stonks-oracle/.kiro/specs/remote-vllm-support/design.md
Celes Renata 117b693b19 feat: add remote vLLM support with provider abstraction layer
- LLMClient Protocol for provider-agnostic inference
- VLLMClient for OpenAI-compatible /v1/chat/completions API
- LLM client factory with provider routing (ollama/vllm)
- VLLMConfig with VLLM_* environment variable loading
- Updated extractor worker with health check and provider switching
- Updated event classifier to use LLMClient protocol
- Helm values for vLLM configuration
- 18 unit tests + 6 property-based tests
- Full backward compatibility preserved
2026-04-23 08:17:23 +00:00


Design Document: Remote vLLM Support

Overview

This design introduces an LLM provider abstraction layer into Stonks Oracle so that both the existing Ollama backend and a new remote vLLM backend can be used interchangeably for document extraction and event classification. The vLLM server at http://192.168.42.254:8000 runs RedHatAI/Qwen3.6-35B-A3B-NVFP4 on an NVIDIA RTX 5090 with tensor parallelism and exposes an OpenAI-compatible /v1/chat/completions API.

The design preserves full backward compatibility — existing Ollama deployments work without any configuration changes. Provider selection is driven by the existing model_provider column in the ai_agents and agent_variants database tables, requiring no new migrations.

Architecture

graph TD
    subgraph "Extractor Worker"
        MAIN[main.py]
        FACTORY[LLMClientFactory]
        EXTRACT[Extraction Pipeline]
        CLASSIFY[Event Classification Pipeline]
    end

    subgraph "Provider Abstraction"
        PROTO[LLMClient Protocol]
        OLLAMA_IMPL[OllamaClient]
        VLLM_IMPL[VLLMClient]
    end

    subgraph "Configuration"
        RESOLVER[AgentConfigResolver]
        OLLAMA_CFG[OllamaConfig]
        VLLM_CFG[VLLMConfig]
        APP_CFG[AppConfig]
    end

    subgraph "External Services"
        OLLAMA_SRV[Ollama Server<br/>:11434/api/chat]
        VLLM_SRV[vLLM Server<br/>:8000/v1/chat/completions]
    end

    MAIN --> FACTORY
    FACTORY --> PROTO
    PROTO --> OLLAMA_IMPL
    PROTO --> VLLM_IMPL
    EXTRACT --> PROTO
    CLASSIFY --> PROTO

    RESOLVER --> FACTORY
    OLLAMA_CFG --> FACTORY
    VLLM_CFG --> FACTORY
    APP_CFG --> OLLAMA_CFG
    APP_CFG --> VLLM_CFG

    OLLAMA_IMPL --> OLLAMA_SRV
    VLLM_IMPL --> VLLM_SRV

The key architectural decision is to use a Python Protocol (structural typing) rather than an ABC for the LLM client interface. This allows the existing OllamaClient to satisfy the protocol without inheritance changes, maintaining backward compatibility. The VLLMClient is a new class that also satisfies the protocol.

A factory function in services/extractor/llm_factory.py takes a ResolvedAgentConfig and the base configs, returning the appropriate client. The extractor worker (main.py) uses this factory instead of directly constructing OllamaClient.

Components and Interfaces

1. LLM Client Protocol (services/shared/llm_protocol.py)

A typing.Protocol defining the contract both clients must satisfy:

from typing import Protocol, runtime_checkable

@runtime_checkable
class LLMClient(Protocol):
    async def call_llm(
        self,
        prompts: dict[str, str],
        json_schema: dict[str, object],
        document_text: str = "",
    ) -> "ExtractionAttempt": ...

    async def close(self) -> None: ...

The call_llm method signature matches the existing OllamaClient._call_ollama() parameters and return type. The OllamaClient gains a public call_llm method that delegates to _call_ollama(), preserving the private method for internal backward compatibility.
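To illustrate the structural-typing argument, here is a minimal sketch of a class that satisfies the protocol with no inheritance, using a public call_llm that delegates to the private method. LegacyClient and the fields of the stand-in ExtractionAttempt are assumptions made for illustration; the real OllamaClient and ExtractionAttempt are not reproduced here.

```python
import asyncio
from dataclasses import dataclass
from typing import Optional, Protocol, runtime_checkable


@dataclass
class ExtractionAttempt:
    """Stand-in for the real dataclass; these fields are assumed."""
    raw_text: str = ""
    error: Optional[str] = None


@runtime_checkable
class LLMClient(Protocol):
    async def call_llm(
        self,
        prompts: dict[str, str],
        json_schema: dict[str, object],
        document_text: str = "",
    ) -> ExtractionAttempt: ...

    async def close(self) -> None: ...


class LegacyClient:
    """Plays the role of OllamaClient. It does not inherit from LLMClient."""

    async def _call_ollama(self, prompts, json_schema, document_text=""):
        # Placeholder for the existing private implementation.
        return ExtractionAttempt(raw_text='{"ok": true}')

    async def call_llm(self, prompts, json_schema, document_text=""):
        # Public protocol method: a thin pass-through to the private one.
        return await self._call_ollama(prompts, json_schema, document_text)

    async def close(self) -> None:
        return None


client = LegacyClient()
# isinstance() against a runtime_checkable Protocol only checks that the
# named methods exist; it does not validate their signatures.
conforms = isinstance(client, LLMClient)
attempt = asyncio.run(client.call_llm({"system": "s", "user": "u"}, {}))
```

Because the runtime check looks only at method names, signature conformance still has to come from a static type checker or from tests.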

2. VLLMClient (services/extractor/vllm_client.py)

New client implementing the LLMClient protocol for the OpenAI-compatible API:

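from dataclasses import dataclass

import httpx

# VLLMConfig lives in services/shared/config.py; ExtractionAttempt is the
# existing dataclass already used by OllamaClient.


@dataclass
class VLLMClient:
    _config: VLLMConfig
    _http: httpx.AsyncClient
    _owns_client: bool  # True when this client created _http and must close it

    async def call_llm(
        self,
        prompts: dict[str, str],
        json_schema: dict[str, object],
        document_text: str = "",
    ) -> ExtractionAttempt: ...

    async def close(self) -> None: ...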

Request format (OpenAI-compatible):

{
    "model": "RedHatAI/Qwen3.6-35B-A3B-NVFP4",
    "messages": [
        {"role": "system", "content": "..."},
        {"role": "user", "content": "..."}
    ],
    "max_tokens": 4096,
    "temperature": 0.7,
    "response_format": {"type": "json_object"}
}

Response parsing: Extracts choices[0].message.content, then applies the same _strip_markdown_fences() and _repair_json() pipeline as OllamaClient.
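The parsing step can be sketched as follows. The helper names strip_fences and extract_content are hypothetical stand-ins, not the project's _strip_markdown_fences() or _repair_json(), and the JSON repair stage is omitted. The sketch assumes that an empty choices array and empty content both map to empty_model_response.

```python
import json
import re

FENCE = "`" * 3  # built indirectly to keep a literal fence out of this listing

_FENCE_RE = re.compile(
    r"^\s*" + re.escape(FENCE) + r"[\w-]*[ \t]*\n(.*?)\n?[ \t]*"
    + re.escape(FENCE) + r"\s*$",
    re.DOTALL,
)


def strip_fences(text: str) -> str:
    """Return the body of a single fenced block, or the trimmed text itself."""
    match = _FENCE_RE.match(text)
    return match.group(1).strip() if match else text.strip()


def extract_content(body: dict):
    """Return (content, error) from an OpenAI-style chat completion body."""
    choices = body.get("choices") or []
    if not choices:
        return None, "empty_model_response"
    message = choices[0].get("message") or {}
    content = strip_fences(message.get("content") or "")
    if not content:
        return None, "empty_model_response"
    return content, None


sample = {
    "choices": [
        {"message": {"role": "assistant",
                     "content": FENCE + 'json\n{"a": 1}\n' + FENCE}}
    ]
}
content, error = extract_content(sample)
parsed = json.loads(content)
```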

Error handling: Maps failures to the same error strings as OllamaClient (timeout, http_{code}, connection_error: {details}, empty_model_response, invalid_response_json), so the existing _is_retryable() function works without modification.

Key differences from OllamaClient:

  • Endpoint: /v1/chat/completions instead of /api/chat
  • No think: false, stream: false, or options block
  • Uses max_tokens instead of options.num_predict
  • Uses response_format: {"type": "json_object"} for structured output
  • Supports temperature parameter (Ollama uses model defaults)
  • Response in choices[0].message.content instead of message.content
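The differences listed above can be made concrete with a small payload builder. This is a sketch under assumptions: the function names are hypothetical, the prompts dict is assumed to use "system" and "user" keys, and the handling of document_text is left out because the design does not specify where it goes. The Bearer header reflects how OpenAI-compatible servers usually authenticate and only applies when an API key is configured.

```python
def build_vllm_payload(
    prompts: dict,
    model: str,
    max_tokens: int = 4096,
    temperature: float = 0.7,
) -> dict:
    """Build an OpenAI-compatible chat completion request body."""
    messages = []
    system = prompts.get("system", "")
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": prompts.get("user", "")})
    return {
        "model": model,
        "messages": messages,
        "max_tokens": max_tokens,          # replaces Ollama's options.num_predict
        "temperature": temperature,
        "response_format": {"type": "json_object"},
        # Deliberately absent: think, stream, options (Ollama-only fields).
    }


def build_headers(api_key: str = "") -> dict:
    """Add an Authorization header only when a key is configured."""
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    return headers


payload = build_vllm_payload(
    {"system": "Extract events as JSON.", "user": "ACME beat earnings."},
    model="RedHatAI/Qwen3.6-35B-A3B-NVFP4",
)
```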

3. VLLMConfig (services/shared/config.py)

New dataclass alongside OllamaConfig:

from dataclasses import dataclass


@dataclass
class VLLMConfig:
    base_url: str = "http://192.168.42.254:8000"
    model: str = "RedHatAI/Qwen3.6-35B-A3B-NVFP4"
    timeout: int = 120
    max_retries: int = 2
    retry_base_delay: float = 1.0
    retry_max_delay: float = 10.0
    retry_backoff_multiplier: float = 2.0
    max_tokens: int = 32768
    temperature: float = 0.7
    api_key: str = ""  # Optional, for authenticated vLLM deployments

Loaded from VLLM_* environment variables in load_config(). Added to AppConfig as vllm: VLLMConfig.
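One way the environment loading could work is sketched below. It assumes each variable is named VLLM_ plus the upper-cased field name and that unset or empty variables fall back to the dataclass default. The helper load_vllm_config is hypothetical; the real load_config() may be structured differently.

```python
import os
from dataclasses import dataclass, fields
from typing import Mapping


@dataclass
class VLLMConfig:
    base_url: str = "http://192.168.42.254:8000"
    model: str = "RedHatAI/Qwen3.6-35B-A3B-NVFP4"
    timeout: int = 120
    max_retries: int = 2
    retry_base_delay: float = 1.0
    retry_max_delay: float = 10.0
    retry_backoff_multiplier: float = 2.0
    max_tokens: int = 32768
    temperature: float = 0.7
    api_key: str = ""


def load_vllm_config(env: Mapping[str, str] = os.environ) -> VLLMConfig:
    """Override defaults with any VLLM_<FIELD> values found in env."""
    overrides = {}
    for field in fields(VLLMConfig):
        raw = env.get(f"VLLM_{field.name.upper()}")
        if raw is None or raw == "":
            continue  # keep the dataclass default
        # field.type is the annotated class (str, int or float) here,
        # so calling it converts the raw string.
        overrides[field.name] = field.type(raw)
    return VLLMConfig(**overrides)


cfg = load_vllm_config({"VLLM_TIMEOUT": "30", "VLLM_TEMPERATURE": "0.2"})
```

A malformed value such as VLLM_TIMEOUT=abc raises ValueError in this sketch; whether to fail fast or fall back to the default is a choice the design leaves open.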

4. LLM Client Factory (services/extractor/llm_factory.py)

Factory function that replaces the hardcoded OllamaClient construction:

def build_llm_client(
    resolved: ResolvedAgentConfig | None,
    ollama_config: OllamaConfig,
    vllm_config: VLLMConfig,
    http_client: httpx.AsyncClient | None = None,
) -> LLMClient:
    """Return the appropriate LLM client based on resolved provider."""
    ...

def build_config_from_resolved(
    resolved: ResolvedAgentConfig,
    base_ollama: OllamaConfig,
    base_vllm: VLLMConfig,
) -> OllamaConfig | VLLMConfig:
    """Build provider-specific config from resolved agent config."""
    ...

Provider routing logic:

  1. If resolved is None or resolved.model_provider is "ollama" or empty → OllamaClient
  2. If resolved.model_provider is "vllm" → VLLMClient
  3. Unknown provider → log warning, fall back to OllamaClient
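The three routing rules reduce to a small decision function. In this sketch resolve_provider is a hypothetical helper that returns the provider name rather than a client, and the case and whitespace normalization is an added assumption that the design does not state.

```python
import logging
from typing import Optional

logger = logging.getLogger("llm_factory_sketch")


def resolve_provider(model_provider: Optional[str]) -> str:
    """Map a model_provider value to 'ollama' or 'vllm'."""
    provider = (model_provider or "").strip().lower()  # normalization is assumed
    if provider in ("", "ollama"):
        return "ollama"          # rule 1: missing, empty, or explicit ollama
    if provider == "vllm":
        return "vllm"            # rule 2
    # Rule 3: unknown providers degrade to Ollama instead of failing.
    logger.warning("Unknown model_provider %r, falling back to ollama", model_provider)
    return "ollama"


routes = {value: resolve_provider(value) for value in (None, "", "ollama", "vllm", "openai")}
```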

5. Updated Extractor Worker (services/extractor/main.py)

Changes to main():

  • Replace _build_ollama_config_from_resolved() with build_llm_client() from the factory
  • Store clients as LLMClient type instead of OllamaClient
  • On config refresh (every 100 jobs), detect provider changes and swap clients
  • Log provider switches at INFO level

Changes to _process_macro_classification():

  • Accept LLMClient instead of OllamaClient for the classifier parameter

6. Updated OllamaClient (services/extractor/client.py)

Minimal changes to satisfy the protocol:

  • Add public call_llm() method that delegates to _call_ollama()
  • Keep _call_ollama() as-is for backward compatibility
  • The extract() method continues to call _call_ollama() internally

7. Updated Event Classifier (services/extractor/event_classifier.py)

Changes to classify_global_event():

  • Accept LLMClient instead of Any for the ollama_client parameter
  • Call client.call_llm() instead of ollama_client._call_ollama()
  • Set ModelMetadata.provider based on the actual client type, either by inspecting the client's _config or by passing the provider string explicitly

8. Helm Values (infra/helm/stonks-oracle/values.yaml)

New config entries:

config:
  VLLM_BASE_URL: "http://192.168.42.254:8000"
  VLLM_MODEL: "RedHatAI/Qwen3.6-35B-A3B-NVFP4"
  VLLM_TIMEOUT: "120"
  VLLM_MAX_RETRIES: "2"
  VLLM_TEMPERATURE: "0.7"
  VLLM_API_KEY: ""

9. Health Check (services/extractor/vllm_client.py)

Startup validation function:

async def check_vllm_health(base_url: str, timeout: float = 10.0) -> bool:
    """GET {base_url}/v1/models to verify vLLM is reachable."""
    ...

Called from main() when the resolved or default config specifies vLLM. On failure, logs WARNING and falls back to Ollama. On success, logs INFO with server URL and model list.

Data Models

VLLMConfig Dataclass

| Field | Type | Default | Env Var |
|---|---|---|---|
| base_url | str | http://192.168.42.254:8000 | VLLM_BASE_URL |
| model | str | RedHatAI/Qwen3.6-35B-A3B-NVFP4 | VLLM_MODEL |
| timeout | int | 120 | VLLM_TIMEOUT |
| max_retries | int | 2 | VLLM_MAX_RETRIES |
| retry_base_delay | float | 1.0 | VLLM_RETRY_BASE_DELAY |
| retry_max_delay | float | 10.0 | VLLM_RETRY_MAX_DELAY |
| retry_backoff_multiplier | float | 2.0 | VLLM_RETRY_BACKOFF_MULTIPLIER |
| max_tokens | int | 32768 | VLLM_MAX_TOKENS |
| temperature | float | 0.7 | VLLM_TEMPERATURE |
| api_key | str | "" | VLLM_API_KEY |

ExtractionAttempt (unchanged)

The existing ExtractionAttempt dataclass is reused as-is for both providers. No changes needed.

ModelMetadata (unchanged structure, new values)

The provider field now accepts "vllm" in addition to "ollama". No schema change needed.

Error Handling

Error String Format Parity

Both clients produce identical error string formats so _is_retryable() works unchanged:

| Condition | Error String | Retryable |
|---|---|---|
| HTTP timeout | timeout | Yes |
| HTTP 400/401/403/404/422 | http_{code} | No |
| HTTP 429/500/502/503 | http_{code} | Yes |
| Connection refused/reset | connection_error: {details} | Yes |
| Empty response body | empty_model_response | Yes |
| Invalid JSON in response | invalid_response_json | Yes |
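The retry classification in the table can be expressed as a single predicate over the error string. The sketch below is not the project's _is_retryable(), whose body is not shown in this document. It assumes that status codes absent from the table are non-retryable.

```python
RETRYABLE_STATUS = {429, 500, 502, 503}
RETRYABLE_EXACT = {"timeout", "empty_model_response", "invalid_response_json"}


def http_error_string(status_code: int) -> str:
    """Format an HTTP failure the same way for every provider."""
    return f"http_{status_code}"


def is_retryable(error: str) -> bool:
    """Decide retryability from the error string alone."""
    if error in RETRYABLE_EXACT:
        return True
    if error.startswith("connection_error"):
        return True
    if error.startswith("http_"):
        code = error[len("http_"):]
        return code.isdigit() and int(code) in RETRYABLE_STATUS
    return False  # unknown error strings are treated as permanent (assumed)


verdicts = {code: is_retryable(http_error_string(code)) for code in (400, 404, 429, 500, 503)}
```

Because the decision depends only on the string, any client that emits these formats gets identical retry behavior without provider-specific branches.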

Health Check Failure

If the vLLM health check fails at startup:

  1. Log WARNING with the error details
  2. Fall back to OllamaClient using OllamaConfig
  3. Continue operation — the system degrades gracefully rather than crashing

Provider Switch During Refresh

When the config refresh (every 100 jobs) detects a provider change:

  1. Close the old client (await old_client.close())
  2. Construct the new client via the factory
  3. Log the switch at INFO level
  4. If new client construction fails, keep the old client and log ERROR

Testing Strategy

Property-Based Tests (tests/test_pbt_llm_provider.py)

Property-based tests using Hypothesis to verify the provider abstraction:

P1: Provider factory routing property (Req 3.4, 3.5, 9.5)
For all model_provider values in {"ollama", "vllm", "", None}, the factory returns the correct client type. For "ollama", empty, or None, returns OllamaClient. For "vllm", returns VLLMClient.

P2: Error string format consistency property (Req 5.6)
For all HTTP status codes (100-599), both OllamaClient and VLLMClient produce error strings in the same format (http_{code}), and _is_retryable() returns the same result for both.

P3: VLLMClient request payload structure property (Req 2.1, 8.1)
For all generated prompt dicts (system + user messages of arbitrary text), the VLLMClient produces a request payload that: contains model, messages, max_tokens, temperature; does NOT contain think, stream, options, num_ctx, num_predict.

P4: JSON repair idempotence property (Req 2.4)
For all valid JSON strings, _repair_json(json_str) returns a string that json.loads() can parse, and _repair_json(_repair_json(json_str)) == _repair_json(json_str) (idempotence).

P5: Markdown fence stripping round-trip property (Req 2.3)
For all strings s, _strip_markdown_fences(f"```json\n{s}\n```") returns s (stripped), and _strip_markdown_fences(s) returns s when no fences are present (identity).

P6: VLLMConfig default construction property (Req 3.1)
For all VLLMConfig instances constructed with default values, base_url is non-empty, timeout > 0, max_retries >= 0, temperature is between 0.0 and 2.0, and max_tokens > 0.

Unit Tests (tests/test_vllm_client.py)

Example-based tests for specific behaviors:

  • VLLMClient sends correct payload to /v1/chat/completions (mock httpx)
  • VLLMClient extracts content from choices[0].message.content
  • VLLMClient handles empty choices array → empty_model_response
  • VLLMClient handles timeout → timeout error
  • VLLMClient handles HTTP 500 → http_500 error, retryable
  • VLLMClient handles HTTP 400 → http_400 error, non-retryable
  • VLLMClient handles connection refused → connection_error: ...
  • VLLMClient applies markdown fence stripping
  • VLLMClient applies JSON repair
  • VLLMClient includes temperature in payload
  • VLLMClient includes response_format in payload
  • Health check success logs INFO
  • Health check failure logs WARNING and returns False
  • Factory returns OllamaClient for provider="ollama"
  • Factory returns VLLMClient for provider="vllm"
  • Factory returns OllamaClient for provider="" (default)
  • Factory returns OllamaClient for unknown provider with warning
  • VLLMConfig loads from environment variables
  • AppConfig includes vllm field with defaults
  • OllamaClient.call_llm() delegates to _call_ollama()

Existing Tests (unchanged)

  • tests/test_ollama_client.py — continues to pass without modification
  • All other existing test files — unaffected

File Changes Summary

| File | Change Type | Description |
|---|---|---|
| services/shared/llm_protocol.py | New | LLMClient Protocol definition |
| services/extractor/vllm_client.py | New | VLLMClient implementation + health check |
| services/extractor/llm_factory.py | New | Factory function for provider routing |
| services/shared/config.py | Modified | Add VLLMConfig, update AppConfig, update load_config() |
| services/extractor/client.py | Modified | Add call_llm() public method to OllamaClient |
| services/extractor/event_classifier.py | Modified | Use call_llm() instead of _call_ollama(), accept LLMClient type |
| services/extractor/main.py | Modified | Use factory, support provider switching, health check |
| infra/helm/stonks-oracle/values.yaml | Modified | Add VLLM_* config entries |
| tests/test_pbt_llm_provider.py | New | Property-based tests for provider abstraction |
| tests/test_vllm_client.py | New | Unit tests for VLLMClient and factory |