# Design Document: Remote vLLM Support

## Overview

This design introduces an LLM provider abstraction layer into Stonks Oracle so that both the existing Ollama backend and a new remote vLLM backend can be used interchangeably for document extraction and event classification. The vLLM server at `http://192.168.42.254:8000` runs `RedHatAI/Qwen3.6-35B-A3B-NVFP4` on an NVIDIA RTX 5090 with tensor parallelism and exposes an OpenAI-compatible `/v1/chat/completions` API.

The design preserves full backward compatibility: existing Ollama deployments work without any configuration changes. Provider selection is driven by the existing `model_provider` column in the `ai_agents` and `agent_variants` database tables, requiring no new migrations.

## Architecture

```mermaid
graph TD
    subgraph "Extractor Worker"
        MAIN[main.py]
        FACTORY[LLMClientFactory]
        EXTRACT[Extraction Pipeline]
        CLASSIFY[Event Classification Pipeline]
    end

    subgraph "Provider Abstraction"
        PROTO[LLMClient Protocol]
        OLLAMA_IMPL[OllamaClient]
        VLLM_IMPL[VLLMClient]
    end

    subgraph "Configuration"
        RESOLVER[AgentConfigResolver]
        OLLAMA_CFG[OllamaConfig]
        VLLM_CFG[VLLMConfig]
        APP_CFG[AppConfig]
    end

    subgraph "External Services"
        OLLAMA_SRV["Ollama Server<br/>:11434/api/chat"]
        VLLM_SRV["vLLM Server<br/>:8000/v1/chat/completions"]
    end

    MAIN --> FACTORY
    FACTORY --> PROTO
    PROTO --> OLLAMA_IMPL
    PROTO --> VLLM_IMPL
    EXTRACT --> PROTO
    CLASSIFY --> PROTO
    RESOLVER --> FACTORY
    OLLAMA_CFG --> FACTORY
    VLLM_CFG --> FACTORY
    APP_CFG --> OLLAMA_CFG
    APP_CFG --> VLLM_CFG
    OLLAMA_IMPL --> OLLAMA_SRV
    VLLM_IMPL --> VLLM_SRV
```

The key architectural decision is to use a Python `Protocol` (structural typing) rather than an ABC for the LLM client interface. This allows the existing `OllamaClient` to satisfy the protocol without inheritance changes, maintaining backward compatibility. The `VLLMClient` is a new class that also satisfies the protocol.

A factory function in `services/extractor/llm_factory.py` takes a `ResolvedAgentConfig` and the base configs, returning the appropriate client. The extractor worker (`main.py`) uses this factory instead of directly constructing `OllamaClient`.

## Components and Interfaces

### 1. LLM Client Protocol (`services/shared/llm_protocol.py`)

A `typing.Protocol` defining the contract both clients must satisfy:

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class LLMClient(Protocol):
    async def call_llm(
        self,
        prompts: dict[str, str],
        json_schema: dict[str, object],
        document_text: str = "",
    ) -> "ExtractionAttempt": ...

    async def close(self) -> None: ...
```

The `call_llm` method signature matches the existing `OllamaClient._call_ollama()` parameters and return type. The `OllamaClient` gains a public `call_llm` method that delegates to `_call_ollama()`, preserving the private method for internal backward compatibility.

### 2. VLLMClient (`services/extractor/vllm_client.py`)

New client implementing the `LLMClient` protocol for the OpenAI-compatible API:

```python
@dataclass
class VLLMClient:
    _config: VLLMConfig
    _http: httpx.AsyncClient
    _owns_client: bool

    async def call_llm(
        self,
        prompts: dict[str, str],
        json_schema: dict[str, object],
        document_text: str = "",
    ) -> ExtractionAttempt: ...

    async def close(self) -> None: ...
```

**Request format** (OpenAI-compatible):

```json
{
  "model": "RedHatAI/Qwen3.6-35B-A3B-NVFP4",
  "messages": [
    {"role": "system", "content": "..."},
    {"role": "user", "content": "..."}
  ],
  "max_tokens": 4096,
  "temperature": 0.7,
  "response_format": {"type": "json_object"}
}
```

**Response parsing**: Extracts `choices[0].message.content`, then applies the same `_strip_markdown_fences()` and `_repair_json()` pipeline as `OllamaClient`.

**Error handling**: Maps HTTP errors to the same string format as `OllamaClient` (`timeout`, `http_{code}`, `connection_error: {details}`, `empty_model_response`), so the existing `_is_retryable()` function works without modification.

**Key differences from OllamaClient**:

- Endpoint: `/v1/chat/completions` instead of `/api/chat`
- No `think: false`, `stream: false`, or `options` block
- Uses `max_tokens` instead of `options.num_predict`
- Uses `response_format: {"type": "json_object"}` for structured output
- Supports `temperature` parameter (Ollama uses model defaults)
- Response in `choices[0].message.content` instead of `message.content`

### 3. VLLMConfig (`services/shared/config.py`)

New dataclass alongside `OllamaConfig`:

```python
@dataclass
class VLLMConfig:
    base_url: str = "http://192.168.42.254:8000"
    model: str = "RedHatAI/Qwen3.6-35B-A3B-NVFP4"
    timeout: int = 120
    max_retries: int = 2
    retry_base_delay: float = 1.0
    retry_max_delay: float = 10.0
    retry_backoff_multiplier: float = 2.0
    max_tokens: int = 32768
    temperature: float = 0.7
    api_key: str = ""  # Optional, for authenticated vLLM deployments
```

Loaded from `VLLM_*` environment variables in `load_config()`. Added to `AppConfig` as `vllm: VLLMConfig`.

### 4. LLM Client Factory (`services/extractor/llm_factory.py`)

Factory function that replaces the hardcoded `OllamaClient` construction:

```python
def build_llm_client(
    resolved: ResolvedAgentConfig | None,
    ollama_config: OllamaConfig,
    vllm_config: VLLMConfig,
    http_client: httpx.AsyncClient | None = None,
) -> LLMClient:
    """Return the appropriate LLM client based on resolved provider."""
    ...


def build_config_from_resolved(
    resolved: ResolvedAgentConfig,
    base_ollama: OllamaConfig,
    base_vllm: VLLMConfig,
) -> OllamaConfig | VLLMConfig:
    """Build provider-specific config from resolved agent config."""
    ...
```

Provider routing logic:

1. If `resolved` is `None` or `resolved.model_provider` is `"ollama"` or empty → `OllamaClient`
2. If `resolved.model_provider` is `"vllm"` → `VLLMClient`
3. Unknown provider → log warning, fall back to `OllamaClient`

### 5. Updated Extractor Worker (`services/extractor/main.py`)

Changes to `main()`:

- Replace `_build_ollama_config_from_resolved()` with `build_llm_client()` from the factory
- Store clients as `LLMClient` type instead of `OllamaClient`
- On config refresh (every 100 jobs), detect provider changes and swap clients
- Log provider switches at INFO level

Changes to `_process_macro_classification()`:

- Accept `LLMClient` instead of `OllamaClient` for the classifier parameter

### 6. Updated OllamaClient (`services/extractor/client.py`)

Minimal changes to satisfy the protocol:

- Add public `call_llm()` method that delegates to `_call_ollama()`
- Keep `_call_ollama()` as-is for backward compatibility
- The `extract()` method continues to call `_call_ollama()` internally

### 7. Updated Event Classifier (`services/extractor/event_classifier.py`)

Changes to `classify_global_event()`:

- Accept `LLMClient` instead of `Any` for the `ollama_client` parameter
- Call `client.call_llm()` instead of `ollama_client._call_ollama()`
- Set `ModelMetadata.provider` based on the actual client type (inspect `_config` or pass provider string)

### 8. Helm Values (`infra/helm/stonks-oracle/values.yaml`)

New config entries:

```yaml
config:
  VLLM_BASE_URL: "http://192.168.42.254:8000"
  VLLM_MODEL: "RedHatAI/Qwen3.6-35B-A3B-NVFP4"
  VLLM_TIMEOUT: "120"
  VLLM_MAX_RETRIES: "2"
  VLLM_TEMPERATURE: "0.7"
  VLLM_API_KEY: ""
```

### 9. Health Check (`services/extractor/vllm_client.py`)

Startup validation function:

```python
async def check_vllm_health(base_url: str, timeout: float = 10.0) -> bool:
    """GET {base_url}/v1/models to verify vLLM is reachable."""
    ...
```

Called from `main()` when the resolved or default config specifies vLLM. On failure, logs WARNING and falls back to Ollama. On success, logs INFO with server URL and model list.

## Data Models

### VLLMConfig Dataclass

| Field | Type | Default | Env Var |
|-------|------|---------|---------|
| `base_url` | `str` | `http://192.168.42.254:8000` | `VLLM_BASE_URL` |
| `model` | `str` | `RedHatAI/Qwen3.6-35B-A3B-NVFP4` | `VLLM_MODEL` |
| `timeout` | `int` | `120` | `VLLM_TIMEOUT` |
| `max_retries` | `int` | `2` | `VLLM_MAX_RETRIES` |
| `retry_base_delay` | `float` | `1.0` | `VLLM_RETRY_BASE_DELAY` |
| `retry_max_delay` | `float` | `10.0` | `VLLM_RETRY_MAX_DELAY` |
| `retry_backoff_multiplier` | `float` | `2.0` | `VLLM_RETRY_BACKOFF_MULTIPLIER` |
| `max_tokens` | `int` | `32768` | `VLLM_MAX_TOKENS` |
| `temperature` | `float` | `0.7` | `VLLM_TEMPERATURE` |
| `api_key` | `str` | `""` | `VLLM_API_KEY` |

### ExtractionAttempt (unchanged)

The existing `ExtractionAttempt` dataclass is reused as-is for both providers. No changes needed.

### ModelMetadata (unchanged structure, new values)

The `provider` field now accepts `"vllm"` in addition to `"ollama"`. No schema change needed.

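The `VLLMConfig` table above pairs each field with a `VLLM_*` environment variable. As a minimal sketch of how such a loader could work, the helper below coerces each variable to the type of the field's default. The name `load_vllm_config` and its `env` parameter are hypothetical; the real `load_config()` in `services/shared/config.py` may be structured differently.

```python
from __future__ import annotations

import os
from collections.abc import Mapping
from dataclasses import dataclass, fields


@dataclass
class VLLMConfig:
    base_url: str = "http://192.168.42.254:8000"
    model: str = "RedHatAI/Qwen3.6-35B-A3B-NVFP4"
    timeout: int = 120
    max_retries: int = 2
    retry_base_delay: float = 1.0
    retry_max_delay: float = 10.0
    retry_backoff_multiplier: float = 2.0
    max_tokens: int = 32768
    temperature: float = 0.7
    api_key: str = ""


def load_vllm_config(env: Mapping[str, str] | None = None) -> VLLMConfig:
    """Hypothetical helper: build a VLLMConfig from VLLM_* variables."""
    env = os.environ if env is None else env
    defaults = VLLMConfig()
    overrides = {}
    for field in fields(VLLMConfig):
        raw = env.get(f"VLLM_{field.name.upper()}")
        if raw is None:
            continue  # variable not set: keep the dataclass default
        # Coerce the string using the type of the default (int, float, or str).
        caster = type(getattr(defaults, field.name))
        overrides[field.name] = caster(raw)
    return VLLMConfig(**overrides)
```

Passing an explicit mapping instead of reading `os.environ` directly keeps the loader easy to test; calling `load_vllm_config({"VLLM_TIMEOUT": "30"})` overrides only the timeout and leaves every other field at its default.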
## Error Handling

### Error String Format Parity

Both clients produce identical error string formats so `_is_retryable()` works unchanged:

| Condition | Error String | Retryable |
|-----------|-------------|-----------|
| HTTP timeout | `timeout` | Yes |
| HTTP 400/401/403/404/422 | `http_{code}` | No |
| HTTP 500/502/503/429 | `http_{code}` | Yes |
| Connection refused/reset | `connection_error: {details}` | Yes |
| Empty response body | `empty_model_response` | Yes |
| Invalid JSON in response | `invalid_response_json` | Yes |

### Health Check Failure

If the vLLM health check fails at startup:

1. Log WARNING with the error details
2. Fall back to `OllamaClient` using `OllamaConfig`
3. Continue operation: the system degrades gracefully rather than crashing

### Provider Switch During Refresh

When the config refresh (every 100 jobs) detects a provider change:

1. Close the old client (`await old_client.close()`)
2. Construct the new client via the factory
3. Log the switch at INFO level
4. If new client construction fails, keep the old client and log ERROR

## Testing Strategy

### Property-Based Tests (`tests/test_pbt_llm_provider.py`)

Property-based tests using Hypothesis to verify the provider abstraction:

**P1: Provider factory routing property** (Req 3.4, 3.5, 9.5)
For all `model_provider` values in `{"ollama", "vllm", "", None}`, the factory returns the correct client type. For `"ollama"`, empty, or `None`, returns `OllamaClient`. For `"vllm"`, returns `VLLMClient`.

**P2: Error string format consistency property** (Req 5.6)
For all HTTP status codes (100-599), both `OllamaClient` and `VLLMClient` produce error strings in the same format (`http_{code}`), and `_is_retryable()` returns the same result for both.

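Property P2 depends on both clients mapping failures to the same strings. The sketch below illustrates the error-string table from the Error Handling section. Both function names are stand-ins (the project's actual `_is_retryable()` is not shown in this document), and treating status codes absent from the table as non-retryable is an assumption.

```python
# Status codes the error table marks as retryable.
RETRYABLE_STATUS = {429, 500, 502, 503}
# Non-HTTP error strings the error table marks as retryable.
RETRYABLE_ERRORS = {"timeout", "empty_model_response", "invalid_response_json"}


def error_string_for_status(code: int) -> str:
    """Format an HTTP failure the way both clients are expected to."""
    return f"http_{code}"


def is_retryable(error: str) -> bool:
    """Illustrative stand-in for the project's _is_retryable()."""
    if error in RETRYABLE_ERRORS:
        return True
    if error.startswith("connection_error"):
        return True
    if error.startswith("http_"):
        suffix = error[len("http_"):]
        # Assumption: codes not listed in the table are not retried.
        return suffix.isdigit() and int(suffix) in RETRYABLE_STATUS
    return False
```

Because retryability is decided from the string alone, any client that emits these formats can reuse the same retry logic without provider-specific branches.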
**P3: VLLMClient request payload structure property** (Req 2.1, 8.1)
For all generated prompt dicts (system + user messages of arbitrary text), the VLLMClient produces a request payload that: contains `model`, `messages`, `max_tokens`, `temperature`; does NOT contain `think`, `stream`, `options`, `num_ctx`, `num_predict`.

**P4: JSON repair idempotence property** (Req 2.4)
For all valid JSON strings, `_repair_json(json_str)` returns a string that `json.loads()` can parse, and `_repair_json(_repair_json(json_str)) == _repair_json(json_str)` (idempotence).

**P5: Markdown fence stripping round-trip property** (Req 2.3)
For all strings `s`, ```` _strip_markdown_fences(f"```json\n{s}\n```") ```` returns `s` (stripped), and `_strip_markdown_fences(s)` returns `s` when no fences are present (identity).

**P6: VLLMConfig default construction property** (Req 3.1)
For all VLLMConfig instances constructed with default values, `base_url` is non-empty, `timeout > 0`, `max_retries >= 0`, `temperature` is between 0.0 and 2.0, and `max_tokens > 0`.

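The payload shape checked by P3 and the `choices[0].message.content` parsing rule can be illustrated with two small pure functions. This is a sketch under stated assumptions: `build_vllm_payload` and `extract_content` are hypothetical names, and the `"system"` and `"user"` keys of the prompt dict are assumed, not confirmed by the existing `OllamaClient` code.

```python
from __future__ import annotations


def build_vllm_payload(
    prompts: dict[str, str],
    model: str,
    max_tokens: int,
    temperature: float,
) -> dict[str, object]:
    """Hypothetical helper: build an OpenAI-compatible chat request body."""
    return {
        "model": model,
        "messages": [
            # Assumption: prompts carries "system" and "user" entries.
            {"role": "system", "content": prompts.get("system", "")},
            {"role": "user", "content": prompts.get("user", "")},
        ],
        "max_tokens": max_tokens,
        "temperature": temperature,
        "response_format": {"type": "json_object"},
    }


def extract_content(response_body: dict) -> str | None:
    """Return choices[0].message.content, or None when it is missing or empty."""
    choices = response_body.get("choices") or []
    if not choices:
        return None  # caller would report empty_model_response
    message = choices[0].get("message") or {}
    content = message.get("content")
    return content if content else None
```

Keeping payload construction free of I/O means a property test can generate arbitrary prompt text and assert on the resulting dict without mocking `httpx`.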
### Unit Tests (`tests/test_vllm_client.py`)

Example-based tests for specific behaviors:

- VLLMClient sends correct payload to `/v1/chat/completions` (mock httpx)
- VLLMClient extracts content from `choices[0].message.content`
- VLLMClient handles empty choices array → `empty_model_response`
- VLLMClient handles timeout → `timeout` error
- VLLMClient handles HTTP 500 → `http_500` error, retryable
- VLLMClient handles HTTP 400 → `http_400` error, non-retryable
- VLLMClient handles connection refused → `connection_error: ...`
- VLLMClient applies markdown fence stripping
- VLLMClient applies JSON repair
- VLLMClient includes temperature in payload
- VLLMClient includes `response_format` in payload
- Health check success logs INFO
- Health check failure logs WARNING and returns False
- Factory returns OllamaClient for provider="ollama"
- Factory returns VLLMClient for provider="vllm"
- Factory returns OllamaClient for provider="" (default)
- Factory returns OllamaClient for unknown provider with warning
- VLLMConfig loads from environment variables
- AppConfig includes vllm field with defaults
- OllamaClient.call_llm() delegates to _call_ollama()

### Existing Tests (unchanged)

- `tests/test_ollama_client.py`: continues to pass without modification
- All other existing test files: unaffected

## File Changes Summary

| File | Change Type | Description |
|------|-------------|-------------|
| `services/shared/llm_protocol.py` | **New** | `LLMClient` Protocol definition |
| `services/extractor/vllm_client.py` | **New** | `VLLMClient` implementation + health check |
| `services/extractor/llm_factory.py` | **New** | Factory function for provider routing |
| `services/shared/config.py` | **Modified** | Add `VLLMConfig`, update `AppConfig`, update `load_config()` |
| `services/extractor/client.py` | **Modified** | Add `call_llm()` public method to `OllamaClient` |
| `services/extractor/event_classifier.py` | **Modified** | Use `call_llm()` instead of `_call_ollama()`, accept `LLMClient` type |
| `services/extractor/main.py` | **Modified** | Use factory, support provider switching, health check |
| `infra/helm/stonks-oracle/values.yaml` | **Modified** | Add `VLLM_*` config entries |
| `tests/test_pbt_llm_provider.py` | **New** | Property-based tests for provider abstraction |
| `tests/test_vllm_client.py` | **New** | Unit tests for VLLMClient and factory |
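
The provider routing rules described for the LLM client factory (`services/extractor/llm_factory.py`) reduce to a small decision function. The sketch below is illustrative only: `pick_provider` is a hypothetical name, and normalizing the value with `strip()` and `lower()` is an assumption that the design does not specify.

```python
from __future__ import annotations

import logging

logger = logging.getLogger(__name__)


def pick_provider(model_provider: str | None) -> str:
    """Hypothetical helper: map a model_provider value to a client kind."""
    # Assumption: whitespace and case differences are normalized away.
    provider = (model_provider or "").strip().lower()
    if provider in ("", "ollama"):
        return "ollama"  # default path keeps existing deployments unchanged
    if provider == "vllm":
        return "vllm"
    # Unknown provider: warn and fall back to Ollama, per the routing rules.
    logger.warning(
        "Unknown model_provider %r; falling back to ollama", model_provider
    )
    return "ollama"
```

A factory such as `build_llm_client()` could branch on this result to construct either `OllamaClient` or `VLLMClient`, which keeps the fallback behavior testable without building real HTTP clients.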