Design Document: Remote vLLM Support
Overview
This design introduces an LLM provider abstraction layer into Stonks Oracle so that both the existing Ollama backend and a new remote vLLM backend can be used interchangeably for document extraction and event classification. The vLLM server at http://192.168.42.254:8000 runs RedHatAI/Qwen3.6-35B-A3B-NVFP4 on an NVIDIA RTX 5090 with tensor parallelism and exposes an OpenAI-compatible /v1/chat/completions API.
The design preserves full backward compatibility — existing Ollama deployments work without any configuration changes. Provider selection is driven by the existing model_provider column in the ai_agents and agent_variants database tables, requiring no new migrations.
Architecture
```mermaid
graph TD
    subgraph "Extractor Worker"
        MAIN[main.py]
        FACTORY[LLMClientFactory]
        EXTRACT[Extraction Pipeline]
        CLASSIFY[Event Classification Pipeline]
    end
    subgraph "Provider Abstraction"
        PROTO[LLMClient Protocol]
        OLLAMA_IMPL[OllamaClient]
        VLLM_IMPL[VLLMClient]
    end
    subgraph "Configuration"
        RESOLVER[AgentConfigResolver]
        OLLAMA_CFG[OllamaConfig]
        VLLM_CFG[VLLMConfig]
        APP_CFG[AppConfig]
    end
    subgraph "External Services"
        OLLAMA_SRV[Ollama Server<br/>:11434/api/chat]
        VLLM_SRV[vLLM Server<br/>:8000/v1/chat/completions]
    end
    MAIN --> FACTORY
    FACTORY --> PROTO
    PROTO --> OLLAMA_IMPL
    PROTO --> VLLM_IMPL
    EXTRACT --> PROTO
    CLASSIFY --> PROTO
    RESOLVER --> FACTORY
    OLLAMA_CFG --> FACTORY
    VLLM_CFG --> FACTORY
    APP_CFG --> OLLAMA_CFG
    APP_CFG --> VLLM_CFG
    OLLAMA_IMPL --> OLLAMA_SRV
    VLLM_IMPL --> VLLM_SRV
```
The key architectural decision is to use a Python Protocol (structural typing) rather than an ABC for the LLM client interface. This allows the existing OllamaClient to satisfy the protocol without inheritance changes, maintaining backward compatibility. The VLLMClient is a new class that also satisfies the protocol.
A factory function in services/extractor/llm_factory.py takes a ResolvedAgentConfig and the base configs, returning the appropriate client. The extractor worker (main.py) uses this factory instead of directly constructing OllamaClient.
Components and Interfaces
1. LLM Client Protocol (services/shared/llm_protocol.py)
A typing.Protocol defining the contract both clients must satisfy:
```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class LLMClient(Protocol):
    async def call_llm(
        self,
        prompts: dict[str, str],
        json_schema: dict[str, object],
        document_text: str = "",
    ) -> "ExtractionAttempt": ...

    async def close(self) -> None: ...
```
The call_llm method signature matches the existing OllamaClient._call_ollama() parameters and return type. The OllamaClient gains a public call_llm method that delegates to _call_ollama(), preserving the private method for internal backward compatibility.
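A minimal sketch of how the structural typing plays out, assuming a simplified stand-in for `ExtractionAttempt` and a toy `OllamaClient` (their fields and bodies here are hypothetical, not the real classes). The client never inherits from `LLMClient`, yet it passes an `isinstance` check because `@runtime_checkable` only verifies that the named methods exist:

```python
import asyncio
from dataclasses import dataclass
from typing import Protocol, runtime_checkable


@dataclass
class ExtractionAttempt:
    """Stand-in for the real result type; these fields are hypothetical."""
    raw_text: str = ""
    error: str = ""


@runtime_checkable
class LLMClient(Protocol):
    async def call_llm(
        self,
        prompts: dict[str, str],
        json_schema: dict[str, object],
        document_text: str = "",
    ) -> ExtractionAttempt: ...

    async def close(self) -> None: ...


class OllamaClient:
    """Toy client. Note that it has no LLMClient base class."""

    async def _call_ollama(self, prompts, json_schema, document_text=""):
        # The real method would POST to /api/chat; this one just echoes.
        return ExtractionAttempt(raw_text=f"echo:{prompts.get('user', '')}")

    async def call_llm(self, prompts, json_schema, document_text=""):
        # Public entry point that delegates to the existing private method.
        return await self._call_ollama(prompts, json_schema, document_text)

    async def close(self) -> None:
        return None


class NotAClient:
    """Has close() but no call_llm(), so it does not satisfy the protocol."""

    async def close(self) -> None:
        return None


client = OllamaClient()
result = asyncio.run(client.call_llm({"user": "hi"}, {}))
print(isinstance(client, LLMClient), isinstance(NotAClient(), LLMClient))
```

The runtime check looks only at method names, not signatures or return types, so a static type checker is still needed to catch a mismatched `call_llm` signature.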
2. VLLMClient (services/extractor/vllm_client.py)
New client implementing the LLMClient protocol for the OpenAI-compatible API:
```python
@dataclass
class VLLMClient:
    _config: VLLMConfig
    _http: httpx.AsyncClient
    _owns_client: bool

    async def call_llm(
        self,
        prompts: dict[str, str],
        json_schema: dict[str, object],
        document_text: str = "",
    ) -> ExtractionAttempt: ...

    async def close(self) -> None: ...
```
Request format (OpenAI-compatible):
```json
{
  "model": "RedHatAI/Qwen3.6-35B-A3B-NVFP4",
  "messages": [
    {"role": "system", "content": "..."},
    {"role": "user", "content": "..."}
  ],
  "max_tokens": 4096,
  "temperature": 0.7,
  "response_format": {"type": "json_object"}
}
```
Response parsing: Extracts choices[0].message.content, then applies the same _strip_markdown_fences() and _repair_json() pipeline as OllamaClient.
Error handling: Maps HTTP errors to the same string format as OllamaClient (timeout, http_{code}, connection_error: {details}, empty_model_response), so the existing _is_retryable() function works without modification.
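The response-parsing step can be sketched roughly as follows. The helper names `strip_markdown_fences` and `parse_chat_completion` are hypothetical stand-ins, not the project's `_strip_markdown_fences()` or `_repair_json()`; the repair step is omitted, and treating blank content as `empty_model_response` is an assumption:

```python
import json
from typing import Optional

FENCE = "`" * 3  # a Markdown code fence, built this way to keep the listing clean


def strip_markdown_fences(text: str) -> str:
    """Drop a leading fence line (with or without a language tag) and a trailing one."""
    stripped = text.strip()
    if not stripped.startswith(FENCE):
        return stripped
    lines = stripped.splitlines()[1:]  # discard the opening fence line
    if lines and lines[-1].strip() == FENCE:
        lines = lines[:-1]  # discard the closing fence line
    return "\n".join(lines).strip()


def parse_chat_completion(body: dict) -> tuple[Optional[dict], Optional[str]]:
    """Return (parsed_json, error_string) for an OpenAI-style response body."""
    choices = body.get("choices") or []
    if not choices:
        return None, "empty_model_response"
    content = (choices[0].get("message") or {}).get("content") or ""
    cleaned = strip_markdown_fences(content)
    if not cleaned:
        return None, "empty_model_response"
    try:
        return json.loads(cleaned), None
    except json.JSONDecodeError:
        # The real pipeline would attempt a JSON repair before giving up.
        return None, "invalid_response_json"


fenced = FENCE + 'json\n{"ticker": "ACME"}\n' + FENCE
parsed, error = parse_chat_completion({"choices": [{"message": {"content": fenced}}]})
```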
Key differences from OllamaClient:
- Endpoint: `/v1/chat/completions` instead of `/api/chat`
- No `think: false`, `stream: false`, or `options` block
- Uses `max_tokens` instead of `options.num_predict`
- Uses `response_format: {"type": "json_object"}` for structured output
- Supports `temperature` parameter (Ollama uses model defaults)
- Response in `choices[0].message.content` instead of `message.content`
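To make these differences concrete, here is a hedged sketch of a payload builder. The function name `build_vllm_payload`, the `"system"` and `"user"` keys in `prompts`, and the way `document_text` is appended to the user message are all assumptions for illustration:

```python
def build_vllm_payload(
    prompts: dict[str, str],
    document_text: str = "",
    *,
    model: str = "RedHatAI/Qwen3.6-35B-A3B-NVFP4",
    max_tokens: int = 4096,
    temperature: float = 0.7,
) -> dict[str, object]:
    """Build an OpenAI-style chat payload with none of the Ollama-only keys."""
    user_content = prompts.get("user", "")
    if document_text:
        # Assumed layout: the document follows the user prompt after a blank line.
        user_content = f"{user_content}\n\n{document_text}"
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": prompts.get("system", "")},
            {"role": "user", "content": user_content},
        ],
        "max_tokens": max_tokens,  # replaces Ollama's options.num_predict
        "temperature": temperature,
        "response_format": {"type": "json_object"},
    }


payload = build_vllm_payload(
    {"system": "Extract events.", "user": "Summarize:"},
    "ACME beat earnings.",
)
```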
3. VLLMConfig (services/shared/config.py)
New dataclass alongside OllamaConfig:
```python
@dataclass
class VLLMConfig:
    base_url: str = "http://192.168.42.254:8000"
    model: str = "RedHatAI/Qwen3.6-35B-A3B-NVFP4"
    timeout: int = 120
    max_retries: int = 2
    retry_base_delay: float = 1.0
    retry_max_delay: float = 10.0
    retry_backoff_multiplier: float = 2.0
    max_tokens: int = 32768
    temperature: float = 0.7
    api_key: str = ""  # Optional, for authenticated vLLM deployments
```
Loaded from VLLM_* environment variables in load_config(). Added to AppConfig as vllm: VLLMConfig.
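One way the environment loading could work is sketched below. The helper `load_vllm_config` is hypothetical (the real logic lives inside `load_config()`), and the rule "field name upper-cased with a `VLLM_` prefix" is inferred from the variable names used in this document:

```python
import os
from collections.abc import Mapping
from dataclasses import dataclass, fields


@dataclass
class VLLMConfig:
    base_url: str = "http://192.168.42.254:8000"
    model: str = "RedHatAI/Qwen3.6-35B-A3B-NVFP4"
    timeout: int = 120
    max_retries: int = 2
    retry_base_delay: float = 1.0
    retry_max_delay: float = 10.0
    retry_backoff_multiplier: float = 2.0
    max_tokens: int = 32768
    temperature: float = 0.7
    api_key: str = ""


def load_vllm_config(env: Mapping[str, str] = os.environ) -> VLLMConfig:
    """Override each default with VLLM_<FIELD_NAME> when that variable is set."""
    defaults = VLLMConfig()
    overrides = {}
    for field in fields(VLLMConfig):
        raw = env.get(f"VLLM_{field.name.upper()}")
        if raw is None:
            continue
        # Coerce using the type of the default value (str, int or float).
        caster = type(getattr(defaults, field.name))
        overrides[field.name] = caster(raw)
    return VLLMConfig(**overrides)


cfg = load_vllm_config({"VLLM_TIMEOUT": "30", "VLLM_TEMPERATURE": "0.2"})
```

In this sketch an empty string for a numeric variable raises `ValueError` at startup; whether the real loader fails fast or falls back to the default is not specified here.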
4. LLM Client Factory (services/extractor/llm_factory.py)
Factory function that replaces the hardcoded OllamaClient construction:
```python
def build_llm_client(
    resolved: ResolvedAgentConfig | None,
    ollama_config: OllamaConfig,
    vllm_config: VLLMConfig,
    http_client: httpx.AsyncClient | None = None,
) -> LLMClient:
    """Return the appropriate LLM client based on resolved provider."""
    ...


def build_config_from_resolved(
    resolved: ResolvedAgentConfig,
    base_ollama: OllamaConfig,
    base_vllm: VLLMConfig,
) -> OllamaConfig | VLLMConfig:
    """Build provider-specific config from resolved agent config."""
    ...
```
Provider routing logic:
- If `resolved` is `None` or `resolved.model_provider` is `"ollama"` or empty → `OllamaClient`
- If `resolved.model_provider` is `"vllm"` → `VLLMClient`
- Unknown provider → log warning, fall back to `OllamaClient`
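The routing rules above can be sketched with stub classes. This is a simplified illustration: the config and HTTP client arguments of the real `build_llm_client()` are dropped, and `ResolvedAgentConfig` is reduced to the single field the decision needs:

```python
import logging
from dataclasses import dataclass
from typing import Optional

logger = logging.getLogger("llm_factory")


@dataclass
class ResolvedAgentConfig:
    """Stand-in carrying only the field that drives routing."""
    model_provider: Optional[str] = None


class OllamaClient:
    """Stub for the existing client."""


class VLLMClient:
    """Stub for the new client."""


def build_llm_client(resolved: Optional[ResolvedAgentConfig]):
    # None, a missing provider, and an empty string all mean "ollama".
    provider = (resolved.model_provider if resolved else None) or "ollama"
    if provider == "vllm":
        return VLLMClient()
    if provider != "ollama":
        logger.warning("Unknown model_provider %r; falling back to Ollama", provider)
    return OllamaClient()


clients = {
    name: build_llm_client(ResolvedAgentConfig(name))
    for name in ("ollama", "vllm", "", "mystery")
}
```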
5. Updated Extractor Worker (services/extractor/main.py)
Changes to main():
- Replace `_build_ollama_config_from_resolved()` with `build_llm_client()` from the factory
- Store clients as `LLMClient` type instead of `OllamaClient`
- On config refresh (every 100 jobs), detect provider changes and swap clients
- Log provider switches at INFO level
Changes to _process_macro_classification():
- Accept `LLMClient` instead of `OllamaClient` for the classifier parameter
6. Updated OllamaClient (services/extractor/client.py)
Minimal changes to satisfy the protocol:
- Add public `call_llm()` method that delegates to `_call_ollama()`
- Keep `_call_ollama()` as-is for backward compatibility
- The `extract()` method continues to call `_call_ollama()` internally
7. Updated Event Classifier (services/extractor/event_classifier.py)
Changes to classify_global_event():
- Accept `LLMClient` instead of `Any` for the `ollama_client` parameter
- Call `client.call_llm()` instead of `ollama_client._call_ollama()`
- Set `ModelMetadata.provider` based on the actual client type (inspect `_config` or pass provider string)
8. Helm Values (infra/helm/stonks-oracle/values.yaml)
New config entries:
```yaml
config:
  VLLM_BASE_URL: "http://192.168.42.254:8000"
  VLLM_MODEL: "RedHatAI/Qwen3.6-35B-A3B-NVFP4"
  VLLM_TIMEOUT: "120"
  VLLM_MAX_RETRIES: "2"
  VLLM_TEMPERATURE: "0.7"
  VLLM_API_KEY: ""
```
9. Health Check (services/extractor/vllm_client.py)
Startup validation function:
```python
async def check_vllm_health(base_url: str, timeout: float = 10.0) -> bool:
    """GET {base_url}/v1/models to verify vLLM is reachable."""
    ...
```
Called from main() when the resolved or default config specifies vLLM. On failure, logs WARNING and falls back to Ollama. On success, logs INFO with server URL and model list.
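A rough sketch of the health check follows. It departs from the design in two labeled ways: it uses the standard library's `urllib` in a worker thread rather than `httpx`, and it adds a hypothetical `fetch` parameter so the example can run without a network. The `data[].id` layout is the OpenAI-style model list that `/v1/models` returns:

```python
import asyncio
import json
import logging
import urllib.request
from typing import Callable

logger = logging.getLogger("vllm_health")


def _fetch_json(url: str, timeout: float) -> object:
    """Blocking GET that returns the decoded JSON body."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read())


async def check_vllm_health(
    base_url: str,
    timeout: float = 10.0,
    fetch: Callable[[str, float], object] = _fetch_json,  # hypothetical test seam
) -> bool:
    url = f"{base_url.rstrip('/')}/v1/models"
    try:
        # Run the blocking call in a worker thread so the event loop stays free.
        body = await asyncio.to_thread(fetch, url, timeout)
    except (OSError, ValueError) as exc:
        # URLError, HTTPError and socket timeouts are OSError subclasses;
        # JSONDecodeError is a ValueError subclass.
        logger.warning("vLLM health check failed for %s: %s", url, exc)
        return False
    data = body.get("data", []) if isinstance(body, dict) else []
    models = [m.get("id") for m in data if isinstance(m, dict)]
    logger.info("vLLM reachable at %s; models: %s", base_url, models)
    return True


# Usage with hypothetical fake fetchers, so no network is needed.
def _fake_ok(url: str, timeout: float) -> object:
    return {"object": "list", "data": [{"id": "demo-model"}]}


def _fake_down(url: str, timeout: float) -> object:
    raise ConnectionRefusedError("connection refused")


healthy = asyncio.run(check_vllm_health("http://vllm.invalid:8000", fetch=_fake_ok))
unhealthy = asyncio.run(check_vllm_health("http://vllm.invalid:8000", fetch=_fake_down))
```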
Data Models
VLLMConfig Dataclass
| Field | Type | Default | Env Var |
|---|---|---|---|
| `base_url` | `str` | `http://192.168.42.254:8000` | `VLLM_BASE_URL` |
| `model` | `str` | `RedHatAI/Qwen3.6-35B-A3B-NVFP4` | `VLLM_MODEL` |
| `timeout` | `int` | `120` | `VLLM_TIMEOUT` |
| `max_retries` | `int` | `2` | `VLLM_MAX_RETRIES` |
| `retry_base_delay` | `float` | `1.0` | `VLLM_RETRY_BASE_DELAY` |
| `retry_max_delay` | `float` | `10.0` | `VLLM_RETRY_MAX_DELAY` |
| `retry_backoff_multiplier` | `float` | `2.0` | `VLLM_RETRY_BACKOFF_MULTIPLIER` |
| `max_tokens` | `int` | `32768` | `VLLM_MAX_TOKENS` |
| `temperature` | `float` | `0.7` | `VLLM_TEMPERATURE` |
| `api_key` | `str` | `""` | `VLLM_API_KEY` |
ExtractionAttempt (unchanged)
The existing ExtractionAttempt dataclass is reused as-is for both providers. No changes needed.
ModelMetadata (unchanged structure, new values)
The provider field now accepts "vllm" in addition to "ollama". No schema change needed.
Error Handling
Error String Format Parity
Both clients produce identical error string formats so _is_retryable() works unchanged:
| Condition | Error String | Retryable |
|---|---|---|
| HTTP timeout | `timeout` | Yes |
| HTTP 400/401/403/404/422 | `http_{code}` | No |
| HTTP 429/500/502/503 | `http_{code}` | Yes |
| Connection refused/reset | `connection_error: {details}` | Yes |
| Empty response body | `empty_model_response` | Yes |
| Invalid JSON in response | `invalid_response_json` | Yes |
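As an illustration of why string-format parity matters, a retry predicate consistent with the table above might look like this. It is a hypothetical reimplementation, not the project's `_is_retryable()`, and it assumes that status codes the table does not list are non-retryable:

```python
RETRYABLE_HTTP_CODES = {429, 500, 502, 503}
RETRYABLE_ERRORS = {"timeout", "empty_model_response", "invalid_response_json"}


def is_retryable(error: str) -> bool:
    """Decide from the error string alone, regardless of which client produced it."""
    if error in RETRYABLE_ERRORS:
        return True
    if error.startswith("connection_error:"):
        return True
    if error.startswith("http_"):
        code = error.removeprefix("http_")
        return code.isdigit() and int(code) in RETRYABLE_HTTP_CODES
    return False


decisions = {e: is_retryable(e) for e in ("timeout", "http_500", "http_400", "http_429")}
```

Because the decision depends only on the string, either client can feed the same retry loop as long as it emits these exact formats.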
Health Check Failure
If the vLLM health check fails at startup:
- Log WARNING with the error details
- Fall back to `OllamaClient` using `OllamaConfig`
- Continue operation; the system degrades gracefully rather than crashing
Provider Switch During Refresh
When the config refresh (every 100 jobs) detects a provider change:
- Close the old client (`await old_client.close()`)
- Construct the new client via the factory
- Log the switch at INFO level
- If new client construction fails, keep the old client and log ERROR
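The swap can be sketched as below, using a hypothetical `FakeClient` and a `build_client` stub that raises for unsupported names purely to simulate a construction failure (the real factory falls back to Ollama instead). The sketch builds the replacement before closing the old client, which is an assumption made so that the "keep the old client" fallback never returns a client that has already been closed:

```python
import asyncio
import logging

logger = logging.getLogger("extractor")


class FakeClient:
    """Hypothetical client that only tracks its provider and closed state."""

    def __init__(self, provider: str) -> None:
        self.provider = provider
        self.closed = False

    async def close(self) -> None:
        self.closed = True


def build_client(provider: str) -> FakeClient:
    # Raises for unsupported names only to simulate a construction failure.
    if provider not in {"ollama", "vllm"}:
        raise ValueError(f"cannot build client for {provider!r}")
    return FakeClient(provider)


async def swap_client_if_needed(current: FakeClient, new_provider: str) -> FakeClient:
    if new_provider == current.provider:
        return current
    try:
        replacement = build_client(new_provider)
    except Exception:
        # logger.exception records at ERROR level with the traceback.
        logger.exception("Could not build %s client; keeping %s", new_provider, current.provider)
        return current
    await current.close()
    logger.info("LLM provider switched: %s -> %s", current.provider, new_provider)
    return replacement


old = FakeClient("ollama")
new = asyncio.run(swap_client_if_needed(old, "vllm"))
kept = asyncio.run(swap_client_if_needed(new, "broken"))
```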
Testing Strategy
Property-Based Tests (tests/test_pbt_llm_provider.py)
Property-based tests using Hypothesis to verify the provider abstraction:
P1: Provider factory routing property (Req 3.4, 3.5, 9.5)
For all model_provider values in {"ollama", "vllm", "", None}, the factory returns the correct client type. For "ollama", empty, or None, returns OllamaClient. For "vllm", returns VLLMClient.
P2: Error string format consistency property (Req 5.6)
For all HTTP status codes (100-599), both OllamaClient and VLLMClient produce error strings in the same format (http_{code}), and _is_retryable() returns the same result for both.
P3: VLLMClient request payload structure property (Req 2.1, 8.1)
For all generated prompt dicts (system + user messages of arbitrary text), the VLLMClient produces a request payload that: contains model, messages, max_tokens, temperature; does NOT contain think, stream, options, num_ctx, num_predict.
P4: JSON repair idempotence property (Req 2.4)
For all valid JSON strings, _repair_json(json_str) returns a string that json.loads() can parse, and _repair_json(_repair_json(json_str)) == _repair_json(json_str) (idempotence).
P5: Markdown fence stripping round-trip property (Req 2.3)
For all strings s, _strip_markdown_fences(f"```json\n{s}\n```") returns s (stripped), and _strip_markdown_fences(s) returns s when no fences are present (identity).
P6: VLLMConfig default construction property (Req 3.1)
For all VLLMConfig instances constructed with default values, base_url is non-empty, timeout > 0, max_retries >= 0, temperature is between 0.0 and 2.0, and max_tokens > 0.
Unit Tests (tests/test_vllm_client.py)
Example-based tests for specific behaviors:
- VLLMClient sends correct payload to `/v1/chat/completions` (mock httpx)
- VLLMClient extracts content from `choices[0].message.content`
- VLLMClient handles empty choices array → `empty_model_response`
- VLLMClient handles timeout → `timeout` error
- VLLMClient handles HTTP 500 → `http_500` error, retryable
- VLLMClient handles HTTP 400 → `http_400` error, non-retryable
- VLLMClient handles connection refused → `connection_error: ...`
- VLLMClient applies markdown fence stripping
- VLLMClient applies JSON repair
- VLLMClient includes temperature in payload
- VLLMClient includes `response_format` in payload
- Health check success logs INFO
- Health check failure logs WARNING and returns False
- Factory returns OllamaClient for provider="ollama"
- Factory returns VLLMClient for provider="vllm"
- Factory returns OllamaClient for provider="" (default)
- Factory returns OllamaClient for unknown provider with warning
- VLLMConfig loads from environment variables
- AppConfig includes vllm field with defaults
- OllamaClient.call_llm() delegates to _call_ollama()
Existing Tests (unchanged)
- `tests/test_ollama_client.py`: continues to pass without modification
- All other existing test files: unaffected
File Changes Summary
| File | Change Type | Description |
|---|---|---|
| `services/shared/llm_protocol.py` | New | LLMClient Protocol definition |
| `services/extractor/vllm_client.py` | New | VLLMClient implementation + health check |
| `services/extractor/llm_factory.py` | New | Factory function for provider routing |
| `services/shared/config.py` | Modified | Add VLLMConfig, update AppConfig, update load_config() |
| `services/extractor/client.py` | Modified | Add call_llm() public method to OllamaClient |
| `services/extractor/event_classifier.py` | Modified | Use call_llm() instead of _call_ollama(), accept LLMClient type |
| `services/extractor/main.py` | Modified | Use factory, support provider switching, health check |
| `infra/helm/stonks-oracle/values.yaml` | Modified | Add VLLM_* config entries |
| `tests/test_pbt_llm_provider.py` | New | Property-based tests for provider abstraction |
| `tests/test_vllm_client.py` | New | Unit tests for VLLMClient and factory |