feat: add remote vLLM support with provider abstraction layer

- LLMClient Protocol for provider-agnostic inference
- VLLMClient for OpenAI-compatible /v1/chat/completions API
- LLM client factory with provider routing (ollama/vllm)
- VLLMConfig with VLLM_* environment variable loading
- Updated extractor worker with health check and provider switching
- Updated event classifier to use LLMClient protocol
- Helm values for vLLM configuration
- 18 unit tests + 6 property-based tests
- Full backward compatibility preserved
@@ -0,0 +1,350 @@
# Design Document: Remote vLLM Support

## Overview

This design introduces an LLM provider abstraction layer into Stonks Oracle so that both the existing Ollama backend and a new remote vLLM backend can be used interchangeably for document extraction and event classification. The vLLM server at `http://192.168.42.254:8000` runs `RedHatAI/Qwen3.6-35B-A3B-NVFP4` on an NVIDIA RTX 5090 with tensor parallelism and exposes an OpenAI-compatible `/v1/chat/completions` API.

The design preserves full backward compatibility: existing Ollama deployments work without any configuration changes. Provider selection is driven by the existing `model_provider` column in the `ai_agents` and `agent_variants` database tables, so no new migrations are required.

## Architecture

```mermaid
graph TD
    subgraph "Extractor Worker"
        MAIN[main.py]
        FACTORY[LLMClientFactory]
        EXTRACT[Extraction Pipeline]
        CLASSIFY[Event Classification Pipeline]
    end

    subgraph "Provider Abstraction"
        PROTO[LLMClient Protocol]
        OLLAMA_IMPL[OllamaClient]
        VLLM_IMPL[VLLMClient]
    end

    subgraph "Configuration"
        RESOLVER[AgentConfigResolver]
        OLLAMA_CFG[OllamaConfig]
        VLLM_CFG[VLLMConfig]
        APP_CFG[AppConfig]
    end

    subgraph "External Services"
        OLLAMA_SRV[Ollama Server<br/>:11434/api/chat]
        VLLM_SRV[vLLM Server<br/>:8000/v1/chat/completions]
    end

    MAIN --> FACTORY
    FACTORY --> PROTO
    PROTO --> OLLAMA_IMPL
    PROTO --> VLLM_IMPL
    EXTRACT --> PROTO
    CLASSIFY --> PROTO

    RESOLVER --> FACTORY
    OLLAMA_CFG --> FACTORY
    VLLM_CFG --> FACTORY
    APP_CFG --> OLLAMA_CFG
    APP_CFG --> VLLM_CFG

    OLLAMA_IMPL --> OLLAMA_SRV
    VLLM_IMPL --> VLLM_SRV
```

The key architectural decision is to use a Python `Protocol` (structural typing) rather than an ABC for the LLM client interface. This allows the existing `OllamaClient` to satisfy the protocol without inheritance changes, maintaining backward compatibility. The `VLLMClient` is a new class that also satisfies the protocol.

A factory function in `services/extractor/llm_factory.py` takes a `ResolvedAgentConfig` and the base configs, returning the appropriate client. The extractor worker (`main.py`) uses this factory instead of directly constructing `OllamaClient`.

## Components and Interfaces

### 1. LLM Client Protocol (`services/shared/llm_protocol.py`)

A `typing.Protocol` defining the contract both clients must satisfy:

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class LLMClient(Protocol):
    async def call_llm(
        self,
        prompts: dict[str, str],
        json_schema: dict[str, object],
        document_text: str = "",
    ) -> "ExtractionAttempt": ...

    async def close(self) -> None: ...
```

The `call_llm` method signature matches the existing `OllamaClient._call_ollama()` parameters and return type. The `OllamaClient` gains a public `call_llm` method that delegates to `_call_ollama()`, preserving the private method for internal backward compatibility.

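To illustrate why structural typing avoids inheritance changes, here is a minimal, self-contained sketch. `LegacyClient` and the fields of `ExtractionAttempt` are hypothetical stand-ins (the real `OllamaClient` and result type are not shown in this document); only the method names come from the design.

```python
import asyncio
from dataclasses import dataclass
from typing import Protocol, runtime_checkable


@dataclass
class ExtractionAttempt:
    """Stand-in result type; the real fields are an assumption."""
    raw_text: str = ""
    error: str = ""


@runtime_checkable
class LLMClient(Protocol):
    async def call_llm(
        self,
        prompts: dict[str, str],
        json_schema: dict[str, object],
        document_text: str = "",
    ) -> ExtractionAttempt: ...

    async def close(self) -> None: ...


class LegacyClient:
    """Hypothetical stand-in for OllamaClient: no base class, only matching methods."""

    async def _call_ollama(self, prompts, json_schema, document_text=""):
        return ExtractionAttempt(raw_text='{"ok": true}')

    async def call_llm(self, prompts, json_schema, document_text=""):
        # The public entry point delegates to the existing private method.
        return await self._call_ollama(prompts, json_schema, document_text)

    async def close(self) -> None:
        return None


client = LegacyClient()
# isinstance() works because the protocol is runtime_checkable.
is_llm_client = isinstance(client, LLMClient)
attempt = asyncio.run(client.call_llm({"system": "s", "user": "u"}, {}))
```

Note that a `runtime_checkable` `isinstance()` check only verifies that the methods exist, not that their signatures match, so a static type checker is still the stronger guarantee.
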
### 2. VLLMClient (`services/extractor/vllm_client.py`)

New client implementing the `LLMClient` protocol for the OpenAI-compatible API:

```python
@dataclass
class VLLMClient:
    _config: VLLMConfig
    _http: httpx.AsyncClient
    _owns_client: bool

    async def call_llm(
        self,
        prompts: dict[str, str],
        json_schema: dict[str, object],
        document_text: str = "",
    ) -> ExtractionAttempt: ...

    async def close(self) -> None: ...
```

**Request format** (OpenAI-compatible):

```json
{
  "model": "RedHatAI/Qwen3.6-35B-A3B-NVFP4",
  "messages": [
    {"role": "system", "content": "..."},
    {"role": "user", "content": "..."}
  ],
  "max_tokens": 4096,
  "temperature": 0.7,
  "response_format": {"type": "json_object"}
}
```

**Response parsing**: Extracts `choices[0].message.content`, then applies the same `_strip_markdown_fences()` and `_repair_json()` pipeline as `OllamaClient`.

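The response-parsing step could look roughly like the sketch below. Both functions are hypothetical: `strip_markdown_fences` is an illustrative analog of the project's `_strip_markdown_fences()`, whose real implementation is not shown here, and the `_repair_json()` step is omitted for the same reason.

```python
from __future__ import annotations

import json
import re

# Matches one fenced block that spans the whole (already stripped) string.
_FENCE_RE = re.compile(r"^```[a-zA-Z0-9_-]*[ \t]*\n(.*?)\n?```\s*$", re.DOTALL)


def strip_markdown_fences(text: str) -> str:
    """Return the body of a single fenced block, or the stripped input if unfenced."""
    stripped = text.strip()
    match = _FENCE_RE.match(stripped)
    return match.group(1).strip() if match else stripped


def extract_content(body: dict) -> tuple[str | None, str | None]:
    """Return (content, error) from an OpenAI-style chat completion body."""
    choices = body.get("choices") or []
    if not choices:
        return None, "empty_model_response"
    content = (choices[0].get("message") or {}).get("content") or ""
    content = strip_markdown_fences(content)
    if not content:
        return None, "empty_model_response"
    return content, None


sample = {"choices": [{"message": {"content": "```json\n{\"a\": 1}\n```"}}]}
content, error = extract_content(sample)
parsed = json.loads(content) if content is not None else None
```
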
**Error handling**: Maps HTTP errors to the same string format as `OllamaClient` (`timeout`, `http_{code}`, `connection_error: {details}`, `empty_model_response`), so the existing `_is_retryable()` function works without modification.

**Key differences from OllamaClient**:

- Endpoint: `/v1/chat/completions` instead of `/api/chat`
- No `think: false`, `stream: false`, or `options` block
- Uses `max_tokens` instead of `options.num_predict`
- Uses `response_format: {"type": "json_object"}` for structured output
- Supports `temperature` parameter (Ollama uses model defaults)
- Response in `choices[0].message.content` instead of `message.content`

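The differences listed above mostly show up in how the request body is assembled. The following is a sketch under stated assumptions: `build_payload` and `build_headers` are hypothetical helper names, the `"system"`/`"user"` keys of `prompts` and the way `document_text` is appended to the user message are guesses, and the `Authorization: Bearer` header is the usual convention for OpenAI-compatible servers rather than something this design specifies.

```python
from dataclasses import dataclass


@dataclass
class VLLMConfig:
    """Trimmed copy of the config with only the fields this sketch needs."""
    model: str = "RedHatAI/Qwen3.6-35B-A3B-NVFP4"
    max_tokens: int = 32768
    temperature: float = 0.7
    api_key: str = ""


def build_payload(config: VLLMConfig, prompts: dict, document_text: str = "") -> dict:
    """Assemble an OpenAI-style chat completion request body."""
    user_content = prompts.get("user", "")
    if document_text:
        # Assumption: the document is appended to the user prompt.
        user_content = f"{user_content}\n\n{document_text}"
    return {
        "model": config.model,
        "messages": [
            {"role": "system", "content": prompts.get("system", "")},
            {"role": "user", "content": user_content},
        ],
        "max_tokens": config.max_tokens,
        "temperature": config.temperature,
        "response_format": {"type": "json_object"},
    }


def build_headers(config: VLLMConfig) -> dict:
    """Send a bearer token only when an API key is configured."""
    headers = {"Content-Type": "application/json"}
    if config.api_key:
        headers["Authorization"] = f"Bearer {config.api_key}"
    return headers


payload = build_payload(VLLMConfig(), {"system": "Extract events.", "user": "Text:"}, "ACME beat estimates.")
```

None of the Ollama-only keys (`think`, `stream`, `options`) appear in the payload, which is the property P3 in the testing strategy checks.
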
### 3. VLLMConfig (`services/shared/config.py`)

New dataclass alongside `OllamaConfig`:

```python
@dataclass
class VLLMConfig:
    base_url: str = "http://192.168.42.254:8000"
    model: str = "RedHatAI/Qwen3.6-35B-A3B-NVFP4"
    timeout: int = 120
    max_retries: int = 2
    retry_base_delay: float = 1.0
    retry_max_delay: float = 10.0
    retry_backoff_multiplier: float = 2.0
    max_tokens: int = 32768
    temperature: float = 0.7
    api_key: str = ""  # Optional, for authenticated vLLM deployments
```

Loaded from `VLLM_*` environment variables in `load_config()`. Added to `AppConfig` as `vllm: VLLMConfig`.

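A minimal sketch of the environment loading, assuming each variable simply overrides the matching dataclass default. `load_vllm_config` is a hypothetical helper name (the design folds this into `load_config()`); the variable names are the ones listed in the Data Models table.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass
class VLLMConfig:
    base_url: str = "http://192.168.42.254:8000"
    model: str = "RedHatAI/Qwen3.6-35B-A3B-NVFP4"
    timeout: int = 120
    max_retries: int = 2
    retry_base_delay: float = 1.0
    retry_max_delay: float = 10.0
    retry_backoff_multiplier: float = 2.0
    max_tokens: int = 32768
    temperature: float = 0.7
    api_key: str = ""


def load_vllm_config(env: Mapping[str, str] = os.environ) -> VLLMConfig:
    """Build a VLLMConfig, letting VLLM_* variables override the defaults."""
    d = VLLMConfig()
    return VLLMConfig(
        base_url=env.get("VLLM_BASE_URL", d.base_url),
        model=env.get("VLLM_MODEL", d.model),
        timeout=int(env.get("VLLM_TIMEOUT", d.timeout)),
        max_retries=int(env.get("VLLM_MAX_RETRIES", d.max_retries)),
        retry_base_delay=float(env.get("VLLM_RETRY_BASE_DELAY", d.retry_base_delay)),
        retry_max_delay=float(env.get("VLLM_RETRY_MAX_DELAY", d.retry_max_delay)),
        retry_backoff_multiplier=float(
            env.get("VLLM_RETRY_BACKOFF_MULTIPLIER", d.retry_backoff_multiplier)
        ),
        max_tokens=int(env.get("VLLM_MAX_TOKENS", d.max_tokens)),
        temperature=float(env.get("VLLM_TEMPERATURE", d.temperature)),
        api_key=env.get("VLLM_API_KEY", d.api_key),
    )


# Passing a plain dict instead of os.environ keeps the example deterministic.
cfg = load_vllm_config({"VLLM_TIMEOUT": "30", "VLLM_TEMPERATURE": "0.2"})
```
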
### 4. LLM Client Factory (`services/extractor/llm_factory.py`)

Factory function that replaces the hardcoded `OllamaClient` construction:

```python
def build_llm_client(
    resolved: ResolvedAgentConfig | None,
    ollama_config: OllamaConfig,
    vllm_config: VLLMConfig,
    http_client: httpx.AsyncClient | None = None,
) -> LLMClient:
    """Return the appropriate LLM client based on resolved provider."""
    ...


def build_config_from_resolved(
    resolved: ResolvedAgentConfig,
    base_ollama: OllamaConfig,
    base_vllm: VLLMConfig,
) -> OllamaConfig | VLLMConfig:
    """Build provider-specific config from resolved agent config."""
    ...
```

Provider routing logic:

1. If `resolved` is `None` or `resolved.model_provider` is `"ollama"` or empty → `OllamaClient`
2. If `resolved.model_provider` is `"vllm"` → `VLLMClient`
3. Unknown provider → log warning, fall back to `OllamaClient`

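The three routing rules above can be sketched as a small pure function. The classes below are empty stand-ins and `select_client_class` is a hypothetical name; the real factory would also construct the client from the matching config.

```python
from __future__ import annotations

import logging
from dataclasses import dataclass

logger = logging.getLogger(__name__)


@dataclass
class ResolvedAgentConfig:
    """Stand-in with only the field the routing decision reads."""
    model_provider: str | None = None


class OllamaClient:
    """Empty stand-in for the existing client."""


class VLLMClient:
    """Empty stand-in for the new client."""


def select_client_class(resolved: ResolvedAgentConfig | None) -> type:
    """Apply the routing rules: ollama/empty/None, vllm, then fallback."""
    provider = (resolved.model_provider or "") if resolved is not None else ""
    if provider in ("", "ollama"):
        return OllamaClient
    if provider == "vllm":
        return VLLMClient
    logger.warning("Unknown model_provider %r; falling back to Ollama", provider)
    return OllamaClient
```

Keeping the decision separate from client construction makes property P1 in the testing strategy straightforward to check without any HTTP setup.
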
### 5. Updated Extractor Worker (`services/extractor/main.py`)

Changes to `main()`:

- Replace `_build_ollama_config_from_resolved()` with `build_llm_client()` from the factory
- Store clients as `LLMClient` type instead of `OllamaClient`
- On config refresh (every 100 jobs), detect provider changes and swap clients
- Log provider switches at INFO level

Changes to `_process_macro_classification()`:

- Accept `LLMClient` instead of `OllamaClient` for the classifier parameter

### 6. Updated OllamaClient (`services/extractor/client.py`)

Minimal changes to satisfy the protocol:

- Add public `call_llm()` method that delegates to `_call_ollama()`
- Keep `_call_ollama()` as-is for backward compatibility
- The `extract()` method continues to call `_call_ollama()` internally

### 7. Updated Event Classifier (`services/extractor/event_classifier.py`)

Changes to `classify_global_event()`:

- Accept `LLMClient` instead of `Any` for the `ollama_client` parameter
- Call `client.call_llm()` instead of `ollama_client._call_ollama()`
- Set `ModelMetadata.provider` based on the actual client type (inspect `_config` or pass provider string)

### 8. Helm Values (`infra/helm/stonks-oracle/values.yaml`)

New config entries:

```yaml
config:
  VLLM_BASE_URL: "http://192.168.42.254:8000"
  VLLM_MODEL: "RedHatAI/Qwen3.6-35B-A3B-NVFP4"
  VLLM_TIMEOUT: "120"
  VLLM_MAX_RETRIES: "2"
  VLLM_TEMPERATURE: "0.7"
  VLLM_API_KEY: ""
```

### 9. Health Check (`services/extractor/vllm_client.py`)

Startup validation function:

```python
async def check_vllm_health(base_url: str, timeout: float = 10.0) -> bool:
    """GET {base_url}/v1/models to verify vLLM is reachable."""
    ...
```

Called from `main()` when the resolved or default config specifies vLLM. On failure, logs WARNING and falls back to Ollama. On success, logs INFO with server URL and model list.

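One possible shape for the health check is sketched below. To keep the example dependency-free it uses the standard library's `urllib` inside `asyncio.to_thread`, not the `httpx` client the design uses, so treat it as an illustration of the control flow only. `_fetch_model_ids` is a hypothetical helper, and the `{"data": [{"id": ...}]}` response shape is the standard OpenAI-compatible model list format.

```python
import asyncio
import json
import logging
import urllib.error
import urllib.request

logger = logging.getLogger(__name__)


def _fetch_model_ids(base_url: str, timeout: float) -> list:
    """Blocking GET of /v1/models; returns the model ids the server reports."""
    url = f"{base_url.rstrip('/')}/v1/models"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
    data = body.get("data", []) if isinstance(body, dict) else []
    return [m.get("id", "") for m in data if isinstance(m, dict)]


async def check_vllm_health(base_url: str, timeout: float = 10.0) -> bool:
    """Return True if the server answers /v1/models, False on any failure."""
    try:
        model_ids = await asyncio.to_thread(_fetch_model_ids, base_url, timeout)
    except (urllib.error.URLError, OSError, ValueError) as exc:
        # URLError covers refused connections and HTTP errors;
        # ValueError covers malformed URLs and invalid JSON bodies.
        logger.warning("vLLM health check failed for %s: %s", base_url, exc)
        return False
    logger.info("vLLM reachable at %s; models: %s", base_url, model_ids)
    return True


# A malformed URL fails fast without touching the network.
healthy = asyncio.run(check_vllm_health("not-a-url", timeout=1.0))
```

Returning a boolean instead of raising lets `main()` decide on the Ollama fallback in one place.
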
## Data Models

### VLLMConfig Dataclass

| Field | Type | Default | Env Var |
|-------|------|---------|---------|
| `base_url` | `str` | `http://192.168.42.254:8000` | `VLLM_BASE_URL` |
| `model` | `str` | `RedHatAI/Qwen3.6-35B-A3B-NVFP4` | `VLLM_MODEL` |
| `timeout` | `int` | `120` | `VLLM_TIMEOUT` |
| `max_retries` | `int` | `2` | `VLLM_MAX_RETRIES` |
| `retry_base_delay` | `float` | `1.0` | `VLLM_RETRY_BASE_DELAY` |
| `retry_max_delay` | `float` | `10.0` | `VLLM_RETRY_MAX_DELAY` |
| `retry_backoff_multiplier` | `float` | `2.0` | `VLLM_RETRY_BACKOFF_MULTIPLIER` |
| `max_tokens` | `int` | `32768` | `VLLM_MAX_TOKENS` |
| `temperature` | `float` | `0.7` | `VLLM_TEMPERATURE` |
| `api_key` | `str` | `""` | `VLLM_API_KEY` |

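The design does not spell out how the three retry fields combine. A common reading, assumed here, is capped exponential backoff without jitter; `retry_delay` is a hypothetical helper whose defaults mirror the table above.

```python
def retry_delay(
    attempt: int,
    base_delay: float = 1.0,
    max_delay: float = 10.0,
    multiplier: float = 2.0,
) -> float:
    """Delay in seconds before retry number `attempt` (0-based), capped at max_delay."""
    return min(base_delay * multiplier ** attempt, max_delay)


# With the defaults the delay doubles until it hits the 10 second cap.
delays = [retry_delay(n) for n in range(5)]  # [1.0, 2.0, 4.0, 8.0, 10.0]
```

With the default `max_retries` of 2, only the first two delays (1.0 s and 2.0 s) would be used, so the cap matters only if the retry count is raised.
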
### ExtractionAttempt (unchanged)

The existing `ExtractionAttempt` dataclass is reused as-is for both providers. No changes needed.

### ModelMetadata (unchanged structure, new values)

The `provider` field now accepts `"vllm"` in addition to `"ollama"`. No schema change needed.

## Error Handling

### Error String Format Parity

Both clients produce identical error string formats so `_is_retryable()` works unchanged:

| Condition | Error String | Retryable |
|-----------|-------------|-----------|
| HTTP timeout | `timeout` | Yes |
| HTTP 400/401/403/404/422 | `http_{code}` | No |
| HTTP 500/502/503/429 | `http_{code}` | Yes |
| Connection refused/reset | `connection_error: {details}` | Yes |
| Empty response body | `empty_model_response` | Yes |
| Invalid JSON in response | `invalid_response_json` | Yes |

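The table can be read as a small classification function. The sketch below is not the project's `_is_retryable()`, which is not shown in this document; it is a hypothetical version that treats only the status codes listed in the table as retryable and everything else under `http_` as non-retryable, which is an assumption for codes the table does not mention.

```python
RETRYABLE_STATUS = {429, 500, 502, 503}
RETRYABLE_ERRORS = {"timeout", "empty_model_response", "invalid_response_json"}


def http_error_string(status_code: int) -> str:
    """Format shared by both clients for HTTP failures."""
    return f"http_{status_code}"


def is_retryable(error: str) -> bool:
    """Decide retryability from the error string alone, per the table above."""
    if error in RETRYABLE_ERRORS:
        return True
    if error.startswith("connection_error"):
        return True
    if error.startswith("http_"):
        code = error[len("http_"):]
        return code.isdigit() and int(code) in RETRYABLE_STATUS
    return False
```

Because the decision depends only on the string, any client that emits the same formats gets identical retry behavior, which is the point of the parity requirement.
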
### Health Check Failure

If the vLLM health check fails at startup:

1. Log WARNING with the error details
2. Fall back to `OllamaClient` using `OllamaConfig`
3. Continue operation; the system degrades gracefully rather than crashing

### Provider Switch During Refresh

When the config refresh (every 100 jobs) detects a provider change:

1. Close the old client (`await old_client.close()`)
2. Construct the new client via the factory
3. Log the switch at INFO level
4. If new client construction fails, keep the old client and log ERROR

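A sketch of the swap logic follows, with one deliberate deviation that is an assumption rather than part of the design: it builds the replacement before closing the old client, so that the "keep the old client" fallback in step 4 never hands back an already closed client. `swap_client` and the `build_client` callable are hypothetical names.

```python
import asyncio
import logging

logger = logging.getLogger(__name__)


async def swap_client(old_client, old_provider, new_provider, build_client):
    """Return (client, provider) after a config refresh.

    `build_client` is any callable that takes a provider name and returns
    a client exposing an async close() method.
    """
    if new_provider == old_provider:
        return old_client, old_provider
    try:
        # Build first so a failure leaves the old client untouched.
        new_client = build_client(new_provider)
    except Exception:
        logger.exception(
            "Failed to build %s client; keeping %s", new_provider, old_provider
        )
        return old_client, old_provider
    await old_client.close()
    logger.info("Switched LLM provider from %s to %s", old_provider, new_provider)
    return new_client, new_provider


class FakeClient:
    """Minimal stand-in used to exercise the swap."""

    def __init__(self):
        self.closed = False

    async def close(self):
        self.closed = True


old, new = FakeClient(), FakeClient()
client, provider = asyncio.run(swap_client(old, "ollama", "vllm", lambda p: new))
```
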
## Testing Strategy

### Property-Based Tests (`tests/test_pbt_llm_provider.py`)

Property-based tests using Hypothesis to verify the provider abstraction:

**P1: Provider factory routing property** (Req 3.4, 3.5, 9.5)
For all `model_provider` values in `{"ollama", "vllm", "", None}`, the factory returns the correct client type. For `"ollama"`, empty, or `None`, returns `OllamaClient`. For `"vllm"`, returns `VLLMClient`.

**P2: Error string format consistency property** (Req 5.6)
For all HTTP status codes (100-599), both `OllamaClient` and `VLLMClient` produce error strings in the same format (`http_{code}`), and `_is_retryable()` returns the same result for both.

**P3: VLLMClient request payload structure property** (Req 2.1, 8.1)
For all generated prompt dicts (system + user messages of arbitrary text), the VLLMClient produces a request payload that: contains `model`, `messages`, `max_tokens`, `temperature`; does NOT contain `think`, `stream`, `options`, `num_ctx`, `num_predict`.

**P4: JSON repair idempotence property** (Req 2.4)
For all valid JSON strings, `_repair_json(json_str)` returns a string that `json.loads()` can parse, and `_repair_json(_repair_json(json_str)) == _repair_json(json_str)` (idempotence).

**P5: Markdown fence stripping round-trip property** (Req 2.3)
For all strings `s`, ``_strip_markdown_fences(f"```json\n{s}\n```")`` returns `s` (stripped), and `_strip_markdown_fences(s)` returns `s` when no fences are present (identity).

**P6: VLLMConfig default construction property** (Req 3.1)
For all VLLMConfig instances constructed with default values, `base_url` is non-empty, `timeout > 0`, `max_retries >= 0`, `temperature` is between 0.0 and 2.0, and `max_tokens > 0`.

### Unit Tests (`tests/test_vllm_client.py`)

Example-based tests for specific behaviors:

- VLLMClient sends correct payload to `/v1/chat/completions` (mock httpx)
- VLLMClient extracts content from `choices[0].message.content`
- VLLMClient handles empty choices array → `empty_model_response`
- VLLMClient handles timeout → `timeout` error
- VLLMClient handles HTTP 500 → `http_500` error, retryable
- VLLMClient handles HTTP 400 → `http_400` error, non-retryable
- VLLMClient handles connection refused → `connection_error: ...`
- VLLMClient applies markdown fence stripping
- VLLMClient applies JSON repair
- VLLMClient includes temperature in payload
- VLLMClient includes `response_format` in payload
- Health check success logs INFO
- Health check failure logs WARNING and returns False
- Factory returns OllamaClient for provider="ollama"
- Factory returns VLLMClient for provider="vllm"
- Factory returns OllamaClient for provider="" (default)
- Factory returns OllamaClient for unknown provider with warning
- VLLMConfig loads from environment variables
- AppConfig includes vllm field with defaults
- OllamaClient.call_llm() delegates to _call_ollama()

### Existing Tests (unchanged)

- `tests/test_ollama_client.py`: continues to pass without modification
- All other existing test files: unaffected

## File Changes Summary

| File | Change Type | Description |
|------|-------------|-------------|
| `services/shared/llm_protocol.py` | **New** | `LLMClient` Protocol definition |
| `services/extractor/vllm_client.py` | **New** | `VLLMClient` implementation + health check |
| `services/extractor/llm_factory.py` | **New** | Factory function for provider routing |
| `services/shared/config.py` | **Modified** | Add `VLLMConfig`, update `AppConfig`, update `load_config()` |
| `services/extractor/client.py` | **Modified** | Add `call_llm()` public method to `OllamaClient` |
| `services/extractor/event_classifier.py` | **Modified** | Use `call_llm()` instead of `_call_ollama()`, accept `LLMClient` type |
| `services/extractor/main.py` | **Modified** | Use factory, support provider switching, health check |
| `infra/helm/stonks-oracle/values.yaml` | **Modified** | Add `VLLM_*` config entries |
| `tests/test_pbt_llm_provider.py` | **New** | Property-based tests for provider abstraction |
| `tests/test_vllm_client.py` | **New** | Unit tests for VLLMClient and factory |