feat: add remote vLLM support with provider abstraction layer

- LLMClient Protocol for provider-agnostic inference
- VLLMClient for OpenAI-compatible /v1/chat/completions API
- LLM client factory with provider routing (ollama/vllm)
- VLLMConfig with VLLM_* environment variable loading
- Updated extractor worker with health check and provider switching
- Updated event classifier to use LLMClient protocol
- Helm values for vLLM configuration
- 18 unit tests + 6 property-based tests
- Full backward compatibility preserved
@@ -0,0 +1 @@
{"specId": "a7e3f1b2-9c4d-4e8a-b5f6-d2a1c3e7f9b0", "workflowType": "requirements-first", "specType": "feature"}
@@ -0,0 +1,350 @@
# Design Document: Remote vLLM Support

## Overview

This design introduces an LLM provider abstraction layer into Stonks Oracle so that both the existing Ollama backend and a new remote vLLM backend can be used interchangeably for document extraction and event classification. The vLLM server at `http://192.168.42.254:8000` runs `RedHatAI/Qwen3.6-35B-A3B-NVFP4` on an NVIDIA RTX 5090 with tensor parallelism and exposes an OpenAI-compatible `/v1/chat/completions` API.

The design preserves full backward compatibility: existing Ollama deployments work without any configuration changes. Provider selection is driven by the existing `model_provider` column in the `ai_agents` and `agent_variants` database tables, requiring no new migrations.

## Architecture

```mermaid
graph TD
    subgraph "Extractor Worker"
        MAIN[main.py]
        FACTORY[LLMClientFactory]
        EXTRACT[Extraction Pipeline]
        CLASSIFY[Event Classification Pipeline]
    end

    subgraph "Provider Abstraction"
        PROTO[LLMClient Protocol]
        OLLAMA_IMPL[OllamaClient]
        VLLM_IMPL[VLLMClient]
    end

    subgraph "Configuration"
        RESOLVER[AgentConfigResolver]
        OLLAMA_CFG[OllamaConfig]
        VLLM_CFG[VLLMConfig]
        APP_CFG[AppConfig]
    end

    subgraph "External Services"
        OLLAMA_SRV[Ollama Server<br/>:11434/api/chat]
        VLLM_SRV[vLLM Server<br/>:8000/v1/chat/completions]
    end

    MAIN --> FACTORY
    FACTORY --> PROTO
    PROTO --> OLLAMA_IMPL
    PROTO --> VLLM_IMPL
    EXTRACT --> PROTO
    CLASSIFY --> PROTO

    RESOLVER --> FACTORY
    OLLAMA_CFG --> FACTORY
    VLLM_CFG --> FACTORY
    APP_CFG --> OLLAMA_CFG
    APP_CFG --> VLLM_CFG

    OLLAMA_IMPL --> OLLAMA_SRV
    VLLM_IMPL --> VLLM_SRV
```

The key architectural decision is to use a Python `Protocol` (structural typing) rather than an ABC for the LLM client interface. This allows the existing `OllamaClient` to satisfy the protocol without inheritance changes, maintaining backward compatibility. The `VLLMClient` is a new class that also satisfies the protocol.

A factory function in `services/extractor/llm_factory.py` takes a `ResolvedAgentConfig` and the base configs, returning the appropriate client. The extractor worker (`main.py`) uses this factory instead of directly constructing `OllamaClient`.

## Components and Interfaces

### 1. LLM Client Protocol (`services/shared/llm_protocol.py`)

A `typing.Protocol` defining the contract both clients must satisfy:

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class LLMClient(Protocol):
    async def call_llm(
        self,
        prompts: dict[str, str],
        json_schema: dict[str, object],
        document_text: str = "",
    ) -> "ExtractionAttempt": ...

    async def close(self) -> None: ...
```

The `call_llm` method signature matches the existing `OllamaClient._call_ollama()` parameters and return type. The `OllamaClient` gains a public `call_llm` method that delegates to `_call_ollama()`, preserving the private method for internal backward compatibility.

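To illustrate why a `Protocol` lets a client qualify without inheriting from anything, here is a minimal, self-contained sketch. The `ExtractionAttempt` stand-in and the `FakeClient` / `NotAClient` classes are hypothetical and exist only for this example; the real `ExtractionAttempt` has more fields.

```python
import asyncio
from dataclasses import dataclass
from typing import Optional, Protocol, runtime_checkable


@dataclass
class ExtractionAttempt:
    """Hypothetical stand-in; the real dataclass carries more fields."""
    raw_output: str = ""
    error: Optional[str] = None


@runtime_checkable
class LLMClient(Protocol):
    async def call_llm(
        self,
        prompts: dict[str, str],
        json_schema: dict[str, object],
        document_text: str = "",
    ) -> ExtractionAttempt: ...

    async def close(self) -> None: ...


class FakeClient:
    """Does not inherit from LLMClient, yet matches it structurally."""

    async def call_llm(self, prompts, json_schema, document_text=""):
        return ExtractionAttempt(raw_output="{}")

    async def close(self) -> None:
        return None


class NotAClient:
    """Lacks call_llm, so it does not satisfy the protocol."""

    async def close(self) -> None:
        return None


if __name__ == "__main__":
    print(isinstance(FakeClient(), LLMClient))
    print(isinstance(NotAClient(), LLMClient))
    print(asyncio.run(FakeClient().call_llm({}, {})))
```

Note that `isinstance` against a `runtime_checkable` protocol only checks that the methods exist, not that their signatures match, so a static type checker is still the stronger guarantee.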
### 2. VLLMClient (`services/extractor/vllm_client.py`)

New client implementing the `LLMClient` protocol for the OpenAI-compatible API:

```python
@dataclass
class VLLMClient:
    _config: VLLMConfig
    _http: httpx.AsyncClient
    _owns_client: bool

    async def call_llm(
        self,
        prompts: dict[str, str],
        json_schema: dict[str, object],
        document_text: str = "",
    ) -> ExtractionAttempt: ...

    async def close(self) -> None: ...
```

**Request format** (OpenAI-compatible):
```json
{
  "model": "RedHatAI/Qwen3.6-35B-A3B-NVFP4",
  "messages": [
    {"role": "system", "content": "..."},
    {"role": "user", "content": "..."}
  ],
  "max_tokens": 4096,
  "temperature": 0.7,
  "response_format": {"type": "json_object"}
}
```

**Response parsing**: Extracts `choices[0].message.content`, then applies the same `_strip_markdown_fences()` and `_repair_json()` pipeline as `OllamaClient`.

**Error handling**: Maps HTTP errors to the same string format as `OllamaClient` (`timeout`, `http_{code}`, `connection_error: {details}`, `empty_model_response`), so the existing `_is_retryable()` function works without modification.

**Key differences from OllamaClient**:
- Endpoint: `/v1/chat/completions` instead of `/api/chat`
- No `think: false`, `stream: false`, or `options` block
- Uses `max_tokens` instead of `options.num_predict`
- Uses `response_format: {"type": "json_object"}` for structured output
- Supports `temperature` parameter (Ollama uses model defaults)
- Response in `choices[0].message.content` instead of `message.content`

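The request and response shapes described above can be sketched as two small pure functions. This is only an illustration: the helper names `build_vllm_payload` and `extract_content` are hypothetical, and the `"system"` / `"user"` keys of `prompts` as well as the way `document_text` is appended to the user message are assumptions, since the source does not spell them out.

```python
from typing import Any, Optional


def build_vllm_payload(
    model: str,
    prompts: dict[str, str],
    document_text: str = "",
    max_tokens: int = 4096,
    temperature: float = 0.7,
) -> dict[str, Any]:
    """Build an OpenAI-compatible chat payload (no Ollama-only fields)."""
    # Assumption: prompts holds "system" and "user" entries.
    user_content = prompts.get("user", "")
    if document_text:
        # Assumption: the document is appended to the user prompt.
        user_content = f"{user_content}\n\n{document_text}"
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": prompts.get("system", "")},
            {"role": "user", "content": user_content},
        ],
        "max_tokens": max_tokens,
        "temperature": temperature,
        "response_format": {"type": "json_object"},
    }


def extract_content(body: dict[str, Any]) -> tuple[Optional[str], Optional[str]]:
    """Return (content, error) from a /v1/chat/completions response body."""
    choices = body.get("choices") or []
    if not choices:
        return None, "empty_model_response"
    content = (choices[0].get("message") or {}).get("content")
    if not content:
        return None, "empty_model_response"
    return content, None


if __name__ == "__main__":
    payload = build_vllm_payload("demo-model", {"system": "s", "user": "u"})
    print(sorted(payload))
    print(extract_content({"choices": [{"message": {"content": "{}"}}]}))
```

Keeping payload construction separate from the HTTP call makes the "no `think`, `stream`, or `options`" rule straightforward to assert in tests without a running server.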
### 3. VLLMConfig (`services/shared/config.py`)

New dataclass alongside `OllamaConfig`:

```python
@dataclass
class VLLMConfig:
    base_url: str = "http://192.168.42.254:8000"
    model: str = "RedHatAI/Qwen3.6-35B-A3B-NVFP4"
    timeout: int = 120
    max_retries: int = 2
    retry_base_delay: float = 1.0
    retry_max_delay: float = 10.0
    retry_backoff_multiplier: float = 2.0
    max_tokens: int = 32768
    temperature: float = 0.7
    api_key: str = ""  # Optional, for authenticated vLLM deployments
```

Loaded from `VLLM_*` environment variables in `load_config()`. Added to `AppConfig` as `vllm: VLLMConfig`.

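One possible way to load the `VLLM_*` variables is sketched below. The helper `load_vllm_config` is hypothetical (the real logic lives inside `load_config()` and may look different); it assumes each variable name is `VLLM_` plus the upper-cased field name, which matches the env var table in this document.

```python
import os
from dataclasses import dataclass, fields
from typing import Mapping, Optional


@dataclass
class VLLMConfig:
    base_url: str = "http://192.168.42.254:8000"
    model: str = "RedHatAI/Qwen3.6-35B-A3B-NVFP4"
    timeout: int = 120
    max_retries: int = 2
    retry_base_delay: float = 1.0
    retry_max_delay: float = 10.0
    retry_backoff_multiplier: float = 2.0
    max_tokens: int = 32768
    temperature: float = 0.7
    api_key: str = ""


def load_vllm_config(env: Optional[Mapping[str, str]] = None) -> VLLMConfig:
    """Hypothetical loader: override defaults from VLLM_<FIELD> variables."""
    env = os.environ if env is None else env
    overrides = {}
    for field in fields(VLLMConfig):
        raw = env.get(f"VLLM_{field.name.upper()}")
        if raw is None:
            continue
        # Coerce the string using the type of the field's default value.
        overrides[field.name] = type(field.default)(raw)
    return VLLMConfig(**overrides)


if __name__ == "__main__":
    print(load_vllm_config({"VLLM_TIMEOUT": "30", "VLLM_TEMPERATURE": "0.2"}))
```

Passing the environment as a mapping keeps the loader testable without mutating `os.environ`.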
### 4. LLM Client Factory (`services/extractor/llm_factory.py`)

Factory function that replaces the hardcoded `OllamaClient` construction:

```python
def build_llm_client(
    resolved: ResolvedAgentConfig | None,
    ollama_config: OllamaConfig,
    vllm_config: VLLMConfig,
    http_client: httpx.AsyncClient | None = None,
) -> LLMClient:
    """Return the appropriate LLM client based on resolved provider."""
    ...

def build_config_from_resolved(
    resolved: ResolvedAgentConfig,
    base_ollama: OllamaConfig,
    base_vllm: VLLMConfig,
) -> OllamaConfig | VLLMConfig:
    """Build provider-specific config from resolved agent config."""
    ...
```

Provider routing logic:
1. If `resolved` is `None` or `resolved.model_provider` is `"ollama"` or empty → `OllamaClient`
2. If `resolved.model_provider` is `"vllm"` → `VLLMClient`
3. Unknown provider → log warning, fall back to `OllamaClient`

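The three routing rules above can be sketched as follows. The classes here are empty stand-ins and `select_provider` / `build_client_sketch` are hypothetical names; the real factory also threads the configs and HTTP client through, which this sketch leaves out.

```python
import logging
from dataclasses import dataclass
from typing import Optional

logger = logging.getLogger(__name__)


@dataclass
class ResolvedAgentConfig:
    """Stand-in with only the field the routing needs."""
    model_provider: Optional[str] = None


class OllamaClient:
    """Stand-in for the real Ollama client."""


class VLLMClient:
    """Stand-in for the real vLLM client."""


_CLIENTS = {"ollama": OllamaClient, "vllm": VLLMClient}


def select_provider(resolved: Optional[ResolvedAgentConfig]) -> str:
    """Apply the routing rules: None/empty -> ollama, unknown -> ollama."""
    provider = (resolved.model_provider if resolved else None) or "ollama"
    if provider in _CLIENTS:
        return provider
    logger.warning("Unknown model_provider %r; falling back to ollama", provider)
    return "ollama"


def build_client_sketch(resolved: Optional[ResolvedAgentConfig]):
    """Instantiate the stand-in client for the selected provider."""
    return _CLIENTS[select_provider(resolved)]()


if __name__ == "__main__":
    for value in (None, "", "ollama", "vllm", "something-else"):
        client = build_client_sketch(ResolvedAgentConfig(value))
        print(repr(value), "->", type(client).__name__)
```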
### 5. Updated Extractor Worker (`services/extractor/main.py`)

Changes to `main()`:
- Replace `_build_ollama_config_from_resolved()` with `build_llm_client()` from the factory
- Store clients as `LLMClient` type instead of `OllamaClient`
- On config refresh (every 100 jobs), detect provider changes and swap clients
- Log provider switches at INFO level

Changes to `_process_macro_classification()`:
- Accept `LLMClient` instead of `OllamaClient` for the classifier parameter

### 6. Updated OllamaClient (`services/extractor/client.py`)

Minimal changes to satisfy the protocol:
- Add public `call_llm()` method that delegates to `_call_ollama()`
- Keep `_call_ollama()` as-is for backward compatibility
- The `extract()` method continues to call `_call_ollama()` internally

### 7. Updated Event Classifier (`services/extractor/event_classifier.py`)

Changes to `classify_global_event()`:
- Accept `LLMClient` instead of `Any` for the `ollama_client` parameter
- Call `client.call_llm()` instead of `ollama_client._call_ollama()`
- Set `ModelMetadata.provider` based on the actual client type (inspect `_config` or pass provider string)

### 8. Helm Values (`infra/helm/stonks-oracle/values.yaml`)

New config entries:
```yaml
config:
  VLLM_BASE_URL: "http://192.168.42.254:8000"
  VLLM_MODEL: "RedHatAI/Qwen3.6-35B-A3B-NVFP4"
  VLLM_TIMEOUT: "120"
  VLLM_MAX_RETRIES: "2"
  VLLM_TEMPERATURE: "0.7"
  VLLM_API_KEY: ""
```

### 9. Health Check (`services/extractor/vllm_client.py`)

Startup validation function:

```python
async def check_vllm_health(base_url: str, timeout: float = 10.0) -> bool:
    """GET {base_url}/v1/models to verify vLLM is reachable."""
    ...
```

Called from `main()` when the resolved or default config specifies vLLM. On failure, logs WARNING and falls back to Ollama. On success, logs INFO with server URL and model list.

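The control flow of the health check can be sketched without committing to a particular HTTP library. In the sketch below the actual `GET /v1/models` call is injected as `fetch_models`, a hypothetical callable that is not part of the original signature; the real function presumably issues the request itself via `httpx`. The two fake fetchers exist only to exercise both branches.

```python
import asyncio
import logging
from typing import Awaitable, Callable

logger = logging.getLogger(__name__)

# Hypothetical hook: given (url, timeout), return the model IDs served.
FetchModels = Callable[[str, float], Awaitable[list[str]]]


async def check_vllm_health_sketch(
    base_url: str,
    fetch_models: FetchModels,
    timeout: float = 10.0,
) -> bool:
    """Return True if {base_url}/v1/models answers within the timeout."""
    url = f"{base_url.rstrip('/')}/v1/models"
    try:
        models = await asyncio.wait_for(fetch_models(url, timeout), timeout)
    except Exception as exc:  # broad on purpose: any failure means "unhealthy"
        logger.warning("vLLM health check failed for %s: %s", url, exc)
        return False
    logger.info("vLLM reachable at %s; models: %s", base_url, models)
    return True


async def fake_fetch_ok(url: str, timeout: float) -> list[str]:
    """Test double simulating a healthy server."""
    return ["demo-model"]


async def fake_fetch_refused(url: str, timeout: float) -> list[str]:
    """Test double simulating an unreachable server."""
    raise ConnectionError("connection refused")


if __name__ == "__main__":
    print(asyncio.run(check_vllm_health_sketch("http://host:8000", fake_fetch_ok)))
    print(asyncio.run(check_vllm_health_sketch("http://host:8000", fake_fetch_refused)))
```

The caller would then choose the Ollama client whenever this returns `False`, matching the fallback behaviour described above.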
## Data Models

### VLLMConfig Dataclass

| Field | Type | Default | Env Var |
|-------|------|---------|---------|
| `base_url` | `str` | `http://192.168.42.254:8000` | `VLLM_BASE_URL` |
| `model` | `str` | `RedHatAI/Qwen3.6-35B-A3B-NVFP4` | `VLLM_MODEL` |
| `timeout` | `int` | `120` | `VLLM_TIMEOUT` |
| `max_retries` | `int` | `2` | `VLLM_MAX_RETRIES` |
| `retry_base_delay` | `float` | `1.0` | `VLLM_RETRY_BASE_DELAY` |
| `retry_max_delay` | `float` | `10.0` | `VLLM_RETRY_MAX_DELAY` |
| `retry_backoff_multiplier` | `float` | `2.0` | `VLLM_RETRY_BACKOFF_MULTIPLIER` |
| `max_tokens` | `int` | `32768` | `VLLM_MAX_TOKENS` |
| `temperature` | `float` | `0.7` | `VLLM_TEMPERATURE` |
| `api_key` | `str` | `""` | `VLLM_API_KEY` |

### ExtractionAttempt (unchanged)

The existing `ExtractionAttempt` dataclass is reused as-is for both providers. No changes needed.

### ModelMetadata (unchanged structure, new values)

The `provider` field now accepts `"vllm"` in addition to `"ollama"`. No schema change needed.

## Error Handling

### Error String Format Parity

Both clients produce identical error string formats so `_is_retryable()` works unchanged:

| Condition | Error String | Retryable |
|-----------|-------------|-----------|
| HTTP timeout | `timeout` | Yes |
| HTTP 400/401/403/404/422 | `http_{code}` | No |
| HTTP 500/502/503/429 | `http_{code}` | Yes |
| Connection refused/reset | `connection_error: {details}` | Yes |
| Empty response body | `empty_model_response` | Yes |
| Invalid JSON in response | `invalid_response_json` | Yes |

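Because classification is driven purely by the error string, the table above can be expressed as a small function. This is a sketch of the idea, not the existing `_is_retryable()`, whose body is not shown in this document; in particular, treating HTTP codes that the table does not list (for example 504) as non-retryable is an assumption made here.

```python
# Status codes the table marks as retryable.
RETRYABLE_HTTP_CODES = {429, 500, 502, 503}

# Non-HTTP error strings the table marks as retryable.
RETRYABLE_ERRORS = {"timeout", "empty_model_response", "invalid_response_json"}


def is_retryable_sketch(error: str) -> bool:
    """Classify an error string using the table's rules."""
    if error.startswith("http_"):
        try:
            code = int(error[len("http_"):])
        except ValueError:
            return False
        # Assumption: codes absent from the table are not retried.
        return code in RETRYABLE_HTTP_CODES
    if error.startswith("connection_error"):
        return True
    return error in RETRYABLE_ERRORS


if __name__ == "__main__":
    for err in ("timeout", "http_400", "http_503", "connection_error: reset"):
        print(err, "->", is_retryable_sketch(err))
```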
### Health Check Failure

If the vLLM health check fails at startup:
1. Log WARNING with the error details
2. Fall back to `OllamaClient` using `OllamaConfig`
3. Continue operation; the system degrades gracefully rather than crashing

### Provider Switch During Refresh

When the config refresh (every 100 jobs) detects a provider change:
1. Close the old client (`await old_client.close()`)
2. Construct the new client via the factory
3. Log the switch at INFO level
4. If new client construction fails, keep the old client and log ERROR

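A provider swap along these lines might look like the sketch below. One deliberate deviation, flagged as an assumption: the sketch constructs the new client before closing the old one, so that the "keep the old client" fallback in step 4 still has a usable client to keep. `swap_client`, `StubClient`, and `failing_factory` are hypothetical names used only for illustration.

```python
import asyncio
import logging
from typing import Any, Callable

logger = logging.getLogger(__name__)


class StubClient:
    """Stand-in client that records whether close() was awaited."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.closed = False

    async def close(self) -> None:
        self.closed = True


def failing_factory() -> StubClient:
    """Factory that simulates a construction failure."""
    raise RuntimeError("cannot construct client")


async def swap_client(
    old_client: Any,
    build_new: Callable[[], Any],
    old_provider: str,
    new_provider: str,
) -> Any:
    """Return the client to use after a config refresh."""
    if new_provider == old_provider:
        return old_client
    try:
        # Assumption: build first, so a failure leaves the old client open.
        new_client = build_new()
    except Exception:
        logger.exception("Provider switch to %s failed; keeping %s",
                         new_provider, old_provider)
        return old_client
    await old_client.close()
    logger.info("LLM provider switched: %s -> %s", old_provider, new_provider)
    return new_client


if __name__ == "__main__":
    current = StubClient("ollama")
    current = asyncio.run(
        swap_client(current, lambda: StubClient("vllm"), "ollama", "vllm")
    )
    print(current.name)
```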
## Testing Strategy

### Property-Based Tests (`tests/test_pbt_llm_provider.py`)

Property-based tests using Hypothesis to verify the provider abstraction:

**P1: Provider factory routing property** (Req 3.4, 3.5, 9.5)
For all `model_provider` values in `{"ollama", "vllm", "", None}`, the factory returns the correct client type. For `"ollama"`, empty, or `None`, returns `OllamaClient`. For `"vllm"`, returns `VLLMClient`.

**P2: Error string format consistency property** (Req 5.6)
For all HTTP status codes (100-599), both `OllamaClient` and `VLLMClient` produce error strings in the same format (`http_{code}`), and `_is_retryable()` returns the same result for both.

**P3: VLLMClient request payload structure property** (Req 2.1, 8.1)
For all generated prompt dicts (system + user messages of arbitrary text), the VLLMClient produces a request payload that: contains `model`, `messages`, `max_tokens`, `temperature`; does NOT contain `think`, `stream`, `options`, `num_ctx`, `num_predict`.

**P4: JSON repair idempotence property** (Req 2.4)
For all valid JSON strings, `_repair_json(json_str)` returns a string that `json.loads()` can parse, and `_repair_json(_repair_json(json_str)) == _repair_json(json_str)` (idempotence).

**P5: Markdown fence stripping round-trip property** (Req 2.3)
For all strings `s`, `_strip_markdown_fences(f"```json\n{s}\n```")` returns `s` (stripped), and `_strip_markdown_fences(s)` returns `s` when no fences are present (identity).

**P6: VLLMConfig default construction property** (Req 3.1)
For all VLLMConfig instances constructed with default values, `base_url` is non-empty, `timeout > 0`, `max_retries >= 0`, `temperature` is between 0.0 and 2.0, and `max_tokens > 0`.

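To make the P5 round-trip property concrete, here is a stdlib-only sketch of a fence stripper plus the checks the property implies. The function `strip_markdown_fences_sketch` is hypothetical; the project's `_strip_markdown_fences()` is not shown in this document and may behave differently on edge cases. The real tests use Hypothesis, whereas this example checks a few fixed inputs.

```python
import re

# Build the fence marker programmatically to keep this example easy to embed.
FENCE = "`" * 3

_FENCE_RE = re.compile(
    r"^" + FENCE + r"[A-Za-z0-9_-]*[ \t]*\n(.*?)\n?" + FENCE + r"[ \t]*$",
    re.DOTALL,
)


def strip_markdown_fences_sketch(text: str) -> str:
    """Remove one surrounding fenced block, if present; otherwise just trim."""
    stripped = text.strip()
    match = _FENCE_RE.match(stripped)
    return match.group(1).strip() if match else stripped


def wrap_in_fence(body: str, lang: str = "json") -> str:
    """Wrap a string the way a model might wrap its JSON output."""
    return f"{FENCE}{lang}\n{body}\n{FENCE}"


if __name__ == "__main__":
    sample = '{"ticker": "ABC", "score": 1}'
    print(strip_markdown_fences_sketch(wrap_in_fence(sample)) == sample)
    print(strip_markdown_fences_sketch(sample) == sample)
```

As written, the property only holds for inputs that do not themselves contain fence markers or leading and trailing whitespace, so a Hypothesis strategy would need to filter or account for those cases.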
### Unit Tests (`tests/test_vllm_client.py`)

Example-based tests for specific behaviors:

- VLLMClient sends correct payload to `/v1/chat/completions` (mock httpx)
- VLLMClient extracts content from `choices[0].message.content`
- VLLMClient handles empty choices array → `empty_model_response`
- VLLMClient handles timeout → `timeout` error
- VLLMClient handles HTTP 500 → `http_500` error, retryable
- VLLMClient handles HTTP 400 → `http_400` error, non-retryable
- VLLMClient handles connection refused → `connection_error: ...`
- VLLMClient applies markdown fence stripping
- VLLMClient applies JSON repair
- VLLMClient includes temperature in payload
- VLLMClient includes `response_format` in payload
- Health check success logs INFO
- Health check failure logs WARNING and returns False
- Factory returns OllamaClient for provider="ollama"
- Factory returns VLLMClient for provider="vllm"
- Factory returns OllamaClient for provider="" (default)
- Factory returns OllamaClient for unknown provider with warning
- VLLMConfig loads from environment variables
- AppConfig includes vllm field with defaults
- OllamaClient.call_llm() delegates to _call_ollama()

### Existing Tests (unchanged)

- `tests/test_ollama_client.py`: continues to pass without modification
- All other existing test files: unaffected

## File Changes Summary

| File | Change Type | Description |
|------|-------------|-------------|
| `services/shared/llm_protocol.py` | **New** | `LLMClient` Protocol definition |
| `services/extractor/vllm_client.py` | **New** | `VLLMClient` implementation + health check |
| `services/extractor/llm_factory.py` | **New** | Factory function for provider routing |
| `services/shared/config.py` | **Modified** | Add `VLLMConfig`, update `AppConfig`, update `load_config()` |
| `services/extractor/client.py` | **Modified** | Add `call_llm()` public method to `OllamaClient` |
| `services/extractor/event_classifier.py` | **Modified** | Use `call_llm()` instead of `_call_ollama()`, accept `LLMClient` type |
| `services/extractor/main.py` | **Modified** | Use factory, support provider switching, health check |
| `infra/helm/stonks-oracle/values.yaml` | **Modified** | Add `VLLM_*` config entries |
| `tests/test_pbt_llm_provider.py` | **New** | Property-based tests for provider abstraction |
| `tests/test_vllm_client.py` | **New** | Unit tests for VLLMClient and factory |
@@ -0,0 +1,136 @@
# Requirements Document

## Introduction

Add remote vLLM support to the Stonks Oracle platform. The system currently uses Ollama exclusively for LLM inference via the `/api/chat` endpoint. A remote vLLM server running `RedHatAI/Qwen3.6-35B-A3B-NVFP4` on a 5090 GPU with tensor parallelism is available at `http://192.168.42.254:8000` and exposes an OpenAI-compatible `/v1/chat/completions` API. This feature introduces a provider abstraction layer so that both Ollama and vLLM backends can be used interchangeably, selected per-agent via the existing `model_provider` database column and environment variable configuration. The abstraction preserves all existing behavior (retry logic, JSON repair, audit trail, backoff, context window override) while adapting to the differences between the two API protocols.

## Glossary

- **LLM_Client**: An abstract interface defining the contract for sending chat completion requests to any LLM backend. Concrete implementations exist for Ollama and vLLM.
- **Ollama_Backend**: The existing Ollama inference server at `ollama.ollama-service.svc.cluster.local:11434` (cluster) or `http://10.1.1.12:2701` (external), using the `/api/chat` endpoint with Ollama-specific payload fields (`think`, `options.num_ctx`, `options.num_predict`).
- **VLLM_Backend**: A remote vLLM inference server at `http://192.168.42.254:8000` exposing the OpenAI-compatible `/v1/chat/completions` endpoint. Runs `RedHatAI/Qwen3.6-35B-A3B-NVFP4` on a 5090 GPU with tensor parallelism.
- **Provider**: A string identifier (`ollama` or `vllm`) that determines which LLM_Client implementation is used for a given agent. Stored in the `model_provider` column of `ai_agents` and `agent_variants` tables.
- **LLM_Config**: A provider-agnostic configuration dataclass containing connection and inference parameters (base_url, model, timeout, retries, max_tokens, context_window) used to construct an LLM_Client.
- **Extraction_Pipeline**: The document intelligence extraction workflow in `services/extractor/client.py` that sends documents to an LLM and parses structured JSON responses.
- **Event_Classification_Pipeline**: The macro event classification workflow in `services/extractor/event_classifier.py` that classifies global news articles via an LLM.
- **Agent_Config_Resolver**: The `AgentConfigResolver` in `services/shared/agent_config.py` that resolves runtime configuration from the `ai_agents` and `agent_variants` database tables, including the `model_provider` field.
- **OpenAI_Chat_Format**: The request/response format used by `/v1/chat/completions`: a messages array with role/content, `max_tokens`, `temperature`, and response in `choices[0].message.content`.
- **JSON_Repair**: The existing `json-repair` library usage that fixes malformed JSON from model output, applied regardless of provider.
- **Model_Metadata**: The `ModelMetadata` Pydantic model in `services/shared/schemas.py` that tracks `provider`, `model_name`, `prompt_version`, and `schema_version` for audit.

## Requirements

### Requirement 1: Provider Abstraction Layer

**User Story:** As a developer, I want a provider abstraction layer that decouples LLM inference from any specific backend, so that the extraction and classification pipelines can use either Ollama or vLLM without code changes in the calling services.

#### Acceptance Criteria

1. THE LLM_Client interface SHALL define an async method that accepts a messages list (system and user prompts), a JSON schema hint, and optional document text, and returns an attempt result containing raw output, validation report, error string, duration, and model name.
2. THE LLM_Client interface SHALL define an async `close` method for releasing underlying HTTP resources.
3. WHEN the Extraction_Pipeline calls the LLM, THE Extraction_Pipeline SHALL use the LLM_Client interface instead of calling Ollama-specific endpoints directly.
4. WHEN the Event_Classification_Pipeline calls the LLM, THE Event_Classification_Pipeline SHALL use the LLM_Client interface instead of calling `_call_ollama()` directly.
5. THE Ollama_Backend implementation of LLM_Client SHALL preserve the existing `/api/chat` payload structure including `think: false`, `stream: false`, `options.num_predict`, and `options.num_ctx`.
6. THE VLLM_Backend implementation of LLM_Client SHALL send requests to `/v1/chat/completions` using the OpenAI_Chat_Format with `model`, `messages`, `max_tokens`, and `temperature` fields.
7. FOR ALL valid prompt inputs, sending a prompt through the Ollama_Backend and parsing the response SHALL produce the same ExtractionAttempt structure as the current `_call_ollama()` method (round-trip equivalence with existing behavior).

### Requirement 2: vLLM Client Implementation

**User Story:** As a developer, I want a vLLM client that communicates with the remote vLLM server using the OpenAI-compatible API, so that the platform can leverage the 5090 GPU for inference.

#### Acceptance Criteria

1. THE VLLM_Backend SHALL send POST requests to `{base_url}/v1/chat/completions` with a JSON payload containing `model`, `messages` (array of role/content objects), `max_tokens`, and `temperature`.
2. THE VLLM_Backend SHALL extract the response content from `choices[0].message.content` in the OpenAI-compatible response format.
3. THE VLLM_Backend SHALL apply the same markdown fence stripping logic as the Ollama_Backend to handle model output wrapped in ```json ... ``` blocks.
4. THE VLLM_Backend SHALL apply the same JSON_Repair logic as the Ollama_Backend to fix malformed JSON in model output.
5. WHEN the vLLM server returns an HTTP timeout, THE VLLM_Backend SHALL report the error as `timeout` in the attempt result, consistent with the Ollama_Backend error format.
6. WHEN the vLLM server returns an HTTP error status, THE VLLM_Backend SHALL report the error as `http_{status_code}` in the attempt result, consistent with the Ollama_Backend error format.
7. WHEN the vLLM server returns an empty `choices` array or missing `content`, THE VLLM_Backend SHALL report the error as `empty_model_response`.
8. IF the vLLM server is unreachable, THEN THE VLLM_Backend SHALL report the error as `connection_error: {details}`, consistent with the Ollama_Backend error format.
9. THE VLLM_Backend SHALL use the same `httpx.AsyncClient` timeout configuration as the Ollama_Backend, derived from the LLM_Config timeout value.
10. THE VLLM_Backend SHALL support an optional `temperature` parameter from the resolved agent config, defaulting to 0.7 when not specified.

### Requirement 3: Provider-Aware Configuration

**User Story:** As an operator, I want to configure the vLLM backend via environment variables and database agent config, so that I can switch providers without code changes.

#### Acceptance Criteria

1. THE Configuration SHALL include a `VLLMConfig` dataclass with fields: `base_url` (default `http://192.168.42.254:8000`), `model` (default `RedHatAI/Qwen3.6-35B-A3B-NVFP4`), `timeout` (default 120), `max_retries` (default 2), `retry_base_delay`, `retry_max_delay`, `retry_backoff_multiplier`, `max_tokens` (default 32768), and `temperature` (default 0.7).
2. THE Configuration SHALL load VLLMConfig values from environment variables prefixed with `VLLM_` (e.g., `VLLM_BASE_URL`, `VLLM_MODEL`, `VLLM_TIMEOUT`), following the same pattern as OllamaConfig.
3. THE AppConfig dataclass SHALL include a `vllm` field of type VLLMConfig alongside the existing `ollama` field.
4. WHEN the Agent_Config_Resolver resolves a `model_provider` value of `vllm`, THE service SHALL use the VLLMConfig base_url and construct a VLLM_Backend client instead of an Ollama_Backend client.
5. WHEN the Agent_Config_Resolver resolves a `model_provider` value of `ollama` or when no `model_provider` is specified, THE service SHALL continue to use the OllamaConfig and Ollama_Backend client as the default.
6. THE `_build_ollama_config_from_resolved` function in `services/extractor/main.py` SHALL be generalized to a provider-aware factory that returns the appropriate config and client type based on the resolved `model_provider`.

### Requirement 4: Provider Selection in Extractor Worker

**User Story:** As a developer, I want the extractor worker to select the correct LLM client based on the resolved agent config provider, so that each agent can independently use Ollama or vLLM.

#### Acceptance Criteria

1. WHEN the extractor worker starts, THE worker SHALL construct the default LLM_Client based on the environment variable configuration (defaulting to Ollama_Backend).
2. WHEN the Agent_Config_Resolver returns a resolved config with `model_provider = "vllm"` for the `document-extractor` slug, THE worker SHALL construct a VLLM_Backend client using the VLLMConfig base_url and the resolved model_name.
3. WHEN the Agent_Config_Resolver returns a resolved config with `model_provider = "vllm"` for the `event-classifier` slug, THE worker SHALL construct a VLLM_Backend client for the event classification pipeline.
4. WHEN the resolved config changes provider during a config refresh cycle (every 100 jobs), THE worker SHALL close the old LLM_Client and construct a new one matching the updated provider.
5. WHEN the resolved config changes from `ollama` to `vllm` or vice versa, THE worker SHALL log the provider switch at INFO level including the old and new provider, model name, and variant ID.

### Requirement 5: Retry and Error Handling Parity

**User Story:** As a developer, I want the vLLM client to use the same retry logic, backoff strategy, and error classification as the Ollama client, so that reliability behavior is consistent across providers.

#### Acceptance Criteria

1. THE VLLM_Backend SHALL use the same exponential backoff computation as the Ollama_Backend, using `retry_base_delay`, `retry_max_delay`, and `retry_backoff_multiplier` from the LLM_Config.
2. THE VLLM_Backend SHALL classify HTTP 400, 401, 403, 404, and 422 errors as non-retryable, consistent with the Ollama_Backend.
3. THE VLLM_Backend SHALL classify HTTP 500, 502, 503, 429, timeout, and connection errors as retryable, consistent with the Ollama_Backend.
4. WHEN the VLLM_Backend encounters a retryable error, THE Extraction_Pipeline SHALL retry up to `max_retries` times with exponential backoff, preserving each attempt in the audit trail.
5. WHEN the VLLM_Backend encounters a non-retryable error, THE Extraction_Pipeline SHALL stop retries immediately and record the attempt as non-retryable.
6. FOR ALL error types, the VLLM_Backend error string format SHALL match the Ollama_Backend error string format so that `_is_retryable()` works without modification.

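For reference, capped exponential backoff with the three configured parameters is commonly computed as in the sketch below. Whether the existing Ollama client uses exactly this formula (or adds jitter) is not stated in this document, so treat the function `backoff_delay` and its zero-based `attempt` index as assumptions.

```python
def backoff_delay(
    attempt: int,
    retry_base_delay: float = 1.0,
    retry_max_delay: float = 10.0,
    retry_backoff_multiplier: float = 2.0,
) -> float:
    """Delay in seconds before retry number `attempt` (zero-based), capped."""
    raw = retry_base_delay * (retry_backoff_multiplier ** attempt)
    return min(raw, retry_max_delay)


if __name__ == "__main__":
    # With the VLLMConfig defaults the delay doubles until it hits the cap.
    print([backoff_delay(i) for i in range(5)])
```

With the default `max_retries` of 2, only the first two delays would be used in practice.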
### Requirement 6: Audit Trail and Model Metadata

**User Story:** As a developer, I want the audit trail and model metadata to correctly reflect which provider and model were used for each extraction, so that I can trace results back to the specific backend.

#### Acceptance Criteria

1. WHEN the VLLM_Backend completes an extraction attempt, THE attempt record SHALL include the vLLM model name in the `model` field.
2. WHEN an extraction or classification succeeds via the VLLM_Backend, THE Model_Metadata in the result SHALL have `provider` set to `"vllm"` and `model_name` set to the vLLM model identifier.
3. WHEN the `agent_performance_log` records an invocation that used the VLLM_Backend, THE log entry SHALL be attributed to the correct agent_id and variant_id, consistent with Ollama_Backend logging.
4. THE MinIO prompt and result artifacts persisted by the Event_Classification_Pipeline SHALL include the provider name and model name in the stored JSON, regardless of which backend was used.

### Requirement 7: Health Check and Connectivity Validation

**User Story:** As an operator, I want the system to validate connectivity to the vLLM server at startup, so that misconfiguration is detected early rather than failing silently on the first inference request.

#### Acceptance Criteria

1. WHEN the extractor worker starts and the resolved or default config specifies `model_provider = "vllm"`, THE worker SHALL send a GET request to `{vllm_base_url}/v1/models` to verify the vLLM server is reachable.
2. IF the vLLM health check fails at startup, THEN THE worker SHALL log a WARNING and fall back to the Ollama_Backend, continuing operation with degraded capability.
3. IF the vLLM health check succeeds, THEN THE worker SHALL log an INFO message confirming the vLLM connection including the server URL and available model name.
4. THE health check SHALL use a timeout of 10 seconds to avoid blocking worker startup on an unresponsive server.

### Requirement 8: Context Window and Token Handling for vLLM

**User Story:** As a developer, I want the vLLM client to handle context window and token limits appropriately for the vLLM API, so that large documents are processed correctly on the remote GPU.

#### Acceptance Criteria

1. WHEN the resolved agent config specifies a non-zero `context_window`, THE VLLM_Backend SHALL omit the `num_ctx` Ollama-specific option and instead rely on the vLLM server's model configuration for context window sizing.
2. THE VLLM_Backend SHALL pass `max_tokens` in the OpenAI-compatible request payload to control the maximum number of output tokens generated.
3. WHEN the resolved agent config specifies a non-zero `input_token_limit`, THE Extraction_Pipeline SHALL truncate the input text before sending it to the VLLM_Backend, using the same truncation logic as for the Ollama_Backend.
4. WHEN the resolved agent config specifies a non-zero `token_budget`, THE worker SHALL enforce the same hourly token budget check for vLLM invocations as for Ollama invocations.

### Requirement 9: Backward Compatibility
|
||||
|
||||
**User Story:** As a developer, I want the vLLM integration to be fully backward compatible, so that existing Ollama-based deployments continue to work without any configuration changes.
|
||||
|
||||
#### Acceptance Criteria
|
||||
|
||||
1. WHEN no `VLLM_BASE_URL` environment variable is set and no agent config specifies `model_provider = "vllm"`, THE system SHALL behave identically to the current Ollama-only implementation.
|
||||
2. THE existing `OllamaConfig` dataclass and its environment variable loading SHALL remain unchanged.
|
||||
3. THE existing `OllamaClient` class SHALL continue to function for Ollama-specific usage, with the LLM_Client interface added as a compatible layer on top.
|
||||
4. THE existing test suite in `tests/test_ollama_client.py` SHALL continue to pass without modification.
|
||||
5. WHEN the `model_provider` column in `ai_agents` or `agent_variants` contains `"ollama"` or NULL, THE system SHALL use the Ollama_Backend, preserving current behavior.
|
||||
6. Any database migration for this feature SHALL NOT modify or remove existing columns or tables; it SHALL only add new columns or tables, and only if needed.
|
||||
@@ -0,0 +1,82 @@
|
||||
# Tasks
|
||||
|
||||
## Task 1: LLM Client Protocol and VLLMConfig
|
||||
|
||||
- [x] 1.1 Create `services/shared/llm_protocol.py` with `LLMClient` Protocol defining `call_llm(prompts, json_schema, document_text) -> ExtractionAttempt` and `close()` methods
|
||||
- [x] 1.2 Add `VLLMConfig` dataclass to `services/shared/config.py` with fields: `base_url`, `model`, `timeout`, `max_retries`, `retry_base_delay`, `retry_max_delay`, `retry_backoff_multiplier`, `max_tokens`, `temperature`, `api_key`
|
||||
- [x] 1.3 Add `vllm: VLLMConfig` field to `AppConfig` dataclass
|
||||
- [x] 1.4 Add `VLLM_*` environment variable loading to `load_config()` function
|
||||
- [x] 1.5 Add public `call_llm()` method to `OllamaClient` in `services/extractor/client.py` that delegates to `_call_ollama()`
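
As an illustration of tasks 1.1 and 1.5, the protocol and the Ollama delegation could look roughly like the sketch below. Several things here are assumptions: the methods are shown as `async` (inferred from the use of `httpx.AsyncClient` elsewhere in this plan), the parameter types are guesses, and `ExtractionAttempt` and `OllamaClient` are trimmed-down stand-ins for the real classes.

```python
import asyncio
from dataclasses import dataclass
from typing import Any, Optional, Protocol, runtime_checkable


@dataclass
class ExtractionAttempt:
    """Stand-in for the project's real result type (fields are assumed)."""

    raw_text: Optional[str] = None
    error: Optional[str] = None


@runtime_checkable
class LLMClient(Protocol):
    """Provider-agnostic interface; any class with these methods satisfies it."""

    async def call_llm(
        self,
        prompts: dict[str, str],
        json_schema: dict[str, Any],
        document_text: str,
    ) -> ExtractionAttempt: ...

    async def close(self) -> None: ...


class OllamaClient:
    """Stand-in showing only the delegation described in task 1.5."""

    async def _call_ollama(self, prompts, json_schema, document_text) -> ExtractionAttempt:
        # Placeholder for the real HTTP call to Ollama.
        return ExtractionAttempt(raw_text="{}")

    async def call_llm(self, prompts, json_schema, document_text) -> ExtractionAttempt:
        # Public entry point: forwards unchanged to the existing private method.
        return await self._call_ollama(prompts, json_schema, document_text)

    async def close(self) -> None:
        return None
```

Because `Protocol` relies on structural typing, `OllamaClient` does not need to inherit from `LLMClient`, so the existing class gains the interface without changing its base classes.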
|
||||
|
||||
## Task 2: VLLMClient Implementation
|
||||
|
||||
- [x] 2.1 Create `services/extractor/vllm_client.py` with `VLLMClient` class that satisfies the `LLMClient` protocol
|
||||
- [x] 2.2 Implement `call_llm()` method that sends POST to `/v1/chat/completions` with OpenAI-compatible payload (`model`, `messages`, `max_tokens`, `temperature`, `response_format`)
|
||||
- [x] 2.3 Implement response parsing: extract content from `choices[0].message.content`, apply `_strip_markdown_fences()` and `_repair_json()`
|
||||
- [x] 2.4 Implement error handling: map timeout → `timeout`, HTTP errors → `http_{code}`, connection errors → `connection_error: {details}`, empty response → `empty_model_response`
|
||||
- [x] 2.5 Implement `close()` method to release the underlying `httpx.AsyncClient`
|
||||
- [x] 2.6 Implement `check_vllm_health(base_url, timeout=10.0)` async function that GETs `/v1/models` and returns bool
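
The response parsing and error mapping in tasks 2.3 and 2.4 can be sketched with pure functions. The names below are hypothetical, the fence-stripping regex is a guess at what `_strip_markdown_fences()` does, and the built-in `TimeoutError` and `ConnectionError` stand in for the corresponding `httpx` exceptions so that the example needs no third-party packages.

```python
import re
from typing import Any, Optional

# Matches a single fenced block, with an optional language tag after the opening fence.
_FENCE_RE = re.compile(r"^\s*```[A-Za-z0-9_-]*\s*\n(.*?)\n?\s*```\s*$", re.DOTALL)


def strip_markdown_fences(text: str) -> str:
    """Return the body of a fenced block, or the trimmed text if unfenced."""
    match = _FENCE_RE.match(text)
    return match.group(1).strip() if match else text.strip()


def parse_chat_completion(body: dict[str, Any]) -> tuple[Optional[str], Optional[str]]:
    """Return (content, error) from an OpenAI-style chat completion body."""
    choices = body.get("choices") or []
    if not choices:
        return None, "empty_model_response"
    content = (choices[0].get("message") or {}).get("content")
    if not isinstance(content, str) or not content.strip():
        return None, "empty_model_response"
    return strip_markdown_fences(content), None


def http_error_string(status_code: int) -> str:
    """Map an HTTP status code to the `http_{code}` error string."""
    return f"http_{status_code}"


def transport_error_string(exc: Exception) -> str:
    """Map a transport failure to the error strings listed in task 2.4."""
    if isinstance(exc, TimeoutError):
        return "timeout"
    if isinstance(exc, ConnectionError):
        return f"connection_error: {exc}"
    # Anything else falls outside the mapping described above.
    raise exc
```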
|
||||
|
||||
## Task 3: LLM Client Factory
|
||||
|
||||
- [x] 3.1 Create `services/extractor/llm_factory.py` with `build_llm_client()` function that returns `OllamaClient` or `VLLMClient` based on resolved `model_provider`
|
||||
- [x] 3.2 Implement `build_config_from_resolved()` function that creates provider-specific config from `ResolvedAgentConfig` and base configs
|
||||
- [x] 3.3 Handle unknown provider values: log warning and fall back to `OllamaClient`
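
The routing rule from Task 3 is small enough to show in full. In this sketch the two client classes are empty stand-ins and the real `build_llm_client()` presumably takes config objects rather than a bare string; normalizing case and whitespace is also an assumption.

```python
import logging
from typing import Optional, Union

logger = logging.getLogger("llm_factory_sketch")


class OllamaClient:
    """Empty stand-in for the real Ollama client."""


class VLLMClient:
    """Empty stand-in for the real vLLM client."""


def build_llm_client(model_provider: Optional[str]) -> Union[OllamaClient, VLLMClient]:
    """Return a client for the given provider, defaulting to Ollama."""
    provider = (model_provider or "").strip().lower()
    if provider == "vllm":
        return VLLMClient()
    if provider not in ("", "ollama"):
        # Unknown values are logged and then treated like the default.
        logger.warning("Unknown model_provider %r; falling back to Ollama", model_provider)
    return OllamaClient()
```

Treating `None` and the empty string as Ollama is what keeps existing rows in `ai_agents` and `agent_variants` working unchanged.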
|
||||
|
||||
## Task 4: Update Extractor Worker for Provider Abstraction
|
||||
|
||||
- [x] 4.1 Update `services/extractor/main.py` to import and use `build_llm_client()` from the factory instead of directly constructing `OllamaClient`
|
||||
- [x] 4.2 Replace `_build_ollama_config_from_resolved()` usage with the factory's `build_config_from_resolved()` for both extractor and classifier clients
|
||||
- [x] 4.3 Add vLLM health check call at startup when resolved config specifies `model_provider = "vllm"`, with fallback to Ollama on failure
|
||||
- [x] 4.4 Update config refresh logic (every 100 jobs) to detect provider changes, close old client, and construct new client via factory
|
||||
- [x] 4.5 Add INFO-level logging for provider switches including old/new provider, model name, and variant ID
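
Tasks 4.3 to 4.5 describe two decisions: which provider to start with, and when to swap clients during a config refresh. A rough sketch follows, with hypothetical function names and a `StubClient` in place of the real clients.

```python
import asyncio
import logging
from typing import Callable

logger = logging.getLogger("extractor_sketch")


class StubClient:
    """Stand-in client that only records its provider and closed state."""

    def __init__(self, provider: str) -> None:
        self.provider = provider
        self.closed = False

    async def close(self) -> None:
        self.closed = True


def choose_startup_provider(resolved_provider: str, vllm_healthy: bool) -> str:
    """Fall back to Ollama when vLLM is configured but its health check failed."""
    if resolved_provider == "vllm" and not vllm_healthy:
        logger.warning("vLLM health check failed; falling back to Ollama")
        return "ollama"
    return resolved_provider


async def refresh_client(
    current: StubClient,
    new_provider: str,
    model: str,
    variant_id: str,
    build: Callable[[str], StubClient],
) -> StubClient:
    """Swap clients only when the provider changed; close the old one first."""
    if new_provider == current.provider:
        return current
    logger.info(
        "LLM provider switch: %s -> %s (model=%s, variant_id=%s)",
        current.provider, new_provider, model, variant_id,
    )
    await current.close()
    return build(new_provider)
```

Closing the old client before building the new one releases its HTTP connections, at the cost of a brief window in which no client exists.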
|
||||
|
||||
## Task 5: Update Event Classifier for Provider Abstraction
|
||||
|
||||
- [x] 5.1 Update `classify_global_event()` in `services/extractor/event_classifier.py` to accept `LLMClient` protocol type instead of `Any` for the client parameter
|
||||
- [x] 5.2 Replace `ollama_client._call_ollama()` calls with `client.call_llm()` calls
|
||||
- [x] 5.3 Update `ModelMetadata.provider` assignment to use the actual provider string from the client (detect from config type or pass explicitly)
|
||||
- [x] 5.4 Update retry logic to use client config attributes instead of accessing `ollama_client._base_delay` and `ollama_client._backoff_multiplier` directly
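
For task 5.4, reading retry settings from a config object rather than from private client attributes might look like this. The field names follow the `VLLMConfig` fields listed in task 1.2; the default values and the capped exponential formula are assumptions, since the document does not spell out the exact schedule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetryConfig:
    """Retry-related subset of a provider config (default values are placeholders)."""

    max_retries: int = 3
    retry_base_delay: float = 1.0
    retry_max_delay: float = 30.0
    retry_backoff_multiplier: float = 2.0


def backoff_delay(cfg: RetryConfig, attempt: int) -> float:
    """Delay in seconds before retry number `attempt` (0-based), capped at the maximum."""
    delay = cfg.retry_base_delay * cfg.retry_backoff_multiplier ** attempt
    return min(delay, cfg.retry_max_delay)
```

With the placeholder defaults this yields delays of 1, 2, 4, 8 and 16 seconds, after which every further retry waits the 30-second cap.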
|
||||
|
||||
## Task 6: Helm Configuration
|
||||
|
||||
- [x] 6.1 Add `VLLM_BASE_URL`, `VLLM_MODEL`, `VLLM_TIMEOUT`, `VLLM_MAX_RETRIES`, `VLLM_TEMPERATURE`, and `VLLM_API_KEY` entries to the `config:` section in `infra/helm/stonks-oracle/values.yaml`
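
On the Python side, the six variables listed above would be read by `load_config()`. The following sketch covers only those six; the real `VLLMConfig` has more fields (see task 1.2), and every default value here is a placeholder, as is the function name `load_vllm_config`.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class VLLMConfig:
    """Subset of the vLLM settings; default values are placeholders only."""

    base_url: str = "http://localhost:8000"
    model: str = ""
    timeout: float = 120.0
    max_retries: int = 3
    temperature: float = 0.0
    api_key: str = ""


def load_vllm_config(env: Optional[Mapping[str, str]] = None) -> VLLMConfig:
    """Build a VLLMConfig from VLLM_* variables, using defaults where unset."""
    env = os.environ if env is None else env
    defaults = VLLMConfig()
    return VLLMConfig(
        # Drop a trailing slash so paths like /v1/models can be appended safely.
        base_url=env.get("VLLM_BASE_URL", defaults.base_url).rstrip("/"),
        model=env.get("VLLM_MODEL", defaults.model),
        timeout=float(env.get("VLLM_TIMEOUT", defaults.timeout)),
        max_retries=int(env.get("VLLM_MAX_RETRIES", defaults.max_retries)),
        temperature=float(env.get("VLLM_TEMPERATURE", defaults.temperature)),
        api_key=env.get("VLLM_API_KEY", defaults.api_key),
    )
```

Accepting an explicit mapping makes the loader easy to test without mutating the process environment.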
|
||||
|
||||
## Task 7: Unit Tests for VLLMClient
|
||||
|
||||
- [x] 7.1 Create `tests/test_vllm_client.py` with test for VLLMClient sending correct payload to `/v1/chat/completions` using mock httpx transport
|
||||
- [x] 7.2 Add test for VLLMClient extracting content from `choices[0].message.content`
|
||||
- [x] 7.3 Add test for VLLMClient handling empty choices array returning `empty_model_response` error
|
||||
- [x] 7.4 Add test for VLLMClient handling HTTP timeout returning `timeout` error
|
||||
- [x] 7.5 Add test for VLLMClient handling HTTP 500 returning `http_500` retryable error
|
||||
- [x] 7.6 Add test for VLLMClient handling HTTP 400 returning `http_400` non-retryable error
|
||||
- [x] 7.7 Add test for VLLMClient handling connection error returning `connection_error: ...`
|
||||
- [x] 7.8 Add test for VLLMClient applying markdown fence stripping and JSON repair to response
|
||||
- [x] 7.9 Add test for VLLMClient including temperature and response_format in payload
|
||||
- [x] 7.10 Add test for health check success returning True and logging INFO
|
||||
- [x] 7.11 Add test for health check failure returning False and logging WARNING
|
||||
- [x] 7.12 Add test for OllamaClient.call_llm() delegating to _call_ollama()
|
||||
- [x] 7.13 Add test for VLLMConfig loading from environment variables
|
||||
- [x] 7.14 Add test for AppConfig including vllm field with correct defaults
|
||||
|
||||
## Task 8: Unit Tests for LLM Factory
|
||||
|
||||
- [x] 8.1 Add tests to `tests/test_vllm_client.py` for factory returning OllamaClient when provider is "ollama"
|
||||
- [x] 8.2 Add test for factory returning VLLMClient when provider is "vllm"
|
||||
- [x] 8.3 Add test for factory returning OllamaClient when provider is empty string (default)
|
||||
- [x] 8.4 Add test for factory returning OllamaClient with warning when provider is unknown value
|
||||
|
||||
## Task 9: Property-Based Tests
|
||||
|
||||
- [x] 9.1 Create `tests/test_pbt_llm_provider.py` with property test for factory routing: for all model_provider in {"ollama", "vllm", "", None}, factory returns correct client type [PBT]
|
||||
- [x] 9.2 Add property test for error string format consistency: for all HTTP status codes (100-599), `_is_retryable()` classifies them consistently [PBT]
|
||||
- [x] 9.3 Add property test for VLLMClient request payload structure: for all generated prompt dicts, payload contains required OpenAI fields and excludes Ollama-specific fields [PBT]
|
||||
- [x] 9.4 Add property test for JSON repair idempotence: for all valid JSON strings, `_repair_json()` is idempotent [PBT]
|
||||
- [x] 9.5 Add property test for markdown fence stripping: for all strings, wrapping in fences then stripping recovers the original [PBT]
|
||||
- [x] 9.6 Add property test for VLLMConfig defaults: for all default-constructed instances, invariants hold (timeout > 0, max_retries >= 0, 0 <= temperature <= 2, max_tokens > 0) [PBT]
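
Property 9.2 ranges over a small finite domain, so it can also be checked exhaustively with the standard library alone. The sketch below does that instead of using a property-testing framework. The retry policy is an assumption: tasks 7.5 and 7.6 only state that `http_500` is retryable and `http_400` is not, and the generalization to all 5xx codes, as well as the treatment of `timeout` and `connection_error`, is a guess at what `_is_retryable()` does.

```python
def http_error_string(status_code: int) -> str:
    """Map an HTTP status code to the `http_{code}` error string."""
    return f"http_{status_code}"


def is_retryable(error: str) -> bool:
    """Assumed policy: 5xx, timeouts and connection errors are retryable."""
    if error.startswith("http_"):
        code = error[len("http_"):]
        return code.isdigit() and int(code) >= 500
    return error == "timeout" or error.startswith("connection_error")


def check_retryable_consistency() -> None:
    """Check every status code from 100 to 599 for a stable, expected result."""
    for code in range(100, 600):
        error = http_error_string(code)
        first = is_retryable(error)
        # Same input must always give the same classification.
        assert first == is_retryable(error)
        # Classification must match the assumed 5xx-only policy.
        assert first == (code >= 500)
```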
|
||||
|
||||
## Task 10: Verification and Backward Compatibility
|
||||
|
||||
- [x] 10.1 Run existing `tests/test_ollama_client.py` to verify no regressions
|
||||
- [x] 10.2 Run `ruff check services/` to verify no lint errors in modified files
|
||||
- [x] 10.3 Run full test suite `python -m pytest tests/ -x --tb=short -q` to verify all tests pass
|
||||