Celes Renata 117b693b19 feat: add remote vLLM support with provider abstraction layer

- LLMClient Protocol for provider-agnostic inference
- VLLMClient for OpenAI-compatible /v1/chat/completions API
- LLM client factory with provider routing (ollama/vllm)
- VLLMConfig with VLLM_* environment variable loading
- Updated extractor worker with health check and provider switching
- Updated event classifier to use LLMClient protocol
- Helm values for vLLM configuration
- 18 unit tests + 6 property-based tests
- Full backward compatibility preserved

2026-04-23 08:17:23 +00:00

Design Document: Remote vLLM Support

Overview

This design introduces an LLM provider abstraction layer into Stonks Oracle so that both the existing Ollama backend and a new remote vLLM backend can be used interchangeably for document extraction and event classification. The vLLM server at http://192.168.42.254:8000 runs RedHatAI/Qwen3.6-35B-A3B-NVFP4 on an NVIDIA RTX 5090 with tensor parallelism and exposes an OpenAI-compatible /v1/chat/completions API.

The design preserves full backward compatibility: existing Ollama deployments work without any configuration changes. Provider selection is driven by the existing model_provider column in the ai_agents and agent_variants database tables, so no new migrations are required.

Architecture

graph TD
    subgraph "Extractor Worker"
        MAIN[main.py]
        FACTORY[LLMClientFactory]
        EXTRACT[Extraction Pipeline]
        CLASSIFY[Event Classification Pipeline]
    end

    subgraph "Provider Abstraction"
        PROTO[LLMClient Protocol]
        OLLAMA_IMPL[OllamaClient]
        VLLM_IMPL[VLLMClient]
    end

    subgraph "Configuration"
        RESOLVER[AgentConfigResolver]
        OLLAMA_CFG[OllamaConfig]
        VLLM_CFG[VLLMConfig]
        APP_CFG[AppConfig]
    end

    subgraph "External Services"
        OLLAMA_SRV[Ollama Server<br/>:11434/api/chat]
        VLLM_SRV[vLLM Server<br/>:8000/v1/chat/completions]
    end

    MAIN --> FACTORY
    FACTORY --> PROTO
    PROTO --> OLLAMA_IMPL
    PROTO --> VLLM_IMPL
    EXTRACT --> PROTO
    CLASSIFY --> PROTO

    RESOLVER --> FACTORY
    OLLAMA_CFG --> FACTORY
    VLLM_CFG --> FACTORY
    APP_CFG --> OLLAMA_CFG
    APP_CFG --> VLLM_CFG

    OLLAMA_IMPL --> OLLAMA_SRV
    VLLM_IMPL --> VLLM_SRV

The key architectural decision is to use a Python Protocol (structural typing) rather than an ABC for the LLM client interface. This allows the existing OllamaClient to satisfy the protocol without inheritance changes, maintaining backward compatibility. The VLLMClient is a new class that also satisfies the protocol.

A factory function in services/extractor/llm_factory.py takes a ResolvedAgentConfig and the base configs, returning the appropriate client. The extractor worker (main.py) uses this factory instead of directly constructing OllamaClient.

Components and Interfaces

1. LLM Client Protocol (`services/shared/llm_protocol.py`)

A typing.Protocol defining the contract both clients must satisfy:

from typing import Protocol, runtime_checkable

@runtime_checkable
class LLMClient(Protocol):
    async def call_llm(
        self,
        prompts: dict[str, str],
        json_schema: dict[str, object],
        document_text: str = "",
    ) -> "ExtractionAttempt": ...

    async def close(self) -> None: ...

The call_llm method signature matches the existing OllamaClient._call_ollama() parameters and return type. The OllamaClient gains a public call_llm method that delegates to _call_ollama(), preserving the private method for internal backward compatibility.
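The structural-typing argument can be illustrated with a minimal sketch. `ExtractionAttempt` here is a hypothetical two-field stand-in for the real dataclass, and `FakeOllamaClient` and `NotAClient` are illustrative names that are not part of the codebase. Note that `isinstance()` against a `runtime_checkable` protocol only checks that the methods exist, not their signatures.

```python
import asyncio
from dataclasses import dataclass
from typing import Optional, Protocol, runtime_checkable


@dataclass
class ExtractionAttempt:
    """Hypothetical stand-in; the real dataclass has its own fields."""
    raw_text: str = ""
    error: Optional[str] = None


@runtime_checkable
class LLMClient(Protocol):
    async def call_llm(
        self,
        prompts: dict[str, str],
        json_schema: dict[str, object],
        document_text: str = "",
    ) -> ExtractionAttempt: ...

    async def close(self) -> None: ...


class FakeOllamaClient:
    """Does not inherit from LLMClient; it only provides matching methods."""

    async def _call_ollama(self, prompts, json_schema, document_text=""):
        return ExtractionAttempt(raw_text='{"ok": true}')

    async def call_llm(self, prompts, json_schema, document_text=""):
        # The public protocol method delegates to the existing private one.
        return await self._call_ollama(prompts, json_schema, document_text)

    async def close(self) -> None:
        return None


class NotAClient:
    """Lacks call_llm, so it does not satisfy the protocol."""

    async def close(self) -> None:
        return None


print(isinstance(FakeOllamaClient(), LLMClient))  # True
print(isinstance(NotAClient(), LLMClient))        # False
```

Because the check is structural, the existing client needs only the added `call_llm()` method, with no change to its base classes.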

2. VLLMClient (`services/extractor/vllm_client.py`)

New client implementing the LLMClient protocol for the OpenAI-compatible API:

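from dataclasses import dataclass

import httpx

# VLLMConfig is defined in services/shared/config.py (see section 3).

@dataclass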
class VLLMClient:
    _config: VLLMConfig
    _http: httpx.AsyncClient
    _owns_client: bool

    async def call_llm(
        self,
        prompts: dict[str, str],
        json_schema: dict[str, object],
        document_text: str = "",
    ) -> ExtractionAttempt: ...

    async def close(self) -> None: ...

Request format (OpenAI-compatible):

{
    "model": "RedHatAI/Qwen3.6-35B-A3B-NVFP4",
    "messages": [
        {"role": "system", "content": "..."},
        {"role": "user", "content": "..."}
    ],
    "max_tokens": 4096,
    "temperature": 0.7,
    "response_format": {"type": "json_object"}
}

Response parsing: Extracts choices[0].message.content, then applies the same _strip_markdown_fences() and _repair_json() pipeline as OllamaClient.

Error handling: Maps transport and HTTP failures to the same error strings as OllamaClient (timeout, http_{code}, connection_error: {details}, empty_model_response, invalid_response_json), so the existing _is_retryable() function works without modification.

Key differences from OllamaClient:

- Endpoint: /v1/chat/completions instead of /api/chat
- No think: false, stream: false, or options block
- Uses max_tokens instead of options.num_predict
- Uses response_format: {"type": "json_object"} for structured output
- Supports temperature parameter (Ollama uses model defaults)
- Response in choices[0].message.content instead of message.content
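The request and response shapes described above can be sketched as two pure functions. This is an illustration under assumptions, not the actual `VLLMClient` code: the helper names `build_vllm_payload` and `extract_content` are hypothetical, and so is the assumption that the prompts dict uses the keys `"system"` and `"user"`.

```python
from typing import Any, Optional


def build_vllm_payload(
    prompts: dict[str, str],
    model: str,
    max_tokens: int,
    temperature: float,
) -> dict[str, Any]:
    """Build an OpenAI-compatible chat completion request body."""
    return {
        "model": model,
        "messages": [
            # Assumed prompt keys; the real dict layout may differ.
            {"role": "system", "content": prompts.get("system", "")},
            {"role": "user", "content": prompts.get("user", "")},
        ],
        "max_tokens": max_tokens,
        "temperature": temperature,
        "response_format": {"type": "json_object"},
    }


def extract_content(response_body: dict[str, Any]) -> Optional[str]:
    """Return choices[0].message.content, or None if missing or empty."""
    choices = response_body.get("choices") or []
    if not choices:
        return None
    message = choices[0].get("message") or {}
    return message.get("content") or None


payload = build_vllm_payload(
    {"system": "Extract events.", "user": "Some filing text"},
    model="RedHatAI/Qwen3.6-35B-A3B-NVFP4",
    max_tokens=4096,
    temperature=0.7,
)
```

A `None` result from `extract_content` corresponds to the `empty_model_response` error string; a non-empty result would then go through the fence-stripping and JSON-repair steps.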

3. VLLMConfig (`services/shared/config.py`)

New dataclass alongside OllamaConfig:

from dataclasses import dataclass

@dataclass
class VLLMConfig:
    base_url: str = "http://192.168.42.254:8000"
    model: str = "RedHatAI/Qwen3.6-35B-A3B-NVFP4"
    timeout: int = 120
    max_retries: int = 2
    retry_base_delay: float = 1.0
    retry_max_delay: float = 10.0
    retry_backoff_multiplier: float = 2.0
    max_tokens: int = 32768
    temperature: float = 0.7
    api_key: str = ""  # Optional, for authenticated vLLM deployments

Loaded from VLLM_* environment variables in load_config(). Added to AppConfig as vllm: VLLMConfig.
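One way the `VLLM_*` loading could work is sketched below. The function name `load_vllm_config` and the field-driven coercion are assumptions made for illustration; the real `load_config()` may read each variable explicitly. The sketch relies on every field having a default, so the default's type can be used to parse the raw string.

```python
import os
from dataclasses import dataclass, fields
from typing import Mapping, Optional


@dataclass
class VLLMConfig:
    base_url: str = "http://192.168.42.254:8000"
    model: str = "RedHatAI/Qwen3.6-35B-A3B-NVFP4"
    timeout: int = 120
    max_retries: int = 2
    retry_base_delay: float = 1.0
    retry_max_delay: float = 10.0
    retry_backoff_multiplier: float = 2.0
    max_tokens: int = 32768
    temperature: float = 0.7
    api_key: str = ""


def load_vllm_config(env: Optional[Mapping[str, str]] = None) -> VLLMConfig:
    """Build a VLLMConfig, overriding defaults with VLLM_<FIELD> variables."""
    env = os.environ if env is None else env
    overrides = {}
    for field in fields(VLLMConfig):
        raw = env.get(f"VLLM_{field.name.upper()}")
        if raw is not None:
            # Coerce the string using the type of the field's default.
            overrides[field.name] = type(field.default)(raw)
    return VLLMConfig(**overrides)


cfg = load_vllm_config({"VLLM_TIMEOUT": "30", "VLLM_TEMPERATURE": "0.2"})
```

Passing the environment as a mapping keeps the loader easy to test without mutating `os.environ`. A malformed value such as `VLLM_TIMEOUT=abc` raises `ValueError` in this sketch rather than falling back silently.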

4. LLM Client Factory (`services/extractor/llm_factory.py`)

Factory function that replaces the hardcoded OllamaClient construction:

def build_llm_client(
    resolved: ResolvedAgentConfig | None,
    ollama_config: OllamaConfig,
    vllm_config: VLLMConfig,
    http_client: httpx.AsyncClient | None = None,
) -> LLMClient:
    """Return the appropriate LLM client based on resolved provider."""
    ...

def build_config_from_resolved(
    resolved: ResolvedAgentConfig,
    base_ollama: OllamaConfig,
    base_vllm: VLLMConfig,
) -> OllamaConfig | VLLMConfig:
    """Build provider-specific config from resolved agent config."""
    ...

Provider routing logic:

- If resolved is None, or resolved.model_provider is "ollama" or empty → OllamaClient
- If resolved.model_provider is "vllm" → VLLMClient
- Unknown provider → log warning, fall back to OllamaClient
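The routing rules above can be sketched as a small selection function. The classes below are empty hypothetical stand-ins, `select_client_class` is an illustrative name, and the sketch returns a class rather than a constructed client so that it stays free of config and HTTP details. Whether the real factory normalizes case or whitespace is not stated, so this sketch compares the value as-is.

```python
import logging
from dataclasses import dataclass
from typing import Optional

logger = logging.getLogger("llm_factory")


@dataclass
class ResolvedAgentConfig:
    """Hypothetical stand-in carrying only the field used for routing."""
    model_provider: Optional[str] = None


class OllamaClient:
    """Placeholder for the existing Ollama client."""


class VLLMClient:
    """Placeholder for the new vLLM client."""


def select_client_class(resolved: Optional[ResolvedAgentConfig]) -> type:
    """Pick the client class for a resolved agent config."""
    provider = (resolved.model_provider or "") if resolved else ""
    if provider == "vllm":
        return VLLMClient
    if provider not in ("", "ollama"):
        # Unknown providers degrade to the default backend.
        logger.warning("Unknown model_provider %r; using Ollama", provider)
    return OllamaClient
```

Treating `None`, the empty string, and unknown values the same way is what keeps existing Ollama deployments working with no configuration changes.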

5. Updated Extractor Worker (`services/extractor/main.py`)

Changes to main():

- Replace _build_ollama_config_from_resolved() with build_llm_client() from the factory
- Store clients as LLMClient type instead of OllamaClient
- On config refresh (every 100 jobs), detect provider changes and swap clients
- Log provider switches at INFO level

Changes to _process_macro_classification():

- Accept LLMClient instead of OllamaClient for the classifier parameter

6. Updated OllamaClient (`services/extractor/client.py`)

Minimal changes to satisfy the protocol:

- Add public call_llm() method that delegates to _call_ollama()
- Keep _call_ollama() as-is for backward compatibility
- The extract() method continues to call _call_ollama() internally

7. Updated Event Classifier (`services/extractor/event_classifier.py`)

Changes to classify_global_event():

- Accept LLMClient instead of Any for the ollama_client parameter
- Call client.call_llm() instead of ollama_client._call_ollama()
- Set ModelMetadata.provider based on the actual client type (inspect _config or pass provider string)

8. Helm Values (`infra/helm/stonks-oracle/values.yaml`)

New config entries:

config:
  VLLM_BASE_URL: "http://192.168.42.254:8000"
  VLLM_MODEL: "RedHatAI/Qwen3.6-35B-A3B-NVFP4"
  VLLM_TIMEOUT: "120"
  VLLM_MAX_RETRIES: "2"
  VLLM_TEMPERATURE: "0.7"
  VLLM_API_KEY: ""

9. Health Check (`services/extractor/vllm_client.py`)

Startup validation function:

async def check_vllm_health(base_url: str, timeout: float = 10.0) -> bool:
    """GET {base_url}/v1/models to verify vLLM is reachable."""
    ...

Called from main() when the resolved or default config specifies vLLM. On failure, logs WARNING and falls back to Ollama. On success, logs INFO with server URL and model list.
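The health check and its fallback can be sketched as follows. To keep the example runnable without a network or third-party packages, the HTTP call is injected as a hypothetical `fetch` coroutine, so the signature differs from the `check_vllm_health(base_url, timeout)` shown above. `choose_provider`, `fake_ok`, and `fake_down` are illustrative names as well. The `data` list with `id` entries follows the OpenAI-style `/v1/models` response.

```python
import asyncio
import logging
from typing import Awaitable, Callable, Tuple

logger = logging.getLogger("vllm_health")

# Hypothetical fetcher: takes (url, timeout), returns (status, JSON body).
Fetch = Callable[[str, float], Awaitable[Tuple[int, dict]]]


async def check_vllm_health(base_url: str, fetch: Fetch, timeout: float = 10.0) -> bool:
    """GET {base_url}/v1/models through the injected fetcher."""
    url = f"{base_url.rstrip('/')}/v1/models"
    try:
        status, body = await fetch(url, timeout)
    except Exception as exc:  # timeout, connection refused, bad JSON
        logger.warning("vLLM health check failed for %s: %s", url, exc)
        return False
    if status != 200:
        logger.warning("vLLM health check for %s returned HTTP %s", url, status)
        return False
    models = [m.get("id") for m in body.get("data", [])]
    logger.info("vLLM reachable at %s; models: %s", base_url, models)
    return True


async def choose_provider(requested: str, base_url: str, fetch: Fetch) -> str:
    """Fall back to Ollama when vLLM is requested but unreachable."""
    if requested == "vllm" and not await check_vllm_health(base_url, fetch):
        return "ollama"
    return requested


async def fake_ok(url: str, timeout: float) -> Tuple[int, dict]:
    return 200, {"data": [{"id": "RedHatAI/Qwen3.6-35B-A3B-NVFP4"}]}


async def fake_down(url: str, timeout: float) -> Tuple[int, dict]:
    raise ConnectionError("connection refused")


print(asyncio.run(choose_provider("vllm", "http://192.168.42.254:8000", fake_down)))
```

In the real worker the fetcher would presumably be a thin wrapper around the shared `httpx.AsyncClient`.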

Data Models

VLLMConfig Dataclass

| Field | Type | Default | Env Var |
| --- | --- | --- | --- |
| `base_url` | `str` | `http://192.168.42.254:8000` | `VLLM_BASE_URL` |
| `model` | `str` | `RedHatAI/Qwen3.6-35B-A3B-NVFP4` | `VLLM_MODEL` |
| `timeout` | `int` | `120` | `VLLM_TIMEOUT` |
| `max_retries` | `int` | `2` | `VLLM_MAX_RETRIES` |
| `retry_base_delay` | `float` | `1.0` | `VLLM_RETRY_BASE_DELAY` |
| `retry_max_delay` | `float` | `10.0` | `VLLM_RETRY_MAX_DELAY` |
| `retry_backoff_multiplier` | `float` | `2.0` | `VLLM_RETRY_BACKOFF_MULTIPLIER` |
| `max_tokens` | `int` | `32768` | `VLLM_MAX_TOKENS` |
| `temperature` | `float` | `0.7` | `VLLM_TEMPERATURE` |
| `api_key` | `str` | `""` | `VLLM_API_KEY` |
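The three retry fields suggest a capped exponential backoff. The document does not state the exact formula, so the sketch below is an assumption: the delay before retry n (counting from 0) is `retry_base_delay * retry_backoff_multiplier ** n`, capped at `retry_max_delay`, with no jitter. `retry_delays` is a hypothetical helper name.

```python
def retry_delays(
    max_retries: int = 2,
    base_delay: float = 1.0,
    max_delay: float = 10.0,
    multiplier: float = 2.0,
) -> list[float]:
    """Return the sleep (seconds) before each retry, capped at max_delay."""
    return [
        min(base_delay * multiplier ** attempt, max_delay)
        for attempt in range(max_retries)
    ]


print(retry_delays())               # [1.0, 2.0]
print(retry_delays(max_retries=5))  # [1.0, 2.0, 4.0, 8.0, 10.0]
```

With the defaults in the table, a failing request would wait 1 s and then 2 s before giving up, well below the 120 s per-request timeout.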

ExtractionAttempt (unchanged)

The existing ExtractionAttempt dataclass is reused as-is for both providers. No changes needed.

ModelMetadata (unchanged structure, new values)

The provider field now accepts "vllm" in addition to "ollama". No schema change needed.

Error Handling

Error String Format Parity

Both clients produce identical error string formats so _is_retryable() works unchanged:

| Condition | Error String | Retryable |
| --- | --- | --- |
| HTTP timeout | `timeout` | Yes |
| HTTP 400/401/403/404/422 | `http_{code}` | No |
| HTTP 500/502/503/429 | `http_{code}` | Yes |
| Connection refused/reset | `connection_error: {details}` | Yes |
| Empty response body | `empty_model_response` | Yes |
| Invalid JSON in response | `invalid_response_json` | Yes |
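The table can be read as a classifier over error strings. The sketch below is not the existing `_is_retryable()`; it is a hypothetical reimplementation that encodes only the rows listed, so status codes the table does not mention (for example 504) are treated as non-retryable here, which may differ from the real function.

```python
# Status codes the table marks as retryable.
RETRYABLE_STATUS = {429, 500, 502, 503}


def http_error(status_code: int) -> str:
    """Format an HTTP failure the way both clients are described to."""
    return f"http_{status_code}"


def is_retryable(error: str) -> bool:
    """Classify an error string according to the table above."""
    if error in ("timeout", "empty_model_response", "invalid_response_json"):
        return True
    if error.startswith("connection_error"):
        return True
    if error.startswith("http_"):
        code = error[len("http_"):]
        return code.isdigit() and int(code) in RETRYABLE_STATUS
    return False


print(is_retryable(http_error(503)))  # True
print(is_retryable(http_error(400)))  # False
```

Because the classification depends only on the string, any client that emits these strings gets the same retry behavior without provider-specific branches.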

Health Check Failure

If the vLLM health check fails at startup:

- Log WARNING with the error details
- Fall back to OllamaClient using OllamaConfig
- Continue operation: the system degrades gracefully rather than crashing

Provider Switch During Refresh

When the config refresh (every 100 jobs) detects a provider change:

- Close the old client (await old_client.close())
- Construct the new client via the factory
- Log the switch at INFO level
- If new client construction fails, keep the old client and log ERROR
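A minimal sketch of the swap follows, with hypothetical names throughout (`StubClient`, `failing_build`, `refresh_client`). One deliberate choice to flag: the sketch builds the new client before closing the old one, so that the old client is still open if construction fails. The steps above list closing first, so the real ordering may differ.

```python
import asyncio
import logging

logger = logging.getLogger("extractor")


class StubClient:
    """Hypothetical client that only records whether close() was awaited."""

    def __init__(self, provider: str) -> None:
        self.provider = provider
        self.closed = False

    async def close(self) -> None:
        self.closed = True


def failing_build(provider: str) -> StubClient:
    """Hypothetical factory that always fails, to exercise the error path."""
    raise RuntimeError(f"cannot build client for {provider}")


async def refresh_client(current, current_provider, new_provider, build):
    """Return the (client, provider) pair to use after a config refresh."""
    if new_provider == current_provider:
        return current, current_provider
    try:
        new_client = build(new_provider)
    except Exception:
        logger.exception(
            "Could not build %s client; keeping %s", new_provider, current_provider
        )
        return current, current_provider
    # Only close the old client once its replacement exists.
    await current.close()
    logger.info("LLM provider switched: %s -> %s", current_provider, new_provider)
    return new_client, new_provider


old = StubClient("ollama")
client, provider = asyncio.run(refresh_client(old, "ollama", "vllm", StubClient))
```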

Testing Strategy

Property-Based Tests (`tests/test_pbt_llm_provider.py`)

Property-based tests using Hypothesis to verify the provider abstraction:

P1: Provider factory routing property (Req 3.4, 3.5, 9.5). For all model_provider values in {"ollama", "vllm", "", None}, the factory returns the correct client type. For "ollama", empty, or None, returns OllamaClient. For "vllm", returns VLLMClient.

P2: Error string format consistency property (Req 5.6). For all HTTP status codes (100-599), both OllamaClient and VLLMClient produce error strings in the same format (http_{code}), and _is_retryable() returns the same result for both.

P3: VLLMClient request payload structure property (Req 2.1, 8.1). For all generated prompt dicts (system + user messages of arbitrary text), the VLLMClient produces a request payload that contains model, messages, max_tokens, and temperature, and does NOT contain think, stream, options, num_ctx, or num_predict.

P4: JSON repair idempotence property (Req 2.4). For all valid JSON strings, _repair_json(json_str) returns a string that json.loads() can parse, and _repair_json(_repair_json(json_str)) == _repair_json(json_str) (idempotence).

P5: Markdown fence stripping round-trip property (Req 2.3). For all strings s, _strip_markdown_fences(f"```json\n{s}\n```") returns s (stripped), and _strip_markdown_fences(s) returns s when no fences are present (identity).

P6: VLLMConfig default construction property (Req 3.1). For all VLLMConfig instances constructed with default values, base_url is non-empty, timeout > 0, max_retries >= 0, temperature is between 0.0 and 2.0, and max_tokens > 0.
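As an illustration of P5, the sketch below checks the round-trip property against a hypothetical `strip_markdown_fences`; the real `_strip_markdown_fences()` is not shown in this document and may behave differently. To stay dependency-free it loops over fixed samples rather than using Hypothesis strategies, and it reads "returns s" as "returns s with surrounding whitespace removed".

```python
import re

# One fenced block spanning the whole string, with an optional language tag.
_FENCE_RE = re.compile(r"^```[a-zA-Z]*[ \t]*\n(.*)\n```$", re.DOTALL)


def strip_markdown_fences(text: str) -> str:
    """Remove one enclosing ``` fence, if present, and trim whitespace."""
    stripped = text.strip()
    match = _FENCE_RE.match(stripped)
    return match.group(1).strip() if match else stripped


SAMPLES = ['{"a": 1}', "", "  padded  ", "line1\nline2", "has ``` inside"]

for s in SAMPLES:
    wrapped = f"```json\n{s}\n```"
    # Round trip: wrapping then stripping recovers the trimmed input.
    assert strip_markdown_fences(wrapped) == s.strip()
    # Identity: unfenced input is returned unchanged apart from trimming.
    assert strip_markdown_fences(s) == s.strip()
```

In the actual test file, the sample list would be replaced by a Hypothesis text strategy so the same two assertions run over generated inputs.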

Unit Tests (`tests/test_vllm_client.py`)

Example-based tests for specific behaviors:

- VLLMClient sends correct payload to /v1/chat/completions (mock httpx)
- VLLMClient extracts content from choices[0].message.content
- VLLMClient handles empty choices array → empty_model_response
- VLLMClient handles timeout → timeout error
- VLLMClient handles HTTP 500 → http_500 error, retryable
- VLLMClient handles HTTP 400 → http_400 error, non-retryable
- VLLMClient handles connection refused → connection_error: ...
- VLLMClient applies markdown fence stripping
- VLLMClient applies JSON repair
- VLLMClient includes temperature in payload
- VLLMClient includes response_format in payload
- Health check success logs INFO
- Health check failure logs WARNING and returns False
- Factory returns OllamaClient for provider="ollama"
- Factory returns VLLMClient for provider="vllm"
- Factory returns OllamaClient for provider="" (default)
- Factory returns OllamaClient for unknown provider with warning
- VLLMConfig loads from environment variables
- AppConfig includes vllm field with defaults
- OllamaClient.call_llm() delegates to _call_ollama()

Existing Tests (unchanged)

- tests/test_ollama_client.py: continues to pass without modification
- All other existing test files: unaffected

File Changes Summary

| File | Change Type | Description |
| --- | --- | --- |
| `services/shared/llm_protocol.py` | New | `LLMClient` Protocol definition |
| `services/extractor/vllm_client.py` | New | `VLLMClient` implementation + health check |
| `services/extractor/llm_factory.py` | New | Factory function for provider routing |
| `services/shared/config.py` | Modified | Add `VLLMConfig`, update `AppConfig`, update `load_config()` |
| `services/extractor/client.py` | Modified | Add `call_llm()` public method to `OllamaClient` |
| `services/extractor/event_classifier.py` | Modified | Use `call_llm()` instead of `_call_ollama()`, accept `LLMClient` type |
| `services/extractor/main.py` | Modified | Use factory, support provider switching, health check |
| `infra/helm/stonks-oracle/values.yaml` | Modified | Add `VLLM_*` config entries |
| `tests/test_pbt_llm_provider.py` | New | Property-based tests for provider abstraction |
| `tests/test_vllm_client.py` | New | Unit tests for VLLMClient and factory |
