Files
stonks-oracle/.kiro/specs/remote-vllm-support/requirements.md
T
Celes Renata 117b693b19 feat: add remote vLLM support with provider abstraction layer
- LLMClient Protocol for provider-agnostic inference
- VLLMClient for OpenAI-compatible /v1/chat/completions API
- LLM client factory with provider routing (ollama/vllm)
- VLLMConfig with VLLM_* environment variable loading
- Updated extractor worker with health check and provider switching
- Updated event classifier to use LLMClient protocol
- Helm values for vLLM configuration
- 18 unit tests + 6 property-based tests
- Full backward compatibility preserved
2026-04-23 08:17:23 +00:00

14 KiB

Requirements Document

Introduction

Add remote vLLM support to the Stonks Oracle platform. The system currently uses Ollama exclusively for LLM inference via the /api/chat endpoint. A remote vLLM server running RedHatAI/Qwen3.6-35B-A3B-NVFP4 on a 5090 GPU with tensor parallelism is available at http://192.168.42.254:8000 and exposes an OpenAI-compatible /v1/chat/completions API. This feature introduces a provider abstraction layer so that both Ollama and vLLM backends can be used interchangeably, selected per-agent via the existing model_provider database column and environment variable configuration. The abstraction preserves all existing behavior (retry logic, JSON repair, audit trail, backoff, context window override) while adapting to the differences between the two API protocols.

Glossary

  • LLM_Client: An abstract interface defining the contract for sending chat completion requests to any LLM backend. Concrete implementations exist for Ollama and vLLM.
  • Ollama_Backend: The existing Ollama inference server at ollama.ollama-service.svc.cluster.local:11434 (cluster) or http://10.1.1.12:2701 (external), using the /api/chat endpoint with Ollama-specific payload fields (think, options.num_ctx, options.num_predict).
  • VLLM_Backend: A remote vLLM inference server at http://192.168.42.254:8000 exposing the OpenAI-compatible /v1/chat/completions endpoint. Runs RedHatAI/Qwen3.6-35B-A3B-NVFP4 on a 5090 GPU with tensor parallelism.
  • Provider: A string identifier (ollama or vllm) that determines which LLM_Client implementation is used for a given agent. Stored in the model_provider column of ai_agents and agent_variants tables.
  • LLM_Config: A provider-agnostic configuration dataclass containing connection and inference parameters (base_url, model, timeout, retries, max_tokens, context_window) used to construct an LLM_Client.
  • Extraction_Pipeline: The document intelligence extraction workflow in services/extractor/client.py that sends documents to an LLM and parses structured JSON responses.
  • Event_Classification_Pipeline: The macro event classification workflow in services/extractor/event_classifier.py that classifies global news articles via an LLM.
  • Agent_Config_Resolver: The AgentConfigResolver in services/shared/agent_config.py that resolves runtime configuration from the ai_agents and agent_variants database tables, including the model_provider field.
  • OpenAI_Chat_Format: The request/response format used by /v1/chat/completions — messages array with role/content, max_tokens, temperature, and response in choices[0].message.content.
  • JSON_Repair: The existing json-repair library usage that fixes malformed JSON from model output, applied regardless of provider.
  • Model_Metadata: The ModelMetadata Pydantic model in services/shared/schemas.py that tracks provider, model_name, prompt_version, and schema_version for audit.

Requirements

Requirement 1: Provider Abstraction Layer

User Story: As a developer, I want a provider abstraction layer that decouples LLM inference from any specific backend, so that the extraction and classification pipelines can use either Ollama or vLLM without code changes in the calling services.

Acceptance Criteria

  1. THE LLM_Client interface SHALL define an async method that accepts a messages list (system and user prompts), a JSON schema hint, and optional document text, and returns an attempt result containing raw output, validation report, error string, duration, and model name.
  2. THE LLM_Client interface SHALL define an async close method for releasing underlying HTTP resources.
  3. WHEN the Extraction_Pipeline calls the LLM, THE Extraction_Pipeline SHALL use the LLM_Client interface instead of calling Ollama-specific endpoints directly.
  4. WHEN the Event_Classification_Pipeline calls the LLM, THE Event_Classification_Pipeline SHALL use the LLM_Client interface instead of calling _call_ollama() directly.
  5. THE Ollama_Backend implementation of LLM_Client SHALL preserve the existing /api/chat payload structure including think: false, stream: false, options.num_predict, and options.num_ctx.
  6. THE VLLM_Backend implementation of LLM_Client SHALL send requests to /v1/chat/completions using the OpenAI_Chat_Format with model, messages, max_tokens, and temperature fields.
  7. FOR ALL valid prompt inputs, sending a prompt through the Ollama_Backend and parsing the response SHALL produce the same ExtractionAttempt structure as the current _call_ollama() method (round-trip equivalence with existing behavior).

Requirement 2: vLLM Client Implementation

User Story: As a developer, I want a vLLM client that communicates with the remote vLLM server using the OpenAI-compatible API, so that the platform can leverage the 5090 GPU for inference.

Acceptance Criteria

  1. THE VLLM_Backend SHALL send POST requests to {base_url}/v1/chat/completions with a JSON payload containing model, messages (array of role/content objects), max_tokens, and temperature.
  2. THE VLLM_Backend SHALL extract the response content from choices[0].message.content in the OpenAI-compatible response format.
  3. THE VLLM_Backend SHALL apply the same markdown fence stripping logic as the Ollama_Backend to handle model output wrapped in json ... blocks.
  4. THE VLLM_Backend SHALL apply the same JSON_Repair logic as the Ollama_Backend to fix malformed JSON in model output.
  5. WHEN the vLLM server returns an HTTP timeout, THE VLLM_Backend SHALL report the error as timeout in the attempt result, consistent with the Ollama_Backend error format.
  6. WHEN the vLLM server returns an HTTP error status, THE VLLM_Backend SHALL report the error as http_{status_code} in the attempt result, consistent with the Ollama_Backend error format.
  7. WHEN the vLLM server returns an empty choices array or missing content, THE VLLM_Backend SHALL report the error as empty_model_response.
  8. IF the vLLM server is unreachable, THEN THE VLLM_Backend SHALL report the error as connection_error: {details}, consistent with the Ollama_Backend error format.
  9. THE VLLM_Backend SHALL use the same httpx.AsyncClient timeout configuration as the Ollama_Backend, derived from the LLM_Config timeout value.
  10. THE VLLM_Backend SHALL support an optional temperature parameter from the resolved agent config, defaulting to 0.7 when not specified.

Requirement 3: Provider-Aware Configuration

User Story: As an operator, I want to configure the vLLM backend via environment variables and database agent config, so that I can switch providers without code changes.

Acceptance Criteria

  1. THE Configuration SHALL include a VLLMConfig dataclass with fields: base_url (default http://192.168.42.254:8000), model (default RedHatAI/Qwen3.6-35B-A3B-NVFP4), timeout (default 120), max_retries (default 2), retry_base_delay, retry_max_delay, retry_backoff_multiplier, max_tokens (default 32768), and temperature (default 0.7).
  2. THE Configuration SHALL load VLLMConfig values from environment variables prefixed with VLLM_ (e.g., VLLM_BASE_URL, VLLM_MODEL, VLLM_TIMEOUT), following the same pattern as OllamaConfig.
  3. THE AppConfig dataclass SHALL include a vllm field of type VLLMConfig alongside the existing ollama field.
  4. WHEN the Agent_Config_Resolver resolves a model_provider value of vllm, THE service SHALL use the VLLMConfig base_url and construct a VLLM_Backend client instead of an Ollama_Backend client.
  5. WHEN the Agent_Config_Resolver resolves a model_provider value of ollama or when no model_provider is specified, THE service SHALL continue to use the OllamaConfig and Ollama_Backend client as the default.
  6. THE _build_ollama_config_from_resolved function in services/extractor/main.py SHALL be generalized to a provider-aware factory that returns the appropriate config and client type based on the resolved model_provider.

Requirement 4: Provider Selection in Extractor Worker

User Story: As a developer, I want the extractor worker to select the correct LLM client based on the resolved agent config provider, so that each agent can independently use Ollama or vLLM.

Acceptance Criteria

  1. WHEN the extractor worker starts, THE worker SHALL construct the default LLM_Client based on the environment variable configuration (defaulting to Ollama_Backend).
  2. WHEN the Agent_Config_Resolver returns a resolved config with model_provider = "vllm" for the document-extractor slug, THE worker SHALL construct a VLLM_Backend client using the VLLMConfig base_url and the resolved model_name.
  3. WHEN the Agent_Config_Resolver returns a resolved config with model_provider = "vllm" for the event-classifier slug, THE worker SHALL construct a VLLM_Backend client for the event classification pipeline.
  4. WHEN the resolved config changes provider during a config refresh cycle (every 100 jobs), THE worker SHALL close the old LLM_Client and construct a new one matching the updated provider.
  5. WHEN the resolved config changes from ollama to vllm or vice versa, THE worker SHALL log the provider switch at INFO level including the old and new provider, model name, and variant ID.

Requirement 5: Retry and Error Handling Parity

User Story: As a developer, I want the vLLM client to use the same retry logic, backoff strategy, and error classification as the Ollama client, so that reliability behavior is consistent across providers.

Acceptance Criteria

  1. THE VLLM_Backend SHALL use the same exponential backoff computation as the Ollama_Backend, using retry_base_delay, retry_max_delay, and retry_backoff_multiplier from the LLM_Config.
  2. THE VLLM_Backend SHALL classify HTTP 400, 401, 403, 404, and 422 errors as non-retryable, consistent with the Ollama_Backend.
  3. THE VLLM_Backend SHALL classify HTTP 500, 502, 503, 429, timeout, and connection errors as retryable, consistent with the Ollama_Backend.
  4. WHEN the VLLM_Backend encounters a retryable error, THE Extraction_Pipeline SHALL retry up to max_retries times with exponential backoff, preserving each attempt in the audit trail.
  5. WHEN the VLLM_Backend encounters a non-retryable error, THE Extraction_Pipeline SHALL stop retries immediately and record the attempt as non-retryable.
  6. FOR ALL error types, the VLLM_Backend error string format SHALL match the Ollama_Backend error string format so that _is_retryable() works without modification.

Requirement 6: Audit Trail and Model Metadata

User Story: As a developer, I want the audit trail and model metadata to correctly reflect which provider and model were used for each extraction, so that I can trace results back to the specific backend.

Acceptance Criteria

  1. WHEN the VLLM_Backend completes an extraction attempt, THE attempt record SHALL include the vLLM model name in the model field.
  2. WHEN an extraction or classification succeeds via the VLLM_Backend, THE Model_Metadata in the result SHALL have provider set to "vllm" and model_name set to the vLLM model identifier.
  3. WHEN the agent_performance_log records an invocation that used the VLLM_Backend, THE log entry SHALL be attributed to the correct agent_id and variant_id, consistent with Ollama_Backend logging.
  4. THE MinIO prompt and result artifacts persisted by the Event_Classification_Pipeline SHALL include the provider name and model name in the stored JSON, regardless of which backend was used.

Requirement 7: Health Check and Connectivity Validation

User Story: As an operator, I want the system to validate connectivity to the vLLM server at startup, so that misconfiguration is detected early rather than failing silently on the first inference request.

Acceptance Criteria

  1. WHEN the extractor worker starts and the resolved or default config specifies model_provider = "vllm", THE worker SHALL send a GET request to {vllm_base_url}/v1/models to verify the vLLM server is reachable.
  2. IF the vLLM health check fails at startup, THEN THE worker SHALL log a WARNING and fall back to the Ollama_Backend, continuing operation with degraded capability.
  3. IF the vLLM health check succeeds, THEN THE worker SHALL log an INFO message confirming the vLLM connection including the server URL and available model name.
  4. THE health check SHALL use a timeout of 10 seconds to avoid blocking worker startup on an unresponsive server.

Requirement 8: Context Window and Token Handling for vLLM

User Story: As a developer, I want the vLLM client to handle context window and token limits appropriately for the vLLM API, so that large documents are processed correctly on the remote GPU.

Acceptance Criteria

  1. WHEN the resolved agent config specifies a non-zero context_window, THE VLLM_Backend SHALL omit the num_ctx Ollama-specific option and instead rely on the vLLM server's model configuration for context window sizing.
  2. THE VLLM_Backend SHALL pass max_tokens in the OpenAI-compatible request payload to control the maximum number of output tokens generated.
  3. WHEN the resolved agent config specifies a non-zero input_token_limit, THE Extraction_Pipeline SHALL truncate the input text before sending it to the VLLM_Backend, using the same truncation logic as for the Ollama_Backend.
  4. WHEN the resolved agent config specifies a non-zero token_budget, THE worker SHALL enforce the same hourly token budget check for vLLM invocations as for Ollama invocations.

Requirement 9: Backward Compatibility

User Story: As a developer, I want the vLLM integration to be fully backward compatible, so that existing Ollama-based deployments continue to work without any configuration changes.

Acceptance Criteria

  1. WHEN no VLLM_BASE_URL environment variable is set and no agent config specifies model_provider = "vllm", THE system SHALL behave identically to the current Ollama-only implementation.
  2. THE existing OllamaConfig dataclass and its environment variable loading SHALL remain unchanged.
  3. THE existing OllamaClient class SHALL continue to function for Ollama-specific usage, with the LLM_Client interface added as a compatible layer on top.
  4. THE existing test suite in tests/test_ollama_client.py SHALL continue to pass without modification.
  5. WHEN the model_provider column in ai_agents or agent_variants contains "ollama" or NULL, THE system SHALL use the Ollama_Backend, preserving current behavior.
  6. THE database migration for this feature SHALL NOT alter existing table structures; it SHALL only add new columns or tables if needed.