Files
stonks-oracle/.kiro/specs/remote-vllm-support/tasks.md
T
Celes Renata 117b693b19 feat: add remote vLLM support with provider abstraction layer
- LLMClient Protocol for provider-agnostic inference
- VLLMClient for OpenAI-compatible /v1/chat/completions API
- LLM client factory with provider routing (ollama/vllm)
- VLLMConfig with VLLM_* environment variable loading
- Updated extractor worker with health check and provider switching
- Updated event classifier to use LLMClient protocol
- Helm values for vLLM configuration
- 18 unit tests + 6 property-based tests
- Full backward compatibility preserved
2026-04-23 08:17:23 +00:00

6.5 KiB

Tasks

Task 1: LLM Client Protocol and VLLMConfig

  • 1.1 Create services/shared/llm_protocol.py with LLMClient Protocol defining call_llm(prompts, json_schema, document_text) -> ExtractionAttempt and close() methods
  • 1.2 Add VLLMConfig dataclass to services/shared/config.py with fields: base_url, model, timeout, max_retries, retry_base_delay, retry_max_delay, retry_backoff_multiplier, max_tokens, temperature, api_key
  • 1.3 Add vllm: VLLMConfig field to AppConfig dataclass
  • 1.4 Add VLLM_* environment variable loading to load_config() function
  • 1.5 Add public call_llm() method to OllamaClient in services/extractor/client.py that delegates to _call_ollama()

Task 2: VLLMClient Implementation

  • 2.1 Create services/extractor/vllm_client.py with VLLMClient class that satisfies the LLMClient protocol
  • 2.2 Implement call_llm() method that sends POST to /v1/chat/completions with OpenAI-compatible payload (model, messages, max_tokens, temperature, response_format)
  • 2.3 Implement response parsing: extract content from choices[0].message.content, apply _strip_markdown_fences() and _repair_json()
  • 2.4 Implement error handling: map timeout → timeout, HTTP errors → http_{code}, connection errors → connection_error: {details}, empty response → empty_model_response
  • 2.5 Implement close() method to release the underlying httpx.AsyncClient
  • 2.6 Implement check_vllm_health(base_url, timeout=10.0) async function that GETs /v1/models and returns bool

Task 3: LLM Client Factory

  • 3.1 Create services/extractor/llm_factory.py with build_llm_client() function that returns OllamaClient or VLLMClient based on resolved model_provider
  • 3.2 Implement build_config_from_resolved() function that creates provider-specific config from ResolvedAgentConfig and base configs
  • 3.3 Handle unknown provider values: log warning and fall back to OllamaClient

Task 4: Update Extractor Worker for Provider Abstraction

  • 4.1 Update services/extractor/main.py to import and use build_llm_client() from the factory instead of directly constructing OllamaClient
  • 4.2 Replace _build_ollama_config_from_resolved() usage with the factory's build_config_from_resolved() for both extractor and classifier clients
  • 4.3 Add vLLM health check call at startup when resolved config specifies model_provider = "vllm", with fallback to Ollama on failure
  • 4.4 Update config refresh logic (every 100 jobs) to detect provider changes, close old client, and construct new client via factory
  • 4.5 Add INFO-level logging for provider switches including old/new provider, model name, and variant ID

Task 5: Update Event Classifier for Provider Abstraction

  • 5.1 Update classify_global_event() in services/extractor/event_classifier.py to accept LLMClient protocol type instead of Any for the client parameter
  • 5.2 Replace ollama_client._call_ollama() calls with client.call_llm() calls
  • 5.3 Update ModelMetadata.provider assignment to use the actual provider string from the client (detect from config type or pass explicitly)
  • 5.4 Update retry logic to use client config attributes instead of accessing ollama_client._base_delay and ollama_client._backoff_multiplier directly

Task 6: Helm Configuration

  • 6.1 Add VLLM_BASE_URL, VLLM_MODEL, VLLM_TIMEOUT, VLLM_MAX_RETRIES, VLLM_TEMPERATURE, and VLLM_API_KEY entries to the config: section in infra/helm/stonks-oracle/values.yaml

Task 7: Unit Tests for VLLMClient

  • 7.1 Create tests/test_vllm_client.py with test for VLLMClient sending correct payload to /v1/chat/completions using mock httpx transport
  • 7.2 Add test for VLLMClient extracting content from choices[0].message.content
  • 7.3 Add test for VLLMClient handling empty choices array returning empty_model_response error
  • 7.4 Add test for VLLMClient handling HTTP timeout returning timeout error
  • 7.5 Add test for VLLMClient handling HTTP 500 returning http_500 retryable error
  • 7.6 Add test for VLLMClient handling HTTP 400 returning http_400 non-retryable error
  • 7.7 Add test for VLLMClient handling connection error returning connection_error: ...
  • 7.8 Add test for VLLMClient applying markdown fence stripping and JSON repair to response
  • 7.9 Add test for VLLMClient including temperature and response_format in payload
  • 7.10 Add test for health check success returning True and logging INFO
  • 7.11 Add test for health check failure returning False and logging WARNING
  • 7.12 Add test for OllamaClient.call_llm() delegating to _call_ollama()
  • 7.13 Add test for VLLMConfig loading from environment variables
  • 7.14 Add test for AppConfig including vllm field with correct defaults

Task 8: Unit Tests for LLM Factory

  • 8.1 Add tests to tests/test_vllm_client.py for factory returning OllamaClient when provider is "ollama"
  • 8.2 Add test for factory returning VLLMClient when provider is "vllm"
  • 8.3 Add test for factory returning OllamaClient when provider is empty string (default)
  • 8.4 Add test for factory returning OllamaClient with warning when provider is unknown value

Task 9: Property-Based Tests

  • 9.1 Create tests/test_pbt_llm_provider.py with property test for factory routing: for all model_provider in {"ollama", "vllm", "", None}, factory returns correct client type [PBT]
  • 9.2 Add property test for error string format consistency: for all HTTP status codes (100-599), _is_retryable() classifies them consistently [PBT]
  • 9.3 Add property test for VLLMClient request payload structure: for all generated prompt dicts, payload contains required OpenAI fields and excludes Ollama-specific fields [PBT]
  • 9.4 Add property test for JSON repair idempotence: for all valid JSON strings, _repair_json() is idempotent [PBT]
  • 9.5 Add property test for markdown fence stripping: for all strings, wrapping in fences then stripping recovers the original [PBT]
  • 9.6 Add property test for VLLMConfig defaults: for all default-constructed instances, invariants hold (timeout > 0, max_retries >= 0, 0 <= temperature <= 2, max_tokens > 0) [PBT]

Task 10: Verification and Backward Compatibility

  • 10.1 Run existing tests/test_ollama_client.py to verify no regressions
  • 10.2 Run ruff check services/ to verify no lint errors in modified files
  • 10.3 Run full test suite python -m pytest tests/ -x --tb=short -q to verify all tests pass