Files
stonks-oracle/.kiro/specs/remote-vllm-support/tasks.md
T
Celes Renata 117b693b19 feat: add remote vLLM support with provider abstraction layer
- LLMClient Protocol for provider-agnostic inference
- VLLMClient for OpenAI-compatible /v1/chat/completions API
- LLM client factory with provider routing (ollama/vllm)
- VLLMConfig with VLLM_* environment variable loading
- Updated extractor worker with health check and provider switching
- Updated event classifier to use LLMClient protocol
- Helm values for vLLM configuration
- 18 unit tests + 6 property-based tests
- Full backward compatibility preserved
2026-04-23 08:17:23 +00:00

83 lines
6.5 KiB
Markdown

# Tasks
## Task 1: LLM Client Protocol and VLLMConfig
- [x] 1.1 Create `services/shared/llm_protocol.py` with `LLMClient` Protocol defining `call_llm(prompts, json_schema, document_text) -> ExtractionAttempt` and `close()` methods
- [x] 1.2 Add `VLLMConfig` dataclass to `services/shared/config.py` with fields: `base_url`, `model`, `timeout`, `max_retries`, `retry_base_delay`, `retry_max_delay`, `retry_backoff_multiplier`, `max_tokens`, `temperature`, `api_key`
- [x] 1.3 Add `vllm: VLLMConfig` field to `AppConfig` dataclass
- [x] 1.4 Add `VLLM_*` environment variable loading to `load_config()` function
- [x] 1.5 Add public `call_llm()` method to `OllamaClient` in `services/extractor/client.py` that delegates to `_call_ollama()`
## Task 2: VLLMClient Implementation
- [x] 2.1 Create `services/extractor/vllm_client.py` with `VLLMClient` class that satisfies the `LLMClient` protocol
- [x] 2.2 Implement `call_llm()` method that sends POST to `/v1/chat/completions` with OpenAI-compatible payload (`model`, `messages`, `max_tokens`, `temperature`, `response_format`)
- [x] 2.3 Implement response parsing: extract content from `choices[0].message.content`, apply `_strip_markdown_fences()` and `_repair_json()`
- [x] 2.4 Implement error handling: map timeout → `timeout`, HTTP errors → `http_{code}`, connection errors → `connection_error: {details}`, empty response → `empty_model_response`
- [x] 2.5 Implement `close()` method to release the underlying `httpx.AsyncClient`
- [x] 2.6 Implement `check_vllm_health(base_url, timeout=10.0)` async function that GETs `/v1/models` and returns bool
## Task 3: LLM Client Factory
- [x] 3.1 Create `services/extractor/llm_factory.py` with `build_llm_client()` function that returns `OllamaClient` or `VLLMClient` based on resolved `model_provider`
- [x] 3.2 Implement `build_config_from_resolved()` function that creates provider-specific config from `ResolvedAgentConfig` and base configs
- [x] 3.3 Handle unknown provider values: log warning and fall back to `OllamaClient`
## Task 4: Update Extractor Worker for Provider Abstraction
- [x] 4.1 Update `services/extractor/main.py` to import and use `build_llm_client()` from the factory instead of directly constructing `OllamaClient`
- [x] 4.2 Replace `_build_ollama_config_from_resolved()` usage with the factory's `build_config_from_resolved()` for both extractor and classifier clients
- [x] 4.3 Add vLLM health check call at startup when resolved config specifies `model_provider = "vllm"`, with fallback to Ollama on failure
- [x] 4.4 Update config refresh logic (every 100 jobs) to detect provider changes, close old client, and construct new client via factory
- [x] 4.5 Add INFO-level logging for provider switches including old/new provider, model name, and variant ID
## Task 5: Update Event Classifier for Provider Abstraction
- [x] 5.1 Update `classify_global_event()` in `services/extractor/event_classifier.py` to accept `LLMClient` protocol type instead of `Any` for the client parameter
- [x] 5.2 Replace `ollama_client._call_ollama()` calls with `client.call_llm()` calls
- [x] 5.3 Update `ModelMetadata.provider` assignment to use the actual provider string from the client (detect from config type or pass explicitly)
- [x] 5.4 Update retry logic to use client config attributes instead of accessing `ollama_client._base_delay` and `ollama_client._backoff_multiplier` directly
## Task 6: Helm Configuration
- [x] 6.1 Add `VLLM_BASE_URL`, `VLLM_MODEL`, `VLLM_TIMEOUT`, `VLLM_MAX_RETRIES`, `VLLM_TEMPERATURE`, and `VLLM_API_KEY` entries to the `config:` section in `infra/helm/stonks-oracle/values.yaml`
## Task 7: Unit Tests for VLLMClient
- [x] 7.1 Create `tests/test_vllm_client.py` with test for VLLMClient sending correct payload to `/v1/chat/completions` using mock httpx transport
- [x] 7.2 Add test for VLLMClient extracting content from `choices[0].message.content`
- [x] 7.3 Add test for VLLMClient handling empty choices array returning `empty_model_response` error
- [x] 7.4 Add test for VLLMClient handling HTTP timeout returning `timeout` error
- [x] 7.5 Add test for VLLMClient handling HTTP 500 returning `http_500` retryable error
- [x] 7.6 Add test for VLLMClient handling HTTP 400 returning `http_400` non-retryable error
- [x] 7.7 Add test for VLLMClient handling connection error returning `connection_error: ...`
- [x] 7.8 Add test for VLLMClient applying markdown fence stripping and JSON repair to response
- [x] 7.9 Add test for VLLMClient including temperature and response_format in payload
- [x] 7.10 Add test for health check success returning True and logging INFO
- [x] 7.11 Add test for health check failure returning False and logging WARNING
- [x] 7.12 Add test for OllamaClient.call_llm() delegating to _call_ollama()
- [x] 7.13 Add test for VLLMConfig loading from environment variables
- [x] 7.14 Add test for AppConfig including vllm field with correct defaults
## Task 8: Unit Tests for LLM Factory
- [x] 8.1 Add tests to `tests/test_vllm_client.py` for factory returning OllamaClient when provider is "ollama"
- [x] 8.2 Add test for factory returning VLLMClient when provider is "vllm"
- [x] 8.3 Add test for factory returning OllamaClient when provider is empty string (default)
- [x] 8.4 Add test for factory returning OllamaClient with warning when provider is unknown value
## Task 9: Property-Based Tests
- [x] 9.1 Create `tests/test_pbt_llm_provider.py` with property test for factory routing: for all model_provider in {"ollama", "vllm", "", None}, factory returns correct client type [PBT]
- [x] 9.2 Add property test for error string format consistency: for all HTTP status codes (100-599), `_is_retryable()` classifies them consistently [PBT]
- [x] 9.3 Add property test for VLLMClient request payload structure: for all generated prompt dicts, payload contains required OpenAI fields and excludes Ollama-specific fields [PBT]
- [x] 9.4 Add property test for JSON repair idempotence: for all valid JSON strings, `_repair_json()` is idempotent [PBT]
- [x] 9.5 Add property test for markdown fence stripping: for all strings, wrapping in fences then stripping recovers the original [PBT]
- [x] 9.6 Add property test for VLLMConfig defaults: for all default-constructed instances, invariants hold (timeout > 0, max_retries >= 0, 0 <= temperature <= 2, max_tokens > 0) [PBT]
## Task 10: Verification and Backward Compatibility
- [x] 10.1 Run existing `tests/test_ollama_client.py` to verify no regressions
- [x] 10.2 Run `ruff check services/` to verify no lint errors in modified files
- [x] 10.3 Run full test suite `python -m pytest tests/ -x --tb=short -q` to verify all tests pass