feat: add remote vLLM support with provider abstraction layer

- LLMClient Protocol for provider-agnostic inference - VLLMClient for OpenAI-compatible /v1/chat/completions API - LLM client factory with provider routing (ollama/vllm) - VLLMConfig with VLLM_* environment variable loading - Updated extractor worker with health check and provider switching - Updated event classifier to use LLMClient protocol - Helm values for vLLM configuration - 18 unit tests + 6 property-based tests - Full backward compatibility preserved
2026-04-23 08:17:23 +00:00
parent 63e4fb96ea
commit 117b693b19
15 changed files with 1876 additions and 77 deletions
@@ -0,0 +1,82 @@
+# Tasks
+
+## Task 1: LLM Client Protocol and VLLMConfig
+
+- [x] 1.1 Create `services/shared/llm_protocol.py` with `LLMClient` Protocol defining `call_llm(prompts, json_schema, document_text) -> ExtractionAttempt` and `close()` methods
+- [x] 1.2 Add `VLLMConfig` dataclass to `services/shared/config.py` with fields: `base_url`, `model`, `timeout`, `max_retries`, `retry_base_delay`, `retry_max_delay`, `retry_backoff_multiplier`, `max_tokens`, `temperature`, `api_key`
+- [x] 1.3 Add `vllm: VLLMConfig` field to `AppConfig` dataclass
+- [x] 1.4 Add `VLLM_*` environment variable loading to `load_config()` function
+- [x] 1.5 Add public `call_llm()` method to `OllamaClient` in `services/extractor/client.py` that delegates to `_call_ollama()`
+
+## Task 2: VLLMClient Implementation
+
+- [x] 2.1 Create `services/extractor/vllm_client.py` with `VLLMClient` class that satisfies the `LLMClient` protocol
+- [x] 2.2 Implement `call_llm()` method that sends POST to `/v1/chat/completions` with OpenAI-compatible payload (`model`, `messages`, `max_tokens`, `temperature`, `response_format`)
+- [x] 2.3 Implement response parsing: extract content from `choices[0].message.content`, apply `_strip_markdown_fences()` and `_repair_json()`
+- [x] 2.4 Implement error handling: map timeout → `timeout`, HTTP errors → `http_{code}`, connection errors → `connection_error: {details}`, empty response → `empty_model_response`
+- [x] 2.5 Implement `close()` method to release the underlying `httpx.AsyncClient`
+- [x] 2.6 Implement `check_vllm_health(base_url, timeout=10.0)` async function that GETs `/v1/models` and returns bool
+
+## Task 3: LLM Client Factory
+
+- [x] 3.1 Create `services/extractor/llm_factory.py` with `build_llm_client()` function that returns `OllamaClient` or `VLLMClient` based on resolved `model_provider`
+- [x] 3.2 Implement `build_config_from_resolved()` function that creates provider-specific config from `ResolvedAgentConfig` and base configs
+- [x] 3.3 Handle unknown provider values: log warning and fall back to `OllamaClient`
+
+## Task 4: Update Extractor Worker for Provider Abstraction
+
+- [x] 4.1 Update `services/extractor/main.py` to import and use `build_llm_client()` from the factory instead of directly constructing `OllamaClient`
+- [x] 4.2 Replace `_build_ollama_config_from_resolved()` usage with the factory's `build_config_from_resolved()` for both extractor and classifier clients
+- [x] 4.3 Add vLLM health check call at startup when resolved config specifies `model_provider = "vllm"`, with fallback to Ollama on failure
+- [x] 4.4 Update config refresh logic (every 100 jobs) to detect provider changes, close old client, and construct new client via factory
+- [x] 4.5 Add INFO-level logging for provider switches including old/new provider, model name, and variant ID
+
+## Task 5: Update Event Classifier for Provider Abstraction
+
+- [x] 5.1 Update `classify_global_event()` in `services/extractor/event_classifier.py` to accept `LLMClient` protocol type instead of `Any` for the client parameter
+- [x] 5.2 Replace `ollama_client._call_ollama()` calls with `client.call_llm()` calls
+- [x] 5.3 Update `ModelMetadata.provider` assignment to use the actual provider string from the client (detect from config type or pass explicitly)
+- [x] 5.4 Update retry logic to use client config attributes instead of accessing `ollama_client._base_delay` and `ollama_client._backoff_multiplier` directly
+
+## Task 6: Helm Configuration
+
+- [x] 6.1 Add `VLLM_BASE_URL`, `VLLM_MODEL`, `VLLM_TIMEOUT`, `VLLM_MAX_RETRIES`, `VLLM_TEMPERATURE`, and `VLLM_API_KEY` entries to the `config:` section in `infra/helm/stonks-oracle/values.yaml`
+
+## Task 7: Unit Tests for VLLMClient
+
+- [x] 7.1 Create `tests/test_vllm_client.py` with test for VLLMClient sending correct payload to `/v1/chat/completions` using mock httpx transport
+- [x] 7.2 Add test for VLLMClient extracting content from `choices[0].message.content`
+- [x] 7.3 Add test for VLLMClient handling empty choices array returning `empty_model_response` error
+- [x] 7.4 Add test for VLLMClient handling HTTP timeout returning `timeout` error
+- [x] 7.5 Add test for VLLMClient handling HTTP 500 returning `http_500` retryable error
+- [x] 7.6 Add test for VLLMClient handling HTTP 400 returning `http_400` non-retryable error
+- [x] 7.7 Add test for VLLMClient handling connection error returning `connection_error: ...`
+- [x] 7.8 Add test for VLLMClient applying markdown fence stripping and JSON repair to response
+- [x] 7.9 Add test for VLLMClient including temperature and response_format in payload
+- [x] 7.10 Add test for health check success returning True and logging INFO
+- [x] 7.11 Add test for health check failure returning False and logging WARNING
+- [x] 7.12 Add test for OllamaClient.call_llm() delegating to _call_ollama()
+- [x] 7.13 Add test for VLLMConfig loading from environment variables
+- [x] 7.14 Add test for AppConfig including vllm field with correct defaults
+
+## Task 8: Unit Tests for LLM Factory
+
+- [x] 8.1 Add tests to `tests/test_vllm_client.py` for factory returning OllamaClient when provider is "ollama"
+- [x] 8.2 Add test for factory returning VLLMClient when provider is "vllm"
+- [x] 8.3 Add test for factory returning OllamaClient when provider is empty string (default)
+- [x] 8.4 Add test for factory returning OllamaClient with warning when provider is unknown value
+
+## Task 9: Property-Based Tests
+
+- [x] 9.1 Create `tests/test_pbt_llm_provider.py` with property test for factory routing: for all model_provider in {"ollama", "vllm", "", None}, factory returns correct client type [PBT]
+- [x] 9.2 Add property test for error string format consistency: for all HTTP status codes (100-599), `_is_retryable()` classifies them consistently [PBT]
+- [x] 9.3 Add property test for VLLMClient request payload structure: for all generated prompt dicts, payload contains required OpenAI fields and excludes Ollama-specific fields [PBT]
+- [x] 9.4 Add property test for JSON repair idempotence: for all valid JSON strings, `_repair_json()` is idempotent [PBT]
+- [x] 9.5 Add property test for markdown fence stripping: for all strings, wrapping in fences then stripping recovers the original [PBT]
+- [x] 9.6 Add property test for VLLMConfig defaults: for all default-constructed instances, invariants hold (timeout > 0, max_retries >= 0, 0 <= temperature <= 2, max_tokens > 0) [PBT]
+
+## Task 10: Verification and Backward Compatibility
+
+- [x] 10.1 Run existing `tests/test_ollama_client.py` to verify no regressions
+- [x] 10.2 Run `ruff check services/` to verify no lint errors in modified files
+- [x] 10.3 Run full test suite `python -m pytest tests/ -x --tb=short -q` to verify all tests pass