feat: implement dual-pipeline signal engine service

New service at services/signal_engine/ implementing concurrent heuristic (deterministic scoring) and probabilistic (Bayesian inference) pipelines that evaluate technical signals across 6 timeframes (M30-M) and produce independent BUY/WATCH/SKIP verdicts per ticker per evaluation tick.

Components:
- Input Normalizer: multi-source data assembly with sentinel fallbacks
- Signal Library: Fibonacci, MA Stack, RSI, Cup & Handle, Elliott Wave
- Multi-Timeframe Confluence Engine: weighted scoring with D/W/M anchors
- Hard Filter Engine: macro_bias, valuation, earnings proximity gating
- Heuristic Pipeline: S_total scoring with confidence-gated verdicts
- Probabilistic Pipeline: Bayesian log-odds with regime priors, entropy gating, EV_R calculation, and signal correlation penalty
- Exit Engine: stop-loss, targets, trailing ATR-based stops
- Delta Analyzer: pipeline agreement tracking with rolling Redis metrics
- Output Formatter: SignalOutput contract + Recommendation schema mapping
- Worker orchestrator: concurrent pipelines with failure isolation
- Main entry point: queue polling with fail-safe config loading

Infrastructure:
- Migration 039: signal_engine_outputs table with 3 indexes
- Helm chart: signalEngine service entry (processing tier)
- Redis key: QUEUE_SIGNAL_ENGINE constant

Tests: 390 tests (unit + property-based) covering all components

Config: dual_pipeline_enabled=false by default (safe rollout)
2026-05-02 07:32:26 +00:00
parent 7e2343ec2c
commit f468e30af0
61 changed files with 14107 additions and 184 deletions
@@ -1,6 +1,6 @@
 # AI Agent Building Guide

-Stonks Oracle uses three AI agents powered by a local Ollama instance. Each agent has a dedicated purpose in the pipeline, a database-backed configuration, and support for A/B testing through variants. This guide covers how each agent works, how to configure them, how to create and test variants, and how to monitor performance.
+Stonks Oracle uses three AI agents powered by local LLM inference (Ollama or vLLM). Each agent has a dedicated purpose in the pipeline, a database-backed configuration, and support for A/B testing through variants. This guide covers how each agent works, how to configure them, how to create and test variants, and how to monitor performance.

 ## Table of Contents

@@ -8,6 +8,7 @@ Stonks Oracle uses three AI agents powered by a local Ollama instance. Each agen
  - [Document Intelligence Extractor](#1-document-intelligence-extractor)
  - [Global Event Classifier](#2-global-event-classifier)
  - [Thesis Rewriter](#3-thesis-rewriter)
+- [LLM Provider Abstraction](#llm-provider-abstraction)
 - [Database Schema](#database-schema)
  - [ai_agents Table](#ai_agents-table)
  - [agent_variants Table](#agent_variants-table)
@@ -30,9 +31,10 @@ Three agents are seeded into the `ai_agents` table on first migration (migration
 | **Slug** | `document-extractor` |
 | **Purpose** | Extracts structured intelligence (sentiment, catalysts, impact scores, key facts, risks) from company news, SEC filings, earnings transcripts, and press releases |
 | **Default Model** | `qwen3.5:9b-fast` (Ollama) |
+| **Supported Providers** | `ollama`, `vllm` |
 | **Prompt Version** | `document-intel-v2` |
 | **Schema Version** | `2.0.0` |
-| **Entry Point** | `services/extractor/main.py` → `services/extractor/client.py` |
+| **Entry Point** | `services/extractor/main.py` → `services/extractor/llm_factory.py` → `services/extractor/client.py` (Ollama) or `services/extractor/vllm_client.py` (vLLM) |

 **Input Data:**
 - Normalized document text (fetched from MinIO or passed in the Redis job payload)
@@ -40,7 +42,7 @@ Three agents are seeded into the `ai_agents` table on first migration (migration
 - List of tracked tickers for company identification
 - Document ID for traceability

-**Output Schema** (`ExtractionResult`):
+**Output Schema** (`ExtractionResult` — defined in `services/extractor/schemas.py`):

 ```json
 {
@@ -81,6 +83,7 @@ Use "other" for catalyst_type if unsure. Keep evidence_spans short
 - Includes tracked ticker list with rules for company identification
 - Includes the full JSON schema field descriptions
 - Truncates documents to 8,000 characters to limit inference time
+- When an active variant has `input_token_limit > 0`, truncation uses `input_token_limit * 4` characters instead
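
The truncation rule in the list above can be sketched as follows. This is a minimal illustration, not the extractor's actual code: the helper name `truncate_for_prompt` and its parameters are hypothetical, and only the 8,000-character default and the `input_token_limit * 4` multiplier come from the guide.

```python
DEFAULT_CHAR_LIMIT = 8_000  # documented default for the document extractor
CHARS_PER_TOKEN = 4         # multiplier stated in the guide


def truncate_for_prompt(
    text: str,
    input_token_limit: int = 0,
    default_chars: int = DEFAULT_CHAR_LIMIT,
) -> str:
    """Hypothetical helper: cut `text` to the character budget for the prompt.

    A positive `input_token_limit` (from the active variant) overrides the
    default character limit; zero or negative values keep the default.
    """
    if input_token_limit > 0:
        limit = input_token_limit * CHARS_PER_TOKEN
    else:
        limit = default_chars
    return text[:limit]
```

The same helper would cover the event classifier by passing `default_chars=6_000`, assuming both agents apply the rule identically.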

 ---

@@ -91,6 +94,7 @@ Use "other" for catalyst_type if unsure. Keep evidence_spans short
 | **Slug** | `event-classifier` |
 | **Purpose** | Classifies global/geopolitical news into structured macro events with impact type, severity, affected regions/sectors/commodities, and estimated duration |
 | **Default Model** | `qwen3.5:9b-fast` (Ollama) |
+| **Supported Providers** | `ollama`, `vllm` |
 | **Prompt Version** | `event-classification-v1` |
 | **Schema Version** | `1.0.0` |
 | **Entry Point** | `services/extractor/main.py` → `services/extractor/event_classifier.py` |
@@ -99,7 +103,7 @@ Use "other" for catalyst_type if unsure. Keep evidence_spans short
 - Normalized text of a macro news article (from the `stonks:queue:macro_classification` Redis queue)
 - Document ID for traceability

-**Output Schema** (`GlobalEvent`):
+**Output Schema** (`GlobalEvent` — defined in `services/extractor/event_classifier.py`):

 ```json
 {
@@ -141,9 +145,11 @@ as empty arrays.
 ```

 **User Prompt Template** (built by `build_event_classification_prompt()` in `services/extractor/event_classifier.py`):
- Includes anti-hallucination rules
+- Includes anti-hallucination rules (no fabrication, severity "critical" reserved for multi-country events)
 - Lists all valid enum values for each field
 - Truncates articles to 6,000 characters
+- When an active variant has `input_token_limit > 0`, truncation uses `input_token_limit * 4` characters instead
+- If a variant overrides the system prompt, the classifier appends the JSON output instructions whenever they are not already present
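
One way to picture the JSON-instruction safeguard from the last bullet is the sketch below. Everything in it is an assumption: the function name, the instruction wording, and the detection heuristic (a case-insensitive search for "json") are illustrative, and the classifier may detect and append the instructions differently.

```python
# Assumed wording; the real instruction text lives in event_classifier.py.
JSON_INSTRUCTION = "Respond with a single valid JSON object and nothing else."


def ensure_json_instruction(
    system_prompt: str, instruction: str = JSON_INSTRUCTION
) -> str:
    """Hypothetical helper: append JSON output instructions when missing."""
    # Heuristic (assumption): a prompt that already mentions JSON is left alone.
    if "json" in system_prompt.lower():
        return system_prompt
    # Otherwise append the instruction as a separate paragraph.
    return f"{system_prompt.rstrip()}\n\n{instruction}".lstrip()
```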

 ---

@@ -154,6 +160,7 @@ as empty arrays.
 | **Slug** | `thesis-rewriter` |
 | **Purpose** | Rewrites deterministic trade thesis summaries into clear, professional analyst prose. Optional layer — the system falls back to the deterministic thesis if this fails |
 | **Default Model** | `qwen3.5:9b-fast` (Ollama) |
+| **Supported Providers** | `ollama`, `vllm` |
 | **Prompt Version** | `thesis-rewrite-v1` |
 | **Schema Version** | `1.0.0` |
 | **Entry Point** | `services/recommendation/main.py` → `services/recommendation/thesis_llm.py` |
@@ -165,6 +172,7 @@ as empty arrays.
 **Output Schema:**
 - Plain text (not JSON). The model returns only the rewritten thesis as a string, under 150 words.
 - On failure or empty response, the original deterministic thesis is returned unchanged.
+- A `_strip_thinking_block()` post-processor removes `<think>` XML tags and "Thinking Process:" blocks that some models (e.g. Qwen3) emit before the actual response.
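
A post-processor of this kind could look roughly like the sketch below. It is not the body of `_strip_thinking_block()`: the regular expressions are assumptions, in particular the rule that a "Thinking Process:" block starts the response and ends at the first blank line.

```python
import re

# <think>...</think> spans, possibly multi-line.
_THINK_TAG_RE = re.compile(r"<think>.*?</think>", re.DOTALL | re.IGNORECASE)
# Assumption: a leading "Thinking Process:" block runs until the first blank
# line (or the end of the text if there is none).
_THINKING_PROCESS_RE = re.compile(
    r"^\s*Thinking Process:.*?(?:\n\s*\n|\Z)", re.DOTALL
)


def strip_thinking_block(text: str) -> str:
    """Illustrative stand-in for the thesis rewriter's post-processor."""
    text = _THINK_TAG_RE.sub("", text)
    text = _THINKING_PROCESS_RE.sub("", text, count=1)
    return text.strip()
```

If stripping leaves an empty string, the fallback described above applies and the deterministic thesis is returned unchanged.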

 **System Prompt:**

@@ -182,11 +190,37 @@ STRICT RULES:
 5. Use a neutral, professional tone. Avoid hype or marketing language.
 6. Return ONLY the rewritten thesis text. No JSON, no markdown, no
   commentary.
+7. Do NOT show your thinking process. Do NOT include any reasoning
+   steps. Output ONLY the final rewritten text.
 ```

 **User Prompt Template** (built by `build_thesis_rewrite_prompt()` in `services/recommendation/thesis_llm.py`):
 - Includes the deterministic thesis between delimiters
 - Includes trend context: ticker, window, direction, strength, confidence, contradiction score, top catalysts, top risks
+- Appends `/no_think` suffix to suppress reasoning mode on models that support it (e.g. Qwen3)
+- Ollama calls also set `"think": false` in the request payload
+
+---
+
+## LLM Provider Abstraction
+
+All three agents support both **Ollama** and **vLLM** as inference providers. The provider is determined by the `model_provider` field in the agent config (or active variant).
+
+**Module:** `services/extractor/llm_factory.py`
+
+The `build_llm_client()` factory function routes to the correct client:
+
+| `model_provider` value | Client class | API endpoint |
+|------------------------|-------------|--------------|
+| `ollama` (default), `""`, `None` | `OllamaClient` (`services/extractor/client.py`) | `{OLLAMA_BASE_URL}/api/chat` |
+| `vllm` | `VLLMClient` (`services/extractor/vllm_client.py`) | `{VLLM_BASE_URL}/v1/chat/completions` (OpenAI-compatible) |
+| Unknown value | `OllamaClient` (with warning log) | Falls back to Ollama |
+
+Both clients implement the `LLMClient` protocol (`services/shared/llm_protocol.py`), providing `call_llm()` and `close()` methods.
+
+**Provider switching at runtime:** When a variant changes the `model_provider`, the extractor worker detects this during its periodic config refresh (every 100 jobs) and creates a new client instance. The old client is closed gracefully. A safety guard prevents switching to Ollama if `OLLAMA_BASE_URL` is empty.
+
+**vLLM health check:** At startup, if the resolved provider is `vllm`, the extractor runs a health check against the vLLM endpoint. If it fails, the worker falls back to Ollama automatically.
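
The routing table above can be sketched as a small factory. This is a simplified illustration under stated assumptions: the `call_llm()` signature, the constructor arguments, and the stub client bodies are hypothetical, and the real `build_llm_client()` in `services/extractor/llm_factory.py` may take different parameters.

```python
import logging
from typing import Optional, Protocol

logger = logging.getLogger(__name__)


class LLMClient(Protocol):
    # Assumed signature; the guide only names call_llm() and close().
    def call_llm(self, system_prompt: str, user_prompt: str) -> str: ...
    def close(self) -> None: ...


class OllamaClient:
    """Stub standing in for services/extractor/client.py."""

    def __init__(self, base_url: str) -> None:
        self.endpoint = f"{base_url}/api/chat"

    def call_llm(self, system_prompt: str, user_prompt: str) -> str:
        raise NotImplementedError("network call omitted in this sketch")

    def close(self) -> None:
        pass


class VLLMClient(OllamaClient):
    """Stub standing in for services/extractor/vllm_client.py."""

    def __init__(self, base_url: str) -> None:
        self.endpoint = f"{base_url}/v1/chat/completions"


def build_llm_client(
    model_provider: Optional[str], ollama_url: str, vllm_url: str
) -> LLMClient:
    provider = model_provider or "ollama"  # "" and None mean the default
    if provider == "vllm":
        return VLLMClient(vllm_url)
    if provider != "ollama":
        logger.warning("Unknown provider %r, falling back to Ollama", provider)
    return OllamaClient(ollama_url)
```

Returning the protocol type rather than a concrete class keeps callers indifferent to which backend is active, which is what makes a runtime provider switch a matter of swapping one object.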

 ---

@@ -202,8 +236,8 @@ Defined in migration `026_ai_agents.sql`. Stores the base configuration for each
 | `name` | `VARCHAR(100)` | — | Human-readable name (unique) |
 | `slug` | `VARCHAR(100)` | — | URL-safe identifier (unique), used by `AgentConfigResolver` |
 | `purpose` | `TEXT` | `''` | Description of what the agent does |
-| `model_provider` | `VARCHAR(50)` | `'ollama'` | LLM provider |
-| `model_name` | `VARCHAR(200)` | `'qwen3.5:9b'` | Model identifier |
+| `model_provider` | `VARCHAR(50)` | `'ollama'` | LLM provider (`ollama` or `vllm`) |
+| `model_name` | `VARCHAR(200)` | `'qwen3.5:9b-fast'` | Model identifier |
 | `system_prompt` | `TEXT` | `''` | System prompt sent to the model |
 | `user_prompt_template` | `TEXT` | `''` | User prompt template (optional — code-defined templates take precedence) |
 | `prompt_version` | `VARCHAR(100)` | `''` | Version tag for prompt tracking |
@@ -297,13 +331,20 @@ The `AgentConfigResolver` is the central mechanism for resolving runtime agent c
 2. **COALESCE-based override**: The SQL query uses `COALESCE(variant_column, agent_column)` for every configuration field. If an active variant exists and has a non-NULL value for a field, that value is used. Otherwise, the base agent's value is used.

   ```sql
-   SELECT a.id AS agent_id,
-          v.id AS variant_id,
+   SELECT a.id        AS agent_id,
+          v.id        AS variant_id,
          COALESCE(v.model_provider,       a.model_provider)       AS model_provider,
          COALESCE(v.model_name,           a.model_name)           AS model_name,
          COALESCE(v.system_prompt,        a.system_prompt)        AS system_prompt,
          COALESCE(v.user_prompt_template, a.user_prompt_template) AS user_prompt_template,
-          -- ... all other fields ...
+          COALESCE(v.prompt_version,       a.prompt_version)       AS prompt_version,
+          COALESCE(v.temperature,          a.temperature)          AS temperature,
+          COALESCE(v.max_tokens,           a.max_tokens)           AS max_tokens,
+          COALESCE(v.context_window,       0)                      AS context_window,
+          COALESCE(v.input_token_limit,    0)                      AS input_token_limit,
+          COALESCE(v.token_budget,         0)                      AS token_budget,
+          COALESCE(v.timeout_seconds,      a.timeout_seconds)      AS timeout_seconds,
+          COALESCE(v.max_retries,          a.max_retries)          AS max_retries
     FROM ai_agents a
     LEFT JOIN agent_variants v
            ON v.agent_id = a.id AND v.is_active = TRUE
@@ -361,7 +402,10 @@ resolver.invalidate()                       # Clear all entries

 ### Config Refresh in Workers

-The extractor and recommendation workers periodically re-resolve their agent config (every 100 jobs for the extractor, every 50 jobs for the recommendation worker). If the resolved model changes, the worker creates a new `OllamaClient` instance with the updated configuration.
+The extractor and recommendation workers periodically re-resolve their agent config to pick up variant swaps and model changes:
+
+- **Extractor worker** (`services/extractor/main.py`): Re-resolves both `document-extractor` and `event-classifier` configs every **100 jobs**. If the resolved model or provider changes, the worker creates a new LLM client instance via `build_llm_client()` and closes the old one. A safety guard prevents switching to Ollama if `OLLAMA_BASE_URL` is empty.
+- **Recommendation worker** (`services/recommendation/main.py`): Re-resolves the `thesis-rewriter` config every **50 jobs**. If the model changes, a new `OllamaConfig` is built.
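
The periodic refresh described in the two bullets above follows a simple counter pattern, sketched here. The class and field names are hypothetical, and the counter `client_builds` merely stands in for closing the old client and building a new one.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ResolvedConfig:
    model_provider: str
    model_name: str


class RefreshingWorker:
    """Hypothetical worker that re-resolves its config every N jobs."""

    def __init__(
        self, resolve: Callable[[], ResolvedConfig], refresh_every: int = 100
    ) -> None:
        self._resolve = resolve
        self._refresh_every = refresh_every
        self._jobs_done = 0
        self.config = resolve()
        self.client_builds = 1  # initial client

    def after_job(self) -> None:
        self._jobs_done += 1
        if self._jobs_done % self._refresh_every != 0:
            return
        new_config = self._resolve()
        if new_config != self.config:
            # Stand-in for: close the old client, build a new one.
            self.config = new_config
            self.client_builds += 1
```

A consequence of this pattern is that an activated variant can take up to `refresh_every` jobs to reach a running worker.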

 ---

@@ -373,7 +417,7 @@ Every agent invocation is logged to `agent_performance_log` with the `agent_id`

 - **Document extractor**: Logged in `services/extractor/main.py` after each extraction. Records success/failure, duration, confidence, retry count, token estimates.
 - **Event classifier**: Logged in `services/extractor/event_classifier.py` after each classification. Same fields.
- **Thesis rewriter**: Logged in `services/recommendation/thesis_llm.py` after each rewrite attempt. Confidence is always 0.0 (not applicable for rewrites).
+- **Thesis rewriter**: Logged in `services/recommendation/thesis_llm.py` after each rewrite attempt. Confidence is always 0.0 (not applicable for rewrites). `document_id` is always NULL.

 ### Querying for Variant Comparison

@@ -464,6 +508,8 @@ All agent endpoints are served by the Query API (`services/api/app.py`) under th
 }
 ```

+All fields except `name` have defaults. The `slug` is auto-generated from `name` if not provided. The `model_name` defaults to `llama3.1:8b` for user-created agents.
+
 **Update Agent Request Body** (all fields optional):

 ```json
@@ -509,6 +555,30 @@ All agent endpoints are served by the Query API (`services/api/app.py`) under th
 | `PUT` | `/api/agents/{agent_id}/variants/{variant_id}` | Partial update a variant |
 | `DELETE` | `/api/agents/{agent_id}/variants/{variant_id}` | Delete a variant (returns 400 if active) |

+**Create Variant Request Body:**
+
+```json
+{
+  "variant_name": "Llama 3.1 8B Test",
+  "variant_slug": "llama-3-1-8b-test",
+  "description": "Testing llama3.1:8b as an alternative",
+  "model_provider": "ollama",
+  "model_name": "llama3.1:8b",
+  "system_prompt": "",
+  "user_prompt_template": "",
+  "prompt_version": "",
+  "temperature": 0.0,
+  "max_tokens": 32768,
+  "context_window": 0,
+  "input_token_limit": 0,
+  "token_budget": 0,
+  "timeout_seconds": 120,
+  "max_retries": 2
+}
+```
+
+Required fields: `variant_name`, `model_name`. The `variant_slug` is auto-generated from `variant_name` if not provided.
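
For client code that builds this request body, a sketch like the following may help. The slug rule shown is an assumption chosen because it reproduces the example above (`"Llama 3.1 8B Test"` becomes `llama-3-1-8b-test`); the API's own slug generation may differ, and both function names are hypothetical.

```python
import re


def slugify(name: str) -> str:
    """Assumed slug rule: lowercase, non-alphanumeric runs become hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")


def build_variant_payload(variant_name: str, model_name: str, **overrides) -> dict:
    """Hypothetical helper: assemble a create-variant request body."""
    if not variant_name or not model_name:
        raise ValueError("variant_name and model_name are required")
    payload = {
        "variant_name": variant_name,
        "variant_slug": slugify(variant_name),
        "model_name": model_name,
    }
    payload.update(overrides)  # e.g. model_provider, temperature, max_tokens
    return payload
```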
+
 ### Clone Endpoints

 | Method | Path | Description |
@@ -516,7 +586,7 @@ All agent endpoints are served by the Query API (`services/api/app.py`) under th
 | `POST` | `/api/agents/{agent_id}/clone` | Clone an agent's base config as a new variant |
 | `POST` | `/api/agents/{agent_id}/variants/{variant_id}/clone` | Clone an existing variant as a new variant |

-Clone requests copy all configuration fields from the source, with optional overrides in the request body.
+Clone requests copy all configuration fields from the source, with optional overrides in the request body. The `variant_name` field is required. All other fields default to the source's values if not provided.

 ### Activate / Deactivate

@@ -525,6 +595,8 @@ Clone requests copy all configuration fields from the source, with optional over
 | `POST` | `/api/agents/{agent_id}/variants/{variant_id}/activate` | Set a variant as active (deactivates any other active variant in a single transaction) |
 | `POST` | `/api/agents/{agent_id}/variants/deactivate` | Deactivate the currently active variant (agent falls back to base config) |

+The activate endpoint uses a single database transaction to deactivate the current variant and activate the new one atomically, so an agent never has more than one active variant.
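
The activation swap can be illustrated with a single transaction as below. This sketch uses Python's built-in `sqlite3` and a simplified `agent_variants` table purely for demonstration; the service's database, driver, and full schema are different, and the function name is hypothetical.

```python
import sqlite3


def activate_variant(conn: sqlite3.Connection, agent_id: int, variant_id: int) -> None:
    """Illustrative swap: deactivate the current variant, activate the new one."""
    # `with conn` commits if the block succeeds and rolls back on an exception,
    # so both updates take effect together or not at all.
    with conn:
        conn.execute(
            "UPDATE agent_variants SET is_active = 0 "
            "WHERE agent_id = ? AND is_active = 1",
            (agent_id,),
        )
        cur = conn.execute(
            "UPDATE agent_variants SET is_active = 1 "
            "WHERE agent_id = ? AND id = ?",
            (agent_id, variant_id),
        )
        if cur.rowcount != 1:
            raise LookupError("variant not found for this agent")
```

If the target variant does not exist, the exception rolls back the deactivation as well, so the previously active variant stays active.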
+
 ### Per-Variant Performance

 | Method | Path | Description |
@@ -532,6 +604,8 @@ Clone requests copy all configuration fields from the source, with optional over
 | `GET` | `/api/agents/{agent_id}/variants/{variant_id}/performance` | Aggregated metrics for a specific variant |
 | `GET` | `/api/agents/{agent_id}/variants/{variant_id}/performance/history` | Hourly time-series for a specific variant |

+Both endpoints accept the same `hours` query parameter (default 24, max 720) and return the same response shape as the agent-level performance endpoints.
+
 ---

 ## Step-by-Step: Creating and Activating a Variant
@@ -616,3 +690,20 @@ curl -s -X PUT \
 ```

 Then re-activate and compare again.
+
+### 7. Switch to vLLM Provider
+
+To test a variant using vLLM instead of Ollama:
+
+```bash
+curl -s -X POST https://stonks-api.celestium.life/api/agents/$AGENT_ID/clone \
+  -H "Content-Type: application/json" \
+  -d '{
+    "variant_name": "vLLM Qwen3 Test",
+    "description": "Testing extraction with vLLM backend",
+    "model_provider": "vllm",
+    "model_name": "Qwen/Qwen3-8B"
+  }' | jq .
+```
+
+The extractor worker will detect the provider change during its next config refresh and build a `VLLMClient` instead of an `OllamaClient`. Ensure the `VLLM_BASE_URL` environment variable is set in the extractor deployment.