Files
stonks-oracle/docs/llm-to-trade-pipeline.md
T
Celes Renata c85c0068a2 fix: clean up utcnow deprecation warnings, fix 12 failing tests, add CI/CD pipeline manifests
- Replace all datetime.utcnow() with datetime.now(tz=timezone.utc) across 8 files
- Fix 12 failing tests to match current implementation behavior
- Fix pytest_plugins in non-top-level conftest (moved to root conftest.py)
- Auto-fix 189 lint issues (import sorting, unused imports)
- Add CI/CD pipeline infrastructure (ARC, ArgoCD, Kargo manifests)
- Add values-beta.yaml and values-paper.yaml for staged deployments
- Update GitHub Actions workflow to use self-hosted-gremlin runners
- Add integration-test job to CI pipeline

Result: 1596 passed, 0 failed, 0 warnings
2026-04-18 03:59:28 +00:00

536 lines
25 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# From Model Output to Trade: The Full Pipeline
This document traces the complete journey of data through Stonks Oracle — from the moment an Ollama model produces structured JSON, through signal scoring and aggregation, to the final trading decision.
---
## 1. Document Ingestion
Before the model ever sees a document, the ingestion layer fetches raw content from configured sources (news APIs, SEC filings, earnings transcripts, press releases). Each document lands in the `documents` table with a status, type, and `published_at` timestamp. A Redis queue (`stonks:queue:extraction`) feeds documents to the extractor service.
---
## 2. Prompting the Model
The extractor service (`services/extractor/client.py`) sends each document to a local Ollama instance via `POST /api/chat`.
### System prompt
A short, strict instruction set:
> You are a financial document analyst. Extract structured data as JSON. Return ONLY a single JSON object. No markdown fences, no explanation, no text before or after the JSON. Every field in the schema is required. Use "other" for catalyst_type if unsure. Keep evidence_spans short (under 20 words each). Keep key_facts to 3-5 items max.
### User prompt
Built dynamically per document (`services/extractor/prompts.py`). It includes:
- **Document type guidance** — tailored instructions for articles, filings, transcripts, and press releases. For example, filings get: *"Extract concrete financial figures, risk factors, and material events as stated."* Transcripts get: *"Distinguish between management forward-looking statements and reported results."*
- **Tracked ticker hints** — the list of 50 tracked tickers, with rules: if a ticker appears verbatim in the text, the model must include it; if a sector theme clearly affects a tracked company, include it; never invent tickers outside the list.
- **Field-by-field instructions** — what each output field means and its valid range.
- **Document text** — truncated to 8,000 characters to keep inference fast.
### Ollama call parameters
- `think=false` (speed over chain-of-thought)
- `num_predict=4096` (max output tokens)
- Optional `num_ctx` override for longer documents
- The JSON schema (generated from Pydantic models) is passed as the `format` parameter for structured output
---
## 3. Model Output: The JSON Contract
The model returns a single JSON object matching the `ExtractionResult` schema (`services/extractor/schemas.py`):
```json
{
"summary": "Apple reported record Q4 earnings driven by iPhone 16 demand.",
"companies": [
{
"ticker": "AAPL",
"company_name": "Apple Inc.",
"relevance": 0.95,
"sentiment": "positive",
"impact_score": 0.8,
"impact_horizon": "1d_7d",
"catalyst_type": "earnings",
"key_facts": [
"Revenue up 12% YoY to $94.9B",
"iPhone revenue grew 18%",
"Services hit all-time high"
],
"risks": [
"China market softness noted by management"
],
"evidence_spans": [
"record quarterly revenue of $94.9 billion",
"iPhone revenue grew 18 percent year over year"
]
}
],
"macro_themes": ["consumer_spending", "ai_capex"],
"novelty_score": 0.6,
"confidence": 0.85,
"extraction_warnings": []
}
```
### Field definitions
| Field | Type | Range | Purpose |
|---|---|---|---|
| `summary` | string | — | 1-3 sentence document summary |
| `companies[]` | array | — | Per-company intelligence (one entry per affected company) |
| `.ticker` | string | — | Stock ticker symbol |
| `.relevance` | float | 0-1 | How central this company is to the document |
| `.sentiment` | enum | positive / negative / neutral / mixed | Overall sentiment toward the company |
| `.impact_score` | float | 0-1 | Estimated magnitude of impact (0 = negligible, 1 = highly material) |
| `.impact_horizon` | string | intraday / 1d / 1d_7d / 1d_30d / 30d_90d / 90d_plus | When the impact is expected to manifest |
| `.catalyst_type` | enum | earnings / product / legal / macro / supply_chain / m_and_a / rating_change / other | Primary catalyst category |
| `.key_facts` | string[] | — | Facts explicitly stated in the document (no fabrication) |
| `.risks` | string[] | — | Risks explicitly mentioned |
| `.evidence_spans` | string[] | — | Short verbatim quotes supporting the analysis |
| `macro_themes` | string[] | — | Broad economic themes (rates, inflation, ai_capex, etc.) |
| `novelty_score` | float | 0-1 | How surprising the information is |
| `confidence` | float | 0-1 | Model's self-assessed extraction quality |
| `extraction_warnings` | string[] | — | Issues encountered (ambiguous_ticker, incomplete_text, etc.) |
---
## 4. JSON Repair and Validation
The raw model output goes through two stages before it's trusted.
### 4a. JSON repair (`services/extractor/client.py`)
Ollama's `format` constraint is unreliable with `think=false` on certain models (Ollama bug #14645). The extractor handles this:
1. Try `json.loads()` directly — if it parses, use it as-is.
2. Strip markdown fences (` ```json ... ``` `) if present.
3. Fall back to the `json-repair` library, which fixes trailing commas, unterminated strings, and control characters.
### 4b. Structural + semantic validation (`services/extractor/schemas.py`)
1. **Structural validation** — parse the JSON against the `ExtractionResult` Pydantic model. Missing required fields, wrong types, or out-of-range values fail here.
2. **Semantic validation** — cross-field consistency checks:
- Ticker format validation
- Evidence span length checks
- Catalyst type alias normalization (maps variants to canonical enum values)
- Impact horizon normalization
3. Returns a `ValidationReport` with the parsed result or a list of errors.
### 4c. Retry logic
If validation fails and the error is retryable (not an HTTP 4xx client error), the extractor retries up to `max_retries` times (default 2) with exponential backoff. Every attempt — raw output, validation result, error, duration — is preserved in the `ExtractionResponse.attempts` list for audit.
---
## 5. Persistence: Document Intelligence and Impact Records
A successful extraction produces two sets of database records.
### Document intelligence (`document_intelligence` table)
One row per document:
- `document_id`, `document_type`, `summary`, `companies` (JSONB), `macro_themes`
- `novelty_score`, `source_credibility`, `confidence`, `extraction_warnings`
- `validation_status` (valid/failed)
- `model` metadata: provider, model_name, prompt_version, schema_version
### Per-company impact records (`document_impact_records` table)
One row per company mentioned in the extraction:
- `ticker`, `company_name`, `relevance`, `sentiment`, `impact_score`, `impact_horizon`, `catalyst_type`
- `key_facts`, `risks`, `evidence_spans` (all JSONB)
- Links back to `document_intelligence` via `intelligence_id`
### Raw artifact storage (MinIO)
Full prompts and raw model responses are stored in MinIO buckets (`stonks-llm-prompts`, `stonks-llm-results`) keyed by `document_id`, so any extraction can be replayed or audited.
---
## 6. Signal Scoring: Turning Records into Weighted Signals
The aggregation engine (`services/aggregation/worker.py`) converts raw impact records into `WeightedSignal` objects. Each signal carries a composite weight that determines how much it influences the final trend.
### Weight components (`services/aggregation/scoring.py`)
The combined weight is:
```
combined = gate × recency × credibility × (1 + novelty_bonus) × market_context_multiplier
```
| Component | Formula | Purpose |
|---|---|---|
| **Confidence gate** | 0 if extraction confidence < 0.2, else 1 | Reject unreliable extractions entirely |
| **Recency decay** | `2^(-age_hours / half_life)`, min 0.01 | Exponential decay — newer documents matter more. Half-lives: intraday=2h, 1d=12h, 7d=72h, 30d=240h, 90d=720h |
| **Credibility** | `source_credibility ^ exponent`, clamped [0.1, 1.0] | Source quality weighting |
| **Novelty bonus** | `novelty_score × 0.25` | Novel information gets up to 25% boost |
| **Market context** | Volatility boost (up to +30%) + volume surge boost (+15%) | Fast-moving, high-volume markets amplify fresh signals |
Each `WeightedSignal` also carries:
- `sentiment_value`: +1.0 (positive), -1.0 (negative), 0.0 (neutral/mixed)
- `impact_score`: the extraction's impact magnitude
- `document_id`: for evidence tracing
---
## 7. Three Signal Layers
The aggregation engine merges signals from three independent layers. Each layer can be toggled on/off at runtime via the `risk_configs` table — no restart needed.
### Layer 1: Company-specific signals (always active)
Direct document intelligence about a company. This is the core layer — `document_impact_records` for the ticker, scored as described in §6.
### Layer 2: Macro signals (toggle: `macro_enabled`)
Global events that affect companies through exposure profiles.
**Flow:**
1. The macro service classifies global events (from news) using Ollama — extracting event type, severity, affected regions/sectors/commodities.
2. Each company has an **exposure profile** (`exposure_profiles` table): geographic revenue mix, supply chain regions, commodity dependencies, market position tier.
3. **Overlap scoring** computes how much a global event overlaps with a company's exposure (geographic, supply chain, commodity dimensions).
4. A **resilience modifier** based on market position tier (global leaders are more resilient than domestic companies) adjusts the score.
5. The final `macro_impact_score = base_score × overlap_factor × resilience_modifier`.
6. Events older than 48 hours get accelerated staleness decay.
Macro signals are converted to `WeightedSignal` objects with:
- `sentiment_value` mapped from `impact_direction` (positive → +1, negative → -1)
- `impact_score = macro_impact_score × macro_signal_weight` (default weight: 0.3)
- Recency decay from the global event's publication time
### Layer 3: Competitive signals (toggle: `competitive_enabled`)
Historical patterns and cross-company signal propagation.
**Flow:**
1. **Self-company pattern mining** (`services/aggregation/pattern_matcher.py`): For each catalyst type in the current impact records, query historical outcomes for this ticker. Lookback: 180 days for routine signals, 365 days for major decisions (1.3× weight multiplier). Produces `HistoricalPattern` objects with `bullish_pct`, `bearish_pct`, `avg_strength`, `pattern_confidence`.
2. **Cross-company propagation** (`services/aggregation/signal_propagation.py`): When company A has a catalyst, look up its competitors via the `competitor_relationships` table (46 relationships across 50 companies). For each competitor, query cross-company historical patterns. Signal strength = `avg_strength × relationship_strength × pattern_confidence × impact_score`. Direction = majority historical outcome (bullish or bearish).
3. Competitive signals are converted to `WeightedSignal` objects with:
- `impact_score = signal_strength × competitive_signal_weight` (default weight: 0.2)
- Recency decay from the pattern's most recent data point or the signal's `computed_at` time
### Merging
All three layers produce `WeightedSignal` objects with the same structure. The aggregation engine simply concatenates them into a single list before computing the trend summary. The relative influence of each layer is controlled by the `macro_signal_weight` (0.3) and `competitive_signal_weight` (0.2) multipliers applied to their impact scores.
---
## 8. Trend Summary Assembly
From the merged signal list, the aggregation engine computes a `TrendSummary` for each ticker × window combination (intraday, 1d, 7d, 30d, 90d).
### Weighted sentiment average
```
avg_sentiment = Σ(sentiment_value × combined_weight × impact_score) / Σ(combined_weight × impact_score)
```
### Trend direction
| Condition | Direction |
|---|---|
| `avg_sentiment ≥ 0.15` | **Bullish** |
| `avg_sentiment ≤ -0.15` | **Bearish** |
| Contradiction > 0.10 and \|avg_sentiment\| < 0.30 | **Mixed** |
| Otherwise | **Neutral** |
### Trend strength
`strength = min(|avg_sentiment|, 1.0)` — the absolute magnitude of the weighted sentiment, clamped to [0, 1].
### Contradiction score
Measures disagreement among signals:
```
contradiction = minority_side_weight / total_weight
```
Where minority side is whichever of positive or negative has less total weight. A score of 0 means full agreement; approaching 0.5 means equal-weight disagreement.
The system also runs multi-dimensional contradiction detection (`services/aggregation/contradiction.py`):
- **Sentiment disagreement** — the core positive-vs-negative split
- **Catalyst disagreement** — same catalyst type with opposing sentiment from different documents
### Confidence
Derived from four factors:
- **Unique source count** — more distinct documents = higher confidence (caps at 15 unique sources for 0.8 contribution)
- **Average extraction confidence** — from the model's self-assessed quality
- **Signal agreement** — fraction of signals pointing the same direction, dampened by sample size (log₂ scaling, saturates around 7 unique sources)
- **Contradiction penalty** — `contradiction_score × 0.4` subtracted
### Evidence ranking
Supporting and opposing documents are ranked by a composite score considering weight, impact, recency, and confidence — not just raw weight. The top 10 of each are stored for citation.
### Catalysts and risks
Dominant catalyst types are ranked by cumulative signal weight. Material risks are deduplicated and ordered by the weight of the signal that surfaced them.
### Persistence
The assembled `TrendSummary` is upserted into the `trend_windows` table (one row per entity × window, updated each cycle). A snapshot is also appended to `trend_history` for time-series charting. Evidence mappings go into `trend_evidence` with per-document rank scores and component breakdowns.
---
## 9. Trend Projections
After assembling the current trend, the engine computes a forward-looking projection (`services/aggregation/projection.py`):
- **Macro decay** — projects macro event impact forward with exponential decay based on estimated duration and severity
- **Momentum** — trend momentum from recent price action
- **Driving factors** — lists key macro events, competitive patterns, and market conditions
- **Divergence detection** — flags when the projection diverges from the current trend direction
Output: `TrendProjection` with `projected_direction`, `projected_strength`, `projected_confidence`, `projection_horizon`, `driving_factors`, and `diverges_from_current`. Projections with confidence below 0.3 are flagged as `low_confidence` and excluded from thesis generation.
---
## 10. Recommendation Generation
The recommendation service (`services/recommendation/worker.py`) turns trend summaries into actionable recommendations.
### Step 1: Data quality suppression (`services/recommendation/suppression.py`)
Before any eligibility check, the system evaluates the quality of the underlying data:
| Check | Threshold | Effect |
|---|---|---|
| Average extraction confidence | < 0.40 | Suppress |
| Evidence staleness | > 168 hours (7 days) | Suppress |
| Source type diversity | < 1 distinct type | Suppress |
| Extraction failure rate | > 50% | Suppress |
| Valid document count | < 2 | Suppress |
| Overall data quality score | < 0.30 | Suppress |
The data quality score is a weighted composite: 40% extraction confidence + 30% evidence freshness + 30% document coverage.
**Safety suppression** — two additional rules prevent trading on thin evidence from a single signal layer:
- **Macro-only suppression**: If the trend direction is driven solely by macro signals with zero company-specific evidence, the recommendation is forced to informational mode.
- **Pattern-only suppression**: Same rule for pattern/competitive signals with no company or macro support.
### Step 2: Eligibility evaluation (`services/recommendation/eligibility.py`)
Deterministic rules — no model involvement:
**Gate checks** (any failure → no recommendation):
- Confidence ≥ 0.35
- Trend strength ≥ 0.10
- Contradiction score ≤ 0.60
- Evidence count ≥ 2
- Direction ≠ neutral
**Action mapping:**
- Strong bullish (strength ≥ 0.25) → **BUY**
- Strong bearish (strength ≥ 0.25) → **SELL**
- Weak but directional + decent confidence (≥ 0.50) → **HOLD**
- Everything else → **WATCH**
**Mode escalation:**
- WATCH and HOLD → always **informational** (no trades)
- BUY/SELL with confidence ≥ 0.70, contradiction ≤ 0.25, evidence ≥ 5 → **live_eligible**
- BUY/SELL with confidence ≥ 0.50 → **paper_eligible**
- Below that → **informational**
### Step 3: Position sizing
Computed from signal quality:
```
raw_portfolio_pct = base (1%) + confidence × strength × range (up to 10%)
```
Adjusted by:
- Contradiction penalty (higher contradiction → smaller position)
- Evidence count penalty (< 3 docs → 50% reduction, < 5 docs → 75%)
- Max loss percentage scales similarly (base 0.3% up to 2%)
### Step 4: Thesis generation
Two layers:
1. **Deterministic thesis** — assembled from trend direction, strength, catalysts, risks, contradiction notes, projection info, and the recommended action. Always generated.
2. **Optional LLM rewrite** (`services/recommendation/thesis_llm.py`) — for trading-eligible recommendations only, the deterministic thesis is rewritten into analyst-quality prose via Ollama. This is cosmetic; the underlying decision is unchanged.
### Step 5: Risk classification
Based on contradiction score, confidence, evidence count, and mode:
- `low` — high confidence, low contradiction, strong evidence
- `moderate` — decent signals with some uncertainty
- `high` — notable contradiction or low evidence
- `very_high` — multiple risk factors present
The thesis is prefixed with the risk label: `[risk:moderate] AAPL shows a bullish trend...`
### Step 6: Persistence
- `recommendations` table — the full recommendation record
- `recommendation_evidence` table — per-document citations with weights and evidence types
- `risk_evaluations` table — the eligibility decision, risk checks, and full decision trace
---
## 11. Trading Engine Decision Loop
The trading engine (`services/trading/engine.py`) polls the `recommendations` table every 60 seconds for actionable recommendations (`action IN ('buy', 'sell')` and `mode IN ('paper_eligible', 'live_eligible')`).
### Pre-trade checks (in order, first failure short-circuits)
1. **Circuit breaker** — is the daily loss cap or single-position loss cap breached? If so, all trading halts.
2. **Trading window** — is the market open? Outside market hours, skip.
3. **Confidence gate** — does the recommendation meet the active risk tier's minimum confidence?
4. **Deduplication** — has this recommendation already been processed?
5. **Declining positions** — are there multiple open positions currently declining?
6. **Max open positions** — is the portfolio at capacity?
### Position sizing (`services/trading/position_sizer.py`)
Computes the dollar amount and share quantity:
- Confidence-based scaling (sample-size-dampened agreement scoring)
- Risk tier adjustment (conservative / moderate / aggressive)
- Portfolio heat check (sector concentration, correlation)
- Active pool available capital
- Absolute position cap
### Stop-loss and take-profit (`services/trading/stop_loss_manager.py`)
- Stop-loss = entry price (ATR × atr_multiplier)
- Take-profit = entry price + (ATR × atr_multiplier × reward_risk_ratio)
- Trailing stops activate for open positions
### Additional checks
- **Correlation-aware diversification** — reject positions that would push portfolio correlation above threshold
- **Earnings calendar awareness** — reduce size or skip if earnings are within 2 days
- **Gradual entry** — large positions (> $30) split into 3 tranches over time
- **Reserve pool** — profits from closed positions siphon into an emergency liquidity reserve
### Risk tier auto-adjustment (`services/trading/risk_tier_controller.py`)
Daily evaluation of Sharpe ratio, drawdown, and win rate. The engine auto-adjusts between conservative, moderate, and aggressive tiers. The new tier is persisted to `risk_configs` and takes effect on the next cycle.
### Output: Trading Decision
Every evaluation produces a `TradingDecision` record persisted for audit:
- `decision`: act or skip
- `skip_reason`: which check failed (if any)
- `computed_position_size`, `computed_share_quantity`
- `risk_tier_at_decision`, `portfolio_heat_at_decision`, `active_pool_at_decision`
- `circuit_breaker_status`, `correlation_check_result`, `sector_exposure_check_result`
- `earnings_proximity_flag`, `is_micro_trade`, `decision_trace`
If the decision is **act**, an order job is pushed to the Redis broker queue (`stonks:queue:broker`) with ticker, action, quantity, and order type.
---
## 12. The Complete Data Flow (Summary)
```
Document (article/filing/transcript)
Ollama extraction (JSON)
│ ├─ JSON repair (json-repair library)
│ └─ Pydantic validation + semantic checks
document_intelligence + document_impact_records (PostgreSQL)
│ └─ Raw prompts/responses → MinIO (audit)
├──────────────────────────────────────────────────┐
│ │
▼ ▼
Layer 1: Company signals Layer 2: Macro signals
(impact records → WeightedSignal) (global_events → exposure matching
→ macro_impact_records → WeightedSignal)
│ │
│ Layer 3: Competitive signals │
│ (pattern mining + propagation │
│ → competitive_signal_records │
│ → WeightedSignal) │
│ │ │
└───────────┬───────────────┘───────────────────────┘
Signal merging (concatenate all WeightedSignals)
Trend summary assembly
(weighted sentiment → direction, strength, confidence,
contradiction, evidence ranking, catalysts, risks)
├─→ trend_windows (PostgreSQL)
├─→ trend_history (time-series)
└─→ trend_evidence (per-document rankings)
Trend projection (forward-looking)
Data quality suppression
(extraction confidence, staleness, diversity,
macro-only / pattern-only safety)
Eligibility evaluation
(gate checks → action mapping → mode escalation → position sizing)
Thesis generation + risk classification
├─→ recommendations (PostgreSQL)
├─→ recommendation_evidence
└─→ risk_evaluations
Trading engine decision loop
(pre-trade checks → position sizing → stop-loss →
correlation → earnings → gradual entry)
├─→ trading_decisions (PostgreSQL, audit)
└─→ stonks:queue:broker (Redis, order execution)
```
---
## 13. Key Database Tables
| Table | Stage | Purpose |
|---|---|---|
| `documents` | Ingestion | Raw ingested content |
| `document_intelligence` | Extraction | Ollama extraction output |
| `document_impact_records` | Extraction | Per-company impact from a document |
| `global_events` | Macro | Classified macro/geopolitical events |
| `exposure_profiles` | Macro | Company exposure data (geography, supply chain, commodities) |
| `macro_impact_records` | Macro | Per-company macro impact scores |
| `competitor_relationships` | Competitive | Company relationship graph |
| `competitive_signal_records` | Competitive | Cross-company propagated signals |
| `trend_windows` | Aggregation | Current trend summaries (upserted each cycle) |
| `trend_history` | Aggregation | Time-series snapshots for charting |
| `trend_evidence` | Aggregation | Per-document evidence rankings |
| `trend_projections` | Projection | Forward-looking trend projections |
| `recommendations` | Recommendation | Trade recommendations |
| `recommendation_evidence` | Recommendation | Per-document citations |
| `risk_evaluations` | Recommendation | Eligibility decisions and risk checks |
| `risk_configs` | Runtime | Toggle switches and risk tier configuration |
| `trading_decisions` | Trading | Pre-trade evaluation audit trail |
| `positions` | Trading | Open positions |
| `orders` | Trading | Broker orders |
| `fills` | Trading | Order fills |
---
## 14. Audit Trail
Every stage preserves full context for reproducibility:
- **Extraction**: raw Ollama response, repair steps, validation errors, all retry attempts
- **Aggregation**: per-signal weight breakdowns (recency, credibility, novelty, market context), contradiction details by dimension
- **Recommendation**: deterministic thesis, evidence citations with weights, eligibility decision trace, risk evaluation
- **Trading**: every pre-trade check result, position sizing breakdown, risk tier at decision time, full decision trace
- **Execution**: order details, fills, P&L, performance metrics