Files
stonks-oracle/docs/llm-to-trade-pipeline.md
T
Celes Renata c85c0068a2 fix: clean up utcnow deprecation warnings, fix 12 failing tests, add CI/CD pipeline manifests
- Replace all datetime.utcnow() with datetime.now(tz=timezone.utc) across 8 files
- Fix 12 failing tests to match current implementation behavior
- Fix pytest_plugins in non-top-level conftest (moved to root conftest.py)
- Auto-fix 189 lint issues (import sorting, unused imports)
- Add CI/CD pipeline infrastructure (ARC, ArgoCD, Kargo manifests)
- Add values-beta.yaml and values-paper.yaml for staged deployments
- Update GitHub Actions workflow to use self-hosted-gremlin runners
- Add integration-test job to CI pipeline

Result: 1596 passed, 0 failed, 0 warnings
2026-04-18 03:59:28 +00:00

25 KiB
Raw Blame History

From Model Output to Trade: The Full Pipeline

This document traces the complete journey of data through Stonks Oracle — from the moment an Ollama model produces structured JSON, through signal scoring and aggregation, to the final trading decision.


1. Document Ingestion

Before the model ever sees a document, the ingestion layer fetches raw content from configured sources (news APIs, SEC filings, earnings transcripts, press releases). Each document lands in the documents table with a status, type, and published_at timestamp. A Redis queue (stonks:queue:extraction) feeds documents to the extractor service.


2. Prompting the Model

The extractor service (services/extractor/client.py) sends each document to a local Ollama instance via POST /api/chat.

System prompt

A short, strict instruction set:

You are a financial document analyst. Extract structured data as JSON. Return ONLY a single JSON object. No markdown fences, no explanation, no text before or after the JSON. Every field in the schema is required. Use "other" for catalyst_type if unsure. Keep evidence_spans short (under 20 words each). Keep key_facts to 3-5 items max.

User prompt

Built dynamically per document (services/extractor/prompts.py). It includes:

  • Document type guidance — tailored instructions for articles, filings, transcripts, and press releases. For example, filings get: "Extract concrete financial figures, risk factors, and material events as stated." Transcripts get: "Distinguish between management forward-looking statements and reported results."
  • Tracked ticker hints — the list of 50 tracked tickers, with rules: if a ticker appears verbatim in the text, the model must include it; if a sector theme clearly affects a tracked company, include it; never invent tickers outside the list.
  • Field-by-field instructions — what each output field means and its valid range.
  • Document text — truncated to 8,000 characters to keep inference fast.

Ollama call parameters

  • think=false (speed over chain-of-thought)
  • num_predict=4096 (max output tokens)
  • Optional num_ctx override for longer documents
  • The JSON schema (generated from Pydantic models) is passed as the format parameter for structured output

3. Model Output: The JSON Contract

The model returns a single JSON object matching the ExtractionResult schema (services/extractor/schemas.py):

{
  "summary": "Apple reported record Q4 earnings driven by iPhone 16 demand.",
  "companies": [
    {
      "ticker": "AAPL",
      "company_name": "Apple Inc.",
      "relevance": 0.95,
      "sentiment": "positive",
      "impact_score": 0.8,
      "impact_horizon": "1d_7d",
      "catalyst_type": "earnings",
      "key_facts": [
        "Revenue up 12% YoY to $94.9B",
        "iPhone revenue grew 18%",
        "Services hit all-time high"
      ],
      "risks": [
        "China market softness noted by management"
      ],
      "evidence_spans": [
        "record quarterly revenue of $94.9 billion",
        "iPhone revenue grew 18 percent year over year"
      ]
    }
  ],
  "macro_themes": ["consumer_spending", "ai_capex"],
  "novelty_score": 0.6,
  "confidence": 0.85,
  "extraction_warnings": []
}

Field definitions

Field Type Range Purpose
summary string 1-3 sentence document summary
companies[] array Per-company intelligence (one entry per affected company)
.ticker string Stock ticker symbol
.relevance float 0-1 How central this company is to the document
.sentiment enum positive / negative / neutral / mixed Overall sentiment toward the company
.impact_score float 0-1 Estimated magnitude of impact (0 = negligible, 1 = highly material)
.impact_horizon string intraday / 1d / 1d_7d / 1d_30d / 30d_90d / 90d_plus When the impact is expected to manifest
.catalyst_type enum earnings / product / legal / macro / supply_chain / m_and_a / rating_change / other Primary catalyst category
.key_facts string[] Facts explicitly stated in the document (no fabrication)
.risks string[] Risks explicitly mentioned
.evidence_spans string[] Short verbatim quotes supporting the analysis
macro_themes string[] Broad economic themes (rates, inflation, ai_capex, etc.)
novelty_score float 0-1 How surprising the information is
confidence float 0-1 Model's self-assessed extraction quality
extraction_warnings string[] Issues encountered (ambiguous_ticker, incomplete_text, etc.)

4. JSON Repair and Validation

The raw model output goes through two stages before it's trusted.

4a. JSON repair (services/extractor/client.py)

Ollama's format constraint is unreliable with think=false on certain models (Ollama bug #14645). The extractor handles this:

  1. Try json.loads() directly — if it parses, use it as-is.
  2. Strip markdown fences (```json ... ```) if present.
  3. Fall back to the json-repair library, which fixes trailing commas, unterminated strings, and control characters.

4b. Structural + semantic validation (services/extractor/schemas.py)

  1. Structural validation — parse the JSON against the ExtractionResult Pydantic model. Missing required fields, wrong types, or out-of-range values fail here.
  2. Semantic validation — cross-field consistency checks:
    • Ticker format validation
    • Evidence span length checks
    • Catalyst type alias normalization (maps variants to canonical enum values)
    • Impact horizon normalization
  3. Returns a ValidationReport with the parsed result or a list of errors.

4c. Retry logic

If validation fails and the error is retryable (not an HTTP 4xx client error), the extractor retries up to max_retries times (default 2) with exponential backoff. Every attempt — raw output, validation result, error, duration — is preserved in the ExtractionResponse.attempts list for audit.


5. Persistence: Document Intelligence and Impact Records

A successful extraction produces two sets of database records.

Document intelligence (document_intelligence table)

One row per document:

  • document_id, document_type, summary, companies (JSONB), macro_themes
  • novelty_score, source_credibility, confidence, extraction_warnings
  • validation_status (valid/failed)
  • model metadata: provider, model_name, prompt_version, schema_version

Per-company impact records (document_impact_records table)

One row per company mentioned in the extraction:

  • ticker, company_name, relevance, sentiment, impact_score, impact_horizon, catalyst_type
  • key_facts, risks, evidence_spans (all JSONB)
  • Links back to document_intelligence via intelligence_id

Raw artifact storage (MinIO)

Full prompts and raw model responses are stored in MinIO buckets (stonks-llm-prompts, stonks-llm-results) keyed by document_id, so any extraction can be replayed or audited.


6. Signal Scoring: Turning Records into Weighted Signals

The aggregation engine (services/aggregation/worker.py) converts raw impact records into WeightedSignal objects. Each signal carries a composite weight that determines how much it influences the final trend.

Weight components (services/aggregation/scoring.py)

The combined weight is:

combined = gate × recency × credibility × (1 + novelty_bonus) × market_context_multiplier
Component Formula Purpose
Confidence gate 0 if extraction confidence < 0.2, else 1 Reject unreliable extractions entirely
Recency decay 2^(-age_hours / half_life), min 0.01 Exponential decay — newer documents matter more. Half-lives: intraday=2h, 1d=12h, 7d=72h, 30d=240h, 90d=720h
Credibility source_credibility ^ exponent, clamped [0.1, 1.0] Source quality weighting
Novelty bonus novelty_score × 0.25 Novel information gets up to 25% boost
Market context Volatility boost (up to +30%) + volume surge boost (+15%) Fast-moving, high-volume markets amplify fresh signals

Each WeightedSignal also carries:

  • sentiment_value: +1.0 (positive), -1.0 (negative), 0.0 (neutral/mixed)
  • impact_score: the extraction's impact magnitude
  • document_id: for evidence tracing

7. Three Signal Layers

The aggregation engine merges signals from three independent layers. Each layer can be toggled on/off at runtime via the risk_configs table — no restart needed.

Layer 1: Company-specific signals (always active)

Direct document intelligence about a company. This is the core layer — document_impact_records for the ticker, scored as described in §6.

Layer 2: Macro signals (toggle: macro_enabled)

Global events that affect companies through exposure profiles.

Flow:

  1. The macro service classifies global events (from news) using Ollama — extracting event type, severity, affected regions/sectors/commodities.
  2. Each company has an exposure profile (exposure_profiles table): geographic revenue mix, supply chain regions, commodity dependencies, market position tier.
  3. Overlap scoring computes how much a global event overlaps with a company's exposure (geographic, supply chain, commodity dimensions).
  4. A resilience modifier based on market position tier (global leaders are more resilient than domestic companies) adjusts the score.
  5. The final macro_impact_score = base_score × overlap_factor × resilience_modifier.
  6. Events older than 48 hours get accelerated staleness decay.

Macro signals are converted to WeightedSignal objects with:

  • sentiment_value mapped from impact_direction (positive → +1, negative → -1)
  • impact_score = macro_impact_score × macro_signal_weight (default weight: 0.3)
  • Recency decay from the global event's publication time

Layer 3: Competitive signals (toggle: competitive_enabled)

Historical patterns and cross-company signal propagation.

Flow:

  1. Self-company pattern mining (services/aggregation/pattern_matcher.py): For each catalyst type in the current impact records, query historical outcomes for this ticker. Lookback: 180 days for routine signals, 365 days for major decisions (1.3× weight multiplier). Produces HistoricalPattern objects with bullish_pct, bearish_pct, avg_strength, pattern_confidence.
  2. Cross-company propagation (services/aggregation/signal_propagation.py): When company A has a catalyst, look up its competitors via the competitor_relationships table (46 relationships across 50 companies). For each competitor, query cross-company historical patterns. Signal strength = avg_strength × relationship_strength × pattern_confidence × impact_score. Direction = majority historical outcome (bullish or bearish).
  3. Competitive signals are converted to WeightedSignal objects with:
    • impact_score = signal_strength × competitive_signal_weight (default weight: 0.2)
    • Recency decay from the pattern's most recent data point or the signal's computed_at time

Merging

All three layers produce WeightedSignal objects with the same structure. The aggregation engine simply concatenates them into a single list before computing the trend summary. The relative influence of each layer is controlled by the macro_signal_weight (0.3) and competitive_signal_weight (0.2) multipliers applied to their impact scores.


8. Trend Summary Assembly

From the merged signal list, the aggregation engine computes a TrendSummary for each ticker × window combination (intraday, 1d, 7d, 30d, 90d).

Weighted sentiment average

avg_sentiment = Σ(sentiment_value × combined_weight × impact_score) / Σ(combined_weight × impact_score)

Trend direction

Condition Direction
avg_sentiment ≥ 0.15 Bullish
avg_sentiment ≤ -0.15 Bearish
Contradiction > 0.10 and |avg_sentiment| < 0.30 Mixed
Otherwise Neutral

Trend strength

strength = min(|avg_sentiment|, 1.0) — the absolute magnitude of the weighted sentiment, clamped to [0, 1].

Contradiction score

Measures disagreement among signals:

contradiction = minority_side_weight / total_weight

Where minority side is whichever of positive or negative has less total weight. A score of 0 means full agreement; approaching 0.5 means equal-weight disagreement.

The system also runs multi-dimensional contradiction detection (services/aggregation/contradiction.py):

  • Sentiment disagreement — the core positive-vs-negative split
  • Catalyst disagreement — same catalyst type with opposing sentiment from different documents

Confidence

Derived from four factors:

  • Unique source count — more distinct documents = higher confidence (caps at 15 unique sources for 0.8 contribution)
  • Average extraction confidence — from the model's self-assessed quality
  • Signal agreement — fraction of signals pointing the same direction, dampened by sample size (log₂ scaling, saturates around 7 unique sources)
  • Contradiction penaltycontradiction_score × 0.4 subtracted

Evidence ranking

Supporting and opposing documents are ranked by a composite score considering weight, impact, recency, and confidence — not just raw weight. The top 10 of each are stored for citation.

Catalysts and risks

Dominant catalyst types are ranked by cumulative signal weight. Material risks are deduplicated and ordered by the weight of the signal that surfaced them.

Persistence

The assembled TrendSummary is upserted into the trend_windows table (one row per entity × window, updated each cycle). A snapshot is also appended to trend_history for time-series charting. Evidence mappings go into trend_evidence with per-document rank scores and component breakdowns.


9. Trend Projections

After assembling the current trend, the engine computes a forward-looking projection (services/aggregation/projection.py):

  • Macro decay — projects macro event impact forward with exponential decay based on estimated duration and severity
  • Momentum — trend momentum from recent price action
  • Driving factors — lists key macro events, competitive patterns, and market conditions
  • Divergence detection — flags when the projection diverges from the current trend direction

Output: TrendProjection with projected_direction, projected_strength, projected_confidence, projection_horizon, driving_factors, and diverges_from_current. Projections with confidence below 0.3 are flagged as low_confidence and excluded from thesis generation.


10. Recommendation Generation

The recommendation service (services/recommendation/worker.py) turns trend summaries into actionable recommendations.

Step 1: Data quality suppression (services/recommendation/suppression.py)

Before any eligibility check, the system evaluates the quality of the underlying data:

Check Threshold Effect
Average extraction confidence < 0.40 Suppress
Evidence staleness > 168 hours (7 days) Suppress
Source type diversity < 1 distinct type Suppress
Extraction failure rate > 50% Suppress
Valid document count < 2 Suppress
Overall data quality score < 0.30 Suppress

The data quality score is a weighted composite: 40% extraction confidence + 30% evidence freshness + 30% document coverage.

Safety suppression — two additional rules prevent trading on thin evidence from a single signal layer:

  • Macro-only suppression: If the trend direction is driven solely by macro signals with zero company-specific evidence, the recommendation is forced to informational mode.
  • Pattern-only suppression: Same rule for pattern/competitive signals with no company or macro support.

Step 2: Eligibility evaluation (services/recommendation/eligibility.py)

Deterministic rules — no model involvement:

Gate checks (any failure → no recommendation):

  • Confidence ≥ 0.35
  • Trend strength ≥ 0.10
  • Contradiction score ≤ 0.60
  • Evidence count ≥ 2
  • Direction ≠ neutral

Action mapping:

  • Strong bullish (strength ≥ 0.25) → BUY
  • Strong bearish (strength ≥ 0.25) → SELL
  • Weak but directional + decent confidence (≥ 0.50) → HOLD
  • Everything else → WATCH

Mode escalation:

  • WATCH and HOLD → always informational (no trades)
  • BUY/SELL with confidence ≥ 0.70, contradiction ≤ 0.25, evidence ≥ 5 → live_eligible
  • BUY/SELL with confidence ≥ 0.50 → paper_eligible
  • Below that → informational

Step 3: Position sizing

Computed from signal quality:

raw_portfolio_pct = base (1%) + confidence × strength × range (up to 10%)

Adjusted by:

  • Contradiction penalty (higher contradiction → smaller position)
  • Evidence count penalty (< 3 docs → 50% reduction, < 5 docs → 75%)
  • Max loss percentage scales similarly (base 0.3% up to 2%)

Step 4: Thesis generation

Two layers:

  1. Deterministic thesis — assembled from trend direction, strength, catalysts, risks, contradiction notes, projection info, and the recommended action. Always generated.
  2. Optional LLM rewrite (services/recommendation/thesis_llm.py) — for trading-eligible recommendations only, the deterministic thesis is rewritten into analyst-quality prose via Ollama. This is cosmetic; the underlying decision is unchanged.

Step 5: Risk classification

Based on contradiction score, confidence, evidence count, and mode:

  • low — high confidence, low contradiction, strong evidence
  • moderate — decent signals with some uncertainty
  • high — notable contradiction or low evidence
  • very_high — multiple risk factors present

The thesis is prefixed with the risk label: [risk:moderate] AAPL shows a bullish trend...

Step 6: Persistence

  • recommendations table — the full recommendation record
  • recommendation_evidence table — per-document citations with weights and evidence types
  • risk_evaluations table — the eligibility decision, risk checks, and full decision trace

11. Trading Engine Decision Loop

The trading engine (services/trading/engine.py) polls the recommendations table every 60 seconds for actionable recommendations (action IN ('buy', 'sell') and mode IN ('paper_eligible', 'live_eligible')).

Pre-trade checks (in order, first failure short-circuits)

  1. Circuit breaker — is the daily loss cap or single-position loss cap breached? If so, all trading halts.
  2. Trading window — is the market open? Outside market hours, skip.
  3. Confidence gate — does the recommendation meet the active risk tier's minimum confidence?
  4. Deduplication — has this recommendation already been processed?
  5. Declining positions — are there multiple open positions currently declining?
  6. Max open positions — is the portfolio at capacity?

Position sizing (services/trading/position_sizer.py)

Computes the dollar amount and share quantity:

  • Confidence-based scaling (sample-size-dampened agreement scoring)
  • Risk tier adjustment (conservative / moderate / aggressive)
  • Portfolio heat check (sector concentration, correlation)
  • Active pool available capital
  • Absolute position cap

Stop-loss and take-profit (services/trading/stop_loss_manager.py)

  • Stop-loss = entry price (ATR × atr_multiplier)
  • Take-profit = entry price + (ATR × atr_multiplier × reward_risk_ratio)
  • Trailing stops activate for open positions

Additional checks

  • Correlation-aware diversification — reject positions that would push portfolio correlation above threshold
  • Earnings calendar awareness — reduce size or skip if earnings are within 2 days
  • Gradual entry — large positions (> $30) split into 3 tranches over time
  • Reserve pool — profits from closed positions siphon into an emergency liquidity reserve

Risk tier auto-adjustment (services/trading/risk_tier_controller.py)

Daily evaluation of Sharpe ratio, drawdown, and win rate. The engine auto-adjusts between conservative, moderate, and aggressive tiers. The new tier is persisted to risk_configs and takes effect on the next cycle.

Output: Trading Decision

Every evaluation produces a TradingDecision record persisted for audit:

  • decision: act or skip
  • skip_reason: which check failed (if any)
  • computed_position_size, computed_share_quantity
  • risk_tier_at_decision, portfolio_heat_at_decision, active_pool_at_decision
  • circuit_breaker_status, correlation_check_result, sector_exposure_check_result
  • earnings_proximity_flag, is_micro_trade, decision_trace

If the decision is act, an order job is pushed to the Redis broker queue (stonks:queue:broker) with ticker, action, quantity, and order type.


12. The Complete Data Flow (Summary)

Document (article/filing/transcript)
  │
  ▼
Ollama extraction (JSON)
  │  ├─ JSON repair (json-repair library)
  │  └─ Pydantic validation + semantic checks
  │
  ▼
document_intelligence + document_impact_records (PostgreSQL)
  │  └─ Raw prompts/responses → MinIO (audit)
  │
  ├──────────────────────────────────────────────────┐
  │                                                  │
  ▼                                                  ▼
Layer 1: Company signals              Layer 2: Macro signals
(impact records → WeightedSignal)     (global_events → exposure matching
                                       → macro_impact_records → WeightedSignal)
  │                                                  │
  │                    Layer 3: Competitive signals   │
  │                    (pattern mining + propagation  │
  │                     → competitive_signal_records  │
  │                     → WeightedSignal)             │
  │                           │                       │
  └───────────┬───────────────┘───────────────────────┘
              │
              ▼
       Signal merging (concatenate all WeightedSignals)
              │
              ▼
       Trend summary assembly
       (weighted sentiment → direction, strength, confidence,
        contradiction, evidence ranking, catalysts, risks)
              │
              ├─→ trend_windows (PostgreSQL)
              ├─→ trend_history (time-series)
              └─→ trend_evidence (per-document rankings)
              │
              ▼
       Trend projection (forward-looking)
              │
              ▼
       Data quality suppression
       (extraction confidence, staleness, diversity,
        macro-only / pattern-only safety)
              │
              ▼
       Eligibility evaluation
       (gate checks → action mapping → mode escalation → position sizing)
              │
              ▼
       Thesis generation + risk classification
              │
              ├─→ recommendations (PostgreSQL)
              ├─→ recommendation_evidence
              └─→ risk_evaluations
              │
              ▼
       Trading engine decision loop
       (pre-trade checks → position sizing → stop-loss →
        correlation → earnings → gradual entry)
              │
              ├─→ trading_decisions (PostgreSQL, audit)
              └─→ stonks:queue:broker (Redis, order execution)

13. Key Database Tables

Table Stage Purpose
documents Ingestion Raw ingested content
document_intelligence Extraction Ollama extraction output
document_impact_records Extraction Per-company impact from a document
global_events Macro Classified macro/geopolitical events
exposure_profiles Macro Company exposure data (geography, supply chain, commodities)
macro_impact_records Macro Per-company macro impact scores
competitor_relationships Competitive Company relationship graph
competitive_signal_records Competitive Cross-company propagated signals
trend_windows Aggregation Current trend summaries (upserted each cycle)
trend_history Aggregation Time-series snapshots for charting
trend_evidence Aggregation Per-document evidence rankings
trend_projections Projection Forward-looking trend projections
recommendations Recommendation Trade recommendations
recommendation_evidence Recommendation Per-document citations
risk_evaluations Recommendation Eligibility decisions and risk checks
risk_configs Runtime Toggle switches and risk tier configuration
trading_decisions Trading Pre-trade evaluation audit trail
positions Trading Open positions
orders Trading Broker orders
fills Trading Order fills

14. Audit Trail

Every stage preserves full context for reproducibility:

  • Extraction: raw Ollama response, repair steps, validation errors, all retry attempts
  • Aggregation: per-signal weight breakdowns (recency, credibility, novelty, market context), contradiction details by dimension
  • Recommendation: deterministic thesis, evidence citations with weights, eligibility decision trace, risk evaluation
  • Trading: every pre-trade check result, position sizing breakdown, risk tier at decision time, full decision trace
  • Execution: order details, fills, P&L, performance metrics