Files

T

Celes Renata c85c0068a2 fix: clean up utcnow deprecation warnings, fix 12 failing tests, add CI/CD pipeline manifests

- Replace all datetime.utcnow() with datetime.now(tz=timezone.utc) across 8 files
- Fix 12 failing tests to match current implementation behavior
- Fix pytest_plugins in non-top-level conftest (moved to root conftest.py)
- Auto-fix 189 lint issues (import sorting, unused imports)
- Add CI/CD pipeline infrastructure (ARC, ArgoCD, Kargo manifests)
- Add values-beta.yaml and values-paper.yaml for staged deployments
- Update GitHub Actions workflow to use self-hosted-gremlin runners
- Add integration-test job to CI pipeline

Result: 1596 passed, 0 failed, 0 warnings

2026-04-18 03:59:28 +00:00

25 KiB

Raw Blame History

From Model Output to Trade: The Full Pipeline

This document traces the complete journey of data through Stonks Oracle — from the moment an Ollama model produces structured JSON, through signal scoring and aggregation, to the final trading decision.

1. Document Ingestion

Before the model ever sees a document, the ingestion layer fetches raw content from configured sources (news APIs, SEC filings, earnings transcripts, press releases). Each document lands in the documents table with a status, type, and published_at timestamp. A Redis queue (stonks:queue:extraction) feeds documents to the extractor service.

2. Prompting the Model

The extractor service (services/extractor/client.py) sends each document to a local Ollama instance via POST /api/chat.

System prompt

A short, strict instruction set:

You are a financial document analyst. Extract structured data as JSON. Return ONLY a single JSON object. No markdown fences, no explanation, no text before or after the JSON. Every field in the schema is required. Use "other" for catalyst_type if unsure. Keep evidence_spans short (under 20 words each). Keep key_facts to 3-5 items max.

User prompt

Built dynamically per document (services/extractor/prompts.py). It includes:

Document type guidance — tailored instructions for articles, filings, transcripts, and press releases. For example, filings get: "Extract concrete financial figures, risk factors, and material events as stated." Transcripts get: "Distinguish between management forward-looking statements and reported results."
Tracked ticker hints — the list of 50 tracked tickers, with rules: if a ticker appears verbatim in the text, the model must include it; if a sector theme clearly affects a tracked company, include it; never invent tickers outside the list.
Field-by-field instructions — what each output field means and its valid range.
Document text — truncated to 8,000 characters to keep inference fast.

Ollama call parameters

think=false (speed over chain-of-thought)
num_predict=4096 (max output tokens)
Optional num_ctx override for longer documents
The JSON schema (generated from Pydantic models) is passed as the format parameter for structured output

3. Model Output: The JSON Contract

The model returns a single JSON object matching the ExtractionResult schema (services/extractor/schemas.py):

{
  "summary": "Apple reported record Q4 earnings driven by iPhone 16 demand.",
  "companies": [
    {
      "ticker": "AAPL",
      "company_name": "Apple Inc.",
      "relevance": 0.95,
      "sentiment": "positive",
      "impact_score": 0.8,
      "impact_horizon": "1d_7d",
      "catalyst_type": "earnings",
      "key_facts": [
        "Revenue up 12% YoY to $94.9B",
        "iPhone revenue grew 18%",
        "Services hit all-time high"
      ],
      "risks": [
        "China market softness noted by management"
      ],
      "evidence_spans": [
        "record quarterly revenue of $94.9 billion",
        "iPhone revenue grew 18 percent year over year"
      ]
    }
  ],
  "macro_themes": ["consumer_spending", "ai_capex"],
  "novelty_score": 0.6,
  "confidence": 0.85,
  "extraction_warnings": []
}

Field definitions

Field	Type	Range	Purpose
`summary`	string	—	1-3 sentence document summary
`companies[]`	array	—	Per-company intelligence (one entry per affected company)
`.ticker`	string	—	Stock ticker symbol
`.relevance`	float	0-1	How central this company is to the document
`.sentiment`	enum	positive / negative / neutral / mixed	Overall sentiment toward the company
`.impact_score`	float	0-1	Estimated magnitude of impact (0 = negligible, 1 = highly material)
`.impact_horizon`	string	intraday / 1d / 1d_7d / 1d_30d / 30d_90d / 90d_plus	When the impact is expected to manifest
`.catalyst_type`	enum	earnings / product / legal / macro / supply_chain / m_and_a / rating_change / other	Primary catalyst category
`.key_facts`	string[]	—	Facts explicitly stated in the document (no fabrication)
`.risks`	string[]	—	Risks explicitly mentioned
`.evidence_spans`	string[]	—	Short verbatim quotes supporting the analysis
`macro_themes`	string[]	—	Broad economic themes (rates, inflation, ai_capex, etc.)
`novelty_score`	float	0-1	How surprising the information is
`confidence`	float	0-1	Model's self-assessed extraction quality
`extraction_warnings`	string[]	—	Issues encountered (ambiguous_ticker, incomplete_text, etc.)

4. JSON Repair and Validation

The raw model output goes through two stages before it's trusted.

4a. JSON repair (`services/extractor/client.py`)

Ollama's format constraint is unreliable with think=false on certain models (Ollama bug #14645). The extractor handles this:

Try json.loads() directly — if it parses, use it as-is.
Strip markdown fences (```json ... ```) if present.
Fall back to the json-repair library, which fixes trailing commas, unterminated strings, and control characters.

4b. Structural + semantic validation (`services/extractor/schemas.py`)

Structural validation — parse the JSON against the ExtractionResult Pydantic model. Missing required fields, wrong types, or out-of-range values fail here.
Semantic validation — cross-field consistency checks:
- Ticker format validation
- Evidence span length checks
- Catalyst type alias normalization (maps variants to canonical enum values)
- Impact horizon normalization
Returns a ValidationReport with the parsed result or a list of errors.

4c. Retry logic

If validation fails and the error is retryable (not an HTTP 4xx client error), the extractor retries up to max_retries times (default 2) with exponential backoff. Every attempt — raw output, validation result, error, duration — is preserved in the ExtractionResponse.attempts list for audit.

5. Persistence: Document Intelligence and Impact Records

A successful extraction produces two sets of database records.

Document intelligence (`document_intelligence` table)

One row per document:

document_id, document_type, summary, companies (JSONB), macro_themes
novelty_score, source_credibility, confidence, extraction_warnings
validation_status (valid/failed)
model metadata: provider, model_name, prompt_version, schema_version

Per-company impact records (`document_impact_records` table)

One row per company mentioned in the extraction:

ticker, company_name, relevance, sentiment, impact_score, impact_horizon, catalyst_type
key_facts, risks, evidence_spans (all JSONB)
Links back to document_intelligence via intelligence_id

Raw artifact storage (MinIO)

Full prompts and raw model responses are stored in MinIO buckets (stonks-llm-prompts, stonks-llm-results) keyed by document_id, so any extraction can be replayed or audited.

6. Signal Scoring: Turning Records into Weighted Signals

The aggregation engine (services/aggregation/worker.py) converts raw impact records into WeightedSignal objects. Each signal carries a composite weight that determines how much it influences the final trend.

Weight components (`services/aggregation/scoring.py`)

The combined weight is:

combined = gate × recency × credibility × (1 + novelty_bonus) × market_context_multiplier

Component	Formula	Purpose
Confidence gate	0 if extraction confidence < 0.2, else 1	Reject unreliable extractions entirely
Recency decay	`2^(-age_hours / half_life)`, min 0.01	Exponential decay — newer documents matter more. Half-lives: intraday=2h, 1d=12h, 7d=72h, 30d=240h, 90d=720h
Credibility	`source_credibility ^ exponent`, clamped [0.1, 1.0]	Source quality weighting
Novelty bonus	`novelty_score × 0.25`	Novel information gets up to 25% boost
Market context	Volatility boost (up to +30%) + volume surge boost (+15%)	Fast-moving, high-volume markets amplify fresh signals

Each WeightedSignal also carries:

sentiment_value: +1.0 (positive), -1.0 (negative), 0.0 (neutral/mixed)
impact_score: the extraction's impact magnitude
document_id: for evidence tracing

7. Three Signal Layers

The aggregation engine merges signals from three independent layers. Each layer can be toggled on/off at runtime via the risk_configs table — no restart needed.

Layer 1: Company-specific signals (always active)

Direct document intelligence about a company. This is the core layer — document_impact_records for the ticker, scored as described in §6.

Layer 2: Macro signals (toggle: `macro_enabled`)

Global events that affect companies through exposure profiles.

Flow:

The macro service classifies global events (from news) using Ollama — extracting event type, severity, affected regions/sectors/commodities.
Each company has an exposure profile (exposure_profiles table): geographic revenue mix, supply chain regions, commodity dependencies, market position tier.
Overlap scoring computes how much a global event overlaps with a company's exposure (geographic, supply chain, commodity dimensions).
A resilience modifier based on market position tier (global leaders are more resilient than domestic companies) adjusts the score.
The final macro_impact_score = base_score × overlap_factor × resilience_modifier.
Events older than 48 hours get accelerated staleness decay.

Macro signals are converted to WeightedSignal objects with:

sentiment_value mapped from impact_direction (positive → +1, negative → -1)
impact_score = macro_impact_score × macro_signal_weight (default weight: 0.3)
Recency decay from the global event's publication time

Layer 3: Competitive signals (toggle: `competitive_enabled`)

Historical patterns and cross-company signal propagation.

Flow:

Self-company pattern mining (services/aggregation/pattern_matcher.py): For each catalyst type in the current impact records, query historical outcomes for this ticker. Lookback: 180 days for routine signals, 365 days for major decisions (1.3× weight multiplier). Produces HistoricalPattern objects with bullish_pct, bearish_pct, avg_strength, pattern_confidence.
Cross-company propagation (services/aggregation/signal_propagation.py): When company A has a catalyst, look up its competitors via the competitor_relationships table (46 relationships across 50 companies). For each competitor, query cross-company historical patterns. Signal strength = avg_strength × relationship_strength × pattern_confidence × impact_score. Direction = majority historical outcome (bullish or bearish).
Competitive signals are converted to WeightedSignal objects with:
- impact_score = signal_strength × competitive_signal_weight (default weight: 0.2)
- Recency decay from the pattern's most recent data point or the signal's computed_at time

Merging

All three layers produce WeightedSignal objects with the same structure. The aggregation engine simply concatenates them into a single list before computing the trend summary. The relative influence of each layer is controlled by the macro_signal_weight (0.3) and competitive_signal_weight (0.2) multipliers applied to their impact scores.

8. Trend Summary Assembly

From the merged signal list, the aggregation engine computes a TrendSummary for each ticker × window combination (intraday, 1d, 7d, 30d, 90d).

Weighted sentiment average

avg_sentiment = Σ(sentiment_value × combined_weight × impact_score) / Σ(combined_weight × impact_score)

Trend direction

Condition	Direction
`avg_sentiment ≥ 0.15`	Bullish
`avg_sentiment ≤ -0.15`	Bearish
Contradiction > 0.10 and \|avg_sentiment\| < 0.30	Mixed
Otherwise	Neutral

Trend strength

strength = min(|avg_sentiment|, 1.0) — the absolute magnitude of the weighted sentiment, clamped to [0, 1].

Contradiction score

Measures disagreement among signals:

contradiction = minority_side_weight / total_weight

Where minority side is whichever of positive or negative has less total weight. A score of 0 means full agreement; approaching 0.5 means equal-weight disagreement.

The system also runs multi-dimensional contradiction detection (services/aggregation/contradiction.py):

Sentiment disagreement — the core positive-vs-negative split
Catalyst disagreement — same catalyst type with opposing sentiment from different documents

Confidence

Derived from four factors:

Unique source count — more distinct documents = higher confidence (caps at 15 unique sources for 0.8 contribution)
Average extraction confidence — from the model's self-assessed quality
Signal agreement — fraction of signals pointing the same direction, dampened by sample size (log₂ scaling, saturates around 7 unique sources)
Contradiction penalty — contradiction_score × 0.4 subtracted

Evidence ranking

Supporting and opposing documents are ranked by a composite score considering weight, impact, recency, and confidence — not just raw weight. The top 10 of each are stored for citation.

Catalysts and risks

Dominant catalyst types are ranked by cumulative signal weight. Material risks are deduplicated and ordered by the weight of the signal that surfaced them.

Persistence

The assembled TrendSummary is upserted into the trend_windows table (one row per entity × window, updated each cycle). A snapshot is also appended to trend_history for time-series charting. Evidence mappings go into trend_evidence with per-document rank scores and component breakdowns.

9. Trend Projections

After assembling the current trend, the engine computes a forward-looking projection (services/aggregation/projection.py):

Macro decay — projects macro event impact forward with exponential decay based on estimated duration and severity
Momentum — trend momentum from recent price action
Driving factors — lists key macro events, competitive patterns, and market conditions
Divergence detection — flags when the projection diverges from the current trend direction

Output: TrendProjection with projected_direction, projected_strength, projected_confidence, projection_horizon, driving_factors, and diverges_from_current. Projections with confidence below 0.3 are flagged as low_confidence and excluded from thesis generation.

10. Recommendation Generation

The recommendation service (services/recommendation/worker.py) turns trend summaries into actionable recommendations.

Step 1: Data quality suppression (`services/recommendation/suppression.py`)

Before any eligibility check, the system evaluates the quality of the underlying data:

Check	Threshold	Effect
Average extraction confidence	< 0.40	Suppress
Evidence staleness	> 168 hours (7 days)	Suppress
Source type diversity	< 1 distinct type	Suppress
Extraction failure rate	> 50%	Suppress
Valid document count	< 2	Suppress
Overall data quality score	< 0.30	Suppress

The data quality score is a weighted composite: 40% extraction confidence + 30% evidence freshness + 30% document coverage.

Safety suppression — two additional rules prevent trading on thin evidence from a single signal layer:

Macro-only suppression: If the trend direction is driven solely by macro signals with zero company-specific evidence, the recommendation is forced to informational mode.
Pattern-only suppression: Same rule for pattern/competitive signals with no company or macro support.

Step 2: Eligibility evaluation (`services/recommendation/eligibility.py`)

Deterministic rules — no model involvement:

Gate checks (any failure → no recommendation):

Confidence ≥ 0.35
Trend strength ≥ 0.10
Contradiction score ≤ 0.60
Evidence count ≥ 2
Direction ≠ neutral

Action mapping:

Strong bullish (strength ≥ 0.25) → BUY
Strong bearish (strength ≥ 0.25) → SELL
Weak but directional + decent confidence (≥ 0.50) → HOLD
Everything else → WATCH

Mode escalation:

WATCH and HOLD → always informational (no trades)
BUY/SELL with confidence ≥ 0.70, contradiction ≤ 0.25, evidence ≥ 5 → live_eligible
BUY/SELL with confidence ≥ 0.50 → paper_eligible
Below that → informational

Step 3: Position sizing

Computed from signal quality:

raw_portfolio_pct = base (1%) + confidence × strength × range (up to 10%)

Adjusted by:

Contradiction penalty (higher contradiction → smaller position)
Evidence count penalty (< 3 docs → 50% reduction, < 5 docs → 75%)
Max loss percentage scales similarly (base 0.3% up to 2%)

Step 4: Thesis generation

Two layers:

Deterministic thesis — assembled from trend direction, strength, catalysts, risks, contradiction notes, projection info, and the recommended action. Always generated.
Optional LLM rewrite (services/recommendation/thesis_llm.py) — for trading-eligible recommendations only, the deterministic thesis is rewritten into analyst-quality prose via Ollama. This is cosmetic; the underlying decision is unchanged.

Step 5: Risk classification

Based on contradiction score, confidence, evidence count, and mode:

low — high confidence, low contradiction, strong evidence
moderate — decent signals with some uncertainty
high — notable contradiction or low evidence
very_high — multiple risk factors present

The thesis is prefixed with the risk label: [risk:moderate] AAPL shows a bullish trend...

Step 6: Persistence

recommendations table — the full recommendation record
recommendation_evidence table — per-document citations with weights and evidence types
risk_evaluations table — the eligibility decision, risk checks, and full decision trace

11. Trading Engine Decision Loop

The trading engine (services/trading/engine.py) polls the recommendations table every 60 seconds for actionable recommendations (action IN ('buy', 'sell') and mode IN ('paper_eligible', 'live_eligible')).

Pre-trade checks (in order, first failure short-circuits)

Circuit breaker — is the daily loss cap or single-position loss cap breached? If so, all trading halts.
Trading window — is the market open? Outside market hours, skip.
Confidence gate — does the recommendation meet the active risk tier's minimum confidence?
Deduplication — has this recommendation already been processed?
Declining positions — are there multiple open positions currently declining?
Max open positions — is the portfolio at capacity?

Position sizing (`services/trading/position_sizer.py`)

Computes the dollar amount and share quantity:

Confidence-based scaling (sample-size-dampened agreement scoring)
Risk tier adjustment (conservative / moderate / aggressive)
Portfolio heat check (sector concentration, correlation)
Active pool available capital
Absolute position cap

Stop-loss and take-profit (`services/trading/stop_loss_manager.py`)

Stop-loss = entry price − (ATR × atr_multiplier)
Take-profit = entry price + (ATR × atr_multiplier × reward_risk_ratio)
Trailing stops activate for open positions

Additional checks

Correlation-aware diversification — reject positions that would push portfolio correlation above threshold
Earnings calendar awareness — reduce size or skip if earnings are within 2 days
Gradual entry — large positions (> $30) split into 3 tranches over time
Reserve pool — profits from closed positions siphon into an emergency liquidity reserve

Risk tier auto-adjustment (`services/trading/risk_tier_controller.py`)

Daily evaluation of Sharpe ratio, drawdown, and win rate. The engine auto-adjusts between conservative, moderate, and aggressive tiers. The new tier is persisted to risk_configs and takes effect on the next cycle.

Output: Trading Decision

Every evaluation produces a TradingDecision record persisted for audit:

decision: act or skip
skip_reason: which check failed (if any)
computed_position_size, computed_share_quantity
risk_tier_at_decision, portfolio_heat_at_decision, active_pool_at_decision
circuit_breaker_status, correlation_check_result, sector_exposure_check_result
earnings_proximity_flag, is_micro_trade, decision_trace

If the decision is act, an order job is pushed to the Redis broker queue (stonks:queue:broker) with ticker, action, quantity, and order type.

12. The Complete Data Flow (Summary)

Document (article/filing/transcript)
  │
  ▼
Ollama extraction (JSON)
  │  ├─ JSON repair (json-repair library)
  │  └─ Pydantic validation + semantic checks
  │
  ▼
document_intelligence + document_impact_records (PostgreSQL)
  │  └─ Raw prompts/responses → MinIO (audit)
  │
  ├──────────────────────────────────────────────────┐
  │                                                  │
  ▼                                                  ▼
Layer 1: Company signals              Layer 2: Macro signals
(impact records → WeightedSignal)     (global_events → exposure matching
                                       → macro_impact_records → WeightedSignal)
  │                                                  │
  │                    Layer 3: Competitive signals   │
  │                    (pattern mining + propagation  │
  │                     → competitive_signal_records  │
  │                     → WeightedSignal)             │
  │                           │                       │
  └───────────┬───────────────┘───────────────────────┘
              │
              ▼
       Signal merging (concatenate all WeightedSignals)
              │
              ▼
       Trend summary assembly
       (weighted sentiment → direction, strength, confidence,
        contradiction, evidence ranking, catalysts, risks)
              │
              ├─→ trend_windows (PostgreSQL)
              ├─→ trend_history (time-series)
              └─→ trend_evidence (per-document rankings)
              │
              ▼
       Trend projection (forward-looking)
              │
              ▼
       Data quality suppression
       (extraction confidence, staleness, diversity,
        macro-only / pattern-only safety)
              │
              ▼
       Eligibility evaluation
       (gate checks → action mapping → mode escalation → position sizing)
              │
              ▼
       Thesis generation + risk classification
              │
              ├─→ recommendations (PostgreSQL)
              ├─→ recommendation_evidence
              └─→ risk_evaluations
              │
              ▼
       Trading engine decision loop
       (pre-trade checks → position sizing → stop-loss →
        correlation → earnings → gradual entry)
              │
              ├─→ trading_decisions (PostgreSQL, audit)
              └─→ stonks:queue:broker (Redis, order execution)

13. Key Database Tables

Table	Stage	Purpose
`documents`	Ingestion	Raw ingested content
`document_intelligence`	Extraction	Ollama extraction output
`document_impact_records`	Extraction	Per-company impact from a document
`global_events`	Macro	Classified macro/geopolitical events
`exposure_profiles`	Macro	Company exposure data (geography, supply chain, commodities)
`macro_impact_records`	Macro	Per-company macro impact scores
`competitor_relationships`	Competitive	Company relationship graph
`competitive_signal_records`	Competitive	Cross-company propagated signals
`trend_windows`	Aggregation	Current trend summaries (upserted each cycle)
`trend_history`	Aggregation	Time-series snapshots for charting
`trend_evidence`	Aggregation	Per-document evidence rankings
`trend_projections`	Projection	Forward-looking trend projections
`recommendations`	Recommendation	Trade recommendations
`recommendation_evidence`	Recommendation	Per-document citations
`risk_evaluations`	Recommendation	Eligibility decisions and risk checks
`risk_configs`	Runtime	Toggle switches and risk tier configuration
`trading_decisions`	Trading	Pre-trade evaluation audit trail
`positions`	Trading	Open positions
`orders`	Trading	Broker orders
`fills`	Trading	Order fills

14. Audit Trail

Every stage preserves full context for reproducibility:

Extraction: raw Ollama response, repair steps, validation errors, all retry attempts
Aggregation: per-signal weight breakdowns (recency, credibility, novelty, market context), contradiction details by dimension
Recommendation: deterministic thesis, evidence citations with weights, eligibility decision trace, risk evaluation
Trading: every pre-trade check result, position sizing breakdown, risk tier at decision time, full decision trace
Execution: order details, fills, P&L, performance metrics

25 KiB Raw Blame History Unescape Escape