- Replace all datetime.utcnow() with datetime.now(tz=timezone.utc) across 8 files - Fix 12 failing tests to match current implementation behavior - Fix pytest_plugins in non-top-level conftest (moved to root conftest.py) - Auto-fix 189 lint issues (import sorting, unused imports) - Add CI/CD pipeline infrastructure (ARC, ArgoCD, Kargo manifests) - Add values-beta.yaml and values-paper.yaml for staged deployments - Update GitHub Actions workflow to use self-hosted-gremlin runners - Add integration-test job to CI pipeline Result: 1596 passed, 0 failed, 0 warnings
25 KiB
From Model Output to Trade: The Full Pipeline
This document traces the complete journey of data through Stonks Oracle — from the moment an Ollama model produces structured JSON, through signal scoring and aggregation, to the final trading decision.
1. Document Ingestion
Before the model ever sees a document, the ingestion layer fetches raw content from configured sources (news APIs, SEC filings, earnings transcripts, press releases). Each document lands in the documents table with a status, type, and published_at timestamp. A Redis queue (stonks:queue:extraction) feeds documents to the extractor service.
2. Prompting the Model
The extractor service (services/extractor/client.py) sends each document to a local Ollama instance via POST /api/chat.
System prompt
A short, strict instruction set:
You are a financial document analyst. Extract structured data as JSON. Return ONLY a single JSON object. No markdown fences, no explanation, no text before or after the JSON. Every field in the schema is required. Use "other" for catalyst_type if unsure. Keep evidence_spans short (under 20 words each). Keep key_facts to 3-5 items max.
User prompt
Built dynamically per document (services/extractor/prompts.py). It includes:
- Document type guidance — tailored instructions for articles, filings, transcripts, and press releases. For example, filings get: "Extract concrete financial figures, risk factors, and material events as stated." Transcripts get: "Distinguish between management forward-looking statements and reported results."
- Tracked ticker hints — the list of 50 tracked tickers, with rules: if a ticker appears verbatim in the text, the model must include it; if a sector theme clearly affects a tracked company, include it; never invent tickers outside the list.
- Field-by-field instructions — what each output field means and its valid range.
- Document text — truncated to 8,000 characters to keep inference fast.
Ollama call parameters
think=false(speed over chain-of-thought)num_predict=4096(max output tokens)- Optional
num_ctxoverride for longer documents - The JSON schema (generated from Pydantic models) is passed as the
formatparameter for structured output
3. Model Output: The JSON Contract
The model returns a single JSON object matching the ExtractionResult schema (services/extractor/schemas.py):
{
"summary": "Apple reported record Q4 earnings driven by iPhone 16 demand.",
"companies": [
{
"ticker": "AAPL",
"company_name": "Apple Inc.",
"relevance": 0.95,
"sentiment": "positive",
"impact_score": 0.8,
"impact_horizon": "1d_7d",
"catalyst_type": "earnings",
"key_facts": [
"Revenue up 12% YoY to $94.9B",
"iPhone revenue grew 18%",
"Services hit all-time high"
],
"risks": [
"China market softness noted by management"
],
"evidence_spans": [
"record quarterly revenue of $94.9 billion",
"iPhone revenue grew 18 percent year over year"
]
}
],
"macro_themes": ["consumer_spending", "ai_capex"],
"novelty_score": 0.6,
"confidence": 0.85,
"extraction_warnings": []
}
Field definitions
| Field | Type | Range | Purpose |
|---|---|---|---|
summary |
string | — | 1-3 sentence document summary |
companies[] |
array | — | Per-company intelligence (one entry per affected company) |
.ticker |
string | — | Stock ticker symbol |
.relevance |
float | 0-1 | How central this company is to the document |
.sentiment |
enum | positive / negative / neutral / mixed | Overall sentiment toward the company |
.impact_score |
float | 0-1 | Estimated magnitude of impact (0 = negligible, 1 = highly material) |
.impact_horizon |
string | intraday / 1d / 1d_7d / 1d_30d / 30d_90d / 90d_plus | When the impact is expected to manifest |
.catalyst_type |
enum | earnings / product / legal / macro / supply_chain / m_and_a / rating_change / other | Primary catalyst category |
.key_facts |
string[] | — | Facts explicitly stated in the document (no fabrication) |
.risks |
string[] | — | Risks explicitly mentioned |
.evidence_spans |
string[] | — | Short verbatim quotes supporting the analysis |
macro_themes |
string[] | — | Broad economic themes (rates, inflation, ai_capex, etc.) |
novelty_score |
float | 0-1 | How surprising the information is |
confidence |
float | 0-1 | Model's self-assessed extraction quality |
extraction_warnings |
string[] | — | Issues encountered (ambiguous_ticker, incomplete_text, etc.) |
4. JSON Repair and Validation
The raw model output goes through two stages before it's trusted.
4a. JSON repair (services/extractor/client.py)
Ollama's format constraint is unreliable with think=false on certain models (Ollama bug #14645). The extractor handles this:
- Try
json.loads()directly — if it parses, use it as-is. - Strip markdown fences (
```json ... ```) if present. - Fall back to the
json-repairlibrary, which fixes trailing commas, unterminated strings, and control characters.
4b. Structural + semantic validation (services/extractor/schemas.py)
- Structural validation — parse the JSON against the
ExtractionResultPydantic model. Missing required fields, wrong types, or out-of-range values fail here. - Semantic validation — cross-field consistency checks:
- Ticker format validation
- Evidence span length checks
- Catalyst type alias normalization (maps variants to canonical enum values)
- Impact horizon normalization
- Returns a
ValidationReportwith the parsed result or a list of errors.
4c. Retry logic
If validation fails and the error is retryable (not an HTTP 4xx client error), the extractor retries up to max_retries times (default 2) with exponential backoff. Every attempt — raw output, validation result, error, duration — is preserved in the ExtractionResponse.attempts list for audit.
5. Persistence: Document Intelligence and Impact Records
A successful extraction produces two sets of database records.
Document intelligence (document_intelligence table)
One row per document:
document_id,document_type,summary,companies(JSONB),macro_themesnovelty_score,source_credibility,confidence,extraction_warningsvalidation_status(valid/failed)modelmetadata: provider, model_name, prompt_version, schema_version
Per-company impact records (document_impact_records table)
One row per company mentioned in the extraction:
ticker,company_name,relevance,sentiment,impact_score,impact_horizon,catalyst_typekey_facts,risks,evidence_spans(all JSONB)- Links back to
document_intelligenceviaintelligence_id
Raw artifact storage (MinIO)
Full prompts and raw model responses are stored in MinIO buckets (stonks-llm-prompts, stonks-llm-results) keyed by document_id, so any extraction can be replayed or audited.
6. Signal Scoring: Turning Records into Weighted Signals
The aggregation engine (services/aggregation/worker.py) converts raw impact records into WeightedSignal objects. Each signal carries a composite weight that determines how much it influences the final trend.
Weight components (services/aggregation/scoring.py)
The combined weight is:
combined = gate × recency × credibility × (1 + novelty_bonus) × market_context_multiplier
| Component | Formula | Purpose |
|---|---|---|
| Confidence gate | 0 if extraction confidence < 0.2, else 1 | Reject unreliable extractions entirely |
| Recency decay | 2^(-age_hours / half_life), min 0.01 |
Exponential decay — newer documents matter more. Half-lives: intraday=2h, 1d=12h, 7d=72h, 30d=240h, 90d=720h |
| Credibility | source_credibility ^ exponent, clamped [0.1, 1.0] |
Source quality weighting |
| Novelty bonus | novelty_score × 0.25 |
Novel information gets up to 25% boost |
| Market context | Volatility boost (up to +30%) + volume surge boost (+15%) | Fast-moving, high-volume markets amplify fresh signals |
Each WeightedSignal also carries:
sentiment_value: +1.0 (positive), -1.0 (negative), 0.0 (neutral/mixed)impact_score: the extraction's impact magnitudedocument_id: for evidence tracing
7. Three Signal Layers
The aggregation engine merges signals from three independent layers. Each layer can be toggled on/off at runtime via the risk_configs table — no restart needed.
Layer 1: Company-specific signals (always active)
Direct document intelligence about a company. This is the core layer — document_impact_records for the ticker, scored as described in §6.
Layer 2: Macro signals (toggle: macro_enabled)
Global events that affect companies through exposure profiles.
Flow:
- The macro service classifies global events (from news) using Ollama — extracting event type, severity, affected regions/sectors/commodities.
- Each company has an exposure profile (
exposure_profilestable): geographic revenue mix, supply chain regions, commodity dependencies, market position tier. - Overlap scoring computes how much a global event overlaps with a company's exposure (geographic, supply chain, commodity dimensions).
- A resilience modifier based on market position tier (global leaders are more resilient than domestic companies) adjusts the score.
- The final
macro_impact_score = base_score × overlap_factor × resilience_modifier. - Events older than 48 hours get accelerated staleness decay.
Macro signals are converted to WeightedSignal objects with:
sentiment_valuemapped fromimpact_direction(positive → +1, negative → -1)impact_score = macro_impact_score × macro_signal_weight(default weight: 0.3)- Recency decay from the global event's publication time
Layer 3: Competitive signals (toggle: competitive_enabled)
Historical patterns and cross-company signal propagation.
Flow:
- Self-company pattern mining (
services/aggregation/pattern_matcher.py): For each catalyst type in the current impact records, query historical outcomes for this ticker. Lookback: 180 days for routine signals, 365 days for major decisions (1.3× weight multiplier). ProducesHistoricalPatternobjects withbullish_pct,bearish_pct,avg_strength,pattern_confidence. - Cross-company propagation (
services/aggregation/signal_propagation.py): When company A has a catalyst, look up its competitors via thecompetitor_relationshipstable (46 relationships across 50 companies). For each competitor, query cross-company historical patterns. Signal strength =avg_strength × relationship_strength × pattern_confidence × impact_score. Direction = majority historical outcome (bullish or bearish). - Competitive signals are converted to
WeightedSignalobjects with:impact_score = signal_strength × competitive_signal_weight(default weight: 0.2)- Recency decay from the pattern's most recent data point or the signal's
computed_attime
Merging
All three layers produce WeightedSignal objects with the same structure. The aggregation engine simply concatenates them into a single list before computing the trend summary. The relative influence of each layer is controlled by the macro_signal_weight (0.3) and competitive_signal_weight (0.2) multipliers applied to their impact scores.
8. Trend Summary Assembly
From the merged signal list, the aggregation engine computes a TrendSummary for each ticker × window combination (intraday, 1d, 7d, 30d, 90d).
Weighted sentiment average
avg_sentiment = Σ(sentiment_value × combined_weight × impact_score) / Σ(combined_weight × impact_score)
Trend direction
| Condition | Direction |
|---|---|
avg_sentiment ≥ 0.15 |
Bullish |
avg_sentiment ≤ -0.15 |
Bearish |
| Contradiction > 0.10 and |avg_sentiment| < 0.30 | Mixed |
| Otherwise | Neutral |
Trend strength
strength = min(|avg_sentiment|, 1.0) — the absolute magnitude of the weighted sentiment, clamped to [0, 1].
Contradiction score
Measures disagreement among signals:
contradiction = minority_side_weight / total_weight
Where minority side is whichever of positive or negative has less total weight. A score of 0 means full agreement; approaching 0.5 means equal-weight disagreement.
The system also runs multi-dimensional contradiction detection (services/aggregation/contradiction.py):
- Sentiment disagreement — the core positive-vs-negative split
- Catalyst disagreement — same catalyst type with opposing sentiment from different documents
Confidence
Derived from four factors:
- Unique source count — more distinct documents = higher confidence (caps at 15 unique sources for 0.8 contribution)
- Average extraction confidence — from the model's self-assessed quality
- Signal agreement — fraction of signals pointing the same direction, dampened by sample size (log₂ scaling, saturates around 7 unique sources)
- Contradiction penalty —
contradiction_score × 0.4subtracted
Evidence ranking
Supporting and opposing documents are ranked by a composite score considering weight, impact, recency, and confidence — not just raw weight. The top 10 of each are stored for citation.
Catalysts and risks
Dominant catalyst types are ranked by cumulative signal weight. Material risks are deduplicated and ordered by the weight of the signal that surfaced them.
Persistence
The assembled TrendSummary is upserted into the trend_windows table (one row per entity × window, updated each cycle). A snapshot is also appended to trend_history for time-series charting. Evidence mappings go into trend_evidence with per-document rank scores and component breakdowns.
9. Trend Projections
After assembling the current trend, the engine computes a forward-looking projection (services/aggregation/projection.py):
- Macro decay — projects macro event impact forward with exponential decay based on estimated duration and severity
- Momentum — trend momentum from recent price action
- Driving factors — lists key macro events, competitive patterns, and market conditions
- Divergence detection — flags when the projection diverges from the current trend direction
Output: TrendProjection with projected_direction, projected_strength, projected_confidence, projection_horizon, driving_factors, and diverges_from_current. Projections with confidence below 0.3 are flagged as low_confidence and excluded from thesis generation.
10. Recommendation Generation
The recommendation service (services/recommendation/worker.py) turns trend summaries into actionable recommendations.
Step 1: Data quality suppression (services/recommendation/suppression.py)
Before any eligibility check, the system evaluates the quality of the underlying data:
| Check | Threshold | Effect |
|---|---|---|
| Average extraction confidence | < 0.40 | Suppress |
| Evidence staleness | > 168 hours (7 days) | Suppress |
| Source type diversity | < 1 distinct type | Suppress |
| Extraction failure rate | > 50% | Suppress |
| Valid document count | < 2 | Suppress |
| Overall data quality score | < 0.30 | Suppress |
The data quality score is a weighted composite: 40% extraction confidence + 30% evidence freshness + 30% document coverage.
Safety suppression — two additional rules prevent trading on thin evidence from a single signal layer:
- Macro-only suppression: If the trend direction is driven solely by macro signals with zero company-specific evidence, the recommendation is forced to informational mode.
- Pattern-only suppression: Same rule for pattern/competitive signals with no company or macro support.
Step 2: Eligibility evaluation (services/recommendation/eligibility.py)
Deterministic rules — no model involvement:
Gate checks (any failure → no recommendation):
- Confidence ≥ 0.35
- Trend strength ≥ 0.10
- Contradiction score ≤ 0.60
- Evidence count ≥ 2
- Direction ≠ neutral
Action mapping:
- Strong bullish (strength ≥ 0.25) → BUY
- Strong bearish (strength ≥ 0.25) → SELL
- Weak but directional + decent confidence (≥ 0.50) → HOLD
- Everything else → WATCH
Mode escalation:
- WATCH and HOLD → always informational (no trades)
- BUY/SELL with confidence ≥ 0.70, contradiction ≤ 0.25, evidence ≥ 5 → live_eligible
- BUY/SELL with confidence ≥ 0.50 → paper_eligible
- Below that → informational
Step 3: Position sizing
Computed from signal quality:
raw_portfolio_pct = base (1%) + confidence × strength × range (up to 10%)
Adjusted by:
- Contradiction penalty (higher contradiction → smaller position)
- Evidence count penalty (< 3 docs → 50% reduction, < 5 docs → 75%)
- Max loss percentage scales similarly (base 0.3% up to 2%)
Step 4: Thesis generation
Two layers:
- Deterministic thesis — assembled from trend direction, strength, catalysts, risks, contradiction notes, projection info, and the recommended action. Always generated.
- Optional LLM rewrite (
services/recommendation/thesis_llm.py) — for trading-eligible recommendations only, the deterministic thesis is rewritten into analyst-quality prose via Ollama. This is cosmetic; the underlying decision is unchanged.
Step 5: Risk classification
Based on contradiction score, confidence, evidence count, and mode:
low— high confidence, low contradiction, strong evidencemoderate— decent signals with some uncertaintyhigh— notable contradiction or low evidencevery_high— multiple risk factors present
The thesis is prefixed with the risk label: [risk:moderate] AAPL shows a bullish trend...
Step 6: Persistence
recommendationstable — the full recommendation recordrecommendation_evidencetable — per-document citations with weights and evidence typesrisk_evaluationstable — the eligibility decision, risk checks, and full decision trace
11. Trading Engine Decision Loop
The trading engine (services/trading/engine.py) polls the recommendations table every 60 seconds for actionable recommendations (action IN ('buy', 'sell') and mode IN ('paper_eligible', 'live_eligible')).
Pre-trade checks (in order, first failure short-circuits)
- Circuit breaker — is the daily loss cap or single-position loss cap breached? If so, all trading halts.
- Trading window — is the market open? Outside market hours, skip.
- Confidence gate — does the recommendation meet the active risk tier's minimum confidence?
- Deduplication — has this recommendation already been processed?
- Declining positions — are there multiple open positions currently declining?
- Max open positions — is the portfolio at capacity?
Position sizing (services/trading/position_sizer.py)
Computes the dollar amount and share quantity:
- Confidence-based scaling (sample-size-dampened agreement scoring)
- Risk tier adjustment (conservative / moderate / aggressive)
- Portfolio heat check (sector concentration, correlation)
- Active pool available capital
- Absolute position cap
Stop-loss and take-profit (services/trading/stop_loss_manager.py)
- Stop-loss = entry price − (ATR × atr_multiplier)
- Take-profit = entry price + (ATR × atr_multiplier × reward_risk_ratio)
- Trailing stops activate for open positions
Additional checks
- Correlation-aware diversification — reject positions that would push portfolio correlation above threshold
- Earnings calendar awareness — reduce size or skip if earnings are within 2 days
- Gradual entry — large positions (> $30) split into 3 tranches over time
- Reserve pool — profits from closed positions siphon into an emergency liquidity reserve
Risk tier auto-adjustment (services/trading/risk_tier_controller.py)
Daily evaluation of Sharpe ratio, drawdown, and win rate. The engine auto-adjusts between conservative, moderate, and aggressive tiers. The new tier is persisted to risk_configs and takes effect on the next cycle.
Output: Trading Decision
Every evaluation produces a TradingDecision record persisted for audit:
decision: act or skipskip_reason: which check failed (if any)computed_position_size,computed_share_quantityrisk_tier_at_decision,portfolio_heat_at_decision,active_pool_at_decisioncircuit_breaker_status,correlation_check_result,sector_exposure_check_resultearnings_proximity_flag,is_micro_trade,decision_trace
If the decision is act, an order job is pushed to the Redis broker queue (stonks:queue:broker) with ticker, action, quantity, and order type.
12. The Complete Data Flow (Summary)
Document (article/filing/transcript)
│
▼
Ollama extraction (JSON)
│ ├─ JSON repair (json-repair library)
│ └─ Pydantic validation + semantic checks
│
▼
document_intelligence + document_impact_records (PostgreSQL)
│ └─ Raw prompts/responses → MinIO (audit)
│
├──────────────────────────────────────────────────┐
│ │
▼ ▼
Layer 1: Company signals Layer 2: Macro signals
(impact records → WeightedSignal) (global_events → exposure matching
→ macro_impact_records → WeightedSignal)
│ │
│ Layer 3: Competitive signals │
│ (pattern mining + propagation │
│ → competitive_signal_records │
│ → WeightedSignal) │
│ │ │
└───────────┬───────────────┘───────────────────────┘
│
▼
Signal merging (concatenate all WeightedSignals)
│
▼
Trend summary assembly
(weighted sentiment → direction, strength, confidence,
contradiction, evidence ranking, catalysts, risks)
│
├─→ trend_windows (PostgreSQL)
├─→ trend_history (time-series)
└─→ trend_evidence (per-document rankings)
│
▼
Trend projection (forward-looking)
│
▼
Data quality suppression
(extraction confidence, staleness, diversity,
macro-only / pattern-only safety)
│
▼
Eligibility evaluation
(gate checks → action mapping → mode escalation → position sizing)
│
▼
Thesis generation + risk classification
│
├─→ recommendations (PostgreSQL)
├─→ recommendation_evidence
└─→ risk_evaluations
│
▼
Trading engine decision loop
(pre-trade checks → position sizing → stop-loss →
correlation → earnings → gradual entry)
│
├─→ trading_decisions (PostgreSQL, audit)
└─→ stonks:queue:broker (Redis, order execution)
13. Key Database Tables
| Table | Stage | Purpose |
|---|---|---|
documents |
Ingestion | Raw ingested content |
document_intelligence |
Extraction | Ollama extraction output |
document_impact_records |
Extraction | Per-company impact from a document |
global_events |
Macro | Classified macro/geopolitical events |
exposure_profiles |
Macro | Company exposure data (geography, supply chain, commodities) |
macro_impact_records |
Macro | Per-company macro impact scores |
competitor_relationships |
Competitive | Company relationship graph |
competitive_signal_records |
Competitive | Cross-company propagated signals |
trend_windows |
Aggregation | Current trend summaries (upserted each cycle) |
trend_history |
Aggregation | Time-series snapshots for charting |
trend_evidence |
Aggregation | Per-document evidence rankings |
trend_projections |
Projection | Forward-looking trend projections |
recommendations |
Recommendation | Trade recommendations |
recommendation_evidence |
Recommendation | Per-document citations |
risk_evaluations |
Recommendation | Eligibility decisions and risk checks |
risk_configs |
Runtime | Toggle switches and risk tier configuration |
trading_decisions |
Trading | Pre-trade evaluation audit trail |
positions |
Trading | Open positions |
orders |
Trading | Broker orders |
fills |
Trading | Order fills |
14. Audit Trail
Every stage preserves full context for reproducibility:
- Extraction: raw Ollama response, repair steps, validation errors, all retry attempts
- Aggregation: per-signal weight breakdowns (recency, credibility, novelty, market context), contradiction details by dimension
- Recommendation: deterministic thesis, evidence citations with weights, eligibility decision trace, risk evaluation
- Trading: every pre-trade check result, position sizing breakdown, risk tier at decision time, full decision trace
- Execution: order details, fills, P&L, performance metrics