stonks-oracle/.kiro/specs/model-validation-calibration/tasks.md
Celes Renata 7fcc8a6c07
feat: model validation, calibration, and signal quality layer
- Migration 035: prediction_snapshots, prediction_outcomes, signal_evidence_links, model_metric_snapshots tables + SQL views
- Prediction snapshot writer with canonical evidence keys, duplicate detection, contribution scores
- Outcome evaluator across 5 horizons (1h, 6h, 1d, 7d, 30d)
- Metrics engine: ECE, Brier score, IC, Rank IC, benchmark comparison
- Attribution engine: per-source, per-catalyst, per-layer performance
- Calibration engine: Bayesian shrinkage source reliability
- Quality gate for live trading eligibility with configurable thresholds
- 7 new /api/validation/* endpoints
- Upgraded OpsModel dashboard with validation tab
- Enhanced recommendation display with calibration context
- Backtest replay validation mode
- 86 Python tests (unit + property-based), 179 frontend tests passing
2026-05-01 03:04:58 +00:00

# Implementation Plan: Model Validation, Calibration, and Signal Quality
## Overview
Add a closed-loop model validation layer to Stonks Oracle: prediction snapshot capture, outcome evaluation, calibration/IC metrics, source/catalyst/layer attribution, Bayesian source reliability, a quality gate for live trading, 7 new API endpoints, an upgraded OpsModel dashboard, and backtest replay integration. Implementation follows the four-phase priority order from the spec, with each phase building on the previous one.
## Tasks
- [x] 1. Database migration 035 — schema foundation
- [x] 1.1 Create `infra/migrations/035_model_validation.sql` with all tables, indexes, and views
- Create `prediction_snapshots` table with all columns from design (id UUID PK, generated_at, ticker, window, horizon, direction, action, mode, strength, confidence, contradiction, p_bull, p_bear, score_company, score_macro, score_competitive, evidence_count, unique_source_count, duplicate_evidence_count, price_at_prediction, spy_price_at_prediction, sector_etf_price_at_prediction, metadata JSONB, created_at)
- Create `prediction_outcomes` table with FK to prediction_snapshots (id UUID PK, prediction_id, evaluated_at, horizon, future_price, future_return, spy_future_price, spy_return, sector_etf_future_price, sector_etf_return, excess_return_vs_spy, excess_return_vs_sector, direction_correct, profitable, metadata JSONB, created_at)
- Create `signal_evidence_links` table with FK to prediction_snapshots (id UUID PK, prediction_id, document_id, signal_id, ticker, source, source_type, catalyst_type, sentiment, impact, extraction_confidence, weight, is_duplicate, canonical_evidence_key, contribution_score, metadata JSONB, created_at)
- Create `model_metric_snapshots` table (id UUID PK, generated_at, lookback_window, horizon, prediction_count, win_rate, directional_accuracy, information_coefficient, rank_information_coefficient, avg_return, avg_excess_return_vs_spy, avg_excess_return_vs_sector, calibration_error, brier_score, buy_win_rate, sell_win_rate, hold_win_rate, metadata JSONB, created_at)
- Create indexes on prediction_snapshots (ticker, generated_at, horizon), prediction_outcomes (prediction_id, horizon, evaluated_at), signal_evidence_links (prediction_id, document_id, ticker), model_metric_snapshots (generated_at, lookback_window, horizon)
- Create `v_prediction_performance` view joining prediction_snapshots with prediction_outcomes
- Create `v_source_performance` view joining signal_evidence_links with prediction_snapshots and prediction_outcomes
- _Requirements: 16.1, 16.2, 16.3, 16.4, 16.5, 16.6, 14.1, 14.2, 14.3, 14.4_
- [x] 2. Phase 1 — Prediction capture, outcome evaluation, core metrics, and dashboard API
- [x] 2.1 Implement Prediction Snapshot Writer (`services/validation/prediction_snapshot.py`)
- Create `services/validation/__init__.py`
- Define `SECTOR_ETF_MAP`, `EVALUATION_HORIZONS`, `MAX_SINGLE_DOCUMENT_WEIGHT` constants
- Implement `PredictionSnapshot` and `SignalEvidenceLink` dataclasses
- Implement `compute_canonical_evidence_key(title, url)` — SHA256 of normalized title + normalized URL (lowercase, strip whitespace for title; lowercase, strip query params for URL)
- Implement `fetch_latest_close_price(pool, ticker)` — query most recent close from market_snapshots
- Implement `create_prediction_snapshot(pool, recommendation, trend_summary, evidence_signals, evidence_docs)` — fetch prices (ticker, SPY, sector ETF), compute canonical keys, detect duplicates, clamp weights to MAX_SINGLE_DOCUMENT_WEIGHT, compute contribution scores (one-vote-per-canonical-key), persist snapshot + evidence links in a transaction
- Implement `compute_contribution_scores(weights)` — each score = weight_i / sum(weights), sums to 1.0
- Handle NULL prices gracefully (log warning, store NULL, don't fail)
- _Requirements: 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 3.1, 3.2, 3.3, 3.4_
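
The key and contribution-score rules in task 2.1 can be sketched as below. This is a minimal illustration, not the project's implementation: the `"\n"` separator between title and URL, the removal of URL fragments, and the equal-share fallback for all-zero weights are assumptions the plan does not specify.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit


def normalize_title(title: str) -> str:
    # Lowercase and collapse whitespace runs (one reading of "strip whitespace").
    return " ".join(title.lower().split())


def normalize_url(url: str) -> str:
    # Lowercase, then drop the query string (and, by assumption, the fragment).
    parts = urlsplit(url.strip().lower())
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def compute_canonical_evidence_key(title: str, url: str) -> str:
    # The "\n" separator is an assumption; the plan does not say how parts are joined.
    payload = normalize_title(title) + "\n" + normalize_url(url)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def compute_contribution_scores(weights: list[float]) -> list[float]:
    # Each score is weight_i / sum(weights); empty input yields an empty list.
    if not weights:
        return []
    total = sum(weights)
    if total <= 0:
        # Assumed fallback for all-zero weights: equal shares.
        return [1.0 / len(weights)] * len(weights)
    return [w / total for w in weights]
```

Because both normalizers are idempotent, feeding already-normalized input back through `compute_canonical_evidence_key` yields the same digest, which is the property task 2.2 checks.
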
- [x] 2.2 Write property test for canonical evidence key determinism and idempotence
- **Property 4: Canonical Evidence Key Determinism and Normalization Idempotence**
- Test that same (title, url) always produces same key
- Test that normalizing already-normalized input produces same key
- **Validates: Requirements 2.3, 17.4**
- [x] 2.3 Write property test for contribution score sum-to-one and range
- **Property 7: Contribution Score Sum-to-One and Range**
- Test that all scores in [0.0, 1.0] and sum to 1.0 (within 1e-9 tolerance)
- Test that empty input returns empty list
- **Validates: Requirements 2.5, 17.7**
- [x] 2.4 Implement Outcome Evaluator (`services/validation/outcome_evaluator.py`)
- Define `PredictionOutcome` dataclass and `HORIZON_DURATIONS` mapping
- Implement `evaluate_matured_predictions(pool)` — find snapshots where horizon elapsed and outcome not recorded, evaluate each
- Implement `evaluate_single_prediction(pool, snapshot, horizon)` — fetch future price at horizon endpoint, compute future_return, SPY return, sector ETF return, excess returns, direction_correct, profitable; return None if future price unavailable
- Evaluate across all 5 horizons: 1h, 6h, 1d, 7d, 30d
- Skip horizons where future price is unavailable (retry next run)
- Store results in prediction_outcomes table
- _Requirements: 4.1, 4.2, 4.3, 4.4, 4.5, 4.6, 4.7, 4.8, 4.9, 4.10_
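
The per-prediction arithmetic in task 2.4 might look like the following pure-function sketch. The direction labels (`"bullish"`, `"bearish"`) and actions (`"buy"`, `"sell"`) come from the unit-test descriptions in this plan; treating any other direction or action as "not scored" (`None`) is an assumption, as is the `evaluate_outcome` signature.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PredictionOutcome:
    future_return: float
    spy_return: Optional[float]
    excess_return_vs_spy: Optional[float]
    direction_correct: Optional[bool]
    profitable: Optional[bool]


def simple_return(start: Optional[float], end: Optional[float]) -> Optional[float]:
    # Fractional price change; None when either price is missing or start is zero.
    if start is None or end is None or start == 0:
        return None
    return (end - start) / start


def evaluate_outcome(direction, action, price, future_price,
                     spy_price=None, spy_future_price=None) -> Optional[PredictionOutcome]:
    future_return = simple_return(price, future_price)
    if future_return is None:
        return None  # future price unavailable: skip and retry on the next run

    spy_return = simple_return(spy_price, spy_future_price)
    excess = None if spy_return is None else future_return - spy_return

    if direction == "bullish":
        direction_correct = future_return > 0
    elif direction == "bearish":
        direction_correct = future_return < 0
    else:
        direction_correct = None  # assumption: other directions are not scored

    if action == "buy":
        profitable = future_return > 0
    elif action == "sell":
        profitable = future_return < 0
    else:
        profitable = None  # assumption: hold is not scored for profitability

    return PredictionOutcome(future_return, spy_return, excess, direction_correct, profitable)
```

The sector ETF comparison would follow the same pattern as the SPY comparison and is omitted here for brevity.
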
- [x] 2.5 Implement Metrics Engine (`services/validation/metrics.py`)
- Define `CONFIDENCE_BUCKETS`, `LOOKBACK_WINDOWS` constants
- Define `CalibrationBucket` and `ModelMetricSnapshot` dataclasses
- Implement `compute_calibration_error(confidences, outcomes)` — group into 5 confidence buckets, compute ECE as weighted average of |avg_conf - win_rate|, flag miscalibrated buckets (|diff| > 0.15)
- Implement `compute_brier_score(p_bulls, outcomes)` — mean((p_bull - outcome)^2)
- Implement `compute_information_coefficient(scores, returns)` — Pearson correlation, return None when < 30 data points
- Implement `compute_rank_information_coefficient(scores, returns)` — Spearman rank correlation, return None when < 30 data points
- Implement `compute_contribution_scores(weights)` — weight_i / sum(weights), sums to 1.0
- Implement benchmark metrics: average excess return vs SPY, vs sector ETF, hit rate improvement
- Implement `compute_and_store_metric_snapshots(pool)` — compute for all lookback/horizon combinations (4 lookbacks × 5 horizons), persist to model_metric_snapshots
- _Requirements: 5.1, 5.2, 5.3, 5.4, 5.5, 5.6, 6.1, 6.2, 6.3, 6.4, 6.5, 9.1, 9.2, 9.3, 9.4, 10.1, 10.2, 10.3, 10.4, 10.5_
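
As a rough illustration of the calibration metrics in task 2.5, the sketch below computes ECE and the Brier score with the standard library only. The plan says there are five confidence buckets but does not give their bounds, so the equal-width edges here are an assumption; the 0.15 miscalibration threshold is taken from the task text.

```python
from dataclasses import dataclass

# Assumed equal-width bucket edges; the plan only states that there are five buckets.
CONFIDENCE_BUCKETS = [(0.0, 0.2), (0.2, 0.4), (0.4, 0.6), (0.6, 0.8), (0.8, 1.0)]
MISCALIBRATION_THRESHOLD = 0.15


@dataclass
class CalibrationBucket:
    low: float
    high: float
    count: int
    avg_confidence: float
    win_rate: float
    miscalibrated: bool


def compute_calibration_error(confidences, outcomes):
    """Return (ECE, non-empty buckets). ECE is the count-weighted mean of |avg_conf - win_rate|."""
    if len(confidences) != len(outcomes):
        raise ValueError("confidences and outcomes must have the same length")
    total = len(confidences)
    ece = 0.0
    buckets = []
    for i, (low, high) in enumerate(CONFIDENCE_BUCKETS):
        is_last = i == len(CONFIDENCE_BUCKETS) - 1
        members = [
            (c, o) for c, o in zip(confidences, outcomes)
            if low <= c < high or (is_last and c == high)  # 1.0 falls in the top bucket
        ]
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        win_rate = sum(1 for _, o in members if o) / len(members)
        gap = abs(avg_conf - win_rate)
        ece += (len(members) / total) * gap
        buckets.append(CalibrationBucket(low, high, len(members), avg_conf, win_rate,
                                         gap > MISCALIBRATION_THRESHOLD))
    return ece, buckets


def compute_brier_score(p_bulls, outcomes):
    """Mean squared difference between p_bull and the 0/1 outcome; None for empty input."""
    if not p_bulls:
        return None
    return sum((p - (1.0 if o else 0.0)) ** 2 for p, o in zip(p_bulls, outcomes)) / len(p_bulls)
```

Since each bucket gap lies in [0, 1] and the weights sum to at most 1, the ECE stays in [0, 1], which matches the range property tested in task 2.6.
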
- [x] 2.6 Write property test for ECE range and round-trip
- **Property 1: Calibration Error Range and Round-Trip**
- Test ECE in [0.0, 1.0] for all valid distributions
- Test ECE = 0.0 when every bucket's win rate matches avg confidence
- **Validates: Requirements 5.1, 5.3, 17.1**
- [x] 2.7 Write property test for Brier score range and perfect prediction
- **Property 2: Brier Score Range and Perfect Prediction**
- Test Brier in [0.0, 1.0] for all valid (p_bull, outcome) pairs
- Test Brier = 0.0 when all predictions perfectly correct
- **Validates: Requirements 5.4, 17.2**
- [x] 2.8 Write property test for IC range and perfect correlation
- **Property 3: Information Coefficient Range and Perfect Correlation**
- Test IC in [-1.0, 1.0] for all valid (score, return) pairs with ≥30 elements
- Test IC = 1.0 for perfectly positively correlated data
- **Validates: Requirements 6.1, 6.2, 17.3**
- [x] 2.9 Implement Dashboard API endpoints in `services/api/app.py`
- Add `/api/validation/summary` GET — return latest model_metric_snapshot + gate status
- Add `/api/validation/calibration` GET — return calibration table with buckets
- Add `/api/validation/ic-by-horizon` GET — return IC and Rank IC per horizon
- Add `/api/validation/gate-status` GET — return quality gate evaluation detail
- All endpoints accept optional `lookback` (default "30d") and `horizon` (default "7d") query params
- _Requirements: 12.1, 12.2, 12.3, 12.7_
- [x] 2.10 Add frontend validation API hooks in `frontend/src/api/hooks.ts`
- Add `useValidationSummary(lookback?, horizon?)` hook for `/api/validation/summary`
- Add `useValidationCalibration(lookback?, horizon?)` hook for `/api/validation/calibration`
- Add `useValidationICByHorizon(lookback?)` hook for `/api/validation/ic-by-horizon`
- Add `useValidationGateStatus()` hook for `/api/validation/gate-status`
- _Requirements: 12.1, 12.2, 12.3, 12.7_
- [x] 2.11 Upgrade OpsModel page (`frontend/src/pages/OpsModel.tsx`) — Phase 1 dashboard
- Add tabbed layout: existing "Extraction Performance" tab + new "Model Validation" tab
- Add summary cards: prediction count, win rate, directional accuracy, IC, Rank IC, Brier score, ECE, avg excess return vs SPY, gate status
- Add calibration table with confidence buckets, avg confidence, observed win rate, count, miscalibration flag
- Highlight miscalibrated buckets (|avg_confidence - observed_win_rate| > 0.15) with warning indicator
- Add IC-by-horizon table showing IC and Rank IC for each horizon
- Add gate status indicator (pass/fail with threshold details)
- _Requirements: 12.1, 12.2, 12.3, 12.7, 12.8, 12.9_
- [x] 3. Checkpoint — Phase 1 verification
- Ensure all tests pass; ask the user if questions arise.
- [x] 4. Phase 2 — Attribution engine and source/catalyst truth tables
- [x] 4.1 Implement Attribution Engine (`services/validation/attribution.py`)
- Define `SourceAttribution`, `CatalystAttribution`, `LayerAttribution` dataclasses
- Implement `compute_source_attribution(pool, lookback_days, horizon)` — join signal_evidence_links with prediction_outcomes, group by source; compute prediction count, avg weight, avg contribution score, win rate, avg future return, avg excess return vs SPY, IC, duplicate rate
- Implement `compute_catalyst_attribution(pool, lookback_days, horizon)` — same metrics grouped by catalyst_type
- Implement `compute_layer_attribution(pool, lookback_days, horizon)` — compute per-layer (company, macro, competitive) avg contribution %, dominant win rate (layer > 30% contribution), dominant IC
- _Requirements: 7.1, 7.2, 7.3, 7.4, 7.5, 7.6, 7.7_
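
To make the grouping in task 4.1 concrete, here is an in-memory sketch that aggregates already-joined rows by source. The row shape (dicts with `prediction_id`, `source`, `weight`, `is_duplicate`, `direction_correct`, `future_return`) and the helper name `compute_source_attribution_from_rows` are hypothetical; the real functions take a database pool and would do the join in SQL. Counting win rate per evidence link rather than per prediction is also an assumption.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class SourceAttribution:
    source: str
    prediction_count: int
    avg_weight: float
    win_rate: float
    avg_future_return: float
    duplicate_rate: float


def compute_source_attribution_from_rows(rows):
    """Group joined evidence/outcome rows by source and compute summary metrics."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["source"]].append(row)

    result = []
    for source, items in sorted(grouped.items()):
        n = len(items)
        result.append(SourceAttribution(
            source=source,
            # Distinct predictions this source contributed evidence to.
            prediction_count=len({r["prediction_id"] for r in items}),
            avg_weight=sum(r["weight"] for r in items) / n,
            win_rate=sum(1 for r in items if r["direction_correct"]) / n,
            avg_future_return=sum(r["future_return"] for r in items) / n,
            duplicate_rate=sum(1 for r in items if r["is_duplicate"]) / n,
        ))
    return result
```

The catalyst variant would be identical except for grouping on `catalyst_type` instead of `source`.
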
- [x] 4.2 Implement Calibration Engine (`services/validation/calibration.py`)
- Implement `compute_source_reliability(observed_win_rate, sample_count, prior_strength=30)` — Bayesian shrinkage: `0.5 + (n / (n + 30)) * (observed_win_rate - 0.5)`; return 0.5 when n=0
- Implement `compute_adjusted_evidence_weight(base_weight, reliability)`, computed as `base_weight * (0.5 + reliability)` and clamped to [0.1, 2.0]
- Implement `update_source_reliabilities(pool)` — recompute from latest outcomes, update source_accuracy table
- _Requirements: 8.1, 8.2, 8.3, 8.4, 8.5_
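
The two formulas in task 4.2 translate almost directly into code. The sketch below follows the stated expressions; the only liberty taken is using the `prior_strength` parameter in place of the literal 30 in the shrinkage formula, which gives the same result at the default.

```python
def compute_source_reliability(observed_win_rate: float, sample_count: int,
                               prior_strength: float = 30.0) -> float:
    """Shrink the observed win rate toward the 0.5 prior; more samples means less shrinkage."""
    if sample_count <= 0:
        return 0.5  # no evidence yet: fall back to the neutral prior
    shrink = sample_count / (sample_count + prior_strength)
    return 0.5 + shrink * (observed_win_rate - 0.5)


def compute_adjusted_evidence_weight(base_weight: float, reliability: float) -> float:
    """Scale the base weight by (0.5 + reliability) and clamp the result to [0.1, 2.0]."""
    return max(0.1, min(2.0, base_weight * (0.5 + reliability)))
```

With 30 samples at a 0.7 win rate the shrinkage factor is 30 / 60 = 0.5, so the reliability lands halfway between the prior and the observation at 0.6. A neutral reliability of 0.5 leaves the base weight unchanged.
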
- [x] 4.3 Write property test for source reliability Bayesian shrinkage bounds and convergence
- **Property 5: Source Reliability Bayesian Shrinkage Bounds and Convergence**
- Test reliability in [0.0, 1.0] for all valid inputs
- Test reliability = 0.5 when sample_count = 0
- Test reliability approaches observed_win_rate as sample_count → ∞
- **Validates: Requirements 8.1, 8.2, 17.5**
- [x] 4.4 Add attribution API endpoints in `services/api/app.py`
- Add `/api/validation/attribution/sources` GET — return per-source performance metrics
- Add `/api/validation/attribution/catalysts` GET — return per-catalyst performance metrics
- Add `/api/validation/attribution/layers` GET — return per-layer performance metrics
- All endpoints accept optional `lookback` (default "30d") and `horizon` (default "7d") query params
- _Requirements: 12.4, 12.5, 12.6_
- [x] 4.5 Add frontend attribution hooks in `frontend/src/api/hooks.ts`
- Add `useValidationAttributionSources(lookback?, horizon?)` hook
- Add `useValidationAttributionCatalysts(lookback?, horizon?)` hook
- Add `useValidationAttributionLayers(lookback?, horizon?)` hook
- _Requirements: 12.4, 12.5, 12.6_
- [x] 4.6 Extend OpsModel page with attribution tables
- Add source performance table (source, win rate, IC, avg return, duplicate rate)
- Add catalyst truth table (catalyst type, win rate, avg return, IC)
- Add layer attribution table (company/macro/competitive contribution %, dominant win rate, IC)
- _Requirements: 12.4, 12.5, 12.6, 12.8_
- [x] 5. Checkpoint — Phase 2 verification
- Ensure all tests pass; ask the user if questions arise.
- [x] 6. Phase 3 — Quality gate, recommendation enhancements, and pipeline wiring
- [x] 6.1 Implement Quality Gate (`services/trading/model_quality_gate.py`)
- Define `QualityGateConfig` dataclass with default thresholds (min_prediction_count=100, min_ic=0.03, min_win_rate=0.53, max_ece=0.15, min_excess_return_vs_spy=0.0, max_snapshot_age_hours=24)
- Define `GateThresholdResult` and `QualityGateResult` dataclasses
- Implement `evaluate_quality_gate(pool, config)` — read most recent model_metric_snapshot (30d lookback, 7d horizon), evaluate each threshold, store result in risk_configs under 'model_quality_gate' key
- Implement `load_gate_config_from_db(pool)` — load thresholds from risk_configs with defaults
- Default to paper-only mode when no snapshots exist or snapshot is stale (>24h)
- Log gate evaluation result with threshold pass/fail details
- _Requirements: 11.1, 11.2, 11.3, 11.4, 11.5, 11.6, 11.7_
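
A pure-function core for the gate in task 6.1 could look like the sketch below, with database access left out. The default thresholds are the ones listed above and the metric keys mirror the `model_metric_snapshots` columns; the function name `evaluate_gate_from_metrics`, the `mode` strings, inclusive comparisons, and failing on missing (`None`) metrics are assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class QualityGateConfig:
    min_prediction_count: int = 100
    min_ic: float = 0.03
    min_win_rate: float = 0.53
    max_ece: float = 0.15
    min_excess_return_vs_spy: float = 0.0
    max_snapshot_age_hours: float = 24.0


@dataclass
class GateThresholdResult:
    name: str
    value: Optional[float]
    threshold: float
    passed: bool


@dataclass
class QualityGateResult:
    passed: bool
    mode: str  # "live_eligible" or "paper_only" (labels assumed for illustration)
    reason: str
    thresholds: List[GateThresholdResult]


def evaluate_gate_from_metrics(metrics: Optional[dict], snapshot_age_hours: Optional[float],
                               config: QualityGateConfig = QualityGateConfig()) -> QualityGateResult:
    # Fail safe: no snapshot or a stale snapshot forces paper-only mode.
    if metrics is None:
        return QualityGateResult(False, "paper_only", "no metric snapshot available", [])
    if snapshot_age_hours is None or snapshot_age_hours > config.max_snapshot_age_hours:
        return QualityGateResult(False, "paper_only", "metric snapshot is stale", [])

    # (metric key, threshold, True when higher values are better)
    checks = [
        ("prediction_count", config.min_prediction_count, True),
        ("information_coefficient", config.min_ic, True),
        ("win_rate", config.min_win_rate, True),
        ("calibration_error", config.max_ece, False),
        ("avg_excess_return_vs_spy", config.min_excess_return_vs_spy, True),
    ]
    results = []
    for name, threshold, higher_is_better in checks:
        value = metrics.get(name)
        if value is None:
            passed = False  # assumption: a missing metric fails its threshold
        elif higher_is_better:
            passed = value >= threshold
        else:
            passed = value <= threshold
        results.append(GateThresholdResult(name, value, threshold, passed))

    failed = [r.name for r in results if not r.passed]
    if failed:
        return QualityGateResult(False, "paper_only", "failed: " + ", ".join(failed), results)
    return QualityGateResult(True, "live_eligible", "all thresholds met", results)
```

Keeping the evaluation free of I/O makes the determinism and monotonicity properties in task 6.2 straightforward to test: relaxing a threshold can only turn a failing check into a passing one.
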
- [x] 6.2 Write property test for quality gate determinism and threshold monotonicity
- **Property 6: Quality Gate Determinism and Threshold Monotonicity**
- Test same inputs always produce same pass/fail result
- Test relaxing any threshold never causes a previously passing gate to fail
- **Validates: Requirements 11.1, 17.6**
- [x] 6.3 Wire Quality Gate into aggregation cycle (`services/aggregation/worker.py`)
- Call `evaluate_quality_gate` at the start of each aggregation cycle
- When gate fails, force all recommendations to paper mode
- Log gate status at cycle start
- _Requirements: 11.2, 11.3_
- [x] 6.4 Wire Prediction Snapshot Writer into recommendation engine
- After recommendation is generated in `services/recommendation/eligibility.py` or the calling code, call `create_prediction_snapshot` to capture the prediction state
- Pass recommendation, trend_summary, evidence signals, and evidence docs
- Handle snapshot creation failure gracefully (log error, don't block recommendation)
- _Requirements: 1.1, 1.6_
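
The "log error, don't block recommendation" requirement in task 6.4 amounts to a narrow try/except around the snapshot call. A minimal sketch follows, assuming the writer is an async callable; the wrapper name `capture_snapshot_safely` and the logger name are hypothetical.

```python
import asyncio
import logging

logger = logging.getLogger("validation.prediction_snapshot")  # hypothetical logger name


async def capture_snapshot_safely(create_snapshot, *args, **kwargs):
    """Await the snapshot writer; on any failure, log it and return None."""
    try:
        return await create_snapshot(*args, **kwargs)
    except Exception:
        # The recommendation has already been produced, so a snapshot failure
        # is recorded with its traceback but never re-raised to the caller.
        logger.exception("prediction snapshot failed; continuing without it")
        return None
```

In the recommendation path this would be awaited as `capture_snapshot_safely(create_prediction_snapshot, pool, recommendation, trend_summary, evidence_signals, evidence_docs)`, so a database outage degrades validation coverage without affecting the recommendation itself.
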
- [x] 6.5 Enhance recommendation display on frontend
- Update `frontend/src/pages/RecommendationDetail` (or relevant recommendation display component) to show:
  - Original confidence alongside calibrated confidence (historical win rate for that bucket)
  - Historical win rate for similar confidence levels
  - Evidence count, unique evidence count, duplicate evidence count
  - Source reliability indicator for primary contributing sources
  - Live eligibility status with reason (gate passed or which threshold failed)
- Add warning badge when duplicate evidence count > 20% of total evidence count
- Add warning badge when primary source reliability < 0.4
- _Requirements: 13.1, 13.2, 13.3, 13.4, 13.5, 13.6, 13.7_
- [x] 7. Checkpoint — Phase 3 verification
- Ensure all tests pass; ask the user if questions arise.
- [x] 8. Phase 4 — Backtest replay integration and unit tests
- [x] 8.1 Add validation mode to BacktestReplay (`services/trading/backtest_replay.py`)
- Add `validation_mode: bool = False` parameter to `BacktestReplay.run()`
- When validation_mode=True, create prediction snapshots for each historical recommendation using only data available at that point in time
- Evaluate prediction outcomes using market prices from the appropriate future horizon
- Prevent future data leakage: no market data after prediction generation time used during snapshot creation
- After backtest completes, trigger model metrics computation over the backtest period, tag snapshots with backtest_id
- _Requirements: 15.1, 15.2, 15.3, 15.4, 15.5_
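
One way to enforce the no-leakage rule in task 8.1 is to route every price lookup through an "as of" helper, so snapshot creation can only see bars at or before the prediction time. The sketch below assumes price history is an in-memory list of `(timestamp, close)` tuples sorted by timestamp; the helper names `price_as_of` and `price_at_horizon` are hypothetical, while the five horizons come from task 2.4.

```python
from bisect import bisect_right
from datetime import datetime, timedelta, timezone

HORIZON_DURATIONS = {
    "1h": timedelta(hours=1),
    "6h": timedelta(hours=6),
    "1d": timedelta(days=1),
    "7d": timedelta(days=7),
    "30d": timedelta(days=30),
}


def price_as_of(bars, as_of):
    """Return the last close with timestamp <= as_of, or None if no such bar exists.

    bars must be sorted by timestamp; nothing after as_of is ever read.
    """
    timestamps = [ts for ts, _ in bars]
    idx = bisect_right(timestamps, as_of)
    return bars[idx - 1][1] if idx > 0 else None


def price_at_horizon(bars, generated_at, horizon):
    """Return the close at the horizon endpoint, or None if the data does not reach it yet."""
    endpoint = generated_at + HORIZON_DURATIONS[horizon]
    if not bars or bars[-1][0] < endpoint:
        return None  # horizon has not matured in the available data
    return price_as_of(bars, endpoint)
```

During replay, snapshot creation would call `price_as_of(bars, generated_at)` while outcome evaluation would call `price_at_horizon`, which keeps the two phases from sharing future data by construction.
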
- [x] 8.2 Write unit tests for prediction snapshot writer (`tests/test_model_validation_unit.py`)
- Test canonical evidence key: known title/URL → expected SHA256, empty inputs, unicode
- Test duplicate detection: 3 docs with 2 sharing a key → 1 marked duplicate
- Test contribution scores: [0.5, 0.3, 0.2] → [0.5, 0.3, 0.2], single doc → [1.0]
- Test weight clamping: weight 1.5 → clamped to 1.0
- _Requirements: 1.1, 2.3, 2.4, 2.5, 3.3_
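
The duplicate-detection and clamping cases in task 8.2 can be illustrated with a short sketch. The value `MAX_SINGLE_DOCUMENT_WEIGHT = 1.0` is inferred from the "1.5 → 1.0" test case above, and giving duplicates zero effective weight is one interpretation of "one vote per canonical key"; both are assumptions, as is the helper name `effective_weights`.

```python
MAX_SINGLE_DOCUMENT_WEIGHT = 1.0  # assumed value, inferred from the clamping test case


def mark_duplicates(canonical_keys):
    """Flag every occurrence of a key after its first as a duplicate."""
    seen = set()
    flags = []
    for key in canonical_keys:
        flags.append(key in seen)
        seen.add(key)
    return flags


def clamp_weight(weight: float) -> float:
    """Cap a single document's weight at MAX_SINGLE_DOCUMENT_WEIGHT."""
    return min(weight, MAX_SINGLE_DOCUMENT_WEIGHT)


def effective_weights(canonical_keys, weights):
    """Clamp each weight and zero out duplicates so each canonical key votes once."""
    flags = mark_duplicates(canonical_keys)
    return [0.0 if dup else clamp_weight(w) for dup, w in zip(flags, weights)]
```

For three documents where the first and third share a key, only the third is marked duplicate, matching the "3 docs with 2 sharing a key → 1 marked duplicate" case.
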
- [x] 8.3 Write unit tests for outcome evaluator (`tests/test_model_validation_unit.py`)
- Test future return computation: price 100→110 → 0.10, price 100→90 → -0.10
- Test direction_correct logic: bullish+positive → true, bullish+negative → false
- Test profitable logic: buy+positive → true, sell+negative → true
- Test excess return: ticker 10%, SPY 5% → excess 5%
- _Requirements: 4.2, 4.5, 4.6, 4.7_
- [x] 8.4 Write unit tests for metrics engine (`tests/test_model_validation_unit.py`)
- Test ECE specific values: perfect calibration → 0.0, all overconfident → positive ECE
- Test Brier score: all correct at p=1.0 → 0.0, all wrong at p=1.0 → 1.0
- Test IC: perfect correlation → 1.0, anti-correlation → -1.0, < 30 → None
- _Requirements: 5.3, 5.4, 6.1, 6.2, 6.5_
- [x] 8.5 Write unit tests for calibration engine (`tests/test_model_validation_unit.py`)
- Test source reliability: n=0 → 0.5, n=1000 with wr=0.8 → ≈0.8, n=30 with wr=0.7 → 0.6
- Test adjusted evidence weight: reliability=0.5 → base*1.0, clamping to [0.1, 2.0]
- _Requirements: 8.1, 8.2, 8.3_
- [x] 8.6 Write unit tests for quality gate (`tests/test_model_validation_unit.py`)
- Test all thresholds met → pass
- Test one threshold failed → fail with reason
- Test fail-safe: no snapshots → paper-only, stale snapshot → paper-only
- _Requirements: 11.1, 11.6_
- [x] 8.7 Write frontend tests for validation dashboard (`frontend/src/test/pages.test.tsx`)
- Add MSW mock handlers for `/api/validation/summary`, `/api/validation/calibration`, `/api/validation/gate-status`
- Test OpsModel page renders validation tab with summary cards
- Test calibration table renders buckets with miscalibration warning
- Test gate status indicator renders pass/fail
- _Requirements: 12.8, 12.9_
- [x] 9. Final checkpoint — Ensure all tests pass
- Ensure all tests pass; ask the user if questions arise.
## Notes
- Tasks marked with `*` are optional and can be skipped for faster MVP
- Each task references specific requirements for traceability
- Checkpoints ensure incremental validation after each phase
- Property tests validate the 7 universal correctness properties from the design document
- Unit tests validate specific examples, edge cases, and integration points
- The design uses Python for backend and TypeScript for frontend — no language selection needed
- Migration number is 035 (existing migrations go up to 034)
- All new service modules go under `services/validation/` except the quality gate which goes in `services/trading/`
- The 7 new API endpoints are added to the existing `services/api/app.py`
- Frontend hooks follow existing patterns in `frontend/src/api/hooks.ts`
- Phase 1 delivers the core feedback loop (capture → evaluate → measure → display)
- Phase 2 adds attribution depth (which sources/catalysts/layers work best)
- Phase 3 adds safety (quality gate) and UX (recommendation warnings)
- Phase 4 adds historical analysis (backtest validation mode) and comprehensive tests