admin/stonks-oracle

Celes Renata 7fcc8a6c07

ci/woodpecker/push/test Pipeline failed

Details

ci/woodpecker/push/build-1 unknown status

Details

ci/woodpecker/push/build-3 unknown status

Details

ci/woodpecker/push/build-2 unknown status

Details

ci/woodpecker/push/finalize unknown status

Details

Build and Push / lint-and-test (push) Has been cancelled

Details

Build and Push / build-services (map[cmd:python -m services.adapters.broker_adapter name:broker-adapter]) (push) Has been cancelled

Details

Build and Push / build-services (map[cmd:python -m services.aggregation.worker name:aggregation]) (push) Has been cancelled

Details

Build and Push / build-services (map[cmd:python -m services.extractor.worker name:extractor]) (push) Has been cancelled

Details

Build and Push / build-services (map[cmd:python -m services.ingestion.worker name:ingestion]) (push) Has been cancelled

Details

Build and Push / build-services (map[cmd:python -m services.lake_publisher.worker name:lake-publisher]) (push) Has been cancelled

Details

Build and Push / build-services (map[cmd:python -m services.parser.worker name:parser]) (push) Has been cancelled

Details

Build and Push / build-services (map[cmd:python -m services.recommendation.worker name:recommendation]) (push) Has been cancelled

Details

Build and Push / build-services (map[cmd:python -m services.scheduler.app name:scheduler]) (push) Has been cancelled

Details

Build and Push / build-services (map[cmd:uvicorn services.api.app:app --host 0.0.0.0 --port 8000 name:query-api]) (push) Has been cancelled

Details

Build and Push / build-services (map[cmd:uvicorn services.risk.app:app --host 0.0.0.0 --port 8000 name:risk]) (push) Has been cancelled

Details

Build and Push / build-services (map[cmd:uvicorn services.symbol_registry.app:app --host 0.0.0.0 --port 8000 name:symbol-registry]) (push) Has been cancelled

Details

Build and Push / build-services (map[cmd:uvicorn services.trading.app:app --host 0.0.0.0 --port 8000 name:trading-engine]) (push) Has been cancelled

Details

Build and Push / build-dashboard (push) Has been cancelled

Details

Build and Push / build-superset (push) Has been cancelled

Details

Build and Push / integration-test (push) Has been cancelled

Details

Build and Push / beta-gate (push) Has been cancelled

Details

feat: model validation, calibration, and signal quality layer

- Migration 035: prediction_snapshots, prediction_outcomes, signal_evidence_links, model_metric_snapshots tables + SQL views
- Prediction snapshot writer with canonical evidence keys, duplicate detection, contribution scores
- Outcome evaluator across 5 horizons (1h, 6h, 1d, 7d, 30d)
- Metrics engine: ECE, Brier score, IC, Rank IC, benchmark comparison
- Attribution engine: per-source, per-catalyst, per-layer performance
- Calibration engine: Bayesian shrinkage source reliability
- Quality gate for live trading eligibility with configurable thresholds
- 7 new /api/validation/* endpoints
- Upgraded OpsModel dashboard with validation tab
- Enhanced recommendation display with calibration context
- Backtest replay validation mode
- 86 Python tests (unit + property-based), 179 frontend tests passing

2026-05-01 03:04:58 +00:00


Requirements Document — Model Validation, Calibration, and Signal Quality

Introduction

The Stonks Oracle platform generates trend summaries and trading recommendations from a three-layer signal aggregation engine. While the pipeline produces directional predictions with confidence scores, there is no systematic mechanism to evaluate whether those predictions are accurate, whether confidence scores are well-calibrated, which sources and signal types contribute to correct predictions, or whether the system outperforms simple benchmarks. The platform also lacks safety gates that prevent live trading when model quality is insufficient.

This feature adds a complete model validation layer: prediction outcome tracking, calibration analysis, information coefficient metrics, signal and source attribution, evidence deduplication quality tracking, confidence recalibration, benchmark comparison, an upgraded Model Performance dashboard, and safety gates for live trading eligibility. The goal is to transform Stonks Oracle from a signal dashboard with paper trading into a statistically validated prediction engine with closed-loop feedback.

Glossary

Prediction_Snapshot_Writer: A new service component in services/validation/prediction_snapshot.py that captures the full state of every recommendation and trend prediction at generation time, including prices, evidence links, and duplicate counts.
Outcome_Evaluator: A new service component in services/validation/outcome_evaluator.py that runs periodically to compute realized future returns and directional accuracy for matured prediction snapshots across multiple horizons.
Metrics_Engine: A new service component in services/validation/metrics.py that computes aggregate model quality metrics including calibration error, information coefficient, Brier score, and win rates over configurable lookback windows.
Attribution_Engine: A new service component in services/validation/attribution.py that computes per-source, per-catalyst-type, and per-signal-layer performance metrics by joining evidence links with prediction outcomes.
Calibration_Engine: A new service component in services/validation/calibration.py that computes source reliability scores using Bayesian shrinkage and adjusts evidence weights based on historical source performance.
Quality_Gate: A new service component in services/trading/model_quality_gate.py that evaluates aggregate model metrics against configurable thresholds and determines whether the system meets minimum quality standards for live trading.
Information_Coefficient: The Pearson correlation between predicted scores and realized future returns, measuring the linear predictive power of the model. Abbreviated as IC.
Rank_Information_Coefficient: The Spearman rank correlation between predicted scores and realized future returns, measuring ordinal predictive power. Abbreviated as Rank IC.
Calibration_Error: The Expected Calibration Error (ECE), computed as the weighted average of the absolute difference between predicted confidence and observed win rate across confidence buckets.
Brier_Score: The mean squared error between the predicted bullish probability and the binary actual outcome (1 if price went up, 0 otherwise), measuring probabilistic forecast accuracy.
Canonical_Evidence_Key: A normalized identifier for a piece of evidence, computed as SHA256 of the normalized title concatenated with the normalized URL, used to detect duplicate evidence across different ingestion paths.
Excess_Return: The return of a prediction minus the return of a benchmark (SPY for broad market, sector ETF for sector-relative) over the same horizon, measuring alpha generation.
Prediction_Snapshot: A frozen record of a prediction at generation time, capturing all inputs (prices, scores, evidence) needed to evaluate the prediction against future outcomes without hindsight bias.
Model_Metric_Snapshot: A periodic aggregate of model quality metrics over a lookback window and horizon, stored for time-series analysis of model performance trends.
Source_Reliability: A Bayesian-shrunk estimate of a source's historical win rate, computed as 0.5 + (n/(n+30)) * (observed_win_rate - 0.5), which regresses toward 0.5 for sources with few observations.
Dashboard_API: The set of API endpoints under /api/validation/ that serve model quality metrics, calibration tables, attribution data, and gate status to the frontend.

Requirements

Requirement 1: Prediction Snapshot Capture

User Story: As a quantitative analyst, I want every recommendation and trend prediction captured as an immutable snapshot at generation time, so that I can evaluate predictions against future outcomes without hindsight bias.

Acceptance Criteria

WHEN a recommendation is generated by the Recommendation_Engine, THE Prediction_Snapshot_Writer SHALL create a prediction_snapshots record containing the ticker, generation timestamp, trend window, prediction horizon, direction, action, mode, strength, confidence, contradiction score, bullish probability, bearish probability, company score, macro score, competitive score, evidence count, unique source count, duplicate evidence count, price at prediction time, SPY price at prediction time, and sector ETF price at prediction time.
WHEN a prediction snapshot is created, THE Prediction_Snapshot_Writer SHALL record the current market price for the predicted ticker by querying the most recent close price from the market_snapshots table.
WHEN a prediction snapshot is created, THE Prediction_Snapshot_Writer SHALL record the current SPY price by querying the most recent close price for ticker SPY from the market_snapshots table.
WHEN a prediction snapshot is created, THE Prediction_Snapshot_Writer SHALL record the current sector ETF price by looking up the sector for the predicted ticker and querying the most recent close price for the corresponding sector ETF from the market_snapshots table.
IF the market price, SPY price, or sector ETF price is unavailable at snapshot time, THEN THE Prediction_Snapshot_Writer SHALL store NULL for the unavailable price fields and log a warning, rather than failing the snapshot creation.
THE Prediction_Snapshot_Writer SHALL store prediction snapshots in a new prediction_snapshots database table with a UUID primary key and indexed columns for ticker, generated_at, and horizon.
WHEN a prediction snapshot is created, THE Prediction_Snapshot_Writer SHALL store a JSONB metadata field containing any additional context from the trend summary market_context and recommendation risk_checks fields.

Requirement 2: Signal Evidence Link Tracking

User Story: As a quantitative analyst, I want to know which specific evidence documents contributed to each prediction, so that I can attribute prediction success or failure to individual sources and signal types.

Acceptance Criteria

WHEN a prediction snapshot is created, THE Prediction_Snapshot_Writer SHALL create signal_evidence_links records for each document that contributed to the prediction, linking the prediction_id to the document_id and signal_id.
THE signal_evidence_links record SHALL capture the source identifier, source type, catalyst type, sentiment, impact score, extraction confidence, weight assigned during aggregation, duplicate status, canonical evidence key, and contribution score for each contributing document.
WHEN recording evidence links, THE Prediction_Snapshot_Writer SHALL compute the canonical_evidence_key as the SHA256 hash of the concatenation of the normalized (lowercased, whitespace-trimmed) document title and the normalized (lowercased, query-parameters-stripped) document URL.
WHEN recording evidence links, THE Prediction_Snapshot_Writer SHALL mark a link as is_duplicate = true when another link for the same prediction and ticker shares the same canonical_evidence_key.
THE Prediction_Snapshot_Writer SHALL compute the contribution_score for each evidence link as the ratio of that document's effective weight to the total effective weight across all documents for the prediction.
THE signal_evidence_links table SHALL have a foreign key constraint from prediction_id to prediction_snapshots(id) and indexes on prediction_id, document_id, and ticker.
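
The key, duplicate, and contribution rules above can be sketched in a few lines of Python. This is a minimal illustration, not the code in services/validation/prediction_snapshot.py: the function names are hypothetical, stripping the URL fragment along with the query string is an assumption, and so is treating the first occurrence of a key as the non-duplicate.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit


def canonical_evidence_key(title: str, url: str) -> str:
    """SHA256 of the normalized title concatenated with the normalized URL."""
    norm_title = title.strip().lower()
    parts = urlsplit(url.strip().lower())
    # Drop the query string; dropping the fragment as well is an assumption.
    norm_url = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
    return hashlib.sha256((norm_title + norm_url).encode("utf-8")).hexdigest()


def mark_duplicates(keys: list[str]) -> list[bool]:
    """Flag every repeat of a key after its first occurrence (assumed rule)."""
    seen: set[str] = set()
    flags = []
    for key in keys:
        flags.append(key in seen)
        seen.add(key)
    return flags


def contribution_scores(weights: list[float]) -> list[float]:
    """Each document's share of the total effective weight."""
    total = sum(weights)
    if total <= 0:
        return [0.0 for _ in weights]
    return [w / total for w in weights]
```

With this normalization, a tracking parameter or a change in letter case does not change the key, so the same article ingested through two paths collapses to one canonical document.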

Requirement 3: Evidence Deduplication Quality Tracking

User Story: As a quantitative analyst, I want the system to track evidence deduplication quality per prediction, so that I can identify when predictions are inflated by counting the same information multiple times from different sources.

Acceptance Criteria

WHEN creating a prediction snapshot, THE Prediction_Snapshot_Writer SHALL compute the unique_source_count as the number of distinct source identifiers across all non-duplicate evidence links for that prediction.
WHEN creating a prediction snapshot, THE Prediction_Snapshot_Writer SHALL compute the duplicate_evidence_count as the number of evidence links marked as is_duplicate = true for that prediction.
THE Prediction_Snapshot_Writer SHALL enforce a maximum single-document weight cap of 1.0, clamping any individual document's effective weight to prevent a single piece of evidence from dominating the prediction.
WHEN computing contribution scores, THE Prediction_Snapshot_Writer SHALL count each canonical evidence key at most once per ticker per window, applying the one-vote-per-canonical-document deduplication rule.
THE Metrics_Engine SHALL compute a duplicate_rate metric as the ratio of duplicate_evidence_count to total evidence_count across predictions in the lookback window.
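
A minimal sketch of how the weight cap, the one-vote rule, and the per-prediction counts could fit together. The EvidenceLink class and the function names are hypothetical. Giving duplicates zero weight is one reading of the one-vote-per-canonical-document rule, and pooling counts across predictions is one reading of the duplicate_rate definition.

```python
from dataclasses import dataclass
from typing import Optional

MAX_DOCUMENT_WEIGHT = 1.0  # single-document cap from the acceptance criteria


@dataclass
class EvidenceLink:
    source: str
    canonical_evidence_key: str
    weight: float
    is_duplicate: bool = False


def summarize_evidence(links: list[EvidenceLink]) -> dict:
    """Flag duplicates, clamp weights, and derive the snapshot counts."""
    seen: set[str] = set()
    effective = []
    for link in links:
        link.is_duplicate = link.canonical_evidence_key in seen
        seen.add(link.canonical_evidence_key)
        # Assumption: a duplicate contributes no weight (one vote per key).
        weight = 0.0 if link.is_duplicate else min(link.weight, MAX_DOCUMENT_WEIGHT)
        effective.append(weight)
    total = sum(effective)
    return {
        "unique_source_count": len({l.source for l in links if not l.is_duplicate}),
        "duplicate_evidence_count": sum(1 for l in links if l.is_duplicate),
        "contribution_scores": [w / total if total > 0 else 0.0 for w in effective],
    }


def duplicate_rate(counts: list[tuple[int, int]]) -> Optional[float]:
    """counts holds (duplicate_evidence_count, evidence_count) per prediction."""
    total_evidence = sum(evidence for _, evidence in counts)
    if total_evidence == 0:
        return None
    return sum(dupes for dupes, _ in counts) / total_evidence
```

In this sketch a document with raw weight 2.5 is clamped to 1.0 before contribution scores are computed, so it cannot outweigh the rest of the evidence by itself.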

Requirement 4: Prediction Outcome Evaluation

User Story: As a quantitative analyst, I want realized market outcomes automatically matched to historical predictions, so that I can measure whether the system's directional calls and confidence scores correspond to actual price movements.

Acceptance Criteria

THE Outcome_Evaluator SHALL run on a periodic schedule, evaluating prediction snapshots whose horizon has elapsed and whose outcome has not yet been recorded.
WHEN evaluating a prediction snapshot, THE Outcome_Evaluator SHALL compute the future_return as (future_price - price_at_prediction) / price_at_prediction using the closing price at the horizon endpoint.
WHEN evaluating a prediction snapshot, THE Outcome_Evaluator SHALL compute the SPY return over the same horizon as (spy_future_price - spy_price_at_prediction) / spy_price_at_prediction.
WHEN evaluating a prediction snapshot, THE Outcome_Evaluator SHALL compute the sector ETF return over the same horizon as (sector_etf_future_price - sector_etf_price_at_prediction) / sector_etf_price_at_prediction.
WHEN evaluating a prediction snapshot, THE Outcome_Evaluator SHALL compute excess_return_vs_spy as future_return - spy_return and excess_return_vs_sector as future_return - sector_etf_return.
WHEN evaluating a prediction snapshot, THE Outcome_Evaluator SHALL determine direction_correct as true when the prediction direction is bullish and future_return is positive, or when the prediction direction is bearish and future_return is negative.
WHEN evaluating a prediction snapshot, THE Outcome_Evaluator SHALL determine profitable as true when the prediction action is buy and future_return is positive, or when the prediction action is sell and future_return is negative.
THE Outcome_Evaluator SHALL evaluate each prediction across all applicable horizons: 1 hour, 6 hours, 1 day, 7 days, and 30 days.
THE Outcome_Evaluator SHALL store evaluation results in a new prediction_outcomes table with a foreign key to prediction_snapshots and indexed columns for prediction_id, horizon, and evaluated_at.
IF the future price is unavailable at the horizon endpoint (market data gap), THEN THE Outcome_Evaluator SHALL skip that horizon evaluation and retry on the next run.
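
The return and correctness formulas above translate directly into code. The following is a sketch under stated assumptions: the function and class names are illustrative, and storing None for a benchmark return when its prices are missing is an assumption that the criteria do not spell out.

```python
from dataclasses import dataclass
from typing import Optional


def simple_return(start: Optional[float], end: Optional[float]) -> Optional[float]:
    """(end - start) / start, or None when either price is missing."""
    if start is None or end is None or start == 0:
        return None
    return (end - start) / start


def excess(ret: float, benchmark: Optional[float]) -> Optional[float]:
    return None if benchmark is None else ret - benchmark


@dataclass
class Outcome:
    future_return: float
    spy_return: Optional[float]
    sector_etf_return: Optional[float]
    excess_return_vs_spy: Optional[float]
    excess_return_vs_sector: Optional[float]
    direction_correct: bool
    profitable: bool


def evaluate_outcome(direction: str, action: str,
                     price_at_prediction: Optional[float],
                     future_price: Optional[float],
                     spy_start: Optional[float] = None,
                     spy_end: Optional[float] = None,
                     etf_start: Optional[float] = None,
                     etf_end: Optional[float] = None) -> Optional[Outcome]:
    future_return = simple_return(price_at_prediction, future_price)
    if future_return is None:
        return None  # data gap: skip this horizon and retry on the next run
    spy_return = simple_return(spy_start, spy_end)
    etf_return = simple_return(etf_start, etf_end)
    return Outcome(
        future_return=future_return,
        spy_return=spy_return,
        sector_etf_return=etf_return,
        excess_return_vs_spy=excess(future_return, spy_return),
        excess_return_vs_sector=excess(future_return, etf_return),
        direction_correct=(direction == "bullish" and future_return > 0)
        or (direction == "bearish" and future_return < 0),
        profitable=(action == "buy" and future_return > 0)
        or (action == "sell" and future_return < 0),
    )
```

Note that under these rules a zero return is neither correct nor profitable, and a hold action is never marked profitable.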

Requirement 5: Calibration Analysis

User Story: As a quantitative analyst, I want to measure how well the system's confidence scores predict actual win rates, so that I can identify overconfident or underconfident predictions and recalibrate the model.

Acceptance Criteria

THE Metrics_Engine SHALL compute calibration metrics by grouping evaluated predictions into confidence buckets: [0.50, 0.60), [0.60, 0.70), [0.70, 0.80), [0.80, 0.90), [0.90, 1.00].
FOR EACH confidence bucket, THE Metrics_Engine SHALL compute the average confidence, the observed win rate (fraction of direction_correct outcomes), and the prediction count.
THE Metrics_Engine SHALL compute the Expected Calibration Error (ECE) as the weighted average of |avg_confidence - observed_win_rate| across all buckets, weighted by the fraction of predictions in each bucket.
THE Metrics_Engine SHALL compute the Brier Score as mean((p_bull - actual_outcome)^2) across all evaluated predictions, where actual_outcome is 1.0 when the price went up over the horizon (future_return is positive) and 0.0 otherwise, matching the Brier_Score definition in the Glossary.
THE Metrics_Engine SHALL flag calibration buckets where |avg_confidence - observed_win_rate| > 0.15 as miscalibrated for dashboard highlighting.
THE Metrics_Engine SHALL compute calibration metrics separately for each prediction horizon (1h, 6h, 1d, 7d, 30d).
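
The bucket table, ECE, and Brier score could be computed as in the sketch below. The names are illustrative rather than taken from services/validation/metrics.py. Two assumptions are made: predictions with confidence below 0.50 are left out of the table because no bucket covers them, and the Brier outcome follows the Glossary definition (1 if the price went up).

```python
from dataclasses import dataclass
from typing import Optional

BUCKET_EDGES = [0.50, 0.60, 0.70, 0.80, 0.90, 1.00]
MISCALIBRATION_GAP = 0.15


@dataclass
class Bucket:
    lower: float
    upper: float
    count: int
    avg_confidence: float
    win_rate: float
    miscalibrated: bool


def bucket_index(confidence: float) -> Optional[int]:
    """Index of the bucket containing confidence, or None if out of range."""
    if not 0.50 <= confidence <= 1.00:
        return None
    for i, upper in enumerate(BUCKET_EDGES[1:]):
        if confidence < upper:
            return i
    return len(BUCKET_EDGES) - 2  # confidence == 1.00 is in the closed last bucket


def calibration_table(confidences: list[float], correct: list[bool]) -> list[Bucket]:
    grouped: dict[int, list[tuple[float, bool]]] = {}
    for conf, ok in zip(confidences, correct):
        idx = bucket_index(conf)
        if idx is not None:
            grouped.setdefault(idx, []).append((conf, ok))
    table = []
    for idx in sorted(grouped):
        rows = grouped[idx]
        avg_conf = sum(c for c, _ in rows) / len(rows)
        win_rate = sum(1 for _, ok in rows if ok) / len(rows)
        table.append(Bucket(BUCKET_EDGES[idx], BUCKET_EDGES[idx + 1], len(rows),
                            avg_conf, win_rate,
                            abs(avg_conf - win_rate) > MISCALIBRATION_GAP))
    return table


def expected_calibration_error(table: list[Bucket]) -> Optional[float]:
    """Count-weighted average of |avg_confidence - win_rate| over buckets."""
    total = sum(b.count for b in table)
    if total == 0:
        return None
    return sum(b.count / total * abs(b.avg_confidence - b.win_rate) for b in table)


def brier_score(p_bull: list[float], went_up: list[bool]) -> Optional[float]:
    """Mean squared error between p_bull and the binary up/down outcome."""
    if not p_bull:
        return None
    return sum((p - (1.0 if up else 0.0)) ** 2
               for p, up in zip(p_bull, went_up)) / len(p_bull)
```

As a worked example, two predictions at 0.55 confidence with one win and two at 0.95 with two wins give a gap of 0.05 in each bucket, so the ECE is 0.5 * 0.05 + 0.5 * 0.05 = 0.05 and neither bucket is flagged.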

Requirement 6: Information Coefficient Metrics

User Story: As a quantitative analyst, I want to measure the correlation between the system's prediction scores and realized returns, so that I can assess whether higher-scored predictions actually produce higher returns.

Acceptance Criteria

THE Metrics_Engine SHALL compute the Information Coefficient (IC) as the Pearson correlation between prediction scores and future returns across all evaluated predictions in the lookback window.
THE Metrics_Engine SHALL compute the Rank Information Coefficient (Rank IC) as the Spearman rank correlation between prediction scores and future returns across all evaluated predictions in the lookback window.
THE Metrics_Engine SHALL compute IC and Rank IC separately for each prediction horizon (1h, 6h, 1d, 7d, 30d).
THE Metrics_Engine SHALL compute return statistics by confidence decile, grouping predictions into 10 equal-sized bins by confidence and computing the average future return and average excess return for each decile.
WHEN fewer than 30 evaluated predictions exist for a given horizon, THE Metrics_Engine SHALL report IC and Rank IC as NULL rather than computing unreliable correlations from small samples.
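
For reference, IC and Rank IC with the 30-sample floor can be sketched using only the standard library. The function names are hypothetical. Spearman correlation is computed here as the Pearson correlation of average ranks, which is the standard definition when ties are present. Returning None for a zero-variance input is an assumption.

```python
import math
from typing import Optional, Sequence

MIN_SAMPLES = 30  # below this, IC and Rank IC are reported as NULL


def pearson(xs: Sequence[float], ys: Sequence[float]) -> Optional[float]:
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None  # correlation is undefined for a constant series
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values: Sequence[float]) -> list[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def information_coefficients(scores: Sequence[float], returns: Sequence[float]
                             ) -> tuple[Optional[float], Optional[float]]:
    """Return (IC, Rank IC) for one horizon, or (None, None) if too few samples."""
    if len(scores) != len(returns):
        raise ValueError("scores and returns must have the same length")
    if len(scores) < MIN_SAMPLES:
        return None, None
    ic = pearson(scores, returns)
    rank_ic = pearson(average_ranks(scores), average_ranks(returns))
    return ic, rank_ic
```

A monotone but nonlinear relationship between score and return yields a Rank IC of 1.0 and an IC below 1.0, which is why the two metrics are reported side by side.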

Requirement 7: Source and Signal Attribution

User Story: As a quantitative analyst, I want to know which sources, source types, and catalyst types contribute to accurate predictions, so that I can identify the most valuable information channels and deprioritize unreliable ones.

Acceptance Criteria

THE Attribution_Engine SHALL compute per-source performance metrics by joining signal_evidence_links with prediction_outcomes, grouping by source identifier.
FOR EACH source, THE Attribution_Engine SHALL compute: prediction count, average weight, average contribution score, win rate, average future return, average excess return vs SPY, and information coefficient.
THE Attribution_Engine SHALL compute the same performance metrics grouped by source_type (e.g., news_api, filings_api, web_scrape, market_api).
THE Attribution_Engine SHALL compute the same performance metrics grouped by catalyst_type (e.g., earnings, product, legal, macro, m_and_a).
THE Attribution_Engine SHALL compute layer attribution metrics for the three signal layers (company, macro, competitive) by using the score_company, score_macro, and score_competitive fields from prediction snapshots.
FOR EACH layer, THE Attribution_Engine SHALL compute the average contribution percentage, the win rate when that layer is the dominant contributor, and the IC of predictions where that layer contributes more than 30% of the total score.
THE Attribution_Engine SHALL compute a per-source duplicate_rate as the fraction of evidence links from that source marked as is_duplicate.

Requirement 8: Confidence Recalibration via Source Reliability

User Story: As a quantitative analyst, I want source credibility weights adjusted based on historical prediction accuracy using Bayesian shrinkage, so that the system learns from its own track record and improves over time.

Acceptance Criteria

THE Calibration_Engine SHALL compute source reliability using Bayesian shrinkage: reliability = 0.5 + (n / (n + 30)) * (observed_win_rate - 0.5), where n is the number of evaluated predictions involving that source and observed_win_rate is the fraction of correct directional calls.
WHEN a source has zero evaluated predictions, THE Calibration_Engine SHALL assign a reliability of 0.5 (the prior mean).
THE Calibration_Engine SHALL compute an adjusted evidence weight for each source as adjusted_weight = base_weight * (0.5 + reliability), clamped to the range [0.1, 2.0].
THE Calibration_Engine SHALL update source reliability scores after each outcome evaluation cycle, using the latest prediction outcomes.
THE Calibration_Engine SHALL store source reliability scores in the existing source_accuracy table, extending it with a reliability column or using the existing accuracy_ratio field with the Bayesian shrinkage formula.
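
The shrinkage and weight formulas are small enough to state as code. This sketch uses illustrative function names; the constants come from the criteria above.

```python
PRIOR_MEAN = 0.5
PRIOR_STRENGTH = 30  # pseudo-observations pulling the estimate toward the prior
MIN_WEIGHT, MAX_WEIGHT = 0.1, 2.0


def source_reliability(n: int, observed_win_rate: float) -> float:
    """Bayesian-shrunk win rate: 0.5 + (n / (n + 30)) * (win_rate - 0.5)."""
    if n <= 0:
        return PRIOR_MEAN  # no evidence yet, so fall back to the prior mean
    return PRIOR_MEAN + (n / (n + PRIOR_STRENGTH)) * (observed_win_rate - PRIOR_MEAN)


def adjusted_weight(base_weight: float, reliability: float) -> float:
    """base_weight * (0.5 + reliability), clamped to [0.1, 2.0]."""
    return min(MAX_WEIGHT, max(MIN_WEIGHT, base_weight * (0.5 + reliability)))
```

For example, a source with 30 evaluated predictions and a 0.70 observed win rate gets reliability 0.5 + (30 / 60) * 0.2 = 0.6, and a base weight of 1.0 becomes 1.1. A source at the prior mean of 0.5 keeps its base weight unchanged.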

Requirement 9: Benchmark Comparison

User Story: As a quantitative analyst, I want the system's prediction performance compared against simple benchmarks, so that I can determine whether the model adds value beyond naive strategies.

Acceptance Criteria

THE Metrics_Engine SHALL compute the average excess return of all buy predictions versus a buy-and-hold SPY strategy over the same horizons.
THE Metrics_Engine SHALL compute the average excess return of all buy predictions versus a buy-and-hold sector ETF strategy over the same horizons.
THE Metrics_Engine SHALL compute the win rate of the system's directional predictions compared to a random 50/50 baseline, reporting the statistical significance using a binomial test when the prediction count exceeds 100.
THE Metrics_Engine SHALL compute the hit rate improvement, defined as (system_win_rate - 0.5) / 0.5, representing the relative improvement over random guessing (for example, a win rate of 0.56 yields 0.12, a 12% improvement).
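
A sketch of the baseline comparison, assuming an exact two-sided binomial test against p = 0.5. The criteria name a binomial test without specifying one-sided or two-sided, so the two-sided choice and the function names here are assumptions.

```python
import math
from typing import Optional

MIN_COUNT_FOR_TEST = 100  # significance is only reported above this count


def hit_rate_improvement(win_rate: float) -> float:
    """Relative improvement over a 50/50 baseline."""
    return (win_rate - 0.5) / 0.5


def binomial_p_value(wins: int, n: int) -> Optional[float]:
    """Exact two-sided p-value for H0: win probability = 0.5."""
    if n <= MIN_COUNT_FOR_TEST:
        return None
    # Under p = 0.5 the distribution is symmetric, so the two-sided
    # p-value is twice the tail beyond the more extreme of wins and losses.
    k = max(wins, n - wins)
    tail = sum(math.comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

Integer arithmetic keeps the tail sum exact until the final division, which avoids underflow in the individual probabilities for large n.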

Requirement 10: Model Metric Snapshots

User Story: As a quantitative analyst, I want aggregate model metrics stored as time-series snapshots, so that I can track whether model quality is improving or degrading over time.

Acceptance Criteria

THE Metrics_Engine SHALL periodically compute and store model_metric_snapshots containing all aggregate metrics for each combination of lookback window and prediction horizon.
EACH model_metric_snapshot SHALL contain: prediction count, win rate, directional accuracy, IC, Rank IC, average return, average excess return vs SPY, average excess return vs sector, calibration error (ECE), Brier score, and per-action win rates (buy, sell, hold).
THE Metrics_Engine SHALL store model_metric_snapshots in a new model_metric_snapshots database table with a UUID primary key and indexed columns for generated_at, lookback_window, and horizon.
THE Metrics_Engine SHALL compute snapshots for lookback windows of 7 days, 30 days, 90 days, and all-time.
THE Metrics_Engine SHALL store a JSONB metadata field in each snapshot for extensibility, containing any additional computed metrics not captured in dedicated columns.

Requirement 11: Safety Gate for Live Trading

User Story: As a platform operator, I want live trading automatically disabled when model quality metrics fall below minimum thresholds, so that the system does not risk real capital on a poorly performing model.

Acceptance Criteria

THE Quality_Gate SHALL evaluate the following minimum thresholds for live trading eligibility: minimum prediction count of 100, minimum IC of 0.03, minimum win rate of 0.53, maximum ECE of 0.15, and minimum excess return vs SPY of 0.0.
WHEN any threshold is not met, THE Quality_Gate SHALL force all recommendations to paper mode, overriding any live_eligible mode assignments.
THE Quality_Gate SHALL evaluate gate status at the start of each aggregation cycle by reading the most recent model_metric_snapshot.
THE Quality_Gate SHALL log the gate evaluation result including which thresholds passed and which failed, with their actual values.
THE Quality_Gate SHALL store the gate evaluation result in the risk_configs table under a model_quality_gate key, making it available to the recommendation engine and dashboard.
IF the model_metric_snapshots table is empty or the most recent snapshot is older than 24 hours, THEN THE Quality_Gate SHALL default to paper-only mode (fail-safe behavior).
THE Quality_Gate SHALL support configurable thresholds via the risk_configs table, with the default values specified in the first acceptance criterion of this requirement used when no override is configured.
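
The gate logic above can be sketched as a pure function over the most recent snapshot. The dictionary keys for the snapshot follow the model_metric_snapshots column names; the threshold key names, the return shape, and the rule that a NULL metric fails its check are assumptions for illustration.

```python
import operator
from datetime import datetime, timedelta
from typing import Optional

DEFAULT_THRESHOLDS = {
    "min_prediction_count": 100,
    "min_ic": 0.03,
    "min_win_rate": 0.53,
    "max_ece": 0.15,
    "min_excess_return_vs_spy": 0.0,
}
MAX_SNAPSHOT_AGE = timedelta(hours=24)

# (threshold name, snapshot field, comparison that must hold to pass)
RULES = [
    ("min_prediction_count", "prediction_count", operator.ge),
    ("min_ic", "information_coefficient", operator.ge),
    ("min_win_rate", "win_rate", operator.ge),
    ("max_ece", "calibration_error", operator.le),
    ("min_excess_return_vs_spy", "avg_excess_return_vs_spy", operator.ge),
]


def evaluate_gate(snapshot: Optional[dict], now: datetime,
                  overrides: Optional[dict] = None) -> dict:
    thresholds = {**DEFAULT_THRESHOLDS, **(overrides or {})}
    # Fail safe: no snapshot or a stale snapshot means paper-only mode.
    if snapshot is None:
        return {"live_eligible": False, "reason": "no metric snapshot", "checks": {}}
    if now - snapshot["generated_at"] > MAX_SNAPSHOT_AGE:
        return {"live_eligible": False, "reason": "metric snapshot is stale", "checks": {}}
    checks = {}
    for name, field, passes in RULES:
        value = snapshot.get(field)
        # Assumption: a missing (NULL) metric counts as a failed check.
        ok = value is not None and passes(value, thresholds[name])
        checks[name] = {"value": value, "threshold": thresholds[name], "passed": ok}
    failed = [name for name, check in checks.items() if not check["passed"]]
    reason = "gate passed" if not failed else "failed: " + ", ".join(failed)
    return {"live_eligible": not failed, "reason": reason, "checks": checks}
```

Returning the per-threshold values alongside the verdict gives the logging, risk_configs storage, and dashboard criteria a single structure to share.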

Requirement 12: Model Performance Dashboard Upgrade

User Story: As a platform operator, I want a comprehensive model performance dashboard showing prediction accuracy, calibration, attribution, and gate status, so that I can monitor model quality and make informed decisions about live trading.

Acceptance Criteria

THE Dashboard_API SHALL expose a /api/validation/summary endpoint returning the latest model metric snapshot with summary cards for: prediction count, win rate, directional accuracy, IC, Rank IC, Brier score, calibration error, average excess return vs SPY, average excess return vs sector, and live trading gate status.
THE Dashboard_API SHALL expose a /api/validation/calibration endpoint returning the calibration table with confidence buckets, average confidence, observed win rate, prediction count, and miscalibration flag for each bucket.
THE Dashboard_API SHALL expose a /api/validation/ic-by-horizon endpoint returning IC and Rank IC values for each prediction horizon.
THE Dashboard_API SHALL expose a /api/validation/attribution/sources endpoint returning per-source performance metrics including win rate, IC, average return, and duplicate rate.
THE Dashboard_API SHALL expose a /api/validation/attribution/catalysts endpoint returning per-catalyst-type performance metrics.
THE Dashboard_API SHALL expose a /api/validation/attribution/layers endpoint returning per-signal-layer (company, macro, competitive) performance metrics.
THE Dashboard_API SHALL expose a /api/validation/gate-status endpoint returning the current quality gate evaluation with pass/fail status for each threshold.
THE frontend OpsModel page SHALL be upgraded to display the model validation summary cards, calibration table, IC-by-horizon table, source performance table, catalyst truth table, layer attribution table, and gate status indicator.
THE frontend SHALL highlight miscalibrated confidence buckets where |avg_confidence - observed_win_rate| > 0.15 with a visual warning indicator.

Requirement 13: Recommendation Display Enhancements

User Story: As a platform operator, I want each recommendation to display its validation context including calibrated confidence, historical win rate, and evidence quality indicators, so that I can assess the reliability of individual predictions.

Acceptance Criteria

WHEN displaying a recommendation, THE frontend SHALL show the original confidence alongside the calibrated confidence (based on the historical win rate for that confidence bucket).
WHEN displaying a recommendation, THE frontend SHALL show the historical win rate for predictions with similar confidence levels.
WHEN displaying a recommendation, THE frontend SHALL show the evidence count, unique evidence count, and duplicate evidence count.
WHEN displaying a recommendation, THE frontend SHALL show a source reliability indicator based on the Bayesian-shrunk reliability score of the primary contributing sources.
WHEN displaying a recommendation, THE frontend SHALL show the live eligibility status with the reason (gate passed, or which threshold failed).
WHEN the duplicate evidence count exceeds 20% of the total evidence count, THE frontend SHALL display a warning badge indicating potential evidence inflation.
WHEN the primary contributing source has a reliability score below 0.4, THE frontend SHALL display a warning badge indicating unknown or low source reliability.

Requirement 14: SQL Explorer Views

User Story: As a quantitative analyst, I want pre-built SQL views joining predictions with outcomes and evidence with performance, so that I can run ad-hoc analysis in the SQL Explorer without writing complex joins.

Acceptance Criteria

THE database migration SHALL create a view v_prediction_performance that joins prediction_snapshots with prediction_outcomes on prediction_id, providing a single flat table with prediction inputs and realized outcomes.
THE database migration SHALL create a view v_source_performance that joins signal_evidence_links with prediction_outcomes (via prediction_id), providing per-evidence-link outcome data for source attribution analysis.
THE v_prediction_performance view SHALL include columns for ticker, direction, action, confidence, strength, price_at_prediction, future_return, excess_return_vs_spy, direction_correct, profitable, horizon, generated_at, and evaluated_at.
THE v_source_performance view SHALL include columns for source, source_type, catalyst_type, sentiment, weight, contribution_score, is_duplicate, direction_correct, future_return, and excess_return_vs_spy.

Requirement 15: Backtest Replay Integration

User Story: As a quantitative analyst, I want to replay historical data through the prediction snapshot and outcome evaluation pipeline, so that I can assess model quality on historical data without future data leakage.

Acceptance Criteria

THE Backtest_Replay service SHALL support a validation mode that generates prediction snapshots and evaluates outcomes using only data available at each historical point in time.
WHEN running in validation mode, THE Backtest_Replay service SHALL process historical recommendations chronologically, creating prediction snapshots with the market prices that were available at each recommendation's generation time.
WHEN running in validation mode, THE Backtest_Replay service SHALL evaluate prediction outcomes using market prices from the appropriate future horizon relative to each prediction's generation time.
THE Backtest_Replay service SHALL prevent future data leakage by ensuring that no market data with a timestamp after the prediction generation time is used during snapshot creation.
WHEN a backtest validation run completes, THE Backtest_Replay service SHALL trigger a model metrics computation over the backtest period, storing the results as model_metric_snapshots tagged with the backtest_id.
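
One way to enforce the no-leakage rule is to route every price read through an "as of" lookup that cannot see bars later than the requested timestamp. The sketch below assumes an in-memory list of (timestamp, close) pairs sorted by time; the function names are hypothetical, and the rule that an outcome is only evaluated once the data reaches the horizon endpoint is an assumption.

```python
from bisect import bisect_right
from datetime import datetime, timedelta
from typing import Optional

Bar = tuple[datetime, float]  # (timestamp, close), sorted ascending by timestamp


def price_as_of(bars: list[Bar], as_of: datetime) -> Optional[float]:
    """Most recent close at or before as_of; never reads a later bar."""
    timestamps = [ts for ts, _ in bars]
    idx = bisect_right(timestamps, as_of)
    return bars[idx - 1][1] if idx > 0 else None


def future_price(bars: list[Bar], generated_at: datetime,
                 horizon: timedelta) -> Optional[float]:
    """Close at the horizon endpoint, or None if the data does not reach it yet."""
    endpoint = generated_at + horizon
    if not bars or bars[-1][0] < endpoint:
        return None
    return price_as_of(bars, endpoint)
```

During replay, snapshot creation would call price_as_of with the recommendation's generation time, while outcome evaluation would call future_price with the same generation time and the horizon being scored.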

Requirement 16: Database Schema

User Story: As a developer, I want the new database tables created via a migration script following the existing migration conventions, so that the schema changes are applied consistently across all environments.

Acceptance Criteria

THE database migration SHALL create the prediction_snapshots table with columns: id (UUID PK), generated_at (TIMESTAMPTZ), ticker (VARCHAR), window (VARCHAR), horizon (VARCHAR), direction (VARCHAR), action (VARCHAR), mode (VARCHAR), strength (FLOAT), confidence (FLOAT), contradiction (FLOAT), p_bull (FLOAT), p_bear (FLOAT), score_company (FLOAT), score_macro (FLOAT), score_competitive (FLOAT), evidence_count (INTEGER), unique_source_count (INTEGER), duplicate_evidence_count (INTEGER), price_at_prediction (FLOAT), spy_price_at_prediction (FLOAT), sector_etf_price_at_prediction (FLOAT), metadata (JSONB), created_at (TIMESTAMPTZ).
THE database migration SHALL create the prediction_outcomes table with columns: id (UUID PK), prediction_id (UUID FK to prediction_snapshots), evaluated_at (TIMESTAMPTZ), horizon (VARCHAR), future_price (FLOAT), future_return (FLOAT), spy_future_price (FLOAT), spy_return (FLOAT), sector_etf_future_price (FLOAT), sector_etf_return (FLOAT), excess_return_vs_spy (FLOAT), excess_return_vs_sector (FLOAT), direction_correct (BOOLEAN), profitable (BOOLEAN), metadata (JSONB), created_at (TIMESTAMPTZ).
THE database migration SHALL create the signal_evidence_links table with columns: id (UUID PK), prediction_id (UUID FK to prediction_snapshots), document_id (VARCHAR), signal_id (VARCHAR), ticker (VARCHAR), source (VARCHAR), source_type (VARCHAR), catalyst_type (VARCHAR), sentiment (VARCHAR), impact (FLOAT), extraction_confidence (FLOAT), weight (FLOAT), is_duplicate (BOOLEAN), canonical_evidence_key (VARCHAR), contribution_score (FLOAT), metadata (JSONB), created_at (TIMESTAMPTZ).
THE database migration SHALL create the model_metric_snapshots table with columns: id (UUID PK), generated_at (TIMESTAMPTZ), lookback_window (VARCHAR), horizon (VARCHAR), prediction_count (INTEGER), win_rate (FLOAT), directional_accuracy (FLOAT), information_coefficient (FLOAT), rank_information_coefficient (FLOAT), avg_return (FLOAT), avg_excess_return_vs_spy (FLOAT), avg_excess_return_vs_sector (FLOAT), calibration_error (FLOAT), brier_score (FLOAT), buy_win_rate (FLOAT), sell_win_rate (FLOAT), hold_win_rate (FLOAT), metadata (JSONB), created_at (TIMESTAMPTZ).
THE database migration SHALL create indexes on prediction_snapshots (ticker, generated_at, horizon), prediction_outcomes (prediction_id, horizon), signal_evidence_links (prediction_id, document_id, ticker), and model_metric_snapshots (generated_at, lookback_window, horizon).
THE database migration SHALL be named 035_model_validation.sql, following the existing migration numbering convention.
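
The key and index criteria above can be illustrated with a reduced sketch. It uses Python's built-in `sqlite3` module as a stand-in for PostgreSQL, so it only demonstrates the foreign key from prediction_outcomes to prediction_snapshots and the index columns, not the real column types. The index names, the choice of one composite index per table, and the helper functions are assumptions made for illustration; the criteria do not specify them. The `window` column is quoted because WINDOW is a reserved word in PostgreSQL.

```python
import sqlite3
import uuid

# Reduced stand-in schema: only a few of the specified columns are included,
# and PostgreSQL types (UUID, TIMESTAMPTZ, JSONB) are replaced by TEXT/INTEGER.
DDL = """
CREATE TABLE prediction_snapshots (
    id TEXT PRIMARY KEY,
    generated_at TEXT,
    ticker TEXT,
    "window" TEXT,
    horizon TEXT
);
CREATE TABLE prediction_outcomes (
    id TEXT PRIMARY KEY,
    prediction_id TEXT REFERENCES prediction_snapshots(id),
    horizon TEXT,
    direction_correct INTEGER
);
-- Hypothetical index names; composite indexes are an assumption.
CREATE INDEX idx_prediction_snapshots_lookup
    ON prediction_snapshots (ticker, generated_at, horizon);
CREATE INDEX idx_prediction_outcomes_lookup
    ON prediction_outcomes (prediction_id, horizon);
"""


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with the reduced schema."""
    conn = sqlite3.connect(":memory:")
    # SQLite enforces foreign keys only when this pragma is enabled.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(DDL)
    return conn


def orphan_outcome_rejected(conn: sqlite3.Connection) -> bool:
    """Return True if an outcome without a matching snapshot is refused."""
    try:
        conn.execute(
            "INSERT INTO prediction_outcomes (id, prediction_id, horizon) "
            "VALUES (?, ?, ?)",
            (str(uuid.uuid4()), str(uuid.uuid4()), "5d"),
        )
    except sqlite3.IntegrityError:
        return True
    return False
```

Calling `orphan_outcome_rejected(build_demo_db())` exercises the foreign key: an outcome row that points at a nonexistent snapshot is rejected.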

Requirement 17: Property-Based Testing for Validation Metrics

User Story: As a developer, I want property-based tests validating the mathematical correctness of all validation metric computations, so that edge cases and numerical stability issues are caught before deployment.

Acceptance Criteria

THE test suite SHALL include a property-based test for calibration error verifying that ECE is in [0.0, 1.0] for all valid distributions of predictions across confidence buckets, and that ECE is 0.0 when every bucket's observed win rate exactly matches its average confidence (round-trip calibration property).
THE test suite SHALL include a property-based test for Brier score verifying that the score is in [0.0, 1.0] for all valid probability-outcome pairs, and that the score is 0.0 when every prediction assigns probability 1.0 to the outcome that actually occurs.
THE test suite SHALL include a property-based test for information coefficient verifying that IC is in [-1.0, 1.0] for all valid score-return pairs, and that IC is 1.0 when scores and returns are perfectly positively correlated.
THE test suite SHALL include a property-based test for the canonical evidence key verifying that the key is deterministic (same inputs always produce the same key) and that normalization is idempotent (normalizing an already-normalized input produces the same key).
THE test suite SHALL include a property-based test for source reliability Bayesian shrinkage verifying that reliability is always in [0.0, 1.0], that reliability approaches 0.5 as sample count approaches 0, and that reliability approaches the observed win rate as sample count approaches infinity.
THE test suite SHALL include a property-based test for the quality gate verifying that the gate result is deterministic for the same metric inputs, and that relaxing any single threshold (making it easier to pass) never causes a previously passing gate to fail (monotonicity property).
THE test suite SHALL include a property-based test for contribution score computation verifying that all contribution scores for a single prediction sum to 1.0 (within floating-point tolerance) and that each individual score is in [0.0, 1.0].
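
Several of the properties above can be sketched as randomized checks. The formulas below are assumptions chosen for illustration, not the project's implementations: a binary Brier score as mean squared error, ECE as the count-weighted mean of |average confidence - observed win rate| per bucket, shrinkage toward 0.5 with a hypothetical `prior_strength` pseudo-count, and contribution scores as weights normalized by their sum. A real suite would more likely use a property-based testing library such as Hypothesis; the standard-library `random` module is used here only to keep the sketch self-contained.

```python
import random


def brier_score(probs, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def expected_calibration_error(buckets):
    """buckets: list of (count, avg_confidence, observed_win_rate) tuples."""
    total = sum(n for n, _, _ in buckets)
    return sum(n / total * abs(conf - win) for n, conf, win in buckets)


def shrunk_reliability(wins, n, prior_strength=10.0):
    """Shrink the observed win rate toward 0.5 (prior_strength is hypothetical)."""
    return (wins + 0.5 * prior_strength) / (n + prior_strength)


def contribution_scores(weights):
    """Normalize non-negative weights so they sum to 1.0 (assumed formula)."""
    total = sum(weights)
    return [w / total for w in weights]


def check_properties(trials=500, seed=0):
    """Run randomized checks of the bounds and limiting cases."""
    rng = random.Random(seed)
    for _ in range(trials):
        size = rng.randint(1, 50)
        probs = [rng.random() for _ in range(size)]
        outcomes = [rng.randint(0, 1) for _ in range(size)]
        assert 0.0 <= brier_score(probs, outcomes) <= 1.0
        # Probability 1.0 on the realized outcome gives a score of exactly 0.0.
        assert brier_score([float(o) for o in outcomes], outcomes) == 0.0

        buckets = [
            (rng.randint(1, 100), rng.random(), rng.random())
            for _ in range(rng.randint(1, 10))
        ]
        assert 0.0 <= expected_calibration_error(buckets) <= 1.0 + 1e-12
        # Perfect calibration: win rate equals average confidence per bucket.
        calibrated = [(n, conf, conf) for n, conf, _ in buckets]
        assert expected_calibration_error(calibrated) == 0.0

        n = rng.randint(0, 1000)
        wins = rng.randint(0, n)
        assert 0.0 <= shrunk_reliability(wins, n) <= 1.0

        scores = contribution_scores([rng.random() + 1e-9 for _ in range(size)])
        assert abs(sum(scores) - 1.0) < 1e-9
        assert all(0.0 <= s <= 1.0 for s in scores)

    # Limiting behaviour of the shrinkage estimate.
    assert shrunk_reliability(0, 0) == 0.5
    assert abs(shrunk_reliability(700_000, 1_000_000) - 0.7) < 1e-3
    return True
```

`check_properties()` returns `True` when every assertion holds. The fixed seed keeps any failure reproducible, which a property-based library would normally handle through shrinking and example replay.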
