feat: model validation, calibration, and signal quality layer

- Migration 035: prediction_snapshots, prediction_outcomes, signal_evidence_links, model_metric_snapshots tables + SQL views
- Prediction snapshot writer with canonical evidence keys, duplicate detection, contribution scores
- Outcome evaluator across 5 horizons (1h, 6h, 1d, 7d, 30d)
- Metrics engine: ECE, Brier score, IC, Rank IC, benchmark comparison
- Attribution engine: per-source, per-catalyst, per-layer performance
- Calibration engine: Bayesian shrinkage source reliability
- Quality gate for live trading eligibility with configurable thresholds
- 7 new /api/validation/* endpoints
- Upgraded OpsModel dashboard with validation tab
- Enhanced recommendation display with calibration context
- Backtest replay validation mode
- 86 Python tests (unit + property-based), 179 frontend tests passing
Celes Renata
2026-05-01 03:04:58 +00:00
parent 5d2ffd9163
commit 7fcc8a6c07
23 changed files with 7554 additions and 9 deletions
@@ -0,0 +1 @@
{"specId": "b595d834-7e72-4fab-87a9-65c92115a069", "workflowType": "requirements-first", "specType": "feature"}
@@ -0,0 +1,975 @@
# Design Document — Model Validation, Calibration, and Signal Quality
## Overview
This design adds a closed-loop model validation layer to Stonks Oracle. The system currently generates trend summaries and trading recommendations with confidence scores, but has no mechanism to evaluate whether those predictions are accurate, whether confidence scores are well-calibrated, which sources contribute to correct predictions, or whether the system outperforms simple benchmarks.
The validation layer introduces five new service modules under `services/validation/`, a quality gate in `services/trading/`, seven new API endpoints under `/api/validation/`, a database migration (035) with four new tables and two SQL views, and an upgraded OpsModel dashboard page. The architecture follows the existing patterns: pure computation modules with asyncpg for persistence, FastAPI endpoints in `services/api/app.py`, and React/TanStack Query hooks on the frontend.
### Design Rationale
A prediction engine without outcome tracking is flying blind. The validation layer closes the feedback loop by:
1. **Capturing immutable snapshots** at prediction time — preventing hindsight bias in evaluation
2. **Evaluating outcomes** across multiple horizons (1h, 6h, 1d, 7d, 30d) — matching the system's multi-window trend architecture
3. **Computing calibration metrics** (ECE, Brier score) — measuring whether confidence scores mean what they claim
4. **Tracking information coefficients** (IC, Rank IC) — measuring linear and ordinal predictive power
5. **Attributing performance** to sources, catalysts, and signal layers — identifying the most valuable information channels
6. **Recalibrating confidence** via Bayesian shrinkage — learning from the system's own track record
7. **Gating live trading** on minimum quality thresholds — preventing real capital risk on a poorly performing model
The design reuses existing infrastructure (asyncpg, FastAPI, TanStack Query, Recharts) and integrates with the existing `source_accuracy` table from the signal-math-upgrade spec.
---
## Architecture
### High-Level Data Flow
```mermaid
flowchart TD
subgraph "Prediction Capture (Real-time)"
A[Recommendation Engine] -->|generates| B[Prediction_Snapshot_Writer]
B --> C[prediction_snapshots table]
B --> D[signal_evidence_links table]
B -->|computes| E[canonical_evidence_key<br/>duplicate detection<br/>contribution scores]
end
subgraph "Outcome Evaluation (Periodic)"
F[Outcome_Evaluator<br/>scheduled job] -->|reads matured snapshots| C
F -->|fetches future prices| G[market_snapshots table]
F -->|computes returns| H[prediction_outcomes table]
F -->|evaluates 5 horizons| H
end
subgraph "Metrics Computation (Periodic)"
I[Metrics_Engine] -->|reads| H
I -->|reads| C
I -->|reads| D
I -->|computes| J[model_metric_snapshots table]
I -->|computes| K[Calibration: ECE, Brier]
I -->|computes| L[IC, Rank IC by horizon]
I -->|computes| M[Benchmark: excess returns]
end
subgraph "Attribution (Periodic)"
N[Attribution_Engine] -->|joins| D
N -->|joins| H
N -->|computes| O[Per-source metrics]
N -->|computes| P[Per-catalyst metrics]
N -->|computes| Q[Per-layer metrics]
end
subgraph "Calibration (Periodic)"
R[Calibration_Engine] -->|reads| H
R -->|reads| D
R -->|computes Bayesian shrinkage| S[source_accuracy table<br/>reliability scores]
end
subgraph "Safety Gate (Per-cycle)"
T[Quality_Gate] -->|reads latest| J
T -->|evaluates thresholds| U{Pass?}
U -->|yes| V[Live trading allowed]
U -->|no| W[Force paper mode]
T -->|stores result| X[risk_configs table<br/>model_quality_gate key]
end
subgraph "Dashboard (Frontend)"
Y[Dashboard_API<br/>7 endpoints] -->|reads| J
Y -->|reads| C
Y -->|reads| H
Y -->|reads| D
Z[OpsModel.tsx<br/>upgraded page] -->|fetches| Y
end
subgraph "Backtest Integration"
AA[BacktestReplay] -->|validation mode| B
AA -->|validation mode| F
AA -->|triggers| I
end
```
### Scheduling Strategy
The validation components run on different cadences:
| Component | Trigger | Cadence |
|-----------|---------|---------|
| Prediction_Snapshot_Writer | Synchronous — called by recommendation engine | Every recommendation |
| Outcome_Evaluator | Scheduled job | Every 1 hour |
| Metrics_Engine | After Outcome_Evaluator completes | Every 1 hour |
| Attribution_Engine | Called by Metrics_Engine | Every 1 hour |
| Calibration_Engine | After Metrics_Engine completes | Every 6 hours |
| Quality_Gate | Start of each aggregation cycle | Every aggregation cycle |
### Sector ETF Mapping
The system needs a mapping from company sectors to sector ETFs for benchmark comparison. This is stored as a configuration constant:
```python
SECTOR_ETF_MAP: dict[str, str] = {
"Technology": "XLK",
"Consumer Cyclical": "XLY",
"Financial Services": "XLF",
"Healthcare": "XLV",
"Energy": "XLE",
"Communication Services": "XLC",
"Industrials": "XLI",
"Consumer Defensive": "XLP",
"Real Estate": "XLRE",
"Utilities": "XLU",
}
```
---
## Components and Interfaces
### New Modules
| Module | File | Responsibility |
|--------|------|----------------|
| Prediction Snapshot Writer | `services/validation/prediction_snapshot.py` | Captures immutable prediction state at generation time |
| Outcome Evaluator | `services/validation/outcome_evaluator.py` | Matches predictions with realized market outcomes |
| Metrics Engine | `services/validation/metrics.py` | Computes calibration, IC, Brier, benchmark metrics |
| Attribution Engine | `services/validation/attribution.py` | Per-source, per-catalyst, per-layer performance |
| Calibration Engine | `services/validation/calibration.py` | Bayesian shrinkage source reliability, weight adjustment |
| Quality Gate | `services/trading/model_quality_gate.py` | Safety gate for live trading eligibility |
### Modified Modules
| Module | File | Changes |
|--------|------|---------|
| Query API | `services/api/app.py` | 7 new `/api/validation/*` endpoints |
| Aggregation Worker | `services/aggregation/worker.py` | Call Quality_Gate at cycle start |
| Recommendation Engine | `services/recommendation/eligibility.py` | Call Prediction_Snapshot_Writer after recommendation |
| Backtest Replay | `services/trading/backtest_replay.py` | Validation mode support |
| Frontend Hooks | `frontend/src/api/hooks.ts` | 7 new validation hooks |
| OpsModel Page | `frontend/src/pages/OpsModel.tsx` | Full dashboard upgrade |
| AppLayout | `frontend/src/components/AppLayout.tsx` | Nav item update (if needed) |
### Component Interface Details
#### 1. Prediction Snapshot Writer (`services/validation/prediction_snapshot.py`)
```python
SECTOR_ETF_MAP: dict[str, str] = {
"Technology": "XLK",
"Consumer Cyclical": "XLY",
"Financial Services": "XLF",
"Healthcare": "XLV",
"Energy": "XLE",
"Communication Services": "XLC",
"Industrials": "XLI",
"Consumer Defensive": "XLP",
"Real Estate": "XLRE",
"Utilities": "XLU",
}
EVALUATION_HORIZONS: list[str] = ["1h", "6h", "1d", "7d", "30d"]
MAX_SINGLE_DOCUMENT_WEIGHT: float = 1.0
@dataclass
class PredictionSnapshot:
"""Immutable snapshot of a prediction at generation time."""
id: str # UUID
generated_at: datetime
ticker: str
window: str
horizon: str
direction: str # bullish/bearish/mixed/neutral
action: str # buy/sell/hold/watch
mode: str # informational/paper_eligible/live_eligible
strength: float
confidence: float
contradiction: float
p_bull: float | None
p_bear: float | None
score_company: float
score_macro: float
score_competitive: float
evidence_count: int
unique_source_count: int
duplicate_evidence_count: int
price_at_prediction: float | None
spy_price_at_prediction: float | None
sector_etf_price_at_prediction: float | None
metadata: dict
@dataclass
class SignalEvidenceLink:
"""Link between a prediction and a contributing evidence document."""
id: str # UUID
prediction_id: str
document_id: str
signal_id: str
ticker: str
source: str
source_type: str
catalyst_type: str
sentiment: str
impact: float
extraction_confidence: float
weight: float # clamped to MAX_SINGLE_DOCUMENT_WEIGHT
is_duplicate: bool
canonical_evidence_key: str
contribution_score: float # weight / total_weight, sums to 1.0
metadata: dict
def compute_canonical_evidence_key(title: str, url: str) -> str:
"""SHA256 of normalized(title) + normalized(url).
Normalization: lowercase, strip whitespace for title;
lowercase, strip query params for URL.
"""
...
async def create_prediction_snapshot(
pool: asyncpg.Pool,
recommendation: Recommendation,
trend_summary: TrendSummary,
evidence_signals: list[WeightedSignal],
evidence_docs: list[dict], # document metadata from recommendation_evidence
) -> PredictionSnapshot:
"""Create and persist a prediction snapshot with evidence links.
1. Fetches current prices (ticker, SPY, sector ETF) from market_snapshots
2. Computes canonical evidence keys and duplicate detection
3. Clamps individual document weights to MAX_SINGLE_DOCUMENT_WEIGHT
4. Computes contribution scores (one-vote-per-canonical-key dedup)
5. Persists snapshot and evidence links in a transaction
"""
...
async def fetch_latest_close_price(
pool: asyncpg.Pool,
ticker: str,
) -> float | None:
"""Fetch most recent close price from market_snapshots for a ticker."""
...
```
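The key normalization and duplicate flagging described in the docstrings above can be sketched as follows. This is a minimal illustration, not the shipped implementation: it assumes plain concatenation of the normalized title and URL, strips everything from the first `?` onward, and treats the first document seen for a key as the original. The helper `flag_duplicates` is a hypothetical name.

```python
import hashlib


def compute_canonical_evidence_key(title: str, url: str) -> str:
    """SHA-256 hex digest of normalized(title) + normalized(url)."""
    # Title: strip surrounding whitespace, then lowercase.
    norm_title = title.strip().lower()
    # URL: lowercase, then drop the query string (everything from the first "?").
    norm_url = url.lower().split("?", 1)[0]
    # Encode as UTF-8 so non-ASCII titles hash consistently.
    return hashlib.sha256((norm_title + norm_url).encode("utf-8")).hexdigest()


def flag_duplicates(keys: list[str]) -> list[bool]:
    """Hypothetical helper: True for every key that was already seen earlier."""
    seen: set[str] = set()
    flags: list[bool] = []
    for key in keys:
        flags.append(key in seen)  # the first occurrence stays False
        seen.add(key)
    return flags


if __name__ == "__main__":
    a = compute_canonical_evidence_key("  Apple Beats Earnings ", "https://Example.com/a?utm=1")
    b = compute_canonical_evidence_key("apple beats earnings", "https://example.com/a")
    print(a == b, flag_duplicates([a, b, "other"]))
```

The hex digest is 64 characters long, which matches the `VARCHAR(64)` column used for `canonical_evidence_key` in the migration.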
#### 2. Outcome Evaluator (`services/validation/outcome_evaluator.py`)
```python
@dataclass
class PredictionOutcome:
"""Realized outcome for a prediction at a specific horizon."""
id: str # UUID
prediction_id: str
evaluated_at: datetime
horizon: str # 1h, 6h, 1d, 7d, 30d
future_price: float
future_return: float
spy_future_price: float | None
spy_return: float | None
sector_etf_future_price: float | None
sector_etf_return: float | None
excess_return_vs_spy: float | None
excess_return_vs_sector: float | None
direction_correct: bool
profitable: bool
metadata: dict
HORIZON_DURATIONS: dict[str, timedelta] = {
"1h": timedelta(hours=1),
"6h": timedelta(hours=6),
"1d": timedelta(days=1),
"7d": timedelta(days=7),
"30d": timedelta(days=30),
}
async def evaluate_matured_predictions(
pool: asyncpg.Pool,
) -> int:
"""Evaluate all matured prediction snapshots.
Finds snapshots where horizon has elapsed and outcome not yet recorded.
For each, fetches future prices and computes returns.
Skips horizons where future price is unavailable (retries next run).
Returns count of outcomes recorded.
"""
...
async def evaluate_single_prediction(
pool: asyncpg.Pool,
snapshot: PredictionSnapshot,
horizon: str,
) -> PredictionOutcome | None:
"""Evaluate a single prediction at a specific horizon.
Returns None if future price is unavailable.
"""
...
```
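The per-horizon arithmetic behind `PredictionOutcome` can be sketched with pure functions. The helper names below are hypothetical, and the direction rule is an assumption: a bullish call counts as correct on a positive return, a bearish call on a negative return, and mixed or neutral calls are never counted as correct. The actual evaluator may treat those cases differently.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

HORIZON_DURATIONS: dict[str, timedelta] = {
    "1h": timedelta(hours=1),
    "6h": timedelta(hours=6),
    "1d": timedelta(days=1),
    "7d": timedelta(days=7),
    "30d": timedelta(days=30),
}


def is_matured(generated_at: datetime, horizon: str, now: datetime) -> bool:
    """A snapshot is ready for evaluation once its horizon has elapsed."""
    return now >= generated_at + HORIZON_DURATIONS[horizon]


def simple_return(price_then: Optional[float], price_now: Optional[float]) -> Optional[float]:
    """Fractional price change; None when either price is missing or invalid."""
    if price_then is None or price_now is None or price_then <= 0:
        return None
    return price_now / price_then - 1.0


def excess_return(ticker_return: Optional[float], benchmark_return: Optional[float]) -> Optional[float]:
    """Ticker return minus benchmark return; None if either side is unavailable."""
    if ticker_return is None or benchmark_return is None:
        return None
    return ticker_return - benchmark_return


def direction_correct(direction: str, future_return: float) -> bool:
    """Assumed rule: bullish needs a gain, bearish needs a loss, others never win."""
    if direction == "bullish":
        return future_return > 0
    if direction == "bearish":
        return future_return < 0
    return False


if __name__ == "__main__":
    t0 = datetime(2026, 5, 1, tzinfo=timezone.utc)
    print(is_matured(t0, "1d", t0 + timedelta(days=2)))
    print(excess_return(simple_return(100.0, 105.0), simple_return(400.0, 408.0)))
```

Returning `None` for a missing benchmark price keeps the ticker return usable while leaving the excess-return columns NULL, which matches the handling listed under Price Data Unavailability.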
#### 3. Metrics Engine (`services/validation/metrics.py`)
```python
CONFIDENCE_BUCKETS: list[tuple[float, float]] = [
(0.50, 0.60),
(0.60, 0.70),
(0.70, 0.80),
(0.80, 0.90),
(0.90, 1.00),
]
LOOKBACK_WINDOWS: list[str] = ["7d", "30d", "90d", "all"]
@dataclass
class CalibrationBucket:
"""Calibration metrics for a single confidence bucket."""
bucket_low: float
bucket_high: float
avg_confidence: float
observed_win_rate: float
prediction_count: int
miscalibrated: bool # |avg_confidence - win_rate| > 0.15
@dataclass
class ModelMetricSnapshot:
"""Aggregate model quality metrics for a lookback/horizon combination."""
id: str
generated_at: datetime
lookback_window: str
horizon: str
prediction_count: int
win_rate: float
directional_accuracy: float
information_coefficient: float | None
rank_information_coefficient: float | None
avg_return: float
avg_excess_return_vs_spy: float
avg_excess_return_vs_sector: float
calibration_error: float # ECE
brier_score: float
buy_win_rate: float
sell_win_rate: float
hold_win_rate: float
metadata: dict
def compute_calibration_error(
confidences: list[float],
outcomes: list[bool],
) -> tuple[float, list[CalibrationBucket]]:
"""Compute ECE and calibration buckets.
ECE = Σ (n_b / N) * |avg_conf_b - win_rate_b|
Returns (ece, buckets).
"""
...
def compute_brier_score(
p_bulls: list[float],
outcomes: list[bool],
) -> float:
"""Brier score = mean((p_bull - outcome)^2).
outcome is 1.0 when the price rose over the horizon (the event p_bull forecasts), 0.0 otherwise.
Returns value in [0.0, 1.0].
"""
...
def compute_information_coefficient(
scores: list[float],
returns: list[float],
) -> float | None:
"""Pearson correlation between prediction scores and future returns.
Returns None when fewer than 30 data points.
Returns value in [-1.0, 1.0].
"""
...
def compute_rank_information_coefficient(
scores: list[float],
returns: list[float],
) -> float | None:
"""Spearman rank correlation between prediction scores and future returns.
Returns None when fewer than 30 data points.
Returns value in [-1.0, 1.0].
"""
...
def compute_contribution_scores(
weights: list[float],
) -> list[float]:
"""Compute contribution scores from document weights.
Each score = weight_i / sum(weights). Sums to 1.0.
Each score in [0.0, 1.0].
Returns empty list for empty input.
"""
...
async def compute_and_store_metric_snapshots(
pool: asyncpg.Pool,
) -> list[ModelMetricSnapshot]:
"""Compute metric snapshots for all lookback/horizon combinations.
Lookback windows: 7d, 30d, 90d, all-time.
Horizons: 1h, 6h, 1d, 7d, 30d.
"""
...
```
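As a rough sketch of the two calibration formulas in the docstrings above, the functions below compute ECE over the five confidence buckets and a Brier score using only the standard library. The names `expected_calibration_error` and `brier_score` are illustrative, and two details are assumptions: buckets are half-open `[low, high)` except the last, which includes 1.0, and confidences below 0.50 are ignored.

```python
from typing import Optional

CONFIDENCE_BUCKETS: list[tuple[float, float]] = [
    (0.50, 0.60), (0.60, 0.70), (0.70, 0.80), (0.80, 0.90), (0.90, 1.00),
]


def bucket_index(confidence: float) -> Optional[int]:
    """Index of the bucket containing `confidence`, or None if out of range."""
    last = len(CONFIDENCE_BUCKETS) - 1
    for i, (low, high) in enumerate(CONFIDENCE_BUCKETS):
        if low <= confidence < high or (i == last and confidence == high):
            return i
    return None


def expected_calibration_error(confidences: list[float], outcomes: list[bool]) -> float:
    """ECE = sum over non-empty buckets of (n_b / N) * |avg_conf_b - win_rate_b|."""
    n = len(CONFIDENCE_BUCKETS)
    conf_sum, wins, counts = [0.0] * n, [0] * n, [0] * n
    for conf, won in zip(confidences, outcomes):
        i = bucket_index(conf)
        if i is None:
            continue  # assumption: out-of-range confidences are skipped
        conf_sum[i] += conf
        wins[i] += int(won)
        counts[i] += 1
    total = sum(counts)
    if total == 0:
        return 0.0  # assumption: no bucketed predictions means no measurable error
    return sum(
        (counts[i] / total) * abs(conf_sum[i] / counts[i] - wins[i] / counts[i])
        for i in range(n)
        if counts[i] > 0  # empty buckets are excluded
    )


def brier_score(probabilities: list[float], outcomes: list[bool]) -> float:
    """Mean squared difference between forecast probability and the 0/1 outcome."""
    if not probabilities or len(probabilities) != len(outcomes):
        raise ValueError("need equally sized, non-empty inputs")
    return sum((p - float(o)) ** 2 for p, o in zip(probabilities, outcomes)) / len(probabilities)


if __name__ == "__main__":
    print(expected_calibration_error([0.55, 0.55, 0.85, 0.85], [True, False, True, True]))
    print(brier_score([0.5, 0.5], [True, False]))
```

In the first call, the 0.50-0.60 bucket has a gap of 0.05 and the 0.80-0.90 bucket a gap of 0.15, each holding half of the predictions, so the ECE is about 0.10.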
#### 4. Attribution Engine (`services/validation/attribution.py`)
```python
@dataclass
class SourceAttribution:
"""Performance metrics for a single source."""
source: str
source_type: str
prediction_count: int
avg_weight: float
avg_contribution_score: float
win_rate: float
avg_future_return: float
avg_excess_return_vs_spy: float
information_coefficient: float | None
duplicate_rate: float
@dataclass
class CatalystAttribution:
"""Performance metrics for a single catalyst type."""
catalyst_type: str
prediction_count: int
win_rate: float
avg_future_return: float
avg_excess_return_vs_spy: float
information_coefficient: float | None
@dataclass
class LayerAttribution:
"""Performance metrics for a signal layer."""
layer: str # company, macro, competitive
avg_contribution_pct: float
dominant_win_rate: float # win rate when this layer > 30% contribution
dominant_ic: float | None # IC when this layer > 30% contribution
async def compute_source_attribution(
pool: asyncpg.Pool,
lookback_days: int = 30,
horizon: str = "7d",
) -> list[SourceAttribution]:
...
async def compute_catalyst_attribution(
pool: asyncpg.Pool,
lookback_days: int = 30,
horizon: str = "7d",
) -> list[CatalystAttribution]:
...
async def compute_layer_attribution(
pool: asyncpg.Pool,
lookback_days: int = 30,
horizon: str = "7d",
) -> list[LayerAttribution]:
...
```
#### 5. Calibration Engine (`services/validation/calibration.py`)
```python
def compute_source_reliability(
observed_win_rate: float,
sample_count: int,
prior_strength: int = 30,
) -> float:
"""Bayesian shrinkage source reliability.
reliability = 0.5 + (n / (n + prior_strength)) * (observed_win_rate - 0.5)
Returns value in [0.0, 1.0].
When n=0, returns 0.5 (prior mean).
As n→∞, approaches observed_win_rate.
"""
...
def compute_adjusted_evidence_weight(
base_weight: float,
reliability: float,
) -> float:
"""Adjusted weight = base_weight * (0.5 + reliability), clamped to [0.1, 2.0]."""
...
async def update_source_reliabilities(
pool: asyncpg.Pool,
) -> int:
"""Recompute and store source reliability scores from latest outcomes.
Uses the existing source_accuracy table, updating accuracy_ratio
with the Bayesian shrinkage formula.
Returns count of sources updated.
"""
...
```
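The two formulas in the docstrings above translate almost directly into code. The sketch below fills in the function bodies under one assumption: the final clamp to [0.0, 1.0] in `compute_source_reliability` is a defensive guard against out-of-range inputs, since the formula already stays in range for a win rate in [0.0, 1.0].

```python
def compute_source_reliability(
    observed_win_rate: float,
    sample_count: int,
    prior_strength: int = 30,
) -> float:
    """Shrink the observed win rate toward the 0.5 prior based on sample size."""
    if sample_count <= 0:
        return 0.5  # no evidence yet: return the prior mean
    shrink = sample_count / (sample_count + prior_strength)
    reliability = 0.5 + shrink * (observed_win_rate - 0.5)
    return min(1.0, max(0.0, reliability))  # defensive clamp (assumption)


def compute_adjusted_evidence_weight(base_weight: float, reliability: float) -> float:
    """Scale a base weight by (0.5 + reliability), clamped to [0.1, 2.0]."""
    return min(2.0, max(0.1, base_weight * (0.5 + reliability)))


if __name__ == "__main__":
    for n in (0, 30, 270):
        print(n, compute_source_reliability(0.8, n))
```

With the default `prior_strength` of 30, a source with an 80% win rate over 30 samples is credited with 0.65, and over 270 samples with 0.77. A source with reliability 0.5 leaves the base weight unchanged because the multiplier is exactly 1.0.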
#### 6. Quality Gate (`services/trading/model_quality_gate.py`)
```python
@dataclass
class QualityGateConfig:
"""Configurable thresholds for live trading eligibility."""
min_prediction_count: int = 100
min_ic: float = 0.03
min_win_rate: float = 0.53
max_ece: float = 0.15
min_excess_return_vs_spy: float = 0.0
max_snapshot_age_hours: int = 24
@dataclass
class GateThresholdResult:
"""Result for a single threshold check."""
name: str
threshold: float
actual: float
passed: bool
@dataclass
class QualityGateResult:
"""Full gate evaluation result."""
passed: bool
evaluated_at: datetime
threshold_results: list[GateThresholdResult]
reason: str # "all thresholds met" or "failed: ..."
snapshot_id: str | None
config: QualityGateConfig
async def evaluate_quality_gate(
pool: asyncpg.Pool,
config: QualityGateConfig | None = None,
) -> QualityGateResult:
"""Evaluate model quality gate from latest metric snapshot.
Reads the most recent model_metric_snapshot for the 30d lookback
and 7d horizon (the primary evaluation window).
If no snapshot exists or the snapshot is stale (older than max_snapshot_age_hours, default 24h), defaults to
paper-only mode (fail-safe).
Stores result in risk_configs under 'model_quality_gate' key.
"""
...
async def load_gate_config_from_db(
pool: asyncpg.Pool,
) -> QualityGateConfig:
"""Load gate thresholds from risk_configs, with defaults."""
...
```
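The threshold logic can be separated from database access and tested as a pure function. The sketch below is one way to do that under stated assumptions: `GateMetrics` and `check_thresholds` are hypothetical names, comparisons are inclusive (a value equal to its threshold passes), and a missing IC counts as a failure so the gate stays fail-safe.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QualityGateConfig:
    min_prediction_count: int = 100
    min_ic: float = 0.03
    min_win_rate: float = 0.53
    max_ece: float = 0.15
    min_excess_return_vs_spy: float = 0.0
    max_snapshot_age_hours: int = 24


@dataclass
class GateMetrics:
    """Hypothetical container for the fields read from the latest metric snapshot."""
    prediction_count: int
    information_coefficient: Optional[float]
    win_rate: float
    calibration_error: float
    avg_excess_return_vs_spy: float
    snapshot_age_hours: float


def check_thresholds(
    metrics: Optional[GateMetrics], config: QualityGateConfig
) -> tuple[bool, list[str]]:
    """Return (passed, failure reasons). A missing snapshot fails the gate."""
    if metrics is None:
        return False, ["no metric snapshot available"]
    failures: list[str] = []
    if metrics.snapshot_age_hours > config.max_snapshot_age_hours:
        failures.append("snapshot is stale")
    if metrics.prediction_count < config.min_prediction_count:
        failures.append("prediction_count below minimum")
    ic = metrics.information_coefficient
    if ic is None or ic < config.min_ic:
        failures.append("information_coefficient below minimum or unavailable")
    if metrics.win_rate < config.min_win_rate:
        failures.append("win_rate below minimum")
    if metrics.calibration_error > config.max_ece:
        failures.append("calibration_error above maximum")
    if metrics.avg_excess_return_vs_spy < config.min_excess_return_vs_spy:
        failures.append("excess return vs SPY below minimum")
    return not failures, failures


if __name__ == "__main__":
    healthy = GateMetrics(150, 0.05, 0.56, 0.08, 0.01, 2.0)
    print(check_thresholds(healthy, QualityGateConfig()))
    print(check_thresholds(None, QualityGateConfig()))
```

Because every check only compares a metric against a single bound, loosening one bound can remove a failure reason but never add one, which is the monotonicity that Property 6 asks for.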
#### 7. Dashboard API Endpoints
Seven new endpoints added to `services/api/app.py`:
| Endpoint | Method | Returns |
|----------|--------|---------|
| `/api/validation/summary` | GET | Latest model metric snapshot + gate status |
| `/api/validation/calibration` | GET | Calibration table with buckets |
| `/api/validation/ic-by-horizon` | GET | IC and Rank IC per horizon |
| `/api/validation/attribution/sources` | GET | Per-source performance |
| `/api/validation/attribution/catalysts` | GET | Per-catalyst performance |
| `/api/validation/attribution/layers` | GET | Per-layer performance |
| `/api/validation/gate-status` | GET | Quality gate evaluation detail |
All endpoints accept optional `lookback` (default "30d") and `horizon` (default "7d") query parameters.
---
## Data Models
### Database Schema (Migration 035)
#### prediction_snapshots
```sql
CREATE TABLE IF NOT EXISTS prediction_snapshots (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
generated_at TIMESTAMPTZ NOT NULL,
ticker VARCHAR(20) NOT NULL,
"window" VARCHAR(20) NOT NULL, -- quoted: WINDOW is a reserved word in PostgreSQL
horizon VARCHAR(20) NOT NULL,
direction VARCHAR(20) NOT NULL,
action VARCHAR(20) NOT NULL,
mode VARCHAR(30) NOT NULL,
strength FLOAT NOT NULL,
confidence FLOAT NOT NULL,
contradiction FLOAT NOT NULL DEFAULT 0.0,
p_bull FLOAT,
p_bear FLOAT,
score_company FLOAT NOT NULL DEFAULT 0.0,
score_macro FLOAT NOT NULL DEFAULT 0.0,
score_competitive FLOAT NOT NULL DEFAULT 0.0,
evidence_count INTEGER NOT NULL DEFAULT 0,
unique_source_count INTEGER NOT NULL DEFAULT 0,
duplicate_evidence_count INTEGER NOT NULL DEFAULT 0,
price_at_prediction FLOAT,
spy_price_at_prediction FLOAT,
sector_etf_price_at_prediction FLOAT,
metadata JSONB DEFAULT '{}',
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE INDEX IF NOT EXISTS idx_pred_snap_ticker ON prediction_snapshots(ticker);
CREATE INDEX IF NOT EXISTS idx_pred_snap_generated ON prediction_snapshots(generated_at);
CREATE INDEX IF NOT EXISTS idx_pred_snap_horizon ON prediction_snapshots(horizon);
```
#### prediction_outcomes
```sql
CREATE TABLE IF NOT EXISTS prediction_outcomes (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
prediction_id UUID NOT NULL REFERENCES prediction_snapshots(id),
evaluated_at TIMESTAMPTZ NOT NULL,
horizon VARCHAR(20) NOT NULL,
future_price FLOAT,
future_return FLOAT,
spy_future_price FLOAT,
spy_return FLOAT,
sector_etf_future_price FLOAT,
sector_etf_return FLOAT,
excess_return_vs_spy FLOAT,
excess_return_vs_sector FLOAT,
direction_correct BOOLEAN,
profitable BOOLEAN,
metadata JSONB DEFAULT '{}',
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE INDEX IF NOT EXISTS idx_pred_out_prediction ON prediction_outcomes(prediction_id);
CREATE INDEX IF NOT EXISTS idx_pred_out_horizon ON prediction_outcomes(horizon);
CREATE INDEX IF NOT EXISTS idx_pred_out_evaluated ON prediction_outcomes(evaluated_at);
```
#### signal_evidence_links
```sql
CREATE TABLE IF NOT EXISTS signal_evidence_links (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
prediction_id UUID NOT NULL REFERENCES prediction_snapshots(id),
document_id VARCHAR(200),
signal_id VARCHAR(200),
ticker VARCHAR(20),
source VARCHAR(200),
source_type VARCHAR(50),
catalyst_type VARCHAR(50),
sentiment VARCHAR(20),
impact FLOAT,
extraction_confidence FLOAT,
weight FLOAT,
is_duplicate BOOLEAN NOT NULL DEFAULT FALSE,
canonical_evidence_key VARCHAR(64),
contribution_score FLOAT,
metadata JSONB DEFAULT '{}',
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE INDEX IF NOT EXISTS idx_sig_ev_prediction ON signal_evidence_links(prediction_id);
CREATE INDEX IF NOT EXISTS idx_sig_ev_document ON signal_evidence_links(document_id);
CREATE INDEX IF NOT EXISTS idx_sig_ev_ticker ON signal_evidence_links(ticker);
```
#### model_metric_snapshots
```sql
CREATE TABLE IF NOT EXISTS model_metric_snapshots (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
generated_at TIMESTAMPTZ NOT NULL,
lookback_window VARCHAR(20) NOT NULL,
horizon VARCHAR(20) NOT NULL,
prediction_count INTEGER NOT NULL DEFAULT 0,
win_rate FLOAT,
directional_accuracy FLOAT,
information_coefficient FLOAT,
rank_information_coefficient FLOAT,
avg_return FLOAT,
avg_excess_return_vs_spy FLOAT,
avg_excess_return_vs_sector FLOAT,
calibration_error FLOAT,
brier_score FLOAT,
buy_win_rate FLOAT,
sell_win_rate FLOAT,
hold_win_rate FLOAT,
metadata JSONB DEFAULT '{}',
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE INDEX IF NOT EXISTS idx_model_snap_generated ON model_metric_snapshots(generated_at);
CREATE INDEX IF NOT EXISTS idx_model_snap_lookback ON model_metric_snapshots(lookback_window);
CREATE INDEX IF NOT EXISTS idx_model_snap_horizon ON model_metric_snapshots(horizon);
```
#### SQL Explorer Views
```sql
CREATE OR REPLACE VIEW v_prediction_performance AS
SELECT
ps.ticker,
ps.direction,
ps.action,
ps.confidence,
ps.strength,
ps.contradiction,
ps.p_bull,
ps.score_company,
ps.score_macro,
ps.score_competitive,
ps.evidence_count,
ps.unique_source_count,
ps.duplicate_evidence_count,
ps.price_at_prediction,
po.future_return,
po.excess_return_vs_spy,
po.excess_return_vs_sector,
po.direction_correct,
po.profitable,
po.horizon,
ps.generated_at,
po.evaluated_at
FROM prediction_snapshots ps
JOIN prediction_outcomes po ON po.prediction_id = ps.id;
CREATE OR REPLACE VIEW v_source_performance AS
SELECT
sel.source,
sel.source_type,
sel.catalyst_type,
sel.sentiment,
sel.weight,
sel.contribution_score,
sel.is_duplicate,
po.direction_correct,
po.future_return,
po.excess_return_vs_spy,
po.horizon,
ps.generated_at
FROM signal_evidence_links sel
JOIN prediction_snapshots ps ON ps.id = sel.prediction_id
JOIN prediction_outcomes po ON po.prediction_id = sel.prediction_id;
```
---
## Correctness Properties
*A property is a characteristic or behavior that should hold true across all valid executions of a system — essentially, a formal statement about what the system should do. Properties serve as the bridge between human-readable specifications and machine-verifiable correctness guarantees.*
The following properties were derived from the acceptance criteria through systematic prework analysis. Each property is universally quantified and maps to specific requirements. After reflection, 7 unique properties remain — one for each PBT requirement in Requirement 17. Redundant properties from Requirements 2, 5, 6, 8, and 11 were consolidated with their corresponding Requirement 17 counterparts.
### Property 1: Calibration Error Range and Round-Trip
*For any* valid distribution of predictions across confidence buckets (where each prediction has a confidence in [0.5, 1.0] and a boolean outcome), the Expected Calibration Error (ECE) SHALL be in [0.0, 1.0]. Furthermore, when every bucket's observed win rate exactly matches its average confidence, ECE SHALL be 0.0.
**Validates: Requirements 5.1, 5.3, 17.1**
### Property 2: Brier Score Range and Perfect Prediction
*For any* list of (p_bull, outcome) pairs where p_bull ∈ [0.0, 1.0] and outcome ∈ {0.0, 1.0}, the Brier score SHALL be in [0.0, 1.0]. Furthermore, when all predictions have p_bull = 1.0 and outcome = 1.0 (or p_bull = 0.0 and outcome = 0.0), the Brier score SHALL be 0.0.
**Validates: Requirements 5.4, 17.2**
### Property 3: Information Coefficient Range and Perfect Correlation
*For any* list of (score, return) pairs with at least 30 elements where scores and returns are finite floats with non-zero variance, the Information Coefficient (Pearson correlation) SHALL be in [-1.0, 1.0]. Furthermore, when scores are not all equal and returns are a positive linear function of scores (returns = a * scores + b, a > 0), IC SHALL be 1.0 (within floating-point tolerance).
**Validates: Requirements 6.1, 6.2, 17.3**
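A dependency-free sketch of both coefficients is shown here to make the property concrete. It assumes that constant inputs (zero variance) return `None` because the correlation is undefined, and that ties in the rank version receive average ranks; `average_ranks` is a hypothetical helper, and the real module may rely on a statistics library instead.

```python
import math
from typing import Optional

MIN_SAMPLES = 30  # below this, the coefficient is treated as unreliable


def compute_information_coefficient(scores: list[float], returns: list[float]) -> Optional[float]:
    """Pearson correlation; None for short, mismatched, or constant inputs."""
    n = len(scores)
    if n != len(returns) or n < MIN_SAMPLES:
        return None
    mean_s, mean_r = sum(scores) / n, sum(returns) / n
    sxx = sum((s - mean_s) ** 2 for s in scores)
    syy = sum((r - mean_r) ** 2 for r in returns)
    sxy = sum((s - mean_s) * (r - mean_r) for s, r in zip(scores, returns))
    if sxx == 0.0 or syy == 0.0:
        return None  # assumption: undefined correlation is reported as None
    # Clamp to absorb floating-point overshoot slightly beyond +/-1.
    return max(-1.0, min(1.0, sxy / math.sqrt(sxx * syy)))


def average_ranks(values: list[float]) -> list[float]:
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def compute_rank_information_coefficient(scores: list[float], returns: list[float]) -> Optional[float]:
    """Spearman correlation: Pearson correlation of the rank-transformed inputs."""
    return compute_information_coefficient(average_ranks(scores), average_ranks(returns))


if __name__ == "__main__":
    xs = [i / 10 for i in range(40)]
    print(compute_information_coefficient(xs, [2 * x + 1 for x in xs]))
    print(compute_rank_information_coefficient(xs, [x ** 3 for x in xs]))
```

The second call illustrates why both metrics are tracked: a monotonic but non-linear relationship still yields a rank IC of 1.0, while the Pearson IC for the same data would fall below 1.0.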
### Property 4: Canonical Evidence Key Determinism and Normalization Idempotence
*For any* (title, url) string pair, computing the canonical evidence key SHALL be deterministic — the same inputs always produce the same key. Furthermore, normalizing an already-normalized input (lowercased, trimmed title; lowercased, query-stripped URL) and computing the key SHALL produce the same key as the original computation (idempotence).
**Validates: Requirements 2.3, 17.4**
### Property 5: Source Reliability Bayesian Shrinkage Bounds and Convergence
*For any* observed_win_rate ∈ [0.0, 1.0] and sample_count ≥ 0, the source reliability computed via Bayesian shrinkage SHALL be in [0.0, 1.0]. When sample_count = 0, reliability SHALL be exactly 0.5. As sample_count increases toward infinity, reliability SHALL approach the observed_win_rate monotonically.
**Validates: Requirements 8.1, 8.2, 17.5**
### Property 6: Quality Gate Determinism and Threshold Monotonicity
*For any* set of model metric values and quality gate configuration, the gate evaluation result SHALL be deterministic: the same inputs always produce the same pass/fail result. Furthermore, for any configuration where the gate passes, relaxing any single threshold (decreasing a min value or increasing a max value, so that it is easier to satisfy) SHALL NOT cause the gate to fail (monotonicity).
**Validates: Requirements 11.1, 17.6**
### Property 7: Contribution Score Sum-to-One and Range
*For any* non-empty list of positive document weights, the computed contribution scores SHALL each be in [0.0, 1.0] and SHALL sum to 1.0 (within floating-point tolerance of 1e-9). For an empty weight list, the result SHALL be an empty list.
**Validates: Requirements 2.5, 17.7**
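A minimal sketch of the normalization behind this property follows. It includes the equal-share fallback for a zero total weight listed under Metrics Computation Edge Cases; `clamp_weights` is a hypothetical helper illustrating the `MAX_SINGLE_DOCUMENT_WEIGHT` cap, and flooring negative weights at zero is an assumption.

```python
MAX_SINGLE_DOCUMENT_WEIGHT: float = 1.0


def clamp_weights(weights: list[float]) -> list[float]:
    """Hypothetical helper: cap each weight so no single document dominates."""
    return [min(max(w, 0.0), MAX_SINGLE_DOCUMENT_WEIGHT) for w in weights]


def compute_contribution_scores(weights: list[float]) -> list[float]:
    """Each score is weight_i / sum(weights); the scores sum to 1.0."""
    if not weights:
        return []
    total = sum(weights)
    if total <= 0.0:
        # Zero total weight: fall back to equal shares instead of dividing by zero.
        return [1.0 / len(weights)] * len(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    raw = [2.5, 0.4, 0.6]
    capped = clamp_weights(raw)
    print(capped, compute_contribution_scores(capped))
```

Clamping before normalizing matters: without the cap, the first document above would hold about 71% of the contribution, and with it the share drops to 50%.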
---
## Error Handling
### Price Data Unavailability
| Scenario | Handling |
|----------|----------|
| Ticker price unavailable at snapshot time | Store NULL for `price_at_prediction`, log warning, continue |
| SPY price unavailable at snapshot time | Store NULL for `spy_price_at_prediction`, log warning, continue |
| Sector ETF price unavailable at snapshot time | Store NULL for `sector_etf_price_at_prediction`, log warning, continue |
| Sector not found in SECTOR_ETF_MAP | Store NULL for sector ETF price, log warning |
| Future price unavailable at evaluation time | Skip that horizon, retry on next Outcome_Evaluator run |
| SPY/sector ETF future price unavailable | Store NULL for excess returns, still compute ticker return |
### Metrics Computation Edge Cases
| Scenario | Handling |
|----------|----------|
| Zero predictions in a confidence bucket | Exclude bucket from ECE computation |
| Fewer than 30 predictions for IC/Rank IC | Return NULL instead of unreliable correlation |
| All predictions in same confidence bucket | ECE = \|avg_confidence - win_rate\| for that single bucket |
| Division by zero in contribution scores (total weight = 0) | Return equal contribution scores (1/n) |
| Single prediction | Contribution score = 1.0 |
| NaN/infinity in metric computation | Guard with `math.isnan`/`math.isinf` checks, return 0.0 or NULL |
### Quality Gate Failures
| Scenario | Handling |
|----------|----------|
| No model_metric_snapshots exist | Default to paper-only mode (fail-safe) |
| Most recent snapshot older than 24 hours | Default to paper-only mode (fail-safe) |
| risk_configs table unreachable | Default to paper-only mode, log warning |
| Invalid threshold values in risk_configs | Use default thresholds, log warning |
| Gate evaluation fails mid-computation | Default to paper-only mode, log error |
### Database Failures
| Scenario | Handling |
|----------|----------|
| prediction_snapshots insert fails | Log error, do not block recommendation generation |
| signal_evidence_links insert fails | Log error, snapshot still created (partial data) |
| prediction_outcomes insert fails | Log error, retry on next Outcome_Evaluator run |
| model_metric_snapshots insert fails | Log error, stale metrics used until next successful computation |
| source_accuracy update fails | Log error, continue with stale reliability data |
### Canonical Evidence Key Edge Cases
| Scenario | Handling |
|----------|----------|
| Empty title | Use empty string in hash computation |
| Empty URL | Use empty string in hash computation |
| URL with no query parameters | Use URL as-is after lowercasing |
| Non-ASCII characters in title/URL | Encode as UTF-8 before hashing |
---
## Testing Strategy
### Dual Testing Approach
The model validation feature requires both property-based tests (for mathematical correctness of metric computations) and example-based unit tests (for specific behaviors, integration points, and edge cases). Property-based testing is appropriate here because the feature contains several pure mathematical functions (ECE, Brier score, IC, Bayesian shrinkage, contribution scores) with clear input/output behavior and universal properties.
### Property-Based Testing
**Library:** Hypothesis (already in use — `.hypothesis/` directory exists, project convention established)
**Configuration:**
- Minimum 100 iterations per property: `@settings(max_examples=100)`
- File naming: `tests/test_pbt_model_validation.py`
- Tag format: `# Feature: model-validation-calibration, Property N: <title>`
**Property tests to implement (one test per correctness property):**
| Property | Test Function | Key Generators |
|----------|---------------|----------------|
| 1: ECE range and round-trip | `test_calibration_error_range_and_roundtrip` | `st.lists(st.tuples(st.floats(0.5, 1.0), st.booleans()))` |
| 2: Brier score range and perfect | `test_brier_score_range_and_perfect` | `st.lists(st.tuples(st.floats(0.0, 1.0), st.sampled_from([0.0, 1.0])))` |
| 3: IC range and perfect correlation | `test_information_coefficient_range_and_perfect` | `st.lists(st.floats(-10, 10), min_size=30)` with linear transform |
| 4: Canonical key determinism and idempotence | `test_canonical_key_determinism_and_idempotence` | `st.text()` pairs for title and URL |
| 5: Source reliability bounds and convergence | `test_source_reliability_bounds_and_convergence` | `st.floats(0.0, 1.0)` for win_rate, `st.integers(0, 10000)` for n |
| 6: Quality gate determinism and monotonicity | `test_quality_gate_determinism_and_monotonicity` | Custom strategy for `QualityGateConfig` and metric values |
| 7: Contribution score sum-to-one | `test_contribution_score_sum_to_one` | `st.lists(st.floats(0.01, 100.0), min_size=1)` |
### Example-Based Unit Tests
**File:** `tests/test_model_validation_unit.py`
| Test Area | Examples |
|-----------|----------|
| Canonical evidence key | Known title/URL → expected SHA256, empty inputs, unicode |
| Duplicate detection | 3 docs with 2 sharing a key → 1 marked duplicate |
| Contribution scores | [0.5, 0.3, 0.2] → [0.5, 0.3, 0.2], single doc → [1.0] |
| ECE specific values | Perfect calibration → 0.0, all overconfident → positive ECE |
| Brier score specific values | All correct at p=1.0 → 0.0, all wrong at p=1.0 → 1.0 |
| IC specific values | Perfect correlation → 1.0, anti-correlation → -1.0, < 30 → None |
| Source reliability | n=0 → 0.5, n=1000 with wr=0.8 → ≈0.8, n=30 with wr=0.7 → 0.6 |
| Adjusted evidence weight | reliability=0.5 → base*1.0, clamping to [0.1, 2.0] |
| Quality gate | All thresholds met → pass, one failed → fail with reason |
| Quality gate fail-safe | No snapshots → paper-only, stale snapshot → paper-only |
| Direction correct logic | bullish+positive → true, bullish+negative → false |
| Profitable logic | buy+positive → true, sell+negative → true |
| Future return computation | price 100→110 → 0.10, price 100→90 → -0.10 |
| Excess return | ticker 10%, SPY 5% → excess 5% |
| Weight clamping | weight 1.5 → clamped to 1.0 |
### Frontend Tests
**File:** `frontend/src/test/pages.test.tsx` (extend existing)
| Test Area | Strategy |
|-----------|----------|
| OpsModel page renders validation tabs | MSW mock for `/api/validation/summary` |
| Calibration table renders buckets | MSW mock for `/api/validation/calibration` |
| Gate status indicator | MSW mock for `/api/validation/gate-status` |
| Miscalibration warning badge | Mock data with miscalibrated bucket |
### Integration Tests
**File:** `tests/test_model_validation_integration.py`
| Test Area | Strategy |
|-----------|----------|
| Snapshot creation with mock DB | asyncpg mock, verify INSERT queries |
| Outcome evaluation with mock prices | asyncpg mock, verify return computation |
| Metrics computation end-to-end | In-memory data, verify all metrics computed |
| API endpoint responses | FastAPI TestClient with mock pool |
### Test File Structure
```
tests/
├── test_pbt_model_validation.py # 7 property-based tests
├── test_model_validation_unit.py # Example-based unit tests
└── test_model_validation_integration.py # Integration tests (optional)
frontend/src/test/
└── pages.test.tsx # Extended with validation page tests
```
@@ -0,0 +1,286 @@
# Requirements Document — Model Validation, Calibration, and Signal Quality
## Introduction
The Stonks Oracle platform generates trend summaries and trading recommendations from a three-layer signal aggregation engine. While the pipeline produces directional predictions with confidence scores, there is no systematic mechanism to evaluate whether those predictions are accurate, whether confidence scores are well-calibrated, which sources and signal types contribute to correct predictions, or whether the system outperforms simple benchmarks. The platform also lacks safety gates that prevent live trading when model quality is insufficient.
This feature adds a complete model validation layer: prediction outcome tracking, calibration analysis, information coefficient metrics, signal and source attribution, evidence deduplication quality tracking, confidence recalibration, benchmark comparison, an upgraded Model Performance dashboard, and safety gates for live trading eligibility. The goal is to transform Stonks Oracle from a signal dashboard with paper trading into a statistically validated prediction engine with closed-loop feedback.
## Glossary
- **Prediction_Snapshot_Writer**: A new service component in `services/validation/prediction_snapshot.py` that captures the full state of every recommendation and trend prediction at generation time, including prices, evidence links, and duplicate counts.
- **Outcome_Evaluator**: A new service component in `services/validation/outcome_evaluator.py` that runs periodically to compute realized future returns and directional accuracy for matured prediction snapshots across multiple horizons.
- **Metrics_Engine**: A new service component in `services/validation/metrics.py` that computes aggregate model quality metrics including calibration error, information coefficient, Brier score, and win rates over configurable lookback windows.
- **Attribution_Engine**: A new service component in `services/validation/attribution.py` that computes per-source, per-catalyst-type, and per-signal-layer performance metrics by joining evidence links with prediction outcomes.
- **Calibration_Engine**: A new service component in `services/validation/calibration.py` that computes source reliability scores using Bayesian shrinkage and adjusts evidence weights based on historical source performance.
- **Quality_Gate**: A new service component in `services/trading/model_quality_gate.py` that evaluates aggregate model metrics against configurable thresholds and determines whether the system meets minimum quality standards for live trading.
- **Information_Coefficient**: The Pearson correlation between predicted scores and realized future returns, measuring the linear predictive power of the model. Abbreviated as IC.
- **Rank_Information_Coefficient**: The Spearman rank correlation between predicted scores and realized future returns, measuring ordinal predictive power. Abbreviated as Rank IC.
- **Calibration_Error**: The Expected Calibration Error (ECE), computed as the weighted average of the absolute difference between predicted confidence and observed win rate across confidence buckets.
- **Brier_Score**: The mean squared error between the predicted bullish probability and the binary actual outcome (1 if price went up, 0 otherwise), measuring probabilistic forecast accuracy.
- **Canonical_Evidence_Key**: A normalized identifier for a piece of evidence, computed as SHA256 of the normalized title concatenated with the normalized URL, used to detect duplicate evidence across different ingestion paths.
- **Excess_Return**: The return of a prediction minus the return of a benchmark (SPY for broad market, sector ETF for sector-relative) over the same horizon, measuring alpha generation.
- **Prediction_Snapshot**: A frozen record of a prediction at generation time, capturing all inputs (prices, scores, evidence) needed to evaluate the prediction against future outcomes without hindsight bias.
- **Model_Metric_Snapshot**: A periodic aggregate of model quality metrics over a lookback window and horizon, stored for time-series analysis of model performance trends.
- **Source_Reliability**: A Bayesian-shrunk estimate of a source's historical win rate, computed as `0.5 + (n/(n+30)) * (observed_win_rate - 0.5)`, which regresses toward 0.5 for sources with few observations.
- **Dashboard_API**: The set of API endpoints under `/api/validation/` that serve model quality metrics, calibration tables, attribution data, and gate status to the frontend.
---
## Requirements
### Requirement 1: Prediction Snapshot Capture
**User Story:** As a quantitative analyst, I want every recommendation and trend prediction captured as an immutable snapshot at generation time, so that I can evaluate predictions against future outcomes without hindsight bias.
#### Acceptance Criteria
1. WHEN a recommendation is generated by the Recommendation_Engine, THE Prediction_Snapshot_Writer SHALL create a prediction_snapshots record containing the ticker, generation timestamp, trend window, prediction horizon, direction, action, mode, strength, confidence, contradiction score, bullish probability, bearish probability, company score, macro score, competitive score, evidence count, unique source count, duplicate evidence count, price at prediction time, SPY price at prediction time, and sector ETF price at prediction time.
2. WHEN a prediction snapshot is created, THE Prediction_Snapshot_Writer SHALL record the current market price for the predicted ticker by querying the most recent close price from the market_snapshots table.
3. WHEN a prediction snapshot is created, THE Prediction_Snapshot_Writer SHALL record the current SPY price by querying the most recent close price for ticker SPY from the market_snapshots table.
4. WHEN a prediction snapshot is created, THE Prediction_Snapshot_Writer SHALL record the current sector ETF price by looking up the sector for the predicted ticker and querying the most recent close price for the corresponding sector ETF from the market_snapshots table.
5. IF the market price, SPY price, or sector ETF price is unavailable at snapshot time, THEN THE Prediction_Snapshot_Writer SHALL store NULL for the unavailable price fields and log a warning, rather than failing the snapshot creation.
6. THE Prediction_Snapshot_Writer SHALL store prediction snapshots in a new `prediction_snapshots` database table with a UUID primary key and indexed columns for ticker, generated_at, and horizon.
7. WHEN a prediction snapshot is created, THE Prediction_Snapshot_Writer SHALL store a JSONB metadata field containing any additional context from the trend summary market_context and recommendation risk_checks fields.
---
### Requirement 2: Signal Evidence Link Tracking
**User Story:** As a quantitative analyst, I want to know which specific evidence documents contributed to each prediction, so that I can attribute prediction success or failure to individual sources and signal types.
#### Acceptance Criteria
1. WHEN a prediction snapshot is created, THE Prediction_Snapshot_Writer SHALL create signal_evidence_links records for each document that contributed to the prediction, linking the prediction_id to the document_id and signal_id.
2. THE signal_evidence_links record SHALL capture the source identifier, source type, catalyst type, sentiment, impact score, extraction confidence, weight assigned during aggregation, duplicate status, canonical evidence key, and contribution score for each contributing document.
3. WHEN recording evidence links, THE Prediction_Snapshot_Writer SHALL compute the canonical_evidence_key as the SHA256 hash of the concatenation of the normalized (lowercased, whitespace-trimmed) document title and the normalized (lowercased, query-parameters-stripped) document URL.
4. WHEN recording evidence links, THE Prediction_Snapshot_Writer SHALL mark a link as `is_duplicate = true` when an earlier link for the same prediction and ticker already carries the same canonical_evidence_key, so that the first occurrence of each key remains non-duplicate.
5. THE Prediction_Snapshot_Writer SHALL compute the contribution_score for each evidence link as the ratio of that document's effective weight to the total effective weight across all documents for the prediction.
6. THE signal_evidence_links table SHALL have a foreign key constraint from prediction_id to prediction_snapshots(id) and indexes on prediction_id, document_id, and ticker.
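
Criteria 4 and 5 can be sketched as follows. The function names are hypothetical. The sketch assumes the first link carrying a key is kept as the original and only later repeats are flagged (consistent with the unit-test example "3 docs with 2 sharing a key → 1 marked duplicate"), and it falls back to equal scores when the total weight is zero, as listed in the error-handling table.

```python
from typing import List, Sequence


def mark_duplicates(canonical_keys: Sequence[str]) -> List[bool]:
    """Return is_duplicate flags: True for every repeat of an already-seen key."""
    seen = set()
    flags = []
    for key in canonical_keys:
        flags.append(key in seen)  # the first occurrence stays False
        seen.add(key)
    return flags


def contribution_scores(effective_weights: Sequence[float]) -> List[float]:
    """Each document's share of the total effective weight (sums to 1.0)."""
    n = len(effective_weights)
    if n == 0:
        return []
    total = sum(effective_weights)
    if total == 0:
        # Fail-safe from the error-handling table: equal contributions.
        return [1.0 / n] * n
    return [w / total for w in effective_weights]
```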
---
### Requirement 3: Evidence Deduplication Quality Tracking
**User Story:** As a quantitative analyst, I want the system to track evidence deduplication quality per prediction, so that I can identify when predictions are inflated by counting the same information multiple times from different sources.
#### Acceptance Criteria
1. WHEN creating a prediction snapshot, THE Prediction_Snapshot_Writer SHALL compute the unique_source_count as the number of distinct source identifiers across all non-duplicate evidence links for that prediction.
2. WHEN creating a prediction snapshot, THE Prediction_Snapshot_Writer SHALL compute the duplicate_evidence_count as the number of evidence links marked as `is_duplicate = true` for that prediction.
3. THE Prediction_Snapshot_Writer SHALL enforce a maximum single-document weight cap of 1.0, clamping any individual document's effective weight to prevent a single piece of evidence from dominating the prediction.
4. WHEN computing contribution scores, THE Prediction_Snapshot_Writer SHALL count each canonical evidence key at most once per ticker per window, applying the one-vote-per-canonical-document deduplication rule.
5. THE Metrics_Engine SHALL compute a duplicate_rate metric as the ratio of duplicate_evidence_count to total evidence_count across predictions in the lookback window.
---
### Requirement 4: Prediction Outcome Evaluation
**User Story:** As a quantitative analyst, I want realized market outcomes automatically matched to historical predictions, so that I can measure whether the system's directional calls and confidence scores correspond to actual price movements.
#### Acceptance Criteria
1. THE Outcome_Evaluator SHALL run on a periodic schedule, evaluating prediction snapshots whose horizon has elapsed and whose outcome has not yet been recorded.
2. WHEN evaluating a prediction snapshot, THE Outcome_Evaluator SHALL compute the future_return as `(future_price - price_at_prediction) / price_at_prediction` using the closing price at the horizon endpoint.
3. WHEN evaluating a prediction snapshot, THE Outcome_Evaluator SHALL compute the SPY return over the same horizon as `(spy_future_price - spy_price_at_prediction) / spy_price_at_prediction`.
4. WHEN evaluating a prediction snapshot, THE Outcome_Evaluator SHALL compute the sector ETF return over the same horizon as `(sector_etf_future_price - sector_etf_price_at_prediction) / sector_etf_price_at_prediction`.
5. WHEN evaluating a prediction snapshot, THE Outcome_Evaluator SHALL compute excess_return_vs_spy as `future_return - spy_return` and excess_return_vs_sector as `future_return - sector_etf_return`.
6. WHEN evaluating a prediction snapshot, THE Outcome_Evaluator SHALL determine direction_correct as true when the prediction direction is bullish and future_return is positive, or when the prediction direction is bearish and future_return is negative.
7. WHEN evaluating a prediction snapshot, THE Outcome_Evaluator SHALL determine profitable as true when the prediction action is buy and future_return is positive, or when the prediction action is sell and future_return is negative.
8. THE Outcome_Evaluator SHALL evaluate each prediction across all applicable horizons: 1 hour, 6 hours, 1 day, 7 days, and 30 days.
9. THE Outcome_Evaluator SHALL store evaluation results in a new `prediction_outcomes` table with a foreign key to prediction_snapshots and indexed columns for prediction_id, horizon, and evaluated_at.
10. IF the future price is unavailable at the horizon endpoint (market data gap), THEN THE Outcome_Evaluator SHALL skip that horizon evaluation and retry on the next run.
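
The return and correctness rules in criteria 2 through 7 reduce to a few pure functions. The sketch below is illustrative only: the function names and the string labels (`"bullish"`, `"bearish"`, `"buy"`, `"sell"`) are assumptions, and returning `None` for a missing or non-positive starting price is one possible way to handle data gaps.

```python
from typing import Optional


def simple_return(start_price: Optional[float], end_price: Optional[float]) -> Optional[float]:
    """(end - start) / start, or None when a price is missing or start <= 0."""
    if start_price is None or end_price is None or start_price <= 0:
        return None
    return (end_price - start_price) / start_price


def excess_return(asset_return: Optional[float], benchmark_return: Optional[float]) -> Optional[float]:
    """Asset return minus benchmark return; None if either side is unavailable."""
    if asset_return is None or benchmark_return is None:
        return None
    return asset_return - benchmark_return


def direction_correct(direction: str, future_return: float) -> bool:
    """True for bullish with a positive return or bearish with a negative return."""
    return (direction == "bullish" and future_return > 0) or (
        direction == "bearish" and future_return < 0
    )


def profitable(action: str, future_return: float) -> bool:
    """True for buy with a positive return or sell with a negative return."""
    return (action == "buy" and future_return > 0) or (
        action == "sell" and future_return < 0
    )
```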
---
### Requirement 5: Calibration Analysis
**User Story:** As a quantitative analyst, I want to measure how well the system's confidence scores predict actual win rates, so that I can identify overconfident or underconfident predictions and recalibrate the model.
#### Acceptance Criteria
1. THE Metrics_Engine SHALL compute calibration metrics by grouping evaluated predictions into confidence buckets: [0.50, 0.60), [0.60, 0.70), [0.70, 0.80), [0.80, 0.90), [0.90, 1.00].
2. FOR EACH confidence bucket, THE Metrics_Engine SHALL compute the average confidence, the observed win rate (fraction of direction_correct outcomes), and the prediction count.
3. THE Metrics_Engine SHALL compute the Expected Calibration Error (ECE) as the weighted average of `|avg_confidence - observed_win_rate|` across all buckets, weighted by the fraction of predictions in each bucket.
4. THE Metrics_Engine SHALL compute the Brier Score as `mean((p_bull - actual_outcome)^2)` across all evaluated predictions, where p_bull is the predicted bullish probability and actual_outcome is 1.0 when the price went up over the horizon and 0.0 otherwise.
5. THE Metrics_Engine SHALL flag calibration buckets where `|avg_confidence - observed_win_rate| > 0.15` as miscalibrated for dashboard highlighting.
6. THE Metrics_Engine SHALL compute calibration metrics separately for each prediction horizon (1h, 6h, 1d, 7d, 30d).
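
A minimal sketch of the bucketed calibration table, ECE, and Brier score described above. All names are hypothetical, empty input is assumed to yield `None` rather than 0.0, and the Brier function takes the binary outcome as supplied by the caller. Bucket lookup uses `bisect` rather than arithmetic on the confidence value, because `(0.6 - 0.5) * 10` evaluates to slightly less than 1.0 in floating point and would place 0.60 in the wrong bucket.

```python
from bisect import bisect_right
from typing import Dict, List, Optional, Sequence, Tuple

LOWER_EDGES = [0.5, 0.6, 0.7, 0.8, 0.9]  # last bucket is [0.90, 1.00]
MISCALIBRATION_THRESHOLD = 0.15


def bucket_index(confidence: float) -> int:
    """Index of the confidence bucket; 1.0 belongs to the last bucket."""
    if not 0.5 <= confidence <= 1.0:
        raise ValueError("confidence must be within [0.5, 1.0]")
    return min(bisect_right(LOWER_EDGES, confidence) - 1, len(LOWER_EDGES) - 1)


def calibration_table(samples: Sequence[Tuple[float, bool]]) -> List[Dict]:
    """One row per non-empty bucket from (confidence, direction_correct) pairs."""
    buckets: List[List[Tuple[float, bool]]] = [[] for _ in LOWER_EDGES]
    for confidence, correct in samples:
        buckets[bucket_index(confidence)].append((confidence, correct))
    rows = []
    for index, members in enumerate(buckets):
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        win_rate = sum(1 for _, ok in members if ok) / len(members)
        rows.append({
            "bucket": index,
            "count": len(members),
            "avg_confidence": avg_conf,
            "win_rate": win_rate,
            "miscalibrated": abs(avg_conf - win_rate) > MISCALIBRATION_THRESHOLD,
        })
    return rows


def expected_calibration_error(samples: Sequence[Tuple[float, bool]]) -> Optional[float]:
    """Count-weighted average of |avg_confidence - win_rate| over buckets."""
    rows = calibration_table(samples)
    total = sum(row["count"] for row in rows)
    if total == 0:
        return None  # assumption: no data is reported as NULL, not 0.0
    return sum(
        row["count"] / total * abs(row["avg_confidence"] - row["win_rate"])
        for row in rows
    )


def brier_score(pairs: Sequence[Tuple[float, float]]) -> Optional[float]:
    """Mean squared error over (p_bull, outcome) pairs, outcome in {0.0, 1.0}."""
    if not pairs:
        return None
    return sum((p - outcome) ** 2 for p, outcome in pairs) / len(pairs)
```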
---
### Requirement 6: Information Coefficient Metrics
**User Story:** As a quantitative analyst, I want to measure the correlation between the system's prediction scores and realized returns, so that I can assess whether higher-scored predictions actually produce higher returns.
#### Acceptance Criteria
1. THE Metrics_Engine SHALL compute the Information Coefficient (IC) as the Pearson correlation between prediction scores and future returns across all evaluated predictions in the lookback window.
2. THE Metrics_Engine SHALL compute the Rank Information Coefficient (Rank IC) as the Spearman rank correlation between prediction scores and future returns across all evaluated predictions in the lookback window.
3. THE Metrics_Engine SHALL compute IC and Rank IC separately for each prediction horizon (1h, 6h, 1d, 7d, 30d).
4. THE Metrics_Engine SHALL compute return statistics by confidence decile, grouping predictions into 10 equal-sized bins by confidence and computing the average future return and average excess return for each decile.
5. WHEN fewer than 30 evaluated predictions exist for a given horizon, THE Metrics_Engine SHALL report IC and Rank IC as NULL rather than computing unreliable correlations from small samples.
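
IC and Rank IC can be sketched with the standard library alone. The names below are hypothetical. The sketch assumes that ties receive average ranks for the Spearman variant and that a constant input series (zero variance) yields `None`, since the correlation is undefined in that case.

```python
import math
from typing import List, Optional, Sequence

MIN_SAMPLES = 30  # below this, IC and Rank IC are reported as NULL


def pearson(xs: Sequence[float], ys: Sequence[float]) -> Optional[float]:
    """Pearson correlation, or None when it is undefined."""
    n = len(xs)
    if n != len(ys) or n < 2:
        return None
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    sxx = sum(d * d for d in dx)
    syy = sum(d * d for d in dy)
    if sxx == 0 or syy == 0:
        return None  # constant series: correlation undefined
    r = sum(a * b for a, b in zip(dx, dy)) / (math.sqrt(sxx) * math.sqrt(syy))
    return max(-1.0, min(1.0, r))  # clamp rounding noise into [-1, 1]


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks where tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def information_coefficient(scores: Sequence[float], returns: Sequence[float]) -> Optional[float]:
    """Pearson IC, or None for fewer than MIN_SAMPLES observations."""
    if len(scores) < MIN_SAMPLES:
        return None
    return pearson(scores, returns)


def rank_information_coefficient(scores: Sequence[float], returns: Sequence[float]) -> Optional[float]:
    """Spearman Rank IC, computed as the Pearson correlation of average ranks."""
    if len(scores) < MIN_SAMPLES:
        return None
    return pearson(average_ranks(scores), average_ranks(returns))
```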
---
### Requirement 7: Source and Signal Attribution
**User Story:** As a quantitative analyst, I want to know which sources, source types, and catalyst types contribute to accurate predictions, so that I can identify the most valuable information channels and deprioritize unreliable ones.
#### Acceptance Criteria
1. THE Attribution_Engine SHALL compute per-source performance metrics by joining signal_evidence_links with prediction_outcomes, grouping by source identifier.
2. FOR EACH source, THE Attribution_Engine SHALL compute: prediction count, average weight, average contribution score, win rate, average future return, average excess return vs SPY, and information coefficient.
3. THE Attribution_Engine SHALL compute the same performance metrics grouped by source_type (e.g., news_api, filings_api, web_scrape, market_api).
4. THE Attribution_Engine SHALL compute the same performance metrics grouped by catalyst_type (e.g., earnings, product, legal, macro, m_and_a).
5. THE Attribution_Engine SHALL compute layer attribution metrics for the three signal layers (company, macro, competitive) by using the score_company, score_macro, and score_competitive fields from prediction snapshots.
6. FOR EACH layer, THE Attribution_Engine SHALL compute the average contribution percentage, the win rate when that layer is the dominant contributor, and the IC of predictions where that layer contributes more than 30% of the total score.
7. THE Attribution_Engine SHALL compute a per-source duplicate_rate as the fraction of evidence links from that source marked as is_duplicate.
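
Per-source aggregation over joined evidence-link and outcome rows might look like the sketch below. It is an in-memory illustration with hypothetical names. It assumes each row is a dict carrying the columns named in `v_source_performance` plus `prediction_id`, computes rates per evidence link rather than per prediction, and omits the per-source IC for brevity.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def _mean(values: List[float]) -> float:
    return sum(values) / len(values)


def per_source_metrics(rows: Iterable[dict]) -> Dict[str, dict]:
    """Group joined link/outcome rows by source and aggregate basic metrics."""
    grouped: Dict[str, List[dict]] = defaultdict(list)
    for row in rows:
        grouped[row["source"]].append(row)

    result = {}
    for source, links in grouped.items():
        result[source] = {
            # Distinct predictions this source contributed to.
            "prediction_count": len({link["prediction_id"] for link in links}),
            "avg_weight": _mean([link["weight"] for link in links]),
            "avg_contribution_score": _mean([link["contribution_score"] for link in links]),
            "win_rate": _mean([1.0 if link["direction_correct"] else 0.0 for link in links]),
            "avg_future_return": _mean([link["future_return"] for link in links]),
            "avg_excess_return_vs_spy": _mean([link["excess_return_vs_spy"] for link in links]),
            "duplicate_rate": _mean([1.0 if link["is_duplicate"] else 0.0 for link in links]),
        }
    return result
```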
---
### Requirement 8: Confidence Recalibration via Source Reliability
**User Story:** As a quantitative analyst, I want source credibility weights adjusted based on historical prediction accuracy using Bayesian shrinkage, so that the system learns from its own track record and improves over time.
#### Acceptance Criteria
1. THE Calibration_Engine SHALL compute source reliability using Bayesian shrinkage: `reliability = 0.5 + (n / (n + 30)) * (observed_win_rate - 0.5)`, where n is the number of evaluated predictions involving that source and observed_win_rate is the fraction of correct directional calls.
2. WHEN a source has zero evaluated predictions, THE Calibration_Engine SHALL assign a reliability of 0.5 (the prior mean).
3. THE Calibration_Engine SHALL compute an adjusted evidence weight for each source as `adjusted_weight = base_weight * (0.5 + reliability)`, clamped to the range [0.1, 2.0].
4. THE Calibration_Engine SHALL update source reliability scores after each outcome evaluation cycle, using the latest prediction outcomes.
5. THE Calibration_Engine SHALL store source reliability scores in the existing `source_accuracy` table, extending it with a reliability column or using the existing accuracy_ratio field with the Bayesian shrinkage formula.
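
The shrinkage formula and weight adjustment from criteria 1 through 3 can be written directly. The function names and the `prior_strength` parameter name are hypothetical; the constant 30 and the clamp range [0.1, 2.0] come from the criteria above.

```python
def source_reliability(n: int, observed_win_rate: float, prior_strength: int = 30) -> float:
    """Bayesian-shrunk win rate: regresses toward 0.5 when n is small."""
    if n <= 0:
        return 0.5  # prior mean for sources with no evaluated predictions
    shrink = n / (n + prior_strength)
    return 0.5 + shrink * (observed_win_rate - 0.5)


def adjusted_weight(base_weight: float, reliability: float) -> float:
    """base_weight * (0.5 + reliability), clamped to [0.1, 2.0]."""
    raw = base_weight * (0.5 + reliability)
    return min(max(raw, 0.1), 2.0)
```

With these definitions, a source with 30 evaluated predictions and a 0.7 observed win rate is pulled halfway back to the prior (0.6), while a source at the prior reliability of 0.5 leaves the base weight unchanged.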
---
### Requirement 9: Benchmark Comparison
**User Story:** As a quantitative analyst, I want the system's prediction performance compared against simple benchmarks, so that I can determine whether the model adds value beyond naive strategies.
#### Acceptance Criteria
1. THE Metrics_Engine SHALL compute the average excess return of all buy predictions versus a buy-and-hold SPY strategy over the same horizons.
2. THE Metrics_Engine SHALL compute the average excess return of all buy predictions versus a buy-and-hold sector ETF strategy over the same horizons.
3. THE Metrics_Engine SHALL compute the win rate of the system's directional predictions compared to a random 50/50 baseline, reporting the statistical significance using a binomial test when the prediction count exceeds 100.
4. THE Metrics_Engine SHALL compute the hit rate improvement, defined as `(system_win_rate - 0.5) / 0.5`, representing the relative improvement over random guessing (for example, a win rate of 0.55 gives 0.10, a 10% improvement).
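
Criteria 3 and 4 can be sketched with an exact binomial tail computed from the standard library. The names are hypothetical, and the choice of a one-sided test (probability of at least the observed number of wins under a fair coin) is an assumption, since the criterion does not say whether the test is one-sided or two-sided. The exact sum is fine for a few thousand predictions; a library routine would be preferable for much larger counts.

```python
from math import comb
from typing import Optional


def hit_rate_improvement(system_win_rate: float) -> float:
    """Relative improvement over a 50/50 baseline."""
    return (system_win_rate - 0.5) / 0.5


def binomial_p_value_vs_coin_flip(wins: int, n: int) -> Optional[float]:
    """One-sided exact p-value P(X >= wins) for X ~ Binomial(n, 0.5).

    Returns None unless the prediction count exceeds 100.
    """
    if n <= 100:
        return None
    if not 0 <= wins <= n:
        raise ValueError("wins must be between 0 and n")
    # Exact integer tail count, divided once by 2**n to avoid rounding drift.
    tail = sum(comb(n, k) for k in range(wins, n + 1))
    return tail / 2 ** n
```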
---
### Requirement 10: Model Metric Snapshots
**User Story:** As a quantitative analyst, I want aggregate model metrics stored as time-series snapshots, so that I can track whether model quality is improving or degrading over time.
#### Acceptance Criteria
1. THE Metrics_Engine SHALL periodically compute and store model_metric_snapshots containing all aggregate metrics for each combination of lookback window and prediction horizon.
2. EACH model_metric_snapshot SHALL contain: prediction count, win rate, directional accuracy, IC, Rank IC, average return, average excess return vs SPY, average excess return vs sector, calibration error (ECE), Brier score, and per-action win rates (buy, sell, hold).
3. THE Metrics_Engine SHALL store model_metric_snapshots in a new `model_metric_snapshots` database table with a UUID primary key and indexed columns for generated_at, lookback_window, and horizon.
4. THE Metrics_Engine SHALL compute snapshots for lookback windows of 7 days, 30 days, 90 days, and all-time.
5. THE Metrics_Engine SHALL store a JSONB metadata field in each snapshot for extensibility, containing any additional computed metrics not captured in dedicated columns.
---
### Requirement 11: Safety Gate for Live Trading
**User Story:** As a platform operator, I want live trading automatically disabled when model quality metrics fall below minimum thresholds, so that the system does not risk real capital on a poorly performing model.
#### Acceptance Criteria
1. THE Quality_Gate SHALL evaluate the following minimum thresholds for live trading eligibility: minimum prediction count of 100, minimum IC of 0.03, minimum win rate of 0.53, maximum ECE of 0.15, and minimum excess return vs SPY of 0.0.
2. WHEN any threshold is not met, THE Quality_Gate SHALL force all recommendations to paper mode, overriding any live_eligible mode assignments.
3. THE Quality_Gate SHALL evaluate gate status at the start of each aggregation cycle by reading the most recent model_metric_snapshot.
4. THE Quality_Gate SHALL log the gate evaluation result including which thresholds passed and which failed, with their actual values.
5. THE Quality_Gate SHALL store the gate evaluation result in the `risk_configs` table under a `model_quality_gate` key, making it available to the recommendation engine and dashboard.
6. IF the model_metric_snapshots table is empty or the most recent snapshot is older than 24 hours, THEN THE Quality_Gate SHALL default to paper-only mode (fail-safe behavior).
7. THE Quality_Gate SHALL support configurable thresholds via the `risk_configs` table, with the default values specified in acceptance criterion 1 used when no override is configured.
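
The gate logic above can be sketched as a pure function over the most recent snapshot. `QualityGateConfig` is named in the testing strategy, but its fields here, along with `MetricSnapshot`, `GateResult`, and `evaluate_gate`, are hypothetical. The sketch assumes thresholds are inclusive and that a NULL or NaN metric fails its check, in keeping with the fail-safe behavior.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional


@dataclass(frozen=True)
class QualityGateConfig:
    min_prediction_count: int = 100
    min_ic: float = 0.03
    min_win_rate: float = 0.53
    max_ece: float = 0.15
    min_excess_return_vs_spy: float = 0.0
    max_snapshot_age: timedelta = timedelta(hours=24)


@dataclass(frozen=True)
class MetricSnapshot:
    generated_at: datetime
    prediction_count: int
    information_coefficient: Optional[float]
    win_rate: Optional[float]
    calibration_error: Optional[float]
    avg_excess_return_vs_spy: Optional[float]


@dataclass(frozen=True)
class GateResult:
    passed: bool          # False means paper-only mode
    failures: List[str]   # reasons, with actual values for logging


def evaluate_gate(
    snapshot: Optional[MetricSnapshot],
    config: QualityGateConfig = QualityGateConfig(),
    now: Optional[datetime] = None,
) -> GateResult:
    now = now or datetime.now(timezone.utc)
    if snapshot is None:
        return GateResult(False, ["no model_metric_snapshots available"])
    if now - snapshot.generated_at > config.max_snapshot_age:
        return GateResult(False, ["most recent snapshot is stale"])

    checks = [
        ("prediction_count", snapshot.prediction_count,
         lambda v: v >= config.min_prediction_count),
        ("information_coefficient", snapshot.information_coefficient,
         lambda v: v >= config.min_ic),
        ("win_rate", snapshot.win_rate, lambda v: v >= config.min_win_rate),
        ("calibration_error", snapshot.calibration_error,
         lambda v: v <= config.max_ece),
        ("avg_excess_return_vs_spy", snapshot.avg_excess_return_vs_spy,
         lambda v: v >= config.min_excess_return_vs_spy),
    ]
    # A missing value fails; NaN also fails because every comparison is False.
    failures = [
        f"{name}={value}" for name, value, ok in checks
        if value is None or not ok(value)
    ]
    return GateResult(passed=not failures, failures=failures)
```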
---
### Requirement 12: Model Performance Dashboard Upgrade
**User Story:** As a platform operator, I want a comprehensive model performance dashboard showing prediction accuracy, calibration, attribution, and gate status, so that I can monitor model quality and make informed decisions about live trading.
#### Acceptance Criteria
1. THE Dashboard_API SHALL expose a `/api/validation/summary` endpoint returning the latest model metric snapshot with summary cards for: prediction count, win rate, directional accuracy, IC, Rank IC, Brier score, calibration error, average excess return vs SPY, average excess return vs sector, and live trading gate status.
2. THE Dashboard_API SHALL expose a `/api/validation/calibration` endpoint returning the calibration table with confidence buckets, average confidence, observed win rate, prediction count, and miscalibration flag for each bucket.
3. THE Dashboard_API SHALL expose a `/api/validation/ic-by-horizon` endpoint returning IC and Rank IC values for each prediction horizon.
4. THE Dashboard_API SHALL expose a `/api/validation/attribution/sources` endpoint returning per-source performance metrics including win rate, IC, average return, and duplicate rate.
5. THE Dashboard_API SHALL expose a `/api/validation/attribution/catalysts` endpoint returning per-catalyst-type performance metrics.
6. THE Dashboard_API SHALL expose a `/api/validation/attribution/layers` endpoint returning per-signal-layer (company, macro, competitive) performance metrics.
7. THE Dashboard_API SHALL expose a `/api/validation/gate-status` endpoint returning the current quality gate evaluation with pass/fail status for each threshold.
8. THE frontend OpsModel page SHALL be upgraded to display the model validation summary cards, calibration table, IC-by-horizon table, source performance table, catalyst truth table, layer attribution table, and gate status indicator.
9. THE frontend SHALL highlight miscalibrated confidence buckets where `|avg_confidence - observed_win_rate| > 0.15` with a visual warning indicator.
---
### Requirement 13: Recommendation Display Enhancements
**User Story:** As a platform operator, I want each recommendation to display its validation context including calibrated confidence, historical win rate, and evidence quality indicators, so that I can assess the reliability of individual predictions.
#### Acceptance Criteria
1. WHEN displaying a recommendation, THE frontend SHALL show the original confidence alongside the calibrated confidence (based on the historical win rate for that confidence bucket).
2. WHEN displaying a recommendation, THE frontend SHALL show the historical win rate for predictions with similar confidence levels.
3. WHEN displaying a recommendation, THE frontend SHALL show the evidence count, unique evidence count, and duplicate evidence count.
4. WHEN displaying a recommendation, THE frontend SHALL show a source reliability indicator based on the Bayesian-shrunk reliability score of the primary contributing sources.
5. WHEN displaying a recommendation, THE frontend SHALL show the live eligibility status with the reason (gate passed, or which threshold failed).
6. WHEN the duplicate evidence count exceeds 20% of the total evidence count, THE frontend SHALL display a warning badge indicating potential evidence inflation.
7. WHEN the primary contributing source has a reliability score below 0.4, THE frontend SHALL display a warning badge indicating low source reliability.
---
### Requirement 14: SQL Explorer Views
**User Story:** As a quantitative analyst, I want pre-built SQL views joining predictions with outcomes and evidence with performance, so that I can run ad-hoc analysis in the SQL Explorer without writing complex joins.
#### Acceptance Criteria
1. THE database migration SHALL create a view `v_prediction_performance` that joins prediction_snapshots with prediction_outcomes on prediction_id, providing a single flat table with prediction inputs and realized outcomes.
2. THE database migration SHALL create a view `v_source_performance` that joins signal_evidence_links with prediction_outcomes (via prediction_id), providing per-evidence-link outcome data for source attribution analysis.
3. THE v_prediction_performance view SHALL include columns for ticker, direction, action, confidence, strength, price_at_prediction, future_return, excess_return_vs_spy, direction_correct, profitable, horizon, generated_at, and evaluated_at.
4. THE v_source_performance view SHALL include columns for source, source_type, catalyst_type, sentiment, weight, contribution_score, is_duplicate, direction_correct, future_return, and excess_return_vs_spy.
---
### Requirement 15: Backtest Replay Integration
**User Story:** As a quantitative analyst, I want to replay historical data through the prediction snapshot and outcome evaluation pipeline, so that I can assess model quality on historical data without future data leakage.
#### Acceptance Criteria
1. THE Backtest_Replay service SHALL support a validation mode that generates prediction snapshots and evaluates outcomes using only data available at each historical point in time.
2. WHEN running in validation mode, THE Backtest_Replay service SHALL process historical recommendations chronologically, creating prediction snapshots with the market prices that were available at each recommendation's generation time.
3. WHEN running in validation mode, THE Backtest_Replay service SHALL evaluate prediction outcomes using market prices from the appropriate future horizon relative to each prediction's generation time.
4. THE Backtest_Replay service SHALL prevent future data leakage by ensuring that no market data with a timestamp after the prediction generation time is used during snapshot creation.
5. WHEN a backtest validation run completes, THE Backtest_Replay service SHALL trigger a model metrics computation over the backtest period, storing the results as model_metric_snapshots tagged with the backtest_id.
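
The leakage rule in criterion 4 amounts to a point-in-time lookup: snapshot creation may only see the latest price at or before the generation time. A minimal in-memory sketch follows, with a hypothetical function name and the assumption that the price history is already sorted by ascending timestamp.

```python
from bisect import bisect_right
from datetime import datetime
from typing import List, Optional, Tuple


def price_as_of(history: List[Tuple[datetime, float]], as_of: datetime) -> Optional[float]:
    """Latest close with timestamp <= as_of, or None if nothing is available yet.

    `history` must be sorted by ascending timestamp.
    """
    timestamps = [ts for ts, _ in history]
    # bisect_right returns the count of entries with timestamp <= as_of.
    idx = bisect_right(timestamps, as_of)
    if idx == 0:
        return None
    return history[idx - 1][1]
```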
---
### Requirement 16: Database Schema
**User Story:** As a developer, I want the new database tables created via a migration script following the existing migration conventions, so that the schema changes are applied consistently across all environments.
#### Acceptance Criteria
1. THE database migration SHALL create the `prediction_snapshots` table with columns: id (UUID PK), generated_at (TIMESTAMPTZ), ticker (VARCHAR), window (VARCHAR), horizon (VARCHAR), direction (VARCHAR), action (VARCHAR), mode (VARCHAR), strength (FLOAT), confidence (FLOAT), contradiction (FLOAT), p_bull (FLOAT), p_bear (FLOAT), score_company (FLOAT), score_macro (FLOAT), score_competitive (FLOAT), evidence_count (INTEGER), unique_source_count (INTEGER), duplicate_evidence_count (INTEGER), price_at_prediction (FLOAT), spy_price_at_prediction (FLOAT), sector_etf_price_at_prediction (FLOAT), metadata (JSONB), created_at (TIMESTAMPTZ).
2. THE database migration SHALL create the `prediction_outcomes` table with columns: id (UUID PK), prediction_id (UUID FK to prediction_snapshots), evaluated_at (TIMESTAMPTZ), horizon (VARCHAR), future_price (FLOAT), future_return (FLOAT), spy_future_price (FLOAT), spy_return (FLOAT), sector_etf_future_price (FLOAT), sector_etf_return (FLOAT), excess_return_vs_spy (FLOAT), excess_return_vs_sector (FLOAT), direction_correct (BOOLEAN), profitable (BOOLEAN), metadata (JSONB), created_at (TIMESTAMPTZ).
3. THE database migration SHALL create the `signal_evidence_links` table with columns: id (UUID PK), prediction_id (UUID FK to prediction_snapshots), document_id (VARCHAR), signal_id (VARCHAR), ticker (VARCHAR), source (VARCHAR), source_type (VARCHAR), catalyst_type (VARCHAR), sentiment (VARCHAR), impact (FLOAT), extraction_confidence (FLOAT), weight (FLOAT), is_duplicate (BOOLEAN), canonical_evidence_key (VARCHAR), contribution_score (FLOAT), metadata (JSONB), created_at (TIMESTAMPTZ).
4. THE database migration SHALL create the `model_metric_snapshots` table with columns: id (UUID PK), generated_at (TIMESTAMPTZ), lookback_window (VARCHAR), horizon (VARCHAR), prediction_count (INTEGER), win_rate (FLOAT), directional_accuracy (FLOAT), information_coefficient (FLOAT), rank_information_coefficient (FLOAT), avg_return (FLOAT), avg_excess_return_vs_spy (FLOAT), avg_excess_return_vs_sector (FLOAT), calibration_error (FLOAT), brier_score (FLOAT), buy_win_rate (FLOAT), sell_win_rate (FLOAT), hold_win_rate (FLOAT), metadata (JSONB), created_at (TIMESTAMPTZ).
5. THE database migration SHALL create appropriate indexes on prediction_snapshots (ticker, generated_at, horizon), prediction_outcomes (prediction_id, horizon), signal_evidence_links (prediction_id, document_id, ticker), and model_metric_snapshots (generated_at, lookback_window, horizon).
6. THE database migration SHALL be numbered as `035_model_validation.sql`, following the existing migration numbering convention.
---
### Requirement 17: Property-Based Testing for Validation Metrics
**User Story:** As a developer, I want property-based tests validating the mathematical correctness of all validation metric computations, so that edge cases and numerical stability issues are caught before deployment.
#### Acceptance Criteria
1. THE test suite SHALL include a property-based test for calibration error verifying that ECE is in [0.0, 1.0] for all valid distributions of predictions across confidence buckets, and that ECE is 0.0 when every bucket's observed win rate exactly matches its average confidence (round-trip calibration property).
2. THE test suite SHALL include a property-based test for Brier score verifying that the score is in [0.0, 1.0] for all valid probability-outcome pairs, and that the score is 0.0 when all predictions are perfectly correct with probability 1.0.
3. THE test suite SHALL include a property-based test for information coefficient verifying that IC is in [-1.0, 1.0] for all valid score-return pairs, and that IC is 1.0 when scores and returns are perfectly positively correlated.
4. THE test suite SHALL include a property-based test for the canonical evidence key verifying that the key is deterministic (same inputs always produce the same key) and that normalization is idempotent (normalizing an already-normalized input produces the same key).
5. THE test suite SHALL include a property-based test for source reliability Bayesian shrinkage verifying that reliability is always in [0.0, 1.0], that reliability approaches 0.5 as sample count approaches 0, and that reliability approaches the observed win rate as sample count approaches infinity.
6. THE test suite SHALL include a property-based test for the quality gate verifying that the gate result is deterministic for the same metric inputs, and that relaxing any single threshold (making it easier to pass) never causes a previously passing gate to fail (monotonicity property).
7. THE test suite SHALL include a property-based test for contribution score computation verifying that all contribution scores for a single prediction sum to 1.0 (within floating-point tolerance) and that each individual score is in [0.0, 1.0].
@@ -0,0 +1,260 @@
# Implementation Plan: Model Validation, Calibration, and Signal Quality
## Overview
Add a closed-loop model validation layer to Stonks Oracle: prediction snapshot capture, outcome evaluation, calibration/IC metrics, source/catalyst/layer attribution, Bayesian source reliability, a quality gate for live trading, 7 new API endpoints, an upgraded OpsModel dashboard, and backtest replay integration. Implementation follows the four-phase priority order from the spec, with each phase building on the previous one.
## Tasks
- [x] 1. Database migration 035 — schema foundation
- [x] 1.1 Create `infra/migrations/035_model_validation.sql` with all tables, indexes, and views
- Create `prediction_snapshots` table with all columns from design (id UUID PK, generated_at, ticker, window, horizon, direction, action, mode, strength, confidence, contradiction, p_bull, p_bear, score_company, score_macro, score_competitive, evidence_count, unique_source_count, duplicate_evidence_count, price_at_prediction, spy_price_at_prediction, sector_etf_price_at_prediction, metadata JSONB, created_at)
- Create `prediction_outcomes` table with FK to prediction_snapshots (id UUID PK, prediction_id, evaluated_at, horizon, future_price, future_return, spy_future_price, spy_return, sector_etf_future_price, sector_etf_return, excess_return_vs_spy, excess_return_vs_sector, direction_correct, profitable, metadata JSONB, created_at)
- Create `signal_evidence_links` table with FK to prediction_snapshots (id UUID PK, prediction_id, document_id, signal_id, ticker, source, source_type, catalyst_type, sentiment, impact, extraction_confidence, weight, is_duplicate, canonical_evidence_key, contribution_score, metadata JSONB, created_at)
- Create `model_metric_snapshots` table (id UUID PK, generated_at, lookback_window, horizon, prediction_count, win_rate, directional_accuracy, information_coefficient, rank_information_coefficient, avg_return, avg_excess_return_vs_spy, avg_excess_return_vs_sector, calibration_error, brier_score, buy_win_rate, sell_win_rate, hold_win_rate, metadata JSONB, created_at)
- Create indexes on prediction_snapshots (ticker, generated_at, horizon), prediction_outcomes (prediction_id, horizon, evaluated_at), signal_evidence_links (prediction_id, document_id, ticker), model_metric_snapshots (generated_at, lookback_window, horizon)
- Create `v_prediction_performance` view joining prediction_snapshots with prediction_outcomes
- Create `v_source_performance` view joining signal_evidence_links with prediction_snapshots and prediction_outcomes
- _Requirements: 16.1, 16.2, 16.3, 16.4, 16.5, 16.6, 14.1, 14.2, 14.3, 14.4_
- [x] 2. Phase 1 — Prediction capture, outcome evaluation, core metrics, and dashboard API
- [x] 2.1 Implement Prediction Snapshot Writer (`services/validation/prediction_snapshot.py`)
- Create `services/validation/__init__.py`
- Define `SECTOR_ETF_MAP`, `EVALUATION_HORIZONS`, `MAX_SINGLE_DOCUMENT_WEIGHT` constants
- Implement `PredictionSnapshot` and `SignalEvidenceLink` dataclasses
- Implement `compute_canonical_evidence_key(title, url)` — SHA256 of normalized title + normalized URL (lowercase, strip whitespace for title; lowercase, strip query params for URL)
- Implement `fetch_latest_close_price(pool, ticker)` — query most recent close from market_snapshots
- Implement `create_prediction_snapshot(pool, recommendation, trend_summary, evidence_signals, evidence_docs)` — fetch prices (ticker, SPY, sector ETF), compute canonical keys, detect duplicates, clamp weights to MAX_SINGLE_DOCUMENT_WEIGHT, compute contribution scores (one-vote-per-canonical-key), persist snapshot + evidence links in a transaction
- Implement `compute_contribution_scores(weights)` — each score = weight_i / sum(weights), sums to 1.0
- Handle NULL prices gracefully (log warning, store NULL, don't fail)
- _Requirements: 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 3.1, 3.2, 3.3, 3.4_
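The key normalization and duplicate detection described in task 2.1 could look roughly like the sketch below. It is not the project's actual module: reading "strip whitespace" as collapsing whitespace runs, the `"\n"` separator between the two fields, and the helper names `normalize_title`, `normalize_url`, and `mark_duplicates` are all assumptions made for illustration.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit


def normalize_title(title: str) -> str:
    # Lowercase and collapse whitespace runs (assumed reading of "strip whitespace").
    return " ".join(title.lower().split())


def normalize_url(url: str) -> str:
    # Lowercase the URL and drop its query string; the fragment is left alone.
    parts = urlsplit(url.strip().lower())
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", parts.fragment))


def compute_canonical_evidence_key(title: str, url: str) -> str:
    # The "\n" separator is an assumption; it keeps the two fields from running together.
    payload = normalize_title(title) + "\n" + normalize_url(url)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def mark_duplicates(keys: list[str]) -> list[bool]:
    # Hypothetical helper: the first document per key is canonical, later ones are duplicates.
    seen: set[str] = set()
    flags: list[bool] = []
    for key in keys:
        flags.append(key in seen)
        seen.add(key)
    return flags
```

Because both normalizers are idempotent, feeding already-normalized inputs back in yields the same key, which is the property task 2.2 checks.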
- [x] 2.2 Write property test for canonical evidence key determinism and idempotence
- **Property 4: Canonical Evidence Key Determinism and Normalization Idempotence**
- Test that same (title, url) always produces same key
- Test that normalizing already-normalized input produces same key
- **Validates: Requirements 2.3, 17.4**
- [x] 2.3 Write property test for contribution score sum-to-one and range
- **Property 7: Contribution Score Sum-to-One and Range**
- Test that all scores in [0.0, 1.0] and sum to 1.0 (within 1e-9 tolerance)
- Test that empty input returns empty list
- **Validates: Requirements 2.5, 17.7**
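The sum-to-one property above follows directly from the `weight_i / sum(weights)` definition. A minimal sketch, assuming non-negative weights, is shown here. The cap value of 1.0 is taken from the clamping example in task 8.2, and the equal-share fallback for an all-zero input is an assumption, not something the plan specifies.

```python
# Assumed value; the plan names the constant, and task 8.2 clamps 1.5 down to 1.0.
MAX_SINGLE_DOCUMENT_WEIGHT = 1.0


def clamp_weight(weight: float, cap: float = MAX_SINGLE_DOCUMENT_WEIGHT) -> float:
    # Keep any single document from dominating a prediction.
    return min(weight, cap)


def compute_contribution_scores(weights: list[float]) -> list[float]:
    # Empty input yields an empty list, as the property test expects.
    if not weights:
        return []
    total = sum(weights)
    if total <= 0:
        # Assumption: fall back to equal shares when every weight is zero.
        return [1.0 / len(weights)] * len(weights)
    return [w / total for w in weights]
```

With non-negative inputs every score lands in [0.0, 1.0], so the range half of Property 7 holds by construction.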
- [x] 2.4 Implement Outcome Evaluator (`services/validation/outcome_evaluator.py`)
- Define `PredictionOutcome` dataclass and `HORIZON_DURATIONS` mapping
- Implement `evaluate_matured_predictions(pool)` — find snapshots where horizon elapsed and outcome not recorded, evaluate each
- Implement `evaluate_single_prediction(pool, snapshot, horizon)` — fetch future price at horizon endpoint, compute future_return, SPY return, sector ETF return, excess returns, direction_correct, profitable; return None if future price unavailable
- Evaluate across all 5 horizons: 1h, 6h, 1d, 7d, 30d
- Skip horizons where future price is unavailable (retry next run)
- Store results in prediction_outcomes table
- _Requirements: 4.1, 4.2, 4.3, 4.4, 4.5, 4.6, 4.7, 4.8, 4.9, 4.10_
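The per-prediction arithmetic in task 2.4 can be sketched without any database access. The version below is illustrative only: the `OutcomeSketch` dataclass and `evaluate_outcome` signature are hypothetical, the string values `"bullish"`, `"bearish"`, `"buy"`, and `"sell"` are assumed encodings, and treating neutral directions and hold actions as neither correct nor profitable is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class OutcomeSketch:
    future_return: float
    spy_return: Optional[float]
    excess_return_vs_spy: Optional[float]
    direction_correct: bool
    profitable: bool


def simple_return(start: float, end: float) -> float:
    # Fractional price change over the horizon.
    return (end - start) / start


def evaluate_outcome(direction, action, price, future_price,
                     spy_price=None, spy_future_price=None):
    # Return None when the prediction cannot be scored yet, so it is retried later.
    if not price or future_price is None:
        return None
    future_return = simple_return(price, future_price)
    spy_return = None
    excess = None
    if spy_price and spy_future_price is not None:
        spy_return = simple_return(spy_price, spy_future_price)
        excess = future_return - spy_return
    direction_correct = (
        (direction == "bullish" and future_return > 0)
        or (direction == "bearish" and future_return < 0)
    )
    profitable = (
        (action == "buy" and future_return > 0)
        or (action == "sell" and future_return < 0)
    )
    return OutcomeSketch(future_return, spy_return, excess, direction_correct, profitable)
```

A missing benchmark price leaves the excess return as `None` rather than failing the whole evaluation, mirroring how the snapshot writer tolerates NULL prices.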
- [x] 2.5 Implement Metrics Engine (`services/validation/metrics.py`)
- Define `CONFIDENCE_BUCKETS`, `LOOKBACK_WINDOWS` constants
- Define `CalibrationBucket` and `ModelMetricSnapshot` dataclasses
- Implement `compute_calibration_error(confidences, outcomes)` — group into 5 confidence buckets, compute ECE as weighted average of |avg_conf - win_rate|, flag miscalibrated buckets (|diff| > 0.15)
- Implement `compute_brier_score(p_bulls, outcomes)` — mean((p_bull - outcome)^2)
- Implement `compute_information_coefficient(scores, returns)` — Pearson correlation, return None when < 30 data points
- Implement `compute_rank_information_coefficient(scores, returns)` — Spearman rank correlation, return None when < 30 data points
- Implement `compute_contribution_scores(weights)` — weight_i / sum(weights), sums to 1.0
- Implement benchmark metrics: average excess return vs SPY, vs sector ETF, hit rate improvement
- Implement `compute_and_store_metric_snapshots(pool)` — compute for all lookback/horizon combinations (4 lookbacks × 5 horizons), persist to model_metric_snapshots
- _Requirements: 5.1, 5.2, 5.3, 5.4, 5.5, 5.6, 6.1, 6.2, 6.3, 6.4, 6.5, 9.1, 9.2, 9.3, 9.4, 10.1, 10.2, 10.3, 10.4, 10.5_
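To make the two calibration metrics in task 2.5 concrete, here is a minimal sketch. It assumes five equal-width confidence buckets over [0, 1] (the real `CONFIDENCE_BUCKETS` edges are not shown in this plan), outcomes encoded as 0/1 or booleans, and a hypothetical field layout for `CalibrationBucket`.

```python
from dataclasses import dataclass


@dataclass
class CalibrationBucket:
    lower: float
    upper: float
    count: int
    avg_confidence: float
    win_rate: float
    miscalibrated: bool


def compute_calibration_error(confidences, outcomes, n_buckets=5, flag_threshold=0.15):
    if not confidences or len(confidences) != len(outcomes):
        raise ValueError("need equally sized, non-empty inputs")
    grouped = [[] for _ in range(n_buckets)]
    for conf, won in zip(confidences, outcomes):
        # A confidence of exactly 1.0 belongs to the top bucket.
        idx = min(int(conf * n_buckets), n_buckets - 1)
        grouped[idx].append((conf, won))
    total = len(confidences)
    ece = 0.0
    buckets = []
    for idx, items in enumerate(grouped):
        if not items:
            continue
        avg_conf = sum(c for c, _ in items) / len(items)
        win_rate = sum(1 for _, won in items if won) / len(items)
        gap = abs(avg_conf - win_rate)
        # ECE is the count-weighted average of the per-bucket gaps.
        ece += (len(items) / total) * gap
        buckets.append(CalibrationBucket(
            idx / n_buckets, (idx + 1) / n_buckets,
            len(items), avg_conf, win_rate, gap > flag_threshold,
        ))
    return ece, buckets


def compute_brier_score(p_bulls, outcomes):
    if not p_bulls or len(p_bulls) != len(outcomes):
        raise ValueError("need equally sized, non-empty inputs")
    # Mean squared difference between predicted probability and the 0/1 outcome.
    return sum((p - int(o)) ** 2 for p, o in zip(p_bulls, outcomes)) / len(p_bulls)
```

Both metrics stay in [0.0, 1.0] whenever the inputs are valid probabilities, which is the range half of Properties 1 and 2.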
- [x] 2.6 Write property test for ECE range and round-trip
- **Property 1: Calibration Error Range and Round-Trip**
- Test ECE in [0.0, 1.0] for all valid distributions
- Test ECE = 0.0 when every bucket's win rate matches avg confidence
- **Validates: Requirements 5.1, 5.3, 17.1**
- [x] 2.7 Write property test for Brier score range and perfect prediction
- **Property 2: Brier Score Range and Perfect Prediction**
- Test Brier in [0.0, 1.0] for all valid (p_bull, outcome) pairs
- Test Brier = 0.0 when all predictions perfectly correct
- **Validates: Requirements 5.4, 17.2**
- [x] 2.8 Write property test for IC range and perfect correlation
- **Property 3: Information Coefficient Range and Perfect Correlation**
- Test IC in [-1.0, 1.0] for all valid (score, return) pairs with ≥30 elements
- Test IC = 1.0 for perfectly positively correlated data
- **Validates: Requirements 6.1, 6.2, 17.3**
- [x] 2.9 Implement Dashboard API endpoints in `services/api/app.py`
- Add `/api/validation/summary` GET — return latest model_metric_snapshot + gate status
- Add `/api/validation/calibration` GET — return calibration table with buckets
- Add `/api/validation/ic-by-horizon` GET — return IC and Rank IC per horizon
- Add `/api/validation/gate-status` GET — return quality gate evaluation detail
- All endpoints accept optional `lookback` (default "30d") and `horizon` (default "7d") query params
- _Requirements: 12.1, 12.2, 12.3, 12.7_
- [x] 2.10 Add frontend validation API hooks in `frontend/src/api/hooks.ts`
- Add `useValidationSummary(lookback?, horizon?)` hook for `/api/validation/summary`
- Add `useValidationCalibration(lookback?, horizon?)` hook for `/api/validation/calibration`
- Add `useValidationICByHorizon(lookback?)` hook for `/api/validation/ic-by-horizon`
- Add `useValidationGateStatus()` hook for `/api/validation/gate-status`
- _Requirements: 12.1, 12.2, 12.3, 12.7_
- [x] 2.11 Upgrade OpsModel page (`frontend/src/pages/OpsModel.tsx`) — Phase 1 dashboard
- Add tabbed layout: existing "Extraction Performance" tab + new "Model Validation" tab
- Add summary cards: prediction count, win rate, directional accuracy, IC, Rank IC, Brier score, ECE, avg excess return vs SPY, gate status
- Add calibration table with confidence buckets, avg confidence, observed win rate, count, miscalibration flag
- Highlight miscalibrated buckets (|avg_confidence - observed_win_rate| > 0.15) with warning indicator
- Add IC-by-horizon table showing IC and Rank IC for each horizon
- Add gate status indicator (pass/fail with threshold details)
- _Requirements: 12.1, 12.2, 12.3, 12.7, 12.8, 12.9_
- [x] 3. Checkpoint — Phase 1 verification
- Ensure all tests pass; ask the user if questions arise.
- [x] 4. Phase 2 — Attribution engine and source/catalyst truth tables
- [x] 4.1 Implement Attribution Engine (`services/validation/attribution.py`)
- Define `SourceAttribution`, `CatalystAttribution`, `LayerAttribution` dataclasses
- Implement `compute_source_attribution(pool, lookback_days, horizon)` — join signal_evidence_links with prediction_outcomes, group by source; compute prediction count, avg weight, avg contribution score, win rate, avg future return, avg excess return vs SPY, IC, duplicate rate
- Implement `compute_catalyst_attribution(pool, lookback_days, horizon)` — same metrics grouped by catalyst_type
- Implement `compute_layer_attribution(pool, lookback_days, horizon)` — compute per-layer (company, macro, competitive) avg contribution %, dominant win rate (layer > 30% contribution), dominant IC
- _Requirements: 7.1, 7.2, 7.3, 7.4, 7.5, 7.6, 7.7_
- [x] 4.2 Implement Calibration Engine (`services/validation/calibration.py`)
- Implement `compute_source_reliability(observed_win_rate, sample_count, prior_strength=30)`: Bayesian shrinkage `0.5 + (n / (n + prior_strength)) * (observed_win_rate - 0.5)`, where n is `sample_count`; return 0.5 when n=0
- Implement `compute_adjusted_evidence_weight(base_weight, reliability)`: `base_weight * (0.5 + reliability)`, clamped to [0.1, 2.0]
- Implement `update_source_reliabilities(pool)` — recompute from latest outcomes, update source_accuracy table
- _Requirements: 8.1, 8.2, 8.3, 8.4, 8.5_
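The shrinkage and weight-adjustment formulas in task 4.2 translate almost line for line into code. The sketch below uses the function names and formulas given in the plan; treating a non-positive `sample_count` the same as zero is an assumption.

```python
def compute_source_reliability(observed_win_rate: float, sample_count: int,
                               prior_strength: float = 30.0) -> float:
    # With no evidence, fall back to the neutral prior of 0.5.
    if sample_count <= 0:
        return 0.5
    # The shrink factor grows from 0 toward 1 as samples accumulate.
    shrink = sample_count / (sample_count + prior_strength)
    return 0.5 + shrink * (observed_win_rate - 0.5)


def compute_adjusted_evidence_weight(base_weight: float, reliability: float) -> float:
    # Reliability 0.5 leaves the weight unchanged; the result is clamped to [0.1, 2.0].
    return max(0.1, min(2.0, base_weight * (0.5 + reliability)))
```

For example, a source with 30 samples and a 0.7 observed win rate is pulled halfway back to the prior, giving a reliability of about 0.6, which matches the unit-test case in task 8.5.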
- [x] 4.3 Write property test for source reliability Bayesian shrinkage bounds and convergence
- **Property 5: Source Reliability Bayesian Shrinkage Bounds and Convergence**
- Test reliability in [0.0, 1.0] for all valid inputs
- Test reliability = 0.5 when sample_count = 0
- Test reliability approaches observed_win_rate as sample_count → ∞
- **Validates: Requirements 8.1, 8.2, 17.5**
- [x] 4.4 Add attribution API endpoints in `services/api/app.py`
- Add `/api/validation/attribution/sources` GET — return per-source performance metrics
- Add `/api/validation/attribution/catalysts` GET — return per-catalyst performance metrics
- Add `/api/validation/attribution/layers` GET — return per-layer performance metrics
- All endpoints accept optional `lookback` (default "30d") and `horizon` (default "7d") query params
- _Requirements: 12.4, 12.5, 12.6_
- [x] 4.5 Add frontend attribution hooks in `frontend/src/api/hooks.ts`
- Add `useValidationAttributionSources(lookback?, horizon?)` hook
- Add `useValidationAttributionCatalysts(lookback?, horizon?)` hook
- Add `useValidationAttributionLayers(lookback?, horizon?)` hook
- _Requirements: 12.4, 12.5, 12.6_
- [x] 4.6 Extend OpsModel page with attribution tables
- Add source performance table (source, win rate, IC, avg return, duplicate rate)
- Add catalyst truth table (catalyst type, win rate, avg return, IC)
- Add layer attribution table (company/macro/competitive contribution %, dominant win rate, IC)
- _Requirements: 12.4, 12.5, 12.6, 12.8_
- [x] 5. Checkpoint — Phase 2 verification
- Ensure all tests pass; ask the user if questions arise.
- [x] 6. Phase 3 — Quality gate, recommendation enhancements, and pipeline wiring
- [x] 6.1 Implement Quality Gate (`services/trading/model_quality_gate.py`)
- Define `QualityGateConfig` dataclass with default thresholds (min_prediction_count=100, min_ic=0.03, min_win_rate=0.53, max_ece=0.15, min_excess_return_vs_spy=0.0, max_snapshot_age_hours=24)
- Define `GateThresholdResult` and `QualityGateResult` dataclasses
- Implement `evaluate_quality_gate(pool, config)` — read most recent model_metric_snapshot (30d lookback, 7d horizon), evaluate each threshold, store result in risk_configs under 'model_quality_gate' key
- Implement `load_gate_config_from_db(pool)` — load thresholds from risk_configs with defaults
- Default to paper-only mode when no snapshots exist or snapshot is stale (>24h)
- Log gate evaluation result with threshold pass/fail details
- _Requirements: 11.1, 11.2, 11.3, 11.4, 11.5, 11.6, 11.7_
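The threshold logic of the gate in task 6.1 can be separated from database access and sketched as a pure function. The default values below come from the plan. The `MetricsSketch` dataclass, the `evaluate_thresholds` name, the failure labels, and the choice that a value exactly on a threshold passes are all assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class QualityGateConfig:
    min_prediction_count: int = 100
    min_ic: float = 0.03
    min_win_rate: float = 0.53
    max_ece: float = 0.15
    min_excess_return_vs_spy: float = 0.0
    max_snapshot_age_hours: float = 24


@dataclass(frozen=True)
class MetricsSketch:
    # Hypothetical stand-in for one model_metric_snapshots row.
    generated_at: datetime
    prediction_count: int
    information_coefficient: Optional[float]
    win_rate: float
    calibration_error: float
    avg_excess_return_vs_spy: float


@dataclass
class QualityGateResult:
    passed: bool
    failures: list[str]


def evaluate_thresholds(snapshot: Optional[MetricsSketch], config: QualityGateConfig,
                        now: datetime) -> QualityGateResult:
    # Fail safe: no snapshot means paper-only mode.
    if snapshot is None:
        return QualityGateResult(False, ["no_snapshot"])
    failures = []
    if now - snapshot.generated_at > timedelta(hours=config.max_snapshot_age_hours):
        failures.append("stale_snapshot")
    if snapshot.prediction_count < config.min_prediction_count:
        failures.append("prediction_count")
    ic = snapshot.information_coefficient
    if ic is None or ic < config.min_ic:
        failures.append("information_coefficient")
    if snapshot.win_rate < config.min_win_rate:
        failures.append("win_rate")
    if snapshot.calibration_error > config.max_ece:
        failures.append("calibration_error")
    if snapshot.avg_excess_return_vs_spy < config.min_excess_return_vs_spy:
        failures.append("excess_return_vs_spy")
    return QualityGateResult(not failures, failures)
```

Since each check is an independent comparison against one threshold, loosening any single threshold can only remove entries from `failures`, which is the monotonicity property task 6.2 tests.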
- [x] 6.2 Write property test for quality gate determinism and threshold monotonicity
- **Property 6: Quality Gate Determinism and Threshold Monotonicity**
- Test same inputs always produce same pass/fail result
- Test relaxing any threshold never causes a previously passing gate to fail
- **Validates: Requirements 11.1, 17.6**
- [x] 6.3 Wire Quality Gate into aggregation cycle (`services/aggregation/worker.py`)
- Call `evaluate_quality_gate` at the start of each aggregation cycle
- When gate fails, force all recommendations to paper mode
- Log gate status at cycle start
- _Requirements: 11.2, 11.3_
- [x] 6.4 Wire Prediction Snapshot Writer into recommendation engine
- After a recommendation is generated in `services/recommendation/eligibility.py` (or in its calling code), call `create_prediction_snapshot` to capture the prediction state
- Pass recommendation, trend_summary, evidence signals, and evidence docs
- Handle snapshot creation failure gracefully (log error, don't block recommendation)
- _Requirements: 1.1, 1.6_
- [x] 6.5 Enhance recommendation display on frontend
- Update `frontend/src/pages/RecommendationDetail` (or relevant recommendation display component) to show:
- Original confidence alongside calibrated confidence (historical win rate for that bucket)
- Historical win rate for similar confidence levels
- Evidence count, unique evidence count, duplicate evidence count
- Source reliability indicator for primary contributing sources
- Live eligibility status with reason (gate passed or which threshold failed)
- Add warning badge when duplicate evidence count > 20% of total evidence count
- Add warning badge when primary source reliability < 0.4
- _Requirements: 13.1, 13.2, 13.3, 13.4, 13.5, 13.6, 13.7_
- [x] 7. Checkpoint — Phase 3 verification
- Ensure all tests pass; ask the user if questions arise.
- [x] 8. Phase 4 — Backtest replay integration and unit tests
- [x] 8.1 Add validation mode to BacktestReplay (`services/trading/backtest_replay.py`)
- Add `validation_mode: bool = False` parameter to `BacktestReplay.run()`
- When validation_mode=True, create prediction snapshots for each historical recommendation using only data available at that point in time
- Evaluate prediction outcomes using market prices from the appropriate future horizon
- Prevent future data leakage: no market data after prediction generation time used during snapshot creation
- After backtest completes, trigger model metrics computation over the backtest period, tag snapshots with backtest_id
- _Requirements: 15.1, 15.2, 15.3, 15.4, 15.5_
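One way to picture the leakage rule in task 8.1 is a point-in-time price lookup over an in-memory series. This is a sketch under stated assumptions rather than the replay service's actual query: `PriceSeries`, `price_as_of`, and `outcome_price` are hypothetical names, the series is assumed sorted by timestamp, and requiring data to reach the horizon endpoint before scoring is an assumed policy.

```python
from bisect import bisect_right
from datetime import datetime, timedelta
from typing import Optional

# (timestamp, close) pairs sorted ascending by timestamp.
PriceSeries = list[tuple[datetime, float]]


def price_as_of(series: PriceSeries, as_of: datetime) -> Optional[float]:
    # Latest close at or before as_of; rows after as_of never contribute.
    timestamps = [ts for ts, _ in series]
    idx = bisect_right(timestamps, as_of)
    return series[idx - 1][1] if idx > 0 else None


def outcome_price(series: PriceSeries, generated_at: datetime,
                  horizon: timedelta) -> Optional[float]:
    # Score only once data reaches the horizon endpoint; otherwise retry later.
    endpoint = generated_at + horizon
    if not series or series[-1][0] < endpoint:
        return None
    return price_as_of(series, endpoint)
```

Snapshot creation would call `price_as_of` with the recommendation's generation time, while outcome evaluation would call `outcome_price` with the horizon, so future prices are reachable only through the evaluation path.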
- [x] 8.2 Write unit tests for prediction snapshot writer (`tests/test_model_validation_unit.py`)
- Test canonical evidence key: known title/URL → expected SHA256, empty inputs, unicode
- Test duplicate detection: 3 docs with 2 sharing a key → 1 marked duplicate
- Test contribution scores: [0.5, 0.3, 0.2] → [0.5, 0.3, 0.2], single doc → [1.0]
- Test weight clamping: weight 1.5 → clamped to 1.0
- _Requirements: 1.1, 2.3, 2.4, 2.5, 3.3_
- [x] 8.3 Write unit tests for outcome evaluator (`tests/test_model_validation_unit.py`)
- Test future return computation: price 100→110 → 0.10, price 100→90 → -0.10
- Test direction_correct logic: bullish+positive → true, bullish+negative → false
- Test profitable logic: buy+positive → true, sell+negative → true
- Test excess return: ticker 10%, SPY 5% → excess 5%
- _Requirements: 4.2, 4.5, 4.6, 4.7_
- [x] 8.4 Write unit tests for metrics engine (`tests/test_model_validation_unit.py`)
- Test ECE specific values: perfect calibration → 0.0, all overconfident → positive ECE
- Test Brier score: all correct at p=1.0 → 0.0, all wrong at p=1.0 → 1.0
- Test IC: perfect correlation → 1.0, anti-correlation → -1.0, < 30 → None
- _Requirements: 5.3, 5.4, 6.1, 6.2, 6.5_
- [x] 8.5 Write unit tests for calibration engine (`tests/test_model_validation_unit.py`)
- Test source reliability: n=0 → 0.5, n=1000 with wr=0.8 → ≈0.8, n=30 with wr=0.7 → 0.6
- Test adjusted evidence weight: reliability=0.5 → base*1.0, clamping to [0.1, 2.0]
- _Requirements: 8.1, 8.2, 8.3_
- [x] 8.6 Write unit tests for quality gate (`tests/test_model_validation_unit.py`)
- Test all thresholds met → pass
- Test one threshold failed → fail with reason
- Test fail-safe: no snapshots → paper-only, stale snapshot → paper-only
- _Requirements: 11.1, 11.6_
- [x] 8.7 Write frontend tests for validation dashboard (`frontend/src/test/pages.test.tsx`)
- Add MSW mock handlers for `/api/validation/summary`, `/api/validation/calibration`, `/api/validation/gate-status`
- Test OpsModel page renders validation tab with summary cards
- Test calibration table renders buckets with miscalibration warning
- Test gate status indicator renders pass/fail
- _Requirements: 12.8, 12.9_
- [x] 9. Final checkpoint — Ensure all tests pass
- Ensure all tests pass; ask the user if questions arise.
## Notes
- Tasks marked with `*` are optional and can be skipped for a faster MVP
- Each task references specific requirements for traceability
- Checkpoints ensure incremental validation after each phase
- Property tests validate the 7 universal correctness properties from the design document
- Unit tests validate specific examples, edge cases, and integration points
- The design uses Python for backend and TypeScript for frontend — no language selection needed
- Migration number is 035 (existing migrations go up to 034)
- All new service modules go under `services/validation/`, except the quality gate, which goes in `services/trading/`
- The 7 new API endpoints are added to the existing `services/api/app.py`
- Frontend hooks follow existing patterns in `frontend/src/api/hooks.ts`
- Phase 1 delivers the core feedback loop (capture → evaluate → measure → display)
- Phase 2 adds attribution depth (which sources/catalysts/layers work best)
- Phase 3 adds safety (quality gate) and UX (recommendation warnings)
- Phase 4 adds historical analysis (backtest validation mode) and comprehensive tests