Celes Renata 7fcc8a6c07
feat: model validation, calibration, and signal quality layer
- Migration 035: prediction_snapshots, prediction_outcomes, signal_evidence_links, model_metric_snapshots tables + SQL views
- Prediction snapshot writer with canonical evidence keys, duplicate detection, contribution scores
- Outcome evaluator across 5 horizons (1h, 6h, 1d, 7d, 30d)
- Metrics engine: ECE, Brier score, IC, Rank IC, benchmark comparison
- Attribution engine: per-source, per-catalyst, per-layer performance
- Calibration engine: Bayesian shrinkage source reliability
- Quality gate for live trading eligibility with configurable thresholds
- 7 new /api/validation/* endpoints
- Upgraded OpsModel dashboard with validation tab
- Enhanced recommendation display with calibration context
- Backtest replay validation mode
- 86 Python tests (unit + property-based), 179 frontend tests passing
2026-05-01 03:04:58 +00:00


# Design Document — Model Validation, Calibration, and Signal Quality
## Overview
This design adds a closed-loop model validation layer to Stonks Oracle. The system currently generates trend summaries and trading recommendations with confidence scores, but has no mechanism to evaluate whether those predictions are accurate, whether confidence scores are well-calibrated, which sources contribute to correct predictions, or whether the system outperforms simple benchmarks.
The validation layer introduces five new service modules under `services/validation/`, a quality gate in `services/trading/`, seven new API endpoints under `/api/validation/`, a database migration (035) with four new tables and two SQL views, and an upgraded OpsModel dashboard page. The architecture follows the existing patterns: pure computation modules with asyncpg for persistence, FastAPI endpoints in `services/api/app.py`, and React/TanStack Query hooks on the frontend.
### Design Rationale
A prediction engine without outcome tracking is flying blind. The validation layer closes the feedback loop by:
1. **Capturing immutable snapshots** at prediction time — preventing hindsight bias in evaluation
2. **Evaluating outcomes** across multiple horizons (1h, 6h, 1d, 7d, 30d) — matching the system's multi-window trend architecture
3. **Computing calibration metrics** (ECE, Brier score) — measuring whether confidence scores mean what they claim
4. **Tracking information coefficients** (IC, Rank IC) — measuring linear and ordinal predictive power
5. **Attributing performance** to sources, catalysts, and signal layers — identifying the most valuable information channels
6. **Recalibrating confidence** via Bayesian shrinkage — learning from the system's own track record
7. **Gating live trading** on minimum quality thresholds — preventing real capital risk on a poorly performing model
The design reuses existing infrastructure (asyncpg, FastAPI, TanStack Query, Recharts) and integrates with the existing `source_accuracy` table from the signal-math-upgrade spec.
---
## Architecture
### High-Level Data Flow
```mermaid
flowchart TD
subgraph "Prediction Capture (Real-time)"
A[Recommendation Engine] -->|generates| B[Prediction_Snapshot_Writer]
B --> C[prediction_snapshots table]
B --> D[signal_evidence_links table]
B -->|computes| E[canonical_evidence_key<br/>duplicate detection<br/>contribution scores]
end
subgraph "Outcome Evaluation (Periodic)"
F[Outcome_Evaluator<br/>scheduled job] -->|reads matured snapshots| C
F -->|fetches future prices| G[market_snapshots table]
F -->|computes returns| H[prediction_outcomes table]
F -->|evaluates 5 horizons| H
end
subgraph "Metrics Computation (Periodic)"
I[Metrics_Engine] -->|reads| H
I -->|reads| C
I -->|reads| D
I -->|computes| J[model_metric_snapshots table]
I -->|computes| K[Calibration: ECE, Brier]
I -->|computes| L[IC, Rank IC by horizon]
I -->|computes| M[Benchmark: excess returns]
end
subgraph "Attribution (Periodic)"
N[Attribution_Engine] -->|joins| D
N -->|joins| H
N -->|computes| O[Per-source metrics]
N -->|computes| P[Per-catalyst metrics]
N -->|computes| Q[Per-layer metrics]
end
subgraph "Calibration (Periodic)"
R[Calibration_Engine] -->|reads| H
R -->|reads| D
R -->|computes Bayesian shrinkage| S[source_accuracy table<br/>reliability scores]
end
subgraph "Safety Gate (Per-cycle)"
T[Quality_Gate] -->|reads latest| J
T -->|evaluates thresholds| U{Pass?}
U -->|yes| V[Live trading allowed]
U -->|no| W[Force paper mode]
T -->|stores result| X[risk_configs table<br/>model_quality_gate key]
end
subgraph "Dashboard (Frontend)"
Y[Dashboard_API<br/>7 endpoints] -->|reads| J
Y -->|reads| C
Y -->|reads| H
Y -->|reads| D
Z[OpsModel.tsx<br/>upgraded page] -->|fetches| Y
end
subgraph "Backtest Integration"
AA[BacktestReplay] -->|validation mode| B
AA -->|validation mode| F
AA -->|triggers| I
end
```
### Scheduling Strategy
The validation components run on different cadences:
| Component | Trigger | Cadence |
|-----------|---------|---------|
| Prediction_Snapshot_Writer | Synchronous — called by recommendation engine | Every recommendation |
| Outcome_Evaluator | Scheduled job | Every 1 hour |
| Metrics_Engine | After Outcome_Evaluator completes | Every 1 hour |
| Attribution_Engine | Called by Metrics_Engine | Every 1 hour |
| Calibration_Engine | After Metrics_Engine completes | Every 6 hours |
| Quality_Gate | Start of each aggregation cycle | Every aggregation cycle |
### Sector ETF Mapping
The system needs a mapping from company sectors to sector ETFs for benchmark comparison. This is stored as a configuration constant:
```python
SECTOR_ETF_MAP: dict[str, str] = {
    "Technology": "XLK",
    "Consumer Cyclical": "XLY",
    "Financial Services": "XLF",
    "Healthcare": "XLV",
    "Energy": "XLE",
    "Communication Services": "XLC",
    "Industrials": "XLI",
    "Consumer Defensive": "XLP",
    "Real Estate": "XLRE",
    "Utilities": "XLU",
}
```
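A lookup over this map might look like the sketch below. The helper name `benchmark_etf_for_sector` is hypothetical, and returning `None` for a missing or unmapped sector is an assumption chosen to match the NULL handling listed under Error Handling.

```python
from typing import Optional

# Subset of SECTOR_ETF_MAP, repeated here so the sketch runs on its own.
SECTOR_ETF_MAP: dict[str, str] = {
    "Technology": "XLK",
    "Energy": "XLE",
    "Utilities": "XLU",
}


def benchmark_etf_for_sector(sector: Optional[str]) -> Optional[str]:
    """Return the sector ETF ticker, or None when the sector is missing or unmapped."""
    if not sector:
        return None
    return SECTOR_ETF_MAP.get(sector.strip())
```

A caller would store NULL for the sector ETF price whenever this returns `None`.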
---
## Components and Interfaces
### New Modules
| Module | File | Responsibility |
|--------|------|----------------|
| Prediction Snapshot Writer | `services/validation/prediction_snapshot.py` | Captures immutable prediction state at generation time |
| Outcome Evaluator | `services/validation/outcome_evaluator.py` | Matches predictions with realized market outcomes |
| Metrics Engine | `services/validation/metrics.py` | Computes calibration, IC, Brier, benchmark metrics |
| Attribution Engine | `services/validation/attribution.py` | Per-source, per-catalyst, per-layer performance |
| Calibration Engine | `services/validation/calibration.py` | Bayesian shrinkage source reliability, weight adjustment |
| Quality Gate | `services/trading/model_quality_gate.py` | Safety gate for live trading eligibility |
### Modified Modules
| Module | File | Changes |
|--------|------|---------|
| Query API | `services/api/app.py` | 7 new `/api/validation/*` endpoints |
| Aggregation Worker | `services/aggregation/worker.py` | Call Quality_Gate at cycle start |
| Recommendation Engine | `services/recommendation/eligibility.py` | Call Prediction_Snapshot_Writer after recommendation |
| Backtest Replay | `services/trading/backtest_replay.py` | Validation mode support |
| Frontend Hooks | `frontend/src/api/hooks.ts` | 7 new validation hooks |
| OpsModel Page | `frontend/src/pages/OpsModel.tsx` | Full dashboard upgrade |
| AppLayout | `frontend/src/components/AppLayout.tsx` | Nav item update (if needed) |
### Component Interface Details
#### 1. Prediction Snapshot Writer (`services/validation/prediction_snapshot.py`)
```python
SECTOR_ETF_MAP: dict[str, str] = {
    "Technology": "XLK",
    "Consumer Cyclical": "XLY",
    "Financial Services": "XLF",
    "Healthcare": "XLV",
    "Energy": "XLE",
    "Communication Services": "XLC",
    "Industrials": "XLI",
    "Consumer Defensive": "XLP",
    "Real Estate": "XLRE",
    "Utilities": "XLU",
}
EVALUATION_HORIZONS: list[str] = ["1h", "6h", "1d", "7d", "30d"]
MAX_SINGLE_DOCUMENT_WEIGHT: float = 1.0


@dataclass
class PredictionSnapshot:
    """Immutable snapshot of a prediction at generation time."""
    id: str  # UUID
    generated_at: datetime
    ticker: str
    window: str
    horizon: str
    direction: str  # bullish/bearish/mixed/neutral
    action: str  # buy/sell/hold/watch
    mode: str  # informational/paper_eligible/live_eligible
    strength: float
    confidence: float
    contradiction: float
    p_bull: float | None
    p_bear: float | None
    score_company: float
    score_macro: float
    score_competitive: float
    evidence_count: int
    unique_source_count: int
    duplicate_evidence_count: int
    price_at_prediction: float | None
    spy_price_at_prediction: float | None
    sector_etf_price_at_prediction: float | None
    metadata: dict


@dataclass
class SignalEvidenceLink:
    """Link between a prediction and a contributing evidence document."""
    id: str  # UUID
    prediction_id: str
    document_id: str
    signal_id: str
    ticker: str
    source: str
    source_type: str
    catalyst_type: str
    sentiment: str
    impact: float
    extraction_confidence: float
    weight: float  # clamped to MAX_SINGLE_DOCUMENT_WEIGHT
    is_duplicate: bool
    canonical_evidence_key: str
    contribution_score: float  # weight / total_weight, sums to 1.0
    metadata: dict


def compute_canonical_evidence_key(title: str, url: str) -> str:
    """SHA256 of normalized(title) + normalized(url).

    Normalization: lowercase, strip whitespace for title;
    lowercase, strip query params for URL.
    """
    ...


async def create_prediction_snapshot(
    pool: asyncpg.Pool,
    recommendation: Recommendation,
    trend_summary: TrendSummary,
    evidence_signals: list[WeightedSignal],
    evidence_docs: list[dict],  # document metadata from recommendation_evidence
) -> PredictionSnapshot:
    """Create and persist a prediction snapshot with evidence links.

    1. Fetches current prices (ticker, SPY, sector ETF) from market_snapshots
    2. Computes canonical evidence keys and duplicate detection
    3. Clamps individual document weights to MAX_SINGLE_DOCUMENT_WEIGHT
    4. Computes contribution scores (one-vote-per-canonical-key dedup)
    5. Persists snapshot and evidence links in a transaction
    """
    ...


async def fetch_latest_close_price(
    pool: asyncpg.Pool,
    ticker: str,
) -> float | None:
    """Fetch most recent close price from market_snapshots for a ticker."""
    ...
```
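The key computation and duplicate marking described in the docstrings above could be sketched as follows. This is a minimal illustration, not the project's implementation: the newline separator between title and URL, keeping the URL fragment, and the helper names `normalize_title`, `normalize_url`, and `mark_duplicates` are all assumptions. Malformed URLs that make `urlsplit` raise are not handled here.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit


def normalize_title(title: str) -> str:
    """Lowercase the title and strip surrounding whitespace."""
    return title.strip().lower()


def normalize_url(url: str) -> str:
    """Lowercase the URL and drop its query string."""
    parts = urlsplit(url.strip().lower())
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", parts.fragment))


def compute_canonical_evidence_key(title: str, url: str) -> str:
    """SHA256 hex digest of the normalized title and URL (UTF-8 encoded)."""
    # Assumption: a newline separates the two parts so they cannot run together.
    payload = normalize_title(title) + "\n" + normalize_url(url)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def mark_duplicates(keys: list[str]) -> list[bool]:
    """Flag every occurrence of a key after its first as a duplicate."""
    seen: set[str] = set()
    flags: list[bool] = []
    for key in keys:
        flags.append(key in seen)
        seen.add(key)
    return flags
```

The hex digest is 64 characters long, which fits the `VARCHAR(64)` column used for `canonical_evidence_key` in the schema.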
#### 2. Outcome Evaluator (`services/validation/outcome_evaluator.py`)
```python
@dataclass
class PredictionOutcome:
    """Realized outcome for a prediction at a specific horizon."""
    id: str  # UUID
    prediction_id: str
    evaluated_at: datetime
    horizon: str  # 1h, 6h, 1d, 7d, 30d
    future_price: float
    future_return: float
    spy_future_price: float | None
    spy_return: float | None
    sector_etf_future_price: float | None
    sector_etf_return: float | None
    excess_return_vs_spy: float | None
    excess_return_vs_sector: float | None
    direction_correct: bool
    profitable: bool
    metadata: dict


HORIZON_DURATIONS: dict[str, timedelta] = {
    "1h": timedelta(hours=1),
    "6h": timedelta(hours=6),
    "1d": timedelta(days=1),
    "7d": timedelta(days=7),
    "30d": timedelta(days=30),
}


async def evaluate_matured_predictions(
    pool: asyncpg.Pool,
) -> int:
    """Evaluate all matured prediction snapshots.

    Finds snapshots where horizon has elapsed and outcome not yet recorded.
    For each, fetches future prices and computes returns.
    Skips horizons where future price is unavailable (retries next run).
    Returns count of outcomes recorded.
    """
    ...


async def evaluate_single_prediction(
    pool: asyncpg.Pool,
    snapshot: PredictionSnapshot,
    horizon: str,
) -> PredictionOutcome | None:
    """Evaluate a single prediction at a specific horizon.

    Returns None if future price is unavailable.
    """
    ...
```
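The design does not spell out the return formulas, so the sketch below is one plausible reading: simple returns, excess return as the difference from the benchmark return, and a direction check that only scores bullish and bearish calls. The helper names and the treatment of `mixed`/`neutral` as never correct are assumptions.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

HORIZON_DURATIONS = {
    "1h": timedelta(hours=1),
    "6h": timedelta(hours=6),
    "1d": timedelta(days=1),
    "7d": timedelta(days=7),
    "30d": timedelta(days=30),
}


def is_matured(generated_at: datetime, horizon: str, now: datetime) -> bool:
    """A snapshot is matured for a horizon once that much time has elapsed."""
    return now >= generated_at + HORIZON_DURATIONS[horizon]


def simple_return(start: Optional[float], end: Optional[float]) -> Optional[float]:
    """Simple return end/start - 1, or None when a price is missing or invalid."""
    if start is None or end is None or start <= 0:
        return None
    return end / start - 1.0


def excess_return(ticker_ret: float, benchmark_ret: Optional[float]) -> Optional[float]:
    """Ticker return minus benchmark return; None when the benchmark is missing."""
    if benchmark_ret is None:
        return None
    return ticker_ret - benchmark_ret


def direction_correct(direction: str, future_return: float) -> bool:
    """Assumption: only bullish/bearish calls can be directionally correct."""
    if direction == "bullish":
        return future_return > 0
    if direction == "bearish":
        return future_return < 0
    return False
```

Returning `None` for a missing benchmark keeps the ticker return usable while the excess-return columns stay NULL.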
#### 3. Metrics Engine (`services/validation/metrics.py`)
```python
CONFIDENCE_BUCKETS: list[tuple[float, float]] = [
    (0.50, 0.60),
    (0.60, 0.70),
    (0.70, 0.80),
    (0.80, 0.90),
    (0.90, 1.00),
]
LOOKBACK_WINDOWS: list[str] = ["7d", "30d", "90d", "all"]


@dataclass
class CalibrationBucket:
    """Calibration metrics for a single confidence bucket."""
    bucket_low: float
    bucket_high: float
    avg_confidence: float
    observed_win_rate: float
    prediction_count: int
    miscalibrated: bool  # |avg_confidence - win_rate| > 0.15


@dataclass
class ModelMetricSnapshot:
    """Aggregate model quality metrics for a lookback/horizon combination."""
    id: str
    generated_at: datetime
    lookback_window: str
    horizon: str
    prediction_count: int
    win_rate: float
    directional_accuracy: float
    information_coefficient: float | None
    rank_information_coefficient: float | None
    avg_return: float
    avg_excess_return_vs_spy: float
    avg_excess_return_vs_sector: float
    calibration_error: float  # ECE
    brier_score: float
    buy_win_rate: float
    sell_win_rate: float
    hold_win_rate: float
    metadata: dict


def compute_calibration_error(
    confidences: list[float],
    outcomes: list[bool],
) -> tuple[float, list[CalibrationBucket]]:
    """Compute ECE and calibration buckets.

    ECE = Σ (n_b / N) * |avg_conf_b - win_rate_b|
    Returns (ece, buckets).
    """
    ...


def compute_brier_score(
    p_bulls: list[float],
    outcomes: list[bool],
) -> float:
    """Brier score = mean((p_bull - outcome)^2).

    outcome is 1.0 when the bullish event that p_bull forecasts occurred
    (the price rose over the horizon), 0.0 otherwise.
    Returns value in [0.0, 1.0].
    """
    ...


def compute_information_coefficient(
    scores: list[float],
    returns: list[float],
) -> float | None:
    """Pearson correlation between prediction scores and future returns.

    Returns None when fewer than 30 data points.
    Returns value in [-1.0, 1.0].
    """
    ...


def compute_rank_information_coefficient(
    scores: list[float],
    returns: list[float],
) -> float | None:
    """Spearman rank correlation between prediction scores and future returns.

    Returns None when fewer than 30 data points.
    Returns value in [-1.0, 1.0].
    """
    ...


def compute_contribution_scores(
    weights: list[float],
) -> list[float]:
    """Compute contribution scores from document weights.

    Each score = weight_i / sum(weights). Sums to 1.0.
    Each score in [0.0, 1.0].
    Returns empty list for empty input.
    """
    ...


async def compute_and_store_metric_snapshots(
    pool: asyncpg.Pool,
) -> list[ModelMetricSnapshot]:
    """Compute metric snapshots for all lookback/horizon combinations.

    Lookback windows: 7d, 30d, 90d, all-time.
    Horizons: 1h, 6h, 1d, 7d, 30d.
    """
    ...
```
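As a rough illustration of the two calibration formulas in the docstrings above, the following sketch computes ECE over the five confidence buckets and a Brier score. It returns only the ECE value rather than the `(ece, buckets)` tuple. The function names, the half-open bucket boundaries, the 0.0 result for empty input, and reading the Brier outcome as "the bullish event occurred" are assumptions.

```python
from typing import Sequence

CONFIDENCE_BUCKETS = [
    (0.50, 0.60), (0.60, 0.70), (0.70, 0.80), (0.80, 0.90), (0.90, 1.00),
]


def expected_calibration_error(
    confidences: Sequence[float], outcomes: Sequence[bool]
) -> float:
    """ECE = sum over non-empty buckets of (n_b / N) * |avg_conf_b - win_rate_b|."""
    total = len(confidences)
    if total == 0:
        return 0.0  # assumption: no predictions means no measurable error
    ece = 0.0
    last = len(CONFIDENCE_BUCKETS) - 1
    for index, (low, high) in enumerate(CONFIDENCE_BUCKETS):
        # Half-open buckets [low, high); the last one also includes 1.0.
        members = [
            (c, o) for c, o in zip(confidences, outcomes)
            if low <= c < high or (index == last and c == high)
        ]
        if not members:
            continue  # empty buckets are excluded from the sum
        avg_conf = sum(c for c, _ in members) / len(members)
        win_rate = sum(1 for _, o in members if o) / len(members)
        ece += (len(members) / total) * abs(avg_conf - win_rate)
    return ece


def brier_score(p_bulls: Sequence[float], outcomes: Sequence[bool]) -> float:
    """Mean squared difference between p_bull and the 0/1 outcome."""
    if not p_bulls:
        return 0.0  # assumption: empty input scores 0.0
    squared = [(p - (1.0 if o else 0.0)) ** 2 for p, o in zip(p_bulls, outcomes)]
    return sum(squared) / len(squared)
```

For example, two predictions at 0.55 confidence with one win and two at 0.95 with two wins give a gap of 0.05 in each bucket, so the ECE is 0.05.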
#### 4. Attribution Engine (`services/validation/attribution.py`)
```python
@dataclass
class SourceAttribution:
    """Performance metrics for a single source."""
    source: str
    source_type: str
    prediction_count: int
    avg_weight: float
    avg_contribution_score: float
    win_rate: float
    avg_future_return: float
    avg_excess_return_vs_spy: float
    information_coefficient: float | None
    duplicate_rate: float


@dataclass
class CatalystAttribution:
    """Performance metrics for a single catalyst type."""
    catalyst_type: str
    prediction_count: int
    win_rate: float
    avg_future_return: float
    avg_excess_return_vs_spy: float
    information_coefficient: float | None


@dataclass
class LayerAttribution:
    """Performance metrics for a signal layer."""
    layer: str  # company, macro, competitive
    avg_contribution_pct: float
    dominant_win_rate: float  # win rate when this layer > 30% contribution
    dominant_ic: float | None  # IC when this layer > 30% contribution


async def compute_source_attribution(
    pool: asyncpg.Pool,
    lookback_days: int = 30,
    horizon: str = "7d",
) -> list[SourceAttribution]:
    ...


async def compute_catalyst_attribution(
    pool: asyncpg.Pool,
    lookback_days: int = 30,
    horizon: str = "7d",
) -> list[CatalystAttribution]:
    ...


async def compute_layer_attribution(
    pool: asyncpg.Pool,
    lookback_days: int = 30,
    horizon: str = "7d",
) -> list[LayerAttribution]:
    ...
```
#### 5. Calibration Engine (`services/validation/calibration.py`)
```python
def compute_source_reliability(
    observed_win_rate: float,
    sample_count: int,
    prior_strength: int = 30,
) -> float:
    """Bayesian shrinkage source reliability.

    reliability = 0.5 + (n / (n + prior_strength)) * (observed_win_rate - 0.5)
    Returns value in [0.0, 1.0].
    When n=0, returns 0.5 (prior mean).
    As n→∞, approaches observed_win_rate.
    """
    ...


def compute_adjusted_evidence_weight(
    base_weight: float,
    reliability: float,
) -> float:
    """Adjusted weight = base_weight * (0.5 + reliability), clamped to [0.1, 2.0]."""
    ...


async def update_source_reliabilities(
    pool: asyncpg.Pool,
) -> int:
    """Recompute and store source reliability scores from latest outcomes.

    Uses the existing source_accuracy table, updating accuracy_ratio
    with the Bayesian shrinkage formula.
    Returns count of sources updated.
    """
    ...
```
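The two pure functions above follow directly from the formulas in their docstrings. A minimal sketch might look like this; the final clamp in `compute_source_reliability` is a defensive assumption, since the formula already stays in [0.0, 1.0] for valid inputs.

```python
def compute_source_reliability(
    observed_win_rate: float,
    sample_count: int,
    prior_strength: int = 30,
) -> float:
    """Shrink the observed win rate toward the 0.5 prior mean."""
    shrink = sample_count / (sample_count + prior_strength)
    reliability = 0.5 + shrink * (observed_win_rate - 0.5)
    # Defensive clamp (assumption); a no-op when the win rate is in [0, 1].
    return min(1.0, max(0.0, reliability))


def compute_adjusted_evidence_weight(base_weight: float, reliability: float) -> float:
    """Scale the base weight by (0.5 + reliability) and clamp to [0.1, 2.0]."""
    return min(2.0, max(0.1, base_weight * (0.5 + reliability)))
```

With the default `prior_strength` of 30, a source with an 80% win rate scores 0.5 with no samples, 0.65 after 30 samples, and 0.77 after 270 samples, so a short lucky streak moves the weight only modestly.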
#### 6. Quality Gate (`services/trading/model_quality_gate.py`)
```python
@dataclass
class QualityGateConfig:
    """Configurable thresholds for live trading eligibility."""
    min_prediction_count: int = 100
    min_ic: float = 0.03
    min_win_rate: float = 0.53
    max_ece: float = 0.15
    min_excess_return_vs_spy: float = 0.0
    max_snapshot_age_hours: int = 24


@dataclass
class GateThresholdResult:
    """Result for a single threshold check."""
    name: str
    threshold: float
    actual: float
    passed: bool


@dataclass
class QualityGateResult:
    """Full gate evaluation result."""
    passed: bool
    evaluated_at: datetime
    threshold_results: list[GateThresholdResult]
    reason: str  # "all thresholds met" or "failed: ..."
    snapshot_id: str | None
    config: QualityGateConfig


async def evaluate_quality_gate(
    pool: asyncpg.Pool,
    config: QualityGateConfig | None = None,
) -> QualityGateResult:
    """Evaluate model quality gate from latest metric snapshot.

    Reads the most recent model_metric_snapshot for the 30d lookback
    and 7d horizon (the primary evaluation window).
    If no snapshot exists or snapshot is stale (>24h), defaults to
    paper-only mode (fail-safe).
    Stores result in risk_configs under 'model_quality_gate' key.
    """
    ...


async def load_gate_config_from_db(
    pool: asyncpg.Pool,
) -> QualityGateConfig:
    """Load gate thresholds from risk_configs, with defaults."""
    ...
```
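The threshold comparison at the core of the gate can be kept as a pure function, separate from database access. The sketch below is illustrative only: `check_thresholds` is a hypothetical helper, the inclusive comparisons and treating a missing IC as a failure are assumptions, and the snapshot staleness check is left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class QualityGateConfig:
    # Defaults copied from the design above (staleness setting omitted).
    min_prediction_count: int = 100
    min_ic: float = 0.03
    min_win_rate: float = 0.53
    max_ece: float = 0.15
    min_excess_return_vs_spy: float = 0.0


def check_thresholds(
    config: QualityGateConfig,
    prediction_count: int,
    ic: Optional[float],
    win_rate: float,
    ece: float,
    excess_return_vs_spy: float,
) -> tuple[bool, list[str]]:
    """Return (passed, names of failed checks). Hypothetical helper."""
    failures: list[str] = []
    if prediction_count < config.min_prediction_count:
        failures.append("min_prediction_count")
    # Assumption: a missing IC (too few samples) fails the gate (fail-safe).
    if ic is None or ic < config.min_ic:
        failures.append("min_ic")
    if win_rate < config.min_win_rate:
        failures.append("min_win_rate")
    if ece > config.max_ece:
        failures.append("max_ece")
    if excess_return_vs_spy < config.min_excess_return_vs_spy:
        failures.append("min_excess_return_vs_spy")
    return (not failures, failures)
```

Because the function has no I/O, the determinism and monotonicity property can be tested on it directly.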
#### 7. Dashboard API Endpoints
Seven new endpoints added to `services/api/app.py`:
| Endpoint | Method | Returns |
|----------|--------|---------|
| `/api/validation/summary` | GET | Latest model metric snapshot + gate status |
| `/api/validation/calibration` | GET | Calibration table with buckets |
| `/api/validation/ic-by-horizon` | GET | IC and Rank IC per horizon |
| `/api/validation/attribution/sources` | GET | Per-source performance |
| `/api/validation/attribution/catalysts` | GET | Per-catalyst performance |
| `/api/validation/attribution/layers` | GET | Per-layer performance |
| `/api/validation/gate-status` | GET | Quality gate evaluation detail |
All endpoints accept optional `lookback` (default "30d") and `horizon` (default "7d") query parameters.
---
## Data Models
### Database Schema (Migration 035)
#### prediction_snapshots
```sql
CREATE TABLE IF NOT EXISTS prediction_snapshots (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    generated_at TIMESTAMPTZ NOT NULL,
    ticker VARCHAR(20) NOT NULL,
    "window" VARCHAR(20) NOT NULL, -- quoted: WINDOW is a reserved word in PostgreSQL
    horizon VARCHAR(20) NOT NULL,
    direction VARCHAR(20) NOT NULL,
    action VARCHAR(20) NOT NULL,
    mode VARCHAR(30) NOT NULL,
    strength FLOAT NOT NULL,
    confidence FLOAT NOT NULL,
    contradiction FLOAT NOT NULL DEFAULT 0.0,
    p_bull FLOAT,
    p_bear FLOAT,
    score_company FLOAT NOT NULL DEFAULT 0.0,
    score_macro FLOAT NOT NULL DEFAULT 0.0,
    score_competitive FLOAT NOT NULL DEFAULT 0.0,
    evidence_count INTEGER NOT NULL DEFAULT 0,
    unique_source_count INTEGER NOT NULL DEFAULT 0,
    duplicate_evidence_count INTEGER NOT NULL DEFAULT 0,
    price_at_prediction FLOAT,
    spy_price_at_prediction FLOAT,
    sector_etf_price_at_prediction FLOAT,
    metadata JSONB DEFAULT '{}',
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE INDEX IF NOT EXISTS idx_pred_snap_ticker ON prediction_snapshots(ticker);
CREATE INDEX IF NOT EXISTS idx_pred_snap_generated ON prediction_snapshots(generated_at);
CREATE INDEX IF NOT EXISTS idx_pred_snap_horizon ON prediction_snapshots(horizon);
```
#### prediction_outcomes
```sql
CREATE TABLE IF NOT EXISTS prediction_outcomes (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    prediction_id UUID NOT NULL REFERENCES prediction_snapshots(id),
    evaluated_at TIMESTAMPTZ NOT NULL,
    horizon VARCHAR(20) NOT NULL,
    future_price FLOAT,
    future_return FLOAT,
    spy_future_price FLOAT,
    spy_return FLOAT,
    sector_etf_future_price FLOAT,
    sector_etf_return FLOAT,
    excess_return_vs_spy FLOAT,
    excess_return_vs_sector FLOAT,
    direction_correct BOOLEAN,
    profitable BOOLEAN,
    metadata JSONB DEFAULT '{}',
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE INDEX IF NOT EXISTS idx_pred_out_prediction ON prediction_outcomes(prediction_id);
CREATE INDEX IF NOT EXISTS idx_pred_out_horizon ON prediction_outcomes(horizon);
CREATE INDEX IF NOT EXISTS idx_pred_out_evaluated ON prediction_outcomes(evaluated_at);
```
#### signal_evidence_links
```sql
CREATE TABLE IF NOT EXISTS signal_evidence_links (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    prediction_id UUID NOT NULL REFERENCES prediction_snapshots(id),
    document_id VARCHAR(200),
    signal_id VARCHAR(200),
    ticker VARCHAR(20),
    source VARCHAR(200),
    source_type VARCHAR(50),
    catalyst_type VARCHAR(50),
    sentiment VARCHAR(20),
    impact FLOAT,
    extraction_confidence FLOAT,
    weight FLOAT,
    is_duplicate BOOLEAN NOT NULL DEFAULT FALSE,
    canonical_evidence_key VARCHAR(64),
    contribution_score FLOAT,
    metadata JSONB DEFAULT '{}',
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE INDEX IF NOT EXISTS idx_sig_ev_prediction ON signal_evidence_links(prediction_id);
CREATE INDEX IF NOT EXISTS idx_sig_ev_document ON signal_evidence_links(document_id);
CREATE INDEX IF NOT EXISTS idx_sig_ev_ticker ON signal_evidence_links(ticker);
```
#### model_metric_snapshots
```sql
CREATE TABLE IF NOT EXISTS model_metric_snapshots (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    generated_at TIMESTAMPTZ NOT NULL,
    lookback_window VARCHAR(20) NOT NULL,
    horizon VARCHAR(20) NOT NULL,
    prediction_count INTEGER NOT NULL DEFAULT 0,
    win_rate FLOAT,
    directional_accuracy FLOAT,
    information_coefficient FLOAT,
    rank_information_coefficient FLOAT,
    avg_return FLOAT,
    avg_excess_return_vs_spy FLOAT,
    avg_excess_return_vs_sector FLOAT,
    calibration_error FLOAT,
    brier_score FLOAT,
    buy_win_rate FLOAT,
    sell_win_rate FLOAT,
    hold_win_rate FLOAT,
    metadata JSONB DEFAULT '{}',
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE INDEX IF NOT EXISTS idx_model_snap_generated ON model_metric_snapshots(generated_at);
CREATE INDEX IF NOT EXISTS idx_model_snap_lookback ON model_metric_snapshots(lookback_window);
CREATE INDEX IF NOT EXISTS idx_model_snap_horizon ON model_metric_snapshots(horizon);
```
#### SQL Explorer Views
```sql
CREATE OR REPLACE VIEW v_prediction_performance AS
SELECT
    ps.ticker,
    ps.direction,
    ps.action,
    ps.confidence,
    ps.strength,
    ps.contradiction,
    ps.p_bull,
    ps.score_company,
    ps.score_macro,
    ps.score_competitive,
    ps.evidence_count,
    ps.unique_source_count,
    ps.duplicate_evidence_count,
    ps.price_at_prediction,
    po.future_return,
    po.excess_return_vs_spy,
    po.excess_return_vs_sector,
    po.direction_correct,
    po.profitable,
    po.horizon,
    ps.generated_at,
    po.evaluated_at
FROM prediction_snapshots ps
JOIN prediction_outcomes po ON po.prediction_id = ps.id;

CREATE OR REPLACE VIEW v_source_performance AS
SELECT
    sel.source,
    sel.source_type,
    sel.catalyst_type,
    sel.sentiment,
    sel.weight,
    sel.contribution_score,
    sel.is_duplicate,
    po.direction_correct,
    po.future_return,
    po.excess_return_vs_spy,
    po.horizon,
    ps.generated_at
FROM signal_evidence_links sel
JOIN prediction_snapshots ps ON ps.id = sel.prediction_id
JOIN prediction_outcomes po ON po.prediction_id = sel.prediction_id;
```
---
## Correctness Properties
*A property is a characteristic or behavior that should hold true across all valid executions of a system — essentially, a formal statement about what the system should do. Properties serve as the bridge between human-readable specifications and machine-verifiable correctness guarantees.*
The following properties were derived from the acceptance criteria through systematic prework analysis. Each property is universally quantified and maps to specific requirements. After reflection, 7 unique properties remain — one for each PBT requirement in Requirement 17. Redundant properties from Requirements 2, 5, 6, 8, and 11 were consolidated with their corresponding Requirement 17 counterparts.
### Property 1: Calibration Error Range and Round-Trip
*For any* valid distribution of predictions across confidence buckets (where each prediction has a confidence in [0.5, 1.0] and a boolean outcome), the Expected Calibration Error (ECE) SHALL be in [0.0, 1.0]. Furthermore, when every bucket's observed win rate exactly matches its average confidence, ECE SHALL be 0.0.
**Validates: Requirements 5.1, 5.3, 17.1**
### Property 2: Brier Score Range and Perfect Prediction
*For any* list of (p_bull, outcome) pairs where p_bull ∈ [0.0, 1.0] and outcome ∈ {0.0, 1.0}, the Brier score SHALL be in [0.0, 1.0]. Furthermore, when all predictions have p_bull = 1.0 and outcome = 1.0 (or p_bull = 0.0 and outcome = 0.0), the Brier score SHALL be 0.0.
**Validates: Requirements 5.4, 17.2**
### Property 3: Information Coefficient Range and Perfect Correlation
*For any* list of (score, return) pairs with at least 30 elements where scores and returns are finite floats, the Information Coefficient (Pearson correlation) SHALL be in [-1.0, 1.0]. Furthermore, when scores and returns are perfectly positively linearly correlated (returns = a * scores + b, a > 0), IC SHALL be 1.0 (within floating-point tolerance).
**Validates: Requirements 6.1, 6.2, 17.3**
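A small pure-Python reference for IC and Rank IC can make Property 3 concrete. This is a sketch under stated assumptions: the function names are hypothetical, tied values receive average ranks, and a zero-variance input returns `None` because the correlation is undefined there (the design does not say how that case is handled).

```python
import math
from typing import Optional, Sequence

MIN_IC_SAMPLES = 30  # below this, the design returns None


def pearson(xs: Sequence[float], ys: Sequence[float]) -> Optional[float]:
    """Pearson correlation, or None when either input has zero variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None
    # Clamp to absorb floating-point drift just outside [-1, 1].
    return max(-1.0, min(1.0, cov / math.sqrt(var_x * var_y)))


def average_ranks(values: Sequence[float]) -> list[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def information_coefficient(scores, returns) -> Optional[float]:
    if len(scores) != len(returns) or len(scores) < MIN_IC_SAMPLES:
        return None
    return pearson(scores, returns)


def rank_information_coefficient(scores, returns) -> Optional[float]:
    if len(scores) != len(returns) or len(scores) < MIN_IC_SAMPLES:
        return None
    return pearson(average_ranks(scores), average_ranks(returns))
```

Spearman correlation is computed here as the Pearson correlation of the ranks, which is why a monotone but non-linear relationship still yields a Rank IC of 1.0.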
### Property 4: Canonical Evidence Key Determinism and Normalization Idempotence
*For any* (title, url) string pair, computing the canonical evidence key SHALL be deterministic — the same inputs always produce the same key. Furthermore, normalizing an already-normalized input (lowercased, trimmed title; lowercased, query-stripped URL) and computing the key SHALL produce the same key as the original computation (idempotence).
**Validates: Requirements 2.3, 17.4**
### Property 5: Source Reliability Bayesian Shrinkage Bounds and Convergence
*For any* observed_win_rate ∈ [0.0, 1.0] and sample_count ≥ 0, the source reliability computed via Bayesian shrinkage SHALL be in [0.0, 1.0]. When sample_count = 0, reliability SHALL be exactly 0.5. As sample_count increases toward infinity, reliability SHALL approach the observed_win_rate monotonically.
**Validates: Requirements 8.1, 8.2, 17.5**
### Property 6: Quality Gate Determinism and Threshold Monotonicity
*For any* set of model metric values and quality gate configuration, the gate evaluation result SHALL be deterministic: the same inputs always produce the same pass/fail result. Furthermore, for any configuration where the gate passes, relaxing any single threshold (decreasing a min value or increasing a max value, which makes it easier to satisfy) SHALL NOT cause the gate to fail (monotonicity).
**Validates: Requirements 11.1, 17.6**
### Property 7: Contribution Score Sum-to-One and Range
*For any* non-empty list of positive document weights, the computed contribution scores SHALL each be in [0.0, 1.0] and SHALL sum to 1.0 (within floating-point tolerance of 1e-9). For an empty weight list, the result SHALL be an empty list.
**Validates: Requirements 2.5, 17.7**
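A direct reading of Property 7 is sketched below. The equal-share fallback for a zero total weight follows the edge case listed under Error Handling; everything else is the plain `weight_i / sum(weights)` definition.

```python
def compute_contribution_scores(weights: list[float]) -> list[float]:
    """Normalize document weights so the scores sum to 1.0."""
    if not weights:
        return []
    total = sum(weights)
    if total == 0:
        # Edge case from the Error Handling table: equal shares of 1/n.
        return [1.0 / len(weights)] * len(weights)
    return [w / total for w in weights]
```

For weights `[2.0, 1.0, 1.0]` the scores are `[0.5, 0.25, 0.25]`, and a single document always receives a score of 1.0.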
---
## Error Handling
### Price Data Unavailability
| Scenario | Handling |
|----------|----------|
| Ticker price unavailable at snapshot time | Store NULL for `price_at_prediction`, log warning, continue |
| SPY price unavailable at snapshot time | Store NULL for `spy_price_at_prediction`, log warning, continue |
| Sector ETF price unavailable at snapshot time | Store NULL for `sector_etf_price_at_prediction`, log warning, continue |
| Sector not found in SECTOR_ETF_MAP | Store NULL for sector ETF price, log warning |
| Future price unavailable at evaluation time | Skip that horizon, retry on next Outcome_Evaluator run |
| SPY/sector ETF future price unavailable | Store NULL for excess returns, still compute ticker return |
### Metrics Computation Edge Cases
| Scenario | Handling |
|----------|----------|
| Zero predictions in a confidence bucket | Exclude bucket from ECE computation |
| Fewer than 30 predictions for IC/Rank IC | Return NULL instead of unreliable correlation |
| All predictions in same confidence bucket | ECE = \|avg_confidence - win_rate\| for that single bucket |
| Division by zero in contribution scores (total weight = 0) | Return equal contribution scores (1/n) |
| Single prediction | Contribution score = 1.0 |
| NaN/infinity in metric computation | Guard with `math.isnan`/`math.isinf` checks, return 0.0 or NULL |
### Quality Gate Failures
| Scenario | Handling |
|----------|----------|
| No model_metric_snapshots exist | Default to paper-only mode (fail-safe) |
| Most recent snapshot older than 24 hours | Default to paper-only mode (fail-safe) |
| risk_configs table unreachable | Default to paper-only mode, log warning |
| Invalid threshold values in risk_configs | Use default thresholds, log warning |
| Gate evaluation fails mid-computation | Default to paper-only mode, log error |
### Database Failures
| Scenario | Handling |
|----------|----------|
| prediction_snapshots insert fails | Log error, do not block recommendation generation |
| signal_evidence_links insert fails | Log error, snapshot still created (partial data) |
| prediction_outcomes insert fails | Log error, retry on next Outcome_Evaluator run |
| model_metric_snapshots insert fails | Log error, stale metrics used until next successful computation |
| source_accuracy update fails | Log error, continue with stale reliability data |
### Canonical Evidence Key Edge Cases
| Scenario | Handling |
|----------|----------|
| Empty title | Use empty string in hash computation |
| Empty URL | Use empty string in hash computation |
| URL with no query parameters | Use URL as-is after lowercasing |
| Non-ASCII characters in title/URL | Encode as UTF-8 before hashing |
---
## Testing Strategy
### Dual Testing Approach
The model validation feature requires both property-based tests (for mathematical correctness of metric computations) and example-based unit tests (for specific behaviors, integration points, and edge cases). Property-based testing is appropriate here because the feature contains several pure mathematical functions (ECE, Brier score, IC, Bayesian shrinkage, contribution scores) with clear input/output behavior and universal properties.
### Property-Based Testing
**Library:** Hypothesis (already in use: the `.hypothesis/` directory exists and the project convention is established)
**Configuration:**
- Target of 100 examples per property: `@settings(max_examples=100)` (Hypothesis may stop earlier if the strategy's search space is exhausted)
- File naming: `tests/test_pbt_model_validation.py`
- Tag format: `# Feature: model-validation-calibration, Property N: <title>`
**Property tests to implement (one test per correctness property):**
| Property | Test Function | Key Generators |
|----------|---------------|----------------|
| 1: ECE range and round-trip | `test_calibration_error_range_and_roundtrip` | `st.lists(st.tuples(st.floats(0.5, 1.0), st.booleans()))` |
| 2: Brier score range and perfect | `test_brier_score_range_and_perfect` | `st.lists(st.tuples(st.floats(0.0, 1.0), st.sampled_from([0.0, 1.0])))` |
| 3: IC range and perfect correlation | `test_information_coefficient_range_and_perfect` | `st.lists(st.floats(-10, 10), min_size=30)` with linear transform |
| 4: Canonical key determinism and idempotence | `test_canonical_key_determinism_and_idempotence` | `st.text()` pairs for title and URL |
| 5: Source reliability bounds and convergence | `test_source_reliability_bounds_and_convergence` | `st.floats(0.0, 1.0)` for win_rate, `st.integers(0, 10000)` for n |
| 6: Quality gate determinism and monotonicity | `test_quality_gate_determinism_and_monotonicity` | Custom strategy for `QualityGateConfig` and metric values |
| 7: Contribution score sum-to-one | `test_contribution_score_sum_to_one` | `st.lists(st.floats(0.01, 100.0), min_size=1)` |
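As an illustration of what one of these properties checks, here is a sketch of a Brier score function together with the range and perfect-prediction assertions from Property 2. The function names and the `None` result for empty input are assumptions. The stdlib `random` loop only stands in for the Hypothesis generators so that the sketch runs without extra dependencies; the real test would use `@given` with the strategy listed in the table.

```python
import random
from typing import Optional, Sequence, Tuple


def brier_score(pairs: Sequence[Tuple[float, float]]) -> Optional[float]:
    """Mean squared difference between predicted probability and 0/1 outcome."""
    if not pairs:
        return None  # assumption: no data yields no score
    return sum((p - o) ** 2 for p, o in pairs) / len(pairs)


def check_brier_properties(trials: int = 100, seed: int = 0) -> None:
    """Randomized stand-in for the Hypothesis property test."""
    rng = random.Random(seed)
    for _ in range(trials):
        n = rng.randint(1, 50)
        pairs = [(rng.random(), float(rng.random() < 0.5)) for _ in range(n)]
        score = brier_score(pairs)
        # Range property: the score always lies in [0, 1].
        assert score is not None and 0.0 <= score <= 1.0
        # Perfect property: predicting the outcome exactly scores 0.0.
        perfect = [(o, o) for _, o in pairs]
        assert brier_score(perfect) == 0.0
```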
### Example-Based Unit Tests
**File:** `tests/test_model_validation_unit.py`
| Test Area | Examples |
|-----------|----------|
| Canonical evidence key | Known title/URL → expected SHA256, empty inputs, unicode |
| Duplicate detection | 3 docs with 2 sharing a key → 1 marked duplicate |
| Contribution scores | [0.5, 0.3, 0.2] → [0.5, 0.3, 0.2], single doc → [1.0] |
| ECE specific values | Perfect calibration → 0.0, all overconfident → positive ECE |
| Brier score specific values | All correct at p=1.0 → 0.0, all wrong at p=1.0 → 1.0 |
| IC specific values | Perfect correlation → 1.0, anti-correlation → -1.0, fewer than 30 samples → None |
| Source reliability | n=0 → 0.5, n=1000 with wr=0.8 → ≈0.8, n=30 with wr=0.7 → 0.6 |
| Adjusted evidence weight | reliability=0.5 → base*1.0, clamping to [0.1, 2.0] |
| Quality gate | All thresholds met → pass, one failed → fail with reason |
| Quality gate fail-safe | No snapshots → paper-only, stale snapshot → paper-only |
| Direction correct logic | bullish+positive → true, bullish+negative → false |
| Profitable logic | buy+positive → true, sell+negative → true |
| Future return computation | price 100→110 → 0.10, price 100→90 → -0.10 |
| Excess return | ticker 10%, SPY 5% → excess 5% |
| Weight clamping | weight 1.5 → clamped to 1.0 |
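Several of the numeric rows above can be reproduced with a few small functions. This is a sketch under stated assumptions: the prior mean of 0.5 follows from the `n=0 → 0.5` row, while the prior strength of 30 is inferred from the `n=30 with wr=0.7 → 0.6` example and is not stated in this section. The function names and the `ValueError` on a non-positive entry price are also assumptions.

```python
PRIOR_MEAN = 0.5      # reliability of a source with no history
PRIOR_STRENGTH = 30   # assumption: inferred from the n=30, wr=0.7 example


def source_reliability(win_rate: float, n: int) -> float:
    """Bayesian shrinkage of an observed win rate toward the prior mean."""
    return (n * win_rate + PRIOR_STRENGTH * PRIOR_MEAN) / (n + PRIOR_STRENGTH)


def future_return(entry_price: float, exit_price: float) -> float:
    """Simple return between entry and exit prices."""
    if entry_price <= 0:
        raise ValueError("entry_price must be positive")  # assumption
    return (exit_price - entry_price) / entry_price


def excess_return(ticker_return: float, benchmark_return: float) -> float:
    """Return relative to the benchmark (SPY in the table above)."""
    return ticker_return - benchmark_return
```

With these constants, a source with no history starts at 0.5 and moves toward its observed win rate as `n` grows, which matches the bounds-and-convergence behavior named in Property 5.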
### Frontend Tests
**File:** `frontend/src/test/pages.test.tsx` (extend existing)
| Test Area | Strategy |
|-----------|----------|
| OpsModel page renders validation tabs | MSW mock for `/api/validation/summary` |
| Calibration table renders buckets | MSW mock for `/api/validation/calibration` |
| Gate status indicator | MSW mock for `/api/validation/gate-status` |
| Miscalibration warning badge | Mock data with miscalibrated bucket |
### Integration Tests
**File:** `tests/test_model_validation_integration.py`
| Test Area | Strategy |
|-----------|----------|
| Snapshot creation with mock DB | asyncpg mock, verify INSERT queries |
| Outcome evaluation with mock prices | asyncpg mock, verify return computation |
| Metrics computation end-to-end | In-memory data, verify all metrics computed |
| API endpoint responses | FastAPI TestClient with mock pool |
### Test File Structure
```
tests/
├── test_pbt_model_validation.py # 7 property-based tests
├── test_model_validation_unit.py # Example-based unit tests
└── test_model_validation_integration.py # Integration tests (optional)
frontend/src/test/
└── pages.test.tsx # Extended with validation page tests
```