# Design Document — Model Validation, Calibration, and Signal Quality

## Overview

This design adds a closed-loop model validation layer to Stonks Oracle. The system currently generates trend summaries and trading recommendations with confidence scores, but has no mechanism to evaluate whether those predictions are accurate, whether confidence scores are well-calibrated, which sources contribute to correct predictions, or whether the system outperforms simple benchmarks.

The validation layer introduces five new service modules under `services/validation/`, a quality gate in `services/trading/`, seven new API endpoints under `/api/validation/`, a database migration (035) with four new tables and two SQL views, and an upgraded OpsModel dashboard page. The architecture follows the existing patterns: pure computation modules with asyncpg for persistence, FastAPI endpoints in `services/api/app.py`, and React/TanStack Query hooks on the frontend.

### Design Rationale

A prediction engine without outcome tracking is flying blind. The validation layer closes the feedback loop by:

1. **Capturing immutable snapshots** at prediction time, preventing hindsight bias in evaluation
2. **Evaluating outcomes** across multiple horizons (1h, 6h, 1d, 7d, 30d), matching the system's multi-window trend architecture
3. **Computing calibration metrics** (ECE, Brier score), measuring whether confidence scores mean what they claim
4. **Tracking information coefficients** (IC, Rank IC), measuring linear and ordinal predictive power
5. **Attributing performance** to sources, catalysts, and signal layers, identifying the most valuable information channels
6. **Recalibrating confidence** via Bayesian shrinkage, learning from the system's own track record
7. **Gating live trading** on minimum quality thresholds, preventing real capital risk on a poorly performing model

The design reuses existing infrastructure (asyncpg, FastAPI, TanStack Query, Recharts) and integrates with the existing `source_accuracy` table from the signal-math-upgrade spec.

---

## Architecture

### High-Level Data Flow

```mermaid
flowchart TD
    subgraph "Prediction Capture (Real-time)"
        A[Recommendation Engine] -->|generates| B[Prediction_Snapshot_Writer]
        B --> C[prediction_snapshots table]
        B --> D[signal_evidence_links table]
        B -->|computes| E[canonical_evidence_key<br/>duplicate detection<br/>contribution scores]
    end

    subgraph "Outcome Evaluation (Periodic)"
        F[Outcome_Evaluator<br/>scheduled job] -->|reads matured snapshots| C
        F -->|fetches future prices| G[market_snapshots table]
        F -->|computes returns| H[prediction_outcomes table]
        F -->|evaluates 5 horizons| H
    end

    subgraph "Metrics Computation (Periodic)"
        I[Metrics_Engine] -->|reads| H
        I -->|reads| C
        I -->|reads| D
        I -->|computes| J[model_metric_snapshots table]
        I -->|computes| K[Calibration: ECE, Brier]
        I -->|computes| L[IC, Rank IC by horizon]
        I -->|computes| M[Benchmark: excess returns]
    end

    subgraph "Attribution (Periodic)"
        N[Attribution_Engine] -->|joins| D
        N -->|joins| H
        N -->|computes| O[Per-source metrics]
        N -->|computes| P[Per-catalyst metrics]
        N -->|computes| Q[Per-layer metrics]
    end

    subgraph "Calibration (Periodic)"
        R[Calibration_Engine] -->|reads| H
        R -->|reads| D
        R -->|computes Bayesian shrinkage| S[source_accuracy table<br/>reliability scores]
    end

    subgraph "Safety Gate (Per-cycle)"
        T[Quality_Gate] -->|reads latest| J
        T -->|evaluates thresholds| U{Pass?}
        U -->|yes| V[Live trading allowed]
        U -->|no| W[Force paper mode]
        T -->|stores result| X[risk_configs table<br/>model_quality_gate key]
    end

    subgraph "Dashboard (Frontend)"
        Y[Dashboard_API<br/>7 endpoints] -->|reads| J
        Y -->|reads| C
        Y -->|reads| H
        Y -->|reads| D
        Z[OpsModel.tsx<br/>upgraded page] -->|fetches| Y
    end

    subgraph "Backtest Integration"
        AA[BacktestReplay] -->|validation mode| B
        AA -->|validation mode| F
        AA -->|triggers| I
    end
```

### Scheduling Strategy

The validation components run on different cadences:

| Component | Trigger | Cadence |
|-----------|---------|---------|
| Prediction_Snapshot_Writer | Synchronous, called by the recommendation engine | Every recommendation |
| Outcome_Evaluator | Scheduled job | Every 1 hour |
| Metrics_Engine | After Outcome_Evaluator completes | Every 1 hour |
| Attribution_Engine | Called by Metrics_Engine | Every 1 hour |
| Calibration_Engine | After Metrics_Engine completes | Every 6 hours |
| Quality_Gate | Start of each aggregation cycle | Every aggregation cycle |

### Sector ETF Mapping

The system needs a mapping from company sectors to sector ETFs for benchmark comparison. This is stored as a configuration constant:

```python
SECTOR_ETF_MAP: dict[str, str] = {
    "Technology": "XLK",
    "Consumer Cyclical": "XLY",
    "Financial Services": "XLF",
    "Healthcare": "XLV",
    "Energy": "XLE",
    "Communication Services": "XLC",
    "Industrials": "XLI",
    "Consumer Defensive": "XLP",
    "Real Estate": "XLRE",
    "Utilities": "XLU",
}
```
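As a rough illustration of how the mapping might be consulted, the sketch below resolves a sector name to its benchmark ETF and returns `None` for an unmapped sector, in line with the NULL handling listed under Error Handling. The helper name `resolve_sector_etf` is hypothetical, and the map is abbreviated for the example.

```python
from __future__ import annotations

import logging

logger = logging.getLogger(__name__)

# Abbreviated copy of the mapping, for illustration only.
SECTOR_ETF_MAP: dict[str, str] = {"Technology": "XLK", "Energy": "XLE"}


def resolve_sector_etf(sector: str | None) -> str | None:
    """Return the benchmark ETF for a sector, or None when unmapped (hypothetical helper)."""
    if sector is None:
        return None
    etf = SECTOR_ETF_MAP.get(sector)
    if etf is None:
        # The caller would then store NULL for the sector ETF price.
        logger.warning("No sector ETF mapped for sector %r", sector)
    return etf
```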

---

## Components and Interfaces

### New Modules

| Module | File | Responsibility |
|--------|------|----------------|
| Prediction Snapshot Writer | `services/validation/prediction_snapshot.py` | Captures immutable prediction state at generation time |
| Outcome Evaluator | `services/validation/outcome_evaluator.py` | Matches predictions with realized market outcomes |
| Metrics Engine | `services/validation/metrics.py` | Computes calibration, IC, Brier, benchmark metrics |
| Attribution Engine | `services/validation/attribution.py` | Per-source, per-catalyst, per-layer performance |
| Calibration Engine | `services/validation/calibration.py` | Bayesian shrinkage source reliability, weight adjustment |
| Quality Gate | `services/trading/model_quality_gate.py` | Safety gate for live trading eligibility |

### Modified Modules

| Module | File | Changes |
|--------|------|---------|
| Query API | `services/api/app.py` | 7 new `/api/validation/*` endpoints |
| Aggregation Worker | `services/aggregation/worker.py` | Call Quality_Gate at cycle start |
| Recommendation Engine | `services/recommendation/eligibility.py` | Call Prediction_Snapshot_Writer after recommendation |
| Backtest Replay | `services/trading/backtest_replay.py` | Validation mode support |
| Frontend Hooks | `frontend/src/api/hooks.ts` | 7 new validation hooks |
| OpsModel Page | `frontend/src/pages/OpsModel.tsx` | Full dashboard upgrade |
| AppLayout | `frontend/src/components/AppLayout.tsx` | Nav item update (if needed) |

### Component Interface Details

#### 1. Prediction Snapshot Writer (`services/validation/prediction_snapshot.py`)

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime

import asyncpg

# Recommendation, TrendSummary, and WeightedSignal are project types
# defined elsewhere; their imports are omitted here.

SECTOR_ETF_MAP: dict[str, str] = {
    "Technology": "XLK",
    "Consumer Cyclical": "XLY",
    "Financial Services": "XLF",
    "Healthcare": "XLV",
    "Energy": "XLE",
    "Communication Services": "XLC",
    "Industrials": "XLI",
    "Consumer Defensive": "XLP",
    "Real Estate": "XLRE",
    "Utilities": "XLU",
}

EVALUATION_HORIZONS: list[str] = ["1h", "6h", "1d", "7d", "30d"]

MAX_SINGLE_DOCUMENT_WEIGHT: float = 1.0


@dataclass
class PredictionSnapshot:
    """Immutable snapshot of a prediction at generation time."""
    id: str  # UUID
    generated_at: datetime
    ticker: str
    window: str
    horizon: str
    direction: str  # bullish/bearish/mixed/neutral
    action: str  # buy/sell/hold/watch
    mode: str  # informational/paper_eligible/live_eligible
    strength: float
    confidence: float
    contradiction: float
    p_bull: float | None
    p_bear: float | None
    score_company: float
    score_macro: float
    score_competitive: float
    evidence_count: int
    unique_source_count: int
    duplicate_evidence_count: int
    price_at_prediction: float | None
    spy_price_at_prediction: float | None
    sector_etf_price_at_prediction: float | None
    metadata: dict


@dataclass
class SignalEvidenceLink:
    """Link between a prediction and a contributing evidence document."""
    id: str  # UUID
    prediction_id: str
    document_id: str
    signal_id: str
    ticker: str
    source: str
    source_type: str
    catalyst_type: str
    sentiment: str
    impact: float
    extraction_confidence: float
    weight: float  # clamped to MAX_SINGLE_DOCUMENT_WEIGHT
    is_duplicate: bool
    canonical_evidence_key: str
    contribution_score: float  # weight / total_weight, sums to 1.0
    metadata: dict


def compute_canonical_evidence_key(title: str, url: str) -> str:
    """SHA256 of normalized(title) + normalized(url).

    Normalization: lowercase, strip whitespace for title;
    lowercase, strip query params for URL.
    """
    ...


async def create_prediction_snapshot(
    pool: asyncpg.Pool,
    recommendation: Recommendation,
    trend_summary: TrendSummary,
    evidence_signals: list[WeightedSignal],
    evidence_docs: list[dict],  # document metadata from recommendation_evidence
) -> PredictionSnapshot:
    """Create and persist a prediction snapshot with evidence links.

    1. Fetches current prices (ticker, SPY, sector ETF) from market_snapshots
    2. Computes canonical evidence keys and duplicate detection
    3. Clamps individual document weights to MAX_SINGLE_DOCUMENT_WEIGHT
    4. Computes contribution scores (one-vote-per-canonical-key dedup)
    5. Persists snapshot and evidence links in a transaction
    """
    ...


async def fetch_latest_close_price(
    pool: asyncpg.Pool,
    ticker: str,
) -> float | None:
    """Fetch most recent close price from market_snapshots for a ticker."""
    ...
```
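The key computation described in the `compute_canonical_evidence_key` docstring could look roughly like the following. This is a minimal sketch under stated assumptions: dropping the URL fragment along with the query string, and treating the first occurrence of a key as the non-duplicate, are choices made for this example, and `normalize_title`, `normalize_url`, and `flag_duplicates` are hypothetical helper names.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit


def normalize_title(title: str) -> str:
    """Lowercase the title and strip surrounding whitespace."""
    return title.strip().lower()


def normalize_url(url: str) -> str:
    """Lowercase the URL and drop its query string (and, by assumption, its fragment)."""
    parts = urlsplit(url.strip().lower())
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def compute_canonical_evidence_key(title: str, url: str) -> str:
    """SHA-256 hex digest of normalized(title) + normalized(url)."""
    payload = normalize_title(title) + normalize_url(url)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def flag_duplicates(keys: list[str]) -> list[bool]:
    """Mark each repeat of a key as a duplicate; the first occurrence is kept (assumption)."""
    seen: set[str] = set()
    flags: list[bool] = []
    for key in keys:
        flags.append(key in seen)
        seen.add(key)
    return flags
```

A SHA-256 hex digest is 64 characters long, which fits the `VARCHAR(64)` declared for `canonical_evidence_key` in `signal_evidence_links`.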

#### 2. Outcome Evaluator (`services/validation/outcome_evaluator.py`)

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

import asyncpg

from services.validation.prediction_snapshot import PredictionSnapshot


@dataclass
class PredictionOutcome:
    """Realized outcome for a prediction at a specific horizon."""
    id: str  # UUID
    prediction_id: str
    evaluated_at: datetime
    horizon: str  # 1h, 6h, 1d, 7d, 30d
    future_price: float
    future_return: float
    spy_future_price: float | None
    spy_return: float | None
    sector_etf_future_price: float | None
    sector_etf_return: float | None
    excess_return_vs_spy: float | None
    excess_return_vs_sector: float | None
    direction_correct: bool
    profitable: bool
    metadata: dict


HORIZON_DURATIONS: dict[str, timedelta] = {
    "1h": timedelta(hours=1),
    "6h": timedelta(hours=6),
    "1d": timedelta(days=1),
    "7d": timedelta(days=7),
    "30d": timedelta(days=30),
}


async def evaluate_matured_predictions(
    pool: asyncpg.Pool,
) -> int:
    """Evaluate all matured prediction snapshots.

    Finds snapshots where horizon has elapsed and outcome not yet recorded.
    For each, fetches future prices and computes returns.
    Skips horizons where future price is unavailable (retries next run).

    Returns count of outcomes recorded.
    """
    ...


async def evaluate_single_prediction(
    pool: asyncpg.Pool,
    snapshot: PredictionSnapshot,
    horizon: str,
) -> PredictionOutcome | None:
    """Evaluate a single prediction at a specific horizon.

    Returns None if future price is unavailable.
    """
    ...
```
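The arithmetic behind an outcome record can be sketched as below. The formulas are assumptions rather than something this document specifies: the return is taken as a simple percentage return, and the excess return as the asset return minus the benchmark return. The helper names `is_matured`, `simple_return`, and `excess_return` are hypothetical.

```python
from __future__ import annotations

from datetime import datetime, timedelta, timezone

HORIZON_DURATIONS: dict[str, timedelta] = {
    "1h": timedelta(hours=1),
    "6h": timedelta(hours=6),
    "1d": timedelta(days=1),
    "7d": timedelta(days=7),
    "30d": timedelta(days=30),
}


def is_matured(generated_at: datetime, horizon: str, now: datetime) -> bool:
    """True once the horizon has fully elapsed since the snapshot was generated."""
    return now >= generated_at + HORIZON_DURATIONS[horizon]


def simple_return(base_price: float | None, future_price: float | None) -> float | None:
    """Simple return future/base - 1, or None when either price is missing (assumed formula)."""
    if base_price is None or future_price is None or base_price <= 0:
        return None
    return future_price / base_price - 1.0


def excess_return(asset_return: float | None, benchmark_return: float | None) -> float | None:
    """Asset return minus benchmark return; None propagates as NULL (assumed formula)."""
    if asset_return is None or benchmark_return is None:
        return None
    return asset_return - benchmark_return


# Example: a snapshot taken at midnight UTC is matured for "1d" a day later, but not for "7d".
generated_at = datetime(2024, 1, 1, tzinfo=timezone.utc)
one_day_later = datetime(2024, 1, 2, tzinfo=timezone.utc)
matured_1d = is_matured(generated_at, "1d", one_day_later)
matured_7d = is_matured(generated_at, "7d", one_day_later)
```

Returning `None` for a missing benchmark price lets the ticker return still be recorded while the excess-return columns stay NULL, as described under Error Handling.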

#### 3. Metrics Engine (`services/validation/metrics.py`)

```python
from dataclasses import dataclass
from datetime import datetime

import asyncpg

CONFIDENCE_BUCKETS: list[tuple[float, float]] = [
    (0.50, 0.60),
    (0.60, 0.70),
    (0.70, 0.80),
    (0.80, 0.90),
    (0.90, 1.00),
]

LOOKBACK_WINDOWS: list[str] = ["7d", "30d", "90d", "all"]


@dataclass
class CalibrationBucket:
    """Calibration metrics for a single confidence bucket."""
    bucket_low: float
    bucket_high: float
    avg_confidence: float
    observed_win_rate: float
    prediction_count: int
    miscalibrated: bool  # |avg_confidence - win_rate| > 0.15


@dataclass
class ModelMetricSnapshot:
    """Aggregate model quality metrics for a lookback/horizon combination."""
    id: str
    generated_at: datetime
    lookback_window: str
    horizon: str
    prediction_count: int
    win_rate: float
    directional_accuracy: float
    information_coefficient: float | None
    rank_information_coefficient: float | None
    avg_return: float
    avg_excess_return_vs_spy: float
    avg_excess_return_vs_sector: float
    calibration_error: float  # ECE
    brier_score: float
    buy_win_rate: float
    sell_win_rate: float
    hold_win_rate: float
    metadata: dict


def compute_calibration_error(
    confidences: list[float],
    outcomes: list[bool],
) -> tuple[float, list[CalibrationBucket]]:
    """Compute ECE and calibration buckets.

    ECE = Σ (n_b / N) * |avg_conf_b - win_rate_b|

    Returns (ece, buckets).
    """
    ...


def compute_brier_score(
    p_bulls: list[float],
    outcomes: list[bool],
) -> float:
    """Brier score = mean((p_bull - outcome)^2).

    outcome is 1.0 when price moved in predicted direction, 0.0 otherwise.
    Returns value in [0.0, 1.0].
    """
    ...


def compute_information_coefficient(
    scores: list[float],
    returns: list[float],
) -> float | None:
    """Pearson correlation between prediction scores and future returns.

    Returns None when fewer than 30 data points.
    Returns value in [-1.0, 1.0].
    """
    ...


def compute_rank_information_coefficient(
    scores: list[float],
    returns: list[float],
) -> float | None:
    """Spearman rank correlation between prediction scores and future returns.

    Returns None when fewer than 30 data points.
    Returns value in [-1.0, 1.0].
    """
    ...


def compute_contribution_scores(
    weights: list[float],
) -> list[float]:
    """Compute contribution scores from document weights.

    Each score = weight_i / sum(weights). Sums to 1.0.
    Each score in [0.0, 1.0].
    Returns empty list for empty input.
    """
    ...


async def compute_and_store_metric_snapshots(
    pool: asyncpg.Pool,
) -> list[ModelMetricSnapshot]:
    """Compute metric snapshots for all lookback/horizon combinations.

    Lookback windows: 7d, 30d, 90d, all-time.
    Horizons: 1h, 6h, 1d, 7d, 30d.
    """
    ...
```
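To make the two calibration formulas concrete, here is a minimal sketch of ECE and the Brier score using the bucket edges above. It is a simplified variant that returns only the ECE value rather than the `(ece, buckets)` tuple. Several details are assumptions: buckets are half-open except the last (so a confidence of exactly 1.0 is counted), predictions outside every bucket are ignored, and empty input yields 0.0. For the Brier score, the outcome must be encoded against the same event the probability describes.

```python
from __future__ import annotations

CONFIDENCE_BUCKETS: list[tuple[float, float]] = [
    (0.50, 0.60), (0.60, 0.70), (0.70, 0.80), (0.80, 0.90), (0.90, 1.00),
]


def compute_ece(confidences: list[float], outcomes: list[bool]) -> float:
    """ECE = sum over buckets of (n_b / N) * |avg_conf_b - win_rate_b|."""
    if len(confidences) != len(outcomes):
        raise ValueError("confidences and outcomes must have the same length")
    buckets: list[list[tuple[float, bool]]] = [[] for _ in CONFIDENCE_BUCKETS]
    for conf, won in zip(confidences, outcomes):
        for i, (low, high) in enumerate(CONFIDENCE_BUCKETS):
            is_last = i == len(CONFIDENCE_BUCKETS) - 1
            if low <= conf < high or (is_last and conf == high):
                buckets[i].append((conf, won))
                break
    total = sum(len(b) for b in buckets)
    if total == 0:
        return 0.0  # assumption: nothing to calibrate against
    ece = 0.0
    for members in buckets:
        if not members:
            continue  # empty buckets are excluded from the sum
        avg_conf = sum(c for c, _ in members) / len(members)
        win_rate = sum(1 for _, w in members if w) / len(members)
        ece += (len(members) / total) * abs(avg_conf - win_rate)
    return ece


def compute_brier_score(probabilities: list[float], outcomes: list[bool]) -> float:
    """Mean squared difference between predicted probability and the 0/1 outcome."""
    if len(probabilities) != len(outcomes):
        raise ValueError("probabilities and outcomes must have the same length")
    if not probabilities:
        return 0.0  # assumption: empty input scores as 0.0
    return sum((p - float(o)) ** 2 for p, o in zip(probabilities, outcomes)) / len(probabilities)
```

With two predictions at 0.55 confidence (one win) and two at 0.95 (both wins), each bucket is off by 0.05 and carries half the weight, so the ECE is 0.05.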

#### 4. Attribution Engine (`services/validation/attribution.py`)

```python
from dataclasses import dataclass

import asyncpg


@dataclass
class SourceAttribution:
    """Performance metrics for a single source."""
    source: str
    source_type: str
    prediction_count: int
    avg_weight: float
    avg_contribution_score: float
    win_rate: float
    avg_future_return: float
    avg_excess_return_vs_spy: float
    information_coefficient: float | None
    duplicate_rate: float


@dataclass
class CatalystAttribution:
    """Performance metrics for a single catalyst type."""
    catalyst_type: str
    prediction_count: int
    win_rate: float
    avg_future_return: float
    avg_excess_return_vs_spy: float
    information_coefficient: float | None


@dataclass
class LayerAttribution:
    """Performance metrics for a signal layer."""
    layer: str  # company, macro, competitive
    avg_contribution_pct: float
    dominant_win_rate: float  # win rate when this layer > 30% contribution
    dominant_ic: float | None  # IC when this layer > 30% contribution


async def compute_source_attribution(
    pool: asyncpg.Pool,
    lookback_days: int = 30,
    horizon: str = "7d",
) -> list[SourceAttribution]:
    ...


async def compute_catalyst_attribution(
    pool: asyncpg.Pool,
    lookback_days: int = 30,
    horizon: str = "7d",
) -> list[CatalystAttribution]:
    ...


async def compute_layer_attribution(
    pool: asyncpg.Pool,
    lookback_days: int = 30,
    horizon: str = "7d",
) -> list[LayerAttribution]:
    ...
```
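The per-source aggregation could be done in SQL or in Python; the sketch below shows an in-memory version over rows shaped like evidence links joined to outcomes (keys `source`, `direction_correct`, `future_return`, `is_duplicate`). It is illustrative only. The function name `summarize_sources` is hypothetical, "win" is assumed to mean `direction_correct`, and the count is taken over evidence-link rows rather than distinct predictions.

```python
from collections import defaultdict


def summarize_sources(rows: list[dict]) -> dict[str, dict[str, float]]:
    """Group joined evidence/outcome rows by source and compute simple averages."""
    grouped: dict[str, list[dict]] = defaultdict(list)
    for row in rows:
        grouped[row["source"]].append(row)

    summary: dict[str, dict[str, float]] = {}
    for source, items in grouped.items():
        n = len(items)
        summary[source] = {
            "prediction_count": n,
            # Assumption: a "win" is a row whose prediction direction was correct.
            "win_rate": sum(1 for r in items if r["direction_correct"]) / n,
            "avg_future_return": sum(r["future_return"] for r in items) / n,
            "duplicate_rate": sum(1 for r in items if r["is_duplicate"]) / n,
        }
    return summary


example_rows = [
    {"source": "reuters", "direction_correct": True, "future_return": 0.02, "is_duplicate": False},
    {"source": "reuters", "direction_correct": False, "future_return": -0.01, "is_duplicate": True},
    {"source": "sec", "direction_correct": True, "future_return": 0.03, "is_duplicate": False},
]
example_summary = summarize_sources(example_rows)
```

Here `example_summary["reuters"]` has a win rate and duplicate rate of 0.5 each, since one of its two rows is correct and one is a duplicate.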

#### 5. Calibration Engine (`services/validation/calibration.py`)

```python
import asyncpg


def compute_source_reliability(
    observed_win_rate: float,
    sample_count: int,
    prior_strength: int = 30,
) -> float:
    """Bayesian shrinkage source reliability.

    reliability = 0.5 + (n / (n + prior_strength)) * (observed_win_rate - 0.5)

    Returns value in [0.0, 1.0].
    When n=0, returns 0.5 (prior mean).
    As n→∞, approaches observed_win_rate.
    """
    ...


def compute_adjusted_evidence_weight(
    base_weight: float,
    reliability: float,
) -> float:
    """Adjusted weight = base_weight * (0.5 + reliability), clamped to [0.1, 2.0]."""
    ...


async def update_source_reliabilities(
    pool: asyncpg.Pool,
) -> int:
    """Recompute and store source reliability scores from latest outcomes.

    Uses the existing source_accuracy table, updating accuracy_ratio
    with the Bayesian shrinkage formula.

    Returns count of sources updated.
    """
    ...
```
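The two pure functions follow directly from the formulas in their docstrings. A minimal sketch is shown below; the input validation is an addition for the example, not something the design specifies.

```python
def compute_source_reliability(
    observed_win_rate: float,
    sample_count: int,
    prior_strength: int = 30,
) -> float:
    """Shrink the observed win rate toward the 0.5 prior based on sample size."""
    if not 0.0 <= observed_win_rate <= 1.0:
        raise ValueError("observed_win_rate must be in [0.0, 1.0]")
    if sample_count < 0 or prior_strength <= 0:
        raise ValueError("sample_count must be >= 0 and prior_strength > 0")
    shrinkage = sample_count / (sample_count + prior_strength)
    return 0.5 + shrinkage * (observed_win_rate - 0.5)


def compute_adjusted_evidence_weight(base_weight: float, reliability: float) -> float:
    """Scale the base weight by (0.5 + reliability) and clamp to [0.1, 2.0]."""
    return min(2.0, max(0.1, base_weight * (0.5 + reliability)))
```

As a worked example, a source with an 80% observed win rate over 30 samples (equal to the default prior strength) is shrunk halfway toward the prior: 0.5 + 0.5 × 0.3 = 0.65. A base weight of 1.0 then becomes 1.0 × (0.5 + 0.65) = 1.15.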

#### 6. Quality Gate (`services/trading/model_quality_gate.py`)

```python
from dataclasses import dataclass
from datetime import datetime

import asyncpg


@dataclass
class QualityGateConfig:
    """Configurable thresholds for live trading eligibility."""
    min_prediction_count: int = 100
    min_ic: float = 0.03
    min_win_rate: float = 0.53
    max_ece: float = 0.15
    min_excess_return_vs_spy: float = 0.0
    max_snapshot_age_hours: int = 24


@dataclass
class GateThresholdResult:
    """Result for a single threshold check."""
    name: str
    threshold: float
    actual: float
    passed: bool


@dataclass
class QualityGateResult:
    """Full gate evaluation result."""
    passed: bool
    evaluated_at: datetime
    threshold_results: list[GateThresholdResult]
    reason: str  # "all thresholds met" or "failed: ..."
    snapshot_id: str | None
    config: QualityGateConfig


async def evaluate_quality_gate(
    pool: asyncpg.Pool,
    config: QualityGateConfig | None = None,
) -> QualityGateResult:
    """Evaluate model quality gate from latest metric snapshot.

    Reads the most recent model_metric_snapshot for the 30d lookback
    and 7d horizon (the primary evaluation window).

    If no snapshot exists or snapshot is stale (>24h), defaults to
    paper-only mode (fail-safe).

    Stores result in risk_configs under 'model_quality_gate' key.
    """
    ...


async def load_gate_config_from_db(
    pool: asyncpg.Pool,
) -> QualityGateConfig:
    """Load gate thresholds from risk_configs, with defaults."""
    ...
```
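The threshold logic can be separated from database access so that it stays a pure, testable function. The sketch below illustrates one way to do that. The function name `check_quality_gate`, the metrics dictionary keys, and the use of inclusive comparisons (`>=` and `<=`) are assumptions for this example.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class QualityGateConfig:
    min_prediction_count: int = 100
    min_ic: float = 0.03
    min_win_rate: float = 0.53
    max_ece: float = 0.15
    min_excess_return_vs_spy: float = 0.0
    max_snapshot_age_hours: int = 24


def check_quality_gate(
    metrics: dict | None,
    snapshot_age_hours: float | None,
    config: QualityGateConfig = QualityGateConfig(),
) -> tuple[bool, list[str]]:
    """Return (passed, failure names). Missing or stale data fails safe."""
    if metrics is None or snapshot_age_hours is None:
        return False, ["no metric snapshot available"]
    if snapshot_age_hours > config.max_snapshot_age_hours:
        return False, ["metric snapshot is stale"]

    # (metric key, threshold, True when the threshold is a minimum)
    checks = [
        ("prediction_count", config.min_prediction_count, True),
        ("information_coefficient", config.min_ic, True),
        ("win_rate", config.min_win_rate, True),
        ("calibration_error", config.max_ece, False),
        ("avg_excess_return_vs_spy", config.min_excess_return_vs_spy, True),
    ]
    failures: list[str] = []
    for name, threshold, is_minimum in checks:
        actual = metrics.get(name)
        if actual is None:
            failures.append(name)  # a missing metric (e.g. NULL IC) cannot pass
            continue
        passed = actual >= threshold if is_minimum else actual <= threshold
        if not passed:
            failures.append(name)
    return (not failures, failures)
```

Because each check only compares one metric against one threshold, loosening a single threshold can never turn a passing result into a failing one, which is the monotonicity the gate is expected to have.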

#### 7. Dashboard API Endpoints

Seven new endpoints added to `services/api/app.py`:

| Endpoint | Method | Returns |
|----------|--------|---------|
| `/api/validation/summary` | GET | Latest model metric snapshot + gate status |
| `/api/validation/calibration` | GET | Calibration table with buckets |
| `/api/validation/ic-by-horizon` | GET | IC and Rank IC per horizon |
| `/api/validation/attribution/sources` | GET | Per-source performance |
| `/api/validation/attribution/catalysts` | GET | Per-catalyst performance |
| `/api/validation/attribution/layers` | GET | Per-layer performance |
| `/api/validation/gate-status` | GET | Quality gate evaluation detail |

All endpoints accept optional `lookback` (default "30d") and `horizon` (default "7d") query parameters.

---

## Data Models

### Database Schema (Migration 035)

#### prediction_snapshots

```sql
CREATE TABLE IF NOT EXISTS prediction_snapshots (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    generated_at TIMESTAMPTZ NOT NULL,
    ticker VARCHAR(20) NOT NULL,
    "window" VARCHAR(20) NOT NULL,  -- quoted: WINDOW is a reserved word in PostgreSQL
    horizon VARCHAR(20) NOT NULL,
    direction VARCHAR(20) NOT NULL,
    action VARCHAR(20) NOT NULL,
    mode VARCHAR(30) NOT NULL,
    strength FLOAT NOT NULL,
    confidence FLOAT NOT NULL,
    contradiction FLOAT NOT NULL DEFAULT 0.0,
    p_bull FLOAT,
    p_bear FLOAT,
    score_company FLOAT NOT NULL DEFAULT 0.0,
    score_macro FLOAT NOT NULL DEFAULT 0.0,
    score_competitive FLOAT NOT NULL DEFAULT 0.0,
    evidence_count INTEGER NOT NULL DEFAULT 0,
    unique_source_count INTEGER NOT NULL DEFAULT 0,
    duplicate_evidence_count INTEGER NOT NULL DEFAULT 0,
    price_at_prediction FLOAT,
    spy_price_at_prediction FLOAT,
    sector_etf_price_at_prediction FLOAT,
    metadata JSONB DEFAULT '{}',
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE INDEX IF NOT EXISTS idx_pred_snap_ticker ON prediction_snapshots(ticker);
CREATE INDEX IF NOT EXISTS idx_pred_snap_generated ON prediction_snapshots(generated_at);
CREATE INDEX IF NOT EXISTS idx_pred_snap_horizon ON prediction_snapshots(horizon);
```

#### prediction_outcomes

```sql
CREATE TABLE IF NOT EXISTS prediction_outcomes (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    prediction_id UUID NOT NULL REFERENCES prediction_snapshots(id),
    evaluated_at TIMESTAMPTZ NOT NULL,
    horizon VARCHAR(20) NOT NULL,
    future_price FLOAT,
    future_return FLOAT,
    spy_future_price FLOAT,
    spy_return FLOAT,
    sector_etf_future_price FLOAT,
    sector_etf_return FLOAT,
    excess_return_vs_spy FLOAT,
    excess_return_vs_sector FLOAT,
    direction_correct BOOLEAN,
    profitable BOOLEAN,
    metadata JSONB DEFAULT '{}',
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE INDEX IF NOT EXISTS idx_pred_out_prediction ON prediction_outcomes(prediction_id);
CREATE INDEX IF NOT EXISTS idx_pred_out_horizon ON prediction_outcomes(horizon);
CREATE INDEX IF NOT EXISTS idx_pred_out_evaluated ON prediction_outcomes(evaluated_at);
```

#### signal_evidence_links

```sql
CREATE TABLE IF NOT EXISTS signal_evidence_links (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    prediction_id UUID NOT NULL REFERENCES prediction_snapshots(id),
    document_id VARCHAR(200),
    signal_id VARCHAR(200),
    ticker VARCHAR(20),
    source VARCHAR(200),
    source_type VARCHAR(50),
    catalyst_type VARCHAR(50),
    sentiment VARCHAR(20),
    impact FLOAT,
    extraction_confidence FLOAT,
    weight FLOAT,
    is_duplicate BOOLEAN NOT NULL DEFAULT FALSE,
    canonical_evidence_key VARCHAR(64),
    contribution_score FLOAT,
    metadata JSONB DEFAULT '{}',
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE INDEX IF NOT EXISTS idx_sig_ev_prediction ON signal_evidence_links(prediction_id);
CREATE INDEX IF NOT EXISTS idx_sig_ev_document ON signal_evidence_links(document_id);
CREATE INDEX IF NOT EXISTS idx_sig_ev_ticker ON signal_evidence_links(ticker);
```

#### model_metric_snapshots

```sql
CREATE TABLE IF NOT EXISTS model_metric_snapshots (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    generated_at TIMESTAMPTZ NOT NULL,
    lookback_window VARCHAR(20) NOT NULL,
    horizon VARCHAR(20) NOT NULL,
    prediction_count INTEGER NOT NULL DEFAULT 0,
    win_rate FLOAT,
    directional_accuracy FLOAT,
    information_coefficient FLOAT,
    rank_information_coefficient FLOAT,
    avg_return FLOAT,
    avg_excess_return_vs_spy FLOAT,
    avg_excess_return_vs_sector FLOAT,
    calibration_error FLOAT,
    brier_score FLOAT,
    buy_win_rate FLOAT,
    sell_win_rate FLOAT,
    hold_win_rate FLOAT,
    metadata JSONB DEFAULT '{}',
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE INDEX IF NOT EXISTS idx_model_snap_generated ON model_metric_snapshots(generated_at);
CREATE INDEX IF NOT EXISTS idx_model_snap_lookback ON model_metric_snapshots(lookback_window);
CREATE INDEX IF NOT EXISTS idx_model_snap_horizon ON model_metric_snapshots(horizon);
```

#### SQL Explorer Views

```sql
CREATE OR REPLACE VIEW v_prediction_performance AS
SELECT
    ps.ticker,
    ps.direction,
    ps.action,
    ps.confidence,
    ps.strength,
    ps.contradiction,
    ps.p_bull,
    ps.score_company,
    ps.score_macro,
    ps.score_competitive,
    ps.evidence_count,
    ps.unique_source_count,
    ps.duplicate_evidence_count,
    ps.price_at_prediction,
    po.future_return,
    po.excess_return_vs_spy,
    po.excess_return_vs_sector,
    po.direction_correct,
    po.profitable,
    po.horizon,
    ps.generated_at,
    po.evaluated_at
FROM prediction_snapshots ps
JOIN prediction_outcomes po ON po.prediction_id = ps.id;

CREATE OR REPLACE VIEW v_source_performance AS
SELECT
    sel.source,
    sel.source_type,
    sel.catalyst_type,
    sel.sentiment,
    sel.weight,
    sel.contribution_score,
    sel.is_duplicate,
    po.direction_correct,
    po.future_return,
    po.excess_return_vs_spy,
    po.horizon,
    ps.generated_at
FROM signal_evidence_links sel
JOIN prediction_snapshots ps ON ps.id = sel.prediction_id
JOIN prediction_outcomes po ON po.prediction_id = sel.prediction_id;
```

---

## Correctness Properties

*A property is a characteristic or behavior that should hold true across all valid executions of a system; in essence, it is a formal statement about what the system should do. Properties serve as the bridge between human-readable specifications and machine-verifiable correctness guarantees.*

The following properties were derived from the acceptance criteria through systematic prework analysis. Each property is universally quantified and maps to specific requirements. After reflection, 7 unique properties remain, one for each property-based testing (PBT) requirement in Requirement 17. Redundant properties from Requirements 2, 5, 6, 8, and 11 were consolidated with their corresponding Requirement 17 counterparts.

### Property 1: Calibration Error Range and Round-Trip

*For any* valid distribution of predictions across confidence buckets (where each prediction has a confidence in [0.5, 1.0] and a boolean outcome), the Expected Calibration Error (ECE) SHALL be in [0.0, 1.0]. Furthermore, when every bucket's observed win rate exactly matches its average confidence, ECE SHALL be 0.0.

**Validates: Requirements 5.1, 5.3, 17.1**

### Property 2: Brier Score Range and Perfect Prediction

*For any* list of (p_bull, outcome) pairs where p_bull ∈ [0.0, 1.0] and outcome ∈ {0.0, 1.0}, the Brier score SHALL be in [0.0, 1.0]. Furthermore, when all predictions have p_bull = 1.0 and outcome = 1.0 (or p_bull = 0.0 and outcome = 0.0), the Brier score SHALL be 0.0.

**Validates: Requirements 5.4, 17.2**

### Property 3: Information Coefficient Range and Perfect Correlation

*For any* list of (score, return) pairs with at least 30 elements where scores and returns are finite floats, the Information Coefficient (Pearson correlation) SHALL be in [-1.0, 1.0]. Furthermore, when scores and returns are perfectly positively linearly correlated (returns = a * scores + b, a > 0), IC SHALL be 1.0 (within floating-point tolerance).

**Validates: Requirements 6.1, 6.2, 17.3**
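
For illustration, the two correlation metrics can be computed without third-party dependencies as sketched below. The 30-sample minimum comes from the interface docstrings. Returning `None` for constant input, clamping the result to [-1.0, 1.0], and averaging ranks for ties are assumptions made for this example, and `average_ranks` is a hypothetical helper.

```python
from __future__ import annotations

import math

MIN_SAMPLES = 30


def compute_information_coefficient(scores: list[float], returns: list[float]) -> float | None:
    """Pearson correlation, or None when there are too few points or no variance."""
    n = len(scores)
    if n != len(returns):
        raise ValueError("scores and returns must have the same length")
    if n < MIN_SAMPLES:
        return None
    mean_s = sum(scores) / n
    mean_r = sum(returns) / n
    cov = sum((s - mean_s) * (r - mean_r) for s, r in zip(scores, returns))
    var_s = sum((s - mean_s) ** 2 for s in scores)
    var_r = sum((r - mean_r) ** 2 for r in returns)
    if var_s == 0 or var_r == 0:
        return None  # assumption: correlation is undefined for constant input
    # Clamp to absorb floating-point drift just outside [-1, 1].
    return max(-1.0, min(1.0, cov / math.sqrt(var_s * var_r)))


def average_ranks(values: list[float]) -> list[float]:
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def compute_rank_information_coefficient(scores: list[float], returns: list[float]) -> float | None:
    """Spearman correlation: Pearson correlation of the rank-transformed inputs."""
    if len(scores) != len(returns):
        raise ValueError("scores and returns must have the same length")
    return compute_information_coefficient(average_ranks(scores), average_ranks(returns))
```

A monotone but non-linear relationship (for example, returns equal to the cube of the score) gives a Rank IC of 1.0 while the Pearson IC stays below 1.0, which is why the design tracks both.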

### Property 4: Canonical Evidence Key Determinism and Normalization Idempotence

*For any* (title, url) string pair, computing the canonical evidence key SHALL be deterministic: the same inputs always produce the same key. Furthermore, normalizing an already-normalized input (lowercased, trimmed title; lowercased, query-stripped URL) and computing the key SHALL produce the same key as the original computation (idempotence).

**Validates: Requirements 2.3, 17.4**

### Property 5: Source Reliability Bayesian Shrinkage Bounds and Convergence

*For any* observed_win_rate ∈ [0.0, 1.0] and sample_count ≥ 0, the source reliability computed via Bayesian shrinkage SHALL be in [0.0, 1.0]. When sample_count = 0, reliability SHALL be exactly 0.5. As sample_count increases toward infinity, reliability SHALL approach the observed_win_rate monotonically.

**Validates: Requirements 8.1, 8.2, 17.5**

### Property 6: Quality Gate Determinism and Threshold Monotonicity

*For any* set of model metric values and quality gate configuration, the gate evaluation result SHALL be deterministic: the same inputs always produce the same pass/fail result. Furthermore, for any configuration where the gate passes, relaxing any single threshold (decreasing a minimum value or increasing a maximum value so that it is easier to satisfy) SHALL NOT cause the gate to fail (monotonicity).

**Validates: Requirements 11.1, 17.6**

### Property 7: Contribution Score Sum-to-One and Range

*For any* non-empty list of positive document weights, the computed contribution scores SHALL each be in [0.0, 1.0] and SHALL sum to 1.0 (within floating-point tolerance of 1e-9). For an empty weight list, the result SHALL be an empty list.

**Validates: Requirements 2.5, 17.7**
|
|
|
|
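Normalizing each weight by the total gives scores with the stated range and sum. The sketch below is illustrative; the function name is an assumption, and the equal-share fallback for a zero total follows the edge-case table in the Error Handling section.

```python
import math
from typing import List, Sequence


def contribution_scores(weights: Sequence[float]) -> List[float]:
    """Normalize document weights so that they sum to 1.0."""
    if not weights:
        return []
    total = math.fsum(weights)
    if total <= 0.0:
        # Zero total weight: fall back to equal shares (1/n).
        return [1.0 / len(weights)] * len(weights)
    return [w / total for w in weights]
```

For example, weights `[5.0, 3.0, 2.0]` normalize to approximately `[0.5, 0.3, 0.2]`, and a single document always receives a score of 1.0.
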
---

## Error Handling

### Price Data Unavailability

| Scenario | Handling |
|----------|----------|
| Ticker price unavailable at snapshot time | Store NULL for `price_at_prediction`, log warning, continue |
| SPY price unavailable at snapshot time | Store NULL for `spy_price_at_prediction`, log warning, continue |
| Sector ETF price unavailable at snapshot time | Store NULL for `sector_etf_price_at_prediction`, log warning, continue |
| Sector not found in SECTOR_ETF_MAP | Store NULL for sector ETF price, log warning |
| Future price unavailable at evaluation time | Skip that horizon, retry on next Outcome_Evaluator run |
| SPY/sector ETF future price unavailable | Store NULL for excess returns, still compute ticker return |

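The NULL-propagation rules in the table above can be sketched with `Optional` values, where `None` stands for a database NULL. The function names `simple_return` and `excess_return` are hypothetical.

```python
from typing import Optional


def simple_return(start_price: Optional[float], end_price: Optional[float]) -> Optional[float]:
    """Fractional price change, or None when either price is unavailable."""
    if start_price is None or end_price is None or start_price <= 0:
        return None
    return (end_price - start_price) / start_price


def excess_return(
    ticker_return: Optional[float], benchmark_return: Optional[float]
) -> Optional[float]:
    """Ticker return minus benchmark return; None if either side is missing."""
    if ticker_return is None or benchmark_return is None:
        return None
    return ticker_return - benchmark_return
```

A missing benchmark price therefore leaves the excess return as `None` while the ticker return is still computed, which matches the last row of the table.
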
### Metrics Computation Edge Cases

| Scenario | Handling |
|----------|----------|
| Zero predictions in a confidence bucket | Exclude bucket from ECE computation |
| Fewer than 30 predictions for IC/Rank IC | Return NULL instead of unreliable correlation |
| All predictions in same confidence bucket | ECE = \|avg_confidence - win_rate\| for that single bucket |
| Division by zero in contribution scores (total weight = 0) | Return equal contribution scores (1/n) |
| Single prediction | Contribution score = 1.0 |
| NaN/infinity in metric computation | Guard with `math.isnan`/`math.isinf` checks, return 0.0 or NULL |

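The two ECE rows above (empty buckets excluded, single bucket reduces to one absolute difference) fall out naturally from a weighted sum over non-empty buckets. The sketch below assumes five equal-width buckets over [0.5, 1.0] and a `None` result for an empty input; neither detail is specified in this section.

```python
import math
from bisect import bisect_right
from typing import List, Optional, Sequence, Tuple

# Assumption: five equal-width confidence buckets over [0.5, 1.0].
BUCKET_EDGES = (0.5, 0.6, 0.7, 0.8, 0.9, 1.0)


def expected_calibration_error(
    predictions: Sequence[Tuple[float, bool]]
) -> Optional[float]:
    """Weighted mean of |avg_confidence - win_rate| over non-empty buckets."""
    if not predictions:
        return None
    buckets: List[List[Tuple[float, bool]]] = [[] for _ in range(len(BUCKET_EDGES) - 1)]
    for confidence, correct in predictions:
        # Clamp the index so out-of-range values and 1.0 land in an edge bucket.
        index = bisect_right(BUCKET_EDGES, confidence) - 1
        index = max(0, min(index, len(buckets) - 1))
        buckets[index].append((confidence, correct))
    total = len(predictions)
    ece = 0.0
    for bucket in buckets:
        if not bucket:
            continue  # empty buckets contribute nothing
        avg_confidence = math.fsum(c for c, _ in bucket) / len(bucket)
        win_rate = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_confidence - win_rate)
    return ece
```

When every prediction falls in one bucket, the weight `len(bucket) / total` is 1.0 and the result is exactly the single-bucket difference described in the table.
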
### Quality Gate Failures

| Scenario | Handling |
|----------|----------|
| No model_metric_snapshots exist | Default to paper-only mode (fail-safe) |
| Most recent snapshot older than 24 hours | Default to paper-only mode (fail-safe) |
| risk_configs table unreachable | Default to paper-only mode, log warning |
| Invalid threshold values in risk_configs | Use default thresholds, log warning |
| Gate evaluation fails mid-computation | Default to paper-only mode, log error |

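The fail-safe rows above share one pattern: anything other than a fresh snapshot with an explicit pass resolves to paper-only mode. The sketch below is illustrative; the function name, the `"paper"` and `"live"` labels, and the use of timezone-aware UTC timestamps are assumptions.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

MAX_SNAPSHOT_AGE = timedelta(hours=24)  # staleness limit from the table above


def resolve_trading_mode(
    latest_snapshot_at: Optional[datetime],
    gate_passed: Optional[bool],
    now: Optional[datetime] = None,
) -> str:
    """Return "live" only for a fresh snapshot with an explicit gate pass."""
    current = now if now is not None else datetime.now(timezone.utc)
    if latest_snapshot_at is None:
        return "paper"  # no snapshot exists
    if current - latest_snapshot_at > MAX_SNAPSHOT_AGE:
        return "paper"  # snapshot is stale
    if gate_passed is not True:
        return "paper"  # failed, unknown, or errored evaluation
    return "live"
```

Passing `gate_passed=None` covers the cases where evaluation could not complete, for instance when the configuration table is unreachable, so the caller does not need a separate error path to stay fail-safe.
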
### Database Failures

| Scenario | Handling |
|----------|----------|
| prediction_snapshots insert fails | Log error, do not block recommendation generation |
| signal_evidence_links insert fails | Log error, snapshot still created (partial data) |
| prediction_outcomes insert fails | Log error, retry on next Outcome_Evaluator run |
| model_metric_snapshots insert fails | Log error, stale metrics used until next successful computation |
| source_accuracy update fails | Log error, continue with stale reliability data |

### Canonical Evidence Key Edge Cases

| Scenario | Handling |
|----------|----------|
| Empty title | Use empty string in hash computation |
| Empty URL | Use empty string in hash computation |
| URL with no query parameters | Use URL as-is after lowercasing |
| Non-ASCII characters in title/URL | Encode as UTF-8 before hashing |

---

## Testing Strategy

### Dual Testing Approach

The model validation feature requires both property-based tests (for mathematical correctness of metric computations) and example-based unit tests (for specific behaviors, integration points, and edge cases). Property-based testing is appropriate here because the feature contains several pure mathematical functions (ECE, Brier score, IC, Bayesian shrinkage, contribution scores) with clear input/output behavior and universal properties.

### Property-Based Testing

**Library:** Hypothesis (already in use: the `.hypothesis/` directory exists and the project convention is established)

**Configuration:**
- 100 examples per property: `@settings(max_examples=100)` (Hypothesis stops after this many passing examples, or earlier if the input space is exhausted)
- File naming: `tests/test_pbt_model_validation.py`
- Tag format: `# Feature: model-validation-calibration, Property N: <title>`

**Property tests to implement (one test per correctness property):**

| Property | Test Function | Key Generators |
|----------|---------------|----------------|
| 1: ECE range and round-trip | `test_calibration_error_range_and_roundtrip` | `st.lists(st.tuples(st.floats(0.5, 1.0), st.booleans()))` |
| 2: Brier score range and perfect | `test_brier_score_range_and_perfect` | `st.lists(st.tuples(st.floats(0.0, 1.0), st.sampled_from([0.0, 1.0])))` |
| 3: IC range and perfect correlation | `test_information_coefficient_range_and_perfect` | `st.lists(st.floats(-10, 10), min_size=30)` with linear transform |
| 4: Canonical key determinism and idempotence | `test_canonical_key_determinism_and_idempotence` | `st.text()` pairs for title and URL |
| 5: Source reliability bounds and convergence | `test_source_reliability_bounds_and_convergence` | `st.floats(0.0, 1.0)` for win_rate, `st.integers(0, 10000)` for n |
| 6: Quality gate determinism and monotonicity | `test_quality_gate_determinism_and_monotonicity` | Custom strategy for `QualityGateConfig` and metric values |
| 7: Contribution score sum-to-one | `test_contribution_score_sum_to_one` | `st.lists(st.floats(0.01, 100.0), min_size=1)` |

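As an illustration of the configuration and tag format above, the Property 2 test might look like the following sketch. The inline `brier_score` helper is a stand-in for the real function under test, and `min_size=1` is an assumption added to avoid dividing by zero on an empty list.

```python
from hypothesis import given, settings, strategies as st


def brier_score(pairs):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - outcome) ** 2 for p, outcome in pairs) / len(pairs)


# Feature: model-validation-calibration, Property 2: Brier score range and perfect
@settings(max_examples=100)
@given(
    st.lists(
        st.tuples(st.floats(0.0, 1.0), st.sampled_from([0.0, 1.0])),
        min_size=1,  # assumption: the score is only defined for non-empty input
    )
)
def test_brier_score_range_and_perfect(pairs):
    # Range: each squared error is in [0, 1], so the mean is too.
    assert 0.0 <= brier_score(pairs) <= 1.0
    # Perfect predictions: the probability equals the outcome for every pair.
    perfect = [(outcome, outcome) for _, outcome in pairs]
    assert brier_score(perfect) == 0.0
```

Bounded `st.floats` strategies do not generate NaN or infinity by default, so the test needs no extra filtering for non-finite inputs.
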
### Example-Based Unit Tests

**File:** `tests/test_model_validation_unit.py`

| Test Area | Examples |
|-----------|----------|
| Canonical evidence key | Known title/URL → expected SHA256, empty inputs, unicode |
| Duplicate detection | 3 docs with 2 sharing a key → 1 marked duplicate |
| Contribution scores | [0.5, 0.3, 0.2] → [0.5, 0.3, 0.2], single doc → [1.0] |
| ECE specific values | Perfect calibration → 0.0, all overconfident → positive ECE |
| Brier score specific values | All correct at p=1.0 → 0.0, all wrong at p=1.0 → 1.0 |
| IC specific values | Perfect correlation → 1.0, anti-correlation → -1.0, < 30 → None |
| Source reliability | n=0 → 0.5, n=1000 with wr=0.8 → ≈0.8, n=30 with wr=0.7 → 0.6 |
| Adjusted evidence weight | reliability=0.5 → base*1.0, clamping to [0.1, 2.0] |
| Quality gate | All thresholds met → pass, one failed → fail with reason |
| Quality gate fail-safe | No snapshots → paper-only, stale snapshot → paper-only |
| Direction correct logic | bullish+positive → true, bullish+negative → false |
| Profitable logic | buy+positive → true, sell+negative → true |
| Future return computation | price 100→110 → 0.10, price 100→90 → -0.10 |
| Excess return | ticker 10%, SPY 5% → excess 5% |
| Weight clamping | weight 1.5 → clamped to 1.0 |

### Frontend Tests

**File:** `frontend/src/test/pages.test.tsx` (extend existing)

| Test Area | Strategy |
|-----------|----------|
| OpsModel page renders validation tabs | MSW mock for `/api/validation/summary` |
| Calibration table renders buckets | MSW mock for `/api/validation/calibration` |
| Gate status indicator | MSW mock for `/api/validation/gate-status` |
| Miscalibration warning badge | Mock data with miscalibrated bucket |

### Integration Tests

**File:** `tests/test_model_validation_integration.py`

| Test Area | Strategy |
|-----------|----------|
| Snapshot creation with mock DB | asyncpg mock, verify INSERT queries |
| Outcome evaluation with mock prices | asyncpg mock, verify return computation |
| Metrics computation end-to-end | In-memory data, verify all metrics computed |
| API endpoint responses | FastAPI TestClient with mock pool |

### Test File Structure

```
tests/
├── test_pbt_model_validation.py          # 7 property-based tests
├── test_model_validation_unit.py         # Example-based unit tests
└── test_model_validation_integration.py  # Integration tests (optional)

frontend/src/test/
└── pages.test.tsx                        # Extended with validation page tests
```