feat: model validation, calibration, and signal quality layer

- Migration 035: prediction_snapshots, prediction_outcomes, signal_evidence_links, model_metric_snapshots tables + SQL views
- Prediction snapshot writer with canonical evidence keys, duplicate detection, contribution scores
- Outcome evaluator across 5 horizons (1h, 6h, 1d, 7d, 30d)
- Metrics engine: ECE, Brier score, IC, Rank IC, benchmark comparison
- Attribution engine: per-source, per-catalyst, per-layer performance
- Calibration engine: Bayesian shrinkage source reliability
- Quality gate for live trading eligibility with configurable thresholds
- 7 new /api/validation/* endpoints
- Upgraded OpsModel dashboard with validation tab
- Enhanced recommendation display with calibration context
- Backtest replay validation mode
- 86 Python tests (unit + property-based), 179 frontend tests passing
2026-05-01 03:04:58 +00:00
parent 5d2ffd9163
commit 7fcc8a6c07
23 changed files with 7554 additions and 9 deletions
@@ -0,0 +1 @@
 {"specId": "b595d834-7e72-4fab-87a9-65c92115a069", "workflowType": "requirements-first", "specType": "feature"}
@@ -0,0 +1,975 @@
 # Design Document — Model Validation, Calibration, and Signal Quality
 ## Overview
 This design adds a closed-loop model validation layer to Stonks Oracle. The system currently generates trend summaries and trading recommendations with confidence scores, but has no mechanism to evaluate whether those predictions are accurate, whether confidence scores are well-calibrated, which sources contribute to correct predictions, or whether the system outperforms simple benchmarks.
 The validation layer introduces five new service modules under `services/validation/`, a quality gate module in `services/trading/`, seven new API endpoints under `/api/validation/`, a database migration (035) with four new tables and two SQL views, and an upgraded OpsModel dashboard page. The architecture follows the existing patterns: pure computation modules with asyncpg for persistence, FastAPI endpoints in `services/api/app.py`, and React/TanStack Query hooks on the frontend.
 ### Design Rationale
 A prediction engine without outcome tracking is flying blind. The validation layer closes the feedback loop by:
 1. **Capturing immutable snapshots** at prediction time — preventing hindsight bias in evaluation
 2. **Evaluating outcomes** across multiple horizons (1h, 6h, 1d, 7d, 30d) — matching the system's multi-window trend architecture
 3. **Computing calibration metrics** (ECE, Brier score) — measuring whether confidence scores mean what they claim
 4. **Tracking information coefficients** (IC, Rank IC) — measuring linear and ordinal predictive power
 5. **Attributing performance** to sources, catalysts, and signal layers — identifying the most valuable information channels
 6. **Recalibrating confidence** via Bayesian shrinkage — learning from the system's own track record
 7. **Gating live trading** on minimum quality thresholds — preventing real capital risk on a poorly performing model
 The design reuses existing infrastructure (asyncpg, FastAPI, TanStack Query, Recharts) and integrates with the existing `source_accuracy` table from the signal-math-upgrade spec.
 ---
 ## Architecture
 ### High-Level Data Flow
 ```mermaid
 flowchart TD
    subgraph "Prediction Capture (Real-time)"
        A[Recommendation Engine] -->|generates| B[Prediction_Snapshot_Writer]
        B --> C[prediction_snapshots table]
        B --> D[signal_evidence_links table]
        B -->|computes| E[canonical_evidence_key<br/>duplicate detection<br/>contribution scores]
    end
    subgraph "Outcome Evaluation (Periodic)"
        F[Outcome_Evaluator<br/>scheduled job] -->|reads matured snapshots| C
        F -->|fetches future prices| G[market_snapshots table]
        F -->|computes returns| H[prediction_outcomes table]
        F -->|evaluates 5 horizons| H
    end
    subgraph "Metrics Computation (Periodic)"
        I[Metrics_Engine] -->|reads| H
        I -->|reads| C
        I -->|reads| D
        I -->|computes| J[model_metric_snapshots table]
        I -->|computes| K[Calibration: ECE, Brier]
        I -->|computes| L[IC, Rank IC by horizon]
        I -->|computes| M[Benchmark: excess returns]
    end
    subgraph "Attribution (Periodic)"
        N[Attribution_Engine] -->|joins| D
        N -->|joins| H
        N -->|computes| O[Per-source metrics]
        N -->|computes| P[Per-catalyst metrics]
        N -->|computes| Q[Per-layer metrics]
    end
    subgraph "Calibration (Periodic)"
        R[Calibration_Engine] -->|reads| H
        R -->|reads| D
        R -->|computes Bayesian shrinkage| S[source_accuracy table<br/>reliability scores]
    end
    subgraph "Safety Gate (Per-cycle)"
        T[Quality_Gate] -->|reads latest| J
        T -->|evaluates thresholds| U{Pass?}
        U -->|yes| V[Live trading allowed]
        U -->|no| W[Force paper mode]
        T -->|stores result| X[risk_configs table<br/>model_quality_gate key]
    end
    subgraph "Dashboard (Frontend)"
        Y[Dashboard_API<br/>7 endpoints] -->|reads| J
        Y -->|reads| C
        Y -->|reads| H
        Y -->|reads| D
        Z[OpsModel.tsx<br/>upgraded page] -->|fetches| Y
    end
    subgraph "Backtest Integration"
        AA[BacktestReplay] -->|validation mode| B
        AA -->|validation mode| F
        AA -->|triggers| I
    end
 ```
 ### Scheduling Strategy
 The validation components run on different cadences:
 | Component | Trigger | Cadence |
 |-----------|---------|---------|
 | Prediction_Snapshot_Writer | Synchronous — called by recommendation engine | Every recommendation |
 | Outcome_Evaluator | Scheduled job | Every 1 hour |
 | Metrics_Engine | After Outcome_Evaluator completes | Every 1 hour |
 | Attribution_Engine | Called by Metrics_Engine | Every 1 hour |
 | Calibration_Engine | After Metrics_Engine completes | Every 6 hours |
 | Quality_Gate | Start of each aggregation cycle | Every aggregation cycle |
 ### Sector ETF Mapping
 The system needs a mapping from company sectors to sector ETFs for benchmark comparison. This is stored as a configuration constant:
 ```python
 SECTOR_ETF_MAP: dict[str, str] = {
    "Technology": "XLK",
    "Consumer Cyclical": "XLY",
    "Financial Services": "XLF",
    "Healthcare": "XLV",
    "Energy": "XLE",
    "Communication Services": "XLC",
    "Industrials": "XLI",
    "Consumer Defensive": "XLP",
    "Real Estate": "XLRE",
    "Utilities": "XLU",
 }
 ```
 ---
 ## Components and Interfaces
 ### New Modules
 | Module | File | Responsibility |
 |--------|------|----------------|
 | Prediction Snapshot Writer | `services/validation/prediction_snapshot.py` | Captures immutable prediction state at generation time |
 | Outcome Evaluator | `services/validation/outcome_evaluator.py` | Matches predictions with realized market outcomes |
 | Metrics Engine | `services/validation/metrics.py` | Computes calibration, IC, Brier, benchmark metrics |
 | Attribution Engine | `services/validation/attribution.py` | Per-source, per-catalyst, per-layer performance |
 | Calibration Engine | `services/validation/calibration.py` | Bayesian shrinkage source reliability, weight adjustment |
 | Quality Gate | `services/trading/model_quality_gate.py` | Safety gate for live trading eligibility |
 ### Modified Modules
 | Module | File | Changes |
 |--------|------|---------|
 | Query API | `services/api/app.py` | 7 new `/api/validation/*` endpoints |
 | Aggregation Worker | `services/aggregation/worker.py` | Call Quality_Gate at cycle start |
 | Recommendation Engine | `services/recommendation/eligibility.py` | Call Prediction_Snapshot_Writer after recommendation |
 | Backtest Replay | `services/trading/backtest_replay.py` | Validation mode support |
 | Frontend Hooks | `frontend/src/api/hooks.ts` | 7 new validation hooks |
 | OpsModel Page | `frontend/src/pages/OpsModel.tsx` | Full dashboard upgrade |
 | AppLayout | `frontend/src/components/AppLayout.tsx` | Nav item update (if needed) |
 ### Component Interface Details
 #### 1. Prediction Snapshot Writer (`services/validation/prediction_snapshot.py`)
 ```python
 SECTOR_ETF_MAP: dict[str, str] = {
    "Technology": "XLK",
    "Consumer Cyclical": "XLY",
    "Financial Services": "XLF",
    "Healthcare": "XLV",
    "Energy": "XLE",
    "Communication Services": "XLC",
    "Industrials": "XLI",
    "Consumer Defensive": "XLP",
    "Real Estate": "XLRE",
    "Utilities": "XLU",
 }
 EVALUATION_HORIZONS: list[str] = ["1h", "6h", "1d", "7d", "30d"]
 MAX_SINGLE_DOCUMENT_WEIGHT: float = 1.0
@dataclass
 class PredictionSnapshot:
    """Immutable snapshot of a prediction at generation time."""
    id: str                          # UUID
    generated_at: datetime
    ticker: str
    window: str
    horizon: str
    direction: str                   # bullish/bearish/mixed/neutral
    action: str                      # buy/sell/hold/watch
    mode: str                        # informational/paper_eligible/live_eligible
    strength: float
    confidence: float
    contradiction: float
    p_bull: float | None
    p_bear: float | None
    score_company: float
    score_macro: float
    score_competitive: float
    evidence_count: int
    unique_source_count: int
    duplicate_evidence_count: int
    price_at_prediction: float | None
    spy_price_at_prediction: float | None
    sector_etf_price_at_prediction: float | None
    metadata: dict
@dataclass
 class SignalEvidenceLink:
    """Link between a prediction and a contributing evidence document."""
    id: str                          # UUID
    prediction_id: str
    document_id: str
    signal_id: str
    ticker: str
    source: str
    source_type: str
    catalyst_type: str
    sentiment: str
    impact: float
    extraction_confidence: float
    weight: float                    # clamped to MAX_SINGLE_DOCUMENT_WEIGHT
    is_duplicate: bool
    canonical_evidence_key: str
    contribution_score: float        # weight / total_weight, sums to 1.0
    metadata: dict
 def compute_canonical_evidence_key(title: str, url: str) -> str:
    """SHA256 of normalized(title) + normalized(url).
    Normalization: lowercase, strip whitespace for title;
    lowercase, strip query params for URL.
    """
    ...
 async def create_prediction_snapshot(
    pool: asyncpg.Pool,
    recommendation: Recommendation,
    trend_summary: TrendSummary,
    evidence_signals: list[WeightedSignal],
    evidence_docs: list[dict],       # document metadata from recommendation_evidence
 ) -> PredictionSnapshot:
    """Create and persist a prediction snapshot with evidence links.
    1. Fetches current prices (ticker, SPY, sector ETF) from market_snapshots
    2. Computes canonical evidence keys and duplicate detection
    3. Clamps individual document weights to MAX_SINGLE_DOCUMENT_WEIGHT
    4. Computes contribution scores (one-vote-per-canonical-key dedup)
    5. Persists snapshot and evidence links in a transaction
    """
    ...
 async def fetch_latest_close_price(
    pool: asyncpg.Pool,
    ticker: str,
 ) -> float | None:
    """Fetch most recent close price from market_snapshots for a ticker."""
    ...
 ```
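The key computation described in the `compute_canonical_evidence_key` docstring could be sketched as follows. This is a minimal illustration, not the shipped implementation: the helper names `normalize_title` and `normalize_url` are hypothetical, the two normalized strings are concatenated directly as the docstring's `+` suggests, and keeping any URL fragment is an assumption (the design only says query parameters are stripped).

```python
import hashlib


def normalize_title(title: str) -> str:
    """Lowercase the title and strip surrounding whitespace."""
    return title.strip().lower()


def normalize_url(url: str) -> str:
    """Lowercase the URL and drop its query string.

    Assumption: a fragment (the part after '#') is kept unchanged.
    """
    lowered = url.strip().lower()
    base, hash_sep, fragment = lowered.partition("#")
    base = base.partition("?")[0]  # discard '?key=value...' if present
    return base + hash_sep + fragment


def compute_canonical_evidence_key(title: str, url: str) -> str:
    """SHA-256 hex digest of normalized(title) + normalized(url)."""
    payload = normalize_title(title) + normalize_url(url)
    # UTF-8 encoding covers non-ASCII titles and URLs.
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


key_a = compute_canonical_evidence_key("  Hello World ", "https://Example.com/a?utm=1")
key_b = compute_canonical_evidence_key("hello world", "https://example.com/a")
```

Under these assumptions `key_a` and `key_b` are identical, so the two documents would count as the same piece of evidence. The 64-character hex digest fits the `VARCHAR(64)` column used for `canonical_evidence_key`.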
 #### 2. Outcome Evaluator (`services/validation/outcome_evaluator.py`)
 ```python
@dataclass
 class PredictionOutcome:
    """Realized outcome for a prediction at a specific horizon."""
    id: str                          # UUID
    prediction_id: str
    evaluated_at: datetime
    horizon: str                     # 1h, 6h, 1d, 7d, 30d
    future_price: float
    future_return: float
    spy_future_price: float | None
    spy_return: float | None
    sector_etf_future_price: float | None
    sector_etf_return: float | None
    excess_return_vs_spy: float | None
    excess_return_vs_sector: float | None
    direction_correct: bool
    profitable: bool
    metadata: dict
 HORIZON_DURATIONS: dict[str, timedelta] = {
    "1h": timedelta(hours=1),
    "6h": timedelta(hours=6),
    "1d": timedelta(days=1),
    "7d": timedelta(days=7),
    "30d": timedelta(days=30),
 }
 async def evaluate_matured_predictions(
    pool: asyncpg.Pool,
 ) -> int:
    """Evaluate all matured prediction snapshots.
    Finds snapshots where horizon has elapsed and outcome not yet recorded.
    For each, fetches future prices and computes returns.
    Skips horizons where future price is unavailable (retries next run).
    Returns count of outcomes recorded.
    """
    ...
 async def evaluate_single_prediction(
    pool: asyncpg.Pool,
    snapshot: PredictionSnapshot,
    horizon: str,
 ) -> PredictionOutcome | None:
    """Evaluate a single prediction at a specific horizon.
    Returns None if future price is unavailable.
    """
    ...
 ```
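The return arithmetic behind a `PredictionOutcome` can be illustrated with a small pure function. This is a sketch under stated assumptions: `ReturnSummary`, `simple_return`, and `summarize_outcome` are hypothetical names, returns are simple (not log) returns, and the rule for `direction_correct` (bullish needs a positive return, bearish a negative one, mixed/neutral never count as correct) is an assumption the design does not spell out.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ReturnSummary:
    future_return: float
    spy_return: float | None
    excess_return_vs_spy: float | None
    direction_correct: bool


def simple_return(start_price: float, end_price: float) -> float:
    """Fractional price change between prediction time and the horizon."""
    return (end_price - start_price) / start_price


def summarize_outcome(
    direction: str,
    price_at_prediction: float,
    future_price: float,
    spy_price_at_prediction: float | None = None,
    spy_future_price: float | None = None,
) -> ReturnSummary:
    future_return = simple_return(price_at_prediction, future_price)

    # Benchmark fields stay None when either SPY price is missing,
    # mirroring the NULL handling in the error-handling tables.
    spy_return = None
    excess = None
    if spy_price_at_prediction is not None and spy_future_price is not None:
        spy_return = simple_return(spy_price_at_prediction, spy_future_price)
        excess = future_return - spy_return

    # Assumed rule: only bullish/bearish predictions can be directionally correct.
    if direction == "bullish":
        correct = future_return > 0
    elif direction == "bearish":
        correct = future_return < 0
    else:
        correct = False

    return ReturnSummary(future_return, spy_return, excess, correct)


example = summarize_outcome("bullish", 100.0, 110.0, 400.0, 404.0)
```

In this example the ticker gains 10% while SPY gains 1%, so the excess return versus SPY is roughly 9 percentage points and the bullish call counts as correct.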
 #### 3. Metrics Engine (`services/validation/metrics.py`)
 ```python
 CONFIDENCE_BUCKETS: list[tuple[float, float]] = [
    (0.50, 0.60),
    (0.60, 0.70),
    (0.70, 0.80),
    (0.80, 0.90),
    (0.90, 1.00),
 ]
 LOOKBACK_WINDOWS: list[str] = ["7d", "30d", "90d", "all"]
@dataclass
 class CalibrationBucket:
    """Calibration metrics for a single confidence bucket."""
    bucket_low: float
    bucket_high: float
    avg_confidence: float
    observed_win_rate: float
    prediction_count: int
    miscalibrated: bool              # |avg_confidence - win_rate| > 0.15
@dataclass
 class ModelMetricSnapshot:
    """Aggregate model quality metrics for a lookback/horizon combination."""
    id: str
    generated_at: datetime
    lookback_window: str
    horizon: str
    prediction_count: int
    win_rate: float
    directional_accuracy: float
    information_coefficient: float | None
    rank_information_coefficient: float | None
    avg_return: float
    avg_excess_return_vs_spy: float
    avg_excess_return_vs_sector: float
    calibration_error: float         # ECE
    brier_score: float
    buy_win_rate: float
    sell_win_rate: float
    hold_win_rate: float
    metadata: dict
 def compute_calibration_error(
    confidences: list[float],
    outcomes: list[bool],
 ) -> tuple[float, list[CalibrationBucket]]:
    """Compute ECE and calibration buckets.
    ECE = Σ (n_b / N) * |avg_conf_b - win_rate_b|
    Returns (ece, buckets).
    """
    ...
 def compute_brier_score(
    p_bulls: list[float],
    outcomes: list[bool],
 ) -> float:
    """Brier score = mean((p_bull - outcome)^2).
    outcome is 1.0 when the price rose over the horizon (the event p_bull forecasts), 0.0 otherwise.
    Returns value in [0.0, 1.0].
    """
    ...
 def compute_information_coefficient(
    scores: list[float],
    returns: list[float],
 ) -> float | None:
    """Pearson correlation between prediction scores and future returns.
    Returns None when fewer than 30 data points.
    Returns value in [-1.0, 1.0].
    """
    ...
 def compute_rank_information_coefficient(
    scores: list[float],
    returns: list[float],
 ) -> float | None:
    """Spearman rank correlation between prediction scores and future returns.
    Returns None when fewer than 30 data points.
    Returns value in [-1.0, 1.0].
    """
    ...
 def compute_contribution_scores(
    weights: list[float],
 ) -> list[float]:
    """Compute contribution scores from document weights.
    Each score = weight_i / sum(weights). Sums to 1.0.
    Each score in [0.0, 1.0].
    Returns empty list for empty input.
    """
    ...
 async def compute_and_store_metric_snapshots(
    pool: asyncpg.Pool,
 ) -> list[ModelMetricSnapshot]:
    """Compute metric snapshots for all lookback/horizon combinations.
    Lookback windows: 7d, 30d, 90d, all-time.
    Horizons: 1h, 6h, 1d, 7d, 30d.
    """
    ...
 ```
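The ECE formula in the `compute_calibration_error` docstring can be sketched in plain Python. Several details here are assumptions rather than confirmed behavior: buckets are treated as half-open intervals with the last one closed at 1.0, confidences below 0.50 fall outside every bucket and are ignored, `N` is the number of bucketed predictions, and `MISCALIBRATION_GAP` is a hypothetical name for the 0.15 flag threshold.

```python
from __future__ import annotations

from dataclasses import dataclass

CONFIDENCE_BUCKETS = [(0.50, 0.60), (0.60, 0.70), (0.70, 0.80), (0.80, 0.90), (0.90, 1.00)]
MISCALIBRATION_GAP = 0.15  # hypothetical constant name for the 0.15 threshold


@dataclass(frozen=True)
class CalibrationBucket:
    bucket_low: float
    bucket_high: float
    avg_confidence: float
    observed_win_rate: float
    prediction_count: int
    miscalibrated: bool


def compute_calibration_error(
    confidences: list[float],
    outcomes: list[bool],
) -> tuple[float, list[CalibrationBucket]]:
    last_high = CONFIDENCE_BUCKETS[-1][1]
    grouped = []
    for low, high in CONFIDENCE_BUCKETS:
        # Half-open [low, high); the final bucket also includes 1.0.
        members = [
            (c, o)
            for c, o in zip(confidences, outcomes)
            if low <= c < high or (high == last_high and c == high)
        ]
        if members:  # empty buckets are excluded from ECE
            grouped.append((low, high, members))

    total = sum(len(members) for _, _, members in grouped)
    if total == 0:
        return 0.0, []

    ece = 0.0
    buckets = []
    for low, high, members in grouped:
        count = len(members)
        avg_conf = sum(c for c, _ in members) / count
        win_rate = sum(1 for _, won in members if won) / count
        gap = abs(avg_conf - win_rate)
        ece += (count / total) * gap  # (n_b / N) * |avg_conf_b - win_rate_b|
        buckets.append(
            CalibrationBucket(low, high, avg_conf, win_rate, count, gap > MISCALIBRATION_GAP)
        )
    return ece, buckets


ece, buckets = compute_calibration_error([0.55, 0.55, 0.95, 0.95], [True, False, True, True])
```

Here two buckets are populated, each with a 0.05 gap between average confidence and observed win rate, so the weighted sum is approximately 0.05 and neither bucket is flagged as miscalibrated.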
 #### 4. Attribution Engine (`services/validation/attribution.py`)
 ```python
@dataclass
 class SourceAttribution:
    """Performance metrics for a single source."""
    source: str
    source_type: str
    prediction_count: int
    avg_weight: float
    avg_contribution_score: float
    win_rate: float
    avg_future_return: float
    avg_excess_return_vs_spy: float
    information_coefficient: float | None
    duplicate_rate: float
@dataclass
 class CatalystAttribution:
    """Performance metrics for a single catalyst type."""
    catalyst_type: str
    prediction_count: int
    win_rate: float
    avg_future_return: float
    avg_excess_return_vs_spy: float
    information_coefficient: float | None
@dataclass
 class LayerAttribution:
    """Performance metrics for a signal layer."""
    layer: str                       # company, macro, competitive
    avg_contribution_pct: float
    dominant_win_rate: float         # win rate when this layer > 30% contribution
    dominant_ic: float | None        # IC when this layer > 30% contribution
 async def compute_source_attribution(
    pool: asyncpg.Pool,
    lookback_days: int = 30,
    horizon: str = "7d",
 ) -> list[SourceAttribution]:
    ...
 async def compute_catalyst_attribution(
    pool: asyncpg.Pool,
    lookback_days: int = 30,
    horizon: str = "7d",
 ) -> list[CatalystAttribution]:
    ...
 async def compute_layer_attribution(
    pool: asyncpg.Pool,
    lookback_days: int = 30,
    horizon: str = "7d",
 ) -> list[LayerAttribution]:
    ...
 ```
 #### 5. Calibration Engine (`services/validation/calibration.py`)
 ```python
 def compute_source_reliability(
    observed_win_rate: float,
    sample_count: int,
    prior_strength: int = 30,
 ) -> float:
    """Bayesian shrinkage source reliability.
    reliability = 0.5 + (n / (n + prior_strength)) * (observed_win_rate - 0.5)
    Returns value in [0.0, 1.0].
    When n=0, returns 0.5 (prior mean).
    As n→∞, approaches observed_win_rate.
    """
    ...
 def compute_adjusted_evidence_weight(
    base_weight: float,
    reliability: float,
 ) -> float:
    """Adjusted weight = base_weight * (0.5 + reliability), clamped to [0.1, 2.0]."""
    ...
 async def update_source_reliabilities(
    pool: asyncpg.Pool,
 ) -> int:
    """Recompute and store source reliability scores from latest outcomes.
    Uses the existing source_accuracy table, updating accuracy_ratio
    with the Bayesian shrinkage formula.
    Returns count of sources updated.
    """
    ...
 ```
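As a worked example of the shrinkage formula: a source with an observed win rate of 0.80 over 30 samples and the default `prior_strength` of 30 gets `0.5 + (30 / 60) * 0.30 = 0.65`, halfway between the prior mean and the observed rate. The two pure functions above could be sketched like this; the explicit clamping of the reliability to [0.0, 1.0] and the handling of non-positive sample counts are defensive assumptions.

```python
def compute_source_reliability(
    observed_win_rate: float,
    sample_count: int,
    prior_strength: int = 30,
) -> float:
    """Shrink the observed win rate toward the 0.5 prior mean."""
    if sample_count <= 0:
        return 0.5  # no evidence yet: fall back to the prior
    shrink = sample_count / (sample_count + prior_strength)
    reliability = 0.5 + shrink * (observed_win_rate - 0.5)
    return min(1.0, max(0.0, reliability))


def compute_adjusted_evidence_weight(base_weight: float, reliability: float) -> float:
    """Scale the base weight by (0.5 + reliability), clamped to [0.1, 2.0]."""
    return min(2.0, max(0.1, base_weight * (0.5 + reliability)))


few_samples = compute_source_reliability(0.80, 30)    # about 0.65
many_samples = compute_source_reliability(0.80, 270)  # about 0.77
weight = compute_adjusted_evidence_weight(1.0, few_samples)  # about 1.15
```

With more samples the score moves closer to the observed 0.80, which is the convergence behavior stated in the docstring.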
 #### 6. Quality Gate (`services/trading/model_quality_gate.py`)
 ```python
@dataclass
 class QualityGateConfig:
    """Configurable thresholds for live trading eligibility."""
    min_prediction_count: int = 100
    min_ic: float = 0.03
    min_win_rate: float = 0.53
    max_ece: float = 0.15
    min_excess_return_vs_spy: float = 0.0
    max_snapshot_age_hours: int = 24
@dataclass
 class GateThresholdResult:
    """Result for a single threshold check."""
    name: str
    threshold: float
    actual: float
    passed: bool
@dataclass
 class QualityGateResult:
    """Full gate evaluation result."""
    passed: bool
    evaluated_at: datetime
    threshold_results: list[GateThresholdResult]
    reason: str                      # "all thresholds met" or "failed: ..."
    snapshot_id: str | None
    config: QualityGateConfig
 async def evaluate_quality_gate(
    pool: asyncpg.Pool,
    config: QualityGateConfig | None = None,
 ) -> QualityGateResult:
    """Evaluate model quality gate from latest metric snapshot.
    Reads the most recent model_metric_snapshot for the 30d lookback
    and 7d horizon (the primary evaluation window).
    If no snapshot exists or the snapshot is older than max_snapshot_age_hours (default 24), defaults to
    paper-only mode (fail-safe).
    Stores result in risk_configs under 'model_quality_gate' key.
    """
    ...
 async def load_gate_config_from_db(
    pool: asyncpg.Pool,
 ) -> QualityGateConfig:
    """Load gate thresholds from risk_configs, with defaults."""
    ...
 ```
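The threshold comparison at the core of the gate can be sketched as a pure function, separate from database access and the staleness check. This is an illustration only: `check_thresholds` and the metric dictionary keys are hypothetical, the comparisons are assumed to be inclusive, and a missing metric (for example a NULL IC) is assumed to fail its check so the gate stays fail-safe.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class QualityGateConfig:
    min_prediction_count: int = 100
    min_ic: float = 0.03
    min_win_rate: float = 0.53
    max_ece: float = 0.15
    min_excess_return_vs_spy: float = 0.0


@dataclass(frozen=True)
class GateThresholdResult:
    name: str
    threshold: float
    actual: float | None
    passed: bool


# (config field, metric key, True when the metric must be >= the threshold)
RULES = [
    ("min_prediction_count", "prediction_count", True),
    ("min_ic", "information_coefficient", True),
    ("min_win_rate", "win_rate", True),
    ("max_ece", "calibration_error", False),
    ("min_excess_return_vs_spy", "avg_excess_return_vs_spy", True),
]


def check_thresholds(
    metrics: dict[str, float | None],
    config: QualityGateConfig,
) -> tuple[bool, list[GateThresholdResult]]:
    results = []
    for field_name, metric_key, is_minimum in RULES:
        threshold = getattr(config, field_name)
        actual = metrics.get(metric_key)
        if actual is None:
            ok = False  # assumed fail-safe: a missing metric blocks live trading
        elif is_minimum:
            ok = actual >= threshold
        else:
            ok = actual <= threshold
        results.append(GateThresholdResult(field_name, threshold, actual, ok))
    return all(r.passed for r in results), results


good_metrics = {
    "prediction_count": 250,
    "information_coefficient": 0.05,
    "win_rate": 0.56,
    "calibration_error": 0.08,
    "avg_excess_return_vs_spy": 0.004,
}
passed, results = check_thresholds(good_metrics, QualityGateConfig())
```

Keeping this step free of I/O makes the determinism and monotonicity properties straightforward to test: the same metrics and config always yield the same list of `GateThresholdResult` entries.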
 #### 7. Dashboard API Endpoints
 Seven new endpoints added to `services/api/app.py`:
 | Endpoint | Method | Returns |
 |----------|--------|---------|
 | `/api/validation/summary` | GET | Latest model metric snapshot + gate status |
 | `/api/validation/calibration` | GET | Calibration table with buckets |
 | `/api/validation/ic-by-horizon` | GET | IC and Rank IC per horizon |
 | `/api/validation/attribution/sources` | GET | Per-source performance |
 | `/api/validation/attribution/catalysts` | GET | Per-catalyst performance |
 | `/api/validation/attribution/layers` | GET | Per-layer performance |
 | `/api/validation/gate-status` | GET | Quality gate evaluation detail |
 All endpoints accept optional `lookback` (default "30d") and `horizon` (default "7d") query parameters.
 ---
 ## Data Models
 ### Database Schema (Migration 035)
 #### prediction_snapshots
 ```sql
 CREATE TABLE IF NOT EXISTS prediction_snapshots (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    generated_at TIMESTAMPTZ NOT NULL,
    ticker VARCHAR(20) NOT NULL,
    window VARCHAR(20) NOT NULL,
    horizon VARCHAR(20) NOT NULL,
    direction VARCHAR(20) NOT NULL,
    action VARCHAR(20) NOT NULL,
    mode VARCHAR(30) NOT NULL,
    strength FLOAT NOT NULL,
    confidence FLOAT NOT NULL,
    contradiction FLOAT NOT NULL DEFAULT 0.0,
    p_bull FLOAT,
    p_bear FLOAT,
    score_company FLOAT NOT NULL DEFAULT 0.0,
    score_macro FLOAT NOT NULL DEFAULT 0.0,
    score_competitive FLOAT NOT NULL DEFAULT 0.0,
    evidence_count INTEGER NOT NULL DEFAULT 0,
    unique_source_count INTEGER NOT NULL DEFAULT 0,
    duplicate_evidence_count INTEGER NOT NULL DEFAULT 0,
    price_at_prediction FLOAT,
    spy_price_at_prediction FLOAT,
    sector_etf_price_at_prediction FLOAT,
    metadata JSONB DEFAULT '{}',
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
 );
 CREATE INDEX IF NOT EXISTS idx_pred_snap_ticker ON prediction_snapshots(ticker);
 CREATE INDEX IF NOT EXISTS idx_pred_snap_generated ON prediction_snapshots(generated_at);
 CREATE INDEX IF NOT EXISTS idx_pred_snap_horizon ON prediction_snapshots(horizon);
 ```
 #### prediction_outcomes
 ```sql
 CREATE TABLE IF NOT EXISTS prediction_outcomes (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    prediction_id UUID NOT NULL REFERENCES prediction_snapshots(id),
    evaluated_at TIMESTAMPTZ NOT NULL,
    horizon VARCHAR(20) NOT NULL,
    future_price FLOAT,
    future_return FLOAT,
    spy_future_price FLOAT,
    spy_return FLOAT,
    sector_etf_future_price FLOAT,
    sector_etf_return FLOAT,
    excess_return_vs_spy FLOAT,
    excess_return_vs_sector FLOAT,
    direction_correct BOOLEAN,
    profitable BOOLEAN,
    metadata JSONB DEFAULT '{}',
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
 );
 CREATE INDEX IF NOT EXISTS idx_pred_out_prediction ON prediction_outcomes(prediction_id);
 CREATE INDEX IF NOT EXISTS idx_pred_out_horizon ON prediction_outcomes(horizon);
 CREATE INDEX IF NOT EXISTS idx_pred_out_evaluated ON prediction_outcomes(evaluated_at);
 ```
 #### signal_evidence_links
 ```sql
 CREATE TABLE IF NOT EXISTS signal_evidence_links (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    prediction_id UUID NOT NULL REFERENCES prediction_snapshots(id),
    document_id VARCHAR(200),
    signal_id VARCHAR(200),
    ticker VARCHAR(20),
    source VARCHAR(200),
    source_type VARCHAR(50),
    catalyst_type VARCHAR(50),
    sentiment VARCHAR(20),
    impact FLOAT,
    extraction_confidence FLOAT,
    weight FLOAT,
    is_duplicate BOOLEAN NOT NULL DEFAULT FALSE,
    canonical_evidence_key VARCHAR(64),
    contribution_score FLOAT,
    metadata JSONB DEFAULT '{}',
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
 );
 CREATE INDEX IF NOT EXISTS idx_sig_ev_prediction ON signal_evidence_links(prediction_id);
 CREATE INDEX IF NOT EXISTS idx_sig_ev_document ON signal_evidence_links(document_id);
 CREATE INDEX IF NOT EXISTS idx_sig_ev_ticker ON signal_evidence_links(ticker);
 ```
 #### model_metric_snapshots
 ```sql
 CREATE TABLE IF NOT EXISTS model_metric_snapshots (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    generated_at TIMESTAMPTZ NOT NULL,
    lookback_window VARCHAR(20) NOT NULL,
    horizon VARCHAR(20) NOT NULL,
    prediction_count INTEGER NOT NULL DEFAULT 0,
    win_rate FLOAT,
    directional_accuracy FLOAT,
    information_coefficient FLOAT,
    rank_information_coefficient FLOAT,
    avg_return FLOAT,
    avg_excess_return_vs_spy FLOAT,
    avg_excess_return_vs_sector FLOAT,
    calibration_error FLOAT,
    brier_score FLOAT,
    buy_win_rate FLOAT,
    sell_win_rate FLOAT,
    hold_win_rate FLOAT,
    metadata JSONB DEFAULT '{}',
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
 );
 CREATE INDEX IF NOT EXISTS idx_model_snap_generated ON model_metric_snapshots(generated_at);
 CREATE INDEX IF NOT EXISTS idx_model_snap_lookback ON model_metric_snapshots(lookback_window);
 CREATE INDEX IF NOT EXISTS idx_model_snap_horizon ON model_metric_snapshots(horizon);
 ```
 #### SQL Explorer Views
 ```sql
 CREATE OR REPLACE VIEW v_prediction_performance AS
 SELECT
    ps.ticker,
    ps.direction,
    ps.action,
    ps.confidence,
    ps.strength,
    ps.contradiction,
    ps.p_bull,
    ps.score_company,
    ps.score_macro,
    ps.score_competitive,
    ps.evidence_count,
    ps.unique_source_count,
    ps.duplicate_evidence_count,
    ps.price_at_prediction,
    po.future_return,
    po.excess_return_vs_spy,
    po.excess_return_vs_sector,
    po.direction_correct,
    po.profitable,
    po.horizon,
    ps.generated_at,
    po.evaluated_at
 FROM prediction_snapshots ps
 JOIN prediction_outcomes po ON po.prediction_id = ps.id;
 CREATE OR REPLACE VIEW v_source_performance AS
 SELECT
    sel.source,
    sel.source_type,
    sel.catalyst_type,
    sel.sentiment,
    sel.weight,
    sel.contribution_score,
    sel.is_duplicate,
    po.direction_correct,
    po.future_return,
    po.excess_return_vs_spy,
    po.horizon,
    ps.generated_at
 FROM signal_evidence_links sel
 JOIN prediction_snapshots ps ON ps.id = sel.prediction_id
 JOIN prediction_outcomes po ON po.prediction_id = sel.prediction_id;
 ```
 ---
 ## Correctness Properties
 *A property is a characteristic or behavior that should hold true across all valid executions of a system — essentially, a formal statement about what the system should do. Properties serve as the bridge between human-readable specifications and machine-verifiable correctness guarantees.*
 The following properties were derived from the acceptance criteria through systematic prework analysis. Each property is universally quantified and maps to specific requirements. After reflection, 7 unique properties remain — one for each PBT requirement in Requirement 17. Redundant properties from Requirements 2, 5, 6, 8, and 11 were consolidated with their corresponding Requirement 17 counterparts.
 ### Property 1: Calibration Error Range and Round-Trip
 *For any* valid distribution of predictions across confidence buckets (where each prediction has a confidence in [0.5, 1.0] and a boolean outcome), the Expected Calibration Error (ECE) SHALL be in [0.0, 1.0]. Furthermore, when every bucket's observed win rate exactly matches its average confidence, ECE SHALL be 0.0.
 **Validates: Requirements 5.1, 5.3, 17.1**
 ### Property 2: Brier Score Range and Perfect Prediction
 *For any* list of (p_bull, outcome) pairs where p_bull ∈ [0.0, 1.0] and outcome ∈ {0.0, 1.0}, the Brier score SHALL be in [0.0, 1.0]. Furthermore, when all predictions have p_bull = 1.0 and outcome = 1.0 (or p_bull = 0.0 and outcome = 0.0), the Brier score SHALL be 0.0.
 **Validates: Requirements 5.4, 17.2**
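A minimal sketch of the Brier computation this property constrains is shown below. It treats each probability as a forecast of an event and each outcome as a flag for whether that event occurred; raising `ValueError` on empty or mismatched input is an assumption, since the design does not state how those cases are handled.

```python
def compute_brier_score(p_bulls: list[float], outcomes: list[bool]) -> float:
    """Mean squared difference between forecast probability and 0/1 outcome."""
    if not p_bulls or len(p_bulls) != len(outcomes):
        raise ValueError("need equally sized, non-empty inputs")
    squared_errors = [
        (p - (1.0 if occurred else 0.0)) ** 2
        for p, occurred in zip(p_bulls, outcomes)
    ]
    return sum(squared_errors) / len(squared_errors)


perfect = compute_brier_score([1.0, 0.0], [True, False])
coin_flip = compute_brier_score([0.5, 0.5], [True, False])
```

Fully confident, fully correct forecasts score 0.0, while an uninformative 0.5 forecast scores 0.25 regardless of the outcome. Because each squared error lies in [0.0, 1.0], so does the mean.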
 ### Property 3: Information Coefficient Range and Perfect Correlation
 *For any* list of (score, return) pairs with at least 30 elements where scores and returns are finite floats, the Information Coefficient (Pearson correlation) SHALL be in [-1.0, 1.0]. Furthermore, when scores and returns are perfectly positively linearly correlated (returns = a * scores + b, a > 0), IC SHALL be 1.0 (within floating-point tolerance).
 **Validates: Requirements 6.1, 6.2, 17.3**
 ### Property 4: Canonical Evidence Key Determinism and Normalization Idempotence
 *For any* (title, url) string pair, computing the canonical evidence key SHALL be deterministic — the same inputs always produce the same key. Furthermore, normalizing an already-normalized input (lowercased, trimmed title; lowercased, query-stripped URL) and computing the key SHALL produce the same key as the original computation (idempotence).
 **Validates: Requirements 2.3, 17.4**
 ### Property 5: Source Reliability Bayesian Shrinkage Bounds and Convergence
 *For any* observed_win_rate ∈ [0.0, 1.0] and sample_count ≥ 0, the source reliability computed via Bayesian shrinkage SHALL be in [0.0, 1.0]. When sample_count = 0, reliability SHALL be exactly 0.5. As sample_count increases toward infinity, reliability SHALL approach the observed_win_rate monotonically.
 **Validates: Requirements 8.1, 8.2, 17.5**
 ### Property 6: Quality Gate Determinism and Threshold Monotonicity
 *For any* set of model metric values and quality gate configuration, the gate evaluation result SHALL be deterministic: the same inputs always produce the same pass/fail result. Furthermore, for any configuration where the gate passes, relaxing any single threshold (decreasing a min value or increasing a max value, making it easier to satisfy) SHALL NOT cause the gate to fail (monotonicity).
 **Validates: Requirements 11.1, 17.6**
 ### Property 7: Contribution Score Sum-to-One and Range
 *For any* non-empty list of positive document weights, the computed contribution scores SHALL each be in [0.0, 1.0] and SHALL sum to 1.0 (within floating-point tolerance of 1e-9). For an empty weight list, the result SHALL be an empty list.
 **Validates: Requirements 2.5, 17.7**
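A short sketch of the normalization this property describes follows. The equal-share fallback for a zero total weight is taken from the Metrics Computation Edge Cases table; treating a negative total the same way is an assumption.

```python
def compute_contribution_scores(weights: list[float]) -> list[float]:
    """Normalize document weights so the scores sum to 1.0."""
    if not weights:
        return []
    total = sum(weights)
    if total <= 0:
        # Zero total weight: fall back to equal shares (1/n each).
        return [1.0 / len(weights)] * len(weights)
    return [w / total for w in weights]


scores = compute_contribution_scores([1.0, 3.0])
```

Here the two documents receive 0.25 and 0.75 respectively, and a single document always receives 1.0.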
 ---
 ## Error Handling
 ### Price Data Unavailability
 | Scenario | Handling |
 |----------|----------|
 | Ticker price unavailable at snapshot time | Store NULL for `price_at_prediction`, log warning, continue |
 | SPY price unavailable at snapshot time | Store NULL for `spy_price_at_prediction`, log warning, continue |
 | Sector ETF price unavailable at snapshot time | Store NULL for `sector_etf_price_at_prediction`, log warning, continue |
 | Sector not found in SECTOR_ETF_MAP | Store NULL for sector ETF price, log warning |
 | Future price unavailable at evaluation time | Skip that horizon, retry on next Outcome_Evaluator run |
 | SPY/sector ETF future price unavailable | Store NULL for excess returns, still compute ticker return |
 ### Metrics Computation Edge Cases
 | Scenario | Handling |
 |----------|----------|
 | Zero predictions in a confidence bucket | Exclude bucket from ECE computation |
 | Fewer than 30 predictions for IC/Rank IC | Return NULL instead of unreliable correlation |
 | All predictions in same confidence bucket | ECE = \|avg_confidence - win_rate\| for that single bucket |
 | Division by zero in contribution scores (total weight = 0) | Return equal contribution scores (1/n) |
 | Single prediction | Contribution score = 1.0 |
 | NaN/infinity in metric computation | Guard with `math.isnan`/`math.isinf` checks, return 0.0 or NULL |
 ### Quality Gate Failures
 | Scenario | Handling |
 |----------|----------|
 | No model_metric_snapshots exist | Default to paper-only mode (fail-safe) |
 | Most recent snapshot older than 24 hours | Default to paper-only mode (fail-safe) |
 | risk_configs table unreachable | Default to paper-only mode, log warning |
 | Invalid threshold values in risk_configs | Use default thresholds, log warning |
 | Gate evaluation fails mid-computation | Default to paper-only mode, log error |
 ### Database Failures
 | Scenario | Handling |
 |----------|----------|
 | prediction_snapshots insert fails | Log error, do not block recommendation generation |
 | signal_evidence_links insert fails | Log error, snapshot still created (partial data) |
 | prediction_outcomes insert fails | Log error, retry on next Outcome_Evaluator run |
 | model_metric_snapshots insert fails | Log error, stale metrics used until next successful computation |
 | source_accuracy update fails | Log error, continue with stale reliability data |
 ### Canonical Evidence Key Edge Cases
 | Scenario | Handling |
 |----------|----------|
 | Empty title | Use empty string in hash computation |
 | Empty URL | Use empty string in hash computation |
 | URL with no query parameters | Use URL as-is after lowercasing |
 | Non-ASCII characters in title/URL | Encode as UTF-8 before hashing |
 ---
 ## Testing Strategy
 ### Dual Testing Approach
 The model validation feature requires both property-based tests (for mathematical correctness of metric computations) and example-based unit tests (for specific behaviors, integration points, and edge cases). Property-based testing is appropriate here because the feature contains several pure mathematical functions (ECE, Brier score, IC, Bayesian shrinkage, contribution scores) with clear input/output behavior and universal properties.
 ### Property-Based Testing
 **Library:** Hypothesis (already in use — `.hypothesis/` directory exists, project convention established)
 **Configuration:**
 - Minimum 100 iterations per property: `@settings(max_examples=100)`
 - File naming: `tests/test_pbt_model_validation.py`
 - Tag format: `# Feature: model-validation-calibration, Property N: <title>`
 **Property tests to implement (one test per correctness property):**
 | Property | Test Function | Key Generators |
 |----------|---------------|----------------|
 | 1: ECE range and round-trip | `test_calibration_error_range_and_roundtrip` | `st.lists(st.tuples(st.floats(0.5, 1.0), st.booleans()))` |
 | 2: Brier score range and perfect | `test_brier_score_range_and_perfect` | `st.lists(st.tuples(st.floats(0.0, 1.0), st.sampled_from([0.0, 1.0])))` |
 | 3: IC range and perfect correlation | `test_information_coefficient_range_and_perfect` | `st.lists(st.floats(-10, 10), min_size=30)` with linear transform |
 | 4: Canonical key determinism and idempotence | `test_canonical_key_determinism_and_idempotence` | `st.text()` pairs for title and URL |
 | 5: Source reliability bounds and convergence | `test_source_reliability_bounds_and_convergence` | `st.floats(0.0, 1.0)` for win_rate, `st.integers(0, 10000)` for n |
 | 6: Quality gate determinism and monotonicity | `test_quality_gate_determinism_and_monotonicity` | Custom strategy for `QualityGateConfig` and metric values |
 | 7: Contribution score sum-to-one | `test_contribution_score_sum_to_one` | `st.lists(st.floats(0.01, 100.0), min_size=1)` |
 ### Example-Based Unit Tests
 **File:** `tests/test_model_validation_unit.py`
 | Test Area | Examples |
 |-----------|----------|
 | Canonical evidence key | Known title/URL → expected SHA256, empty inputs, unicode |
 | Duplicate detection | 3 docs with 2 sharing a key → 1 marked duplicate |
 | Contribution scores | [0.5, 0.3, 0.2] → [0.5, 0.3, 0.2], single doc → [1.0] |
 | ECE specific values | Perfect calibration → 0.0, all overconfident → positive ECE |
 | Brier score specific values | All correct at p=1.0 → 0.0, all wrong at p=1.0 → 1.0 |
 | IC specific values | Perfect correlation → 1.0, anti-correlation → -1.0, < 30 → None |
 | Source reliability | n=0 → 0.5, n=1000 with wr=0.8 → ≈0.8, n=30 with wr=0.7 → 0.6 |
 | Adjusted evidence weight | reliability=0.5 → base*1.0, clamping to [0.1, 2.0] |
 | Quality gate | All thresholds met → pass, one failed → fail with reason |
 | Quality gate fail-safe | No snapshots → paper-only, stale snapshot → paper-only |
 | Direction correct logic | bullish+positive → true, bullish+negative → false |
 | Profitable logic | buy+positive → true, sell+negative → true |
 | Future return computation | price 100→110 → 0.10, price 100→90 → -0.10 |
 | Excess return | ticker 10%, SPY 5% → excess 5% |
 | Weight clamping | weight 1.5 → clamped to 1.0 |
 ### Frontend Tests
 **File:** `frontend/src/test/pages.test.tsx` (extend existing)
 | Test Area | Strategy |
 |-----------|----------|
 | OpsModel page renders validation tabs | MSW mock for `/api/validation/summary` |
 | Calibration table renders buckets | MSW mock for `/api/validation/calibration` |
 | Gate status indicator | MSW mock for `/api/validation/gate-status` |
 | Miscalibration warning badge | Mock data with miscalibrated bucket |
 ### Integration Tests
 **File:** `tests/test_model_validation_integration.py`
 | Test Area | Strategy |
 |-----------|----------|
 | Snapshot creation with mock DB | asyncpg mock, verify INSERT queries |
 | Outcome evaluation with mock prices | asyncpg mock, verify return computation |
 | Metrics computation end-to-end | In-memory data, verify all metrics computed |
 | API endpoint responses | FastAPI TestClient with mock pool |
 ### Test File Structure
 ```
 tests/
 ├── test_pbt_model_validation.py         # 7 property-based tests
 ├── test_model_validation_unit.py        # Example-based unit tests
 └── test_model_validation_integration.py # Integration tests (optional)
 frontend/src/test/
 └── pages.test.tsx                       # Extended with validation page tests
 ```
@@ -0,0 +1,286 @@
 # Requirements Document — Model Validation, Calibration, and Signal Quality
 ## Introduction
 The Stonks Oracle platform generates trend summaries and trading recommendations from a three-layer signal aggregation engine. While the pipeline produces directional predictions with confidence scores, there is no systematic mechanism to evaluate whether those predictions are accurate, whether confidence scores are well-calibrated, which sources and signal types contribute to correct predictions, or whether the system outperforms simple benchmarks. The platform also lacks safety gates that prevent live trading when model quality is insufficient.
 This feature adds a complete model validation layer: prediction outcome tracking, calibration analysis, information coefficient metrics, signal and source attribution, evidence deduplication quality tracking, confidence recalibration, benchmark comparison, an upgraded Model Performance dashboard, and safety gates for live trading eligibility. The goal is to transform Stonks Oracle from a signal dashboard with paper trading into a statistically validated prediction engine with closed-loop feedback.
 ## Glossary
 - **Prediction_Snapshot_Writer**: A new service component in `services/validation/prediction_snapshot.py` that captures the full state of every recommendation and trend prediction at generation time, including prices, evidence links, and duplicate counts.
 - **Outcome_Evaluator**: A new service component in `services/validation/outcome_evaluator.py` that runs periodically to compute realized future returns and directional accuracy for matured prediction snapshots across multiple horizons.
 - **Metrics_Engine**: A new service component in `services/validation/metrics.py` that computes aggregate model quality metrics including calibration error, information coefficient, Brier score, and win rates over configurable lookback windows.
 - **Attribution_Engine**: A new service component in `services/validation/attribution.py` that computes per-source, per-catalyst-type, and per-signal-layer performance metrics by joining evidence links with prediction outcomes.
 - **Calibration_Engine**: A new service component in `services/validation/calibration.py` that computes source reliability scores using Bayesian shrinkage and adjusts evidence weights based on historical source performance.
 - **Quality_Gate**: A new service component in `services/trading/model_quality_gate.py` that evaluates aggregate model metrics against configurable thresholds and determines whether the system meets minimum quality standards for live trading.
 - **Information_Coefficient**: The Pearson correlation between predicted scores and realized future returns, measuring the linear predictive power of the model. Abbreviated as IC.
 - **Rank_Information_Coefficient**: The Spearman rank correlation between predicted scores and realized future returns, measuring ordinal predictive power. Abbreviated as Rank IC.
 - **Calibration_Error**: The Expected Calibration Error (ECE), computed as the weighted average of the absolute difference between predicted confidence and observed win rate across confidence buckets.
 - **Brier_Score**: The mean squared error between the predicted bullish probability and the binary actual outcome (1 if price went up, 0 otherwise), measuring probabilistic forecast accuracy.
 - **Canonical_Evidence_Key**: A normalized identifier for a piece of evidence, computed as SHA256 of the normalized title concatenated with the normalized URL, used to detect duplicate evidence across different ingestion paths.
 - **Excess_Return**: The return of a prediction minus the return of a benchmark (SPY for broad market, sector ETF for sector-relative) over the same horizon, measuring alpha generation.
 - **Prediction_Snapshot**: A frozen record of a prediction at generation time, capturing all inputs (prices, scores, evidence) needed to evaluate the prediction against future outcomes without hindsight bias.
 - **Model_Metric_Snapshot**: A periodic aggregate of model quality metrics over a lookback window and horizon, stored for time-series analysis of model performance trends.
 - **Source_Reliability**: A Bayesian-shrunk estimate of a source's historical win rate, computed as `0.5 + (n/(n+30)) * (observed_win_rate - 0.5)`, which regresses toward 0.5 for sources with few observations.
 - **Dashboard_API**: The set of API endpoints under `/api/validation/` that serve model quality metrics, calibration tables, attribution data, and gate status to the frontend.
 ---
 ## Requirements
 ### Requirement 1: Prediction Snapshot Capture
 **User Story:** As a quantitative analyst, I want every recommendation and trend prediction captured as an immutable snapshot at generation time, so that I can evaluate predictions against future outcomes without hindsight bias.
 #### Acceptance Criteria
 1. WHEN a recommendation is generated by the Recommendation_Engine, THE Prediction_Snapshot_Writer SHALL create a prediction_snapshots record containing the ticker, generation timestamp, trend window, prediction horizon, direction, action, mode, strength, confidence, contradiction score, bullish probability, bearish probability, company score, macro score, competitive score, evidence count, unique source count, duplicate evidence count, price at prediction time, SPY price at prediction time, and sector ETF price at prediction time.
 2. WHEN a prediction snapshot is created, THE Prediction_Snapshot_Writer SHALL record the current market price for the predicted ticker by querying the most recent close price from the market_snapshots table.
 3. WHEN a prediction snapshot is created, THE Prediction_Snapshot_Writer SHALL record the current SPY price by querying the most recent close price for ticker SPY from the market_snapshots table.
 4. WHEN a prediction snapshot is created, THE Prediction_Snapshot_Writer SHALL record the current sector ETF price by looking up the sector for the predicted ticker and querying the most recent close price for the corresponding sector ETF from the market_snapshots table.
 5. IF the market price, SPY price, or sector ETF price is unavailable at snapshot time, THEN THE Prediction_Snapshot_Writer SHALL store NULL for the unavailable price fields and log a warning, rather than failing the snapshot creation.
 6. THE Prediction_Snapshot_Writer SHALL store prediction snapshots in a new `prediction_snapshots` database table with a UUID primary key and indexed columns for ticker, generated_at, and horizon.
 7. WHEN a prediction snapshot is created, THE Prediction_Snapshot_Writer SHALL store a JSONB metadata field containing any additional context from the trend summary market_context and recommendation risk_checks fields.
 ---
 ### Requirement 2: Signal Evidence Link Tracking
 **User Story:** As a quantitative analyst, I want to know which specific evidence documents contributed to each prediction, so that I can attribute prediction success or failure to individual sources and signal types.
 #### Acceptance Criteria
 1. WHEN a prediction snapshot is created, THE Prediction_Snapshot_Writer SHALL create signal_evidence_links records for each document that contributed to the prediction, linking the prediction_id to the document_id and signal_id.
 2. THE signal_evidence_links record SHALL capture the source identifier, source type, catalyst type, sentiment, impact score, extraction confidence, weight assigned during aggregation, duplicate status, canonical evidence key, and contribution score for each contributing document.
 3. WHEN recording evidence links, THE Prediction_Snapshot_Writer SHALL compute the canonical_evidence_key as the SHA256 hash of the concatenation of the normalized (lowercased, whitespace-trimmed) document title and the normalized (lowercased, query-parameters-stripped) document URL.
 4. WHEN recording evidence links, THE Prediction_Snapshot_Writer SHALL mark a link as `is_duplicate = true` when an earlier link for the same prediction and ticker shares the same canonical_evidence_key, so that the first occurrence of each key remains unflagged.
 5. THE Prediction_Snapshot_Writer SHALL compute the contribution_score for each evidence link as the ratio of that document's effective weight to the total effective weight across all documents for the prediction.
 6. THE signal_evidence_links table SHALL have a foreign key constraint from prediction_id to prediction_snapshots(id) and indexes on prediction_id, document_id, and ticker.
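
Criteria 3 and 4 can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in `prediction_snapshot.py`: the function names are hypothetical, the query string is stripped by cutting at the first `?`, and flagging only the second and later occurrences of a key is an assumption that matches the design document's duplicate-detection example (3 docs with 2 sharing a key yields 1 duplicate).

```python
import hashlib


def canonical_evidence_key(title: str, url: str) -> str:
    """SHA256 of the normalized title concatenated with the normalized URL."""
    # Title: lowercased and whitespace-trimmed.
    norm_title = (title or "").strip().lower()
    # URL: lowercased, with everything from the first "?" dropped.
    norm_url = (url or "").lower().split("?", 1)[0]
    # Encode as UTF-8 so non-ASCII titles and URLs hash consistently.
    return hashlib.sha256((norm_title + norm_url).encode("utf-8")).hexdigest()


def mark_duplicates(keys: list[str]) -> list[bool]:
    """Return one flag per key; True for every occurrence after the first."""
    seen: set[str] = set()
    flags = []
    for key in keys:
        flags.append(key in seen)
        seen.add(key)
    return flags
```

Empty titles and URLs fall through to empty strings, so the key is still well defined when either field is missing.
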
 ---
 ### Requirement 3: Evidence Deduplication Quality Tracking
 **User Story:** As a quantitative analyst, I want the system to track evidence deduplication quality per prediction, so that I can identify when predictions are inflated by counting the same information multiple times from different sources.
 #### Acceptance Criteria
 1. WHEN creating a prediction snapshot, THE Prediction_Snapshot_Writer SHALL compute the unique_source_count as the number of distinct source identifiers across all non-duplicate evidence links for that prediction.
 2. WHEN creating a prediction snapshot, THE Prediction_Snapshot_Writer SHALL compute the duplicate_evidence_count as the number of evidence links marked as `is_duplicate = true` for that prediction.
 3. THE Prediction_Snapshot_Writer SHALL enforce a maximum single-document weight cap of 1.0, clamping any individual document's effective weight to prevent a single piece of evidence from dominating the prediction.
 4. WHEN computing contribution scores, THE Prediction_Snapshot_Writer SHALL count each canonical evidence key at most once per ticker per window, applying the one-vote-per-canonical-document deduplication rule.
 5. THE Metrics_Engine SHALL compute a duplicate_rate metric as the ratio of duplicate_evidence_count to total evidence_count across predictions in the lookback window.
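
The weight cap and contribution-score normalization might look like the sketch below. It assumes the weights passed in already reflect the one-vote-per-canonical-document rule; the function name and the lower clamp at 0.0 are assumptions, and the equal-share fallback for a zero total follows the edge-case table in the design document.

```python
def contribution_scores(weights: list[float], cap: float = 1.0) -> list[float]:
    """Clamp each weight to [0, cap], then normalize so the scores sum to 1."""
    if not weights:
        return []
    # Cap each document's weight so no single document dominates.
    clamped = [min(max(w, 0.0), cap) for w in weights]
    total = sum(clamped)
    if total == 0:
        # Zero total weight: fall back to equal shares (1/n).
        return [1.0 / len(clamped)] * len(clamped)
    return [w / total for w in clamped]
```

A single document always receives a score of 1.0, and a weight of 1.5 is treated as 1.0 before normalization.
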
 ---
 ### Requirement 4: Prediction Outcome Evaluation
 **User Story:** As a quantitative analyst, I want realized market outcomes automatically matched to historical predictions, so that I can measure whether the system's directional calls and confidence scores correspond to actual price movements.
 #### Acceptance Criteria
 1. THE Outcome_Evaluator SHALL run on a periodic schedule, evaluating prediction snapshots whose horizon has elapsed and whose outcome has not yet been recorded.
 2. WHEN evaluating a prediction snapshot, THE Outcome_Evaluator SHALL compute the future_return as `(future_price - price_at_prediction) / price_at_prediction` using the closing price at the horizon endpoint.
 3. WHEN evaluating a prediction snapshot, THE Outcome_Evaluator SHALL compute the SPY return over the same horizon as `(spy_future_price - spy_price_at_prediction) / spy_price_at_prediction`.
 4. WHEN evaluating a prediction snapshot, THE Outcome_Evaluator SHALL compute the sector ETF return over the same horizon as `(sector_etf_future_price - sector_etf_price_at_prediction) / sector_etf_price_at_prediction`.
 5. WHEN evaluating a prediction snapshot, THE Outcome_Evaluator SHALL compute excess_return_vs_spy as `future_return - spy_return` and excess_return_vs_sector as `future_return - sector_etf_return`.
 6. WHEN evaluating a prediction snapshot, THE Outcome_Evaluator SHALL determine direction_correct as true when the prediction direction is bullish and future_return is positive, or when the prediction direction is bearish and future_return is negative.
 7. WHEN evaluating a prediction snapshot, THE Outcome_Evaluator SHALL determine profitable as true when the prediction action is buy and future_return is positive, or when the prediction action is sell and future_return is negative.
 8. THE Outcome_Evaluator SHALL evaluate each prediction across all applicable horizons: 1 hour, 6 hours, 1 day, 7 days, and 30 days.
 9. THE Outcome_Evaluator SHALL store evaluation results in a new `prediction_outcomes` table with a foreign key to prediction_snapshots and indexed columns for prediction_id, horizon, and evaluated_at.
 10. IF the future price is unavailable at the horizon endpoint (market data gap), THEN THE Outcome_Evaluator SHALL skip that horizon evaluation and retry on the next run.
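
The return and correctness rules in criteria 2 through 7 reduce to a few pure functions. The sketch below is illustrative only: the function names are hypothetical, and returning `None` for a missing or zero starting price is an assumption that mirrors the design document's NULL handling for unavailable benchmark prices.

```python
from typing import Optional


def simple_return(start: Optional[float], end: Optional[float]) -> Optional[float]:
    """(end - start) / start, or None when either price is missing."""
    if start is None or end is None or start == 0:
        return None
    return (end - start) / start


def excess_return(ticker_ret: Optional[float], bench_ret: Optional[float]) -> Optional[float]:
    """Ticker return minus benchmark return; None if either side is missing."""
    if ticker_ret is None or bench_ret is None:
        return None
    return ticker_ret - bench_ret


def direction_correct(direction: str, future_return: float) -> bool:
    """True for bullish with a positive return or bearish with a negative one."""
    return (direction == "bullish" and future_return > 0) or (
        direction == "bearish" and future_return < 0
    )


def profitable(action: str, future_return: float) -> bool:
    """True for buy with a positive return or sell with a negative one."""
    return (action == "buy" and future_return > 0) or (
        action == "sell" and future_return < 0
    )
```

Under these rules any other direction or action label, and a return of exactly zero, evaluates to `False`.
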
 ---
 ### Requirement 5: Calibration Analysis
 **User Story:** As a quantitative analyst, I want to measure how well the system's confidence scores predict actual win rates, so that I can identify overconfident or underconfident predictions and recalibrate the model.
 #### Acceptance Criteria
 1. THE Metrics_Engine SHALL compute calibration metrics by grouping evaluated predictions into confidence buckets: [0.50, 0.60), [0.60, 0.70), [0.70, 0.80), [0.80, 0.90), [0.90, 1.00].
 2. FOR EACH confidence bucket, THE Metrics_Engine SHALL compute the average confidence, the observed win rate (fraction of direction_correct outcomes), and the prediction count.
 3. THE Metrics_Engine SHALL compute the Expected Calibration Error (ECE) as the weighted average of `|avg_confidence - observed_win_rate|` across all buckets, weighted by the fraction of predictions in each bucket.
 4. THE Metrics_Engine SHALL compute the Brier Score as `mean((p_bull - actual_outcome)^2)` across all evaluated predictions, where actual_outcome is 1.0 when the price went up over the horizon and 0.0 otherwise, matching the Brier_Score glossary definition.
 5. THE Metrics_Engine SHALL flag calibration buckets where `|avg_confidence - observed_win_rate| > 0.15` as miscalibrated for dashboard highlighting.
 6. THE Metrics_Engine SHALL compute calibration metrics separately for each prediction horizon (1h, 6h, 1d, 7d, 30d).
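
A sketch of the bucketed calibration table, ECE, and Brier score, assuming predictions arrive as in-memory tuples for a single horizon. The function names and dictionary keys are hypothetical. Bucket membership is found by comparing against the lower edges rather than by arithmetic on the confidence, which avoids floating-point misplacement at boundaries such as 0.60. The Brier outcome follows the glossary definition (1.0 if the price went up).

```python
from bisect import bisect_right
from typing import Optional

# Lower edges of [0.50, 0.60), [0.60, 0.70), ..., [0.90, 1.00].
LOWER_EDGES = [0.5, 0.6, 0.7, 0.8, 0.9]
MISCALIBRATION_GAP = 0.15


def calibration_table(preds: list[tuple[float, bool]]) -> list[dict]:
    """One row per non-empty bucket from (confidence, direction_correct) pairs."""
    buckets: list[list[tuple[float, bool]]] = [[] for _ in LOWER_EDGES]
    for conf, correct in preds:
        if 0.5 <= conf <= 1.0:
            buckets[bisect_right(LOWER_EDGES, conf) - 1].append((conf, correct))
    rows = []
    for lower, items in zip(LOWER_EDGES, buckets):
        if not items:
            continue  # empty buckets are excluded from ECE
        avg_conf = sum(c for c, _ in items) / len(items)
        win_rate = sum(1 for _, ok in items if ok) / len(items)
        rows.append({
            "lower": lower,
            "count": len(items),
            "avg_confidence": avg_conf,
            "win_rate": win_rate,
            "miscalibrated": abs(avg_conf - win_rate) > MISCALIBRATION_GAP,
        })
    return rows


def expected_calibration_error(preds: list[tuple[float, bool]]) -> Optional[float]:
    """Count-weighted mean of |avg_confidence - win_rate| over the buckets."""
    rows = calibration_table(preds)
    total = sum(r["count"] for r in rows)
    if total == 0:
        return None
    return sum(
        r["count"] / total * abs(r["avg_confidence"] - r["win_rate"]) for r in rows
    )


def brier_score(pairs: list[tuple[float, float]]) -> Optional[float]:
    """Mean squared error over (p_bull, outcome) pairs, outcome in {0.0, 1.0}."""
    if not pairs:
        return None
    return sum((p - y) ** 2 for p, y in pairs) / len(pairs)
```

When every prediction lands in one bucket, the weighted sum collapses to that bucket's gap, which is the single-bucket case listed in the design document.
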
 ---
 ### Requirement 6: Information Coefficient Metrics
 **User Story:** As a quantitative analyst, I want to measure the correlation between the system's prediction scores and realized returns, so that I can assess whether higher-scored predictions actually produce higher returns.
 #### Acceptance Criteria
 1. THE Metrics_Engine SHALL compute the Information Coefficient (IC) as the Pearson correlation between prediction scores and future returns across all evaluated predictions in the lookback window.
 2. THE Metrics_Engine SHALL compute the Rank Information Coefficient (Rank IC) as the Spearman rank correlation between prediction scores and future returns across all evaluated predictions in the lookback window.
 3. THE Metrics_Engine SHALL compute IC and Rank IC separately for each prediction horizon (1h, 6h, 1d, 7d, 30d).
 4. THE Metrics_Engine SHALL compute return statistics by confidence decile, grouping predictions into 10 equal-sized bins by confidence and computing the average future return and average excess return for each decile.
 5. WHEN fewer than 30 evaluated predictions exist for a given horizon, THE Metrics_Engine SHALL report IC and Rank IC as NULL rather than computing unreliable correlations from small samples.
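
IC and Rank IC can be sketched in pure Python as below. This is an illustration, not the `metrics.py` implementation: the function names are hypothetical, Spearman correlation is computed as Pearson correlation over average ranks (ties share the mean of their positions), and returning `None` for a zero-variance input is an assumption in the spirit of the NaN guards in the design document.

```python
import math
from typing import Optional

MIN_SAMPLES = 30  # below this, IC and Rank IC are reported as NULL


def pearson(xs: list[float], ys: list[float]) -> Optional[float]:
    """Pearson correlation, or None when it is undefined."""
    n = len(xs)
    if n != len(ys) or n < 2:
        return None
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None  # constant input: correlation undefined
    # Clamp to absorb floating-point overshoot beyond [-1, 1].
    return max(-1.0, min(1.0, sxy / math.sqrt(sxx * syy)))


def average_ranks(values: list[float]) -> list[float]:
    """1-based ranks; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def information_coefficient(scores: list[float], returns: list[float]) -> Optional[float]:
    if len(scores) < MIN_SAMPLES:
        return None
    return pearson(scores, returns)


def rank_information_coefficient(scores: list[float], returns: list[float]) -> Optional[float]:
    if len(scores) < MIN_SAMPLES:
        return None
    return pearson(average_ranks(scores), average_ranks(returns))
```

Rank IC rewards any monotone relationship, so a score that orders returns correctly but non-linearly can have a Rank IC near 1 with a lower IC.
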
 ---
 ### Requirement 7: Source and Signal Attribution
 **User Story:** As a quantitative analyst, I want to know which sources, source types, and catalyst types contribute to accurate predictions, so that I can identify the most valuable information channels and deprioritize unreliable ones.
 #### Acceptance Criteria
 1. THE Attribution_Engine SHALL compute per-source performance metrics by joining signal_evidence_links with prediction_outcomes, grouping by source identifier.
 2. FOR EACH source, THE Attribution_Engine SHALL compute: prediction count, average weight, average contribution score, win rate, average future return, average excess return vs SPY, and information coefficient.
 3. THE Attribution_Engine SHALL compute the same performance metrics grouped by source_type (e.g., news_api, filings_api, web_scrape, market_api).
 4. THE Attribution_Engine SHALL compute the same performance metrics grouped by catalyst_type (e.g., earnings, product, legal, macro, m_and_a).
 5. THE Attribution_Engine SHALL compute layer attribution metrics for the three signal layers (company, macro, competitive) by using the score_company, score_macro, and score_competitive fields from prediction snapshots.
 6. FOR EACH layer, THE Attribution_Engine SHALL compute the average contribution percentage, the win rate when that layer is the dominant contributor, and the IC of predictions where that layer contributes more than 30% of the total score.
 7. THE Attribution_Engine SHALL compute a per-source duplicate_rate as the fraction of evidence links from that source marked as is_duplicate.
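
Per-source attribution is essentially a group-by over evidence links joined to outcomes. The sketch below assumes rows shaped like the `v_source_performance` view columns and computes a subset of the metrics in criterion 2; the function name and the `link_count` key are hypothetical, and IC per source is omitted for brevity.

```python
from collections import defaultdict


def source_attribution(rows: list[dict]) -> dict[str, dict]:
    """Group evidence-link rows by source and average the outcome columns."""
    grouped: dict[str, list[dict]] = defaultdict(list)
    for row in rows:
        grouped[row["source"]].append(row)

    def mean(values: list[float]) -> float:
        return sum(values) / len(values)

    report = {}
    for source, links in grouped.items():
        report[source] = {
            "link_count": len(links),
            "avg_weight": mean([r["weight"] for r in links]),
            "avg_contribution_score": mean([r["contribution_score"] for r in links]),
            "win_rate": mean([1.0 if r["direction_correct"] else 0.0 for r in links]),
            "avg_future_return": mean([r["future_return"] for r in links]),
            "duplicate_rate": mean([1.0 if r["is_duplicate"] else 0.0 for r in links]),
        }
    return report
```

Grouping by `source_type` or `catalyst_type` instead only requires changing the grouping key.
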
 ---
 ### Requirement 8: Confidence Recalibration via Source Reliability
 **User Story:** As a quantitative analyst, I want source credibility weights adjusted based on historical prediction accuracy using Bayesian shrinkage, so that the system learns from its own track record and improves over time.
 #### Acceptance Criteria
 1. THE Calibration_Engine SHALL compute source reliability using Bayesian shrinkage: `reliability = 0.5 + (n / (n + 30)) * (observed_win_rate - 0.5)`, where n is the number of evaluated predictions involving that source and observed_win_rate is the fraction of correct directional calls.
 2. WHEN a source has zero evaluated predictions, THE Calibration_Engine SHALL assign a reliability of 0.5 (the prior mean).
 3. THE Calibration_Engine SHALL compute an adjusted evidence weight for each source as `adjusted_weight = base_weight * (0.5 + reliability)`, clamped to the range [0.1, 2.0].
 4. THE Calibration_Engine SHALL update source reliability scores after each outcome evaluation cycle, using the latest prediction outcomes.
 5. THE Calibration_Engine SHALL store source reliability scores in the existing `source_accuracy` table, extending it with a reliability column or using the existing accuracy_ratio field with the Bayesian shrinkage formula.
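
The shrinkage and weight-adjustment formulas in criteria 1 through 3 translate directly into code. The function and constant names below are hypothetical; the formulas and the clamp range are taken from the criteria.

```python
PRIOR_MEAN = 0.5       # reliability assigned to a source with no history
PRIOR_STRENGTH = 30    # pseudo-observations pulling the estimate toward the prior


def source_reliability(n: int, observed_win_rate: float) -> float:
    """Shrink the observed win rate toward 0.5; more data means less shrinkage."""
    if n <= 0:
        return PRIOR_MEAN
    return PRIOR_MEAN + (n / (n + PRIOR_STRENGTH)) * (observed_win_rate - PRIOR_MEAN)


def adjusted_weight(base_weight: float, reliability: float) -> float:
    """base_weight * (0.5 + reliability), clamped to [0.1, 2.0]."""
    return min(max(base_weight * (0.5 + reliability), 0.1), 2.0)
```

With 30 observations the estimate sits halfway between the prior and the observed rate, so a source with a 0.7 win rate over 30 predictions scores 0.6, and a neutral reliability of 0.5 leaves the base weight unchanged.
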
 ---
 ### Requirement 9: Benchmark Comparison
 **User Story:** As a quantitative analyst, I want the system's prediction performance compared against simple benchmarks, so that I can determine whether the model adds value beyond naive strategies.
 #### Acceptance Criteria
 1. THE Metrics_Engine SHALL compute the average excess return of all buy predictions versus a buy-and-hold SPY strategy over the same horizons.
 2. THE Metrics_Engine SHALL compute the average excess return of all buy predictions versus a buy-and-hold sector ETF strategy over the same horizons.
 3. THE Metrics_Engine SHALL compute the win rate of the system's directional predictions compared to a random 50/50 baseline, reporting the statistical significance using a binomial test when the prediction count exceeds 100.
 4. THE Metrics_Engine SHALL compute the hit rate improvement, defined as `(system_win_rate - 0.5) / 0.5`, representing the relative improvement over random guessing (for example, a win rate of 0.53 yields 0.06, a 6% improvement).
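
The baseline comparison in criteria 3 and 4 can be sketched with the standard library alone. The criteria do not say whether the binomial test is one-sided or two-sided; the sketch assumes a one-sided exact test (is the win count higher than a fair coin would produce?), and the function names are hypothetical.

```python
from math import comb
from typing import Optional

MIN_COUNT_FOR_TEST = 100  # significance is reported only above this count


def hit_rate_improvement(win_rate: float) -> float:
    """Relative improvement over a 50/50 baseline."""
    return (win_rate - 0.5) / 0.5


def binomial_p_value(wins: int, n: int) -> Optional[float]:
    """One-sided exact p-value P(X >= wins) for X ~ Binomial(n, 0.5)."""
    if n <= MIN_COUNT_FOR_TEST:
        return None
    # Under p = 0.5 every outcome sequence has probability 1 / 2**n,
    # so the tail probability is a ratio of exact integers.
    tail = sum(comb(n, k) for k in range(wins, n + 1))
    return tail / 2 ** n
```

Working in exact integers avoids underflow in the individual tail terms; only the final ratio is converted to a float.
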
 ---
 ### Requirement 10: Model Metric Snapshots
 **User Story:** As a quantitative analyst, I want aggregate model metrics stored as time-series snapshots, so that I can track whether model quality is improving or degrading over time.
 #### Acceptance Criteria
 1. THE Metrics_Engine SHALL periodically compute and store model_metric_snapshots containing all aggregate metrics for each combination of lookback window and prediction horizon.
 2. EACH model_metric_snapshot SHALL contain: prediction count, win rate, directional accuracy, IC, Rank IC, average return, average excess return vs SPY, average excess return vs sector, calibration error (ECE), Brier score, and per-action win rates (buy, sell, hold).
 3. THE Metrics_Engine SHALL store model_metric_snapshots in a new `model_metric_snapshots` database table with a UUID primary key and indexed columns for generated_at, lookback_window, and horizon.
 4. THE Metrics_Engine SHALL compute snapshots for lookback windows of 7 days, 30 days, 90 days, and all-time.
 5. THE Metrics_Engine SHALL store a JSONB metadata field in each snapshot for extensibility, containing any additional computed metrics not captured in dedicated columns.
 ---
 ### Requirement 11: Safety Gate for Live Trading
 **User Story:** As a platform operator, I want live trading automatically disabled when model quality metrics fall below minimum thresholds, so that the system does not risk real capital on a poorly performing model.
 #### Acceptance Criteria
 1. THE Quality_Gate SHALL evaluate the following minimum thresholds for live trading eligibility: minimum prediction count of 100, minimum IC of 0.03, minimum win rate of 0.53, maximum ECE of 0.15, and minimum excess return vs SPY of 0.0.
 2. WHEN any threshold is not met, THE Quality_Gate SHALL force all recommendations to paper mode, overriding any live_eligible mode assignments.
 3. THE Quality_Gate SHALL evaluate gate status at the start of each aggregation cycle by reading the most recent model_metric_snapshot.
 4. THE Quality_Gate SHALL log the gate evaluation result including which thresholds passed and which failed, with their actual values.
 5. THE Quality_Gate SHALL store the gate evaluation result in the `risk_configs` table under a `model_quality_gate` key, making it available to the recommendation engine and dashboard.
 6. IF the model_metric_snapshots table is empty or the most recent snapshot is older than 24 hours, THEN THE Quality_Gate SHALL default to paper-only mode (fail-safe behavior).
 7. THE Quality_Gate SHALL support configurable thresholds via the `risk_configs` table, with the default values specified in acceptance criterion 1 used when no override is configured.
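
A sketch of the gate evaluation with its fail-safe paths, assuming the latest snapshot is passed in as a dictionary keyed by `model_metric_snapshots` column names. The `QualityGateConfig` field names, the `evaluate_gate` function, and the result shape are hypothetical, and treating a NULL metric as a failed threshold is an assumption consistent with the fail-safe intent.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class QualityGateConfig:
    min_prediction_count: int = 100
    min_ic: float = 0.03
    min_win_rate: float = 0.53
    max_ece: float = 0.15
    min_excess_return_vs_spy: float = 0.0
    max_snapshot_age: timedelta = timedelta(hours=24)


def evaluate_gate(
    snapshot: Optional[dict],
    now: datetime,
    config: QualityGateConfig = QualityGateConfig(),
) -> dict:
    """Return live eligibility plus the names of any failed checks."""
    # Fail-safe: no snapshot or a stale one forces paper-only mode.
    if snapshot is None:
        return {"live_eligible": False, "failed": ["no_snapshot"]}
    if now - snapshot["generated_at"] > config.max_snapshot_age:
        return {"live_eligible": False, "failed": ["stale_snapshot"]}

    def at_least(key: str, minimum: float) -> bool:
        value = snapshot.get(key)
        return value is not None and value >= minimum

    ece = snapshot.get("calibration_error")
    checks = {
        "prediction_count": at_least("prediction_count", config.min_prediction_count),
        "information_coefficient": at_least("information_coefficient", config.min_ic),
        "win_rate": at_least("win_rate", config.min_win_rate),
        "calibration_error": ece is not None and ece <= config.max_ece,
        "avg_excess_return_vs_spy": at_least(
            "avg_excess_return_vs_spy", config.min_excess_return_vs_spy
        ),
    }
    failed = [name for name, ok in checks.items() if not ok]
    return {"live_eligible": not failed, "failed": failed}
```

Because the function is pure, the same snapshot and configuration always produce the same result, and raising any minimum threshold can only turn a pass into a fail.
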
 ---
 ### Requirement 12: Model Performance Dashboard Upgrade
 **User Story:** As a platform operator, I want a comprehensive model performance dashboard showing prediction accuracy, calibration, attribution, and gate status, so that I can monitor model quality and make informed decisions about live trading.
 #### Acceptance Criteria
 1. THE Dashboard_API SHALL expose a `/api/validation/summary` endpoint returning the latest model metric snapshot with summary cards for: prediction count, win rate, directional accuracy, IC, Rank IC, Brier score, calibration error, average excess return vs SPY, average excess return vs sector, and live trading gate status.
 2. THE Dashboard_API SHALL expose a `/api/validation/calibration` endpoint returning the calibration table with confidence buckets, average confidence, observed win rate, prediction count, and miscalibration flag for each bucket.
 3. THE Dashboard_API SHALL expose a `/api/validation/ic-by-horizon` endpoint returning IC and Rank IC values for each prediction horizon.
 4. THE Dashboard_API SHALL expose a `/api/validation/attribution/sources` endpoint returning per-source performance metrics including win rate, IC, average return, and duplicate rate.
 5. THE Dashboard_API SHALL expose a `/api/validation/attribution/catalysts` endpoint returning per-catalyst-type performance metrics.
 6. THE Dashboard_API SHALL expose a `/api/validation/attribution/layers` endpoint returning per-signal-layer (company, macro, competitive) performance metrics.
 7. THE Dashboard_API SHALL expose a `/api/validation/gate-status` endpoint returning the current quality gate evaluation with pass/fail status for each threshold.
 8. THE frontend OpsModel page SHALL be upgraded to display the model validation summary cards, calibration table, IC-by-horizon table, source performance table, catalyst truth table, layer attribution table, and gate status indicator.
 9. THE frontend SHALL highlight miscalibrated confidence buckets where `|avg_confidence - observed_win_rate| > 0.15` with a visual warning indicator.
 ---
 ### Requirement 13: Recommendation Display Enhancements
 **User Story:** As a platform operator, I want each recommendation to display its validation context including calibrated confidence, historical win rate, and evidence quality indicators, so that I can assess the reliability of individual predictions.
 #### Acceptance Criteria
 1. WHEN displaying a recommendation, THE frontend SHALL show the original confidence alongside the calibrated confidence (based on the historical win rate for that confidence bucket).
 2. WHEN displaying a recommendation, THE frontend SHALL show the historical win rate for predictions with similar confidence levels.
 3. WHEN displaying a recommendation, THE frontend SHALL show the evidence count, unique evidence count, and duplicate evidence count.
 4. WHEN displaying a recommendation, THE frontend SHALL show a source reliability indicator based on the Bayesian-shrunk reliability score of the primary contributing sources.
 5. WHEN displaying a recommendation, THE frontend SHALL show the live eligibility status with the reason (gate passed, or which threshold failed).
 6. WHEN the duplicate evidence count exceeds 20% of the total evidence count, THE frontend SHALL display a warning badge indicating potential evidence inflation.
 7. WHEN the primary contributing source has a reliability score below 0.4, THE frontend SHALL display a warning badge indicating unknown or low source reliability.
 ---
 ### Requirement 14: SQL Explorer Views
 **User Story:** As a quantitative analyst, I want pre-built SQL views joining predictions with outcomes and evidence with performance, so that I can run ad-hoc analysis in the SQL Explorer without writing complex joins.
 #### Acceptance Criteria
 1. THE database migration SHALL create a view `v_prediction_performance` that joins prediction_snapshots with prediction_outcomes on prediction_id, providing a single flat table with prediction inputs and realized outcomes.
 2. THE database migration SHALL create a view `v_source_performance` that joins signal_evidence_links with prediction_outcomes (via prediction_id), providing per-evidence-link outcome data for source attribution analysis.
 3. THE v_prediction_performance view SHALL include columns for ticker, direction, action, confidence, strength, price_at_prediction, future_return, excess_return_vs_spy, direction_correct, profitable, horizon, generated_at, and evaluated_at.
 4. THE v_source_performance view SHALL include columns for source, source_type, catalyst_type, sentiment, weight, contribution_score, is_duplicate, direction_correct, future_return, and excess_return_vs_spy.
 ---
 ### Requirement 15: Backtest Replay Integration
 **User Story:** As a quantitative analyst, I want to replay historical data through the prediction snapshot and outcome evaluation pipeline, so that I can assess model quality on historical data without future data leakage.
 #### Acceptance Criteria
 1. THE Backtest_Replay service SHALL support a validation mode that generates prediction snapshots and evaluates outcomes using only data available at each historical point in time.
 2. WHEN running in validation mode, THE Backtest_Replay service SHALL process historical recommendations chronologically, creating prediction snapshots with the market prices that were available at each recommendation's generation time.
 3. WHEN running in validation mode, THE Backtest_Replay service SHALL evaluate prediction outcomes using market prices from the appropriate future horizon relative to each prediction's generation time.
 4. THE Backtest_Replay service SHALL prevent future data leakage by ensuring that no market data with a timestamp after the prediction generation time is used during snapshot creation.
 5. WHEN a backtest validation run completes, THE Backtest_Replay service SHALL trigger a model metrics computation over the backtest period, storing the results as model_metric_snapshots tagged with the backtest_id.
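
The leakage rule in criterion 4 amounts to a point-in-time price lookup: only closes stamped at or before the prediction's generation time are visible. The sketch below assumes the price history is held in memory as parallel lists sorted by timestamp; the function name is hypothetical, and the real service would presumably issue an equivalent query against `market_snapshots`.

```python
from bisect import bisect_right
from datetime import datetime
from typing import Optional


def price_as_of(
    timestamps: list[datetime], closes: list[float], as_of: datetime
) -> Optional[float]:
    """Latest close stamped at or before as_of; timestamps sorted ascending."""
    # bisect_right returns the count of timestamps <= as_of.
    idx = bisect_right(timestamps, as_of)
    if idx == 0:
        return None  # no data existed yet at that point in time
    return closes[idx - 1]
```

Calling this with `as_of` set to the recommendation's generation time yields the snapshot price without touching any later bar.
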
 ---
 ### Requirement 16: Database Schema
 **User Story:** As a developer, I want the new database tables created via a migration script following the existing migration conventions, so that the schema changes are applied consistently across all environments.
 #### Acceptance Criteria
 1. THE database migration SHALL create the `prediction_snapshots` table with columns: id (UUID PK), generated_at (TIMESTAMPTZ), ticker (VARCHAR), window (VARCHAR), horizon (VARCHAR), direction (VARCHAR), action (VARCHAR), mode (VARCHAR), strength (FLOAT), confidence (FLOAT), contradiction (FLOAT), p_bull (FLOAT), p_bear (FLOAT), score_company (FLOAT), score_macro (FLOAT), score_competitive (FLOAT), evidence_count (INTEGER), unique_source_count (INTEGER), duplicate_evidence_count (INTEGER), price_at_prediction (FLOAT), spy_price_at_prediction (FLOAT), sector_etf_price_at_prediction (FLOAT), metadata (JSONB), created_at (TIMESTAMPTZ).
 2. THE database migration SHALL create the `prediction_outcomes` table with columns: id (UUID PK), prediction_id (UUID FK to prediction_snapshots), evaluated_at (TIMESTAMPTZ), horizon (VARCHAR), future_price (FLOAT), future_return (FLOAT), spy_future_price (FLOAT), spy_return (FLOAT), sector_etf_future_price (FLOAT), sector_etf_return (FLOAT), excess_return_vs_spy (FLOAT), excess_return_vs_sector (FLOAT), direction_correct (BOOLEAN), profitable (BOOLEAN), metadata (JSONB), created_at (TIMESTAMPTZ).
 3. THE database migration SHALL create the `signal_evidence_links` table with columns: id (UUID PK), prediction_id (UUID FK to prediction_snapshots), document_id (VARCHAR), signal_id (VARCHAR), ticker (VARCHAR), source (VARCHAR), source_type (VARCHAR), catalyst_type (VARCHAR), sentiment (VARCHAR), impact (FLOAT), extraction_confidence (FLOAT), weight (FLOAT), is_duplicate (BOOLEAN), canonical_evidence_key (VARCHAR), contribution_score (FLOAT), metadata (JSONB), created_at (TIMESTAMPTZ).
 4. THE database migration SHALL create the `model_metric_snapshots` table with columns: id (UUID PK), generated_at (TIMESTAMPTZ), lookback_window (VARCHAR), horizon (VARCHAR), prediction_count (INTEGER), win_rate (FLOAT), directional_accuracy (FLOAT), information_coefficient (FLOAT), rank_information_coefficient (FLOAT), avg_return (FLOAT), avg_excess_return_vs_spy (FLOAT), avg_excess_return_vs_sector (FLOAT), calibration_error (FLOAT), brier_score (FLOAT), buy_win_rate (FLOAT), sell_win_rate (FLOAT), hold_win_rate (FLOAT), metadata (JSONB), created_at (TIMESTAMPTZ).
 5. THE database migration SHALL create appropriate indexes on prediction_snapshots (ticker, generated_at, horizon), prediction_outcomes (prediction_id, horizon), signal_evidence_links (prediction_id, document_id, ticker), and model_metric_snapshots (generated_at, lookback_window, horizon).
 6. THE database migration SHALL be numbered as `035_model_validation.sql`, following the existing migration numbering convention.
 ---
 ### Requirement 17: Property-Based Testing for Validation Metrics
 **User Story:** As a developer, I want property-based tests validating the mathematical correctness of all validation metric computations, so that edge cases and numerical stability issues are caught before deployment.
 #### Acceptance Criteria
 1. THE test suite SHALL include a property-based test for calibration error verifying that ECE is in [0.0, 1.0] for all valid distributions of predictions across confidence buckets, and that ECE is 0.0 when every bucket's observed win rate exactly matches its average confidence (round-trip calibration property).
 2. THE test suite SHALL include a property-based test for Brier score verifying that the score is in [0.0, 1.0] for all valid probability-outcome pairs, and that the score is 0.0 when all predictions are perfectly correct with probability 1.0.
 3. THE test suite SHALL include a property-based test for information coefficient verifying that IC is in [-1.0, 1.0] for all valid score-return pairs, and that IC is 1.0 when scores and returns are perfectly positively correlated.
 4. THE test suite SHALL include a property-based test for the canonical evidence key verifying that the key is deterministic (same inputs always produce the same key) and that normalization is idempotent (normalizing an already-normalized input produces the same key).
 5. THE test suite SHALL include a property-based test for source reliability Bayesian shrinkage verifying that reliability is always in [0.0, 1.0], that reliability approaches 0.5 as sample count approaches 0, and that reliability approaches the observed win rate as sample count approaches infinity.
 6. THE test suite SHALL include a property-based test for the quality gate verifying that the gate result is deterministic for the same metric inputs, and that relaxing any single threshold (making it easier to pass) never causes a previously passing gate to fail (monotonicity property).
 7. THE test suite SHALL include a property-based test for contribution score computation verifying that all contribution scores for a single prediction sum to 1.0 (within floating-point tolerance) and that each individual score is in [0.0, 1.0].
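As an illustration of criterion 5, the shrinkage properties can be checked with a randomized loop. This is a minimal sketch using only the standard library rather than a property-testing framework; `check_shrinkage_properties` is a hypothetical helper, and the reliability formula is restated from the calibration task in the implementation plan.

```python
import random


def compute_source_reliability(observed_win_rate, sample_count, prior_strength=30.0):
    # Bayesian shrinkage toward 0.5, restated from the implementation plan.
    if sample_count <= 0:
        return 0.5
    shrink = sample_count / (sample_count + prior_strength)
    return 0.5 + shrink * (observed_win_rate - 0.5)


def check_shrinkage_properties(trials=2000, seed=0):
    # Hypothetical helper: samples random inputs and asserts the three properties.
    rng = random.Random(seed)
    for _ in range(trials):
        wr = rng.random()
        n_small = rng.randint(0, 500)
        n_large = n_small + rng.randint(1, 5000)
        r_small = compute_source_reliability(wr, n_small)
        r_large = compute_source_reliability(wr, n_large)
        # Bounds: reliability stays inside [0.0, 1.0].
        assert 0.0 <= r_small <= 1.0 and 0.0 <= r_large <= 1.0
        # No data: reliability equals the neutral prior.
        assert compute_source_reliability(wr, 0) == 0.5
        # Convergence: more samples never move reliability away from the win rate.
        assert abs(r_large - wr) <= abs(r_small - wr) + 1e-12
        assert abs(compute_source_reliability(wr, 10**9) - wr) < 1e-6
    return trials
```

A fixed seed keeps the check reproducible; a property-testing library would additionally shrink failing inputs to a minimal counterexample.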
@@ -0,0 +1,260 @@
 # Implementation Plan: Model Validation, Calibration, and Signal Quality
 ## Overview
 Add a closed-loop model validation layer to Stonks Oracle: prediction snapshot capture, outcome evaluation, calibration/IC metrics, source/catalyst/layer attribution, Bayesian source reliability, a quality gate for live trading, 7 new API endpoints, an upgraded OpsModel dashboard, and backtest replay integration. Implementation follows the four-phase priority order from the spec, with each phase building on the previous one.
 ## Tasks
 - [x] 1. Database migration 035 — schema foundation
  - [x] 1.1 Create `infra/migrations/035_model_validation.sql` with all tables, indexes, and views
    - Create `prediction_snapshots` table with all columns from design (id UUID PK, generated_at, ticker, window, horizon, direction, action, mode, strength, confidence, contradiction, p_bull, p_bear, score_company, score_macro, score_competitive, evidence_count, unique_source_count, duplicate_evidence_count, price_at_prediction, spy_price_at_prediction, sector_etf_price_at_prediction, metadata JSONB, created_at)
    - Create `prediction_outcomes` table with FK to prediction_snapshots (id UUID PK, prediction_id, evaluated_at, horizon, future_price, future_return, spy_future_price, spy_return, sector_etf_future_price, sector_etf_return, excess_return_vs_spy, excess_return_vs_sector, direction_correct, profitable, metadata JSONB, created_at)
    - Create `signal_evidence_links` table with FK to prediction_snapshots (id UUID PK, prediction_id, document_id, signal_id, ticker, source, source_type, catalyst_type, sentiment, impact, extraction_confidence, weight, is_duplicate, canonical_evidence_key, contribution_score, metadata JSONB, created_at)
    - Create `model_metric_snapshots` table (id UUID PK, generated_at, lookback_window, horizon, prediction_count, win_rate, directional_accuracy, information_coefficient, rank_information_coefficient, avg_return, avg_excess_return_vs_spy, avg_excess_return_vs_sector, calibration_error, brier_score, buy_win_rate, sell_win_rate, hold_win_rate, metadata JSONB, created_at)
    - Create indexes on prediction_snapshots (ticker, generated_at, horizon), prediction_outcomes (prediction_id, horizon, evaluated_at), signal_evidence_links (prediction_id, document_id, ticker), model_metric_snapshots (generated_at, lookback_window, horizon)
    - Create `v_prediction_performance` view joining prediction_snapshots with prediction_outcomes
    - Create `v_source_performance` view joining signal_evidence_links with prediction_snapshots and prediction_outcomes
    - _Requirements: 16.1, 16.2, 16.3, 16.4, 16.5, 16.6, 14.1, 14.2, 14.3, 14.4_
 - [x] 2. Phase 1 — Prediction capture, outcome evaluation, core metrics, and dashboard API
  - [x] 2.1 Implement Prediction Snapshot Writer (`services/validation/prediction_snapshot.py`)
    - Create `services/validation/__init__.py`
    - Define `SECTOR_ETF_MAP`, `EVALUATION_HORIZONS`, `MAX_SINGLE_DOCUMENT_WEIGHT` constants
    - Implement `PredictionSnapshot` and `SignalEvidenceLink` dataclasses
    - Implement `compute_canonical_evidence_key(title, url)` — SHA256 of normalized title + normalized URL (lowercase, strip whitespace for title; lowercase, strip query params for URL)
    - Implement `fetch_latest_close_price(pool, ticker)` — query most recent close from market_snapshots
    - Implement `create_prediction_snapshot(pool, recommendation, trend_summary, evidence_signals, evidence_docs)` — fetch prices (ticker, SPY, sector ETF), compute canonical keys, detect duplicates, clamp weights to MAX_SINGLE_DOCUMENT_WEIGHT, compute contribution scores (one-vote-per-canonical-key), persist snapshot + evidence links in a transaction
    - Implement `compute_contribution_scores(weights)` — each score = weight_i / sum(weights), sums to 1.0
    - Handle NULL prices gracefully (log warning, store NULL, don't fail)
    - _Requirements: 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 3.1, 3.2, 3.3, 3.4_
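The key, duplicate, and contribution-score steps in task 2.1 can be sketched as pure functions. This is only an illustration under stated assumptions: the `|` separator between title and URL, the helper names `normalize_title`, `normalize_url`, and `flag_duplicates`, and the equal-share fallback for zero total weight are not taken from the plan.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit


def normalize_title(title: str) -> str:
    # Lowercase and strip surrounding whitespace, as the task describes.
    return title.strip().lower()


def normalize_url(url: str) -> str:
    # Lowercase, then drop the query string so tracking params do not split keys.
    parts = urlsplit(url.strip().lower())
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", parts.fragment))


def compute_canonical_evidence_key(title: str, url: str) -> str:
    # Assumption: the two normalized parts are joined with "|" before hashing.
    payload = f"{normalize_title(title)}|{normalize_url(url)}"
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def flag_duplicates(keys):
    # The first occurrence of a key is the original; later ones are duplicates.
    seen = set()
    flags = []
    for key in keys:
        flags.append(key in seen)
        seen.add(key)
    return flags


def compute_contribution_scores(weights):
    if not weights:
        return []
    total = sum(weights)
    if total <= 0:
        # Assumption: fall back to equal shares when the weights carry no mass.
        return [1.0 / len(weights)] * len(weights)
    return [w / total for w in weights]
```

Because both normalizers only lowercase, strip, and drop the query string, applying them to already-normalized input leaves it unchanged, which is the idempotence property that task 2.2 tests.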
  - [x] 2.2 Write property test for canonical evidence key determinism and idempotence
    - **Property 4: Canonical Evidence Key Determinism and Normalization Idempotence**
    - Test that same (title, url) always produces same key
    - Test that normalizing already-normalized input produces same key
    - **Validates: Requirements 2.3, 17.4**
  - [x] 2.3 Write property test for contribution score sum-to-one and range
    - **Property 7: Contribution Score Sum-to-One and Range**
    - Test that all scores in [0.0, 1.0] and sum to 1.0 (within 1e-9 tolerance)
    - Test that empty input returns empty list
    - **Validates: Requirements 2.5, 17.7**
  - [x] 2.4 Implement Outcome Evaluator (`services/validation/outcome_evaluator.py`)
    - Define `PredictionOutcome` dataclass and `HORIZON_DURATIONS` mapping
    - Implement `evaluate_matured_predictions(pool)` — find snapshots where horizon elapsed and outcome not recorded, evaluate each
    - Implement `evaluate_single_prediction(pool, snapshot, horizon)` — fetch future price at horizon endpoint, compute future_return, SPY return, sector ETF return, excess returns, direction_correct, profitable; return None if future price unavailable
    - Evaluate across all 5 horizons: 1h, 6h, 1d, 7d, 30d
    - Skip horizons where future price is unavailable (retry next run)
    - Store results in prediction_outcomes table
    - _Requirements: 4.1, 4.2, 4.3, 4.4, 4.5, 4.6, 4.7, 4.8, 4.9, 4.10_
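The arithmetic behind task 2.4 is small enough to sketch without a database. The function names below and the handling of directions or actions other than bullish/bearish and buy/sell (returning `None`) are assumptions for illustration; only the horizon list and the return definitions come from the plan.

```python
from datetime import datetime, timedelta, timezone

# The five evaluation horizons named in the plan.
HORIZON_DURATIONS = {
    "1h": timedelta(hours=1),
    "6h": timedelta(hours=6),
    "1d": timedelta(days=1),
    "7d": timedelta(days=7),
    "30d": timedelta(days=30),
}


def is_matured(generated_at, horizon, now):
    # A prediction can be scored once its horizon has fully elapsed.
    return now >= generated_at + HORIZON_DURATIONS[horizon]


def simple_return(start_price, end_price):
    # None signals "price unavailable", so the caller can retry on a later run.
    if start_price is None or end_price is None or start_price == 0:
        return None
    return (end_price - start_price) / start_price


def excess_return(asset_return, benchmark_return):
    if asset_return is None or benchmark_return is None:
        return None
    return asset_return - benchmark_return


def direction_correct(direction, future_return):
    if direction == "bullish":
        return future_return > 0
    if direction == "bearish":
        return future_return < 0
    return None  # Assumption: other directions are left unscored.


def profitable(action, future_return):
    if action == "buy":
        return future_return > 0
    if action == "sell":
        return future_return < 0
    return None  # Assumption: other actions are left unscored.
```

Returning `None` instead of raising keeps a missing SPY or sector ETF price from blocking the ticker's own outcome row.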
  - [x] 2.5 Implement Metrics Engine (`services/validation/metrics.py`)
    - Define `CONFIDENCE_BUCKETS`, `LOOKBACK_WINDOWS` constants
    - Define `CalibrationBucket` and `ModelMetricSnapshot` dataclasses
    - Implement `compute_calibration_error(confidences, outcomes)` — group into 5 confidence buckets, compute ECE as weighted average of |avg_conf - win_rate|, flag miscalibrated buckets (|diff| > 0.15)
    - Implement `compute_brier_score(p_bulls, outcomes)` — mean((p_bull - outcome)^2)
    - Implement `compute_information_coefficient(scores, returns)` — Pearson correlation, return None when < 30 data points
    - Implement `compute_rank_information_coefficient(scores, returns)` — Spearman rank correlation, return None when < 30 data points
    - Implement `compute_contribution_scores(weights)` — weight_i / sum(weights), sums to 1.0
    - Implement benchmark metrics: average excess return vs SPY, vs sector ETF, hit rate improvement
    - Implement `compute_and_store_metric_snapshots(pool)` — compute for all lookback/horizon combinations (4 lookbacks × 5 horizons), persist to model_metric_snapshots
    - _Requirements: 5.1, 5.2, 5.3, 5.4, 5.5, 5.6, 6.1, 6.2, 6.3, 6.4, 6.5, 9.1, 9.2, 9.3, 9.4, 10.1, 10.2, 10.3, 10.4, 10.5_
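A rough sketch of the ECE and Brier computations from task 2.5 follows. The plan does not state the bucket edges, so this assumes five equal-width buckets over [0, 1]; the dict row shape and the `None` result for empty input are also assumptions.

```python
def compute_calibration_error(confidences, outcomes, n_buckets=5, flag_threshold=0.15):
    if len(confidences) != len(outcomes):
        raise ValueError("confidences and outcomes must have the same length")
    if not confidences:
        return None, []  # Assumption: no data yields no ECE value.
    buckets = [[] for _ in range(n_buckets)]
    for conf, won in zip(confidences, outcomes):
        # Assumption: equal-width buckets; confidence 1.0 falls in the last one.
        idx = max(0, min(int(conf * n_buckets), n_buckets - 1))
        buckets[idx].append((conf, 1.0 if won else 0.0))
    total = len(confidences)
    ece = 0.0
    rows = []
    for idx, items in enumerate(buckets):
        if not items:
            continue
        avg_conf = sum(c for c, _ in items) / len(items)
        win_rate = sum(o for _, o in items) / len(items)
        gap = abs(avg_conf - win_rate)
        # ECE is the count-weighted average of the per-bucket gaps.
        ece += (len(items) / total) * gap
        rows.append({
            "bucket_low": idx / n_buckets,
            "bucket_high": (idx + 1) / n_buckets,
            "avg_confidence": avg_conf,
            "observed_win_rate": win_rate,
            "prediction_count": len(items),
            "miscalibrated": gap > flag_threshold,
        })
    return ece, rows


def compute_brier_score(p_bulls, outcomes):
    if not p_bulls:
        return None
    # Mean squared difference between predicted probability and 0/1 outcome.
    return sum((p - (1.0 if o else 0.0)) ** 2 for p, o in zip(p_bulls, outcomes)) / len(p_bulls)
```

Since each bucket gap lies in [0, 1] and the bucket weights sum to 1, the ECE stays in [0, 1], which is the range property that task 2.6 checks.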
  - [x] 2.6 Write property test for ECE range and round-trip
    - **Property 1: Calibration Error Range and Round-Trip**
    - Test ECE in [0.0, 1.0] for all valid distributions
    - Test ECE = 0.0 when every bucket's win rate matches avg confidence
    - **Validates: Requirements 5.1, 5.3, 17.1**
  - [x] 2.7 Write property test for Brier score range and perfect prediction
    - **Property 2: Brier Score Range and Perfect Prediction**
    - Test Brier in [0.0, 1.0] for all valid (p_bull, outcome) pairs
    - Test Brier = 0.0 when all predictions perfectly correct
    - **Validates: Requirements 5.4, 17.2**
  - [x] 2.8 Write property test for IC range and perfect correlation
    - **Property 3: Information Coefficient Range and Perfect Correlation**
    - Test IC in [-1.0, 1.0] for all valid (score, return) pairs with ≥30 elements
    - Test IC = 1.0 for perfectly positively correlated data
    - **Validates: Requirements 6.1, 6.2, 17.3**
  - [x] 2.9 Implement Dashboard API endpoints in `services/api/app.py`
    - Add `/api/validation/summary` GET — return latest model_metric_snapshot + gate status
    - Add `/api/validation/calibration` GET — return calibration table with buckets
    - Add `/api/validation/ic-by-horizon` GET — return IC and Rank IC per horizon
    - Add `/api/validation/gate-status` GET — return quality gate evaluation detail
    - All endpoints accept optional `lookback` (default "30d") and `horizon` (default "7d") query params
    - _Requirements: 12.1, 12.2, 12.3, 12.7_
  - [x] 2.10 Add frontend validation API hooks in `frontend/src/api/hooks.ts`
    - Add `useValidationSummary(lookback?, horizon?)` hook for `/api/validation/summary`
    - Add `useValidationCalibration(lookback?, horizon?)` hook for `/api/validation/calibration`
    - Add `useValidationICByHorizon(lookback?)` hook for `/api/validation/ic-by-horizon`
    - Add `useValidationGateStatus()` hook for `/api/validation/gate-status`
    - _Requirements: 12.1, 12.2, 12.3, 12.7_
  - [x] 2.11 Upgrade OpsModel page (`frontend/src/pages/OpsModel.tsx`) — Phase 1 dashboard
    - Add tabbed layout: existing "Extraction Performance" tab + new "Model Validation" tab
    - Add summary cards: prediction count, win rate, directional accuracy, IC, Rank IC, Brier score, ECE, avg excess return vs SPY, gate status
    - Add calibration table with confidence buckets, avg confidence, observed win rate, count, miscalibration flag
    - Highlight miscalibrated buckets (|avg_confidence - observed_win_rate| > 0.15) with warning indicator
    - Add IC-by-horizon table showing IC and Rank IC for each horizon
    - Add gate status indicator (pass/fail with threshold details)
    - _Requirements: 12.1, 12.2, 12.3, 12.7, 12.8, 12.9_
 - [x] 3. Checkpoint — Phase 1 verification
  - Ensure all tests pass; ask the user if questions arise.
 - [x] 4. Phase 2 — Attribution engine and source/catalyst truth tables
  - [x] 4.1 Implement Attribution Engine (`services/validation/attribution.py`)
    - Define `SourceAttribution`, `CatalystAttribution`, `LayerAttribution` dataclasses
    - Implement `compute_source_attribution(pool, lookback_days, horizon)` — join signal_evidence_links with prediction_outcomes, group by source; compute prediction count, avg weight, avg contribution score, win rate, avg future return, avg excess return vs SPY, IC, duplicate rate
    - Implement `compute_catalyst_attribution(pool, lookback_days, horizon)` — same metrics grouped by catalyst_type
    - Implement `compute_layer_attribution(pool, lookback_days, horizon)` — compute per-layer (company, macro, competitive) avg contribution %, dominant win rate (layer > 30% contribution), dominant IC
    - _Requirements: 7.1, 7.2, 7.3, 7.4, 7.5, 7.6, 7.7_
  - [x] 4.2 Implement Calibration Engine (`services/validation/calibration.py`)
    - Implement `compute_source_reliability(observed_win_rate, sample_count, prior_strength=30)` — Bayesian shrinkage: `0.5 + (n / (n + prior_strength)) * (observed_win_rate - 0.5)`, where `n` is `sample_count`; return 0.5 when n=0
    - Implement `compute_adjusted_evidence_weight(base_weight, reliability)` — `base_weight * (0.5 + reliability)`, clamped to [0.1, 2.0]
    - Implement `update_source_reliabilities(pool)` — recompute from latest outcomes, update source_accuracy table
    - _Requirements: 8.1, 8.2, 8.3, 8.4, 8.5_
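The two pure formulas in task 4.2 translate almost directly into code. This is a minimal sketch, not the shipped module; treating a non-positive `sample_count` like zero is an assumption.

```python
def compute_source_reliability(observed_win_rate, sample_count, prior_strength=30.0):
    # With no samples, fall back to the neutral prior of 0.5.
    if sample_count <= 0:
        return 0.5
    # The shrink factor grows from 0 toward 1 as evidence accumulates.
    shrink = sample_count / (sample_count + prior_strength)
    return 0.5 + shrink * (observed_win_rate - 0.5)


def compute_adjusted_evidence_weight(base_weight, reliability):
    # Reliability 0.5 leaves the weight unchanged; the result is clamped to [0.1, 2.0].
    return max(0.1, min(2.0, base_weight * (0.5 + reliability)))
```

With `prior_strength=30`, a source needs 30 evaluated predictions before its observed win rate counts for half of its reliability, so a source with a handful of lucky calls stays close to 0.5.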
  - [x] 4.3 Write property test for source reliability Bayesian shrinkage bounds and convergence
    - **Property 5: Source Reliability Bayesian Shrinkage Bounds and Convergence**
    - Test reliability in [0.0, 1.0] for all valid inputs
    - Test reliability = 0.5 when sample_count = 0
    - Test reliability approaches observed_win_rate as sample_count → ∞
    - **Validates: Requirements 8.1, 8.2, 17.5**
  - [x] 4.4 Add attribution API endpoints in `services/api/app.py`
    - Add `/api/validation/attribution/sources` GET — return per-source performance metrics
    - Add `/api/validation/attribution/catalysts` GET — return per-catalyst performance metrics
    - Add `/api/validation/attribution/layers` GET — return per-layer performance metrics
    - All endpoints accept optional `lookback` (default "30d") and `horizon` (default "7d") query params
    - _Requirements: 12.4, 12.5, 12.6_
  - [x] 4.5 Add frontend attribution hooks in `frontend/src/api/hooks.ts`
    - Add `useValidationAttributionSources(lookback?, horizon?)` hook
    - Add `useValidationAttributionCatalysts(lookback?, horizon?)` hook
    - Add `useValidationAttributionLayers(lookback?, horizon?)` hook
    - _Requirements: 12.4, 12.5, 12.6_
  - [x] 4.6 Extend OpsModel page with attribution tables
    - Add source performance table (source, win rate, IC, avg return, duplicate rate)
    - Add catalyst truth table (catalyst type, win rate, avg return, IC)
    - Add layer attribution table (company/macro/competitive contribution %, dominant win rate, IC)
    - _Requirements: 12.4, 12.5, 12.6, 12.8_
 - [x] 5. Checkpoint — Phase 2 verification
  - Ensure all tests pass; ask the user if questions arise.
 - [x] 6. Phase 3 — Quality gate, recommendation enhancements, and pipeline wiring
  - [x] 6.1 Implement Quality Gate (`services/trading/model_quality_gate.py`)
    - Define `QualityGateConfig` dataclass with default thresholds (min_prediction_count=100, min_ic=0.03, min_win_rate=0.53, max_ece=0.15, min_excess_return_vs_spy=0.0, max_snapshot_age_hours=24)
    - Define `GateThresholdResult` and `QualityGateResult` dataclasses
    - Implement `evaluate_quality_gate(pool, config)` — read most recent model_metric_snapshot (30d lookback, 7d horizon), evaluate each threshold, store result in risk_configs under 'model_quality_gate' key
    - Implement `load_gate_config_from_db(pool)` — load thresholds from risk_configs with defaults
    - Default to paper-only mode when no snapshots exist or snapshot is stale (>24h)
    - Log gate evaluation result with threshold pass/fail details
    - _Requirements: 11.1, 11.2, 11.3, 11.4, 11.5, 11.6, 11.7_
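The threshold logic in task 6.1 can be sketched as a pure function, separate from the database read. The default thresholds are the ones listed above; `check_gate`, the result fields, the `live`/`paper` mode strings, the failure labels, and the inclusive comparisons are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class QualityGateConfig:
    min_prediction_count: int = 100
    min_ic: float = 0.03
    min_win_rate: float = 0.53
    max_ece: float = 0.15
    min_excess_return_vs_spy: float = 0.0
    max_snapshot_age_hours: float = 24.0


@dataclass
class QualityGateResult:
    passed: bool
    mode: str  # Assumed labels: "live" when the gate passes, otherwise "paper".
    failures: list = field(default_factory=list)


def check_gate(snapshot: Optional[dict], age_hours: float,
               config: QualityGateConfig = QualityGateConfig()) -> QualityGateResult:
    # Fail safe: a missing or stale snapshot means paper-only.
    if snapshot is None:
        return QualityGateResult(False, "paper", ["no_snapshot"])
    if age_hours > config.max_snapshot_age_hours:
        return QualityGateResult(False, "paper", ["stale_snapshot"])
    checks = {
        "min_prediction_count": (snapshot.get("prediction_count"),
                                 lambda v: v >= config.min_prediction_count),
        "min_ic": (snapshot.get("information_coefficient"), lambda v: v >= config.min_ic),
        "min_win_rate": (snapshot.get("win_rate"), lambda v: v >= config.min_win_rate),
        "max_ece": (snapshot.get("calibration_error"), lambda v: v <= config.max_ece),
        "min_excess_return_vs_spy": (snapshot.get("avg_excess_return_vs_spy"),
                                     lambda v: v >= config.min_excess_return_vs_spy),
    }
    # A missing metric counts as a failed threshold.
    failures = [name for name, (value, ok) in checks.items() if value is None or not ok(value)]
    return QualityGateResult(not failures, "live" if not failures else "paper", failures)
```

Each check is a single comparison against one threshold, so loosening a threshold can only turn a failure into a pass. That is the monotonicity property task 6.2 tests.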
  - [x] 6.2 Write property test for quality gate determinism and threshold monotonicity
    - **Property 6: Quality Gate Determinism and Threshold Monotonicity**
    - Test same inputs always produce same pass/fail result
    - Test relaxing any threshold never causes a previously passing gate to fail
    - **Validates: Requirements 11.1, 17.6**
  - [x] 6.3 Wire Quality Gate into aggregation cycle (`services/aggregation/worker.py`)
    - Call `evaluate_quality_gate` at the start of each aggregation cycle
    - When gate fails, force all recommendations to paper mode
    - Log gate status at cycle start
    - _Requirements: 11.2, 11.3_
  - [x] 6.4 Wire Prediction Snapshot Writer into recommendation engine
    - After recommendation is generated in `services/recommendation/eligibility.py` or the calling code, call `create_prediction_snapshot` to capture the prediction state
    - Pass recommendation, trend_summary, evidence signals, and evidence docs
    - Handle snapshot creation failure gracefully (log error, don't block recommendation)
    - _Requirements: 1.1, 1.6_
  - [x] 6.5 Enhance recommendation display on frontend
    - Update `frontend/src/pages/RecommendationDetail` (or relevant recommendation display component) to show:
      - Original confidence alongside calibrated confidence (historical win rate for that bucket)
      - Historical win rate for similar confidence levels
      - Evidence count, unique evidence count, duplicate evidence count
      - Source reliability indicator for primary contributing sources
      - Live eligibility status with reason (gate passed or which threshold failed)
    - Add warning badge when duplicate evidence count > 20% of total evidence count
    - Add warning badge when primary source reliability < 0.4
    - _Requirements: 13.1, 13.2, 13.3, 13.4, 13.5, 13.6, 13.7_
 - [x] 7. Checkpoint — Phase 3 verification
  - Ensure all tests pass; ask the user if questions arise.
 - [x] 8. Phase 4 — Backtest replay integration and unit tests
  - [x] 8.1 Add validation mode to BacktestReplay (`services/trading/backtest_replay.py`)
    - Add `validation_mode: bool = False` parameter to `BacktestReplay.run()`
    - When validation_mode=True, create prediction snapshots for each historical recommendation using only data available at that point in time
    - Evaluate prediction outcomes using market prices from the appropriate future horizon
    - Prevent future data leakage: no market data after prediction generation time used during snapshot creation
    - After backtest completes, trigger model metrics computation over the backtest period, tag snapshots with backtest_id
    - _Requirements: 15.1, 15.2, 15.3, 15.4, 15.5_
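One way to enforce the no-leakage rule in task 8.1 is an as-of lookup over time-sorted bars. The sketch below assumes bars are `(timestamp, close)` tuples sorted ascending; `price_as_of` and `price_at_horizon` are hypothetical helpers, and taking the first bar at or after the horizon endpoint is an assumption about how "future price" is resolved.

```python
from bisect import bisect_left, bisect_right
from datetime import datetime, timedelta, timezone


def price_as_of(bars, as_of):
    # Latest close with timestamp <= as_of; never looks past the prediction time.
    times = [t for t, _ in bars]
    idx = bisect_right(times, as_of)
    if idx == 0:
        return None
    return bars[idx - 1][1]


def price_at_horizon(bars, generated_at, horizon):
    # First close at or after the horizon endpoint; None means "not matured yet".
    times = [t for t, _ in bars]
    idx = bisect_left(times, generated_at + horizon)
    if idx == len(bars):
        return None
    return bars[idx][1]
```

Keeping the snapshot-side lookup (`price_as_of`) separate from the outcome-side lookup (`price_at_horizon`) makes it easier to verify that only outcome evaluation ever reads data after the generation time.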
  - [x] 8.2 Write unit tests for prediction snapshot writer (`tests/test_model_validation_unit.py`)
    - Test canonical evidence key: known title/URL → expected SHA256, empty inputs, unicode
    - Test duplicate detection: 3 docs with 2 sharing a key → 1 marked duplicate
    - Test contribution scores: [0.5, 0.3, 0.2] → [0.5, 0.3, 0.2], single doc → [1.0]
    - Test weight clamping: weight 1.5 → clamped to 1.0
    - _Requirements: 1.1, 2.3, 2.4, 2.5, 3.3_
  - [x] 8.3 Write unit tests for outcome evaluator (`tests/test_model_validation_unit.py`)
    - Test future return computation: price 100→110 → 0.10, price 100→90 → -0.10
    - Test direction_correct logic: bullish+positive → true, bullish+negative → false
    - Test profitable logic: buy+positive → true, sell+negative → true
    - Test excess return: ticker 10%, SPY 5% → excess 5%
    - _Requirements: 4.2, 4.5, 4.6, 4.7_
  - [x] 8.4 Write unit tests for metrics engine (`tests/test_model_validation_unit.py`)
    - Test ECE specific values: perfect calibration → 0.0, all overconfident → positive ECE
    - Test Brier score: all correct at p=1.0 → 0.0, all wrong at p=1.0 → 1.0
    - Test IC: perfect correlation → 1.0, anti-correlation → -1.0, < 30 → None
    - _Requirements: 5.3, 5.4, 6.1, 6.2, 6.5_
  - [x] 8.5 Write unit tests for calibration engine (`tests/test_model_validation_unit.py`)
    - Test source reliability: n=0 → 0.5, n=1000 with wr=0.8 → ≈0.8, n=30 with wr=0.7 → 0.6
    - Test adjusted evidence weight: reliability=0.5 → base*1.0, clamping to [0.1, 2.0]
    - _Requirements: 8.1, 8.2, 8.3_
  - [x] 8.6 Write unit tests for quality gate (`tests/test_model_validation_unit.py`)
    - Test all thresholds met → pass
    - Test one threshold failed → fail with reason
    - Test fail-safe: no snapshots → paper-only, stale snapshot → paper-only
    - _Requirements: 11.1, 11.6_
  - [x] 8.7 Write frontend tests for validation dashboard (`frontend/src/test/pages.test.tsx`)
    - Add MSW mock handlers for `/api/validation/summary`, `/api/validation/calibration`, `/api/validation/gate-status`
    - Test OpsModel page renders validation tab with summary cards
    - Test calibration table renders buckets with miscalibration warning
    - Test gate status indicator renders pass/fail
    - _Requirements: 12.8, 12.9_
 - [x] 9. Final checkpoint — Ensure all tests pass
  - Ensure all tests pass; ask the user if questions arise.
 ## Notes
 - Tasks marked with `*` are optional and can be skipped for a faster MVP (no task in this plan currently carries the marker)
 - Each task references specific requirements for traceability
 - Checkpoints ensure incremental validation after each phase
 - Property tests validate the 7 universal correctness properties from the design document
 - Unit tests validate specific examples, edge cases, and integration points
 - The design uses Python for backend and TypeScript for frontend — no language selection needed
 - Migration number is 035 (existing migrations go up to 034)
 - All new service modules go under `services/validation/` except the quality gate which goes in `services/trading/`
 - The 7 new API endpoints are added to the existing `services/api/app.py`
 - Frontend hooks follow existing patterns in `frontend/src/api/hooks.ts`
 - Phase 1 delivers the core feedback loop (capture → evaluate → measure → display)
 - Phase 2 adds attribution depth (which sources/catalysts/layers work best)
 - Phase 3 adds safety (quality gate) and UX (recommendation warnings)
 - Phase 4 adds historical analysis (backtest validation mode) and comprehensive tests
@@ -885,3 +885,169 @@ export function useToggleMacro() {
    onSuccess: () => qc.invalidateQueries({ queryKey: ['macro-status'] }),
  });
 }
 // ---------------------------------------------------------------------------
 // Validation: Model Quality & Calibration (Requirements 12.1, 12.2, 12.3, 12.7)
 // ---------------------------------------------------------------------------
 export interface ModelMetricSnapshot {
  id: string;
  generated_at: string;
  lookback_window: string;
  horizon: string;
  prediction_count: number;
  win_rate: number | null;
  directional_accuracy: number | null;
  information_coefficient: number | null;
  rank_information_coefficient: number | null;
  avg_return: number | null;
  avg_excess_return_vs_spy: number | null;
  avg_excess_return_vs_sector: number | null;
  calibration_error: number | null;
  brier_score: number | null;
  buy_win_rate: number | null;
  sell_win_rate: number | null;
  hold_win_rate: number | null;
  metadata: Record<string, unknown> | null;
 }
 export interface ValidationSummary {
  snapshot: ModelMetricSnapshot | null;
  gate_status: Record<string, unknown> | null;
 }
 export interface CalibrationBucket {
  bucket_low: number;
  bucket_high: number;
  avg_confidence: number;
  observed_win_rate: number;
  prediction_count: number;
  miscalibrated: boolean;
 }
 export interface ValidationCalibration {
  buckets: CalibrationBucket[];
  lookback: string;
  horizon: string;
 }
 export interface ICByHorizonEntry {
  horizon: string;
  information_coefficient: number | null;
  rank_information_coefficient: number | null;
  prediction_count: number;
  generated_at: string | null;
 }
 export interface ValidationICByHorizon {
  horizons: ICByHorizonEntry[];
  lookback: string;
 }
 export interface ValidationGateStatus {
  gate_status: Record<string, unknown> | null;
  updated_at?: string | null;
  message?: string;
 }
 export function useValidationSummary(lookback = '30d', horizon = '7d') {
  const qs = new URLSearchParams();
  if (lookback) qs.set('lookback', lookback);
  if (horizon) qs.set('horizon', horizon);
  const path = `/api/validation/summary${qs.toString() ? '?' + qs : ''}`;
  return useGet<ValidationSummary>(['validation-summary', lookback, horizon], 'query', path);
 }
 export function useValidationCalibration(lookback = '30d', horizon = '7d') {
  const qs = new URLSearchParams();
  if (lookback) qs.set('lookback', lookback);
  if (horizon) qs.set('horizon', horizon);
  const path = `/api/validation/calibration${qs.toString() ? '?' + qs : ''}`;
  return useGet<ValidationCalibration>(['validation-calibration', lookback, horizon], 'query', path);
 }
 export function useValidationICByHorizon(lookback = '30d') {
  const qs = new URLSearchParams();
  if (lookback) qs.set('lookback', lookback);
  const path = `/api/validation/ic-by-horizon${qs.toString() ? '?' + qs : ''}`;
  return useGet<ValidationICByHorizon>(['validation-ic-by-horizon', lookback], 'query', path);
 }
 export function useValidationGateStatus() {
  return useGet<ValidationGateStatus>(['validation-gate-status'], 'query', '/api/validation/gate-status');
 }
 // ---------------------------------------------------------------------------
 // Validation: Attribution — Sources, Catalysts, Layers (Requirements 12.4, 12.5, 12.6)
 // ---------------------------------------------------------------------------
 export interface SourceAttribution {
  source: string;
  source_type: string;
  prediction_count: number;
  avg_weight: number;
  avg_contribution_score: number;
  win_rate: number;
  avg_future_return: number;
  avg_excess_return_vs_spy: number;
  information_coefficient: number | null;
  duplicate_rate: number;
 }
 export interface SourceAttributionResponse {
  sources: SourceAttribution[];
  lookback: string;
  horizon: string;
 }
 export interface CatalystAttribution {
  catalyst_type: string;
  prediction_count: number;
  win_rate: number;
  avg_future_return: number;
  avg_excess_return_vs_spy: number;
  information_coefficient: number | null;
 }
 export interface CatalystAttributionResponse {
  catalysts: CatalystAttribution[];
  lookback: string;
  horizon: string;
 }
 export interface LayerAttribution {
  layer: string;
  avg_contribution_pct: number;
  dominant_win_rate: number;
  dominant_ic: number | null;
 }
 export interface LayerAttributionResponse {
  layers: LayerAttribution[];
  lookback: string;
  horizon: string;
 }
 export function useValidationAttributionSources(lookback = '30d', horizon = '7d') {
  const qs = new URLSearchParams();
  if (lookback) qs.set('lookback', lookback);
  if (horizon) qs.set('horizon', horizon);
  const path = `/api/validation/attribution/sources${qs.toString() ? '?' + qs : ''}`;
  return useGet<SourceAttributionResponse>(['validation-attribution-sources', lookback, horizon], 'query', path);
 }
 export function useValidationAttributionCatalysts(lookback = '30d', horizon = '7d') {
  const qs = new URLSearchParams();
  if (lookback) qs.set('lookback', lookback);
  if (horizon) qs.set('horizon', horizon);
  const path = `/api/validation/attribution/catalysts${qs.toString() ? '?' + qs : ''}`;
  return useGet<CatalystAttributionResponse>(['validation-attribution-catalysts', lookback, horizon], 'query', path);
 }
 export function useValidationAttributionLayers(lookback = '30d', horizon = '7d') {
  const qs = new URLSearchParams();
  if (lookback) qs.set('lookback', lookback);
  if (horizon) qs.set('horizon', horizon);
  const path = `/api/validation/attribution/layers${qs.toString() ? '?' + qs : ''}`;
  return useGet<LayerAttributionResponse>(['validation-attribution-layers', lookback, horizon], 'query', path);
 }
@@ -1,9 +1,89 @@
 import { useState } from 'react';
-import { useModelPerformance, useModelFailures } from '../api/hooks';
+import {
  useModelPerformance,
  useModelFailures,
  useValidationSummary,
  useValidationCalibration,
  useValidationICByHorizon,
  useValidationGateStatus,
  useValidationAttributionSources,
  useValidationAttributionCatalysts,
  useValidationAttributionLayers,
 } from '../api/hooks';
 import type {
  ValidationSummary,
  ValidationCalibration,
  CalibrationBucket,
  ValidationICByHorizon,
  ICByHorizonEntry,
  ValidationGateStatus,
  SourceAttributionResponse,
  CatalystAttributionResponse,
  LayerAttributionResponse,
  SourceAttribution,
  CatalystAttribution,
  LayerAttribution,
 } from '../api/hooks';
 import { LoadingSpinner, DateRangeSelector, StatusBadge, Card } from '../components/ui';
 import { AlertTriangle, ShieldCheck, ShieldX } from 'lucide-react';
 type Tab = 'extraction' | 'validation';
 export function OpsModelPage() {
  const [hours, setHours] = useState(24);
  const [activeTab, setActiveTab] = useState<Tab>('extraction');
  return (
    <div className="space-y-6">
      <div className="flex items-center justify-between">
        <h1 className="text-xl font-semibold text-gray-100">Model Performance</h1>
        {activeTab === 'extraction' && (
          <DateRangeSelector value={hours} onChange={setHours} />
        )}
      </div>
      {/* Tab bar */}
      <div className="flex border-b border-surface-700" role="tablist" aria-label="Model performance tabs">
        <button
          role="tab"
          aria-selected={activeTab === 'extraction'}
          onClick={() => setActiveTab('extraction')}
          className={`px-4 py-2 text-sm font-medium transition-colors ${
            activeTab === 'extraction'
              ? 'border-b-2 border-brand-500 text-brand-400'
              : 'text-gray-400 hover:text-gray-200'
          }`}
        >
          Extraction Performance
        </button>
        <button
          role="tab"
          aria-selected={activeTab === 'validation'}
          onClick={() => setActiveTab('validation')}
          className={`px-4 py-2 text-sm font-medium transition-colors ${
            activeTab === 'validation'
              ? 'border-b-2 border-brand-500 text-brand-400'
              : 'text-gray-400 hover:text-gray-200'
          }`}
        >
          Model Validation
        </button>
      </div>
      {activeTab === 'extraction' ? (
        <ExtractionTab hours={hours} />
      ) : (
        <ValidationTab />
      )}
    </div>
  );
 }
 /* ------------------------------------------------------------------ */
 /* Extraction Performance Tab (existing content)                       */
 /* ------------------------------------------------------------------ */
 function ExtractionTab({ hours }: { hours: number }) {
  const { data: perf, isLoading } = useModelPerformance(hours);
  const { data: failures } = useModelFailures(hours);
@@ -13,11 +93,6 @@ export function OpsModelPage() {
  return (
    <div className="space-y-6">
      <div className="flex items-center justify-between">
        <h1 className="text-xl font-semibold text-gray-100">Model Performance</h1>
        <DateRangeSelector value={hours} onChange={setHours} />
      </div>
      {/* Key metrics */}
      <div className="grid grid-cols-2 gap-3 sm:grid-cols-5">
        <StatCard label="Total Extractions" value={String(p.total_extractions ?? '—')} />
@@ -63,6 +138,482 @@ export function OpsModelPage() {
  );
 }
 /* ------------------------------------------------------------------ */
 /* Model Validation Tab (new)                                          */
 /* ------------------------------------------------------------------ */
 function ValidationTab() {
  const { data: summary, isLoading: summaryLoading, error: summaryError } = useValidationSummary();
  const { data: calibration, isLoading: calLoading, error: calError } = useValidationCalibration();
  const { data: icData, isLoading: icLoading, error: icError } = useValidationICByHorizon();
  const { data: gateData, isLoading: gateLoading, error: gateError } = useValidationGateStatus();
  const { data: sourcesData, isLoading: srcLoading, error: srcError } = useValidationAttributionSources();
  const { data: catalystsData, isLoading: catLoading, error: catError } = useValidationAttributionCatalysts();
  const { data: layersData, isLoading: layLoading, error: layError } = useValidationAttributionLayers();
  return (
    <div className="space-y-6">
      {/* Gate Status */}
      <GateStatusSection data={gateData} isLoading={gateLoading} error={gateError} />
      {/* Summary Cards */}
      <SummaryCardsSection data={summary} isLoading={summaryLoading} error={summaryError} />
      {/* Calibration Table */}
      <CalibrationTableSection data={calibration} isLoading={calLoading} error={calError} />
      {/* IC by Horizon Table */}
      <ICByHorizonSection data={icData} isLoading={icLoading} error={icError} />
      {/* Source Attribution Table */}
      <SourceAttributionSection data={sourcesData} isLoading={srcLoading} error={srcError} />
      {/* Catalyst Attribution Table */}
      <CatalystAttributionSection data={catalystsData} isLoading={catLoading} error={catError} />
      {/* Layer Attribution Table */}
      <LayerAttributionSection data={layersData} isLoading={layLoading} error={layError} />
    </div>
  );
 }
 /* ------------------------------------------------------------------ */
 /* Gate Status Section                                                 */
 /* ------------------------------------------------------------------ */
 function GateStatusSection({ data, isLoading, error }: {
  data: ValidationGateStatus | undefined;
  isLoading: boolean;
  error: Error | null;
 }) {
  if (isLoading) return <LoadingSpinner />;
  if (error) return <ErrorCard message="Failed to load gate status" />;
  const gate = data?.gate_status as Record<string, unknown> | null;
  if (!gate) {
    return (
      <Card className="flex items-center gap-3">
        <ShieldX size={20} className="text-yellow-400" />
        <div>
          <div className="text-sm font-medium text-yellow-400">Gate Status Unknown</div>
          <div className="text-xs text-gray-500">No gate evaluation data available</div>
        </div>
      </Card>
    );
  }
  const passed = gate.passed as boolean | undefined;
  const reason = gate.reason as string | undefined;
  const thresholds = gate.threshold_results as Array<Record<string, unknown>> | undefined;
  return (
    <Card>
      <div className="mb-3 flex items-center gap-3">
        {passed ? (
          <ShieldCheck size={20} className="text-green-400" />
        ) : (
          <ShieldX size={20} className="text-red-400" />
        )}
        <div>
          <div className={`text-sm font-medium ${passed ? 'text-green-400' : 'text-red-400'}`}>
            Live Trading Gate: {passed ? 'PASS' : 'FAIL'}
          </div>
          {reason && <div className="text-xs text-gray-500">{reason}</div>}
        </div>
      </div>
      {thresholds && thresholds.length > 0 && (
        <div className="overflow-x-auto">
          <table className="w-full text-left text-xs">
            <thead>
              <tr className="border-b border-surface-700 text-gray-500">
                <th className="pb-2 pr-4 font-medium">Threshold</th>
                <th className="pb-2 pr-4 font-medium">Required</th>
                <th className="pb-2 pr-4 font-medium">Actual</th>
                <th className="pb-2 font-medium">Status</th>
              </tr>
            </thead>
            <tbody>
              {thresholds.map((t, i) => (
                <tr key={i} className="border-b border-surface-800">
                  <td className="py-1.5 pr-4 text-gray-300">{String(t.name ?? '')}</td>
                  <td className="py-1.5 pr-4 font-mono text-gray-400">{fmtThreshold(t.threshold)}</td>
                  <td className="py-1.5 pr-4 font-mono text-gray-300">{fmtThreshold(t.actual)}</td>
                  <td className="py-1.5">
                    <StatusBadge status={t.passed ? 'success' : 'failed'} />
                  </td>
                </tr>
              ))}
            </tbody>
          </table>
        </div>
      )}
    </Card>
  );
 }
 /* ------------------------------------------------------------------ */
 /* Summary Cards Section                                               */
 /* ------------------------------------------------------------------ */
 function SummaryCardsSection({ data, isLoading, error }: {
  data: ValidationSummary | undefined;
  isLoading: boolean;
  error: Error | null;
 }) {
  if (isLoading) return <LoadingSpinner />;
  if (error) return <ErrorCard message="Failed to load validation summary" />;
  const snap = data?.snapshot;
  if (!snap) {
    return (
      <Card>
        <p className="text-sm text-gray-500">No validation data available yet. Metrics will appear once predictions have been evaluated.</p>
      </Card>
    );
  }
  return (
    <div className="grid grid-cols-2 gap-3 sm:grid-cols-3 lg:grid-cols-5">
      <StatCard label="Predictions" value={String(snap.prediction_count ?? '—')} />
      <StatCard
        label="Win Rate"
        value={fmtPct(snap.win_rate)}
        color={colorForRate(snap.win_rate, 0.53)}
      />
      <StatCard
        label="Directional Accuracy"
        value={fmtPct(snap.directional_accuracy)}
        color={colorForRate(snap.directional_accuracy, 0.53)}
      />
      <StatCard
        label="IC"
        value={fmtIC(snap.information_coefficient)}
        color={colorForIC(snap.information_coefficient)}
      />
      <StatCard
        label="Rank IC"
        value={fmtIC(snap.rank_information_coefficient)}
        color={colorForIC(snap.rank_information_coefficient)}
      />
      <StatCard
        label="Brier Score"
        value={snap.brier_score != null ? snap.brier_score.toFixed(4) : '—'}
        color={snap.brier_score != null && snap.brier_score < 0.25 ? 'text-green-400' : 'text-gray-100'}
      />
      <StatCard
        label="ECE"
        value={snap.calibration_error != null ? snap.calibration_error.toFixed(4) : '—'}
        color={snap.calibration_error == null ? 'text-gray-100' : snap.calibration_error < 0.15 ? 'text-green-400' : 'text-yellow-400'}
      />
      <StatCard
        label="Excess vs SPY"
        value={fmtPct(snap.avg_excess_return_vs_spy)}
        color={snap.avg_excess_return_vs_spy == null ? 'text-gray-100' : snap.avg_excess_return_vs_spy > 0 ? 'text-green-400' : 'text-red-400'}
      />
    </div>
  );
 }
 /* ------------------------------------------------------------------ */
 /* Calibration Table Section                                           */
 /* ------------------------------------------------------------------ */
 function CalibrationTableSection({ data, isLoading, error }: {
  data: ValidationCalibration | undefined;
  isLoading: boolean;
  error: Error | null;
 }) {
  if (isLoading) return <LoadingSpinner />;
  if (error) return <ErrorCard message="Failed to load calibration data" />;
  const buckets = data?.buckets;
  if (!buckets || buckets.length === 0) {
    return (
      <Card>
        <h2 className="mb-2 text-sm font-medium text-gray-400">Calibration</h2>
        <p className="text-sm text-gray-500">No calibration data available</p>
      </Card>
    );
  }
  return (
    <Card>
      <h2 className="mb-3 text-sm font-medium text-gray-400">Calibration by Confidence Bucket</h2>
      <div className="overflow-x-auto">
        <table className="w-full text-left text-xs">
          <thead>
            <tr className="border-b border-surface-700 text-gray-500">
              <th className="pb-2 pr-4 font-medium">Bucket</th>
              <th className="pb-2 pr-4 font-medium">Avg Confidence</th>
              <th className="pb-2 pr-4 font-medium">Observed Win Rate</th>
              <th className="pb-2 pr-4 font-medium">Count</th>
              <th className="pb-2 font-medium">Status</th>
            </tr>
          </thead>
          <tbody>
            {buckets.map((b: CalibrationBucket, i: number) => (
              <CalibrationRow key={i} bucket={b} />
            ))}
          </tbody>
        </table>
      </div>
    </Card>
  );
 }
 function CalibrationRow({ bucket }: { bucket: CalibrationBucket }) {
  const isMiscalibrated = bucket.miscalibrated ||
    Math.abs(bucket.avg_confidence - bucket.observed_win_rate) > 0.15;
  return (
    <tr className={`border-b border-surface-800 ${isMiscalibrated ? 'bg-amber-900/20' : ''}`}>
      <td className="py-1.5 pr-4 font-mono text-gray-300">
        [{fmtPctShort(bucket.bucket_low)}, {fmtPctShort(bucket.bucket_high)})
      </td>
      <td className="py-1.5 pr-4 font-mono text-gray-300">{fmtPctShort(bucket.avg_confidence)}</td>
      <td className="py-1.5 pr-4 font-mono text-gray-300">{fmtPctShort(bucket.observed_win_rate)}</td>
      <td className="py-1.5 pr-4 font-mono text-gray-400">{bucket.prediction_count}</td>
      <td className="py-1.5">
        {isMiscalibrated ? (
          <span className="inline-flex items-center gap-1 text-amber-400">
            <AlertTriangle size={14} />
            <span>Miscalibrated</span>
          </span>
        ) : (
          <span className="text-green-400">OK</span>
        )}
      </td>
    </tr>
  );
 }
 /* ------------------------------------------------------------------ */
 /* IC by Horizon Section                                               */
 /* ------------------------------------------------------------------ */
 function ICByHorizonSection({ data, isLoading, error }: {
  data: ValidationICByHorizon | undefined;
  isLoading: boolean;
  error: Error | null;
 }) {
  if (isLoading) return <LoadingSpinner />;
  if (error) return <ErrorCard message="Failed to load IC by horizon data" />;
  const horizons = data?.horizons;
  if (!horizons || horizons.length === 0) {
    return (
      <Card>
        <h2 className="mb-2 text-sm font-medium text-gray-400">IC by Horizon</h2>
        <p className="text-sm text-gray-500">No IC data available</p>
      </Card>
    );
  }
  return (
    <Card>
      <h2 className="mb-3 text-sm font-medium text-gray-400">Information Coefficient by Horizon</h2>
      <div className="overflow-x-auto">
        <table className="w-full text-left text-xs">
          <thead>
            <tr className="border-b border-surface-700 text-gray-500">
              <th className="pb-2 pr-4 font-medium">Horizon</th>
              <th className="pb-2 pr-4 font-medium">IC</th>
              <th className="pb-2 pr-4 font-medium">Rank IC</th>
              <th className="pb-2 font-medium">Predictions</th>
            </tr>
          </thead>
          <tbody>
            {horizons.map((h: ICByHorizonEntry, i: number) => (
              <tr key={i} className="border-b border-surface-800">
                <td className="py-1.5 pr-4 font-mono text-gray-300">{h.horizon}</td>
                <td className={`py-1.5 pr-4 font-mono ${colorForIC(h.information_coefficient)}`}>
                  {fmtIC(h.information_coefficient)}
                </td>
                <td className={`py-1.5 pr-4 font-mono ${colorForIC(h.rank_information_coefficient)}`}>
                  {fmtIC(h.rank_information_coefficient)}
                </td>
                <td className="py-1.5 font-mono text-gray-400">{h.prediction_count}</td>
              </tr>
            ))}
          </tbody>
        </table>
      </div>
    </Card>
  );
 }
 /* ------------------------------------------------------------------ */
 /* Source Attribution Section                                           */
 /* ------------------------------------------------------------------ */
 function SourceAttributionSection({ data, isLoading, error }: {
  data: SourceAttributionResponse | undefined;
  isLoading: boolean;
  error: Error | null;
 }) {
  if (isLoading) return <LoadingSpinner />;
  if (error) return <ErrorCard message="Failed to load source attribution data" />;
  const sources = data?.sources;
  if (!sources || sources.length === 0) {
    return (
      <Card>
        <h2 className="mb-2 text-sm font-medium text-gray-400">Source Performance</h2>
        <p className="text-sm text-gray-500">No source attribution data available</p>
      </Card>
    );
  }
  return (
    <Card>
      <h2 className="mb-3 text-sm font-medium text-gray-400">Source Performance</h2>
      <div className="overflow-x-auto">
        <table className="w-full text-left text-xs">
          <thead>
            <tr className="border-b border-surface-700 text-gray-500">
              <th className="pb-2 pr-4 font-medium">Source</th>
              <th className="pb-2 pr-4 font-medium">Win Rate</th>
              <th className="pb-2 pr-4 font-medium">IC</th>
              <th className="pb-2 pr-4 font-medium">Avg Return</th>
              <th className="pb-2 font-medium">Duplicate Rate</th>
            </tr>
          </thead>
          <tbody>
            {sources.map((s: SourceAttribution, i: number) => (
              <tr key={i} className="border-b border-surface-800">
                <td className="py-1.5 pr-4 text-gray-300">{s.source}</td>
                <td className={`py-1.5 pr-4 font-mono ${colorForRate(s.win_rate, 0.53)}`}>
                  {fmtPct(s.win_rate)}
                </td>
                <td className={`py-1.5 pr-4 font-mono ${colorForIC(s.information_coefficient)}`}>
                  {fmtIC(s.information_coefficient)}
                </td>
                <td className="py-1.5 pr-4 font-mono text-gray-300">{fmtPct(s.avg_future_return)}</td>
                <td className="py-1.5 font-mono text-gray-300">{fmtPct(s.duplicate_rate)}</td>
              </tr>
            ))}
          </tbody>
        </table>
      </div>
    </Card>
  );
 }
 /* ------------------------------------------------------------------ */
 /* Catalyst Attribution Section                                        */
 /* ------------------------------------------------------------------ */
 function CatalystAttributionSection({ data, isLoading, error }: {
  data: CatalystAttributionResponse | undefined;
  isLoading: boolean;
  error: Error | null;
 }) {
  if (isLoading) return <LoadingSpinner />;
  if (error) return <ErrorCard message="Failed to load catalyst attribution data" />;
  const catalysts = data?.catalysts;
  if (!catalysts || catalysts.length === 0) {
    return (
      <Card>
        <h2 className="mb-2 text-sm font-medium text-gray-400">Catalyst Truth Table</h2>
        <p className="text-sm text-gray-500">No catalyst attribution data available</p>
      </Card>
    );
  }
  return (
    <Card>
      <h2 className="mb-3 text-sm font-medium text-gray-400">Catalyst Truth Table</h2>
      <div className="overflow-x-auto">
        <table className="w-full text-left text-xs">
          <thead>
            <tr className="border-b border-surface-700 text-gray-500">
              <th className="pb-2 pr-4 font-medium">Catalyst Type</th>
              <th className="pb-2 pr-4 font-medium">Win Rate</th>
              <th className="pb-2 pr-4 font-medium">Avg Return</th>
              <th className="pb-2 font-medium">IC</th>
            </tr>
          </thead>
          <tbody>
            {catalysts.map((c: CatalystAttribution, i: number) => (
              <tr key={i} className="border-b border-surface-800">
                <td className="py-1.5 pr-4 text-gray-300">{c.catalyst_type}</td>
                <td className={`py-1.5 pr-4 font-mono ${colorForRate(c.win_rate, 0.53)}`}>
                  {fmtPct(c.win_rate)}
                </td>
                <td className="py-1.5 pr-4 font-mono text-gray-300">{fmtPct(c.avg_future_return)}</td>
                <td className={`py-1.5 font-mono ${colorForIC(c.information_coefficient)}`}>
                  {fmtIC(c.information_coefficient)}
                </td>
              </tr>
            ))}
          </tbody>
        </table>
      </div>
    </Card>
  );
 }
 /* ------------------------------------------------------------------ */
 /* Layer Attribution Section                                           */
 /* ------------------------------------------------------------------ */
 function LayerAttributionSection({ data, isLoading, error }: {
  data: LayerAttributionResponse | undefined;
  isLoading: boolean;
  error: Error | null;
 }) {
  if (isLoading) return <LoadingSpinner />;
  if (error) return <ErrorCard message="Failed to load layer attribution data" />;
  const layers = data?.layers;
  if (!layers || layers.length === 0) {
    return (
      <Card>
        <h2 className="mb-2 text-sm font-medium text-gray-400">Layer Attribution</h2>
        <p className="text-sm text-gray-500">No layer attribution data available</p>
      </Card>
    );
  }
  return (
    <Card>
      <h2 className="mb-3 text-sm font-medium text-gray-400">Layer Attribution</h2>
      <div className="overflow-x-auto">
        <table className="w-full text-left text-xs">
          <thead>
            <tr className="border-b border-surface-700 text-gray-500">
              <th className="pb-2 pr-4 font-medium">Layer</th>
              <th className="pb-2 pr-4 font-medium">Contribution %</th>
              <th className="pb-2 pr-4 font-medium">Dominant Win Rate</th>
              <th className="pb-2 font-medium">IC</th>
            </tr>
          </thead>
          <tbody>
            {layers.map((l: LayerAttribution, i: number) => (
              <tr key={i} className="border-b border-surface-800">
                <td className="py-1.5 pr-4 text-gray-300 capitalize">{l.layer}</td>
                <td className="py-1.5 pr-4 font-mono text-gray-300">{fmtPct(l.avg_contribution_pct)}</td>
                <td className={`py-1.5 pr-4 font-mono ${colorForRate(l.dominant_win_rate, 0.53)}`}>
                  {fmtPct(l.dominant_win_rate)}
                </td>
                <td className={`py-1.5 font-mono ${colorForIC(l.dominant_ic)}`}>
                  {fmtIC(l.dominant_ic)}
                </td>
              </tr>
            ))}
          </tbody>
        </table>
      </div>
    </Card>
  );
 }
 /* ------------------------------------------------------------------ */
 /* Shared helpers                                                      */
 /* ------------------------------------------------------------------ */
 function StatCard({ label, value, color = 'text-gray-100' }: { label: string; value: string; color?: string }) {
  return (
    <Card className="text-center">
@@ -71,3 +622,53 @@ function StatCard({ label, value, color = 'text-gray-100' }: { label: string; va
    </Card>
  );
 }
 function ErrorCard({ message }: { message: string }) {
  return (
    <Card className="border-red-700/50 bg-red-900/20">
      <p className="text-sm text-red-400">{message}</p>
    </Card>
  );
 }
 /** Format a float as percentage with 1 decimal place, or '—' if null */
 function fmtPct(v: number | null | undefined): string {
  if (v == null) return '—';
  return `${(v * 100).toFixed(1)}%`;
 }
 /** Format a float as short percentage (no decimal) for bucket display */
 function fmtPctShort(v: number | null | undefined): string {
  if (v == null) return '—';
  return `${(v * 100).toFixed(0)}%`;
 }
 /** Format IC value with 4 decimal places, or '—' if null */
 function fmtIC(v: number | null | undefined): string {
  if (v == null) return '—';
  return v.toFixed(4);
 }
 /** Format a threshold value for display */
 function fmtThreshold(v: unknown): string {
  if (v == null) return '—';
  if (typeof v === 'number') {
    if (Number.isInteger(v)) return String(v);
    return v.toFixed(4);
  }
  return String(v);
 }
 /** Color for win rate / accuracy: green at or above threshold, red below it, neutral if null */
 function colorForRate(v: number | null | undefined, threshold: number): string {
  if (v == null) return 'text-gray-100';
  return v >= threshold ? 'text-green-400' : 'text-red-400';
 }
 /** Color for IC: green at or above 0.03, yellow if positive but below 0.03, red if zero or negative, gray if null */
 function colorForIC(v: number | null | undefined): string {
  if (v == null) return 'text-gray-400';
  if (v >= 0.03) return 'text-green-400';
  if (v > 0) return 'text-yellow-400';
  return 'text-red-400';
 }
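Review note: two rules in this page carry most of the interpretation, the IC color tiers in `colorForIC` and the 0.15 gap check in `CalibrationRow`. The sketch below restates them as pure functions so the thresholds can be checked in isolation. The names `classifyIC`, `ICTier`, and `isMiscalibrated` are illustrative and do not appear in the commit; the thresholds (0.03 and 0.15) are copied from the diff.

```typescript
// Illustrative names; only the thresholds come from the diff.
type ICTier = 'strong' | 'weak' | 'non-positive' | 'missing';

// Same branching as colorForIC, returning a label instead of a CSS class.
function classifyIC(v: number | null | undefined): ICTier {
  if (v == null) return 'missing';   // rendered gray
  if (v >= 0.03) return 'strong';    // rendered green
  if (v > 0) return 'weak';          // rendered yellow
  return 'non-positive';             // rendered red
}

// Gap rule from CalibrationRow: a bucket is flagged when stated confidence
// and observed win rate differ by more than the tolerance.
function isMiscalibrated(
  avgConfidence: number,
  observedWinRate: number,
  tolerance = 0.15,
): boolean {
  return Math.abs(avgConfidence - observedWinRate) > tolerance;
}

const tiers = [0.045, 0.01, -0.02, null].map(classifyIC);
const wideGap = isMiscalibrated(0.75, 0.55);   // gap of about 0.20
const narrowGap = isMiscalibrated(0.65, 0.58); // gap of about 0.07

console.log(tiers, wideGap, narrowGap);
```

In the page itself the gap rule is OR-ed with the server-provided `miscalibrated` flag, so a bucket can still be flagged when the gap is within tolerance.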
@@ -1,13 +1,92 @@
 /**
 * Recommendation detail page with validation context.
 *
 * Shows original confidence alongside calibrated confidence (historical win rate),
 * evidence quality indicators, source reliability, and live eligibility status.
 *
 * Requirements: 13.1, 13.2, 13.3, 13.4, 13.5, 13.6, 13.7
 */
 import { useParams, Link } from '@tanstack/react-router';
-import { useRecommendation } from '../api/hooks';
+import { AlertTriangle, ShieldCheck, ShieldX, Info } from 'lucide-react';
 import {
  useRecommendation,
  useValidationCalibration,
  useValidationGateStatus,
  useValidationAttributionSources,
 } from '../api/hooks';
 import { StatusBadge, ConfidenceBar, LoadingSpinner, Card } from '../components/ui';
 export function RecommendationDetailPage() {
  const { id } = useParams({ from: '/recommendations/$id' });
  const { data: rec, isLoading } = useRecommendation(id);
  const { data: calibration } = useValidationCalibration();
  const { data: gateData } = useValidationGateStatus();
  const { data: sourcesData } = useValidationAttributionSources();
  if (isLoading || !rec) return <LoadingSpinner />;
  // --- Calibration: find the bucket matching this recommendation's confidence ---
  const matchingBucket = calibration?.buckets?.find(
    (b) => rec.confidence >= b.bucket_low && rec.confidence < b.bucket_high,
  );
  // Handle edge case: confidence of exactly 1.0 falls in the last bucket [0.90, 1.00]
  const calibratedBucket =
    matchingBucket ??
    (rec.confidence >= 1.0
      ? calibration?.buckets?.find((b) => b.bucket_high >= 1.0)
      : undefined);
  const historicalWinRate = calibratedBucket?.observed_win_rate;
  // --- Evidence counts ---
  const totalEvidenceCount = rec.evidence.length;
  // Compute duplicate evidence: group by normalized title, count extras.
  // Evidence without a title is skipped so untitled items are not counted
  // as duplicates of one another.
  const titleCounts = new Map<string, number>();
  for (const ev of rec.evidence) {
    const key = (ev.title ?? '').toLowerCase().trim();
    if (!key) continue;
    titleCounts.set(key, (titleCounts.get(key) ?? 0) + 1);
  }
  let duplicateEvidenceCount = 0;
  for (const count of titleCounts.values()) {
    if (count > 1) duplicateEvidenceCount += count - 1;
  }
  const uniqueEvidenceCount = totalEvidenceCount - duplicateEvidenceCount;
  const duplicateRatio = totalEvidenceCount > 0 ? duplicateEvidenceCount / totalEvidenceCount : 0;
  const hasDuplicateWarning = duplicateRatio > 0.2;
  // --- Source reliability: find primary contributing sources ---
  const evidenceSources = new Map<string, number>();
  for (const ev of rec.evidence) {
    const src = ev.source_type ?? ev.publisher ?? 'unknown';
    evidenceSources.set(src, (evidenceSources.get(src) ?? 0) + ev.weight);
  }
  // Sort by total weight descending to find primary source
  const sortedSources = [...evidenceSources.entries()].sort((a, b) => b[1] - a[1]);
  const primarySourceType = sortedSources[0]?.[0];
  // Look up source reliability from attribution data
  const primarySourceAttribution = sourcesData?.sources?.find(
    (s) => s.source_type === primarySourceType || s.source === primarySourceType,
  );
  // Source reliability is approximated here by shrinking the observed win_rate
  // from the attribution data toward 0.5, so sources with few predictions stay near neutral.
  const primarySourceWinRate = primarySourceAttribution?.win_rate;
  // Bayesian shrinkage: reliability = 0.5 + (n/(n+30)) * (win_rate - 0.5)
  const primarySourceCount = primarySourceAttribution?.prediction_count ?? 0;
  const primarySourceReliability =
    primarySourceWinRate != null
      ? 0.5 + (primarySourceCount / (primarySourceCount + 30)) * (primarySourceWinRate - 0.5)
      : undefined;
  const hasLowReliabilityWarning =
    primarySourceReliability != null && primarySourceReliability < 0.4;
  // --- Gate status ---
  const gateStatus = gateData?.gate_status as {
    passed?: boolean;
    reason?: string;
    threshold_results?: Array<{ name: string; threshold: number; actual: number; passed: boolean }>;
  } | null;
  return (
    <div className="space-y-6">
      <div className="flex items-center gap-3">
@@ -28,6 +107,137 @@ export function RecommendationDetailPage() {
        </dl>
      </Card>
      {/* Validation Context Card — Requirements 13.1–13.7 */}
      <Card>
        <h2 className="mb-3 text-sm font-medium text-gray-400">Validation Context</h2>
        <dl className="grid grid-cols-2 gap-x-8 gap-y-3 text-sm sm:grid-cols-3">
          {/* 13.1: Original confidence alongside calibrated confidence */}
          <div>
            <dt className="text-gray-500">Original Confidence</dt>
            <dd className="text-gray-200">{(rec.confidence * 100).toFixed(1)}%</dd>
          </div>
          <div>
            <dt className="text-gray-500">Calibrated Confidence</dt>
            <dd className="text-gray-200">
              {historicalWinRate != null
                ? `${(historicalWinRate * 100).toFixed(1)}%`
                : 'N/A'}
            </dd>
          </div>
          {/* 13.2: Historical win rate for similar confidence levels */}
          <div>
            <dt className="text-gray-500">Historical Win Rate</dt>
            <dd className="text-gray-200">
              {historicalWinRate != null ? (
                <span>
                  {(historicalWinRate * 100).toFixed(1)}%
                  {calibratedBucket && (
                    <span className="ml-1 text-xs text-gray-500">
                      ({calibratedBucket.prediction_count} predictions)
                    </span>
                  )}
                </span>
              ) : (
                'N/A'
              )}
            </dd>
          </div>
          {/* 13.3: Evidence count, unique evidence count, duplicate evidence count */}
          <div>
            <dt className="text-gray-500">Evidence Count</dt>
            <dd className="text-gray-200">{totalEvidenceCount}</dd>
          </div>
          <div>
            <dt className="text-gray-500">Unique Evidence</dt>
            <dd className="text-gray-200">{uniqueEvidenceCount}</dd>
          </div>
          <div>
            <dt className="flex items-center gap-1 text-gray-500">
              Duplicate Evidence
              {/* 13.6: Warning badge when duplicate evidence count > 20% of total */}
              {hasDuplicateWarning && (
                <span
                  className="inline-flex items-center gap-0.5 rounded-full border border-yellow-700/50 bg-yellow-900/40 px-1.5 py-0.5 text-[10px] font-medium text-yellow-400"
                  title="Duplicate evidence exceeds 20% of total — potential evidence inflation"
                >
                  <AlertTriangle size={10} />
                  &gt;20%
                </span>
              )}
            </dt>
            <dd className="text-gray-200">
              {duplicateEvidenceCount}
              {totalEvidenceCount > 0 && (
                <span className="ml-1 text-xs text-gray-500">
                  ({(duplicateRatio * 100).toFixed(0)}%)
                </span>
              )}
            </dd>
          </div>
          {/* 13.4: Source reliability indicator */}
          <div>
            <dt className="flex items-center gap-1 text-gray-500">
              Primary Source Reliability
              {/* 13.7: Warning badge when primary source reliability < 0.4 */}
              {hasLowReliabilityWarning && (
                <span
                  className="inline-flex items-center gap-0.5 rounded-full border border-red-700/50 bg-red-900/40 px-1.5 py-0.5 text-[10px] font-medium text-red-400"
                  title="Primary source reliability is below 0.4 — low or unknown reliability"
                >
                  <AlertTriangle size={10} />
                  Low
                </span>
              )}
            </dt>
            <dd className="text-gray-200">
              {primarySourceReliability != null ? (
                <span>
                  {primarySourceReliability.toFixed(3)}
                  {primarySourceType && (
                    <span className="ml-1 text-xs text-gray-500">({primarySourceType})</span>
                  )}
                </span>
              ) : (
                'N/A'
              )}
            </dd>
          </div>
          {/* 13.5: Live eligibility status with reason */}
          <div className="col-span-2">
            <dt className="text-gray-500">Live Eligibility</dt>
            <dd>
              {gateStatus != null ? (
                <div className="flex items-center gap-2">
                  {gateStatus.passed ? (
                    <span className="inline-flex items-center gap-1 text-green-400">
                      <ShieldCheck size={14} />
                      Gate Passed
                    </span>
                  ) : (
                    <span className="inline-flex items-center gap-1 text-red-400">
                      <ShieldX size={14} />
                      Gate Failed
                    </span>
                  )}
                  {gateStatus.reason && (
                    <span className="text-xs text-gray-500">{gateStatus.reason}</span>
                  )}
                </div>
              ) : (
                <span className="inline-flex items-center gap-1 text-gray-500">
                  <Info size={14} />
                  N/A — no gate evaluation available
                </span>
              )}
            </dd>
          </div>
        </dl>
      </Card>
      {rec.thesis && (
        <Card>
          <h2 className="mb-2 text-sm font-medium text-gray-400">Thesis</h2>
@@ -73,6 +73,97 @@ export const mockVariantPerfHistory = [
  { hour: '2026-04-10T11:00:00Z', invocations: 12, successes: 11, avg_duration_ms: 1300, avg_confidence: 0.82 },
 ];
 // Validation: Model Quality & Calibration mock data
 export const mockValidationSummary = {
  snapshot: {
    id: 'ms-1',
    generated_at: '2026-04-11T12:00:00Z',
    lookback_window: '30d',
    horizon: '7d',
    prediction_count: 150,
    win_rate: 0.58,
    directional_accuracy: 0.56,
    information_coefficient: 0.045,
    rank_information_coefficient: 0.038,
    avg_return: 0.012,
    avg_excess_return_vs_spy: 0.003,
    avg_excess_return_vs_sector: 0.002,
    calibration_error: 0.08,
    brier_score: 0.21,
    buy_win_rate: 0.61,
    sell_win_rate: 0.54,
    hold_win_rate: 0.50,
    metadata: {},
  },
  gate_status: {
    passed: true,
    reason: 'all thresholds met',
    threshold_results: [
      { name: 'min_prediction_count', threshold: 100, actual: 150, passed: true },
      { name: 'min_ic', threshold: 0.03, actual: 0.045, passed: true },
      { name: 'min_win_rate', threshold: 0.53, actual: 0.58, passed: true },
    ],
  },
 };
 export const mockValidationCalibration = {
  buckets: [
    { bucket_low: 0.50, bucket_high: 0.60, avg_confidence: 0.55, observed_win_rate: 0.52, prediction_count: 30, miscalibrated: false },
    { bucket_low: 0.60, bucket_high: 0.70, avg_confidence: 0.65, observed_win_rate: 0.58, prediction_count: 40, miscalibrated: false },
    { bucket_low: 0.70, bucket_high: 0.80, avg_confidence: 0.75, observed_win_rate: 0.55, prediction_count: 35, miscalibrated: true },
    { bucket_low: 0.80, bucket_high: 0.90, avg_confidence: 0.85, observed_win_rate: 0.70, prediction_count: 25, miscalibrated: false },
    { bucket_low: 0.90, bucket_high: 1.00, avg_confidence: 0.95, observed_win_rate: 0.72, prediction_count: 20, miscalibrated: true },
  ],
  lookback: '30d',
  horizon: '7d',
 };
 export const mockValidationGateStatus = {
  gate_status: {
    passed: false,
    reason: 'failed: min_ic below threshold',
    threshold_results: [
      { name: 'min_prediction_count', threshold: 100, actual: 150, passed: true },
      { name: 'min_ic', threshold: 0.03, actual: 0.02, passed: false },
      { name: 'min_win_rate', threshold: 0.53, actual: 0.58, passed: true },
    ],
  },
 };
 export const mockValidationICByHorizon = {
  horizons: [
    { horizon: '1h', information_coefficient: 0.02, rank_information_coefficient: 0.015, prediction_count: 120, generated_at: '2026-04-11T12:00:00Z' },
    { horizon: '7d', information_coefficient: 0.045, rank_information_coefficient: 0.038, prediction_count: 100, generated_at: '2026-04-11T12:00:00Z' },
  ],
  lookback: '30d',
 };
 export const mockValidationAttributionSources = {
  sources: [
    { source: 'Reuters', source_type: 'news_api', prediction_count: 50, avg_weight: 0.6, avg_contribution_score: 0.3, win_rate: 0.62, avg_future_return: 0.015, avg_excess_return_vs_spy: 0.005, information_coefficient: 0.05, duplicate_rate: 0.1 },
  ],
  lookback: '30d',
  horizon: '7d',
 };
 export const mockValidationAttributionCatalysts = {
  catalysts: [
    { catalyst_type: 'earnings', prediction_count: 40, win_rate: 0.65, avg_future_return: 0.02, avg_excess_return_vs_spy: 0.008, information_coefficient: 0.06 },
  ],
  lookback: '30d',
  horizon: '7d',
 };
 export const mockValidationAttributionLayers = {
  layers: [
    { layer: 'company', avg_contribution_pct: 0.55, dominant_win_rate: 0.60, dominant_ic: 0.04 },
    { layer: 'macro', avg_contribution_pct: 0.30, dominant_win_rate: 0.52, dominant_ic: 0.02 },
    { layer: 'competitive', avg_contribution_pct: 0.15, dominant_win_rate: 0.48, dominant_ic: null },
  ],
  lookback: '30d',
  horizon: '7d',
 };
 export const handlers = [
  // Query API (proxied at /api/)
  http.get('/api/companies', () => HttpResponse.json(mockCompanies)),
@@ -242,4 +333,13 @@ export const handlers = [
    const body = await request.json() as Record<string, unknown>;
    return HttpResponse.json({ enabled: body.enabled, previous_enabled: true, toggled_by: 'operator' });
  }),
  // Validation: Model Quality & Calibration endpoints
  http.get('/api/validation/summary', () => HttpResponse.json(mockValidationSummary)),
  http.get('/api/validation/calibration', () => HttpResponse.json(mockValidationCalibration)),
  http.get('/api/validation/gate-status', () => HttpResponse.json(mockValidationGateStatus)),
  http.get('/api/validation/ic-by-horizon', () => HttpResponse.json(mockValidationICByHorizon)),
  http.get('/api/validation/attribution/sources', () => HttpResponse.json(mockValidationAttributionSources)),
  http.get('/api/validation/attribution/catalysts', () => HttpResponse.json(mockValidationAttributionCatalysts)),
  http.get('/api/validation/attribution/layers', () => HttpResponse.json(mockValidationAttributionLayers)),
 ];
@@ -169,6 +169,55 @@ describe('Global Events page', () => {
  });
 });
 describe('OpsModel validation tab', () => {
  it('renders Model Validation tab with summary cards', async () => {
    renderRoute('/ops/model');
    await waitFor(() => expect(screen.getByText('Model Performance')).toBeInTheDocument());
    // The tab buttons should be present
    expect(screen.getByText('Extraction Performance')).toBeInTheDocument();
    expect(screen.getByText('Model Validation')).toBeInTheDocument();
    // Click the Model Validation tab button
    await userEvent.click(screen.getByText('Model Validation'));
    // Summary cards should render key metric labels unique to the validation summary
    await waitFor(() => {
      expect(screen.getByText('Brier Score')).toBeInTheDocument();
      expect(screen.getByText('ECE')).toBeInTheDocument();
      expect(screen.getByText('Directional Accuracy')).toBeInTheDocument();
      expect(screen.getByText('Excess vs SPY')).toBeInTheDocument();
    });
  }, 10000);
  it('renders calibration table with miscalibration warning', async () => {
    renderRoute('/ops/model');
    await waitFor(() => expect(screen.getByText('Model Performance')).toBeInTheDocument());
    await userEvent.click(screen.getByText('Model Validation'));
    await waitFor(() => {
      expect(screen.getByText('Calibration by Confidence Bucket')).toBeInTheDocument();
    });
    // Miscalibrated buckets should show warning text
    const miscalWarnings = screen.getAllByText('Miscalibrated');
    expect(miscalWarnings.length).toBeGreaterThanOrEqual(1);
  }, 10000);
  it('renders gate status pass/fail indicator', async () => {
    renderRoute('/ops/model');
    await waitFor(() => expect(screen.getByText('Model Performance')).toBeInTheDocument());
    await userEvent.click(screen.getByText('Model Validation'));
    // The gate-status endpoint returns passed: false
    await waitFor(() => {
      expect(screen.getByText(/Live Trading Gate: FAIL/)).toBeInTheDocument();
    });
  }, 10000);
 });
 describe('Agents page', () => {
  it('renders agent list in sidebar', async () => {
    renderRoute('/agents');
@@ -0,0 +1,176 @@
 -- Migration 035: Model Validation, Calibration, and Signal Quality
 -- Creates tables for prediction snapshots, outcomes, evidence links, and metric snapshots
 -- Plus views for prediction performance and source performance analysis
 -- ============================================================================
 -- Table: prediction_snapshots
 -- Immutable snapshot of a prediction at generation time
 -- ============================================================================
 CREATE TABLE IF NOT EXISTS prediction_snapshots (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    generated_at TIMESTAMPTZ NOT NULL,
    ticker VARCHAR(20) NOT NULL,
    "window" VARCHAR(20) NOT NULL,  -- quoted: WINDOW is a reserved word in PostgreSQL
    horizon VARCHAR(20) NOT NULL,
    direction VARCHAR(20) NOT NULL,
    action VARCHAR(20) NOT NULL,
    mode VARCHAR(30) NOT NULL,
    strength FLOAT NOT NULL,
    confidence FLOAT NOT NULL,
    contradiction FLOAT NOT NULL DEFAULT 0.0,
    p_bull FLOAT,
    p_bear FLOAT,
    score_company FLOAT NOT NULL DEFAULT 0.0,
    score_macro FLOAT NOT NULL DEFAULT 0.0,
    score_competitive FLOAT NOT NULL DEFAULT 0.0,
    evidence_count INTEGER NOT NULL DEFAULT 0,
    unique_source_count INTEGER NOT NULL DEFAULT 0,
    duplicate_evidence_count INTEGER NOT NULL DEFAULT 0,
    price_at_prediction FLOAT,
    spy_price_at_prediction FLOAT,
    sector_etf_price_at_prediction FLOAT,
    metadata JSONB DEFAULT '{}',
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
 );
 CREATE INDEX IF NOT EXISTS idx_pred_snap_ticker ON prediction_snapshots(ticker);
 CREATE INDEX IF NOT EXISTS idx_pred_snap_generated ON prediction_snapshots(generated_at);
 CREATE INDEX IF NOT EXISTS idx_pred_snap_horizon ON prediction_snapshots(horizon);
 -- ============================================================================
 -- Table: prediction_outcomes
 -- Realized outcome for a prediction at a specific horizon
 -- ============================================================================
 CREATE TABLE IF NOT EXISTS prediction_outcomes (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    prediction_id UUID NOT NULL REFERENCES prediction_snapshots(id),
    evaluated_at TIMESTAMPTZ NOT NULL,
    horizon VARCHAR(20) NOT NULL,
    future_price FLOAT,
    future_return FLOAT,
    spy_future_price FLOAT,
    spy_return FLOAT,
    sector_etf_future_price FLOAT,
    sector_etf_return FLOAT,
    excess_return_vs_spy FLOAT,
    excess_return_vs_sector FLOAT,
    direction_correct BOOLEAN,
    profitable BOOLEAN,
    metadata JSONB DEFAULT '{}',
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
 );
 CREATE INDEX IF NOT EXISTS idx_pred_out_prediction ON prediction_outcomes(prediction_id);
 CREATE INDEX IF NOT EXISTS idx_pred_out_horizon ON prediction_outcomes(horizon);
 CREATE INDEX IF NOT EXISTS idx_pred_out_evaluated ON prediction_outcomes(evaluated_at);
 -- ============================================================================
 -- Table: signal_evidence_links
 -- Link between a prediction and a contributing evidence document
 -- ============================================================================
 CREATE TABLE IF NOT EXISTS signal_evidence_links (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    prediction_id UUID NOT NULL REFERENCES prediction_snapshots(id),
    document_id VARCHAR(200),
    signal_id VARCHAR(200),
    ticker VARCHAR(20),
    source VARCHAR(200),
    source_type VARCHAR(50),
    catalyst_type VARCHAR(50),
    sentiment VARCHAR(20),
    impact FLOAT,
    extraction_confidence FLOAT,
    weight FLOAT,
    is_duplicate BOOLEAN NOT NULL DEFAULT FALSE,
    canonical_evidence_key VARCHAR(64),
    contribution_score FLOAT,
    metadata JSONB DEFAULT '{}',
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
 );
 CREATE INDEX IF NOT EXISTS idx_sig_ev_prediction ON signal_evidence_links(prediction_id);
 CREATE INDEX IF NOT EXISTS idx_sig_ev_document ON signal_evidence_links(document_id);
 CREATE INDEX IF NOT EXISTS idx_sig_ev_ticker ON signal_evidence_links(ticker);
 -- ============================================================================
 -- Table: model_metric_snapshots
 -- Aggregate model quality metrics for a lookback/horizon combination
 -- ============================================================================
 CREATE TABLE IF NOT EXISTS model_metric_snapshots (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    generated_at TIMESTAMPTZ NOT NULL,
    lookback_window VARCHAR(20) NOT NULL,
    horizon VARCHAR(20) NOT NULL,
    prediction_count INTEGER NOT NULL DEFAULT 0,
    win_rate FLOAT,
    directional_accuracy FLOAT,
    information_coefficient FLOAT,
    rank_information_coefficient FLOAT,
    avg_return FLOAT,
    avg_excess_return_vs_spy FLOAT,
    avg_excess_return_vs_sector FLOAT,
    calibration_error FLOAT,
    brier_score FLOAT,
    buy_win_rate FLOAT,
    sell_win_rate FLOAT,
    hold_win_rate FLOAT,
    metadata JSONB DEFAULT '{}',
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
 );
 CREATE INDEX IF NOT EXISTS idx_model_snap_generated ON model_metric_snapshots(generated_at);
 CREATE INDEX IF NOT EXISTS idx_model_snap_lookback ON model_metric_snapshots(lookback_window);
 CREATE INDEX IF NOT EXISTS idx_model_snap_horizon ON model_metric_snapshots(horizon);
 -- ============================================================================
 -- View: v_prediction_performance
 -- Joins prediction snapshots with outcomes for flat analysis
 -- ============================================================================
 CREATE OR REPLACE VIEW v_prediction_performance AS
 SELECT
    ps.ticker,
    ps.direction,
    ps.action,
    ps.confidence,
    ps.strength,
    ps.contradiction,
    ps.p_bull,
    ps.score_company,
    ps.score_macro,
    ps.score_competitive,
    ps.evidence_count,
    ps.unique_source_count,
    ps.duplicate_evidence_count,
    ps.price_at_prediction,
    po.future_return,
    po.excess_return_vs_spy,
    po.excess_return_vs_sector,
    po.direction_correct,
    po.profitable,
    po.horizon,
    ps.generated_at,
    po.evaluated_at
 FROM prediction_snapshots ps
 JOIN prediction_outcomes po ON po.prediction_id = ps.id;
 -- ============================================================================
 -- View: v_source_performance
 -- Joins evidence links with snapshots and outcomes for source attribution
 -- ============================================================================
 CREATE OR REPLACE VIEW v_source_performance AS
 SELECT
    sel.source,
    sel.source_type,
    sel.catalyst_type,
    sel.sentiment,
    sel.weight,
    sel.contribution_score,
    sel.is_duplicate,
    po.direction_correct,
    po.future_return,
    po.excess_return_vs_spy,
    po.horizon,
    ps.generated_at
 FROM signal_evidence_links sel
 JOIN prediction_snapshots ps ON ps.id = sel.prediction_id
 JOIN prediction_outcomes po ON po.prediction_id = sel.prediction_id;
@@ -64,6 +64,7 @@ from services.shared.metrics import (
    AGGREGATION_WINDOWS_COMPUTED,
 )
 from services.shared.schemas import TrendDirection, TrendSummary, TrendWindow
 from services.trading.model_quality_gate import QualityGateResult, evaluate_quality_gate
 logger = logging.getLogger(__name__)
@@ -1576,10 +1577,34 @@ async def aggregate_company(
    # Mid-cycle changes take effect on the next cycle.
    probabilistic = await fetch_probabilistic_scoring_enabled(pool)
    pipeline_mode = "probabilistic" if probabilistic else "heuristic"
    # --- Quality gate evaluation (Req 11.2, 11.3) ---
    # Evaluate model quality gate at the start of each aggregation cycle.
    # When the gate fails, all recommendations are forced to paper mode.
    # Gate evaluation failure defaults to paper-only (fail-safe).
    quality_gate_passed = False
    try:
        gate_result: QualityGateResult = await evaluate_quality_gate(pool)
        quality_gate_passed = gate_result.passed
        logger.info(
-        "Aggregation cycle for %s: pipeline_mode=%s",
+            "Quality gate for %s cycle: %s — %s",
            ticker,
            "PASS" if gate_result.passed else "FAIL",
            gate_result.reason,
        )
    except Exception:
        logger.exception(
            "Quality gate evaluation failed for %s cycle — "
            "defaulting to paper-only mode (fail-safe)",
            ticker,
        )
        quality_gate_passed = False
    logger.info(
        "Aggregation cycle for %s: pipeline_mode=%s quality_gate=%s",
        ticker,
        pipeline_mode,
        "passed" if quality_gate_passed else "failed",
    )
    # --- Regime detection (Req 7.1, 7.2, 7.3, 7.8, 7.9) ---
@@ -1647,6 +1672,20 @@ async def aggregate_company(
            ticker_returns=ticker_returns,
            ticker_volumes=ticker_volumes,
        )
        # When quality gate fails, annotate the trend summary so the
        # recommendation engine forces paper mode (Req 11.2, 11.3).
        if not quality_gate_passed:
            ctx = summary.market_context
            if isinstance(ctx, dict):
                ctx["quality_gate_passed"] = False
            elif ctx is not None and hasattr(ctx, "model_dump"):
                ctx_dict = ctx.model_dump()
                ctx_dict["quality_gate_passed"] = False
                summary.market_context = ctx_dict
            else:
                summary.market_context = {"quality_gate_passed": False}
        summaries.append(summary)
    return summaries
@@ -43,6 +43,11 @@ from services.shared.db import get_pg_pool, get_redis
 from services.shared.logging import new_trace_id, set_trace_context, setup_logging
 from services.shared.redis_keys import PIPELINE_ENABLED_KEY, QUEUE_BROKER, QUEUE_PREFIX, queue_key
 from services.shared.schemas import MAJOR_DECISION_CATALYSTS
 from services.validation.attribution import (
    compute_catalyst_attribution,
    compute_layer_attribution,
    compute_source_attribution,
 )
 logger = logging.getLogger("query_api")
@@ -3769,3 +3774,336 @@ async def get_variant_performance_history(
        agent_id, variant_id, hours,
    )
    return [_row_to_dict(r) for r in rows]
 # ---------------------------------------------------------------------------
 # Model Validation Dashboard  (Requirements 12.1, 12.2, 12.3, 12.7)
 # ---------------------------------------------------------------------------
 _VALID_LOOKBACKS = {"7d", "30d", "90d", "all"}
 _VALID_HORIZONS = {"1h", "6h", "1d", "7d", "30d"}
@app.get("/api/validation/summary")
 async def get_validation_summary(
    lookback: str = Query(default="30d"),
    horizon: str = Query(default="7d"),
 ):
    """Latest model metric snapshot plus quality gate status.
    Returns the most recent model_metric_snapshot for the given
    lookback/horizon combination, along with the current gate status
    from risk_configs.
    Requirement 12.1
    """
    if lookback not in _VALID_LOOKBACKS:
        raise HTTPException(400, f"Invalid lookback: {lookback}. Must be one of {sorted(_VALID_LOOKBACKS)}")
    if horizon not in _VALID_HORIZONS:
        raise HTTPException(400, f"Invalid horizon: {horizon}. Must be one of {sorted(_VALID_HORIZONS)}")
    # Latest metric snapshot for the requested lookback/horizon
    snapshot_row = await pool.fetchrow(
        """SELECT id, generated_at, lookback_window, horizon,
                  prediction_count, win_rate, directional_accuracy,
                  information_coefficient, rank_information_coefficient,
                  avg_return, avg_excess_return_vs_spy, avg_excess_return_vs_sector,
                  calibration_error, brier_score,
                  buy_win_rate, sell_win_rate, hold_win_rate,
                  metadata
           FROM model_metric_snapshots
           WHERE lookback_window = $1 AND horizon = $2
           ORDER BY generated_at DESC
           LIMIT 1""",
        lookback, horizon,
    )
    snapshot = None
    if snapshot_row:
        snapshot = _row_to_dict(snapshot_row)
        snapshot["metadata"] = _parse_jsonb(snapshot.get("metadata"))
    # Gate status from risk_configs
    gate_row = await pool.fetchrow(
        "SELECT config, updated_at FROM risk_configs WHERE name = 'model_quality_gate'",
    )
    gate_status = None
    if gate_row:
        gate_status = _parse_jsonb(gate_row["config"])
    return {
        "snapshot": snapshot,
        "gate_status": gate_status,
    }
@app.get("/api/validation/calibration")
 async def get_validation_calibration(
    lookback: str = Query(default="30d"),
    horizon: str = Query(default="7d"),
 ):
    """Calibration table with confidence buckets.
    Queries v_prediction_performance for the given lookback/horizon,
    groups by confidence buckets, and computes avg_confidence,
    observed_win_rate, count, and miscalibrated flag per bucket.
    Requirement 12.2
    """
    if lookback not in _VALID_LOOKBACKS:
        raise HTTPException(400, f"Invalid lookback: {lookback}. Must be one of {sorted(_VALID_LOOKBACKS)}")
    if horizon not in _VALID_HORIZONS:
        raise HTTPException(400, f"Invalid horizon: {horizon}. Must be one of {sorted(_VALID_HORIZONS)}")
    # Build lookback filter
    lookback_condition = ""
    params: list[Any] = [horizon]
    idx = 2
    if lookback != "all":
        lookback_days = {"7d": 7, "30d": 30, "90d": 90}[lookback]
        lookback_condition = f"AND generated_at >= NOW() - make_interval(days => ${idx})"
        params.append(lookback_days)
        idx += 1
    rows = await pool.fetch(
        f"""SELECT confidence, direction_correct
            FROM v_prediction_performance
            WHERE horizon = $1
              {lookback_condition}
              AND confidence IS NOT NULL
              AND direction_correct IS NOT NULL""",
        *params,
    )
    # Group into calibration buckets
    buckets_def = [
        (0.50, 0.60),
        (0.60, 0.70),
        (0.70, 0.80),
        (0.80, 0.90),
        (0.90, 1.00),
    ]
    buckets = []
    for low, high in buckets_def:
        bucket_rows = []
        for r in rows:
            conf = float(r["confidence"])
            if high == 1.00:
                in_bucket = low <= conf <= high
            else:
                in_bucket = low <= conf < high
            if in_bucket:
                bucket_rows.append(r)
        count = len(bucket_rows)
        if count == 0:
            buckets.append({
                "bucket_low": low,
                "bucket_high": high,
                "avg_confidence": 0.0,
                "observed_win_rate": 0.0,
                "prediction_count": 0,
                "miscalibrated": False,
            })
            continue
        avg_conf = sum(float(r["confidence"]) for r in bucket_rows) / count
        win_count = sum(1 for r in bucket_rows if r["direction_correct"] is True)
        win_rate = win_count / count
        diff = abs(avg_conf - win_rate)
        buckets.append({
            "bucket_low": low,
            "bucket_high": high,
            "avg_confidence": round(avg_conf, 4),
            "observed_win_rate": round(win_rate, 4),
            "prediction_count": count,
            "miscalibrated": diff > 0.15,
        })
    return {"buckets": buckets, "lookback": lookback, "horizon": horizon}
@app.get("/api/validation/ic-by-horizon")
 async def get_validation_ic_by_horizon(
    lookback: str = Query(default="30d"),
 ):
    """IC and Rank IC per prediction horizon.
    Queries the most recent model_metric_snapshot for the given lookback
    across all 5 horizons, returning IC and Rank IC for each.
    Requirement 12.3
    """
    if lookback not in _VALID_LOOKBACKS:
        raise HTTPException(400, f"Invalid lookback: {lookback}. Must be one of {sorted(_VALID_LOOKBACKS)}")
    rows = await pool.fetch(
        """SELECT DISTINCT ON (horizon)
                  horizon,
                  information_coefficient,
                  rank_information_coefficient,
                  prediction_count,
                  generated_at
           FROM model_metric_snapshots
           WHERE lookback_window = $1
           ORDER BY horizon, generated_at DESC""",
        lookback,
    )
    horizons = []
    for r in rows:
        horizons.append({
            "horizon": r["horizon"],
            "information_coefficient": float(r["information_coefficient"]) if r["information_coefficient"] is not None else None,
            "rank_information_coefficient": float(r["rank_information_coefficient"]) if r["rank_information_coefficient"] is not None else None,
            "prediction_count": r["prediction_count"],
            "generated_at": r["generated_at"].isoformat() if r["generated_at"] else None,
        })
    # Sort by canonical horizon order
    horizon_order = {"1h": 0, "6h": 1, "1d": 2, "7d": 3, "30d": 4}
    horizons.sort(key=lambda h: horizon_order.get(h["horizon"], 99))
    return {"horizons": horizons, "lookback": lookback}
@app.get("/api/validation/gate-status")
 async def get_validation_gate_status():
    """Quality gate evaluation detail.
    Returns the stored gate evaluation result from risk_configs
    where name = 'model_quality_gate'.
    Requirement 12.7
    """
    gate_row = await pool.fetchrow(
        "SELECT config, updated_at FROM risk_configs WHERE name = 'model_quality_gate'",
    )
    if not gate_row:
        return {
            "gate_status": None,
            "message": "No gate evaluation found. Model metrics may not have been computed yet.",
        }
    gate_data = _parse_jsonb(gate_row["config"])
    updated_at = gate_row["updated_at"].isoformat() if gate_row.get("updated_at") else None
    return {
        "gate_status": gate_data,
        "updated_at": updated_at,
    }
 # ---------------------------------------------------------------------------
 # Attribution Endpoints  (Requirements 12.4, 12.5, 12.6)
 # ---------------------------------------------------------------------------
 _LOOKBACK_TO_DAYS: dict[str, int] = {
    "7d": 7,
    "30d": 30,
    "90d": 90,
    "all": 3650,  # "all" is approximated as roughly 10 years of history
 }
@app.get("/api/validation/attribution/sources")
 async def get_validation_attribution_sources(
    lookback: str = Query(default="30d"),
    horizon: str = Query(default="7d"),
 ):
    """Per-source performance metrics.
    Returns win rate, IC, average return, duplicate rate, and other
    attribution metrics for each source, computed over the given
    lookback window and prediction horizon.
    Requirement 12.4
    """
    if lookback not in _VALID_LOOKBACKS:
        raise HTTPException(400, f"Invalid lookback: {lookback}. Must be one of {sorted(_VALID_LOOKBACKS)}")
    if horizon not in _VALID_HORIZONS:
        raise HTTPException(400, f"Invalid horizon: {horizon}. Must be one of {sorted(_VALID_HORIZONS)}")
    lookback_days = _LOOKBACK_TO_DAYS[lookback]
    try:
        results = await compute_source_attribution(pool, lookback_days=lookback_days, horizon=horizon)
    except Exception:
        logger.exception("Failed to compute source attribution")
        raise HTTPException(500, "Failed to compute source attribution")
    return {
        "sources": [asdict(r) for r in results],
        "lookback": lookback,
        "horizon": horizon,
    }
@app.get("/api/validation/attribution/catalysts")
 async def get_validation_attribution_catalysts(
    lookback: str = Query(default="30d"),
    horizon: str = Query(default="7d"),
 ):
    """Per-catalyst-type performance metrics.
    Returns win rate, IC, average return, and other attribution metrics
    for each catalyst type, computed over the given lookback window
    and prediction horizon.
    Requirement 12.5
    """
    if lookback not in _VALID_LOOKBACKS:
        raise HTTPException(400, f"Invalid lookback: {lookback}. Must be one of {sorted(_VALID_LOOKBACKS)}")
    if horizon not in _VALID_HORIZONS:
        raise HTTPException(400, f"Invalid horizon: {horizon}. Must be one of {sorted(_VALID_HORIZONS)}")
    lookback_days = _LOOKBACK_TO_DAYS[lookback]
    try:
        results = await compute_catalyst_attribution(pool, lookback_days=lookback_days, horizon=horizon)
    except Exception:
        logger.exception("Failed to compute catalyst attribution")
        raise HTTPException(500, "Failed to compute catalyst attribution")
    return {
        "catalysts": [asdict(r) for r in results],
        "lookback": lookback,
        "horizon": horizon,
    }
@app.get("/api/validation/attribution/layers")
 async def get_validation_attribution_layers(
    lookback: str = Query(default="30d"),
    horizon: str = Query(default="7d"),
 ):
    """Per-signal-layer (company, macro, competitive) performance metrics.
    Returns average contribution percentage, dominant win rate, and
    dominant IC for each of the three signal layers, computed over
    the given lookback window and prediction horizon.
    Requirement 12.6
    """
    if lookback not in _VALID_LOOKBACKS:
        raise HTTPException(400, f"Invalid lookback: {lookback}. Must be one of {sorted(_VALID_LOOKBACKS)}")
    if horizon not in _VALID_HORIZONS:
        raise HTTPException(400, f"Invalid horizon: {horizon}. Must be one of {sorted(_VALID_HORIZONS)}")
    lookback_days = _LOOKBACK_TO_DAYS[lookback]
    try:
        results = await compute_layer_attribution(pool, lookback_days=lookback_days, horizon=horizon)
    except Exception:
        logger.exception("Failed to compute layer attribution")
        raise HTTPException(500, "Failed to compute layer attribution")
    return {
        "layers": [asdict(r) for r in results],
        "lookback": lookback,
        "horizon": horizon,
    }
@@ -48,6 +48,7 @@ from services.shared.schemas import (
    TrendSummary,
    TrendWindow,
 )
 from services.validation.prediction_snapshot import create_prediction_snapshot
 logger = logging.getLogger(__name__)
@@ -741,6 +742,92 @@ def _map_time_horizon_prefix(window: str) -> str:
    return mapping.get(window, "window_")
 # ---------------------------------------------------------------------------
 # Fetch evidence signals and docs for prediction snapshot (Requirement 1.1)
 # ---------------------------------------------------------------------------
 _EVIDENCE_SIGNALS_QUERY = """
 SELECT
    dir.document_id::text AS document_id,
    di.id::text AS signal_id,
    dir.ticker,
    d.source_type AS source,
    d.source_type,
    dir.catalyst_type,
    dir.sentiment,
    dir.impact_score AS impact,
    di.confidence AS extraction_confidence,
    di.source_credibility AS weight
 FROM document_impact_records dir
 JOIN document_intelligence di ON di.id = dir.intelligence_id
 JOIN documents d ON d.id = di.document_id
 WHERE dir.document_id = ANY($1::uuid[])
  AND di.validation_status = 'valid'
 """
 _EVIDENCE_DOCS_QUERY = """
 SELECT
    d.id::text AS document_id,
    COALESCE(d.title, '') AS title,
    COALESCE(d.url, '') AS url
 FROM documents d
 WHERE d.id = ANY($1::uuid[])
 """
 async def _fetch_evidence_for_snapshot(
    pool: asyncpg.Pool,
    document_ids: list[str],
 ) -> tuple[list[dict], list[dict]]:
    """Fetch evidence signals and document metadata for prediction snapshot.
    Filters out non-UUID document IDs (e.g. synthetic pattern IDs) since
    they cannot be looked up in the documents table.
    Returns (evidence_signals, evidence_docs).
    """
    # Filter to valid UUIDs only
    valid_ids: list[str] = []
    for doc_id in document_ids:
        try:
            _uuid.UUID(doc_id)
            valid_ids.append(doc_id)
        except (ValueError, AttributeError, TypeError):  # TypeError covers a None ID
            continue
    if not valid_ids:
        return [], []
    signal_rows = await pool.fetch(_EVIDENCE_SIGNALS_QUERY, valid_ids)
    evidence_signals = [
        {
            "document_id": row["document_id"],
            "signal_id": row["signal_id"],
            "ticker": row["ticker"] or "",
            "source": row["source"] or "",
            "source_type": row["source_type"] or "",
            "catalyst_type": row["catalyst_type"] or "",
            "sentiment": row["sentiment"] or "",
            "impact": float(row["impact"] or 0.0),
            "extraction_confidence": float(row["extraction_confidence"] or 0.0),
            "weight": float(row["weight"] or 0.0),
        }
        for row in signal_rows
    ]
    doc_rows = await pool.fetch(_EVIDENCE_DOCS_QUERY, valid_ids)
    evidence_docs = [
        {
            "document_id": row["document_id"],
            "title": row["title"],
            "url": row["url"],
        }
        for row in doc_rows
    ]
    return evidence_signals, evidence_docs
 async def generate_recommendation(
    pool: asyncpg.Pool,
    ticker: str,
@@ -847,6 +934,22 @@ async def generate_recommendation(
        eligibility_result=result,
    )
    # 7b. Capture prediction snapshot for model validation (Requirements 1.1, 1.6)
    try:
        all_doc_ids = list(summary.top_supporting_evidence) + list(summary.top_opposing_evidence)
        evidence_signals, evidence_docs = await _fetch_evidence_for_snapshot(
            pool, all_doc_ids,
        )
        await create_prediction_snapshot(
            pool, rec, summary, evidence_signals, evidence_docs,
        )
    except Exception:
        logger.warning(
            "Failed to create prediction snapshot for %s/%s — recommendation "
            "persisted but snapshot creation failed",
            ticker, rec_id, exc_info=True,
        )
    # 8. Publish prediction facts to analytical tables (Requirement 9.4)
    if minio_client is not None:
        try:
@@ -4,6 +4,10 @@ Task 32: Fetches historical recommendations from the database, simulates
 the decision logic chronologically using evaluate_recommendation(), tracks
 simulated positions and equity curve, and persists results to backtest_runs
 and backtest_trades tables.
 Supports a validation mode (Requirements 15.1–15.5) that generates prediction
 snapshots and evaluates outcomes using only data available at each historical
 point in time, preventing future data leakage.
 """
 from __future__ import annotations
@@ -39,12 +43,22 @@ class BacktestReplay:
        self.pool = pool
        self._perf = PerformanceComputer()
-    async def run(self, config: BacktestConfig, backtest_id: str | None = None) -> BacktestResult:
+    async def run(
        self,
        config: BacktestConfig,
        backtest_id: str | None = None,
        validation_mode: bool = False,
    ) -> BacktestResult:
        """Execute a full backtest replay.
        Args:
            config: Backtest configuration (date range, capital, risk tier).
            backtest_id: Optional pre-generated ID. If not provided, one is generated.
            validation_mode: When True, creates prediction snapshots for each
                historical recommendation using only data available at that point
                in time, evaluates outcomes, and computes model metrics over the
                backtest period. Snapshots are tagged with the backtest_id.
                (Requirements 15.1–15.5)
        Returns:
            BacktestResult with metrics, trade log, and equity curve.
@@ -87,6 +101,7 @@ class BacktestReplay:
            daily_returns: list[float] = []
            prev_value = config.initial_capital
            trade_log: list[dict] = []
            validation_snapshot_ids: list[str] = []  # track snapshot IDs for validation mode
            # Pre-load company sectors and latest prices for enrichment
            company_sectors: dict[str, str] = {}
@@ -172,6 +187,25 @@ class BacktestReplay:
                        now=sim_time,
                    )
                    # --- Validation mode: create prediction snapshot (Req 15.1, 15.2, 15.4) ---
                    if validation_mode and self.pool is not None:
                        try:
                            snapshot_id = await self._create_validation_snapshot(
                                rec=rec,
                                sim_time=sim_time,
                                backtest_id=backtest_id,
                                company_sectors=company_sectors,
                            )
                            if snapshot_id is not None:
                                validation_snapshot_ids.append(snapshot_id)
                        except Exception:
                            logger.warning(
                                "Validation snapshot failed for %s at %s, continuing backtest",
                                rec.get("ticker", "?"),
                                sim_time,
                                exc_info=True,
                            )
                    if decision.decision == "act":
                        act_count += 1
                        ticker = decision.ticker
@@ -348,6 +382,10 @@ class BacktestReplay:
            # Persist results
            await self._persist_results(result, closed_trades)
            # --- Validation mode: evaluate outcomes and compute metrics (Req 15.3, 15.5) ---
            if validation_mode and self.pool is not None and validation_snapshot_ids:
                await self._run_validation_evaluation(backtest_id)
            return result
        except Exception as exc:
@@ -356,6 +394,210 @@ class BacktestReplay:
            await self._persist_failed_run(backtest_id, config, str(exc))
            raise
    # ------------------------------------------------------------------
    # Validation mode helpers (Requirements 15.1–15.5)
    # ------------------------------------------------------------------
    # SQL to fetch the close price at or before a specific time — prevents
    # future data leakage by only returning data available at that point.
    _CLOSE_AT_TIME_SQL = """
    SELECT (data->>'c')::float AS close
    FROM market_snapshots
    WHERE ticker = $1
      AND snapshot_type = 'bar'
      AND data->>'c' IS NOT NULL
      AND captured_at <= $2
    ORDER BY captured_at DESC
    LIMIT 1
    """
    _COMPANY_SECTOR_SQL = """
    SELECT sector FROM companies WHERE ticker = $1 AND active = TRUE LIMIT 1
    """
    _SECTOR_ETF_MAP: dict[str, str] = {
        "Technology": "XLK",
        "Consumer Cyclical": "XLY",
        "Financial Services": "XLF",
        "Healthcare": "XLV",
        "Energy": "XLE",
        "Communication Services": "XLC",
        "Industrials": "XLI",
        "Consumer Defensive": "XLP",
        "Real Estate": "XLRE",
        "Utilities": "XLU",
    }
    async def _fetch_close_at_time(
        self,
        ticker: str,
        target_time: datetime,
    ) -> float | None:
        """Fetch the close price for *ticker* at or before *target_time*.
        Ensures no future data leakage — only market data with
        ``captured_at <= target_time`` is considered (Requirement 15.4).
        """
        if self.pool is None:
            return None
        row = await self.pool.fetchrow(self._CLOSE_AT_TIME_SQL, ticker, target_time)
        if row is None:
            return None
        return row["close"]
    async def _create_validation_snapshot(
        self,
        rec: dict,
        sim_time: datetime,
        backtest_id: str,
        company_sectors: dict[str, str],
    ) -> str | None:
        """Create a prediction snapshot using only data available at *sim_time*.
        Fetches ticker, SPY, and sector ETF prices as of *sim_time* to prevent
        future data leakage (Requirements 15.1, 15.2, 15.4).  The snapshot is
        tagged with *backtest_id* in its metadata field (Requirement 15.5).
        Returns the snapshot UUID on success, or ``None`` on failure.
        """
        from services.validation.prediction_snapshot import (
            SECTOR_ETF_MAP,
        )
        ticker = rec.get("ticker", "")
        if not ticker:
            return None
        # Fetch prices using only data available at sim_time (Req 15.4)
        ticker_price = await self._fetch_close_at_time(ticker, sim_time)
        spy_price = await self._fetch_close_at_time("SPY", sim_time)
        # Sector ETF price
        sector = company_sectors.get(ticker)
        sector_etf_ticker = SECTOR_ETF_MAP.get(sector) if sector else None
        sector_etf_price: float | None = None
        if sector_etf_ticker is not None:
            sector_etf_price = await self._fetch_close_at_time(
                sector_etf_ticker, sim_time
            )
        snapshot_id = str(uuid.uuid4())
        # Build metadata tagged with backtest_id (Req 15.5)
        metadata: dict = {
            "backtest_id": backtest_id,
            "source": "backtest_validation",
        }
        # Map recommendation fields to snapshot columns
        direction = rec.get("direction", rec.get("trend_direction", "neutral"))
        action = rec.get("action", "watch")
        mode = rec.get("mode", "informational")
        confidence = float(rec.get("confidence", 0.5))
        strength = float(rec.get("strength", rec.get("trend_strength", 0.5)))
        contradiction = float(rec.get("contradiction", rec.get("contradiction_score", 0.0)))
        p_bull = rec.get("p_bull")
        if p_bull is not None:
            p_bull = float(p_bull)
        p_bear = (1.0 - p_bull) if p_bull is not None else None
        window = rec.get("window", rec.get("trend_window", "7d"))
        horizon = rec.get("time_horizon", rec.get("horizon", "7d"))
        # Insert the snapshot directly. We bypass create_prediction_snapshot()
        # because that function fetches *latest* prices (not point-in-time).
        # "window" is double-quoted because WINDOW is a reserved word in PostgreSQL.
        insert_sql = """
        INSERT INTO prediction_snapshots (
            id, generated_at, ticker, "window", horizon, direction, action, mode,
            strength, confidence, contradiction, p_bull, p_bear,
            score_company, score_macro, score_competitive,
            evidence_count, unique_source_count, duplicate_evidence_count,
            price_at_prediction, spy_price_at_prediction,
            sector_etf_price_at_prediction, metadata
        ) VALUES (
            $1::uuid, $2, $3, $4, $5, $6, $7, $8,
            $9, $10, $11, $12, $13,
            $14, $15, $16,
            $17, $18, $19,
            $20, $21, $22,
            $23::jsonb
        )
        """
        await self.pool.execute(
            insert_sql,
            snapshot_id,
            sim_time,
            ticker,
            str(window),
            str(horizon),
            str(direction),
            str(action),
            str(mode),
            strength,
            confidence,
            contradiction,
            p_bull,
            p_bear,
            float(rec.get("score_company", 0.0)),
            float(rec.get("score_macro", 0.0)),
            float(rec.get("score_competitive", 0.0)),
            int(rec.get("evidence_count", 0)),
            int(rec.get("unique_source_count", 0)),
            int(rec.get("duplicate_evidence_count", 0)),
            ticker_price,
            spy_price,
            sector_etf_price,
            json.dumps(metadata),
        )
        logger.debug(
            "Validation snapshot %s created for %s at %s (backtest %s)",
            snapshot_id,
            ticker,
            sim_time,
            backtest_id,
        )
        return snapshot_id
    async def _run_validation_evaluation(self, backtest_id: str) -> None:
        """Evaluate prediction outcomes and compute metrics for the backtest.
        Calls the outcome evaluator and metrics engine after the backtest
        completes (Requirements 15.3, 15.5).  Failures are logged but do
        not block the backtest result.
        """
        from services.validation.metrics import compute_and_store_metric_snapshots
        from services.validation.outcome_evaluator import evaluate_matured_predictions
        # Step 1: Evaluate matured predictions (Req 15.3)
        try:
            outcomes_count = await evaluate_matured_predictions(self.pool)
            logger.info(
                "Backtest %s validation: %d prediction outcomes evaluated",
                backtest_id,
                outcomes_count,
            )
        except Exception:
            logger.warning(
                "Backtest %s: outcome evaluation failed, continuing",
                backtest_id,
                exc_info=True,
            )
        # Step 2: Compute and store metric snapshots (Req 15.5)
        try:
            snapshots = await compute_and_store_metric_snapshots(self.pool)
            logger.info(
                "Backtest %s validation: %d metric snapshots computed",
                backtest_id,
                len(snapshots),
            )
        except Exception:
            logger.warning(
                "Backtest %s: metric snapshot computation failed, continuing",
                backtest_id,
                exc_info=True,
            )
    # ------------------------------------------------------------------
    # Database helpers
    # ------------------------------------------------------------------
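The point-in-time rule behind `_fetch_close_at_time` (only bars with `captured_at <= target_time` are visible) can be illustrated in memory. This is a minimal sketch, not the repository's implementation: `close_at_or_before` and the sample bars are hypothetical, and it assumes the bars are already sorted ascending by capture time.

```python
from __future__ import annotations

from bisect import bisect_right
from datetime import datetime, timezone


def close_at_or_before(
    bars: list[tuple[datetime, float]],
    target_time: datetime,
) -> float | None:
    """Return the latest close captured at or before target_time, else None.

    Hypothetical helper: `bars` is a list of (captured_at, close) pairs
    sorted ascending by captured_at.
    """
    times = [captured_at for captured_at, _ in bars]
    # bisect_right counts every bar with captured_at <= target_time,
    # so later bars are never visible to the caller.
    idx = bisect_right(times, target_time)
    if idx == 0:
        return None  # no market data existed yet at target_time
    return bars[idx - 1][1]


# Illustrative data only (not real prices).
bars = [
    (datetime(2024, 1, 2, 21, 0, tzinfo=timezone.utc), 100.0),
    (datetime(2024, 1, 3, 21, 0, tzinfo=timezone.utc), 102.0),
    (datetime(2024, 1, 4, 21, 0, tzinfo=timezone.utc), 99.0),
]
midday = datetime(2024, 1, 3, 12, 0, tzinfo=timezone.utc)
print(close_at_or_before(bars, midday))
```

At midday on the second day only the first bar is visible, which mirrors the `ORDER BY captured_at DESC LIMIT 1` query with its `captured_at <= $2` filter.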
@@ -0,0 +1,329 @@
 """Quality gate for live trading eligibility.
 Evaluates aggregate model metrics against configurable thresholds and
 determines whether the system meets minimum quality standards for live
 trading.  When any threshold is not met, the gate forces all
 recommendations to paper mode (fail-safe).
 Requirements: 11.1, 11.2, 11.3, 11.4, 11.5, 11.6, 11.7
 """
 from __future__ import annotations
 import json
 import logging
 from dataclasses import asdict, dataclass, field
 from datetime import datetime, timezone
 import asyncpg
 logger = logging.getLogger("trading_engine.quality_gate")
 # ---------------------------------------------------------------------------
 # Data classes
 # ---------------------------------------------------------------------------
@dataclass
 class QualityGateConfig:
    """Configurable thresholds for live trading eligibility."""
    min_prediction_count: int = 100
    min_ic: float = 0.03
    min_win_rate: float = 0.53
    max_ece: float = 0.15
    min_excess_return_vs_spy: float = 0.0
    max_snapshot_age_hours: int = 24
@dataclass
 class GateThresholdResult:
    """Result for a single threshold check."""
    name: str
    threshold: float
    actual: float
    passed: bool
@dataclass
 class QualityGateResult:
    """Full gate evaluation result."""
    passed: bool
    evaluated_at: datetime
    threshold_results: list[GateThresholdResult] = field(default_factory=list)
    reason: str = ""
    snapshot_id: str | None = None
    config: QualityGateConfig = field(default_factory=QualityGateConfig)
 # ---------------------------------------------------------------------------
 # Threshold evaluation helpers
 # ---------------------------------------------------------------------------
 def _evaluate_thresholds(
    snapshot: dict,
    config: QualityGateConfig,
 ) -> list[GateThresholdResult]:
    """Evaluate each threshold against snapshot metric values."""
    results: list[GateThresholdResult] = []
    # min_prediction_count
    actual_count = snapshot.get("prediction_count") or 0
    results.append(
        GateThresholdResult(
            name="min_prediction_count",
            threshold=float(config.min_prediction_count),
            actual=float(actual_count),
            passed=actual_count >= config.min_prediction_count,
        )
    )
    # min_ic
    actual_ic = snapshot.get("information_coefficient")
    if actual_ic is None:
        actual_ic = 0.0
    results.append(
        GateThresholdResult(
            name="min_ic",
            threshold=config.min_ic,
            actual=float(actual_ic),
            passed=float(actual_ic) >= config.min_ic,
        )
    )
    # min_win_rate
    actual_wr = snapshot.get("win_rate")
    if actual_wr is None:
        actual_wr = 0.0
    results.append(
        GateThresholdResult(
            name="min_win_rate",
            threshold=config.min_win_rate,
            actual=float(actual_wr),
            passed=float(actual_wr) >= config.min_win_rate,
        )
    )
    # max_ece (calibration_error)
    actual_ece = snapshot.get("calibration_error")
    if actual_ece is None:
        actual_ece = 1.0  # worst-case when missing
    results.append(
        GateThresholdResult(
            name="max_ece",
            threshold=config.max_ece,
            actual=float(actual_ece),
            passed=float(actual_ece) <= config.max_ece,
        )
    )
    # min_excess_return_vs_spy
    # A missing metric must fail the check (fail-safe): with the default
    # threshold of 0.0, a 0.0 placeholder alone would otherwise pass.
    raw_excess = snapshot.get("avg_excess_return_vs_spy")
    actual_excess = 0.0 if raw_excess is None else float(raw_excess)
    results.append(
        GateThresholdResult(
            name="min_excess_return_vs_spy",
            threshold=config.min_excess_return_vs_spy,
            actual=actual_excess,
            passed=raw_excess is not None
            and actual_excess >= config.min_excess_return_vs_spy,
        )
    )
    return results
 # ---------------------------------------------------------------------------
 # Public API
 # ---------------------------------------------------------------------------
 async def evaluate_quality_gate(
    pool: asyncpg.Pool,
    config: QualityGateConfig | None = None,
 ) -> QualityGateResult:
    """Evaluate model quality gate from latest metric snapshot.
    Reads the most recent ``model_metric_snapshots`` row for the 30d lookback
    and 7d horizon (the primary evaluation window).
    If no snapshot exists or the snapshot is stale (older than
    ``max_snapshot_age_hours``), defaults to paper-only mode (fail-safe).
    Stores result in ``risk_configs`` under ``'model_quality_gate'`` key.
    """
    if config is None:
        config = await load_gate_config_from_db(pool)
    now = datetime.now(tz=timezone.utc)
    # Fetch the most recent metric snapshot for 30d lookback / 7d horizon
    try:
        row = await pool.fetchrow(
            """SELECT id, generated_at, prediction_count, win_rate,
                      directional_accuracy, information_coefficient,
                      rank_information_coefficient, avg_return,
                      avg_excess_return_vs_spy, avg_excess_return_vs_sector,
                      calibration_error, brier_score,
                      buy_win_rate, sell_win_rate, hold_win_rate
               FROM model_metric_snapshots
               WHERE lookback_window = '30d' AND horizon = '7d'
               ORDER BY generated_at DESC
               LIMIT 1""",
        )
    except Exception:
        logger.exception("Failed to query model_metric_snapshots")
        row = None
    # Fail-safe: no snapshot exists
    if row is None:
        result = QualityGateResult(
            passed=False,
            evaluated_at=now,
            threshold_results=[],
            reason="no model metric snapshot available — defaulting to paper-only",
            snapshot_id=None,
            config=config,
        )
        logger.warning("Quality gate: %s", result.reason)
        await _store_gate_result(pool, result)
        return result
    snapshot = dict(row)
    snapshot_id = str(snapshot["id"])
    generated_at = snapshot["generated_at"]
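    # Fail-safe: stale snapshot
    # A naive timestamp is assumed to be UTC so the subtraction cannot raise.
    if generated_at.tzinfo is None:
        generated_at = generated_at.replace(tzinfo=timezone.utc)
    age_hours = (now - generated_at).total_seconds() / 3600.0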
    if age_hours > config.max_snapshot_age_hours:
        result = QualityGateResult(
            passed=False,
            evaluated_at=now,
            threshold_results=[],
            reason=(
                f"most recent snapshot is {age_hours:.1f}h old "
                f"(max {config.max_snapshot_age_hours}h) — defaulting to paper-only"
            ),
            snapshot_id=snapshot_id,
            config=config,
        )
        logger.warning("Quality gate: %s", result.reason)
        await _store_gate_result(pool, result)
        return result
    # Evaluate thresholds
    threshold_results = _evaluate_thresholds(snapshot, config)
    failed = [r for r in threshold_results if not r.passed]
    if failed:
        failed_names = ", ".join(
            f"{r.name}(actual={r.actual:.4f}, threshold={r.threshold:.4f})"
            for r in failed
        )
        reason = f"failed: {failed_names}"
        passed = False
    else:
        reason = "all thresholds met"
        passed = True
    result = QualityGateResult(
        passed=passed,
        evaluated_at=now,
        threshold_results=threshold_results,
        reason=reason,
        snapshot_id=snapshot_id,
        config=config,
    )
    # Log details
    for tr in threshold_results:
        logger.info(
            "Quality gate threshold %s: actual=%.4f threshold=%.4f %s",
            tr.name,
            tr.actual,
            tr.threshold,
            "PASS" if tr.passed else "FAIL",
        )
    logger.info("Quality gate result: %s — %s", "PASS" if passed else "FAIL", reason)
    await _store_gate_result(pool, result)
    return result
 async def load_gate_config_from_db(
    pool: asyncpg.Pool,
 ) -> QualityGateConfig:
    """Load gate thresholds from risk_configs, with defaults.
    Looks for a ``risk_configs`` row with ``name = 'model_quality_gate_config'``.
    If found, merges stored thresholds over the defaults.  If not found or
    the stored JSON is invalid, returns the default config.
    """
    defaults = QualityGateConfig()
    try:
        row = await pool.fetchrow(
            "SELECT config FROM risk_configs WHERE name = 'model_quality_gate_config'",
        )
    except Exception:
        logger.warning(
            "Failed to load gate config from risk_configs, using defaults",
            exc_info=True,
        )
        return defaults
    if row is None:
        return defaults
    try:
        raw = row["config"]
        cfg = raw if isinstance(raw, dict) else json.loads(raw)
        # Merge stored thresholds over the defaults. A non-dict payload or a
        # non-numeric value falls through to the except branch below.
        return QualityGateConfig(
            min_prediction_count=int(cfg.get("min_prediction_count", defaults.min_prediction_count)),
            min_ic=float(cfg.get("min_ic", defaults.min_ic)),
            min_win_rate=float(cfg.get("min_win_rate", defaults.min_win_rate)),
            max_ece=float(cfg.get("max_ece", defaults.max_ece)),
            min_excess_return_vs_spy=float(
                cfg.get("min_excess_return_vs_spy", defaults.min_excess_return_vs_spy)
            ),
            max_snapshot_age_hours=int(
                cfg.get("max_snapshot_age_hours", defaults.max_snapshot_age_hours)
            ),
        )
    except (json.JSONDecodeError, TypeError, ValueError, AttributeError):
        logger.warning("Invalid gate config JSON in risk_configs, using defaults")
        return defaults
 # ---------------------------------------------------------------------------
 # Internal helpers
 # ---------------------------------------------------------------------------
 def _gate_result_to_json(result: QualityGateResult) -> str:
    """Serialize a QualityGateResult to JSON for storage in risk_configs."""
    payload = {
        "passed": result.passed,
        "evaluated_at": result.evaluated_at.isoformat(),
        "reason": result.reason,
        "snapshot_id": result.snapshot_id,
        "config": asdict(result.config),
        "threshold_results": [asdict(tr) for tr in result.threshold_results],
    }
    return json.dumps(payload, default=str)
 async def _store_gate_result(pool: asyncpg.Pool, result: QualityGateResult) -> None:
    """Upsert gate evaluation result into risk_configs."""
    payload = _gate_result_to_json(result)
    try:
        await pool.execute(
            """INSERT INTO risk_configs (name, config, updated_at)
               VALUES ('model_quality_gate', $1::jsonb, NOW())
               ON CONFLICT (name) WHERE active = TRUE
               DO UPDATE SET config = $1::jsonb, updated_at = NOW()""",
            payload,
        )
    except Exception:
        logger.exception("Failed to store quality gate result in risk_configs")
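The threshold logic of the gate above can be exercised without a database. The following is a simplified sketch under stated assumptions, not the module's code: `Check`, `check_min`, `check_max`, and `evaluate` are hypothetical names, the thresholds are copied from the `QualityGateConfig` defaults, and a missing metric is treated as a failed check as an illustrative fail-safe choice.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Check:
    """Hypothetical stand-in for GateThresholdResult."""
    name: str
    threshold: float
    actual: float | None
    passed: bool


def check_min(name: str, actual: float | None, threshold: float) -> Check:
    # Assumption for this sketch: a missing metric fails the check.
    return Check(name, threshold, actual, actual is not None and actual >= threshold)


def check_max(name: str, actual: float | None, threshold: float) -> Check:
    return Check(name, threshold, actual, actual is not None and actual <= threshold)


def evaluate(snapshot: dict) -> tuple[bool, list[Check]]:
    """Run all five checks; the gate passes only if every check passes."""
    checks = [
        check_min("min_prediction_count", snapshot.get("prediction_count"), 100),
        check_min("min_ic", snapshot.get("information_coefficient"), 0.03),
        check_min("min_win_rate", snapshot.get("win_rate"), 0.53),
        check_max("max_ece", snapshot.get("calibration_error"), 0.15),
        check_min("min_excess_return_vs_spy", snapshot.get("avg_excess_return_vs_spy"), 0.0),
    ]
    return all(c.passed for c in checks), checks


# Illustrative metric values only.
good = {
    "prediction_count": 250,
    "information_coefficient": 0.05,
    "win_rate": 0.56,
    "calibration_error": 0.08,
    "avg_excess_return_vs_spy": 0.004,
}
passed, checks = evaluate(dict(good, calibration_error=0.2))
print(passed, [c.name for c in checks if not c.passed])
```

A single failing threshold is enough to fail the gate, which matches the module docstring's statement that any unmet threshold forces paper mode.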
@@ -0,0 +1 @@
@@ -0,0 +1,591 @@
 """Attribution Engine — per-source, per-catalyst, and per-layer performance.
 Joins signal evidence links with prediction outcomes to compute attribution
 metrics that identify which sources, catalyst types, and signal layers
 contribute most to accurate predictions.
 Requirements: 7.1, 7.2, 7.3, 7.4, 7.5, 7.6, 7.7
 """
 from __future__ import annotations
 import logging
 import math
 from dataclasses import dataclass
 from datetime import datetime, timedelta
 import asyncpg
 logger = logging.getLogger(__name__)
 # ---------------------------------------------------------------------------
 # Dataclasses
 # ---------------------------------------------------------------------------
@dataclass
 class SourceAttribution:
    """Performance metrics for a single source."""
    source: str
    source_type: str
    prediction_count: int
    avg_weight: float
    avg_contribution_score: float
    win_rate: float
    avg_future_return: float
    avg_excess_return_vs_spy: float
    information_coefficient: float | None
    duplicate_rate: float
@dataclass
 class CatalystAttribution:
    """Performance metrics for a single catalyst type."""
    catalyst_type: str
    prediction_count: int
    win_rate: float
    avg_future_return: float
    avg_excess_return_vs_spy: float
    information_coefficient: float | None
@dataclass
 class LayerAttribution:
    """Performance metrics for a signal layer."""
    layer: str  # company, macro, competitive
    avg_contribution_pct: float
    dominant_win_rate: float  # win rate when this layer > 30% contribution
    dominant_ic: float | None  # IC when this layer > 30% contribution
 # ---------------------------------------------------------------------------
 # Pure computation helpers
 # ---------------------------------------------------------------------------
 def _pearson_correlation(xs: list[float], ys: list[float]) -> float | None:
    """Compute Pearson correlation coefficient between two lists.
    Returns None if the lists differ in length, have fewer than 2 elements,
    or if either has zero variance. Guards against NaN/infinity.
    """
    n = len(xs)
    if n < 2 or len(ys) != n:
        return None
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = 0.0
    var_x = 0.0
    var_y = 0.0
    for x, y in zip(xs, ys):
        dx = x - mean_x
        dy = y - mean_y
        cov += dx * dy
        var_x += dx * dx
        var_y += dy * dy
    if var_x == 0.0 or var_y == 0.0:
        return None
    r = cov / math.sqrt(var_x * var_y)
    if math.isnan(r) or math.isinf(r):
        return None
    return max(-1.0, min(1.0, r))
 def _compute_ic(
    contribution_scores: list[float],
    future_returns: list[float],
 ) -> float | None:
    """Compute IC (Pearson correlation) between contribution scores and returns.
    Returns None when fewer than 30 data points.
    """
    if len(contribution_scores) < 30 or len(future_returns) < 30:
        return None
    n = min(len(contribution_scores), len(future_returns))
    return _pearson_correlation(contribution_scores[:n], future_returns[:n])
 # ---------------------------------------------------------------------------
 # SQL queries — source attribution via v_source_performance
 # ---------------------------------------------------------------------------
 _SOURCE_ATTRIBUTION_SQL = """
 SELECT
    source,
    source_type,
    weight,
    contribution_score,
    is_duplicate,
    direction_correct,
    future_return,
    excess_return_vs_spy
 FROM v_source_performance
 WHERE horizon = $1
  AND generated_at >= $2
 """
 _SOURCE_ATTRIBUTION_ALL_SQL = """
 SELECT
    source,
    source_type,
    weight,
    contribution_score,
    is_duplicate,
    direction_correct,
    future_return,
    excess_return_vs_spy
 FROM v_source_performance
 WHERE horizon = $1
 """
 # ---------------------------------------------------------------------------
 # SQL queries — catalyst attribution via v_source_performance
 # ---------------------------------------------------------------------------
 _CATALYST_ATTRIBUTION_SQL = """
 SELECT
    catalyst_type,
    weight,
    contribution_score,
    direction_correct,
    future_return,
    excess_return_vs_spy
 FROM v_source_performance
 WHERE horizon = $1
  AND generated_at >= $2
 """
 _CATALYST_ATTRIBUTION_ALL_SQL = """
 SELECT
    catalyst_type,
    weight,
    contribution_score,
    direction_correct,
    future_return,
    excess_return_vs_spy
 FROM v_source_performance
 WHERE horizon = $1
 """
 # ---------------------------------------------------------------------------
 # SQL queries — layer attribution via prediction_snapshots + outcomes
 # ---------------------------------------------------------------------------
 _LAYER_ATTRIBUTION_SQL = """
 SELECT
    ps.score_company,
    ps.score_macro,
    ps.score_competitive,
    po.direction_correct,
    po.future_return
 FROM prediction_snapshots ps
 JOIN prediction_outcomes po ON po.prediction_id = ps.id
 WHERE po.horizon = $1
  AND ps.generated_at >= $2
 """
 _LAYER_ATTRIBUTION_ALL_SQL = """
 SELECT
    ps.score_company,
    ps.score_macro,
    ps.score_competitive,
    po.direction_correct,
    po.future_return
 FROM prediction_snapshots ps
 JOIN prediction_outcomes po ON po.prediction_id = ps.id
 WHERE po.horizon = $1
 """
 # ---------------------------------------------------------------------------
 # Source attribution (Requirements 7.1, 7.2, 7.7)
 # ---------------------------------------------------------------------------
 async def compute_source_attribution(
    pool: asyncpg.Pool,
    lookback_days: int = 30,
    horizon: str = "7d",
 ) -> list[SourceAttribution]:
    """Compute per-source performance metrics.
    Queries v_source_performance, groups by source, and computes:
    prediction count, avg weight, avg contribution score, win rate,
    avg future return, avg excess return vs SPY, IC, and duplicate rate.
    Returns a list of SourceAttribution sorted by prediction count descending.
    """
    now = datetime.now().astimezone()
    cutoff = now - timedelta(days=lookback_days)
    try:
        rows = await pool.fetch(_SOURCE_ATTRIBUTION_SQL, horizon, cutoff)
    except Exception:
        logger.exception(
            "Failed to query source attribution for horizon=%s lookback=%dd",
            horizon,
            lookback_days,
        )
        return []
    if not rows:
        return []
    # Group rows by source
    source_groups: dict[str, list[dict]] = {}
    for row in rows:
        r = dict(row)
        key = r.get("source") or "unknown"
        source_groups.setdefault(key, []).append(r)
    results: list[SourceAttribution] = []
    for source, group in source_groups.items():
        count = len(group)
        # Source type: taken from the first row in the group
        source_type = group[0].get("source_type") or "unknown"
        # Avg weight
        weights = [r["weight"] for r in group if r.get("weight") is not None]
        avg_weight = sum(weights) / len(weights) if weights else 0.0
        # Avg contribution score
        contrib_scores = [
            r["contribution_score"]
            for r in group
            if r.get("contribution_score") is not None
        ]
        avg_contribution_score = (
            sum(contrib_scores) / len(contrib_scores) if contrib_scores else 0.0
        )
        # Win rate
        direction_rows = [r for r in group if r.get("direction_correct") is not None]
        win_count = sum(1 for r in direction_rows if r["direction_correct"] is True)
        win_rate = win_count / len(direction_rows) if direction_rows else 0.0
        # Avg future return
        returns = [
            r["future_return"] for r in group if r.get("future_return") is not None
        ]
        avg_future_return = sum(returns) / len(returns) if returns else 0.0
        # Avg excess return vs SPY
        excess_returns = [
            r["excess_return_vs_spy"]
            for r in group
            if r.get("excess_return_vs_spy") is not None
        ]
        avg_excess_return_vs_spy = (
            sum(excess_returns) / len(excess_returns) if excess_returns else 0.0
        )
        # IC: correlation between contribution scores and future returns
        ic_scores = [
            r["contribution_score"]
            for r in group
            if r.get("contribution_score") is not None
            and r.get("future_return") is not None
        ]
        ic_returns = [
            r["future_return"]
            for r in group
            if r.get("contribution_score") is not None
            and r.get("future_return") is not None
        ]
        ic = _compute_ic(ic_scores, ic_returns)
        # Duplicate rate: is_duplicate=true / total
        dup_count = sum(1 for r in group if r.get("is_duplicate") is True)
        duplicate_rate = dup_count / count
        results.append(
            SourceAttribution(
                source=source,
                source_type=source_type,
                prediction_count=count,
                avg_weight=avg_weight,
                avg_contribution_score=avg_contribution_score,
                win_rate=win_rate,
                avg_future_return=avg_future_return,
                avg_excess_return_vs_spy=avg_excess_return_vs_spy,
                information_coefficient=ic,
                duplicate_rate=duplicate_rate,
            )
        )
    # Sort by prediction count descending
    results.sort(key=lambda a: a.prediction_count, reverse=True)
    logger.info(
        "Computed source attribution for %d sources (horizon=%s, lookback=%dd)",
        len(results),
        horizon,
        lookback_days,
    )
    return results
 # ---------------------------------------------------------------------------
 # Catalyst attribution (Requirements 7.3, 7.4)
 # ---------------------------------------------------------------------------
 async def compute_catalyst_attribution(
    pool: asyncpg.Pool,
    lookback_days: int = 30,
    horizon: str = "7d",
 ) -> list[CatalystAttribution]:
    """Compute per-catalyst-type performance metrics.
    Queries v_source_performance, groups by catalyst_type, and computes:
    prediction count, win rate, avg future return, avg excess return vs SPY,
    and IC.
    Returns a list of CatalystAttribution sorted by prediction count descending.
    """
    now = datetime.now().astimezone()
    cutoff = now - timedelta(days=lookback_days)
    try:
        rows = await pool.fetch(_CATALYST_ATTRIBUTION_SQL, horizon, cutoff)
    except Exception:
        logger.exception(
            "Failed to query catalyst attribution for horizon=%s lookback=%dd",
            horizon,
            lookback_days,
        )
        return []
    if not rows:
        return []
    # Group rows by catalyst_type
    catalyst_groups: dict[str, list[dict]] = {}
    for row in rows:
        r = dict(row)
        key = r.get("catalyst_type") or "unknown"
        catalyst_groups.setdefault(key, []).append(r)
    results: list[CatalystAttribution] = []
    for catalyst_type, group in catalyst_groups.items():
        count = len(group)
        # Win rate
        direction_rows = [r for r in group if r.get("direction_correct") is not None]
        win_count = sum(1 for r in direction_rows if r["direction_correct"] is True)
        win_rate = win_count / len(direction_rows) if direction_rows else 0.0
        # Avg future return
        returns = [
            r["future_return"] for r in group if r.get("future_return") is not None
        ]
        avg_future_return = sum(returns) / len(returns) if returns else 0.0
        # Avg excess return vs SPY
        excess_returns = [
            r["excess_return_vs_spy"]
            for r in group
            if r.get("excess_return_vs_spy") is not None
        ]
        avg_excess_return_vs_spy = (
            sum(excess_returns) / len(excess_returns) if excess_returns else 0.0
        )
        # IC: correlation between contribution scores and future returns
        ic_scores = [
            r["contribution_score"]
            for r in group
            if r.get("contribution_score") is not None
            and r.get("future_return") is not None
        ]
        ic_returns = [
            r["future_return"]
            for r in group
            if r.get("contribution_score") is not None
            and r.get("future_return") is not None
        ]
        ic = _compute_ic(ic_scores, ic_returns)
        results.append(
            CatalystAttribution(
                catalyst_type=catalyst_type,
                prediction_count=count,
                win_rate=win_rate,
                avg_future_return=avg_future_return,
                avg_excess_return_vs_spy=avg_excess_return_vs_spy,
                information_coefficient=ic,
            )
        )
    # Sort by prediction count descending
    results.sort(key=lambda a: a.prediction_count, reverse=True)
    logger.info(
        "Computed catalyst attribution for %d catalyst types "
        "(horizon=%s, lookback=%dd)",
        len(results),
        horizon,
        lookback_days,
    )
    return results
 # ---------------------------------------------------------------------------
 # Layer attribution (Requirements 7.5, 7.6)
 # ---------------------------------------------------------------------------
 async def compute_layer_attribution(
    pool: asyncpg.Pool,
    lookback_days: int = 30,
    horizon: str = "7d",
 ) -> list[LayerAttribution]:
    """Compute per-layer (company, macro, competitive) performance metrics.
    Queries prediction_snapshots joined with prediction_outcomes to get
    score_company, score_macro, score_competitive alongside outcomes.
    For each layer computes:
    - avg_contribution_pct: average of layer_score / total_score across all
      predictions (where total_score > 0)
    - dominant_win_rate: win rate for predictions where the layer contributes
      more than 30% of the total score
    - dominant_ic: IC (Pearson correlation between layer score and future
      return) for predictions where the layer contributes > 30%
    Returns a list of 3 LayerAttribution objects (company, macro, competitive),
    or an empty list if the query fails.
    """
    now = datetime.now().astimezone()
    cutoff = now - timedelta(days=lookback_days)
    try:
        rows = await pool.fetch(_LAYER_ATTRIBUTION_SQL, horizon, cutoff)
    except Exception:
        logger.exception(
            "Failed to query layer attribution for horizon=%s lookback=%dd",
            horizon,
            lookback_days,
        )
        return []
    if not rows:
        # No rows: return a zeroed entry per layer so callers always see 3 items
        return [
            LayerAttribution(
                layer=layer_name,
                avg_contribution_pct=0.0,
                dominant_win_rate=0.0,
                dominant_ic=None,
            )
            for layer_name in ("company", "macro", "competitive")
        ]
    row_dicts = [dict(r) for r in rows]
    layers = [
        ("company", "score_company"),
        ("macro", "score_macro"),
        ("competitive", "score_competitive"),
    ]
    results: list[LayerAttribution] = []
    for layer_name, score_field in layers:
        # --- Average contribution percentage ---
        contribution_pcts: list[float] = []
        for r in row_dicts:
            total = (
                (r.get("score_company") or 0.0)
                + (r.get("score_macro") or 0.0)
                + (r.get("score_competitive") or 0.0)
            )
            if total > 0.0:
                layer_score = r.get(score_field) or 0.0
                contribution_pcts.append(layer_score / total)
        avg_contribution_pct = (
            sum(contribution_pcts) / len(contribution_pcts)
            if contribution_pcts
            else 0.0
        )
        # --- Dominant predictions: layer > 30% of total score ---
        dominant_rows: list[dict] = []
        for r in row_dicts:
            total = (
                (r.get("score_company") or 0.0)
                + (r.get("score_macro") or 0.0)
                + (r.get("score_competitive") or 0.0)
            )
            if total > 0.0:
                layer_score = r.get(score_field) or 0.0
                if layer_score / total > 0.30:
                    dominant_rows.append(r)
        # Dominant win rate
        dominant_direction_rows = [
            r for r in dominant_rows if r.get("direction_correct") is not None
        ]
        dominant_win_count = sum(
            1 for r in dominant_direction_rows if r["direction_correct"] is True
        )
        dominant_win_rate = (
            dominant_win_count / len(dominant_direction_rows)
            if dominant_direction_rows
            else 0.0
        )
        # Dominant IC: correlation between layer score and future return
        dom_scores = [
            r.get(score_field) or 0.0
            for r in dominant_rows
            if r.get("future_return") is not None
        ]
        dom_returns = [
            r["future_return"]
            for r in dominant_rows
            if r.get("future_return") is not None
        ]
        dominant_ic = _compute_ic(dom_scores, dom_returns)
        results.append(
            LayerAttribution(
                layer=layer_name,
                avg_contribution_pct=avg_contribution_pct,
                dominant_win_rate=dominant_win_rate,
                dominant_ic=dominant_ic,
            )
        )
    logger.info(
        "Computed layer attribution for 3 layers (horizon=%s, lookback=%dd)",
        horizon,
        lookback_days,
    )
    return results
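A minimal single-pass sketch of the layer-share and dominance logic above, restated so it runs without a database. `layer_summary`, `LAYER_FIELDS`, `DOMINANCE_THRESHOLD`, and the sample rows are illustrative names and data, not part of this commit. It is meant to show two edge cases: a layer at exactly 30% of the total is not dominant (the comparison is strict), and rows whose total score is zero are skipped.

```python
from __future__ import annotations

# Hypothetical names for illustration; the field names mirror the columns
# used by compute_layer_attribution.
LAYER_FIELDS = ("score_company", "score_macro", "score_competitive")
DOMINANCE_THRESHOLD = 0.30


def layer_summary(rows: list[dict], score_field: str) -> tuple[float, float]:
    """Return (avg_contribution_pct, dominant_win_rate) for one layer."""
    shares: list[float] = []
    dominant: list[dict] = []
    for r in rows:
        total = sum((r.get(f) or 0.0) for f in LAYER_FIELDS)
        if total <= 0.0:
            continue  # no usable score split for this prediction
        share = (r.get(score_field) or 0.0) / total
        shares.append(share)
        if share > DOMINANCE_THRESHOLD:  # strict: exactly 30% is not dominant
            dominant.append(r)
    avg_share = sum(shares) / len(shares) if shares else 0.0
    judged = [r for r in dominant if r.get("direction_correct") is not None]
    wins = sum(1 for r in judged if r["direction_correct"] is True)
    win_rate = wins / len(judged) if judged else 0.0
    return avg_share, win_rate


# Made-up sample data: whole-number scores keep the shares exact.
rows = [
    {"score_company": 6.0, "score_macro": 3.0, "score_competitive": 1.0,
     "direction_correct": True},
    {"score_company": 2.0, "score_macro": 5.0, "score_competitive": 3.0,
     "direction_correct": False},
    {"score_company": 0.0, "score_macro": 0.0, "score_competitive": 0.0,
     "direction_correct": True},  # zero total, skipped
]

company = layer_summary(rows, "score_company")
macro = layer_summary(rows, "score_macro")
competitive = layer_summary(rows, "score_competitive")
print(company, macro, competitive)
```

In this sample the macro layer is dominant only in the second row (50% share), so its dominant win rate is 0.0, while the first row's 30% macro share falls just short of the strict threshold.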
@@ -0,0 +1,135 @@
 """Calibration Engine — Bayesian shrinkage source reliability and weight adjustment.
 Computes source reliability scores using Bayesian shrinkage from historical
 prediction outcomes, and adjusts evidence weights based on source performance.
 Updates the existing source_accuracy table with reliability scores.
 Requirements: 8.1, 8.2, 8.3, 8.4, 8.5
 """
 from __future__ import annotations
 import logging
 import asyncpg
 logger = logging.getLogger(__name__)
 # ---------------------------------------------------------------------------
 # Pure functions — testable without a database
 # ---------------------------------------------------------------------------
 def compute_source_reliability(
    observed_win_rate: float,
    sample_count: int,
    prior_strength: int = 30,
 ) -> float:
    """Bayesian shrinkage source reliability.
    reliability = 0.5 + (n / (n + prior_strength)) * (observed_win_rate - 0.5)
    Returns value in [0.0, 1.0].
    When n=0, returns 0.5 (prior mean).
    As n→∞, approaches observed_win_rate.
    """
    if sample_count <= 0:
        return 0.5
    shrinkage = sample_count / (sample_count + prior_strength)
    reliability = 0.5 + shrinkage * (observed_win_rate - 0.5)
    # Clamp to [0.0, 1.0] for safety (should already be in range when
    # observed_win_rate is in [0.0, 1.0], but guard against edge cases).
    return max(0.0, min(1.0, reliability))
 def compute_adjusted_evidence_weight(
    base_weight: float,
    reliability: float,
 ) -> float:
    """Adjusted weight = base_weight * (0.5 + reliability), clamped to [0.1, 2.0]."""
    adjusted = base_weight * (0.5 + reliability)
    return max(0.1, min(2.0, adjusted))
 # ---------------------------------------------------------------------------
 # SQL queries
 # ---------------------------------------------------------------------------
 # Query v_source_performance to get per-source win rates and sample counts.
 # Groups by source, counting total predictions and directional wins.
 _SOURCE_PERFORMANCE_SQL = """
 SELECT
    source,
    COUNT(*) AS sample_count,
    COUNT(*) FILTER (WHERE direction_correct = TRUE) AS win_count
 FROM v_source_performance
 WHERE direction_correct IS NOT NULL
 GROUP BY source
 """
 # Upsert into source_accuracy: update accuracy_ratio and sample_count
 # for existing sources, insert new ones.
 _UPSERT_SOURCE_ACCURACY_SQL = """
 INSERT INTO source_accuracy (source_id, accuracy_ratio, sample_count, last_updated)
 VALUES ($1, $2, $3, NOW())
 ON CONFLICT (source_id)
 DO UPDATE SET
    accuracy_ratio = EXCLUDED.accuracy_ratio,
    sample_count = EXCLUDED.sample_count,
    last_updated = NOW()
 """
 # ---------------------------------------------------------------------------
 # Database-backed function
 # ---------------------------------------------------------------------------
 async def update_source_reliabilities(
    pool: asyncpg.Pool,
 ) -> int:
    """Recompute and store source reliability scores from latest outcomes.
    1. Queries v_source_performance to get per-source win rates and counts
    2. Computes Bayesian shrinkage reliability for each source
    3. Upserts into source_accuracy table (accuracy_ratio = reliability)
    Returns count of sources updated.
    """
    try:
        rows = await pool.fetch(_SOURCE_PERFORMANCE_SQL)
    except Exception:
        logger.exception("Failed to query source performance for reliability update")
        return 0
    if not rows:
        logger.info("No source performance data available for reliability update")
        return 0
    updated = 0
    for row in rows:
        source = row["source"]
        sample_count = row["sample_count"]
        win_count = row["win_count"]
        observed_win_rate = win_count / sample_count if sample_count > 0 else 0.5
        reliability = compute_source_reliability(observed_win_rate, sample_count)
        try:
            await pool.execute(
                _UPSERT_SOURCE_ACCURACY_SQL,
                source,
                reliability,
                sample_count,
            )
            updated += 1
        except Exception:
            logger.exception(
                "Failed to upsert source reliability for source=%s", source
            )
    logger.info("Updated source reliabilities for %d sources", updated)
    return updated
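A short worked example of the shrinkage formula, with the two pure functions restated so the snippet runs standalone. The 80% win rate and the sample counts are made-up inputs chosen to show how reliability moves from the 0.5 prior toward the observed rate as evidence accumulates.

```python
def compute_source_reliability(
    observed_win_rate: float,
    sample_count: int,
    prior_strength: int = 30,
) -> float:
    """Shrink the observed win rate toward 0.5 based on sample size."""
    if sample_count <= 0:
        return 0.5
    shrinkage = sample_count / (sample_count + prior_strength)
    reliability = 0.5 + shrinkage * (observed_win_rate - 0.5)
    return max(0.0, min(1.0, reliability))


def compute_adjusted_evidence_weight(base_weight: float, reliability: float) -> float:
    """Scale the base weight by (0.5 + reliability), clamped to [0.1, 2.0]."""
    return max(0.1, min(2.0, base_weight * (0.5 + reliability)))


# Same observed win rate, increasing amounts of evidence.
for n in (0, 10, 30, 300):
    reliability = compute_source_reliability(0.8, n)
    weight = compute_adjusted_evidence_weight(1.0, reliability)
    print(f"n={n:>3}  reliability={reliability:.4f}  weight={weight:.4f}")
```

With the default `prior_strength` of 30, a source needs 30 evaluated predictions before its reliability moves halfway from 0.5 to its observed win rate (0.65 for an 80% win rate). A neutral reliability of 0.5 leaves the base weight unchanged, since the multiplier is 0.5 + 0.5 = 1.0.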
@@ -0,0 +1,637 @@
 """Metrics Engine — computes calibration, IC, Brier, and benchmark metrics.
 Aggregates model quality metrics across configurable lookback windows and
 prediction horizons. Stores periodic snapshots for time-series analysis
 of model performance trends.
 Requirements: 5.1, 5.2, 5.3, 5.4, 5.5, 5.6, 6.1, 6.2, 6.3, 6.4, 6.5,
              9.1, 9.2, 9.3, 9.4, 10.1, 10.2, 10.3, 10.4, 10.5
 """
 from __future__ import annotations
 import json
 import logging
 import math
 import uuid
 from dataclasses import dataclass, field
 from datetime import datetime, timedelta
 import asyncpg
 logger = logging.getLogger(__name__)
 # ---------------------------------------------------------------------------
 # Constants
 # ---------------------------------------------------------------------------
 CONFIDENCE_BUCKETS: list[tuple[float, float]] = [
    (0.50, 0.60),
    (0.60, 0.70),
    (0.70, 0.80),
    (0.80, 0.90),
    (0.90, 1.00),
 ]
 LOOKBACK_WINDOWS: list[str] = ["7d", "30d", "90d", "all"]
 LOOKBACK_DURATIONS: dict[str, timedelta | None] = {
    "7d": timedelta(days=7),
    "30d": timedelta(days=30),
    "90d": timedelta(days=90),
    "all": None,
 }
 EVALUATION_HORIZONS: list[str] = ["1h", "6h", "1d", "7d", "30d"]
 # ---------------------------------------------------------------------------
 # Dataclasses
 # ---------------------------------------------------------------------------
 @dataclass
 class CalibrationBucket:
    """Calibration metrics for a single confidence bucket."""
    bucket_low: float
    bucket_high: float
    avg_confidence: float
    observed_win_rate: float
    prediction_count: int
    miscalibrated: bool  # |avg_confidence - win_rate| > 0.15
 @dataclass
 class ModelMetricSnapshot:
    """Aggregate model quality metrics for a lookback/horizon combination."""
    id: str
    generated_at: datetime
    lookback_window: str
    horizon: str
    prediction_count: int
    win_rate: float
    directional_accuracy: float
    information_coefficient: float | None
    rank_information_coefficient: float | None
    avg_return: float
    avg_excess_return_vs_spy: float
    avg_excess_return_vs_sector: float
    calibration_error: float  # ECE
    brier_score: float
    buy_win_rate: float
    sell_win_rate: float
    hold_win_rate: float
    metadata: dict = field(default_factory=dict)
 # ---------------------------------------------------------------------------
 # Pure computation functions
 # ---------------------------------------------------------------------------
 def compute_calibration_error(
    confidences: list[float],
    outcomes: list[bool],
 ) -> tuple[float, list[CalibrationBucket]]:
    """Compute ECE and calibration buckets.
    ECE = Σ (n_b / N) * |avg_conf_b - win_rate_b|
    Groups predictions into 5 confidence buckets and computes the weighted
    average of |avg_confidence - observed_win_rate| across all buckets.
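    Flags buckets where |diff| > 0.15 as miscalibrated.
    Confidences below 0.50 fall outside every bucket: they contribute no
    term to the sum but still count toward N, which lowers the reported ECE.
    Returns (ece, buckets). Returns (0.0, []) when no data is provided.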
    """
    if not confidences or not outcomes:
        return 0.0, []
    n = len(confidences)
    buckets: list[CalibrationBucket] = []
    ece = 0.0
    for low, high in CONFIDENCE_BUCKETS:
        bucket_confs: list[float] = []
        bucket_outcomes: list[bool] = []
        for conf, outcome in zip(confidences, outcomes):
            # Last bucket is inclusive on the right: [0.90, 1.00]
            if high == 1.00:
                in_bucket = low <= conf <= high
            else:
                in_bucket = low <= conf < high
            if in_bucket:
                bucket_confs.append(conf)
                bucket_outcomes.append(outcome)
        count = len(bucket_confs)
        if count == 0:
            # Empty bucket — exclude from ECE, still record it
            buckets.append(
                CalibrationBucket(
                    bucket_low=low,
                    bucket_high=high,
                    avg_confidence=0.0,
                    observed_win_rate=0.0,
                    prediction_count=0,
                    miscalibrated=False,
                )
            )
            continue
        avg_conf = sum(bucket_confs) / count
        win_rate = sum(1.0 for o in bucket_outcomes if o) / count
        diff = abs(avg_conf - win_rate)
        miscalibrated = diff > 0.15
        buckets.append(
            CalibrationBucket(
                bucket_low=low,
                bucket_high=high,
                avg_confidence=avg_conf,
                observed_win_rate=win_rate,
                prediction_count=count,
                miscalibrated=miscalibrated,
            )
        )
        ece += (count / n) * diff
    return ece, buckets
 def compute_brier_score(
    p_bulls: list[float],
    outcomes: list[bool],
 ) -> float:
    """Brier score = mean((p_bull - outcome)^2).
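    outcome is 1.0 when price moved in predicted direction, 0.0 otherwise.
    Note: the score is only meaningful when the probability and the outcome
    refer to the same event. If p_bull is the probability of an up move, a
    correct bearish call (low p_bull, outcome 1.0) is scored as a large
    error, so callers should pass the probability assigned to the predicted
    direction for such rows.
    Returns value in [0.0, 1.0] for probabilities in [0.0, 1.0].
    Returns 0.0 for empty input.
    """
    if not p_bulls or not outcomes:
        return 0.0
    # zip() truncates to the shorter list, so divide by the paired count.
    n = min(len(p_bulls), len(outcomes))
    total = 0.0
    for p, o in zip(p_bulls, outcomes):
        actual = 1.0 if o else 0.0
        total += (p - actual) ** 2
    return total / n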
 def _pearson_correlation(xs: list[float], ys: list[float]) -> float | None:
    """Compute Pearson correlation coefficient between two lists.
    Returns None if the lists have fewer than 2 elements or if either
    has zero variance. Guards against NaN/infinity.
    """
    n = len(xs)
    if n < 2:
        return None
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = 0.0
    var_x = 0.0
    var_y = 0.0
    for x, y in zip(xs, ys):
        dx = x - mean_x
        dy = y - mean_y
        cov += dx * dy
        var_x += dx * dx
        var_y += dy * dy
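    if var_x == 0.0 or var_y == 0.0:
        return None
    # Take the roots separately so the product of two tiny variances cannot
    # underflow to zero before the division.
    denom = math.sqrt(var_x) * math.sqrt(var_y)
    if denom == 0.0:
        return None
    r = cov / denom
    # Guard against NaN/infinity from overflow
    if math.isnan(r) or math.isinf(r):
        return None
    # Clamp to [-1.0, 1.0] to absorb floating-point drift
    return max(-1.0, min(1.0, r))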
 def _rank_data(values: list[float]) -> list[float]:
    """Compute fractional ranks for a list of values (average tie-breaking)."""
    n = len(values)
    indexed = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the end of the tie group
        j = i + 1
        while j < n and values[indexed[j]] == values[indexed[i]]:
            j += 1
        # Average rank for the tie group (1-based)
        avg_rank = (i + j + 1) / 2.0
        for k in range(i, j):
            ranks[indexed[k]] = avg_rank
        i = j
    return ranks
 def compute_information_coefficient(
    scores: list[float],
    returns: list[float],
 ) -> float | None:
    """Pearson correlation between prediction scores and future returns.
    Returns None when fewer than 30 data points.
    Returns value in [-1.0, 1.0].
    """
    if len(scores) < 30 or len(returns) < 30:
        return None
    n = min(len(scores), len(returns))
    return _pearson_correlation(scores[:n], returns[:n])
 def compute_rank_information_coefficient(
    scores: list[float],
    returns: list[float],
 ) -> float | None:
    """Spearman rank correlation between prediction scores and future returns.
    Ranks the data and computes Pearson correlation on the ranks.
    Returns None when fewer than 30 data points.
    Returns value in [-1.0, 1.0].
    """
    if len(scores) < 30 or len(returns) < 30:
        return None
    n = min(len(scores), len(returns))
    ranked_scores = _rank_data(scores[:n])
    ranked_returns = _rank_data(returns[:n])
    return _pearson_correlation(ranked_scores, ranked_returns)
 def compute_contribution_scores(
    weights: list[float],
 ) -> list[float]:
    """Compute contribution scores from document weights.
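    Each score = weight_i / sum(weights). Sums to 1.0.
    Each score in [0.0, 1.0], assuming non-negative weights.
    When the weights sum to zero, falls back to an equal split (1 / n each).
    Returns empty list for empty input.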
    """
    if not weights:
        return []
    total = sum(weights)
    if total == 0.0:
        n = len(weights)
        return [1.0 / n] * n
    return [w / total for w in weights]
 def compute_hit_rate_improvement(win_rate: float) -> float:
    """Hit rate improvement over random 50/50 baseline.
    Defined as (system_win_rate - 0.5) / 0.5.
    """
    return (win_rate - 0.5) / 0.5
 # ---------------------------------------------------------------------------
 # SQL queries for v_prediction_performance view
 # ---------------------------------------------------------------------------
 _PERFORMANCE_DATA_SQL = """
 SELECT
    ticker,
    direction,
    action,
    confidence,
    strength,
    p_bull,
    score_company,
    score_macro,
    score_competitive,
    future_return,
    excess_return_vs_spy,
    excess_return_vs_sector,
    direction_correct,
    profitable,
    horizon,
    generated_at
 FROM v_prediction_performance
 WHERE horizon = $1
 """
 _PERFORMANCE_DATA_WITH_LOOKBACK_SQL = """
 SELECT
    ticker,
    direction,
    action,
    confidence,
    strength,
    p_bull,
    score_company,
    score_macro,
    score_competitive,
    future_return,
    excess_return_vs_spy,
    excess_return_vs_sector,
    direction_correct,
    profitable,
    horizon,
    generated_at
 FROM v_prediction_performance
 WHERE horizon = $1
  AND generated_at >= $2
 """
 _INSERT_METRIC_SNAPSHOT_SQL = """
 INSERT INTO model_metric_snapshots (
    id, generated_at, lookback_window, horizon,
    prediction_count, win_rate, directional_accuracy,
    information_coefficient, rank_information_coefficient,
    avg_return, avg_excess_return_vs_spy, avg_excess_return_vs_sector,
    calibration_error, brier_score,
    buy_win_rate, sell_win_rate, hold_win_rate,
    metadata
 ) VALUES (
    $1::uuid, $2, $3, $4,
    $5, $6, $7,
    $8, $9,
    $10, $11, $12,
    $13, $14,
    $15, $16, $17,
    $18::jsonb
 )
 """
 # ---------------------------------------------------------------------------
 # Metric computation from raw rows
 # ---------------------------------------------------------------------------
 def _compute_metrics_from_rows(
    rows: list[dict],
    lookback_window: str,
    horizon: str,
 ) -> ModelMetricSnapshot:
    """Compute all metrics from a list of prediction performance rows.
    Returns a ModelMetricSnapshot with all computed metrics.
    """
    now = datetime.now().astimezone()
    snapshot_id = str(uuid.uuid4())
    prediction_count = len(rows)
    if prediction_count == 0:
        return ModelMetricSnapshot(
            id=snapshot_id,
            generated_at=now,
            lookback_window=lookback_window,
            horizon=horizon,
            prediction_count=0,
            win_rate=0.0,
            directional_accuracy=0.0,
            information_coefficient=None,
            rank_information_coefficient=None,
            avg_return=0.0,
            avg_excess_return_vs_spy=0.0,
            avg_excess_return_vs_sector=0.0,
            calibration_error=0.0,
            brier_score=0.0,
            buy_win_rate=0.0,
            sell_win_rate=0.0,
            hold_win_rate=0.0,
            metadata={},
        )
    # --- Win rate and directional accuracy ---
    # The denominator is every row, so a row whose direction_correct is
    # None counts as a loss here.
    direction_correct_count = sum(
        1 for r in rows if r.get("direction_correct") is True
    )
    win_rate = direction_correct_count / prediction_count
    directional_accuracy = win_rate  # Same metric, different name
    # --- Per-action win rates ---
    buy_rows = [r for r in rows if (r.get("action") or "").lower() == "buy"]
    sell_rows = [r for r in rows if (r.get("action") or "").lower() == "sell"]
    hold_rows = [r for r in rows if (r.get("action") or "").lower() == "hold"]
    buy_win_rate = (
        sum(1 for r in buy_rows if r.get("direction_correct") is True) / len(buy_rows)
        if buy_rows
        else 0.0
    )
    sell_win_rate = (
        sum(1 for r in sell_rows if r.get("direction_correct") is True)
        / len(sell_rows)
        if sell_rows
        else 0.0
    )
    hold_win_rate = (
        sum(1 for r in hold_rows if r.get("direction_correct") is True)
        / len(hold_rows)
        if hold_rows
        else 0.0
    )
    # --- Average return ---
    returns_list = [
        r["future_return"] for r in rows if r.get("future_return") is not None
    ]
    avg_return = sum(returns_list) / len(returns_list) if returns_list else 0.0
    # --- Average excess return vs SPY (Requirement 9.1) ---
    excess_spy_list = [
        r["excess_return_vs_spy"]
        for r in rows
        if r.get("excess_return_vs_spy") is not None
    ]
    avg_excess_return_vs_spy = (
        sum(excess_spy_list) / len(excess_spy_list) if excess_spy_list else 0.0
    )
    # --- Average excess return vs sector ETF (Requirement 9.2) ---
    excess_sector_list = [
        r["excess_return_vs_sector"]
        for r in rows
        if r.get("excess_return_vs_sector") is not None
    ]
    avg_excess_return_vs_sector = (
        sum(excess_sector_list) / len(excess_sector_list)
        if excess_sector_list
        else 0.0
    )
    # --- Calibration error (ECE) (Requirements 5.1, 5.2, 5.3, 5.5) ---
    confidences = [
        r["confidence"] for r in rows if r.get("confidence") is not None
    ]
    outcomes = [
        r.get("direction_correct") is True
        for r in rows
        if r.get("confidence") is not None
    ]
    ece, _buckets = compute_calibration_error(confidences, outcomes)
    # --- Brier score (Requirement 5.4) ---
    p_bulls = [r["p_bull"] for r in rows if r.get("p_bull") is not None]
    brier_outcomes = [
        r.get("direction_correct") is True
        for r in rows
        if r.get("p_bull") is not None
    ]
    brier = compute_brier_score(p_bulls, brier_outcomes)
    # --- Information Coefficient (Requirements 6.1, 6.5) ---
    ic_scores = [
        r["strength"] for r in rows if r.get("strength") is not None
        and r.get("future_return") is not None
    ]
    ic_returns = [
        r["future_return"] for r in rows if r.get("strength") is not None
        and r.get("future_return") is not None
    ]
    ic = compute_information_coefficient(ic_scores, ic_returns)
    # --- Rank Information Coefficient (Requirements 6.2, 6.5) ---
    rank_ic = compute_rank_information_coefficient(ic_scores, ic_returns)
    # --- Hit rate improvement (Requirement 9.4) ---
    hit_rate_improvement = compute_hit_rate_improvement(win_rate)
    # --- Metadata (Requirement 10.5) ---
    metadata: dict = {
        "hit_rate_improvement": hit_rate_improvement,
        "buy_count": len(buy_rows),
        "sell_count": len(sell_rows),
        "hold_count": len(hold_rows),
        "returns_count": len(returns_list),
        "excess_spy_count": len(excess_spy_list),
        "excess_sector_count": len(excess_sector_list),
    }
    return ModelMetricSnapshot(
        id=snapshot_id,
        generated_at=now,
        lookback_window=lookback_window,
        horizon=horizon,
        prediction_count=prediction_count,
        win_rate=win_rate,
        directional_accuracy=directional_accuracy,
        information_coefficient=ic,
        rank_information_coefficient=rank_ic,
        avg_return=avg_return,
        avg_excess_return_vs_spy=avg_excess_return_vs_spy,
        avg_excess_return_vs_sector=avg_excess_return_vs_sector,
        calibration_error=ece,
        brier_score=brier,
        buy_win_rate=buy_win_rate,
        sell_win_rate=sell_win_rate,
        hold_win_rate=hold_win_rate,
        metadata=metadata,
    )
 # ---------------------------------------------------------------------------
 # Main entry point (Requirements 10.1, 10.2, 10.3, 10.4, 10.5)
 # ---------------------------------------------------------------------------
 async def compute_and_store_metric_snapshots(
    pool: asyncpg.Pool,
 ) -> list[ModelMetricSnapshot]:
    """Compute metric snapshots for all lookback/horizon combinations.
    Lookback windows: 7d, 30d, 90d, all-time.
    Horizons: 1h, 6h, 1d, 7d, 30d.
    For each of the 4 lookbacks × 5 horizons = 20 combinations, queries the
    v_prediction_performance view, computes all metrics, and persists the
    result to model_metric_snapshots.
    Returns the list of computed snapshots.
    """
    snapshots: list[ModelMetricSnapshot] = []
    now = datetime.now().astimezone()
    for lookback in LOOKBACK_WINDOWS:
        duration = LOOKBACK_DURATIONS[lookback]
        for horizon in EVALUATION_HORIZONS:
            try:
                # Query performance data
                if duration is not None:
                    cutoff = now - duration
                    rows = await pool.fetch(
                        _PERFORMANCE_DATA_WITH_LOOKBACK_SQL,
                        horizon,
                        cutoff,
                    )
                else:
                    rows = await pool.fetch(
                        _PERFORMANCE_DATA_SQL,
                        horizon,
                    )
                # Convert asyncpg Records to dicts
                row_dicts = [dict(r) for r in rows]
                # Compute metrics
                snapshot = _compute_metrics_from_rows(
                    row_dicts, lookback, horizon
                )
                # Persist
                await pool.execute(
                    _INSERT_METRIC_SNAPSHOT_SQL,
                    snapshot.id,
                    snapshot.generated_at,
                    snapshot.lookback_window,
                    snapshot.horizon,
                    snapshot.prediction_count,
                    snapshot.win_rate,
                    snapshot.directional_accuracy,
                    snapshot.information_coefficient,
                    snapshot.rank_information_coefficient,
                    snapshot.avg_return,
                    snapshot.avg_excess_return_vs_spy,
                    snapshot.avg_excess_return_vs_sector,
                    snapshot.calibration_error,
                    snapshot.brier_score,
                    snapshot.buy_win_rate,
                    snapshot.sell_win_rate,
                    snapshot.hold_win_rate,
                    json.dumps(snapshot.metadata),
                )
                snapshots.append(snapshot)
            except Exception:
                # Log and move on so one failing combination does not
                # abort the remaining lookback/horizon pairs.
                logger.exception(
                    "Failed to compute metrics for lookback=%s horizon=%s",
                    lookback,
                    horizon,
                )
    logger.info(
        "Computed %d metric snapshots across %d lookback/horizon combinations",
        len(snapshots),
        len(LOOKBACK_WINDOWS) * len(EVALUATION_HORIZONS),
    )
    return snapshots
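A small worked example of the ECE formula used by the metrics engine, written as a condensed standalone sketch rather than a copy of `compute_calibration_error`. `expected_calibration_error`, `BUCKETS`, and the sample confidences and outcomes are illustrative. It also shows the effect of a prediction whose confidence is below 0.50: it lands in no bucket but still increases N.

```python
from __future__ import annotations

# Same bucket edges as CONFIDENCE_BUCKETS in the metrics engine.
BUCKETS = [(0.50, 0.60), (0.60, 0.70), (0.70, 0.80), (0.80, 0.90), (0.90, 1.00)]


def expected_calibration_error(
    confidences: list[float], outcomes: list[bool]
) -> float:
    """ECE = sum over buckets of (n_b / N) * |avg_conf_b - win_rate_b|."""
    n = len(confidences)
    if n == 0:
        return 0.0
    ece = 0.0
    for low, high in BUCKETS:
        # Buckets are half-open, except the last one which includes 1.00.
        members = [
            (c, o)
            for c, o in zip(confidences, outcomes)
            if low <= c < high or (high == 1.00 and c == high)
        ]
        if not members:
            continue  # empty buckets contribute nothing
        avg_conf = sum(c for c, _ in members) / len(members)
        win_rate = sum(1 for _, o in members if o) / len(members)
        ece += (len(members) / n) * abs(avg_conf - win_rate)
    return ece


# Two predictions at 0.55 (one win) and four at 0.85 (three wins).
confidences = [0.55, 0.55, 0.85, 0.85, 0.85, 0.85]
outcomes = [True, False, True, True, True, False]
ece = expected_calibration_error(confidences, outcomes)

# Adding a 0.40-confidence prediction changes N but no bucket.
ece_diluted = expected_calibration_error(confidences + [0.40], outcomes + [False])
print(ece, ece_diluted)
```

Here the 0.50 to 0.60 bucket is off by 0.05 and the 0.80 to 0.90 bucket by 0.10, giving (2/6)(0.05) + (4/6)(0.10) = 1/12. With the extra sub-0.50 prediction the same two gaps are weighted by 2/7 and 4/7, so the ECE drops to 0.5/7 even though calibration inside the buckets is unchanged.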
@@ -0,0 +1,414 @@
 """Outcome Evaluator — matches predictions with realized market outcomes.
 Runs periodically to evaluate prediction snapshots whose horizon has elapsed.
 For each snapshot, fetches future prices at the horizon endpoint and computes
 returns, excess returns, directional accuracy, and profitability across all
 five evaluation horizons (1h, 6h, 1d, 7d, 30d).
 Requirements: 4.1, 4.2, 4.3, 4.4, 4.5, 4.6, 4.7, 4.8, 4.9, 4.10
 """
 from __future__ import annotations
 import json
 import logging
 import uuid
 from dataclasses import dataclass, field
 from datetime import datetime, timedelta
 import asyncpg
 logger = logging.getLogger(__name__)
 # ---------------------------------------------------------------------------
 # Constants
 # ---------------------------------------------------------------------------
 HORIZON_DURATIONS: dict[str, timedelta] = {
    "1h": timedelta(hours=1),
    "6h": timedelta(hours=6),
    "1d": timedelta(days=1),
    "7d": timedelta(days=7),
    "30d": timedelta(days=30),
 }
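# Illustrative sketch only (hypothetical helper, not used by the evaluator):
# the maturity rule from the module docstring, expressed in Python. A snapshot
# is ready for a horizon once generated_at + duration <= now. The name
# `sketch_matured_horizons` and its `durations` argument are assumptions made
# for this example; the evaluator itself applies the rule in SQL.
from datetime import datetime, timedelta


def sketch_matured_horizons(
    generated_at: datetime,
    now: datetime,
    durations: dict[str, timedelta],
) -> list[str]:
    """Return the horizon labels whose duration has fully elapsed."""
    return [
        label
        for label, duration in durations.items()
        if generated_at + duration <= now
    ]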
 # ---------------------------------------------------------------------------
 # Dataclasses
 # ---------------------------------------------------------------------------
@dataclass
 class PredictionOutcome:
    """Realized outcome for a prediction at a specific horizon."""
    id: str  # UUID
    prediction_id: str
    evaluated_at: datetime
    horizon: str  # 1h, 6h, 1d, 7d, 30d
    future_price: float
    future_return: float
    spy_future_price: float | None
    spy_return: float | None
    sector_etf_future_price: float | None
    sector_etf_return: float | None
    excess_return_vs_spy: float | None
    excess_return_vs_sector: float | None
    direction_correct: bool
    profitable: bool
    metadata: dict = field(default_factory=dict)
 # ---------------------------------------------------------------------------
 # SQL statements
 # ---------------------------------------------------------------------------
 # Find matured predictions: snapshots where generated_at + horizon_duration <= NOW()
 # and no outcome has been recorded yet for that (prediction_id, horizon) pair.
 # We evaluate ALL 5 horizons for each snapshot, not just the snapshot's own horizon.
 _MATURED_PREDICTIONS_SQL = """
 SELECT
    ps.id,
    ps.generated_at,
    ps.ticker,
    ps.horizon AS snapshot_horizon,
    ps.direction,
    ps.action,
    ps.price_at_prediction,
    ps.spy_price_at_prediction,
    ps.sector_etf_price_at_prediction
 FROM prediction_snapshots ps
 WHERE ps.generated_at + $1::interval <= NOW()
  AND NOT EXISTS (
      SELECT 1 FROM prediction_outcomes po
      WHERE po.prediction_id = ps.id AND po.horizon = $2
  )
 """
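 # Fetch the close price for a ticker at or before a specific time.
 # Uses the closest bar at or before the target time. There is no lower bound
 # on captured_at, so if bars are missing around the target time this returns
 # an older (possibly pre-prediction) close rather than NULL.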
 _CLOSE_AT_TIME_SQL = """
 SELECT (data->>'c')::float AS close
 FROM market_snapshots
 WHERE ticker = $1
  AND snapshot_type = 'bar'
  AND data->>'c' IS NOT NULL
  AND captured_at <= $2
 ORDER BY captured_at DESC
 LIMIT 1
 """
 _INSERT_OUTCOME_SQL = """
 INSERT INTO prediction_outcomes (
    id, prediction_id, evaluated_at, horizon,
    future_price, future_return,
    spy_future_price, spy_return,
    sector_etf_future_price, sector_etf_return,
    excess_return_vs_spy, excess_return_vs_sector,
    direction_correct, profitable,
    metadata
 ) VALUES (
    $1::uuid, $2::uuid, $3, $4,
    $5, $6,
    $7, $8,
    $9, $10,
    $11, $12,
    $13, $14,
    $15::jsonb
 )
 """
 # ---------------------------------------------------------------------------
 # Price fetching at a specific time
 # ---------------------------------------------------------------------------
 async def _fetch_close_at_time(
    pool: asyncpg.Pool,
    ticker: str,
    target_time: datetime,
 ) -> float | None:
    """Fetch the close price for a ticker at or before a specific time.
    Returns None if no market data is available before the target time.
    """
    row = await pool.fetchrow(_CLOSE_AT_TIME_SQL, ticker, target_time)
    if row is None:
        return None
    return row["close"]
 # ---------------------------------------------------------------------------
 # Sector ETF lookup (duplicates SECTOR_ETF_MAP in prediction_snapshot; keep in sync)
 # ---------------------------------------------------------------------------
 _SECTOR_ETF_MAP: dict[str, str] = {
    "Technology": "XLK",
    "Consumer Cyclical": "XLY",
    "Financial Services": "XLF",
    "Healthcare": "XLV",
    "Energy": "XLE",
    "Communication Services": "XLC",
    "Industrials": "XLI",
    "Consumer Defensive": "XLP",
    "Real Estate": "XLRE",
    "Utilities": "XLU",
 }
 _COMPANY_SECTOR_SQL = """
 SELECT sector FROM companies WHERE ticker = $1 AND active = TRUE LIMIT 1
 """
 async def _fetch_sector_etf_ticker(pool: asyncpg.Pool, ticker: str) -> str | None:
    """Look up the sector ETF ticker for a company ticker."""
    row = await pool.fetchrow(_COMPANY_SECTOR_SQL, ticker)
    if row is None or row["sector"] is None:
        return None
    return _SECTOR_ETF_MAP.get(row["sector"])
 # ---------------------------------------------------------------------------
 # Return computation helpers
 # ---------------------------------------------------------------------------
 def _compute_return(current_price: float, future_price: float) -> float:
    """Compute simple return: (future - current) / current."""
    if current_price == 0.0:
        return 0.0
    return (future_price - current_price) / current_price
 def _is_direction_correct(direction: str, future_return: float) -> bool:
    """Determine if the predicted direction matches the realized return.
    bullish + positive return = True
    bearish + negative return = True
    All other combinations (mixed/neutral directions, or a zero return) = False
    """
    direction_lower = direction.lower()
    if direction_lower == "bullish" and future_return > 0.0:
        return True
    if direction_lower == "bearish" and future_return < 0.0:
        return True
    return False
 def _is_profitable(action: str, future_return: float) -> bool:
    """Determine if the predicted action would have been profitable.
    buy + positive return = True
    sell + negative return = True
    All other combinations (hold/watch actions, or a zero return) = False
    """
    action_lower = action.lower()
    if action_lower == "buy" and future_return > 0.0:
        return True
    if action_lower == "sell" and future_return < 0.0:
        return True
    return False
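# Illustrative sketch only (hypothetical helper, not used by the evaluator):
# a self-contained walk-through of the arithmetic in the helpers above, using
# simple returns and the same strict-inequality direction rules. The name
# `sketch_outcome_arithmetic` and its arguments are assumptions made for this
# example, and it assumes non-zero starting prices.
def sketch_outcome_arithmetic(
    price_then: float,
    price_later: float,
    spy_then: float,
    spy_later: float,
) -> dict:
    """Return the ticker return, SPY return, and excess return for one horizon."""
    future_return = (price_later - price_then) / price_then
    spy_return = (spy_later - spy_then) / spy_then
    return {
        "future_return": future_return,
        "spy_return": spy_return,
        # Excess return is the ticker's return minus the benchmark's return.
        "excess_return_vs_spy": future_return - spy_return,
        # A bullish call counts as correct only on a strictly positive return.
        "bullish_correct": future_return > 0.0,
        # A bearish call counts as correct only on a strictly negative return.
        "bearish_correct": future_return < 0.0,
    }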
 # ---------------------------------------------------------------------------
 # Single prediction evaluation (Requirements 4.2–4.7)
 # ---------------------------------------------------------------------------
 async def evaluate_single_prediction(
    pool: asyncpg.Pool,
    snapshot: dict,
    horizon: str,
 ) -> PredictionOutcome | None:
    """Evaluate a single prediction at a specific horizon.
    Fetches the future price at generated_at + horizon_duration for the ticker,
    SPY, and sector ETF. Computes returns, excess returns, direction correctness,
    and profitability.
    Returns None if the ticker's future price is unavailable (Requirement 4.10).
    """
    duration = HORIZON_DURATIONS[horizon]
    target_time = snapshot["generated_at"] + duration
    ticker = snapshot["ticker"]
    # Fetch future price for the ticker — required (skip if unavailable)
    future_price = await _fetch_close_at_time(pool, ticker, target_time)
    if future_price is None:
        logger.debug(
            "Future price unavailable for %s at horizon %s (target %s), skipping",
            ticker,
            horizon,
            target_time,
        )
        return None
    price_at_prediction = snapshot["price_at_prediction"]
    if price_at_prediction is None or price_at_prediction == 0.0:
        logger.warning(
            "Price at prediction is NULL or zero for snapshot %s, skipping horizon %s",
            snapshot["id"],
            horizon,
        )
        return None
    # Compute ticker future return (Requirement 4.2)
    future_return = _compute_return(price_at_prediction, future_price)
    # Fetch SPY future price and compute SPY return (Requirement 4.3)
    spy_future_price: float | None = None
    spy_return: float | None = None
    spy_price_at_prediction = snapshot["spy_price_at_prediction"]
    if spy_price_at_prediction is not None and spy_price_at_prediction != 0.0:
        spy_future_price = await _fetch_close_at_time(pool, "SPY", target_time)
        if spy_future_price is not None:
            spy_return = _compute_return(spy_price_at_prediction, spy_future_price)
    # Fetch sector ETF future price and compute sector return (Requirement 4.4)
    sector_etf_future_price: float | None = None
    sector_etf_return: float | None = None
    sector_etf_price_at_prediction = snapshot["sector_etf_price_at_prediction"]
    if (
        sector_etf_price_at_prediction is not None
        and sector_etf_price_at_prediction != 0.0
    ):
        sector_etf_ticker = await _fetch_sector_etf_ticker(pool, ticker)
        if sector_etf_ticker is not None:
            sector_etf_future_price = await _fetch_close_at_time(
                pool, sector_etf_ticker, target_time
            )
            if sector_etf_future_price is not None:
                sector_etf_return = _compute_return(
                    sector_etf_price_at_prediction, sector_etf_future_price
                )
    # Compute excess returns (Requirement 4.5).
    # future_return is always a float here, so only the benchmark can be missing.
    excess_return_vs_spy: float | None = None
    if spy_return is not None:
        excess_return_vs_spy = future_return - spy_return
    excess_return_vs_sector: float | None = None
    if sector_etf_return is not None:
        excess_return_vs_sector = future_return - sector_etf_return
    # Determine direction correctness (Requirement 4.6)
    direction_correct = _is_direction_correct(snapshot["direction"], future_return)
    # Determine profitability (Requirement 4.7)
    profitable = _is_profitable(snapshot["action"], future_return)
    now = datetime.now().astimezone()
    return PredictionOutcome(
        id=str(uuid.uuid4()),
        prediction_id=str(snapshot["id"]),
        evaluated_at=now,
        horizon=horizon,
        future_price=future_price,
        future_return=future_return,
        spy_future_price=spy_future_price,
        spy_return=spy_return,
        sector_etf_future_price=sector_etf_future_price,
        sector_etf_return=sector_etf_return,
        excess_return_vs_spy=excess_return_vs_spy,
        excess_return_vs_sector=excess_return_vs_sector,
        direction_correct=direction_correct,
        profitable=profitable,
        metadata={
            "ticker": ticker,
            "horizon": horizon,
            "price_at_prediction": price_at_prediction,
            "future_price": future_price,
        },
    )
 # ---------------------------------------------------------------------------
 # Store outcome (Requirement 4.9)
 # ---------------------------------------------------------------------------
 async def _store_outcome(
    conn: asyncpg.Connection,
    outcome: PredictionOutcome,
 ) -> None:
    """Persist a single prediction outcome to the database."""
    await conn.execute(
        _INSERT_OUTCOME_SQL,
        outcome.id,
        outcome.prediction_id,
        outcome.evaluated_at,
        outcome.horizon,
        outcome.future_price,
        outcome.future_return,
        outcome.spy_future_price,
        outcome.spy_return,
        outcome.sector_etf_future_price,
        outcome.sector_etf_return,
        outcome.excess_return_vs_spy,
        outcome.excess_return_vs_sector,
        outcome.direction_correct,
        outcome.profitable,
        json.dumps(outcome.metadata),
    )
 # ---------------------------------------------------------------------------
 # Main entry point (Requirements 4.1, 4.8, 4.9, 4.10)
 # ---------------------------------------------------------------------------
 async def evaluate_matured_predictions(
    pool: asyncpg.Pool,
 ) -> int:
    """Evaluate all matured prediction snapshots across all horizons.
    For each of the 5 horizons (1h, 6h, 1d, 7d, 30d), finds prediction
    snapshots where generated_at + horizon_duration <= NOW() and no outcome
    has been recorded for that (prediction_id, horizon) pair.
    For each matured snapshot-horizon pair, fetches future prices and computes
    returns. Skips horizons where the future price is unavailable — those will
    be retried on the next run (Requirement 4.10).
    Returns the total count of outcomes recorded.
    """
    total_recorded = 0
    for horizon, duration in HORIZON_DURATIONS.items():
        # Find snapshots matured for this horizon
        rows = await pool.fetch(_MATURED_PREDICTIONS_SQL, duration, horizon)
        if not rows:
            continue
        logger.info(
            "Found %d matured predictions for horizon %s", len(rows), horizon
        )
        for row in rows:
            snapshot = dict(row)
            try:
                outcome = await evaluate_single_prediction(pool, snapshot, horizon)
                if outcome is None:
                    # Future price unavailable — skip, retry next run
                    continue
                async with pool.acquire() as conn:
                    async with conn.transaction():
                        await _store_outcome(conn, outcome)
                total_recorded += 1
            except Exception:
                logger.exception(
                    "Failed to evaluate snapshot %s at horizon %s",
                    snapshot["id"],
                    horizon,
                )
                continue
    logger.info("Outcome evaluation complete: %d outcomes recorded", total_recorded)
    return total_recorded
@@ -0,0 +1,540 @@
 """Prediction Snapshot Writer — captures immutable prediction state at generation time.
 Creates frozen records of every recommendation with prices, evidence links,
 duplicate detection, and contribution scores so that predictions can be
 evaluated against future outcomes without hindsight bias.
 Requirements: 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 3.1, 3.2, 3.3, 3.4
 """
 from __future__ import annotations
 import hashlib
 import json
 import logging
 import urllib.parse
 import uuid
 from dataclasses import dataclass, field
 from datetime import datetime
 import asyncpg
 from services.shared.schemas import Recommendation, TrendSummary
 logger = logging.getLogger(__name__)
 # ---------------------------------------------------------------------------
 # Constants
 # ---------------------------------------------------------------------------
 SECTOR_ETF_MAP: dict[str, str] = {
    "Technology": "XLK",
    "Consumer Cyclical": "XLY",
    "Financial Services": "XLF",
    "Healthcare": "XLV",
    "Energy": "XLE",
    "Communication Services": "XLC",
    "Industrials": "XLI",
    "Consumer Defensive": "XLP",
    "Real Estate": "XLRE",
    "Utilities": "XLU",
 }
 EVALUATION_HORIZONS: list[str] = ["1h", "6h", "1d", "7d", "30d"]
 MAX_SINGLE_DOCUMENT_WEIGHT: float = 1.0
 # ---------------------------------------------------------------------------
 # Dataclasses
 # ---------------------------------------------------------------------------
@dataclass
 class PredictionSnapshot:
    """Immutable snapshot of a prediction at generation time."""
    id: str  # UUID
    generated_at: datetime
    ticker: str
    window: str
    horizon: str
    direction: str  # bullish/bearish/mixed/neutral
    action: str  # buy/sell/hold/watch
    mode: str  # informational/paper_eligible/live_eligible
    strength: float
    confidence: float
    contradiction: float
    p_bull: float | None
    p_bear: float | None
    score_company: float
    score_macro: float
    score_competitive: float
    evidence_count: int
    unique_source_count: int
    duplicate_evidence_count: int
    price_at_prediction: float | None
    spy_price_at_prediction: float | None
    sector_etf_price_at_prediction: float | None
    metadata: dict = field(default_factory=dict)
@dataclass
 class SignalEvidenceLink:
    """Link between a prediction and a contributing evidence document."""
    id: str  # UUID
    prediction_id: str
    document_id: str
    signal_id: str
    ticker: str
    source: str
    source_type: str
    catalyst_type: str
    sentiment: str
    impact: float
    extraction_confidence: float
    weight: float  # clamped to MAX_SINGLE_DOCUMENT_WEIGHT
    is_duplicate: bool
    canonical_evidence_key: str
    contribution_score: float  # weight / total non-duplicate weight; 0.0 for duplicates
    metadata: dict = field(default_factory=dict)
 # ---------------------------------------------------------------------------
 # Canonical evidence key computation (Requirements 2.3, 17.4)
 # ---------------------------------------------------------------------------
 def compute_canonical_evidence_key(title: str, url: str) -> str:
    """SHA256 of normalized(title) + normalized(url).
    Normalization:
    - Title: lowercase, strip leading/trailing whitespace
    - URL: lowercase, strip query parameters (keep scheme, netloc, path)
    """
    normalized_title = title.strip().lower()
    parsed = urllib.parse.urlparse(url.lower())
    normalized_url = urllib.parse.urlunparse(
        (parsed.scheme, parsed.netloc, parsed.path, "", "", "")
    )
    combined = normalized_title + normalized_url
    return hashlib.sha256(combined.encode("utf-8")).hexdigest()
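# Illustrative sketch only: a stand-alone copy of the normalization above,
# written to show which inputs collapse to the same key. The name
# `sketch_canonical_key` is hypothetical and exists only for this example.
# Note that lowercasing the whole URL also folds case-sensitive paths together.
def sketch_canonical_key(title: str, url: str) -> str:
    """Hash the lowercased, stripped title plus the URL without query/fragment."""
    import hashlib
    import urllib.parse

    parsed = urllib.parse.urlparse(url.lower())
    # Keep scheme, host, and path; drop params, query string, and fragment.
    bare_url = urllib.parse.urlunparse(
        (parsed.scheme, parsed.netloc, parsed.path, "", "", "")
    )
    combined = title.strip().lower() + bare_url
    return hashlib.sha256(combined.encode("utf-8")).hexdigest()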
 # ---------------------------------------------------------------------------
 # Contribution score computation (Requirements 2.5, 17.7)
 # ---------------------------------------------------------------------------
 def compute_contribution_scores(weights: list[float]) -> list[float]:
    """Compute contribution scores: each score = weight_i / sum(weights).
    Assumes non-negative weights; scores are then in [0.0, 1.0] and sum to 1.0
    (within floating-point tolerance). If every weight is zero, the scores are
    split equally. Returns an empty list for empty input.
    """
    if not weights:
        return []
    total = sum(weights)
    if total == 0.0:
        # All weights are zero — distribute equally
        n = len(weights)
        return [1.0 / n] * n
    return [w / total for w in weights]
 # ---------------------------------------------------------------------------
 # Price fetching (Requirements 1.2, 1.3, 1.4, 1.5)
 # ---------------------------------------------------------------------------
 _LATEST_CLOSE_SQL = """
 SELECT (data->>'c')::float AS close
 FROM market_snapshots
 WHERE ticker = $1 AND snapshot_type = 'bar' AND data->>'c' IS NOT NULL
 ORDER BY captured_at DESC
 LIMIT 1
 """
 async def fetch_latest_close_price(
    pool: asyncpg.Pool,
    ticker: str,
 ) -> float | None:
    """Fetch most recent close price from market_snapshots for a ticker.
    Returns None if no market data is available for the ticker.
    """
    row = await pool.fetchrow(_LATEST_CLOSE_SQL, ticker)
    if row is None:
        return None
    return row["close"]
 # ---------------------------------------------------------------------------
 # Sector ETF lookup
 # ---------------------------------------------------------------------------
 _COMPANY_SECTOR_SQL = """
 SELECT sector FROM companies WHERE ticker = $1 AND active = TRUE LIMIT 1
 """
 async def _fetch_sector_etf_ticker(pool: asyncpg.Pool, ticker: str) -> str | None:
    """Look up the sector ETF ticker for a company ticker."""
    row = await pool.fetchrow(_COMPANY_SECTOR_SQL, ticker)
    if row is None or row["sector"] is None:
        return None
    return SECTOR_ETF_MAP.get(row["sector"])
 # ---------------------------------------------------------------------------
 # Layer score computation
 # ---------------------------------------------------------------------------
 def _compute_layer_scores(
    evidence_signals: list[dict],
 ) -> tuple[float, float, float]:
    """Compute company, macro, and competitive layer scores from evidence signals.
    Each signal is assigned to exactly one layer, checked in this order:
    - macro: source_type contains 'macro', or catalyst_type equals 'macro'
    - competitive: source_type contains 'competitive' or 'pattern'
    - company: everything else (e.g. news_api, filings_api, web_scrape)
    Returns (score_company, score_macro, score_competitive) as weight fractions
    rounded to 6 decimals; they sum to 1.0 up to rounding, or are all 0.0 when
    the total weight is zero.
    """
    company_weight = 0.0
    macro_weight = 0.0
    competitive_weight = 0.0
    for sig in evidence_signals:
        w = sig.get("weight", 0.0)
        source_type = sig.get("source_type", "").lower()
        catalyst_type = sig.get("catalyst_type", "").lower()
        if "macro" in source_type or catalyst_type == "macro":
            macro_weight += w
        elif "competitive" in source_type or "pattern" in source_type:
            competitive_weight += w
        else:
            company_weight += w
    total = company_weight + macro_weight + competitive_weight
    if total == 0.0:
        return (0.0, 0.0, 0.0)
    return (
        round(company_weight / total, 6),
        round(macro_weight / total, 6),
        round(competitive_weight / total, 6),
    )
 # ---------------------------------------------------------------------------
 # SQL statements
 # ---------------------------------------------------------------------------
 _INSERT_SNAPSHOT_SQL = """
 INSERT INTO prediction_snapshots (
    id, generated_at, ticker, "window", horizon, direction, action, mode,
    strength, confidence, contradiction, p_bull, p_bear,
    score_company, score_macro, score_competitive,
    evidence_count, unique_source_count, duplicate_evidence_count,
    price_at_prediction, spy_price_at_prediction, sector_etf_price_at_prediction,
    metadata
 ) VALUES (
    $1::uuid, $2, $3, $4, $5, $6, $7, $8,
    $9, $10, $11, $12, $13,
    $14, $15, $16,
    $17, $18, $19,
    $20, $21, $22,
    $23::jsonb
 )
 """
 _INSERT_EVIDENCE_LINK_SQL = """
 INSERT INTO signal_evidence_links (
    id, prediction_id, document_id, signal_id, ticker,
    source, source_type, catalyst_type, sentiment,
    impact, extraction_confidence, weight,
    is_duplicate, canonical_evidence_key, contribution_score,
    metadata
 ) VALUES (
    $1::uuid, $2::uuid, $3, $4, $5,
    $6, $7, $8, $9,
    $10, $11, $12,
    $13, $14, $15,
    $16::jsonb
 )
 """
 # ---------------------------------------------------------------------------
 # Main entry point (Requirements 1.1–1.7, 2.1–2.6, 3.1–3.4)
 # ---------------------------------------------------------------------------
 async def create_prediction_snapshot(
    pool: asyncpg.Pool,
    recommendation: Recommendation,
    trend_summary: TrendSummary,
    evidence_signals: list[dict],
    evidence_docs: list[dict],
 ) -> PredictionSnapshot:
    """Create and persist a prediction snapshot with evidence links.
    Steps:
    1. Fetch current prices (ticker, SPY, sector ETF) from market_snapshots
    2. Compute canonical evidence keys and detect duplicates
    3. Clamp individual document weights to MAX_SINGLE_DOCUMENT_WEIGHT
    4. Compute contribution scores (one-vote-per-canonical-key dedup)
    5. Persist snapshot and evidence links in a transaction
    Args:
        pool: asyncpg connection pool.
        recommendation: The generated Recommendation object.
        trend_summary: The TrendSummary used to generate the recommendation.
        evidence_signals: List of dicts with signal fields (source, source_type,
            catalyst_type, sentiment, impact, extraction_confidence, weight,
            document_id, signal_id, ticker).
        evidence_docs: List of dicts with document metadata (title, url, document_id).
    Returns:
        The persisted PredictionSnapshot.
    """
    ticker = recommendation.ticker
    # 1. Fetch prices — handle NULL gracefully (Requirement 1.5)
    ticker_price = await fetch_latest_close_price(pool, ticker)
    if ticker_price is None:
        logger.warning("No market price available for %s at snapshot time", ticker)
    spy_price = await fetch_latest_close_price(pool, "SPY")
    if spy_price is None:
        logger.warning("No SPY price available at snapshot time")
    sector_etf_ticker = await _fetch_sector_etf_ticker(pool, ticker)
    sector_etf_price: float | None = None
    if sector_etf_ticker is not None:
        sector_etf_price = await fetch_latest_close_price(pool, sector_etf_ticker)
        if sector_etf_price is None:
            logger.warning(
                "No sector ETF price available for %s (%s) at snapshot time",
                sector_etf_ticker,
                ticker,
            )
    else:
        logger.warning("No sector ETF mapping found for ticker %s", ticker)
    # 2. Build a doc lookup for canonical key computation
    doc_lookup: dict[str, dict] = {}
    for doc in evidence_docs:
        doc_id = doc.get("document_id", "")
        doc_lookup[doc_id] = doc
    # 3. Process evidence signals: compute canonical keys, detect duplicates,
    #    clamp weights
    processed_links: list[dict] = []
    seen_canonical_keys: dict[str, int] = {}  # canonical_key -> first index
    for sig in evidence_signals:
        doc_id = sig.get("document_id", "")
        doc_meta = doc_lookup.get(doc_id, {})
        title = doc_meta.get("title", "")
        url = doc_meta.get("url", "")
        canonical_key = compute_canonical_evidence_key(title, url)
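        # Detect duplicates: same canonical key seen earlier in this snapshot
        # (a snapshot covers a single ticker). Signals whose document has
        # neither a title nor a URL all hash to the same key.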
        is_duplicate = canonical_key in seen_canonical_keys
        if not is_duplicate:
            seen_canonical_keys[canonical_key] = len(processed_links)
        # Clamp weight to [0.0, MAX_SINGLE_DOCUMENT_WEIGHT] (Requirement 3.3);
        # the lower bound keeps contribution scores non-negative.
        raw_weight = sig.get("weight", 0.0)
        clamped_weight = max(0.0, min(raw_weight, MAX_SINGLE_DOCUMENT_WEIGHT))
        processed_links.append({
            "id": str(uuid.uuid4()),
            "document_id": doc_id,
            "signal_id": sig.get("signal_id", ""),
            "ticker": sig.get("ticker", ticker),
            "source": sig.get("source", ""),
            "source_type": sig.get("source_type", ""),
            "catalyst_type": sig.get("catalyst_type", ""),
            "sentiment": sig.get("sentiment", ""),
            "impact": sig.get("impact", 0.0),
            "extraction_confidence": sig.get("extraction_confidence", 0.0),
            "weight": clamped_weight,
            "is_duplicate": is_duplicate,
            "canonical_evidence_key": canonical_key,
        })
    # 4. Compute contribution scores — one vote per canonical key (Requirement 3.4)
    #    Only non-duplicate links contribute to the weight pool
    non_dup_weights = [
        link["weight"] for link in processed_links if not link["is_duplicate"]
    ]
    non_dup_scores = compute_contribution_scores(non_dup_weights)
    # Assign contribution scores: non-duplicates get their computed score,
    # duplicates get 0.0
    score_idx = 0
    for link in processed_links:
        if not link["is_duplicate"]:
            link["contribution_score"] = non_dup_scores[score_idx]
            score_idx += 1
        else:
            link["contribution_score"] = 0.0
    # 5. Compute deduplication quality metrics (Requirements 3.1, 3.2)
    unique_sources = {
        link["source"]
        for link in processed_links
        if not link["is_duplicate"]
    }
    unique_source_count = len(unique_sources)
    duplicate_evidence_count = sum(
        1 for link in processed_links if link["is_duplicate"]
    )
    # 6. Compute layer scores from evidence signals
    score_company, score_macro, score_competitive = _compute_layer_scores(
        evidence_signals
    )
    # 7. Build metadata from trend summary context (Requirement 1.7)
    metadata: dict = {}
    if trend_summary.market_context is not None:
        metadata["market_context"] = {
            "ticker": trend_summary.market_context.ticker,
            "price_change_pct": trend_summary.market_context.price_change_pct,
            "avg_volume": trend_summary.market_context.avg_volume,
            "volume_change_pct": trend_summary.market_context.volume_change_pct,
            "volatility": trend_summary.market_context.volatility,
            "latest_close": trend_summary.market_context.latest_close,
            "bars_available": trend_summary.market_context.bars_available,
        }
    if sector_etf_ticker is not None:
        metadata["sector_etf_ticker"] = sector_etf_ticker
    # 8. Build the snapshot
    snapshot_id = str(uuid.uuid4())
    snapshot = PredictionSnapshot(
        id=snapshot_id,
        generated_at=recommendation.generated_at,
        ticker=ticker,
        window=trend_summary.window.value,
        horizon=recommendation.time_horizon,
        direction=trend_summary.trend_direction.value,
        action=recommendation.action.value,
        mode=recommendation.mode.value,
        strength=trend_summary.trend_strength,
        confidence=recommendation.confidence,
        contradiction=trend_summary.contradiction_score,
        p_bull=trend_summary.p_bull,
        p_bear=(1.0 - trend_summary.p_bull) if trend_summary.p_bull is not None else None,
        score_company=score_company,
        score_macro=score_macro,
        score_competitive=score_competitive,
        evidence_count=len(processed_links),
        unique_source_count=unique_source_count,
        duplicate_evidence_count=duplicate_evidence_count,
        price_at_prediction=ticker_price,
        spy_price_at_prediction=spy_price,
        sector_etf_price_at_prediction=sector_etf_price,
        metadata=metadata,
    )
    # 9. Build evidence link objects
    evidence_link_objects: list[SignalEvidenceLink] = []
    for link in processed_links:
        evidence_link_objects.append(
            SignalEvidenceLink(
                id=link["id"],
                prediction_id=snapshot_id,
                document_id=link["document_id"],
                signal_id=link["signal_id"],
                ticker=link["ticker"],
                source=link["source"],
                source_type=link["source_type"],
                catalyst_type=link["catalyst_type"],
                sentiment=link["sentiment"],
                impact=link["impact"],
                extraction_confidence=link["extraction_confidence"],
                weight=link["weight"],
                is_duplicate=link["is_duplicate"],
                canonical_evidence_key=link["canonical_evidence_key"],
                contribution_score=link["contribution_score"],
            )
        )
    # 10. Persist in a transaction (Requirements 1.6, 2.6)
    async with pool.acquire() as conn:
        async with conn.transaction():
            await conn.execute(
                _INSERT_SNAPSHOT_SQL,
                snapshot.id,
                snapshot.generated_at,
                snapshot.ticker,
                snapshot.window,
                snapshot.horizon,
                snapshot.direction,
                snapshot.action,
                snapshot.mode,
                snapshot.strength,
                snapshot.confidence,
                snapshot.contradiction,
                snapshot.p_bull,
                snapshot.p_bear,
                snapshot.score_company,
                snapshot.score_macro,
                snapshot.score_competitive,
                snapshot.evidence_count,
                snapshot.unique_source_count,
                snapshot.duplicate_evidence_count,
                snapshot.price_at_prediction,
                snapshot.spy_price_at_prediction,
                snapshot.sector_etf_price_at_prediction,
                json.dumps(snapshot.metadata),
            )
            for link in evidence_link_objects:
                await conn.execute(
                    _INSERT_EVIDENCE_LINK_SQL,
                    link.id,
                    link.prediction_id,
                    link.document_id,
                    link.signal_id,
                    link.ticker,
                    link.source,
                    link.source_type,
                    link.catalyst_type,
                    link.sentiment,
                    link.impact,
                    link.extraction_confidence,
                    link.weight,
                    link.is_duplicate,
                    link.canonical_evidence_key,
                    link.contribution_score,
                    json.dumps(link.metadata),
                )
    logger.info(
        "Created prediction snapshot %s for %s: %d evidence links "
        "(%d unique sources, %d duplicates), prices: ticker=%s spy=%s sector_etf=%s",
        snapshot_id,
        ticker,
        len(evidence_link_objects),
        unique_source_count,
        duplicate_evidence_count,
        ticker_price,
        spy_price,
        sector_etf_price,
    )
    return snapshot
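Note on the evidence-link fields persisted above: each link stores a `canonical_evidence_key` and an `is_duplicate` flag. The sketch below shows one way those values could be derived, assuming the normalization rules stated in the accompanying tests (title stripped and lowercased, URL lowercased with query parameters removed, SHA-256 over the result). The names `canonical_evidence_key_sketch` and `mark_duplicates` are hypothetical and are not the functions in `services/validation/prediction_snapshot.py`; concatenating the two parts with no separator is also an assumption.

```python
from __future__ import annotations

import hashlib
import urllib.parse


def canonical_evidence_key_sketch(title: str, url: str) -> str:
    """Hypothetical stand-in for compute_canonical_evidence_key."""
    # Title: trim surrounding whitespace, then lowercase.
    normalized_title = title.strip().lower()
    # URL: lowercase, then keep only scheme, netloc, and path.
    parsed = urllib.parse.urlparse(url.lower())
    normalized_url = urllib.parse.urlunparse(
        (parsed.scheme, parsed.netloc, parsed.path, "", "", "")
    )
    # Assumption: the two parts are concatenated with no separator.
    payload = (normalized_title + normalized_url).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def mark_duplicates(docs: list[dict[str, str]]) -> list[bool]:
    """Flag every document whose key was already seen earlier in the list."""
    seen: set[str] = set()
    flags: list[bool] = []
    for doc in docs:
        key = canonical_evidence_key_sketch(doc["title"], doc["url"])
        flags.append(key in seen)
        seen.add(key)
    return flags
```

With this approach the first occurrence of a key is always kept as the original, so `duplicate_evidence_count` would be the number of `True` flags and the order of evidence affects which link is marked.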
@@ -0,0 +1,690 @@
 """Unit tests for model validation, calibration, and signal quality modules.
 Covers prediction snapshot writer, outcome evaluator, metrics engine,
 calibration engine, and quality gate — all pure-function / deterministic tests.
 Requirements: 1.1, 2.3, 2.4, 2.5, 3.3, 4.2, 4.5, 4.6, 4.7,
              5.3, 5.4, 6.1, 6.2, 6.5, 8.1, 8.2, 8.3, 11.1, 11.6
 """
 from __future__ import annotations
 import hashlib
 import pytest
 # -- Prediction Snapshot Writer --
 from services.validation.prediction_snapshot import (
    MAX_SINGLE_DOCUMENT_WEIGHT,
    compute_canonical_evidence_key,
    compute_contribution_scores,
 )
 # -- Outcome Evaluator --
 from services.validation.outcome_evaluator import (
    _compute_return,
    _is_direction_correct,
    _is_profitable,
 )
 # -- Metrics Engine --
 from services.validation.metrics import (
    compute_brier_score,
    compute_calibration_error,
    compute_information_coefficient,
    compute_rank_information_coefficient,
 )
 # -- Calibration Engine --
 from services.validation.calibration import (
    compute_adjusted_evidence_weight,
    compute_source_reliability,
 )
 # -- Quality Gate --
 from services.trading.model_quality_gate import (
    QualityGateConfig,
    _evaluate_thresholds,
 )
 # ===================================================================
 # 8.2 — Prediction Snapshot Writer unit tests
 # Requirements: 1.1, 2.3, 2.4, 2.5, 3.3
 # ===================================================================
 class TestCanonicalEvidenceKey:
    """Tests for compute_canonical_evidence_key."""
    def test_known_title_url_produces_expected_sha256(self):
        """Known title/URL pair produces a deterministic SHA256 hash."""
        key = compute_canonical_evidence_key(
            "Test Article", "https://example.com/article?ref=123"
        )
        assert key == "abd5818d51579a7af51cd06861289c7f1fdc97c0f522e8ba13ce9b4aad01cb6f"
    def test_empty_inputs(self):
        """Empty title and URL produce SHA256 of empty string."""
        key = compute_canonical_evidence_key("", "")
        expected = hashlib.sha256(b"").hexdigest()
        assert key == expected
    def test_unicode_inputs(self):
        """Unicode title and URL are handled correctly."""
        key = compute_canonical_evidence_key(
            "日本語テスト", "https://example.com/日本語"
        )
        assert key == "553553928bb4e36abdf283ff3c52df0695fca09809159650a9bdcb4fb2c5f62b"
    def test_normalization_case_insensitive(self):
        """Title and URL are lowercased before hashing."""
        key_lower = compute_canonical_evidence_key(
            "test article", "https://example.com/path"
        )
        key_upper = compute_canonical_evidence_key(
            "TEST ARTICLE", "HTTPS://EXAMPLE.COM/PATH"
        )
        assert key_lower == key_upper
    def test_normalization_strips_query_params(self):
        """URL query parameters are stripped before hashing."""
        key_with_params = compute_canonical_evidence_key(
            "title", "https://example.com/article?utm_source=twitter&ref=123"
        )
        key_without_params = compute_canonical_evidence_key(
            "title", "https://example.com/article"
        )
        assert key_with_params == key_without_params
    def test_normalization_strips_whitespace(self):
        """Leading/trailing whitespace in title is stripped."""
        key_trimmed = compute_canonical_evidence_key(
            "test", "https://example.com"
        )
        key_padded = compute_canonical_evidence_key(
            "  test  ", "https://example.com"
        )
        assert key_trimmed == key_padded
 class TestDuplicateDetection:
    """Tests for duplicate detection via canonical evidence keys."""
    def test_three_docs_two_sharing_key_one_duplicate(self):
        """3 docs where 2 share a canonical key → 1 marked duplicate."""
        # Simulate the duplicate detection logic from create_prediction_snapshot
        docs = [
            {"title": "Breaking News", "url": "https://news.com/article"},
            {"title": "breaking news", "url": "https://news.com/article?ref=1"},
            {"title": "Other Story", "url": "https://other.com/story"},
        ]
        seen_keys: dict[str, int] = {}
        duplicates: list[bool] = []
        for doc in docs:
            key = compute_canonical_evidence_key(doc["title"], doc["url"])
            is_dup = key in seen_keys
            if not is_dup:
                seen_keys[key] = len(duplicates)
            duplicates.append(is_dup)
        assert duplicates == [False, True, False]
        assert sum(duplicates) == 1
 class TestContributionScores:
    """Tests for compute_contribution_scores."""
    def test_known_weights(self):
        """[0.5, 0.3, 0.2] → [0.5, 0.3, 0.2] (already sums to 1.0)."""
        scores = compute_contribution_scores([0.5, 0.3, 0.2])
        assert scores == pytest.approx([0.5, 0.3, 0.2])
        assert sum(scores) == pytest.approx(1.0)
    def test_single_doc(self):
        """Single document → contribution score of 1.0."""
        scores = compute_contribution_scores([0.7])
        assert scores == pytest.approx([1.0])
    def test_empty_input(self):
        """Empty input → empty list."""
        scores = compute_contribution_scores([])
        assert scores == []
    def test_all_zero_weights(self):
        """All-zero weights → equal distribution."""
        scores = compute_contribution_scores([0.0, 0.0, 0.0])
        assert len(scores) == 3
        assert all(s == pytest.approx(1.0 / 3.0) for s in scores)
    def test_scores_sum_to_one(self):
        """Arbitrary weights sum to 1.0."""
        scores = compute_contribution_scores([1.0, 2.0, 3.0, 4.0])
        assert sum(scores) == pytest.approx(1.0)
        assert scores == pytest.approx([0.1, 0.2, 0.3, 0.4])
 class TestWeightClamping:
    """Tests for MAX_SINGLE_DOCUMENT_WEIGHT clamping."""
    def test_weight_above_max_clamped(self):
        """Weight 1.5 → clamped to MAX_SINGLE_DOCUMENT_WEIGHT (1.0)."""
        raw_weight = 1.5
        clamped = min(raw_weight, MAX_SINGLE_DOCUMENT_WEIGHT)
        assert clamped == 1.0
    def test_weight_at_max_unchanged(self):
        """Weight exactly at MAX stays unchanged."""
        raw_weight = 1.0
        clamped = min(raw_weight, MAX_SINGLE_DOCUMENT_WEIGHT)
        assert clamped == 1.0
    def test_weight_below_max_unchanged(self):
        """Weight below MAX stays unchanged."""
        raw_weight = 0.5
        clamped = min(raw_weight, MAX_SINGLE_DOCUMENT_WEIGHT)
        assert clamped == 0.5
 # ===================================================================
 # 8.3 — Outcome Evaluator unit tests
 # Requirements: 4.2, 4.5, 4.6, 4.7
 # ===================================================================
 class TestComputeReturn:
    """Tests for _compute_return."""
    def test_positive_return(self):
        """Price 100 → 110 → return 0.10."""
        assert _compute_return(100.0, 110.0) == pytest.approx(0.10)
    def test_negative_return(self):
        """Price 100 → 90 → return -0.10."""
        assert _compute_return(100.0, 90.0) == pytest.approx(-0.10)
    def test_zero_return(self):
        """Price unchanged → return 0.0."""
        assert _compute_return(100.0, 100.0) == pytest.approx(0.0)
    def test_zero_current_price(self):
        """Current price 0 → return 0.0 (guard against division by zero)."""
        assert _compute_return(0.0, 110.0) == 0.0
 class TestDirectionCorrect:
    """Tests for _is_direction_correct."""
    def test_bullish_positive_return(self):
        """Bullish + positive return → True."""
        assert _is_direction_correct("bullish", 0.05) is True
    def test_bullish_negative_return(self):
        """Bullish + negative return → False."""
        assert _is_direction_correct("bullish", -0.05) is False
    def test_bearish_negative_return(self):
        """Bearish + negative return → True."""
        assert _is_direction_correct("bearish", -0.05) is True
    def test_bearish_positive_return(self):
        """Bearish + positive return → False."""
        assert _is_direction_correct("bearish", 0.05) is False
    def test_bullish_zero_return(self):
        """Bullish + zero return → False (not strictly positive)."""
        assert _is_direction_correct("bullish", 0.0) is False
    def test_bearish_zero_return(self):
        """Bearish + zero return → False (not strictly negative)."""
        assert _is_direction_correct("bearish", 0.0) is False
    def test_mixed_direction(self):
        """Mixed direction → always False."""
        assert _is_direction_correct("mixed", 0.05) is False
        assert _is_direction_correct("mixed", -0.05) is False
    def test_case_insensitive(self):
        """Direction matching is case-insensitive."""
        assert _is_direction_correct("Bullish", 0.05) is True
        assert _is_direction_correct("BEARISH", -0.05) is True
 class TestIsProfitable:
    """Tests for _is_profitable."""
    def test_buy_positive_return(self):
        """Buy + positive return → True."""
        assert _is_profitable("buy", 0.05) is True
    def test_buy_negative_return(self):
        """Buy + negative return → False."""
        assert _is_profitable("buy", -0.05) is False
    def test_sell_negative_return(self):
        """Sell + negative return → True."""
        assert _is_profitable("sell", -0.05) is True
    def test_sell_positive_return(self):
        """Sell + positive return → False."""
        assert _is_profitable("sell", 0.05) is False
    def test_hold_any_return(self):
        """Hold → always False."""
        assert _is_profitable("hold", 0.05) is False
        assert _is_profitable("hold", -0.05) is False
    def test_case_insensitive(self):
        """Action matching is case-insensitive."""
        assert _is_profitable("Buy", 0.05) is True
        assert _is_profitable("SELL", -0.05) is True
 class TestExcessReturn:
    """Tests for excess return computation (ticker return - benchmark return)."""
    def test_excess_return_vs_spy(self):
        """Ticker 10%, SPY 5% → excess 5%."""
        ticker_return = _compute_return(100.0, 110.0)  # 0.10
        spy_return = _compute_return(100.0, 105.0)  # 0.05
        excess = ticker_return - spy_return
        assert excess == pytest.approx(0.05)
    def test_negative_excess_return(self):
        """Ticker 3%, SPY 5% → excess -2%."""
        ticker_return = _compute_return(100.0, 103.0)  # 0.03
        spy_return = _compute_return(100.0, 105.0)  # 0.05
        excess = ticker_return - spy_return
        assert excess == pytest.approx(-0.02)
    def test_zero_excess_return(self):
        """Same return → excess 0%."""
        ticker_return = _compute_return(100.0, 110.0)
        spy_return = _compute_return(100.0, 110.0)
        excess = ticker_return - spy_return
        assert excess == pytest.approx(0.0)
 # ===================================================================
 # 8.4 — Metrics Engine unit tests
 # Requirements: 5.3, 5.4, 6.1, 6.2, 6.5
 # ===================================================================
 class TestCalibrationError:
    """Tests for compute_calibration_error (ECE)."""
    def test_perfect_calibration_ece_zero(self):
        """Perfect calibration → ECE = 0.0.
        All predictions in [0.70, 0.80) bucket with 75% win rate
        matching ~0.75 avg confidence.
        """
        confidences = [0.75] * 100
        outcomes = [True] * 75 + [False] * 25
        ece, buckets = compute_calibration_error(confidences, outcomes)
        assert ece == pytest.approx(0.0, abs=1e-9)
    def test_all_overconfident_positive_ece(self):
        """All overconfident (high confidence, low win rate) → positive ECE."""
        # All predictions at 0.95 confidence but only 50% win rate
        confidences = [0.95] * 100
        outcomes = [True] * 50 + [False] * 50
        ece, buckets = compute_calibration_error(confidences, outcomes)
        assert ece > 0.0
        # ECE should be |0.95 - 0.50| = 0.45
        assert ece == pytest.approx(0.45, abs=0.01)
    def test_empty_input_returns_zero(self):
        """Empty input → ECE = 0.0, empty buckets."""
        ece, buckets = compute_calibration_error([], [])
        assert ece == 0.0
        assert buckets == []
    def test_miscalibrated_flag(self):
        """Buckets with |avg_conf - win_rate| > 0.15 are flagged."""
        # All in [0.90, 1.00] bucket with 0% win rate → diff = 0.95
        confidences = [0.95] * 20
        outcomes = [False] * 20
        _ece, buckets = compute_calibration_error(confidences, outcomes)
        # Find the [0.90, 1.00] bucket (approx avoids exact float equality)
        high_bucket = [b for b in buckets if b.bucket_low == pytest.approx(0.90)]
        assert len(high_bucket) == 1
        assert high_bucket[0].miscalibrated is True
    def test_ece_in_valid_range(self):
        """ECE is always in [0.0, 1.0]."""
        confidences = [0.55, 0.65, 0.75, 0.85, 0.95]
        outcomes = [False, True, False, True, False]
        ece, _ = compute_calibration_error(confidences, outcomes)
        assert 0.0 <= ece <= 1.0
 class TestBrierScore:
    """Tests for compute_brier_score."""
    def test_all_correct_at_p1(self):
        """All correct at p=1.0 → Brier = 0.0."""
        p_bulls = [1.0] * 10
        outcomes = [True] * 10
        assert compute_brier_score(p_bulls, outcomes) == pytest.approx(0.0)
    def test_all_wrong_at_p1(self):
        """All wrong at p=1.0 → Brier = 1.0."""
        p_bulls = [1.0] * 10
        outcomes = [False] * 10
        assert compute_brier_score(p_bulls, outcomes) == pytest.approx(1.0)
    def test_all_correct_at_p0(self):
        """All correct at p=0.0 (bearish correct) → Brier = 0.0."""
        p_bulls = [0.0] * 10
        outcomes = [False] * 10
        assert compute_brier_score(p_bulls, outcomes) == pytest.approx(0.0)
    def test_empty_input(self):
        """Empty input → Brier = 0.0."""
        assert compute_brier_score([], []) == 0.0
    def test_mixed_predictions(self):
        """Mixed predictions produce a value in (0, 1)."""
        p_bulls = [0.8, 0.6, 0.3]
        outcomes = [True, False, True]
        brier = compute_brier_score(p_bulls, outcomes)
        assert 0.0 < brier < 1.0
 class TestInformationCoefficient:
    """Tests for compute_information_coefficient (Pearson IC)."""
    def test_perfect_positive_correlation(self):
        """Perfectly correlated scores and returns → IC = 1.0."""
        scores = list(range(30))
        returns = [s * 2.0 + 1.0 for s in scores]  # linear: y = 2x + 1
        ic = compute_information_coefficient(scores, returns)
        assert ic is not None
        assert ic == pytest.approx(1.0, abs=1e-9)
    def test_perfect_negative_correlation(self):
        """Anti-correlated scores and returns → IC = -1.0."""
        scores = list(range(30))
        returns = [-s * 2.0 for s in scores]
        ic = compute_information_coefficient(scores, returns)
        assert ic is not None
        assert ic == pytest.approx(-1.0, abs=1e-9)
    def test_fewer_than_30_returns_none(self):
        """Fewer than 30 data points → None."""
        scores = list(range(29))
        returns = list(range(29))
        ic = compute_information_coefficient(scores, returns)
        assert ic is None
    def test_ic_in_valid_range(self):
        """IC is always in [-1.0, 1.0] for valid data."""
        scores = [float(i % 7) for i in range(50)]
        returns = [float(i % 5) for i in range(50)]
        ic = compute_information_coefficient(scores, returns)
        assert ic is not None
        assert -1.0 <= ic <= 1.0
 class TestRankInformationCoefficient:
    """Tests for compute_rank_information_coefficient (Spearman Rank IC)."""
    def test_perfect_rank_correlation(self):
        """Perfectly rank-correlated → Rank IC = 1.0."""
        scores = list(range(30))
        returns = list(range(30))  # same ordering
        rank_ic = compute_rank_information_coefficient(scores, returns)
        assert rank_ic is not None
        assert rank_ic == pytest.approx(1.0, abs=1e-9)
    def test_perfect_anti_rank_correlation(self):
        """Perfectly anti-rank-correlated → Rank IC = -1.0."""
        scores = list(range(30))
        returns = list(range(29, -1, -1))  # reversed ordering
        rank_ic = compute_rank_information_coefficient(scores, returns)
        assert rank_ic is not None
        assert rank_ic == pytest.approx(-1.0, abs=1e-9)
    def test_fewer_than_30_returns_none(self):
        """Fewer than 30 data points → None."""
        scores = list(range(29))
        returns = list(range(29))
        rank_ic = compute_rank_information_coefficient(scores, returns)
        assert rank_ic is None
 # ===================================================================
 # 8.5 — Calibration Engine unit tests
 # Requirements: 8.1, 8.2, 8.3
 # ===================================================================
 class TestSourceReliability:
    """Tests for compute_source_reliability (Bayesian shrinkage)."""
    def test_zero_samples_returns_prior(self):
        """n=0 → reliability = 0.5 (prior mean)."""
        assert compute_source_reliability(0.8, 0) == 0.5
    def test_large_sample_approaches_observed(self):
        """n=1000 with wr=0.8 → ≈0.8 (close to observed win rate)."""
        reliability = compute_source_reliability(0.8, 1000)
        assert reliability == pytest.approx(0.7912621359223302)
        # Should be close to 0.8 but not exactly
        assert abs(reliability - 0.8) < 0.02
    def test_moderate_sample(self):
        """n=30 with wr=0.7 → 0.6 exactly.
        0.5 + (30/60) * (0.7 - 0.5) = 0.5 + 0.5 * 0.2 = 0.6
        """
        assert compute_source_reliability(0.7, 30) == pytest.approx(0.6)
    def test_reliability_in_range(self):
        """Reliability is always in [0.0, 1.0]."""
        # Extreme win rates
        assert 0.0 <= compute_source_reliability(0.0, 100) <= 1.0
        assert 0.0 <= compute_source_reliability(1.0, 100) <= 1.0
        assert 0.0 <= compute_source_reliability(0.5, 1) <= 1.0
    def test_negative_sample_count_returns_prior(self):
        """Negative sample count → treated as 0, returns 0.5."""
        assert compute_source_reliability(0.8, -5) == 0.5
 class TestAdjustedEvidenceWeight:
    """Tests for compute_adjusted_evidence_weight."""
    def test_reliability_half_gives_base_weight(self):
        """reliability=0.5 → adjusted = base * (0.5 + 0.5) = base * 1.0."""
        assert compute_adjusted_evidence_weight(1.0, 0.5) == pytest.approx(1.0)
    def test_high_reliability_increases_weight(self):
        """reliability=1.0 → adjusted = base * 1.5."""
        assert compute_adjusted_evidence_weight(1.0, 1.0) == pytest.approx(1.5)
    def test_low_reliability_decreases_weight(self):
        """reliability=0.0 → adjusted = base * 0.5."""
        assert compute_adjusted_evidence_weight(1.0, 0.0) == pytest.approx(0.5)
    def test_clamped_to_upper_bound(self):
        """Large base_weight * high reliability → clamped to 2.0."""
        result = compute_adjusted_evidence_weight(3.0, 1.0)
        assert result == 2.0
    def test_clamped_to_lower_bound(self):
        """Small base_weight * low reliability → clamped to 0.1."""
        result = compute_adjusted_evidence_weight(0.1, 0.0)
        assert result == 0.1
    def test_mid_range_not_clamped(self):
        """Normal values stay within bounds without clamping."""
        result = compute_adjusted_evidence_weight(0.8, 0.6)
        # 0.8 * (0.5 + 0.6) = 0.8 * 1.1 = 0.88
        assert result == pytest.approx(0.88)
        assert 0.1 <= result <= 2.0
 # ===================================================================
 # 8.6 — Quality Gate unit tests
 # Requirements: 11.1, 11.6
 # ===================================================================
 class TestQualityGate:
    """Tests for _evaluate_thresholds and QualityGateConfig."""
    def _make_passing_snapshot(self) -> dict:
        """Return a metric snapshot dict that meets all default thresholds."""
        return {
            "prediction_count": 200,
            "information_coefficient": 0.10,
            "win_rate": 0.60,
            "calibration_error": 0.08,
            "avg_excess_return_vs_spy": 0.02,
        }
    def test_all_thresholds_met_pass(self):
        """All thresholds met → every result is passed=True."""
        config = QualityGateConfig()
        snapshot = self._make_passing_snapshot()
        results = _evaluate_thresholds(snapshot, config)
        assert len(results) == 5
        assert all(r.passed for r in results), (
            f"Expected all thresholds to pass, but got: "
            f"{[(r.name, r.passed) for r in results]}"
        )
    def test_one_threshold_failed_ic_below_min(self):
        """IC below min_ic → that threshold fails, others pass."""
        config = QualityGateConfig()
        snapshot = self._make_passing_snapshot()
        snapshot["information_coefficient"] = 0.01  # below min_ic=0.03
        results = _evaluate_thresholds(snapshot, config)
        results_by_name = {r.name: r for r in results}
        assert results_by_name["min_ic"].passed is False
        assert results_by_name["min_ic"].actual == pytest.approx(0.01)
        assert results_by_name["min_ic"].threshold == pytest.approx(0.03)
        # All other thresholds should still pass
        for name, result in results_by_name.items():
            if name != "min_ic":
                assert result.passed is True, f"{name} should pass but didn't"
    def test_all_thresholds_below_all_fail(self):
        """Every metric on the failing side of its threshold → all results are passed=False."""
        config = QualityGateConfig()
        snapshot = {
            "prediction_count": 10,           # below 100
            "information_coefficient": 0.0,   # below 0.03
            "win_rate": 0.40,                 # below 0.53
            "calibration_error": 0.50,        # above 0.15
            "avg_excess_return_vs_spy": -0.05, # below 0.0
        }
        results = _evaluate_thresholds(snapshot, config)
        assert len(results) == 5
        assert all(not r.passed for r in results), (
            f"Expected all thresholds to fail, but got: "
            f"{[(r.name, r.passed) for r in results]}"
        )
    def test_failsafe_none_values_treated_as_worst_case(self):
        """Missing (None) metric values are treated as worst-case defaults.
        This tests the fail-safe behavior: when no snapshots exist,
        the snapshot dict would have None values. _evaluate_thresholds
        treats None as 0 for min-thresholds and 1.0 for max_ece, so four
        of the five thresholds fail. The excess-return check still passes
        because 0.0 meets the default minimum of 0.0; the remaining
        failures are what keep the gate at paper-only.
        """
        config = QualityGateConfig()
        snapshot = {
            "prediction_count": None,
            "information_coefficient": None,
            "win_rate": None,
            "calibration_error": None,
            "avg_excess_return_vs_spy": None,
        }
        results = _evaluate_thresholds(snapshot, config)
        results_by_name = {r.name: r for r in results}
        # prediction_count: None → 0, below 100 → fail
        assert results_by_name["min_prediction_count"].passed is False
        assert results_by_name["min_prediction_count"].actual == 0.0
        # IC: None → 0.0, below 0.03 → fail
        assert results_by_name["min_ic"].passed is False
        assert results_by_name["min_ic"].actual == 0.0
        # win_rate: None → 0.0, below 0.53 → fail
        assert results_by_name["min_win_rate"].passed is False
        assert results_by_name["min_win_rate"].actual == 0.0
        # calibration_error: None → 1.0 (worst-case), above 0.15 → fail
        assert results_by_name["max_ece"].passed is False
        assert results_by_name["max_ece"].actual == 1.0
        # excess_return: None → 0.0, equal to min 0.0 → pass (>= 0.0)
        assert results_by_name["min_excess_return_vs_spy"].passed is True
        assert results_by_name["min_excess_return_vs_spy"].actual == 0.0
    def test_stale_snapshot_age_exceeds_max(self):
        """Snapshot age exceeding max_snapshot_age_hours is detectable from config.
        The staleness check itself lives in the async evaluate_quality_gate
        function, which runs before _evaluate_thresholds and is not exercised
        here. This test only verifies that the configured limit is stored on
        QualityGateConfig and that a 30-hour-old snapshot compares as stale.
        """
        config = QualityGateConfig(max_snapshot_age_hours=24)
        age_hours = 30.0  # 30 hours old, exceeds 24h max
        assert config.max_snapshot_age_hours == 24
        assert age_hours > config.max_snapshot_age_hours
    def test_threshold_boundary_exact_values(self):
        """Metric values exactly at threshold boundaries → pass.
        min thresholds use >=, max thresholds use <=.
        """
        config = QualityGateConfig()
        snapshot = {
            "prediction_count": 100,          # exactly min_prediction_count
            "information_coefficient": 0.03,  # exactly min_ic
            "win_rate": 0.53,                 # exactly min_win_rate
            "calibration_error": 0.15,        # exactly max_ece
            "avg_excess_return_vs_spy": 0.0,  # exactly min_excess_return
        }
        results = _evaluate_thresholds(snapshot, config)
        assert all(r.passed for r in results), (
            f"Boundary values should pass, but got: "
            f"{[(r.name, r.passed, r.actual, r.threshold) for r in results]}"
        )
    def test_custom_config_thresholds(self):
        """Custom QualityGateConfig thresholds are respected."""
        config = QualityGateConfig(
            min_prediction_count=50,
            min_ic=0.01,
            min_win_rate=0.51,
            max_ece=0.20,
            min_excess_return_vs_spy=-0.01,
        )
        snapshot = {
            "prediction_count": 60,
            "information_coefficient": 0.02,
            "win_rate": 0.52,
            "calibration_error": 0.18,
            "avg_excess_return_vs_spy": -0.005,
        }
        results = _evaluate_thresholds(snapshot, config)
        assert all(r.passed for r in results), (
            f"Custom thresholds should pass, but got: "
            f"{[(r.name, r.passed) for r in results]}"
        )
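The calibration tests above pin down the shrinkage arithmetic fairly tightly: n=30 with wr=0.7 gives 0.6, and n=1000 with wr=0.8 gives about 0.7913. Both values are consistent with shrinking the observed win rate toward a prior mean of 0.5 with a prior strength of 30 pseudo-observations. A minimal sketch under that assumption follows; `PRIOR_MEAN`, `PRIOR_STRENGTH`, the weight bounds, and the `_sketch` function names are hypothetical, and the real `compute_source_reliability` and `compute_adjusted_evidence_weight` may be parameterized differently.

```python
PRIOR_MEAN = 0.5       # assumed prior win rate for an unknown source
PRIOR_STRENGTH = 30    # assumed pseudo-observation count, inferred from the tests
MIN_WEIGHT = 0.1       # assumed lower clamp, matching the test expectations
MAX_WEIGHT = 2.0       # assumed upper clamp, matching the test expectations


def source_reliability_sketch(win_rate: float, sample_count: int) -> float:
    """Shrink an observed win rate toward the prior based on sample size."""
    n = max(sample_count, 0)  # negative counts are treated as no evidence
    shrink = n / (n + PRIOR_STRENGTH)
    return PRIOR_MEAN + shrink * (win_rate - PRIOR_MEAN)


def adjusted_evidence_weight_sketch(base_weight: float, reliability: float) -> float:
    """Scale a base weight by (0.5 + reliability), then clamp to the bounds."""
    raw = base_weight * (0.5 + reliability)
    return min(max(raw, MIN_WEIGHT), MAX_WEIGHT)
```

Under these assumptions a source with no history keeps its base weight unchanged (reliability 0.5 gives a multiplier of 1.0), and a source needs a few hundred evaluated predictions before its observed win rate dominates the prior.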
@@ -0,0 +1,662 @@
 """Property-based tests for model validation, calibration, and signal quality.
 Feature: model-validation-calibration
 Tests correctness properties from the design specification covering
 canonical evidence key determinism/idempotence, contribution score
 invariants, calibration error bounds, Brier score bounds, information
 coefficient bounds, source reliability shrinkage, and quality gate
 determinism.
 """
 from __future__ import annotations
 import urllib.parse
 from hypothesis import given, settings
 from hypothesis import strategies as st
 from services.validation.prediction_snapshot import (
    compute_canonical_evidence_key,
    compute_contribution_scores,
 )
 # ---------------------------------------------------------------------------
 # Strategies
 # ---------------------------------------------------------------------------
 # Titles: arbitrary text (including whitespace, unicode)
 title_strategy = st.text(min_size=0, max_size=200)
 # URLs: build realistic URLs with optional query params
 url_strategy = st.builds(
    lambda scheme, host, path, query: urllib.parse.urlunparse(
        (scheme, host, path, "", query, "")
    ),
    scheme=st.sampled_from(["http", "https"]),
    host=st.from_regex(r"[a-z0-9]{1,20}\.[a-z]{2,6}", fullmatch=True),
    path=st.from_regex(r"(/[a-z0-9\-]{0,15}){0,4}", fullmatch=True),
    query=st.from_regex(r"([a-z]{1,8}=[a-z0-9]{1,8}(&[a-z]{1,8}=[a-z0-9]{1,8}){0,3})?", fullmatch=True),
 )
 # ---------------------------------------------------------------------------
 # Property 4: Canonical Evidence Key Determinism and Normalization Idempotence
 # Validates: Requirements 2.3, 17.4
 # ---------------------------------------------------------------------------
@given(title=title_strategy, url=url_strategy)
@settings(max_examples=100)
 def test_canonical_evidence_key_determinism(title: str, url: str) -> None:
    """**Validates: Requirements 2.3, 17.4**
    For any (title, url) pair, computing the canonical evidence key twice
    with the same inputs SHALL produce the same result (determinism).
    """
    key1 = compute_canonical_evidence_key(title, url)
    key2 = compute_canonical_evidence_key(title, url)
    assert key1 == key2, (
        f"Determinism violated: same inputs produced different keys: "
        f"{key1!r} != {key2!r}"
    )
    # Key should be a valid SHA256 hex digest (64 hex chars)
    assert len(key1) == 64, f"Expected 64-char hex digest, got {len(key1)}"
    assert all(c in "0123456789abcdef" for c in key1), (
        f"Key contains non-hex characters: {key1!r}"
    )
@given(title=title_strategy, url=url_strategy)
@settings(max_examples=100)
 def test_canonical_evidence_key_normalization_idempotence(title: str, url: str) -> None:
    """**Validates: Requirements 2.3, 17.4**
    Normalizing an already-normalized input and computing the key SHALL
    produce the same key as the original computation (idempotence).
    Normalization rules:
    - Title: lowercase, strip leading/trailing whitespace
    - URL: lowercase, strip query parameters (keep scheme, netloc, path)
    """
    # Compute key from original (unnormalized) inputs
    key_original = compute_canonical_evidence_key(title, url)
    # Pre-normalize the inputs the same way the function does internally
    normalized_title = title.strip().lower()
    parsed = urllib.parse.urlparse(url.lower())
    normalized_url = urllib.parse.urlunparse(
        (parsed.scheme, parsed.netloc, parsed.path, "", "", "")
    )
    # Compute key from already-normalized inputs
    key_from_normalized = compute_canonical_evidence_key(normalized_title, normalized_url)
    assert key_original == key_from_normalized, (
        f"Idempotence violated: key from original inputs ({key_original!r}) "
        f"differs from key from pre-normalized inputs ({key_from_normalized!r}). "
        f"title={title!r}, url={url!r}"
    )
 # ---------------------------------------------------------------------------
 # Strategies for contribution score tests
 # ---------------------------------------------------------------------------
 positive_weights_strategy = st.lists(
    st.floats(min_value=0.01, max_value=1000.0, allow_nan=False, allow_infinity=False),
    min_size=1,
    max_size=50,
 )
 # ---------------------------------------------------------------------------
 # Property 7: Contribution Score Sum-to-One and Range
 # Validates: Requirements 2.5, 17.7
 # ---------------------------------------------------------------------------
@given(weights=positive_weights_strategy)
@settings(max_examples=100)
 def test_contribution_scores_sum_to_one_and_range(weights: list[float]) -> None:
    """**Validates: Requirements 2.5, 17.7**
    For any non-empty list of positive document weights, the computed
    contribution scores SHALL each be in [0.0, 1.0] and SHALL sum to 1.0
    (within floating-point tolerance of 1e-9).
    """
    scores = compute_contribution_scores(weights)
    # Same length as input
    assert len(scores) == len(weights), (
        f"Expected {len(weights)} scores, got {len(scores)}"
    )
    # Each score in [0.0, 1.0]
    for i, score in enumerate(scores):
        assert 0.0 <= score <= 1.0, (
            f"Score at index {i} is {score}, expected in [0.0, 1.0]. "
            f"weights={weights}"
        )
    # Scores sum to 1.0 within tolerance
    total = sum(scores)
    assert abs(total - 1.0) < 1e-9, (
        f"Scores sum to {total}, expected 1.0 within 1e-9 tolerance. "
        f"weights={weights}"
    )
 def test_contribution_scores_empty_input() -> None:
    """**Validates: Requirements 2.5, 17.7**
    For an empty weight list, the result SHALL be an empty list.
    """
    scores = compute_contribution_scores([])
    assert scores == [], f"Expected empty list for empty input, got {scores}"
 # ---------------------------------------------------------------------------
 # Strategies for calibration error tests
 # ---------------------------------------------------------------------------
 confidence_strategy = st.floats(
    min_value=0.50, max_value=1.00, allow_nan=False, allow_infinity=False
 )
 outcome_strategy = st.booleans()
 prediction_pairs_strategy = st.lists(
    st.tuples(confidence_strategy, outcome_strategy),
    min_size=1,
    max_size=100,
 )
 # Import metric functions
 from services.validation.metrics import (
    compute_brier_score,
    compute_calibration_error,
    compute_information_coefficient,
 )
 # ---------------------------------------------------------------------------
 # Property 1: Calibration Error Range and Round-Trip
 # Validates: Requirements 5.1, 5.3, 17.1
 # ---------------------------------------------------------------------------
@given(pairs=prediction_pairs_strategy)
@settings(max_examples=100)
 def test_calibration_error_range(pairs: list[tuple[float, bool]]) -> None:
    """**Validates: Requirements 5.1, 5.3, 17.1**
    For any valid distribution of predictions with confidences in [0.50, 1.00]
    and boolean outcomes, the Expected Calibration Error (ECE) SHALL be in
    [0.0, 1.0].
    """
    confidences = [c for c, _ in pairs]
    outcomes = [o for _, o in pairs]
    ece, buckets = compute_calibration_error(confidences, outcomes)
    assert 0.0 <= ece <= 1.0, (
        f"ECE {ece} is outside [0.0, 1.0]. "
        f"confidences={confidences}, outcomes={outcomes}"
    )
    # Each bucket's metrics should also be well-formed
    for bucket in buckets:
        if bucket.prediction_count > 0:
            assert 0.0 <= bucket.avg_confidence <= 1.0, (
                f"Bucket [{bucket.bucket_low}, {bucket.bucket_high}) has "
                f"avg_confidence={bucket.avg_confidence} outside [0.0, 1.0]"
            )
            assert 0.0 <= bucket.observed_win_rate <= 1.0, (
                f"Bucket [{bucket.bucket_low}, {bucket.bucket_high}) has "
                f"observed_win_rate={bucket.observed_win_rate} outside [0.0, 1.0]"
            )
 def test_calibration_error_zero_when_perfectly_calibrated() -> None:
    """**Validates: Requirements 5.1, 5.3, 17.1**
    When every bucket's observed win rate exactly matches its average
    confidence, ECE SHALL be 0.0.
    Constructs a scenario with predictions in multiple buckets where the
    fraction of True outcomes in each bucket equals the bucket's average
    confidence.
    """
    # For each bucket midpoint, place predictions so win_rate == avg_confidence.
    # Use 100 predictions per bucket at the midpoint confidence.
    # Set exactly round(100 * midpoint) outcomes to True.
    bucket_midpoints = [0.55, 0.65, 0.75, 0.85, 0.95]
    n_per_bucket = 100
    confidences: list[float] = []
    outcomes: list[bool] = []
    for midpoint in bucket_midpoints:
        n_true = round(n_per_bucket * midpoint)
        n_false = n_per_bucket - n_true
        confidences.extend([midpoint] * n_per_bucket)
        outcomes.extend([True] * n_true + [False] * n_false)
    ece, buckets = compute_calibration_error(confidences, outcomes)
    # Compare with a small tolerance: averaging repeated float confidences can
    # differ from the exact midpoint in the last few bits.
    assert abs(ece) < 1e-9, (
        f"ECE should be 0.0 (within 1e-9) for perfectly calibrated predictions, got {ece}. "
        f"Buckets: {[(b.avg_confidence, b.observed_win_rate, b.prediction_count) for b in buckets]}"
    )
    # Verify each non-empty bucket has matching avg_confidence and win_rate
    for bucket in buckets:
        if bucket.prediction_count > 0:
            assert abs(bucket.avg_confidence - bucket.observed_win_rate) < 1e-9, (
                f"Bucket [{bucket.bucket_low}, {bucket.bucket_high}) has "
                f"avg_confidence={bucket.avg_confidence} != "
                f"observed_win_rate={bucket.observed_win_rate}"
            )
            assert not bucket.miscalibrated, (
                f"Bucket [{bucket.bucket_low}, {bucket.bucket_high}) should not "
                f"be flagged as miscalibrated when perfectly calibrated"
            )
 # ---------------------------------------------------------------------------
 # Strategies for Brier score tests
 # ---------------------------------------------------------------------------
 p_bull_strategy = st.floats(
    min_value=0.0, max_value=1.0, allow_nan=False, allow_infinity=False
 )
 brier_outcome_strategy = st.booleans()
 brier_pairs_strategy = st.lists(
    st.tuples(p_bull_strategy, brier_outcome_strategy),
    min_size=1,
    max_size=100,
 )
 # ---------------------------------------------------------------------------
 # Property 2: Brier Score Range and Perfect Prediction
 # Validates: Requirements 5.4, 17.2
 # ---------------------------------------------------------------------------
@given(pairs=brier_pairs_strategy)
@settings(max_examples=100)
 def test_brier_score_range(pairs: list[tuple[float, bool]]) -> None:
    """**Validates: Requirements 5.4, 17.2**
    For any list of (p_bull, outcome) pairs where p_bull ∈ [0.0, 1.0] and
    outcome is boolean, the Brier score SHALL be in [0.0, 1.0].
    """
    p_bulls = [p for p, _ in pairs]
    outcomes = [o for _, o in pairs]
    brier = compute_brier_score(p_bulls, outcomes)
    assert 0.0 <= brier <= 1.0, (
        f"Brier score {brier} is outside [0.0, 1.0]. "
        f"p_bulls={p_bulls}, outcomes={outcomes}"
    )
@given(n=st.integers(min_value=1, max_value=100))
@settings(max_examples=100)
 def test_brier_score_perfect_prediction(n: int) -> None:
    """**Validates: Requirements 5.4, 17.2**
    When all predictions are perfectly correct — p_bull = 1.0 with
    outcome = True, or p_bull = 0.0 with outcome = False — the Brier
    score SHALL be 0.0.
    """
    # Case 1: all p_bull = 1.0 and outcome = True
    p_bulls_all_bull = [1.0] * n
    outcomes_all_true = [True] * n
    brier_bull = compute_brier_score(p_bulls_all_bull, outcomes_all_true)
    assert brier_bull == 0.0, (
        f"Brier score should be 0.0 for perfect bullish predictions, "
        f"got {brier_bull} with n={n}"
    )
    # Case 2: all p_bull = 0.0 and outcome = False
    p_bulls_all_bear = [0.0] * n
    outcomes_all_false = [False] * n
    brier_bear = compute_brier_score(p_bulls_all_bear, outcomes_all_false)
    assert brier_bear == 0.0, (
        f"Brier score should be 0.0 for perfect bearish predictions, "
        f"got {brier_bear} with n={n}"
    )
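# ---------------------------------------------------------------------------
# Illustrative sketch only (not the project's implementation).
# The Brier score for binary outcomes is the mean squared difference between
# the predicted probability and the 0/1 outcome, which is why it stays in
# [0.0, 1.0] and is 0.0 for perfect predictions. `_sketch_brier_score` is a
# hypothetical name; raising on empty input is an assumption of this sketch,
# and the real `compute_brier_score` may handle that case differently.
# ---------------------------------------------------------------------------
import math


def _sketch_brier_score(p_bulls: list[float], outcomes: list[bool]) -> float:
    if len(p_bulls) != len(outcomes):
        raise ValueError("p_bulls and outcomes must have the same length")
    if not p_bulls:
        raise ValueError("at least one prediction is required")
    squared_errors = (
        (p - (1.0 if won else 0.0)) ** 2 for p, won in zip(p_bulls, outcomes)
    )
    return math.fsum(squared_errors) / len(p_bulls)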
 # ---------------------------------------------------------------------------
 # Strategies for Information Coefficient tests
 # ---------------------------------------------------------------------------
 ic_score_strategy = st.floats(
    min_value=-100.0, max_value=100.0, allow_nan=False, allow_infinity=False
 )
 # Generate lists of at least 30 (score, return) pairs
 ic_pairs_strategy = st.lists(
    st.tuples(ic_score_strategy, ic_score_strategy),
    min_size=30,
    max_size=100,
 )
 # ---------------------------------------------------------------------------
 # Property 3: Information Coefficient Range and Perfect Correlation
 # Validates: Requirements 6.1, 6.2, 17.3
 # ---------------------------------------------------------------------------
@given(pairs=ic_pairs_strategy)
@settings(max_examples=100)
 def test_information_coefficient_range(pairs: list[tuple[float, float]]) -> None:
    """**Validates: Requirements 6.1, 6.2, 17.3**
    For any list of (score, return) pairs with at least 30 elements where
    scores and returns are finite floats, the Information Coefficient
    (Pearson correlation) SHALL be in [-1.0, 1.0] or None (when variance
    is zero).
    """
    scores = [s for s, _ in pairs]
    returns = [r for _, r in pairs]
    ic = compute_information_coefficient(scores, returns)
    # IC may be None if variance is zero in either list
    if ic is not None:
        assert -1.0 <= ic <= 1.0, (
            f"IC {ic} is outside [-1.0, 1.0]. "
            f"scores={scores[:5]}..., returns={returns[:5]}..."
        )
@given(
    scores=st.lists(
        st.floats(min_value=-100.0, max_value=100.0, allow_nan=False, allow_infinity=False),
        min_size=30,
        max_size=100,
    ).filter(lambda xs: max(xs) - min(xs) > 1e-6),
    a=st.floats(min_value=0.01, max_value=100.0, allow_nan=False, allow_infinity=False),
    b=st.floats(min_value=-100.0, max_value=100.0, allow_nan=False, allow_infinity=False),
 )
@settings(max_examples=100)
 def test_information_coefficient_perfect_positive_correlation(
    scores: list[float], a: float, b: float
 ) -> None:
    """**Validates: Requirements 6.1, 6.2, 17.3**
    When scores and returns are perfectly positively linearly correlated
    (returns = a * scores + b, a > 0), IC SHALL be 1.0 within
    floating-point tolerance.
    """
    returns = [a * s + b for s in scores]
    ic = compute_information_coefficient(scores, returns)
    assert ic is not None, (
        f"IC should not be None for perfectly correlated data with variance. "
        f"a={a}, b={b}, scores={scores[:5]}..."
    )
    assert abs(ic - 1.0) < 1e-6, (
        f"IC should be 1.0 for perfectly positively correlated data, "
        f"got {ic}. a={a}, b={b}"
    )
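# ---------------------------------------------------------------------------
# Illustrative sketch only (not the project's implementation).
# The IC tested above is described as a Pearson correlation that yields None
# when either series has zero variance. `_sketch_information_coefficient` is a
# hypothetical name. This sketch does NOT enforce any minimum sample size, and
# it clamps the result to [-1.0, 1.0] to absorb rounding error; both choices
# are assumptions rather than documented behaviour of the real function.
# ---------------------------------------------------------------------------
import math
from typing import Optional


def _sketch_information_coefficient(
    scores: list[float], returns: list[float]
) -> Optional[float]:
    n = len(scores)
    if n != len(returns) or n < 2:
        return None
    mean_s = math.fsum(scores) / n
    mean_r = math.fsum(returns) / n
    dev_s = [s - mean_s for s in scores]
    dev_r = [r - mean_r for r in returns]
    ss_s = math.fsum(d * d for d in dev_s)
    ss_r = math.fsum(d * d for d in dev_r)
    if ss_s == 0.0 or ss_r == 0.0:
        return None  # correlation is undefined for a constant series
    cov = math.fsum(a * b for a, b in zip(dev_s, dev_r))
    return max(-1.0, min(1.0, cov / math.sqrt(ss_s * ss_r)))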
 # ---------------------------------------------------------------------------
 # Strategies for source reliability tests
 # ---------------------------------------------------------------------------
 from services.validation.calibration import compute_source_reliability
 observed_win_rate_strategy = st.floats(
    min_value=0.0, max_value=1.0, allow_nan=False, allow_infinity=False
 )
 sample_count_strategy = st.integers(min_value=0, max_value=100_000)
 # ---------------------------------------------------------------------------
 # Property 5: Source Reliability Bayesian Shrinkage Bounds and Convergence
 # Validates: Requirements 8.1, 8.2, 17.5
 # ---------------------------------------------------------------------------
@given(
    observed_win_rate=observed_win_rate_strategy,
    sample_count=sample_count_strategy,
 )
@settings(max_examples=100)
 def test_source_reliability_range(observed_win_rate: float, sample_count: int) -> None:
    """**Validates: Requirements 8.1, 8.2, 17.5**
    For any observed_win_rate in [0.0, 1.0] and sample_count >= 0,
    the source reliability computed via Bayesian shrinkage SHALL be
    in [0.0, 1.0].
    """
    reliability = compute_source_reliability(observed_win_rate, sample_count)
    assert 0.0 <= reliability <= 1.0, (
        f"Reliability {reliability} is outside [0.0, 1.0]. "
        f"observed_win_rate={observed_win_rate}, sample_count={sample_count}"
    )
 def test_source_reliability_zero_samples() -> None:
    """**Validates: Requirements 8.1, 8.2, 17.5**
    When sample_count = 0, reliability SHALL be exactly 0.5 (the prior mean).
    """
    reliability = compute_source_reliability(observed_win_rate=0.8, sample_count=0)
    assert reliability == 0.5, (
        f"Reliability should be 0.5 when sample_count=0, got {reliability}"
    )
    # Also verify with different win rates
    for wr in [0.0, 0.25, 0.5, 0.75, 1.0]:
        r = compute_source_reliability(observed_win_rate=wr, sample_count=0)
        assert r == 0.5, (
            f"Reliability should be 0.5 when sample_count=0 regardless of "
            f"observed_win_rate={wr}, got {r}"
        )
@given(
    observed_win_rate=st.floats(
        min_value=0.0, max_value=1.0, allow_nan=False, allow_infinity=False
    ),
 )
@settings(max_examples=100)
 def test_source_reliability_convergence(observed_win_rate: float) -> None:
    """**Validates: Requirements 8.1, 8.2, 17.5**
    As sample_count increases toward infinity, reliability SHALL approach
    the observed_win_rate. For a large sample_count (e.g., 10000),
    reliability should be within 0.01 of observed_win_rate.
    """
    reliability = compute_source_reliability(observed_win_rate, sample_count=10_000)
    assert abs(reliability - observed_win_rate) < 0.01, (
        f"Reliability {reliability} should be within 0.01 of "
        f"observed_win_rate {observed_win_rate} when sample_count=10000. "
        f"Difference: {abs(reliability - observed_win_rate)}"
    )
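# ---------------------------------------------------------------------------
# Illustrative sketch only (not the project's implementation).
# A common form of Bayesian shrinkage blends the observed win rate with a
# prior mean, weighting the prior by a pseudo-count. With a prior mean of 0.5
# this gives exactly 0.5 at zero samples and approaches the observed rate as
# samples grow, matching the three properties above. The pseudo-count of 20
# and the name `_sketch_source_reliability` are ASSUMPTIONS of this sketch;
# the real `compute_source_reliability` may use a different prior strength.
# ---------------------------------------------------------------------------
def _sketch_source_reliability(
    observed_win_rate: float,
    sample_count: int,
    prior_mean: float = 0.5,
    prior_strength: float = 20.0,  # assumed pseudo-count, not from the source
) -> float:
    if sample_count < 0:
        raise ValueError("sample_count must be non-negative")
    weighted_observed = sample_count * observed_win_rate
    weighted_prior = prior_strength * prior_mean
    return (weighted_observed + weighted_prior) / (sample_count + prior_strength)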
 # ---------------------------------------------------------------------------
 # Strategies for quality gate tests
 # ---------------------------------------------------------------------------
 from services.trading.model_quality_gate import (
    GateThresholdResult,
    QualityGateConfig,
    _evaluate_thresholds,
 )
 # Snapshot dict strategy: generate each metric value in a reasonable range
 snapshot_strategy = st.fixed_dictionaries({
    "prediction_count": st.integers(min_value=0, max_value=10_000),
    "information_coefficient": st.floats(
        min_value=-1.0, max_value=1.0, allow_nan=False, allow_infinity=False
    ),
    "win_rate": st.floats(
        min_value=0.0, max_value=1.0, allow_nan=False, allow_infinity=False
    ),
    "calibration_error": st.floats(
        min_value=0.0, max_value=1.0, allow_nan=False, allow_infinity=False
    ),
    "avg_excess_return_vs_spy": st.floats(
        min_value=-1.0, max_value=1.0, allow_nan=False, allow_infinity=False
    ),
 })
 # Config strategy: generate each threshold in a reasonable range
 gate_config_strategy = st.builds(
    QualityGateConfig,
    min_prediction_count=st.integers(min_value=0, max_value=10_000),
    min_ic=st.floats(
        min_value=-1.0, max_value=1.0, allow_nan=False, allow_infinity=False
    ),
    min_win_rate=st.floats(
        min_value=0.0, max_value=1.0, allow_nan=False, allow_infinity=False
    ),
    max_ece=st.floats(
        min_value=0.0, max_value=1.0, allow_nan=False, allow_infinity=False
    ),
    min_excess_return_vs_spy=st.floats(
        min_value=-1.0, max_value=1.0, allow_nan=False, allow_infinity=False
    ),
 )
 # ---------------------------------------------------------------------------
 # Property 6: Quality Gate Determinism and Threshold Monotonicity
 # Validates: Requirements 11.1, 17.6
 # ---------------------------------------------------------------------------
@given(snapshot=snapshot_strategy, config=gate_config_strategy)
@settings(max_examples=100)
 def test_quality_gate_determinism(
    snapshot: dict, config: QualityGateConfig
 ) -> None:
    """**Validates: Requirements 11.1, 17.6**
    For any set of model metric values and quality gate configuration,
    calling _evaluate_thresholds twice with the same inputs SHALL produce
    the same pass/fail result for every threshold (determinism).
    """
    results1 = _evaluate_thresholds(snapshot, config)
    results2 = _evaluate_thresholds(snapshot, config)
    assert len(results1) == len(results2), (
        f"Different number of threshold results: {len(results1)} vs {len(results2)}"
    )
    for r1, r2 in zip(results1, results2):
        assert r1.name == r2.name, (
            f"Threshold name mismatch: {r1.name!r} vs {r2.name!r}"
        )
        assert r1.threshold == r2.threshold, (
            f"Threshold value mismatch for {r1.name}: "
            f"{r1.threshold} vs {r2.threshold}"
        )
        assert r1.actual == r2.actual, (
            f"Actual value mismatch for {r1.name}: "
            f"{r1.actual} vs {r2.actual}"
        )
        assert r1.passed == r2.passed, (
            f"Determinism violated for threshold {r1.name}: "
            f"first call passed={r1.passed}, second call passed={r2.passed}. "
            f"actual={r1.actual}, threshold={r1.threshold}"
        )
    # Overall gate pass/fail should also be deterministic
    all_passed_1 = all(r.passed for r in results1)
    all_passed_2 = all(r.passed for r in results2)
    assert all_passed_1 == all_passed_2, (
        f"Overall gate determinism violated: "
        f"first call passed={all_passed_1}, second call passed={all_passed_2}"
    )
@given(
    snapshot=snapshot_strategy,
    config=gate_config_strategy,
    relax_amount=st.floats(
        min_value=0.0, max_value=1.0, allow_nan=False, allow_infinity=False
    ),
    threshold_to_relax=st.sampled_from([
        "min_prediction_count",
        "min_ic",
        "min_win_rate",
        "max_ece",
        "min_excess_return_vs_spy",
    ]),
 )
@settings(max_examples=100)
 def test_quality_gate_threshold_monotonicity(
    snapshot: dict,
    config: QualityGateConfig,
    relax_amount: float,
    threshold_to_relax: str,
 ) -> None:
    """**Validates: Requirements 11.1, 17.6**
    For any configuration where the gate passes, relaxing any single
    threshold (decreasing min values or increasing max values to make
    them easier to satisfy) SHALL NOT cause the gate to fail
    (monotonicity).
    """
    # Evaluate with original config
    original_results = _evaluate_thresholds(snapshot, config)
    original_passed = all(r.passed for r in original_results)
    # Only test monotonicity when the gate originally passes
    if not original_passed:
        return
    # Create a relaxed config by making one threshold easier to satisfy
    from dataclasses import replace
    if threshold_to_relax == "min_prediction_count":
        # Decrease min → easier to satisfy
        relaxed_value = max(0, config.min_prediction_count - int(relax_amount * 1000))
        relaxed_config = replace(config, min_prediction_count=relaxed_value)
    elif threshold_to_relax == "min_ic":
        # Decrease min → easier to satisfy
        relaxed_config = replace(config, min_ic=config.min_ic - relax_amount)
    elif threshold_to_relax == "min_win_rate":
        # Decrease min → easier to satisfy
        relaxed_config = replace(config, min_win_rate=config.min_win_rate - relax_amount)
    elif threshold_to_relax == "max_ece":
        # Increase max → easier to satisfy
        relaxed_config = replace(config, max_ece=config.max_ece + relax_amount)
    elif threshold_to_relax == "min_excess_return_vs_spy":
        # Decrease min → easier to satisfy
        relaxed_config = replace(
            config,
            min_excess_return_vs_spy=config.min_excess_return_vs_spy - relax_amount,
        )
    else:
        return  # pragma: no cover
    # Evaluate with relaxed config
    relaxed_results = _evaluate_thresholds(snapshot, relaxed_config)
    relaxed_passed = all(r.passed for r in relaxed_results)
    assert relaxed_passed, (
        f"Monotonicity violated: gate passed with original config but failed "
        f"after relaxing {threshold_to_relax}. "
        f"Original config: min_prediction_count={config.min_prediction_count}, "
        f"min_ic={config.min_ic}, min_win_rate={config.min_win_rate}, "
        f"max_ece={config.max_ece}, "
        f"min_excess_return_vs_spy={config.min_excess_return_vs_spy}. "
        f"Relaxed threshold: {threshold_to_relax} by {relax_amount}. "
        f"Failed thresholds: "
        f"{[(r.name, r.actual, r.threshold) for r in relaxed_results if not r.passed]}"
    )