stonks-oracle/.kiro/specs/model-validation-calibration/design.md
Celes Renata 7fcc8a6c07
feat: model validation, calibration, and signal quality layer
- Migration 035: prediction_snapshots, prediction_outcomes, signal_evidence_links, model_metric_snapshots tables + SQL views
- Prediction snapshot writer with canonical evidence keys, duplicate detection, contribution scores
- Outcome evaluator across 5 horizons (1h, 6h, 1d, 7d, 30d)
- Metrics engine: ECE, Brier score, IC, Rank IC, benchmark comparison
- Attribution engine: per-source, per-catalyst, per-layer performance
- Calibration engine: Bayesian shrinkage source reliability
- Quality gate for live trading eligibility with configurable thresholds
- 7 new /api/validation/* endpoints
- Upgraded OpsModel dashboard with validation tab
- Enhanced recommendation display with calibration context
- Backtest replay validation mode
- 86 Python tests (unit + property-based), 179 frontend tests passing
2026-05-01 03:04:58 +00:00

Design Document — Model Validation, Calibration, and Signal Quality

Overview

This design adds a closed-loop model validation layer to Stonks Oracle. The system currently generates trend summaries and trading recommendations with confidence scores, but has no mechanism to evaluate whether those predictions are accurate, whether confidence scores are well-calibrated, which sources contribute to correct predictions, or whether the system outperforms simple benchmarks.

The validation layer introduces five new service modules under services/validation/, a quality gate in services/trading/, seven new API endpoints under /api/validation/, a database migration (035) with four new tables and two SQL views, and an upgraded OpsModel dashboard page. The architecture follows the existing patterns: pure computation modules with asyncpg for persistence, FastAPI endpoints in services/api/app.py, and React/TanStack Query hooks on the frontend.

Design Rationale

A prediction engine without outcome tracking is flying blind. The validation layer closes the feedback loop by:

  1. Capturing immutable snapshots at prediction time — preventing hindsight bias in evaluation
  2. Evaluating outcomes across multiple horizons (1h, 6h, 1d, 7d, 30d) — matching the system's multi-window trend architecture
  3. Computing calibration metrics (ECE, Brier score) — measuring whether confidence scores mean what they claim
  4. Tracking information coefficients (IC, Rank IC) — measuring linear and ordinal predictive power
  5. Attributing performance to sources, catalysts, and signal layers — identifying the most valuable information channels
  6. Recalibrating confidence via Bayesian shrinkage — learning from the system's own track record
  7. Gating live trading on minimum quality thresholds — preventing real capital risk on a poorly performing model

The design reuses existing infrastructure (asyncpg, FastAPI, TanStack Query, Recharts) and integrates with the existing source_accuracy table from the signal-math-upgrade spec.


Architecture

High-Level Data Flow

flowchart TD
    subgraph "Prediction Capture (Real-time)"
        A[Recommendation Engine] -->|generates| B[Prediction_Snapshot_Writer]
        B --> C[prediction_snapshots table]
        B --> D[signal_evidence_links table]
        B -->|computes| E[canonical_evidence_key<br/>duplicate detection<br/>contribution scores]
    end

    subgraph "Outcome Evaluation (Periodic)"
        F[Outcome_Evaluator<br/>scheduled job] -->|reads matured snapshots| C
        F -->|fetches future prices| G[market_snapshots table]
        F -->|computes returns| H[prediction_outcomes table]
        F -->|evaluates 5 horizons| H
    end

    subgraph "Metrics Computation (Periodic)"
        I[Metrics_Engine] -->|reads| H
        I -->|reads| C
        I -->|reads| D
        I -->|computes| J[model_metric_snapshots table]
        I -->|computes| K[Calibration: ECE, Brier]
        I -->|computes| L[IC, Rank IC by horizon]
        I -->|computes| M[Benchmark: excess returns]
    end

    subgraph "Attribution (Periodic)"
        N[Attribution_Engine] -->|joins| D
        N -->|joins| H
        N -->|computes| O[Per-source metrics]
        N -->|computes| P[Per-catalyst metrics]
        N -->|computes| Q[Per-layer metrics]
    end

    subgraph "Calibration (Periodic)"
        R[Calibration_Engine] -->|reads| H
        R -->|reads| D
        R -->|computes Bayesian shrinkage| S[source_accuracy table<br/>reliability scores]
    end

    subgraph "Safety Gate (Per-cycle)"
        T[Quality_Gate] -->|reads latest| J
        T -->|evaluates thresholds| U{Pass?}
        U -->|yes| V[Live trading allowed]
        U -->|no| W[Force paper mode]
        T -->|stores result| X[risk_configs table<br/>model_quality_gate key]
    end

    subgraph "Dashboard (Frontend)"
        Y[Dashboard_API<br/>7 endpoints] -->|reads| J
        Y -->|reads| C
        Y -->|reads| H
        Y -->|reads| D
        Z[OpsModel.tsx<br/>upgraded page] -->|fetches| Y
    end

    subgraph "Backtest Integration"
        AA[BacktestReplay] -->|validation mode| B
        AA -->|validation mode| F
        AA -->|triggers| I
    end

Scheduling Strategy

The validation components run on different cadences:

| Component | Trigger | Cadence |
| --- | --- | --- |
| Prediction_Snapshot_Writer | Synchronous (called by recommendation engine) | Every recommendation |
| Outcome_Evaluator | Scheduled job | Every 1 hour |
| Metrics_Engine | After Outcome_Evaluator completes | Every 1 hour |
| Attribution_Engine | Called by Metrics_Engine | Every 1 hour |
| Calibration_Engine | After Metrics_Engine completes | Every 6 hours |
| Quality_Gate | Start of each aggregation cycle | Every aggregation cycle |

Sector ETF Mapping

The system needs a mapping from company sectors to sector ETFs for benchmark comparison. This is stored as a configuration constant:

SECTOR_ETF_MAP: dict[str, str] = {
    "Technology": "XLK",
    "Consumer Cyclical": "XLY",
    "Financial Services": "XLF",
    "Healthcare": "XLV",
    "Energy": "XLE",
    "Communication Services": "XLC",
    "Industrials": "XLI",
    "Consumer Defensive": "XLP",
    "Real Estate": "XLRE",
    "Utilities": "XLU",
}

Components and Interfaces

New Modules

| Module | File | Responsibility |
| --- | --- | --- |
| Prediction Snapshot Writer | services/validation/prediction_snapshot.py | Captures immutable prediction state at generation time |
| Outcome Evaluator | services/validation/outcome_evaluator.py | Matches predictions with realized market outcomes |
| Metrics Engine | services/validation/metrics.py | Computes calibration, IC, Brier, benchmark metrics |
| Attribution Engine | services/validation/attribution.py | Per-source, per-catalyst, per-layer performance |
| Calibration Engine | services/validation/calibration.py | Bayesian shrinkage source reliability, weight adjustment |
| Quality Gate | services/trading/model_quality_gate.py | Safety gate for live trading eligibility |

Modified Modules

| Module | File | Changes |
| --- | --- | --- |
| Query API | services/api/app.py | 7 new /api/validation/* endpoints |
| Aggregation Worker | services/aggregation/worker.py | Call Quality_Gate at cycle start |
| Recommendation Engine | services/recommendation/eligibility.py | Call Prediction_Snapshot_Writer after recommendation |
| Backtest Replay | services/trading/backtest_replay.py | Validation mode support |
| Frontend Hooks | frontend/src/api/hooks.ts | 7 new validation hooks |
| OpsModel Page | frontend/src/pages/OpsModel.tsx | Full dashboard upgrade |
| AppLayout | frontend/src/components/AppLayout.tsx | Nav item update (if needed) |

Component Interface Details

1. Prediction Snapshot Writer (services/validation/prediction_snapshot.py)

SECTOR_ETF_MAP: dict[str, str] = {
    "Technology": "XLK",
    "Consumer Cyclical": "XLY",
    "Financial Services": "XLF",
    "Healthcare": "XLV",
    "Energy": "XLE",
    "Communication Services": "XLC",
    "Industrials": "XLI",
    "Consumer Defensive": "XLP",
    "Real Estate": "XLRE",
    "Utilities": "XLU",
}

EVALUATION_HORIZONS: list[str] = ["1h", "6h", "1d", "7d", "30d"]

MAX_SINGLE_DOCUMENT_WEIGHT: float = 1.0


@dataclass
class PredictionSnapshot:
    """Immutable snapshot of a prediction at generation time."""
    id: str                          # UUID
    generated_at: datetime
    ticker: str
    window: str
    horizon: str
    direction: str                   # bullish/bearish/mixed/neutral
    action: str                      # buy/sell/hold/watch
    mode: str                        # informational/paper_eligible/live_eligible
    strength: float
    confidence: float
    contradiction: float
    p_bull: float | None
    p_bear: float | None
    score_company: float
    score_macro: float
    score_competitive: float
    evidence_count: int
    unique_source_count: int
    duplicate_evidence_count: int
    price_at_prediction: float | None
    spy_price_at_prediction: float | None
    sector_etf_price_at_prediction: float | None
    metadata: dict


@dataclass
class SignalEvidenceLink:
    """Link between a prediction and a contributing evidence document."""
    id: str                          # UUID
    prediction_id: str
    document_id: str
    signal_id: str
    ticker: str
    source: str
    source_type: str
    catalyst_type: str
    sentiment: str
    impact: float
    extraction_confidence: float
    weight: float                    # clamped to MAX_SINGLE_DOCUMENT_WEIGHT
    is_duplicate: bool
    canonical_evidence_key: str
    contribution_score: float        # weight / total_weight, sums to 1.0
    metadata: dict


def compute_canonical_evidence_key(title: str, url: str) -> str:
    """SHA256 of normalized(title) + normalized(url).

    Normalization: lowercase, strip whitespace for title;
    lowercase, strip query params for URL.
    """
    ...


async def create_prediction_snapshot(
    pool: asyncpg.Pool,
    recommendation: Recommendation,
    trend_summary: TrendSummary,
    evidence_signals: list[WeightedSignal],
    evidence_docs: list[dict],       # document metadata from recommendation_evidence
) -> PredictionSnapshot:
    """Create and persist a prediction snapshot with evidence links.

    1. Fetches current prices (ticker, SPY, sector ETF) from market_snapshots
    2. Computes canonical evidence keys and duplicate detection
    3. Clamps individual document weights to MAX_SINGLE_DOCUMENT_WEIGHT
    4. Computes contribution scores (one-vote-per-canonical-key dedup)
    5. Persists snapshot and evidence links in a transaction
    """
    ...


async def fetch_latest_close_price(
    pool: asyncpg.Pool,
    ticker: str,
) -> float | None:
    """Fetch most recent close price from market_snapshots for a ticker."""
    ...
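
The key computation described in compute_canonical_evidence_key can be sketched as below. This is a minimal illustration under stated assumptions, not the project's implementation: the helper names normalize_title and normalize_url are hypothetical, the two normalized parts are joined by plain concatenation as the docstring's "+" suggests, and the query string is dropped by cutting at the first "?".

```python
import hashlib


def normalize_title(title: str) -> str:
    # Lowercase and strip surrounding whitespace, per the docstring.
    return title.strip().lower()


def normalize_url(url: str) -> str:
    # Lowercase, then drop everything from the first "?" onward.
    # A URL without query parameters is returned as-is after lowercasing.
    return url.lower().split("?", 1)[0]


def compute_canonical_evidence_key(title: str, url: str) -> str:
    # Assumption: plain concatenation of the two normalized parts.
    payload = normalize_title(title) + normalize_url(url)
    # Encode as UTF-8 before hashing so non-ASCII input is handled.
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

The hex digest is 64 characters long, which fits the canonical_evidence_key VARCHAR(64) column in the schema. Two documents that differ only in title casing, surrounding whitespace, or tracking parameters map to the same key, which is what the duplicate detection relies on.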

2. Outcome Evaluator (services/validation/outcome_evaluator.py)

@dataclass
class PredictionOutcome:
    """Realized outcome for a prediction at a specific horizon."""
    id: str                          # UUID
    prediction_id: str
    evaluated_at: datetime
    horizon: str                     # 1h, 6h, 1d, 7d, 30d
    future_price: float
    future_return: float
    spy_future_price: float | None
    spy_return: float | None
    sector_etf_future_price: float | None
    sector_etf_return: float | None
    excess_return_vs_spy: float | None
    excess_return_vs_sector: float | None
    direction_correct: bool
    profitable: bool
    metadata: dict


HORIZON_DURATIONS: dict[str, timedelta] = {
    "1h": timedelta(hours=1),
    "6h": timedelta(hours=6),
    "1d": timedelta(days=1),
    "7d": timedelta(days=7),
    "30d": timedelta(days=30),
}


async def evaluate_matured_predictions(
    pool: asyncpg.Pool,
) -> int:
    """Evaluate all matured prediction snapshots.

    Finds snapshots where horizon has elapsed and outcome not yet recorded.
    For each, fetches future prices and computes returns.
    Skips horizons where future price is unavailable (retries next run).

    Returns count of outcomes recorded.
    """
    ...


async def evaluate_single_prediction(
    pool: asyncpg.Pool,
    snapshot: PredictionSnapshot,
    horizon: str,
) -> PredictionOutcome | None:
    """Evaluate a single prediction at a specific horizon.

    Returns None if future price is unavailable.
    """
    ...
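
The return arithmetic behind a PredictionOutcome might look like the following sketch. The design does not spell out the formulas, so several things here are assumptions: returns are simple fractional price changes, excess return is the ticker return minus the benchmark return, and only "bullish" and "bearish" predictions can be marked direction-correct. The names ReturnSketch, simple_return, and score_outcome are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReturnSketch:
    future_return: float
    spy_return: Optional[float]
    excess_return_vs_spy: Optional[float]
    direction_correct: bool


def simple_return(start_price: float, end_price: float) -> float:
    """Fractional price change over the horizon (assumed return definition)."""
    return end_price / start_price - 1.0


def score_outcome(
    direction: str,
    price_at_prediction: float,
    future_price: float,
    spy_price_at_prediction: Optional[float] = None,
    spy_future_price: Optional[float] = None,
) -> ReturnSketch:
    future_return = simple_return(price_at_prediction, future_price)

    # Benchmark fields stay None when either SPY price is missing,
    # while the ticker return is still computed.
    spy_return = None
    excess = None
    if spy_price_at_prediction is not None and spy_future_price is not None:
        spy_return = simple_return(spy_price_at_prediction, spy_future_price)
        excess = future_return - spy_return

    # Assumption: mixed/neutral predictions are never counted as correct.
    if direction == "bullish":
        correct = future_return > 0
    elif direction == "bearish":
        correct = future_return < 0
    else:
        correct = False

    return ReturnSketch(future_return, spy_return, excess, correct)
```

For example, a bullish prediction at 100.0 that reaches 110.0 while SPY moves from 400.0 to 404.0 yields a return of about 0.10, a SPY return of about 0.01, and an excess return of about 0.09.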

3. Metrics Engine (services/validation/metrics.py)

CONFIDENCE_BUCKETS: list[tuple[float, float]] = [
    (0.50, 0.60),
    (0.60, 0.70),
    (0.70, 0.80),
    (0.80, 0.90),
    (0.90, 1.00),
]

LOOKBACK_WINDOWS: list[str] = ["7d", "30d", "90d", "all"]


@dataclass
class CalibrationBucket:
    """Calibration metrics for a single confidence bucket."""
    bucket_low: float
    bucket_high: float
    avg_confidence: float
    observed_win_rate: float
    prediction_count: int
    miscalibrated: bool              # |avg_confidence - win_rate| > 0.15


@dataclass
class ModelMetricSnapshot:
    """Aggregate model quality metrics for a lookback/horizon combination."""
    id: str
    generated_at: datetime
    lookback_window: str
    horizon: str
    prediction_count: int
    win_rate: float
    directional_accuracy: float
    information_coefficient: float | None
    rank_information_coefficient: float | None
    avg_return: float
    avg_excess_return_vs_spy: float
    avg_excess_return_vs_sector: float
    calibration_error: float         # ECE
    brier_score: float
    buy_win_rate: float
    sell_win_rate: float
    hold_win_rate: float
    metadata: dict


def compute_calibration_error(
    confidences: list[float],
    outcomes: list[bool],
) -> tuple[float, list[CalibrationBucket]]:
    """Compute ECE and calibration buckets.

    ECE = Σ (n_b / N) * |avg_conf_b - win_rate_b|

    Returns (ece, buckets).
    """
    ...


def compute_brier_score(
    p_bulls: list[float],
    outcomes: list[bool],
) -> float:
    """Brier score = mean((p_bull - outcome)^2).

    outcome is 1.0 when price moved in predicted direction, 0.0 otherwise.
    Returns value in [0.0, 1.0].
    """
    ...


def compute_information_coefficient(
    scores: list[float],
    returns: list[float],
) -> float | None:
    """Pearson correlation between prediction scores and future returns.

    Returns None when fewer than 30 data points.
    Returns value in [-1.0, 1.0].
    """
    ...


def compute_rank_information_coefficient(
    scores: list[float],
    returns: list[float],
) -> float | None:
    """Spearman rank correlation between prediction scores and future returns.

    Returns None when fewer than 30 data points.
    Returns value in [-1.0, 1.0].
    """
    ...


def compute_contribution_scores(
    weights: list[float],
) -> list[float]:
    """Compute contribution scores from document weights.

    Each score = weight_i / sum(weights). Sums to 1.0.
    Each score in [0.0, 1.0].
    Returns empty list for empty input.
    """
    ...


async def compute_and_store_metric_snapshots(
    pool: asyncpg.Pool,
) -> list[ModelMetricSnapshot]:
    """Compute metric snapshots for all lookback/horizon combinations.

    Lookback windows: 7d, 30d, 90d, all-time.
    Horizons: 1h, 6h, 1d, 7d, 30d.
    """
    ...
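
As an illustration of the ECE formula in compute_calibration_error, here is a minimal pure-Python sketch. It assumes that N counts only predictions that fall into one of the CONFIDENCE_BUCKETS, that each bucket is half-open except the last (which includes 1.0), and that bucket rows are returned as plain tuples rather than CalibrationBucket instances.

```python
CONFIDENCE_BUCKETS = [
    (0.50, 0.60),
    (0.60, 0.70),
    (0.70, 0.80),
    (0.80, 0.90),
    (0.90, 1.00),
]


def compute_calibration_error(confidences, outcomes):
    """Return (ece, rows); each row is (low, high, avg_conf, win_rate, count)."""
    rows = []
    for index, (low, high) in enumerate(CONFIDENCE_BUCKETS):
        is_last = index == len(CONFIDENCE_BUCKETS) - 1
        members = [
            (c, o)
            for c, o in zip(confidences, outcomes)
            if low <= c < high or (is_last and c == high)
        ]
        if not members:
            continue  # empty buckets are excluded from the sum
        avg_conf = sum(c for c, _ in members) / len(members)
        win_rate = sum(1 for _, o in members if o) / len(members)
        rows.append((low, high, avg_conf, win_rate, len(members)))

    total = sum(row[4] for row in rows)
    if total == 0:
        return 0.0, rows
    # ECE = sum over buckets of (n_b / N) * |avg_conf_b - win_rate_b|
    ece = sum(row[4] / total * abs(row[2] - row[3]) for row in rows)
    return ece, rows
```

When every prediction lands in a single bucket, the weight n_b / N is 1 and the ECE reduces to the gap between that bucket's average confidence and its win rate.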

4. Attribution Engine (services/validation/attribution.py)

@dataclass
class SourceAttribution:
    """Performance metrics for a single source."""
    source: str
    source_type: str
    prediction_count: int
    avg_weight: float
    avg_contribution_score: float
    win_rate: float
    avg_future_return: float
    avg_excess_return_vs_spy: float
    information_coefficient: float | None
    duplicate_rate: float


@dataclass
class CatalystAttribution:
    """Performance metrics for a single catalyst type."""
    catalyst_type: str
    prediction_count: int
    win_rate: float
    avg_future_return: float
    avg_excess_return_vs_spy: float
    information_coefficient: float | None


@dataclass
class LayerAttribution:
    """Performance metrics for a signal layer."""
    layer: str                       # company, macro, competitive
    avg_contribution_pct: float
    dominant_win_rate: float         # win rate when this layer > 30% contribution
    dominant_ic: float | None        # IC when this layer > 30% contribution


async def compute_source_attribution(
    pool: asyncpg.Pool,
    lookback_days: int = 30,
    horizon: str = "7d",
) -> list[SourceAttribution]:
    ...


async def compute_catalyst_attribution(
    pool: asyncpg.Pool,
    lookback_days: int = 30,
    horizon: str = "7d",
) -> list[CatalystAttribution]:
    ...


async def compute_layer_attribution(
    pool: asyncpg.Pool,
    lookback_days: int = 30,
    horizon: str = "7d",
) -> list[LayerAttribution]:
    ...
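
The per-source aggregation could be sketched in memory as follows. This is only an illustration: the real functions read from the database through asyncpg, whereas this hypothetical summarize_sources helper takes rows already shaped like v_source_performance (dicts with source, direction_correct, future_return, and is_duplicate keys) and counts one row per evidence link, which is an assumption about how prediction_count is defined.

```python
from collections import defaultdict


def summarize_sources(rows):
    """Group evidence-link rows by source and compute simple averages."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["source"]].append(row)

    summary = {}
    for source, items in grouped.items():
        n = len(items)
        summary[source] = {
            "prediction_count": n,
            "win_rate": sum(1 for r in items if r["direction_correct"]) / n,
            "avg_future_return": sum(r["future_return"] for r in items) / n,
            "duplicate_rate": sum(1 for r in items if r["is_duplicate"]) / n,
        }
    return summary
```

The same grouping applied to catalyst_type instead of source would give the per-catalyst view.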

5. Calibration Engine (services/validation/calibration.py)

def compute_source_reliability(
    observed_win_rate: float,
    sample_count: int,
    prior_strength: int = 30,
) -> float:
    """Bayesian shrinkage source reliability.

    reliability = 0.5 + (n / (n + prior_strength)) * (observed_win_rate - 0.5)

    Returns value in [0.0, 1.0].
    When n=0, returns 0.5 (prior mean).
    As n→∞, approaches observed_win_rate.
    """
    ...


def compute_adjusted_evidence_weight(
    base_weight: float,
    reliability: float,
) -> float:
    """Adjusted weight = base_weight * (0.5 + reliability), clamped to [0.1, 2.0]."""
    ...


async def update_source_reliabilities(
    pool: asyncpg.Pool,
) -> int:
    """Recompute and store source reliability scores from latest outcomes.

    Uses the existing source_accuracy table, updating accuracy_ratio
    with the Bayesian shrinkage formula.

    Returns count of sources updated.
    """
    ...
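
The two formulas above translate almost directly into code. The sketch below follows the docstrings; the input validation is an added assumption, since the design does not say how negative sample counts or a non-positive prior_strength should be handled.

```python
def compute_source_reliability(
    observed_win_rate: float,
    sample_count: int,
    prior_strength: int = 30,
) -> float:
    # Assumption: reject inputs that would make the shrinkage factor undefined.
    if sample_count < 0 or prior_strength <= 0:
        raise ValueError("sample_count must be >= 0 and prior_strength > 0")
    # Shrinkage factor: 0 with no data, approaching 1 as evidence accumulates.
    shrink = sample_count / (sample_count + prior_strength)
    return 0.5 + shrink * (observed_win_rate - 0.5)


def compute_adjusted_evidence_weight(base_weight: float, reliability: float) -> float:
    # base_weight * (0.5 + reliability), clamped to [0.1, 2.0].
    return min(2.0, max(0.1, base_weight * (0.5 + reliability)))
```

With the default prior_strength of 30, a source with an observed win rate of 0.8 is rated 0.5 after 0 samples, 0.575 after 10, and 0.65 after 30, so a short lucky streak moves the reliability only slightly.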

6. Quality Gate (services/trading/model_quality_gate.py)

@dataclass
class QualityGateConfig:
    """Configurable thresholds for live trading eligibility."""
    min_prediction_count: int = 100
    min_ic: float = 0.03
    min_win_rate: float = 0.53
    max_ece: float = 0.15
    min_excess_return_vs_spy: float = 0.0
    max_snapshot_age_hours: int = 24


@dataclass
class GateThresholdResult:
    """Result for a single threshold check."""
    name: str
    threshold: float
    actual: float
    passed: bool


@dataclass
class QualityGateResult:
    """Full gate evaluation result."""
    passed: bool
    evaluated_at: datetime
    threshold_results: list[GateThresholdResult]
    reason: str                      # "all thresholds met" or "failed: ..."
    snapshot_id: str | None
    config: QualityGateConfig


async def evaluate_quality_gate(
    pool: asyncpg.Pool,
    config: QualityGateConfig | None = None,
) -> QualityGateResult:
    """Evaluate model quality gate from latest metric snapshot.

    Reads the most recent model_metric_snapshot for the 30d lookback
    and 7d horizon (the primary evaluation window).

    If no snapshot exists or snapshot is stale (>24h), defaults to
    paper-only mode (fail-safe).

    Stores result in risk_configs under 'model_quality_gate' key.
    """
    ...


async def load_gate_config_from_db(
    pool: asyncpg.Pool,
) -> QualityGateConfig:
    """Load gate thresholds from risk_configs, with defaults."""
    ...
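
The threshold comparison at the core of evaluate_quality_gate could be sketched as a pure function. Several details are assumptions of this sketch: the hypothetical check_thresholds helper takes metrics as a dict keyed by model_metric_snapshots column names, comparisons are inclusive, a missing metric (None) fails its check, and the staleness check and persistence to risk_configs are left out.

```python
from dataclasses import dataclass


@dataclass
class QualityGateConfig:
    min_prediction_count: int = 100
    min_ic: float = 0.03
    min_win_rate: float = 0.53
    max_ece: float = 0.15
    min_excess_return_vs_spy: float = 0.0


def check_thresholds(metrics: dict, config: QualityGateConfig):
    """Return (passed, results); each result is (name, threshold, actual, ok)."""
    checks = [
        ("prediction_count", config.min_prediction_count, "min"),
        ("information_coefficient", config.min_ic, "min"),
        ("win_rate", config.min_win_rate, "min"),
        ("calibration_error", config.max_ece, "max"),
        ("avg_excess_return_vs_spy", config.min_excess_return_vs_spy, "min"),
    ]
    results = []
    for name, threshold, kind in checks:
        actual = metrics.get(name)
        if actual is None:
            ok = False  # fail-safe: an unknown metric cannot pass
        elif kind == "min":
            ok = actual >= threshold
        else:
            ok = actual <= threshold
        results.append((name, threshold, actual, ok))
    return all(r[3] for r in results), results
```

Because the function depends only on its arguments, the same metrics and configuration always give the same result, and loosening one threshold can only turn a failing check into a passing one.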

7. Dashboard API Endpoints

Seven new endpoints added to services/api/app.py:

| Endpoint | Method | Returns |
| --- | --- | --- |
| /api/validation/summary | GET | Latest model metric snapshot + gate status |
| /api/validation/calibration | GET | Calibration table with buckets |
| /api/validation/ic-by-horizon | GET | IC and Rank IC per horizon |
| /api/validation/attribution/sources | GET | Per-source performance |
| /api/validation/attribution/catalysts | GET | Per-catalyst performance |
| /api/validation/attribution/layers | GET | Per-layer performance |
| /api/validation/gate-status | GET | Quality gate evaluation detail |

All endpoints accept optional lookback (default "30d") and horizon (default "7d") query parameters.


Data Models

Database Schema (Migration 035)

prediction_snapshots

CREATE TABLE IF NOT EXISTS prediction_snapshots (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    generated_at TIMESTAMPTZ NOT NULL,
    ticker VARCHAR(20) NOT NULL,
    window VARCHAR(20) NOT NULL,
    horizon VARCHAR(20) NOT NULL,
    direction VARCHAR(20) NOT NULL,
    action VARCHAR(20) NOT NULL,
    mode VARCHAR(30) NOT NULL,
    strength FLOAT NOT NULL,
    confidence FLOAT NOT NULL,
    contradiction FLOAT NOT NULL DEFAULT 0.0,
    p_bull FLOAT,
    p_bear FLOAT,
    score_company FLOAT NOT NULL DEFAULT 0.0,
    score_macro FLOAT NOT NULL DEFAULT 0.0,
    score_competitive FLOAT NOT NULL DEFAULT 0.0,
    evidence_count INTEGER NOT NULL DEFAULT 0,
    unique_source_count INTEGER NOT NULL DEFAULT 0,
    duplicate_evidence_count INTEGER NOT NULL DEFAULT 0,
    price_at_prediction FLOAT,
    spy_price_at_prediction FLOAT,
    sector_etf_price_at_prediction FLOAT,
    metadata JSONB DEFAULT '{}',
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE INDEX IF NOT EXISTS idx_pred_snap_ticker ON prediction_snapshots(ticker);
CREATE INDEX IF NOT EXISTS idx_pred_snap_generated ON prediction_snapshots(generated_at);
CREATE INDEX IF NOT EXISTS idx_pred_snap_horizon ON prediction_snapshots(horizon);

prediction_outcomes

CREATE TABLE IF NOT EXISTS prediction_outcomes (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    prediction_id UUID NOT NULL REFERENCES prediction_snapshots(id),
    evaluated_at TIMESTAMPTZ NOT NULL,
    horizon VARCHAR(20) NOT NULL,
    future_price FLOAT,
    future_return FLOAT,
    spy_future_price FLOAT,
    spy_return FLOAT,
    sector_etf_future_price FLOAT,
    sector_etf_return FLOAT,
    excess_return_vs_spy FLOAT,
    excess_return_vs_sector FLOAT,
    direction_correct BOOLEAN,
    profitable BOOLEAN,
    metadata JSONB DEFAULT '{}',
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE INDEX IF NOT EXISTS idx_pred_out_prediction ON prediction_outcomes(prediction_id);
CREATE INDEX IF NOT EXISTS idx_pred_out_horizon ON prediction_outcomes(horizon);
CREATE INDEX IF NOT EXISTS idx_pred_out_evaluated ON prediction_outcomes(evaluated_at);

signal_evidence_links

CREATE TABLE IF NOT EXISTS signal_evidence_links (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    prediction_id UUID NOT NULL REFERENCES prediction_snapshots(id),
    document_id VARCHAR(200),
    signal_id VARCHAR(200),
    ticker VARCHAR(20),
    source VARCHAR(200),
    source_type VARCHAR(50),
    catalyst_type VARCHAR(50),
    sentiment VARCHAR(20),
    impact FLOAT,
    extraction_confidence FLOAT,
    weight FLOAT,
    is_duplicate BOOLEAN NOT NULL DEFAULT FALSE,
    canonical_evidence_key VARCHAR(64),
    contribution_score FLOAT,
    metadata JSONB DEFAULT '{}',
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE INDEX IF NOT EXISTS idx_sig_ev_prediction ON signal_evidence_links(prediction_id);
CREATE INDEX IF NOT EXISTS idx_sig_ev_document ON signal_evidence_links(document_id);
CREATE INDEX IF NOT EXISTS idx_sig_ev_ticker ON signal_evidence_links(ticker);

model_metric_snapshots

CREATE TABLE IF NOT EXISTS model_metric_snapshots (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    generated_at TIMESTAMPTZ NOT NULL,
    lookback_window VARCHAR(20) NOT NULL,
    horizon VARCHAR(20) NOT NULL,
    prediction_count INTEGER NOT NULL DEFAULT 0,
    win_rate FLOAT,
    directional_accuracy FLOAT,
    information_coefficient FLOAT,
    rank_information_coefficient FLOAT,
    avg_return FLOAT,
    avg_excess_return_vs_spy FLOAT,
    avg_excess_return_vs_sector FLOAT,
    calibration_error FLOAT,
    brier_score FLOAT,
    buy_win_rate FLOAT,
    sell_win_rate FLOAT,
    hold_win_rate FLOAT,
    metadata JSONB DEFAULT '{}',
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE INDEX IF NOT EXISTS idx_model_snap_generated ON model_metric_snapshots(generated_at);
CREATE INDEX IF NOT EXISTS idx_model_snap_lookback ON model_metric_snapshots(lookback_window);
CREATE INDEX IF NOT EXISTS idx_model_snap_horizon ON model_metric_snapshots(horizon);

SQL Explorer Views

CREATE OR REPLACE VIEW v_prediction_performance AS
SELECT
    ps.ticker,
    ps.direction,
    ps.action,
    ps.confidence,
    ps.strength,
    ps.contradiction,
    ps.p_bull,
    ps.score_company,
    ps.score_macro,
    ps.score_competitive,
    ps.evidence_count,
    ps.unique_source_count,
    ps.duplicate_evidence_count,
    ps.price_at_prediction,
    po.future_return,
    po.excess_return_vs_spy,
    po.excess_return_vs_sector,
    po.direction_correct,
    po.profitable,
    po.horizon,
    ps.generated_at,
    po.evaluated_at
FROM prediction_snapshots ps
JOIN prediction_outcomes po ON po.prediction_id = ps.id;

CREATE OR REPLACE VIEW v_source_performance AS
SELECT
    sel.source,
    sel.source_type,
    sel.catalyst_type,
    sel.sentiment,
    sel.weight,
    sel.contribution_score,
    sel.is_duplicate,
    po.direction_correct,
    po.future_return,
    po.excess_return_vs_spy,
    po.horizon,
    ps.generated_at
FROM signal_evidence_links sel
JOIN prediction_snapshots ps ON ps.id = sel.prediction_id
JOIN prediction_outcomes po ON po.prediction_id = sel.prediction_id;

Correctness Properties

A property is a characteristic or behavior that should hold true across all valid executions of a system — essentially, a formal statement about what the system should do. Properties serve as the bridge between human-readable specifications and machine-verifiable correctness guarantees.

The following properties were derived from the acceptance criteria through systematic prework analysis. Each property is universally quantified and maps to specific requirements. After reflection, 7 unique properties remain — one for each PBT requirement in Requirement 17. Redundant properties from Requirements 2, 5, 6, 8, and 11 were consolidated with their corresponding Requirement 17 counterparts.

Property 1: Calibration Error Range and Round-Trip

For any valid distribution of predictions across confidence buckets (where each prediction has a confidence in [0.5, 1.0] and a boolean outcome), the Expected Calibration Error (ECE) SHALL be in [0.0, 1.0]. Furthermore, when every bucket's observed win rate exactly matches its average confidence, ECE SHALL be 0.0.

Validates: Requirements 5.1, 5.3, 17.1

Property 2: Brier Score Range and Perfect Prediction

For any list of (p_bull, outcome) pairs where p_bull ∈ [0.0, 1.0] and outcome ∈ {0.0, 1.0}, the Brier score SHALL be in [0.0, 1.0]. Furthermore, when all predictions have p_bull = 1.0 and outcome = 1.0 (or p_bull = 0.0 and outcome = 0.0), the Brier score SHALL be 0.0.

Validates: Requirements 5.4, 17.2
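
A minimal sketch of the Brier score shows why the range and perfect-prediction claims hold. It is written for any probability paired with a boolean event; which event is scored against p_bull is left to the specification. Raising on empty or mismatched input is an assumption of this sketch.

```python
def compute_brier_score(probabilities, outcomes):
    """Mean squared difference between forecast probability and 0/1 outcome."""
    if not probabilities or len(probabilities) != len(outcomes):
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [
        (p - (1.0 if outcome else 0.0)) ** 2
        for p, outcome in zip(probabilities, outcomes)
    ]
    return sum(squared_errors) / len(squared_errors)
```

Each squared error lies in [0.0, 1.0] because both the probability and the outcome do, so their mean does too. A forecast of 0.5 for every prediction scores 0.25 regardless of outcomes, which makes 0.25 a useful reference point when reading the metric.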

Property 3: Information Coefficient Range and Perfect Correlation

For any list of (score, return) pairs with at least 30 elements where scores and returns are finite floats, the Information Coefficient (Pearson correlation) SHALL be in [-1.0, 1.0]. Furthermore, when scores and returns are perfectly positively linearly correlated (returns = a * scores + b, a > 0), IC SHALL be 1.0 (within floating-point tolerance).

Validates: Requirements 6.1, 6.2, 17.3
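
Both coefficients can be sketched in pure Python, with Rank IC computed as the Pearson correlation of average ranks. Returning None for zero-variance input and clamping the result against floating-point drift are assumptions of this sketch; the helper names pearson and average_ranks are hypothetical.

```python
import math

MIN_SAMPLES = 30


def pearson(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None  # correlation is undefined for constant input
    r = sxy / math.sqrt(sxx * syy)
    return max(-1.0, min(1.0, r))  # guard against rounding past the bounds


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def compute_information_coefficient(scores, returns):
    if len(scores) != len(returns) or len(scores) < MIN_SAMPLES:
        return None
    return pearson(scores, returns)


def compute_rank_information_coefficient(scores, returns):
    if len(scores) != len(returns) or len(scores) < MIN_SAMPLES:
        return None
    return pearson(average_ranks(scores), average_ranks(returns))
```

A monotonic but non-linear relationship, such as returns equal to the cube of the score, gives a Rank IC of 1.0 while the Pearson IC stays below 1.0, which is why the design tracks both.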

Property 4: Canonical Evidence Key Determinism and Normalization Idempotence

For any (title, url) string pair, computing the canonical evidence key SHALL be deterministic — the same inputs always produce the same key. Furthermore, normalizing an already-normalized input (lowercased, trimmed title; lowercased, query-stripped URL) and computing the key SHALL produce the same key as the original computation (idempotence).

Validates: Requirements 2.3, 17.4

Property 5: Source Reliability Bayesian Shrinkage Bounds and Convergence

For any observed_win_rate ∈ [0.0, 1.0] and sample_count ≥ 0, the source reliability computed via Bayesian shrinkage SHALL be in [0.0, 1.0]. When sample_count = 0, reliability SHALL be exactly 0.5. As sample_count increases toward infinity, reliability SHALL approach the observed_win_rate monotonically.

Validates: Requirements 8.1, 8.2, 17.5

Property 6: Quality Gate Determinism and Threshold Monotonicity

For any set of model metric values and quality gate configuration, the gate evaluation result SHALL be deterministic: the same inputs always produce the same pass/fail result. Furthermore, for any configuration where the gate passes, relaxing any single threshold (decreasing a min value or increasing a max value, so that it is easier to satisfy) SHALL NOT cause the gate to fail (monotonicity).

Validates: Requirements 11.1, 17.6

Property 7: Contribution Score Sum-to-One and Range

For any non-empty list of positive document weights, the computed contribution scores SHALL each be in [0.0, 1.0] and SHALL sum to 1.0 (within floating-point tolerance of 1e-9). For an empty weight list, the result SHALL be an empty list.

Validates: Requirements 2.5, 17.7
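
A minimal sketch of the contribution score computation, including the zero-total fallback listed under Error Handling. The clamp_weights helper is hypothetical and applies only the upper bound MAX_SINGLE_DOCUMENT_WEIGHT; whether a lower bound also applies is not stated in the design.

```python
MAX_SINGLE_DOCUMENT_WEIGHT = 1.0


def clamp_weights(weights):
    # Assumption: only the upper bound is enforced here.
    return [min(w, MAX_SINGLE_DOCUMENT_WEIGHT) for w in weights]


def compute_contribution_scores(weights):
    if not weights:
        return []
    total = sum(weights)
    if total == 0:
        # Fallback from the error-handling table: equal shares of 1/n.
        return [1.0 / len(weights)] * len(weights)
    return [w / total for w in weights]
```

Clamping before normalizing keeps a single heavily weighted document from dominating: raw weights of 3.0 and 1.0 become 1.0 and 1.0 after clamping, so each document contributes 0.5 instead of 0.75 and 0.25.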


Error Handling

Price Data Unavailability

| Scenario | Handling |
| --- | --- |
| Ticker price unavailable at snapshot time | Store NULL for price_at_prediction, log warning, continue |
| SPY price unavailable at snapshot time | Store NULL for spy_price_at_prediction, log warning, continue |
| Sector ETF price unavailable at snapshot time | Store NULL for sector_etf_price_at_prediction, log warning, continue |
| Sector not found in SECTOR_ETF_MAP | Store NULL for sector ETF price, log warning |
| Future price unavailable at evaluation time | Skip that horizon, retry on next Outcome_Evaluator run |
| SPY/sector ETF future price unavailable | Store NULL for excess returns, still compute ticker return |

Metrics Computation Edge Cases

| Scenario | Handling |
| --- | --- |
| Zero predictions in a confidence bucket | Exclude bucket from ECE computation |
| Fewer than 30 predictions for IC/Rank IC | Return NULL instead of unreliable correlation |
| All predictions in same confidence bucket | ECE = abs(avg_confidence - win_rate) for that bucket |
| Division by zero in contribution scores (total weight = 0) | Return equal contribution scores (1/n) |
| Single prediction | Contribution score = 1.0 |
| NaN/infinity in metric computation | Guard with math.isnan/math.isinf checks, return 0.0 or NULL |

Quality Gate Failures

| Scenario | Handling |
| --- | --- |
| No model_metric_snapshots exist | Default to paper-only mode (fail-safe) |
| Most recent snapshot older than 24 hours | Default to paper-only mode (fail-safe) |
| risk_configs table unreachable | Default to paper-only mode, log warning |
| Invalid threshold values in risk_configs | Use default thresholds, log warning |
| Gate evaluation fails mid-computation | Default to paper-only mode, log error |

Database Failures

| Scenario | Handling |
|---|---|
| prediction_snapshots insert fails | Log error, do not block recommendation generation |
| signal_evidence_links insert fails | Log error, snapshot still created (partial data) |
| prediction_outcomes insert fails | Log error, retry on next Outcome_Evaluator run |
| model_metric_snapshots insert fails | Log error, stale metrics used until next successful computation |
| source_accuracy update fails | Log error, continue with stale reliability data |

Canonical Evidence Key Edge Cases

| Scenario | Handling |
|---|---|
| Empty title | Use empty string in hash computation |
| Empty URL | Use empty string in hash computation |
| URL with no query parameters | Use URL as-is after lowercasing |
| Non-ASCII characters in title/URL | Encode as UTF-8 before hashing |
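
A minimal sketch of a key function that handles these edge cases. The exact normalization is specified elsewhere in this design; here the title normalization (strip and lowercase), the removal of the query string and fragment, and the newline separator between title and URL are all assumptions for illustration. SHA-256 is used as indicated by the unit-test plan below.

```python
import hashlib
from typing import Optional


def canonical_evidence_key(title: Optional[str], url: Optional[str]) -> str:
    """Return a hex SHA-256 digest over the normalized title and URL."""
    norm_title = (title or "").strip().lower()  # empty title -> ""
    norm_url = (url or "").strip().lower()      # empty URL -> ""
    # Assumed normalization: drop the query string and fragment, if any.
    norm_url = norm_url.split("#", 1)[0].split("?", 1)[0]
    payload = f"{norm_title}\n{norm_url}".encode("utf-8")  # UTF-8 before hashing
    return hashlib.sha256(payload).hexdigest()


a = canonical_evidence_key("Q3 Earnings Beat", "https://Example.com/news/1?utm_source=x")
b = canonical_evidence_key("q3 earnings beat", "https://example.com/news/1")
print(a == b)  # True
```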

Testing Strategy

Dual Testing Approach

The model validation feature requires both property-based tests (for mathematical correctness of metric computations) and example-based unit tests (for specific behaviors, integration points, and edge cases). Property-based testing is appropriate here because the feature contains several pure mathematical functions (ECE, Brier score, IC, Bayesian shrinkage, contribution scores) with clear input/output behavior and universal properties.

Property-Based Testing

Library: Hypothesis (already in use — .hypothesis/ directory exists, project convention established)

Configuration:

  • Minimum 100 iterations per property: @settings(max_examples=100)
  • File naming: tests/test_pbt_model_validation.py
  • Tag format: # Feature: model-validation-calibration, Property N: <title>

Property tests to implement (one test per correctness property):

| Property | Test Function | Key Generators |
|---|---|---|
| 1: ECE range and round-trip | `test_calibration_error_range_and_roundtrip` | `st.lists(st.tuples(st.floats(0.5, 1.0), st.booleans()))` |
| 2: Brier score range and perfect | `test_brier_score_range_and_perfect` | `st.lists(st.tuples(st.floats(0.0, 1.0), st.sampled_from([0.0, 1.0])))` |
| 3: IC range and perfect correlation | `test_information_coefficient_range_and_perfect` | `st.lists(st.floats(-10, 10), min_size=30)` with linear transform |
| 4: Canonical key determinism and idempotence | `test_canonical_key_determinism_and_idempotence` | `st.text()` pairs for title and URL |
| 5: Source reliability bounds and convergence | `test_source_reliability_bounds_and_convergence` | `st.floats(0.0, 1.0)` for win_rate, `st.integers(0, 10000)` for n |
| 6: Quality gate determinism and monotonicity | `test_quality_gate_determinism_and_monotonicity` | Custom strategy for QualityGateConfig and metric values |
| 7: Contribution score sum-to-one | `test_contribution_score_sum_to_one` | `st.lists(st.floats(0.01, 100.0), min_size=1)` |

Example-Based Unit Tests

File: tests/test_model_validation_unit.py

| Test Area | Examples |
|---|---|
| Canonical evidence key | Known title/URL → expected SHA256, empty inputs, unicode |
| Duplicate detection | 3 docs with 2 sharing a key → 1 marked duplicate |
| Contribution scores | [0.5, 0.3, 0.2] → [0.5, 0.3, 0.2], single doc → [1.0] |
| ECE specific values | Perfect calibration → 0.0, all overconfident → positive ECE |
| Brier score specific values | All correct at p=1.0 → 0.0, all wrong at p=1.0 → 1.0 |
| IC specific values | Perfect correlation → 1.0, anti-correlation → -1.0, < 30 → None |
| Source reliability | n=0 → 0.5, n=1000 with wr=0.8 → ≈0.8, n=30 with wr=0.7 → 0.6 |
| Adjusted evidence weight | reliability=0.5 → base*1.0, clamping to [0.1, 2.0] |
| Quality gate | All thresholds met → pass, one failed → fail with reason |
| Quality gate fail-safe | No snapshots → paper-only, stale snapshot → paper-only |
| Direction correct logic | bullish+positive → true, bullish+negative → false |
| Profitable logic | buy+positive → true, sell+negative → true |
| Future return computation | price 100→110 → 0.10, price 100→90 → -0.10 |
| Excess return | ticker 10%, SPY 5% → excess 5% |
| Weight clamping | weight 1.5 → clamped to 1.0 |
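
The source reliability examples above are consistent with Bayesian shrinkage toward a 0.5 prior. The sketch below reproduces them under the assumption that the prior strength is 30 pseudo-observations, a value inferred from the "n=30 with wr=0.7 → 0.6" example rather than taken from the specification; the function name `source_reliability` is likewise hypothetical.

```python
PRIOR_MEAN = 0.5       # reliability assumed for a source with no history
PRIOR_STRENGTH = 30.0  # assumed pseudo-observation count (inferred, see above)


def source_reliability(win_rate: float, n: int) -> float:
    """Shrink an observed win rate toward the prior; more data, less shrinkage."""
    return (n * win_rate + PRIOR_STRENGTH * PRIOR_MEAN) / (n + PRIOR_STRENGTH)


print(source_reliability(0.0, 0))               # 0.5
print(round(source_reliability(0.7, 30), 6))    # 0.6
print(round(source_reliability(0.8, 1000), 3))  # 0.791
```

With no observations the estimate equals the prior, and as n grows the estimate converges to the observed win rate, which is the bounds-and-convergence behaviour that Property 5 checks.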

Frontend Tests

File: frontend/src/test/pages.test.tsx (extend existing)

| Test Area | Strategy |
|---|---|
| OpsModel page renders validation tabs | MSW mock for /api/validation/summary |
| Calibration table renders buckets | MSW mock for /api/validation/calibration |
| Gate status indicator | MSW mock for /api/validation/gate-status |
| Miscalibration warning badge | Mock data with miscalibrated bucket |

Integration Tests

File: tests/test_model_validation_integration.py

| Test Area | Strategy |
|---|---|
| Snapshot creation with mock DB | asyncpg mock, verify INSERT queries |
| Outcome evaluation with mock prices | asyncpg mock, verify return computation |
| Metrics computation end-to-end | In-memory data, verify all metrics computed |
| API endpoint responses | FastAPI TestClient with mock pool |

Test File Structure

tests/
├── test_pbt_model_validation.py         # 7 property-based tests
├── test_model_validation_unit.py        # Example-based unit tests
└── test_model_validation_integration.py # Integration tests (optional)

frontend/src/test/
└── pages.test.tsx                       # Extended with validation page tests