Replay Dataset for Deterministic Extraction Testing
Archived document fixtures used to verify that the extraction pipeline produces consistent, schema-valid results across code changes.
Each fixture is a JSON file containing:
- document_id: stable identifier for the fixture
- document_type: article, filing, transcript, or press_release
- document_text: normalized text as it would arrive from the parser
- known_tickers: ticker hints passed to the extraction prompt
- expected_extraction: the expected extraction result (schema-valid)
- metadata: fixture provenance info (created_at, description, schema_version)
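As an illustration, a fixture with these fields might look like the sketch below. Every value is invented for the example, the inner shape of expected_extraction is a placeholder (the real structure is defined by the extraction schema), and the helper check_fixture_shape is hypothetical rather than part of the test suite.

```python
import json

# Hypothetical fixture; all values are illustrative only.
example_fixture = {
    "document_id": "example-article-001",
    "document_type": "article",
    "document_text": "Example Corp reported quarterly results on Tuesday.",
    "known_tickers": ["EXMP"],
    # Placeholder: the real structure comes from the extraction schema.
    "expected_extraction": {"entities": []},
    "metadata": {
        "created_at": "2024-01-01T00:00:00Z",
        "description": "Minimal illustrative fixture",
        "schema_version": "1",
    },
}

REQUIRED_KEYS = {
    "document_id",
    "document_type",
    "document_text",
    "known_tickers",
    "expected_extraction",
    "metadata",
}
ALLOWED_TYPES = {"article", "filing", "transcript", "press_release"}


def check_fixture_shape(fixture: dict) -> list[str]:
    """Return a list of human-readable problems; empty means the shape looks fine."""
    problems = []
    missing = REQUIRED_KEYS - fixture.keys()
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    if fixture.get("document_type") not in ALLOWED_TYPES:
        problems.append(f"unknown document_type: {fixture.get('document_type')!r}")
    return problems


if __name__ == "__main__":
    # Print the fixture as it would appear on disk, then any shape problems.
    print(json.dumps(example_fixture, indent=2))
    print(check_fixture_shape(example_fixture))
```

A check like this only covers the top-level fields listed above; validating expected_extraction itself is left to the extraction schema.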
The replay runner (tests/test_replay_extraction.py) loads these fixtures,
validates the expected outputs against the current extraction schema, and
optionally runs them through a live Ollama instance for end-to-end checks.
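A minimal sketch of the loading step and of the opt-in switch for live checks follows. It is not the actual contents of tests/test_replay_extraction.py: the function names and the REPLAY_LIVE environment variable are assumptions made for illustration, and the real runner may discover fixtures and enable live runs differently.

```python
import json
import os
from pathlib import Path


def load_fixtures(fixture_dir: Path) -> list[dict]:
    """Load every *.json file in fixture_dir, sorted by filename for a stable order."""
    fixtures = []
    for path in sorted(fixture_dir.glob("*.json")):
        with path.open(encoding="utf-8") as fh:
            fixtures.append(json.load(fh))
    return fixtures


def live_checks_enabled(environ=None) -> bool:
    """Return True only when the (hypothetical) REPLAY_LIVE variable is set to "1"."""
    env = os.environ if environ is None else environ
    return env.get("REPLAY_LIVE", "") == "1"
```

Sorting the paths keeps the replay order independent of filesystem listing order, and making the live run opt-in lets the schema validation pass on machines without an Ollama instance.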