stonks-oracle/.kiro/specs/sanitized-pipeline-docs/design.md
Celes Renata 88ad1e8d99 feat: comprehensive docs, unit tests, docker-compose app services
- Add scheduler and ingestion unit tests (test_scheduler_unit.py, test_ingestion_unit.py)
- Add all 13 app services + dashboard to docker-compose.yml
- Add full documentation suite: API reference, Helm reference, Docker deployment guide,
  3 architecture diagrams (K8s, Docker Compose, data pipeline), AI agent guide,
  backup/restore guide, observability/metrics reference, per-service docs
- Add intelligence pipeline deep-dive docs with Mermaid diagrams
- Update README with documentation index and links
- Add specs for comprehensive-quality-docs, intelligence-pipeline-deep-dive,
  sanitized-pipeline-docs
2026-04-22 02:56:41 +00:00


Design Document: Sanitized Pipeline Documentation

Overview

This design specifies the process and structure for producing a sanitized version of the 6-page intelligence pipeline deep dive documentation. The sanitized docs transform the existing docs/intelligence-pipeline-deep-dive/ content into domain-neutral equivalents stored at docs/sanitized-pipeline-deep-dive/, stripping all financial, market, and trading language while preserving every engineering detail — algorithms, formulas, architectural patterns, queue topologies, database schemas, code module references, and Mermaid diagrams.

The deliverable is a documentation-only transformation. No application code, database schemas, or infrastructure changes are involved. The output is Markdown files and Mermaid diagram files that mirror the original structure with domain-neutral framing.

Key design decision: The sanitization is a manual content transformation guided by a defined terminology map. Each source file is read, transformed according to the mapping rules, and written to the output directory. The original files remain untouched.

Source Material

The source documentation at docs/intelligence-pipeline-deep-dive/ consists of:

| File | Content |
| --- | --- |
| index.md | Table of contents, introduction, diagram links, related docs |
| 01-data-ingestion-and-preparation.md | Scheduler, ingestion worker, deduplication, parser |
| 02-ai-agent-processing-and-extraction.md | Document extractor, event classifier, JSON repair, validation |
| 03-signal-scoring-and-weighted-signals.md | Composite weight formula, three signal layers, sentiment mapping |
| 04-trend-aggregation-and-accumulating-signals.md | Time windows, trend direction, contradiction, evidence ranking, confidence |
| 05-recommendation-generation.md | Suppression, eligibility, position sizing, thesis, risk classification |
| 06-trading-decisions-and-execution.md | Trading engine, pre-trade checks, circuit breakers, broker adapter |
| diagrams/ingestion-to-extraction-flow.md | Mermaid flowchart: scheduler → ingestion → parser → extractor |
| diagrams/three-layer-signal-merging.md | Mermaid flowchart: three signal layers → aggregation |
| diagrams/weighted-signal-computation.md | Mermaid flowchart: composite weight formula breakdown |
| diagrams/trend-accumulation-escalation.md | Mermaid flowchart: time windows → escalation path |
| diagrams/recommendation-generation-flow.md | Mermaid flowchart: suppression → eligibility → thesis → risk |
| diagrams/trading-engine-decision-loop.md | Mermaid flowchart: pre-trade checks → position sizing → order submission |

Architecture

Output File Organization

The sanitized docs mirror the source structure with sanitized filenames:

docs/sanitized-pipeline-deep-dive/
├── index.md
├── 01-data-ingestion-and-preparation.md
├── 02-ai-agent-processing-and-extraction.md
├── 03-signal-scoring-and-weighted-signals.md
├── 04-trend-aggregation-and-accumulating-signals.md
├── 05-recommendation-generation.md
├── 06-decision-execution.md
└── diagrams/
    ├── ingestion-to-extraction-flow.md
    ├── three-layer-signal-merging.md
    ├── weighted-signal-computation.md
    ├── trend-accumulation-escalation.md
    ├── recommendation-generation-flow.md
    └── decision-engine-loop.md

Filename changes from source:

  • 06-trading-decisions-and-execution.md → 06-decision-execution.md (removes "trading")
  • diagrams/trading-engine-decision-loop.md → diagrams/decision-engine-loop.md (removes "trading")
  • All other filenames are already domain-neutral and remain unchanged

Transformation Process

The sanitization follows a three-pass approach for each file:

  1. Terminology pass: Apply the terminology map to replace all financial/trading terms with domain-neutral equivalents. This covers inline text, headings, table cells, code blocks, and Mermaid diagram labels.
  2. Reference pass: Update all internal cross-references to point to sanitized filenames (e.g., 06-trading-decisions-and-execution.md → 06-decision-execution.md, trading-engine-decision-loop.md → decision-engine-loop.md). Remove or neutralize references to external financial docs (e.g., links to ../llm-to-trade-pipeline.md become neutral descriptions).
  3. Narrative pass: Reframe example scenarios, inline illustrations, and narrative framing to use domain-neutral language. This pass handles context-dependent replacements that a simple find-and-replace cannot catch — e.g., "a bearish article about AAPL" becomes "a negative-sentiment article about Entity-A".
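The terminology pass can be sketched as a whole-word replacement that tries longer source terms first, so that "trading engine" is matched before any shorter overlapping term. This is a minimal illustration, assuming a small excerpt of the map; the function name `terminology_pass` and the matching rules are hypothetical, and the design itself describes the transformation as manual.

```python
import re

# Excerpt of the terminology map (the full map is defined in the tables below).
TERMINOLOGY_MAP = {
    "trading engine": "decision execution engine",
    "pre-trade checks": "pre-execution checks",
    "paper trading": "simulation mode",
    "bearish": "negative",
    "bullish": "positive",
    "AAPL": "Entity-A",
}


def terminology_pass(text: str, mapping: dict[str, str]) -> str:
    """Replace whole-term matches, trying longer source terms first."""
    terms = sorted(mapping, key=len, reverse=True)
    # The lookarounds reject matches glued to letters, digits, "_" or "-",
    # so identifiers and hyphenated words are not rewritten by accident.
    pattern = re.compile(
        r"(?<![\w-])(" + "|".join(re.escape(t) for t in terms) + r")(?![\w-])",
        re.IGNORECASE,
    )
    lowered = {source.lower(): target for source, target in mapping.items()}
    return pattern.sub(lambda m: lowered[m.group(1).lower()], text)


print(terminology_pass("The trading engine saw a bearish article about AAPL.", TERMINOLOGY_MAP))
# The decision execution engine saw a negative article about Entity-A.
```

Case-insensitive matching loses the original capitalization (a sentence-initial "Trading engine" comes out lowercase), which is one reason the narrative pass still needs a human reader.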

Content Flow

The sanitized docs preserve the same page-to-page narrative flow as the originals:

flowchart LR
    P1["Page 1\nData Ingestion"] --> P2["Page 2\nAI Extraction"]
    P2 --> P3["Page 3\nSignal Scoring"]
    P3 --> P4["Page 4\nTrend Aggregation"]
    P4 --> P5["Page 5\nRecommendations"]
    P5 --> P6["Page 6\nDecision Execution"]

Components and Interfaces

Terminology Map

The core of the sanitization is a defined mapping from financial/trading terms to domain-neutral equivalents. The map is applied consistently across all files.

System and Provider Names

| Source Term | Sanitized Replacement |
| --- | --- |
| Stonks Oracle / stonks | the platform / the system |
| Polygon.io / Polygon | external data provider / data source API |
| SEC EDGAR / SEC / EFTS | public records API / regulatory filings source |
| Alpaca / AlpacaBrokerAdapter | execution adapter / external execution API |
| Wall Street | (removed or reframed) |

Trading and Financial Actions

| Source Term | Sanitized Replacement |
| --- | --- |
| buy | act |
| sell | defer |
| hold | monitor |
| watch | observe |
| trading engine | decision execution engine |
| paper trading / paper_eligible | simulation mode / simulation_eligible |
| live trading / live_eligible | live execution mode / production_eligible |
| trade / trading (as action) | decision / execution |
| order (broker order) | execution request |
| pre-trade checks | pre-execution checks |

Financial Concepts

| Source Term | Sanitized Replacement |
| --- | --- |
| portfolio | resource pool / allocation pool |
| portfolio allocation | resource allocation |
| portfolio heat | pool exposure |
| portfolio snapshots | pool snapshots |
| position sizing | commitment sizing / resource allocation |
| position (open position) | commitment / active commitment |
| stop-loss | risk threshold / loss limit |
| take-profit | gain target |
| bullish | positive / favorable |
| bearish | negative / unfavorable |
| stock ticker / ticker symbol | entity identifier |
| stock market | (removed or reframed) |
| earnings / earnings call / earnings report | performance report / periodic disclosure |
| 10-K / 10-Q / 8-K | regulatory filing types |
| SEC filings | regulatory filings |
| broker / broker API | execution adapter / execution API |
| P&L | gain/loss |
| Sharpe ratio | risk-adjusted return ratio |
| drawdown | peak-to-trough decline |
| win rate | success rate |

Ticker Symbols and Company Names

| Source Term | Sanitized Replacement |
| --- | --- |
| AAPL / Apple | Entity-A |
| TSLA / Tesla | Entity-B |
| NVDA / NVIDIA | Entity-C |
| XOM | Entity-D |
| META | Entity-E |
| Any other ticker | Entity-{letter} or "tracked entity" |
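One way to keep the Entity-{letter} labels consistent across pages is to assign the next unused letter the first time an unmapped ticker appears, and to fall back to "tracked entity" once the alphabet is exhausted. The sketch below assumes that policy; the helper `label_for` and the tickers MSFT and AMZN are hypothetical examples, not part of the source map.

```python
import string

# Fixed assignments taken from the table above.
FIXED_LABELS = {
    "AAPL": "Entity-A",
    "TSLA": "Entity-B",
    "NVDA": "Entity-C",
    "XOM": "Entity-D",
    "META": "Entity-E",
}


def label_for(ticker: str, assigned: dict[str, str]) -> str:
    """Return a stable neutral label, assigning the next free letter if needed."""
    if ticker not in assigned:
        used = set(assigned.values())
        for letter in string.ascii_uppercase:
            candidate = f"Entity-{letter}"
            if candidate not in used:
                assigned[ticker] = candidate
                break
        else:
            # All 26 letters are taken; use the generic fallback from the map.
            assigned[ticker] = "tracked entity"
    return assigned[ticker]


labels = dict(FIXED_LABELS)
print(label_for("AAPL", labels))  # Entity-A
print(label_for("MSFT", labels))  # Entity-F (hypothetical extra ticker)
print(label_for("MSFT", labels))  # Entity-F again: the assignment is stable
```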

Redis Keys

| Source Pattern | Sanitized Pattern |
| --- | --- |
| stonks:queue:* | app:queue:* |
| stonks:dedupe:* | app:dedupe:* |
| stonks:ratelimit:* | app:ratelimit:* |
| stonks:trading:circuit_breaker:* | app:execution:circuit_breaker:* |
| stonks:dedupe:trading:* | app:dedupe:execution:* |
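Two of these patterns overlap: stonks:dedupe:trading:* also matches stonks:dedupe:*, so the more specific rule has to be tried first or the "trading" segment survives. A minimal sketch of ordered prefix rewriting follows; the function name and the example key suffixes are hypothetical.

```python
# Rules are ordered most-specific first so "trading" segments are rewritten
# before the generic prefix rules get a chance to match.
KEY_PREFIX_RULES = [
    ("stonks:trading:circuit_breaker:", "app:execution:circuit_breaker:"),
    ("stonks:dedupe:trading:", "app:dedupe:execution:"),
    ("stonks:queue:", "app:queue:"),
    ("stonks:dedupe:", "app:dedupe:"),
    ("stonks:ratelimit:", "app:ratelimit:"),
]


def sanitize_redis_key(key: str) -> str:
    """Rewrite a Redis key using the first matching prefix rule."""
    for source, target in KEY_PREFIX_RULES:
        if key.startswith(source):
            return target + key[len(source):]
    return key  # keys without a known prefix are left unchanged


print(sanitize_redis_key("stonks:dedupe:trading:abc"))  # app:dedupe:execution:abc
print(sanitize_redis_key("stonks:dedupe:doc:123"))      # app:dedupe:doc:123
```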

MinIO Buckets

| Source Bucket | Sanitized Bucket |
| --- | --- |
| stonks-raw-market | app-raw-data |
| stonks-raw-news | app-raw-content |
| stonks-raw-filings | app-raw-filings |
| stonks-normalized | app-normalized |
| stonks-llm-prompts | app-llm-prompts |
| stonks-llm-results | app-llm-results |

Database Tables

| Source Table | Sanitized Table |
| --- | --- |
| trading_decisions | execution_decisions |
| portfolio_snapshots | pool_snapshots |
| portfolio_pct (column) | allocation_pct |

All other table names (documents, document_intelligence, trend_windows, recommendations, etc.) are already domain-neutral and remain unchanged.

Adapter and Source Type Names

| Source Term | Sanitized Replacement |
| --- | --- |
| PolygonNewsAdapter | ExternalNewsAdapter |
| PolygonMarketAdapter | ExternalDataAdapter |
| SECEdgarAdapter | RegulatoryFilingsAdapter |
| AlpacaBrokerAdapter | ExecutionAdapter |
| broker (source_type) | execution_api |
| market_api (source_type) | data_api |
| filings_api (source_type) | filings_api (unchanged; already neutral) |

Preserved Engineering Terms

The following terms are explicitly preserved because they describe engineering patterns, not financial concepts:

  • circuit breaker — engineering safety pattern for rate limiting and cascading failure prevention
  • exponential backoff — retry pattern
  • adapter pattern — software design pattern (only the domain-specific adapter names are sanitized)
  • signal — used in signal processing and scoring context
  • trend, sentiment, confidence, contradiction, evidence — data analysis terms
  • recency decay, credibility weight, novelty bonus — scoring algorithm terms
  • weighted sentiment average — mathematical computation term

Preserved Technical Content

All of the following are preserved verbatim (with only the terminology map applied to embedded financial terms):

  • Composite signal scoring formula: combined = gate × recency × credibility × (1 + novelty_bonus) × market_context_multiplier
  • Confidence computation formula with log₂ scaling and four components
  • Weighted sentiment average formula
  • All threshold values, configuration parameters, and numeric constants
  • All Markdown table structures containing technical parameters
  • All code module path references (e.g., services/aggregation/scoring.py)
  • Three-layer signal architecture with weight ratios (1.0, 0.3, 0.2)
  • Contradiction detection algorithm and evidence ranking methodology
  • All PostgreSQL table structures and column descriptions (with sanitized names where needed)
  • All Redis queue patterns and operations (rpush/lpop/blpop)
  • All MinIO storage patterns (with sanitized bucket names)
  • Ollama as the LLM inference provider

Index Page Reframing

The sanitized index.md describes the system as an "AI-driven intelligence-to-decision pipeline" that:

  1. Ingests data from multiple external data sources
  2. Extracts structured intelligence via NLP/LLM
  3. Scores and weights signals
  4. Aggregates trends across time windows
  5. Generates recommendations with quality gates
  6. Executes decisions autonomously with safety mechanisms

References to "Stonks Oracle" are replaced with "the platform" or "the system". References to financial-specific APIs (Polygon.io, SEC EDGAR) are replaced with neutral descriptions. The "Related Documentation" section links are updated to use neutral descriptions or removed if they reference financial-specific content.

Page 06 Reframing

Page 06 undergoes the most extensive reframing since it covers the trading engine. Key changes:

  • Title: "Decision Execution" instead of "Trading Decisions and Execution"
  • "Trading engine" → "decision execution engine"
  • "Pre-trade checks" → "pre-execution checks"
  • "Broker adapter" / "Alpaca" → "execution adapter" / "external execution API"
  • "Paper trading" → "simulation mode"
  • "Live trading" → "live execution mode"
  • "Portfolio" → "resource pool" / "allocation pool"
  • "Position" → "commitment" / "active commitment"
  • "Stop-loss" → "risk threshold"
  • "Take-profit" → "gain target"
  • All order submission language reframed as "execution request submission"

Diagram Sanitization

Each Mermaid diagram file receives the same terminology map treatment:

  • Node labels containing financial terms are replaced
  • Queue name labels (stonks:queue:* → app:queue:*)
  • Bucket name labels (stonks-raw-market → app-raw-data)
  • Table name labels (trading_decisions → execution_decisions)
  • Adapter names in node labels
  • Subgraph titles containing financial terms
  • The trading-engine-decision-loop.md diagram is renamed to decision-engine-loop.md

Mermaid syntax, node relationships, subgraph structures, and flow directions are preserved exactly.

Data Models

This feature produces only documentation files. There are no new data models, database tables, or schema changes.

The sanitized narrative pages reference the same data models as the originals, with terminology-mapped names where applicable:

  • WeightedSignal — document reference + composite weight + sentiment + impact (unchanged)
  • SignalWeight — breakdown of recency, credibility, novelty, confidence gate, market context multiplier (unchanged)
  • TrendSummary — rolling trend for an entity across a time window (unchanged)
  • Recommendation — actionable decision recommendation (reframed from "trade recommendation")
  • execution_decisions table — audit record of every decision evaluation (sanitized from trading_decisions)
  • pool_snapshots table — resource pool state snapshots (sanitized from portfolio_snapshots)

Correctness Properties

A property is a characteristic or behavior that should hold true across all valid executions of a system — essentially, a formal statement about what the system should do. Properties serve as the bridge between human-readable specifications and machine-verifiable correctness guarantees.

The sanitized documentation set has one key universal property: the complete absence of financial/trading terminology across all output files. This is well-suited to property-based testing because the property must hold for every file in the output set, and the banned term list is large enough that systematic checking across all files provides high-value coverage.

Property 1: Banned Financial Terminology Exclusion

For any file in the sanitized documentation set (docs/sanitized-pipeline-deep-dive/), the file content shall not contain any term from the comprehensive banned financial terminology list. The banned list includes: stock ticker symbols (AAPL, TSLA, NVDA, XOM, META, and all 50 tracked tickers), company names used as financial examples (Apple, Tesla, NVIDIA), trading action labels (buy, sell, hold, watch as action labels — BUY, SELL, HOLD, WATCH in uppercase), financial system terms (trading engine, paper trading, live trading, paper_eligible, live_eligible, portfolio, portfolio allocation, portfolio heat, portfolio snapshots, broker, Alpaca, broker adapter, broker API, stock market, Wall Street, bullish, bearish, position sizing, stop-loss), financial event terms (SEC EDGAR, SEC filings, 10-K, 10-Q, 8-K, earnings, earnings call, earnings report), provider names (Polygon.io, Polygon), system names (Stonks Oracle, stonks), and infrastructure patterns containing financial terms (stonks: prefix in Redis keys, stonks- prefix in MinIO buckets, trading_decisions table name, portfolio_snapshots table name).

Validates: Requirements 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.8, 3.9, 3.10, 6.2, 7.1, 7.2, 7.3, 8.1, 8.2

Error Handling

Since this is a documentation-only deliverable, there is no runtime error handling to design. The primary quality concerns are:

Accuracy of Terminology Replacement

Every financial/trading term must be replaced with its domain-neutral equivalent. Missing a single instance of "stonks" in a Redis key pattern or "AAPL" in an example scenario would violate the sanitization requirements. The terminology map defined in the Components section serves as the authoritative reference.

Preservation of Technical Content

The sanitization must not accidentally remove or alter engineering content. Key risks:

  • Formula corruption: The composite weight formula contains market_context_multiplier — the word "market" must not be blindly replaced since it's part of a technical variable name
  • Code path corruption: Module paths like services/trading/engine.py contain "trading" — these paths reference actual files and must be preserved as-is (the code files are not being renamed)
  • Table name corruption: Database table names like trading_decisions need sanitization in narrative text but the actual SQL/code references to the original table names should be handled carefully

Design decision: Code module paths (e.g., services/trading/engine.py) are preserved exactly as they appear in the source, since they reference actual files in the repository. Only narrative references to concepts (e.g., "the trading engine") are sanitized. Variable names within formulas and code blocks are preserved. Database table names are sanitized in narrative descriptions and table listings, and inline code references to those tables use the sanitized name as well, since Property 1 bans the original table names anywhere in the output.
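The design decision above can be illustrated by splitting text into protected tokens (file paths and snake_case identifiers) and everything else, then applying replacements only to the unprotected text. This is a rough sketch under assumed token rules; the regular expression, the helper names, and the deliberately blunt replacement function are all hypothetical.

```python
import re

# Protected tokens: paths ending in .py, and snake_case identifiers.
PROTECTED = re.compile(r"[\w./-]+\.py\b|\b\w+_\w+\b")

# Table and column identifiers are the exception: they are renamed explicitly.
TABLE_RENAMES = {
    "trading_decisions": "execution_decisions",
    "portfolio_snapshots": "pool_snapshots",
    "portfolio_pct": "allocation_pct",
}


def blunt_replace(text: str) -> str:
    """A deliberately naive replacement that would corrupt paths if unguarded."""
    return text.replace("trading", "execution").replace("market", "data")


def sanitize_outside_protected(text: str) -> str:
    """Apply blunt_replace only between protected tokens."""
    parts, last = [], 0
    for match in PROTECTED.finditer(text):
        parts.append(blunt_replace(text[last:match.start()]))
        token = match.group(0)
        parts.append(TABLE_RENAMES.get(token, token))  # rename tables, keep the rest
        last = match.end()
    parts.append(blunt_replace(text[last:]))
    return "".join(parts)


line = ("The trading engine (services/trading/engine.py) writes "
        "trading_decisions and applies market_context_multiplier.")
print(sanitize_outside_protected(line))
# The execution engine (services/trading/engine.py) writes
# execution_decisions and applies market_context_multiplier.
```

The path and the formula variable survive even though the replacement function itself is careless, which is the property the preservation rules are after.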

Cross-Reference Integrity

All internal links must resolve to files that exist in the sanitized output:

  • Page-to-page links must use sanitized filenames
  • Diagram links must use sanitized diagram filenames
  • No links should point back to the source docs/intelligence-pipeline-deep-dive/ directory
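The three link rules above can be checked mechanically by extracting Markdown link targets and resolving each one relative to the file that contains it. The sketch below works on an in-memory mapping of relative path to file text, so it makes no assumption about how the files are loaded; the function name and the link regular expression are hypothetical and cover only inline `[text](target)` links.

```python
import posixpath
import re

# Captures the target of an inline Markdown link, without any #fragment.
LINK = re.compile(r"\[[^\]]*\]\(([^)\s#]+)(?:#[^)]*)?\)")
URL_SCHEME = re.compile(r"[a-z][a-z0-9+.-]*:", re.IGNORECASE)


def broken_links(files: dict[str, str]) -> list[tuple[str, str]]:
    """Return (file, target) pairs whose target is not in the sanitized set."""
    problems = []
    for path, text in files.items():
        for target in LINK.findall(text):
            if URL_SCHEME.match(target):
                continue  # external URL, outside the scope of this check
            resolved = posixpath.normpath(
                posixpath.join(posixpath.dirname(path), target)
            )
            if resolved not in files:
                problems.append((path, target))
    return problems


docs = {
    "index.md": "[Page 6](06-decision-execution.md)",
    "06-decision-execution.md": "[old](../intelligence-pipeline-deep-dive/index.md)",
    "diagrams/decision-engine-loop.md": "[page](../06-decision-execution.md)",
}
print(broken_links(docs))
# [('06-decision-execution.md', '../intelligence-pipeline-deep-dive/index.md')]
```

A link that points back into the source directory resolves outside the sanitized set, so the same check also catches the third rule.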

Testing Strategy

Why Limited PBT Applies

This is a documentation-only deliverable — the output is static Markdown files, not executable code with functions and data transformations. However, one universal property (banned term exclusion) is well-suited to property-based testing because it must hold across all files and involves checking a large set of terms against file content.

Most other requirements (structural checks, content preservation, narrative reframing) are better verified through example-based tests and manual review.

Property-Based Tests

  • Library: Hypothesis (Python, already in the project)
  • Configuration: @settings(max_examples=100)
  • Property 1 implementation: Generate random selections from the banned term list and random file selections from the sanitized docs, verify the term does not appear in the file content. Alternatively, exhaustively check all banned terms against all files (since the file set is small and fixed, this is more practical as an exhaustive example-based test).

Practical note: Given the small, fixed file set (13 files: the index, six pages, and six diagrams), the banned term exclusion property is most practically implemented as an exhaustive check that iterates over all files × all banned terms, rather than as a randomized property test. This provides complete coverage rather than probabilistic coverage.
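A minimal sketch of the exhaustive variant is shown below, using only the standard library. The banned list is a short excerpt, and the matching rules are assumptions rather than part of the spec: all-uppercase terms (tickers and action labels) are matched case-sensitively, everything else case-insensitively, and a term only counts when it is not embedded in a longer alphanumeric word.

```python
import re

# Excerpt only; the full list is given under Property 1.
BANNED_TERMS = [
    "AAPL", "HOLD", "bearish", "trading engine",
    "portfolio", "stonks", "trading_decisions",
]


def find_banned(files: dict[str, str], terms: list[str]) -> list[tuple[str, int, str]]:
    """Return (path, line number, term) for every banned term occurrence."""
    hits = []
    for path, text in files.items():
        for lineno, line in enumerate(text.splitlines(), start=1):
            for term in terms:
                # Assumption: uppercase terms such as HOLD are action labels,
                # so lowercase prose like "hold the line" is not flagged.
                flags = 0 if term.isupper() else re.IGNORECASE
                pattern = r"(?<![A-Za-z0-9])" + re.escape(term) + r"(?![A-Za-z0-9])"
                if re.search(pattern, line, flags):
                    hits.append((path, lineno, term))
    return hits


sample = {
    "index.md": "The platform ingests data.\nQueue: app:queue:ingest",
    "03.md": "A bearish article about AAPL.\nPlease hold; key stonks:queue:x",
}
for hit in find_banned(sample, BANNED_TERMS):
    print(hit)
```

Reporting the path and line number for every hit, instead of stopping at the first failure, makes a single test run enough to clean up a whole page.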

Example-Based Tests

  1. File structure verification: Verify all expected files exist at the correct paths
  2. Cross-reference integrity: Parse all sanitized files, extract markdown links, verify they resolve to existing sanitized files
  3. Mermaid syntax validation: Verify each diagram file contains valid Mermaid flowchart declarations
  4. Technical content preservation: Spot-check that key formulas, threshold values, and code module paths are present in the sanitized docs
  5. Terminology replacement verification: Spot-check that key replacements appear (e.g., "decision execution engine" replaces "trading engine")
  6. Index page framing: Verify the index describes the system as an "AI-driven intelligence-to-decision pipeline"
  7. Database table sanitization: Verify execution_decisions appears where trading_decisions was, and pool_snapshots where portfolio_snapshots was

Manual Review

  • Narrative coherence and readability of the sanitized content
  • Consistency of domain-neutral framing across all pages
  • Quality of example scenario replacements (e.g., "bearish article about AAPL" → "negative-sentiment article about Entity-A")
  • Preservation of page-to-page transition flow