Celes Renata 88ad1e8d99 feat: comprehensive docs, unit tests, docker-compose app services

- Add scheduler and ingestion unit tests (test_scheduler_unit.py, test_ingestion_unit.py)
- Add all 13 app services + dashboard to docker-compose.yml
- Add full documentation suite: API reference, Helm reference, Docker deployment guide,
  3 architecture diagrams (K8s, Docker Compose, data pipeline), AI agent guide,
  backup/restore guide, observability/metrics reference, per-service docs
- Add intelligence pipeline deep-dive docs with Mermaid diagrams
- Update README with documentation index and links
- Add specs for comprehensive-quality-docs, intelligence-pipeline-deep-dive,
  sanitized-pipeline-docs

2026-04-22 02:56:41 +00:00

Design Document: Sanitized Pipeline Documentation

Overview

This design specifies the process and structure for producing a sanitized version of the 6-page intelligence pipeline deep dive documentation. The sanitized docs transform the existing docs/intelligence-pipeline-deep-dive/ content into domain-neutral equivalents stored at docs/sanitized-pipeline-deep-dive/, stripping all financial, market, and trading language while preserving every engineering detail — algorithms, formulas, architectural patterns, queue topologies, database schemas, code module references, and Mermaid diagrams.

The deliverable is a documentation-only transformation. No application code, database schemas, or infrastructure changes are involved. The output is Markdown files and Mermaid diagram files that mirror the original structure with domain-neutral framing.

Key design decision: The sanitization is a manual content transformation guided by a defined terminology map. Each source file is read, transformed according to the mapping rules, and written to the output directory. The original files remain untouched.

Source Material

The source documentation at docs/intelligence-pipeline-deep-dive/ consists of:

File	Content
`index.md`	Table of contents, introduction, diagram links, related docs
`01-data-ingestion-and-preparation.md`	Scheduler, ingestion worker, deduplication, parser
`02-ai-agent-processing-and-extraction.md`	Document extractor, event classifier, JSON repair, validation
`03-signal-scoring-and-weighted-signals.md`	Composite weight formula, three signal layers, sentiment mapping
`04-trend-aggregation-and-accumulating-signals.md`	Time windows, trend direction, contradiction, evidence ranking, confidence
`05-recommendation-generation.md`	Suppression, eligibility, position sizing, thesis, risk classification
`06-trading-decisions-and-execution.md`	Trading engine, pre-trade checks, circuit breakers, broker adapter
`diagrams/ingestion-to-extraction-flow.md`	Mermaid flowchart: scheduler → ingestion → parser → extractor
`diagrams/three-layer-signal-merging.md`	Mermaid flowchart: three signal layers → aggregation
`diagrams/weighted-signal-computation.md`	Mermaid flowchart: composite weight formula breakdown
`diagrams/trend-accumulation-escalation.md`	Mermaid flowchart: time windows → escalation path
`diagrams/recommendation-generation-flow.md`	Mermaid flowchart: suppression → eligibility → thesis → risk
`diagrams/trading-engine-decision-loop.md`	Mermaid flowchart: pre-trade checks → position sizing → order submission

Architecture

Output File Organization

The sanitized docs mirror the source structure with sanitized filenames:

docs/sanitized-pipeline-deep-dive/
├── index.md
├── 01-data-ingestion-and-preparation.md
├── 02-ai-agent-processing-and-extraction.md
├── 03-signal-scoring-and-weighted-signals.md
├── 04-trend-aggregation-and-accumulating-signals.md
├── 05-recommendation-generation.md
├── 06-decision-execution.md
└── diagrams/
    ├── ingestion-to-extraction-flow.md
    ├── three-layer-signal-merging.md
    ├── weighted-signal-computation.md
    ├── trend-accumulation-escalation.md
    ├── recommendation-generation-flow.md
    └── decision-engine-loop.md

Filename changes from source:

06-trading-decisions-and-execution.md → 06-decision-execution.md (removes "trading")
diagrams/trading-engine-decision-loop.md → diagrams/decision-engine-loop.md (removes "trading")
All other filenames are already domain-neutral and remain unchanged
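
The rename rule above can be sketched as a small lookup. This is a minimal illustration, not the project's tooling: `RENAMES` and `sanitized_path` are hypothetical names, and only the two directory paths and the two renames come from this document.

```python
from pathlib import PurePosixPath

SOURCE_DIR = PurePosixPath("docs/intelligence-pipeline-deep-dive")
OUTPUT_DIR = PurePosixPath("docs/sanitized-pipeline-deep-dive")

# Only the two filenames containing "trading" change; all others are kept.
RENAMES = {
    "06-trading-decisions-and-execution.md": "06-decision-execution.md",
    "diagrams/trading-engine-decision-loop.md": "diagrams/decision-engine-loop.md",
}


def sanitized_path(source_path: str) -> str:
    """Map a path under SOURCE_DIR to its location under OUTPUT_DIR."""
    relative = PurePosixPath(source_path).relative_to(SOURCE_DIR).as_posix()
    return (OUTPUT_DIR / RENAMES.get(relative, relative)).as_posix()
```

Keeping the renames in one table means the reference pass and any file-structure check can share the same source of truth.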

Transformation Process

The sanitization follows a three-pass approach for each file:

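Terminology pass: Apply the terminology map to replace all financial/trading terms with domain-neutral equivalents. This covers inline text, headings, table cells, code blocks, and Mermaid diagram labels, except for the code module paths and formula variable names that the Error Handling section marks as preserved.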
Reference pass: Update all internal cross-references to point to sanitized filenames (e.g., 06-trading-decisions-and-execution.md → 06-decision-execution.md, trading-engine-decision-loop.md → decision-engine-loop.md). Remove or neutralize references to external financial docs (e.g., links to ../llm-to-trade-pipeline.md become neutral descriptions).
Narrative pass: Reframe example scenarios, inline illustrations, and narrative framing to use domain-neutral language. This pass handles context-dependent replacements that a simple find-and-replace cannot catch — e.g., "a bearish article about AAPL" becomes "a negative-sentiment article about Entity-A".
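
Although the design treats sanitization as a manual transformation, the terminology pass can be approximated mechanically as a first draft. The sketch below is an assumption about how such a helper might look: `TERMINOLOGY` is a short excerpt of the map, and `build_pattern` and `terminology_pass` are hypothetical names. Replacements are emitted in lowercase, so sentence-initial capitalization and context-dependent wording are still left to the narrative pass.

```python
import re

# Excerpt of the terminology map; the full map has many more entries.
TERMINOLOGY = {
    "portfolio allocation": "resource allocation",
    "portfolio": "resource pool",
    "trading engine": "decision execution engine",
    "pre-trade checks": "pre-execution checks",
    "bearish": "negative",
    "bullish": "positive",
}


def build_pattern(terms):
    # Longest terms first so "portfolio allocation" wins over "portfolio".
    ordered = sorted(terms, key=len, reverse=True)
    alternation = "|".join(re.escape(term) for term in ordered)
    # Lookarounds reject matches glued to word characters, "/" or "-",
    # which keeps identifiers such as portfolio_pct and file paths intact.
    return re.compile(rf"(?<![\w/-])(?:{alternation})(?![\w/-])", re.IGNORECASE)


PATTERN = build_pattern(TERMINOLOGY)


def terminology_pass(text: str) -> str:
    """Replace every mapped term with its domain-neutral equivalent."""
    return PATTERN.sub(lambda m: TERMINOLOGY[m.group(0).lower()], text)
```

Ordering the alternation by length matters because multi-word terms share prefixes with shorter ones; without it, "portfolio allocation" would become "resource pool allocation".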

Content Flow

The sanitized docs preserve the same page-to-page narrative flow as the originals:

flowchart LR
    P1["Page 1\nData Ingestion"] --> P2["Page 2\nAI Extraction"]
    P2 --> P3["Page 3\nSignal Scoring"]
    P3 --> P4["Page 4\nTrend Aggregation"]
    P4 --> P5["Page 5\nRecommendations"]
    P5 --> P6["Page 6\nDecision Execution"]

Components and Interfaces

Terminology Map

The core of the sanitization is a defined mapping from financial/trading terms to domain-neutral equivalents. The map is applied consistently across all files.

System and Provider Names

Source Term	Sanitized Replacement
Stonks Oracle / stonks	the platform / the system
Polygon.io / Polygon	external data provider / data source API
SEC EDGAR / SEC / EFTS	public records API / regulatory filings source
Alpaca / AlpacaBrokerAdapter	execution adapter / external execution API
Wall Street	(removed or reframed)

Trading and Financial Actions

Source Term	Sanitized Replacement
buy	act
sell	defer
hold	monitor
watch	observe
trading engine	decision execution engine
paper trading / paper_eligible	simulation mode / simulation_eligible
live trading / live_eligible	live execution mode / production_eligible
trade / trading (as action)	decision / execution
order (broker order)	execution request
pre-trade checks	pre-execution checks

Financial Concepts

Source Term	Sanitized Replacement
portfolio	resource pool / allocation pool
portfolio allocation	resource allocation
portfolio heat	pool exposure
portfolio snapshots	pool snapshots
position sizing	commitment sizing / resource allocation
position (open position)	commitment / active commitment
stop-loss	risk threshold / loss limit
take-profit	gain target
bullish	positive / favorable
bearish	negative / unfavorable
stock ticker / ticker symbol	entity identifier
stock market	(removed or reframed)
earnings / earnings call / earnings report	performance report / periodic disclosure
10-K / 10-Q / 8-K	regulatory filing types
SEC filings	regulatory filings
broker / broker API	execution adapter / execution API
P&L	gain/loss
Sharpe ratio	risk-adjusted return ratio
drawdown	peak-to-trough decline
win rate	success rate

Ticker Symbols and Company Names

Source Term	Sanitized Replacement
AAPL / Apple	Entity-A
TSLA / Tesla	Entity-B
NVDA / NVIDIA	Entity-C
XOM	Entity-D
META	Entity-E
Any other ticker	Entity-{letter} or "tracked entity"

Redis Keys

Source Pattern	Sanitized Pattern
`stonks:queue:*`	`app:queue:*`
`stonks:dedupe:*`	`app:dedupe:*`
`stonks:ratelimit:*`	`app:ratelimit:*`
`stonks:trading:circuit_breaker:*`	`app:execution:circuit_breaker:*`
`stonks:dedupe:trading:*`	`app:dedupe:execution:*`
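
The table has overlapping patterns: `stonks:dedupe:trading:*` also matches the generic `stonks:dedupe:*` rule, so the order of application matters. A minimal sketch, assuming a hypothetical `sanitize_redis_key` helper and treating each pattern as a plain prefix:

```python
# Prefix rewrites from the table above, most specific first: the two
# "trading" patterns must be tried before the generic stonks:dedupe: rule.
KEY_PREFIXES = [
    ("stonks:trading:circuit_breaker:", "app:execution:circuit_breaker:"),
    ("stonks:dedupe:trading:", "app:dedupe:execution:"),
    ("stonks:queue:", "app:queue:"),
    ("stonks:dedupe:", "app:dedupe:"),
    ("stonks:ratelimit:", "app:ratelimit:"),
]


def sanitize_redis_key(key: str) -> str:
    """Rewrite a documented Redis key to its sanitized form."""
    for source, target in KEY_PREFIXES:
        if key.startswith(source):
            return target + key[len(source):]
    # Failing loudly is a choice made for this sketch: an unmapped key
    # probably means the table is missing a pattern.
    raise ValueError(f"no sanitized pattern for key: {key!r}")
```

For example, `sanitize_redis_key("stonks:dedupe:trading:abc")` yields `app:dedupe:execution:abc` rather than `app:dedupe:trading:abc`; the `abc` suffix is illustrative only.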

MinIO Buckets

Source Bucket	Sanitized Bucket
`stonks-raw-market`	`app-raw-data`
`stonks-raw-news`	`app-raw-content`
`stonks-raw-filings`	`app-raw-filings`
`stonks-normalized`	`app-normalized`
`stonks-llm-prompts`	`app-llm-prompts`
`stonks-llm-results`	`app-llm-results`

Database Tables

Source Table	Sanitized Table
`trading_decisions`	`execution_decisions`
`portfolio_snapshots`	`pool_snapshots`
`portfolio_pct` (column)	`allocation_pct`

All other table names (documents, document_intelligence, trend_windows, recommendations, etc.) are already domain-neutral and remain unchanged.

Adapter and Source Type Names

Source Term	Sanitized Replacement
`PolygonNewsAdapter`	`ExternalNewsAdapter`
`PolygonMarketAdapter`	`ExternalDataAdapter`
`SECEdgarAdapter`	`RegulatoryFilingsAdapter`
`AlpacaBrokerAdapter`	`ExecutionAdapter`
`broker` (source_type)	`execution_api`
`market_api` (source_type)	`data_api`
`filings_api` (source_type)	`filings_api` (unchanged — already neutral)

Preserved Engineering Terms

The following terms are explicitly preserved because they describe engineering patterns, not financial concepts:

circuit breaker — engineering safety pattern for rate limiting and cascading failure prevention
exponential backoff — retry pattern
adapter pattern — software design pattern (only the domain-specific adapter names are sanitized)
signal — used in signal processing and scoring context
trend, sentiment, confidence, contradiction, evidence — data analysis terms
recency decay, credibility weight, novelty bonus — scoring algorithm terms
weighted sentiment average — mathematical computation term

Preserved Technical Content

All of the following are preserved verbatim (with only the terminology map applied to embedded financial terms):

Composite signal scoring formula: combined = gate × recency × credibility × (1 + novelty_bonus) × market_context_multiplier
Confidence computation formula with log₂ scaling and four components
Weighted sentiment average formula
All threshold values, configuration parameters, and numeric constants
All Markdown table structures containing technical parameters
All code module path references (e.g., services/aggregation/scoring.py)
Three-layer signal architecture with weight ratios (1.0, 0.3, 0.2)
Contradiction detection algorithm and evidence ranking methodology
All PostgreSQL table structures and column descriptions (with sanitized names where needed)
All Redis queue patterns and operations (rpush/lpop/blpop)
All MinIO storage patterns (with sanitized bucket names)
Ollama as the LLM inference provider
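
As a worked illustration of the composite formula listed above, the snippet below evaluates it directly. The function name `composite_weight` and the input values are assumptions chosen for easy arithmetic; they are not thresholds or defaults from the source docs. The parameter `market_context_multiplier` keeps its original name, as the preservation rules require.

```python
def composite_weight(
    gate: float,
    recency: float,
    credibility: float,
    novelty_bonus: float,
    market_context_multiplier: float,
) -> float:
    """combined = gate x recency x credibility x (1 + novelty_bonus) x market_context_multiplier"""
    return gate * recency * credibility * (1 + novelty_bonus) * market_context_multiplier


# Illustrative inputs only: 1.0 * 0.5 * 0.75 * 1.25 * 1.0
example = composite_weight(
    gate=1.0,
    recency=0.5,
    credibility=0.75,
    novelty_bonus=0.25,
    market_context_multiplier=1.0,
)
```

Because every factor is multiplicative, a gate of 0 removes the signal entirely regardless of the other components.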

Index Page Reframing

The sanitized index.md describes the system as an "AI-driven intelligence-to-decision pipeline" that:

Ingests data from multiple external data sources
Extracts structured intelligence via NLP/LLM
Scores and weights signals
Aggregates trends across time windows
Generates recommendations with quality gates
Executes decisions autonomously with safety mechanisms

References to "Stonks Oracle" are replaced with "the platform" or "the system". References to financial-specific APIs (Polygon.io, SEC EDGAR) are replaced with neutral descriptions. The "Related Documentation" section links are updated to use neutral descriptions or removed if they reference financial-specific content.

Page 06 Reframing

Page 06 undergoes the most extensive reframing since it covers the trading engine. Key changes:

Title: "Decision Execution" instead of "Trading Decisions and Execution"
"Trading engine" → "decision execution engine"
"Pre-trade checks" → "pre-execution checks"
"Broker adapter" / "Alpaca" → "execution adapter" / "external execution API"
"Paper trading" → "simulation mode"
"Live trading" → "live execution mode"
"Portfolio" → "resource pool" / "allocation pool"
"Position" → "commitment" / "active commitment"
"Stop-loss" → "risk threshold"
"Take-profit" → "gain target"
All order submission language reframed as "execution request submission"

Diagram Sanitization

Each Mermaid diagram file receives the same terminology map treatment:

Node labels containing financial terms are replaced
Queue name labels are rewritten (stonks:queue:* → app:queue:*)
Bucket name labels are rewritten (stonks-raw-market → app-raw-data)
Table name labels are rewritten (trading_decisions → execution_decisions)
Adapter names in node labels are replaced
Subgraph titles containing financial terms are replaced
The trading-engine-decision-loop.md diagram is renamed to decision-engine-loop.md

Mermaid syntax, node relationships, subgraph structures, and flow directions are preserved exactly.
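
One way to keep Mermaid structure intact is to rewrite only the text inside node labels and leave node IDs, arrows, and indentation untouched. The sketch below assumes labels written in the `["..."]` form; `sanitize_labels` is a hypothetical helper, `REPLACEMENTS` is an excerpt of the map, and subgraph titles or other node shapes would need additional patterns.

```python
import re

# Matches ["..."] node labels only; IDs and edges are never touched.
LABEL = re.compile(r'\["([^"]*)"\]')

# Excerpt of the terminology map relevant to diagram labels.
REPLACEMENTS = {
    "stonks:queue:": "app:queue:",
    "stonks-raw-market": "app-raw-data",
    "trading_decisions": "execution_decisions",
}


def sanitize_labels(mermaid_source: str) -> str:
    """Apply the replacements inside node labels, preserving everything else."""

    def rewrite(match: re.Match) -> str:
        label = match.group(1)
        for source, target in REPLACEMENTS.items():
            label = label.replace(source, target)
        return f'["{label}"]'

    return LABEL.sub(rewrite, mermaid_source)
```

Restricting the substitution to label text means a diff of the source and sanitized diagram should show changes only between the quotes.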

Data Models

This feature produces only documentation files. There are no new data models, database tables, or schema changes.

The sanitized narrative pages reference the same data models as the originals, with terminology-mapped names where applicable:

WeightedSignal — document reference + composite weight + sentiment + impact (unchanged)
SignalWeight — breakdown of recency, credibility, novelty, confidence gate, market context multiplier (unchanged)
TrendSummary — rolling trend for an entity across a time window (unchanged)
Recommendation — actionable decision recommendation (reframed from "trade recommendation")
execution_decisions table — audit record of every decision evaluation (sanitized from trading_decisions)
pool_snapshots table — resource pool state snapshots (sanitized from portfolio_snapshots)

Correctness Properties

A property is a characteristic or behavior that should hold true across all valid executions of a system — essentially, a formal statement about what the system should do. Properties serve as the bridge between human-readable specifications and machine-verifiable correctness guarantees.

The sanitized documentation set has one key universal property: the complete absence of financial/trading terminology across all output files. This is well-suited to property-based testing because the property must hold for every file in the output set, and the banned term list is large enough that systematic checking across all files provides high-value coverage.

Property 1: Banned Financial Terminology Exclusion

For any file in the sanitized documentation set (docs/sanitized-pipeline-deep-dive/), the file content shall not contain any term from the comprehensive banned financial terminology list. The banned list includes:

Stock ticker symbols: AAPL, TSLA, NVDA, XOM, META, and all 50 tracked tickers
Company names used as financial examples: Apple, Tesla, NVIDIA
Trading action labels: buy, sell, hold, watch used as action labels (BUY, SELL, HOLD, WATCH in uppercase)
Financial system terms: trading engine, paper trading, live trading, paper_eligible, live_eligible, portfolio, portfolio allocation, portfolio heat, portfolio snapshots, broker, Alpaca, broker adapter, broker API, stock market, Wall Street, bullish, bearish, position sizing, stop-loss
Financial event terms: SEC EDGAR, SEC filings, 10-K, 10-Q, 8-K, earnings, earnings call, earnings report
Provider names: Polygon.io, Polygon
System names: Stonks Oracle, stonks
Infrastructure patterns containing financial terms: stonks: prefix in Redis keys, stonks- prefix in MinIO buckets, trading_decisions table name, portfolio_snapshots table name

Validates: Requirements 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.8, 3.9, 3.10, 6.2, 7.1, 7.2, 7.3, 8.1, 8.2
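
A banned-term scanner for Property 1 might look like the following. It is a sketch under stated assumptions: the two lists are short excerpts of the banned list, `find_banned_terms` is a hypothetical name, and treating ticker symbols as case-sensitive is an assumption made here so that ordinary words such as "metadata" do not trip the check. Matching the action labels in uppercase only follows the property text above.

```python
import re

# Excerpts of the banned list; the real list also covers all tracked tickers.
CASE_INSENSITIVE = ["trading engine", "portfolio", "bullish", "bearish", "stonks", "Polygon.io"]
CASE_SENSITIVE = ["AAPL", "TSLA", "META", "BUY", "SELL", "HOLD", "WATCH"]


def _compile(terms, flags=0):
    ordered = sorted(terms, key=len, reverse=True)
    alternation = "|".join(re.escape(term) for term in ordered)
    # Alphanumeric lookarounds: "stonks:queue" and "portfolio_snapshots" are
    # caught, while "threshold" or "metadata" are not.
    return re.compile(rf"(?<![A-Za-z0-9])(?:{alternation})(?![A-Za-z0-9])", flags)


PATTERNS = [_compile(CASE_INSENSITIVE, re.IGNORECASE), _compile(CASE_SENSITIVE)]


def find_banned_terms(text: str) -> list[str]:
    """Return every banned term occurrence found in text (empty if clean)."""
    hits = []
    for pattern in PATTERNS:
        hits.extend(match.group(0) for match in pattern.finditer(text))
    return hits
```

A test suite can then assert that `find_banned_terms` returns an empty list for the content of each file in the sanitized directory.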

Error Handling

Since this is a documentation-only deliverable, there is no runtime error handling to design. The primary quality concerns are:

Accuracy of Terminology Replacement

Every financial/trading term must be replaced with its domain-neutral equivalent. Missing a single instance of "stonks" in a Redis key pattern or "AAPL" in an example scenario would violate the sanitization requirements. The terminology map defined in the Components section serves as the authoritative reference.

Preservation of Technical Content

The sanitization must not accidentally remove or alter engineering content. Key risks:

Formula corruption: The composite weight formula contains market_context_multiplier — the word "market" must not be blindly replaced since it's part of a technical variable name
Code path corruption: Module paths like services/trading/engine.py contain "trading" — these paths reference actual files and must be preserved as-is (the code files are not being renamed)
Table name corruption: Database table names like trading_decisions are sanitized in narrative text, while SQL/code references to the original table names follow the design decision below

Design decision: Code module paths (e.g., services/trading/engine.py) are preserved exactly as they appear in the source, since they reference actual files in the repository. Only narrative references to concepts (e.g., "the trading engine") are sanitized. Variable names within formulas and code blocks are preserved. Database table names are sanitized in narrative descriptions and table listings, but inline code references note the sanitized name.
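
The design decision above can be sketched with a single regular expression in which protected spans are matched first and returned unchanged. Everything here is illustrative: `sanitize_narrative` is a hypothetical helper, `NARRATIVE_TERMS` is a two-entry excerpt of the map, and the path pattern assumes module paths end in `.py`.

```python
import re

# Excerpt of the terminology map for narrative text.
NARRATIVE_TERMS = {
    "trading engine": "decision execution engine",
    "trading": "execution",
}

_path = r"[\w./-]+\.py"  # module paths such as services/trading/engine.py
_terms = "|".join(
    re.escape(term) for term in sorted(NARRATIVE_TERMS, key=len, reverse=True)
)
# The "keep" branch comes first, so a path is consumed whole before the
# term branch can see the word "trading" inside it.
PATTERN = re.compile(rf"(?P<keep>{_path})|(?P<term>\b(?:{_terms})\b)", re.IGNORECASE)


def sanitize_narrative(text: str) -> str:
    """Replace narrative terms while leaving code module paths untouched."""

    def rewrite(match: re.Match) -> str:
        if match.group("keep"):
            return match.group("keep")
        return NARRATIVE_TERMS[match.group("term").lower()]

    return PATTERN.sub(rewrite, text)
```

The `\b` word boundaries already keep identifiers such as market_context_multiplier intact, since an underscore counts as a word character; the explicit path branch is needed because "/" does not.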

Cross-Reference Integrity

All internal links must resolve to files that exist in the sanitized output:

Page-to-page links must use sanitized filenames
Diagram links must use sanitized diagram filenames
No links should point back to the source docs/intelligence-pipeline-deep-dive/ directory

Testing Strategy

Why Limited Property-Based Testing (PBT) Applies

This is a documentation-only deliverable — the output is static Markdown files, not executable code with functions and data transformations. However, one universal property (banned term exclusion) is well-suited to property-based testing because it must hold across all files and involves checking a large set of terms against file content.

Most other requirements (structural checks, content preservation, narrative reframing) are better verified through example-based tests and manual review.

Property-Based Tests

Library: Hypothesis (Python, already in the project)
Configuration: @settings(max_examples=100)
Property 1 implementation: Generate random selections from the banned term list and random file selections from the sanitized docs, verify the term does not appear in the file content. Alternatively, exhaustively check all banned terms against all files (since the file set is small and fixed, this is more practical as an exhaustive example-based test).

Practical note: Given the small, fixed file set (13 files: the index, six pages, and six diagrams), the banned term exclusion property is most practically implemented as an exhaustive check that iterates all files × all banned terms, rather than as a randomized property test. This provides complete rather than probabilistic coverage.

Example-Based Tests

File structure verification: Verify all expected files exist at the correct paths
Cross-reference integrity: Parse all sanitized files, extract markdown links, verify they resolve to existing sanitized files
Mermaid syntax validation: Verify each diagram file contains valid Mermaid flowchart declarations
Technical content preservation: Spot-check that key formulas, threshold values, and code module paths are present in the sanitized docs
Terminology replacement verification: Spot-check that key replacements appear (e.g., "decision execution engine" replaces "trading engine")
Index page framing: Verify the index describes the system as an "AI-driven intelligence-to-decision pipeline"
Database table sanitization: Verify execution_decisions appears where trading_decisions was, and pool_snapshots where portfolio_snapshots was
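
The cross-reference integrity check from the list above could be sketched as follows. It works on an in-memory mapping of relative paths to Markdown text so it can be exercised without touching the filesystem; `broken_links` and the `files` structure are assumptions, and the link regex covers only inline `[text](target)` links.

```python
import posixpath
import re

# Captures the target of an inline Markdown link, ignoring any #fragment.
LINK = re.compile(r"\[[^\]]*\]\(([^)\s#]+)(?:#[^)]*)?\)")


def broken_links(files: dict[str, str]) -> list[tuple[str, str]]:
    """files maps a path relative to the sanitized root to its Markdown text."""
    broken = []
    for path, text in files.items():
        for target in LINK.findall(text):
            if "://" in target:
                continue  # external URLs are out of scope for this check
            resolved = posixpath.normpath(
                posixpath.join(posixpath.dirname(path), target)
            )
            if resolved not in files:
                broken.append((path, target))
    return broken
```

Because targets are resolved relative to the linking file, a link that climbs out of the sanitized root (for example, back into the source directory) fails to resolve and is reported as broken.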

Manual Review

Narrative coherence and readability of the sanitized content
Consistency of domain-neutral framing across all pages
Quality of example scenario replacements (e.g., "bearish article about AAPL" → "negative-sentiment article about Entity-A")
Preservation of page-to-page transition flow
