feat: comprehensive docs, unit tests, docker-compose app services

- Add scheduler and ingestion unit tests (test_scheduler_unit.py, test_ingestion_unit.py)
- Add all 13 app services + dashboard to docker-compose.yml
- Add full documentation suite: API reference, Helm reference, Docker deployment guide,
  3 architecture diagrams (K8s, Docker Compose, data pipeline), AI agent guide,
  backup/restore guide, observability/metrics reference, per-service docs
- Add intelligence pipeline deep-dive docs with Mermaid diagrams
- Update README with documentation index and links
- Add specs for comprehensive-quality-docs, intelligence-pipeline-deep-dive,
  sanitized-pipeline-docs
Celes Renata
2026-04-22 02:56:41 +00:00
parent f251c53f92
commit 88ad1e8d99
57 changed files with 13318 additions and 51 deletions
@@ -0,0 +1,341 @@
# Design Document: Sanitized Pipeline Documentation
## Overview
This design specifies the process and structure for producing a sanitized version of the 6-page intelligence pipeline deep dive documentation. The sanitized docs transform the existing `docs/intelligence-pipeline-deep-dive/` content into domain-neutral equivalents stored at `docs/sanitized-pipeline-deep-dive/`, stripping all financial, market, and trading language while preserving every engineering detail — algorithms, formulas, architectural patterns, queue topologies, database schemas, code module references, and Mermaid diagrams.
The deliverable is a documentation-only transformation. No application code, database schemas, or infrastructure changes are involved. The output is Markdown files and Mermaid diagram files that mirror the original structure with domain-neutral framing.
**Key design decision**: The sanitization is a manual content transformation guided by a defined terminology map. Each source file is read, transformed according to the mapping rules, and written to the output directory. The original files remain untouched.
### Source Material
The source documentation at `docs/intelligence-pipeline-deep-dive/` consists of:
| File | Content |
|------|---------|
| `index.md` | Table of contents, introduction, diagram links, related docs |
| `01-data-ingestion-and-preparation.md` | Scheduler, ingestion worker, deduplication, parser |
| `02-ai-agent-processing-and-extraction.md` | Document extractor, event classifier, JSON repair, validation |
| `03-signal-scoring-and-weighted-signals.md` | Composite weight formula, three signal layers, sentiment mapping |
| `04-trend-aggregation-and-accumulating-signals.md` | Time windows, trend direction, contradiction, evidence ranking, confidence |
| `05-recommendation-generation.md` | Suppression, eligibility, position sizing, thesis, risk classification |
| `06-trading-decisions-and-execution.md` | Trading engine, pre-trade checks, circuit breakers, broker adapter |
| `diagrams/ingestion-to-extraction-flow.md` | Mermaid flowchart: scheduler → ingestion → parser → extractor |
| `diagrams/three-layer-signal-merging.md` | Mermaid flowchart: three signal layers → aggregation |
| `diagrams/weighted-signal-computation.md` | Mermaid flowchart: composite weight formula breakdown |
| `diagrams/trend-accumulation-escalation.md` | Mermaid flowchart: time windows → escalation path |
| `diagrams/recommendation-generation-flow.md` | Mermaid flowchart: suppression → eligibility → thesis → risk |
| `diagrams/trading-engine-decision-loop.md` | Mermaid flowchart: pre-trade checks → position sizing → order submission |
## Architecture
### Output File Organization
The sanitized docs mirror the source structure with sanitized filenames:
```
docs/sanitized-pipeline-deep-dive/
├── index.md
├── 01-data-ingestion-and-preparation.md
├── 02-ai-agent-processing-and-extraction.md
├── 03-signal-scoring-and-weighted-signals.md
├── 04-trend-aggregation-and-accumulating-signals.md
├── 05-recommendation-generation.md
├── 06-decision-execution.md
└── diagrams/
    ├── ingestion-to-extraction-flow.md
    ├── three-layer-signal-merging.md
    ├── weighted-signal-computation.md
    ├── trend-accumulation-escalation.md
    ├── recommendation-generation-flow.md
    └── decision-engine-loop.md
```
**Filename changes from source:**
- `06-trading-decisions-and-execution.md` → `06-decision-execution.md` (removes "trading")
- `diagrams/trading-engine-decision-loop.md` → `diagrams/decision-engine-loop.md` (removes "trading")
- All other filenames are already domain-neutral and remain unchanged
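The filename rule above can be sketched as a small lookup. This is a minimal illustration only: the names `RENAMES` and `sanitized_path` are hypothetical, and the design describes a manual transformation rather than a script.

```python
from pathlib import PurePosixPath

# The two renames listed above; every other filename passes through unchanged.
RENAMES = {
    "06-trading-decisions-and-execution.md": "06-decision-execution.md",
    "trading-engine-decision-loop.md": "decision-engine-loop.md",
}


def sanitized_path(relative_path: str) -> str:
    """Map a source-relative path to its sanitized counterpart."""
    path = PurePosixPath(relative_path)
    return str(path.with_name(RENAMES.get(path.name, path.name)))


print(sanitized_path("diagrams/trading-engine-decision-loop.md"))
```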
### Transformation Process
The sanitization follows a three-pass approach for each file:
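1. **Terminology pass**: Apply the terminology map to replace all financial/trading terms with domain-neutral equivalents. This covers inline text, headings, table cells, code blocks, and Mermaid diagram labels; code module paths and variable names inside formulas and code blocks are the exception and are preserved (see Error Handling).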
2. **Reference pass**: Update all internal cross-references to point to sanitized filenames (e.g., `06-trading-decisions-and-execution.md` → `06-decision-execution.md`, `trading-engine-decision-loop.md` → `decision-engine-loop.md`). Remove or neutralize references to external financial docs (e.g., links to `../llm-to-trade-pipeline.md` become neutral descriptions).
3. **Narrative pass**: Reframe example scenarios, inline illustrations, and narrative framing to use domain-neutral language. This pass handles context-dependent replacements that a simple find-and-replace cannot catch — e.g., "a bearish article about AAPL" becomes "a negative-sentiment article about Entity-A".
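As a rough aid to the terminology pass, the sketch below applies a small excerpt of the map. `TERM_MAP` and `apply_terminology_map` are hypothetical names, and the excerpt is not the full map. Longer terms are tried first so that multi-word phrases win over their substrings. Replacements take the map's casing rather than the source's. Context-dependent rewording still belongs to the manual narrative pass.

```python
import re

# Hypothetical excerpt of the terminology map, for illustration only.
TERM_MAP = {
    "trading engine": "decision execution engine",
    "pre-trade checks": "pre-execution checks",
    "paper trading": "simulation mode",
    "bearish": "negative",
    "bullish": "positive",
    "AAPL": "Entity-A",
}


def apply_terminology_map(text: str, term_map: dict[str, str] = TERM_MAP) -> str:
    """Replace mapped terms case-insensitively, longest term first."""
    terms = sorted(term_map, key=len, reverse=True)
    # Require a non-word, non-hyphen character on both sides of a match.
    pattern = re.compile(
        r"(?<![\w-])(" + "|".join(re.escape(t) for t in terms) + r")(?![\w-])",
        re.IGNORECASE,
    )
    lookup = {term.lower(): repl for term, repl in term_map.items()}
    return pattern.sub(lambda m: lookup[m.group(1).lower()], text)


print(apply_terminology_map("The trading engine flagged a bearish article about AAPL."))
# prints: The decision execution engine flagged a negative article about Entity-A.
```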
### Content Flow
The sanitized docs preserve the same page-to-page narrative flow as the originals:
```mermaid
flowchart LR
P1["Page 1\nData Ingestion"] --> P2["Page 2\nAI Extraction"]
P2 --> P3["Page 3\nSignal Scoring"]
P3 --> P4["Page 4\nTrend Aggregation"]
P4 --> P5["Page 5\nRecommendations"]
P5 --> P6["Page 6\nDecision Execution"]
```
## Components and Interfaces
### Terminology Map
The core of the sanitization is a defined mapping from financial/trading terms to domain-neutral equivalents. The map is applied consistently across all files.
#### System and Provider Names
| Source Term | Sanitized Replacement |
|-------------|----------------------|
| Stonks Oracle / stonks | the platform / the system |
| Polygon.io / Polygon | external data provider / data source API |
| SEC EDGAR / SEC / EFTS | public records API / regulatory filings source |
| Alpaca / AlpacaBrokerAdapter | execution adapter / external execution API |
| Wall Street | (removed or reframed) |
#### Trading and Financial Actions
| Source Term | Sanitized Replacement |
|-------------|----------------------|
| buy | act |
| sell | defer |
| hold | monitor |
| watch | observe |
| trading engine | decision execution engine |
| paper trading / paper_eligible | simulation mode / simulation_eligible |
| live trading / live_eligible | live execution mode / production_eligible |
| trade / trading (as action) | decision / execution |
| order (broker order) | execution request |
| pre-trade checks | pre-execution checks |
#### Financial Concepts
| Source Term | Sanitized Replacement |
|-------------|----------------------|
| portfolio | resource pool / allocation pool |
| portfolio allocation | resource allocation |
| portfolio heat | pool exposure |
| portfolio snapshots | pool snapshots |
| position sizing | commitment sizing / resource allocation |
| position (open position) | commitment / active commitment |
| stop-loss | risk threshold / loss limit |
| take-profit | gain target |
| bullish | positive / favorable |
| bearish | negative / unfavorable |
| stock ticker / ticker symbol | entity identifier |
| stock market | (removed or reframed) |
| earnings / earnings call / earnings report | performance report / periodic disclosure |
| 10-K / 10-Q / 8-K | regulatory filing types |
| SEC filings | regulatory filings |
| broker / broker API | execution adapter / execution API |
| P&L | gain/loss |
| Sharpe ratio | risk-adjusted return ratio |
| drawdown | peak-to-trough decline |
| win rate | success rate |
#### Ticker Symbols and Company Names
| Source Term | Sanitized Replacement |
|-------------|----------------------|
| AAPL / Apple | Entity-A |
| TSLA / Tesla | Entity-B |
| NVDA / NVIDIA | Entity-C |
| XOM | Entity-D |
| META | Entity-E |
| Any other ticker | Entity-{letter} or "tracked entity" |
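One possible way to apply the "any other ticker" rule consistently is sketched below. The function name `build_entity_labels` and the placeholder tickers `OTHER1`/`OTHER2` are hypothetical. The sketch assumes letters are handed out in first-seen order, with "tracked entity" as the fallback once the alphabet is exhausted.

```python
from string import ascii_uppercase

# Fixed assignments from the table above.
FIXED_ENTITIES = {
    "AAPL": "Entity-A",
    "TSLA": "Entity-B",
    "NVDA": "Entity-C",
    "XOM": "Entity-D",
    "META": "Entity-E",
}


def build_entity_labels(tickers: list[str]) -> dict[str, str]:
    """Assign each unseen ticker the next free Entity-{letter}, in first-seen order."""
    labels = dict(FIXED_ENTITIES)
    used = set(labels.values())
    free = [f"Entity-{c}" for c in ascii_uppercase if f"Entity-{c}" not in used]
    for ticker in tickers:
        if ticker not in labels:
            # Fall back to the generic phrase once letters run out.
            labels[ticker] = free.pop(0) if free else "tracked entity"
    return labels


print(build_entity_labels(["AAPL", "OTHER1", "OTHER2"])["OTHER1"])
```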
#### Redis Keys
| Source Pattern | Sanitized Pattern |
|----------------|-------------------|
| `stonks:queue:*` | `app:queue:*` |
| `stonks:dedupe:*` | `app:dedupe:*` |
| `stonks:ratelimit:*` | `app:ratelimit:*` |
| `stonks:trading:circuit_breaker:*` | `app:execution:circuit_breaker:*` |
| `stonks:dedupe:trading:*` | `app:dedupe:execution:*` |
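The patterns above overlap: `stonks:dedupe:trading:*` also matches `stonks:dedupe:*`, so the more specific prefix has to be tried first. A minimal sketch of that ordering follows. `KEY_PREFIXES` and `sanitize_redis_key` are hypothetical names, and the example key suffixes are invented for illustration.

```python
# Prefix rewrites from the table above, with the `*` wildcard dropped.
KEY_PREFIXES = {
    "stonks:queue:": "app:queue:",
    "stonks:dedupe:": "app:dedupe:",
    "stonks:ratelimit:": "app:ratelimit:",
    "stonks:trading:circuit_breaker:": "app:execution:circuit_breaker:",
    "stonks:dedupe:trading:": "app:dedupe:execution:",
}


def sanitize_redis_key(key: str) -> str:
    """Rewrite a key using the longest matching source prefix."""
    for source in sorted(KEY_PREFIXES, key=len, reverse=True):
        if key.startswith(source):
            return KEY_PREFIXES[source] + key[len(source):]
    return key  # no known prefix: leave the key unchanged


print(sanitize_redis_key("stonks:dedupe:trading:abc"))  # hypothetical key suffix
```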
#### MinIO Buckets
| Source Bucket | Sanitized Bucket |
|---------------|-----------------|
| `stonks-raw-market` | `app-raw-data` |
| `stonks-raw-news` | `app-raw-content` |
| `stonks-raw-filings` | `app-raw-filings` |
| `stonks-normalized` | `app-normalized` |
| `stonks-llm-prompts` | `app-llm-prompts` |
| `stonks-llm-results` | `app-llm-results` |
#### Database Tables
| Source Table | Sanitized Table |
|-------------|----------------|
| `trading_decisions` | `execution_decisions` |
| `portfolio_snapshots` | `pool_snapshots` |
| `portfolio_pct` (column) | `allocation_pct` |
All other table names (`documents`, `document_intelligence`, `trend_windows`, `recommendations`, etc.) are already domain-neutral and remain unchanged.
#### Adapter and Source Type Names
| Source Term | Sanitized Replacement |
|-------------|----------------------|
| `PolygonNewsAdapter` | `ExternalNewsAdapter` |
| `PolygonMarketAdapter` | `ExternalDataAdapter` |
| `SECEdgarAdapter` | `RegulatoryFilingsAdapter` |
| `AlpacaBrokerAdapter` | `ExecutionAdapter` |
| `broker` (source_type) | `execution_api` |
| `market_api` (source_type) | `data_api` |
| `filings_api` (source_type) | `filings_api` (unchanged — already neutral) |
### Preserved Engineering Terms
The following terms are explicitly preserved because they describe engineering patterns, not financial concepts:
- **circuit breaker** — engineering safety pattern for rate limiting and cascading failure prevention
- **exponential backoff** — retry pattern
- **adapter pattern** — software design pattern (only the domain-specific adapter *names* are sanitized)
- **signal** — used in signal processing and scoring context
- **trend**, **sentiment**, **confidence**, **contradiction**, **evidence** — data analysis terms
- **recency decay**, **credibility weight**, **novelty bonus** — scoring algorithm terms
- **weighted sentiment average** — mathematical computation term
### Preserved Technical Content
All of the following are preserved verbatim (with only the terminology map applied to embedded financial terms):
- Composite signal scoring formula: `combined = gate × recency × credibility × (1 + novelty_bonus) × market_context_multiplier`
- Confidence computation formula with log₂ scaling and four components
- Weighted sentiment average formula
- All threshold values, configuration parameters, and numeric constants
- All Markdown table structures containing technical parameters
- All code module path references (e.g., `services/aggregation/scoring.py`)
- Three-layer signal architecture with weight ratios (1.0, 0.3, 0.2)
- Contradiction detection algorithm and evidence ranking methodology
- All PostgreSQL table structures and column descriptions (with sanitized names where needed)
- All Redis queue patterns and operations (`rpush`/`lpop`/`blpop`)
- All MinIO storage patterns (with sanitized bucket names)
- Ollama as the LLM inference provider
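To make the preserved composite formula concrete, here is a minimal sketch of it. The field names follow the formula's own variables, but the dataclass layout and the input values are assumptions for illustration, not the contents of `services/aggregation/scoring.py`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SignalWeight:
    """Illustrative breakdown of one signal's weight (field layout is assumed)."""

    gate: float
    recency: float
    credibility: float
    novelty_bonus: float
    market_context_multiplier: float  # variable name preserved, not sanitized

    @property
    def combined(self) -> float:
        # combined = gate × recency × credibility × (1 + novelty_bonus) × market_context_multiplier
        return (
            self.gate
            * self.recency
            * self.credibility
            * (1 + self.novelty_bonus)
            * self.market_context_multiplier
        )


# Hypothetical inputs, chosen only to make the arithmetic easy to follow.
weight = SignalWeight(gate=1.0, recency=0.5, credibility=0.8,
                      novelty_bonus=0.25, market_context_multiplier=1.0)
print(round(weight.combined, 6))
```

With these inputs the product is 1.0 × 0.5 × 0.8 × 1.25 × 1.0 = 0.5. The variable `market_context_multiplier` keeps its name in the sanitized docs because it is a code identifier rather than narrative text.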
### Index Page Reframing
The sanitized `index.md` describes the system as an "AI-driven intelligence-to-decision pipeline" that:
1. Ingests data from multiple external data sources
2. Extracts structured intelligence via NLP/LLM
3. Scores and weights signals
4. Aggregates trends across time windows
5. Generates recommendations with quality gates
6. Executes decisions autonomously with safety mechanisms
References to "Stonks Oracle" are replaced with "the platform" or "the system". References to financial-specific APIs (Polygon.io, SEC EDGAR) are replaced with neutral descriptions. The "Related Documentation" section links are updated to use neutral descriptions or removed if they reference financial-specific content.
### Page 06 Reframing
Page 06 undergoes the most extensive reframing since it covers the trading engine. Key changes:
- Title: "Decision Execution" instead of "Trading Decisions and Execution"
- "Trading engine" → "decision execution engine"
- "Pre-trade checks" → "pre-execution checks"
- "Broker adapter" / "Alpaca" → "execution adapter" / "external execution API"
- "Paper trading" → "simulation mode"
- "Live trading" → "live execution mode"
- "Portfolio" → "resource pool" / "allocation pool"
- "Position" → "commitment" / "active commitment"
- "Stop-loss" → "risk threshold"
- "Take-profit" → "gain target"
- All order submission language reframed as "execution request submission"
### Diagram Sanitization
Each Mermaid diagram file receives the same terminology map treatment:
- Node labels containing financial terms are replaced
- Queue name labels (`stonks:queue:*` → `app:queue:*`)
- Bucket name labels (`stonks-raw-market` → `app-raw-data`)
- Table name labels (`trading_decisions` → `execution_decisions`)
- Adapter names in node labels
- Subgraph titles containing financial terms
- The `trading-engine-decision-loop.md` diagram is renamed to `decision-engine-loop.md`
Mermaid syntax, node relationships, subgraph structures, and flow directions are preserved exactly.
## Data Models
This feature produces only documentation files. There are no new data models, database tables, or schema changes.
The sanitized narrative pages reference the same data models as the originals, with terminology-mapped names where applicable:
- **`WeightedSignal`** — document reference + composite weight + sentiment + impact (unchanged)
- **`SignalWeight`** — breakdown of recency, credibility, novelty, confidence gate, market context multiplier (unchanged)
- **`TrendSummary`** — rolling trend for an entity across a time window (unchanged)
- **`Recommendation`** — actionable decision recommendation (reframed from "trade recommendation")
- **`execution_decisions`** table — audit record of every decision evaluation (sanitized from `trading_decisions`)
- **`pool_snapshots`** table — resource pool state snapshots (sanitized from `portfolio_snapshots`)
## Correctness Properties
*A property is a characteristic or behavior that should hold true across all valid executions of a system — essentially, a formal statement about what the system should do. Properties serve as the bridge between human-readable specifications and machine-verifiable correctness guarantees.*
The sanitized documentation set has one key universal property: the complete absence of financial/trading terminology across all output files. This is well-suited to property-based testing because the property must hold for *every* file in the output set, and the banned term list is large enough that systematic checking across all files provides high-value coverage.
### Property 1: Banned Financial Terminology Exclusion
*For any* file in the sanitized documentation set (`docs/sanitized-pipeline-deep-dive/`), the file content shall not contain any term from the comprehensive banned financial terminology list. The banned list includes:
- **Stock ticker symbols**: AAPL, TSLA, NVDA, XOM, META, and all 50 tracked tickers
- **Company names used as financial examples**: Apple, Tesla, NVIDIA
- **Trading action labels**: buy, sell, hold, watch as action labels (BUY, SELL, HOLD, WATCH in uppercase)
- **Financial system terms**: trading engine, paper trading, live trading, paper_eligible, live_eligible, portfolio, portfolio allocation, portfolio heat, portfolio snapshots, broker, Alpaca, broker adapter, broker API, stock market, Wall Street, bullish, bearish, position sizing, stop-loss
- **Financial event terms**: SEC EDGAR, SEC filings, 10-K, 10-Q, 8-K, earnings, earnings call, earnings report
- **Provider names**: Polygon.io, Polygon
- **System names**: Stonks Oracle, stonks
- **Infrastructure patterns containing financial terms**: `stonks:` prefix in Redis keys, `stonks-` prefix in MinIO buckets, `trading_decisions` table name, `portfolio_snapshots` table name

**Validates: Requirements 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.8, 3.9, 3.10, 6.2, 7.1, 7.2, 7.3, 8.1, 8.2**
## Error Handling
Since this is a documentation-only deliverable, there is no runtime error handling to design. The primary quality concerns are:
### Accuracy of Terminology Replacement
Every financial/trading term must be replaced with its domain-neutral equivalent. Missing a single instance of "stonks" in a Redis key pattern or "AAPL" in an example scenario would violate the sanitization requirements. The terminology map defined in the Components section serves as the authoritative reference.
### Preservation of Technical Content
The sanitization must not accidentally remove or alter engineering content. Key risks:
- **Formula corruption**: The composite weight formula contains `market_context_multiplier` — the word "market" must not be blindly replaced since it's part of a technical variable name
- **Code path corruption**: Module paths like `services/trading/engine.py` contain "trading" — these paths reference actual files and must be preserved as-is (the code files are not being renamed)
- **Table name corruption**: Database table names like `trading_decisions` are sanitized in narrative text, and any SQL or inline code reference to the original table name must be rewritten to the same sanitized name so that the two never diverge
**Design decision**: Code module paths (e.g., `services/trading/engine.py`) are preserved exactly as they appear in the source, since they reference actual files in the repository. Only narrative references to concepts (e.g., "the trading engine") are sanitized. Variable names within formulas and code blocks are preserved. Database table names are sanitized in narrative descriptions and table listings, and inline code references to a table use the sanitized name as well.
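One way to honor this decision when using any find-and-replace aid is to split the text on protected tokens and rewrite only the segments between them. The sketch below assumes module paths look like `services/<...>.py`. The names `PROTECTED`, `sanitize_outside_protected`, and `naive_replace` are hypothetical.

```python
import re

# Tokens that must survive untouched. The path pattern is an assumption
# about how module paths appear in the source docs.
PROTECTED = re.compile(r"(services/[\w/]+\.py|market_context_multiplier)")


def sanitize_outside_protected(text: str, replace) -> str:
    """Apply `replace` to narrative segments only, leaving protected tokens as-is."""
    parts = PROTECTED.split(text)
    # With one capturing group, re.split places the protected matches at odd indices.
    return "".join(
        part if index % 2 == 1 else replace(part)
        for index, part in enumerate(parts)
    )


def naive_replace(segment: str) -> str:
    # Deliberately blunt stand-in for a real terminology pass.
    return segment.replace("trading", "execution")


line = "The trading engine lives in `services/trading/engine.py`."
print(sanitize_outside_protected(line, naive_replace))
# prints: The execution engine lives in `services/trading/engine.py`.
```

Without the split, the blunt replacement would have produced `services/execution/engine.py`, a path that does not exist in the repository.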
### Cross-Reference Integrity
All internal links must resolve to files that exist in the sanitized output:
- Page-to-page links must use sanitized filenames
- Diagram links must use sanitized diagram filenames
- No links should point back to the source `docs/intelligence-pipeline-deep-dive/` directory
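The three rules above can be checked mechanically. The following is a minimal sketch, not the project's test code: `broken_links` is a hypothetical helper, and it only handles inline Markdown links of the form `[text](target)`.

```python
import re
from pathlib import Path

LINK = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")
SCHEME = re.compile(r"[a-z][a-z0-9+.-]*:", re.IGNORECASE)


def broken_links(root: Path) -> list[tuple[str, str]]:
    """Return (file, target) pairs for relative links that do not resolve under root."""
    root = root.resolve()
    problems = []
    for md in sorted(root.rglob("*.md")):
        for target in LINK.findall(md.read_text(encoding="utf-8")):
            if SCHEME.match(target) or target.startswith("#"):
                continue  # external URL or same-page anchor
            resolved = (md.parent / target.split("#", 1)[0]).resolve()
            # A link is valid only if it names a file inside the sanitized tree.
            if not resolved.is_file() or root not in resolved.parents:
                problems.append((str(md.relative_to(root)), target))
    return problems
```

Because the resolved target must sit under `root`, a link that points back into `docs/intelligence-pipeline-deep-dive/` is reported even when that file exists.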
## Testing Strategy
### Why Limited PBT Applies
This is a documentation-only deliverable — the output is static Markdown files, not executable code with functions and data transformations. However, one universal property (banned term exclusion) is well-suited to property-based testing because it must hold across all files and involves checking a large set of terms against file content.
Most other requirements (structural checks, content preservation, narrative reframing) are better verified through example-based tests and manual review.
### Property-Based Tests
- **Library**: Hypothesis (Python, already in the project)
- **Configuration**: `@settings(max_examples=100)`
- **Property 1 implementation**: Generate random selections from the banned term list and random file selections from the sanitized docs, verify the term does not appear in the file content. Alternatively, exhaustively check all banned terms against all files (since the file set is small and fixed, this is more practical as an exhaustive example-based test).
**Practical note**: Given the small, fixed file set (13 files: the index, six pages, and six diagrams), the banned term exclusion property is most practically implemented as an exhaustive check that iterates all files × all banned terms, rather than as a randomized property test. This provides complete coverage rather than probabilistic coverage.
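A minimal sketch of the exhaustive check is shown below. `BANNED_TERMS` is a short excerpt of the Property 1 list and `find_banned_terms` is a hypothetical helper. Matching here is case-insensitive and bounded by non-letters, which is an assumption. The uppercase action labels (BUY, SELL, HOLD, WATCH) would need a separate case-sensitive check.

```python
import re
from pathlib import Path

# Short excerpt of the banned list from Property 1; the real list is longer.
BANNED_TERMS = ["stonks", "AAPL", "bullish", "bearish", "portfolio", "Alpaca", "paper trading"]


def find_banned_terms(root: Path, terms=BANNED_TERMS) -> list[tuple[str, str]]:
    """Check every Markdown file under root against every banned term."""
    hits = []
    for md in sorted(root.rglob("*.md")):
        content = md.read_text(encoding="utf-8")
        for term in terms:
            # Letters on either side block a match; `:`, `-`, and `_` do not,
            # so `stonks:queue` and `portfolio_snapshots` are still caught.
            pattern = rf"(?<![A-Za-z]){re.escape(term)}(?![A-Za-z])"
            if re.search(pattern, content, re.IGNORECASE):
                hits.append((md.name, term))
    return hits
```

In a test, the whole property reduces to asserting that `find_banned_terms(Path("docs/sanitized-pipeline-deep-dive"))` returns an empty list.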
### Example-Based Tests
1. **File structure verification**: Verify all expected files exist at the correct paths
2. **Cross-reference integrity**: Parse all sanitized files, extract markdown links, verify they resolve to existing sanitized files
3. **Mermaid syntax validation**: Verify each diagram file contains valid Mermaid `flowchart` declarations
4. **Technical content preservation**: Spot-check that key formulas, threshold values, and code module paths are present in the sanitized docs
5. **Terminology replacement verification**: Spot-check that key replacements appear (e.g., "decision execution engine" replaces "trading engine")
6. **Index page framing**: Verify the index describes the system as an "AI-driven intelligence-to-decision pipeline"
7. **Database table sanitization**: Verify `execution_decisions` appears where `trading_decisions` was, and `pool_snapshots` where `portfolio_snapshots` was
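As an example of how light these checks can be, the Mermaid check (item 3) might look like the sketch below. It verifies only that each Mermaid block opens with a `flowchart` declaration and a direction; it does not parse the diagram. `has_flowchart_declaration` is a hypothetical helper name.

```python
import re

FENCE = "`" * 3  # built at runtime so this example can itself sit inside a fenced block
MERMAID_BLOCK = re.compile(FENCE + r"mermaid\s*\n(.*?)" + FENCE, re.DOTALL)
DECLARATION = re.compile(r"\s*flowchart\s+(TB|TD|BT|LR|RL)\b")


def has_flowchart_declaration(markdown: str) -> bool:
    """True if the text has at least one mermaid block and each starts with `flowchart`."""
    blocks = MERMAID_BLOCK.findall(markdown)
    return bool(blocks) and all(DECLARATION.match(block) for block in blocks)


sample = FENCE + "mermaid\nflowchart LR\n    A --> B\n" + FENCE
print(has_flowchart_declaration(sample))
```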
### Manual Review
- Narrative coherence and readability of the sanitized content
- Consistency of domain-neutral framing across all pages
- Quality of example scenario replacements (e.g., "bearish article about AAPL" → "negative-sentiment article about Entity-A")
- Preservation of page-to-page transition flow