Design Document: Comprehensive Quality & Documentation
Overview
This design covers three pillars for the Stonks Oracle platform:
- Test Coverage: Close unit test gaps in the scheduler and ingestion services, fix pre-existing test failures in the extractor module, and achieve a fully green test suite (Requirements 1–4).
- Docker Deployment: Extend docker-compose.yml to include all 13 application services plus the frontend, enabling full-platform local development without Kubernetes (Requirement 5).
- Documentation: Produce comprehensive documentation covering per-service features, API references, Helm chart configuration, Docker deployment, three Mermaid architecture diagrams, AI agent building, backup/restore, observability, and README resource links (Requirements 6–16).
Design Rationale
The platform has mature production code across 13 services but uneven test coverage and documentation. The scheduler and ingestion services lack dedicated unit tests — their logic is only exercised through integration tests. Four extractor-related test files have pre-existing failures that block CI. Documentation exists only as a local dev setup guide, a pipeline overview, and a runbook. This initiative fills those gaps systematically.
The approach prioritizes:
- Test isolation: Mock all external dependencies (PostgreSQL, Redis, MinIO, Ollama) so unit tests run fast and deterministically.
- Documentation from source: Generate API references by inspecting actual FastAPI route definitions, Helm values from values.yaml, and metrics from services/shared/metrics.py.
- Docker parity with Kubernetes: Mirror the Helm chart's service definitions in Docker Compose so both deployment modes stay in sync.
Architecture
The work does not change the platform's runtime architecture. It adds:
- New test files in tests/ for scheduler and ingestion unit tests.
- Fixes to existing test files and/or production code to resolve failures.
- New service definitions in docker-compose.yml using the existing docker/Dockerfile with SERVICE_CMD build args.
- New documentation files in docs/ organized by topic.
- Updated README.md with a documentation index and Mermaid diagram.
graph TD
subgraph "Test Coverage (Reqs 1-4)"
T1[tests/test_scheduler_unit.py]
T2[tests/test_ingestion_unit.py]
T3[Fix test_extractor_prompts.py]
T4[Fix test_extractor_schemas.py]
T5[Fix test_ollama_client.py]
T6[Fix test_filings_adapter.py]
end
subgraph "Docker (Req 5)"
D1[docker-compose.yml<br/>+ 13 app services + frontend]
end
subgraph "Documentation (Reqs 6-16)"
DOC1[docs/services.md]
DOC2[docs/api-reference.md]
DOC3[docs/helm-reference.md]
DOC4[docs/docker-deployment.md]
DOC5[docs/architecture-kubernetes.md]
DOC6[docs/architecture-docker-compose.md]
DOC7[docs/architecture-data-pipeline.md]
DOC8[docs/ai-agents.md]
DOC9[docs/backup-restore.md]
DOC10[docs/observability.md]
DOC11[README.md update]
end
Components and Interfaces
1. Scheduler Unit Tests (Requirement 1)
Target module: services/scheduler/app.py
Functions to test in isolation:
- get_cadence_for_source(source_type, config): Returns polling interval from config or defaults.
- compute_backoff(retry_count): Exponential backoff with cap.
- is_source_due(...): Core scheduling logic; determines if a source needs polling based on last run status, timing, retry state.
- build_job_payload(source, aliases, now): Constructs the ingestion job dict.
- schedule_cycle(pool, rds): Full scheduling pass (mocked DB/Redis).
- check_rate_limit(rds, source_type, now): Rate limiting with per-type and global Polygon limits.
- recover_stale_documents(pool, rds): Re-enqueue orphaned parsed documents.
- retry_failed_extractions(pool, rds): Re-enqueue failed extractions.
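As an illustration of the kind of pure function these tests target, here is a minimal sketch of capped exponential backoff together with an example-based test. The function name, base delay, cap, and formula are assumptions made for this example; the real compute_backoff in services/scheduler/app.py may use different values.

```python
def compute_backoff_sketch(retry_count: int, base_seconds: int = 60, cap_seconds: int = 3600) -> int:
    """Hypothetical stand-in for compute_backoff: doubles per retry, never exceeds the cap."""
    if retry_count < 0:
        raise ValueError("retry_count must be non-negative")
    # Bound the exponent so very large retry counts do not build huge integers.
    exponent = min(retry_count, 32)
    return min(base_seconds * (2 ** exponent), cap_seconds)


def test_backoff_grows_then_caps():
    assert compute_backoff_sketch(0) == 60
    assert compute_backoff_sketch(1) == 120
    assert compute_backoff_sketch(3) == 480
    # 60 * 2**6 = 3840 exceeds the cap, so the cap wins.
    assert compute_backoff_sketch(6) == 3600
    assert compute_backoff_sketch(50) == 3600
```

Testing both the growth region and the capped region keeps the test meaningful even if the constants change later.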
Mocking strategy:
- asyncpg.Pool → AsyncMock with .fetch(), .fetchrow(), .fetchval(), .execute() returning canned records.
- redis.asyncio.Redis → AsyncMock with .rpush(), .set(), .get(), .incr(), .expire(), .decr(), .delete() tracking calls.
- Use unittest.mock.patch for module-level imports where needed.
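The mocking strategy above could look roughly like the following. This is a sketch under stated assumptions: enqueue_due_sources is a hypothetical stand-in for a scheduling pass (it is not the real schedule_cycle), the SQL text and row shape are invented for the example, and asyncio.run is used instead of pytest-asyncio so the snippet has no third-party dependencies.

```python
import asyncio
from unittest.mock import AsyncMock


async def enqueue_due_sources(pool, rds, queue="stonks:queue:ingestion"):
    """Hypothetical scheduling pass: fetch sources, push one job per source."""
    rows = await pool.fetch("SELECT id, source_type FROM sources")
    for row in rows:
        await rds.rpush(queue, str(row["id"]))
    return len(rows)


def make_mocks(rows):
    # AsyncMock turns every attribute into an awaitable mock, so .fetch() and
    # .rpush() can be awaited and their calls inspected afterwards.
    pool = AsyncMock()
    pool.fetch.return_value = rows
    rds = AsyncMock()
    return pool, rds


def test_enqueues_one_job_per_source():
    rows = [{"id": 1, "source_type": "news"}, {"id": 2, "source_type": "filings"}]
    pool, rds = make_mocks(rows)
    assert asyncio.run(enqueue_due_sources(pool, rds)) == 2
    assert rds.rpush.await_count == 2
    rds.rpush.assert_any_await("stonks:queue:ingestion", "1")


def test_no_sources_means_no_pushes():
    pool, rds = make_mocks([])
    assert asyncio.run(enqueue_due_sources(pool, rds)) == 0
    rds.rpush.assert_not_awaited()
```

Because neither mock opens a connection, tests of this shape stay fast and deterministic, which is the goal of the test isolation principle.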
Test file: tests/test_scheduler_unit.py
2. Ingestion Unit Tests (Requirement 2)
Target module: services/ingestion/worker.py
Functions to test:
- process_job(job, pool, rds, minio_client, adapters): Main job processing with various adapter outcomes.
- Error handling paths: adapter returns AdapterResult(error=...), retry exhaustion, dead-letter routing.
- Deduplication: content hash already seen in Redis, cross-source document dedup via dedupe_items.
Mocking strategy:
- Adapters → AsyncMock returning AdapterResult with controlled error, items, content_hash, raw_payload.
- asyncpg.Pool → AsyncMock for ingestion_runs INSERT/UPDATE, persist_ingestion_items, record_retrieval_failure.
- redis.asyncio.Redis → AsyncMock for dedupe checks, queue pushes, DLQ routing.
- minio.Minio → MagicMock for upload_raw_artifact.
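A test for the adapter error path might be shaped like the sketch below. Everything here is hypothetical scaffolding: AdapterResultSketch and process_job_sketch stand in for the real AdapterResult and process_job, the adapter's fetch method name, the retry limit of 3, and the DLQ key stonks:dlq:ingestion are all assumptions chosen for illustration.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Optional
from unittest.mock import AsyncMock


@dataclass
class AdapterResultSketch:
    """Hypothetical stand-in for AdapterResult; the real class carries more fields."""
    items: list = field(default_factory=list)
    error: Optional[str] = None


async def process_job_sketch(job, rds, adapter, max_retries=3):
    """Hypothetical retry/DLQ flow: requeue on error until retries run out."""
    result = await adapter.fetch(job)
    if result.error is None:
        return "ok"
    if job["retry_count"] + 1 >= max_retries:
        await rds.rpush("stonks:dlq:ingestion", job["id"])
        return "dead_lettered"
    await rds.rpush("stonks:queue:ingestion", job["id"])
    return "retried"


def run_with_error(retry_count):
    # The adapter mock always reports a failure; only the retry count varies.
    adapter = AsyncMock()
    adapter.fetch.return_value = AdapterResultSketch(error="HTTP 503")
    rds = AsyncMock()
    job = {"id": "job-1", "retry_count": retry_count}
    return asyncio.run(process_job_sketch(job, rds, adapter)), rds


def test_error_is_retried_before_exhaustion():
    outcome, rds = run_with_error(retry_count=0)
    assert outcome == "retried"
    rds.rpush.assert_awaited_once_with("stonks:queue:ingestion", "job-1")


def test_error_goes_to_dlq_after_exhaustion():
    outcome, rds = run_with_error(retry_count=2)
    assert outcome == "dead_lettered"
    rds.rpush.assert_awaited_once_with("stonks:dlq:ingestion", "job-1")
```

Asserting on the exact queue key passed to rpush is what distinguishes a retry from dead-letter routing in a test of this kind.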
Test file: tests/test_ingestion_unit.py
3. Extractor Test Fixes (Requirement 3)
Target files:
- tests/test_extractor_prompts.py
- tests/test_extractor_schemas.py
- tests/test_ollama_client.py
- tests/test_filings_adapter.py
Approach: Run each file individually, diagnose failures, and fix either the test setup (mock configuration, fixture data) or the production code. Preserve original test intent and assertions. If production code changes are needed, add regression tests.
4. Full Test Suite Green (Requirement 4)
Verification: Run pytest tests/ -x --tb=short -q and ruff check services/ after all fixes. All existing test_pbt_* files must remain passing. Any production code fix must include a regression test.
5. Docker Compose Application Services (Requirement 5)
Current state: docker-compose.yml defines 8 infrastructure services (postgres, redis, minio, minio-init, ollama, trino, hive-metastore, superset).
Addition: 14 new service definitions (13 app services + frontend dashboard):
| Service | Image Build | Command | Port | Depends On |
|---|---|---|---|---|
| scheduler | docker/Dockerfile.scheduler | python -m services.scheduler.app | - | postgres, redis |
| symbol-registry | docker/Dockerfile | uvicorn services.symbol_registry.app:app --host 0.0.0.0 --port 8000 | 8001:8000 | postgres |
| ingestion | docker/Dockerfile | python -m services.ingestion.worker | - | postgres, redis, minio |
| parser | docker/Dockerfile | python -m services.parser.worker | - | postgres, redis |
| extractor | docker/Dockerfile | python -m services.extractor.main | - | postgres, redis, ollama |
| aggregation | docker/Dockerfile | python -m services.aggregation.main | - | postgres, redis |
| recommendation | docker/Dockerfile | python -m services.recommendation.main | - | postgres, redis |
| trading-engine | docker/Dockerfile | uvicorn services.trading.app:app --host 0.0.0.0 --port 8000 | 8002:8000 | postgres, redis |
| risk-engine | docker/Dockerfile | uvicorn services.risk.app:app --host 0.0.0.0 --port 8000 | 8003:8000 | postgres |
| broker-adapter | docker/Dockerfile | python -m services.adapters.broker_service | - | postgres, redis |
| lake-publisher | docker/Dockerfile | python -m services.lake_publisher.jobs | - | postgres, minio |
| query-api | docker/Dockerfile | uvicorn services.api.app:app --host 0.0.0.0 --port 8000 | 8004:8000 | postgres, redis, minio |
| dashboard | frontend/Dockerfile | nginx (built-in) | 3000:8080 | query-api |
Common environment block (shared via x-app-env YAML anchor):
POSTGRES_HOST: postgres
POSTGRES_PORT: "5432"
POSTGRES_DB: stonks
POSTGRES_USER: stonks
POSTGRES_PASSWORD: stonks_dev
REDIS_HOST: redis
REDIS_PORT: "6379"
MINIO_ENDPOINT: minio:9000
MINIO_ACCESS_KEY: minioadmin
MINIO_SECRET_KEY: minioadmin
OLLAMA_BASE_URL: http://ollama:11434
.env file support: MARKET_DATA_API_KEY, BROKER_API_KEY, BROKER_API_SECRET, BROKER_BASE_URL loaded via env_file: .env on services that need them (ingestion, broker-adapter, trading-engine).
Health checks: FastAPI services use curl -f http://localhost:8000/health; workers use process liveness checks. Infrastructure depends_on uses condition: service_healthy.
6. Documentation Structure (Requirements 6–16)
All documentation files are Markdown in docs/. The structure:
docs/
├── services.md # Req 6: Per-service feature docs
├── api-reference.md # Req 7: All 4 FastAPI API references
├── helm-reference.md # Req 8: Helm chart values reference
├── docker-deployment.md # Req 9: Docker deployment guide
├── architecture-kubernetes.md # Req 10: K8s Mermaid diagram
├── architecture-docker-compose.md # Req 11: Docker Compose Mermaid diagram
├── architecture-data-pipeline.md # Req 12: Data pipeline Mermaid diagram
├── ai-agents.md # Req 13: AI agent building guide
├── backup-restore.md # Req 14: Backup and restore guide
├── observability.md # Req 15: Observability & metrics reference
├── LOCAL_DEV_SETUP.md # (existing)
├── llm-to-trade-pipeline.md # (existing)
└── notes/
└── runbook.md # (existing)
6a. Service Feature Documentation (docs/services.md) — Req 6
For each of the 13 services, document:
- Purpose: What the service does in the pipeline.
- Entry point: Module path (e.g., services.scheduler.app).
- Configuration: Environment variables from services/shared/config.py relevant to this service.
- Database tables: Tables read/written by this service.
- Redis queues: Queue names consumed from and published to (from services/shared/redis_keys.py).
- Queue message schema: JSON structure of messages.
- Signal layers: For aggregation/recommendation, document the three signal layers (company, macro, competitive), their toggles (macro_enabled, competitive_enabled in risk_configs), and weight configurations.
- Trading engine features: For the trading service, document position sizing, circuit breakers, reserve pool, risk tier auto-adjustment, backtesting, and notification configuration.
Queue topology reference (from redis_keys.py):
| Queue | Producer | Consumer |
|---|---|---|
| stonks:queue:ingestion | scheduler | ingestion |
| stonks:queue:parsing | ingestion | parser |
| stonks:queue:extraction | parser | extractor |
| stonks:queue:macro_classification | parser, scheduler | extractor |
| stonks:queue:aggregation | extractor | aggregation |
| stonks:queue:recommendation | aggregation | recommendation |
| stonks:queue:lake_publish | various | lake-publisher |
| stonks:queue:broker_orders | trading-engine, trading API | broker-adapter |
| stonks:queue:trading_decisions | recommendation | trading-engine |
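One way to keep the diagrams and service docs consistent with this table is to treat the topology as data and derive downstream paths from it. The sketch below transcribes the table rows (omitting stonks:queue:lake_publish, whose producers are listed only as "various"); the QUEUE_TOPOLOGY name and the downstream_services helper are hypothetical and do not exist in redis_keys.py.

```python
# Queue topology transcribed from the table: queue -> (producers, consumer).
QUEUE_TOPOLOGY = {
    "stonks:queue:ingestion": (["scheduler"], "ingestion"),
    "stonks:queue:parsing": (["ingestion"], "parser"),
    "stonks:queue:extraction": (["parser"], "extractor"),
    "stonks:queue:macro_classification": (["parser", "scheduler"], "extractor"),
    "stonks:queue:aggregation": (["extractor"], "aggregation"),
    "stonks:queue:recommendation": (["aggregation"], "recommendation"),
    "stonks:queue:trading_decisions": (["recommendation"], "trading-engine"),
    "stonks:queue:broker_orders": (["trading-engine", "trading API"], "broker-adapter"),
}


def downstream_services(start):
    """Return every service reachable from `start` by following queues, in BFS order."""
    seen, order, frontier = {start}, [], [start]
    while frontier:
        current = frontier.pop(0)
        for producers, consumer in QUEUE_TOPOLOGY.values():
            if current in producers and consumer not in seen:
                seen.add(consumer)
                order.append(consumer)
                frontier.append(consumer)
    return order


print(downstream_services("aggregation"))
```

A check like this can flag a queue that is documented with a consumer no diagram shows, or a service that nothing feeds.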
6b. API Reference (docs/api-reference.md) — Req 7
Document all endpoints from the four FastAPI services by inspecting their route definitions:
Query API (services/api/app.py): more than 40 endpoints covering companies, documents, trends, recommendations, evidence drill-down, orders, positions, portfolio, global events, macro impacts, competitive signals, trend projections, agents, dead-letter queues, pipeline control, SQL explorer, saved queries, audit trail, DevOps metrics, and Prometheus metrics.
Symbol Registry API (services/symbol_registry/app.py): Companies CRUD, aliases, watchlists, sources, exposure profiles, competitor relationships, competitor inference.
Trading API (services/trading/app.py): Health/readiness, engine status, config update, pause/resume, reset, decisions audit, performance metrics/history, backtesting, notifications config/history, override orders, debug state.
Risk API (services/risk/app.py): Order evaluation (POST /evaluate), health, pending approvals, approval review, approval expiration.
For each endpoint: method, path, query parameters (type, default, constraints), request body schema, response schema, error codes (4xx/5xx).
6c. Helm Chart Reference (docs/helm-reference.md) — Req 8
Document from infra/helm/stonks-oracle/values.yaml:
- image block: registry, pullPolicy, tag
- pipelineEnabled: toggle and effect on worker replicas
- services block: per-service structure (replicas, image, command, tier, port, secrets, resources, probes)
- config block: all ConfigMap environment variables with defaults and descriptions
- secrets block: core, broker, market, gmail, dashboard; injection via --set flags
- ingress block: className, clusterIssuer, host mappings
- Analytics stack: trino, hiveMetastore, superset toggles and resources
- networkPolicies.enabled: default-deny-ingress behavior
- Value override files: values-beta.yaml, values-paper.yaml and their deployment stages
6d. Docker Deployment Guide (docs/docker-deployment.md) — Req 9
- Complete service inventory with images, ports, volumes, environment variables
- .env file format with all required/optional variables
- Volume mounts and data persistence (pgdata, miniodata, ollama_models, hive_data, superset_data)
- Health check configurations
- Dockerfile build arguments (SERVICE_CMD)
- Operational commands: start, stop, restart, logs, scale, reset (docker compose down -v)
6e. Architecture Diagrams (Reqs 10–12)
Kubernetes diagram (docs/architecture-kubernetes.md):
- stonks-oracle namespace with all 13 services grouped by tier (api, processing, trading, orchestration, analytics, frontend)
- External cluster services in their namespaces (postgresql-service, redis-service, minio-service, ollama-service)
- Traefik ingress routes to external domains
- Network policy boundaries
- Analytics plane (Trino, Hive Metastore, Superset)
- Helm-managed secrets (core, broker, market, gmail) with consumer mapping
- Service tier distinction (API with ingress, pipeline workers, trading)
Docker Compose diagram (docs/architecture-docker-compose.md):
- All infrastructure + application containers
- Host port mappings
- depends_on relationships and health check dependencies
- Named volumes and mount points
- .env file providing API keys
- Internal Docker network connectivity
Data Pipeline diagram (docs/architecture-data-pipeline.md):
- External sources → ingestion → parsing → extraction → aggregation → recommendation → risk → trading → broker
- Redis queue topology with queue names
- Three signal layers as distinct paths merging at aggregation
- Data stores at each stage (MinIO, PostgreSQL, Redis)
- Trading engine decision loop
- Analytical branch (lake publisher → MinIO/Parquet → Trino → Superset/Dashboard)
- External integrations (Ollama, Alpaca, AWS SNS, Gmail)
6f. AI Agent Guide (docs/ai-agents.md) — Req 13
- Three built-in agents: document-extractor, event-classifier, thesis-rewriter
- Per-agent: purpose, input data, output schema, default model, system prompt structure, user prompt template
- ai_agents table schema and registration (system-seeded vs API-created)
- agent_variants table: create, activate, deactivate variants for A/B testing
- AgentConfigResolver module: TTL cache (60s default), COALESCE-based variant override, fallback behavior
- Performance logging: agent_performance_log table, querying for variant comparison
- API endpoints: CRUD on /api/agents, test endpoint /api/agents/{id}/test
- Step-by-step guide: creating a new variant with different model/prompt and activating it
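To make the resolver behavior concrete for the guide, the following sketch shows a TTL cache combined with a COALESCE-style override, where the active variant's value is used when present and the agent default otherwise. It is an illustration only: TTLCacheSketch, resolve_model, and the model names are hypothetical and are not taken from the AgentConfigResolver module; only the 60-second default TTL comes from this design.

```python
import time
from typing import Callable, Optional


class TTLCacheSketch:
    """Hypothetical TTL cache: entries expire ttl_seconds after being stored."""

    def __init__(self, ttl_seconds: float = 60.0, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so tests can control time
        self._entries = {}

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self._clock() - stored_at >= self._ttl:
            del self._entries[key]  # expired: drop it and report a miss
            return None
        return value

    def put(self, key, value):
        self._entries[key] = (self._clock(), value)


def resolve_model(cache, agent_name: str, variant_model: Optional[str], default_model: str) -> str:
    """Return the cached value if fresh; otherwise prefer the variant, else the default."""
    cached = cache.get(agent_name)
    if cached is not None:
        return cached
    resolved = variant_model if variant_model is not None else default_model
    cache.put(agent_name, resolved)
    return resolved
```

One consequence worth documenting for users: with a cache like this, a newly activated variant only takes effect after the cached entry expires, so a change can lag by up to one TTL.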
6g. Backup & Restore Guide (docs/backup-restore.md) — Req 14
Scripts in scripts/:
- backup-db.sh: PostgreSQL dump, CLI args, storage location, retention (keeps last 7)
- restore-db.sh: PostgreSQL restore, service scale-down/up, data loss implications
- backup-redis.sh: Redis RDB snapshot backup
- backup.sh: Combined backup (DB + Redis), --upload-minio option
- restore.sh: Combined restore
- Full nuke-and-rebuild procedure (connection termination, DB drop, Redis flush, redeploy, re-seed)
- Recommended backup schedules and automation (cron, Kubernetes CronJobs)
6h. Observability Reference (docs/observability.md) — Req 15
- /metrics endpoint on query-api, Prometheus scrape configuration
- All metrics from services/shared/metrics.py:
  - Ingestion: stonks_ingestion_jobs_total, stonks_ingestion_items_fetched_total, stonks_ingestion_items_new_total, stonks_ingestion_items_deduped_total, stonks_ingestion_errors_total, stonks_ingestion_adapter_duration_seconds
  - Parsing: stonks_parse_jobs_total, stonks_parse_quality_score, stonks_parse_low_quality_total, stonks_parse_duration_seconds
  - Extraction: stonks_extraction_jobs_total, stonks_extraction_attempts_total, stonks_extraction_retries_total, stonks_extraction_duration_seconds, stonks_extraction_confidence, stonks_extraction_validation_errors_total, stonks_extraction_tokens_total
  - Aggregation: stonks_aggregation_windows_total, stonks_aggregation_signals_total, stonks_aggregation_contradiction_score, stonks_aggregation_duration_seconds
  - Recommendation: stonks_recommendations_total, stonks_recommendations_suppressed_total, stonks_recommendation_confidence
  - Lake: stonks_lake_facts_published_total, stonks_lake_publish_duration_seconds, stonks_lake_publish_errors_total, stonks_lake_publish_bytes_total
  - Trading: stonks_orders_submitted_total, stonks_orders_rejected_total, stonks_orders_filled_total, stonks_orders_duplicates_prevented_total, stonks_risk_evaluations_total, stonks_risk_check_failures_total, stonks_positions_synced_total
  - Alerting: stonks_alerts_fired_total, stonks_alerts_resolved_total, stonks_alert_check_duration_seconds, stonks_alert_active
  - DLQ: stonks_dlq_items_total, stonks_dlq_replayed_total, stonks_dlq_depth
  - Active: stonks_active_jobs
- Alerting module (services/shared/alerting.py): 4 alert rules (source_failures, schema_failure_spike, analytical_lag, broker_issues), thresholds, evaluation windows, ConfigMap variables
- Structured JSON logging format, trace context (trace_id, span_id)
- Dead-letter queue system: queue names (stonks:dlq:<queue>), routing, replay tooling
- Recommended Prometheus/Grafana queries
6i. README Update — Req 16
- Add "Documentation" section with links to all docs
- Replace ASCII architecture diagram with Mermaid or link to diagram docs
- Preserve all existing content (license, features, tech stack, project structure, deployment)
Data Models
No new database tables or schema changes are introduced. This initiative works with existing tables:
Tables referenced in test coverage work:
- sources, companies, company_aliases: scheduler source polling
- ingestion_runs: scheduler run tracking, ingestion job recording
- documents, document_company_mentions: ingestion persistence, stale document recovery
- document_intelligence, document_impact_records: extractor test fixtures
- model_performance_metrics: extractor schema validation metrics
Tables documented (not modified):
- All tables listed above plus trend_windows, trend_history, trend_projections, recommendations, recommendation_evidence, risk_evaluations, orders, order_events, positions, portfolio_snapshots, trading_decisions, circuit_breaker_events, reserve_pool_ledger, risk_tier_history, backtest_runs, backtest_trades, notifications, global_events, macro_impact_records, exposure_profiles, competitor_relationships, competitive_signal_records, ai_agents, agent_variants, agent_performance_log, audit_events, watchlists, watchlist_members, retention_policies, market_snapshots
Error Handling
Test Coverage
- Mock failures: Unit tests must verify that scheduler and ingestion services handle database/Redis connection failures gracefully (no crashes, proper logging).
- Adapter errors: Ingestion unit tests must verify retry logic with exponential backoff and dead-letter queue routing after retry exhaustion.
- Test fix approach: When fixing pre-existing failures, prefer fixing test setup over changing production code. If production code changes are needed, add regression tests to prevent re-introduction.
Docker Compose
- Health check failures: Application services use depends_on with condition: service_healthy to wait for infrastructure. Health checks have interval, timeout, retries, and start_period configured.
- Missing .env file: Services that need API keys (ingestion, broker-adapter, trading-engine) will start but log warnings about missing keys. The platform runs in a degraded mode without external API access.
- Build failures: Each service uses the same base Dockerfile with SERVICE_CMD build arg. Build errors are isolated per service.
Documentation
- Stale documentation: Documentation is generated from source code inspection. If the codebase changes after documentation is written, the docs may drift. The README links section serves as a single index to find and update docs.
- Diagram accuracy: Mermaid diagrams are hand-authored based on current architecture. They should be updated when services are added or removed.
Testing Strategy
PBT Applicability Assessment
Property-based testing is NOT applicable to this feature. The work consists of:
- Unit tests for existing services — These are example-based tests with mocked dependencies, not pure functions with universal properties.
- Fixing pre-existing test failures — Bug fixes to existing tests/code.
- Docker Compose configuration — Declarative infrastructure configuration.
- Documentation — Markdown files with no executable logic.
None of these involve new pure functions, parsers, serializers, or business logic where PBT would add value. The existing test_pbt_* files (22 files covering trading, aggregation, competitive intelligence, etc.) already provide PBT coverage for the platform's core logic and must remain passing.
Unit Testing Strategy
New test files:
- tests/test_scheduler_unit.py: 8+ test cases covering all scheduler pure functions and the schedule_cycle orchestration with mocked dependencies.
- tests/test_ingestion_unit.py: 6+ test cases covering adapter error handling, retry logic, deduplication, and dead-letter queue routing.
Test fix files (existing, to be repaired):
- tests/test_extractor_prompts.py
- tests/test_extractor_schemas.py
- tests/test_ollama_client.py
- tests/test_filings_adapter.py
Test framework: pytest + pytest-asyncio (already configured in the project).
Mocking approach: unittest.mock.AsyncMock for async dependencies, unittest.mock.MagicMock for sync dependencies, unittest.mock.patch for module-level state.
Verification Criteria
- pytest tests/ -x --tb=short -q → zero failures
- ruff check services/ → zero violations
- All 22 existing test_pbt_* files pass unchanged
- docker compose config validates the updated docker-compose.yml
- All documentation files render valid Markdown with working internal links
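The last criterion, working internal links, has no tool named in this design. One possible approach is a small script like the sketch below, which scans Markdown files for inline links and reports relative targets that do not exist on disk. The function name and regular expression are assumptions for illustration; it does not validate heading anchors or reference-style links.

```python
import re
from pathlib import Path

# Matches inline Markdown links and captures the target: [text](target)
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")
# Matches targets that start with a URL scheme such as https: or mailto:
SCHEME_RE = re.compile(r"^[a-z][a-z0-9+.-]*:", re.IGNORECASE)


def broken_internal_links(root: Path) -> list:
    """Return (file, target) pairs for relative links whose target file is missing."""
    broken = []
    for md_file in sorted(root.rglob("*.md")):
        text = md_file.read_text(encoding="utf-8")
        for target in LINK_RE.findall(text):
            if SCHEME_RE.match(target) or target.startswith("#"):
                continue  # skip external URLs and same-page anchors
            path_part = target.split("#", 1)[0]  # ignore any trailing #anchor
            if not (md_file.parent / path_part).exists():
                broken.append((md_file.relative_to(root).as_posix(), target))
    return broken
```

Running it from the repository root, for example broken_internal_links(Path(".")), would cover both README.md and everything under docs/; an empty list means every relative link resolved.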