Celes Renata 88ad1e8d99 feat: comprehensive docs, unit tests, docker-compose app services

- Add scheduler and ingestion unit tests (test_scheduler_unit.py, test_ingestion_unit.py)
- Add all 13 app services + dashboard to docker-compose.yml
- Add full documentation suite: API reference, Helm reference, Docker deployment guide,
  3 architecture diagrams (K8s, Docker Compose, data pipeline), AI agent guide,
  backup/restore guide, observability/metrics reference, per-service docs
- Add intelligence pipeline deep-dive docs with Mermaid diagrams
- Update README with documentation index and links
- Add specs for comprehensive-quality-docs, intelligence-pipeline-deep-dive,
  sanitized-pipeline-docs

2026-04-22 02:56:41 +00:00

Design Document: Comprehensive Quality & Documentation

Overview

This design covers three pillars for the Stonks Oracle platform:

Test Coverage — Close unit test gaps in the scheduler and ingestion services, fix pre-existing test failures in the extractor module, and achieve a fully green test suite (Requirements 1–4).
Docker Deployment — Extend docker-compose.yml to include all 13 application services plus the frontend, enabling full-platform local development without Kubernetes (Requirement 5).
Documentation — Produce comprehensive documentation covering per-service features, API references, Helm chart configuration, Docker deployment, three Mermaid architecture diagrams, AI agent building, backup/restore, observability, and README resource links (Requirements 6–16).

Design Rationale

The platform has mature production code across 13 services but uneven test coverage and documentation. The scheduler and ingestion services lack dedicated unit tests — their logic is only exercised through integration tests. Four extractor-related test files have pre-existing failures that block CI. Documentation exists only as a local dev setup guide, a pipeline overview, and a runbook. This initiative fills those gaps systematically.

The approach prioritizes:

Test isolation: Mock all external dependencies (PostgreSQL, Redis, MinIO, Ollama) so unit tests run fast and deterministically.
Documentation from source: Generate API references by inspecting actual FastAPI route definitions, Helm values from values.yaml, and metrics from services/shared/metrics.py.
Docker parity with Kubernetes: Mirror the Helm chart's service definitions in Docker Compose so both deployment modes stay in sync.

Architecture

The work does not change the platform's runtime architecture. It adds:

New test files in tests/ for scheduler and ingestion unit tests.
Fixes to existing test files and/or production code to resolve failures.
New service definitions in docker-compose.yml using the existing docker/Dockerfile with SERVICE_CMD build args.
New documentation files in docs/ organized by topic.
Updated README.md with a documentation index and Mermaid diagram.

graph TD
    subgraph "Test Coverage (Reqs 1-4)"
        T1[tests/test_scheduler_unit.py]
        T2[tests/test_ingestion_unit.py]
        T3[Fix test_extractor_prompts.py]
        T4[Fix test_extractor_schemas.py]
        T5[Fix test_ollama_client.py]
        T6[Fix test_filings_adapter.py]
    end

    subgraph "Docker (Req 5)"
        D1[docker-compose.yml<br/>+ 13 app services + frontend]
    end

    subgraph "Documentation (Reqs 6-16)"
        DOC1[docs/services.md]
        DOC2[docs/api-reference.md]
        DOC3[docs/helm-reference.md]
        DOC4[docs/docker-deployment.md]
        DOC5[docs/architecture-kubernetes.md]
        DOC6[docs/architecture-docker-compose.md]
        DOC7[docs/architecture-data-pipeline.md]
        DOC8[docs/ai-agents.md]
        DOC9[docs/backup-restore.md]
        DOC10[docs/observability.md]
        DOC11[README.md update]
    end

Components and Interfaces

1. Scheduler Unit Tests (Requirement 1)

Target module: services/scheduler/app.py

Functions to test in isolation:

get_cadence_for_source(source_type, config) — Returns polling interval from config or defaults.
compute_backoff(retry_count) — Exponential backoff with cap.
is_source_due(...) — Core scheduling logic: determines if a source needs polling based on last run status, timing, retry state.
build_job_payload(source, aliases, now) — Constructs the ingestion job dict.
schedule_cycle(pool, rds) — Full scheduling pass (mocked DB/Redis).
check_rate_limit(rds, source_type, now) — Rate limiting with per-type and global Polygon limits.
recover_stale_documents(pool, rds) — Re-enqueue orphaned parsed documents.
retry_failed_extractions(pool, rds) — Re-enqueue failed extractions.
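As a rough illustration of what "exponential backoff with cap" means for `compute_backoff`, the sketch below doubles the delay on each retry and clamps it at a ceiling. The base delay, the cap, and the `_sketch` function name are assumptions made for this example; the real values live in services/scheduler/app.py.

```python
def compute_backoff_sketch(
    retry_count: int,
    base_seconds: float = 30.0,   # assumed base delay, not taken from the codebase
    cap_seconds: float = 3600.0,  # assumed ceiling, not taken from the codebase
) -> float:
    """Return base * 2**retry_count seconds, never exceeding cap_seconds."""
    if retry_count < 0:
        raise ValueError("retry_count must be non-negative")
    # Clamp the exponent so very large retry counts cannot overflow a float.
    exponent = min(retry_count, 32)
    return min(cap_seconds, base_seconds * (2 ** exponent))
```

A unit test for the real function would likely assert the same three behaviours: the first retry uses the base delay, the delay doubles per retry, and the cap holds for large retry counts.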

Mocking strategy:

asyncpg.Pool → AsyncMock with .fetch(), .fetchrow(), .fetchval(), .execute() returning canned records.
redis.asyncio.Redis → AsyncMock with .rpush(), .set(), .get(), .incr(), .expire(), .decr(), .delete() tracking calls.
Use unittest.mock.patch for module-level imports where needed.
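The mocking strategy above could look roughly like the following. `check_rate_limit_sketch` is a hypothetical fixed-window limiter written only for this example (its key format, limit, and window are assumptions); it is not the scheduler's actual `check_rate_limit`. The point is how an `AsyncMock` Redis client returns canned values and records awaited calls.

```python
import asyncio
from unittest.mock import AsyncMock


async def check_rate_limit_sketch(rds, source_type: str, limit: int,
                                  window_seconds: int = 60) -> bool:
    """Hypothetical fixed-window limiter: allow while the counter is <= limit."""
    key = f"ratelimit:{source_type}"  # assumed key format
    count = await rds.incr(key)
    if count == 1:
        # First hit in this window: start the expiry clock.
        await rds.expire(key, window_seconds)
    return count <= limit


def run_rate_limit_example() -> tuple:
    rds = AsyncMock()
    # Canned INCR results: first call opens the window, second is over the limit.
    rds.incr.side_effect = [1, 6]

    async def scenario():
        first = await check_rate_limit_sketch(rds, "news", limit=5)
        second = await check_rate_limit_sketch(rds, "news", limit=5)
        return first, second

    result = asyncio.run(scenario())
    # The mock tracks awaited calls, so the test can verify side effects.
    rds.expire.assert_awaited_once_with("ratelimit:news", 60)
    return result
```

In the real test file the same pattern would run under pytest-asyncio rather than `asyncio.run`.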

Test file: tests/test_scheduler_unit.py

2. Ingestion Unit Tests (Requirement 2)

Target module: services/ingestion/worker.py

Functions to test:

process_job(job, pool, rds, minio_client, adapters) — Main job processing with various adapter outcomes.
Error handling paths: adapter returns AdapterResult(error=...), retry exhaustion, dead-letter routing.
Deduplication: content hash already seen in Redis, cross-source document dedup via dedupe_items.

Mocking strategy:

Adapters → AsyncMock returning AdapterResult with controlled error, items, content_hash, raw_payload.
asyncpg.Pool → AsyncMock for ingestion_runs INSERT/UPDATE, persist_ingestion_items, record_retrieval_failure.
redis.asyncio.Redis → AsyncMock for dedupe checks, queue pushes, DLQ routing.
minio.Minio → MagicMock for upload_raw_artifact.
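A minimal sketch of a deduplication test built on these mocks follows. `AdapterResultStub` mirrors only the four fields named above, and `process_job_sketch`, the `adapter.fetch` method, the `seen:` key prefix, and the SQL text are hypothetical stand-ins for illustration; the real `process_job` has a different signature and more steps.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Optional
from unittest.mock import AsyncMock


@dataclass
class AdapterResultStub:
    """Stand-in with the fields the design names; not the real AdapterResult."""
    items: list = field(default_factory=list)
    content_hash: Optional[str] = None
    raw_payload: Optional[bytes] = None
    error: Optional[str] = None


async def process_job_sketch(job: dict, pool, rds, adapter) -> str:
    result = await adapter.fetch(job)  # hypothetical adapter method
    if result.error:
        return "error"
    # SET ... NX yields a truthy value only for a hash not seen before.
    is_new = await rds.set(f"seen:{result.content_hash}", "1", nx=True)
    if not is_new:
        return "duplicate"
    await pool.execute("INSERT INTO documents_sketch VALUES ($1)", result.content_hash)
    return "persisted"


def run_dedupe_example() -> tuple:
    adapter, pool, rds = AsyncMock(), AsyncMock(), AsyncMock()
    adapter.fetch.return_value = AdapterResultStub(items=[{"id": 1}], content_hash="abc")
    rds.set.side_effect = [True, None]  # first sighting, then a repeat

    async def scenario():
        first = await process_job_sketch({"source_id": 1}, pool, rds, adapter)
        second = await process_job_sketch({"source_id": 1}, pool, rds, adapter)
        return first, second

    outcomes = asyncio.run(scenario())
    assert pool.execute.await_count == 1  # the duplicate never reaches the database
    return outcomes
```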

Test file: tests/test_ingestion_unit.py

3. Extractor Test Fixes (Requirement 3)

Target files:

tests/test_extractor_prompts.py
tests/test_extractor_schemas.py
tests/test_ollama_client.py
tests/test_filings_adapter.py

Approach: Run each file individually, diagnose failures, and fix either the test setup (mock configuration, fixture data) or the production code. Preserve original test intent and assertions. If production code changes are needed, add regression tests.

4. Full Test Suite Green (Requirement 4)

Verification: Run pytest tests/ -x --tb=short -q and ruff check services/ after all fixes. All existing test_pbt_* files must remain passing. Any production code fix must include a regression test.

5. Docker Compose Application Services (Requirement 5)

Current state: docker-compose.yml defines 8 infrastructure services (postgres, redis, minio, minio-init, ollama, trino, hive-metastore, superset).

Addition: 14 new service definitions (13 app services + frontend dashboard):

Service	Image Build	Command	Port	Depends On
scheduler	`docker/Dockerfile`	`python -m services.scheduler.app`	—	postgres, redis
symbol-registry	`docker/Dockerfile`	`uvicorn services.symbol_registry.app:app --host 0.0.0.0 --port 8000`	8001:8000	postgres
ingestion	`docker/Dockerfile`	`python -m services.ingestion.worker`	—	postgres, redis, minio
parser	`docker/Dockerfile`	`python -m services.parser.worker`	—	postgres, redis
extractor	`docker/Dockerfile`	`python -m services.extractor.main`	—	postgres, redis, ollama
aggregation	`docker/Dockerfile`	`python -m services.aggregation.main`	—	postgres, redis
recommendation	`docker/Dockerfile`	`python -m services.recommendation.main`	—	postgres, redis
trading-engine	`docker/Dockerfile`	`uvicorn services.trading.app:app --host 0.0.0.0 --port 8000`	8002:8000	postgres, redis
risk-engine	`docker/Dockerfile`	`uvicorn services.risk.app:app --host 0.0.0.0 --port 8000`	8003:8000	postgres
broker-adapter	`docker/Dockerfile`	`python -m services.adapters.broker_service`	—	postgres, redis
lake-publisher	`docker/Dockerfile`	`python -m services.lake_publisher.jobs`	—	postgres, minio
query-api	`docker/Dockerfile`	`uvicorn services.api.app:app --host 0.0.0.0 --port 8000`	8004:8000	postgres, redis, minio
dashboard	`frontend/Dockerfile`	nginx (built-in)	3000:8080	query-api

Common environment block (shared via x-app-env YAML anchor):

POSTGRES_HOST: postgres
POSTGRES_PORT: "5432"
POSTGRES_DB: stonks
POSTGRES_USER: stonks
POSTGRES_PASSWORD: stonks_dev
REDIS_HOST: redis
REDIS_PORT: "6379"
MINIO_ENDPOINT: minio:9000
MINIO_ACCESS_KEY: minioadmin
MINIO_SECRET_KEY: minioadmin
OLLAMA_BASE_URL: http://ollama:11434

.env file support: MARKET_DATA_API_KEY, BROKER_API_KEY, BROKER_API_SECRET, BROKER_BASE_URL loaded via env_file: .env on services that need them (ingestion, broker-adapter, trading-engine).

Health checks: FastAPI services use curl -f http://localhost:8000/health; workers use process liveness checks. Infrastructure depends_on uses condition: service_healthy.
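One way to keep the `condition: service_healthy` convention from drifting is a small check over the parsed Compose file. The sketch below assumes the file has already been loaded into a plain dict (by whatever YAML or JSON loader the project prefers); the function name and the sample data are illustrative only.

```python
def unhealthy_dependencies(compose: dict) -> list:
    """Return (service, dependency) pairs that do not wait on service_healthy."""
    problems = []
    for name, service in compose.get("services", {}).items():
        deps = service.get("depends_on", {})
        if isinstance(deps, list):
            # Short-form depends_on carries no condition at all.
            problems.extend((name, dep) for dep in deps)
            continue
        for dep, options in deps.items():
            if (options or {}).get("condition") != "service_healthy":
                problems.append((name, dep))
    return problems


# Illustrative fragment, not the project's actual docker-compose.yml.
EXAMPLE_COMPOSE = {
    "services": {
        "scheduler": {
            "depends_on": {
                "postgres": {"condition": "service_healthy"},
                "redis": {"condition": "service_started"},
            }
        },
        "parser": {"depends_on": ["postgres"]},
        "postgres": {},
    }
}
```

Here `unhealthy_dependencies(EXAMPLE_COMPOSE)` flags the scheduler's redis dependency and the parser's short-form postgres dependency.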

6. Documentation Structure (Requirements 6–16)

All documentation files are Markdown in docs/. The structure:

docs/
├── services.md                      # Req 6: Per-service feature docs
├── api-reference.md                 # Req 7: All 4 FastAPI API references
├── helm-reference.md                # Req 8: Helm chart values reference
├── docker-deployment.md             # Req 9: Docker deployment guide
├── architecture-kubernetes.md       # Req 10: K8s Mermaid diagram
├── architecture-docker-compose.md   # Req 11: Docker Compose Mermaid diagram
├── architecture-data-pipeline.md    # Req 12: Data pipeline Mermaid diagram
├── ai-agents.md                     # Req 13: AI agent building guide
├── backup-restore.md                # Req 14: Backup and restore guide
├── observability.md                 # Req 15: Observability & metrics reference
├── LOCAL_DEV_SETUP.md               # (existing)
├── llm-to-trade-pipeline.md         # (existing)
└── notes/
    └── runbook.md                   # (existing)

6a. Service Feature Documentation (`docs/services.md`) — Req 6

For each of the 13 services, document:

Purpose: What the service does in the pipeline.
Entry point: Module path (e.g., services.scheduler.app).
Configuration: Environment variables from services/shared/config.py relevant to this service.
Database tables: Tables read/written by this service.
Redis queues: Queue names consumed from and published to (from services/shared/redis_keys.py).
Queue message schema: JSON structure of messages.
Signal layers: For aggregation/recommendation, document the three signal layers (company, macro, competitive), their toggles (macro_enabled, competitive_enabled in risk_configs), and weight configurations.
Trading engine features: For the trading service, document position sizing, circuit breakers, reserve pool, risk tier auto-adjustment, backtesting, and notification configuration.

Queue topology reference (from redis_keys.py):

Queue	Producer	Consumer
`stonks:queue:ingestion`	scheduler	ingestion
`stonks:queue:parsing`	ingestion	parser
`stonks:queue:extraction`	parser	extractor
`stonks:queue:macro_classification`	parser, scheduler	extractor
`stonks:queue:aggregation`	extractor	aggregation
`stonks:queue:recommendation`	aggregation	recommendation
`stonks:queue:lake_publish`	various	lake-publisher
`stonks:queue:broker_orders`	trading-engine, trading API	broker-adapter
`stonks:queue:trading_decisions`	recommendation	trading-engine
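The table can also be read as a directed graph from producers to consumers. The sketch below encodes the rows exactly as listed (leaving out `stonks:queue:lake_publish`, whose producer is only given as "various") and walks the graph breadth-first to find a service-to-service path. The `QUEUES` dict and `service_path` helper are illustrative and do not exist in redis_keys.py.

```python
from collections import deque

# queue name -> (producers, consumer), copied from the table above
QUEUES = {
    "stonks:queue:ingestion": (["scheduler"], "ingestion"),
    "stonks:queue:parsing": (["ingestion"], "parser"),
    "stonks:queue:extraction": (["parser"], "extractor"),
    "stonks:queue:macro_classification": (["parser", "scheduler"], "extractor"),
    "stonks:queue:aggregation": (["extractor"], "aggregation"),
    "stonks:queue:recommendation": (["aggregation"], "recommendation"),
    "stonks:queue:broker_orders": (["trading-engine", "trading API"], "broker-adapter"),
    "stonks:queue:trading_decisions": (["recommendation"], "trading-engine"),
}


def service_path(start: str, goal: str):
    """Shortest producer-to-consumer chain from start to goal, or None."""
    edges = {}
    for producers, consumer in QUEUES.values():
        for producer in producers:
            edges.setdefault(producer, set()).add(consumer)
    frontier = deque([[start]])
    seen = {start}
    while frontier:
        path = frontier.popleft()
        if path[-1] == goal:
            return path
        for nxt in sorted(edges.get(path[-1], ())):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(path + [nxt])
    return None
```

A check like this can help confirm that the data pipeline diagram and the queue table describe the same chain of services.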

6b. API Reference (`docs/api-reference.md`) — Req 7

Document all endpoints from the four FastAPI services by inspecting their route definitions:

Query API (services/api/app.py): 40+ endpoints covering companies, documents, trends, recommendations, evidence drill-down, orders, positions, portfolio, global events, macro impacts, competitive signals, trend projections, agents, dead-letter queues, pipeline control, SQL explorer, saved queries, audit trail, DevOps metrics, and Prometheus metrics.

Symbol Registry API (services/symbol_registry/app.py): Companies CRUD, aliases, watchlists, sources, exposure profiles, competitor relationships, competitor inference.

Trading API (services/trading/app.py): Health/readiness, engine status, config update, pause/resume, reset, decisions audit, performance metrics/history, backtesting, notifications config/history, override orders, debug state.

Risk API (services/risk/app.py): Order evaluation (POST /evaluate), health, pending approvals, approval review, approval expiration.

For each endpoint: method, path, query parameters (type, default, constraints), request body schema, response schema, error codes (4xx/5xx).

6c. Helm Chart Reference (`docs/helm-reference.md`) — Req 8

Document from infra/helm/stonks-oracle/values.yaml:

image block: registry, pullPolicy, tag
pipelineEnabled: toggle and effect on worker replicas
services block: per-service structure (replicas, image, command, tier, port, secrets, resources, probes)
config block: all ConfigMap environment variables with defaults and descriptions
secrets block: core, broker, market, gmail, dashboard — injection via --set flags
ingress block: className, clusterIssuer, host mappings
Analytics stack: trino, hiveMetastore, superset toggles and resources
networkPolicies.enabled: default-deny-ingress behavior
Value override files: values-beta.yaml, values-paper.yaml and their deployment stages

6d. Docker Deployment Guide (`docs/docker-deployment.md`) — Req 9

Complete service inventory with images, ports, volumes, environment variables
.env file format with all required/optional variables
Volume mounts and data persistence (pgdata, miniodata, ollama_models, hive_data, superset_data)
Health check configurations
Dockerfile build arguments (SERVICE_CMD)
Operational commands: start, stop, restart, logs, scale, reset (docker compose down -v)

6e. Architecture Diagrams (Reqs 10–12)

Kubernetes diagram (docs/architecture-kubernetes.md):

stonks-oracle namespace with all 13 services grouped by tier (api, processing, trading, orchestration, analytics, frontend)
External cluster services in their namespaces (postgresql-service, redis-service, minio-service, ollama-service)
Traefik ingress routes to external domains
Network policy boundaries
Analytics plane (Trino, Hive Metastore, Superset)
Helm-managed secrets (core, broker, market, gmail) with consumer mapping
Service tier distinction (API with ingress, pipeline workers, trading)

Docker Compose diagram (docs/architecture-docker-compose.md):

All infrastructure + application containers
Host port mappings
depends_on relationships and health check dependencies
Named volumes and mount points
.env file providing API keys
Internal Docker network connectivity

Data Pipeline diagram (docs/architecture-data-pipeline.md):

External sources → ingestion → parsing → extraction → aggregation → recommendation → risk → trading → broker
Redis queue topology with queue names
Three signal layers as distinct paths merging at aggregation
Data stores at each stage (MinIO, PostgreSQL, Redis)
Trading engine decision loop
Analytical branch (lake publisher → MinIO/Parquet → Trino → Superset/Dashboard)
External integrations (Ollama, Alpaca, AWS SNS, Gmail)

6f. AI Agent Guide (`docs/ai-agents.md`) — Req 13

Three built-in agents: document-extractor, event-classifier, thesis-rewriter
Per-agent: purpose, input data, output schema, default model, system prompt structure, user prompt template
ai_agents table schema and registration (system-seeded vs API-created)
agent_variants table: create, activate, deactivate variants for A/B testing
AgentConfigResolver module: TTL cache (60s default), COALESCE-based variant override, fallback behavior
Performance logging: agent_performance_log table, querying for variant comparison
API endpoints: CRUD on /api/agents, test endpoint /api/agents/{id}/test
Step-by-step guide: creating a new variant with different model/prompt and activating it
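To make the resolver behaviour concrete, here is a rough sketch of a TTL cache combined with a COALESCE-style override, where a variant field replaces the base value only when it is not null. The class and function names, the loader callback, and the field names in the usage note are assumptions for illustration; the actual AgentConfigResolver reads from PostgreSQL and may differ in detail.

```python
import time
from typing import Callable, Optional


class TTLConfigCache:
    """Cache loader results per agent for ttl_seconds (60s mirrors the stated default)."""

    def __init__(self, loader: Callable[[str], dict], ttl_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic):
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so tests can control time
        self._entries = {}

    def get(self, agent_id: str) -> dict:
        now = self._clock()
        entry = self._entries.get(agent_id)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]  # still fresh
        value = self._loader(agent_id)
        self._entries[agent_id] = (now, value)
        return value


def coalesce_config(base: dict, variant: Optional[dict]) -> dict:
    """For each base field, prefer the variant's value unless it is None."""
    variant = variant or {}
    return {
        key: variant[key] if variant.get(key) is not None else value
        for key, value in base.items()
    }
```

For example, `coalesce_config({"model": "base", "prompt": "p"}, {"model": "v", "prompt": None})` keeps the base prompt and takes the variant's model, which is the fallback behaviour the guide should document.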

6g. Backup & Restore Guide (`docs/backup-restore.md`) — Req 14

Scripts in scripts/:

backup-db.sh: PostgreSQL dump, CLI args, storage location, retention (keeps last 7)
restore-db.sh: PostgreSQL restore, service scale-down/up, data loss implications
backup-redis.sh: Redis RDB snapshot backup
backup.sh: Combined backup (DB + Redis), --upload-minio option
restore.sh: Combined restore
Full nuke-and-rebuild procedure (connection termination, DB drop, Redis flush, redeploy, re-seed)
Recommended backup schedules and automation (cron, Kubernetes CronJobs)
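The "keeps last 7" retention rule can be illustrated with a few lines of Python. This is only a sketch of the selection logic, assuming backup file names embed a timestamp that sorts chronologically; the naming pattern in the usage note is invented, and the real backup-db.sh is a shell script that may implement the rule differently.

```python
def backups_to_prune(names: list, keep: int = 7) -> list:
    """Return the backups to delete, oldest first, keeping the newest `keep`."""
    if keep <= 0:
        return sorted(names)
    ordered = sorted(names)  # assumes names sort chronologically
    return ordered[:-keep]   # empty when there are `keep` or fewer backups
```

With ten daily dumps named like `db_2026-04-01.sql.gz` through `db_2026-04-10.sql.gz`, the function selects the three oldest for deletion and leaves the seven most recent.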

6h. Observability Reference (`docs/observability.md`) — Req 15

/metrics endpoint on query-api, Prometheus scrape configuration
All metrics from services/shared/metrics.py:
- Ingestion: stonks_ingestion_jobs_total, stonks_ingestion_items_fetched_total, stonks_ingestion_items_new_total, stonks_ingestion_items_deduped_total, stonks_ingestion_errors_total, stonks_ingestion_adapter_duration_seconds
- Parsing: stonks_parse_jobs_total, stonks_parse_quality_score, stonks_parse_low_quality_total, stonks_parse_duration_seconds
- Extraction: stonks_extraction_jobs_total, stonks_extraction_attempts_total, stonks_extraction_retries_total, stonks_extraction_duration_seconds, stonks_extraction_confidence, stonks_extraction_validation_errors_total, stonks_extraction_tokens_total
- Aggregation: stonks_aggregation_windows_total, stonks_aggregation_signals_total, stonks_aggregation_contradiction_score, stonks_aggregation_duration_seconds
- Recommendation: stonks_recommendations_total, stonks_recommendations_suppressed_total, stonks_recommendation_confidence
- Lake: stonks_lake_facts_published_total, stonks_lake_publish_duration_seconds, stonks_lake_publish_errors_total, stonks_lake_publish_bytes_total
- Trading: stonks_orders_submitted_total, stonks_orders_rejected_total, stonks_orders_filled_total, stonks_orders_duplicates_prevented_total, stonks_risk_evaluations_total, stonks_risk_check_failures_total, stonks_positions_synced_total
- Alerting: stonks_alerts_fired_total, stonks_alerts_resolved_total, stonks_alert_check_duration_seconds, stonks_alert_active
- DLQ: stonks_dlq_items_total, stonks_dlq_replayed_total, stonks_dlq_depth
- Active: stonks_active_jobs
Alerting module (services/shared/alerting.py): 4 alert rules (source_failures, schema_failure_spike, analytical_lag, broker_issues), thresholds, evaluation windows, ConfigMap variables
Structured JSON logging format, trace context (trace_id, span_id)
Dead-letter queue system: queue names (stonks:dlq:<queue>), routing, replay tooling
Recommended Prometheus/Grafana queries
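As an illustration of structured JSON logging with trace context, the sketch below uses only the standard library. Apart from `trace_id` and `span_id`, the field names, the logger name, and the formatter class are assumptions for this example; the observability doc should record the fields the platform's logging module actually emits.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Trace context arrives via logging's `extra` mechanism.
            "trace_id": getattr(record, "trace_id", None),
            "span_id": getattr(record, "span_id", None),
        }
        return json.dumps(payload)


def log_example() -> dict:
    stream = io.StringIO()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger("stonks.sketch")  # hypothetical logger name
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.info("job done", extra={"trace_id": "t-1", "span_id": "s-1"})
    return json.loads(stream.getvalue())
```

Emitting one JSON object per line keeps the output easy to ingest in log aggregators and lets `trace_id` tie together entries from different services.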

6i. README Update — Req 16

Add "Documentation" section with links to all docs
Replace ASCII architecture diagram with Mermaid or link to diagram docs
Preserve all existing content (license, features, tech stack, project structure, deployment)

Data Models

No new database tables or schema changes are introduced. This initiative works with existing tables:

Tables referenced in test coverage work:

sources, companies, company_aliases — scheduler source polling
ingestion_runs — scheduler run tracking, ingestion job recording
documents, document_company_mentions — ingestion persistence, stale document recovery
document_intelligence, document_impact_records — extractor test fixtures
model_performance_metrics — extractor schema validation metrics

Tables documented (not modified):

All tables listed above plus trend_windows, trend_history, trend_projections, recommendations, recommendation_evidence, risk_evaluations, orders, order_events, positions, portfolio_snapshots, trading_decisions, circuit_breaker_events, reserve_pool_ledger, risk_tier_history, backtest_runs, backtest_trades, notifications, global_events, macro_impact_records, exposure_profiles, competitor_relationships, competitive_signal_records, ai_agents, agent_variants, agent_performance_log, audit_events, watchlists, watchlist_members, retention_policies, market_snapshots

Error Handling

Test Coverage

Mock failures: Unit tests must verify that scheduler and ingestion services handle database/Redis connection failures gracefully (no crashes, proper logging).
Adapter errors: Ingestion unit tests must verify retry logic with exponential backoff and dead-letter queue routing after retry exhaustion.
Test fix approach: When fixing pre-existing failures, prefer fixing test setup over changing production code. If production code changes are needed, add regression tests to prevent re-introduction.
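The adapter-error requirement might translate into a test along these lines. `handle_failure_sketch`, the retry limit, the backoff constants, and the `stonks:dlq:ingestion` key (one reading of the `stonks:dlq:<queue>` pattern) are assumptions made for this example rather than the ingestion worker's real code.

```python
import asyncio
import json
from unittest.mock import AsyncMock

MAX_RETRIES = 3  # assumed limit, not taken from the codebase


async def handle_failure_sketch(rds, job: dict, error: str) -> str:
    """Re-enqueue with backoff metadata, or dead-letter once retries run out."""
    attempt = job.get("retry_count", 0) + 1
    if attempt > MAX_RETRIES:
        await rds.rpush("stonks:dlq:ingestion", json.dumps({**job, "error": error}))
        return "dead_lettered"
    delay = min(3600, 30 * 2 ** attempt)  # assumed exponential backoff with cap
    retry_job = {**job, "retry_count": attempt, "retry_after_seconds": delay}
    await rds.rpush("stonks:queue:ingestion", json.dumps(retry_job))
    return "retried"


def run_retry_example() -> tuple:
    rds = AsyncMock()

    async def scenario():
        first = await handle_failure_sketch(rds, {"source_id": 7}, "timeout")
        last = await handle_failure_sketch(
            rds, {"source_id": 7, "retry_count": 3}, "timeout"
        )
        return first, last

    outcomes = asyncio.run(scenario())
    # Inspect which Redis list each push targeted.
    targets = [call.args[0] for call in rds.rpush.await_args_list]
    return outcomes, targets
```

The assertions worth keeping in the real test are the same two: a fresh failure goes back on the work queue, and a job past its retry budget lands on the dead-letter queue.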

Docker Compose

Health check failures: Application services use depends_on with condition: service_healthy to wait for infrastructure. Health checks have interval, timeout, retries, and start_period configured.
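Missing .env file: Docker Compose refuses to start a service whose env_file path does not exist unless the entry is marked required: false, so the .env file must be present (it may be empty) or be declared optional. When the file exists but the keys are unset, services that need API keys (ingestion, broker-adapter, trading-engine) start but log warnings about missing keys, and the platform runs in a degraded mode without external API access.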
Build failures: Each service uses the same base Dockerfile with SERVICE_CMD build arg. Build errors are isolated per service.

Documentation

Stale documentation: Documentation is generated from source code inspection. If the codebase changes after documentation is written, the docs may drift. The README links section serves as a single index to find and update docs.
Diagram accuracy: Mermaid diagrams are hand-authored based on current architecture. They should be updated when services are added or removed.

Testing Strategy

PBT Applicability Assessment

Property-based testing is NOT applicable to this feature. The work consists of:

Unit tests for existing services — These are example-based tests with mocked dependencies, not pure functions with universal properties.
Fixing pre-existing test failures — Bug fixes to existing tests/code.
Docker Compose configuration — Declarative infrastructure configuration.
Documentation — Markdown files with no executable logic.

None of these involve new pure functions, parsers, serializers, or business logic where PBT would add value. The existing test_pbt_* files (22 files covering trading, aggregation, competitive intelligence, etc.) already provide PBT coverage for the platform's core logic and must remain passing.

Unit Testing Strategy

New test files:

tests/test_scheduler_unit.py — 8+ test cases covering all scheduler pure functions and the schedule_cycle orchestration with mocked dependencies.
tests/test_ingestion_unit.py — 6+ test cases covering adapter error handling, retry logic, deduplication, and dead-letter queue routing.

Test fix files (existing, to be repaired):

tests/test_extractor_prompts.py
tests/test_extractor_schemas.py
tests/test_ollama_client.py
tests/test_filings_adapter.py

Test framework: pytest + pytest-asyncio (already configured in the project).

Mocking approach: unittest.mock.AsyncMock for async dependencies, unittest.mock.MagicMock for sync dependencies, unittest.mock.patch for module-level state.

Verification Criteria

pytest tests/ -x --tb=short -q → zero failures
ruff check services/ → zero violations
All 22 existing test_pbt_* files pass unchanged
docker compose config validates the updated docker-compose.yml
All documentation files render valid Markdown with working internal links
