Design Document: Comprehensive Quality & Documentation

Overview

This design covers three pillars for the Stonks Oracle platform:

  1. Test Coverage: Close unit test gaps in the scheduler and ingestion services, fix pre-existing test failures in the extractor module, and achieve a fully green test suite (Requirements 1-4).
  2. Docker Deployment: Extend docker-compose.yml to include all 13 application services plus the frontend, enabling full-platform local development without Kubernetes (Requirement 5).
  3. Documentation: Produce comprehensive documentation covering per-service features, API references, Helm chart configuration, Docker deployment, three Mermaid architecture diagrams, AI agent building, backup/restore, observability, and README resource links (Requirements 6-16).

Design Rationale

The platform has mature production code across 13 services but uneven test coverage and documentation. The scheduler and ingestion services lack dedicated unit tests — their logic is only exercised through integration tests. Four extractor-related test files have pre-existing failures that block CI. Documentation exists only as a local dev setup guide, a pipeline overview, and a runbook. This initiative fills those gaps systematically.

The approach prioritizes:

  • Test isolation: Mock all external dependencies (PostgreSQL, Redis, MinIO, Ollama) so unit tests run fast and deterministically.
  • Documentation from source: Generate API references by inspecting actual FastAPI route definitions, Helm values from values.yaml, and metrics from services/shared/metrics.py.
  • Docker parity with Kubernetes: Mirror the Helm chart's service definitions in Docker Compose so both deployment modes stay in sync.

Architecture

The work does not change the platform's runtime architecture. It adds:

  1. New test files in tests/ for scheduler and ingestion unit tests.
  2. Fixes to existing test files and/or production code to resolve failures.
  3. New service definitions in docker-compose.yml using the existing docker/Dockerfile with SERVICE_CMD build args.
  4. New documentation files in docs/ organized by topic.
  5. Updated README.md with a documentation index and Mermaid diagram.
graph TD
    subgraph "Test Coverage (Reqs 1-4)"
        T1[tests/test_scheduler_unit.py]
        T2[tests/test_ingestion_unit.py]
        T3[Fix test_extractor_prompts.py]
        T4[Fix test_extractor_schemas.py]
        T5[Fix test_ollama_client.py]
        T6[Fix test_filings_adapter.py]
    end

    subgraph "Docker (Req 5)"
        D1[docker-compose.yml<br/>+ 13 app services + frontend]
    end

    subgraph "Documentation (Reqs 6-16)"
        DOC1[docs/services.md]
        DOC2[docs/api-reference.md]
        DOC3[docs/helm-reference.md]
        DOC4[docs/docker-deployment.md]
        DOC5[docs/architecture-kubernetes.md]
        DOC6[docs/architecture-docker-compose.md]
        DOC7[docs/architecture-data-pipeline.md]
        DOC8[docs/ai-agents.md]
        DOC9[docs/backup-restore.md]
        DOC10[docs/observability.md]
        DOC11[README.md update]
    end

Components and Interfaces

1. Scheduler Unit Tests (Requirement 1)

Target module: services/scheduler/app.py

Functions to test in isolation:

  • get_cadence_for_source(source_type, config) — Returns polling interval from config or defaults.
  • compute_backoff(retry_count) — Exponential backoff with cap.
  • is_source_due(...) — Core scheduling logic: determines if a source needs polling based on last run status, timing, retry state.
  • build_job_payload(source, aliases, now) — Constructs the ingestion job dict.
  • schedule_cycle(pool, rds) — Full scheduling pass (mocked DB/Redis).
  • check_rate_limit(rds, source_type, now) — Rate limiting with per-type and global Polygon limits.
  • recover_stale_documents(pool, rds) — Re-enqueue orphaned parsed documents.
  • retry_failed_extractions(pool, rds) — Re-enqueue failed extractions.
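
A minimal sketch of the "exponential backoff with cap" behavior listed for compute_backoff above. The base delay, the cap, and the doubling factor are assumptions made for illustration; the actual values live in services/scheduler/app.py and may differ.

```python
# Hypothetical stand-in for compute_backoff; both constants are assumed values.
BASE_DELAY_SECONDS = 60
MAX_DELAY_SECONDS = 3600


def compute_backoff(retry_count: int) -> int:
    """Return the delay before the next retry, doubling per retry up to a cap."""
    if retry_count < 0:
        raise ValueError("retry_count must be non-negative")
    return min(BASE_DELAY_SECONDS * (2 ** retry_count), MAX_DELAY_SECONDS)


def test_backoff_doubles_until_cap() -> None:
    delays = [compute_backoff(n) for n in range(8)]
    # The first delays double; later ones are pinned at the cap.
    assert delays[:3] == [60, 120, 240]
    assert delays[-1] == MAX_DELAY_SECONDS
    # The sequence never decreases.
    assert delays == sorted(delays)
```

A test in this shape covers both the growth and the cap in one pass, which is usually where off-by-one mistakes in backoff code show up.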

Mocking strategy:

  • asyncpg.Pool → AsyncMock with .fetch(), .fetchrow(), .fetchval(), .execute() returning canned records.
  • redis.asyncio.Redis → AsyncMock with .rpush(), .set(), .get(), .incr(), .expire(), .decr(), .delete() tracking calls.
  • Use unittest.mock.patch for module-level imports where needed.

Test file: tests/test_scheduler_unit.py
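
The mocking strategy above could look roughly like the following. This is a sketch only: enqueue_sources is a hypothetical, much-simplified stand-in for a scheduling pass (the real schedule_cycle is not reproduced here), and asyncio.run is used instead of pytest-asyncio so the example stays self-contained.

```python
import asyncio
import json
from unittest.mock import AsyncMock


async def enqueue_sources(pool, rds, queue: str) -> int:
    """Hypothetical stand-in for one scheduling pass: read sources, push jobs."""
    rows = await pool.fetch("SELECT id, source_type FROM sources")
    for row in rows:
        await rds.rpush(queue, json.dumps({"source_id": row["id"]}))
    return len(rows)


def test_enqueue_sources_pushes_one_job_per_row() -> None:
    pool = AsyncMock()  # stands in for asyncpg.Pool
    pool.fetch.return_value = [
        {"id": 1, "source_type": "news"},
        {"id": 2, "source_type": "filings"},
    ]
    rds = AsyncMock()  # stands in for redis.asyncio.Redis

    count = asyncio.run(enqueue_sources(pool, rds, "stonks:queue:ingestion"))

    assert count == 2
    pool.fetch.assert_awaited_once()
    assert rds.rpush.await_count == 2
    # Inspect the first recorded call to check queue name and payload.
    first_queue, first_payload = rds.rpush.await_args_list[0].args
    assert first_queue == "stonks:queue:ingestion"
    assert json.loads(first_payload) == {"source_id": 1}
```

Because AsyncMock records awaits, the test can assert on what was pushed without a running PostgreSQL or Redis instance.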

2. Ingestion Unit Tests (Requirement 2)

Target module: services/ingestion/worker.py

Functions to test:

  • process_job(job, pool, rds, minio_client, adapters) — Main job processing with various adapter outcomes.
  • Error handling paths: adapter returns AdapterResult(error=...), retry exhaustion, dead-letter routing.
  • Deduplication: content hash already seen in Redis, cross-source document dedup via dedupe_items.

Mocking strategy:

  • Adapters → AsyncMock returning AdapterResult with controlled error, items, content_hash, raw_payload.
  • asyncpg.Pool → AsyncMock for ingestion_runs INSERT/UPDATE, persist_ingestion_items, record_retrieval_failure.
  • redis.asyncio.Redis → AsyncMock for dedupe checks, queue pushes, DLQ routing.
  • minio.Minio → MagicMock for upload_raw_artifact.

Test file: tests/test_ingestion_unit.py
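
As an illustration of the error-handling and deduplication paths above, here is a simplified sketch. Everything beyond the four AdapterResult field names is an assumption: the adapter method name fetch, the retry budget, the Redis key prefix, the bucket name, and the DLQ name are hypothetical, and this process_job omits the pool argument of the real function.

```python
import asyncio
import io
from dataclasses import dataclass, field
from typing import Optional
from unittest.mock import AsyncMock, MagicMock

MAX_RETRIES = 3  # assumed retry budget, not taken from the worker


@dataclass
class AdapterResult:
    """Stand-in with the four fields named in the mocking strategy."""
    error: Optional[str] = None
    items: list = field(default_factory=list)
    content_hash: str = ""
    raw_payload: bytes = b""


async def process_job(job, rds, minio_client, adapter) -> str:
    """Simplified, hypothetical flow: fetch, dead-letter, dedupe, upload."""
    result = await adapter.fetch(job)
    if result.error:
        if job.get("retry_count", 0) >= MAX_RETRIES:
            # Retries exhausted: route the job to a dead-letter queue.
            await rds.rpush("stonks:dlq:ingestion", str(job["source_id"]))
            return "dead_lettered"
        return "retry"
    # SET with nx=True returns None when the key already exists.
    if not await rds.set(f"seen:{result.content_hash}", "1", nx=True):
        return "deduped"
    payload = result.raw_payload
    minio_client.put_object("raw", result.content_hash, io.BytesIO(payload), len(payload))
    return "stored"


def run_case(job, result, already_seen=False):
    adapter = AsyncMock()
    adapter.fetch.return_value = result
    rds = AsyncMock()
    rds.set.return_value = None if already_seen else True
    minio_client = MagicMock()  # the MinIO client is synchronous
    outcome = asyncio.run(process_job(job, rds, minio_client, adapter))
    return outcome, rds, minio_client


def test_exhausted_retries_go_to_dlq() -> None:
    job = {"source_id": 7, "retry_count": 3}
    outcome, rds, minio_client = run_case(job, AdapterResult(error="timeout"))
    assert outcome == "dead_lettered"
    rds.rpush.assert_awaited_once_with("stonks:dlq:ingestion", "7")
    minio_client.put_object.assert_not_called()


def test_duplicate_hash_is_skipped() -> None:
    result = AdapterResult(content_hash="abc")
    outcome, _, minio_client = run_case({"source_id": 7}, result, already_seen=True)
    assert outcome == "deduped"
    minio_client.put_object.assert_not_called()
```

Keeping the mocks in one helper (run_case) lets each test state only the adapter outcome it cares about.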

3. Extractor Test Fixes (Requirement 3)

Target files:

  • tests/test_extractor_prompts.py
  • tests/test_extractor_schemas.py
  • tests/test_ollama_client.py
  • tests/test_filings_adapter.py

Approach: Run each file individually, diagnose failures, and fix either the test setup (mock configuration, fixture data) or the production code. Preserve original test intent and assertions. If production code changes are needed, add regression tests.

4. Full Test Suite Green (Requirement 4)

Verification: Run pytest tests/ -x --tb=short -q and ruff check services/ after all fixes. All existing test_pbt_* files must remain passing. Any production code fix must include a regression test.

5. Docker Compose Application Services (Requirement 5)

Current state: docker-compose.yml defines 8 infrastructure services (postgres, redis, minio, minio-init, ollama, trino, hive-metastore, superset).

Addition: 14 new service definitions (13 app services + frontend dashboard):

| Service | Build Dockerfile | Command | Port (host:container) | Depends On |
|---|---|---|---|---|
| scheduler | docker/Dockerfile.scheduler | python -m services.scheduler.app | | postgres, redis |
| symbol-registry | docker/Dockerfile | uvicorn services.symbol_registry.app:app --host 0.0.0.0 --port 8000 | 8001:8000 | postgres |
| ingestion | docker/Dockerfile | python -m services.ingestion.worker | | postgres, redis, minio |
| parser | docker/Dockerfile | python -m services.parser.worker | | postgres, redis |
| extractor | docker/Dockerfile | python -m services.extractor.main | | postgres, redis, ollama |
| aggregation | docker/Dockerfile | python -m services.aggregation.main | | postgres, redis |
| recommendation | docker/Dockerfile | python -m services.recommendation.main | | postgres, redis |
| trading-engine | docker/Dockerfile | uvicorn services.trading.app:app --host 0.0.0.0 --port 8000 | 8002:8000 | postgres, redis |
| risk-engine | docker/Dockerfile | uvicorn services.risk.app:app --host 0.0.0.0 --port 8000 | 8003:8000 | postgres |
| broker-adapter | docker/Dockerfile | python -m services.adapters.broker_service | | postgres, redis |
| lake-publisher | docker/Dockerfile | python -m services.lake_publisher.jobs | | postgres, minio |
| query-api | docker/Dockerfile | uvicorn services.api.app:app --host 0.0.0.0 --port 8000 | 8004:8000 | postgres, redis, minio |
| dashboard | frontend/Dockerfile | nginx (built-in) | 3000:8080 | query-api |
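
The Depends On column can be checked mechanically. The sketch below assumes the table has been transcribed by hand into a Python mapping; the mapping and both helper names are illustrative and not part of the repository. It flags dependencies that point at undefined services and derives one valid start order.

```python
from graphlib import TopologicalSorter

# Infrastructure services referenced by the application services.
INFRA = {"postgres", "redis", "minio", "ollama"}

# Transcribed from the Depends On column of the service table.
DEPENDS_ON = {
    "scheduler": {"postgres", "redis"},
    "symbol-registry": {"postgres"},
    "ingestion": {"postgres", "redis", "minio"},
    "parser": {"postgres", "redis"},
    "extractor": {"postgres", "redis", "ollama"},
    "aggregation": {"postgres", "redis"},
    "recommendation": {"postgres", "redis"},
    "trading-engine": {"postgres", "redis"},
    "risk-engine": {"postgres"},
    "broker-adapter": {"postgres", "redis"},
    "lake-publisher": {"postgres", "minio"},
    "query-api": {"postgres", "redis", "minio"},
    "dashboard": {"query-api"},
}


def undefined_dependencies(depends_on, infra):
    """Map each service to the dependencies that are not defined anywhere."""
    known = set(depends_on) | set(infra)
    return {
        service: sorted(deps - known)
        for service, deps in depends_on.items()
        if deps - known
    }


def start_order(depends_on):
    """Return services with dependencies first; raises CycleError on a cycle."""
    return list(TopologicalSorter(depends_on).static_order())
```

A check like this is a cheap complement to docker compose config, since it also catches dependency cycles before any container starts.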

Common environment block (shared via x-app-env YAML anchor):

POSTGRES_HOST: postgres
POSTGRES_PORT: "5432"
POSTGRES_DB: stonks
POSTGRES_USER: stonks
POSTGRES_PASSWORD: stonks_dev
REDIS_HOST: redis
REDIS_PORT: "6379"
MINIO_ENDPOINT: minio:9000
MINIO_ACCESS_KEY: minioadmin
MINIO_SECRET_KEY: minioadmin
OLLAMA_BASE_URL: http://ollama:11434

.env file support: MARKET_DATA_API_KEY, BROKER_API_KEY, BROKER_API_SECRET, BROKER_BASE_URL loaded via env_file: .env on services that need them (ingestion, broker-adapter, trading-engine).

Health checks: FastAPI services use curl -f http://localhost:8000/health; workers use process liveness checks. Infrastructure depends_on uses condition: service_healthy.

6. Documentation Structure (Requirements 6-16)

All documentation files are Markdown in docs/. The structure:

docs/
├── services.md                      # Req 6: Per-service feature docs
├── api-reference.md                 # Req 7: All 4 FastAPI API references
├── helm-reference.md                # Req 8: Helm chart values reference
├── docker-deployment.md             # Req 9: Docker deployment guide
├── architecture-kubernetes.md       # Req 10: K8s Mermaid diagram
├── architecture-docker-compose.md   # Req 11: Docker Compose Mermaid diagram
├── architecture-data-pipeline.md    # Req 12: Data pipeline Mermaid diagram
├── ai-agents.md                     # Req 13: AI agent building guide
├── backup-restore.md                # Req 14: Backup and restore guide
├── observability.md                 # Req 15: Observability & metrics reference
├── LOCAL_DEV_SETUP.md               # (existing)
├── llm-to-trade-pipeline.md         # (existing)
└── notes/
    └── runbook.md                   # (existing)

6a. Service Feature Documentation (docs/services.md) — Req 6

For each of the 13 services, document:

  • Purpose: What the service does in the pipeline.
  • Entry point: Module path (e.g., services.scheduler.app).
  • Configuration: Environment variables from services/shared/config.py relevant to this service.
  • Database tables: Tables read/written by this service.
  • Redis queues: Queue names consumed from and published to (from services/shared/redis_keys.py).
  • Queue message schema: JSON structure of messages.
  • Signal layers: For aggregation/recommendation, document the three signal layers (company, macro, competitive), their toggles (macro_enabled, competitive_enabled in risk_configs), and weight configurations.
  • Trading engine features: For the trading service, document position sizing, circuit breakers, reserve pool, risk tier auto-adjustment, backtesting, and notification configuration.

Queue topology reference (from redis_keys.py):

| Queue | Producer | Consumer |
|---|---|---|
| stonks:queue:ingestion | scheduler | ingestion |
| stonks:queue:parsing | ingestion | parser |
| stonks:queue:extraction | parser | extractor |
| stonks:queue:macro_classification | parser, scheduler | extractor |
| stonks:queue:aggregation | extractor | aggregation |
| stonks:queue:recommendation | aggregation | recommendation |
| stonks:queue:lake_publish | various | lake-publisher |
| stonks:queue:broker_orders | trading-engine, trading API | broker-adapter |
| stonks:queue:trading_decisions | recommendation | trading-engine |
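
To make the topology easier to reason about, the table can be treated as a directed graph. The following sketch copies the rows into a list and walks producer-to-consumer edges; the QUEUES structure and the reachable_from helper are illustrative names, not code from redis_keys.py.

```python
from collections import deque

# (queue, producers, consumer) rows copied from the queue topology table.
QUEUES = [
    ("stonks:queue:ingestion", ["scheduler"], "ingestion"),
    ("stonks:queue:parsing", ["ingestion"], "parser"),
    ("stonks:queue:extraction", ["parser"], "extractor"),
    ("stonks:queue:macro_classification", ["parser", "scheduler"], "extractor"),
    ("stonks:queue:aggregation", ["extractor"], "aggregation"),
    ("stonks:queue:recommendation", ["aggregation"], "recommendation"),
    ("stonks:queue:lake_publish", ["various"], "lake-publisher"),
    ("stonks:queue:broker_orders", ["trading-engine", "trading API"], "broker-adapter"),
    ("stonks:queue:trading_decisions", ["recommendation"], "trading-engine"),
]


def reachable_from(service: str) -> list:
    """Breadth-first walk over producer -> consumer edges, in discovery order."""
    seen = {service}
    order = []
    todo = deque([service])
    while todo:
        current = todo.popleft()
        for _queue, producers, consumer in QUEUES:
            if current in producers and consumer not in seen:
                seen.add(consumer)
                order.append(consumer)
                todo.append(consumer)
    return order


def queues_into(consumer: str) -> list:
    """List the queues a given service consumes from."""
    return [queue for queue, _producers, c in QUEUES if c == consumer]
```

For example, reachable_from("scheduler") visits every stage down to broker-adapter but never lake-publisher, because the table lists its producer only as "various".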

6b. API Reference (docs/api-reference.md) — Req 7

Document all endpoints from the four FastAPI services by inspecting their route definitions:

Query API (services/api/app.py): 40+ endpoints covering companies, documents, trends, recommendations, evidence drill-down, orders, positions, portfolio, global events, macro impacts, competitive signals, trend projections, agents, dead-letter queues, pipeline control, SQL explorer, saved queries, audit trail, DevOps metrics, and Prometheus metrics.

Symbol Registry API (services/symbol_registry/app.py): Companies CRUD, aliases, watchlists, sources, exposure profiles, competitor relationships, competitor inference.

Trading API (services/trading/app.py): Health/readiness, engine status, config update, pause/resume, reset, decisions audit, performance metrics/history, backtesting, notifications config/history, override orders, debug state.

Risk API (services/risk/app.py): Order evaluation (POST /evaluate), health, pending approvals, approval review, approval expiration.

For each endpoint: method, path, query parameters (type, default, constraints), request body schema, response schema, error codes (4xx/5xx).

6c. Helm Chart Reference (docs/helm-reference.md) — Req 8

Document from infra/helm/stonks-oracle/values.yaml:

  • image block: registry, pullPolicy, tag
  • pipelineEnabled: toggle and effect on worker replicas
  • services block: per-service structure (replicas, image, command, tier, port, secrets, resources, probes)
  • config block: all ConfigMap environment variables with defaults and descriptions
  • secrets block: core, broker, market, gmail, dashboard — injection via --set flags
  • ingress block: className, clusterIssuer, host mappings
  • Analytics stack: trino, hiveMetastore, superset toggles and resources
  • networkPolicies.enabled: default-deny-ingress behavior
  • Value override files: values-beta.yaml, values-paper.yaml and their deployment stages

6d. Docker Deployment Guide (docs/docker-deployment.md) — Req 9

  • Complete service inventory with images, ports, volumes, environment variables
  • .env file format with all required/optional variables
  • Volume mounts and data persistence (pgdata, miniodata, ollama_models, hive_data, superset_data)
  • Health check configurations
  • Dockerfile build arguments (SERVICE_CMD)
  • Operational commands: start, stop, restart, logs, scale, reset (docker compose down -v)

6e. Architecture Diagrams (Reqs 10-12)

Kubernetes diagram (docs/architecture-kubernetes.md):

  • stonks-oracle namespace with all 13 services grouped by tier (api, processing, trading, orchestration, analytics, frontend)
  • External cluster services in their namespaces (postgresql-service, redis-service, minio-service, ollama-service)
  • Traefik ingress routes to external domains
  • Network policy boundaries
  • Analytics plane (Trino, Hive Metastore, Superset)
  • Helm-managed secrets (core, broker, market, gmail) with consumer mapping
  • Service tier distinction (API with ingress, pipeline workers, trading)

Docker Compose diagram (docs/architecture-docker-compose.md):

  • All infrastructure + application containers
  • Host port mappings
  • depends_on relationships and health check dependencies
  • Named volumes and mount points
  • .env file providing API keys
  • Internal Docker network connectivity

Data Pipeline diagram (docs/architecture-data-pipeline.md):

  • External sources → ingestion → parsing → extraction → aggregation → recommendation → risk → trading → broker
  • Redis queue topology with queue names
  • Three signal layers as distinct paths merging at aggregation
  • Data stores at each stage (MinIO, PostgreSQL, Redis)
  • Trading engine decision loop
  • Analytical branch (lake publisher → MinIO/Parquet → Trino → Superset/Dashboard)
  • External integrations (Ollama, Alpaca, AWS SNS, Gmail)

6f. AI Agent Guide (docs/ai-agents.md) — Req 13

  • Three built-in agents: document-extractor, event-classifier, thesis-rewriter
  • Per-agent: purpose, input data, output schema, default model, system prompt structure, user prompt template
  • ai_agents table schema and registration (system-seeded vs API-created)
  • agent_variants table: create, activate, deactivate variants for A/B testing
  • AgentConfigResolver module: TTL cache (60s default), COALESCE-based variant override, fallback behavior
  • Performance logging: agent_performance_log table, querying for variant comparison
  • API endpoints: CRUD on /api/agents, test endpoint /api/agents/{id}/test
  • Step-by-step guide: creating a new variant with different model/prompt and activating it
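
The resolver behavior described above (TTL cache, COALESCE-style override, fallback to base values) can be sketched as follows. This is not the project's AgentConfigResolver: the loader callables, the dictionary shapes, and the injectable clock are assumptions chosen to keep the example self-contained and testable.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentConfig:
    model: str
    system_prompt: str


def coalesce(*values):
    """Return the first argument that is not None, mirroring SQL COALESCE."""
    return next((v for v in values if v is not None), None)


class AgentConfigResolver:
    """Hypothetical sketch; loaders stand in for ai_agents / agent_variants queries."""

    def __init__(self, load_agent, load_active_variant, ttl_seconds=60.0, clock=time.monotonic):
        self._load_agent = load_agent
        self._load_active_variant = load_active_variant
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache = {}  # agent_id -> (expires_at, AgentConfig)

    def resolve(self, agent_id: str) -> AgentConfig:
        now = self._clock()
        cached = self._cache.get(agent_id)
        if cached is not None and now < cached[0]:
            return cached[1]  # still fresh: skip both loaders
        base = self._load_agent(agent_id)
        # No active variant: every field falls back to the base agent.
        variant = self._load_active_variant(agent_id) or {}
        config = AgentConfig(
            model=coalesce(variant.get("model"), base["model"]),
            system_prompt=coalesce(variant.get("system_prompt"), base["system_prompt"]),
        )
        self._cache[agent_id] = (now + self._ttl, config)
        return config
```

With this shape, a variant that sets only model inherits the base system_prompt, and a newly activated variant takes effect once the cached entry expires.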

6g. Backup & Restore Guide (docs/backup-restore.md) — Req 14

Scripts in scripts/:

  • backup-db.sh: PostgreSQL dump, CLI args, storage location, retention (keeps last 7)
  • restore-db.sh: PostgreSQL restore, service scale-down/up, data loss implications
  • backup-redis.sh: Redis RDB snapshot backup
  • backup.sh: Combined backup (DB + Redis), --upload-minio option
  • restore.sh: Combined restore
  • Full nuke-and-rebuild procedure (connection termination, DB drop, Redis flush, redeploy, re-seed)
  • Recommended backup schedules and automation (cron, Kubernetes CronJobs)
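
The "keeps last 7" retention rule for backup-db.sh can be expressed compactly. The scripts themselves are shell; the Python below is only a sketch of the same idea, and the *.sql.gz file pattern and the function name are assumptions rather than details taken from the script.

```python
from pathlib import Path


def prune_backups(directory: Path, keep: int = 7, pattern: str = "*.sql.gz") -> list:
    """Delete all but the `keep` newest matching files; return the removed paths."""
    if keep < 0:
        raise ValueError("keep must be non-negative")
    # Newest first, judged by modification time.
    backups = sorted(
        directory.glob(pattern),
        key=lambda path: path.stat().st_mtime,
        reverse=True,
    )
    stale = backups[keep:]
    for path in stale:
        path.unlink()
    return stale
```

Returning the removed paths makes the function easy to log from a cron job or Kubernetes CronJob wrapper and easy to assert on in a test.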

6h. Observability Reference (docs/observability.md) — Req 15

  • /metrics endpoint on query-api, Prometheus scrape configuration
  • All metrics from services/shared/metrics.py:
    • Ingestion: stonks_ingestion_jobs_total, stonks_ingestion_items_fetched_total, stonks_ingestion_items_new_total, stonks_ingestion_items_deduped_total, stonks_ingestion_errors_total, stonks_ingestion_adapter_duration_seconds
    • Parsing: stonks_parse_jobs_total, stonks_parse_quality_score, stonks_parse_low_quality_total, stonks_parse_duration_seconds
    • Extraction: stonks_extraction_jobs_total, stonks_extraction_attempts_total, stonks_extraction_retries_total, stonks_extraction_duration_seconds, stonks_extraction_confidence, stonks_extraction_validation_errors_total, stonks_extraction_tokens_total
    • Aggregation: stonks_aggregation_windows_total, stonks_aggregation_signals_total, stonks_aggregation_contradiction_score, stonks_aggregation_duration_seconds
    • Recommendation: stonks_recommendations_total, stonks_recommendations_suppressed_total, stonks_recommendation_confidence
    • Lake: stonks_lake_facts_published_total, stonks_lake_publish_duration_seconds, stonks_lake_publish_errors_total, stonks_lake_publish_bytes_total
    • Trading: stonks_orders_submitted_total, stonks_orders_rejected_total, stonks_orders_filled_total, stonks_orders_duplicates_prevented_total, stonks_risk_evaluations_total, stonks_risk_check_failures_total, stonks_positions_synced_total
    • Alerting: stonks_alerts_fired_total, stonks_alerts_resolved_total, stonks_alert_check_duration_seconds, stonks_alert_active
    • DLQ: stonks_dlq_items_total, stonks_dlq_replayed_total, stonks_dlq_depth
    • Active: stonks_active_jobs
  • Alerting module (services/shared/alerting.py): 4 alert rules (source_failures, schema_failure_spike, analytical_lag, broker_issues), thresholds, evaluation windows, ConfigMap variables
  • Structured JSON logging format, trace context (trace_id, span_id)
  • Dead-letter queue system: queue names (stonks:dlq:<queue>), routing, replay tooling
  • Recommended Prometheus/Grafana queries
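
For the structured JSON logging item above, a minimal formatter built on the standard logging module might look like this. Only trace_id and span_id come from the design; the other field names and the build_logger helper are assumptions, and the project's actual log schema may differ.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Trace context arrives through the `extra` argument, if provided.
            "trace_id": getattr(record, "trace_id", None),
            "span_id": getattr(record, "span_id", None),
        }
        return json.dumps(payload)


def build_logger(name: str, stream) -> logging.Logger:
    """Create a logger that writes JSON lines to the given stream."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    buffer = io.StringIO()
    log = build_logger("stonks.example", buffer)
    log.info("job finished", extra={"trace_id": "t-1", "span_id": "s-1"})
    print(buffer.getvalue(), end="")
```

Passing trace context through extra keeps call sites unchanged for code that has no trace to report; those records simply carry null values.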

6i. README Update — Req 16

  • Add "Documentation" section with links to all docs
  • Replace ASCII architecture diagram with Mermaid or link to diagram docs
  • Preserve all existing content (license, features, tech stack, project structure, deployment)

Data Models

No new database tables or schema changes are introduced. This initiative works with existing tables:

Tables referenced in test coverage work:

  • sources, companies, company_aliases — scheduler source polling
  • ingestion_runs — scheduler run tracking, ingestion job recording
  • documents, document_company_mentions — ingestion persistence, stale document recovery
  • document_intelligence, document_impact_records — extractor test fixtures
  • model_performance_metrics — extractor schema validation metrics

Tables documented (not modified):

  • All tables listed above plus trend_windows, trend_history, trend_projections, recommendations, recommendation_evidence, risk_evaluations, orders, order_events, positions, portfolio_snapshots, trading_decisions, circuit_breaker_events, reserve_pool_ledger, risk_tier_history, backtest_runs, backtest_trades, notifications, global_events, macro_impact_records, exposure_profiles, competitor_relationships, competitive_signal_records, ai_agents, agent_variants, agent_performance_log, audit_events, watchlists, watchlist_members, retention_policies, market_snapshots

Error Handling

Test Coverage

  • Mock failures: Unit tests must verify that scheduler and ingestion services handle database/Redis connection failures gracefully (no crashes, proper logging).
  • Adapter errors: Ingestion unit tests must verify retry logic with exponential backoff and dead-letter queue routing after retry exhaustion.
  • Test fix approach: When fixing pre-existing failures, prefer fixing test setup over changing production code. If production code changes are needed, add regression tests to prevent re-introduction.

Docker Compose

  • Health check failures: Application services use depends_on with condition: service_healthy to wait for infrastructure. Health checks have interval, timeout, retries, and start_period configured.
  • Missing .env file: Services that need API keys (ingestion, broker-adapter, trading-engine) will start but log warnings about missing keys. The platform runs in a degraded mode without external API access.
  • Build failures: Each service uses the same base Dockerfile with SERVICE_CMD build arg. Build errors are isolated per service.
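
The degraded-mode behavior described for a missing .env file could be implemented along these lines. The variable names come from the .env section of this design; the helper names and the warning text are hypothetical, and the services may handle this differently.

```python
import logging
import os

# API-related variables that the .env file is expected to provide.
API_KEYS = (
    "MARKET_DATA_API_KEY",
    "BROKER_API_KEY",
    "BROKER_API_SECRET",
    "BROKER_BASE_URL",
)


def missing_api_keys(environ=None) -> list:
    """Return the names of API variables that are unset or empty."""
    env = os.environ if environ is None else environ
    return [key for key in API_KEYS if not env.get(key)]


def warn_if_degraded(logger: logging.Logger, environ=None) -> bool:
    """Log one warning per missing variable; return True when running degraded."""
    missing = missing_api_keys(environ)
    for key in missing:
        logger.warning("%s is not set; external API access is disabled", key)
    return bool(missing)
```

Calling warn_if_degraded once at startup lets a service keep running while making the missing configuration visible in its logs.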

Documentation

  • Stale documentation: Documentation is generated from source code inspection. If the codebase changes after documentation is written, the docs may drift. The README links section serves as a single index to find and update docs.
  • Diagram accuracy: Mermaid diagrams are hand-authored based on current architecture. They should be updated when services are added or removed.

Testing Strategy

PBT Applicability Assessment

Property-based testing is NOT applicable to this feature. The work consists of:

  1. Unit tests for existing services — These are example-based tests with mocked dependencies, not pure functions with universal properties.
  2. Fixing pre-existing test failures — Bug fixes to existing tests/code.
  3. Docker Compose configuration — Declarative infrastructure configuration.
  4. Documentation — Markdown files with no executable logic.

None of these involve new pure functions, parsers, serializers, or business logic where PBT would add value. The existing test_pbt_* files (22 files covering trading, aggregation, competitive intelligence, etc.) already provide PBT coverage for the platform's core logic and must remain passing.

Unit Testing Strategy

New test files:

  • tests/test_scheduler_unit.py — 8+ test cases covering all scheduler pure functions and the schedule_cycle orchestration with mocked dependencies.
  • tests/test_ingestion_unit.py — 6+ test cases covering adapter error handling, retry logic, deduplication, and dead-letter queue routing.

Test fix files (existing, to be repaired):

  • tests/test_extractor_prompts.py
  • tests/test_extractor_schemas.py
  • tests/test_ollama_client.py
  • tests/test_filings_adapter.py

Test framework: pytest + pytest-asyncio (already configured in the project).

Mocking approach: unittest.mock.AsyncMock for async dependencies, unittest.mock.MagicMock for sync dependencies, unittest.mock.patch for module-level state.

Verification Criteria

  1. pytest tests/ -x --tb=short -q → zero failures
  2. ruff check services/ → zero violations
  3. All 22 existing test_pbt_* files pass unchanged
  4. docker compose config validates the updated docker-compose.yml
  5. All documentation files render valid Markdown with working internal links