# Design Document: Comprehensive Quality & Documentation

## Overview

This design covers three pillars for the Stonks Oracle platform:

1. **Test Coverage** — Close unit test gaps in the scheduler and ingestion services, fix pre-existing test failures in the extractor module, and achieve a fully green test suite (Requirements 1–4).
2. **Docker Deployment** — Extend `docker-compose.yml` to include all 13 application services plus the frontend, enabling full-platform local development without Kubernetes (Requirement 5).
3. **Documentation** — Produce comprehensive documentation covering per-service features, API references, Helm chart configuration, Docker deployment, three Mermaid architecture diagrams, AI agent building, backup/restore, observability, and README resource links (Requirements 6–16).

### Design Rationale

The platform has mature production code across 13 services but uneven test coverage and documentation. The scheduler and ingestion services lack dedicated unit tests — their logic is only exercised through integration tests. Four extractor-related test files have pre-existing failures that block CI. Documentation exists only as a local dev setup guide, a pipeline overview, and a runbook. This initiative fills those gaps systematically.

The approach prioritizes:
- **Test isolation**: Mock all external dependencies (PostgreSQL, Redis, MinIO, Ollama) so unit tests run fast and deterministically.
- **Documentation from source**: Generate API references by inspecting actual FastAPI route definitions, Helm values from `values.yaml`, and metrics from `services/shared/metrics.py`.
- **Docker parity with Kubernetes**: Mirror the Helm chart's service definitions in Docker Compose so both deployment modes stay in sync.

## Architecture

The work does not change the platform's runtime architecture. It adds:

1. **New test files** in `tests/` for scheduler and ingestion unit tests.
2. **Fixes** to existing test files and/or production code to resolve failures.
3. **New service definitions** in `docker-compose.yml` using the existing `docker/Dockerfile` with `SERVICE_CMD` build args.
4. **New documentation files** in `docs/` organized by topic.
5. **Updated `README.md`** with a documentation index and Mermaid diagram.

```mermaid
graph TD
    subgraph "Test Coverage (Reqs 1-4)"
        T1[tests/test_scheduler_unit.py]
        T2[tests/test_ingestion_unit.py]
        T3[Fix test_extractor_prompts.py]
        T4[Fix test_extractor_schemas.py]
        T5[Fix test_ollama_client.py]
        T6[Fix test_filings_adapter.py]
    end

    subgraph "Docker (Req 5)"
        D1[docker-compose.yml<br/>+ 13 app services + frontend]
    end

    subgraph "Documentation (Reqs 6-16)"
        DOC1[docs/services.md]
        DOC2[docs/api-reference.md]
        DOC3[docs/helm-reference.md]
        DOC4[docs/docker-deployment.md]
        DOC5[docs/architecture-kubernetes.md]
        DOC6[docs/architecture-docker-compose.md]
        DOC7[docs/architecture-data-pipeline.md]
        DOC8[docs/ai-agents.md]
        DOC9[docs/backup-restore.md]
        DOC10[docs/observability.md]
        DOC11[README.md update]
    end
```

## Components and Interfaces

### 1. Scheduler Unit Tests (Requirement 1)

**Target module**: `services/scheduler/app.py`

**Functions to test in isolation**:
- `get_cadence_for_source(source_type, config)` — Returns polling interval from config or defaults.
- `compute_backoff(retry_count)` — Exponential backoff with cap.
- `is_source_due(...)` — Core scheduling logic: determines if a source needs polling based on last run status, timing, retry state.
- `build_job_payload(source, aliases, now)` — Constructs the ingestion job dict.
- `schedule_cycle(pool, rds)` — Full scheduling pass (mocked DB/Redis).
- `check_rate_limit(rds, source_type, now)` — Rate limiting with per-type and global Polygon limits.
- `recover_stale_documents(pool, rds)` — Re-enqueue orphaned parsed documents.
- `retry_failed_extractions(pool, rds)` — Re-enqueue failed extractions.
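
To illustrate the kind of assertion a test for `compute_backoff` might make, here is a minimal sketch. The `base_seconds` and `cap_seconds` parameters and their values are assumptions made for the example; the real function in `services/scheduler/app.py` may use different constants or a different signature.

```python
def compute_backoff(retry_count: int, base_seconds: int = 60, cap_seconds: int = 3600) -> int:
    """Stand-in for the scheduler helper: double the delay per retry, up to a cap."""
    if retry_count < 0:
        raise ValueError("retry_count must be non-negative")
    return min(cap_seconds, base_seconds * 2 ** retry_count)


def test_backoff_doubles_until_capped():
    delays = [compute_backoff(n) for n in range(8)]
    # Delays never decrease and never exceed the cap.
    assert delays == sorted(delays)
    assert max(delays) == 3600
    # The first retries follow the doubling pattern.
    assert delays[:3] == [60, 120, 240]
```

A test of this shape checks both the growth pattern and the cap without touching PostgreSQL or Redis.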

**Mocking strategy**:
- `asyncpg.Pool` → `AsyncMock` with `.fetch()`, `.fetchrow()`, `.fetchval()`, `.execute()` returning canned records.
- `redis.asyncio.Redis` → `AsyncMock` with `.rpush()`, `.set()`, `.get()`, `.incr()`, `.expire()`, `.decr()`, `.delete()` tracking calls.
- Use `unittest.mock.patch` for module-level imports where needed.

**Test file**: `tests/test_scheduler_unit.py`
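
The mocking strategy above could look roughly like the following sketch. `enqueue_due_sources` is a hypothetical, much simplified stand-in for a scheduling pass and is not the real `schedule_cycle`; only the `AsyncMock` usage is the point here.

```python
import asyncio
from unittest.mock import AsyncMock


async def enqueue_due_sources(pool, rds, queue: str = "stonks:queue:ingestion") -> int:
    """Hypothetical scheduling pass: fetch sources, push one job per source."""
    rows = await pool.fetch("SELECT id, source_type FROM sources")
    for row in rows:
        await rds.rpush(queue, str(row["id"]))
    return len(rows)


def make_mocks(rows):
    # Child attributes of an AsyncMock are AsyncMocks, so .fetch() and .rpush() are awaitable.
    pool = AsyncMock()
    pool.fetch.return_value = rows
    rds = AsyncMock()
    return pool, rds


def run_example():
    rows = [{"id": 1, "source_type": "news"}, {"id": 2, "source_type": "filings"}]
    pool, rds = make_mocks(rows)
    count = asyncio.run(enqueue_due_sources(pool, rds))
    # await_count records how many times the mocked rpush was awaited.
    return count, rds.rpush.await_count
```

In the real test file the same pattern would be driven through pytest-asyncio rather than `asyncio.run`.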

### 2. Ingestion Unit Tests (Requirement 2)

**Target module**: `services/ingestion/worker.py`

**Functions to test**:
- `process_job(job, pool, rds, minio_client, adapters)` — Main job processing with various adapter outcomes.
- Error handling paths: adapter returns `AdapterResult(error=...)`, retry exhaustion, dead-letter routing.
- Deduplication: content hash already seen in Redis, cross-source document dedup via `dedupe_items`.

**Mocking strategy**:
- Adapters → `AsyncMock` returning `AdapterResult` with controlled `error`, `items`, `content_hash`, `raw_payload`.
- `asyncpg.Pool` → `AsyncMock` for `ingestion_runs` INSERT/UPDATE, `persist_ingestion_items`, `record_retrieval_failure`.
- `redis.asyncio.Redis` → `AsyncMock` for dedupe checks, queue pushes, DLQ routing.
- `minio.Minio` → `MagicMock` for `upload_raw_artifact`.

**Test file**: `tests/test_ingestion_unit.py`
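
A minimal sketch of the retry and dead-letter paths described above. Everything here is illustrative: `handle_job`, the `adapter.fetch` method name, `MAX_RETRIES`, the simplified `AdapterResult`, and the concrete DLQ name are assumptions, not the actual `process_job` implementation.

```python
import asyncio
from dataclasses import dataclass, field
from typing import List, Optional
from unittest.mock import AsyncMock

MAX_RETRIES = 3  # assumed retry budget for this sketch


@dataclass
class AdapterResult:
    """Simplified stand-in for the real AdapterResult."""
    error: Optional[str] = None
    items: List[dict] = field(default_factory=list)


async def handle_job(job: dict, adapter, rds) -> str:
    """Hypothetical job handler: retry on adapter error, dead-letter when exhausted."""
    result = await adapter.fetch(job)
    if result.error is None:
        return "ok"
    attempts = job.get("retry_count", 0) + 1
    if attempts >= MAX_RETRIES:
        await rds.rpush("stonks:dlq:ingestion", job["id"])
        return "dead_lettered"
    await rds.rpush("stonks:queue:ingestion", job["id"])
    return "retried"


def run_case(retry_count: int):
    adapter = AsyncMock()
    adapter.fetch.return_value = AdapterResult(error="HTTP 503")
    rds = AsyncMock()
    job = {"id": "job-1", "retry_count": retry_count}
    outcome = asyncio.run(handle_job(job, adapter, rds))
    # Return the outcome and the queue name passed to the last rpush call.
    return outcome, rds.rpush.await_args.args[0]
```

Asserting on the queue name passed to `rpush` lets a test distinguish a retry from dead-letter routing without a live Redis.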

### 3. Extractor Test Fixes (Requirement 3)

**Target files**:
- `tests/test_extractor_prompts.py`
- `tests/test_extractor_schemas.py`
- `tests/test_ollama_client.py`
- `tests/test_filings_adapter.py`

**Approach**: Run each file individually, diagnose failures, and fix either the test setup (mock configuration, fixture data) or the production code. Preserve original test intent and assertions. If production code changes are needed, add regression tests.

### 4. Full Test Suite Green (Requirement 4)

**Verification**: Run `pytest tests/ -x --tb=short -q` and `ruff check services/` after all fixes. All existing `test_pbt_*` files must remain passing. Any production code fix must include a regression test.

### 5. Docker Compose Application Services (Requirement 5)

**Current state**: `docker-compose.yml` defines 8 infrastructure services (postgres, redis, minio, minio-init, ollama, trino, hive-metastore, superset).

**Addition**: 13 new service definitions (12 backend services plus the frontend dashboard), as listed in the table below:

| Service | Image Build | Command | Port | Depends On |
|---------|------------|---------|------|------------|
| scheduler | `docker/Dockerfile` | `python -m services.scheduler.app` | — | postgres, redis |
| symbol-registry | `docker/Dockerfile` | `uvicorn services.symbol_registry.app:app --host 0.0.0.0 --port 8000` | 8001:8000 | postgres |
| ingestion | `docker/Dockerfile` | `python -m services.ingestion.worker` | — | postgres, redis, minio |
| parser | `docker/Dockerfile` | `python -m services.parser.worker` | — | postgres, redis |
| extractor | `docker/Dockerfile` | `python -m services.extractor.main` | — | postgres, redis, ollama |
| aggregation | `docker/Dockerfile` | `python -m services.aggregation.main` | — | postgres, redis |
| recommendation | `docker/Dockerfile` | `python -m services.recommendation.main` | — | postgres, redis |
| trading-engine | `docker/Dockerfile` | `uvicorn services.trading.app:app --host 0.0.0.0 --port 8000` | 8002:8000 | postgres, redis |
| risk-engine | `docker/Dockerfile` | `uvicorn services.risk.app:app --host 0.0.0.0 --port 8000` | 8003:8000 | postgres |
| broker-adapter | `docker/Dockerfile` | `python -m services.adapters.broker_service` | — | postgres, redis |
| lake-publisher | `docker/Dockerfile` | `python -m services.lake_publisher.jobs` | — | postgres, minio |
| query-api | `docker/Dockerfile` | `uvicorn services.api.app:app --host 0.0.0.0 --port 8000` | 8004:8000 | postgres, redis, minio |
| dashboard | `frontend/Dockerfile` | nginx (built-in) | 3000:8080 | query-api |

**Common environment block** (shared via `x-app-env` YAML anchor):
```yaml
POSTGRES_HOST: postgres
POSTGRES_PORT: "5432"
POSTGRES_DB: stonks
POSTGRES_USER: stonks
POSTGRES_PASSWORD: stonks_dev
REDIS_HOST: redis
REDIS_PORT: "6379"
MINIO_ENDPOINT: minio:9000
MINIO_ACCESS_KEY: minioadmin
MINIO_SECRET_KEY: minioadmin
OLLAMA_BASE_URL: http://ollama:11434
```
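
As a rough illustration of how a service might turn this shared block into connection strings, consider the sketch below. The helper names are hypothetical and the real logic in `services/shared/config.py` may differ; only the variable names and development values come from the block above.

```python
import os
from typing import Mapping

# Development defaults mirroring the shared x-app-env block.
DEV_ENV = {
    "POSTGRES_HOST": "postgres",
    "POSTGRES_PORT": "5432",
    "POSTGRES_DB": "stonks",
    "POSTGRES_USER": "stonks",
    "POSTGRES_PASSWORD": "stonks_dev",
    "REDIS_HOST": "redis",
    "REDIS_PORT": "6379",
}


def postgres_dsn(env: Mapping[str, str] = os.environ) -> str:
    """Hypothetical helper: build a PostgreSQL DSN from the shared variables."""
    return "postgresql://{user}:{password}@{host}:{port}/{db}".format(
        user=env["POSTGRES_USER"],
        password=env["POSTGRES_PASSWORD"],
        host=env["POSTGRES_HOST"],
        port=env["POSTGRES_PORT"],
        db=env["POSTGRES_DB"],
    )


def redis_url(env: Mapping[str, str] = os.environ) -> str:
    """Hypothetical helper: build a Redis URL from the shared variables."""
    return "redis://{host}:{port}".format(host=env["REDIS_HOST"], port=env["REDIS_PORT"])
```

Because the hostnames are Compose service names, these strings resolve only inside the Compose network.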

**`.env` file support**: `MARKET_DATA_API_KEY`, `BROKER_API_KEY`, `BROKER_API_SECRET`, `BROKER_BASE_URL` loaded via `env_file: .env` on services that need them (ingestion, broker-adapter, trading-engine).

**Health checks**: FastAPI services use `curl -f http://localhost:8000/health`; workers use process liveness checks. Infrastructure `depends_on` uses `condition: service_healthy`.

### 6. Documentation Structure (Requirements 6–16)

All documentation files are Markdown in `docs/`. The structure:

```
docs/
├── services.md                      # Req 6: Per-service feature docs
├── api-reference.md                 # Req 7: All 4 FastAPI API references
├── helm-reference.md                # Req 8: Helm chart values reference
├── docker-deployment.md             # Req 9: Docker deployment guide
├── architecture-kubernetes.md       # Req 10: K8s Mermaid diagram
├── architecture-docker-compose.md   # Req 11: Docker Compose Mermaid diagram
├── architecture-data-pipeline.md    # Req 12: Data pipeline Mermaid diagram
├── ai-agents.md                     # Req 13: AI agent building guide
├── backup-restore.md                # Req 14: Backup and restore guide
├── observability.md                 # Req 15: Observability & metrics reference
├── LOCAL_DEV_SETUP.md               # (existing)
├── llm-to-trade-pipeline.md         # (existing)
└── notes/
    └── runbook.md                   # (existing)
```

#### 6a. Service Feature Documentation (`docs/services.md`) — Req 6

For each of the 13 services, document:
- **Purpose**: What the service does in the pipeline.
- **Entry point**: Module path (e.g., `services.scheduler.app`).
- **Configuration**: Environment variables from `services/shared/config.py` relevant to this service.
- **Database tables**: Tables read/written by this service.
- **Redis queues**: Queue names consumed from and published to (from `services/shared/redis_keys.py`).
- **Queue message schema**: JSON structure of messages.
- **Signal layers**: For aggregation/recommendation, document the three signal layers (company, macro, competitive), their toggles (`macro_enabled`, `competitive_enabled` in `risk_configs`), and weight configurations.
- **Trading engine features**: For the trading service, document position sizing, circuit breakers, reserve pool, risk tier auto-adjustment, backtesting, and notification configuration.

Queue topology reference (from `redis_keys.py`):

| Queue | Producer | Consumer |
|-------|----------|----------|
| `stonks:queue:ingestion` | scheduler | ingestion |
| `stonks:queue:parsing` | ingestion | parser |
| `stonks:queue:extraction` | parser | extractor |
| `stonks:queue:macro_classification` | parser, scheduler | extractor |
| `stonks:queue:aggregation` | extractor | aggregation |
| `stonks:queue:recommendation` | aggregation | recommendation |
| `stonks:queue:lake_publish` | various | lake-publisher |
| `stonks:queue:broker_orders` | trading-engine, trading API | broker-adapter |
| `stonks:queue:trading_decisions` | recommendation | trading-engine |
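
One way to keep this table honest is to encode it as data and query it in a small check. The sketch below copies the table into a dictionary; the `consumers_of` helper is hypothetical and exists only for illustration.

```python
# Queue name -> (producers, consumer), copied from the table above.
QUEUE_TOPOLOGY = {
    "stonks:queue:ingestion": (["scheduler"], "ingestion"),
    "stonks:queue:parsing": (["ingestion"], "parser"),
    "stonks:queue:extraction": (["parser"], "extractor"),
    "stonks:queue:macro_classification": (["parser", "scheduler"], "extractor"),
    "stonks:queue:aggregation": (["extractor"], "aggregation"),
    "stonks:queue:recommendation": (["aggregation"], "recommendation"),
    "stonks:queue:lake_publish": (["various"], "lake-publisher"),
    "stonks:queue:broker_orders": (["trading-engine", "trading API"], "broker-adapter"),
    "stonks:queue:trading_decisions": (["recommendation"], "trading-engine"),
}


def consumers_of(producer: str) -> list:
    """Return the sorted, de-duplicated services that consume what `producer` publishes."""
    consumers = {
        consumer
        for producers, consumer in QUEUE_TOPOLOGY.values()
        if producer in producers
    }
    return sorted(consumers)
```

For example, `consumers_of("scheduler")` returns `["extractor", "ingestion"]`, reflecting that the scheduler feeds both the ingestion and macro classification queues.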

#### 6b. API Reference (`docs/api-reference.md`) — Req 7

Document all endpoints from the four FastAPI services by inspecting their route definitions:

**Query API** (`services/api/app.py`): 40+ endpoints covering companies, documents, trends, recommendations, evidence drill-down, orders, positions, portfolio, global events, macro impacts, competitive signals, trend projections, agents, dead-letter queues, pipeline control, SQL explorer, saved queries, audit trail, DevOps metrics, and Prometheus metrics.

**Symbol Registry API** (`services/symbol_registry/app.py`): Companies CRUD, aliases, watchlists, sources, exposure profiles, competitor relationships, competitor inference.

**Trading API** (`services/trading/app.py`): Health/readiness, engine status, config update, pause/resume, reset, decisions audit, performance metrics/history, backtesting, notifications config/history, override orders, debug state.

**Risk API** (`services/risk/app.py`): Order evaluation (`POST /evaluate`), health, pending approvals, approval review, approval expiration.

For each endpoint: method, path, query parameters (type, default, constraints), request body schema, response schema, error codes (4xx/5xx).
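
Since the references are meant to be derived from the route definitions, a small script could turn an OpenAPI document into a Markdown endpoint table. The sketch below works on a plain dictionary in OpenAPI shape; the `endpoint_rows` helper and the sample spec are illustrative assumptions, not existing project code.

```python
HTTP_METHODS = ("get", "post", "put", "patch", "delete")


def endpoint_rows(spec: dict) -> list:
    """Render the `paths` section of an OpenAPI dict as Markdown table rows."""
    rows = ["| Method | Path | Summary |", "|--------|------|---------|"]
    for path in sorted(spec.get("paths", {})):
        operations = spec["paths"][path]
        for method in HTTP_METHODS:
            if method in operations:
                summary = operations[method].get("summary", "")
                rows.append(f"| {method.upper()} | `{path}` | {summary} |")
    return rows


# Hand-written sample in OpenAPI shape, loosely modeled on the Risk API.
SAMPLE_SPEC = {
    "paths": {
        "/health": {"get": {"summary": "Health check"}},
        "/evaluate": {"post": {"summary": "Evaluate an order"}},
    }
}
```

For a FastAPI application object, `app.openapi()` returns a dictionary of this shape, so the same function could be pointed at each of the four services.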

#### 6c. Helm Chart Reference (`docs/helm-reference.md`) — Req 8

Document from `infra/helm/stonks-oracle/values.yaml`:
- `image` block: registry, pullPolicy, tag
- `pipelineEnabled`: toggle and effect on worker replicas
- `services` block: per-service structure (replicas, image, command, tier, port, secrets, resources, probes)
- `config` block: all ConfigMap environment variables with defaults and descriptions
- `secrets` block: core, broker, market, gmail, dashboard — injection via `--set` flags
- `ingress` block: className, clusterIssuer, host mappings
- Analytics stack: trino, hiveMetastore, superset toggles and resources
- `networkPolicies.enabled`: default-deny-ingress behavior
- Value override files: `values-beta.yaml`, `values-paper.yaml` and their deployment stages

#### 6d. Docker Deployment Guide (`docs/docker-deployment.md`) — Req 9

- Complete service inventory with images, ports, volumes, environment variables
- `.env` file format with all required/optional variables
- Volume mounts and data persistence (pgdata, miniodata, ollama_models, hive_data, superset_data)
- Health check configurations
- Dockerfile build arguments (`SERVICE_CMD`)
- Operational commands: start, stop, restart, logs, scale, reset (`docker compose down -v`)

#### 6e. Architecture Diagrams (Reqs 10–12)

**Kubernetes diagram** (`docs/architecture-kubernetes.md`):
- `stonks-oracle` namespace with all 13 services grouped by tier (api, processing, trading, orchestration, analytics, frontend)
- External cluster services in their namespaces (postgresql-service, redis-service, minio-service, ollama-service)
- Traefik ingress routes to external domains
- Network policy boundaries
- Analytics plane (Trino, Hive Metastore, Superset)
- Helm-managed secrets (core, broker, market, gmail) with consumer mapping
- Service tier distinction (API with ingress, pipeline workers, trading)

**Docker Compose diagram** (`docs/architecture-docker-compose.md`):
- All infrastructure + application containers
- Host port mappings
- `depends_on` relationships and health check dependencies
- Named volumes and mount points
- `.env` file providing API keys
- Internal Docker network connectivity

**Data Pipeline diagram** (`docs/architecture-data-pipeline.md`):
- External sources → ingestion → parsing → extraction → aggregation → recommendation → risk → trading → broker
- Redis queue topology with queue names
- Three signal layers as distinct paths merging at aggregation
- Data stores at each stage (MinIO, PostgreSQL, Redis)
- Trading engine decision loop
- Analytical branch (lake publisher → MinIO/Parquet → Trino → Superset/Dashboard)
- External integrations (Ollama, Alpaca, AWS SNS, Gmail)

#### 6f. AI Agent Guide (`docs/ai-agents.md`) — Req 13

- Three built-in agents: document-extractor, event-classifier, thesis-rewriter
- Per-agent: purpose, input data, output schema, default model, system prompt structure, user prompt template
- `ai_agents` table schema and registration (system-seeded vs API-created)
- `agent_variants` table: create, activate, deactivate variants for A/B testing
- `AgentConfigResolver` module: TTL cache (60s default), COALESCE-based variant override, fallback behavior
- Performance logging: `agent_performance_log` table, querying for variant comparison
- API endpoints: CRUD on `/api/agents`, test endpoint `/api/agents/{id}/test`
- Step-by-step guide: creating a new variant with different model/prompt and activating it
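
The caching and override behavior attributed to `AgentConfigResolver` can be pictured with the sketch below. It is not the real module: `TTLCache`, `resolve_config`, and the field names are hypothetical, and the COALESCE semantics are emulated in Python rather than SQL.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class TTLCache:
    """Tiny time-to-live cache; entries expire `ttl_seconds` after being stored."""

    def __init__(self, ttl_seconds: float = 60.0, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so tests can control time
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def put(self, key: str, value: Any) -> None:
        self._entries[key] = (self._clock() + self._ttl, value)

    def get(self, key: str) -> Optional[Any]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # drop the stale entry
            return None
        return value


def resolve_config(base: dict, variant: Optional[dict]) -> dict:
    """COALESCE-style merge: a variant value wins unless it is missing or None."""
    if not variant:
        return dict(base)
    return {
        key: variant[key] if variant.get(key) is not None else value
        for key, value in base.items()
    }
```

Injecting the clock keeps expiry tests deterministic, which fits the test isolation goal stated earlier in this document.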

#### 6g. Backup & Restore Guide (`docs/backup-restore.md`) — Req 14

Scripts in `scripts/`:
- `backup-db.sh`: PostgreSQL dump, CLI args, storage location, retention (keeps last 7)
- `restore-db.sh`: PostgreSQL restore, service scale-down/up, data loss implications
- `backup-redis.sh`: Redis RDB snapshot backup
- `backup.sh`: Combined backup (DB + Redis), `--upload-minio` option
- `restore.sh`: Combined restore
- Full nuke-and-rebuild procedure (connection termination, DB drop, Redis flush, redeploy, re-seed)
- Recommended backup schedules and automation (cron, Kubernetes CronJobs)
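
The "keeps last 7" retention rule can be expressed compactly. The scripts themselves are shell, so the following is only an illustration of the rule; the function name and the timestamped file naming scheme are assumptions.

```python
def backups_to_prune(filenames, keep: int = 7):
    """Return the backups that fall outside the newest `keep` files.

    Assumes file names embed a sortable timestamp, so lexical order is
    chronological order.
    """
    if keep <= 0:
        raise ValueError("keep must be positive")
    ordered = sorted(filenames)
    return ordered[:-keep] if len(ordered) > keep else []


# Hypothetical file names: nine daily dumps.
EXAMPLE_BACKUPS = [f"stonks-2024-01-{day:02d}.sql.gz" for day in range(1, 10)]
```

With nine daily dumps and the default of seven, the two oldest files are selected for deletion.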

#### 6h. Observability Reference (`docs/observability.md`) — Req 15

- `/metrics` endpoint on query-api, Prometheus scrape configuration
- All metrics from `services/shared/metrics.py`:
  - **Ingestion**: `stonks_ingestion_jobs_total`, `stonks_ingestion_items_fetched_total`, `stonks_ingestion_items_new_total`, `stonks_ingestion_items_deduped_total`, `stonks_ingestion_errors_total`, `stonks_ingestion_adapter_duration_seconds`
  - **Parsing**: `stonks_parse_jobs_total`, `stonks_parse_quality_score`, `stonks_parse_low_quality_total`, `stonks_parse_duration_seconds`
  - **Extraction**: `stonks_extraction_jobs_total`, `stonks_extraction_attempts_total`, `stonks_extraction_retries_total`, `stonks_extraction_duration_seconds`, `stonks_extraction_confidence`, `stonks_extraction_validation_errors_total`, `stonks_extraction_tokens_total`
  - **Aggregation**: `stonks_aggregation_windows_total`, `stonks_aggregation_signals_total`, `stonks_aggregation_contradiction_score`, `stonks_aggregation_duration_seconds`
  - **Recommendation**: `stonks_recommendations_total`, `stonks_recommendations_suppressed_total`, `stonks_recommendation_confidence`
  - **Lake**: `stonks_lake_facts_published_total`, `stonks_lake_publish_duration_seconds`, `stonks_lake_publish_errors_total`, `stonks_lake_publish_bytes_total`
  - **Trading**: `stonks_orders_submitted_total`, `stonks_orders_rejected_total`, `stonks_orders_filled_total`, `stonks_orders_duplicates_prevented_total`, `stonks_risk_evaluations_total`, `stonks_risk_check_failures_total`, `stonks_positions_synced_total`
  - **Alerting**: `stonks_alerts_fired_total`, `stonks_alerts_resolved_total`, `stonks_alert_check_duration_seconds`, `stonks_alert_active`
  - **DLQ**: `stonks_dlq_items_total`, `stonks_dlq_replayed_total`, `stonks_dlq_depth`
  - **Active**: `stonks_active_jobs`
- Alerting module (`services/shared/alerting.py`): 4 alert rules (source_failures, schema_failure_spike, analytical_lag, broker_issues), thresholds, evaluation windows, ConfigMap variables
- Structured JSON logging format, trace context (trace_id, span_id)
- Dead-letter queue system: queue names (`stonks:dlq:<queue>`), routing, replay tooling
- Recommended Prometheus/Grafana queries
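
To make the structured logging item concrete, here is a minimal sketch of a JSON formatter that carries `trace_id` and `span_id`. The field set and the `JsonFormatter` class are assumptions for illustration; the platform's actual log schema should be taken from its logging module.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Trace context is optional; it is attached via `extra=` by the caller.
            "trace_id": getattr(record, "trace_id", None),
            "span_id": getattr(record, "span_id", None),
        }
        return json.dumps(payload)


def format_example() -> dict:
    record = logging.LogRecord(
        name="ingestion",
        level=logging.INFO,
        pathname="example.py",
        lineno=0,
        msg="job %s done",
        args=("42",),
        exc_info=None,
    )
    record.trace_id = "abc123"
    record.span_id = "def456"
    return json.loads(JsonFormatter().format(record))
```

In application code the same fields would typically be supplied with `logger.info("...", extra={"trace_id": ..., "span_id": ...})`.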

#### 6i. README Update — Req 16

- Add "Documentation" section with links to all docs
- Replace ASCII architecture diagram with Mermaid or link to diagram docs
- Preserve all existing content (license, features, tech stack, project structure, deployment)

## Data Models

No new database tables or schema changes are introduced. This initiative works with existing tables:

**Tables referenced in test coverage work**:
- `sources`, `companies`, `company_aliases` — scheduler source polling
- `ingestion_runs` — scheduler run tracking, ingestion job recording
- `documents`, `document_company_mentions` — ingestion persistence, stale document recovery
- `document_intelligence`, `document_impact_records` — extractor test fixtures
- `model_performance_metrics` — extractor schema validation metrics

**Tables documented** (not modified):
- All tables listed above plus `trend_windows`, `trend_history`, `trend_projections`, `recommendations`, `recommendation_evidence`, `risk_evaluations`, `orders`, `order_events`, `positions`, `portfolio_snapshots`, `trading_decisions`, `circuit_breaker_events`, `reserve_pool_ledger`, `risk_tier_history`, `backtest_runs`, `backtest_trades`, `notifications`, `global_events`, `macro_impact_records`, `exposure_profiles`, `competitor_relationships`, `competitive_signal_records`, `ai_agents`, `agent_variants`, `agent_performance_log`, `audit_events`, `watchlists`, `watchlist_members`, `retention_policies`, `market_snapshots`

## Error Handling

### Test Coverage
- **Mock failures**: Unit tests must verify that scheduler and ingestion services handle database/Redis connection failures gracefully (no crashes, proper logging).
- **Adapter errors**: Ingestion unit tests must verify retry logic with exponential backoff and dead-letter queue routing after retry exhaustion.
- **Test fix approach**: When fixing pre-existing failures, prefer fixing test setup over changing production code. If production code changes are needed, add regression tests to prevent re-introduction.
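
A failure-path test for the first bullet might be structured like this sketch. `safe_cycle` is a hypothetical wrapper, and the built-in `ConnectionError` stands in for whatever exceptions the real database and Redis clients raise.

```python
import asyncio
import logging
from unittest.mock import AsyncMock

logger = logging.getLogger("scheduler.example")


async def safe_cycle(pool, rds) -> bool:
    """Hypothetical scheduling pass that logs and returns False instead of crashing."""
    try:
        rows = await pool.fetch("SELECT id FROM sources")
        for row in rows:
            await rds.rpush("stonks:queue:ingestion", str(row["id"]))
        return True
    except OSError as exc:  # ConnectionError is a subclass of OSError
        logger.error("scheduling cycle failed: %s", exc)
        return False


def run_with_db_down():
    pool = AsyncMock()
    # side_effect makes the awaited call raise instead of returning rows.
    pool.fetch.side_effect = ConnectionError("postgres unavailable")
    rds = AsyncMock()
    ok = asyncio.run(safe_cycle(pool, rds))
    return ok, rds.rpush.await_count
```

The test then asserts that the call returned normally and that nothing was enqueued; pytest's `caplog` fixture could additionally verify the logged error.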

### Docker Compose
- **Health check failures**: Application services use `depends_on` with `condition: service_healthy` to wait for infrastructure. Health checks have `interval`, `timeout`, `retries`, and `start_period` configured.
- **Missing `.env` file**: Services that need API keys (ingestion, broker-adapter, trading-engine) will start but log warnings about missing keys. The platform runs in a degraded mode without external API access.
- **Build failures**: Each service uses the same base Dockerfile with `SERVICE_CMD` build arg. Build errors are isolated per service.

### Documentation
- **Stale documentation**: Documentation is generated from source code inspection. If the codebase changes after documentation is written, the docs may drift. The README links section serves as a single index to find and update docs.
- **Diagram accuracy**: Mermaid diagrams are hand-authored based on current architecture. They should be updated when services are added or removed.

## Testing Strategy

### PBT Applicability Assessment

Property-based testing is **NOT applicable** to this feature. The work consists of:
1. **Unit tests for existing services** — These are example-based tests with mocked dependencies, not pure functions with universal properties.
2. **Fixing pre-existing test failures** — Bug fixes to existing tests/code.
3. **Docker Compose configuration** — Declarative infrastructure configuration.
4. **Documentation** — Markdown files with no executable logic.

None of these involve new pure functions, parsers, serializers, or business logic where PBT would add value. The existing `test_pbt_*` files (22 files covering trading, aggregation, competitive intelligence, etc.) already provide PBT coverage for the platform's core logic and must remain passing.

### Unit Testing Strategy

**New test files**:
- `tests/test_scheduler_unit.py` — 8+ test cases covering all scheduler pure functions and the `schedule_cycle` orchestration with mocked dependencies.
- `tests/test_ingestion_unit.py` — 6+ test cases covering adapter error handling, retry logic, deduplication, and dead-letter queue routing.

**Test fix files** (existing, to be repaired):
- `tests/test_extractor_prompts.py`
- `tests/test_extractor_schemas.py`
- `tests/test_ollama_client.py`
- `tests/test_filings_adapter.py`

**Test framework**: pytest + pytest-asyncio (already configured in the project).

**Mocking approach**: `unittest.mock.AsyncMock` for async dependencies, `unittest.mock.MagicMock` for sync dependencies, `unittest.mock.patch` for module-level state.

### Verification Criteria

1. `pytest tests/ -x --tb=short -q` → zero failures
2. `ruff check services/` → zero violations
3. All 22 existing `test_pbt_*` files pass unchanged
4. `docker compose config` validates the updated docker-compose.yml
5. All documentation files render valid Markdown with working internal links
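
Criterion 5 could be checked mechanically. The sketch below is one possible approach, not an existing project script: it scans a Markdown file for inline links and reports relative targets that do not exist on disk. Anchors and external URLs are skipped, and the regular expression covers only the common `[text](target)` form.

```python
import re
import tempfile
from pathlib import Path

LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")


def broken_links(markdown_file: Path) -> list:
    """Return relative link targets in `markdown_file` that do not exist."""
    broken = []
    text = markdown_file.read_text(encoding="utf-8")
    for target in LINK_RE.findall(text):
        if target.startswith(("http://", "https://", "mailto:", "#")):
            continue  # external links and in-page anchors are out of scope
        path_part = target.split("#", 1)[0]
        if not (markdown_file.parent / path_part).exists():
            broken.append(target)
    return broken


def demo() -> list:
    # Build a throwaway tree with one valid and one dangling link.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "docs").mkdir()
        (root / "docs" / "services.md").write_text("# Services\n", encoding="utf-8")
        readme = root / "README.md"
        readme.write_text(
            "[Services](docs/services.md) and [Missing](docs/missing.md)\n",
            encoding="utf-8",
        )
        return broken_links(readme)
```

Running such a check over `README.md` and every file in `docs/` would give a simple pass/fail signal for internal links.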