diff --git a/.kiro/specs/stonks-oracle/tasks.md b/.kiro/specs/stonks-oracle/tasks.md
index 7ee4ad9..1b25c95 100644
--- a/.kiro/specs/stonks-oracle/tasks.md
+++ b/.kiro/specs/stonks-oracle/tasks.md
@@ -113,7 +113,7 @@
 - [x] Create replay dataset from archived documents for deterministic extraction testing
 - [x] Create integration tests for the full ingest-to-recommendation flow
 - [x] Create paper trading simulation scenarios
-- [x] Validate fail-closed behavior for broker outages and ambiguous order states
+- [x] Vnmalidate fail-closed behavior for broker outages and ambiguous order states
 - [x] Validate lake publication and Trino query correctness over partitioned MinIO datasets
 - [x] ~~Run shadow mode~~ moved to Phase 15.5 (post-deployment)
 - [x] ~~Prepare operator runbook~~ moved to Phase 15.5 (post-deployment)
@@ -136,7 +136,7 @@
   - _Requirements: N1_
 
 ## Phase 15 - CI Validation, Helm Deployment, and Cluster Rollout
-- [-] 15. Commit, push, validate CI, create Helm chart, and deploy to cluster
+- [x] 15. Commit, push, validate CI, create Helm chart, and deploy to cluster
 - [x] 15.1 Commit and push code to GitHub
   - Configure git with SSH key for the private repo
   - Commit all current changes with message `phase 14-15: docker build validation and helm deployment`
@@ -146,7 +146,7 @@
   - Monitor the GitHub Actions run to confirm lint-and-test and build-services jobs succeed
   - Fix any CI failures and re-push if needed
   - _Requirements: N1_
-- [-] 15.3 Create Helm chart for stonks-oracle deployment
+- [x] 15.3 Create Helm chart for stonks-oracle deployment
   - Create `infra/helm/stonks-oracle/Chart.yaml` with chart metadata
   - Create `infra/helm/stonks-oracle/values.yaml` with configurable image tags, replica counts, resource limits, and environment references
   - Create Helm templates for all deployments, services, configmap, secrets, ingress, and network policies from existing K8s manifests
@@ -157,17 +157,17 @@
   - Create a `docker-registry` secret in the `stonks-oracle` namespace with GHCR credentials (using a GitHub PAT or deploy key)
   - Reference the imagePullSecret in all deployment specs via the Helm values
   - _Requirements: 8.2, N1_
-- [-] 15.5 Deploy stonks-oracle to the cluster via Helm
+- [x] 15.5 Deploy stonks-oracle to the cluster via Helm
   - Run `helm install` or `helm upgrade --install` targeting the `stonks-oracle` namespace
   - Verify all pods reach Running/Ready state
   - Verify services and ingress endpoints are reachable
   - Debug and fix any deployment issues (CrashLoopBackOff, image pull errors, config mismatches)
   - _Requirements: N1, 12.1_
-- [ ] 15.6 Run shadow mode before enabling any live execution
+- [x] 15.6 Run shadow mode before enabling any live execution
   - Confirm all services are running and processing in paper-only mode
   - Validate end-to-end data flow from ingestion through recommendation without live trades
   - _Requirements: N5, 8.1_
-- [ ] 15.7 Prepare operator runbook and incident response procedures
+- [x] 15.7 Prepare operator runbook and incident response procedures
   - Document service restart procedures, log access, and common failure modes
   - Document how to toggle trading modes and approve live execution
   - _Requirements: 8.2, 12.1_
diff --git a/docs/notes/runbook.md b/docs/notes/runbook.md
new file mode 100644
index 0000000..3d38478
--- /dev/null
+++ b/docs/notes/runbook.md
@@ -0,0 +1,147 @@
+# Stonks Oracle — Operator Runbook
+
+## Cluster Access
+
+```bash
+kubectl config use-context
+# All stonks-oracle resources live in the stonks-oracle namespace
+alias kso='kubectl -n stonks-oracle'
+```
+
+## Service Overview
+
+| Service | Type | Replicas | Notes |
+|---------|------|----------|-------|
+| scheduler | CronJob-like worker | 1 | Polls sources on schedule |
+| symbol-registry | FastAPI | 1 | Company/watchlist CRUD |
+| ingestion | Queue worker | 2 | Fetches from adapters |
+| parser | Queue worker | 2 | HTML→text extraction |
+| extractor | Queue worker | 1 | LLM-based intelligence extraction |
+| aggregation | Queue worker | 1 | Trend/signal aggregation |
+| recommendation | Queue worker | 1 | Trade signal generation |
+| risk | FastAPI | 1 | Risk evaluation + approval |
+| broker-adapter | Queue worker | 1 | Paper/live order execution |
+| lake-publisher | Queue worker | 1 | Iceberg table publication |
+| query-api | FastAPI | 1 | Dashboard/analytics queries |
+| trino | Analytics engine | 1 | SQL over lakehouse |
+| superset | Dashboard | 1 | Visualization |
+| hive-metastore | Metastore | 1 | Iceberg catalog backend |
+
+## Common Operations
+
+### Restart a service
+```bash
+kso rollout restart deployment/
+```
+
+### Check logs
+```bash
+kso logs deployment/ --tail=50 -f
+# For previous crash:
+kso logs --previous --tail=50
+```
+
+### Scale a service
+```bash
+kso scale deployment/ --replicas=N
+```
+
+### Redeploy with updated secrets
+```bash
+GHCR_TOKEN=$(cat /run/secrets/github_token)
+helm upgrade --install stonks-oracle infra/helm/stonks-oracle \
+  --namespace stonks-oracle \
+  --set "ghcrAuth.password=$GHCR_TOKEN" \
+  --set 'secrets.core.POSTGRES_PASSWORD=St0nks0racl3!' \
+  --set "secrets.core.MINIO_ACCESS_KEY=AKIA6V7J3N9B5P0D2YQH" \
+  --set 'secrets.core.MINIO_SECRET_KEY=8fG3!v2rJ7$wN@9mLpQ6zXbC4tKdPqW1' \
+  --set 'secrets.core.REDIS_PASSWORD=PSCh4ng3me!'
+# Then restart deployments to pick up secret changes:
+for dep in $(kso get deployments -o name); do kso rollout restart "$dep"; done
+```
+
+### Run database migrations
+```bash
+for f in $(ls infra/migrations/*.sql | sort); do
+  kubectl exec -i -n postgresql-service postgresql-1 -c postgres -- psql -U postgres -d stonks < "$f"
+done
+```
+
+## Trading Mode Toggle
+
+Current mode is set via ConfigMap `stonks-config` key `BROKER_MODE`.
+
+```bash
+# Check current mode
+kso get configmap stonks-config -o jsonpath='{.data.BROKER_MODE}'
+
+# To switch modes, update values.yaml config.BROKER_MODE and helm upgrade,
+# then restart broker-adapter and risk deployments.
+```
+
+**Modes:**
+- `paper` — all orders go through paper trading simulation (default, safe)
+- `live` — orders are submitted to the real broker API (requires operator approval workflow)
+
+**Never switch to live without:**
+1. Confirming paper trading PnL is acceptable
+2. Verifying risk limits are configured in `risk_configuration` table
+3. Enabling operator approval in `operator_approvals` table
+
+## Operator Approval for Live Trades
+
+The risk engine requires explicit operator approval before executing live trades.
+Approvals are managed via the risk API:
+
+```bash
+# Check pending approvals
+curl -s https://stonks-api.celestium.life/risk/approvals/pending
+
+# Approve a recommendation
+curl -X POST https://stonks-api.celestium.life/risk/approvals//approve
+```
+
+## Common Failure Modes
+
+### CrashLoopBackOff on workers
+Queue workers (aggregation, extractor, recommendation, broker-adapter, lake-publisher) exit with code 0 when the queue is empty. Kubernetes restarts them, which is normal. They'll process work when messages arrive.
+
+### PostgreSQL auth failure
+Password mismatch between `stonks-core-secrets.POSTGRES_PASSWORD` and the actual DB user password. Fix:
+```bash
+kubectl exec -i -n postgresql-service postgresql-1 -c postgres -- psql -U postgres -d stonks <<'EOF'
+ALTER USER stonks WITH PASSWORD '';
+EOF
+```
+Then update the Helm secret and restart.
+
+### Redis connection refused
+Check Redis is running: `kubectl get pods -n redis-service`
+If Redis master is down, restart it: `kubectl rollout restart -n redis-service statefulset/redis-master`
+
+### ImagePullBackOff
+GHCR credentials expired or missing. Re-run `helm upgrade` with fresh `ghcrAuth.password`.
+
+### Superset won't start
+Needs custom image with `sqlalchemy-trino` package. Stock `apache/superset:latest` doesn't include it.
+
+## Log Access
+
+All services output JSON logs when `JSON_LOGS=true` (default).
+
+```bash
+# Stream all logs from a service
+kso logs -f deployment/ --tail=100
+
+# Search for errors across all pods
+kso logs --all-containers --prefix --tail=100 | grep -i error
+```
+
+## Ingress Endpoints
+
+| URL | Service |
+|-----|---------|
+| https://stonks-api.celestium.life | Query API |
+| https://stonks-registry.celestium.life | Symbol Registry |
+| https://stonks-dash.celestium.life | Superset |
+| https://stonks-trino.celestium.life | Trino |
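
The runbook's "Trading Mode Toggle" section documents two `BROKER_MODE` values and warns against switching to `live` without checks. A minimal sketch of a fail-closed guard an operator script could run first is shown here. The function names (`classify_broker_mode`, `current_broker_mode`, `require_paper_mode`) are hypothetical and not part of the repository; only the `kubectl` query is taken from the runbook. Any value other than `paper`, including an empty or unreadable one, is treated as unsafe.

```bash
#!/usr/bin/env bash
# Hypothetical pre-flight helpers; names are illustrative, not from the repo.

# Map a BROKER_MODE value to a coarse safety class.
# Only `paper` and `live` are documented; anything else is "unknown".
classify_broker_mode() {
  case "$1" in
    paper) echo "safe" ;;
    live)  echo "needs-approval" ;;
    *)     echo "unknown" ;;
  esac
}

# Read the current mode from the ConfigMap (same query as in the runbook).
current_broker_mode() {
  kubectl -n stonks-oracle get configmap stonks-config \
    -o jsonpath='{.data.BROKER_MODE}'
}

# Succeed only when the mode is `paper`. An explicit argument overrides the
# cluster lookup; a failed lookup yields an empty mode and therefore fails.
require_paper_mode() {
  local mode="${1:-$(current_broker_mode)}"
  [ "$(classify_broker_mode "$mode")" = "safe" ]
}
```

A script could then gate routine work with `require_paper_mode || { echo "not in paper mode" >&2; exit 1; }`, so that an unexpected mode stops the script instead of letting it proceed.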