# Stonks Oracle: Operator Runbook

## Cluster Access

```bash
# CONTEXT_NAME is a placeholder for your kubeconfig context
kubectl config use-context CONTEXT_NAME

# All stonks-oracle resources live in the stonks-oracle namespace
alias kso='kubectl -n stonks-oracle'
```

The cluster is a 4-node k3s cluster (gremlin-1 through gremlin-4). The deploy host is gremlin-1 (192.168.42.254), where the secrets and the deploy script live.

## Service Overview

| Service | Type | Replicas | Notes |
|---------|------|----------|-------|
| scheduler | CronJob-like worker | 1 | Polls sources on schedule |
| symbol-registry | FastAPI | 1 | Company/watchlist/exposure/competitor CRUD |
| ingestion | Queue worker | 2 | Fetches from adapters (market data, news, filings, macro) |
| parser | Queue worker | 2 | HTML→text extraction |
| extractor | Queue worker | 1 | LLM-based intelligence extraction + event classification |
| aggregation | Queue worker | 1 | Trend/signal aggregation across all 3 layers |
| recommendation | Queue worker | 1 | Trade signal generation |
| trading-engine | FastAPI | 1 | Autonomous decision loop, position sizing, backtesting |
| risk | FastAPI | 1 | Risk evaluation + approval |
| broker-adapter | Queue worker | 1 | Paper/live order execution via Alpaca |
| lake-publisher | Queue worker | 1 | Iceberg table publication |
| query-api | FastAPI | 1 | Dashboard/analytics queries |
| dashboard | nginx | 1 | React SPA on port 8080 |
| trino | Analytics engine | 1 | SQL over lakehouse |
| superset | Dashboard | 1 | Visualization |
| hive-metastore | Metastore | 1 | Iceberg catalog backend |

## Deployment

### Full Deploy

Run from gremlin-1, where the secrets are available:

```bash
bash ~/sources/kube/stonks-oracle/runmefirst.sh
```

This script:

1. Pulls the latest code
2. Creates the namespace with Helm labels
3. Sets up the PostgreSQL user and database
4. Runs all migrations in order
5. Deploys via Helm with secrets injected
6. Performs a rolling restart of all deployments

### Quick Helm Upgrade

After CI builds new images:

```bash
helm upgrade --install stonks-oracle infra/helm/stonks-oracle -n stonks-oracle
```

### Full Teardown

Tears down the Helm release but preserves PostgreSQL, Redis, and MinIO data:

```bash
bash ~/sources/kube/stonks-oracle/runmelast.sh
```

## Secrets Management

Secrets are stored on the deploy host at `~/sources/kube/stonks-oracle/`. This directory is NOT a git repo, so secrets stay local.

Required secret files:

- `~/sources/kube/stonks-oracle/polygon.io.key`: Polygon.io API key
- `~/sources/kube/stonks-oracle/alpaca.key`: Alpaca API key
- `~/sources/kube/stonks-oracle/alpaca.secret`: Alpaca API secret
- `~/sources/kube/stonks-oracle/alpaca.url`: Alpaca base URL (defaults to paper API)
- `/run/secrets/github_token`: GHCR authentication token

The deploy script (`runmefirst.sh`) reads these files and injects them into Kubernetes secrets via Helm `--set` flags. Never hardcode secrets in manifests, values files, or this runbook.

To rotate a secret:

1. Update the file on gremlin-1
2. Re-run `runmefirst.sh` (or `helm upgrade` with the new `--set` values)
3. Restart the affected deployments

## Common Operations

In the commands below, `SERVICE` stands for a deployment name from the Service Overview table, `POD` for a pod name, and `N` for a replica count.

### Restart a service

```bash
kso rollout restart deployment/SERVICE
```

### Check logs

```bash
kso logs deployment/SERVICE --tail=50 -f

# For previous crash:
kso logs POD --previous --tail=50
```

### Scale a service

```bash
kso scale deployment/SERVICE --replicas=N
```

### Run database migrations

```bash
# The glob expands in sorted order, so migrations are applied in filename order
for f in infra/migrations/*.sql; do
  kubectl exec -i -n postgresql-service postgresql-1 -c postgres -- \
    psql -U postgres -d stonks < "$f"
done
```

## Trading Engine Operations

### Check trading engine status

```bash
curl -s https://stonks-trading.celestium.life/health
curl -s https://stonks-trading.celestium.life/ready
```

### Pause trading

```bash
# Via API: sets enabled=false in trading_engine_config
curl -X PUT https://stonks-trading.celestium.life/api/trading/config \
  -H 'Content-Type: application/json' \
  -d '{"enabled": false}'
```

### Resume trading

```bash
curl -X PUT https://stonks-trading.celestium.life/api/trading/config \
  -H 'Content-Type: application/json' \
  -d '{"enabled": true}'
```

### Check recent trading decisions

```bash
# Quote the URL so the shell does not treat "?" as a glob character
curl -s 'https://stonks-api.celestium.life/api/trading/decisions?limit=10'
```

### Run a backtest

```bash
curl -X POST https://stonks-trading.celestium.life/api/trading/backtest \
  -H 'Content-Type: application/json' \
  -d '{"start_date": "2025-01-01", "end_date": "2025-06-01", "initial_capital": 100000, "risk_tier": "moderate"}'
```

### Check circuit breaker status

```bash
curl -s https://stonks-api.celestium.life/api/trading/circuit-breaker
```

### Check portfolio state

```bash
curl -s https://stonks-api.celestium.life/api/trading/portfolio
```

## Broker Mode Toggle

The current mode is set via the `BROKER_MODE` key of the `stonks-config` ConfigMap.
```bash
# Check current mode
kso get configmap stonks-config -o jsonpath='{.data.BROKER_MODE}'
```

**Modes:**

- `paper`: all orders go through paper trading simulation (default)
- `live`: orders are submitted to the real broker API (requires operator approval workflow)

**Never switch to live without:**

1. Confirming paper trading PnL is acceptable
2. Verifying risk limits are configured
3. Enabling operator approval in the risk engine

## Signal Layer Toggles

### Macro signal layer

```bash
# Check status
curl -s https://stonks-api.celestium.life/api/admin/macro/status

# Toggle
curl -X PUT https://stonks-api.celestium.life/api/admin/macro/toggle
```

### Competitive signal layer

```bash
# Check status
curl -s https://stonks-api.celestium.life/api/admin/competitive/status

# Toggle
curl -X PUT https://stonks-api.celestium.life/api/admin/competitive/toggle
```

## Backup and Restore

### Database backup

```bash
# Local backup (keeps last 7)
./scripts/backup-db.sh

# Backup + upload to MinIO
./scripts/backup-db.sh --upload-minio
```

Backups go to `~/backups/stonks-oracle/`. Old backups are pruned automatically; only the last 7 are kept.

### Database restore

```bash
# Lists available backups if no argument given
./scripts/restore-db.sh

# Restore a specific backup (WARNING: replaces all data)
./scripts/restore-db.sh ~/backups/stonks-oracle/stonks-20250615-180000.sql.gz
```

The restore script scales down all services, restores the dump, re-grants permissions, and scales the services back up.

### Redis backup

```bash
./scripts/backup-redis.sh
```

This triggers a BGSAVE and copies the RDB dump locally.

## Database Nuke & Rebuild

When a full reset is needed:

```bash
# 1. Tear down Helm release
bash ~/sources/kube/stonks-oracle/runmelast.sh

# 2. Terminate connections and drop database
kubectl exec -n postgresql-service postgresql-1 -c postgres -- \
  psql -U postgres -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE datname = 'stonks' AND pid <> pg_backend_pid();"
kubectl exec -n postgresql-service postgresql-1 -c postgres -- \
  psql -U postgres -c "DROP DATABASE IF EXISTS stonks;"

# 3. Flush Redis dedup markers
# (clear all stonks:* keys from Redis)

# 4. Full redeploy (creates DB, runs migrations, deploys)
bash ~/sources/kube/stonks-oracle/runmefirst.sh

# 5. Re-seed companies and relationships
# (run from a pod or with port-forwarded DB access)
python -m services.symbol_registry.seed
```

## Monitoring

### Check pod status

```bash
kso get pods
kso get pods -o wide  # includes node placement
```

### Check ingestion health

```bash
# Recent ingestion activity
kso logs deployment/ingestion --tail=20

# Source failure alerts
kso logs deployment/scheduler --tail=20 | grep -i "failure\|alert"
```

### Check broker errors

```bash
kso logs deployment/broker-adapter --tail=30 | grep -i "error\|fail"
```

### Check global event processing

```bash
kso logs deployment/extractor --tail=20 | grep -i "macro\|global"
```

### Check trading decisions

```bash
kso logs deployment/trading-engine --tail=30
```

### Stream all errors

```bash
# kubectl logs needs a pod, TYPE/NAME, or label selector.
# SELECTOR is a placeholder for a label selector that matches the stonks-oracle pods.
kso logs -l SELECTOR --all-containers --prefix --tail=100 | grep -i error
```

## Ingress Endpoints

| URL | Service |
|-----|---------|
| https://stonks.celestium.life | Dashboard |
| https://stonks-api.celestium.life | Query API |
| https://stonks-registry.celestium.life | Symbol Registry |
| https://stonks-trading.celestium.life | Trading Engine |
| https://stonks-dash.celestium.life | Superset |
| https://stonks-trino.celestium.life | Trino |

## CI/CD

Workflow: `.github/workflows/build.yml`

A push to `main` triggers: lint → pytest → frontend vitest → build all service images → push to GHCR.
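After a push to `main`, it can be useful to follow the resulting workflow run from the terminal. The following is a minimal sketch, assuming the GitHub CLI is installed and authenticated for this repository. The function name `watch_latest_build` is hypothetical and not part of the repo.

```bash
# Hypothetical helper: follow the newest build.yml run on main until it finishes.
watch_latest_build() {
  # Ask gh for the ID of the most recent run of the workflow on main.
  run_id="$(gh run list --workflow build.yml --branch main -L 1 \
    --json databaseId -q '.[0].databaseId')" || return 1

  # Treat an empty or null result as "no runs found".
  if [ -z "$run_id" ] || [ "$run_id" = "null" ]; then
    echo "no runs found for build.yml on main" >&2
    return 1
  fi

  # --exit-status makes gh exit non-zero if the run fails.
  gh run watch "$run_id" --exit-status
}
```

Calling `watch_latest_build` after a push blocks until the run completes, so its exit code can gate a follow-up `helm upgrade`.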

### Check recent builds

```bash
gh run list -L 5
```

### Re-run a failed build

```bash
gh run rerun --failed
```

### View failure logs

```bash
gh run view --log-failed
```

## Common Failure Modes

### CrashLoopBackOff on workers

Queue workers (aggregation, extractor, recommendation, broker-adapter, lake-publisher) exit with code 0 when the queue is empty. Kubernetes restarts them, and the repeated restarts show up as CrashLoopBackOff. This is normal; the workers process work when messages arrive.

### PostgreSQL auth failure

Caused by a password mismatch between the Kubernetes secret and the actual DB user. Fix it by re-running `runmefirst.sh`, which resets the password and redeploys.

### Redis connection refused

```bash
# Check the Redis pods, then restart the StatefulSet
kubectl get pods -n redis-service
kubectl rollout restart -n redis-service statefulset/redis-master
```

### ImagePullBackOff

GHCR credentials are expired or missing. Re-run `runmefirst.sh` with a fresh GitHub token at `/run/secrets/github_token`.

### Trading engine not making decisions

1. Check if trading is enabled: `curl -s https://stonks-trading.celestium.life/health`
2. Check the circuit breaker status; it may be tripped
3. Check if the current time is within the trading window (9:45 AM – 3:45 PM ET)
4. Check if there are actionable recommendations in the queue
5. Check logs: `kso logs deployment/trading-engine --tail=50`
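
The trading-window check in step 3 can be done from a shell without converting time zones by hand. The following is a minimal sketch, assuming the window is inclusive at both ends and that the host has the `America/New_York` zone data installed. The helper name `in_trading_window` is hypothetical. It does not account for weekends or market holidays, which the engine may also enforce.

```bash
# Hypothetical helper: succeeds when a zero-padded HHMM value lies in 09:45-15:45.
in_trading_window() {
  # [ ] compares its operands as decimal integers, so 0945 is read as 945.
  [ "$1" -ge 945 ] && [ "$1" -le 1545 ]
}

# Current wall-clock time in US Eastern time, formatted as HHMM.
now_et="$(TZ=America/New_York date +%H%M)"

if in_trading_window "$now_et"; then
  echo "inside trading window (ET $now_et)"
else
  echo "outside trading window (ET $now_et)"
fi
```

If this reports a time outside the window, the lack of decisions is expected and the remaining steps can wait until the window opens.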