Stonks Oracle — Operator Runbook
Cluster Access
kubectl config use-context <your-context>
# All stonks-oracle resources live in the stonks-oracle namespace
alias kso='kubectl -n stonks-oracle'
This is a 4-node k3s cluster (gremlin-1 through gremlin-4). The deploy host is gremlin-1 (192.168.42.254), where the secrets and the deploy script live.
Service Overview
| Service | Type | Replicas | Notes |
|---|---|---|---|
| scheduler | CronJob-like worker | 1 | Polls sources on schedule |
| symbol-registry | FastAPI | 1 | Company/watchlist/exposure/competitor CRUD |
| ingestion | Queue worker | 2 | Fetches from adapters (market data, news, filings, macro) |
| parser | Queue worker | 2 | HTML→text extraction |
| extractor | Queue worker | 1 | LLM-based intelligence extraction + event classification |
| aggregation | Queue worker | 1 | Trend/signal aggregation across all 3 layers |
| recommendation | Queue worker | 1 | Trade signal generation |
| trading-engine | FastAPI | 1 | Autonomous decision loop, position sizing, backtesting |
| risk | FastAPI | 1 | Risk evaluation + approval |
| broker-adapter | Queue worker | 1 | Paper/live order execution via Alpaca |
| lake-publisher | Queue worker | 1 | Iceberg table publication |
| query-api | FastAPI | 1 | Dashboard/analytics queries |
| dashboard | nginx | 1 | React SPA on port 8080 |
| trino | Analytics engine | 1 | SQL over lakehouse |
| superset | Dashboard | 1 | Visualization |
| hive-metastore | Metastore | 1 | Iceberg catalog backend |
Deployment
Full Deploy
Run from gremlin-1 where secrets are available:
bash ~/sources/kube/stonks-oracle/runmefirst.sh
This script:
- Pulls latest code
- Creates namespace with Helm labels
- Sets up PostgreSQL user and database
- Runs all migrations in order
- Deploys via Helm with secrets injected
- Rolling restarts all deployments
Quick Helm Upgrade
After CI builds new images:
helm upgrade --install stonks-oracle infra/helm/stonks-oracle -n stonks-oracle
Full Teardown
Preserves PostgreSQL, Redis, and MinIO data:
bash ~/sources/kube/stonks-oracle/runmelast.sh
Secrets Management
Secrets are stored on the deploy host at ~/sources/kube/stonks-oracle/. This directory is NOT a git repo — secrets stay local.
Required secret files:
- ~/sources/kube/stonks-oracle/polygon.io.key: Polygon.io API key
- ~/sources/kube/stonks-oracle/alpaca.key: Alpaca API key
- ~/sources/kube/stonks-oracle/alpaca.secret: Alpaca API secret
- ~/sources/kube/stonks-oracle/alpaca.url: Alpaca base URL (defaults to paper API)
- /run/secrets/github_token: GHCR authentication token
The deploy script (runmefirst.sh) reads these files and injects them into Kubernetes secrets via Helm --set flags. Never hardcode secrets in manifests, values files, or this runbook.
To rotate a secret:
- Update the file on gremlin-1
- Re-run runmefirst.sh (or helm upgrade with the new --set values)
- Restart affected deployments
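Before re-running the deploy script after a rotation, it can help to confirm that every secret file is present and non-empty. The following is a minimal sketch, not part of the repository: the function name check_secret_files is hypothetical, and it assumes all five files listed above are mandatory (alpaca.url may in fact be optional if the deploy script supplies the paper default).

```shell
# Hypothetical pre-flight check; paths mirror the list of required secret files.
check_secret_files() {
  dir=${1:-"$HOME/sources/kube/stonks-oracle"}
  token=${2:-/run/secrets/github_token}
  missing=0
  for f in "$dir/polygon.io.key" "$dir/alpaca.key" \
           "$dir/alpaca.secret" "$dir/alpaca.url" "$token"; do
    # -s is true only when the file exists and is non-empty
    if [ ! -s "$f" ]; then
      echo "missing or empty: $f" >&2
      missing=$((missing + 1))
    fi
  done
  # Succeed only when nothing was flagged
  [ "$missing" -eq 0 ]
}
```

Usage on gremlin-1: check_secret_files && bash ~/sources/kube/stonks-oracle/runmefirst.sh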
Common Operations
Restart a service
kso rollout restart deployment/<service-name>
Check logs
kso logs deployment/<service-name> --tail=50 -f
# For previous crash:
kso logs <pod-name> --previous --tail=50
Scale a service
kso scale deployment/<service-name> --replicas=N
Run database migrations
# Globs expand in sorted order, so migrations run in filename order
for f in infra/migrations/*.sql; do
  kubectl exec -i -n postgresql-service postgresql-1 -c postgres -- psql -U postgres -d stonks < "$f"
done
Trading Engine Operations
Check trading engine status
curl -s https://stonks-trading.celestium.life/health
curl -s https://stonks-trading.celestium.life/ready
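After a restart or redeploy, the readiness check can be polled instead of run by hand. This is a rough sketch under one assumption the runbook does not confirm: that /ready answers with a non-2xx status until the engine is ready. The function name wait_ready and its defaults are hypothetical.

```shell
# Hypothetical helper: poll a URL until it returns a successful HTTP status.
# Arguments: URL, max attempts (default 30), delay in seconds (default 2).
wait_ready() {
  url=$1
  tries=${2:-30}
  delay=${3:-2}
  i=1
  while [ "$i" -le "$tries" ]; do
    # -f makes curl exit non-zero on HTTP 4xx/5xx responses
    if curl -fsS -o /dev/null "$url"; then
      echo "ready after $i attempt(s)"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "not ready after $tries attempt(s)" >&2
  return 1
}
```

Usage: wait_ready https://stonks-trading.celestium.life/ready 30 2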
Pause trading
# Via API — sets enabled=false in trading_engine_config
curl -X PUT https://stonks-trading.celestium.life/api/trading/config \
-H 'Content-Type: application/json' \
-d '{"enabled": false}'
Resume trading
curl -X PUT https://stonks-trading.celestium.life/api/trading/config \
-H 'Content-Type: application/json' \
-d '{"enabled": true}'
Full paper trading reset
Liquidates all Alpaca positions, cancels open orders, wipes all local trading state (decisions, orders, positions, snapshots, backtests), and sets engine capital from the broker's actual account balance.
# Reset and sync capital from broker
curl -X POST https://stonks-trading.celestium.life/api/trading/reset \
-H 'Content-Type: application/json' \
-d '{}'
# Or override with a specific capital amount
curl -X POST https://stonks-trading.celestium.life/api/trading/reset \
-H 'Content-Type: application/json' \
-d '{"initial_capital": 100000}'
Note: if the market is closed, Alpaca liquidation orders will be queued and fill at next market open. The engine capital is set immediately.
Switching to a new Alpaca paper account
Alpaca allows up to 3 paper accounts. To start fresh:
- Go to https://app.alpaca.markets
- Click paper account number → "Open New Paper Account"
- Generate new API keys
- Update secrets on gremlin-1: alpaca.key, alpaca.secret, alpaca.url
- Re-run runmefirst.sh or helm upgrade with new --set values
- Restart broker-adapter: kso rollout restart deployment/broker-adapter
- Hit the reset endpoint to sync engine state with the new account
Check recent trading decisions
# Quote the URL so the shell does not treat ? as a glob character
curl -s 'https://stonks-api.celestium.life/api/trading/decisions?limit=10'
Run a backtest
curl -X POST https://stonks-trading.celestium.life/api/trading/backtest \
-H 'Content-Type: application/json' \
-d '{"start_date": "2025-01-01", "end_date": "2025-06-01", "initial_capital": 100000, "risk_tier": "moderate"}'
Check circuit breaker status
curl -s https://stonks-api.celestium.life/api/trading/circuit-breaker
Check portfolio state
curl -s https://stonks-api.celestium.life/api/trading/portfolio
Broker Mode Toggle
Current mode is set via ConfigMap stonks-config key BROKER_MODE.
# Check current mode
kso get configmap stonks-config -o jsonpath='{.data.BROKER_MODE}'
Modes:
- paper: all orders go through paper trading simulation (default)
- live: orders submitted to real broker API (requires operator approval workflow)
Never switch to live without:
- Confirming paper trading PnL is acceptable
- Verifying risk limits are configured
- Enabling operator approval in the risk engine
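The runbook does not show how the mode is actually changed. One possible approach, sketched here with loud caveats, is to patch the ConfigMap behind a guard that refuses live unless the operator opts in explicitly. The set_broker_mode and kube helpers and the CONFIRM_LIVE variable are all hypothetical. The sketch also assumes the ConfigMap is not overwritten by the next Helm upgrade and that the affected services are restarted afterwards to pick up the change; neither is confirmed by this document.

```shell
# Thin wrapper so the namespace is set in one place (aliases do not expand in scripts)
kube() { kubectl -n stonks-oracle "$@"; }

# Hypothetical guard around a ConfigMap patch; not the project's own tooling.
set_broker_mode() {
  mode=$1
  case "$mode" in
    paper) ;;
    live)
      # Require an explicit opt-in after the checklist has been completed
      if [ "${CONFIRM_LIVE:-}" != "yes" ]; then
        echo "refusing live mode: set CONFIRM_LIVE=yes after completing the checklist" >&2
        return 1
      fi
      ;;
    *)
      echo "unknown mode: $mode (expected paper or live)" >&2
      return 2
      ;;
  esac
  # Merge-patch only the BROKER_MODE key, leaving other keys untouched
  kube patch configmap stonks-config --type merge \
    -p "{\"data\":{\"BROKER_MODE\":\"$mode\"}}"
}
```

Usage: set_broker_mode paper, or CONFIRM_LIVE=yes set_broker_mode live once all three conditions above are met.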
Signal Layer Toggles
Macro signal layer
# Check status
curl -s https://stonks-api.celestium.life/api/admin/macro/status
# Toggle
curl -X PUT https://stonks-api.celestium.life/api/admin/macro/toggle
Competitive signal layer
# Check status
curl -s https://stonks-api.celestium.life/api/admin/competitive/status
# Toggle
curl -X PUT https://stonks-api.celestium.life/api/admin/competitive/toggle
Backup and Restore
Database backup
# Local backup (keeps last 7)
./scripts/backup-db.sh
# Backup + upload to MinIO
./scripts/backup-db.sh --upload-minio
Backups go to ~/backups/stonks-oracle/. Old backups are auto-pruned (keeps last 7).
Database restore
# Lists available backups if no argument given
./scripts/restore-db.sh
# Restore a specific backup (WARNING: replaces all data)
./scripts/restore-db.sh ~/backups/stonks-oracle/stonks-20250615-180000.sql.gz
The restore script scales down all services, restores the dump, re-grants permissions, and scales services back up.
Redis backup
./scripts/backup-redis.sh
Triggers a BGSAVE and copies the RDB dump locally.
Database Nuke & Rebuild
When a full reset is needed:
# 1. Tear down Helm release
bash ~/sources/kube/stonks-oracle/runmelast.sh
# 2. Terminate connections and drop database
kubectl exec -n postgresql-service postgresql-1 -c postgres -- \
psql -U postgres -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE datname = 'stonks' AND pid <> pg_backend_pid();"
kubectl exec -n postgresql-service postgresql-1 -c postgres -- \
psql -U postgres -c "DROP DATABASE IF EXISTS stonks;"
# 3. Flush Redis dedup markers
# (clear all stonks:* keys from Redis)
# 4. Full redeploy (creates DB, runs migrations, deploys)
bash ~/sources/kube/stonks-oracle/runmefirst.sh
# 5. Re-seed companies and relationships
# (run from a pod or with port-forwarded DB access)
python -m services.symbol_registry.seed
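Step 3 above leaves the Redis flush as a placeholder. A minimal sketch of one way to do it follows. The pod name redis-master-0 is an assumption inferred from the redis-master StatefulSet in the redis-service namespace, and the sketch assumes redis-cli inside the pod needs no password. The helper names redis_cli and flush_stonks_keys are hypothetical.

```shell
# Hypothetical wrapper; pod name redis-master-0 is an assumption, adjust as needed.
redis_cli() {
  kubectl exec -n redis-service redis-master-0 -- redis-cli "$@"
}

# Delete every key matching stonks:* and report how many were removed.
flush_stonks_keys() {
  count=0
  # --scan iterates with SCAN, which avoids blocking the server the way KEYS can
  keys=$(redis_cli --scan --pattern 'stonks:*') || return 1
  while IFS= read -r key; do
    [ -n "$key" ] || continue
    # Redirect stdin so the inner command cannot consume the key list
    redis_cli DEL "$key" </dev/null >/dev/null || return 1
    count=$((count + 1))
  done <<EOF
$keys
EOF
  echo "deleted $count keys"
}
```

Usage: run flush_stonks_keys between the database drop and the full redeploy.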
Monitoring
Check pod status
kso get pods
kso get pods -o wide # includes node placement
Check ingestion health
# Recent ingestion activity
kso logs deployment/ingestion --tail=20
# Source failure alerts
kso logs deployment/scheduler --tail=20 | grep -i "failure\|alert"
Check broker errors
kso logs deployment/broker-adapter --tail=30 | grep -i "error\|fail"
Check global event processing
kso logs deployment/extractor --tail=20 | grep -i "macro\|global"
Check trading decisions
kso logs deployment/trading-engine --tail=30
Stream all errors
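# kubectl logs needs a target, so iterate over every deployment (one pod per deployment is sampled)
kso get deployments -o name | xargs -I{} kubectl -n stonks-oracle logs {} --all-containers --prefix --tail=100 | grep -i error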
Ingress Endpoints
| URL | Service |
|---|---|
| https://stonks.celestium.life | Dashboard |
| https://stonks-api.celestium.life | Query API |
| https://stonks-registry.celestium.life | Symbol Registry |
| https://stonks-trading.celestium.life | Trading Engine |
| https://stonks-dash.celestium.life | Superset |
| https://stonks-trino.celestium.life | Trino |
CI/CD
Workflow: .github/workflows/build.yml
Push to main triggers: lint → pytest → frontend vitest → build all service images → push to GHCR.
Check recent builds
gh run list -L 5
Re-run a failed build
gh run rerun <run-id> --failed
View failure logs
gh run view <run-id> --log-failed
Common Failure Modes
CrashLoopBackOff on workers
Queue workers (aggregation, extractor, recommendation, broker-adapter, lake-publisher) exit with code 0 when the queue is empty. Kubernetes restarts them, and repeated clean exits are reported as CrashLoopBackOff, which is normal here. They process work when messages arrive.
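To tell a benign empty-queue exit from a real crash, the previous container's exit code can be read from the pod status. The sketch below assumes single-container worker pods (it reads containerStatuses[0]). The helper names kube, last_exit_code, and classify_restart are hypothetical.

```shell
# Thin wrapper so the namespace is set in one place
kube() { kubectl -n stonks-oracle "$@"; }

# Print the exit code of the pod's previous container run (empty if none recorded).
last_exit_code() {
  kube get pod "$1" \
    -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
}

# Map the exit code to a short verdict for the operator.
classify_restart() {
  code=$(last_exit_code "$1")
  case "$code" in
    0)  echo "benign: exited 0 (empty queue)" ;;
    "") echo "unknown: no previous termination recorded" ;;
    *)  echo "investigate: exited $code" ;;
  esac
}
```

Usage: classify_restart <pod-name>. For any non-zero code, follow up with kso logs <pod-name> --previous.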
PostgreSQL auth failure
The password in the Kubernetes secret does not match the actual DB user's password. Fix by re-running runmefirst.sh, which resets the password and redeploys.
Redis connection refused
kubectl get pods -n redis-service
kubectl rollout restart -n redis-service statefulset/redis-master
ImagePullBackOff
GHCR credentials expired or missing. Re-run runmefirst.sh with a fresh GitHub token at /run/secrets/github_token.
Trading engine not making decisions
- Check if trading is enabled: curl -s https://stonks-trading.celestium.life/health
- Check circuit breaker status (it may be tripped)
- Check if within trading window (9:45 AM – 3:45 PM ET)
- Check if there are actionable recommendations in the queue
- Check logs: kso logs deployment/trading-engine --tail=50
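The trading-window check in the list above can be scripted rather than worked out by hand. This sketch rests on several assumptions: both ends of the 9:45 AM to 3:45 PM ET window are treated as inclusive, only Monday through Friday count, and market holidays are not handled. None of this is confirmed against the engine's own logic. The function name in_trading_window is hypothetical.

```shell
# Hypothetical check: succeeds if the ET time (HHMM) on the given ISO weekday
# (1=Mon .. 7=Sun) is inside the window. Defaults to the current time in New York.
in_trading_window() {
  hhmm=${1:-$(TZ=America/New_York date +%H%M)}
  dow=${2:-$(TZ=America/New_York date +%u)}
  # Strip leading zeros so values like 0945 are compared as plain decimals
  hhmm=${hhmm#0}; hhmm=${hhmm#0}; hhmm=${hhmm#0}
  hhmm=${hhmm:-0}
  # Assumption: weekdays only, bounds inclusive
  [ "$dow" -le 5 ] && [ "$hhmm" -ge 945 ] && [ "$hhmm" -le 1545 ]
}
```

Usage: in_trading_window && echo "inside window" || echo "outside window"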