Stonks Oracle — Operator Runbook
Cluster Access
kubectl config use-context <your-context>
# All stonks-oracle resources live in the stonks-oracle namespace
alias kso='kubectl -n stonks-oracle'
The cluster is a 4-node k3s cluster (gremlin-1 through gremlin-4). The deploy host is gremlin-1 (192.168.42.254), where the secrets and the deploy script live.
Service Overview
| Service | Type | Replicas | Notes |
|---|---|---|---|
| scheduler | CronJob-like worker | 1 | Polls sources on schedule |
| symbol-registry | FastAPI | 1 | Company/watchlist/exposure/competitor CRUD |
| ingestion | Queue worker | 2 | Fetches from adapters (market data, news, filings, macro) |
| parser | Queue worker | 2 | HTML→text extraction |
| extractor | Queue worker | 1 | LLM-based intelligence extraction + event classification |
| aggregation | Queue worker | 1 | Trend/signal aggregation across all 3 layers |
| recommendation | Queue worker | 1 | Trade signal generation |
| trading-engine | FastAPI | 1 | Autonomous decision loop, position sizing, backtesting |
| risk | FastAPI | 1 | Risk evaluation + approval |
| broker-adapter | Queue worker | 1 | Paper/live order execution via Alpaca |
| lake-publisher | Queue worker | 1 | Iceberg table publication |
| query-api | FastAPI | 1 | Dashboard/analytics queries |
| dashboard | nginx | 1 | React SPA on port 8080 |
| trino | Analytics engine | 1 | SQL over lakehouse |
| superset | Dashboard | 1 | Visualization |
| hive-metastore | Metastore | 1 | Iceberg catalog backend |
Deployment
Full Deploy
Run from gremlin-1 where secrets are available:
bash ~/sources/kube/stonks-oracle/runmefirst.sh
This script:
- Pulls latest code
- Creates namespace with Helm labels
- Sets up PostgreSQL user and database
- Runs all migrations in order
- Deploys via Helm with secrets injected
- Rolling restarts all deployments
Quick Helm Upgrade
After CI builds new images:
helm upgrade --install stonks-oracle infra/helm/stonks-oracle -n stonks-oracle
Full Teardown
Removes the Helm release but preserves PostgreSQL, Redis, and MinIO data:
bash ~/sources/kube/stonks-oracle/runmelast.sh
Secrets Management
Secrets are stored on the deploy host at ~/sources/kube/stonks-oracle/. This directory is NOT a git repo — secrets stay local.
Required secret files:
- ~/sources/kube/stonks-oracle/polygon.io.key: Polygon.io API key
- ~/sources/kube/stonks-oracle/alpaca.key: Alpaca API key
- ~/sources/kube/stonks-oracle/alpaca.secret: Alpaca API secret
- ~/sources/kube/stonks-oracle/alpaca.url: Alpaca base URL (defaults to paper API)
- /run/secrets/github_token: GHCR authentication token
The deploy script (runmefirst.sh) reads these files and injects them into Kubernetes secrets via Helm --set flags. Never hardcode secrets in manifests, values files, or this runbook.
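Because the deploy script reads these files, a preflight check before running it can catch a missing or empty key early. The following is a minimal sketch and is not part of runmefirst.sh: the function name check_secret_files is hypothetical, and it assumes the three API key files are mandatory while alpaca.url may be absent because it has a default.

```shell
# Hypothetical preflight helper (not part of runmefirst.sh).
# Usage: check_secret_files DIR
# Returns 0 when every required key file exists and is non-empty, 1 otherwise.
check_secret_files() {
  secrets_dir=$1
  rc=0
  # Assumption: alpaca.url is optional because it defaults to the paper API.
  for name in polygon.io.key alpaca.key alpaca.secret; do
    if [ ! -s "$secrets_dir/$name" ]; then
      echo "missing or empty: $secrets_dir/$name" >&2
      rc=1
    fi
  done
  return "$rc"
}
```

One possible use is to gate the deploy on it: check_secret_files ~/sources/kube/stonks-oracle && bash ~/sources/kube/stonks-oracle/runmefirst.sh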
To rotate a secret:
- Update the file on gremlin-1
- Re-run runmefirst.sh (or helm upgrade with the new --set values)
- Restart affected deployments
Common Operations
Restart a service
kso rollout restart deployment/<service-name>
Check logs
kso logs deployment/<service-name> --tail=50 -f
# For previous crash:
kso logs <pod-name> --previous --tail=50
Scale a service
kso scale deployment/<service-name> --replicas=N
Run database migrations
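# Shell globs expand in sorted order, so migrations run in filename order; stop at the first failure
for f in infra/migrations/*.sql; do
  kubectl exec -i -n postgresql-service postgresql-1 -c postgres -- psql -v ON_ERROR_STOP=1 -U postgres -d stonks < "$f" || break
done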
Trading Engine Operations
Check trading engine status
curl -s https://stonks-trading.celestium.life/health
curl -s https://stonks-trading.celestium.life/ready
Pause trading
# Via API — sets enabled=false in trading_engine_config
curl -X PUT https://stonks-trading.celestium.life/api/trading/config \
-H 'Content-Type: application/json' \
-d '{"enabled": false}'
Resume trading
curl -X PUT https://stonks-trading.celestium.life/api/trading/config \
-H 'Content-Type: application/json' \
-d '{"enabled": true}'
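The pause and resume calls differ only in the boolean, so a small wrapper can reduce the chance of a typo in the JSON body. This is a sketch under assumptions: the function names and the TRADING_URL override are hypothetical, and only the endpoint and payload shape come from the commands above.

```shell
# Hypothetical wrapper around the pause/resume calls.
# TRADING_URL is an assumed override; it defaults to the trading engine ingress.
trading_config_payload() {
  case "$1" in
    true|false) printf '{"enabled": %s}\n' "$1" ;;
    *) echo "expected true or false, got '$1'" >&2; return 1 ;;
  esac
}

set_trading_enabled() {
  payload=$(trading_config_payload "$1") || return 1
  # -f makes curl exit non-zero on an HTTP error, so a failed pause is not silent
  curl -fsS -X PUT "${TRADING_URL:-https://stonks-trading.celestium.life}/api/trading/config" \
    -H 'Content-Type: application/json' \
    -d "$payload"
}
```

With this in place, set_trading_enabled false pauses trading and set_trading_enabled true resumes it; any other argument is rejected before a request is sent.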
Check recent trading decisions
# Quote the URL so the shell does not treat ? as a glob character
curl -s 'https://stonks-api.celestium.life/api/trading/decisions?limit=10'
Run a backtest
curl -X POST https://stonks-trading.celestium.life/api/trading/backtest \
-H 'Content-Type: application/json' \
-d '{"start_date": "2025-01-01", "end_date": "2025-06-01", "initial_capital": 100000, "risk_tier": "moderate"}'
Check circuit breaker status
curl -s https://stonks-api.celestium.life/api/trading/circuit-breaker
Check portfolio state
curl -s https://stonks-api.celestium.life/api/trading/portfolio
Broker Mode Toggle
The current mode is set by the BROKER_MODE key in the stonks-config ConfigMap.
# Check current mode
kso get configmap stonks-config -o jsonpath='{.data.BROKER_MODE}'
Modes:
- paper: all orders go through paper trading simulation (default)
- live: orders submitted to real broker API (requires operator approval workflow)
Never switch to live without:
- Confirming paper trading PnL is acceptable
- Verifying risk limits are configured
- Enabling operator approval in the risk engine
Signal Layer Toggles
Macro signal layer
# Check status
curl -s https://stonks-api.celestium.life/api/admin/macro/status
# Toggle
curl -X PUT https://stonks-api.celestium.life/api/admin/macro/toggle
Competitive signal layer
# Check status
curl -s https://stonks-api.celestium.life/api/admin/competitive/status
# Toggle
curl -X PUT https://stonks-api.celestium.life/api/admin/competitive/toggle
Backup and Restore
Database backup
# Local backup (keeps last 7)
./scripts/backup-db.sh
# Backup + upload to MinIO
./scripts/backup-db.sh --upload-minio
Backups go to ~/backups/stonks-oracle/. Old backups are auto-pruned (keeps last 7).
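The retention rule can be illustrated with a short sketch. It is not the code in backup-db.sh, which may prune differently; the function name prune_backups is hypothetical, and the stonks-YYYYMMDD-HHMMSS.sql.gz naming is assumed from the restore example in this runbook.

```shell
# Hypothetical sketch of the pruning step; backup-db.sh may implement it differently.
# Usage: prune_backups DIR [KEEP]
prune_backups() {
  backup_dir=$1
  keep=${2:-7}
  # Timestamped names (stonks-YYYYMMDD-HHMMSS.sql.gz) sort chronologically,
  # so a reverse sort lists newest first and tail skips the KEEP newest.
  printf '%s\n' "$backup_dir"/stonks-*.sql.gz |
    sort -r |
    tail -n +"$((keep + 1))" |
    while IFS= read -r old; do
      # The -e guard skips the unexpanded pattern when no backup exists
      if [ -e "$old" ]; then
        rm -f -- "$old"
      fi
    done
}
```

Sorting by name rather than by modification time keeps the result stable even if a backup file is copied or touched later.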
Database restore
# Lists available backups if no argument given
./scripts/restore-db.sh
# Restore a specific backup (WARNING: replaces all data)
./scripts/restore-db.sh ~/backups/stonks-oracle/stonks-20250615-180000.sql.gz
The restore script scales down all services, restores the dump, re-grants permissions, and scales services back up.
Redis backup
./scripts/backup-redis.sh
Triggers a BGSAVE and copies the RDB dump locally.
Database Nuke & Rebuild
When a full reset is needed:
# 1. Tear down Helm release
bash ~/sources/kube/stonks-oracle/runmelast.sh
# 2. Terminate connections and drop database
kubectl exec -n postgresql-service postgresql-1 -c postgres -- \
psql -U postgres -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE datname = 'stonks' AND pid <> pg_backend_pid();"
kubectl exec -n postgresql-service postgresql-1 -c postgres -- \
psql -U postgres -c "DROP DATABASE IF EXISTS stonks;"
# 3. Flush Redis dedup markers
# (clear all stonks:* keys from Redis)
# 4. Full redeploy (creates DB, runs migrations, deploys)
bash ~/sources/kube/stonks-oracle/runmefirst.sh
# 5. Re-seed companies and relationships
# (run from a pod or with port-forwarded DB access)
python -m services.symbol_registry.seed
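Step 3 above is described only in a comment. One way it could be done is sketched below, assuming the dedup markers are ordinary keys under the stonks: prefix. The function name and the REDIS_CLI override are hypothetical, the Redis pod name is not given in this runbook, and authentication (if enabled) is not handled.

```shell
# Hypothetical helper for step 3; assumes bash or sh (relies on word splitting of REDIS_CLI).
# REDIS_CLI is an assumed override so the command can be wrapped, for example in kubectl exec.
flush_stonks_keys() {
  # SCAN iterates incrementally, unlike KEYS, which blocks the server while it runs
  ${REDIS_CLI:-redis-cli} --scan --pattern 'stonks:*' |
    while IFS= read -r key; do
      # </dev/null keeps the inner command from consuming the key list on stdin
      ${REDIS_CLI:-redis-cli} DEL "$key" </dev/null >/dev/null
    done
}
```

Run from the deploy host, this might look like REDIS_CLI='kubectl exec -n redis-service <redis-pod> -- redis-cli' flush_stonks_keys, with the pod name filled in from kubectl get pods -n redis-service.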
Monitoring
Check pod status
kso get pods
kso get pods -o wide # includes node placement
Check ingestion health
# Recent ingestion activity
kso logs deployment/ingestion --tail=20
# Source failure alerts
kso logs deployment/scheduler --tail=20 | grep -iE "failure|alert"
Check broker errors
kso logs deployment/broker-adapter --tail=30 | grep -iE "error|fail"
Check global event processing
kso logs deployment/extractor --tail=20 | grep -iE "macro|global"
Check trading decisions
kso logs deployment/trading-engine --tail=30
Stream all errors
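# kubectl logs needs a pod name or a selector, so loop over every pod in the namespace
for p in $(kso get pods -o name); do kso logs "$p" --all-containers --prefix --tail=100; done | grep -i error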
Ingress Endpoints
| URL | Service |
|---|---|
| https://stonks.celestium.life | Dashboard |
| https://stonks-api.celestium.life | Query API |
| https://stonks-registry.celestium.life | Symbol Registry |
| https://stonks-trading.celestium.life | Trading Engine |
| https://stonks-dash.celestium.life | Superset |
| https://stonks-trino.celestium.life | Trino |
CI/CD
Workflow: .github/workflows/build.yml
Push to main triggers: lint → pytest → frontend vitest → build all service images → push to GHCR.
Check recent builds
gh run list -L 5
Re-run a failed build
gh run rerun <run-id> --failed
View failure logs
gh run view <run-id> --log-failed
Common Failure Modes
CrashLoopBackOff on workers
Queue workers (aggregation, extractor, recommendation, broker-adapter, lake-publisher) exit with code 0 when the queue is empty. Kubernetes restarts them — this is normal. They process work when messages arrive.
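To tell a benign empty-queue restart from a real crash, the exit code of the previous container run can be inspected. The helpers below are a sketch: the function names and messages are hypothetical, and the jsonpath assumes the worker is the first container in the pod.

```shell
# Hypothetical triage helpers; the messages are illustrative.
# last_exit_code POD: exit code of the first container's previous run.
last_exit_code() {
  kubectl -n stonks-oracle get pod "$1" \
    -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
}

# classify_restart CODE: decide whether a restart needs attention.
classify_restart() {
  case "$1" in
    0)  echo "benign: clean exit, likely an empty queue" ;;
    '') echo "unknown: no previous termination recorded" ;;
    *)  echo "investigate: exit code $1" ;;
  esac
}
```

A typical call would be classify_restart "$(last_exit_code <pod-name>)"; any non-zero code is worth following up with kso logs <pod-name> --previous.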
PostgreSQL auth failure
The password in the Kubernetes secret does not match the actual DB user's password. Fix by re-running runmefirst.sh, which resets the password and redeploys.
Redis connection refused
kubectl get pods -n redis-service
kubectl rollout restart -n redis-service statefulset/redis-master
ImagePullBackOff
GHCR credentials expired or missing. Re-run runmefirst.sh with a fresh GitHub token at /run/secrets/github_token.
Trading engine not making decisions
- Check if trading is enabled: curl -s https://stonks-trading.celestium.life/health
- Check circuit breaker status (may be tripped)
- Check if within trading window (9:45 AM – 3:45 PM ET)
- Check if there are actionable recommendations in the queue
- Check logs: kso logs deployment/trading-engine --tail=50
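The trading-window item in the checklist can be verified from any shell. The sketch below is not taken from the trading engine's code: the function name is hypothetical, and it assumes the window is inclusive at both ends, applies Monday through Friday, and ignores market holidays.

```shell
# Hypothetical check for the trading window; not taken from the trading engine's code.
# Usage: in_trading_window [HHMM] [ISO_WEEKDAY]   (both default to the current Eastern Time)
in_trading_window() {
  hhmm=${1:-$(TZ=America/New_York date +%H%M)}
  dow=${2:-$(TZ=America/New_York date +%u)}
  # Assumption: no trading on Saturday (6) or Sunday (7); market holidays are not handled
  [ "$dow" -le 5 ] || return 1
  # Drop one leading zero so 0945 compares as 945
  hhmm=${hhmm#0}
  # Assumption: both ends of the 9:45 AM to 3:45 PM window are inclusive
  [ "$hhmm" -ge 945 ] && [ "$hhmm" -le 1545 ]
}
```

Called with no arguments it uses the current Eastern Time, for example: in_trading_window && echo "inside window" || echo "outside window".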