# Stonks Oracle — Operator Runbook

## Cluster Access

```bash
kubectl config use-context <your-context>
# All stonks-oracle resources live in the stonks-oracle namespace
alias kso='kubectl -n stonks-oracle'
```

4-node k3s cluster (gremlin-1 through gremlin-4). Deploy host is gremlin-1 (192.168.42.254) where secrets and the deploy script live.

## Service Overview

| Service | Type | Replicas | Notes |
|---------|------|----------|-------|
| scheduler | CronJob-like worker | 1 | Polls sources on schedule |
| symbol-registry | FastAPI | 1 | Company/watchlist/exposure/competitor CRUD |
| ingestion | Queue worker | 2 | Fetches from adapters (market data, news, filings, macro) |
| parser | Queue worker | 2 | HTML→text extraction |
| extractor | Queue worker | 1 | LLM-based intelligence extraction + event classification |
| aggregation | Queue worker | 1 | Trend/signal aggregation across all 3 layers |
| recommendation | Queue worker | 1 | Trade signal generation |
| trading-engine | FastAPI | 1 | Autonomous decision loop, position sizing, backtesting |
| risk | FastAPI | 1 | Risk evaluation + approval |
| broker-adapter | Queue worker | 1 | Paper/live order execution via Alpaca |
| lake-publisher | Queue worker | 1 | Iceberg table publication |
| query-api | FastAPI | 1 | Dashboard/analytics queries |
| dashboard | nginx | 1 | React SPA on port 8080 |
| trino | Analytics engine | 1 | SQL over lakehouse |
| superset | Dashboard | 1 | Visualization |
| hive-metastore | Metastore | 1 | Iceberg catalog backend |

## Deployment

### Full Deploy

Run from gremlin-1 where secrets are available:

```bash
bash ~/sources/kube/stonks-oracle/runmefirst.sh
```

This script:
1. Pulls latest code
2. Creates namespace with Helm labels
3. Sets up PostgreSQL user and database
4. Runs all migrations in order
5. Deploys via Helm with secrets injected
6. Rolling restarts all deployments

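After the deploy script finishes, it can be useful to confirm that every rolling restart actually completed. The sketch below is not part of `runmefirst.sh`; the `wait_for_rollouts` helper, the `NAMESPACE` variable, and the 180-second default timeout are assumptions made for illustration.

```bash
# Sketch: wait for every Deployment in the namespace to finish rolling out.
# NAMESPACE and the default timeout are assumptions, not values from runmefirst.sh.
NAMESPACE="${NAMESPACE:-stonks-oracle}"

wait_for_rollouts() {
    local timeout="${1:-180s}" failed=0 deploy
    # `get deployments -o name` prints one "deployment.apps/<name>" per line
    for deploy in $(kubectl -n "$NAMESPACE" get deployments -o name); do
        if ! kubectl -n "$NAMESPACE" rollout status "$deploy" --timeout="$timeout"; then
            echo "rollout not complete: $deploy" >&2
            failed=1
        fi
    done
    return "$failed"
}
```

Calling `wait_for_rollouts 300s` returns non-zero if any deployment fails to become ready within the timeout, so it can gate a follow-up step.
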
### Quick Helm Upgrade

After CI builds new images:

```bash
helm upgrade --install stonks-oracle infra/helm/stonks-oracle -n stonks-oracle
```

### Full Teardown

Preserves PostgreSQL, Redis, and MinIO data:

```bash
bash ~/sources/kube/stonks-oracle/runmelast.sh
```

## Secrets Management

Secrets are stored on the deploy host at `~/sources/kube/stonks-oracle/`. This directory is NOT a git repo, so secrets stay local.

Required secret files:
- `~/sources/kube/stonks-oracle/polygon.io.key`: Polygon.io API key
- `~/sources/kube/stonks-oracle/alpaca.key`: Alpaca API key
- `~/sources/kube/stonks-oracle/alpaca.secret`: Alpaca API secret
- `~/sources/kube/stonks-oracle/alpaca.url`: Alpaca base URL (defaults to paper API)
- `/run/secrets/github_token`: GHCR authentication token

The deploy script (`runmefirst.sh`) reads these files and injects them into Kubernetes secrets via Helm `--set` flags. Never hardcode secrets in manifests, values files, or this runbook.

To rotate a secret:
1. Update the file on gremlin-1
2. Re-run `runmefirst.sh` (or `helm upgrade` with the new `--set` values)
3. Restart affected deployments

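Before re-running the deploy script after a rotation, a quick pre-flight check can confirm that each secret file listed above exists and is non-empty. This is a minimal sketch: `check_secret_files` and the `GITHUB_TOKEN_FILE` override are hypothetical names, and the sketch treats every listed file as required.

```bash
# Sketch: verify the secret files exist and are non-empty.
# check_secret_files and GITHUB_TOKEN_FILE are hypothetical, not part of runmefirst.sh.
check_secret_files() {
    local dir="${1:-$HOME/sources/kube/stonks-oracle}" missing=0 f
    for f in "$dir/polygon.io.key" "$dir/alpaca.key" "$dir/alpaca.secret" \
             "$dir/alpaca.url" "${GITHUB_TOKEN_FILE:-/run/secrets/github_token}"; do
        # -s is true only when the file exists and has a size greater than zero
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

The function prints only file paths, never file contents, so its output is safe to paste into a ticket or chat.
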
## Common Operations

### Restart a service
```bash
kso rollout restart deployment/<service-name>
```

### Check logs
```bash
kso logs deployment/<service-name> --tail=50 -f
# Logs from the previous (crashed) container:
kso logs <pod-name> --previous --tail=50
```

### Scale a service
```bash
kso scale deployment/<service-name> --replicas=N
```

### Run database migrations
```bash
# The glob expands in sorted (lexical) order, so no `ls | sort` is needed
for f in infra/migrations/*.sql; do
    kubectl exec -i -n postgresql-service postgresql-1 -c postgres -- psql -U postgres -d stonks < "$f"
done
```

## Trading Engine Operations

### Check trading engine status
```bash
curl -s https://stonks-trading.celestium.life/health
curl -s https://stonks-trading.celestium.life/ready
```

### Pause trading
```bash
# Via API: sets enabled=false in trading_engine_config
curl -X PUT https://stonks-trading.celestium.life/api/trading/config \
    -H 'Content-Type: application/json' \
    -d '{"enabled": false}'
```

### Resume trading
```bash
curl -X PUT https://stonks-trading.celestium.life/api/trading/config \
    -H 'Content-Type: application/json' \
    -d '{"enabled": true}'
```

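The pause and resume calls above differ only in the boolean they send, so they can be wrapped in one helper. The following is a sketch under assumptions: `trading_config_payload`, `set_trading_enabled`, and the `TRADING_URL` variable are hypothetical names, and only the endpoint and payload shape come from the commands above.

```bash
# Sketch: one helper for pausing and resuming trading.
TRADING_URL="${TRADING_URL:-https://stonks-trading.celestium.life}"

# Print the JSON body for the config endpoint; accepts only "true" or "false".
trading_config_payload() {
    case "$1" in
        true|false) printf '{"enabled": %s}\n' "$1" ;;
        *) echo "usage: trading_config_payload true|false" >&2; return 2 ;;
    esac
}

# PUT the payload; --fail makes curl exit non-zero on an HTTP error status.
set_trading_enabled() {
    local payload
    payload="$(trading_config_payload "$1")" || return
    curl --fail --silent --show-error -X PUT "$TRADING_URL/api/trading/config" \
        -H 'Content-Type: application/json' -d "$payload"
}
```

With this in place, `set_trading_enabled false` pauses trading and `set_trading_enabled true` resumes it; any other argument is rejected before a request is sent.
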
### Check recent trading decisions
```bash
# Quote the URL so the shell does not treat "?" as a glob character
curl -s 'https://stonks-api.celestium.life/api/trading/decisions?limit=10'
```

### Run a backtest
```bash
curl -X POST https://stonks-trading.celestium.life/api/trading/backtest \
    -H 'Content-Type: application/json' \
    -d '{"start_date": "2025-01-01", "end_date": "2025-06-01", "initial_capital": 100000, "risk_tier": "moderate"}'
```

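Hand-editing the backtest JSON is easy to get wrong, so a small builder that validates its inputs first may help. This is a sketch only: `backtest_payload` is a hypothetical helper, the field names are taken from the example request above, and the validation rules (ISO dates, whole-number capital, lowercase tier name) are assumptions rather than documented API constraints.

```bash
# Sketch: build the backtest request body after basic input validation.
backtest_payload() {
    local start="$1" end="$2" capital="${3:-100000}" tier="${4:-moderate}" d
    for d in "$start" "$end"; do
        case "$d" in
            [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]) ;;
            *) echo "dates must be YYYY-MM-DD: $d" >&2; return 2 ;;
        esac
    done
    # Dropping the hyphens turns ISO dates into integers that compare chronologically
    if [ "$(printf '%s' "$start" | tr -d '-')" -ge "$(printf '%s' "$end" | tr -d '-')" ]; then
        echo "start_date must precede end_date" >&2; return 2
    fi
    case "$capital" in ''|*[!0-9]*) echo "initial_capital must be a whole number" >&2; return 2 ;; esac
    case "$tier" in ''|*[!a-z_]*) echo "risk_tier must be lowercase letters" >&2; return 2 ;; esac
    printf '{"start_date": "%s", "end_date": "%s", "initial_capital": %s, "risk_tier": "%s"}\n' \
        "$start" "$end" "$capital" "$tier"
}
```

The output can be passed straight to the request, for example `curl -X POST ... -d "$(backtest_payload 2025-01-01 2025-06-01)"`.
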
### Check circuit breaker status
```bash
curl -s https://stonks-api.celestium.life/api/trading/circuit-breaker
```

### Check portfolio state
```bash
curl -s https://stonks-api.celestium.life/api/trading/portfolio
```

## Broker Mode Toggle

Current mode is set via ConfigMap `stonks-config` key `BROKER_MODE`.

```bash
# Check current mode
kso get configmap stonks-config -o jsonpath='{.data.BROKER_MODE}'
```

**Modes:**
- `paper`: all orders go through paper trading simulation (default)
- `live`: orders submitted to real broker API (requires operator approval workflow)

**Never switch to live without:**
1. Confirming paper trading PnL is acceptable
2. Verifying risk limits are configured
3. Enabling operator approval in the risk engine

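Scripts that place test orders or run experiments can guard themselves by refusing to continue unless the cluster is in paper mode. The sketch below assumes hypothetical helper names (`broker_mode`, `require_paper_mode`) and a `NAMESPACE` variable; it deliberately treats an empty or unreadable value as unsafe, which is stricter than the documented default.

```bash
# Sketch: fail fast unless BROKER_MODE is "paper".
NAMESPACE="${NAMESPACE:-stonks-oracle}"

# Read BROKER_MODE from the stonks-config ConfigMap.
broker_mode() {
    kubectl -n "$NAMESPACE" get configmap stonks-config -o jsonpath='{.data.BROKER_MODE}'
}

# Succeed only when the mode is exactly "paper"; empty or unknown values fail.
require_paper_mode() {
    local mode
    mode="$(broker_mode)" || return 1
    if [ "$mode" != "paper" ]; then
        echo "BROKER_MODE is '${mode:-unset}', expected 'paper'" >&2
        return 1
    fi
}
```

A typical use is `require_paper_mode || exit 1` at the top of a script.
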
## Signal Layer Toggles

### Macro signal layer
```bash
# Check status
curl -s https://stonks-api.celestium.life/api/admin/macro/status

# Toggle
curl -X PUT https://stonks-api.celestium.life/api/admin/macro/toggle
```

### Competitive signal layer
```bash
# Check status
curl -s https://stonks-api.celestium.life/api/admin/competitive/status

# Toggle
curl -X PUT https://stonks-api.celestium.life/api/admin/competitive/toggle
```

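Because the toggle endpoints flip the current state rather than set it, it is worth printing the status before and after each toggle. The sketch below wraps the commands above; `layer_url`, `toggle_layer`, and `API_URL` are hypothetical names, and the status response is printed as-is because its format is not documented here.

```bash
# Sketch: toggle a signal layer and show its status before and after.
API_URL="${API_URL:-https://stonks-api.celestium.life}"

# layer_url <macro|competitive> <status|toggle> prints the admin endpoint URL.
layer_url() {
    case "$1" in macro|competitive) ;; *) echo "unknown layer: $1" >&2; return 2 ;; esac
    case "$2" in status|toggle) ;; *) echo "unknown action: $2" >&2; return 2 ;; esac
    printf '%s/api/admin/%s/%s\n' "$API_URL" "$1" "$2"
}

toggle_layer() {
    local status_url toggle_url
    status_url="$(layer_url "$1" status)" || return
    toggle_url="$(layer_url "$1" toggle)" || return
    echo "before: $(curl --fail -s "$status_url")"
    curl --fail -s -X PUT "$toggle_url" >/dev/null || return
    echo "after:  $(curl --fail -s "$status_url")"
}
```

For example, `toggle_layer macro` flips the macro layer, and an unknown layer name is rejected before any request is made.
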
## Backup and Restore

### Database backup
```bash
# Local backup (keeps last 7)
./scripts/backup-db.sh

# Backup + upload to MinIO
./scripts/backup-db.sh --upload-minio
```

Backups go to `~/backups/stonks-oracle/`. Old backups are auto-pruned (keeps last 7).

### Database restore
```bash
# Lists available backups if no argument given
./scripts/restore-db.sh

# Restore a specific backup (WARNING: replaces all data)
./scripts/restore-db.sh ~/backups/stonks-oracle/stonks-20250615-180000.sql.gz
```

The restore script scales down all services, restores the dump, re-grants permissions, and scales services back up.

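When the goal is simply to restore the most recent dump, the newest file can be selected by name. This sketch assumes that every backup follows the `stonks-YYYYMMDD-HHMMSS.sql.gz` pattern seen in the example above, so that lexical order matches chronological order; `latest_backup` is a hypothetical helper, not part of the repository scripts.

```bash
# Sketch: print the path of the newest backup, judged by its timestamped filename.
latest_backup() {
    local dir="${1:-$HOME/backups/stonks-oracle}" newest="" f
    for f in "$dir"/stonks-*.sql.gz; do
        [ -e "$f" ] || continue   # the glob stays literal when nothing matches
        newest="$f"               # globs expand in sorted order, so the last match wins
    done
    if [ -z "$newest" ]; then
        echo "no backups found in $dir" >&2
        return 1
    fi
    printf '%s\n' "$newest"
}
```

It could then be combined with the restore script as `./scripts/restore-db.sh "$(latest_backup)"`, keeping in mind that a restore replaces all data.
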
### Redis backup
```bash
./scripts/backup-redis.sh
```

Triggers a BGSAVE and copies the RDB dump locally.

## Database Nuke & Rebuild

When a full reset is needed:

```bash
# 1. Tear down Helm release
bash ~/sources/kube/stonks-oracle/runmelast.sh

# 2. Terminate connections and drop database
kubectl exec -n postgresql-service postgresql-1 -c postgres -- \
    psql -U postgres -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE datname = 'stonks' AND pid <> pg_backend_pid();"
kubectl exec -n postgresql-service postgresql-1 -c postgres -- \
    psql -U postgres -c "DROP DATABASE IF EXISTS stonks;"

# 3. Flush Redis dedup markers
# (clear all stonks:* keys from Redis)

# 4. Full redeploy (creates DB, runs migrations, deploys)
bash ~/sources/kube/stonks-oracle/runmefirst.sh

# 5. Re-seed companies and relationships
# (run from a pod or with port-forwarded DB access)
python -m services.symbol_registry.seed
```

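Step 3 above names the goal but gives no command. One possible way to clear the `stonks:*` keys is sketched below. Everything specific here is an assumption: the pod name `redis-master-0` is inferred from the `redis-master` StatefulSet in the `redis-service` namespace, the Redis image is assumed to ship `sh`, `xargs`, and `redis-cli`, and no password is passed (if authentication is enabled, `redis-cli` would need credentials, for example through its `REDISCLI_AUTH` environment variable).

```bash
# Sketch: delete every stonks:* key using SCAN instead of the blocking KEYS command.
REDIS_NAMESPACE="${REDIS_NAMESPACE:-redis-service}"
REDIS_POD="${REDIS_POD:-redis-master-0}"   # assumed pod name, adjust to the cluster

flush_stonks_keys() {
    # --scan iterates the keyspace incrementally; xargs -r skips DEL when nothing matches
    kubectl exec -n "$REDIS_NAMESPACE" "$REDIS_POD" -- sh -c \
        "redis-cli --scan --pattern 'stonks:*' | xargs -r redis-cli del"
}
```

Key names containing whitespace or quotes would not survive the `xargs` step, so this sketch only suits simple key names.
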
## Monitoring

### Check pod status
```bash
kso get pods
kso get pods -o wide  # includes node placement
```

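With sixteen services, scanning the full pod list by eye is slow. The filter below is a small sketch, with `unhealthy_pods` as a hypothetical helper name, that reads the default `kubectl get pods --no-headers` columns (NAME, READY, STATUS, ...) and prints only pods that are not fully ready.

```bash
# Sketch: read `kubectl get pods --no-headers` on stdin and list pods needing attention.
unhealthy_pods() {
    awk '{
        split($2, ready, "/")                 # READY column looks like "1/1"
        if ($3 == "Completed") next           # finished jobs are fine
        if ($3 != "Running" || ready[1] != ready[2])
            print $1 " (" $3 ", " $2 " ready)"
    }'
}
```

Usage: `kubectl -n stonks-oracle get pods --no-headers | unhealthy_pods`. Queue workers that restart when idle may show up here briefly, as described under Common Failure Modes.
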
### Check ingestion health
```bash
# Recent ingestion activity
kso logs deployment/ingestion --tail=20

# Source failure alerts
kso logs deployment/scheduler --tail=20 | grep -i "failure\|alert"
```

### Check broker errors
```bash
kso logs deployment/broker-adapter --tail=30 | grep -i "error\|fail"
```

### Check global event processing
```bash
kso logs deployment/extractor --tail=20 | grep -i "macro\|global"
```

### Check trading decisions
```bash
kso logs deployment/trading-engine --tail=30
```

### Stream all errors
```bash
# `kubectl logs` needs a pod or selector, so iterate over every pod in the namespace
kso get pods -o name | xargs -I{} kubectl -n stonks-oracle logs {} --all-containers --prefix --tail=100 | grep -i error
```

## Ingress Endpoints

| URL | Service |
|-----|---------|
| https://stonks.celestium.life | Dashboard |
| https://stonks-api.celestium.life | Query API |
| https://stonks-registry.celestium.life | Symbol Registry |
| https://stonks-trading.celestium.life | Trading Engine |
| https://stonks-dash.celestium.life | Superset |
| https://stonks-trino.celestium.life | Trino |

## CI/CD

Workflow: `.github/workflows/build.yml`

Push to `main` triggers: lint → pytest → frontend vitest → build all service images → push to GHCR.

### Check recent builds
```bash
gh run list -L 5
```

### Re-run a failed build
```bash
gh run rerun <run-id> --failed
```

### View failure logs
```bash
gh run view <run-id> --log-failed
```

## Common Failure Modes

### CrashLoopBackOff on workers
Queue workers (aggregation, extractor, recommendation, broker-adapter, lake-publisher) exit with code 0 when the queue is empty. Kubernetes restarts them, which is normal; they process work when messages arrive.

### PostgreSQL auth failure
Caused by a password mismatch between the Kubernetes secret and the actual DB user. Fix by re-running `runmefirst.sh`, which resets the password and redeploys.

### Redis connection refused
```bash
kubectl get pods -n redis-service
kubectl rollout restart -n redis-service statefulset/redis-master
```

### ImagePullBackOff
GHCR credentials are expired or missing. Re-run `runmefirst.sh` with a fresh GitHub token at `/run/secrets/github_token`.

### Trading engine not making decisions
1. Check if trading is enabled: `curl -s https://stonks-trading.celestium.life/health`
2. Check circuit breaker status, which may be tripped
3. Check if within trading window (9:45 AM – 3:45 PM ET)
4. Check if there are actionable recommendations in the queue
5. Check logs: `kso logs deployment/trading-engine --tail=50`

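The trading-window check in step 3 is easy to get wrong when the operator is not in the Eastern time zone. The sketch below converts the current time with the `America/New_York` zone and compares it against the window. `in_trading_window` is a hypothetical helper; treating both endpoints as inclusive is an assumption, and weekends and market holidays are not considered.

```bash
# Sketch: succeed when HH:MM falls within 09:45-15:45 (endpoints assumed inclusive).
in_trading_window() {
    local hh="${1%%:*}" mm="${1##*:}" minutes
    hh="${hh#0}"; mm="${mm#0}"   # drop a leading zero so "09" is not read as octal
    minutes=$(( hh * 60 + mm ))
    [ "$minutes" -ge $(( 9 * 60 + 45 )) ] && [ "$minutes" -le $(( 15 * 60 + 45 )) ]
}

# Current wall-clock time in US Eastern, formatted as HH:MM.
now_et() { TZ=America/New_York date '+%H:%M'; }
```

Usage: `in_trading_window "$(now_et)" && echo "inside window" || echo "outside window"`.
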