phase 15: helm deployment complete, runbook, trino/superset fixes

Celes Renata
2026-04-11 14:27:47 -07:00
parent fe3d6c0cb0
commit e7b2a5e67f
2 changed files with 153 additions and 6 deletions
@@ -0,0 +1,147 @@
# Stonks Oracle — Operator Runbook
## Cluster Access
```bash
kubectl config use-context <your-context>
# All stonks-oracle resources live in the stonks-oracle namespace
alias kso='kubectl -n stonks-oracle'
```
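Shell aliases are not expanded in non-interactive scripts by default, so a function is a more portable alternative when these commands are copied into automation. A minimal sketch, assuming the same `stonks-oracle` namespace; use it instead of the alias (run `unalias kso` first if the alias is already defined in the current shell):

```bash
# Function form of the kso shortcut; unlike an alias, it also works inside scripts.
kso() {
  kubectl -n stonks-oracle "$@"
}
```

With this in place, `kso get pods` runs `kubectl -n stonks-oracle get pods`.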
## Service Overview
| Service | Type | Replicas | Notes |
|---------|------|----------|-------|
| scheduler | CronJob-like worker | 1 | Polls sources on schedule |
| symbol-registry | FastAPI | 1 | Company/watchlist CRUD |
| ingestion | Queue worker | 2 | Fetches from adapters |
| parser | Queue worker | 2 | HTML→text extraction |
| extractor | Queue worker | 1 | LLM-based intelligence extraction |
| aggregation | Queue worker | 1 | Trend/signal aggregation |
| recommendation | Queue worker | 1 | Trade signal generation |
| risk | FastAPI | 1 | Risk evaluation + approval |
| broker-adapter | Queue worker | 1 | Paper/live order execution |
| lake-publisher | Queue worker | 1 | Iceberg table publication |
| query-api | FastAPI | 1 | Dashboard/analytics queries |
| trino | Analytics engine | 1 | SQL over lakehouse |
| superset | Dashboard | 1 | Visualization |
| hive-metastore | Metastore | 1 | Iceberg catalog backend |
## Common Operations
### Restart a service
```bash
kso rollout restart deployment/<service-name>
```
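`rollout restart` returns as soon as the restart is requested. To block until the new pods are ready, it can be paired with `rollout status`. A minimal sketch; the helper name `restart_and_wait` and the 120s default timeout are illustrative choices, not part of this repository:

```bash
# Restart one deployment and wait until the rollout finishes or the timeout expires.
# $1: deployment name; $2: optional timeout (defaults to 120s, an arbitrary value)
restart_and_wait() {
  kubectl -n stonks-oracle rollout restart "deployment/$1" &&
    kubectl -n stonks-oracle rollout status "deployment/$1" --timeout="${2:-120s}"
}
```

For example, `restart_and_wait parser 60s` exits non-zero if the parser rollout has not completed within 60 seconds.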
### Check logs
```bash
kso logs deployment/<service-name> --tail=50 -f
# For previous crash:
kso logs <pod-name> --previous --tail=50
```
### Scale a service
```bash
kso scale deployment/<service-name> --replicas=<N>
```
### Redeploy with updated secrets
```bash
GHCR_TOKEN=$(cat /run/secrets/github_token)
helm upgrade --install stonks-oracle infra/helm/stonks-oracle \
--namespace stonks-oracle \
--set "ghcrAuth.password=$GHCR_TOKEN" \
--set 'secrets.core.POSTGRES_PASSWORD=<postgres-password>' \
--set 'secrets.core.MINIO_ACCESS_KEY=<minio-access-key>' \
--set 'secrets.core.MINIO_SECRET_KEY=<minio-secret-key>' \
--set 'secrets.core.REDIS_PASSWORD=<redis-password>'
# Then restart deployments to pick up secret changes:
for dep in $(kso get deployments -o name); do kso rollout restart "$dep"; done
```
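If the secret values are supplied through shell variables instead of being typed inline, a small guard can stop the upgrade before a blank secret is deployed. A minimal sketch; `require_vars` is a hypothetical helper, and apart from `GHCR_TOKEN` the variable names in the usage line are assumptions:

```bash
# Fail early if any named variable is unset or empty.
# Arguments must be plain variable names, because they are passed through eval.
require_vars() {
  for name in "$@"; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing required variable: $name" >&2
      return 1
    fi
  done
}
```

One possible use is `require_vars GHCR_TOKEN POSTGRES_PASSWORD && helm upgrade ...`, so that the `helm upgrade` only runs when every listed variable has a value.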
### Run database migrations
```bash
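# The glob expands in sorted order; ON_ERROR_STOP aborts a file at its first failing statement.
for f in infra/migrations/*.sql; do
  kubectl exec -i -n postgresql-service postgresql-1 -c postgres -- \
    psql -U postgres -d stonks -v ON_ERROR_STOP=1 < "$f" || break
done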
```
## Trading Mode Toggle
The current mode is set by the `BROKER_MODE` key in the `stonks-config` ConfigMap.
```bash
# Check current mode
kso get configmap stonks-config -o jsonpath='{.data.BROKER_MODE}'
# To switch modes, update values.yaml config.BROKER_MODE and helm upgrade,
# then restart broker-adapter and risk deployments.
```
**Modes:**
- `paper` — all orders go through paper trading simulation (default, safe)
- `live` — orders are submitted to the real broker API (requires operator approval workflow)
**Never switch to live without:**
1. Confirming paper trading PnL is acceptable
2. Verifying risk limits are configured in the `risk_configuration` table
3. Enabling operator approval in the `operator_approvals` table
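
The mode switch can be sketched as a single helper that rejects anything other than the two documented modes. This is a sketch under assumptions: it overrides `config.BROKER_MODE` on the command line with `--reuse-values` rather than editing values.yaml, it assumes the deployments are named `broker-adapter` and `risk` after their services, and `switch_broker_mode` is a hypothetical name.

```bash
# Validate the requested mode, apply it through Helm, then restart the two consumers.
switch_broker_mode() {
  case "$1" in
    paper|live) ;;
    *) echo "BROKER_MODE must be 'paper' or 'live', got '$1'" >&2; return 1 ;;
  esac
  # --reuse-values keeps the values of the last release and overrides only this key.
  helm upgrade stonks-oracle infra/helm/stonks-oracle \
    --namespace stonks-oracle --reuse-values \
    --set "config.BROKER_MODE=$1" &&
    kubectl -n stonks-oracle rollout restart deployment/broker-adapter &&
    kubectl -n stonks-oracle rollout restart deployment/risk
}
```

`switch_broker_mode paper` returns to the safe default. The helper performs none of the three checks listed above, so they still have to be done by hand before `switch_broker_mode live`.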
## Operator Approval for Live Trades
The risk engine requires explicit operator approval before executing live trades.
Approvals are managed via the risk API:
```bash
# Check pending approvals
curl -s https://stonks-api.celestium.life/risk/approvals/pending
# Approve a recommendation
curl -X POST https://stonks-api.celestium.life/risk/approvals/<id>/approve
```
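Plain `curl` exits 0 even when the API answers with an HTTP error, which makes a failed approval easy to miss. A minimal sketch that wraps the approve call with `-f` so that failures surface as a non-zero exit status; `approve_recommendation` and `RISK_API` are hypothetical names, and any authentication the API may require is not shown:

```bash
RISK_API="https://stonks-api.celestium.life/risk"

# Approve one recommendation; -f makes curl exit non-zero on an HTTP error status.
approve_recommendation() {
  if [ -z "${1:-}" ]; then
    echo "usage: approve_recommendation <id>" >&2
    return 2
  fi
  curl -fsS -X POST "$RISK_API/approvals/$1/approve"
}
```

Calling `approve_recommendation <id>` then either prints the API response or fails visibly.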
## Common Failure Modes
### CrashLoopBackOff on workers
Queue workers (aggregation, extractor, recommendation, broker-adapter, lake-publisher) exit with code 0 when the queue is empty. Kubernetes restarts a container even after a clean exit and backs off between restarts, so idle workers show CrashLoopBackOff. This is expected; they process work once messages arrive.
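To tell an idle worker from a real crash, the exit code of the last terminated container can be read from the pod status. A minimal sketch, assuming the worker is the first container in the pod; `last_exit_code` is a hypothetical helper name:

```bash
# Print the exit code of the pod's most recently terminated container (first container only).
last_exit_code() {
  kubectl -n stonks-oracle get pod "$1" \
    -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
}
```

An exit code of `0` matches the empty-queue case described above; any other value points to a genuine failure, and the `--previous` logs are the next place to look.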
### PostgreSQL auth failure
The password in `stonks-core-secrets.POSTGRES_PASSWORD` does not match the actual password of the DB user. Fix:
```bash
kubectl exec -i -n postgresql-service postgresql-1 -c postgres -- psql -U postgres -d stonks <<'EOF'
ALTER USER stonks WITH PASSWORD '<new-password>';
EOF
```
Then update the Helm secret and restart.
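To confirm the mismatch before changing anything, the value currently stored in the cluster can be decoded and compared with the one in use. A minimal sketch, assuming `stonks-core-secrets` is a Secret in the `stonks-oracle` namespace; `core_secret` is a hypothetical helper name:

```bash
# Print the decoded value of one key from the stonks-core-secrets Secret.
# Secret data is stored base64-encoded, hence the decode step.
core_secret() {
  kubectl -n stonks-oracle get secret stonks-core-secrets \
    -o "jsonpath={.data.$1}" | base64 -d
}
```

`core_secret POSTGRES_PASSWORD` prints the password in clear text, so avoid running it in a shared or recorded terminal session.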
### Redis connection refused
Check Redis is running: `kubectl get pods -n redis-service`
If Redis master is down, restart it: `kubectl rollout restart -n redis-service statefulset/redis-master`
### ImagePullBackOff
GHCR credentials expired or missing. Re-run `helm upgrade` with fresh `ghcrAuth.password`.
### Superset won't start
Superset needs a custom image that includes the `sqlalchemy-trino` package. The stock `apache/superset:latest` image doesn't include it.
## Log Access
All services output JSON logs when `JSON_LOGS=true` (default).
```bash
# Stream all logs from a service
kso logs -f deployment/<service> --tail=100
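# Search for errors across all pods (kubectl logs needs a pod or selector, so loop over pods)
for pod in $(kso get pods -o name); do
  kso logs "$pod" --all-containers --prefix --tail=100
done | grep -i error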
```
## Ingress Endpoints
| URL | Service |
|-----|---------|
| https://stonks-api.celestium.life | Query API |
| https://stonks-registry.celestium.life | Symbol Registry |
| https://stonks-dash.celestium.life | Superset |
| https://stonks-trino.celestium.life | Trino |