Stonks Oracle — Operator Runbook

Cluster Access

kubectl config use-context <your-context>
# All stonks-oracle resources live in the stonks-oracle namespace
alias kso='kubectl -n stonks-oracle'

Service Overview

| Service | Type | Replicas | Notes |
|---|---|---|---|
| scheduler | CronJob-like worker | 1 | Polls sources on schedule |
| symbol-registry | FastAPI | 1 | Company/watchlist CRUD |
| ingestion | Queue worker | 2 | Fetches from adapters |
| parser | Queue worker | 2 | HTML→text extraction |
| extractor | Queue worker | 1 | LLM-based intelligence extraction |
| aggregation | Queue worker | 1 | Trend/signal aggregation |
| recommendation | Queue worker | 1 | Trade signal generation |
| risk | FastAPI | 1 | Risk evaluation + approval |
| broker-adapter | Queue worker | 1 | Paper/live order execution |
| lake-publisher | Queue worker | 1 | Iceberg table publication |
| query-api | FastAPI | 1 | Dashboard/analytics queries |
| trino | Analytics engine | 1 | SQL over lakehouse |
| superset | Dashboard | 1 | Visualization |
| hive-metastore | Metastore | 1 | Iceberg catalog backend |

Common Operations

Restart a service

kso rollout restart deployment/<service-name>
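A restart only triggers the rollout; it does not confirm that the new pods became ready. The following is a minimal sketch of a wrapper that waits for completion, assuming bash and kubectl on the PATH. The function name `restart_and_wait` and the 120-second default timeout are illustrative choices, not part of the project.

```shell
# Hypothetical helper: restart a deployment and block until the rollout finishes.
# Uses the stonks-oracle namespace directly because aliases such as kso
# are not expanded inside non-interactive scripts.
restart_and_wait() {
  local service="${1:-}"
  local timeout="${2:-120s}"   # arbitrary default, adjust as needed
  kubectl -n stonks-oracle rollout restart "deployment/${service}" &&
    kubectl -n stonks-oracle rollout status "deployment/${service}" --timeout="${timeout}"
}

# Usage: restart_and_wait parser 180s
```

The function returns non-zero if either the restart or the status check fails, so it can gate later steps in a script.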

Check logs

kso logs deployment/<service-name> --tail=50 -f
# For previous crash:
kso logs <pod-name> --previous --tail=50

Scale a service

kso scale deployment/<service-name> --replicas=N

Redeploy with updated secrets

GHCR_TOKEN=$(cat /run/secrets/github_token)
helm upgrade --install stonks-oracle infra/helm/stonks-oracle \
  --namespace stonks-oracle \
  --set "ghcrAuth.password=$GHCR_TOKEN" \
  --set 'secrets.core.POSTGRES_PASSWORD=<postgres-password>' \
  --set 'secrets.core.MINIO_ACCESS_KEY=<minio-access-key>' \
  --set 'secrets.core.MINIO_SECRET_KEY=<minio-secret-key>' \
  --set 'secrets.core.REDIS_PASSWORD=<redis-password>'
# Then restart deployments to pick up secret changes:
for dep in $(kso get deployments -o name); do kso rollout restart "$dep"; done

Run database migrations

# The glob expands in sorted order, so migrations run in filename order.
for f in infra/migrations/*.sql; do
  kubectl exec -i -n postgresql-service postgresql-1 -c postgres -- \
    psql -U postgres -d stonks -v ON_ERROR_STOP=1 < "$f" || break
done

Trading Mode Toggle

The current mode is set by the BROKER_MODE key in the stonks-config ConfigMap.

# Check current mode
kso get configmap stonks-config -o jsonpath='{.data.BROKER_MODE}'

# To switch modes, update values.yaml config.BROKER_MODE and helm upgrade,
# then restart broker-adapter and risk deployments.

Modes:

  • paper — all orders go through paper trading simulation (default, safe)
  • live — orders are submitted to the real broker API (requires operator approval workflow)

Never switch to live without:

  1. Confirming paper trading PnL is acceptable
  2. Verifying risk limits are configured in risk_configuration table
  3. Enabling operator approval in operator_approvals table
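Because live mode submits real orders, scripts that exercise the pipeline may want to confirm the mode before doing anything else. The sketch below reads the same ConfigMap key and succeeds only in paper mode. The function name `require_paper_mode` and its return codes are assumptions made for illustration.

```shell
# Hypothetical guard: succeed only when BROKER_MODE is "paper".
# Return codes (illustrative): 0 = paper, 1 = any other mode, 2 = lookup failed.
require_paper_mode() {
  local mode
  mode="$(kubectl -n stonks-oracle get configmap stonks-config \
    -o jsonpath='{.data.BROKER_MODE}')" || return 2
  if [ "$mode" != "paper" ]; then
    echo "BROKER_MODE is '${mode}', not 'paper'" >&2
    return 1
  fi
}

# Usage: require_paper_mode && ./run-smoke-test.sh   (script name is a placeholder)
```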

Operator Approval for Live Trades

The risk engine requires explicit operator approval before executing live trades. Approvals are managed via the risk API:

# Check pending approvals
curl -s https://stonks-api.celestium.life/risk/approvals/pending

# Approve a recommendation
curl -X POST https://stonks-api.celestium.life/risk/approvals/<id>/approve
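The two calls above can be wrapped so that HTTP errors surface as non-zero exit codes. This is a minimal sketch; the function names and the `RISK_API` variable are hypothetical, and the response format of the risk API is not assumed.

```shell
# Hypothetical wrappers around the two risk API calls shown above.
# -f makes curl exit non-zero on HTTP status >= 400; -sS hides progress but keeps errors.
RISK_API="${RISK_API:-https://stonks-api.celestium.life/risk}"

list_pending_approvals() {
  curl -fsS "${RISK_API}/approvals/pending"
}

approve_recommendation() {
  local id="${1:-}"
  if [ -z "$id" ]; then
    echo "usage: approve_recommendation <id>" >&2
    return 2
  fi
  curl -fsS -X POST "${RISK_API}/approvals/${id}/approve"
}
```

Refusing an empty id avoids posting to a malformed URL when a variable is unexpectedly unset.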

Common Failure Modes

CrashLoopBackOff on workers

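Queue workers (aggregation, extractor, recommendation, broker-adapter, lake-publisher) exit with code 0 when the queue is empty. Kubernetes restarts the container after every exit, with increasing back-off, so the pod shows CrashLoopBackOff even though nothing failed. This is normal; the workers process work when messages arrive. A non-zero exit code indicates a real crash and should be investigated.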
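To tell the idle exit described above apart from a real crash, the exit code of the last terminated container can be read from the pod status. A minimal sketch follows; the function name `last_exit_code` is hypothetical, and it inspects only the first container in the pod.

```shell
# Hypothetical helper: print the exit code of a pod's last terminated container.
# Reads containerStatuses[0] only; multi-container pods need a different index.
last_exit_code() {
  local pod="${1:-}"
  kubectl -n stonks-oracle get pod "$pod" \
    -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
}

# Usage: last_exit_code <pod-name>
```

An output of 0 matches the empty-queue case; any other value warrants a look at the previous logs.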

PostgreSQL auth failure

Cause: the password in stonks-core-secrets.POSTGRES_PASSWORD does not match the actual password of the DB user. To fix it, reset the user's password:

kubectl exec -i -n postgresql-service postgresql-1 -c postgres -- psql -U postgres -d stonks <<'EOF'
ALTER USER stonks WITH PASSWORD '<new-password>';
EOF

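Then set the same password as secrets.core.POSTGRES_PASSWORD in a helm upgrade and restart the deployments, as described under "Redeploy with updated secrets".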
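To confirm which password the services actually receive, the value stored in the Kubernetes secret can be decoded. This sketch assumes the secret stonks-core-secrets lives in the stonks-oracle namespace; the function name `secret_value` is hypothetical. It prints the secret to the terminal, so use it with care.

```shell
# Hypothetical helper: decode one key of stonks-core-secrets.
# Assumes the secret is in the stonks-oracle namespace.
# Secret data is stored base64-encoded, hence the decode step.
secret_value() {
  local key="${1:-}"
  kubectl -n stonks-oracle get secret stonks-core-secrets \
    -o "jsonpath={.data.${key}}" | base64 --decode
}

# Usage: secret_value POSTGRES_PASSWORD
```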

Redis connection refused

Check that Redis is running:

kubectl get pods -n redis-service

If the Redis master is down, restart it:

kubectl rollout restart -n redis-service statefulset/redis-master

ImagePullBackOff

GHCR credentials expired or missing. Re-run helm upgrade with fresh ghcrAuth.password.
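Before and after refreshing the credentials, it can help to list exactly which pods are stuck on image pulls. The sketch below filters the STATUS column of the default `kubectl get pods` output; the function name `image_pull_failures` is hypothetical.

```shell
# Hypothetical helper: list pods whose status reports an image pull failure.
# Column 3 of the default "kubectl get pods" output is STATUS.
# The regex also matches init-container variants such as Init:ImagePullBackOff.
image_pull_failures() {
  kubectl -n stonks-oracle get pods --no-headers |
    awk '$3 ~ /ImagePullBackOff|ErrImagePull/ { print $1 }'
}
```

Empty output after the helm upgrade suggests the new credentials were picked up.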

Superset won't start

Superset needs a custom image that includes the sqlalchemy-trino package; the stock apache/superset:latest image doesn't include it.

Log Access

All services output JSON logs when JSON_LOGS=true (default).

# Stream all logs from a service
kso logs -f deployment/<service> --tail=100

# Search for errors across all pods
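# (kubectl logs needs a pod or selector, so loop over the pods)
for pod in $(kso get pods -o name); do
  kso logs "$pod" --all-containers --prefix --tail=100
done | grep -i error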
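A plain `grep -i error` also matches info-level lines that merely mention the word. Since the logs are JSON, a filter on the level field is more precise. This sketch assumes the field is named `level` and uses the values `error` or `critical`; both are assumptions, so check them against the services' actual log schema. The function name `only_errors` is hypothetical.

```shell
# Hypothetical filter: keep only JSON log lines whose "level" field is error or critical.
# The field name "level" and its values are assumptions about the log schema.
only_errors() {
  grep -iE '"level"[[:space:]]*:[[:space:]]*"(error|critical)"'
}

# Usage: kso logs deployment/<service> --tail=500 | only_errors
```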

Ingress Endpoints

| URL | Service |
|---|---|
| https://stonks-api.celestium.life | Query API |
| https://stonks-registry.celestium.life | Symbol Registry |
| https://stonks-dash.celestium.life | Superset |
| https://stonks-trino.celestium.life | Trino |
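A quick reachability check over the endpoints listed above can be sketched as follows. The function name `check_endpoints` and the 10-second timeout are illustrative; which status code each service returns at its root path is not assumed, so the output is only printed, not judged.

```shell
# Hypothetical helper: print the HTTP status code for each ingress endpoint.
check_endpoints() {
  local url code
  for url in \
    https://stonks-api.celestium.life \
    https://stonks-registry.celestium.life \
    https://stonks-dash.celestium.life \
    https://stonks-trino.celestium.life
  do
    # %{http_code} is 000 when the connection itself fails;
    # "|| true" keeps the loop going in that case.
    code="$(curl -s -o /dev/null --max-time 10 -w '%{http_code}' "$url" || true)"
    echo "${code} ${url}"
  done
}
```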