Celes Renata fde819ec09 docs: update README and runbook for broker-synced reset, confidence dampener, paper account workflow

2026-04-17 04:32:49 +00:00

Stonks Oracle — Operator Runbook

Cluster Access

kubectl config use-context <your-context>
# All stonks-oracle resources live in the stonks-oracle namespace
alias kso='kubectl -n stonks-oracle'
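
Aliases are only expanded in interactive shells by default, so the alias above will not work inside a script. A minimal sketch of a function equivalent, which behaves the same way in both contexts:

```shell
# Drop the alias first: if it is defined, the shell would expand "kso"
# inside the function definition below and produce a syntax error.
unalias kso 2>/dev/null || true

# Function equivalent of the alias; it also works in non-interactive scripts.
kso() {
  kubectl -n stonks-oracle "$@"
}
```

With this in place, `kso get pods` runs `kubectl -n stonks-oracle get pods`, exactly as the alias does.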

The cluster is a 4-node k3s cluster (gremlin-1 through gremlin-4). The deploy host is gremlin-1 (192.168.42.254), where the secrets and the deploy script live.

Service Overview

Service	Type	Replicas	Notes
scheduler	CronJob-like worker	1	Polls sources on schedule
symbol-registry	FastAPI	1	Company/watchlist/exposure/competitor CRUD
ingestion	Queue worker	2	Fetches from adapters (market data, news, filings, macro)
parser	Queue worker	2	HTML→text extraction
extractor	Queue worker	1	LLM-based intelligence extraction + event classification
aggregation	Queue worker	1	Trend/signal aggregation across all 3 layers
recommendation	Queue worker	1	Trade signal generation
trading-engine	FastAPI	1	Autonomous decision loop, position sizing, backtesting
risk	FastAPI	1	Risk evaluation + approval
broker-adapter	Queue worker	1	Paper/live order execution via Alpaca
lake-publisher	Queue worker	1	Iceberg table publication
query-api	FastAPI	1	Dashboard/analytics queries
dashboard	nginx	1	React SPA on port 8080
trino	Analytics engine	1	SQL over lakehouse
superset	Dashboard	1	Visualization
hive-metastore	Metastore	1	Iceberg catalog backend

Deployment

Full Deploy

Run from gremlin-1 where secrets are available:

bash ~/sources/kube/stonks-oracle/runmefirst.sh

This script:

Pulls latest code
Creates namespace with Helm labels
Sets up PostgreSQL user and database
Runs all migrations in order
Deploys via Helm with secrets injected
Performs a rolling restart of all deployments

Quick Helm Upgrade

After CI builds new images:

helm upgrade --install stonks-oracle infra/helm/stonks-oracle -n stonks-oracle
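
After an upgrade it can be useful to block until every deployment has finished rolling out. The helper below is a sketch, not part of the repository: the function name `wait_for_rollouts` and the 180s default timeout are assumptions.

```shell
# Wait until every deployment in the namespace has finished rolling out.
# The 180s default timeout is an arbitrary choice, not a value from this runbook.
wait_for_rollouts() {
  local ns="${1:-stonks-oracle}" timeout="${2:-180s}" d rc=0
  for d in $(kubectl -n "$ns" get deployments -o name); do
    # Keep going after a failure so every stuck deployment is reported.
    kubectl -n "$ns" rollout status "$d" --timeout="$timeout" || rc=1
  done
  return "$rc"
}
```

Running `wait_for_rollouts` right after the Helm command returns non-zero if any deployment fails to become ready within the timeout.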

Full Teardown

Preserves PostgreSQL, Redis, and MinIO data:

bash ~/sources/kube/stonks-oracle/runmelast.sh

Secrets Management

Secrets are stored on the deploy host at ~/sources/kube/stonks-oracle/. This directory is NOT a git repo — secrets stay local.

Required secret files:

~/sources/kube/stonks-oracle/polygon.io.key — Polygon.io API key
~/sources/kube/stonks-oracle/alpaca.key — Alpaca API key
~/sources/kube/stonks-oracle/alpaca.secret — Alpaca API secret
~/sources/kube/stonks-oracle/alpaca.url — Alpaca base URL (defaults to paper API)
/run/secrets/github_token — GHCR authentication token

The deploy script (runmefirst.sh) reads these files and injects them into Kubernetes secrets via Helm --set flags. Never hardcode secrets in manifests, values files, or this runbook.
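
The file-to-flag injection described above could look roughly like the following. This is a sketch under stated assumptions, not the contents of runmefirst.sh: the function name and the Helm value keys (`secrets.polygonApiKey` and so on) are hypothetical placeholders, and the GHCR token is left out.

```shell
# Directory holding the secret files (from this runbook); override for testing.
SECRETS_DIR="${SECRETS_DIR:-$HOME/sources/kube/stonks-oracle}"

helm_upgrade_with_secrets() {
  local pair file key
  set --  # reuse the positional parameters as the list of Helm flags
  # file:key pairs; the value keys on the right are HYPOTHETICAL placeholders.
  for pair in \
    "polygon.io.key:secrets.polygonApiKey" \
    "alpaca.key:secrets.alpacaApiKey" \
    "alpaca.secret:secrets.alpacaApiSecret" \
    "alpaca.url:secrets.alpacaBaseUrl"; do
    file="$SECRETS_DIR/${pair%%:*}"
    key="${pair##*:}"
    if [ ! -s "$file" ]; then
      echo "missing or empty secret file: $file" >&2
      return 1
    fi
    # $(cat ...) strips trailing newlines that would otherwise end up in the secret.
    set -- "$@" --set-string "$key=$(cat "$file")"
  done
  helm upgrade --install stonks-oracle infra/helm/stonks-oracle -n stonks-oracle "$@"
}
```

Failing fast on a missing or empty file avoids deploying a release with a blank credential. Note that Helm treats commas in `--set` values as separators, so a secret containing a comma would need escaping.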

To rotate a secret:

Update the file on gremlin-1
Re-run runmefirst.sh (or helm upgrade with the new --set values)
Restart affected deployments

Common Operations

Restart a service

kso rollout restart deployment/<service-name>

Check logs

kso logs deployment/<service-name> --tail=50 -f
# For previous crash:
kso logs <pod-name> --previous --tail=50

Scale a service

kso scale deployment/<service-name> --replicas=N

Run database migrations

# Globs expand in sorted order, so migrations run in filename order
for f in infra/migrations/*.sql; do
  kubectl exec -i -n postgresql-service postgresql-1 -c postgres -- psql -U postgres -d stonks < "$f"
done

Trading Engine Operations

Check trading engine status

curl -s https://stonks-trading.celestium.life/health
curl -s https://stonks-trading.celestium.life/ready
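
When scripting around a restart, the readiness check can be polled until it succeeds. The sketch below assumes that `/ready` answers with a non-2xx status while the engine is not ready; the function name, attempt count, and delay are illustrative choices.

```shell
# Poll a readiness URL until it returns a successful HTTP status.
# Usage: wait_until_ready [url] [attempts] [delay-seconds]
wait_until_ready() {
  local url="${1:-https://stonks-trading.celestium.life/ready}"
  local attempts="${2:-30}" delay="${3:-2}" i=1
  while [ "$i" -le "$attempts" ]; do
    # -f makes curl exit non-zero on HTTP 4xx/5xx responses.
    if curl -fsS --max-time 5 "$url" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "not ready after $attempts attempts: $url" >&2
  return 1
}
```

For example, `kso rollout restart deployment/trading-engine && wait_until_ready` returns once the engine reports ready, or fails after roughly a minute.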

Pause trading

# Via API — sets enabled=false in trading_engine_config
curl -X PUT https://stonks-trading.celestium.life/api/trading/config \
  -H 'Content-Type: application/json' \
  -d '{"enabled": false}'

Resume trading

curl -X PUT https://stonks-trading.celestium.life/api/trading/config \
  -H 'Content-Type: application/json' \
  -d '{"enabled": true}'
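
The pause and resume calls differ only in the boolean, so they can be wrapped in one helper. A small sketch; the function name and the `TRADING_URL` variable are assumptions, while the endpoint and payload are the ones shown above.

```shell
TRADING_URL="${TRADING_URL:-https://stonks-trading.celestium.life}"

# Usage: set_trading_enabled true|false
set_trading_enabled() {
  case "${1:-}" in
    true | false) ;;
    *)
      echo "usage: set_trading_enabled true|false" >&2
      return 2
      ;;
  esac
  curl -fsS -X PUT "$TRADING_URL/api/trading/config" \
    -H 'Content-Type: application/json' \
    -d "{\"enabled\": $1}"
}
```

`set_trading_enabled false` pauses trading and `set_trading_enabled true` resumes it; any other argument is rejected before a request is sent.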

Full paper trading reset

Liquidates all Alpaca positions, cancels open orders, wipes all local trading state (decisions, orders, positions, snapshots, backtests), and sets engine capital from the broker's actual account balance.

# Reset and sync capital from broker
curl -X POST https://stonks-trading.celestium.life/api/trading/reset \
  -H 'Content-Type: application/json' \
  -d '{}'

# Or override with a specific capital amount
curl -X POST https://stonks-trading.celestium.life/api/trading/reset \
  -H 'Content-Type: application/json' \
  -d '{"initial_capital": 100000}'

Note: if the market is closed, Alpaca liquidation orders will be queued and fill at next market open. The engine capital is set immediately.
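
Because the reset is destructive, a wrapper that validates the optional capital override before sending anything may be worthwhile. This is a sketch; the function names are hypothetical, and the numeric check is an illustrative guard rather than a rule taken from the API.

```shell
# Build the reset request body: {} to sync from the broker,
# or an explicit initial_capital override.
reset_payload() {
  local capital="${1:-}"
  case "$capital" in
    "") printf '{}' ;;
    *[!0-9.]* | *.*.* | .* | *.)
      echo "initial_capital must be a non-negative number, got: $capital" >&2
      return 2
      ;;
    *) printf '{"initial_capital": %s}' "$capital" ;;
  esac
}

# Usage: reset_paper_trading [initial_capital]
reset_paper_trading() {
  local payload
  payload="$(reset_payload "${1:-}")" || return 2
  curl -fsS -X POST "${TRADING_URL:-https://stonks-trading.celestium.life}/api/trading/reset" \
    -H 'Content-Type: application/json' \
    -d "$payload"
}
```

`reset_paper_trading` syncs capital from the broker, and `reset_paper_trading 100000` sends the override. Input such as `1e5` or `-5` is rejected locally.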

Switching to a new Alpaca paper account

Alpaca allows up to 3 paper accounts. To start fresh:

Go to https://app.alpaca.markets
Click paper account number → "Open New Paper Account"
Generate new API keys
Update secrets on gremlin-1: alpaca.key, alpaca.secret, alpaca.url
Re-run runmefirst.sh or helm upgrade with new --set values
Restart broker-adapter: kso rollout restart deployment/broker-adapter
Hit the reset endpoint to sync engine state with the new account

Check recent trading decisions

# Quote the URL so the shell does not treat ? as a glob character
curl -s 'https://stonks-api.celestium.life/api/trading/decisions?limit=10'

Run a backtest

curl -X POST https://stonks-trading.celestium.life/api/trading/backtest \
  -H 'Content-Type: application/json' \
  -d '{"start_date": "2025-01-01", "end_date": "2025-06-01", "initial_capital": 100000, "risk_tier": "moderate"}'
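
For repeated backtests, a helper that builds the request body from arguments and checks the date range first can save a round trip. A sketch with hypothetical function names; the defaults mirror the example above, and the date check validates only the YYYY-MM-DD shape, not calendar validity.

```shell
# Succeeds when the argument has the shape YYYY-MM-DD.
is_iso_date() {
  case "${1:-}" in
    [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage: run_backtest START END [initial_capital] [risk_tier]
run_backtest() {
  local start="${1:-}" end="${2:-}" capital="${3:-100000}" tier="${4:-moderate}"
  if ! is_iso_date "$start" || ! is_iso_date "$end"; then
    echo "dates must be YYYY-MM-DD" >&2
    return 2
  fi
  # With the dashes removed, ISO dates compare correctly as integers.
  if [ "$(printf '%s' "$start" | tr -d '-')" -ge "$(printf '%s' "$end" | tr -d '-')" ]; then
    echo "start_date must be before end_date" >&2
    return 2
  fi
  curl -fsS -X POST "${TRADING_URL:-https://stonks-trading.celestium.life}/api/trading/backtest" \
    -H 'Content-Type: application/json' \
    -d "{\"start_date\": \"$start\", \"end_date\": \"$end\", \"initial_capital\": $capital, \"risk_tier\": \"$tier\"}"
}
```

`run_backtest 2025-01-01 2025-06-01` sends the same request as the curl command above.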

Check circuit breaker status

curl -s https://stonks-api.celestium.life/api/trading/circuit-breaker

Check portfolio state

curl -s https://stonks-api.celestium.life/api/trading/portfolio

Broker Mode Toggle

The current mode is set by the BROKER_MODE key in the stonks-config ConfigMap.

# Check current mode
kso get configmap stonks-config -o jsonpath='{.data.BROKER_MODE}'

Modes:

paper — all orders go through paper trading simulation (default)
live — orders submitted to real broker API (requires operator approval workflow)

Never switch to live without:

Confirming paper trading PnL is acceptable
Verifying risk limits are configured
Enabling operator approval in the risk engine
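
Before running paper-only operations such as the full reset from a script, it may help to assert the broker mode first. A minimal sketch built on the ConfigMap lookup shown above; the function names are hypothetical.

```shell
# Print the current broker mode from the stonks-config ConfigMap.
broker_mode() {
  kubectl -n stonks-oracle get configmap stonks-config -o jsonpath='{.data.BROKER_MODE}'
}

# Fail unless the broker mode is exactly "paper".
require_paper_mode() {
  local mode
  mode="$(broker_mode)" || return 1
  if [ "$mode" != "paper" ]; then
    echo "refusing to continue: BROKER_MODE is '$mode', expected 'paper'" >&2
    return 1
  fi
}
```

Chaining it as `require_paper_mode && <command>` stops the command from running when the cluster is in live mode or the lookup fails.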

Signal Layer Toggles

Macro signal layer

# Check status
curl -s https://stonks-api.celestium.life/api/admin/macro/status

# Toggle
curl -X PUT https://stonks-api.celestium.life/api/admin/macro/toggle

Competitive signal layer

# Check status
curl -s https://stonks-api.celestium.life/api/admin/competitive/status

# Toggle
curl -X PUT https://stonks-api.celestium.life/api/admin/competitive/toggle
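
The toggle endpoints flip the current state, so a script that needs a specific state has to check the status first. The sketch below does that, but it rests on an unverified assumption: that the status response is JSON containing `"enabled": true` or `"enabled": false`. Check the real response and adjust the pattern before relying on it. The function name and `API_URL` variable are also hypothetical.

```shell
API_URL="${API_URL:-https://stonks-api.celestium.life}"

# Usage: ensure_layer macro|competitive on|off
ensure_layer() {
  local layer="${1:-}" want="${2:-}" status have=off
  case "$want" in
    on | off) ;;
    *)
      echo "usage: ensure_layer macro|competitive on|off" >&2
      return 2
      ;;
  esac
  # Abort rather than toggle blindly if the status cannot be read.
  status="$(curl -fsS "$API_URL/api/admin/$layer/status")" || return 1
  # ASSUMPTION: the response contains "enabled": true when the layer is on.
  if printf '%s' "$status" | grep -Eq '"enabled"[[:space:]]*:[[:space:]]*true'; then
    have=on
  fi
  if [ "$have" != "$want" ]; then
    curl -fsS -X PUT "$API_URL/api/admin/$layer/toggle"
  fi
}
```

`ensure_layer macro off` toggles only when the macro layer is currently on, which makes the call safe to repeat.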

Backup and Restore

Database backup

# Local backup (keeps last 7)
./scripts/backup-db.sh

# Backup + upload to MinIO
./scripts/backup-db.sh --upload-minio

Backups go to ~/backups/stonks-oracle/. Old backups are auto-pruned (keeps last 7).
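
The keep-last-7 pruning can be pictured as follows. This is an illustration of the retention rule, not the code in backup-db.sh; it assumes backups are named like `stonks-20250615-180000.sql.gz`, as in the restore example in this runbook, so that sorting by name also sorts by time.

```shell
# Delete all but the newest N backups in a directory (default N=7).
# Usage: prune_backups DIR [keep]
prune_backups() {
  local dir="$1" keep="${2:-7}" f
  # Timestamped names sort chronologically, so a reverse sort lists newest first;
  # tail then selects everything after the first $keep entries.
  find "$dir" -maxdepth 1 -type f -name 'stonks-*.sql.gz' | sort -r |
    tail -n "+$((keep + 1))" |
    while IFS= read -r f; do
      rm -- "$f"
    done
}
```

Files that do not match the backup name pattern are left alone.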

Database restore

# Lists available backups if no argument given
./scripts/restore-db.sh

# Restore a specific backup (WARNING: replaces all data)
./scripts/restore-db.sh ~/backups/stonks-oracle/stonks-20250615-180000.sql.gz

The restore script scales down all services, restores the dump, re-grants permissions, and scales services back up.

Redis backup

./scripts/backup-redis.sh

Triggers a BGSAVE and copies the RDB dump locally.

Database Nuke & Rebuild

When a full reset is needed:

# 1. Tear down Helm release
bash ~/sources/kube/stonks-oracle/runmelast.sh

# 2. Terminate connections and drop database
kubectl exec -n postgresql-service postgresql-1 -c postgres -- \
  psql -U postgres -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE datname = 'stonks' AND pid <> pg_backend_pid();"
kubectl exec -n postgresql-service postgresql-1 -c postgres -- \
  psql -U postgres -c "DROP DATABASE IF EXISTS stonks;"

# 3. Flush Redis dedup markers
# (clear all stonks:* keys from Redis)

# 4. Full redeploy (creates DB, runs migrations, deploys)
bash ~/sources/kube/stonks-oracle/runmefirst.sh

# 5. Re-seed companies and relationships
# (run from a pod or with port-forwarded DB access)
python -m services.symbol_registry.seed
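
Step 3 above leaves the Redis flush as a comment. One way to do it is sketched below, under several assumptions: the Redis pod is named `redis-master-0` (inferred from the `redis-master` StatefulSet in the `redis-service` namespace mentioned elsewhere in this runbook), `redis-cli` is available in that pod, no password is required, and key names contain no whitespace.

```shell
# Delete every key matching stonks:* from Redis.
# ASSUMPTIONS: pod name redis-master-0, unauthenticated redis-cli inside the pod.
flush_stonks_keys() {
  local ns="${REDIS_NAMESPACE:-redis-service}" pod="${REDIS_POD:-redis-master-0}"
  # --scan iterates with SCAN instead of KEYS, which avoids blocking the server;
  # xargs -r skips the DEL call entirely when nothing matches.
  kubectl exec -n "$ns" "$pod" -- sh -c \
    "redis-cli --scan --pattern 'stonks:*' | xargs -r -n 100 redis-cli DEL"
}
```

If Redis requires authentication, the `redis-cli` calls would need credentials, for example through the `REDISCLI_AUTH` environment variable inside the pod.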

Monitoring

Check pod status

kso get pods
kso get pods -o wide  # includes node placement

Check ingestion health

# Recent ingestion activity
kso logs deployment/ingestion --tail=20

# Source failure alerts
kso logs deployment/scheduler --tail=20 | grep -i "failure\|alert"

Check broker errors

kso logs deployment/broker-adapter --tail=30 | grep -i "error\|fail"

Check global event processing

kso logs deployment/extractor --tail=20 | grep -i "macro\|global"

Check trading decisions

kso logs deployment/trading-engine --tail=30

Stream all errors

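# kubectl logs needs a pod, resource, or selector, so iterate over the deployments
kso get deployments -o name | xargs -I{} kubectl -n stonks-oracle logs {} --all-containers --prefix --tail=100 | grep -i error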

Ingress Endpoints

URL	Service
https://stonks.celestium.life	Dashboard
https://stonks-api.celestium.life	Query API
https://stonks-registry.celestium.life	Symbol Registry
https://stonks-trading.celestium.life	Trading Engine
https://stonks-dash.celestium.life	Superset
https://stonks-trino.celestium.life	Trino

CI/CD

Workflow: .github/workflows/build.yml

Push to main triggers: lint → pytest → frontend vitest → build all service images → push to GHCR.

Check recent builds

gh run list -L 5

Re-run a failed build

gh run rerun <run-id> --failed

View failure logs

gh run view <run-id> --log-failed

Common Failure Modes

CrashLoopBackOff on workers

Queue workers (aggregation, extractor, recommendation, broker-adapter, lake-publisher) exit with code 0 when the queue is empty. Kubernetes restarts them — this is normal. They process work when messages arrive.
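
To tell this benign case apart from a real crash, the exit code of the last termination can be read from the pod status. A minimal sketch with hypothetical function names; it inspects only the first container in the pod.

```shell
# Print the exit code of the most recent termination of a pod's first container.
last_exit_code() {
  kubectl -n stonks-oracle get pod "$1" \
    -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
}

# Succeeds only when the last restart followed a clean (code 0) exit.
exited_cleanly() {
  [ "$(last_exit_code "$1")" = "0" ]
}
```

A worker for which `exited_cleanly <pod-name>` fails did not stop because of an empty queue, and its previous logs deserve a look.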

PostgreSQL auth failure

The password in the Kubernetes secret does not match the actual DB user's password. Fix by re-running runmefirst.sh, which resets the password and redeploys.

Redis connection refused

kubectl get pods -n redis-service
kubectl rollout restart -n redis-service statefulset/redis-master

ImagePullBackOff

GHCR credentials expired or missing. Re-run runmefirst.sh with a fresh GitHub token at /run/secrets/github_token.

Trading engine not making decisions

Check if trading is enabled: curl -s https://stonks-trading.celestium.life/health
Check circuit breaker status — may be tripped
Check if within trading window (9:45 AM – 3:45 PM ET)
Check if there are actionable recommendations in the queue
Check logs: kso logs deployment/trading-engine --tail=50
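
The trading-window check in the list above can be scripted. The sketch below uses the 9:45 AM to 3:45 PM ET window from this runbook and additionally assumes that trading happens on weekdays only. It does not account for market holidays, and the function name is hypothetical.

```shell
# Usage: in_trading_window [HHMM] [day-of-week 1-7, Monday=1]
# Both arguments default to the current time in New York.
in_trading_window() {
  local hhmm="${1:-$(TZ=America/New_York date +%H%M)}"
  local dow="${2:-$(TZ=America/New_York date +%u)}"
  # ASSUMPTION: weekdays only; the runbook states only the time window.
  [ "$dow" -le 5 ] || return 1
  # test compares operands as decimal integers, so a leading zero (0945) is harmless.
  [ "$hhmm" -ge 945 ] && [ "$hhmm" -le 1545 ]
}
```

`in_trading_window && echo open || echo closed` gives a quick answer for the current moment, and passing explicit arguments makes it easy to check a specific time.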
