docs: rewrite README and runbook for current platform state

README: updated architecture diagram, three signal layers, tracked
universe, autonomous trading engine, global news interpolation,
competitive intelligence, paper trading, notification service,
updated services table, project structure, deployment, endpoints.

Runbook: updated service overview, deployment via runmefirst.sh,
secrets management (keys in kube dir not repo), backup/restore
scripts, trading engine operations, signal layer toggles, database
nuke & rebuild, monitoring, CI/CD, removed hardcoded secrets.
Celes Renata
2026-04-16 02:06:18 +00:00
parent e652a62dbc
commit 9aae57f3e1
2 changed files with 366 additions and 134 deletions
+253 -58
@@ -8,25 +8,81 @@ kubectl config use-context <your-context>
alias kso='kubectl -n stonks-oracle'
```
4-node k3s cluster (gremlin-1 through gremlin-4). Deploy host is gremlin-1 (192.168.42.254) where secrets and the deploy script live.
## Service Overview
| Service | Type | Replicas | Notes |
|---------|------|----------|-------|
| scheduler | CronJob-like worker | 1 | Polls sources on schedule |
| symbol-registry | FastAPI | 1 | Company/watchlist CRUD |
| ingestion | Queue worker | 2 | Fetches from adapters |
| symbol-registry | FastAPI | 1 | Company/watchlist/exposure/competitor CRUD |
| ingestion | Queue worker | 2 | Fetches from adapters (market data, news, filings, macro) |
| parser | Queue worker | 2 | HTML→text extraction |
| extractor | Queue worker | 1 | LLM-based intelligence extraction |
| aggregation | Queue worker | 1 | Trend/signal aggregation |
| extractor | Queue worker | 1 | LLM-based intelligence extraction + event classification |
| aggregation | Queue worker | 1 | Trend/signal aggregation across all 3 layers |
| recommendation | Queue worker | 1 | Trade signal generation |
| trading-engine | FastAPI | 1 | Autonomous decision loop, position sizing, backtesting |
| risk | FastAPI | 1 | Risk evaluation + approval |
| broker-adapter | Queue worker | 1 | Paper/live order execution |
| broker-adapter | Queue worker | 1 | Paper/live order execution via Alpaca |
| lake-publisher | Queue worker | 1 | Iceberg table publication |
| query-api | FastAPI | 1 | Dashboard/analytics queries |
| dashboard | nginx | 1 | React SPA on port 8080 |
| trino | Analytics engine | 1 | SQL over lakehouse |
| superset | Dashboard | 1 | Visualization |
| hive-metastore | Metastore | 1 | Iceberg catalog backend |
## Deployment
### Full Deploy
Run from gremlin-1 where secrets are available:
```bash
bash ~/sources/kube/stonks-oracle/runmefirst.sh
```
This script:
1. Pulls latest code
2. Creates namespace with Helm labels
3. Sets up PostgreSQL user and database
4. Runs all migrations in order
5. Deploys via Helm with secrets injected
6. Rolling restarts all deployments
### Quick Helm Upgrade
After CI builds new images:
```bash
helm upgrade --install stonks-oracle infra/helm/stonks-oracle -n stonks-oracle
```
### Full Teardown
Preserves PostgreSQL, Redis, and MinIO data:
```bash
bash ~/sources/kube/stonks-oracle/runmelast.sh
```
## Secrets Management
Secrets are stored on the deploy host at `~/sources/kube/stonks-oracle/`. This directory is NOT a git repo — secrets stay local.
Required secret files:
- `~/sources/kube/stonks-oracle/polygon.io.key` — Polygon.io API key
- `~/sources/kube/stonks-oracle/alpaca.key` — Alpaca API key
- `~/sources/kube/stonks-oracle/alpaca.secret` — Alpaca API secret
- `~/sources/kube/stonks-oracle/alpaca.url` — Alpaca base URL (defaults to paper API)
- `/run/secrets/github_token` — GHCR authentication token
The deploy script (`runmefirst.sh`) reads these files and injects them into Kubernetes secrets via Helm `--set` flags. Never hardcode secrets in manifests, values files, or this runbook.
To rotate a secret:
1. Update the file on gremlin-1
2. Re-run `runmefirst.sh` (or `helm upgrade` with the new `--set` values)
3. Restart affected deployments
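Before re-running the deploy script after a rotation, it can help to confirm that the secret files listed above are present and non-empty. The following is a minimal sketch, not part of `runmefirst.sh`: the function name `check_secret_files` is hypothetical, and it assumes all four files in the deploy directory are mandatory (the runbook notes that `alpaca.url` has a default, so that file may be optional in practice). It does not check the GHCR token at `/run/secrets/github_token`.

```bash
# Hypothetical preflight helper; not part of runmefirst.sh.
# Prints each expected secret file that is missing or empty and
# returns non-zero if any are.
check_secret_files() {
  local dir="$1"
  local missing=0
  local name
  for name in polygon.io.key alpaca.key alpaca.secret alpaca.url; do
    # -s is true only when the file exists and is larger than zero bytes
    if [ ! -s "$dir/$name" ]; then
      echo "missing or empty: $dir/$name" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

A possible use is `check_secret_files ~/sources/kube/stonks-oracle && bash ~/sources/kube/stonks-oracle/runmefirst.sh`, so the deploy only starts when every file is in place.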
## Common Operations
### Restart a service
@@ -46,20 +102,6 @@ kso logs <pod-name> --previous --tail=50
kso scale deployment/<service-name> --replicas=N
```
### Redeploy with updated secrets
```bash
GHCR_TOKEN=$(cat /run/secrets/github_token)
helm upgrade --install stonks-oracle infra/helm/stonks-oracle \
--namespace stonks-oracle \
--set "ghcrAuth.password=$GHCR_TOKEN" \
--set 'secrets.core.POSTGRES_PASSWORD=St0nks0racl3!' \
--set "secrets.core.MINIO_ACCESS_KEY=AKIA6V7J3N9B5P0D2YQH" \
--set 'secrets.core.MINIO_SECRET_KEY=8fG3!v2rJ7$wN@9mLpQ6zXbC4tKdPqW1' \
--set 'secrets.core.REDIS_PASSWORD=PSCh4ng3me!'
# Then restart deployments to pick up secret changes:
for dep in $(kso get deployments -o name); do kso rollout restart "$dep"; done
```
### Run database migrations
```bash
for f in $(ls infra/migrations/*.sql | sort); do
@@ -67,73 +109,179 @@ for f in $(ls infra/migrations/*.sql | sort); do
done
```
## Trading Mode Toggle
## Trading Engine Operations
### Check trading engine status
```bash
curl -s https://stonks-trading.celestium.life/health
curl -s https://stonks-trading.celestium.life/ready
```
### Pause trading
```bash
# Via API — sets enabled=false in trading_engine_config
curl -X PUT https://stonks-trading.celestium.life/api/trading/config \
-H 'Content-Type: application/json' \
-d '{"enabled": false}'
```
### Resume trading
```bash
curl -X PUT https://stonks-trading.celestium.life/api/trading/config \
-H 'Content-Type: application/json' \
-d '{"enabled": true}'
```
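The pause and resume calls above differ only in the boolean sent to the config endpoint, so they could be wrapped in one helper. This is a sketch under assumptions: the function names `trading_config_payload` and `set_trading_enabled` are hypothetical, and the endpoint is assumed to accept exactly the request body shown in the runbook.

```bash
# Hypothetical helper: builds the JSON body for the config endpoint.
# Accepts only the literal strings "true" or "false".
trading_config_payload() {
  case "$1" in
    true|false) printf '{"enabled": %s}\n' "$1" ;;
    *) echo "usage: trading_config_payload true|false" >&2; return 2 ;;
  esac
}

# Hypothetical wrapper around the PUT request used for pause/resume.
# Fails before calling curl if the argument is not true or false.
set_trading_enabled() {
  local body
  body=$(trading_config_payload "$1") || return
  curl -fsS -X PUT https://stonks-trading.celestium.life/api/trading/config \
    -H 'Content-Type: application/json' \
    -d "$body"
}
```

With this in place, `set_trading_enabled false` would pause trading and `set_trading_enabled true` would resume it. Rejecting anything other than the two literals keeps a typo from sending a malformed body.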
### Check recent trading decisions
```bash
curl -s 'https://stonks-api.celestium.life/api/trading/decisions?limit=10'
```
### Run a backtest
```bash
curl -X POST https://stonks-trading.celestium.life/api/trading/backtest \
-H 'Content-Type: application/json' \
-d '{"start_date": "2025-01-01", "end_date": "2025-06-01", "initial_capital": 100000, "risk_tier": "moderate"}'
```
### Check circuit breaker status
```bash
curl -s https://stonks-api.celestium.life/api/trading/circuit-breaker
```
### Check portfolio state
```bash
curl -s https://stonks-api.celestium.life/api/trading/portfolio
```
## Broker Mode Toggle
Current mode is set via ConfigMap `stonks-config` key `BROKER_MODE`.
```bash
# Check current mode
kso get configmap stonks-config -o jsonpath='{.data.BROKER_MODE}'
# To switch modes, update values.yaml config.BROKER_MODE and helm upgrade,
# then restart broker-adapter and risk deployments.
```
**Modes:**
- `paper` — all orders go through paper trading simulation (default, safe)
- `live` — orders are submitted to the real broker API (requires operator approval workflow)
- `paper` — all orders go through paper trading simulation (default)
- `live` — orders submitted to real broker API (requires operator approval workflow)
**Never switch to live without:**
1. Confirming paper trading PnL is acceptable
2. Verifying risk limits are configured in `risk_configuration` table
3. Enabling operator approval in `operator_approvals` table
2. Verifying risk limits are configured
3. Enabling operator approval in the risk engine
## Operator Approval for Live Trades
The risk engine requires explicit operator approval before executing live trades.
Approvals are managed via the risk API:
## Signal Layer Toggles
### Macro signal layer
```bash
# Check pending approvals
curl -s https://stonks-api.celestium.life/risk/approvals/pending
# Check status
curl -s https://stonks-api.celestium.life/api/admin/macro/status
# Approve a recommendation
curl -X POST https://stonks-api.celestium.life/risk/approvals/<id>/approve
# Toggle
curl -X PUT https://stonks-api.celestium.life/api/admin/macro/toggle
```
## Common Failure Modes
### CrashLoopBackOff on workers
Queue workers (aggregation, extractor, recommendation, broker-adapter, lake-publisher) exit with code 0 when the queue is empty. Kubernetes restarts them, which is normal. They'll process work when messages arrive.
### PostgreSQL auth failure
Password mismatch between `stonks-core-secrets.POSTGRES_PASSWORD` and the actual DB user password. Fix:
### Competitive signal layer
```bash
kubectl exec -i -n postgresql-service postgresql-1 -c postgres -- psql -U postgres -d stonks <<'EOF'
ALTER USER stonks WITH PASSWORD '<new-password>';
EOF
# Check status
curl -s https://stonks-api.celestium.life/api/admin/competitive/status
# Toggle
curl -X PUT https://stonks-api.celestium.life/api/admin/competitive/toggle
```
Then update the Helm secret and restart.
### Redis connection refused
Check Redis is running: `kubectl get pods -n redis-service`
If Redis master is down, restart it: `kubectl rollout restart -n redis-service statefulset/redis-master`
## Backup and Restore
### ImagePullBackOff
GHCR credentials expired or missing. Re-run `helm upgrade` with fresh `ghcrAuth.password`.
### Database backup
```bash
# Local backup (keeps last 7)
./scripts/backup-db.sh
### Superset won't start
Needs custom image with `sqlalchemy-trino` package. Stock `apache/superset:latest` doesn't include it.
# Backup + upload to MinIO
./scripts/backup-db.sh --upload-minio
```
## Log Access
Backups go to `~/backups/stonks-oracle/`. Old backups are auto-pruned (keeps last 7).
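To illustrate what "keeps last 7" means, here is a rough sketch of that kind of pruning. It is not taken from `backup-db.sh`, which may work differently. The function name `prune_backups` is hypothetical, and the sketch assumes backup files follow the `stonks-YYYYMMDD-HHMMSS.sql.gz` naming seen in the restore example, so that sorting by name is the same as sorting by time.

```bash
# Hypothetical sketch of "keep the newest N" pruning; the real
# backup-db.sh may do this differently.
prune_backups() {
  local dir="$1"
  local keep="${2:-7}"
  # Timestamped names sort chronologically, so a reverse name sort
  # puts the newest first; everything after the first $keep is removed.
  ls -1 "$dir"/stonks-*.sql.gz 2>/dev/null \
    | sort -r \
    | tail -n "+$((keep + 1))" \
    | while IFS= read -r f; do
        rm -f -- "$f"
      done
}
```

Called as `prune_backups ~/backups/stonks-oracle 7`, this would leave the seven most recent dumps and delete the rest.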
All services output JSON logs when `JSON_LOGS=true` (default).
### Database restore
```bash
# Lists available backups if no argument given
./scripts/restore-db.sh
# Restore a specific backup (WARNING: replaces all data)
./scripts/restore-db.sh ~/backups/stonks-oracle/stonks-20250615-180000.sql.gz
```
The restore script scales down all services, restores the dump, re-grants permissions, and scales services back up.
### Redis backup
```bash
./scripts/backup-redis.sh
```
Triggers a BGSAVE and copies the RDB dump locally.
## Database Nuke & Rebuild
When a full reset is needed:
```bash
# Stream all logs from a service
kso logs -f deployment/<service> --tail=100
# 1. Tear down Helm release
bash ~/sources/kube/stonks-oracle/runmelast.sh
# Search for errors across all pods
# 2. Terminate connections and drop database
kubectl exec -n postgresql-service postgresql-1 -c postgres -- \
psql -U postgres -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE datname = 'stonks' AND pid <> pg_backend_pid();"
kubectl exec -n postgresql-service postgresql-1 -c postgres -- \
psql -U postgres -c "DROP DATABASE IF EXISTS stonks;"
# 3. Flush Redis dedup markers
# (clear all stonks:* keys from Redis)
# 4. Full redeploy (creates DB, runs migrations, deploys)
bash ~/sources/kube/stonks-oracle/runmefirst.sh
# 5. Re-seed companies and relationships
# (run from a pod or with port-forwarded DB access)
python -m services.symbol_registry.seed
```
## Monitoring
### Check pod status
```bash
kso get pods
kso get pods -o wide # includes node placement
```
### Check ingestion health
```bash
# Recent ingestion activity
kso logs deployment/ingestion --tail=20
# Source failure alerts
kso logs deployment/scheduler --tail=20 | grep -i "failure\|alert"
```
### Check broker errors
```bash
kso logs deployment/broker-adapter --tail=30 | grep -i "error\|fail"
```
### Check global event processing
```bash
kso logs deployment/extractor --tail=20 | grep -i "macro\|global"
```
### Check trading decisions
```bash
kso logs deployment/trading-engine --tail=30
```
### Stream all errors
```bash
kso logs --all-containers --prefix --tail=100 | grep -i error
```
@@ -141,7 +289,54 @@ kso logs --all-containers --prefix --tail=100 | grep -i error
| URL | Service |
|-----|---------|
| https://stonks.celestium.life | Dashboard |
| https://stonks-api.celestium.life | Query API |
| https://stonks-registry.celestium.life | Symbol Registry |
| https://stonks-trading.celestium.life | Trading Engine |
| https://stonks-dash.celestium.life | Superset |
| https://stonks-trino.celestium.life | Trino |
## CI/CD
Workflow: `.github/workflows/build.yml`
Push to `main` triggers: lint → pytest → frontend vitest → build all service images → push to GHCR.
### Check recent builds
```bash
gh run list -L 5
```
### Re-run a failed build
```bash
gh run rerun <run-id> --failed
```
### View failure logs
```bash
gh run view <run-id> --log-failed
```
## Common Failure Modes
### CrashLoopBackOff on workers
Queue workers (aggregation, extractor, recommendation, broker-adapter, lake-publisher) exit with code 0 when the queue is empty. Kubernetes restarts them — this is normal. They process work when messages arrive.
### PostgreSQL auth failure
Password mismatch between the Kubernetes secret and the actual DB user. Fix by re-running `runmefirst.sh`, which resets the password and redeploys.
### Redis connection refused
```bash
kubectl get pods -n redis-service
kubectl rollout restart -n redis-service statefulset/redis-master
```
### ImagePullBackOff
GHCR credentials expired or missing. Re-run `runmefirst.sh` with a fresh GitHub token at `/run/secrets/github_token`.
### Trading engine not making decisions
1. Check if trading is enabled: `curl -s https://stonks-trading.celestium.life/health`
2. Check circuit breaker status — may be tripped
3. Check if within trading window (9:45 AM to 3:45 PM ET)
4. Check if there are actionable recommendations in the queue
5. Check logs: `kso logs deployment/trading-engine --tail=50`
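The trading-window check in step 3 can be done from a shell. The helper below is a sketch only: the name `in_trading_window` is hypothetical, treating both endpoints as inside the window is an assumption, and it ignores weekends and market holidays, which the engine may also take into account.

```bash
# Hypothetical helper: succeeds when HH:MM (24-hour, already in ET)
# falls inside 09:45-15:45. Treating both endpoints as inside the
# window is an assumption.
in_trading_window() {
  local h="${1%%:*}"
  local m="${1##*:}"
  # Strip one leading zero so values like "09" are not read as octal.
  h="${h#0}"
  m="${m#0}"
  local now=$(( ${h:-0} * 60 + ${m:-0} ))
  [ "$now" -ge $((9 * 60 + 45)) ] && [ "$now" -le $((15 * 60 + 45)) ]
}
```

For the current time, something like `in_trading_window "$(TZ=America/New_York date +%H:%M)" && echo "inside window"` should work on a host with the tz database installed.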