docs: rewrite README and runbook for current platform state
README: updated architecture diagram, three signal layers, tracked universe, autonomous trading engine, global news interpolation, competitive intelligence, paper trading, notification service, updated services table, project structure, deployment, endpoints. Runbook: updated service overview, deployment via runmefirst.sh, secrets management (keys in kube dir not repo), backup/restore scripts, trading engine operations, signal layer toggles, database nuke & rebuild, monitoring, CI/CD, removed hardcoded secrets.
+253
-58
@@ -8,25 +8,81 @@ kubectl config use-context <your-context>
alias kso='kubectl -n stonks-oracle'
```

4-node k3s cluster (gremlin-1 through gremlin-4). Deploy host is gremlin-1 (192.168.42.254) where secrets and the deploy script live.

## Service Overview

| Service | Type | Replicas | Notes |
|---------|------|----------|-------|
| scheduler | CronJob-like worker | 1 | Polls sources on schedule |
| symbol-registry | FastAPI | 1 | Company/watchlist CRUD |
| ingestion | Queue worker | 2 | Fetches from adapters |
| symbol-registry | FastAPI | 1 | Company/watchlist/exposure/competitor CRUD |
| ingestion | Queue worker | 2 | Fetches from adapters (market data, news, filings, macro) |
| parser | Queue worker | 2 | HTML→text extraction |
| extractor | Queue worker | 1 | LLM-based intelligence extraction |
| aggregation | Queue worker | 1 | Trend/signal aggregation |
| extractor | Queue worker | 1 | LLM-based intelligence extraction + event classification |
| aggregation | Queue worker | 1 | Trend/signal aggregation across all 3 layers |
| recommendation | Queue worker | 1 | Trade signal generation |
| trading-engine | FastAPI | 1 | Autonomous decision loop, position sizing, backtesting |
| risk | FastAPI | 1 | Risk evaluation + approval |
| broker-adapter | Queue worker | 1 | Paper/live order execution |
| broker-adapter | Queue worker | 1 | Paper/live order execution via Alpaca |
| lake-publisher | Queue worker | 1 | Iceberg table publication |
| query-api | FastAPI | 1 | Dashboard/analytics queries |
| dashboard | nginx | 1 | React SPA on port 8080 |
| trino | Analytics engine | 1 | SQL over lakehouse |
| superset | Dashboard | 1 | Visualization |
| hive-metastore | Metastore | 1 | Iceberg catalog backend |

## Deployment

### Full Deploy

Run from gremlin-1 where secrets are available:

```bash
bash ~/sources/kube/stonks-oracle/runmefirst.sh
```

This script:
1. Pulls latest code
2. Creates namespace with Helm labels
3. Sets up PostgreSQL user and database
4. Runs all migrations in order
5. Deploys via Helm with secrets injected
6. Rolling restarts all deployments

### Quick Helm Upgrade

After CI builds new images:

```bash
helm upgrade --install stonks-oracle infra/helm/stonks-oracle -n stonks-oracle
```

### Full Teardown

Preserves PostgreSQL, Redis, and MinIO data:

```bash
bash ~/sources/kube/stonks-oracle/runmelast.sh
```

## Secrets Management

Secrets are stored on the deploy host at `~/sources/kube/stonks-oracle/`. This directory is NOT a git repo — secrets stay local.

Required secret files:
- `~/sources/kube/stonks-oracle/polygon.io.key` — Polygon.io API key
- `~/sources/kube/stonks-oracle/alpaca.key` — Alpaca API key
- `~/sources/kube/stonks-oracle/alpaca.secret` — Alpaca API secret
- `~/sources/kube/stonks-oracle/alpaca.url` — Alpaca base URL (defaults to paper API)
- `/run/secrets/github_token` — GHCR authentication token

The deploy script (`runmefirst.sh`) reads these files and injects them into Kubernetes secrets via Helm `--set` flags. Never hardcode secrets in manifests, values files, or this runbook.

To rotate a secret:
1. Update the file on gremlin-1
2. Re-run `runmefirst.sh` (or `helm upgrade` with the new `--set` values)
3. Restart affected deployments
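
As a rough illustration of the injection step described above, the sketch below reads the key files and emits one Helm argument per line. It is not the actual `runmefirst.sh`: the function names and the `secrets.core.*` value keys used for Polygon and Alpaca are assumptions made for the example, and the real chart may name them differently.

```bash
# Sketch only: not the real runmefirst.sh. The value keys below are assumed names.
SECRETS_DIR="${SECRETS_DIR:-$HOME/sources/kube/stonks-oracle}"

# Print a secret file's contents without trailing newlines; fail if missing or empty.
read_secret() {
  [ -s "$1" ] || { echo "missing or empty secret file: $1" >&2; return 1; }
  tr -d '\r\n' < "$1"
}

# Emit helm arguments, one per line, for the key files in directory $1.
build_secret_args() {
  local dir="$1" polygon key secret url
  polygon=$(read_secret "$dir/polygon.io.key") || return 1
  key=$(read_secret "$dir/alpaca.key") || return 1
  secret=$(read_secret "$dir/alpaca.secret") || return 1
  printf '%s\n' \
    --set "secrets.core.POLYGON_API_KEY=$polygon" \
    --set "secrets.core.ALPACA_API_KEY=$key" \
    --set "secrets.core.ALPACA_API_SECRET=$secret"
  # alpaca.url is treated as optional; when absent, the default (paper API) is assumed to apply.
  if [ -s "$dir/alpaca.url" ]; then
    url=$(read_secret "$dir/alpaca.url") || return 1
    printf '%s\n' --set "secrets.core.ALPACA_BASE_URL=$url"
  fi
}
```

In bash, the output could be collected with `mapfile -t args < <(build_secret_args "$SECRETS_DIR")` and passed to `helm upgrade --install` as `"${args[@]}"`. Failing early on a missing or empty file keeps a half-configured release from being deployed.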

## Common Operations

### Restart a service
@@ -46,20 +102,6 @@ kso logs <pod-name> --previous --tail=50
kso scale deployment/<service-name> --replicas=N
```

### Redeploy with updated secrets
```bash
GHCR_TOKEN=$(cat /run/secrets/github_token)
helm upgrade --install stonks-oracle infra/helm/stonks-oracle \
  --namespace stonks-oracle \
  --set "ghcrAuth.password=$GHCR_TOKEN" \
  --set 'secrets.core.POSTGRES_PASSWORD=<redacted>' \
  --set "secrets.core.MINIO_ACCESS_KEY=<redacted>" \
  --set 'secrets.core.MINIO_SECRET_KEY=<redacted>' \
  --set 'secrets.core.REDIS_PASSWORD=<redacted>'
# Then restart deployments to pick up secret changes:
for dep in $(kso get deployments -o name); do kso rollout restart "$dep"; done
```

### Run database migrations
```bash
for f in $(ls infra/migrations/*.sql | sort); do
@@ -67,73 +109,179 @@ for f in $(ls infra/migrations/*.sql | sort); do
done
```

## Trading Mode Toggle
## Trading Engine Operations

### Check trading engine status
```bash
curl -s https://stonks-trading.celestium.life/health
curl -s https://stonks-trading.celestium.life/ready
```

### Pause trading
```bash
# Via API — sets enabled=false in trading_engine_config
curl -X PUT https://stonks-trading.celestium.life/api/trading/config \
  -H 'Content-Type: application/json' \
  -d '{"enabled": false}'
```

### Resume trading
```bash
curl -X PUT https://stonks-trading.celestium.life/api/trading/config \
  -H 'Content-Type: application/json' \
  -d '{"enabled": true}'
```

### Check recent trading decisions
```bash
# Quote the URL so the shell does not treat "?" as a glob character
curl -s 'https://stonks-api.celestium.life/api/trading/decisions?limit=10'
```

### Run a backtest
```bash
curl -X POST https://stonks-trading.celestium.life/api/trading/backtest \
  -H 'Content-Type: application/json' \
  -d '{"start_date": "2025-01-01", "end_date": "2025-06-01", "initial_capital": 100000, "risk_tier": "moderate"}'
```

### Check circuit breaker status
```bash
curl -s https://stonks-api.celestium.life/api/trading/circuit-breaker
```

### Check portfolio state
```bash
curl -s https://stonks-api.celestium.life/api/trading/portfolio
```

## Broker Mode Toggle

Current mode is set via ConfigMap `stonks-config` key `BROKER_MODE`.

```bash
# Check current mode
kso get configmap stonks-config -o jsonpath='{.data.BROKER_MODE}'

# To switch modes, update values.yaml config.BROKER_MODE and helm upgrade,
# then restart broker-adapter and risk deployments.
```
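
The switch procedure in the comments above can be sketched as a dry-run helper that validates the requested mode and prints the commands instead of running them. This is an illustration under assumptions: it overrides `config.BROKER_MODE` with `--set` and `--reuse-values` rather than editing `values.yaml`, it assumes the deployments are named `broker-adapter` and `risk`, and `plan_broker_mode_switch` is a hypothetical name.

```bash
# Sketch only: prints the commands for a mode switch instead of executing them.
plan_broker_mode_switch() {
  local mode="$1"
  # Reject anything other than the two documented modes.
  case "$mode" in
    paper|live) ;;
    *) echo "unknown BROKER_MODE: $mode (expected paper or live)" >&2; return 1 ;;
  esac
  echo "helm upgrade --install stonks-oracle infra/helm/stonks-oracle -n stonks-oracle --reuse-values --set config.BROKER_MODE=$mode"
  echo "kubectl -n stonks-oracle rollout restart deployment/broker-adapter"
  echo "kubectl -n stonks-oracle rollout restart deployment/risk"
}
```

Printing the plan first gives the operator a chance to review the exact commands before a change to `live` is applied.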

**Modes:**
- `paper` — all orders go through paper trading simulation (default, safe)
- `live` — orders are submitted to the real broker API (requires operator approval workflow)
- `paper` — all orders go through paper trading simulation (default)
- `live` — orders submitted to real broker API (requires operator approval workflow)

**Never switch to live without:**
1. Confirming paper trading PnL is acceptable
2. Verifying risk limits are configured in `risk_configuration` table
3. Enabling operator approval in `operator_approvals` table
2. Verifying risk limits are configured
3. Enabling operator approval in the risk engine

## Operator Approval for Live Trades

The risk engine requires explicit operator approval before executing live trades.
Approvals are managed via the risk API:
## Signal Layer Toggles

### Macro signal layer
```bash
# Check pending approvals
curl -s https://stonks-api.celestium.life/risk/approvals/pending
# Check status
curl -s https://stonks-api.celestium.life/api/admin/macro/status

# Approve a recommendation
curl -X POST https://stonks-api.celestium.life/risk/approvals/<id>/approve
# Toggle
curl -X PUT https://stonks-api.celestium.life/api/admin/macro/toggle
```

## Common Failure Modes

### CrashLoopBackOff on workers
Queue workers (aggregation, extractor, recommendation, broker-adapter, lake-publisher) exit with code 0 when the queue is empty. Kubernetes restarts them, which is normal. They'll process work when messages arrive.

### PostgreSQL auth failure
Password mismatch between `stonks-core-secrets.POSTGRES_PASSWORD` and the actual DB user password. Fix:
### Competitive signal layer
```bash
kubectl exec -i -n postgresql-service postgresql-1 -c postgres -- psql -U postgres -d stonks <<'EOF'
ALTER USER stonks WITH PASSWORD '<new-password>';
EOF
# Check status
curl -s https://stonks-api.celestium.life/api/admin/competitive/status

# Toggle
curl -X PUT https://stonks-api.celestium.life/api/admin/competitive/toggle
```
Then update the Helm secret and restart.

### Redis connection refused
Check Redis is running: `kubectl get pods -n redis-service`
If Redis master is down, restart it: `kubectl rollout restart -n redis-service statefulset/redis-master`
## Backup and Restore

### ImagePullBackOff
GHCR credentials expired or missing. Re-run `helm upgrade` with fresh `ghcrAuth.password`.
### Database backup
```bash
# Local backup (keeps last 7)
./scripts/backup-db.sh

### Superset won't start
Needs custom image with `sqlalchemy-trino` package. Stock `apache/superset:latest` doesn't include it.
# Backup + upload to MinIO
./scripts/backup-db.sh --upload-minio
```

## Log Access
Backups go to `~/backups/stonks-oracle/`. Old backups are auto-pruned (keeps last 7).
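
The "keeps last 7" retention could be implemented along the following lines. This is a minimal sketch, not the contents of `backup-db.sh`: the function name `prune_backups` is hypothetical, and it assumes dump files follow the `stonks-YYYYMMDD-HHMMSS.sql.gz` naming seen in the restore example.

```bash
# Sketch only: not the real backup-db.sh. Keeps the newest $2 dumps (default 7) in directory $1.
prune_backups() {
  local dir="$1" keep="${2:-7}" total excess
  total=$(find "$dir" -maxdepth 1 -type f -name 'stonks-*.sql.gz' | wc -l)
  excess=$((total - keep))
  [ "$excess" -gt 0 ] || return 0
  # Names embed YYYYMMDD-HHMMSS, so lexical order is chronological: the oldest sort first.
  find "$dir" -maxdepth 1 -type f -name 'stonks-*.sql.gz' | sort | head -n "$excess" |
    while IFS= read -r f; do
      rm -f -- "$f"
    done
}
```

Sorting by the timestamp in the file name rather than by modification time keeps the result stable if dumps are copied or touched after creation.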

All services output JSON logs when `JSON_LOGS=true` (default).
### Database restore
```bash
# Lists available backups if no argument given
./scripts/restore-db.sh

# Restore a specific backup (WARNING: replaces all data)
./scripts/restore-db.sh ~/backups/stonks-oracle/stonks-20250615-180000.sql.gz
```

The restore script scales down all services, restores the dump, re-grants permissions, and scales services back up.

### Redis backup
```bash
./scripts/backup-redis.sh
```

Triggers a BGSAVE and copies the RDB dump locally.

## Database Nuke & Rebuild

When a full reset is needed:

```bash
# Stream all logs from a service
kso logs -f deployment/<service> --tail=100
# 1. Tear down Helm release
bash ~/sources/kube/stonks-oracle/runmelast.sh

# Search for errors across all pods
# 2. Terminate connections and drop database
kubectl exec -n postgresql-service postgresql-1 -c postgres -- \
  psql -U postgres -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE datname = 'stonks' AND pid <> pg_backend_pid();"
kubectl exec -n postgresql-service postgresql-1 -c postgres -- \
  psql -U postgres -c "DROP DATABASE IF EXISTS stonks;"

# 3. Flush Redis dedup markers
# (clear all stonks:* keys from Redis)

# 4. Full redeploy (creates DB, runs migrations, deploys)
bash ~/sources/kube/stonks-oracle/runmefirst.sh

# 5. Re-seed companies and relationships
# (run from a pod or with port-forwarded DB access)
python -m services.symbol_registry.seed
```
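
Step 3 above leaves the Redis flush as a comment. One way it might be done is sketched below, assuming `redis-cli` can reach the instance; the `flush_keys` helper and the `REDIS_CLI` override are hypothetical names introduced for the example.

```bash
# Sketch only. REDIS_CLI can be overridden, e.g. to wrap redis-cli in a kubectl exec call.
REDIS_CLI="${REDIS_CLI:-redis-cli}"

# Delete every key matching pattern $1 and report how many were removed.
flush_keys() {
  local pattern="$1" count=0 key tmp
  tmp=$(mktemp)
  # --scan iterates with SCAN, which avoids blocking the server the way KEYS can.
  $REDIS_CLI --scan --pattern "$pattern" > "$tmp"
  while IFS= read -r key; do
    [ -n "$key" ] || continue
    $REDIS_CLI DEL "$key" > /dev/null < /dev/null
    count=$((count + 1))
  done < "$tmp"
  rm -f "$tmp"
  echo "removed $count keys matching $pattern"
}
```

Running `flush_keys 'stonks:*'` would clear only the prefixed keys and leave any other data in the same Redis instance untouched. If the instance requires a password, `redis-cli` reads it from the `REDISCLI_AUTH` environment variable.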

## Monitoring

### Check pod status
```bash
kso get pods
kso get pods -o wide # includes node placement
```

### Check ingestion health
```bash
# Recent ingestion activity
kso logs deployment/ingestion --tail=20

# Source failure alerts
kso logs deployment/scheduler --tail=20 | grep -i "failure\|alert"
```

### Check broker errors
```bash
kso logs deployment/broker-adapter --tail=30 | grep -i "error\|fail"
```

### Check global event processing
```bash
kso logs deployment/extractor --tail=20 | grep -i "macro\|global"
```

### Check trading decisions
```bash
kso logs deployment/trading-engine --tail=30
```

### Stream all errors
```bash
kso logs --all-containers --prefix --tail=100 | grep -i error
```
@@ -141,7 +289,54 @@ kso logs --all-containers --prefix --tail=100 | grep -i error

| URL | Service |
|-----|---------|
| https://stonks.celestium.life | Dashboard |
| https://stonks-api.celestium.life | Query API |
| https://stonks-registry.celestium.life | Symbol Registry |
| https://stonks-trading.celestium.life | Trading Engine |
| https://stonks-dash.celestium.life | Superset |
| https://stonks-trino.celestium.life | Trino |

## CI/CD

Workflow: `.github/workflows/build.yml`

Push to `main` triggers: lint → pytest → frontend vitest → build all service images → push to GHCR.

### Check recent builds
```bash
gh run list -L 5
```

### Re-run a failed build
```bash
gh run rerun <run-id> --failed
```

### View failure logs
```bash
gh run view <run-id> --log-failed
```

## Common Failure Modes

### CrashLoopBackOff on workers
Queue workers (aggregation, extractor, recommendation, broker-adapter, lake-publisher) exit with code 0 when the queue is empty. Kubernetes restarts them — this is normal. They process work when messages arrive.

### PostgreSQL auth failure
Password mismatch between the Kubernetes secret and the actual DB user. Fix by re-running `runmefirst.sh` which resets the password and redeploys.

### Redis connection refused
```bash
kubectl get pods -n redis-service
kubectl rollout restart -n redis-service statefulset/redis-master
```

### ImagePullBackOff
GHCR credentials expired or missing. Re-run `runmefirst.sh` with a fresh GitHub token at `/run/secrets/github_token`.

### Trading engine not making decisions
1. Check if trading is enabled: `curl -s https://stonks-trading.celestium.life/health`
2. Check circuit breaker status — may be tripped
3. Check if within trading window (9:45 AM – 3:45 PM ET)
4. Check if there are actionable recommendations in the queue
5. Check logs: `kso logs deployment/trading-engine --tail=50`
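
The trading-window check in step 3 can be done from a shell with a small helper like the one below. It is a sketch under assumptions: `trading_window_state` is a hypothetical name, the 3:45 PM boundary is treated as exclusive, and weekends and market holidays are ignored, so the engine's own logic may differ.

```bash
# Sketch only: the engine's real window logic may differ (weekends and holidays are ignored here).
# Prints "open" or "closed" for an HHMM time in US Eastern; defaults to the current time.
trading_window_state() {
  local hhmm="${1:-$(TZ=America/New_York date +%H%M)}" hour min minutes
  hour=${hhmm%??}
  min=${hhmm#??}
  # Drop one leading zero so values like 09 are not read as octal.
  hour=${hour#0}
  min=${min#0}
  minutes=$((hour * 60 + min))
  # 9:45 AM is minute 585 of the day and 3:45 PM is minute 945.
  if [ "$minutes" -ge 585 ] && [ "$minutes" -lt 945 ]; then
    echo open
  else
    echo closed
  fi
}
```

Calling `trading_window_state` with no argument uses the current Eastern time, and passing a value such as `1530` checks a specific time.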