phase 15: helm deployment complete, runbook, trino/superset fixes
@@ -113,7 +113,7 @@
- [x] Create replay dataset from archived documents for deterministic extraction testing
- [x] Create integration tests for the full ingest-to-recommendation flow
- [x] Create paper trading simulation scenarios
- [x] Validate fail-closed behavior for broker outages and ambiguous order states
- [x] Validate lake publication and Trino query correctness over partitioned MinIO datasets
- [x] ~~Run shadow mode~~ moved to Phase 15.5 (post-deployment)
- [x] ~~Prepare operator runbook~~ moved to Phase 15.5 (post-deployment)
@@ -136,7 +136,7 @@
- _Requirements: N1_

## Phase 15 - CI Validation, Helm Deployment, and Cluster Rollout
- [-] 15. Commit, push, validate CI, create Helm chart, and deploy to cluster
- [x] 15. Commit, push, validate CI, create Helm chart, and deploy to cluster
- [x] 15.1 Commit and push code to GitHub
- Configure git with SSH key for the private repo
- Commit all current changes with message `phase 14-15: docker build validation and helm deployment`
@@ -146,7 +146,7 @@
- Monitor the GitHub Actions run to confirm lint-and-test and build-services jobs succeed
- Fix any CI failures and re-push if needed
- _Requirements: N1_
- [-] 15.3 Create Helm chart for stonks-oracle deployment
- [x] 15.3 Create Helm chart for stonks-oracle deployment
- Create `infra/helm/stonks-oracle/Chart.yaml` with chart metadata
- Create `infra/helm/stonks-oracle/values.yaml` with configurable image tags, replica counts, resource limits, and environment references
- Create Helm templates for all deployments, services, configmap, secrets, ingress, and network policies from existing K8s manifests
@@ -157,17 +157,17 @@
- Create a `docker-registry` secret in the `stonks-oracle` namespace with GHCR credentials (using a GitHub PAT or deploy key)
- Reference the imagePullSecret in all deployment specs via the Helm values
- _Requirements: 8.2, N1_
- [-] 15.5 Deploy stonks-oracle to the cluster via Helm
- [x] 15.5 Deploy stonks-oracle to the cluster via Helm
- Run `helm install` or `helm upgrade --install` targeting the `stonks-oracle` namespace
- Verify all pods reach Running/Ready state
- Verify services and ingress endpoints are reachable
- Debug and fix any deployment issues (CrashLoopBackOff, image pull errors, config mismatches)
- _Requirements: N1, 12.1_
- [ ] 15.6 Run shadow mode before enabling any live execution
- [x] 15.6 Run shadow mode before enabling any live execution
- Confirm all services are running and processing in paper-only mode
- Validate end-to-end data flow from ingestion through recommendation without live trades
- _Requirements: N5, 8.1_
- [ ] 15.7 Prepare operator runbook and incident response procedures
- [x] 15.7 Prepare operator runbook and incident response procedures
- Document service restart procedures, log access, and common failure modes
- Document how to toggle trading modes and approve live execution
- _Requirements: 8.2, 12.1_
@@ -0,0 +1,147 @@
# Stonks Oracle - Operator Runbook

## Cluster Access

```bash
kubectl config use-context <your-context>
# All stonks-oracle resources live in the stonks-oracle namespace
alias kso='kubectl -n stonks-oracle'
```

## Service Overview

| Service | Type | Replicas | Notes |
|---------|------|----------|-------|
| scheduler | CronJob-like worker | 1 | Polls sources on schedule |
| symbol-registry | FastAPI | 1 | Company/watchlist CRUD |
| ingestion | Queue worker | 2 | Fetches from adapters |
| parser | Queue worker | 2 | HTML→text extraction |
| extractor | Queue worker | 1 | LLM-based intelligence extraction |
| aggregation | Queue worker | 1 | Trend/signal aggregation |
| recommendation | Queue worker | 1 | Trade signal generation |
| risk | FastAPI | 1 | Risk evaluation + approval |
| broker-adapter | Queue worker | 1 | Paper/live order execution |
| lake-publisher | Queue worker | 1 | Iceberg table publication |
| query-api | FastAPI | 1 | Dashboard/analytics queries |
| trino | Analytics engine | 1 | SQL over lakehouse |
| superset | Dashboard | 1 | Visualization |
| hive-metastore | Metastore | 1 | Iceberg catalog backend |

## Common Operations

### Restart a service
```bash
kso rollout restart deployment/<service-name>
```

### Check logs
```bash
kso logs deployment/<service-name> --tail=50 -f
# For previous crash:
kso logs <pod-name> --previous --tail=50
```

### Scale a service
```bash
kso scale deployment/<service-name> --replicas=N
```

### Redeploy with updated secrets
```bash
# Assumes POSTGRES_PASSWORD, MINIO_ACCESS_KEY, MINIO_SECRET_KEY, and REDIS_PASSWORD
# are already exported from your secret store; do not paste literal credentials here.
GHCR_TOKEN=$(cat /run/secrets/github_token)
helm upgrade --install stonks-oracle infra/helm/stonks-oracle \
  --namespace stonks-oracle \
  --set "ghcrAuth.password=$GHCR_TOKEN" \
  --set "secrets.core.POSTGRES_PASSWORD=$POSTGRES_PASSWORD" \
  --set "secrets.core.MINIO_ACCESS_KEY=$MINIO_ACCESS_KEY" \
  --set "secrets.core.MINIO_SECRET_KEY=$MINIO_SECRET_KEY" \
  --set "secrets.core.REDIS_PASSWORD=$REDIS_PASSWORD"
# Then restart deployments to pick up secret changes:
for dep in $(kso get deployments -o name); do kso rollout restart "$dep"; done
```

### Run database migrations
```bash
# Apply migrations in lexical order and stop at the first failure
for f in infra/migrations/*.sql; do
  kubectl exec -i -n postgresql-service postgresql-1 -c postgres -- \
    psql -v ON_ERROR_STOP=1 -U postgres -d stonks < "$f" || break
done
```

## Trading Mode Toggle

The current mode is set by the `BROKER_MODE` key in the `stonks-config` ConfigMap.

```bash
# Check current mode
kso get configmap stonks-config -o jsonpath='{.data.BROKER_MODE}'

# To switch modes, update values.yaml config.BROKER_MODE and helm upgrade,
# then restart broker-adapter and risk deployments.
```

**Modes:**
- `paper`: all orders go through paper trading simulation (default, safe)
- `live`: orders are submitted to the real broker API (requires operator approval workflow)

**Never switch to live without:**
1. Confirming paper trading PnL is acceptable
2. Verifying risk limits are configured in the `risk_configuration` table
3. Enabling operator approval in the `operator_approvals` table
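
The mode switch described in the comments above could be scripted roughly as follows. This is a minimal sketch, not the project's tooling: the helper names `is_valid_mode` and `switch_broker_mode` are hypothetical, the deployment names `broker-adapter` and `risk` are assumed to match the service names in the overview table, and `--reuse-values --set` is used here in place of editing `values.yaml` by hand.

```bash
# Assumed to match the release, chart path, and namespace used elsewhere in this runbook.
NAMESPACE="stonks-oracle"
RELEASE="stonks-oracle"
CHART="infra/helm/stonks-oracle"

# Succeeds only for the two documented modes.
is_valid_mode() {
  case "$1" in
    paper|live) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical helper: set BROKER_MODE via Helm, then restart the two consumers.
switch_broker_mode() {
  local mode="$1"
  if ! is_valid_mode "$mode"; then
    echo "unknown BROKER_MODE: $mode (expected paper or live)" >&2
    return 1
  fi
  # --reuse-values keeps the values from the previous release (including secrets).
  helm upgrade "$RELEASE" "$CHART" --namespace "$NAMESPACE" \
    --reuse-values --set "config.BROKER_MODE=$mode" || return 1
  local dep
  for dep in broker-adapter risk; do
    kubectl -n "$NAMESPACE" rollout restart "deployment/$dep" || return 1
    kubectl -n "$NAMESPACE" rollout status "deployment/$dep" --timeout=120s || return 1
  done
}
```

Calling `switch_broker_mode paper` returns the system to the safe default. Rejecting anything other than `paper` or `live` before touching the cluster keeps a typo from being written into the ConfigMap.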

## Operator Approval for Live Trades

The risk engine requires explicit operator approval before executing live trades.
Approvals are managed via the risk API:

```bash
# Check pending approvals
curl -s https://stonks-api.celestium.life/risk/approvals/pending

# Approve a recommendation
curl -X POST https://stonks-api.celestium.life/risk/approvals/<id>/approve
```

## Common Failure Modes

### CrashLoopBackOff on workers
Queue workers (aggregation, extractor, recommendation, broker-adapter, lake-publisher) exit with code 0 when the queue is empty. Kubernetes restarts them and may report CrashLoopBackOff between restarts, which is expected here. The workers process messages as soon as they arrive.
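
To tell this benign case apart from a real crash, one option is to read the exit code of the pod's previous container run. The sketch below assumes single-container pods (it reads container index 0); the helper names are hypothetical.

```bash
NAMESPACE="stonks-oracle"

# Map an exit code to a triage hint; 0 is the empty-queue exit described above.
classify_exit_code() {
  case "$1" in
    0)  echo "benign" ;;
    "") echo "unknown" ;;
    *)  echo "investigate" ;;
  esac
}

# Print the exit code of the pod's last terminated container (empty if none).
last_exit_code() {
  kubectl -n "$NAMESPACE" get pod "$1" \
    -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
}

# Usage: triage_pod <pod-name>
triage_pod() {
  local code
  code="$(last_exit_code "$1")"
  echo "$1: exit=${code:-n/a} -> $(classify_exit_code "$code")"
}
```

A non-zero code is the signal to pull the previous container's logs with `kso logs <pod-name> --previous`.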

### PostgreSQL auth failure
Caused by a mismatch between `stonks-core-secrets.POSTGRES_PASSWORD` and the actual password of the database user. Reset the database password:
```bash
kubectl exec -i -n postgresql-service postgresql-1 -c postgres -- psql -U postgres -d stonks <<'EOF'
ALTER USER stonks WITH PASSWORD '<new-password>';
EOF
```
Then set the same password in the Helm secret (`secrets.core.POSTGRES_PASSWORD`) and restart the deployments.
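
Before changing anything, it can help to confirm which password the pods actually receive. The following is a sketch under the assumption that the Secret object is named `stonks-core-secrets` (as referenced above) and lives in the `stonks-oracle` namespace; the helper names are hypothetical.

```bash
NAMESPACE="stonks-oracle"

# Kubernetes stores Secret values base64-encoded; decode a single value.
decode_secret_value() {
  printf '%s' "$1" | base64 -d
}

# Print the decoded POSTGRES_PASSWORD currently stored in the Secret.
current_pg_password() {
  local encoded
  encoded="$(kubectl -n "$NAMESPACE" get secret stonks-core-secrets \
    -o jsonpath='{.data.POSTGRES_PASSWORD}')" || return 1
  decode_secret_value "$encoded"
}
```

Avoid running `current_pg_password` in a shared or recorded terminal, since it prints the credential in clear text.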

### Redis connection refused
Check that Redis is running: `kubectl get pods -n redis-service`

If the Redis master is down, restart it: `kubectl rollout restart -n redis-service statefulset/redis-master`

### ImagePullBackOff
GHCR credentials are expired or missing. Re-run `helm upgrade` with a fresh `ghcrAuth.password`.

### Superset won't start
Superset needs a custom image that includes the `sqlalchemy-trino` package. The stock `apache/superset:latest` image doesn't include it.

## Log Access

All services output JSON logs when `JSON_LOGS=true` (default).

```bash
# Stream all logs from a service
kso logs -f deployment/<service> --tail=100

# Search for errors across all pods (kubectl logs needs a target, so loop over pods)
for pod in $(kso get pods -o name); do
  kso logs "$pod" --all-containers --prefix --tail=100
done | grep -i error
```
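
Because the logs are JSON, matching on the severity field is usually more precise than grepping for the word "error" anywhere in the line. The sketch below assumes each log line carries a `level` field with values such as `error` or `critical`; that field name is an assumption about the log schema, not something this runbook confirms, and the helper names are hypothetical.

```bash
# Keep only JSON log lines whose "level" field is error or critical.
# The field name "level" is an assumption about the log schema.
filter_error_lines() {
  grep -iE '"level"[[:space:]]*:[[:space:]]*"(error|critical)"' || true
}

# Usage: errors_for_service <deployment-name>
errors_for_service() {
  kubectl -n stonks-oracle logs "deployment/$1" --tail=500 | filter_error_lines
}
```

With this filter, an info-level line whose message merely mentions "error" is no longer reported.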

## Ingress Endpoints

| URL | Service |
|-----|---------|
| https://stonks-api.celestium.life | Query API |
| https://stonks-registry.celestium.life | Symbol Registry |
| https://stonks-dash.celestium.life | Superset |
| https://stonks-trino.celestium.life | Trino |