fix: clean up utcnow deprecation warnings, fix 12 failing tests, add CI/CD pipeline manifests

- Replace all datetime.utcnow() with datetime.now(tz=timezone.utc) across 8 files
- Fix 12 failing tests to match current implementation behavior
- Fix pytest_plugins in non-top-level conftest (moved to root conftest.py)
- Auto-fix 189 lint issues (import sorting, unused imports)
- Add CI/CD pipeline infrastructure (ARC, ArgoCD, Kargo manifests)
- Add values-beta.yaml and values-paper.yaml for staged deployments
- Update GitHub Actions workflow to use self-hosted-gremlin runners
- Add integration-test job to CI pipeline

Result: 1596 passed, 0 failed, 0 warnings
Celes Renata
2026-04-18 03:59:28 +00:00
parent 40227a4eb2
commit c85c0068a2
123 changed files with 7221 additions and 405 deletions
+1
@@ -0,0 +1 @@
{"specId": "6864b7d1-ab86-473f-b6ad-7091eaabac76", "workflowType": "requirements-first", "specType": "feature"}
+628
@@ -0,0 +1,628 @@
# CI/CD Pipeline — Design
## Overview
This design describes a full CI/CD pipeline for the Stonks Oracle platform built on three Kubernetes-native tools: GitHub Actions Runner Controller (ARC) for self-hosted CI runners, ArgoCD for GitOps-based deployment, and Kargo for staged promotion orchestration. The pipeline replaces GitHub-hosted runners with ephemeral pods on the existing 4-node NixOS Gremlin cluster, routes built images through five stages (CI → Integration Test → Beta → Paper → Live), and enforces market-hours promotion blockers with a break-glass emergency override.
All pipeline infrastructure scripts and manifests live in `~/sources/kube/pipelines/` on gremlin-1 — fully separate from the application's `~/sources/kube/stonks-oracle/` deployment scripts. Pipeline state persists on NFS volumes at `nfs://192.168.42.8:/volume1/Kubernetes/pipelines` so that ArgoCD configs, Kargo promotion history, and ARC data survive cluster teardowns and rebuilds.
### Key Design Decisions
1. **ARC with Kubernetes mode (not Docker-in-Docker)** — Runner pods use `containerMode.type: kubernetes` so each workflow step runs as a separate pod. This avoids the security and complexity overhead of DinD while leveraging the cluster's existing container runtime. Docker builds use `docker/build-push-action` with Buildx, which works with the Kubernetes executor.
2. **One ArgoCD Application per stage** — Beta, Paper, and Live each get their own ArgoCD Application resource pointing at the same Helm chart (`infra/helm/stonks-oracle/`) but with different values files (`values-beta.yaml`, `values-paper.yaml`, `values.yaml`). This keeps stage configs independent and auditable.
3. **Kargo Image Updater pattern** — A single Kargo Warehouse watches the GHCR image repository for new tags. Kargo Stages (beta → paper → live) form a linear promotion DAG. Each Stage's promotion template updates the image tag in the corresponding ArgoCD Application and triggers a sync.
4. **Market-hours blocker via Kargo AnalysisTemplate** — Kargo verification steps check Eastern Time before allowing promotions to Paper and Live stages. Break-glass is implemented via Kargo's manual approval with required notes, bypassing the verification gate.
5. **NFS static provisioning with Retain policy** — PVs are created manually by `runmefirst.sh` with `persistentVolumeReclaimPolicy: Retain`. The teardown script (`runmelast.sh`) deletes Helm releases and namespaces but leaves PVs and NFS data intact.
6. **Install order: PVs → ARC → ArgoCD → Kargo**: `runmefirst.sh` creates PVs first (they're cluster-scoped), then installs each tool via Helm in dependency order. Kargo depends on ArgoCD being present.
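The install order in decision 6 can be derived from a dependency map instead of being hard-coded. The following is a minimal sketch under stated assumptions: only "PVs first" and "Kargo depends on ArgoCD" come from this design, and the `DEPENDS_ON` mapping and function names are hypothetical.

```python
from graphlib import TopologicalSorter

# Each key lists the components that must be installed before it.
# "PVs first" and "kargo needs argocd" come from the design; the rest is illustrative.
DEPENDS_ON = {
    "pvs": set(),
    "arc": {"pvs"},
    "argocd": {"pvs"},
    "kargo": {"pvs", "argocd"},
}


def install_order(deps):
    """Return one valid install order (dependencies before dependents)."""
    return list(TopologicalSorter(deps).static_order())


def teardown_order(deps):
    """Teardown walks the install order backwards."""
    return list(reversed(install_order(deps)))


if __name__ == "__main__":
    print("install: ", " -> ".join(install_order(DEPENDS_ON)))
    print("teardown:", " -> ".join(teardown_order(DEPENDS_ON)))
```

Several orders satisfy these constraints, so ARC and ArgoCD may swap places. In the teardown order the PVs come last, which is the step `runmelast.sh` deliberately skips.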
## Architecture
```
┌─────────────────────────────────────────────────────────────────────────────┐
│                         Gremlin Cluster (4x NixOS)                          │
│                                                                             │
│  ┌─────────────────┐  ┌──────────────────┐  ┌───────────────────────────┐   │
│  │ arc-system ns   │  │ argocd ns        │  │ kargo ns                  │   │
│  │                 │  │                  │  │                           │   │
│  │ ARC Controller  │  │ ArgoCD Server    │  │ Kargo Controller          │   │
│  │ Runner ScaleSet │  │ Repo Server      │  │ Kargo Dashboard           │   │
│  │ (ephemeral pods)│  │ App Controller   │  │ (stonks-kargo.            │   │
│  │                 │  │ (stonks-argocd.  │  │  celestium.life)          │   │
│  │                 │  │  celestium.life) │  │                           │   │
│  └─────────────────┘  └──────────────────┘  └───────────────────────────┘   │
│                                                                             │
│  ┌─────────────────┐  ┌──────────────────┐  ┌───────────────────────────┐   │
│  │ stonks-beta ns  │  │ stonks-paper ns  │  │ stonks-oracle ns          │   │
│  │                 │  │                  │  │ (live/production)         │   │
│  │ ArgoCD App:     │  │ ArgoCD App:      │  │ ArgoCD App:               │   │
│  │ stonks-beta     │  │ stonks-paper     │  │ stonks-live               │   │
│  │ values-beta.yaml│  │ values-paper.    │  │ values.yaml               │   │
│  │ (mock broker)   │  │  yaml            │  │ (production broker)       │   │
│  │                 │  │ (paper broker)   │  │                           │   │
│  └─────────────────┘  └──────────────────┘  └───────────────────────────┘   │
│                                                                             │
│ NFS PVs: nfs://192.168.42.8:/volume1/Kubernetes/pipelines/{argocd,kargo,arc}│
└─────────────────────────────────────────────────────────────────────────────┘
```
### Promotion Flow
```mermaid
graph LR
A[Git Push to main] --> B[CI: Lint + Test<br/>ARC self-hosted runner]
B --> C[CI: Build + Push<br/>all images to GHCR]
C --> D[Integration Tests<br/>run_pipeline.sh]
D -->|pass| E[Kargo Warehouse<br/>detects new image tag]
D -->|fail| X[❌ Blocked]
E --> F[Beta Stage<br/>auto-promote]
F --> G{Market Hours?}
G -->|outside hours| H[Paper Stage<br/>manual promote]
G -->|during hours| I[🚫 Blocked<br/>break-glass available]
I -->|break-glass| H
H --> J{Market Hours?}
J -->|outside hours| K[Live Stage<br/>manual approve + notes]
J -->|during hours| L[🚫 Blocked<br/>break-glass available]
L -->|break-glass| K
```
## Components and Interfaces
### 1. Pipeline Scripts (`~/sources/kube/pipelines/`)
```
~/sources/kube/pipelines/
├── runmefirst.sh              # Full install: PVs → ARC → ArgoCD → Kargo
├── runmelast.sh               # Teardown: Kargo → ArgoCD → ARC (preserves PVs + NFS data)
├── pvs/
│   ├── argocd-pv.yaml         # NFS PV for ArgoCD server data
│   ├── kargo-pv.yaml          # NFS PV for Kargo data
│   └── arc-pv.yaml            # NFS PV for ARC runner data
├── arc/
│   ├── values.yaml            # ARC controller Helm values
│   └── runner-scaleset.yaml   # RunnerScaleSet CR for stonks-oracle repo
├── argocd/
│   ├── values.yaml            # ArgoCD Helm values (ingress, TLS, persistence)
│   ├── apps/
│   │   ├── stonks-beta.yaml   # ArgoCD Application for beta
│   │   ├── stonks-paper.yaml  # ArgoCD Application for paper
│   │   └── stonks-live.yaml   # ArgoCD Application for live
│   └── repo-secret.yaml       # Git repo credentials for ArgoCD
├── kargo/
│   ├── values.yaml            # Kargo Helm values (ingress, TLS, persistence)
│   ├── project.yaml           # Kargo Project: stonks-oracle
│   ├── warehouse.yaml         # Kargo Warehouse watching GHCR
│   ├── stages/
│   │   ├── beta.yaml          # Kargo Stage: beta (auto-promote)
│   │   ├── paper.yaml         # Kargo Stage: paper (market-hours gate)
│   │   └── live.yaml          # Kargo Stage: live (manual approval + market-hours gate)
│   └── project-config.yaml    # ProjectConfig: auto-promotion settings
└── helm-values/
    ├── values-beta.yaml       # Helm overrides for beta stage
    └── values-paper.yaml      # Helm overrides for paper stage
```
### 2. ARC — GitHub Actions Runner Controller
**Namespace:** `arc-system`
**Components:**
- **ARC Controller** — Installed via the `oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set-controller` Helm chart. Runs a listener that long-polls the GitHub Actions service for queued jobs (no inbound webhook is required) and provisions runner pods on demand.
- **Runner ScaleSet** — Installed via the `oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set` Helm chart. Configured for the `celesrenata/stonks-oracle` repository with the label `self-hosted-gremlin`.
**Runner Pod Configuration:**
- Ephemeral: each job gets a fresh pod, destroyed on completion
- Kubernetes mode (`containerMode.type: kubernetes`): job containers and container steps run in separate pods created by the runner, with no Docker daemon inside the runner pod
- Resource limits: 2 CPU, 4Gi memory per runner pod
- Docker Buildx support via `docker/setup-buildx-action` (uses Kubernetes builder)
- GitHub App or PAT authentication stored in a Kubernetes Secret
**Interface with CI workflow:**
- The existing `.github/workflows/build.yml` is updated to use `runs-on: self-hosted-gremlin` instead of `runs-on: ubuntu-latest`
- All existing build steps remain unchanged — only the runner label changes
### 3. ArgoCD — GitOps Deployment Controller
**Namespace:** `argocd`
**Components:**
- **ArgoCD Server** — Web UI and API, exposed via Traefik ingress at `stonks-argocd.celestium.life` with TLS via `ca-issuer`
- **Repo Server** — Clones Git repos and renders Helm templates
- **Application Controller** — Watches ArgoCD Application resources and syncs cluster state
**ArgoCD Applications (one per stage):**
| Application | Namespace | Values File | Sync Policy |
|---|---|---|---|
| `stonks-beta` | `stonks-beta` | `values-beta.yaml` | Auto-sync (Kargo triggers) |
| `stonks-paper` | `stonks-paper` | `values-paper.yaml` | Auto-sync (Kargo triggers) |
| `stonks-live` | `stonks-oracle` | `values.yaml` | Auto-sync (Kargo triggers) |
Each Application points at the same Helm chart (`infra/helm/stonks-oracle/`) in the `celesrenata/stonks-oracle` Git repository but uses a different values file. The `image.tag` parameter is overridden by Kargo during promotion.
**Application Resource Structure:**
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: stonks-beta
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/celesrenata/stonks-oracle.git
    targetRevision: main
    path: infra/helm/stonks-oracle
    helm:
      valueFiles:
        - values-beta.yaml
      parameters:
        - name: image.tag
          value: latest  # Overridden by Kargo during promotion
  destination:
    server: https://kubernetes.default.svc
    namespace: stonks-beta
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```
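Because the three Applications differ only in name, target namespace, and values file, they could be generated from the table above rather than maintained by hand. A minimal sketch, assuming that approach; the `STAGES` list and `application()` helper are hypothetical and not part of the repository:

```python
import json

REPO_URL = "https://github.com/celesrenata/stonks-oracle.git"
CHART_PATH = "infra/helm/stonks-oracle"

# (application name, target namespace, values file), copied from the table above.
STAGES = [
    ("stonks-beta", "stonks-beta", "values-beta.yaml"),
    ("stonks-paper", "stonks-paper", "values-paper.yaml"),
    ("stonks-live", "stonks-oracle", "values.yaml"),
]


def application(name, namespace, values_file, image_tag="latest"):
    """Build one ArgoCD Application manifest as a plain dict."""
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Application",
        "metadata": {"name": name, "namespace": "argocd"},
        "spec": {
            "project": "default",
            "source": {
                "repoURL": REPO_URL,
                "targetRevision": "main",
                "path": CHART_PATH,
                "helm": {
                    "valueFiles": [values_file],
                    "parameters": [{"name": "image.tag", "value": image_tag}],
                },
            },
            "destination": {
                "server": "https://kubernetes.default.svc",
                "namespace": namespace,
            },
            "syncPolicy": {"automated": {"prune": True, "selfHeal": True}},
        },
    }


if __name__ == "__main__":
    for stage in STAGES:
        print(json.dumps(application(*stage), indent=2))
```

The output is JSON, which `kubectl apply -f` accepts alongside YAML.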
### 4. Kargo — Promotion Orchestration
**Namespace:** `kargo`
**Components:**
- **Kargo Controller** — Watches Warehouse, Stage, and Promotion resources
- **Kargo Dashboard** — Web UI at `stonks-kargo.celestium.life` with TLS via `ca-issuer`. Provides visual promotion management, stage status, and audit history.
**Kargo Resources:**
#### Warehouse
Watches the GHCR image repository for new image tags. Produces Freight resources for each new tag discovered.
```yaml
apiVersion: kargo.akuity.io/v1alpha1
kind: Warehouse
metadata:
  name: stonks-images
  namespace: stonks-oracle  # Kargo project namespace
spec:
  subscriptions:
    - image:
        repoURL: ghcr.io/celesrenata/stonks-oracle/query-api
        # Image tags are Git SHAs, not semantic versions, so SemVer selection
        # would never match them; select the most recently built image instead.
        imageSelectionStrategy: NewestBuild
```
#### Stages (Linear DAG)
```
Warehouse: stonks-images
  └── Stage: beta (auto-promote, no market-hours gate)
        └── Stage: paper (manual promote, market-hours verification)
              └── Stage: live (manual approval + notes, market-hours verification)
```
Each Stage's promotion template:
1. Clones the Git repo
2. Updates `image.tag` in the stage-specific values file (or uses `argocd-update` step)
3. Triggers the ArgoCD Application to sync
#### Market-Hours Verification
Paper and Live stages include a verification step that checks whether the current time falls within US market hours (09:30–16:00 ET, Mon–Fri). If it does, the promotion is blocked unless the operator uses Kargo's manual approval (break-glass) with a required justification note.
This is implemented as a Kargo verification step using an `AnalysisTemplate` that runs a lightweight container to check the current Eastern Time:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: market-hours-check
  namespace: stonks-oracle
spec:
  metrics:
    - name: outside-market-hours
      provider:
        job:
          spec:
            template:
              spec:
                containers:
                  - name: check
                    image: alpine:3.19
                    command: [sh, -c]
                    args:
                      - |
                        apk add --no-cache tzdata
                        export TZ=America/New_York
                        DOW=$(date +%u)  # 1=Mon, 7=Sun
                        # Strip the leading zero so 08 and 09 are not parsed as invalid octal
                        HOUR=$(date +%H); HOUR=${HOUR#0}
                        MIN=$(date +%M); MIN=${MIN#0}
                        TIME_MIN=$((HOUR * 60 + MIN))
                        MARKET_OPEN=570   # 09:30
                        MARKET_CLOSE=960  # 16:00
                        if [ "$DOW" -ge 6 ]; then
                          echo "Weekend — promotions allowed"
                          exit 0
                        fi
                        if [ "$TIME_MIN" -lt "$MARKET_OPEN" ] || [ "$TIME_MIN" -ge "$MARKET_CLOSE" ]; then
                          echo "Outside market hours — promotions allowed"
                          exit 0
                        fi
                        echo "Market hours active ($(date)) — promotion blocked"
                        exit 1
                restartPolicy: Never
```
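The same check can be expressed as a small pure function, which makes the 09:29 / 09:30 / 15:59 / 16:00 boundaries and DST transitions easy to exercise with fixed timestamps. This is a sketch only, not the deployed check: the function name is hypothetical, it assumes the host has the IANA timezone database available, and like the shell version it does not account for exchange holidays.

```python
from datetime import datetime, time, timezone
from zoneinfo import ZoneInfo

NEW_YORK = ZoneInfo("America/New_York")
MARKET_OPEN = time(9, 30)
MARKET_CLOSE = time(16, 0)


def in_market_hours(now: datetime) -> bool:
    """True when `now` falls in 09:30 <= t < 16:00 Eastern Time on Mon-Fri."""
    if now.tzinfo is None:
        raise ValueError("now must be timezone-aware")
    local = now.astimezone(NEW_YORK)  # zone-aware conversion handles EST/EDT
    if local.isoweekday() >= 6:  # 6 = Saturday, 7 = Sunday
        return False
    return MARKET_OPEN <= local.time() < MARKET_CLOSE


if __name__ == "__main__":
    now = datetime.now(tz=timezone.utc)
    print("promotion blocked" if in_market_hours(now) else "promotion allowed")
```

Converting from an aware timestamp into `America/New_York`, rather than applying a fixed UTC offset, is what keeps the result correct on both sides of a DST change.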
#### Break-Glass Mechanism
Kargo's built-in manual approval flow serves as the break-glass mechanism. When a promotion is blocked by the market-hours verification:
1. The operator clicks "Approve" in the Kargo Dashboard
2. A confirmation dialog appears requiring a justification note
3. The approval bypasses the verification gate for that single Freight/Stage combination
4. The approval, operator identity, timestamp, and justification are recorded in Kargo's audit trail
5. Subsequent promotions still require passing the market-hours check (the override is not sticky)
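The break-glass rules above (required note, single Freight/Stage scope, non-sticky) can be modelled in a few lines. This is an illustrative model only, not Kargo's API: the `Override` record, `break_glass()`, and `may_promote()` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class Override:
    """Hypothetical audit record for one break-glass approval."""
    operator: str
    justification: str
    freight: str
    stage: str
    approved_at: datetime


def break_glass(operator: str, justification: str, freight: str, stage: str) -> Override:
    """Create an override; a blank justification is rejected."""
    if not justification.strip():
        raise ValueError("a justification note is required")
    return Override(operator, justification.strip(), freight, stage,
                    datetime.now(tz=timezone.utc))


def may_promote(freight: str, stage: str, market_open: bool,
                override: Optional[Override] = None) -> bool:
    """Allow promotion outside market hours, or during market hours when an
    override matches this exact Freight/Stage pair (so it is not sticky)."""
    if not market_open:
        return True
    return (override is not None
            and override.freight == freight
            and override.stage == stage)
```

Because the override is matched against one Freight/Stage pair, approving `abc123` for `live` does not unblock a later Freight during the same market session.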
### 5. Updated GitHub Actions Workflow
The existing `.github/workflows/build.yml` is updated with:
1. **Runner label change**: `runs-on: ubuntu-latest` → `runs-on: self-hosted-gremlin`
2. **New integration test job**: After image builds, a new `integration-test` job invokes `bash infra/inttest/run_pipeline.sh --image-tag ${{ github.sha }} --results-file inttest-results.json`
3. **Artifact upload**: The `inttest-results.json` is uploaded as a build artifact
4. **Gate logic**: If integration tests fail, the workflow fails and Kargo will not see the new image tag as verified
```yaml
integration-test:
  needs: [build-services, build-dashboard]
  if: github.event_name == 'push' && github.ref == 'refs/heads/main'
  runs-on: self-hosted-gremlin
  steps:
    - uses: actions/checkout@v5
    - name: Run integration tests
      run: |
        bash infra/inttest/run_pipeline.sh \
          --image-tag ${{ github.sha }} \
          --results-file inttest-results.json
    - name: Upload results
      if: always()
      uses: actions/upload-artifact@v4
      with:
        name: inttest-results
        path: inttest-results.json
```
### 6. Helm Values Strategy
**values-beta.yaml** (lighter resources, mock broker, no external API keys):
```yaml
image:
  tag: latest  # Overridden by Kargo
config:
  BROKER_MODE: "mock"
  BROKER_PROVIDER: "mock"
  LOG_LEVEL: "DEBUG"
  TRADING_ENABLED: "false"
services:
  ingestion:
    replicas: 1
  parser:
    replicas: 1
  aggregation:
    replicas: 1
```
**values-paper.yaml** (paper broker credentials, Alpaca paper API):
```yaml
image:
  tag: latest  # Overridden by Kargo
config:
  BROKER_MODE: "paper"
  BROKER_PROVIDER: "alpaca"
  LOG_LEVEL: "INFO"
  TRADING_ENABLED: "true"
secrets:
  broker:
    BROKER_BASE_URL: "https://paper-api.alpaca.markets"
```
**values.yaml** (production — existing, unchanged):
- Uses live broker credentials
- Full replica counts
- Production resource limits
### 7. NFS Persistent Volumes
Three PVs with static provisioning, all using `persistentVolumeReclaimPolicy: Retain`:
| PV Name | NFS Path | Capacity | Bound To |
|---|---|---|---|
| `pipeline-argocd-pv` | `/volume1/Kubernetes/pipelines/argocd` | 5Gi | PVC in `argocd` ns |
| `pipeline-kargo-pv` | `/volume1/Kubernetes/pipelines/kargo` | 2Gi | PVC in `kargo` ns |
| `pipeline-arc-pv` | `/volume1/Kubernetes/pipelines/arc` | 2Gi | PVC in `arc-system` ns |
```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pipeline-argocd-pv
  labels:
    app: pipeline-argocd
spec:
  capacity:
    storage: 5Gi
  accessModes: [ReadWriteOnce]
  persistentVolumeReclaimPolicy: Retain
  nfs:
    server: 192.168.42.8
    path: /volume1/Kubernetes/pipelines/argocd
```
### 8. runmefirst.sh — Install Orchestration
```
#!/bin/bash
set -euo pipefail
# Every step uses `kubectl apply` or `helm upgrade --install`, so a re-run is idempotent.
# 1. Create namespaces
kubectl create namespace arc-system --dry-run=client -o yaml | kubectl apply -f -
kubectl create namespace argocd --dry-run=client -o yaml | kubectl apply -f -
kubectl create namespace kargo --dry-run=client -o yaml | kubectl apply -f -
kubectl create namespace stonks-beta --dry-run=client -o yaml | kubectl apply -f -
kubectl create namespace stonks-paper --dry-run=client -o yaml | kubectl apply -f -
# 2. Create NFS PVs (cluster-scoped, idempotent)
kubectl apply -f pvs/
# 3. Install ARC controller
helm upgrade --install arc \
  oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set-controller \
  --namespace arc-system \
  -f arc/values.yaml
# 4. Install ARC runner scale set
kubectl apply -f arc/runner-scaleset.yaml
# 5. Install ArgoCD (the `argo` chart repo must be registered first)
helm repo add argo https://argoproj.github.io/argo-helm --force-update
helm upgrade --install argocd argo/argo-cd \
  --namespace argocd \
  -f argocd/values.yaml \
  --wait
# 6. Apply ArgoCD repo secret + Applications
kubectl apply -f argocd/repo-secret.yaml
kubectl apply -f argocd/apps/
# 7. Install Kargo (wait so its CRDs and webhooks are ready before step 8)
helm upgrade --install kargo oci://ghcr.io/akuity/kargo-charts/kargo \
  --namespace kargo \
  -f kargo/values.yaml \
  --wait
# 8. Apply Kargo project, warehouse, stages
kubectl apply -f kargo/project.yaml
kubectl apply -f kargo/project-config.yaml
kubectl apply -f kargo/warehouse.yaml
kubectl apply -f kargo/stages/
```
### 9. runmelast.sh — Teardown
```
#!/bin/bash
set -euo pipefail
# Reverse order: Kargo → ArgoCD → ARC
# Preserves: PVs, NFS data, stonks-oracle namespace
# 1. Remove Kargo resources
kubectl delete -f kargo/stages/ --ignore-not-found
kubectl delete -f kargo/warehouse.yaml --ignore-not-found
kubectl delete -f kargo/project-config.yaml --ignore-not-found
kubectl delete -f kargo/project.yaml --ignore-not-found
helm uninstall kargo --namespace kargo || true
# 2. Remove ArgoCD resources
kubectl delete -f argocd/apps/ --ignore-not-found
kubectl delete -f argocd/repo-secret.yaml --ignore-not-found
helm uninstall argocd --namespace argocd || true
# 3. Remove ARC
kubectl delete -f arc/runner-scaleset.yaml --ignore-not-found
helm uninstall arc --namespace arc-system || true
# 4. Delete namespaces (but NOT stonks-oracle, stonks-beta, stonks-paper)
kubectl delete namespace arc-system --ignore-not-found
kubectl delete namespace argocd --ignore-not-found
kubectl delete namespace kargo --ignore-not-found
# 5. PVs are intentionally NOT deleted — data persists on NFS
echo "Pipeline infrastructure removed. NFS PVs and data preserved."
```
## Data Models
### Kargo Resource Relationships
```mermaid
graph TD
W[Warehouse: stonks-images<br/>watches GHCR for new tags] -->|produces| F[Freight<br/>image tag = git SHA]
F -->|auto-promote| SB[Stage: beta<br/>ArgoCD App: stonks-beta]
SB -->|verified → available| SP[Stage: paper<br/>market-hours verification<br/>ArgoCD App: stonks-paper]
SP -->|verified → available| SL[Stage: live<br/>manual approval + market-hours<br/>ArgoCD App: stonks-live]
```
### ArgoCD Application ↔ Kargo Stage Mapping
| Kargo Stage | ArgoCD Application | Target Namespace | Values File | Promotion Gate |
|---|---|---|---|---|
| `beta` | `stonks-beta` | `stonks-beta` | `values-beta.yaml` | Auto-promote (no gate) |
| `paper` | `stonks-paper` | `stonks-paper` | `values-paper.yaml` | Market-hours verification |
| `live` | `stonks-live` | `stonks-oracle` | `values.yaml` | Manual approval + market-hours |
### NFS Storage Layout
```
nfs://192.168.42.8:/volume1/Kubernetes/pipelines/
├── argocd/ # ArgoCD server data, repo cache
├── kargo/ # Kargo controller data, promotion history
└── arc/ # ARC runner data, job logs
```
### Image Tag Flow
```
Git SHA (e.g., abc123)
→ CI builds: ghcr.io/celesrenata/stonks-oracle/<service>:abc123
→ Integration test: run_pipeline.sh --image-tag abc123
→ Kargo Warehouse detects: abc123
→ Kargo Freight created: abc123
→ Beta: helm upgrade with image.tag=abc123
→ Paper: helm upgrade with image.tag=abc123 (after market-hours check)
→ Live: helm upgrade with image.tag=abc123 (after approval + market-hours check)
```
### Stage Enable/Disable Configuration
Stage enable/disable is managed via the Kargo ProjectConfig resource. Disabling a stage removes it from the promotion DAG — Freight skips to the next enabled stage. Re-enabling restores the gate.
```yaml
apiVersion: kargo.akuity.io/v1alpha1
kind: ProjectConfig
metadata:
  name: stonks-oracle
  namespace: stonks-oracle
spec:
  promotionPolicies:
    - stage: beta
      autoPromotionEnabled: true
    - stage: paper
      autoPromotionEnabled: false
    - stage: live
      autoPromotionEnabled: false
```
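The "Freight skips to the next enabled stage" behaviour described above can be pictured with a small model of the linear DAG. This is a sketch of the intended semantics only; how a stage is actually disabled in Kargo is not specified here, and `STAGE_ORDER` and `next_enabled_stage()` are hypothetical names.

```python
from typing import Optional, Set

# Linear promotion order from the design: beta -> paper -> live.
STAGE_ORDER = ["beta", "paper", "live"]


def next_enabled_stage(current: Optional[str], enabled: Set[str]) -> Optional[str]:
    """Return the next enabled stage after `current`, or None at the end.

    `current=None` means Freight that has just been produced by the Warehouse.
    """
    start = 0 if current is None else STAGE_ORDER.index(current) + 1
    for stage in STAGE_ORDER[start:]:
        if stage in enabled:
            return stage
    return None


if __name__ == "__main__":
    # With paper disabled, Freight verified in beta goes straight to live.
    print(next_enabled_stage("beta", {"beta", "live"}))
```

Re-enabling a stage simply adds it back to the `enabled` set, which restores its gate for subsequent Freight.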
## Error Handling
### runmefirst.sh Failures
| Failure | Detection | Recovery |
|---|---|---|
| Namespace creation fails | `kubectl create` non-zero exit | Script exits with error message. Re-run is idempotent (uses `--dry-run=client -o yaml \| kubectl apply`). |
| NFS PV creation fails | `kubectl apply` non-zero exit | Check NFS server reachability (`ping 192.168.42.8`). Verify NFS paths exist on Synology. |
| Helm install fails (ARC/ArgoCD/Kargo) | `helm install` non-zero exit | Script exits. Check Helm repo access, image pull credentials, and cluster resources. Re-run after fixing. |
| ArgoCD Application creation fails | `kubectl apply` non-zero exit | Verify ArgoCD CRDs are installed (ArgoCD Helm chart must be running first). |
| Kargo resource creation fails | `kubectl apply` non-zero exit | Verify Kargo CRDs are installed (Kargo Helm chart must be running first). |
### runmelast.sh Failures
| Failure | Detection | Recovery |
|---|---|---|
| Helm uninstall fails | Non-zero exit (caught by `\|\| true`) | Script continues. Manually clean up with `kubectl delete namespace`. |
| Namespace deletion hangs | Namespace stuck in Terminating | Check for finalizers: `kubectl get namespace <ns> -o json` and remove stuck finalizers. |
| PV accidentally deleted | PV missing after teardown | PVs are NOT deleted by runmelast.sh. If manually deleted, NFS data is still on disk — recreate PV pointing at same NFS path. |
### CI Workflow Failures
| Failure | Detection | Recovery |
|---|---|---|
| Self-hosted runner unavailable | GitHub Actions job queued indefinitely | Check ARC controller logs in `arc-system`. Verify runner scale set is registered. Fallback: temporarily switch to `ubuntu-latest`. |
| Image build fails | `docker/build-push-action` non-zero exit | Check build logs. Fix code/Dockerfile and re-push. |
| Integration test fails | `run_pipeline.sh` exits non-zero | Check `inttest-results.json` artifact for failure details. Fix and re-push. Promotion to beta is blocked. |
| GHCR push fails | Authentication error | Verify `GITHUB_TOKEN` secret has `packages:write` permission. Check GHCR rate limits. |
### Promotion Failures
| Failure | Detection | Recovery |
|---|---|---|
| ArgoCD sync fails | ArgoCD Application shows "Degraded" or "OutOfSync" | Check ArgoCD UI at `stonks-argocd.celestium.life`. Inspect sync error. Fix manifests and re-sync. |
| Kargo promotion fails | Kargo Stage shows "Failed" | Check Kargo Dashboard at `stonks-kargo.celestium.life`. Inspect promotion step logs. |
| Market-hours check fails unexpectedly | Verification step errors (not blocks) | Check AnalysisTemplate pod logs. Verify `tzdata` package is available in the container. |
| NFS volume unavailable | Pods stuck in Pending (PVC not bound) | Check NFS server status. Verify PV exists and is not bound to a different PVC. |
### Rollback Strategy
- **Beta/Paper**: ArgoCD auto-sync means reverting the image tag in the values file (or promoting a previous Freight) triggers a rollback. Kargo's promotion history shows which Freight was previously deployed.
- **Live**: Same mechanism — promote a previous Freight to the live stage. ArgoCD syncs the previous image tag. Manual approval is still required.
- **Emergency**: If ArgoCD is down, direct `helm upgrade` with the previous image tag: `helm upgrade stonks-oracle infra/helm/stonks-oracle -n stonks-oracle --set image.tag=<previous-sha>`
## Testing Strategy
### Why Property-Based Testing Does Not Apply
This feature is entirely Infrastructure as Code: shell scripts (`runmefirst.sh`, `runmelast.sh`), Kubernetes YAML manifests (PVs, ArgoCD Applications, Kargo Stages/Warehouses), Helm values files, and GitHub Actions workflow configuration. There are no pure functions, parsers, serializers, or business logic with meaningful input variation. Every acceptance criterion is classified as either SMOKE (one-time configuration check) or INTEGRATION (external service verification).
PBT requires universal properties that hold across a wide input space — "for all X, P(X) holds." This feature has no such properties. The "inputs" are fixed configuration values (namespace names, NFS paths, Helm chart paths, domain names) and the "outputs" are Kubernetes resource states. Running 100 iterations of "does the ArgoCD ingress have TLS enabled" adds no value over running it once.
### Testing Approach
The testing strategy uses three tiers:
#### Tier 1: Smoke Tests (Configuration Validation)
Validate that all generated manifests and scripts are structurally correct before deployment. These run locally or in CI without requiring a live cluster.
| Test | What It Validates | How |
|---|---|---|
| Manifest syntax | All YAML files parse correctly | `kubectl apply --dry-run=client -f <file>` |
| Helm template rendering | Values files produce valid K8s resources | `helm template` with each values file |
| Namespace isolation | Pipeline namespaces are distinct from `stonks-oracle` | Grep manifests for namespace fields |
| NFS path separation | PVs use distinct subdirectories | Inspect PV YAML for unique paths |
| Workflow syntax | GitHub Actions YAML is valid | `actionlint` or GitHub's workflow validator |
| Runner label | Workflow uses `self-hosted-gremlin` label | Grep workflow YAML |
| Service matrix completeness | All 12 services + dashboard + superset in build matrix | Count matrix entries |
| ArgoCD Application structure | Each app points at correct chart, values, namespace | Inspect Application YAML |
| Kargo Stage DAG | Stages form correct linear pipeline | Inspect Stage YAML requestedFreight |
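Two of the smoke checks above (NFS path separation and the Retain policy) can be automated without a cluster. A minimal sketch, assuming the PV manifests have already been parsed into dicts; the `PVS` list mirrors the PV table in this design, and `check_pvs()` is a hypothetical helper.

```python
PIPELINES_ROOT = "/volume1/Kubernetes/pipelines/"

# Name, NFS path, and reclaim policy of each PV, as listed in the PV table.
PVS = [
    {"name": "pipeline-argocd-pv", "path": PIPELINES_ROOT + "argocd", "reclaim": "Retain"},
    {"name": "pipeline-kargo-pv", "path": PIPELINES_ROOT + "kargo", "reclaim": "Retain"},
    {"name": "pipeline-arc-pv", "path": PIPELINES_ROOT + "arc", "reclaim": "Retain"},
]


def check_pvs(pvs, root=PIPELINES_ROOT):
    """Return a list of problems; an empty list means all checks pass."""
    problems = []
    paths = [pv["path"] for pv in pvs]
    if len(set(paths)) != len(paths):
        problems.append("NFS paths are not unique")
    for pv in pvs:
        if not pv["path"].startswith(root):
            problems.append(f"{pv['name']}: path is outside {root}")
        if pv["reclaim"] != "Retain":
            problems.append(f"{pv['name']}: reclaim policy is {pv['reclaim']}")
    return problems


if __name__ == "__main__":
    issues = check_pvs(PVS)
    print("OK" if not issues else "\n".join(issues))
```

Returning a list of problems rather than raising on the first one lets a CI job report every misconfigured PV in a single run.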
#### Tier 2: Integration Tests (Live Cluster Verification)
Run after `runmefirst.sh` on the Gremlin cluster. Verify that all components are running and wired correctly.
| Test | What It Validates | How |
|---|---|---|
| ARC controller running | ARC pods healthy in `arc-system` | `kubectl get pods -n arc-system` |
| Runner registration | Scale set registered with GitHub | Check GitHub repo settings or ARC logs |
| ArgoCD accessible | Web UI responds at `stonks-argocd.celestium.life` | `curl -k https://stonks-argocd.celestium.life` |
| Kargo accessible | Dashboard responds at `stonks-kargo.celestium.life` | `curl -k https://stonks-kargo.celestium.life` |
| TLS certificates | Ingress has valid certs from `ca-issuer` | `openssl s_client` or cert-manager status |
| PV binding | PVCs are bound to NFS PVs | `kubectl get pvc -n argocd` |
| ArgoCD sync | Applications sync successfully | `argocd app get stonks-beta` |
| Kargo Warehouse | Warehouse discovers images from GHCR | `kubectl get freight -n stonks-oracle` |
| End-to-end promotion | Image flows from beta → paper → live | Trigger promotion, verify deployments update |
| Teardown preservation | After `runmelast.sh`, PVs and NFS data intact | Run teardown, check PVs and NFS mount |
| Rebuild reattach | After teardown + `runmefirst.sh`, state restored | Rebuild, verify promotion history preserved |
#### Tier 3: Market-Hours and Break-Glass Tests
These require either mocked time or execution at specific times.
| Test | What It Validates | How |
|---|---|---|
| Market-hours block (during hours) | Promotion blocked 09:30–16:00 ET Mon–Fri | Run AnalysisTemplate with `TZ=America/New_York` during market hours |
| Market-hours allow (outside hours) | Promotion allowed outside market hours | Run AnalysisTemplate outside market hours or on weekend |
| Market-hours boundary | Correct behavior at 09:29, 09:30, 15:59, 16:00 | Run check script with mocked times |
| DST handling | Correct ET evaluation across DST transitions | Verify script uses `America/New_York` (not fixed UTC offset) |
| Break-glass override | Manual approval bypasses market-hours block | During market hours, use Kargo manual approval |
| Break-glass audit | Approval records operator, timestamp, justification | After break-glass, query Kargo audit trail |
| Break-glass non-sticky | Next promotion is still blocked | After break-glass, verify subsequent promotion is blocked |
### Test Execution
- **Smoke tests**: Run as part of a validation script before deployment. Can be added as a CI job.
- **Integration tests**: Run manually after `runmefirst.sh` on the Gremlin cluster. Document as a checklist in the pipeline README.
- **Market-hours tests**: Run manually at appropriate times, or use the market-hours check script in isolation with mocked `TZ` and `date` values.
+229
@@ -0,0 +1,229 @@
# CI/CD Pipeline — Requirements
## Introduction
Full CI/CD pipeline for the Stonks Oracle platform replacing GitHub-hosted runners with self-hosted runners on the existing Kubernetes cluster (GitHub Actions Runner Controller), GitOps-based deployment via ArgoCD, and staged promotion orchestration via Kargo. The pipeline provides five stages — CI, integration test, beta, paper, and live — with market-hours promotion blockers, break-glass emergency overrides, and a visual web dashboard for promotion management. All pipeline infrastructure scripts reside in `~/sources/kube/pipelines/` on gremlin-1 and persist state on NFS volumes that survive cluster rebuilds.
## Glossary
- **ARC**: GitHub Actions Runner Controller — a Kubernetes operator that provisions self-hosted GitHub Actions runners as pods in the cluster
- **ArgoCD**: A GitOps continuous delivery controller for Kubernetes that syncs cluster state from Git repositories
- **Kargo**: A promotion orchestration layer built on top of ArgoCD providing staged promotion gates, a visual web dashboard, and audit trails
- **Pipeline_Infrastructure**: The set of Kubernetes resources (ARC, ArgoCD, Kargo) and their supporting manifests, PVs, and scripts that comprise the CI/CD system, deployed from `~/sources/kube/pipelines/`
- **Promotion**: The act of advancing a specific image tag (SHA) from one pipeline stage to the next (e.g., beta to paper)
- **Promotion_Blocker**: A time-based gate that prevents promotions during US equity market hours (09:30–16:00 ET, Monday–Friday)
- **Break_Glass**: An emergency override mechanism that bypasses the Promotion_Blocker, requiring explicit confirmation and an audit note
- **Stage**: One of the five deployment environments in the pipeline: CI, Integration_Test, Beta, Paper, Live
- **NFS_PV**: A Kubernetes PersistentVolume backed by the NFS share at `nfs://192.168.42.8:/volume1/Kubernetes/pipelines`, used to persist pipeline state across cluster rebuilds
- **GHCR**: GitHub Container Registry at `ghcr.io/celesrenata/stonks-oracle`, the target registry for all built images
- **Image_Tag**: A Docker image tag in the format `<sha>` (Git commit SHA) used to identify a specific build across all stages
- **Gremlin_Cluster**: The 4-node NixOS Kubernetes cluster (gremlin-1 through gremlin-4) at primary address 192.168.42.254
- **Market_Hours**: US equity market trading hours, 09:30–16:00 Eastern Time, Monday through Friday
- **Kargo_Dashboard**: The Kargo web UI providing visual promotion management, stage status, and audit history
- **Integration_Test_Runner**: The existing standalone script at `infra/inttest/run_pipeline.sh` that deploys an ephemeral sandbox, seeds data, runs API tests, and produces `inttest-results.json`
## Requirements
### Requirement 1: Pipeline Infrastructure Deployment
**User Story:** As a platform operator, I want a single deployment script that installs all CI/CD pipeline components (ARC, ArgoCD, Kargo) onto the Gremlin_Cluster, so that the pipeline infrastructure can be stood up or rebuilt with one command.
#### Acceptance Criteria
1. WHEN the operator executes `runmefirst.sh` from `~/sources/kube/pipelines/`, THE Pipeline_Infrastructure SHALL install ARC, ArgoCD, and Kargo into the Gremlin_Cluster in dedicated namespaces
2. WHEN the operator executes `runmefirst.sh`, THE Pipeline_Infrastructure SHALL create NFS-backed PersistentVolumes at `nfs://192.168.42.8:/volume1/Kubernetes/pipelines` for ArgoCD, Kargo, and ARC persistent data
3. WHEN ArgoCD is deployed, THE Pipeline_Infrastructure SHALL expose the ArgoCD web UI via Traefik ingress with TLS using the `ca-issuer` ClusterIssuer
4. WHEN Kargo is deployed, THE Pipeline_Infrastructure SHALL expose the Kargo_Dashboard via Traefik ingress with TLS using the `ca-issuer` ClusterIssuer
5. THE Pipeline_Infrastructure SHALL store all deployment manifests and scripts in `~/sources/kube/pipelines/` on gremlin-1
### Requirement 2: Pipeline Infrastructure Teardown
**User Story:** As a platform operator, I want a teardown script that removes pipeline components without destroying persistent pipeline data, so that pipeline state survives cluster rebuilds.
#### Acceptance Criteria
1. WHEN the operator executes `runmelast.sh` from `~/sources/kube/pipelines/`, THE Pipeline_Infrastructure SHALL remove ARC, ArgoCD, and Kargo deployments from the Gremlin_Cluster
2. WHEN `runmelast.sh` executes, THE Pipeline_Infrastructure SHALL preserve all NFS_PV resources and the data stored on `nfs://192.168.42.8:/volume1/Kubernetes/pipelines`
3. WHEN `runmelast.sh` executes, THE Pipeline_Infrastructure SHALL leave the application namespace `stonks-oracle` and all application workloads untouched
4. WHEN the application teardown script `~/sources/kube/stonks-oracle/runmelast.sh` executes, THE Pipeline_Infrastructure SHALL remain operational and unaffected
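Non-normative sketch: one way to express the teardown order (Kargo, then ArgoCD, then ARC) while leaving PersistentVolumes and application namespaces alone. The Helm release names (`kargo`, `argocd`, `arc`), the `DRY_RUN` switch, and the helper names are assumptions made for illustration; the manifest paths follow the layout in the implementation plan. It also assumes the ArgoCD Applications carry no cascading-delete finalizer, so removing them does not remove the workloads they manage.

```bash
# DRY_RUN=1 (the default in this sketch) only prints the commands.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@" || true   # tolerate resources that are already gone
  fi
}

teardown_pipeline() {
  local dir="${1:-$HOME/sources/kube/pipelines}" ns
  run kubectl delete -f "$dir/kargo/stages/" --ignore-not-found
  run kubectl delete -f "$dir/kargo/warehouse.yaml" --ignore-not-found
  run helm uninstall kargo -n kargo        # release name is an assumption
  run kubectl delete -f "$dir/argocd/apps/" --ignore-not-found
  run helm uninstall argocd -n argocd      # release name is an assumption
  run kubectl delete -f "$dir/arc/runner-scaleset.yaml" --ignore-not-found
  run helm uninstall arc -n arc-system     # release name is an assumption
  for ns in kargo argocd arc-system; do
    run kubectl delete namespace "$ns" --ignore-not-found
  done
  # Intentionally absent: any delete of PersistentVolumes, or of the
  # application, beta, and paper namespaces.
}
```

Running `teardown_pipeline` with the default `DRY_RUN=1` prints the plan for review; setting `DRY_RUN=0` executes it.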
### Requirement 3: Pipeline Infrastructure Isolation
**User Story:** As a platform operator, I want the pipeline infrastructure to be fully isolated from the application infrastructure, so that deploying or tearing down one does not affect the other.
#### Acceptance Criteria
1. THE Pipeline_Infrastructure SHALL deploy ARC, ArgoCD, and Kargo in namespaces separate from the `stonks-oracle` application namespace
2. THE Pipeline_Infrastructure SHALL use independent Helm releases or manifests that share no lifecycle with the `stonks-oracle` Helm chart
3. THE Pipeline_Infrastructure SHALL use NFS_PV paths under `pipelines/` that are distinct from any application storage paths
### Requirement 4: Self-Hosted CI Runners
**User Story:** As a developer, I want CI builds to run on self-hosted runners in the Gremlin_Cluster via ARC, so that GitHub Actions compute costs are eliminated.
#### Acceptance Criteria
1. WHEN ARC is deployed, THE Pipeline_Infrastructure SHALL register a runner scale set with GitHub that accepts jobs from the `celesrenata/stonks-oracle` repository
2. WHEN a GitHub Actions workflow targets the self-hosted runner label, THE ARC SHALL provision runner pods in the Gremlin_Cluster to execute the job
3. WHEN a CI job completes, THE ARC SHALL terminate the runner pod and release cluster resources
4. THE ARC SHALL use ephemeral runner pods that start clean for each job execution
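Non-normative sketch: if the runner scale set is installed through the upstream `gha-runner-scale-set` Helm chart (an assumption; the implementation plan describes a hand-written manifest instead), its values could be rendered as below. The Secret name `arc-github-secret`, the `maxRunners` ceiling, and the `local-path` storage class are hypothetical placeholders. The scale set name `self-hosted-gremlin` and the 2 CPU / 4Gi limits come from the implementation plan.

```bash
# Print Helm values for one runner scale set to stdout.
render_runner_values() {
  local repo_url="$1" scale_set_name="$2"
  cat <<EOF
githubConfigUrl: "${repo_url}"
githubConfigSecret: arc-github-secret   # hypothetical pre-created Secret
runnerScaleSetName: "${scale_set_name}"
minRunners: 0
maxRunners: 4                           # hypothetical ceiling
containerMode:
  type: "kubernetes"
  kubernetesModeWorkVolumeClaim:
    accessModes: ["ReadWriteOnce"]
    storageClassName: "local-path"      # hypothetical storage class
    resources:
      requests:
        storage: 1Gi
template:
  spec:
    containers:
      - name: runner
        image: ghcr.io/actions/actions-runner:latest
        command: ["/home/runner/run.sh"]
        resources:
          limits:
            cpu: "2"
            memory: 4Gi
EOF
}
```

A possible use is `render_runner_values https://github.com/celesrenata/stonks-oracle self-hosted-gremlin > values.yaml`, followed by a Helm install of `oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set` into `arc-system` with that file. Workflows would then select the runners with `runs-on: self-hosted-gremlin`.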
### Requirement 5: CI Stage — Lint and Test
**User Story:** As a developer, I want every push to main or pull request to trigger automated linting and testing on self-hosted runners, so that code quality is validated before images are built.
#### Acceptance Criteria
1. WHEN a commit is pushed to the `main` branch or a pull request is opened, THE CI_Stage SHALL trigger a workflow on self-hosted ARC runners
2. WHEN the CI workflow runs, THE CI_Stage SHALL execute Python linting using `ruff check services/`
3. WHEN the CI workflow runs, THE CI_Stage SHALL execute Python unit tests using `pytest tests/`
4. WHEN the CI workflow runs, THE CI_Stage SHALL install frontend dependencies and execute frontend tests using `vitest`
5. IF any lint or test step fails, THEN THE CI_Stage SHALL mark the workflow as failed and skip image builds
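Non-normative sketch: the fail-fast behavior in criterion 5 can be pictured as a small wrapper that stops at the first failing step. The helper names are hypothetical, and the `frontend/` directory and the use of `npm ci` are assumptions (the directory is inferred from `frontend/Dockerfile`); the actual workflow expresses the same thing through job dependencies.

```bash
# Run one named step; report and return non-zero if it fails.
run_step() {
  local name="$1"
  shift
  echo "==> $name"
  if ! "$@"; then
    echo "FAILED: $name" >&2
    return 1
  fi
}

# Lint and test in order; the first failure aborts the remaining steps,
# so image builds that depend on this function never start.
lint_and_test() {
  run_step "ruff check" ruff check services/ || return 1
  run_step "pytest"     pytest tests/        || return 1
  run_step "vitest"     sh -c 'cd frontend && npm ci && npx vitest run' || return 1
}
```

A caller could gate builds with `lint_and_test && build_images`, where `build_images` stands in for whatever performs the image build.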
### Requirement 6: CI Stage — Image Build and Push
**User Story:** As a developer, I want Docker images for all services and the dashboard to be built and pushed to GHCR on every successful main branch push, so that new images are available for deployment.
#### Acceptance Criteria
1. WHEN lint and tests pass on a push to `main`, THE CI_Stage SHALL build Docker images for all 12 Python services (scheduler, symbol-registry, ingestion, parser, extractor, aggregation, recommendation, risk, broker-adapter, lake-publisher, query-api, trading-engine)
2. WHEN lint and tests pass on a push to `main`, THE CI_Stage SHALL build the dashboard Docker image from `frontend/Dockerfile`
3. WHEN lint and tests pass on a push to `main`, THE CI_Stage SHALL build the superset Docker image from `docker/Dockerfile.superset`
4. WHEN images are built, THE CI_Stage SHALL push each image to GHCR with tags `ghcr.io/celesrenata/stonks-oracle/<service>:<git-sha>` and `ghcr.io/celesrenata/stonks-oracle/<service>:latest`
5. WHEN all images are pushed, THE CI_Stage SHALL record the Git SHA as the Image_Tag for downstream stages
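Non-normative sketch: the tagging rule in criterion 4 expands to two references per image. The helper names are hypothetical, and the image names `dashboard` and `superset` are assumptions, since the requirement names the Dockerfiles but not the resulting image names.

```bash
REGISTRY="ghcr.io/celesrenata/stonks-oracle"
SERVICES="scheduler symbol-registry ingestion parser extractor aggregation
recommendation risk broker-adapter lake-publisher query-api trading-engine"

# Print the SHA-tagged and the :latest reference for one image.
image_refs() {
  local service="$1" sha="$2"
  printf '%s/%s:%s\n' "$REGISTRY" "$service" "$sha"
  printf '%s/%s:latest\n' "$REGISTRY" "$service"
}

# Print every reference pushed for one commit (12 services + 2 extras).
all_image_refs() {
  local sha="$1" svc
  for svc in $SERVICES dashboard superset; do
    image_refs "$svc" "$sha"
  done
}
```

Downstream stages only need the SHA-tagged reference; `:latest` is a convenience tag and should not be used to identify a build.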
### Requirement 7: Integration Test Stage
**User Story:** As a developer, I want the CI pipeline to automatically run integration tests against newly built images, so that functional correctness is validated before promotion to beta.
#### Acceptance Criteria
1. WHEN all images are pushed to GHCR for a given Image_Tag, THE Integration_Test_Stage SHALL invoke the Integration_Test_Runner with `bash infra/inttest/run_pipeline.sh --image-tag <sha>`
2. WHEN the Integration_Test_Runner completes, THE Integration_Test_Stage SHALL parse the `inttest-results.json` file for test counts and exit code
3. IF the Integration_Test_Runner exits with code 0, THEN THE Integration_Test_Stage SHALL mark the Image_Tag as eligible for promotion to Beta
4. IF the Integration_Test_Runner exits with a non-zero code, THEN THE Integration_Test_Stage SHALL block promotion to Beta and report the failure details
5. THE Integration_Test_Stage SHALL archive the `inttest-results.json` as a build artifact
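Non-normative sketch: criteria 3 and 4 reduce to a mapping from the runner's exit code to a promotion decision. The codes 0, 1, and 2 are the ones documented in the runner's CLI contract; the function name and the message wording are hypothetical.

```bash
# Print a human-readable decision and return 0 only when promotion
# to Beta should be allowed.
promotion_gate() {
  local code="$1"
  case "$code" in
    0) echo "eligible" ;;
    1) echo "blocked: test failures" ;;
    2) echo "blocked: infrastructure setup failure" ;;
    *) echo "blocked: unexpected exit code $code" ;;
  esac
  [ "$code" -eq 0 ]
}

# Possible use in a CI step (not executed here):
#   bash infra/inttest/run_pipeline.sh --image-tag "$SHA" && rc=0 || rc=$?
#   promotion_gate "$rc"
```

Treating any unrecognized code as a block keeps the gate closed if the runner is killed or fails in a way the contract does not describe.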
### Requirement 8: Beta Stage Deployment
**User Story:** As a developer, I want a beta environment where newly built images are deployed for smoke testing and manual verification before promotion to paper trading, so that regressions are caught early.
#### Acceptance Criteria
1. WHEN an Image_Tag passes the Integration_Test_Stage, THE Beta_Stage SHALL deploy the application with that Image_Tag to a beta namespace or Helm release managed by ArgoCD
2. WHILE the Beta_Stage is active, THE Kargo_Dashboard SHALL display the currently deployed Image_Tag and its promotion status
3. WHEN a developer requests promotion from Beta to Paper via the Kargo_Dashboard, THE Beta_Stage SHALL verify that the Image_Tag passed integration tests before allowing promotion
4. THE Beta_Stage SHALL use the same Helm chart (`infra/helm/stonks-oracle/`) as production, with beta-specific value overrides
### Requirement 9: Paper Trading Stage Deployment
**User Story:** As a trader, I want a paper trading environment that uses the Alpaca paper broker, so that new builds can be validated against simulated market conditions before going live.
#### Acceptance Criteria
1. WHEN an Image_Tag is promoted from Beta, THE Paper_Stage SHALL deploy the application with that Image_Tag to a paper trading namespace managed by ArgoCD
2. THE Paper_Stage SHALL configure the broker adapter with `BROKER_MODE=paper` and `BROKER_PROVIDER=alpaca` using Alpaca paper trading credentials
3. WHILE Market_Hours are active (09:30–16:00 ET, Monday–Friday), THE Paper_Stage SHALL block automatic and manual promotions to the Paper_Stage unless Break_Glass is activated
4. WHEN a promotion to Paper is attempted outside Market_Hours, THE Paper_Stage SHALL allow the promotion to proceed
5. THE Paper_Stage SHALL use the same Helm chart (`infra/helm/stonks-oracle/`) as production, with paper-specific value overrides
### Requirement 10: Live Stage Deployment
**User Story:** As a platform operator, I want production deployments to require explicit manual approval with notes, so that live trading is protected from accidental or untested deployments.
#### Acceptance Criteria
1. WHEN an Image_Tag is promoted from Paper, THE Live_Stage SHALL require explicit manual approval with a notes field before deploying to the `stonks-oracle` production namespace
2. THE Live_Stage SHALL deploy the application with the approved Image_Tag via ArgoCD syncing the production Helm release
3. WHILE Market_Hours are active (09:30–16:00 ET, Monday–Friday), THE Live_Stage SHALL block promotions to the Live_Stage unless Break_Glass is activated
4. WHEN a promotion to Live is attempted outside Market_Hours with valid approval, THE Live_Stage SHALL allow the promotion to proceed
5. THE Live_Stage SHALL use the existing `stonks-oracle` namespace and Helm chart with production values
### Requirement 11: Market-Hours Promotion Blocker
**User Story:** As a risk manager, I want promotions to paper and live environments to be blocked during US market hours, so that deployments do not disrupt active trading sessions.
#### Acceptance Criteria
1. WHILE the current time is between 09:30 and 16:00 Eastern Time on a weekday, THE Promotion_Blocker SHALL prevent promotions to the Paper_Stage and Live_Stage
2. WHEN the current time is outside 09:30–16:00 ET or on a weekend, THE Promotion_Blocker SHALL allow promotions to proceed (subject to other gates)
3. WHEN a promotion is blocked by the Promotion_Blocker, THE Kargo_Dashboard SHALL display a visual indicator showing the block reason and the time until the market closes
4. THE Promotion_Blocker SHALL evaluate Eastern Time correctly, accounting for US daylight saving time transitions
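Non-normative sketch: the check behind the Promotion_Blocker can be written as a pure function of weekday and wall-clock time, plus a thin wrapper that reads the current time in the `America/New_York` zone so that the system's zone database handles daylight saving time. The function names are hypothetical. The sketch assumes the window is half-open (blocked from 09:30 up to but not including 16:00), that the container has time zone data installed, and it does not account for market holidays.

```bash
# promotion_allowed DOW HOUR MINUTE
#   DOW is 1 (Monday) through 7 (Sunday), as printed by `date +%u`.
# Returns 0 when promotion is allowed, 1 during market hours.
promotion_allowed() {
  # Strip one leading zero so "09" is not parsed as an octal number.
  local dow="$1" hour="${2#0}" minute="${3#0}" now
  now=$(( ${hour:-0} * 60 + ${minute:-0} ))
  if [ "$dow" -ge 6 ]; then
    return 0                      # Saturday or Sunday
  fi
  if [ "$now" -ge 570 ] && [ "$now" -lt 960 ]; then
    return 1                      # 09:30 <= now < 16:00
  fi
  return 0
}

# Evaluate against the current wall-clock time in New York.
check_market_hours_now() {
  # One `date` call so all three fields come from the same instant.
  set -- $(TZ=America/New_York date '+%u %H %M')
  promotion_allowed "$1" "$2" "$3"
}
```

Keeping the time arithmetic separate from the clock read makes the boundary cases (09:29, 09:30, 15:59, 16:00) easy to test without waiting for them to occur.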
### Requirement 12: Break-Glass Emergency Override
**User Story:** As a platform operator, I want a break-glass mechanism to bypass market-hours blockers during emergencies, so that critical fixes can be deployed at any time.
#### Acceptance Criteria
1. WHEN an operator activates Break_Glass via the Kargo_Dashboard, THE Pipeline_Infrastructure SHALL bypass the Promotion_Blocker for the target Stage
2. WHEN Break_Glass is activated, THE Kargo_Dashboard SHALL require a confirmation dialog before proceeding
3. WHEN Break_Glass is activated, THE Pipeline_Infrastructure SHALL require the operator to provide a written justification note
4. WHEN Break_Glass is used, THE Pipeline_Infrastructure SHALL record the operator identity, timestamp, target Stage, Image_Tag, and justification note in the audit trail
5. THE Break_Glass mechanism SHALL apply only to the single promotion for which it was activated and SHALL NOT disable the Promotion_Blocker for subsequent promotions
### Requirement 13: Per-Stage Enable/Disable Controls
**User Story:** As a platform operator, I want to independently enable or disable each pipeline stage, so that the pipeline can be configured for different operational modes.
#### Acceptance Criteria
1. THE Pipeline_Infrastructure SHALL provide a configuration mechanism to independently enable or disable each of the five stages (CI, Integration_Test, Beta, Paper, Live)
2. WHEN a Stage is disabled, THE Pipeline_Infrastructure SHALL skip that Stage during promotion and advance the Image_Tag to the next enabled Stage
3. WHEN a Stage is re-enabled, THE Pipeline_Infrastructure SHALL resume gating promotions through that Stage for new Image_Tags
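Non-normative sketch: the skip rule in criterion 2 amounts to "find the first enabled stage after the current one". The stage identifiers and the function name are hypothetical, and how the toggle is actually stored (for example in Kargo configuration) is outside this illustration.

```bash
# Fixed pipeline order.
STAGES="ci integration-test beta paper live"

# next_enabled_stage CURRENT "SPACE SEPARATED ENABLED STAGES"
# Prints the next enabled stage after CURRENT; returns 1 if there is none.
next_enabled_stage() {
  local current="$1" enabled="$2" seen=0 stage
  for stage in $STAGES; do
    if [ "$seen" -eq 1 ]; then
      case " $enabled " in
        *" $stage "*) echo "$stage"; return 0 ;;
      esac
    fi
    if [ "$stage" = "$current" ]; then
      seen=1
    fi
  done
  return 1
}
```

For example, with Beta disabled, `next_enabled_stage integration-test "ci integration-test paper live"` prints `paper`.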
### Requirement 14: Revision Tracking
**User Story:** As a developer, I want to see which Image_Tag (Git SHA) is deployed at each pipeline stage, so that I can track exactly what code is running in each environment.
#### Acceptance Criteria
1. THE Kargo_Dashboard SHALL display the currently deployed Image_Tag for each active Stage
2. WHEN a promotion occurs, THE Kargo_Dashboard SHALL update the displayed Image_Tag for the target Stage within 60 seconds
3. THE Pipeline_Infrastructure SHALL maintain a mapping of Stage to current Image_Tag that is queryable via the Kargo API or ArgoCD
### Requirement 15: Audit Trail
**User Story:** As a compliance officer, I want a complete audit trail of all promotions including who promoted, when, with what notes, and whether break-glass was used, so that deployment decisions are traceable.
#### Acceptance Criteria
1. WHEN a promotion occurs, THE Pipeline_Infrastructure SHALL record the operator identity, timestamp, source Stage, target Stage, Image_Tag, and any notes provided
2. WHEN Break_Glass is used for a promotion, THE Pipeline_Infrastructure SHALL record the break-glass justification alongside the standard promotion record
3. THE Kargo_Dashboard SHALL display the promotion history for each Stage, showing all recorded audit fields
4. THE Pipeline_Infrastructure SHALL persist audit trail data on NFS_PV so that promotion history survives cluster rebuilds
### Requirement 16: Kargo Visual Dashboard
**User Story:** As a platform operator, I want a web dashboard showing all pipeline stages, their current revisions, and promotion controls, so that I can manage deployments visually.
#### Acceptance Criteria
1. THE Kargo_Dashboard SHALL display all five Stages with their current deployed Image_Tag and promotion status
2. THE Kargo_Dashboard SHALL provide a click-to-promote action for advancing an Image_Tag from one Stage to the next
3. WHEN Market_Hours are active, THE Kargo_Dashboard SHALL display block/allow indicators on the Paper_Stage and Live_Stage
4. THE Kargo_Dashboard SHALL provide a notes field when promoting or when a promotion is blocked
5. THE Kargo_Dashboard SHALL provide a Break_Glass button with a confirmation dialog for emergency overrides
6. THE Kargo_Dashboard SHALL be accessible via Traefik ingress at a `*.celestium.life` domain with TLS via `ca-issuer`
### Requirement 17: NFS Persistent Storage
**User Story:** As a platform operator, I want all pipeline state (ArgoCD app configs, Kargo promotion history, ARC data) to persist on NFS volumes, so that pipeline data survives cluster teardowns and rebuilds.
#### Acceptance Criteria
1. THE Pipeline_Infrastructure SHALL create PersistentVolumes backed by the NFS share at `nfs://192.168.42.8:/volume1/Kubernetes/pipelines` for ArgoCD server data, Kargo data, and ARC data
2. WHEN `runmelast.sh` is executed, THE NFS_PV resources and their underlying NFS data SHALL remain intact
3. WHEN `runmefirst.sh` is executed after a previous teardown, THE Pipeline_Infrastructure SHALL reattach to the existing NFS data and restore previous pipeline state
4. THE Pipeline_Infrastructure SHALL use separate NFS subdirectories for ArgoCD, Kargo, and ARC to prevent data conflicts
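Non-normative sketch: a PersistentVolume per component with its own NFS subdirectory and a `Retain` reclaim policy could be rendered as below. The server address, base path, and `pipeline-<component>` label follow this spec and the implementation plan; the PV name, the `ReadWriteMany` access mode, and the function name are assumptions.

```bash
# render_pv COMPONENT SIZE
# Prints a PersistentVolume manifest for one pipeline component.
render_pv() {
  local component="$1" size="$2"
  cat <<EOF
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pipeline-${component}        # PV name is an assumption
  labels:
    app: pipeline-${component}
spec:
  capacity:
    storage: ${size}
  accessModes:
    - ReadWriteMany                  # assumed; not stated in the spec
  persistentVolumeReclaimPolicy: Retain
  nfs:
    server: 192.168.42.8
    path: /volume1/Kubernetes/pipelines/${component}
EOF
}
```

A possible use is `render_pv argocd 5Gi | kubectl apply -f -`. With `Retain`, deleting a claim or the PV object leaves the files on the NFS share in place, which is what allows a later install to reattach to them.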
### Requirement 18: ArgoCD GitOps Configuration
**User Story:** As a platform operator, I want ArgoCD to sync Kubernetes manifests from the Git repository, so that the cluster state is always consistent with the declared configuration.
#### Acceptance Criteria
1. THE ArgoCD SHALL be configured with an Application resource pointing to the `infra/helm/stonks-oracle/` Helm chart in the `celesrenata/stonks-oracle` Git repository
2. WHEN a change is committed to the Helm chart or values files in Git, THE ArgoCD SHALL detect the change and sync the updated manifests to the target namespace
3. THE ArgoCD SHALL support multiple Application resources for beta, paper, and live environments, each with stage-specific value overrides
4. IF an ArgoCD sync fails, THEN THE ArgoCD SHALL report the failure status in the ArgoCD UI and the Kargo_Dashboard
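Non-normative sketch: one ArgoCD Application per stage, differing only in values file and target namespace, could be rendered as below. The chart path, stage names, and namespaces come from this spec and the implementation plan. The repository URL form, `targetRevision: main`, `project: default`, and the function name are assumptions.

```bash
# render_app STAGE NAMESPACE VALUES_FILE
# Prints an ArgoCD Application manifest for one stage.
render_app() {
  local stage="$1" namespace="$2" values_file="$3"
  cat <<EOF
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: stonks-${stage}
  namespace: argocd
spec:
  project: default                   # assumed ArgoCD project
  source:
    repoURL: https://github.com/celesrenata/stonks-oracle.git  # assumed URL form
    targetRevision: main             # assumed branch
    path: infra/helm/stonks-oracle
    helm:
      valueFiles:
        - ${values_file}
  destination:
    server: https://kubernetes.default.svc
    namespace: ${namespace}
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
EOF
}
```

For the three stages this would be called as `render_app beta stonks-beta values-beta.yaml`, `render_app paper stonks-paper values-paper.yaml`, and `render_app live stonks-oracle values.yaml`.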
+96
@@ -0,0 +1,96 @@
# Implementation Plan: CI/CD Pipeline
## Overview
Build a full CI/CD pipeline for Stonks Oracle using ARC (self-hosted GitHub Actions runners), ArgoCD (GitOps deployment), and Kargo (staged promotion orchestration) on the Gremlin cluster. Pipeline infrastructure scripts go in `~/sources/kube/pipelines/` on gremlin-1. Helm values files and the updated GitHub Actions workflow go in the stonks-oracle repo.
## Tasks
- [x] 1. Create NFS PersistentVolume manifests
- [x] 1.1 Create `~/sources/kube/pipelines/pvs/argocd-pv.yaml` — NFS PV for ArgoCD (5Gi, `nfs://192.168.42.8:/volume1/Kubernetes/pipelines/argocd`, `persistentVolumeReclaimPolicy: Retain`, label `app: pipeline-argocd`)
- _Requirements: 1.2, 17.1, 17.4_
- [x] 1.2 Create `~/sources/kube/pipelines/pvs/kargo-pv.yaml` — NFS PV for Kargo (2Gi, `nfs://192.168.42.8:/volume1/Kubernetes/pipelines/kargo`, `persistentVolumeReclaimPolicy: Retain`, label `app: pipeline-kargo`)
- _Requirements: 1.2, 17.1, 17.4_
- [x] 1.3 Create `~/sources/kube/pipelines/pvs/arc-pv.yaml` — NFS PV for ARC (2Gi, `nfs://192.168.42.8:/volume1/Kubernetes/pipelines/arc`, `persistentVolumeReclaimPolicy: Retain`, label `app: pipeline-arc`)
- _Requirements: 1.2, 17.1, 17.4_
- [x] 2. Create ARC (Actions Runner Controller) manifests
- [x] 2.1 Create `~/sources/kube/pipelines/arc/values.yaml` — Helm values for the ARC controller chart (`oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set-controller`), namespace `arc-system`
- _Requirements: 1.1, 4.1_
- [x] 2.2 Create `~/sources/kube/pipelines/arc/runner-scaleset.yaml` — RunnerScaleSet CR for `celesrenata/stonks-oracle` repo with label `self-hosted-gremlin`, `containerMode.type: kubernetes`, ephemeral pods, 2 CPU / 4Gi memory limits
- _Requirements: 4.1, 4.2, 4.3, 4.4_
- [x] 3. Create ArgoCD manifests
- [x] 3.1 Create `~/sources/kube/pipelines/argocd/values.yaml` — Helm values for `argo/argo-cd` chart in `argocd` namespace, with Traefik ingress at `stonks-argocd.celestium.life`, TLS via `ca-issuer`, NFS PVC for persistence
- _Requirements: 1.1, 1.3, 18.1_
- [x] 3.2 Create `~/sources/kube/pipelines/argocd/repo-secret.yaml` — Kubernetes Secret with Git credentials for the `celesrenata/stonks-oracle` repository, namespace `argocd`
- _Requirements: 18.1_
- [x] 3.3 Create `~/sources/kube/pipelines/argocd/apps/stonks-beta.yaml` — ArgoCD Application for beta stage, pointing at `infra/helm/stonks-oracle/` with `values-beta.yaml`, target namespace `stonks-beta`, auto-sync with prune and selfHeal
- _Requirements: 8.1, 8.4, 18.2, 18.3_
- [x] 3.4 Create `~/sources/kube/pipelines/argocd/apps/stonks-paper.yaml` — ArgoCD Application for paper stage, pointing at `infra/helm/stonks-oracle/` with `values-paper.yaml`, target namespace `stonks-paper`, auto-sync with prune and selfHeal
- _Requirements: 9.1, 9.5, 18.2, 18.3_
- [x] 3.5 Create `~/sources/kube/pipelines/argocd/apps/stonks-live.yaml` — ArgoCD Application for live stage, pointing at `infra/helm/stonks-oracle/` with `values.yaml`, target namespace `stonks-oracle`, auto-sync with prune and selfHeal
- _Requirements: 10.2, 10.5, 18.2, 18.3_
- [x] 4. Checkpoint — Verify ArgoCD and ARC manifests
- Ensure all YAML manifests are syntactically valid. Review that each ArgoCD Application points at the correct chart path, values file, and target namespace. Ask the user if questions arise.
- [x] 5. Create Kargo manifests
- [x] 5.1 Create `~/sources/kube/pipelines/kargo/values.yaml` — Helm values for `oci://ghcr.io/akuity/kargo-charts/kargo` in `kargo` namespace, with Traefik ingress at `stonks-kargo.celestium.life`, TLS via `ca-issuer`, NFS PVC for persistence
- _Requirements: 1.1, 1.4, 16.6_
- [x] 5.2 Create `~/sources/kube/pipelines/kargo/project.yaml` — Kargo Project resource `stonks-oracle` in `stonks-oracle` namespace
- _Requirements: 8.2, 14.1_
- [x] 5.3 Create `~/sources/kube/pipelines/kargo/warehouse.yaml` — Kargo Warehouse `stonks-images` watching `ghcr.io/celesrenata/stonks-oracle/query-api` for new image tags
- _Requirements: 6.5, 14.1_
- [x] 5.4 Create `~/sources/kube/pipelines/kargo/stages/beta.yaml` — Kargo Stage for beta with auto-promotion enabled, promotion template that updates `image.tag` in the `stonks-beta` ArgoCD Application
- _Requirements: 8.1, 8.3, 13.1_
- [x] 5.5 Create `~/sources/kube/pipelines/kargo/stages/paper.yaml` — Kargo Stage for paper with manual promotion, market-hours verification step (AnalysisTemplate), promotion template that updates `image.tag` in the `stonks-paper` ArgoCD Application
- _Requirements: 9.1, 9.3, 9.4, 11.1, 11.2, 13.1_
- [x] 5.6 Create `~/sources/kube/pipelines/kargo/stages/live.yaml` — Kargo Stage for live with manual approval + required notes, market-hours verification step, promotion template that updates `image.tag` in the `stonks-live` ArgoCD Application
- _Requirements: 10.1, 10.3, 10.4, 11.1, 11.2, 12.1, 12.3, 13.1_
- [x] 5.7 Create `~/sources/kube/pipelines/kargo/project-config.yaml` — Kargo ProjectConfig with per-stage `autoPromotionEnabled` settings (beta: true, paper: false, live: false)
- _Requirements: 13.1, 13.2, 13.3_
- [x] 6. Create market-hours AnalysisTemplate
- [x] 6.1 Create the AnalysisTemplate manifest for market-hours verification — runs an Alpine container that checks Eastern Time (09:30–16:00 ET, Mon–Fri), exits 0 outside market hours, exits 1 during market hours. Uses `America/New_York` timezone for DST correctness. Place in `~/sources/kube/pipelines/kargo/` directory.
- _Requirements: 11.1, 11.2, 11.4_
- [x] 7. Checkpoint — Verify Kargo manifests and promotion DAG
- Ensure Kargo stages form the correct linear DAG: beta → paper → live. Verify market-hours AnalysisTemplate is referenced by paper and live stages. Ensure all YAML is syntactically valid. Ask the user if questions arise.
- [x] 8. Create Helm values files for beta and paper stages (in stonks-oracle repo)
- [x] 8.1 Create `infra/helm/stonks-oracle/values-beta.yaml` — lighter resources, `BROKER_MODE: mock`, `BROKER_PROVIDER: mock`, `LOG_LEVEL: DEBUG`, `TRADING_ENABLED: false`, single replicas per service
- _Requirements: 8.4, 9.2_
- [x] 8.2 Create `infra/helm/stonks-oracle/values-paper.yaml` — paper broker config, `BROKER_MODE: paper`, `BROKER_PROVIDER: alpaca`, `BROKER_BASE_URL: https://paper-api.alpaca.markets`, `LOG_LEVEL: INFO`, `TRADING_ENABLED: true`
- _Requirements: 9.2, 9.5_
- [x] 9. Update GitHub Actions workflow (in stonks-oracle repo)
- [x] 9.1 Update `.github/workflows/build.yml` — change `runs-on: ubuntu-latest` to `runs-on: self-hosted-gremlin` on all jobs (`lint-and-test`, `build-services`, `build-dashboard`, `build-superset`)
- _Requirements: 5.1, 4.2_
- [x] 9.2 Add `integration-test` job to `.github/workflows/build.yml` — depends on `build-services` and `build-dashboard`, runs only on push to main, invokes `bash infra/inttest/run_pipeline.sh --image-tag ${{ github.sha }} --results-file inttest-results.json`, uploads `inttest-results.json` as a build artifact via `actions/upload-artifact@v4`
- _Requirements: 7.1, 7.2, 7.3, 7.4, 7.5_
- [x] 10. Checkpoint — Verify workflow and values files
- Ensure the updated workflow YAML is syntactically valid. Verify the integration-test job has correct `needs`, `if` condition, and artifact upload. Confirm values-beta.yaml and values-paper.yaml are valid Helm values. Ask the user if questions arise.
- [x] 11. Create install and teardown scripts
- [x] 11.1 Create `~/sources/kube/pipelines/runmefirst.sh` — full install script: create namespaces (`arc-system`, `argocd`, `kargo`, `stonks-beta`, `stonks-paper`), apply PVs, install ARC controller via Helm, apply runner scaleset, install ArgoCD via Helm with values, apply repo secret + ArgoCD Applications, install Kargo via Helm with values, apply Kargo project + warehouse + stages. Use `set -euo pipefail`, idempotent namespace creation via `--dry-run=client -o yaml | kubectl apply -f -`
- _Requirements: 1.1, 1.2, 1.5, 3.1_
- [x] 11.2 Create `~/sources/kube/pipelines/runmelast.sh` — teardown script: delete Kargo resources (stages, warehouse, project-config, project), uninstall Kargo Helm release, delete ArgoCD resources (apps, repo-secret), uninstall ArgoCD Helm release, delete ARC resources (runner-scaleset), uninstall ARC Helm release, delete namespaces (`arc-system`, `argocd`, `kargo`). Preserve PVs, NFS data, `stonks-oracle` namespace, `stonks-beta`, and `stonks-paper` namespaces. Use `--ignore-not-found` and `|| true` for idempotency.
- _Requirements: 2.1, 2.2, 2.3, 2.4, 3.2, 3.3, 17.2_
- [x] 12. Final checkpoint — Review all artifacts
- Ensure all files are created in the correct locations: pipeline scripts in `~/sources/kube/pipelines/`, Helm values and workflow changes in the stonks-oracle repo. Verify install order in `runmefirst.sh` matches design (PVs → ARC → ArgoCD → Kargo). Verify teardown order in `runmelast.sh` is reverse (Kargo → ArgoCD → ARC). Ensure all tests pass. Ask the user if questions arise.
## Notes
- Pipeline infrastructure scripts (`~/sources/kube/pipelines/`) are created on gremlin-1, separate from the stonks-oracle repo
- Helm values files (`values-beta.yaml`, `values-paper.yaml`) and the GitHub Actions workflow update are in the stonks-oracle repo
- No property-based tests — this feature is entirely IaC (shell scripts, YAML manifests, Helm values)
- The existing `values.yaml` (production) is not modified — live stage uses it as-is
- PVs use `persistentVolumeReclaimPolicy: Retain` so NFS data survives teardowns
- Break-glass is Kargo's built-in manual approval — no custom code needed (Requirements 12.1–12.5)
- Audit trail is provided by Kargo's native promotion history (Requirements 15.1–15.4)
- Kargo Dashboard features (stage display, promotion controls, block indicators) are provided by the Kargo chart out of the box (Requirements 14.1–14.3, 16.1–16.5)
- Each task references specific requirements for traceability
- Checkpoints ensure incremental validation between major phases
+130 -24
@@ -120,48 +120,119 @@ Wraps each test with timing:
- Outputs a summary table at the end
- Flags any endpoint > 500ms as "slow"
### 6. Runner Script (`tests/integration/run_pipeline.sh`)
### 6. Runner Script (`infra/inttest/run_pipeline.sh`)
Orchestrates the full pipeline:
Standalone orchestration script with a well-defined CLI contract so any CI/CD system (or a human) can invoke it. The future CI/CD pipeline spec will call this script as a stage.
**CLI interface:**
```
Usage: bash infra/inttest/run_pipeline.sh [OPTIONS]

Options:
  --image-tag TAG       Docker image tag to deploy (default: latest)
  --namespace NAME      Override namespace name (default: stonks-inttest-<timestamp>)
  --skip-teardown       Leave namespace running after tests (for debugging)
  --results-file PATH   Path for JSON results output (default: inttest-results.json)

Exit codes:
  0   All tests passed
  1   One or more test failures
  2   Infrastructure setup failure (postgres/redis/minio/services didn't start)
```
**JSON result contract** (`inttest-results.json`):
```json
{
  "run_id": "stonks-inttest-1705312800",
  "image_tag": "abc123",
  "started_at": "2025-01-15T12:00:00Z",
  "completed_at": "2025-01-15T12:07:30Z",
  "exit_code": 0,
  "stages": {
    "infra_deploy": {"duration_s": 45.2, "status": "ok"},
    "seed_data": {"duration_s": 8.1, "status": "ok"},
    "service_deploy": {"duration_s": 32.5, "status": "ok"},
    "integration_tests": {"duration_s": 28.3, "status": "ok"},
    "teardown": {"duration_s": 5.0, "status": "ok"}
  },
  "tests": {
    "total": 41,
    "passed": 41,
    "failed": 0,
    "errors": 0
  },
  "profiling": {
    "endpoints": {
      "/api/companies": {"p50_ms": 12, "p95_ms": 25, "p99_ms": 45},
      ...
    },
    "slow_endpoints": []
  }
}
```
This contract is designed so the future CI/CD pipeline can:
1. Parse `exit_code` to decide whether to promote to the next stage
2. Parse `profiling.slow_endpoints` to flag performance regressions
3. Archive the full JSON as a build artifact
4. Display `tests.passed`/`tests.failed` in a dashboard
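A minimal sketch of how a consumer might read these fields, assuming `jq` is available on the runner (this spec does not require it) and that the results file is valid JSON with the fields shown in the contract. The function name and the summary wording are hypothetical.

```bash
# summarize_results FILE
# Prints a one-line summary and returns 0 only if the run is clean.
summarize_results() {
  local file="$1"
  # Human-readable line for logs or a dashboard.
  jq -r '"\(.tests.passed)/\(.tests.total) passed, \(.tests.failed) failed, \(.profiling.slow_endpoints | length) slow"' "$file"
  # -e makes jq's exit status reflect the boolean result of the filter.
  jq -e '.exit_code == 0 and .tests.failed == 0 and .tests.errors == 0' "$file" >/dev/null
}
```

A pipeline step could call `summarize_results inttest-results.json` and block promotion when it returns non-zero. Slow endpoints are reported but deliberately do not fail the check in this sketch.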
```bash
#!/bin/bash
set -euo pipefail
# Parse CLI args
IMAGE_TAG="latest"
NAMESPACE="stonks-inttest-$(date +%s)"
PROFILING_OUTPUT="inttest-results-${NAMESPACE}.json"
SKIP_TEARDOWN=false
RESULTS_FILE="inttest-results.json"
while [[ $# -gt 0 ]]; do
  case $1 in
    --image-tag) IMAGE_TAG="$2"; shift 2 ;;
    --namespace) NAMESPACE="$2"; shift 2 ;;
    --skip-teardown) SKIP_TEARDOWN=true; shift ;;
    --results-file) RESULTS_FILE="$2"; shift 2 ;;
    *) echo "Unknown option: $1"; exit 2 ;;
  esac
done
# Cleanup function (always runs, even on failure)
cleanup() {
  if [ "$SKIP_TEARDOWN" = false ]; then
    kubectl delete namespace "$NAMESPACE" --wait=false 2>/dev/null || true
  fi
}
trap cleanup EXIT
# Stage 1: Create namespace
kubectl create namespace $NAMESPACE
kubectl create namespace "$NAMESPACE"
# Stage 2: Deploy infra
envsubst < infra/inttest/postgres.yaml | kubectl apply -n $NAMESPACE -f -
envsubst < infra/inttest/redis.yaml | kubectl apply -n $NAMESPACE -f -
envsubst < infra/inttest/minio.yaml | kubectl apply -n $NAMESPACE -f -
kubectl wait --for=condition=ready pod -l app=postgres -n $NAMESPACE --timeout=120s
kubectl wait --for=condition=ready pod -l app=redis -n $NAMESPACE --timeout=60s
kubectl wait --for=condition=ready pod -l app=minio -n $NAMESPACE --timeout=60s
kubectl create configmap postgres-migrations --from-file=infra/migrations/ -n "$NAMESPACE"
export NAMESPACE
envsubst < infra/inttest/postgres.yaml | kubectl apply -n "$NAMESPACE" -f -
envsubst < infra/inttest/redis.yaml | kubectl apply -n "$NAMESPACE" -f -
envsubst < infra/inttest/minio.yaml | kubectl apply -n "$NAMESPACE" -f -
kubectl wait --for=condition=ready pod -l app=postgres -n "$NAMESPACE" --timeout=120s
kubectl wait --for=condition=ready pod -l app=redis -n "$NAMESPACE" --timeout=60s
kubectl wait --for=condition=ready pod -l app=minio -n "$NAMESPACE" --timeout=60s
# Stage 3: Run migrations + seed
kubectl run seed-runner --image=ghcr.io/celesrenata/stonks-oracle/query-api:latest \
-n $NAMESPACE --restart=Never --env="POSTGRES_HOST=postgres" ... \
-- python -c "import asyncio; from tests.integration.seed_sandbox import seed; asyncio.run(seed())"
kubectl wait --for=condition=complete job/seed-runner -n $NAMESPACE --timeout=120s
# Stage 3: Seed data (run from a pod with DB access)
# ... seed runner pod ...
# Stage 4: Deploy services
envsubst < infra/inttest/services.yaml | kubectl apply -n $NAMESPACE -f -
kubectl wait --for=condition=ready pod -l tier=api -n $NAMESPACE --timeout=120s
# Stage 4: Deploy services (using specified image tag)
envsubst < infra/inttest/services.yaml | sed "s/:latest/:${IMAGE_TAG}/g" | kubectl apply -n "$NAMESPACE" -f -
kubectl wait --for=condition=ready pod -l tier=api -n "$NAMESPACE" --timeout=120s
# Stage 5: Run integration tests
kubectl run test-runner --image=ghcr.io/celesrenata/stonks-oracle/query-api:latest \
-n $NAMESPACE --restart=Never \
-- python -m pytest tests/integration/ -v --tb=short
envsubst < infra/inttest/runner.yaml | sed "s/:latest/:${IMAGE_TAG}/g" | kubectl apply -n "$NAMESPACE" -f -
kubectl wait --for=condition=complete job/inttest-runner -n "$NAMESPACE" --timeout=600s
# Stage 6: Collect results
kubectl logs job/test-runner -n $NAMESPACE > $PROFILING_OUTPUT
kubectl logs job/inttest-runner -n "$NAMESPACE" > "$RESULTS_FILE"
# Stage 7: Teardown
kubectl delete namespace $NAMESPACE --wait=false
# Stage 7: Teardown (handled by trap)
```
## Profiling Strategy
@@ -217,3 +288,38 @@ CREATE namespace
→ Collect results
→ DELETE namespace (always, even on failure)
```
## Integration Contract for Future CI/CD Pipeline
This spec produces a standalone runner (`infra/inttest/run_pipeline.sh`) with a well-defined contract. A future spec ("CI/CD Deployment Pipeline") will consume it as one stage in a larger pipeline:
```
┌─────────────────────────────────────────────────────────────────────────┐
│  Future CI/CD Pipeline (separate spec)                                  │
│                                                                         │
│  1. Git push → webhook to self-hosted runner on gremlin nodes           │
│  2. Lint + Unit Tests (ruff, pytest, vitest)                            │
│  3. Docker Build → push to GHCR (self-hosted, no GH Actions compute)    │
│  4. ┌──────────────────────────────────────────────────────────┐        │
│     │  Integration Tests (THIS SPEC)                           │        │
│     │  bash infra/inttest/run_pipeline.sh --image-tag $SHA     │        │
│     │  → reads inttest-results.json                            │        │
│     │  → exit code 0 = promote, 1 = block                      │        │
│     └──────────────────────────────────────────────────────────┘        │
│  5. Promote to beta namespace (if tests pass)                           │
│  6. Promote to paper namespace (manual gate or auto)                    │
│  7. Promote to live namespace (market-hours blocker + break-glass)      │
│                                                                         │
│  Each stage has enable/disable toggle.                                  │
│  Promotions blocked during market hours (9:30–16:00 ET) unless          │
│  break-glass is activated.                                              │
└─────────────────────────────────────────────────────────────────────────┘
```
**What this spec provides to the future pipeline:**
- `infra/inttest/run_pipeline.sh` — callable with `--image-tag` to test any build
- `inttest-results.json` — machine-readable results for promotion decisions
- Exit codes for pass/fail gating
- `--skip-teardown` for debugging failed runs
- All K8s manifests in `infra/inttest/` for sandbox lifecycle
- Deterministic seed data and comprehensive API test coverage
@@ -5,16 +5,23 @@ End-to-end integration test pipeline that runs in Kubernetes, spinning up isolat
## Functional Requirements
### FR-1: Pipeline Stages
1. **Lint** — ruff check on Python, eslint on frontend
2. **Unit Tests** — pytest + vitest against local mocks
3. **Build** — Docker images for all services + dashboard
4. **Deploy Sandbox** — ephemeral namespace with own PostgreSQL, Redis, MinIO (no Ollama — too heavy for CI)
5. **Seed Data** — populate DB and S3 with enough data to exercise every frontend component
6. **Integration Tests** — HTTP-level validation of every API endpoint the frontend depends on
7. **Frontend E2E** — render every page against the live sandbox APIs, assert no errors and expected data
8. **Profiling** — measure and report timing for each pipeline stage and each API endpoint
9. **Teardown** — delete the ephemeral namespace and all resources
### FR-1: Integration Test Stages
This spec covers the **integration test foundation** — sandbox infra, seed data, test suites, profiling, and a standalone runner script. A separate CI/CD pipeline spec will consume this foundation to provide build, staged promotion (beta → paper → live), market-hours gating, and break-glass deployment.
Stages owned by this spec:
1. **Deploy Sandbox** — ephemeral namespace with own PostgreSQL, Redis, MinIO (no Ollama — too heavy for CI)
2. **Seed Data** — populate DB and S3 with enough data to exercise every frontend component
3. **Integration Tests** — HTTP-level validation of every API endpoint the frontend depends on
4. **Frontend Data Deps** — verify every page's API dependencies return valid data
5. **Profiling** — measure and report timing for each stage and each API endpoint
6. **Teardown** — delete the ephemeral namespace and all resources
Stages deferred to the CI/CD pipeline spec:
- Lint, unit tests, Docker image builds (self-hosted on gremlin nodes)
- Staged promotion: beta → paper → live namespaces
- Market-hours promotion blockers (no deploys during 9:30–16:00 ET unless break-glass)
- Break-glass emergency production deploy
- Per-stage enable/disable toggles
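A minimal sketch of the market-hours blocker, assuming the check is a plain time-window test. The function names, the half-open window, and the weekend handling are assumptions made for illustration; the CI/CD pipeline spec may define them differently.

```python
from datetime import datetime, time
from zoneinfo import ZoneInfo

MARKET_OPEN = time(9, 30)
MARKET_CLOSE = time(16, 0)


def current_eastern_time() -> datetime:
    """Return the current wall-clock time in US Eastern time."""
    return datetime.now(tz=ZoneInfo("America/New_York"))


def promotion_blocked(now_et: datetime, break_glass: bool = False) -> bool:
    """Return True if a promotion should be refused at `now_et`.

    `now_et` must already be expressed in Eastern time.
    """
    if break_glass:
        # Break-glass overrides the blocker entirely.
        return False
    # Assumption: weekends are never blocked; market holidays are not modelled.
    if now_et.weekday() >= 5:
        return False
    # Assumption: the window is half-open, [09:30, 16:00).
    return MARKET_OPEN <= now_et.time() < MARKET_CLOSE
```

A caller would typically evaluate `promotion_blocked(current_eastern_time())` immediately before promoting, so the decision reflects the time of the deploy and not the time the pipeline started.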
### FR-2: Sandbox Infrastructure
- PostgreSQL 16 (ephemeral, no persistent volume)
@@ -72,5 +79,15 @@ Target: full pipeline completes in under 10 minutes. Seed data insertion under 3
### NFR-3: Reproducibility
Seed data is deterministic (fixed UUIDs, timestamps). No external API calls (Polygon, Alpaca). All data is synthetic.
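One way to get fixed UUIDs without hard-coding every value is to derive them from a natural key with `uuid.uuid5`. The sketch below is an assumption about how a seed script could do this, not a description of `seed_sandbox.py`; the namespace value, the `seed_id` helper, the record fields, and the tickers are all hypothetical.

```python
import uuid
from datetime import datetime, timezone
from typing import Dict, Iterable, List

# Hypothetical namespace UUID; any fixed value works as long as it never changes.
SEED_NAMESPACE = uuid.UUID("00000000-0000-0000-0000-000000000001")
# Fixed timestamp used in place of the wall clock.
SEED_EPOCH = datetime(2026, 1, 1, tzinfo=timezone.utc)


def seed_id(kind: str, key: str) -> uuid.UUID:
    """Derive a stable UUID from an entity kind and a natural key."""
    return uuid.uuid5(SEED_NAMESPACE, f"{kind}:{key}")


def seed_companies(tickers: Iterable[str]) -> List[Dict[str, str]]:
    """Build company rows whose IDs and timestamps are identical on every run."""
    return [
        {
            "id": str(seed_id("company", ticker)),
            "ticker": ticker,
            "created_at": SEED_EPOCH.isoformat(),
        }
        for ticker in tickers
    ]
```

Because the IDs depend only on the namespace and the key, test fixtures can recompute them instead of importing a shared table of literals.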
### NFR-4: CI Integration
Pipeline can be triggered from GitHub Actions as a separate workflow, or manually via `kubectl apply`.
### NFR-4: Pipeline Integration Contract
The runner script is a standalone tool that can be invoked by any CI/CD system. It exposes:
- **CLI interface**: `bash infra/inttest/run_pipeline.sh [--image-tag TAG] [--namespace NAME] [--skip-teardown]`
- **Exit codes**: 0 = all tests passed, 1 = test failures, 2 = infra setup failure
- **JSON result file**: `inttest-results.json` with test counts, pass/fail, per-endpoint latency, stage timings
- **stdout/stderr**: human-readable progress and summary
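The contract above could be consumed from Python roughly as follows. This is a hedged sketch: the script path, flags, and exit codes come from the contract, while the function names and the decision labels (`"promote"`, `"block"`, `"infra-error"`, `"unknown"`) are hypothetical.

```python
import subprocess
from typing import List, Optional

SCRIPT = "infra/inttest/run_pipeline.sh"


def build_command(
    image_tag: Optional[str] = None,
    namespace: Optional[str] = None,
    skip_teardown: bool = False,
) -> List[str]:
    """Assemble the runner invocation from the documented CLI flags."""
    cmd = ["bash", SCRIPT]
    if image_tag:
        cmd += ["--image-tag", image_tag]
    if namespace:
        cmd += ["--namespace", namespace]
    if skip_teardown:
        cmd.append("--skip-teardown")
    return cmd


def interpret_exit_code(code: int) -> str:
    """Map the runner's exit code to a (hypothetical) pipeline decision."""
    if code == 0:
        return "promote"      # all tests passed
    if code == 1:
        return "block"        # test failures
    if code == 2:
        return "infra-error"  # sandbox setup failed; tests did not decide
    return "unknown"          # outside the documented contract


def run_integration_stage(image_tag: str) -> str:
    """Run the integration tests for one build and return the decision."""
    result = subprocess.run(build_command(image_tag=image_tag), check=False)
    return interpret_exit_code(result.returncode)
```

Keeping exit code 2 distinct from 1 lets a calling pipeline retry infrastructure failures without treating them as a failed build.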
A future CI/CD pipeline spec will invoke this script as a stage, passing in the image tag from a self-hosted build step. That spec will handle:
- Self-hosted build runners on gremlin nodes (no GitHub Actions compute)
- Staged promotion (beta → paper → live) with per-stage enable/disable
- Market-hours promotion blockers (9:30–16:00 ET)
- Break-glass emergency deploy to production
@@ -1,31 +1,30 @@
# Integration Test Pipeline — Tasks
## Phase 1: Sandbox Infrastructure Manifests
- [ ] 1. Create `infra/inttest/postgres.yaml` — PostgreSQL 16 Deployment with migrations as init container, no PV
- [ ] 2. Create `infra/inttest/redis.yaml` — Redis 7 Deployment, no persistence
- [ ] 3. Create `infra/inttest/minio.yaml` — MinIO Deployment + bucket init Job
- [ ] 4. Create `infra/inttest/services.yaml` — query-api, symbol-registry, risk, trading-engine Deployments pointing at sandbox infra
- [ ] 5. Create `infra/inttest/runner.yaml` — test runner Job template
- [x] 1. Create `infra/inttest/postgres.yaml` — PostgreSQL 16 Deployment with migrations as init container, no PV
- [x] 2. Create `infra/inttest/redis.yaml` — Redis 7 Deployment, no persistence
- [x] 3. Create `infra/inttest/minio.yaml` — MinIO Deployment + bucket init Job
- [x] 4. Create `infra/inttest/services.yaml` — query-api, symbol-registry, risk, trading-engine Deployments pointing at sandbox infra
- [x] 5. Create `infra/inttest/runner.yaml` — test runner Job template
## Phase 2: Seed Data
- [ ] 6. Create `tests/integration/seed_sandbox.py` — deterministic seed script with fixed UUIDs for 5 companies, 10 documents, 5 trends, 5 recommendations, 3 orders, 2 positions, 2 global events, 2 competitive signals, 3 agents, trading config, portfolio snapshot
- [ ] 7. Create `tests/integration/seed_minio.py` — seed MinIO buckets with sample normalized text files
- [x] 6. Create `tests/integration/seed_sandbox.py` — deterministic seed script with fixed UUIDs for 5 companies, 10 documents, 5 trends, 5 recommendations, 3 orders, 2 positions, 2 global events, 2 competitive signals, 3 agents, trading config, portfolio snapshot
- [x] 7. Create `tests/integration/seed_minio.py` — seed MinIO buckets with sample normalized text files
## Phase 3: API Integration Tests
- [ ] 8. Create `tests/integration/conftest.py` — pytest fixtures for HTTP client, base URLs, seed IDs
- [ ] 9. Create `tests/integration/test_query_api.py` — tests for all 17 query API endpoints
- [ ] 10. Create `tests/integration/test_registry_api.py` — tests for all 8 symbol registry endpoints
- [ ] 11. Create `tests/integration/test_risk_api.py` — tests for all 4 risk engine endpoints
- [ ] 12. Create `tests/integration/test_trading_api.py` — tests for all 12 trading engine endpoints
- [ ] 13. Create `tests/integration/test_frontend_data_deps.py` — tests verifying every frontend page's API dependencies return valid data
- [x] 8. Create `tests/integration/conftest.py` — pytest fixtures for HTTP client, base URLs, seed IDs
- [x] 9. Create `tests/integration/test_query_api.py` — tests for all 17 query API endpoints
- [x] 10. Create `tests/integration/test_registry_api.py` — tests for all 8 symbol registry endpoints
- [x] 11. Create `tests/integration/test_risk_api.py` — tests for all 4 risk engine endpoints
- [x] 12. Create `tests/integration/test_trading_api.py` — tests for all 12 trading engine endpoints
- [x] 13. Create `tests/integration/test_frontend_data_deps.py` — tests verifying every frontend page's API dependencies return valid data
## Phase 4: Profiling
- [ ] 14. Create `tests/integration/profiler.py` — timing wrapper that records per-endpoint latency and produces a summary report
- [ ] 15. Add profiling output to test runner (JSON report with P50/P95/P99 per endpoint, stage timings)
- [x] 14. Create `tests/integration/profiler.py` — timing wrapper that records per-endpoint latency and produces a summary report
- [x] 15. Add profiling output to test runner (JSON report with P50/P95/P99 per endpoint, stage timings)
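A possible shape for the P50/P95/P99 summary, assuming the nearest-rank percentile method. The tasks above do not say which method `profiler.py` uses, so the method, the function names, and the output keys here are assumptions.

```python
import math
from typing import Dict, List, Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    `pct` percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("percentile() requires at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(samples)
    # 1-based rank; multiply before dividing to keep exact cases exact.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def summarize(latencies_ms: Dict[str, List[float]]) -> Dict[str, Dict[str, float]]:
    """Produce a per-endpoint P50/P95/P99 summary from raw latency samples."""
    return {
        endpoint: {
            "p50": percentile(values, 50),
            "p95": percentile(values, 95),
            "p99": percentile(values, 99),
        }
        for endpoint, values in latencies_ms.items()
    }
```

Nearest-rank always returns an observed sample, which keeps the report easy to check against raw timings. With only a handful of requests per endpoint, P95 and P99 will usually equal the maximum.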
## Phase 5: Pipeline Runner
- [ ] 16. Create `infra/inttest/run_pipeline.sh` — orchestration script that creates namespace, deploys infra, seeds, deploys services, runs tests, collects results, tears down
- [ ] 17. Create `.github/workflows/integration.yml` — GitHub Actions workflow that triggers the pipeline on demand or on PR
- [x] 16. Create `infra/inttest/run_pipeline.sh` — standalone orchestration script with CLI args (`--image-tag`, `--namespace`, `--skip-teardown`, `--results-file`), exit codes (0=pass, 1=fail, 2=infra error), JSON result output; creates namespace, deploys infra, seeds, deploys services, runs tests, collects results, tears down
## Phase 6: Documentation
- [ ] 18. Add integration test section to `docs/LOCAL_DEV_SETUP.md` with instructions for running locally
- [x] 17. Add integration test section to `docs/LOCAL_DEV_SETUP.md` with instructions for running locally, CLI usage, JSON result contract, and a note that a future CI/CD pipeline spec will consume this runner