# CI/CD Pipeline — Design

## Overview

This design describes a full CI/CD pipeline for the Stonks Oracle platform built on three Kubernetes-native tools: GitHub Actions Runner Controller (ARC) for self-hosted CI runners, ArgoCD for GitOps-based deployment, and Kargo for staged promotion orchestration. The pipeline replaces GitHub-hosted runners with ephemeral pods on the existing 4-node NixOS Gremlin cluster, routes built images through five stages (CI → Integration Test → Beta → Paper → Live), and enforces market-hours promotion blockers with a break-glass emergency override.

All pipeline infrastructure scripts and manifests live in `~/sources/kube/pipelines/` on gremlin-1 — fully separate from the application's `~/sources/kube/stonks-oracle/` deployment scripts. Pipeline state persists on NFS volumes at `nfs://192.168.42.8:/volume1/Kubernetes/pipelines` so that ArgoCD configs, Kargo promotion history, and ARC data survive cluster teardowns and rebuilds.

### Key Design Decisions

1. **ARC with Kubernetes mode (not Docker-in-Docker)** — Runner pods use `containerMode.type: kubernetes`, so job containers and container-based steps run in separate pods created through the Kubernetes API. This avoids the security and complexity overhead of DinD while leveraging the cluster's existing container runtime. Docker builds use `docker/build-push-action` with Buildx, which can build through its `kubernetes` driver instead of a local Docker daemon.

2. **One ArgoCD Application per stage** — Beta, Paper, and Live each get their own ArgoCD Application resource pointing at the same Helm chart (`infra/helm/stonks-oracle/`) but with different values files (`values-beta.yaml`, `values-paper.yaml`, `values.yaml`). This keeps stage configs independent and auditable.

3. **Kargo Image Updater pattern** — A single Kargo Warehouse watches the GHCR image repository for new tags. Kargo Stages (beta → paper → live) form a linear promotion DAG. Each Stage's promotion template updates the image tag in the corresponding ArgoCD Application and triggers a sync.

4. **Market-hours blocker via Kargo AnalysisTemplate** — Kargo verification steps check Eastern Time before allowing promotions to Paper and Live stages. Break-glass is implemented via Kargo's manual approval with required notes, bypassing the verification gate.

5. **NFS static provisioning with Retain policy** — PVs are created manually by `runmefirst.sh` with `persistentVolumeReclaimPolicy: Retain`. The teardown script (`runmelast.sh`) deletes Helm releases and namespaces but leaves PVs and NFS data intact.

6. **Install order: PVs → ARC → ArgoCD → Kargo** — `runmefirst.sh` creates PVs first (they're cluster-scoped), then installs each tool via Helm in dependency order. Kargo depends on ArgoCD being present.

## Architecture

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                         Gremlin Cluster (4x NixOS)                          │
│                                                                             │
│  ┌─────────────────┐  ┌──────────────────┐  ┌───────────────────────────┐   │
│  │ arc-system ns   │  │ argocd ns        │  │ kargo ns                  │   │
│  │                 │  │                  │  │                           │   │
│  │ ARC Controller  │  │ ArgoCD Server    │  │ Kargo Controller          │   │
│  │ Runner ScaleSet │  │ Repo Server      │  │ Kargo Dashboard           │   │
│  │ (ephemeral pods)│  │ App Controller   │  │ (stonks-kargo.            │   │
│  │                 │  │ (stonks-argocd.  │  │  celestium.life)          │   │
│  │                 │  │  celestium.life) │  │                           │   │
│  └─────────────────┘  └──────────────────┘  └───────────────────────────┘   │
│                                                                             │
│  ┌─────────────────┐  ┌──────────────────┐  ┌───────────────────────────┐   │
│  │ stonks-beta ns  │  │ stonks-paper ns  │  │ stonks-oracle ns          │   │
│  │                 │  │                  │  │ (live/production)         │   │
│  │ ArgoCD App:     │  │ ArgoCD App:      │  │ ArgoCD App:               │   │
│  │   stonks-beta   │  │   stonks-paper   │  │   stonks-live             │   │
│  │ values-beta.yaml│  │ values-paper.    │  │ values.yaml               │   │
│  │ (mock broker)   │  │   yaml           │  │ (production broker)       │   │
│  │                 │  │ (paper broker)   │  │                           │   │
│  └─────────────────┘  └──────────────────┘  └───────────────────────────┘   │
│                                                                             │
│ NFS PVs: nfs://192.168.42.8:/volume1/Kubernetes/pipelines/{argocd,kargo,arc}│
└─────────────────────────────────────────────────────────────────────────────┘
```

### Promotion Flow

```mermaid
graph LR
    A[Git Push to main] --> B[CI: Lint + Test<br/>ARC self-hosted runner]
    B --> C[CI: Build + Push<br/>all images to GHCR]
    C --> D[Integration Tests<br/>run_pipeline.sh]
    D -->|pass| E[Kargo Warehouse<br/>detects new image tag]
    D -->|fail| X[❌ Blocked]
    E --> F[Beta Stage<br/>auto-promote]
    F --> G{Market Hours?}
    G -->|outside hours| H[Paper Stage<br/>manual promote]
    G -->|during hours| I[🚫 Blocked<br/>break-glass available]
    I -->|break-glass| H
    H --> J{Market Hours?}
    J -->|outside hours| K[Live Stage<br/>manual approve + notes]
    J -->|during hours| L[🚫 Blocked<br/>break-glass available]
    L -->|break-glass| K
```

## Components and Interfaces

### 1. Pipeline Scripts (`~/sources/kube/pipelines/`)

```
~/sources/kube/pipelines/
├── runmefirst.sh              # Full install: PVs → ARC → ArgoCD → Kargo
├── runmelast.sh               # Teardown: Kargo → ArgoCD → ARC (preserves PVs + NFS data)
├── pvs/
│   ├── argocd-pv.yaml         # NFS PV for ArgoCD server data
│   ├── kargo-pv.yaml          # NFS PV for Kargo data
│   └── arc-pv.yaml            # NFS PV for ARC runner data
├── arc/
│   ├── values.yaml            # ARC controller Helm values
│   └── runner-scaleset.yaml   # RunnerScaleSet CR for stonks-oracle repo
├── argocd/
│   ├── values.yaml            # ArgoCD Helm values (ingress, TLS, persistence)
│   ├── apps/
│   │   ├── stonks-beta.yaml   # ArgoCD Application for beta
│   │   ├── stonks-paper.yaml  # ArgoCD Application for paper
│   │   └── stonks-live.yaml   # ArgoCD Application for live
│   └── repo-secret.yaml       # Git repo credentials for ArgoCD
├── kargo/
│   ├── values.yaml            # Kargo Helm values (ingress, TLS, persistence)
│   ├── project.yaml           # Kargo Project: stonks-oracle
│   ├── warehouse.yaml         # Kargo Warehouse watching GHCR
│   ├── stages/
│   │   ├── beta.yaml          # Kargo Stage: beta (auto-promote)
│   │   ├── paper.yaml         # Kargo Stage: paper (market-hours gate)
│   │   └── live.yaml          # Kargo Stage: live (manual approval + market-hours gate)
│   └── project-config.yaml    # ProjectConfig: auto-promotion settings
└── helm-values/
    ├── values-beta.yaml       # Helm overrides for beta stage
    └── values-paper.yaml      # Helm overrides for paper stage
```
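
A preflight check can catch a missing manifest before `runmefirst.sh` gets halfway through an install. The sketch below assumes the layout shown above; `REQUIRED_FILES` and `check_layout` are hypothetical names, not part of the existing scripts, and the list covers only single-file entries from the tree.

```shell
# Hypothetical preflight list: single-file entries from the layout above,
# relative to the pipelines directory.
REQUIRED_FILES="
pvs/argocd-pv.yaml
pvs/kargo-pv.yaml
pvs/arc-pv.yaml
arc/values.yaml
arc/runner-scaleset.yaml
argocd/values.yaml
argocd/repo-secret.yaml
kargo/values.yaml
kargo/project.yaml
kargo/project-config.yaml
kargo/warehouse.yaml
"

# Report every missing file, then return non-zero if any were missing.
check_layout() {
    local root=$1 missing=0 f
    for f in $REQUIRED_FILES; do
        if [ ! -f "$root/$f" ]; then
            echo "missing: $f" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

Calling `check_layout ~/sources/kube/pipelines` at the top of the install script would stop the run before any cluster state is touched.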

### 2. ARC — GitHub Actions Runner Controller

**Namespace:** `arc-system`

**Components:**
- **ARC Controller** — Installed via the `oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set-controller` Helm chart. Manages runner scale sets and provisions runner pods as jobs are queued. In scale-set mode a listener polls GitHub for job assignments, so no inbound webhook endpoint is required.
- **Runner ScaleSet** — Installed via the `oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set` Helm chart. Configured for the `celesrenata/stonks-oracle` repository with the label `self-hosted-gremlin`.

**Runner Pod Configuration:**
- Ephemeral: each job gets a fresh pod, destroyed on completion
- Kubernetes mode (`containerMode.type: kubernetes`): job containers and container-based steps run in separate pods
- Resource limits: 2 CPU, 4Gi memory per runner pod
- Docker Buildx support via `docker/setup-buildx-action` (uses Kubernetes builder)
- GitHub App or PAT authentication stored in a Kubernetes Secret

**Interface with CI workflow:**
- The existing `.github/workflows/build.yml` is updated to use `runs-on: self-hosted-gremlin` instead of `runs-on: ubuntu-latest`
- All existing build steps remain unchanged — only the runner label changes

### 3. ArgoCD — GitOps Deployment Controller

**Namespace:** `argocd`

**Components:**
- **ArgoCD Server** — Web UI and API, exposed via Traefik ingress at `stonks-argocd.celestium.life` with TLS via `ca-issuer`
- **Repo Server** — Clones Git repos and renders Helm templates
- **Application Controller** — Watches ArgoCD Application resources and syncs cluster state

**ArgoCD Applications (one per stage):**

| Application | Namespace | Values File | Sync Policy |
|---|---|---|---|
| `stonks-beta` | `stonks-beta` | `values-beta.yaml` | Auto-sync (Kargo triggers) |
| `stonks-paper` | `stonks-paper` | `values-paper.yaml` | Auto-sync (Kargo triggers) |
| `stonks-live` | `stonks-oracle` | `values.yaml` | Auto-sync (Kargo triggers) |

Each Application points at the same Helm chart (`infra/helm/stonks-oracle/`) in the `celesrenata/stonks-oracle` Git repository but uses a different values file. The `image.tag` parameter is overridden by Kargo during promotion.

**Application Resource Structure:**
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: stonks-beta
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/celesrenata/stonks-oracle.git
    targetRevision: main
    path: infra/helm/stonks-oracle
    helm:
      valueFiles:
        - values-beta.yaml
      parameters:
        - name: image.tag
          value: latest  # Overridden by Kargo during promotion
  destination:
    server: https://kubernetes.default.svc
    namespace: stonks-beta
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```
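
Since the three Applications differ only in name, namespace, and values file, the mapping can be kept in one place. A minimal sketch, assuming the stage names `beta`, `paper`, and `live`; `stage_config` and `render_stage_cmd` are hypothetical helpers, and the values-file path assumes the files sit next to the chart, as the `valueFiles` entry implies. The second function only prints a `helm template` command so it can be reviewed before it is run.

```shell
# Print "<argocd-app> <namespace> <values-file>" for a stage name.
stage_config() {
    case $1 in
        beta)  echo "stonks-beta stonks-beta values-beta.yaml" ;;
        paper) echo "stonks-paper stonks-paper values-paper.yaml" ;;
        live)  echo "stonks-live stonks-oracle values.yaml" ;;
        *)     echo "unknown stage: $1" >&2; return 1 ;;
    esac
}

# Print (not run) the helm command that renders the chart for one stage.
render_stage_cmd() {
    local cfg
    cfg=$(stage_config "$1") || return 1
    # Split the three fields into this function's positional parameters.
    set -- $cfg
    echo "helm template $1 infra/helm/stonks-oracle -n $2 -f infra/helm/stonks-oracle/$3"
}

render_stage_cmd beta
# helm template stonks-beta infra/helm/stonks-oracle -n stonks-beta -f infra/helm/stonks-oracle/values-beta.yaml
```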

### 4. Kargo — Promotion Orchestration

**Namespace:** `kargo`

**Components:**
- **Kargo Controller** — Watches Warehouse, Stage, and Promotion resources
- **Kargo Dashboard** — Web UI at `stonks-kargo.celestium.life` with TLS via `ca-issuer`. Provides visual promotion management, stage status, and audit history.

**Kargo Resources:**

#### Warehouse
Watches the GHCR image repository for new image tags. Produces Freight resources for each new tag discovered.

```yaml
apiVersion: kargo.akuity.io/v1alpha1
kind: Warehouse
metadata:
  name: stonks-images
  namespace: stonks-oracle  # Kargo project namespace
spec:
  subscriptions:
    - image:
        repoURL: ghcr.io/celesrenata/stonks-oracle/query-api
        # Image tags are git SHAs, which are not semantic versions, so the
        # newest build is selected instead of applying a SemVer constraint.
        imageSelectionStrategy: NewestBuild
```

#### Stages (Linear DAG)

```
Warehouse: stonks-images
    │
    ▼
Stage: beta (auto-promote, no market-hours gate)
    │
    ▼
Stage: paper (manual promote, market-hours verification)
    │
    ▼
Stage: live (manual approval + notes, market-hours verification)
```

Each Stage's promotion template:
1. Clones the Git repo
2. Updates `image.tag` in the stage-specific values file (or uses `argocd-update` step)
3. Triggers the ArgoCD Application to sync

#### Market-Hours Verification

Paper and Live stages include a verification step that checks whether the current time falls within US market hours (09:30–16:00 ET, Mon–Fri). If it does, the promotion is blocked unless the operator uses Kargo's manual approval (break-glass) with a required justification note.

This is implemented as a Kargo verification step using an `AnalysisTemplate` that runs a lightweight container to check the current Eastern Time:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: market-hours-check
  namespace: stonks-oracle
spec:
  metrics:
    - name: outside-market-hours
      provider:
        job:
          spec:
            template:
              spec:
                containers:
                  - name: check
                    image: alpine:3.19
                    command: [sh, -c]
                    args:
                      - |
                        apk add --no-cache tzdata
                        export TZ=America/New_York
                        DOW=$(date +%u)  # 1=Mon, 7=Sun
                        HOUR=$(date +%H)
                        MIN=$(date +%M)
                        # Strip the zero padding: "08" and "09" would otherwise
                        # be rejected as invalid octal inside $(( )).
                        HOUR=${HOUR#0}
                        MIN=${MIN#0}
                        TIME_MIN=$((HOUR * 60 + MIN))
                        MARKET_OPEN=570   # 09:30
                        MARKET_CLOSE=960  # 16:00
                        if [ "$DOW" -ge 6 ]; then
                          echo "Weekend — promotions allowed"
                          exit 0
                        fi
                        if [ "$TIME_MIN" -lt "$MARKET_OPEN" ] || [ "$TIME_MIN" -ge "$MARKET_CLOSE" ]; then
                          echo "Outside market hours — promotions allowed"
                          exit 0
                        fi
                        echo "Market hours active ($(date)) — promotion blocked"
                        exit 1
                restartPolicy: Never
```
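
The same decision logic can be pulled into a function that takes the day and time as arguments, which makes the boundaries easy to exercise without waiting for the clock. This is a sketch, not the deployed check: `promotion_gate` is a hypothetical name, and it strips the zero padding that `date +%H` and `date +%M` produce, since shell arithmetic treats a leading zero as octal and rejects `08` and `09`.

```shell
# Hypothetical helper. Arguments: day of week (1=Mon .. 7=Sun), hour, minute,
# all already expressed in Eastern Time. Prints "allowed" or "blocked".
promotion_gate() {
    local dow=$1
    local hour=${2#0}      # "09" -> "9", "00" -> "0"
    local min=${3#0}
    hour=${hour:-0}        # guard against an empty value after stripping
    min=${min:-0}
    local time_min=$((hour * 60 + min))
    local market_open=570  # 09:30
    local market_close=960 # 16:00

    if [ "$dow" -ge 6 ]; then
        echo "allowed"     # weekend
    elif [ "$time_min" -lt "$market_open" ] || [ "$time_min" -ge "$market_close" ]; then
        echo "allowed"     # before the open, or at/after the close
    else
        echo "blocked"     # weekday, 09:30 <= t < 16:00
    fi
}

# Boundary cases on a Wednesday (dow = 3):
promotion_gate 3 09 29   # allowed
promotion_gate 3 09 30   # blocked
promotion_gate 3 15 59   # blocked
promotion_gate 3 16 00   # allowed
```

In a container the arguments would come from `date`, for example `promotion_gate "$(date +%u)" "$(date +%H)" "$(date +%M)"` with `TZ=America/New_York` exported.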

#### Break-Glass Mechanism

Kargo's built-in manual approval flow serves as the break-glass mechanism. When a promotion is blocked by the market-hours verification:

1. The operator clicks "Approve" in the Kargo Dashboard
2. A confirmation dialog appears requiring a justification note
3. The approval bypasses the verification gate for that single Freight/Stage combination
4. The approval, operator identity, timestamp, and justification are recorded in Kargo's audit trail
5. Subsequent promotions still require passing the market-hours check (the override is not sticky)

### 5. Updated GitHub Actions Workflow

The existing `.github/workflows/build.yml` is updated with:

1. **Runner label change**: `runs-on: ubuntu-latest` → `runs-on: self-hosted-gremlin`
2. **New integration test job**: After image builds, a new `integration-test` job invokes `bash infra/inttest/run_pipeline.sh --image-tag ${{ github.sha }} --results-file inttest-results.json`
3. **Artifact upload**: The `inttest-results.json` is uploaded as a build artifact
4. **Gate logic**: If integration tests fail, the workflow fails and Kargo will not see the new image tag as verified. Because images are pushed to GHCR before this job runs, the gate only holds if the Warehouse ignores tags whose integration tests have not passed.

```yaml
integration-test:
  needs: [build-services, build-dashboard]
  if: github.event_name == 'push' && github.ref == 'refs/heads/main'
  runs-on: self-hosted-gremlin
  steps:
    - uses: actions/checkout@v5
    - name: Run integration tests
      run: |
        bash infra/inttest/run_pipeline.sh \
          --image-tag ${{ github.sha }} \
          --results-file inttest-results.json
    - name: Upload results
      if: always()
      uses: actions/upload-artifact@v4
      with:
        name: inttest-results
        path: inttest-results.json
```

### 6. Helm Values Strategy

**values-beta.yaml** (lighter resources, mock broker, no external API keys):
```yaml
image:
  tag: latest  # Overridden by Kargo

config:
  BROKER_MODE: "mock"
  BROKER_PROVIDER: "mock"
  LOG_LEVEL: "DEBUG"
  TRADING_ENABLED: "false"

services:
  ingestion:
    replicas: 1
  parser:
    replicas: 1
  aggregation:
    replicas: 1
```

**values-paper.yaml** (paper broker credentials, Alpaca paper API):
```yaml
image:
  tag: latest  # Overridden by Kargo

config:
  BROKER_MODE: "paper"
  BROKER_PROVIDER: "alpaca"
  LOG_LEVEL: "INFO"
  TRADING_ENABLED: "true"

secrets:
  broker:
    BROKER_BASE_URL: "https://paper-api.alpaca.markets"
```

**values.yaml** (production — existing, unchanged):
- Uses live broker credentials
- Full replica counts
- Production resource limits

### 7. NFS Persistent Volumes

Three PVs with static provisioning, all using `persistentVolumeReclaimPolicy: Retain`:

| PV Name | NFS Path | Capacity | Bound To |
|---|---|---|---|
| `pipeline-argocd-pv` | `/volume1/Kubernetes/pipelines/argocd` | 5Gi | PVC in `argocd` ns |
| `pipeline-kargo-pv` | `/volume1/Kubernetes/pipelines/kargo` | 2Gi | PVC in `kargo` ns |
| `pipeline-arc-pv` | `/volume1/Kubernetes/pipelines/arc` | 2Gi | PVC in `arc-system` ns |

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pipeline-argocd-pv
  labels:
    app: pipeline-argocd
spec:
  capacity:
    storage: 5Gi
  accessModes: [ReadWriteOnce]
  persistentVolumeReclaimPolicy: Retain
  nfs:
    server: 192.168.42.8
    path: /volume1/Kubernetes/pipelines/argocd
```
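
The three PV manifests differ only in component name and capacity, so they could also be generated from one template. A sketch under that assumption; `render_pv` and `render_all_pvs` are hypothetical helpers, and the names, paths, and capacities are copied from the table above.

```shell
# Hypothetical generator. Usage: render_pv <component> <capacity>
render_pv() {
    local component=$1 capacity=$2
    cat <<EOF
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pipeline-${component}-pv
  labels:
    app: pipeline-${component}
spec:
  capacity:
    storage: ${capacity}
  accessModes: [ReadWriteOnce]
  persistentVolumeReclaimPolicy: Retain
  nfs:
    server: 192.168.42.8
    path: /volume1/Kubernetes/pipelines/${component}
EOF
}

# Emit all three PVs as one multi-document YAML stream.
render_all_pvs() {
    render_pv argocd 5Gi
    echo "---"
    render_pv kargo 2Gi
    echo "---"
    render_pv arc 2Gi
}
```

Piping `render_all_pvs` into `kubectl apply -f -` would be equivalent to applying the three files in `pvs/`.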

### 8. runmefirst.sh — Install Orchestration

```bash
#!/bin/bash
set -euo pipefail

# 1. Create namespaces
kubectl create namespace arc-system --dry-run=client -o yaml | kubectl apply -f -
kubectl create namespace argocd --dry-run=client -o yaml | kubectl apply -f -
kubectl create namespace kargo --dry-run=client -o yaml | kubectl apply -f -
kubectl create namespace stonks-beta --dry-run=client -o yaml | kubectl apply -f -
kubectl create namespace stonks-paper --dry-run=client -o yaml | kubectl apply -f -

# 2. Create NFS PVs (cluster-scoped, idempotent)
kubectl apply -f pvs/

# 3. Install ARC controller
#    "upgrade --install" keeps re-runs idempotent; plain "install" fails
#    once a release with the same name already exists.
helm upgrade --install arc \
  --namespace arc-system \
  -f arc/values.yaml \
  oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set-controller

# 4. Install ARC runner scale set
kubectl apply -f arc/runner-scaleset.yaml

# 5. Install ArgoCD (the "argo" repo alias must be registered first)
helm repo add argo https://argoproj.github.io/argo-helm --force-update
helm upgrade --install argocd argo/argo-cd \
  --namespace argocd \
  -f argocd/values.yaml

# 6. Apply ArgoCD repo secret + Applications
kubectl apply -f argocd/repo-secret.yaml
kubectl apply -f argocd/apps/

# 7. Install Kargo
helm upgrade --install kargo oci://ghcr.io/akuity/kargo-charts/kargo \
  --namespace kargo \
  -f kargo/values.yaml

# 8. Apply Kargo project, warehouse, stages
kubectl apply -f kargo/project.yaml
kubectl apply -f kargo/project-config.yaml
kubectl apply -f kargo/warehouse.yaml
kubectl apply -f kargo/stages/
```
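
Steps 6 and 8 apply custom resources shortly after the Helm install that provides their CRDs, and a CRD can take a moment to register. One hedged way to absorb that delay is a small retry wrapper; `retry` is a hypothetical helper, not part of the existing script.

```shell
# Run a command until it succeeds, up to <attempts> times,
# sleeping <delay> seconds between tries.
# Usage: retry <attempts> <delay> <command> [args...]
retry() {
    local attempts=$1 delay=$2 i=1
    shift 2
    while ! "$@"; do
        if [ "$i" -ge "$attempts" ]; then
            echo "giving up after $attempts attempts: $*" >&2
            return 1
        fi
        i=$((i + 1))
        sleep "$delay"
    done
}
```

For example, `retry 20 3 kubectl get crd applications.argoproj.io` could run before `kubectl apply -f argocd/apps/`, and the same pattern with `stages.kargo.akuity.io` before the Kargo stages are applied.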

### 9. runmelast.sh — Teardown

```bash
#!/bin/bash
set -euo pipefail

# Reverse order: Kargo → ArgoCD → ARC
# Preserves: PVs, NFS data, stonks-oracle namespace

# 1. Remove Kargo resources
kubectl delete -f kargo/stages/ --ignore-not-found
kubectl delete -f kargo/warehouse.yaml --ignore-not-found
kubectl delete -f kargo/project-config.yaml --ignore-not-found
kubectl delete -f kargo/project.yaml --ignore-not-found
helm uninstall kargo --namespace kargo || true

# 2. Remove ArgoCD resources
kubectl delete -f argocd/apps/ --ignore-not-found
kubectl delete -f argocd/repo-secret.yaml --ignore-not-found
helm uninstall argocd --namespace argocd || true

# 3. Remove ARC
kubectl delete -f arc/runner-scaleset.yaml --ignore-not-found
helm uninstall arc --namespace arc-system || true

# 4. Delete namespaces (but NOT stonks-oracle, stonks-beta, stonks-paper)
kubectl delete namespace arc-system --ignore-not-found
kubectl delete namespace argocd --ignore-not-found
kubectl delete namespace kargo --ignore-not-found

# 5. PVs are intentionally NOT deleted — data persists on NFS
echo "Pipeline infrastructure removed. NFS PVs and data preserved."
```

## Data Models

### Kargo Resource Relationships

```mermaid
graph TD
    W[Warehouse: stonks-images<br/>watches GHCR for new tags] -->|produces| F[Freight<br/>image tag = git SHA]
    F -->|auto-promote| SB[Stage: beta<br/>ArgoCD App: stonks-beta]
    SB -->|verified → available| SP[Stage: paper<br/>market-hours verification<br/>ArgoCD App: stonks-paper]
    SP -->|verified → available| SL[Stage: live<br/>manual approval + market-hours<br/>ArgoCD App: stonks-live]
```

### ArgoCD Application ↔ Kargo Stage Mapping

| Kargo Stage | ArgoCD Application | Target Namespace | Values File | Promotion Gate |
|---|---|---|---|---|
| `beta` | `stonks-beta` | `stonks-beta` | `values-beta.yaml` | Auto-promote (no gate) |
| `paper` | `stonks-paper` | `stonks-paper` | `values-paper.yaml` | Market-hours verification |
| `live` | `stonks-live` | `stonks-oracle` | `values.yaml` | Manual approval + market-hours |

### NFS Storage Layout

```
nfs://192.168.42.8:/volume1/Kubernetes/pipelines/
├── argocd/    # ArgoCD server data, repo cache
├── kargo/     # Kargo controller data, promotion history
└── arc/       # ARC runner data, job logs
```

### Image Tag Flow

```
Git SHA (e.g., abc123)
  → CI builds: ghcr.io/celesrenata/stonks-oracle/<service>:abc123
  → Integration test: run_pipeline.sh --image-tag abc123
  → Kargo Warehouse detects: abc123
  → Kargo Freight created: abc123
  → Beta: helm upgrade with image.tag=abc123
  → Paper: helm upgrade with image.tag=abc123 (after market-hours check)
  → Live: helm upgrade with image.tag=abc123 (after approval + market-hours check)
```
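
Every step in this flow derives the image reference from the same two inputs, the service name and the commit SHA. A small sketch of that derivation, assuming tags are lowercase hex commit SHAs as in the example; `image_ref` and `REGISTRY_PREFIX` are hypothetical names.

```shell
REGISTRY_PREFIX="ghcr.io/celesrenata/stonks-oracle"

# Print the image reference for a service at a given commit.
# Rejects tags that are empty or contain anything other than lowercase hex.
image_ref() {
    local service=$1 sha=$2
    case $sha in
        ""|*[!0-9a-f]*)
            echo "invalid commit SHA: '$sha'" >&2
            return 1
            ;;
    esac
    echo "${REGISTRY_PREFIX}/${service}:${sha}"
}

image_ref query-api abc123
# ghcr.io/celesrenata/stonks-oracle/query-api:abc123
```

Rejecting non-hex tags keeps a mutable tag such as `latest` from slipping into a promotion by accident.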

### Stage Enable/Disable Configuration

Stage enable/disable is managed via the Kargo ProjectConfig resource. Disabling a stage removes it from the promotion DAG — Freight skips to the next enabled stage. Re-enabling restores the gate.

```yaml
apiVersion: kargo.akuity.io/v1alpha1
kind: ProjectConfig
metadata:
  name: stonks-oracle
  namespace: stonks-oracle
spec:
  promotionPolicies:
    - stage: beta
      autoPromotionEnabled: true
    - stage: paper
      autoPromotionEnabled: false
    - stage: live
      autoPromotionEnabled: false
```

## Error Handling

### runmefirst.sh Failures

| Failure | Detection | Recovery |
|---|---|---|
| Namespace creation fails | `kubectl create` non-zero exit | Script exits with error message. Re-run is idempotent (uses `--dry-run=client -o yaml \| kubectl apply`). |
| NFS PV creation fails | `kubectl apply` non-zero exit | Check NFS server reachability (`ping 192.168.42.8`). Verify NFS paths exist on Synology. |
| Helm install fails (ARC/ArgoCD/Kargo) | Helm command exits non-zero | Script exits. Check Helm repo access, image pull credentials, and cluster resources. Re-run after fixing. |
| ArgoCD Application creation fails | `kubectl apply` non-zero exit | Verify ArgoCD CRDs are installed (ArgoCD Helm chart must be running first). |
| Kargo resource creation fails | `kubectl apply` non-zero exit | Verify Kargo CRDs are installed (Kargo Helm chart must be running first). |

### runmelast.sh Failures

| Failure | Detection | Recovery |
|---|---|---|
| Helm uninstall fails | Non-zero exit (caught by `\|\| true`) | Script continues. Manually clean up with `kubectl delete namespace`. |
| Namespace deletion hangs | Namespace stuck in Terminating | Check for finalizers: `kubectl get namespace <ns> -o json` and remove stuck finalizers. |
| PV accidentally deleted | PV missing after teardown | PVs are NOT deleted by runmelast.sh. If manually deleted, NFS data is still on disk — recreate PV pointing at same NFS path. |

### CI Workflow Failures

| Failure | Detection | Recovery |
|---|---|---|
| Self-hosted runner unavailable | GitHub Actions job queued indefinitely | Check ARC controller logs in `arc-system`. Verify runner scale set is registered. Fallback: temporarily switch to `ubuntu-latest`. |
| Image build fails | `docker/build-push-action` non-zero exit | Check build logs. Fix code/Dockerfile and re-push. |
| Integration test fails | `run_pipeline.sh` exits non-zero | Check `inttest-results.json` artifact for failure details. Fix and re-push. Promotion to beta is blocked. |
| GHCR push fails | Authentication error | Verify the workflow's `GITHUB_TOKEN` has the `packages: write` permission. Check GHCR rate limits. |

### Promotion Failures

| Failure | Detection | Recovery |
|---|---|---|
| ArgoCD sync fails | ArgoCD Application shows "Degraded" or "OutOfSync" | Check ArgoCD UI at `stonks-argocd.celestium.life`. Inspect sync error. Fix manifests and re-sync. |
| Kargo promotion fails | Kargo Stage shows "Failed" | Check Kargo Dashboard at `stonks-kargo.celestium.life`. Inspect promotion step logs. |
| Market-hours check fails unexpectedly | Verification step errors (not blocks) | Check AnalysisTemplate pod logs. Verify `tzdata` package is available in the container. |
| NFS volume unavailable | Pods stuck in Pending (PVC not bound) | Check NFS server status. Verify PV exists and is not bound to a different PVC. |

### Rollback Strategy

- **Beta/Paper**: ArgoCD auto-sync means reverting the image tag in the values file (or promoting a previous Freight) triggers a rollback. Kargo's promotion history shows which Freight was previously deployed.
- **Live**: Same mechanism — promote a previous Freight to the live stage. ArgoCD syncs the previous image tag. Manual approval is still required.
- **Emergency**: If ArgoCD is down, direct `helm upgrade` with the previous image tag: `helm upgrade stonks-oracle infra/helm/stonks-oracle -n stonks-oracle --set image.tag=<previous-sha>`

## Testing Strategy

### Why Property-Based Testing Does Not Apply

This feature is entirely Infrastructure as Code: shell scripts (`runmefirst.sh`, `runmelast.sh`), Kubernetes YAML manifests (PVs, ArgoCD Applications, Kargo Stages/Warehouses), Helm values files, and GitHub Actions workflow configuration. There are no pure functions, parsers, serializers, or business logic with meaningful input variation. Every acceptance criterion is classified as either SMOKE (one-time configuration check) or INTEGRATION (external service verification).

PBT requires universal properties that hold across a wide input space — "for all X, P(X) holds." This feature has no such properties. The "inputs" are fixed configuration values (namespace names, NFS paths, Helm chart paths, domain names) and the "outputs" are Kubernetes resource states. Running 100 iterations of "does the ArgoCD ingress have TLS enabled" adds no value over running it once.

### Testing Approach

The testing strategy uses three tiers:

#### Tier 1: Smoke Tests (Configuration Validation)

Validate that all generated manifests and scripts are structurally correct before deployment. These run locally or in CI without requiring a live cluster.
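
Two of the checks in this tier, NFS path separation and the runner label, reduce to text searches over the manifests. A sketch of how they might look; `check_unique_nfs_paths` and `check_runner_label` are hypothetical helpers, and the patterns assume one `path:` key per PV file and `runs-on:` values written as plain labels.

```shell
# Fail if two PV manifests in a directory share the same NFS path.
check_unique_nfs_paths() {
    local dir=$1 dupes
    # Collect every "path:" value, then keep only the repeated ones.
    dupes=$(grep -h '^[[:space:]]*path:' "$dir"/*.yaml \
        | awk '{print $2}' | sort | uniq -d)
    if [ -n "$dupes" ]; then
        echo "duplicate NFS paths: $dupes" >&2
        return 1
    fi
}

# Fail unless every runs-on line in a workflow names the expected label.
check_runner_label() {
    local workflow=$1 label=${2:-self-hosted-gremlin}
    # There must be at least one runs-on line to check.
    grep -q 'runs-on:' "$workflow" || return 1
    # Any runs-on line that names a different label is a failure.
    if grep 'runs-on:' "$workflow" \
        | grep -qv "runs-on:[[:space:]]*${label}[[:space:]]*\$"; then
        return 1
    fi
    return 0
}

# Example invocations (paths taken from this design):
#   check_unique_nfs_paths ~/sources/kube/pipelines/pvs
#   check_runner_label .github/workflows/build.yml
```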

| Test | What It Validates | How |
|---|---|---|
| Manifest syntax | All YAML files parse correctly | `kubectl apply --dry-run=client -f <file>` |
| Helm template rendering | Values files produce valid K8s resources | `helm template` with each values file |
| Namespace isolation | Pipeline namespaces are distinct from `stonks-oracle` | Grep manifests for namespace fields |
| NFS path separation | PVs use distinct subdirectories | Inspect PV YAML for unique paths |
| Workflow syntax | GitHub Actions YAML is valid | `actionlint` or GitHub's workflow validator |
| Runner label | Workflow uses `self-hosted-gremlin` label | Grep workflow YAML |
| Service matrix completeness | All 12 services + dashboard + superset in build matrix | Count matrix entries |
| ArgoCD Application structure | Each app points at correct chart, values, namespace | Inspect Application YAML |
| Kargo Stage DAG | Stages form correct linear pipeline | Inspect Stage YAML requestedFreight |

#### Tier 2: Integration Tests (Live Cluster Verification)

Run after `runmefirst.sh` on the Gremlin cluster. Verify that all components are running and wired correctly.
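
Because these checks are meant to be run as a checklist, a thin wrapper that prints a verdict per item and keeps a tally may be enough. A sketch with hypothetical helper names (`run_check`, `summary`); the commented commands are taken from the integration checks in this section and are not executed by the block itself.

```shell
PASS=0
FAIL=0

# Run one checklist item, print PASS or FAIL, and update the tally.
# Usage: run_check "<description>" <command> [args...]
run_check() {
    local name=$1
    shift
    if "$@" >/dev/null 2>&1; then
        PASS=$((PASS + 1))
        echo "PASS  $name"
    else
        FAIL=$((FAIL + 1))
        echo "FAIL  $name"
    fi
}

# Print the totals; the exit status is non-zero if anything failed.
summary() {
    echo "$PASS passed, $FAIL failed"
    [ "$FAIL" -eq 0 ]
}

# Example checklist (requires cluster access, so left commented out):
#   run_check "ARC pods listed"    kubectl get pods -n arc-system
#   run_check "ArgoCD UI responds" curl -kfsS https://stonks-argocd.celestium.life
#   run_check "Kargo UI responds"  curl -kfsS https://stonks-kargo.celestium.life
#   run_check "ArgoCD PVCs listed" kubectl get pvc -n argocd
#   summary
```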

| Test | What It Validates | How |
|---|---|---|
| ARC controller running | ARC pods healthy in `arc-system` | `kubectl get pods -n arc-system` |
| Runner registration | Scale set registered with GitHub | Check GitHub repo settings or ARC logs |
| ArgoCD accessible | Web UI responds at `stonks-argocd.celestium.life` | `curl -k https://stonks-argocd.celestium.life` |
| Kargo accessible | Dashboard responds at `stonks-kargo.celestium.life` | `curl -k https://stonks-kargo.celestium.life` |
| TLS certificates | Ingress has valid certs from `ca-issuer` | `openssl s_client` or cert-manager status |
| PV binding | PVCs are bound to NFS PVs | `kubectl get pvc -n argocd` |
| ArgoCD sync | Applications sync successfully | `argocd app get stonks-beta` |
| Kargo Warehouse | Warehouse discovers images from GHCR | `kubectl get freight -n stonks-oracle` |
| End-to-end promotion | Image flows from beta → paper → live | Trigger promotion, verify deployments update |
| Teardown preservation | After `runmelast.sh`, PVs and NFS data intact | Run teardown, check PVs and NFS mount |
| Rebuild reattach | After teardown + `runmefirst.sh`, state restored | Rebuild, verify promotion history preserved |

#### Tier 3: Market-Hours and Break-Glass Tests

These require either mocked time or execution at specific times.

| Test | What It Validates | How |
|---|---|---|
| Market-hours block (during hours) | Promotion blocked 09:30–16:00 ET Mon–Fri | Run AnalysisTemplate with `TZ=America/New_York` during market hours |
| Market-hours allow (outside hours) | Promotion allowed outside market hours | Run AnalysisTemplate outside market hours or on weekend |
| Market-hours boundary | Correct behavior at 09:29, 09:30, 15:59, 16:00 | Run check script with mocked times |
| DST handling | Correct ET evaluation across DST transitions | Verify script uses `America/New_York` (not fixed UTC offset) |
| Break-glass override | Manual approval bypasses market-hours block | During market hours, use Kargo manual approval |
| Break-glass audit | Approval records operator, timestamp, justification | After break-glass, query Kargo audit trail |
| Break-glass non-sticky | Next promotion is still blocked | After break-glass, verify subsequent promotion is blocked |

### Test Execution

- **Smoke tests**: Run as part of a validation script before deployment. Can be added as a CI job.
- **Integration tests**: Run manually after `runmefirst.sh` on the Gremlin cluster. Document as a checklist in the pipeline README.
- **Market-hours tests**: Run manually at appropriate times, or use the market-hours check script in isolation with mocked `TZ` and `date` values.