# CI/CD Pipeline — Requirements

## Introduction

This document specifies a full CI/CD pipeline for the Stonks Oracle platform. The pipeline replaces GitHub-hosted runners with self-hosted runners on the existing Kubernetes cluster (provisioned by the GitHub Actions Runner Controller), adds GitOps-based deployment via ArgoCD, and adds staged promotion orchestration via Kargo. It provides five stages (CI, integration test, beta, paper, and live) with market-hours promotion blockers, break-glass emergency overrides, and a visual web dashboard for promotion management. All pipeline infrastructure scripts reside in `~/sources/kube/pipelines/` on gremlin-1, and pipeline state persists on NFS volumes that survive cluster rebuilds.

## Glossary

- **ARC**: GitHub Actions Runner Controller, a Kubernetes operator that provisions self-hosted GitHub Actions runners as pods in the cluster
- **ArgoCD**: A GitOps continuous delivery controller for Kubernetes that syncs cluster state from Git repositories
- **Kargo**: A promotion orchestration layer built on top of ArgoCD providing staged promotion gates, a visual web dashboard, and audit trails
- **Pipeline_Infrastructure**: The set of Kubernetes resources (ARC, ArgoCD, Kargo) and their supporting manifests, PVs, and scripts that comprise the CI/CD system, deployed from `~/sources/kube/pipelines/`
- **Promotion**: The act of advancing a specific image tag (SHA) from one pipeline stage to the next (e.g., beta to paper)
- **Promotion_Blocker**: A time-based gate that prevents promotions during US equity market hours (09:30–16:00 ET, Monday–Friday)
- **Break_Glass**: An emergency override mechanism that bypasses the Promotion_Blocker, requiring explicit confirmation and an audit note
- **Stage**: One of the five deployment environments in the pipeline: CI, Integration_Test, Beta, Paper, Live
- **NFS_PV**: A Kubernetes PersistentVolume backed by the NFS share at `nfs://192.168.42.8:/volume1/Kubernetes/pipelines`, used to persist pipeline state across cluster rebuilds
- **GHCR**: GitHub Container Registry at `ghcr.io/celesrenata/stonks-oracle`, the target registry for all built images
- **Image_Tag**: A Docker image tag in the format `<sha>` (Git commit SHA) used to identify a specific build across all stages
- **Gremlin_Cluster**: The 4-node NixOS Kubernetes cluster (gremlin-1 through gremlin-4) at primary address 192.168.42.254
- **Market_Hours**: US equity market trading hours, 09:30–16:00 Eastern Time, Monday through Friday
- **Kargo_Dashboard**: The Kargo web UI providing visual promotion management, stage status, and audit history
- **Integration_Test_Runner**: The existing standalone script at `infra/inttest/run_pipeline.sh` that deploys an ephemeral sandbox, seeds data, runs API tests, and produces `inttest-results.json`

## Requirements

### Requirement 1: Pipeline Infrastructure Deployment

**User Story:** As a platform operator, I want a single deployment script that installs all CI/CD pipeline components (ARC, ArgoCD, Kargo) onto the Gremlin_Cluster, so that the pipeline infrastructure can be stood up or rebuilt with one command.

#### Acceptance Criteria

1. WHEN the operator executes `runmefirst.sh` from `~/sources/kube/pipelines/`, THE Pipeline_Infrastructure SHALL install ARC, ArgoCD, and Kargo into the Gremlin_Cluster in dedicated namespaces
2. WHEN the operator executes `runmefirst.sh`, THE Pipeline_Infrastructure SHALL create NFS-backed PersistentVolumes at `nfs://192.168.42.8:/volume1/Kubernetes/pipelines` for ArgoCD, Kargo, and ARC persistent data
3. WHEN ArgoCD is deployed, THE Pipeline_Infrastructure SHALL expose the ArgoCD web UI via Traefik ingress with TLS using the `ca-issuer` ClusterIssuer
4. WHEN Kargo is deployed, THE Pipeline_Infrastructure SHALL expose the Kargo_Dashboard via Traefik ingress with TLS using the `ca-issuer` ClusterIssuer
5. THE Pipeline_Infrastructure SHALL store all deployment manifests and scripts in `~/sources/kube/pipelines/` on gremlin-1

### Requirement 2: Pipeline Infrastructure Teardown

**User Story:** As a platform operator, I want a teardown script that removes pipeline components without destroying persistent pipeline data, so that pipeline state survives cluster rebuilds.

#### Acceptance Criteria

1. WHEN the operator executes `runmelast.sh` from `~/sources/kube/pipelines/`, THE Pipeline_Infrastructure SHALL remove ARC, ArgoCD, and Kargo deployments from the Gremlin_Cluster
2. WHEN `runmelast.sh` executes, THE Pipeline_Infrastructure SHALL preserve all NFS_PV resources and the data stored on `nfs://192.168.42.8:/volume1/Kubernetes/pipelines`
3. WHEN `runmelast.sh` executes, THE Pipeline_Infrastructure SHALL leave the application namespace `stonks-oracle` and all application workloads untouched
4. WHEN the application teardown script `~/sources/kube/stonks-oracle/runmelast.sh` executes, THE Pipeline_Infrastructure SHALL remain operational and unaffected

### Requirement 3: Pipeline Infrastructure Isolation

**User Story:** As a platform operator, I want the pipeline infrastructure to be fully isolated from the application infrastructure, so that deploying or tearing down one does not affect the other.

#### Acceptance Criteria

1. THE Pipeline_Infrastructure SHALL deploy ARC, ArgoCD, and Kargo in namespaces separate from the `stonks-oracle` application namespace
2. THE Pipeline_Infrastructure SHALL use independent Helm releases or manifests that share no lifecycle with the `stonks-oracle` Helm chart
3. THE Pipeline_Infrastructure SHALL use NFS_PV paths under `pipelines/` that are distinct from any application storage paths

### Requirement 4: Self-Hosted CI Runners

**User Story:** As a developer, I want CI builds to run on self-hosted runners in the Gremlin_Cluster via ARC, so that GitHub Actions compute costs are eliminated.

#### Acceptance Criteria

1. WHEN ARC is deployed, THE Pipeline_Infrastructure SHALL register a runner scale set with GitHub that accepts jobs from the `celesrenata/stonks-oracle` repository
2. WHEN a GitHub Actions workflow targets the self-hosted runner label, THE ARC SHALL provision runner pods in the Gremlin_Cluster to execute the job
3. WHEN a CI job completes, THE ARC SHALL terminate the runner pod and release cluster resources
4. THE ARC SHALL use ephemeral runner pods that start clean for each job execution

### Requirement 5: CI Stage - Lint and Test

**User Story:** As a developer, I want every push to main or pull request to trigger automated linting and testing on self-hosted runners, so that code quality is validated before images are built.

#### Acceptance Criteria

1. WHEN a commit is pushed to the `main` branch or a pull request is opened, THE CI_Stage SHALL trigger a workflow on self-hosted ARC runners
2. WHEN the CI workflow runs, THE CI_Stage SHALL execute Python linting using `ruff check services/`
3. WHEN the CI workflow runs, THE CI_Stage SHALL execute Python unit tests using `pytest tests/`
4. WHEN the CI workflow runs, THE CI_Stage SHALL install frontend dependencies and execute frontend tests using `vitest`
5. IF any lint or test step fails, THEN THE CI_Stage SHALL mark the workflow as failed and skip image builds

### Requirement 6: CI Stage - Image Build and Push

**User Story:** As a developer, I want Docker images for all services and the dashboard to be built and pushed to GHCR on every successful main branch push, so that new images are available for deployment.

#### Acceptance Criteria

1. WHEN lint and tests pass on a push to `main`, THE CI_Stage SHALL build Docker images for all 12 Python services (scheduler, symbol-registry, ingestion, parser, extractor, aggregation, recommendation, risk, broker-adapter, lake-publisher, query-api, trading-engine)
2. WHEN lint and tests pass on a push to `main`, THE CI_Stage SHALL build the dashboard Docker image from `frontend/Dockerfile`
3. WHEN lint and tests pass on a push to `main`, THE CI_Stage SHALL build the superset Docker image from `docker/Dockerfile.superset`
4. WHEN images are built, THE CI_Stage SHALL push each image to GHCR with tags `ghcr.io/celesrenata/stonks-oracle/<service>:<git-sha>` and `ghcr.io/celesrenata/stonks-oracle/<service>:latest`
5. WHEN all images are pushed, THE CI_Stage SHALL record the Git SHA as the Image_Tag for downstream stages
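
The tagging scheme in criterion 4 can be sketched in Python as follows. This is a minimal illustration rather than the project's build script: the function names `image_refs` and `all_image_refs` are hypothetical, and the image names `dashboard` and `superset` are assumed from criteria 2 and 3.

```python
REGISTRY = "ghcr.io/celesrenata/stonks-oracle"

# The 12 Python services listed in criterion 1.
PYTHON_SERVICES = [
    "scheduler", "symbol-registry", "ingestion", "parser", "extractor",
    "aggregation", "recommendation", "risk", "broker-adapter",
    "lake-publisher", "query-api", "trading-engine",
]

# Assumed image names for the two non-service images (criteria 2 and 3).
EXTRA_IMAGES = ["dashboard", "superset"]


def image_refs(service: str, git_sha: str) -> list:
    """Return the two references pushed for one image: <git-sha> and latest."""
    return [
        f"{REGISTRY}/{service}:{git_sha}",
        f"{REGISTRY}/{service}:latest",
    ]


def all_image_refs(git_sha: str) -> list:
    """Return every reference pushed for a single commit."""
    refs = []
    for name in PYTHON_SERVICES + EXTRA_IMAGES:
        refs.extend(image_refs(name, git_sha))
    return refs
```

Downstream stages would consume the `<git-sha>` reference, since the Image_Tag is defined as the commit SHA.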

### Requirement 7: Integration Test Stage

**User Story:** As a developer, I want the CI pipeline to automatically run integration tests against newly built images, so that functional correctness is validated before promotion to beta.

#### Acceptance Criteria

1. WHEN all images are pushed to GHCR for a given Image_Tag, THE Integration_Test_Stage SHALL invoke the Integration_Test_Runner with `bash infra/inttest/run_pipeline.sh --image-tag <sha>`
2. WHEN the Integration_Test_Runner completes, THE Integration_Test_Stage SHALL parse the `inttest-results.json` file for test counts and exit code
3. IF the Integration_Test_Runner exits with code 0, THEN THE Integration_Test_Stage SHALL mark the Image_Tag as eligible for promotion to Beta
4. IF the Integration_Test_Runner exits with a non-zero code, THEN THE Integration_Test_Stage SHALL block promotion to Beta and report the failure details
5. THE Integration_Test_Stage SHALL archive the `inttest-results.json` as a build artifact
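
Criteria 2 through 4 could be combined into one evaluation step, sketched below. The function name `evaluate_inttest` is hypothetical, and so are the JSON keys `passed` and `failed`: the schema of `inttest-results.json` is not specified in this document, so they stand in for whatever count fields the runner actually writes.

```python
import json
from pathlib import Path


def evaluate_inttest(exit_code: int, results_path: str) -> dict:
    """Combine the runner's exit code with counts parsed from the results file."""
    results = json.loads(Path(results_path).read_text(encoding="utf-8"))
    return {
        # Criteria 3 and 4: only exit code 0 makes the tag eligible for Beta.
        "eligible_for_beta": exit_code == 0,
        # Hypothetical keys; adjust to the real inttest-results.json schema.
        "passed": results.get("passed"),
        "failed": results.get("failed"),
    }
```

Eligibility is keyed on the exit code alone, as the criteria state; the parsed counts are carried along only for reporting.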

### Requirement 8: Beta Stage Deployment

**User Story:** As a developer, I want a beta environment where newly built images are deployed for smoke testing and manual verification before promotion to paper trading, so that regressions are caught early.

#### Acceptance Criteria

1. WHEN an Image_Tag passes the Integration_Test_Stage, THE Beta_Stage SHALL deploy the application with that Image_Tag to a beta namespace or Helm release managed by ArgoCD
2. WHILE the Beta_Stage is active, THE Kargo_Dashboard SHALL display the currently deployed Image_Tag and its promotion status
3. WHEN a developer requests promotion from Beta to Paper via the Kargo_Dashboard, THE Beta_Stage SHALL verify that the Image_Tag passed integration tests before allowing promotion
4. THE Beta_Stage SHALL use the same Helm chart (`infra/helm/stonks-oracle/`) as production, with beta-specific value overrides

### Requirement 9: Paper Trading Stage Deployment

**User Story:** As a trader, I want a paper trading environment that uses the Alpaca paper broker, so that new builds can be validated against simulated market conditions before going live.

#### Acceptance Criteria

1. WHEN an Image_Tag is promoted from Beta, THE Paper_Stage SHALL deploy the application with that Image_Tag to a paper trading namespace managed by ArgoCD
2. THE Paper_Stage SHALL configure the broker adapter with `BROKER_MODE=paper` and `BROKER_PROVIDER=alpaca` using Alpaca paper trading credentials
3. WHILE Market_Hours are active (09:30–16:00 ET, Monday–Friday), THE Paper_Stage SHALL block automatic and manual promotions to the Paper_Stage unless Break_Glass is activated
4. WHEN a promotion to Paper is attempted outside Market_Hours, THE Paper_Stage SHALL allow the promotion to proceed
5. THE Paper_Stage SHALL use the same Helm chart (`infra/helm/stonks-oracle/`) as production, with paper-specific value overrides

### Requirement 10: Live Stage Deployment

**User Story:** As a platform operator, I want production deployments to require explicit manual approval with notes, so that live trading is protected from accidental or untested deployments.

#### Acceptance Criteria

1. WHEN an Image_Tag is promoted from Paper, THE Live_Stage SHALL require explicit manual approval with a notes field before deploying to the `stonks-oracle` production namespace
2. THE Live_Stage SHALL deploy the application with the approved Image_Tag via ArgoCD syncing the production Helm release
3. WHILE Market_Hours are active (09:30–16:00 ET, Monday–Friday), THE Live_Stage SHALL block promotions to the Live_Stage unless Break_Glass is activated
4. WHEN a promotion to Live is attempted outside Market_Hours with valid approval, THE Live_Stage SHALL allow the promotion to proceed
5. THE Live_Stage SHALL use the existing `stonks-oracle` namespace and Helm chart with production values
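
The way the Live gates combine (criteria 1, 3, and 4) might look like the sketch below. The function name `live_promotion_decision` and the reason strings are hypothetical. Treating an empty notes field as a rejection is also an assumption, since the criteria require a notes field but do not say whether it may be left blank.

```python
from typing import Tuple


def live_promotion_decision(
    approved: bool,
    notes: str,
    in_market_hours: bool,
    break_glass: bool = False,
) -> Tuple[bool, str]:
    """Return (allowed, reason) for a Paper-to-Live promotion request."""
    if not approved:
        return False, "manual approval required"
    if not notes.strip():
        # Assumption: blank notes do not count as valid approval.
        return False, "approval notes required"
    if in_market_hours and not break_glass:
        return False, "blocked by market hours"
    return True, "ok"
```

Note that Break_Glass in this sketch bypasses only the market-hours check; manual approval is still required first.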

### Requirement 11: Market-Hours Promotion Blocker

**User Story:** As a risk manager, I want promotions to paper and live environments to be blocked during US market hours, so that deployments do not disrupt active trading sessions.

#### Acceptance Criteria

1. WHILE the current time is between 09:30 and 16:00 Eastern Time on a weekday, THE Promotion_Blocker SHALL prevent promotions to the Paper_Stage and Live_Stage
2. WHEN the current time is outside 09:30–16:00 ET or on a weekend, THE Promotion_Blocker SHALL allow promotions to proceed (subject to other gates)
3. WHEN a promotion is blocked by the Promotion_Blocker, THE Kargo_Dashboard SHALL display a visual indicator showing the block reason and the time until the market closes
4. THE Promotion_Blocker SHALL evaluate Eastern Time correctly, accounting for US daylight saving time transitions
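
A minimal sketch of the time check follows, using the standard-library `zoneinfo` module so that EST/EDT transitions are handled by the tz database. The function names are hypothetical. Two assumptions go beyond the criteria: the window is treated as half-open (09:30 inclusive, 16:00 exclusive), and exchange holidays are not considered because the criteria do not mention them.

```python
from datetime import datetime, time, timedelta
from typing import Optional
from zoneinfo import ZoneInfo

EASTERN = ZoneInfo("America/New_York")  # tz database zone; covers EST and EDT
MARKET_OPEN = time(9, 30)
MARKET_CLOSE = time(16, 0)


def is_market_hours(now: datetime) -> bool:
    """Return True if `now` falls within 09:30-16:00 ET on a weekday."""
    if now.tzinfo is None:
        raise ValueError("now must be timezone-aware")
    local = now.astimezone(EASTERN)
    if local.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        return False
    # Assumption: half-open window, so 16:00:00 itself is outside.
    return MARKET_OPEN <= local.time() < MARKET_CLOSE


def time_until_close(now: datetime) -> Optional[timedelta]:
    """Return the time left until 16:00 ET, or None outside Market_Hours."""
    if not is_market_hours(now):
        return None
    local = now.astimezone(EASTERN)
    close = datetime.combine(local.date(), MARKET_CLOSE, tzinfo=EASTERN)
    return close - local
```

For example, 13:30 UTC is 09:30 ET in July (EDT) but 08:30 ET in January (EST), so the same UTC instant of day is blocked in summer and allowed in winter. Requiring a timezone-aware input avoids silently interpreting a naive timestamp in the wrong zone.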

### Requirement 12: Break-Glass Emergency Override

**User Story:** As a platform operator, I want a break-glass mechanism to bypass market-hours blockers during emergencies, so that critical fixes can be deployed at any time.

#### Acceptance Criteria

1. WHEN an operator activates Break_Glass via the Kargo_Dashboard, THE Pipeline_Infrastructure SHALL bypass the Promotion_Blocker for the target Stage
2. WHEN Break_Glass is activated, THE Kargo_Dashboard SHALL require a confirmation dialog before proceeding
3. WHEN Break_Glass is activated, THE Pipeline_Infrastructure SHALL require the operator to provide a written justification note
4. WHEN Break_Glass is used, THE Pipeline_Infrastructure SHALL record the operator identity, timestamp, target Stage, Image_Tag, and justification note in the audit trail
5. THE Break_Glass mechanism SHALL apply only to the single promotion for which it was activated and SHALL NOT disable the Promotion_Blocker for subsequent promotions
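
One way to model the single-use behaviour in criterion 5 is an override object that is bound to one Stage and Image_Tag and can be consumed only once. The class `BreakGlassOverride` and its field names are hypothetical; this is a sketch of the rule, not a description of how Kargo implements overrides.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class BreakGlassOverride:
    operator: str
    target_stage: str
    image_tag: str
    justification: str
    confirmed: bool
    activated_at: datetime = field(
        default_factory=lambda: datetime.now(tz=timezone.utc)
    )
    used: bool = False

    def __post_init__(self) -> None:
        # Criterion 2: explicit confirmation is mandatory.
        if not self.confirmed:
            raise ValueError("break-glass requires explicit confirmation")
        # Criterion 3: a written justification note is mandatory.
        if not self.justification.strip():
            raise ValueError("break-glass requires a justification note")

    def consume(self, target_stage: str, image_tag: str) -> bool:
        """Return True once, and only for the promotion it was activated for."""
        if self.used:
            return False
        if target_stage != self.target_stage or image_tag != self.image_tag:
            return False
        self.used = True
        return True
```

The fields on the object match the audit fields listed in criterion 4, so the same record could be written to the audit trail when `consume` succeeds.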

### Requirement 13: Per-Stage Enable/Disable Controls

**User Story:** As a platform operator, I want to independently enable or disable each pipeline stage, so that the pipeline can be configured for different operational modes.

#### Acceptance Criteria

1. THE Pipeline_Infrastructure SHALL provide a configuration mechanism to independently enable or disable each of the five stages (CI, Integration_Test, Beta, Paper, Live)
2. WHEN a Stage is disabled, THE Pipeline_Infrastructure SHALL skip that Stage during promotion and advance the Image_Tag to the next enabled Stage
3. WHEN a Stage is re-enabled, THE Pipeline_Infrastructure SHALL resume gating promotions through that Stage for new Image_Tags
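
The skip rule in criterion 2 amounts to finding the next enabled entry in the ordered stage list. A minimal sketch, in which the function name `next_enabled_stage` and the shape of the `enabled` mapping are hypothetical:

```python
from typing import Mapping, Optional

# Stage order as defined in the glossary.
STAGES = ["CI", "Integration_Test", "Beta", "Paper", "Live"]


def next_enabled_stage(current: str, enabled: Mapping[str, bool]) -> Optional[str]:
    """Return the first enabled Stage after `current`, or None if there is none."""
    start = STAGES.index(current) + 1
    for stage in STAGES[start:]:
        # Assumption: a Stage missing from the mapping counts as enabled.
        if enabled.get(stage, True):
            return stage
    return None
```

With Beta disabled, an Image_Tag leaving Integration_Test would advance directly to Paper. Re-enabling Beta changes the result for later calls only, which matches criterion 3.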

### Requirement 14: Revision Tracking

**User Story:** As a developer, I want to see which Image_Tag (Git SHA) is deployed at each pipeline stage, so that I can track exactly what code is running in each environment.

#### Acceptance Criteria

1. THE Kargo_Dashboard SHALL display the currently deployed Image_Tag for each active Stage
2. WHEN a promotion occurs, THE Kargo_Dashboard SHALL update the displayed Image_Tag for the target Stage within 60 seconds
3. THE Pipeline_Infrastructure SHALL maintain a mapping of Stage to current Image_Tag that is queryable via the Kargo API or ArgoCD

### Requirement 15: Audit Trail

**User Story:** As a compliance officer, I want a complete audit trail of all promotions including who promoted, when, with what notes, and whether break-glass was used, so that deployment decisions are traceable.

#### Acceptance Criteria

1. WHEN a promotion occurs, THE Pipeline_Infrastructure SHALL record the operator identity, timestamp, source Stage, target Stage, Image_Tag, and any notes provided
2. WHEN Break_Glass is used for a promotion, THE Pipeline_Infrastructure SHALL record the break-glass justification alongside the standard promotion record
3. THE Kargo_Dashboard SHALL display the promotion history for each Stage, showing all recorded audit fields
4. THE Pipeline_Infrastructure SHALL persist audit trail data on NFS_PV so that promotion history survives cluster rebuilds
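
The audit fields in criteria 1 and 2 can be pictured as one record per promotion, appended to a file on the NFS_PV. The sketch below uses a JSON Lines file purely for illustration. The names `PromotionRecord`, `append_record`, and `read_records` are hypothetical, and the actual storage format used by Kargo may differ.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass(frozen=True)
class PromotionRecord:
    operator: str
    timestamp: str  # ISO 8601, UTC
    source_stage: str
    target_stage: str
    image_tag: str
    notes: str = ""
    # Set only when Break_Glass was used for this promotion (criterion 2).
    break_glass_justification: Optional[str] = None


def append_record(path: str, record: PromotionRecord) -> None:
    """Append one record as a single JSON line; existing lines are untouched."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(asdict(record), sort_keys=True) + "\n")


def read_records(path: str) -> List[dict]:
    """Read the full promotion history back as a list of dicts."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

An append-only file keeps earlier entries intact, which suits an audit trail that has to survive teardown and redeployment.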

### Requirement 16: Kargo Visual Dashboard

**User Story:** As a platform operator, I want a web dashboard showing all pipeline stages, their current revisions, and promotion controls, so that I can manage deployments visually.

#### Acceptance Criteria

1. THE Kargo_Dashboard SHALL display all five Stages with their current deployed Image_Tag and promotion status
2. THE Kargo_Dashboard SHALL provide a click-to-promote action for advancing an Image_Tag from one Stage to the next
3. WHEN Market_Hours are active, THE Kargo_Dashboard SHALL display block/allow indicators on the Paper_Stage and Live_Stage
4. THE Kargo_Dashboard SHALL provide a notes field when promoting or when a promotion is blocked
5. THE Kargo_Dashboard SHALL provide a Break_Glass button with a confirmation dialog for emergency overrides
6. THE Kargo_Dashboard SHALL be accessible via Traefik ingress at a `*.celestium.life` domain with TLS via `ca-issuer`

### Requirement 17: NFS Persistent Storage

**User Story:** As a platform operator, I want all pipeline state (ArgoCD app configs, Kargo promotion history, ARC data) to persist on NFS volumes, so that pipeline data survives cluster teardowns and rebuilds.

#### Acceptance Criteria

1. THE Pipeline_Infrastructure SHALL create PersistentVolumes backed by the NFS share at `nfs://192.168.42.8:/volume1/Kubernetes/pipelines` for ArgoCD server data, Kargo data, and ARC data
2. WHEN `runmelast.sh` is executed, THE NFS_PV resources and their underlying NFS data SHALL remain intact
3. WHEN `runmefirst.sh` is executed after a previous teardown, THE Pipeline_Infrastructure SHALL reattach to the existing NFS data and restore previous pipeline state
4. THE Pipeline_Infrastructure SHALL use separate NFS subdirectories for ArgoCD, Kargo, and ARC to prevent data conflicts
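
The per-component volumes in criteria 1 and 4 could be generated as Kubernetes PersistentVolume manifests along the lines of the sketch below. The NFS server and base path come from this document; the PV names, the subdirectory names (`argocd`, `kargo`, `arc`), the capacity, and the access mode are assumptions for illustration. The `Retain` reclaim policy is one way to keep the underlying data when a claim is deleted.

```python
import json

NFS_SERVER = "192.168.42.8"
NFS_BASE = "/volume1/Kubernetes/pipelines"


def nfs_pv(component: str, capacity: str = "5Gi") -> dict:
    """Build a PersistentVolume manifest for one component's NFS subdirectory."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolume",
        "metadata": {"name": f"pipelines-{component}"},  # assumed naming
        "spec": {
            "capacity": {"storage": capacity},  # assumed size
            "accessModes": ["ReadWriteMany"],  # assumed access mode
            # Retain keeps the NFS data when the bound claim is removed.
            "persistentVolumeReclaimPolicy": "Retain",
            "nfs": {"server": NFS_SERVER, "path": f"{NFS_BASE}/{component}"},
        },
    }


if __name__ == "__main__":
    # Print one manifest per component (assumed subdirectory names).
    for name in ("argocd", "kargo", "arc"):
        print(json.dumps(nfs_pv(name), indent=2))
```

Because each manifest points at its own subdirectory, a redeploy can bind new claims to the same paths and pick up the existing data.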

### Requirement 18: ArgoCD GitOps Configuration

**User Story:** As a platform operator, I want ArgoCD to sync Kubernetes manifests from the Git repository, so that the cluster state is always consistent with the declared configuration.

#### Acceptance Criteria

1. THE ArgoCD SHALL be configured with an Application resource pointing to the `infra/helm/stonks-oracle/` Helm chart in the `celesrenata/stonks-oracle` Git repository
2. WHEN a change is committed to the Helm chart or values files in Git, THE ArgoCD SHALL detect the change and sync the updated manifests to the target namespace
3. THE ArgoCD SHALL support multiple Application resources for beta, paper, and live environments, each with stage-specific value overrides
4. IF an ArgoCD sync fails, THEN THE ArgoCD SHALL report the failure status in the ArgoCD UI and the Kargo_Dashboard
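
Criteria 1 and 3 imply one ArgoCD Application per environment, all pointing at the same chart with different values files. A hedged sketch of how such manifests could be generated is shown below. The chart path and the `stonks-oracle` live namespace come from this document. The repository URL form, the `argocd` namespace, the `default` project, the `main` revision, the beta and paper namespace names, and the values file names are assumptions. Sync policy is left out because it is not specified here.

```python
REPO_URL = "https://github.com/celesrenata/stonks-oracle.git"  # assumed URL form
CHART_PATH = "infra/helm/stonks-oracle"


def argocd_application(stage: str, namespace: str, values_file: str) -> dict:
    """Build an ArgoCD Application manifest for one environment."""
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Application",
        "metadata": {
            "name": f"stonks-oracle-{stage}",  # assumed naming
            "namespace": "argocd",  # assumed ArgoCD install namespace
        },
        "spec": {
            "project": "default",  # assumed project
            "source": {
                "repoURL": REPO_URL,
                "path": CHART_PATH,
                "targetRevision": "main",  # assumed branch
                # Paths are resolved relative to the chart directory.
                "helm": {"valueFiles": [values_file]},
            },
            "destination": {
                "server": "https://kubernetes.default.svc",
                "namespace": namespace,
            },
        },
    }


# Assumed namespaces and values files for beta and paper.
APPLICATIONS = [
    argocd_application("beta", "stonks-oracle-beta", "values-beta.yaml"),
    argocd_application("paper", "stonks-oracle-paper", "values-paper.yaml"),
    argocd_application("live", "stonks-oracle", "values.yaml"),
]
```

Keeping the chart path constant and varying only `valueFiles` and the destination namespace reflects the requirement that every stage uses the same Helm chart.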