Celes Renata c85c0068a2 fix: clean up utcnow deprecation warnings, fix 12 failing tests, add CI/CD pipeline manifests
- Replace all datetime.utcnow() with datetime.now(tz=timezone.utc) across 8 files
- Fix 12 failing tests to match current implementation behavior
- Fix pytest_plugins in non-top-level conftest (moved to root conftest.py)
- Auto-fix 189 lint issues (import sorting, unused imports)
- Add CI/CD pipeline infrastructure (ARC, ArgoCD, Kargo manifests)
- Add values-beta.yaml and values-paper.yaml for staged deployments
- Update GitHub Actions workflow to use self-hosted-gremlin runners
- Add integration-test job to CI pipeline

Result: 1596 passed, 0 failed, 0 warnings
2026-04-18 03:59:28 +00:00

CI/CD Pipeline — Requirements

Introduction

This document specifies a full CI/CD pipeline for the Stonks Oracle platform. It replaces GitHub-hosted runners with self-hosted runners on the existing Kubernetes cluster (via GitHub Actions Runner Controller), adds GitOps-based deployment via ArgoCD, and adds staged promotion orchestration via Kargo. The pipeline provides five stages (CI, integration test, beta, paper, and live) with market-hours promotion blockers, break-glass emergency overrides, and a visual web dashboard for promotion management. All pipeline infrastructure scripts reside in ~/sources/kube/pipelines/ on gremlin-1, and pipeline state persists on NFS volumes that survive cluster rebuilds.

Glossary

  • ARC: GitHub Actions Runner Controller — a Kubernetes operator that provisions self-hosted GitHub Actions runners as pods in the cluster
  • ArgoCD: A GitOps continuous delivery controller for Kubernetes that syncs cluster state from Git repositories
  • Kargo: A promotion orchestration layer built on top of ArgoCD providing staged promotion gates, a visual web dashboard, and audit trails
  • Pipeline_Infrastructure: The set of Kubernetes resources (ARC, ArgoCD, Kargo) and their supporting manifests, PVs, and scripts that comprise the CI/CD system, deployed from ~/sources/kube/pipelines/
  • Promotion: The act of advancing a specific image tag (SHA) from one pipeline stage to the next (e.g., beta to paper)
  • Promotion_Blocker: A time-based gate that prevents promotions during US equity market hours (09:30-16:00 ET, Monday-Friday)
  • Break_Glass: An emergency override mechanism that bypasses the Promotion_Blocker, requiring explicit confirmation and an audit note
  • Stage: One of the five deployment environments in the pipeline: CI, Integration_Test, Beta, Paper, Live
  • NFS_PV: A Kubernetes PersistentVolume backed by the NFS share at nfs://192.168.42.8:/volume1/Kubernetes/pipelines, used to persist pipeline state across cluster rebuilds
  • GHCR: GitHub Container Registry at ghcr.io/celesrenata/stonks-oracle, the target registry for all built images
  • Image_Tag: A Docker image tag in the format <sha> (Git commit SHA) used to identify a specific build across all stages
  • Gremlin_Cluster: The 4-node NixOS Kubernetes cluster (gremlin-1 through gremlin-4) at primary address 192.168.42.254
  • Market_Hours: US equity market trading hours, 09:30-16:00 Eastern Time, Monday through Friday
  • Kargo_Dashboard: The Kargo web UI providing visual promotion management, stage status, and audit history
  • Integration_Test_Runner: The existing standalone script at infra/inttest/run_pipeline.sh that deploys an ephemeral sandbox, seeds data, runs API tests, and produces inttest-results.json

Requirements

Requirement 1: Pipeline Infrastructure Deployment

User Story: As a platform operator, I want a single deployment script that installs all CI/CD pipeline components (ARC, ArgoCD, Kargo) onto the Gremlin_Cluster, so that the pipeline infrastructure can be stood up or rebuilt with one command.

Acceptance Criteria

  1. WHEN the operator executes runmefirst.sh from ~/sources/kube/pipelines/, THE Pipeline_Infrastructure SHALL install ARC, ArgoCD, and Kargo into the Gremlin_Cluster in dedicated namespaces
  2. WHEN the operator executes runmefirst.sh, THE Pipeline_Infrastructure SHALL create NFS-backed PersistentVolumes at nfs://192.168.42.8:/volume1/Kubernetes/pipelines for ArgoCD, Kargo, and ARC persistent data
  3. WHEN ArgoCD is deployed, THE Pipeline_Infrastructure SHALL expose the ArgoCD web UI via Traefik ingress with TLS using the ca-issuer ClusterIssuer
  4. WHEN Kargo is deployed, THE Pipeline_Infrastructure SHALL expose the Kargo_Dashboard via Traefik ingress with TLS using the ca-issuer ClusterIssuer
  5. THE Pipeline_Infrastructure SHALL store all deployment manifests and scripts in ~/sources/kube/pipelines/ on gremlin-1

Requirement 2: Pipeline Infrastructure Teardown

User Story: As a platform operator, I want a teardown script that removes pipeline components without destroying persistent pipeline data, so that pipeline state survives cluster rebuilds.

Acceptance Criteria

  1. WHEN the operator executes runmelast.sh from ~/sources/kube/pipelines/, THE Pipeline_Infrastructure SHALL remove ARC, ArgoCD, and Kargo deployments from the Gremlin_Cluster
  2. WHEN runmelast.sh executes, THE Pipeline_Infrastructure SHALL preserve all NFS_PV resources and the data stored on nfs://192.168.42.8:/volume1/Kubernetes/pipelines
  3. WHEN runmelast.sh executes, THE Pipeline_Infrastructure SHALL leave the application namespace stonks-oracle and all application workloads untouched
  4. WHEN the application teardown script ~/sources/kube/stonks-oracle/runmelast.sh executes, THE Pipeline_Infrastructure SHALL remain operational and unaffected

Requirement 3: Pipeline Infrastructure Isolation

User Story: As a platform operator, I want the pipeline infrastructure to be fully isolated from the application infrastructure, so that deploying or tearing down one does not affect the other.

Acceptance Criteria

  1. THE Pipeline_Infrastructure SHALL deploy ARC, ArgoCD, and Kargo in namespaces separate from the stonks-oracle application namespace
  2. THE Pipeline_Infrastructure SHALL use independent Helm releases or manifests that share no lifecycle with the stonks-oracle Helm chart
  3. THE Pipeline_Infrastructure SHALL use NFS_PV paths under pipelines/ that are distinct from any application storage paths

Requirement 4: Self-Hosted CI Runners

User Story: As a developer, I want CI builds to run on self-hosted runners in the Gremlin_Cluster via ARC, so that GitHub Actions compute costs are eliminated.

Acceptance Criteria

  1. WHEN ARC is deployed, THE Pipeline_Infrastructure SHALL register a runner scale set with GitHub that accepts jobs from the celesrenata/stonks-oracle repository
  2. WHEN a GitHub Actions workflow targets the self-hosted runner label, THE ARC SHALL provision runner pods in the Gremlin_Cluster to execute the job
  3. WHEN a CI job completes, THE ARC SHALL terminate the runner pod and release cluster resources
  4. THE ARC SHALL use ephemeral runner pods that start clean for each job execution

Requirement 5: CI Stage — Lint and Test

User Story: As a developer, I want every push to main or pull request to trigger automated linting and testing on self-hosted runners, so that code quality is validated before images are built.

Acceptance Criteria

  1. WHEN a commit is pushed to the main branch or a pull request is opened, THE CI_Stage SHALL trigger a workflow on self-hosted ARC runners
  2. WHEN the CI workflow runs, THE CI_Stage SHALL execute Python linting using ruff check services/
  3. WHEN the CI workflow runs, THE CI_Stage SHALL execute Python unit tests using pytest tests/
  4. WHEN the CI workflow runs, THE CI_Stage SHALL install frontend dependencies and execute frontend tests using vitest
  5. IF any lint or test step fails, THEN THE CI_Stage SHALL mark the workflow as failed and skip image builds

Requirement 6: CI Stage — Image Build and Push

User Story: As a developer, I want Docker images for all services and the dashboard to be built and pushed to GHCR on every successful main branch push, so that new images are available for deployment.

Acceptance Criteria

  1. WHEN lint and tests pass on a push to main, THE CI_Stage SHALL build Docker images for all 12 Python services (scheduler, symbol-registry, ingestion, parser, extractor, aggregation, recommendation, risk, broker-adapter, lake-publisher, query-api, trading-engine)
  2. WHEN lint and tests pass on a push to main, THE CI_Stage SHALL build the dashboard Docker image from frontend/Dockerfile
  3. WHEN lint and tests pass on a push to main, THE CI_Stage SHALL build the superset Docker image from docker/Dockerfile.superset
  4. WHEN images are built, THE CI_Stage SHALL push each image to GHCR with tags ghcr.io/celesrenata/stonks-oracle/<service>:<git-sha> and ghcr.io/celesrenata/stonks-oracle/<service>:latest
  5. WHEN all images are pushed, THE CI_Stage SHALL record the Git SHA as the Image_Tag for downstream stages
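
The tagging rule in criterion 4 can be sketched as a small helper. This is an illustrative sketch only: the function names are hypothetical, and the image names "dashboard" and "superset" are assumed to match the wording of criteria 2 and 3.

```python
from typing import List

REGISTRY = "ghcr.io/celesrenata/stonks-oracle"

# The 12 Python services listed in criterion 1.
PYTHON_SERVICES = [
    "scheduler", "symbol-registry", "ingestion", "parser", "extractor",
    "aggregation", "recommendation", "risk", "broker-adapter",
    "lake-publisher", "query-api", "trading-engine",
]
# Assumption: the two extra images are named after criteria 2 and 3.
EXTRA_IMAGES = ["dashboard", "superset"]


def image_refs(service: str, git_sha: str) -> List[str]:
    """Return the two references pushed for one image: <git-sha> and latest."""
    if not git_sha:
        raise ValueError("git_sha must be non-empty")
    return [
        f"{REGISTRY}/{service}:{git_sha}",
        f"{REGISTRY}/{service}:latest",
    ]


def all_image_refs(git_sha: str) -> List[str]:
    """Return every reference a single CI run would push for one commit."""
    refs: List[str] = []
    for name in PYTHON_SERVICES + EXTRA_IMAGES:
        refs.extend(image_refs(name, git_sha))
    return refs
```

Only the SHA-tagged reference identifies a build unambiguously, which is why downstream stages track the Image_Tag rather than latest.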

Requirement 7: Integration Test Stage

User Story: As a developer, I want the CI pipeline to automatically run integration tests against newly built images, so that functional correctness is validated before promotion to beta.

Acceptance Criteria

  1. WHEN all images are pushed to GHCR for a given Image_Tag, THE Integration_Test_Stage SHALL invoke the Integration_Test_Runner with bash infra/inttest/run_pipeline.sh --image-tag <sha>
  2. WHEN the Integration_Test_Runner completes, THE Integration_Test_Stage SHALL parse the inttest-results.json file for test counts and exit code
  3. IF the Integration_Test_Runner exits with code 0, THEN THE Integration_Test_Stage SHALL mark the Image_Tag as eligible for promotion to Beta
  4. IF the Integration_Test_Runner exits with a non-zero code, THEN THE Integration_Test_Stage SHALL block promotion to Beta and report the failure details
  5. THE Integration_Test_Stage SHALL archive the inttest-results.json as a build artifact
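
The eligibility decision in criteria 2 to 4 might look like the sketch below. The schema of inttest-results.json is not given in this document, so the "passed" and "failed" keys are assumptions used purely for reporting; eligibility itself depends only on the exit code, as the criteria state. The function name is hypothetical.

```python
import json
from typing import Any, Dict


def evaluate_inttest(exit_code: int, results_text: str) -> Dict[str, Any]:
    """Summarize one Integration_Test_Runner run for the promotion gate."""
    try:
        results = json.loads(results_text)
    except json.JSONDecodeError:
        results = {}
    if not isinstance(results, dict):
        results = {}
    return {
        # Criteria 3 and 4: exit code 0 means eligible for Beta.
        "eligible_for_beta": exit_code == 0,
        "exit_code": exit_code,
        # Assumed keys; the real results file may name these differently.
        "passed": results.get("passed"),
        "failed": results.get("failed"),
    }
```

An unreadable results file does not change the decision here; it only leaves the reported counts empty.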

Requirement 8: Beta Stage Deployment

User Story: As a developer, I want a beta environment where newly built images are deployed for smoke testing and manual verification before promotion to paper trading, so that regressions are caught early.

Acceptance Criteria

  1. WHEN an Image_Tag passes the Integration_Test_Stage, THE Beta_Stage SHALL deploy the application with that Image_Tag to a beta namespace or Helm release managed by ArgoCD
  2. WHILE the Beta_Stage is active, THE Kargo_Dashboard SHALL display the currently deployed Image_Tag and its promotion status
  3. WHEN a developer requests promotion from Beta to Paper via the Kargo_Dashboard, THE Beta_Stage SHALL verify that the Image_Tag passed integration tests before allowing promotion
  4. THE Beta_Stage SHALL use the same Helm chart (infra/helm/stonks-oracle/) as production, with beta-specific value overrides

Requirement 9: Paper Trading Stage Deployment

User Story: As a trader, I want a paper trading environment that uses the Alpaca paper broker, so that new builds can be validated against simulated market conditions before going live.

Acceptance Criteria

  1. WHEN an Image_Tag is promoted from Beta, THE Paper_Stage SHALL deploy the application with that Image_Tag to a paper trading namespace managed by ArgoCD
  2. THE Paper_Stage SHALL configure the broker adapter with BROKER_MODE=paper and BROKER_PROVIDER=alpaca using Alpaca paper trading credentials
  3. WHILE Market_Hours are active (09:30-16:00 ET, Monday-Friday), THE Paper_Stage SHALL block automatic and manual promotions to the Paper_Stage unless Break_Glass is activated
  4. WHEN a promotion to Paper is attempted outside Market_Hours, THE Paper_Stage SHALL allow the promotion to proceed
  5. THE Paper_Stage SHALL use the same Helm chart (infra/helm/stonks-oracle/) as production, with paper-specific value overrides

Requirement 10: Live Stage Deployment

User Story: As a platform operator, I want production deployments to require explicit manual approval with notes, so that live trading is protected from accidental or untested deployments.

Acceptance Criteria

  1. WHEN an Image_Tag is promoted from Paper, THE Live_Stage SHALL require explicit manual approval with a notes field before deploying to the stonks-oracle production namespace
  2. THE Live_Stage SHALL deploy the application with the approved Image_Tag via ArgoCD syncing the production Helm release
  3. WHILE Market_Hours are active (09:30-16:00 ET, Monday-Friday), THE Live_Stage SHALL block promotions to the Live_Stage unless Break_Glass is activated
  4. WHEN a promotion to Live is attempted outside Market_Hours with valid approval, THE Live_Stage SHALL allow the promotion to proceed
  5. THE Live_Stage SHALL use the existing stonks-oracle namespace and Helm chart with production values

Requirement 11: Market-Hours Promotion Blocker

User Story: As a risk manager, I want promotions to paper and live environments to be blocked during US market hours, so that deployments do not disrupt active trading sessions.

Acceptance Criteria

  1. WHILE the current time is between 09:30 and 16:00 Eastern Time on a weekday, THE Promotion_Blocker SHALL prevent promotions to the Paper_Stage and Live_Stage
  2. WHEN the current time is outside 09:30-16:00 ET or on a weekend, THE Promotion_Blocker SHALL allow promotions to proceed (subject to other gates)
  3. WHEN a promotion is blocked by the Promotion_Blocker, THE Kargo_Dashboard SHALL display a visual indicator showing the block reason and the time until the market closes
  4. THE Promotion_Blocker SHALL evaluate Eastern Time correctly, accounting for US daylight saving time transitions
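
A minimal sketch of the time check behind these criteria, using the standard-library zoneinfo module so that daylight saving transitions are handled by the tz database. Two assumptions are made here that the criteria leave open: the 16:00 boundary is treated as exclusive, and market holidays are not considered. The function names are hypothetical.

```python
from datetime import datetime, time, timedelta
from typing import Optional
from zoneinfo import ZoneInfo

EASTERN = ZoneInfo("America/New_York")  # resolves EST/EDT automatically
MARKET_OPEN = time(9, 30)
MARKET_CLOSE = time(16, 0)


def is_market_hours(now: datetime) -> bool:
    """Return True if `now` falls in 09:30-16:00 ET on a weekday."""
    if now.tzinfo is None:
        raise ValueError("now must be timezone-aware")
    local = now.astimezone(EASTERN)
    if local.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        return False
    # Assumption: the close boundary is exclusive (16:00:00 is not blocked).
    return MARKET_OPEN <= local.time() < MARKET_CLOSE


def time_until_close(now: datetime) -> Optional[timedelta]:
    """Return the time left until 16:00 ET, or None when not blocked."""
    if not is_market_hours(now):
        return None
    local = now.astimezone(EASTERN)
    close = local.replace(hour=16, minute=0, second=0, microsecond=0)
    return close - local
```

Passing the current time in as a parameter, rather than reading the clock inside the function, keeps the check easy to test around DST changes.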

Requirement 12: Break-Glass Emergency Override

User Story: As a platform operator, I want a break-glass mechanism to bypass market-hours blockers during emergencies, so that critical fixes can be deployed at any time.

Acceptance Criteria

  1. WHEN an operator activates Break_Glass via the Kargo_Dashboard, THE Pipeline_Infrastructure SHALL bypass the Promotion_Blocker for the target Stage
  2. WHEN Break_Glass is activated, THE Kargo_Dashboard SHALL require a confirmation dialog before proceeding
  3. WHEN Break_Glass is activated, THE Pipeline_Infrastructure SHALL require the operator to provide a written justification note
  4. WHEN Break_Glass is used, THE Pipeline_Infrastructure SHALL record the operator identity, timestamp, target Stage, Image_Tag, and justification note in the audit trail
  5. THE Break_Glass mechanism SHALL apply only to the single promotion for which it was activated and SHALL NOT disable the Promotion_Blocker for subsequent promotions
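
One way to model the single-use property in criterion 5 is to bind each override to a specific Stage and Image_Tag and pass it in with the promotion request, so nothing persists for later promotions. The sketch below is illustrative; the class and function names are hypothetical and do not describe Kargo's own API.

```python
from dataclasses import dataclass
from typing import Optional

# Stages guarded by the Promotion_Blocker (Requirement 11).
GATED_STAGES = {"Paper", "Live"}


@dataclass(frozen=True)
class BreakGlass:
    """A confirmed override, bound to exactly one promotion."""
    operator: str
    target_stage: str
    image_tag: str
    justification: str
    confirmed: bool


def promotion_allowed(
    target_stage: str,
    image_tag: str,
    market_open: bool,
    override: Optional[BreakGlass] = None,
) -> bool:
    """Apply the market-hours gate, honoring a matching override."""
    if target_stage not in GATED_STAGES or not market_open:
        return True
    if override is None:
        return False
    # The override must be confirmed, justified, and match this promotion.
    return (
        override.confirmed
        and bool(override.justification.strip())
        and override.target_stage == target_stage
        and override.image_tag == image_tag
    )
```

Because the override carries its own target, reusing it for a different Image_Tag or Stage is rejected by the same check.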

Requirement 13: Per-Stage Enable/Disable Controls

User Story: As a platform operator, I want to independently enable or disable each pipeline stage, so that the pipeline can be configured for different operational modes.

Acceptance Criteria

  1. THE Pipeline_Infrastructure SHALL provide a configuration mechanism to independently enable or disable each of the five stages (CI, Integration_Test, Beta, Paper, Live)
  2. WHEN a Stage is disabled, THE Pipeline_Infrastructure SHALL skip that Stage during promotion and advance the Image_Tag to the next enabled Stage
  3. WHEN a Stage is re-enabled, THE Pipeline_Infrastructure SHALL resume gating promotions through that Stage for new Image_Tags
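
The skip behavior in criterion 2 amounts to finding the next enabled Stage in pipeline order. A minimal sketch follows, assuming the enable flags are available as a simple mapping; the mapping format and function name are hypothetical, and stages missing from the mapping are assumed to be enabled.

```python
from typing import Dict, Optional

# Pipeline order as defined in the glossary entry for Stage.
STAGES = ["CI", "Integration_Test", "Beta", "Paper", "Live"]


def next_enabled_stage(current: str, enabled: Dict[str, bool]) -> Optional[str]:
    """Return the next enabled Stage after `current`, or None at the end."""
    index = STAGES.index(current)  # raises ValueError for unknown stages
    for stage in STAGES[index + 1:]:
        # Assumption: a stage with no explicit flag counts as enabled.
        if enabled.get(stage, True):
            return stage
    return None
```

For example, with Beta disabled, an Image_Tag leaving Integration_Test would advance straight to Paper.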

Requirement 14: Revision Tracking

User Story: As a developer, I want to see which Image_Tag (Git SHA) is deployed at each pipeline stage, so that I can track exactly what code is running in each environment.

Acceptance Criteria

  1. THE Kargo_Dashboard SHALL display the currently deployed Image_Tag for each active Stage
  2. WHEN a promotion occurs, THE Kargo_Dashboard SHALL update the displayed Image_Tag for the target Stage within 60 seconds
  3. THE Pipeline_Infrastructure SHALL maintain a mapping of Stage to current Image_Tag that is queryable via the Kargo API or ArgoCD

Requirement 15: Audit Trail

User Story: As a compliance officer, I want a complete audit trail of all promotions including who promoted, when, with what notes, and whether break-glass was used, so that deployment decisions are traceable.

Acceptance Criteria

  1. WHEN a promotion occurs, THE Pipeline_Infrastructure SHALL record the operator identity, timestamp, source Stage, target Stage, Image_Tag, and any notes provided
  2. WHEN Break_Glass is used for a promotion, THE Pipeline_Infrastructure SHALL record the break-glass justification alongside the standard promotion record
  3. THE Kargo_Dashboard SHALL display the promotion history for each Stage, showing all recorded audit fields
  4. THE Pipeline_Infrastructure SHALL persist audit trail data on NFS_PV so that promotion history survives cluster rebuilds
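
The audit fields named in criteria 1 and 2 could be captured in a record like the one sketched below and serialized one JSON object per line. The record type, field names, and storage format are assumptions for illustration; this document does not specify how the audit data is laid out on the NFS_PV.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class PromotionRecord:
    operator: str
    timestamp: str  # ISO 8601, UTC
    source_stage: str
    target_stage: str
    image_tag: str
    notes: str = ""
    # Set only when Break_Glass was used for this promotion.
    break_glass_justification: Optional[str] = None


def new_record(
    operator: str,
    source_stage: str,
    target_stage: str,
    image_tag: str,
    notes: str = "",
    break_glass_justification: Optional[str] = None,
    now: Optional[datetime] = None,
) -> PromotionRecord:
    """Build a record stamped with the current (or supplied) UTC time."""
    moment = now or datetime.now(tz=timezone.utc)
    return PromotionRecord(
        operator=operator,
        timestamp=moment.astimezone(timezone.utc).isoformat(),
        source_stage=source_stage,
        target_stage=target_stage,
        image_tag=image_tag,
        notes=notes,
        break_glass_justification=break_glass_justification,
    )


def to_json_line(record: PromotionRecord) -> str:
    """Serialize one record as a single JSON line for append-only storage."""
    return json.dumps(asdict(record), sort_keys=True)
```

An append-only line format is one simple option for data that must survive cluster rebuilds, since each promotion adds a line without rewriting earlier history.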

Requirement 16: Kargo Visual Dashboard

User Story: As a platform operator, I want a web dashboard showing all pipeline stages, their current revisions, and promotion controls, so that I can manage deployments visually.

Acceptance Criteria

  1. THE Kargo_Dashboard SHALL display all five Stages with their current deployed Image_Tag and promotion status
  2. THE Kargo_Dashboard SHALL provide a click-to-promote action for advancing an Image_Tag from one Stage to the next
  3. WHEN Market_Hours are active, THE Kargo_Dashboard SHALL display block/allow indicators on the Paper_Stage and Live_Stage
  4. THE Kargo_Dashboard SHALL provide a notes field when promoting or when a promotion is blocked
  5. THE Kargo_Dashboard SHALL provide a Break_Glass button with a confirmation dialog for emergency overrides
  6. THE Kargo_Dashboard SHALL be accessible via Traefik ingress at a *.celestium.life domain with TLS via ca-issuer

Requirement 17: NFS Persistent Storage

User Story: As a platform operator, I want all pipeline state (ArgoCD app configs, Kargo promotion history, ARC data) to persist on NFS volumes, so that pipeline data survives cluster teardowns and rebuilds.

Acceptance Criteria

  1. THE Pipeline_Infrastructure SHALL create PersistentVolumes backed by the NFS share at nfs://192.168.42.8:/volume1/Kubernetes/pipelines for ArgoCD server data, Kargo data, and ARC data
  2. WHEN runmelast.sh is executed, THE NFS_PV resources and their underlying NFS data SHALL remain intact
  3. WHEN runmefirst.sh is executed after a previous teardown, THE Pipeline_Infrastructure SHALL reattach to the existing NFS data and restore previous pipeline state
  4. THE Pipeline_Infrastructure SHALL use separate NFS subdirectories for ArgoCD, Kargo, and ARC to prevent data conflicts
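
As a sketch of criteria 1, 2, and 4, the helper below builds one PersistentVolume manifest per component as a plain dictionary. The subdirectory names, volume names, and the 5Gi capacity are assumptions; the Retain reclaim policy is shown because it is the standard Kubernetes setting that keeps the underlying data when a claim is deleted.

```python
from typing import Any, Dict

NFS_SERVER = "192.168.42.8"
NFS_BASE = "/volume1/Kubernetes/pipelines"
# Assumption: one subdirectory per component, named after the component.
COMPONENTS = ("argocd", "kargo", "arc")


def pv_manifest(component: str, size: str = "5Gi") -> Dict[str, Any]:
    """Build a PersistentVolume manifest for one pipeline component."""
    if component not in COMPONENTS:
        raise ValueError(f"unknown component: {component}")
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolume",
        "metadata": {"name": f"pipelines-{component}"},  # hypothetical name
        "spec": {
            "capacity": {"storage": size},
            "accessModes": ["ReadWriteMany"],
            # Retain keeps the NFS data when the claim is removed.
            "persistentVolumeReclaimPolicy": "Retain",
            "nfs": {
                "server": NFS_SERVER,
                "path": f"{NFS_BASE}/{component}",
            },
        },
    }
```

The resulting dictionaries can be dumped to YAML or JSON and applied with the cluster tooling already in use.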

Requirement 18: ArgoCD GitOps Configuration

User Story: As a platform operator, I want ArgoCD to sync Kubernetes manifests from the Git repository, so that the cluster state is always consistent with the declared configuration.

Acceptance Criteria

  1. THE ArgoCD SHALL be configured with an Application resource pointing to the infra/helm/stonks-oracle/ Helm chart in the celesrenata/stonks-oracle Git repository
  2. WHEN a change is committed to the Helm chart or values files in Git, THE ArgoCD SHALL detect the change and sync the updated manifests to the target namespace
  3. THE ArgoCD SHALL support multiple Application resources for beta, paper, and live environments, each with stage-specific value overrides
  4. IF an ArgoCD sync fails, THEN THE ArgoCD SHALL report the failure status in the ArgoCD UI and the Kargo_Dashboard
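
Criteria 1 and 3 can be illustrated by generating one ArgoCD Application manifest per environment. Several values below are assumptions rather than facts from this document: the repository URL, the beta and paper namespace names, the use of values.yaml for live, the main target revision, and the argocd namespace and default project. The values-beta.yaml and values-paper.yaml file names come from the commit message.

```python
from typing import Any, Dict

# Assumption: the repository is hosted on GitHub under this URL.
REPO_URL = "https://github.com/celesrenata/stonks-oracle.git"
CHART_PATH = "infra/helm/stonks-oracle"

STAGE_CONFIG = {
    # Beta and paper namespace names are hypothetical.
    "beta": {"namespace": "stonks-oracle-beta", "values": "values-beta.yaml"},
    "paper": {"namespace": "stonks-oracle-paper", "values": "values-paper.yaml"},
    # Live uses the existing namespace; values.yaml is assumed.
    "live": {"namespace": "stonks-oracle", "values": "values.yaml"},
}


def application_manifest(stage: str) -> Dict[str, Any]:
    """Build an ArgoCD Application manifest for one environment."""
    cfg = STAGE_CONFIG[stage]  # raises KeyError for unknown stages
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Application",
        "metadata": {"name": f"stonks-oracle-{stage}", "namespace": "argocd"},
        "spec": {
            "project": "default",
            "source": {
                "repoURL": REPO_URL,
                "path": CHART_PATH,
                "targetRevision": "main",
                "helm": {"valueFiles": [cfg["values"]]},
            },
            "destination": {
                "server": "https://kubernetes.default.svc",
                "namespace": cfg["namespace"],
            },
        },
    }
```

All three Applications point at the same chart path and differ only in values file and destination namespace, which matches the shared-chart rule in Requirements 8 and 9.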