Permanent fix for cluster rebuilds:
- OAuth2 client_id/secret baked into woodpecker/values.yaml
- WOODPECKER_AGENT_SECRET shared between server and agents
- runmefirst.sh uses the baked creds if present and creates fresh ones
only if values.yaml still has placeholders
- Agents survive DB wipes since they auth via shared secret
Tests complete in ~7s. The 10-minute timeout was causing unnecessary
wait time on failures. Reduced Job activeDeadlineSeconds and kubectl
wait timeout to 300s.
- rec['mode'] can be 'autonomous' (not just informational/paper/live)
- risk check uses 'check_name'/'result' not 'name'/'passed'
- decision type can be 'execute' not just 'act'/'skip'
- Added pipelineEnabled flag to Helm values (default: true)
- Worker services (scheduler, ingestion, parser, extractor, aggregation,
recommendation, broker-adapter, lake-publisher) scale to 0 when disabled
- API services always run regardless of toggle
- Redis-based runtime toggle: POST /api/ops/pipeline/toggle
- Scheduler checks the flag before each cycle
- Frontend: green/red Pipeline ON/OFF button on the pipeline page
- Beta defaults to pipelineEnabled: false
- Base values.yaml: blanked external URLs (Ollama, Polygon, Alpaca)
so stages only connect to what they explicitly configure
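A minimal sketch of the per-cycle flag check described above. The Redis key name, the accepted truthy strings, and the helper names are assumptions for illustration; the source only states that the flag lives in Redis and that the scheduler checks it before each cycle.

```python
from typing import Callable, Optional, Protocol

# Hypothetical key name; the real key is not given in the source.
PIPELINE_FLAG_KEY = "stonks:pipeline:enabled"


class KeyValueStore(Protocol):
    """Anything with a Redis-like get(); a redis-py client or a plain dict fits."""

    def get(self, key: str) -> Optional[bytes]: ...


def pipeline_enabled(store: KeyValueStore, default: bool = True) -> bool:
    """Read the runtime toggle, falling back to the Helm default when unset."""
    raw = store.get(PIPELINE_FLAG_KEY)
    if raw is None:
        return default
    value = raw.decode() if isinstance(raw, bytes) else str(raw)
    # Assumed encoding: "1", "true", or "on" mean enabled.
    return value.strip().lower() in {"1", "true", "on"}


def run_cycle_if_enabled(
    store: KeyValueStore, run_cycle: Callable[[], None], default: bool = True
) -> bool:
    """Run one scheduler cycle only when the pipeline is switched on."""
    if not pipeline_enabled(store, default):
        return False
    run_cycle()
    return True
```

Passing default=False would mirror the Beta behaviour, where the pipeline stays off until someone toggles it on.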
Base values.yaml now has empty OLLAMA_BASE_URL, MARKET_DATA_BASE_URL,
and BROKER_PROVIDER. Only paper (and eventually live) set the real
URLs. Beta inherits empty defaults so it can't reach external services.
Beta is for API testing only. Scale scheduler, ingestion, parser,
extractor, aggregation, recommendation, broker-adapter, and
lake-publisher to 0 replicas. Blank out Polygon and Alpaca keys.
Infra secrets (postgres, redis, minio) kept so API services work.
Beta is for API testing only. Blanked out Polygon/Alpaca/Ollama
credentials, set OLLAMA_BASE_URL to localhost:99999, and scaled
scheduler/ingestion/parser/extractor/aggregation/recommendation/
broker-adapter/lake-publisher to 0 replicas.
The 30-minute threshold was shorter than the queue drain time, causing
the recovery sweep to re-enqueue docs that were already queued but not
yet processed. Bumped to 4 hours with matching marker TTL.
- All paper stage credentials now in values-paper.yaml so ArgoCD
renders them correctly on every sync (no more empty secrets)
- Added seed-if-empty init container to scheduler: runs the seed
script if the companies table is empty after migrations
Recovery sweeps and the retry endpoint now check a per-document Redis
key (SET NX, 1h TTL) before pushing to the queue. If the marker exists,
the doc is already enqueued and gets skipped. This prevents the
scheduler from re-enqueuing the same parsed docs every 5 minutes.
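The marker check above can be sketched roughly as follows. The marker key layout and the enqueue_once name are assumptions, and InMemoryRedis is only a stand-in so the example runs without a server; a redis-py client accepts the same set(..., nx=True, ex=...) and rpush calls.

```python
from typing import Dict, List, Optional

MARKER_TTL_SECONDS = 3600  # 1h TTL, as in the description above


class InMemoryRedis:
    """Tiny stand-in for a Redis client; TTLs are recorded but never expire."""

    def __init__(self) -> None:
        self.values: Dict[str, str] = {}
        self.ttls: Dict[str, int] = {}
        self.lists: Dict[str, List[str]] = {}

    def set(self, key: str, value: str, nx: bool = False,
            ex: Optional[int] = None) -> Optional[bool]:
        if nx and key in self.values:
            return None  # redis-py also returns None when NX blocks the write
        self.values[key] = value
        if ex is not None:
            self.ttls[key] = ex
        return True

    def rpush(self, key: str, *items: str) -> int:
        self.lists.setdefault(key, []).extend(items)
        return len(self.lists[key])


def enqueue_once(client, queue: str, doc_id: str,
                 ttl: int = MARKER_TTL_SECONDS) -> bool:
    """Push doc_id onto the queue unless a marker says it is already enqueued."""
    marker = f"{queue}:enqueued:{doc_id}"  # hypothetical marker key layout
    # SET NX succeeds only for the first caller; later callers skip the push.
    if not client.set(marker, "1", nx=True, ex=ttl):
        return False
    client.rpush(queue, doc_id)
    return True
```

Because SET NX is atomic, the recovery sweep and the retry endpoint can race on the same document and only one of them will push it.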
The pipeline health, SSE stream, and retry endpoints were hardcoding
'stonks:queue:{name}' but services use a DEPLOY_STAGE prefix
('stonks:paper:queue:{name}'). They now use queue_key() from redis_keys.py.
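For reference, a helper with this shape might look like the sketch below. The signature and the error raised when DEPLOY_STAGE is unset are assumptions; the actual queue_key() in redis_keys.py may differ.

```python
import os
from typing import Optional


def queue_key(name: str, stage: Optional[str] = None) -> str:
    """Build a stage-prefixed queue key such as 'stonks:paper:queue:<name>'."""
    # Fall back to the DEPLOY_STAGE environment variable when no stage is passed.
    resolved = stage if stage is not None else os.environ.get("DEPLOY_STAGE", "")
    if not resolved:
        # Assumed behaviour: fail loudly rather than silently use an unprefixed key.
        raise ValueError("DEPLOY_STAGE is not set")
    return f"stonks:{resolved}:queue:{name}"
```

Routing every endpoint through one helper keeps the API and the workers on the same key, which is what the hardcoded strings broke.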
The extraction queue had 3000+ SEC filings backed up with a single
extractor pod processing them at 10-115s each. Ollama handles
concurrent requests so multiple extractor pods can share the GPU.
- POST /api/ops/pipeline/retry-failed endpoint resets extraction_failed
docs to parsed, deletes failed intelligence rows, and re-enqueues
them (batch of 200)
- Scheduler now auto-retries extraction_failed docs every ~10 minutes
(100 per cycle, 60-min cooldown per doc)
- Pipeline page shows 'Retry Failed (N)' button when extraction_failed
count > 0, with pending/success/error states
The polling loop checked conditions[0].type which missed the Complete
condition when it wasn't at index 0. Switch to kubectl wait
--for=condition=complete, which handles condition matching reliably.
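To illustrate the index bug, here is a small sketch of condition matching that scans every entry instead of reading conditions[0]. The job_outcome name and the dict input (the shape of kubectl get job -o json output) are assumptions, not the code that was replaced.

```python
from typing import Any, Dict, Optional


def job_outcome(job: Dict[str, Any]) -> Optional[str]:
    """Return 'Complete' or 'Failed' once the Job reports it, else None."""
    conditions = job.get("status", {}).get("conditions") or []
    for cond in conditions:
        # Condition status is the string "True", "False", or "Unknown".
        if cond.get("status") == "True" and cond.get("type") in ("Complete", "Failed"):
            return cond["type"]
    return None


# A Job whose Complete condition is not at index 0, which conditions[0].type misses.
finished = {
    "status": {
        "conditions": [
            {"type": "SuccessCriteriaMet", "status": "True"},
            {"type": "Complete", "status": "True"},
        ]
    }
}
print(job_outcome(finished))
```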
The inline catalyst_type query in GET /api/patterns/{ticker} referenced
dir.document_id which does not exist on document_impact_records. The
table links to documents via intelligence_id -> document_intelligence ->
document_id. Added the missing JOIN to match the pattern used in
_SELF_PATTERN_QUERY.
1. patterns endpoint: fix query referencing non-existent column
di.catalyst_type → dir.catalyst_type (column is on document_impact_records)
2. lockouts seed: use relative timestamps (now + 7d) so active lockout
is always in the future regardless of when tests run
3. create_agent: make slug optional with auto-generation from name
4. create_source: json.dumps(config) + ::jsonb cast for asyncpg JSONB compat
5. approval_expiry: return count as int (len(expired)) not the list itself
6. metrics_consistency: fix test assertion to match API contract
(total >= active + reserve, not total == active + reserve + unrealized)
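Item 4 can be sketched as below. By default asyncpg expects a JSON string, not a dict, for a jsonb parameter, hence json.dumps plus the explicit cast. The table name, column names, and helper name are hypothetical.

```python
import json
from typing import Any, Dict, Tuple

# Hypothetical table and columns; only the $2::jsonb cast reflects the fix.
CREATE_SOURCE_SQL = (
    "INSERT INTO sources (name, config) "
    "VALUES ($1, $2::jsonb) RETURNING id"
)


def create_source_args(name: str, config: Dict[str, Any]) -> Tuple[str, str]:
    """Serialize config to a JSON string so it can bind to the jsonb parameter."""
    return name, json.dumps(config)
```

With an open asyncpg connection, the call would be along the lines of `await conn.fetchval(CREATE_SOURCE_SQL, *create_source_args(name, config))`.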
- Poll job status instead of kubectl wait (catches Failed condition
immediately instead of waiting 600s for Complete that never comes)
- Replace grep -oP (Perl regex) with POSIX grep -o (BusyBox compat)
- StorageClass longhorn-rwx with hard NFS mount (no softerr)
- Prevents Remote I/O errors from share-manager startup race
- RWX allows steps to spread across all 4 cluster nodes
- podAntiAffinity ensures different workflows use different nodes