phase 0+1: project scaffold, k8s manifests, CI pipeline, steering, hooks, tests

- Repository structure for all services, infra, lakehouse, dashboards
- K8s manifests targeting stonks-oracle namespace with GHCR images
- Ingress via Traefik with ca-issuer TLS for internal services
- ConfigMap wired to existing cluster services (pg, redis, minio, ollama)
- GitHub Actions workflow for lint, test, multi-service container builds
- Dockerfile with build-arg CMD per service
- Makefile for local build/push/deploy
- Steering rules for TDD workflow, K8s conventions, project context
- Agent hooks for lint-on-save, test-on-save, k8s-validate, phase-commit
- Ruff linter config, all lint issues fixed
- 14 passing tests for schemas, config, redis keys
- PostgreSQL migrations, Trino catalogs, Superset config, MinIO lifecycle
Celes Renata
2026-04-11 03:25:08 -07:00
parent 8cfc4f423b
commit ebea70573b
90 changed files with 3590 additions and 19 deletions
@@ -0,0 +1,269 @@
# Stonks Oracle - Requirements
## Overview
This feature builds an AI-assisted market intelligence, execution, and analytics platform for a Kubernetes-hosted environment. The platform ingests market symbols, licensed market data, company-specific news, regulatory filings, scraped web sources, and broker execution events; stores raw and normalized artifacts; extracts structured JSON with local Ollama models; computes trend and sentiment summaries; and optionally places trades through a broker integration.
The platform SHALL also maintain a local analytics lake on MinIO using Hive-compatible partitioned data, support Athena-like SQL querying over captured market and trade data, and expose QuickSight-like dashboards for research, review, and audit.
The initial release is focused on reliable ingestion, deterministic structured extraction, explainable trend scoring, paper trading safety, and internal analytics visibility.
## User Stories
- As an operator, I want to register companies, tickers, sectors, watchlists, and source rules so the system knows what to monitor.
- As an analyst, I want every raw article, filing, market snapshot, and scrape artifact preserved so I can audit downstream AI conclusions.
- As a data engineer, I want structured JSON extraction from each article and filing so downstream analytics are queryable.
- As a strategist, I want aggregated trend assessments per symbol, sector, and market regime so I can evaluate opportunities.
- As a trader, I want the system to generate explainable trade recommendations with explicit confidence, catalysts, and risk notes.
- As a risk owner, I want strict controls on automated trading so the system cannot place unsafe orders.
- As a quantitative reviewer, I want to query historical market data, AI predictions, and executed trades in one SQL-accessible analytics plane.
- As a dashboard user, I want QuickSight-like visualizations for performance, signal quality, prediction accuracy, and model behavior.
- As a platform owner, I want the system to run fully inside Kubernetes against local Ollama and self-hosted analytics components.
## Functional Requirements
### 1. Watchlist and source management
#### Requirement 1.1
WHEN an operator creates or updates a tracked company
THE SYSTEM SHALL persist the company profile including ticker, legal name, aliases, exchange, sector, industry, market cap bucket, and source configuration.
#### Requirement 1.2
WHEN an operator defines a source configuration for a company
THE SYSTEM SHALL support source types including market data APIs, news API feeds, SEC or investor relations URLs, company press release pages, earnings transcript sources, curated web pages, and broker-linked execution sources.
#### Requirement 1.3
WHEN a company has aliases, brands, or product names
THE SYSTEM SHALL use those aliases during source retrieval, de-duplication, entity matching, and extraction.
### 2. External API integrations
#### Requirement 2.1
WHEN the scheduler triggers a market ingestion cycle
THE SYSTEM SHALL fetch configured market data API results for tracked companies and persist raw response payloads.
#### Requirement 2.2
WHEN the scheduler triggers a news ingestion cycle
THE SYSTEM SHALL fetch configured news API results for tracked companies and persist raw response payloads.
#### Requirement 2.3
WHEN the scheduler triggers a regulatory ingestion cycle
THE SYSTEM SHALL fetch configured filing or issuer event data from authoritative sources such as SEC-style APIs and persist raw response payloads.
#### Requirement 2.4
WHEN trade automation is enabled
THE SYSTEM SHALL integrate with at least one broker API that supports paper trading, order placement, order status retrieval, positions, account balances, and execution events.
#### Requirement 2.5
WHEN external APIs enforce rate limits or quotas
THE SYSTEM SHALL coordinate request pacing, retries, and backoff across workers.
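
One common way to pace requests is a token bucket. The sketch below is only an in-process illustration: the `TokenBucket` name, its fields, and the injected `now` clock are assumptions made for this example, and coordination across workers would need the bucket state held in a shared store, which is not shown here.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    """Hypothetical in-process pacing helper; not part of the original spec."""

    rate: float        # tokens refilled per second
    capacity: float    # maximum burst size
    tokens: float      # tokens currently available
    updated_at: float  # timestamp (seconds) of the last refill

    def try_acquire(self, now: float, cost: float = 1.0) -> bool:
        """Refill based on elapsed time, then spend `cost` tokens if possible."""
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated_at = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# Usage: allow bursts of 2 requests, refilling 1 request per second.
bucket = TokenBucket(rate=1.0, capacity=2.0, tokens=2.0, updated_at=0.0)
allowed = bucket.try_acquire(now=0.0)
```

Passing the clock in explicitly keeps the helper deterministic under test; a worker would pass `time.monotonic()`.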
### 3. Ingestion and raw artifact retention
#### Requirement 3.1
WHEN a scraper retrieves an article, filing, or web page
THE SYSTEM SHALL store the raw HTML, rendered text, metadata, retrieval timestamp, and retrieval source in object storage.
#### Requirement 3.2
WHEN an article, filing, or market payload is ingested
THE SYSTEM SHALL generate a stable content hash and use it to prevent duplicate processing.
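
A minimal sketch of a stable content hash, assuming SHA-256 over Unicode-normalized, whitespace-collapsed text. The normalization rules and the `content_hash` and `is_duplicate` names are illustrative choices, not something the requirement prescribes.

```python
import hashlib
import unicodedata


def content_hash(text: str) -> str:
    """Hash text after NFC normalization and whitespace collapsing (assumed rules)."""
    normalized = unicodedata.normalize("NFC", text)
    normalized = " ".join(normalized.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def is_duplicate(text: str, seen_hashes: set) -> bool:
    """Return True if the content was already seen; otherwise remember it."""
    digest = content_hash(text)
    if digest in seen_hashes:
        return True
    seen_hashes.add(digest)
    return False


# Usage: the in-memory set stands in for whatever durable store holds hashes.
seen = set()
first = is_duplicate("ACME beats  earnings\n", seen)
second = is_duplicate("ACME beats earnings", seen)
```

Collapsing whitespace before hashing makes the hash insensitive to formatting differences between retrievals of the same article, at the cost of treating whitespace-only edits as duplicates.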
#### Requirement 3.3
WHEN the system stores a raw artifact
THE SYSTEM SHALL persist an associated metadata record containing symbol, source, URL when applicable, title, publication time, retrieval time, language when applicable, and content hash.
#### Requirement 3.4
WHEN content retrieval fails
THE SYSTEM SHALL record the failure reason, retry policy state, and next eligible retry time.
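
The failure record could look roughly like the sketch below. The field names, the 30-second base delay, the doubling schedule, and the five-attempt cap are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass(frozen=True)
class RetrievalFailure:
    """Hypothetical failure record; field names are illustrative."""

    url: str
    reason: str
    attempt: int                        # 1 for the first failed attempt
    failed_at: datetime
    next_retry_at: Optional[datetime]   # None once retries are exhausted


def record_failure(
    url: str,
    reason: str,
    attempt: int,
    failed_at: datetime,
    base_seconds: float = 30.0,
    max_attempts: int = 5,
) -> RetrievalFailure:
    """Compute the next eligible retry time with exponential backoff."""
    if attempt >= max_attempts:
        next_retry_at = None
    else:
        delay = base_seconds * 2 ** (attempt - 1)
        next_retry_at = failed_at + timedelta(seconds=delay)
    return RetrievalFailure(url, reason, attempt, failed_at, next_retry_at)


# Usage with a placeholder URL.
failure = record_failure(
    "https://example.com/article",
    "timeout",
    attempt=1,
    failed_at=datetime(2026, 4, 11, tzinfo=timezone.utc),
)
```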
### 4. Parsing and normalization
#### Requirement 4.1
WHEN a raw article or filing enters the parsing stage
THE SYSTEM SHALL extract normalized text, author data when available, publisher, tags, mentioned entities, outbound links, and document type.
#### Requirement 4.2
WHEN the system detects boilerplate or repeated template text
THE SYSTEM SHALL reduce or remove boilerplate before AI extraction while retaining the original raw artifact for audit.
#### Requirement 4.3
WHEN the parser cannot confidently extract article body text
THE SYSTEM SHALL flag the document for low-quality extraction and prevent it from influencing downstream trading until reviewed or reprocessed.
### 5. AI article and document extraction
#### Requirement 5.1
WHEN a normalized article or filing is ready for AI extraction
THE SYSTEM SHALL send the document to a local Ollama model using structured output with an explicit JSON schema.
#### Requirement 5.2
WHEN the model returns extraction output
THE SYSTEM SHALL validate the response against the expected schema before saving it.
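
As an illustration of the validation step, the sketch below checks a model response using only the standard library. The field subset, the 0 to 1 confidence range, and the `validate_extraction` name are assumptions; a production implementation would more likely use a dedicated schema validation library.

```python
import json

# Assumed subset of required fields and their JSON types.
REQUIRED_FIELDS = {
    "document_id": str,
    "document_type": str,
    "summary": str,
    "tickers": list,
    "confidence": (int, float),
}


def validate_extraction(raw: str) -> list:
    """Return a list of validation errors; an empty list means the payload passed."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(payload, dict):
        return ["top-level value must be an object"]

    errors = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in payload:
            errors.append(f"missing field: {name}")
        elif isinstance(payload[name], bool) or not isinstance(payload[name], expected):
            # bool is a subclass of int in Python, so reject it explicitly.
            errors.append(f"wrong type for field: {name}")

    confidence = payload.get("confidence")
    if (
        isinstance(confidence, (int, float))
        and not isinstance(confidence, bool)
        and not 0.0 <= confidence <= 1.0
    ):
        errors.append("confidence must be between 0 and 1")
    return errors
```

Returning the full error list rather than raising on the first problem makes it straightforward to store the validation errors alongside the failed output.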
#### Requirement 5.3
WHEN extraction succeeds
THE SYSTEM SHALL produce a canonical document intelligence object with at minimum:
- document_id
- document_type
- source metadata
- tickers referenced
- companies referenced
- document summary
- sentiment by company
- catalyst type
- impact horizon
- key facts
- risks mentioned
- macro themes
- confidence score
- extraction warnings
- model metadata
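
The field list above could map onto a record type along these lines. This is a sketch only: the Python field names, the types, and the choice of which fields default to empty lists are assumptions, since the requirement names the fields but not their shapes.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class DocumentIntelligence:
    """Hypothetical shape for the canonical document intelligence object."""

    document_id: str
    document_type: str
    source_metadata: dict
    tickers: list
    companies: list
    summary: str
    sentiment_by_company: dict  # assumed: ticker -> sentiment score
    catalyst_type: str
    impact_horizon: str
    confidence: float
    model_metadata: dict        # assumed: model name, prompt version, etc.
    key_facts: list = field(default_factory=list)
    risks: list = field(default_factory=list)
    macro_themes: list = field(default_factory=list)
    extraction_warnings: list = field(default_factory=list)

    def to_json(self) -> str:
        """Serialize with sorted keys so the output is stable across runs."""
        return json.dumps(asdict(self), sort_keys=True)


# Usage with placeholder values.
doc = DocumentIntelligence(
    document_id="doc-1",
    document_type="news_article",
    source_metadata={"publisher": "Example Wire"},
    tickers=["ACME"],
    companies=["Acme Corp"],
    summary="Acme reports quarterly results.",
    sentiment_by_company={"ACME": 0.4},
    catalyst_type="earnings",
    impact_horizon="short_term",
    confidence=0.8,
    model_metadata={"model": "example-model"},
)
payload = doc.to_json()
```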
#### Requirement 5.4
WHEN the model response is invalid, incomplete, or contains claims not supported by the source document
THE SYSTEM SHALL retry extraction according to policy and preserve both the failed output and validation errors.
#### Requirement 5.5
WHEN a document is materially relevant to multiple companies
THE SYSTEM SHALL emit one shared document record and one or more per-company impact records.
### 6. Aggregation and trend analysis
#### Requirement 6.1
WHEN multiple document intelligence objects and market observations exist for a company
THE SYSTEM SHALL generate a rolling company trend summary over configurable windows including intraday, 1-day, 7-day, 30-day, and 90-day intervals.
#### Requirement 6.2
WHEN generating a company trend summary
THE SYSTEM SHALL consider sentiment, catalyst frequency, source credibility, recency decay, contradiction detection, document novelty, and current market context.
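
To make recency decay and source credibility concrete, here is one possible weighting scheme: each observation is weighted by its credibility times an exponential half-life decay. The half-life form, the 24-hour default, and the `decayed_sentiment` name are assumptions, and the sketch deliberately ignores the other listed factors (catalyst frequency, contradiction detection, novelty, market context).

```python
from typing import Iterable, Optional, Tuple


def decayed_sentiment(
    observations: Iterable[Tuple[float, float, float]],
    now_hours: float,
    half_life_hours: float = 24.0,
) -> Optional[float]:
    """Credibility- and recency-weighted mean sentiment.

    Each observation is (timestamp_hours, sentiment, credibility).
    Returns None when there is no usable weight.
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for timestamp_hours, sentiment, credibility in observations:
        age = max(0.0, now_hours - timestamp_hours)
        weight = credibility * 0.5 ** (age / half_life_hours)
        weighted_sum += weight * sentiment
        total_weight += weight
    if total_weight <= 0.0:
        return None
    return weighted_sum / total_weight


# Usage: an older positive item and a fresh negative item of equal credibility.
score = decayed_sentiment([(0.0, 1.0, 1.0), (24.0, -1.0, 1.0)], now_hours=24.0)
```

With a 24-hour half-life, the day-old item carries half the weight of the fresh one, so the combined score leans negative.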
#### Requirement 6.3
WHEN generating a market-wide trend summary
THE SYSTEM SHALL aggregate company-level signals into sector and market-level summaries.
#### Requirement 6.4
WHEN contradictory signals exist across sources
THE SYSTEM SHALL represent disagreement explicitly rather than collapsing it into a single unsupported conclusion.
#### Requirement 6.5
WHEN a trend summary is produced
THE SYSTEM SHALL include explainability fields listing the top supporting and opposing evidence.
### 7. Trade recommendation generation
#### Requirement 7.1
WHEN a company trend summary is available
THE SYSTEM SHALL be able to generate a recommendation object containing action type, thesis, confidence, expected horizon, invalidation conditions, and cited evidence.
#### Requirement 7.2
WHEN a recommendation is generated
THE SYSTEM SHALL separate descriptive analysis from prescriptive trade action and include a risk classification.
#### Requirement 7.3
WHEN the system proposes a trade
THE SYSTEM SHALL attach position sizing guidance based on configured portfolio rules rather than unconstrained model output.
#### Requirement 7.4
WHEN the confidence or data quality falls below configured thresholds
THE SYSTEM SHALL suppress automated trade eligibility and mark the recommendation as informational only.
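
The gating rule can be sketched as a pure function. The threshold defaults, the label strings, and the `Eligibility` type below are illustrative assumptions rather than values taken from the requirements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Eligibility:
    """Hypothetical result of the threshold gate."""

    auto_trade_eligible: bool
    label: str
    reasons: tuple


def classify_recommendation(
    confidence: float,
    data_quality: float,
    min_confidence: float = 0.7,
    min_data_quality: float = 0.8,
) -> Eligibility:
    """Mark a recommendation informational-only if any threshold is missed."""
    reasons = []
    if confidence < min_confidence:
        reasons.append("confidence below threshold")
    if data_quality < min_data_quality:
        reasons.append("data quality below threshold")
    if reasons:
        return Eligibility(False, "informational_only", tuple(reasons))
    return Eligibility(True, "trade_eligible", ())


# Usage: low confidence suppresses automated trade eligibility.
result = classify_recommendation(confidence=0.55, data_quality=0.9)
```

Recording the reasons alongside the label keeps the suppression explainable when an operator later inspects the recommendation.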
### 8. Trade execution and safety controls
#### Requirement 8.1
WHEN trade automation is enabled
THE SYSTEM SHALL support paper trading mode and live trading mode as separate execution environments.
#### Requirement 8.2
WHEN live trading mode is enabled
THE SYSTEM SHALL require operator approval controls, risk limits, and broker credential isolation.
#### Requirement 8.3
WHEN the system places an order
THE SYSTEM SHALL persist the full decision trace including signals used, prompt versions, model versions, thresholds, and broker response.
#### Requirement 8.4
WHEN a proposed order violates configured risk controls
THE SYSTEM SHALL reject the order before broker submission.
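
A pre-submission risk check might be structured as below. The specific limits (per-order notional, per-position notional, symbol allowlist) are hypothetical examples of "configured risk controls", and the sketch assumes buy orders only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RiskLimits:
    """Hypothetical risk configuration; real controls would be broader."""

    max_order_notional: float
    max_position_notional: float
    allowed_symbols: frozenset


def risk_violations(
    symbol: str,
    quantity: int,
    price: float,
    current_position_notional: float,
    limits: RiskLimits,
) -> list:
    """Return all violated controls; an empty list means the order may proceed."""
    if quantity <= 0 or price <= 0:
        return ["quantity and price must be positive"]

    violations = []
    notional = quantity * price
    if symbol not in limits.allowed_symbols:
        violations.append("symbol not allowed")
    if notional > limits.max_order_notional:
        violations.append("order notional exceeds limit")
    if current_position_notional + notional > limits.max_position_notional:
        violations.append("position notional exceeds limit")
    return violations


# Usage: the caller submits to the broker only when the list is empty.
limits = RiskLimits(1000.0, 5000.0, frozenset({"ACME"}))
problems = risk_violations("ACME", 30, 50.0, 0.0, limits)
```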
#### Requirement 8.5
WHEN a broker API is unavailable or partially fails
THE SYSTEM SHALL fail closed and SHALL NOT place duplicate or ambiguous orders.
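
One way to avoid duplicate or ambiguous orders is a deterministic client order ID that is recorded before the broker call. In this sketch the `BrokerUnavailable` exception, the `submit` callable, and the in-memory set (standing in for durable storage) are all assumptions; whether a given broker API accepts client-supplied order IDs would need to be confirmed.

```python
import hashlib
from typing import Callable


class BrokerUnavailable(Exception):
    """Hypothetical error raised when the broker call fails or times out."""


def client_order_id(recommendation_id: str, symbol: str, side: str) -> str:
    """Derive the same ID for the same decision, so retries cannot fork it."""
    raw = f"{recommendation_id}|{symbol}|{side}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:32]


def submit_once(
    order_id: str,
    submit: Callable[[str], None],
    submitted_ids: set,
) -> str:
    """Return 'submitted', 'duplicate', or 'halted'."""
    if order_id in submitted_ids:
        return "duplicate"
    # Record intent before calling the broker: if the call fails ambiguously,
    # the ID stays recorded and the order is never blindly resubmitted.
    submitted_ids.add(order_id)
    try:
        submit(order_id)
    except BrokerUnavailable:
        return "halted"  # fail closed; reconcile with the broker before retrying
    return "submitted"
```

A `halted` result is intended to route the order to reconciliation (for example, querying broker order status by ID) rather than to an automatic retry.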
### 9. Storage and queryability
#### Requirement 9.1
WHEN storing raw artifacts
THE SYSTEM SHALL use MinIO object storage as the system of record for HTML, text, API payloads, prompts, model outputs, and exported analytical datasets.
#### Requirement 9.2
WHEN storing normalized relational data
THE SYSTEM SHALL use PostgreSQL for companies, watchlists, article metadata, document intelligence objects, trends, recommendations, operational execution records, and control-plane state.
#### Requirement 9.3
WHEN low-latency coordination or caching is required
THE SYSTEM SHALL use Redis for job state, distributed locks, short-lived caches, and rate-limit coordination.
#### Requirement 9.4
WHEN historical analytical queries are needed
THE SYSTEM SHALL persist analytical fact datasets in Hive-compatible partitioned form on MinIO so that market data, predictions, and trade outcomes can be queried together.
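
Hive-compatible partitioning encodes partition columns as `key=value` path segments. The sketch below builds such an object key prefix; the `warehouse` prefix, the `dt` and `symbol` partition columns, and the table name are hypothetical choices for illustration.

```python
from datetime import date
from urllib.parse import quote


def partition_prefix(
    table: str,
    symbol: str,
    trade_date: date,
    root: str = "warehouse",
) -> str:
    """Build a Hive-style key prefix such as root/table/dt=YYYY-MM-DD/symbol=XYZ/."""
    # Percent-encode the symbol so characters like "/" cannot break the layout.
    safe_symbol = quote(symbol, safe="")
    return f"{root}/{table}/dt={trade_date.isoformat()}/symbol={safe_symbol}/"


# Usage: data files for one symbol and day live under this prefix.
prefix = partition_prefix("market_bars", "ACME", date(2026, 4, 11))
```

Putting the date partition first lets a query engine prune by date range before symbol, which suits time-bounded analytical queries; the opposite order may fit better if most queries target a single symbol.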
#### Requirement 9.5
WHEN analytical table management is required
THE SYSTEM SHALL support a lakehouse table abstraction that permits append-only fact ingestion, partitioned queries, and schema evolution.
### 10. SQL analytics and dashboards
#### Requirement 10.1
WHEN a user or service executes an analytical query
THE SYSTEM SHALL provide an Athena-like SQL query service over MinIO-hosted analytical datasets.
#### Requirement 10.2
WHEN a dashboard user explores market, prediction, and trade data
THE SYSTEM SHALL expose QuickSight-like dashboards for performance, confidence, prediction accuracy, evidence coverage, and model behavior.
#### Requirement 10.3
WHEN analytical results combine AI outputs with executed trades and market outcomes
THE SYSTEM SHALL support joins across predicted signals, broker executions, and realized performance data.
#### Requirement 10.4
WHEN dashboards or research queries need drill-down capability
THE SYSTEM SHALL provide traceability from analytical aggregates back to underlying documents, prompts, model outputs, and raw artifacts.
### 11. APIs and UI
#### Requirement 11.1
WHEN a client requests company analytics
THE SYSTEM SHALL expose APIs for document timelines, trend summaries, recommendation history, execution history, and evidence drill-down.
#### Requirement 11.2
WHEN an operator inspects a recommendation
THE SYSTEM SHALL display the contributing document intelligence objects, the raw sources used, and any market context features that influenced the decision.
#### Requirement 11.3
WHEN a user reviews an order decision
THE SYSTEM SHALL expose a full audit trail from ingestion through broker execution and eventual market outcome.
### 12. Observability and operations
#### Requirement 12.1
WHEN a pipeline stage runs
THE SYSTEM SHALL emit structured logs, metrics, and traces for ingestion, parsing, extraction, aggregation, analytics publication, and trading.
#### Requirement 12.2
WHEN model performance degrades
THE SYSTEM SHALL surface schema failure rates, latency percentiles, token usage estimates, and extraction retry counts.
#### Requirement 12.3
WHEN source coverage changes materially
THE SYSTEM SHALL alert operators about sustained source failures, symbol coverage gaps, or analytical publication lag.
## Non-Functional Requirements
#### Requirement N1
WHEN the system processes documents and market events concurrently
THE SYSTEM SHALL support horizontal scaling across Kubernetes workers.
#### Requirement N2
WHEN the system stores model-derived conclusions
THE SYSTEM SHALL preserve enough provenance to reproduce or challenge those conclusions later.
#### Requirement N3
WHEN the system handles licensed or restricted content
THE SYSTEM SHALL preserve source metadata, access policy, and retention policy for each artifact.
#### Requirement N4
WHEN the system publishes analytical datasets
THE SYSTEM SHALL ensure queryable partitions are written atomically or with an equivalent consistency guarantee.
#### Requirement N5
WHEN trade execution is enabled
THE SYSTEM SHALL prioritize fail-closed behavior over availability in ambiguous conditions.
#### Requirement N6
WHEN dashboards query large historical datasets
THE SYSTEM SHALL support partition pruning and index or metadata strategies that keep typical analyst queries responsive.