phase 0+1: project scaffold, k8s manifests, CI pipeline, steering, hooks, tests

- Repository structure for all services, infra, lakehouse, dashboards
- K8s manifests targeting stonks-oracle namespace with GHCR images
- Ingress via Traefik with ca-issuer TLS for internal services
- ConfigMap wired to existing cluster services (pg, redis, minio, ollama)
- GitHub Actions workflow for lint, test, multi-service container builds
- Dockerfile with build-arg CMD per service
- Makefile for local build/push/deploy
- Steering rules for TDD workflow, K8s conventions, project context
- Agent hooks for lint-on-save, test-on-save, k8s-validate, phase-commit
- Ruff linter config, all lint issues fixed
- 14 passing tests for schemas, config, redis keys
- PostgreSQL migrations, Trino catalogs, Superset config, MinIO lifecycle
2026-04-11 03:25:08 -07:00
parent 8cfc4f423b
commit ebea70573b
90 changed files with 3590 additions and 19 deletions
@@ -0,0 +1,269 @@
+# Stonks Oracle - Requirements
+
+## Overview
+This feature builds an AI-assisted market intelligence, execution, and analytics platform for a Kubernetes-hosted environment. The platform ingests market symbols, licensed market data, company-specific news, regulatory filings, scraped web sources, and broker execution events; stores raw and normalized artifacts; extracts structured JSON with local Ollama models; computes trend and sentiment summaries; and optionally places trades through a broker integration.
+
+The platform SHALL also maintain a local analytics lake on MinIO using Hive-compatible partitioned data, support Athena-like SQL querying over captured market and trade data, and expose QuickSight-like dashboards for research, review, and audit.
+
+The initial release is focused on reliable ingestion, deterministic structured extraction, explainable trend scoring, paper trading safety, and internal analytics visibility.
+
+## User Stories
+- As an operator, I want to register companies, tickers, sectors, watchlists, and source rules so the system knows what to monitor.
+- As an analyst, I want every raw article, filing, market snapshot, and scrape artifact preserved so I can audit downstream AI conclusions.
+- As a data engineer, I want structured JSON extraction from each article and filing so downstream analytics are queryable.
+- As a strategist, I want aggregated trend assessments per symbol, sector, and market regime so I can evaluate opportunities.
+- As a trader, I want the system to generate explainable trade recommendations with explicit confidence, catalysts, and risk notes.
+- As a risk owner, I want strict controls on automated trading so the system cannot place unsafe orders.
+- As a quantitative reviewer, I want to query historical market data, AI predictions, and executed trades in one SQL-accessible analytics plane.
+- As a dashboard user, I want QuickSight-like visualizations for performance, signal quality, prediction accuracy, and model behavior.
+- As a platform owner, I want the system to run fully inside Kubernetes against local Ollama and self-hosted analytics components.
+
+## Functional Requirements
+
+### 1. Watchlist and source management
+#### Requirement 1.1
+WHEN an operator creates or updates a tracked company
+THE SYSTEM SHALL persist the company profile including ticker, legal name, aliases, exchange, sector, industry, market cap bucket, and source configuration.
+
+#### Requirement 1.2
+WHEN an operator defines a source configuration for a company
+THE SYSTEM SHALL support source types including market data APIs, news API feeds, SEC or investor relations URLs, company press release pages, earnings transcript sources, curated web pages, and broker-linked execution sources.
+
+#### Requirement 1.3
+WHEN a company has aliases, brands, or product names
+THE SYSTEM SHALL use those aliases during source retrieval, de-duplication, entity matching, and extraction.
+
+### 2. External API integrations
+#### Requirement 2.1
+WHEN the scheduler triggers a market ingestion cycle
+THE SYSTEM SHALL fetch configured market data API results for tracked companies and persist raw response payloads.
+
+#### Requirement 2.2
+WHEN the scheduler triggers a news ingestion cycle
+THE SYSTEM SHALL fetch configured news API results for tracked companies and persist raw response payloads.
+
+#### Requirement 2.3
+WHEN the scheduler triggers a regulatory ingestion cycle
+THE SYSTEM SHALL fetch configured filing or issuer event data from authoritative sources such as SEC-style APIs and persist raw response payloads.
+
+#### Requirement 2.4
+WHEN trade automation is enabled
+THE SYSTEM SHALL integrate with at least one broker API that supports paper trading, order placement, order status retrieval, positions, account balances, and execution events.
+
+#### Requirement 2.5
+WHEN external APIs enforce rate limits or quotas
+THE SYSTEM SHALL coordinate request pacing, retries, and backoff across workers.
+
+### 3. Ingestion and raw artifact retention
+#### Requirement 3.1
+WHEN a scraper retrieves an article, filing, or web page
+THE SYSTEM SHALL store the raw HTML, rendered text, metadata, retrieval timestamp, and retrieval source in object storage.
+
+#### Requirement 3.2
+WHEN an article, filing, or market payload is ingested
+THE SYSTEM SHALL generate a stable content hash and use it to prevent duplicate processing.
+
+#### Requirement 3.3
+WHEN the system stores a raw artifact
+THE SYSTEM SHALL persist an associated metadata record containing symbol, source, URL when applicable, title, publication time, retrieval time, language when applicable, and content hash.
+
+#### Requirement 3.4
+WHEN content retrieval fails
+THE SYSTEM SHALL record the failure reason, retry policy state, and next eligible retry time.
+
+### 4. Parsing and normalization
+#### Requirement 4.1
+WHEN a raw article or filing enters the parsing stage
+THE SYSTEM SHALL extract normalized text, author data when available, publisher, tags, mentioned entities, outbound links, and document type.
+
+#### Requirement 4.2
+WHEN the system detects boilerplate or repeated template text
+THE SYSTEM SHALL reduce or remove boilerplate before AI extraction while retaining the original raw artifact for audit.
+
+#### Requirement 4.3
+WHEN the parser cannot confidently extract article body text
+THE SYSTEM SHALL flag the document as a low-quality extraction and prevent it from influencing downstream trading until reviewed or reprocessed.
+
+### 5. AI article and document extraction
+#### Requirement 5.1
+WHEN a normalized article or filing is ready for AI extraction
+THE SYSTEM SHALL send the document to a local Ollama model using structured output with an explicit JSON schema.
+
+#### Requirement 5.2
+WHEN the model returns extraction output
+THE SYSTEM SHALL validate the response against the expected schema before saving it.
+
+#### Requirement 5.3
+WHEN extraction succeeds
+THE SYSTEM SHALL produce a canonical document intelligence object with at minimum:
+- document_id
+- document_type
+- source metadata
+- tickers referenced
+- companies referenced
+- document summary
+- sentiment by company
+- catalyst type
+- impact horizon
+- key facts
+- risks mentioned
+- macro themes
+- confidence score
+- extraction warnings
+- model metadata
+
+#### Requirement 5.4
+WHEN the model response is invalid, incomplete, or hallucinatory
+THE SYSTEM SHALL retry extraction according to policy and preserve both the failed output and validation errors.
+
+#### Requirement 5.5
+WHEN a document is materially relevant to multiple companies
+THE SYSTEM SHALL emit one shared document record and one or more per-company impact records.
+
+### 6. Aggregation and trend analysis
+#### Requirement 6.1
+WHEN multiple document intelligence objects and market observations exist for a company
+THE SYSTEM SHALL generate a rolling company trend summary over configurable windows including intraday, 1-day, 7-day, 30-day, and 90-day intervals.
+
+#### Requirement 6.2
+WHEN generating a company trend summary
+THE SYSTEM SHALL consider sentiment, catalyst frequency, source credibility, recency decay, contradiction detection, document novelty, and current market context.
+
+#### Requirement 6.3
+WHEN generating a market-wide trend summary
+THE SYSTEM SHALL aggregate company-level signals into sector and market-level summaries.
+
+#### Requirement 6.4
+WHEN contradictory signals exist across sources
+THE SYSTEM SHALL represent disagreement explicitly rather than collapsing it into a single unsupported conclusion.
+
+#### Requirement 6.5
+WHEN a trend summary is produced
+THE SYSTEM SHALL include explainability fields listing the top supporting and opposing evidence.
+
+### 7. Trade recommendation generation
+#### Requirement 7.1
+WHEN a company trend summary is available
+THE SYSTEM SHALL be able to generate a recommendation object containing action type, thesis, confidence, expected horizon, invalidation conditions, and cited evidence.
+
+#### Requirement 7.2
+WHEN a recommendation is generated
+THE SYSTEM SHALL separate descriptive analysis from prescriptive trade action and include a risk classification.
+
+#### Requirement 7.3
+WHEN the system proposes a trade
+THE SYSTEM SHALL attach position sizing guidance based on configured portfolio rules rather than unconstrained model output.
+
+#### Requirement 7.4
+WHEN the confidence or data quality falls below configured thresholds
+THE SYSTEM SHALL suppress automated trade eligibility and mark the recommendation as informational only.
+
+### 8. Trade execution and safety controls
+#### Requirement 8.1
+WHEN trade automation is enabled
+THE SYSTEM SHALL support paper trading mode and live trading mode as separate execution environments.
+
+#### Requirement 8.2
+WHEN live trading mode is enabled
+THE SYSTEM SHALL require operator approval controls, risk limits, and broker credential isolation.
+
+#### Requirement 8.3
+WHEN the system places an order
+THE SYSTEM SHALL persist the full decision trace including signals used, prompt versions, model versions, thresholds, and broker response.
+
+#### Requirement 8.4
+WHEN a proposed order violates configured risk controls
+THE SYSTEM SHALL reject the order before broker submission.
+
+#### Requirement 8.5
+WHEN a broker API is unavailable or partially fails
+THE SYSTEM SHALL fail closed and SHALL NOT place duplicate or ambiguous orders.
+
+### 9. Storage and queryability
+#### Requirement 9.1
+WHEN storing raw artifacts
+THE SYSTEM SHALL use MinIO object storage as the system of record for HTML, text, API payloads, prompts, model outputs, and exported analytical datasets.
+
+#### Requirement 9.2
+WHEN storing normalized relational data
+THE SYSTEM SHALL use PostgreSQL for companies, watchlists, article metadata, document intelligence objects, trends, recommendations, operational execution records, and control-plane state.
+
+#### Requirement 9.3
+WHEN low-latency coordination or caching is required
+THE SYSTEM SHALL use Redis for job state, distributed locks, short-lived caches, and rate-limit coordination.
+
+#### Requirement 9.4
+WHEN historical analytical queries are needed
+THE SYSTEM SHALL persist analytical fact datasets in Hive-compatible partitioned form on MinIO so that market data, predictions, and trade outcomes can be queried together.
+
+#### Requirement 9.5
+WHEN analytical table management is required
+THE SYSTEM SHALL support a lakehouse table abstraction that permits append-only fact ingestion, partitioned queries, and schema evolution.
+
+### 10. SQL analytics and dashboards
+#### Requirement 10.1
+WHEN a user or service executes an analytical query
+THE SYSTEM SHALL provide an Athena-like SQL query service over MinIO-hosted analytical datasets.
+
+#### Requirement 10.2
+WHEN a dashboard user explores market, prediction, and trade data
+THE SYSTEM SHALL expose QuickSight-like dashboards for performance, confidence, prediction accuracy, evidence coverage, and model behavior.
+
+#### Requirement 10.3
+WHEN analytical results combine AI outputs with executed trades and market outcomes
+THE SYSTEM SHALL support joins across predicted signals, broker executions, and realized performance data.
+
+#### Requirement 10.4
+WHEN dashboards or research queries need drill-down capability
+THE SYSTEM SHALL provide traceability from analytical aggregates back to underlying documents, prompts, model outputs, and raw artifacts.
+
+### 11. APIs and UI
+#### Requirement 11.1
+WHEN a client requests company analytics
+THE SYSTEM SHALL expose APIs for document timelines, trend summaries, recommendation history, execution history, and evidence drill-down.
+
+#### Requirement 11.2
+WHEN an operator inspects a recommendation
+THE SYSTEM SHALL display the contributing document intelligence objects, the raw sources used, and any market context features that influenced the decision.
+
+#### Requirement 11.3
+WHEN a user reviews an order decision
+THE SYSTEM SHALL expose a full audit trail from ingestion through broker execution and eventual market outcome.
+
+### 12. Observability and operations
+#### Requirement 12.1
+WHEN a pipeline stage runs
+THE SYSTEM SHALL emit structured logs, metrics, and traces for ingestion, parsing, extraction, aggregation, analytics publication, and trading.
+
+#### Requirement 12.2
+WHEN model performance degrades
+THE SYSTEM SHALL surface schema failure rates, latency percentiles, token usage estimates, and extraction retry counts.
+
+#### Requirement 12.3
+WHEN source coverage changes materially
+THE SYSTEM SHALL alert operators about sustained source failures, symbol coverage gaps, or analytical publication lag.
+
+## Non-Functional Requirements
+#### Requirement N1
+WHEN the system processes documents and market events concurrently
+THE SYSTEM SHALL support horizontal scaling across Kubernetes workers.
+
+#### Requirement N2
+WHEN the system stores model-derived conclusions
+THE SYSTEM SHALL preserve enough provenance to reproduce or challenge those conclusions later.
+
+#### Requirement N3
+WHEN the system handles licensed or restricted content
+THE SYSTEM SHALL preserve source metadata, access policy, and retention policy for each artifact.
+
+#### Requirement N4
+WHEN the system publishes analytical datasets
+THE SYSTEM SHALL ensure queryable partitions are written atomically or with an equivalent consistency guarantee.
+
+#### Requirement N5
+WHEN trade execution is enabled
+THE SYSTEM SHALL prioritize fail-closed behavior over availability in ambiguous conditions.
+
+#### Requirement N6
+WHEN dashboards query large historical datasets
+THE SYSTEM SHALL support partition pruning and index or metadata strategies that keep typical analyst queries responsive.
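Reviewer sketch (not part of the commit): Requirements 3.2 and 9.4 are the two that most directly constrain data layout, so a small Python illustration may help when reading them. Everything named here is hypothetical: `content_hash`, `DedupIndex`, and `partition_prefix` do not appear in the commit, SHA-256 is an assumed choice of hash, and the `dt=` / `symbol=` partition columns are an assumed layout. The in-memory set stands in for whatever persistent index the real system would use.

```python
import hashlib
from datetime import datetime, timezone


def content_hash(payload: bytes) -> str:
    """Return a stable SHA-256 hex digest of the raw payload bytes."""
    return hashlib.sha256(payload).hexdigest()


class DedupIndex:
    """In-memory stand-in for a persistent content-hash index (hypothetical)."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def is_new(self, payload: bytes) -> bool:
        """Record the payload's hash and return True only the first time it is seen."""
        digest = content_hash(payload)
        if digest in self._seen:
            return False
        self._seen.add(digest)
        return True


def partition_prefix(dataset: str, symbol: str, ts: datetime) -> str:
    """Build a Hive-style object key prefix such as dataset/dt=YYYY-MM-DD/symbol=XYZ/.

    Expects a timezone-aware datetime; the date is taken in UTC so that
    workers in different zones write to the same partition.
    """
    day = ts.astimezone(timezone.utc).strftime("%Y-%m-%d")
    return f"{dataset}/dt={day}/symbol={symbol.upper()}/"
```

Under these assumptions, a worker would call `is_new` before parsing an artifact and skip anything already seen, and would write analytical facts under the prefix returned by `partition_prefix` so a query engine can prune by date and symbol.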