From 8cfc4f423b8132f32c0371e6e0ea4a17b37fa243 Mon Sep 17 00:00:00 2001
From: Celes Renata
Date: Sat, 11 Apr 2026 02:15:06 -0700
Subject: [PATCH] initial commit

---
 design.md       | 480 ++++++++++++++++++++++++++++++++++++++++++++++++
 requirements.md | 269 +++++++++++++++++++++++++++
 tasks.md        | 129 +++++++++++++
 3 files changed, 878 insertions(+)
 create mode 100644 design.md
 create mode 100644 requirements.md
 create mode 100644 tasks.md

diff --git a/design.md b/design.md
new file mode 100644
index 0000000..ef57af8
--- /dev/null
+++ b/design.md
@@ -0,0 +1,480 @@
+# Stonks Oracle - Design
+
+## 1. Purpose
+Stonks Oracle is a Kubernetes-native AI market intelligence and trading platform. It ingests structured market data, company news, filings, and curated web content; preserves raw artifacts in MinIO; extracts structured intelligence objects with local Ollama models; aggregates signals into trend and recommendation outputs; optionally executes trades through a broker integration; and publishes historical datasets into a local lakehouse for Athena-like querying and QuickSight-like dashboards.
+
+This design prioritizes:
+- deterministic data contracts
+- auditability of every AI-derived conclusion
+- safe paper-trading-first automation
+- self-hosted analytics on MinIO-backed datasets
+- clear separation between operational state and analytical state
+
+## 2. Architecture Summary
+The platform is split into two planes:
+
+### 2.1 Operational plane
+Handles ingestion, parsing, structured extraction, signal generation, risk evaluation, trade execution, and control APIs.
+
+Primary stores:
+- PostgreSQL for operational state and transactional records
+- Redis for queues, locks, and hot cache state
+- MinIO for raw artifacts, prompts, model outputs, and exported datasets
+
+### 2.2 Analytical plane
+Handles historical fact storage, SQL query access, research, scorecards, and dashboards.
+
+Primary components:
+- MinIO as S3-compatible object store
+- Hive-compatible partition layout for query compatibility
+- Iceberg tables as the preferred lakehouse abstraction for managed analytical datasets
+- Trino as the Athena-like SQL query engine
+- Apache Superset as the QuickSight-like dashboard and exploration layer
+
+## 3. External Integrations
+
+### 3.1 Market Data API
+Used for:
+- quotes
+- OHLCV bars
+- reference data
+- corporate actions
+- earnings calendars
+- optional market news or fundamentals
+
+### 3.2 News API
+Used for:
+- company-linked headlines
+- publisher metadata
+- article URLs
+- article summaries when licensed
+
+### 3.3 Filings / Regulatory API
+Used for:
+- SEC-style company submissions
+- 8-K, 10-Q, 10-K, and related filings
+- structured issuer event discovery
+
+### 3.4 Web Scraper
+Used for:
+- full article body retrieval when API content is partial
+- investor relations pages
+- curated press release sources
+- transcript or presentation retrieval when permitted
+
+### 3.5 Broker API
+Used for:
+- paper-trading simulation or sandbox trading
+- live order submission when enabled
+- order acknowledgements and rejections
+- fills and cancellations
+- positions and account balances
+
+## 4. Logical Components
+
+### 4.1 Symbol Registry Service
+Responsibilities:
+- manage companies, aliases, watchlists, sectors, and source configurations
+- manage source trust or credibility policies
+- manage symbol-to-document matching rules
+
+### 4.2 Scheduler / Orchestrator
+Responsibilities:
+- trigger market, news, filings, and scrape jobs
+- manage polling cadences by source class
+- coordinate backoff, retries, and dedupe windows
+- publish downstream jobs to workers
+
+### 4.3 Ingestion Adapters
+Subcomponents:
+- Market data adapter
+- News API adapter
+- Filings adapter
+- Broker event adapter
+
+Responsibilities:
+- fetch external payloads
+- preserve raw responses in MinIO
+- normalize metadata into PostgreSQL
+- emit processing jobs for parsing or publication
+
+### 4.4 Scraper / Parser Service
+Responsibilities:
+- fetch and render source pages
+- extract normalized text and metadata
+- reduce boilerplate and duplicated template text
+- score parser quality and extraction confidence
+- persist normalized artifacts
+
+### 4.5 Ollama Extraction Service
+Responsibilities:
+- call local Ollama models using schema-constrained JSON output
+- produce canonical document intelligence objects
+- preserve prompts, schemas, model metadata, and raw outputs
+- validate schema and semantic consistency
+- retry invalid generations under policy
+
+### 4.6 Aggregation Engine
+Responsibilities:
+- combine document intelligence with market context
+- compute rolling trend summaries by company, sector, and market
+- track contradiction and agreement signals
+- score evidence with recency decay and source weighting
+
+### 4.7 Recommendation Engine
+Responsibilities:
+- generate explainable recommendation objects from aggregated evidence
+- separate deterministic eligibility scoring from final action mapping
+- produce suggested action, thesis, horizon, and invalidation conditions
+- publish analytical prediction facts to the lake
+
+### 4.8 Risk Engine
+Responsibilities:
+- enforce guardrails such as max position size, daily loss cap, exposure by sector, symbol cooldowns, news shock lockouts, and operator approval rules
+- determine whether a recommendation is eligible for paper or live execution
+- block ambiguous or unsafe orders before broker submission
+
+### 4.9 Broker Adapter
+Responsibilities:
+- abstract one or more trading APIs
+- support paper mode and live mode
+- record submission, acknowledgement, rejection, fill, and cancellation events
+- guarantee idempotent order submission keys
+- publish order and fill facts to both PostgreSQL and the analytical lake
+
+### 4.10 Lake Publisher
+Responsibilities:
+- transform operational records into analytics-friendly fact datasets
+- publish append-only partitioned tables to MinIO
+- maintain Iceberg metadata or equivalent lakehouse metadata
+- expose datasets such as predictions, outcomes, fills, bars, and PnL
+
+### 4.11 Query API / Dashboard
+Responsibilities:
+- expose companies, documents, trends, recommendations, and orders
+- provide evidence drill-down and audit views
+- provide operator controls for live-trading enablement and review queues
+- expose links into analytical dashboards and query tools
+
+### 4.12 SQL Query Engine and BI Layer
+Components:
+- Trino coordinator and workers
+- Hive Metastore or Iceberg catalog service
+- Apache Superset
+
+Responsibilities:
+- provide Athena-like SQL access to MinIO-hosted tables
+- support dashboard datasets and ad hoc exploration
+- support joins between market facts, AI predictions, and executed trades
+
+## 5. Storage Model
+
+### 5.1 Operational stores
+#### PostgreSQL
+Used for:
+- companies and aliases
+- watchlists and source configs
+- article and filing metadata
+- document intelligence objects
+- trend summaries
+- recommendations
+- risk evaluations
+- orders and execution events
+- control-plane state and audit records
+
+#### Redis
+Used for:
+- distributed locks for symbol-source retrieval
+- ingestion rate-limit counters
+- job queue state
+- retry backoff state
+- dedupe markers
+- cache for hot API and dashboard views
+
+#### MinIO object storage
+Used for:
+- raw API payloads
+- raw article HTML and normalized text
+- prompts, schemas, and raw model results
+- exported analytical datasets
+- audit traces and reproducibility bundles
+
+### 5.2 MinIO bucket layout
+Recommended buckets:
+- `stonks-raw-market`: raw market API payloads
+- `stonks-raw-news`: raw news API payloads and article HTML
+- `stonks-raw-filings`: raw filings and issuer event payloads
+- `stonks-normalized`: cleaned text and parser outputs
+- `stonks-llm-prompts`: prompts and schemas used
+- `stonks-llm-results`: raw model outputs and validation reports
+- `stonks-lakehouse`: partitioned analytical datasets and table metadata
+- `stonks-audit`: execution traces and exported reports
+
+Suggested raw object path pattern:
+```text
+/{stage}/{symbol}/{yyyy}/{mm}/{dd}/{document_id}/{artifact_type}.json
+/{stage}/{symbol}/{yyyy}/{mm}/{dd}/{document_id}/{artifact_type}.html
+```
+
+Suggested analytical path pattern:
+```text
+/warehouse/{table_name}/dt={yyyy-mm-dd}/symbol={ticker}/part-*.parquet
+```
+
+### 5.3 Lakehouse model
+Preferred design:
+- Parquet files stored in MinIO
+- Hive-compatible partitioning for interoperability
+- Iceberg table metadata for managed analytical tables
+- Trino catalogs for SQL access
+
+Rationale:
+- Hive-compatible layouts preserve broad engine compatibility
+- Iceberg improves schema evolution, partition handling, and table maintenance
+- Trino can query MinIO-backed object storage and supports both Hive and Iceberg catalogs
+
+## 6. Data Model
+
+### 6.1 PostgreSQL schema outline
+Core tables:
+- `companies`
+- `company_aliases`
+- `watchlists`
+- `watchlist_members`
+- `sources`
+- `api_credentials_refs`
+- `ingestion_runs`
+- `market_snapshots`
+- `documents`
+- `document_versions`
+- `document_company_mentions`
+- `document_intelligence`
+- `document_impact_records`
+- `trend_windows`
+- `recommendations`
+- `recommendation_evidence`
+- `risk_evaluations`
+- `broker_accounts`
+- `orders`
+- `order_events`
+- `positions`
+- `audit_events`
+
+### 6.2 Article or document metadata record
+```json
+{
+  "document_id": "uuid",
+  "document_type": "article|filing|transcript|press_release",
+  "symbol_candidates": ["AAPL", "MSFT"],
+  "source_type": "news_api",
+  "publisher": "string",
+  "url": "string",
+  "canonical_url": "string",
+  "title": "string",
+  "published_at": "2026-04-09T00:00:00Z",
+  "retrieved_at": "2026-04-09T00:00:00Z",
+  "language": "en",
+  "content_hash": "sha256",
+  "storage_refs": {
+    "raw_html": "s3://...",
+    "raw_payload": "s3://..."
+  }
+}
+```
+
+### 6.3 Document intelligence schema
+```json
+{
+  "document_id": "uuid",
+  "summary": "string",
+  "companies": [
+    {
+      "ticker": "AAPL",
+      "company_name": "Apple Inc.",
+      "relevance": 0.95,
+      "sentiment": "positive",
+      "impact_score": 0.71,
+      "impact_horizon": "1d_30d",
+      "catalyst_type": "earnings|product|legal|macro|supply_chain|m_and_a|rating_change|other",
+      "key_facts": ["string"],
+      "risks": ["string"],
+      "evidence_spans": ["string"]
+    }
+  ],
+  "macro_themes": ["rates", "ai_capex"],
+  "novelty_score": 0.64,
+  "source_credibility": 0.8,
+  "extraction_warnings": ["ambiguous_ticker_reference"],
+  "confidence": 0.86,
+  "model": {
+    "provider": "ollama",
+    "model_name": "gpt-oss:20b",
+    "prompt_version": "document-intel-v2",
+    "schema_version": "2.0.0"
+  }
+}
+```
+
+### 6.4 Trend summary schema
+```json
+{
+  "entity_type": "company",
+  "entity_id": "AAPL",
+  "window": "7d",
+  "trend_direction": "bullish|bearish|mixed|neutral",
+  "trend_strength": 0.68,
+  "confidence": 0.74,
+  "top_supporting_evidence": ["document_id_1", "document_id_2"],
+  "top_opposing_evidence": ["document_id_3"],
+  "dominant_catalysts": ["product", "rating_change"],
+  "material_risks": ["regulatory scrutiny"],
+  "contradiction_score": 0.22
+}
+```
+
+### 6.5 Recommendation schema
+```json
+{
+  "recommendation_id": "uuid",
+  "ticker": "AAPL",
+  "action": "buy|sell|hold|watch",
+  "mode": "informational|paper_eligible|live_eligible",
+  "confidence": 0.72,
+  "time_horizon": "swing_1d_10d",
+  "thesis": "string",
+  "invalidation_conditions": ["string"],
+  "position_sizing": {
+    "portfolio_pct": 0.02,
+    "max_loss_pct": 0.005
+  },
+  "evidence_refs": ["document_id_1", "document_id_2"],
+  "model_metadata": {
+    "version": "recommendation-v1"
+  }
+}
+```
+
+## 7. Analytical Lake Datasets
+The analytical plane should expose the following logical fact tables:
+- `lake.market_bars`
+- `lake.market_quotes`
+- `lake.company_events`
+- `lake.documents`
+- `lake.document_extractions`
+- `lake.trade_signals`
+- `lake.trade_orders`
+- `lake.trade_fills`
+- `lake.positions_daily`
+- `lake.pnl_daily`
+- `lake.prediction_vs_outcome`
+
+Recommended partitioning examples:
+- market data: partition by `dt`, optional symbol transform later
+- documents: partition by `dt` and maybe `source_type`
+- predictions: partition by `dt` and `model_version`
+- fills and PnL: partition by `dt` and broker account
+
+## 8. Data Flows
+
+### 8.1 Market and document ingestion flow
+1. Scheduler selects due symbols and sources.
+2. Adapters fetch market, news, and filings payloads.
+3. Raw payloads are written to MinIO.
+4. Metadata records are written to PostgreSQL.
+5. New documents are emitted to parser jobs.
+
+### 8.2 Extraction flow
+1. Parser produces normalized text and confidence score.
+2. Extraction worker sends document to Ollama with schema-bound output.
+3. Validator checks schema and semantic consistency.
+4. Canonical intelligence object is stored in PostgreSQL and MinIO.
+5. Aggregation jobs are triggered for impacted symbols.
+
+### 8.3 Recommendation and trade flow
+1. Aggregation engine updates trend windows.
+2. Recommendation engine emits a recommendation object.
+3. Risk engine determines eligibility and allowed execution mode.
+4. Broker adapter places paper or live orders when authorized.
+5. Broker events update PostgreSQL and publish analytical facts to the lake.
+
+### 8.4 Lake publication flow
+1. Operational records are transformed into analytical facts.
+2. Facts are written as partitioned Parquet files to MinIO.
+3. Table metadata is updated through Iceberg or equivalent catalog operations.
+4. Trino exposes the datasets for SQL.
+5. Superset uses Trino datasets for dashboards and ad hoc exploration.
+
+## 9. Query and Dashboard Surface
+
+### 9.1 Operational API
+Should expose:
+- company and watchlist configuration
+- source health and job state
+- document timelines and evidence
+- recommendation history
+- order history and audit trail
+- risk configuration and trading mode
+
+### 9.2 Analytical surface
+Should expose:
+- SQL access through Trino
+- dashboard datasets in Superset
+- scorecards for prediction accuracy and PnL
+- evidence-to-outcome drill-down views
+- model performance and extraction failure dashboards
+
+Suggested starter dashboards:
+- symbol overview
+- market sentiment heatmap
+- prediction confidence vs realized move
+- paper trading PnL
+- model extraction quality
+- source coverage and ingestion lag
+
+## 10. Reliability and Safety
+- Broker submission must be idempotent.
+- Live trading must be disabled by default.
+- Paper trading must be the first enabled execution mode.
+- Invalid model output must not advance to trade execution.
+- Low-quality document extraction must not influence live trading.
+- All analytical publication jobs should be replayable.
+- Every recommendation and order should be reproducible from saved prompts, source refs, and model metadata.
+
+## 11. Deployment Notes
+Recommended Kubernetes workloads:
+- `symbol-registry-api`
+- `scheduler`
+- `market-adapter-worker`
+- `news-adapter-worker`
+- `filings-adapter-worker`
+- `scraper-worker`
+- `parser-worker`
+- `ollama-extractor-worker`
+- `aggregation-worker`
+- `recommendation-worker`
+- `risk-engine-api`
+- `broker-adapter`
+- `lake-publisher`
+- `trino-coordinator`
+- `trino-worker`
+- `superset-web`
+- `postgres`
+- `redis`
+- `minio`
+
+## 12. Deliberate Scope Boundaries for v1
+Included in v1:
+- tracked watchlists
+- market, news, filings, and broker integrations
+- Ollama structured extraction
+- trend aggregation and recommendation objects
+- paper trading with strict controls
+- MinIO-backed analytics lake
+- Trino and Superset self-hosted analytics
+
+Deferred from v1:
+- options trading
+- full order book or tick-level market microstructure
+- online model retraining
+- fully autonomous live trading with no approval workflow
+- advanced portfolio optimization beyond basic sizing and risk caps
diff --git a/requirements.md b/requirements.md
new file mode 100644
index 0000000..6cca995
--- /dev/null
+++ b/requirements.md
@@ -0,0 +1,269 @@
+# Stonks Oracle - Requirements
+
+## Overview
+This feature builds an AI-assisted market intelligence, execution, and analytics platform for a Kubernetes-hosted environment. The platform ingests market symbols, licensed market data, company-specific news, regulatory filings, scraped web sources, and broker execution events; stores raw and normalized artifacts; extracts structured JSON with local Ollama models; computes trend and sentiment summaries; and optionally places trades through a broker integration.
+
+The platform SHALL also maintain a local analytics lake on MinIO using Hive-compatible partitioned data, support Athena-like SQL querying over captured market and trade data, and expose QuickSight-like dashboards for research, review, and audit.
+
+The initial release is focused on reliable ingestion, deterministic structured extraction, explainable trend scoring, paper trading safety, and internal analytics visibility.
+
+## User Stories
+- As an operator, I want to register companies, tickers, sectors, watchlists, and source rules so the system knows what to monitor.
+- As an analyst, I want every raw article, filing, market snapshot, and scrape artifact preserved so I can audit downstream AI conclusions.
+- As a data engineer, I want structured JSON extraction from each article and filing so downstream analytics are queryable.
+- As a strategist, I want aggregated trend assessments per symbol, sector, and market regime so I can evaluate opportunities.
+- As a trader, I want the system to generate explainable trade recommendations with explicit confidence, catalysts, and risk notes.
+- As a risk owner, I want strict controls on automated trading so the system cannot place unsafe orders.
+- As a quantitative reviewer, I want to query historical market data, AI predictions, and executed trades in one SQL-accessible analytics plane.
+- As a dashboard user, I want QuickSight-like visualizations for performance, signal quality, prediction accuracy, and model behavior.
+- As a platform owner, I want the system to run fully inside Kubernetes against local Ollama and self-hosted analytics components.
+
+## Functional Requirements
+
+### 1. Watchlist and source management
+#### Requirement 1.1
+WHEN an operator creates or updates a tracked company
+THE SYSTEM SHALL persist the company profile including ticker, legal name, aliases, exchange, sector, industry, market cap bucket, and source configuration.
+
+#### Requirement 1.2
+WHEN an operator defines a source configuration for a company
+THE SYSTEM SHALL support source types including market data APIs, news API feeds, SEC or investor relations URLs, company press release pages, earnings transcript sources, curated web pages, and broker-linked execution sources.
+
+#### Requirement 1.3
+WHEN a company has aliases, brands, or product names
+THE SYSTEM SHALL use those aliases during source retrieval, de-duplication, entity matching, and extraction.
+
+### 2. External API integrations
+#### Requirement 2.1
+WHEN the scheduler triggers a market ingestion cycle
+THE SYSTEM SHALL fetch configured market data API results for tracked companies and persist raw response payloads.
+
+#### Requirement 2.2
+WHEN the scheduler triggers a news ingestion cycle
+THE SYSTEM SHALL fetch configured news API results for tracked companies and persist raw response payloads.
+
+#### Requirement 2.3
+WHEN the scheduler triggers a regulatory ingestion cycle
+THE SYSTEM SHALL fetch configured filing or issuer event data from authoritative sources such as SEC-style APIs and persist raw response payloads.
+
+#### Requirement 2.4
+WHEN trade automation is enabled
+THE SYSTEM SHALL integrate with at least one broker API that supports paper trading, order placement, order status retrieval, positions, account balances, and execution events.
+
+#### Requirement 2.5
+WHEN external APIs enforce rate limits or quotas
+THE SYSTEM SHALL coordinate request pacing, retries, and backoff across workers.
+
+### 3. Ingestion and raw artifact retention
+#### Requirement 3.1
+WHEN a scraper retrieves an article, filing, or web page
+THE SYSTEM SHALL store the raw HTML, rendered text, metadata, retrieval timestamp, and retrieval source in object storage.
+
+#### Requirement 3.2
+WHEN an article, filing, or market payload is ingested
+THE SYSTEM SHALL generate a stable content hash and use it to prevent duplicate processing.
+
+#### Requirement 3.3
+WHEN the system stores a raw artifact
+THE SYSTEM SHALL persist an associated metadata record containing symbol, source, URL when applicable, title, publication time, retrieval time, language when applicable, and content hash.
+
+#### Requirement 3.4
+WHEN content retrieval fails
+THE SYSTEM SHALL record the failure reason, retry policy state, and next eligible retry time.
+
+### 4. Parsing and normalization
+#### Requirement 4.1
+WHEN a raw article or filing enters the parsing stage
+THE SYSTEM SHALL extract normalized text, author data when available, publisher, tags, mentioned entities, outbound links, and document type.
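
As an illustration of Requirement 3.2, a stable content hash could be computed over text that has first been normalized, so that trivial whitespace or Unicode-form differences do not defeat de-duplication. This is a minimal sketch, not the project's implementation: the normalization rules and the names `content_hash` and `is_duplicate` are assumptions. SHA-256 is chosen only because the design's metadata record shows `"content_hash": "sha256"`.

```python
from __future__ import annotations

import hashlib
import unicodedata


def content_hash(text: str) -> str:
    """Return a SHA-256 hex digest of Unicode- and whitespace-normalized text."""
    # Assumption: NFC normalization plus whitespace collapsing is "stable enough".
    normalized = unicodedata.normalize("NFC", text)
    normalized = " ".join(normalized.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def is_duplicate(text: str, seen: set[str]) -> bool:
    """Record the hash of `text` in `seen`; return True if it was already present."""
    digest = content_hash(text)
    if digest in seen:
        return True
    seen.add(digest)
    return False
```

In a deployed system the `seen` set would more plausibly be a Redis dedupe marker or a unique index on the PostgreSQL metadata table; the in-memory set here only keeps the sketch self-contained.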
+
+#### Requirement 4.2
+WHEN the system detects boilerplate or repeated template text
+THE SYSTEM SHALL reduce or remove boilerplate before AI extraction while retaining the original raw artifact for audit.
+
+#### Requirement 4.3
+WHEN the parser cannot confidently extract article body text
+THE SYSTEM SHALL flag the document for low-quality extraction and prevent it from influencing downstream trading until reviewed or reprocessed.
+
+### 5. AI article and document extraction
+#### Requirement 5.1
+WHEN a normalized article or filing is ready for AI extraction
+THE SYSTEM SHALL send the document to a local Ollama model using structured output with an explicit JSON schema.
+
+#### Requirement 5.2
+WHEN the model returns extraction output
+THE SYSTEM SHALL validate the response against the expected schema before saving it.
+
+#### Requirement 5.3
+WHEN extraction succeeds
+THE SYSTEM SHALL produce a canonical document intelligence object with at minimum:
+- document_id
+- document_type
+- source metadata
+- tickers referenced
+- companies referenced
+- document summary
+- sentiment by company
+- catalyst type
+- impact horizon
+- key facts
+- risks mentioned
+- macro themes
+- confidence score
+- extraction warnings
+- model metadata
+
+#### Requirement 5.4
+WHEN the model response is invalid, incomplete, or hallucinatory
+THE SYSTEM SHALL retry extraction according to policy and preserve both the failed output and validation errors.
+
+#### Requirement 5.5
+WHEN a document is materially relevant to multiple companies
+THE SYSTEM SHALL emit one shared document record and one or more per-company impact records.
+
+### 6. Aggregation and trend analysis
+#### Requirement 6.1
+WHEN multiple document intelligence objects and market observations exist for a company
+THE SYSTEM SHALL generate a rolling company trend summary over configurable windows including intraday, 1 day, 7 day, 30 day, and 90 day intervals.
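
One way to picture a rolling window like those in Requirement 6.1 is as a weighted mean over the evidence that falls inside the window, with each document weighted by its age and source credibility. The sketch below is hedged throughout: the exponential half-life decay, the `signed_impact` field (a signed score from -1.0 to +1.0), and the names `Evidence` and `window_score` are assumptions for illustration. The spec names recency decay and source weighting but does not fix their form.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Evidence:
    published_at: datetime
    signed_impact: float       # assumption: -1.0 (bearish) .. +1.0 (bullish)
    source_credibility: float  # 0.0 .. 1.0, as in the document intelligence schema


def window_score(items, now, window, half_life):
    """Credibility- and recency-weighted mean of signed impact inside the window."""
    weighted_sum = 0.0
    weight_total = 0.0
    for item in items:
        age = now - item.published_at
        if age < timedelta(0) or age > window:
            continue  # outside the rolling window (or timestamped in the future)
        # Exponential decay: weight halves every `half_life` (timedelta / timedelta is a float).
        decay = 0.5 ** (age / half_life)
        weight = decay * item.source_credibility
        weighted_sum += weight * item.signed_impact
        weight_total += weight
    # With no usable evidence, report a neutral score rather than dividing by zero.
    return weighted_sum / weight_total if weight_total else 0.0
```

Calling `window_score(items, now, timedelta(days=7), timedelta(days=2))` would correspond to the 7 day window with a hypothetical two-day half-life. The same function can be evaluated per configured window.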
+
+#### Requirement 6.2
+WHEN generating a company trend summary
+THE SYSTEM SHALL consider sentiment, catalyst frequency, source credibility, recency decay, contradiction detection, document novelty, and current market context.
+
+#### Requirement 6.3
+WHEN generating a market-wide trend summary
+THE SYSTEM SHALL aggregate company-level signals into sector and market-level summaries.
+
+#### Requirement 6.4
+WHEN contradictory signals exist across sources
+THE SYSTEM SHALL represent disagreement explicitly rather than collapsing it into a single unsupported conclusion.
+
+#### Requirement 6.5
+WHEN a trend summary is produced
+THE SYSTEM SHALL include explainability fields listing the top supporting and opposing evidence.
+
+### 7. Trade recommendation generation
+#### Requirement 7.1
+WHEN a company trend summary is available
+THE SYSTEM SHALL be able to generate a recommendation object containing action type, thesis, confidence, expected horizon, invalidation conditions, and cited evidence.
+
+#### Requirement 7.2
+WHEN a recommendation is generated
+THE SYSTEM SHALL separate descriptive analysis from prescriptive trade action and include a risk classification.
+
+#### Requirement 7.3
+WHEN the system proposes a trade
+THE SYSTEM SHALL attach position sizing guidance based on configured portfolio rules rather than unconstrained model output.
+
+#### Requirement 7.4
+WHEN the confidence or data quality falls below configured thresholds
+THE SYSTEM SHALL suppress automated trade eligibility and mark the recommendation as informational only.
+
+### 8. Trade execution and safety controls
+#### Requirement 8.1
+WHEN trade automation is enabled
+THE SYSTEM SHALL support paper trading mode and live trading mode as separate execution environments.
+
+#### Requirement 8.2
+WHEN live trading mode is enabled
+THE SYSTEM SHALL require operator approval controls, risk limits, and broker credential isolation.
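
Requirements 7.4, 8.1, and 8.2 together suggest a deterministic gate that maps a recommendation onto one of the `mode` values from the design's recommendation schema. The following is a sketch under stated assumptions: the threshold defaults, the `data_quality` input, and the names `Thresholds` and `execution_mode` are hypothetical, and a real risk engine would also apply the position, exposure, and lockout rules that this sketch omits.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    min_confidence: float = 0.6    # hypothetical default, not from the spec
    min_data_quality: float = 0.7  # hypothetical default, not from the spec


def execution_mode(confidence: float, data_quality: float,
                   live_enabled: bool, operator_approved: bool,
                   thresholds: Thresholds = Thresholds()) -> str:
    """Map a recommendation onto informational, paper_eligible, or live_eligible."""
    # Requirement 7.4: weak confidence or data quality suppresses trade eligibility.
    if confidence < thresholds.min_confidence or data_quality < thresholds.min_data_quality:
        return "informational"
    # Requirement 8.2: live eligibility needs both the live switch and operator approval.
    if live_enabled and operator_approved:
        return "live_eligible"
    # Default for a passing recommendation is paper trading.
    return "paper_eligible"
```

Keeping this mapping as plain code rather than model output fits the design's aim of separating deterministic eligibility scoring from the final action mapping.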
+
+#### Requirement 8.3
+WHEN the system places an order
+THE SYSTEM SHALL persist the full decision trace including signals used, prompt versions, model versions, thresholds, and broker response.
+
+#### Requirement 8.4
+WHEN a proposed order violates configured risk controls
+THE SYSTEM SHALL reject the order before broker submission.
+
+#### Requirement 8.5
+WHEN a broker API is unavailable or partially fails
+THE SYSTEM SHALL fail closed and SHALL NOT place duplicate or ambiguous orders.
+
+### 9. Storage and queryability
+#### Requirement 9.1
+WHEN storing raw artifacts
+THE SYSTEM SHALL use MinIO object storage as the system of record for HTML, text, API payloads, prompts, model outputs, and exported analytical datasets.
+
+#### Requirement 9.2
+WHEN storing normalized relational data
+THE SYSTEM SHALL use PostgreSQL for companies, watchlists, article metadata, document intelligence objects, trends, recommendations, operational execution records, and control-plane state.
+
+#### Requirement 9.3
+WHEN low-latency coordination or caching is required
+THE SYSTEM SHALL use Redis for job state, distributed locks, short-lived caches, and rate-limit coordination.
+
+#### Requirement 9.4
+WHEN historical analytical queries are needed
+THE SYSTEM SHALL persist analytical fact datasets in Hive-compatible partitioned form on MinIO so that market data, predictions, and trade outcomes can be queried together.
+
+#### Requirement 9.5
+WHEN analytical table management is required
+THE SYSTEM SHALL support a lakehouse table abstraction that permits append-only fact ingestion, partitioned queries, and schema evolution.
+
+### 10. SQL analytics and dashboards
+#### Requirement 10.1
+WHEN a user or service executes an analytical query
+THE SYSTEM SHALL provide an Athena-like SQL query service over MinIO-hosted analytical datasets.
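
The Hive-compatible layout in Requirement 9.4 follows the analytical path pattern given in design.md (`/warehouse/{table_name}/dt={yyyy-mm-dd}/symbol={ticker}/part-*.parquet`). A small helper can make the `key=value` partition convention concrete. In this sketch the function name `partition_path` and the zero-padded part numbering are assumptions; the design specifies only `part-*`.

```python
from datetime import date


def partition_path(table_name: str, dt: date, ticker: str, part: int) -> str:
    """Build an object key following the dt=/symbol= Hive-style partition layout."""
    # Hive-style partitions encode column values as key=value directory names,
    # which lets a query engine prune by dt and symbol without opening files.
    return (
        f"/warehouse/{table_name}"
        f"/dt={dt.isoformat()}"
        f"/symbol={ticker.upper()}"
        f"/part-{part:05d}.parquet"  # assumption: five-digit part index
    )
```

For example, `partition_path("trade_fills", date(2026, 4, 9), "AAPL", 0)` yields `/warehouse/trade_fills/dt=2026-04-09/symbol=AAPL/part-00000.parquet`.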
+
+#### Requirement 10.2
+WHEN a dashboard user explores market, prediction, and trade data
+THE SYSTEM SHALL expose QuickSight-like dashboards for performance, confidence, prediction accuracy, evidence coverage, and model behavior.
+
+#### Requirement 10.3
+WHEN analytical results combine AI outputs with executed trades and market outcomes
+THE SYSTEM SHALL support joins across predicted signals, broker executions, and realized performance data.
+
+#### Requirement 10.4
+WHEN dashboards or research queries need drill-down capability
+THE SYSTEM SHALL provide traceability from analytical aggregates back to underlying documents, prompts, model outputs, and raw artifacts.
+
+### 11. APIs and UI
+#### Requirement 11.1
+WHEN a client requests company analytics
+THE SYSTEM SHALL expose APIs for document timelines, trend summaries, recommendation history, execution history, and evidence drill-down.
+
+#### Requirement 11.2
+WHEN an operator inspects a recommendation
+THE SYSTEM SHALL display the contributing document intelligence objects, the raw sources used, and any market context features that influenced the decision.
+
+#### Requirement 11.3
+WHEN a user reviews an order decision
+THE SYSTEM SHALL expose a full audit trail from ingestion through broker execution and eventual market outcome.
+
+### 12. Observability and operations
+#### Requirement 12.1
+WHEN a pipeline stage runs
+THE SYSTEM SHALL emit structured logs, metrics, and traces for ingestion, parsing, extraction, aggregation, analytics publication, and trading.
+
+#### Requirement 12.2
+WHEN model performance degrades
+THE SYSTEM SHALL surface schema failure rates, latency percentiles, token usage estimates, and extraction retry counts.
+
+#### Requirement 12.3
+WHEN source coverage changes materially
+THE SYSTEM SHALL alert operators about sustained source failures, symbol coverage gaps, or analytical publication lag.
+
+## Non-Functional Requirements
+#### Requirement N1
+WHEN the system processes documents and market events concurrently
+THE SYSTEM SHALL support horizontal scaling across Kubernetes workers.
+
+#### Requirement N2
+WHEN the system stores model-derived conclusions
+THE SYSTEM SHALL preserve enough provenance to reproduce or challenge those conclusions later.
+
+#### Requirement N3
+WHEN the system handles licensed or restricted content
+THE SYSTEM SHALL preserve source metadata, access policy, and retention policy for each artifact.
+
+#### Requirement N4
+WHEN the system publishes analytical datasets
+THE SYSTEM SHALL ensure queryable partitions are written atomically or with an equivalent consistency guarantee.
+
+#### Requirement N5
+WHEN trade execution is enabled
+THE SYSTEM SHALL prioritize fail-closed behavior over availability in ambiguous conditions.
+
+#### Requirement N6
+WHEN dashboards query large historical datasets
+THE SYSTEM SHALL support partition pruning and index or metadata strategies that keep typical analyst queries responsive.
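
Requirements 8.5 and N5 ask for fail-closed behavior with no duplicate orders, and the design's Broker Adapter is expected to guarantee idempotent order submission keys. A hedged sketch of how those two ideas could combine is given below. The key fields, the `OrderGate` class, and the in-memory set are assumptions for illustration only; a durable store such as PostgreSQL or Redis would be needed in practice.

```python
import hashlib


def order_idempotency_key(recommendation_id: str, account_id: str,
                          ticker: str, side: str, quantity: float) -> str:
    """Derive a deterministic key so the same decision always maps to the same order."""
    # Assumption: these five fields identify an order intent uniquely.
    payload = "|".join([
        recommendation_id, account_id, ticker.upper(), side.lower(), f"{quantity:.6f}",
    ])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class OrderGate:
    """Refuse submission when the broker is unhealthy or the key was already used."""

    def __init__(self) -> None:
        self._submitted = set()  # in-memory only; a real system needs durable storage

    def should_submit(self, key: str, broker_healthy: bool) -> bool:
        if not broker_healthy:
            return False  # fail closed: no submission in ambiguous conditions
        if key in self._submitted:
            return False  # duplicate prevention
        self._submitted.add(key)
        return True
```

Because the key is derived from the decision rather than generated at submission time, a retry after a timeout reuses the same key and is rejected by the gate instead of producing a second order.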
diff --git a/tasks.md b/tasks.md
new file mode 100644
index 0000000..779a282
--- /dev/null
+++ b/tasks.md
@@ -0,0 +1,129 @@
+# Stonks Oracle - Tasks
+
+## Phase 0 - Project Setup
+- [ ] Create repository structure for services, shared schemas, infrastructure, lakehouse, and dashboards
+- [ ] Choose implementation language for services (Python preferred for scraping/LLM workflows)
+- [ ] Add local development stack with MinIO, PostgreSQL, Redis, Ollama, Trino, and Superset
+- [ ] Add Kubernetes manifests or Helm chart skeletons for all core components
+- [ ] Add CI pipeline for linting, tests, container builds, schema checks, and lake dataset validation
+
+## Phase 1 - Core Data and Infrastructure
+- [ ] Create PostgreSQL schema migrations for companies, watchlists, sources, documents, document intelligence, trends, recommendations, orders, positions, and audit records
+- [ ] Create MinIO bucket provisioning and lifecycle policies
+- [ ] Create Redis key conventions and queue abstractions
+- [ ] Implement shared config loader for environment variables and secrets
+- [ ] Implement shared typed JSON schemas for document intelligence, trend summaries, and recommendations
+- [ ] Stand up initial Trino catalog configuration for MinIO-backed datasets
+- [ ] Stand up Superset with environment-backed datasource configuration
+
+## Phase 2 - Symbol Registry and Source Management
+- [ ] Build symbol registry API endpoints for companies, aliases, watchlists, and sources
+- [ ] Add source credibility, retention policy, and access policy fields
+- [ ] Add source classes for market data API, news API, filings API, web scrape, and broker adapter
+- [ ] Add admin validation for duplicate tickers, invalid URLs, and unsupported source types
+- [ ] Add seed data support for an initial tracked watchlist
+
+## Phase 3 - External API Adapters
+- [ ] Implement scheduler for symbol and source polling windows
+- [ ] Implement market data API adapter interface
+- [ ] Implement first concrete market data provider adapter
+- [ ] Implement news API adapter interface
+- [ ] Implement first concrete news API provider adapter
+- [ ] Implement filings or regulatory adapter interface
+- [ ] Implement first concrete filings provider adapter
+- [ ] Implement broker API adapter interface for paper trading and order events
+- [ ] Implement rate-limit coordination, retries, and backoff across adapters
+
+## Phase 4 - Ingestion Pipeline
+- [ ] Implement web scraper worker for curated URLs and article pages
+- [ ] Implement canonical URL normalization and content hashing
+- [ ] Implement raw artifact upload to MinIO
+- [ ] Implement metadata persistence in PostgreSQL for market payloads, documents, and broker events
+- [ ] Implement retry and failure tracking for source retrieval
+- [ ] Implement dedupe logic across article and filing sources
+
+## Phase 5 - Parsing and Normalization
+- [ ] Implement HTML-to-text parsing pipeline
+- [ ] Implement boilerplate reduction and body extraction heuristics
+- [ ] Implement parser quality scoring and confidence flags
+- [ ] Implement company mention detection using ticker, alias, and name matching
+- [ ] Persist normalized text and parser outputs to MinIO and PostgreSQL
+
+## Phase 6 - Ollama Structured Extraction
+- [ ] Build extraction prompt templates with anti-hallucination instructions
+- [ ] Build JSON schema definitions for document intelligence extraction
+- [ ] Implement Ollama client wrapper using structured output format
+- [ ] Implement schema validation and semantic validation layers
+- [ ] Persist prompts, model metadata, raw outputs, validation reports, and final intelligence objects
+- [ ] Add retry behavior for invalid or incomplete model responses
+- [ ] Add model performance metrics and dashboards
+
+## Phase 7 - Aggregation and Trend Engine
+- [ ] Implement recency decay and source credibility weighting
+- [ ] Integrate market context features into aggregation windows
+- [ ] Implement company-level rolling window aggregation
+- [ ] Implement contradiction detection and disagreement representation
+- [ ] Implement sector and market rollups
+- [ ] Implement evidence ranking for supporting and opposing documents
+- [ ] Persist trend windows and evidence mappings
+
+## Phase 8 - Recommendation Engine
+- [ ] Design deterministic recommendation eligibility logic
+- [ ] Implement recommendation generation from aggregated scores and evidence
+- [ ] Add optional LLM wording layer for thesis generation only
+- [ ] Persist recommendation objects and evidence citations
+- [ ] Add suppression logic for low-quality data or low confidence
+- [ ] Publish prediction facts to analytical tables
+
+## Phase 9 - Risk Engine and Trade Adapter
+- [ ] Implement portfolio and account risk configuration model
+- [ ] Implement hard blocks for max position size, sector exposure, daily loss limits, and news-shock lockouts
+- [ ] Implement paper trading adapter behavior and state sync
+- [ ] Integrate first broker API in sandbox mode
+- [ ] Implement idempotent order submission keys and duplicate prevention
+- [ ] Implement full execution audit trail
+- [ ] Add operator approval workflow for live trading mode
+- [ ] Publish order, fill, and position facts to analytical tables
+
+## Phase 10 - Lakehouse and SQL Analytics
+- [ ] Define analytical fact tables for bars, documents, extractions, signals, orders, fills, positions, and PnL
+- [ ] Implement Parquet writers for analytical datasets
+- [ ] Implement Hive-compatible partition layout conventions on MinIO
+- [ ] Implement Iceberg table creation and metadata management for analytical datasets
+- [ ] Implement lake publisher jobs from operational data into analytical fact tables
+- [ ] Configure Trino catalogs for Hive and/or Iceberg access to MinIO
+- [ ] Add example SQL views for prediction-vs-outcome and paper-trade scorecards
+
+## Phase 11 - Query API and Dashboard
+- [ ] Build APIs for companies, document timelines, trend summaries, recommendations, and order history
+- [ ] Build evidence drill-down view linking recommendations to source documents and raw artifacts
+- [ ] Build admin controls for source health, symbol configs, and trading mode
+- [ ] Build operational dashboard for ingestion throughput, model failures, and source coverage gaps
+- [ ] Build Superset starter dashboards for symbol overview, sentiment heatmap, PnL, and prediction accuracy
+
+## Phase 12 - Observability and Hardening
+- [ ] Add structured logs and distributed tracing across services
+- [ ] Add Prometheus metrics for ingestion, parsing, extraction, aggregation, lake publication, and trading
+- [ ] Add alerting for source failures, schema failure spikes, analytical lag, and broker issues
+- [ ] Add dead-letter queues and replay tooling
+- [ ] Add data retention and lifecycle controls for raw and derived artifacts
+- [ ] Add security review for secrets, network policies, trading isolation, and dashboard access control
+
+## Phase 13 - Verification and Rollout
+- [ ] Create replay dataset from archived documents for deterministic extraction testing
+- [ ] Create integration tests for the full ingest-to-recommendation flow
+- [ ] Create paper trading simulation scenarios
+- [ ] Validate fail-closed behavior for broker outages and ambiguous order states
+- [ ] Validate lake publication and Trino query correctness over partitioned MinIO datasets
+- [ ] Run shadow mode before enabling any live execution
+- [ ] Prepare operator runbook and incident response procedures
+
+## Recommended First Vertical Slice
+- [ ] Track 5 to 10 symbols
+- [ ] Ingest one market data API, one news API, and one filings source per symbol group
+- [ ] Persist raw artifacts to MinIO and metadata to PostgreSQL
+- [ ] Extract structured document intelligence through Ollama
+- [ ] Generate 7-day company trend summaries with market context
+- [ ] Produce paper-trade recommendations only
+- [ ] Publish analytical facts for bars, signals, and paper
trades into MinIO +- [ ] Expose a simple dashboard with evidence, trend cards, and prediction-vs-outcome views