phase 0+1: project scaffold, k8s manifests, CI pipeline, steering, hooks, tests

- Repository structure for all services, infra, lakehouse, dashboards
- K8s manifests targeting stonks-oracle namespace with GHCR images
- Ingress via Traefik with ca-issuer TLS for internal services
- ConfigMap wired to existing cluster services (pg, redis, minio, ollama)
- GitHub Actions workflow for lint, test, multi-service container builds
- Dockerfile with build-arg CMD per service
- Makefile for local build/push/deploy
- Steering rules for TDD workflow, K8s conventions, project context
- Agent hooks for lint-on-save, test-on-save, k8s-validate, phase-commit
- Ruff linter config, all lint issues fixed
- 14 passing tests for schemas, config, redis keys
- PostgreSQL migrations, Trino catalogs, Superset config, MinIO lifecycle
Celes Renata
2026-04-11 03:25:08 -07:00
parent 8cfc4f423b
commit ebea70573b
90 changed files with 3590 additions and 19 deletions
+480
@@ -0,0 +1,480 @@
# Stonks Oracle - Design
## 1. Purpose
Stonks Oracle is a Kubernetes-native AI market intelligence and trading platform. It ingests structured market data, company news, filings, and curated web content; preserves raw artifacts in MinIO; extracts structured intelligence objects with local Ollama models; aggregates signals into trend and recommendation outputs; optionally executes trades through a broker integration; and publishes historical datasets into a local lakehouse for Athena-like querying and QuickSight-like dashboards.
This design prioritizes:
- deterministic data contracts
- auditability of every AI-derived conclusion
- safe paper-trading-first automation
- self-hosted analytics on MinIO-backed datasets
- clear separation between operational state and analytical state
## 2. Architecture Summary
The platform is split into two planes:
### 2.1 Operational plane
Handles ingestion, parsing, structured extraction, signal generation, risk evaluation, trade execution, and control APIs.
Primary stores:
- PostgreSQL for operational state and transactional records
- Redis for queues, locks, and hot cache state
- MinIO for raw artifacts, prompts, model outputs, and exported datasets
### 2.2 Analytical plane
Handles historical fact storage, SQL query access, research, scorecards, and dashboards.
Primary components:
- MinIO as S3-compatible object store
- Hive-compatible partition layout for query compatibility
- Iceberg tables as the preferred lakehouse abstraction for managed analytical datasets
- Trino as the Athena-like SQL query engine
- Apache Superset as the QuickSight-like dashboard and exploration layer
## 3. External Integrations
### 3.1 Market Data API
Used for:
- quotes
- OHLCV bars
- reference data
- corporate actions
- earnings calendars
- optional market news or fundamentals
### 3.2 News API
Used for:
- company-linked headlines
- publisher metadata
- article URLs
- article summaries when licensed
### 3.3 Filings / Regulatory API
Used for:
- SEC-style company submissions
- 8-K, 10-Q, 10-K, and related filings
- structured issuer event discovery
### 3.4 Web Scraper
Used for:
- full article body retrieval when API content is partial
- investor relations pages
- curated press release sources
- transcript or presentation retrieval when permitted
### 3.5 Broker API
Used for:
- paper-trading simulation or sandbox trading
- live order submission when enabled
- order acknowledgements and rejections
- fills and cancellations
- positions and account balances
## 4. Logical Components
### 4.1 Symbol Registry Service
Responsibilities:
- manage companies, aliases, watchlists, sectors, and source configurations
- manage source trust or credibility policies
- manage symbol-to-document matching rules
### 4.2 Scheduler / Orchestrator
Responsibilities:
- trigger market, news, filings, and scrape jobs
- manage polling cadences by source class
- coordinate backoff, retries, and dedupe windows
- publish downstream jobs to workers
### 4.3 Ingestion Adapters
Subcomponents:
- Market data adapter
- News API adapter
- Filings adapter
- Broker event adapter
Responsibilities:
- fetch external payloads
- preserve raw responses in MinIO
- normalize metadata into PostgreSQL
- emit processing jobs for parsing or publication
### 4.4 Scraper / Parser Service
Responsibilities:
- fetch and render source pages
- extract normalized text and metadata
- reduce boilerplate and duplicated template text
- score parser quality and extraction confidence
- persist normalized artifacts
### 4.5 Ollama Extraction Service
Responsibilities:
- call local Ollama models using schema-constrained JSON output
- produce canonical document intelligence objects
- preserve prompts, schemas, model metadata, and raw outputs
- validate schema and semantic consistency
- retry invalid generations under policy
### 4.6 Aggregation Engine
Responsibilities:
- combine document intelligence with market context
- compute rolling trend summaries by company, sector, and market
- track contradiction and agreement signals
- score evidence with recency decay and source weighting
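The recency decay and source weighting listed above could be sketched as follows. This is a minimal illustration, assuming an exponential half-life decay; the `Evidence` type, the `half_life_hours` parameter, and the signed sentiment scale are hypothetical and are not defined by this design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    """Hypothetical per-document signal; field names are illustrative."""
    sentiment: float           # signed, -1.0 (bearish) to 1.0 (bullish)
    source_credibility: float  # 0.0 to 1.0
    age_hours: float           # hours since publication


def decay_weight(age_hours: float, half_life_hours: float = 48.0) -> float:
    """Exponential recency decay: the weight halves every half_life_hours."""
    return 0.5 ** (age_hours / half_life_hours)


def weighted_score(items: list[Evidence], half_life_hours: float = 48.0) -> float:
    """Credibility- and recency-weighted mean sentiment; 0.0 with no evidence."""
    weights = [
        e.source_credibility * decay_weight(e.age_hours, half_life_hours)
        for e in items
    ]
    total = sum(weights)
    if total == 0:
        return 0.0
    return sum(w * e.sentiment for w, e in zip(weights, items)) / total
```

A fresh bullish document outweighs an equally credible bearish one that is one half-life old, so the combined score stays positive but is pulled toward zero.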
### 4.7 Recommendation Engine
Responsibilities:
- generate explainable recommendation objects from aggregated evidence
- separate deterministic eligibility scoring from final action mapping
- produce suggested action, thesis, horizon, and invalidation conditions
- publish analytical prediction facts to the lake
### 4.8 Risk Engine
Responsibilities:
- enforce guardrails such as max position size, daily loss cap, exposure by sector, symbol cooldowns, news shock lockouts, and operator approval rules
- determine whether a recommendation is eligible for paper or live execution
- block ambiguous or unsafe orders before broker submission
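As a rough sketch of how these guardrails might be evaluated before broker submission, the example below checks a proposed order against a few limits and downgrades it to `informational` on any violation. The `RiskLimits` and `ProposedOrder` types, the limit values, and the violation names are assumptions for illustration; only the mode names come from the recommendation schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RiskLimits:
    """Hypothetical guardrail configuration; values are placeholders."""
    max_position_pct: float = 0.02
    max_sector_exposure_pct: float = 0.20
    daily_loss_cap_pct: float = 0.01


@dataclass(frozen=True)
class ProposedOrder:
    ticker: str
    position_pct: float               # share of portfolio this order would hold
    sector_exposure_after_pct: float  # sector exposure if the order fills


def evaluate(order: ProposedOrder, limits: RiskLimits, daily_loss_pct: float,
             live_enabled: bool = False) -> tuple[str, list[str]]:
    """Return (mode, violations). Any violation downgrades to informational."""
    violations = []
    if order.position_pct > limits.max_position_pct:
        violations.append("max_position_size")
    if order.sector_exposure_after_pct > limits.max_sector_exposure_pct:
        violations.append("sector_exposure")
    if daily_loss_pct >= limits.daily_loss_cap_pct:
        violations.append("daily_loss_cap")
    if violations:
        return "informational", violations
    # Live eligibility requires an explicit opt-in; paper is the default.
    return ("live_eligible" if live_enabled else "paper_eligible"), []
```

Keeping the check a pure function of the order, the limits, and current account state makes each risk evaluation easy to persist and replay for audit.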
### 4.9 Broker Adapter
Responsibilities:
- abstract one or more trading APIs
- support paper mode and live mode
- record submission, acknowledgement, rejection, fill, and cancellation events
- guarantee idempotent order submission keys
- publish order and fill facts to both PostgreSQL and the analytical lake
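One way to obtain idempotent order submission keys is to derive them deterministically from the decision that produced the order, so a retried submission maps to the same key. The sketch below assumes the key is built from the recommendation id, broker account, side, and mode; that choice of fields, the function name, and the in-memory `SubmissionLedger` (standing in for a database uniqueness constraint) are all hypothetical.

```python
import hashlib


def order_idempotency_key(recommendation_id: str, broker_account: str,
                          side: str, mode: str) -> str:
    """Derive a deterministic key so retries of one decision share one key."""
    material = "|".join([recommendation_id, broker_account, side, mode])
    return hashlib.sha256(material.encode("utf-8")).hexdigest()[:32]


class SubmissionLedger:
    """In-memory stand-in for a unique constraint on the idempotency key."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def claim(self, key: str) -> bool:
        """Return True only for the first caller that claims a given key."""
        if key in self._seen:
            return False
        self._seen.add(key)
        return True
```

A worker would call `claim` before submitting and skip the broker call when it returns `False`, which keeps a retry after a timeout from producing a duplicate order.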
### 4.10 Lake Publisher
Responsibilities:
- transform operational records into analytics-friendly fact datasets
- publish append-only partitioned tables to MinIO
- maintain Iceberg metadata or equivalent lakehouse metadata
- expose datasets such as predictions, outcomes, fills, bars, and PnL
### 4.11 Query API / Dashboard
Responsibilities:
- expose companies, documents, trends, recommendations, and orders
- provide evidence drill-down and audit views
- provide operator controls for live-trading enablement and review queues
- expose links into analytical dashboards and query tools
### 4.12 SQL Query Engine and BI Layer
Components:
- Trino coordinator and workers
- Hive Metastore or Iceberg catalog service
- Apache Superset
Responsibilities:
- provide Athena-like SQL access to MinIO-hosted tables
- support dashboard datasets and ad hoc exploration
- support joins between market facts, AI predictions, and executed trades
## 5. Storage Model
### 5.1 Operational stores
#### PostgreSQL
Used for:
- companies and aliases
- watchlists and source configs
- article and filing metadata
- document intelligence objects
- trend summaries
- recommendations
- risk evaluations
- orders and execution events
- control-plane state and audit records
#### Redis
Used for:
- distributed locks for symbol-source retrieval
- ingestion rate-limit counters
- job queue state
- retry backoff state
- dedupe markers
- cache for hot API and dashboard views
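A small helper module can keep Redis key names consistent across workers. The sketch below is illustrative only: the `stonks` prefix, the key layouts, and the fixed-window rate-limit scheme are assumptions, not the conventions actually used in this repository.

```python
NAMESPACE = "stonks"  # hypothetical key prefix


def lock_key(source: str, symbol: str) -> str:
    """Distributed lock guarding one symbol-source retrieval."""
    return f"{NAMESPACE}:lock:{source}:{symbol.upper()}"


def rate_limit_key(source: str, epoch_seconds: int, window_seconds: int = 60) -> str:
    """Counter key for a fixed window; all workers in a window share it."""
    window_start = epoch_seconds - (epoch_seconds % window_seconds)
    return f"{NAMESPACE}:ratelimit:{source}:{window_start}"


def dedupe_key(content_hash: str) -> str:
    """Marker key recording that a content hash was already processed."""
    return f"{NAMESPACE}:dedupe:{content_hash}"
```

Building keys through functions rather than inline string formatting makes the conventions testable and keeps symbol casing uniform.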
#### MinIO object storage
Used for:
- raw API payloads
- raw article HTML and normalized text
- prompts, schemas, and raw model results
- exported analytical datasets
- audit traces and reproducibility bundles
### 5.2 MinIO bucket layout
Recommended buckets:
- `stonks-raw-market` — raw market API payloads
- `stonks-raw-news` — raw news API payloads and article HTML
- `stonks-raw-filings` — raw filings and issuer event payloads
- `stonks-normalized` — cleaned text and parser outputs
- `stonks-llm-prompts` — prompts and schemas used
- `stonks-llm-results` — raw model outputs and validation reports
- `stonks-lakehouse` — partitioned analytical datasets and table metadata
- `stonks-audit` — execution traces and exported reports
Suggested raw object path pattern:
```text
/{stage}/{symbol}/{yyyy}/{mm}/{dd}/{document_id}/{artifact_type}.json
/{stage}/{symbol}/{yyyy}/{mm}/{dd}/{document_id}/{artifact_type}.html
```
Suggested analytical path pattern:
```text
/warehouse/{table_name}/dt={yyyy-mm-dd}/symbol={ticker}/part-*.parquet
```
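The two path patterns above can be generated by small helpers so that every writer produces the same layout. This is a sketch; the function names and the `ext` parameter are hypothetical, and the leading slash is kept only because the patterns above show one.

```python
from datetime import date


def raw_object_key(stage: str, symbol: str, day: date, document_id: str,
                   artifact_type: str, ext: str = "json") -> str:
    """Build a raw artifact path following the suggested raw pattern."""
    return (
        f"/{stage}/{symbol}/{day:%Y}/{day:%m}/{day:%d}/"
        f"{document_id}/{artifact_type}.{ext}"
    )


def lake_partition_prefix(table_name: str, day: date, ticker: str) -> str:
    """Build the Hive-style partition prefix for an analytical table."""
    return f"/warehouse/{table_name}/dt={day.isoformat()}/symbol={ticker}/"
```

For example, `lake_partition_prefix("market_bars", date(2026, 4, 9), "AAPL")` yields `/warehouse/market_bars/dt=2026-04-09/symbol=AAPL/`, under which the Parquet part files would be written.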
### 5.3 Lakehouse model
Preferred design:
- Parquet files stored in MinIO
- Hive-compatible partitioning for interoperability
- Iceberg table metadata for managed analytical tables
- Trino catalogs for SQL access
Rationale:
- Hive-compatible layouts preserve broad engine compatibility
- Iceberg improves schema evolution, partition handling, and table maintenance
- Trino can query MinIO-backed object storage and supports both Hive and Iceberg catalogs
## 6. Data Model
### 6.1 PostgreSQL schema outline
Core tables:
- `companies`
- `company_aliases`
- `watchlists`
- `watchlist_members`
- `sources`
- `api_credentials_refs`
- `ingestion_runs`
- `market_snapshots`
- `documents`
- `document_versions`
- `document_company_mentions`
- `document_intelligence`
- `document_impact_records`
- `trend_windows`
- `recommendations`
- `recommendation_evidence`
- `risk_evaluations`
- `broker_accounts`
- `orders`
- `order_events`
- `positions`
- `audit_events`
### 6.2 Article or document metadata record
```json
{
  "document_id": "uuid",
  "document_type": "article|filing|transcript|press_release",
  "symbol_candidates": ["AAPL", "MSFT"],
  "source_type": "news_api",
  "publisher": "string",
  "url": "string",
  "canonical_url": "string",
  "title": "string",
  "published_at": "2026-04-09T00:00:00Z",
  "retrieved_at": "2026-04-09T00:00:00Z",
  "language": "en",
  "content_hash": "sha256",
  "storage_refs": {
    "raw_html": "s3://...",
    "raw_payload": "s3://..."
  }
}
```
### 6.3 Document intelligence schema
```json
{
  "document_id": "uuid",
  "summary": "string",
  "companies": [
    {
      "ticker": "AAPL",
      "company_name": "Apple Inc.",
      "relevance": 0.95,
      "sentiment": "positive",
      "impact_score": 0.71,
      "impact_horizon": "1d_30d",
      "catalyst_type": "earnings|product|legal|macro|supply_chain|m_and_a|rating_change|other",
      "key_facts": ["string"],
      "risks": ["string"],
      "evidence_spans": ["string"]
    }
  ],
  "macro_themes": ["rates", "ai_capex"],
  "novelty_score": 0.64,
  "source_credibility": 0.8,
  "extraction_warnings": ["ambiguous_ticker_reference"],
  "confidence": 0.86,
  "model": {
    "provider": "ollama",
    "model_name": "gpt-oss:20b",
    "prompt_version": "document-intel-v2",
    "schema_version": "2.0.0"
  }
}
```
### 6.4 Trend summary schema
```json
{
  "entity_type": "company",
  "entity_id": "AAPL",
  "window": "7d",
  "trend_direction": "bullish|bearish|mixed|neutral",
  "trend_strength": 0.68,
  "confidence": 0.74,
  "top_supporting_evidence": ["document_id_1", "document_id_2"],
  "top_opposing_evidence": ["document_id_3"],
  "dominant_catalysts": ["product", "rating_change"],
  "material_risks": ["regulatory scrutiny"],
  "contradiction_score": 0.22
}
```
### 6.5 Recommendation schema
```json
{
  "recommendation_id": "uuid",
  "ticker": "AAPL",
  "action": "buy|sell|hold|watch",
  "mode": "informational|paper_eligible|live_eligible",
  "confidence": 0.72,
  "time_horizon": "swing_1d_10d",
  "thesis": "string",
  "invalidation_conditions": ["string"],
  "position_sizing": {
    "portfolio_pct": 0.02,
    "max_loss_pct": 0.005
  },
  "evidence_refs": ["document_id_1", "document_id_2"],
  "model_metadata": {
    "version": "recommendation-v1"
  }
}
```
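The `position_sizing` fields in the schema above can be turned into a share count with simple arithmetic. The sketch below assumes a long position with a stop price that bounds the loss per share; the stop price, the function name, and the rounding rule are illustrative assumptions, and only `portfolio_pct` and `max_loss_pct` come from the schema.

```python
import math


def position_size_shares(portfolio_value: float, price: float, stop_price: float,
                         portfolio_pct: float = 0.02,
                         max_loss_pct: float = 0.005) -> int:
    """Largest whole-share long position allowed by both sizing caps."""
    if price <= 0 or stop_price >= price:
        raise ValueError("price must be positive and stop_price below price")
    # Cap 1: the position may not exceed portfolio_pct of the portfolio.
    by_allocation = math.floor(portfolio_value * portfolio_pct / price)
    # Cap 2: the loss at the stop may not exceed max_loss_pct of the portfolio.
    by_loss = math.floor(portfolio_value * max_loss_pct / (price - stop_price))
    return max(0, min(by_allocation, by_loss))
```

With a 100,000 portfolio, a price of 199 and a stop at 188, the allocation cap allows 10 shares and the loss cap allows 45, so the allocation cap binds. With a wide stop the loss cap becomes the binding constraint instead.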
## 7. Analytical Lake Datasets
The analytical plane should expose the following logical fact tables:
- `lake.market_bars`
- `lake.market_quotes`
- `lake.company_events`
- `lake.documents`
- `lake.document_extractions`
- `lake.trade_signals`
- `lake.trade_orders`
- `lake.trade_fills`
- `lake.positions_daily`
- `lake.pnl_daily`
- `lake.prediction_vs_outcome`
Recommended partitioning examples:
- market data: partition by `dt`; a symbol-based partition transform can be added later
- documents: partition by `dt` and optionally `source_type`
- predictions: partition by `dt` and `model_version`
- fills and PnL: partition by `dt` and broker account
## 8. Data Flows
### 8.1 Market and document ingestion flow
1. Scheduler selects due symbols and sources.
2. Adapters fetch market, news, and filings payloads.
3. Raw payloads are written to MinIO.
4. Metadata records are written to PostgreSQL.
5. New documents are emitted to parser jobs.
### 8.2 Extraction flow
1. Parser produces normalized text and confidence score.
2. Extraction worker sends document to Ollama with schema-bound output.
3. Validator checks schema and semantic consistency.
4. Canonical intelligence object is stored in PostgreSQL and MinIO.
5. Aggregation jobs are triggered for impacted symbols.
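Steps 2 through 4 of this flow can be sketched as a validate-and-retry loop. The example below is a simplified illustration: it takes the model call as an injected `generate` callable instead of calling Ollama, checks only a few top-level fields, and uses hypothetical error labels. A full implementation would validate against the complete JSON schema and add semantic checks.

```python
import json
from typing import Callable, Optional

# Hypothetical subset of required top-level fields and their accepted types.
REQUIRED_FIELDS = {
    "document_id": str,
    "summary": str,
    "companies": list,
    "confidence": (int, float),
}


def validate(raw: str) -> tuple[Optional[dict], list[str]]:
    """Return (parsed object, errors); the object is None when invalid."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, [f"invalid_json: {exc.msg}"]
    if not isinstance(obj, dict):
        return None, ["not_an_object"]
    errors = [
        f"missing_or_wrong_type: {name}"
        for name, typ in REQUIRED_FIELDS.items()
        if not isinstance(obj.get(name), typ)
    ]
    if not errors and not 0.0 <= obj["confidence"] <= 1.0:
        errors.append("confidence_out_of_range")
    return (None if errors else obj), errors


def extract_with_retry(generate: Callable[[], str], max_attempts: int = 3):
    """Call generate() until its output validates; keep failures for audit."""
    failures = []
    for _ in range(max_attempts):
        raw = generate()
        obj, errors = validate(raw)
        if obj is not None:
            return obj, failures
        failures.append({"raw_output": raw, "errors": errors})
    return None, failures
```

Returning the failed outputs alongside the result lets the caller persist them with their validation errors, and a `None` result signals that the document must not advance to aggregation.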
### 8.3 Recommendation and trade flow
1. Aggregation engine updates trend windows.
2. Recommendation engine emits a recommendation object.
3. Risk engine determines eligibility and allowed execution mode.
4. Broker adapter places paper or live orders when authorized.
5. Broker events update PostgreSQL and publish analytical facts to the lake.
### 8.4 Lake publication flow
1. Operational records are transformed into analytical facts.
2. Facts are written as partitioned Parquet files to MinIO.
3. Table metadata is updated through Iceberg or equivalent catalog operations.
4. Trino exposes the datasets for SQL.
5. Superset uses Trino datasets for dashboards and ad hoc exploration.
## 9. Query and Dashboard Surface
### 9.1 Operational API
Should expose:
- company and watchlist configuration
- source health and job state
- document timelines and evidence
- recommendation history
- order history and audit trail
- risk configuration and trading mode
### 9.2 Analytical surface
Should expose:
- SQL access through Trino
- dashboard datasets in Superset
- scorecards for prediction accuracy and PnL
- evidence-to-outcome drill-down views
- model performance and extraction failure dashboards
Suggested starter dashboards:
- symbol overview
- market sentiment heatmap
- prediction confidence vs realized move
- paper trading PnL
- model extraction quality
- source coverage and ingestion lag
## 10. Reliability and Safety
- Broker submission must be idempotent.
- Live trading must be disabled by default.
- Paper trading must be the first enabled execution mode.
- Invalid model output must not advance to trade execution.
- Low-quality document extraction must not influence live trading.
- All analytical publication jobs should be replayable.
- Every recommendation and order should be reproducible from saved prompts, source refs, and model metadata.
## 11. Deployment Notes
Recommended Kubernetes workloads:
- `symbol-registry-api`
- `scheduler`
- `market-adapter-worker`
- `news-adapter-worker`
- `filings-adapter-worker`
- `scraper-worker`
- `parser-worker`
- `ollama-extractor-worker`
- `aggregation-worker`
- `recommendation-worker`
- `risk-engine-api`
- `broker-adapter`
- `lake-publisher`
- `trino-coordinator`
- `trino-worker`
- `superset-web`
- `postgres`
- `redis`
- `minio`
## 12. Deliberate Scope Boundaries for v1
Included in v1:
- tracked watchlists
- market, news, filings, and broker integrations
- Ollama structured extraction
- trend aggregation and recommendation objects
- paper trading with strict controls
- MinIO-backed analytics lake
- Trino and Superset self-hosted analytics
Deferred from v1:
- options trading
- full order book or tick-level market microstructure
- online model retraining
- fully autonomous live trading with no approval workflow
- advanced portfolio optimization beyond basic sizing and risk caps
+269
@@ -0,0 +1,269 @@
# Stonks Oracle - Requirements
## Overview
This feature builds an AI-assisted market intelligence, execution, and analytics platform for a Kubernetes-hosted environment. The platform ingests market symbols, licensed market data, company-specific news, regulatory filings, scraped web sources, and broker execution events; stores raw and normalized artifacts; extracts structured JSON with local Ollama models; computes trend and sentiment summaries; and optionally places trades through a broker integration.
The platform SHALL also maintain a local analytics lake on MinIO using Hive-compatible partitioned data, support Athena-like SQL querying over captured market and trade data, and expose QuickSight-like dashboards for research, review, and audit.
The initial release is focused on reliable ingestion, deterministic structured extraction, explainable trend scoring, paper trading safety, and internal analytics visibility.
## User Stories
- As an operator, I want to register companies, tickers, sectors, watchlists, and source rules so the system knows what to monitor.
- As an analyst, I want every raw article, filing, market snapshot, and scrape artifact preserved so I can audit downstream AI conclusions.
- As a data engineer, I want structured JSON extraction from each article and filing so downstream analytics are queryable.
- As a strategist, I want aggregated trend assessments per symbol, sector, and market regime so I can evaluate opportunities.
- As a trader, I want the system to generate explainable trade recommendations with explicit confidence, catalysts, and risk notes.
- As a risk owner, I want strict controls on automated trading so the system cannot place unsafe orders.
- As a quantitative reviewer, I want to query historical market data, AI predictions, and executed trades in one SQL-accessible analytics plane.
- As a dashboard user, I want QuickSight-like visualizations for performance, signal quality, prediction accuracy, and model behavior.
- As a platform owner, I want the system to run fully inside Kubernetes against local Ollama and self-hosted analytics components.
## Functional Requirements
### 1. Watchlist and source management
#### Requirement 1.1
WHEN an operator creates or updates a tracked company
THE SYSTEM SHALL persist the company profile including ticker, legal name, aliases, exchange, sector, industry, market cap bucket, and source configuration.
#### Requirement 1.2
WHEN an operator defines a source configuration for a company
THE SYSTEM SHALL support source types including market data APIs, news API feeds, SEC or investor relations URLs, company press release pages, earnings transcript sources, curated web pages, and broker-linked execution sources.
#### Requirement 1.3
WHEN a company has aliases, brands, or product names
THE SYSTEM SHALL use those aliases during source retrieval, de-duplication, entity matching, and extraction.
### 2. External API integrations
#### Requirement 2.1
WHEN the scheduler triggers a market ingestion cycle
THE SYSTEM SHALL fetch configured market data API results for tracked companies and persist raw response payloads.
#### Requirement 2.2
WHEN the scheduler triggers a news ingestion cycle
THE SYSTEM SHALL fetch configured news API results for tracked companies and persist raw response payloads.
#### Requirement 2.3
WHEN the scheduler triggers a regulatory ingestion cycle
THE SYSTEM SHALL fetch configured filing or issuer event data from authoritative sources such as SEC-style APIs and persist raw response payloads.
#### Requirement 2.4
WHEN trade automation is enabled
THE SYSTEM SHALL integrate with at least one broker API that supports paper trading, order placement, order status retrieval, positions, account balances, and execution events.
#### Requirement 2.5
WHEN external APIs enforce rate limits or quotas
THE SYSTEM SHALL coordinate request pacing, retries, and backoff across workers.
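A common way to space out retries is exponential backoff with jitter. The sketch below shows the "full jitter" variant for a single worker; the function name and default values are assumptions, and cross-worker coordination (for example shared counters in Redis) is not shown.

```python
import random
from typing import Optional


def backoff_delay(attempt: int, base_seconds: float = 1.0,
                  cap_seconds: float = 60.0,
                  rng: Optional[random.Random] = None) -> float:
    """Full-jitter backoff: uniform in [0, min(cap, base * 2**attempt)]."""
    rng = rng or random.Random()
    ceiling = min(cap_seconds, base_seconds * (2 ** attempt))
    return rng.uniform(0.0, ceiling)
```

Randomizing the delay keeps workers that failed at the same moment from retrying in lockstep against a rate-limited API.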
### 3. Ingestion and raw artifact retention
#### Requirement 3.1
WHEN a scraper retrieves an article, filing, or web page
THE SYSTEM SHALL store the raw HTML, rendered text, metadata, retrieval timestamp, and retrieval source in object storage.
#### Requirement 3.2
WHEN an article, filing, or market payload is ingested
THE SYSTEM SHALL generate a stable content hash and use it to prevent duplicate processing.
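A stable content hash could be computed as sketched below. The whitespace normalization step and the in-memory `seen` set are assumptions for illustration; the requirement itself does not prescribe a normalization rule or a storage location for seen hashes.

```python
import hashlib


def normalize_text(text: str) -> bytes:
    """Collapse runs of whitespace so trivial formatting changes hash equally."""
    return " ".join(text.split()).encode("utf-8")


def content_hash(payload: bytes) -> str:
    """SHA-256 hex digest of the payload bytes."""
    return hashlib.sha256(payload).hexdigest()


def is_duplicate(text: str, seen: set[str]) -> bool:
    """Record the hash in seen and report whether it was already present."""
    digest = content_hash(normalize_text(text))
    if digest in seen:
        return True
    seen.add(digest)
    return False
```

Binary payloads would be hashed directly with `content_hash`, without the text normalization step.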
#### Requirement 3.3
WHEN the system stores a raw artifact
THE SYSTEM SHALL persist an associated metadata record containing symbol, source, URL when applicable, title, publication time, retrieval time, language when applicable, and content hash.
#### Requirement 3.4
WHEN content retrieval fails
THE SYSTEM SHALL record the failure reason, retry policy state, and next eligible retry time.
### 4. Parsing and normalization
#### Requirement 4.1
WHEN a raw article or filing enters the parsing stage
THE SYSTEM SHALL extract normalized text, author data when available, publisher, tags, mentioned entities, outbound links, and document type.
#### Requirement 4.2
WHEN the system detects boilerplate or repeated template text
THE SYSTEM SHALL reduce or remove boilerplate before AI extraction while retaining the original raw artifact for audit.
#### Requirement 4.3
WHEN the parser cannot confidently extract article body text
THE SYSTEM SHALL flag the document for low-quality extraction and prevent it from influencing downstream trading until reviewed or reprocessed.
### 5. AI article and document extraction
#### Requirement 5.1
WHEN a normalized article or filing is ready for AI extraction
THE SYSTEM SHALL send the document to a local Ollama model using structured output with an explicit JSON schema.
#### Requirement 5.2
WHEN the model returns extraction output
THE SYSTEM SHALL validate the response against the expected schema before saving it.
#### Requirement 5.3
WHEN extraction succeeds
THE SYSTEM SHALL produce a canonical document intelligence object with at minimum:
- document_id
- document_type
- source metadata
- tickers referenced
- companies referenced
- document summary
- sentiment by company
- catalyst type
- impact horizon
- key facts
- risks mentioned
- macro themes
- confidence score
- extraction warnings
- model metadata
#### Requirement 5.4
WHEN the model response is invalid, incomplete, or contains unsupported (hallucinated) claims
THE SYSTEM SHALL retry extraction according to policy and preserve both the failed output and validation errors.
#### Requirement 5.5
WHEN a document is materially relevant to multiple companies
THE SYSTEM SHALL emit one shared document record and one or more per-company impact records.
### 6. Aggregation and trend analysis
#### Requirement 6.1
WHEN multiple document intelligence objects and market observations exist for a company
THE SYSTEM SHALL generate a rolling company trend summary over configurable windows including intraday, 1 day, 7 day, 30 day, and 90 day intervals.
#### Requirement 6.2
WHEN generating a company trend summary
THE SYSTEM SHALL consider sentiment, catalyst frequency, source credibility, recency decay, contradiction detection, document novelty, and current market context.
#### Requirement 6.3
WHEN generating a market-wide trend summary
THE SYSTEM SHALL aggregate company-level signals into sector and market-level summaries.
#### Requirement 6.4
WHEN contradictory signals exist across sources
THE SYSTEM SHALL represent disagreement explicitly rather than collapsing it into a single unsupported conclusion.
#### Requirement 6.5
WHEN a trend summary is produced
THE SYSTEM SHALL include explainability fields listing the top supporting and opposing evidence.
### 7. Trade recommendation generation
#### Requirement 7.1
WHEN a company trend summary is available
THE SYSTEM SHALL be able to generate a recommendation object containing action type, thesis, confidence, expected horizon, invalidation conditions, and cited evidence.
#### Requirement 7.2
WHEN a recommendation is generated
THE SYSTEM SHALL separate descriptive analysis from prescriptive trade action and include a risk classification.
#### Requirement 7.3
WHEN the system proposes a trade
THE SYSTEM SHALL attach position sizing guidance based on configured portfolio rules rather than unconstrained model output.
#### Requirement 7.4
WHEN the confidence or data quality falls below configured thresholds
THE SYSTEM SHALL suppress automated trade eligibility and mark the recommendation as informational only.
### 8. Trade execution and safety controls
#### Requirement 8.1
WHEN trade automation is enabled
THE SYSTEM SHALL support paper trading mode and live trading mode as separate execution environments.
#### Requirement 8.2
WHEN live trading mode is enabled
THE SYSTEM SHALL require operator approval controls, risk limits, and broker credential isolation.
#### Requirement 8.3
WHEN the system places an order
THE SYSTEM SHALL persist the full decision trace including signals used, prompt versions, model versions, thresholds, and broker response.
#### Requirement 8.4
WHEN a proposed order violates configured risk controls
THE SYSTEM SHALL reject the order before broker submission.
#### Requirement 8.5
WHEN a broker API is unavailable or partially fails
THE SYSTEM SHALL fail closed and SHALL NOT place duplicate or ambiguous orders.
### 9. Storage and queryability
#### Requirement 9.1
WHEN storing raw artifacts
THE SYSTEM SHALL use MinIO object storage as the system of record for HTML, text, API payloads, prompts, model outputs, and exported analytical datasets.
#### Requirement 9.2
WHEN storing normalized relational data
THE SYSTEM SHALL use PostgreSQL for companies, watchlists, article metadata, document intelligence objects, trends, recommendations, operational execution records, and control-plane state.
#### Requirement 9.3
WHEN low-latency coordination or caching is required
THE SYSTEM SHALL use Redis for job state, distributed locks, short-lived caches, and rate-limit coordination.
#### Requirement 9.4
WHEN historical analytical queries are needed
THE SYSTEM SHALL persist analytical fact datasets in Hive-compatible partitioned form on MinIO so that market data, predictions, and trade outcomes can be queried together.
#### Requirement 9.5
WHEN analytical table management is required
THE SYSTEM SHALL support a lakehouse table abstraction that permits append-only fact ingestion, partitioned queries, and schema evolution.
### 10. SQL analytics and dashboards
#### Requirement 10.1
WHEN a user or service executes an analytical query
THE SYSTEM SHALL provide an Athena-like SQL query service over MinIO-hosted analytical datasets.
#### Requirement 10.2
WHEN a dashboard user explores market, prediction, and trade data
THE SYSTEM SHALL expose QuickSight-like dashboards for performance, confidence, prediction accuracy, evidence coverage, and model behavior.
#### Requirement 10.3
WHEN analytical results combine AI outputs with executed trades and market outcomes
THE SYSTEM SHALL support joins across predicted signals, broker executions, and realized performance data.
#### Requirement 10.4
WHEN dashboards or research queries need drill-down capability
THE SYSTEM SHALL provide traceability from analytical aggregates back to underlying documents, prompts, model outputs, and raw artifacts.
### 11. APIs and UI
#### Requirement 11.1
WHEN a client requests company analytics
THE SYSTEM SHALL expose APIs for document timelines, trend summaries, recommendation history, execution history, and evidence drill-down.
#### Requirement 11.2
WHEN an operator inspects a recommendation
THE SYSTEM SHALL display the contributing document intelligence objects, the raw sources used, and any market context features that influenced the decision.
#### Requirement 11.3
WHEN a user reviews an order decision
THE SYSTEM SHALL expose a full audit trail from ingestion through broker execution and eventual market outcome.
### 12. Observability and operations
#### Requirement 12.1
WHEN a pipeline stage runs
THE SYSTEM SHALL emit structured logs, metrics, and traces for ingestion, parsing, extraction, aggregation, analytics publication, and trading.
#### Requirement 12.2
WHEN model performance degrades
THE SYSTEM SHALL surface schema failure rates, latency percentiles, token usage estimates, and extraction retry counts.
#### Requirement 12.3
WHEN source coverage changes materially
THE SYSTEM SHALL alert operators about sustained source failures, symbol coverage gaps, or analytical publication lag.
## Non-Functional Requirements
#### Requirement N1
WHEN the system processes documents and market events concurrently
THE SYSTEM SHALL support horizontal scaling across Kubernetes workers.
#### Requirement N2
WHEN the system stores model-derived conclusions
THE SYSTEM SHALL preserve enough provenance to reproduce or challenge those conclusions later.
#### Requirement N3
WHEN the system handles licensed or restricted content
THE SYSTEM SHALL preserve source metadata, access policy, and retention policy for each artifact.
#### Requirement N4
WHEN the system publishes analytical datasets
THE SYSTEM SHALL ensure queryable partitions are written atomically or with an equivalent consistency guarantee.
#### Requirement N5
WHEN trade execution is enabled
THE SYSTEM SHALL prioritize fail-closed behavior over availability in ambiguous conditions.
#### Requirement N6
WHEN dashboards query large historical datasets
THE SYSTEM SHALL support partition pruning and index or metadata strategies that keep typical analyst queries responsive.
+129
@@ -0,0 +1,129 @@
# Stonks Oracle - Tasks
## Phase 0 - Project Setup
- [x] Create repository structure for services, shared schemas, infrastructure, lakehouse, and dashboards
- [x] Choose implementation language for services (Python preferred for scraping/LLM workflows)
- [x] Add local development stack with MinIO, PostgreSQL, Redis, Ollama, Trino, and Superset
- [x] Add Kubernetes manifests or Helm chart skeletons for all core components
- [x] Add CI pipeline for linting, tests, container builds, schema checks, and lake dataset validation
## Phase 1 - Core Data and Infrastructure
- [x] Create PostgreSQL schema migrations for companies, watchlists, sources, documents, document intelligence, trends, recommendations, orders, positions, and audit records
- [x] Create MinIO bucket provisioning and lifecycle policies
- [x] Create Redis key conventions and queue abstractions
- [x] Implement shared config loader for environment variables and secrets
- [x] Implement shared typed JSON schemas for document intelligence, trend summaries, and recommendations
- [x] Stand up initial Trino catalog configuration for MinIO-backed datasets
- [x] Stand up Superset with environment-backed datasource configuration
## Phase 2 - Symbol Registry and Source Management
- [ ] Build symbol registry API endpoints for companies, aliases, watchlists, and sources
- [ ] Add source credibility, retention policy, and access policy fields
- [ ] Add source classes for market data API, news API, filings API, web scrape, and broker adapter
- [ ] Add admin validation for duplicate tickers, invalid URLs, and unsupported source types
- [ ] Add seed data support for an initial tracked watchlist
## Phase 3 - External API Adapters
- [ ] Implement scheduler for symbol and source polling windows
- [ ] Implement market data API adapter interface
- [ ] Implement first concrete market data provider adapter
- [ ] Implement news API adapter interface
- [ ] Implement first concrete news API provider adapter
- [ ] Implement filings or regulatory adapter interface
- [ ] Implement first concrete filings provider adapter
- [ ] Implement broker API adapter interface for paper trading and order events
- [ ] Implement rate-limit coordination, retries, and backoff across adapters
## Phase 4 - Ingestion Pipeline
- [ ] Implement web scraper worker for curated URLs and article pages
- [ ] Implement canonical URL normalization and content hashing
- [ ] Implement raw artifact upload to MinIO
- [ ] Implement metadata persistence in PostgreSQL for market payloads, documents, and broker events
- [ ] Implement retry and failure tracking for source retrieval
- [ ] Implement dedupe logic across article and filing sources
## Phase 5 - Parsing and Normalization
- [ ] Implement HTML-to-text parsing pipeline
- [ ] Implement boilerplate reduction and body extraction heuristics
- [ ] Implement parser quality scoring and confidence flags
- [ ] Implement company mention detection using ticker, alias, and name matching
- [ ] Persist normalized text and parser outputs to MinIO and PostgreSQL
## Phase 6 - Ollama Structured Extraction
- [ ] Build extraction prompt templates with anti-hallucination instructions
- [ ] Build JSON schema definitions for document intelligence extraction
- [ ] Implement Ollama client wrapper using structured output format
- [ ] Implement schema validation and semantic validation layers
- [ ] Persist prompts, model metadata, raw outputs, validation reports, and final intelligence objects
- [ ] Add retry behavior for invalid or incomplete model responses
- [ ] Add model performance metrics and dashboards
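
The validation and retry items above might be layered roughly as follows. This is a hedged sketch: the field names (`ticker`, `sentiment`, `confidence`) and allowed values are hypothetical stand-ins for the project's real document intelligence schema, and `generate` is an injected callable rather than an actual Ollama client call.

```python
import json
from typing import Callable, List, Optional, Tuple

# Hypothetical required fields; the real schema lives in the shared schemas.
REQUIRED_FIELDS = {"ticker": str, "sentiment": str, "confidence": (int, float)}
ALLOWED_SENTIMENTS = {"positive", "neutral", "negative"}


def validate_extraction(raw: str) -> Tuple[Optional[dict], List[str]]:
    """Return (data, errors); data is None whenever errors is non-empty."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return None, ["top-level value must be an object"]
    errors: List[str] = []
    # Schema layer: presence and types (bool is rejected as a number).
    for name, expected in REQUIRED_FIELDS.items():
        if name not in data:
            errors.append(f"missing field: {name}")
        elif isinstance(data[name], bool) or not isinstance(data[name], expected):
            errors.append(f"wrong type: {name}")
    # Semantic layer: only meaningful once the shape is correct.
    if not errors:
        if data["sentiment"] not in ALLOWED_SENTIMENTS:
            errors.append("sentiment outside allowed values")
        if not 0.0 <= data["confidence"] <= 1.0:
            errors.append("confidence outside [0, 1]")
    return (None, errors) if errors else (data, errors)


def extract_with_retries(generate: Callable[[str], str], prompt: str,
                         max_attempts: int = 3) -> Tuple[Optional[dict], List[List[str]]]:
    """Call generate until output validates; return data plus per-attempt reports."""
    reports: List[List[str]] = []
    for _ in range(max_attempts):
        data, errors = validate_extraction(generate(prompt))
        reports.append(errors)
        if data is not None:
            return data, reports
    return None, reports
```

Returning every attempt's error list keeps the validation reports available for persistence alongside the raw model outputs.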
## Phase 7 - Aggregation and Trend Engine
- [ ] Implement recency decay and source credibility weighting
- [ ] Integrate market context features into aggregation windows
- [ ] Implement company-level rolling window aggregation
- [ ] Implement contradiction detection and disagreement representation
- [ ] Implement sector and market rollups
- [ ] Implement evidence ranking for supporting and opposing documents
- [ ] Persist trend windows and evidence mappings
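
One possible shape for the recency decay and credibility weighting item is an exponential half-life weight multiplied by source credibility. The sketch below assumes that formula, a 48-hour default half-life, and a hypothetical `Signal` record; none of these are fixed by the design.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Optional


@dataclass(frozen=True)
class Signal:
    """Hypothetical per-document signal feeding a company-level window."""
    score: float          # sentiment or impact score, assumed in [-1, 1]
    credibility: float    # source credibility, assumed in [0, 1]
    published_at: datetime


def signal_weight(signal: Signal, now: datetime,
                  half_life_hours: float = 48.0) -> float:
    """credibility * 0.5 ** (age / half_life); future timestamps count as age 0."""
    age_hours = max(0.0, (now - signal.published_at).total_seconds() / 3600.0)
    return signal.credibility * 0.5 ** (age_hours / half_life_hours)


def weighted_score(signals: Iterable[Signal], now: datetime,
                   half_life_hours: float = 48.0) -> Optional[float]:
    """Weighted mean of scores, or None when there is no usable weight."""
    pairs = [(signal_weight(s, now, half_life_hours), s.score) for s in signals]
    total = sum(w for w, _ in pairs)
    if total == 0:
        return None
    return sum(w * score for w, score in pairs) / total
```

With this weighting, a fully credible document published one half-life ago counts half as much as one published now. Returning `None` for an empty window lets downstream suppression logic distinguish "no data" from a neutral score.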
## Phase 8 - Recommendation Engine
- [ ] Design deterministic recommendation eligibility logic
- [ ] Implement recommendation generation from aggregated scores and evidence
- [ ] Add optional LLM wording layer for thesis generation only
- [ ] Persist recommendation objects and evidence citations
- [ ] Add suppression logic for low-quality data or low confidence
- [ ] Publish prediction facts to analytical tables
## Phase 9 - Risk Engine and Trade Adapter
- [ ] Implement portfolio and account risk configuration model
- [ ] Implement hard blocks for max position size, sector exposure, daily loss limits, and news-shock lockouts
- [ ] Implement paper trading adapter behavior and state sync
- [ ] Integrate first broker API in sandbox mode
- [ ] Implement idempotent order submission keys and duplicate prevention
- [ ] Implement full execution audit trail
- [ ] Add operator approval workflow for live trading mode
- [ ] Publish order, fill, and position facts to analytical tables
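
For the idempotent order submission item, one approach is to derive a deterministic key from the fields that define "the same order" and refuse to submit twice under that key. The sketch below is illustrative only: the key fields, `order_idempotency_key`, and `InMemoryOrderLedger` are hypothetical, and a real deployment would presumably back the ledger with a PostgreSQL unique constraint or a similar durable store rather than a dict.

```python
import hashlib
import json
from typing import Any, Callable, Dict, Tuple


def order_idempotency_key(account_id: str, recommendation_id: str,
                          symbol: str, side: str, quantity: str) -> str:
    """Deterministic key: identical order intents always hash to the same value.

    quantity is passed as a string to avoid float formatting differences.
    """
    payload = json.dumps(
        {
            "account_id": account_id,
            "recommendation_id": recommendation_id,
            "symbol": symbol.upper(),
            "side": side.lower(),
            "quantity": quantity,
        },
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:32]


class InMemoryOrderLedger:
    """Process-local stand-in for a durable duplicate-prevention store."""

    def __init__(self) -> None:
        self._results: Dict[str, Any] = {}

    def submit_once(self, key: str, submit: Callable[[], Any]) -> Tuple[Any, bool]:
        """Run submit only for unseen keys; return (result, was_submitted)."""
        if key in self._results:
            return self._results[key], False
        result = submit()
        self._results[key] = result
        return result, True
```

If the broker accepts a client-supplied order ID, the same key could be passed through so that duplicates are also rejected on the broker side. Note that this sketch records the key only after `submit` returns, so an ambiguous failure mid-submission still needs the fail-closed handling described in the verification phase.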
## Phase 10 - Lakehouse and SQL Analytics
- [ ] Define analytical fact tables for bars, documents, extractions, signals, orders, fills, positions, and PnL
- [ ] Implement Parquet writers for analytical datasets
- [ ] Implement Hive-compatible partition layout conventions on MinIO
- [ ] Implement Iceberg table creation and metadata management for analytical datasets
- [ ] Implement lake publisher jobs from operational data into analytical fact tables
- [ ] Configure Trino catalogs for Hive and/or Iceberg access to MinIO
- [ ] Add example SQL views for prediction-vs-outcome and paper-trade scorecards
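
The Hive-compatible partition layout item refers to the `key=value` directory convention that Hive-style readers use to discover partitions. A minimal sketch of building such object key prefixes follows; the table name `fact_bars` and the partition columns `dt` and `symbol` in the usage note are assumptions, not decided conventions.

```python
from typing import Mapping
from urllib.parse import quote


def partition_prefix(table: str, partitions: Mapping[str, object]) -> str:
    """Build 'table/key=value/...' in the mapping's insertion order.

    Values are percent-encoded so characters such as '/' cannot
    introduce extra path segments.
    """
    segments = [f"{key}={quote(str(value), safe='')}" for key, value in partitions.items()]
    return "/".join([table, *segments])


def object_key(table: str, partitions: Mapping[str, object], filename: str) -> str:
    """Full object key for one Parquet file inside a partition."""
    return f"{partition_prefix(table, partitions)}/{filename}"
```

For example, `object_key("fact_bars", {"dt": "2026-04-10", "symbol": "AAPL"}, "part-0000.parquet")` yields `fact_bars/dt=2026-04-10/symbol=AAPL/part-0000.parquet`. Partition column order matters for this layout, so writers should agree on one order per table.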
## Phase 11 - Query API and Dashboard
- [ ] Build APIs for companies, document timelines, trend summaries, recommendations, and order history
- [ ] Build evidence drill-down view linking recommendations to source documents and raw artifacts
- [ ] Build admin controls for source health, symbol configs, and trading mode
- [ ] Build operational dashboard for ingestion throughput, model failures, and source coverage gaps
- [ ] Build Superset starter dashboards for symbol overview, sentiment heatmap, PnL, and prediction accuracy
## Phase 12 - Observability and Hardening
- [ ] Add structured logs and distributed tracing across services
- [ ] Add Prometheus metrics for ingestion, parsing, extraction, aggregation, lake publication, and trading
- [ ] Add alerting for source failures, schema failure spikes, analytical lag, and broker issues
- [ ] Add dead-letter queues and replay tooling
- [ ] Add data retention and lifecycle controls for raw and derived artifacts
- [ ] Add security review for secrets, network policies, trading isolation, and dashboard access control
## Phase 13 - Verification and Rollout
- [ ] Create replay dataset from archived documents for deterministic extraction testing
- [ ] Create integration tests for the full ingest-to-recommendation flow
- [ ] Create paper trading simulation scenarios
- [ ] Validate fail-closed behavior for broker outages and ambiguous order states
- [ ] Validate lake publication and Trino query correctness over partitioned MinIO datasets
- [ ] Run shadow mode before enabling any live execution
- [ ] Prepare operator runbook and incident response procedures
## Recommended First Vertical Slice
- [ ] Track 5 to 10 symbols
- [ ] Ingest one market data API, one news API, and one filings source per symbol group
- [ ] Persist raw artifacts to MinIO and metadata to PostgreSQL
- [ ] Extract structured document intelligence through Ollama
- [ ] Generate 7-day company trend summaries with market context
- [ ] Produce paper-trade recommendations only
- [ ] Publish analytical facts for bars, signals, and paper trades into MinIO
- [ ] Expose a simple dashboard with evidence, trend cards, and prediction-vs-outcome views