stonks-oracle/design.md
Celes Renata 8cfc4f423b initial commit
2026-04-11 02:15:06 -07:00


# Stonks Oracle - Design
## 1. Purpose
Stonks Oracle is a Kubernetes-native AI market intelligence and trading platform. It ingests structured market data, company news, filings, and curated web content; preserves raw artifacts in MinIO; extracts structured intelligence objects with local Ollama models; aggregates signals into trend and recommendation outputs; optionally executes trades through a broker integration; and publishes historical datasets into a local lakehouse for Athena-like querying and QuickSight-like dashboards.
This design prioritizes:
- deterministic data contracts
- auditability of every AI-derived conclusion
- safe paper-trading-first automation
- self-hosted analytics on MinIO-backed datasets
- clear separation between operational state and analytical state
## 2. Architecture Summary
The platform is split into two planes:
### 2.1 Operational plane
Handles ingestion, parsing, structured extraction, signal generation, risk evaluation, trade execution, and control APIs.
Primary stores:
- PostgreSQL for operational state and transactional records
- Redis for queues, locks, and hot cache state
- MinIO for raw artifacts, prompts, model outputs, and exported datasets
### 2.2 Analytical plane
Handles historical fact storage, SQL query access, research, scorecards, and dashboards.
Primary components:
- MinIO as S3-compatible object store
- Hive-compatible partition layout for query compatibility
- Iceberg tables as the preferred lakehouse abstraction for managed analytical datasets
- Trino as the Athena-like SQL query engine
- Apache Superset as the QuickSight-like dashboard and exploration layer
## 3. External Integrations
### 3.1 Market Data API
Used for:
- quotes
- OHLCV bars
- reference data
- corporate actions
- earnings calendars
- optional market news or fundamentals
### 3.2 News API
Used for:
- company-linked headlines
- publisher metadata
- article URLs
- article summaries when licensed
### 3.3 Filings / Regulatory API
Used for:
- SEC-style company submissions
- 8-K, 10-Q, 10-K, and related filings
- structured issuer event discovery
### 3.4 Web Scraper
Used for:
- full article body retrieval when API content is partial
- investor relations pages
- curated press release sources
- transcript or presentation retrieval when permitted
### 3.5 Broker API
Used for:
- paper-trading simulation or sandbox trading
- live order submission when enabled
- order acknowledgements and rejections
- fills and cancellations
- positions and account balances
## 4. Logical Components
### 4.1 Symbol Registry Service
Responsibilities:
- manage companies, aliases, watchlists, sectors, and source configurations
- manage source trust or credibility policies
- manage symbol-to-document matching rules
### 4.2 Scheduler / Orchestrator
Responsibilities:
- trigger market, news, filings, and scrape jobs
- manage polling cadences by source class
- coordinate backoff, retries, and dedupe windows
- publish downstream jobs to workers
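
As an illustration of the backoff responsibility, the following is a minimal sketch of capped exponential backoff with full jitter. The function name `backoff_delay` and the default `base` and `cap` values are assumptions made for this example, not settings taken from the design.

```python
import random
from typing import Optional


def backoff_delay(
    attempt: int,
    base: float = 2.0,      # assumed first-retry ceiling, in seconds
    cap: float = 300.0,     # assumed upper bound on any single delay
    rng: Optional[random.Random] = None,
) -> float:
    """Return a retry delay in seconds using capped exponential backoff with full jitter."""
    if attempt < 0:
        raise ValueError("attempt must be >= 0")
    rng = rng or random.Random()
    # The ceiling doubles with each attempt until it reaches the cap.
    ceiling = min(cap, base * (2 ** attempt))
    # Full jitter: pick uniformly in [0, ceiling] so retries from many
    # workers do not line up against the same upstream API.
    return rng.uniform(0.0, ceiling)
```

Jitter matters most for sources that share a rate limit, since many symbols polling the same API would otherwise retry in lockstep.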
### 4.3 Ingestion Adapters
Subcomponents:
- Market data adapter
- News API adapter
- Filings adapter
- Broker event adapter
Responsibilities:
- fetch external payloads
- preserve raw responses in MinIO
- normalize metadata into PostgreSQL
- emit processing jobs for parsing or publication
### 4.4 Scraper / Parser Service
Responsibilities:
- fetch and render source pages
- extract normalized text and metadata
- reduce boilerplate and duplicated template text
- score parser quality and extraction confidence
- persist normalized artifacts
### 4.5 Ollama Extraction Service
Responsibilities:
- call local Ollama models using schema-constrained JSON output
- produce canonical document intelligence objects
- preserve prompts, schemas, model metadata, and raw outputs
- validate schema and semantic consistency
- retry invalid generations under policy
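
The validate-and-retry behaviour could look roughly like the sketch below. The `generate` callable is a hypothetical stand-in for the actual Ollama call, and `REQUIRED_KEYS` and `max_attempts` are assumptions; a real retry policy and full schema validation would replace them.

```python
import json
from typing import Callable, List, Optional, Tuple

# Assumed minimal set of top-level keys; the full schema is defined in section 6.3.
REQUIRED_KEYS = {"document_id", "summary", "companies", "confidence"}


def parse_and_check(raw: str) -> Optional[dict]:
    """Return the parsed object, or None if it is not valid JSON with the required keys."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(obj, dict) or not REQUIRED_KEYS.issubset(obj):
        return None
    return obj


def extract_with_retry(
    generate: Callable[[int], str],  # hypothetical: attempt index -> raw model output
    max_attempts: int = 3,
) -> Tuple[Optional[dict], List[str]]:
    """Call the model until the output validates or attempts run out.

    Every raw output is returned so it can be preserved for audit,
    including the rejected ones.
    """
    raw_outputs: List[str] = []
    for attempt in range(max_attempts):
        raw = generate(attempt)
        raw_outputs.append(raw)
        obj = parse_and_check(raw)
        if obj is not None:
            return obj, raw_outputs
    return None, raw_outputs
```

Returning the rejected outputs alongside the accepted one keeps the audit trail complete even when the first generations fail validation.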
### 4.6 Aggregation Engine
Responsibilities:
- combine document intelligence with market context
- compute rolling trend summaries by company, sector, and market
- track contradiction and agreement signals
- score evidence with recency decay and source weighting
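
One way to read "recency decay and source weighting" is an exponentially decayed, credibility-weighted average. The sketch below assumes a signed `impact_score` (positive for bullish, negative for bearish) and a 72-hour half-life; both are assumptions for illustration, and the design does not fix either.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Evidence:
    impact_score: float        # assumed signed: > 0 bullish, < 0 bearish
    source_credibility: float  # 0.0 to 1.0
    age_hours: float           # time since publication


def decayed_weight(e: Evidence, half_life_hours: float = 72.0) -> float:
    """Weight halves every `half_life_hours` and scales with source credibility."""
    return e.source_credibility * 0.5 ** (e.age_hours / half_life_hours)


def weighted_signal(evidence: Iterable[Evidence], half_life_hours: float = 72.0) -> float:
    """Weighted mean of impact scores; 0.0 when there is no usable evidence."""
    items = list(evidence)
    weights = [decayed_weight(e, half_life_hours) for e in items]
    total = sum(weights)
    if total == 0:
        return 0.0
    return sum(w * e.impact_score for w, e in zip(weights, items)) / total
```

With these assumptions, a fresh bullish document and an equally credible bearish one that is three days old do not cancel out: the older one counts for half as much.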
### 4.7 Recommendation Engine
Responsibilities:
- generate explainable recommendation objects from aggregated evidence
- separate deterministic eligibility scoring from final action mapping
- produce suggested action, thesis, horizon, and invalidation conditions
- publish analytical prediction facts to the lake
### 4.8 Risk Engine
Responsibilities:
- enforce guardrails such as max position size, daily loss cap, exposure by sector, symbol cooldowns, news shock lockouts, and operator approval rules
- determine whether a recommendation is eligible for paper or live execution
- block ambiguous or unsafe orders before broker submission
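
A guardrail check can be sketched as a pure function that returns the list of violated rules, where an empty list means the order may proceed. The field names and every default limit below are hypothetical values chosen for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RiskLimits:
    # All defaults are illustrative assumptions, not values from the design.
    max_position_pct: float = 0.05
    daily_loss_cap_pct: float = 0.02
    max_sector_exposure_pct: float = 0.25
    cooldown_minutes: float = 60.0


@dataclass(frozen=True)
class OrderContext:
    position_pct_after: float         # position size after the order, as a share of equity
    daily_loss_pct: float             # realized loss so far today, as a share of equity
    sector_exposure_pct_after: float  # sector exposure after the order
    minutes_since_last_trade: float   # for this symbol
    news_shock_lockout: bool          # set when a news shock lockout is active


def evaluate(ctx: OrderContext, limits: RiskLimits = RiskLimits()) -> List[str]:
    """Return the names of violated guardrails; an empty list means eligible."""
    violations: List[str] = []
    if ctx.position_pct_after > limits.max_position_pct:
        violations.append("max_position_size")
    if ctx.daily_loss_pct >= limits.daily_loss_cap_pct:
        violations.append("daily_loss_cap")
    if ctx.sector_exposure_pct_after > limits.max_sector_exposure_pct:
        violations.append("sector_exposure")
    if ctx.minutes_since_last_trade < limits.cooldown_minutes:
        violations.append("symbol_cooldown")
    if ctx.news_shock_lockout:
        violations.append("news_shock_lockout")
    return violations
```

Returning every violation rather than stopping at the first gives the audit record the full reason an order was blocked.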
### 4.9 Broker Adapter
Responsibilities:
- abstract one or more trading APIs
- support paper mode and live mode
- record submission, acknowledgement, rejection, fill, and cancellation events
- guarantee idempotent order submission keys
- publish order and fill facts to both PostgreSQL and the analytical lake
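
Idempotent submission keys can be derived deterministically from the fields that identify an order intent. The sketch below is one possible approach: the choice of key fields is an assumption, and `InMemoryOrderLedger` is a hypothetical stand-in for a PostgreSQL table with a unique constraint on the key.

```python
import hashlib
from typing import Any, Callable, Dict


def idempotency_key(recommendation_id: str, account_id: str, ticker: str,
                    side: str, mode: str) -> str:
    """Derive a stable key so a retried submission maps to the same order."""
    payload = "|".join([recommendation_id, account_id, ticker, side, mode])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class InMemoryOrderLedger:
    """Hypothetical in-memory stand-in for a durable store keyed by idempotency key."""

    def __init__(self) -> None:
        self._results: Dict[str, Any] = {}

    def submit_once(self, key: str, submit: Callable[[], Any]) -> Any:
        """Run `submit` only the first time a key is seen; replay the stored result after."""
        if key in self._results:
            return self._results[key]
        result = submit()
        self._results[key] = result
        return result
```

Many broker APIs also accept a client-supplied order identifier; where one is available, passing the same key there lets the broker reject duplicates as well.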
### 4.10 Lake Publisher
Responsibilities:
- transform operational records into analytics-friendly fact datasets
- publish append-only partitioned tables to MinIO
- maintain Iceberg metadata or equivalent lakehouse metadata
- expose datasets such as predictions, outcomes, fills, bars, and PnL
### 4.11 Query API / Dashboard
Responsibilities:
- expose companies, documents, trends, recommendations, and orders
- provide evidence drill-down and audit views
- provide operator controls for live-trading enablement and review queues
- expose links into analytical dashboards and query tools
### 4.12 SQL Query Engine and BI Layer
Components:
- Trino coordinator and workers
- Hive Metastore or Iceberg catalog service
- Apache Superset
Responsibilities:
- provide Athena-like SQL access to MinIO-hosted tables
- support dashboard datasets and ad hoc exploration
- support joins between market facts, AI predictions, and executed trades
## 5. Storage Model
### 5.1 Operational stores
#### PostgreSQL
Used for:
- companies and aliases
- watchlists and source configs
- article and filing metadata
- document intelligence objects
- trend summaries
- recommendations
- risk evaluations
- orders and execution events
- control-plane state and audit records
#### Redis
Used for:
- distributed locks for symbol-source retrieval
- ingestion rate-limit counters
- job queue state
- retry backoff state
- dedupe markers
- cache for hot API and dashboard views
#### MinIO object storage
Used for:
- raw API payloads
- raw article HTML and normalized text
- prompts, schemas, and raw model results
- exported analytical datasets
- audit traces and reproducibility bundles
### 5.2 MinIO bucket layout
Recommended buckets:
- `stonks-raw-market` — raw market API payloads
- `stonks-raw-news` — raw news API payloads and article HTML
- `stonks-raw-filings` — raw filings and issuer event payloads
- `stonks-normalized` — cleaned text and parser outputs
- `stonks-llm-prompts` — prompts and schemas used
- `stonks-llm-results` — raw model outputs and validation reports
- `stonks-lakehouse` — partitioned analytical datasets and table metadata
- `stonks-audit` — execution traces and exported reports
Suggested raw object path pattern:
```text
/{stage}/{symbol}/{yyyy}/{mm}/{dd}/{document_id}/{artifact_type}.json
/{stage}/{symbol}/{yyyy}/{mm}/{dd}/{document_id}/{artifact_type}.html
```
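
A small helper can keep raw object keys consistent with the pattern above. This is a sketch only: the function name is hypothetical, and dating the path by retrieval time in UTC (rather than publication time) is an assumption.

```python
from datetime import datetime, timezone


def raw_object_key(stage: str, symbol: str, retrieved_at: datetime,
                   document_id: str, artifact_type: str, ext: str = "json") -> str:
    """Build a raw artifact key following the suggested path pattern."""
    if retrieved_at.tzinfo is None:
        # Refuse naive timestamps so the date segments are never ambiguous.
        raise ValueError("retrieved_at must be timezone-aware")
    ts = retrieved_at.astimezone(timezone.utc)
    return (
        f"/{stage}/{symbol}/{ts:%Y}/{ts:%m}/{ts:%d}/"
        f"{document_id}/{artifact_type}.{ext}"
    )
```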
Suggested analytical path pattern:
```text
/warehouse/{table_name}/dt={yyyy-mm-dd}/symbol={ticker}/part-*.parquet
```
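
The `key=value` directory segments in this pattern are what make the layout Hive-compatible. As a sketch under that reading, the hypothetical helpers below build a partition prefix and recover the partition values from a path.

```python
from datetime import date
from typing import Dict


def partition_prefix(table_name: str, dt: date, ticker: str) -> str:
    """Build the directory prefix for one (dt, symbol) partition."""
    return f"/warehouse/{table_name}/dt={dt.isoformat()}/symbol={ticker}/"


def parse_partitions(path: str) -> Dict[str, str]:
    """Extract Hive-style key=value segments from an object path."""
    parts: Dict[str, str] = {}
    for segment in path.strip("/").split("/"):
        key, sep, value = segment.partition("=")
        if sep:  # only segments that contain '=' are partition columns
            parts[key] = value
    return parts
```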
### 5.3 Lakehouse model
Preferred design:
- Parquet files stored in MinIO
- Hive-compatible partitioning for interoperability
- Iceberg table metadata for managed analytical tables
- Trino catalogs for SQL access
Rationale:
- Hive-compatible layouts preserve broad engine compatibility
- Iceberg improves schema evolution, partition handling, and table maintenance
- Trino can query MinIO-backed object storage and supports both Hive and Iceberg catalogs
## 6. Data Model
### 6.1 PostgreSQL schema outline
Core tables:
- `companies`
- `company_aliases`
- `watchlists`
- `watchlist_members`
- `sources`
- `api_credentials_refs`
- `ingestion_runs`
- `market_snapshots`
- `documents`
- `document_versions`
- `document_company_mentions`
- `document_intelligence`
- `document_impact_records`
- `trend_windows`
- `recommendations`
- `recommendation_evidence`
- `risk_evaluations`
- `broker_accounts`
- `orders`
- `order_events`
- `positions`
- `audit_events`
### 6.2 Article or document metadata record
```json
{
  "document_id": "uuid",
  "document_type": "article|filing|transcript|press_release",
  "symbol_candidates": ["AAPL", "MSFT"],
  "source_type": "news_api",
  "publisher": "string",
  "url": "string",
  "canonical_url": "string",
  "title": "string",
  "published_at": "2026-04-09T00:00:00Z",
  "retrieved_at": "2026-04-09T00:00:00Z",
  "language": "en",
  "content_hash": "sha256",
  "storage_refs": {
    "raw_html": "s3://...",
    "raw_payload": "s3://..."
  }
}
```
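
The `content_hash` field can support deduplication of re-fetched or syndicated documents. The sketch below assumes the hash is taken over whitespace-normalized text; hashing the raw bytes instead is an equally plausible reading of the record.

```python
import hashlib


def content_hash(text: str) -> str:
    """SHA-256 of whitespace-normalized text, as a hex string."""
    # Collapse runs of whitespace so trivial formatting differences
    # between fetches do not produce different hashes (an assumption).
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()
```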
### 6.3 Document intelligence schema
```json
{
  "document_id": "uuid",
  "summary": "string",
  "companies": [
    {
      "ticker": "AAPL",
      "company_name": "Apple Inc.",
      "relevance": 0.95,
      "sentiment": "positive",
      "impact_score": 0.71,
      "impact_horizon": "1d_30d",
      "catalyst_type": "earnings|product|legal|macro|supply_chain|m_and_a|rating_change|other",
      "key_facts": ["string"],
      "risks": ["string"],
      "evidence_spans": ["string"]
    }
  ],
  "macro_themes": ["rates", "ai_capex"],
  "novelty_score": 0.64,
  "source_credibility": 0.8,
  "extraction_warnings": ["ambiguous_ticker_reference"],
  "confidence": 0.86,
  "model": {
    "provider": "ollama",
    "model_name": "gpt-oss:20b",
    "prompt_version": "document-intel-v2",
    "schema_version": "2.0.0"
  }
}
```
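
Beyond schema validity, a semantic consistency check might look like the sketch below. The rules are assumptions inferred from the example values: scores are treated as lying in [0, 1], `catalyst_type` must be one of the listed options, and each company entry must carry at least one evidence span.

```python
from typing import List

CATALYST_TYPES = {
    "earnings", "product", "legal", "macro",
    "supply_chain", "m_and_a", "rating_change", "other",
}


def _in_unit_range(value: object) -> bool:
    """True for a number between 0.0 and 1.0 inclusive."""
    return isinstance(value, (int, float)) and 0.0 <= value <= 1.0


def semantic_problems(doc: dict) -> List[str]:
    """Return human-readable problems; an empty list means the object passed."""
    problems: List[str] = []
    for name in ("novelty_score", "source_credibility", "confidence"):
        if not _in_unit_range(doc.get(name)):
            problems.append(f"{name} outside [0, 1]")
    for i, company in enumerate(doc.get("companies", [])):
        for name in ("relevance", "impact_score"):
            if not _in_unit_range(company.get(name)):
                problems.append(f"companies[{i}].{name} outside [0, 1]")
        if company.get("catalyst_type") not in CATALYST_TYPES:
            problems.append(f"companies[{i}].catalyst_type not recognized")
        if not company.get("evidence_spans"):
            problems.append(f"companies[{i}] has no evidence_spans")
    return problems
```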
### 6.4 Trend summary schema
```json
{
  "entity_type": "company",
  "entity_id": "AAPL",
  "window": "7d",
  "trend_direction": "bullish|bearish|mixed|neutral",
  "trend_strength": 0.68,
  "confidence": 0.74,
  "top_supporting_evidence": ["document_id_1", "document_id_2"],
  "top_opposing_evidence": ["document_id_3"],
  "dominant_catalysts": ["product", "rating_change"],
  "material_risks": ["regulatory scrutiny"],
  "contradiction_score": 0.22
}
```
### 6.5 Recommendation schema
```json
{
  "recommendation_id": "uuid",
  "ticker": "AAPL",
  "action": "buy|sell|hold|watch",
  "mode": "informational|paper_eligible|live_eligible",
  "confidence": 0.72,
  "time_horizon": "swing_1d_10d",
  "thesis": "string",
  "invalidation_conditions": ["string"],
  "position_sizing": {
    "portfolio_pct": 0.02,
    "max_loss_pct": 0.005
  },
  "evidence_refs": ["document_id_1", "document_id_2"],
  "model_metadata": {
    "version": "recommendation-v1"
  }
}
```
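
The `position_sizing` block could be turned into an order size as sketched below. This is one possible interpretation, not a rule stated in the design: `portfolio_pct` is read as the share of account equity to allocate, and `max_loss_pct` as the share of equity that may be lost on a long position, which together imply a stop distance.

```python
import math
from typing import Tuple


def size_long_position(equity: float, price: float,
                       portfolio_pct: float, max_loss_pct: float) -> Tuple[int, float]:
    """Return (whole shares, implied stop price) for a long position.

    Assumed interpretation: allocate `portfolio_pct` of equity, and place the
    stop where the loss on that allocation equals `max_loss_pct` of equity.
    """
    if price <= 0 or equity <= 0 or portfolio_pct <= 0:
        raise ValueError("equity, price, and portfolio_pct must be positive")
    shares = math.floor(equity * portfolio_pct / price)
    # Fraction of the position that may be lost before the stop triggers.
    stop_fraction = min(1.0, max_loss_pct / portfolio_pct)
    stop_price = price * (1.0 - stop_fraction)
    return shares, stop_price
```

Under this reading, the example values of `0.02` and `0.005` imply a stop 25% below the entry price, which suggests the two fields may be intended to work differently; the interpretation should be confirmed before it is used for sizing.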
## 7. Analytical Lake Datasets
The analytical plane should expose the following logical fact tables:
- `lake.market_bars`
- `lake.market_quotes`
- `lake.company_events`
- `lake.documents`
- `lake.document_extractions`
- `lake.trade_signals`
- `lake.trade_orders`
- `lake.trade_fills`
- `lake.positions_daily`
- `lake.pnl_daily`
- `lake.prediction_vs_outcome`
Recommended partitioning examples:
- market data: partition by `dt`; a symbol-based partition transform can be added later if needed
- documents: partition by `dt`, and optionally by `source_type`
- predictions: partition by `dt` and `model_version`
- fills and PnL: partition by `dt` and broker account
## 8. Data Flows
### 8.1 Market and document ingestion flow
1. Scheduler selects due symbols and sources.
2. Adapters fetch market, news, and filings payloads.
3. Raw payloads are written to MinIO.
4. Metadata records are written to PostgreSQL.
5. New documents are emitted to parser jobs.
### 8.2 Extraction flow
1. Parser produces normalized text and confidence score.
2. Extraction worker sends document to Ollama with schema-bound output.
3. Validator checks schema and semantic consistency.
4. Canonical intelligence object is stored in PostgreSQL and MinIO.
5. Aggregation jobs are triggered for impacted symbols.
### 8.3 Recommendation and trade flow
1. Aggregation engine updates trend windows.
2. Recommendation engine emits a recommendation object.
3. Risk engine determines eligibility and allowed execution mode.
4. Broker adapter places paper or live orders when authorized.
5. Broker events update PostgreSQL and publish analytical facts to the lake.
### 8.4 Lake publication flow
1. Operational records are transformed into analytical facts.
2. Facts are written as partitioned Parquet files to MinIO.
3. Table metadata is updated through Iceberg or equivalent catalog operations.
4. Trino exposes the datasets for SQL.
5. Superset uses Trino datasets for dashboards and ad hoc exploration.
## 9. Query and Dashboard Surface
### 9.1 Operational API
Should expose:
- company and watchlist configuration
- source health and job state
- document timelines and evidence
- recommendation history
- order history and audit trail
- risk configuration and trading mode
### 9.2 Analytical surface
Should expose:
- SQL access through Trino
- dashboard datasets in Superset
- scorecards for prediction accuracy and PnL
- evidence-to-outcome drill-down views
- model performance and extraction failure dashboards
Suggested starter dashboards:
- symbol overview
- market sentiment heatmap
- prediction confidence vs realized move
- paper trading PnL
- model extraction quality
- source coverage and ingestion lag
## 10. Reliability and Safety
- Broker submission must be idempotent.
- Live trading must be disabled by default.
- Paper trading must be the first enabled execution mode.
- Invalid model output must not advance to trade execution.
- Low-quality document extraction must not influence live trading.
- All analytical publication jobs should be replayable.
- Every recommendation and order should be reproducible from saved prompts, source refs, and model metadata.
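
Several of these rules can be expressed as a single mode-resolution function whose defaults are the safe ones. The sketch below is illustrative: the parameter names and the `min_live_confidence` threshold are assumptions, while the three mode values come from the recommendation schema.

```python
from enum import Enum


class Mode(str, Enum):
    INFORMATIONAL = "informational"
    PAPER = "paper_eligible"
    LIVE = "live_eligible"


def resolve_mode(
    output_valid: bool,
    extraction_confidence: float,
    live_enabled: bool = False,        # live trading is off unless explicitly enabled
    operator_approved: bool = False,
    min_live_confidence: float = 0.8,  # assumed threshold for "low-quality" extraction
) -> Mode:
    """Map validation and control-plane state to the highest permitted mode."""
    if not output_valid:
        # Invalid model output never advances to execution of any kind.
        return Mode.INFORMATIONAL
    if (live_enabled and operator_approved
            and extraction_confidence >= min_live_confidence):
        return Mode.LIVE
    # Everything else is limited to paper trading.
    return Mode.PAPER
```

Because every keyword default is the restrictive choice, a caller that forgets to pass control-plane state can reach paper mode at most.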
## 11. Deployment Notes
Recommended Kubernetes workloads:
- `symbol-registry-api`
- `scheduler`
- `market-adapter-worker`
- `news-adapter-worker`
- `filings-adapter-worker`
- `scraper-worker`
- `parser-worker`
- `ollama-extractor-worker`
- `aggregation-worker`
- `recommendation-worker`
- `risk-engine-api`
- `broker-adapter`
- `lake-publisher`
- `trino-coordinator`
- `trino-worker`
- `superset-web`
- `postgres`
- `redis`
- `minio`
## 12. Deliberate Scope Boundaries for v1
Included in v1:
- tracked watchlists
- market, news, filings, and broker integrations
- Ollama structured extraction
- trend aggregation and recommendation objects
- paper trading with strict controls
- MinIO-backed analytics lake
- Trino and Superset self-hosted analytics
Deferred from v1:
- options trading
- full order book or tick-level market microstructure
- online model retraining
- fully autonomous live trading with no approval workflow
- advanced portfolio optimization beyond basic sizing and risk caps