# Stonks Oracle - Design

## 1. Purpose
Stonks Oracle is a Kubernetes-native AI market intelligence and trading platform. It ingests structured market data, company news, filings, and curated web content; preserves raw artifacts in MinIO; extracts structured intelligence objects with local Ollama models; aggregates signals into trend and recommendation outputs; optionally executes trades through a broker integration; and publishes historical datasets into a local lakehouse for Athena-like querying and QuickSight-like dashboards.

This design prioritizes:
- deterministic data contracts
- auditability of every AI-derived conclusion
- safe paper-trading-first automation
- self-hosted analytics on MinIO-backed datasets
- clear separation between operational state and analytical state

## 2. Architecture Summary
The platform is split into two planes:

### 2.1 Operational plane
Handles ingestion, parsing, structured extraction, signal generation, risk evaluation, trade execution, and control APIs.

Primary stores:
- PostgreSQL for operational state and transactional records
- Redis for queues, locks, and hot cache state
- MinIO for raw artifacts, prompts, model outputs, and exported datasets

### 2.2 Analytical plane
Handles historical fact storage, SQL query access, research, scorecards, and dashboards.

Primary components:
- MinIO as S3-compatible object store
- Hive-compatible partition layout for query compatibility
- Iceberg tables as the preferred lakehouse abstraction for managed analytical datasets
- Trino as the Athena-like SQL query engine
- Apache Superset as the QuickSight-like dashboard and exploration layer

## 3. External Integrations

### 3.1 Market Data API
Used for:
- quotes
- OHLCV bars
- reference data
- corporate actions
- earnings calendars
- optional market news or fundamentals

### 3.2 News API
Used for:
- company-linked headlines
- publisher metadata
- article URLs
- article summaries when licensed

### 3.3 Filings / Regulatory API
Used for:
- SEC-style company submissions
- 8-K, 10-Q, 10-K, and related filings
- structured issuer event discovery

### 3.4 Web Scraper
Used for:
- full article body retrieval when API content is partial
- investor relations pages
- curated press release sources
- transcript or presentation retrieval when permitted

### 3.5 Broker API
Used for:
- paper-trading simulation or sandbox trading
- live order submission when enabled
- order acknowledgements and rejections
- fills and cancellations
- positions and account balances

## 4. Logical Components

### 4.1 Symbol Registry Service
Responsibilities:
- manage companies, aliases, watchlists, sectors, and source configurations
- manage source trust or credibility policies
- manage symbol-to-document matching rules

### 4.2 Scheduler / Orchestrator
Responsibilities:
- trigger market, news, filings, and scrape jobs
- manage polling cadences by source class
- coordinate backoff, retries, and dedupe windows
- publish downstream jobs to workers

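As one way to picture the backoff and dedupe-window responsibilities above, here is a minimal Python sketch. It assumes full-jitter exponential backoff and an in-memory map of last-seen timestamps. The names `backoff_delay` and `is_duplicate`, the default base and cap, and the `"AAPL:news"` key format are hypothetical; a deployment would presumably hold this state in Redis rather than in process memory.

```python
from __future__ import annotations

import random
from typing import Optional


def backoff_delay(attempt: int, base_s: float = 2.0, cap_s: float = 300.0,
                  rng: Optional[random.Random] = None) -> float:
    """Full-jitter backoff: a uniform delay in [0, min(cap_s, base_s * 2**attempt)]."""
    if attempt < 0:
        raise ValueError("attempt must be >= 0")
    ceiling = min(cap_s, base_s * (2 ** attempt))
    rng = rng or random.Random()
    return rng.uniform(0.0, ceiling)


def is_duplicate(seen: dict[str, float], key: str, now_s: float, window_s: float) -> bool:
    """Return True if `key` was recorded within the last `window_s` seconds.

    A key that is not a duplicate is recorded with the current timestamp.
    """
    last = seen.get(key)
    if last is not None and now_s - last < window_s:
        return True
    seen[key] = now_s
    return False


if __name__ == "__main__":
    seen: dict[str, float] = {}
    print(is_duplicate(seen, "AAPL:news", now_s=100.0, window_s=60.0))
    print(is_duplicate(seen, "AAPL:news", now_s=130.0, window_s=60.0))
    print(backoff_delay(3) <= 16.0)
```

Jitter is used here so that many workers retrying the same source do not wake up at the same instant; a fixed schedule would be an equally valid choice for a single worker.
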
### 4.3 Ingestion Adapters
Subcomponents:
- Market data adapter
- News API adapter
- Filings adapter
- Broker event adapter

Responsibilities:
- fetch external payloads
- preserve raw responses in MinIO
- normalize metadata into PostgreSQL
- emit processing jobs for parsing or publication

### 4.4 Scraper / Parser Service
Responsibilities:
- fetch and render source pages
- extract normalized text and metadata
- reduce boilerplate and duplicated template text
- score parser quality and extraction confidence
- persist normalized artifacts

### 4.5 Ollama Extraction Service
Responsibilities:
- call local Ollama models using schema-constrained JSON output
- produce canonical document intelligence objects
- preserve prompts, schemas, model metadata, and raw outputs
- validate schema and semantic consistency
- retry invalid generations under policy

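The validate-then-retry responsibility above could look roughly like the following Python sketch. It is not the service's actual implementation: `generate` is a hypothetical stand-in for whatever call reaches the local Ollama model, `REQUIRED_FIELDS` checks only a few top-level fields, and the three-attempt policy is an assumption.

```python
from __future__ import annotations

import json
from typing import Callable

# Hypothetical subset of the document intelligence fields and their JSON types.
REQUIRED_FIELDS = {
    "document_id": str,
    "summary": str,
    "companies": list,
    "confidence": (int, float),
}


def validate(obj: object) -> list[str]:
    """Return a list of problems; an empty list means these checks passed."""
    if not isinstance(obj, dict):
        return ["top-level value is not an object"]
    problems = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in obj:
            problems.append(f"missing field: {name}")
        elif not isinstance(obj[name], expected):
            problems.append(f"wrong type for field: {name}")
    conf = obj.get("confidence")
    if isinstance(conf, (int, float)) and not 0.0 <= conf <= 1.0:
        problems.append("confidence outside [0, 1]")
    return problems


def extract_with_retry(generate: Callable[[str], str], prompt: str,
                       max_attempts: int = 3) -> dict:
    """Call `generate` until its output parses and validates, or attempts run out."""
    last_problems: list[str] = []
    for _ in range(max_attempts):
        raw = generate(prompt)
        try:
            obj = json.loads(raw)
        except json.JSONDecodeError as exc:
            last_problems = [f"invalid JSON: {exc.msg}"]
            continue
        last_problems = validate(obj)
        if not last_problems:
            return obj
    raise ValueError(f"extraction failed after {max_attempts} attempts: {last_problems}")
```

Raising after the final attempt keeps invalid output from flowing downstream, which matches the rule that invalid model output must not advance to trade execution.
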
### 4.6 Aggregation Engine
Responsibilities:
- combine document intelligence with market context
- compute rolling trend summaries by company, sector, and market
- track contradiction and agreement signals
- score evidence with recency decay and source weighting

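To make "recency decay and source weighting" concrete, the sketch below assumes exponential decay with a configurable half-life, multiplied by a source credibility score in [0, 1]. The 72-hour half-life, the function names, and the weighted-mean aggregation are illustrative assumptions, not values taken from this design.

```python
from __future__ import annotations

from typing import Iterable


def evidence_weight(age_hours: float, source_credibility: float,
                    half_life_hours: float = 72.0) -> float:
    """Weight that halves every `half_life_hours`, scaled by source credibility."""
    if age_hours < 0 or half_life_hours <= 0:
        raise ValueError("age must be >= 0 and half-life must be > 0")
    decay = 0.5 ** (age_hours / half_life_hours)
    return decay * source_credibility


def weighted_impact(items: Iterable[tuple[float, float, float]]) -> float:
    """Weighted mean of signed impact scores.

    Each item is (signed_impact, age_hours, source_credibility).
    Returns 0.0 when there is no evidence or all weights are zero.
    """
    total_weight = 0.0
    total = 0.0
    for impact, age_hours, credibility in items:
        w = evidence_weight(age_hours, credibility)
        total_weight += w
        total += w * impact
    return total / total_weight if total_weight > 0 else 0.0


if __name__ == "__main__":
    # A fresh positive document outweighs a three-day-old negative one.
    print(weighted_impact([(1.0, 0.0, 1.0), (-1.0, 72.0, 1.0)]))
```
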
### 4.7 Recommendation Engine
Responsibilities:
- generate explainable recommendation objects from aggregated evidence
- separate deterministic eligibility scoring from final action mapping
- produce suggested action, thesis, horizon, and invalidation conditions
- publish analytical prediction facts to the lake

### 4.8 Risk Engine
Responsibilities:
- enforce guardrails such as max position size, daily loss cap, exposure by sector, symbol cooldowns, news shock lockouts, and operator approval rules
- determine whether a recommendation is eligible for paper or live execution
- block ambiguous or unsafe orders before broker submission

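A guardrail check of the kind listed above might be sketched as follows. The limit values, the `RiskLimits` and `ProposedOrder` types, and the reason strings are hypothetical; only the three result modes (`informational`, `paper_eligible`, `live_eligible`) come from the recommendation schema in this document. Cooldowns and operator approval are left out for brevity.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class RiskLimits:
    max_position_pct: float = 0.05         # assumed cap on one position
    daily_loss_cap_pct: float = 0.02       # assumed daily loss cap
    max_sector_exposure_pct: float = 0.25  # assumed cap per sector


@dataclass(frozen=True)
class ProposedOrder:
    ticker: str
    position_pct: float               # resulting position as a share of the portfolio
    sector_exposure_after_pct: float  # sector exposure if the order fills


def evaluate(order: ProposedOrder, limits: RiskLimits, daily_loss_pct: float,
             locked_out: set, live_enabled: bool = False) -> tuple:
    """Return (mode, reasons). Any violated guardrail downgrades to informational."""
    reasons = []
    if order.position_pct > limits.max_position_pct:
        reasons.append("max_position_size")
    if daily_loss_pct >= limits.daily_loss_cap_pct:
        reasons.append("daily_loss_cap")
    if order.sector_exposure_after_pct > limits.max_sector_exposure_pct:
        reasons.append("sector_exposure")
    if order.ticker in locked_out:
        reasons.append("symbol_lockout")
    if reasons:
        return "informational", reasons
    # Live trading stays off unless an operator has explicitly enabled it.
    return ("live_eligible" if live_enabled else "paper_eligible"), reasons
```

Collecting every violated rule, rather than stopping at the first, gives the audit trail a complete picture of why an order was blocked.
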
### 4.9 Broker Adapter
Responsibilities:
- abstract one or more trading APIs
- support paper mode and live mode
- record submission, acknowledgement, rejection, fill, and cancellation events
- guarantee idempotent order submission keys
- publish order and fill facts to both PostgreSQL and the analytical lake

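One plausible way to derive an idempotent submission key is to hash the fields that define an order's intent, then refuse to submit twice under the same key. The sketch below assumes that a recommendation, account, ticker, side, and quantity together identify an order; that field choice, the function names, and the in-memory log (standing in for a PostgreSQL table with a unique constraint) are all assumptions.

```python
from __future__ import annotations

import hashlib
from typing import Callable


def order_idempotency_key(recommendation_id: str, account_id: str, ticker: str,
                          side: str, quantity: int) -> str:
    """Deterministic SHA-256 key over the normalized order intent."""
    canonical = "|".join(
        [recommendation_id, account_id, ticker.upper(), side.lower(), str(quantity)]
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class InMemoryOrderLog:
    """Remembers the result of each key so a retry never re-submits."""

    def __init__(self) -> None:
        self._results: dict[str, dict] = {}

    def submit_once(self, key: str, submit: Callable[[], dict]) -> dict:
        if key in self._results:
            return self._results[key]  # replay the recorded outcome
        result = submit()
        self._results[key] = result
        return result
```

A retry after a timeout then returns the first recorded outcome instead of placing a second order.
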
### 4.10 Lake Publisher
Responsibilities:
- transform operational records into analytics-friendly fact datasets
- publish append-only partitioned tables to MinIO
- maintain Iceberg metadata or equivalent lakehouse metadata
- expose datasets such as predictions, outcomes, fills, bars, and PnL

### 4.11 Query API / Dashboard
Responsibilities:
- expose companies, documents, trends, recommendations, and orders
- provide evidence drill-down and audit views
- provide operator controls for live-trading enablement and review queues
- expose links into analytical dashboards and query tools

### 4.12 SQL Query Engine and BI Layer
Components:
- Trino coordinator and workers
- Hive Metastore or Iceberg catalog service
- Apache Superset

Responsibilities:
- provide Athena-like SQL access to MinIO-hosted tables
- support dashboard datasets and ad hoc exploration
- support joins between market facts, AI predictions, and executed trades

### 4.13 Web Dashboard (Frontend)
Technology:
- React 18+ with TypeScript
- Vite for build tooling
- TanStack Router for client-side routing
- TanStack Query for API data fetching and caching
- Recharts for time-series and analytical charts
- Monaco Editor for SQL query editing
- Tailwind CSS for styling
- Served as a static SPA from the query-api container or a dedicated nginx container

Responsibilities:
- provide a unified web UI for all operational and analytical functions
- company, watchlist, alias, and source CRUD management
- document timeline browsing with intelligence drill-down
- trend summary visualization with evidence chain navigation
- recommendation review with full provenance display
- order and position tracking with audit trail views
- trading mode controls, risk configuration, approval workflow, and lockout management
- DevOps dashboards for pipeline health, ingestion throughput, model performance, and source coverage
- interactive SQL query explorer with result visualization (Athena-like)
- pre-built analytical dashboards with chart interactions (QuickSight-like)
- all API interactions go through the existing Query API, Symbol Registry, and Risk Engine endpoints

## 5. Storage Model

### 5.1 Operational stores
#### PostgreSQL
Used for:
- companies and aliases
- watchlists and source configs
- article and filing metadata
- document intelligence objects
- trend summaries
- recommendations
- risk evaluations
- orders and execution events
- control-plane state and audit records

#### Redis
Used for:
- distributed locks for symbol-source retrieval
- ingestion rate-limit counters
- job queue state
- retry backoff state
- dedupe markers
- cache for hot API and dashboard views

#### MinIO object storage
Used for:
- raw API payloads
- raw article HTML and normalized text
- prompts, schemas, and raw model results
- exported analytical datasets
- audit traces and reproducibility bundles

### 5.2 MinIO bucket layout
Recommended buckets:
- `stonks-raw-market`: raw market API payloads
- `stonks-raw-news`: raw news API payloads and article HTML
- `stonks-raw-filings`: raw filings and issuer event payloads
- `stonks-normalized`: cleaned text and parser outputs
- `stonks-llm-prompts`: prompts and schemas used
- `stonks-llm-results`: raw model outputs and validation reports
- `stonks-lakehouse`: partitioned analytical datasets and table metadata
- `stonks-audit`: execution traces and exported reports

Suggested raw object path pattern:
```text
/{stage}/{symbol}/{yyyy}/{mm}/{dd}/{document_id}/{artifact_type}.json
/{stage}/{symbol}/{yyyy}/{mm}/{dd}/{document_id}/{artifact_type}.html
```

Suggested analytical path pattern:
```text
/warehouse/{table_name}/dt={yyyy-mm-dd}/symbol={ticker}/part-*.parquet
```

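The two path patterns above can be produced with small helpers like these. The function names and the example argument values (`"raw"`, `"doc-1"`, `"payload"`) are hypothetical; the path shapes follow the patterns shown in this section.

```python
from __future__ import annotations

from datetime import date

ALLOWED_EXTENSIONS = {"json", "html"}


def raw_object_key(stage: str, symbol: str, day: date, document_id: str,
                   artifact_type: str, ext: str) -> str:
    """Build /{stage}/{symbol}/{yyyy}/{mm}/{dd}/{document_id}/{artifact_type}.{ext}."""
    if ext not in ALLOWED_EXTENSIONS:
        raise ValueError(f"unsupported extension: {ext}")
    return (
        f"/{stage}/{symbol}/{day:%Y}/{day:%m}/{day:%d}/"
        f"{document_id}/{artifact_type}.{ext}"
    )


def analytical_partition_prefix(table_name: str, day: date, ticker: str) -> str:
    """Build the Hive-style prefix /warehouse/{table}/dt={yyyy-mm-dd}/symbol={ticker}/."""
    return f"/warehouse/{table_name}/dt={day.isoformat()}/symbol={ticker}/"


if __name__ == "__main__":
    d = date(2026, 4, 9)
    print(raw_object_key("raw", "AAPL", d, "doc-1", "payload", "json"))
    print(analytical_partition_prefix("market_bars", d, "AAPL"))
```
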
### 5.3 Lakehouse model
Preferred design:
- Parquet files stored in MinIO
- Hive-compatible partitioning for interoperability
- Iceberg table metadata for managed analytical tables
- Trino catalogs for SQL access

Rationale:
- Hive-compatible layouts preserve broad engine compatibility
- Iceberg improves schema evolution, partition handling, and table maintenance
- Trino can query MinIO-backed object storage and supports both Hive and Iceberg catalogs

## 6. Data Model

### 6.1 PostgreSQL schema outline
Core tables:
- `companies`
- `company_aliases`
- `watchlists`
- `watchlist_members`
- `sources`
- `api_credentials_refs`
- `ingestion_runs`
- `market_snapshots`
- `documents`
- `document_versions`
- `document_company_mentions`
- `document_intelligence`
- `document_impact_records`
- `trend_windows`
- `recommendations`
- `recommendation_evidence`
- `risk_evaluations`
- `broker_accounts`
- `orders`
- `order_events`
- `positions`
- `audit_events`

### 6.2 Article or document metadata record
```json
{
  "document_id": "uuid",
  "document_type": "article|filing|transcript|press_release",
  "symbol_candidates": ["AAPL", "MSFT"],
  "source_type": "news_api",
  "publisher": "string",
  "url": "string",
  "canonical_url": "string",
  "title": "string",
  "published_at": "2026-04-09T00:00:00Z",
  "retrieved_at": "2026-04-09T00:00:00Z",
  "language": "en",
  "content_hash": "sha256",
  "storage_refs": {
    "raw_html": "s3://...",
    "raw_payload": "s3://..."
  }
}
```

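The `content_hash` field is what makes deduplication of re-fetched documents cheap. A minimal sketch of computing it is shown below, assuming SHA-256 over whitespace-normalized text. The normalization step and the function name are assumptions; the record above only indicates that the hash is a SHA-256 value.

```python
import hashlib


def content_hash(text: str) -> str:
    """SHA-256 hex digest of the text after collapsing runs of whitespace."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    a = content_hash("Apple reports  record revenue.\n")
    b = content_hash("Apple reports record revenue.")
    print(a == b)  # whitespace-only differences hash identically
```

Hashing normalized text rather than raw HTML means a template or markup change on the publisher side does not register as a new document version, at the cost of ignoring formatting-only edits.
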
### 6.3 Document intelligence schema
```json
{
  "document_id": "uuid",
  "summary": "string",
  "companies": [
    {
      "ticker": "AAPL",
      "company_name": "Apple Inc.",
      "relevance": 0.95,
      "sentiment": "positive",
      "impact_score": 0.71,
      "impact_horizon": "1d_30d",
      "catalyst_type": "earnings|product|legal|macro|supply_chain|m_and_a|rating_change|other",
      "key_facts": ["string"],
      "risks": ["string"],
      "evidence_spans": ["string"]
    }
  ],
  "macro_themes": ["rates", "ai_capex"],
  "novelty_score": 0.64,
  "source_credibility": 0.8,
  "extraction_warnings": ["ambiguous_ticker_reference"],
  "confidence": 0.86,
  "model": {
    "provider": "ollama",
    "model_name": "gpt-oss:20b",
    "prompt_version": "document-intel-v2",
    "schema_version": "2.0.0"
  }
}
```

### 6.4 Trend summary schema
```json
{
  "entity_type": "company",
  "entity_id": "AAPL",
  "window": "7d",
  "trend_direction": "bullish|bearish|mixed|neutral",
  "trend_strength": 0.68,
  "confidence": 0.74,
  "top_supporting_evidence": ["document_id_1", "document_id_2"],
  "top_opposing_evidence": ["document_id_3"],
  "dominant_catalysts": ["product", "rating_change"],
  "material_risks": ["regulatory scrutiny"],
  "contradiction_score": 0.22
}
```

### 6.5 Recommendation schema
```json
{
  "recommendation_id": "uuid",
  "ticker": "AAPL",
  "action": "buy|sell|hold|watch",
  "mode": "informational|paper_eligible|live_eligible",
  "confidence": 0.72,
  "time_horizon": "swing_1d_10d",
  "thesis": "string",
  "invalidation_conditions": ["string"],
  "position_sizing": {
    "portfolio_pct": 0.02,
    "max_loss_pct": 0.005
  },
  "evidence_refs": ["document_id_1", "document_id_2"],
  "model_metadata": {
    "version": "recommendation-v1"
  }
}
```

## 7. Analytical Lake Datasets
The analytical plane should expose the following logical fact tables:
- `lake.market_bars`
- `lake.market_quotes`
- `lake.company_events`
- `lake.documents`
- `lake.document_extractions`
- `lake.trade_signals`
- `lake.trade_orders`
- `lake.trade_fills`
- `lake.positions_daily`
- `lake.pnl_daily`
- `lake.prediction_vs_outcome`

Recommended partitioning examples:
- market data: partition by `dt`, with an optional symbol partition transform added later
- documents: partition by `dt` and optionally `source_type`
- predictions: partition by `dt` and `model_version`
- fills and PnL: partition by `dt` and broker account

## 8. Data Flows

### 8.1 Market and document ingestion flow
1. Scheduler selects due symbols and sources.
2. Adapters fetch market, news, and filings payloads.
3. Raw payloads are written to MinIO.
4. Metadata records are written to PostgreSQL.
5. New documents are emitted to parser jobs.

### 8.2 Extraction flow
1. Parser produces normalized text and confidence score.
2. Extraction worker sends document to Ollama with schema-bound output.
3. Validator checks schema and semantic consistency.
4. Canonical intelligence object is stored in PostgreSQL and MinIO.
5. Aggregation jobs are triggered for impacted symbols.

### 8.3 Recommendation and trade flow
1. Aggregation engine updates trend windows.
2. Recommendation engine emits a recommendation object.
3. Risk engine determines eligibility and allowed execution mode.
4. Broker adapter places paper or live orders when authorized.
5. Broker events update PostgreSQL and publish analytical facts to the lake.

### 8.4 Lake publication flow
1. Operational records are transformed into analytical facts.
2. Facts are written as partitioned Parquet files to MinIO.
3. Table metadata is updated through Iceberg or equivalent catalog operations.
4. Trino exposes the datasets for SQL.
5. Superset uses Trino datasets for dashboards and ad hoc exploration.

## 9. Query and Dashboard Surface

### 9.1 Operational API
Should expose:
- company and watchlist configuration
- source health and job state
- document timelines and evidence
- recommendation history
- order history and audit trail
- risk configuration and trading mode

### 9.2 Analytical surface
Should expose:
- SQL access through Trino
- dashboard datasets in Superset
- scorecards for prediction accuracy and PnL
- evidence-to-outcome drill-down views
- model performance and extraction failure dashboards

Suggested starter dashboards:
- symbol overview
- market sentiment heatmap
- prediction confidence vs realized move
- paper trading PnL
- model extraction quality
- source coverage and ingestion lag

### 9.3 Web Dashboard Architecture

#### Page structure
The SPA is organized into the following top-level sections accessible from a persistent sidebar:

| Section | Route prefix | Primary API source |
|---------|-------------|-------------------|
| Home / Overview | `/` | Query API `/api/ops/pipeline/health`, `/api/ops/ingestion/summary` |
| Companies | `/companies` | Symbol Registry `/companies`, Query API `/api/companies` |
| Documents | `/documents` | Query API `/api/documents` |
| Trends | `/trends` | Query API `/api/trends` |
| Recommendations | `/recommendations` | Query API `/api/recommendations` |
| Orders | `/orders` | Query API `/api/orders` |
| Positions | `/positions` | Query API `/api/positions` |
| Trading Controls | `/trading` | Query API `/api/admin/trading/*`, Risk Engine `/approvals/*` |
| Source Management | `/sources` | Query API `/api/admin/sources/*` |
| Pipeline Health | `/ops/pipeline` | Query API `/api/ops/pipeline/health` |
| Ingestion Monitor | `/ops/ingestion` | Query API `/api/ops/ingestion/*` |
| Model Performance | `/ops/model` | Query API `/api/ops/model/*` |
| Source Coverage | `/ops/coverage` | Query API `/api/ops/sources/coverage-gaps` |
| SQL Explorer | `/analytics/query` | Trino via Query API proxy endpoint |
| Dashboards | `/analytics/dashboards` | Trino via Query API proxy endpoint |

#### SQL Explorer design
The SQL explorer provides an Athena-like experience:
- Monaco Editor instance with SQL syntax highlighting
- Schema browser sidebar listing Trino catalogs, schemas, and tables
- Query execution via a new `/api/analytics/query` proxy endpoint on the Query API that forwards SQL to Trino
- Results rendered in a virtual-scrolling table with sortable columns
- Chart builder that maps result columns to axes and renders via Recharts
- Saved queries stored in PostgreSQL with name, description, and SQL text

#### Pre-built dashboards
Each dashboard is a React component that fetches data from the SQL explorer proxy or dedicated API endpoints:
- Symbol Overview: company card grid with trend direction, latest recommendation, position status
- Sentiment Heatmap: sector × time matrix colored by aggregated sentiment
- Prediction Accuracy: scatter plot of predicted confidence vs realized price move
- Paper Trading PnL: equity curve, daily PnL bars, win rate metrics
- Model Quality: extraction success rate over time, latency distribution, retry rate
- Source Coverage: company × source type matrix with health indicators

#### API proxy for Trino
A new endpoint on the Query API:
```text
POST /api/analytics/query
Body: { "sql": "SELECT ...", "limit": 1000 }
Response: { "columns": [...], "rows": [...], "row_count": N, "elapsed_ms": N }
```
This proxies SQL to Trino, enforces row limits, and returns structured results.

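Row-limit enforcement in the proxy could be approached as in the sketch below, which wraps the caller's statement in an outer `SELECT ... LIMIT` before it is forwarded. This is an assumption about how the limit might be applied, not the endpoint's defined behavior. The helper names and the 10,000-row ceiling are hypothetical, the read-only check is a coarse keyword test rather than a SQL parser, and a semicolon inside a string literal would be rejected as a false positive.

```python
from __future__ import annotations


def bounded_query(sql: str, limit: int, max_limit: int = 10_000) -> str:
    """Wrap a single read-only statement so at most `limit` rows come back."""
    statement = sql.strip().rstrip(";").strip()
    if not statement:
        raise ValueError("empty SQL")
    if ";" in statement:
        raise ValueError("multiple statements are not allowed")
    first_word = statement.split(None, 1)[0].lower()
    if first_word not in {"select", "with"}:
        raise ValueError("only SELECT or WITH queries are allowed")
    effective = max(1, min(limit, max_limit))
    return f"SELECT * FROM ({statement}) AS bounded LIMIT {effective}"


def to_response(columns: list[str], rows: list[list], elapsed_ms: int) -> dict:
    """Shape results like the response body described for the endpoint."""
    return {
        "columns": columns,
        "rows": rows,
        "row_count": len(rows),
        "elapsed_ms": elapsed_ms,
    }


if __name__ == "__main__":
    print(bounded_query("SELECT ticker FROM lake.trade_fills;", limit=1000))
```

Clamping on the server side means a client cannot raise the limit simply by sending a larger `limit` value in the request body.
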
#### Deployment
The frontend is built as a static SPA and served either:
- From the query-api container via FastAPI's `StaticFiles` mount
- From a dedicated lightweight nginx container

The Helm chart adds a `dashboard` service entry when the frontend is enabled.

## 10. Reliability and Safety
- Broker submission must be idempotent.
- Live trading must be disabled by default.
- Paper trading must be the first enabled execution mode.
- Invalid model output must not advance to trade execution.
- Low-quality document extraction must not influence live trading.
- All analytical publication jobs should be replayable.
- Every recommendation and order should be reproducible from saved prompts, source refs, and model metadata.

## 11. Deployment Notes
Recommended Kubernetes workloads:
- `symbol-registry-api`
- `scheduler`
- `market-adapter-worker`
- `news-adapter-worker`
- `filings-adapter-worker`
- `scraper-worker`
- `parser-worker`
- `ollama-extractor-worker`
- `aggregation-worker`
- `recommendation-worker`
- `risk-engine-api`
- `broker-adapter`
- `lake-publisher`
- `trino-coordinator`
- `trino-worker`
- `superset-web`
- `dashboard` (React SPA served via nginx or query-api static mount)
- `postgres`
- `redis`
- `minio`

## 12. Deliberate Scope Boundaries for v1
Included in v1:
- tracked watchlists
- market, news, filings, and broker integrations
- Ollama structured extraction
- trend aggregation and recommendation objects
- paper trading with strict controls
- MinIO-backed analytics lake
- Trino and Superset self-hosted analytics

Deferred from v1:
- options trading
- full order book or tick-level market microstructure
- online model retraining
- fully autonomous live trading with no approval workflow
- advanced portfolio optimization beyond basic sizing and risk caps