# Page 1 — Data Ingestion and Preparation

Every signal that the platform eventually acts on begins its life as raw data pulled from an external source. Before any AI agent can extract structured intelligence, before any trend can accumulate, and before any decision can be executed, the platform must first discover new content, fetch it reliably, eliminate duplicates, store the raw artifacts for audit, and normalize the text into a form suitable for downstream processing. This page traces that journey from external API to parser output, covering the Scheduler, Ingestion Worker, deduplication layer, raw storage, and Parser in detail.

For a visual overview of the full flow described here, see the [Ingestion to Extraction Flow diagram](diagrams/ingestion-to-extraction-flow.md).

---

## Four Categories of Input Data

The platform tracks 50 entities across 10 sectors, and it draws intelligence from four distinct categories of external data. Each category has its own adapter, its own API conventions, and its own scheduling cadence, but all of them feed into the same ingestion pipeline.

The first category is **entity news**, sourced from the external data provider's news endpoint (`/v2/reference/news`). The `ExternalNewsAdapter` in `services/adapters/news_adapter.py` fetches articles linked to a specific entity identifier, returning structured results that include title, publisher, article URL, description, keywords, and publication timestamp. Each request can return up to 1,000 articles, though the default limit is 20 per fetch. The adapter tracks the most recent `published_utc` value and uses it on subsequent fetches to avoid re-retrieving articles the system has already seen.

The second category is **regulatory filings**, sourced from the public records API's full-text search system. The `RegulatoryFilingsAdapter` in `services/adapters/filings_adapter.py` queries the `/LATEST/search-index` endpoint for regulatory filings and other form types associated with an entity's identifier or CIK number. Unlike the external data provider endpoints, the public records API requires no key, only a descriptive `User-Agent` header per the API's fair-access policy. The adapter deduplicates results by accession number (`adsh`), filters out non-primary documents like XML fragments and graphics, and constructs the public records API filing index URL for each hit so downstream services can fetch the full document.

The third category is **data feeds**, also sourced from the external data provider. The `ExternalDataAdapter` in `services/adapters/market_adapter.py` supports multiple endpoints: previous-day aggregate bars (`/v2/aggs/ticker/{ticker}/prev`), range bars for custom date windows, intraday hourly bars, grouped daily bars that return data for all entities in a single call (`/v2/aggs/grouped/locale/us/market/stocks/{date}`), and entity detail lookups. Data feeds follow a different path than textual content — they do not pass through the Parser or Extractor, since the structured numeric data is already in a usable form.

The fourth category is **macro and geopolitical news**, fetched by the `MacroNewsAdapter` in `services/adapters/macro_news_adapter.py`. Unlike the other three categories, macro news is not entity-specific. These sources have `source_type='macro_news'` in the `sources` database table and may have a `NULL` `company_id`. The adapter fetches from a configurable HTTP endpoint (typically the external data provider's news API filtered for broad topics) and returns articles that describe global events — policy shifts, central bank decisions, geopolitical conflicts — rather than entity-specific developments. Macro news articles are eventually classified by the Global Event Classifier agent and routed through a separate queue, as described in [Page 2](02-ai-agent-processing-and-extraction.md).

All four adapter classes inherit from `BaseAdapter` defined in `services/adapters/base.py` and return an `AdapterResult` dataclass containing the raw payload bytes, a SHA-256 content hash, a list of parsed item dicts, HTTP metadata (status code, response time), and an error field that is `None` on success. This uniform interface allows the Ingestion Worker to handle all source types through a single dispatch mechanism.
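To make the uniform return type concrete, the following is a minimal sketch of what an `AdapterResult` could look like. The field names (`raw_payload`, `items`, `status_code`, `response_time_ms`) and the `build_result()` helper are assumptions made for illustration; only the set of fields described above comes from this page.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class AdapterResult:
    """Illustrative shape only; the real field names may differ."""

    raw_payload: bytes                       # raw response body, as fetched
    content_hash: str                        # SHA-256 hex digest of raw_payload
    items: list[dict[str, Any]] = field(default_factory=list)
    status_code: Optional[int] = None        # HTTP metadata
    response_time_ms: Optional[float] = None
    error: Optional[str] = None              # None on success

    @property
    def ok(self) -> bool:
        return self.error is None


def build_result(raw: bytes, items: list[dict[str, Any]],
                 status_code: int, response_time_ms: float) -> AdapterResult:
    """Hypothetical helper: derive the content hash from the raw payload."""
    return AdapterResult(
        raw_payload=raw,
        content_hash=hashlib.sha256(raw).hexdigest(),
        items=items,
        status_code=status_code,
        response_time_ms=response_time_ms,
    )
```

Because every adapter returns the same shape, a caller only needs to check `error` (here wrapped as `ok`) before deciding whether to continue with storage and deduplication.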

---

## The Scheduler: Orchestrating Ingestion Cycles

The Scheduler (`services/scheduler/app.py`) is the heartbeat of the ingestion pipeline. It runs a continuous loop that ticks every 15 seconds (`SCHEDULER_TICK = 15`), and on each tick it evaluates which sources are due for their next fetch. The Scheduler does not fetch data itself — it enqueues jobs onto the `app:queue:ingestion` Redis list for the Ingestion Worker to process.

Each source type has a default polling cadence defined in the `DEFAULT_CADENCES` dictionary:

| Source Type      | Default Cadence |
|------------------|-----------------|
| `market_api`     | 300 seconds     |
| `news_api`       | 300 seconds     |
| `filings_api`    | 3,600 seconds   |
| `macro_news`     | 600 seconds     |
| `web_scrape`     | 1,800 seconds   |
| `execution_api`  | 30 seconds      |

Individual sources can override their cadence via the `polling_interval_seconds` field in their `config` JSONB column in the `sources` table. The `get_cadence_for_source()` function checks for this override first, falling back to the default if none is set, and enforces a minimum interval of 10 seconds.
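The override-then-default lookup can be sketched as follows. This is an approximation under stated assumptions: the source row is modeled as a plain dict, and the constant name `MIN_CADENCE_SECONDS` is invented here; the cadence values and the 10-second floor are taken from the text above.

```python
DEFAULT_CADENCES = {
    "market_api": 300,
    "news_api": 300,
    "filings_api": 3_600,
    "macro_news": 600,
    "web_scrape": 1_800,
    "execution_api": 30,
}
MIN_CADENCE_SECONDS = 10  # assumed constant name for the documented floor


def get_cadence_for_source(source: dict) -> int:
    """Return the polling cadence in seconds for one source row."""
    config = source.get("config") or {}
    override = config.get("polling_interval_seconds")
    if override is not None:
        cadence = int(override)          # per-source override wins
    else:
        cadence = DEFAULT_CADENCES[source["source_type"]]
    return max(cadence, MIN_CADENCE_SECONDS)
```

For example, a `news_api` source with no override resolves to 300 seconds, while an override of 5 seconds is raised to the 10-second minimum.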

The Scheduler determines whether a source is due by calling `is_source_due()`, which considers several conditions. If a source has never run before (no entry in the `ingestion_runs` table), it is immediately due. If the last run failed, the Scheduler respects an exponential backoff computed by `compute_backoff()`: the delay starts at 60 seconds (`DEFAULT_BACKOFF_BASE`) and doubles with each retry up to a maximum of 3,600 seconds (`MAX_BACKOFF`). If a source has failed 10 consecutive times (`MAX_RETRY_COUNT`), the Scheduler stops scheduling it entirely until an operator manually resets the retry state. If the last run is still marked as `running`, the source is skipped to prevent double-scheduling. Otherwise, the Scheduler checks whether enough time has elapsed since the last completed run based on the source's cadence.
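The decision logic above can be approximated in a few lines. This sketch assumes that the retry count is 1-based (the first retry waits 60 seconds), that the last run is available as a dict with `status` and `finished_at` keys, and that the checks run in the order shown; none of these details are confirmed by this page.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

DEFAULT_BACKOFF_BASE = 60   # seconds
MAX_BACKOFF = 3_600         # seconds
MAX_RETRY_COUNT = 10


def compute_backoff(retry_count: int) -> int:
    """Exponential backoff: 60, 120, 240, ... capped at MAX_BACKOFF."""
    exponent = max(retry_count - 1, 0)
    return min(DEFAULT_BACKOFF_BASE * (2 ** exponent), MAX_BACKOFF)


def is_source_due(last_run: Optional[dict], retry_count: int,
                  cadence_seconds: int, now: datetime) -> bool:
    if last_run is None:
        return True                      # never run before
    if retry_count >= MAX_RETRY_COUNT:
        return False                     # waits for a manual reset
    if last_run["status"] == "running":
        return False                     # avoid double-scheduling
    if last_run["status"] == "failed":
        wait = compute_backoff(retry_count)
    else:
        wait = cadence_seconds
    return now >= last_run["finished_at"] + timedelta(seconds=wait)
```

Under these assumptions the delay reaches the 3,600-second cap on the seventh retry, since 60 × 2⁶ = 3,840 exceeds the maximum.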

Rate limiting adds another layer of protection. The `check_rate_limit()` function enforces two constraints. First, each source type has a per-type limit defined in `DEFAULT_RATE_LIMITS` — for example, `market_api` and `news_api` are each capped at 20 requests per minute, while `filings_api` and `macro_news` are capped at 10. Second, because `market_api` and `news_api` both use the same external data provider API key, a global provider rate limit of 45 requests per minute (`PROVIDER_GLOBAL_RATE_LIMIT`) is enforced across both types combined. Rate limit state is tracked in Redis using keys of the form `app:ratelimit:{source_type}:{window}`, where the window is a minute-granularity timestamp. If a source type exceeds its limit, the Scheduler logs a warning and skips that source for the current tick.
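The two-level check can be illustrated with an in-memory counter standing in for Redis. In this sketch the window is encoded as the integer epoch minute, `PROVIDER_SOURCE_TYPES` is an invented name, and source types without a listed limit are treated as unlimited; all three are assumptions. A production version would presumably use atomic Redis increments with a key expiry rather than a local `Counter`.

```python
from collections import Counter

DEFAULT_RATE_LIMITS = {
    "market_api": 20,
    "news_api": 20,
    "filings_api": 10,
    "macro_news": 10,
}
PROVIDER_GLOBAL_RATE_LIMIT = 45
PROVIDER_SOURCE_TYPES = ("market_api", "news_api")  # share one API key


def rate_limit_key(source_type: str, now: float) -> str:
    window = int(now // 60)  # minute-granularity window (format assumed)
    return f"app:ratelimit:{source_type}:{window}"


def check_rate_limit(counters: Counter, source_type: str, now: float) -> bool:
    """Return True and count the request if it fits both limits."""
    key = rate_limit_key(source_type, now)
    limit = DEFAULT_RATE_LIMITS.get(source_type)
    if limit is not None and counters[key] >= limit:
        return False                     # per-type limit reached
    if source_type in PROVIDER_SOURCE_TYPES:
        combined = sum(counters[rate_limit_key(t, now)]
                       for t in PROVIDER_SOURCE_TYPES)
        if combined >= PROVIDER_GLOBAL_RATE_LIMIT:
            return False                 # shared provider limit reached
    counters[key] += 1
    return True
```

Because the key embeds the minute window, counts from earlier minutes are never consulted, which is what lets the real keys simply expire.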

The Scheduler handles three categories of sources in each cycle. First, it fetches all active entity-specific sources (excluding `macro_news`) by joining the `sources` and `companies` tables. Second, it fetches active macro news sources separately, since these may not have a `company_id`. Third, it fetches global data sources — those with `source_type='market_api'` and `company_id IS NULL` — which represent endpoints like the grouped daily bars that return data for all entities in a single API call. For intraday bar sources, the Scheduler expands a single global source into per-entity jobs for every active entity.

Each enqueued job payload includes the `source_id`, `company_id`, `ticker`, `legal_name`, `source_type`, `source_name`, `config`, `credibility_score`, a list of company `aliases` (fetched from the `company_aliases` table), and a `scheduled_at` timestamp. The job is pushed onto `app:queue:ingestion` via Redis `RPUSH`.
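As an illustration of the payload described above, the sketch below assembles the ten listed fields into a JSON string. The `build_ingestion_job()` helper, the dict-shaped source row, and the ISO-8601 encoding of `scheduled_at` are assumptions; a `deque` stands in for the Redis list so the example runs without a server.

```python
import json
from collections import deque
from datetime import datetime, timezone

INGESTION_QUEUE = "app:queue:ingestion"


def build_ingestion_job(source: dict, aliases: list[str], now: datetime) -> str:
    """Hypothetical helper: serialize one scheduled job as JSON."""
    payload = {
        "source_id": source["source_id"],
        "company_id": source.get("company_id"),   # may be None for macro news
        "ticker": source.get("ticker"),
        "legal_name": source.get("legal_name"),
        "source_type": source["source_type"],
        "source_name": source["source_name"],
        "config": source.get("config") or {},
        "credibility_score": source.get("credibility_score"),
        "aliases": aliases,
        "scheduled_at": now.isoformat(),          # timestamp format assumed
    }
    return json.dumps(payload)


# Stand-in for the Redis list: appending on the right mirrors RPUSH.
queue: deque = deque()
queue.append(build_ingestion_job(
    {"source_id": 7, "company_id": 1, "ticker": "Entity-A",
     "legal_name": "Entity A Corp", "source_type": "news_api",
     "source_name": "entity-a-news", "config": {}, "credibility_score": 0.8},
    aliases=["Entity A"],
    now=datetime(2025, 1, 15, 12, 0, tzinfo=timezone.utc),
))
```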

Beyond scheduling, the Scheduler also performs periodic maintenance. Every ~20 cycles (~5 minutes), it runs `recover_stale_documents()` to re-enqueue documents that have been stuck in `parsed` status for longer than 240 minutes — a safety net for cases where Redis loses queue entries due to pod restarts or OOM events. Every ~40 cycles (~10 minutes), it runs `retry_failed_extractions()` to give documents in `extraction_failed` status another chance, resetting them to `parsed` and deleting the failed `document_intelligence` row so the Extractor treats them as fresh. Every ~100 cycles (~25 minutes), it runs `cleanup_all_tables()` to enforce retention policies across tables like `competitive_signal_records` (30 days), `ingestion_runs` (14 days), and `execution_decisions` (90 days).
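The cycle counts translate to wall-clock intervals through the 15-second tick: 20 × 15 s = 5 minutes, 40 × 15 s = 10 minutes, and 100 × 15 s = 25 minutes. One simple way to express such a schedule is a modulo check on the cycle counter, sketched below. The `MAINTENANCE_INTERVALS` mapping and the modulo approach are assumptions; the page only states the approximate frequencies.

```python
SCHEDULER_TICK = 15  # seconds per cycle

# Task name -> run every N cycles (values from the text; structure assumed).
MAINTENANCE_INTERVALS = {
    "recover_stale_documents": 20,
    "retry_failed_extractions": 40,
    "cleanup_all_tables": 100,
}


def maintenance_due(cycle: int) -> list[str]:
    """Return the maintenance tasks that should run on this cycle."""
    return [name for name, every in MAINTENANCE_INTERVALS.items()
            if cycle > 0 and cycle % every == 0]


def interval_minutes(every_cycles: int) -> float:
    """Approximate wall-clock interval for a task, ignoring tick overrun."""
    return every_cycles * SCHEDULER_TICK / 60
```

The intervals are approximate in practice because a cycle that takes longer than one tick stretches the wall-clock period.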

For more detail on the Scheduler's configuration and operational behavior, see the [Services Reference](../services.md).

---

## The Ingestion Worker: Adapter Dispatch and Persistence

The Ingestion Worker (`services/ingestion/worker.py`) is a long-running process that continuously pops jobs from the `app:queue:ingestion` Redis list and processes them. On startup, it initializes one instance of each adapter class and stores them in a dispatch dictionary keyed by `source_type`:

```python
adapters = {
    "market_api":     ExternalDataAdapter(...),
    "news_api":       ExternalNewsAdapter(...),
    "filings_api":    RegulatoryFilingsAdapter(),
    "web_scrape":     WebScrapeAdapter(),
    "execution_api":  ExecutionAdapter(...),
    "macro_news":     MacroNewsAdapter(...),
}
```

When a job arrives, the `process_job()` function looks up the appropriate adapter by `source_type` and calls its `fetch()` method with the ticker and source config. Before fetching, it records a new row in the `ingestion_runs` table with status `running`. If the adapter returns an error, the worker calls `record_retrieval_failure()` to update the run status and increment the source's retry counter with exponential backoff timing.

On a successful fetch, the worker performs several steps in sequence. First, it uploads the raw payload to MinIO via `upload_raw_artifact()` in `services/shared/storage.py`. The target bucket is determined by the source type through the `SOURCE_BUCKET_MAP`: `market_api` payloads go to `app-raw-data`, `news_api` and `macro_news` payloads go to `app-raw-content`, and `filings_api` payloads go to `app-raw-filings`. Objects are stored under a path that encodes the source type, entity identifier, date hierarchy, and run ID, for example `news_api/Entity-A/2025/01/15/{run_id}/raw.json`.
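The bucket lookup and object path can be sketched as follows. The `build_object_key()` and `resolve_bucket()` helpers are hypothetical; the bucket names and the path layout follow the example given above.

```python
from datetime import date

SOURCE_BUCKET_MAP = {
    "market_api": "app-raw-data",
    "news_api": "app-raw-content",
    "macro_news": "app-raw-content",
    "filings_api": "app-raw-filings",
}


def resolve_bucket(source_type: str) -> str:
    """Look up the raw-artifact bucket for a source type."""
    return SOURCE_BUCKET_MAP[source_type]


def build_object_key(source_type: str, entity: str,
                     run_id: str, day: date) -> str:
    """Hypothetical helper: <type>/<entity>/<YYYY>/<MM>/<DD>/<run_id>/raw.json"""
    return f"{source_type}/{entity}/{day:%Y/%m/%d}/{run_id}/raw.json"
```

Calling `build_object_key("news_api", "Entity-A", "run-123", date(2025, 1, 15))` yields `news_api/Entity-A/2025/01/15/run-123/raw.json`, which keeps all artifacts for one entity and day under a common prefix.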

---

## Content Deduplication via Redis

After storing the raw artifact, the Ingestion Worker checks for duplicate content. Deduplication operates at two levels.

At the payload level, the worker checks the overall `content_hash` (a SHA-256 digest of the raw API response) against Redis. The key pattern is `app:dedupe:{content_hash}` with a 24-hour TTL (86,400 seconds). If the hash is already present, the entire payload is skipped — the `ingestion_runs` row is marked as completed with `items_new=0`, and no downstream jobs are enqueued. If the hash is new, the worker sets the marker in Redis so future fetches of identical content are caught.
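A minimal sketch of the payload-level check follows. An in-memory `TTLMarkerStore` class stands in for Redis so the example is self-contained; that class and the `is_duplicate_payload()` helper are assumptions, while the key pattern and the 24-hour TTL come from the text above.

```python
import hashlib

DEDUPE_TTL_SECONDS = 86_400  # 24 hours


class TTLMarkerStore:
    """In-memory stand-in for Redis keys that expire after a TTL."""

    def __init__(self) -> None:
        self._expires_at: dict[str, float] = {}

    def seen(self, key: str, now: float) -> bool:
        expiry = self._expires_at.get(key)
        return expiry is not None and expiry > now

    def mark(self, key: str, now: float,
             ttl: int = DEDUPE_TTL_SECONDS) -> None:
        self._expires_at[key] = now + ttl


def is_duplicate_payload(store: TTLMarkerStore, raw: bytes, now: float) -> bool:
    """Check the payload hash; record it when it is new."""
    key = f"app:dedupe:{hashlib.sha256(raw).hexdigest()}"
    if store.seen(key, now):
        return True
    store.mark(key, now)
    return False
```

Note that a hit does not refresh the marker in this sketch, so an identical payload is treated as new again once 24 hours have passed since it was first recorded.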

At the individual item level, for source types other than `market_api` and `execution_api`, the worker calls `dedupe_items()` from `services/shared/dedupe.py`. This function checks each item against a layered deduplication strategy. The fast path checks Redis for both content-hash markers (`app:dedupe:{hash}`) and canonical-URL markers (`app:dedupe:url:{url_hash}`), both with 24-hour TTLs. If the Redis check misses, the function falls back to PostgreSQL, querying the `documents` table by `content_hash` or `canonical_url` for durable cross-source matching. When a duplicate is found through the PostgreSQL fallback, the function warms the Redis cache so subsequent checks are fast.
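The layered lookup can be sketched with plain sets standing in for Redis and PostgreSQL. The function signature, the item keys (`content_hash`, `canonical_url`), and the use of SHA-256 for the URL hash are assumptions for illustration; the real `dedupe_items()` may differ.

```python
import hashlib


def url_hash(url: str) -> str:
    # Hash algorithm assumed; the page only names the key pattern.
    return hashlib.sha256(url.encode("utf-8")).hexdigest()


def dedupe_items(items: list[dict], redis_markers: set[str],
                 db_hashes: set[str], db_urls: set[str]):
    """Split items into (new, duplicates) using cache first, then database."""
    new_items, duplicates = [], []
    for item in items:
        hash_key = f"app:dedupe:{item['content_hash']}"
        url_key = f"app:dedupe:url:{url_hash(item['canonical_url'])}"

        # Fast path: either marker present in the cache.
        if hash_key in redis_markers or url_key in redis_markers:
            duplicates.append(item)
            continue

        # Slow path: durable lookup, then warm the cache for next time.
        if item["content_hash"] in db_hashes or item["canonical_url"] in db_urls:
            redis_markers.update({hash_key, url_key})
            duplicates.append(item)
            continue

        new_items.append(item)
    return new_items, duplicates
```

Warming the cache on a database hit means the same duplicate costs only a cache lookup on the next fetch cycle.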

Items identified as duplicates are not discarded entirely. If the duplicate document was originally ingested for a different entity, the worker creates a cross-source mention link in the `document_company_mentions` table via `persist_document_company_mention()`. This ensures that a news article mentioning both Entity-A and Entity-F is linked to both entities even if it was first ingested through Entity-A's news source.

New (non-duplicate) items are persisted to PostgreSQL through `persist_ingestion_items()` in `services/shared/metadata.py`, which inserts rows into the `documents` table and records entity mentions in `document_company_mentions`. Each new document ID is then pushed onto `app:queue:parsing` for the Parser to process. After persistence, the worker calls `mark_as_seen()` to set Redis dedupe markers for both the content hash and canonical URL of each new item, ensuring that the next fetch cycle's deduplication checks are fast.

On successful completion, the worker updates the `ingestion_runs` row with the final counts (`items_fetched`, `items_new`) and calls `reset_source_retry_state()` to clear any accumulated backoff from previous failures. For news-type sources (`news_api` and `macro_news`), the worker also updates the source's `config` JSONB column with the latest `published_utc` value, so the next fetch only retrieves newer articles.
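The cursor update for news-type sources might look like the sketch below. The config key `last_published_utc` and the `advance_cursor()` helper are hypothetical names, and timestamps are assumed to be ISO-8601 strings; the page states only that the latest `published_utc` value is saved to the source's `config`.

```python
from datetime import datetime


def parse_utc(timestamp: str) -> datetime:
    """Parse an ISO-8601 timestamp, accepting a trailing 'Z' for UTC."""
    return datetime.fromisoformat(timestamp.replace("Z", "+00:00"))


def advance_cursor(config: dict, items: list[dict]) -> dict:
    """Return a config whose cursor reflects the newest item seen."""
    timestamps = [i["published_utc"] for i in items if i.get("published_utc")]
    if not timestamps:
        return config                    # nothing fetched, keep the cursor
    latest = max(timestamps, key=parse_utc)
    current = config.get("last_published_utc")   # key name assumed
    if current is None or parse_utc(latest) > parse_utc(current):
        return {**config, "last_published_utc": latest}
    return config
```

Comparing parsed datetimes rather than raw strings avoids ordering mistakes when timestamps mix the `Z` suffix with explicit offsets.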

---

## The Parser: Normalization, Quality Scoring, and Routing

Documents that pass through ingestion arrive on the `app:queue:parsing` Redis list as JSON payloads containing a `document_id`, `ticker`, and `source_type`. The Parser Worker (`services/parser/worker.py`) pops these jobs and transforms raw HTML or text into normalized, quality-scored documents ready for AI extraction.

The parsing pipeline begins with HTML fetching. If the document has a URL (looked up from the `documents` table if not present in the job payload), the worker calls `fetch_html()` to retrieve the page content. Public records API URLs receive a specialized `User-Agent` header to comply with the API's fair-access policy. The raw HTML is then passed to `parse_html()` in `services/parser/html_parser.py`, which runs a multi-stage extraction pipeline.

The HTML parser first strips non-content tags — `script`, `style`, `nav`, `footer`, `header`, `aside`, `iframe`, and others — and removes boilerplate containers identified by CSS class or ID patterns (sidebars, ad slots, newsletter signups, social share bars, and similar UI elements). It then searches for the article body using a priority list of semantic selectors (`article`, `[role='main']`, `.article-body`, `.post-content`, and others). If no semantic match is found, it falls back to text-density scoring across candidate `div`, `section`, and `td` elements, selecting the block with the highest composite score based on text density, link density, paragraph count, and word count. The extracted text undergoes further cleaning: regex-based removal of residual boilerplate phrases (copyright notices, "subscribe to our newsletter" prompts, "share this article" fragments), removal of short orphan lines that are likely UI fragments, detection and collapse of repeated template blocks, and whitespace normalization.

Metadata extraction pulls the document title (from `og:title` or `<title>`), author, publisher (from `og:site_name` or hostname), publication date (from `article:published_time` or JSON-LD `datePublished`), canonical URL, language, description, and keywords from the HTML head elements.

If the parsed body text is shorter than 500 characters, the worker attempts to enrich it by reading the raw API payload from MinIO and extracting the data provider's article description, keywords, and author fields for the matching article. This enrichment step ensures that even articles with minimal scrapeable HTML still have enough textual content for meaningful AI extraction.

Quality scoring is performed by `score_parse_quality()` in `services/parser/html_parser.py`, which evaluates six weighted signals to produce a composite score between 0 and 0.95:

| Signal             | Weight | What It Measures                                                |
|--------------------|--------|-----------------------------------------------------------------|
| `word_count`       | 0.30   | Length of extracted text (thresholds at 20, 50, 150, 300 words) |
| `body_found`       | 0.20   | Whether a semantic article body element was located              |
| `diversity`        | 0.15   | Vocabulary richness (unique words / total words)                 |
| `sentence`         | 0.15   | Presence of proper sentence structure (terminal punctuation)     |
| `paragraph`        | 0.10   | Multi-paragraph structure (blocks separated by blank lines)      |
| `metadata`         | 0.10   | Presence of title, author, publisher, and publication date       |

The composite score maps to a confidence label: scores below 0.35 are labeled `low`, scores from 0.35 up to but not including 0.65 are `medium`, and scores of 0.65 and above are `high`. Documents with `low` confidence are marked with status `low_quality` in the `documents` table and are not enqueued for extraction, so they are filtered out of the pipeline at this stage.
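The weighting and labeling can be sketched as a weighted sum followed by a threshold lookup. This sketch assumes each signal has already been reduced to a value between 0 and 1 and that the 0.95 ceiling is applied as a simple clamp; how `score_parse_quality()` derives the individual signals and enforces its ceiling is not described on this page.

```python
QUALITY_WEIGHTS = {
    "word_count": 0.30,
    "body_found": 0.20,
    "diversity": 0.15,
    "sentence": 0.15,
    "paragraph": 0.10,
    "metadata": 0.10,
}
MAX_QUALITY_SCORE = 0.95  # documented ceiling; clamping is an assumption


def composite_quality_score(signals: dict[str, float]) -> float:
    """Weighted sum of per-signal scores, each clamped to [0, 1]."""
    total = sum(
        weight * min(max(signals.get(name, 0.0), 0.0), 1.0)
        for name, weight in QUALITY_WEIGHTS.items()
    )
    return min(total, MAX_QUALITY_SCORE)


def confidence_label(score: float) -> str:
    if score < 0.35:
        return "low"
    if score < 0.65:
        return "medium"
    return "high"
```

For instance, a document that scores fully on `word_count` and `body_found` but on nothing else reaches 0.50 and is labeled `medium`, so it still proceeds to extraction.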

Entity mention detection runs next. The worker fetches all known aliases from the `company_aliases` table (plus entity identifiers and legal names from the `companies` table) and calls `detect_company_mentions()` in `services/parser/html_parser.py`. The matching strategy varies by alias length: one-to-two character aliases use case-sensitive word-boundary matching to avoid false positives (the letter "A" should not match every occurrence of the word "a"), three-to-four character aliases use case-insensitive word-boundary matching (standard identifier format), and aliases of five or more characters use case-insensitive substring matching (entity names and brands). Confidence scores vary by alias type: identifier matches receive 0.9, legal name matches 0.85, general aliases 0.7, and brand matches 0.6. Multiple alias hits for the same entity are deduplicated, keeping the highest-confidence match and summing match counts. Detected mentions are persisted to the `document_company_mentions` table.
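The length-based matching rules can be expressed with regular expressions, as in the sketch below. The tuple layout of the alias list, the alias type labels, and the shape of the returned records are assumptions; the confidence values and matching tiers follow the description above.

```python
import re

ALIAS_TYPE_CONFIDENCE = {
    "identifier": 0.9,
    "legal_name": 0.85,
    "alias": 0.7,
    "brand": 0.6,
}


def count_alias_matches(text: str, alias: str) -> int:
    escaped = re.escape(alias)
    if len(alias) <= 2:
        pattern = re.compile(rf"\b{escaped}\b")                 # case-sensitive
    elif len(alias) <= 4:
        pattern = re.compile(rf"\b{escaped}\b", re.IGNORECASE)  # word-boundary
    else:
        pattern = re.compile(escaped, re.IGNORECASE)            # substring
    return len(pattern.findall(text))


def detect_company_mentions(text: str,
                            aliases: list[tuple[int, str, str]]) -> list[dict]:
    """aliases holds (company_id, alias, alias_type) tuples (layout assumed)."""
    mentions: dict[int, dict] = {}
    for company_id, alias, alias_type in aliases:
        count = count_alias_matches(text, alias)
        if count == 0:
            continue
        confidence = ALIAS_TYPE_CONFIDENCE[alias_type]
        current = mentions.get(company_id)
        if current is None:
            mentions[company_id] = {"company_id": company_id,
                                    "confidence": confidence,
                                    "match_count": count,
                                    "matched_alias": alias}
        else:
            current["match_count"] += count          # sum hits per entity
            if confidence > current["confidence"]:   # keep the best match
                current["confidence"] = confidence
                current["matched_alias"] = alias
    return list(mentions.values())
```

Because counts are summed per alias, text that matches both a short identifier and a longer legal name containing it contributes to the total twice in this sketch.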

The normalized text and a structured parser output JSON (containing all metadata, quality signals, warnings, outbound links, tags, and mentions) are uploaded to the `app-normalized` MinIO bucket. The `documents` row is updated with the normalized storage reference, parser output reference, quality score, and confidence level.

Finally, the Parser makes a routing decision. If the document's `document_type` is `macro_event`, it is pushed onto `app:queue:macro_classification` for the Global Event Classifier agent. All other documents are pushed onto `app:queue:extraction` for the Document Intelligence Extractor agent. Both queues feed into the Extractor service described in [Page 2](02-ai-agent-processing-and-extraction.md). The job payload includes the `document_id`, `ticker`, and the first 32,000 characters of the normalized text, giving the downstream agent immediate access to the content without needing to fetch it from MinIO.
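The routing step reduces to a queue choice plus a truncated payload, as this sketch shows. The `route_document()` helper and the `text` payload key are assumptions; the queue names, the `macro_event` condition, and the 32,000-character limit come from the paragraph above.

```python
import json

MACRO_QUEUE = "app:queue:macro_classification"
EXTRACTION_QUEUE = "app:queue:extraction"
MAX_TEXT_CHARS = 32_000


def route_document(document: dict, normalized_text: str) -> tuple[str, str]:
    """Return (queue_name, json_payload) for a parsed document."""
    if document.get("document_type") == "macro_event":
        queue = MACRO_QUEUE
    else:
        queue = EXTRACTION_QUEUE
    payload = {
        "document_id": document["document_id"],
        "ticker": document.get("ticker"),
        "text": normalized_text[:MAX_TEXT_CHARS],   # key name assumed
    }
    return queue, json.dumps(payload)
```

Embedding the text in the job trades a larger queue entry for one fewer storage round trip per document in the downstream agent.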

For additional detail on queue topology and data store layout, see the [Data Pipeline Architecture](../architecture-data-pipeline.md) documentation.

---

## What Comes Next

At this point, raw data has been fetched from four categories of external sources, deduplicated, stored in MinIO, parsed into normalized text, scored for quality, tagged with entity mentions, and routed to the appropriate extraction queue. The documents sitting on `app:queue:extraction` and `app:queue:macro_classification` are clean, quality-filtered, and ready for AI processing. [Page 2 — AI Agent Processing and Structured Extraction](02-ai-agent-processing-and-extraction.md) picks up the story from here, explaining how the Document Intelligence Extractor and Global Event Classifier agents use LLM inference to transform these normalized documents into the structured JSON intelligence that feeds the rest of the pipeline.