# Page 5 — Recommendation Generation and Signal-to-Action Translation


The aggregation engine described in [Page 4](04-trend-aggregation-and-accumulating-signals.md) produces `TrendSummary` objects across five time windows for each entity identifier, encoding the direction, strength, confidence, contradiction level, and supporting evidence accumulated from all three signal layers. But a `TrendSummary` is an assessment — it describes what the evidence says, not what the system should do about it. The recommendation engine is where assessment becomes action. It takes each `TrendSummary`, subjects it to a series of deterministic evaluations, and produces a `Recommendation` object that specifies a concrete action (act, defer, monitor, or observe), an execution mode (informational, simulation-eligible, or production-eligible), a commitment sizing guideline, a human-readable thesis, and a risk classification. Every decision in this pipeline is rule-based and fully traceable — the LLM is only involved in an optional downstream step that rewrites the thesis wording.


The recommendation worker in `services/recommendation/main.py` polls the `app:queue:recommendation` Redis queue for jobs, each specifying an entity identifier and time window. For each job, it delegates to `generate_recommendation()` in `services/recommendation/worker.py`, which orchestrates the full pipeline: fetch the latest trend summary, check for duplicate recommendations, fetch any available trend projection, evaluate data quality suppression, evaluate eligibility, optionally rewrite the thesis via LLM, build the `Recommendation` object, and persist everything to PostgreSQL. For a visual overview of this flow, see the [Recommendation Generation Flow diagram](diagrams/recommendation-generation-flow.md).


---


## Data Quality Suppression


Before the eligibility engine evaluates whether a trend is strong enough to act on, the suppression layer in `services/recommendation/suppression.py` asks a more fundamental question: is the underlying data reliable enough to act on at all? A trend might show high confidence and strong directionality, but if the documents feeding it are stale, poorly extracted, or drawn from a single source type, the apparent signal quality is illusory. The suppression layer acts as a pre-filter on data quality, running before the eligibility engine and forcing any recommendation built on unreliable data to `informational` mode regardless of how strong the trend metrics look.


The `evaluate_suppression()` function accepts a `TrendSummary` and a `DataQualityContext` — a set of metrics about the documents underlying the trend, populated by querying `documents` and `document_intelligence` tables for the evidence document IDs stored in the trend summary. When full document-level metrics are not available (for example, in a development environment without the full document pipeline), the function falls back to `build_quality_context_from_summary()`, which estimates quality metrics from the trend summary's own evidence counts and confidence.


### The Six Data Quality Checks


The suppression evaluation runs six independent checks, each comparing a data quality metric against a configurable threshold defined in `SuppressionConfig`. If any single check fails, the recommendation is suppressed:


1. **Low extraction confidence** — If the average extraction confidence across the evidence documents falls below `0.40` (`min_avg_extraction_confidence`), the underlying LLM extractions are too unreliable. This catches cases where the extractor struggled with document formatting, ambiguous content, or low-quality source material, as described in [Page 2](02-ai-agent-processing-and-extraction.md).


2. **Evidence staleness** — If the most recent evidence document is older than `168` hours (7 days, `max_evidence_staleness_hours`), the trend is based on outdated information. Conditions change rapidly, and a week-old evidence base may no longer reflect the current state. When documents exist but no timestamp is available, the evidence is conservatively treated as stale.


3. **Low source diversity**: If fewer than `1` distinct source type (`min_source_types`) contributed to the evidence, the check fails. With the threshold at `1`, this fires only when the quality context has documents but records no source type for them. Evidence drawn entirely from one source type (for example, all news articles with no filings or supplementary data to corroborate) still passes; requiring corroboration across source classes would mean raising `min_source_types` to `2`.


4. **High extraction failure rate** — If more than `50%` (`max_extraction_failure_rate`) of the documents that should have contributed to the trend failed extraction entirely, the data pipeline is unreliable for this entity. A high failure rate means the trend summary is built from a biased subset of the available evidence — the failed documents might have told a different story.


5. **Insufficient valid documents** — If fewer than `2` valid (non-failed) documents (`min_valid_documents`) contributed to the trend, there simply is not enough data to act on. A single document, no matter how high-quality, does not provide the corroboration needed for automated execution decisions.


6. **Low data quality score** — The `_compute_data_quality_score()` function computes an overall quality score from three weighted components: extraction confidence (40% weight, normalized against a 0.8 baseline), evidence freshness (30% weight, linear decay over the staleness window), and document coverage (30% weight, combining the valid/total ratio with a count factor that saturates at 10 documents). If this composite score falls below `0.30` (`min_data_quality_score`) and the low-confidence check has not already fired, a general suppression reason is added.


When any check triggers, the `SuppressionResult` records the specific reasons (as `SuppressionReason` enum values) and the computed data quality score. The worker in `services/recommendation/worker.py` uses this result to force the recommendation's mode to `informational` and append a suppression note to the thesis text, ensuring the suppression decision is visible in the audit trail.


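The six checks can be sketched as a single function. This is a minimal illustration, not the code in `services/recommendation/suppression.py`: the `DataQualityContext` field names, the string reason labels (the real module uses `SuppressionReason` enum values), and the way document coverage combines the valid/total ratio with the count factor (a product here) are all assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SuppressionConfig:
    # Defaults mirror the thresholds quoted in the text.
    min_avg_extraction_confidence: float = 0.40
    max_evidence_staleness_hours: float = 168.0
    min_source_types: int = 1
    max_extraction_failure_rate: float = 0.50
    min_valid_documents: int = 2
    min_data_quality_score: float = 0.30


@dataclass
class DataQualityContext:
    # Hypothetical field names, chosen for this sketch only.
    avg_extraction_confidence: float
    newest_evidence_age_hours: Optional[float]  # None: no timestamp available
    source_type_count: int
    total_documents: int
    valid_documents: int


def data_quality_score(ctx: DataQualityContext, cfg: SuppressionConfig) -> float:
    # 40%: extraction confidence, normalized against a 0.8 baseline.
    conf = min(ctx.avg_extraction_confidence / 0.8, 1.0)
    # 30%: freshness, decaying linearly to zero over the staleness window.
    age = ctx.newest_evidence_age_hours
    fresh = 0.0 if age is None else max(0.0, 1.0 - age / cfg.max_evidence_staleness_hours)
    # 30%: coverage. Assumption: ratio times a count factor saturating at 10 docs.
    ratio = ctx.valid_documents / ctx.total_documents if ctx.total_documents else 0.0
    coverage = ratio * min(ctx.valid_documents / 10.0, 1.0)
    return 0.4 * conf + 0.3 * fresh + 0.3 * coverage


def suppression_reasons(ctx: DataQualityContext,
                        cfg: Optional[SuppressionConfig] = None) -> List[str]:
    cfg = cfg or SuppressionConfig()
    reasons: List[str] = []
    has_docs = ctx.total_documents > 0
    if ctx.avg_extraction_confidence < cfg.min_avg_extraction_confidence:
        reasons.append("low_extraction_confidence")
    age = ctx.newest_evidence_age_hours
    # Documents without a timestamp are conservatively treated as stale.
    if has_docs and (age is None or age > cfg.max_evidence_staleness_hours):
        reasons.append("stale_evidence")
    if has_docs and ctx.source_type_count < cfg.min_source_types:
        reasons.append("low_source_diversity")
    failure_rate = 1.0 - ctx.valid_documents / ctx.total_documents if has_docs else 0.0
    if failure_rate > cfg.max_extraction_failure_rate:
        reasons.append("high_extraction_failure_rate")
    if ctx.valid_documents < cfg.min_valid_documents:
        reasons.append("insufficient_valid_documents")
    # The composite check only adds a reason if low confidence has not fired.
    if (data_quality_score(ctx, cfg) < cfg.min_data_quality_score
            and "low_extraction_confidence" not in reasons):
        reasons.append("low_data_quality")
    return reasons
```

A non-empty return value corresponds to a suppressed recommendation, which the worker would then force to `informational` mode.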


### Safety Suppressions: Macro-Only and Pattern-Only Signals


Beyond the six data quality checks, two additional safety suppressions protect against acting on signals that lack entity-specific corroboration:


**Macro-only suppression** (`evaluate_macro_only_suppression()`) fires when macro signals are the sole basis for a trend direction — no entity-specific signals contributed at all. As described in [Page 3](03-signal-scoring-and-weighted-signals.md), macro signals enter the aggregation engine at a reduced weight of `0.3` relative to entity-specific signals. But even at reduced weight, macro signals alone can shift a trend direction if no entity-specific evidence exists. When this happens, the recommendation is forced to `informational` mode with a caveat noting that the signal is macro-only and should not be used for automated execution.


**Pattern-only suppression** (`evaluate_pattern_only_suppression()`) applies the same logic to competitive/pattern signals. When pattern-based signals from `services/aggregation/pattern_matcher.py` and `services/aggregation/signal_propagation.py` are the sole contributors — no entity-specific or macro signals — the recommendation is suppressed. Historical patterns are valuable context, but acting on them without any current evidence is too speculative for automated execution.


Both safety suppressions are evaluated in the worker after the main suppression check, and both force the mode to `informational` when triggered.


---


## Eligibility Evaluation


Recommendations that survive the suppression layer enter the eligibility evaluation in `services/recommendation/eligibility.py`. This is the core decision logic — a set of deterministic rules that map trend metrics to actions, execution modes, and commitment sizing. The `evaluate_eligibility()` function is the single entry point, accepting a `TrendSummary` and an `EligibilityConfig` of tunable thresholds.


### Gate Checks


The `_check_gates()` function applies five hard gates. If any gate fails, the trend is ineligible for a recommendation (though the action and mode are still computed for the audit trace):


| Gate | Threshold | Rejection Reason |
|------|-----------|-----------------|
| Confidence | ≥ `0.35` | `low_confidence` |
| Trend strength | ≥ `0.10` | `low_trend_strength` |
| Contradiction score | ≤ `0.60` | `high_contradiction` |
| Evidence count | ≥ `2` (supporting + opposing) | `insufficient_evidence` |
| Direction | ≠ `neutral` | `neutral_direction` |


These gates are intentionally conservative. A confidence threshold of `0.35` means the system needs meaningful evidence breadth and agreement before generating any recommendation at all (see the confidence computation in [Page 4](04-trend-aggregation-and-accumulating-signals.md)). The contradiction ceiling of `0.60` allows moderately contested trends through — only when the evidence is deeply split does the gate reject. The evidence minimum of `2` ensures that no recommendation is ever based on a single document.


When a trend fails any gate, the resulting `EligibilityResult` has `eligible = False` and the mode is forced to `informational`, regardless of what the mode escalation logic would otherwise compute.


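The gate table translates directly into a sequence of threshold comparisons. The sketch below is illustrative: `TrendMetrics` is a hypothetical stand-in for the relevant `TrendSummary` fields, and returning every failed gate (rather than stopping at the first) is an assumption about how `_check_gates()` collects rejection reasons.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TrendMetrics:
    # Hypothetical subset of TrendSummary fields used by the gates.
    direction: str
    trend_strength: float
    confidence: float
    contradiction_score: float
    supporting_count: int
    opposing_count: int


def check_gates(trend: TrendMetrics) -> List[str]:
    """Return the rejection reasons for every gate the trend fails."""
    reasons: List[str] = []
    if trend.confidence < 0.35:
        reasons.append("low_confidence")
    if trend.trend_strength < 0.10:
        reasons.append("low_trend_strength")
    if trend.contradiction_score > 0.60:
        reasons.append("high_contradiction")
    # Evidence count is the sum of supporting and opposing documents.
    if trend.supporting_count + trend.opposing_count < 2:
        reasons.append("insufficient_evidence")
    if trend.direction == "neutral":
        reasons.append("neutral_direction")
    return reasons
```

An empty list means the trend is eligible; any entry corresponds to `eligible = False` and a forced `informational` mode.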


### Action Mapping


The `_determine_action()` function maps the trend's direction and strength to one of four action types. The logic evaluates in a specific order:


**Mixed or neutral direction → OBSERVE.** If the trend direction is `mixed` (high contradiction with weak directional signal) or `neutral`, the action is always `OBSERVE`. There is no directional conviction to act on.


**Strong directional signal → ACT or DEFER.** If the trend strength reaches `0.25` or above (`action_strength_threshold`), the action follows the direction: `ACT` for positive, `DEFER` for negative. This threshold ensures that only trends with meaningful magnitude trigger commitment-changing actions.


**Weak directional signal with decent confidence → MONITOR.** If the trend has a clear direction (positive or negative) but strength remains below `0.25`, the action depends on confidence. If confidence reaches `0.50` or above (`hold_confidence_threshold`), the action is `MONITOR` — the system recognizes the directional lean but does not have enough conviction to recommend a commitment change. Below `0.50` confidence, the action falls to `OBSERVE`.


This mapping creates the escalation ladder described in [Page 4](04-trend-aggregation-and-accumulating-signals.md): as consecutive signals accumulate and strengthen the trend metrics, the action naturally progresses from OBSERVE → MONITOR → ACT/DEFER.


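The evaluation order described above can be written as a short decision function. This is a sketch under the stated thresholds; the function signature and the plain string values for directions and actions are assumptions, not the actual `_determine_action()` interface.

```python
def determine_action(direction: str, strength: float, confidence: float,
                     action_strength_threshold: float = 0.25,
                     hold_confidence_threshold: float = 0.50) -> str:
    # 1. No directional conviction: always OBSERVE.
    if direction in ("mixed", "neutral"):
        return "OBSERVE"
    # 2. Strong directional signal: the action follows the direction.
    if strength >= action_strength_threshold:
        return "ACT" if direction == "positive" else "DEFER"
    # 3. Weak directional signal: MONITOR only with decent confidence.
    if confidence >= hold_confidence_threshold:
        return "MONITOR"
    return "OBSERVE"
```

For example, a positive trend with strength `0.20` and confidence `0.55` maps to `MONITOR`; once its strength reaches `0.25`, the same trend maps to `ACT`.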


### Mode Escalation


The `_determine_mode()` function determines the highest execution mode allowed for the recommendation. Mode controls whether the recommendation is purely informational, eligible for simulation mode, or eligible for live execution mode:


**OBSERVE and MONITOR → always informational.** These actions do not trigger executions, so they are always `informational` mode. They are logged for human review and dashboard display but never enter the decision execution engine.


**ACT and DEFER → escalation based on signal quality.** For actionable recommendations, mode escalates through three tiers:


- **`informational`** — The default when confidence is below `0.50`. The recommendation is recorded but not eligible for any execution.
- **`simulation_eligible`** — When confidence reaches `0.50` or above (`paper_confidence_threshold`). The recommendation can be picked up by the simulation engine described in [Page 6](06-decision-execution.md).
- **`production_eligible`** — The strictest tier, requiring confidence ≥ `0.70` (`live_confidence_threshold`), contradiction ≤ `0.25` (`live_max_contradiction`), and evidence count ≥ `5` (`live_min_evidence`). This triple gate ensures that only high-conviction, well-corroborated, low-contradiction recommendations can trigger live executions.


The evidence count for mode escalation is computed as the sum of supporting and opposing evidence documents, matching the same count used in the gate checks.


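The mode tiers can be sketched as a function that tries the strictest tier first. The signature is hypothetical, and one behaviour is an assumption: a recommendation that meets the live confidence threshold but fails the contradiction or evidence requirement is taken to fall back to `simulation_eligible` rather than `informational`.

```python
def determine_mode(action: str, confidence: float, contradiction: float,
                   evidence_count: int, eligible: bool = True) -> str:
    # Failing any gate forces informational mode, whatever the metrics say.
    if not eligible:
        return "informational"
    # OBSERVE and MONITOR never trigger executions.
    if action in ("OBSERVE", "MONITOR"):
        return "informational"
    # Strictest tier: all three conditions must hold.
    if confidence >= 0.70 and contradiction <= 0.25 and evidence_count >= 5:
        return "production_eligible"
    if confidence >= 0.50:
        return "simulation_eligible"
    return "informational"


# evidence_count is supporting + opposing documents, as in the gate checks.
print(determine_mode("ACT", confidence=0.80, contradiction=0.20, evidence_count=6))
print(determine_mode("ACT", confidence=0.80, contradiction=0.30, evidence_count=6))
```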


---


## Commitment Sizing


The `_compute_position_sizing()` function in `services/recommendation/eligibility.py` translates signal quality into an allocation pool guideline. Commitment sizing is not a fixed value — it scales dynamically with the confidence and strength of the underlying trend, penalized by contradiction and thin evidence.


### Base and Scaling


The computation starts with a base allocation of `1%` (`base_allocation_pct = 0.01`) and scales upward based on two factors:


- **Confidence factor** — `0.8 × confidence` (`confidence_sizing_weight`), reflecting how much the system trusts the trend assessment.
- **Strength factor** — `0.5 + 0.5 × trend_strength`, ranging from `0.5` (weakest trend) to `1.0` (strongest trend).


The raw allocation percentage is computed as:


```
raw_allocation = base + confidence_factor × strength_factor × (max - base)
```


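where `max` is `10%` (`max_allocation_pct = 0.10`). Because the confidence factor is weighted by `0.8`, even maximum confidence (1.0) and maximum strength (1.0) give a raw allocation of `0.01 + 0.8 × 1.0 × 0.09 = 0.082` (8.2%), which stays below the 10% ceiling. At typical values (confidence 0.6, strength 0.3), the raw allocation is considerably lower: `0.01 + 0.48 × 0.65 × 0.09 ≈ 0.038` (about 3.8%).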


### Contradiction Penalty


The contradiction score applies a multiplicative penalty:


```
allocation_pct = raw_allocation × (1.0 − 0.5 × contradiction_score)
```


A contradiction score of `0.40` reduces the allocation by 20%. A score of `0.0` (no contradiction) applies no penalty. This ensures that contested trends receive smaller commitment sizes even when they pass the eligibility gates.


### Evidence Count Penalty


Thin evidence further reduces the allocation:


- Fewer than `3` evidence documents → multiply by `0.5` (halved).
- `3` or `4` evidence documents (fewer than `5`) → multiply by `0.75`.
- `5` or more documents → no penalty.


This penalty stacks with the contradiction penalty, so a trend with high contradiction and thin evidence receives a substantially reduced commitment size.


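Putting the base scaling and the two penalties together gives a short, end-to-end sketch of the allocation computation. The function name and signature are hypothetical, and the final clamp to the range from `0` to `max_allocation_pct` is an assumption, since the exact bounds used by `_compute_position_sizing()` are not stated here.

```python
def compute_allocation_pct(confidence: float, trend_strength: float,
                           contradiction_score: float, evidence_count: int,
                           base_allocation_pct: float = 0.01,
                           max_allocation_pct: float = 0.10,
                           confidence_sizing_weight: float = 0.8) -> float:
    confidence_factor = confidence_sizing_weight * confidence
    strength_factor = 0.5 + 0.5 * trend_strength
    raw = base_allocation_pct + confidence_factor * strength_factor * (
        max_allocation_pct - base_allocation_pct)
    # Contradiction penalty: up to 50% off at a contradiction score of 1.0.
    allocation = raw * (1.0 - 0.5 * contradiction_score)
    # Evidence count penalty stacks with the contradiction penalty.
    if evidence_count < 3:
        allocation *= 0.5
    elif evidence_count < 5:
        allocation *= 0.75
    # Assumed clamp; the real bounds may differ.
    return min(max(allocation, 0.0), max_allocation_pct)


# Confidence 0.6, strength 0.3, contradiction 0.4, four evidence documents:
# raw = 0.01 + 0.48 * 0.65 * 0.09 = 0.03808, then * 0.8, then * 0.75.
print(round(compute_allocation_pct(0.6, 0.3, 0.4, 4), 6))  # 0.022848
```

With the same metrics but only two evidence documents, the evidence multiplier drops to `0.5` and the allocation falls to roughly 1.5%.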


### Max Loss Scaling


The same scaling logic applies to the maximum loss percentage, which starts at a base of `0.3%` (`base_max_loss_pct = 0.003`) and scales up to `2%` (`max_max_loss_pct = 0.02`). Higher-conviction commitments are allowed larger loss tolerances, while low-conviction or contested commitments are constrained to tighter risk thresholds.


The final `PositionSizing` object (defined in `services/shared/schemas.py`) contains `allocation_pct` and `max_loss_pct`, both clamped to their respective bounds. This object is embedded in the `Recommendation` and later consumed by the decision execution engine's own commitment sizer (described in [Page 6](06-decision-execution.md)), which applies additional resource pool-level constraints.


---


## Thesis Generation


Every recommendation includes a human-readable thesis that explains the reasoning behind the action. Thesis generation happens in two layers: a deterministic assembly that is always present, and an optional LLM rewrite that polishes the wording for execution-eligible recommendations.


### Deterministic Thesis Assembly


The `build_thesis()` function in `services/recommendation/worker.py` constructs a thesis string entirely from the trend data and eligibility result, with no model involvement. The thesis is assembled from several components in order:


1. **Opening** — States the entity identifier, trend direction, window, strength, and confidence. For example: "Entity-A shows a negative trend over the 7d window with strength 0.35 and confidence 0.62."


2. **Catalysts** — Lists the top three dominant catalysts from the `TrendSummary`, drawn from the evidence ranking described in [Page 4](04-trend-aggregation-and-accumulating-signals.md).


3. **Contradiction note** — If the contradiction score exceeds `0.15`, a note flags the signal disagreement and its magnitude.


4. **Trend projection** — When a `TrendProjection` is available and not flagged as low-confidence, the thesis incorporates the projected direction, strength, and top driving factors. If the projection diverges from the current trend, a divergence note is appended.


5. **Risks** — Lists the top two material risks from the `TrendSummary`.


6. **Evidence count** — States the number of supporting and opposing evidence documents.


7. **Prescriptive action** — States the recommended action and mode (e.g., "Recommendation: DEFER (simulation eligible).").


The deterministic thesis is always generated and serves as the audit reference. Even when the LLM rewrites the thesis, the deterministic version is preserved in the model metadata for traceability.


### Optional LLM Rewrite via the Thesis-Rewriter Agent


For recommendations that are both eligible and not suppressed, the worker optionally invokes the thesis-rewriter agent to polish the deterministic thesis into professional-quality prose. The LLM rewrite is implemented in `services/recommendation/thesis_llm.py` and uses the `thesis-rewriter` agent slug, resolved at runtime through the `AgentConfigResolver` in `services/shared/agent_config.py`.


The `AgentConfigResolver` queries the `ai_agents` and `agent_variants` database tables to resolve the active configuration for the `thesis-rewriter` slug, preferring an active variant's model, timeout, and retry settings when one exists. The resolver uses a 60-second TTL in-memory cache to avoid hitting the database on every recommendation. This is the same resolution mechanism used by the document extractor and event classifier agents described in [Page 2](02-ai-agent-processing-and-extraction.md).


The `rewrite_thesis_with_llm()` function builds a prompt from the deterministic thesis and trend context (entity identifier, window, direction, strength, confidence, contradiction score, catalysts, risks), sends it to the local Ollama instance via HTTP, and returns the rewritten text. The system prompt enforces strict rules: no fabricated information, no numbers or facts not present in the input, under 150 words, neutral professional tone, and only the rewritten thesis text in the response.


The LLM layer is purely additive — if the call fails for any reason (network error, timeout, empty response, token budget exceeded), the original deterministic thesis is returned unchanged. The worker in `services/recommendation/main.py` resolves the thesis-rewriter configuration at startup and refreshes it every 50 jobs to pick up configuration changes without requiring a restart. When no database configuration exists for the `thesis-rewriter` slug, thesis rewriting is silently disabled.


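The "purely additive" fallback can be sketched as a thin wrapper around the rewrite call. `rewrite_or_fallback` and its `rewrite` callable are hypothetical names; the real `rewrite_thesis_with_llm()` also builds the prompt, calls Ollama over HTTP, and enforces the token budget, none of which is modelled here.

```python
from typing import Callable


def rewrite_or_fallback(deterministic_thesis: str,
                        rewrite: Callable[[str], str]) -> str:
    """Return the rewritten thesis, or the original if anything goes wrong."""
    try:
        rewritten = rewrite(deterministic_thesis)
    except Exception:
        # Network error, timeout, budget exceeded: keep the original text.
        return deterministic_thesis
    # An empty or whitespace-only response also counts as a failure.
    if not rewritten or not rewritten.strip():
        return deterministic_thesis
    return rewritten.strip()


def failing_rewriter(thesis: str) -> str:
    raise TimeoutError("simulated LLM timeout")


original = "Entity-A shows a negative trend over the 7d window."
print(rewrite_or_fallback(original, failing_rewriter) == original)  # True
```

Because every failure path returns the input unchanged, callers never need to distinguish a rewritten thesis from a fallback when building the `Recommendation`.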


Performance logging for the thesis-rewriter is written to the `agent_performance_log` table, recording success/failure, duration, estimated token counts, and the variant ID. Token budget enforcement checks hourly usage against the variant's configured budget before making the LLM call, preventing runaway costs from high-volume recommendation cycles.


### Risk Classification Prefix


Before the thesis is stored, the `classify_risk()` function in `services/recommendation/worker.py` assigns a risk classification label that is prepended to the thesis text as a `[risk:<level>]` prefix. The classification is computed from a composite score:


| Factor | Contribution |
|--------|-------------|
| Contradiction score | `contradiction × 2.0` |
| Low confidence | `(1.0 − confidence) × 1.5` |
| Low evidence count | `+1.0` if < 3 docs, `+0.5` if < 5 docs |
| Rejection reasons | `+0.5` per rejection reason |


The composite score maps to four levels:


| Score Range | Classification |
|-------------|---------------|
| ≥ 3.0 | `very_high` |
| ≥ 2.0 | `high` |
| ≥ 1.0 | `moderate` |
| < 1.0 | `low` |


A recommendation with high contradiction (0.4 → contributes 0.8), moderate confidence (0.55 → contributes 0.675), and 4 evidence documents (contributes 0.5) would score 1.975, classifying as `moderate`. The same recommendation with only 2 evidence documents would score 2.475, pushing it to `high`. This classification gives downstream consumers — both the decision execution engine and human reviewers — a quick risk signal without needing to re-evaluate the underlying metrics.


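The worked example can be reproduced with a few lines of Python. The function names and signatures below are illustrative rather than the actual `classify_risk()` interface; the weights and cut-offs come from the two tables above.

```python
def risk_score(contradiction: float, confidence: float,
               evidence_count: int, rejection_count: int = 0) -> float:
    score = contradiction * 2.0 + (1.0 - confidence) * 1.5
    if evidence_count < 3:
        score += 1.0
    elif evidence_count < 5:
        score += 0.5
    return score + 0.5 * rejection_count


def risk_level(score: float) -> str:
    if score >= 3.0:
        return "very_high"
    if score >= 2.0:
        return "high"
    if score >= 1.0:
        return "moderate"
    return "low"


def with_risk_prefix(thesis: str, level: str) -> str:
    return f"[risk:{level}] {thesis}"


four_docs = risk_score(contradiction=0.4, confidence=0.55, evidence_count=4)
two_docs = risk_score(contradiction=0.4, confidence=0.55, evidence_count=2)
print(round(four_docs, 3), risk_level(four_docs))  # 1.975 moderate
print(round(two_docs, 3), risk_level(two_docs))    # 2.475 high
```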


---


## Persistence


The recommendation pipeline persists its output to three PostgreSQL tables, creating a complete audit trail from trend assessment through decision logic to the final recommendation.


### `recommendations` — The Core Record


The `persist_recommendation()` function in `services/recommendation/worker.py` inserts the `Recommendation` into the `recommendations` table. Each row captures the entity identifier, action, mode, confidence, time horizon, thesis (including the risk classification prefix and any suppression notes), invalidation conditions (as JSONB), commitment sizing (allocation percentage and max loss percentage), model metadata (provider, model name, prompt version, schema version), risk classification, and generation timestamp. The insert returns the recommendation's UUID, which serves as the foreign key for the evidence and risk evaluation tables.


### `recommendation_evidence` — Evidence Citations


For each evidence document referenced in the recommendation, a row is inserted into the `recommendation_evidence` table linking the recommendation UUID to the document UUID, with an evidence type (`supporting` or `opposing`) and a position-based weight that decays with rank: `weight = 1.0 / (1.0 + index × 0.1)`. The first supporting document gets weight `1.0`, the second gets `0.91`, the third `0.83`, and so on. Non-UUID document IDs (such as synthetic pattern signal IDs like `pattern:Entity-A:performance_report:7d` from the competitive signal layer) are filtered out before insertion, since the table enforces a foreign key to the `documents` table.


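The weight decay and the UUID filter can be illustrated together. This is a sketch with hypothetical names (`evidence_rows`, `is_uuid`); it assumes the rank index is the document's position in the original list, before non-UUID IDs are filtered out, which may differ from the actual persistence code.

```python
import uuid
from typing import List, Tuple


def is_uuid(value: str) -> bool:
    try:
        uuid.UUID(value)
        return True
    except ValueError:
        return False


def evidence_rows(document_ids: List[str],
                  evidence_type: str) -> List[Tuple[str, str, float]]:
    """Build (document_id, evidence_type, weight) rows for insertion."""
    rows = []
    for index, doc_id in enumerate(document_ids):
        # Synthetic IDs have no row in `documents`, so skip them.
        if not is_uuid(doc_id):
            continue
        weight = 1.0 / (1.0 + index * 0.1)
        rows.append((doc_id, evidence_type, weight))
    return rows


ids = [str(uuid.uuid4()), str(uuid.uuid4()), str(uuid.uuid4()),
       "pattern:Entity-A:performance_report:7d"]
for _, kind, weight in evidence_rows(ids, "supporting"):
    print(kind, round(weight, 2))  # weights 1.0, 0.91, 0.83
```

The synthetic pattern ID is dropped, so only three rows are produced for the four input IDs.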


### `risk_evaluations` — Decision Audit Trail


The `risk_evaluations` table records the full eligibility decision for each recommendation: whether the trend was eligible, the allowed mode, the list of rejection reasons (as JSONB), and a `risk_checks` JSONB object containing the time horizon, commitment sizing details, invalidation conditions, and risk classification. This table enables post-hoc analysis of why the system made a particular decision — auditors can trace from the recommendation back through the eligibility evaluation to the underlying trend metrics.


---


## Deduplication


Before running the full evaluation pipeline, the worker checks whether the latest recommendation for the same entity identifier and time horizon is effectively identical to what would be generated. The `_is_duplicate_recommendation()` function in `services/recommendation/worker.py` compares the previous recommendation's action, mode, and confidence (within a `0.01` tolerance) against the current eligibility result. If all three match, the recommendation is skipped — the underlying trend data has not changed meaningfully since the last cycle. This prevents the system from flooding the `recommendations` table with identical entries on every aggregation cycle, while still generating a new recommendation whenever the trend metrics shift enough to change the action, mode, or confidence.


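The comparison itself is small enough to sketch directly. The dictionary shape for the previous recommendation and the inclusive tolerance comparison are assumptions made for this example; only the three compared fields and the `0.01` tolerance come from the description above.

```python
from typing import Mapping, Optional


def is_duplicate(previous: Optional[Mapping[str, object]], action: str,
                 mode: str, confidence: float, tolerance: float = 0.01) -> bool:
    # No earlier recommendation for this entity and horizon: never a duplicate.
    if previous is None:
        return False
    return (previous["action"] == action
            and previous["mode"] == mode
            and abs(float(previous["confidence"]) - confidence) <= tolerance)


last = {"action": "DEFER", "mode": "simulation_eligible", "confidence": 0.62}
print(is_duplicate(last, "DEFER", "simulation_eligible", 0.625))  # True: skip
print(is_duplicate(last, "DEFER", "simulation_eligible", 0.70))   # False: new row
print(is_duplicate(last, "DEFER", "production_eligible", 0.62))   # False: mode changed
```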


---


## What Comes Next


At this point, the recommendation engine has translated trend assessments into concrete `Recommendation` objects (each with an action, execution mode, commitment sizing guideline, thesis, and risk classification) and persisted them alongside their evidence citations and eligibility audit trails. Recommendations marked as `simulation_eligible` or `production_eligible` are now available for the decision execution engine to consume. [Page 6](06-decision-execution.md) explains how the decision execution engine polls these recommendations, applies its own pre-execution check sequence (circuit breakers, execution windows, confidence gates, deduplication, declining commitments, and max open commitments), computes final commitment sizes with resource pool-level constraints, and submits execution requests through the execution adapter to the external execution API.