Celes Renata 88ad1e8d99 feat: comprehensive docs, unit tests, docker-compose app services

- Add scheduler and ingestion unit tests (test_scheduler_unit.py, test_ingestion_unit.py)
- Add all 13 app services + dashboard to docker-compose.yml
- Add full documentation suite: API reference, Helm reference, Docker deployment guide,
  3 architecture diagrams (K8s, Docker Compose, data pipeline), AI agent guide,
  backup/restore guide, observability/metrics reference, per-service docs
- Add intelligence pipeline deep-dive docs with Mermaid diagrams
- Update README with documentation index and links
- Add specs for comprehensive-quality-docs, intelligence-pipeline-deep-dive,
  sanitized-pipeline-docs

2026-04-22 02:56:41 +00:00

Page 5 — Recommendation Generation and Signal-to-Action Translation

The aggregation engine described in Page 4 produces TrendSummary objects across five time windows for each entity identifier, encoding the direction, strength, confidence, contradiction level, and supporting evidence accumulated from all three signal layers. But a TrendSummary is an assessment — it describes what the evidence says, not what the system should do about it. The recommendation engine is where assessment becomes action. It takes each TrendSummary, subjects it to a series of deterministic evaluations, and produces a Recommendation object that specifies a concrete action (act, defer, monitor, or observe), an execution mode (informational, simulation-eligible, or production-eligible), a commitment sizing guideline, a human-readable thesis, and a risk classification. Every decision in this pipeline is rule-based and fully traceable — the LLM is only involved in an optional downstream step that rewrites the thesis wording.

The recommendation worker in services/recommendation/main.py polls the app:queue:recommendation Redis queue for jobs, each specifying an entity identifier and time window. For each job, it delegates to generate_recommendation() in services/recommendation/worker.py, which orchestrates the full pipeline: fetch the latest trend summary, check for duplicate recommendations, fetch any available trend projection, evaluate data quality suppression, evaluate eligibility, optionally rewrite the thesis via LLM, build the Recommendation object, and persist everything to PostgreSQL. For a visual overview of this flow, see the Recommendation Generation Flow diagram.

Data Quality Suppression

Before the eligibility engine evaluates whether a trend is strong enough to act on, the suppression layer in services/recommendation/suppression.py asks a more fundamental question: is the underlying data reliable enough to act on at all? A trend might show high confidence and strong directionality, but if the documents feeding it are stale, poorly extracted, or drawn from a single source type, the apparent signal quality is illusory. The suppression layer acts as a pre-filter on data quality, running before the eligibility engine and forcing any recommendation built on unreliable data to informational mode regardless of how strong the trend metrics look.

The evaluate_suppression() function accepts a TrendSummary and a DataQualityContext — a set of metrics about the documents underlying the trend, populated by querying documents and document_intelligence tables for the evidence document IDs stored in the trend summary. When full document-level metrics are not available (for example, in a development environment without the full document pipeline), the function falls back to build_quality_context_from_summary(), which estimates quality metrics from the trend summary's own evidence counts and confidence.

The Six Data Quality Checks

The suppression evaluation runs six independent checks, each comparing a data quality metric against a configurable threshold defined in SuppressionConfig. If any single check fails, the recommendation is suppressed:

Low extraction confidence — If the average extraction confidence across the evidence documents falls below 0.40 (min_avg_extraction_confidence), the underlying LLM extractions are too unreliable. This catches cases where the extractor struggled with document formatting, ambiguous content, or low-quality source material, as described in Page 2.
Evidence staleness — If the most recent evidence document is older than 168 hours (7 days, max_evidence_staleness_hours), the trend is based on outdated information. Conditions change rapidly, and a week-old evidence base may no longer reflect the current state. When documents exist but no timestamp is available, the evidence is conservatively treated as stale.
Low source diversity — If fewer than 1 distinct source type (min_source_types) contributed to the evidence, the signal has no identifiable source class behind it. At this threshold the check fires only when the quality context has documents but none of them maps to a source type. A threshold of 2 would be needed to also catch evidence drawn entirely from one source type (for example, all news articles with no filings or supplementary data to corroborate).
High extraction failure rate — If more than 50% (max_extraction_failure_rate) of the documents that should have contributed to the trend failed extraction entirely, the data pipeline is unreliable for this entity. A high failure rate means the trend summary is built from a biased subset of the available evidence — the failed documents might have told a different story.
Insufficient valid documents — If fewer than 2 valid (non-failed) documents (min_valid_documents) contributed to the trend, there simply is not enough data to act on. A single document, no matter how high-quality, does not provide the corroboration needed for automated execution decisions.
Low data quality score — The _compute_data_quality_score() function computes an overall quality score from three weighted components: extraction confidence (40% weight, normalized against a 0.8 baseline), evidence freshness (30% weight, linear decay over the staleness window), and document coverage (30% weight, combining the valid/total ratio with a count factor that saturates at 10 documents). If this composite score falls below 0.30 (min_data_quality_score) and the low-confidence check has not already fired, a general suppression reason is added.

When any check triggers, the SuppressionResult records the specific reasons (as SuppressionReason enum values) and the computed data quality score. The worker in services/recommendation/worker.py uses this result to force the recommendation's mode to informational and append a suppression note to the thesis text, ensuring the suppression decision is visible in the audit trail.
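The six checks and the composite score can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in services/recommendation/suppression.py: the DataQualityContext field names and the SuppressionReason member names are hypothetical, and multiplying the valid/total ratio by the count factor is one plausible reading of "combining".

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Tuple


class SuppressionReason(Enum):
    # Member names are illustrative; the real enum values are not shown here.
    LOW_EXTRACTION_CONFIDENCE = "low_extraction_confidence"
    STALE_EVIDENCE = "stale_evidence"
    LOW_SOURCE_DIVERSITY = "low_source_diversity"
    HIGH_EXTRACTION_FAILURE_RATE = "high_extraction_failure_rate"
    INSUFFICIENT_VALID_DOCUMENTS = "insufficient_valid_documents"
    LOW_DATA_QUALITY = "low_data_quality"


@dataclass(frozen=True)
class SuppressionConfig:
    min_avg_extraction_confidence: float = 0.40
    max_evidence_staleness_hours: float = 168.0
    min_source_types: int = 1
    max_extraction_failure_rate: float = 0.50
    min_valid_documents: int = 2
    min_data_quality_score: float = 0.30


@dataclass(frozen=True)
class DataQualityContext:
    # Field names are assumptions made for this sketch.
    avg_extraction_confidence: float
    newest_evidence_age_hours: Optional[float]  # None: no timestamp available
    source_type_count: int
    total_documents: int
    valid_documents: int


def data_quality_score(ctx: DataQualityContext, cfg: SuppressionConfig) -> float:
    # 40%: extraction confidence, normalized against a 0.8 baseline.
    confidence = min(ctx.avg_extraction_confidence / 0.8, 1.0)
    # 30%: freshness, decaying linearly to zero over the staleness window.
    if ctx.newest_evidence_age_hours is None:
        freshness = 0.0
    else:
        freshness = max(0.0, 1.0 - ctx.newest_evidence_age_hours / cfg.max_evidence_staleness_hours)
    # 30%: coverage; the product of ratio and count factor is an assumption.
    if ctx.total_documents == 0:
        coverage = 0.0
    else:
        ratio = ctx.valid_documents / ctx.total_documents
        coverage = ratio * min(ctx.valid_documents / 10.0, 1.0)
    return 0.4 * confidence + 0.3 * freshness + 0.3 * coverage


def evaluate_suppression_sketch(
    ctx: DataQualityContext, cfg: SuppressionConfig = SuppressionConfig()
) -> Tuple[List[SuppressionReason], float]:
    reasons: List[SuppressionReason] = []
    has_docs = ctx.total_documents > 0
    if ctx.avg_extraction_confidence < cfg.min_avg_extraction_confidence:
        reasons.append(SuppressionReason.LOW_EXTRACTION_CONFIDENCE)
    age = ctx.newest_evidence_age_hours
    # Documents without a timestamp are conservatively treated as stale.
    if has_docs and (age is None or age > cfg.max_evidence_staleness_hours):
        reasons.append(SuppressionReason.STALE_EVIDENCE)
    if has_docs and ctx.source_type_count < cfg.min_source_types:
        reasons.append(SuppressionReason.LOW_SOURCE_DIVERSITY)
    if has_docs:
        failure_rate = (ctx.total_documents - ctx.valid_documents) / ctx.total_documents
        if failure_rate > cfg.max_extraction_failure_rate:
            reasons.append(SuppressionReason.HIGH_EXTRACTION_FAILURE_RATE)
    if ctx.valid_documents < cfg.min_valid_documents:
        reasons.append(SuppressionReason.INSUFFICIENT_VALID_DOCUMENTS)
    score = data_quality_score(ctx, cfg)
    # The general reason is added only if the low-confidence check did not fire.
    if (score < cfg.min_data_quality_score
            and SuppressionReason.LOW_EXTRACTION_CONFIDENCE not in reasons):
        reasons.append(SuppressionReason.LOW_DATA_QUALITY)
    return reasons, score
```

A non-empty reasons list is what would force the recommendation to informational mode in this sketch.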

Safety Suppressions: Macro-Only and Pattern-Only Signals

Beyond the six data quality checks, two additional safety suppressions protect against acting on signals that lack entity-specific corroboration:

Macro-only suppression (evaluate_macro_only_suppression()) fires when macro signals are the sole basis for a trend direction — no entity-specific signals contributed at all. As described in Page 3, macro signals enter the aggregation engine at a reduced weight of 0.3 relative to entity-specific signals. But even at reduced weight, macro signals alone can shift a trend direction if no entity-specific evidence exists. When this happens, the recommendation is forced to informational mode with a caveat noting that the signal is macro-only and should not be used for automated execution.

Pattern-only suppression (evaluate_pattern_only_suppression()) applies the same logic to competitive/pattern signals. When pattern-based signals from services/aggregation/pattern_matcher.py and services/aggregation/signal_propagation.py are the sole contributors — no entity-specific or macro signals — the recommendation is suppressed. Historical patterns are valuable context, but acting on them without any current evidence is too speculative for automated execution.

Both safety suppressions are evaluated in the worker after the main suppression check, and both force the mode to informational when triggered.
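As a rough sketch of the two safety suppressions, assuming the worker can count how many signals of each layer fed the trend: the SignalCounts container, the caveat wording, and the reading that "sole basis" means both other layers are empty are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class SignalCounts:
    # Hypothetical container: number of contributing signals per layer.
    entity: int
    macro: int
    pattern: int


def is_macro_only(counts: SignalCounts) -> bool:
    # Macro signals are present and nothing else contributed.
    return counts.macro > 0 and counts.entity == 0 and counts.pattern == 0


def is_pattern_only(counts: SignalCounts) -> bool:
    # Pattern signals are present and nothing else contributed.
    return counts.pattern > 0 and counts.entity == 0 and counts.macro == 0


def apply_safety_suppressions(mode: str, counts: SignalCounts) -> Tuple[str, Optional[str]]:
    """Return the (possibly downgraded) mode and an optional caveat for the thesis."""
    if is_macro_only(counts):
        return "informational", "Macro-only signal; not for automated execution."
    if is_pattern_only(counts):
        return "informational", "Pattern-only signal; not for automated execution."
    return mode, None
```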

Eligibility Evaluation

Recommendations that survive the suppression layer enter the eligibility evaluation in services/recommendation/eligibility.py. This is the core decision logic — a set of deterministic rules that map trend metrics to actions, execution modes, and commitment sizing. The evaluate_eligibility() function is the single entry point, accepting a TrendSummary and an EligibilityConfig of tunable thresholds.

Gate Checks

The _check_gates() function applies five hard gates. If any gate fails, the trend is ineligible for a recommendation (though the action and mode are still computed for the audit trace):

| Gate | Threshold | Rejection Reason |
| --- | --- | --- |
| Confidence | ≥ `0.35` | `low_confidence` |
| Trend strength | ≥ `0.10` | `low_trend_strength` |
| Contradiction score | ≤ `0.60` | `high_contradiction` |
| Evidence count | ≥ `2` (supporting + opposing) | `insufficient_evidence` |
| Direction | ≠ `neutral` | `neutral_direction` |

These gates are intentionally conservative. A confidence threshold of 0.35 means the system needs meaningful evidence breadth and agreement before generating any recommendation at all (see the confidence computation in Page 4). The contradiction ceiling of 0.60 allows moderately contested trends through — only when the evidence is deeply split does the gate reject. The evidence minimum of 2 ensures that no recommendation is ever based on a single document.

When a trend fails any gate, the resulting EligibilityResult has eligible = False and the mode is forced to informational, regardless of what the mode escalation logic would otherwise compute.
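The five gates can be expressed as a small function that collects rejection reasons. This is a sketch using the thresholds listed above; the TrendMetrics container and its field names are hypothetical stand-ins for the relevant TrendSummary fields.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TrendMetrics:
    # Hypothetical subset of TrendSummary used only for this illustration.
    direction: str  # "positive", "negative", "mixed", or "neutral"
    strength: float
    confidence: float
    contradiction: float
    supporting: int
    opposing: int


def check_gates(trend: TrendMetrics) -> List[str]:
    """Return the rejection reasons; an empty list means every gate passed."""
    reasons: List[str] = []
    if trend.confidence < 0.35:
        reasons.append("low_confidence")
    if trend.strength < 0.10:
        reasons.append("low_trend_strength")
    if trend.contradiction > 0.60:
        reasons.append("high_contradiction")
    if trend.supporting + trend.opposing < 2:
        reasons.append("insufficient_evidence")
    if trend.direction == "neutral":
        reasons.append("neutral_direction")
    return reasons


trend = TrendMetrics("negative", 0.35, 0.62, 0.20, supporting=4, opposing=1)
eligible = not check_gates(trend)
```

Collecting every failed gate, rather than stopping at the first, keeps the full list of rejection reasons available for the audit trace.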

Action Mapping

The _determine_action() function maps the trend's direction and strength to one of four action types. The logic evaluates in a specific order:

Mixed or neutral direction → OBSERVE. If the trend direction is mixed (high contradiction with weak directional signal) or neutral, the action is always OBSERVE. There is no directional conviction to act on.

Strong directional signal → ACT or DEFER. If the trend strength reaches 0.25 or above (action_strength_threshold), the action follows the direction: ACT for positive, DEFER for negative. This threshold ensures that only trends with meaningful magnitude trigger commitment-changing actions.

Weak directional signal with decent confidence → MONITOR. If the trend has a clear direction (positive or negative) but strength remains below 0.25, the action depends on confidence. If confidence reaches 0.50 or above (hold_confidence_threshold), the action is MONITOR — the system recognizes the directional lean but does not have enough conviction to recommend a commitment change. Below 0.50 confidence, the action falls to OBSERVE.

This mapping creates the escalation ladder described in Page 4: as consecutive signals accumulate and strengthen the trend metrics, the action naturally progresses from OBSERVE → MONITOR → ACT/DEFER.
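The evaluation order above can be sketched as a single function. The signature and the string values for directions and actions are assumptions; only the two thresholds come from the text.

```python
def determine_action(
    direction: str,
    strength: float,
    confidence: float,
    action_strength_threshold: float = 0.25,
    hold_confidence_threshold: float = 0.50,
) -> str:
    # 1. No directional conviction: always OBSERVE.
    if direction in ("mixed", "neutral"):
        return "OBSERVE"
    # 2. Strong directional signal: follow the direction.
    if strength >= action_strength_threshold:
        return "ACT" if direction == "positive" else "DEFER"
    # 3. Weak directional signal: MONITOR only with decent confidence.
    if confidence >= hold_confidence_threshold:
        return "MONITOR"
    return "OBSERVE"


# A trend that strengthens over successive cycles climbs the ladder.
ladder = [
    determine_action("positive", 0.12, 0.40),
    determine_action("positive", 0.18, 0.55),
    determine_action("positive", 0.30, 0.65),
]
```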

Mode Escalation

The _determine_mode() function determines the highest execution mode allowed for the recommendation. Mode controls whether the recommendation is purely informational, eligible for simulation mode, or eligible for live execution mode:

OBSERVE and MONITOR → always informational. These actions never trigger executions, so their mode is always informational. They are logged for human review and dashboard display but never enter the decision execution engine.

ACT and DEFER → escalation based on signal quality. For actionable recommendations, mode escalates through three tiers:

informational — The default when confidence is below 0.50. The recommendation is recorded but not eligible for any execution.
simulation_eligible — When confidence reaches 0.50 or above (paper_confidence_threshold). The recommendation can be picked up by the simulation engine described in Page 6.
production_eligible — The strictest tier, requiring confidence ≥ 0.70 (live_confidence_threshold), contradiction ≤ 0.25 (live_max_contradiction), and evidence count ≥ 5 (live_min_evidence). This triple gate ensures that only high-conviction, well-corroborated, low-contradiction recommendations can trigger live executions.

The evidence count for mode escalation is computed as the sum of supporting and opposing evidence documents, matching the same count used in the gate checks.
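A minimal sketch of the tiering, assuming string values for actions and modes. The function checks the strictest tier first so that a recommendation lands in the highest mode it qualifies for.

```python
def determine_mode(
    action: str,
    confidence: float,
    contradiction: float,
    evidence_count: int,  # supporting + opposing documents
    paper_confidence_threshold: float = 0.50,
    live_confidence_threshold: float = 0.70,
    live_max_contradiction: float = 0.25,
    live_min_evidence: int = 5,
) -> str:
    # Non-actionable recommendations never leave informational mode.
    if action in ("OBSERVE", "MONITOR"):
        return "informational"
    # Triple gate for live execution.
    if (confidence >= live_confidence_threshold
            and contradiction <= live_max_contradiction
            and evidence_count >= live_min_evidence):
        return "production_eligible"
    if confidence >= paper_confidence_threshold:
        return "simulation_eligible"
    return "informational"
```

For example, confidence 0.75 with contradiction 0.30 fails the contradiction leg of the triple gate and falls back to simulation_eligible.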

Commitment Sizing

The _compute_position_sizing() function in services/recommendation/eligibility.py translates signal quality into an allocation pool guideline. Commitment sizing is not a fixed value — it scales dynamically with the confidence and strength of the underlying trend, penalized by contradiction and thin evidence.

Base and Scaling

The computation starts with a base allocation of 1% (base_allocation_pct = 0.01) and scales upward based on two factors:

Confidence factor — 0.8 × confidence (confidence_sizing_weight), reflecting how much the system trusts the trend assessment.
Strength factor — 0.5 + 0.5 × trend_strength, ranging from 0.5 (weakest trend) to 1.0 (strongest trend).

The raw allocation percentage is computed as:

raw_allocation = base + confidence_factor × strength_factor × (max - base)

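where max is 10% (max_allocation_pct = 0.10). Because the confidence factor is capped at 0.8, the raw allocation tops out at 8.2% (0.01 + 0.8 × 1.0 × 0.09) at maximum confidence (1.0) and maximum strength (1.0), so the 10% ceiling acts as a clamp rather than a value the formula reaches on its own. At typical values (confidence 0.6, strength 0.3), the raw allocation is about 3.8% (0.01 + 0.48 × 0.65 × 0.09).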

Contradiction Penalty

The contradiction score applies a multiplicative penalty:

allocation_pct = raw_allocation × (1.0 − 0.5 × contradiction_score)

A contradiction score of 0.40 reduces the allocation by 20%. A score of 0.0 (no contradiction) applies no penalty. This ensures that contested trends receive smaller commitment sizes even when they pass the eligibility gates.

Evidence Count Penalty

Thin evidence further reduces the allocation:

Fewer than 3 evidence documents → multiply by 0.5 (halved).
3 or 4 evidence documents → multiply by 0.75.
5 or more documents → no penalty.

This penalty stacks with the contradiction penalty, so a trend with high contradiction and thin evidence receives a substantially reduced commitment size.
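The allocation arithmetic, including both penalties, can be traced with a short function. This is a sketch of the formulas as written in this section, not the body of _compute_position_sizing(); the final clamp to the range 0 to max_allocation_pct is an assumption about what "clamped to their respective bounds" means.

```python
def compute_allocation_pct(
    confidence: float,
    trend_strength: float,
    contradiction_score: float,
    evidence_count: int,
    base_allocation_pct: float = 0.01,
    max_allocation_pct: float = 0.10,
    confidence_sizing_weight: float = 0.8,
) -> float:
    confidence_factor = confidence_sizing_weight * confidence
    strength_factor = 0.5 + 0.5 * trend_strength
    raw = base_allocation_pct + confidence_factor * strength_factor * (
        max_allocation_pct - base_allocation_pct
    )
    # Multiplicative contradiction penalty.
    allocation = raw * (1.0 - 0.5 * contradiction_score)
    # Thin-evidence penalty stacks on top of the contradiction penalty.
    if evidence_count < 3:
        allocation *= 0.5
    elif evidence_count < 5:
        allocation *= 0.75
    # Assumed clamp: never negative, never above the configured maximum.
    return min(max(allocation, 0.0), max_allocation_pct)


# Confidence 0.6, strength 0.3: raw allocation of roughly 3.8%.
typical = compute_allocation_pct(0.6, 0.3, 0.0, 5)
# Same trend with contradiction 0.40 and only 2 documents: 0.8 x 0.5 of that.
contested_thin = compute_allocation_pct(0.6, 0.3, 0.4, 2)
```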

Max Loss Scaling

The same scaling logic applies to the maximum loss percentage, which starts at a base of 0.3% (base_max_loss_pct = 0.003) and scales up to 2% (max_max_loss_pct = 0.02). Higher-conviction commitments are allowed larger loss tolerances, while low-conviction or contested commitments are constrained to tighter risk thresholds.

The final PositionSizing object (defined in services/shared/schemas.py) contains allocation_pct and max_loss_pct, both clamped to their respective bounds. This object is embedded in the Recommendation and later consumed by the decision execution engine's own commitment sizer (described in Page 6), which applies additional resource pool-level constraints.

Thesis Generation

Every recommendation includes a human-readable thesis that explains the reasoning behind the action. Thesis generation happens in two layers: a deterministic assembly that is always present, and an optional LLM rewrite that polishes the wording for execution-eligible recommendations.

Deterministic Thesis Assembly

The build_thesis() function in services/recommendation/worker.py constructs a thesis string entirely from the trend data and eligibility result, with no model involvement. The thesis is assembled from several components in order:

Opening — States the entity identifier, trend direction, window, strength, and confidence. For example: "Entity-A shows a negative trend over the 7d window with strength 0.35 and confidence 0.62."
Catalysts — Lists the top three dominant catalysts from the TrendSummary, drawn from the evidence ranking described in Page 4.
Contradiction note — If the contradiction score exceeds 0.15, a note flags the signal disagreement and its magnitude.
Trend projection — When a TrendProjection is available and not flagged as low-confidence, the thesis incorporates the projected direction, strength, and top driving factors. If the projection diverges from the current trend, a divergence note is appended.
Risks — Lists the top two material risks from the TrendSummary.
Evidence count — States the number of supporting and opposing evidence documents.
Prescriptive action — States the recommended action and mode (e.g., "Recommendation: DEFER (simulation eligible).").

The deterministic thesis is always generated and serves as the audit reference. Even when the LLM rewrites the thesis, the deterministic version is preserved in the model metadata for traceability.
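A simplified sketch of the assembly order follows. Only the opening sentence and the closing "Recommendation: ..." line follow the examples given above; the wording of the other components and the function signature are assumptions, and the trend projection component is omitted for brevity.

```python
from typing import List


def build_thesis_sketch(
    entity: str,
    direction: str,
    window: str,
    strength: float,
    confidence: float,
    contradiction: float,
    catalysts: List[str],
    risks: List[str],
    supporting: int,
    opposing: int,
    action: str,
    mode: str,
) -> str:
    parts = [
        f"{entity} shows a {direction} trend over the {window} window "
        f"with strength {strength:.2f} and confidence {confidence:.2f}."
    ]
    if catalysts:
        # Top three dominant catalysts.
        parts.append("Catalysts: " + ", ".join(catalysts[:3]) + ".")
    if contradiction > 0.15:
        parts.append(f"Signals disagree (contradiction {contradiction:.2f}).")
    if risks:
        # Top two material risks.
        parts.append("Risks: " + ", ".join(risks[:2]) + ".")
    parts.append(f"Evidence: {supporting} supporting, {opposing} opposing.")
    parts.append(f"Recommendation: {action} ({mode.replace('_', ' ')}).")
    return " ".join(parts)


thesis = build_thesis_sketch(
    "Entity-A", "negative", "7d", 0.35, 0.62, 0.20,
    ["c1", "c2", "c3", "c4"], ["r1", "r2", "r3"], 4, 1,
    "DEFER", "simulation_eligible",
)
```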

Optional LLM Rewrite via the Thesis-Rewriter Agent

For recommendations that are both eligible and not suppressed, the worker optionally invokes the thesis-rewriter agent to polish the deterministic thesis into professional-quality prose. The LLM rewrite is implemented in services/recommendation/thesis_llm.py and uses the thesis-rewriter agent slug, resolved at runtime through the AgentConfigResolver in services/shared/agent_config.py.

The AgentConfigResolver queries the ai_agents and agent_variants database tables to resolve the active configuration for the thesis-rewriter slug, preferring an active variant's model, timeout, and retry settings when one exists. The resolver uses a 60-second TTL in-memory cache to avoid hitting the database on every recommendation. This is the same resolution mechanism used by the document extractor and event classifier agents described in Page 2.
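The caching behaviour can be illustrated with a small time-to-live cache. This is a generic sketch, not the AgentConfigResolver itself: the class name, the loader callback standing in for the database query, and the injectable clock are all hypothetical.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TTLConfigCache:
    """Cache loader results per slug and reload them once they are older than the TTL."""

    def __init__(
        self,
        loader: Callable[[str], Any],  # stands in for the database lookup
        ttl_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def get(self, slug: str) -> Any:
        now = self._clock()
        entry = self._entries.get(slug)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]  # still fresh: no loader call
        value = self._loader(slug)
        self._entries[slug] = (now, value)
        return value
```

Injecting the clock keeps the expiry logic testable without sleeping for a full minute.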

The rewrite_thesis_with_llm() function builds a prompt from the deterministic thesis and trend context (entity identifier, window, direction, strength, confidence, contradiction score, catalysts, risks), sends it to the local Ollama instance via HTTP, and returns the rewritten text. The system prompt enforces strict rules: no fabricated information, no numbers or facts not present in the input, under 150 words, neutral professional tone, and only the rewritten thesis text in the response.

The LLM layer is purely additive — if the call fails for any reason (network error, timeout, empty response, token budget exceeded), the original deterministic thesis is returned unchanged. The worker in services/recommendation/main.py resolves the thesis-rewriter configuration at startup and refreshes it every 50 jobs to pick up configuration changes without requiring a restart. When no database configuration exists for the thesis-rewriter slug, thesis rewriting is silently disabled.
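The fallback behaviour can be sketched independently of the HTTP details. Here call_llm is a hypothetical callable standing in for the request to the local model, and the function name is illustrative; the point is only that every failure path returns the deterministic thesis unchanged.

```python
from typing import Callable


def rewrite_with_fallback(deterministic_thesis: str, call_llm: Callable[[str], str]) -> str:
    """Return the rewritten thesis, or the deterministic one if anything goes wrong."""
    try:
        rewritten = call_llm(deterministic_thesis)
    except Exception:
        # Network error, timeout, budget exceeded: keep the original text.
        return deterministic_thesis
    if not rewritten or not rewritten.strip():
        # Empty responses are treated the same as failures.
        return deterministic_thesis
    return rewritten.strip()
```

Because the deterministic thesis is the default return value, disabling or breaking the rewriter never blocks a recommendation from being generated.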

Performance logging for the thesis-rewriter is written to the agent_performance_log table, recording success/failure, duration, estimated token counts, and the variant ID. Token budget enforcement checks hourly usage against the variant's configured budget before making the LLM call, preventing runaway costs from high-volume recommendation cycles.

Risk Classification Prefix

Before the thesis is stored, the classify_risk() function in services/recommendation/worker.py assigns a risk classification label that is prepended to the thesis text as a [risk:<level>] prefix. The classification is computed from a composite score:

| Factor | Contribution |
| --- | --- |
| Contradiction score | `contradiction × 2.0` |
| Low confidence | `(1.0 − confidence) × 1.5` |
| Low evidence count | `+1.0` if < 3 docs, `+0.5` if < 5 docs |
| Rejection reasons | `+0.5` per rejection reason |

The composite score maps to four levels:

| Score Range | Classification |
| --- | --- |
| ≥ 3.0 | `very_high` |
| ≥ 2.0 | `high` |
| ≥ 1.0 | `moderate` |
| < 1.0 | `low` |

A recommendation with high contradiction (0.4 → contributes 0.8), moderate confidence (0.55 → contributes 0.675), and 4 evidence documents (contributes 0.5) would score 1.975, classifying as moderate. The same recommendation with only 2 evidence documents would score 2.475, pushing it to high. This classification gives downstream consumers — both the decision execution engine and human reviewers — a quick risk signal without needing to re-evaluate the underlying metrics.
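The composite score and its mapping can be reproduced in a few lines. This sketch follows the two tables above; the function names and signatures are assumptions, and the two evidence penalties are treated as mutually exclusive, which matches the worked example.

```python
def risk_score(
    contradiction: float,
    confidence: float,
    evidence_count: int,
    rejection_reason_count: int = 0,
) -> float:
    score = contradiction * 2.0 + (1.0 - confidence) * 1.5
    if evidence_count < 3:
        score += 1.0
    elif evidence_count < 5:
        score += 0.5
    score += 0.5 * rejection_reason_count
    return score


def risk_level(score: float) -> str:
    if score >= 3.0:
        return "very_high"
    if score >= 2.0:
        return "high"
    if score >= 1.0:
        return "moderate"
    return "low"


four_docs = risk_score(0.4, 0.55, 4)  # 0.8 + 0.675 + 0.5
two_docs = risk_score(0.4, 0.55, 2)   # 0.8 + 0.675 + 1.0
prefix = f"[risk:{risk_level(four_docs)}]"
```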

Persistence

The recommendation pipeline persists its output to three PostgreSQL tables, creating a complete audit trail from trend assessment through decision logic to the final recommendation.

`recommendations` — The Core Record

The persist_recommendation() function in services/recommendation/worker.py inserts the Recommendation into the recommendations table. Each row captures the entity identifier, action, mode, confidence, time horizon, thesis (including the risk classification prefix and any suppression notes), invalidation conditions (as JSONB), commitment sizing (allocation percentage and max loss percentage), model metadata (provider, model name, prompt version, schema version), risk classification, and generation timestamp. The insert returns the recommendation's UUID, which serves as the foreign key for the evidence and risk evaluation tables.

`recommendation_evidence` — Evidence Citations

For each evidence document referenced in the recommendation, a row is inserted into the recommendation_evidence table linking the recommendation UUID to the document UUID, with an evidence type (supporting or opposing) and a position-based weight that decays with rank: weight = 1.0 / (1.0 + index × 0.1). The first supporting document gets weight 1.0, the second gets 0.91, the third 0.83, and so on. Non-UUID document IDs (such as synthetic pattern signal IDs like pattern:Entity-A:performance_report:7d from the competitive signal layer) are filtered out before insertion, since the table enforces a foreign key to the documents table.
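The weighting and filtering can be sketched with the standard uuid module. The function names and the tuple layout are illustrative, and ranking documents after the non-UUID IDs have been filtered out is an assumption about the ordering.

```python
import uuid
from typing import List, Tuple


def is_uuid(value: str) -> bool:
    """True if the string parses as a UUID, False for synthetic IDs."""
    try:
        uuid.UUID(value)
        return True
    except ValueError:
        return False


def evidence_rows(doc_ids: List[str], evidence_type: str) -> List[Tuple[str, str, float]]:
    # Drop IDs that cannot satisfy the foreign key to the documents table.
    valid = [doc_id for doc_id in doc_ids if is_uuid(doc_id)]
    # Weight decays with rank: 1.0, 0.91, 0.83, ...
    return [
        (doc_id, evidence_type, 1.0 / (1.0 + index * 0.1))
        for index, doc_id in enumerate(valid)
    ]


rows = evidence_rows(
    [
        "00000000-0000-0000-0000-000000000001",
        "pattern:Entity-A:performance_report:7d",
        "00000000-0000-0000-0000-000000000002",
        "00000000-0000-0000-0000-000000000003",
    ],
    "supporting",
)
```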

`risk_evaluations` — Decision Audit Trail

The risk_evaluations table records the full eligibility decision for each recommendation: whether the trend was eligible, the allowed mode, the list of rejection reasons (as JSONB), and a risk_checks JSONB object containing the time horizon, commitment sizing details, invalidation conditions, and risk classification. This table enables post-hoc analysis of why the system made a particular decision — auditors can trace from the recommendation back through the eligibility evaluation to the underlying trend metrics.

Deduplication

Before running the full evaluation pipeline, the worker checks whether the latest recommendation for the same entity identifier and time horizon is effectively identical to what would be generated. The _is_duplicate_recommendation() function in services/recommendation/worker.py compares the previous recommendation's action, mode, and confidence (within a 0.01 tolerance) against the current eligibility result. If all three match, the recommendation is skipped — the underlying trend data has not changed meaningfully since the last cycle. This prevents the system from flooding the recommendations table with identical entries on every aggregation cycle, while still generating a new recommendation whenever the trend metrics shift enough to change the action, mode, or confidence.
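The comparison itself is small. A sketch, assuming a hypothetical PreviousRecommendation record holding the three compared fields and treating a difference of exactly 0.01 as still within tolerance:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PreviousRecommendation:
    # Hypothetical view of the latest stored row for the entity and horizon.
    action: str
    mode: str
    confidence: float


def is_duplicate(
    previous: Optional[PreviousRecommendation],
    action: str,
    mode: str,
    confidence: float,
    tolerance: float = 0.01,
) -> bool:
    if previous is None:
        return False  # nothing stored yet, so always generate
    return (
        previous.action == action
        and previous.mode == mode
        and abs(previous.confidence - confidence) <= tolerance
    )


last = PreviousRecommendation("DEFER", "simulation_eligible", 0.62)
skip = is_duplicate(last, "DEFER", "simulation_eligible", 0.625)
```

Any change in action or mode, or a confidence shift larger than the tolerance, produces a new row.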

What Comes Next

At this point, the recommendation engine has translated trend assessments into concrete Recommendation objects — each with an action, execution mode, commitment sizing guideline, thesis, and risk classification — and persisted them alongside their evidence citations and eligibility audit trails. Recommendations marked as simulation_eligible or production_eligible are now available for the decision execution engine to consume. Page 6 — Decision Execution explains how the decision execution engine polls these recommendations, applies its own pre-execution check sequence (circuit breakers, execution windows, confidence gates, deduplication, declining commitments, and max open commitments), computes final commitment sizes with resource pool-level constraints, and submits execution requests through the execution adapter to the external execution API.
