diff --git a/ARCHITECTURE.md b/ARCHITECTURE.md index 1f3fb5c25b94eb3933e9bd37db2cbbf2d25a590e..5a5c82c326324bd64c00d30c6c2640a6b7ecbefa 100644 --- a/ARCHITECTURE.md +++ b/ARCHITECTURE.md @@ -63,6 +63,8 @@ DB vs tabular is **not** a routing concern — it's a per-source attribute (`sou ## 3. Routing model +> **Superseded 2026-06-18** — the 3-way `source_hint` below was reworked into a flat **6-intent** handler router (`chat`, `help`, `problem_statement`, `check`, `unstructured_flow`, `structured_flow`). Modality (structured vs unstructured *data*) is now the Planner's job, not the router's. See `ORCHESTRATOR_REWORK_PLAN.md`. + ``` source_hint ∈ { chat, unstructured, structured } ``` diff --git a/CHECKPOINT_PLAN_2026-06-17.md b/CHECKPOINT_PLAN_2026-06-17.md new file mode 100644 index 0000000000000000000000000000000000000000..db7d81082d386a1acdd5142397f8616df019f4d4 --- /dev/null +++ b/CHECKPOINT_PLAN_2026-06-17.md @@ -0,0 +1,147 @@ +# Checkpoint Plan — Wednesday, 17 June 2026 + +Working plan for Sofhia & Rifqi based on the checkpoint with mas Harry on **Thursday, 11 June 2026**. +Goal: everything below is **merged and demo-able before the next sync on Wednesday, 17 June (afternoon)**. + +**Updated at: Friday, 12 June 2026** (Sofhia + Rifqi) + +> Source of truth for decisions is the meeting itself. Note: the NotebookLM summary is **stale on two points** — Data Availability Check was *eliminated* as a tool, and Success Metrics was *folded into* the Problem Statement template. Do not build either as a standalone skill. + +--- + +## 0. Progress (per Fri 12 Jun — Sofhia) + +Dated snapshot of what landed this session. Live task status (incl. what's left) lives in §2 Ownership — this section only records the deltas + traceability. + +- ✅ **Tool matrix** built (xlsx, all ~10 tools + status colours) — presentation material ready. +- ✅ **Registry trimmed to 4 active analytics** (`KM-641`, commit `66e2e4d`): `ACTIVE_ANALYTICS_TOOLS` (descriptive, aggregate, correlation, trend) vs `DEFERRED_ANALYTICS_TOOLS` (comparison, contribution, profile, segment) — specs + compute fns kept, only registry exposure withheld. Tests 206 pass, ruff/mypy clean. +- ✅ **Planner few-shot synced**: Example A `analyze_contribution` → `analyze_aggregate` (so few-shots don't reference a deferred tool). +- ✅ **Data-access tools renamed** (`KM-642`, commit `c38c0c2`): `query_structured` → `data_retrieve`, `retrieve_documents` → `knowledge_retrieve` across the tool layer + planner stub/prompt/validator/few-shots. Mechanical, no behavior change. +- ✅ **`data_check` merge + `knowledge_check`** (`KM-643`, commit `4bd5f1e`): `list_sources` + `describe_source` → one parameterized `data_check` (no arg = list structured sources; `source_id` = schema) + new `knowledge_check` (unstructured). Tests 206 pass. +- ✅ **Redis Cloud live** (free tier, TTL = 1 h), env vars shared in the group (Rifqi). +- ✅ **Planner tool list verified** against the trimmed registry — no references to old tool names or deferred analytics anywhere in `src/` (Rifqi). +- 📌 **Decision:** `tests/` stays gitignored — team decided not to push tests to origin (closes PROGRESS.md R3 as won't-do). +- 📌 **Ownership:** Rifqi owns `generate_report` development + the `analysis_records` table / real `AnalysisStore` (contract still co-designed with Sofhia). +- ✅ **R5 cache fix** (Rifqi, `b701e95`): chat cache scoped by `user_id`, TTL 24h→1h. +- ✅ **AnalysisRecord persistence landed** (Rifqi): `stage` now flows to the record (CRISP-DM grouping for the report) + identity fields (`record_id`/`analysis_id`/`user_id`); `PostgresAnalysisStore` + `analysis_records` table replace `NullAnalysisStore`, wired into `ChatHandler`. Unblocks the `generate_report` renderer and the DoD "record persisted" step. Open: `analysis_id` handoff from Harry's Analysis State. +- ✅ **Verb-first tool naming** (Sofhia, commit `2d6406d`): the 4 data/knowledge tools renamed to lead with a verb — `data_check`→`check_data`, `knowledge_check`→`check_knowledge`, `data_retrieve`→`retrieve_data`, `knowledge_retrieve`→`retrieve_knowledge` (the `analyze_*` tools already lead with a verb). These verb-first names are now canonical; the tool-set table + §3 below use them. Dated log entries above keep the old names as historical record. + +--- + +## 1. Locked decisions (from the 2026-06-11 checkpoint) + +1. **Single chat page.** The separate interview/survey page is killed. Sidebar = Knowledge menu (connect/manage data) + Analysis menu (sessions). +2. **Data-first hard gate.** Creating a new analysis requires **≥ 1 bound data source** (server-side rejection, no empty sessions). User provides title + optional short description. +3. **Analysis State lives in the DB.** Per-analysis row: `user_id`, `data_source_ids[]`, `interview_status` (default `not_pass`), `report_status` (default `no_report` → `V1`, `V2`, …). Explicitly **NOT cached, NOT in Redis** — the Orchestrator reads it from Postgres every turn. +4. **Skills, not agents.** No separate interview agent. The Orchestrator routes per user turn using the Analysis State; an analytical request still executes through the existing Planner → TaskRunner → Assembler spine (static plan, no mid-run LLM). +5. **Interview = one skill: Problem Statement.** Success metrics become fields inside the PS template (what to increase/decrease + target). Data availability check is handled by the data-first creation gate + PS validation cross-checking fields against the bound catalog — not a separate tool. +6. **Analytics focus = 4 tools:** descriptive, aggregate, correlation, trend. The other four composites (comparison, contribution, profile, segment) are **deprioritized, not deleted** — keep the code, just don't register them. If "comparison" returns later it should be a proper statistical **test**, not a generic compare. +7. **`describe_source` merges into the listing tool** — one call returns sources *with* their schema/metadata, fewer tools for the planner. +8. **Report = on-demand, button-triggered (not a chat skill).** A dedicated "Generate Report" button in the Analysis menu calls a **report API** (not the chat route): trigger generation for a session, list its versions, fetch a version. Renders from accumulated **AnalysisRecords + the Problem Statement** — never from chat history. Each report is a **persisted, versioned artifact**: generation snapshots the record IDs it used and bumps `report_status` to `V`. (Owner: Rifqi, KM-644.) +9. **Help = deterministic guide.** No LLM: read Analysis State → tell the user the next required step. Callable in any state. +10. **Redis Cloud free tier, TTL = 1 hour**, env shared in the team group — for retrieval/query caching only, never for state. + +### Final tool set (~10) + +| Tool (canonical, verb-first) | Maps to (lineage) | Status | +|---|---|---| +| `check_knowledge` | new — list user's documents + metadata | done | +| `check_data` | `list_sources` + `describe_source` merged (catalog-backed) | done | +| `retrieve_knowledge` | `retrieve_documents` → `knowledge_retrieve` | done | +| `retrieve_data` | `query_structured` → `data_retrieve` (tabular: file + DB, both working) | done | +| `analyze_descriptive` | `src/tools/analytics/descriptive.py` | done | +| `analyze_aggregate` | `src/tools/analytics/aggregation.py` | done | +| `analyze_correlation` | `src/tools/analytics/relationship.py` | done | +| `analyze_trend` | `src/tools/analytics/temporal.py` | done | +| `problem_statement` | new — interview skill (**Harry**) | Harry | +| `generate_report` | new — on-demand, versioned | to design | +| `help` | new — deterministic state guide | to build | + +(`problem_statement` + `help` live at the orchestrator level; `generate_report` is **button-triggered via a dedicated report API**, not chat-routed (decision #8). The TaskRunner registry holds the 4 analytics + 4 data/knowledge tools. Unregister `analyze_comparison`, `analyze_contribution`, `analyze_profile`, `analyze_segment` from the planner-visible registry — keep the modules.) + +--- + +## 2. Ownership + +### Sofhia +- [x] 4 analytics tools: trim registry to 4 active, tests still pass after deprioritizing the other four. (`KM-641`, commit `66e2e4d`) +- [x] Data/knowledge tools: merge `describe_source` into `data_check`, rename `retrieve_documents` → `knowledge_retrieve`, `query_structured` → `data_retrieve`, build `knowledge_check`. (`KM-642` `c38c0c2`, `KM-643` `4bd5f1e`) +- [ ] Co-design `generate_report` contract with Rifqi (Rifqi owns development, see §3). +- [x] Tool matrix (see §4). + +### Rifqi +- [x] **Redis Cloud free tier** (~30–50 MB): create instance, set TTL = 1 h, share env vars in the group. (done 12 Jun) +- [x] **R5 cache fix**: chat cache key scoped by `user_id`, TTL 24h→1h (urgent on shared Redis). (12 Jun, commit `b701e95`) +- [x] **AnalysisRecord contract gaps closed**: `stage` (CRISP-DM) now flows Task→TaskResult→TaskSummary so the report can group the method appendix; `AnalysisRecord` gained `record_id`/`analysis_id`/`user_id` identity fields. (12 Jun) +- [x] **`analysis_records` table + real `AnalysisStore`**: `PostgresAnalysisStore` (save + `list_for_analysis`, never-throw) replaces `NullAnalysisStore`; wired into `ChatHandler`, `user_id` stamped at save. Satisfies the DoD "record persisted" step. (12 Jun) +- [ ] **Own `generate_report` development — KM-644 "Report Generator"** (contract co-designed with Sofhia, see §3). Button-triggered via a dedicated **report API** (trigger / list versions / fetch); reads `analysis_records` + Problem Statement; persists a versioned report artifact, bumps `report_status`. *(record persistence done above; report API + persistence + renderer + contract doc next)* +- [x] Verify planner tool list matches the trimmed registry (4 analytics + 4 data/knowledge) and few-shots don't reference removed tools. (verified 12 Jun — no stale tool names in `src/`) +- ⚠️ **Blocked-on-Harry**: `analysis_id` is `NULL` on persisted records until the Analysis State reaches the slow path — need the session-ID handoff so `generate_report` can group records per analysis. + +### Shared (Sofhia + Rifqi) +- [ ] `generate_report` design + skeleton: input = AnalysisRecords for the session + Problem Statement from Analysis State; output = versioned artifact; bumps `report_status`. Agree on the contract even if rendering is stubbed for Wednesday. (Development: Rifqi.) +- [ ] `help` skill: deterministic — read Analysis State, return the next required step. Small, do it together or whoever finishes first. +- [ ] Tool behavior smoke test end-to-end on an easy case (descriptive/aggregate path), per Harry's ask: "robust tools before agents." + +### Harry (dependencies — not ours, but we block on them) +- `problem_statement` skill + PS template (incl. increase/decrease target fields). +- Analysis State class + DB table, frontend analysis-builder step. +- Merging our PRs (he auto-merges; he clones from latest after). + +--- + +## 3. Per-tool behavior contract (how to build each one) + +Harry's framing: for every tool, define **goal / trigger / input / process / output**, and behave like a Claude-style skill — if a required argument is missing, respond with a polite feedback message asking for it (e.g. table/column name), never guess silently. + +- **`check_knowledge`** — "what documents do I have?" → list documents with name, type, uploaded-at. +- **`check_data`** — "what data do I have?" → sources (file + DB) with schema/metadata from the data catalog, created/uploaded timestamps. +- **`retrieve_knowledge`** — RAG over uploaded documents; returns passages with source attribution. +- **`retrieve_data`** — query tabular data (file + DB) via QueryIR; output consumable by the `analyze_*` tools. +- **`analyze_*` (4)** — require valid table/column references; if missing or wrong, return actionable feedback instead of guessing. +- **`generate_report`** — button-triggered via a dedicated report API (not chat-routed); on-demand only (never auto); post-pass gated; renders from AnalysisRecords + PS; persists a versioned artifact, snapshots record IDs, bumps version. (KM-644, Rifqi.) +- **`help`** — no LLM; state → next step. Repeating it is fine, that's its job. + +--- + +## 4. Tool matrix (deliverable for the sync) + +Harry explicitly asked for a matrix covering every tool. Produce one sheet/markdown table with columns: + +`tool | goal | trigger (when the orchestrator calls it) | input | process | output | gated by interview_status? | status (done / in progress / planned)` + +Use the tool set table in §1 as the row list. This doubles as the presentation material on Wednesday. + +--- + +## 5. Day-by-day + +| Day | Target | +|---|---| +| **Thu 11** | Checkpoint meeting + task split with Harry. | +| **Fri 12 (today)** | ✅ Registry trimmed to 4 analytics + few-shot synced (Sofhia, KM-641). ✅ Tool matrix built. ⏳ Redis Cloud + env share (Rifqi). | +| **Mon 15** | Data/knowledge tools done (`data_check` merge, renames, `knowledge_check`). `generate_report` contract agreed. | +| **Tue 16** | `help` skill done. `generate_report` skeleton wired to AnalysisRecord. Tool matrix drafted. End-to-end smoke test on the easy path. | +| **Wed 17 (AM)** | Buffer: fix fallout, finalize matrix, rehearse the demo flow. | +| **Wed 17 (PM)** | **Sync with Harry.** | + +--- + +## 6. Open questions to confirm with Harry on Wednesday + +1. **Gate scope.** Proposal: keep the fast path + exploration tools (`check_knowledge`, `check_data`, retrieves, `help`, arguably `descriptive`) available **pre-pass**; gate only the insight tools (correlation, trend, report). Hard-gating everything risks frustrating users who just want to look at their data. +2. **Who flips `interview_status` to `pass`?** Proposal: a deterministic validator (PS template slots complete + fields cross-checked against the bound catalog) makes the call — the LLM conducts the conversation but never decides the pass. ("Conversational skin, deterministic skeleton.") +3. **Skills vs spine — one sentence to lock in writing:** *"Skills are registry tools executed by the existing Planner → TaskRunner → Assembler spine; the Analysis State gate is a pre-check in the Orchestrator."* This keeps the new flow and the locked architecture fully compatible. +4. `generate_report` invocation goes through the same gate (post-pass only) — confirm. + +--- + +## 7. Definition of done for Wednesday + +- [ ] All team PRs merged; Harry unblocked on the Analysis State class. +- [ ] Registry exposes exactly 4 analytics + 4 data/knowledge tools, all passing local tests. +- [ ] Redis Cloud shared and working locally for all three of us (TTL 1 h). +- [ ] `help` works against a (possibly stubbed) Analysis State. +- [ ] `generate_report` contract written; skeleton callable. +- [ ] Tool matrix ready to present. +- [ ] One end-to-end happy path runs: create analysis (with data) → blocked pre-pass → interview stub passes → descriptive/aggregate answer → record persisted. diff --git a/PROGRESS.md b/PROGRESS.md index 82710e060647468fed6d48ca4268861cd707351e..52099d06d9979f4b1a543cd2c07bfedb7b106e2d 100644 --- a/PROGRESS.md +++ b/PROGRESS.md @@ -2,8 +2,32 @@ Persistent tracker mirroring the 42-item ownership table in `REPO_CONTEXT.md` "Team — division of work". Update as PRs land. Future Claude Code sessions read this to know what's already done. -**Last updated**: 2026-06-10 (tool layer complete + hardening/DRY + Langfuse tracing + gated slow-path wiring) -**Current open PR**: `pr/2` — active. +**Last updated**: 2026-06-12 (Redis Cloud live; R3 closed as won't-do; R5 cache fix; AnalysisRecord persistence landed — `PostgresAnalysisStore` + `analysis_records` table) +**Current open PR**: `pr/3` — active. + +--- + +## What just shipped (2026-06-12 — AnalysisRecord persistence, Rifqi) + +Groundwork for `generate_report`. The slow path now persists a real, citable +record; the report (next) renders from it. + +- **Contract gaps closed** (`agents/slow_path/schemas.py`): `stage: CrispStage` + added to `TaskResult` + `TaskSummary` and populated at all 3 `TaskResult` build + sites in `task_runner.py` + copied in `assembler._build_record` — so the report + can group its method appendix by CRISP-DM phase. `AnalysisRecord` gained identity: + `record_id` (auto uuid), `analysis_id`/`user_id` (optional; stamped at persist). +- **Real store** (`agents/slow_path/store.py`): `PostgresAnalysisStore` — + `save()` (never-throw, idempotent upsert) + `list_for_analysis()` (oldest-first, + the report's render order). `NullAnalysisStore` kept (tests / disabled persistence). + `AnalysisStore` Protocol gained `list_for_analysis`. +- **Table** (`db/postgres/models.py`): `analysis_records` jsonb table (one row per + run, indexed by `analysis_id` + `user_id`); registered in `init_db.py`, created by + `create_all` on startup (no migration — `data_catalog` precedent). +- **Wired** (`agents/chat_handler.py`): default store flipped to `PostgresAnalysisStore`; + `user_id` stamped onto the record at the save site (in scope there). +- **Open**: `analysis_id` is `NULL` until Harry's Analysis State reaches the slow + path (session-ID handoff needed to group records per analysis). --- @@ -39,12 +63,12 @@ Verified against code before logging. Severity: **critical** / important / nice- |---|---|---|---|---| | R1 | **AuthN/AuthZ** on data endpoints — reject body-supplied `user_id`/`room_id`, derive identity from a verified token. `/chat/stream` has none (`chat.py:40,128`); tenant isolation is client honesty. **CORRECTION to the review:** `security/auth.py` is a STUB (all `NotImplementedError`); the real JWT impl lives in `src/users/users.py` (`encode_jwt`/`decode_jwt`, HS, env-keyed) **but is unused** — `/login` (`api/v1/users.py`) returns the user profile as plain JSON and mints NO token. So R1 is cross-team: (1) `/login` must issue a JWT, (2) frontend must send it as `Bearer`, (3) data endpoints validate it. **Gates the engine-cache work (DB2).** | **critical** | DB/B + frontend | `[ ]` | | R2 | **Always compile a LIMIT** — `sql.py` now emits a bound for every query: explicit limit honored (clamped to `MAX_RESULT_ROWS=10000`), unbounded queries get `LIMIT cap+1` so an unbounded SELECT can't stream a whole table into memory. `CompiledSql.row_cap` carries the cap; `DbExecutor` caps + flags truncation from it (dropped its own `_ROW_HARD_CAP`). Tests updated (`test_sql.py`, +3 cases); `S608` restored to `tests/**` ruff ignore (was dropped). | **critical** | DB | `[x]` | -| R3 | **Commit `tests/` + minimal CI** — `tests/` is gitignored; the 200+ tests cited as done exist only on laptops (already caused rename rot). GitHub origin carries tests; HF Space gets the Docker build (already doesn't COPY tests). | **critical (process)** | shared | `[ ]` | +| R3 | **Commit `tests/` + minimal CI** — `tests/` is gitignored; the 200+ tests cited as done exist only on laptops (already caused rename rot). ~~GitHub origin carries tests; HF Space gets the Docker build.~~ **2026-06-12: team decided tests stay gitignored/local — closed as won't-do.** | **critical (process)** | shared | `[won't do]` | | DB1 | **In-memory `describe_source`** (request-scoped `MemoizingCatalogReader`, `reader.py`) + **LLM-client hoist** (shared module-level `ChatHandler` in `chat.py`). Measured live: `describe_source` 3.5s→~2.0s (structured read now served from the planner's cached snapshot; only the unstructured read remains a round-trip), catalog reads/request ~5→~2. External `query_structured` handshake unchanged (DB2's job) so total slow path is ~flat until DB2. Tests: `tests/catalog/test_reader.py`. | important | agent | `[x]` | | DB2 | **Keyed engine cache** — `src/database_client/engine.py::UserEngineCache` (process singleton): pooled engines keyed by `client_id + creds-hash` (rotation auto-invalidates), bounded LRU (50) + 600s idle TTL, `pool_pre_ping` + `pool_recycle=300`. `DbExecutor._run_sync` reuses the warm connection instead of `create_engine→connect→dispose` per query (postgres/supabase only; other db_types keep the legacy path — no regression). **Live-measured: warm `query_structured` 6.6–9.4s → ~2.5s** (the residual is the per-call catalog-DB client fetch + pre-ping, not the external handshake). **Finding:** Neon's transaction pooler REJECTS `default_transaction_read_only` as a libpq startup `option` — caught live; moved read-only + statement_timeout to a per-connection `connect` event (best-effort; authoritative read-only is the SELECT-only compiler + sqlglot guard, see R10). Per-request ownership/active check kept. Proceeded ahead of R1 per owner decision (marginal security delta over the existing no-auth state; auth tracked separately). Tests: `tests/database_client/test_engine.py`. First query/process still cold → DB3. | important | DB | `[x]` | | DB3 | **Speculative pre-connect** — `DbExecutor.prewarm(catalog, user_id)` warms the pooled engine for schema sources (fire-and-forget at slow-path entry) so the cold first-query handshake overlaps the ~4s Planner call. Best-effort, never raises; gated to the default path (skipped when a coordinator factory is injected). Verified live through `ChatHandler.handle`. | nice-to-have | DB | `[x]` | | R4 | **Per-stage progress events** — `SlowPathCoordinator.run` gained an optional `progress` callback; `ChatHandler` bridges it to SSE `status` events (`chat.py` forwards them). Live: stream now shows `Planning…`→`Running N steps…`→`Composing…` (max wire gap ~4.6s, was ~13s of silence) → fixes proxy idle-timeout + UX. **Deferred:** token-streaming the Assembler answer needs splitting it into a streamed prose call + a structured-record call — that doubles the Assembler LLM calls (cost/latency), so it's a separate decision; the answer is still emitted as one chunk after the (fast ~2.5s) Assembler. Test: `test_chat_handler_wiring.py`. | important | agent | `[~]` | -| R5 | **Response cache**: key on `user_id` + catalog version; invalidate on ingest. Today `chat:{room_id}:{message}`, 24h TTL, no user (`chat.py:138`) → cross-room replay + stale answers. | important | B | `[ ]` | +| R5 | **Response cache**: key on `user_id` + catalog version; invalidate on ingest. Was `chat:{room_id}:{message}`, 24h TTL, no user → cross-user replay + stale answers. **2026-06-12 (Rifqi):** key now `chat:{room_id}:{user_id}:{message}` via `_chat_cache_key()`, TTL 24h→1h (checkpoint decision) — urgent now that Redis is a shared Cloud instance. `DELETE /chat/cache` gained a required `user_id` param (frontend heads-up); room-wide clear pattern unchanged. **Still open:** catalog-version in key / invalidate-on-ingest. | important | B | `[~]` | | R6 | **Hard time budget** — wrap `coordinator.run()` in `asyncio.wait_for` (60–90s). `Constraints.time_budget_seconds` is rendered but not enforced. | important | agent | `[ ]` | | R7 | **Root-task-failure short-circuit** before the Assembler (templated/fast-path fallback, NOT replanning) — stops paying ~2k tok to narrate an empty RunState. | important | agent | `[ ]` | | R8 | **Catalog upsert race** — per-user advisory lock around read-merge-upsert (`store.py`); concurrent uploads can drop a source. | important | DB | `[ ]` | @@ -129,7 +153,8 @@ LLM tokens; verified live to US Cloud. wires Pattern A correctly; self-corrects via retry. **Open follow-ups:** real `BusinessContext` (lead); create `analysis_records` table + -real `AnalysisStore`; register data-access `ToolSpec`s upstream (`data_access_registry()`) +real `AnalysisStore` (**Rifqi owns, 2026-06-12** — folded into `generate_report` work, +see `CHECKPOINT_PLAN_2026-06-17.md`); register data-access `ToolSpec`s upstream (`data_access_registry()`) or keep the planner stub; 4o → GPT-mini deployment swap; flip `enable_slow_path` on once `BusinessContext` is real. NOTE: 3 test files pre-existing broken from rename rot (`test_chat_handler.py`, `test_intent_router.py`, `test_answer_agent.py` import the old @@ -396,8 +421,9 @@ New scope after the original 42-item table; added as the tool layer landed (KM-6 | — | Tool contracts (`tools/contracts.py`) | TAB | `[x]` | KM-627 — canonical `ToolSpec` / `ToolRegistry` / `ToolOutput`. `agents/planner/contracts.py` re-exports them (+ keeps the lead's `BusinessContext` stub). | | — | Analytics registry (`tools/registry.py`) | TAB | `[x]` | KM-628 — `analytics_registry()`. `analyze_descriptive.required` = `["data","column_ids"]` (aligned to compute signature, commit 4bb7623). | | — | Invoker layer (`tools/invoker.py`) | TAB | `[x]` | KM-629 — `AnalyticsToolInvoker` (Pattern A: `analyze_*` take a `data` `${t}` placeholder from upstream `query_structured`; `_materialize` → DataFrame, `_coerce_decimals` covers the whole family) + `CompositeToolInvoker` (routes data-access vs analytics by name). | -| — | Data-access tools (`tools/data_access.py`) | TAB | `[x]` | KM-630 — `DataAccessToolInvoker`: `list_sources` / `describe_source` / `query_structured` / `retrieve_documents`. Per-request DI (`user_id` + `CatalogReader`). `query_structured` calls `IRValidator` + `ExecutorDispatcher` (planner skipped — IR pre-built by the agent Planner). | +| — | Data-access tools (`tools/data_access.py`) | TAB | `[x]` | KM-630 — `DataAccessToolInvoker`: `list_sources` / `describe_source` / `query_structured` / `retrieve_documents`. Per-request DI (`user_id` + `CatalogReader`). `query_structured` calls `IRValidator` + `ExecutorDispatcher` (planner skipped — IR pre-built by the agent Planner). **Superseded by KM-642/643** — renamed `data_retrieve`/`knowledge_retrieve` and `list_sources`+`describe_source` merged into `data_check` + new `knowledge_check`; see row below. | | — | Tool tests (`tests/unit/tools/`) | TAB | `[x]` | analytics + data-access + invoker tests (gitignored). Incl. regression `test_decimal_columns_coerced_for_analyze_contribution`. | +| — | Data/knowledge tool taxonomy (`tools/data_access.py`) | TAB | `[x]` | KM-642/643 (commits c38c0c2, 4bd5f1e) — renamed `query_structured`→`data_retrieve`, `retrieve_documents`→`knowledge_retrieve`; merged `list_sources`+`describe_source` → parameterized `data_check` (no arg = list structured sources; `source_id` = that source's schema) + new `knowledge_check` (unstructured/documents). Split mirrors the catalog's structured/unstructured slices. Planner stub/prompt/validator/few-shots synced; `DATA_ACCESS_TOOLS` guard kept in lockstep. Note: dated log entries above (e.g. the 2026-06-09 E2E) keep the old names as historical record. | ### API surface diff --git a/eval/__init__.py b/eval/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/eval/intent/README.md b/eval/intent/README.md new file mode 100644 index 0000000000000000000000000000000000000000..35a5825d4c9b7ed1ccb59c7ab2254fdc06367e39 --- /dev/null +++ b/eval/intent/README.md @@ -0,0 +1,70 @@ +# Intent-Routing Eval (E3) + +Scores the live 6-intent router (`OrchestratorAgent.classify`) against a golden +dataset of labelled messages. Run it before any deploy that touches the router +prompt (`src/config/prompts/intent_router.md`) or its few-shots. + +## Files + +| File | What | +|---|---| +| `intent_dataset.json` | **Golden dataset** — `message` + known-correct `expected_intent` per case. The source of truth scoring compares against. | +| `run_eval.py` | Runner — calls the router per case, scores correctness, records latency + tokens. | +| `results/` | Timestamped run reports, one JSON per run (never overwritten). | + +## Run + +Run as a module (`-m`), not the file path — module mode puts the repo root on +`sys.path` so `src` imports resolve; `python eval/intent/run_eval.py` fails. + +```bash +uv run python -m eval.intent.run_eval # full dataset +uv run python -m eval.intent.run_eval --limit 6 # quick smoke test +uv run python -m eval.intent.run_eval --langfuse # also stream traces to Langfuse +``` + +Needs a populated `.env` (Azure OpenAI) — it calls the live model and spends +tokens. Output: a per-case detail table + an aggregate summary in the terminal, +and `results/eval_result_.json`. + +**Tracking is the committed result files, not Langfuse** — the JSON reports in +`results/` are the versionable audit trail (see below). `--langfuse` is an +*optional* extra: when set, each case is also sent as a Langfuse trace (grouped +under one `intent_eval_` session) with a `intent_correct` 1/0 score, so the +same run is browsable in the Langfuse dashboard. It is off by default and the +eval runs fully without Langfuse configured. + +## What's measured + +- **correctness** — overall + per-intent + per-language accuracy (`got == expected`) +- **runtime** — average ms per case +- **tokens** — input / output / total (read from the model response, no Langfuse) + +## Commit convention for `results/` + +The reports are **versionable**, not a scratch log: + +- **Do commit** a result after a meaningful change — e.g. a new + `intent_router.md` version, or new dataset cases. The new timestamped file + *adds* to the history; old files are never replaced. This is how we answer + "did accuracy improve after prompt v2?" — diff two committed result files. +- **Don't commit** throwaway runs while iterating. Just leave them unstaged or + delete them. + +So the audit trail = prompt versions (in `src/config/prompts/`) lined up against +the committed result files here. + +## Dataset notes + +- 6 intents: `chat`, `help`, `problem_statement`, `check`, `unstructured_flow`, + `structured_flow`. Each has 6-7 **distinct** scenarios (not EN/ID translation + pairs), balanced across English + Indonesian. +- `carried_over: true` rows mirror the pre-rework `intent_router.md` examples + (regression). `lang` enables per-language scoring. `id` is a stable handle for + diffing the same case across runs. +- Routing labels are decided from the question **phrasing**, not from which file + holds the answer (the router has no catalog access). See the `_grounding` note + in `intent_dataset.json`. +- Owner: Rifqi (structured/DB-grounded rows) + Sofhia (unstructured/document + + tabular-file rows). Merge both into this one file. +``` diff --git a/eval/intent/__init__.py b/eval/intent/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/eval/intent/intent_dataset.json b/eval/intent/intent_dataset.json new file mode 100644 index 0000000000000000000000000000000000000000..5d362e5549efc098d1a7280ef73d4a901eaac629 --- /dev/null +++ b/eval/intent/intent_dataset.json @@ -0,0 +1,56 @@ +{ + "_about": "Golden intent dataset for the reworked 6-intent router (E3 eval — runs against the live LLM, not the mocked unit tests). Each of the 6 intents has 6-7 DISTINCT scenarios (not EN/ID translation pairs of one scenario) balanced across English + Indonesian, since users code-switch (technical terms often stay English: revenue, churn, upload, base_price). `carried_over` rows mirror the old intent_router.md examples for regression; the rest are new. `id` is a stable per-case handle so timestamped run files (eval_result_.json) can be diffed case-by-case across runs — it is NOT the run timestamp (that lives in the filename). `lang` enables per-language correctness scoring (aggregate, not matched-pairs).", + "_grounding": "structured_flow + structured `check` rows are grounded in Sofhia's real test files: online_vs_offline_learning_dataset.csv (cols: Learning_Mode, Subject, Study_Hours, Retention_Score, Focus_Level, Exam_Score) and xl_knowledge_large.xlsx (telco product catalog: product_name, category, base_price, is_active, region_restriction...). unstructured_flow + document `check` rows are grounded in the IoT connectivity indicative-price PDF and the Internet of Things DOCX. IMPORTANT: routing labels are decided from the question PHRASING, not from which file holds the answer (the router has no catalog access) — so structured rows are clearly analytical (avg/correlation/count) and unstructured rows are clearly explanatory (jelaskan/summarize/features), with no price-lookup collisions. Rifqi's DB-grounded rows can be merged into this same file for variety; owner: Rifqi + Sofhia.", + "_next_layer": "Not in this seed: (1) deliberate near-boundary cases (chat-vs-help, check-vs-structured); (2) follow-up/contextual handling — decide whether follow-ups route to `chat` or are out of scope.", + "schema": { + "id": "stable per-case handle, _", + "message": "the user utterance fed to the router", + "expected_intent": "one of: chat | help | problem_statement | check | unstructured_flow | structured_flow", + "lang": "en | id", + "carried_over": "true if mirrored from the pre-rework intent_router.md examples" + }, + "cases": [ + { "id": "chat_01", "message": "Hi", "expected_intent": "chat", "lang": "en", "carried_over": true }, + { "id": "chat_02", "message": "Bye, thanks", "expected_intent": "chat", "lang": "en", "carried_over": true }, + { "id": "chat_03", "message": "What can you do?", "expected_intent": "chat", "lang": "en", "carried_over": true }, + { "id": "chat_04", "message": "Kamu bisa ngerti bahasa Indonesia gk?", "expected_intent": "chat", "lang": "id", "carried_over": false }, + { "id": "chat_05", "message": "Test, kebaca gak?", "expected_intent": "chat", "lang": "id", "carried_over": false }, + { "id": "chat_06", "message": "Oh paham2", "expected_intent": "chat", "lang": "id", "carried_over": false }, + + { "id": "help_01", "message": "Okay I uploaded my data, what do I do next?", "expected_intent": "help", "lang": "en", "carried_over": false }, + { "id": "help_02", "message": "How does this work, where should I start?", "expected_intent": "help", "lang": "en", "carried_over": false }, + { "id": "help_03", "message": "How do I connect my database to this?", "expected_intent": "help", "lang": "en", "carried_over": false }, + { "id": "help_04", "message": "Setelah analisis selesai, aku bisa ngapain lagi?", "expected_intent": "help", "lang": "id", "carried_over": false }, + { "id": "help_05", "message": "Aku harus upload file dulu atau connect database dulu atau bisa langsung tanpa keduanya?", "expected_intent": "help", "lang": "id", "carried_over": false }, + { "id": "help_06", "message": "Cara bikin report-nya gimana deh?", "expected_intent": "help", "lang": "id", "carried_over": false }, + + { "id": "ps_01", "message": "I want to reduce customer churn next quarter, target under 5%.", "expected_intent": "problem_statement", "lang": "en", "carried_over": false }, + { "id": "ps_02", "message": "My goal is to improve online students' exam scores this semester.", "expected_intent": "problem_statement", "lang": "en", "carried_over": false }, + { "id": "ps_03", "message": "We need to figure out which product categories to push next year.", "expected_intent": "problem_statement", "lang": "en", "carried_over": false }, + { "id": "ps_04", "message": "Aku mau tau faktor apa yg paling ngaruh ke retention score siswa.", "expected_intent": "problem_statement", "lang": "id", "carried_over": false }, + { "id": "ps_05", "message": "Tujuanku naikin penjualan produk prepaid kuartal depan.", "expected_intent": "problem_statement", "lang": "id", "carried_over": false }, + { "id": "ps_06", "message": "Aku pengen fokus benahin paket internet yang kurang laku di luar Jawa.", "expected_intent": "problem_statement", "lang": "id", "carried_over": false }, + + { "id": "check_01", "message": "What data do I have?", "expected_intent": "check", "lang": "en", "carried_over": false }, + { "id": "check_02", "message": "What columns are in the online vs offline learning dataset?", "expected_intent": "check", "lang": "en", "carried_over": false }, + { "id": "check_03", "message": "Is the IoT connectivity pricing PDF already uploaded?", "expected_intent": "check", "lang": "en", "carried_over": false }, + { "id": "check_04", "message": "Kolom di tabel product master list apa aja?", "expected_intent": "check", "lang": "id", "carried_over": false }, + { "id": "check_05", "message": "Dokumen apa aja yang udh aku upload?", "expected_intent": "check", "lang": "id", "carried_over": false }, + { "id": "check_06", "message": "Sumber dataku yang berupa database yg mana aja?", "expected_intent": "check", "lang": "id", "carried_over": false }, + + { "id": "unstructured_01", "message": "apa key feature dari iot connectivity?", "expected_intent": "unstructured_flow", "lang": "id", "carried_over": true }, + { "id": "unstructured_02", "message": "Jelaskan tentang Internet of Things.", "expected_intent": "unstructured_flow", "lang": "id", "carried_over": false }, + { "id": "unstructured_03", "message": "Menurut dokumen IoT connectivity, paket apa aja yang ditawarkan?", "expected_intent": "unstructured_flow", "lang": "id", "carried_over": false }, + { "id": "unstructured_04", "message": "What pricing tiers are in the IoT connectivity document?", "expected_intent": "unstructured_flow", "lang": "en", "carried_over": false }, + { "id": "unstructured_05", "message": "Summarize the key points from the IoT connectivity pricing document.", "expected_intent": "unstructured_flow", "lang": "en", "carried_over": false }, + { "id": "unstructured_06", "message": "What use cases of IoT are mentioned in the document?", "expected_intent": "unstructured_flow", "lang": "en", "carried_over": false }, + + { "id": "structured_01", "message": "How many orders did we get last month?", "expected_intent": "structured_flow", "lang": "en", "carried_over": true }, + { "id": "structured_02", "message": "Top 5 customers by revenue this year", "expected_intent": "structured_flow", "lang": "en", "carried_over": true }, + { "id": "structured_03", "message": "What's the average exam score per learning mode?", "expected_intent": "structured_flow", "lang": "en", "carried_over": false }, + { "id": "structured_04", "message": "Is there a correlation between study hours and exam score?", "expected_intent": "structured_flow", "lang": "en", "carried_over": false }, + { "id": "structured_05", "message": "Rata-rata base price per kategori produk berapa?", "expected_intent": "structured_flow", "lang": "id", "carried_over": false }, + { "id": "structured_06", "message": "Ada berapa produk yang masih aktif per kategori?", "expected_intent": "structured_flow", "lang": "id", "carried_over": false }, + { "id": "structured_07", "message": "Bandingin retention score antara siswa online sama offline.", "expected_intent": "structured_flow", "lang": "id", "carried_over": false } + ] +} diff --git a/eval/intent/results/.gitkeep b/eval/intent/results/.gitkeep new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/eval/intent/run_eval.py b/eval/intent/run_eval.py new file mode 100644 index 0000000000000000000000000000000000000000..1a22d351ed998e3ede73bd9a40bf909d53bd87e9 --- /dev/null +++ b/eval/intent/run_eval.py @@ -0,0 +1,384 @@ +"""Intent-routing eval runner (E3). + +Feeds each golden case in `intent_dataset.json` to the live 6-intent router +(`OrchestratorAgent.classify`), then scores correctness + records latency and +token usage. Prints a per-case detail table and an aggregate summary, and +writes a timestamped JSON report under `results/` (never overwritten — one file +per run, so runs can be diffed over time). + +Run before every deploy that touches the router prompt or its few-shots. +Invoke as a module (`-m`) so the repo root is on `sys.path` and `src` imports +resolve — running the file path directly (`python eval/intent/run_eval.py`) +puts only `eval/intent/` on the path and fails: + + uv run python -m eval.intent.run_eval + uv run python -m eval.intent.run_eval --limit 6 # quick smoke test + +Tokens come straight from the model response (LangChain `usage_metadata` via a +callback) — no Langfuse needed. The router is called unmodified: it already +accepts a `callbacks=` list and forwards it into the chain config. +""" + +from __future__ import annotations + +import argparse +import asyncio +import json +import statistics +import time +from dataclasses import asdict, dataclass +from datetime import datetime +from pathlib import Path +from typing import Any + +from langchain_core.callbacks import BaseCallbackHandler +from langchain_core.outputs import LLMResult + +from src.agents.orchestration import OrchestratorAgent + +_HERE = Path(__file__).resolve().parent +DATASET = _HERE / "intent_dataset.json" +RESULTS_DIR = _HERE / "results" + +INTENTS = [ + "chat", + "help", + "problem_statement", + "check", + "unstructured_flow", + "structured_flow", +] + +# Short labels so the EXPECT->GOT column stays narrow in the detail table. +_ABBR = { + "chat": "chat", + "help": "help", + "problem_statement": "prob_stmt", + "check": "check", + "unstructured_flow": "unstruct", + "structured_flow": "structF", +} + + +class _UsageCollector(BaseCallbackHandler): + """Sums token usage across the LLM calls made during one classify(). + + Reads `usage_metadata` off each returned message (the canonical LangChain + field), falling back to `llm_output['token_usage']` for providers that only + populate the legacy field. + """ + + def __init__(self) -> None: + self.input_tokens = 0 + self.output_tokens = 0 + self.total_tokens = 0 + + def on_llm_end(self, response: LLMResult, **kwargs: Any) -> None: + before = self.total_tokens + for generation_list in response.generations: + for generation in generation_list: + message = getattr(generation, "message", None) + usage = getattr(message, "usage_metadata", None) if message else None + if usage: + self.input_tokens += usage.get("input_tokens", 0) + self.output_tokens += usage.get("output_tokens", 0) + self.total_tokens += usage.get("total_tokens", 0) + if self.total_tokens == before and response.llm_output: + usage = response.llm_output.get("token_usage") or {} + self.input_tokens += usage.get("prompt_tokens", 0) + self.output_tokens += usage.get("completion_tokens", 0) + self.total_tokens += usage.get("total_tokens", 0) + + @property + def tokens(self) -> dict[str, int]: + return { + "input": self.input_tokens, + "output": self.output_tokens, + "total": self.total_tokens, + } + + +@dataclass +class CaseResult: + id: str + lang: str + message: str + expected: str + got: str + correct: bool + latency_ms: int + tokens: dict[str, int] + + +def load_cases(path: Path) -> list[dict[str, Any]]: + """Read the `cases` array, skipping the leading `_*` doc keys and `schema`.""" + data = json.loads(path.read_text(encoding="utf-8")) + return list(data["cases"]) + + +@dataclass +class _LangfuseCtx: + """Optional Langfuse sink — one session groups all cases of a run.""" + + session_id: str + client: Any + + +def _new_langfuse_handler(lf_ctx: _LangfuseCtx, case: dict[str, Any]) -> Any: + """Per-case LangChain callback so each trace carries the case's labels.""" + from langfuse.callback import CallbackHandler + + from src.config.settings import settings + + return CallbackHandler( + public_key=settings.LANGFUSE_PUBLIC_KEY, + secret_key=settings.LANGFUSE_SECRET_KEY, + host=settings.LANGFUSE_HOST, + session_id=lf_ctx.session_id, + trace_name=f"intent_eval/{case['id']}", + metadata={ + "case_id": case["id"], + "expected": case["expected_intent"], + "lang": case["lang"], + }, + tags=["intent-eval", case["expected_intent"], case["lang"]], + ) + + +def _score_langfuse(lf_ctx: _LangfuseCtx, handler: Any, result: CaseResult) -> None: + """Attach a 1/0 correctness score to the case's trace. Best-effort.""" + try: + lf_ctx.client.score( + trace_id=handler.get_trace_id(), + name="intent_correct", + value=1 if result.correct else 0, + comment=f"{result.expected} -> {result.got}", + ) + except Exception: # noqa: BLE001, S110 — scoring must never break the run + pass + + +async def run_case( + agent: OrchestratorAgent, + case: dict[str, Any], + lf_ctx: _LangfuseCtx | None = None, +) -> CaseResult: + """Classify one message; never throws — a failed call is recorded as ERROR.""" + collector = _UsageCollector() + callbacks: list[Any] = [collector] + lf_handler = _new_langfuse_handler(lf_ctx, case) if lf_ctx else None + if lf_handler is not None: + callbacks.append(lf_handler) + + start = time.perf_counter() + got: str + try: + decision = await agent.classify(case["message"], callbacks=callbacks) + got = decision.intent + except Exception as exc: # noqa: BLE001 — one bad case shouldn't kill the run + got = f"ERROR:{type(exc).__name__}" + latency_ms = round((time.perf_counter() - start) * 1000) + + result = CaseResult( + id=case["id"], + lang=case["lang"], + message=case["message"], + expected=case["expected_intent"], + got=got, + correct=got == case["expected_intent"], + latency_ms=latency_ms, + tokens=collector.tokens, + ) + if lf_ctx is not None and lf_handler is not None: + _score_langfuse(lf_ctx, lf_handler, result) + return result + + +def _group_accuracy(results: list[CaseResult], key: str) -> dict[str, dict[str, Any]]: + out: dict[str, dict[str, Any]] = {} + keys = INTENTS if key == "expected" else sorted({getattr(r, key) for r in results}) + for k in keys: + sub = [r for r in results if getattr(r, key) == k] + if not sub: + continue + passed = sum(r.correct for r in sub) + out[k] = { + "n": len(sub), + "passed": passed, + "accuracy": round(passed / len(sub), 3), + } + return out + + +def summarize(results: list[CaseResult]) -> dict[str, Any]: + n = len(results) + passed = sum(r.correct for r in results) + latencies = [r.latency_ms for r in results] + tok_in = sum(r.tokens["input"] for r in results) + tok_out = sum(r.tokens["output"] for r in results) + tok_total = sum(r.tokens["total"] for r in results) + return { + "total": n, + "passed": passed, + "accuracy": round(passed / n, 3) if n else 0.0, + "runtime_avg_ms": round(statistics.mean(latencies)) if latencies else 0, + "runtime_total_s": round(sum(latencies) / 1000, 1), + "tokens": { + "input": tok_in, + "output": tok_out, + "total": tok_total, + "avg_total_per_case": round(tok_total / n) if n else 0, + }, + "by_intent": _group_accuracy(results, "expected"), + "by_lang": _group_accuracy(results, "lang"), + } + + +def _truncate(text: str, width: int) -> str: + text = text.replace("\n", " ") + return text if len(text) <= width else text[: width - 3] + "..." + + +def format_table(results: list[CaseResult]) -> str: + header = ( + f"{'ID':<15} {'L':<3} {'QUESTION':<40} " + f"{'EXPECT->GOT':<20} {'OK':<3} {'MS':>5} {'TOK':>6}" + ) + rule = "-" * len(header) + lines = [rule, header, rule] + for r in results: + exp_got = f"{_ABBR.get(r.expected, r.expected)}->{_ABBR.get(r.got, r.got)}" + ok = "ok" if r.correct else "X" + lines.append( + f"{r.id:<15} {r.lang:<3} {_truncate(r.message, 40):<40} " + f"{_truncate(exp_got, 20):<20} {ok:<3} {r.latency_ms:>5} {r.tokens['total']:>6}" + ) + lines.append(rule) + return "\n".join(lines) + + +def format_summary(summary: dict[str, Any], results: list[CaseResult]) -> str: + lines = ["SUMMARY"] + lines.append( + f" Overall {summary['passed']}/{summary['total']} correct" + f" ({summary['accuracy'] * 100:.1f}%)" + ) + lines.append( + f" Runtime avg {summary['runtime_avg_ms']} ms" + f" | total {summary['runtime_total_s']} s" + ) + tok = summary["tokens"] + lines.append( + f" Tokens avg {tok['avg_total_per_case']}" + f" | total {tok['total']} (in {tok['input']} / out {tok['output']})" + ) + lines.append("") + lines.append(" By intent") + for intent, m in summary["by_intent"].items(): + lines.append( + f" {intent:<18} {m['passed']}/{m['n']} {m['accuracy'] * 100:.0f}%" + ) + lines.append(" By language") + for lang, m in summary["by_lang"].items(): + lines.append( + f" {lang:<18} {m['passed']}/{m['n']} {m['accuracy'] * 100:.0f}%" + ) + failures = [r for r in results if not r.correct] + lines.append("") + lines.append(f" FAILURES ({len(failures)})") + for r in failures: + lines.append(f" {r.id:<14} [{r.lang}] {r.expected:<12} -> {r.got}") + return "\n".join(lines) + + +def build_report( + results: list[CaseResult], summary: dict[str, Any], meta: dict[str, Any] +) -> dict[str, Any]: + run = {**meta, **{k: summary[k] for k in ( + "total", "passed", "accuracy", "runtime_avg_ms", "runtime_total_s", "tokens" + )}} + return { + "run": run, + "by_intent": summary["by_intent"], + "by_lang": summary["by_lang"], + "cases": [asdict(r) for r in results], + } + + +def _model_name() -> str: + try: + from src.config.settings import settings + + return str(settings.azureai_deployment_name_4o) + except Exception: # noqa: BLE001 — meta only; .env may be absent + return "gpt-4o" + + +async def main() -> None: + parser = argparse.ArgumentParser(description="Intent-routing eval (E3)") + parser.add_argument("--dataset", type=Path, default=DATASET) + parser.add_argument("--limit", type=int, default=0, help="run first N cases only") + parser.add_argument("--prompt-version", default="intent_router.md") + parser.add_argument("--no-table", action="store_true", help="skip the detail table") + parser.add_argument( + "--langfuse", action="store_true", + help="also send each case as a Langfuse trace + correctness score", + ) + args = parser.parse_args() + + cases = load_cases(args.dataset) + if args.limit: + cases = cases[: args.limit] + + started = datetime.now() + print(f"Intent Routing Eval -- {started:%Y-%m-%d %H:%M:%S}") + print(f"dataset: {args.dataset.name} ({len(cases)}) model: {_model_name()} " + f"prompt: {args.prompt_version}") + + lf_ctx: _LangfuseCtx | None = None + if args.langfuse: + try: + from src.observability.langfuse.langfuse import get_langfuse + + lf_ctx = _LangfuseCtx( + session_id=f"intent_eval_{started:%Y%m%d_%H%M%S}", + client=get_langfuse(), # type: ignore[no-untyped-call] + ) + print(f"langfuse: enabled (session {lf_ctx.session_id})") + except Exception as exc: # noqa: BLE001 — Langfuse is optional + print(f"langfuse: disabled ({type(exc).__name__}: {exc})") + + agent = OrchestratorAgent() + results: list[CaseResult] = [] + for case in cases: + results.append(await run_case(agent, case, lf_ctx)) + + if lf_ctx is not None: + try: + lf_ctx.client.flush() + except Exception: # noqa: BLE001, S110 — flush failure shouldn't fail the run + pass + + summary = summarize(results) + if not args.no_table: + print(format_table(results)) + print(format_summary(summary, results)) + + meta = { + "timestamp": started.isoformat(timespec="seconds"), + "dataset": args.dataset.name, + "model": _model_name(), + "prompt_version": args.prompt_version, + "langfuse_session": lf_ctx.session_id if lf_ctx else None, + } + report = build_report(results, summary, meta) + RESULTS_DIR.mkdir(parents=True, exist_ok=True) + out_path = RESULTS_DIR / f"eval_result_{started:%Y-%m-%d_%H%M%S}.json" + out_path.write_text( + json.dumps(report, ensure_ascii=False, indent=2), encoding="utf-8" + ) + print(f"\n-> saved: {out_path.relative_to(_HERE.parent.parent)}") + + +if __name__ == "__main__": + asyncio.run(main()) diff --git a/eval/readiness/README.md b/eval/readiness/README.md new file mode 100644 index 0000000000000000000000000000000000000000..c1537630b94d016f783e32e39549723d6d7bc0d2 --- /dev/null +++ b/eval/readiness/README.md @@ -0,0 +1,34 @@ +# Report-readiness eval + +Scores the deterministic `is_report_ready` signal (`src/agents/report/readiness.py`) +that the Help skill consumes to decide whether to nudge the user toward generating a +report. No LLM, no DB — each golden case declares an analysis state + a set of +persisted records/reports, and the runner feeds them through `is_report_ready` via +injectable fake stores. + +## Run + +```bash +uv run python -m eval.readiness.run_eval +uv run python -m eval.readiness.run_eval --limit 5 # smoke test +uv run python -m eval.readiness.run_eval --no-table # summary only +``` + +Each run writes a timestamped `results/readiness_result_.json` (never +overwritten, diffable across runs). + +## What it measures + +- **Floor correctness** — exact `ready` + `missing` for the deterministic floor + (validated goal · ≥1 substantive record · delta-since-report). Should sit at ~100%; + this is the regression guard as criteria evolve. +- **Alignment gap** — `alignment` cases have substantive records (floor says + `ready=true`) but `aligned=false`: the analyses don't address the problem statement. + The floor can't see this. The gap count is the evidence for/against adding the + deferred LLM-judge — "ship the floor, earn the judge." + +## Dataset + +`readiness_dataset.json` — groups: `floor`, `delta`, `edge` (doc-only product +question), `alignment`. See the `_about` / `_alignment` doc keys in the file. The +`aligned` label is a semantic judgment; owner: Rifqi (report semantics) + Sofhia. diff --git a/eval/readiness/__init__.py b/eval/readiness/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/eval/readiness/readiness_dataset.json b/eval/readiness/readiness_dataset.json new file mode 100644 index 0000000000000000000000000000000000000000..52f6a678f3df46373f3337fe37623cf90d5bf00b --- /dev/null +++ b/eval/readiness/readiness_dataset.json @@ -0,0 +1,40 @@ +{ + "_about": "Golden dataset for the report-readiness signal (`src/agents/report/readiness.is_report_ready`). Deterministic (no LLM): each case declares an analysis state + a set of persisted AnalysisRecords/reports, and the runner feeds them through is_report_ready via injectable fake stores, scoring the boolean `ready` AND the `missing` gaps. Floor cases should score ~100% (regression value). The `alignment` group probes the deferred LLM-judge — see _alignment.", + "_floor": "is_report_ready's deterministic floor: (1) problem_validated, (2) >=1 SUBSTANTIVE record, (3) delta-since-report. SUBSTANTIVE (KM-652 fix T1) = a record whose ANALYSIS task succeeded: tasks_run contains a task with status=success AND an analyze_* tool. A failed analysis still persists a record WITH findings (narrating the failure) and its data-access tasks (check_/retrieve_) succeed — so neither 'has findings' nor 'any task succeeded' counts. Only a successful analyze_* does.", + "_records": "records[].analysis = 'success' (analyze_* succeeded → substantive) | 'failure' (analyze_* failed, data-access still succeeded — the real e2e case, NOT substantive) | 'none' (only check_/retrieve_ succeeded, no analyze task — NOT substantive; guards the 'any task succeeded' trap). records[].findings = count (a failure run still has findings; floor ignores them now). records[].age_min / reports[].age_min = minutes ago (smaller = newer).", + "_alignment": "ALIGNMENT cases: a successful analysis (floor says ready=true) but `aligned=false` means it doesn't address the problem statement — a human would say NOT ready. Scored floor-correct, counted separately as the 'alignment gap' = evidence for/against the LLM-judge. Alignment label owner: Rifqi (report semantics) + Sofhia.", + "schema": { + "id": "stable per-case handle, _", + "group": "floor | delta | edge | alignment", + "problem_validated": "bool", + "report_id": "null = never generated; a string = a report exists", + "records": "[{ analysis: success|failure|none, findings: int, age_min: int }]", + "reports": "[{ age_min: int }] (only meaningful when report_id set)", + "aligned": "bool — do the analyses address the problem statement? (floor ignores this)", + "expected_ready": "what the deterministic floor SHOULD return", + "expected_missing": "subset of [problem, analysis, delta]", + "note": "human-readable description" + }, + "cases": [ + { "id": "floor_01", "group": "floor", "problem_validated": false, "report_id": null, "records": [], "reports": [], "aligned": false, "expected_ready": false, "expected_missing": ["problem", "analysis"], "note": "new analysis: no validated goal and no records" }, + { "id": "floor_02", "group": "floor", "problem_validated": false, "report_id": null, "records": [{ "analysis": "success", "findings": 2, "age_min": 30 }], "reports": [], "aligned": true, "expected_ready": false, "expected_missing": ["problem"], "note": "has a successful analysis but goal not validated (isolates the problem gap)" }, + { "id": "floor_03", "group": "floor", "problem_validated": true, "report_id": null, "records": [], "reports": [], "aligned": false, "expected_ready": false, "expected_missing": ["analysis"], "note": "validated goal but no analysis run yet" }, + { "id": "floor_04", "group": "floor", "problem_validated": true, "report_id": null, "records": [{ "analysis": "failure", "findings": 3, "age_min": 20 }], "reports": [], "aligned": false, "expected_ready": false, "expected_missing": ["analysis"], "note": "T1 REGRESSION: analyze_* FAILED but the record still has 3 findings (narrating failure) + check/retrieve succeeded. Must NOT be ready — this is the live e2e case (analyze_aggregate failed, report still got generated under the old 'has findings' rule)." }, + { "id": "floor_05", "group": "floor", "problem_validated": true, "report_id": null, "records": [{ "analysis": "none", "findings": 0, "age_min": 15 }], "reports": [], "aligned": false, "expected_ready": false, "expected_missing": ["analysis"], "note": "T1 nuance: only data-access tasks (check/retrieve) succeeded, no analyze task. 'any task succeeded' would wrongly pass — must NOT be ready." }, + { "id": "floor_06", "group": "floor", "problem_validated": true, "report_id": null, "records": [{ "analysis": "success", "findings": 2, "age_min": 15 }], "reports": [], "aligned": true, "expected_ready": true, "expected_missing": [], "note": "validated + one successful analysis, no prior report → ready" }, + { "id": "floor_07", "group": "floor", "problem_validated": true, "report_id": null, "records": [{ "analysis": "success", "findings": 3, "age_min": 40 }, { "analysis": "success", "findings": 1, "age_min": 10 }], "reports": [], "aligned": true, "expected_ready": true, "expected_missing": [], "note": "multiple successful analyses → ready" }, + { "id": "floor_08", "group": "floor", "problem_validated": true, "report_id": null, "records": [{ "analysis": "failure", "findings": 3, "age_min": 30 }, { "analysis": "success", "findings": 2, "age_min": 10 }], "reports": [], "aligned": true, "expected_ready": true, "expected_missing": [], "note": "one failed + one successful analysis → the successful one is enough → ready" }, + + { "id": "delta_01", "group": "delta", "problem_validated": true, "report_id": "rep-1", "records": [{ "analysis": "success", "findings": 2, "age_min": 120 }], "reports": [{ "age_min": 5 }], "aligned": true, "expected_ready": false, "expected_missing": ["delta"], "note": "report exists, all analysis older than it → nothing new to report" }, + { "id": "delta_02", "group": "delta", "problem_validated": true, "report_id": "rep-1", "records": [{ "analysis": "success", "findings": 2, "age_min": 5 }], "reports": [{ "age_min": 120 }], "aligned": true, "expected_ready": true, "expected_missing": [], "note": "newer successful analysis after the report → ready to regenerate" }, + { "id": "delta_03", "group": "delta", "problem_validated": true, "report_id": "rep-1", "records": [{ "analysis": "success", "findings": 1, "age_min": 90 }, { "analysis": "success", "findings": 2, "age_min": 10 }], "reports": [{ "age_min": 60 }], "aligned": true, "expected_ready": true, "expected_missing": [], "note": "one old + one newer-than-report success → ready" }, + { "id": "delta_04", "group": "delta", "problem_validated": true, "report_id": "rep-2", "records": [{ "analysis": "success", "findings": 2, "age_min": 90 }], "reports": [{ "age_min": 200 }, { "age_min": 30 }], "aligned": true, "expected_ready": false, "expected_missing": ["delta"], "note": "multiple reports — newest wins; analysis older than newest report → not ready" }, + { "id": "delta_05", "group": "delta", "problem_validated": true, "report_id": "rep-1", "records": [{ "analysis": "success", "findings": 2, "age_min": 120 }, { "analysis": "failure", "findings": 3, "age_min": 5 }], "reports": [{ "age_min": 60 }], "aligned": true, "expected_ready": false, "expected_missing": ["delta"], "note": "T1+delta: the only NEW analysis (age 5) is a FAILURE → no NEW substantive since the report → not ready. A failed retry must not unlock a duplicate report." }, + + { "id": "edge_01", "group": "edge", "problem_validated": true, "report_id": null, "records": [], "reports": [], "aligned": false, "expected_ready": false, "expected_missing": ["analysis"], "note": "doc-only analysis (RAG, no structured run) produces no AnalysisRecord → never report-able under the floor. PRODUCT QUESTION: should doc-only be report-able?" }, + + { "id": "align_01", "group": "alignment", "problem_validated": true, "report_id": null, "records": [{ "analysis": "success", "findings": 2, "age_min": 15 }], "reports": [], "aligned": false, "expected_ready": true, "expected_missing": [], "note": "GAP: successful analysis but it doesn't address the problem statement. Floor says ready; a human would say not-ready." }, + { "id": "align_02", "group": "alignment", "problem_validated": true, "report_id": null, "records": [{ "analysis": "success", "findings": 3, "age_min": 25 }, { "analysis": "success", "findings": 1, "age_min": 5 }], "reports": [], "aligned": false, "expected_ready": true, "expected_missing": [], "note": "GAP: lots of successful analysis, none aligned to the goal" }, + { "id": "align_03", "group": "alignment", "problem_validated": true, "report_id": null, "records": [{ "analysis": "success", "findings": 2, "age_min": 15 }], "reports": [], "aligned": true, "expected_ready": true, "expected_missing": [], "note": "control: successful AND aligned → genuinely ready, no gap" } + ] +} diff --git a/eval/readiness/results/.gitkeep b/eval/readiness/results/.gitkeep new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/eval/readiness/results/readiness_result_2026-06-22_101645.json b/eval/readiness/results/readiness_result_2026-06-22_101645.json new file mode 100644 index 0000000000000000000000000000000000000000..317b7a3b35f4adb7f3ea7e535f4d0a19128ca4a1 --- /dev/null +++ b/eval/readiness/results/readiness_result_2026-06-22_101645.json @@ -0,0 +1,268 @@ +{ + "run": { + "timestamp": "2026-06-22T10:16:45", + "dataset": "readiness_dataset.json", + "target": "src/agents/report/readiness.is_report_ready", + "total": 16, + "passed": 16, + "accuracy": 1.0, + "runtime_avg_ms": 0.0 + }, + "alignment_gap": { + "count": 2, + "ids": [ + "align_01", + "align_02" + ] + }, + "by_group": { + "floor": { + "n": 8, + "passed": 8, + "accuracy": 1.0 + }, + "delta": { + "n": 4, + "passed": 4, + "accuracy": 1.0 + }, + "edge": { + "n": 1, + "passed": 1, + "accuracy": 1.0 + }, + "alignment": { + "n": 3, + "passed": 3, + "accuracy": 1.0 + } + }, + "cases": [ + { + "id": "floor_01", + "group": "floor", + "expected_ready": false, + "got_ready": false, + "expected_missing": [ + "a validated problem statement", + "at least one completed analysis" + ], + "got_missing": [ + "a validated problem statement", + "at least one completed analysis" + ], + "correct": true, + "aligned": false, + "gap": false, + "latency_ms": 0.0 + }, + { + "id": "floor_02", + "group": "floor", + "expected_ready": false, + "got_ready": false, + "expected_missing": [ + "a validated problem statement" + ], + "got_missing": [ + "a validated problem statement" + ], + "correct": true, + "aligned": true, + "gap": false, + "latency_ms": 0.0 + }, + { + "id": "floor_03", + "group": "floor", + "expected_ready": false, + "got_ready": false, + "expected_missing": [ + "at least one completed analysis" + ], + "got_missing": [ + "at least one completed analysis" + ], + "correct": true, + "aligned": false, + "gap": false, + "latency_ms": 0.0 + }, + { + "id": "floor_04", + "group": "floor", + "expected_ready": false, + "got_ready": false, + "expected_missing": [ + "at least one completed analysis" + ], + "got_missing": [ + "at least one completed analysis" + ], + "correct": true, + "aligned": false, + "gap": false, + "latency_ms": 0.0 + }, + { + "id": "floor_05", + "group": "floor", + "expected_ready": false, + "got_ready": false, + "expected_missing": [ + "at least one completed analysis" + ], + "got_missing": [ + "at least one completed analysis" + ], + "correct": true, + "aligned": false, + "gap": false, + "latency_ms": 0.0 + }, + { + "id": "floor_06", + "group": "floor", + "expected_ready": true, + "got_ready": true, + "expected_missing": [], + "got_missing": [], + "correct": true, + "aligned": true, + "gap": false, + "latency_ms": 0.0 + }, + { + "id": "floor_07", + "group": "floor", + "expected_ready": true, + "got_ready": true, + "expected_missing": [], + "got_missing": [], + "correct": true, + "aligned": true, + "gap": false, + "latency_ms": 0.0 + }, + { + "id": "floor_08", + "group": "floor", + "expected_ready": true, + "got_ready": true, + "expected_missing": [], + "got_missing": [], + "correct": true, + "aligned": true, + "gap": false, + "latency_ms": 0.0 + }, + { + "id": "delta_01", + "group": "delta", + "expected_ready": false, + "got_ready": false, + "expected_missing": [ + "a new analysis since the last report" + ], + "got_missing": [ + "a new analysis since the last report" + ], + "correct": true, + "aligned": true, + "gap": false, + "latency_ms": 0.0 + }, + { + "id": "delta_02", + "group": "delta", + "expected_ready": true, + "got_ready": true, + "expected_missing": [], + "got_missing": [], + "correct": true, + "aligned": true, + "gap": false, + "latency_ms": 0.0 + }, + { + "id": "delta_03", + "group": "delta", + "expected_ready": true, + "got_ready": true, + "expected_missing": [], + "got_missing": [], + "correct": true, + "aligned": true, + "gap": false, + "latency_ms": 0.0 + }, + { + "id": "delta_04", + "group": "delta", + "expected_ready": false, + "got_ready": false, + "expected_missing": [ + "a new analysis since the last report" + ], + "got_missing": [ + "a new analysis since the last report" + ], + "correct": true, + "aligned": true, + "gap": false, + "latency_ms": 0.0 + }, + { + "id": "edge_01", + "group": "edge", + "expected_ready": false, + "got_ready": false, + "expected_missing": [ + "at least one completed analysis" + ], + "got_missing": [ + "at least one completed analysis" + ], + "correct": true, + "aligned": false, + "gap": false, + "latency_ms": 0.0 + }, + { + "id": "align_01", + "group": "alignment", + "expected_ready": true, + "got_ready": true, + "expected_missing": [], + "got_missing": [], + "correct": true, + "aligned": false, + "gap": true, + "latency_ms": 0.0 + }, + { + "id": "align_02", + "group": "alignment", + "expected_ready": true, + "got_ready": true, + "expected_missing": [], + "got_missing": [], + "correct": true, + "aligned": false, + "gap": true, + "latency_ms": 0.0 + }, + { + "id": "align_03", + "group": "alignment", + "expected_ready": true, + "got_ready": true, + "expected_missing": [], + "got_missing": [], + "correct": true, + "aligned": true, + "gap": false, + "latency_ms": 0.0 + } + ] +} \ No newline at end of file diff --git a/eval/readiness/results/readiness_result_2026-06-22_143809.json b/eval/readiness/results/readiness_result_2026-06-22_143809.json new file mode 100644 index 0000000000000000000000000000000000000000..8f93a8876bfb3ddb988d6d13c9b2ea148d934859 --- /dev/null +++ b/eval/readiness/results/readiness_result_2026-06-22_143809.json @@ -0,0 +1,284 @@ +{ + "run": { + "timestamp": "2026-06-22T14:38:09", + "dataset": "readiness_dataset.json", + "target": "src/agents/report/readiness.is_report_ready", + "total": 17, + "passed": 17, + "accuracy": 1.0, + "runtime_avg_ms": 0.01 + }, + "alignment_gap": { + "count": 2, + "ids": [ + "align_01", + "align_02" + ] + }, + "by_group": { + "floor": { + "n": 8, + "passed": 8, + "accuracy": 1.0 + }, + "delta": { + "n": 5, + "passed": 5, + "accuracy": 1.0 + }, + "edge": { + "n": 1, + "passed": 1, + "accuracy": 1.0 + }, + "alignment": { + "n": 3, + "passed": 3, + "accuracy": 1.0 + } + }, + "cases": [ + { + "id": "floor_01", + "group": "floor", + "expected_ready": false, + "got_ready": false, + "expected_missing": [ + "a validated problem statement", + "at least one completed analysis" + ], + "got_missing": [ + "a validated problem statement", + "at least one completed analysis" + ], + "correct": true, + "aligned": false, + "gap": false, + "latency_ms": 0.0 + }, + { + "id": "floor_02", + "group": "floor", + "expected_ready": false, + "got_ready": false, + "expected_missing": [ + "a validated problem statement" + ], + "got_missing": [ + "a validated problem statement" + ], + "correct": true, + "aligned": true, + "gap": false, + "latency_ms": 0.0 + }, + { + "id": "floor_03", + "group": "floor", + "expected_ready": false, + "got_ready": false, + "expected_missing": [ + "at least one completed analysis" + ], + "got_missing": [ + "at least one completed analysis" + ], + "correct": true, + "aligned": false, + "gap": false, + "latency_ms": 0.0 + }, + { + "id": "floor_04", + "group": "floor", + "expected_ready": false, + "got_ready": false, + "expected_missing": [ + "at least one completed analysis" + ], + "got_missing": [ + "at least one completed analysis" + ], + "correct": true, + "aligned": false, + "gap": false, + "latency_ms": 0.0 + }, + { + "id": "floor_05", + "group": "floor", + "expected_ready": false, + "got_ready": false, + "expected_missing": [ + "at least one completed analysis" + ], + "got_missing": [ + "at least one completed analysis" + ], + "correct": true, + "aligned": false, + "gap": false, + "latency_ms": 0.0 + }, + { + "id": "floor_06", + "group": "floor", + "expected_ready": true, + "got_ready": true, + "expected_missing": [], + "got_missing": [], + "correct": true, + "aligned": true, + "gap": false, + "latency_ms": 0.0 + }, + { + "id": "floor_07", + "group": "floor", + "expected_ready": true, + "got_ready": true, + "expected_missing": [], + "got_missing": [], + "correct": true, + "aligned": true, + "gap": false, + "latency_ms": 0.0 + }, + { + "id": "floor_08", + "group": "floor", + "expected_ready": true, + "got_ready": true, + "expected_missing": [], + "got_missing": [], + "correct": true, + "aligned": true, + "gap": false, + "latency_ms": 0.1 + }, + { + "id": "delta_01", + "group": "delta", + "expected_ready": false, + "got_ready": false, + "expected_missing": [ + "a new analysis since the last report" + ], + "got_missing": [ + "a new analysis since the last report" + ], + "correct": true, + "aligned": true, + "gap": false, + "latency_ms": 0.0 + }, + { + "id": "delta_02", + "group": "delta", + "expected_ready": true, + "got_ready": true, + "expected_missing": [], + "got_missing": [], + "correct": true, + "aligned": true, + "gap": false, + "latency_ms": 0.0 + }, + { + "id": "delta_03", + "group": "delta", + "expected_ready": true, + "got_ready": true, + "expected_missing": [], + "got_missing": [], + "correct": true, + "aligned": true, + "gap": false, + "latency_ms": 0.0 + }, + { + "id": "delta_04", + "group": "delta", + "expected_ready": false, + "got_ready": false, + "expected_missing": [ + "a new analysis since the last report" + ], + "got_missing": [ + "a new analysis since the last report" + ], + "correct": true, + "aligned": true, + "gap": false, + "latency_ms": 0.0 + }, + { + "id": "delta_05", + "group": "delta", + "expected_ready": false, + "got_ready": false, + "expected_missing": [ + "a new analysis since the last report" + ], + "got_missing": [ + "a new analysis since the last report" + ], + "correct": true, + "aligned": true, + "gap": false, + "latency_ms": 0.0 + }, + { + "id": "edge_01", + "group": "edge", + "expected_ready": false, + "got_ready": false, + "expected_missing": [ + "at least one completed analysis" + ], + "got_missing": [ + "at least one completed analysis" + ], + "correct": true, + "aligned": false, + "gap": false, + "latency_ms": 0.0 + }, + { + "id": "align_01", + "group": "alignment", + "expected_ready": true, + "got_ready": true, + "expected_missing": [], + "got_missing": [], + "correct": true, + "aligned": false, + "gap": true, + "latency_ms": 0.0 + }, + { + "id": "align_02", + "group": "alignment", + "expected_ready": true, + "got_ready": true, + "expected_missing": [], + "got_missing": [], + "correct": true, + "aligned": false, + "gap": true, + "latency_ms": 0.0 + }, + { + "id": "align_03", + "group": "alignment", + "expected_ready": true, + "got_ready": true, + "expected_missing": [], + "got_missing": [], + "correct": true, + "aligned": true, + "gap": false, + "latency_ms": 0.0 + } + ] +} \ No newline at end of file diff --git a/eval/readiness/run_eval.py b/eval/readiness/run_eval.py new file mode 100644 index 0000000000000000000000000000000000000000..83d44da60eadb2d44d62e5318d72528f27bfd449 --- /dev/null +++ b/eval/readiness/run_eval.py @@ -0,0 +1,309 @@ +"""Report-readiness eval runner. + +Feeds each golden case in `readiness_dataset.json` to the deterministic +`is_report_ready` signal (`src/agents/report/readiness.py`) via injectable FAKE +stores — no LLM, no DB — then scores both the boolean `ready` and the `missing` +gaps. Prints a per-case detail table + aggregate summary and writes a timestamped +JSON report under `results/` (never overwritten — one file per run, diffable). + +Two metrics matter: + - FLOOR correctness (ready + missing exact) — should be ~100%; this is the + regression guard as the criteria evolve. + - ALIGNMENT GAP — cases the floor calls ready=true but whose analyses are NOT + aligned to the problem statement (`aligned=false`). The floor can't see this; + the gap count is the evidence for/against adding the deferred LLM-judge. + +Invoke as a module so `src` imports resolve: + + uv run python -m eval.readiness.run_eval + uv run python -m eval.readiness.run_eval --limit 5 +""" + +from __future__ import annotations + +import argparse +import asyncio +import json +import statistics +import time +from dataclasses import asdict, dataclass, field +from datetime import UTC, datetime, timedelta +from pathlib import Path +from typing import Any + +from src.agents.gate import stub_analysis_state +from src.agents.report.readiness import ( + _MISSING_ANALYSIS, + _MISSING_DELTA, + _MISSING_PROBLEM, + is_report_ready, +) + +_HERE = Path(__file__).resolve().parent +DATASET = _HERE / "readiness_dataset.json" +RESULTS_DIR = _HERE / "results" +GROUPS = ["floor", "delta", "edge", "alignment"] + +# Dataset short codes -> the exact `missing` strings is_report_ready emits. Imported +# from the module so the dataset stays readable and survives wording changes. +_CODE_TO_MISSING = { + "problem": _MISSING_PROBLEM, + "analysis": _MISSING_ANALYSIS, + "delta": _MISSING_DELTA, +} + + +@dataclass +class _FakeTask: + """Mirrors slow_path.schemas.TaskSummary (the bits is_report_ready reads).""" + + status: str # success | partial | failure + tools_used: list[str] + + +@dataclass +class _FakeRecord: + findings: list[Any] + created_at: datetime + tasks_run: list[_FakeTask] + + +@dataclass +class _FakeReport: + generated_at: datetime + + +class _FakeStore: + """Stand-in for the Postgres record/report store — returns canned rows.""" + + def __init__(self, rows: list[Any]) -> None: + self._rows = rows + + async def list_for_analysis(self, _analysis_id: str) -> list[Any]: + return self._rows + + +@dataclass +class CaseResult: + id: str + group: str + expected_ready: bool + got_ready: bool + expected_missing: list[str] + got_missing: list[str] + correct: bool + aligned: bool + gap: bool # floor said ready but analyses not aligned to the problem statement + latency_ms: float + + +def load_cases(path: Path) -> list[dict[str, Any]]: + data = json.loads(path.read_text(encoding="utf-8")) + return list(data["cases"]) + + +def _build_tasks(analysis: str) -> list[_FakeTask]: + """Realistic tasks_run: data-access always succeeds; the analyze_* task varies. + + analysis = 'success' (analyze_* succeeded) | 'failure' (analyze_* failed) | + 'none' (no analyze task at all — only check/retrieve succeeded). + """ + tasks = [ + _FakeTask(status="success", tools_used=["check_data"]), + _FakeTask(status="success", tools_used=["retrieve_data"]), + ] + if analysis == "success": + tasks.append(_FakeTask(status="success", tools_used=["analyze_aggregate"])) + elif analysis == "failure": + tasks.append(_FakeTask(status="failure", tools_used=["analyze_aggregate"])) + return tasks + + +def _build_records(specs: list[dict[str, Any]], now: datetime) -> list[_FakeRecord]: + return [ + _FakeRecord( + findings=["f"] * int(spec.get("findings", 0)), + created_at=now - timedelta(minutes=int(spec["age_min"])), + tasks_run=_build_tasks(str(spec.get("analysis", "success"))), + ) + for spec in specs + ] + + +def _build_reports(specs: list[dict[str, Any]], now: datetime) -> list[_FakeReport]: + return [ + _FakeReport(generated_at=now - timedelta(minutes=int(spec["age_min"]))) + for spec in specs + ] + + +async def run_case(case: dict[str, Any]) -> CaseResult: + now = datetime.now(UTC) + state = stub_analysis_state(problem_validated=bool(case["problem_validated"])) + if case.get("report_id"): + state = state.model_copy(update={"report_id": case["report_id"]}) + + record_store = _FakeStore(_build_records(case.get("records", []), now)) + report_store = _FakeStore(_build_reports(case.get("reports", []), now)) + expected_missing = sorted(_CODE_TO_MISSING[c] for c in case["expected_missing"]) + + start = time.perf_counter() + rr = await is_report_ready( + case["id"], state, record_store=record_store, report_store=report_store + ) + latency_ms = round((time.perf_counter() - start) * 1000, 1) + + got_missing = sorted(rr.missing) + ready_ok = rr.ready == bool(case["expected_ready"]) + missing_ok = got_missing == expected_missing + return CaseResult( + id=case["id"], + group=case["group"], + expected_ready=bool(case["expected_ready"]), + got_ready=rr.ready, + expected_missing=expected_missing, + got_missing=got_missing, + correct=ready_ok and missing_ok, + aligned=bool(case["aligned"]), + gap=rr.ready and not bool(case["aligned"]), + latency_ms=latency_ms, + ) + + +def _group_accuracy(results: list[CaseResult]) -> dict[str, dict[str, Any]]: + out: dict[str, dict[str, Any]] = {} + for g in GROUPS: + sub = [r for r in results if r.group == g] + if not sub: + continue + passed = sum(r.correct for r in sub) + out[g] = {"n": len(sub), "passed": passed, "accuracy": round(passed / len(sub), 3)} + return out + + +def summarize(results: list[CaseResult]) -> dict[str, Any]: + n = len(results) + passed = sum(r.correct for r in results) + gaps = [r for r in results if r.gap] + latencies = [r.latency_ms for r in results] + return { + "total": n, + "passed": passed, + "accuracy": round(passed / n, 3) if n else 0.0, + "runtime_avg_ms": round(statistics.mean(latencies), 2) if latencies else 0, + "alignment_gap": {"count": len(gaps), "ids": [r.id for r in gaps]}, + "by_group": _group_accuracy(results), + } + + +def _fmt_bool(value: bool) -> str: + return "T" if value else "F" + + +def _truncate(text: str, width: int) -> str: + return text if len(text) <= width else text[: width - 3] + "..." + + +def format_table(results: list[CaseResult]) -> str: + header = ( + f"{'ID':<12} {'GROUP':<10} {'RDY e/g':<8} " + f"{'MISSING (got)':<40} {'OK':<3} {'GAP':<4}" + ) + rule = "-" * len(header) + lines = [rule, header, rule] + for r in results: + rdy = f"{_fmt_bool(r.expected_ready)}/{_fmt_bool(r.got_ready)}" + missing = ", ".join(r.got_missing) or "-" + ok = "ok" if r.correct else "X" + gap = "GAP" if r.gap else "" + lines.append( + f"{r.id:<12} {r.group:<10} {rdy:<8} " + f"{_truncate(missing, 40):<40} {ok:<3} {gap:<4}" + ) + lines.append(rule) + return "\n".join(lines) + + +def format_summary(summary: dict[str, Any], results: list[CaseResult]) -> str: + lines = ["SUMMARY"] + lines.append( + f" Floor {summary['passed']}/{summary['total']} correct" + f" ({summary['accuracy'] * 100:.1f}%) avg {summary['runtime_avg_ms']} ms" + ) + gap = summary["alignment_gap"] + lines.append( + f" Align gap {gap['count']} case(s) ready-but-misaligned" + + (f" -> {', '.join(gap['ids'])}" if gap["ids"] else "") + ) + lines.append(" (floor can't catch these; this count is the LLM-judge justification)") + lines.append("") + lines.append(" By group") + for g, m in summary["by_group"].items(): + lines.append(f" {g:<12} {m['passed']}/{m['n']} {m['accuracy'] * 100:.0f}%") + failures = [r for r in results if not r.correct] + lines.append("") + lines.append(f" FAILURES ({len(failures)})") + for r in failures: + lines.append( + f" {r.id:<12} ready {_fmt_bool(r.expected_ready)}->{_fmt_bool(r.got_ready)}" + f" missing {r.expected_missing} -> {r.got_missing}" + ) + return "\n".join(lines) + + +def build_report( + results: list[CaseResult], summary: dict[str, Any], meta: dict[str, Any] +) -> dict[str, Any]: + run = {**meta, **{k: summary[k] for k in ("total", "passed", "accuracy", "runtime_avg_ms")}} + return { + "run": run, + "alignment_gap": summary["alignment_gap"], + "by_group": summary["by_group"], + "cases": [asdict(r) for r in results], + } + + +@dataclass +class _Args: + dataset: Path = DATASET + limit: int = 0 + no_table: bool = False + extra: dict[str, Any] = field(default_factory=dict) + + +async def main() -> None: + parser = argparse.ArgumentParser(description="Report-readiness eval") + parser.add_argument("--dataset", type=Path, default=DATASET) + parser.add_argument("--limit", type=int, default=0, help="run first N cases only") + parser.add_argument("--no-table", action="store_true", help="skip the detail table") + args = parser.parse_args() + + cases = load_cases(args.dataset) + if args.limit: + cases = cases[: args.limit] + + started = datetime.now() + print(f"Report-Readiness Eval -- {started:%Y-%m-%d %H:%M:%S}") + print(f"dataset: {args.dataset.name} ({len(cases)} cases) target: is_report_ready") + + results = [await run_case(case) for case in cases] + + summary = summarize(results) + if not args.no_table: + print(format_table(results)) + print(format_summary(summary, results)) + + meta = { + "timestamp": started.isoformat(timespec="seconds"), + "dataset": args.dataset.name, + "target": "src/agents/report/readiness.is_report_ready", + } + report = build_report(results, summary, meta) + RESULTS_DIR.mkdir(parents=True, exist_ok=True) + out_path = RESULTS_DIR / f"readiness_result_{started:%Y-%m-%d_%H%M%S}.json" + out_path.write_text(json.dumps(report, ensure_ascii=False, indent=2), encoding="utf-8") + print(f"\n-> saved: {out_path.relative_to(_HERE.parent.parent)}") + + +if __name__ == "__main__": + asyncio.run(main()) diff --git a/main.py b/main.py index f8102b34ec32859ecb28c9a74a3038ca71448cbe..945ffcd3aabeb28d4f779f8b922f46289be2c51d 100644 --- a/main.py +++ b/main.py @@ -13,6 +13,9 @@ from src.api.v1.room import router as room_router from src.api.v1.users import router as users_router from src.api.v1.db_client import router as db_client_router from src.api.v1.data_catalog import router as data_catalog_router +from src.api.v1.report import router as report_router +from src.api.v1.analysis import router as analysis_router +from src.api.v1.tools import router as tools_router from src.db.postgres.init_db import init_db import os import uvicorn @@ -53,6 +56,9 @@ app.include_router(room_router) app.include_router(chat_router) app.include_router(db_client_router) app.include_router(data_catalog_router) +app.include_router(report_router) +app.include_router(analysis_router) +app.include_router(tools_router) @app.get("/") diff --git a/pyproject.toml b/pyproject.toml index ea38cb5468913c984643d8f2c0105e2957bcb364..4e21c1c45ad088d0ec660395bf8032de3de23d27 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -123,6 +123,8 @@ ignore = [ # S608: golden compiler tests assert literal SQL strings (incl. concatenated # suffixes) — they never execute against a DB, so it's a false positive here. "tests/**" = ["S101", "S105", "S106", "S608"] +# T201: eval/ scripts are CLIs — print() is their intended output channel. +"eval/**" = ["T201"] [tool.mypy] python_version = "3.12" diff --git a/src/agents/binding_store.py b/src/agents/binding_store.py new file mode 100644 index 0000000000000000000000000000000000000000..07cc4e50cd82eeb65d3b61db6f5a705fab318d9e --- /dev/null +++ b/src/agents/binding_store.py @@ -0,0 +1,34 @@ +"""AnalysisDataSourceStore — read per-analysis data-source bindings (#10). + +The dedorch `data_sources` table records which catalog sources an analysis is scoped +to (`reference_id` = the catalog source id). It's written at `/analysis/create`; this +store is the read seam for the two consumers — `structured_flow` catalog scoping and +the report's data-source appendix. + +Fail-open by convention at the call sites: an empty binding (legacy room, or the FE +not yet sending ids) means "no restriction" — fall back to the whole catalog. Mirrors +`AnalysisStateStore`: each call opens its own `AsyncSession`. +""" + +from __future__ import annotations + +from sqlalchemy import select + +from src.db.postgres.connection import AsyncSessionLocal +from src.db.postgres.models import AnalysisDataSourceRow +from src.middlewares.logging import get_logger + +logger = get_logger("binding_store") + + +class AnalysisDataSourceStore: + """Read the bound catalog `source_id`s for an analysis.""" + + async def get(self, analysis_id: str) -> list[str]: + async with AsyncSessionLocal() as session: + result = await session.execute( + select(AnalysisDataSourceRow.reference_id).where( + AnalysisDataSourceRow.analysis_id == analysis_id + ) + ) + return list(result.scalars().all()) diff --git a/src/agents/chat_handler.py b/src/agents/chat_handler.py index b009e1a1e559f041ea4a8dd027f13f3e9d2d5b43..077e041956691b371d2c8b521c96f718abfa0fda 100644 --- a/src/agents/chat_handler.py +++ b/src/agents/chat_handler.py @@ -2,18 +2,22 @@ End-to-end flow per user message: - 1. `IntentRouter.classify` → `chat` / `unstructured` / `structured`. - 2. Route: - - `chat` → no context. Pass straight to ChatbotAgent. - - `structured` → CatalogReader → QueryService → QueryResult. - - `unstructured` → DocumentRetriever (placeholder, raises until TAB - ships) → list[DocumentChunk]. + 1. `OrchestratorAgent.classify` → RouterDecision (one of six intents). + 2. Route by intent: + - `chat` → no context. Pass straight to ChatbotAgent. + - `structured_flow` → CatalogReader → slow path / QueryService. + - `unstructured_flow` → DocumentRetriever (RAG over PGVector) → + list[DocumentChunk]. + - `check` → check_data / check_knowledge tool → rendered table. + - `problem_statement` → PS skill: draft + validate → write analysis state. + - `help` → Help skill: analysis state + history → streamed guidance. 3. `ChatbotAgent.astream` → yield text tokens. 4. Wrap each step into an SSE-style event dict so the API endpoint can stream them as Server-Sent Events. -Phase 1's chat endpoint (`src/api/v1/chat.py`) is intentionally NOT touched -in this PR. PR7 cleanup will rewire it to call `ChatHandler.handle(...)`. +The chat endpoint (`src/api/v1/chat.py`) calls `ChatHandler.handle(...)` per +request, behind two endpoint-level pre-filters: a greeting/farewell +short-circuit and a Redis response cache (both skip the LLM on a hit). All dependencies are injectable for tests. Default constructors lazy-build production deps (no `Settings()` triggered at import time as long as you @@ -33,12 +37,16 @@ from src.middlewares.logging import get_logger from src.retrieval.base import RetrievalResult from .chatbot import ChatbotAgent, DocumentChunk +from .handlers.check import run_check +from .handlers.help import HelpAgent +from .handlers.problem_statement import ProblemStatementAgent, run_problem_statement from .orchestration import OrchestratorAgent if TYPE_CHECKING: from ..catalog.reader import CatalogReader from ..query.service import QueryService from ..retrieval.router import RetrievalRouter + from .gate import AnalysisState from .slow_path.coordinator import SlowPathCoordinator from .slow_path.store import AnalysisStore @@ -71,6 +79,12 @@ class ChatHandler: Callable[[str], SlowPathCoordinator] | None ) = None, analysis_store: AnalysisStore | None = None, + check_invoker_factory: Callable[[str], Any] | None = None, + ps_agent: ProblemStatementAgent | None = None, + help_agent: HelpAgent | None = None, + state_store: Any | None = None, + binding_store: Any | None = None, + enable_gate: bool = False, enable_tracing: bool = False, ) -> None: self._intent_router = intent_router @@ -88,6 +102,21 @@ class ChatHandler: self._enable_slow_path = enable_slow_path self._slow_path_factory = slow_path_coordinator_factory self._analysis_store = analysis_store + # `check` skill: builds the data-access invoker (check_data/check_knowledge) + # per request with the authenticated user_id. Injectable for tests. + self._check_invoker_factory = check_invoker_factory + # `problem_statement` skill: LLM drafter + the Analysis State store it writes + # `problem_validated` to. Both injectable for tests. + self._ps_agent = ps_agent + # `help` skill: LLM guide that reads the Analysis State + chat history. + self._help_agent = help_agent + self._state_store = state_store + # `#10` data-source binding: scopes structured_flow's catalog to the sources + # the analysis is bound to. Injectable for tests; fail-open when absent. + self._binding_store = binding_store + # Deterministic gate: redirect structured_flow -> problem_statement until the + # analysis is validated. OFF by default (legacy rooms have no state row). + self._enable_gate = enable_gate # ------------------------------------------------------------------ # Lazy default-dep builders @@ -125,6 +154,71 @@ class ChatHandler: self._document_retriever = RetrievalRouter() return self._document_retriever + def _get_check_invoker(self, user_id: str) -> Any: + """Build the per-request data-access invoker for the `check` skill.""" + if self._check_invoker_factory is not None: + return self._check_invoker_factory(user_id) + from ..tools.data_access import DataAccessToolInvoker + + return DataAccessToolInvoker(user_id, self._get_catalog_reader()) + + def _get_ps_agent(self) -> ProblemStatementAgent: + if self._ps_agent is None: + self._ps_agent = ProblemStatementAgent() + return self._ps_agent + + def _get_help_agent(self) -> HelpAgent: + if self._help_agent is None: + self._help_agent = HelpAgent() + return self._help_agent + + def _get_state_store(self) -> Any: + if self._state_store is None: + from .state_store import AnalysisStateStore + + self._state_store = AnalysisStateStore() + return self._state_store + + def _get_binding_store(self) -> Any: + if self._binding_store is None: + from .binding_store import AnalysisDataSourceStore + + self._binding_store = AnalysisDataSourceStore() + return self._binding_store + + async def _bound_source_ids(self, analysis_id: str | None) -> set[str]: + """#10: the catalog source_ids this analysis is bound to (empty = unscoped). + + Fail-open: no analysis_id, no binding rows (legacy room / FE not sending + ids), or a read error → empty set, which the caller treats as "whole + catalog". Used to build a `_ScopedCatalogReader` so the Planner AND the + data-access tools (which re-read the catalog themselves) see the same scope. + """ + if not analysis_id: + return set() + try: + return set(await self._get_binding_store().get(analysis_id)) + except Exception as e: # noqa: BLE001 — never block the query on this + logger.warning("binding read failed — unscoped", analysis_id=analysis_id, error=str(e)) + return set() + + async def _load_analysis_state(self, analysis_id: str | None) -> AnalysisState: + """Load Analysis State for the Help skill; fail closed to a not-validated stub. + + Mirrors the gate's never-throw fallback so Help degrades gracefully on a + missing row, a read error, or a legacy room with no `analysis_id`. + """ + from .gate import stub_analysis_state + + if not analysis_id: + return stub_analysis_state(problem_validated=False) + try: + state = await self._get_state_store().get(analysis_id) + except Exception as e: + logger.warning("help state read failed — not-validated", error=str(e)) + state = None + return state if state is not None else stub_analysis_state(problem_validated=False) + # ------------------------------------------------------------------ # Public entry # ------------------------------------------------------------------ @@ -134,6 +228,7 @@ class ChatHandler: message: str, user_id: str, history: list[BaseMessage] | None = None, + analysis_id: str | None = None, ) -> AsyncIterator[dict[str, Any]]: tracer = self._make_tracer(user_id, message) @@ -147,7 +242,39 @@ class ChatHandler: yield {"event": "error", "data": f"Could not classify message: {e}"} return - yield {"event": "intent", "data": decision.model_dump_json()} + intent = decision.intent + # ---- 1a. Ensure session state row (T-A) ---------------------- + # Rooms created via /room/create have no `analysis_states` row. Without one + # the gate redirect-loops and problem_statement / report_id writes silently + # no-op. Lazily get-or-create it (idempotent) so any session is gate-ready. + analysis_state: AnalysisState | None = None + if analysis_id: + try: + analysis_state = await self._get_state_store().ensure(analysis_id, user_id) + except Exception as e: + logger.warning( + "analysis state ensure failed", analysis_id=analysis_id, error=str(e) + ) + + # ---- 1b. Gate (deterministic, post-router) ------------------- + # Redirect structured_flow -> problem_statement until the analysis is + # validated. Fails closed (not-validated) when the state row is unavailable. + if self._enable_gate and analysis_id: + from .gate import gate, stub_analysis_state + + intent = gate( + intent, + analysis_state + if analysis_state is not None + else stub_analysis_state(problem_validated=False), + ) + + # The `intent` event is consumed by the endpoint (it gates response caching + # on the effective intent) and is NOT forwarded to the frontend. We emit the + # post-gate intent so the cache keys on what actually ran. + event_data = decision.model_dump() + event_data["intent"] = intent + yield {"event": "intent", "data": json.dumps(event_data)} rewritten = decision.rewritten_query or message query_result = None @@ -155,7 +282,7 @@ class ChatHandler: raw_chunks: Any = None # ---- 2. Route ------------------------------------------------ - if decision.source_hint == "structured": + if intent == "structured_flow": try: # One memoizing reader per request: the same catalog is otherwise # re-fetched from the catalog DB 4-5x across the slow-path run. This @@ -164,10 +291,15 @@ class ChatHandler: from ..catalog.reader import MemoizingCatalogReader req_reader = MemoizingCatalogReader(self._get_catalog_reader()) - catalog = await req_reader.read(user_id, "structured") + # #10: scope every catalog read — the Planner's AND the data-access + # tools' own re-reads — to the analysis's bound sources, so binding + # is a boundary, not just a planner hint (T-B). Fail-open (T-C). + bound = await self._bound_source_ids(analysis_id) + reader = _ScopedCatalogReader(req_reader, bound) if bound else req_reader + catalog = await reader.read(user_id, "structured") if self._enable_slow_path: async for event in self._run_slow_path( - user_id, rewritten, catalog, tracer, req_reader + user_id, rewritten, catalog, tracer, reader, analysis_id ): yield event return @@ -182,32 +314,88 @@ class ChatHandler: ) yield {"event": "error", "data": f"Structured query failed: {e}"} return - elif decision.source_hint == "unstructured": + elif intent == "unstructured_flow": try: raw_chunks = await self._get_document_retriever().retrieve( rewritten, user_id ) chunks = _normalize_chunks(raw_chunks) - except NotImplementedError: - logger.warning("DocumentRetriever placeholder hit", user_id=user_id) - yield { - "event": "error", - "data": "Document retrieval is not yet available — pending implementation.", - } - return except Exception as e: logger.error( "unstructured route failed", user_id=user_id, error=str(e) ) yield {"event": "error", "data": f"Document retrieval failed: {e}"} return + elif intent == "check": + try: + invoker = self._get_check_invoker(user_id) + text = await run_check(rewritten, invoker) + except Exception as e: + logger.error("check route failed", user_id=user_id, error=str(e)) + yield {"event": "error", "data": f"Lookup failed: {e}"} + return + yield {"event": "chunk", "data": text} + yield {"event": "done", "data": ""} + return + elif intent == "problem_statement": + try: + text = await run_problem_statement( + message, + analysis_id, + agent=self._get_ps_agent(), + store=self._get_state_store(), + history=history, + ) + except Exception as e: + logger.error("problem_statement route failed", user_id=user_id, error=str(e)) + yield {"event": "error", "data": f"Problem statement failed: {e}"} + return + yield {"event": "chunk", "data": text} + yield {"event": "done", "data": ""} + return + elif intent == "help": + try: + state = analysis_state or await self._load_analysis_state(analysis_id) + except Exception as e: + logger.error("help route failed", user_id=user_id, error=str(e)) + yield {"event": "error", "data": f"Help failed: {e}"} + return + # report_ready (seam #5): deterministic — validated goal + ≥1 recorded + # analysis (mirrors the report API's own 409 gate). Never-throws (fails + # closed to not-ready), so Help degrades safely. The consistency guard in + # HelpAgent only offers `generate_report` when this says ready. + from .report.readiness import is_report_ready + + report_ready = await is_report_ready(analysis_id, state) + # The prompt sees chat history -> masked. + hc = tracer.callbacks(masked=True) + hkw = {"callbacks": hc} if hc else {} + try: + async for token in self._get_help_agent().astream( + state, + history=history, + message=message, + report_ready=report_ready, + **hkw, + ): + yield {"event": "chunk", "data": token} + except Exception as e: + logger.error("help streaming failed", user_id=user_id, error=str(e)) + yield {"event": "error", "data": f"Help generation failed: {e}"} + return + tracer.end() + yield {"event": "done", "data": ""} + return # else: chat path — no context # ---- 2b. Emit sources --------------------------------------- - sources = _build_sources( - decision.source_hint, user_id, query_result, raw_chunks + sources = _build_sources(intent, user_id, query_result, raw_chunks) + logger.info( + "built sources", + intent=intent, + sources_count=len(sources), + raw_chunks_count=len(raw_chunks) if raw_chunks else 0, ) - logger.info("built sources", source_hint=decision.source_hint, sources_count=len(sources), raw_chunks_count=len(raw_chunks) if raw_chunks else 0) yield {"event": "sources", "data": json.dumps(sources)} # ---- 3. Stream answer ---------------------------------------- @@ -282,9 +470,9 @@ class ChatHandler: def _get_analysis_store(self) -> AnalysisStore: if self._analysis_store is None: - from .slow_path.store import NullAnalysisStore + from .slow_path.store import PostgresAnalysisStore - self._analysis_store = NullAnalysisStore() + self._analysis_store = PostgresAnalysisStore() return self._analysis_store async def _run_slow_path( @@ -294,11 +482,13 @@ class ChatHandler: catalog: Any, tracer: Any = None, catalog_reader: CatalogReader | None = None, + analysis_id: str | None = None, ) -> AsyncIterator[dict[str, Any]]: """Run the slow path and stream its assembled answer as SSE events. Context comes from the `get_business_context` seam (a stub today); the - `analysis_record` is persisted via the `AnalysisStore` seam (a no-op today). + `analysis_record` is persisted via the `AnalysisStore` seam (PostgresAnalysisStore), + stamped with the request's user_id + analysis_id so the report can group it. `chat_answer` is emitted as a single `chunk` (the Assembler returns the whole object — true token streaming is a later step). """ @@ -368,26 +558,58 @@ class ChatHandler: yield {"event": "sources", "data": json.dumps([])} # TODO: derive from record yield {"event": "chunk", "data": result.chat_answer} try: - await self._get_analysis_store().save(result.analysis_record) + # Stamp identity from the request scope: owner + the shared session id + # (analysis_id == room_id). Without analysis_id the record is orphaned — + # list_for_analysis can't find it, so the report + is_report_ready go + # blind. The store is never-throw. + record = result.analysis_record.model_copy( + update={"user_id": user_id, "analysis_id": analysis_id} + ) + await self._get_analysis_store().save(record) except Exception as e: # persistence must never break the user's answer logger.error("analysis_record persist failed", user_id=user_id, error=str(e)) tracer.end() # output omitted (chat_answer may contain PII on Cloud) yield {"event": "done", "data": ""} +class _ScopedCatalogReader: + """Wraps a CatalogReader, restricting `structured` reads to an analysis's bound + sources (#10). + + Scoping lives here — not at a single call site — so the Planner AND the + data-access tools (which re-read the catalog themselves) see the same scoped + view; otherwise binding is only a hint to the Planner while the executor runs + against the full catalog. Fail-open: an empty or fully-disjoint binding yields + the whole catalog, so a stale / cross-source binding degrades instead of + emptying the catalog. Only `structured` reads are scoped (all #10 binds today); + `unstructured` / retrieval reads pass through. + """ + + def __init__(self, inner: Any, bound: set[str]) -> None: + self._inner = inner + self._bound = bound + + async def read(self, user_id: str, source_hint: str) -> Any: + catalog = await self._inner.read(user_id, source_hint) + if not self._bound or source_hint != "structured": + return catalog + scoped = [s for s in catalog.sources if s.source_id in self._bound] + return catalog.model_copy(update={"sources": scoped or catalog.sources}) + + def _build_sources( - source_hint: str, + intent: str, user_id: str, query_result: Any, raw_chunks: Any, ) -> list[dict[str, Any]]: """Build the sources payload for the SSE `sources` event. - - structured: one entry per executed table (table_name only). - - unstructured: deduped by (document_id, page_label), Phase 1 shape. + - structured_flow: one entry per executed table (table_name only). + - unstructured_flow: deduped by (document_id, page_label), Phase 1 shape. - chat or error: empty list. """ - if source_hint == "structured": + if intent == "structured_flow": if query_result is None or getattr(query_result, "error", None): return [] table_name = getattr(query_result, "table_name", "") or "" @@ -399,7 +621,7 @@ def _build_sources( "page_label": None, }] - if source_hint == "unstructured" and raw_chunks: + if intent == "unstructured_flow" and raw_chunks: seen: set[tuple[Any, Any]] = set() sources: list[dict[str, Any]] = [] for item in raw_chunks: diff --git a/src/agents/gate.py b/src/agents/gate.py new file mode 100644 index 0000000000000000000000000000000000000000..96e0ba74ffe52504903f6253228082fddd71be8c --- /dev/null +++ b/src/agents/gate.py @@ -0,0 +1,108 @@ +"""Deterministic routing gate — policy check over the router's intent. + +After the LLM router picks an intent, the gate checks it against the per-analysis +Analysis State and returns the **effective** intent: allow as-is, or redirect. No +LLM, no I/O in `gate()` itself. + +Only one rule has teeth in v1: an analytical request (`structured_flow`) requires a +validated problem statement (`problem_validated is True`); otherwise it is +redirected to `problem_statement` so the user defines the goal first. Everything +else passes through. `generate_report` is not a router intent (button / report +API), so it is not gated here. + +`AnalysisState` is the locked 8-field contract (mirrors the `analysis_states` DB +table). `get_analysis_state` reads the real per-analysis row via `AnalysisStateStore` +(#9, landed); it fails closed to a not-validated stub on a missing row or read error. +See `ORCHESTRATOR_REWORK_PLAN.md` §4. +""" + +from __future__ import annotations + +from datetime import UTC, datetime + +from pydantic import BaseModel + +from src.agents.orchestration import Intent +from src.middlewares.logging import get_logger + +logger = get_logger("gate") + + +class AnalysisState(BaseModel): + """Per-analysis state the gate + Help skill read every turn (locked contract). + + `problem_validated` is the gate driver; `report_id` is null until a report + exists. Field names mirror the `analysis_states` table so the DB read swaps in + without touching readers. + """ + + id: str + analysis_title: str + problem_statement: str + problem_validated: bool = False + owner_id: str + report_id: str | None = None + created_at: datetime + updated_at: datetime + + +def gate(intent: Intent, state: AnalysisState) -> Intent: + """Return the effective intent after applying the deterministic gate policy. + + `structured_flow` requires `problem_validated is True`; otherwise redirect to + `problem_statement`. All other intents pass through unchanged. + """ + if intent == "structured_flow" and not state.problem_validated: + logger.info( + "gate redirect", + requested=intent, + effective="problem_statement", + reason="problem_not_validated", + ) + return "problem_statement" + return intent + + +def stub_analysis_state(*, problem_validated: bool = False) -> AnalysisState: + """Hardcoded Analysis State for building/testing before the DB lands (#9). + + Shared fixture so the gate, the Help skill, and tests all exercise the same + shape. `problem_validated=True` simulates a passed interview. + """ + now = datetime.now(UTC) + return AnalysisState( + id="stub-analysis", + analysis_title="Stub analysis", + problem_statement="Stub problem statement" if problem_validated else "", + problem_validated=problem_validated, + owner_id="stub-user", + report_id=None, + created_at=now, + updated_at=now, + ) + + +async def get_analysis_state(analysis_id: str) -> AnalysisState: + """Load the Analysis State for an analysis (shared id with the chat room). + + Reads the `analysis_states` row via `AnalysisStateStore`. Never-throw seam: a + missing row (e.g. a legacy room created before this table) or a read failure + degrades to a **not-validated** stub, so the gate fails closed (→ steer to + `problem_statement`) rather than running ungated analysis. The store import is + lazy so this module stays import-safe without a DB. + """ + try: + from src.agents.state_store import AnalysisStateStore + + state = await AnalysisStateStore().get(analysis_id) + except Exception as exc: # noqa: BLE001 — never-throw; fail closed to not-validated + logger.warning( + "get_analysis_state read failed — default not-validated", + analysis_id=analysis_id, + error=str(exc), + ) + return stub_analysis_state(problem_validated=False) + if state is None: + logger.debug("analysis_state missing — default not-validated", analysis_id=analysis_id) + return stub_analysis_state(problem_validated=False) + return state diff --git a/src/agents/handlers/__init__.py b/src/agents/handlers/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..94c86057031f4d146d13f1c38aa0bc4ce2276787 --- /dev/null +++ b/src/agents/handlers/__init__.py @@ -0,0 +1 @@ +"""Deterministic skill handlers dispatched by the orchestrator (non-LLM).""" diff --git a/src/agents/handlers/check.py b/src/agents/handlers/check.py new file mode 100644 index 0000000000000000000000000000000000000000..fe8823b9d8a48d75c6b1060ec3f04c6ffb238981 --- /dev/null +++ b/src/agents/handlers/check.py @@ -0,0 +1,165 @@ +"""`check` skill handler — deterministic data/document inventory (no LLM). + +The router emits a single `check` intent; this handler picks the concrete tool +(`check_data` for structured sources, `check_knowledge` for documents) and renders +the tool's `ToolOutput` table into a markdown reply. Broad queries with no +specific cue call both tools concurrently and stitch a helicopter-view inventory. +See `ORCHESTRATOR_REWORK_PLAN.md` §2. + +The data-access invoker never throws (§8.4); `render_tool_output` handles the +`error` envelope defensively. +""" + +from __future__ import annotations + +import asyncio +import re +from typing import TYPE_CHECKING + +from src.tools.contracts import ToolOutput + +if TYPE_CHECKING: + from src.agents.slow_path.invoker import ToolInvoker + +# Cues that point at documents rather than structured data. +_KNOWLEDGE_CUES = ( + "document", + "docs", + "doc ", + "file", + "pdf", + "docx", + ".txt", + "uploaded", + "knowledge", + "dokumen", +) + +# Cues that point at structured/tabular data specifically. +_DATA_CUES = ( + "kolom", + "column", + "tabel", + "table", + "baris", + "row", + "schema", + "skema", + "database", + " db", +) + + +def _intent(message: str) -> str: + """Return 'knowledge', 'data', or 'both' (helicopter view) from keyword cues.""" + lowered = message.lower() + is_knowledge = any(cue in lowered for cue in _KNOWLEDGE_CUES) + is_data = any(cue in lowered for cue in _DATA_CUES) + if is_knowledge and not is_data: + return "knowledge" + if is_data and not is_knowledge: + return "data" + return "both" + + +def render_tool_output(out: ToolOutput) -> str: + """Render a `check_*` ToolOutput table into a markdown string, or '' if empty.""" + if out.kind == "error": + return f"Sorry, I couldn't look that up: {out.error}" + columns = out.columns or [] + rows = out.rows or [] + if not rows: + return "" + header = "| " + " | ".join(columns) + " |" + separator = "| " + " | ".join("---" for _ in columns) + " |" + body = "\n".join( + "| " + " | ".join(str(cell) for cell in row) + " |" for row in rows + ) + return f"{header}\n{separator}\n{body}" + + +def _matched_source_ids(message: str, inventory: ToolOutput) -> list[str]: + """All source_ids whose name appears as a whole word in the message. + + The user names sources in plain words ("sales", "kolom sales sama orders"); + the tool needs exact `source_id`s. We resolve them against the inventory + rows (kind="table", columns include "source_id" + "name") instead of an LLM + — a cheap match against catalog metadata already in hand. Whole-word match + (`\\b`) avoids nuisance hits ("orders" inside "reorders") and treats `_` as + part of the word, so "sales" won't pick up "sales_archive". Multiple named + sources all match, so the caller can show each schema. + """ + if inventory.kind != "table" or not inventory.rows: + return [] + cols = inventory.columns or [] + try: + id_idx = cols.index("source_id") + name_idx = cols.index("name") + except ValueError: + return [] + + matched: list[str] = [] + for row in inventory.rows: + name = str(row[name_idx]) + if name and re.search(rf"\b{re.escape(name)}\b", message, re.IGNORECASE): + matched.append(str(row[id_idx])) + return matched + + +def _render_helicopter(data_out: ToolOutput, knowledge_out: ToolOutput) -> str: + """Stitch structured + document inventory into one helicopter-view reply.""" + parts: list[str] = [] + + data_table = render_tool_output(data_out) + if data_table: + parts.append(f"**Data terstruktur**\n{data_table}") + + knowledge_table = render_tool_output(knowledge_out) + if knowledge_table: + parts.append(f"**Dokumen**\n{knowledge_table}") + + if not parts: + return "Nothing registered yet — I don't see any sources or documents." + + return "\n\n".join(parts) + + +async def run_check(message: str, invoker: ToolInvoker) -> str: + """Route to check_data, check_knowledge, or both (helicopter view) based on cues.""" + intent = _intent(message) + + _no_match = "Nothing registered yet — I don't see any matching sources." + + if intent == "knowledge": + out = await invoker.invoke("check_knowledge", {}) + return render_tool_output(out) or _no_match + + if intent == "data": + inventory = await invoker.invoke("check_data", {}) + if inventory.kind == "error": + return render_tool_output(inventory) + # Drill down to the schema of each source the user named; if they named + # none, return the source listing. + source_ids = _matched_source_ids(message, inventory) + if not source_ids: + return render_tool_output(inventory) or _no_match + schemas = await asyncio.gather( + *(invoker.invoke("check_data", {"source_id": sid}) for sid in source_ids) + ) + if len(schemas) == 1: + return render_tool_output(schemas[0]) or _no_match + # Multiple named sources → one labelled section per source. + sections: list[str] = [] + for out in schemas: + table = render_tool_output(out) + if table: + label = (out.meta or {}).get("source_name") or "source" + sections.append(f"**{label}**\n{table}") + return "\n\n".join(sections) or _no_match + + # broad / ambiguous → helicopter view: call both concurrently + data_out, knowledge_out = await asyncio.gather( + invoker.invoke("check_data", {}), + invoker.invoke("check_knowledge", {}), + ) + return _render_helicopter(data_out, knowledge_out) diff --git a/src/agents/handlers/help.py b/src/agents/handlers/help.py new file mode 100644 index 0000000000000000000000000000000000000000..016c5391dd86ceac1822ab18a9782ed125d5aef4 --- /dev/null +++ b/src/agents/handlers/help.py @@ -0,0 +1,192 @@ +"""`help` skill handler — state-aware next-step guidance (LLM call). + +Reads the per-analysis state + chat history (and a deterministic report-readiness +signal) and tells the user where they are and what to do next. Help only guides; +it never runs analysis or produces data answers. + +The prompt lives in `config/prompts/help.md` (the playbook); this module composes +the context and streams the LLM answer, mirroring `ChatbotAgent`. The **consistency +guard** has teeth here, not just in the prompt: `_derive_available_actions` computes +the actions actually allowed from the state (the same policy as `gate.py`), and that +list is fed into the prompt — the LLM is told to suggest *only* those, so it can't +tell the user to generate a report when the goal isn't validated or the analysis +isn't ready. + +SEAMS: + - `AnalysisState` is the locked 8-field contract from `gate.py` (KM-652). The gate, + this skill, and tests share `gate.stub_analysis_state(...)` so they exercise the + same shape. + - `ReportReadiness` is the return shape of `is_report_ready(chat_history)` (seam #5, + Rifqi — not built yet). Help *consumes* it; it does not compute it. Until it lands, + the caller passes a stub (default: not ready). +""" + +from __future__ import annotations + +from collections.abc import AsyncIterator +from dataclasses import dataclass, field +from pathlib import Path +from typing import Any + +from langchain_core.messages import BaseMessage +from langchain_core.output_parsers import StrOutputParser +from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder +from langchain_core.runnables import Runnable +from langchain_openai import AzureChatOpenAI + +from src.agents.gate import AnalysisState +from src.middlewares.logging import get_logger + +logger = get_logger("help") + +_PROMPT_DIR = Path(__file__).resolve().parent.parent.parent / "config" / "prompts" +_SYSTEM_PROMPT_PATH = _PROMPT_DIR / "help.md" +_GUARDRAILS_PATH = _PROMPT_DIR / "guardrails.md" + +# Neutral human turn when Help is triggered by a slash command with no real content. +_DEFAULT_TRIGGER = "What should I do next?" + + +@dataclass +class ReportReadiness: + """Deterministic report-readiness signal — the return of Rifqi's `is_report_ready`. + + `missing` lists the gaps to fill when `ready` is False. + """ + + ready: bool = False + missing: list[str] = field(default_factory=list) + + +def _derive_available_actions(state: AnalysisState, report_ready: ReportReadiness) -> list[str]: + """Actions Help is allowed to suggest, derived from state (mirrors `gate.py`). + + This is the consistency guard's teeth: analysis is gated behind a validated goal + (same rule the gate applies to `structured_flow`), and a report is only offered + when the readiness signal says so. Keep this policy in sync with `gate.gate`. + """ + if not state.problem_validated: + # Goal not set → the only useful move is defining the problem statement. + return ["define_problem_statement"] + + actions = ["ask_analysis_question", "refine_problem_statement"] + if report_ready.ready: + actions.append("generate_report") + return actions + + +def _format_state(state: AnalysisState) -> str: + """Render the analysis state as a compact context block for the LLM.""" + has_report = "yes" if state.report_id else "no" + return ( + "[Analysis state]\n" + f"analysis_title: {state.analysis_title or '(none)'}\n" + f"problem_statement: {state.problem_statement or '(empty)'}\n" + f"problem_validated: {str(state.problem_validated).lower()}\n" + f"has_report: {has_report}" + ) + + +def _format_report_ready(report_ready: ReportReadiness) -> str: + missing = ", ".join(report_ready.missing) if report_ready.missing else "(none)" + return ( + "[Report readiness]\n" + f"ready: {str(report_ready.ready).lower()}\n" + f"missing: {missing}" + ) + + +def _build_context_block( + state: AnalysisState, + report_ready: ReportReadiness, + available_actions: list[str], +) -> str: + """Compose the deterministic context the prompt's 'never misguide' rule trusts.""" + return "\n\n".join( + [ + _format_state(state), + _format_report_ready(report_ready), + "[Available actions]\n" + ", ".join(available_actions), + ] + ) + + +def _load_system_prompt() -> str: + """Compose system prompt = help.md + guardrails.md (guardrails last, as elsewhere).""" + help_md = _SYSTEM_PROMPT_PATH.read_text(encoding="utf-8") + guardrails = _GUARDRAILS_PATH.read_text(encoding="utf-8") + return f"{help_md}\n\n{guardrails}" + + +def _build_default_chain() -> Runnable: + from src.config.settings import settings + + llm = AzureChatOpenAI( + azure_deployment=settings.azureai_deployment_name_4o, + openai_api_version=settings.azureai_api_version_4o, + azure_endpoint=settings.azureai_endpoint_url_4o, + api_key=settings.azureai_api_key_4o, + temperature=0.3, + model_kwargs={"stream_options": {"include_usage": True}}, + ) + prompt = ChatPromptTemplate.from_messages( + [ + ("system", _load_system_prompt()), + MessagesPlaceholder(variable_name="history", optional=True), + ("human", "{message}"), + ("system", "Analysis state and signals for this turn:\n\n{context}"), + ] + ) + return prompt | llm | StrOutputParser() + + +class HelpAgent: + """Streams state-aware guidance to the user. + + `chain` is injectable: tests pass a fake that yields canned tokens. Default + constructs the production Azure OpenAI streaming chain on first use. + """ + + def __init__(self, chain: Runnable | None = None) -> None: + self._chain = chain + + def _ensure_chain(self) -> Runnable: + if self._chain is None: + self._chain = _build_default_chain() + return self._chain + + async def astream( + self, + state: AnalysisState, + history: list[BaseMessage] | None = None, + report_ready: ReportReadiness | None = None, + message: str | None = None, + available_actions: list[str] | None = None, + callbacks: list | None = None, + ) -> AsyncIterator[str]: + """Stream tokens of the guidance reply. + + `report_ready` defaults to "not ready" so a missing signal degrades safely. + `available_actions`, when omitted, is derived deterministically from state. + """ + readiness = report_ready or ReportReadiness() + actions = available_actions or _derive_available_actions(state, readiness) + logger.info( + "help guidance", + problem_validated=state.problem_validated, + report_ready=readiness.ready, + available_actions=actions, + ) + + chain = self._ensure_chain() + payload: dict[str, Any] = { + "message": message or _DEFAULT_TRIGGER, + "history": history or [], + "context": _build_context_block(state, readiness, actions), + } + if callbacks: + async for token in chain.astream(payload, config={"callbacks": callbacks}): + yield token + else: + async for token in chain.astream(payload): + yield token diff --git a/src/agents/handlers/problem_statement.py b/src/agents/handlers/problem_statement.py new file mode 100644 index 0000000000000000000000000000000000000000..71597180c7b42194b95b6491d217f320c3793ef6 --- /dev/null +++ b/src/agents/handlers/problem_statement.py @@ -0,0 +1,171 @@ +"""Problem Statement skill — guide the user to a usable problem statement. + +Routed by the orchestrator (intent `problem_statement`) and callable as a skill. +An LLM drafts/refines the statement from the analysis title + the user's message and +declares what's still `missing`; a check validates only when nothing is missing. The +model is instructed to fill `objective`/`metric` ONLY from what the user explicitly +stated — a bare data question ("which X has the most Y?") leaves them in `missing`, so +it does not auto-validate (the gate stays meaningful). On a valid draft it persists +`problem_statement` + `problem_validated=True`; otherwise it streams guidance and +leaves the analysis un-validated. + +NOTE: completeness is still a (hardened) LLM judgment — the truly deterministic gate +is an explicit user confirmation, planned with the frontend (see T3b / #11). + +See `ORCHESTRATOR_REWORK_PLAN.md` §4 and the 2026-06-18 checkpoint. +""" + +from __future__ import annotations + +from pathlib import Path +from typing import TYPE_CHECKING + +from langchain_core.messages import BaseMessage +from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder +from langchain_core.runnables import Runnable +from langchain_openai import AzureChatOpenAI +from pydantic import BaseModel, Field + +from src.middlewares.logging import get_logger + +if TYPE_CHECKING: + from src.agents.state_store import AnalysisStateStore + +logger = get_logger("problem_statement") + +_PROMPT_PATH = ( + Path(__file__).resolve().parent.parent.parent + / "config" + / "prompts" + / "problem_statement.md" +) + + +class ProblemStatementDraft(BaseModel): + """LLM output for the Problem Statement skill.""" + + problem_statement: str = Field( + ..., description="The refined, standalone problem statement (never empty)." + ) + objective: str = Field( + "", description="What success looks like — fill ONLY when the user explicitly " + "stated it; never inferred from a data question. Empty otherwise." + ) + metric: str = Field( + "", description="The KPI to move/investigate — fill ONLY when the user " + "explicitly stated it; never inferred from a data question. Empty otherwise." + ) + missing: list[str] = Field( + default_factory=list, + description="Which of 'objective' / 'metric' the user has NOT explicitly stated " + "yet. A bare data question leaves both here. Empty list = complete.", + ) + feedback: str = Field( + ..., + description="Message to the user — guidance if incomplete, confirmation if complete.", + ) + + +def is_valid(draft: ProblemStatementDraft) -> bool: + """Complete iff there's a statement and the model flagged nothing missing. + + Keying on the model's explicit `missing` list (rather than 'are objective/metric + non-empty') is what stops a bare data question from auto-validating: the hardened + prompt puts the un-stated parts in `missing`, so this returns False for it. + """ + return bool(draft.problem_statement.strip()) and not draft.missing + + +def _load_prompt_text() -> str: + return _PROMPT_PATH.read_text(encoding="utf-8") + + +def _build_default_chain() -> Runnable: + from src.config.settings import settings + + llm = AzureChatOpenAI( + azure_deployment=settings.azureai_deployment_name_4o, + openai_api_version=settings.azureai_api_version_4o, + azure_endpoint=settings.azureai_endpoint_url_4o, + api_key=settings.azureai_api_key_4o, + temperature=0, + ) + prompt = ChatPromptTemplate.from_messages( + [ + ("system", _load_prompt_text()), + MessagesPlaceholder(variable_name="history", optional=True), + ( + "human", + "Analysis title: {analysis_title}\n" + "Current problem statement: {current}\n\n" + "User message: {message}", + ), + ] + ) + return prompt | llm.with_structured_output(ProblemStatementDraft) + + +class ProblemStatementAgent: + """Single LLM call that drafts/refines a problem statement. + + Inject `chain` for tests; the default builds the Azure OpenAI chain on first use. + """ + + def __init__(self, chain: Runnable | None = None) -> None: + self._chain = chain + + def _ensure_chain(self) -> Runnable: + if self._chain is None: + self._chain = _build_default_chain() + return self._chain + + async def draft( + self, + message: str, + analysis_title: str, + current: str, + history: list[BaseMessage] | None = None, + ) -> ProblemStatementDraft: + chain = self._ensure_chain() + return await chain.ainvoke( + { + "message": message, + "analysis_title": analysis_title, + "current": current, + "history": history or [], + } + ) + + +async def run_problem_statement( + message: str, + analysis_id: str | None, + *, + agent: ProblemStatementAgent, + store: AnalysisStateStore, + history: list[BaseMessage] | None = None, +) -> str: + """Draft + validate the problem statement; persist on a valid draft. + + Loads the current title/statement (if the analysis exists), drafts a refinement, + runs the deterministic completeness check, and writes `problem_statement` + + `problem_validated` back. Returns the user-facing feedback. When `analysis_id` is + missing (e.g. a legacy room), it still drafts + returns guidance but cannot persist. + """ + analysis_title, current = "New analysis", "" + if analysis_id: + state = await store.get(analysis_id) + if state is not None: + analysis_title, current = state.analysis_title, state.problem_statement + + draft = await agent.draft(message, analysis_title, current, history) + validated = is_valid(draft) + + if analysis_id: + await store.update( + analysis_id, + problem_statement=draft.problem_statement, + problem_validated=validated, + ) + logger.info("problem_statement drafted", analysis_id=analysis_id, validated=validated) + return draft.feedback diff --git a/src/agents/orchestration.py b/src/agents/orchestration.py index 53f702d7c267660d038f5ce88b803f304e3b33e0..ebe8990d7e32e4104825c65f82c65f94c2a226bc 100644 --- a/src/agents/orchestration.py +++ b/src/agents/orchestration.py @@ -1,13 +1,17 @@ -"""OrchestratorAgent — classifies a user message and emits source_hint. +"""OrchestratorAgent — classifies a user message into one of six intents. -Output: needs_search (bool) + source_hint ∈ { chat, unstructured, structured } -+ rewritten_query (standalone form of the user's question, history-resolved). +Output: RouterDecision { intent, rewritten_query, confidence }. -Phase 2 replaces the previous intent-classification body. The class name -is preserved so existing import sites (`from src.agents.orchestration -import OrchestratorAgent`) keep working. The default LLM chain is -constructed lazily so the module is import-safe even without `.env` -populated. +The router is a **handler-level** intent classifier, not a data-modality +classifier: `structured_flow` routes to the slow Planner spine and +`unstructured_flow` to the fast RAG path; the structured/unstructured data mix on +the slow path is the Planner's job, not the router's. See +`ORCHESTRATOR_REWORK_PLAN.md`. + +The class name `OrchestratorAgent` is preserved so existing import sites +(`from src.agents.orchestration import OrchestratorAgent`) keep working. The +default LLM chain is built lazily so the module is import-safe even without +`.env` populated. """ from __future__ import annotations @@ -25,7 +29,14 @@ from src.middlewares.logging import get_logger logger = get_logger("orchestrator") -SourceHint = Literal["chat", "unstructured", "structured"] +Intent = Literal[ + "chat", + "help", + "problem_statement", + "check", + "unstructured_flow", + "structured_flow", +] _PROMPT_PATH = ( Path(__file__).resolve().parent.parent @@ -35,21 +46,29 @@ _PROMPT_PATH = ( ) -class IntentRouterDecision(BaseModel): +class RouterDecision(BaseModel): """LLM output. Pydantic so it can be used with `with_structured_output`.""" - needs_search: bool = Field( - ..., description="True if we must look at the user's data to answer." - ) - source_hint: SourceHint = Field( + intent: Intent = Field( ..., - description="Which downstream path: 'chat' (no lookup), " - "'unstructured' (PDF/DOCX/TXT prose), 'structured' (DB / tabular file).", + description=( + "Handler route for this message: 'chat' (conversational, no data), " + "'help' (what-to-do-next guidance), 'problem_statement' (define or " + "refine the analysis goal), 'check' (inventory: what data/documents " + "exist), 'unstructured_flow' (answer from documents, fast RAG), or " + "'structured_flow' (analytical question over data, slow Planner path)." + ), ) rewritten_query: str | None = Field( None, - description="Standalone version of the question, history-resolved. " - "Null when needs_search=false.", + description=( + "Standalone version of the question, history-resolved. Null for " + "'chat' and 'help' (no data lookup needed)." + ), + ) + confidence: float | None = Field( + None, + description="Classifier confidence in [0, 1]. Optional.", ) @@ -74,11 +93,11 @@ def _build_default_chain() -> Runnable: ("human", "{message}"), ] ) - return prompt | llm.with_structured_output(IntentRouterDecision) + return prompt | llm.with_structured_output(RouterDecision) class OrchestratorAgent: - """Classifies a user message into chat / unstructured / structured. + """Classifies a user message into one of the six router intents. Inject `structured_chain` for tests; default builds the production Azure OpenAI chain on first use. @@ -97,18 +116,14 @@ class OrchestratorAgent: message: str, history: list[BaseMessage] | None = None, callbacks: list | None = None, - ) -> IntentRouterDecision: + ) -> RouterDecision: chain = self._ensure_chain() payload = {"message": message, "history": history or []} if callbacks: - decision: IntentRouterDecision = await chain.ainvoke( + decision: RouterDecision = await chain.ainvoke( payload, config={"callbacks": callbacks} ) else: decision = await chain.ainvoke(payload) - logger.info( - "intent classified", - source_hint=decision.source_hint, - needs_search=decision.needs_search, - ) + logger.info("intent classified", intent=decision.intent) return decision diff --git a/src/agents/planner/examples.py b/src/agents/planner/examples.py index ca701e6ecd56f3c44ea801f8dd8833cdb9a4841d..2bb00919e2200b16be7dc121abc41640b18df8e1 100644 --- a/src/agents/planner/examples.py +++ b/src/agents/planner/examples.py @@ -2,7 +2,7 @@ Two illustrative (question -> TaskList) pairs that teach the OUTPUT SHAPE: stages, dependency edges, ordered tool-call chains, inline QueryIR, -"${t}" placeholders, and the assumed data-flow convention — `query_structured` +"${t}" placeholders, and the assumed data-flow convention — `retrieve_data` pulls rows, then a composite `analyze_*` tool consumes them via a `data` placeholder referencing the upstream result's column aliases (Pattern A; the tool team may instead pick self-fetch by `source_id`, in which case these examples are reshaped @@ -21,9 +21,8 @@ from .schemas import Task, TaskList, ToolCall # --------------------------------------------------------------------------- # # Example A — exploratory, no modeling. # "Which product categories drove last quarter's revenue?" -# Shows: query_structured pulls rows -> analyze_contribution computes each -# category's share of the total in one call (no manual per-category + total -# queries). +# Shows: retrieve_data pulls rows -> analyze_aggregate sums revenue per +# category in one call (no manual per-category queries). # --------------------------------------------------------------------------- # _EXAMPLE_A = TaskList( @@ -36,7 +35,7 @@ _EXAMPLE_A = TaskList( id="t1", stage="data_understanding", objective="Confirm the sales source exposes category, revenue, and order date.", - tool_calls=[ToolCall(tool="describe_source", args={"source_id": "src_sales"})], + tool_calls=[ToolCall(tool="check_data", args={"source_id": "src_sales"})], expected_output="source_shape", success_criteria="Produced the orders table schema; the 3 needed columns are present.", depends_on=[], @@ -48,7 +47,7 @@ _EXAMPLE_A = TaskList( objective="Pull last quarter's order-level category and revenue rows.", tool_calls=[ ToolCall( - tool="query_structured", + tool="retrieve_data", args={ "ir": { "source_id": "src_sales", @@ -78,20 +77,19 @@ _EXAMPLE_A = TaskList( Task( id="t3", stage="evaluation", - objective="Rank each category's revenue share of the quarter total.", + objective="Sum revenue per category for the quarter.", tool_calls=[ ToolCall( - tool="analyze_contribution", + tool="analyze_aggregate", args={ "data": "${t2}", - "dimension": "category", - "value_column": "revenue", - "agg": "sum", + "aggregations": {"revenue": ["sum"]}, + "group_by": ["category"], }, ) ], - expected_output="category_contribution", - success_criteria="Produced each category's revenue share, ranked high to low.", + expected_output="category_revenue", + success_criteria="Produced total revenue per category, one row each.", depends_on=["t2"], estimated_cost="low", ), @@ -113,7 +111,7 @@ _EXAMPLE_B = TaskList( id="t1", stage="data_understanding", objective="Confirm the sales source exposes order date, revenue, and region.", - tool_calls=[ToolCall(tool="describe_source", args={"source_id": "src_sales"})], + tool_calls=[ToolCall(tool="check_data", args={"source_id": "src_sales"})], expected_output="source_shape", success_criteria="Produced the orders table schema; the needed columns are present.", depends_on=[], @@ -125,7 +123,7 @@ _EXAMPLE_B = TaskList( objective="Pull this year's order dates, revenue, and region.", tool_calls=[ ToolCall( - tool="query_structured", + tool="retrieve_data", args={ "ir": { "source_id": "src_sales", @@ -189,8 +187,8 @@ _EXAMPLE_B = TaskList( # Example C — mixed structured + unstructured. # "Revenue dipped in Q1 — what happened?" # Shows: a structured branch (query -> analyze_trend) runs alongside an -# INDEPENDENT retrieve_documents branch that pulls qualitative context. Note -# retrieve_documents takes a natural-language `query` (NOT a `${t}` data +# INDEPENDENT retrieve_knowledge branch that pulls qualitative context. Note +# retrieve_knowledge takes a natural-language `query` (NOT a `${t}` data # placeholder — it is a source, not a consumer) and can run in parallel; the # Assembler folds the document context into the explanation. # --------------------------------------------------------------------------- # @@ -205,7 +203,7 @@ _EXAMPLE_C = TaskList( id="t1", stage="data_understanding", objective="Confirm the sales source exposes order date and revenue.", - tool_calls=[ToolCall(tool="describe_source", args={"source_id": "src_sales"})], + tool_calls=[ToolCall(tool="check_data", args={"source_id": "src_sales"})], expected_output="source_shape", success_criteria="Produced the orders table schema; date and revenue columns present.", depends_on=[], @@ -217,7 +215,7 @@ _EXAMPLE_C = TaskList( objective="Pull Q1 order dates and revenue.", tool_calls=[ ToolCall( - tool="query_structured", + tool="retrieve_data", args={ "ir": { "source_id": "src_sales", @@ -275,7 +273,7 @@ _EXAMPLE_C = TaskList( objective="Retrieve qualitative context on Q1 operational events behind a dip.", tool_calls=[ ToolCall( - tool="retrieve_documents", + tool="retrieve_knowledge", args={ "query": "operational issues, outages, or notable events in Q1 2026", "top_k": 5, @@ -310,7 +308,7 @@ _EXAMPLE_D = TaskList( id="t1", stage="data_understanding", objective="Confirm the sales source exposes region and revenue.", - tool_calls=[ToolCall(tool="describe_source", args={"source_id": "src_sales"})], + tool_calls=[ToolCall(tool="check_data", args={"source_id": "src_sales"})], expected_output="source_shape", success_criteria="Produced the orders table schema; region and revenue present.", depends_on=[], @@ -322,7 +320,7 @@ _EXAMPLE_D = TaskList( objective="Pull order-level region and revenue.", tool_calls=[ ToolCall( - tool="query_structured", + tool="retrieve_data", args={ "ir": { "source_id": "src_sales", diff --git a/src/agents/planner/inputs.py b/src/agents/planner/inputs.py index 95496c628c353b8145f40b00864b66c6233b4769..7e7f56dc99539b43e5269ed7097d9e2a3847343b 100644 --- a/src/agents/planner/inputs.py +++ b/src/agents/planner/inputs.py @@ -4,9 +4,9 @@ for the planner prompt. It carries every table + column id/type/PII flag + row counts + low-cardinality top_values, with `sample_values` nulled on PII columns (INV: no PII sample values into the prompt, see doc §13). It also lists the -available unstructured sources so the planner can plan `retrieve_documents`. +available unstructured sources so the planner can plan `retrieve_knowledge`. -The planner *validator* still checks inline `query_structured` IRs against the +The planner *validator* still checks inline `retrieve_data` IRs against the full `Catalog` via the existing IRValidator — the summary is a prompt input, not the validation source of truth. @@ -124,7 +124,7 @@ class CatalogSummary(BaseModel): lines.append("") if self.unstructured_sources: - lines.append("Unstructured sources (for retrieve_documents):") + lines.append("Unstructured sources (for retrieve_knowledge):") for src in self.unstructured_sources: lines.append(f" - {src.name} — id={src.source_id}") diff --git a/src/agents/planner/registry.py b/src/agents/planner/registry.py index 4c7783ec8af1dca2b1c270ba0931b7ced52f3e65..0da2d3608c182f9cfabef60b0796ab6c0d8a4039 100644 --- a/src/agents/planner/registry.py +++ b/src/agents/planner/registry.py @@ -7,8 +7,8 @@ outside it). `src/tools/registry.py::analytics_registry()` (KM-628), built on the canonical `ToolSpec` (`src/tools/contracts.py`, KM-465/KM-627) and the prompt-style tool descriptions (KM-625). No longer a stub on our side — it tracks the real registry. -- **Data access (`query_structured` / `retrieve_documents` / `list_sources` / - `describe_source`) — spec BODIES still a local stub.** The tool team owns these too, +- **Data access (`retrieve_data` / `retrieve_knowledge` / `check_data` / + `check_knowledge`) — spec BODIES still a local stub.** The tool team owns these too, but their wrappers + `ToolSpec`s haven't landed yet (KM-465 #4). We keep best-guess spec bodies here so the Planner can plan end-to-end — but the NAMES derive from `src.tools.data_access.DATA_ACCESS_TOOLS` (R11), so a tool rename/addition upstream @@ -16,10 +16,10 @@ outside it). this slice and swap `default_registry()` for the tool team's full composition. **Confirmed conventions (KM-465):** Pattern A — `analyze_*` tools take a `data` -`"${t}"` placeholder pointing at an upstream `query_structured` output (no +`"${t}"` placeholder pointing at an upstream `retrieve_data` output (no self-fetch); resolved to a DataFrame at execution time. `input_schema` is the lightweight `{required, properties}` dict the planner validator (check #8) reads; -`query_structured.args["ir"]` carries an inline QueryIR validated against the +`retrieve_data.args["ir"]` carries an inline QueryIR validated against the catalog by the existing IRValidator. See AGENT_ARCHITECTURE_CONTEXT_new.md §9.2 / §9.3. @@ -38,25 +38,32 @@ from .contracts import ToolRegistry, ToolSpec # --------------------------------------------------------------------------- # _DATA_ACCESS_SPEC_BODIES: tuple[ToolSpec, ...] = ( ToolSpec( - name="query_structured", + name="retrieve_data", category="analytics.query", input_schema={"required": ["ir"], "properties": {"ir": {"type": "object"}}}, output_kind="table", description=( - "Run one validated, single-table query against a structured source (DB " - "schema or tabular file) and return rows. The `ir` argument is an inline " - "QueryIR (the JSON intent: source_id, table_id, select, filters, group_by, " - "order_by, limit) — never SQL. This is the data-access entry point: use it " - "to select, filter, and pull the rows the analytics (`analyze_*`) tools " - "then consume. It also does simple built-in aggregation the IR can express " - "(count/sum/avg/min/max/count_distinct). Do NOT use it for richer statistics " + "Run one validated query against a structured source and return rows. The " + "`ir` argument is an inline QueryIR (the JSON intent: source_id, table_id, " + "joins, select, filters, group_by, order_by, limit) — never SQL. This is the " + "data-access entry point: use it to select, filter, and pull the rows the " + "analytics (`analyze_*`) tools then consume. It also does simple built-in " + "aggregation the IR can express (count/sum/avg/min/max/count_distinct). " + "JOINS (database sources only): to group a measure in one table by a " + "dimension in a RELATED table, add a `joins` entry " + "({target_table_id, left_column_id, right_column_id}) along a declared " + "foreign key — e.g. sum order_items.line_total grouped by products.category " + "via order_items.product_id = products.id. Prefer an existing measure column " + "(e.g. line_total) over recomputing, and a single table when the measure and " + "dimension already live together. Joins are NOT supported on tabular/file " + "sources yet. Do NOT use this for richer statistics " "(median/percentile/mode/stddev/skew → analyze_descriptive), trends " "(analyze_trend), correlation, segmentation, or share-of-total; and do NOT " - "use it to read documents (use retrieve_documents)." + "use it to read documents (use retrieve_knowledge)." ), ), ToolSpec( - name="retrieve_documents", + name="retrieve_knowledge", category="retrieval.documents", input_schema={ "required": ["query"], @@ -71,32 +78,36 @@ _DATA_ACCESS_SPEC_BODIES: tuple[ToolSpec, ...] = ( "Dense-retrieve the most relevant chunks from the user's unstructured " "sources (PDF/DOCX/TXT) for a natural-language `query`. Use this to pull " "qualitative context into an analysis. Optionally scope to one `source_id`. " - "Do NOT use it for numbers in tables — that is query_structured's job." + "Do NOT use it for numbers in tables — that is retrieve_data's job." ), ), ToolSpec( - name="list_sources", + name="check_data", category="catalog.introspection", - input_schema={"required": [], "properties": {}}, + input_schema={ + "required": [], + "properties": {"source_id": {"type": "string"}}, + }, output_kind="table", description=( - "List the user's available data sources (id, name, type, table count). Use " - "early in data_understanding when the plan must discover what exists before " - "querying. Cheap. Do NOT use it to read column details (use describe_source)." + "Inspect the user's structured data sources (DB + tabular). With no " + "arguments, lists the sources (id, name, type, table count) — use early in " + "data_understanding to discover what exists. With a `source_id`, returns that " + "source's tables and columns (names, types, row counts) — use to confirm a " + "source's shape before querying it. Cheap. Do NOT use it to fetch data rows " + "(use retrieve_data) or to inspect documents (use check_knowledge)." ), ), ToolSpec( - name="describe_source", + name="check_knowledge", category="catalog.introspection", - input_schema={ - "required": ["source_id"], - "properties": {"source_id": {"type": "string"}}, - }, + input_schema={"required": [], "properties": {}}, output_kind="table", description=( - "Return the tables and columns (names, types, row counts) of one source by " - "`source_id`. Use in data_understanding to confirm the shape of a source " - "before querying it. Do NOT use it to fetch data rows (use query_structured)." + "List the user's unstructured sources / documents (id, name, type). Use in " + "data_understanding to discover what qualitative material exists before " + "retrieving from it. Do NOT use it to read document content (use " + "retrieve_knowledge) or to inspect structured data (use check_data)." ), ), ) diff --git a/src/agents/planner/service.py b/src/agents/planner/service.py index 37bfb55044a58f155240cb0391bae05527782ca6..1c47fd3cb7f53468e07230ab92ff433a5399381e 100644 --- a/src/agents/planner/service.py +++ b/src/agents/planner/service.py @@ -9,7 +9,7 @@ static plan. The service takes the full `Catalog` (not just a `CatalogSummary`): it derives the PII-safe `CatalogSummary` for the prompt, but validation needs the full -catalog so the existing `IRValidator` can check inline `query_structured` IRs. +catalog so the existing `IRValidator` can check inline `retrieve_data` IRs. See AGENT_ARCHITECTURE_CONTEXT_new.md §7.3. """ diff --git a/src/agents/planner/validator.py b/src/agents/planner/validator.py index c4333886a30c37225ef0dfef9dfc0d96563827d3..a277a2b90e392abea67fba65ef78b6f52b039759 100644 --- a/src/agents/planner/validator.py +++ b/src/agents/planner/validator.py @@ -95,8 +95,8 @@ class PlannerValidator: f"source_id {src!r} (known: {sorted(known_sources)})" ) - # Check 8b — inline query_structured IR validates against the catalog. - if call.tool == "query_structured": + # Check 8b — inline retrieve_data IR validates against the catalog. + if call.tool == "retrieve_data": self._validate_inline_ir(task.id, call.args, catalog) # Check 7 — success_criteria is checkable. @@ -114,20 +114,20 @@ class PlannerValidator: raw_ir = args.get("ir") if not isinstance(raw_ir, dict): raise PlannerValidationError( - f"task {task_id}: query_structured.args.ir must be an inline QueryIR " + f"task {task_id}: retrieve_data.args.ir must be an inline QueryIR " f"object, got {type(raw_ir).__name__}" ) try: ir = QueryIR.model_validate(raw_ir) except ValidationError as e: raise PlannerValidationError( - f"task {task_id}: query_structured.args.ir is not a valid QueryIR: {e}" + f"task {task_id}: retrieve_data.args.ir is not a valid QueryIR: {e}" ) from e try: self._ir_validator.validate(ir, catalog) except IRValidationError as e: raise PlannerValidationError( - f"task {task_id}: query_structured IR failed catalog validation: {e}" + f"task {task_id}: retrieve_data IR failed catalog validation: {e}" ) from e @staticmethod diff --git a/src/agents/report/__init__.py b/src/agents/report/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..7024c4b0aa439f0c4740109243aa780e3a384845 --- /dev/null +++ b/src/agents/report/__init__.py @@ -0,0 +1,9 @@ +"""Report generator (KM-644). + +A button-triggered *service* — not a chat skill, not a slow-path agent. It turns a +session's persisted `AnalysisRecord`s + Problem Statement into a versioned, +business-readable `AnalysisReport`. Architecturally it mirrors the Assembler: one +constrained LLM call (the executive summary) wrapped in deterministic assembly that +copies every other field verbatim from the records (INV-4). Reports are immutable +per version and persisted to the `analysis_reports` table. +""" diff --git a/src/agents/report/errors.py b/src/agents/report/errors.py new file mode 100644 index 0000000000000000000000000000000000000000..28052ff6496c9c2c4de540b4ab30d2ae264cff50 --- /dev/null +++ b/src/agents/report/errors.py @@ -0,0 +1,7 @@ +"""Typed errors for the report generator (KM-644).""" + +from __future__ import annotations + + +class ReportError(Exception): + """The report could not be generated (e.g. no records for the analysis).""" diff --git a/src/agents/report/generator.py b/src/agents/report/generator.py new file mode 100644 index 0000000000000000000000000000000000000000..6f2ca6465d0a86766298c8a06cceb921891743c8 --- /dev/null +++ b/src/agents/report/generator.py @@ -0,0 +1,363 @@ +"""ReportGenerator — turns a session's AnalysisRecords into an AnalysisReport (KM-644). + +A button-triggered service shaped like the Assembler: deterministic assembly of the +records (findings/caveats/open_questions/data_sources/method_steps, copied verbatim — +INV-4) wrapped around exactly ONE LLM call that authors only the executive summary. +If that call fails the report is still returned with a deterministic fallback +summary (decision D1) — the deterministic body is the real value. + +Versioning + persistence live in `ReportStore`; this service does generation only +(returns an `AnalysisReport` with `version=0`; the store assigns the real version). +Chain construction mirrors `agents/slow_path/assembler.py`. +""" + +from __future__ import annotations + +from datetime import UTC, datetime +from pathlib import Path + +from langchain_core.messages import SystemMessage +from langchain_core.prompts import ChatPromptTemplate +from langchain_core.runnables import Runnable +from langchain_openai import AzureChatOpenAI + +from src.middlewares.logging import get_logger + +from ..slow_path.schemas import AnalysisRecord, TaskSummary +from .errors import ReportError +from .schemas import ( + AnalysisReport, + AttributedNote, + DataSourceRef, + ProblemStatement, + ReportFinding, + ReportSummaryNarrative, +) + +logger = get_logger("report_generator") + +_FALLBACK_SUMMARY = "Automated summary unavailable — see the findings below." + +# CRISP-DM phases in narrative order, with human labels for the method appendix. +_STAGE_LABELS: list[tuple[str, str]] = [ + ("data_understanding", "Data understanding"), + ("data_preparation", "Data preparation"), + ("modeling", "Modeling"), + ("evaluation", "Evaluation"), +] + +_PROMPT_PATH = ( + Path(__file__).resolve().parent.parent.parent / "config" / "prompts" / "report_summary.md" +) + + +def _load_prompt_text() -> str: + return _PROMPT_PATH.read_text(encoding="utf-8") + + +def _build_default_chain() -> Runnable: + from src.config.settings import settings + + llm = AzureChatOpenAI( + azure_deployment=settings.azureai_deployment_name_4o, + openai_api_version=settings.azureai_api_version_4o, + azure_endpoint=settings.azureai_endpoint_url_4o, + api_key=settings.azureai_api_key_4o, + temperature=0, + ) + prompt = ChatPromptTemplate.from_messages( + [ + SystemMessage(content=_load_prompt_text()), + ("human", "{human_content}"), + ] + ) + return prompt | llm.with_structured_output(ReportSummaryNarrative) + + +_default_chain: Runnable | None = None + + +def _get_default_chain() -> Runnable: + global _default_chain + if _default_chain is None: + _default_chain = _build_default_chain() + return _default_chain + + +# --------------------------------------------------------------------------- # +# Deterministic assembly (pure; no LLM, no I/O) — easy to unit-test. +# --------------------------------------------------------------------------- # + + +def _collect_findings(records: list[AnalysisRecord]) -> list[ReportFinding]: + # Findings are distinct insights — not deduped; each traces to its record. + return [ + ReportFinding(text=text, record_ids=[rec.record_id]) + for rec in records + for text in rec.findings + ] + + +def _collect_notes(records: list[AnalysisRecord], field: str) -> list[AttributedNote]: + # Caveats / open_questions are deduped by text; a merged note cites every + # record it came from (plural record_ids). + merged: dict[str, list[str]] = {} + for rec in records: + for text in getattr(rec, field): + ids = merged.setdefault(text, []) + if rec.record_id not in ids: + ids.append(rec.record_id) + return [AttributedNote(text=text, record_ids=ids) for text, ids in merged.items()] + + +def _collect_method_steps(records: list[AnalysisRecord]) -> list[TaskSummary]: + steps: list[TaskSummary] = [] + for rec in records: + steps.extend(rec.tasks_run) + return steps + + +def _build_data_sources( + records: list[AnalysisRecord], catalog, bound_ids: list[str] | None = None +) -> list[DataSourceRef]: + """Freeze real catalog metadata for the sources this analysis used. + + When the analysis has a data-source binding (#10), the candidate set is scoped + to the bound sources first (fail-open if the binding doesn't intersect the + catalog). Within that set, matches catalog sources against the records' + (narrative) `data_used` by name/id; falls back to all (bound) sources, then to + bare `data_used` strings if no catalog is available — so the section is always + populated, best-effort. + """ + if catalog is None or not catalog.sources: + seen: list[str] = [] + for rec in records: + for du in rec.data_used: + if du not in seen: + seen.append(du) + return [DataSourceRef(source_id=d, name=d, source_type="", detail={}) for d in seen] + + candidates = catalog.sources + if bound_ids: + scoped = [s for s in candidates if s.source_id in set(bound_ids)] + candidates = scoped or candidates # fail-open if binding doesn't match catalog + + def _ref(s) -> DataSourceRef: + return DataSourceRef( + source_id=s.source_id, + name=s.name, + source_type=s.source_type, + detail={ + "tables": [t.name for t in s.tables], + "row_count": sum((t.row_count or 0) for t in s.tables) or None, + "columns": [c.name for t in s.tables for c in t.columns], + }, + ) + + used = " ".join(du for rec in records for du in rec.data_used).lower() + matched = [ + _ref(s) + for s in candidates + if s.name.lower() in used or s.source_id.lower() in used + ] + return matched or [_ref(s) for s in candidates] + + +def _build_human_content( + ps: ProblemStatement, findings: list[ReportFinding], caveats: list[AttributedNote] +) -> str: + sections = [] + ps_lines = [v for v in (ps.objective, ps.target_value, ps.scope) if v] + if ps_lines: + sections.append("# Problem Statement\n" + "\n".join(ps_lines)) + sections.append( + "# Findings (already finalized — synthesize, do not add numbers)\n" + + "\n".join(f"- {f.text}" for f in findings) + ) + if caveats: + sections.append("# Caveats\n" + "\n".join(f"- {c.text}" for c in caveats)) + return "\n\n".join(sections) + + +def _render_markdown(report: AnalysisReport) -> str: + # Version is deliberately NOT in the markdown — it is assigned by the store + # after rendering and lives in the structured `version` field / API metadata. + parts: list[str] = ["# Analysis Report"] + parts.append( + f"*Generated {report.generated_at:%Y-%m-%d} · " + f"{len(report.record_ids)} analyses · {len(report.data_sources)} source(s)*" + ) + + ps = report.problem_statement + ps_lines = [v for v in (ps.objective, ps.target_value, ps.scope) if v] + if ps_lines: + parts.append("## Problem Statement\n" + " ".join(ps_lines)) + + if report.executive_summary: + parts.append("## Executive Summary\n" + report.executive_summary) + + if report.findings: + lines = ["## Key Findings"] + for i, f in enumerate(report.findings, 1): + cite = f" *({', '.join(f.record_ids)})*" if f.record_ids else "" + lines.append(f"{i}. {f.text}{cite}") + parts.append("\n".join(lines)) + + if report.caveats or report.open_questions: + lines = ["## Caveats & Open Questions"] + for n in report.caveats: + cite = f" *({', '.join(n.record_ids)})*" if n.record_ids else "" + lines.append(f"- {n.text}{cite}") + for n in report.open_questions: + cite = f" *({', '.join(n.record_ids)})*" if n.record_ids else "" + lines.append(f"- Open: {n.text}{cite}") + parts.append("\n".join(lines)) + + if report.data_sources: + lines = ["## Appendix A — Data Used", "| source | type | detail |", "|---|---|---|"] + for ds in report.data_sources: + d = ds.detail + bits = [] + if d.get("tables"): + bits.append("tables: " + ", ".join(d["tables"])) + if d.get("row_count"): + bits.append(f"{d['row_count']} rows") + if d.get("columns"): + bits.append(f"{len(d['columns'])} cols") + lines.append(f"| {ds.name} | {ds.source_type or '—'} | {' · '.join(bits) or '—'} |") + parts.append("\n".join(lines)) + + if report.method_steps: + lines = ["## Appendix B — Method"] + for stage_key, label in _STAGE_LABELS: + steps = [s for s in report.method_steps if s.stage == stage_key] + if not steps: + continue + rendered = "; ".join( + f"{', '.join(s.tools_used) or '—'} ({s.status})" for s in steps + ) + lines.append(f"**{label}** — {rendered}") + parts.append("\n".join(lines)) + + return "\n\n".join(parts) + + +# --------------------------------------------------------------------------- # +# Service +# --------------------------------------------------------------------------- # + + +class ReportGenerator: + """Generates an `AnalysisReport` from persisted records. Inject deps for tests.""" + + def __init__( + self, + record_store=None, + structured_chain: Runnable | None = None, + catalog_store=None, + binding_store=None, + ) -> None: + self._record_store = record_store + self._chain = structured_chain + self._catalog_store = catalog_store + self._binding_store = binding_store + + def _ensure_record_store(self): + if self._record_store is None: + from ..slow_path.store import PostgresAnalysisStore + + self._record_store = PostgresAnalysisStore() + return self._record_store + + def _ensure_chain(self) -> Runnable: + if self._chain is None: + self._chain = _get_default_chain() + return self._chain + + def _ensure_catalog_store(self): + if self._catalog_store is None: + from src.catalog.store import CatalogStore + + self._catalog_store = CatalogStore() + return self._catalog_store + + async def generate( + self, + analysis_id: str, + user_id: str | None = None, + problem_statement: ProblemStatement | None = None, + ) -> AnalysisReport: + records = await self._ensure_record_store().list_for_analysis(analysis_id) + if not records: + raise ReportError(f"no analyses recorded for {analysis_id!r} yet") + + ps = problem_statement or ProblemStatement() + findings = _collect_findings(records) + caveats = _collect_notes(records, "caveats") + open_questions = _collect_notes(records, "open_questions") + method_steps = _collect_method_steps(records) + bound_ids = await self._read_binding(analysis_id) + data_sources = _build_data_sources( + records, await self._read_catalog(user_id), bound_ids + ) + executive_summary = await self._summarize(ps, findings, caveats) + + report = AnalysisReport( + analysis_id=analysis_id, + user_id=user_id, + version=0, # assigned by ReportStore.save under the advisory lock + generated_at=datetime.now(UTC), + problem_statement=ps, + record_ids=[r.record_id for r in records], + executive_summary=executive_summary, + findings=findings, + caveats=caveats, + open_questions=open_questions, + data_sources=data_sources, + method_steps=method_steps, + ) + report.rendered_markdown = _render_markdown(report) + logger.info( + "report generated", + analysis_id=analysis_id, + records=len(records), + findings=len(findings), + ) + return report + + async def _read_catalog(self, user_id: str | None): + if not user_id: + return None + try: + return await self._ensure_catalog_store().get(user_id) + except Exception as exc: # data_sources falls back; never break the report + logger.warning("catalog read failed; data_sources will fall back", error=str(exc)) + return None + + def _ensure_binding_store(self): + if self._binding_store is None: + from ..binding_store import AnalysisDataSourceStore + + self._binding_store = AnalysisDataSourceStore() + return self._binding_store + + async def _read_binding(self, analysis_id: str) -> list[str]: + """Bound source ids for the analysis (#10). Never-throw → [] (unscoped).""" + try: + return await self._ensure_binding_store().get(analysis_id) + except Exception as exc: # data_sources falls back to whole catalog + logger.warning("binding read failed; data_sources unscoped", error=str(exc)) + return [] + + async def _summarize( + self, ps: ProblemStatement, findings: list[ReportFinding], caveats: list[AttributedNote] + ) -> str: + human_content = _build_human_content(ps, findings, caveats) + try: + narrative: ReportSummaryNarrative = await self._ensure_chain().ainvoke( + {"human_content": human_content} + ) + return narrative.executive_summary + except Exception as exc: # D1: degrade, don't fail the whole report + logger.warning("report summary LLM failed; using fallback", error=str(exc)) + return _FALLBACK_SUMMARY diff --git a/src/agents/report/readiness.py b/src/agents/report/readiness.py new file mode 100644 index 0000000000000000000000000000000000000000..5234004d9170968b2bde00f3b3136ab475296465 --- /dev/null +++ b/src/agents/report/readiness.py @@ -0,0 +1,165 @@ +"""`is_report_ready` — deterministic report-readiness signal (seam #5, KM-652). + +The Help skill asks "can the user generate a report yet?" before it offers that as +a next step. This is the producer of that answer; Help only *consumes* it (see +`handlers/help.ReportReadiness`). No LLM — readiness is a fact about persisted state, +not a judgement. + +The rule mirrors what makes a real report non-empty and worth generating, so Help can +never suggest an action that would 409 or produce a duplicate: + 1. `problem_validated` — the gate's own precondition (no validated goal, no + analysis worth reporting). Same rule `gate.gate` applies to `structured_flow`. + 2. at least one **substantive** persisted `AnalysisRecord` — a record whose + *analysis* task succeeded. A failed run still persists a record WITH findings + (they narrate the failure), and data-access tasks (check_/retrieve_) succeed even + when the analysis fails — so neither "has findings" nor "any task succeeded" is + enough. We require a genuine analysis tool (analyze_*) to have completed. We count + *results*, not chat turns. + 3. delta-since-report — if a report already exists (`state.report_id`), only ready + when there's a substantive analysis newer than the latest report; otherwise the + new report would be identical. + +`missing` names whichever criterion is absent, so Help can tell the user the next gap +to fill (the team values `missing` over the bare boolean). Bias is anti-false-positive +(report is also button-triggered): a record-store read failure fails **closed** +(not ready); a report-store read failure during the delta check fails **open** (we +can't prove staleness, and the button is always there). + +NOT in scope (deferred, pending the readiness eval set): semantic *alignment* of the +analyses to the problem statement and *depth*/variety scoring — both need an LLM judge +and shouldn't sit in the per-turn Help hot path until eval justifies the cost. +""" + +from __future__ import annotations + +from datetime import UTC, datetime +from typing import TYPE_CHECKING + +from src.middlewares.logging import get_logger + +from ..handlers.help import ReportReadiness + +if TYPE_CHECKING: + from ..gate import AnalysisState + +logger = get_logger("report_readiness") + +# Human-readable gaps surfaced to the user via Help (kept stable for the prompt). +_MISSING_PROBLEM = "a validated problem statement" +_MISSING_ANALYSIS = "at least one completed analysis" +_MISSING_DELTA = "a new analysis since the last report" + + +def _default_record_store(): + from ..slow_path.store import PostgresAnalysisStore + + return PostgresAnalysisStore() + + +def _default_report_store(): + from .store import ReportStore + + return ReportStore() + + +def _is_newer(a: datetime, b: datetime) -> bool: + """True if `a` is later than `b`, tolerating naive/aware mismatch (assume UTC).""" + if a.tzinfo is None: + a = a.replace(tzinfo=UTC) + if b.tzinfo is None: + b = b.replace(tzinfo=UTC) + return a > b + + +def _has_successful_analysis(record) -> bool: + """True if the record has at least one *analysis* task that succeeded. + + A failed run still writes findings (narrating the failure) and its data-access + tasks (check_/retrieve_) succeed, so we can't key on findings or on "any task + succeeded". An analysis tool (analyze_*) completing is the real "we produced a + result" signal. + """ + return any( + t.status == "success" and any(tool.startswith("analyze") for tool in t.tools_used) + for t in record.tasks_run + ) + + +async def report_floor( + analysis_id: str | None, + state: AnalysisState, + *, + record_store=None, +) -> tuple[list[str], list]: + """The report **floor**: a validated goal + ≥1 substantive analysis. + + Returns `(missing, substantive_records)`. This is the shared gate both the Help + readiness signal AND the report API enforce, so the button and Help can't drift + (T-D / T11). It deliberately excludes the delta-since-report check — that is + advisory and lives only in `is_report_ready`; the report button is always allowed + to cut a new version (decision 4A). Fails closed (counts as missing analysis) on + a record-store read error. `record_store` is injectable for tests. + """ + missing: list[str] = [] + if not state.problem_validated: + missing.append(_MISSING_PROBLEM) + + substantive: list = [] + if analysis_id: + try: + store = record_store or _default_record_store() + records = await store.list_for_analysis(analysis_id) + substantive = [r for r in records if _has_successful_analysis(r)] + except Exception as exc: # noqa: BLE001 — never-throw; fail closed to not-ready + logger.warning( + "report_floor: record store read failed — not ready", + analysis_id=analysis_id, + error=str(exc), + ) + return [*missing, _MISSING_ANALYSIS], [] + + if not substantive: + missing.append(_MISSING_ANALYSIS) + return missing, substantive + + +async def is_report_ready( + analysis_id: str | None, + state: AnalysisState, + *, + record_store=None, + report_store=None, +) -> ReportReadiness: + """Return whether a report can be generated for this analysis, and the gaps if not. + + `record_store` / `report_store` are injectable for tests; they default to the + real Postgres stores. + """ + missing, substantive = await report_floor( + analysis_id, state, record_store=record_store + ) + + if not substantive: + # No analyses to report on → the delta check is moot. + return ReportReadiness(ready=not missing, missing=missing) + + # Delta-since-report: a report already exists, so only ready if a substantive + # analysis is newer than the latest report. Fail-open on a report-store error. + if state.report_id: + last_report_at: datetime | None = None + try: + rstore = report_store or _default_report_store() + reports = await rstore.list_for_analysis(analysis_id) + last_report_at = max((r.generated_at for r in reports), default=None) + except Exception as exc: # noqa: BLE001 — skip delta; can't prove staleness + logger.warning( + "is_report_ready: report store read failed — skipping delta check", + analysis_id=analysis_id, + error=str(exc), + ) + if last_report_at is not None and not any( + _is_newer(r.created_at, last_report_at) for r in substantive + ): + missing.append(_MISSING_DELTA) + + return ReportReadiness(ready=not missing, missing=missing) diff --git a/src/agents/report/schemas.py b/src/agents/report/schemas.py new file mode 100644 index 0000000000000000000000000000000000000000..ab282222a05db1a430e4a04b327b4f197f0bbd23 --- /dev/null +++ b/src/agents/report/schemas.py @@ -0,0 +1,91 @@ +"""Report contract — `AnalysisReport` and its parts (KM-644). + +The report generator turns a session's persisted `AnalysisRecord`s + Problem +Statement into a versioned report. Only `executive_summary` is LLM-authored +(`ReportSummaryNarrative`); every other field is copied verbatim from the records +by code (INV-4), so the report stays a faithful, auditable artifact. + +Two deliberate looseness choices for v1 (tighten later once usage shows): +`ProblemStatement` (stub of Harry's real PS) and `ReportFinding.supporting_data`. + +See CHECKPOINT_PLAN_2026-06-17.md decision #8. +""" + +from __future__ import annotations + +from datetime import datetime +from uuid import uuid4 + +from pydantic import BaseModel, Field + +from ..slow_path.schemas import TaskSummary + + +class ProblemStatement(BaseModel): + """Minimal stub of Harry's Problem Statement, frozen into each report. + + Loose on purpose until the real PS template lands (Analysis State, upstream). + A report snapshots the PS as it was at generation time. + """ + + objective: str = "" + metric_direction: str = "" # "increase" | "decrease" + target_metric: str = "" + target_value: str = "" + scope: str = "" + + +class DataSourceRef(BaseModel): + """Frozen catalog metadata for a source used in the analysis. + + Snapshotted at generation time (NOT re-fetched at render) so a re-ingested + source never retroactively changes an old report — same freeze rationale as + `ProblemStatement`. + """ + + source_id: str + name: str + source_type: str # postgres | file | ... + detail: dict = Field(default_factory=dict) # rows in scope, columns, window + + +class ReportFinding(BaseModel): + text: str + record_ids: list[str] = Field(default_factory=list) # records backing this finding + supporting_data: dict | None = None # loose for v1; the chart-able slice + + +class AttributedNote(BaseModel): + """A caveat or open question carrying the records it came from. + + Plural `record_ids` because a note can be deduped/merged across records. + """ + + text: str + record_ids: list[str] = Field(default_factory=list) + + +class ReportSummaryNarrative(BaseModel): + """The ONLY LLM-authored part of the report (with_structured_output target).""" + + executive_summary: str + + +class AnalysisReport(BaseModel): + report_id: str = Field(default_factory=lambda: uuid4().hex) + analysis_id: str + user_id: str | None = None + version: int + generated_at: datetime + # Frozen snapshots. + problem_statement: ProblemStatement = Field(default_factory=ProblemStatement) + record_ids: list[str] = Field(default_factory=list) # records used (snapshot) + # LLM-authored. + executive_summary: str = "" + # Deterministic pass-through from records. + findings: list[ReportFinding] = Field(default_factory=list) + caveats: list[AttributedNote] = Field(default_factory=list) + open_questions: list[AttributedNote] = Field(default_factory=list) + data_sources: list[DataSourceRef] = Field(default_factory=list) + method_steps: list[TaskSummary] = Field(default_factory=list) # carries `stage` + rendered_markdown: str = "" diff --git a/src/agents/report/store.py b/src/agents/report/store.py new file mode 100644 index 0000000000000000000000000000000000000000..db39c289fe478badcdeafaeb270b08f91b6d6f1d --- /dev/null +++ b/src/agents/report/store.py @@ -0,0 +1,119 @@ +"""ReportStore — persists/reads versioned AnalysisReports (KM-644). + +Mirrors `PostgresAnalysisStore`: each call opens its own `AsyncSessionLocal`. + +Version assignment is serialized per `analysis_id` with a Postgres +transaction-level advisory lock so concurrent button presses can't compute the +same version number; the `(analysis_id, version)` unique constraint is the +backstop. Per decision 4A every generation is a new version, so two +near-simultaneous presses legitimately produce V and V — the lock only +prevents a duplicate-number race, not double generation. +""" + +from __future__ import annotations + +import hashlib + +from sqlalchemy import func, select, text + +from src.db.postgres.connection import AsyncSessionLocal +from src.db.postgres.models import AnalysisReportRow +from src.middlewares.logging import get_logger + +from .schemas import AnalysisReport + +logger = get_logger("report_store") + + +def _lock_key(analysis_id: str) -> int: + """Stable signed 64-bit key for `pg_advisory_xact_lock`. + + Python's builtin `hash(str)` is randomized per process, so derive a + deterministic key from a digest instead. + """ + digest = hashlib.sha256(analysis_id.encode()).digest() + return int.from_bytes(digest[:8], "big", signed=True) + + +def _report_title(report: AnalysisReport) -> str: + """Title for the dedorch `reports.title` column — the goal, else a generic label.""" + objective = (report.problem_statement.objective or "").strip() + return objective[:200] if objective else "Analysis Report" + + +def _row_to_report(row) -> AnalysisReport: + """Rebuild a minimal AnalysisReport from the flat dedorch row. + + dedorch stores markdown only, so structured fields (findings/caveats/…) come back + empty; `rendered_markdown` carries the content the FE renders/downloads. + """ + return AnalysisReport( + report_id=row.id, + analysis_id=row.analysis_id, + version=row.version, + generated_at=row.generated_at, + rendered_markdown=row.content, + ) + + +class ReportStore: + """Read/write versioned reports keyed by `analysis_id`.""" + + async def save(self, report: AnalysisReport) -> AnalysisReport: + """Assign the next version under an advisory lock and persist. + + Mutates and returns `report` with its final `version`. + """ + async with AsyncSessionLocal() as session: + async with session.begin(): + await session.execute( + text("SELECT pg_advisory_xact_lock(:k)"), + {"k": _lock_key(report.analysis_id)}, + ) + result = await session.execute( + select(func.max(AnalysisReportRow.version)).where( + AnalysisReportRow.analysis_id == report.analysis_id + ) + ) + report.version = (result.scalar_one_or_none() or 0) + 1 + session.add( + AnalysisReportRow( + id=report.report_id, + analysis_id=report.analysis_id, + title=_report_title(report), + content=report.rendered_markdown or "", + generated_at=report.generated_at, + version=report.version, + ) + ) + # leaving session.begin() commits, which releases the advisory lock + logger.info( + "report persisted", + analysis_id=report.analysis_id, + version=report.version, + report_id=report.report_id, + ) + return report + + async def list_for_analysis(self, analysis_id: str) -> list[AnalysisReport]: + async with AsyncSessionLocal() as session: + result = await session.execute( + select(AnalysisReportRow) + .where(AnalysisReportRow.analysis_id == analysis_id) + .order_by(AnalysisReportRow.version.asc()) + ) + rows = result.scalars().all() + return [_row_to_report(row) for row in rows] + + async def get(self, analysis_id: str, version: int) -> AnalysisReport | None: + async with AsyncSessionLocal() as session: + result = await session.execute( + select(AnalysisReportRow).where( + AnalysisReportRow.analysis_id == analysis_id, + AnalysisReportRow.version == version, + ) + ) + row = result.scalar_one_or_none() + if row is None: + return None + return _row_to_report(row) diff --git a/src/agents/slow_path/assembler.py b/src/agents/slow_path/assembler.py index 88ab7aca02b49782502e79338063a703b3c2edc9..f5645aa1b5ea7205aaec714814bed9b79fab1b12 100644 --- a/src/agents/slow_path/assembler.py +++ b/src/agents/slow_path/assembler.py @@ -33,6 +33,7 @@ from .schemas import ( AssembledOutput, AssemblerNarrative, RunState, + TaskResult, TaskSummary, ) @@ -116,16 +117,46 @@ class Assembler: return AssembledOutput(chat_answer=narrative.chat_answer, analysis_record=record) +# Persisted records keep `analyze_*` outputs (scalar/stats/series — small, and the +# basis a future report/chart renders from) in full, but cap raw `table` rows from +# data-access tools (retrieve_data can return up to the 10k LIMIT): the report never +# renders raw rows, so storing them all would bloat every record's jsonb. +_SNAPSHOT_ROW_SAMPLE = 10 + + +def _trim_for_snapshot(result: TaskResult) -> TaskResult: + trimmed = [] + changed = False + for out in result.outputs: + if out.kind == "table" and out.rows is not None and len(out.rows) > _SNAPSHOT_ROW_SAMPLE: + changed = True + trimmed.append( + out.model_copy( + update={ + "rows": out.rows[:_SNAPSHOT_ROW_SAMPLE], + "meta": {**out.meta, "total_rows": len(out.rows), "rows_truncated": True}, + } + ) + ) + else: + trimmed.append(out) + return result.model_copy(update={"outputs": trimmed}) if changed else result + + def _build_record(narrative: AssemblerNarrative, run_state: RunState) -> AnalysisRecord: tasks_run = [ TaskSummary( task_id=task_id, + stage=result.stage, objective=result.objective, status=result.status, tools_used=[o.tool for o in result.outputs], ) for task_id, result in run_state.results.items() ] + results_snapshot = { + task_id: _trim_for_snapshot(result) for task_id, result in run_state.results.items() + } return AnalysisRecord( goal_restated=narrative.goal_restated, findings=narrative.findings, @@ -133,7 +164,7 @@ def _build_record(narrative: AssemblerNarrative, run_state: RunState) -> Analysi data_used=narrative.data_used, open_questions=narrative.open_questions, tasks_run=tasks_run, - results_snapshot=run_state.results, + results_snapshot=results_snapshot, plan_id=run_state.plan_id, business_context_id=run_state.business_context_id, created_at=datetime.now(UTC), diff --git a/src/agents/slow_path/coordinator.py b/src/agents/slow_path/coordinator.py index 4fe68a9b56f61d2d17745b2ae57825cc3ae004ba..a17468401398f93cf75a41c71bbb7e77be45fa82 100644 --- a/src/agents/slow_path/coordinator.py +++ b/src/agents/slow_path/coordinator.py @@ -1,9 +1,9 @@ """SlowPathCoordinator — wires the slow path: Planner -> TaskRunner -> Assembler. -A thin coordination object. This is the unit the (future) expanded Orchestrator / -ChatHandler will call on a `structured` analytical query. It is built and tested -here but **not yet wired into the live chat flow** — that step waits on the tool -team's real `ToolInvoker` and a real `BusinessContext` source. +A thin coordination object. `ChatHandler` calls it on a `structured_flow` query when +`ENABLE_SLOW_PATH` is on (the real `ToolInvoker` is composed in +`ChatHandler._get_slow_path_coordinator`). `BusinessContext` is still a stub until the +lead's real source lands. See AGENT_ARCHITECTURE_CONTEXT_new.md §5.2 / §6.1. """ diff --git a/src/agents/slow_path/schemas.py b/src/agents/slow_path/schemas.py index 42d4aa2d729815a4ff8a1fa01a4baac93e562176..70632cdc9bbf71a9e4a29e1b5a808b1a865df998 100644 --- a/src/agents/slow_path/schemas.py +++ b/src/agents/slow_path/schemas.py @@ -21,10 +21,12 @@ from __future__ import annotations from datetime import datetime from typing import Literal +from uuid import uuid4 from pydantic import BaseModel, Field from ..planner.contracts import ToolOutput +from ..planner.schemas import CrispStage TaskStatus = Literal["success", "partial", "failure"] @@ -36,6 +38,7 @@ TaskStatus = Literal["success", "partial", "failure"] class TaskResult(BaseModel): task_id: str + stage: CrispStage # copied from the plan Task; carries CRISP-DM grouping to the report status: TaskStatus objective: str outputs: list[ToolOutput] = Field(default_factory=list) # one per tool_call @@ -57,12 +60,21 @@ class RunState(BaseModel): class TaskSummary(BaseModel): task_id: str + stage: CrispStage # lets the report group the method appendix by CRISP-DM phase objective: str status: TaskStatus tools_used: list[str] = Field(default_factory=list) class AnalysisRecord(BaseModel): + # Identity. `record_id` is the unit the report cites and snapshots + # (`record_ids`); `analysis_id`/`user_id` scope the record to one analysis + # session + owner and are stamped by the composition root / AnalysisStore at + # persist time (they depend on the Analysis State that lives outside the slow + # path), so they default to None when the Assembler first builds the record. + record_id: str = Field(default_factory=lambda: uuid4().hex) + analysis_id: str | None = None + user_id: str | None = None # Narrative fields — authored by the Assembler LLM. goal_restated: str findings: list[str] = Field(default_factory=list) diff --git a/src/agents/slow_path/store.py b/src/agents/slow_path/store.py index 43181a8615ca0d73848a2cbdedea2c490f2419bc..871d03ce3e7d753db9c4698e3f6f50cbbef912a0 100644 --- a/src/agents/slow_path/store.py +++ b/src/agents/slow_path/store.py @@ -2,21 +2,28 @@ The Assembler produces an `AnalysisRecord` (the faithful, structured record of a run — §8.3, INV-4). Persisting it is a separate concern from streaming the answer, -so it sits behind this one-method seam. +so it sits behind this seam. `generate_report` later reads records back by +`analysis_id` (oldest-first) and renders from them — never from chat history. -`NullAnalysisStore` is the default: it logs that a record was produced but stores -nothing, because the backing table does not exist yet. The plan is to store records -in the **same catalog DB** (Neon `dataeyond`, `settings.postgres_connstring`). +- `NullAnalysisStore` logs and stores nothing (kept for tests / when persistence + is intentionally disabled). +- `PostgresAnalysisStore` writes one `analysis_records` row per run in the catalog + DB (Neon `dataeyond`, `settings.postgres_connstring`). -TODO(persistence): add a Postgres-backed `AnalysisStore` writing an -`analysis_records` table in the catalog DB, keyed on -(business_context_id, plan_id, created_at), then inject it into ChatHandler. +`save` must never raise on the caller's path — a persistence failure must not break +the user's answer (§8.3). `list_for_analysis` is a read for the report generator and +is allowed to surface errors to its caller. """ from __future__ import annotations from typing import Protocol, runtime_checkable +from sqlalchemy import select +from sqlalchemy.dialects.postgresql import insert + +from src.db.postgres.connection import AsyncSessionLocal +from src.db.postgres.models import AnalysisRecordRow from src.middlewares.logging import get_logger from .schemas import AnalysisRecord @@ -26,19 +33,78 @@ logger = get_logger("analysis_store") @runtime_checkable class AnalysisStore(Protocol): - """Persist a completed analysis. Implementations must never raise on the - caller's path — a persistence failure must not break the user's answer.""" + """Persist + read completed analyses. + + `save` must never raise on the caller's path. `list_for_analysis` returns the + records for one analysis session, oldest-first (the order the report renders in). + """ async def save(self, record: AnalysisRecord) -> None: ... + async def list_for_analysis(self, analysis_id: str) -> list[AnalysisRecord]: ... + class NullAnalysisStore: - """Default no-op store: logs the record, persists nothing (no table yet).""" + """No-op store: logs the record, persists nothing. Reads return empty.""" async def save(self, record: AnalysisRecord) -> None: logger.info( - "analysis_record produced (not persisted — no store configured)", + "analysis_record produced (not persisted — NullAnalysisStore)", + record_id=record.record_id, plan_id=record.plan_id, - business_context_id=record.business_context_id, n_tasks=len(record.tasks_run), ) + + async def list_for_analysis(self, analysis_id: str) -> list[AnalysisRecord]: + return [] + + +class PostgresAnalysisStore: + """Writes/reads `analysis_records` jsonb rows in the catalog DB. + + Mirrors `CatalogStore`: each call opens its own `AsyncSession`. One row per + record (vs. one-per-user for the catalog) since records accumulate per analysis. + """ + + async def save(self, record: AnalysisRecord) -> None: + try: + payload = record.model_dump(mode="json") + async with AsyncSessionLocal() as session: + stmt = insert(AnalysisRecordRow).values( + id=record.record_id, + analysis_id=record.analysis_id, + user_id=record.user_id, + plan_id=record.plan_id, + data=payload, + created_at=record.created_at, + ) + # Re-running the same plan id-collides only if record_id repeats; + # treat that as idempotent (overwrite) rather than erroring the user. + stmt = stmt.on_conflict_do_update( + index_elements=[AnalysisRecordRow.id], + set_={"data": stmt.excluded.data}, + ) + await session.execute(stmt) + await session.commit() + logger.info( + "analysis_record persisted", + record_id=record.record_id, + analysis_id=record.analysis_id, + user_id=record.user_id, + ) + except Exception as exc: # never break the user's answer (§8.3) + logger.error( + "analysis_record persist failed", + record_id=record.record_id, + error=str(exc), + ) + + async def list_for_analysis(self, analysis_id: str) -> list[AnalysisRecord]: + async with AsyncSessionLocal() as session: + result = await session.execute( + select(AnalysisRecordRow.data) + .where(AnalysisRecordRow.analysis_id == analysis_id) + .order_by(AnalysisRecordRow.created_at.asc()) + ) + rows = result.scalars().all() + return [AnalysisRecord.model_validate(row) for row in rows] diff --git a/src/agents/slow_path/task_runner.py b/src/agents/slow_path/task_runner.py index aeac6ea9a14210043c25327f2cdedfd38557337a..7b057606f319709f553593aa090ece260196890c 100644 --- a/src/agents/slow_path/task_runner.py +++ b/src/agents/slow_path/task_runner.py @@ -53,6 +53,7 @@ class TaskRunner: for tid in list(remaining): results[tid] = TaskResult( task_id=tid, + stage=tasks_by_id[tid].stage, status="failure", objective=tasks_by_id[tid].objective, error="unresolved dependency; task could not run", @@ -68,6 +69,7 @@ class TaskRunner: if failed: results[tid] = TaskResult( task_id=tid, + stage=task.stage, status="failure", objective=task.objective, error=f"skipped: upstream {failed} did not succeed", @@ -110,6 +112,7 @@ class TaskRunner: error = errs[0] if errs else "all tool calls failed" return TaskResult( task_id=task.id, + stage=task.stage, status=status, objective=task.objective, outputs=outputs, diff --git a/src/agents/state_store.py b/src/agents/state_store.py new file mode 100644 index 0000000000000000000000000000000000000000..f63787da071f91b191b346bd57b355d680cb4ff8 --- /dev/null +++ b/src/agents/state_store.py @@ -0,0 +1,128 @@ +"""AnalysisStateStore — read/write the per-analysis session state. + +The orchestrator gate + Help skill read `AnalysisState` (the locked contract in +`gate.py`) every turn; the Problem Statement skill writes `problem_validated`. The +row shares its id with the chat `rooms` row — one session = one analysis = one +conversation (`analysis_id == room_id`). + +Mirrors `PostgresAnalysisStore`: each call opens its own `AsyncSession`. +""" + +from __future__ import annotations + +from sqlalchemy.dialects.postgresql import insert + +from src.agents.gate import AnalysisState +from src.db.postgres.connection import AsyncSessionLocal +from src.db.postgres.models import AnalysisStateRow +from src.middlewares.logging import get_logger + +logger = get_logger("analysis_state_store") + + +def _row_to_state(row: AnalysisStateRow) -> AnalysisState: + """Map a DB row to the frozen `AnalysisState` contract.""" + return AnalysisState( + id=row.id, + analysis_title=row.analysis_title, + problem_statement=row.problem_statement, + problem_validated=row.problem_validated, + owner_id=row.owner_id, + report_id=row.report_id, + created_at=row.created_at, + updated_at=row.updated_at, + ) + + +class AnalysisStateStore: + """Read/write the dedorch `analysis` table, keyed by the shared session id.""" + + async def get(self, analysis_id: str) -> AnalysisState | None: + async with AsyncSessionLocal() as session: + row = await session.get(AnalysisStateRow, analysis_id) + return _row_to_state(row) if row is not None else None + + async def ensure( + self, + analysis_id: str, + owner_id: str, + analysis_title: str = "New analysis", + ) -> AnalysisState: + """Get-or-create the state row for a session (idempotent, race-safe). + + Sessions born from `/room/create` have no `analysis_states` row; without + one the gate redirect-loops and `problem_statement` / `report_id` writes + silently no-op. Called per turn (analysis_id == room_id) so any session is + gate-ready. `INSERT ... ON CONFLICT DO NOTHING` makes concurrent first + turns safe; the row is then read back. Legacy rows created this way carry + no source bindings — binding scoping fail-opens to the whole catalog. + """ + async with AsyncSessionLocal() as session: + stmt = ( + insert(AnalysisStateRow) + .values( + id=analysis_id, + owner_id=owner_id, + analysis_title=analysis_title, + problem_statement="", + problem_validated=False, + ) + .on_conflict_do_nothing(index_elements=[AnalysisStateRow.id]) + ) + await session.execute(stmt) + await session.commit() + row = await session.get(AnalysisStateRow, analysis_id) + return _row_to_state(row) + + async def create( + self, + *, + analysis_id: str, + owner_id: str, + analysis_title: str = "New analysis", + problem_statement: str = "", + ) -> AnalysisState: + """Create the state row for a new analysis (id shared with its chat room).""" + async with AsyncSessionLocal() as session: + row = AnalysisStateRow( + id=analysis_id, + owner_id=owner_id, + analysis_title=analysis_title, + problem_statement=problem_statement, + problem_validated=False, + ) + session.add(row) + await session.commit() + await session.refresh(row) + return _row_to_state(row) + + async def update( + self, + analysis_id: str, + *, + problem_statement: str | None = None, + problem_validated: bool | None = None, + report_id: str | None = None, + ) -> AnalysisState | None: + """Patch the given fields (only non-None args are written). Returns the row. + + Used by the Problem Statement skill (`problem_validated`) and the report + flow (`report_id`). Returns None if the analysis doesn't exist. + """ + async with AsyncSessionLocal() as session: + row = await session.get(AnalysisStateRow, analysis_id) + if row is None: + logger.warning( + "analysis row missing — update skipped", + analysis_id=analysis_id, + ) + return None + if problem_statement is not None: + row.problem_statement = problem_statement + if problem_validated is not None: + row.problem_validated = problem_validated + if report_id is not None: + row.report_id = report_id + await session.commit() + await session.refresh(row) + return _row_to_state(row) diff --git a/src/api/v1/analysis.py b/src/api/v1/analysis.py new file mode 100644 index 0000000000000000000000000000000000000000..ae63bc0c09b6317f2e1d95796e31bc219c65a3a0 --- /dev/null +++ b/src/api/v1/analysis.py @@ -0,0 +1,174 @@ +"""Analysis session API — create a new analysis (the per-session workspace). + +An analysis IS the chat session: the `analysis_states` row and the chat `rooms` +row share one id (`analysis_id == room_id`), so the existing `room_id` on the chat +request doubles as the `analysis_id`. Creating an analysis enforces the data-first +gate (>=1 bound source) and seeds the state with a title + an optional problem +statement (validated later by the Problem Statement skill). +""" + +import uuid + +from fastapi import APIRouter, Depends, HTTPException +from pydantic import BaseModel, Field +from sqlalchemy import select +from sqlalchemy.ext.asyncio import AsyncSession + +from src.db.postgres.connection import get_db +from src.db.postgres.models import AnalysisDataSourceRow, AnalysisStateRow, Room +from src.middlewares.logging import get_logger, log_execution + +logger = get_logger("analysis_api") + +router = APIRouter(prefix="/api/v1", tags=["Analysis"]) + + +def _serialize_state(row: AnalysisStateRow, data_source_ids: list[str]) -> dict: + """The full analysis payload: the 8 state fields + the bound source ids.""" + return { + "id": row.id, + "analysis_title": row.analysis_title, + "problem_statement": row.problem_statement, + "problem_validated": row.problem_validated, + "owner_id": row.owner_id, + "report_id": row.report_id, + "data_source_ids": data_source_ids, + "created_at": row.created_at.isoformat() if row.created_at else None, + "updated_at": row.updated_at.isoformat() if row.updated_at else None, + } + + +async def _bound_source_ids(db: AsyncSession, analysis_id: str) -> list[str]: + result = await db.execute( + select(AnalysisDataSourceRow.reference_id).where( + AnalysisDataSourceRow.analysis_id == analysis_id + ) + ) + return list(result.scalars().all()) + + +async def _sources_by_id(user_id: str) -> dict: + """Catalog sources keyed by source_id, to resolve `type`/`name` on binding. + + Never-throw: missing catalog / read error → empty map, and binding rows fall back + to type='unknown' / name=reference_id. + """ + try: + from src.catalog.store import CatalogStore + + catalog = await CatalogStore().get(user_id) + except Exception as e: # noqa: BLE001 — binding must not fail on catalog read + logger.warning("analysis: catalog read failed for binding", user_id=user_id, error=str(e)) + return {} + return {s.source_id: s for s in catalog.sources} if catalog else {} + + +class CreateAnalysisRequest(BaseModel): + user_id: str + analysis_title: str = "New analysis" + problem_statement: str = "" + data_source_ids: list[str] = Field(default_factory=list) + + +@router.post("/analysis/create") +@log_execution(logger) +async def create_analysis( + request: CreateAnalysisRequest, + db: AsyncSession = Depends(get_db), +): + """Create a new analysis session: one shared id for its state + chat room. + + Data-first gate (decision #2): an analysis requires >=1 bound data source. + The bound sources are persisted as dedorch `data_sources` rows (#10) in the same + transaction as the state + room, so the analysis is scoped to exactly the sources + the user picked. `structured_flow` and the report read this binding back. + """ + if not request.data_source_ids: + raise HTTPException( + status_code=400, + detail="An analysis requires at least one bound data source.", + ) + + analysis_id = str(uuid.uuid4()) + # The analysis IS the session: state row + chat room + source bindings share one + # id, created atomically in one transaction. + state_row = AnalysisStateRow( + id=analysis_id, + owner_id=request.user_id, + analysis_title=request.analysis_title, + problem_statement=request.problem_statement, + problem_validated=False, + ) + db.add(Room(id=analysis_id, user_id=request.user_id, title=request.analysis_title)) + db.add(state_row) + # dict.fromkeys dedupes while preserving order. Each binding row snapshots the + # source's type + name from the catalog (reference_id = catalog source id); + # bound_at/created_at default to now() in dedorch. + bound_ids = list(dict.fromkeys(request.data_source_ids)) + src_by_id = await _sources_by_id(request.user_id) + for source_id in bound_ids: + src = src_by_id.get(source_id) + db.add( + AnalysisDataSourceRow( + id=str(uuid.uuid4()), + analysis_id=analysis_id, + type=src.source_type if src else "unknown", + name=src.name if src else source_id, + reference_id=source_id, + bound_by=request.user_id, + ) + ) + await db.commit() + await db.refresh(state_row) + + logger.info( + "analysis created", + analysis_id=analysis_id, + user_id=request.user_id, + sources=len(bound_ids), + ) + return { + "status": "success", + "message": "Analysis created successfully", + "data": _serialize_state(state_row, bound_ids), + } + + +@router.get("/analysis") +@log_execution(logger) +async def list_analyses(user_id: str, db: AsyncSession = Depends(get_db)): + """List a user's analyses, most-recently-updated first (Analysis sidebar). + + Summary fields only (no per-row source bindings — fetch those via the detail + endpoint) to keep the list a single query. + """ + result = await db.execute( + select(AnalysisStateRow) + .where(AnalysisStateRow.owner_id == user_id) + .order_by(AnalysisStateRow.updated_at.desc()) + ) + rows = result.scalars().all() + return { + "status": "success", + "data": [ + { + "id": r.id, + "analysis_title": r.analysis_title, + "problem_validated": r.problem_validated, + "report_id": r.report_id, + "updated_at": r.updated_at.isoformat() if r.updated_at else None, + } + for r in rows + ], + } + + +@router.get("/analysis/{analysis_id}") +@log_execution(logger) +async def get_analysis(analysis_id: str, db: AsyncSession = Depends(get_db)): + """Read one analysis's state + bound data sources (the FE workspace render).""" + row = await db.get(AnalysisStateRow, analysis_id) + if row is None: + raise HTTPException(status_code=404, detail=f"Analysis {analysis_id!r} not found.") + data_source_ids = await _bound_source_ids(db, analysis_id) + return {"status": "success", "data": _serialize_state(row, data_source_ids)} diff --git a/src/api/v1/chat.py b/src/api/v1/chat.py index c1801f844345627efd78759a912ded55e9c79c6e..f9de517031ca80e9a0ae9906aa89e79fbfb7a117 100644 --- a/src/api/v1/chat.py +++ b/src/api/v1/chat.py @@ -31,6 +31,7 @@ router = APIRouter(prefix="/api/v1", tags=["Chat"]) _chat_handler = ChatHandler( enable_tracing=True, enable_slow_path=settings.enable_slow_path, + enable_gate=settings.enable_gate, ) _GREETINGS = frozenset(["hi", "hello", "hey", "halo", "hai", "hei"]) @@ -64,8 +65,39 @@ async def get_cached_response(redis, cache_key: str) -> Optional[dict]: return None +# 1h TTL per the 2026-06-11 checkpoint decision (Redis = retrieval/query caching +# only, short-lived). Was 24h, which served stale answers after re-ingestion. +_CHAT_CACHE_TTL_SECONDS = 3600 + +# Only stateless replies are safe to cache. The cache key is (room, user, message) +# with no analysis-state/data version, so caching a state- or data-dependent answer +# (help / problem_statement / check / structured_flow / unstructured_flow) would +# replay a stale answer after the state or data changes — and, since the read check +# runs before the gate, could even bypass the gate when the same message repeats. +# So we cache ONLY the `chat` intent. Caching analysis answers needs proper +# invalidation on data/state change — deferred. The write is gated by the intent the +# handler already emits; the read stays as-is (safe because only `chat` is ever +# stored). +_CACHEABLE_INTENTS = frozenset({"chat"}) + + +def _chat_cache_key(room_id: str, user_id: str, message: str) -> str: + # user_id is part of the key so one user's cached answer can never be + # replayed to another (R5); room_id stays first so the room-wide clear + # endpoint can keep matching on a `chat:{room_id}:*` prefix. + # LIMITATION (T-G): the key omits conversation history, so a repeated message + # replays its cached answer even if the conversation has since moved on. Only + # the stateless `chat` intent is cached, so the blast radius is small — but a + # history-aware key (hash of last-N turns) would close it. Flagged to Harry. + return f"{settings.redis_prefix}chat:{room_id}:{user_id}:{message}" + + async def cache_response(redis, cache_key: str, response: str, sources: list): - await redis.setex(cache_key, 86400, json.dumps({"response": response, "sources": sources})) + await redis.setex( + cache_key, + _CHAT_CACHE_TTL_SECONDS, + json.dumps({"response": response, "sources": sources}), + ) async def load_history(db: AsyncSession, room_id: str, limit: int = 10) -> list: @@ -107,10 +139,10 @@ async def save_messages( @router.delete("/chat/cache") -async def clear_chat_cache(room_id: str, message: str): - """Delete the Redis cache entry for a specific room + message pair.""" +async def clear_chat_cache(room_id: str, user_id: str, message: str): + """Delete the Redis cache entry for a specific room + user + message pair.""" redis = await get_redis() - cache_key = f"{settings.redis_prefix}chat:{room_id}:{message}" + cache_key = _chat_cache_key(room_id, user_id, message) deleted = await redis.delete(cache_key) return {"deleted": deleted > 0, "cache_key": cache_key} @@ -146,7 +178,7 @@ async def chat_stream(request: ChatRequest, db: AsyncSession = Depends(get_db)): 3. done — signals end of stream """ redis = await get_redis() - cache_key = f"{settings.redis_prefix}chat:{request.room_id}:{request.message}" + cache_key = _chat_cache_key(request.room_id, request.user_id, request.message) # Redis cache hit cached = await get_cached_response(redis, cache_key) @@ -186,8 +218,17 @@ async def chat_stream(request: ChatRequest, db: AsyncSession = Depends(get_db)): logger.info("stream_response started", room_id=request.room_id, user_id=request.user_id) full_response = "" sources: List[Dict[str, Any]] = [] - async for event in handler.handle(request.message, request.user_id, history): - if event["event"] == "sources": + effective_intent: Optional[str] = None + async for event in handler.handle( + request.message, request.user_id, history, analysis_id=request.room_id + ): + if event["event"] == "intent": + # consumed internally (not forwarded); gates caching below. + try: + effective_intent = json.loads(event["data"]).get("intent") + except (TypeError, ValueError, AttributeError): + effective_intent = None + elif event["event"] == "sources": try: sources = json.loads(event["data"]) or [] except (TypeError, ValueError): @@ -197,7 +238,10 @@ async def chat_stream(request: ChatRequest, db: AsyncSession = Depends(get_db)): full_response += event["data"] yield event elif event["event"] == "done": - await cache_response(redis, cache_key, full_response, sources=sources) + # Only cache stateless `chat` replies — caching a state/data- + # dependent answer would replay it stale (see _CACHEABLE_INTENTS). + if effective_intent in _CACHEABLE_INTENTS: + await cache_response(redis, cache_key, full_response, sources=sources) logger.info("saving messages", sources_count=len(sources), sources=sources) try: await save_messages(db, request.room_id, request.message, full_response, sources=sources) @@ -211,7 +255,6 @@ async def chat_stream(request: ChatRequest, db: AsyncSession = Depends(get_db)): elif event["event"] == "error": yield event return - # "intent" event: consumed internally, not forwarded to frontend return EventSourceResponse(stream_response()) diff --git a/src/api/v1/report.py b/src/api/v1/report.py new file mode 100644 index 0000000000000000000000000000000000000000..3c2da90b7a584cf97cdea9a1a280b5c66da49af5 --- /dev/null +++ b/src/api/v1/report.py @@ -0,0 +1,189 @@ +"""Report API (KM-644) — the dedicated "Generate Report" surface. + +NOT a chat route. The frontend button calls these endpoints directly: + POST /report generate a new version for a session + GET /report/{analysis_id} list a session's report versions + GET /report/{analysis_id}/{ver} fetch one version + +Generation reads persisted AnalysisRecords + Problem Statement, makes one LLM call +(the executive summary), and persists an immutable versioned artifact. The +ReportGenerator + ReportStore are process singletons (the generator caches its LLM +chain warm across requests, like ChatHandler). + +Note (T-E): AnalysisRecords are only persisted by the slow path, so reports require +`ENABLE_SLOW_PATH=on`. With it off, no records exist and generation 409s — by design, +not a bug. POST gates on the same floor as Help's readiness signal (validated goal + +≥1 substantive analysis) so the button and Help never disagree. +""" + +from fastapi import APIRouter, HTTPException, Query, status + +from src.agents.report.errors import ReportError +from src.agents.report.generator import ReportGenerator +from src.agents.report.schemas import AnalysisReport, ProblemStatement +from src.agents.report.store import ReportStore +from src.middlewares.logging import get_logger, log_execution +from src.models.api.report import ReportVersionEntry + +logger = get_logger("report_api") + +router = APIRouter(prefix="/api/v1", tags=["Report"]) + +_generator = ReportGenerator() +_store = ReportStore() + + +async def _load_state(analysis_id: str): + """Load the AnalysisState (for the floor gate + problem statement). Never-throw.""" + try: + from src.agents.state_store import AnalysisStateStore + + return await AnalysisStateStore().get(analysis_id) + except Exception as e: # noqa: BLE001 — never block report generation on this + logger.warning("report: state load failed", analysis_id=analysis_id, error=str(e)) + return None + + +def _problem_statement_from(state) -> ProblemStatement: + """Map the analysis's free-text problem statement into the report's structured PS.""" + if state is None or not state.problem_statement: + return ProblemStatement() + return ProblemStatement(objective=state.problem_statement) + + +async def _record_report_on_state(analysis_id: str, report_id: str) -> None: + """Write the new `report_id` back onto the Analysis State (never-throw). + + Closes the loop so Help's `has_report` and the readiness delta-check can see + that a report exists. A missing state row / write error must not fail a report + that already generated and persisted. + """ + try: + from src.agents.state_store import AnalysisStateStore + + await AnalysisStateStore().update(analysis_id, report_id=report_id) + except Exception as e: # noqa: BLE001 + logger.warning( + "report: report_id write-back failed", analysis_id=analysis_id, error=str(e) + ) + + +@router.post( + "/report", + response_model=AnalysisReport, + status_code=status.HTTP_201_CREATED, + summary="Generate a new report version for an analysis session", + responses={ + 201: {"description": "A new versioned report was generated and persisted."}, + 409: {"description": "No analyses recorded for this session yet — nothing to report."}, + 500: {"description": "Report generation or persistence failed."}, + }, +) +@log_execution(logger) +async def generate_report( + analysis_id: str = Query(..., description="The analysis session to report on."), + user_id: str = Query(..., description="Owner of the analysis session."), +): + """Generate, persist, and return a new report version. + + Each call produces a new version (V1, V2, …) that snapshots the records and + Problem Statement it used. Server-side gate: the report **floor** — a validated + goal + ≥1 substantive analysis — the same floor Help's readiness signal uses, so + the button and Help can't disagree (T-D). The delta-since-report check is NOT + applied here: a new version is always allowed (decision 4A). + """ + from src.agents.gate import stub_analysis_state + from src.agents.report.readiness import report_floor + + state = await _load_state(analysis_id) + floor_missing, _ = await report_floor( + analysis_id, state or stub_analysis_state(problem_validated=False) + ) + if floor_missing: + raise HTTPException( + status_code=status.HTTP_409_CONFLICT, + detail="Not ready to generate a report — still needs " + + ", ".join(floor_missing) + + ".", + ) + + try: + problem_statement = _problem_statement_from(state) + report = await _generator.generate( + analysis_id, user_id, problem_statement=problem_statement + ) + except ReportError as e: + raise HTTPException(status_code=status.HTTP_409_CONFLICT, detail=str(e)) from e + except Exception as e: + logger.error("report generation failed", analysis_id=analysis_id, error=str(e)) + raise HTTPException( + status_code=status.HTTP_500_INTERNAL_SERVER_ERROR, + detail=f"Report generation failed: {e}", + ) from e + + try: + saved = await _store.save(report) + except Exception as e: + logger.error("report persist failed", analysis_id=analysis_id, error=str(e)) + raise HTTPException( + status_code=status.HTTP_500_INTERNAL_SERVER_ERROR, + detail=f"Report persistence failed: {e}", + ) from e + + await _record_report_on_state(analysis_id, saved.report_id) + return saved + + +@router.get( + "/report/{analysis_id}", + response_model=list[ReportVersionEntry], + summary="List a session's report versions", + response_description="Version metadata, oldest-first. Empty if none generated yet.", +) +@log_execution(logger) +async def list_report_versions(analysis_id: str): + """Return version metadata for a session (for the Analysis-menu sidebar).""" + try: + reports = await _store.list_for_analysis(analysis_id) + except Exception as e: + logger.error("report list failed", analysis_id=analysis_id, error=str(e)) + raise HTTPException( + status_code=status.HTTP_500_INTERNAL_SERVER_ERROR, + detail=f"Failed to list reports: {e}", + ) from e + + return [ + ReportVersionEntry( + report_id=r.report_id, + version=r.version, + generated_at=r.generated_at, + record_count=len(r.record_ids), + ) + for r in reports + ] + + +@router.get( + "/report/{analysis_id}/{version}", + response_model=AnalysisReport, + summary="Fetch one report version", + responses={404: {"description": "No report at that version for this session."}}, +) +@log_execution(logger) +async def get_report_version(analysis_id: str, version: int): + """Return the full content of a specific report version.""" + try: + report = await _store.get(analysis_id, version) + except Exception as e: + logger.error("report fetch failed", analysis_id=analysis_id, version=version, error=str(e)) + raise HTTPException( + status_code=status.HTTP_500_INTERNAL_SERVER_ERROR, + detail=f"Failed to fetch report: {e}", + ) from e + + if report is None: + raise HTTPException( + status_code=status.HTTP_404_NOT_FOUND, + detail=f"No report v{version} for analysis {analysis_id!r}.", + ) + return report diff --git a/src/api/v1/tools.py b/src/api/v1/tools.py new file mode 100644 index 0000000000000000000000000000000000000000..d1bf8a3d1421196ecbf954107a9e962076a59eb9 --- /dev/null +++ b/src/api/v1/tools.py @@ -0,0 +1,124 @@ +"""Tool / command catalog API endpoints. + +Exposes the agent's user-invocable slash-command catalog so the Golang backend +can cache it and the frontend can render its "/" command menu WITHOUT calling the +AI agent for every list (Golang GETs + caches `list_tools`). + +Scope confirmed: the catalog is the UNIFIED set of +everything the user can invoke via `/` — +spanning what the team internally splits into skills + analytics tools + +data-access tools. Naming : verb-first, kebab-case, `/` prefix. + +Each command maps 1:1 to a real internal tool/intent `name` (the dispatch key); +the granular data-access tools (check_data, check_knowledge, retrieve_data, +retrieve_knowledge) are listed separately. +NOTE: the merged `check` intent still exists for natural-language routing — it is +NOT a slash command; slash invocation bypasses the router to the tool directly. +Deferred analytics tools (comparison/contribution/profile/segment) are NOT +exposed (not wired to the Planner). + +Stateless and deterministic — safe for the Golang backend to cache. +""" + +from typing import Literal + +from fastapi import APIRouter +from pydantic import BaseModel + +from src.middlewares.logging import get_logger, log_execution + +logger = get_logger("tools_api") + +router = APIRouter(prefix="/api/v1", tags=["Tools"]) + +CommandType = Literal["skill", "analytics", "data_access"] + + +class CommandResponse(BaseModel): + command: str # FE-facing slash command, e.g. "/analyze-descriptive" + name: str # internal handler/tool name, e.g. "analyze_descriptive" + type: CommandType + description: str + + +class ListToolsResponse(BaseModel): + count: int + tools: list[CommandResponse] + + +# Single source of truth for the FE slash-command catalog. Order = display order. +# Keep `command` in Harry's convention (verb-first, kebab-case, `/`); `name` is the +# internal route/tool name used by the orchestrator. +_COMMAND_CATALOG: list[CommandResponse] = [ + CommandResponse( + command="/help", + name="help", + type="skill", + description="Show what the assistant can do and guide your next step.", + ), + CommandResponse( + command="/problem-statement", + name="problem_statement", + type="skill", + description="Define and validate your analysis goal (objective + metric) " + "before exploring data.", + ), + CommandResponse( + command="/analyze-descriptive", + name="analyze_descriptive", + type="analytics", + description="Summary statistics for selected columns (count, mean, min, max, …).", + ), + CommandResponse( + command="/analyze-aggregate", + name="analyze_aggregate", + type="analytics", + description="Group and aggregate values (sum, count, average) by dimension.", + ), + CommandResponse( + command="/analyze-correlation", + name="analyze_correlation", + type="analytics", + description="Correlation strength between numeric columns.", + ), + CommandResponse( + command="/analyze-trend", + name="analyze_trend", + type="analytics", + description="Trend of a value over time at a chosen frequency.", + ), + CommandResponse( + command="/check-data", + name="check_data", + type="data_access", + description="Inventory of the available structured data sources.", + ), + CommandResponse( + command="/check-knowledge", + name="check_knowledge", + type="data_access", + description="Inventory of the available knowledge / uploaded documents.", + ), + CommandResponse( + command="/retrieve-data", + name="retrieve_data", + type="data_access", + description="Pull rows from a structured source for analysis.", + ), + CommandResponse( + command="/retrieve-knowledge", + name="retrieve_knowledge", + type="data_access", + description="Retrieve relevant passages from your uploaded documents.", + ), +] + + +@router.get("/tools", response_model=ListToolsResponse) +@log_execution(logger) +async def list_tools() -> ListToolsResponse: + """List the user-invocable slash-command catalog (skills + tools). + + Static per deployment — safe for the Golang backend to cache. + """ + return ListToolsResponse(count=len(_COMMAND_CATALOG), tools=_COMMAND_CATALOG) diff --git a/src/catalog/reader.py b/src/catalog/reader.py index 2c933a62332b752ddac63b065a9e0beb8e26c181..1da38b141b1391cb6659200dcf161425a749c1b9 100644 --- a/src/catalog/reader.py +++ b/src/catalog/reader.py @@ -45,8 +45,9 @@ class MemoizingCatalogReader(CatalogReader): One per request. The same per-user catalog is otherwise fetched from the catalog DB 4-5x during a single slow-path run (planner load, then - describe_source's structured+unstructured reads, then query_structured's - structured read). Wrapping the base reader collapses those to one round-trip + check_data's structured read + check_knowledge's unstructured read, then + retrieve_data's structured read). Wrapping the base reader collapses those + to one round-trip per distinct source_hint and pins a single consistent snapshot for the whole request (plan-time and execution-time catalogs can no longer diverge). """ diff --git a/src/config/prompts/help.md b/src/config/prompts/help.md new file mode 100644 index 0000000000000000000000000000000000000000..8092e77db8dffc1cf1792d372e0bf576819d9b85 --- /dev/null +++ b/src/config/prompts/help.md @@ -0,0 +1,107 @@ + + +You are the **Help guide** for an AI data-analysis assistant. Think of yourself as the +instruction sheet that comes with a board game: your only job is to tell the user +**where they are in their analysis and what to do next**, so they are never lost. You do +**not** do analysis, answer data questions, or invent facts about their data. + +## What you receive this turn + +You are given context, never raw user prose to analyze: + +- **`analysis_state`** — the current per-analysis state. Fields you use: + - `analysis_title` — what this analysis is called. + - `problem_statement` — the user's goal (may be empty/weak; it is optional at creation). + - `problem_validated` (bool) — **the gate.** `false` = the goal still needs work; `true` = the goal is set and analysis is unlocked. + - `report_id` — `0`/absent means no report has ever been generated. +- **`chat_history`** — the conversation so far. Use it to judge how far along the user is and to avoid repeating yourself. +- **`report_ready`** — a **deterministic** signal computed for you (NOT your judgment): + - `ready` (bool) — whether there is enough analysis to generate a report. + - `missing` (list) — if not ready, the gaps to fill. +- **`available_actions`** *(optional)* — which actions are actually wired right now. If present, **only suggest actions listed here.** + +> **Hard rule — never misguide.** Trust the signals above for *what is possible*, not your +> own guess. If `report_ready.ready` is `false`, do **not** tell the user to generate a +> report. If an action isn't in `available_actions`, do not suggest it. If Help is wrong, +> the user is wrong. + +## How to answer — two layers, always + +1. **Where you are + what's next** — one short sentence locating the user, then the single most useful next step. +2. **How** — concrete, do-able instructions for that step (not just "you can analyze now" — show *how* to start). + +Keep it short. Lead with the next step; don't recap everything. + +## State-tiered guidance + +Pick the branch that matches `analysis_state` + `report_ready`: + +### A. `problem_validated == false` → fix the goal first +The user can't get good analysis without a clear goal. Steer them to define or sharpen the +problem statement. +- If `problem_statement` is empty: encourage them to state what they want to find out, and mention the AI can help — they can run **`/problem_statement`** (or just describe their goal in chat). +- If `problem_statement` exists but is vague: gently push for something more **measurable and concrete** (a target, a metric, a timeframe), grounded in their `analysis_title` and the data they've bound. Give one short example of a sharper version. +- Do **not** push analysis or reports yet. + +### B. `problem_validated == true`, little/no analysis yet → orient to analysis +Tell them the goal is set and they can start asking questions about their data. Give the **how**: +- Suggest 2–3 concrete starter questions, **descriptive/basic first** (e.g. "Which products sell the most?", "How have sales trended this month?"). +- **Tie suggestions back to their `problem_statement`** so the analysis stays relevant — don't suggest random analyses. +- **Read `chat_history` first and never re-suggest a question already asked or answered.** Build on what's done with a follow-up that adds *new* evidence (a trend over time, a breakdown, a comparison, a deeper cut), not a repeat of a question that already has an answer. +- You may offer a basic end-to-end "starter analysis" path (a few descriptive questions → a first report), kept simple. + +### C. `problem_validated == true`, analysis under way, `report_ready.ready == false` → close the gaps +They've started but there isn't enough yet for a report. Point at `report_ready.missing` and +recommend the specific next questions that would fill those gaps (phrase them as questions +the user can ask), still anchored to the problem statement. + +### D. `problem_validated == true` and `report_ready.ready == true` → nudge toward the report +There's enough to report. Encourage them to generate it. Report can be triggered **two ways**: +the **`/generate report`** skill **or** the report button — mention both so it feels natural. +Do not over-promise the report's depth. + +## How-to phrasing (degrade gracefully) + +- **Via chat / skills** — write these **accurately and specifically**; they are stable (e.g. "type your question in the chat", "run `/problem_statement`", "run `/generate report`"). +- **Via the UI (buttons/menus)** — the frontend isn't final yet. Describe UI steps **generically** ("use the Generate Report option") rather than naming exact buttons/positions you're unsure of. Prefer the chat/skill path when unsure. *(A later version of this file will fill in the real UI steps.)* +- If a field in `analysis_state` is missing or the state looks unwired, **fall back to generic guidance** rather than guessing specifics. + +## Tone + +Plain, warm, and encouraging — like a helpful guide, **not** a hype trailer. No exclamation +spam, no overselling. Respond in the **user's language** (match `chat_history` — Indonesian or +English). A few sentences is usually enough. + +## Constraints + +- You **only** guide. Never run analysis, never produce report content, never quote data values. +- Never suggest an action that the signals say isn't available or isn't ready. +- One step at a time — give the next step, not the whole roadmap. +- When you suggest questions, **dedupe against `chat_history`** — only propose analyses not yet run that move the goal forward; a question that already has an answer adds no fresh evidence. +- No markdown headers or code fences in your reply; short prose (and an inline `/command` or a tiny bullet list) is fine. + +## Examples + +``` +State: problem_validated=false, problem_statement="" +→ "Looks like we haven't set a goal yet. Tell me what you want to find out — for example, + 'reduce churn next quarter' — or run /problem_statement and I'll help you shape it." + +State: problem_validated=false, problem_statement="make sales better" +→ "Your goal is a good start but a bit broad. Let's make it measurable — e.g. 'grow north-region + revenue by 10% this quarter.' Run /problem_statement and we'll refine it together." + +State: problem_validated=true, chat_history nearly empty +→ "Your goal is set — you can start exploring now. Try a basic question first, like + 'Which products sell the most?' or 'How have monthly sales trended?', then we can dig into + what's driving your goal." + +State: problem_validated=true, report_ready.ready=false, missing=["no comparison over time"] +→ "Good progress. Before a report, it's worth looking at change over time — try asking + 'How does this quarter compare to last?' Once we have that, we can put the report together." + +State: problem_validated=true, report_ready.ready=true +→ "You've covered enough to summarize. You can generate your report now — run /generate report + or use the report option to create it." +``` diff --git a/src/config/prompts/intent_router.md b/src/config/prompts/intent_router.md index 96cc5e53948764b397350c588435a744a2d97ab4..e327a9596c81aebfcfb64285df5d5586c313ae53 100644 --- a/src/config/prompts/intent_router.md +++ b/src/config/prompts/intent_router.md @@ -1,82 +1,119 @@ -You are the intent router for an AI data assistant. Given a user's latest message (and optionally recent conversation history), decide which downstream path should handle it. +You are the intent router for an AI data assistant. Given a user's latest message (and optionally recent conversation history), decide which downstream **handler** should process it. You classify the route only — you do not answer the question. ## Output Return three fields: -- **`needs_search`** — `true` if we must look at the user's data to answer; `false` for greetings, farewells, off-topic chitchat, or meta questions about the assistant itself. -- **`source_hint`** — one of: - - `chat` — no data lookup needed (greetings, farewells, generic small talk). - - `unstructured` — the user is asking about a topic, concept, feature, or factual knowledge that may exist in uploaded documents (PDF / DOCX / TXT). The user does not need to explicitly mention a document. - - `structured` — the user is asking a **data question** answerable from a database or a tabular file (CSV / XLSX / Parquet). This includes counts, sums, top-N, filters, comparisons, trends, joins across registered structured sources. -- **`rewritten_query`** — a **standalone** version of the user's question that incorporates necessary context from history. If the original message is already standalone, return it unchanged. If `needs_search` is `false`, leave this empty/null. +- **`intent`** — exactly one of: + - `chat` — conversational, no data needed: greetings, farewells, thanks, "how are you", "what can you do", small talk. + - `help` — the user wants to know **what to do next** or how the process works ("what's the next step?", "how do I start?", "what should I do now?"). + - `problem_statement` — the user wants to **define or refine the analysis goal**: the business problem, objectives, what to increase/decrease, targets/success metrics — or is answering questions about the goal. + - `check` — the user wants an **inventory** of what they have: "what data do I have?", "what columns are in this table?", "what documents did I upload?", "describe my dataset". This is metadata/listing, not analysis. + - `unstructured_flow` — the user asks about a **topic, concept, feature, explanation, or factual knowledge** that may live in uploaded documents (PDF/DOCX/TXT). Pure document Q&A. The user need not mention a document. + - `structured_flow` — the user asks an **analytical question over their data**: counts, sums, top-N, filters, comparisons, trends, correlations, segments, share-of-total, joins across structured sources. This routes to the slow analytical path. +- **`rewritten_query`** — a **standalone** version of the user's question, with context from history resolved. If the message is already standalone, copy it verbatim. Leave empty/null for `chat` and `help`. +- **`confidence`** — your confidence in the chosen intent, a number in [0, 1]. ## Routing rules -1. If the message is ONLY a pure greeting / farewell / thanks / "how are you" / "what can you do" / compliment with no factual question → `chat` + `needs_search=false`. -2. If the message asks a data question answerable from a database or tabular file (counts, sums, top-N, filters, comparisons, trends, sheet rows, table columns) → `structured` + `needs_search=true`. -3. If the message asks about a topic, concept, feature, explanation, summary, or factual knowledge — even without explicitly mentioning a document — route to `unstructured` + `needs_search=true`. The user may have uploaded relevant documents covering that topic. -4. If ambiguous between structured and unstructured → prefer `unstructured`. Only prefer `structured` if there are clear signals of tabular/numeric data questions. -5. Cross-source comparison ("compare DB sales to the customers.csv file") → `structured`. The planner sees both source types in one prompt and can correlate. +1. Pure greeting / farewell / thanks / "what can you do" / compliment with no task → `chat`. +2. "What do I do next / how do I proceed / where do I start" → `help`. +3. The user states or refines a goal, objective, target, or success metric, or answers a goal-defining question → `problem_statement`. +4. "What data / columns / tables / documents do I have", "describe my data", inventory or metadata requests → `check`. +5. A question answerable from document prose — a topic, concept, feature, explanation, summary, or factual knowledge, even without naming a document → `unstructured_flow`. +6. An analytical question answerable by computing over tabular/DB data (counts, sums, top-N, filters, comparisons, trends, correlations, segments) → `structured_flow`. + +## Disambiguation (the boundaries that matter) + +- **`check` vs `structured_flow`** — "what do I have / describe it" → `check`; "analyze / compute / trend / correlate / compare it" → `structured_flow`. +- **`unstructured_flow` vs `structured_flow`** — pure document/concept Q&A → `unstructured_flow`; anything needing computation over tabular/DB data → `structured_flow`. **When in doubt between "analytical AND also needs document context" → `structured_flow`** (the analytical path can pull document context itself). Only choose `unstructured_flow` for *pure* document questions with no computation. +- **`help` vs `problem_statement`** — "what's next?" → `help`; "here is my goal / let's define the objective" → `problem_statement`. +- **`chat` vs everything else** — only use `chat` when there is no task and no data question at all. ## Rewriting follow-ups -When history is present and the new message references prior context using pronouns or fragments ("tell me more", "what about last quarter?", "and by region?"), expand the rewritten_query into a fully standalone question. Example: +When history is present and the new message references prior context with pronouns or fragments ("tell me more", "what about last quarter?", "and by region?"), expand `rewritten_query` into a fully standalone question. Example: History: "What was our top product last month?" → "Pro Plan Annual at $487k" Message: "How does that compare to Q1?" rewritten_query: "How does Pro Plan Annual's revenue last month compare to Q1?" -If the original is already standalone, copy it verbatim into rewritten_query. +If the original is already standalone, copy it verbatim into `rewritten_query`. ## Few-shot examples ``` User: "Hi" -→ needs_search=false, source_hint="chat", rewritten_query=null +→ intent="chat", rewritten_query=null, confidence=0.99 User: "Bye, thanks" -→ needs_search=false, source_hint="chat", rewritten_query=null +→ intent="chat", rewritten_query=null, confidence=0.99 User: "What can you do?" -→ needs_search=false, source_hint="chat", rewritten_query=null +→ intent="chat", rewritten_query=null, confidence=0.95 -User: "How many orders did we get last month?" -→ needs_search=true, source_hint="structured", - rewritten_query="How many orders did we get last month?" +User: "Okay I uploaded my data, what do I do next?" +→ intent="help", rewritten_query=null, confidence=0.93 + +User: "How does this work? Where should I start?" +→ intent="help", rewritten_query=null, confidence=0.9 + +User: "I want to reduce customer churn next quarter, target under 5%." +→ intent="problem_statement", + rewritten_query="Define the analysis goal: reduce customer churn next quarter to under 5%.", + confidence=0.9 + +User: "My goal is to grow revenue in the north region." +→ intent="problem_statement", + rewritten_query="Define the analysis goal: grow revenue in the north region.", + confidence=0.88 + +User: "What data do I have?" +→ intent="check", rewritten_query="What data sources do I have?", confidence=0.95 + +User: "What columns are in the orders table?" +→ intent="check", rewritten_query="What columns are in the orders table?", confidence=0.93 + +User: "What documents have I uploaded?" +→ intent="check", rewritten_query="What documents have I uploaded?", confidence=0.93 User: "What does the Q1 board memo say about churn?" -→ needs_search=true, source_hint="unstructured", - rewritten_query="What does the Q1 board memo say about churn?" +→ intent="unstructured_flow", + rewritten_query="What does the Q1 board memo say about churn?", confidence=0.9 -User: "Top 5 customers by revenue this year" -→ needs_search=true, source_hint="structured", - rewritten_query="Top 5 customers by revenue this year" +User: "jelaskan tentang machine learning" +→ intent="unstructured_flow", rewritten_query="Explain machine learning", confidence=0.85 User: "apa key feature dari iot connectivity?" -→ needs_search=true, source_hint="unstructured", - rewritten_query="What are the key features of IoT connectivity?" +→ intent="unstructured_flow", + rewritten_query="What are the key features of IoT connectivity?", confidence=0.85 -User: "jelaskan tentang machine learning" -→ needs_search=true, source_hint="unstructured", - rewritten_query="Explain machine learning" +User: "How many orders did we get last month?" +→ intent="structured_flow", + rewritten_query="How many orders did we get last month?", confidence=0.92 + +User: "Top 5 customers by revenue this year" +→ intent="structured_flow", + rewritten_query="Top 5 customers by revenue this year", confidence=0.93 -User: "bagaimana cara kerja neural network?" -→ needs_search=true, source_hint="unstructured", - rewritten_query="How does a neural network work?" +User: "Is there a correlation between discount and units sold?" +→ intent="structured_flow", + rewritten_query="Is there a correlation between discount and units sold?", confidence=0.9 -User: "what is the main purpose of this system?" -→ needs_search=true, source_hint="unstructured", - rewritten_query="What is the main purpose of this system?" +User: "How has monthly revenue trended by region, and what stands out?" +→ intent="structured_flow", + rewritten_query="How has monthly revenue trended by region this year, and what is unusual?", + confidence=0.88 History: assistant: "Pro Plan Annual led at $487,200 in April." User: "And in March?" -→ needs_search=true, source_hint="structured", - rewritten_query="What was Pro Plan Annual's revenue in March?" +→ intent="structured_flow", + rewritten_query="What was Pro Plan Annual's revenue in March?", confidence=0.9 ``` ## Constraints -- Do not invent data. If the question is factual or knowledge-based (not clearly tabular), route to `unstructured` and let the retriever decide. Only route to `structured` if the question clearly involves counts, sums, filters, or trends from tabular sources. +- Pick exactly one `intent`. Do not invent values outside the six listed. +- Prefer `unstructured_flow` over `structured_flow` only for pure knowledge/document questions; prefer `structured_flow` whenever computation over data is involved. - Do not refuse — refusal happens later in guardrails. Just classify. - One JSON object as output; no prose, no markdown. diff --git a/src/config/prompts/planner.md b/src/config/prompts/planner.md index 8946de1bbf716ef42d78c444d824de1364e7d1a3..9e49dcb1048622bb2bc08855a82366b47d8d58eb 100644 --- a/src/config/prompts/planner.md +++ b/src/config/prompts/planner.md @@ -10,7 +10,7 @@ only a `TaskList` object that conforms to the provided schema. 1. **Emit intent, never code.** Never write SQL, pandas, or any code. The only query you express is an inline `QueryIR` (a JSON intent object) inside a - `query_structured` tool call's `args.ir`. + `retrieve_data` tool call's `args.ir`. 2. **The plan is static.** There is no replanning and no execution feedback. Plan the whole analysis up front; assume each task runs once, in dependency order. 3. **Use only tools from the "Available tools" list.** Never invent a tool name. @@ -31,16 +31,27 @@ only a `TaskList` object that conforms to the provided schema. - **Wire data between tasks with placeholders.** When a task needs an upstream task's output as an argument, use the string `"${t}"` (e.g. `"${t2}"`) as the argument value. Set `depends_on` accordingly. -- **Data access vs analytics tools.** `query_structured` is the data-access entry +- **Data access vs analytics tools.** `retrieve_data` is the data-access entry point: use it to select, filter, and pull rows (and simple built-in count/sum/avg/min/max/count_distinct the IR can express). For anything richer — descriptive statistics (median/percentile/mode/std/skew), time trends, group comparisons, share-of-total, correlation, segmentation, or data-quality - profiling — run `query_structured` to fetch the rows, then pass its output to + profiling — run `retrieve_data` to fetch the rows, then pass its output to the matching composite `analyze_*` tool via a `"${t}"` `data` argument (referencing the upstream result's column aliases). +- **Measure by a dimension in another table (joins).** When the number you are + aggregating and the grouping dimension live in DIFFERENT tables of the same + database source, add a `joins` entry to the `retrieve_data` IR along a foreign + key declared in the catalog — do NOT pick a table that lacks the measure, and do + NOT try to "combine" unrelated tables. Example — "revenue by category": the + measure `order_items.line_total` joined to `products` on + `order_items.product_id = products.id`, grouped by `products.category`. Prefer an + existing measure column over recomputing; use a single table (no join) when the + measure and dimension already live together (e.g. "revenue by region" from + `orders.region` + `orders.total_amount`). Joins are database-only — not available + for tabular/file sources. - **Mixing structured + unstructured.** If qualitative context helps, add a - `retrieve_documents` task against an unstructured source listed in the catalog. + `retrieve_knowledge` task against an unstructured source listed in the catalog. - **CRISP-DM stages.** Tag each task with the stage it serves: `data_understanding`, `data_preparation`, or `evaluation`. (Never `modeling`.) - **success_criteria is a reporting signal**, not a control trigger. State, in diff --git a/src/config/prompts/problem_statement.md b/src/config/prompts/problem_statement.md new file mode 100644 index 0000000000000000000000000000000000000000..8da9af5350d074cd4f732be83557fe1459bfe509 --- /dev/null +++ b/src/config/prompts/problem_statement.md @@ -0,0 +1,41 @@ +You are the Problem Statement coach for an AI data-analysis assistant. Your job is to help the user turn a vague goal into a clear, analyzable **problem statement**, using the analysis title and the conversation so far. + +You do not run analysis. You only shape the problem statement and decide whether it is complete enough. + +## What a complete problem statement needs + +1. **problem_statement** — a clear, standalone sentence describing the business problem or decision (refine the user's wording; incorporate the analysis title where useful). +2. **objective** — what success looks like (e.g. "reduce churn", "grow north-region revenue", "understand drivers of retention"). +3. **metric** — the concrete measure to move or investigate (e.g. "churn rate", "monthly revenue", "retention score"). + +A statement is **complete** only when all three are present and concrete, and **the user explicitly stated the objective and the metric in their own words**. If they haven't, leave that field empty, list it in `missing`, and ask for it in `feedback`. + +**A bare data question is NOT a complete problem statement.** Questions like "which product category has the most revenue?", "what's our top region?", "how many orders last month?" only tell you *what to compute* — they do not state a business objective or a target metric to move. Do **not** infer `objective`/`metric` from such a question. Put both in `missing` and ask the user for the actual goal. + +## Output (structured) + +- **`problem_statement`** — your best refined version so far (never empty; use the title if that's all you have). +- **`objective`** — filled ONLY when the user explicitly stated it; otherwise empty string. +- **`metric`** — filled ONLY when the user explicitly stated it; otherwise empty string. +- **`missing`** — the list of which fields among `objective` / `metric` the user has not yet explicitly stated. Empty list means the statement is complete and will be validated. A bare data question must yield `missing: ["objective", "metric"]`. +- **`feedback`** — a short, friendly message. If `missing` is non-empty: explain what's missing and ask one focused question. If complete: confirm the problem statement back and say they can start analyzing. + +## Rules + +- Be concise and concrete. One focused follow-up question at a time — don't interrogate. +- Only fill `objective`/`metric` from what the user **explicitly stated**, never from what a question merely implies. Empty + listed in `missing` is correct when the user hasn't said it. +- Keep `problem_statement` decision-oriented, not a restatement of the data. +- Match the user's language (English / Indonesian). + +## Examples + +**Incomplete — a bare data question (do NOT validate):** +User: "Which product category generates the most total revenue?" +→ `problem_statement`: "Identify which product category drives the most total revenue." +→ `objective`: "" · `metric`: "" · `missing`: ["objective", "metric"] +→ `feedback`: "Good starting question. To set this up as an analysis goal: what business outcome are you trying to drive (e.g. grow revenue, cut cost), and which metric should we track (e.g. total revenue per category)?" + +**Complete — the user stated the goal + metric:** +User: "Goal: grow total revenue by focusing marketing on the top categories. Metric: total revenue per category." +→ `objective`: "grow total revenue by focusing on the top categories" · `metric`: "total revenue per category" · `missing`: [] +→ `feedback`: "Your problem statement is complete — you can start the analysis." diff --git a/src/config/prompts/query_planner.md b/src/config/prompts/query_planner.md index e2ed549dc1b90946b979042310161aed04e0df68..bfa6c9ca8906c9eae8ef924d1edde66c038e8980 100644 --- a/src/config/prompts/query_planner.md +++ b/src/config/prompts/query_planner.md @@ -15,7 +15,10 @@ A `QueryIR` object: { "ir_version": "1.0", "source_id": "...", // pick from catalog - "table_id": "...", // pick from chosen source + "table_id": "...", // base table + "joins": [ // optional; DATABASE sources only; FK-backed + {"target_table_id": "...", "left_column_id": "...", "right_column_id": "...", "type": "inner"} + ], "select": [ {"kind": "column", "column_id": "...", "alias": "..."}, {"kind": "agg", "fn": "count|count_distinct|sum|avg|min|max", @@ -36,7 +39,7 @@ A `QueryIR` object: ## Hard constraints (a violation makes the IR invalid) 1. `source_id`, `table_id`, `column_id` must come **verbatim** from the catalog. Never invent IDs or copy table/column **names** in their place. -2. **Single-table only in v1.** Pick the table whose columns best answer the question. If the question genuinely needs a join, pick the table that yields the most useful answer alone and the user can refine. +2. **Joins (database sources only).** Prefer a single table when the measure and the grouping dimension already live together. When the number you aggregate and the dimension are in DIFFERENT tables of the same DB source, add a `joins` entry along a **foreign key declared in the catalog** — `left_column_id` must already be in the query (base table or an earlier join), `right_column_id` is in `target_table_id`. Do NOT pick a table missing the measure, and do NOT join unrelated tables. Tabular/file sources are single-table only — no joins. 3. Use only listed operators / aggregates. No window functions, no `CASE WHEN`, no subqueries — those are not part of v1. 4. `value_type` must be compatible with the column's `data_type`: - `int` column ↔ value_type ∈ {int, decimal} @@ -115,6 +118,28 @@ Output: } ``` +Question: "Which product category generates the most total revenue?" +(catalog DB source `src_prod_db` also has: `order_items` [id=t_order_items] with `line_total` [decimal, id=c_oi_line_total] and `product_id` [int, id=c_oi_product_id, FK → products.id]; `products` [id=t_products] with `category` [string, id=c_products_category] and `id` [int, id=c_products_id]) +Output: +```json +{ + "ir_version": "1.0", + "source_id": "src_prod_db", + "table_id": "t_order_items", + "joins": [ + {"target_table_id": "t_products", "left_column_id": "c_oi_product_id", "right_column_id": "c_products_id", "type": "inner"} + ], + "select": [ + {"kind": "column", "column_id": "c_products_category"}, + {"kind": "agg", "fn": "sum", "column_id": "c_oi_line_total", "alias": "total_revenue"} + ], + "filters": [], + "group_by": ["c_products_category"], + "order_by": [{"column_id": "total_revenue", "dir": "desc"}], + "limit": 100 +} +``` + Catalog excerpt (tabular source — XLSX sheet): ``` diff --git a/src/config/prompts/report_summary.md b/src/config/prompts/report_summary.md new file mode 100644 index 0000000000000000000000000000000000000000..c756f0d593296b30d326a36cad862633aa209dd0 --- /dev/null +++ b/src/config/prompts/report_summary.md @@ -0,0 +1,10 @@ +You are a senior data analyst writing the **executive summary** of an analysis report. + +You are given the Problem Statement and a list of already-finalized findings and caveats drawn from completed analyses. Write a concise executive summary (3–5 sentences) that synthesizes those findings in relation to the stated goal. + +Rules: +- Synthesize and prioritize — lead with the most decision-relevant finding. +- Do NOT introduce any number, fact, or claim that is not present in the findings. You are summarizing, not analyzing. +- Do NOT simply restate every finding; connect them into a narrative and say what they mean for the goal. +- If the findings are thin or inconclusive, say so plainly rather than overstating. +- Plain business language, prose only — no headings, no bullet lists. diff --git a/src/config/settings.py b/src/config/settings.py index 38a956cec789de70033442abd4a80e0fe1013acd..69cf65b79f56e8baa314cbe0b498af769a64682e 100644 --- a/src/config/settings.py +++ b/src/config/settings.py @@ -24,6 +24,13 @@ class Settings(BaseSettings): # real source lands, so this stays opt-in. enable_slow_path: bool = Field(alias="enable_slow_path", default=False) + # Apply the deterministic gate (problem_validated) before dispatch: redirect + # `structured_flow` to `problem_statement` until the analysis is validated. Off + # by default — legacy `rooms` have no `analysis_states` row, so it would gate + # everything. Flip ENABLE_GATE=true once the frontend creates analyses via + # /analysis/create. + enable_gate: bool = Field(alias="enable_gate", default=False) + # Database postgres_connstring: str diff --git a/src/db/postgres/init_db.py b/src/db/postgres/init_db.py index 1d009d06113d2e570312d605f3ce7b68f3d0bf50..bf10d0b72def21659d5181958fcd962642555042 100644 --- a/src/db/postgres/init_db.py +++ b/src/db/postgres/init_db.py @@ -3,6 +3,10 @@ from sqlalchemy import text from src.db.postgres.connection import engine, Base from src.db.postgres.models import ( + AnalysisDataSourceRow, + AnalysisRecordRow, + AnalysisReportRow, + AnalysisStateRow, Catalog, ChatMessage, DatabaseClient, diff --git a/src/db/postgres/models.py b/src/db/postgres/models.py index 8224aa8ea59259ba33bb96004bd1c3e71f6db37d..40f4b0807af7a342448fd10da4669dd8738b2ea1 100644 --- a/src/db/postgres/models.py +++ b/src/db/postgres/models.py @@ -1,10 +1,20 @@ """SQLAlchemy database models.""" from uuid import uuid4 -from sqlalchemy import Column, String, DateTime, Text, Integer, ForeignKey + +from sqlalchemy import ( + Boolean, + Column, + DateTime, + ForeignKey, + Integer, + String, + Text, +) +from sqlalchemy.dialects.postgresql import JSONB, UUID from sqlalchemy.orm import relationship from sqlalchemy.sql import func -from sqlalchemy.dialects.postgresql import JSONB + from src.db.postgres.connection import Base @@ -115,4 +125,90 @@ class Catalog(Base): schema_version = Column(String, nullable=False, default="1.0") generated_at = Column(DateTime(timezone=True), server_default=func.now()) updated_at = Column(DateTime(timezone=True), onupdate=func.now()) - \ No newline at end of file + + +class AnalysisRecordRow(Base): + """One row per completed slow-path analysis (the report's source of truth). + + `data` holds the full Pydantic AnalysisRecord + (src/agents/slow_path/schemas.py:AnalysisRecord) serialized via + `model_dump(mode="json")`; the read path rehydrates with + `AnalysisRecord.model_validate(...)`. Many records accumulate per analysis + session — `generate_report` reads them by `analysis_id`, oldest-first. + + `analysis_id` is nullable until the Analysis State (owned upstream) is wired + into the slow path; records still persist (and carry `user_id`) before then. + """ + __tablename__ = "analysis_records" + + id = Column(String, primary_key=True) # AnalysisRecord.record_id + analysis_id = Column(String, index=True) # FK to the analysis session (nullable for now) + user_id = Column(String, nullable=False, index=True) + plan_id = Column(String, nullable=False) + data = Column(JSONB, nullable=False) + created_at = Column(DateTime(timezone=True), server_default=func.now()) + + +class AnalysisReportRow(Base): + """One immutable row per generated report version — dedorch `reports` (Go-owned). + + dedorch stores the rendered markdown `content` + `title` + `version` (no jsonb + snapshot — markdown-only per the 2026-06-23 checkpoint). The read path rebuilds a + minimal `AnalysisReport` (structured fields empty; `rendered_markdown` = content). + Versions accumulate per analysis; versioning is serialized by a per-analysis + advisory lock in `ReportStore`. Class name kept; table + shape changed for dedorch. + """ + __tablename__ = "reports" + + id = Column(UUID(as_uuid=False), primary_key=True) # AnalysisReport.report_id (uuid) + analysis_id = Column(UUID(as_uuid=False), nullable=False, index=True) + title = Column(String, nullable=False) + content = Column(Text, nullable=False) # rendered markdown + generated_at = Column(DateTime(timezone=True), nullable=False, server_default=func.now()) + version = Column(Integer, nullable=False) + + +class AnalysisStateRow(Base): + """Per-analysis session state — the dedorch `analysis` table (Go-owned migration). + + One session = one analysis = one conversation; `id` is the shared session id + (canonical UUID). The orchestrator gate + Help skill read this every turn; + `problem_validated` gates structured analysis; the Problem Statement skill flips + it; `report_id` is null until a report exists. `id`/`report_id` are Postgres + `uuid` in dedorch, so they bind as UUID (canonical-string in/out). Class name + kept as `AnalysisStateRow`; only the table + id types changed for dedorch. + """ + __tablename__ = "analysis" + + id = Column(UUID(as_uuid=False), primary_key=True) # shared session id (uuid) + analysis_title = Column(String, nullable=False, default="New analysis") + problem_statement = Column(Text, nullable=False, default="") + problem_validated = Column(Boolean, nullable=False, default=False) + owner_id = Column(String, nullable=False, index=True) + report_id = Column(UUID(as_uuid=False), nullable=True) + created_at = Column(DateTime(timezone=True), server_default=func.now()) + updated_at = Column( + DateTime(timezone=True), server_default=func.now(), onupdate=func.now() + ) + + +class AnalysisDataSourceRow(Base): + """Per-analysis data-source binding (#10) — dedorch `data_sources` (Go-owned). + + Which catalog sources an analysis is scoped to. `reference_id` is the catalog + `Source.source_id`; `type`/`name` snapshot the source kind + label. Written at + `/analysis/create`; read by `structured_flow` scoping + the report appendix. + `source_metadata` maps to the `metadata` column (`metadata` is reserved by the + declarative API). Class name kept; table + shape changed for dedorch. + """ + __tablename__ = "data_sources" + + id = Column(UUID(as_uuid=False), primary_key=True) + analysis_id = Column(UUID(as_uuid=False), nullable=False, index=True) + type = Column(String, nullable=False) + name = Column(String, nullable=False) + reference_id = Column(String, nullable=False) # == catalog Source.source_id + bound_by = Column(String, nullable=False) + bound_at = Column(DateTime(timezone=True), nullable=False, server_default=func.now()) + source_metadata = Column("metadata", JSONB, nullable=True) + created_at = Column(DateTime(timezone=True), nullable=False, server_default=func.now()) diff --git a/src/models/api/report.py b/src/models/api/report.py new file mode 100644 index 0000000000000000000000000000000000000000..8f88e4fad910b4edc8910144716b4746594d48cd --- /dev/null +++ b/src/models/api/report.py @@ -0,0 +1,19 @@ +"""Request / response models for report routes (KM-644). + +The full report payload is the `AnalysisReport` contract +(src/agents/report/schemas.py); this module only adds the lightweight list-entry +returned by the versions endpoint. +""" + +from datetime import datetime + +from pydantic import BaseModel, Field + + +class ReportVersionEntry(BaseModel): + """One row in a session's report-version list (for the Analysis-menu sidebar).""" + + report_id: str = Field(..., description="Stable report identifier.") + version: int = Field(..., description="Monotonic version (V1, V2, …).") + generated_at: datetime = Field(..., description="When this version was generated.") + record_count: int = Field(..., description="Number of AnalysisRecords it was built from.") diff --git a/src/query/compiler/pandas.py b/src/query/compiler/pandas.py index b968d5f5c4de8d280be1822e509f5f0f6bffbadd..8ae9cd1f69140a65cbd95b61b6115e5a16c19b12 100644 --- a/src/query/compiler/pandas.py +++ b/src/query/compiler/pandas.py @@ -48,6 +48,13 @@ class PandasCompiler(BaseCompiler): self._catalog = catalog def compile(self, ir: QueryIR) -> CompiledPandas: + if ir.joins: + # Tabular joins need the executor to load + merge multiple parquet blobs + # (deferred). The validator already rejects joins on tabular sources; + # this is the defensive backstop. Joins are DB-only in v1 (KM-652 T4). + raise PandasCompilerError( + "joins are not supported on tabular sources yet (database sources only)" + ) _, table, cols_by_id = self._lookup(ir) output_columns = _output_column_names(ir.select, cols_by_id) diff --git a/src/query/compiler/sql.py b/src/query/compiler/sql.py index 707100cb1731e235f41b97ed655f0c1f2b643b13..12c0fb275b1edc4176e87151e14d9fc728f885c1 100644 --- a/src/query/compiler/sql.py +++ b/src/query/compiler/sql.py @@ -9,9 +9,10 @@ The output `CompiledSql.sql` uses SQLAlchemy-style named placeholders (`:p_0, :p_1, ...`) so it can be executed via `text(sql)` with a params dict on a sync SQLAlchemy engine. -v1 supports the Postgres dialect only. Supabase reuses the same compiler -output (Supabase = Postgres). MySQL / BigQuery / Snowflake compilers will -be separate classes that implement `BaseCompiler`. +Joins (KM-652 T4): a column_id resolves to its owning table across the base +table + any joined tables, so every reference is emitted table-qualified +(`"table"."col"`). With `joins=[]` the output is identical to the single-table +form. v1 supports the Postgres dialect only (Supabase = Postgres). """ from __future__ import annotations @@ -36,6 +37,9 @@ from .base import BaseCompiler # `row_cap` and flags truncation. MAX_RESULT_ROWS = 10_000 +# A resolved column reference: the table that owns it + the column itself. +ColRef = tuple[Table, Column] + @dataclass class CompiledSql: @@ -65,17 +69,15 @@ class SqlCompiler(BaseCompiler): self._dialect = dialect def compile(self, ir: QueryIR) -> CompiledSql: - _, table, cols_by_id = self._lookup(ir) + base_table, cols_by_id = self._lookup(ir) params: dict[str, Any] = {} param_seq = [0] - select_clause, select_aliases = self._build_select(ir.select, table, cols_by_id) - from_clause = self._build_from(table) - where_clause = self._build_where(ir.filters, table, cols_by_id, params, param_seq) - groupby_clause = self._build_groupby(ir.group_by, table, cols_by_id) - orderby_clause = self._build_orderby( - ir.order_by, table, cols_by_id, select_aliases - ) + select_clause, select_aliases = self._build_select(ir.select, cols_by_id) + from_clause = self._build_from(base_table, ir, cols_by_id) + where_clause = self._build_where(ir.filters, cols_by_id, params, param_seq) + groupby_clause = self._build_groupby(ir.group_by, cols_by_id) + orderby_clause = self._build_orderby(ir.order_by, cols_by_id, select_aliases) limit_clause, row_cap = self._build_limit(ir.limit) parts: list[str] = [select_clause, from_clause] @@ -86,23 +88,33 @@ class SqlCompiler(BaseCompiler): return CompiledSql(sql=" ".join(parts), params=params, row_cap=row_cap) # ------------------------------------------------------------------ - # Catalog lookup + # Catalog lookup — column_id -> (owning Table, Column) across base + joins # ------------------------------------------------------------------ - def _lookup(self, ir: QueryIR) -> tuple[Source, Table, dict[str, Column]]: + def _lookup(self, ir: QueryIR) -> tuple[Table, dict[str, ColRef]]: source = next( (s for s in self._catalog.sources if s.source_id == ir.source_id), None ) if source is None: raise SqlCompilerError(f"source_id {ir.source_id!r} not in catalog") - table = next( - (t for t in source.tables if t.table_id == ir.table_id), None - ) + base_table = self._find_table(source, ir.table_id) + cols_by_id: dict[str, ColRef] = { + c.column_id: (base_table, c) for c in base_table.columns + } + for j in ir.joins: + target = self._find_table(source, j.target_table_id) + for c in target.columns: + cols_by_id[c.column_id] = (target, c) + return base_table, cols_by_id + + @staticmethod + def _find_table(source: Source, table_id: str) -> Table: + table = next((t for t in source.tables if t.table_id == table_id), None) if table is None: raise SqlCompilerError( - f"table_id {ir.table_id!r} not in source {ir.source_id!r}" + f"table_id {table_id!r} not in source {source.source_id!r}" ) - return source, table, {c.column_id: c for c in table.columns} + return table # ------------------------------------------------------------------ # Identifier quoting @@ -113,7 +125,8 @@ class SqlCompiler(BaseCompiler): """Postgres-style double-quoted identifier with embedded-quote escape.""" return '"' + name.replace('"', '""') + '"' - def _qcol(self, table: Table, col: Column) -> str: + def _qcol(self, ref: ColRef) -> str: + table, col = ref return f"{self._qident(table.name)}.{self._qident(col.name)}" # ------------------------------------------------------------------ @@ -121,17 +134,14 @@ class SqlCompiler(BaseCompiler): # ------------------------------------------------------------------ def _build_select( - self, - items: list[SelectItem], - table: Table, - cols_by_id: dict[str, Column], + self, items: list[SelectItem], cols_by_id: dict[str, ColRef] ) -> tuple[str, set[str]]: if not items: raise SqlCompilerError("select clause cannot be empty") parts: list[str] = [] aliases: set[str] = set() for i, item in enumerate(items): - expr, alias = self._select_item(item, table, cols_by_id, i) + expr, alias = self._select_item(item, cols_by_id, i) if alias: parts.append(f"{expr} AS {self._qident(alias)}") aliases.add(alias) @@ -140,35 +150,27 @@ class SqlCompiler(BaseCompiler): return "SELECT " + ", ".join(parts), aliases def _select_item( - self, - item: SelectItem, - table: Table, - cols_by_id: dict[str, Column], - index: int, + self, item: SelectItem, cols_by_id: dict[str, ColRef], index: int ) -> tuple[str, str | None]: if isinstance(item, ColumnSelect): - col = self._require_col(cols_by_id, item.column_id, f"select[{index}]") - return self._qcol(table, col), item.alias + ref = self._require_col(cols_by_id, item.column_id, f"select[{index}]") + return self._qcol(ref), item.alias if not isinstance(item, AggSelect): raise SqlCompilerError( f"select[{index}]: unknown SelectItem kind {type(item).__name__}" ) - return self._compile_agg(item, table, cols_by_id, index), item.alias + return self._compile_agg(item, cols_by_id, index), item.alias def _compile_agg( - self, - item: AggSelect, - table: Table, - cols_by_id: dict[str, Column], - index: int, + self, item: AggSelect, cols_by_id: dict[str, ColRef], index: int ) -> str: if item.fn == "count_distinct": if item.column_id is None: raise SqlCompilerError( f"select[{index}].fn=count_distinct requires column_id" ) - col = self._require_col(cols_by_id, item.column_id, f"select[{index}]") - return f"COUNT(DISTINCT {self._qcol(table, col)})" + ref = self._require_col(cols_by_id, item.column_id, f"select[{index}]") + return f"COUNT(DISTINCT {self._qcol(ref)})" if item.column_id is None: if item.fn != "count": raise SqlCompilerError( @@ -176,24 +178,35 @@ class SqlCompiler(BaseCompiler): "(only 'count' may omit it for COUNT(*))" ) return "COUNT(*)" - col = self._require_col(cols_by_id, item.column_id, f"select[{index}]") - return f"{item.fn.upper()}({self._qcol(table, col)})" + ref = self._require_col(cols_by_id, item.column_id, f"select[{index}]") + return f"{item.fn.upper()}({self._qcol(ref)})" - def _build_from(self, table: Table) -> str: - return f"FROM {self._qident(table.name)}" + def _build_from( + self, base_table: Table, ir: QueryIR, cols_by_id: dict[str, ColRef] + ) -> str: + sql = f"FROM {self._qident(base_table.name)}" + for i, j in enumerate(ir.joins): + left = self._require_col(cols_by_id, j.left_column_id, f"joins[{i}].left") + right = self._require_col(cols_by_id, j.right_column_id, f"joins[{i}].right") + target_table = right[0] + keyword = "INNER JOIN" if j.type == "inner" else "LEFT JOIN" + sql += ( + f" {keyword} {self._qident(target_table.name)} " + f"ON {self._qcol(left)} = {self._qcol(right)}" + ) + return sql def _build_where( self, filters: list[FilterClause], - table: Table, - cols_by_id: dict[str, Column], + cols_by_id: dict[str, ColRef], params: dict[str, Any], param_seq: list[int], ) -> str: if not filters: return "" parts = [ - self._compile_filter(f, table, cols_by_id, params, param_seq, index=i) + self._compile_filter(f, cols_by_id, params, param_seq, index=i) for i, f in enumerate(filters) ] return "WHERE " + " AND ".join(parts) @@ -201,14 +214,13 @@ class SqlCompiler(BaseCompiler): def _compile_filter( self, f: FilterClause, - table: Table, - cols_by_id: dict[str, Column], + cols_by_id: dict[str, ColRef], params: dict[str, Any], param_seq: list[int], index: int, ) -> str: - col = self._require_col(cols_by_id, f.column_id, f"filters[{index}]") - col_ref = self._qcol(table, col) + ref = self._require_col(cols_by_id, f.column_id, f"filters[{index}]") + col_ref = self._qcol(ref) op = f.op if op == "is_null": @@ -248,15 +260,12 @@ class SqlCompiler(BaseCompiler): raise SqlCompilerError(f"filters[{index}]: unhandled op {op!r}") def _build_groupby( - self, - group_by: list[str], - table: Table, - cols_by_id: dict[str, Column], + self, group_by: list[str], cols_by_id: dict[str, ColRef] ) -> str: if not group_by: return "" parts = [ - self._qcol(table, self._require_col(cols_by_id, col_id, f"group_by[{i}]")) + self._qcol(self._require_col(cols_by_id, col_id, f"group_by[{i}]")) for i, col_id in enumerate(group_by) ] return "GROUP BY " + ", ".join(parts) @@ -264,8 +273,7 @@ class SqlCompiler(BaseCompiler): def _build_orderby( self, order_by: list[OrderByClause], - table: Table, - cols_by_id: dict[str, Column], + cols_by_id: dict[str, ColRef], select_aliases: set[str], ) -> str: if not order_by: @@ -273,12 +281,12 @@ class SqlCompiler(BaseCompiler): parts: list[str] = [] for i, ob in enumerate(order_by): if ob.column_id in cols_by_id: - ref = self._qcol(table, cols_by_id[ob.column_id]) + ref = self._qcol(cols_by_id[ob.column_id]) elif ob.column_id in select_aliases: ref = self._qident(ob.column_id) else: raise SqlCompilerError( - f"order_by[{i}].column_id: {ob.column_id!r} not in table " + f"order_by[{i}].column_id: {ob.column_id!r} not in query " "columns or select aliases" ) parts.append(f"{ref} {ob.dir.upper()}") @@ -312,9 +320,9 @@ class SqlCompiler(BaseCompiler): @staticmethod def _require_col( - cols_by_id: dict[str, Column], col_id: str, where: str - ) -> Column: - col = cols_by_id.get(col_id) - if col is None: - raise SqlCompilerError(f"{where}.column_id: {col_id!r} not in table") - return col + cols_by_id: dict[str, ColRef], col_id: str, where: str + ) -> ColRef: + ref = cols_by_id.get(col_id) + if ref is None: + raise SqlCompilerError(f"{where}.column_id: {col_id!r} not in query tables") + return ref diff --git a/src/query/executor/db.py b/src/query/executor/db.py index 0d025df9f23437baaace124fc69e5b2599753b12..faefdcb671c0cac9e5cda0705f7f6c4d3a3380d0 100644 --- a/src/query/executor/db.py +++ b/src/query/executor/db.py @@ -220,7 +220,7 @@ class DbExecutor(BaseExecutor): """Best-effort: warm pooled engines for the catalog's schema sources. Called at slow-path entry so the TCP+TLS+auth handshake overlaps the ~4s - Planner LLM call — by the time `query_structured` runs, the connection is + Planner LLM call — by the time `retrieve_data` runs, the connection is already established. Warming is an optimization, never a requirement, so this never raises and per-source failures are swallowed. """ diff --git a/src/query/ir/models.py b/src/query/ir/models.py index 3fb50f8410e81db66205dbd78350b145f7350d44..e731242ea5bc25c7c5290a8d1c9d6e90d020fffa 100644 --- a/src/query/ir/models.py +++ b/src/query/ir/models.py @@ -2,8 +2,14 @@ See ARCHITECTURE.md §7 for the schema. -Initial scope: single-table; filter, group_by, agg, order_by, limit. -Joins, having, offset, boolean tree filters are deferred to later versions. +Scope: filter, group_by, agg, order_by, limit, plus a single-level `joins` to +related tables in the SAME source (KM-652 T4). having, offset, and boolean tree +filters are still deferred. Joins are supported on database (schema) sources only; +the validator rejects them on tabular sources (cross-file merge is a later step). + +With joins, `select` / `group_by` / `filters` / `order_by` may reference a +`column_id` from the base table OR any joined table (column_ids are globally +unique), so a measure in one table can be grouped by a dimension in a related one. """ from typing import Any, Literal @@ -18,6 +24,21 @@ FilterOp = Literal[ AggFn = Literal["count", "count_distinct", "sum", "avg", "min", "max"] ValueType = Literal["int", "decimal", "string", "datetime", "date", "bool"] SortDir = Literal["asc", "desc"] +JoinType = Literal["inner", "left"] + + +class Join(BaseModel): + """A single equi-join to another table in the same source. + + `left_column_id` is a column already in the query (base table or an earlier + join); `right_column_id` is a column in `target_table_id`. The validator + requires the pair to match a foreign key declared in the catalog. + """ + + target_table_id: str + left_column_id: str + right_column_id: str + type: JoinType = "inner" class ColumnSelect(BaseModel): @@ -52,6 +73,7 @@ class QueryIR(BaseModel): ir_version: str = "1.0" source_id: str table_id: str + joins: list[Join] = Field(default_factory=list) select: list[SelectItem] filters: list[FilterClause] = Field(default_factory=list) group_by: list[str] = Field(default_factory=list) diff --git a/src/query/ir/validator.py b/src/query/ir/validator.py index c6d05abfdbdb8d2efa2f0fbfeaaa6cf86e7edff4..d777a0f328f84325a7dc665836a7b1bacfbf49b7 100644 --- a/src/query/ir/validator.py +++ b/src/query/ir/validator.py @@ -35,8 +35,12 @@ class IRValidator: def validate(self, ir: QueryIR, catalog: Catalog) -> None: source = self._find_source(catalog, ir.source_id) - table = self._find_table(source, ir.table_id) - columns_by_id: dict[str, Column] = {c.column_id: c for c in table.columns} + base_table = self._find_table(source, ir.table_id) + # Columns available to select/filter/group/order — grows as joins are added. + columns_by_id: dict[str, Column] = {c.column_id: c for c in base_table.columns} + + if ir.joins: + self._validate_joins(ir, source, columns_by_id) select_aliases: set[str] = set() for i, item in enumerate(ir.select): @@ -96,6 +100,51 @@ class IRValidator: f"limit {ir.limit} exceeds hard cap {LIMIT_HARD_CAP}" ) + def _validate_joins( + self, ir: QueryIR, source: Source, columns_by_id: dict[str, Column] + ) -> None: + """Validate joins and grow `columns_by_id` with each joined table's columns. + + Joins are DB-only in v1; each join's column pair must be a declared foreign + key, the left column must already be in the query, the right in the target. + """ + if source.source_type != "schema": + raise IRValidationError( + f"joins are only supported on database sources in v1; source " + f"{ir.source_id!r} is {source.source_type!r}" + ) + fk_pairs = self._fk_pairs(source) + for i, j in enumerate(ir.joins): + where = f"joins[{i}]" + target = self._find_table(source, j.target_table_id) + target_cols = {c.column_id: c for c in target.columns} + if j.left_column_id not in columns_by_id: + raise IRValidationError( + f"{where}.left_column_id {j.left_column_id!r} is not in the query " + "so far (base table or an earlier join)" + ) + if j.right_column_id not in target_cols: + raise IRValidationError( + f"{where}.right_column_id {j.right_column_id!r} not in target table " + f"{j.target_table_id!r}" + ) + if frozenset((j.left_column_id, j.right_column_id)) not in fk_pairs: + raise IRValidationError( + f"{where}: ({j.left_column_id!r}, {j.right_column_id!r}) is not a " + f"declared foreign key in source {ir.source_id!r} — only FK-backed " + "joins are allowed" + ) + columns_by_id.update(target_cols) + + @staticmethod + def _fk_pairs(source: Source) -> set[frozenset[str]]: + """All declared FK column pairs in the source, as unordered {col, col} sets.""" + pairs: set[frozenset[str]] = set() + for t in source.tables: + for fk in t.foreign_keys: + pairs.add(frozenset((fk.column_id, fk.target_column_id))) + return pairs + @staticmethod def _find_source(catalog: Catalog, source_id: str) -> Source: for s in catalog.sources: diff --git a/src/tools/analytics/aggregation.py b/src/tools/analytics/aggregation.py index ef00ed45f2f84745ebf1f6ac0dc8c205d7a96ce1..1c598f532fc599e00d06dae515cff8dc8d6693e6 100644 --- a/src/tools/analytics/aggregation.py +++ b/src/tools/analytics/aggregation.py @@ -3,7 +3,7 @@ An analytical "family" tool: in ONE call it groups rows by one or more keys and computes aggregates (sum, mean, count, min, max, median, nunique) per group. This is the deterministic compute twin of the Planner's -`query_structured` step — the wrapper layer later maps a QueryIR onto this. +`retrieve_data` step — the wrapper layer later maps a QueryIR onto this. STATUS: compute layer only — the function takes an already-materialized DataFrame. The wrapper layer (fetching data from the catalog via source_id, diff --git a/src/tools/data_access.py b/src/tools/data_access.py index 22e0c21ccd468f276ff27e7ef3346ef6e59a95d6..f9d66179fc423ab032fd567c8ac02d324cb2b557 100644 --- a/src/tools/data_access.py +++ b/src/tools/data_access.py @@ -8,14 +8,18 @@ the runtime/Coordinator supplies them; INV-7 keeps the agent layer tool-agnostic). Tools implemented here: -- `list_sources` — the user's data sources (id, name, type, table count). -- `describe_source` — tables/columns of one source (one row per column, - metadata only — exposes `pii_flag`, never sample values). -- `query_structured` — runs a pre-built `QueryIR` (validate -> dispatch -> +- `check_data` — structured data sources (DB + tabular). No `source_id` + → list sources (id, name, type, table count); with a + `source_id` → that source's tables/columns (one row per + column, metadata only — exposes `pii_flag`, never + sample values). +- `check_knowledge` — the user's unstructured sources / documents (id, name, + type). +- `retrieve_data` — runs a pre-built `QueryIR` (validate -> dispatch -> execute, skipping the planner) and returns rows as `ToolOutput(kind="table")` — the Pattern A handoff the `analyze_*` tools consume. -- `retrieve_documents` — dense retrieval over unstructured sources, returns +- `retrieve_knowledge` — dense retrieval over unstructured sources, returns `ToolOutput(kind="documents")`. Frozen guarantee (§8.4): **never throws.** Any failure returns @@ -47,7 +51,7 @@ DispatcherFactory = Callable[[Catalog], ExecutorDispatcher] # adding/renaming a data-access tool can't silently drift the router out of sync # from the registry (R11). Must match the names in `DataAccessToolInvoker.invoke`. DATA_ACCESS_TOOLS: frozenset[str] = frozenset( - {"list_sources", "describe_source", "query_structured", "retrieve_documents"} + {"check_data", "check_knowledge", "retrieve_data", "retrieve_knowledge"} ) @@ -73,28 +77,28 @@ class DataAccessToolInvoker: ) -> None: self._user_id = user_id self._reader = catalog_reader - # query_structured deps — injectable so tests need no real LLM/DB. The + # retrieve_data deps — injectable so tests need no real LLM/DB. The # validator is stateless; the dispatcher is built per-call from the # request's catalog (executors are picked by source_type). self._validator = ir_validator or IRValidator() self._dispatcher_factory: DispatcherFactory = ( dispatcher_factory or ExecutorDispatcher ) - # retrieve_documents dep — the module singleton by default, injectable + # retrieve_knowledge dep — the module singleton by default, injectable # for tests (the real one pulls PGVector + Redis). Lazy-imported on first # use so importing this module stays cheap. self._retriever = document_retriever async def invoke(self, tool_name: str, args: dict[str, Any]) -> ToolOutput: try: - if tool_name == "list_sources": - return await self._list_sources() - if tool_name == "describe_source": - return await self._describe_source(args) - if tool_name == "query_structured": - return await self._query_structured(args) - if tool_name == "retrieve_documents": - return await self._retrieve_documents(args) + if tool_name == "check_data": + return await self._check_data(args) + if tool_name == "check_knowledge": + return await self._check_knowledge() + if tool_name == "retrieve_data": + return await self._retrieve_data(args) + if tool_name == "retrieve_knowledge": + return await self._retrieve_knowledge(args) return ToolOutput( tool=tool_name, kind="error", error=f"unknown tool {tool_name!r}" ) @@ -103,47 +107,41 @@ class DataAccessToolInvoker: tool=tool_name, kind="error", error=f"{type(exc).__name__}: {exc}" ) - async def _list_sources(self) -> ToolOutput: - """List the user's data sources (structured + unstructured).""" - structured = await self._reader.read(self._user_id, "structured") - unstructured = await self._reader.read(self._user_id, "unstructured") - sources = list(structured.sources) + list(unstructured.sources) + async def _check_data(self, args: dict[str, Any]) -> ToolOutput: + """Inspect the user's structured data sources (DB + tabular). - rows = [ - [s.source_id, s.name, s.source_type, len(s.tables)] for s in sources - ] - return ToolOutput( - tool="list_sources", - kind="table", - columns=["source_id", "name", "source_type", "table_count"], - rows=rows, - meta={"source_count": len(sources)}, - ) - - async def _describe_source(self, args: dict[str, Any]) -> ToolOutput: - """Describe one source: one row per column across its tables. + No `source_id` → an overview: one row per structured source (id, name, + type, table count). With a `source_id` → that source's schema: one row + per column across its tables. - Pattern A note: this is catalog metadata only — never returns row + Pattern A note: schema is catalog metadata only — never returns row data or PII sample values (only the `pii_flag` boolean per column). + Unstructured documents are covered by `check_knowledge`. """ + structured = await self._reader.read(self._user_id, "structured") source_id = args.get("source_id") + if not source_id: + rows = [ + [s.source_id, s.name, s.source_type, len(s.tables)] + for s in structured.sources + ] return ToolOutput( - tool="describe_source", - kind="error", - error="missing 'source_id' argument", + tool="check_data", + kind="table", + columns=["source_id", "name", "source_type", "table_count"], + rows=rows, + meta={"source_count": len(structured.sources)}, ) - structured = await self._reader.read(self._user_id, "structured") - unstructured = await self._reader.read(self._user_id, "unstructured") - sources = list(structured.sources) + list(unstructured.sources) - - source = next((s for s in sources if s.source_id == source_id), None) + source = next( + (s for s in structured.sources if s.source_id == source_id), None + ) if source is None: return ToolOutput( - tool="describe_source", + tool="check_data", kind="error", - error=f"source {source_id!r} not found", + error=f"structured source {source_id!r} not found", ) rows = [ @@ -161,7 +159,7 @@ class DataAccessToolInvoker: for c in t.columns ] return ToolOutput( - tool="describe_source", + tool="check_data", kind="table", columns=[ "table_id", @@ -183,7 +181,24 @@ class DataAccessToolInvoker: }, ) - async def _query_structured(self, args: dict[str, Any]) -> ToolOutput: + async def _check_knowledge(self) -> ToolOutput: + """List the user's unstructured sources (documents). + + Documents have no column schema to drill into, so there is no + `source_id` mode — reading document content is `retrieve_knowledge`'s + job. + """ + unstructured = await self._reader.read(self._user_id, "unstructured") + rows = [[s.source_id, s.name, s.source_type] for s in unstructured.sources] + return ToolOutput( + tool="check_knowledge", + kind="table", + columns=["source_id", "name", "source_type"], + rows=rows, + meta={"source_count": len(unstructured.sources)}, + ) + + async def _retrieve_data(self, args: dict[str, Any]) -> ToolOutput: """Run one validated, single-table QueryIR and return rows as a table. This is the spine of the slow path (Pattern A): the `analyze_*` tools @@ -196,14 +211,14 @@ class DataAccessToolInvoker: raw = args.get("ir") if raw is None: return ToolOutput( - tool="query_structured", kind="error", error="missing 'ir' argument" + tool="retrieve_data", kind="error", error="missing 'ir' argument" ) try: ir = raw if isinstance(raw, QueryIR) else QueryIR.model_validate(raw) except ValidationError as exc: return ToolOutput( - tool="query_structured", kind="error", error=f"invalid IR: {exc}" + tool="retrieve_data", kind="error", error=f"invalid IR: {exc}" ) catalog = await self._reader.read(self._user_id, "structured") @@ -212,7 +227,7 @@ class DataAccessToolInvoker: self._validator.validate(ir, catalog) except IRValidationError as exc: return ToolOutput( - tool="query_structured", + tool="retrieve_data", kind="error", error=f"IR validation failed: {exc}", ) @@ -223,7 +238,7 @@ class DataAccessToolInvoker: if result.error: return ToolOutput( - tool="query_structured", kind="error", error=result.error + tool="retrieve_data", kind="error", error=result.error ) # QueryResult.rows is list[dict]; ToolOutput.rows is list[list] ordered @@ -235,7 +250,7 @@ class DataAccessToolInvoker: [_json_safe(row.get(c)) for c in result.columns] for row in result.rows ] return ToolOutput( - tool="query_structured", + tool="retrieve_data", kind="table", columns=result.columns, rows=rows, @@ -251,7 +266,7 @@ class DataAccessToolInvoker: }, ) - async def _retrieve_documents(self, args: dict[str, Any]) -> ToolOutput: + async def _retrieve_knowledge(self, args: dict[str, Any]) -> ToolOutput: """Dense-retrieve relevant chunks from the user's unstructured sources. Pulls qualitative context (PDF/DOCX/TXT) for a natural-language `query` @@ -259,7 +274,7 @@ class DataAccessToolInvoker: `source_id` scopes to one source (best-effort metadata filter — the router itself does not yet scope by source, so this prunes the results). - TODO(retrieval scoping): the Planner few-shot has no `retrieve_documents` + TODO(retrieval scoping): the Planner few-shot has no `retrieve_knowledge` example, so `source_id` is rarely emitted today and this post-filter is adequate. If source-scoped retrieval becomes common, push scoping down into RetrievalRouter.retrieve()/DocumentRetriever (WHERE @@ -269,7 +284,7 @@ class DataAccessToolInvoker: query = args.get("query") if not isinstance(query, str) or not query.strip(): return ToolOutput( - tool="retrieve_documents", + tool="retrieve_knowledge", kind="error", error="missing 'query' argument", ) @@ -300,7 +315,7 @@ class DataAccessToolInvoker: for r in results ] return ToolOutput( - tool="retrieve_documents", + tool="retrieve_knowledge", kind="documents", value=documents, meta={ diff --git a/src/tools/registry.py b/src/tools/registry.py index bb7ce1b32f16b2b487292ca250a2412f51f36b8d..c3894e2dbae4ca02b6548d3273f791430b12f26b 100644 --- a/src/tools/registry.py +++ b/src/tools/registry.py @@ -8,7 +8,7 @@ reads to choose a tool (KM-625). This replaces the agent team's local stub in Conventions (decided with the agent team, KM-465): - **Pattern A** — `analyze_*` tools do NOT self-fetch by `source_id`. Each takes a `data` argument that is a `"${t}"` placeholder pointing at an upstream - `query_structured` table output, resolved to a DataFrame at execution time. + `retrieve_data` table output, resolved to a DataFrame at execution time. Column arguments reference the aliases that upstream query produced. - `input_schema` is the lightweight JSON-schema-ish dict the planner validator consumes: `required` (arg names with no default) + `properties` (allowed args). @@ -17,8 +17,8 @@ Conventions (decided with the agent team, KM-465): - `output_kind` is the `ToolOutput.kind` each tool returns: stats (labelled-metric dict) | table (rows×cols) | series (ordered periods). -The four data-access tools (query_structured / retrieve_documents / list_sources / -describe_source) are registered separately once their wrappers land (KM-465 #4); +The four data-access tools (retrieve_data / retrieve_knowledge / check_data / +check_knowledge) are registered separately once their wrappers land (KM-465 #4); `default_registry()` composes both slices. """ @@ -36,7 +36,8 @@ from src.tools.analytics import ( ) from src.tools.contracts import ToolRegistry, ToolSpec -ANALYTICS_TOOLS: list[ToolSpec] = [ +# Active this round — the four analytics tools the Planner may select. +ACTIVE_ANALYTICS_TOOLS: list[ToolSpec] = [ ToolSpec( name="analyze_descriptive", category="analytics.descriptive", @@ -65,6 +66,45 @@ ANALYTICS_TOOLS: list[ToolSpec] = [ output_kind="table", description=aggregation.DESCRIPTION, ), + ToolSpec( + name="analyze_correlation", + category="analytics.relationship", + input_schema={ + "required": ["data"], + "properties": { + "data": {"type": "string"}, + "column_ids": {"type": "array"}, + "method": {"type": "string"}, + }, + }, + output_kind="stats", + description=relationship.DESCRIPTION, + ), + ToolSpec( + name="analyze_trend", + category="analytics.timeseries", + input_schema={ + "required": ["data", "date_column", "value_column"], + "properties": { + "data": {"type": "string"}, + "date_column": {"type": "string"}, + "value_column": {"type": "string"}, + "freq": {"type": "string"}, + "agg": {"type": "string"}, + }, + }, + output_kind="series", + description=temporal.DESCRIPTION, + ), +] + +# Deferred this round — specs kept intact for easy re-activation, NOT exposed to +# the Planner. The compute fns still exist (src/tools/analytics/*) and the invoker +# still maps them (src/tools/invoker.py); only registry exposure is withheld. +# To re-activate, move a spec back into ACTIVE_ANALYTICS_TOOLS. NOTE: a deferred +# tool re-activated here must also be re-added to the Planner few-shots +# (src/agents/planner/examples.py) — keep the two in sync. +DEFERRED_ANALYTICS_TOOLS: list[ToolSpec] = [ ToolSpec( name="analyze_comparison", category="analytics.comparison", @@ -111,20 +151,6 @@ ANALYTICS_TOOLS: list[ToolSpec] = [ output_kind="stats", description=quality.DESCRIPTION, ), - ToolSpec( - name="analyze_correlation", - category="analytics.relationship", - input_schema={ - "required": ["data"], - "properties": { - "data": {"type": "string"}, - "column_ids": {"type": "array"}, - "method": {"type": "string"}, - }, - }, - output_kind="stats", - description=relationship.DESCRIPTION, - ), ToolSpec( name="analyze_segment", category="analytics.segmentation", @@ -143,25 +169,17 @@ ANALYTICS_TOOLS: list[ToolSpec] = [ output_kind="table", description=segmentation.DESCRIPTION, ), - ToolSpec( - name="analyze_trend", - category="analytics.timeseries", - input_schema={ - "required": ["data", "date_column", "value_column"], - "properties": { - "data": {"type": "string"}, - "date_column": {"type": "string"}, - "value_column": {"type": "string"}, - "freq": {"type": "string"}, - "agg": {"type": "string"}, - }, - }, - output_kind="series", - description=temporal.DESCRIPTION, - ), ] +# Full set (active + deferred) — kept for callers that need every spec, e.g. tests +# or the invoker's name checks. The Planner-visible registry uses ACTIVE only. +ANALYTICS_TOOLS: list[ToolSpec] = [*ACTIVE_ANALYTICS_TOOLS, *DEFERRED_ANALYTICS_TOOLS] + def analytics_registry() -> ToolRegistry: - """The analytics (`analyze_*`) slice of the tool registry (fresh instance).""" - return ToolRegistry(tools=list(ANALYTICS_TOOLS)) + """The analytics (`analyze_*`) slice of the tool registry (fresh instance). + + Exposes only `ACTIVE_ANALYTICS_TOOLS`; deferred specs are withheld from the + Planner (see `DEFERRED_ANALYTICS_TOOLS`). + """ + return ToolRegistry(tools=list(ACTIVE_ANALYTICS_TOOLS))