Agentic-Service-Data-Eyond-Catalog

/fix check and help tool

by rhbt6767 - opened 4 days ago

base: refs/heads/main

from: refs/pr/8

+1061

-86

Files changed (18)

REPO_STATUS.md +32 -13
eval/help/README.md +77 -0
eval/help/__init__.py +0 -0
eval/help/help_dataset.json +150 -0
eval/help/run_eval.py +428 -0
eval/readiness/readiness_dataset.json +19 -22
eval/readiness/run_eval.py +5 -4
main.py +2 -2
src/agents/handlers/help.py +94 -5
src/agents/planner/inputs.py +39 -0
src/catalog/fk_inference.py +97 -0
src/catalog/render.py +7 -1
src/catalog/store.py +24 -8
src/config/prompts/help.md +29 -11
src/config/prompts/planner.md +14 -9
src/config/settings.py +6 -0
src/db/postgres/models.py +32 -9
src/query/executor/db.py +6 -2

REPO_STATUS.md CHANGED

@@ -2,7 +2,7 @@
 **Audience:** teammates onboarding onto the Python repo (`Agentic-Service-Data-Eyond-Catalog`).
 **Scope:** what the code does **right now** (branch `pr/4`, ticket KM-652). Describes current state only — no roadmap or to-dos.
-**Snapshot date:** 2026-06-25. **Cross-repo update 2026-06-29:** §2/§8/§11/§12 re-verified against
 the **Go source** (`Orchestrator-Agent-Service`), not its docs. The Go service has moved well past its
 own (uncommitted, stale) design docs: it now hosts the **dedorch SQL migrations** in-repo and a full
 **`/api/v1/analyses` + `/api/v1/skills`** REST surface. Go does **not** call Python yet — those skills
@@ -178,7 +178,7 @@ unless `SKIP_INIT_DB=true`.
 |---|---|---|---|
 | `users`, `rooms`, `chat_messages`, `message_sources` | base app | chat endpoint, Go | chat history |
 | `documents`, `databases` | uploads + DB creds (Fernet-encrypted) | Go ingestion | executor cred resolution |
-| `data_catalog` | per-user jsonb `Catalog` (Source → Table → Column) | Go ingestion / Python pipeline | CatalogReader, planner, tools |
 | `langchain_pg_embedding` | PGVector document chunks | Go ingestion | DocumentRetriever |
 | `report_inputs` *(was `analysis_records`)* | jsonb `AnalysisRecord`, one per slow-path run; **Python-owned** | slow path | ReportGenerator, report readiness |
 | `analyses` *(dedorch, plural)* | uuid `id`, `user_id`, `analysis_title`, `objective`, `business_questions` jsonb, `status` (active\|inactive), `data_bind`(+`data_bind_version`), `report_id`, `report_collection` — **defined by Go migrations**; `problem_statement`/`problem_validated`/`owner_id` already **dropped** there (`0003`/`0004`) | Go `/api/v1/analyses`; Python state store | gate (no-op), Help, report |
@@ -186,16 +186,21 @@ unless `SKIP_INIT_DB=true`.
 | `data_sources` *(dedorch)* | per-analysis binding; `reference_id` = catalog source_id; `type ∈ document\|database` | Go `/analyses/{id}/data-bind` (+ Python `/analysis/create`) | structured-flow scoping, report appendix |
 | `analyses_messages` *(dedorch)* | the analysis chat room (`role ∈ user\|ai`); replaces deprecated `rooms`/`chat_messages` | Go `/analyses/{id}/messages` | Python chat path **not yet migrated here** (§12) |
-> ⚠️ **Python ORM ↔ dedorch drift (verified 2026-06-29).** Python's `AnalysisStateRow` + `state_store.py`
-> still model **`problem_statement` / `problem_validated`** and do **not** carry `objective` /
-> `business_questions`, but the Go migrations have already dropped the former and added the latter.
-> Pre-cutover this is harmless (Python runs `create_all` on its own copy); **post-`SKIP_INIT_DB`**, when
-> Python reads dedorch directly, ORM column selection on the dropped columns will break. Reconcile the
-> Python model before the connection-string cutover.
 **Catalog shape** (the jsonb in `data_catalog`):
 `Catalog → Source[ {source_id, source_type ∈ schema|tabular|unstructured, name, location_ref} → Table[ {table_id, name, row_count, foreign_keys[]} → Column[ {column_id, name, data_type, nullable, pii_flag, sample_values|null, stats} ] ] ]`. PII columns have `sample_values: null` so real values never enter prompts.
 **QueryIR shape** (`src/query/ir/models.py`):
 `{ source_id, table_id, joins[], select[], filters[], group_by[], order_by[], limit }`.
 Joins are single-level equi-joins to a related table **in the same source**, FK-backed,
@@ -286,7 +291,7 @@ only.
 |---|---|---|---|
 | `ENABLE_SLOW_PATH` | `settings.enable_slow_path` | **off** | Route `structured_flow` through Planner/TaskRunner/Assembler (vs single-query `QueryService`). Records persist only on the slow path → reports require this on. |
 | `ENABLE_GATE` | `settings.enable_gate` | **off** | **Deprecated 2026-06-25** — gate neutered; the flag has no effect. Kept to avoid `.env` churn. |
-| `SKIP_INIT_DB` | env, `main.py` | off | Skip `create_all` on startup — the dedorch cutover switch (Go owns dedorch migrations). |
 | `enable_tracing` | hardcoded `True` in `chat.py` | on (endpoint) | Langfuse tracing. |
 ---
@@ -309,8 +314,8 @@ copies disagree with the current code on:
 ## 12. dedorch migration — current state
-The Python DB is moving from `dataeyond` → **dedorch** (Go owns dedorch migrations; Python is
-consumer-only). State **re-verified against the Go source 2026-06-29**:
 - **The dedorch migrations now live IN the Go repo** — embedded SQL at
   `internal/repository/postgres/migrations/0001_create_core_schema.sql … 0004_replace_chat_with_analysis_scope.sql`,
@@ -325,8 +330,15 @@ consumer-only). State **re-verified against the Go source 2026-06-29**:
   `rooms`/`chat_messages`/`interview_*` tables to `zdeprecated_*`.
 - **`report_inputs`** (the slow-path structured output, formerly `analysis_records`) stays
   **Python-owned**; its finalized schema goes to Harry so the dedorch migration creates it post-cutover.
-- The connection-string cutover (paired with `SKIP_INIT_DB`) **has not happened yet**; Python still
-  runs `create_all` on its own models until then.
 **⚠️ Integration gap (verified — the big one).** Go's `/api/v1/analyses` and `/api/v1/skills`
 (`help` / `report`) are **placeholders that return dummy data** — the `SendMessage` / `GenerateReport`
@@ -348,6 +360,13 @@ records-based report; floor: ≥1 `analyze_*` success). Wiring Go → Python is
   values are always parameterized.
 - **Settings aliases:** `.env` uses double-underscore names (`azureai__api_key__4o`); `Settings`
   exposes them as `azureai_api_key_4o`.
 - **Never-throw seams** are pervasive (tool invoker, query service, executors, state/binding reads,
   record persistence, report summary). Failures degrade into soft output rather than raising — good
   for UX, but they can mask real breakage (e.g. a binding silently fail-opening to the full catalog).

 **Audience:** teammates onboarding onto the Python repo (`Agentic-Service-Data-Eyond-Catalog`).
 **Scope:** what the code does **right now** (branch `pr/4`, ticket KM-652). Describes current state only — no roadmap or to-dos.
+**Snapshot date:** 2026-06-25. **Data-layer reconcile 2026-07-01:** §8/§12 updated — dedorch cutover done, `data_catalog` model reconciled. **Query-path fix 2026-07-02:** §8/§13 — dedorch catalogs ship no FKs → Python infers them (`fk_inference.py`); shared-Fernet-key gotcha documented. **Cross-repo update 2026-06-29:** §2/§8/§11/§12 re-verified against
 the **Go source** (`Orchestrator-Agent-Service`), not its docs. The Go service has moved well past its
 own (uncommitted, stale) design docs: it now hosts the **dedorch SQL migrations** in-repo and a full
 **`/api/v1/analyses` + `/api/v1/skills`** REST surface. Go does **not** call Python yet — those skills
 |---|---|---|---|
 | `users`, `rooms`, `chat_messages`, `message_sources` | base app | chat endpoint, Go | chat history |
 | `documents`, `databases` | uploads + DB creds (Fernet-encrypted) | Go ingestion | executor cred resolution |
+| `data_catalog` *(dedorch, Go-owned)* | `id` uuid, `scope_type` ('user'\|'analysis'), `user_id`, `analysis_id`, **`catalog_payload`** jsonb (the `Catalog`: Source → Table → Column), schema_version, generated_at, updated_at; partial-unique on `user_id WHERE scope_type='user'` | **Go `catalog.Service`** (all writes: DB/file ingestion) | CatalogReader → CatalogStore (**read-only**), planner, tools |
 | `langchain_pg_embedding` | PGVector document chunks | Go ingestion | DocumentRetriever |
 | `report_inputs` *(was `analysis_records`)* | jsonb `AnalysisRecord`, one per slow-path run; **Python-owned** | slow path | ReportGenerator, report readiness |
 | `analyses` *(dedorch, plural)* | uuid `id`, `user_id`, `analysis_title`, `objective`, `business_questions` jsonb, `status` (active\|inactive), `data_bind`(+`data_bind_version`), `report_id`, `report_collection` — **defined by Go migrations**; `problem_statement`/`problem_validated`/`owner_id` already **dropped** there (`0003`/`0004`) | Go `/api/v1/analyses`; Python state store | gate (no-op), Help, report |
 | `data_sources` *(dedorch)* | per-analysis binding; `reference_id` = catalog source_id; `type ∈ document\|database` | Go `/analyses/{id}/data-bind` (+ Python `/analysis/create`) | structured-flow scoping, report appendix |
 | `analyses_messages` *(dedorch)* | the analysis chat room (`role ∈ user\|ai`); replaces deprecated `rooms`/`chat_messages` | Go `/analyses/{id}/messages` | Python chat path **not yet migrated here** (§12) |
+> ✅ **Python ORM ↔ dedorch drift — reconciled 2026-07-01.** `AnalysisStateRow` (`analyses`) dropped
+> `problem_statement`/`problem_validated` and added `objective`/`business_questions` (Harry's #3);
+> `data_catalog` was the last stale model. Its `Catalog` ORM (old `user_id`-PK + `data` jsonb) is now
+> the dedorch shape (`id` PK, `scope_type`, **`catalog_payload`**), and `CatalogStore` reads
+> `catalog_payload WHERE scope_type='user'` (matching Go's `catalog.Service`). This closed a **live
+> bug**: the `check` skill / `CatalogReader` still selected the dropped `data_catalog.data` column, so
+> every catalog read 500'd after the cutover ("what data do I have" → *"Sorry, I couldn't look that up:
+> column data_catalog.data does not exist"*). Python's catalog **write** methods (`upsert`/
+> `remove_source`/`StructuredPipeline`) were reconciled but are now **legacy** — Go owns ingestion.
 **Catalog shape** (the jsonb in `data_catalog`):
 `Catalog → Source[ {source_id, source_type ∈ schema|tabular|unstructured, name, location_ref} → Table[ {table_id, name, row_count, foreign_keys[]} → Column[ {column_id, name, data_type, nullable, pii_flag, sample_values|null, stats} ] ] ]`. PII columns have `sample_values: null` so real values never enter prompts.
+> ⚠️ **dedorch catalogs ship empty `foreign_keys`** (Go's introspection drops FK constraints), yet the IR validator only allows FK-backed joins — so every cross-table question failed validation until 2026-07-02. `src/catalog/fk_inference.py` (wired into `CatalogStore.get`) now infers the obvious `<base>_id → <table>.id` edges at read time: conservative (single unambiguous target, matching `data_type`, schema sources only) and **self-disabling** once any real FK is present. It's a **stopgap** — the durable fix is Go emitting real FKs during introspection.
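The inference rule described in the note above can be sketched as follows. This is a minimal illustration and not the contents of `fk_inference.py`: the `Table` and `Column` dataclasses, the `infer_foreign_keys` name, the shape of each `foreign_keys` entry, and the singular/plural table-name matching are all assumptions made for the example. The "schema sources only" restriction is not modeled here.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Column:
    name: str
    data_type: str


@dataclass
class Table:
    name: str
    columns: list[Column]
    foreign_keys: list[dict] = field(default_factory=list)


def infer_foreign_keys(tables: list[Table]) -> list[Table]:
    """Add `<base>_id -> <table>.id` edges, but only when no real FK exists."""
    # Self-disabling: if any table already carries a real FK, infer nothing.
    if any(t.foreign_keys for t in tables):
        return tables

    for table in tables:
        for col in table.columns:
            if not col.name.endswith("_id"):
                continue
            base = col.name[: -len("_id")]
            # Hypothetical naming rule: `customer_id` matches `customer` or `customers`.
            targets = [
                t for t in tables
                if t is not table and t.name in (base, base + "s")
            ]
            if len(targets) != 1:
                continue  # ambiguous or missing target: stay conservative
            target = targets[0]
            pk = next((c for c in target.columns if c.name == "id"), None)
            if pk is None or pk.data_type != col.data_type:
                continue  # require an `id` column with the same data_type
            table.foreign_keys.append(
                {"column": col.name, "ref_table": target.name, "ref_column": "id"}
            )
    return tables
```

Running the inference at read time, rather than persisting the edges, keeps the stored catalog untouched, so the stopgap disappears on its own once real FKs arrive.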
 **QueryIR shape** (`src/query/ir/models.py`):
 `{ source_id, table_id, joins[], select[], filters[], group_by[], order_by[], limit }`.
 Joins are single-level equi-joins to a related table **in the same source**, FK-backed,
 |---|---|---|---|
 | `ENABLE_SLOW_PATH` | `settings.enable_slow_path` | **off** | Route `structured_flow` through Planner/TaskRunner/Assembler (vs single-query `QueryService`). Records persist only on the slow path → reports require this on. |
 | `ENABLE_GATE` | `settings.enable_gate` | **off** | **Deprecated 2026-06-25** — gate neutered; the flag has no effect. Kept to avoid `.env` churn. |
+| `SKIP_INIT_DB` | `settings.skip_init_db` (.env/env) | **on** | Skip `init_db()` on startup — the dedorch cutover switch. **Defaults TRUE** (Go owns the dedorch schema); set `false` only for a local Python-owned DB. |
 | `enable_tracing` | hardcoded `True` in `chat.py` | on (endpoint) | Langfuse tracing. |
 ---
 ## 12. dedorch migration — current state
+The Python DB has moved from `dataeyond` → **dedorch** (cutover 2026-07-01; Go owns dedorch migrations;
+Python is consumer-only). State **re-verified against the Go source 2026-06-29**:
 - **The dedorch migrations now live IN the Go repo** — embedded SQL at
   `internal/repository/postgres/migrations/0001_create_core_schema.sql … 0004_replace_chat_with_analysis_scope.sql`,
   `rooms`/`chat_messages`/`interview_*` tables to `zdeprecated_*`.
 - **`report_inputs`** (the slow-path structured output, formerly `analysis_records`) stays
   **Python-owned**; its finalized schema goes to Harry so the dedorch migration creates it post-cutover.
+- **Connection-string cutover DONE (2026-07-01).** Python's `postgres_connstring` now points at
+  **dedorch** and reads the Go-migrated tables directly. Every ORM model Python reads (`analyses`,
+  `data_sources`, `analyses_messages`, `data_catalog`) has been reconciled to its dedorch shape.
+  **`init_db()` is now skipped by default** (`settings.skip_init_db` defaults **True**): its privileged
+  DDL (`ALTER TABLE rooms …`, index creation) fails on Go-owned tables
+  (`InsufficientPrivilegeError: must be owner of table rooms`). Skipping is safe — Go migration `0001`
+  already provides the `vector` extension + the langchain FTS index. Set `SKIP_INIT_DB=false` (.env or
+  env) only for a local Python-owned DB. `report_inputs` is not in any Go migration yet (#22) — create
+  it in dedorch before enabling the slow path, else report/slow-path writes fail (chat path unaffected).
 **⚠️ Integration gap (verified — the big one).** Go's `/api/v1/analyses` and `/api/v1/skills`
 (`help` / `report`) are **placeholders that return dummy data** — the `SendMessage` / `GenerateReport`
   values are always parameterized.
 - **Settings aliases:** `.env` uses double-underscore names (`azureai__api_key__4o`); `Settings`
   exposes them as `azureai_api_key_4o`.
+- **Shared Fernet key across repos (gotcha).** User DB credentials in `databases` are written and
+  encrypted by **Go** and decrypted by Python; both read the **same** env var
+  `dataeyond__db__credential__key` (Go: `configs/app.yaml` → `credentials.fernet_key`). The two
+  deployments MUST hold the **identical value**, or Python's decrypt throws
+  `cryptography.fernet.InvalidToken`. Its `str()` is **empty**, so the failure was logged as
+  `error=""` and masqueraded as a DB-connection failure (the executor now logs `repr(e)` to expose
+  it). To tell the two cases apart: a valid-but-wrong key → `InvalidToken`; a malformed key → a
+  non-empty `ValueError` at cipher build.
 - **Never-throw seams** are pervasive (tool invoker, query service, executors, state/binding reads,
   record persistence, report summary). Failures degrade into soft output rather than raising — good
   for UX, but they can mask real breakage (e.g. a binding silently fail-opening to the full catalog).
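The logging pitfall from the Fernet gotcha above applies to any exception raised without a message. The sketch below uses a local stand-in class named `InvalidToken` so that it needs no third-party dependency; the stand-in and the `describe_error` helper are assumptions for illustration, not code from the executor.

```python
class InvalidToken(Exception):
    """Stand-in for an exception raised with no message (assumption for this sketch)."""


def describe_error(exc: Exception) -> tuple[str, str]:
    """Return the str() and repr() renderings of an exception for comparison."""
    return str(exc), repr(exc)


try:
    raise InvalidToken()
except InvalidToken as e:
    as_str, as_repr = describe_error(e)

print(f"error={as_str!r}")  # the message is empty, so this log line carries no clue
print(f"error={as_repr}")   # repr() still names the exception class
```

Logging `repr(e)` (or the exception type alongside the message) keeps a message-less exception distinguishable from a genuine connection failure.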

eval/help/README.md ADDED

	@@ -0,0 +1,77 @@

+# Help-skill eval
+Scores the **live** Help skill (`src/agents/handlers/help.HelpAgent`) — the guide that
+tells a user where they are and what to do next. Each golden case declares an analysis
+state + report-readiness + chat history; the runner streams `HelpAgent.astream` for real
+and asserts the **rules** the reply must obey.
+Unlike `eval/readiness` (deterministic, no LLM), this calls the model, so it needs a
+working `.env` (Azure OpenAI) and spends tokens. Run it before a deploy that touches
+`config/prompts/help.md` — not on every commit. The fast, no-LLM guard is
+`tests/unit/agents/handlers/test_help.py` (fake chain); this is the end-to-end
+"does the model actually obey the prompt" layer on top.
+## Run
+```bash
+uv run python -m eval.help.run_eval
+uv run python -m eval.help.run_eval --limit 4     # smoke test
+uv run python -m eval.help.run_eval --no-table    # summary only
+```
+Each run writes a timestamped `results/help_result_<ts>.json` (never overwritten,
+diffable across runs).
+## What it measures
+Not accuracy — Help replies are free prose with no single correct wording. The metric is
+**compliance**: the % of cases whose reply obeys every rule asserted for it.
+- **`language`** — the reply must match the user's language. This is the regression guard
+  for the button-path bug (`/tools/help` passes `message=None`, and the reply used to
+  default to English even for an Indonesian conversation).
+- **`report_guard`** — never suggest generating a report when `report_ready.ready=false`;
+  do suggest it when `true`. Since `generate_report` is the only gated action, this also
+  serves as the "no action leakage" check.
+- **`orientation`** — quality of the suggested starter questions. **Manual review**: these
+  run but are excluded from the auto compliance rate. Read their `output_text` in the JSON.
+Assertion types: `language_match {expected}`, `must_not_contain_any {patterns}`,
+`must_contain_any {patterns}`.
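A minimal sketch of how the three assertion types could be evaluated, assuming case-insensitive substring matching for the pattern rules and a language detector passed in as a callable. The names `check_assert` and `case_passes` are hypothetical and are not taken from `run_eval.py`.

```python
from typing import Callable


def check_assert(spec: dict, reply: str, detect_language: Callable[[str], str]) -> bool:
    """Evaluate one rule assertion against a reply."""
    kind = spec["type"]
    if kind == "language_match":
        return detect_language(reply) == spec["expected"]
    if kind not in ("must_not_contain_any", "must_contain_any"):
        raise ValueError(f"unknown assertion type: {kind}")
    lowered = reply.lower()
    # Assumption: patterns are plain substrings, compared case-insensitively.
    found = any(p.lower() in lowered for p in spec["patterns"])
    return found if kind == "must_contain_any" else not found


def case_passes(asserts: list[dict], reply: str, detect_language: Callable[[str], str]) -> bool:
    """A case complies only if every asserted rule holds."""
    return all(check_assert(a, reply, detect_language) for a in asserts)
```

Passing the detector in as an argument lets the same checks run against the real detector in the eval and a trivial fake in unit tests.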
+## Held-out vs carried-over (why the summary splits them)
+`carried_over: true` cases **mirror an example in `help.md`** — the case `id` *is* the
+prompt's `<!-- id: ... -->`. They are a regression guard: if the prompt is refactored, the
+demonstrated rule must still hold. What is mirrored is the **input spec + the assertion**,
+never the example's reply text (temperature > 0 makes exact match invalid).
+Held-out cases (`carried_over: false`) are **absent from the prompt**; their compliance is
+the real generalization signal. If held-out compliance drops while carried-over stays at
+100%, the prompt is overfitting to its own examples ("train on test set"). That's why the
+two are reported separately.
+**Sync rule (manual, like `intent`):** if `help.md`'s Examples change, keep the mirrored
+`id`s here in sync. Current mirrored ids: `help_ex_orient`, `help_ex_guard_delta`,
+`help_ex_guard_ready`.
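The split reporting described above can be sketched as a small aggregation. This assumes each result row carries `carried_over`, `manual_review`, and a boolean `passed` field; `passed` and the function name `compliance_by_split` are hypothetical, chosen for the example.

```python
from __future__ import annotations


def compliance_by_split(results: list[dict]) -> dict[str, float | None]:
    """Compliance rate (0-100) per split, skipping manual-review cases."""
    buckets: dict[str, list[bool]] = {"held_out": [], "carried_over": []}
    for r in results:
        if r["manual_review"]:
            continue  # these cases run, but are excluded from the auto rate
        key = "carried_over" if r["carried_over"] else "held_out"
        buckets[key].append(r["passed"])
    # None marks a split with no auto-scored cases, instead of a misleading 0 or 100.
    return {
        key: (100.0 * sum(flags) / len(flags) if flags else None)
        for key, flags in buckets.items()
    }
```

A held-out rate well below the carried-over rate is the overfitting signal the README describes, which a single combined number would hide.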
+## Dataset
+`help_dataset.json` — see the `_about` / `_carried_over` doc keys in the file. Language
+detection reuses `help._detect_reply_language`; `report_ready.missing` uses the codes
+`analysis` / `delta` mapped to the real `is_report_ready` strings in the runner.
+## Known limitations
+- **Compliance is approximate across runs.** `HelpAgent` runs at `temperature=0.3`, so the
+  reply varies; a borderline case can flip pass/fail between runs. Treat the rate as a
+  signal, not a fixed number — re-run before trusting a single-point drop.
+- **`language_match` grades with the same detector the feature uses** (`_detect_reply_language`
+  over the reply). It verifies the model obeyed the `[Reply language]` directive, assuming the
+  detector is correct — the detector itself is unit-tested separately in
+  `tests/unit/agents/handlers/test_help.py`. It can also misfire on a reply that mixes
+  languages (e.g. an Indonesian reply quoting an English business question).
+- **Errored cases (stream crash) count as failures, not rule violations.** If `astream` raises
+  (Azure down, timeout), the case is flagged `errored` and reported under a separate `ERRORED`
+  line — assertions are NOT run on the error string (a crash must not trivially "pass" a
+  `must_not_contain_any`). A run with errors is not a clean pass; re-run once the cause clears.
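The errored-case rule above can be sketched like this: stream the reply, and if the stream raises, flag the case and skip the assertions entirely. The `run_case` function, its arguments, and the result keys other than `output_text` are assumptions for illustration; the real runner streams `HelpAgent.astream`, which is replaced here by any async generator.

```python
import asyncio
from typing import AsyncIterator, Callable


async def run_case(
    stream_reply: Callable[[], AsyncIterator[str]],
    check: Callable[[str], bool],
) -> dict:
    """Stream one reply; on a crash, flag the case as errored and skip assertions."""
    chunks: list[str] = []
    try:
        async for chunk in stream_reply():
            chunks.append(chunk)
    except Exception as exc:
        # Never grade the error string: a crash must not "pass" a must_not_contain_any.
        return {"errored": True, "passed": False, "error": repr(exc), "output_text": ""}
    text = "".join(chunks)
    return {"errored": False, "passed": check(text), "error": None, "output_text": text}
```

Keeping `errored` separate from `passed` lets the summary report crashes on their own line while still counting them against a clean run.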

eval/help/__init__.py ADDED

File without changes

eval/help/help_dataset.json ADDED

	@@ -0,0 +1,150 @@

+{
+  "_about": "Golden dataset for the Help skill (`src/agents/handlers/help.HelpAgent`). Unlike intent/readiness this calls the LIVE model: each case declares an analysis state + report-readiness + chat history, the runner streams HelpAgent.astream for real, and asserts RULES the reply must obey (not text similarity — help replies are free prose with no single correct wording). Metric is COMPLIANCE (% of rule assertions that hold), reported separately for held-out vs carried_over cases.",
+  "_groups": "language (reply matches the user's language — the button-path bug), report_guard (never suggest a report when report_ready.ready=false; do suggest it when true — this also IS the 'no action leakage' check, since generate_report is the only gated action), orientation (quality of the suggested starter questions — MANUAL review, not auto-scored).",
+  "_asserts": "language_match {expected} — detect the reply's language (help._detect_reply_language over the OUTPUT) must equal expected. must_not_contain_any {patterns} — none of the (case-insensitive) patterns appear. must_contain_any {patterns} — at least one appears.",
+  "_carried_over": "carried_over:true rows MIRROR an example in config/prompts/help.md (the row `id` IS the help.md `<!-- id: ... -->`). They are the regression guard: if the prompt is refactored, the demonstrated rule must still hold. What is mirrored is the INPUT spec + the assertion — NOT the example's reply text (temperature>0 makes exact match invalid). Held-out rows (carried_over:false) are NOT in the prompt; their compliance is the real generalization signal. If help.md's Examples change, keep these ids in sync (manual, like intent).",
+  "_missing_codes": "report_ready.missing uses codes mapped to the real strings is_report_ready emits (imported in run_eval): analysis -> _MISSING_ANALYSIS, delta -> _MISSING_DELTA. Kept as codes so the dataset survives wording changes.",
+  "schema": {
+    "id": "stable handle; for carried_over rows this equals the help.md example id",
+    "group": "language | report_guard | orientation",
+    "carried_over": "bool — mirrors a help.md example",
+    "manual_review": "bool — run but exclude from the auto compliance rate (read output_text)",
+    "state": "{ analysis_title, objective, business_questions[], report_id }",
+    "report_ready": "{ ready: bool, missing: [analysis|delta] }",
+    "history": "[{ role: human|ai, content }] — drives language on the button path",
+    "message": "the human turn; null = button path (HelpAgent falls back to a per-language trigger)",
+    "asserts": "[{ type, ...spec }] — the rules the reply must obey",
+    "note": "human-readable description"
+  },
+  "cases": [
+    {
+      "id": "lang_01", "group": "language", "carried_over": false, "manual_review": false,
+      "state": { "analysis_title": "Analisis penjualan", "objective": "memahami performa penjualan bulanan", "business_questions": ["produk mana yang paling laku?"], "report_id": null },
+      "report_ready": { "ready": false, "missing": ["analysis"] },
+      "history": [{ "role": "human", "content": "aku baru upload datanya, terus aku harus ngapain?" }],
+      "message": null,
+      "asserts": [{ "type": "language_match", "expected": "Indonesian" }],
+      "note": "REGRESSION of the button-path bug: Indonesian conversation, message=null. Reply must be Indonesian, not English."
+    },
+    {
+      "id": "lang_02", "group": "language", "carried_over": false, "manual_review": false,
+      "state": { "analysis_title": "Sales analysis", "objective": "understand monthly sales performance", "business_questions": ["which products drive revenue?"], "report_id": null },
+      "report_ready": { "ready": false, "missing": ["analysis"] },
+      "history": [{ "role": "human", "content": "okay I uploaded my data, what do I do next?" }],
+      "message": null,
+      "asserts": [{ "type": "language_match", "expected": "English" }],
+      "note": "English conversation, button path — reply must stay English."
+    },
+    {
+      "id": "lang_03", "group": "language", "carried_over": false, "manual_review": false,
+      "state": { "analysis_title": "Analisis churn", "objective": "menurunkan churn pelanggan", "business_questions": ["segmen mana yang paling banyak churn?"], "report_id": null },
+      "report_ready": { "ready": false, "missing": ["analysis"] },
+      "history": [],
+      "message": "gimana caranya mulai analisis ini ya?",
+      "asserts": [{ "type": "language_match", "expected": "Indonesian" }],
+      "note": "Intent path: the real Indonesian user turn drives the language."
+    },
+    {
+      "id": "lang_04", "group": "language", "carried_over": false, "manual_review": false,
+      "state": { "analysis_title": "Retention analysis", "objective": "understand user retention", "business_questions": ["what drives repeat usage?"], "report_id": null },
+      "report_ready": { "ready": false, "missing": ["analysis"] },
+      "history": [],
+      "message": null,
+      "asserts": [{ "type": "language_match", "expected": "English" }],
+      "note": "Fresh analysis, no chat yet, button path — with no turn to read, the user-authored goal (English objective + business_questions, required at onboarding) drives the language."
+    },
+    {
+      "id": "lang_06", "group": "language", "carried_over": false, "manual_review": false,
+      "state": { "analysis_title": "Analisis retensi", "objective": "memahami retensi pengguna", "business_questions": ["apa yang mendorong penggunaan berulang?"], "report_id": null },
+      "report_ready": { "ready": false, "missing": ["analysis"] },
+      "history": [],
+      "message": null,
+      "asserts": [{ "type": "language_match", "expected": "Indonesian" }],
+      "note": "Same fresh-analysis path as lang_04 but the goal is Indonesian — the goal signal must yield Indonesian (not the hard fallback, which only fires when the goal is empty too)."
+    },
+    {
+      "id": "lang_05", "group": "language", "carried_over": false, "manual_review": false,
+      "state": { "analysis_title": "Analisis penjualan", "objective": "memahami tren penjualan", "business_questions": ["bagaimana tren bulanan?"], "report_id": null },
+      "report_ready": { "ready": false, "missing": ["analysis"] },
+      "history": [
+        { "role": "human", "content": "apa saja yang bisa aku tanyakan tentang data ini?" },
+        { "role": "ai", "content": "You can start by asking which products sell the most." }
+      ],
+      "message": null,
+      "asserts": [{ "type": "language_match", "expected": "Indonesian" }],
+      "note": "Last AI turn is English but the human turn is Indonesian — mirror the human, reply Indonesian."
+    },
+    {
+      "id": "help_ex_guard_delta", "group": "report_guard", "carried_over": true, "manual_review": false,
+      "state": { "analysis_title": "Sales analysis", "objective": "understand monthly sales performance", "business_questions": ["which products drive revenue?"], "report_id": "rep-1" },
+      "report_ready": { "ready": false, "missing": ["delta"] },
+      "history": [{ "role": "human", "content": "what should I do next?" }],
+      "message": null,
+      "asserts": [{ "type": "must_not_contain_any", "patterns": ["/report", "generate the report", "generate your report", "create the report"] }],
+      "note": "MIRRORS help.md example help_ex_guard_delta. A report exists and nothing new since — must NOT tell the user to generate a report; steer them to run a fresh analysis first."
+    },
+    {
+      "id": "help_ex_guard_ready", "group": "report_guard", "carried_over": true, "manual_review": false,
+      "state": { "analysis_title": "Sales analysis", "objective": "understand monthly sales performance", "business_questions": ["which products drive revenue?"], "report_id": null },
+      "report_ready": { "ready": true, "missing": [] },
+      "history": [{ "role": "human", "content": "what should I do next?" }],
+      "message": null,
+      "asserts": [{ "type": "must_contain_any", "patterns": ["/report", "report"] }],
+      "note": "MIRRORS help.md example help_ex_guard_ready. Enough analysis done — SHOULD nudge toward the report (mention /report or the report option)."
+    },
+    {
+      "id": "guard_03", "group": "report_guard", "carried_over": false, "manual_review": false,
+      "state": { "analysis_title": "Retention analysis", "objective": "improve 30-day retention", "business_questions": ["which cohort retains best?"], "report_id": null },
+      "report_ready": { "ready": false, "missing": ["analysis"] },
+      "history": [{ "role": "human", "content": "can I get a report now?" }],
+      "message": null,
+      "asserts": [{ "type": "must_not_contain_any", "patterns": ["/report", "generate the report", "generate your report", "you can generate"] }],
+      "note": "No analysis run yet, user asks for a report directly — must NOT offer to generate; redirect to running an analysis first."
+    },
+    {
+      "id": "guard_04", "group": "report_guard", "carried_over": false, "manual_review": false,
+      "state": { "analysis_title": "Analisis penjualan", "objective": "memahami performa penjualan", "business_questions": ["produk mana yang paling laku?"], "report_id": null },
+      "report_ready": { "ready": true, "missing": [] },
+      "history": [{ "role": "human", "content": "selanjutnya aku ngapain?" }],
+      "message": null,
+      "asserts": [
+        { "type": "must_contain_any", "patterns": ["/report", "laporan", "report"] },
+        { "type": "language_match", "expected": "Indonesian" }
+      ],
+      "note": "Ready + Indonesian conversation — should nudge toward the report AND stay in Indonesian (two rules at once)."
+    },
+    {
+      "id": "guard_05", "group": "report_guard", "carried_over": false, "manual_review": false,
+      "state": { "analysis_title": "Analisis churn", "objective": "menurunkan churn", "business_questions": ["segmen mana yang paling churn?"], "report_id": null },
+      "report_ready": { "ready": false, "missing": ["analysis"] },
+      "history": [{ "role": "human", "content": "aku mau bikin laporan dong" }],
+      "message": null,
+      "asserts": [
+        { "type": "must_not_contain_any", "patterns": ["/report", "silakan buat laporan", "kamu bisa membuat laporan", "generate your report"] },
+        { "type": "language_match", "expected": "Indonesian" }
+      ],
+      "note": "Indonesian, not ready, user asks for a report — must NOT offer it and must reply in Indonesian."
+    },
+    {
+      "id": "help_ex_orient", "group": "orientation", "carried_over": true, "manual_review": true,
+      "state": { "analysis_title": "Sales analysis", "objective": "understand monthly sales performance", "business_questions": ["which products drive revenue?"], "report_id": null },
+      "report_ready": { "ready": false, "missing": ["analysis"] },
+      "history": [],
+      "message": null,
+      "asserts": [],
+      "note": "MIRRORS help.md example help_ex_orient. MANUAL: are the 2-3 starter questions concrete, descriptive-first, and tied to the objective? Read output_text."
+    },
+    {
+      "id": "orient_02", "group": "orientation", "carried_over": false, "manual_review": true,
+      "state": { "analysis_title": "Retention analysis", "objective": "improve 30-day retention", "business_questions": ["which acquisition channel retains best?"], "report_id": null },
+      "report_ready": { "ready": false, "missing": ["analysis"] },
+      "history": [
+        { "role": "human", "content": "which channel brings the most signups?" },
+        { "role": "ai", "content": "Organic search brought the most signups last month (1,240)." }
+      ],
+      "message": null,
+      "asserts": [],
+      "note": "MANUAL: one question already answered — does help build on it with a NEW follow-up (retention by channel), not re-suggest the answered question? Read output_text."
+    }
+  ]
+}

eval/help/run_eval.py ADDED

	@@ -0,0 +1,428 @@

+"""Help-skill eval runner.
+Feeds each golden case in `help_dataset.json` to the LIVE Help skill
+(`src/agents/handlers/help.HelpAgent.astream`), then scores whether the streamed
+reply obeys a set of RULE assertions — reply language, never suggesting a report
+when `report_ready.ready=false`, suggesting it when true. Prints a per-case detail
+table + aggregate summary and writes a timestamped JSON report under `results/`
+(never overwritten — one file per run, diffable).
+Unlike `eval/readiness` (deterministic, no LLM), this calls the model for real, so
+it needs a working `.env` (Azure OpenAI) and spends tokens — run it before a deploy
+that touches `help.md`, not on every commit. `tests/unit/agents/handlers/test_help.py`
+already covers the deterministic Python guard with a fake chain; this is the
+end-to-end "does the model actually obey the prompt" layer on top.
+Two things the metric separates on purpose:
+  - COMPLIANCE = % of rule assertions that hold. NOT accuracy — help replies are free
+    prose with no single correct wording; we score rule-obedience, not similarity.
+  - HELD-OUT vs CARRIED-OVER — carried_over cases mirror a help.md example (regression);
+    held-out cases are absent from the prompt. Held-out compliance is the real
+    generalization signal. If held-out drops while carried_over stays 100%, the prompt
+    is overfitting to its own examples.
+`orientation` cases are `manual_review` — run but excluded from the auto compliance
+rate; read their `output_text` in the JSON report to judge suggestion quality.
+Invoke as a module so `src` imports resolve:
+    uv run python -m eval.help.run_eval
+    uv run python -m eval.help.run_eval --limit 4     # smoke test
+    uv run python -m eval.help.run_eval --no-table    # summary only
+"""
+from __future__ import annotations
+import argparse
+import asyncio
+import json
+import statistics
+import time
+from dataclasses import asdict, dataclass, field
+from datetime import datetime
+from pathlib import Path
+from typing import Any
+from langchain_core.callbacks import BaseCallbackHandler
+from langchain_core.messages import AIMessage, BaseMessage, HumanMessage
+from langchain_core.outputs import LLMResult
+from src.agents.gate import AnalysisState, stub_analysis_state
+from src.agents.handlers.help import HelpAgent, ReportReadiness, _detect_reply_language
+from src.agents.report.readiness import _MISSING_ANALYSIS, _MISSING_DELTA
+_HERE = Path(__file__).resolve().parent
+DATASET = _HERE / "help_dataset.json"
+RESULTS_DIR = _HERE / "results"
+GROUPS = ["language", "report_guard", "orientation"]
+# Dataset short codes -> the exact `missing` strings is_report_ready emits. Imported
+# from the module so the dataset stays readable and survives wording changes.
+_CODE_TO_MISSING = {
+    "analysis": _MISSING_ANALYSIS,
+    "delta": _MISSING_DELTA,
+}
+class _UsageCollector(BaseCallbackHandler):
+    """Sums token usage across the LLM calls made during one astream()."""
+    def __init__(self) -> None:
+        self.input_tokens = 0
+        self.output_tokens = 0
+        self.total_tokens = 0
+    def on_llm_end(self, response: LLMResult, **kwargs: Any) -> None:
+        before = self.total_tokens
+        for generation_list in response.generations:
+            for generation in generation_list:
+                message = getattr(generation, "message", None)
+                usage = getattr(message, "usage_metadata", None) if message else None
+                if usage:
+                    self.input_tokens += usage.get("input_tokens", 0)
+                    self.output_tokens += usage.get("output_tokens", 0)
+                    self.total_tokens += usage.get("total_tokens", 0)
+        if self.total_tokens == before and response.llm_output:
+            usage = response.llm_output.get("token_usage") or {}
+            self.input_tokens += usage.get("prompt_tokens", 0)
+            self.output_tokens += usage.get("completion_tokens", 0)
+            self.total_tokens += usage.get("total_tokens", 0)
+    @property
+    def tokens(self) -> dict[str, int]:
+        return {
+            "input": self.input_tokens,
+            "output": self.output_tokens,
+            "total": self.total_tokens,
+        }
+# --- assertion checkers -----------------------------------------------------
+# Each returns (passed, detail). `detail` explains a failure in the table/report.
+def _check_language_match(output: str, spec: dict[str, Any]) -> tuple[bool, str]:
+    got = _detect_reply_language([], message=output)
+    return got == spec["expected"], f"want {spec['expected']}, got {got}"
+def _check_must_not_contain_any(output: str, spec: dict[str, Any]) -> tuple[bool, str]:
+    low = output.lower()
+    hits = [p for p in spec["patterns"] if p.lower() in low]
+    return (not hits), (f"found {hits}" if hits else "none present")
+def _check_must_contain_any(output: str, spec: dict[str, Any]) -> tuple[bool, str]:
+    low = output.lower()
+    hits = [p for p in spec["patterns"] if p.lower() in low]
+    return bool(hits), (f"found {hits}" if hits else f"none of {spec['patterns']}")
+_ASSERT_CHECKS = {
+    "language_match": _check_language_match,
+    "must_not_contain_any": _check_must_not_contain_any,
+    "must_contain_any": _check_must_contain_any,
+}
+@dataclass
+class AssertResult:
+    type: str
+    passed: bool
+    detail: str
+@dataclass
+class CaseResult:
+    id: str
+    group: str
+    carried_over: bool
+    manual_review: bool
+    output_text: str
+    asserts: list[AssertResult]
+    all_passed: bool | None  # None when manual_review (not auto-scored)
+    latency_ms: float
+    tokens: dict[str, int]
+    errored: bool = False  # the astream call raised — infra failure, not a rule verdict
+def load_cases(path: Path) -> list[dict[str, Any]]:
+    """Read the `cases` array, skipping the leading `_*` doc keys and `schema`."""
+    data = json.loads(path.read_text(encoding="utf-8"))
+    return list(data["cases"])
+def _build_state(spec: dict[str, Any]) -> AnalysisState:
+    """Build an AnalysisState from a case's `state` block (defaults from the stub)."""
+    return stub_analysis_state().model_copy(
+        update={
+            "analysis_title": spec.get("analysis_title", "New analysis"),
+            "objective": spec.get("objective", ""),
+            "business_questions": list(spec.get("business_questions", [])),
+            "report_id": spec.get("report_id"),
+        }
+    )
+def _build_history(rows: list[dict[str, Any]]) -> list[BaseMessage]:
+    out: list[BaseMessage] = []
+    for row in rows:
+        cls = HumanMessage if row["role"] == "human" else AIMessage
+        out.append(cls(content=row["content"]))
+    return out
+def _build_readiness(spec: dict[str, Any]) -> ReportReadiness:
+    return ReportReadiness(
+        ready=bool(spec["ready"]),
+        missing=[_CODE_TO_MISSING[c] for c in spec.get("missing", [])],
+    )
+async def run_case(case: dict[str, Any]) -> CaseResult:
+    """Stream one Help reply and score it; a failed stream is recorded, not raised."""
+    state = _build_state(case["state"])
+    history = _build_history(case.get("history", []))
+    readiness = _build_readiness(case["report_ready"])
+    collector = _UsageCollector()
+    agent = HelpAgent()  # real Azure chain, constructed lazily on first astream
+    start = time.perf_counter()
+    try:
+        output = "".join(
+            [
+                token
+                async for token in agent.astream(
+                    state,
+                    history=history,
+                    message=case.get("message"),
+                    report_ready=readiness,
+                    callbacks=[collector],
+                )
+            ]
+        )
+    except Exception as exc:  # noqa: BLE001 - one bad case shouldn't kill the run
+        output = f"ERROR:{type(exc).__name__}: {exc}"
+        errored = True
+    else:
+        # Explicit flag: a reply whose text merely starts with "ERROR:" is not a crash.
+        errored = False
+    latency_ms = round((time.perf_counter() - start) * 1000, 1)
+    manual = bool(case.get("manual_review"))
+    asserts: list[AssertResult] = []
+    if errored:
+        # Don't run rule checks on an error string — a crash must not "pass" a
+        # must_not_contain_any (the pattern is trivially absent) or a language check.
+        # Count it as a failure, but flag it as errored so it reads as infra, not a
+        # rule violation (overrides manual_review — a crash isn't reviewable).
+        asserts = [AssertResult(type="stream", passed=False, detail=_truncate(output, 100))]
+        all_passed: bool | None = False
+    elif manual:
+        all_passed = None
+    else:
+        for spec in case.get("asserts", []):
+            check = _ASSERT_CHECKS[spec["type"]]
+            passed, detail = check(output, spec)
+            asserts.append(AssertResult(type=spec["type"], passed=passed, detail=detail))
+        all_passed = all(a.passed for a in asserts)
+    return CaseResult(
+        id=case["id"],
+        group=case["group"],
+        carried_over=bool(case.get("carried_over")),
+        manual_review=manual,
+        output_text=output,
+        asserts=asserts,
+        all_passed=all_passed,
+        latency_ms=latency_ms,
+        tokens=collector.tokens,
+        errored=errored,
+    )
+def _compliance(results: list[CaseResult]) -> dict[str, Any]:
+    scored = [r for r in results if r.all_passed is not None]
+    passed = sum(1 for r in scored if r.all_passed)
+    return {
+        "n": len(scored),
+        "passed": passed,
+        "compliance": round(passed / len(scored), 3) if scored else 0.0,
+    }
+def summarize(results: list[CaseResult]) -> dict[str, Any]:
+    scored = [r for r in results if r.all_passed is not None]
+    latencies = [r.latency_ms for r in results]
+    tok_total = sum(r.tokens["total"] for r in results)
+    overall = _compliance(results)
+    by_group = {
+        g: _compliance([r for r in results if r.group == g])
+        for g in GROUPS
+        if any(r.group == g for r in results)
+    }
+    errored = [r for r in results if r.errored]
+    return {
+        "total": len(results),
+        "scored": len(scored),
+        "manual_review": len(results) - len(scored),
+        "passed": overall["passed"],
+        "compliance": overall["compliance"],
+        "runtime_avg_ms": round(statistics.mean(latencies), 1) if latencies else 0,
+        "tokens_total": tok_total,
+        "by_group": by_group,
+        "held_out": _compliance([r for r in scored if not r.carried_over]),
+        "carried_over": _compliance([r for r in scored if r.carried_over]),
+        "errored": {"count": len(errored), "ids": [r.id for r in errored]},
+    }
+def _truncate(text: str, width: int) -> str:
+    text = text.replace("\n", " ")
+    return text if len(text) <= width else text[: width - 3] + "..."
+def format_table(results: list[CaseResult]) -> str:
+    header = (
+        f"{'ID':<20} {'GROUP':<13} {'C/O':<4} {'ASSERTS':<22} {'OK':<4} {'MS':>7}"
+    )
+    rule = "-" * len(header)
+    lines = [rule, header, rule]
+    for r in results:
+        co = "CO" if r.carried_over else "-"
+        if r.all_passed is None:  # manual-review case that streamed without error
+            atypes, ok = "(manual)", "~"
+        else:
+            atypes = ",".join(a.type.replace("_", "")[:6] for a in r.asserts) or "-"
+            ok = "ok" if r.all_passed else "X"
+        lines.append(
+            f"{r.id:<20} {r.group:<13} {co:<4} {_truncate(atypes, 22):<22} "
+            f"{ok:<4} {r.latency_ms:>7}"
+        )
+    lines.append(rule)
+    return "\n".join(lines)
+def format_summary(summary: dict[str, Any], results: list[CaseResult]) -> str:
+    lines = ["SUMMARY"]
+    lines.append(
+        f"  Compliance   {summary['passed']}/{summary['scored']} cases obey all rules"
+        f"   ({summary['compliance'] * 100:.1f}%)   avg {summary['runtime_avg_ms']} ms"
+    )
+    lines.append(
+        f"  Manual       {summary['manual_review']} case(s) excluded from the rate"
+        " (read output_text)"
+    )
+    lines.append("")
+    lines.append("  By group")
+    for g, m in summary["by_group"].items():
+        if m["n"]:
+            lines.append(f"    {g:<14} {m['passed']}/{m['n']}  {m['compliance'] * 100:.0f}%")
+        else:
+            lines.append(f"    {g:<14} (manual only)")
+    lines.append("")
+    ho, co = summary["held_out"], summary["carried_over"]
+    lines.append("  Held-out vs carried-over")
+    lines.append(
+        f"    held_out       {ho['passed']}/{ho['n']}  "
+        f"{ho['compliance'] * 100:.0f}%   <- generalization"
+    )
+    lines.append(
+        f"    carried_over   {co['passed']}/{co['n']}  "
+        f"{co['compliance'] * 100:.0f}%   <- regression"
+    )
+    # Rule failures (real disobedience) vs errored (infra/stream crash) — kept apart so
+    # a crashed run isn't misread as the model breaking a rule.
+    failures = [r for r in results if r.all_passed is False and not r.errored]
+    lines.append("")
+    lines.append(f"  FAILURES ({len(failures)})")
+    for r in failures:
+        bad = [f"{a.type}({a.detail})" for a in r.asserts if not a.passed]
+        lines.append(f"    {r.id:<20} {r.group:<13} {'; '.join(bad)}")
+    err = summary["errored"]
+    if err["count"]:
+        lines.append("")
+        lines.append(
+            f"  ERRORED ({err['count']}) - stream crashed, counted as fail NOT a rule miss"
+            f"  -> {', '.join(err['ids'])}"
+        )
+    return "\n".join(lines)
+def build_report(
+    results: list[CaseResult], summary: dict[str, Any], meta: dict[str, Any]
+) -> dict[str, Any]:
+    run = {
+        **meta,
+        **{
+            k: summary[k]
+            for k in ("total", "scored", "manual_review", "passed", "compliance",
+                      "runtime_avg_ms", "tokens_total")
+        },
+    }
+    return {
+        "run": run,
+        "by_group": summary["by_group"],
+        "held_out": summary["held_out"],
+        "carried_over": summary["carried_over"],
+        "errored": summary["errored"],
+        "cases": [asdict(r) for r in results],
+    }
+def _model_name() -> str:
+    try:
+        from src.config.settings import settings
+        return str(settings.azureai_deployment_name_4o)
+    except Exception:  # noqa: BLE001 — meta only; .env may be absent
+        return "gpt-4o"
+@dataclass
+class _Args:
+    dataset: Path = DATASET
+    limit: int = 0
+    no_table: bool = False
+    extra: dict[str, Any] = field(default_factory=dict)
+async def main() -> None:
+    parser = argparse.ArgumentParser(description="Help-skill eval")
+    parser.add_argument("--dataset", type=Path, default=DATASET)
+    parser.add_argument("--limit", type=int, default=0, help="run first N cases only")
+    parser.add_argument("--prompt-version", default="help.md")
+    parser.add_argument("--no-table", action="store_true", help="skip the detail table")
+    args = parser.parse_args()
+    cases = load_cases(args.dataset)
+    if args.limit:
+        cases = cases[: args.limit]
+    started = datetime.now()
+    print(f"Help Skill Eval -- {started:%Y-%m-%d %H:%M:%S}")
+    print(
+        f"dataset: {args.dataset.name} ({len(cases)} cases)  model: {_model_name()}  "
+        f"prompt: {args.prompt_version}  target: HelpAgent.astream (live)"
+    )
+    results = [await run_case(case) for case in cases]
+    summary = summarize(results)
+    if not args.no_table:
+        print(format_table(results))
+    print(format_summary(summary, results))
+    meta = {
+        "timestamp": started.isoformat(timespec="seconds"),
+        "dataset": args.dataset.name,
+        "model": _model_name(),
+        "prompt_version": args.prompt_version,
+        "target": "src/agents/handlers/help.HelpAgent.astream",
+    }
+    report = build_report(results, summary, meta)
+    RESULTS_DIR.mkdir(parents=True, exist_ok=True)
+    out_path = RESULTS_DIR / f"help_result_{started:%Y-%m-%d_%H%M%S}.json"
+    out_path.write_text(json.dumps(report, ensure_ascii=False, indent=2), encoding="utf-8")
+    print(f"\n-> saved: {out_path.relative_to(_HERE.parent.parent)}")
+if __name__ == "__main__":
+    asyncio.run(main())
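To make the held-out versus carried-over split concrete, here is a minimal sketch of the compliance arithmetic on a toy result set. `Scored` is a hypothetical stand-in for `CaseResult`, reduced to three fields, and every case id except `orient_01` is invented; only the scoring rule (manual cases have `all_passed=None` and are left out of the rate) mirrors the runner above.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Scored:
    """Hypothetical, trimmed-down stand-in for the runner's CaseResult."""

    id: str
    all_passed: bool | None  # None = manual_review, excluded from the rate
    carried_over: bool


def compliance(results: list[Scored]) -> dict[str, float | int]:
    """Share of auto-scored cases in which every rule assertion held."""
    scored = [r for r in results if r.all_passed is not None]
    passed = sum(1 for r in scored if r.all_passed)
    rate = round(passed / len(scored), 3) if scored else 0.0
    return {"n": len(scored), "passed": passed, "compliance": rate}


results = [
    Scored("lang_01", True, carried_over=True),
    Scored("lang_02", True, carried_over=True),
    Scored("guard_01", True, carried_over=False),
    Scored("guard_02", False, carried_over=False),
    Scored("orient_01", None, carried_over=True),  # manual: not counted
]

overall = compliance(results)
held_out = compliance([r for r in results if not r.carried_over])
carried = compliance([r for r in results if r.carried_over])

print(overall, held_out, carried)
```

In this toy run the carried-over rate is 100% while the held-out rate is 50%, which is the pattern the module docstring describes as the prompt overfitting to its own examples.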

eval/readiness/readiness_dataset.json CHANGED

@@ -1,40 +1,37 @@
 {
   "_about": "Golden dataset for the report-readiness signal (`src/agents/report/readiness.is_report_ready`). Deterministic (no LLM): each case declares an analysis state + a set of persisted AnalysisRecords/reports, and the runner feeds them through is_report_ready via injectable fake stores, scoring the boolean `ready` AND the `missing` gaps. Floor cases should score ~100% (regression value). The `alignment` group probes the deferred LLM-judge — see _alignment.",
-  "_floor": "is_report_ready's deterministic floor: (1) problem_validated, (2) >=1 SUBSTANTIVE record, (3) delta-since-report. SUBSTANTIVE (KM-652 fix T1) = a record whose ANALYSIS task succeeded: tasks_run contains a task with status=success AND an analyze_* tool. A failed analysis still persists a record WITH findings (narrating the failure) and its data-access tasks (check_/retrieve_) succeed — so neither 'has findings' nor 'any task succeeded' counts. Only a successful analyze_* does.",
   "_records": "records[].analysis = 'success' (analyze_* succeeded → substantive) | 'failure' (analyze_* failed, data-access still succeeded — the real e2e case, NOT substantive) | 'none' (only check_/retrieve_ succeeded, no analyze task — NOT substantive; guards the 'any task succeeded' trap). records[].findings = count (a failure run still has findings; floor ignores them now). records[].age_min / reports[].age_min = minutes ago (smaller = newer).",
-  "_alignment": "ALIGNMENT cases: a successful analysis (floor says ready=true) but `aligned=false` means it doesn't address the problem statement — a human would say NOT ready. Scored floor-correct, counted separately as the 'alignment gap' = evidence for/against the LLM-judge. Alignment label owner: Rifqi (report semantics) + Sofhia.",
   "schema": {
     "id": "stable per-case handle, <group>_<NN>",
     "group": "floor | delta | edge | alignment",
-    "problem_validated": "bool",
     "report_id": "null = never generated; a string = a report exists",
     "records": "[{ analysis: success|failure|none, findings: int, age_min: int }]",
     "reports": "[{ age_min: int }] (only meaningful when report_id set)",
-    "aligned": "bool — do the analyses address the problem statement? (floor ignores this)",
     "expected_ready": "what the deterministic floor SHOULD return",
-    "expected_missing": "subset of [problem, analysis, delta]",
     "note": "human-readable description"
   },
   "cases": [
-    { "id": "floor_01", "group": "floor", "problem_validated": false, "report_id": null, "records": [], "reports": [], "aligned": false, "expected_ready": false, "expected_missing": ["problem", "analysis"], "note": "new analysis: no validated goal and no records" },
-    { "id": "floor_02", "group": "floor", "problem_validated": false, "report_id": null, "records": [{ "analysis": "success", "findings": 2, "age_min": 30 }], "reports": [], "aligned": true, "expected_ready": false, "expected_missing": ["problem"], "note": "has a successful analysis but goal not validated (isolates the problem gap)" },
-    { "id": "floor_03", "group": "floor", "problem_validated": true, "report_id": null, "records": [], "reports": [], "aligned": false, "expected_ready": false, "expected_missing": ["analysis"], "note": "validated goal but no analysis run yet" },
-    { "id": "floor_04", "group": "floor", "problem_validated": true, "report_id": null, "records": [{ "analysis": "failure", "findings": 3, "age_min": 20 }], "reports": [], "aligned": false, "expected_ready": false, "expected_missing": ["analysis"], "note": "T1 REGRESSION: analyze_* FAILED but the record still has 3 findings (narrating failure) + check/retrieve succeeded. Must NOT be ready — this is the live e2e case (analyze_aggregate failed, report still got generated under the old 'has findings' rule)." },
-    { "id": "floor_05", "group": "floor", "problem_validated": true, "report_id": null, "records": [{ "analysis": "none", "findings": 0, "age_min": 15 }], "reports": [], "aligned": false, "expected_ready": false, "expected_missing": ["analysis"], "note": "T1 nuance: only data-access tasks (check/retrieve) succeeded, no analyze task. 'any task succeeded' would wrongly pass — must NOT be ready." },
-    { "id": "floor_06", "group": "floor", "problem_validated": true, "report_id": null, "records": [{ "analysis": "success", "findings": 2, "age_min": 15 }], "reports": [], "aligned": true, "expected_ready": true, "expected_missing": [], "note": "validated + one successful analysis, no prior report → ready" },
-    { "id": "floor_07", "group": "floor", "problem_validated": true, "report_id": null, "records": [{ "analysis": "success", "findings": 3, "age_min": 40 }, { "analysis": "success", "findings": 1, "age_min": 10 }], "reports": [], "aligned": true, "expected_ready": true, "expected_missing": [], "note": "multiple successful analyses → ready" },
-    { "id": "floor_08", "group": "floor", "problem_validated": true, "report_id": null, "records": [{ "analysis": "failure", "findings": 3, "age_min": 30 }, { "analysis": "success", "findings": 2, "age_min": 10 }], "reports": [], "aligned": true, "expected_ready": true, "expected_missing": [], "note": "one failed + one successful analysis → the successful one is enough → ready" },
-    { "id": "delta_01", "group": "delta", "problem_validated": true, "report_id": "rep-1", "records": [{ "analysis": "success", "findings": 2, "age_min": 120 }], "reports": [{ "age_min": 5 }], "aligned": true, "expected_ready": false, "expected_missing": ["delta"], "note": "report exists, all analysis older than it → nothing new to report" },
-    { "id": "delta_02", "group": "delta", "problem_validated": true, "report_id": "rep-1", "records": [{ "analysis": "success", "findings": 2, "age_min": 5 }], "reports": [{ "age_min": 120 }], "aligned": true, "expected_ready": true, "expected_missing": [], "note": "newer successful analysis after the report → ready to regenerate" },
-    { "id": "delta_03", "group": "delta", "problem_validated": true, "report_id": "rep-1", "records": [{ "analysis": "success", "findings": 1, "age_min": 90 }, { "analysis": "success", "findings": 2, "age_min": 10 }], "reports": [{ "age_min": 60 }], "aligned": true, "expected_ready": true, "expected_missing": [], "note": "one old + one newer-than-report success → ready" },
-    { "id": "delta_04", "group": "delta", "problem_validated": true, "report_id": "rep-2", "records": [{ "analysis": "success", "findings": 2, "age_min": 90 }], "reports": [{ "age_min": 200 }, { "age_min": 30 }], "aligned": true, "expected_ready": false, "expected_missing": ["delta"], "note": "multiple reports — newest wins; analysis older than newest report → not ready" },
-    { "id": "delta_05", "group": "delta", "problem_validated": true, "report_id": "rep-1", "records": [{ "analysis": "success", "findings": 2, "age_min": 120 }, { "analysis": "failure", "findings": 3, "age_min": 5 }], "reports": [{ "age_min": 60 }], "aligned": true, "expected_ready": false, "expected_missing": ["delta"], "note": "T1+delta: the only NEW analysis (age 5) is a FAILURE → no NEW substantive since the report → not ready. A failed retry must not unlock a duplicate report." },
-    { "id": "edge_01", "group": "edge", "problem_validated": true, "report_id": null, "records": [], "reports": [], "aligned": false, "expected_ready": false, "expected_missing": ["analysis"], "note": "doc-only analysis (RAG, no structured run) produces no AnalysisRecord → never report-able under the floor. PRODUCT QUESTION: should doc-only be report-able?" },
-    { "id": "align_01", "group": "alignment", "problem_validated": true, "report_id": null, "records": [{ "analysis": "success", "findings": 2, "age_min": 15 }], "reports": [], "aligned": false, "expected_ready": true, "expected_missing": [], "note": "GAP: successful analysis but it doesn't address the problem statement. Floor says ready; a human would say not-ready." },
-    { "id": "align_02", "group": "alignment", "problem_validated": true, "report_id": null, "records": [{ "analysis": "success", "findings": 3, "age_min": 25 }, { "analysis": "success", "findings": 1, "age_min": 5 }], "reports": [], "aligned": false, "expected_ready": true, "expected_missing": [], "note": "GAP: lots of successful analysis, none aligned to the goal" },
-    { "id": "align_03", "group": "alignment", "problem_validated": true, "report_id": null, "records": [{ "analysis": "success", "findings": 2, "age_min": 15 }], "reports": [], "aligned": true, "expected_ready": true, "expected_missing": [], "note": "control: successful AND aligned → genuinely ready, no gap" }
   ]
 }

 {
   "_about": "Golden dataset for the report-readiness signal (`src/agents/report/readiness.is_report_ready`). Deterministic (no LLM): each case declares an analysis state + a set of persisted AnalysisRecords/reports, and the runner feeds them through is_report_ready via injectable fake stores, scoring the boolean `ready` AND the `missing` gaps. Floor cases should score ~100% (regression value). The `alignment` group probes the deferred LLM-judge — see _alignment.",
+  "_floor": "is_report_ready's deterministic floor (KM-652, after the problem_validated gate was removed 2026-06-24): (1) >=1 SUBSTANTIVE record, (2) delta-since-report. SUBSTANTIVE = a record whose ANALYSIS task succeeded: tasks_run contains a task with status=success AND an analyze_* tool. A failed analysis still persists a record WITH findings (narrating the failure) and its data-access tasks (check_/retrieve_) succeed — so neither 'has findings' nor 'any task succeeded' counts. Only a successful analyze_* does.",
   "_records": "records[].analysis = 'success' (analyze_* succeeded → substantive) | 'failure' (analyze_* failed, data-access still succeeded — the real e2e case, NOT substantive) | 'none' (only check_/retrieve_ succeeded, no analyze task — NOT substantive; guards the 'any task succeeded' trap). records[].findings = count (a failure run still has findings; floor ignores them now). records[].age_min / reports[].age_min = minutes ago (smaller = newer).",
+  "_alignment": "ALIGNMENT cases: a successful analysis (floor says ready=true) but `aligned=false` means it doesn't address the analysis objective — a human would say NOT ready. Scored floor-correct, counted separately as the 'alignment gap' = evidence for/against the LLM-judge. Alignment label owner: Rifqi (report semantics) + Sofhia.",
   "schema": {
     "id": "stable per-case handle, <group>_<NN>",
     "group": "floor | delta | edge | alignment",
     "report_id": "null = never generated; a string = a report exists",
     "records": "[{ analysis: success|failure|none, findings: int, age_min: int }]",
     "reports": "[{ age_min: int }] (only meaningful when report_id set)",
+    "aligned": "bool — do the analyses address the objective? (floor ignores this)",
     "expected_ready": "what the deterministic floor SHOULD return",
+    "expected_missing": "subset of [analysis, delta]",
     "note": "human-readable description"
   },
   "cases": [
+    { "id": "floor_01", "group": "floor", "report_id": null, "records": [], "reports": [], "aligned": false, "expected_ready": false, "expected_missing": ["analysis"], "note": "new analysis: no analysis run yet → not ready" },
+    { "id": "floor_02", "group": "floor", "report_id": null, "records": [{ "analysis": "failure", "findings": 3, "age_min": 20 }], "reports": [], "aligned": false, "expected_ready": false, "expected_missing": ["analysis"], "note": "T1 REGRESSION: analyze_* FAILED but the record still has 3 findings (narrating failure) + check/retrieve succeeded. Must NOT be ready — this is the live e2e case (analyze_aggregate failed, report still got generated under the old 'has findings' rule)." },
+    { "id": "floor_03", "group": "floor", "report_id": null, "records": [{ "analysis": "none", "findings": 0, "age_min": 15 }], "reports": [], "aligned": false, "expected_ready": false, "expected_missing": ["analysis"], "note": "T1 nuance: only data-access tasks (check/retrieve) succeeded, no analyze task. 'any task succeeded' would wrongly pass — must NOT be ready." },
+    { "id": "floor_04", "group": "floor", "report_id": null, "records": [{ "analysis": "success", "findings": 2, "age_min": 15 }], "reports": [], "aligned": true, "expected_ready": true, "expected_missing": [], "note": "one successful analysis, no prior report → ready" },
+    { "id": "floor_05", "group": "floor", "report_id": null, "records": [{ "analysis": "success", "findings": 3, "age_min": 40 }, { "analysis": "success", "findings": 1, "age_min": 10 }], "reports": [], "aligned": true, "expected_ready": true, "expected_missing": [], "note": "multiple successful analyses → ready" },
+    { "id": "floor_06", "group": "floor", "report_id": null, "records": [{ "analysis": "failure", "findings": 3, "age_min": 30 }, { "analysis": "success", "findings": 2, "age_min": 10 }], "reports": [], "aligned": true, "expected_ready": true, "expected_missing": [], "note": "one failed + one successful analysis → the successful one is enough → ready" },
+    { "id": "delta_01", "group": "delta", "report_id": "rep-1", "records": [{ "analysis": "success", "findings": 2, "age_min": 120 }], "reports": [{ "age_min": 5 }], "aligned": true, "expected_ready": false, "expected_missing": ["delta"], "note": "report exists, all analysis older than it → nothing new to report" },
+    { "id": "delta_02", "group": "delta", "report_id": "rep-1", "records": [{ "analysis": "success", "findings": 2, "age_min": 5 }], "reports": [{ "age_min": 120 }], "aligned": true, "expected_ready": true, "expected_missing": [], "note": "newer successful analysis after the report → ready to regenerate" },
+    { "id": "delta_03", "group": "delta", "report_id": "rep-1", "records": [{ "analysis": "success", "findings": 1, "age_min": 90 }, { "analysis": "success", "findings": 2, "age_min": 10 }], "reports": [{ "age_min": 60 }], "aligned": true, "expected_ready": true, "expected_missing": [], "note": "one old + one newer-than-report success → ready" },
+    { "id": "delta_04", "group": "delta", "report_id": "rep-2", "records": [{ "analysis": "success", "findings": 2, "age_min": 90 }], "reports": [{ "age_min": 200 }, { "age_min": 30 }], "aligned": true, "expected_ready": false, "expected_missing": ["delta"], "note": "multiple reports — newest wins; analysis older than newest report → not ready" },
+    { "id": "delta_05", "group": "delta", "report_id": "rep-1", "records": [{ "analysis": "success", "findings": 2, "age_min": 120 }, { "analysis": "failure", "findings": 3, "age_min": 5 }], "reports": [{ "age_min": 60 }], "aligned": true, "expected_ready": false, "expected_missing": ["delta"], "note": "T1+delta: the only NEW analysis (age 5) is a FAILURE → no NEW substantive since the report → not ready. A failed retry must not unlock a duplicate report." },
+    { "id": "edge_01", "group": "edge", "report_id": null, "records": [], "reports": [], "aligned": false, "expected_ready": false, "expected_missing": ["analysis"], "note": "doc-only analysis (RAG, no structured run) produces no AnalysisRecord → never report-able under the floor. PRODUCT QUESTION: should doc-only be report-able?" },
+    { "id": "align_01", "group": "alignment", "report_id": null, "records": [{ "analysis": "success", "findings": 2, "age_min": 15 }], "reports": [], "aligned": false, "expected_ready": true, "expected_missing": [], "note": "GAP: successful analysis but it doesn't address the objective. Floor says ready; a human would say not-ready." },
+    { "id": "align_02", "group": "alignment", "report_id": null, "records": [{ "analysis": "success", "findings": 3, "age_min": 25 }, { "analysis": "success", "findings": 1, "age_min": 5 }], "reports": [], "aligned": false, "expected_ready": true, "expected_missing": [], "note": "GAP: lots of successful analysis, none aligned to the objective" },
+    { "id": "align_03", "group": "alignment", "report_id": null, "records": [{ "analysis": "success", "findings": 2, "age_min": 15 }], "reports": [], "aligned": true, "expected_ready": true, "expected_missing": [], "note": "control: successful AND aligned → genuinely ready, no gap" }
   ]
 }
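The two-step floor described in `_floor` can be sketched directly against the dataset's own case shape. This is not the real `is_report_ready`, whose source is not shown here; `floor_ready` is a hypothetical helper that returns the dataset's short codes instead of the `_MISSING_*` strings. It also assumes that a missing analysis short-circuits the delta check, that ties between a record and a report count as not newer, and that a `report_id` with an empty `reports` list behaves like no prior report.

```python
from typing import Any


def floor_ready(case: dict[str, Any]) -> tuple[bool, list[str]]:
    """Hypothetical re-statement of the readiness floor on dataset-shaped cases."""
    # Step 1: only a record whose analyze_* task succeeded is substantive.
    substantive = [r for r in case["records"] if r["analysis"] == "success"]
    if not substantive:
        return False, ["analysis"]

    # Step 2: if a report exists, a substantive record must be newer than the
    # newest report (a smaller age_min means more recent).
    if case["report_id"] and case["reports"]:
        newest_report_age = min(rep["age_min"] for rep in case["reports"])
        if not any(r["age_min"] < newest_report_age for r in substantive):
            return False, ["delta"]

    return True, []


# Three cases copied from the new dataset (fields unused by the floor are omitted).
floor_03 = {"report_id": None, "records": [{"analysis": "none", "findings": 0, "age_min": 15}], "reports": []}
delta_03 = {
    "report_id": "rep-1",
    "records": [
        {"analysis": "success", "findings": 1, "age_min": 90},
        {"analysis": "success", "findings": 2, "age_min": 10},
    ],
    "reports": [{"age_min": 60}],
}
delta_05 = {
    "report_id": "rep-1",
    "records": [
        {"analysis": "success", "findings": 2, "age_min": 120},
        {"analysis": "failure", "findings": 3, "age_min": 5},
    ],
    "reports": [{"age_min": 60}],
}

for case in (floor_03, delta_03, delta_05):
    print(floor_ready(case))
```

Under these assumptions the sketch agrees with the `expected_ready` and `expected_missing` values of the three cases it copies, including `delta_05`, where the only record newer than the report is a failed retry.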

eval/readiness/run_eval.py CHANGED

@@ -35,7 +35,6 @@ from src.agents.gate import stub_analysis_state
 from src.agents.report.readiness import (
     _MISSING_ANALYSIS,
     _MISSING_DELTA,
-    _MISSING_PROBLEM,
     is_report_ready,
 )
@@ -45,9 +44,9 @@ RESULTS_DIR = _HERE / "results"
 GROUPS = ["floor", "delta", "edge", "alignment"]
 # Dataset short codes -> the exact `missing` strings is_report_ready emits. Imported
-# from the module so the dataset stays readable and survives wording changes.
 _CODE_TO_MISSING = {
-    "problem": _MISSING_PROBLEM,
     "analysis": _MISSING_ANALYSIS,
     "delta": _MISSING_DELTA,
 }
@@ -139,7 +138,9 @@ def _build_reports(specs: list[dict[str, Any]], now: datetime) -> list[_FakeRepo
 async def run_case(case: dict[str, Any]) -> CaseResult:
     now = datetime.now(UTC)
-    state = stub_analysis_state(problem_validated=bool(case["problem_validated"]))
     if case.get("report_id"):
         state = state.model_copy(update={"report_id": case["report_id"]})

 from src.agents.report.readiness import (
     _MISSING_ANALYSIS,
     _MISSING_DELTA,
     is_report_ready,
 )
 GROUPS = ["floor", "delta", "edge", "alignment"]
 # Dataset short codes -> the exact `missing` strings is_report_ready emits. Imported
+# from the module so the dataset stays readable and survives wording changes. The
+# `problem` code was retired with the problem_validated gate (KM-652, 2026-06-24).
 _CODE_TO_MISSING = {
     "analysis": _MISSING_ANALYSIS,
     "delta": _MISSING_DELTA,
 }
 async def run_case(case: dict[str, Any]) -> CaseResult:
     now = datetime.now(UTC)
+    # The problem_validated gate was removed (KM-652); readiness no longer reads the goal,
+    # so a bare stub state + report_id is all is_report_ready needs.
+    state = stub_analysis_state()
     if case.get("report_id"):
         state = state.model_copy(update={"report_id": case["report_id"]})

main.py CHANGED

@@ -23,7 +23,7 @@ from src.api.v1.tools import router as tools_router
 from src.api.v1.help import router as help_router  # pr/5 Phase 2: dedicated /tools/help
 from src.api.v2.chat import router as chat_v2_router  # pr/5 Phase 2: v2 chat pilot (analysis_id)
 from src.db.postgres.init_db import init_db
-import os
 import uvicorn
 # Configure logging
@@ -34,7 +34,7 @@ logger = get_logger("main")
 @asynccontextmanager
 async def lifespan(app: FastAPI):
     logger.info("Starting application...")
-    if os.getenv("SKIP_INIT_DB", "false").lower() != "true":
         await init_db()
         logger.info("Database initialized")
     else:

 from src.api.v1.help import router as help_router  # pr/5 Phase 2: dedicated /tools/help
 from src.api.v2.chat import router as chat_v2_router  # pr/5 Phase 2: v2 chat pilot (analysis_id)
 from src.db.postgres.init_db import init_db
+from src.config.settings import settings
 import uvicorn
 # Configure logging
 @asynccontextmanager
 async def lifespan(app: FastAPI):
     logger.info("Starting application...")
+    if not settings.skip_init_db:
         await init_db()
         logger.info("Database initialized")
     else:

src/agents/handlers/help.py CHANGED

@@ -29,6 +29,7 @@ SEAMS:
 from __future__ import annotations
 from collections.abc import AsyncIterator
 from dataclasses import dataclass, field
 from pathlib import Path
@@ -49,8 +50,80 @@ _PROMPT_DIR = Path(__file__).resolve().parent.parent.parent / "config" / "prompt
 _SYSTEM_PROMPT_PATH = _PROMPT_DIR / "help.md"
 _GUARDRAILS_PATH = _PROMPT_DIR / "guardrails.md"
-# Neutral human turn when Help is triggered by a slash command with no real content.
-_DEFAULT_TRIGGER = "What should I do next?"
 @dataclass
@@ -107,13 +180,20 @@ def _build_context_block(
     state: AnalysisState,
     report_ready: ReportReadiness,
     available_actions: list[str],
 ) -> str:
-    """Compose the deterministic context the prompt's 'never misguide' rule trusts."""
     return "\n\n".join(
         [
             _format_state(state),
             _format_report_ready(report_ready),
             "[Available actions]\n" + ", ".join(available_actions),
         ]
     )
@@ -178,17 +258,26 @@ class HelpAgent:
         """
         readiness = report_ready or ReportReadiness()
         actions = available_actions or _derive_available_actions(state, readiness)
         logger.info(
             "help guidance",
             report_ready=readiness.ready,
             available_actions=actions,
         )
         chain = self._ensure_chain()
         payload: dict[str, Any] = {
-            "message": message or _DEFAULT_TRIGGER,
             "history": history or [],
-            "context": _build_context_block(state, readiness, actions),
         }
         if callbacks:
             async for token in chain.astream(payload, config={"callbacks": callbacks}):

 from __future__ import annotations
+import re
 from collections.abc import AsyncIterator
 from dataclasses import dataclass, field
 from pathlib import Path
 _SYSTEM_PROMPT_PATH = _PROMPT_DIR / "help.md"
 _GUARDRAILS_PATH = _PROMPT_DIR / "guardrails.md"
+# Neutral human turn when Help is triggered by a slash command with no real content
+# (button path passes message=None). Per language, so the synthetic turn never drags the
+# reply toward English — without this the only human-turn signal on the button path would
+# be an English sentence, and the model mirrors the last human turn's language.
+_DEFAULT_TRIGGERS = {
+    "Indonesian": "Apa yang sebaiknya saya lakukan selanjutnya?",
+    "English": "What should I do next?",
+}
+_FALLBACK_LANGUAGE = "Indonesian"  # team default when no human turn exists yet
+# Lightweight, LLM-free language detection over the last human turn. The result is LOCKED
+# into the prompt via a `[Reply language]` directive (see `_build_context_block`), so
+# replying in the user's language is deterministic/mandatory — not a soft prompt hint that
+# an English system prompt + English default trigger can override.
+_ID_MARKERS = frozenset({
+    "yang", "dan", "apa", "gimana", "bagaimana", "kenapa", "mengapa", "aku", "saya",
+    "tolong", "ini", "itu", "nih", "dong", "kah", "untuk", "dengan", "pada", "adalah",
+    "tidak", "enggak", "nggak", "bisa", "mau", "buat", "dari", "kamu", "ya",
+    "berapa", "kapan", "siapa", "dimana", "juga", "sudah", "belum", "akan",
+})
+_EN_MARKERS = frozenset({
+    "the", "what", "how", "why", "please", "this", "that", "is", "are", "can", "could",
+    "should", "for", "with", "of", "and", "you", "do", "does", "when", "where",
+    "who", "which", "my", "me", "your", "have", "has", "want", "next",
+})
+def _last_human_text(history: list[BaseMessage] | None) -> str:
+    """Return the text of the most recent human turn in history, or '' if none."""
+    for msg in reversed(history or []):
+        if getattr(msg, "type", None) == "human":
+            content = msg.content
+            return content if isinstance(content, str) else str(content)
+    return ""
+def _score_language(text: str) -> str | None:
+    """Return "Indonesian"/"English" from marker-word counts, or None if no signal."""
+    tokens = re.findall(r"[a-z']+", text.lower())
+    id_hits = sum(1 for t in tokens if t in _ID_MARKERS)
+    en_hits = sum(1 for t in tokens if t in _EN_MARKERS)
+    if en_hits > id_hits:
+        return "English"
+    if id_hits > en_hits:
+        return "Indonesian"
+    return None
+def _detect_reply_language(
+    history: list[BaseMessage] | None,
+    message: str | None = None,
+    goal_texts: list[str] | None = None,
+) -> str:
+    """Detect the reply language deterministically (no LLM), by signal priority.
+    1. the user's turn — an explicit `message` (intent path) or the last human turn in
+       `history` (button path, where `message` is None);
+    2. the user-authored goal (`objective` + `business_questions`) — required at
+       onboarding, so it's always present and is a reliable signal on a fresh analysis
+       that has no chat yet;
+    3. the team default (Indonesian) — a safety net only, for a stub/legacy/empty-goal
+       state where nothing above yields a signal.
+    Returns "Indonesian" or "English".
+    """
+    primary = (message or _last_human_text(history)).strip()
+    lang = _score_language(primary) if primary else None
+    if lang:
+        return lang
+    goal = " ".join(t for t in (goal_texts or []) if t).strip()
+    lang = _score_language(goal) if goal else None
+    if lang:
+        return lang
+    return _FALLBACK_LANGUAGE
 @dataclass
     state: AnalysisState,
     report_ready: ReportReadiness,
     available_actions: list[str],
+    reply_language: str = _FALLBACK_LANGUAGE,
 ) -> str:
+    """Compose the deterministic context the prompt's 'never misguide' rule trusts.
+    `reply_language` is a hard directive: the prompt is told to reply ONLY in this
+    language, so the answer matches the user's language even on the button path (where
+    the synthetic human turn would otherwise pull the reply toward English).
+    """
     return "\n\n".join(
         [
             _format_state(state),
             _format_report_ready(report_ready),
             "[Available actions]\n" + ", ".join(available_actions),
+            f"[Reply language]\nRespond ONLY in: {reply_language}",
         ]
     )
         """
         readiness = report_ready or ReportReadiness()
         actions = available_actions or _derive_available_actions(state, readiness)
+        goal_texts = [
+            getattr(state, "objective", "") or "",
+            *(getattr(state, "business_questions", None) or []),
+        ]
+        reply_language = _detect_reply_language(history, message, goal_texts=goal_texts)
         logger.info(
             "help guidance",
             report_ready=readiness.ready,
             available_actions=actions,
+            reply_language=reply_language,
         )
         chain = self._ensure_chain()
+        default_trigger = _DEFAULT_TRIGGERS.get(
+            reply_language, _DEFAULT_TRIGGERS[_FALLBACK_LANGUAGE]
+        )
         payload: dict[str, Any] = {
+            "message": message or default_trigger,
             "history": history or [],
+            "context": _build_context_block(state, readiness, actions, reply_language),
         }
         if callbacks:
             async for token in chain.astream(payload, config={"callbacks": callbacks}):

src/agents/planner/inputs.py CHANGED

@@ -31,11 +31,24 @@ class ColumnSummary(BaseModel):
     top_values: list[Any] | None = None
 class TableSummary(BaseModel):
     table_id: str
     name: str
     row_count: int | None = None
     columns: list[ColumnSummary] = Field(default_factory=list)
 class StructuredSourceSummary(BaseModel):
@@ -89,6 +102,16 @@ class CatalogSummary(BaseModel):
                         )
                         for col in table.columns
                     ],
                 )
                 for table in source.tables
             ]
@@ -111,6 +134,12 @@ class CatalogSummary(BaseModel):
         lines: list[str] = []
         for source in self.structured_sources:
             lines.append(f"Source: {source.name} ({source.source_type}) — id={source.source_id}")
             for table in source.tables:
                 rc = f" ({table.row_count:,} rows)" if table.row_count is not None else ""
                 lines.append(f"  Table: {table.name}{rc} — id={table.table_id}")
@@ -121,6 +150,16 @@ class CatalogSummary(BaseModel):
                         f"    - {col.name} [{col.data_type}]: "
                         f"samples={samples}{top} — id={col.column_id}"
                     )
             lines.append("")
         if self.unstructured_sources:

     top_values: list[Any] | None = None
+class ForeignKeySummary(BaseModel):
+    """A declared FK edge — the only joins the IR validator accepts.
+    Maps directly onto a `retrieve_data` IR join: `column_id` → `left_column_id`,
+    `target_table_id` → `target_table_id`, `target_column_id` → `right_column_id`.
+    """
+    column_id: str
+    target_table_id: str
+    target_column_id: str
 class TableSummary(BaseModel):
     table_id: str
     name: str
     row_count: int | None = None
     columns: list[ColumnSummary] = Field(default_factory=list)
+    foreign_keys: list[ForeignKeySummary] = Field(default_factory=list)
 class StructuredSourceSummary(BaseModel):
                         )
                         for col in table.columns
                     ],
+                    # The declared FKs — the only joins the validator accepts. FKs
+                    # carry no PII (ids only), so they're always surfaced.
+                    foreign_keys=[
+                        ForeignKeySummary(
+                            column_id=fk.column_id,
+                            target_table_id=fk.target_table_id,
+                            target_column_id=fk.target_column_id,
+                        )
+                        for fk in table.foreign_keys
+                    ],
                 )
                 for table in source.tables
             ]
         lines: list[str] = []
         for source in self.structured_sources:
             lines.append(f"Source: {source.name} ({source.source_type}) — id={source.source_id}")
+            # Name lookups (within a source) so FK edges render with readable
+            # table/column names alongside the ids the IR join must copy verbatim.
+            table_name_by_id = {t.table_id: t.name for t in source.tables}
+            col_name_by_id = {
+                c.column_id: c.name for t in source.tables for c in t.columns
+            }
             for table in source.tables:
                 rc = f" ({table.row_count:,} rows)" if table.row_count is not None else ""
                 lines.append(f"  Table: {table.name}{rc} — id={table.table_id}")
                         f"    - {col.name} [{col.data_type}]: "
                         f"samples={samples}{top} — id={col.column_id}"
                     )
+                for fk in table.foreign_keys:
+                    tgt_table = table_name_by_id.get(fk.target_table_id, fk.target_table_id)
+                    tgt_col = col_name_by_id.get(fk.target_column_id, fk.target_column_id)
+                    src_col = col_name_by_id.get(fk.column_id, fk.column_id)
+                    lines.append(
+                        f"    FK: {src_col} → {tgt_table}.{tgt_col} "
+                        f"(join: target_table_id={fk.target_table_id}, "
+                        f"left_column_id={fk.column_id}, "
+                        f"right_column_id={fk.target_column_id})"
+                    )
             lines.append("")
         if self.unstructured_sources:
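
Reviewer note: the `FK:` line format added here carries both readable names and the three ids the IR join must copy verbatim. The sketch below reproduces that line format with a stand-in dataclass and then recovers the ids from the rendered text. The `FK` dataclass, `render_fk_line`, and especially `join_from_fk_line` are hypothetical helpers for illustration; in the PR the planner model copies the ids itself, so no parser like this exists in the diff.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class FK:
    """Stand-in for ForeignKeySummary (ids only)."""
    column_id: str
    target_table_id: str
    target_column_id: str


def render_fk_line(fk: FK, table_names: dict[str, str], col_names: dict[str, str]) -> str:
    """Render one FK edge; unknown ids fall back to the raw id, as in the diff."""
    src_col = col_names.get(fk.column_id, fk.column_id)
    tgt_table = table_names.get(fk.target_table_id, fk.target_table_id)
    tgt_col = col_names.get(fk.target_column_id, fk.target_column_id)
    return (
        f"    FK: {src_col} → {tgt_table}.{tgt_col} "
        f"(join: target_table_id={fk.target_table_id}, "
        f"left_column_id={fk.column_id}, "
        f"right_column_id={fk.target_column_id})"
    )


_JOIN_IDS = re.compile(
    r"\(join: target_table_id=(?P<target_table_id>[^,]+), "
    r"left_column_id=(?P<left_column_id>[^,]+), "
    r"right_column_id=(?P<right_column_id>[^)]+)\)"
)


def join_from_fk_line(line: str) -> dict[str, str]:
    """Hypothetical helper: pull the three join ids back out of a rendered line."""
    match = _JOIN_IDS.search(line)
    if match is None:
        raise ValueError("not an FK line")
    return match.groupdict()
```

The round trip shows why the ids are inlined: the readable `product_id → products.id` part is for the model's reasoning, while the parenthesised part is the only text that survives a literal id lookup.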

src/catalog/fk_inference.py ADDED

@@ -0,0 +1,97 @@
+"""Heuristic foreign-key inference for catalogs that ship no declared FKs.
+The dedorch catalog (written by Go's introspection) currently carries **no**
+`foreign_keys`, so the FK-backed-joins-only IR validator rejects every join the
+planner proposes — cross-table questions ("revenue by product") can't run even
+though the planner picks the right columns. Until Go captures real FK
+constraints, we infer the obvious relational edges from naming conventions so the
+planner and the validator agree on the same catalog.
+Conservative by design (a wrong edge would silently corrupt joined results):
+  - `schema` (database) sources only — joins are DB-only anyway
+  - a foreign key is only inferred from a column named ``<base>_id``
+  - the target must be the SINGLE other table whose name matches ``<base>``
+    (singular/plural) and exposes an ``id`` column of the SAME data_type
+  - ambiguous matches (0 or >1 candidate tables) are skipped, never guessed
+  - sources that already declare ANY foreign key are left untouched (trust Go)
+"""
+from __future__ import annotations
+import re
+from src.catalog.models import ForeignKey, Source
+from src.middlewares.logging import get_logger
+from .models import Catalog
+logger = get_logger("fk_inference")
+# `<base>_id` — the conventional foreign-key column name (base must be non-empty).
+_ID_COL = re.compile(r"^(?P<base>.+)_id$", re.IGNORECASE)
+def _table_matches_base(table_name: str, base: str) -> bool:
+    """Whether `table_name` is the table `<base>` refers to (singular/plural)."""
+    n = table_name.lower()
+    b = base.lower()
+    # `orders`↔`order`, `products`↔`product`, `sales_agents`↔`agent` (suffix),
+    # plus the singular form and the `-es` plural.
+    return n == b or n == b + "es" or n.endswith(b + "s")
+def _infer_source(source: Source) -> int:
+    """Add inferred FK edges to one source's tables in place; return the count."""
+    added = 0
+    for table in source.tables:
+        for col in table.columns:
+            m = _ID_COL.match(col.name)
+            if not m:
+                continue
+            base = m.group("base")
+            candidates: list[tuple[str, str]] = []  # (target_table_id, target_column_id)
+            for tgt in source.tables:
+                if tgt.table_id == table.table_id:
+                    continue
+                if not _table_matches_base(tgt.name, base):
+                    continue
+                id_col = next(
+                    (
+                        c
+                        for c in tgt.columns
+                        if c.name.lower() == "id" and c.data_type == col.data_type
+                    ),
+                    None,
+                )
+                if id_col is not None:
+                    candidates.append((tgt.table_id, id_col.column_id))
+            # Only act on an unambiguous single match — never guess between many.
+            if len(candidates) != 1:
+                continue
+            target_table_id, target_column_id = candidates[0]
+            table.foreign_keys.append(
+                ForeignKey(
+                    column_id=col.column_id,
+                    target_table_id=target_table_id,
+                    target_column_id=target_column_id,
+                )
+            )
+            added += 1
+    return added
+def infer_foreign_keys(catalog: Catalog) -> Catalog:
+    """Infer FK edges in place for schema sources that declare none. Returns `catalog`.
+    Sources that already carry any declared FK are left as-is (Go's real FKs win).
+    """
+    total = 0
+    for source in catalog.sources:
+        if source.source_type != "schema":
+            continue
+        if any(t.foreign_keys for t in source.tables):
+            continue  # real FKs present — trust them, infer nothing
+        total += _infer_source(source)
+    if total:
+        logger.info("inferred foreign keys", user_id=catalog.user_id, count=total)
+    return catalog
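
Reviewer note: the conservative matching rules in this module are easiest to check against a small fixture. The sketch below mirrors the `<base>_id` rule with stand-in dataclasses in place of the real `src.catalog.models` types. The `Column` and `Table` classes, the tuple-shaped foreign keys, and the sample ids are assumptions for illustration.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Column:
    column_id: str
    name: str
    data_type: str


@dataclass
class Table:
    table_id: str
    name: str
    columns: list[Column]
    # (column_id, target_table_id, target_column_id); a stand-in for ForeignKey.
    foreign_keys: list[tuple[str, str, str]] = field(default_factory=list)


ID_COL = re.compile(r"^(?P<base>.+)_id$", re.IGNORECASE)


def table_matches_base(table_name: str, base: str) -> bool:
    n, b = table_name.lower(), base.lower()
    return n == b or n == b + "es" or n.endswith(b + "s")


def infer(tables: list[Table]) -> int:
    """Append an FK only when exactly one other table matches by name and id type."""
    added = 0
    for table in tables:
        for col in table.columns:
            m = ID_COL.match(col.name)
            if not m:
                continue
            candidates = []
            for tgt in tables:
                if tgt.table_id == table.table_id:
                    continue
                if not table_matches_base(tgt.name, m.group("base")):
                    continue
                id_col = next(
                    (c for c in tgt.columns
                     if c.name.lower() == "id" and c.data_type == col.data_type),
                    None,
                )
                if id_col is not None:
                    candidates.append((tgt.table_id, id_col.column_id))
            if len(candidates) != 1:
                continue  # zero or several matches: skip rather than guess
            table.foreign_keys.append((col.column_id, *candidates[0]))
            added += 1
    return added
```

In the fixture used to check this sketch, `orders.agent_id` is skipped because both `agents` and `sales_agents` match by suffix, and `orders.customer_id` is skipped because its `text` type differs from the integer `customers.id`.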

src/catalog/render.py CHANGED

@@ -65,5 +65,11 @@ def render_source(source: Source) -> str:
                 tgt_col_name = col_names_by_id.get(fk.target_table_id, {}).get(
                     fk.target_column_id, fk.target_column_id
                 )
-                lines.append(f"    - {src_col_name} -> {tgt_table_name}.{tgt_col_name}")
     return "\n".join(lines)

                 tgt_col_name = col_names_by_id.get(fk.target_table_id, {}).get(
                     fk.target_column_id, fk.target_column_id
                 )
+                # Include the join ids inline — the planner must copy these verbatim
+                # into the IR join, and the IRValidator does a literal id lookup.
+                lines.append(
+                    f"    - {src_col_name} -> {tgt_table_name}.{tgt_col_name} "
+                    f"(join: target_table_id={fk.target_table_id}, "
+                    f"left_column_id={fk.column_id}, right_column_id={fk.target_column_id})"
+                )
     return "\n".join(lines)

src/catalog/store.py CHANGED

@@ -1,7 +1,9 @@
-"""CatalogStore — persists per-user catalogs as Postgres jsonb rows.
-Storage shape: one row per user in a `catalogs` table with columns
-(user_id PK, data jsonb, schema_version, generated_at, updated_at).
 """
 from sqlalchemy import case, delete, func, select
@@ -11,6 +13,7 @@ from src.db.postgres.connection import AsyncSessionLocal
 from src.db.postgres.models import Catalog as CatalogRow
 from src.middlewares.logging import get_logger
 from .models import Catalog
 logger = get_logger("catalog_store")
@@ -27,30 +30,43 @@ class CatalogStore:
     async def get(self, user_id: str) -> Catalog | None:
         async with AsyncSessionLocal() as session:
             result = await session.execute(
-                select(CatalogRow.data).where(CatalogRow.user_id == user_id)
             )
             row = result.scalar_one_or_none()
         if row is None:
             return None
-        return Catalog.model_validate(row)
     async def upsert(self, catalog: Catalog) -> None:
         payload = catalog.model_dump(mode="json")
         async with AsyncSessionLocal() as session:
             stmt = insert(CatalogRow).values(
                 user_id=catalog.user_id,
-                data=payload,
                 schema_version=catalog.schema_version,
                 generated_at=catalog.generated_at,
                 updated_at=func.now(),
             )
             stmt = stmt.on_conflict_do_update(
                 index_elements=[CatalogRow.user_id],
                 set_={
-                    "data": stmt.excluded.data,
                     "schema_version": stmt.excluded.schema_version,
                     "updated_at": case(
-                        (stmt.excluded.data != CatalogRow.data, func.now()),
                         else_=CatalogRow.updated_at,
                     ),
                 },

+"""CatalogStore — reads the per-user catalog from the dedorch `data_catalog` table.
+Storage shape (Go-owned): one row per scope in `data_catalog`
+(id, scope_type, user_id, analysis_id, catalog_payload jsonb, schema_version,
+generated_at, updated_at). Python reads the user-scoped row (scope_type='user');
+Go's `catalog.Service` owns all writes, so `upsert`/`remove_source` are legacy.
 """
 from sqlalchemy import case, delete, func, select
 from src.db.postgres.models import Catalog as CatalogRow
 from src.middlewares.logging import get_logger
+from .fk_inference import infer_foreign_keys
 from .models import Catalog
 logger = get_logger("catalog_store")
     async def get(self, user_id: str) -> Catalog | None:
         async with AsyncSessionLocal() as session:
             result = await session.execute(
+                select(CatalogRow.catalog_payload).where(
+                    CatalogRow.user_id == user_id,
+                    CatalogRow.scope_type == "user",
+                )
             )
             row = result.scalar_one_or_none()
         if row is None:
             return None
+        # dedorch catalogs ship no foreign_keys (Go introspection drops them),
+        # but the IR validator only allows FK-backed joins. Infer the obvious
+        # edges so the planner and validator agree. No-op once Go emits real FKs.
+        return infer_foreign_keys(Catalog.model_validate(row))
     async def upsert(self, catalog: Catalog) -> None:
+        # Legacy: Go's catalog.Service owns catalog writes now. Kept working (and
+        # reconciled to the dedorch shape) but no longer on any live Python path.
         payload = catalog.model_dump(mode="json")
         async with AsyncSessionLocal() as session:
             stmt = insert(CatalogRow).values(
+                scope_type="user",
                 user_id=catalog.user_id,
+                catalog_payload=payload,
                 schema_version=catalog.schema_version,
                 generated_at=catalog.generated_at,
                 updated_at=func.now(),
             )
             stmt = stmt.on_conflict_do_update(
                 index_elements=[CatalogRow.user_id],
+                index_where=CatalogRow.scope_type == "user",
                 set_={
+                    "catalog_payload": stmt.excluded.catalog_payload,
                     "schema_version": stmt.excluded.schema_version,
                     "updated_at": case(
+                        (
+                            stmt.excluded.catalog_payload != CatalogRow.catalog_payload,
+                            func.now(),
+                        ),
                         else_=CatalogRow.updated_at,
                     ),
                 },

src/config/prompts/help.md CHANGED

@@ -1,8 +1,14 @@
-<!-- help.md · v2 · Help skill prompt. v2 (2026-06-24, KM-652): removed the problem_statement
-     skill + the problem_validated gate — the goal (objective + business_questions) is now set
-     in the New Analysis form at onboarding, so Help no longer steers users to define/validate a
-     goal in chat. Bump to v3 (don't silently overwrite) on the next major change (e.g. real UI
-     steps from the frontend). -->
 You are the **Help guide** for an AI data-analysis assistant. Think of yourself as the
 instruction sheet that comes with a board game: your only job is to tell the user
@@ -23,6 +29,7 @@ You are given context, never raw user prose to analyze:
   - `ready` (bool) — whether there is enough analysis to generate a report.
   - `missing` (list) — if not ready, the gaps to fill.
 - **`available_actions`** *(optional)* — which actions are actually wired right now. If present, **only suggest actions listed here.**
 > **Hard rule — never misguide.** Trust the signals above for *what is possible*, not your
 > own guess. If `report_ready.ready` is `false`, do **not** tell the user to generate a
@@ -72,8 +79,13 @@ Do not over-promise the report's depth.
 ## Tone
 Plain, warm, and encouraging — like a helpful guide, **not** a hype trailer. No exclamation
-spam, no overselling. Respond in the **user's language** (match `chat_history` — Indonesian or
-English). A few sentences is usually enough.
 ## Constraints
@@ -86,15 +98,21 @@ English). A few sentences is usually enough.
 ## Examples
 ```
-State: chat_history nearly empty
 → "Your goal is set — you can start exploring now. Try a basic question first, like
    'Which products sell the most?' or 'How have monthly sales trended?', then we can dig into
    what's driving your objective."
-State: report_ready.ready=false, missing=["no comparison over time"]
-→ "Good progress. Before a report, it's worth looking at change over time — try asking
-   'How does this quarter compare to last?' Once we have that, we can put the report together."
 State: report_ready.ready=true
 → "You've covered enough to summarize. You can generate your report now — run /report
    or use the report option to create it."

+<!-- help.md · v3 · Help skill prompt.
+     v2 (2026-06-24, KM-652): removed the problem_statement skill + the problem_validated gate —
+     the goal (objective + business_questions) is now set in the New Analysis form at onboarding,
+     so Help no longer steers users to define/validate a goal in chat.
+     v3 (2026-07-02): (a) reply language is now a hard rule driven by the [Reply language]
+     directive (the button path was defaulting to English); (b) Examples got stable ids
+     ("id: ..." comment above each) so eval/help can mirror them as carried_over regression
+     cases, and the second example now uses a REAL `missing` value from report/readiness.py —
+     the old "no comparison over time" string is never emitted by is_report_ready.
+     Bump to v4 (don't silently overwrite) on the next major change (e.g. real UI steps from
+     the frontend). -->
 You are the **Help guide** for an AI data-analysis assistant. Think of yourself as the
 instruction sheet that comes with a board game: your only job is to tell the user
   - `ready` (bool) — whether there is enough analysis to generate a report.
   - `missing` (list) — if not ready, the gaps to fill.
 - **`available_actions`** *(optional)* — which actions are actually wired right now. If present, **only suggest actions listed here.**
+- **`[Reply language]`** — the language you MUST reply in (detected deterministically from the user's last turn). This is an instruction, not a suggestion — see the hard rule below.
 > **Hard rule — never misguide.** Trust the signals above for *what is possible*, not your
 > own guess. If `report_ready.ready` is `false`, do **not** tell the user to generate a
 ## Tone
 Plain, warm, and encouraging — like a helpful guide, **not** a hype trailer. No exclamation
+spam, no overselling. A few sentences is usually enough.
+> **Hard rule — reply language.** Reply **only** in the language named in `[Reply language]`.
+> This is mandatory and overrides the language of this prompt, its examples, and the trigger
+> question. If `[Reply language]` says `Indonesian`, answer entirely in Indonesian even though
+> these instructions are in English; if it says `English`, answer in English. Never mix
+> languages or switch mid-reply.
 ## Constraints
 ## Examples
 ```
+<!-- id: help_ex_orient -->
+State: objective="understand monthly sales performance",
+       business_questions=["which products drive revenue?"],
+       chat_history empty, report_ready.ready=false, missing=["at least one completed analysis"]
 → "Your goal is set — you can start exploring now. Try a basic question first, like
    'Which products sell the most?' or 'How have monthly sales trended?', then we can dig into
    what's driving your objective."
+<!-- id: help_ex_guard_delta -->
+State: report_ready.ready=false, missing=["a new analysis since the last report"]
+→ "You already have a report, and nothing new has come in since. Ask something that builds
+   on your objective — a fresh cut, a new time period, or a different angle — and we can
+   regenerate the report with that."
+<!-- id: help_ex_guard_ready -->
 State: report_ready.ready=true
 → "You've covered enough to summarize. You can generate your report now — run /report
    or use the report option to create it."

src/config/prompts/planner.md CHANGED

@@ -41,15 +41,20 @@ only a `TaskList` object that conforms to the provided schema.
   (referencing the upstream result's column aliases).
 - **Measure by a dimension in another table (joins).** When the number you are
   aggregating and the grouping dimension live in DIFFERENT tables of the same
-  database source, add a `joins` entry to the `retrieve_data` IR along a foreign
-  key declared in the catalog — do NOT pick a table that lacks the measure, and do
-  NOT try to "combine" unrelated tables. Example — "revenue by category": the
-  measure `order_items.line_total` joined to `products` on
-  `order_items.product_id = products.id`, grouped by `products.category`. Prefer an
-  existing measure column over recomputing; use a single table (no join) when the
-  measure and dimension already live together (e.g. "revenue by region" from
-  `orders.region` + `orders.total_amount`). Joins are database-only — not available
-  for tabular/file sources.
 - **Mixing structured + unstructured.** If qualitative context helps, add a
   `retrieve_knowledge` task against an unstructured source listed in the catalog.
 - **CRISP-DM stages.** Tag each task with the stage it serves:

   (referencing the upstream result's column aliases).
 - **Measure by a dimension in another table (joins).** When the number you are
   aggregating and the grouping dimension live in DIFFERENT tables of the same
+  database source, add a `joins` entry to the `retrieve_data` IR. **Join ONLY on a
+  foreign key listed in the catalog.** Each joinable relationship appears as an
+  `FK:` line under its table, e.g.
+  `FK: product_id → products.id (join: target_table_id=t_products, left_column_id=c_oi_product_id, right_column_id=c_products_id)`
+  — copy those three ids verbatim into the join (`target_table_id`,
+  `left_column_id`, `right_column_id`). Example — "revenue by category": the measure
+  `order_items.line_total` joined to `products` on `order_items.product_id =
+  products.id`, grouped by `products.category`. **If no `FK:` line links the tables
+  you need, do NOT invent a join** — the validator rejects any join that isn't a
+  declared FK. Instead use a single table when the measure and dimension already
+  live together (e.g. "revenue by region" from `orders.region` +
+  `orders.total_amount`); if they genuinely aren't linked, say the data isn't
+  connected rather than guessing. Prefer an existing measure column over
+  recomputing. Joins are database-only — not available for tabular/file sources.
 - **Mixing structured + unstructured.** If qualitative context helps, add a
   `retrieve_knowledge` task against an unstructured source listed in the catalog.
 - **CRISP-DM stages.** Tag each task with the stage it serves:
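
The join rule in the updated planner prompt (join only on a declared `FK:` line and copy its three ids verbatim) can be illustrated with a small check. This is a minimal sketch, not the repository's validator: the `DeclaredFk` class and `is_declared_join` function are hypothetical names, and the ids are the ones from the prompt's own example.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class DeclaredFk:
    """One `FK:` line from the rendered catalog (hypothetical shape)."""
    target_table_id: str
    left_column_id: str
    right_column_id: str


def is_declared_join(join: dict, declared: Iterable[DeclaredFk]) -> bool:
    """Return True only if the join's three ids match a declared FK exactly."""
    candidate = DeclaredFk(
        target_table_id=join.get("target_table_id", ""),
        left_column_id=join.get("left_column_id", ""),
        right_column_id=join.get("right_column_id", ""),
    )
    return candidate in set(declared)


# Ids copied from the example FK line in the prompt.
catalog_fks = [DeclaredFk("t_products", "c_oi_product_id", "c_products_id")]

copied_join = {
    "target_table_id": "t_products",
    "left_column_id": "c_oi_product_id",
    "right_column_id": "c_products_id",
}
# An invented join: the column id does not appear on any FK line.
invented_join = {
    "target_table_id": "t_products",
    "left_column_id": "c_oi_sku",
    "right_column_id": "c_products_id",
}

copied_ok = is_declared_join(copied_join, catalog_fks)
invented_ok = is_declared_join(invented_join, catalog_fks)
```

Exact-match comparison is the reason the prompt asks the planner to copy the ids rather than reconstruct them from table or column names.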

src/config/settings.py CHANGED

@@ -30,6 +30,12 @@ class Settings(BaseSettings):
     # to avoid .env churn; remove once no environment references it.
     enable_gate: bool = Field(alias="enable_gate", default=False)
     # Database
     postgres_connstring: str

     # to avoid .env churn; remove once no environment references it.
     enable_gate: bool = Field(alias="enable_gate", default=False)
+    # Skip init_db() (create_all + startup DDL) on boot. TRUE by default post-dedorch
+    # cutover: Go owns the dedorch schema, so Python (consumer-only role) must NOT run
+    # init_db — its ALTER/index DDL on Go-owned tables fails with InsufficientPrivilege
+    # ("must be owner of table rooms"). Set to false only for a local Python-owned DB.
+    skip_init_db: bool = Field(alias="SKIP_INIT_DB", default=True)
     # Database
     postgres_connstring: str
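
The effect of the new `skip_init_db` default can be sketched without the real `Settings` class. The code below is an assumption-laden stand-in: `read_skip_init_db` and `startup` are hypothetical helpers, and the truthy-string parsing is a simplification of what the settings library does for boolean fields. Only the `SKIP_INIT_DB` name and the default of `True` come from the diff.

```python
import os
from typing import Callable, Mapping, Optional

# Simplified truthy set; the real settings library accepts a similar list.
_TRUTHY = {"1", "true", "yes", "on"}


def read_skip_init_db(env: Optional[Mapping[str, str]] = None) -> bool:
    """Read SKIP_INIT_DB, defaulting to True as in the diff."""
    env = os.environ if env is None else env
    raw = env.get("SKIP_INIT_DB")
    if raw is None:
        return True
    return raw.strip().lower() in _TRUTHY


def startup(env: Mapping[str, str], init_db: Callable[[], None]) -> bool:
    """Run init_db only when skipping is explicitly disabled.

    Returns True if init_db ran.
    """
    if read_skip_init_db(env):
        return False
    init_db()
    return True
```

With no variable set, `startup` leaves the Go-owned schema untouched. Setting `SKIP_INIT_DB=false` restores the old behaviour for a local Python-owned database.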

src/db/postgres/models.py CHANGED

@@ -6,9 +6,11 @@ from sqlalchemy import (
     Column,
     DateTime,
     ForeignKey,
     Integer,
     String,
     Text,
 )
 from sqlalchemy.dialects.postgresql import JSONB, UUID
 from sqlalchemy.orm import relationship
@@ -108,23 +110,44 @@ class DatabaseClient(Base):
 class Catalog(Base):
-    """Per-user data catalog stored as a single jsonb row.
-    `data` holds the full Pydantic Catalog (src/catalog/models.py:Catalog)
-    serialized via `model_dump(mode="json")`. Read path uses
-    `Catalog.model_validate(...)` to rehydrate.
-    Dedicated table — kept separate from `langchain_pg_embedding` so unstructured
-    embeddings and structured-catalog metadata never share storage.
     """
     __tablename__ = "data_catalog"
-    user_id = Column(String, primary_key=True)
-    data = Column(JSONB, nullable=False)
     schema_version = Column(String, nullable=False, default="1.0")
-    generated_at = Column(DateTime(timezone=True), server_default=func.now())
     updated_at = Column(DateTime(timezone=True), onupdate=func.now())
 class ReportInputRow(Base):
     """One row per completed slow-path analysis (the report's source of truth).

     Column,
     DateTime,
     ForeignKey,
+    Index,
     Integer,
     String,
     Text,
+    text,
 )
 from sqlalchemy.dialects.postgresql import JSONB, UUID
 from sqlalchemy.orm import relationship
 class Catalog(Base):
+    """Data catalog — dedorch **`data_catalog`** (Go-owned; reconciled 2026-07-01).
+    Mirrors Go migration `0001`/`0002`. One jsonb `catalog_payload` per scope:
+    `scope_type='user'` rows are keyed by `user_id` (partial unique index),
+    `scope_type='analysis'` rows by `analysis_id`. Python is **consumer-only** —
+    Go's `catalog.Service` owns all writes (DB/file ingestion); `CatalogStore`
+    reads the user-scoped catalog and its write methods are legacy.
+    `catalog_payload` holds the full Pydantic Catalog (src/catalog/models.py:Catalog)
+    serialized via `model_dump(mode="json")`; the read path rehydrates with
+    `Catalog.model_validate(...)`. Go writes the same shape (json tags match).
     """
     __tablename__ = "data_catalog"
+    id = Column(UUID(as_uuid=False), primary_key=True, default=lambda: str(uuid4()))
+    scope_type = Column(String, nullable=False, default="user")  # 'user' | 'analysis'
+    user_id = Column(String, nullable=False, index=True)
+    analysis_id = Column(UUID(as_uuid=False), nullable=True)
+    catalog_payload = Column(JSONB, nullable=False)
     schema_version = Column(String, nullable=False, default="1.0")
+    generated_at = Column(DateTime(timezone=True), nullable=False, server_default=func.now())
     updated_at = Column(DateTime(timezone=True), onupdate=func.now())
+    __table_args__ = (
+        Index(
+            "idx_data_catalog_user_scope",
+            "user_id",
+            unique=True,
+            postgresql_where=text("scope_type = 'user'"),
+        ),
+        Index(
+            "idx_data_catalog_analysis_scope",
+            "analysis_id",
+            unique=True,
+            postgresql_where=text("scope_type = 'analysis'"),
+        ),
+    )
 class ReportInputRow(Base):
     """One row per completed slow-path analysis (the report's source of truth).

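The two partial unique indexes on `data_catalog` allow one `scope_type='user'` row per `user_id` and one `scope_type='analysis'` row per `analysis_id`. The sketch below shows that behaviour using the standard-library `sqlite3` module as a stand-in for Postgres (SQLite also supports partial indexes). The column types are simplified and the `insert_row` helper is hypothetical. Only the table, column, and index names come from the diff.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE data_catalog (
        id TEXT PRIMARY KEY,
        scope_type TEXT NOT NULL DEFAULT 'user',
        user_id TEXT NOT NULL,
        analysis_id TEXT,
        catalog_payload TEXT NOT NULL
    );
    CREATE UNIQUE INDEX idx_data_catalog_user_scope
        ON data_catalog (user_id) WHERE scope_type = 'user';
    CREATE UNIQUE INDEX idx_data_catalog_analysis_scope
        ON data_catalog (analysis_id) WHERE scope_type = 'analysis';
    """
)


def insert_row(row_id, scope_type, user_id, analysis_id=None):
    """Insert one catalog row with an empty JSON payload."""
    conn.execute(
        "INSERT INTO data_catalog "
        "(id, scope_type, user_id, analysis_id, catalog_payload) "
        "VALUES (?, ?, ?, ?, ?)",
        (row_id, scope_type, user_id, analysis_id, "{}"),
    )


insert_row("r1", "user", "u1")             # the user's single catalog row
insert_row("r2", "analysis", "u1", "a1")   # analysis rows may share a user_id
insert_row("r3", "analysis", "u1", "a2")

try:
    insert_row("r4", "user", "u1")         # second user-scoped row for u1
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM data_catalog").fetchone()[0]
```

Because the uniqueness on `user_id` applies only where `scope_type = 'user'`, a reader such as `CatalogStore` can filter on both columns and expect at most one row.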
src/query/executor/db.py CHANGED

@@ -121,7 +121,9 @@ class DbExecutor(BaseExecutor):
             logger.error(
                 "db executor failed",
                 source_id=ir.source_id,
-                error=str(e),
                 elapsed_ms=elapsed_ms,
             )
             return QueryResult(
@@ -235,7 +237,9 @@ class DbExecutor(BaseExecutor):
                 creds = decrypt_credentials_dict(client.credentials)
                 await asyncio.to_thread(cls._warm_sync, client_id, client.db_type, creds)
             except Exception as exc:  # noqa: BLE001 — best-effort warming
-                logger.info("prewarm skipped", source_id=source.source_id, error=str(exc))
     @staticmethod
     def _warm_sync(client_id: str, db_type: str, creds: dict) -> None:

             logger.error(
                 "db executor failed",
                 source_id=ir.source_id,
+                # repr, not str: some exceptions (e.g. Fernet InvalidToken) have an
+                # empty str(), which hides the real failure as error="".
+                error=repr(e),
                 elapsed_ms=elapsed_ms,
             )
             return QueryResult(
                 creds = decrypt_credentials_dict(client.credentials)
                 await asyncio.to_thread(cls._warm_sync, client_id, client.db_type, creds)
             except Exception as exc:  # noqa: BLE001 — best-effort warming
+                # repr, not str: empty-str exceptions (e.g. Fernet InvalidToken)
+                # would otherwise log as error="".
+                logger.info("prewarm skipped", source_id=source.source_id, error=repr(exc))
     @staticmethod
     def _warm_sync(client_id: str, db_type: str, creds: dict) -> None:
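
The switch from `str(e)` to `repr(e)` in both log calls matters for exceptions raised without a message. The following minimal sketch shows the difference. The `InvalidToken` class here is a local stand-in defined for illustration, not the one from the cryptography library, and `describe` is a hypothetical helper.

```python
class InvalidToken(Exception):
    """Local stand-in for an exception that is raised with no arguments."""


def describe(exc: BaseException) -> dict:
    """Return both renderings so they can be compared side by side."""
    return {"str": str(exc), "repr": repr(exc)}


# An exception with no args: str() is empty, repr() still names the type.
no_message = describe(InvalidToken())

# An exception with a message: repr() keeps the message and adds the type.
with_message = describe(ValueError("bad dsn"))
```

In the first case a log line built from `str()` would carry `error=""`, which is the failure mode the new code comments describe.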