/fix check and help tool

#8
by rhbt6767 - opened
REPO_STATUS.md CHANGED
@@ -2,7 +2,7 @@
2
 
3
  **Audience:** teammates onboarding onto the Python repo (`Agentic-Service-Data-Eyond-Catalog`).
4
  **Scope:** what the code does **right now** (branch `pr/4`, ticket KM-652). Describes current state only — no roadmap or to-dos.
5
- **Snapshot date:** 2026-06-25. **Cross-repo update 2026-06-29:** §2/§8/§11/§12 re-verified against
6
  the **Go source** (`Orchestrator-Agent-Service`), not its docs. The Go service has moved well past its
7
  own (uncommitted, stale) design docs: it now hosts the **dedorch SQL migrations** in-repo and a full
8
  **`/api/v1/analyses` + `/api/v1/skills`** REST surface. Go does **not** call Python yet — those skills
@@ -178,7 +178,7 @@ unless `SKIP_INIT_DB=true`.
178
  |---|---|---|---|
179
  | `users`, `rooms`, `chat_messages`, `message_sources` | base app | chat endpoint, Go | chat history |
180
  | `documents`, `databases` | uploads + DB creds (Fernet-encrypted) | Go ingestion | executor cred resolution |
181
- | `data_catalog` | per-user jsonb `Catalog` (Source → Table → Column) | Go ingestion / Python pipeline | CatalogReader, planner, tools |
182
  | `langchain_pg_embedding` | PGVector document chunks | Go ingestion | DocumentRetriever |
183
  | `report_inputs` *(was `analysis_records`)* | jsonb `AnalysisRecord`, one per slow-path run; **Python-owned** | slow path | ReportGenerator, report readiness |
184
  | `analyses` *(dedorch, plural)* | uuid `id`, `user_id`, `analysis_title`, `objective`, `business_questions` jsonb, `status` (active\|inactive), `data_bind`(+`data_bind_version`), `report_id`, `report_collection` — **defined by Go migrations**; `problem_statement`/`problem_validated`/`owner_id` already **dropped** there (`0003`/`0004`) | Go `/api/v1/analyses`; Python state store | gate (no-op), Help, report |
@@ -186,16 +186,21 @@ unless `SKIP_INIT_DB=true`.
186
  | `data_sources` *(dedorch)* | per-analysis binding; `reference_id` = catalog source_id; `type ∈ document\|database` | Go `/analyses/{id}/data-bind` (+ Python `/analysis/create`) | structured-flow scoping, report appendix |
187
  | `analyses_messages` *(dedorch)* | the analysis chat room (`role ∈ user\|ai`); replaces deprecated `rooms`/`chat_messages` | Go `/analyses/{id}/messages` | Python chat path **not yet migrated here** (§12) |
188
 
189
- > ⚠️ **Python ORM ↔ dedorch drift (verified 2026-06-29).** Python's `AnalysisStateRow` + `state_store.py`
190
- > still model **`problem_statement` / `problem_validated`** and do **not** carry `objective` /
191
- > `business_questions`, but the Go migrations have already dropped the former and added the latter.
192
- > Pre-cutover this is harmless (Python runs `create_all` on its own copy); **post-`SKIP_INIT_DB`**, when
193
- > Python reads dedorch directly, ORM column selection on the dropped columns will break. Reconcile the
194
- > Python model before the connection-string cutover.
195
 
196
  **Catalog shape** (the jsonb in `data_catalog`):
197
  `Catalog → Source[ {source_id, source_type ∈ schema|tabular|unstructured, name, location_ref} → Table[ {table_id, name, row_count, foreign_keys[]} → Column[ {column_id, name, data_type, nullable, pii_flag, sample_values|null, stats} ] ] ]`. PII columns have `sample_values: null` so real values never enter prompts.
198
 
 
 
199
  **QueryIR shape** (`src/query/ir/models.py`):
200
  `{ source_id, table_id, joins[], select[], filters[], group_by[], order_by[], limit }`.
201
  Joins are single-level equi-joins to a related table **in the same source**, FK-backed,
@@ -286,7 +291,7 @@ only.
286
  |---|---|---|---|
287
  | `ENABLE_SLOW_PATH` | `settings.enable_slow_path` | **off** | Route `structured_flow` through Planner/TaskRunner/Assembler (vs single-query `QueryService`). Records persist only on the slow path → reports require this on. |
288
  | `ENABLE_GATE` | `settings.enable_gate` | **off** | **Deprecated 2026-06-25** — gate neutered; the flag has no effect. Kept to avoid `.env` churn. |
289
- | `SKIP_INIT_DB` | env, `main.py` | off | Skip `create_all` on startup — the dedorch cutover switch (Go owns dedorch migrations). |
290
  | `enable_tracing` | hardcoded `True` in `chat.py` | on (endpoint) | Langfuse tracing. |
291
 
292
  ---
@@ -309,8 +314,8 @@ copies disagree with the current code on:
309
 
310
  ## 12. dedorch migration — current state
311
 
312
- The Python DB is moving from `dataeyond` → **dedorch** (Go owns dedorch migrations; Python is
313
- consumer-only). State **re-verified against the Go source 2026-06-29**:
314
 
315
  - **The dedorch migrations now live IN the Go repo** — embedded SQL at
316
  `internal/repository/postgres/migrations/0001_create_core_schema.sql … 0004_replace_chat_with_analysis_scope.sql`,
@@ -325,8 +330,15 @@ consumer-only). State **re-verified against the Go source 2026-06-29**:
325
  `rooms`/`chat_messages`/`interview_*` tables to `zdeprecated_*`.
326
  - **`report_inputs`** (the slow-path structured output, formerly `analysis_records`) stays
327
  **Python-owned**; its finalized schema goes to Harry so the dedorch migration creates it post-cutover.
328
- - The connection-string cutover (paired with `SKIP_INIT_DB`) **has not happened yet**; Python still
329
- runs `create_all` on its own models until then.
330
 
331
  **⚠️ Integration gap (verified — the big one).** Go's `/api/v1/analyses` and `/api/v1/skills`
332
  (`help` / `report`) are **placeholders that return dummy data** — the `SendMessage` / `GenerateReport`
@@ -348,6 +360,13 @@ records-based report; floor: ≥1 `analyze_*` success). Wiring Go → Python is
348
  values are always parameterized.
349
  - **Settings aliases:** `.env` uses double-underscore names (`azureai__api_key__4o`); `Settings`
350
  exposes them as `azureai_api_key_4o`.
351
  - **Never-throw seams** are pervasive (tool invoker, query service, executors, state/binding reads,
352
  record persistence, report summary). Failures degrade into soft output rather than raising — good
353
  for UX, but they can mask real breakage (e.g. a binding silently fail-opening to the full catalog).
 
2
 
3
  **Audience:** teammates onboarding onto the Python repo (`Agentic-Service-Data-Eyond-Catalog`).
4
  **Scope:** what the code does **right now** (branch `pr/4`, ticket KM-652). Describes current state only — no roadmap or to-dos.
5
+ **Snapshot date:** 2026-06-25. **Data-layer reconcile 2026-07-01:** §8/§12 updated — dedorch cutover done, `data_catalog` model reconciled. **Query-path fix 2026-07-02:** §8/§13 — dedorch catalogs ship no FKs → Python infers them (`fk_inference.py`); shared-Fernet-key gotcha documented. **Cross-repo update 2026-06-29:** §2/§8/§11/§12 re-verified against
6
  the **Go source** (`Orchestrator-Agent-Service`), not its docs. The Go service has moved well past its
7
  own (uncommitted, stale) design docs: it now hosts the **dedorch SQL migrations** in-repo and a full
8
  **`/api/v1/analyses` + `/api/v1/skills`** REST surface. Go does **not** call Python yet — those skills
 
178
  |---|---|---|---|
179
  | `users`, `rooms`, `chat_messages`, `message_sources` | base app | chat endpoint, Go | chat history |
180
  | `documents`, `databases` | uploads + DB creds (Fernet-encrypted) | Go ingestion | executor cred resolution |
181
+ | `data_catalog` *(dedorch, Go-owned)* | `id` uuid, `scope_type` ('user'\|'analysis'), `user_id`, `analysis_id`, **`catalog_payload`** jsonb (the `Catalog`: Source → Table → Column), `schema_version`, `generated_at`, `updated_at`; partial-unique on `user_id WHERE scope_type='user'` | **Go `catalog.Service`** (all writes: DB/file ingestion) | CatalogReader → CatalogStore (**read-only**), planner, tools |
182
  | `langchain_pg_embedding` | PGVector document chunks | Go ingestion | DocumentRetriever |
183
  | `report_inputs` *(was `analysis_records`)* | jsonb `AnalysisRecord`, one per slow-path run; **Python-owned** | slow path | ReportGenerator, report readiness |
184
  | `analyses` *(dedorch, plural)* | uuid `id`, `user_id`, `analysis_title`, `objective`, `business_questions` jsonb, `status` (active\|inactive), `data_bind`(+`data_bind_version`), `report_id`, `report_collection` — **defined by Go migrations**; `problem_statement`/`problem_validated`/`owner_id` already **dropped** there (`0003`/`0004`) | Go `/api/v1/analyses`; Python state store | gate (no-op), Help, report |
 
186
  | `data_sources` *(dedorch)* | per-analysis binding; `reference_id` = catalog source_id; `type ∈ document\|database` | Go `/analyses/{id}/data-bind` (+ Python `/analysis/create`) | structured-flow scoping, report appendix |
187
  | `analyses_messages` *(dedorch)* | the analysis chat room (`role ∈ user\|ai`); replaces deprecated `rooms`/`chat_messages` | Go `/analyses/{id}/messages` | Python chat path **not yet migrated here** (§12) |
188
 
189
+ > **Python ORM ↔ dedorch drift reconciled 2026-07-01.** `AnalysisStateRow` (`analyses`) dropped
190
+ > `problem_statement`/`problem_validated` and added `objective`/`business_questions` (Harry's #3);
191
+ > `data_catalog` was the last stale model. Its `Catalog` ORM (old `user_id`-PK + `data` jsonb) is now
192
+ > the dedorch shape (`id` PK, `scope_type`, **`catalog_payload`**), and `CatalogStore` reads
193
+ > `catalog_payload WHERE scope_type='user'` (matching Go's `catalog.Service`). This closed a **live
194
+ > bug**: the `check` skill / `CatalogReader` still selected the dropped `data_catalog.data` column, so
195
+ > every catalog read 500'd after the cutover ("what data do I have" → *"Sorry, I couldn't look that up:
196
+ > column data_catalog.data does not exist"*). Python's catalog **write** methods (`upsert`/
197
+ > `remove_source`/`StructuredPipeline`) were reconciled but are now **legacy** — Go owns ingestion.
198
 
199
  **Catalog shape** (the jsonb in `data_catalog`):
200
  `Catalog → Source[ {source_id, source_type ∈ schema|tabular|unstructured, name, location_ref} → Table[ {table_id, name, row_count, foreign_keys[]} → Column[ {column_id, name, data_type, nullable, pii_flag, sample_values|null, stats} ] ] ]`. PII columns have `sample_values: null` so real values never enter prompts.
201
 
202
+ > ⚠️ **dedorch catalogs ship empty `foreign_keys`** (Go's introspection drops FK constraints), yet the IR validator only allows FK-backed joins — so every cross-table question failed validation until 2026-07-02. `src/catalog/fk_inference.py` (wired into `CatalogStore.get`) now infers the obvious `<base>_id → <table>.id` edges at read time: conservative (single unambiguous target, matching `data_type`, schema sources only) and **self-disabling** once any real FK is present. It's a **stopgap** — the durable fix is Go emitting real FKs during introspection.
203
+
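The inference rule described in the note above can be sketched as follows. This is a minimal illustration, not the actual `fk_inference.py`: the dict-based table layout, the `infer_foreign_keys` name, the edge format, and the naive pluralization (`<base>` or `<base>s`) are all assumptions. The "schema sources only" check is left to the caller.

```python
from typing import Any, Dict, List


def infer_foreign_keys(tables: List[Dict[str, Any]]) -> List[Dict[str, str]]:
    """Infer `<base>_id -> <table>.id` edges for one source (hypothetical helper)."""
    # Self-disabling: if any table already carries a real FK, infer nothing.
    if any(t.get("foreign_keys") for t in tables):
        return []

    # Index the data_type of each table's `id` column, keyed by table name.
    id_types: Dict[str, str] = {}
    for t in tables:
        for c in t["columns"]:
            if c["name"] == "id":
                id_types[t["name"]] = c["data_type"]

    edges: List[Dict[str, str]] = []
    for t in tables:
        for c in t["columns"]:
            if not c["name"].endswith("_id"):
                continue
            base = c["name"][: -len("_id")]
            # Assumed naming convention: singular or naively pluralized table name.
            targets = [n for n in (base, base + "s")
                       if n in id_types and n != t["name"]]
            # Conservative: exactly one candidate, and the data types must match.
            if len(targets) == 1 and id_types[targets[0]] == c["data_type"]:
                edges.append({"from_table": t["name"], "from_column": c["name"],
                              "to_table": targets[0], "to_column": "id"})
    return edges


tables = [
    {"name": "customers", "foreign_keys": [],
     "columns": [{"name": "id", "data_type": "integer"}]},
    {"name": "orders", "foreign_keys": [],
     "columns": [{"name": "id", "data_type": "integer"},
                 {"name": "customer_id", "data_type": "integer"}]},
]
edges = infer_foreign_keys(tables)
```

Running inference at read time keeps the stored catalog untouched, so the stopgap disappears on its own once Go starts emitting real FKs.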
204
  **QueryIR shape** (`src/query/ir/models.py`):
205
  `{ source_id, table_id, joins[], select[], filters[], group_by[], order_by[], limit }`.
206
  Joins are single-level equi-joins to a related table **in the same source**, FK-backed,
 
291
  |---|---|---|---|
292
  | `ENABLE_SLOW_PATH` | `settings.enable_slow_path` | **off** | Route `structured_flow` through Planner/TaskRunner/Assembler (vs single-query `QueryService`). Records persist only on the slow path → reports require this on. |
293
  | `ENABLE_GATE` | `settings.enable_gate` | **off** | **Deprecated 2026-06-25** — gate neutered; the flag has no effect. Kept to avoid `.env` churn. |
294
+ | `SKIP_INIT_DB` | `settings.skip_init_db` (.env/env) | **on** | Skip `init_db()` on startup — the dedorch cutover switch. **Defaults TRUE** (Go owns the dedorch schema); set `false` only for a local Python-owned DB. |
295
  | `enable_tracing` | hardcoded `True` in `chat.py` | on (endpoint) | Langfuse tracing. |
296
 
297
  ---
 
314
 
315
  ## 12. dedorch migration — current state
316
 
317
+ The Python DB has moved from `dataeyond` → **dedorch** (cutover 2026-07-01; Go owns dedorch migrations;
318
+ Python is consumer-only). State **re-verified against the Go source 2026-06-29**:
319
 
320
  - **The dedorch migrations now live IN the Go repo** — embedded SQL at
321
  `internal/repository/postgres/migrations/0001_create_core_schema.sql … 0004_replace_chat_with_analysis_scope.sql`,
 
330
  `rooms`/`chat_messages`/`interview_*` tables to `zdeprecated_*`.
331
  - **`report_inputs`** (the slow-path structured output, formerly `analysis_records`) stays
332
  **Python-owned**; its finalized schema goes to Harry so the dedorch migration creates it post-cutover.
333
+ - **Connection-string cutover DONE (2026-07-01).** Python's `postgres_connstring` now points at
334
+ **dedorch** and reads the Go-migrated tables directly. Every ORM model Python reads (`analyses`,
335
+ `data_sources`, `analyses_messages`, `data_catalog`) has been reconciled to its dedorch shape.
336
+ **`init_db()` is now skipped by default** (`settings.skip_init_db` defaults **True**): its privileged
337
+ DDL (`ALTER TABLE rooms …`, index creation) fails on Go-owned tables
338
+ (`InsufficientPrivilegeError: must be owner of table rooms`). Skipping is safe — Go migration `0001`
339
+ already provides the `vector` extension + the langchain FTS index. Set `SKIP_INIT_DB=false` (.env or
340
+ env) only for a local Python-owned DB. `report_inputs` is not in any Go migration yet (#22) — create
341
+ it in dedorch before enabling the slow path, else report/slow-path writes fail (chat path unaffected).
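
The default-on behaviour of `SKIP_INIT_DB` can be illustrated with a small sketch. It reads the environment directly instead of going through the real `Settings` class; the `skip_init_db` function and the set of accepted falsy spellings are assumptions for illustration only.

```python
import os
from typing import Mapping, Optional

# Assumed spellings that switch the flag off; the real Settings parser may differ.
_FALSY = {"0", "false", "no", "off"}


def skip_init_db(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True unless SKIP_INIT_DB is explicitly set to a falsy value."""
    env = os.environ if env is None else env
    raw = env.get("SKIP_INIT_DB")
    if raw is None:
        # Default: Go owns the dedorch schema, so Python must not run DDL.
        return True
    return raw.strip().lower() not in _FALSY


default_value = skip_init_db({})
local_dev_value = skip_init_db({"SKIP_INIT_DB": "false"})
```

With this shape, only a local Python-owned database needs an explicit `SKIP_INIT_DB=false`.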
342
 
343
  **⚠️ Integration gap (verified — the big one).** Go's `/api/v1/analyses` and `/api/v1/skills`
344
  (`help` / `report`) are **placeholders that return dummy data** — the `SendMessage` / `GenerateReport`
 
360
  values are always parameterized.
361
  - **Settings aliases:** `.env` uses double-underscore names (`azureai__api_key__4o`); `Settings`
362
  exposes them as `azureai_api_key_4o`.
363
+ - **Shared Fernet key across repos (gotcha).** User DB credentials in `databases` are written +
364
+ encrypted by **Go** and decrypted by Python; both read the **same** env var
365
+ `dataeyond__db__credential__key` (Go: `configs/app.yaml` → `credentials.fernet_key`). The two
366
+ deployments MUST hold the **identical value** or Python's decrypt throws
367
+ `cryptography.fernet.InvalidToken`, whose `str()` is **empty**, so it was logged as `error=""` and
368
+ masqueraded as a DB-connection failure (the executor now logs `repr(e)` to expose it). To tell them apart:
369
+ a valid-but-wrong key → `InvalidToken`; a malformed key → a non-empty `ValueError` at cipher build.
370
  - **Never-throw seams** are pervasive (tool invoker, query service, executors, state/binding reads,
371
  record persistence, report summary). Failures degrade into soft output rather than raising — good
372
  for UX, but they can mask real breakage (e.g. a binding silently fail-opening to the full catalog).
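
The empty-`str()` pitfall from the Fernet bullet above applies to any exception raised without arguments. The sketch below uses a stand-in `InvalidToken` class (an assumption, so that it runs without the `cryptography` package) to show why logging `repr(e)` exposes what `str(e)` hides.

```python
class InvalidToken(Exception):
    """Stand-in for cryptography.fernet.InvalidToken, raised with no message."""


try:
    # Simulates a decrypt attempt with a valid-but-wrong key.
    raise InvalidToken()
except InvalidToken as e:
    as_str = str(e)    # empty: there is no message to show
    as_repr = repr(e)  # still names the exception class

log_before = f'error="{as_str}"'  # what the old log line looked like
log_after = f"error={as_repr}"    # what logging repr(e) produces
```

Logging the exception class name alongside the message is a cheap habit that keeps never-throw seams debuggable.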
eval/help/README.md ADDED
@@ -0,0 +1,77 @@
1
+ # Help-skill eval
2
+
3
+ Scores the **live** Help skill (`src/agents/handlers/help.HelpAgent`) — the guide that
4
+ tells a user where they are and what to do next. Each golden case declares an analysis
5
+ state + report-readiness + chat history; the runner streams `HelpAgent.astream` for real
6
+ and asserts the **rules** the reply must obey.
7
+
8
+ Unlike `eval/readiness` (deterministic, no LLM), this calls the model, so it needs a
9
+ working `.env` (Azure OpenAI) and spends tokens. Run it before a deploy that touches
10
+ `config/prompts/help.md` — not on every commit. The fast, no-LLM guard is
11
+ `tests/unit/agents/handlers/test_help.py` (fake chain); this is the end-to-end
12
+ "does the model actually obey the prompt" layer on top.
13
+
14
+ ## Run
15
+
16
+ ```bash
17
+ uv run python -m eval.help.run_eval
18
+ uv run python -m eval.help.run_eval --limit 4 # smoke test
19
+ uv run python -m eval.help.run_eval --no-table # summary only
20
+ ```
21
+
22
+ Each run writes a timestamped `results/help_result_<ts>.json` (never overwritten,
23
+ diffable across runs).
24
+
25
+ ## What it measures
26
+
27
+ Not accuracy — Help replies are free prose with no single correct wording. The metric is
28
+ **compliance**: the % of cases whose reply obeys every rule asserted for it.
29
+
30
+ - **`language`** — the reply must match the user's language. This is the regression guard
31
+ for the button-path bug (`/tools/help` passes `message=None`, and the reply used to
32
+ default to English even for an Indonesian conversation).
33
+ - **`report_guard`** — never suggest generating a report when `report_ready.ready=false`;
34
+ do suggest it when `true`. Since `generate_report` is the only gated action, this also
35
+ serves as the "no action leakage" check.
36
+ - **`orientation`** — quality of the suggested starter questions. **Manual review**: these
37
+ run but are excluded from the auto compliance rate. Read their `output_text` in the JSON.
38
+
39
+ Assertion types: `language_match {expected}`, `must_not_contain_any {patterns}`,
40
+ `must_contain_any {patterns}`.
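
A minimal sketch of how the three assertion types could be evaluated. It is not the runner's actual code: `check_assertion` is a hypothetical name, substring matching is an assumption (the dataset only says patterns are case-insensitive), and the language detector is injected so a toy stand-in can replace `help._detect_reply_language`.

```python
from typing import Any, Callable, Dict


def check_assertion(spec: Dict[str, Any], reply: str,
                    detect_language: Callable[[str], str]) -> bool:
    """Return True when the reply obeys one rule assertion."""
    kind = spec["type"]
    text = reply.lower()
    if kind == "language_match":
        return detect_language(reply) == spec["expected"]
    if kind == "must_not_contain_any":
        return not any(p.lower() in text for p in spec["patterns"])
    if kind == "must_contain_any":
        return any(p.lower() in text for p in spec["patterns"])
    raise ValueError(f"unknown assertion type: {kind!r}")


def toy_detector(text: str) -> str:
    # Toy stand-in for the real detector, for this example only.
    return "Indonesian" if "laporan" in text.lower() else "English"


reply = "Kamu bisa lanjut ke /report untuk membuat laporan."
contains_ok = check_assertion(
    {"type": "must_contain_any", "patterns": ["/report", "laporan"]},
    reply, toy_detector)
guard_ok = check_assertion(
    {"type": "must_not_contain_any", "patterns": ["Generate your report"]},
    reply, toy_detector)
language_ok = check_assertion(
    {"type": "language_match", "expected": "Indonesian"}, reply, toy_detector)
```

Raising on an unknown type, rather than returning `False`, keeps a typo in the dataset from being counted as a model failure.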
41
+
42
+ ## Held-out vs carried-over (why the summary splits them)
43
+
44
+ `carried_over: true` cases **mirror an example in `help.md`** — the case `id` *is* the
45
+ prompt's `<!-- id: ... -->`. They are a regression guard: if the prompt is refactored, the
46
+ demonstrated rule must still hold. What is mirrored is the **input spec + the assertion**,
47
+ never the example's reply text (temperature > 0 makes exact match invalid).
48
+
49
+ Held-out cases (`carried_over: false`) are **absent from the prompt**; their compliance is
50
+ the real generalization signal. If held-out compliance drops while carried-over stays at
51
+ 100%, the prompt is overfitting to its own examples ("train on test set"). That's why the
52
+ two are reported separately.
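
The split reporting can be sketched as below, using the case-level definition of compliance given in this README. The result-dict keys (`passed`, `errored`) and the `compliance_by_split` name are assumptions; the real runner's output format may differ.

```python
from collections import defaultdict
from typing import Any, Dict, List


def compliance_by_split(results: List[Dict[str, Any]]) -> Dict[str, float]:
    """Percent of auto-scored cases that passed, per held_out / carried_over."""
    buckets: Dict[str, Dict[str, int]] = defaultdict(
        lambda: {"passed": 0, "total": 0})
    for r in results:
        if r["manual_review"]:
            continue  # manual cases run but are excluded from the rate
        key = "carried_over" if r["carried_over"] else "held_out"
        buckets[key]["total"] += 1
        # An errored case counts as a failure, never as a pass.
        if r["passed"] and not r.get("errored", False):
            buckets[key]["passed"] += 1
    return {k: 100.0 * v["passed"] / v["total"] for k, v in buckets.items()}


results = [
    {"carried_over": True, "manual_review": False, "passed": True},
    {"carried_over": True, "manual_review": False, "passed": True},
    {"carried_over": True, "manual_review": True, "passed": False},
    {"carried_over": False, "manual_review": False, "passed": True},
    {"carried_over": False, "manual_review": False, "passed": True},
    {"carried_over": False, "manual_review": False, "passed": True},
    {"carried_over": False, "manual_review": False, "passed": False},
]
rates = compliance_by_split(results)
```

In this toy run the carried-over rate stays at 100% while the held-out rate is lower, which is the overfitting pattern the split is meant to surface.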
53
+
54
+ **Sync rule (manual, like `intent`):** if `help.md`'s Examples change, keep the mirrored
55
+ `id`s here in sync. Current mirrored ids: `help_ex_orient`, `help_ex_guard_delta`,
56
+ `help_ex_guard_ready`.
57
+
58
+ ## Dataset
59
+
60
+ `help_dataset.json` — see the `_about` / `_carried_over` doc keys in the file. Language
61
+ detection reuses `help._detect_reply_language`; `report_ready.missing` uses the codes
62
+ `analysis` / `delta`, which the runner maps to the real `is_report_ready` strings.
63
+
64
+ ## Known limitations
65
+
66
+ - **Compliance is approximate across runs.** `HelpAgent` runs at `temperature=0.3`, so the
67
+ reply varies; a borderline case can flip pass/fail between runs. Treat the rate as a
68
+ signal, not a fixed number — re-run before trusting a single-point drop.
69
+ - **`language_match` grades with the same detector the feature uses** (`_detect_reply_language`
70
+ over the reply). It verifies the model obeyed the `[Reply language]` directive, assuming the
71
+ detector is correct — the detector itself is unit-tested separately in
72
+ `tests/unit/agents/handlers/test_help.py`. It can also misfire on a reply that mixes
73
+ languages (e.g. an Indonesian reply quoting an English business question).
74
+ - **Errored cases (stream crash) count as failures, not rule violations.** If `astream` raises
75
+ (Azure down, timeout), the case is flagged `errored` and reported under a separate `ERRORED`
76
+ line — assertions are NOT run on the error string (a crash must not trivially "pass" a
77
+ `must_not_contain_any`). A run with errors is not a clean pass; re-run once the cause clears.
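
The errored-case rule above can be sketched as follows. This is an illustration with fake async generators standing in for `HelpAgent.astream`; the `run_case` helper and its result keys are assumptions, not the runner's real interface.

```python
import asyncio
from typing import Any, AsyncIterator, Callable, Dict, List


async def run_case(stream_factory: Callable[[], AsyncIterator[str]],
                   checks: List[Callable[[str], bool]]) -> Dict[str, Any]:
    """Stream one reply; on a crash, flag the case and skip every assertion."""
    try:
        chunks = [chunk async for chunk in stream_factory()]
    except Exception as exc:
        # Assertions are NOT run on the error string, so a crash cannot
        # trivially "pass" a must_not_contain_any rule.
        return {"errored": True, "error": repr(exc),
                "passed": False, "output_text": ""}
    text = "".join(chunks)
    return {"errored": False, "error": None,
            "passed": all(check(text) for check in checks),
            "output_text": text}


async def ok_stream() -> AsyncIterator[str]:
    for part in ("Run an analysis ", "first."):
        yield part


async def broken_stream() -> AsyncIterator[str]:
    yield "partial"
    raise TimeoutError("model timed out")


checks = [lambda t: "/report" not in t.lower()]
ok = asyncio.run(run_case(ok_stream, checks))
bad = asyncio.run(run_case(broken_stream, checks))
```

The partial output of the crashed stream is discarded on purpose: scoring half a reply would make the compliance number harder to interpret.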
eval/help/__init__.py ADDED
File without changes
eval/help/help_dataset.json ADDED
@@ -0,0 +1,150 @@
1
+ {
2
+ "_about": "Golden dataset for the Help skill (`src/agents/handlers/help.HelpAgent`). Unlike intent/readiness this calls the LIVE model: each case declares an analysis state + report-readiness + chat history, the runner streams HelpAgent.astream for real, and asserts RULES the reply must obey (not text similarity — help replies are free prose with no single correct wording). Metric is COMPLIANCE (% of rule assertions that hold), reported separately for held-out vs carried_over cases.",
3
+ "_groups": "language (reply matches the user's language — the button-path bug), report_guard (never suggest a report when report_ready.ready=false; do suggest it when true — this also IS the 'no action leakage' check, since generate_report is the only gated action), orientation (quality of the suggested starter questions — MANUAL review, not auto-scored).",
4
+ "_asserts": "language_match {expected} — detect the reply's language (help._detect_reply_language over the OUTPUT) must equal expected. must_not_contain_any {patterns} — none of the (case-insensitive) patterns appear. must_contain_any {patterns} — at least one appears.",
5
+ "_carried_over": "carried_over:true rows MIRROR an example in config/prompts/help.md (the row `id` IS the help.md `<!-- id: ... -->`). They are the regression guard: if the prompt is refactored, the demonstrated rule must still hold. What is mirrored is the INPUT spec + the assertion — NOT the example's reply text (temperature>0 makes exact match invalid). Held-out rows (carried_over:false) are NOT in the prompt; their compliance is the real generalization signal. If help.md's Examples change, keep these ids in sync (manual, like intent).",
6
+ "_missing_codes": "report_ready.missing uses codes mapped to the real strings is_report_ready emits (imported in run_eval): analysis -> _MISSING_ANALYSIS, delta -> _MISSING_DELTA. Kept as codes so the dataset survives wording changes.",
7
+ "schema": {
8
+ "id": "stable handle; for carried_over rows this equals the help.md example id",
9
+ "group": "language | report_guard | orientation",
10
+ "carried_over": "bool — mirrors a help.md example",
11
+ "manual_review": "bool — run but exclude from the auto compliance rate (read output_text)",
12
+ "state": "{ analysis_title, objective, business_questions[], report_id }",
13
+ "report_ready": "{ ready: bool, missing: [analysis|delta] }",
14
+ "history": "[{ role: human|ai, content }] — drives language on the button path",
15
+ "message": "the human turn; null = button path (HelpAgent falls back to a per-language trigger)",
16
+ "asserts": "[{ type, ...spec }] — the rules the reply must obey",
17
+ "note": "human-readable description"
18
+ },
19
+ "cases": [
20
+ {
21
+ "id": "lang_01", "group": "language", "carried_over": false, "manual_review": false,
22
+ "state": { "analysis_title": "Analisis penjualan", "objective": "memahami performa penjualan bulanan", "business_questions": ["produk mana yang paling laku?"], "report_id": null },
23
+ "report_ready": { "ready": false, "missing": ["analysis"] },
24
+ "history": [{ "role": "human", "content": "aku baru upload datanya, terus aku harus ngapain?" }],
25
+ "message": null,
26
+ "asserts": [{ "type": "language_match", "expected": "Indonesian" }],
27
+ "note": "REGRESSION of the button-path bug: Indonesian conversation, message=null. Reply must be Indonesian, not English."
28
+ },
29
+ {
30
+ "id": "lang_02", "group": "language", "carried_over": false, "manual_review": false,
31
+ "state": { "analysis_title": "Sales analysis", "objective": "understand monthly sales performance", "business_questions": ["which products drive revenue?"], "report_id": null },
32
+ "report_ready": { "ready": false, "missing": ["analysis"] },
33
+ "history": [{ "role": "human", "content": "okay I uploaded my data, what do I do next?" }],
34
+ "message": null,
35
+ "asserts": [{ "type": "language_match", "expected": "English" }],
36
+ "note": "English conversation, button path — reply must stay English."
37
+ },
38
+ {
39
+ "id": "lang_03", "group": "language", "carried_over": false, "manual_review": false,
40
+ "state": { "analysis_title": "Analisis churn", "objective": "menurunkan churn pelanggan", "business_questions": ["segmen mana yang paling banyak churn?"], "report_id": null },
41
+ "report_ready": { "ready": false, "missing": ["analysis"] },
42
+ "history": [],
43
+ "message": "gimana caranya mulai analisis ini ya?",
44
+ "asserts": [{ "type": "language_match", "expected": "Indonesian" }],
45
+ "note": "Intent path: the real Indonesian user turn drives the language."
46
+ },
47
+ {
48
+ "id": "lang_04", "group": "language", "carried_over": false, "manual_review": false,
49
+ "state": { "analysis_title": "Retention analysis", "objective": "understand user retention", "business_questions": ["what drives repeat usage?"], "report_id": null },
50
+ "report_ready": { "ready": false, "missing": ["analysis"] },
51
+ "history": [],
52
+ "message": null,
53
+ "asserts": [{ "type": "language_match", "expected": "English" }],
54
+ "note": "Fresh analysis, no chat yet, button path — with no turn to read, the user-authored goal (English objective + business_questions, required at onboarding) drives the language."
55
+ },
56
+ {
57
+ "id": "lang_06", "group": "language", "carried_over": false, "manual_review": false,
58
+ "state": { "analysis_title": "Analisis retensi", "objective": "memahami retensi pengguna", "business_questions": ["apa yang mendorong penggunaan berulang?"], "report_id": null },
59
+ "report_ready": { "ready": false, "missing": ["analysis"] },
60
+ "history": [],
61
+ "message": null,
62
+ "asserts": [{ "type": "language_match", "expected": "Indonesian" }],
63
+ "note": "Same fresh-analysis path as lang_04 but the goal is Indonesian — the goal signal must yield Indonesian (not the hard fallback, which only fires when the goal is empty too)."
64
+ },
65
+ {
66
+ "id": "lang_05", "group": "language", "carried_over": false, "manual_review": false,
67
+ "state": { "analysis_title": "Analisis penjualan", "objective": "memahami tren penjualan", "business_questions": ["bagaimana tren bulanan?"], "report_id": null },
68
+ "report_ready": { "ready": false, "missing": ["analysis"] },
69
+ "history": [
70
+ { "role": "human", "content": "apa saja yang bisa aku tanyakan tentang data ini?" },
71
+ { "role": "ai", "content": "You can start by asking which products sell the most." }
72
+ ],
73
+ "message": null,
74
+ "asserts": [{ "type": "language_match", "expected": "Indonesian" }],
75
+ "note": "Last AI turn is English but the human turn is Indonesian — mirror the human, reply Indonesian."
76
+ },
77
+ {
78
+ "id": "help_ex_guard_delta", "group": "report_guard", "carried_over": true, "manual_review": false,
79
+ "state": { "analysis_title": "Sales analysis", "objective": "understand monthly sales performance", "business_questions": ["which products drive revenue?"], "report_id": "rep-1" },
80
+ "report_ready": { "ready": false, "missing": ["delta"] },
81
+ "history": [{ "role": "human", "content": "what should I do next?" }],
82
+ "message": null,
83
+ "asserts": [{ "type": "must_not_contain_any", "patterns": ["/report", "generate the report", "generate your report", "create the report"] }],
84
+ "note": "MIRRORS help.md example help_ex_guard_delta. A report exists and nothing new since — must NOT tell the user to generate a report; steer them to run a fresh analysis first."
85
+ },
86
+ {
87
+ "id": "help_ex_guard_ready", "group": "report_guard", "carried_over": true, "manual_review": false,
88
+ "state": { "analysis_title": "Sales analysis", "objective": "understand monthly sales performance", "business_questions": ["which products drive revenue?"], "report_id": null },
89
+ "report_ready": { "ready": true, "missing": [] },
90
+ "history": [{ "role": "human", "content": "what should I do next?" }],
91
+ "message": null,
92
+ "asserts": [{ "type": "must_contain_any", "patterns": ["/report", "report"] }],
93
+ "note": "MIRRORS help.md example help_ex_guard_ready. Enough analysis done — SHOULD nudge toward the report (mention /report or the report option)."
94
+ },
95
+ {
96
+ "id": "guard_03", "group": "report_guard", "carried_over": false, "manual_review": false,
97
+ "state": { "analysis_title": "Retention analysis", "objective": "improve 30-day retention", "business_questions": ["which cohort retains best?"], "report_id": null },
98
+ "report_ready": { "ready": false, "missing": ["analysis"] },
99
+ "history": [{ "role": "human", "content": "can I get a report now?" }],
100
+ "message": null,
101
+ "asserts": [{ "type": "must_not_contain_any", "patterns": ["/report", "generate the report", "generate your report", "you can generate"] }],
102
+ "note": "No analysis run yet, user asks for a report directly — must NOT offer to generate; redirect to running an analysis first."
103
+ },
104
+ {
105
+ "id": "guard_04", "group": "report_guard", "carried_over": false, "manual_review": false,
106
+ "state": { "analysis_title": "Analisis penjualan", "objective": "memahami performa penjualan", "business_questions": ["produk mana yang paling laku?"], "report_id": null },
107
+ "report_ready": { "ready": true, "missing": [] },
108
+ "history": [{ "role": "human", "content": "selanjutnya aku ngapain?" }],
109
+ "message": null,
110
+ "asserts": [
111
+ { "type": "must_contain_any", "patterns": ["/report", "laporan", "report"] },
112
+ { "type": "language_match", "expected": "Indonesian" }
113
+ ],
114
+ "note": "Ready + Indonesian conversation — should nudge toward the report AND stay in Indonesian (two rules at once)."
115
+ },
116
+ {
117
+ "id": "guard_05", "group": "report_guard", "carried_over": false, "manual_review": false,
118
+ "state": { "analysis_title": "Analisis churn", "objective": "menurunkan churn", "business_questions": ["segmen mana yang paling churn?"], "report_id": null },
119
+ "report_ready": { "ready": false, "missing": ["analysis"] },
120
+ "history": [{ "role": "human", "content": "aku mau bikin laporan dong" }],
121
+ "message": null,
122
+ "asserts": [
123
+ { "type": "must_not_contain_any", "patterns": ["/report", "silakan buat laporan", "kamu bisa membuat laporan", "generate your report"] },
124
+ { "type": "language_match", "expected": "Indonesian" }
125
+ ],
126
+ "note": "Indonesian, not ready, user asks for a report — must NOT offer it and must reply in Indonesian."
127
+ },
128
+ {
129
+ "id": "help_ex_orient", "group": "orientation", "carried_over": true, "manual_review": true,
130
+ "state": { "analysis_title": "Sales analysis", "objective": "understand monthly sales performance", "business_questions": ["which products drive revenue?"], "report_id": null },
131
+ "report_ready": { "ready": false, "missing": ["analysis"] },
132
+ "history": [],
133
+ "message": null,
134
+ "asserts": [],
135
+ "note": "MIRRORS help.md example help_ex_orient. MANUAL: are the 2-3 starter questions concrete, descriptive-first, and tied to the objective? Read output_text."
136
+ },
137
+ {
138
+ "id": "orient_02", "group": "orientation", "carried_over": false, "manual_review": true,
139
+ "state": { "analysis_title": "Retention analysis", "objective": "improve 30-day retention", "business_questions": ["which acquisition channel retains best?"], "report_id": null },
140
+ "report_ready": { "ready": false, "missing": ["analysis"] },
141
+ "history": [
142
+ { "role": "human", "content": "which channel brings the most signups?" },
143
+ { "role": "ai", "content": "Organic search brought the most signups last month (1,240)." }
144
+ ],
145
+ "message": null,
146
+ "asserts": [],
147
+ "note": "MANUAL: one question already answered — does help build on it with a NEW follow-up (retention by channel), not re-suggest the answered question? Read output_text."
148
+ }
149
+ ]
150
+ }
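
Because the dataset is hand-maintained, a small structural check can catch drift before a token-spending run. The sketch below is hypothetical (the PR does not include such a validator); the required keys and group names are taken from the `schema` block above, and the "auto-scored cases need at least one assert" rule is an assumption.

```python
from typing import Any, Dict, List

REQUIRED_KEYS = {"id", "group", "carried_over", "manual_review", "state",
                 "report_ready", "history", "message", "asserts", "note"}
GROUPS = {"language", "report_guard", "orientation"}


def validate_cases(dataset: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; empty means the shape is fine."""
    problems: List[str] = []
    seen = set()
    for case in dataset["cases"]:
        cid = case.get("id", "<missing id>")
        missing = REQUIRED_KEYS - set(case)
        if missing:
            problems.append(f"{cid}: missing keys {sorted(missing)}")
        if case.get("group") not in GROUPS:
            problems.append(f"{cid}: unknown group {case.get('group')!r}")
        if cid in seen:
            problems.append(f"{cid}: duplicate id")
        seen.add(cid)
        # Assumed rule: a case that is auto-scored must assert something.
        if not case.get("manual_review") and not case.get("asserts"):
            problems.append(f"{cid}: auto-scored case has no asserts")
    return problems


good_case = {
    "id": "lang_02", "group": "language", "carried_over": False,
    "manual_review": False, "state": {}, "report_ready": {"ready": False},
    "history": [], "message": None,
    "asserts": [{"type": "language_match", "expected": "English"}],
    "note": "example",
}
problems_good = validate_cases({"cases": [good_case]})
problems_bad = validate_cases({"cases": [{"id": "x", "group": "oops"}]})
```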
eval/help/run_eval.py ADDED
@@ -0,0 +1,428 @@
1
+ """Help-skill eval runner.
2
+
3
+ Feeds each golden case in `help_dataset.json` to the LIVE Help skill
4
+ (`src/agents/handlers/help.HelpAgent.astream`), then scores whether the streamed
5
+ reply obeys a set of RULE assertions — reply language, never suggesting a report
6
+ when `report_ready.ready=false`, suggesting it when true. Prints a per-case detail
7
+ table + aggregate summary and writes a timestamped JSON report under `results/`
8
+ (never overwritten — one file per run, diffable).
9
+
10
+ Unlike `eval/readiness` (deterministic, no LLM), this calls the model for real, so
11
+ it needs a working `.env` (Azure OpenAI) and spends tokens — run it before a deploy
12
+ that touches `help.md`, not on every commit. `tests/unit/agents/handlers/test_help.py`
13
+ already covers the deterministic Python guard with a fake chain; this is the
14
+ end-to-end "does the model actually obey the prompt" layer on top.
15
+
16
+ Two things the metric separates on purpose:
17
+ - COMPLIANCE = % of rule assertions that hold. NOT accuracy — help replies are free
18
+ prose with no single correct wording; we score rule-obedience, not similarity.
19
+ - HELD-OUT vs CARRIED-OVER — carried_over cases mirror a help.md example (regression);
20
+ held-out cases are absent from the prompt. Held-out compliance is the real
21
+ generalization signal. If held-out drops while carried_over stays 100%, the prompt
22
+ is overfitting to its own examples.
23
+
24
+ `orientation` cases are `manual_review` — run but excluded from the auto compliance
25
+ rate; read their `output_text` in the JSON report to judge suggestion quality.
26
+
27
+ Invoke as a module so `src` imports resolve:
28
+
29
+ uv run python -m eval.help.run_eval
30
+ uv run python -m eval.help.run_eval --limit 4 # smoke test
31
+ uv run python -m eval.help.run_eval --no-table # summary only
32
+ """
33
+
34
+ from __future__ import annotations
35
+
36
+ import argparse
37
+ import asyncio
38
+ import json
39
+ import statistics
40
+ import time
41
+ from dataclasses import asdict, dataclass, field
42
+ from datetime import datetime
43
+ from pathlib import Path
44
+ from typing import Any
45
+
46
+ from langchain_core.callbacks import BaseCallbackHandler
47
+ from langchain_core.messages import AIMessage, BaseMessage, HumanMessage
48
+ from langchain_core.outputs import LLMResult
49
+
50
+ from src.agents.gate import AnalysisState, stub_analysis_state
51
+ from src.agents.handlers.help import HelpAgent, ReportReadiness, _detect_reply_language
52
+ from src.agents.report.readiness import _MISSING_ANALYSIS, _MISSING_DELTA
53
+
54
+ _HERE = Path(__file__).resolve().parent
55
+ DATASET = _HERE / "help_dataset.json"
56
+ RESULTS_DIR = _HERE / "results"
57
+ GROUPS = ["language", "report_guard", "orientation"]
58
+
59
+ # Dataset short codes -> the exact `missing` strings is_report_ready emits. Imported
60
+ # from the module so the dataset stays readable and survives wording changes.
61
+ _CODE_TO_MISSING = {
62
+ "analysis": _MISSING_ANALYSIS,
63
+ "delta": _MISSING_DELTA,
64
+ }
65
+
66
+
67
+ class _UsageCollector(BaseCallbackHandler):
68
+ """Sums token usage across the LLM calls made during one astream()."""
69
+
70
+ def __init__(self) -> None:
71
+ self.input_tokens = 0
72
+ self.output_tokens = 0
73
+ self.total_tokens = 0
74
+
75
+ def on_llm_end(self, response: LLMResult, **kwargs: Any) -> None:
76
+ before = self.total_tokens
77
+ for generation_list in response.generations:
78
+ for generation in generation_list:
79
+ message = getattr(generation, "message", None)
80
+ usage = getattr(message, "usage_metadata", None) if message else None
81
+ if usage:
82
+ self.input_tokens += usage.get("input_tokens", 0)
83
+ self.output_tokens += usage.get("output_tokens", 0)
84
+ self.total_tokens += usage.get("total_tokens", 0)
85
+ if self.total_tokens == before and response.llm_output:
86
+ usage = response.llm_output.get("token_usage") or {}
87
+ self.input_tokens += usage.get("prompt_tokens", 0)
88
+ self.output_tokens += usage.get("completion_tokens", 0)
89
+ self.total_tokens += usage.get("total_tokens", 0)
90
+
91
+ @property
92
+ def tokens(self) -> dict[str, int]:
93
+ return {
94
+ "input": self.input_tokens,
95
+ "output": self.output_tokens,
96
+ "total": self.total_tokens,
97
+ }
98
+
99
+
100
+ # --- assertion checkers -----------------------------------------------------
101
+ # Each returns (passed, detail). `detail` explains a failure in the table/report.
102
+
103
+
104
+ def _check_language_match(output: str, spec: dict[str, Any]) -> tuple[bool, str]:
105
+ got = _detect_reply_language([], message=output)
106
+ return got == spec["expected"], f"want {spec['expected']}, got {got}"
107
+
108
+
109
+ def _check_must_not_contain_any(output: str, spec: dict[str, Any]) -> tuple[bool, str]:
110
+ low = output.lower()
111
+ hits = [p for p in spec["patterns"] if p.lower() in low]
112
+ return (not hits), (f"found {hits}" if hits else "none present")
113
+
114
+
115
+ def _check_must_contain_any(output: str, spec: dict[str, Any]) -> tuple[bool, str]:
116
+ low = output.lower()
117
+ hits = [p for p in spec["patterns"] if p.lower() in low]
118
+ return bool(hits), (f"found {hits}" if hits else f"none of {spec['patterns']}")
119
+
120
+
121
+ _ASSERT_CHECKS = {
122
+ "language_match": _check_language_match,
123
+ "must_not_contain_any": _check_must_not_contain_any,
124
+ "must_contain_any": _check_must_contain_any,
125
+ }
126
+
127
+
128
+ @dataclass
129
+ class AssertResult:
130
+ type: str
131
+ passed: bool
132
+ detail: str
133
+
134
+
135
+ @dataclass
136
+ class CaseResult:
137
+ id: str
138
+ group: str
139
+ carried_over: bool
140
+ manual_review: bool
141
+ output_text: str
142
+ asserts: list[AssertResult]
143
+ all_passed: bool | None # None when manual_review (not auto-scored)
144
+ latency_ms: float
145
+ tokens: dict[str, int]
146
+ errored: bool = False # the astream call raised — infra failure, not a rule verdict
147
+
148
+
149
+ def load_cases(path: Path) -> list[dict[str, Any]]:
150
+ """Read the `cases` array, skipping the leading `_*` doc keys and `schema`."""
151
+ data = json.loads(path.read_text(encoding="utf-8"))
152
+ return list(data["cases"])
153
+
154
+
155
+ def _build_state(spec: dict[str, Any]) -> AnalysisState:
156
+ """Build an AnalysisState from a case's `state` block (defaults from the stub)."""
157
+ return stub_analysis_state().model_copy(
158
+ update={
159
+ "analysis_title": spec.get("analysis_title", "New analysis"),
160
+ "objective": spec.get("objective", ""),
161
+ "business_questions": list(spec.get("business_questions", [])),
162
+ "report_id": spec.get("report_id"),
163
+ }
164
+ )
165
+
166
+
167
+ def _build_history(rows: list[dict[str, Any]]) -> list[BaseMessage]:
168
+ out: list[BaseMessage] = []
169
+ for row in rows:
170
+ cls = HumanMessage if row["role"] == "human" else AIMessage
171
+ out.append(cls(content=row["content"]))
172
+ return out
173
+
174
+
175
+ def _build_readiness(spec: dict[str, Any]) -> ReportReadiness:
176
+ return ReportReadiness(
177
+ ready=bool(spec["ready"]),
178
+ missing=[_CODE_TO_MISSING[c] for c in spec.get("missing", [])],
179
+ )
180
+
181
+
182
+ async def run_case(case: dict[str, Any]) -> CaseResult:
183
+ """Stream one Help reply and score its assertions; never raises."""
184
+ state = _build_state(case["state"])
185
+ history = _build_history(case.get("history", []))
186
+ readiness = _build_readiness(case["report_ready"])
187
+ collector = _UsageCollector()
188
+
189
+ agent = HelpAgent() # real Azure chain, constructed lazily on first astream
190
+ start = time.perf_counter()
191
+ try:
192
+ output = "".join(
193
+ [
194
+ token
195
+ async for token in agent.astream(
196
+ state,
197
+ history=history,
198
+ message=case.get("message"),
199
+ report_ready=readiness,
200
+ callbacks=[collector],
201
+ )
202
+ ]
203
+ )
204
+ except Exception as exc: # noqa: BLE001 — one bad case shouldn't kill the run
205
+ output = f"ERROR:{type(exc).__name__}: {exc}"
206
+ latency_ms = round((time.perf_counter() - start) * 1000, 1)
207
+
208
+ manual = bool(case.get("manual_review"))
209
+ errored = output.startswith("ERROR:")
210
+ asserts: list[AssertResult] = []
211
+ if errored:
212
+ # Don't run rule checks on an error string — a crash must not "pass" a
213
+ # must_not_contain_any (the pattern is trivially absent) or a language check.
214
+ # Count it as a failure, but flag it as errored so it reads as infra, not a
215
+ # rule violation (overrides manual_review — a crash isn't reviewable).
216
+ asserts = [AssertResult(type="stream", passed=False, detail=_truncate(output, 100))]
217
+ all_passed: bool | None = False
218
+ elif manual:
219
+ all_passed = None
220
+ else:
221
+ for spec in case.get("asserts", []):
222
+ check = _ASSERT_CHECKS[spec["type"]]
223
+ passed, detail = check(output, spec)
224
+ asserts.append(AssertResult(type=spec["type"], passed=passed, detail=detail))
225
+ all_passed = all(a.passed for a in asserts)
226
+
227
+ return CaseResult(
228
+ id=case["id"],
229
+ group=case["group"],
230
+ carried_over=bool(case.get("carried_over")),
231
+ manual_review=manual,
232
+ output_text=output,
233
+ asserts=asserts,
234
+ all_passed=all_passed,
235
+ latency_ms=latency_ms,
236
+ tokens=collector.tokens,
237
+ errored=errored,
238
+ )
239
+
240
+
241
+ def _compliance(results: list[CaseResult]) -> dict[str, Any]:
242
+ scored = [r for r in results if r.all_passed is not None]
243
+ passed = sum(1 for r in scored if r.all_passed)
244
+ return {
245
+ "n": len(scored),
246
+ "passed": passed,
247
+ "compliance": round(passed / len(scored), 3) if scored else 0.0,
248
+ }
249
+
250
+
251
+ def summarize(results: list[CaseResult]) -> dict[str, Any]:
252
+ scored = [r for r in results if r.all_passed is not None]
253
+ latencies = [r.latency_ms for r in results]
254
+ tok_total = sum(r.tokens["total"] for r in results)
255
+ overall = _compliance(results)
256
+ by_group = {
257
+ g: _compliance([r for r in results if r.group == g])
258
+ for g in GROUPS
259
+ if any(r.group == g for r in results)
260
+ }
261
+ errored = [r for r in results if r.errored]
262
+ return {
263
+ "total": len(results),
264
+ "scored": len(scored),
265
+ "manual_review": len(results) - len(scored),
266
+ "passed": overall["passed"],
267
+ "compliance": overall["compliance"],
268
+ "runtime_avg_ms": round(statistics.mean(latencies), 1) if latencies else 0,
269
+ "tokens_total": tok_total,
270
+ "by_group": by_group,
271
+ "held_out": _compliance([r for r in scored if not r.carried_over]),
272
+ "carried_over": _compliance([r for r in scored if r.carried_over]),
273
+ "errored": {"count": len(errored), "ids": [r.id for r in errored]},
274
+ }
275
+
276
+
277
+ def _truncate(text: str, width: int) -> str:
278
+ text = text.replace("\n", " ")
279
+ return text if len(text) <= width else text[: width - 3] + "..."
280
+
281
+
282
+ def format_table(results: list[CaseResult]) -> str:
283
+ header = (
284
+ f"{'ID':<20} {'GROUP':<13} {'C/O':<4} {'ASSERTS':<22} {'OK':<4} {'MS':>7}"
285
+ )
286
+ rule = "-" * len(header)
287
+ lines = [rule, header, rule]
288
+ for r in results:
289
+ co = "CO" if r.carried_over else "-"
290
+ if r.manual_review and not r.errored:
291
+ atypes, ok = "(manual)", "~"
292
+ else:
293
+ atypes = ",".join(a.type.replace("_", "")[:6] for a in r.asserts) or "-"
294
+ ok = "ok" if r.all_passed else "X"
295
+ lines.append(
296
+ f"{r.id:<20} {r.group:<13} {co:<4} {_truncate(atypes, 22):<22} "
297
+ f"{ok:<4} {r.latency_ms:>7}"
298
+ )
299
+ lines.append(rule)
300
+ return "\n".join(lines)
301
+
302
+
303
+ def format_summary(summary: dict[str, Any], results: list[CaseResult]) -> str:
304
+ lines = ["SUMMARY"]
305
+ lines.append(
306
+ f" Compliance {summary['passed']}/{summary['scored']} cases obey all rules"
307
+ f" ({summary['compliance'] * 100:.1f}%) avg {summary['runtime_avg_ms']} ms"
308
+ )
309
+ lines.append(
310
+ f" Manual {summary['manual_review']} case(s) excluded from the rate"
311
+ " (read output_text)"
312
+ )
313
+ lines.append("")
314
+ lines.append(" By group")
315
+ for g, m in summary["by_group"].items():
316
+ if m["n"]:
317
+ lines.append(f" {g:<14} {m['passed']}/{m['n']} {m['compliance'] * 100:.0f}%")
318
+ else:
319
+ lines.append(f" {g:<14} (manual only)")
320
+ lines.append("")
321
+ ho, co = summary["held_out"], summary["carried_over"]
322
+ lines.append(" Held-out vs carried-over")
323
+ lines.append(
324
+ f" held_out {ho['passed']}/{ho['n']} "
325
+ f"{ho['compliance'] * 100:.0f}% <- generalization"
326
+ )
327
+ lines.append(
328
+ f" carried_over {co['passed']}/{co['n']} "
329
+ f"{co['compliance'] * 100:.0f}% <- regression"
330
+ )
331
+ # Rule failures (real disobedience) vs errored (infra/stream crash) — kept apart so
332
+ # a crashed run isn't misread as the model breaking a rule.
333
+ failures = [r for r in results if r.all_passed is False and not r.errored]
334
+ lines.append("")
335
+ lines.append(f" FAILURES ({len(failures)})")
336
+ for r in failures:
337
+ bad = [f"{a.type}({a.detail})" for a in r.asserts if not a.passed]
338
+ lines.append(f" {r.id:<20} {r.group:<13} {'; '.join(bad)}")
339
+ err = summary["errored"]
340
+ if err["count"]:
341
+ lines.append("")
342
+ lines.append(
343
+ f" ERRORED ({err['count']}) - stream crashed, counted as fail NOT a rule miss"
344
+ f" -> {', '.join(err['ids'])}"
345
+ )
346
+ return "\n".join(lines)
347
+
348
+
349
+ def build_report(
350
+ results: list[CaseResult], summary: dict[str, Any], meta: dict[str, Any]
351
+ ) -> dict[str, Any]:
352
+ run = {
353
+ **meta,
354
+ **{
355
+ k: summary[k]
356
+ for k in ("total", "scored", "manual_review", "passed", "compliance",
357
+ "runtime_avg_ms", "tokens_total")
358
+ },
359
+ }
360
+ return {
361
+ "run": run,
362
+ "by_group": summary["by_group"],
363
+ "held_out": summary["held_out"],
364
+ "carried_over": summary["carried_over"],
365
+ "errored": summary["errored"],
366
+ "cases": [asdict(r) for r in results],
367
+ }
368
+
369
+
370
+ def _model_name() -> str:
371
+ try:
372
+ from src.config.settings import settings
373
+
374
+ return str(settings.azureai_deployment_name_4o)
375
+ except Exception: # noqa: BLE001 — meta only; .env may be absent
376
+ return "gpt-4o"
377
+
378
+
379
+ @dataclass
380
+ class _Args:
381
+ dataset: Path = DATASET
382
+ limit: int = 0
383
+ no_table: bool = False
384
+ extra: dict[str, Any] = field(default_factory=dict)
385
+
386
+
387
+ async def main() -> None:
388
+ parser = argparse.ArgumentParser(description="Help-skill eval")
389
+ parser.add_argument("--dataset", type=Path, default=DATASET)
390
+ parser.add_argument("--limit", type=int, default=0, help="run first N cases only")
391
+ parser.add_argument("--prompt-version", default="help.md")
392
+ parser.add_argument("--no-table", action="store_true", help="skip the detail table")
393
+ args = parser.parse_args()
394
+
395
+ cases = load_cases(args.dataset)
396
+ if args.limit:
397
+ cases = cases[: args.limit]
398
+
399
+ started = datetime.now()
400
+ print(f"Help Skill Eval -- {started:%Y-%m-%d %H:%M:%S}")
401
+ print(
402
+ f"dataset: {args.dataset.name} ({len(cases)} cases) model: {_model_name()} "
403
+ f"prompt: {args.prompt_version} target: HelpAgent.astream (live)"
404
+ )
405
+
406
+ results = [await run_case(case) for case in cases]
407
+
408
+ summary = summarize(results)
409
+ if not args.no_table:
410
+ print(format_table(results))
411
+ print(format_summary(summary, results))
412
+
413
+ meta = {
414
+ "timestamp": started.isoformat(timespec="seconds"),
415
+ "dataset": args.dataset.name,
416
+ "model": _model_name(),
417
+ "prompt_version": args.prompt_version,
418
+ "target": "src/agents/handlers/help.HelpAgent.astream",
419
+ }
420
+ report = build_report(results, summary, meta)
421
+ RESULTS_DIR.mkdir(parents=True, exist_ok=True)
422
+ out_path = RESULTS_DIR / f"help_result_{started:%Y-%m-%d_%H%M%S}.json"
423
+ out_path.write_text(json.dumps(report, ensure_ascii=False, indent=2), encoding="utf-8")
424
+ print(f"\n-> saved: {out_path.relative_to(_HERE.parent.parent)}")
425
+
426
+
427
+ if __name__ == "__main__":
428
+ asyncio.run(main())
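The held-out vs carried-over split described in the runner's docstring can be illustrated in isolation. The sketch below is a simplified stand-in, not the runner itself: the `Result` dataclass and the sample results are hypothetical, and only the rule "cases with `all_passed is None` are excluded from the rate" is taken from the code above.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any


@dataclass
class Result:
    """Hypothetical, trimmed-down stand-in for a per-case result."""

    id: str
    carried_over: bool
    all_passed: bool | None  # None = manual review, not auto-scored


def compliance(results: list[Result]) -> dict[str, Any]:
    """Share of auto-scored cases whose assertions all passed."""
    scored = [r for r in results if r.all_passed is not None]
    passed = sum(1 for r in scored if r.all_passed)
    return {
        "n": len(scored),
        "passed": passed,
        "compliance": round(passed / len(scored), 3) if scored else 0.0,
    }


results = [
    Result("a", carried_over=True, all_passed=True),
    Result("b", carried_over=True, all_passed=True),
    Result("c", carried_over=False, all_passed=True),
    Result("d", carried_over=False, all_passed=False),
    Result("e", carried_over=False, all_passed=None),  # manual, ignored by the rate
]

held_out = compliance([r for r in results if not r.carried_over])
carried_over = compliance([r for r in results if r.carried_over])

print("held_out", held_out)
print("carried_over", carried_over)
```

In this made-up sample the carried-over cases all pass while one held-out case fails, which is the pattern the docstring reads as a sign of the prompt overfitting to its own examples.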
eval/readiness/readiness_dataset.json CHANGED
@@ -1,40 +1,37 @@
1
  {
2
  "_about": "Golden dataset for the report-readiness signal (`src/agents/report/readiness.is_report_ready`). Deterministic (no LLM): each case declares an analysis state + a set of persisted AnalysisRecords/reports, and the runner feeds them through is_report_ready via injectable fake stores, scoring the boolean `ready` AND the `missing` gaps. Floor cases should score ~100% (regression value). The `alignment` group probes the deferred LLM-judge — see _alignment.",
3
- "_floor": "is_report_ready's deterministic floor: (1) problem_validated, (2) >=1 SUBSTANTIVE record, (3) delta-since-report. SUBSTANTIVE (KM-652 fix T1) = a record whose ANALYSIS task succeeded: tasks_run contains a task with status=success AND an analyze_* tool. A failed analysis still persists a record WITH findings (narrating the failure) and its data-access tasks (check_/retrieve_) succeed — so neither 'has findings' nor 'any task succeeded' counts. Only a successful analyze_* does.",
4
  "_records": "records[].analysis = 'success' (analyze_* succeeded → substantive) | 'failure' (analyze_* failed, data-access still succeeded — the real e2e case, NOT substantive) | 'none' (only check_/retrieve_ succeeded, no analyze task — NOT substantive; guards the 'any task succeeded' trap). records[].findings = count (a failure run still has findings; floor ignores them now). records[].age_min / reports[].age_min = minutes ago (smaller = newer).",
5
- "_alignment": "ALIGNMENT cases: a successful analysis (floor says ready=true) but `aligned=false` means it doesn't address the problem statement — a human would say NOT ready. Scored floor-correct, counted separately as the 'alignment gap' = evidence for/against the LLM-judge. Alignment label owner: Rifqi (report semantics) + Sofhia.",
6
  "schema": {
7
  "id": "stable per-case handle, <group>_<NN>",
8
  "group": "floor | delta | edge | alignment",
9
- "problem_validated": "bool",
10
  "report_id": "null = never generated; a string = a report exists",
11
  "records": "[{ analysis: success|failure|none, findings: int, age_min: int }]",
12
  "reports": "[{ age_min: int }] (only meaningful when report_id set)",
13
- "aligned": "bool — do the analyses address the problem statement? (floor ignores this)",
14
  "expected_ready": "what the deterministic floor SHOULD return",
15
- "expected_missing": "subset of [problem, analysis, delta]",
16
  "note": "human-readable description"
17
  },
18
  "cases": [
19
- { "id": "floor_01", "group": "floor", "problem_validated": false, "report_id": null, "records": [], "reports": [], "aligned": false, "expected_ready": false, "expected_missing": ["problem", "analysis"], "note": "new analysis: no validated goal and no records" },
20
- { "id": "floor_02", "group": "floor", "problem_validated": false, "report_id": null, "records": [{ "analysis": "success", "findings": 2, "age_min": 30 }], "reports": [], "aligned": true, "expected_ready": false, "expected_missing": ["problem"], "note": "has a successful analysis but goal not validated (isolates the problem gap)" },
21
- { "id": "floor_03", "group": "floor", "problem_validated": true, "report_id": null, "records": [], "reports": [], "aligned": false, "expected_ready": false, "expected_missing": ["analysis"], "note": "validated goal but no analysis run yet" },
22
- { "id": "floor_04", "group": "floor", "problem_validated": true, "report_id": null, "records": [{ "analysis": "failure", "findings": 3, "age_min": 20 }], "reports": [], "aligned": false, "expected_ready": false, "expected_missing": ["analysis"], "note": "T1 REGRESSION: analyze_* FAILED but the record still has 3 findings (narrating failure) + check/retrieve succeeded. Must NOT be ready — this is the live e2e case (analyze_aggregate failed, report still got generated under the old 'has findings' rule)." },
23
- { "id": "floor_05", "group": "floor", "problem_validated": true, "report_id": null, "records": [{ "analysis": "none", "findings": 0, "age_min": 15 }], "reports": [], "aligned": false, "expected_ready": false, "expected_missing": ["analysis"], "note": "T1 nuance: only data-access tasks (check/retrieve) succeeded, no analyze task. 'any task succeeded' would wrongly pass — must NOT be ready." },
24
- { "id": "floor_06", "group": "floor", "problem_validated": true, "report_id": null, "records": [{ "analysis": "success", "findings": 2, "age_min": 15 }], "reports": [], "aligned": true, "expected_ready": true, "expected_missing": [], "note": "validated + one successful analysis, no prior report → ready" },
25
- { "id": "floor_07", "group": "floor", "problem_validated": true, "report_id": null, "records": [{ "analysis": "success", "findings": 3, "age_min": 40 }, { "analysis": "success", "findings": 1, "age_min": 10 }], "reports": [], "aligned": true, "expected_ready": true, "expected_missing": [], "note": "multiple successful analyses → ready" },
26
- { "id": "floor_08", "group": "floor", "problem_validated": true, "report_id": null, "records": [{ "analysis": "failure", "findings": 3, "age_min": 30 }, { "analysis": "success", "findings": 2, "age_min": 10 }], "reports": [], "aligned": true, "expected_ready": true, "expected_missing": [], "note": "one failed + one successful analysis → the successful one is enough → ready" },
27
 
28
- { "id": "delta_01", "group": "delta", "problem_validated": true, "report_id": "rep-1", "records": [{ "analysis": "success", "findings": 2, "age_min": 120 }], "reports": [{ "age_min": 5 }], "aligned": true, "expected_ready": false, "expected_missing": ["delta"], "note": "report exists, all analysis older than it → nothing new to report" },
29
- { "id": "delta_02", "group": "delta", "problem_validated": true, "report_id": "rep-1", "records": [{ "analysis": "success", "findings": 2, "age_min": 5 }], "reports": [{ "age_min": 120 }], "aligned": true, "expected_ready": true, "expected_missing": [], "note": "newer successful analysis after the report → ready to regenerate" },
30
- { "id": "delta_03", "group": "delta", "problem_validated": true, "report_id": "rep-1", "records": [{ "analysis": "success", "findings": 1, "age_min": 90 }, { "analysis": "success", "findings": 2, "age_min": 10 }], "reports": [{ "age_min": 60 }], "aligned": true, "expected_ready": true, "expected_missing": [], "note": "one old + one newer-than-report success → ready" },
31
- { "id": "delta_04", "group": "delta", "problem_validated": true, "report_id": "rep-2", "records": [{ "analysis": "success", "findings": 2, "age_min": 90 }], "reports": [{ "age_min": 200 }, { "age_min": 30 }], "aligned": true, "expected_ready": false, "expected_missing": ["delta"], "note": "multiple reports — newest wins; analysis older than newest report → not ready" },
32
- { "id": "delta_05", "group": "delta", "problem_validated": true, "report_id": "rep-1", "records": [{ "analysis": "success", "findings": 2, "age_min": 120 }, { "analysis": "failure", "findings": 3, "age_min": 5 }], "reports": [{ "age_min": 60 }], "aligned": true, "expected_ready": false, "expected_missing": ["delta"], "note": "T1+delta: the only NEW analysis (age 5) is a FAILURE → no NEW substantive since the report → not ready. A failed retry must not unlock a duplicate report." },
33
 
34
- { "id": "edge_01", "group": "edge", "problem_validated": true, "report_id": null, "records": [], "reports": [], "aligned": false, "expected_ready": false, "expected_missing": ["analysis"], "note": "doc-only analysis (RAG, no structured run) produces no AnalysisRecord → never report-able under the floor. PRODUCT QUESTION: should doc-only be report-able?" },
35
 
36
- { "id": "align_01", "group": "alignment", "problem_validated": true, "report_id": null, "records": [{ "analysis": "success", "findings": 2, "age_min": 15 }], "reports": [], "aligned": false, "expected_ready": true, "expected_missing": [], "note": "GAP: successful analysis but it doesn't address the problem statement. Floor says ready; a human would say not-ready." },
37
- { "id": "align_02", "group": "alignment", "problem_validated": true, "report_id": null, "records": [{ "analysis": "success", "findings": 3, "age_min": 25 }, { "analysis": "success", "findings": 1, "age_min": 5 }], "reports": [], "aligned": false, "expected_ready": true, "expected_missing": [], "note": "GAP: lots of successful analysis, none aligned to the goal" },
38
- { "id": "align_03", "group": "alignment", "problem_validated": true, "report_id": null, "records": [{ "analysis": "success", "findings": 2, "age_min": 15 }], "reports": [], "aligned": true, "expected_ready": true, "expected_missing": [], "note": "control: successful AND aligned → genuinely ready, no gap" }
39
  ]
40
  }
 
1
  {
2
  "_about": "Golden dataset for the report-readiness signal (`src/agents/report/readiness.is_report_ready`). Deterministic (no LLM): each case declares an analysis state + a set of persisted AnalysisRecords/reports, and the runner feeds them through is_report_ready via injectable fake stores, scoring the boolean `ready` AND the `missing` gaps. Floor cases should score ~100% (regression value). The `alignment` group probes the deferred LLM-judge — see _alignment.",
3
+ "_floor": "is_report_ready's deterministic floor (KM-652, after the problem_validated gate was removed 2026-06-24): (1) >=1 SUBSTANTIVE record, (2) delta-since-report. SUBSTANTIVE = a record whose ANALYSIS task succeeded: tasks_run contains a task with status=success AND an analyze_* tool. A failed analysis still persists a record WITH findings (narrating the failure) and its data-access tasks (check_/retrieve_) succeed — so neither 'has findings' nor 'any task succeeded' counts. Only a successful analyze_* does.",
4
  "_records": "records[].analysis = 'success' (analyze_* succeeded → substantive) | 'failure' (analyze_* failed, data-access still succeeded — the real e2e case, NOT substantive) | 'none' (only check_/retrieve_ succeeded, no analyze task — NOT substantive; guards the 'any task succeeded' trap). records[].findings = count (a failure run still has findings; floor ignores them now). records[].age_min / reports[].age_min = minutes ago (smaller = newer).",
5
+ "_alignment": "ALIGNMENT cases: a successful analysis (floor says ready=true) but `aligned=false` means it doesn't address the analysis objective — a human would say NOT ready. Scored floor-correct, counted separately as the 'alignment gap' = evidence for/against the LLM-judge. Alignment label owner: Rifqi (report semantics) + Sofhia.",
6
  "schema": {
7
  "id": "stable per-case handle, <group>_<NN>",
8
  "group": "floor | delta | edge | alignment",
 
9
  "report_id": "null = never generated; a string = a report exists",
10
  "records": "[{ analysis: success|failure|none, findings: int, age_min: int }]",
11
  "reports": "[{ age_min: int }] (only meaningful when report_id set)",
12
+ "aligned": "bool — do the analyses address the objective? (floor ignores this)",
13
  "expected_ready": "what the deterministic floor SHOULD return",
14
+ "expected_missing": "subset of [analysis, delta]",
15
  "note": "human-readable description"
16
  },
17
  "cases": [
18
+ { "id": "floor_01", "group": "floor", "report_id": null, "records": [], "reports": [], "aligned": false, "expected_ready": false, "expected_missing": ["analysis"], "note": "new analysis: no analysis run yet → not ready" },
19
+ { "id": "floor_02", "group": "floor", "report_id": null, "records": [{ "analysis": "failure", "findings": 3, "age_min": 20 }], "reports": [], "aligned": false, "expected_ready": false, "expected_missing": ["analysis"], "note": "T1 REGRESSION: analyze_* FAILED but the record still has 3 findings (narrating failure) + check/retrieve succeeded. Must NOT be ready — this is the live e2e case (analyze_aggregate failed, report still got generated under the old 'has findings' rule)." },
20
+ { "id": "floor_03", "group": "floor", "report_id": null, "records": [{ "analysis": "none", "findings": 0, "age_min": 15 }], "reports": [], "aligned": false, "expected_ready": false, "expected_missing": ["analysis"], "note": "T1 nuance: only data-access tasks (check/retrieve) succeeded, no analyze task. 'any task succeeded' would wrongly pass — must NOT be ready." },
21
+ { "id": "floor_04", "group": "floor", "report_id": null, "records": [{ "analysis": "success", "findings": 2, "age_min": 15 }], "reports": [], "aligned": true, "expected_ready": true, "expected_missing": [], "note": "one successful analysis, no prior report → ready" },
22
+ { "id": "floor_05", "group": "floor", "report_id": null, "records": [{ "analysis": "success", "findings": 3, "age_min": 40 }, { "analysis": "success", "findings": 1, "age_min": 10 }], "reports": [], "aligned": true, "expected_ready": true, "expected_missing": [], "note": "multiple successful analyses → ready" },
23
+ { "id": "floor_06", "group": "floor", "report_id": null, "records": [{ "analysis": "failure", "findings": 3, "age_min": 30 }, { "analysis": "success", "findings": 2, "age_min": 10 }], "reports": [], "aligned": true, "expected_ready": true, "expected_missing": [], "note": "one failed + one successful analysis → the successful one is enough → ready" },
 
 
24
 
25
+ { "id": "delta_01", "group": "delta", "report_id": "rep-1", "records": [{ "analysis": "success", "findings": 2, "age_min": 120 }], "reports": [{ "age_min": 5 }], "aligned": true, "expected_ready": false, "expected_missing": ["delta"], "note": "report exists, all analysis older than it → nothing new to report" },
+ { "id": "delta_02", "group": "delta", "report_id": "rep-1", "records": [{ "analysis": "success", "findings": 2, "age_min": 5 }], "reports": [{ "age_min": 120 }], "aligned": true, "expected_ready": true, "expected_missing": [], "note": "newer successful analysis after the report → ready to regenerate" },
+ { "id": "delta_03", "group": "delta", "report_id": "rep-1", "records": [{ "analysis": "success", "findings": 1, "age_min": 90 }, { "analysis": "success", "findings": 2, "age_min": 10 }], "reports": [{ "age_min": 60 }], "aligned": true, "expected_ready": true, "expected_missing": [], "note": "one old + one newer-than-report success → ready" },
+ { "id": "delta_04", "group": "delta", "report_id": "rep-2", "records": [{ "analysis": "success", "findings": 2, "age_min": 90 }], "reports": [{ "age_min": 200 }, { "age_min": 30 }], "aligned": true, "expected_ready": false, "expected_missing": ["delta"], "note": "multiple reports — newest wins; analysis older than newest report → not ready" },
+ { "id": "delta_05", "group": "delta", "report_id": "rep-1", "records": [{ "analysis": "success", "findings": 2, "age_min": 120 }, { "analysis": "failure", "findings": 3, "age_min": 5 }], "reports": [{ "age_min": 60 }], "aligned": true, "expected_ready": false, "expected_missing": ["delta"], "note": "T1+delta: the only NEW analysis (age 5) is a FAILURE → no NEW substantive since the report → not ready. A failed retry must not unlock a duplicate report." },

+ { "id": "edge_01", "group": "edge", "report_id": null, "records": [], "reports": [], "aligned": false, "expected_ready": false, "expected_missing": ["analysis"], "note": "doc-only analysis (RAG, no structured run) produces no AnalysisRecord → never report-able under the floor. PRODUCT QUESTION: should doc-only be report-able?" },

+ { "id": "align_01", "group": "alignment", "report_id": null, "records": [{ "analysis": "success", "findings": 2, "age_min": 15 }], "reports": [], "aligned": false, "expected_ready": true, "expected_missing": [], "note": "GAP: successful analysis but it doesn't address the objective. Floor says ready; a human would say not-ready." },
+ { "id": "align_02", "group": "alignment", "report_id": null, "records": [{ "analysis": "success", "findings": 3, "age_min": 25 }, { "analysis": "success", "findings": 1, "age_min": 5 }], "reports": [], "aligned": false, "expected_ready": true, "expected_missing": [], "note": "GAP: lots of successful analysis, none aligned to the objective" },
+ { "id": "align_03", "group": "alignment", "report_id": null, "records": [{ "analysis": "success", "findings": 2, "age_min": 15 }], "reports": [], "aligned": true, "expected_ready": true, "expected_missing": [], "note": "control: successful AND aligned → genuinely ready, no gap" }
  ]
  }
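The delta and edge cases above imply a simple rule: a report is ready only when at least one substantive analysis is newer than the newest report. The sketch below is a hypothetical stand-in written from the dataset fields alone, not the real `is_report_ready`; the name `sketch_is_report_ready`, the `Readiness` dataclass, and the "success with at least one finding" definition of substantive are assumptions, and it returns the dataset's short codes (`analysis`, `delta`) rather than the exact `missing` strings.

```python
from dataclasses import dataclass


@dataclass
class Readiness:
    ready: bool
    missing: list[str]


def sketch_is_report_ready(records: list[dict], reports: list[dict]) -> Readiness:
    """Hypothetical readiness floor driven only by the eval dataset fields."""
    # Assumption: a record is substantive when it succeeded and produced findings.
    substantive = [
        r for r in records if r["analysis"] == "success" and r["findings"] > 0
    ]
    if not substantive:
        return Readiness(False, ["analysis"])
    if reports:
        # age_min counts minutes ago, so the newest report has the smallest age.
        newest_report_age = min(rep["age_min"] for rep in reports)
        if not any(r["age_min"] < newest_report_age for r in substantive):
            return Readiness(False, ["delta"])
    return Readiness(True, [])
```

Under these assumptions `delta_05` stays not ready because the only record newer than the report is a failure, which matches the note on that case.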
eval/readiness/run_eval.py CHANGED
@@ -35,7 +35,6 @@ from src.agents.gate import stub_analysis_state
35
  from src.agents.report.readiness import (
36
  _MISSING_ANALYSIS,
37
  _MISSING_DELTA,
38
- _MISSING_PROBLEM,
39
  is_report_ready,
40
  )
41
 
@@ -45,9 +44,9 @@ RESULTS_DIR = _HERE / "results"
45
  GROUPS = ["floor", "delta", "edge", "alignment"]
46
 
47
  # Dataset short codes -> the exact `missing` strings is_report_ready emits. Imported
48
- # from the module so the dataset stays readable and survives wording changes.
 
49
  _CODE_TO_MISSING = {
50
- "problem": _MISSING_PROBLEM,
51
  "analysis": _MISSING_ANALYSIS,
52
  "delta": _MISSING_DELTA,
53
  }
@@ -139,7 +138,9 @@ def _build_reports(specs: list[dict[str, Any]], now: datetime) -> list[_FakeRepo
139
 
140
  async def run_case(case: dict[str, Any]) -> CaseResult:
141
  now = datetime.now(UTC)
142
- state = stub_analysis_state(problem_validated=bool(case["problem_validated"]))
 
 
143
  if case.get("report_id"):
144
  state = state.model_copy(update={"report_id": case["report_id"]})
145
 
 
35
  from src.agents.report.readiness import (
36
  _MISSING_ANALYSIS,
37
  _MISSING_DELTA,
 
38
  is_report_ready,
39
  )
40
 
 
44
  GROUPS = ["floor", "delta", "edge", "alignment"]
45
 
46
  # Dataset short codes -> the exact `missing` strings is_report_ready emits. Imported
47
+ # from the module so the dataset stays readable and survives wording changes. The
48
+ # `problem` code was retired with the problem_validated gate (KM-652, 2026-06-24).
49
  _CODE_TO_MISSING = {
 
50
  "analysis": _MISSING_ANALYSIS,
51
  "delta": _MISSING_DELTA,
52
  }
 
138
 
139
  async def run_case(case: dict[str, Any]) -> CaseResult:
140
  now = datetime.now(UTC)
141
+ # The problem_validated gate was removed (KM-652); readiness no longer reads the goal,
142
+ # so a bare stub state + report_id is all is_report_ready needs.
143
+ state = stub_analysis_state()
144
  if case.get("report_id"):
145
  state = state.model_copy(update={"report_id": case["report_id"]})
146
 
main.py CHANGED
@@ -23,7 +23,7 @@ from src.api.v1.tools import router as tools_router
23
  from src.api.v1.help import router as help_router # pr/5 Phase 2: dedicated /tools/help
24
  from src.api.v2.chat import router as chat_v2_router # pr/5 Phase 2: v2 chat pilot (analysis_id)
25
  from src.db.postgres.init_db import init_db
26
- import os
27
  import uvicorn
28
 
29
  # Configure logging
@@ -34,7 +34,7 @@ logger = get_logger("main")
34
  @asynccontextmanager
35
  async def lifespan(app: FastAPI):
36
  logger.info("Starting application...")
37
- if os.getenv("SKIP_INIT_DB", "false").lower() != "true":
38
  await init_db()
39
  logger.info("Database initialized")
40
  else:
 
23
  from src.api.v1.help import router as help_router # pr/5 Phase 2: dedicated /tools/help
24
  from src.api.v2.chat import router as chat_v2_router # pr/5 Phase 2: v2 chat pilot (analysis_id)
25
  from src.db.postgres.init_db import init_db
26
+ from src.config.settings import settings
27
  import uvicorn
28
 
29
  # Configure logging
 
34
  @asynccontextmanager
35
  async def lifespan(app: FastAPI):
36
  logger.info("Starting application...")
37
+ if not settings.skip_init_db:
38
  await init_db()
39
  logger.info("Database initialized")
40
  else:
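The switch from `os.getenv("SKIP_INIT_DB", "false")` to `settings.skip_init_db` also flips the behavior when the variable is unset: previously `init_db()` ran, now it is skipped. A minimal stdlib sketch of the new decision follows; `should_run_init_db` and its set of truthy spellings are assumptions for illustration, since the real parsing is done by the settings class.

```python
def should_run_init_db(env: dict[str, str]) -> bool:
    """Return True when startup should call init_db() for the given environment."""
    raw = env.get("SKIP_INIT_DB")
    if raw is None:
        # New default (skip_init_db=True): an unset variable means "skip".
        return False
    # Assumption: a small set of truthy spellings stands in for the settings parser.
    skip = raw.strip().lower() in {"1", "true", "yes", "on"}
    return not skip
```

For example, `should_run_init_db({})` is `False`, while `should_run_init_db({"SKIP_INIT_DB": "false"})` is `True`, which is the opt-in a local Python-owned database would need.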
src/agents/handlers/help.py CHANGED
@@ -29,6 +29,7 @@ SEAMS:
29
 
30
  from __future__ import annotations
31
 
 
32
  from collections.abc import AsyncIterator
33
  from dataclasses import dataclass, field
34
  from pathlib import Path
@@ -49,8 +50,80 @@ _PROMPT_DIR = Path(__file__).resolve().parent.parent.parent / "config" / "prompt
49
  _SYSTEM_PROMPT_PATH = _PROMPT_DIR / "help.md"
50
  _GUARDRAILS_PATH = _PROMPT_DIR / "guardrails.md"
51
 
52
- # Neutral human turn when Help is triggered by a slash command with no real content.
53
- _DEFAULT_TRIGGER = "What should I do next?"
54
 
55
 
56
  @dataclass
@@ -107,13 +180,20 @@ def _build_context_block(
107
  state: AnalysisState,
108
  report_ready: ReportReadiness,
109
  available_actions: list[str],
 
110
  ) -> str:
111
- """Compose the deterministic context the prompt's 'never misguide' rule trusts."""
 
 
 
 
 
112
  return "\n\n".join(
113
  [
114
  _format_state(state),
115
  _format_report_ready(report_ready),
116
  "[Available actions]\n" + ", ".join(available_actions),
 
117
  ]
118
  )
119
 
@@ -178,17 +258,26 @@ class HelpAgent:
178
  """
179
  readiness = report_ready or ReportReadiness()
180
  actions = available_actions or _derive_available_actions(state, readiness)
 
 
 
 
 
181
  logger.info(
182
  "help guidance",
183
  report_ready=readiness.ready,
184
  available_actions=actions,
 
185
  )
186
 
187
  chain = self._ensure_chain()
 
 
 
188
  payload: dict[str, Any] = {
189
- "message": message or _DEFAULT_TRIGGER,
190
  "history": history or [],
191
- "context": _build_context_block(state, readiness, actions),
192
  }
193
  if callbacks:
194
  async for token in chain.astream(payload, config={"callbacks": callbacks}):
 
29
 
30
  from __future__ import annotations
31
 
32
+ import re
33
  from collections.abc import AsyncIterator
34
  from dataclasses import dataclass, field
35
  from pathlib import Path
 
50
  _SYSTEM_PROMPT_PATH = _PROMPT_DIR / "help.md"
51
  _GUARDRAILS_PATH = _PROMPT_DIR / "guardrails.md"
52
 
53
+ # Neutral human turn when Help is triggered by a slash command with no real content
54
+ # (button path passes message=None). Per language, so the synthetic turn never drags the
55
+ # reply toward English — without this the only human-turn signal on the button path would
56
+ # be an English sentence, and the model mirrors the last human turn's language.
57
+ _DEFAULT_TRIGGERS = {
58
+ "Indonesian": "Apa yang sebaiknya saya lakukan selanjutnya?",
59
+ "English": "What should I do next?",
60
+ }
61
+ _FALLBACK_LANGUAGE = "Indonesian" # team default when no human turn exists yet
62
+
63
+ # Lightweight, LLM-free language detection over the last human turn. The result is LOCKED
64
+ # into the prompt via a `[Reply language]` directive (see `_build_context_block`), so
65
+ # replying in the user's language is deterministic/mandatory — not a soft prompt hint that
66
+ # an English system prompt + English default trigger can override.
67
+ _ID_MARKERS = frozenset({
68
+ "yang", "dan", "apa", "gimana", "bagaimana", "kenapa", "mengapa", "aku", "saya",
69
+ "tolong", "ini", "itu", "nih", "dong", "kah", "untuk", "dengan", "pada", "adalah",
70
+ "tidak", "enggak", "nggak", "bisa", "mau", "buat", "dari", "kamu", "ya",
71
+ "berapa", "kapan", "siapa", "dimana", "juga", "sudah", "belum", "akan",
72
+ })
73
+ _EN_MARKERS = frozenset({
74
+ "the", "what", "how", "why", "please", "this", "that", "is", "are", "can", "could",
75
+ "should", "for", "with", "of", "and", "you", "do", "does", "when", "where",
76
+ "who", "which", "my", "me", "your", "have", "has", "want", "next",
77
+ })
78
+
79
+
80
+ def _last_human_text(history: list[BaseMessage] | None) -> str:
81
+ """Return the text of the most recent human turn in history, or '' if none."""
82
+ for msg in reversed(history or []):
83
+ if getattr(msg, "type", None) == "human":
84
+ content = msg.content
85
+ return content if isinstance(content, str) else str(content)
86
+ return ""
87
+
88
+
89
+ def _score_language(text: str) -> str | None:
90
+ """Return "Indonesian"/"English" from marker-word counts, or None if no signal."""
91
+ tokens = re.findall(r"[a-z']+", text.lower())
92
+ id_hits = sum(1 for t in tokens if t in _ID_MARKERS)
93
+ en_hits = sum(1 for t in tokens if t in _EN_MARKERS)
94
+ if en_hits > id_hits:
95
+ return "English"
96
+ if id_hits > en_hits:
97
+ return "Indonesian"
98
+ return None
99
+
100
+
101
+ def _detect_reply_language(
102
+ history: list[BaseMessage] | None,
103
+ message: str | None = None,
104
+ goal_texts: list[str] | None = None,
105
+ ) -> str:
106
+ """Detect the reply language deterministically (no LLM), by signal priority.
107
+
108
+ 1. the user's turn — an explicit `message` (intent path) or the last human turn in
109
+ `history` (button path, where `message` is None);
110
+ 2. the user-authored goal (`objective` + `business_questions`) — required at
111
+ onboarding, so it's always present and is a reliable signal on a fresh analysis
112
+ that has no chat yet;
113
+ 3. the team default (Indonesian) — a safety net only, for a stub/legacy/empty-goal
114
+ state where nothing above yields a signal.
115
+
116
+ Returns "Indonesian" or "English".
117
+ """
118
+ primary = (message or _last_human_text(history)).strip()
119
+ lang = _score_language(primary) if primary else None
120
+ if lang:
121
+ return lang
122
+ goal = " ".join(t for t in (goal_texts or []) if t).strip()
123
+ lang = _score_language(goal) if goal else None
124
+ if lang:
125
+ return lang
126
+ return _FALLBACK_LANGUAGE
127
 
128
 
129
  @dataclass
 
180
  state: AnalysisState,
181
  report_ready: ReportReadiness,
182
  available_actions: list[str],
183
+ reply_language: str = _FALLBACK_LANGUAGE,
184
  ) -> str:
185
+ """Compose the deterministic context the prompt's 'never misguide' rule trusts.
186
+
187
+ `reply_language` is a hard directive: the prompt is told to reply ONLY in this
188
+ language, so the answer matches the user's language even on the button path (where
189
+ the synthetic human turn would otherwise pull the reply toward English).
190
+ """
191
  return "\n\n".join(
192
  [
193
  _format_state(state),
194
  _format_report_ready(report_ready),
195
  "[Available actions]\n" + ", ".join(available_actions),
196
+ f"[Reply language]\nRespond ONLY in: {reply_language}",
197
  ]
198
  )
199
 
 
258
  """
259
  readiness = report_ready or ReportReadiness()
260
  actions = available_actions or _derive_available_actions(state, readiness)
261
+ goal_texts = [
262
+ getattr(state, "objective", "") or "",
263
+ *(getattr(state, "business_questions", None) or []),
264
+ ]
265
+ reply_language = _detect_reply_language(history, message, goal_texts=goal_texts)
266
  logger.info(
267
  "help guidance",
268
  report_ready=readiness.ready,
269
  available_actions=actions,
270
+ reply_language=reply_language,
271
  )
272
 
273
  chain = self._ensure_chain()
274
+ default_trigger = _DEFAULT_TRIGGERS.get(
275
+ reply_language, _DEFAULT_TRIGGERS[_FALLBACK_LANGUAGE]
276
+ )
277
  payload: dict[str, Any] = {
278
+ "message": message or default_trigger,
279
  "history": history or [],
280
+ "context": _build_context_block(state, readiness, actions, reply_language),
281
  }
282
  if callbacks:
283
  async for token in chain.astream(payload, config={"callbacks": callbacks}):
src/agents/planner/inputs.py CHANGED
@@ -31,11 +31,24 @@ class ColumnSummary(BaseModel):
31
  top_values: list[Any] | None = None
32
 
33
 
 
 
 
 
 
 
 
 
 
 
 
 
34
  class TableSummary(BaseModel):
35
  table_id: str
36
  name: str
37
  row_count: int | None = None
38
  columns: list[ColumnSummary] = Field(default_factory=list)
 
39
 
40
 
41
  class StructuredSourceSummary(BaseModel):
@@ -89,6 +102,16 @@ class CatalogSummary(BaseModel):
89
  )
90
  for col in table.columns
91
  ],
 
 
 
 
 
 
 
 
 
 
92
  )
93
  for table in source.tables
94
  ]
@@ -111,6 +134,12 @@ class CatalogSummary(BaseModel):
111
  lines: list[str] = []
112
  for source in self.structured_sources:
113
  lines.append(f"Source: {source.name} ({source.source_type}) — id={source.source_id}")
 
 
 
 
 
 
114
  for table in source.tables:
115
  rc = f" ({table.row_count:,} rows)" if table.row_count is not None else ""
116
  lines.append(f" Table: {table.name}{rc} — id={table.table_id}")
@@ -121,6 +150,16 @@ class CatalogSummary(BaseModel):
121
  f" - {col.name} [{col.data_type}]: "
122
  f"samples={samples}{top} — id={col.column_id}"
123
  )
 
 
 
 
 
 
 
 
 
 
124
  lines.append("")
125
 
126
  if self.unstructured_sources:
 
31
  top_values: list[Any] | None = None
32
 
33
 
34
+ class ForeignKeySummary(BaseModel):
35
+ """A declared FK edge — the only joins the IR validator accepts.
36
+
37
+ Maps directly onto a `retrieve_data` IR join: `column_id` → `left_column_id`,
38
+ `target_table_id` → `target_table_id`, `target_column_id` → `right_column_id`.
39
+ """
40
+
41
+ column_id: str
42
+ target_table_id: str
43
+ target_column_id: str
44
+
45
+
46
  class TableSummary(BaseModel):
47
  table_id: str
48
  name: str
49
  row_count: int | None = None
50
  columns: list[ColumnSummary] = Field(default_factory=list)
51
+ foreign_keys: list[ForeignKeySummary] = Field(default_factory=list)
52
 
53
 
54
  class StructuredSourceSummary(BaseModel):
 
102
  )
103
  for col in table.columns
104
  ],
105
+ # The declared FKs — the only joins the validator accepts. FKs
106
+ # carry no PII (ids only), so they're always surfaced.
107
+ foreign_keys=[
108
+ ForeignKeySummary(
109
+ column_id=fk.column_id,
110
+ target_table_id=fk.target_table_id,
111
+ target_column_id=fk.target_column_id,
112
+ )
113
+ for fk in table.foreign_keys
114
+ ],
115
  )
116
  for table in source.tables
117
  ]
 
134
  lines: list[str] = []
135
  for source in self.structured_sources:
136
  lines.append(f"Source: {source.name} ({source.source_type}) — id={source.source_id}")
137
+ # Name lookups (within a source) so FK edges render with readable
138
+ # table/column names alongside the ids the IR join must copy verbatim.
139
+ table_name_by_id = {t.table_id: t.name for t in source.tables}
140
+ col_name_by_id = {
141
+ c.column_id: c.name for t in source.tables for c in t.columns
142
+ }
143
  for table in source.tables:
144
  rc = f" ({table.row_count:,} rows)" if table.row_count is not None else ""
145
  lines.append(f" Table: {table.name}{rc} — id={table.table_id}")
 
150
  f" - {col.name} [{col.data_type}]: "
151
  f"samples={samples}{top} — id={col.column_id}"
152
  )
153
+ for fk in table.foreign_keys:
154
+ tgt_table = table_name_by_id.get(fk.target_table_id, fk.target_table_id)
155
+ tgt_col = col_name_by_id.get(fk.target_column_id, fk.target_column_id)
156
+ src_col = col_name_by_id.get(fk.column_id, fk.column_id)
157
+ lines.append(
158
+ f" FK: {src_col} → {tgt_table}.{tgt_col} "
159
+ f"(join: target_table_id={fk.target_table_id}, "
160
+ f"left_column_id={fk.column_id}, "
161
+ f"right_column_id={fk.target_column_id})"
162
+ )
163
  lines.append("")
164
 
165
  if self.unstructured_sources:
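The `FK:` line carries both readable names and the three ids a join must copy verbatim, which is what lets a literal id lookup accept or reject a proposed join. The sketch below shows that round trip under stated assumptions: `FkEdge`, `render_fk_line`, and `join_is_declared` are hypothetical names, the join is modelled as a plain dict, and the check is only an illustration of a literal lookup, not the IR validator's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FkEdge:
    column_id: str
    target_table_id: str
    target_column_id: str


def render_fk_line(
    fk: FkEdge, table_names: dict[str, str], column_names: dict[str, str]
) -> str:
    """Render one FK edge with names for reading and ids for copying."""
    src_col = column_names.get(fk.column_id, fk.column_id)
    tgt_table = table_names.get(fk.target_table_id, fk.target_table_id)
    tgt_col = column_names.get(fk.target_column_id, fk.target_column_id)
    return (
        f"FK: {src_col} → {tgt_table}.{tgt_col} "
        f"(join: target_table_id={fk.target_table_id}, "
        f"left_column_id={fk.column_id}, "
        f"right_column_id={fk.target_column_id})"
    )


def join_is_declared(join: dict[str, str], declared: list[FkEdge]) -> bool:
    """Accept a join only if all three ids match one declared edge exactly."""
    return any(
        join.get("target_table_id") == fk.target_table_id
        and join.get("left_column_id") == fk.column_id
        and join.get("right_column_id") == fk.target_column_id
        for fk in declared
    )
```

Because the comparison is on ids only, a join that names the right tables but paraphrases an id would still be rejected, which is why the prompt asks the planner to copy the ids unchanged.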
src/catalog/fk_inference.py ADDED
@@ -0,0 +1,97 @@
+ """Heuristic foreign-key inference for catalogs that ship no declared FKs.
+
+ The dedorch catalog (written by Go's introspection) currently carries **no**
+ `foreign_keys`, so the FK-backed-joins-only IR validator rejects every join the
+ planner proposes — cross-table questions ("revenue by product") can't run even
+ though the planner picks the right columns. Until Go captures real FK
+ constraints, we infer the obvious relational edges from naming conventions so the
+ planner and the validator agree on the same catalog.
+
+ Conservative by design (a wrong edge would silently corrupt joined results):
+   - `schema` (database) sources only — joins are DB-only anyway
+   - a foreign key is only inferred from a column named ``<base>_id``
+   - the target must be the SINGLE other table whose name matches ``<base>``
+     (singular/plural) and exposes an ``id`` column of the SAME data_type
+   - ambiguous matches (0 or >1 candidate tables) are skipped, never guessed
+   - sources that already declare ANY foreign key are left untouched (trust Go)
+ """
+
+ from __future__ import annotations
+
+ import re
+
+ from src.catalog.models import ForeignKey, Source
+ from src.middlewares.logging import get_logger
+
+ from .models import Catalog
+
+ logger = get_logger("fk_inference")
+
+ # `<base>_id` — the conventional foreign-key column name (base must be non-empty).
+ _ID_COL = re.compile(r"^(?P<base>.+)_id$", re.IGNORECASE)
+
+
+ def _table_matches_base(table_name: str, base: str) -> bool:
+     """Whether `table_name` is the table `<base>` refers to (singular/plural)."""
+     n = table_name.lower()
+     b = base.lower()
+     # `orders`↔`order`, `products`↔`product`, `sales_agents`↔`agent` (suffix),
+     # plus the singular form and the `-es` plural.
+     return n == b or n == b + "es" or n.endswith(b + "s")
+
+
+ def _infer_source(source: Source) -> int:
+     """Add inferred FK edges to one source's tables in place; return the count."""
+     added = 0
+     for table in source.tables:
+         for col in table.columns:
+             m = _ID_COL.match(col.name)
+             if not m:
+                 continue
+             base = m.group("base")
+             candidates: list[tuple[str, str]] = []  # (target_table_id, target_column_id)
+             for tgt in source.tables:
+                 if tgt.table_id == table.table_id:
+                     continue
+                 if not _table_matches_base(tgt.name, base):
+                     continue
+                 id_col = next(
+                     (
+                         c
+                         for c in tgt.columns
+                         if c.name.lower() == "id" and c.data_type == col.data_type
+                     ),
+                     None,
+                 )
+                 if id_col is not None:
+                     candidates.append((tgt.table_id, id_col.column_id))
+             # Only act on an unambiguous single match — never guess between many.
+             if len(candidates) != 1:
+                 continue
+             target_table_id, target_column_id = candidates[0]
+             table.foreign_keys.append(
+                 ForeignKey(
+                     column_id=col.column_id,
+                     target_table_id=target_table_id,
+                     target_column_id=target_column_id,
+                 )
+             )
+             added += 1
+     return added
+
+
+ def infer_foreign_keys(catalog: Catalog) -> Catalog:
+     """Infer FK edges in place for schema sources that declare none. Returns `catalog`.
+
+     Sources that already carry any declared FK are left as-is (Go's real FKs win).
+     """
+     total = 0
+     for source in catalog.sources:
+         if source.source_type != "schema":
+             continue
+         if any(t.foreign_keys for t in source.tables):
+             continue  # real FKs present — trust them, infer nothing
+         total += _infer_source(source)
+     if total:
+         logger.info("inferred foreign keys", user_id=catalog.user_id, count=total)
+     return catalog
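The naming heuristic is easiest to judge by running it on a few table names. The sketch below copies the matching expression from `_table_matches_base` and wraps it in a simplified inference over plain dicts; `infer_edges` and the `{table: {column: data_type}}` shape are assumptions for illustration, not the catalog models used in the module.

```python
import re

ID_COL = re.compile(r"^(?P<base>.+)_id$", re.IGNORECASE)


def table_matches_base(table_name: str, base: str) -> bool:
    """Same expression as the diff's helper: exact, `-es` plural, or `<base>s` suffix."""
    n = table_name.lower()
    b = base.lower()
    return n == b or n == b + "es" or n.endswith(b + "s")


def infer_edges(tables: dict[str, dict[str, str]]) -> list[tuple[str, str]]:
    """Infer `<table>.<col> -> <target>.id` edges; assumes the key column is named `id`."""
    edges: list[tuple[str, str]] = []
    for table, cols in tables.items():
        for col, dtype in cols.items():
            m = ID_COL.match(col)
            if not m:
                continue
            base = m.group("base")
            candidates = [
                other
                for other, other_cols in tables.items()
                if other != table
                and table_matches_base(other, base)
                and other_cols.get("id") == dtype  # same data type required
            ]
            # Exactly one candidate, otherwise skip rather than guess.
            if len(candidates) == 1:
                edges.append((f"{table}.{col}", f"{candidates[0]}.id"))
    return edges
```

Two consequences of the matcher are worth noting when reading results: `category_id` does not match a `categories` table (the `y` to `ies` plural is not covered), and because of the suffix rule a `subproducts` table also matches base `product`, which makes `product_id` ambiguous and therefore skipped.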
src/catalog/render.py CHANGED
@@ -65,5 +65,11 @@ def render_source(source: Source) -> str:
65
  tgt_col_name = col_names_by_id.get(fk.target_table_id, {}).get(
66
  fk.target_column_id, fk.target_column_id
67
  )
68
- lines.append(f" - {src_col_name} -> {tgt_table_name}.{tgt_col_name}")
 
 
 
 
 
 
69
  return "\n".join(lines)
 
65
  tgt_col_name = col_names_by_id.get(fk.target_table_id, {}).get(
66
  fk.target_column_id, fk.target_column_id
67
  )
68
+ # Include the join ids inline — the planner must copy these verbatim
69
+ # into the IR join, and the IRValidator does a literal id lookup.
70
+ lines.append(
71
+ f" - {src_col_name} -> {tgt_table_name}.{tgt_col_name} "
72
+ f"(join: target_table_id={fk.target_table_id}, "
73
+ f"left_column_id={fk.column_id}, right_column_id={fk.target_column_id})"
74
+ )
75
  return "\n".join(lines)
src/catalog/store.py CHANGED
@@ -1,7 +1,9 @@
1
- """CatalogStore — persists per-user catalogs as Postgres jsonb rows.
2
 
3
- Storage shape: one row per user in a `catalogs` table with columns
4
- (user_id PK, data jsonb, schema_version, generated_at, updated_at).
 
 
5
  """
6
 
7
  from sqlalchemy import case, delete, func, select
@@ -11,6 +13,7 @@ from src.db.postgres.connection import AsyncSessionLocal
11
  from src.db.postgres.models import Catalog as CatalogRow
12
  from src.middlewares.logging import get_logger
13
 
 
14
  from .models import Catalog
15
 
16
  logger = get_logger("catalog_store")
@@ -27,30 +30,43 @@ class CatalogStore:
27
  async def get(self, user_id: str) -> Catalog | None:
28
  async with AsyncSessionLocal() as session:
29
  result = await session.execute(
30
- select(CatalogRow.data).where(CatalogRow.user_id == user_id)
 
 
 
31
  )
32
  row = result.scalar_one_or_none()
33
  if row is None:
34
  return None
35
- return Catalog.model_validate(row)
 
 
 
36
 
37
  async def upsert(self, catalog: Catalog) -> None:
 
 
38
  payload = catalog.model_dump(mode="json")
39
  async with AsyncSessionLocal() as session:
40
  stmt = insert(CatalogRow).values(
 
41
  user_id=catalog.user_id,
42
- data=payload,
43
  schema_version=catalog.schema_version,
44
  generated_at=catalog.generated_at,
45
  updated_at=func.now(),
46
  )
47
  stmt = stmt.on_conflict_do_update(
48
  index_elements=[CatalogRow.user_id],
 
49
  set_={
50
- "data": stmt.excluded.data,
51
  "schema_version": stmt.excluded.schema_version,
52
  "updated_at": case(
53
- (stmt.excluded.data != CatalogRow.data, func.now()),
 
 
 
54
  else_=CatalogRow.updated_at,
55
  ),
56
  },
 
1
+ """CatalogStore — reads the per-user catalog from the dedorch `data_catalog` table.
2
 
3
+ Storage shape (Go-owned): one row per scope in `data_catalog`
4
+ (id, scope_type, user_id, analysis_id, catalog_payload jsonb, schema_version,
5
+ generated_at, updated_at). Python reads the user-scoped row (scope_type='user');
6
+ Go's `catalog.Service` owns all writes, so `upsert`/`remove_source` are legacy.
7
  """
8
 
9
  from sqlalchemy import case, delete, func, select
 
13
  from src.db.postgres.models import Catalog as CatalogRow
14
  from src.middlewares.logging import get_logger
15
 
16
+ from .fk_inference import infer_foreign_keys
17
  from .models import Catalog
18
 
19
  logger = get_logger("catalog_store")
 
30
  async def get(self, user_id: str) -> Catalog | None:
31
  async with AsyncSessionLocal() as session:
32
  result = await session.execute(
33
+ select(CatalogRow.catalog_payload).where(
34
+ CatalogRow.user_id == user_id,
35
+ CatalogRow.scope_type == "user",
36
+ )
37
  )
38
  row = result.scalar_one_or_none()
39
  if row is None:
40
  return None
41
+ # dedorch catalogs ship no foreign_keys (Go introspection drops them),
42
+ # but the IR validator only allows FK-backed joins. Infer the obvious
43
+ # edges so the planner and validator agree. No-op once Go emits real FKs.
44
+ return infer_foreign_keys(Catalog.model_validate(row))
45
 
46
  async def upsert(self, catalog: Catalog) -> None:
47
+ # Legacy: Go's catalog.Service owns catalog writes now. Kept working (and
48
+ # reconciled to the dedorch shape) but no longer on any live Python path.
49
  payload = catalog.model_dump(mode="json")
50
  async with AsyncSessionLocal() as session:
51
  stmt = insert(CatalogRow).values(
52
+ scope_type="user",
53
  user_id=catalog.user_id,
54
+ catalog_payload=payload,
55
  schema_version=catalog.schema_version,
56
  generated_at=catalog.generated_at,
57
  updated_at=func.now(),
58
  )
59
  stmt = stmt.on_conflict_do_update(
60
  index_elements=[CatalogRow.user_id],
61
+ index_where=CatalogRow.scope_type == "user",
62
  set_={
63
+ "catalog_payload": stmt.excluded.catalog_payload,
64
  "schema_version": stmt.excluded.schema_version,
65
  "updated_at": case(
66
+ (
67
+ stmt.excluded.catalog_payload != CatalogRow.catalog_payload,
68
+ func.now(),
69
+ ),
70
  else_=CatalogRow.updated_at,
71
  ),
72
  },
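The reworked `upsert` targets a conflict on `user_id` restricted to `scope_type = 'user'` and only bumps `updated_at` when the payload actually changes. The sketch below reproduces that pattern with the standard-library `sqlite3` module instead of Postgres and SQLAlchemy, purely so it can run anywhere; the partial unique index `uq_catalog_user`, the integer timestamps, and the trimmed column list are assumptions, and the real table is defined by Go's migrations.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE data_catalog (
        id INTEGER PRIMARY KEY,
        scope_type TEXT NOT NULL,
        user_id TEXT NOT NULL,
        catalog_payload TEXT NOT NULL,
        updated_at INTEGER NOT NULL
    )
    """
)
# Assumption: uniqueness on user_id applies only to user-scoped rows.
conn.execute(
    "CREATE UNIQUE INDEX uq_catalog_user ON data_catalog (user_id) "
    "WHERE scope_type = 'user'"
)


def upsert_user_catalog(user_id: str, payload: str, now: int) -> None:
    """Insert or update the user-scoped row; keep updated_at if the payload is unchanged."""
    conn.execute(
        """
        INSERT INTO data_catalog (scope_type, user_id, catalog_payload, updated_at)
        VALUES ('user', ?, ?, ?)
        ON CONFLICT (user_id) WHERE scope_type = 'user' DO UPDATE SET
            updated_at = CASE
                WHEN excluded.catalog_payload != data_catalog.catalog_payload
                THEN excluded.updated_at
                ELSE data_catalog.updated_at
            END,
            catalog_payload = excluded.catalog_payload
        """,
        (user_id, payload, now),
    )


def user_updated_at(user_id: str) -> int:
    """Read back updated_at for the user-scoped row."""
    row = conn.execute(
        "SELECT updated_at FROM data_catalog WHERE user_id = ? AND scope_type = 'user'",
        (user_id,),
    ).fetchone()
    return row[0]
```

The conflict target repeats the index predicate so the database can match it to the partial index; rows with another `scope_type` for the same user are unaffected by the uniqueness rule.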
src/config/prompts/help.md CHANGED
@@ -1,8 +1,14 @@
1
- <!-- help.md · v2 · Help skill prompt. v2 (2026-06-24, KM-652): removed the problem_statement
2
- skill + the problem_validated gate — the goal (objective + business_questions) is now set
3
- in the New Analysis form at onboarding, so Help no longer steers users to define/validate a
4
- goal in chat. Bump to v3 (don't silently overwrite) on the next major change (e.g. real UI
5
- steps from the frontend). -->
 
 
 
 
 
 
6
 
7
  You are the **Help guide** for an AI data-analysis assistant. Think of yourself as the
8
  instruction sheet that comes with a board game: your only job is to tell the user
@@ -23,6 +29,7 @@ You are given context, never raw user prose to analyze:
23
  - `ready` (bool) — whether there is enough analysis to generate a report.
24
  - `missing` (list) — if not ready, the gaps to fill.
25
  - **`available_actions`** *(optional)* — which actions are actually wired right now. If present, **only suggest actions listed here.**
 
26
 
27
  > **Hard rule — never misguide.** Trust the signals above for *what is possible*, not your
28
  > own guess. If `report_ready.ready` is `false`, do **not** tell the user to generate a
@@ -72,8 +79,13 @@ Do not over-promise the report's depth.
72
  ## Tone
73
 
74
  Plain, warm, and encouraging — like a helpful guide, **not** a hype trailer. No exclamation
75
- spam, no overselling. Respond in the **user's language** (match `chat_history` — Indonesian or
76
- English). A few sentences is usually enough.
 
 
 
 
 
77
 
78
  ## Constraints
79
 
@@ -86,15 +98,21 @@ English). A few sentences is usually enough.
86
  ## Examples
87
 
88
  ```
89
- State: chat_history nearly empty
 
 
 
90
  → "Your goal is set — you can start exploring now. Try a basic question first, like
91
  'Which products sell the most?' or 'How have monthly sales trended?', then we can dig into
92
  what's driving your objective."
93
 
94
- State: report_ready.ready=false, missing=["no comparison over time"]
95
- → "Good progress. Before a report, it's worth looking at change over time — try asking
96
- 'How does this quarter compare to last?' Once we have that, we can put the report together."
 
 
97
 
 
98
  State: report_ready.ready=true
99
  → "You've covered enough to summarize. You can generate your report now — run /report
100
  or use the report option to create it."
 
1
+ <!-- help.md · v3 · Help skill prompt.
2
+ v2 (2026-06-24, KM-652): removed the problem_statement skill + the problem_validated gate
3
+ the goal (objective + business_questions) is now set in the New Analysis form at onboarding,
4
+ so Help no longer steers users to define/validate a goal in chat.
5
+ v3 (2026-07-02): (a) reply language is now a hard rule driven by the [Reply language]
6
+ directive (the button path was defaulting to English); (b) Examples got stable ids
7
+ ("id: ..." comment above each) so eval/help can mirror them as carried_over regression
8
+ cases, and the second example now uses a REAL `missing` value from report/readiness.py —
9
+ the old "no comparison over time" string is never emitted by is_report_ready.
10
+ Bump to v4 (don't silently overwrite) on the next major change (e.g. real UI steps from
11
+ the frontend). -->
12
 
13
  You are the **Help guide** for an AI data-analysis assistant. Think of yourself as the
14
  instruction sheet that comes with a board game: your only job is to tell the user
 
29
  - `ready` (bool) — whether there is enough analysis to generate a report.
30
  - `missing` (list) — if not ready, the gaps to fill.
31
  - **`available_actions`** *(optional)* — which actions are actually wired right now. If present, **only suggest actions listed here.**
32
+ - **`[Reply language]`** — the language you MUST reply in (detected deterministically from the user's last turn). This is an instruction, not a suggestion — see the hard rule below.
33
 
34
  > **Hard rule — never misguide.** Trust the signals above for *what is possible*, not your
35
  > own guess. If `report_ready.ready` is `false`, do **not** tell the user to generate a
 
79
  ## Tone
80
 
81
  Plain, warm, and encouraging — like a helpful guide, **not** a hype trailer. No exclamation
82
+ spam, no overselling. A few sentences is usually enough.
83
+
84
+ > **Hard rule — reply language.** Reply **only** in the language named in `[Reply language]`.
85
+ > This is mandatory and overrides the language of this prompt, its examples, and the trigger
86
+ > question. If `[Reply language]` says `Indonesian`, answer entirely in Indonesian even though
87
+ > these instructions are in English; if it says `English`, answer in English. Never mix
88
+ > languages or switch mid-reply.
89
 
90
  ## Constraints
91
 
 
98
  ## Examples
99
 
100
  ```
101
+ <!-- id: help_ex_orient -->
102
+ State: objective="understand monthly sales performance",
103
+ business_questions=["which products drive revenue?"],
104
+ chat_history empty, report_ready.ready=false, missing=["at least one completed analysis"]
105
  → "Your goal is set — you can start exploring now. Try a basic question first, like
106
  'Which products sell the most?' or 'How have monthly sales trended?', then we can dig into
107
  what's driving your objective."
108
 
109
+ <!-- id: help_ex_guard_delta -->
110
+ State: report_ready.ready=false, missing=["a new analysis since the last report"]
111
+ → "You already have a report, and nothing new has come in since. Ask something that builds
112
+ on your objective — a fresh cut, a new time period, or a different angle — and we can
113
+ regenerate the report with that."
114
 
115
+ <!-- id: help_ex_guard_ready -->
116
  State: report_ready.ready=true
117
  → "You've covered enough to summarize. You can generate your report now — run /report
118
  or use the report option to create it."
src/config/prompts/planner.md CHANGED
@@ -41,15 +41,20 @@ only a `TaskList` object that conforms to the provided schema.
41
  (referencing the upstream result's column aliases).
42
  - **Measure by a dimension in another table (joins).** When the number you are
43
  aggregating and the grouping dimension live in DIFFERENT tables of the same
44
- database source, add a `joins` entry to the `retrieve_data` IR along a foreign
45
- key declared in the catalog; do NOT pick a table that lacks the measure, and do
46
- NOT try to "combine" unrelated tables. Example — "revenue by category": the
47
- measure `order_items.line_total` joined to `products` on
48
- `order_items.product_id = products.id`, grouped by `products.category`. Prefer an
49
- existing measure column over recomputing; use a single table (no join) when the
50
- measure and dimension already live together (e.g. "revenue by region" from
51
- `orders.region` + `orders.total_amount`). Joins are database-only, not available
52
- for tabular/file sources.
 
 
 
 
 
53
  - **Mixing structured + unstructured.** If qualitative context helps, add a
54
  `retrieve_knowledge` task against an unstructured source listed in the catalog.
55
  - **CRISP-DM stages.** Tag each task with the stage it serves:
 
41
  (referencing the upstream result's column aliases).
42
  - **Measure by a dimension in another table (joins).** When the number you are
43
  aggregating and the grouping dimension live in DIFFERENT tables of the same
44
+ database source, add a `joins` entry to the `retrieve_data` IR. **Join ONLY on a
45
+ foreign key listed in the catalog.** Each joinable relationship appears as an
46
+ `FK:` line under its table, e.g.
47
+ `FK: product_id → products.id (join: target_table_id=t_products, left_column_id=c_oi_product_id, right_column_id=c_products_id)`;
48
+ copy those three ids verbatim into the join (`target_table_id`,
49
+ `left_column_id`, `right_column_id`). Example "revenue by category": the measure
50
+ `order_items.line_total` joined to `products` on `order_items.product_id =
51
+ products.id`, grouped by `products.category`. **If no `FK:` line links the tables
52
+ you need, do NOT invent a join** — the validator rejects any join that isn't a
53
+ declared FK. Instead use a single table when the measure and dimension already
54
+ live together (e.g. "revenue by region" from `orders.region` +
55
+ `orders.total_amount`); if they genuinely aren't linked, say the data isn't
56
+ connected rather than guessing. Prefer an existing measure column over
57
+ recomputing. Joins are database-only — not available for tabular/file sources.
58
  - **Mixing structured + unstructured.** If qualitative context helps, add a
59
  `retrieve_knowledge` task against an unstructured source listed in the catalog.
60
  - **CRISP-DM stages.** Tag each task with the stage it serves:
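The new prompt text says the validator rejects any join that is not a declared FK. The sketch below illustrates that rule under stated assumptions: `ForeignKey`, `validate_join`, and the dict shape of a join entry are hypothetical names invented here, not the repository's actual validator or IR schema. Only the three id fields and their example values come from the prompt.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ForeignKey:
    """One catalog `FK:` line, reduced to the three ids the planner copies (hypothetical type)."""
    target_table_id: str
    left_column_id: str
    right_column_id: str


def validate_join(join: dict, declared_fks: set) -> None:
    """Raise ValueError unless the join matches a declared foreign key exactly."""
    candidate = ForeignKey(
        join["target_table_id"],
        join["left_column_id"],
        join["right_column_id"],
    )
    if candidate not in declared_fks:
        raise ValueError(f"join is not a declared FK: {candidate}")


# Ids taken from the example FK line in the prompt.
declared = {ForeignKey("t_products", "c_oi_product_id", "c_products_id")}

good_join = {
    "target_table_id": "t_products",
    "left_column_id": "c_oi_product_id",
    "right_column_id": "c_products_id",
}
validate_join(good_join, declared)  # copied verbatim, so it passes silently

# An invented join: same tables, but a column pair no FK line declares.
invented_join = {**good_join, "right_column_id": "c_products_name"}
try:
    validate_join(invented_join, declared)
    rejected = ""
except ValueError as err:
    rejected = str(err)
```

Matching on the exact id triple, rather than on table or column names, is one way to make "copy those three ids verbatim" mechanically checkable.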
src/config/settings.py CHANGED
@@ -30,6 +30,12 @@ class Settings(BaseSettings):
      # to avoid .env churn; remove once no environment references it.
      enable_gate: bool = Field(alias="enable_gate", default=False)

+     # Skip init_db() (create_all + startup DDL) on boot. TRUE by default post-dedorch
+     # cutover: Go owns the dedorch schema, so Python (consumer-only role) must NOT run
+     # init_db — its ALTER/index DDL on Go-owned tables fails with InsufficientPrivilege
+     # ("must be owner of table rooms"). Set to false only for a local Python-owned DB.
+     skip_init_db: bool = Field(alias="SKIP_INIT_DB", default=True)
+
      # Database
      postgres_connstring: str

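The comment above describes a startup guard: `init_db()` runs only when `SKIP_INIT_DB` is explicitly false. A minimal stdlib sketch of that behaviour follows. `read_skip_init_db` and `run_startup` are hypothetical helpers, and the accepted truthy strings are an assumption; the real `Settings` class parses the flag through its own `Field` definition, not through this code.

```python
from typing import Callable, Mapping


def read_skip_init_db(env: Mapping[str, str]) -> bool:
    """Return the SKIP_INIT_DB flag, defaulting to True when it is unset."""
    raw = env.get("SKIP_INIT_DB")
    if raw is None:
        return True
    # Assumed set of truthy spellings; anything else counts as false.
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def run_startup(env: Mapping[str, str], init_db: Callable[[], None]) -> bool:
    """Call init_db() only when the flag is false; return whether it ran."""
    if read_skip_init_db(env):
        return False
    init_db()
    return True


calls = []
# Default (flag unset): Go owns the schema, so init_db is skipped.
ran_default = run_startup({}, lambda: calls.append("init"))
# Local Python-owned database: opt back in explicitly.
ran_local = run_startup({"SKIP_INIT_DB": "false"}, lambda: calls.append("init"))
```

In a real process the first argument would be `os.environ`; passing a plain mapping here keeps the sketch testable without touching the environment.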
src/db/postgres/models.py CHANGED
@@ -6,9 +6,11 @@ from sqlalchemy import (
      Column,
      DateTime,
      ForeignKey,
+     Index,
      Integer,
      String,
      Text,
+     text,
  )
  from sqlalchemy.dialects.postgresql import JSONB, UUID
  from sqlalchemy.orm import relationship
@@ -108,23 +110,44 @@ class DatabaseClient(Base):


  class Catalog(Base):
-     """Per-user data catalog stored as a single jsonb row.
+     """Data catalog dedorch **`data_catalog`** (Go-owned; reconciled 2026-07-01).

-     `data` holds the full Pydantic Catalog (src/catalog/models.py:Catalog)
-     serialized via `model_dump(mode="json")`. Read path uses
-     `Catalog.model_validate(...)` to rehydrate.
+     Mirrors Go migration `0001`/`0002`. One jsonb `catalog_payload` per scope:
+     `scope_type='user'` rows are keyed by `user_id` (partial unique index),
+     `scope_type='analysis'` rows by `analysis_id`. Python is **consumer-only** —
+     Go's `catalog.Service` owns all writes (DB/file ingestion); `CatalogStore`
+     reads the user-scoped catalog and its write methods are legacy.

-     Dedicated table kept separate from `langchain_pg_embedding` so unstructured
-     embeddings and structured-catalog metadata never share storage.
+     `catalog_payload` holds the full Pydantic Catalog (src/catalog/models.py:Catalog)
+     serialized via `model_dump(mode="json")`; the read path rehydrates with
+     `Catalog.model_validate(...)`. Go writes the same shape (json tags match).
      """
      __tablename__ = "data_catalog"

-     user_id = Column(String, primary_key=True)
-     data = Column(JSONB, nullable=False)
+     id = Column(UUID(as_uuid=False), primary_key=True, default=lambda: str(uuid4()))
+     scope_type = Column(String, nullable=False, default="user")  # 'user' | 'analysis'
+     user_id = Column(String, nullable=False, index=True)
+     analysis_id = Column(UUID(as_uuid=False), nullable=True)
+     catalog_payload = Column(JSONB, nullable=False)
      schema_version = Column(String, nullable=False, default="1.0")
-     generated_at = Column(DateTime(timezone=True), server_default=func.now())
+     generated_at = Column(DateTime(timezone=True), nullable=False, server_default=func.now())
      updated_at = Column(DateTime(timezone=True), onupdate=func.now())

+     __table_args__ = (
+         Index(
+             "idx_data_catalog_user_scope",
+             "user_id",
+             unique=True,
+             postgresql_where=text("scope_type = 'user'"),
+         ),
+         Index(
+             "idx_data_catalog_analysis_scope",
+             "analysis_id",
+             unique=True,
+             postgresql_where=text("scope_type = 'analysis'"),
+         ),
+     )
+

  class ReportInputRow(Base):
      """One row per completed slow-path analysis (the report's source of truth).
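The two partial unique indexes in the new `Catalog` model each enforce uniqueness within a single scope only. The sketch below reproduces that behaviour with the stdlib `sqlite3` module, which also supports `CREATE UNIQUE INDEX ... WHERE`. It is an illustration only: the real table lives in PostgreSQL and is created by the Go migrations, and the column types here are simplified stand-ins.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE data_catalog (
        id TEXT PRIMARY KEY,
        scope_type TEXT NOT NULL DEFAULT 'user',
        user_id TEXT NOT NULL,
        analysis_id TEXT,
        catalog_payload TEXT NOT NULL
    );
    CREATE UNIQUE INDEX idx_data_catalog_user_scope
        ON data_catalog (user_id) WHERE scope_type = 'user';
    CREATE UNIQUE INDEX idx_data_catalog_analysis_scope
        ON data_catalog (analysis_id) WHERE scope_type = 'analysis';
    """
)

insert = "INSERT INTO data_catalog VALUES (?, ?, ?, ?, ?)"
conn.execute(insert, ("row-1", "user", "u1", None, "{}"))
# Same user, different scope: allowed, because the user-scope index
# only covers rows where scope_type = 'user'.
conn.execute(insert, ("row-2", "analysis", "u1", "a1", "{}"))

try:
    # A second user-scoped row for u1 violates the partial unique index.
    conn.execute(insert, ("row-3", "user", "u1", None, "{}"))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM data_catalog").fetchone()[0]
```

The result is one user-scoped catalog per `user_id` and one analysis-scoped catalog per `analysis_id`, while a user can still own rows in both scopes.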
src/query/executor/db.py CHANGED
@@ -121,7 +121,9 @@ class DbExecutor(BaseExecutor):
              logger.error(
                  "db executor failed",
                  source_id=ir.source_id,
-                 error=str(e),
+                 # repr, not str: some exceptions (e.g. Fernet InvalidToken) have an
+                 # empty str(), which hides the real failure as error="".
+                 error=repr(e),
                  elapsed_ms=elapsed_ms,
              )
              return QueryResult(
@@ -235,7 +237,9 @@ class DbExecutor(BaseExecutor):
              creds = decrypt_credentials_dict(client.credentials)
              await asyncio.to_thread(cls._warm_sync, client_id, client.db_type, creds)
          except Exception as exc:  # noqa: BLE001 — best-effort warming
-             logger.info("prewarm skipped", source_id=source.source_id, error=str(exc))
+             # repr, not str: empty-str exceptions (e.g. Fernet InvalidToken)
+             # would otherwise log as error="".
+             logger.info("prewarm skipped", source_id=source.source_id, error=repr(exc))

      @staticmethod
      def _warm_sync(client_id: str, db_type: str, creds: dict) -> None:
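The switch from `str(e)` to `repr(e)` matters for exceptions raised without a message. The short sketch below shows the difference using a local stand-in class. `InvalidToken` here is defined in the snippet and is not imported from the `cryptography` package; `describe` is a hypothetical helper.

```python
class InvalidToken(Exception):
    """Local stand-in for an exception that is raised with no message."""


def describe(exc: BaseException) -> dict:
    """Return both renderings so the difference is visible side by side."""
    return {"str": str(exc), "repr": repr(exc)}


try:
    raise InvalidToken
except InvalidToken as exc:
    logged = describe(exc)

# logged["str"] is empty, so a log field built from it carries no information;
# logged["repr"] still names the exception class.
```

Because `repr` always includes the class name, the log line stays actionable even when the exception carries no arguments.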