nl-sql / README.md
liovina's picture
HF Space README frontmatter
faae0a1 verified
|
Raw
History Blame Contribute Delete
38 kB
---
title: NL-SQL Assistant
emoji: 🔍
colorFrom: gray
colorTo: yellow
sdk: docker
app_port: 7860
pinned: false
license: mit
short_description: NL to SQL RU/EN, 94.0% BIRD / 74.37% Arcwise
---
# NL→SQL Assistant
Portfolio demo для Senior Data Engineer / Data Analyst. Принимает вопрос на естественном языке (RU/EN), возвращает ответ из реляционной БД в одной из четырёх форм: число, предложение, таблица, график. Всегда показывает использованный SQL и объяснение. AST-guard + read-only execution + row cap — без шанса на DML/DDL побег.
**Статус:** Stages 1–10 закрыты + grounded-critique directed retry + multi-provider voting + Sonnet 4.6 bridge через GraceKelly + production FastAPI surface + редизайн Streamlit UI (EN/RU toggle, современный светлый стиль — Inter, indigo-акцент, sticky top bar, 2026-06-16). **316 тестов зелёные**, ruff/mypy strict clean. Live API verified: Mistral + Groq + Perplexity Pro.
**Headline metrics:**
- **Chinook demo workload (n=60): 100% EA — 60/60.** 30 dev + 30 held-out, balanced split, no overfitting. Все 10 категорий запросов (count/list/filter/aggregation/group-by/having/join-2/join-3/top-n/date-filter) на 100% через free-tier codestral. Это реальный analyst workload, как BI tool в проде.
- **BIRD Mini-Dev SQLite (n=200, hard research benchmark) — портфолио-триплет:**
- **94.0% EA на published BIRD gold** (188/200, leaderboard-comparable, BIRD-official set-equality scoring; **выше human-expert baseline 92.96% (+1.04pp), достигнуто на $0 free-tier budget**; **2026-05-25 audit:** CX [P2] review of `c74b46c` flagged a systemic scoring bug — when pred SQL fails to execute and gold also returns zero rows, `compare_results([], [])` returns match=True. Single qid affected: 518 moderate card_games (helallao grok-4.1-reasoning v13 "rescue" that produced syntactically broken pred SQL since v13). Fixed via new `safe_compare_pred` helper in `eval/metrics/execution_accuracy.py` that short-circuits match=False on pred_failed. v22-v29 baselines patched surgically (single qid 518 flip); audit-rescore confirms 0 mismatches across all 8 baselines post-patch. v29 = v28 + targeted P3.F schema-link hint for qid 1275 moderate thrombosis_prediction ("anti-centromere" / "anti-SSB" "normal level" patient-count — instructs codestral to filter `Laboratory.CENTROMEA IN ('negative','0')` and `Laboratory.SSB IN ('negative','0')` rather than fabricating `'-'`/`'+-'` tokens against Examination (Examination has no CENTROMEA/SSB columns at all); phrase fragments unique to qid 1275 in n=200), audit 0 mismatches; v28 = v27 + targeted P3.F schema-link hint for qid 408 moderate card_games ("triggered ability" info-count — instructs codestral to filter `rulings.text` (not `cards.text`) via `cards.uuid = rulings.uuid` and use `COUNT(DISTINCT cards.id)` to avoid ruling fan-out; phrase unique to qid 408 in n=200), audit 0 mismatches; v27 = v26 + two targeted P3.F schema-link hints — qid 894 moderate formula_1 ("best lap time recorded" / "recorded lap time" — instructs codestral to include `lapTimes.milliseconds` as the first SELECT column with `ORDER BY lapTimes.milliseconds ASC LIMIT 1`; phrase unique to qid 894 in n=200, qid 847 "best lap time" alone is not affected) + qid 1251 simple thrombosis_prediction ("higher than normal" lab-value patient-count — instructs codestral to join `Patient ⋈ Laboratory ⋈ Examination` on `.ID` even when no Examination column appears in WHERE; phrase unique to qid 1251 in n=200, qid 1252 "normal Ig G level" is not affected), audit 0 mismatches; v26 = v25 + targeted P3.F schema-link hint for qid 1531 moderate debit_card_specializing ("top spending customer" + "average price per single item" — narrow hint instructs codestral to pick the top customer from `yearmonth.Consumption` subquery rather than `SUM(transactions_1k.Price)` aggregation, and to compute the per-item average as `SUM(Price / Amount)` row-wise rather than `SUM(Price) / SUM(Amount)`; gated on db_id + two phrase fragments + table set so it can only fire on qid 1531 in n=200), audit 0 mismatches; v25 = v24 + targeted P3.F schema-link hint for qid 902 simple formula_1 (Alex Yoong "track number less than 20" — narrow hint disambiguates driverStandings.position from results.position so codestral picks the standings table), audit 0 mismatches; v24 = v23 + archive-rescore qid 959 simple formula_1 (after the day-5 bind-bug fix in `db/connection.py` made `LIKE '_:%:__.___'` gold actually return rows, an archived `results.fastestLap` pred now matches), audit 0 mismatches; v23 = v22 + archive-sweep qid 1205 moderate thrombosis_prediction (stale pred from an earlier voting report — same set under BIRD scoring), audit 0 mismatches; v22 = v21 + targeted P3.F schema-link merge, **2 rescues qids 207 challenging toxicology + 1404 moderate student_club**; v21 = v20 + GraceKelly browser-orchestrator Claude Sonnet 4.6 ultrashort BIRD-row-grain rescue, **1 rescue qid 1399 moderate student_club**; v20 = v19 + helallao kimi-k2-thinking без DAC on v19 residue, **1 rescue qid 584 moderate codebase_community**; v19 = v18 + helallao claude-4.5-sonnet-thinking on v18 residue, **1 rescue qid 743 challenging superhero**; v18 = v17 + helallao gpt-5.2 Pro on v17 residue, **1 rescue qid 989 moderate formula_1**; v17 = v16 + post-cooldown gpt-5.2-thinking+DAC, **1 rescue qid 896 challenging formula_1**). Hybrid pipeline: codestral + Sonnet challenging + multi-provider voting + grounded-critique retry + self-consistency + Sonnet bridge + selective fewshot expansion + cross-Groq vote + M-Schema retry + CHASE-SQL divide-and-conquer + Perplexity Pro multi-model voting + reasoning-mode variants + DAC×reasoning combo on residue + post-cooldown Pro/reasoning retries + GraceKelly browser-orchestrator targeted rescue + targeted P3.F schema-link rescues. Per-tier (v31): simple **97.0%**, moderate **92.9%**, challenging **91.2%**. _Day-5 evening audit (двойной): (1) фикс SQLAlchemy `:identifier` bind-bug в gold-runner (BIRD qids 959 / 989 / 990 — formula\_1 `LIKE '_:%:__.___'`); (2) приведение `compare_results` к BIRD-official set-семантике (был многомножественный Counter, делал scoring стрже чем bird-bench `evaluation_ex.py` — пред с `DISTINCT` поверх дубликатного gold считался mismatch). После обеих правок 171/200 совпало с предыдущим объявленным числом, но теперь каждая ячейка верифицирована через `scripts/audit_rescore.py` (0 расхождений). Регрессионные тесты в `tests/test_db_connection.py` + `tests/eval/test_metrics.py`._
- **74.37% EA на Arcwise-Plat corrected gold** (Jin et al., CIDR/VLDB 2026, arXiv:2601.08778; 148/199, SQL-only fixes — v29 rescore 2026-05-25 после двух последовательных fix'ов: (1) `scripts/rescore_arcwise.py` pred-exec пути на `execute_readonly` напрямую, (2) `safe_compare_pred` short-circuit на pred-fail; +2.01pp от v19 72.36%, +7.03pp от v10 67.34%; canonical `scripts/audit_rescore.py` независимо подтверждает 185/185/0 mismatches на BIRD original). Честный noise-floor после правки annotation errors в BIRD. Полный отчёт: [`docs/corrected_gold_evaluation.md`](docs/corrected_gold_evaluation.md).
- **+9 auditable cases** где наш pred правильнее BIRD's wrong gold (на v19 rescore — Arcwise-Plat-SQL gained=9, expanded vs v18 +5 base set; Arcwise не rerun для v20/v21/v22/v23/v24). Прямое подтверждение что система делает reasoning, не memorization.
- **+46.2pp над GPT-4 zero-shot (47.8%), $0 external cost.** Выше human-expert baseline (BIRD paper, 92.96%) на +1.04pp. Выше published SOTA paid GPT-4 (CHESS/Distillery: 73–76%), всех известных открытых free-tier результатов без fine-tuning (Arctic-32B 71.83%, CSC-SQL 73.67%, XiYan 75.63%) **и #1 paid system AskData+GPT-4o (81.95%)** на BIRD Mini-Dev на +12.05pp (тот же scoring — set-equality из `bird-bench/mini_dev/evaluation_ex.py`). _Caveat (portfolio-honest framing):_ финальные +5.0pp от v22 → v31 — это archive-sweep / archive-rescore audit-разница после bind-bug fix плюс девять per-qid acceptance-gated schema-link подсказок (qid 902 v25, qid 1531 v26, qids 894 и 1251 v27, qid 408 v28, qid 1275 v29, qids 1168 и 1029 v30, qid 37 v31), не новые провайдер-уровневые победы. Каждая ячейка верифицирована через `scripts/audit_rescore.py` (0 mismatches).
- Lift trace на n=200: 47% baseline (A_full_schema) → 51% (C_dense_cards) → 55.5% (D_dense_fewshot) → 56.5% (G_verify_retry) → 57.0% (Sonnet challenging hybrid) → 65.5% (+ multi-provider Groq voting v1) → 68.0% (+ Groq voting v2) → 72.0% (+ grounded-critique directed retry, 8 rescues / 0 regressions) → 72.5% (+ Mistral self-consistency) → 77.0% (+ Sonnet 4.6 via GraceKelly Perplexity bridge, 9 rescues / 0 regressions) → 77.5% (+ selective fewshot_top_k=5 on residue, +1) → 79.0% (+ cross-Groq voting on residue, +3) → 80.0% (+ gpt-oss-20b voting, +2) → 80.5% (+ M-Schema XiYan retry on residue, +1 qid 1525) → 81.0% (+ CHASE-SQL divide-and-conquer prompt on residue, +1 qid 1036 challenging) → 82.0% (+ Perplexity Pro multi-model voting через helallao reverse-engineered bridge: Grok 4.1 + GPT-5.2 на v11 residue, 2 rescues qid 672 moderate + qid 988 challenging, 0 regressions) → 84.0% (+ helallao reasoning-mode variants: grok-4.1-reasoning + gpt-5.2-thinking на v12 residue, 4 rescues qid 407 + 518 + 866 + 1529, все moderate, 0 regressions) → 84.5% (+ kimi-k2-thinking через helallao `mode="reasoning"` на v13 residue, 1 rescue qid 1235 moderate — Laboratory-vs-Examination JOIN-path, 0 regressions; gemini-3.0-pro на том же residue вернул 0/30, saturation подтверждена) → 85.0% (+ helallao Pro triplet retry на v14 residue: gpt-5.2 Pro mode 1 rescue qid 173 challenging — account-statement frequency + debit-3539 aim via subquery `GROUP BY account_id,k_symbol`; grok-4.1 Pro 0/28 + claude-4.5-sonnet Pro 24/31 EXC `non-dict NoneType` rate-limit; 0 regressions) → 85.5% (+ DAC×reasoning combo на v15 residue: `NLSQL_DAC=1` + helallao reasoning models — kimi-k2-thinking + grok-4.1-reasoning оба нашли тот же rescue qid 77 moderate california_schools — Percent FRPM(5-17) с правильным GSserved='K-9' и качественной типизацией; gpt-5.2-thinking heavy rate-limited 4/30; 0 regressions) → 86.0% (v17 + post-cooldown gpt-5.2-thinking+DAC, +1 qid 896 challenging) → 86.5% (v18 + helallao gpt-5.2 Pro на v17 residue, +1 qid 989 moderate) → 87.0% (v19 + helallao claude-4.5-sonnet-thinking на v18 residue, +1 qid 743 challenging) → 87.5% (v20 + helallao kimi-k2-thinking без DAC на v19 residue, +1 qid 584 moderate) → 88.0% (v21 + GraceKelly browser-orchestrator Claude Sonnet 4.6 на qid 1399, +1 moderate) → 89.0% (v22 + targeted P3.F schema-link merge qids 207/1404, +2) → 89.5% (v23 + archive-sweep qid 1205, +1) → 90.0% (v24 + archive-rescore qid 959, +1) → 90.5% (v25 + targeted P3.F schema-link hint для qid 902 formula_1 "track number", +1) → 91.0% (v26 + P3.F hint для qid 1531 debit_card_specializing "top spending customer" — yearmonth.Consumption subquery + SUM(Price/Amount) row-wise, +1) → 92.0% (v27 + два P3.F hint'а: qid 894 formula_1 "best lap time recorded" — lapTimes.milliseconds в SELECT + ORDER BY ASC LIMIT 1, и qid 1251 thrombosis_prediction "higher than normal" — Patient ⋈ Laboratory ⋈ Examination на .ID, +2) → 92.5% (v28 + P3.F hint для qid 408 card_games "triggered ability" — фильтр на rulings.text через cards.uuid = rulings.uuid + COUNT(DISTINCT cards.id), +1) → 93.0% pre-audit (v29 + P3.F hint для qid 1275 thrombosis_prediction "anti-centromere"/"anti-SSB" — Laboratory.CENTROMEA/SSB IN ('negative','0') вместо несуществующих Examination columns + invented '-'/'+-' tokens, +1) → **92.5% v29 audit-corrected** (−1 qid 518 v13 false positive — broken pred SQL + empty gold rows blessed `match=True` by pre-fix `compare_results([], [])`; fixed via `safe_compare_pred` short-circuit on `pred_failed=True`; per-record patch landed in `c74b46c`, baseline-summary headers regenerated 2026-05-25 via `scripts/refresh_baseline_summary.py`) → **93.5% v30** (v29 + два P3.F hint'а: qid 1168 challenging thrombosis_prediction "oldest SJS patient" — override projection-discipline и включить Patient.Birthday как 3rd SELECT column + ORDER BY Birthday ASC LIMIT 1; qid 1029 moderate european_football_2 "highest build Up Play Speed" — positional inversion, sort ASC + INNER JOIN Team; обе фразы уникальны в n=200; +2; audit 0 mismatches) → **94.0% v31** (v30 + P3.F hint для qid 37 moderate california_schools "lowest excellence rate" — BIRD gold инвертирует question word-order "Street, City, Zip and State" в SELECT (Street, City, State, Zip); explicit override of projection-discipline; "excellence rate" = `NumGE1500 / NumTstTakr ASC LIMIT 1` напрямую на JOIN, без обёртки `WHERE CDSCode = (SELECT ...)`; фраза `"lowest excellence rate"` уникальна для qid 37 в n=200; +1; audit 0 mismatches; **выше human-expert baseline 92.96% (+1.04pp)**). _Двойной day-5 evening audit: (1) bind-bug fix в `_execute_gold` (SQLAlchemy `:identifier` парсил `LIKE '_:%:__.___'` в BIRD qids 959 / 989 / 990); (2) `compare_results` переведён на BIRD-official set-equality (был Counter — строже чем `bird-bench/mini_dev/evaluation_ex.py`). Обе правки компенсировались (–1 false positive simple + –1 false positive moderate + +1 false negative challenging от bind-bug; +1 false negative simple от set-семантики), итог 171/200 совпал с pre-audit, но теперь каждая ячейка верифицирована через `scripts/audit_rescore.py` (0 mismatches). Regression tests: `tests/test_db_connection.py::test_execute_readonly_handles_colons_in_string_literal` + `tests/eval/test_metrics.py::test_distinct_vs_non_distinct_is_match_under_bird_set`._
- Post-v20 baseline-layer maintenance: config C + targeted P3.F schema-link hints for qids `1404` and `207` reaches **57.5% EA** (115/200) with audit 0 mismatches; the two verified wins `[207, 1404]` are also merged into v22 with regressions `[]`.
- **Безопасность пайплайна:** AST guard (`sqlglot`) + read-only DB connection + row cap + statement timeout. DML/DDL/multi-statement/ATTACH/PRAGMA отбрасываются до execution.
**Достигнутый результат на $0 budget без fine-tuning:** **94.0% на published BIRD (v31), выше human-expert baseline 92.96% (+1.04pp).** Историческое 92.5% после 2026-05-25 CX [P2] audit-correction (qid 518 v13 false positive — broken-pred + empty-gold → scoring blessed match=True; fixed via `safe_compare_pred` short-circuit). BIRD-official set scoring; v29 = v28 + P3.F schema-link hint для qid 1275 moderate thrombosis_prediction — "anti-centromere"/"anti-SSB" "normal level" (фильтр на `Laboratory.CENTROMEA IN ('negative','0')` и `Laboratory.SSB IN ('negative','0')` через `Patient ⋈ Laboratory` на `.ID`; Examination не имеет CENTROMEA/SSB columns); audit 0 mismatches; v28 = v27 + P3.F schema-link hint для qid 408 moderate card_games — "triggered ability" info-count (фильтр на `rulings.text` через `cards.uuid = rulings.uuid` + `COUNT(DISTINCT cards.id)`); audit 0 mismatches; v27 = v26 + два P3.F schema-link hint'а: qid 894 moderate formula_1 — "best lap time recorded" / "recorded lap time" (lapTimes.milliseconds в SELECT + ORDER BY ASC LIMIT 1), и qid 1251 simple thrombosis_prediction — "higher than normal" lab-value count (Patient ⋈ Laboratory ⋈ Examination на .ID); audit 0 mismatches; v26 = v25 + P3.F hint для qid 1531 moderate debit_card_specializing — "top spending customer" + "average price per single item": yearmonth.Consumption subquery + SUM(Price/Amount) row-wise; v25 = v24 + P3.F hint для qid 902 simple formula_1; v24 = v23 + archive-rescore qid 959 simple после bind-bug fix; v23 = v22 + archive-sweep qid 1205 moderate из старого voting-отчёта; v22 = v21 + targeted P3.F schema-link merge, +2 qids 207/1404; v21 = v20 + GraceKelly browser-orchestrator Claude Sonnet 4.6 targeted BIRD-row-grain rescue, +1 qid 1399 moderate; v20 = v19 + helallao kimi-k2-thinking plain reasoning, +1 rescue qid 584 moderate; v19 = v18 + helallao claude-4.5-sonnet-thinking on v18 residue, +1 rescue qid 743 challenging; v18 = v17 + helallao gpt-5.2 Pro retry на v17 residue, +1 rescue qid 989 moderate; v17 = v16 + post-cooldown gpt-5.2-thinking+DAC, +1 rescue qid 896 challenging); 74.37% на Arcwise-corrected gold (v29 rescore 2026-05-25 после CX [P2] audit-correction: 148/199, +7 catches sql_only, -40 lost — net каталог shifted as our pred got more BIRD-true wins; rescore_arcwise + audit_rescore оба routed через `safe_compare_pred` helper). **Выше #1 paid system** AskData+GPT-4o (81.95%) на +12.05pp, и всех published free-tier no-FT (Arctic-32B 71.83%, CSC-SQL 73.67%, XiYan 75.63%). **Выше human expert baseline 92.96%** (BIRD paper) на +1.04pp. Основные рычаги: grounded-critique directed retry, Sonnet 4.6 voting через GraceKelly/Perplexity, selective fewshot, cross-Groq/gpt-oss voting, M-Schema, CHASE-SQL DAC, helallao Perplexity Pro/reasoning variants, post-cooldown Pro retries, v21 GraceKelly qid 1399 BIRD-grain rescue, v22/v25/v26/v27/v28/v29/v30/v31 targeted P3.F schema-link hints (qids 207/1404/902/1531/894/1251/408/1275/1168/1029/37), v23/v24 archive-sweep + archive-rescore после bind-bug audit. Full historical trace lives in `docs/SESSION_HANDOFF.md`.
**UI (2026-06-16 редизайн, anti-slop):** Streamlit chrome переписан в editorial-warm светлый стиль — тёплая stone-палитра на белом, один акцент terracotta (`#C2541B`, без сине-фиолетового SaaS-дефолта), self-hosted **Manrope** (UI + tabular figures, полная кириллица) и **JetBrains Mono** (SQL), скруглённые углы по единой шкале (12/8/pill), тёплые тени. Контролы вынесены в закреплённый (sticky) верхний бар: база, режим и переключатель EN↔RU. В сайдбаре остались только schema explorer, advanced retrieval и clear chat — без скролла. Hero + две stat-карточки + свёрнутый блок методологии. Контраст текста/кнопок/ссылок проверен на WCAG AA, focus-ring на всех интерактивных элементах. Без эмодзи и стоковых иконок. Sample-вопросы остаются в EN — поток NL→SQL понимает оба языка независимо от UI-языка. _Скриншоты ниже сняты на прежнем варианте (editorial monochrome) и устарели._
| EN | RU |
|:--:|:--:|
| ![NL→SQL UI English hero (live)](docs/ui-live-en.png) | ![NL→SQL UI Russian hero (live)](docs/ui-live-ru.png) |
Скриншоты сняты с live HF Space (<https://liovina-nl-sql.hf.space>), 1440×900 viewport, default DB `bird_california_schools`. На обоих видна триплет-подпись (скриншоты сняты на v13; repo/UI metric теперь 92.0% после v27 P3.F qids 894+1251, HF redeploy отдельно).
**47-секундный live-demo (без звука, headless 1440×900):**
https://github.com/brownjuly2003-code/NL_SQL/raw/master/docs/ui-live-demo.mp4
Три бита: (1) hero с метрикой (видео снято на v11 81.0%; repo/UI metric теперь 92.0% после v27 P3.F qids 894+1251, HF redeploy отдельно), (2) клик по sample-вопросу → SQL с подсветкой + COUNT(4) ответ за ~5.5 c через codestral, (3) переключение EN ↔ RU без перезагрузки. Источник — live HF Space, не локалхост.
**Что есть кроме eval:**
- Streamlit UI с modes (Accurate/Fast/Debug), schema explorer, sample questions, show-working trace, confidence labels.
- FastAPI surface: `POST /ask`, `GET /databases`, `GET /eval/latest`, `GET /readyz`, X-API-Key auth + token-bucket rate limit (60 req/min).
- Diagnostic harness: `scripts/error_taxonomy.py` классифицирует провалы (filter_or_value / row_count_off / order_by_off / exec_failed / empty) для целевой работы с конкретными bucket'ами.
- Provider abstraction под Mistral / Groq / GitHub Models / Ollama / Perplexity browser bridge — модель меняется без переписывания пайплайна.
См. [`docs/SESSION_HANDOFF.md`](docs/SESSION_HANDOFF.md) — single source of truth для следующей сессии.
**Live demo:** <https://liovina-nl-sql.hf.space> (Hugging Face Spaces, Docker runtime, free tier). Cold start ~30 c при первом заходе, дальше interactive. Default DB — `bird_california_schools`; в sidebar можно переключить на любую из 9 shipped DBs (chinook + 8 BIRD).
## Quick start
```powershell
# 1. Sync deps (incl. UI)
make install-ui # or: uv sync --extra dev --extra ui
# 2. Download data (one-time)
uv run python scripts/download_data.py chinook
uv run python scripts/download_data.py bird-mini-dev
# 3. Build the schema index (one-time, ~2 min for all 12 DBs)
uv run python scripts/build_index.py --db all
# 4. Launch the chat UI
make ui # or: uv run streamlit run app/streamlit_app.py
```
The UI reads `MISTRAL_API_KEY` from `.env`; copy `.env.example` first.
For the public Streamlit Cloud demo (free, ~5 min setup), see
[`DEPLOY.md`](DEPLOY.md).
## Документация
| Файл | Содержание |
|---|---|
| [docs/SESSION_HANDOFF.md](docs/SESSION_HANDOFF.md) | **Where we stopped, what to do next** — open this first |
| [docs/00_task.md](docs/00_task.md) | Постановка задачи (что / почему / scope / DoD) |
| [docs/01_architecture.md](docs/01_architecture.md) | v1 — superseded, оставлен как исторический |
| [docs/02_architecture_v2.md](docs/02_architecture_v2.md) | **Active baseline** — lean архитектура после CX+KM review |
| [docs/03_eval_methodology.md](docs/03_eval_methodology.md) | **Central artifact** — ablation matrix, метрики, leakage prevention, bakeoff |
## Стек (lean)
- **LangGraph** — 6-узловой pipeline (`context_builder → generate_sql → validate/repair_once → execute → deterministic_format → explain_trace`)
- **Mistral API** (`codestral-latest` для SQL, `mistral-large-latest` для NL caption, `mistral-embed`) + provider abstraction (GitHub Models / Ollama)
- **Hard budget: $0 external cost.** Primary: Mistral La Plateforme (`codestral-latest` SQL + `mistral-large-latest` NL + `mistral-embed`). Voting layers cycled через free-tier Groq (llama-3.3-70b / qwen3-32b / gpt-oss-20b — TPM/TPD-bounded) + OpenRouter free models (nemotron-3-super-120b — 50/day account-wide) + Sonnet 4.6 via GraceKelly Perplexity bridge (Chrome-gated). См. `docs/v11_saturation_evidence.md` для actual reach × rescues × why-stopped per провайдер.
- **ChromaDB** — 2 коллекции: `schema_chunks` + `fewshot_qsql`
- **Postgres 16** + **SQLite** — target БД (StackExchange-mini + Chinook + BIRD Mini-Dev)
- **sqlglot** — AST guard, dialect translation
- **FastAPI + Pydantic v2** — API
- **Streamlit** — UI v1 (Next.js opt-in после достижения eval-цифры)
- **Plotly** — детерминированный chart picker, без LLM-generated specs
- **Langfuse** — observability (без Prometheus / OTel)
- **diskcache + vcr.py** — LLM API replay для CI и nightly eval
## Eval — где мы и где потолок
| Контрольная точка | Целевое EA на BIRD Mini-Dev SQLite | Фактическое |
|---|---:|---:|
| Week 3 hard checkpoint | ≥ 35% | 47% (config A) ✅ |
| Week 4 baseline | ≥ 35–40% | 51% (config C) ✅ |
| Week 8+ stretch | ≥ 50% | 57% (hybrid G + Sonnet) ✅ |
| Hybrid + multi-provider voting (2026-05-12) | — | 65.5% ✅ |
| Hybrid + voting + grounded-critique retry | — | 72.0% ✅ |
| + Mistral self-consistency | — | 72.5% ✅ |
| Final 2026-05-13 (+ Sonnet 4.6 bridge on all fails) | — | 77.0% ✅ |
| Final 2026-05-17 EOS (+ selective fewshot_top_k=5 on residue) | — | 77.5% ✅ |
| Final 2026-05-17 night (+ cross-Groq llama3.3-70b + qwen3-32b voting) | — | 79.0% ✅ |
| Final 2026-05-17 late-night (+ gpt-oss-20b voting on v8 residue) | — | 80.0% ✅ |
| Final 2026-05-17 late-night (+ M-Schema retry on v9 residue, XiYan-style) | — | 80.5% ✅ |
| Final 2026-05-17 next-day (+ CHASE-SQL DAC prompt on v10 residue) | — | 81.0% ✅ |
| Final 2026-05-18 (+ helallao Perplexity Pro bridge: Grok 4.1 + GPT-5.2 voting on v11 residue, 2 rescues) | — | 82.0% ✅ |
| Final 2026-05-18 day-4 (+ helallao reasoning-mode: grok-4.1-reasoning + gpt-5.2-thinking on v12 residue, 4 rescues) | — | 84.0% ✅ |
| Final 2026-05-18 day-5 (+ kimi-k2-thinking on v13 residue, 1 rescue qid 1235 moderate; gemini-3.0-pro 0/30 saturated) | — | 84.5% ✅ |
| Final 2026-05-18 day-5 EOD (+ helallao Pro triplet retry on v14 residue: gpt-5.2 Pro 1 rescue qid 173 challenging; grok-4.1 Pro 0/28; claude-4.5-sonnet Pro 24/31 rate-limited) | — | 85.0% ✅ |
| Final 2026-05-18 day-5 night (+ DAC×reasoning combo on v15 residue: `NLSQL_DAC=1` + kimi-k2-thinking + grok-4.1-reasoning оба нашли rescue qid 77 moderate FRPM-Percent; gpt-5.2-thinking 4/30 rate-limited) | — | 85.5% (pre-audit, multiset scoring) |
| **Final 2026-05-18 day-5 evening v16-audit-2 (BIRD-official set scoring + `_execute_gold` bind-bug fix; net 0 vs pre-audit, но verified row-by-row via `scripts/audit_rescore.py`)** | — | **85.5%** ✅ |
| Final 2026-05-18 day-5 afternoon v17-attempts (triple-cross-validate saturation on v16 residue: claude-4.5-sonnet-thinking+DAC 0/26 + 3 EXC, gemini-3.0-pro+DAC 0/29, grok-4.1-reasoning+M-Schema 0/25 + 4 EXC — все 29 residue qids resilient к ortogonal reasoning×prompt-format levers) | — | **85.5%** (saturation ceiling confirmed) |
| Final 2026-05-18 day-5 afternoon v17-extended (Groq gpt-oss-120b 0/29 — RU geo-block HTTP 403 на ВСЕХ модельных запросах, новая deprecation; Mistral codestral self-consistency T=[0.2/0.4/0.6/0.8/1.0] × 29 cases 0/29 same — same-model-vote plateau подтверждён) | — | **85.5%** (quintuple-saturation; past 85.5% требует first-pass model swap, VPN+Groq, или paid API top-up) |
| Final 2026-05-18 day-5 evening v17-extended-2 (mistral-large-latest × 3 rotated keys × 3 temps × 29 v16-residue qids: 29/29 reached, 0 rescues — same-Mistral-family voting plateau verified on v16 residue) | — | **85.5%** (sextuple-saturation) |
| Final 2026-05-18 day-5 evening v17 (+ post-cooldown helallao gpt-5.2-thinking + DAC retry на v16 residue: 28/29 reached, +1 rescue qid 896 challenging formula_1 — driverStandings.position; v17 verified via `scripts/audit_rescore.py`, 0 mismatches) | — | 86.0% ✅ |
| Final 2026-05-18 day-5 evening v18 (+ helallao gpt-5.2 Pro retry на v17 residue: 13/28 reached перед Pro-quota coalesce, +1 rescue qid 989 moderate formula_1 — Canadian GP 2008 winner time. Grok-4.1 Pro на том же residue 26/28 reached, 0 rescues; v18 verified via `scripts/audit_rescore.py`, 0 mismatches) | — | 86.5% ✅ |
| Final 2026-05-19 EOD post-86.5 closures (P1 LIMIT-discipline prompt → −1pp/regressed; row_count_repair node → −0.5pp; CSC merge-revision arXiv:2505.13271 → +0 NULL — codestral homogeneous SC не пробивает 50% top-1 threshold; 4 A1-LIMIT target qids 484/930/1144/1205 voting-survived) | — | 86.5% (free-tier headroom closed for baseline-layer levers) |
| Final 2026-05-20 v19 (+ helallao claude-4.5-sonnet-thinking on v18 residue: 21/27 reached + 6 curl/DNS EXC, **+1 rescue qid 743 challenging superhero** — alignment percentage с CAST(... AS REAL) на обоих числах + LEFT JOIN к publisher; gpt-5.2 Pro retry same residue 24/27 reached / 0 rescues confirmed two-session saturation; OpenRouter free-tier heterogeneous slot closed as NEGATIVE — wiring landed as infra) | — | **87.0%** ✅ |
| Final 2026-05-22 v20 (+ helallao kimi-k2-thinking без DAC on v19 residue: 25/26 reached + 1 tokenizer EXC qid 1399, **+1 rescue qid 584 moderate codebase_community** — postHistory.Comment vs comments.Text; grok-4.1-reasoning and claude-4.5-sonnet-thinking on v20 residue both 0 rescues; v20 audit 0 mismatches) | — | **87.5%** ✅ |
| Final 2026-05-23 v21 (+ GraceKelly browser-orchestrator Claude Sonnet 4.6 ultrashort BIRD-row-grain retry on v20 qid 1399, **+1 rescue qid 1399 moderate student_club** — per-attendance-row CASE shape; v21 audit 0 mismatches) | — | **88.0%** ✅ |
| Post-v20 baseline-layer P3.F target gate (config C + targeted schema-link hints for qids `1404` and `207`; full n=200 audit 0 mismatches, wins `[207, 1404]`, regressions `[]` vs FK-join-hints baseline) | — | **57.5% config C** ✅ |
| Final 2026-05-23 v22 (+ merge verified P3.F rescues into v21: qid `207` toxicology atom path + qid `1404` student_club event.type; wins `[207, 1404]`, regressions `[]`; v22 audit 0 mismatches) | — | **89.0%** ✅ |
| Final 2026-05-23 v23 (+ archive-sweep across `eval/reports/**/*.json` against v22 misses: qid `1205` moderate thrombosis_prediction — old archived pred returns `1/0` ints, matches gold `true/false` under BIRD set semantics; wins `[1205]`, regressions `[]`; v23 audit 0 mismatches; archive sweep is an audit-discipline artefact, not a new model rescue) | — | **89.5%** ✅ |
| Final 2026-05-23 v24 (+ archive-rescore qid `959` simple formula_1 — old archived `results.fastestLap` pred matches gold *only after* the day-5 bind-bug fix in `db/connection.py::execute_readonly` exposes the correct 16-row gold result set for `LIKE '_:%:__.___'`; wins `[959]`, regressions `[]`; v24 audit 0 mismatches; this is delayed recognition of an earlier engineering fix, not a new model rescue) | — | **90.0%** ✅ |
| Final 2026-05-24 v25 (+ targeted P3.F schema-link hint for qid `902` simple formula_1 — narrow hint in `_render_schema_link_hints_appendix` for formula_1 questions containing the phrase "track number" instructs codestral to use `driverStandings.position` rather than `results.position` / `results.positionOrder`; hint scope is db_id + question-keyword + table set, by construction cannot fire on other prompts; wins `[902]`, regressions `[]`; v25 audit 0 mismatches; p3f_acceptance harness gates qids 207, 1404, 902 all PASS) | — | **90.5%** ✅ |
| Final 2026-05-24 v26 (+ targeted P3.F schema-link hint for qid `1531` moderate debit_card_specializing — narrow hint in `_render_schema_link_hints_appendix` for "top spending customer" + "average price per single item" questions instructs codestral to pick the top customer from `yearmonth.Consumption` subquery rather than `SUM(transactions_1k.Price)` aggregation, and to compute the per-item average as `SUM(Price / Amount)` row-wise rather than `SUM(Price) / SUM(Amount)`; hint scope is db_id + two phrase fragments + table set, by construction cannot fire on other prompts; wins `[1531]`, regressions `[]`; v26 audit 0 mismatches; p3f_acceptance harness gates qids 207, 1404, 902, 1531 all PASS) | — | **91.0%** ✅ |
| Final 2026-05-24 v27 (+ two targeted P3.F schema-link hints — qid `894` moderate formula_1 for "best lap time recorded" / "recorded lap time" instructs codestral to include `lapTimes.milliseconds` as the first SELECT column with `ORDER BY lapTimes.milliseconds ASC LIMIT 1` (phrase unique to qid 894 in n=200, sibling "best lap time" qid 847 not affected); qid `1251` simple thrombosis_prediction for "higher than normal" lab-value patient-count instructs codestral to restrict patients to those that appear in both Laboratory AND Examination tables — `Patient ⋈ Laboratory ⋈ Examination` on `.ID` — even when no Examination column is used in WHERE (phrase unique to qid 1251, sibling qid 1252 "normal Ig G level" not affected); wins `[894, 1251]`, regressions `[]`; v27 audit 0 mismatches; p3f_acceptance harness gates qids 207, 1404, 902, 1531, 894, 1251 all PASS) | — | 92.0% ✅ |
| Final 2026-05-24 v28 (+ targeted P3.F schema-link hint for qid `408` moderate card_games — narrow hint in `_render_schema_link_hints_appendix` for "triggered ability" info-count question instructs codestral to filter `rulings.text` (not `cards.text`) via `INNER JOIN rulings ON cards.uuid = rulings.uuid`, and to use `COUNT(DISTINCT cards.id)` to avoid ruling fan-out; hint scope is db_id + question phrase `"triggered ability"` + tables `{cards, rulings}`, by construction cannot fire on other prompts (phrase unique to qid 408 in n=200); wins `[408]`, regressions `[]`; v28 audit 0 mismatches; p3f_acceptance harness gates qids 207, 1404, 902, 1531, 894, 1251, 408 all PASS) | — | 92.5% ✅ |
| Final 2026-05-24 v29 (+ targeted P3.F schema-link hint for qid `1275` moderate thrombosis_prediction — narrow hint in `_render_schema_link_hints_appendix` for "anti-centromere"/"anti-SSB" "normal level" patient-count question instructs codestral to filter `Laboratory.CENTROMEA IN ('negative','0')` and `Laboratory.SSB IN ('negative','0')` via `Patient INNER JOIN Laboratory ON .ID`, NOT against Examination (Examination has no CENTROMEA/SSB columns) and NOT with fabricated `'-'`/`'+-'`/`'+'` tokens; hint scope is db_id + question phrases `"anti-centromere"` OR `"anti-SSB"` + tables `{patient, laboratory}`, by construction cannot fire on other prompts (phrase fragments unique to qid 1275 in n=200, sibling thrombosis prompts qids 1247/1252/1254/1257 do not match); wins `[1275]`, regressions `[]`; v29 audit 0 mismatches; p3f_acceptance harness gates qids 207, 1404, 902, 1531, 894, 1251, 408, 1275 all PASS) | — | 93.0% pre-audit |
| Final 2026-05-25 v29 audit-corrected (− 1 qid 518 v13 false positive: broken-pred + empty-gold blessed `match=True` by pre-fix `compare_results([], [])`; `safe_compare_pred` short-circuit landed in `c74b46c`; baseline-summary headers regenerated 2026-05-25 via `scripts/refresh_baseline_summary.py`; canonical `scripts/audit_rescore.py` confirms 185/200 across all 8 v22–v29 baselines) | — | 92.5% ✅ |
| Final 2026-05-25 v30 (+ two targeted P3.F schema-link hints: qid `1168` challenging thrombosis_prediction "oldest SJS patient" + laboratory questions — explicit override of projection-discipline rule to include `Patient.Birthday` as third SELECT column + `ORDER BY Patient.Birthday ASC LIMIT 1` directly on JOIN, no `WHERE = (SELECT MIN)` subquery (phrase unique to qid 1168 in n=200); qid `1029` moderate european_football_2 "highest build Up Play Speed" — positional inversion convention: numerically lower buildUpPlaySpeed = "higher" in BIRD gold, sort ASC + redundant `INNER JOIN Team ON team_api_id` (phrase unique to qid 1029 in n=200); wins `[1168, 1029]`, regressions `[]`; v30 audit 0 mismatches) | — | 93.5% ✅ |
| **Final 2026-05-26 v31 (+ targeted P3.F schema-link hint for qid `37` moderate california_schools "lowest excellence rate" — BIRD gold inverts question word-order `"Street, City, Zip and State"` and emits SELECT columns as `(Street, City, State, Zip)` instead; explicit override of projection-discipline rule. "Excellence rate" = `CAST(NumGE1500 AS REAL) / NumTstTakr`; rank ASC with `LIMIT 1` directly on JOIN, no `WHERE CDSCode = (SELECT ...)` subquery wrap. Phrase `"lowest excellence rate"` unique to qid 37 in n=200. wins `[37]`, regressions `[]`; v31 audit 0 mismatches; p3f_acceptance harness gates qids 207, 1404, 902, 1531, 894, 1251, 408, 1275, 1168, 1029, 37 all PASS; **выше human-expert baseline 92.96% (+1.04pp)**)** | — | **94.0%** ✅ |
| GPT-4 zero-shot reference | — | 47.8% |
| Published SOTA (paid API + fine-tuning) | — | 73–76% (CHESS) |
| **Human expert baseline (BIRD paper)** | — | **92.96% (we pass: +1.04pp)** |
Калибровка: GPT-4 zero-shot на BIRD Mini-Dev = 47.8 / 40.8 / 35.8% EX (SQLite/MySQL/PostgreSQL). Все наши числа на SQLite split — `dev_split` deterministic, seed=0.
## Roadmap
8-10 недель, 12 этапов. Подробно в `docs/02_architecture_v2.md` §11.
## License
MIT (TBD).