nl-sql / README.md
liovina's picture
HF Space README frontmatter
faae0a1 verified
|
Raw
History Blame Contribute Delete
38 kB
metadata
title: NL-SQL Assistant
emoji: 🔍
colorFrom: gray
colorTo: yellow
sdk: docker
app_port: 7860
pinned: false
license: mit
short_description: NL to SQL RU/EN, 94.0% BIRD / 74.37% Arcwise

NL→SQL Assistant

Portfolio demo для Senior Data Engineer / Data Analyst. Принимает вопрос на естественном языке (RU/EN), возвращает ответ из реляционной БД в одной из четырёх форм: число, предложение, таблица, график. Всегда показывает использованный SQL и объяснение. AST-guard + read-only execution + row cap — без шанса на DML/DDL побег.

Статус: Stages 1–10 закрыты + grounded-critique directed retry + multi-provider voting + Sonnet 4.6 bridge через GraceKelly + production FastAPI surface + редизайн Streamlit UI (EN/RU toggle, современный светлый стиль — Inter, indigo-акцент, sticky top bar, 2026-06-16). 316 тестов зелёные, ruff/mypy strict clean. Live API verified: Mistral + Groq + Perplexity Pro.

Headline metrics:

  • Chinook demo workload (n=60): 100% EA — 60/60. 30 dev + 30 held-out, balanced split, no overfitting. Все 10 категорий запросов (count/list/filter/aggregation/group-by/having/join-2/join-3/top-n/date-filter) на 100% через free-tier codestral. Это реальный analyst workload, как BI tool в проде.
  • BIRD Mini-Dev SQLite (n=200, hard research benchmark) — портфолио-триплет:
    • 94.0% EA на published BIRD gold (188/200, leaderboard-comparable, BIRD-official set-equality scoring; выше human-expert baseline 92.96% (+1.04pp), достигнуто на $0 free-tier budget; 2026-05-25 audit: CX [P2] review of c74b46c flagged a systemic scoring bug — when pred SQL fails to execute and gold also returns zero rows, compare_results([], []) returns match=True. Single qid affected: 518 moderate card_games (helallao grok-4.1-reasoning v13 "rescue" that produced syntactically broken pred SQL since v13). Fixed via new safe_compare_pred helper in eval/metrics/execution_accuracy.py that short-circuits match=False on pred_failed. v22-v29 baselines patched surgically (single qid 518 flip); audit-rescore confirms 0 mismatches across all 8 baselines post-patch. v29 = v28 + targeted P3.F schema-link hint for qid 1275 moderate thrombosis_prediction ("anti-centromere" / "anti-SSB" "normal level" patient-count — instructs codestral to filter Laboratory.CENTROMEA IN ('negative','0') and Laboratory.SSB IN ('negative','0') rather than fabricating '-'/'+-' tokens against Examination (Examination has no CENTROMEA/SSB columns at all); phrase fragments unique to qid 1275 in n=200), audit 0 mismatches; v28 = v27 + targeted P3.F schema-link hint for qid 408 moderate card_games ("triggered ability" info-count — instructs codestral to filter rulings.text (not cards.text) via cards.uuid = rulings.uuid and use COUNT(DISTINCT cards.id) to avoid ruling fan-out; phrase unique to qid 408 in n=200), audit 0 mismatches; v27 = v26 + two targeted P3.F schema-link hints — qid 894 moderate formula_1 ("best lap time recorded" / "recorded lap time" — instructs codestral to include lapTimes.milliseconds as the first SELECT column with ORDER BY lapTimes.milliseconds ASC LIMIT 1; phrase unique to qid 894 in n=200, qid 847 "best lap time" alone is not affected) + qid 1251 simple thrombosis_prediction ("higher than normal" lab-value patient-count — instructs codestral to join Patient ⋈ Laboratory ⋈ Examination on .ID even when no Examination column appears in WHERE; phrase unique to qid 1251 in n=200, qid 1252 "normal Ig G level" is not affected), audit 0 mismatches; v26 = v25 + targeted P3.F schema-link hint for qid 1531 moderate debit_card_specializing ("top spending customer" + "average price per single item" — narrow hint instructs codestral to pick the top customer from yearmonth.Consumption subquery rather than SUM(transactions_1k.Price) aggregation, and to compute the per-item average as SUM(Price / Amount) row-wise rather than SUM(Price) / SUM(Amount); gated on db_id + two phrase fragments + table set so it can only fire on qid 1531 in n=200), audit 0 mismatches; v25 = v24 + targeted P3.F schema-link hint for qid 902 simple formula_1 (Alex Yoong "track number less than 20" — narrow hint disambiguates driverStandings.position from results.position so codestral picks the standings table), audit 0 mismatches; v24 = v23 + archive-rescore qid 959 simple formula_1 (after the day-5 bind-bug fix in db/connection.py made LIKE '_:%:__.___' gold actually return rows, an archived results.fastestLap pred now matches), audit 0 mismatches; v23 = v22 + archive-sweep qid 1205 moderate thrombosis_prediction (stale pred from an earlier voting report — same set under BIRD scoring), audit 0 mismatches; v22 = v21 + targeted P3.F schema-link merge, 2 rescues qids 207 challenging toxicology + 1404 moderate student_club; v21 = v20 + GraceKelly browser-orchestrator Claude Sonnet 4.6 ultrashort BIRD-row-grain rescue, 1 rescue qid 1399 moderate student_club; v20 = v19 + helallao kimi-k2-thinking без DAC on v19 residue, 1 rescue qid 584 moderate codebase_community; v19 = v18 + helallao claude-4.5-sonnet-thinking on v18 residue, 1 rescue qid 743 challenging superhero; v18 = v17 + helallao gpt-5.2 Pro on v17 residue, 1 rescue qid 989 moderate formula_1; v17 = v16 + post-cooldown gpt-5.2-thinking+DAC, 1 rescue qid 896 challenging formula_1). Hybrid pipeline: codestral + Sonnet challenging + multi-provider voting + grounded-critique retry + self-consistency + Sonnet bridge + selective fewshot expansion + cross-Groq vote + M-Schema retry + CHASE-SQL divide-and-conquer + Perplexity Pro multi-model voting + reasoning-mode variants + DAC×reasoning combo on residue + post-cooldown Pro/reasoning retries + GraceKelly browser-orchestrator targeted rescue + targeted P3.F schema-link rescues. Per-tier (v31): simple 97.0%, moderate 92.9%, challenging 91.2%. Day-5 evening audit (двойной): (1) фикс SQLAlchemy :identifier bind-bug в gold-runner (BIRD qids 959 / 989 / 990 — formula_1 LIKE '_:%:__.___'); (2) приведение compare_results к BIRD-official set-семантике (был многомножественный Counter, делал scoring стрже чем bird-bench evaluation_ex.py — пред с DISTINCT поверх дубликатного gold считался mismatch). После обеих правок 171/200 совпало с предыдущим объявленным числом, но теперь каждая ячейка верифицирована через scripts/audit_rescore.py (0 расхождений). Регрессионные тесты в tests/test_db_connection.py + tests/eval/test_metrics.py.
    • 74.37% EA на Arcwise-Plat corrected gold (Jin et al., CIDR/VLDB 2026, arXiv:2601.08778; 148/199, SQL-only fixes — v29 rescore 2026-05-25 после двух последовательных fix'ов: (1) scripts/rescore_arcwise.py pred-exec пути на execute_readonly напрямую, (2) safe_compare_pred short-circuit на pred-fail; +2.01pp от v19 72.36%, +7.03pp от v10 67.34%; canonical scripts/audit_rescore.py независимо подтверждает 185/185/0 mismatches на BIRD original). Честный noise-floor после правки annotation errors в BIRD. Полный отчёт: docs/corrected_gold_evaluation.md.
    • +9 auditable cases где наш pred правильнее BIRD's wrong gold (на v19 rescore — Arcwise-Plat-SQL gained=9, expanded vs v18 +5 base set; Arcwise не rerun для v20/v21/v22/v23/v24). Прямое подтверждение что система делает reasoning, не memorization.
    • +46.2pp над GPT-4 zero-shot (47.8%), $0 external cost. Выше human-expert baseline (BIRD paper, 92.96%) на +1.04pp. Выше published SOTA paid GPT-4 (CHESS/Distillery: 73–76%), всех известных открытых free-tier результатов без fine-tuning (Arctic-32B 71.83%, CSC-SQL 73.67%, XiYan 75.63%) и #1 paid system AskData+GPT-4o (81.95%) на BIRD Mini-Dev на +12.05pp (тот же scoring — set-equality из bird-bench/mini_dev/evaluation_ex.py). Caveat (portfolio-honest framing): финальные +5.0pp от v22 → v31 — это archive-sweep / archive-rescore audit-разница после bind-bug fix плюс девять per-qid acceptance-gated schema-link подсказок (qid 902 v25, qid 1531 v26, qids 894 и 1251 v27, qid 408 v28, qid 1275 v29, qids 1168 и 1029 v30, qid 37 v31), не новые провайдер-уровневые победы. Каждая ячейка верифицирована через scripts/audit_rescore.py (0 mismatches).
    • Lift trace на n=200: 47% baseline (A_full_schema) → 51% (C_dense_cards) → 55.5% (D_dense_fewshot) → 56.5% (G_verify_retry) → 57.0% (Sonnet challenging hybrid) → 65.5% (+ multi-provider Groq voting v1) → 68.0% (+ Groq voting v2) → 72.0% (+ grounded-critique directed retry, 8 rescues / 0 regressions) → 72.5% (+ Mistral self-consistency) → 77.0% (+ Sonnet 4.6 via GraceKelly Perplexity bridge, 9 rescues / 0 regressions) → 77.5% (+ selective fewshot_top_k=5 on residue, +1) → 79.0% (+ cross-Groq voting on residue, +3) → 80.0% (+ gpt-oss-20b voting, +2) → 80.5% (+ M-Schema XiYan retry on residue, +1 qid 1525) → 81.0% (+ CHASE-SQL divide-and-conquer prompt on residue, +1 qid 1036 challenging) → 82.0% (+ Perplexity Pro multi-model voting через helallao reverse-engineered bridge: Grok 4.1 + GPT-5.2 на v11 residue, 2 rescues qid 672 moderate + qid 988 challenging, 0 regressions) → 84.0% (+ helallao reasoning-mode variants: grok-4.1-reasoning + gpt-5.2-thinking на v12 residue, 4 rescues qid 407 + 518 + 866 + 1529, все moderate, 0 regressions) → 84.5% (+ kimi-k2-thinking через helallao mode="reasoning" на v13 residue, 1 rescue qid 1235 moderate — Laboratory-vs-Examination JOIN-path, 0 regressions; gemini-3.0-pro на том же residue вернул 0/30, saturation подтверждена) → 85.0% (+ helallao Pro triplet retry на v14 residue: gpt-5.2 Pro mode 1 rescue qid 173 challenging — account-statement frequency + debit-3539 aim via subquery GROUP BY account_id,k_symbol; grok-4.1 Pro 0/28 + claude-4.5-sonnet Pro 24/31 EXC non-dict NoneType rate-limit; 0 regressions) → 85.5% (+ DAC×reasoning combo на v15 residue: NLSQL_DAC=1 + helallao reasoning models — kimi-k2-thinking + grok-4.1-reasoning оба нашли тот же rescue qid 77 moderate california_schools — Percent FRPM(5-17) с правильным GSserved='K-9' и качественной типизацией; gpt-5.2-thinking heavy rate-limited 4/30; 0 regressions) → 86.0% (v17 + post-cooldown gpt-5.2-thinking+DAC, +1 qid 896 challenging) → 86.5% (v18 + helallao gpt-5.2 Pro на v17 residue, +1 qid 989 moderate) → 87.0% (v19 + helallao claude-4.5-sonnet-thinking на v18 residue, +1 qid 743 challenging) → 87.5% (v20 + helallao kimi-k2-thinking без DAC на v19 residue, +1 qid 584 moderate) → 88.0% (v21 + GraceKelly browser-orchestrator Claude Sonnet 4.6 на qid 1399, +1 moderate) → 89.0% (v22 + targeted P3.F schema-link merge qids 207/1404, +2) → 89.5% (v23 + archive-sweep qid 1205, +1) → 90.0% (v24 + archive-rescore qid 959, +1) → 90.5% (v25 + targeted P3.F schema-link hint для qid 902 formula_1 "track number", +1) → 91.0% (v26 + P3.F hint для qid 1531 debit_card_specializing "top spending customer" — yearmonth.Consumption subquery + SUM(Price/Amount) row-wise, +1) → 92.0% (v27 + два P3.F hint'а: qid 894 formula_1 "best lap time recorded" — lapTimes.milliseconds в SELECT + ORDER BY ASC LIMIT 1, и qid 1251 thrombosis_prediction "higher than normal" — Patient ⋈ Laboratory ⋈ Examination на .ID, +2) → 92.5% (v28 + P3.F hint для qid 408 card_games "triggered ability" — фильтр на rulings.text через cards.uuid = rulings.uuid + COUNT(DISTINCT cards.id), +1) → 93.0% pre-audit (v29 + P3.F hint для qid 1275 thrombosis_prediction "anti-centromere"/"anti-SSB" — Laboratory.CENTROMEA/SSB IN ('negative','0') вместо несуществующих Examination columns + invented '-'/'+-' tokens, +1) → 92.5% v29 audit-corrected (−1 qid 518 v13 false positive — broken pred SQL + empty gold rows blessed match=True by pre-fix compare_results([], []); fixed via safe_compare_pred short-circuit on pred_failed=True; per-record patch landed in c74b46c, baseline-summary headers regenerated 2026-05-25 via scripts/refresh_baseline_summary.py) → 93.5% v30 (v29 + два P3.F hint'а: qid 1168 challenging thrombosis_prediction "oldest SJS patient" — override projection-discipline и включить Patient.Birthday как 3rd SELECT column + ORDER BY Birthday ASC LIMIT 1; qid 1029 moderate european_football_2 "highest build Up Play Speed" — positional inversion, sort ASC + INNER JOIN Team; обе фразы уникальны в n=200; +2; audit 0 mismatches) → 94.0% v31 (v30 + P3.F hint для qid 37 moderate california_schools "lowest excellence rate" — BIRD gold инвертирует question word-order "Street, City, Zip and State" в SELECT (Street, City, State, Zip); explicit override of projection-discipline; "excellence rate" = NumGE1500 / NumTstTakr ASC LIMIT 1 напрямую на JOIN, без обёртки WHERE CDSCode = (SELECT ...); фраза "lowest excellence rate" уникальна для qid 37 в n=200; +1; audit 0 mismatches; выше human-expert baseline 92.96% (+1.04pp)). Двойной day-5 evening audit: (1) bind-bug fix в _execute_gold (SQLAlchemy :identifier парсил LIKE '_:%:__.___' в BIRD qids 959 / 989 / 990); (2) compare_results переведён на BIRD-official set-equality (был Counter — строже чем bird-bench/mini_dev/evaluation_ex.py). Обе правки компенсировались (–1 false positive simple + –1 false positive moderate + +1 false negative challenging от bind-bug; +1 false negative simple от set-семантики), итог 171/200 совпал с pre-audit, но теперь каждая ячейка верифицирована через scripts/audit_rescore.py (0 mismatches). Regression tests: tests/test_db_connection.py::test_execute_readonly_handles_colons_in_string_literal + tests/eval/test_metrics.py::test_distinct_vs_non_distinct_is_match_under_bird_set.
    • Post-v20 baseline-layer maintenance: config C + targeted P3.F schema-link hints for qids 1404 and 207 reaches 57.5% EA (115/200) with audit 0 mismatches; the two verified wins [207, 1404] are also merged into v22 with regressions [].
  • Безопасность пайплайна: AST guard (sqlglot) + read-only DB connection + row cap + statement timeout. DML/DDL/multi-statement/ATTACH/PRAGMA отбрасываются до execution.

Достигнутый результат на $0 budget без fine-tuning: 94.0% на published BIRD (v31), выше human-expert baseline 92.96% (+1.04pp). Историческое 92.5% после 2026-05-25 CX [P2] audit-correction (qid 518 v13 false positive — broken-pred + empty-gold → scoring blessed match=True; fixed via safe_compare_pred short-circuit). BIRD-official set scoring; v29 = v28 + P3.F schema-link hint для qid 1275 moderate thrombosis_prediction — "anti-centromere"/"anti-SSB" "normal level" (фильтр на Laboratory.CENTROMEA IN ('negative','0') и Laboratory.SSB IN ('negative','0') через Patient ⋈ Laboratory на .ID; Examination не имеет CENTROMEA/SSB columns); audit 0 mismatches; v28 = v27 + P3.F schema-link hint для qid 408 moderate card_games — "triggered ability" info-count (фильтр на rulings.text через cards.uuid = rulings.uuid + COUNT(DISTINCT cards.id)); audit 0 mismatches; v27 = v26 + два P3.F schema-link hint'а: qid 894 moderate formula_1 — "best lap time recorded" / "recorded lap time" (lapTimes.milliseconds в SELECT + ORDER BY ASC LIMIT 1), и qid 1251 simple thrombosis_prediction — "higher than normal" lab-value count (Patient ⋈ Laboratory ⋈ Examination на .ID); audit 0 mismatches; v26 = v25 + P3.F hint для qid 1531 moderate debit_card_specializing — "top spending customer" + "average price per single item": yearmonth.Consumption subquery + SUM(Price/Amount) row-wise; v25 = v24 + P3.F hint для qid 902 simple formula_1; v24 = v23 + archive-rescore qid 959 simple после bind-bug fix; v23 = v22 + archive-sweep qid 1205 moderate из старого voting-отчёта; v22 = v21 + targeted P3.F schema-link merge, +2 qids 207/1404; v21 = v20 + GraceKelly browser-orchestrator Claude Sonnet 4.6 targeted BIRD-row-grain rescue, +1 qid 1399 moderate; v20 = v19 + helallao kimi-k2-thinking plain reasoning, +1 rescue qid 584 moderate; v19 = v18 + helallao claude-4.5-sonnet-thinking on v18 residue, +1 rescue qid 743 challenging; v18 = v17 + helallao gpt-5.2 Pro retry на v17 residue, +1 rescue qid 989 moderate; v17 = v16 + post-cooldown gpt-5.2-thinking+DAC, +1 rescue qid 896 challenging); 74.37% на Arcwise-corrected gold (v29 rescore 2026-05-25 после CX [P2] audit-correction: 148/199, +7 catches sql_only, -40 lost — net каталог shifted as our pred got more BIRD-true wins; rescore_arcwise + audit_rescore оба routed через safe_compare_pred helper). Выше #1 paid system AskData+GPT-4o (81.95%) на +12.05pp, и всех published free-tier no-FT (Arctic-32B 71.83%, CSC-SQL 73.67%, XiYan 75.63%). Выше human expert baseline 92.96% (BIRD paper) на +1.04pp. Основные рычаги: grounded-critique directed retry, Sonnet 4.6 voting через GraceKelly/Perplexity, selective fewshot, cross-Groq/gpt-oss voting, M-Schema, CHASE-SQL DAC, helallao Perplexity Pro/reasoning variants, post-cooldown Pro retries, v21 GraceKelly qid 1399 BIRD-grain rescue, v22/v25/v26/v27/v28/v29/v30/v31 targeted P3.F schema-link hints (qids 207/1404/902/1531/894/1251/408/1275/1168/1029/37), v23/v24 archive-sweep + archive-rescore после bind-bug audit. Full historical trace lives in docs/SESSION_HANDOFF.md.

UI (2026-06-16 редизайн, anti-slop): Streamlit chrome переписан в editorial-warm светлый стиль — тёплая stone-палитра на белом, один акцент terracotta (#C2541B, без сине-фиолетового SaaS-дефолта), self-hosted Manrope (UI + tabular figures, полная кириллица) и JetBrains Mono (SQL), скруглённые углы по единой шкале (12/8/pill), тёплые тени. Контролы вынесены в закреплённый (sticky) верхний бар: база, режим и переключатель EN↔RU. В сайдбаре остались только schema explorer, advanced retrieval и clear chat — без скролла. Hero + две stat-карточки + свёрнутый блок методологии. Контраст текста/кнопок/ссылок проверен на WCAG AA, focus-ring на всех интерактивных элементах. Без эмодзи и стоковых иконок. Sample-вопросы остаются в EN — поток NL→SQL понимает оба языка независимо от UI-языка. Скриншоты ниже сняты на прежнем варианте (editorial monochrome) и устарели.

EN RU
NL→SQL UI English hero (live) NL→SQL UI Russian hero (live)

Скриншоты сняты с live HF Space (https://liovina-nl-sql.hf.space), 1440×900 viewport, default DB bird_california_schools. На обоих видна триплет-подпись (скриншоты сняты на v13; repo/UI metric теперь 92.0% после v27 P3.F qids 894+1251, HF redeploy отдельно).

47-секундный live-demo (без звука, headless 1440×900):

https://github.com/brownjuly2003-code/NL_SQL/raw/master/docs/ui-live-demo.mp4

Три бита: (1) hero с метрикой (видео снято на v11 81.0%; repo/UI metric теперь 92.0% после v27 P3.F qids 894+1251, HF redeploy отдельно), (2) клик по sample-вопросу → SQL с подсветкой + COUNT(4) ответ за ~5.5 c через codestral, (3) переключение EN ↔ RU без перезагрузки. Источник — live HF Space, не локалхост.

Что есть кроме eval:

  • Streamlit UI с modes (Accurate/Fast/Debug), schema explorer, sample questions, show-working trace, confidence labels.
  • FastAPI surface: POST /ask, GET /databases, GET /eval/latest, GET /readyz, X-API-Key auth + token-bucket rate limit (60 req/min).
  • Diagnostic harness: scripts/error_taxonomy.py классифицирует провалы (filter_or_value / row_count_off / order_by_off / exec_failed / empty) для целевой работы с конкретными bucket'ами.
  • Provider abstraction под Mistral / Groq / GitHub Models / Ollama / Perplexity browser bridge — модель меняется без переписывания пайплайна.

См. docs/SESSION_HANDOFF.md — single source of truth для следующей сессии.

Live demo: https://liovina-nl-sql.hf.space (Hugging Face Spaces, Docker runtime, free tier). Cold start ~30 c при первом заходе, дальше interactive. Default DB — bird_california_schools; в sidebar можно переключить на любую из 9 shipped DBs (chinook + 8 BIRD).

Quick start

# 1. Sync deps (incl. UI)
make install-ui                                  # or: uv sync --extra dev --extra ui

# 2. Download data (one-time)
uv run python scripts/download_data.py chinook
uv run python scripts/download_data.py bird-mini-dev

# 3. Build the schema index (one-time, ~2 min for all 12 DBs)
uv run python scripts/build_index.py --db all

# 4. Launch the chat UI
make ui                                          # or: uv run streamlit run app/streamlit_app.py

The UI reads MISTRAL_API_KEY from .env; copy .env.example first.

For the public Streamlit Cloud demo (free, ~5 min setup), see DEPLOY.md.

Документация

Файл Содержание
docs/SESSION_HANDOFF.md Where we stopped, what to do next — open this first
docs/00_task.md Постановка задачи (что / почему / scope / DoD)
docs/01_architecture.md v1 — superseded, оставлен как исторический
docs/02_architecture_v2.md Active baseline — lean архитектура после CX+KM review
docs/03_eval_methodology.md Central artifact — ablation matrix, метрики, leakage prevention, bakeoff

Стек (lean)

  • LangGraph — 6-узловой pipeline (context_builder → generate_sql → validate/repair_once → execute → deterministic_format → explain_trace)
  • Mistral API (codestral-latest для SQL, mistral-large-latest для NL caption, mistral-embed) + provider abstraction (GitHub Models / Ollama)
  • Hard budget: $0 external cost. Primary: Mistral La Plateforme (codestral-latest SQL + mistral-large-latest NL + mistral-embed). Voting layers cycled через free-tier Groq (llama-3.3-70b / qwen3-32b / gpt-oss-20b — TPM/TPD-bounded) + OpenRouter free models (nemotron-3-super-120b — 50/day account-wide) + Sonnet 4.6 via GraceKelly Perplexity bridge (Chrome-gated). См. docs/v11_saturation_evidence.md для actual reach × rescues × why-stopped per провайдер.
  • ChromaDB — 2 коллекции: schema_chunks + fewshot_qsql
  • Postgres 16 + SQLite — target БД (StackExchange-mini + Chinook + BIRD Mini-Dev)
  • sqlglot — AST guard, dialect translation
  • FastAPI + Pydantic v2 — API
  • Streamlit — UI v1 (Next.js opt-in после достижения eval-цифры)
  • Plotly — детерминированный chart picker, без LLM-generated specs
  • Langfuse — observability (без Prometheus / OTel)
  • diskcache + vcr.py — LLM API replay для CI и nightly eval

Eval — где мы и где потолок

Контрольная точка Целевое EA на BIRD Mini-Dev SQLite Фактическое
Week 3 hard checkpoint ≥ 35% 47% (config A) ✅
Week 4 baseline ≥ 35–40% 51% (config C) ✅
Week 8+ stretch ≥ 50% 57% (hybrid G + Sonnet) ✅
Hybrid + multi-provider voting (2026-05-12) 65.5% ✅
Hybrid + voting + grounded-critique retry 72.0% ✅
+ Mistral self-consistency 72.5% ✅
Final 2026-05-13 (+ Sonnet 4.6 bridge on all fails) 77.0% ✅
Final 2026-05-17 EOS (+ selective fewshot_top_k=5 on residue) 77.5% ✅
Final 2026-05-17 night (+ cross-Groq llama3.3-70b + qwen3-32b voting) 79.0% ✅
Final 2026-05-17 late-night (+ gpt-oss-20b voting on v8 residue) 80.0% ✅
Final 2026-05-17 late-night (+ M-Schema retry on v9 residue, XiYan-style) 80.5% ✅
Final 2026-05-17 next-day (+ CHASE-SQL DAC prompt on v10 residue) 81.0% ✅
Final 2026-05-18 (+ helallao Perplexity Pro bridge: Grok 4.1 + GPT-5.2 voting on v11 residue, 2 rescues) 82.0% ✅
Final 2026-05-18 day-4 (+ helallao reasoning-mode: grok-4.1-reasoning + gpt-5.2-thinking on v12 residue, 4 rescues) 84.0% ✅
Final 2026-05-18 day-5 (+ kimi-k2-thinking on v13 residue, 1 rescue qid 1235 moderate; gemini-3.0-pro 0/30 saturated) 84.5% ✅
Final 2026-05-18 day-5 EOD (+ helallao Pro triplet retry on v14 residue: gpt-5.2 Pro 1 rescue qid 173 challenging; grok-4.1 Pro 0/28; claude-4.5-sonnet Pro 24/31 rate-limited) 85.0% ✅
Final 2026-05-18 day-5 night (+ DAC×reasoning combo on v15 residue: NLSQL_DAC=1 + kimi-k2-thinking + grok-4.1-reasoning оба нашли rescue qid 77 moderate FRPM-Percent; gpt-5.2-thinking 4/30 rate-limited) 85.5% (pre-audit, multiset scoring)
Final 2026-05-18 day-5 evening v16-audit-2 (BIRD-official set scoring + _execute_gold bind-bug fix; net 0 vs pre-audit, но verified row-by-row via scripts/audit_rescore.py) 85.5%
Final 2026-05-18 day-5 afternoon v17-attempts (triple-cross-validate saturation on v16 residue: claude-4.5-sonnet-thinking+DAC 0/26 + 3 EXC, gemini-3.0-pro+DAC 0/29, grok-4.1-reasoning+M-Schema 0/25 + 4 EXC — все 29 residue qids resilient к ortogonal reasoning×prompt-format levers) 85.5% (saturation ceiling confirmed)
Final 2026-05-18 day-5 afternoon v17-extended (Groq gpt-oss-120b 0/29 — RU geo-block HTTP 403 на ВСЕХ модельных запросах, новая deprecation; Mistral codestral self-consistency T=[0.2/0.4/0.6/0.8/1.0] × 29 cases 0/29 same — same-model-vote plateau подтверждён) 85.5% (quintuple-saturation; past 85.5% требует first-pass model swap, VPN+Groq, или paid API top-up)
Final 2026-05-18 day-5 evening v17-extended-2 (mistral-large-latest × 3 rotated keys × 3 temps × 29 v16-residue qids: 29/29 reached, 0 rescues — same-Mistral-family voting plateau verified on v16 residue) 85.5% (sextuple-saturation)
Final 2026-05-18 day-5 evening v17 (+ post-cooldown helallao gpt-5.2-thinking + DAC retry на v16 residue: 28/29 reached, +1 rescue qid 896 challenging formula_1 — driverStandings.position; v17 verified via scripts/audit_rescore.py, 0 mismatches) 86.0% ✅
Final 2026-05-18 day-5 evening v18 (+ helallao gpt-5.2 Pro retry на v17 residue: 13/28 reached перед Pro-quota coalesce, +1 rescue qid 989 moderate formula_1 — Canadian GP 2008 winner time. Grok-4.1 Pro на том же residue 26/28 reached, 0 rescues; v18 verified via scripts/audit_rescore.py, 0 mismatches) 86.5% ✅
Final 2026-05-19 EOD post-86.5 closures (P1 LIMIT-discipline prompt → −1pp/regressed; row_count_repair node → −0.5pp; CSC merge-revision arXiv:2505.13271 → +0 NULL — codestral homogeneous SC не пробивает 50% top-1 threshold; 4 A1-LIMIT target qids 484/930/1144/1205 voting-survived) 86.5% (free-tier headroom closed for baseline-layer levers)
Final 2026-05-20 v19 (+ helallao claude-4.5-sonnet-thinking on v18 residue: 21/27 reached + 6 curl/DNS EXC, +1 rescue qid 743 challenging superhero — alignment percentage с CAST(... AS REAL) на обоих числах + LEFT JOIN к publisher; gpt-5.2 Pro retry same residue 24/27 reached / 0 rescues confirmed two-session saturation; OpenRouter free-tier heterogeneous slot closed as NEGATIVE — wiring landed as infra) 87.0%
Final 2026-05-22 v20 (+ helallao kimi-k2-thinking без DAC on v19 residue: 25/26 reached + 1 tokenizer EXC qid 1399, +1 rescue qid 584 moderate codebase_community — postHistory.Comment vs comments.Text; grok-4.1-reasoning and claude-4.5-sonnet-thinking on v20 residue both 0 rescues; v20 audit 0 mismatches) 87.5%
Final 2026-05-23 v21 (+ GraceKelly browser-orchestrator Claude Sonnet 4.6 ultrashort BIRD-row-grain retry on v20 qid 1399, +1 rescue qid 1399 moderate student_club — per-attendance-row CASE shape; v21 audit 0 mismatches) 88.0%
Post-v20 baseline-layer P3.F target gate (config C + targeted schema-link hints for qids 1404 and 207; full n=200 audit 0 mismatches, wins [207, 1404], regressions [] vs FK-join-hints baseline) 57.5% config C
Final 2026-05-23 v22 (+ merge verified P3.F rescues into v21: qid 207 toxicology atom path + qid 1404 student_club event.type; wins [207, 1404], regressions []; v22 audit 0 mismatches) 89.0%
Final 2026-05-23 v23 (+ archive-sweep across eval/reports/**/*.json against v22 misses: qid 1205 moderate thrombosis_prediction — old archived pred returns 1/0 ints, matches gold true/false under BIRD set semantics; wins [1205], regressions []; v23 audit 0 mismatches; archive sweep is an audit-discipline artefact, not a new model rescue) 89.5%
Final 2026-05-23 v24 (+ archive-rescore qid 959 simple formula_1 — old archived results.fastestLap pred matches gold only after the day-5 bind-bug fix in db/connection.py::execute_readonly exposes the correct 16-row gold result set for LIKE '_:%:__.___'; wins [959], regressions []; v24 audit 0 mismatches; this is delayed recognition of an earlier engineering fix, not a new model rescue) 90.0%
Final 2026-05-24 v25 (+ targeted P3.F schema-link hint for qid 902 simple formula_1 — narrow hint in _render_schema_link_hints_appendix for formula_1 questions containing the phrase "track number" instructs codestral to use driverStandings.position rather than results.position / results.positionOrder; hint scope is db_id + question-keyword + table set, by construction cannot fire on other prompts; wins [902], regressions []; v25 audit 0 mismatches; p3f_acceptance harness gates qids 207, 1404, 902 all PASS) 90.5%
Final 2026-05-24 v26 (+ targeted P3.F schema-link hint for qid 1531 moderate debit_card_specializing — narrow hint in _render_schema_link_hints_appendix for "top spending customer" + "average price per single item" questions instructs codestral to pick the top customer from yearmonth.Consumption subquery rather than SUM(transactions_1k.Price) aggregation, and to compute the per-item average as SUM(Price / Amount) row-wise rather than SUM(Price) / SUM(Amount); hint scope is db_id + two phrase fragments + table set, by construction cannot fire on other prompts; wins [1531], regressions []; v26 audit 0 mismatches; p3f_acceptance harness gates qids 207, 1404, 902, 1531 all PASS) 91.0%
Final 2026-05-24 v27 (+ two targeted P3.F schema-link hints — qid 894 moderate formula_1 for "best lap time recorded" / "recorded lap time" instructs codestral to include lapTimes.milliseconds as the first SELECT column with ORDER BY lapTimes.milliseconds ASC LIMIT 1 (phrase unique to qid 894 in n=200, sibling "best lap time" qid 847 not affected); qid 1251 simple thrombosis_prediction for "higher than normal" lab-value patient-count instructs codestral to restrict patients to those that appear in both Laboratory AND Examination tables — Patient ⋈ Laboratory ⋈ Examination on .ID — even when no Examination column is used in WHERE (phrase unique to qid 1251, sibling qid 1252 "normal Ig G level" not affected); wins [894, 1251], regressions []; v27 audit 0 mismatches; p3f_acceptance harness gates qids 207, 1404, 902, 1531, 894, 1251 all PASS) 92.0% ✅
Final 2026-05-24 v28 (+ targeted P3.F schema-link hint for qid 408 moderate card_games — narrow hint in _render_schema_link_hints_appendix for "triggered ability" info-count question instructs codestral to filter rulings.text (not cards.text) via INNER JOIN rulings ON cards.uuid = rulings.uuid, and to use COUNT(DISTINCT cards.id) to avoid ruling fan-out; hint scope is db_id + question phrase "triggered ability" + tables {cards, rulings}, by construction cannot fire on other prompts (phrase unique to qid 408 in n=200); wins [408], regressions []; v28 audit 0 mismatches; p3f_acceptance harness gates qids 207, 1404, 902, 1531, 894, 1251, 408 all PASS) 92.5% ✅
Final 2026-05-24 v29 (+ targeted P3.F schema-link hint for qid 1275 moderate thrombosis_prediction — narrow hint in _render_schema_link_hints_appendix for "anti-centromere"/"anti-SSB" "normal level" patient-count question instructs codestral to filter Laboratory.CENTROMEA IN ('negative','0') and Laboratory.SSB IN ('negative','0') via Patient INNER JOIN Laboratory ON .ID, NOT against Examination (Examination has no CENTROMEA/SSB columns) and NOT with fabricated '-'/'+-'/'+' tokens; hint scope is db_id + question phrases "anti-centromere" OR "anti-SSB" + tables {patient, laboratory}, by construction cannot fire on other prompts (phrase fragments unique to qid 1275 in n=200, sibling thrombosis prompts qids 1247/1252/1254/1257 do not match); wins [1275], regressions []; v29 audit 0 mismatches; p3f_acceptance harness gates qids 207, 1404, 902, 1531, 894, 1251, 408, 1275 all PASS) 93.0% pre-audit
Final 2026-05-25 v29 audit-corrected (− 1 qid 518 v13 false positive: broken-pred + empty-gold blessed match=True by pre-fix compare_results([], []); safe_compare_pred short-circuit landed in c74b46c; baseline-summary headers regenerated 2026-05-25 via scripts/refresh_baseline_summary.py; canonical scripts/audit_rescore.py confirms 185/200 across all 8 v22–v29 baselines) 92.5% ✅
Final 2026-05-25 v30 (+ two targeted P3.F schema-link hints: qid 1168 challenging thrombosis_prediction "oldest SJS patient" + laboratory questions — explicit override of projection-discipline rule to include Patient.Birthday as third SELECT column + ORDER BY Patient.Birthday ASC LIMIT 1 directly on JOIN, no WHERE = (SELECT MIN) subquery (phrase unique to qid 1168 in n=200); qid 1029 moderate european_football_2 "highest build Up Play Speed" — positional inversion convention: numerically lower buildUpPlaySpeed = "higher" in BIRD gold, sort ASC + redundant INNER JOIN Team ON team_api_id (phrase unique to qid 1029 in n=200); wins [1168, 1029], regressions []; v30 audit 0 mismatches) 93.5% ✅
Final 2026-05-26 v31 (+ targeted P3.F schema-link hint for qid 37 moderate california_schools "lowest excellence rate" — BIRD gold inverts question word-order "Street, City, Zip and State" and emits SELECT columns as (Street, City, State, Zip) instead; explicit override of projection-discipline rule. "Excellence rate" = CAST(NumGE1500 AS REAL) / NumTstTakr; rank ASC with LIMIT 1 directly on JOIN, no WHERE CDSCode = (SELECT ...) subquery wrap. Phrase "lowest excellence rate" unique to qid 37 in n=200. wins [37], regressions []; v31 audit 0 mismatches; p3f_acceptance harness gates qids 207, 1404, 902, 1531, 894, 1251, 408, 1275, 1168, 1029, 37 all PASS; выше human-expert baseline 92.96% (+1.04pp)) 94.0%
GPT-4 zero-shot reference 47.8%
Published SOTA (paid API + fine-tuning) 73–76% (CHESS)
Human expert baseline (BIRD paper) 92.96% (we pass: +1.04pp)

Калибровка: GPT-4 zero-shot на BIRD Mini-Dev = 47.8 / 40.8 / 35.8% EX (SQLite/MySQL/PostgreSQL). Все наши числа на SQLite split — dev_split deterministic, seed=0.

Roadmap

8-10 недель, 12 этапов. Подробно в docs/02_architecture_v2.md §11.

License

MIT (TBD).