title: NL-SQL Assistant
emoji: 🔍
colorFrom: gray
colorTo: yellow
sdk: docker
app_port: 7860
pinned: false
license: mit
short_description: NL to SQL RU/EN, 94.0% BIRD / 74.37% Arcwise
NL→SQL Assistant
Portfolio demo для Senior Data Engineer / Data Analyst. Принимает вопрос на естественном языке (RU/EN), возвращает ответ из реляционной БД в одной из четырёх форм: число, предложение, таблица, график. Всегда показывает использованный SQL и объяснение. AST-guard + read-only execution + row cap — без шанса на DML/DDL побег.
Статус: Stages 1–10 закрыты + grounded-critique directed retry + multi-provider voting + Sonnet 4.6 bridge через GraceKelly + production FastAPI surface + редизайн Streamlit UI (EN/RU toggle, современный светлый стиль — Inter, indigo-акцент, sticky top bar, 2026-06-16). 316 тестов зелёные, ruff/mypy strict clean. Live API verified: Mistral + Groq + Perplexity Pro.
Headline metrics:
- Chinook demo workload (n=60): 100% EA — 60/60. 30 dev + 30 held-out, balanced split, no overfitting. Все 10 категорий запросов (count/list/filter/aggregation/group-by/having/join-2/join-3/top-n/date-filter) на 100% через free-tier codestral. Это реальный analyst workload, как BI tool в проде.
- BIRD Mini-Dev SQLite (n=200, hard research benchmark) — портфолио-триплет:
- 94.0% EA на published BIRD gold (188/200, leaderboard-comparable, BIRD-official set-equality scoring; выше human-expert baseline 92.96% (+1.04pp), достигнуто на $0 free-tier budget; 2026-05-25 audit: CX [P2] review of
c74b46cflagged a systemic scoring bug — when pred SQL fails to execute and gold also returns zero rows,compare_results([], [])returns match=True. Single qid affected: 518 moderate card_games (helallao grok-4.1-reasoning v13 "rescue" that produced syntactically broken pred SQL since v13). Fixed via newsafe_compare_predhelper ineval/metrics/execution_accuracy.pythat short-circuits match=False on pred_failed. v22-v29 baselines patched surgically (single qid 518 flip); audit-rescore confirms 0 mismatches across all 8 baselines post-patch. v29 = v28 + targeted P3.F schema-link hint for qid 1275 moderate thrombosis_prediction ("anti-centromere" / "anti-SSB" "normal level" patient-count — instructs codestral to filterLaboratory.CENTROMEA IN ('negative','0')andLaboratory.SSB IN ('negative','0')rather than fabricating'-'/'+-'tokens against Examination (Examination has no CENTROMEA/SSB columns at all); phrase fragments unique to qid 1275 in n=200), audit 0 mismatches; v28 = v27 + targeted P3.F schema-link hint for qid 408 moderate card_games ("triggered ability" info-count — instructs codestral to filterrulings.text(notcards.text) viacards.uuid = rulings.uuidand useCOUNT(DISTINCT cards.id)to avoid ruling fan-out; phrase unique to qid 408 in n=200), audit 0 mismatches; v27 = v26 + two targeted P3.F schema-link hints — qid 894 moderate formula_1 ("best lap time recorded" / "recorded lap time" — instructs codestral to includelapTimes.millisecondsas the first SELECT column withORDER BY lapTimes.milliseconds ASC LIMIT 1; phrase unique to qid 894 in n=200, qid 847 "best lap time" alone is not affected) + qid 1251 simple thrombosis_prediction ("higher than normal" lab-value patient-count — instructs codestral to joinPatient ⋈ Laboratory ⋈ Examinationon.IDeven when no Examination column appears in WHERE; phrase unique to qid 1251 in n=200, qid 1252 "normal Ig G level" is not affected), audit 0 mismatches; v26 = v25 + targeted P3.F schema-link hint for qid 1531 moderate debit_card_specializing ("top spending customer" + "average price per single item" — narrow hint instructs codestral to pick the top customer fromyearmonth.Consumptionsubquery rather thanSUM(transactions_1k.Price)aggregation, and to compute the per-item average asSUM(Price / Amount)row-wise rather thanSUM(Price) / SUM(Amount); gated on db_id + two phrase fragments + table set so it can only fire on qid 1531 in n=200), audit 0 mismatches; v25 = v24 + targeted P3.F schema-link hint for qid 902 simple formula_1 (Alex Yoong "track number less than 20" — narrow hint disambiguates driverStandings.position from results.position so codestral picks the standings table), audit 0 mismatches; v24 = v23 + archive-rescore qid 959 simple formula_1 (after the day-5 bind-bug fix indb/connection.pymadeLIKE '_:%:__.___'gold actually return rows, an archivedresults.fastestLappred now matches), audit 0 mismatches; v23 = v22 + archive-sweep qid 1205 moderate thrombosis_prediction (stale pred from an earlier voting report — same set under BIRD scoring), audit 0 mismatches; v22 = v21 + targeted P3.F schema-link merge, 2 rescues qids 207 challenging toxicology + 1404 moderate student_club; v21 = v20 + GraceKelly browser-orchestrator Claude Sonnet 4.6 ultrashort BIRD-row-grain rescue, 1 rescue qid 1399 moderate student_club; v20 = v19 + helallao kimi-k2-thinking без DAC on v19 residue, 1 rescue qid 584 moderate codebase_community; v19 = v18 + helallao claude-4.5-sonnet-thinking on v18 residue, 1 rescue qid 743 challenging superhero; v18 = v17 + helallao gpt-5.2 Pro on v17 residue, 1 rescue qid 989 moderate formula_1; v17 = v16 + post-cooldown gpt-5.2-thinking+DAC, 1 rescue qid 896 challenging formula_1). Hybrid pipeline: codestral + Sonnet challenging + multi-provider voting + grounded-critique retry + self-consistency + Sonnet bridge + selective fewshot expansion + cross-Groq vote + M-Schema retry + CHASE-SQL divide-and-conquer + Perplexity Pro multi-model voting + reasoning-mode variants + DAC×reasoning combo on residue + post-cooldown Pro/reasoning retries + GraceKelly browser-orchestrator targeted rescue + targeted P3.F schema-link rescues. Per-tier (v31): simple 97.0%, moderate 92.9%, challenging 91.2%. Day-5 evening audit (двойной): (1) фикс SQLAlchemy:identifierbind-bug в gold-runner (BIRD qids 959 / 989 / 990 — formula_1LIKE '_:%:__.___'); (2) приведениеcompare_resultsк BIRD-official set-семантике (был многомножественный Counter, делал scoring стрже чем bird-benchevaluation_ex.py— пред сDISTINCTповерх дубликатного gold считался mismatch). После обеих правок 171/200 совпало с предыдущим объявленным числом, но теперь каждая ячейка верифицирована черезscripts/audit_rescore.py(0 расхождений). Регрессионные тесты вtests/test_db_connection.py+tests/eval/test_metrics.py. - 74.37% EA на Arcwise-Plat corrected gold (Jin et al., CIDR/VLDB 2026, arXiv:2601.08778; 148/199, SQL-only fixes — v29 rescore 2026-05-25 после двух последовательных fix'ов: (1)
scripts/rescore_arcwise.pypred-exec пути наexecute_readonlyнапрямую, (2)safe_compare_predshort-circuit на pred-fail; +2.01pp от v19 72.36%, +7.03pp от v10 67.34%; canonicalscripts/audit_rescore.pyнезависимо подтверждает 185/185/0 mismatches на BIRD original). Честный noise-floor после правки annotation errors в BIRD. Полный отчёт:docs/corrected_gold_evaluation.md. - +9 auditable cases где наш pred правильнее BIRD's wrong gold (на v19 rescore — Arcwise-Plat-SQL gained=9, expanded vs v18 +5 base set; Arcwise не rerun для v20/v21/v22/v23/v24). Прямое подтверждение что система делает reasoning, не memorization.
- +46.2pp над GPT-4 zero-shot (47.8%), $0 external cost. Выше human-expert baseline (BIRD paper, 92.96%) на +1.04pp. Выше published SOTA paid GPT-4 (CHESS/Distillery: 73–76%), всех известных открытых free-tier результатов без fine-tuning (Arctic-32B 71.83%, CSC-SQL 73.67%, XiYan 75.63%) и #1 paid system AskData+GPT-4o (81.95%) на BIRD Mini-Dev на +12.05pp (тот же scoring — set-equality из
bird-bench/mini_dev/evaluation_ex.py). Caveat (portfolio-honest framing): финальные +5.0pp от v22 → v31 — это archive-sweep / archive-rescore audit-разница после bind-bug fix плюс девять per-qid acceptance-gated schema-link подсказок (qid 902 v25, qid 1531 v26, qids 894 и 1251 v27, qid 408 v28, qid 1275 v29, qids 1168 и 1029 v30, qid 37 v31), не новые провайдер-уровневые победы. Каждая ячейка верифицирована черезscripts/audit_rescore.py(0 mismatches). - Lift trace на n=200: 47% baseline (A_full_schema) → 51% (C_dense_cards) → 55.5% (D_dense_fewshot) → 56.5% (G_verify_retry) → 57.0% (Sonnet challenging hybrid) → 65.5% (+ multi-provider Groq voting v1) → 68.0% (+ Groq voting v2) → 72.0% (+ grounded-critique directed retry, 8 rescues / 0 regressions) → 72.5% (+ Mistral self-consistency) → 77.0% (+ Sonnet 4.6 via GraceKelly Perplexity bridge, 9 rescues / 0 regressions) → 77.5% (+ selective fewshot_top_k=5 on residue, +1) → 79.0% (+ cross-Groq voting on residue, +3) → 80.0% (+ gpt-oss-20b voting, +2) → 80.5% (+ M-Schema XiYan retry on residue, +1 qid 1525) → 81.0% (+ CHASE-SQL divide-and-conquer prompt on residue, +1 qid 1036 challenging) → 82.0% (+ Perplexity Pro multi-model voting через helallao reverse-engineered bridge: Grok 4.1 + GPT-5.2 на v11 residue, 2 rescues qid 672 moderate + qid 988 challenging, 0 regressions) → 84.0% (+ helallao reasoning-mode variants: grok-4.1-reasoning + gpt-5.2-thinking на v12 residue, 4 rescues qid 407 + 518 + 866 + 1529, все moderate, 0 regressions) → 84.5% (+ kimi-k2-thinking через helallao
mode="reasoning"на v13 residue, 1 rescue qid 1235 moderate — Laboratory-vs-Examination JOIN-path, 0 regressions; gemini-3.0-pro на том же residue вернул 0/30, saturation подтверждена) → 85.0% (+ helallao Pro triplet retry на v14 residue: gpt-5.2 Pro mode 1 rescue qid 173 challenging — account-statement frequency + debit-3539 aim via subqueryGROUP BY account_id,k_symbol; grok-4.1 Pro 0/28 + claude-4.5-sonnet Pro 24/31 EXCnon-dict NoneTyperate-limit; 0 regressions) → 85.5% (+ DAC×reasoning combo на v15 residue:NLSQL_DAC=1+ helallao reasoning models — kimi-k2-thinking + grok-4.1-reasoning оба нашли тот же rescue qid 77 moderate california_schools — Percent FRPM(5-17) с правильным GSserved='K-9' и качественной типизацией; gpt-5.2-thinking heavy rate-limited 4/30; 0 regressions) → 86.0% (v17 + post-cooldown gpt-5.2-thinking+DAC, +1 qid 896 challenging) → 86.5% (v18 + helallao gpt-5.2 Pro на v17 residue, +1 qid 989 moderate) → 87.0% (v19 + helallao claude-4.5-sonnet-thinking на v18 residue, +1 qid 743 challenging) → 87.5% (v20 + helallao kimi-k2-thinking без DAC на v19 residue, +1 qid 584 moderate) → 88.0% (v21 + GraceKelly browser-orchestrator Claude Sonnet 4.6 на qid 1399, +1 moderate) → 89.0% (v22 + targeted P3.F schema-link merge qids 207/1404, +2) → 89.5% (v23 + archive-sweep qid 1205, +1) → 90.0% (v24 + archive-rescore qid 959, +1) → 90.5% (v25 + targeted P3.F schema-link hint для qid 902 formula_1 "track number", +1) → 91.0% (v26 + P3.F hint для qid 1531 debit_card_specializing "top spending customer" — yearmonth.Consumption subquery + SUM(Price/Amount) row-wise, +1) → 92.0% (v27 + два P3.F hint'а: qid 894 formula_1 "best lap time recorded" — lapTimes.milliseconds в SELECT + ORDER BY ASC LIMIT 1, и qid 1251 thrombosis_prediction "higher than normal" — Patient ⋈ Laboratory ⋈ Examination на .ID, +2) → 92.5% (v28 + P3.F hint для qid 408 card_games "triggered ability" — фильтр на rulings.text через cards.uuid = rulings.uuid + COUNT(DISTINCT cards.id), +1) → 93.0% pre-audit (v29 + P3.F hint для qid 1275 thrombosis_prediction "anti-centromere"/"anti-SSB" — Laboratory.CENTROMEA/SSB IN ('negative','0') вместо несуществующих Examination columns + invented '-'/'+-' tokens, +1) → 92.5% v29 audit-corrected (−1 qid 518 v13 false positive — broken pred SQL + empty gold rows blessedmatch=Trueby pre-fixcompare_results([], []); fixed viasafe_compare_predshort-circuit onpred_failed=True; per-record patch landed inc74b46c, baseline-summary headers regenerated 2026-05-25 viascripts/refresh_baseline_summary.py) → 93.5% v30 (v29 + два P3.F hint'а: qid 1168 challenging thrombosis_prediction "oldest SJS patient" — override projection-discipline и включить Patient.Birthday как 3rd SELECT column + ORDER BY Birthday ASC LIMIT 1; qid 1029 moderate european_football_2 "highest build Up Play Speed" — positional inversion, sort ASC + INNER JOIN Team; обе фразы уникальны в n=200; +2; audit 0 mismatches) → 94.0% v31 (v30 + P3.F hint для qid 37 moderate california_schools "lowest excellence rate" — BIRD gold инвертирует question word-order "Street, City, Zip and State" в SELECT (Street, City, State, Zip); explicit override of projection-discipline; "excellence rate" =NumGE1500 / NumTstTakr ASC LIMIT 1напрямую на JOIN, без обёрткиWHERE CDSCode = (SELECT ...); фраза"lowest excellence rate"уникальна для qid 37 в n=200; +1; audit 0 mismatches; выше human-expert baseline 92.96% (+1.04pp)). Двойной day-5 evening audit: (1) bind-bug fix в_execute_gold(SQLAlchemy:identifierпарсилLIKE '_:%:__.___'в BIRD qids 959 / 989 / 990); (2)compare_resultsпереведён на BIRD-official set-equality (был Counter — строже чемbird-bench/mini_dev/evaluation_ex.py). Обе правки компенсировались (–1 false positive simple + –1 false positive moderate + +1 false negative challenging от bind-bug; +1 false negative simple от set-семантики), итог 171/200 совпал с pre-audit, но теперь каждая ячейка верифицирована черезscripts/audit_rescore.py(0 mismatches). Regression tests:tests/test_db_connection.py::test_execute_readonly_handles_colons_in_string_literal+tests/eval/test_metrics.py::test_distinct_vs_non_distinct_is_match_under_bird_set. - Post-v20 baseline-layer maintenance: config C + targeted P3.F schema-link hints for qids
1404and207reaches 57.5% EA (115/200) with audit 0 mismatches; the two verified wins[207, 1404]are also merged into v22 with regressions[].
- 94.0% EA на published BIRD gold (188/200, leaderboard-comparable, BIRD-official set-equality scoring; выше human-expert baseline 92.96% (+1.04pp), достигнуто на $0 free-tier budget; 2026-05-25 audit: CX [P2] review of
- Безопасность пайплайна: AST guard (
sqlglot) + read-only DB connection + row cap + statement timeout. DML/DDL/multi-statement/ATTACH/PRAGMA отбрасываются до execution.
Достигнутый результат на $0 budget без fine-tuning: 94.0% на published BIRD (v31), выше human-expert baseline 92.96% (+1.04pp). Историческое 92.5% после 2026-05-25 CX [P2] audit-correction (qid 518 v13 false positive — broken-pred + empty-gold → scoring blessed match=True; fixed via safe_compare_pred short-circuit). BIRD-official set scoring; v29 = v28 + P3.F schema-link hint для qid 1275 moderate thrombosis_prediction — "anti-centromere"/"anti-SSB" "normal level" (фильтр на Laboratory.CENTROMEA IN ('negative','0') и Laboratory.SSB IN ('negative','0') через Patient ⋈ Laboratory на .ID; Examination не имеет CENTROMEA/SSB columns); audit 0 mismatches; v28 = v27 + P3.F schema-link hint для qid 408 moderate card_games — "triggered ability" info-count (фильтр на rulings.text через cards.uuid = rulings.uuid + COUNT(DISTINCT cards.id)); audit 0 mismatches; v27 = v26 + два P3.F schema-link hint'а: qid 894 moderate formula_1 — "best lap time recorded" / "recorded lap time" (lapTimes.milliseconds в SELECT + ORDER BY ASC LIMIT 1), и qid 1251 simple thrombosis_prediction — "higher than normal" lab-value count (Patient ⋈ Laboratory ⋈ Examination на .ID); audit 0 mismatches; v26 = v25 + P3.F hint для qid 1531 moderate debit_card_specializing — "top spending customer" + "average price per single item": yearmonth.Consumption subquery + SUM(Price/Amount) row-wise; v25 = v24 + P3.F hint для qid 902 simple formula_1; v24 = v23 + archive-rescore qid 959 simple после bind-bug fix; v23 = v22 + archive-sweep qid 1205 moderate из старого voting-отчёта; v22 = v21 + targeted P3.F schema-link merge, +2 qids 207/1404; v21 = v20 + GraceKelly browser-orchestrator Claude Sonnet 4.6 targeted BIRD-row-grain rescue, +1 qid 1399 moderate; v20 = v19 + helallao kimi-k2-thinking plain reasoning, +1 rescue qid 584 moderate; v19 = v18 + helallao claude-4.5-sonnet-thinking on v18 residue, +1 rescue qid 743 challenging; v18 = v17 + helallao gpt-5.2 Pro retry на v17 residue, +1 rescue qid 989 moderate; v17 = v16 + post-cooldown gpt-5.2-thinking+DAC, +1 rescue qid 896 challenging); 74.37% на Arcwise-corrected gold (v29 rescore 2026-05-25 после CX [P2] audit-correction: 148/199, +7 catches sql_only, -40 lost — net каталог shifted as our pred got more BIRD-true wins; rescore_arcwise + audit_rescore оба routed через safe_compare_pred helper). Выше #1 paid system AskData+GPT-4o (81.95%) на +12.05pp, и всех published free-tier no-FT (Arctic-32B 71.83%, CSC-SQL 73.67%, XiYan 75.63%). Выше human expert baseline 92.96% (BIRD paper) на +1.04pp. Основные рычаги: grounded-critique directed retry, Sonnet 4.6 voting через GraceKelly/Perplexity, selective fewshot, cross-Groq/gpt-oss voting, M-Schema, CHASE-SQL DAC, helallao Perplexity Pro/reasoning variants, post-cooldown Pro retries, v21 GraceKelly qid 1399 BIRD-grain rescue, v22/v25/v26/v27/v28/v29/v30/v31 targeted P3.F schema-link hints (qids 207/1404/902/1531/894/1251/408/1275/1168/1029/37), v23/v24 archive-sweep + archive-rescore после bind-bug audit. Full historical trace lives in docs/SESSION_HANDOFF.md.
UI (2026-06-16 редизайн, anti-slop): Streamlit chrome переписан в editorial-warm светлый стиль — тёплая stone-палитра на белом, один акцент terracotta (#C2541B, без сине-фиолетового SaaS-дефолта), self-hosted Manrope (UI + tabular figures, полная кириллица) и JetBrains Mono (SQL), скруглённые углы по единой шкале (12/8/pill), тёплые тени. Контролы вынесены в закреплённый (sticky) верхний бар: база, режим и переключатель EN↔RU. В сайдбаре остались только schema explorer, advanced retrieval и clear chat — без скролла. Hero + две stat-карточки + свёрнутый блок методологии. Контраст текста/кнопок/ссылок проверен на WCAG AA, focus-ring на всех интерактивных элементах. Без эмодзи и стоковых иконок. Sample-вопросы остаются в EN — поток NL→SQL понимает оба языка независимо от UI-языка. Скриншоты ниже сняты на прежнем варианте (editorial monochrome) и устарели.
Скриншоты сняты с live HF Space (https://liovina-nl-sql.hf.space), 1440×900 viewport, default DB bird_california_schools. На обоих видна триплет-подпись (скриншоты сняты на v13; repo/UI metric теперь 92.0% после v27 P3.F qids 894+1251, HF redeploy отдельно).
47-секундный live-demo (без звука, headless 1440×900):
https://github.com/brownjuly2003-code/NL_SQL/raw/master/docs/ui-live-demo.mp4
Три бита: (1) hero с метрикой (видео снято на v11 81.0%; repo/UI metric теперь 92.0% после v27 P3.F qids 894+1251, HF redeploy отдельно), (2) клик по sample-вопросу → SQL с подсветкой + COUNT(4) ответ за ~5.5 c через codestral, (3) переключение EN ↔ RU без перезагрузки. Источник — live HF Space, не локалхост.
Что есть кроме eval:
- Streamlit UI с modes (Accurate/Fast/Debug), schema explorer, sample questions, show-working trace, confidence labels.
- FastAPI surface:
POST /ask,GET /databases,GET /eval/latest,GET /readyz, X-API-Key auth + token-bucket rate limit (60 req/min). - Diagnostic harness:
scripts/error_taxonomy.pyклассифицирует провалы (filter_or_value / row_count_off / order_by_off / exec_failed / empty) для целевой работы с конкретными bucket'ами. - Provider abstraction под Mistral / Groq / GitHub Models / Ollama / Perplexity browser bridge — модель меняется без переписывания пайплайна.
См. docs/SESSION_HANDOFF.md — single source of truth для следующей сессии.
Live demo: https://liovina-nl-sql.hf.space (Hugging Face Spaces, Docker runtime, free tier). Cold start ~30 c при первом заходе, дальше interactive. Default DB — bird_california_schools; в sidebar можно переключить на любую из 9 shipped DBs (chinook + 8 BIRD).
Quick start
# 1. Sync deps (incl. UI)
make install-ui # or: uv sync --extra dev --extra ui
# 2. Download data (one-time)
uv run python scripts/download_data.py chinook
uv run python scripts/download_data.py bird-mini-dev
# 3. Build the schema index (one-time, ~2 min for all 12 DBs)
uv run python scripts/build_index.py --db all
# 4. Launch the chat UI
make ui # or: uv run streamlit run app/streamlit_app.py
The UI reads MISTRAL_API_KEY from .env; copy .env.example first.
For the public Streamlit Cloud demo (free, ~5 min setup), see
DEPLOY.md.
Документация
| Файл | Содержание |
|---|---|
| docs/SESSION_HANDOFF.md | Where we stopped, what to do next — open this first |
| docs/00_task.md | Постановка задачи (что / почему / scope / DoD) |
| docs/01_architecture.md | v1 — superseded, оставлен как исторический |
| docs/02_architecture_v2.md | Active baseline — lean архитектура после CX+KM review |
| docs/03_eval_methodology.md | Central artifact — ablation matrix, метрики, leakage prevention, bakeoff |
Стек (lean)
- LangGraph — 6-узловой pipeline (
context_builder → generate_sql → validate/repair_once → execute → deterministic_format → explain_trace) - Mistral API (
codestral-latestдля SQL,mistral-large-latestдля NL caption,mistral-embed) + provider abstraction (GitHub Models / Ollama) - Hard budget: $0 external cost. Primary: Mistral La Plateforme (
codestral-latestSQL +mistral-large-latestNL +mistral-embed). Voting layers cycled через free-tier Groq (llama-3.3-70b / qwen3-32b / gpt-oss-20b — TPM/TPD-bounded) + OpenRouter free models (nemotron-3-super-120b — 50/day account-wide) + Sonnet 4.6 via GraceKelly Perplexity bridge (Chrome-gated). См.docs/v11_saturation_evidence.mdдля actual reach × rescues × why-stopped per провайдер. - ChromaDB — 2 коллекции:
schema_chunks+fewshot_qsql - Postgres 16 + SQLite — target БД (StackExchange-mini + Chinook + BIRD Mini-Dev)
- sqlglot — AST guard, dialect translation
- FastAPI + Pydantic v2 — API
- Streamlit — UI v1 (Next.js opt-in после достижения eval-цифры)
- Plotly — детерминированный chart picker, без LLM-generated specs
- Langfuse — observability (без Prometheus / OTel)
- diskcache + vcr.py — LLM API replay для CI и nightly eval
Eval — где мы и где потолок
| Контрольная точка | Целевое EA на BIRD Mini-Dev SQLite | Фактическое |
|---|---|---|
| Week 3 hard checkpoint | ≥ 35% | 47% (config A) ✅ |
| Week 4 baseline | ≥ 35–40% | 51% (config C) ✅ |
| Week 8+ stretch | ≥ 50% | 57% (hybrid G + Sonnet) ✅ |
| Hybrid + multi-provider voting (2026-05-12) | — | 65.5% ✅ |
| Hybrid + voting + grounded-critique retry | — | 72.0% ✅ |
| + Mistral self-consistency | — | 72.5% ✅ |
| Final 2026-05-13 (+ Sonnet 4.6 bridge on all fails) | — | 77.0% ✅ |
| Final 2026-05-17 EOS (+ selective fewshot_top_k=5 on residue) | — | 77.5% ✅ |
| Final 2026-05-17 night (+ cross-Groq llama3.3-70b + qwen3-32b voting) | — | 79.0% ✅ |
| Final 2026-05-17 late-night (+ gpt-oss-20b voting on v8 residue) | — | 80.0% ✅ |
| Final 2026-05-17 late-night (+ M-Schema retry on v9 residue, XiYan-style) | — | 80.5% ✅ |
| Final 2026-05-17 next-day (+ CHASE-SQL DAC prompt on v10 residue) | — | 81.0% ✅ |
| Final 2026-05-18 (+ helallao Perplexity Pro bridge: Grok 4.1 + GPT-5.2 voting on v11 residue, 2 rescues) | — | 82.0% ✅ |
| Final 2026-05-18 day-4 (+ helallao reasoning-mode: grok-4.1-reasoning + gpt-5.2-thinking on v12 residue, 4 rescues) | — | 84.0% ✅ |
| Final 2026-05-18 day-5 (+ kimi-k2-thinking on v13 residue, 1 rescue qid 1235 moderate; gemini-3.0-pro 0/30 saturated) | — | 84.5% ✅ |
| Final 2026-05-18 day-5 EOD (+ helallao Pro triplet retry on v14 residue: gpt-5.2 Pro 1 rescue qid 173 challenging; grok-4.1 Pro 0/28; claude-4.5-sonnet Pro 24/31 rate-limited) | — | 85.0% ✅ |
Final 2026-05-18 day-5 night (+ DAC×reasoning combo on v15 residue: NLSQL_DAC=1 + kimi-k2-thinking + grok-4.1-reasoning оба нашли rescue qid 77 moderate FRPM-Percent; gpt-5.2-thinking 4/30 rate-limited) |
— | 85.5% (pre-audit, multiset scoring) |
Final 2026-05-18 day-5 evening v16-audit-2 (BIRD-official set scoring + _execute_gold bind-bug fix; net 0 vs pre-audit, но verified row-by-row via scripts/audit_rescore.py) |
— | 85.5% ✅ |
| Final 2026-05-18 day-5 afternoon v17-attempts (triple-cross-validate saturation on v16 residue: claude-4.5-sonnet-thinking+DAC 0/26 + 3 EXC, gemini-3.0-pro+DAC 0/29, grok-4.1-reasoning+M-Schema 0/25 + 4 EXC — все 29 residue qids resilient к ortogonal reasoning×prompt-format levers) | — | 85.5% (saturation ceiling confirmed) |
| Final 2026-05-18 day-5 afternoon v17-extended (Groq gpt-oss-120b 0/29 — RU geo-block HTTP 403 на ВСЕХ модельных запросах, новая deprecation; Mistral codestral self-consistency T=[0.2/0.4/0.6/0.8/1.0] × 29 cases 0/29 same — same-model-vote plateau подтверждён) | — | 85.5% (quintuple-saturation; past 85.5% требует first-pass model swap, VPN+Groq, или paid API top-up) |
| Final 2026-05-18 day-5 evening v17-extended-2 (mistral-large-latest × 3 rotated keys × 3 temps × 29 v16-residue qids: 29/29 reached, 0 rescues — same-Mistral-family voting plateau verified on v16 residue) | — | 85.5% (sextuple-saturation) |
Final 2026-05-18 day-5 evening v17 (+ post-cooldown helallao gpt-5.2-thinking + DAC retry на v16 residue: 28/29 reached, +1 rescue qid 896 challenging formula_1 — driverStandings.position; v17 verified via scripts/audit_rescore.py, 0 mismatches) |
— | 86.0% ✅ |
Final 2026-05-18 day-5 evening v18 (+ helallao gpt-5.2 Pro retry на v17 residue: 13/28 reached перед Pro-quota coalesce, +1 rescue qid 989 moderate formula_1 — Canadian GP 2008 winner time. Grok-4.1 Pro на том же residue 26/28 reached, 0 rescues; v18 verified via scripts/audit_rescore.py, 0 mismatches) |
— | 86.5% ✅ |
| Final 2026-05-19 EOD post-86.5 closures (P1 LIMIT-discipline prompt → −1pp/regressed; row_count_repair node → −0.5pp; CSC merge-revision arXiv:2505.13271 → +0 NULL — codestral homogeneous SC не пробивает 50% top-1 threshold; 4 A1-LIMIT target qids 484/930/1144/1205 voting-survived) | — | 86.5% (free-tier headroom closed for baseline-layer levers) |
| Final 2026-05-20 v19 (+ helallao claude-4.5-sonnet-thinking on v18 residue: 21/27 reached + 6 curl/DNS EXC, +1 rescue qid 743 challenging superhero — alignment percentage с CAST(... AS REAL) на обоих числах + LEFT JOIN к publisher; gpt-5.2 Pro retry same residue 24/27 reached / 0 rescues confirmed two-session saturation; OpenRouter free-tier heterogeneous slot closed as NEGATIVE — wiring landed as infra) | — | 87.0% ✅ |
| Final 2026-05-22 v20 (+ helallao kimi-k2-thinking без DAC on v19 residue: 25/26 reached + 1 tokenizer EXC qid 1399, +1 rescue qid 584 moderate codebase_community — postHistory.Comment vs comments.Text; grok-4.1-reasoning and claude-4.5-sonnet-thinking on v20 residue both 0 rescues; v20 audit 0 mismatches) | — | 87.5% ✅ |
| Final 2026-05-23 v21 (+ GraceKelly browser-orchestrator Claude Sonnet 4.6 ultrashort BIRD-row-grain retry on v20 qid 1399, +1 rescue qid 1399 moderate student_club — per-attendance-row CASE shape; v21 audit 0 mismatches) | — | 88.0% ✅ |
Post-v20 baseline-layer P3.F target gate (config C + targeted schema-link hints for qids 1404 and 207; full n=200 audit 0 mismatches, wins [207, 1404], regressions [] vs FK-join-hints baseline) |
— | 57.5% config C ✅ |
Final 2026-05-23 v22 (+ merge verified P3.F rescues into v21: qid 207 toxicology atom path + qid 1404 student_club event.type; wins [207, 1404], regressions []; v22 audit 0 mismatches) |
— | 89.0% ✅ |
Final 2026-05-23 v23 (+ archive-sweep across eval/reports/**/*.json against v22 misses: qid 1205 moderate thrombosis_prediction — old archived pred returns 1/0 ints, matches gold true/false under BIRD set semantics; wins [1205], regressions []; v23 audit 0 mismatches; archive sweep is an audit-discipline artefact, not a new model rescue) |
— | 89.5% ✅ |
Final 2026-05-23 v24 (+ archive-rescore qid 959 simple formula_1 — old archived results.fastestLap pred matches gold only after the day-5 bind-bug fix in db/connection.py::execute_readonly exposes the correct 16-row gold result set for LIKE '_:%:__.___'; wins [959], regressions []; v24 audit 0 mismatches; this is delayed recognition of an earlier engineering fix, not a new model rescue) |
— | 90.0% ✅ |
Final 2026-05-24 v25 (+ targeted P3.F schema-link hint for qid 902 simple formula_1 — narrow hint in _render_schema_link_hints_appendix for formula_1 questions containing the phrase "track number" instructs codestral to use driverStandings.position rather than results.position / results.positionOrder; hint scope is db_id + question-keyword + table set, by construction cannot fire on other prompts; wins [902], regressions []; v25 audit 0 mismatches; p3f_acceptance harness gates qids 207, 1404, 902 all PASS) |
— | 90.5% ✅ |
Final 2026-05-24 v26 (+ targeted P3.F schema-link hint for qid 1531 moderate debit_card_specializing — narrow hint in _render_schema_link_hints_appendix for "top spending customer" + "average price per single item" questions instructs codestral to pick the top customer from yearmonth.Consumption subquery rather than SUM(transactions_1k.Price) aggregation, and to compute the per-item average as SUM(Price / Amount) row-wise rather than SUM(Price) / SUM(Amount); hint scope is db_id + two phrase fragments + table set, by construction cannot fire on other prompts; wins [1531], regressions []; v26 audit 0 mismatches; p3f_acceptance harness gates qids 207, 1404, 902, 1531 all PASS) |
— | 91.0% ✅ |
Final 2026-05-24 v27 (+ two targeted P3.F schema-link hints — qid 894 moderate formula_1 for "best lap time recorded" / "recorded lap time" instructs codestral to include lapTimes.milliseconds as the first SELECT column with ORDER BY lapTimes.milliseconds ASC LIMIT 1 (phrase unique to qid 894 in n=200, sibling "best lap time" qid 847 not affected); qid 1251 simple thrombosis_prediction for "higher than normal" lab-value patient-count instructs codestral to restrict patients to those that appear in both Laboratory AND Examination tables — Patient ⋈ Laboratory ⋈ Examination on .ID — even when no Examination column is used in WHERE (phrase unique to qid 1251, sibling qid 1252 "normal Ig G level" not affected); wins [894, 1251], regressions []; v27 audit 0 mismatches; p3f_acceptance harness gates qids 207, 1404, 902, 1531, 894, 1251 all PASS) |
— | 92.0% ✅ |
Final 2026-05-24 v28 (+ targeted P3.F schema-link hint for qid 408 moderate card_games — narrow hint in _render_schema_link_hints_appendix for "triggered ability" info-count question instructs codestral to filter rulings.text (not cards.text) via INNER JOIN rulings ON cards.uuid = rulings.uuid, and to use COUNT(DISTINCT cards.id) to avoid ruling fan-out; hint scope is db_id + question phrase "triggered ability" + tables {cards, rulings}, by construction cannot fire on other prompts (phrase unique to qid 408 in n=200); wins [408], regressions []; v28 audit 0 mismatches; p3f_acceptance harness gates qids 207, 1404, 902, 1531, 894, 1251, 408 all PASS) |
— | 92.5% ✅ |
Final 2026-05-24 v29 (+ targeted P3.F schema-link hint for qid 1275 moderate thrombosis_prediction — narrow hint in _render_schema_link_hints_appendix for "anti-centromere"/"anti-SSB" "normal level" patient-count question instructs codestral to filter Laboratory.CENTROMEA IN ('negative','0') and Laboratory.SSB IN ('negative','0') via Patient INNER JOIN Laboratory ON .ID, NOT against Examination (Examination has no CENTROMEA/SSB columns) and NOT with fabricated '-'/'+-'/'+' tokens; hint scope is db_id + question phrases "anti-centromere" OR "anti-SSB" + tables {patient, laboratory}, by construction cannot fire on other prompts (phrase fragments unique to qid 1275 in n=200, sibling thrombosis prompts qids 1247/1252/1254/1257 do not match); wins [1275], regressions []; v29 audit 0 mismatches; p3f_acceptance harness gates qids 207, 1404, 902, 1531, 894, 1251, 408, 1275 all PASS) |
— | 93.0% pre-audit |
Final 2026-05-25 v29 audit-corrected (− 1 qid 518 v13 false positive: broken-pred + empty-gold blessed match=True by pre-fix compare_results([], []); safe_compare_pred short-circuit landed in c74b46c; baseline-summary headers regenerated 2026-05-25 via scripts/refresh_baseline_summary.py; canonical scripts/audit_rescore.py confirms 185/200 across all 8 v22–v29 baselines) |
— | 92.5% ✅ |
Final 2026-05-25 v30 (+ two targeted P3.F schema-link hints: qid 1168 challenging thrombosis_prediction "oldest SJS patient" + laboratory questions — explicit override of projection-discipline rule to include Patient.Birthday as third SELECT column + ORDER BY Patient.Birthday ASC LIMIT 1 directly on JOIN, no WHERE = (SELECT MIN) subquery (phrase unique to qid 1168 in n=200); qid 1029 moderate european_football_2 "highest build Up Play Speed" — positional inversion convention: numerically lower buildUpPlaySpeed = "higher" in BIRD gold, sort ASC + redundant INNER JOIN Team ON team_api_id (phrase unique to qid 1029 in n=200); wins [1168, 1029], regressions []; v30 audit 0 mismatches) |
— | 93.5% ✅ |
Final 2026-05-26 v31 (+ targeted P3.F schema-link hint for qid 37 moderate california_schools "lowest excellence rate" — BIRD gold inverts question word-order "Street, City, Zip and State" and emits SELECT columns as (Street, City, State, Zip) instead; explicit override of projection-discipline rule. "Excellence rate" = CAST(NumGE1500 AS REAL) / NumTstTakr; rank ASC with LIMIT 1 directly on JOIN, no WHERE CDSCode = (SELECT ...) subquery wrap. Phrase "lowest excellence rate" unique to qid 37 in n=200. wins [37], regressions []; v31 audit 0 mismatches; p3f_acceptance harness gates qids 207, 1404, 902, 1531, 894, 1251, 408, 1275, 1168, 1029, 37 all PASS; выше human-expert baseline 92.96% (+1.04pp)) |
— | 94.0% ✅ |
| GPT-4 zero-shot reference | — | 47.8% |
| Published SOTA (paid API + fine-tuning) | — | 73–76% (CHESS) |
| Human expert baseline (BIRD paper) | — | 92.96% (we pass: +1.04pp) |
Калибровка: GPT-4 zero-shot на BIRD Mini-Dev = 47.8 / 40.8 / 35.8% EX (SQLite/MySQL/PostgreSQL). Все наши числа на SQLite split — dev_split deterministic, seed=0.
Roadmap
8-10 недель, 12 этапов. Подробно в docs/02_architecture_v2.md §11.
License
MIT (TBD).

