Expand problem narrative and Engineering Notes: time-on-SQL, Spider vs prod.
HTML /demo: stat callout lede, longer blog with subheads, lists, footnote on ranges.
Gradio blog: matching cost/benchmark sections and how-to-read guide.
Made-with: Cursor
- server/demo_page.html +120 -18
- server/gradio_ui.py +22 -3
server/demo_page.html
CHANGED
@@ -738,6 +738,63 @@
 @media (max-width: 900px) {
 .blog-mini-grid { grid-template-columns: 1fr; }
 }
+.lede-stack {
+max-width: 62ch;
+margin-bottom: 18px;
+}
+.lede-stack .lede {
+max-width: none;
+}
+.stat-callout {
+margin: 0 0 16px;
+padding: 14px 16px 16px;
+border-radius: var(--radius);
+border: 1px solid #c7d2fe;
+background: linear-gradient(135deg, #eef2ff 0%, #f8fafc 55%, #ecfeff 100%);
+box-shadow: 0 6px 22px rgba(37, 99, 235, 0.08);
+font-size: 0.98rem;
+line-height: 1.58;
+color: var(--ink-soft);
+}
+.stat-callout strong {
+color: var(--ink);
+font-weight: 700;
+}
+.blog-pull-wide {
+font-family: var(--font-display);
+font-size: 1.02rem;
+line-height: 1.45;
+color: var(--ink);
+margin: 18px 0 14px;
+padding: 12px 0 12px 16px;
+border-left: 4px solid var(--hf-amber);
+background: linear-gradient(90deg, var(--hf-amber-soft), transparent);
+border-radius: 0 10px 10px 0;
+}
+.blog-subhead {
+font-size: 0.72rem;
+font-weight: 800;
+letter-spacing: 0.12em;
+text-transform: uppercase;
+color: var(--muted);
+margin: 20px 0 8px;
+}
+.blog-list {
+margin: 0 0 14px 1.1rem;
+padding: 0;
+color: var(--muted);
+font-size: 0.9375rem;
+line-height: 1.55;
+}
+.blog-list li { margin-bottom: 8px; }
+.blog-footnote {
+font-size: 0.78rem;
+color: var(--muted-light);
+line-height: 1.45;
+margin: 10px 0 0;
+padding-top: 10px;
+border-top: 1px dashed var(--space-border);
+}
 </style>
 </head>
 <body>
@@ -785,9 +842,24 @@
 <section id="environment" class="section" aria-labelledby="env-title">
 <p class="section-id">Space · Architecture</p>
 <h2 class="hero-title" id="env-title">Environment first — <em>how</em> the agent sees the world.</h2>
-<
-
-
+<div class="lede-stack">
+<p class="stat-callout">
+<strong>Today, nearly 30% of a data team’s time is spent fixing SQL and pipeline logic</strong>—not building net-new insights, not shipping product features,
+but <em>debugging queries that already looked reasonable in a notebook or PR comment</em>. That tax shows up as rework, stale dashboards, and fragile “one-off”
+analyses that nobody trusts after the third incident.
+</p>
+<p class="lede">
+<strong>Even with the most advanced AI models, the problem is not “solved.”</strong>
+On standard text-to-SQL benchmarks like Spider, headline numbers often sit in the <strong>high 80s to low 90s (%)</strong>—an impressive story for a slide deck.
+In real enterprise environments—drifting schemas, implicit business rules, join explosions, and permissioned views—that headline rarely survives contact with production.
+Teams routinely report effective success rates closer to the <strong>10–30%</strong> band unless the system closes the loop with <em>execution-grounded feedback</em>
+(run, observe error or result, attribute reward to what changed).
+</p>
+<p class="lede" style="margin-bottom:0">
+This Space hosts the same HTTP API your trainer calls: <strong>sessions</strong>, <strong>typed observations</strong>, <strong>SQLite-backed tasks</strong>, and a
+<strong>decomposed reward</strong>. Below is the end-to-end map judges can skim in seconds; the Engineering Notes section ties the problem to the OpenEnv contract and the artifacts on this page.
+</p>
+</div>
 <div class="layer-strip" aria-hidden="true">
 <span class="layer"><b>Client</b> / agent</span>
 <span class="layer"><b>API</b> session + JSON</span>
@@ -987,20 +1059,40 @@ import wandb
 “The goal is not to generate beautiful SQL text. The goal is to produce SQL fixes that survive execution, repeatedly, under changing runtime conditions.”
 </div>
 <div class="blog-mini-grid">
-<div class="blog-mini"><b>0.5B
-<div class="blog-mini"><b>32-run eval</b>
-<div class="blog-mini"><b>Execution-first</b>Reward
+<div class="blog-mini"><b>0.5B → 7B</b>Bridge run for wiring, then a stronger base model for SQL structure and joins.</div>
+<div class="blog-mini"><b>32-run eval</b>Artifact-backed pass with sample rewards and run logs you can diff, not vibes.</div>
+<div class="blog-mini"><b>Execution-first</b>Reward comes from running SQL against graded tasks—not from how persuasive the completion sounds.</div>
 </div>
+<div class="blog-mini-grid" style="margin-top:10px">
+<div class="blog-mini"><b>Spider vs prod</b>Leaderboards reward clean splits; warehouses reward joins that do not explode under skew.</div>
+<div class="blog-mini"><b>GRPO loop</b>Group-relative updates turn execution outcomes into a stable training signal across sessions.</div>
+<div class="blog-mini"><b>Reviewer path</b>Optional guardrail so risky SQL is blocked without erasing every learning opportunity.</div>
+</div>
+<p class="blog-pull-wide">
+If you only remember one tension from this page, remember this: <strong>high leaderboard accuracy is not the same thing as high production reliability.</strong>
+</p>
 <p style="color:var(--muted);margin:0 0 12px;font-size:0.9375rem">
-The motive for this project was not to build another text-to-SQL demo.
-
-boundary between language modeling and systems engineering
+The motive for this project was not to build another text-to-SQL demo. It was to shrink the gap between “model looks smart in a demo” and “model helps engineers ship.”
+SQL bugs are expensive because they fail late: a query can pass review, pass linting, and still break under real schema constraints, stale statistics, or join cardinality shifts.
+I picked this problem because it sits at the boundary between language modeling and systems engineering—if the agent improves here, it is learning runtime correctness, not cosmetic fluency.
 </p>
+<p class="blog-subhead">What leaderboards hide</p>
+<p style="color:var(--muted);margin:0 0 12px;font-size:0.9375rem">
+Spider-style suites are useful scientific instruments: they keep comparisons honest and reproducible. They are also intentionally cleaner than most corporate warehouses.
+That is why you can simultaneously believe two facts that sound contradictory: models can score in the <strong>high 80s–90s (%)</strong> on canonical benchmarks while practitioners still describe
+<strong>10–30%</strong> “works first time in our environment” outcomes unless they invest in evaluation harnesses, guardrails, and iterative repair loops grounded in execution.
+</p>
+<ul class="blog-list">
+<li><strong>Latency of truth.</strong> Text-only feedback arrives early; execution feedback arrives when the query meets the database. The latter is slower but decisive.</li>
+<li><strong>Credit assignment.</strong> Without runtime signal, you reward plausible prose. With it, you reward schema-correct joins, stable aggregates, and safe rewrites.</li>
+<li><strong>Operational drift.</strong> Production schemas evolve; a static snapshot benchmark cannot represent every enterprise edge case—so the training surface must be repeatable even when the world is messy.</li>
+</ul>
+<p class="blog-subhead">Why the OpenEnv-shaped API exists</p>
 <p style="color:var(--muted);margin:0 0 12px;font-size:0.9375rem">
 The architecture follows an OpenEnv-style contract:
-<code>reset
-Each episode runs on isolated in-memory SQLite state, deterministic task grading, and execution-grounded rewards.
-valid table references, stable aggregations, and join logic that does not collapse in edge cases.
+<code>reset → observation</code> and <code>step(action) → observation, reward, done, info</code>.
+Each episode runs on isolated in-memory SQLite state, deterministic task grading, and execution-grounded rewards. That contract is what lets you compare runs, swap algorithms,
+and keep the same measurement tape: valid table references, stable aggregations, and join logic that does not collapse in edge cases.
 </p>
 <code class="pre">Conceptual reward:
 R_t = w_c*C_t + w_e*E_t + w_p*P_t + w_s*S_t - lambda*Penalty_t
@@ -1010,7 +1102,7 @@ J(pi) = E_{tau ~ pi}[sum_{t=0..T} gamma^t * R_t]</code>
 <p style="color:var(--muted);margin:0 0 12px;font-size:0.9375rem">
 The technical design makes debugging measurable. Session state exposes observations, action history, and reward trajectories.
 The reviewer-gated path adds risk control for unsafe submissions while preserving gradient signal (instead of hard-failing every risky step).
-
+That gives the policy consequences it can learn from: what failed, why it failed, and how far a candidate moved toward a valid fix.
 </p>
 <code class="pre">Data snapshot shown on this page:
 - Spider-style industry baseline: 48.2%
@@ -1019,13 +1111,23 @@ J(pi) = E_{tau ~ pi}[sum_{t=0..T} gamma^t * R_t]</code>
 - Performance leap view: 0.0% -> 25.0%
 - Hard evidence: 32-run eval + sample reward artifacts</code>
 <p style="color:var(--muted);margin:12px 0 12px;font-size:0.9375rem">
-
-If a metric appears, it should map to concrete run folders, reward JSON files, and checkpoint lineage.
+Traceability is a product decision, not a footnote. This page is an evidence chain: first training context, live interaction, then artifact-backed plots.
+If a metric appears, it should map to concrete run folders, reward JSON files, and checkpoint lineage—so a reviewer can reconstruct the claim without trusting a single screenshot.
 </p>
+<p class="blog-subhead">How to read what ships here</p>
+<ul class="blog-list">
+<li><strong>Environment diagram</strong> — the contract between client, API, env core, data layer, and training artifacts.</li>
+<li><strong>Playground</strong> — the same <code>/reset</code> and <code>/step</code> loop your trainer uses, in-browser, with explicit session headers.</li>
+<li><strong>Benchmark visuals + evidence PNGs</strong> — static exports committed under <code>server/static/</code>; regenerate from real run JSON when you change the story.</li>
+</ul>
 <p style="color:var(--muted);margin:0 0 12px;font-size:0.9375rem">
-Industry and research
-Enterprise SQL debugging
-execution-grounded learning loop.
+Industry and research converge on the same diagnosis: robust text-to-SQL needs context quality, intent handling, dialect robustness, and execution safeguards.
+Enterprise SQL debugging stays painful when feedback is detached from runtime behavior. The objective of this Space is to close that gap with a reproducible,
+execution-grounded learning loop you can fork, stress-test, and defend in a review.
+</p>
+<p class="blog-footnote">
+Percent ranges (≈30% time on debugging work; ≈10–30% production success vs high-80s/90s benchmark headlines) summarize common practitioner reporting and public benchmark narratives;
+your organization’s distributions will differ—treat them as motivation for measurement, not as universal constants.
 </p>
 <div class="link-list" style="margin-top:12px">
 <a href="https://github.com/mdayan8/sql-debug-env" target="_blank" rel="noopener">GitHub — mdayan8/sql-debug-env</a>
server/gradio_ui.py
CHANGED
@@ -689,8 +689,15 @@ def build_blocks(static_dir: Path) -> Any:
 gr.Markdown(
 "### Why I picked SQL debugging and why this architecture exists\n"
 "“The goal is not to generate beautiful SQL text. The goal is to produce SQL fixes that survive execution, repeatedly, under changing runtime conditions.”\n\n"
-"
-"
+"### The cost of “almost right” SQL\n"
+"Industry time-use reporting commonly puts **roughly a quarter to a third** of analytics and data-engineering work into fixing queries and pipelines—"
+"**not** shipping net-new insights, **not** launching features, but **debugging SQL that already looked reasonable** in a notebook or PR.\n\n"
+"### Benchmarks vs production\n"
+"On Spider-style leaderboards, headline numbers often sit in the **high 80s to low 90s (%)**. In messy enterprise warehouses—drifting schemas, implicit business rules, "
+"join explosions, permissioned views—teams routinely describe effective success rates closer to the **10–30%** band unless the system closes the loop with "
+"**execution-grounded feedback** (run the SQL, read the error or result, attribute reward to what changed).\n\n"
+"SQL debugging is one of the few tasks where *language quality* and *system quality* diverge sharply: a query can be neat, plausible, and still fail in production. "
+"This project forces the agent to optimize for **behavior under execution**, not only fluency under prompting."
 )
 gr.HTML(
 """
@@ -703,6 +710,13 @@ def build_blocks(static_dir: Path) -> Any:
 """.strip()
 )
 gr.Markdown(
+"#### What leaderboards hide\n"
+"Canonical text-to-SQL suites are valuable scientific instruments: they keep comparisons honest. They are also cleaner than most corporate warehouses. "
+"That is why two statements can both be true: models can score **very high** on Spider-style tasks while practitioners still report **low tens to low thirties** "
+"effective reliability in production unless they invest in harnesses, guardrails, and iterative repair grounded in execution.\n\n"
+"- **Latency of truth**: prose feedback is fast; execution feedback is slower—and decisive.\n"
+"- **Credit assignment**: without runtime signal you reward plausible text; with it you reward joins, aggregates, and safe rewrites that actually run.\n"
+"- **Drift**: schemas evolve; the training surface must stay repeatable even when the world is messy.\n\n"
 "#### OpenEnv framing (why this is not just a demo UI)\n"
 "The environment follows an OpenEnv-style interface: `reset -> observation`, `step(action) -> observation, reward, done, info`. "
 "This is important because it gives the training loop a stable contract. Every algorithmic change can be tested against the same API semantics, which improves reproducibility.\n\n"
@@ -737,9 +751,14 @@ def build_blocks(static_dir: Path) -> Any:
 "#### Why start with 0.5B then move to 7B\n"
 "The first bridge run on **Qwen2.5-Coder-0.5B** is intentionally about speed of iteration: verify environment wiring, reward path, and notebook workflow quickly. "
 "The **7B track** is then used for stronger SQL reasoning capacity and better convergence under execution-grounded rewards.\n\n"
+"#### How to read this Space\n"
+"- **Diagram** — client → API → env core → data/reward → training and artifacts.\n"
+"- **Playground** — same `POST /reset` and `POST /step` loop as training, with explicit `X-Session-Id`.\n"
+"- **Charts + static PNGs** — committed under `server/static/` so claims stay diffable and auditable.\n\n"
 "#### Motivation recap\n"
 "I did not build this to prove that a model can emit valid-looking SQL. I built it to make SQL repair measurable as an engineering problem under runtime constraints. "
-"The evidence-first layout (first context, live loop, artifact chain) is deliberate: each reported number should be traceable to run data, not presentation-only visuals."
+"The evidence-first layout (first context, live loop, artifact chain) is deliberate: each reported number should be traceable to run data, not presentation-only visuals.\n\n"
+"*Note: percentage ranges summarize common practitioner reporting and public benchmark narratives; your organization’s numbers will differ—treat them as motivation to measure, not as universal constants.*"
 )
 gr.Markdown(
 f"- [Google Cloud: techniques for improving text-to-SQL]({GCLOUD_TEXT2SQL_BLOG})\n"