md896 committed on
Commit 40caa50 · 1 Parent(s): d00292e

Expand problem narrative and Engineering Notes: time-on-SQL, Spider vs prod.


HTML /demo: stat callout lede, longer blog with subheads, lists, footnote on ranges.
Gradio blog: matching cost/benchmark sections and how-to-read guide.

Made-with: Cursor

Files changed (2)
  1. server/demo_page.html +120 -18
  2. server/gradio_ui.py +22 -3
server/demo_page.html CHANGED
@@ -738,6 +738,63 @@
   @media (max-width: 900px) {
   .blog-mini-grid { grid-template-columns: 1fr; }
   }
   </style>
   </head>
   <body>
@@ -785,9 +842,24 @@
   <section id="environment" class="section" aria-labelledby="env-title">
   <p class="section-id">Space · Architecture</p>
   <h2 class="hero-title" id="env-title">Environment first — <em>how</em> the agent sees the world.</h2>
- <p class="lede">
- This Space hosts the same HTTP API your trainer calls: sessions, typed observations, SQLite-backed tasks, and a decomposed reward. Below is the end-to-end map judges can skim in seconds.
- </p>
   <div class="layer-strip" aria-hidden="true">
   <span class="layer"><b>Client</b> / agent</span>
   <span class="layer"><b>API</b> session + JSON</span>
@@ -987,20 +1059,40 @@ import wandb
   “The goal is not to generate beautiful SQL text. The goal is to produce SQL fixes that survive execution, repeatedly, under changing runtime conditions.”
   </div>
   <div class="blog-mini-grid">
- <div class="blog-mini"><b>0.5B -> 7B</b>Model track from first bridge run to main baseline.</div>
- <div class="blog-mini"><b>32-run eval</b>Final artifact path with sample rewards and run logs.</div>
- <div class="blog-mini"><b>Execution-first</b>Reward is computed from runtime outcomes, not prompt resemblance.</div>
   </div>
   <p style="color:var(--muted);margin:0 0 12px;font-size:0.9375rem">
- The motive for this project was not to build another text-to-SQL demo. The motive was reliability. SQL bugs are expensive because they fail late:
- queries can look clean in review but break under real schema constraints, data skew, or join cardinality shifts. I picked this problem because it sits at the
- boundary between language modeling and systems engineering. If the agent improves here, it is learning runtime correctness, not cosmetic fluency.
   </p>
   <p style="color:var(--muted);margin:0 0 12px;font-size:0.9375rem">
   The architecture follows an OpenEnv-style contract:
- <code>reset -&gt; observation</code> and <code>step(action) -&gt; observation, reward, done, info</code>.
- Each episode runs on isolated in-memory SQLite state, deterministic task grading, and execution-grounded rewards. This pushes the model toward behaviors that survive runtime:
- valid table references, stable aggregations, and join logic that does not collapse in edge cases.
   </p>
   <code class="pre">Conceptual reward:
   R_t = w_c*C_t + w_e*E_t + w_p*P_t + w_s*S_t - lambda*Penalty_t
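The conceptual reward `R_t` above feeds the discounted objective `J(pi) = E[sum_{t=0..T} gamma^t * R_t]` shown in the surrounding hunk context. A minimal sketch of the per-episode return that objective averages over (the `gamma` values here are illustrative, not the repo's configuration):

```python
def discounted_return(rewards, gamma=0.99):
    """sum_{t=0..T} gamma^t * R_t for one episode's reward trajectory."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# With gamma = 0.5 and three unit step rewards: 1.0 + 0.5 + 0.25 = 1.75
episode_return = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
```

Later steps are down-weighted geometrically, so fixes that reach a valid state in fewer steps earn a higher return.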
@@ -1010,7 +1102,7 @@ J(pi) = E_{tau ~ pi}[sum_{t=0..T} gamma^t * R_t]</code>
   <p style="color:var(--muted);margin:0 0 12px;font-size:0.9375rem">
   The technical design makes debugging measurable. Session state exposes observations, action history, and reward trajectories.
   The reviewer-gated path adds risk control for unsafe submissions while preserving gradient signal (instead of hard-failing every risky step).
- This gives the policy useful consequences: what failed, why it failed, and how far a candidate moved toward a valid fix.
   </p>
   <code class="pre">Data snapshot shown on this page:
   - Spider-style industry baseline: 48.2%
@@ -1019,13 +1111,23 @@ J(pi) = E_{tau ~ pi}[sum_{t=0..T} gamma^t * R_t]</code>
   - Performance leap view: 0.0% -> 25.0%
   - Hard evidence: 32-run eval + sample reward artifacts</code>
   <p style="color:var(--muted);margin:12px 0 12px;font-size:0.9375rem">
- Another deliberate choice is traceability. This page is an evidence chain: first training context, live interaction, then artifact-backed plots.
- If a metric appears, it should map to concrete run folders, reward JSON files, and checkpoint lineage.
   </p>
   <p style="color:var(--muted);margin:0 0 12px;font-size:0.9375rem">
- Industry and research point the same direction: robust text-to-SQL requires context quality, intent handling, dialect robustness, and execution safeguards.
- Enterprise SQL debugging remains difficult when feedback is detached from runtime behavior. The objective here is to close that gap with a reproducible,
- execution-grounded learning loop.
   </p>
   <div class="link-list" style="margin-top:12px">
   <a href="https://github.com/mdayan8/sql-debug-env" target="_blank" rel="noopener">GitHub — mdayan8/sql-debug-env</a>
 
   @media (max-width: 900px) {
   .blog-mini-grid { grid-template-columns: 1fr; }
   }
+ .lede-stack {
+   max-width: 62ch;
+   margin-bottom: 18px;
+ }
+ .lede-stack .lede {
+   max-width: none;
+ }
+ .stat-callout {
+   margin: 0 0 16px;
+   padding: 14px 16px 16px;
+   border-radius: var(--radius);
+   border: 1px solid #c7d2fe;
+   background: linear-gradient(135deg, #eef2ff 0%, #f8fafc 55%, #ecfeff 100%);
+   box-shadow: 0 6px 22px rgba(37, 99, 235, 0.08);
+   font-size: 0.98rem;
+   line-height: 1.58;
+   color: var(--ink-soft);
+ }
+ .stat-callout strong {
+   color: var(--ink);
+   font-weight: 700;
+ }
+ .blog-pull-wide {
+   font-family: var(--font-display);
+   font-size: 1.02rem;
+   line-height: 1.45;
+   color: var(--ink);
+   margin: 18px 0 14px;
+   padding: 12px 0 12px 16px;
+   border-left: 4px solid var(--hf-amber);
+   background: linear-gradient(90deg, var(--hf-amber-soft), transparent);
+   border-radius: 0 10px 10px 0;
+ }
+ .blog-subhead {
+   font-size: 0.72rem;
+   font-weight: 800;
+   letter-spacing: 0.12em;
+   text-transform: uppercase;
+   color: var(--muted);
+   margin: 20px 0 8px;
+ }
+ .blog-list {
+   margin: 0 0 14px 1.1rem;
+   padding: 0;
+   color: var(--muted);
+   font-size: 0.9375rem;
+   line-height: 1.55;
+ }
+ .blog-list li { margin-bottom: 8px; }
+ .blog-footnote {
+   font-size: 0.78rem;
+   color: var(--muted-light);
+   line-height: 1.45;
+   margin: 10px 0 0;
+   padding-top: 10px;
+   border-top: 1px dashed var(--space-border);
+ }
   </style>
   </head>
   <body>
 
   <section id="environment" class="section" aria-labelledby="env-title">
   <p class="section-id">Space · Architecture</p>
   <h2 class="hero-title" id="env-title">Environment first — <em>how</em> the agent sees the world.</h2>
+ <div class="lede-stack">
+ <p class="stat-callout">
+ <strong>Today, nearly 30% of a data team’s time is spent fixing SQL and pipeline logic</strong>—not building net-new insights, not shipping product features,
+ but <em>debugging queries that already looked reasonable in a notebook or PR comment</em>. That tax shows up as rework, stale dashboards, and fragile “one-off”
+ analyses that nobody trusts after the third incident.
+ </p>
+ <p class="lede">
+ <strong>Even with the most advanced AI models, the problem is not “solved.”</strong>
+ On standard text-to-SQL benchmarks like Spider, headline numbers often sit in the <strong>high 80s to low 90s (%)</strong>—an impressive story for a slide deck.
+ In real enterprise environments—drifting schemas, implicit business rules, join explosions, and permissioned views—that headline rarely survives contact with production.
+ Teams routinely report effective success rates closer to the <strong>10–30%</strong> band unless the system closes the loop with <em>execution-grounded feedback</em>
+ (run, observe error or result, attribute reward to what changed).
+ </p>
+ <p class="lede" style="margin-bottom:0">
+ This Space hosts the same HTTP API your trainer calls: <strong>sessions</strong>, <strong>typed observations</strong>, <strong>SQLite-backed tasks</strong>, and a
+ <strong>decomposed reward</strong>. Below is the end-to-end map judges can skim in seconds; the Engineering Notes section ties the problem to the OpenEnv contract and the artifacts on this page.
+ </p>
+ </div>
   <div class="layer-strip" aria-hidden="true">
   <span class="layer"><b>Client</b> / agent</span>
   <span class="layer"><b>API</b> session + JSON</span>
 
   “The goal is not to generate beautiful SQL text. The goal is to produce SQL fixes that survive execution, repeatedly, under changing runtime conditions.”
   </div>
   <div class="blog-mini-grid">
+ <div class="blog-mini"><b>0.5B -&gt; 7B</b>Bridge run for wiring, then a stronger base model for SQL structure and joins.</div>
+ <div class="blog-mini"><b>32-run eval</b>Artifact-backed pass with sample rewards and run logs you can diff, not vibes.</div>
+ <div class="blog-mini"><b>Execution-first</b>Reward comes from running SQL against graded tasks—not from how persuasive the completion sounds.</div>
   </div>
+ <div class="blog-mini-grid" style="margin-top:10px">
+ <div class="blog-mini"><b>Spider vs prod</b>Leaderboards reward clean splits; warehouses reward joins that do not explode under skew.</div>
+ <div class="blog-mini"><b>GRPO loop</b>Group-relative updates turn execution outcomes into a stable training signal across sessions.</div>
+ <div class="blog-mini"><b>Reviewer path</b>Optional guardrail so risky SQL is blocked without erasing every learning opportunity.</div>
+ </div>
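The “GRPO loop” card above names the group-relative step; a minimal sketch of that idea, assuming a standard GRPO-style normalization (not taken from this repo's trainer code): several completions are sampled per prompt, scored by execution, and each one's advantage is computed within its own group.

```python
def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each completion's reward against its sampling group.

    GRPO-style: advantage = (r - group_mean) / (group_std + eps),
    so no learned value network is needed.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical group: four sampled SQL fixes for one task, two ran correctly.
advs = group_relative_advantages([1.0, 1.0, 0.0, 0.0])
```

Completions that beat their own group get positive advantage even when absolute rewards are sparse, which is what makes execution outcomes usable as a stable training signal.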
+ <p class="blog-pull-wide">
+ If you only remember one tension from this page, remember this: <strong>high leaderboard accuracy is not the same thing as high production reliability.</strong>
+ </p>
   <p style="color:var(--muted);margin:0 0 12px;font-size:0.9375rem">
+ The motive for this project was not to build another text-to-SQL demo. It was to shrink the gap between “model looks smart in a demo” and “model helps engineers ship.”
+ SQL bugs are expensive because they fail late: a query can pass review, pass linting, and still break under real schema constraints, stale statistics, or join cardinality shifts.
+ I picked this problem because it sits at the boundary between language modeling and systems engineering—if the agent improves here, it is learning runtime correctness, not cosmetic fluency.
   </p>
+ <p class="blog-subhead">What leaderboards hide</p>
+ <p style="color:var(--muted);margin:0 0 12px;font-size:0.9375rem">
+ Spider-style suites are useful scientific instruments: they keep comparisons honest and reproducible. They are also intentionally cleaner than most corporate warehouses.
+ That is why you can simultaneously believe two facts that sound contradictory: models can score in the <strong>high 80s–90s (%)</strong> on canonical benchmarks while practitioners still describe
+ <strong>10–30%</strong> “works first time in our environment” outcomes unless they invest in evaluation harnesses, guardrails, and iterative repair loops grounded in execution.
+ </p>
+ <ul class="blog-list">
+ <li><strong>Latency of truth.</strong> Text-only feedback arrives early; execution feedback arrives when the query meets the database. The latter is slower but decisive.</li>
+ <li><strong>Credit assignment.</strong> Without runtime signal, you reward plausible prose. With it, you reward schema-correct joins, stable aggregates, and safe rewrites.</li>
+ <li><strong>Operational drift.</strong> Production schemas evolve; a static snapshot benchmark cannot represent every enterprise edge case—so the training surface must be repeatable even when the world is messy.</li>
+ </ul>
+ <p class="blog-subhead">Why the OpenEnv-shaped API exists</p>
   <p style="color:var(--muted);margin:0 0 12px;font-size:0.9375rem">
   The architecture follows an OpenEnv-style contract:
+ <code>reset -&gt; observation</code> and <code>step(action) -&gt; observation, reward, done, info</code>.
+ Each episode runs on isolated in-memory SQLite state, deterministic task grading, and execution-grounded rewards. That contract is what lets you compare runs, swap algorithms,
+ and keep the same measurement tape: valid table references, stable aggregations, and join logic that does not collapse in edge cases.
   </p>
   <code class="pre">Conceptual reward:
   R_t = w_c*C_t + w_e*E_t + w_p*P_t + w_s*S_t - lambda*Penalty_t
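The decomposed reward above can be sketched as a small function. The weights, the channel meanings, and the penalty value here are illustrative assumptions, not the repo's actual coefficients:

```python
def step_reward(components, weights, penalty=0.0, lam=0.1):
    """R_t = w_c*C_t + w_e*E_t + w_p*P_t + w_s*S_t - lambda*Penalty_t.

    Assumed channel meanings: c = parses/compiles, e = executes,
    p = progress toward the expected result, s = safety/style.
    """
    base = sum(weights[k] * components[k] for k in weights)
    return base - lam * penalty

# Hypothetical step: query parses and runs but only half-matches the target rows.
r = step_reward(
    components={"c": 1.0, "e": 1.0, "p": 0.5, "s": 1.0},
    weights={"c": 0.2, "e": 0.4, "p": 0.3, "s": 0.1},
)
```

Splitting the reward this way keeps a partial signal alive (the query ran, it just returned the wrong rows) instead of collapsing every failure to zero.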
 
   <p style="color:var(--muted);margin:0 0 12px;font-size:0.9375rem">
   The technical design makes debugging measurable. Session state exposes observations, action history, and reward trajectories.
   The reviewer-gated path adds risk control for unsafe submissions while preserving gradient signal (instead of hard-failing every risky step).
+ That gives the policy consequences it can learn from: what failed, why it failed, and how far a candidate moved toward a valid fix.
   </p>
   <code class="pre">Data snapshot shown on this page:
   - Spider-style industry baseline: 48.2%
 
   - Performance leap view: 0.0% -> 25.0%
   - Hard evidence: 32-run eval + sample reward artifacts</code>
   <p style="color:var(--muted);margin:12px 0 12px;font-size:0.9375rem">
+ Traceability is a product decision, not a footnote. This page is an evidence chain: first training context, live interaction, then artifact-backed plots.
+ If a metric appears, it should map to concrete run folders, reward JSON files, and checkpoint lineage—so a reviewer can reconstruct the claim without trusting a single screenshot.
   </p>
+ <p class="blog-subhead">How to read what ships here</p>
+ <ul class="blog-list">
+ <li><strong>Environment diagram</strong> — the contract between client, API, env core, data layer, and training artifacts.</li>
+ <li><strong>Playground</strong> — the same <code>/reset</code> and <code>/step</code> loop your trainer uses, in-browser, with explicit session headers.</li>
+ <li><strong>Benchmark visuals + evidence PNGs</strong> — static exports committed under <code>server/static/</code>; regenerate from real run JSON when you change the story.</li>
+ </ul>
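The Playground bullet above can be sketched as a client. The `/reset` and `/step` paths and the `X-Session-Id` header come from this commit; the base URL and the `{"action": {"sql": ...}}` payload shape are assumptions, since the real request schema is not shown here:

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumed; point at the Space's URL in practice

def make_request(path, session_id, payload):
    """Build the POST the playground/trainer loop would send."""
    return urllib.request.Request(
        f"{BASE_URL}{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "X-Session-Id": session_id},
        method="POST",
    )

def reset_request(session_id):
    return make_request("/reset", session_id, {})

def step_request(session_id, sql):
    # Action schema is illustrative; the real env may wrap SQL differently.
    return make_request("/step", session_id, {"action": {"sql": sql}})

# To actually run the loop (requires the server to be up):
#   with urllib.request.urlopen(step_request("demo", "SELECT 1;")) as resp:
#       print(json.load(resp))
```

Keeping the session id in a header rather than the body means the browser playground and the trainer can share one stateless request builder.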
   <p style="color:var(--muted);margin:0 0 12px;font-size:0.9375rem">
+ Industry and research converge on the same diagnosis: robust text-to-SQL needs context quality, intent handling, dialect robustness, and execution safeguards.
+ Enterprise SQL debugging stays painful when feedback is detached from runtime behavior. The objective of this Space is to close that gap with a reproducible,
+ execution-grounded learning loop you can fork, stress-test, and defend in a review.
+ </p>
+ <p class="blog-footnote">
+ Percent ranges (≈30% time on debugging work; ≈10–30% production success vs high-80s/90s benchmark headlines) summarize common practitioner reporting and public benchmark narratives;
+ your organization’s distributions will differ—treat them as motivation for measurement, not as universal constants.
   </p>
   <div class="link-list" style="margin-top:12px">
   <a href="https://github.com/mdayan8/sql-debug-env" target="_blank" rel="noopener">GitHub — mdayan8/sql-debug-env</a>
server/gradio_ui.py CHANGED
@@ -689,8 +689,15 @@ def build_blocks(static_dir: Path) -> Any:
   gr.Markdown(
   "### Why I picked SQL debugging and why this architecture exists\n"
   "“The goal is not to generate beautiful SQL text. The goal is to produce SQL fixes that survive execution, repeatedly, under changing runtime conditions.”\n\n"
- "SQL debugging is one of the few tasks where language quality and system quality can diverge sharply. A query can be grammatically neat, semantically plausible, and still fail in production. "
- "I chose this problem because it forces an agent to optimize for *behavior under execution*, not only style under prompting."
   )
   gr.HTML(
   """
@@ -703,6 +710,13 @@ def build_blocks(static_dir: Path) -> Any:
   """.strip()
   )
   gr.Markdown(
   "#### OpenEnv framing (why this is not just a demo UI)\n"
   "The environment follows an OpenEnv-style interface: `reset -> observation`, `step(action) -> observation, reward, done, info`. "
   "This is important because it gives the training loop a stable contract. Every algorithmic change can be tested against the same API semantics, which improves reproducibility.\n\n"
@@ -737,9 +751,14 @@ def build_blocks(static_dir: Path) -> Any:
   "#### Why start with 0.5B then move to 7B\n"
   "The first bridge run on **Qwen2.5-Coder-0.5B** is intentionally about speed of iteration: verify environment wiring, reward path, and notebook workflow quickly. "
   "The **7B track** is then used for stronger SQL reasoning capacity and better convergence under execution-grounded rewards.\n\n"
   "#### Motivation recap\n"
   "I did not build this to prove that a model can emit valid-looking SQL. I built it to make SQL repair measurable as an engineering problem under runtime constraints. "
- "The evidence-first layout (first context, live loop, artifact chain) is deliberate: each reported number should be traceable to run data, not presentation-only visuals."
   )
   gr.Markdown(
   f"- [Google Cloud: techniques for improving text-to-SQL]({GCLOUD_TEXT2SQL_BLOG})\n"
 
   gr.Markdown(
   "### Why I picked SQL debugging and why this architecture exists\n"
   "“The goal is not to generate beautiful SQL text. The goal is to produce SQL fixes that survive execution, repeatedly, under changing runtime conditions.”\n\n"
+ "### The cost of “almost right” SQL\n"
+ "Industry time-use reporting commonly puts **roughly a quarter to a third** of analytics and data-engineering work into fixing queries and pipelines—"
+ "**not** shipping net-new insights, **not** launching features, but **debugging SQL that already looked reasonable** in a notebook or PR.\n\n"
+ "### Benchmarks vs production\n"
+ "On Spider-style leaderboards, headline numbers often sit in the **high 80s to low 90s (%)**. In messy enterprise warehouses—drifting schemas, implicit business rules, "
+ "join explosions, permissioned views—teams routinely describe effective success rates closer to the **10–30%** band unless the system closes the loop with "
+ "**execution-grounded feedback** (run the SQL, read the error or result, attribute reward to what changed).\n\n"
+ "SQL debugging is one of the few tasks where *language quality* and *system quality* diverge sharply: a query can be neat, plausible, and still fail in production. "
+ "This project forces the agent to optimize for **behavior under execution**, not only fluency under prompting."
   )
   gr.HTML(
   """
 
   """.strip()
   )
   gr.Markdown(
+ "#### What leaderboards hide\n"
+ "Canonical text-to-SQL suites are valuable scientific instruments: they keep comparisons honest. They are also cleaner than most corporate warehouses. "
+ "That is why two statements can both be true: models can score **very high** on Spider-style tasks while practitioners still report **low tens to low thirties** "
+ "effective reliability in production unless they invest in harnesses, guardrails, and iterative repair grounded in execution.\n\n"
+ "- **Latency of truth**: prose feedback is fast; execution feedback is slower—and decisive.\n"
+ "- **Credit assignment**: without runtime signal you reward plausible text; with it you reward joins, aggregates, and safe rewrites that actually run.\n"
+ "- **Drift**: schemas evolve; the training surface must stay repeatable even when the world is messy.\n\n"
   "#### OpenEnv framing (why this is not just a demo UI)\n"
   "The environment follows an OpenEnv-style interface: `reset -> observation`, `step(action) -> observation, reward, done, info`. "
   "This is important because it gives the training loop a stable contract. Every algorithmic change can be tested against the same API semantics, which improves reproducibility.\n\n"
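One way to make that interface concrete is a toy version: an in-memory SQLite task where reward comes from executing the submitted SQL. Everything here (schema, task text, reward values) is invented for illustration; only the `reset`/`step` shape mirrors the contract described above:

```python
import sqlite3

class ToySqlDebugEnv:
    """Minimal OpenEnv-style shape: reset() -> obs, step(a) -> (obs, reward, done, info)."""

    def reset(self):
        self.db = sqlite3.connect(":memory:")
        self.db.executescript(
            "CREATE TABLE users(id INTEGER, active INTEGER);"
            "INSERT INTO users VALUES (1, 1), (2, 0), (3, 1);"
        )
        self.expected = [(2,)]  # count of active users
        return {"task": "Fix: SELECT COUNT(*) FROM user WHERE active = 1"}

    def step(self, sql):
        try:
            rows = self.db.execute(sql).fetchall()
        except sqlite3.Error as e:
            # Execution failed: zero reward, error text returned as the observation.
            return {"error": str(e)}, 0.0, False, {"executed": False}
        # Ran successfully: full reward only if the result matches the target.
        reward = 1.0 if rows == self.expected else 0.25
        return {"rows": rows}, reward, rows == self.expected, {"executed": True}

env = ToySqlDebugEnv()
obs = env.reset()
obs, r, done, info = env.step("SELECT COUNT(*) FROM users WHERE active = 1")
```

The buggy task query references `user` instead of `users`; only a fix that actually executes and returns the expected rows ends the episode with full reward, which is the execution-grounded signal the notes describe.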
 
   "#### Why start with 0.5B then move to 7B\n"
   "The first bridge run on **Qwen2.5-Coder-0.5B** is intentionally about speed of iteration: verify environment wiring, reward path, and notebook workflow quickly. "
   "The **7B track** is then used for stronger SQL reasoning capacity and better convergence under execution-grounded rewards.\n\n"
+ "#### How to read this Space\n"
+ "- **Diagram** — client → API → env core → data/reward → training and artifacts.\n"
+ "- **Playground** — same `POST /reset` and `POST /step` loop as training, with explicit `X-Session-Id`.\n"
+ "- **Charts + static PNGs** — committed under `server/static/` so claims stay diffable and auditable.\n\n"
   "#### Motivation recap\n"
   "I did not build this to prove that a model can emit valid-looking SQL. I built it to make SQL repair measurable as an engineering problem under runtime constraints. "
+ "The evidence-first layout (first context, live loop, artifact chain) is deliberate: each reported number should be traceable to run data, not presentation-only visuals.\n\n"
+ "*Note: percentage ranges summarize common practitioner reporting and public benchmark narratives; your organization’s numbers will differ—treat them as motivation to measure, not as universal constants.*"
   )
   gr.Markdown(
   f"- [Google Cloud: techniques for improving text-to-SQL]({GCLOUD_TEXT2SQL_BLOG})\n"