| <html lang="en"> | |
| <head> | |
| <meta charset="UTF-8" /> | |
| <meta name="viewport" content="width=device-width, initial-scale=1.0" /> | |
| <title>DataSense E2B — The Full Story</title> | |
| <link rel="preconnect" href="https://fonts.googleapis.com" /> | |
| <link rel="preconnect" href="https://fonts.gstatic.com" crossorigin /> | |
| <link href="https://fonts.googleapis.com/css2?family=Fraunces:ital,opsz,wght@0,9..144,300..900;1,9..144,300..900&family=IBM+Plex+Mono:ital,wght@0,400;0,500;0,600;1,400&family=Newsreader:ital,opsz,wght@0,6..72,200..800;1,6..72,200..800&display=swap" rel="stylesheet" /> | |
| <style> | |
| :root { | |
| /* Editorial Color Palette */ | |
| --bg: #F4F3ED; /* Warm newspaper cream */ | |
| --text: #111110; /* Deep ink */ | |
| --text-muted: #4A4A46; | |
| --border: #111110; | |
| /* Vibrant Print Accents */ | |
| --accent: #E1341E; /* Vermilion Red */ | |
| --accent-blue: #1843D2; /* Cobalt */ | |
| --accent-warm: #D46F15; /* Ochre */ | |
| --accent-ok: #0D733B; /* Forest Green */ | |
| --max-width: 860px; | |
| --radius: 0px; /* Brutalist/Print - absolutely no rounded corners */ | |
| --shadow-offset: 6px; | |
| } | |
| * { box-sizing: border-box; margin: 0; padding: 0; } | |
| html { scroll-behavior: smooth; } | |
| ::selection { | |
| background: var(--accent); | |
| color: var(--bg); | |
| } | |
| body { | |
| font-family: "Newsreader", serif; | |
| background-color: var(--bg); | |
| color: var(--text); | |
| line-height: 1.65; | |
| font-size: 1.15rem; | |
| font-weight: 400; | |
| -webkit-font-smoothing: antialiased; | |
| /* Subtle noise texture for a paper feel */ | |
| background-image: url("data:image/svg+xml,%3Csvg viewBox='0 0 400 400' xmlns='http://www.w3.org/2000/svg'%3E%3Cfilter id='noiseFilter'%3E%3CfeTurbulence type='fractalNoise' baseFrequency='0.9' numOctaves='3' stitchTiles='stitch'/%3E%3C/filter%3E%3Crect width='100%25' height='100%25' filter='url(%23noiseFilter)' opacity='0.04'/%3E%3C/svg%3E"); | |
| } | |
| .wrap { | |
| max-width: var(--max-width); | |
| margin: 0 auto; | |
| padding: 4rem 2rem 8rem; | |
| } | |
| /* ------------------------------------------- | |
| Header & Hero Typography | |
| ------------------------------------------- */ | |
| header { | |
| margin-bottom: 4rem; | |
| padding-bottom: 3rem; | |
| border-bottom: 4px solid var(--border); | |
| position: relative; | |
| } | |
| header::after { | |
| content: ""; | |
| position: absolute; | |
| bottom: -10px; | |
| left: 0; | |
| width: 100%; | |
| height: 1px; | |
| background: var(--border); | |
| } | |
| .badge { | |
| display: inline-block; | |
| font-family: "IBM Plex Mono", monospace; | |
| font-size: 0.75rem; | |
| font-weight: 600; | |
| letter-spacing: 0.1em; | |
| text-transform: uppercase; | |
| color: var(--bg); | |
| background: var(--text); | |
| padding: 0.4rem 0.8rem; | |
| margin-bottom: 2rem; | |
| } | |
| h1 { | |
| font-family: "Fraunces", serif; | |
| font-size: clamp(3rem, 7vw, 5.5rem); | |
| font-weight: 800; | |
| font-variation-settings: "SOFT" 0, "WONK" 1; | |
| line-height: 0.95; | |
| letter-spacing: -0.03em; | |
| margin-bottom: 1.5rem; | |
| text-transform: uppercase; | |
| } | |
| .subtitle { | |
| font-family: "Newsreader", serif; | |
| font-size: 1.4rem; | |
| font-style: italic; | |
| color: var(--text-muted); | |
| max-width: 36em; | |
| line-height: 1.4; | |
| } | |
| .meta { | |
| margin-top: 2rem; | |
| font-family: "IBM Plex Mono", monospace; | |
| font-size: 0.85rem; | |
| text-transform: uppercase; | |
| letter-spacing: 0.05em; | |
| color: var(--text-muted); | |
| border-top: 1px dashed var(--border); | |
| padding-top: 1rem; | |
| } | |
| /* ------------------------------------------- | |
| Table of Contents | |
| ------------------------------------------- */ | |
| nav.toc { | |
| background: transparent; | |
| border: 2px solid var(--border); | |
| padding: 2rem; | |
| margin-bottom: 4rem; | |
| box-shadow: var(--shadow-offset) var(--shadow-offset) 0 var(--border); | |
| } | |
| nav.toc h2 { | |
| font-family: "IBM Plex Mono", monospace; | |
| font-size: 0.9rem; | |
| text-transform: uppercase; | |
| letter-spacing: 0.1em; | |
| border-bottom: 2px solid var(--border); | |
| padding-bottom: 0.75rem; | |
| margin-bottom: 1.5rem; | |
| padding-top: 0; | |
| } | |
| nav.toc ol { | |
| list-style: none; | |
| counter-reset: toc; | |
| column-count: 2; | |
| column-gap: 3rem; | |
| } | |
| @media (max-width: 640px) { | |
| nav.toc ol { column-count: 1; } | |
| } | |
| nav.toc li { | |
| counter-increment: toc; | |
| margin-bottom: 0.75rem; | |
| break-inside: avoid; | |
| } | |
| nav.toc a { | |
| color: var(--text); | |
| text-decoration: none; | |
| display: flex; | |
| gap: 0.5rem; | |
| font-weight: 500; | |
| transition: color 0.2s, transform 0.2s; | |
| } | |
| nav.toc a::before { | |
| content: counter(toc, decimal-leading-zero) "."; | |
| font-family: "IBM Plex Mono", monospace; | |
| font-weight: 600; | |
| color: var(--accent); | |
| } | |
| nav.toc a:hover { | |
| color: var(--accent); | |
| transform: translateX(4px); | |
| } | |
| /* ------------------------------------------- | |
| Typography & Content | |
| ------------------------------------------- */ | |
| section { | |
| margin-bottom: 5rem; | |
| position: relative; | |
| } | |
| section::before { | |
| content: ""; | |
| display: block; | |
| width: 3rem; | |
| height: 4px; | |
| background: var(--accent); | |
| margin-bottom: 1.5rem; | |
| } | |
| h2 { | |
| font-family: "Fraunces", serif; | |
| font-size: 2.5rem; | |
| font-weight: 700; | |
| letter-spacing: -0.02em; | |
| margin-bottom: 1.5rem; | |
| line-height: 1.1; | |
| } | |
| h3 { | |
| font-family: "Fraunces", serif; | |
| font-size: 1.5rem; | |
| font-weight: 600; | |
| font-style: italic; | |
| margin: 2.5rem 0 1rem; | |
| color: var(--text); | |
| } | |
| h4 { | |
| font-family: "IBM Plex Mono", monospace; | |
| font-size: 1rem; | |
| font-weight: 600; | |
| text-transform: uppercase; | |
| letter-spacing: 0.05em; | |
| margin: 2rem 0 0.75rem; | |
| color: var(--text); | |
| } | |
| p { margin-bottom: 1.25rem; } | |
| ul, ol { | |
| margin: 0 0 1.5rem 2rem; | |
| padding: 0; | |
| } | |
| li { margin-bottom: 0.5rem; } | |
| li::marker { | |
| color: var(--accent); | |
| font-weight: bold; | |
| } | |
| strong { font-weight: 700; color: var(--text); } | |
| em { font-style: italic; font-family: "Fraunces", serif; } | |
| a { | |
| color: var(--accent-blue); | |
| text-decoration: underline; | |
| text-underline-offset: 4px; | |
| text-decoration-thickness: 1px; | |
| transition: all 0.2s; | |
| } | |
| a:hover { | |
| background: var(--accent-blue); | |
| color: var(--bg); | |
| text-decoration-color: transparent; | |
| } | |
| /* ------------------------------------------- | |
| Cards & Callouts | |
| ------------------------------------------- */ | |
| .card { | |
| background: var(--bg); | |
| border: 2px solid var(--border); | |
| padding: 1.75rem 2rem; | |
| margin: 2rem 0; | |
| position: relative; | |
| box-shadow: var(--shadow-offset) var(--shadow-offset) 0 var(--border); | |
| transition: transform 0.2s, box-shadow 0.2s; | |
| } | |
| .card:hover { | |
| transform: translate(-2px, -2px); | |
| box-shadow: calc(var(--shadow-offset) + 2px) calc(var(--shadow-offset) + 2px) 0 var(--border); | |
| } | |
| .card.highlight { | |
| border-color: var(--text); | |
| background: #fdfcfa; | |
| } | |
| .card.highlight::before { | |
| content: ""; | |
| position: absolute; | |
| top: 0; left: 0; bottom: 0; | |
| width: 8px; | |
| background: var(--accent-blue); | |
| } | |
| .card.warn { | |
| background: #fcf6ef; | |
| } | |
| .card.warn::before { | |
| content: ""; | |
| position: absolute; | |
| top: 0; left: 0; bottom: 0; | |
| width: 8px; | |
| background: var(--accent-warm); | |
| } | |
| .card.danger { | |
| background: #fcefed; | |
| } | |
| .card.danger::before { | |
| content: ""; | |
| position: absolute; | |
| top: 0; left: 0; bottom: 0; | |
| width: 8px; | |
| background: var(--accent); | |
| } | |
| .card-title { | |
| font-family: "IBM Plex Mono", monospace; | |
| font-weight: 700; | |
| font-size: 0.85rem; | |
| text-transform: uppercase; | |
| letter-spacing: 0.08em; | |
| color: var(--text); | |
| border-bottom: 1px solid var(--border); | |
| padding-bottom: 0.5rem; | |
| margin-bottom: 1rem; | |
| } | |
| .card h4 { | |
| margin-top: 0; | |
| border-bottom: 1px solid var(--border); | |
| padding-bottom: 0.5rem; | |
| } | |
| /* ------------------------------------------- | |
| Data Display (Tables & Code) | |
| ------------------------------------------- */ | |
| table { | |
| width: 100%; | |
| border-collapse: collapse; | |
| margin: 2rem 0; | |
| font-family: "Newsreader", serif; | |
| font-size: 1rem; | |
| border-top: 3px solid var(--border); | |
| border-bottom: 3px solid var(--border); | |
| } | |
| th, td { | |
| text-align: left; | |
| padding: 0.85rem 1rem; | |
| border-bottom: 1px solid #d4d3cf; | |
| } | |
| th { | |
| font-family: "IBM Plex Mono", monospace; | |
| font-size: 0.75rem; | |
| text-transform: uppercase; | |
| letter-spacing: 0.05em; | |
| color: var(--text); | |
| font-weight: 600; | |
| vertical-align: bottom; | |
| } | |
| tr:last-child td { border-bottom: none; } | |
| tr:hover td { background: rgba(0,0,0,0.03); } | |
| .num-good { color: var(--accent-ok); font-weight: 700; } | |
| .num-mid { color: var(--accent-warm); font-weight: 700; } | |
| .num-bad { color: var(--accent); font-weight: 700; } | |
| .pending { color: var(--text-muted); font-style: italic; } | |
| code, .mono { | |
| font-family: "IBM Plex Mono", monospace; | |
| font-size: 0.85em; | |
| } | |
| p code, li code { | |
| background: #e8e7e1; | |
| border: 1px solid #d4d3cf; | |
| padding: 0.15em 0.3em; | |
| color: var(--text); | |
| font-weight: 500; | |
| } | |
| pre { | |
| background: var(--text); | |
| color: var(--bg); | |
| padding: 1.5rem; | |
| overflow-x: auto; | |
| font-family: "IBM Plex Mono", monospace; | |
| font-size: 0.85rem; | |
| line-height: 1.5; | |
| margin: 2rem 0; | |
| box-shadow: var(--shadow-offset) var(--shadow-offset) 0 var(--accent); | |
| } | |
| pre code { | |
| background: transparent; | |
| border: none; | |
| color: inherit; | |
| padding: 0; | |
| } | |
| /* ------------------------------------------- | |
| UI Elements | |
| ------------------------------------------- */ | |
| .flow { | |
| display: flex; | |
| flex-wrap: wrap; | |
| gap: 0; | |
| align-items: center; | |
| margin: 2rem 0; | |
| font-family: "IBM Plex Mono", monospace; | |
| font-size: 0.85rem; | |
| font-weight: 600; | |
| text-transform: uppercase; | |
| border: 2px solid var(--border); | |
| box-shadow: 4px 4px 0 var(--border); | |
| width: fit-content; | |
| } | |
| .flow span { | |
| padding: 0.5rem 1rem; | |
| background: var(--bg); | |
| } | |
| .flow .arrow { | |
| background: var(--text); | |
| color: var(--bg); | |
| padding: 0.5rem; | |
| } | |
| .pill-row { | |
| display: flex; | |
| flex-wrap: wrap; | |
| gap: 0.5rem; | |
| margin: 1rem 0; | |
| } | |
| .pill { | |
| font-family: "IBM Plex Mono", monospace; | |
| font-size: 0.75rem; | |
| font-weight: 600; | |
| text-transform: uppercase; | |
| padding: 0.25rem 0.5rem; | |
| border: 1px solid var(--border); | |
| background: var(--bg); | |
| } | |
| .pill.ok { background: var(--accent-ok); color: #fff; border-color: var(--accent-ok); } | |
| .pill.no { background: var(--accent); color: #fff; border-color: var(--accent); } | |
| .pill.run { background: var(--accent-blue); color: #fff; border-color: var(--accent-blue); } | |
| .two-col { | |
| display: grid; | |
| grid-template-columns: 1fr 1fr; | |
| gap: 2rem; | |
| margin: 2rem 0; | |
| } | |
| /* ------------------------------------------- | |
| Special Components | |
| ------------------------------------------- */ | |
| .status-banner { | |
| background: var(--text); | |
| color: var(--bg); | |
| padding: 1rem 1.5rem; | |
| margin-bottom: 3rem; | |
| font-family: "IBM Plex Mono", monospace; | |
| font-size: 0.85rem; | |
| border: 2px solid var(--text); | |
| position: relative; | |
| } | |
| .status-banner::after { | |
| content: ""; | |
| position: absolute; | |
| top: 4px; left: 4px; right: -8px; bottom: -8px; | |
| border: 1px solid var(--text); | |
| z-index: -1; | |
| } | |
| .status-banner strong { | |
| color: #fff; | |
| text-transform: uppercase; | |
| letter-spacing: 0.05em; | |
| margin-right: 0.5rem; | |
| } | |
| figure.figure { | |
| margin: 3rem 0; | |
| border: 2px solid var(--border); | |
| box-shadow: var(--shadow-offset) var(--shadow-offset) 0 var(--border); | |
| background: var(--bg); | |
| } | |
| figure.figure img { | |
| display: block; | |
| width: 100%; | |
| height: auto; | |
| filter: grayscale(100%) contrast(1.1); /* Editorial print feel */ | |
| transition: filter 0.3s; | |
| } | |
| figure.figure:hover img { | |
| filter: grayscale(0%); | |
| } | |
| figure.figure figcaption { | |
| padding: 1rem 1.25rem; | |
| font-family: "Newsreader", serif; | |
| font-size: 0.95rem; | |
| color: var(--text); | |
| border-top: 2px solid var(--border); | |
| background: #fdfcfa; | |
| } | |
| .gate-table td:first-child { | |
| font-family: "IBM Plex Mono", monospace; | |
| font-size: 0.85rem; | |
| font-weight: 600; | |
| } | |
| .phase-grid { | |
| display: grid; | |
| gap: 1.5rem; | |
| margin: 2.5rem 0; | |
| } | |
| .phase-card { | |
| border: 1px solid var(--border); | |
| padding: 1.5rem; | |
| position: relative; | |
| } | |
| .phase-card::before { | |
| content: ""; | |
| position: absolute; | |
| top: 0; left: 0; | |
| width: 100%; | |
| height: 4px; | |
| background: var(--accent); | |
| } | |
| .phase-card h4 { margin: 0 0 0.5rem; } | |
| .phase-card p { margin: 0; } | |
| blockquote.pull { | |
| font-family: "Fraunces", serif; | |
| font-size: 1.5rem; | |
| line-height: 1.4; | |
| font-style: italic; | |
| margin: 3rem 0; | |
| padding: 2rem; | |
| border-top: 2px solid var(--border); | |
| border-bottom: 2px solid var(--border); | |
| text-align: center; | |
| color: var(--text); | |
| background: repeating-linear-gradient( | |
| 45deg, | |
| transparent, | |
| transparent 10px, | |
| rgba(0,0,0,0.02) 10px, | |
| rgba(0,0,0,0.02) 20px | |
| ); | |
| } | |
| /* ------------------------------------------- | |
| Footer | |
| ------------------------------------------- */ | |
| footer { | |
| margin-top: 6rem; | |
| padding-top: 3rem; | |
| border-top: 4px solid var(--border); | |
| font-family: "IBM Plex Mono", monospace; | |
| font-size: 0.85rem; | |
| text-transform: uppercase; | |
| letter-spacing: 0.05em; | |
| color: var(--text-muted); | |
| } | |
| footer a { color: var(--text); font-weight: 600; } | |
| @media (max-width: 640px) { | |
| .two-col { grid-template-columns: 1fr; } | |
| .wrap { padding: 2rem 1rem 4rem; } | |
| h1 { font-size: 2.5rem; } | |
| } | |
| </style> | |
| </head> | |
| <body> | |
| <div class="wrap"> | |
| <header> | |
| <h1>DataSense E2B<br />The Full Story</h1> | |
| <p class="subtitle"> | |
| How we set out to build a <strong>personal data-science agent</strong> — not a chatbot that | |
| <em>pretends</em> to run code, but one that <strong>writes Python, executes it, reads real errors, | |
| and verifies answers</strong> — and what we learned training Gemma-4-2B on Modal with methods | |
| we had to invent along the way. | |
| </p> | |
| <p class="meta"> | |
| Base: <code>unsloth/gemma-4-E2B-it</code><br /> | |
| Pipeline: Modal A100/T4<br /> | |
| Team: <strong>DataSense E2B</strong> (Execution-verified, Tutor-escalation)<br /> | |
| </p> | |
| </header> | |
| <nav class="toc" aria-label="Table of contents"> | |
| <h2>Index</h2> | |
| <ol> | |
| <li><a href="#goal">The goal</a></li> | |
| <li><a href="#start">Where we started</a></li> | |
| <li><a href="#problem">The problem with naive finetuning</a></li> | |
| <li><a href="#agent">The DataSense agent loop</a></li> | |
| <li><a href="#pipeline">Training pipeline: SFT → GRPO → DPO</a></li> | |
| <li><a href="#methods">Supporting methods (verifiers, eval)</a></li> | |
| <li><a href="#evte">EVTE — core idea & motivation</a></li> | |
| <li><a href="#evte-feedback">EVTE feedback loops (self-recovery)</a></li> | |
| <li><a href="#evte-mentor">Mentor verify & hint protocol</a></li> | |
| <li><a href="#evte-star">EVTE-STaR — online micro-SFT</a></li> | |
| <li><a href="#evte-outcomes">Episode outcomes & trainability gates</a></li> | |
| <li><a href="#worked">What worked</a></li> | |
| <li><a href="#didnt">What didn't work</a></li> | |
| <li><a href="#evals">Evaluation results</a></li> | |
| <li><a href="#demo-choice">Why SFT v1 for the demo</a></li> | |
| <li><a href="#benchmarks">Benchmark suite</a></li> | |
| <li><a href="#models">Model checkpoints</a></li> | |
| <li><a href="#demo">This demo & what's next</a></li> | |
| </ol> | |
| </nav> | |
| <!-- 01 GOAL --> | |
| <section id="goal"> | |
| <h2>01 · The goal</h2> | |
| <p> | |
| The hackathon asked for something ambitious: take a small open model and make it genuinely useful | |
| for <strong>data work</strong> — exploring tables, cleaning messy columns, aggregating, joining, | |
| visualizing, and answering questions with <strong>verifiable correctness</strong>, not plausible prose. | |
| </p> | |
| <p>Our north star was simple to state and hard to achieve:</p> | |
| <div class="card highlight"> | |
| <div class="card-title">North star</div> | |
| <p style="margin:0"> | |
| A <strong>2B-parameter student agent</strong> that behaves like a junior data analyst: | |
| inspect schema first, run focused code steps, debug from real tracebacks, and only claim an | |
| answer after execution confirms it — with a training story credible enough for slides, | |
| papers, and a public Hugging Face demo. | |
| </p> | |
| </div> | |
| <p>Concretely, we targeted:</p> | |
| <ul> | |
| <li><strong>Execution-grounded behavior</strong> — rewards and eval tied to real <code>stdout</code> / errors, not hallucinated <code>&lt;result&gt;</code> blocks</li> | |
| <li><strong>Multi-benchmark credibility</strong> — DataBench, DSBench Excel analysis, and a curated hard pool from our own training data</li> | |
| <li><strong>A reproducible Modal pipeline</strong> — one app, volume checkpoints, automatic HF Hub pushes</li> | |
| <li><strong>Novel training for hard questions</strong> — when the student fails, a larger mentor verifies a solution and gives diagnostic hints <em>without leaking the answer</em></li> | |
| </ul> | |
| <figure class="figure"> | |
| <img src="https://huggingface.co/spaces/build-small-hackathon/DataSense_E2B/resolve/main/assets/illustrations/01-goal-agent-vs-formatter.png" alt="Formatter that fakes answers versus a real execution-verified agent" loading="lazy" /> | |
| <figcaption><strong>Fig 1 — Goal.</strong> We optimize for an agent that runs code on real data and verifies answers — not a model that prints plausible <code> Answer: </code> tags without executing anything.</figcaption> | |
| </figure> | |
| </section> | |
| <!-- 02 START --> | |
| <section id="start"> | |
| <h2>02 · Where we started</h2> | |
| <h3>The base model</h3> | |
| <p> | |
| We built on <code>unsloth/gemma-4-E2B-it</code>, Unsloth's repackaging of Google's instruction-tuned | |
| Gemma 4 E2B checkpoint (the "E" denotes roughly 2B <em>effective</em> parameters). It's small enough to fine-tune on a single GPU, | |
| yet designed with code and tool use in mind. We used 4-bit quantization, LoRA rank 32, | |
| and a 2048-token context throughout. | |
| </p> | |
| <h3>Three Kaggle notebooks → one Modal app</h3> | |
| <p> | |
| The project began as three separate Kaggle notebooks covering supervised fine-tuning (SFT), | |
| GRPO reinforcement learning, and DPO preference optimization. We consolidated them into | |
| <code>datasense_pipeline.py</code> — a single Modal application with shared config in | |
| <code>datasense_utils.py</code> — so training could run unattended on cloud GPUs with | |
| checkpoints persisted to a Modal volume and pushed to Hugging Face. | |
| </p> | |
| <h3>Nine bugs we fixed before trusting any number</h3> | |
| <p>Early runs were misleading because the ported notebooks had latent bugs. We fixed all nine before building the pipeline:</p> | |
| <table> | |
| <thead> | |
| <tr><th>#</th><th>Bug</th><th>Impact</th></tr> | |
| </thead> | |
| <tbody> | |
| <tr><td>1</td><td><code>sft_warmup</code> KeyError</td><td>SFT wouldn't start</td></tr> | |
| <tr><td>2</td><td><code>lora_target_modules</code> KeyError</td><td>LoRA attach failed</td></tr> | |
| <tr><td>3</td><td><code>result_str</code> UnboundLocalError</td><td>Agent loop crashed mid-rollout</td></tr> | |
| <tr><td>4</td><td>DPO pairs missing chat template prefix</td><td>Preference data malformed</td></tr> | |
| <tr><td>5</td><td><code>skip_special_tokens=False</code></td><td>Decode pollution in rewards</td></tr> | |
| <tr><td>6</td><td>Dead <code>oci_sft_v1</code> variable</td><td>Confusing / broken cells</td></tr> | |
| <tr><td>7</td><td>GRPO <code>max_steps</code> hardcoded</td><td>Config ignored</td></tr> | |
| <tr><td>8</td><td>Shorter <code>SYSTEM_PROMPT</code> in DPO cell</td><td>Train/eval prompt drift</td></tr> | |
| <tr><td>9</td><td><code>_PROBLEM_LOOKUP</code> naming mismatch</td><td>Dataset indexing broken</td></tr> | |
| </tbody> | |
| </table> | |
| <h3>Day-one eval: 0% accuracy (and why that was informative)</h3> | |
| <p> | |
| Our first agent eval reported <strong>0% accuracy</strong> for every model, SFT included. Yet | |
| SFT already showed <strong>100% execution success</strong> and ~5.6 agent steps per episode, against 2% execution success and | |
| 1.1 steps for the base model. That gap taught us the first big lesson: <strong>the model was learning to run code, | |
| but we weren't scoring against real data.</strong> | |
| </p> | |
| <div class="card warn"> | |
| <div class="card-title">Root cause</div> | |
| <p style="margin:0"> | |
| Eval workspaces used <strong>synthetic random CSVs</strong> when DataBench parquet wasn't mounted, | |
| but ground truth came from the <strong>real</strong> dataset. The agent analyzed fake data and | |
| was graded against true answers — guaranteed 0%. | |
| </p> | |
| </div> | |
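<p>
One way to make this class of bug impossible is to fail loudly when the real table is missing instead of
silently substituting synthetic data. The sketch below is a minimal illustration, not the project's code:
<code>resolve_eval_table</code> and <code>DatasetNotMountedError</code> are hypothetical names, and only the
<code>sample.parquet</code> filename comes from the pipeline described here.
</p>

```python
from pathlib import Path


class DatasetNotMountedError(RuntimeError):
    """Raised when an eval workspace would otherwise run on placeholder data."""


def resolve_eval_table(workspace: Path, filename: str = "sample.parquet") -> Path:
    """Return the path to the real table, or raise if it is not mounted.

    Hypothetical helper: refusing to fall back to synthetic data keeps the
    table under analysis and the ground truth tied to the same source.
    """
    table = workspace / filename
    if not table.is_file():
        raise DatasetNotMountedError(
            f"{table} is missing; refusing to substitute synthetic data"
        )
    return table
```

<p>
With a guard like this, an unmounted dataset aborts the eval run at setup time rather than producing a
misleading 0% score.
</p>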
| <figure class="figure"> | |
| <img src="https://huggingface.co/spaces/build-small-hackathon/DataSense_E2B/resolve/main/assets/illustrations/02-fake-data-eval.png" alt="Eval bug: synthetic workspace data scored against real ground truth" loading="lazy" /> | |
| <figcaption><strong>Fig 2 — The 0% eval bug.</strong> Early runs used random synthetic CSVs in the sandbox while ground truth came from real DataBench files — so even a good agent could never match.</figcaption> | |
| </figure> | |
| </section> | |
| <!-- 03 PROBLEM --> | |
| <section id="problem"> | |
| <h2>03 · The problem with naive finetuning</h2> | |
| <p> | |
| Most "data agent" demos finetune on static (question, code, answer) triples. The model learns | |
| to <em>format</em> responses that look like an agent — <code> Answer: </code> tags, pandas snippets, | |
| confident summaries — without ever closing the loop on execution. | |
| </p> | |
| <p>We observed these failure modes immediately:</p> | |
| <div class="two-col"> | |
| <div class="card"> | |
| <div class="card-title">Formatter, not agent</div> | |
| <p style="margin:0;font-size:0.95rem"> | |
| Base Gemma-4 could score well on easy boolean questions by emitting answer tags in a single | |
| turn — <strong>0% code execution</strong> — beating SFT on accuracy while doing none of the work. | |
| </p> | |
| </div> | |
| <div class="card"> | |
| <div class="card-title">Hallucinated execution</div> | |
| <p style="margin:0;font-size:0.95rem"> | |
| Models invent <code>&lt;result&gt;</code> blocks with fake stdout. RL rewards on text alone | |
| reinforce the illusion of competence. | |
| </p> | |
| </div> | |
| </div> | |
| <p> | |
| The fix wasn't "more SFT data." It was changing <strong>what we optimize and measure</strong>: | |
| real subprocess execution, multi-turn observe→fix→retry, and verifiers that compare parsed answers | |
| to typed ground truth (boolean, number, category, list types). | |
| </p> | |
| </section> | |
| <!-- 04 AGENT --> | |
| <section id="agent"> | |
| <h2>04 · The DataSense agent loop</h2> | |
| <p>Every training rollout and eval episode follows the same production-shaped loop:</p> | |
| <div class="flow"> | |
| <span>THINK</span><span class="arrow">→</span> | |
| <span>EXPLORE</span><span class="arrow">→</span> | |
| <span>EXECUTE</span><span class="arrow">→</span> | |
| <span>DEBUG</span><span class="arrow">→</span> | |
| <span>ANSWER</span> | |
| </div> | |
| <ol> | |
| <li><strong>THINK</strong> — inspect schema, dtypes, nulls before analysis</li> | |
| <li><strong>EXPLORE</strong> — <code>head()</code>, <code>describe()</code>, small SQL <code>LIMIT</code> queries</li> | |
| <li><strong>EXECUTE</strong> — one focused Python step; read real <code>&lt;result&gt;</code> from sandbox</li> | |
| <li><strong>DEBUG</strong> — fix column names, joins, dtypes from tracebacks</li> | |
| <li><strong>ANSWER</strong> — <code> Answer: </code> + <code> Summary: </code> after verified execution</li> | |
| </ol> | |
| <p> | |
| The system prompt (shared across train, eval, and this HF demo) explicitly forbids hallucinated APIs | |
| and requires the final printed value to match the answer tag. For DataBench we mount real | |
| <code>sample.parquet</code> into the workspace; for DSBench we copy <code>.xlsx</code> workbooks | |
| and use <code>inspect_source</code> for Excel structure. | |
| </p> | |
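<p>
The EXECUTE and DEBUG steps depend on running each code step for real and feeding back its actual output
or traceback. A minimal sketch of that idea follows, assuming a plain child interpreter stands in for the
sandbox. <code>run_step</code>, <code>ExecResult</code>, and <code>format_observation</code> are hypothetical
names, and the exact layout of the <code>[EXEC:real]</code> observation is an assumption. A bare subprocess
is not a security boundary, so a real deployment would need stronger isolation.
</p>

```python
import subprocess
import sys
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class ExecResult:
    ok: bool
    stdout: str
    stderr: str


def run_step(code: str, workspace: Path, timeout_s: float = 30.0) -> ExecResult:
    """Run one code step in a child interpreter with cwd set to the workspace."""
    with tempfile.NamedTemporaryFile(
        "w", suffix=".py", dir=workspace, delete=False
    ) as fh:
        fh.write(code)
        script = Path(fh.name)
    try:
        proc = subprocess.run(
            [sys.executable, str(script)],
            cwd=workspace,
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
        return ExecResult(proc.returncode == 0, proc.stdout, proc.stderr)
    except subprocess.TimeoutExpired:
        return ExecResult(False, "", f"timed out after {timeout_s}s")
    finally:
        # Remove the temporary script so it does not pollute the workspace.
        script.unlink(missing_ok=True)


def format_observation(result: ExecResult) -> str:
    """Render what the model reads back (assumed layout for the real-exec tag)."""
    body = result.stdout if result.ok else result.stderr
    return f"[EXEC:real]\n<result>\n{body.strip()}\n</result>"
```

<p>
On failure the observation carries the real traceback from <code>stderr</code>, which is what the DEBUG step
needs in order to fix column names, joins, or dtypes.
</p>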
| <pre>Reward signal (simplified): | |
| + execution actually ran | |
| + stdout parseable | |
| + answer matches ground truth (typed comparator) | |
| − hallucinated inline &lt;result&gt; without [EXEC:real] | |
| − debug rambling / column dumps as "answers"</pre> | |
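<p>
The signs in the reward list above can be turned into a toy scoring function. The sketch below is
illustrative only: the weights are invented, <code>Trajectory</code> and <code>sketch_reward</code> are
hypothetical names, and it assumes sandbox output is stored separately from the text the model generated.
It is not the project's <code>compute_trajectory_reward()</code>, and it omits the penalty for debug rambling.
</p>

```python
import re
from dataclasses import dataclass

RESULT_BLOCK = re.compile(r"<result>.*?</result>", re.DOTALL)


@dataclass
class Trajectory:
    model_text: str         # everything the model itself generated
    real_exec_count: int    # steps the sandbox actually executed
    stdout_parseable: bool  # final stdout yielded a candidate answer
    answer_correct: bool    # typed comparison against ground truth passed


def sketch_reward(t: Trajectory) -> float:
    """Toy weights (assumed); only the signs follow the reward list."""
    reward = 0.0
    if t.real_exec_count > 0:
        reward += 0.2
    if t.stdout_parseable:
        reward += 0.2
    # Correctness only counts when it is backed by real execution.
    if t.answer_correct and t.real_exec_count > 0:
        reward += 1.0
    # A <result> block inside the model's own text never came from the sandbox.
    if RESULT_BLOCK.search(t.model_text):
        reward -= 0.5
    return reward
```

<p>
Under these assumed weights, a trajectory that fabricates a result block and guesses the right answer
still scores below one that executed code and verified its output.
</p>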
| <figure class="figure"> | |
| <img src="https://huggingface.co/spaces/build-small-hackathon/DataSense_E2B/resolve/main/assets/illustrations/03-agent-loop.png" alt="THINK EXPLORE EXECUTE DEBUG ANSWER agent loop" loading="lazy" /> | |
| <figcaption><strong>Fig 3 — Agent loop.</strong> Every rollout follows the same multi-step cycle: inspect, run code, read real output, debug, then answer.</figcaption> | |
| </figure> | |
| </section> | |
| <!-- 05 PIPELINE --> | |
| <section id="pipeline"> | |
| <h2>05 · Training pipeline: SFT → GRPO → DPO</h2> | |
| <p>Our planned stack mirrors modern agent training — with execution at every stage:</p> | |
| <div class="flow"> | |
| <span>SFT</span><span class="arrow">→</span> | |
| <span>GRPO</span><span class="arrow">→</span> | |
| <span>DPO</span><span class="arrow">→</span> | |
| <span>Eval</span> | |
| </div> | |
| <h3>Stage 1 — Supervised fine-tuning (SFT v1) ✅</h3> | |
| <p> | |
| Bulk SFT on DataBench-style traces plus agent supplements: multi-turn dialogs, Jupyter-agent | |
| traces, dashboard examples, and code-feedback execution pairs. This produced our strongest | |
| baseline — <code>sanjaymalladi/DataSense-Modal-E2B-SFT</code>. | |
| </p> | |
| <ul> | |
| <li>LoRA r=32, α=64 on all attention + MLP projections</li> | |
| <li>~600 max steps, effective batch 8</li> | |
| <li>Teaches the model to <em>use</em> the agent format and run multi-step code</li> | |
| </ul> | |
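<p>
The hyperparameters above imply a few derived quantities worth checking by hand. The sketch below collects
them in a plain dataclass. <code>SftConfig</code> is a hypothetical container, the split of the effective
batch into per-device batch size and gradient accumulation is an assumption, and the scaling formula is
the standard LoRA one (alpha divided by rank), not the rank-stabilized variant.
</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SftConfig:
    lora_rank: int = 32
    lora_alpha: int = 64
    max_seq_length: int = 2048
    load_in_4bit: bool = True
    max_steps: int = 600
    per_device_batch_size: int = 2  # assumed split of the effective batch
    grad_accum_steps: int = 4       # assumed split of the effective batch

    @property
    def lora_scaling(self) -> float:
        # Standard LoRA multiplies the low-rank update by alpha / r.
        return self.lora_alpha / self.lora_rank

    @property
    def effective_batch(self) -> int:
        return self.per_device_batch_size * self.grad_accum_steps

    @property
    def sequences_seen(self) -> int:
        # Upper bound on training sequences consumed over the whole run.
        return self.max_steps * self.effective_batch


cfg = SftConfig()
print(cfg.lora_scaling, cfg.effective_batch, cfg.sequences_seen)
```

<p>
With r=32 and α=64 the adapter update is scaled by 2, and 600 steps at an effective batch of 8 caps the
run at 4,800 training sequences.
</p>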
| <h3>Stage 2 — GRPO (execution-grounded RL) ⚠️ partial</h3> | |
| <p> | |
| Group Relative Policy Optimization with <strong>real Python rollouts</strong> per prompt. | |
| Each step spawns multiple agent trajectories; rewards use <code>compute_trajectory_reward()</code> | |
| with <code>require_real_execution=True</code>. | |
| </p> | |
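<p>
The "group relative" part of GRPO means each trajectory is scored against the other trajectories sampled
for the same prompt. A minimal sketch of that normalization follows. It shows the commonly described form
(subtract the group mean, divide by the group standard deviation) and is not taken from the project's
training code. <code>group_relative_advantages</code> is a hypothetical name, and the use of the
population standard deviation is an assumption.
</p>

```python
from statistics import mean, pstdev


def group_relative_advantages(
    rewards: list[float], eps: float = 1e-6
) -> list[float]:
    """Normalize rewards within one prompt's group of rollouts."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when every rollout got the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts for one prompt: two verified, two failed.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

<p>
When all rollouts in a group receive the same reward, every advantage is zero and that prompt contributes
no gradient, which is one reason reward signals that separate real execution from fabricated output matter.
</p>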
| <p> | |
| GRPO on Gemma-4 is brutally slow (~11 min/step on A100) because most wall time is | |
| <strong>CPU-bound execution</strong>, not GPU matmul — 4 rollouts × up to 5 agent steps × | |
| subprocess sandboxing. We fixed trajectory-forwarding bugs and KL instability | |
| (<code>final_logit_softcapping=30</code>) and added parallel rollout workers, but a full | |
| 300-step GRPO run remained impractical within hackathon time, so we targeted a shortened 100-step run instead. | |
| </p> | |
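<p>
The wall-clock cost follows directly from the figures above. The arithmetic below uses only the numbers
quoted in this section and treats the ~11 min/step figure as constant, which is an approximation.
</p>

```python
MINUTES_PER_STEP = 11        # approximate A100 figure quoted above
ROLLOUTS_PER_PROMPT = 4
MAX_AGENT_STEPS = 5


def grpo_hours(steps: int, minutes_per_step: float = MINUTES_PER_STEP) -> float:
    """Estimated wall-clock hours for a GRPO run of the given length."""
    return steps * minutes_per_step / 60


# Upper bound on sandbox executions triggered per prompt in one GRPO step.
max_execs_per_prompt = ROLLOUTS_PER_PROMPT * MAX_AGENT_STEPS

print(f"300 steps: {grpo_hours(300):.1f} h")
print(f"100 steps: {grpo_hours(100):.1f} h")
print(f"max sandbox executions per prompt: {max_execs_per_prompt}")
```

<p>
At this rate a 300-step run needs about 55 hours and a 100-step run about 18 hours, with up to 20 sandbox
executions per prompt in each step.
</p>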
| <h3>Stage 3 — DPO ⏸️ deferred</h3> | |
| <p> | |
| Preference pairs from high vs low reward rollouts (min gap 0.15) — planned but deprioritized | |
| once EVTE-STaR showed more promise for hard-question gains within our compute budget. | |
| </p> | |
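<p>
Since this stage was deferred, the following is only a sketch of how such pairs could be mined from scored
rollouts, assuming each rollout records its prompt, completion, and reward. <code>Rollout</code> and
<code>build_preference_pairs</code> are hypothetical names. The <code>prompt</code> / <code>chosen</code> /
<code>rejected</code> keys follow the layout commonly used for preference datasets, and applying the chat
template to the prompt is left out.
</p>

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Rollout:
    prompt: str
    completion: str
    reward: float


def build_preference_pairs(
    rollouts: list[Rollout], min_gap: float = 0.15
) -> list[dict[str, str]]:
    """Pair rollouts of the same prompt whose rewards differ by at least min_gap."""
    pairs = []
    for a, b in combinations(rollouts, 2):
        if a.prompt != b.prompt:
            continue
        hi, lo = (a, b) if a.reward >= b.reward else (b, a)
        if hi.reward - lo.reward >= min_gap:
            pairs.append(
                {
                    "prompt": hi.prompt,
                    "chosen": hi.completion,
                    "rejected": lo.completion,
                }
            )
    return pairs
```

<p>
The minimum gap filters out pairs whose reward difference is too small to be a reliable preference signal.
</p>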
| <figure class="figure"> | |
| <img src="https://huggingface.co/spaces/build-small-hackathon/DataSense_E2B/resolve/main/assets/illustrations/04-pipeline-stages.png" alt="SFT GRPO DPO training pipeline stages" loading="lazy" /> | |
| <figcaption><strong>Fig 4 — Training stages.</strong> SFT v1 shipped and works. Full GRPO was execution-bound and slow. DPO was deferred in favor of EVTE-STaR.</figcaption> | |
| </figure> | |
| </section> | |
| <!-- 06 METHODS (supporting) --> | |
| <section id="methods"> | |
| <h2>06 · Supporting infrastructure (not EVTE itself)</h2> | |
| <p> | |
| Before EVTE could work, we needed execution-grounded rollouts, typed verifiers, and honest eval. | |
| These are the plumbing; the novel research contribution is EVTE + EVTE-STaR (sections 07–11 below). | |
| </p> | |
| <h3>Execution-grounded rollouts</h3> | |
| <p> | |
| Every GRPO/DPO/EVTE trajectory runs code in an isolated workspace. Rewards ignore fake | |
| <code>&lt;result&gt;</code> tags unless tagged <code>[EXEC:real]</code>. | |
| </p> | |
| <h3>Typed answer verification (<code>databench_compare</code> + neural verifier)</h3> | |
| <p> | |
| Evidence-bound scoring chain: exec stdout → <code> Answer: </code> tag → LLM extract → typed compare | |
| (boolean, float, category, <code>list[category]</code>, <code>list[number]</code>). | |
| Without this chain, a mentor is marked as failing whenever answer extraction breaks, even if its reasoning was sound. | |
| </p> | |
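<p>
The final "typed compare" step can be sketched as a small dispatcher over answer types. This is an
illustration under stated assumptions, not the project's <code>databench_compare</code>:
<code>typed_match</code> is a hypothetical name, numbers are compared with an assumed 1% relative
tolerance, and lists are compared without regard to order.
</p>

```python
import math
from typing import Any

TRUE_WORDS = {"true", "yes", "1"}
FALSE_WORDS = {"false", "no", "0"}


def _as_bool(text: str) -> bool:
    token = text.strip().lower()
    if token in TRUE_WORDS:
        return True
    if token in FALSE_WORDS:
        return False
    raise ValueError(f"not a boolean: {text!r}")


def _as_items(text: str) -> list[str]:
    # Accepts "[a, b]" or "a, b"; strips brackets and surrounding quotes.
    inner = text.strip().strip("[]")
    return [p.strip().strip("'\"") for p in inner.split(",") if p.strip()]


def typed_match(
    predicted: str, truth: Any, answer_type: str, rel_tol: float = 1e-2
) -> bool:
    """Compare a predicted answer string to ground truth by declared type."""
    try:
        if answer_type == "boolean":
            return _as_bool(predicted) == _as_bool(str(truth))
        if answer_type == "number":
            return math.isclose(float(predicted), float(truth), rel_tol=rel_tol)
        if answer_type == "category":
            got = predicted.strip().strip("'\"").lower()
            return got == str(truth).strip().lower()
        if answer_type == "list[category]":
            got_items = sorted(s.lower() for s in _as_items(predicted))
            return got_items == sorted(str(t).strip().lower() for t in truth)
        if answer_type == "list[number]":
            got_nums = sorted(float(s) for s in _as_items(predicted))
            want = sorted(float(t) for t in truth)
            return len(got_nums) == len(want) and all(
                math.isclose(g, w, rel_tol=rel_tol)
                for g, w in zip(got_nums, want)
            )
    except (TypeError, ValueError):
        # An unparseable prediction is a mismatch, not a crash.
        return False
    raise ValueError(f"unknown answer type: {answer_type!r}")
```

<p>
Treating unparseable predictions as mismatches keeps extraction failures visible in the scores while
letting a batch eval run to completion.
</p>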
| <h3>Lite eval & hackathon harness</h3> | |
| <p> | |
| DataBench lite scores against <code>sample_answer</code> on mounted parquet. | |
| <code>run_hackathon_benchmarks_parallel</code> runs Base / SFT / Micro-1 across three benchmarks on T4. | |
| </p> | |
| </section> | |
| <!-- 07 EVTE CORE --> | |
| <section id="evte"> | |
| <h2>07 · EVTE — Execution-Verified Tutor Escalation</h2> | |
| <p> | |
| <strong>EVTE</strong> is the method we built when classical distillation and STaR broke down for | |
| data agents. The name encodes three commitments: | |
| </p> | |
| <ul> | |
| <li><strong>Execution</strong> — every claim of success must be backed by real code that ran on real files</li> | |
| <li><strong>Verified</strong> — student <em>and</em> mentor answers pass the same typed verifier</li> | |
| <li><strong>Tutor Escalation</strong> — a larger model intervenes only after student failure, and only as a <em>coach</em>, not an answer vending machine</li> | |
| </ul> | |
| <h3>Why we needed EVTE</h3> | |
| <p> | |
| Classical <strong>STaR</strong> (Self-Taught Reasoner) assumes a strong teacher can produce correct | |
| reasoning chains, filter them, and fine-tune the student offline. That fails for DataSense because: | |
| </p> | |
| <ol> | |
| <li>Our <strong>2B student</strong> often can't solve list/category questions at all</li> | |
| <li>Our <strong>31B mentor</strong> also fails verification on the hardest 5 problems (~40% mentor-hard pool)</li> | |
| <li>Even when code is <em>right</em>, <strong>answer extraction</strong> fails (no tag, wrong stdout parse)</li> | |
| <li>Distilling final answers teaches <strong>memorization</strong>; we need debugging under execution constraints</li> | |
| </ol> | |
| <h3>The five-phase episode (EVTE and EVTE-STaR share this skeleton)</h3> | |
| <p>Implemented in <code>datasense_evte.py</code> — <code>run_evte_episode</code> (offline collection) and <code>run_evte_star_episode</code> (online training).</p> | |
| <div class="phase-grid"> | |
| <div class="phase-card"> | |
| <h4>Phase 1 · Student first attempt</h4> | |
| <p>2B student, up to 5 agent steps, real workspace (CSV/parquet/xlsx). Scored via <code>score_rollout()</code>.</p> | |
| </div> | |
| <div class="phase-card"> | |
| <h4>Phase 2 · Self-recovery feedback</h4> | |
| <p>Up to 3 rounds of <code>build_self_recovery_feedback()</code> — real tracebacks, answer withheld.</p> | |
| </div> | |
| <div class="phase-card"> | |
| <h4>Phase 3 · Mentor independent verify</h4> | |
| <p>31B mentor solves in a <em>fresh</em> workspace; must pass the same verifier before any hint.</p> | |
| </div> | |
| <div class="phase-card"> | |
| <h4>Phase 4 · Diagnostic mentor hint</h4> | |
| <p><code>generate_mentor_hint()</code> under <code>MENTOR_HINT_SYSTEM</code> — no final answer, no full script.</p> | |
| </div> | |
| <div class="phase-card"> | |
| <h4>Phase 5 · Post-hint student</h4> | |
| <p>Up to 2 attempts × 5 steps. Episode saved only if student verifies after reading the hint.</p> | |
| </div> | |
| </div> | |
| <figure class="figure"> | |
| <img src="https://huggingface.co/spaces/build-small-hackathon/DataSense_E2B/resolve/main/assets/illustrations/05-evte-five-phases.png" alt="EVTE five phases from student attempt to mentor-assisted success" loading="lazy" /> | |
| <figcaption><strong>Fig 5 — EVTE in five phases.</strong> Student tries → self-recovery → mentor must verify independently → diagnostic hint → student retries. Only verified post-hint wins become training data.</figcaption> | |
| </figure> | |
| <pre>run_evte_star_episode (simplified control flow): | |
|   student_rollout = phase_1_student() | |
|   if clean_first_try_verified and not messy_recovery_in_trace: | |
|       return SKIP  # already knows it — not trainable in STaR mode | |
|   if not verified: | |
|       for i in 1..3: | |
|           add_user(build_self_recovery_feedback())  # ← EVTE feedback | |
|           student_rollout = student_retry() | |
|   mentor_ok, mentor_rollout = mentor_verify_solution( | |
|       student_rollout=student_rollout  # mentor sees failed code | |
|   ) | |
|   if not mentor_ok: | |
|       return DISCARD  # mentor_unverified — no training signal | |
|   hint = generate_mentor_hint(student_rollout, mentor_rollout) | |
|   add_user("[MENTOR] " + hint)  # diagnostic only | |
|   for j in 1..2: | |
|       student_rollout = student_retry() | |
|       if verified: | |
|           return SAVE_TRAINABLE_EPISODE  # mentor_assisted</pre> | |
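<p>As a hedged illustration, the simplified control flow above could be written as runnable Python roughly like this. The <code>Rollout</code> dataclass, the injected callables, and the <code>unsolved_after_hint</code> label are assumptions made for this sketch and are not taken from <code>datasense_evte.py</code>. Returning <code>self_recovered</code> as soon as a self-recovery round verifies is also an assumption about routing.</p>

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Rollout:
    """Hypothetical stand-in for one scored agent rollout."""
    verified: bool        # did the typed verifier accept the answer?
    messy: bool = False   # recovery/debug language present in the trace


def run_episode_sketch(
    student_attempt: Callable[[Optional[str]], Rollout],
    mentor_verify: Callable[[Rollout], Optional[Rollout]],
    make_hint: Callable[[Rollout, Rollout], str],
    self_recovery_rounds: int = 3,
    post_hint_attempts: int = 2,
) -> str:
    """Return one outcome label for a single episode (STaR mode)."""
    rollout = student_attempt(None)                      # Phase 1
    if rollout.verified and not rollout.messy:
        return "self_solved_clean"                       # skipped in STaR mode
    if not rollout.verified:
        for _ in range(self_recovery_rounds):            # Phase 2
            rollout = student_attempt("self-recovery feedback")
            if rollout.verified:
                return "self_recovered"                  # assumed routing
    mentor_rollout = mentor_verify(rollout)              # Phase 3
    if mentor_rollout is None:
        return "discarded"                               # mentor unverified
    hint = make_hint(rollout, mentor_rollout)            # Phase 4
    for _ in range(post_hint_attempts):                  # Phase 5
        rollout = student_attempt("[MENTOR] " + hint)
        if rollout.verified:
            return "mentor_assisted"
    return "unsolved_after_hint"                         # hypothetical label
```

<p>Passing the models in as callables keeps the sketch testable without loading either model. A stub student that fails four times and then succeeds ends in <code>mentor_assisted</code>.</p>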
| <h3>Hard-first curriculum</h3> | |
| <p> | |
| <code>_prioritize_evte_problems()</code> sorts <code>list[category]</code>, <code>list[number]</code>, | |
| and multi-answer types before easy booleans. EVTE compute is expensive (two models × multi-step agents); | |
| we spend it where SFT v1 plateaus. | |
| </p> | |
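<p>A minimal sketch of a hard-first ordering, assuming each problem is a dict with an <code>answer_type</code> field. The field name and the middle rank for unlisted types are assumptions. Only "list types before booleans" comes from the text above.</p>

```python
from typing import Dict, List

# Assumed ranking: list-valued answers first, booleans last,
# every other answer type in between.
_RANK = {"list[category]": 0, "list[number]": 0, "boolean": 2}


def prioritize_hard_first(problems: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Return problems sorted hard-first; ties keep their original order."""
    return sorted(problems, key=lambda p: _RANK.get(p["answer_type"], 1))
```

<p>Because Python's <code>sorted</code> is stable, problems of the same difficulty rank stay in their original dataset order.</p>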
| <h3>Mentor hardware choreography</h3> | |
| <p> | |
| Student (2B) and mentor (31B) don't fit comfortably together on one A100. The STaR loop uses | |
| <code>on_micro_batch</code> hooks to <strong>unload mentor → micro-SFT student → reload mentor</strong> | |
| after every 15 mentor-assisted episodes. Progress persists to <code>evte_star_progress.json</code> with resume support. | |
| </p> | |
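<p>The choreography can be sketched as follows, under stated assumptions: the three callables stand in for the real unload, train, and reload routines, and the JSON fields (<code>episodes_done</code>, <code>micro_batches</code>) are hypothetical, not the actual schema of <code>evte_star_progress.json</code>.</p>

```python
import json
import os
import tempfile
from typing import Callable, Dict


def load_progress(path: str) -> Dict[str, int]:
    """Resume from an existing progress file, or start fresh."""
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as fh:
            return json.load(fh)
    return {"episodes_done": 0, "micro_batches": 0}


def save_progress(path: str, progress: Dict[str, int]) -> None:
    """Write to a temp file, then rename, so a crash cannot leave a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(progress, fh)
    os.replace(tmp_path, path)


def on_micro_batch(
    unload_mentor: Callable[[], None],
    micro_sft: Callable[[], None],
    reload_mentor: Callable[[], None],
    progress: Dict[str, int],
    path: str,
) -> None:
    """Free the mentor's memory, train the student, then bring the mentor back."""
    unload_mentor()
    micro_sft()
    reload_mentor()
    progress["micro_batches"] += 1
    save_progress(path, progress)
```

<p>Saving only after the mentor is back means a resumed run never counts a micro-batch whose training did not finish.</p>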
| </section> | |
| <!-- 08 EVTE FEEDBACK --> | |
| <section id="evte-feedback"> | |
| <h2>08 · EVTE feedback — self-recovery without answer leakage</h2> | |
| <p> | |
| The most underrated piece of EVTE is not the mentor — it's <strong>what we put in the user turn | |
| when the student fails</strong>. This is <code>build_self_recovery_feedback()</code> in | |
| <code>datasense_evte.py</code>. | |
| </p> | |
| <figure class="figure"> | |
| <img src="https://huggingface.co/spaces/build-small-hackathon/DataSense_E2B/resolve/main/assets/illustrations/06-evte-self-recovery.png" alt="Self-recovery feedback loop with real errors but hidden ground truth" loading="lazy" /> | |
| <figcaption><strong>Fig 6 — Self-recovery feedback.</strong> The student sees wrong predictions, last code, and real tracebacks — never the correct answer.</figcaption> | |
| </figure> | |
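<p>A hedged sketch of what such a feedback turn might look like. The wording of the message and the function signature are assumptions, not the real <code>build_self_recovery_feedback()</code>. The key property is structural: the function takes no ground-truth argument, so it has nothing to leak.</p>

```python
def build_feedback_sketch(
    prediction: str,
    last_code: str,
    traceback_text: str,
    attempt: int,
    max_attempts: int = 3,
) -> str:
    """Compose a user turn from the student's own artifacts only."""
    parts = [
        f"Attempt {attempt}/{max_attempts} did not verify.",
        f"Your answer was: {prediction!r}",
        "Your last code:",
        last_code.strip(),
    ]
    if traceback_text.strip():
        # Real error text from the failed execution, passed through unchanged.
        parts += ["Execution error:", traceback_text.strip()]
    else:
        parts.append("The code ran without errors, but the answer is wrong.")
    parts.append("Inspect the data, fix the code, and answer again.")
    return "\n".join(parts)
```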
| <blockquote class="pull"> | |
| Messy success = verified answer but conversation contains debug/recovery language | |
| (<code>trajectory_has_recovery_signal()</code>). We don't want to reinforce "stumble into correctness" | |
| without tutor review in STaR mode. | |
| </blockquote> | |
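<p>One way to detect a messy success is a pattern scan over the conversation. This is only a sketch: the pattern list is an assumption chosen for illustration, and the real <code>trajectory_has_recovery_signal()</code> may use different signals.</p>

```python
import re
from typing import Dict, List

# Assumed markers of debug/recovery language in a trajectory.
_RECOVERY_PATTERNS = re.compile(
    r"traceback|keyerror|let me (fix|retry|try again)|did not verify",
    re.IGNORECASE,
)


def has_recovery_signal(messages: List[Dict[str, str]]) -> bool:
    """True if any turn contains debug or recovery language."""
    return any(_RECOVERY_PATTERNS.search(m["content"]) for m in messages)


def is_messy_success(verified: bool, messages: List[Dict[str, str]]) -> bool:
    """Verified answer, but the trace shows the student stumbled on the way."""
    return verified and has_recovery_signal(messages)
```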
| <h3>Why SFT v2 failed — feedback without balance</h3> | |
| <p> | |
| When we later fine-tuned <strong>only</strong> on recovery trajectories (SFT v2), the model learned | |
| the <em>shape</em> of debug prose — dtype dumps, column lists — without improving verified answers. | |
| Lesson: self-recovery feedback is essential <strong>during collection</strong>, but training must mix | |
| clean completions with mentor-assisted wins, not recovery-only soup. | |
| </p> | |
| </section> | |
| <!-- 09 EVTE MENTOR --> | |
| <section id="evte-mentor"> | |
| <h2>09 · Mentor verify & hint protocol</h2> | |
| <p> | |
| The mentor is <code>google/gemma-4-31B-it</code> (4-bit via Unsloth). It is <strong>not</strong> an oracle | |
| that whispers answers. It must earn the right to hint by passing the same execution verifier as the student. | |
| </p> | |
| <figure class="figure"> | |
| <img src="https://huggingface.co/spaces/build-small-hackathon/DataSense_E2B/resolve/main/assets/illustrations/07-evte-mentor-gate.png" alt="Mentor must pass verification gate before giving a diagnostic hint" loading="lazy" /> | |
| <figcaption><strong>Fig 7 — Mentor gate.</strong> The 31B mentor must verify its own solution by running code before it may give a hint — and the hint must not leak the final answer.</figcaption> | |
| </figure> | |
| <h3>Mentor retry modes</h3> | |
| <table> | |
| <thead> | |
| <tr><th>Mode</th><th>Behavior</th><th>Config</th></tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td><strong>series</strong></td> | |
| <td>Same conversation; temps ramp 0.4 → 0.65 → 0.85</td> | |
| <td><code>evte_mentor_retry_mode=series</code></td> | |
| </tr> | |
| <tr> | |
| <td><strong>parallel</strong></td> | |
| <td>3 independent workspaces; first verified wins; temps [0.2, 0.5, 0.7]</td> | |
| <td><code>evte_mentor_retry_mode=parallel</code></td> | |
| </tr> | |
| </tbody> | |
| </table> | |
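<p>The two retry modes can be sketched as below. The temperature schedules come from the table. Everything else is an assumption: <code>attempt</code> is a hypothetical callable that returns a verified solution or <code>None</code>, and threads stand in for the independent workspaces.</p>

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Optional

SERIES_TEMPS = (0.4, 0.65, 0.85)
PARALLEL_TEMPS = (0.2, 0.5, 0.7)


def mentor_retry(
    attempt: Callable[[float], Optional[str]],
    mode: str = "series",
) -> Optional[str]:
    """Return the first verified mentor solution, or None if every try fails."""
    if mode == "series":
        # One try after another, with the temperature ramping up each time.
        for temp in SERIES_TEMPS:
            solution = attempt(temp)
            if solution is not None:
                return solution
        return None
    if mode == "parallel":
        # All tries start at once; the first verified result wins.
        with ThreadPoolExecutor(max_workers=len(PARALLEL_TEMPS)) as pool:
            futures = [pool.submit(attempt, temp) for temp in PARALLEL_TEMPS]
            for future in as_completed(futures):
                solution = future.result()
                if solution is not None:
                    return solution
        return None
    raise ValueError(f"unknown retry mode: {mode!r}")
```

<p>In this sketch the parallel branch still waits for the remaining workers when the pool closes. A real implementation would likely cancel them to save GPU time.</p>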
| </section> | |
| <!-- 10 EVTE-STAR --> | |
| <section id="evte-star"> | |
| <h2>10 · EVTE-STaR — online Self-Taught Reasoner with micro-SFT</h2> | |
| <p> | |
| <strong>EVTE-STaR</strong> combines EVTE episode collection with <strong>online weight updates</strong>. | |
| Classical STaR: collect all successes → train offline once. EVTE-STaR: | |
| <strong>collect 15 verified mentor-assisted wins → micro-SFT 30 steps → student is slightly better → repeat.</strong> | |
| </p> | |
| <figure class="figure"> | |
| <img src="https://huggingface.co/spaces/build-small-hackathon/DataSense_E2B/resolve/main/assets/illustrations/08-evte-star-online.png" alt="EVTE-STaR online micro-SFT every 15 verified episodes" loading="lazy" /> | |
| <figcaption><strong>Fig 8 — EVTE-STaR online loop.</strong> Every 15 mentor-assisted wins → 30-step micro-SFT at low LR → student continues on harder problems with nudged weights.</figcaption> | |
| </figure> | |
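<p>A minimal sketch of the online loop, assuming episodes arrive as dicts with an <code>outcome</code> field and that <code>micro_sft</code> is a caller-supplied function that runs the 30 training steps. Both names are hypothetical. Leaving a final partial batch untrained is also an assumption.</p>

```python
from typing import Callable, Dict, Iterable, List


def evte_star_loop(
    episodes: Iterable[Dict[str, str]],
    micro_sft: Callable[[List[Dict[str, str]]], None],
    batch_size: int = 15,
) -> int:
    """Buffer mentor-assisted wins and trigger micro-SFT on every full batch.

    Returns the number of micro-batches trained.
    """
    buffer: List[Dict[str, str]] = []
    batches = 0
    for episode in episodes:
        if episode.get("outcome") != "mentor_assisted":
            continue                  # only verified post-hint wins are trainable
        buffer.append(episode)
        if len(buffer) == batch_size:
            micro_sft(buffer)         # student weights are nudged here
            buffer = []               # later episodes use the updated student
            batches += 1
    return batches
```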
| <h3>The overtraining curve (batches 2–3 vs batch 6)</h3> | |
| <p> | |
| Replaying micro-batch <strong>1</strong> in RAM scored <strong>100%</strong> on mentor-hard (5 problems). | |
| The saved Micro-1 checkpoint scored ~<strong>60%</strong> on a confirmatory rerun. Replaying batches <strong>2–3</strong> scored | |
| ~<strong>80%</strong>. The final batch <strong>6</strong> checkpoint scored ~<strong>40%</strong>, which is worse than SFT v1. | |
| </p> | |
| <div class="card warn"> | |
| <div class="card-title">Lesson</div> | |
| <p style="margin:0"> | |
| Online micro-SFT needs <strong>early stopping on a held-out hard set</strong>, not "more batches = better." | |
| We only preserved micro-1 and final checkpoints on the volume — sweet-spot batches 2–3 were lost | |
| until <code>run_micro_replay_eval</code> reconstructed them in RAM. | |
| </p> | |
| </div> | |
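<p>Early stopping on a held-out set could look roughly like this. The function and its <code>patience</code> parameter are assumptions for illustration, and it presumes a checkpoint is saved after every micro-batch so the selected one can be restored.</p>

```python
from typing import Sequence, Tuple


def select_checkpoint(eval_scores: Sequence[float], patience: int = 2) -> Tuple[int, float]:
    """Pick the best micro-batch checkpoint by held-out score.

    eval_scores[i] is the held-out accuracy after micro-batch i + 1.
    Scanning stops once `patience` evaluations pass without a new best.
    Returns (1-based batch index, score).
    """
    if not eval_scores:
        raise ValueError("need at least one evaluation score")
    best_index, best_score = 0, float("-inf")
    for index, score in enumerate(eval_scores, start=1):
        if score > best_score:
            best_index, best_score = index, score
        elif index - best_index >= patience:
            break                     # no improvement for `patience` evals
    return best_index, best_score
```

<p>With made-up scores <code>[0.6, 0.8, 0.8, 0.6, 0.4, 0.4]</code> and <code>patience=2</code>, the scan stops at the fourth evaluation and keeps batch 2. These numbers are illustrative only and are not measurements from this project.</p>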
| </section> | |
| <!-- 11 EVTE OUTCOMES --> | |
| <section id="evte-outcomes"> | |
| <h2>11 · Episode outcomes & trainability gates</h2> | |
| <p>Every episode ends in exactly one outcome. The outcome determines whether it enters training.</p> | |
| <table> | |
| <thead> | |
| <tr><th>Outcome</th><th>Meaning</th><th>EVTE-STaR: train?</th></tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td><code>self_solved_clean</code></td> | |
| <td>First-try verified, no recovery signals in trace</td> | |
| <td class="num-bad">Skip</td> | |
| </tr> | |
| <tr> | |
| <td><code>self_recovered</code></td> | |
| <td>Fixed via self-recovery feedback only</td> | |
| <td class="num-mid">Optional</td> | |
| </tr> | |
| <tr> | |
| <td><code>mentor_assisted</code></td> | |
| <td>Failed → mentor verified → hint → student verified</td> | |
| <td class="num-good">Yes</td> | |
| </tr> | |
| <tr> | |
| <td><code>discarded</code></td> | |
| <td>Mentor couldn't pass execution verifier</td> | |
| <td class="num-bad">No</td> | |
| </tr> | |
| </tbody> | |
| </table> | |
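<p>The table above maps directly onto a small gate function. The enum and the <code>include_self_recovered</code> flag are assumptions made for this sketch. The four outcome names and their train/skip decisions come from the table.</p>

```python
from enum import Enum


class Outcome(str, Enum):
    SELF_SOLVED_CLEAN = "self_solved_clean"
    SELF_RECOVERED = "self_recovered"
    MENTOR_ASSISTED = "mentor_assisted"
    DISCARDED = "discarded"


def is_trainable(outcome: Outcome, include_self_recovered: bool = False) -> bool:
    """Decide whether an episode enters EVTE-STaR training."""
    if outcome is Outcome.MENTOR_ASSISTED:
        return True                       # always trained on
    if outcome is Outcome.SELF_RECOVERED:
        return include_self_recovered     # "Optional" in the table
    return False                          # clean solves and discards are skipped
```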
| </section> | |
| <!-- 12 WORKED --> | |
| <section id="worked"> | |
| <h2>12 · What worked</h2> | |
| <div class="card"> | |
| <h4>✅ SFT v1 — real execution behavior</h4> | |
| <p style="margin:0.5rem 0 0"> | |
| SFT v1 consistently runs real Python (100% exec on many evals), uses ~4–5 agent steps, and | |
| beats base on hard questions, where base's occasional "wins" come without running any code. This is the behavioral foundation | |
| everything else builds on. | |
| </p> | |
| </div> | |
| <div class="card"> | |
| <h4>✅ EVTE episode quality filter</h4> | |
| <p style="margin:0.5rem 0 0"> | |
| Saving only mentor-assisted verified trajectories produced high-signal data — multi-turn debug | |
| with real errors, not synthetic Q/A. 92 episodes is small but <em>curated</em>. | |
| </p> | |
| </div> | |
| </section> | |
| <!-- 13 DIDNT --> | |
| <section id="didnt"> | |
| <h2>13 · What didn't work</h2> | |
| <div class="card danger"> | |
| <h4>❌ Full GRPO within hackathon time</h4> | |
| <p style="margin:0.5rem 0 0"> | |
| ~11 min/step × hundreds of steps × execution-bound rollouts ≈ multi-day runs. Parallel rollout | |
| workers helped but couldn't change the fundamental CPU/GPU pipeline stall. vLLM isn't available | |
| for Gemma 4 E2B, so generation stays on HF generate. | |
| </p> | |
| </div> | |
| <div class="card danger"> | |
| <h4>❌ SFT v2 (recovery-only fine-tune)</h4> | |
| <p style="margin:0.5rem 0 0"> | |
| Training only on EVTE recovery trajectories taught <strong>debug prose</strong> — column dtype | |
| dumps, rambling — without improving answers. Mentor-hard: 40% vs SFT v1's 60%. | |
| </p> | |
| </div> | |
| </section> | |
| <!-- 14 EVALS --> | |
| <section id="evals"> | |
| <h2>14 · Evaluation results</h2> | |
| <p> | |
| <strong>Agent accuracy</strong> on real data files (lite DataBench parquet, DSBench Excel, mentor-hard pool). | |
| Macro average = unweighted mean of the three per-benchmark accuracies (30 problems in total). Always pair accuracy with | |
| <strong>exec_ok</strong> — base can match easy booleans via answer tags without running code. | |
| </p> | |
| <figure class="figure"> | |
| <img src="https://huggingface.co/spaces/build-small-hackathon/DataSense_E2B/resolve/main/assets/illustrations/09-eval-benchmarks.png" alt="Three hackathon benchmarks across three models" loading="lazy" /> | |
| <figcaption><strong>Fig 9 — Hackathon eval suite.</strong> DataBench (15) + DSBench Excel (10) + mentor-hard (5) per model on T4.</figcaption> | |
| </figure> | |
| <h3>Hackathon benchmark suite — final (first complete run)</h3> | |
| <p>Parallel eval: <code>run_hackathon_benchmarks_parallel</code> · 3× T4 · June 2026.</p> | |
| <table> | |
| <thead> | |
| <tr><th>Model</th><th>DataBench (15)</th><th>DSBench (10)</th><th>Mentor-hard (5)</th><th>Macro avg</th><th>Total</th></tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>Base</td> | |
| <td class="num-mid">60.0%</td> | |
| <td class="num-bad">0.0%</td> | |
| <td class="num-mid">20.0%</td> | |
| <td class="num-mid">26.7%</td> | |
| <td>10/30</td> | |
| </tr> | |
| <tr> | |
| <td><strong>SFT v1</strong></td> | |
| <td class="num-good">86.7%</td> | |
| <td class="num-bad">0.0%</td> | |
| <td class="num-good">60.0%</td> | |
| <td class="num-good">48.9%</td> | |
| <td>16/30</td> | |
| </tr> | |
| <tr> | |
| <td>EVTE Micro-1</td> | |
| <td class="num-good">80.0%</td> | |
| <td class="num-bad">0.0%*</td> | |
| <td class="num-good">100.0%</td> | |
| <td class="num-good">60.0%</td> | |
| <td>17/30</td> | |
| </tr> | |
| </tbody> | |
| </table> | |
| <p style="font-size:0.9rem;color:var(--text-muted)"> | |
| *The official DSBench scorer gives 0% for all models. Micro-1's Q15 computed <code>$12,829,511</code>, which matches option <strong>A</strong> (correct), but it was graded wrong because the scorer compares letters, not dollar values. A value-aware DSBench score would be 1/10 (macro <strong>63.3%</strong>). | |
| </p> | |
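<p>The macro averages in the table can be reproduced with a few lines of arithmetic. The per-benchmark correct counts below are back-derived from the percentages in the table (for example, 86.7% of 15 is 13) and agree with the Total column. The dictionary keys are names chosen for this sketch.</p>

```python
from typing import Dict

SUITE_SIZES: Dict[str, int] = {"databench": 15, "dsbench": 10, "mentor_hard": 5}

# Correct counts back-derived from the table's percentages.
CORRECT: Dict[str, Dict[str, int]] = {
    "Base": {"databench": 9, "dsbench": 0, "mentor_hard": 1},
    "SFT v1": {"databench": 13, "dsbench": 0, "mentor_hard": 3},
    "EVTE Micro-1": {"databench": 12, "dsbench": 0, "mentor_hard": 5},
}


def macro_average(correct: Dict[str, int]) -> float:
    """Unweighted mean of per-benchmark accuracies, in percent."""
    accuracies = [correct[name] / size for name, size in SUITE_SIZES.items()]
    return 100.0 * sum(accuracies) / len(accuracies)


def pooled_accuracy(correct: Dict[str, int]) -> float:
    """Total correct over total problems, in percent (the Total column)."""
    return 100.0 * sum(correct.values()) / sum(SUITE_SIZES.values())
```

<p>The two metrics differ because the macro average gives the 5-problem mentor-hard pool the same weight as the 15-problem DataBench slice. For Micro-1, the macro average is 60.0% while the pooled 17/30 is about 56.7%.</p>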
| <h3>Earlier standalone evals (sanity checks)</h3> | |
| <table> | |
| <thead> | |
| <tr><th>Eval</th><th>Base</th><th>SFT v1</th><th>Micro-1 / SFT v2</th></tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>Quick DataBench (5)</td> | |
| <td>80% acc / 0% exec</td> | |
| <td class="num-good">80% / 100% exec</td> | |
| <td>SFT v2: 40%</td> | |
| </tr> | |
| <tr> | |
| <td>Mentor-hard (5)</td> | |
| <td>40% / 0% exec</td> | |
| <td class="num-good">60% / 100% exec</td> | |
| <td>Micro-1 replay: 100% (RAM); saved ckpt ~60%</td> | |
| </tr> | |
| </tbody> | |
| </table> | |
| <div class="card"> | |
| <div class="card-title">How to read DSBench</div> | |
| <p style="margin:0"> | |
| Models often <strong>run code</strong> (50–100% exec_ok) but return dataframe strings, <code>0.0</code>, or dollar amounts that map to the <em>wrong</em> MCQ letter. Only one case (Micro-1 Q15) was a true scoring-format bug. DSBench 0% is mostly real Excel/parsing failure, not a broken metric. | |
| </p> | |
| </div> | |
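<p>A value-aware scorer for the Q15 case might map a numeric prediction back to its option letter before comparing. This is a sketch under assumptions: options are given as a letter-to-text dict with uppercase keys, and options B and C in the usage note are invented placeholders. Only option A's value comes from the footnote above.</p>

```python
import math
import re
from typing import Dict, Optional


def parse_number(text: str) -> Optional[float]:
    """Extract the first number from text, ignoring '$' and thousands separators."""
    match = re.search(r"-?\d[\d,]*(?:\.\d+)?", text)
    if match is None:
        return None
    return float(match.group(0).replace(",", ""))


def value_to_letter(
    prediction: str,
    options: Dict[str, str],
    rel_tol: float = 1e-6,
) -> Optional[str]:
    """Return the option letter a prediction corresponds to, or None."""
    stripped = prediction.strip().upper()
    if stripped in options:
        return stripped                   # the model already answered with a letter
    value = parse_number(prediction)
    if value is None:
        return None
    for letter, option_text in options.items():
        option_value = parse_number(option_text)
        if option_value is not None and math.isclose(value, option_value, rel_tol=rel_tol):
            return letter
    return None
```

<p>With <code>options = {"A": "$12,829,511", "B": "$11,000,000", "C": "$13,500,000"}</code>, the prediction <code>"$12,829,511"</code> maps to <code>"A"</code>, while a stray <code>"0.0"</code> maps to no option and stays wrong.</p>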
| </section> | |
| <!-- DEMO MODEL CHOICE --> | |
| <section id="demo-choice"> | |
| <h2>15 · Why SFT v1 for the live demo (not Micro-1)</h2> | |
| <p> | |
| Micro-1 wins <strong>macro average</strong> (60% vs 48.9%) on paper — driven by a perfect 5/5 on mentor-hard. | |
| We still ship <strong>SFT v1</strong> on this Hugging Face Space. Here's why: | |
| </p> | |
| <table> | |
| <thead> | |
| <tr><th>Factor</th><th>SFT v1</th><th>EVTE Micro-1</th></tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td><strong>DataBench (breadth)</strong></td> | |
| <td class="num-good"><strong>86.7%</strong> — best on the largest held-out slice</td> | |
| <td>80.0%</td> | |
| </tr> | |
| <tr> | |
| <td><strong>Mentor-hard (depth)</strong></td> | |
| <td>60% (3/5), 100% exec</td> | |
| <td class="num-good"><strong>100%</strong> (5/5) on first complete run</td> | |
| </tr> | |
| <tr> | |
| <td><strong>Stability</strong></td> | |
| <td class="num-good">Single bulk SFT — predictable at inference</td> | |
| <td>Online micro-SFT batch 1 — replay 100% vs saved ckpt ~60%</td> | |
| </tr> | |
| <tr> | |
| <td><strong>Straggler reruns</strong></td> | |
| <td class="num-good">Held up when Modal overwrote volume</td> | |
| <td>Mentor-hard dropped to 60% on duplicate run</td> | |
| </tr> | |
| <tr> | |
| <td><strong>Live demo risk</strong></td> | |
| <td class="num-good">Lower — less debug rambling, fewer dtype dumps</td> | |
| <td>Higher — tuned on hard pool, can overfit quirks</td> | |
| </tr> | |
| <tr> | |
| <td><strong>Story on slides</strong></td> | |
| <td>“Execution-grounded baseline that works”</td> | |
| <td>“EVTE-STaR peak — best hard-pool result”</td> | |
| </tr> | |
| </tbody> | |
| </table> | |
| <div class="card highlight"> | |
| <div class="card-title">Decision</div> | |
| <p style="margin:0"> | |
| <strong>Gradio Space → SFT v1</strong> (<code>sanjaymalladi/DataSense-Modal-E2B-SFT</code>) for reliable live CSV demos.<br /> | |
| <strong>Slides → show all three models</strong>; cite Micro-1 as evidence EVTE-STaR helps on the hard curated pool, not as the production default yet. | |
| </p> | |
| </div> | |
| </section> | |
| <!-- 16 BENCHMARKS --> | |
| <section id="benchmarks"> | |
| <h2>16 · Benchmark suite</h2> | |
| <table> | |
| <thead> | |
| <tr><th>Benchmark</th><th>Problems</th><th>What it tests</th><th>Status</th></tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td><strong>DataBench test (lite)</strong></td> | |
| <td>15</td> | |
| <td>SemEval-style QA on real parquet samples</td> | |
| <td><span class="pill ok">integrated</span></td> | |
| </tr> | |
| <tr> | |
| <td><strong>DSBench analysis</strong></td> | |
| <td>10</td> | |
| <td>ModelOff Excel financial modeling</td> | |
| <td><span class="pill ok">integrated</span></td> | |
| </tr> | |
| <tr> | |
| <td><strong>Mentor-hard</strong></td> | |
| <td>5</td> | |
| <td>Curated EVTE failures</td> | |
| <td><span class="pill ok">integrated</span></td> | |
| </tr> | |
| </tbody> | |
| </table> | |
| </section> | |
| <!-- 17 MODELS --> | |
| <section id="models"> | |
| <h2>17 · Model checkpoints on Hugging Face</h2> | |
| <table> | |
| <thead> | |
| <tr><th>Checkpoint</th><th>HF repo</th><th>Role</th></tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>Base</td> | |
| <td><a href="https://huggingface.co/unsloth/gemma-4-E2B-it">unsloth/gemma-4-E2B-it</a></td> | |
| <td>Frozen foundation</td> | |
| </tr> | |
| <tr> | |
| <td><strong>SFT v1 ★ demo</strong></td> | |
| <td><a href="https://huggingface.co/sanjaymalladi/DataSense-Modal-E2B-SFT">DataSense-Modal-E2B-SFT</a></td> | |
| <td>Live HF Space adapter — stable execution</td> | |
| </tr> | |
| <tr> | |
| <td>EVTE-STaR Micro-1</td> | |
| <td><a href="https://huggingface.co/sanjaymalladi/DataSense-Modal-E2B-EVTE-Star-Micro1">DataSense-Modal-E2B-EVTE-Star-Micro1</a></td> | |
| <td>Best mentor-hard (5/5) — research checkpoint</td> | |
| </tr> | |
| </tbody> | |
| </table> | |
| </section> | |
| <!-- 18 DEMO --> | |
| <section id="demo"> | |
| <h2>18 · This Hugging Face demo</h2> | |
| <p> | |
| The Gradio app runs <strong>SFT v1</strong> — same agent loop as training eval: load CSV → multi-step | |
| code generation → sandbox execution → <strong>Answer</strong> + <strong>Summary</strong>. | |
| Six built-in examples cover sales, employees, and students datasets. | |
| </p> | |
| <figure class="figure"> | |
| <img src="https://huggingface.co/spaces/build-small-hackathon/DataSense_E2B/resolve/main/assets/illustrations/03-agent-loop.png" alt="Agent loop used in the HF Space demo" loading="lazy" /> | |
| <figcaption><strong>Same loop as eval.</strong> Upload this <code>hf_demo/</code> folder to a Gradio Space (GPU T4), set <code>HF_TOKEN</code> if needed.</figcaption> | |
| </figure> | |
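<p>A stripped-down sketch of such an agent loop, with loud caveats: <code>generate</code> is a hypothetical stand-in for the model, the <code>ANSWER:</code> prefix is an invented convention rather than the project's real answer tag, and a child Python process is not a real sandbox. The sketch also shows where an <code>exec_ok</code> flag could come from.</p>

```python
import subprocess
import sys
from typing import Callable, List, Tuple


def run_code(code: str, timeout_s: float = 30.0) -> Tuple[bool, str]:
    """Run code in a child Python process; return (ok, stdout or stderr)."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False, "timeout"
    if proc.returncode != 0:
        return False, proc.stderr.strip()
    return True, proc.stdout.strip()


def agent_loop(
    generate: Callable[[List[str]], str],
    max_steps: int = 5,
) -> Tuple[str, bool]:
    """Alternate generation and execution until the model answers.

    Returns (answer, exec_ok), where exec_ok is True only if at least
    one code step actually ran successfully.
    """
    observations: List[str] = []
    exec_ok = False
    for _ in range(max_steps):
        action = generate(observations)
        if action.startswith("ANSWER:"):
            return action[len("ANSWER:"):].strip(), exec_ok
        ok, output = run_code(action)
        exec_ok = exec_ok or ok
        observations.append(output)       # fed back to the model next step
    return "", exec_ok                    # step budget exhausted, no answer
```

<p>A model that answers immediately without running code returns <code>exec_ok=False</code>, which is the pattern the evaluation section warns about for the base model.</p>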
| <h3>Deploy checklist</h3> | |
| <ol> | |
| <li>Create Space (Gradio, <strong>gpu-t4</strong>) — see <code>README.md</code> frontmatter</li> | |
| <li>Upload <code>hf_demo/</code> including <code>assets/illustrations/</code> and <code>story.html</code></li> | |
| <li>Secret <code>HF_TOKEN</code> if adapter repo is private</li> | |
| <li>Smoke-test all 6 examples</li> | |
| </ol> | |
| <h3>Future work</h3> | |
| <ul> | |
| <li>DSBench MCQ letter mapping in scorer</li> | |
| <li>Per-micro-batch checkpointing during EVTE-STaR</li> | |
| <li>Optional Space variant with Micro-1 for hard-pool showcase</li> | |
| </ul> | |
| </section> | |
| <footer> | |
| <p> | |
| <strong>DataSense E2B</strong> — Execution-verified, Tutor-escalation training for personal data science agents.<br /> | |
| Code: <code>datasense_pipeline.py</code> · <code>datasense_evte.py</code> · <code>datasense_agent.py</code> · <code>hf_demo/</code><br /> | |
| Built for the Gemma / DataBench hackathon, June 2026. | |
| </p> | |
| <p style="margin-top:2rem"> | |
| <a href="/">← Back to Gradio demo</a> · | |
| <a href="https://huggingface.co/sanjaymalladi/DataSense-Modal-E2B-SFT">SFT v1 on HF</a> · | |
| <a href="https://huggingface.co/sanjaymalladi/DataSense-Modal-E2B-EVTE-Star-Micro1">Micro-1 on HF</a> | |
| </p> | |
| </footer> | |
| </div> | |
| </body> | |
| </html> |