Spaces:

build-small-hackathon
/

DataSense_E2B

Running on Zero

App Files Files Community

DataSense_E2B / index.html

sanjaymalladi

Softer answer parsing, simplify UI, netlify story link

9edae52 verified 19 days ago

Raw

History Blame Contribute Delete

53.3 kB

	<!DOCTYPE html>
	<html lang="en">
	<head>
	<meta charset="UTF-8" />
	<meta name="viewport" content="width=device-width, initial-scale=1.0" />
	<title>DataSense E2B — The Full Story</title>
	<link rel="preconnect" href="https://fonts.googleapis.com" />
	<link rel="preconnect" href="https://fonts.gstatic.com" crossorigin />
	<link href="https://fonts.googleapis.com/css2?family=Fraunces:ital,opsz,wght@0,9..144,300..900;1,9..144,300..900&family=IBM+Plex+Mono:ital,wght@0,400;0,500;0,600;1,400&family=Newsreader:ital,opsz,wght@0,6..72,200..800;1,6..72,200..800&display=swap" rel="stylesheet" />
	<style>
	:root {
	/* Editorial Color Palette */
	--bg: #F4F3ED; /* Warm newspaper cream */
	--text: #111110; /* Deep ink */
	--text-muted: #4A4A46;
	--border: #111110;

	/* Vibrant Print Accents */
	--accent: #E1341E; /* Vermilion Red */
	--accent-blue: #1843D2; /* Cobalt */
	--accent-warm: #D46F15; /* Ochre */
	--accent-ok: #0D733B; /* Forest Green */

	--max-width: 860px;
	--radius: 0px; /* Brutalist/Print - absolutely no rounded corners */
	--shadow-offset: 6px;
	}

	* { box-sizing: border-box; margin: 0; padding: 0; }

	html { scroll-behavior: smooth; }

	::selection {
	background: var(--accent);
	color: var(--bg);
	}

	body {
	font-family: "Newsreader", serif;
	background-color: var(--bg);
	color: var(--text);
	line-height: 1.65;
	font-size: 1.15rem;
	font-weight: 400;
	-webkit-font-smoothing: antialiased;
	/* Subtle noise texture for a paper feel */
	background-image: url("data:image/svg+xml,%3Csvg viewBox='0 0 400 400' xmlns='http://www.w3.org/2000/svg'%3E%3Cfilter id='noiseFilter'%3E%3CfeTurbulence type='fractalNoise' baseFrequency='0.9' numOctaves='3' stitchTiles='stitch'/%3E%3C/filter%3E%3Crect width='100%25' height='100%25' filter='url(%23noiseFilter)' opacity='0.04'/%3E%3C/svg%3E");
	}

	.wrap {
	max-width: var(--max-width);
	margin: 0 auto;
	padding: 4rem 2rem 8rem;
	}

	/* -------------------------------------------
	Header & Hero Typography
	------------------------------------------- */
	header {
	margin-bottom: 4rem;
	padding-bottom: 3rem;
	border-bottom: 4px solid var(--border);
	position: relative;
	}

	header::after {
	content: "";
	position: absolute;
	bottom: -10px;
	left: 0;
	width: 100%;
	height: 1px;
	background: var(--border);
	}

	.badge {
	display: inline-block;
	font-family: "IBM Plex Mono", monospace;
	font-size: 0.75rem;
	font-weight: 600;
	letter-spacing: 0.1em;
	text-transform: uppercase;
	color: var(--bg);
	background: var(--text);
	padding: 0.4rem 0.8rem;
	margin-bottom: 2rem;
	}

	h1 {
	font-family: "Fraunces", serif;
	font-size: clamp(3rem, 7vw, 5.5rem);
	font-weight: 800;
	font-variation-settings: "SOFT" 0, "WONK" 1;
	line-height: 0.95;
	letter-spacing: -0.03em;
	margin-bottom: 1.5rem;
	text-transform: uppercase;
	}

	.subtitle {
	font-family: "Newsreader", serif;
	font-size: 1.4rem;
	font-style: italic;
	color: var(--text-muted);
	max-width: 36em;
	line-height: 1.4;
	}

	.meta {
	margin-top: 2rem;
	font-family: "IBM Plex Mono", monospace;
	font-size: 0.85rem;
	text-transform: uppercase;
	letter-spacing: 0.05em;
	color: var(--text-muted);
	border-top: 1px dashed var(--border);
	padding-top: 1rem;
	}

	/* -------------------------------------------
	Table of Contents
	------------------------------------------- */
	nav.toc {
	background: transparent;
	border: 2px solid var(--border);
	padding: 2rem;
	margin-bottom: 4rem;
	box-shadow: var(--shadow-offset) var(--shadow-offset) 0 var(--border);
	}

	nav.toc h2 {
	font-family: "IBM Plex Mono", monospace;
	font-size: 0.9rem;
	text-transform: uppercase;
	letter-spacing: 0.1em;
	border-bottom: 2px solid var(--border);
	padding-bottom: 0.75rem;
	margin-bottom: 1.5rem;
	padding-top: 0;
	}

	nav.toc ol {
	list-style: none;
	counter-reset: toc;
	column-count: 2;
	column-gap: 3rem;
	}

	@media (max-width: 640px) {
	nav.toc ol { column-count: 1; }
	}

	nav.toc li {
	counter-increment: toc;
	margin-bottom: 0.75rem;
	break-inside: avoid;
	}

	nav.toc a {
	color: var(--text);
	text-decoration: none;
	display: flex;
	gap: 0.5rem;
	font-weight: 500;
	transition: color 0.2s, transform 0.2s;
	}

	nav.toc a::before {
	content: counter(toc, decimal-leading-zero) ".";
	font-family: "IBM Plex Mono", monospace;
	font-weight: 600;
	color: var(--accent);
	}

	nav.toc a:hover {
	color: var(--accent);
	transform: translateX(4px);
	}

	/* -------------------------------------------
	Typography & Content
	------------------------------------------- */
	section {
	margin-bottom: 5rem;
	position: relative;
	}

	section::before {
	content: "";
	display: block;
	width: 3rem;
	height: 4px;
	background: var(--accent);
	margin-bottom: 1.5rem;
	}

	h2 {
	font-family: "Fraunces", serif;
	font-size: 2.5rem;
	font-weight: 700;
	letter-spacing: -0.02em;
	margin-bottom: 1.5rem;
	line-height: 1.1;
	}

	h3 {
	font-family: "Fraunces", serif;
	font-size: 1.5rem;
	font-weight: 600;
	font-style: italic;
	margin: 2.5rem 0 1rem;
	color: var(--text);
	}

	h4 {
	font-family: "IBM Plex Mono", monospace;
	font-size: 1rem;
	font-weight: 600;
	text-transform: uppercase;
	letter-spacing: 0.05em;
	margin: 2rem 0 0.75rem;
	color: var(--text);
	}

	p { margin-bottom: 1.25rem; }

	ul, ol {
	margin: 0 0 1.5rem 2rem;
	padding: 0;
	}

	li { margin-bottom: 0.5rem; }

	li::marker {
	color: var(--accent);
	font-weight: bold;
	}

	strong { font-weight: 700; color: var(--text); }
	em { font-style: italic; font-family: "Fraunces", serif; }

	a {
	color: var(--accent-blue);
	text-decoration: underline;
	text-underline-offset: 4px;
	text-decoration-thickness: 1px;
	transition: all 0.2s;
	}

	a:hover {
	background: var(--accent-blue);
	color: var(--bg);
	text-decoration-color: transparent;
	}

	/* -------------------------------------------
	Cards & Callouts
	------------------------------------------- */
	.card {
	background: var(--bg);
	border: 2px solid var(--border);
	padding: 1.75rem 2rem;
	margin: 2rem 0;
	position: relative;
	box-shadow: var(--shadow-offset) var(--shadow-offset) 0 var(--border);
	transition: transform 0.2s, box-shadow 0.2s;
	}

	.card:hover {
	transform: translate(-2px, -2px);
	box-shadow: calc(var(--shadow-offset) + 2px) calc(var(--shadow-offset) + 2px) 0 var(--border);
	}

	.card.highlight {
	border-color: var(--text);
	background: #fdfcfa;
	}

	.card.highlight::before {
	content: "";
	position: absolute;
	top: 0; left: 0; bottom: 0;
	width: 8px;
	background: var(--accent-blue);
	}

	.card.warn {
	background: #fcf6ef;
	}

	.card.warn::before {
	content: "";
	position: absolute;
	top: 0; left: 0; bottom: 0;
	width: 8px;
	background: var(--accent-warm);
	}

	.card.danger {
	background: #fcefed;
	}

	.card.danger::before {
	content: "";
	position: absolute;
	top: 0; left: 0; bottom: 0;
	width: 8px;
	background: var(--accent);
	}

	.card-title {
	font-family: "IBM Plex Mono", monospace;
	font-weight: 700;
	font-size: 0.85rem;
	text-transform: uppercase;
	letter-spacing: 0.08em;
	color: var(--text);
	border-bottom: 1px solid var(--border);
	padding-bottom: 0.5rem;
	margin-bottom: 1rem;
	}

	.card h4 {
	margin-top: 0;
	border-bottom: 1px solid var(--border);
	padding-bottom: 0.5rem;
	}

	/* -------------------------------------------
	Data Display (Tables & Code)
	------------------------------------------- */
	table {
	width: 100%;
	border-collapse: collapse;
	margin: 2rem 0;
	font-family: "Newsreader", serif;
	font-size: 1rem;
	border-top: 3px solid var(--border);
	border-bottom: 3px solid var(--border);
	}

	th, td {
	text-align: left;
	padding: 0.85rem 1rem;
	border-bottom: 1px solid #d4d3cf;
	}

	th {
	font-family: "IBM Plex Mono", monospace;
	font-size: 0.75rem;
	text-transform: uppercase;
	letter-spacing: 0.05em;
	color: var(--text);
	font-weight: 600;
	vertical-align: bottom;
	}

	tr:last-child td { border-bottom: none; }

	tr:hover td { background: rgba(0,0,0,0.03); }

	.num-good { color: var(--accent-ok); font-weight: 700; }
	.num-mid { color: var(--accent-warm); font-weight: 700; }
	.num-bad { color: var(--accent); font-weight: 700; }
	.pending { color: var(--text-muted); font-style: italic; }

	code, .mono {
	font-family: "IBM Plex Mono", monospace;
	font-size: 0.85em;
	}

	p code, li code {
	background: #e8e7e1;
	border: 1px solid #d4d3cf;
	padding: 0.15em 0.3em;
	color: var(--text);
	font-weight: 500;
	}

	pre {
	background: var(--text);
	color: var(--bg);
	padding: 1.5rem;
	overflow-x: auto;
	font-family: "IBM Plex Mono", monospace;
	font-size: 0.85rem;
	line-height: 1.5;
	margin: 2rem 0;
	box-shadow: var(--shadow-offset) var(--shadow-offset) 0 var(--accent);
	}

	pre code {
	background: transparent;
	border: none;
	color: inherit;
	padding: 0;
	}

	/* -------------------------------------------
	UI Elements
	------------------------------------------- */
	.flow {
	display: flex;
	flex-wrap: wrap;
	gap: 0;
	align-items: center;
	margin: 2rem 0;
	font-family: "IBM Plex Mono", monospace;
	font-size: 0.85rem;
	font-weight: 600;
	text-transform: uppercase;
	border: 2px solid var(--border);
	box-shadow: 4px 4px 0 var(--border);
	width: fit-content;
	}

	.flow span {
	padding: 0.5rem 1rem;
	background: var(--bg);
	}

	.flow .arrow {
	background: var(--text);
	color: var(--bg);
	padding: 0.5rem;
	}

	.pill-row {
	display: flex;
	flex-wrap: wrap;
	gap: 0.5rem;
	margin: 1rem 0;
	}

	.pill {
	font-family: "IBM Plex Mono", monospace;
	font-size: 0.75rem;
	font-weight: 600;
	text-transform: uppercase;
	padding: 0.25rem 0.5rem;
	border: 1px solid var(--border);
	background: var(--bg);
	}

	.pill.ok { background: var(--accent-ok); color: #fff; border-color: var(--accent-ok); }
	.pill.no { background: var(--accent); color: #fff; border-color: var(--accent); }
	.pill.run { background: var(--accent-blue); color: #fff; border-color: var(--accent-blue); }

	.two-col {
	display: grid;
	grid-template-columns: 1fr 1fr;
	gap: 2rem;
	margin: 2rem 0;
	}

	/* -------------------------------------------
	Special Components
	------------------------------------------- */
	.status-banner {
	background: var(--text);
	color: var(--bg);
	padding: 1rem 1.5rem;
	margin-bottom: 3rem;
	font-family: "IBM Plex Mono", monospace;
	font-size: 0.85rem;
	border: 2px solid var(--text);
	position: relative;
	}

	.status-banner::after {
	content: "";
	position: absolute;
	top: 4px; left: 4px; right: -8px; bottom: -8px;
	border: 1px solid var(--text);
	z-index: -1;
	}

	.status-banner strong {
	color: #fff;
	text-transform: uppercase;
	letter-spacing: 0.05em;
	margin-right: 0.5rem;
	}

	figure.figure {
	margin: 3rem 0;
	border: 2px solid var(--border);
	box-shadow: var(--shadow-offset) var(--shadow-offset) 0 var(--border);
	background: var(--bg);
	}

	figure.figure img {
	display: block;
	width: 100%;
	height: auto;
	filter: grayscale(100%) contrast(1.1); /* Editorial print feel */
	transition: filter 0.3s;
	}

	figure.figure:hover img {
	filter: grayscale(0%);
	}

	figure.figure figcaption {
	padding: 1rem 1.25rem;
	font-family: "Newsreader", serif;
	font-size: 0.95rem;
	color: var(--text);
	border-top: 2px solid var(--border);
	background: #fdfcfa;
	}

	.gate-table td:first-child {
	font-family: "IBM Plex Mono", monospace;
	font-size: 0.85rem;
	font-weight: 600;
	}

	.phase-grid {
	display: grid;
	gap: 1.5rem;
	margin: 2.5rem 0;
	}

	.phase-card {
	border: 1px solid var(--border);
	padding: 1.5rem;
	position: relative;
	}

	.phase-card::before {
	content: "";
	position: absolute;
	top: 0; left: 0;
	width: 100%;
	height: 4px;
	background: var(--accent);
	}

	.phase-card h4 { margin: 0 0 0.5rem; }
	.phase-card p { margin: 0; }

	blockquote.pull {
	font-family: "Fraunces", serif;
	font-size: 1.5rem;
	line-height: 1.4;
	font-style: italic;
	margin: 3rem 0;
	padding: 2rem;
	border-top: 2px solid var(--border);
	border-bottom: 2px solid var(--border);
	text-align: center;
	color: var(--text);
	background: repeating-linear-gradient(
	45deg,
	transparent,
	transparent 10px,
	rgba(0,0,0,0.02) 10px,
	rgba(0,0,0,0.02) 20px
	);
	}

	/* -------------------------------------------
	Footer
	------------------------------------------- */
	footer {
	margin-top: 6rem;
	padding-top: 3rem;
	border-top: 4px solid var(--border);
	font-family: "IBM Plex Mono", monospace;
	font-size: 0.85rem;
	text-transform: uppercase;
	letter-spacing: 0.05em;
	color: var(--text-muted);
	}

	footer a { color: var(--text); font-weight: 600; }

	@media (max-width: 640px) {
	.two-col { grid-template-columns: 1fr; }
	.wrap { padding: 2rem 1rem 4rem; }
	h1 { font-size: 2.5rem; }
	}
	</style>
	</head>
	<body>
	<div class="wrap">
	<header>
	<h1>DataSense E2B<br />The Full Story</h1>
	<p class="subtitle">
	How we set out to build a <strong>personal data-science agent</strong> — not a chatbot that
	<em>pretends</em> to run code, but one that <strong>writes Python, executes it, reads real errors,
	and verifies answers</strong> — and what we learned training Gemma-4-2B on Modal with methods
	we had to invent along the way.
	</p>
	<p class="meta">
	Base: <code>unsloth/gemma-4-E2B-it</code><br />
	Pipeline: Modal A100/T4<br />
	Team: <strong>DataSense E2B</strong> (Execution-verified, Tutor-escalation)<br />
	</p>
	</header>

	<nav class="toc" aria-label="Table of contents">
	<h2>Index</h2>
	<ol>
	<li><a href="#goal">The goal</a></li>
	<li><a href="#start">Where we started</a></li>
	<li><a href="#problem">The problem with naive finetuning</a></li>
	<li><a href="#agent">The DataSense agent loop</a></li>
	<li><a href="#pipeline">Training pipeline: SFT → GRPO → DPO</a></li>
	<li><a href="#methods">Supporting methods (verifiers, eval)</a></li>
	<li><a href="#evte">EVTE — core idea & motivation</a></li>
	<li><a href="#evte-feedback">EVTE feedback loops (self-recovery)</a></li>
	<li><a href="#evte-mentor">Mentor verify & hint protocol</a></li>
	<li><a href="#evte-star">EVTE-STaR — online micro-SFT</a></li>
	<li><a href="#evte-outcomes">Episode outcomes & trainability gates</a></li>
	<li><a href="#worked">What worked</a></li>
	<li><a href="#didnt">What didn't work</a></li>
	<li><a href="#evals">Evaluation results</a></li>
	<li><a href="#demo-choice">Why SFT v1 for the demo</a></li>
	<li><a href="#benchmarks">Benchmark suite</a></li>
	<li><a href="#models">Model checkpoints</a></li>
	<li><a href="#demo">This demo & what's next</a></li>
	</ol>
	</nav>

	<!-- 01 GOAL -->
	<section id="goal">
	<h2>01 · The goal</h2>
	<p>
	The hackathon asked for something ambitious: take a small open model and make it genuinely useful
	for <strong>data work</strong> — exploring tables, cleaning messy columns, aggregating, joining,
	visualizing, and answering questions with <strong>verifiable correctness</strong>, not plausible prose.
	</p>
	<p>Our north star was simple to state and hard to achieve:</p>
	<div class="card highlight">
	<div class="card-title">North star</div>
	<p style="margin:0">
	A <strong>2B-parameter student agent</strong> that behaves like a junior data analyst:
	inspect schema first, run focused code steps, debug from real tracebacks, and only claim an
	answer after execution confirms it — with a training story credible enough for slides,
	papers, and a public Hugging Face demo.
	</p>
	</div>
	<p>Concretely, we targeted:</p>
	<ul>
	<li><strong>Execution-grounded behavior</strong> — rewards and eval tied to real <code>stdout</code> / errors, not hallucinated <code><result></code> blocks</li>
	<li><strong>Multi-benchmark credibility</strong> — DataBench, DSBench Excel analysis, and a curated hard pool from our own training data</li>
	<li><strong>A reproducible Modal pipeline</strong> — one app, volume checkpoints, automatic HF Hub pushes</li>
	<li><strong>Novel training for hard questions</strong> — when the student fails, a larger mentor verifies a solution and gives diagnostic hints <em>without leaking the answer</em></li>
	</ul>

	<figure class="figure">
	<img src="https://huggingface.co/spaces/build-small-hackathon/DataSense_E2B/resolve/main/assets/illustrations/01-goal-agent-vs-formatter.png" alt="Formatter that fakes answers versus a real execution-verified agent" loading="lazy" />
	<figcaption><strong>Fig 1 — Goal.</strong> We optimize for an agent that runs code on real data and verifies answers — not a model that prints plausible <code> Answer: </code> tags without executing anything.</figcaption>
	</figure>
	</section>

	<!-- 02 START -->
	<section id="start">
	<h2>02 · Where we started</h2>
	<h3>The base model</h3>
	<p>
	We built on <code>unsloth/gemma-4-E2B-it</code> — Google's Gemma 4 2B instruction model in
	Unsloth's E2B (execution-to-build) variant. It's small enough to fine-tune on a single GPU,
	yet designed with code and tool use in mind. We used 4-bit quantization, LoRA rank 32,
	and a 2048-token context throughout.
	</p>

	<h3>Three Kaggle notebooks → one Modal app</h3>
	<p>
	The project began as three separate Kaggle notebooks covering supervised fine-tuning (SFT),
	GRPO reinforcement learning, and DPO preference optimization. We consolidated them into
	<code>datasense_pipeline.py</code> — a single Modal application with shared config in
	<code>datasense_utils.py</code> — so training could run unattended on cloud GPUs with
	checkpoints persisted to a Modal volume and pushed to Hugging Face.
	</p>

	<h3>Nine bugs we fixed before trusting any number</h3>
	<p>Early runs were misleading because the ported notebooks had latent bugs. We fixed all nine before building the pipeline:</p>
	<table>
	<thead>
	<tr><th>#</th><th>Bug</th><th>Impact</th></tr>
	</thead>
	<tbody>
	<tr><td>1</td><td><code>sft_warmup</code> KeyError</td><td>SFT wouldn't start</td></tr>
	<tr><td>2</td><td><code>lora_target_modules</code> KeyError</td><td>LoRA attach failed</td></tr>
	<tr><td>3</td><td><code>result_str</code> UnboundLocalError</td><td>Agent loop crashed mid-rollout</td></tr>
	<tr><td>4</td><td>DPO pairs missing chat template prefix</td><td>Preference data malformed</td></tr>
	<tr><td>5</td><td><code>skip_special_tokens=False</code></td><td>Decode pollution in rewards</td></tr>
	<tr><td>6</td><td>Dead <code>oci_sft_v1</code> variable</td><td>Confusing / broken cells</td></tr>
	<tr><td>7</td><td>GRPO <code>max_steps</code> hardcoded</td><td>Config ignored</td></tr>
	<tr><td>8</td><td>Shorter <code>SYSTEM_PROMPT</code> in DPO cell</td><td>Train/eval prompt drift</td></tr>
	<tr><td>9</td><td><code>_PROBLEM_LOOKUP</code> naming mismatch</td><td>Dataset indexing broken</td></tr>
	</tbody>
	</table>

	<h3>Day-one eval: 0% accuracy (and why that was informative)</h3>
	<p>
	Our first agent eval reported <strong>0% accuracy</strong> for everyone — including SFT — while
	SFT already showed <strong>100% execution success</strong> and ~5.6 agent steps vs base's 2% exec /
	1.1 steps. That gap taught us the first big lesson: <strong>the model was learning to run code,
	but we weren't scoring against real data.</strong>
	</p>
	<div class="card warn">
	<div class="card-title">Root cause</div>
	<p style="margin:0">
	Eval workspaces used <strong>synthetic random CSVs</strong> when DataBench parquet wasn't mounted,
	but ground truth came from the <strong>real</strong> dataset. The agent analyzed fake data and
	was graded against true answers — guaranteed 0%.
	</p>
	</div>

	<figure class="figure">
	<img src="https://huggingface.co/spaces/build-small-hackathon/DataSense_E2B/resolve/main/assets/illustrations/02-fake-data-eval.png" alt="Eval bug: synthetic workspace data scored against real ground truth" loading="lazy" />
	<figcaption><strong>Fig 2 — The 0% eval bug.</strong> Early runs used random synthetic CSVs in the sandbox while ground truth came from real DataBench files — so even a good agent could never match.</figcaption>
	</figure>
	</section>

	<!-- 03 PROBLEM -->
	<section id="problem">
	<h2>03 · The problem with naive finetuning</h2>
	<p>
	Most "data agent" demos finetune on static (question, code, answer) triples. The model learns
	to <em>format</em> responses that look like an agent — <code> Answer: </code> tags, pandas snippets,
	confident summaries — without ever closing the loop on execution.
	</p>
	<p>We observed three failure modes immediately:</p>
	<div class="two-col">
	<div class="card">
	<div class="card-title">Formatter, not agent</div>
	<p style="margin:0;font-size:0.95rem">
	Base Gemma-4 could score well on easy boolean questions by emitting answer tags in a single
	turn — <strong>0% code execution</strong> — beating SFT on accuracy while doing none of the work.
	</p>
	</div>
	<div class="card">
	<div class="card-title">Hallucinated execution</div>
	<p style="margin:0;font-size:0.95rem">
	Models invent <code><result></code> blocks with fake stdout. RL rewards on text alone
	reinforce the illusion of competence.
	</p>
	</div>
	</div>
	<p>
	The fix wasn't "more SFT data." It was changing <strong>what we optimize and measure</strong>:
	real subprocess execution, multi-turn observe→fix→retry, and verifiers that compare parsed answers
	to typed ground truth (boolean, number, category, list types).
	</p>
	</section>

	<!-- 04 AGENT -->
	<section id="agent">
	<h2>04 · The DataSense agent loop</h2>
	<p>Every training rollout and eval episode follows the same production-shaped loop:</p>
	<div class="flow">
	<span>THINK</span><span class="arrow">→</span>
	<span>EXPLORE</span><span class="arrow">→</span>
	<span>EXECUTE</span><span class="arrow">→</span>
	<span>DEBUG</span><span class="arrow">→</span>
	<span>ANSWER</span>
	</div>
	<ol>
	<li><strong>THINK</strong> — inspect schema, dtypes, nulls before analysis</li>
	<li><strong>EXPLORE</strong> — <code>head()</code>, <code>describe()</code>, small SQL <code>LIMIT</code> queries</li>
	<li><strong>EXECUTE</strong> — one focused Python step; read real <code><result></code> from sandbox</li>
	<li><strong>DEBUG</strong> — fix column names, joins, dtypes from tracebacks</li>
	<li><strong>ANSWER</strong> — <code> Answer: </code> + <code> Summary: </code> after verified execution</li>
	</ol>
	<p>
	The system prompt (shared across train, eval, and this HF demo) explicitly forbids hallucinated APIs
	and requires the final printed value to match the answer tag. For DataBench we mount real
	<code>sample.parquet</code> into the workspace; for DSBench we copy <code>.xlsx</code> workbooks
	and use <code>inspect_source</code> for Excel structure.
	</p>
	<pre>Reward signal (simplified):
	+ execution actually ran
	+ stdout parseable
	+ answer matches ground truth (typed comparator)
	− hallucinated inline <result> without [EXEC:real]
	− debug rambling / column dumps as "answers"</pre>

	<figure class="figure">
	<img src="https://huggingface.co/spaces/build-small-hackathon/DataSense_E2B/resolve/main/assets/illustrations/03-agent-loop.png" alt="THINK EXPLORE EXECUTE DEBUG ANSWER agent loop" loading="lazy" />
	<figcaption><strong>Fig 3 — Agent loop.</strong> Every rollout follows the same multi-step cycle: inspect, run code, read real output, debug, then answer.</figcaption>
	</figure>
	</section>

	<!-- 05 PIPELINE -->
	<section id="pipeline">
	<h2>05 · Training pipeline: SFT → GRPO → DPO</h2>
	<p>Our planned stack mirrors modern agent training — with execution at every stage:</p>
	<div class="flow">
	<span>SFT</span><span class="arrow">→</span>
	<span>GRPO</span><span class="arrow">→</span>
	<span>DPO</span><span class="arrow">→</span>
	<span>Eval</span>
	</div>

	<h3>Stage 1 — Supervised fine-tuning (SFT v1) ✅</h3>
	<p>
	Bulk SFT on DataBench-style traces plus agent supplements: multi-turn dialogs, Jupyter-agent
	traces, dashboard examples, and code-feedback execution pairs. This produced our strongest
	baseline — <code>sanjaymalladi/DataSense-Modal-E2B-SFT</code>.
	</p>
	<ul>
	<li>LoRA r=32, α=64 on all attention + MLP projections</li>
	<li>~600 max steps, effective batch 8</li>
	<li>Teaches the model to <em>use</em> the agent format and run multi-step code</li>
	</ul>

	<h3>Stage 2 — GRPO (execution-grounded RL) ⚠️ partial</h3>
	<p>
	Group Relative Policy Optimization with <strong>real Python rollouts</strong> per prompt.
	Each step spawns multiple agent trajectories; rewards use <code>compute_trajectory_reward()</code>
	with <code>require_real_execution=True</code>.
	</p>
	<p>
	GRPO on Gemma-4 is brutally slow (~11 min/step on A100) because most wall time is
	<strong>CPU-bound execution</strong>, not GPU matmul — 4 rollouts × up to 5 agent steps ×
	subprocess sandboxing. We fixed trajectory forwarding bugs, KL instability
	(<code>final_logit_softcapping=30</code>), and added parallel rollout workers — but full
	300-step GRPO remained impractical within hackathon time. A shortened 100-step run was targeted.
	</p>

	<h3>Stage 3 — DPO ⏸️ deferred</h3>
	<p>
	Preference pairs from high vs low reward rollouts (min gap 0.15) — planned but deprioritized
	once EVTE-STaR showed more promise for hard-question gains within our compute budget.
	</p>

	<figure class="figure">
	<img src="https://huggingface.co/spaces/build-small-hackathon/DataSense_E2B/resolve/main/assets/illustrations/04-pipeline-stages.png" alt="SFT GRPO DPO training pipeline stages" loading="lazy" />
	<figcaption><strong>Fig 4 — Training stages.</strong> SFT v1 shipped and works. Full GRPO was execution-bound and slow. DPO was deferred in favor of EVTE-STaR.</figcaption>
	</figure>
	</section>

	<!-- 06 METHODS (supporting) -->
	<section id="methods">
	<h2>06 · Supporting infrastructure (not EVTE itself)</h2>
	<p>
	Before EVTE could work, we needed execution-grounded rollouts, typed verifiers, and honest eval.
	These are the plumbing; the novel research contribution is EVTE + EVTE-STaR (sections 07–11 below).
	</p>

	<h3>Execution-grounded rollouts</h3>
	<p>
	Every GRPO/DPO/EVTE trajectory runs code in an isolated workspace. Rewards ignore fake
	<code><result></code> tags unless tagged <code>[EXEC:real]</code>.
	</p>

	<h3>Typed answer verification (<code>databench_compare</code> + neural verifier)</h3>
	<p>
	Evidence-bound scoring chain: exec stdout → <code> Answer: </code> tag → LLM extract → typed compare
	(boolean, float, category, <code>list[category]</code>, <code>list[number]</code>).
	Without this, mentors "fail" when extraction fails, not when reasoning fails.
	</p>

	<h3>Lite eval & hackathon harness</h3>
	<p>
	DataBench lite scores against <code>sample_answer</code> on mounted parquet.
	<code>run_hackathon_benchmarks_parallel</code> runs Base / SFT / Micro-1 across three benchmarks on T4.
	</p>
	</section>

	<!-- 07 EVTE CORE -->
	<section id="evte">
	<h2>07 · EVTE — Execution-Verified Tutor Escalation</h2>
	<p>
	<strong>EVTE</strong> is the method we built when classical distillation and STaR broke down for
	data agents. The name encodes three commitments:
	</p>
	<ul>
	<li><strong>Execution</strong> — every claim of success must be backed by real code that ran on real files</li>
	<li><strong>Verified</strong> — student <em>and</em> mentor answers pass the same typed verifier</li>
	<li><strong>Tutor Escalation</strong> — a larger model intervenes only after student failure, and only as a <em>coach</em>, not an answer vending machine</li>
	</ul>

	<h3>Why we needed EVTE</h3>
	<p>
	Classical <strong>STaR</strong> (Self-Taught Reasoner) assumes a strong teacher can produce correct
	reasoning chains, filter them, and fine-tune the student offline. That fails for DataSense because:
	</p>
	<ol>
	<li>Our <strong>2B student</strong> often can't solve list/category questions at all</li>
	<li>Our <strong>31B mentor</strong> also fails verification on the hardest 5 problems (~40% mentor-hard pool)</li>
	<li>Even when code is <em>right</em>, <strong>answer extraction</strong> fails (no tag, wrong stdout parse)</li>
	<li>Distilling final answers teaches <strong>memorization</strong>; we need debugging under execution constraints</li>
	</ol>

	<h3>The five-phase episode (EVTE and EVTE-STaR share this skeleton)</h3>
	<p>Implemented in <code>datasense_evte.py</code> — <code>run_evte_episode</code> (offline collection) and <code>run_evte_star_episode</code> (online training).</p>

	<div class="phase-grid">
	<div class="phase-card">
	<h4>Phase 1 · Student first attempt</h4>
	<p>2B student, up to 5 agent steps, real workspace (CSV/parquet/xlsx). Scored via <code>score_rollout()</code>.</p>
	</div>
	<div class="phase-card">
	<h4>Phase 2 · Self-recovery feedback</h4>
	<p>Up to 3 rounds of <code>build_self_recovery_feedback()</code> — real tracebacks, answer withheld.</p>
	</div>
	<div class="phase-card">
	<h4>Phase 3 · Mentor independent verify</h4>
	<p>31B mentor solves in a <em>fresh</em> workspace; must pass the same verifier before any hint.</p>
	</div>
	<div class="phase-card">
	<h4>Phase 4 · Diagnostic mentor hint</h4>
	<p><code>generate_mentor_hint()</code> under <code>MENTOR_HINT_SYSTEM</code> — no final answer, no full script.</p>
	</div>
	<div class="phase-card">
	<h4>Phase 5 · Post-hint student</h4>
	<p>Up to 2 attempts × 5 steps. Episode saved only if student verifies after reading the hint.</p>
	</div>
	</div>

	<figure class="figure">
	<img src="https://huggingface.co/spaces/build-small-hackathon/DataSense_E2B/resolve/main/assets/illustrations/05-evte-five-phases.png" alt="EVTE five phases from student attempt to mentor-assisted success" loading="lazy" />
	<figcaption><strong>Fig 5 — EVTE in five phases.</strong> Student tries → self-recovery → mentor must verify independently → diagnostic hint → student retries. Only verified post-hint wins become training data.</figcaption>
	</figure>

	<pre>run_evte_star_episode (simplified control flow):

	student_rollout = phase_1_student()
	if clean_first_try_verified and not messy_recovery_in_trace:
	return SKIP # already knows it — not trainable in STaR mode

	if not verified:
	for i in 1..3:
	add_user(build_self_recovery_feedback()) # ← EVTE feedback
	student_rollout = student_retry()

	mentor_ok, mentor_rollout = mentor_verify_solution(
	student_rollout=junior_trace # mentor sees failed code
	)
	if not mentor_ok:
	return DISCARD # mentor_unverified — no training signal

	hint = generate_mentor_hint(student_rollout, mentor_rollout)
	add_user("[MENTOR] " + hint) # diagnostic only

	for j in 1..2:
	student_rollout = student_retry()
	if verified:
	return SAVE_TRAINABLE_EPISODE # mentor_assisted</pre>

	<h3>Hard-first curriculum</h3>
	<p>
	<code>_prioritize_evte_problems()</code> sorts <code>list[category]</code>, <code>list[number]</code>,
	and multi-answer types before easy booleans. EVTE compute is expensive (two models × multi-step agents);
	we spend it where SFT v1 plateaus.
	</p>

	<h3>Mentor hardware choreography</h3>
	<p>
	Student (2B) and mentor (31B) don't fit comfortably together on one A100. The STaR loop uses
	<code>on_micro_batch</code> hooks to <strong>unload mentor → micro-SFT student → reload mentor</strong>
	every 15 episodes. Progress persists to <code>evte_star_progress.json</code> with resume support.
	</p>
	</section>

	<!-- 08 EVTE FEEDBACK -->
	<section id="evte-feedback">
	<h2>08 · EVTE feedback — self-recovery without answer leakage</h2>
	<p>
	The most underrated piece of EVTE is not the mentor — it's <strong>what we put in the user turn
	when the student fails</strong>. This is <code>build_self_recovery_feedback()</code> in
	<code>datasense_evte.py</code>.
	</p>

	<figure class="figure">
	<img src="https://huggingface.co/spaces/build-small-hackathon/DataSense_E2B/resolve/main/assets/illustrations/06-evte-self-recovery.png" alt="Self-recovery feedback loop with real errors but hidden ground truth" loading="lazy" />
	<figcaption><strong>Fig 6 — Self-recovery feedback.</strong> The student sees wrong predictions, last code, and real tracebacks — never the correct answer.</figcaption>
	</figure>

	<blockquote class="pull">
	Messy success = verified answer but conversation contains debug/recovery language
	(<code>trajectory_has_recovery_signal()</code>). We don't want to reinforce "stumble into correctness"
	without tutor review in STaR mode.
	</blockquote>

	<h3>Why SFT v2 failed — feedback without balance</h3>
	<p>
	When we later fine-tuned <strong>only</strong> on recovery trajectories (SFT v2), the model learned
	the <em>shape</em> of debug prose — dtype dumps, column lists — without improving verified answers.
	Lesson: self-recovery feedback is essential <strong>during collection</strong>, but training must mix
	clean completions with mentor-assisted wins, not recovery-only soup.
	</p>
	</section>

	<!-- 09 EVTE MENTOR -->
	<section id="evte-mentor">
	<h2>09 · Mentor verify & hint protocol</h2>
	<p>
	The mentor is <code>google/gemma-4-31B-it</code> (4-bit via Unsloth). It is <strong>not</strong> an oracle
	that whispers answers. It must earn the right to hint by passing the same execution verifier as the student.
	</p>

	<figure class="figure">
	<img src="https://huggingface.co/spaces/build-small-hackathon/DataSense_E2B/resolve/main/assets/illustrations/07-evte-mentor-gate.png" alt="Mentor must pass verification gate before giving a diagnostic hint" loading="lazy" />
	<figcaption><strong>Fig 7 — Mentor gate.</strong> The 31B mentor must verify its own solution by running code before it may give a hint — and the hint must not leak the final answer.</figcaption>
	</figure>

	<h3>Mentor retry modes</h3>
	<table>
	<thead>
	<tr><th>Mode</th><th>Behavior</th><th>Config</th></tr>
	</thead>
	<tbody>
	<tr>
	<td><strong>series</strong></td>
	<td>Same conversation; temps ramp 0.4 → 0.65 → 0.85</td>
	<td><code>evte_mentor_retry_mode=series</code></td>
	</tr>
	<tr>
	<td><strong>parallel</strong></td>
	<td>3 independent workspaces; first verified wins; temps [0.2, 0.5, 0.7]</td>
	<td><code>evte_mentor_retry_mode=parallel</code></td>
	</tr>
	</tbody>
	</table>

	</section>

	<!-- 10 EVTE-STAR -->
	<section id="evte-star">
	<h2>10 · EVTE-STaR — online Self-Taught Reasoner with micro-SFT</h2>
	<p>
	<strong>EVTE-STaR</strong> combines EVTE episode collection with <strong>online weight updates</strong>.
	Classical STaR: collect all successes → train offline once. EVTE-STaR:
	<strong>collect 15 verified mentor-assisted wins → micro-SFT 30 steps → student is slightly better → repeat.</strong>
	</p>

	<figure class="figure">
	<img src="https://huggingface.co/spaces/build-small-hackathon/DataSense_E2B/resolve/main/assets/illustrations/08-evte-star-online.png" alt="EVTE-STaR online micro-SFT every 15 verified episodes" loading="lazy" />
	<figcaption><strong>Fig 8 — EVTE-STaR online loop.</strong> Every 15 mentor-assisted wins → 30-step micro-SFT at low LR → student continues on harder problems with nudged weights.</figcaption>
	</figure>

	<h3>The overtraining curve (batches 2–3 vs batch 6)</h3>
	<p>
	Micro-batch <strong>1</strong> replay in RAM scored <strong>100%</strong> on mentor-hard (5 problems).
	Saved Micro-1 checkpoint: ~<strong>60%</strong> confirmatory. Replay of batches <strong>2–3</strong>:
	~<strong>80%</strong>. Final batch <strong>6</strong> checkpoint: ~<strong>40%</strong> — worse than SFT v1.
	</p>
	<div class="card warn">
	<div class="card-title">Lesson</div>
	<p style="margin:0">
	Online micro-SFT needs <strong>early stopping on a held-out hard set</strong>, not "more batches = better."
	We only preserved micro-1 and final checkpoints on the volume — sweet-spot batches 2–3 were lost
	until <code>run_micro_replay_eval</code> reconstructed them in RAM.
	</p>
	</div>
	</section>

	<!-- 11 EVTE OUTCOMES -->
	<section id="evte-outcomes">
	<h2>11 · Episode outcomes & trainability gates</h2>
	<p>Every episode ends in exactly one outcome. The outcome determines whether it enters training.</p>

	<table>
	<thead>
	<tr><th>Outcome</th><th>Meaning</th><th>EVTE-STaR: train?</th></tr>
	</thead>
	<tbody>
	<tr>
	<td><code>self_solved_clean</code></td>
	<td>First-try verified, no recovery signals in trace</td>
	<td class="num-bad">Skip</td>
	</tr>
	<tr>
	<td><code>self_recovered</code></td>
	<td>Fixed via self-recovery feedback only</td>
	<td class="num-mid">Optional</td>
	</tr>
	<tr>
	<td><code>mentor_assisted</code></td>
	<td>Failed → mentor verified → hint → student verified</td>
	<td class="num-good">Yes</td>
	</tr>
	<tr>
	<td><code>discarded</code></td>
	<td>Mentor couldn't pass execution verifier</td>
	<td class="num-bad">No</td>
	</tr>
	</tbody>
	</table>
	</section>

	<!-- 12 WORKED -->
	<section id="worked">
	<h2>12 · What worked</h2>

	<div class="card">
	<h4>✅ SFT v1 — real execution behavior</h4>
	<p style="margin:0.5rem 0 0">
	SFT v1 consistently runs real Python (100% exec on many evals), uses ~4–5 agent steps, and
	beats base on hard questions where base "wins" without code. This is the behavioral foundation
	everything else builds on.
	</p>
	</div>

	<div class="card">
	<h4>✅ EVTE episode quality filter</h4>
	<p style="margin:0.5rem 0 0">
	Saving only mentor-assisted verified trajectories produced high-signal data — multi-turn debug
	with real errors, not synthetic Q/A. 92 episodes is small but <em>curated</em>.
	</p>
	</div>
	</section>

	<!-- 09 DIDNT -->
	<section id="didnt">
	<h2>13 · What didn't work</h2>

	<div class="card danger">
	<h4>❌ Full GRPO within hackathon time</h4>
	<p style="margin:0.5rem 0 0">
	~11 min/step × hundreds of steps × execution-bound rollouts ≈ multi-day runs. Parallel rollout
	workers helped but couldn't change the fundamental CPU/GPU pipeline stall. vLLM isn't available
	for Gemma 4 E2B, so generation stays on HF generate.
	</p>
	</div>

	<div class="card danger">
	<h4>❌ SFT v2 (recovery-only fine-tune)</h4>
	<p style="margin:0.5rem 0 0">
	Training only on EVTE recovery trajectories taught <strong>debug prose</strong> — column dtype
	dumps, rambling — without improving answers. Mentor-hard: 40% vs SFT v1's 60%.
	</p>
	</div>
	</section>

	<!-- 10 EVALS -->
	<section id="evals">
	<h2>14 · Evaluation results</h2>
	<p>
	<strong>Agent accuracy</strong> on real data files (lite DataBench parquet, DSBench Excel, mentor-hard pool).
	Macro average = unweighted mean across three benchmarks (30 problems). Always pair accuracy with
	<strong>exec_ok</strong> — base can match easy booleans via answer tags without running code.
	</p>

	<figure class="figure">
	<img src="https://huggingface.co/spaces/build-small-hackathon/DataSense_E2B/resolve/main/assets/illustrations/09-eval-benchmarks.png" alt="Three hackathon benchmarks across three models" loading="lazy" />
	<figcaption><strong>Fig 9 — Hackathon eval suite.</strong> DataBench (15) + DSBench Excel (10) + mentor-hard (5) per model on T4.</figcaption>
	</figure>

	<h3>Hackathon benchmark suite — final (first complete run)</h3>
	<p>Parallel eval: <code>run_hackathon_benchmarks_parallel</code> · 3× T4 · June 2026.</p>
	<table>
	<thead>
	<tr><th>Model</th><th>DataBench (15)</th><th>DSBench (10)</th><th>Mentor-hard (5)</th><th>Macro avg</th><th>Total</th></tr>
	</thead>
	<tbody>
	<tr>
	<td>Base</td>
	<td class="num-mid">60.0%</td>
	<td class="num-bad">0.0%</td>
	<td class="num-mid">20.0%</td>
	<td class="num-mid">26.7%</td>
	<td>10/30</td>
	</tr>
	<tr>
	<td><strong>SFT v1</strong></td>
	<td class="num-good">86.7%</td>
	<td class="num-bad">0.0%</td>
	<td class="num-good">60.0%</td>
	<td class="num-good">48.9%</td>
	<td>16/30</td>
	</tr>
	<tr>
	<td>EVTE Micro-1</td>
	<td class="num-good">80.0%</td>
	<td class="num-bad">0.0%*</td>
	<td class="num-good">100.0%</td>
	<td class="num-good">60.0%</td>
	<td>17/30</td>
	</tr>
	</tbody>
	</table>
	<p style="font-size:0.9rem;color:var(--text-muted)">
	*DSBench official scorer = 0% for all models. Micro-1 Q15 computed <code>$12,829,511</code> = option <strong>A</strong> (correct) but was graded wrong because we compare letters not dollar values → value-aware DSBench would be 1/10 (macro <strong>63.3%</strong>).
	</p>

	<h3>Earlier standalone evals (sanity checks)</h3>
	<table>
	<thead>
	<tr><th>Eval</th><th>Base</th><th>SFT v1</th><th>Micro-1 / SFT v2</th></tr>
	</thead>
	<tbody>
	<tr>
	<td>Quick DataBench (5)</td>
	<td>80% acc / 0% exec</td>
	<td class="num-good">80% / 100% exec</td>
	<td>SFT v2: 40%</td>
	</tr>
	<tr>
	<td>Mentor-hard (5)</td>
	<td>40% / 0% exec</td>
	<td class="num-good">60% / 100% exec</td>
	<td>Micro-1 replay: 100% (RAM); saved ckpt ~60%</td>
	</tr>
	</tbody>
	</table>

	<div class="card">
	<div class="card-title">How to read DSBench</div>
	<p style="margin:0">
	Models often <strong>run code</strong> (50–100% exec_ok) but return dataframe strings, <code>0.0</code>, or dollar amounts that map to the <em>wrong</em> MCQ letter. Only one case (Micro-1 Q15) was a true scoring-format bug. DSBench 0% is mostly real Excel/parsing failure, not a broken metric.
	</p>
	</div>
	</section>

	<!-- DEMO MODEL CHOICE -->
	<section id="demo-choice">
	<h2>15 · Why SFT v1 for the live demo (not Micro-1)</h2>
	<p>
	Micro-1 wins <strong>macro average</strong> (60% vs 48.9%) on paper — driven by a perfect 5/5 on mentor-hard.
	We still ship <strong>SFT v1</strong> on this Hugging Face Space. Here's why:
	</p>

	<table>
	<thead>
	<tr><th>Factor</th><th>SFT v1</th><th>EVTE Micro-1</th></tr>
	</thead>
	<tbody>
	<tr>
	<td><strong>DataBench (breadth)</strong></td>
	<td class="num-good"><strong>86.7%</strong> — best on the largest held-out slice</td>
	<td>80.0%</td>
	</tr>
	<tr>
	<td><strong>Mentor-hard (depth)</strong></td>
	<td>60% (3/5), 100% exec</td>
	<td class="num-good"><strong>100%</strong> (5/5) on first complete run</td>
	</tr>
	<tr>
	<td><strong>Stability</strong></td>
	<td class="num-good">Single bulk SFT — predictable at inference</td>
	<td>Online micro-SFT batch 1 — replay 100% vs saved ckpt ~60%</td>
	</tr>
	<tr>
	<td><strong>Straggler reruns</strong></td>
	<td class="num-good">Held up when Modal overwrote volume</td>
	<td>Mentor-hard dropped to 60% on duplicate run</td>
	</tr>
	<tr>
	<td><strong>Live demo risk</strong></td>
	<td class="num-good">Lower — fewer debug ramble / dtype dumps</td>
	<td>Higher — tuned on hard pool, can overfit quirks</td>
	</tr>
	<tr>
	<td><strong>Story on slides</strong></td>
	<td>“Execution-grounded baseline that works”</td>
	<td>“EVTE-STaR peak — best hard-pool result”</td>
	</tr>
	</tbody>
	</table>

	<div class="card highlight">
	<div class="card-title">Decision</div>
	<p style="margin:0">
	<strong>Gradio Space → SFT v1</strong> (<code>sanjaymalladi/DataSense-Modal-E2B-SFT</code>) for reliable live CSV demos.<br />
	<strong>Slides → show all three models</strong>; cite Micro-1 as evidence EVTE-STaR helps on the hard curated pool, not as the production default yet.
	</p>
	</div>
	</section>

	<!-- 11 BENCHMARKS -->
	<section id="benchmarks">
	<h2>16 · Benchmark suite</h2>
	<table>
	<thead>
	<tr><th>Benchmark</th><th>Problems</th><th>What it tests</th><th>Status</th></tr>
	</thead>
	<tbody>
	<tr>
	<td><strong>DataBench test (lite)</strong></td>
	<td>15</td>
	<td>SemEval-style QA on real parquet samples</td>
	<td><span class="pill ok">integrated</span></td>
	</tr>
	<tr>
	<td><strong>DSBench analysis</strong></td>
	<td>10</td>
	<td>ModelOff Excel financial modeling</td>
	<td><span class="pill ok">integrated</span></td>
	</tr>
	<tr>
	<td><strong>Mentor-hard</strong></td>
	<td>5</td>
	<td>Curated EVTE failures</td>
	<td><span class="pill ok">integrated</span></td>
	</tr>
	</tbody>
	</table>
	</section>

	<!-- 12 MODELS -->
	<section id="models">
	<h2>17 · Model checkpoints on Hugging Face</h2>
	<table>
	<thead>
	<tr><th>Checkpoint</th><th>HF repo</th><th>Role</th></tr>
	</thead>
	<tbody>
	<tr>
	<td>Base</td>
	<td><a href="https://huggingface.co/unsloth/gemma-4-E2B-it">unsloth/gemma-4-E2B-it</a></td>
	<td>Frozen foundation</td>
	</tr>
	<tr>
	<td><strong>SFT v1 ★ demo</strong></td>
	<td><a href="https://huggingface.co/sanjaymalladi/DataSense-Modal-E2B-SFT">DataSense-Modal-E2B-SFT</a></td>
	<td>Live HF Space adapter — stable execution</td>
	</tr>
	<tr>
	<td>EVTE-STaR Micro-1</td>
	<td><a href="https://huggingface.co/sanjaymalladi/DataSense-Modal-E2B-EVTE-Star-Micro1">DataSense-Modal-E2B-EVTE-Star-Micro1</a></td>
	<td>Best mentor-hard (5/5) — research checkpoint</td>
	</tr>
	</tbody>
	</table>
	</section>

	<!-- 13 DEMO -->
	<section id="demo">
	<h2>18 · This Hugging Face demo</h2>
	<p>
	The Gradio app runs <strong>SFT v1</strong> — same agent loop as training eval: load CSV → multi-step
	code generation → sandbox execution → <strong>Answer</strong> + <strong>Summary</strong>.
	Six built-in examples cover sales, employees, and students datasets.
	</p>

	<figure class="figure">
	<img src="https://huggingface.co/spaces/build-small-hackathon/DataSense_E2B/resolve/main/assets/illustrations/03-agent-loop.png" alt="Agent loop used in the HF Space demo" loading="lazy" />
	<figcaption><strong>Same loop as eval.</strong> Upload this <code>hf_demo/</code> folder to a Gradio Space (GPU T4), set <code>HF_TOKEN</code> if needed.</figcaption>
	</figure>

	<h3>Deploy checklist</h3>
	<ol>
	<li>Create Space (Gradio, <strong>gpu-t4</strong>) — see <code>README.md</code> frontmatter</li>
	<li>Upload <code>hf_demo/</code> including <code>assets/illustrations/</code> and <code>story.html</code></li>
	<li>Secret <code>HF_TOKEN</code> if adapter repo is private</li>
	<li>Smoke-test all 6 examples</li>
	</ol>
	<h3>Future work</h3>
	<ul>
	<li>DSBench MCQ letter mapping in scorer</li>
	<li>Per-micro-batch checkpointing during EVTE-STaR</li>
	<li>Optional Space variant with Micro-1 for hard-pool showcase</li>
	</ul>
	</section>

	<footer>
	<p>
	<strong>DataSense E2B</strong> — Execution-verified, Tutor-escalation training for personal data science agents.<br />
	Code: <code>datasense_pipeline.py</code> · <code>datasense_evte.py</code> · <code>datasense_agent.py</code> · <code>hf_demo/</code><br />
	Built for the Gemma / DataBench hackathon, June 2026.
	</p>
	<p style="margin-top:2rem">
	<a href="/">← Back to Gradio demo</a>   ·
	<a href="https://huggingface.co/sanjaymalladi/DataSense-Modal-E2B-SFT">SFT v1 on HF</a>   ·
	<a href="https://huggingface.co/sanjaymalladi/DataSense-Modal-E2B-EVTE-Star-Micro1">Micro-1 on HF</a>
	</p>
	</footer>
	</div>
	</body>
	</html>