	---
	title: DataSense E2B
	emoji: 📊
	colorFrom: blue
	colorTo: green
	sdk: gradio
	sdk_version: 6.5.1
	app_file: app.py
	pinned: false
	license: apache-2.0
	startup_duration_timeout: 30m
	short_description: Execution-grounded data agent — Gemma-4 2B + SFT v1
	models:
	- sanjaymalladi/DataSense-Modal-E2B-SFT
	- unsloth/gemma-4-E2B-it
	hardware: gpu-t4
	tags:
	- track:wood
	- sponsor:modal
	- achievement:offgrid
	- achievement:welltuned
	- achievement:offbrand
	- achievement:sharing
	- achievement:fieldnotes
	---

	# DataSense E2B

	A personal data-science agent built for the Gemma / DataBench hackathon — not a chatbot that pretends to run code, but a model that writes Python, executes it, reads real errors, and verifies answers before claiming a result.

	This Space runs SFT v1 on [`unsloth/gemma-4-E2B-it`](https://huggingface.co/unsloth/gemma-4-E2B-it) with LoRA adapter [`sanjaymalladi/DataSense-Modal-E2B-SFT`](https://huggingface.co/sanjaymalladi/DataSense-Modal-E2B-SFT). Pick a bundled CSV example or ask your own question — the agent inspects schema, runs code in a sandbox, debugs from tracebacks, and returns Answer + Summary tags.

	The model was fine-tuned on Modal.

	📖 [Read the full project story →](https://datasense-e2b.netlify.app/) · 🎬 [Watch the demo →](https://youtu.be/ucFoCdMK7sE) · [LinkedIn post →](https://www.linkedin.com/posts/sanjaymalladi_buildsmall-huggingface-modal-share-7471993638814654464-47hY/)

	---

	## The problem we set out to solve

	Small instruction models can look competent on data questions by printing plausible `Answer:` tags without executing anything. Our first eval even reported 0% accuracy for every model, not because training failed, but because we were scoring answers computed on synthetic sandbox data against real DataBench ground truth.

	The fix was changing what we optimize and measure: execution-grounded rollouts, typed verifiers, and eval on mounted real files — not hallucinated `<result>` blocks.
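To make "typed verifiers" concrete, here is a minimal sketch of comparing an answer string to ground truth with rules that depend on the expected type. The function name `verify_answer`, the type names, the numeric tolerance, and the normalization rules are all assumptions for illustration; the project's actual verifiers may differ.

```python
import math


def verify_answer(predicted: str, expected, answer_type: str, rel_tol: float = 1e-2) -> bool:
    """Compare a model's answer string to ground truth using type-specific rules.

    Hypothetical helper: type names and tolerance are illustrative assumptions.
    """
    text = predicted.strip()
    if answer_type == "boolean":
        truthy, falsy = {"true", "yes"}, {"false", "no"}
        low = text.lower()
        if low not in truthy | falsy:
            return False  # anything that is not a clear boolean is rejected
        return (low in truthy) == bool(expected)
    if answer_type == "number":
        try:
            value = float(text.replace(",", "").lstrip("$"))
        except ValueError:
            return False
        return math.isclose(value, float(expected), rel_tol=rel_tol)
    if answer_type == "category":
        return text.lower() == str(expected).strip().lower()
    if answer_type == "list":
        # Order-insensitive comparison of comma-separated items.
        items = [p.strip().strip("'\"").lower() for p in text.strip("[]").split(",") if p.strip()]
        wanted = [str(item).strip().lower() for item in expected]
        return sorted(items) == sorted(wanted)
    raise ValueError(f"unknown answer type: {answer_type}")
```

Rejecting unparseable output outright, rather than falling back to a fuzzy string match, is what keeps a plausible-looking but wrong tag from scoring as correct.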

	## Agent loop

	![Agent loop: THINK → EXPLORE → EXECUTE → DEBUG → ANSWER](assets/illustrations/03-agent-loop.png)

	Same loop in training, eval, and this demo: THINK → EXPLORE → EXECUTE → DEBUG → ANSWER.

	In this Hugging Face Space, full execution traces are visible on screen — open the Execution trace tab to watch each step: generated code, sandbox stdout, and tracebacks as the agent works.
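The loop can be sketched in a few lines of Python. This is a simplified illustration, not the Space's `app.py`: `run_code`, `agent_loop`, and `fake_generate` are hypothetical names, plain `exec()` stands in for a real sandbox, and schema exploration is left out for brevity.

```python
import contextlib
import io
import traceback
from typing import Callable


def run_code(code: str) -> tuple[bool, str]:
    """Execute code and return (ok, stdout or traceback).

    exec() is NOT a real sandbox; it only keeps this sketch short.
    """
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, {"__name__": "__sandbox__"})
    except Exception:
        return False, traceback.format_exc()
    return True, buffer.getvalue()


def agent_loop(question: str, generate: Callable[[str], str], max_attempts: int = 3) -> dict:
    """Generate code, execute it, and feed tracebacks back until a run succeeds."""
    prompt = question
    trace = []
    for attempt in range(1, max_attempts + 1):
        code = generate(prompt)        # THINK: the model proposes code
        ok, output = run_code(code)    # EXECUTE: run it for real
        trace.append({"attempt": attempt, "code": code, "ok": ok, "output": output})
        if ok:
            # ANSWER: only reached after a successful execution
            return {"answer": output.strip(), "exec_ok": True, "trace": trace}
        # DEBUG: show the model its own traceback on the next attempt
        prompt = f"{question}\n\nPrevious code:\n{code}\n\nTraceback:\n{output}"
    return {"answer": None, "exec_ok": False, "trace": trace}


def fake_generate(prompt: str) -> str:
    """Stand-in for the model: buggy at first, fixed once it sees a traceback."""
    if "Traceback" in prompt:
        return "print(sum([1, 2, 3]))"
    return "print(total)"  # raises NameError on the first attempt


result = agent_loop("What is the sum of 1, 2 and 3?", fake_generate)
```

The `trace` list plays the role of the Execution trace tab: each entry records the generated code, whether it ran, and the stdout or traceback it produced.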

	## Training story (Modal)

	Three Kaggle notebooks (SFT → GRPO → DPO) became one Modal app (`datasense_pipeline.py`) with volume checkpoints and automatic Hub pushes.

	| Stage | Status | What we learned |
	|-------|--------|-----------------|
	| SFT v1 | ✅ Shipped | Real execution behavior (~100% exec on many evals); foundation everything else builds on |
	| GRPO | ⏸ Deferred | ~11 min/step × execution-bound rollouts; too slow for the hackathon window |
	| DPO | ⏸ Deferred | Prompt drift risk; EVTE-STaR took priority for hard questions |
	| EVTE | ✅ Novel | When the 2B student fails, a 31B mentor must verify its own code before giving a diagnostic hint |
	| EVTE-STaR | ✅ Research peak | Online micro-SFT every 15 verified mentor-assisted wins → Micro-1 checkpoint |

	EVTE = Execution-Verified Tutor Escalation. EVTE-STaR = Self-Taught Reasoner with online weight updates instead of a single offline training run at the end.
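The two ideas can be sketched as a gate plus a buffer. This is a rough illustration under stated assumptions, not the code in `datasense_pipeline.py`: `escalate`, `WinBuffer`, and `mentor_solve` are hypothetical names, and checking the mentor's answer against a known expected value is an assumed form of verification. The batch size of 15 comes from the EVTE-STaR description.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

MICRO_BATCH_SIZE = 15  # micro-SFT every 15 verified mentor-assisted wins


def escalate(
    question: str,
    expected: str,
    mentor_solve: Callable[[str], tuple[str, str]],
) -> Optional[str]:
    """Return the mentor's hint only if the mentor's own answer verifies.

    mentor_solve is a hypothetical callable returning (answer, hint); the
    answer is assumed to come from code the mentor actually executed.
    """
    answer, hint = mentor_solve(question)
    if answer.strip() != expected.strip():
        return None  # unverified mentor: withhold the hint
    return hint


@dataclass
class WinBuffer:
    """Collects verified wins and triggers a micro-SFT step when full."""

    train_step: Callable[[list], None]  # stub for the real weight update
    pending: list = field(default_factory=list)
    updates: int = 0

    def add(self, trajectory: dict) -> None:
        self.pending.append(trajectory)
        if len(self.pending) >= MICRO_BATCH_SIZE:
            self.train_step(list(self.pending))  # online update on this batch
            self.pending.clear()
            self.updates += 1
```

Gating the hint on the mentor's own verified result keeps a confidently wrong mentor from teaching the student its mistakes, and the buffer updates weights as wins accumulate rather than once at the end.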

	## Hackathon eval (30 problems × 3 models, T4)

	Macro average = unweighted mean across DataBench (15), DSBench Excel (10), and mentor-hard (5).

	| Model | DataBench | DSBench | Mentor-hard | Macro | Total |
	|-------|-----------|---------|-------------|-------|-------|
	| Base | 60.0% | 0.0% | 20.0% | 26.7% | 10/30 |
	| SFT v1 ★ | 86.7% | 0.0% | 60.0% | 48.9% | 16/30 |
	| EVTE Micro-1 | 80.0% | 0.0%\* | 100.0% | 60.0% | 17/30 |

	\* The official DSBench scorer gives 0% for all models (letter vs dollar mismatch). Micro-1 computed the correct dollar value on Q15, so a value-aware macro average would be 63.3%.
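The macro column and the 63.3% figure can be reproduced with a few lines of arithmetic. The per-suite correct counts below are derived from the reported percentages and suite sizes (for example, 86.7% of 15 is 13); the function and variable names are illustrative.

```python
# Suite sizes from the eval description: 15 + 10 + 5 = 30 problems.
SUITE_SIZES = {"DataBench": 15, "DSBench": 10, "Mentor-hard": 5}

# Correct counts derived from the reported percentages.
CORRECT = {
    "Base": {"DataBench": 9, "DSBench": 0, "Mentor-hard": 1},
    "SFT v1": {"DataBench": 13, "DSBench": 0, "Mentor-hard": 3},
    "EVTE Micro-1": {"DataBench": 12, "DSBench": 0, "Mentor-hard": 5},
}


def macro_average(correct: dict) -> float:
    """Unweighted mean of per-suite accuracies, in percent."""
    rates = [100 * correct[suite] / size for suite, size in SUITE_SIZES.items()]
    return sum(rates) / len(rates)


for model, counts in CORRECT.items():
    total = sum(counts.values())
    print(f"{model}: macro {macro_average(counts):.1f}%, total {total}/30")

# Value-aware variant: credit Micro-1 for the one correct DSBench dollar value.
value_aware = dict(CORRECT["EVTE Micro-1"], DSBench=1)
print(f"EVTE Micro-1 (value-aware): macro {macro_average(value_aware):.1f}%")
```

Because the mean is unweighted, the five mentor-hard problems count as much as the fifteen DataBench problems, which is why Micro-1 leads on macro average while being only one problem ahead in the total.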

	Always pair accuracy with exec_ok: base can match easy booleans via answer tags while running 0% of its code.
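A small sketch of what pairing the two metrics looks like in practice. The `exec_ok` name comes from the text above; the record layout, the `correct` field, and the `grounded_accuracy` metric are assumptions, and the sample records are toy data rather than eval results.

```python
def summarize(results: list[dict]) -> dict:
    """Report accuracy next to execution rate so ungrounded matches stand out."""
    n = len(results)
    return {
        "accuracy": sum(r["correct"] for r in results) / n,
        "exec_ok": sum(r["exec_ok"] for r in results) / n,
        # Correct AND backed by code that actually ran.
        "grounded_accuracy": sum(r["correct"] and r["exec_ok"] for r in results) / n,
    }


# Toy records: a model that guesses some booleans right without running code.
toy_results = [
    {"correct": True, "exec_ok": False},
    {"correct": True, "exec_ok": False},
    {"correct": False, "exec_ok": False},
    {"correct": False, "exec_ok": False},
]
summary = summarize(toy_results)
```

In this toy case accuracy is 0.5 while `exec_ok` and `grounded_accuracy` are both 0.0, which is the pattern the warning above describes.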

	## Why SFT v1 for this demo (not Micro-1)

	Micro-1 wins the macro average on paper (driven by 5/5 on mentor-hard). We still ship SFT v1 here:

	- Best DataBench breadth — 86.7% vs 80% (largest held-out slice)
	- Stable inference — single bulk SFT vs online micro-batch 1 (replay 100% vs saved ckpt ~60%)
	- Lower live-demo risk — fewer debug ramble / dtype dumps
	- Held up under eval reruns: Micro-1's mentor-hard score dropped when Modal stragglers overwrote the volume

	Slides show all three models. Micro-1 is the EVTE-STaR research peak; SFT v1 is the production-shaped baseline.

	## What worked / what didn't

	Worked: SFT v1 execution behavior · EVTE episode quality filter (92 curated mentor-assisted trajectories) · honest eval harness on real files

	Didn't: Full GRPO in hackathon time · SFT v2 (recovery-only fine-tune taught debug prose, not answers) · EVTE-STaR batch 6 overtraining (40% mentor-hard vs Micro-1's 100%)

	## Models on Hugging Face

	| Checkpoint | Repo | Role |
	|------------|------|------|
	| Base | [`unsloth/gemma-4-E2B-it`](https://huggingface.co/unsloth/gemma-4-E2B-it) | Frozen foundation |
	| SFT v1 ★ | [`DataSense-Modal-E2B-SFT`](https://huggingface.co/sanjaymalladi/DataSense-Modal-E2B-SFT) | This Space |
	| EVTE Micro-1 | [`DataSense-Modal-E2B-EVTE-Star-Micro1`](https://huggingface.co/sanjaymalladi/DataSense-Modal-E2B-EVTE-Star-Micro1) | Best mentor-hard (research) |

	## Try the demo

	Six one-click examples on sales, employees, and students CSVs — no upload required. Hit Run DataSense and follow the live progress bar; traces stream into the results panel so you can verify the model really executed code before reading the final answer.

	---

	DataSense E2B — Execution-verified, Tutor-escalation training for personal data science agents.
	Built June 2026 · [Full story](https://datasense-e2b.netlify.app/) · [Demo video](https://youtu.be/ucFoCdMK7sE) · [LinkedIn post](https://www.linkedin.com/posts/sanjaymalladi_buildsmall-huggingface-modal-share-7471993638814654464-47hY/).