sanjaymalladi

Upload README.md with huggingface_hub

ae7a3f7 verified 18 days ago

metadata

title: DataSense E2B
emoji: 📊
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 6.5.1
app_file: app.py
pinned: false
license: apache-2.0
startup_duration_timeout: 30m
short_description: Execution-grounded data agent — Gemma-4 2B + SFT v1
models:
  - sanjaymalladi/DataSense-Modal-E2B-SFT
  - unsloth/gemma-4-E2B-it
hardware: gpu-t4
tags:
  - track:wood
  - sponsor:modal
  - achievement:offgrid
  - achievement:welltuned
  - achievement:offbrand
  - achievement:sharing
  - achievement:fieldnotes

DataSense E2B

A personal data-science agent built for the Gemma / DataBench hackathon — not a chatbot that pretends to run code, but a model that writes Python, executes it, reads real errors, and verifies answers before claiming a result.

This Space runs SFT v1 on unsloth/gemma-4-E2B-it with LoRA adapter sanjaymalladi/DataSense-Modal-E2B-SFT. Pick a bundled CSV example or ask your own question — the agent inspects schema, runs code in a sandbox, debugs from tracebacks, and returns Answer + Summary tags.

Fine-tuning was run on Modal.

📖 Read the full project story → · 🎬 Watch the demo → · LinkedIn post →

The problem we set out to solve

Small instruction models can look competent on data questions by printing plausible **Answer:** tags without executing anything. Our first eval even reported 0% accuracy for every model, not because training had failed, but because we were scoring runs on synthetic sandbox data against real DataBench ground truth.

The fix was changing what we optimize and measure: execution-grounded rollouts, typed verifiers, and eval on mounted real files — not hallucinated <result> blocks.
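As an illustration of what a typed verifier can look like, here is a minimal sketch that compares an answer string to ground truth according to a declared answer type. The function name `verify`, the type labels (`boolean`, `number`, `category`, `list`), and the 1% numeric tolerance are assumptions made for this example, not the project's actual scorer.

```python
import math


def verify(predicted: str, expected, answer_type: str, rel_tol: float = 1e-2) -> bool:
    """Compare a model answer string to ground truth using type-specific rules."""
    text = predicted.strip()
    if answer_type == "boolean":
        low = text.lower()
        if low in {"true", "yes"}:
            return expected is True
        if low in {"false", "no"}:
            return expected is False
        return False  # anything else is not a valid boolean answer
    if answer_type == "number":
        try:
            value = float(text.replace(",", "").lstrip("$"))
        except ValueError:
            return False  # prose such as "about 12" is rejected, not guessed at
        return math.isclose(value, float(expected), rel_tol=rel_tol)
    if answer_type == "category":
        return text.lower() == str(expected).strip().lower()
    if answer_type == "list":
        # Order-insensitive comparison of comma-separated items.
        items = sorted(p.strip().lower() for p in text.strip("[]").split(",") if p.strip())
        return items == sorted(str(e).strip().lower() for e in expected)
    raise ValueError(f"unknown answer type: {answer_type}")


print(verify("$1,234.5", 1234.5, "number"))
print(verify("about 12", 12, "number"))
```

Checking by type keeps a fluent but unparseable answer from being counted as correct, which is the failure mode described above.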

Agent loop

Same loop in training, eval, and this demo: THINK → EXECUTE → DEBUG → ANSWER.

In this Hugging Face Space, full execution traces are visible on screen — open the Execution trace tab to watch each step: generated code, sandbox stdout, and tracebacks as the agent works.
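The loop can be sketched roughly as follows. This is a simplified illustration, not the Space's `app.py`: `generate` is a hypothetical stand-in for the model call, `scripted_model` is a stub that fails once and then fixes itself, and a fresh Python subprocess stands in for the sandbox (it provides no real isolation).

```python
import subprocess
import sys
from typing import Callable, List, Tuple


def run_code(code: str, timeout: float = 10.0) -> Tuple[bool, str]:
    """Run code in a fresh Python process; return (ok, stdout or traceback)."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True, text=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False, "TimeoutExpired"
    if proc.returncode == 0:
        return True, proc.stdout.strip()
    return False, proc.stderr.strip()


def agent_loop(generate: Callable[[List[str]], str], question: str, max_steps: int = 3):
    """THINK -> EXECUTE -> DEBUG -> ANSWER, answering only from output that really ran."""
    history = [f"QUESTION: {question}"]
    trace = []
    for _ in range(max_steps):
        code = generate(history)                 # THINK: the model proposes code
        ok, output = run_code(code)              # EXECUTE: run it for real
        trace.append((code, ok, output))
        if ok:
            return output, trace                 # ANSWER: grounded in real stdout
        history.append(f"TRACEBACK: {output}")   # DEBUG: feed the real error back
    return None, trace


def scripted_model(history: List[str]) -> str:
    # Hypothetical stub: the first attempt has a bug, the retry fixes it.
    if any(h.startswith("TRACEBACK") for h in history):
        return "print(sum([3, 4, 5]))"
    return "print(sum([3, 4, 5]) / undefined_name)"


answer, trace = agent_loop(scripted_model, "What is the total?")
print(answer, len(trace))
```

The `trace` list plays the role of the Execution trace tab: each entry records the generated code, whether it ran, and the stdout or traceback it produced.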

Training story (Modal)

Three Kaggle notebooks (SFT → GRPO → DPO) became one Modal app (datasense_pipeline.py) with volume checkpoints and automatic Hub pushes.

| Stage | Status | What we learned |
| --- | --- | --- |
| SFT v1 | ✅ Shipped | Real execution behavior (~100% exec on many evals); foundation everything else builds on |
| GRPO | ⏸ Deferred | ~11 min/step × execution-bound rollouts; too slow for hackathon window |
| DPO | ⏸ Deferred | Prompt drift risk; EVTE-STaR took priority for hard questions |
| EVTE | ✅ Novel | When the 2B student fails, a 31B mentor must verify its own code before giving a diagnostic hint |
| EVTE-STaR | ✅ Research peak | Online micro-SFT every 15 verified mentor-assisted wins → Micro-1 checkpoint |

EVTE = Execution-Verified Tutor Escalation. EVTE-STaR = Self-Taught Reasoner with online weight updates instead of one offline train at the end.
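The two ideas can be sketched together: the mentor's hint is only released if the mentor's own solution passes verification, and verified mentor-assisted wins accumulate until a micro-SFT step fires. All names here (`StarBuffer`, `escalate`, the callback signatures) are hypothetical and chosen for illustration; only the batch size of 15 comes from the description above.

```python
from typing import Callable, List

MICRO_BATCH = 15  # one micro-SFT update per 15 verified mentor-assisted wins


class StarBuffer:
    """Collect verified wins and trigger a training step every `batch_size` wins."""

    def __init__(self, train_step: Callable[[List[dict]], None], batch_size: int = MICRO_BATCH):
        self.train_step = train_step
        self.batch_size = batch_size
        self.pending: List[dict] = []
        self.updates = 0

    def add_win(self, trajectory: dict) -> None:
        self.pending.append(trajectory)
        if len(self.pending) >= self.batch_size:
            self.train_step(self.pending)  # online micro-SFT on the fresh batch
            self.pending = []
            self.updates += 1


def escalate(question, student_solve, mentor_solve, mentor_hint, check, buffer) -> bool:
    """Return True if the question ends up solved, alone or with a verified hint."""
    if check(student_solve(question, hint=None)):
        return True   # student solved it alone, so no escalation is needed
    if not check(mentor_solve(question)):
        return False  # mentor's own code failed verification, so no hint is given
    hint = mentor_hint(question)
    answer = student_solve(question, hint=hint)
    if check(answer):
        buffer.add_win({"question": question, "hint": hint, "answer": answer})
        return True
    return False


batches = []
buffer = StarBuffer(train_step=lambda batch: batches.append(list(batch)))


def student(question, hint=None):
    # Hypothetical stub: the student only succeeds once it has a hint.
    return 42 if hint else None


for i in range(15):
    escalate(f"q{i}", student, lambda q: 42, lambda q: "check the dtype",
             lambda a: a == 42, buffer)
print(buffer.updates, len(batches[0]))
```

Gating the hint on the mentor's verified code means a wrong mentor cannot leak a plausible but incorrect diagnosis into the training buffer.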

Hackathon eval (30 problems × 3 models, T4)

Macro average = unweighted mean across DataBench (15), DSBench Excel (10), and mentor-hard (5).

| Model | DataBench | DSBench | Mentor-hard | Macro | Total |
| --- | --- | --- | --- | --- | --- |
| Base | 60.0% | 0.0% | 20.0% | 26.7% | 10/30 |
| SFT v1 ★ | 86.7% | 0.0% | 60.0% | 48.9% | 16/30 |
| EVTE Micro-1 | 80.0% | 0.0%* | 100.0% | 60.0% | 17/30 |

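*The official DSBench scorer gives 0% for all models because of a letter-vs-dollar answer mismatch. Micro-1 did compute the correct dollar value on Q15; counting it as correct makes DSBench 1/10 = 10%, so the value-aware macro would be (80.0 + 10.0 + 100.0) / 3 = 63.3%.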
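The macro figures can be reproduced from per-suite counts. In the sketch below the correct counts are back-derived from the reported percentages and suite sizes (for example 86.7% of 15 is 13); the function and variable names are illustrative.

```python
def macro_average(correct_by_suite: dict, size_by_suite: dict) -> float:
    """Unweighted mean of per-suite accuracy, in percent."""
    rates = [correct_by_suite[name] / size_by_suite[name] for name in size_by_suite]
    return 100.0 * sum(rates) / len(rates)


sizes = {"DataBench": 15, "DSBench": 10, "Mentor-hard": 5}

# Correct counts implied by the reported percentages.
base = {"DataBench": 9, "DSBench": 0, "Mentor-hard": 1}
sft_v1 = {"DataBench": 13, "DSBench": 0, "Mentor-hard": 3}
micro1 = {"DataBench": 12, "DSBench": 0, "Mentor-hard": 5}
micro1_value_aware = {**micro1, "DSBench": 1}  # counts Q15 as correct

for name, counts in [("Base", base), ("SFT v1", sft_v1), ("Micro-1", micro1),
                     ("Micro-1 (value-aware)", micro1_value_aware)]:
    print(f"{name}: macro {macro_average(counts, sizes):.1f}%, total {sum(counts.values())}/30")
```

Because each suite carries equal weight regardless of size, the 5-question mentor-hard slice moves the macro average as much as the 15-question DataBench slice, which is why Micro-1 leads on macro while sitting only one problem ahead on the total.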

Always pair accuracy with exec_ok: base can match easy booleans via answer tags while running 0% of its code.
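A minimal sketch of reporting the two numbers side by side follows. The record layout and the extra `grounded_accuracy` metric (correct and executed) are assumptions for illustration; only the `exec_ok` name comes from the text above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EvalRecord:
    correct: bool  # final answer matched ground truth
    exec_ok: bool  # the model's code actually ran without error (assumed meaning)


def summarize(records: List[EvalRecord]) -> dict:
    """Report accuracy next to execution rate so tag-only guesses are visible."""
    n = len(records)
    return {
        "accuracy": sum(r.correct for r in records) / n,
        "exec_ok": sum(r.exec_ok for r in records) / n,
        # Hypothetical stricter metric: correct AND backed by executed code.
        "grounded_accuracy": sum(r.correct and r.exec_ok for r in records) / n,
    }


# A model that guesses two easy booleans correctly without running any code.
guesser = [EvalRecord(True, False), EvalRecord(True, False),
           EvalRecord(False, False), EvalRecord(False, False)]
print(summarize(guesser))
```

Here accuracy alone reads 50%, while the execution rate and the grounded accuracy are both zero, which makes the guessing visible.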

Why SFT v1 for this demo (not Micro-1)

Micro-1 wins macro average on paper (driven by 5/5 mentor-hard). We still ship SFT v1 here:

- Best DataBench breadth: 86.7% vs 80% (largest held-out slice)
- Stable inference: a single bulk SFT run vs online micro-batch 1 (100% on replay vs ~60% from the saved checkpoint)
- Lower live-demo risk: less debug rambling and fewer dtype dumps
- Held up under eval reruns: Micro-1 mentor-hard dropped when Modal stragglers overwrote the volume

Slides show all three models. Micro-1 is the EVTE-STaR research peak; SFT v1 is the production-shaped baseline.

What worked / what didn't

Worked: SFT v1 execution behavior · EVTE episode quality filter (92 curated mentor-assisted trajectories) · honest eval harness on real files

Didn't: Full GRPO in hackathon time · SFT v2 (recovery-only fine-tune taught debug prose, not answers) · EVTE-STaR batch 6 overtraining (40% mentor-hard vs Micro-1's 100%)

Models on Hugging Face

| Checkpoint | Repo | Role |
| --- | --- | --- |
| Base | `unsloth/gemma-4-E2B-it` | Frozen foundation |
| SFT v1 ★ | `DataSense-Modal-E2B-SFT` | This Space |
| EVTE Micro-1 | `DataSense-Modal-E2B-EVTE-Star-Micro1` | Best mentor-hard (research) |

Try the demo

Six one-click examples on sales, employees, and students CSVs — no upload required. Hit Run DataSense and follow the live progress bar; traces stream into the results panel so you can verify the model really executed code before reading the final answer.

DataSense E2B — Execution-verified, Tutor-escalation training for personal data science agents.
Built June 2026 · Full story · Demo video · LinkedIn post.