---
title: DataSense E2B
emoji: 📊
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 6.5.1
app_file: app.py
pinned: false
license: apache-2.0
startup_duration_timeout: 30m
short_description: Execution-grounded data agent — Gemma-4 2B + SFT v1
models:
  - sanjaymalladi/DataSense-Modal-E2B-SFT
  - unsloth/gemma-4-E2B-it
hardware: gpu-t4
tags:
  - track:wood
  - sponsor:modal
  - achievement:offgrid
  - achievement:welltuned
  - achievement:offbrand
  - achievement:sharing
  - achievement:fieldnotes
---

# DataSense E2B

A **personal data-science agent** built for the Gemma / DataBench hackathon — not a chatbot that *pretends* to run code, but a model that **writes Python, executes it, reads real errors, and verifies answers** before claiming a result.

This Space runs **SFT v1** on [`unsloth/gemma-4-E2B-it`](https://huggingface.co/unsloth/gemma-4-E2B-it) with LoRA adapter [`sanjaymalladi/DataSense-Modal-E2B-SFT`](https://huggingface.co/sanjaymalladi/DataSense-Modal-E2B-SFT). Pick a bundled CSV example or ask your own question — the agent inspects schema, runs code in a sandbox, debugs from tracebacks, and returns **Answer** + **Summary** tags.
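As a rough illustration of how the final tags might be pulled out of a model response, here is a minimal sketch. It assumes the tags appear at the start of a line as `**Answer:**` and `**Summary:**`; the exact `**Summary:**` spelling and the `parse_tags` helper are assumptions for illustration, not the Space's actual parser.

```python
import re
from typing import Dict, Optional

# Assumed tag format: "**Answer:** ..." / "**Summary:** ..." at the start of a line.
TAG_RE = re.compile(r"^\*\*(Answer|Summary):\*\*[ \t]*(.*)$", re.MULTILINE)


def parse_tags(text: str) -> Dict[str, Optional[str]]:
    """Return the Answer and Summary values, or None for a tag that is absent."""
    found: Dict[str, Optional[str]] = {"Answer": None, "Summary": None}
    for name, value in TAG_RE.findall(text):
        found[name] = value.strip()  # if a tag repeats, the last occurrence wins
    return found
```

A response with no tags yields `None` for both keys, which makes "the model never committed to an answer" easy to tell apart from a wrong answer.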

**Fine-tuned on Modal.**

📖 **[Read the full project story →](https://datasense-e2b.netlify.app/)** · **🎬 [Watch the demo →](https://youtu.be/ucFoCdMK7sE)** · **[LinkedIn post →](https://www.linkedin.com/posts/sanjaymalladi_buildsmall-huggingface-modal-share-7471993638814654464-47hY/)**

---

## The problem we set out to solve

Small instruction models can look competent on data questions by printing plausible `**Answer:**` tags **without executing anything**. Our first eval even reported **0% accuracy for every model**. Training had not failed; we were scoring answers computed on synthetic sandbox data against real DataBench ground truth.

The fix was changing **what we optimize and measure**: execution-grounded rollouts, typed verifiers, and eval on **mounted real files** — not hallucinated `<result>` blocks.
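To make "typed verifiers" concrete, the sketch below compares a predicted answer with ground truth according to a declared answer type instead of raw string equality. It is an illustration only: the `verify` function, the type labels (`boolean`, `number`, `list`, category as the default), and the tolerance are assumptions, not the project's actual verifier.

```python
import math


def verify(predicted: str, truth: str, answer_type: str, rel_tol: float = 1e-6) -> bool:
    """Hypothetical typed verifier: compare by declared type, not raw strings."""
    pred, gold = predicted.strip(), truth.strip()

    if answer_type == "boolean":
        truthy, falsy = {"true", "yes", "1"}, {"false", "no", "0"}
        p, g = pred.lower(), gold.lower()
        if p not in truthy | falsy or g not in truthy | falsy:
            return False  # unparseable booleans never count as a match
        return (p in truthy) == (g in truthy)

    if answer_type == "number":
        try:
            p = float(pred.replace(",", "").replace("$", ""))
            g = float(gold.replace(",", "").replace("$", ""))
        except ValueError:
            return False
        return math.isclose(p, g, rel_tol=rel_tol)

    if answer_type == "list":
        # Order-insensitive, case-insensitive comparison of comma-separated items.
        def items(s: str):
            return sorted(x.strip().lower() for x in s.strip("[]").split(","))
        return items(pred) == items(gold)

    # Default: case-insensitive category match.
    return pred.lower() == gold.lower()
```

For example, `verify("$1,200", "1200.0", "number")` passes, while a string comparison of the same pair would fail.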

## Agent loop

![Agent loop: THINK → EXPLORE → EXECUTE → DEBUG → ANSWER](assets/illustrations/03-agent-loop.png)

The same loop runs in training, eval, and this demo: **THINK** → **EXPLORE** → **EXECUTE** → **DEBUG** → **ANSWER**.

In this Hugging Face Space, full **execution traces are visible on screen** — open the **Execution trace** tab to watch each step: generated code, sandbox stdout, and tracebacks as the agent works.
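The execute-and-debug part of the loop can be sketched as follows. This is a simplified illustration under stated assumptions: `write_code` is a hypothetical stand-in for the model, `run_code` uses a plain subprocess (which is not a real sandbox), and the THINK and EXPLORE steps are omitted.

```python
import subprocess
import sys
from typing import Callable, List, Optional, Tuple


def run_code(code: str, timeout_s: int = 30) -> Tuple[bool, str]:
    """Run code in a separate Python process; return (ok, stdout or stderr)."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True, text=True, timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False, f"Timed out after {timeout_s}s"
    if proc.returncode != 0:
        return False, proc.stderr  # the real traceback
    return True, proc.stdout


def agent_loop(
    question: str,
    write_code: Callable[[str, str], str],  # hypothetical model call: (question, feedback) -> code
    max_attempts: int = 3,
) -> Tuple[Optional[str], List[dict]]:
    """EXECUTE -> DEBUG -> ANSWER, keeping a trace of every attempt."""
    trace: List[dict] = []
    feedback = ""
    for attempt in range(1, max_attempts + 1):
        code = write_code(question, feedback)
        ok, output = run_code(code)
        trace.append({"attempt": attempt, "code": code, "ok": ok, "output": output})
        if ok:
            return output.strip(), trace
        feedback = output  # feed the traceback into the next attempt
    return None, trace


def flaky_writer(question: str, feedback: str) -> str:
    """Toy stand-in for the model: fails once, then fixes itself from the traceback."""
    return "print(6 * 7)" if "ZeroDivisionError" in feedback else "print(1 / 0)"
```

Calling `agent_loop("demo", flaky_writer)` returns the answer `"42"` with a two-entry trace: the first attempt records the `ZeroDivisionError` traceback, the second succeeds. An answer is only returned after code has actually run, which is the property the trace tab lets you check.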

## Training story (Modal)

Three Kaggle notebooks (SFT → GRPO → DPO) became one Modal app (`datasense_pipeline.py`) with volume checkpoints and automatic Hub pushes.

| Stage | Status | What we learned |
|-------|--------|-----------------|
| **SFT v1** | ✅ Shipped | Real execution behavior (~100% exec on many evals); foundation everything else builds on |
| **GRPO** | ⏸ Deferred | ~11 min/step × execution-bound rollouts — too slow for hackathon window |
| **DPO** | ⏸ Deferred | Prompt drift risk; EVTE-STaR took priority for hard questions |
| **EVTE** | ✅ Novel | When the 2B student fails, a 31B mentor must **verify its own code** before giving a diagnostic hint |
| **EVTE-STaR** | ✅ Research peak | Online micro-SFT every 15 verified mentor-assisted wins → Micro-1 checkpoint |

**EVTE** = Execution-Verified Tutor Escalation. **EVTE-STaR** = Self-Taught Reasoner with online weight updates instead of one offline train at the end.
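The two ideas above can be sketched in a few lines: a gate that releases the mentor's hint only if the mentor's own code runs and verifies, and a buffer that hands back a batch for a micro-SFT update every 15 verified wins. All names here (`MentorProposal`, `gated_hint`, `MicroSFTBuffer`) are hypothetical, and the `execute` and `is_correct` callables are placeholders for the real sandbox and verifier.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class MentorProposal:
    code: str  # the mentor's own solution code
    hint: str  # diagnostic hint intended for the student


def gated_hint(
    proposal: MentorProposal,
    execute: Callable[[str], Optional[str]],  # returns output, or None on failure
    is_correct: Callable[[str], bool],
) -> Optional[str]:
    """Release the hint only if the mentor's own code runs and verifies."""
    output = execute(proposal.code)
    if output is None or not is_correct(output):
        return None  # an unverified mentor gives no hint
    return proposal.hint


@dataclass
class MicroSFTBuffer:
    """Collect verified mentor-assisted wins; hand back a batch every `batch_size`."""
    batch_size: int = 15
    pending: List[dict] = field(default_factory=list)

    def add_win(self, trajectory: dict) -> Optional[List[dict]]:
        self.pending.append(trajectory)
        if len(self.pending) < self.batch_size:
            return None
        batch, self.pending = self.pending, []
        return batch  # the caller would run one micro-SFT update on this batch
```

The design point is that the gate sits before the student ever sees the hint, so a confident but wrong mentor cannot leak into the training data.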

## Hackathon eval (30 problems × 3 models, T4)

Macro average = unweighted mean across DataBench (15), DSBench Excel (10), and mentor-hard (5).

| Model | DataBench | DSBench | Mentor-hard | Macro | Total |
|-------|-----------|---------|-------------|-------|-------|
| Base | 60.0% | 0.0% | 20.0% | 26.7% | 10/30 |
| **SFT v1** ★ | **86.7%** | 0.0% | 60.0% | 48.9% | 16/30 |
| EVTE Micro-1 | 80.0% | 0.0%* | **100.0%** | **60.0%** | 17/30 |

\*DSBench official scorer = 0% for all models (letter vs dollar mismatch). Micro-1 Q15 computed the correct dollar value → value-aware macro would be **63.3%**.
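The macro figures can be reproduced from per-slice correct counts. The counts below are not stated in the table; they are derived from its percentages and slice sizes (for example, 86.7% of 15 is 13) and agree with the Total column.

```python
# Slice sizes: DataBench, DSBench Excel, mentor-hard.
SLICE_SIZES = (15, 10, 5)

# Correct counts per slice, derived from the table's percentages.
CORRECT = {
    "Base": (9, 0, 1),
    "SFT v1": (13, 0, 3),
    "EVTE Micro-1": (12, 0, 5),
}


def macro(correct, sizes=SLICE_SIZES) -> float:
    """Unweighted mean of per-slice accuracy, in percent."""
    return 100 * sum(c / n for c, n in zip(correct, sizes)) / len(sizes)


for name, correct in CORRECT.items():
    print(f"{name}: macro {macro(correct):.1f}%, total {sum(correct)}/{sum(SLICE_SIZES)}")
# Base: macro 26.7%, total 10/30
# SFT v1: macro 48.9%, total 16/30
# EVTE Micro-1: macro 60.0%, total 17/30

# Value-aware variant: credit Micro-1's correct dollar value on DSBench Q15.
print(f"{macro((12, 1, 5)):.1f}%")  # 63.3%
```

Because the mean is unweighted, the five mentor-hard problems carry as much weight as the fifteen DataBench problems, which is why Micro-1 leads on macro while SFT v1 leads on DataBench.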

Always pair accuracy with **exec_ok**: base can match easy booleans via answer tags while running **0%** of its code.
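One way to report the two numbers together is sketched below. The per-episode fields `correct` and `exec_ok` and the `summarize` helper are assumptions for illustration, not the harness's actual schema.

```python
from typing import Dict, List


def summarize(episodes: List[Dict[str, bool]]) -> Dict[str, float]:
    """Report accuracy next to exec_ok, plus accuracy restricted to executed runs."""
    n = len(episodes)
    if n == 0:
        raise ValueError("no episodes to summarize")
    correct = sum(e["correct"] for e in episodes)
    executed = sum(e["exec_ok"] for e in episodes)
    grounded = sum(e["correct"] and e["exec_ok"] for e in episodes)
    return {
        "accuracy": correct / n,
        "exec_ok": executed / n,
        # Correct answers that were backed by code that actually ran.
        "grounded_accuracy": grounded / n,
    }
```

On a made-up set of four episodes where two answers are right but no code ran, this returns an accuracy of 0.5 with `exec_ok` and `grounded_accuracy` both at 0.0, which exposes guessed answer tags immediately.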

## Why SFT v1 for this demo (not Micro-1)

Micro-1 wins macro average on paper (driven by 5/5 mentor-hard). We still ship **SFT v1** here:

- **Best DataBench breadth:** 86.7% vs 80% (largest held-out slice)
- **Stable inference:** one bulk SFT run vs online micro-batch 1 (replay 100% vs saved checkpoint ~60%)
- **Lower live-demo risk:** less debug rambling and fewer dtype dumps
- **Held up under eval reruns:** Micro-1's mentor-hard score dropped when Modal stragglers overwrote the volume

**Slides show all three models.** Micro-1 is the EVTE-STaR research peak; SFT v1 is the production-shaped baseline.

## What worked / what didn't

**Worked:** SFT v1 execution behavior · EVTE episode quality filter (92 curated mentor-assisted trajectories) · honest eval harness on real files

**Didn't:** Full GRPO in hackathon time · SFT v2 (recovery-only fine-tune taught debug prose, not answers) · EVTE-STaR batch 6 overtraining (40% mentor-hard vs Micro-1's 100%)

## Models on Hugging Face

| Checkpoint | Repo | Role |
|------------|------|------|
| Base | [`unsloth/gemma-4-E2B-it`](https://huggingface.co/unsloth/gemma-4-E2B-it) | Frozen foundation |
| **SFT v1** ★ | [`DataSense-Modal-E2B-SFT`](https://huggingface.co/sanjaymalladi/DataSense-Modal-E2B-SFT) | **This Space** |
| EVTE Micro-1 | [`DataSense-Modal-E2B-EVTE-Star-Micro1`](https://huggingface.co/sanjaymalladi/DataSense-Modal-E2B-EVTE-Star-Micro1) | Best mentor-hard — research |

## Try the demo

Six one-click examples on **sales**, **employees**, and **students** CSVs — no upload required. Hit **Run DataSense** and follow the live progress bar; traces stream into the results panel so you can verify the model really executed code before reading the final answer.

---

**DataSense E2B**: Execution-Verified Tutor Escalation training for personal data-science agents.  
Built June 2026 · [Full story](https://datasense-e2b.netlify.app/) · [Demo video](https://youtu.be/ucFoCdMK7sE) · [LinkedIn post](https://www.linkedin.com/posts/sanjaymalladi_buildsmall-huggingface-modal-share-7471993638814654464-47hY/).