---
title: DataSense E2B
emoji: 📊
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 6.5.1
app_file: app.py
pinned: false
license: apache-2.0
startup_duration_timeout: 30m
short_description: Execution-grounded data agent, Gemma-4 2B + SFT v1
models:
- sanjaymalladi/DataSense-Modal-E2B-SFT
- unsloth/gemma-4-E2B-it
hardware: gpu-t4
tags:
- track:wood
- sponsor:modal
- achievement:offgrid
- achievement:welltuned
- achievement:offbrand
- achievement:sharing
- achievement:fieldnotes
---
# DataSense E2B
A **personal data-science agent** built for the Gemma / DataBench hackathon. It is not a chatbot that *pretends* to run code, but a model that **writes Python, executes it, reads real errors, and verifies answers** before claiming a result.

This Space runs **SFT v1** on [`unsloth/gemma-4-E2B-it`](https://huggingface.co/unsloth/gemma-4-E2B-it) with LoRA adapter [`sanjaymalladi/DataSense-Modal-E2B-SFT`](https://huggingface.co/sanjaymalladi/DataSense-Modal-E2B-SFT). Pick a bundled CSV example or ask your own question: the agent inspects the schema, runs code in a sandbox, debugs from tracebacks, and returns **Answer** + **Summary** tags.

**Model fine-tuning powered by Modal.**

📖 **[Read the full project story →](https://datasense-e2b.netlify.app/)** · **🎬 [Watch the demo →](https://youtu.be/ucFoCdMK7sE)** · **[LinkedIn post →](https://www.linkedin.com/posts/sanjaymalladi_buildsmall-huggingface-modal-share-7471993638814654464-47hY/)**

---
## The problem we set out to solve
Small instruction models can look competent on data questions by printing plausible `**Answer:**` tags **without executing anything**. Our first eval even reported **0% accuracy for every model**, not because training failed, but because we were scoring runs on synthetic sandbox data against real DataBench ground truth.

The fix was changing **what we optimize and measure**: execution-grounded rollouts, typed verifiers, and eval on **mounted real files** instead of hallucinated `<result>` blocks.
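
To make "typed verifiers" concrete, here is a minimal sketch of a verifier that compares a raw model answer with ground truth according to a declared answer type. The function name, the type labels (`boolean`, `number`, `category`, `list`), and the 1% numeric tolerance are assumptions for illustration, not the project's actual harness.

```python
import math


def verify(predicted: str, expected, answer_type: str, rel_tol: float = 1e-2) -> bool:
    """Compare a raw model answer against ground truth using its declared type."""
    text = predicted.strip()
    if answer_type == "boolean":
        truthy, falsy = {"true", "yes"}, {"false", "no"}
        value = text.lower()
        if value not in (truthy | falsy):
            return False  # unparseable answers never count as correct
        return (value in truthy) == bool(expected)
    if answer_type == "number":
        try:
            number = float(text.replace(",", "").lstrip("$"))
        except ValueError:
            return False
        return math.isclose(number, float(expected), rel_tol=rel_tol, abs_tol=1e-9)
    if answer_type == "category":
        return text.lower() == str(expected).strip().lower()
    if answer_type == "list":
        # Order-insensitive comparison of comma-separated items.
        parts = text.strip("[]").split(",")
        items = sorted(p.strip().strip("'\"").lower() for p in parts)
        return items == sorted(str(e).strip().lower() for e in expected)
    raise ValueError(f"unknown answer type: {answer_type}")
```

A type-aware comparison like this is what lets `"$1,234.50"` match a ground-truth value of `1234.5` while still rejecting free-text answers that cannot be parsed.
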
## Agent loop
![Agent loop: THINK → EXPLORE → EXECUTE → DEBUG → ANSWER](assets/illustrations/03-agent-loop.png)
Same loop in training, eval, and this demo: **THINK** → **EXPLORE** → **EXECUTE** → **DEBUG** → **ANSWER**.
In this Hugging Face Space, full **execution traces are visible on screen**: open the **Execution trace** tab to watch each step (generated code, sandbox stdout, and tracebacks) as the agent works.
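
The execute-and-debug part of the loop can be sketched in a few lines. This is an illustration under stated assumptions, not the Space's implementation: `run_snippet`, `agent_loop`, and `fake_generate` are hypothetical names, the model is replaced by a stub, and plain `exec` stands in for the sandbox (it provides no isolation).

```python
import contextlib
import io
import traceback


def run_snippet(code: str) -> tuple[bool, str]:
    """Execute code in a fresh namespace; return (ok, stdout or traceback)."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, {"__name__": "__sandbox__"})  # NOT a real sandbox
    except Exception:
        return False, traceback.format_exc()
    return True, buffer.getvalue()


def agent_loop(generate, question: str, max_attempts: int = 3) -> dict:
    """Ask `generate` for code, run it, and feed tracebacks back until it succeeds."""
    trace, feedback = [], ""
    for attempt in range(1, max_attempts + 1):
        code = generate(question, feedback)  # THINK: the model proposes code
        ok, output = run_snippet(code)       # EXECUTE: real run, real output
        trace.append({"attempt": attempt, "code": code, "ok": ok, "output": output})
        if ok:
            return {"answer": output.strip(), "exec_ok": True, "trace": trace}
        feedback = output                    # DEBUG: next attempt sees the traceback
    return {"answer": None, "exec_ok": False, "trace": trace}


def fake_generate(question: str, feedback: str) -> str:
    """Stand-in for the model: the first attempt has a bug, the second fixes it."""
    if "KeyError" in feedback:
        return "row = {'revenue': 120}\nprint(row['revenue'])"
    return "row = {'revenue': 120}\nprint(row['sales'])"


result = agent_loop(fake_generate, "What is the revenue?")
```

The `trace` list holds the same kind of record the Execution trace tab displays: the generated code, whether it ran, and the stdout or traceback it produced.
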
## Training story (Modal)
Three Kaggle notebooks (SFT → GRPO → DPO) became one Modal app (`datasense_pipeline.py`) with volume checkpoints and automatic Hub pushes.
| Stage | Status | What we learned |
|-------|--------|-----------------|
| **SFT v1** | ✅ Shipped | Real execution behavior (~100% exec on many evals); the foundation everything else builds on |
| **GRPO** | ⏸ Deferred | ~11 min/step × execution-bound rollouts; too slow for the hackathon window |
| **DPO** | ⏸ Deferred | Prompt drift risk; EVTE-STaR took priority for hard questions |
| **EVTE** | ✅ Novel | When the 2B student fails, a 31B mentor must **verify its own code** before giving a diagnostic hint |
| **EVTE-STaR** | ✅ Research peak | Online micro-SFT every 15 verified mentor-assisted wins → Micro-1 checkpoint |
**EVTE** = Execution-Verified Tutor Escalation. **EVTE-STaR** = Self-Taught Reasoner with online weight updates instead of one offline train at the end.
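
The two ideas can be sketched as a gate plus a buffer. Everything named here (`release_hint`, `EvteBuffer`, the `run_and_verify` and `train_step` callbacks) is hypothetical; only the rule that the mentor's code must verify first and the batch size of 15 come from the description above.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


def release_hint(hint: str, mentor_code: str, expected,
                 run_and_verify: Callable[[str, object], bool]) -> Optional[str]:
    """EVTE gate: pass the hint to the student only if the mentor's own code verifies."""
    return hint if run_and_verify(mentor_code, expected) else None


@dataclass
class EvteBuffer:
    """EVTE-STaR: collect verified mentor-assisted wins, fine-tune every `batch_size`."""
    train_step: Callable[[list], None]  # hypothetical micro-SFT callback
    batch_size: int = 15                # "every 15 verified wins"
    pending: list = field(default_factory=list)
    updates: int = 0

    def add_win(self, trajectory: dict) -> bool:
        """Store one win; return True when this win triggered a weight update."""
        self.pending.append(trajectory)
        if len(self.pending) < self.batch_size:
            return False
        self.train_step(self.pending)  # online update instead of one final train
        self.pending = []
        self.updates += 1
        return True
```

Keeping the gate separate from the buffer means an unverified mentor hint never reaches the student, and so can never end up in a training batch.
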
## Hackathon eval (30 problems × 3 models, T4)
Macro average = unweighted mean across DataBench (15), DSBench Excel (10), and mentor-hard (5).
| Model | DataBench | DSBench | Mentor-hard | Macro | Total |
|-------|-----------|---------|-------------|-------|-------|
| Base | 60.0% | 0.0% | 20.0% | 26.7% | 10/30 |
| **SFT v1** ★ | **86.7%** | 0.0% | 60.0% | **48.9%** | 16/30 |
| EVTE Micro-1 | 80.0% | 0.0%* | **100.0%** | 60.0% | 17/30 |
\*The official DSBench scorer gives 0% for all models (letter vs dollar mismatch). Micro-1 computed the correct dollar value on Q15, so a value-aware macro would be **63.3%**.
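
The macro column can be checked by hand. The sketch below recomputes it from per-slice correct counts; the counts are not stated in the table and are inferred here from the percentages and slice sizes (for example, 86.7% of 15 is 13), and they add up to the reported totals.

```python
SLICE_SIZES = {"databench": 15, "dsbench": 10, "mentor_hard": 5}

# Correct counts inferred from the percentages in the table above.
CORRECT = {
    "Base": {"databench": 9, "dsbench": 0, "mentor_hard": 1},
    "SFT v1": {"databench": 13, "dsbench": 0, "mentor_hard": 3},
    "EVTE Micro-1": {"databench": 12, "dsbench": 0, "mentor_hard": 5},
}


def macro_average(correct: dict) -> float:
    """Unweighted mean of per-slice accuracy, in percent."""
    rates = [100 * correct[name] / size for name, size in SLICE_SIZES.items()]
    return sum(rates) / len(rates)


for model, counts in CORRECT.items():
    total = sum(counts.values())
    print(f"{model}: macro {macro_average(counts):.1f}%, total {total}/30")

# Value-aware variant: credit Micro-1 for the one DSBench answer (Q15)
# whose dollar value was correct.
value_aware = dict(CORRECT["EVTE Micro-1"], dsbench=1)
print(f"value-aware macro: {macro_average(value_aware):.1f}%")
```

Because the mean is unweighted, the 5-question mentor-hard slice moves the macro as much as the 15-question DataBench slice, which is why Micro-1 leads on macro while SFT v1 leads on DataBench.
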
Always pair accuracy with **exec_ok**: base can match easy booleans via answer tags while running **0%** of its code.
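
One way to keep the two numbers together is to report them from the same records. This is a minimal sketch: the record fields `correct` and `exec_ok` are an assumed schema, and the toy records are invented for illustration, not taken from the eval.

```python
def summarize(records: list[dict]) -> dict:
    """Report accuracy next to exec_ok, plus the share of answers that were
    scored correct although no code ran successfully."""
    n = len(records)
    if n == 0:
        raise ValueError("no records to summarize")
    correct = sum(r["correct"] for r in records)
    executed = sum(r["exec_ok"] for r in records)
    unverified = sum(r["correct"] and not r["exec_ok"] for r in records)
    return {
        "accuracy": correct / n,
        "exec_ok": executed / n,
        "correct_without_exec": unverified / n,
    }


# Toy records, invented for illustration only (not results from the eval).
toy = [
    {"correct": True, "exec_ok": False},
    {"correct": True, "exec_ok": True},
    {"correct": False, "exec_ok": True},
    {"correct": False, "exec_ok": False},
]
report = summarize(toy)
```

A high `correct_without_exec` share is the signal to distrust an accuracy number: those answers matched the ground truth without any executed code behind them.
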
## Why SFT v1 for this demo (not Micro-1)
Micro-1 wins macro average on paper (driven by 5/5 mentor-hard). We still ship **SFT v1** here:
- **Best DataBench breadth**: 86.7% vs 80% (largest held-out slice)
- **Stable inference**: single bulk SFT vs online micro-batch 1 (replay 100% vs saved checkpoint ~60%)
- **Lower live-demo risk**: fewer debug rambles and dtype dumps
- **Held up under eval reruns**: Micro-1 mentor-hard dropped when Modal stragglers overwrote the volume
**Slides show all three models.** Micro-1 is the EVTE-STaR research peak; SFT v1 is the production-shaped baseline.
## What worked / what didn't
**Worked:** SFT v1 execution behavior · EVTE episode quality filter (92 curated mentor-assisted trajectories) · honest eval harness on real files

**Didn't:** Full GRPO in hackathon time · SFT v2 (recovery-only fine-tune taught debug prose, not answers) · EVTE-STaR batch 6 overtraining (40% mentor-hard vs Micro-1's 100%)
## Models on Hugging Face
| Checkpoint | Repo | Role |
|------------|------|------|
| Base | [`unsloth/gemma-4-E2B-it`](https://huggingface.co/unsloth/gemma-4-E2B-it) | Frozen foundation |
| **SFT v1** ★ | [`DataSense-Modal-E2B-SFT`](https://huggingface.co/sanjaymalladi/DataSense-Modal-E2B-SFT) | **This Space** |
| EVTE Micro-1 | [`DataSense-Modal-E2B-EVTE-Star-Micro1`](https://huggingface.co/sanjaymalladi/DataSense-Modal-E2B-EVTE-Star-Micro1) | Best mentor-hard (research) |
## Try the demo
Six one-click examples on **sales**, **employees**, and **students** CSVs, no upload required. Hit **Run DataSense** and follow the live progress bar; traces stream into the results panel so you can verify the model really executed code before reading the final answer.

---

**DataSense E2B**: Execution-verified, Tutor-escalation training for personal data science agents.

Built June 2026 · [Full story](https://datasense-e2b.netlify.app/) · [Demo video](https://youtu.be/ucFoCdMK7sE) · [LinkedIn post](https://www.linkedin.com/posts/sanjaymalladi_buildsmall-huggingface-modal-share-7471993638814654464-47hY/).