title: DataSense E2B
emoji: π
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 6.5.1
app_file: app.py
pinned: false
license: apache-2.0
startup_duration_timeout: 30m
short_description: Execution-grounded data agent - Gemma-4 2B + SFT v1
models:
- sanjaymalladi/DataSense-Modal-E2B-SFT
- unsloth/gemma-4-E2B-it
hardware: gpu-t4
tags:
- track:wood
- sponsor:modal
- achievement:offgrid
- achievement:welltuned
- achievement:offbrand
- achievement:sharing
- achievement:fieldnotes
DataSense E2B
A personal data-science agent built for the Gemma / DataBench hackathon. It is not a chatbot that pretends to run code: the model writes Python, executes it, reads real errors, and verifies answers before claiming a result.
This Space runs SFT v1: unsloth/gemma-4-E2B-it with the LoRA adapter sanjaymalladi/DataSense-Modal-E2B-SFT. Pick a bundled CSV example or ask your own question. The agent inspects the schema, runs code in a sandbox, debugs from tracebacks, and returns Answer and Summary tags.
The model was fine-tuned on Modal.
Read the full project story → · 🎬 Watch the demo → · LinkedIn post →
The problem we set out to solve
Small instruction models can look competent on data questions by printing plausible **Answer:** tags without executing anything. Our first eval even reported 0% accuracy for every model, not because training failed, but because we were scoring answers computed on synthetic sandbox data against real DataBench ground truth.
The fix was changing what we optimize and measure: execution-grounded rollouts, typed verifiers, and eval on mounted real files, not hallucinated <result> blocks.
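To make "typed verifier" concrete, here is a minimal sketch of what such a check could look like. It is an illustration, not the project's actual verifier: the function name `verify`, the three answer types, the accepted boolean spellings, and the numeric tolerance are all assumptions.

```python
import math

def verify(predicted: str, expected, answer_type: str, rel_tol: float = 1e-6) -> bool:
    """Compare a model's answer string with ground truth according to a declared type.

    Hypothetical helper: the type names and parsing rules are assumptions.
    """
    text = predicted.strip()
    if answer_type == "boolean":
        truthy, falsy = {"true", "yes"}, {"false", "no"}
        low = text.lower()
        if low not in truthy | falsy:
            return False  # anything that is not a clear boolean is rejected
        return (low in truthy) == bool(expected)
    if answer_type == "number":
        try:
            # Tolerate thousands separators and a leading dollar sign.
            value = float(text.replace(",", "").lstrip("$"))
        except ValueError:
            return False
        return math.isclose(value, float(expected), rel_tol=rel_tol)
    if answer_type == "category":
        return text.lower() == str(expected).strip().lower()
    raise ValueError(f"unknown answer type: {answer_type}")
```

The point of typing the comparison is that a number is compared as a number and a boolean as a boolean, so a well-formatted but uncomputed string does not pass by accident.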
Agent loop
Same loop in training, eval, and this demo: THINK → EXECUTE → DEBUG → ANSWER.
In this Hugging Face Space, full execution traces are visible on screen. Open the Execution trace tab to watch each step: generated code, sandbox stdout, and tracebacks as the agent works.
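The loop can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the Space's code: `run_agent`, `execute`, and `fake_writer` are hypothetical names, the model is replaced by a stub, and plain `exec` stands in for the sandbox (it provides no isolation).

```python
import contextlib
import io
import traceback
from typing import Callable, Optional

def execute(code: str) -> tuple[bool, str]:
    """Run code and return (ok, stdout or traceback). Plain exec, NOT a real sandbox."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, {"__name__": "__agent__"})
    except Exception:
        return False, traceback.format_exc()
    return True, buffer.getvalue()

def run_agent(question: str,
              write_code: Callable[[str, Optional[str]], str],
              max_attempts: int = 3):
    """THINK -> EXECUTE -> DEBUG -> ANSWER with a recorded trace."""
    trace = []
    last_error = None
    for attempt in range(1, max_attempts + 1):
        code = write_code(question, last_error)   # THINK, or DEBUG when last_error is set
        ok, output = execute(code)                # EXECUTE
        trace.append({"attempt": attempt, "code": code, "ok": ok, "output": output})
        if ok:
            return output.strip(), trace          # ANSWER comes from real stdout
        last_error = output                       # feed the traceback into the next attempt
    return None, trace

def fake_writer(question: str, last_error: Optional[str]) -> str:
    # Stand-in for the model: the first attempt has a bug, the retry fixes it.
    if last_error is None:
        return "print(totl)"
    return "totl = sum([1, 2, 3])\nprint(totl)"

answer, trace = run_agent("What is the total?", fake_writer)
```

Each entry in `trace` holds the generated code, whether it ran, and the stdout or traceback, which is the same kind of information the Execution trace tab displays.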
Training story (Modal)
Three Kaggle notebooks (SFT → GRPO → DPO) became one Modal app (datasense_pipeline.py) with volume checkpoints and automatic Hub pushes.
| Stage | Status | What we learned |
|---|---|---|
| SFT v1 | Shipped | Real execution behavior (~100% exec on many evals); foundation everything else builds on |
| GRPO | ⏸ Deferred | ~11 min/step with execution-bound rollouts; too slow for the hackathon window |
| DPO | ⏸ Deferred | Prompt drift risk; EVTE-STaR took priority for hard questions |
| EVTE | Novel | When the 2B student fails, a 31B mentor must verify its own code before giving a diagnostic hint |
| EVTE-STaR | Research peak | Online micro-SFT every 15 verified mentor-assisted wins → Micro-1 checkpoint |
EVTE = Execution-Verified Tutor Escalation. EVTE-STaR = Self-Taught Reasoner with online weight updates instead of one offline train at the end.
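The escalation and online-update logic can be sketched as follows. This is a hedged illustration of the control flow only: the class `EvteLoop`, its callables, and the trajectory fields are hypothetical, and the `verify` callable stands in for whatever execution-based check the real pipeline uses.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class EvteLoop:
    # All four callables are hypothetical stand-ins supplied by the caller.
    student: Callable[[str, Optional[str]], str]   # (question, hint) -> answer
    mentor: Callable[[str], tuple[str, str]]       # question -> (mentor_answer, hint)
    verify: Callable[[str, str], bool]             # (question, answer) -> verified?
    micro_sft: Callable[[list], None]              # trains on a batch of trajectories
    batch_size: int = 15                           # "every 15 verified wins"
    buffer: list = field(default_factory=list)

    def solve(self, question: str) -> Optional[str]:
        answer = self.student(question, None)
        if self.verify(question, answer):
            return answer                          # student succeeded, no escalation
        mentor_answer, hint = self.mentor(question)
        if not self.verify(question, mentor_answer):
            return None                            # unverified mentor: no hint is passed on
        answer = self.student(question, hint)
        if not self.verify(question, answer):
            return None
        # A verified mentor-assisted win becomes training data.
        self.buffer.append({"question": question, "hint": hint, "answer": answer})
        if len(self.buffer) >= self.batch_size:
            self.micro_sft(list(self.buffer))      # online update instead of one final train
            self.buffer.clear()
        return answer
```

The key design choice the sketch tries to show is the gate in the middle: a hint reaches the student only after the mentor's own result has been verified.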
Hackathon eval (30 problems × 3 models, T4)
Macro average = unweighted mean across DataBench (15), DSBench Excel (10), and mentor-hard (5).
| Model | DataBench | DSBench | Mentor-hard | Macro | Total |
|---|---|---|---|---|---|
| Base | 60.0% | 0.0% | 20.0% | 26.7% | 10/30 |
| SFT v1 | 86.7% | 0.0% | 60.0% | 48.9% | 16/30 |
| EVTE Micro-1 | 80.0% | 0.0%* | 100.0% | 60.0% | 17/30 |
*DSBench official scorer = 0% for all models (letter vs dollar mismatch). Micro-1 Q15 computed the correct dollar value, so a value-aware macro would be 63.3%.
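The macro column can be reproduced from per-slice counts. In the sketch below the counts are inferred from the percentages and slice sizes in the table above (for example, 60.0% of 15 is 9), and the names `results` and `macro_average` are illustrative.

```python
# (correct, total) per slice, inferred from the reported percentages
results = {
    "Base":         {"DataBench": (9, 15),  "DSBench": (0, 10), "Mentor-hard": (1, 5)},
    "SFT v1":       {"DataBench": (13, 15), "DSBench": (0, 10), "Mentor-hard": (3, 5)},
    "EVTE Micro-1": {"DataBench": (12, 15), "DSBench": (0, 10), "Mentor-hard": (5, 5)},
}

def macro_average(slices: dict) -> float:
    """Unweighted mean of per-slice accuracy, in percent."""
    return 100 * sum(correct / total for correct, total in slices.values()) / len(slices)

for model, slices in results.items():
    total_correct = sum(correct for correct, _ in slices.values())
    print(f"{model}: macro {macro_average(slices):.1f}%, total {total_correct}/30")

# Value-aware variant: credit Micro-1 for the one DSBench question (Q15) it got right.
value_aware = dict(results["EVTE Micro-1"], DSBench=(1, 10))
print(f"EVTE Micro-1 value-aware macro: {macro_average(value_aware):.1f}%")
```

Because the macro average weights the 5-question mentor-hard slice the same as the 15-question DataBench slice, Micro-1 leads on macro (60.0% vs 48.9%) while its raw total is only one question ahead (17/30 vs 16/30).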
Always pair accuracy with exec_ok: base can match easy booleans via answer tags while running 0% of its code.
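One way to report the two numbers together is sketched below. The record fields and the extra `grounded_accuracy` metric are assumptions made for illustration, not names from the project's eval harness.

```python
def summarize(records: list[dict]) -> dict:
    """Report accuracy next to exec_ok so tag-only guesses are visible."""
    n = len(records)
    return {
        "accuracy": sum(r["correct"] for r in records) / n,
        "exec_ok": sum(r["exec_ok"] for r in records) / n,
        # Correct answers that were also backed by successfully executed code.
        "grounded_accuracy": sum(r["correct"] and r["exec_ok"] for r in records) / n,
    }

# Toy records: two correct boolean guesses with no executed code, one executed miss,
# and one executed correct answer.
records = [
    {"correct": True,  "exec_ok": False},
    {"correct": True,  "exec_ok": False},
    {"correct": False, "exec_ok": True},
    {"correct": True,  "exec_ok": True},
]
summary = summarize(records)
```

In this toy case accuracy is 0.75 while only 0.25 of the answers are both correct and grounded in executed code, which is the gap the accuracy number alone would hide.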
Why SFT v1 for this demo (not Micro-1)
Micro-1 wins macro average on paper (driven by 5/5 mentor-hard). We still ship SFT v1 here:
- Best DataBench breadth: 86.7% vs 80.0% (largest held-out slice)
- Stable inference: single bulk SFT vs online micro-batch 1 (replay 100% vs saved ckpt ~60%)
- Lower live-demo risk: fewer debug rambles and dtype dumps
- Held up under eval reruns: Micro-1 mentor-hard dropped when Modal stragglers overwrote the volume
Slides show all three models. Micro-1 is the EVTE-STaR research peak; SFT v1 is the production-shaped baseline.
What worked / what didn't
Worked: SFT v1 execution behavior · EVTE episode quality filter (92 curated mentor-assisted trajectories) · honest eval harness on real files
Didn't: Full GRPO in hackathon time · SFT v2 (recovery-only fine-tune taught debug prose, not answers) · EVTE-STaR batch 6 overtraining (40% mentor-hard vs Micro-1's 100%)
Models on Hugging Face
| Checkpoint | Repo | Role |
|---|---|---|
| Base | unsloth/gemma-4-E2B-it | Frozen foundation |
| SFT v1 | DataSense-Modal-E2B-SFT | This Space |
| EVTE Micro-1 | DataSense-Modal-E2B-EVTE-Star-Micro1 | Best mentor-hard (research) |
Try the demo
Six one-click examples on sales, employees, and students CSVs, no upload required. Hit Run DataSense and follow the live progress bar; traces stream into the results panel so you can verify the model really executed code before reading the final answer.
DataSense E2B: Execution-verified, Tutor-escalation training for personal data science agents.
Built June 2026 · Full story · Demo video · LinkedIn post.
