| title: DataSense E2B | |
| emoji: π | |
| colorFrom: blue | |
| colorTo: green | |
| sdk: gradio | |
| sdk_version: 6.5.1 | |
| app_file: app.py | |
| pinned: false | |
| license: apache-2.0 | |
| startup_duration_timeout: 30m | |
| short_description: Execution-grounded data agent, Gemma-4 2B + SFT v1 | |
| models: | |
| - sanjaymalladi/DataSense-Modal-E2B-SFT | |
| - unsloth/gemma-4-E2B-it | |
| hardware: gpu-t4 | |
| tags: | |
| - track:wood | |
| - sponsor:modal | |
| - achievement:offgrid | |
| - achievement:welltuned | |
| - achievement:offbrand | |
| - achievement:sharing | |
| - achievement:fieldnotes | |
| # DataSense E2B | |
| A **personal data-science agent** built for the Gemma / DataBench hackathon: not a chatbot that *pretends* to run code, but a model that **writes Python, executes it, reads real errors, and verifies answers** before claiming a result. | |
| This Space runs **SFT v1** on [`unsloth/gemma-4-E2B-it`](https://huggingface.co/unsloth/gemma-4-E2B-it) with LoRA adapter [`sanjaymalladi/DataSense-Modal-E2B-SFT`](https://huggingface.co/sanjaymalladi/DataSense-Modal-E2B-SFT). Pick a bundled CSV example or ask your own question. The agent inspects schema, runs code in a sandbox, debugs from tracebacks, and returns **Answer** + **Summary** tags. | |
| **Fine-tuning was powered by Modal.** | |
| **[Read the full project story →](https://datasense-e2b.netlify.app/)** · **🎬 [Watch the demo →](https://youtu.be/ucFoCdMK7sE)** · **[LinkedIn post →](https://www.linkedin.com/posts/sanjaymalladi_buildsmall-huggingface-modal-share-7471993638814654464-47hY/)** | |
| --- | |
| ## The problem we set out to solve | |
| Small instruction models can look competent on data questions by printing plausible `**Answer:**` tags **without executing anything**. Our first eval even reported **0% accuracy for everyone**, not because training failed, but because we were scoring synthetic sandbox data against real DataBench ground truth. | |
| The fix was changing **what we optimize and measure**: execution-grounded rollouts, typed verifiers, and eval on **mounted real files**, not hallucinated `<result>` blocks. | |
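To make "typed verifiers" concrete, here is a minimal sketch of one. It is an illustration, not the project's actual verifier: the function name `verify_answer`, the type labels (`boolean`, `number`, `list`, `category`), and the 1% numeric tolerance are all assumptions.

```python
import math


def verify_answer(predicted: str, expected, answer_type: str, rel_tol: float = 1e-2) -> bool:
    """Compare a model's answer string with ground truth according to its declared type.

    Hypothetical helper: the type labels and tolerance are illustrative assumptions.
    """
    text = predicted.strip()
    if answer_type == "boolean":
        # `expected` is assumed to be a Python bool here.
        truthy, falsy = {"true", "yes"}, {"false", "no"}
        value = text.lower()
        if value not in truthy | falsy:
            return False  # unparseable answers never count as correct
        return (value in truthy) == bool(expected)
    if answer_type == "number":
        try:
            number = float(text.replace(",", "").lstrip("$"))
        except ValueError:
            return False
        return math.isclose(number, float(expected), rel_tol=rel_tol)
    if answer_type == "list":
        # Order-insensitive comparison of comma-separated items.
        items = [item.strip().strip("'\"").lower() for item in text.strip("[]").split(",")]
        return sorted(items) == sorted(str(e).strip().lower() for e in expected)
    # Fallback: case-insensitive category / string match.
    return text.lower() == str(expected).strip().lower()
```

Dispatching on type means a formatted number such as `$1,234.50` still matches `1234.5`, while a free-text answer to a boolean question is scored as wrong instead of being matched loosely.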
| ## Agent loop | |
|  | |
| Same loop in training, eval, and this demo: **THINK** → **EXECUTE** → **DEBUG** → **ANSWER**. | |
| In this Hugging Face Space, full **execution traces are visible on screen**: open the **Execution trace** tab to watch each step: generated code, sandbox stdout, and tracebacks as the agent works. | |
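The loop above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the Space's implementation: `run_code`, `agent_loop`, and `fake_generate` are hypothetical names, the model is replaced by a stub, and the code runs in-process with `exec`, which is not a real sandbox.

```python
import contextlib
import io
import traceback
from typing import Callable


def run_code(code: str) -> tuple[bool, str]:
    """EXECUTE: run code in-process and return (ok, stdout or traceback).

    Assumption: plain exec() stands in for the sandbox; do not use it for untrusted code.
    """
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, {"__name__": "__main__"})
    except Exception:
        return False, traceback.format_exc()
    return True, buffer.getvalue()


def agent_loop(question: str, generate: Callable[[str, str], str], max_attempts: int = 3) -> dict:
    """THINK -> EXECUTE -> DEBUG -> ANSWER, recording every step in a trace."""
    trace = []
    feedback = ""
    for attempt in range(1, max_attempts + 1):
        code = generate(question, feedback)  # THINK: draft or repair code
        ok, output = run_code(code)          # EXECUTE: get real stdout or a real traceback
        trace.append({"attempt": attempt, "code": code, "ok": ok, "output": output})
        if ok:
            return {"answer": output.strip(), "trace": trace}  # ANSWER
        feedback = output                    # DEBUG: feed the traceback into the next draft
    return {"answer": None, "trace": trace}


def fake_generate(question: str, feedback: str) -> str:
    # Stand-in for the model: a buggy first draft, fixed once it sees a traceback.
    return "print(1 / 0)" if not feedback else "print(6 * 7)"


result = agent_loop("What is six times seven?", fake_generate)
print(result["answer"], len(result["trace"]))
```

The property that matters is that the answer is taken from executed output only, and the trace keeps the failed attempt so a reader can see the traceback that triggered the repair.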
| ## Training story (Modal) | |
| Three Kaggle notebooks (SFT → GRPO → DPO) became one Modal app (`datasense_pipeline.py`) with volume checkpoints and automatic Hub pushes. | |
| | Stage | Status | What we learned | | |
| |-------|--------|-----------------| | |
| **SFT v1** | Shipped | Real execution behavior (~100% exec on many evals); foundation everything else builds on | |
| **GRPO** | ⏸ Deferred | ~11 min/step × execution-bound rollouts; too slow for hackathon window | |
| **DPO** | ⏸ Deferred | Prompt drift risk; EVTE-STaR took priority for hard questions | |
| **EVTE** | Novel | When the 2B student fails, a 31B mentor must **verify its own code** before giving a diagnostic hint | |
| **EVTE-STaR** | Research peak | Online micro-SFT every 15 verified mentor-assisted wins → Micro-1 checkpoint | |
| **EVTE** = Execution-Verified Tutor Escalation. **EVTE-STaR** = Self-Taught Reasoner with online weight updates instead of one offline train at the end. | |
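The two ideas can be sketched as a verification gate plus a buffer that triggers a micro-SFT step. This is a hedged illustration: `mentor_hint`, `MicroSFTBuffer`, and the `run` / `train_step` callables are hypothetical, and checking the mentor's result against a known expected answer is an assumption about how "verify its own code" is done. Only the batch size of 15 comes from the table above.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

MICRO_SFT_EVERY = 15  # from the table: one micro-SFT per 15 verified mentor-assisted wins


def mentor_hint(mentor_code: str, diagnosis: str, expected, run: Callable[[str], object]) -> Optional[str]:
    """EVTE gate: release the hint only if the mentor's own code reproduces the expected answer.

    Assumption: `run` executes code and returns its result; a crash counts as unverified.
    """
    try:
        result = run(mentor_code)
    except Exception:
        return None
    return diagnosis if result == expected else None


@dataclass
class MicroSFTBuffer:
    """EVTE-STaR sketch: collect verified wins and train online once the buffer fills."""

    train_step: Callable[[list], None]  # hypothetical fine-tuning callback
    batch_size: int = MICRO_SFT_EVERY
    pending: list = field(default_factory=list)
    updates: int = 0

    def add_win(self, trajectory) -> bool:
        """Store one verified trajectory; return True if this call triggered a micro-SFT."""
        self.pending.append(trajectory)
        if len(self.pending) < self.batch_size:
            return False
        self.train_step(list(self.pending))
        self.pending.clear()
        self.updates += 1
        return True
```

Gating the hint on verified mentor code keeps unverified advice out of the student's training data, and the buffer makes the "online" part explicit: weights change during collection, not once at the end.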
| ## Hackathon eval (30 problems × 3 models, T4) | |
| Macro average = unweighted mean across DataBench (15), DSBench Excel (10), and mentor-hard (5). | |
| | Model | DataBench | DSBench | Mentor-hard | Macro | Total | | |
| |-------|-----------|---------|-------------|-------|-------| | |
| | Base | 60.0% | 0.0% | 20.0% | 26.7% | 10/30 | | |
| **SFT v1** | **86.7%** | 0.0% | 60.0% | **48.9%** | 16/30 | |
| | EVTE Micro-1 | 80.0% | 0.0%* | **100.0%** | 60.0% | 17/30 | | |
| \*DSBench official scorer = 0% for all models (letter vs dollar mismatch). Micro-1 Q15 computed the correct dollar value, so a value-aware macro would be **63.3%**. | |
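The macro figures can be reproduced from per-slice counts. The counts below are not stated in the table; they are derived from its percentages and slice sizes (for example, 60.0% of 15 is 9), and the names `SLICES`, `CORRECT`, and `macro_average` are illustrative.

```python
# Slice sizes from the text: DataBench (15), DSBench Excel (10), mentor-hard (5).
SLICES = {"DataBench": 15, "DSBench": 10, "Mentor-hard": 5}

# Correct counts derived from the table's percentages (an inference, not raw logs).
CORRECT = {
    "Base": {"DataBench": 9, "DSBench": 0, "Mentor-hard": 1},
    "SFT v1": {"DataBench": 13, "DSBench": 0, "Mentor-hard": 3},
    "EVTE Micro-1": {"DataBench": 12, "DSBench": 0, "Mentor-hard": 5},
}


def macro_average(correct: dict) -> float:
    """Unweighted mean of per-slice accuracy, in percent."""
    return 100 * sum(correct[name] / size for name, size in SLICES.items()) / len(SLICES)


for model, correct in CORRECT.items():
    total = f"{sum(correct.values())}/{sum(SLICES.values())}"
    print(f"{model}: macro {macro_average(correct):.1f}%, total {total}")

# Value-aware variant: credit Micro-1 for the one DSBench question (Q15) it got right in dollars.
value_aware = dict(CORRECT["EVTE Micro-1"], DSBench=1)
print(f"EVTE Micro-1 value-aware macro: {macro_average(value_aware):.1f}%")
```

Because the macro average weights the 5-question mentor-hard slice the same as the 15-question DataBench slice, Micro-1 can lead on macro (60.0%) while being only one problem ahead on total (17/30 vs 16/30).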
| Always pair accuracy with **exec_ok**: base can match easy booleans via answer tags while running **0%** of its code. | |
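A small sketch of reporting the two numbers together follows. `EvalRecord`, `summarize`, and the `grounded_accuracy` metric are hypothetical names, and treating `exec_ok` as "the model's code ran without error" is an assumption about the harness.

```python
from dataclasses import dataclass


@dataclass
class EvalRecord:
    correct: bool  # final answer matched ground truth
    exec_ok: bool  # the model's code actually ran without error (assumed definition)


def summarize(records: list) -> dict:
    """Report accuracy next to exec_ok, plus accuracy restricted to executed answers."""
    n = len(records)
    if n == 0:
        return {"accuracy": 0.0, "exec_ok": 0.0, "grounded_accuracy": 0.0}
    return {
        "accuracy": sum(r.correct for r in records) / n,
        "exec_ok": sum(r.exec_ok for r in records) / n,
        # Correct AND backed by a successful execution.
        "grounded_accuracy": sum(r.correct and r.exec_ok for r in records) / n,
    }


# A model that guesses booleans without running anything: decent accuracy, zero grounding.
guesser = [EvalRecord(True, False)] * 3 + [EvalRecord(False, False)] * 2
print(summarize(guesser))
```

Here accuracy alone would report 60%, while `exec_ok` and `grounded_accuracy` both report 0, which is the failure mode the line above warns about.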
| ## Why SFT v1 for this demo (not Micro-1) | |
| Micro-1 wins the macro average on paper (driven by 5/5 on mentor-hard). We still ship **SFT v1** here: | |
| - **Best DataBench breadth**: 86.7% vs 80% (largest held-out slice) | |
| - **Stable inference**: single bulk SFT vs online micro-batch 1 (replay 100% vs saved checkpoint ~60%) | |
| - **Lower live-demo risk**: less debug rambling and fewer dtype dumps | |
| - **Held up under eval reruns**: Micro-1 mentor-hard dropped when Modal stragglers overwrote the volume | |
| **Slides show all three models.** Micro-1 is the EVTE-STaR research peak; SFT v1 is the production-shaped baseline. | |
| ## What worked / what didn't | |
| **Worked:** SFT v1 execution behavior · EVTE episode quality filter (92 curated mentor-assisted trajectories) · honest eval harness on real files | |
| **Didn't:** Full GRPO in hackathon time · SFT v2 (recovery-only fine-tune taught debug prose, not answers) · EVTE-STaR batch 6 overtraining (40% mentor-hard vs Micro-1's 100%) | |
| ## Models on Hugging Face | |
| | Checkpoint | Repo | Role | | |
| |------------|------|------| | |
| | Base | [`unsloth/gemma-4-E2B-it`](https://huggingface.co/unsloth/gemma-4-E2B-it) | Frozen foundation | | |
| **SFT v1** | [`DataSense-Modal-E2B-SFT`](https://huggingface.co/sanjaymalladi/DataSense-Modal-E2B-SFT) | **This Space** | |
| EVTE Micro-1 | [`DataSense-Modal-E2B-EVTE-Star-Micro1`](https://huggingface.co/sanjaymalladi/DataSense-Modal-E2B-EVTE-Star-Micro1) | Best mentor-hard (research) | |
| ## Try the demo | |
| Six one-click examples on **sales**, **employees**, and **students** CSVs, no upload required. Hit **Run DataSense** and follow the live progress bar; traces stream into the results panel so you can verify the model really executed code before reading the final answer. | |
| --- | |
| **DataSense E2B**: execution-verified, tutor-escalation training for personal data science agents. | |
| Built June 2026 · [Full story](https://datasense-e2b.netlify.app/) · [Demo video](https://youtu.be/ucFoCdMK7sE) · [LinkedIn post](https://www.linkedin.com/posts/sanjaymalladi_buildsmall-huggingface-modal-share-7471993638814654464-47hY/). | |