---
title: DataSense E2B
emoji: 📊
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 6.5.1
app_file: app.py
pinned: false
license: apache-2.0
startup_duration_timeout: 30m
short_description: Execution-grounded data agent - Gemma-4 2B + SFT v1
models:
  - sanjaymalladi/DataSense-Modal-E2B-SFT
  - unsloth/gemma-4-E2B-it
hardware: gpu-t4
tags:
  - track:wood
  - sponsor:modal
  - achievement:offgrid
  - achievement:welltuned
  - achievement:offbrand
  - achievement:sharing
  - achievement:fieldnotes
---

# DataSense E2B

A **personal data-science agent** built for the Gemma / DataBench hackathon: not a chatbot that *pretends* to run code, but a model that **writes Python, executes it, reads real errors, and verifies answers** before claiming a result.

This Space runs **SFT v1** on [`unsloth/gemma-4-E2B-it`](https://huggingface.co/unsloth/gemma-4-E2B-it) with the LoRA adapter [`sanjaymalladi/DataSense-Modal-E2B-SFT`](https://huggingface.co/sanjaymalladi/DataSense-Modal-E2B-SFT). Pick a bundled CSV example or ask your own question. The agent inspects the schema, runs code in a sandbox, debugs from tracebacks, and returns **Answer** and **Summary** tags.

**Fine-tuned on Modal.**

📖 **[Read the full project story →](https://datasense-e2b.netlify.app/)** · **🎬 [Watch the demo →](https://youtu.be/ucFoCdMK7sE)** · **[LinkedIn post →](https://www.linkedin.com/posts/sanjaymalladi_buildsmall-huggingface-modal-share-7471993638814654464-47hY/)**

---

## The problem we set out to solve

Small instruction models can look competent on data questions by printing plausible `**Answer:**` tags **without executing anything**. Our first eval even reported **0% accuracy for every model**, not because training failed, but because we were scoring synthetic sandbox data against real DataBench ground truth.

The fix was changing **what we optimize and measure**: execution-grounded rollouts, typed verifiers, and eval on **mounted real files**, not hallucinated `` blocks.

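The idea of a typed verifier mentioned above can be sketched as follows. This is a minimal illustration and not the project's actual harness: the function name `verify_answer`, the type labels (`boolean`, `number`, `list`, `category`), and the 1% relative tolerance are all assumptions made for the example.

```python
import math


def _normalize_list(text: str) -> list[str]:
    """Split a comma-separated answer into sorted, lower-cased items."""
    return sorted(item.strip().lower() for item in text.strip("[]").split(","))


def verify_answer(predicted: str, expected: str, answer_type: str,
                  rel_tol: float = 1e-2) -> bool:
    """Compare a predicted answer string with ground truth by answer type.

    Hypothetical helper: the type labels and the tolerance are illustrative
    assumptions, not the DataBench or DataSense scoring rules.
    """
    pred, gold = predicted.strip(), expected.strip()

    if answer_type == "boolean":
        truthy, falsy = {"true", "yes", "1"}, {"false", "no", "0"}
        known = truthy | falsy
        p, g = pred.lower(), gold.lower()
        if p not in known or g not in known:
            return False  # unparseable booleans never count as correct
        return (p in truthy) == (g in truthy)

    if answer_type == "number":
        try:
            p_num = float(pred.replace(",", ""))
            g_num = float(gold.replace(",", ""))
        except ValueError:
            return False  # e.g. a letter where a number was expected
        return math.isclose(p_num, g_num, rel_tol=rel_tol)

    if answer_type == "list":
        return _normalize_list(pred) == _normalize_list(gold)

    # Fallback ("category"): case-insensitive exact match.
    return pred.lower() == gold.lower()


if __name__ == "__main__":
    print(verify_answer("1,234.5", "1234.50", "number"))  # numeric match
    print(verify_answer("B", "1200", "number"))           # type mismatch
```

Comparing by type rather than by raw string means a formatting difference such as `1,234.5` versus `1234.50` does not count as a miss, while an answer of the wrong kind (a letter where a number is expected) is rejected outright.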
## Agent loop

![Agent loop: THINK → EXPLORE → EXECUTE → DEBUG → ANSWER](assets/illustrations/03-agent-loop.png)

The same loop runs in training, eval, and this demo: **THINK** → **EXPLORE** → **EXECUTE** → **DEBUG** → **ANSWER**. In this Hugging Face Space, full **execution traces are visible on screen**: open the **Execution trace** tab to watch each step, including generated code, sandbox stdout, and tracebacks as the agent works.

## Training story (Modal)

Three Kaggle notebooks (SFT → GRPO → DPO) became one Modal app (`datasense_pipeline.py`) with volume checkpoints and automatic Hub pushes.

| Stage | Status | What we learned |
|-------|--------|-----------------|
| **SFT v1** | ✅ Shipped | Real execution behavior (~100% exec on many evals); the foundation everything else builds on |
| **GRPO** | ⏸ Deferred | ~11 min/step with execution-bound rollouts; too slow for the hackathon window |
| **DPO** | ⏸ Deferred | Prompt drift risk; EVTE-STaR took priority for hard questions |
| **EVTE** | ✅ Novel | When the 2B student fails, a 31B mentor must **verify its own code** before giving a diagnostic hint |
| **EVTE-STaR** | ✅ Research peak | Online micro-SFT every 15 verified mentor-assisted wins → Micro-1 checkpoint |

**EVTE** = Execution-Verified Tutor Escalation. **EVTE-STaR** = a Self-Taught Reasoner (STaR) variant with online weight updates instead of a single offline training run at the end.

## Hackathon eval (30 problems × 3 models, T4)

Macro average = unweighted mean of per-slice accuracy across DataBench (15), DSBench Excel (10), and mentor-hard (5).

| Model | DataBench | DSBench | Mentor-hard | Macro | Total |
|-------|-----------|---------|-------------|-------|-------|
| Base | 60.0% | 0.0% | 20.0% | 26.7% | 10/30 |
| **SFT v1** ★ | **86.7%** | 0.0% | 60.0% | **48.9%** | 16/30 |
| EVTE Micro-1 | 80.0% | 0.0%* | **100.0%** | 60.0% | 17/30 |

\*The official DSBench scorer gives 0% for all models (letter vs. dollar-value answer mismatch). Micro-1 computed the correct dollar value on Q15, so a value-aware macro average would be **63.3%**.

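To make the macro column and the footnote concrete, here is a small worked calculation. The per-slice correct counts are back-computed from the percentages in the table (for example, 86.7% of 15 problems is 13 correct); the dictionary keys and function names are illustrative and not taken from the project's eval harness.

```python
# Number of problems in each benchmark slice, as stated in the text.
SLICE_SIZES = {"databench": 15, "dsbench": 10, "mentor_hard": 5}

# Correct counts back-computed from the table's percentages (an assumption
# of this sketch; the table itself reports percentages and totals only).
CORRECT = {
    "base": {"databench": 9, "dsbench": 0, "mentor_hard": 1},
    "sft_v1": {"databench": 13, "dsbench": 0, "mentor_hard": 3},
    "evte_micro1": {"databench": 12, "dsbench": 0, "mentor_hard": 5},
}


def macro_average(correct: dict[str, int]) -> float:
    """Unweighted mean of per-slice accuracy, in percent."""
    per_slice = [correct[name] / size for name, size in SLICE_SIZES.items()]
    return 100.0 * sum(per_slice) / len(per_slice)


def total_correct(correct: dict[str, int]) -> int:
    """Plain count of solved problems across all slices."""
    return sum(correct.values())


if __name__ == "__main__":
    for model, counts in CORRECT.items():
        print(f"{model}: macro={macro_average(counts):.1f}% "
              f"total={total_correct(counts)}/30")

    # Value-aware variant from the footnote: credit Micro-1 for DSBench Q15.
    value_aware = dict(CORRECT["evte_micro1"], dsbench=1)
    print(f"evte_micro1 (value-aware): macro={macro_average(value_aware):.1f}%")
```

Because every slice carries equal weight, the five mentor-hard problems count as much as the fifteen DataBench problems. That is why Micro-1 leads SFT v1 by about 11 points on macro average while solving only one more problem in total (17 vs 16).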
Always pair accuracy with **exec_ok**: the base model can match easy booleans via answer tags while running **0%** of its code.

## Why SFT v1 for this demo (not Micro-1)

Micro-1 wins the macro average on paper (driven by 5/5 on mentor-hard). We still ship **SFT v1** here:

- **Best DataBench breadth**: 86.7% vs 80% (largest held-out slice)
- **Stable inference**: single bulk SFT vs online micro-batch 1 (replay 100% vs saved checkpoint ~60%)
- **Lower live-demo risk**: fewer debug rambles and dtype dumps
- **Held up under eval reruns**: Micro-1 mentor-hard dropped when Modal stragglers overwrote the volume

**Slides show all three models.** Micro-1 is the EVTE-STaR research peak; SFT v1 is the production-shaped baseline.

## What worked / what didn't

**Worked:** SFT v1 execution behavior · EVTE episode quality filter (92 curated mentor-assisted trajectories) · honest eval harness on real files

**Didn't:** Full GRPO in hackathon time · SFT v2 (recovery-only fine-tune taught debug prose, not answers) · EVTE-STaR batch 6 overtraining (40% mentor-hard vs Micro-1's 100%)

## Models on Hugging Face

| Checkpoint | Repo | Role |
|------------|------|------|
| Base | [`unsloth/gemma-4-E2B-it`](https://huggingface.co/unsloth/gemma-4-E2B-it) | Frozen foundation |
| **SFT v1** ★ | [`DataSense-Modal-E2B-SFT`](https://huggingface.co/sanjaymalladi/DataSense-Modal-E2B-SFT) | **This Space** |
| EVTE Micro-1 | [`DataSense-Modal-E2B-EVTE-Star-Micro1`](https://huggingface.co/sanjaymalladi/DataSense-Modal-E2B-EVTE-Star-Micro1) | Best mentor-hard (research) |

## Try the demo

Six one-click examples on **sales**, **employees**, and **students** CSVs, with no upload required. Hit **Run DataSense** and follow the live progress bar; traces stream into the results panel so you can verify that the model really executed code before reading the final answer.

---

**DataSense E2B**: Execution-verified, Tutor-escalation training for personal data science agents.
Built June 2026 · [Full story](https://datasense-e2b.netlify.app/) · [Demo video](https://youtu.be/ucFoCdMK7sE) · [LinkedIn post](https://www.linkedin.com/posts/sanjaymalladi_buildsmall-huggingface-modal-share-7471993638814654464-47hY/).