metadata
title: DataSense E2B
emoji: 📊
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 6.5.1
app_file: app.py
pinned: false
license: apache-2.0
startup_duration_timeout: 30m
short_description: Execution-grounded data agent (Gemma-4 2B + SFT v1)
models:
  - sanjaymalladi/DataSense-Modal-E2B-SFT
  - unsloth/gemma-4-E2B-it
hardware: gpu-t4
tags:
  - track:wood
  - sponsor:modal
  - achievement:offgrid
  - achievement:welltuned
  - achievement:offbrand
  - achievement:sharing
  - achievement:fieldnotes

DataSense E2B

A personal data-science agent built for the Gemma / DataBench hackathon: not a chatbot that pretends to run code, but a model that writes Python, executes it, reads real errors, and verifies answers before claiming a result.

This Space runs SFT v1 on unsloth/gemma-4-E2B-it with the LoRA adapter sanjaymalladi/DataSense-Modal-E2B-SFT. Pick a bundled CSV example or ask your own question: the agent inspects the schema, runs code in a sandbox, debugs from tracebacks, and returns Answer + Summary tags.

The model was fine-tuned on Modal.

📖 Read the full project story → · 🎬 Watch the demo → · LinkedIn post →


The problem we set out to solve

Small instruction models can look competent on data questions by printing plausible **Answer:** tags without executing anything. Our first eval even reported 0% accuracy for every model, not because training failed but because we were scoring synthetic sandbox data against real DataBench ground truth.

The fix was changing what we optimize and measure: execution-grounded rollouts, typed verifiers, and eval on mounted real files rather than hallucinated <result> blocks.
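A typed verifier of the kind mentioned above could look like the following minimal sketch. The function name `verify_answer`, the type labels (`boolean`, `number`, `list`, and a categorical fallback), and the numeric tolerance are assumptions made for illustration; they are not taken from the project's code.

```python
import math


def verify_answer(predicted: str, expected: str, answer_type: str,
                  rel_tol: float = 1e-4) -> bool:
    """Compare a model answer with ground truth according to its declared type.

    Hypothetical helper: type labels and tolerance are illustrative assumptions.
    """
    pred, exp = predicted.strip(), expected.strip()

    if answer_type == "boolean":
        truthy, falsy = {"true", "yes"}, {"false", "no"}
        p, e = pred.lower(), exp.lower()
        # Reject anything that is not a recognizable boolean token.
        if p not in truthy | falsy or e not in truthy | falsy:
            return False
        return (p in truthy) == (e in truthy)

    if answer_type == "number":
        try:
            return math.isclose(float(pred), float(exp), rel_tol=rel_tol)
        except ValueError:
            return False

    if answer_type == "list":
        # Order-insensitive comparison of comma-separated items.
        def normalize(text: str) -> list:
            return sorted(item.strip().lower() for item in text.split(","))
        return normalize(pred) == normalize(exp)

    # Fallback: case-insensitive exact match for categorical answers.
    return pred.lower() == exp.lower()
```

Comparing by type keeps a formatting difference such as `True` versus `yes` from being scored as a wrong answer, and keeps a non-numeric string from being scored as a correct number.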

Agent loop

Agent loop: THINK → EXPLORE → EXECUTE → DEBUG → ANSWER

Same loop in training, eval, and this demo: THINK → EXECUTE → DEBUG → ANSWER.

In this Hugging Face Space, full execution traces are visible on screen. Open the Execution trace tab to watch each step: generated code, sandbox stdout, and tracebacks as the agent works.
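The EXECUTE and DEBUG steps of the loop can be sketched as follows. This is a minimal illustration, not the Space's implementation: `run_code`, `agent_loop`, and the `generate_code` callable (a stand-in for the model) are hypothetical names, and a plain subprocess is used here only to show the control flow. It is not a security sandbox.

```python
import subprocess
import sys
from typing import Callable, List, Optional, Tuple


def run_code(code: str, timeout: float = 10.0) -> Tuple[bool, str]:
    """Run code in a fresh Python process; return (ok, stdout or traceback)."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True, text=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False, f"TimeoutError: code exceeded {timeout} seconds"
    if proc.returncode == 0:
        return True, proc.stdout
    return False, proc.stderr


def agent_loop(
    question: str,
    generate_code: Callable[[str, Optional[str]], str],
    max_attempts: int = 3,
) -> Tuple[Optional[str], List[dict]]:
    """Retry with the real traceback as feedback until the code runs."""
    trace: List[dict] = []
    feedback: Optional[str] = None
    for _ in range(max_attempts):
        code = generate_code(question, feedback)   # model writes code
        ok, output = run_code(code)                # EXECUTE
        trace.append({"code": code, "ok": ok, "output": output})
        if ok:
            return output.strip(), trace           # ANSWER from real stdout
        feedback = output                          # DEBUG from the traceback
    return None, trace
```

The returned `trace` list holds the same three things the Execution trace tab displays: the generated code, its stdout, and any traceback.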

Training story (Modal)

Three Kaggle notebooks (SFT → GRPO → DPO) became one Modal app (datasense_pipeline.py) with volume checkpoints and automatic Hub pushes.

| Stage | Status | What we learned |
|---|---|---|
| SFT v1 | ✅ Shipped | Real execution behavior (~100% exec on many evals); foundation everything else builds on |
| GRPO | ⏸ Deferred | ~11 min/step × execution-bound rollouts; too slow for the hackathon window |
| DPO | ⏸ Deferred | Prompt drift risk; EVTE-STaR took priority for hard questions |
| EVTE | ✅ Novel | When the 2B student fails, a 31B mentor must verify its own code before giving a diagnostic hint |
| EVTE-STaR | ✅ Research peak | Online micro-SFT every 15 verified mentor-assisted wins → Micro-1 checkpoint |

EVTE = Execution-Verified Tutor Escalation. EVTE-STaR = Self-Taught Reasoner with online weight updates instead of one offline train at the end.
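The two mechanisms defined above can be sketched in a few lines: a gate that releases a mentor hint only after the mentor's own code has been verified, and a buffer that triggers a micro-SFT update every 15 verified wins. All names here (`mentor_hint`, `MicroSFTBuffer`, `train_step`) are hypothetical, and the training call is left as a caller-supplied function; this shows the bookkeeping only, under those assumptions.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


def mentor_hint(mentor_code_verified: bool, hint: str) -> Optional[str]:
    """EVTE gate: pass the hint on only if the mentor's own code was verified."""
    return hint if mentor_code_verified else None


@dataclass
class MicroSFTBuffer:
    """Collect verified mentor-assisted wins; update every `batch_size` wins."""
    train_step: Callable[[List[dict]], None]  # assumed: runs one micro-SFT update
    batch_size: int = 15
    pending: List[dict] = field(default_factory=list)
    updates: int = 0

    def add_win(self, episode: dict) -> bool:
        """Store one verified win; return True if it triggered an update."""
        self.pending.append(episode)
        if len(self.pending) < self.batch_size:
            return False
        self.train_step(list(self.pending))  # online weight update
        self.pending.clear()
        self.updates += 1
        return True
```

With the default `batch_size`, the first update fires on the 15th verified win, which matches the point at which the table places the Micro-1 checkpoint.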

Hackathon eval (30 problems × 3 models, T4)

Macro average = unweighted mean across DataBench (15), DSBench Excel (10), and mentor-hard (5).

| Model | DataBench | DSBench | Mentor-hard | Macro | Total |
|---|---|---|---|---|---|
| Base | 60.0% | 0.0% | 20.0% | 26.7% | 10/30 |
| SFT v1 ★ | 86.7% | 0.0% | 60.0% | 48.9% | 16/30 |
| EVTE Micro-1 | 80.0% | 0.0%* | 100.0% | 60.0% | 17/30 |

*DSBench official scorer = 0% for all models (letter vs dollar mismatch). Micro-1 Q15 computed the correct dollar value, so a value-aware macro average would be 63.3%.
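The macro and total columns can be reproduced from per-benchmark correct counts. The counts below are derived from the reported percentages and slice sizes (for example, 86.7% of 15 DataBench problems is 13 correct); the function and variable names are illustrative.

```python
def macro_average(accuracy_by_benchmark: dict) -> float:
    """Unweighted mean of per-benchmark accuracies, in percent."""
    return sum(accuracy_by_benchmark.values()) / len(accuracy_by_benchmark)


# Slice sizes stated in the README.
SIZES = {"databench": 15, "dsbench": 10, "mentor_hard": 5}

# Correct counts derived from the reported percentages.
CORRECT = {
    "Base": {"databench": 9, "dsbench": 0, "mentor_hard": 1},
    "SFT v1": {"databench": 13, "dsbench": 0, "mentor_hard": 3},
    "EVTE Micro-1": {"databench": 12, "dsbench": 0, "mentor_hard": 5},
}


def report(counts: dict) -> tuple:
    """Return (macro average in percent, total correct) for one model."""
    accuracy = {name: 100.0 * counts[name] / SIZES[name] for name in SIZES}
    return macro_average(accuracy), sum(counts.values())


for model, counts in CORRECT.items():
    macro, total = report(counts)
    print(f"{model}: macro={macro:.1f}% total={total}/{sum(SIZES.values())}")

# Value-aware variant: credit Micro-1 with the one DSBench answer (Q15)
# whose dollar value was correct.
value_aware, _ = report({"databench": 12, "dsbench": 1, "mentor_hard": 5})
print(f"EVTE Micro-1 (value-aware): macro={value_aware:.1f}%")
```

Because the mean is unweighted, the five mentor-hard problems count as much as the fifteen DataBench problems. That is why Micro-1 leads on macro average while trailing SFT v1 on DataBench.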

Always pair accuracy with exec_ok: base can match easy booleans via answer tags while running 0% of its code.
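One way to report the two numbers side by side is sketched below. The record fields `correct` and `exec_ok` and the sample records are assumptions for illustration; they are not real eval data.

```python
def summarize(results: list) -> dict:
    """Report accuracy next to the share of problems whose code actually ran."""
    n = len(results)
    return {
        "accuracy": sum(r["correct"] for r in results) / n,
        "exec_ok": sum(r["exec_ok"] for r in results) / n,
        # Correct answers that were never backed by a successful execution.
        "correct_without_exec": sum(
            r["correct"] and not r["exec_ok"] for r in results
        ) / n,
    }


# Illustrative records only, not measured results.
sample = [
    {"correct": True, "exec_ok": False},   # guessed a boolean, ran nothing
    {"correct": True, "exec_ok": True},
    {"correct": False, "exec_ok": False},
    {"correct": False, "exec_ok": True},
]
print(summarize(sample))
```

A non-zero `correct_without_exec` flags answers that matched the ground truth without any code having run, which is the failure mode described above.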

Why SFT v1 for this demo (not Micro-1)

Micro-1 wins the macro average on paper (driven by 5/5 on mentor-hard). We still ship SFT v1 here:

  • Best DataBench breadth: 86.7% vs 80% (largest held-out slice)
  • Stable inference: single bulk SFT vs online micro-batch 1 (replay 100% vs saved ckpt ~60%)
  • Lower live-demo risk: less debug rambling and fewer dtype dumps
  • Held up under eval reruns: Micro-1 mentor-hard dropped when Modal stragglers overwrote the volume

Slides show all three models. Micro-1 is the EVTE-STaR research peak; SFT v1 is the production-shaped baseline.

What worked / what didn't

Worked: SFT v1 execution behavior · EVTE episode quality filter (92 curated mentor-assisted trajectories) · honest eval harness on real files

Didn't: Full GRPO in hackathon time · SFT v2 (recovery-only fine-tune taught debug prose, not answers) · EVTE-STaR batch 6 overtraining (40% mentor-hard vs Micro-1's 100%)

Models on Hugging Face

| Checkpoint | Repo | Role |
|---|---|---|
| Base | unsloth/gemma-4-E2B-it | Frozen foundation |
| SFT v1 ★ | DataSense-Modal-E2B-SFT | This Space |
| EVTE Micro-1 | DataSense-Modal-E2B-EVTE-Star-Micro1 | Best mentor-hard (research) |

Try the demo

Six one-click examples cover the sales, employees, and students CSVs; no upload is required. Hit Run DataSense and follow the live progress bar. Traces stream into the results panel so you can verify the model really executed code before reading the final answer.


DataSense E2B: Execution-verified, Tutor-escalation training for personal data science agents.
Built June 2026 · Full story · Demo video · LinkedIn post.