DataSense E2B
The Full Story

How we set out to build a personal data-science agent — not a chatbot that pretends to run code, but one that writes Python, executes it, reads real errors, and verifies answers — and what we learned training Gemma-4-2B on Modal with methods we had to invent along the way.

Base: unsloth/gemma-4-E2B-it
Pipeline: Modal A100/T4
Team: DataSense E2B (Execution-verified, Tutor-escalation)

01 · The goal

The hackathon asked for something ambitious: take a small open model and make it genuinely useful for data work — exploring tables, cleaning messy columns, aggregating, joining, visualizing, and answering questions with verifiable correctness, not plausible prose.

Our north star was simple to state and hard to achieve:

North star

A 2B-parameter student agent that behaves like a junior data analyst: inspect schema first, run focused code steps, debug from real tracebacks, and only claim an answer after execution confirms it — with a training story credible enough for slides, papers, and a public Hugging Face demo.

Concretely, we targeted:

Fig 1 — Goal. We optimize for an agent that runs code on real data and verifies answers — not a model that prints plausible Answer: tags without executing anything.

02 · Where we started

The base model

We built on unsloth/gemma-4-E2B-it, Unsloth's packaging of Google's Gemma 4 E2B instruction-tuned model. In Gemma's naming, "E2B" refers to roughly 2B effective parameters, not to an execution feature. It's small enough to fine-tune on a single GPU, yet designed with code and tool use in mind. We used 4-bit quantization, LoRA rank 32, and a 2048-token context throughout.

Three Kaggle notebooks → one Modal app

The project began as three separate Kaggle notebooks covering supervised fine-tuning (SFT), GRPO reinforcement learning, and DPO preference optimization. We consolidated them into datasense_pipeline.py — a single Modal application with shared config in datasense_utils.py — so training could run unattended on cloud GPUs with checkpoints persisted to a Modal volume and pushed to Hugging Face.

Nine bugs we fixed before trusting any number

Early runs were misleading because the ported notebooks had latent bugs. We fixed all nine before building the pipeline:

# | Bug | Impact
1 | sft_warmup KeyError | SFT wouldn't start
2 | lora_target_modules KeyError | LoRA attach failed
3 | result_str UnboundLocalError | Agent loop crashed mid-rollout
4 | DPO pairs missing chat template prefix | Preference data malformed
5 | skip_special_tokens=False | Decode pollution in rewards
6 | Dead oci_sft_v1 variable | Confusing / broken cells
7 | GRPO max_steps hardcoded | Config ignored
8 | Shorter SYSTEM_PROMPT in DPO cell | Train/eval prompt drift
9 | _PROBLEM_LOOKUP naming mismatch | Dataset indexing broken

Day-one eval: 0% accuracy (and why that was informative)

Our first agent eval reported 0% accuracy for everyone — including SFT — while SFT already showed 100% execution success and ~5.6 agent steps vs base's 2% exec / 1.1 steps. That gap taught us the first big lesson: the model was learning to run code, but we weren't scoring against real data.

Root cause

Eval workspaces used synthetic random CSVs when DataBench parquet wasn't mounted, but ground truth came from the real dataset. The agent analyzed fake data and was graded against true answers — guaranteed 0%.

Fig 2 — The 0% eval bug. Early runs used random synthetic CSVs in the sandbox while ground truth came from real DataBench files — so even a good agent could never match.
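
A guard of roughly this shape would have surfaced the problem on day one. It is a minimal sketch, not our actual eval code: prepare_eval_workspace and MissingRealDataError are hypothetical names, and the only idea it illustrates is "fail loudly instead of falling back to synthetic data when grading against real ground truth".

```python
import shutil
from pathlib import Path


class MissingRealDataError(RuntimeError):
    """Raised when an eval would otherwise fall back to synthetic data."""


def prepare_eval_workspace(real_file: Path, workspace: Path) -> Path:
    """Copy the real dataset file into the sandbox workspace, or fail loudly.

    Hypothetical helper: there is deliberately no synthetic fallback here,
    because the ground truth is computed from the real file.
    """
    if not real_file.is_file():
        raise MissingRealDataError(
            f"{real_file} is not mounted; refusing to grade on synthetic data"
        )
    workspace.mkdir(parents=True, exist_ok=True)
    target = workspace / real_file.name
    shutil.copy2(real_file, target)
    return target
```

With a check like this, a missing mount stops the run with an error instead of producing a silent 0%.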

03 · The problem with naive finetuning

Most "data agent" demos finetune on static (question, code, answer) triples. The model learns to format responses that look like an agent — Answer: tags, pandas snippets, confident summaries — without ever closing the loop on execution.

We observed three failure modes immediately:

Formatter, not agent

Base Gemma-4 could score well on easy boolean questions by emitting answer tags in a single turn — 0% code execution — beating SFT on accuracy while doing none of the work.

Hallucinated execution

Models invent <result> blocks with fake stdout. RL rewards on text alone reinforce the illusion of competence.

The fix wasn't "more SFT data." It was changing what we optimize and measure: real subprocess execution, multi-turn observe→fix→retry, and verifiers that compare parsed answers to typed ground truth (boolean, number, category, list types).

04 · The DataSense agent loop

Every training rollout and eval episode follows the same production-shaped loop:

THINK → EXPLORE → EXECUTE → DEBUG → ANSWER
  1. THINK: inspect schema, dtypes, nulls before analysis
  2. EXPLORE: head(), describe(), small SQL LIMIT queries
  3. EXECUTE: one focused Python step; read real <result> from sandbox
  4. DEBUG: fix column names, joins, dtypes from tracebacks
  5. ANSWER: emit Answer: and Summary: only after verified execution

The system prompt (shared across train, eval, and this HF demo) explicitly forbids hallucinated APIs and requires the final printed value to match the answer tag. For DataBench we mount real sample.parquet into the workspace; for DSBench we copy .xlsx workbooks and use inspect_source for Excel structure.
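
The EXECUTE step can be pictured as one subprocess call per code step. The sketch below is an assumption-laden illustration, not the project's sandbox: run_step and the _step.py file name are hypothetical, and a plain subprocess gives process isolation and a timeout, not a security boundary.

```python
import subprocess
import sys
from pathlib import Path


def run_step(code: str, workspace: Path, timeout_s: float = 30.0) -> dict:
    """Run one code step in a separate Python process inside `workspace`."""
    script = workspace / "_step.py"  # hypothetical scratch file name
    script.write_text(code, encoding="utf-8")
    try:
        proc = subprocess.run(
            [sys.executable, script.name],
            cwd=workspace,          # relative paths in the step resolve here
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return {"ok": False, "stdout": "", "stderr": f"timeout after {timeout_s}s"}
    # stdout/stderr are what the agent would read back as the real result
    return {"ok": proc.returncode == 0, "stdout": proc.stdout, "stderr": proc.stderr}
```

The returned stderr carries the real traceback, which is what the DEBUG step consumes on the next turn.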

Reward signal (simplified):
  + execution actually ran
  + stdout parseable
  + answer matches ground truth (typed comparator)
  − hallucinated inline <result> without [EXEC:real]
  − debug rambling / column dumps as "answers"
Fig 3 — Agent loop. Every rollout follows the same multi-step cycle: inspect, run code, read real output, debug, then answer.
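
To make the simplified reward signal concrete, here is a sketch of how such terms could be combined. The weights are purely illustrative assumptions (the real values are not given in this write-up), and trajectory_reward and StepSignals are hypothetical names, not the signature of compute_trajectory_reward().

```python
from dataclasses import dataclass


@dataclass
class StepSignals:
    executed_for_real: bool      # code actually ran in the sandbox
    stdout_parseable: bool       # final stdout could be parsed into a value
    answer_correct: bool         # typed comparator matched ground truth
    hallucinated_result: bool    # inline <result> without [EXEC:real]
    debug_dump_as_answer: bool   # rambling or column dumps offered as the answer


def trajectory_reward(s: StepSignals, require_real_execution: bool = True) -> float:
    """Combine the signals with illustrative (assumed) weights."""
    reward = 0.0
    if s.executed_for_real:
        reward += 0.2
    if s.stdout_parseable:
        reward += 0.1
    # With the gate on, a correct-looking answer earns nothing unless code really ran.
    if s.answer_correct and (s.executed_for_real or not require_real_execution):
        reward += 1.0
    if s.hallucinated_result:
        reward -= 0.5
    if s.debug_dump_as_answer:
        reward -= 0.3
    return reward
```

The point of the gate is that a "formatter" that prints a correct tag without executing anything ends up with a negative reward, not a positive one.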

05 · Training pipeline: SFT → GRPO → DPO

Our planned stack mirrors modern agent training — with execution at every stage:

SFT → GRPO → DPO → Eval

Stage 1 — Supervised fine-tuning (SFT v1) ✅

Bulk SFT on DataBench-style traces plus agent supplements: multi-turn dialogs, Jupyter-agent traces, dashboard examples, and code-feedback execution pairs. This produced our strongest baseline — sanjaymalladi/DataSense-Modal-E2B-SFT.

Stage 2 — GRPO (execution-grounded RL) ⚠️ partial

Group Relative Policy Optimization with real Python rollouts per prompt. Each step spawns multiple agent trajectories; rewards use compute_trajectory_reward() with require_real_execution=True.
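
The "group relative" part of GRPO can be sketched in a few lines: each rollout's reward is normalized against the other rollouts for the same prompt. This is a generic illustration of the idea, not our trainer code; it uses the population standard deviation, and whether a given trainer uses population or sample statistics is an implementation detail we do not specify here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one prompt's group of rollouts."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]
```

If every rollout in a group earns the same reward, all advantages are zero and that prompt contributes no gradient signal, which is one reason execution-verified rewards with real variance matter.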

GRPO on Gemma-4 is brutally slow (~11 min/step on A100) because most wall time is CPU-bound execution, not GPU matmul — 4 rollouts × up to 5 agent steps × subprocess sandboxing. We fixed trajectory forwarding bugs, KL instability (final_logit_softcapping=30), and added parallel rollout workers — but full 300-step GRPO remained impractical within hackathon time. A shortened 100-step run was targeted.

Stage 3 — DPO ⏸️ deferred

Preference pairs from high vs low reward rollouts (min gap 0.15) — planned but deprioritized once EVTE-STaR showed more promise for hard-question gains within our compute budget.
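
Since DPO was deferred, the following is only a sketch of how the planned pair construction could look. build_preference_pairs and the dictionary keys are hypothetical; the one detail taken from the plan above is the minimum reward gap of 0.15.

```python
from itertools import combinations


def build_preference_pairs(rollouts: list[tuple[str, float]], min_gap: float = 0.15) -> list[dict]:
    """Pair rollouts for one prompt; `rollouts` holds (completion_text, reward)."""
    pairs = []
    for (text_a, r_a), (text_b, r_b) in combinations(rollouts, 2):
        if abs(r_a - r_b) < min_gap:
            continue  # too close to call; skip ambiguous preferences
        chosen, rejected = (text_a, text_b) if r_a > r_b else (text_b, text_a)
        pairs.append({"chosen": chosen, "rejected": rejected})
    return pairs
```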

Fig 4 — Training stages. SFT v1 shipped and works. Full GRPO was execution-bound and slow. DPO was deferred in favor of EVTE-STaR.

06 · Supporting infrastructure (not EVTE itself)

Before EVTE could work, we needed execution-grounded rollouts, typed verifiers, and honest eval. These are the plumbing; the novel research contribution is EVTE + EVTE-STaR (sections 07–11 below).

Execution-grounded rollouts

Every GRPO/DPO/EVTE trajectory runs code in an isolated workspace. Rewards ignore fake <result> tags unless tagged [EXEC:real].
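
A sketch of the filtering idea, under a stated assumption: we assume here that a real block carries the [EXEC:real] marker inside the <result> body, which may not match the exact tag layout in the pipeline. split_result_blocks is a hypothetical helper name.

```python
import re

RESULT_BLOCK = re.compile(r"<result>(.*?)</result>", re.DOTALL)
REAL_MARKER = "[EXEC:real]"  # assumed to sit inside the block body


def split_result_blocks(transcript: str) -> tuple[list[str], list[str]]:
    """Separate sandbox-produced result blocks from model-invented ones."""
    real, fake = [], []
    for body in RESULT_BLOCK.findall(transcript):
        (real if REAL_MARKER in body else fake).append(body.strip())
    return real, fake
```

For this to be trustworthy, the marker has to be attached by the sandbox in a turn the model does not author; a marker the model can simply type would defeat the check.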

Typed answer verification (databench_compare + neural verifier)

Evidence-bound scoring chain: exec stdout → Answer: tag → LLM extract → typed compare (boolean, float, category, list[category], list[number]). Without this, mentors "fail" when extraction fails, not when reasoning fails.
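
The last link in that chain, the typed compare, can be sketched as below. This is not databench_compare itself: typed_match is a hypothetical name, and the 1% relative tolerance and order-insensitive list comparison are assumptions made for illustration.

```python
import math


def _to_bool(value) -> bool:
    text = str(value).strip().lower()
    if text in {"true", "yes", "1"}:
        return True
    if text in {"false", "no", "0"}:
        return False
    raise ValueError(f"not a boolean: {value!r}")


def _norm(value) -> str:
    return str(value).strip().lower()


def typed_match(pred, truth, answer_type: str, rel_tol: float = 1e-2) -> bool:
    """Compare a parsed prediction to ground truth according to its declared type."""
    try:
        if answer_type == "boolean":
            return _to_bool(pred) == _to_bool(truth)
        if answer_type == "number":
            return math.isclose(float(pred), float(truth), rel_tol=rel_tol, abs_tol=1e-9)
        if answer_type == "category":
            return _norm(pred) == _norm(truth)
        if answer_type == "list[category]":
            return sorted(_norm(p) for p in pred) == sorted(_norm(t) for t in truth)
        if answer_type == "list[number]":
            a, b = sorted(float(p) for p in pred), sorted(float(t) for t in truth)
            return len(a) == len(b) and all(
                math.isclose(x, y, rel_tol=rel_tol, abs_tol=1e-9) for x, y in zip(a, b)
            )
    except (TypeError, ValueError):
        return False  # unparseable prediction counts as a miss, not a crash
    raise ValueError(f"unknown answer type: {answer_type}")
```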

Lite eval & hackathon harness

DataBench lite scores against sample_answer on mounted parquet. run_hackathon_benchmarks_parallel runs Base / SFT / Micro-1 across three benchmarks on T4.

07 · EVTE — Execution-Verified Tutor Escalation

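EVTE is the method we built when classical distillation and STaR broke down for data agents. The name encodes three commitments:

  1. Execution-Verified: student and mentor must both pass the same execution verifier
  2. Tutor: the mentor gives a diagnostic hint, never the final answer or a full script
  3. Escalation: the mentor steps in only after the student's first attempt and self-recovery rounds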

Why we needed EVTE

Classical STaR (Self-Taught Reasoner) assumes a strong teacher can produce correct reasoning chains, filter them, and fine-tune the student offline. That fails for DataSense because:

  1. Our 2B student often can't solve list/category questions at all
  2. Our 31B mentor also fails verification on the hardest 5 problems (~40% mentor-hard pool)
  3. Even when code is right, answer extraction fails (no tag, wrong stdout parse)
  4. Distilling final answers teaches memorization; we need debugging under execution constraints

The five-phase episode (EVTE and EVTE-STaR share this skeleton)

Implemented in datasense_evte.py: run_evte_episode (offline collection) and run_evte_star_episode (online training).

Phase 1 · Student first attempt

2B student, up to 5 agent steps, real workspace (CSV/parquet/xlsx). Scored via score_rollout().

Phase 2 · Self-recovery feedback

Up to 3 rounds of build_self_recovery_feedback() — real tracebacks, answer withheld.

Phase 3 · Mentor independent verify

31B mentor solves in a fresh workspace; must pass the same verifier before any hint.

Phase 4 · Diagnostic mentor hint

generate_mentor_hint() under MENTOR_HINT_SYSTEM — no final answer, no full script.

Phase 5 · Post-hint student

Up to 2 attempts × 5 steps. Episode saved only if student verifies after reading the hint.

Fig 5 — EVTE in five phases. Student tries → self-recovery → mentor must verify independently → diagnostic hint → student retries. Only verified post-hint wins become training data.

run_evte_star_episode (simplified control flow, Python-style pseudocode):

  student_rollout = phase_1_student()
  if clean_first_try_verified and not messy_recovery_in_trace:
      return SKIP  # already knows it; not trainable in STaR mode

  if not verified:
      for _ in range(3):  # up to 3 self-recovery rounds
          add_user(build_self_recovery_feedback())  # EVTE feedback, answer withheld
          student_rollout = student_retry()
          if verified:
              break

  mentor_ok, mentor_rollout = mentor_verify_solution(
      student_rollout=student_rollout  # mentor sees the student's latest code
  )
  if not mentor_ok:
      return DISCARD  # mentor_unverified; no training signal

  hint = generate_mentor_hint(student_rollout, mentor_rollout)
  add_user("[MENTOR] " + hint)  # diagnostic only

  for _ in range(2):  # up to 2 post-hint attempts
      student_rollout = student_retry()
      if verified:
          return SAVE_TRAINABLE_EPISODE  # mentor_assisted

  return NOT_SAVED  # student never verified after the hint

Hard-first curriculum

_prioritize_evte_problems() sorts list[category], list[number], and multi-answer types before easy booleans. EVTE compute is expensive (two models × multi-step agents); we spend it where SFT v1 plateaus.
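
A hard-first ordering can be as small as a stable sort on answer type. The sketch below is not _prioritize_evte_problems() itself: prioritize, the "type" key, and the middle rank given to single category/number answers are assumptions; only "list types before booleans" comes from the description above.

```python
# Lower rank = attempted earlier. Middle rank for scalar types is an assumption.
TYPE_RANK = {
    "list[category]": 0,
    "list[number]": 0,
    "category": 1,
    "number": 1,
    "boolean": 2,
}


def prioritize(problems: list[dict]) -> list[dict]:
    """Stable sort: hard answer types first, original order kept within a rank."""
    return sorted(problems, key=lambda p: TYPE_RANK.get(p["type"], 1))
```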

Mentor hardware choreography

Student (2B) and mentor (31B) don't fit comfortably together on one A100. The STaR loop uses on_micro_batch hooks to unload mentor → micro-SFT student → reload mentor every 15 episodes. Progress persists to evte_star_progress.json with resume support.
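
Resume support mostly comes down to writing the progress file atomically, so a preempted job never leaves a half-written JSON behind. A minimal sketch, assuming hypothetical field names (episodes_done, micro_batches) that may differ from the real evte_star_progress.json:

```python
import json
import os
from pathlib import Path

DEFAULT_STATE = {"episodes_done": 0, "micro_batches": 0}  # hypothetical fields


def save_progress(path: Path, state: dict) -> None:
    """Write to a temp file, then swap it in with an atomic rename."""
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(state), encoding="utf-8")
    os.replace(tmp, path)


def load_progress(path: Path) -> dict:
    """Return saved progress, or a fresh state if no file exists yet."""
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return dict(DEFAULT_STATE)
```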

08 · EVTE feedback — self-recovery without answer leakage

The most underrated piece of EVTE is not the mentor — it's what we put in the user turn when the student fails. This is build_self_recovery_feedback() in datasense_evte.py.

Fig 6 — Self-recovery feedback. The student sees wrong predictions, last code, and real tracebacks — never the correct answer.
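
The shape of such a feedback turn can be sketched as follows. This is an illustration, not build_self_recovery_feedback() itself: recovery_feedback and its wording are hypothetical. The design point is that the function never receives the ground truth, so it cannot leak it.

```python
from typing import Optional


def recovery_feedback(
    predicted: Optional[str],
    last_code: str,
    traceback_text: str,
    max_tb_chars: int = 1500,
) -> str:
    """Build a user turn from real evidence only; ground truth is not an input."""
    lines = ["Your previous attempt did not verify."]
    if predicted is None:
        lines.append("No Answer: tag was found in your reply.")
    else:
        lines.append(f"Your answer was: {predicted} (incorrect).")
    lines.append("Last code you ran:")
    lines.append(last_code.strip())
    if traceback_text.strip():
        lines.append("Real traceback from the sandbox:")
        # Keep the tail: the exception type and message are at the end.
        lines.append(traceback_text.strip()[-max_tb_chars:])
    lines.append("Re-inspect the schema, fix the code, and answer again.")
    return "\n".join(lines)
```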
Messy success = verified answer but conversation contains debug/recovery language (trajectory_has_recovery_signal()). We don't want to reinforce "stumble into correctness" without tutor review in STaR mode.

Why SFT v2 failed — feedback without balance

When we later fine-tuned only on recovery trajectories (SFT v2), the model learned the shape of debug prose — dtype dumps, column lists — without improving verified answers. Lesson: self-recovery feedback is essential during collection, but training must mix clean completions with mentor-assisted wins, not recovery-only soup.

09 · Mentor verify & hint protocol

The mentor is google/gemma-4-31B-it (4-bit via Unsloth). It is not an oracle that whispers answers. It must earn the right to hint by passing the same execution verifier as the student.

Fig 7 — Mentor gate. The 31B mentor must verify its own solution by running code before it may give a hint — and the hint must not leak the final answer.

Mentor retry modes

Mode | Behavior | Config
series | Same conversation; temps ramp 0.4 → 0.65 → 0.85 | evte_mentor_retry_mode=series
parallel | 3 independent workspaces; first verified wins; temps [0.2, 0.5, 0.7] | evte_mentor_retry_mode=parallel
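
The two retry modes differ mainly in scheduling, which the sketch below illustrates. mentor_retry and the attempt callable are hypothetical stand-ins for the real mentor rollout; the stub does not model the shared conversation of series mode, only the temperature schedules listed above.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Optional

SERIES_TEMPS = (0.4, 0.65, 0.85)
PARALLEL_TEMPS = (0.2, 0.5, 0.7)

# Hypothetical: runs one mentor attempt, returns a verified solution or None.
Attempt = Callable[[float], Optional[str]]


def mentor_retry(attempt: Attempt, mode: str) -> Optional[str]:
    if mode == "series":
        for temp in SERIES_TEMPS:  # ramp temperature only after a failure
            solution = attempt(temp)
            if solution is not None:
                return solution
        return None
    if mode == "parallel":
        with ThreadPoolExecutor(max_workers=len(PARALLEL_TEMPS)) as pool:
            futures = [pool.submit(attempt, t) for t in PARALLEL_TEMPS]
            for fut in as_completed(futures):
                solution = fut.result()
                if solution is not None:
                    return solution  # first verified attempt wins
        return None
    raise ValueError(f"unknown retry mode: {mode}")
```

In this sketch the parallel branch still waits for the remaining attempts to finish when the pool closes; cancelling stragglers early would need extra plumbing.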

10 · EVTE-STaR — online Self-Taught Reasoner with micro-SFT

EVTE-STaR combines EVTE episode collection with online weight updates. Classical STaR: collect all successes → train offline once. EVTE-STaR: collect 15 verified mentor-assisted wins → micro-SFT 30 steps → student is slightly better → repeat.

Fig 8 — EVTE-STaR online loop. Every 15 mentor-assisted wins → 30-step micro-SFT at low LR → student continues on harder problems with nudged weights.

The overtraining curve (batches 2–3 vs batch 6)

Micro-batch 1 replay in RAM scored 100% on mentor-hard (5 problems). The saved Micro-1 checkpoint scored ~60% on a confirmatory rerun. Replay of batches 2–3 scored ~80%. The final batch 6 checkpoint scored ~40%, worse than SFT v1's 60%.

Lesson

Online micro-SFT needs early stopping on a held-out hard set, not "more batches = better." We only preserved micro-1 and final checkpoints on the volume — sweet-spot batches 2–3 were lost until run_micro_replay_eval reconstructed them in RAM.
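
What we would add next time is a patience-based stop on the held-out hard set. The sketch below is a generic early-stopping rule, not code from the pipeline; pick_checkpoint and the patience value are assumptions.

```python
def pick_checkpoint(scores: list[float], patience: int = 2) -> tuple[int, int]:
    """Return (best_batch, stop_batch), both 1-based.

    `scores[i]` is the held-out accuracy after micro-batch i+1. Training stops
    once `patience` consecutive batches fail to beat the best score so far.
    """
    best_idx, best = 0, float("-inf")
    for i, score in enumerate(scores, start=1):
        if score > best:
            best_idx, best = i, score
        elif i - best_idx >= patience:
            return best_idx, i
    return best_idx, len(scores)
```

The rule only helps if the checkpoint for every micro-batch is persisted, so the best one can actually be restored after the stop.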

11 · Episode outcomes & trainability gates

Every episode ends in exactly one outcome. The outcome determines whether it enters training.

Outcome | Meaning | EVTE-STaR: train?
self_solved_clean | First-try verified, no recovery signals in trace | Skip
self_recovered | Fixed via self-recovery feedback only | Optional
mentor_assisted | Failed → mentor verified → hint → student verified | Yes
discarded | Mentor couldn't pass execution verifier | No
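
The gate in the table reduces to a small lookup. A sketch, with the include_self_recovered flag as a hypothetical way to express the "Optional" row:

```python
from enum import Enum


class Outcome(str, Enum):
    SELF_SOLVED_CLEAN = "self_solved_clean"
    SELF_RECOVERED = "self_recovered"
    MENTOR_ASSISTED = "mentor_assisted"
    DISCARDED = "discarded"


def is_trainable(outcome: Outcome, include_self_recovered: bool = False) -> bool:
    """Decide whether an episode enters the EVTE-STaR training buffer."""
    if outcome is Outcome.MENTOR_ASSISTED:
        return True
    if outcome is Outcome.SELF_RECOVERED:
        return include_self_recovered  # the "Optional" row in the table
    return False  # clean solves are skipped, discarded episodes carry no signal
```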

12 · What worked

✅ SFT v1 — real execution behavior

SFT v1 consistently runs real Python (100% exec on many evals), uses ~4–5 agent steps, and beats base on hard questions, whereas base's "wins" on easy questions come without running any code. This is the behavioral foundation everything else builds on.

✅ EVTE episode quality filter

Saving only mentor-assisted verified trajectories produced high-signal data — multi-turn debug with real errors, not synthetic Q/A. 92 episodes is small but curated.

13 · What didn't work

❌ Full GRPO within hackathon time

~11 min/step × hundreds of steps × execution-bound rollouts ≈ multi-day runs. Parallel rollout workers helped but couldn't change the fundamental CPU/GPU pipeline stall. vLLM isn't available for Gemma 4 E2B, so generation stays on HF generate.
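
The arithmetic behind "multi-day" is short enough to spell out, using only the figures quoted in this write-up (~11 min/step, 4 rollouts, up to 5 agent steps). The helper names are ours, for illustration.

```python
MINUTES_PER_STEP = 11       # observed on A100, approximate
ROLLOUTS_PER_STEP = 4
MAX_AGENT_STEPS = 5


def wall_clock_hours(grpo_steps: int) -> float:
    """Estimated wall time for a GRPO run of the given length."""
    return grpo_steps * MINUTES_PER_STEP / 60


def max_sandbox_runs_per_step() -> int:
    """Upper bound on subprocess executions inside one GRPO step."""
    return ROLLOUTS_PER_STEP * MAX_AGENT_STEPS
```

At these rates a full 300-step run is about 55 hours (roughly 2.3 days), and even the shortened 100-step run is about 18 hours, with up to 20 sandbox executions per step.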

❌ SFT v2 (recovery-only fine-tune)

Training only on EVTE recovery trajectories taught debug prose — column dtype dumps, rambling — without improving answers. Mentor-hard: 40% vs SFT v1's 60%.

14 · Evaluation results

Agent accuracy on real data files (lite DataBench parquet, DSBench Excel, mentor-hard pool). Macro average = unweighted mean across three benchmarks (30 problems). Always pair accuracy with exec_ok — base can match easy booleans via answer tags without running code.

Fig 9 — Hackathon eval suite. DataBench (15) + DSBench Excel (10) + mentor-hard (5) per model on T4.

Hackathon benchmark suite — final (first complete run)

Parallel eval: run_hackathon_benchmarks_parallel · 3× T4 · June 2026.

Model | DataBench (15) | DSBench (10) | Mentor-hard (5) | Macro avg | Total
Base | 60.0% | 0.0% | 20.0% | 26.7% | 10/30
SFT v1 | 86.7% | 0.0% | 60.0% | 48.9% | 16/30
EVTE Micro-1 | 80.0% | 0.0%* | 100.0% | 60.0% | 17/30

*DSBench official scorer = 0% for all models. Micro-1 Q15 computed $12,829,511, which corresponds to option A (the correct choice), but it was graded wrong because the scorer compares option letters, not dollar values. A value-aware DSBench scorer would give Micro-1 1/10 (macro 63.3%).
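
The macro averages can be rechecked from the per-benchmark counts. The counts below are derived from the reported percentages and benchmark sizes (for example 86.7% of 15 is 13); macro_average is an illustrative helper, not part of the harness.

```python
SIZES = {"DataBench": 15, "DSBench": 10, "Mentor-hard": 5}

CORRECT = {
    "Base": {"DataBench": 9, "DSBench": 0, "Mentor-hard": 1},
    "SFT v1": {"DataBench": 13, "DSBench": 0, "Mentor-hard": 3},
    "EVTE Micro-1": {"DataBench": 12, "DSBench": 0, "Mentor-hard": 5},
}


def macro_average(correct: dict) -> float:
    """Unweighted mean of per-benchmark accuracy, as a percentage."""
    accs = [correct[name] / size for name, size in SIZES.items()]
    return 100 * sum(accs) / len(accs)


# Value-aware variant: credit Micro-1 for the Q15 dollar amount on DSBench.
micro1_value_aware = dict(CORRECT["EVTE Micro-1"], DSBench=1)
```

Because the mean is unweighted, the 5-problem mentor-hard pool counts as much as the 15-problem DataBench slice, which is why Micro-1 leads on macro average while SFT v1 leads on DataBench.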

Earlier standalone evals (sanity checks)

Eval | Base | SFT v1 | Micro-1 / SFT v2
Quick DataBench (5) | 80% acc / 0% exec | 80% / 100% exec | SFT v2: 40%
Mentor-hard (5) | 40% / 0% exec | 60% / 100% exec | Micro-1 replay: 100% (RAM); saved ckpt ~60%

How to read DSBench

Models often run code (50–100% exec_ok) but return dataframe strings, 0.0, or dollar amounts that map to the wrong MCQ letter. Only one case (Micro-1 Q15) was a true scoring-format bug. DSBench 0% is mostly real Excel/parsing failure, not a broken metric.
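
A value-aware scorer for cases like Q15 could map a numeric answer back to its option letter before comparing. This is a sketch of the idea, not the DSBench scorer: value_aware_letter is a hypothetical name, the tolerance is an assumption, and it ignores negative amounts for brevity.

```python
import math
import re
from typing import Optional

NUMBER = re.compile(r"\d[\d,]*(?:\.\d+)?")


def value_aware_letter(
    prediction: str, options: dict[str, float], rel_tol: float = 1e-3
) -> Optional[str]:
    """Return the option letter a prediction refers to, by letter or by value."""
    text = prediction.strip()
    if text.upper() in options:
        return text.upper()  # already a letter
    match = NUMBER.search(text)
    if match is None:
        return None
    value = float(match.group().replace(",", ""))
    for letter, option_value in options.items():
        if math.isclose(value, option_value, rel_tol=rel_tol):
            return letter
    return None
```

With the option values supplied by the benchmark item, a prediction such as "$12,829,511" would resolve to the letter whose value it matches, and the usual letter comparison can proceed from there.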

15 · Why SFT v1 for the live demo (not Micro-1)

Micro-1 wins macro average (60% vs 48.9%) on paper — driven by a perfect 5/5 on mentor-hard. We still ship SFT v1 on this Hugging Face Space. Here's why:

Factor | SFT v1 | EVTE Micro-1
DataBench (breadth) | 86.7%, best on the largest held-out slice | 80.0%
Mentor-hard (depth) | 60% (3/5), 100% exec | 100% (5/5) on first complete run
Stability | Single bulk SFT; predictable at inference | Online micro-SFT batch 1; replay 100% vs saved ckpt ~60%
Straggler reruns | Held up when Modal overwrote volume | Mentor-hard dropped to 60% on duplicate run
Live demo risk | Lower: fewer debug rambles and dtype dumps | Higher: tuned on hard pool, can overfit quirks
Story on slides | “Execution-grounded baseline that works” | “EVTE-STaR peak: best hard-pool result”

Decision

Gradio Space → SFT v1 (sanjaymalladi/DataSense-Modal-E2B-SFT) for reliable live CSV demos.
Slides → show all three models; cite Micro-1 as evidence EVTE-STaR helps on the hard curated pool, not as the production default yet.

16 · Benchmark suite

Benchmark | Problems | What it tests | Status
DataBench test (lite) | 15 | SemEval-style QA on real parquet samples | integrated
DSBench analysis | 10 | ModelOff Excel financial modeling | integrated
Mentor-hard | 5 | Curated EVTE failures | integrated

17 · Model checkpoints on Hugging Face

Checkpoint | HF repo | Role
Base | unsloth/gemma-4-E2B-it | Frozen foundation
SFT v1 ★ demo | DataSense-Modal-E2B-SFT | Live HF Space adapter, stable execution
EVTE-STaR Micro-1 | DataSense-Modal-E2B-EVTE-Star-Micro1 | Best mentor-hard (5/5), research checkpoint

18 · This Hugging Face demo

The Gradio app runs SFT v1 — same agent loop as training eval: load CSV → multi-step code generation → sandbox execution → Answer + Summary. Six built-in examples cover sales, employees, and students datasets.

Same loop as eval. Upload this hf_demo/ folder to a Gradio Space (GPU T4), set HF_TOKEN if needed.

Deploy checklist

  1. Create Space (Gradio, gpu-t4) — see README.md frontmatter
  2. Upload hf_demo/ including assets/illustrations/ and story.html
  3. Secret HF_TOKEN if adapter repo is private
  4. Smoke-test all 6 examples

Future work