Roadside Gemma — E2B fine-tune for CDL pre-trip inspections

A LoRA fine-tune of unsloth/gemma-4-E2B-it that turns the base model into a voice-driven copilot for commercial-driver pre-trip vehicle inspections.

The model runs fully on-device on a modern Android/iOS phone via flutter_gemma and the LiteRT runtime — no network required, which matters because many truck yards and pre-trip inspection sites have little or no cellular coverage.

Built for the Gemma 4 Impact Challenge (May 2026). Project repo: github.com/jtmuller5/roadside-gemma.


What's in this repo

  • lora-adapter/ (948 MB) — PEFT LoRA adapter (r=128, α=128, all attention + MLP projections). Merge against unsloth/gemma-4-E2B-it.
  • litertlm/model.litertlm (4.8 GB) — deployment artifact for the LiteRT runtime, quantized dynamic_wi8_afp32. Drop into flutter_gemma directly.

The 9.6 GB merged BF16 checkpoint is omitted to keep the repo lean; it is reproducible by merging the LoRA adapter into the base model.


What the model actually does

The model is an agent with seven tools and a strict JSON tool-calling contract. It guides the driver step-by-step through the 7-category / 54-item canonical pre-trip inspection (cab, engine, brakes, lights, tires, trailer, coupling) and records OK / defect outcomes.

Tools surfaced to the model include:

  • get_next_step() — advance the inspection
  • query_inspection_item(step, item) — return DOT inspection criteria
  • mark_item_ok(step, item) — record a passing item
  • record_defect(step, item, severity, description) — record a defect
  • complete_inspection() — finalize and sign off
  • (plus refusal / clarification turns with no tool call)

The training corpus enforces a canonical (step, item) keyset; the model is trained to refuse off-topic requests and to ask for clarification rather than hallucinate a tool call.
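The flow the tools drive can be sketched as a tiny state machine. This is a minimal illustration, not the app's implementation: the three-item catalog and item names below are hypothetical stand-ins for the real 7-category / 54-item keyset.

```python
# Hypothetical excerpt of the canonical catalog (real one: 54 (step, item) pairs).
CATALOG = [
    ("cab", "seat_belt"),
    ("cab", "horn"),
    ("engine", "oil_level"),
]

class Inspection:
    """Cursor over the ordered catalog, mirroring the tool contract above."""

    def __init__(self):
        self.cursor = 0
        self.results = {}                     # (step, item) -> "ok" | defect dict

    def get_next_step(self):
        if self.cursor >= len(CATALOG):
            return None                       # time to call complete_inspection()
        return CATALOG[self.cursor]

    def mark_item_ok(self, step, item):
        if (step, item) not in CATALOG:       # enforce the canonical keyset
            raise ValueError("non-canonical (step, item) pair")
        self.results[(step, item)] = "ok"
        self.cursor += 1

    def record_defect(self, step, item, severity, description):
        if (step, item) not in CATALOG:
            raise ValueError("non-canonical (step, item) pair")
        self.results[(step, item)] = {"severity": severity,
                                      "description": description}
        self.cursor += 1

insp = Inspection()
step, item = insp.get_next_step()
insp.mark_item_ok(step, item)
print(insp.get_next_step())  # ('cab', 'horn')
```

Refusal / clarification turns map to "no method call at all" in this picture, which is exactly the behavior the Cat 8 data teaches.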


Evaluation

30 hand-crafted prompts across 6 categories (5 each), scored against the expected tool name and key arguments. "Hard fail" (HF) = wrong tool, or no tool when one was required. "Soft fail" = right tool, wrong argument (e.g. wrong side of the vehicle).

Category        v3 (no refusal data)    v4 (this model)
ambiguous       0 / 5 (HF=5)            5 / 5 ✓
off_topic       1 / 5 (HF=4)            5 / 5 ✓
multi_intent    0 / 5 (HF=0)            4 / 5
mid_correction  1 / 5 (HF=0)            3 / 5
happy_path      2 / 5 (HF=0)            2 / 5 (HF=1)
stt_noisy       1 / 5 (HF=2)            2 / 5 (HF=1)
Total           5 / 30, HF=11           21 / 30, HF=2

Scores above are with the production app-injected opener ("Now checking <Item>. ...") in context. Without that context (worst case): 17 / 30, HF=4.

Remaining soft fails are mostly wrong-side args on dual-sided items (passenger_side vs driver_side).
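The rubric boils down to a small scoring function. A sketch, not the actual eval harness: `score_case` and the dict shapes are assumptions made for illustration.

```python
from typing import Optional

def score_case(expected: Optional[dict], actual: Optional[dict]) -> str:
    """expected/actual are {'tool': name, 'args': {...}}, or None for a
    refusal/clarification turn with no tool call."""
    if expected is None:                       # no tool should be called
        return "pass" if actual is None else "hard_fail"
    if actual is None or actual["tool"] != expected["tool"]:
        return "hard_fail"                     # wrong tool, or none when required
    if actual["args"] != expected["args"]:
        return "soft_fail"                     # right tool, wrong argument
    return "pass"

# A wrong-side argument on a dual-sided item is the typical soft fail:
exp = {"tool": "record_defect",
       "args": {"step": "tires", "item": "tread_depth_driver_side"}}
act = {"tool": "record_defect",
       "args": {"step": "tires", "item": "tread_depth_passenger_side"}}
print(score_case(exp, act))  # soft_fail
```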


The training journey (three bugs that looked like one)

v1 of this model failed hard (2/30 pass), and the debugging path is worth documenting because three independent problems combined to look like a single failure:

  1. Loss-mask bug. The initial training run computed loss over the full sequence including the ~700-token system prompt. With 173 rows sharing one prompt, the model "converged" by memorizing the prompt while never fitting the assistant tool-call tokens. Fixed by switching to unsloth.chat_templates.train_on_responses_only.
  2. Corpus pollution. The 31B teacher model used to synthesize the corpus hallucinated tool-call keys: 78 distinct (step, item) pairs in the data vs. 54 canonical pairs. 46 / 173 rows (27%) were polluted. Fixed by embedding the canonical catalog in the synthesis prompt and adding a validate_conversation() step that drops any row referencing a non-canonical pair.
  3. Missing refusal data. Even the clean v3 corpus had zero examples of "user asks something off-topic." The model called a tool every time because it had never seen what not calling one looked like. Fixed by adding Cat 8 to the synthesis pipeline: 40 conversations across ambiguity, off-topic, uncertainty, greetings, and acknowledgments β€” all producing text responses with no tool call.

Each fix in isolation was insufficient. v4 = all three.
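Fix #2's row filter can be sketched roughly as follows. CANONICAL_ITEMS is a made-up three-pair excerpt standing in for the 54-pair catalog, and the conversation schema is an assumption; the real validate_conversation() may differ.

```python
# Hypothetical excerpt; the real catalog has 54 canonical (step, item) pairs.
CANONICAL_ITEMS = {
    ("engine", "oil_level"),
    ("brakes", "slack_adjuster_driver_side"),
    ("tires", "tread_depth_passenger_side"),
}

def validate_conversation(conversation) -> bool:
    """Drop the row if any tool call references a non-canonical (step, item).
    This is what catches teacher-hallucinated keys before training."""
    for turn in conversation:
        call = turn.get("tool_call")
        if call is None:
            continue                           # plain text turn, nothing to check
        args = call.get("args", {})
        if "step" in args or "item" in args:
            if (args.get("step"), args.get("item")) not in CANONICAL_ITEMS:
                return False
    return True

good = [{"role": "assistant",
         "tool_call": {"name": "mark_item_ok",
                       "args": {"step": "engine", "item": "oil_level"}}}]
bad  = [{"role": "assistant",
         "tool_call": {"name": "mark_item_ok",
                       "args": {"step": "engine", "item": "flux_capacitor"}}}]
print(validate_conversation(good), validate_conversation(bad))  # True False
```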


Training recipe

  • Base: unsloth/gemma-4-E2B-it
  • Framework: Unsloth + TRL SFTTrainer
  • Adapter: LoRA r=128, α=128, dropout=0
  • Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
  • Loss mask: train_on_responses_only (assistant turns only)
  • Schedule: 8 epochs, cosine LR 1e-4, batch_size=4 × grad_accum=2 (effective 8)
  • Corpus: 380 synthetic conversations across 8 categories (340 task + 40 refusal), all teacher-generated against the canonical 54-item keyset
  • Hardware: 1× RTX 5090 (32 GB VRAM)
  • Final train loss: 0.155 mean (final batches ~0.01)
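What train_on_responses_only does, in spirit: labels outside assistant spans are set to -100 so cross-entropy ignores them, which is what keeps the shared ~700-token system prompt out of the loss. A stdlib-only illustration with made-up token ids, not Unsloth's actual implementation:

```python
IGNORE_INDEX = -100  # the label value cross-entropy ignores in PyTorch/TRL

def mask_labels(token_ids, assistant_spans):
    """Return labels equal to token_ids inside assistant turns and
    IGNORE_INDEX everywhere else (spans are (start, end), end exclusive)."""
    labels = [IGNORE_INDEX] * len(token_ids)
    for start, end in assistant_spans:
        labels[start:end] = token_ids[start:end]
    return labels

tokens = [11, 12, 13, 14, 15, 16, 17, 18]   # system+user prompt, then reply
labels = mask_labels(tokens, [(5, 8)])       # only tokens 5..7 are assistant
print(labels)  # [-100, -100, -100, -100, -100, 16, 17, 18]
```

The v1 bug was, in effect, `labels = token_ids` for the whole sequence, so the loss was dominated by the one prompt all 173 rows shared.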

Deployment

Android / iOS via flutter_gemma

import 'package:flutter_gemma/flutter_gemma.dart';

final gemma = FlutterGemmaPlugin.instance;
await gemma.modelManager.setModelPath('<path>/model.litertlm');
final model = await gemma.createModel(/* ... */);
final session = await model.createSession();

The .litertlm is quantized dynamic_wi8_afp32 — the ship recipe per the flutter_gemma notes. Recipes that quantize the LoRA matrices (e.g. wi4 at rank-128) erase the fine-tune.

PyTorch via PEFT

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("unsloth/gemma-4-E2B-it")
tok = AutoTokenizer.from_pretrained("unsloth/gemma-4-E2B-it")
model = PeftModel.from_pretrained(base, "jtmuller/roadside-gemma-e2b",
                                  subfolder="lora-adapter")
# Optional: fold the adapter into the base weights to reproduce the
# merged BF16 checkpoint mentioned above
merged = model.merge_and_unload()

Limitations & honest disclosure

  • Domain-narrow. This is a pre-trip inspection agent, not a general assistant. It will try to interpret most utterances as part of the inspection flow.
  • English only. Corpus is monolingual.
  • Dual-sided items are still soft. Expect occasional wrong-side args on tires, mirrors, lights.
  • Synthetic corpus. All training data is teacher-generated, not real driver transcripts. The Cat 5 (STT-noisy) category models speech recognition artifacts but isn't a substitute for real STT data.
  • Safety scope. This model assists with the inspection workflow. It does not replace a qualified driver's judgment about whether a vehicle is safe to operate.

License

  • LoRA adapter and .litertlm: released under the Gemma Terms of Use.
  • Synthesis prompts and code in the project repo: MIT.