# Roadside Gemma: E2B fine-tune for CDL pre-trip inspections
A LoRA fine-tune of `unsloth/gemma-4-E2B-it` that turns the base model into a voice-driven copilot for commercial-driver pre-trip vehicle inspections. The model runs fully on-device on a modern Android/iOS phone via `flutter_gemma` and the LiteRT runtime, with no network required, which matters because most truck yards and pre-trip inspection sites are cellular dead zones.
Built for the Gemma 4 Impact Challenge (May 2026). Project repo: [github.com/jtmuller5/roadside-gemma](https://github.com/jtmuller5/roadside-gemma).
## What's in this repo
| Path | What it is | Size |
|---|---|---|
| `lora-adapter/` | PEFT LoRA adapter (r=128, α=128, all attn + MLP projections). Merge against `unsloth/gemma-4-E2B-it`. | 948 MB |
| `litertlm/model.litertlm` | Deployment artifact for the LiteRT runtime. Quantized `dynamic_wi8_afp32`. Drop into `flutter_gemma` directly. | 4.8 GB |
The 9.6 GB merged BF16 model is reproducible by merging the LoRA adapter into the base; it is omitted to keep the repo lean.
## What the model actually does
The model is an agent with seven tools and a strict JSON tool-calling contract. It guides the driver step-by-step through the 7-category / 54-item canonical pre-trip inspection (cab, engine, brakes, lights, tires, trailer, coupling) and records OK / defect outcomes.
Tools surfaced to the model:

- `get_next_step()`: advance the inspection
- `query_inspection_item(step, item)`: return DOT inspection criteria
- `mark_item_ok(step, item)`: record a passing item
- `record_defect(step, item, severity, description)`: record a defect
- `complete_inspection()`: finalize and sign off
- (plus refusal / clarification turns with no tool call)
The training corpus enforces a canonical (step, item) keyset; the model is
trained to refuse off-topic asks and to ask for clarification rather
than hallucinate a tool call.
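The repo doesn't publish the exact wire format of the tool-calling contract, but the behavior described above can be sketched as a thin dispatch layer. The JSON shape (`name` / `args` keys) and the `dispatch` helper here are assumptions for illustration, not the app's actual code:

```python
import json

# Hypothetical registry mirroring the tools listed above.
TOOLS = {"get_next_step", "query_inspection_item", "mark_item_ok",
         "record_defect", "complete_inspection"}

def dispatch(model_output: str):
    """Parse one model turn. Returns ("tool", name, args) for a valid
    tool call, or ("text", model_output, None) for a refusal or
    clarification turn with no tool call."""
    try:
        call = json.loads(model_output)
    except json.JSONDecodeError:
        return ("text", model_output, None)  # plain-text turn, no tool
    name, args = call.get("name"), call.get("args", {})
    if name not in TOOLS:
        return ("text", model_output, None)  # unknown tool: treat as text
    return ("tool", name, args)

print(dispatch('{"name": "mark_item_ok", '
               '"args": {"step": "tires", "item": "tread_depth"}}'))
```

The key design point is that *anything* that fails to parse as a known tool call degrades to a text turn, which is exactly the behavior the refusal training data (Cat 8, below) teaches the model to exploit.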
## Evaluation
30 hand-crafted prompts across 6 categories (5 each). Scored against expected tool name + key args. "Hard fail" = wrong/no tool when one was required. "Soft fail" = right tool, wrong arg (e.g. wrong side of vehicle).
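The hard-fail / soft-fail rubric can be sketched as a small scoring function. The record shape and field names are my own; the repo's eval harness may differ:

```python
def score(expected: dict, actual: dict) -> str:
    """expected/actual: dicts with "tool" (str or None) and "args" (dict).
    Returns "pass", "soft_fail" (right tool, wrong args), or
    "hard_fail" (wrong/no tool when one was required)."""
    if actual.get("tool") != expected.get("tool"):
        return "hard_fail"
    if expected.get("tool") is None:
        return "pass"            # correctly produced no tool call
    if actual.get("args") == expected.get("args"):
        return "pass"
    return "soft_fail"           # e.g. wrong side of vehicle

# Right tool, wrong side arg -> soft fail
print(score({"tool": "mark_item_ok", "args": {"side": "driver_side"}},
            {"tool": "mark_item_ok", "args": {"side": "passenger_side"}}))
```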
| Category | v3 (no refusal data) | v4 (this model) |
|---|---|---|
| ambiguous | 0 / 5 (HF=5) | 5 / 5 ✓ |
| off_topic | 1 / 5 (HF=4) | 5 / 5 ✓ |
| multi_intent | 0 / 5 (HF=0) | 4 / 5 |
| mid_correction | 1 / 5 (HF=0) | 3 / 5 |
| happy_path | 2 / 5 (HF=0) | 2 / 5 (HF=1) |
| stt_noisy | 1 / 5 (HF=2) | 2 / 5 (HF=1) |
| Total | 5 / 30, HF=11 | 21 / 30, HF=2 |
Scores above are with the production app-injected opener ("Now checking <Item>. ...") in context. A no-context eval (the worst case) scores 17 / 30, HF=4.
Remaining soft fails are mostly wrong-side args on dual-sided items
(passenger_side vs driver_side).
## The training journey (why all three fixes matter)
v1 of this model failed hard (2/30 pass), and the debugging path is worth documenting because two independent bugs, plus a data gap, combined to make it look like a single failure:
- **Loss-mask bug.** The initial training run computed loss over the full sequence, including the ~700-token system prompt. With 173 rows sharing one prompt, the model "converged" by memorizing the prompt while never fitting the assistant tool-call tokens. Fixed by switching to `unsloth.chat_templates.train_on_responses_only`.
- **Corpus pollution.** The 31B teacher model used to synthesize the corpus hallucinated tool-call keys: 78 distinct `(step, item)` pairs in the data vs. 54 canonical pairs. 46 / 173 rows (27%) were polluted. Fixed by embedding the canonical catalog in the synthesis prompt and adding a `validate_conversation()` step that drops any row referencing a non-canonical pair.
- **Missing refusal data.** Even the clean v3 corpus had zero examples of "user asks something off-topic." The model called a tool every time because it had never seen what not calling one looked like. Fixed by adding Cat 8 to the synthesis pipeline: 40 conversations across ambiguity, off-topic, uncertainty, greetings, and acknowledgments, all producing text responses with no tool call.
Each fix in isolation was insufficient. v4 = all three.
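The corpus-pollution fix can be sketched as follows. `CANONICAL_PAIRS` here is a three-pair stand-in for the real 54-item catalog, and the row shape is assumed:

```python
# Stand-in for the canonical catalog (the real one has 7 steps / 54 items).
CANONICAL_PAIRS = {
    ("tires", "tread_depth"),
    ("lights", "left_turn_signal"),
    ("brakes", "slack_adjuster"),
}

def validate_conversation(rows):
    """Drop any row whose tool calls reference a non-canonical
    (step, item) pair, mirroring the synthesis-pipeline filter."""
    clean = []
    for row in rows:
        pairs = {(c["step"], c["item"]) for c in row["tool_calls"]}
        if pairs <= CANONICAL_PAIRS:   # every pair must be canonical
            clean.append(row)
    return clean

rows = [
    {"tool_calls": [{"step": "tires", "item": "tread_depth"}]},
    {"tool_calls": [{"step": "tires", "item": "left_hubcap"}]},  # hallucinated
]
print(len(validate_conversation(rows)))  # 1 of 2 rows survives
```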
## Training recipe
- Base: `unsloth/gemma-4-E2B-it`
- Framework: Unsloth + TRL `SFTTrainer`
- Adapter: LoRA r=128, α=128, dropout=0
- Target modules: `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`
- Loss mask: `train_on_responses_only` (assistant turns only)
- Schedule: 8 epochs, cosine LR 1e-4, batch_size=4 × grad_accum=2 (effective 8)
- Corpus: 380 synthetic conversations across 8 categories (340 task + 40 refusal), all teacher-generated against the canonical 54-item keyset
- Hardware: 1× RTX 5090 (32 GB VRAM)
- Final train loss: 0.155 mean (final batches ~0.01)
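Conceptually, what `train_on_responses_only` does is set the label to the ignore index (-100) everywhere except inside assistant turns, so the ~700-token shared system prompt contributes nothing to the loss. A minimal sketch with a made-up token sequence and span list (not Unsloth's actual implementation):

```python
IGNORE_INDEX = -100  # positions with this label are skipped by cross-entropy

def mask_labels(input_ids, assistant_spans):
    """Copy input_ids to labels, masking every position that is not
    inside an assistant turn. assistant_spans: list of (start, end)."""
    labels = [IGNORE_INDEX] * len(input_ids)
    for start, end in assistant_spans:
        labels[start:end] = input_ids[start:end]
    return labels

# 10-token toy sequence: positions 0-6 are system/user, 7-9 are assistant.
labels = mask_labels(list(range(10)), [(7, 10)])
print(labels)  # [-100, -100, -100, -100, -100, -100, -100, 7, 8, 9]
```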
## Deployment
### Android / iOS via flutter_gemma
```dart
import 'package:flutter_gemma/flutter_gemma.dart';

final gemma = FlutterGemmaPlugin.instance;
await gemma.modelManager.setModelPath('<path>/model.litertlm');
final session = await gemma.createModel(/* ... */);
```
The `.litertlm` is quantized `dynamic_wi8_afp32`, the ship recipe per the `flutter_gemma` notes. Recipes that quantize the LoRA matrices (e.g. `wi4` at rank-128) erase the fine-tune.
### PyTorch via PEFT
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("unsloth/gemma-4-E2B-it")
tok = AutoTokenizer.from_pretrained("unsloth/gemma-4-E2B-it")
model = PeftModel.from_pretrained(base, "jtmuller/roadside-gemma-e2b",
                                  subfolder="lora-adapter")
```
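Reproducing the 9.6 GB merged BF16 checkpoint mentioned above is a standard PEFT merge; a sketch continuing from the snippet above (the output directory name is arbitrary, and this requires downloading the full weights):

```python
# Fold the LoRA deltas into the base weights, then save a
# standalone checkpoint that no longer needs the adapter.
merged = model.merge_and_unload()
merged.save_pretrained("roadside-gemma-e2b-merged", safe_serialization=True)
tok.save_pretrained("roadside-gemma-e2b-merged")
```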
## Limitations & honest disclosure
- **Domain-narrow.** This is a pre-trip inspection agent, not a general assistant. It will try to interpret most utterances as part of the inspection flow.
- **English only.** Corpus is monolingual.
- **Dual-sided items are still soft.** Expect occasional wrong-side args on tires, mirrors, lights.
- **Synthetic corpus.** All training data is teacher-generated, not real driver transcripts. The Cat 5 (STT-noisy) category models speech recognition artifacts but isn't a substitute for real STT data.
- **Safety scope.** This model assists with the inspection workflow. It does not replace a qualified driver's judgment about whether a vehicle is safe to operate.
## License
- LoRA adapter and `.litertlm`: released under the Gemma Terms of Use.
- Synthesis prompts and code in the project repo: MIT.