# Roadside Gemma: E2B fine-tune for CDL pre-trip inspections
A LoRA fine-tune of `unsloth/gemma-4-E2B-it` that turns the base model into a voice-driven copilot for commercial-driver pre-trip vehicle inspections. The model runs fully on-device on a modern Android/iOS phone via `flutter_gemma` and the LiteRT runtime, with no network required, which matters because most truck yards and pre-trip inspection sites are cellular dead zones.
Built for the Gemma 4 Impact Challenge (May 2026). Project repo: [github.com/jtmuller5/roadside-gemma](https://github.com/jtmuller5/roadside-gemma).
## What's in this repo
| Path | What it is | Size |
|---|---|---|
| `lora-adapter/` | PEFT LoRA adapter (r=128, α=128, all attn + MLP projections). Merge against `unsloth/gemma-4-E2B-it`. | 948 MB |
| `litertlm/model.litertlm` | Deployment artifact for the LiteRT runtime. Quantized `dynamic_wi8_afp32`. Drop into `flutter_gemma` directly. | 4.8 GB |
The 9.6 GB merged BF16 model is reproducible by merging the LoRA adapter into the base; it is omitted to keep the repo lean.
## What the model actually does
The model is an agent with seven tools and a strict JSON tool-calling contract. It guides the driver step-by-step through the 7-category / 54-item canonical pre-trip inspection (cab, engine, brakes, lights, tires, trailer, coupling) and records OK / defect outcomes.
Tools surfaced to the model:

- `get_next_step()`: advance the inspection
- `query_inspection_item(step, item)`: return DOT inspection criteria
- `mark_item_ok(step, item)`: record a passing item
- `record_defect(step, item, severity, description)`: record a defect
- `complete_inspection()`: finalize and sign off
- (plus refusal / clarification turns with no tool call)
The training corpus enforces a canonical (step, item) keyset; the model is
trained to refuse off-topic asks and to ask for clarification rather
than hallucinate a tool call.
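The repo doesn't publish the exact wire format of the tool-calling contract, but the behavior described above can be sketched as a thin dispatch layer. The JSON shape (`name` / `args` keys) and the `dispatch` helper here are assumptions for illustration, not the app's actual code:

```python
import json

# Hypothetical registry mirroring the tools listed above.
TOOLS = {"get_next_step", "query_inspection_item", "mark_item_ok",
         "record_defect", "complete_inspection"}

def dispatch(model_output: str):
    """Parse one model turn. Returns ("tool", name, args) for a valid
    tool call, or ("text", model_output, None) for a refusal or
    clarification turn with no tool call."""
    try:
        call = json.loads(model_output)
    except json.JSONDecodeError:
        return ("text", model_output, None)  # plain-text turn, no tool
    name, args = call.get("name"), call.get("args", {})
    if name not in TOOLS:
        return ("text", model_output, None)  # unknown tool: treat as text
    return ("tool", name, args)

print(dispatch('{"name": "mark_item_ok", '
               '"args": {"step": "tires", "item": "tread_depth"}}'))
```

The key design point is that *anything* that fails to parse as a known tool call degrades to a text turn, which is exactly the behavior the refusal training data (Cat 8, below) teaches the model to exploit.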
## Evaluation
30 hand-crafted prompts across 6 categories (5 each). Scored against expected tool name + key args. "Hard fail" = wrong/no tool when one was required. "Soft fail" = right tool, wrong arg (e.g. wrong side of vehicle).
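The hard-fail / soft-fail rubric can be sketched as a small scoring function. The record shape and field names are my own; the repo's eval harness may differ:

```python
def score(expected: dict, actual: dict) -> str:
    """expected/actual: dicts with "tool" (str or None) and "args" (dict).
    Returns "pass", "soft_fail" (right tool, wrong args), or
    "hard_fail" (wrong/no tool when one was required)."""
    if actual.get("tool") != expected.get("tool"):
        return "hard_fail"
    if expected.get("tool") is None:
        return "pass"            # correctly produced no tool call
    if actual.get("args") == expected.get("args"):
        return "pass"
    return "soft_fail"           # e.g. wrong side of vehicle

# Right tool, wrong side arg -> soft fail
print(score({"tool": "mark_item_ok", "args": {"side": "driver_side"}},
            {"tool": "mark_item_ok", "args": {"side": "passenger_side"}}))
```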
| Category | v3 (no refusal data) | v4 (this model) |
|---|---|---|
| ambiguous | 0 / 5 (HF=5) | 5 / 5 ✓ |
| off_topic | 1 / 5 (HF=4) | 5 / 5 ✓ |
| multi_intent | 0 / 5 (HF=0) | 4 / 5 |
| mid_correction | 1 / 5 (HF=0) | 3 / 5 |
| happy_path | 2 / 5 (HF=0) | 2 / 5 (HF=1) |
| stt_noisy | 1 / 5 (HF=2) | 2 / 5 (HF=1) |
| Total | 5 / 30, HF=11 | 21 / 30, HF=2 |
Scores above are with the production app-injected opener ("Now checking <Item>. ...") in context. A no-context eval (the worst case) scores 17 / 30, HF=4.
Remaining soft fails are mostly wrong-side args on dual-sided items
(passenger_side vs driver_side).
## The training journey (why all three fixes matter)
v1 of this model failed hard (2/30 pass), and the debugging path is worth documenting because two independent bugs, plus a data gap, combined to make it look like a single failure:
- **Loss-mask bug.** The initial training run computed loss over the full sequence, including the ~700-token system prompt. With 173 rows sharing one prompt, the model "converged" by memorizing the prompt while never fitting the assistant tool-call tokens. Fixed by switching to `unsloth.chat_templates.train_on_responses_only`.
- **Corpus pollution.** The 31B teacher model used to synthesize the corpus hallucinated tool-call keys: 78 distinct `(step, item)` pairs in the data vs. 54 canonical pairs. 46 / 173 rows (27%) were polluted. Fixed by embedding the canonical catalog in the synthesis prompt and adding a `validate_conversation()` step that drops any row referencing a non-canonical pair.
- **Missing refusal data.** Even the clean v3 corpus had zero examples of "user asks something off-topic." The model called a tool every time because it had never seen what not calling one looked like. Fixed by adding Cat 8 to the synthesis pipeline: 40 conversations across ambiguity, off-topic, uncertainty, greetings, and acknowledgments, all producing text responses with no tool call.
Each fix in isolation was insufficient. v4 = all three.
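The corpus-pollution fix can be sketched as follows. `CANONICAL_PAIRS` here is a three-pair stand-in for the real 54-item catalog, and the row shape is assumed:

```python
# Stand-in for the canonical catalog (the real one has 7 steps / 54 items).
CANONICAL_PAIRS = {
    ("tires", "tread_depth"),
    ("lights", "left_turn_signal"),
    ("brakes", "slack_adjuster"),
}

def validate_conversation(rows):
    """Drop any row whose tool calls reference a non-canonical
    (step, item) pair, mirroring the synthesis-pipeline filter."""
    clean = []
    for row in rows:
        pairs = {(c["step"], c["item"]) for c in row["tool_calls"]}
        if pairs <= CANONICAL_PAIRS:   # every pair must be canonical
            clean.append(row)
    return clean

rows = [
    {"tool_calls": [{"step": "tires", "item": "tread_depth"}]},
    {"tool_calls": [{"step": "tires", "item": "left_hubcap"}]},  # hallucinated
]
print(len(validate_conversation(rows)))  # 1 of 2 rows survives
```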
## Training recipe
- Base: `unsloth/gemma-4-E2B-it`
- Framework: Unsloth + TRL `SFTTrainer`
- Adapter: LoRA r=128, α=128, dropout=0
- Target modules: `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`
- Loss mask: `train_on_responses_only` (assistant turns only)
- Schedule: 8 epochs, cosine LR 1e-4, batch_size=4 × grad_accum=2 (effective 8)
- Corpus: 380 synthetic conversations across 8 categories (340 task + 40 refusal), all teacher-generated against the canonical 54-item keyset
- Hardware: 1× RTX 5090 (32 GB VRAM)
- Final train loss: 0.155 mean (final batches ~0.01)
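Conceptually, what `train_on_responses_only` does is set the label to the ignore index (-100) everywhere except inside assistant turns, so the ~700-token shared system prompt contributes nothing to the loss. A minimal sketch with a made-up token sequence and span list (not Unsloth's actual implementation):

```python
IGNORE_INDEX = -100  # positions with this label are skipped by cross-entropy

def mask_labels(input_ids, assistant_spans):
    """Copy input_ids to labels, masking every position that is not
    inside an assistant turn. assistant_spans: list of (start, end)."""
    labels = [IGNORE_INDEX] * len(input_ids)
    for start, end in assistant_spans:
        labels[start:end] = input_ids[start:end]
    return labels

# 10-token toy sequence: positions 0-6 are system/user, 7-9 are assistant.
labels = mask_labels(list(range(10)), [(7, 10)])
print(labels)  # [-100, -100, -100, -100, -100, -100, -100, 7, 8, 9]
```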
## Deployment
### Android / iOS via flutter_gemma
```dart
import 'package:flutter_gemma/flutter_gemma.dart';

final gemma = FlutterGemmaPlugin.instance;
await gemma.modelManager.setModelPath('<path>/model.litertlm');
final session = await gemma.createModel(/* ... */);
```
The `.litertlm` is quantized `dynamic_wi8_afp32`, the ship recipe per the `flutter_gemma` notes. Recipes that quantize the LoRA matrices (e.g. `wi4` at rank-128) erase the fine-tune.
### PyTorch via PEFT
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("unsloth/gemma-4-E2B-it")
tok = AutoTokenizer.from_pretrained("unsloth/gemma-4-E2B-it")
model = PeftModel.from_pretrained(base, "jtmuller/roadside-gemma-e2b",
                                  subfolder="lora-adapter")
```
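Reproducing the 9.6 GB merged BF16 checkpoint mentioned above is a standard PEFT merge; a sketch continuing from the snippet above (the output directory name is arbitrary, and this requires downloading the full weights):

```python
# Fold the LoRA deltas into the base weights, then save a
# standalone checkpoint that no longer needs the adapter.
merged = model.merge_and_unload()
merged.save_pretrained("roadside-gemma-e2b-merged", safe_serialization=True)
tok.save_pretrained("roadside-gemma-e2b-merged")
```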
## Limitations & honest disclosure
- **Domain-narrow.** This is a pre-trip inspection agent, not a general assistant. It will try to interpret most utterances as part of the inspection flow.
- **English only.** Corpus is monolingual.
- **Dual-sided items are still soft.** Expect occasional wrong-side args on tires, mirrors, lights.
- **Synthetic corpus.** All training data is teacher-generated, not real driver transcripts. The Cat 5 (STT-noisy) category models speech recognition artifacts but isn't a substitute for real STT data.
- **Safety scope.** This model assists with the inspection workflow. It does not replace a qualified driver's judgment about whether a vehicle is safe to operate.
## License
- LoRA adapter and `.litertlm`: released under the Gemma Terms of Use.
- Synthesis prompts and code in the project repo: MIT.