---
license: gemma
base_model: unsloth/gemma-4-E2B-it
library_name: peft
pipeline_tag: text-generation
tags:
- gemma
- gemma-4
- lora
- unsloth
- litertlm
- on-device
- function-calling
- tool-use
- mobile
- flutter
language:
- en
---

# Roadside Gemma — E2B fine-tune for CDL pre-trip inspections

A LoRA fine-tune of **`unsloth/gemma-4-E2B-it`** that turns the base model into a voice-driven copilot for **commercial-driver pre-trip vehicle inspections**. The model runs **fully on-device** on a modern Android/iOS phone via [`flutter_gemma`](https://pub.dev/packages/flutter_gemma) and the [LiteRT](https://ai.google.dev/edge/litert) runtime — **no network required**, which matters because most truck yards and pre-trip inspection sites are cellular dead zones.

> Built for the **Gemma 4 Impact Challenge** (May 2026). Project repo:
> [github.com/jtmuller5/roadside-gemma](https://github.com/jtmuller5/roadside-gemma).

---

## What's in this repo

| Path | What it is | Size |
|------|------------|------|
| `lora-adapter/` | PEFT LoRA adapter (r=128, α=128, all attn + MLP projections). Merge against `unsloth/gemma-4-E2B-it`. | 948 MB |
| `litertlm/model.litertlm` | Deployment artifact for the LiteRT runtime. Quantized `dynamic_wi8_afp32`. Drop into `flutter_gemma` directly. | 4.8 GB |

The 9.6 GB merged BF16 is reproducible by merging the LoRA — omitted to keep the repo lean.

---

## What the model actually does

The model is an **agent** with seven tools and a strict JSON tool-calling contract. It guides the driver step-by-step through the 7-category / 54-item canonical pre-trip inspection (cab, engine, brakes, lights, tires, trailer, coupling) and records OK / defect outcomes.

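As a rough illustration of what a strict JSON tool-calling contract implies on the app side, the sketch below parses one model reply and checks it against the tool signatures named in this section. The wire format (`{"name": ..., "arguments": {...}}`), the `parse_tool_call` helper, and the `tread_depth` item key in the usage note are assumptions made for illustration; they are not taken from the project's actual schema or keyset.

```python
import json

# Tool name -> required argument names, as named in this README.
TOOL_SIGNATURES = {
    "get_next_step": (),
    "query_inspection_item": ("step", "item"),
    "mark_item_ok": ("step", "item"),
    "record_defect": ("step", "item", "severity", "description"),
    "complete_inspection": (),
}


def parse_tool_call(reply: str):
    """Return (name, arguments) for a valid tool call, or None for a text turn.

    Assumed (hypothetical) wire format: {"name": "<tool>", "arguments": {...}}.
    """
    try:
        payload = json.loads(reply)
    except json.JSONDecodeError:
        return None  # plain text: a refusal or clarification turn
    if not isinstance(payload, dict):
        return None  # valid JSON but not an object, so treat it as text
    name = payload.get("name")
    args = payload.get("arguments", {})
    if name not in TOOL_SIGNATURES or not isinstance(args, dict):
        raise ValueError(f"unknown tool or malformed arguments: {name!r}")
    if set(args) != set(TOOL_SIGNATURES[name]):
        raise ValueError(
            f"{name} expects {TOOL_SIGNATURES[name]}, got {sorted(args)}"
        )
    return name, args
```

A reply such as `{"name": "mark_item_ok", "arguments": {"step": "tires", "item": "tread_depth"}}` would come back as a `(name, arguments)` pair, a plain-text clarification would come back as `None`, and a call with missing or extra arguments would raise `ValueError` so the app can reject it instead of recording a bad outcome.
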
Tools surfaced to the model:

- `get_next_step()` — advance the inspection
- `query_inspection_item(step, item)` — return DOT inspection criteria
- `mark_item_ok(step, item)` — record a passing item
- `record_defect(step, item, severity, description)` — record a defect
- `complete_inspection()` — finalize and sign off
- (plus refusal / clarification turns with **no** tool call)

The training corpus enforces a canonical `(step, item)` keyset; the model is trained to **refuse** off-topic asks and to **ask for clarification** rather than hallucinate a tool call.

---

## Evaluation

30 hand-crafted prompts across 6 categories (5 each), scored against the expected tool name and key args. A "hard fail" (HF) is a wrong or missing tool call when one was required. A "soft fail" is the right tool with a wrong arg (e.g. wrong side of vehicle).

| Category | v3 (no refusal data) | **v4 (this model)** |
|----------------|----------------------|---------------------|
| ambiguous | 0 / 5 (HF=5) | **5 / 5** ✓ |
| off_topic | 1 / 5 (HF=4) | **5 / 5** ✓ |
| multi_intent | 0 / 5 (HF=0) | 4 / 5 |
| mid_correction | 1 / 5 (HF=0) | 3 / 5 |
| happy_path | 2 / 5 (HF=0) | 2 / 5 (HF=1) |
| stt_noisy | 1 / 5 (HF=2) | 2 / 5 (HF=1) |
| **Total** | **5 / 30, HF=11** | **21 / 30, HF=2** |

These scores are with the production app-injected opener (`"Now checking . ..."`) in context. The no-context eval (worst case) scores 17 / 30, HF=4. Remaining soft fails are mostly wrong-side args on dual-sided items (`passenger_side` vs `driver_side`).

---

## The training journey (why all three fixes matter)

v1 of this model **failed hard** (2/30 pass). The debugging path is worth documenting because three independent bugs combined to look like one:

1. **Loss-mask bug.** The initial training run computed loss over the full sequence, including the ~700-token system prompt. With 173 rows sharing one prompt, the model "converged" by memorizing the prompt while never fitting the assistant tool-call tokens. Fixed by switching to `unsloth.chat_templates.train_on_responses_only`.
2. **Corpus pollution.** The 31B teacher model used to synthesize the corpus hallucinated tool-call keys: 78 distinct `(step, item)` pairs in the data vs. 54 canonical pairs. 46 / 173 rows (27%) were polluted. Fixed by embedding the canonical catalog in the synthesis prompt and adding a `validate_conversation()` step that drops any row referencing a non-canonical pair.
3. **Missing refusal data.** Even the clean v3 corpus had zero examples of "user asks something off-topic." The model called a tool every time because it had never seen what *not* calling one looked like. Fixed by adding **Cat 8** to the synthesis pipeline: 40 conversations across ambiguity, off-topic, uncertainty, greetings, and acknowledgments, all producing text responses with no tool call.

Each fix in isolation was insufficient. v4 = all three.

---

## Training recipe

- **Base:** `unsloth/gemma-4-E2B-it`
- **Framework:** Unsloth + TRL `SFTTrainer`
- **Adapter:** LoRA r=128, α=128, dropout=0
- **Target modules:** `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`
- **Loss mask:** `train_on_responses_only` (assistant turns only)
- **Schedule:** 8 epochs, cosine LR 1e-4, batch_size=4 × grad_accum=2 (effective 8)
- **Corpus:** 380 synthetic conversations across 8 categories (340 task + 40 refusal), all teacher-generated against the canonical 54-item keyset
- **Hardware:** 1× RTX 5090 (32 GB VRAM)
- **Final train loss:** 0.155 mean (final batches ~0.01)

---

## Deployment

### Android / iOS via `flutter_gemma`

```dart
import 'package:flutter_gemma/flutter_gemma.dart';

final gemma = FlutterGemmaPlugin.instance;
await gemma.modelManager.setModelPath('/model.litertlm');
final session = await gemma.createModel(/* ... */);
```

The `.litertlm` is quantized `dynamic_wi8_afp32` — the ship recipe per the [`flutter_gemma` notes](https://pub.dev/packages/flutter_gemma). Recipes that quantize the LoRA matrices (e.g. `wi4` at rank-128) erase the fine-tune.

### PyTorch via PEFT

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model and tokenizer the adapter was trained against.
base = AutoModelForCausalLM.from_pretrained("unsloth/gemma-4-E2B-it")
tok = AutoTokenizer.from_pretrained("unsloth/gemma-4-E2B-it")

# Attach the LoRA adapter stored in this repo's lora-adapter/ subfolder.
model = PeftModel.from_pretrained(base, "jtmuller/roadside-gemma-e2b", subfolder="lora-adapter")
```

---

## Limitations & honest disclosure

- **Domain-narrow.** This is a pre-trip inspection agent, not a general assistant. It will try to interpret most utterances as part of the inspection flow.
- **English only.** Corpus is monolingual.
- **Dual-sided items are still soft.** Expect occasional wrong-side args on tires, mirrors, lights.
- **Synthetic corpus.** All training data is teacher-generated, not real driver transcripts. The Cat 5 (STT-noisy) category models speech recognition artifacts but isn't a substitute for real STT data.
- **Safety scope.** This model assists with the inspection workflow. It does **not** replace a qualified driver's judgment about whether a vehicle is safe to operate.

---

## License

- LoRA adapter and `.litertlm`: released under the [Gemma Terms of Use](https://ai.google.dev/gemma/terms).
- Synthesis prompts and code in the project repo: MIT.