+---
+license: gemma
+base_model: unsloth/gemma-4-E2B-it
+library_name: peft
+pipeline_tag: text-generation
+tags:
+  - gemma
+  - gemma-4
+  - lora
+  - unsloth
+  - litertlm
+  - on-device
+  - function-calling
+  - tool-use
+  - mobile
+  - flutter
+language:
+  - en
+---
+# Roadside Gemma — E2B fine-tune for CDL pre-trip inspections
+A LoRA fine-tune of **`unsloth/gemma-4-E2B-it`** that turns the base model into a
+voice-driven copilot for **commercial-driver pre-trip vehicle inspections**.
+The model runs **fully on-device** on a modern Android/iOS phone via
+[`flutter_gemma`](https://pub.dev/packages/flutter_gemma) and the
+[LiteRT](https://ai.google.dev/edge/litert) runtime — **no network required**,
+which matters because most truck yards and pre-trip inspection sites are
+cellular dead zones.
+> Built for the **Gemma 4 Impact Challenge** (May 2026). Project repo:
+> [github.com/jtmuller5/roadside-gemma](https://github.com/jtmuller5/roadside-gemma).
+---
+## What's in this repo
+| Path | What it is | Size |
+|------|------------|------|
+| `lora-adapter/` | PEFT LoRA adapter (r=128, α=128, all attn + MLP projections). Merge against `unsloth/gemma-4-E2B-it`. | 948 MB |
+| `litertlm/model.litertlm` | Deployment artifact for the LiteRT runtime. Quantized `dynamic_wi8_afp32`. Drop into `flutter_gemma` directly. | 4.8 GB |
+The 9.6 GB merged BF16 checkpoint can be reproduced by merging the LoRA adapter
+into the base model, so it is omitted to keep the repo lean.
+---
+## What the model actually does
+The model is an **agent** with seven tools and a strict JSON tool-calling
+contract. It guides the driver step-by-step through the 7-category /
+54-item canonical pre-trip inspection (cab, engine, brakes, lights, tires,
+trailer, coupling) and records OK / defect outcomes.
+Tools surfaced to the model include:
+- `get_next_step()` — advance the inspection
+- `query_inspection_item(step, item)` — return DOT inspection criteria
+- `mark_item_ok(step, item)` — record a passing item
+- `record_defect(step, item, severity, description)` — record a defect
+- `complete_inspection()` — finalize and sign off
+- (plus refusal / clarification turns with **no** tool call)
+The training corpus enforces a canonical `(step, item)` keyset; the model is
+trained to **refuse** off-topic asks and to **ask for clarification** rather
+than hallucinate a tool call.
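One way an app-side handler might separate tool calls from plain-text turns is sketched below. It assumes the model emits each tool call as a JSON object shaped like `{"name": ..., "arguments": {...}}`; that wire format and the `parse_tool_call` helper are illustrative assumptions, not taken from the project repo. Only the tool names and argument names come from the list above.

```python
import json

# Tool names and argument names as listed in this model card.
TOOL_ARGS = {
    "get_next_step": set(),
    "query_inspection_item": {"step", "item"},
    "mark_item_ok": {"step", "item"},
    "record_defect": {"step", "item", "severity", "description"},
    "complete_inspection": set(),
}


def parse_tool_call(raw):
    """Return (name, args) for a well-formed tool call, or None for a text turn.

    Hypothetical helper: the JSON shape is an assumption for illustration.
    """
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return None  # not JSON: treat as a refusal or clarification turn
    if not isinstance(call, dict) or "name" not in call:
        return None  # JSON, but not shaped like a tool call
    name, args = call["name"], call.get("arguments", {})
    if not isinstance(name, str) or name not in TOOL_ARGS:
        raise ValueError(f"unknown tool: {name!r}")
    if not isinstance(args, dict) or set(args) != TOOL_ARGS[name]:
        raise ValueError(f"bad arguments for {name}: {args!r}")
    return name, args
```

Returning `None` for plain text keeps refusals and clarifications on a separate path from malformed tool calls, which raise instead of being silently dropped.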
+---
+## Evaluation
+30 hand-crafted prompts across 6 categories (5 each), scored against the
+expected tool name and key args. A "hard fail" (HF in the table) is a wrong or
+missing tool call when one was required. A "soft fail" is the right tool with a
+wrong arg (e.g. wrong side of the vehicle).
+| Category       | v3 (no refusal data) | **v4 (this model)** |
+|----------------|----------------------|---------------------|
+| ambiguous      | 0 / 5  (HF=5)        | **5 / 5** ✓         |
+| off_topic      | 1 / 5  (HF=4)        | **5 / 5** ✓         |
+| multi_intent   | 0 / 5  (HF=0)        | 4 / 5               |
+| mid_correction | 1 / 5  (HF=0)        | 3 / 5               |
+| happy_path     | 2 / 5  (HF=0)        | 2 / 5  (HF=1)       |
+| stt_noisy      | 1 / 5  (HF=2)        | 2 / 5  (HF=1)       |
+| **Total**      | **5 / 30, HF=11**    | **21 / 30, HF=2**   |
+These scores were measured with the production app-injected opener
+(`"Now checking <Item>. ..."`) in context. The no-context eval (worst case)
+scores 17 / 30 with HF=4.
+Remaining soft fails are mostly wrong-side args on dual-sided items
+(`passenger_side` vs `driver_side`).
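The pass / soft-fail / hard-fail rubric can be sketched as a small classifier. This is a hedged illustration, not the project's eval harness: `score_turn` is a hypothetical name, and treating an unexpected tool call on a text-only prompt as a hard fail is an assumption, since the source only defines a hard fail for prompts where a tool was required.

```python
from typing import Optional


def score_turn(expected_tool: Optional[str], expected_args: dict,
               actual_tool: Optional[str], actual_args: dict) -> str:
    """Classify one eval prompt as 'pass', 'soft_fail', or 'hard_fail'.

    A tool of None means a text-only turn (refusal or clarification).
    """
    # Assumption: any mismatch on the tool itself is a hard fail, including
    # a tool call where a text-only turn was expected.
    if actual_tool != expected_tool:
        return "hard_fail"
    # Right tool: compare only the key args the prompt pins down.
    for key, value in expected_args.items():
        if actual_args.get(key) != value:
            return "soft_fail"
    return "pass"


# A wrong-side arg on the right tool is a soft fail (arg values are illustrative).
result = score_turn("mark_item_ok", {"item": "driver_side"},
                    "mark_item_ok", {"item": "passenger_side"})
```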
+---
+## The training journey (why two-factor matters)
+v1 of this model **failed hard** (2/30 pass), and the debugging path is worth
+documenting because two independent bugs combined to look like one, and a
+third gap surfaced once the corpus was clean:
+1. **Loss-mask bug.** The initial training run computed loss over the full
+   sequence including the ~700-token system prompt. With 173 rows sharing
+   one prompt, the model "converged" by memorizing the prompt while never
+   fitting the assistant tool-call tokens. Fixed by switching to
+   `unsloth.chat_templates.train_on_responses_only`.
+2. **Corpus pollution.** The 31B teacher model used to synthesize the corpus
+   hallucinated tool-call keys: 78 distinct `(step, item)` pairs in the data
+   vs. 54 canonical pairs. 46 / 173 rows (27%) were polluted. Fixed by
+   embedding the canonical catalog in the synthesis prompt and adding a
+   `validate_conversation()` step that drops any row referencing a
+   non-canonical pair.
+3. **Missing refusal data.** Even the clean v3 corpus had zero examples of
+   "user asks something off-topic." The model called a tool every time
+   because it had never seen what *not* calling one looked like. Fixed by
+   adding **Cat 8** to the synthesis pipeline: 40 conversations across
+   ambiguity, off-topic, uncertainty, greetings, and acknowledgments — all
+   producing text responses with no tool call.
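The loss-mask fix in item 1 comes down to excluding every non-assistant token from the loss. A minimal plain-Python sketch of that idea follows. It is not Unsloth's implementation, and `mask_non_assistant` and the span representation are hypothetical. The value `-100` is the default `ignore_index` of PyTorch's cross-entropy loss.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's CrossEntropyLoss


def mask_non_assistant(input_ids, assistant_spans):
    """Build labels that keep only assistant tokens.

    assistant_spans is a list of half-open (start, end) index pairs marking
    assistant turns (a hypothetical representation for this sketch).
    """
    labels = [IGNORE_INDEX] * len(input_ids)
    for start, end in assistant_spans:
        labels[start:end] = input_ids[start:end]
    return labels


# Four prompt tokens followed by two assistant tokens.
labels = mask_non_assistant([11, 12, 13, 14, 21, 22], [(4, 6)])
```

With a shared system prompt of roughly 700 tokens, unmasked training lets the prompt tokens dominate the loss, which is the failure the first item describes.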
+Each fix in isolation was insufficient. v4 = all three.
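The corpus filter from item 2 might look like the sketch below. The row layout (one list of tool-call dicts per conversation) and the two-entry `CANONICAL_PAIRS` excerpt are assumptions for illustration; the real catalog has 54 pairs, and the project's `validate_conversation()` may differ. `filter_corpus` is a hypothetical helper.

```python
# Hypothetical excerpt; the real canonical catalog has 54 (step, item) pairs.
CANONICAL_PAIRS = {("tires", "driver_side"), ("tires", "passenger_side")}


def validate_conversation(tool_calls, canonical=CANONICAL_PAIRS):
    """Return True if every call naming a (step, item) uses a canonical pair."""
    for call in tool_calls:
        args = call.get("arguments", {})
        if "step" in args and "item" in args:
            if (args["step"], args["item"]) not in canonical:
                return False
    return True


def filter_corpus(rows):
    """Drop polluted rows and report the fraction that was dropped."""
    clean = [row for row in rows if validate_conversation(row)]
    polluted = (len(rows) - len(clean)) / len(rows) if rows else 0.0
    return clean, polluted


rows = [
    [{"name": "mark_item_ok", "arguments": {"step": "tires", "item": "driver_side"}}],
    [{"name": "mark_item_ok", "arguments": {"step": "tires", "item": "spare_tire"}}],
    [{"name": "get_next_step", "arguments": {}}],
]
clean, polluted = filter_corpus(rows)
```

Dropping the whole row keeps a conversation from carrying one hallucinated key alongside otherwise valid calls.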
+---
+## Training recipe
+- **Base:** `unsloth/gemma-4-E2B-it`
+- **Framework:** Unsloth + TRL `SFTTrainer`
+- **Adapter:** LoRA r=128, α=128, dropout=0
+- **Target modules:** `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`,
+  `up_proj`, `down_proj`
+- **Loss mask:** `train_on_responses_only` (assistant turns only)
+- **Schedule:** 8 epochs, cosine LR 1e-4, batch_size=4 × grad_accum=2
+  (effective 8)
+- **Corpus:** 380 synthetic conversations across 8 categories (340 task +
+  40 refusal), all teacher-generated against the canonical 54-item keyset
+- **Hardware:** 1× RTX 5090 (32 GB VRAM)
+- **Final train loss:** 0.155 mean (final batches ~0.01)
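The schedule bullet can be read as the arithmetic below. This sketch assumes a plain cosine decay from the peak LR to zero with no warmup; the source states neither a warmup nor a minimum LR, so both are assumptions, and `cosine_lr` is a hypothetical helper rather than the trainer's own scheduler.

```python
import math

PEAK_LR = 1e-4
BATCH_SIZE = 4
GRAD_ACCUM = 2
# Samples contributing to each optimizer update.
EFFECTIVE_BATCH = BATCH_SIZE * GRAD_ACCUM


def cosine_lr(step, total_steps, peak_lr=PEAK_LR):
    """Cosine decay from peak_lr at step 0 to zero at total_steps.

    Assumes no warmup and a floor of zero (not stated in the source).
    """
    progress = min(max(step / total_steps, 0.0), 1.0)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))
```

Halfway through training the rate sits at half the peak, and it approaches zero over the last updates.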
+---
+## Deployment
+### Android / iOS via `flutter_gemma`
+```dart
+import 'package:flutter_gemma/flutter_gemma.dart';
+final gemma = FlutterGemmaPlugin.instance;
+await gemma.modelManager.setModelPath('<path>/model.litertlm');
+final session = await gemma.createModel(/* ... */);
+```
+The `.litertlm` is quantized with `dynamic_wi8_afp32`, the shipping recipe per the
+[`flutter_gemma` notes](https://pub.dev/packages/flutter_gemma). Recipes
+that quantize the LoRA matrices (e.g. `wi4` at rank 128) erase the
+fine-tune.
+### PyTorch via PEFT
+```python
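+from transformers import AutoModelForCausalLM, AutoTokenizer
+from peft import PeftModel
+
+# Load the base model and tokenizer the adapter was trained against.
+base = AutoModelForCausalLM.from_pretrained("unsloth/gemma-4-E2B-it")
+tok = AutoTokenizer.from_pretrained("unsloth/gemma-4-E2B-it")
+# Attach the LoRA adapter stored under lora-adapter/ in this repo.
+model = PeftModel.from_pretrained(base, "jtmuller/roadside-gemma-e2b",
+                                  subfolder="lora-adapter")
+# Optional: model.merge_and_unload() folds the adapter into the base weights.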
+```
+---
+## Limitations & honest disclosure
+- **Domain-narrow.** This is a pre-trip inspection agent, not a general
+  assistant. It will try to interpret most utterances as part of the
+  inspection flow.
+- **English only.** Corpus is monolingual.
+- **Dual-sided items still produce soft fails.** Expect occasional wrong-side
+  args on tires, mirrors, and lights.
+- **Synthetic corpus.** All training data is teacher-generated, not
+  real driver transcripts. The Cat 5 (STT-noisy) category models speech
+  recognition artifacts but isn't a substitute for real STT data.
+- **Safety scope.** This model assists with the inspection workflow.
+  It does **not** replace a qualified driver's judgment about whether a
+  vehicle is safe to operate.
+---
+## License
+- LoRA adapter and `.litertlm`: released under the [Gemma Terms of Use](https://ai.google.dev/gemma/terms).
+- Synthesis prompts and code in the project repo: MIT.