---
license: gemma
base_model: unsloth/gemma-4-E2B-it
library_name: peft
pipeline_tag: text-generation
tags:
  - gemma
  - gemma-4
  - lora
  - unsloth
  - litertlm
  - on-device
  - function-calling
  - tool-use
  - mobile
  - flutter
language:
  - en
---

# Roadside Gemma: E2B fine-tune for CDL pre-trip inspections

A LoRA fine-tune of **`unsloth/gemma-4-E2B-it`** that turns the base model into a
voice-driven copilot for **commercial-driver pre-trip vehicle inspections**.

The model runs **fully on-device** on a modern Android/iOS phone via
[`flutter_gemma`](https://pub.dev/packages/flutter_gemma) and the
[LiteRT](https://ai.google.dev/edge/litert) runtime, with **no network required**,
which matters because most truck yards and pre-trip inspection sites are
cellular dead zones.

> Built for the **Gemma 4 Impact Challenge** (May 2026). Project repo:
> [github.com/jtmuller5/roadside-gemma](https://github.com/jtmuller5/roadside-gemma).

---

## What's in this repo

| Path | What it is | Size |
|------|------------|------|
| `lora-adapter/` | PEFT LoRA adapter (r=128, α=128, all attn + MLP projections). Merge against `unsloth/gemma-4-E2B-it`. | 948 MB |
| `litertlm/model.litertlm` | Deployment artifact for the LiteRT runtime. Quantized `dynamic_wi8_afp32`. Drop into `flutter_gemma` directly. | 4.8 GB |

The 9.6 GB merged BF16 checkpoint is reproducible by merging the LoRA adapter
into the base model; it is omitted to keep the repo lean.

---

## What the model actually does

The model is an **agent** with seven tools and a strict JSON tool-calling
contract. It guides the driver step-by-step through the 7-category /
54-item canonical pre-trip inspection (cab, engine, brakes, lights, tires,
trailer, coupling) and records OK / defect outcomes.

Tools surfaced to the model include:

- `get_next_step()`: advance the inspection
- `query_inspection_item(step, item)`: return DOT inspection criteria
- `mark_item_ok(step, item)`: record a passing item
- `record_defect(step, item, severity, description)`: record a defect
- `complete_inspection()`: finalize and sign off
- (plus refusal / clarification turns with **no** tool call)

The training corpus enforces a canonical `(step, item)` keyset; the model is
trained to **refuse** off-topic asks and to **ask for clarification** rather
than hallucinate a tool call.
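
A minimal sketch of how an app could check one model turn against this contract before dispatching it. Only the tool names and argument names come from the list above; the JSON shape (`{"name": ..., "args": {...}}`), the sample `CANONICAL_PAIRS` entries, and the `classify_turn` helper are illustrative assumptions, not the project's actual wire format.

```python
import json

# Tool names and argument names as listed in this card.
TOOL_ARGS = {
    "get_next_step": [],
    "query_inspection_item": ["step", "item"],
    "mark_item_ok": ["step", "item"],
    "record_defect": ["step", "item", "severity", "description"],
    "complete_inspection": [],
}

# Hypothetical sample of the canonical (step, item) keyset; the real catalog has 54 pairs.
CANONICAL_PAIRS = {("tires", "tread_depth"), ("lights", "headlights")}


def classify_turn(raw: str) -> str:
    """Return 'tool_call', 'text', or 'invalid' for one model turn."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return "text"  # plain-text refusal or clarification, no tool call
    if not isinstance(call, dict) or "name" not in call:
        return "text"
    expected = TOOL_ARGS.get(call["name"])
    args = call.get("args", {})
    # Unknown tool, malformed args, or a wrong argument set all break the contract.
    if expected is None or not isinstance(args, dict) or sorted(args) != sorted(expected):
        return "invalid"
    if "step" in args:
        pair = (args["step"], args["item"])
        if not all(isinstance(p, str) for p in pair) or pair not in CANONICAL_PAIRS:
            return "invalid"  # hallucinated (step, item) key
    return "tool_call"
```

Treating unparseable output as a text turn rather than an error keeps refusals and clarification questions on the normal path, while a well-formed call with a non-canonical pair is rejected instead of being written to the inspection record.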

---

## Evaluation

30 hand-crafted prompts across 6 categories (5 each). Scored against
expected tool name + key args. "Hard fail" = wrong/no tool when one was
required. "Soft fail" = right tool, wrong arg (e.g. wrong side of vehicle).

| Category       | v3 (no refusal data) | **v4 (this model)** |
|----------------|----------------------|---------------------|
| ambiguous      | 0 / 5  (HF=5)        | **5 / 5** ✓         |
| off_topic      | 1 / 5  (HF=4)        | **5 / 5** ✓         |
| multi_intent   | 0 / 5  (HF=0)        | 4 / 5               |
| mid_correction | 1 / 5  (HF=0)        | 3 / 5               |
| happy_path     | 2 / 5  (HF=0)        | 2 / 5  (HF=1)       |
| stt_noisy      | 1 / 5  (HF=2)        | 2 / 5  (HF=1)       |
| **Total**      | **5 / 30, HF=11**    | **21 / 30, HF=2**   |

These results were measured with the production app-injected opener
(`"Now checking <Item>. ..."`) in context. Without that context (worst case),
the model scores 17 / 30 with HF=4.

Remaining soft fails are mostly wrong-side args on dual-sided items
(`passenger_side` vs `driver_side`).
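
The pass / soft-fail / hard-fail rule can be sketched as below. The `score` helper and the dictionary shape for expected and predicted calls are assumptions for illustration, and the project's actual eval harness may differ. In particular, counting "called a tool when none was expected" as a hard fail is an assumption inferred from the refusal categories in the table.

```python
from typing import Optional


def score(expected: Optional[dict], predicted: Optional[dict]) -> str:
    """Classify one eval prompt as 'pass', 'soft_fail', or 'hard_fail'.

    Each call is {"name": str, "args": dict}; None means "no tool call",
    which is the expected outcome for refusal and clarification prompts.
    """
    if expected is None:
        # Assumption: any tool call on a no-tool prompt counts as a hard fail.
        return "pass" if predicted is None else "hard_fail"
    if predicted is None or predicted.get("name") != expected["name"]:
        return "hard_fail"  # wrong tool, or no tool when one was required
    predicted_args = predicted.get("args", {})
    # Only the key args named in the expectation are compared.
    for key, value in expected["args"].items():
        if predicted_args.get(key) != value:
            return "soft_fail"  # right tool, wrong arg
    return "pass"
```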

---

## The training journey (why two-factor matters)

v1 of this model **failed hard** (2/30 pass), and the debugging path is worth
documenting because two independent bugs combined to look like one, and a
third gap remained even after the corpus was cleaned:

1. **Loss-mask bug.** The initial training run computed loss over the full
   sequence including the ~700-token system prompt. With 173 rows sharing
   one prompt, the model "converged" by memorizing the prompt while never
   fitting the assistant tool-call tokens. Fixed by switching to
   `unsloth.chat_templates.train_on_responses_only`.
2. **Corpus pollution.** The 31B teacher model used to synthesize the corpus
   hallucinated tool-call keys: 78 distinct `(step, item)` pairs in the data
   vs. 54 canonical pairs. 46 / 173 rows (27%) were polluted. Fixed by
   embedding the canonical catalog in the synthesis prompt and adding a
   `validate_conversation()` step that drops any row referencing a
   non-canonical pair.
3. **Missing refusal data.** Even the clean v3 corpus had zero examples of
   "user asks something off-topic." The model called a tool every time
   because it had never seen what *not* calling one looked like. Fixed by
   adding **Cat 8** to the synthesis pipeline: 40 conversations across
   ambiguity, off-topic, uncertainty, greetings, and acknowledgments, all
   producing text responses with no tool call.

Each fix in isolation was insufficient. v4 = all three.
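
A minimal sketch of the corpus filter described in fix 2. Only the name `validate_conversation()` comes from the project; the row layout, the sample `CANONICAL_PAIRS` entries, and the `filter_corpus` helper are assumptions made for illustration.

```python
# Hypothetical sample of the canonical catalog; the real one has 54 pairs.
CANONICAL_PAIRS = {("brakes", "air_pressure"), ("tires", "tread_depth")}


def validate_conversation(conversation: list) -> bool:
    """Return True only if every tool call uses a canonical (step, item) pair.

    Assumed row layout: a list of turns, each a dict with an optional
    "tool_call" entry shaped like {"name": str, "args": dict}.
    """
    for turn in conversation:
        call = turn.get("tool_call")
        if call is None:
            continue  # text-only turns (e.g. refusals) have nothing to check
        args = call.get("args", {})
        if "step" in args or "item" in args:
            if (args.get("step"), args.get("item")) not in CANONICAL_PAIRS:
                return False
    return True


def filter_corpus(rows: list) -> list:
    """Drop any conversation that references a non-canonical pair."""
    return [row for row in rows if validate_conversation(row)]
```

Dropping the whole conversation, rather than repairing the offending turn, matches the description above and avoids training on a dialogue whose later turns depend on a hallucinated key.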

---

## Training recipe

- **Base:** `unsloth/gemma-4-E2B-it`
- **Framework:** Unsloth + TRL `SFTTrainer`
- **Adapter:** LoRA r=128, α=128, dropout=0
- **Target modules:** `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`,
  `up_proj`, `down_proj`
- **Loss mask:** `train_on_responses_only` (assistant turns only)
- **Schedule:** 8 epochs, cosine LR 1e-4, batch_size=4 × grad_accum=2
  (effective 8)
- **Corpus:** 380 synthetic conversations across 8 categories (340 task +
  40 refusal), all teacher-generated against the canonical 54-item keyset
- **Hardware:** 1× RTX 5090 (32 GB VRAM)
- **Final train loss:** 0.155 mean (final batches ~0.01)
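
To make the loss-mask entry in the recipe concrete, here is a framework-free sketch of what response-only masking does to the labels. It is not Unsloth's implementation; the `mask_labels` helper and the per-token role list are assumptions. The value `-100` is the ignore index that PyTorch's cross-entropy loss skips by default.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def mask_labels(token_ids: list, roles: list) -> list:
    """Copy token ids into labels, masking every non-assistant token.

    roles[i] names the turn that token i belongs to
    ("system", "user", or "assistant").
    """
    if len(token_ids) != len(roles):
        raise ValueError("token_ids and roles must have the same length")
    return [
        tok if role == "assistant" else IGNORE_INDEX
        for tok, role in zip(token_ids, roles)
    ]


# Toy example: 3 system tokens, 2 user tokens, 2 assistant tokens.
ids = [11, 12, 13, 21, 22, 31, 32]
roles = ["system"] * 3 + ["user"] * 2 + ["assistant"] * 2
labels = mask_labels(ids, roles)  # [-100, -100, -100, -100, -100, 31, 32]
```

With labels built this way, the shared system prompt contributes nothing to the loss, so the only way to lower it is to fit the assistant tokens, including the tool calls.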

---

## Deployment

### Android / iOS via `flutter_gemma`

```dart
import 'package:flutter_gemma/flutter_gemma.dart';

final gemma = FlutterGemmaPlugin.instance;
await gemma.modelManager.setModelPath('<path>/model.litertlm');
final session = await gemma.createModel(/* ... */);
```

The `.litertlm` is quantized `dynamic_wi8_afp32`, the ship recipe per the
[`flutter_gemma` notes](https://pub.dev/packages/flutter_gemma). Recipes
that quantize the LoRA matrices (e.g. `wi4` at rank-128) erase the
fine-tune.

### PyTorch via PEFT

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("unsloth/gemma-4-E2B-it")
tok = AutoTokenizer.from_pretrained("unsloth/gemma-4-E2B-it")
model = PeftModel.from_pretrained(base, "jtmuller/roadside-gemma-e2b",
                                  subfolder="lora-adapter")
```

---

## Limitations & honest disclosure

- **Domain-narrow.** This is a pre-trip inspection agent, not a general
  assistant. It will try to interpret most utterances as part of the
  inspection flow.
- **English only.** Corpus is monolingual.
- **Dual-sided items are still soft.** Expect occasional wrong-side args
  on tires, mirrors, lights.
- **Synthetic corpus.** All training data is teacher-generated, not
  real driver transcripts. The Cat 5 (STT-noisy) category models speech
  recognition artifacts but isn't a substitute for real STT data.
- **Safety scope.** This model assists with the inspection workflow.
  It does **not** replace a qualified driver's judgment about whether a
  vehicle is safe to operate.

---

## License

- LoRA adapter and `.litertlm`: released under the [Gemma Terms of Use](https://ai.google.dev/gemma/terms).
- Synthesis prompts and code in the project repo: MIT.