jtmuller committed on
Commit 40d9999 · verified · 1 Parent(s): 5a0d246

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +190 -0
README.md ADDED
@@ -0,0 +1,190 @@
1
+ ---
2
+ license: gemma
3
+ base_model: unsloth/gemma-4-E2B-it
4
+ library_name: peft
5
+ pipeline_tag: text-generation
6
+ tags:
7
+ - gemma
8
+ - gemma-4
9
+ - lora
10
+ - unsloth
11
+ - litertlm
12
+ - on-device
13
+ - function-calling
14
+ - tool-use
15
+ - mobile
16
+ - flutter
17
+ language:
18
+ - en
19
+ ---
20
+
21
+ # Roadside Gemma: E2B fine-tune for CDL pre-trip inspections
22
+
23
+ A LoRA fine-tune of **`unsloth/gemma-4-E2B-it`** that turns the base model into a
24
+ voice-driven copilot for **commercial-driver pre-trip vehicle inspections**.
25
+
26
+ The model runs **fully on-device** on a modern Android/iOS phone via
27
+ [`flutter_gemma`](https://pub.dev/packages/flutter_gemma) and the
28
+ [LiteRT](https://ai.google.dev/edge/litert) runtime, with **no network required**,
29
+ which matters because most truck yards and pre-trip inspection sites are
30
+ cellular dead zones.
31
+
32
+ > Built for the **Gemma 4 Impact Challenge** (May 2026). Project repo:
33
+ > [github.com/jtmuller5/roadside-gemma](https://github.com/jtmuller5/roadside-gemma).
34
+
35
+ ---
36
+
37
+ ## What's in this repo
38
+
39
+ | Path | What it is | Size |
+ |------|------------|------|
+ | `lora-adapter/` | PEFT LoRA adapter (r=128, α=128, all attention + MLP projections). Merge against `unsloth/gemma-4-E2B-it`. | 948 MB |
+ | `litertlm/model.litertlm` | Deployment artifact for the LiteRT runtime. Quantized with `dynamic_wi8_afp32`. Drop into `flutter_gemma` directly. | 4.8 GB |
43
+
44
+ The 9.6 GB merged BF16 checkpoint is reproducible by merging the LoRA, so it is
+ omitted to keep the repo lean.
46
+
47
+ ---
48
+
49
+ ## What the model actually does
50
+
51
+ The model is an **agent** with seven tools and a strict JSON tool-calling
52
+ contract. It guides the driver step-by-step through the 7-category /
53
+ 54-item canonical pre-trip inspection (cab, engine, brakes, lights, tires,
54
+ trailer, coupling) and records OK / defect outcomes.
55
+
56
+ Tools surfaced to the model:
57
+
58
+ - `get_next_step()`: advance the inspection
+ - `query_inspection_item(step, item)`: return DOT inspection criteria
+ - `mark_item_ok(step, item)`: record a passing item
+ - `record_defect(step, item, severity, description)`: record a defect
+ - `complete_inspection()`: finalize and sign off
63
+ - (plus refusal / clarification turns with **no** tool call)
64
+
65
+ The training corpus enforces a canonical `(step, item)` keyset; the model is
66
+ trained to **refuse** off-topic asks and to **ask for clarification** rather
67
+ than hallucinate a tool call.
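
A minimal sketch of how an app could enforce this contract on the model's output before dispatching anything. The tool names and argument lists come from the list above; the JSON shape (`{"name": ..., "args": {...}}`), the sample `CANONICAL_KEYS` entries, and the `parse_tool_call` helper are assumptions for illustration, not the project's actual schema.

```python
import json

# Argument names per tool, taken from the tool list above.
TOOL_ARGS = {
    "get_next_step": [],
    "query_inspection_item": ["step", "item"],
    "mark_item_ok": ["step", "item"],
    "record_defect": ["step", "item", "severity", "description"],
    "complete_inspection": [],
}

# Hypothetical sample of the canonical (step, item) keyset; the real one has 54 pairs.
CANONICAL_KEYS = {("tires", "tread_depth"), ("lights", "headlights")}


def parse_tool_call(raw: str):
    """Return (name, args) for a valid tool call, or None for anything else."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return None  # refusal / clarification turns are plain text, not JSON
    if not isinstance(call, dict):
        return None
    name = call.get("name")
    args = call.get("args", {})
    if name not in TOOL_ARGS or not isinstance(args, dict):
        return None
    # The argument names must match the tool's signature exactly.
    if sorted(args) != sorted(TOOL_ARGS[name]):
        return None
    # Assumption: every argument value is a string.
    if not all(isinstance(value, str) for value in args.values()):
        return None
    # Reject (step, item) pairs outside the canonical keyset.
    if "step" in args and (args["step"], args["item"]) not in CANONICAL_KEYS:
        return None
    return name, args
```

Returning `None` for both plain-text turns and malformed calls lets the caller treat "no valid tool call" as a single case and fall back to showing the model's text.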
68
+
69
+ ---
70
+
71
+ ## Evaluation
72
+
73
+ 30 hand-crafted prompts across 6 categories (5 each). Scored against
74
+ expected tool name + key args. "Hard fail" = wrong/no tool when one was
75
+ required. "Soft fail" = right tool, wrong arg (e.g. wrong side of vehicle).
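
The scoring rule above can be sketched as a small classifier. This is an illustration, not the project's eval harness: the `score` function and the `(tool_name, args)` tuple shape are hypothetical, and treating a tool call on a prompt that expected none as a hard fail is an assumption inferred from the refusal categories in the results.

```python
def score(expected, actual):
    """Classify one eval prompt as 'pass', 'soft_fail', or 'hard_fail'.

    expected and actual are (tool_name, args) tuples, or None for a turn
    with no tool call (refusal or clarification).
    """
    if expected is None:
        # Assumption: calling a tool when none was expected is a hard fail.
        return "pass" if actual is None else "hard_fail"
    if actual is None or actual[0] != expected[0]:
        return "hard_fail"  # wrong tool, or no tool when one was required
    expected_args, actual_args = expected[1], actual[1]
    # Only the expected key args are compared; extra args are ignored.
    if all(actual_args.get(key) == value for key, value in expected_args.items()):
        return "pass"
    return "soft_fail"  # right tool, wrong arg
```

For example, a call to the right tool with a passenger-side item where the driver-side item was expected comes back as `soft_fail`.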
76
+
77
+ | Category | v3 (no refusal data) | **v4 (this model)** |
+ |----------------|----------------------|---------------------|
+ | ambiguous | 0 / 5 (HF=5) | **5 / 5** ✓ |
+ | off_topic | 1 / 5 (HF=4) | **5 / 5** ✓ |
+ | multi_intent | 0 / 5 (HF=0) | 4 / 5 |
+ | mid_correction | 1 / 5 (HF=0) | 3 / 5 |
+ | happy_path | 2 / 5 (HF=0) | 2 / 5 (HF=1) |
+ | stt_noisy | 1 / 5 (HF=2) | 2 / 5 (HF=1) |
+ | **Total** | **5 / 30, HF=11** | **21 / 30, HF=2** |
86
+
87
+ These scores were measured with the production app-injected opener
+ (`"Now checking <Item>. ..."`) in context. The no-context eval (worst case)
+ scores 17 / 30, HF=4.
89
+
90
+ Remaining soft fails are mostly wrong-side args on dual-sided items
91
+ (`passenger_side` vs `driver_side`).
92
+
93
+ ---
94
+
95
+ ## The training journey (why two-factor matters)
96
+
97
+ v1 of this model **failed hard** (2/30 pass). The debugging path is worth
+ documenting because two independent bugs combined to look like one, and a
+ third gap only surfaced once the corpus was clean:
99
+
100
+ 1. **Loss-mask bug.** The initial training run computed loss over the full
101
+ sequence including the ~700-token system prompt. With 173 rows sharing
102
+ one prompt, the model "converged" by memorizing the prompt while never
103
+ fitting the assistant tool-call tokens. Fixed by switching to
104
+ `unsloth.chat_templates.train_on_responses_only`.
105
+ 2. **Corpus pollution.** The 31B teacher model used to synthesize the corpus
106
+ hallucinated tool-call keys: 78 distinct `(step, item)` pairs in the data
107
+ vs. 54 canonical pairs. 46 / 173 rows (27%) were polluted. Fixed by
108
+ embedding the canonical catalog in the synthesis prompt and adding a
109
+ `validate_conversation()` step that drops any row referencing a
110
+ non-canonical pair.
111
+ 3. **Missing refusal data.** Even the clean v3 corpus had zero examples of
112
+ "user asks something off-topic." The model called a tool every time
113
+ because it had never seen what *not* calling one looked like. Fixed by
114
+ adding **Cat 8** to the synthesis pipeline: 40 conversations across
115
+ ambiguity, off-topic, uncertainty, greetings, and acknowledgments, all
116
+ producing text responses with no tool call.
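
The loss-mask fix in item 1 comes down to which token positions carry a label. The sketch below shows the idea in plain Python and is not Unsloth's implementation: `mask_labels` and the `(start, end)` span format are hypothetical, while `-100` is the label value that PyTorch's cross-entropy loss ignores by default.

```python
IGNORE_INDEX = -100  # positions with this label contribute nothing to the loss


def mask_labels(input_ids, assistant_spans):
    """Build labels that keep only the tokens inside assistant spans.

    assistant_spans is a list of (start, end) index pairs, end exclusive.
    Everything else (system prompt, user turns) is set to IGNORE_INDEX.
    """
    labels = [IGNORE_INDEX] * len(input_ids)
    for start, end in assistant_spans:
        labels[start:end] = input_ids[start:end]
    return labels


# Toy sequence: 5 prompt tokens followed by 2 assistant tokens.
toy_ids = [11, 12, 13, 21, 22, 31, 32]
toy_labels = mask_labels(toy_ids, [(5, 7)])
```

With a long shared system prompt and a short tool call per row, unmasked training spends most of its gradient signal on prompt tokens, which matches the failure described in item 1.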
117
+
118
+ Each fix in isolation was insufficient. v4 = all three.
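
The `validate_conversation()` step from item 2 might look roughly like the following. Only the function name comes from this README; the conversation layout (a list of turns with an optional `tool_call` dict), the sample catalog entries, and `filter_corpus` are assumptions for illustration.

```python
# Hypothetical sample of the canonical catalog; the real one has 54 (step, item) pairs.
CANONICAL_KEYS = {("brakes", "air_pressure"), ("tires", "tread_depth")}


def validate_conversation(conversation, canonical=CANONICAL_KEYS):
    """Return True only if every (step, item) tool call uses a canonical pair."""
    for turn in conversation:
        call = turn.get("tool_call")
        if call is None:
            continue  # text-only turn, nothing to check
        args = call.get("args", {})
        if "step" in args or "item" in args:
            if (args.get("step"), args.get("item")) not in canonical:
                return False
    return True


def filter_corpus(rows):
    """Drop every conversation that references a non-canonical pair."""
    return [row for row in rows if validate_conversation(row)]
```

Dropping the whole row, rather than repairing the key, keeps the filter simple and avoids guessing which canonical pair the teacher model meant.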
119
+
120
+ ---
121
+
122
+ ## Training recipe
123
+
124
+ - **Base:** `unsloth/gemma-4-E2B-it`
125
+ - **Framework:** Unsloth + TRL `SFTTrainer`
126
+ - **Adapter:** LoRA r=128, α=128, dropout=0
127
+ - **Target modules:** `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`,
128
+ `up_proj`, `down_proj`
129
+ - **Loss mask:** `train_on_responses_only` (assistant turns only)
130
+ - **Schedule:** 8 epochs, cosine LR 1e-4, batch_size=4 × grad_accum=2
131
+ (effective 8)
132
+ - **Corpus:** 380 synthetic conversations across 8 categories (340 task +
133
+ 40 refusal), all teacher-generated against the canonical 54-item keyset
134
+ - **Hardware:** 1× RTX 5090 (32 GB VRAM)
135
+ - **Final train loss:** 0.155 mean (final batches ~0.01)
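
To make the schedule numbers concrete, here is a rough sketch of the arithmetic behind them. The peak LR, epoch count, batch settings, and corpus size come from the recipe above. The step count assumes every one of the 380 conversations is used for training, the trainer's own rounding may differ, and the no-warmup cosine decay to zero is an assumption, since the README does not state warmup or a floor.

```python
import math

PEAK_LR = 1e-4           # from the recipe
EPOCHS = 8               # from the recipe
EFFECTIVE_BATCH = 4 * 2  # batch_size x grad_accum, from the recipe
CORPUS_SIZE = 380        # from the recipe; assumes no held-out split

# Approximate number of optimizer steps over the whole run.
TOTAL_STEPS = math.ceil(CORPUS_SIZE / EFFECTIVE_BATCH) * EPOCHS


def cosine_lr(step, total_steps=TOTAL_STEPS, peak=PEAK_LR):
    """Cosine decay from peak to zero with no warmup (an assumption)."""
    progress = min(step, total_steps) / total_steps
    return peak * 0.5 * (1.0 + math.cos(math.pi * progress))
```

Under these assumptions the run is a few hundred optimizer steps long, with the learning rate at half its peak at the midpoint.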
136
+
137
+ ---
138
+
139
+ ## Deployment
140
+
141
+ ### Android / iOS via `flutter_gemma`
142
+
143
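+ ```dart
+ import 'package:flutter_gemma/flutter_gemma.dart';
+
+ // `await` needs an async context, so the calls are wrapped in a function.
+ Future<void> loadRoadsideGemma() async {
+   final gemma = FlutterGemmaPlugin.instance;
+   // Point the plugin at the .litertlm file already on the device.
+   await gemma.modelManager.setModelPath('<path>/model.litertlm');
+   // Model options are elided here, as in the original snippet.
+   final session = await gemma.createModel(/* ... */);
+ }
+ ```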
150
+
151
+ The `.litertlm` is quantized with `dynamic_wi8_afp32`, the shipping recipe per the
152
+ [`flutter_gemma` notes](https://pub.dev/packages/flutter_gemma). Recipes
153
+ that quantize the LoRA matrices (e.g. `wi4` at rank-128) erase the
154
+ fine-tune.
155
+
156
+ ### PyTorch via PEFT
157
+
158
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+ from peft import PeftModel
+
+ # Load the base model and tokenizer the adapter was trained against.
+ base = AutoModelForCausalLM.from_pretrained("unsloth/gemma-4-E2B-it")
+ tok = AutoTokenizer.from_pretrained("unsloth/gemma-4-E2B-it")
+
+ # Attach the LoRA adapter stored in the repo's lora-adapter/ subfolder.
+ model = PeftModel.from_pretrained(
+     base, "jtmuller/roadside-gemma-e2b", subfolder="lora-adapter"
+ )
+ ```
167
+
168
+ ---
169
+
170
+ ## Limitations & honest disclosure
171
+
172
+ - **Domain-narrow.** This is a pre-trip inspection agent, not a general
173
+ assistant. It will try to interpret most utterances as part of the
174
+ inspection flow.
175
+ - **English only.** Corpus is monolingual.
176
+ - **Dual-sided items are still soft.** Expect occasional wrong-side args
177
+ on tires, mirrors, lights.
178
+ - **Synthetic corpus.** All training data is teacher-generated, not
179
+ real driver transcripts. The Cat 5 (STT-noisy) category models speech
180
+ recognition artifacts but isn't a substitute for real STT data.
181
+ - **Safety scope.** This model assists with the inspection workflow.
182
+ It does **not** replace a qualified driver's judgment about whether a
183
+ vehicle is safe to operate.
184
+
185
+ ---
186
+
187
+ ## License
188
+
189
+ - LoRA adapter and `.litertlm`: released under the [Gemma Terms of Use](https://ai.google.dev/gemma/terms).
190
+ - Synthesis prompts and code in the project repo: MIT.