Upload README.md with huggingface_hub

40d9999 verified 6 days ago

7.23 kB

	---
	license: gemma
	base_model: unsloth/gemma-4-E2B-it
	library_name: peft
	pipeline_tag: text-generation
	tags:
	- gemma
	- gemma-4
	- lora
	- unsloth
	- litertlm
	- on-device
	- function-calling
	- tool-use
	- mobile
	- flutter
	language:
	- en
	---

	# Roadside Gemma — E2B fine-tune for CDL pre-trip inspections

	A LoRA fine-tune of `unsloth/gemma-4-E2B-it` that turns the base model into a
	voice-driven copilot for commercial-driver pre-trip vehicle inspections.

	The model runs fully on-device on a modern Android/iOS phone via
	[`flutter_gemma`](https://pub.dev/packages/flutter_gemma) and the
	[LiteRT](https://ai.google.dev/edge/litert) runtime — no network required,
	which matters because most truck yards and pre-trip inspection sites are
	cellular dead zones.

	> Built for the Gemma 4 Impact Challenge (May 2026). Project repo:
	> [github.com/jtmuller5/roadside-gemma](https://github.com/jtmuller5/roadside-gemma).

	---

	## What's in this repo

	\| Path \| What it is \| Size \|
	\|------\|------------\|------\|
	\| `lora-adapter/` \| PEFT LoRA adapter (r=128, α=128, all attn + MLP projections). Merge against `unsloth/gemma-4-E2B-it`. \| 948 MB \|
	\| `litertlm/model.litertlm` \| Deployment artifact for the LiteRT runtime. Quantized `dynamic_wi8_afp32`. Drop into `flutter_gemma` directly. \| 4.8 GB \|

	The 9.6 GB merged BF16 is reproducible by merging the LoRA — omitted to keep
	the repo lean.

	---

	## What the model actually does

	The model is an agent with seven tools and a strict JSON tool-calling
	contract. It guides the driver step-by-step through the 7-category /
	54-item canonical pre-trip inspection (cab, engine, brakes, lights, tires,
	trailer, coupling) and records OK / defect outcomes.

	Tools surfaced to the model:

	- `get_next_step()` — advance the inspection
	- `query_inspection_item(step, item)` — return DOT inspection criteria
	- `mark_item_ok(step, item)` — record a passing item
	- `record_defect(step, item, severity, description)` — record a defect
	- `complete_inspection()` — finalize and sign off
	- (plus refusal / clarification turns with no tool call)

	The training corpus enforces a canonical `(step, item)` keyset; the model is
	trained to refuse off-topic asks and to ask for clarification rather
	than hallucinate a tool call.

	---

	## Evaluation

	30 hand-crafted prompts across 6 categories (5 each). Scored against
	expected tool name + key args. "Hard fail" = wrong/no tool when one was
	required. "Soft fail" = right tool, wrong arg (e.g. wrong side of vehicle).

	\| Category \| v3 (no refusal data) \| v4 (this model) \|
	\|----------------\|----------------------\|---------------------\|
	\| ambiguous \| 0 / 5 (HF=5) \| 5 / 5 ✓ \|
	\| off_topic \| 1 / 5 (HF=4) \| 5 / 5 ✓ \|
	\| multi_intent \| 0 / 5 (HF=0) \| 4 / 5 \|
	\| mid_correction \| 1 / 5 (HF=0) \| 3 / 5 \|
	\| happy_path \| 2 / 5 (HF=0) \| 2 / 5 (HF=1) \|
	\| stt_noisy \| 1 / 5 (HF=2) \| 2 / 5 (HF=1) \|
	\| Total \| 5 / 30, HF=11 \| 21 / 30, HF=2 \|

	With the production app-injected opener (`"Now checking <Item>. ..."`) in
	context. No-context eval (worst case): 17 / 30, HF=4.

	Remaining soft fails are mostly wrong-side args on dual-sided items
	(`passenger_side` vs `driver_side`).

	---

	## The training journey (why two-factor matters)

	v1 of this model failed hard (2/30 pass) and the debugging path is worth
	documenting because two independent bugs combined to make it look like one:

	1. Loss-mask bug. The initial training run computed loss over the full
	sequence including the ~700-token system prompt. With 173 rows sharing
	one prompt, the model "converged" by memorizing the prompt while never
	fitting the assistant tool-call tokens. Fixed by switching to
	`unsloth.chat_templates.train_on_responses_only`.
	2. Corpus pollution. The 31B teacher model used to synthesize the corpus
	hallucinated tool-call keys: 78 distinct `(step, item)` pairs in the data
	vs. 54 canonical pairs. 46 / 173 rows (27%) were polluted. Fixed by
	embedding the canonical catalog in the synthesis prompt and adding a
	`validate_conversation()` step that drops any row referencing a
	non-canonical pair.
	3. Missing refusal data. Even the clean v3 corpus had zero examples of
	"user asks something off-topic." The model called a tool every time
	because it had never seen what not calling one looked like. Fixed by
	adding Cat 8 to the synthesis pipeline: 40 conversations across
	ambiguity, off-topic, uncertainty, greetings, and acknowledgments — all
	producing text responses with no tool call.

	Each fix in isolation was insufficient. v4 = all three.

	---

	## Training recipe

	- Base: `unsloth/gemma-4-E2B-it`
	- Framework: Unsloth + TRL `SFTTrainer`
	- Adapter: LoRA r=128, α=128, dropout=0
	- Target modules: `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`,
	`up_proj`, `down_proj`
	- Loss mask: `train_on_responses_only` (assistant turns only)
	- Schedule: 8 epochs, cosine LR 1e-4, batch_size=4 × grad_accum=2
	(effective 8)
	- Corpus: 380 synthetic conversations across 8 categories (340 task +
	40 refusal), all teacher-generated against the canonical 54-item keyset
	- Hardware: 1× RTX 5090 (32 GB VRAM)
	- Final train loss: 0.155 mean (final batches ~0.01)

	---

	## Deployment

	### Android / iOS via `flutter_gemma`

	```dart
	import 'package:flutter_gemma/flutter_gemma.dart';

	final gemma = FlutterGemmaPlugin.instance;
	await gemma.modelManager.setModelPath('<path>/model.litertlm');
	final session = await gemma.createModel(/* ... */);
	```

	The `.litertlm` is quantized `dynamic_wi8_afp32` — the ship recipe per the
	[`flutter_gemma` notes](https://pub.dev/packages/flutter_gemma). Recipes
	that quantize the LoRA matrices (e.g. `wi4` at rank-128) erase the
	fine-tune.

	### PyTorch via PEFT

	```python
	from transformers import AutoModelForCausalLM, AutoTokenizer
	from peft import PeftModel

	base = AutoModelForCausalLM.from_pretrained("unsloth/gemma-4-E2B-it")
	tok = AutoTokenizer.from_pretrained("unsloth/gemma-4-E2B-it")
	model = PeftModel.from_pretrained(base, "jtmuller/roadside-gemma-e2b",
	subfolder="lora-adapter")
	```

	---

	## Limitations & honest disclosure

	- Domain-narrow. This is a pre-trip inspection agent, not a general
	assistant. It will try to interpret most utterances as part of the
	inspection flow.
	- English only. Corpus is monolingual.
	- Dual-sided items are still soft. Expect occasional wrong-side args
	on tires, mirrors, lights.
	- Synthetic corpus. All training data is teacher-generated, not
	real driver transcripts. The Cat 5 (STT-noisy) category models speech
	recognition artifacts but isn't a substitute for real STT data.
	- Safety scope. This model assists with the inspection workflow.
	It does not replace a qualified driver's judgment about whether a
	vehicle is safe to operate.

	---

	## License

	- LoRA adapter and `.litertlm`: released under the [Gemma Terms of Use](https://ai.google.dev/gemma/terms).
	- Synthesis prompts and code in the project repo: MIT.