Part 2: When Fine-Tuning Isn't the Answer (Yet)
Follow-up to "Teaching a Tiny Model to Hear Bash" Working title β refine before publishing
Narrative arc
Part 1 ended on a high: 97% accuracy, 3GB RAM, under a second. But there's a catch we glossed over: that 97% is on clean protocol input. When users speak naturally ("okay so the command is...") or make corrections mid-sentence ("dash dash no wait just dash v"), the model falls apart.
This post is about what we tried next, what we learned, and the architectural insight that changed our approach.
Key beats
1. The 97% Illusion
The fine-tuned model is great... if you speak its language perfectly. Real users don't.
Four difficulty levels:
- Clean: "git space push space dash u space origin space main" β 93% (processor alone)
- Fuzzy: "git commit minus m quote fix login bug quote" β 0% (no "space" keywords)
- Natural: "okay so the command is git push dash u origin main" β 0% (filler)
- Chaotic: "dash dash no wait just dash v" β 0% (self-corrections)
The training data was clean. Reality isn't.
2. The Procedural Processor Discovery
Before throwing more ML at it, we asked: how much of this task is deterministic?
Answer: almost all of it. "dash" always means "-". "dot" always means ".". A rule-based token scanner gets 93% on clean input with zero hallucination, zero latency, zero training.
This raised the question: what is the LLM actually contributing? It's memorizing fixed mappings. "dash" appears 11,207 times in the training data; the model learned every one of them, but a dictionary lookup does the same job.
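The dictionary-lookup claim can be made concrete with a minimal sketch of the rule-based scanner. The real `procedural-processor.py` handles many more cases; the symbol table and token logic below are illustrative assumptions, not the project's full implementation.

```python
# Minimal sketch of a rule-based protocol processor. "space" separates
# arguments; symbol keywords concatenate into the current argument.
# (Illustrative mapping only; the real table is larger.)
SYMBOLS = {
    "dash": "-",
    "dot": ".",
    "slash": "/",
}

def process(protocol_text: str) -> str:
    """Convert protocol text like 'git space push space dash u' to syntax."""
    parts = []     # completed arguments
    current = ""   # argument being built
    for token in protocol_text.split():
        if token == "space":
            parts.append(current)
            current = ""
        else:
            # symbol keywords map to characters; other tokens pass through
            current += SYMBOLS.get(token, token)
    parts.append(current)
    return " ".join(p for p in parts if p)
```

On the clean example above, `process("git space push space dash u space origin space main")` returns `git push -u origin main`: deterministic, no model in the loop.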
3. The Split Architecture
The insight: use each tool for what it's good at.
Raw speech → LLM (language understanding) → Protocol text → Processor (deterministic) → Final syntax
The LLM's job shrinks dramatically:
- Strip conversational filler
- Resolve self-corrections ("no wait, actually...")
- Insert "space" keywords between arguments
- Replace synonyms (minus → dash, period → dot)
It never outputs symbols. It never makes the dash-to-minus conversion. It just cleans up natural language into a constrained protocol format, and the processor handles the rest.
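The split can be sketched as a two-stage function. `normalize_with_llm` is a stand-in for whatever model call `normalizer-pipeline.py` actually makes, and the prompt wording is illustrative, not the project's real prompt:

```python
# Sketch of the split pipeline: LLM normalizes, processor converts.
# The prompt text below is an assumption about what the normalizer
# is instructed to do; only its four jobs come from the design.
NORMALIZE_PROMPT = """Rewrite the dictated text into protocol form:
- remove filler words ("okay", "so", "um")
- resolve self-corrections, keeping only the final intent
- insert the word "space" between command arguments
- replace synonyms: "minus" becomes "dash", "period" becomes "dot"
Never output symbols like "-" or "." yourself.

Dictated: {raw}
Protocol:"""

def transcribe(raw: str, normalize_with_llm, process) -> str:
    """Stage 1: LLM cleans speech into protocol. Stage 2: rules convert."""
    protocol = normalize_with_llm(NORMALIZE_PROMPT.format(raw=raw))
    return process(protocol)
```

Because the LLM's output space is constrained to protocol text, any hallucinated symbol is simply impossible: the processor only ever sees keywords it knows.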
4. Zero-Training Results
We tested this with pure prompting (no fine-tuning) across 3 models:
| Model | Clean | Fuzzy | Natural | Chaotic | Overall |
|---|---|---|---|---|---|
| Processor only | 92% | 0% | 0% | 2% | 23.5% |
| Qwen 2.5 1.5B | 90% | 20% | 54% | 24% | 47% |
| Qwen 2.5 0.5B | 90% | 12% | 44% | 20% | 41.5% |
| Llama 3.2 1B | 92% | 14% | 34% | 10% | 37.5% |
Key findings:
- Roughly 2x the processor-only baseline (47% vs 23.5% overall) with zero training
- Clean input maintained at 90%+ (protocol bypass: if input already has "space" keywords, skip the LLM entirely)
- Natural/chaotic show real improvement (filler stripping and self-correction resolution work)
- Fuzzy is the bottleneck (20%): inserting "space" keywords requires understanding command structure
5. The Hybrid Architecture
The winning trick: don't send everything through the LLM.
if input contains "space" keywords and no filler:
    → bypass LLM, send directly to processor
else:
    → LLM normalizes, then processor converts
This gives us:
- 96% on clean independent eval (up from 93% processor baseline)
- Near-zero latency for protocol-format input
- LLM only called when genuinely needed (26% of inputs bypassed)
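The routing check itself is a few lines. The exact keyword test and filler list used by the real pipeline are assumptions here; this is a sketch of the heuristic, not the shipped code:

```python
# Bypass heuristic: protocol-format input skips the LLM entirely.
# The filler word list is an illustrative assumption.
FILLER = {"okay", "so", "um", "uh", "wait", "actually", "no"}

def should_bypass(text: str) -> bool:
    """True if input already looks like protocol text: it uses explicit
    'space' keywords and contains no conversational filler."""
    tokens = text.lower().split()
    return "space" in tokens and not FILLER.intersection(tokens)
```

`should_bypass("git space push space dash v")` is true and goes straight to the processor at near-zero latency; the chaotic example "dash dash no wait just dash v" fails both checks and is routed through the LLM.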
6. Where Prompting Hits Its Ceiling
Fuzzy normalization is the hard problem. The LLM needs to understand:
- "cat file period txt" → "cat" and "file.txt" are separate tokens (need "space")
- But within "file.txt", "file" + "dot" + "txt" concatenate (no "space")
- "dash dash verbose" → compound flag, stays together
- "dash u space origin" → flag and argument, need "space"
This requires understanding command structure β which words are commands, flags, paths, filenames. A 1.5B model can't learn this from 12 few-shot examples. But it CAN learn it from 5,000 training examples.
7. The Path Forward
The fine-tuning task just got dramatically simpler:
- Old task: dictated text → final syntax (model must learn ALL symbol mappings)
- New task: dictated text → protocol text (model only learns WHERE to put "space")
Same training data. Same model. Much simpler output space. The processor handles the rest.
Themes to emphasize
- Don't teach an LLM what a dictionary can do. Deterministic mappings belong in code.
- Split tasks at the boundary of language understanding. The LLM handles ambiguity; code handles rules.
- Zero-training experiments reveal architecture. Prompting told us exactly where the value is (filler stripping, correction resolution) and where it isn't (symbol conversion, space insertion).
- Evaluation infrastructure matters. The 4-difficulty eval set (clean/fuzzy/natural/chaotic) made it possible to see WHERE each approach fails, not just a single accuracy number.
Data to include
- The results table above (all 3 models x 4 difficulties)
- Architecture diagram (raw → LLM → protocol → processor → syntax)
- Comparison: end-to-end fine-tuning vs split pipeline
- Error examples showing what the LLM gets right and wrong
- Latency numbers (2.5s with LLM vs ~0ms bypassed)
Code references
All code in the datasets/ directory:
- `procedural-processor.py` – the deterministic backbone
- `normalizer-pipeline.py` – the zero-training pipeline
- `eval-fuzzy.json` – 200 entries, 4 difficulty levels
- `eval-independent.json` – 100 clean protocol entries
- Fine-tuning infrastructure in `finetune/` (from Part 1)
Open questions for Part 3
- How much does fine-tuning the normalizer improve fuzzy accuracy?
- Can we generate training data programmatically? (take clean protocol, randomly drop "space" keywords, add filler)
- Is there a sweet spot between prompting and fine-tuning? (e.g., fine-tune on 100 examples instead of 5000)
- Should the normalizer be a separate model from the transcription engine?
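The programmatic data-generation question is cheap to prototype. A sketch, where the filler phrases and drop probability are illustrative assumptions, not measured choices:

```python
import random

# Prototype of programmatic training data: start from clean protocol
# text, randomly drop "space" keywords, and sometimes prepend filler.
# Each (degraded, original) pair trains the normalizer to restore
# dropped keywords and strip filler.
FILLER_PHRASES = ["okay so", "the command is", "um"]

def degrade(protocol: str, drop_p: float = 0.5, rng=random) -> str:
    tokens = [
        t for t in protocol.split()
        if not (t == "space" and rng.random() < drop_p)
    ]
    if rng.random() < 0.5:
        tokens = rng.choice(FILLER_PHRASES).split() + tokens
    return " ".join(tokens)
```

One nice property: the degraded side never touches the command tokens themselves, so the target is always recoverable and label noise is zero by construction.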