| # Part 2: When Fine-Tuning Isn't the Answer (Yet) |
|
|
| > Follow-up to "Teaching a Tiny Model to Hear Bash" |
> Working title: refine before publishing
|
|
| ## Narrative arc |
|
|
Part 1 ended on a high: 97% accuracy, 3GB RAM, under a second. But there's a catch we glossed over: that 97% is on **clean protocol input**. When users speak naturally ("okay so the command is...") or make corrections mid-sentence ("dash dash no wait just dash v"), the model falls apart.
|
|
| This post is about what we tried next, what we learned, and the architectural insight that changed our approach. |
|
|
| ## Key beats |
|
|
| ### 1. The 97% Illusion |
|
|
| The fine-tuned model is great... if you speak its language perfectly. Real users don't. |
|
|
| Four difficulty levels: |
- **Clean**: "git space push space dash u space origin space main" → 93% (processor alone)
- **Fuzzy**: "git commit minus m quote fix login bug quote" → 0% (no "space" keywords)
- **Natural**: "okay so the command is git push dash u origin main" → 0% (filler)
- **Chaotic**: "dash dash no wait just dash v" → 0% (self-corrections)
|
|
| The training data was clean. Reality isn't. |
|
|
| ### 2. The Procedural Processor Discovery |
|
|
| Before throwing more ML at it, we asked: how much of this task is deterministic? |
|
|
| Answer: almost all of it. "dash" always means "-". "dot" always means ".". A rule-based token scanner gets **93% on clean input** with zero hallucination, zero latency, zero training. |
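A minimal sketch of such a scanner (illustrative only; `SYMBOLS` here is a tiny subset, and the real `procedural-processor.py` presumably covers many more mappings):

```python
# Rule-based token scanner: "space" separates arguments; every other word
# concatenates onto the current token after symbol substitution.
SYMBOLS = {"dash": "-", "dot": ".", "slash": "/", "quote": '"', "tilde": "~"}

def process(dictated: str) -> str:
    out, current = [], ""
    for word in dictated.split():
        if word == "space":
            out.append(current)   # close the current token
            current = ""
        else:
            current += SYMBOLS.get(word, word)
    out.append(current)
    return " ".join(out)

print(process("git space push space dash u space origin space main"))
# → "git push -u origin main"
```

Note how "dash u" with no "space" between concatenates into `-u`: in the protocol, adjacency means concatenation, and only the literal word "space" splits tokens.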
|
|
This raised the question: what is the LLM actually contributing? Mostly memorizing fixed mappings. "dash" appears 11,207 times in the training data, and the model learned the mapping every time, but a dictionary lookup does the same job.
|
|
| ### 3. The Split Architecture |
|
|
| The insight: **use each tool for what it's good at.** |
|
|
| ``` |
Raw speech → LLM (language understanding) → Protocol text → Processor (deterministic) → Final syntax
| ``` |
|
|
| The LLM's job shrinks dramatically: |
| - Strip conversational filler |
| - Resolve self-corrections ("no wait, actually...") |
| - Insert "space" keywords between arguments |
- Replace synonyms (minus → dash, period → dot)
|
|
| It never outputs symbols. It never makes the dash-to-minus conversion. It just cleans up natural language into a constrained protocol format, and the processor handles the rest. |
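As a sketch, the system prompt for such a normalizer might look like this (the wording and the worked example are illustrative, not the prompt we actually shipped):

```python
# Illustrative normalizer prompt: the LLM's only job is raw speech ->
# protocol text. The processor handles all symbol conversion downstream.
NORMALIZER_PROMPT = """\
You rewrite dictated shell commands into protocol form.
- Strip conversational filler ("okay so", "the command is", "um").
- Resolve self-corrections, keeping only the final version.
- Insert the word "space" between separate arguments.
- Replace synonyms: minus -> dash, period -> dot.
- Never output symbols like - or . yourself; keep the spoken words.

Input: okay so the command is git push dash u origin main
Output: git space push space dash u space origin space main
"""
```

Constraining the output to protocol text keeps the LLM's vocabulary tiny, which is what makes sub-2B models viable here.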
|
|
| ### 4. Zero-Training Results |
|
|
| We tested this with pure prompting (no fine-tuning) across 3 models: |
|
|
| | Model | Clean | Fuzzy | Natural | Chaotic | Overall | |
| |---|---|---|---|---|---| |
| | Processor only | 92% | 0% | 0% | 2% | 23.5% | |
| | Qwen 2.5 1.5B | 90% | 20% | 54% | 24% | 47% | |
| | Qwen 2.5 0.5B | 90% | 12% | 44% | 20% | 41.5% | |
| | Llama 3.2 1B | 92% | 14% | 34% | 10% | 37.5% | |
|
|
Key findings:
- Roughly 2x the processor baseline (47% vs 23.5% overall) with zero training
- Clean input held at 90%+ (protocol bypass: if the input already has "space" keywords, skip the LLM entirely)
- Natural and chaotic input show real improvement (filler stripping and self-correction resolution work)
- Fuzzy is the bottleneck (20% at best): inserting "space" keywords requires understanding command structure
|
|
| ### 5. The Hybrid Architecture |
|
|
| The winning trick: **don't send everything through the LLM.** |
|
|
```python
FILLER_WORDS = {"okay", "so", "um", "uh", "wait"}  # illustrative list

def route(text: str) -> str:
    tokens = text.split()
    # Protocol bypass: input that already has "space" keywords and no
    # filler goes straight to the deterministic processor.
    if "space" in tokens and not FILLER_WORDS & set(tokens):
        return processor(text)
    # Otherwise the LLM normalizes to protocol text first.
    return processor(llm_normalize(text))
```
|
|
| This gives us: |
| - 96% on clean independent eval (up from 93% processor baseline) |
| - Near-zero latency for protocol-format input |
| - LLM only called when genuinely needed (26% of inputs bypassed) |
|
|
| ### 6. Where Prompting Hits Its Ceiling |
|
|
Fuzzy normalization is the hard problem. The LLM needs to understand:
- `cat file period txt`: "cat" and "file.txt" are separate tokens (need "space")
- But within "file.txt", "file" + "dot" + "txt" concatenate (no "space")
- `dash dash verbose`: compound flag, stays together
- `dash u space origin`: flag and argument, need "space"
|
|
| This requires understanding command structure β which words are commands, flags, paths, filenames. A 1.5B model can't learn this from 12 few-shot examples. But it CAN learn it from 5,000 training examples. |
|
|
| ### 7. The Path Forward |
|
|
| The fine-tuning task just got dramatically simpler: |
- Old task: dictated text → final syntax (model must learn ALL symbol mappings)
- New task: dictated text → protocol text (model only learns WHERE to put "space")
|
|
| Same training data. Same model. Much simpler output space. The processor handles the rest. |
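Concretely, here is one training pair under each framing, using the fuzzy example from earlier (the targets follow the protocol rules above; the exact formatting is illustrative):

```python
# One dictated input, two possible training targets.
dictated = "git commit minus m quote fix login bug quote"

# Old task target: final syntax. The model must own every symbol mapping
# (minus -> -, quote -> ") and get each character exactly right.
old_target = 'git commit -m "fix login bug"'

# New task target: protocol text only. Symbol words stay as words; the
# model's whole job is filler removal, synonyms, and "space" placement.
new_target = "git space commit space dash m space quote fix space login space bug quote"
```

The output space shrinks from arbitrary shell syntax to a small closed vocabulary, which is exactly the kind of task a small model can be fine-tuned into.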
|
|
| ## Themes to emphasize |
|
|
| - **Don't teach an LLM what a dictionary can do.** Deterministic mappings belong in code. |
| - **Split tasks at the boundary of language understanding.** The LLM handles ambiguity; code handles rules. |
| - **Zero-training experiments reveal architecture.** Prompting told us exactly where the value is (filler stripping, correction resolution) and where it isn't (symbol conversion, space insertion). |
| - **Evaluation infrastructure matters.** The 4-difficulty eval set (clean/fuzzy/natural/chaotic) made it possible to see WHERE each approach fails, not just a single accuracy number. |
|
|
| ## Data to include |
|
|
| - The results table above (all 3 models x 4 difficulties) |
- Architecture diagram (raw → LLM → protocol → processor → syntax)
| - Comparison: end-to-end fine-tuning vs split pipeline |
| - Error examples showing what the LLM gets right and wrong |
| - Latency numbers (2.5s with LLM vs ~0ms bypassed) |
|
|
| ## Code references |
|
|
| All code in the datasets/ directory: |
| - `procedural-processor.py` β the deterministic backbone |
| - `normalizer-pipeline.py` β the zero-training pipeline |
| - `eval-fuzzy.json` β 200 entries, 4 difficulty levels |
| - `eval-independent.json` β 100 clean protocol entries |
| - Fine-tuning infrastructure in `finetune/` (from Part 1) |
|
|
| ## Open questions for Part 3 |
|
|
| - How much does fine-tuning the normalizer improve fuzzy accuracy? |
| - Can we generate training data programmatically? (take clean protocol, randomly drop "space" keywords, add filler) |
| - Is there a sweet spot between prompting and fine-tuning? (e.g., fine-tune on 100 examples instead of 5000) |
| - Should the normalizer be a separate model from the transcription engine? |
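The second question can be prototyped in a few lines: take clean protocol text, randomly drop "space" keywords, and sometimes prepend filler. The filler phrases and probabilities below are illustrative assumptions, not a tuned recipe:

```python
import random

# Degrade clean protocol text into fuzzy/natural training variants.
FILLER = ["okay so", "um", "the command is", "right so"]

def degrade(protocol: str, rng: random.Random) -> str:
    # Drop each "space" keyword with 50% probability (fuzzy)...
    tokens = [t for t in protocol.split()
              if t != "space" or rng.random() < 0.5]
    text = " ".join(tokens)
    # ...then sometimes prepend conversational filler (natural).
    if rng.random() < 0.5:
        text = f"{rng.choice(FILLER)} {text}"
    return text

rng = random.Random(0)
print(degrade("git space push space dash u space origin space main", rng))
```

The clean protocol text itself serves as the training target, so every degraded variant comes with a free label.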
|
|