| # training-lab |
|
|
| Experiments in converting voice dictation to programming syntax. Teaching small models to understand spoken code. |
|
|
| ## Domain |
|
|
| Converting spoken dictation like `"git space push space dash u space origin space main"` into actual syntax: `git push -u origin main`. |
|
|
| The challenge: users don't always speak in perfect protocol format. They use synonyms ("minus" for "dash"), skip separator words, add conversational filler ("okay so the command is..."), and make mid-sentence corrections ("no wait, actually..."). |
|
|
| ## Architecture |
|
|
| ``` |
| Raw speech transcript |
| → Protocol detector (is it already clean?) |
| → IF clean: bypass LLM → procedural processor |
| → IF messy: LLM normalizer → procedural processor |
| → Final syntax output |
| ``` |
|
|
| **Procedural processor**: deterministic token scanner. Symbol vocabulary, number words, casing directives. 93% on clean input, zero hallucination, instant. |
|
|
| **LLM normalizer**: rewrites messy dictation into clean protocol format. Strips filler, resolves corrections, inserts spacing keywords. The LLM never outputs actual symbols; it only outputs protocol words. |
|
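The routing described above can be sketched roughly as follows. This is an illustration only, not the repository's code: `looks_clean`, `route`, and both word lists are hypothetical names, and the heuristic is an assumption about how a protocol detector might work.

```python
from typing import Callable

# Hypothetical, deliberately tiny vocabularies (assumptions for illustration).
PROTOCOL_WORDS = {"space", "dash", "dot", "slash", "pipe", "colon", "quote"}
NON_PROTOCOL_HINTS = {"okay", "so", "um", "uh", "actually", "wait", "minus"}


def looks_clean(transcript: str) -> bool:
    """Guess whether a transcript is already in protocol format."""
    words = [w.strip(",.") for w in transcript.lower().split()]
    if not words or any(w in NON_PROTOCOL_HINTS for w in words):
        return False
    # Several words with no protocol keyword suggests skipped separators.
    return len(words) == 1 or any(w in PROTOCOL_WORDS for w in words)


def route(
    transcript: str,
    normalize: Callable[[str], str],
    process: Callable[[str], str],
) -> str:
    """Send clean input straight to the processor; normalize messy input first."""
    if looks_clean(transcript):
        protocol_text = transcript  # bypass the LLM entirely
    else:
        protocol_text = normalize(transcript)  # LLM emits protocol words only
    return process(protocol_text)
```

Passing `normalize` and `process` as callables keeps the sketch independent of any particular model or processor implementation.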
|
| ## Structure |
|
|
| ``` |
| processor/ Deterministic symbol/number/casing processor |
| pipeline/ LLM + processor pipeline (zero-training normalizer) |
| eval/ Evaluation datasets (fuzzy + independent) |
| training/ |
| data/ Training data (syntax-reconstruction, dictation-to-bash) |
| converters/ Scripts to generate training data from NL2Bash |
| adapters/ Fine-tuned model adapters (LoRA/DoRA) |
| scripts/ Evaluation and benchmarking scripts |
| blog/ Writeup drafts and notes |
| ``` |
|
|
| ## Quick start |
|
|
| ```bash |
| # Run the procedural processor on clean protocol input |
| python3 processor/procedural.py eval/independent.json |
| |
| # Run the normalizer pipeline (requires mlx-lm) |
| pip install mlx mlx-lm |
| python3 pipeline/normalizer.py eval/fuzzy.json --model mlx-community/Qwen2.5-1.5B-Instruct-4bit |
| ``` |
|
|
| ## Results (zero-training, prompted only) |
|
|
| | Model | Clean | Fuzzy | Natural | Chaotic | Overall | |
| |---|---|---|---|---|---| |
| | Processor only | 92% | 0% | 0% | 2% | 23.5% | |
| | Qwen 2.5 1.5B | 90% | 20% | 54% | 24% | 47% | |
| | Qwen 2.5 0.5B | 90% | 12% | 44% | 20% | 41.5% | |
| | Llama 3.2 1B | 92% | 14% | 34% | 10% | 37.5% | |
|
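The Overall column is consistent with an unweighted mean of the four category scores. The check below assumes each category contributes equally; the table does not state category sizes, so treat that as an inference rather than the documented method.

```python
# Category scores (percent) copied from the results table:
# (Clean, Fuzzy, Natural, Chaotic)
results = {
    "Processor only": (92, 0, 0, 2),
    "Qwen 2.5 1.5B": (90, 20, 54, 24),
    "Qwen 2.5 0.5B": (90, 12, 44, 20),
    "Llama 3.2 1B": (92, 14, 34, 10),
}


def overall(scores):
    """Unweighted mean of the per-category scores (assumed equal weighting)."""
    return sum(scores) / len(scores)


for model, scores in results.items():
    print(f"{model}: {overall(scores):.1f}%")
```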
|
| ## Protocol format |
|
|
| The "space-as-a-word" protocol eliminates spacing ambiguity: |
|
|
| - `"space"`: literal space between tokens |
| - Symbol words: `dash dot slash pipe colon quote` etc. |
| - Casing: `camel case`, `snake case`, `pascal case`, `kebab case` |
| - Numbers: `zero` through `nineteen`, `twenty`...`ninety`, `hundred`, `thousand` |
| - Capitalization: `capital X`, `all caps WORD` |
|
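As a rough illustration of how these rules compose, here is a minimal token scanner. It is a sketch under stated assumptions, not the code in `processor/`: the function name `process` is hypothetical, the symbol table is partial, `quote` is assumed to mean a double quote, number words are limited to single digits, and the `camel case`-style directives are omitted.

```python
# Partial, assumed mappings; the real vocabulary is larger.
SYMBOLS = {"dash": "-", "dot": ".", "slash": "/", "pipe": "|", "colon": ":", "quote": '"'}
DIGITS = {
    word: str(value)
    for value, word in enumerate(
        "zero one two three four five six seven eight nine".split()
    )
}


def process(protocol_text: str) -> str:
    """Convert protocol words to syntax; tokens join with no space unless 'space' is spoken."""
    words = protocol_text.split()
    out = []
    i = 0
    while i < len(words):
        word = words[i]
        if word == "space":
            out.append(" ")
        elif word in SYMBOLS:
            out.append(SYMBOLS[word])
        elif word in DIGITS:
            out.append(DIGITS[word])
        elif word == "capital" and i + 1 < len(words):
            i += 1  # consume the word being capitalized
            out.append(words[i][:1].upper() + words[i][1:])
        elif words[i:i + 2] == ["all", "caps"] and i + 2 < len(words):
            i += 2  # consume the word being upper-cased
            out.append(words[i].upper())
        else:
            out.append(word)  # ordinary word passes through unchanged
        i += 1
    return "".join(out)


print(process("git space push space dash u space origin space main"))
```

Because every space must be spoken explicitly, `dash u` yields `-u` with no gap, which is the ambiguity the protocol is meant to remove.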
|