# training-lab

Experiments in converting voice dictation to programming syntax by teaching small models to understand spoken code.

## Domain

Converting spoken dictation like `"git space push space dash u space origin space main"` into actual syntax: `git push -u origin main`.

The challenge: users don't always speak in perfect protocol format. They use synonyms ("minus" for "dash"), skip separator words, add conversational filler ("okay so the command is..."), and make mid-sentence corrections ("no wait, actually...").

## Architecture

```
Raw speech transcript
  → Protocol detector (is it already clean?)
  → IF clean: bypass LLM → procedural processor
  → IF messy: LLM normalizer → procedural processor
  → Final syntax output
```

**Procedural processor**: deterministic token scanner. Symbol vocabulary, number words, casing directives. 93% on clean input, zero hallucination, instant.

**LLM normalizer**: rewrites messy dictation into clean protocol format. Strips filler, resolves corrections, inserts spacing keywords. The LLM never outputs actual symbols; it only outputs protocol words.
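
The routing described above can be sketched roughly as follows. This is a minimal illustration, not the code in `pipeline/`: `looks_clean`, `run_pipeline`, and the `MESSY_MARKERS` list are hypothetical names, and the detection heuristic (no filler markers, and spacing spelled out with `space`) is an assumption about what "clean" means. The normalizer and processor are passed in as plain callables so the sketch does not depend on any model library.

```python
from typing import Callable

# Hypothetical markers of conversational filler or self-correction (assumption).
MESSY_MARKERS = ("okay so", "no wait", "actually", "the command is")


def looks_clean(transcript: str) -> bool:
    """Heuristic stand-in for the protocol detector (an assumption)."""
    text = transcript.lower()
    if any(marker in text for marker in MESSY_MARKERS):
        return False
    words = text.split()
    # Multi-word protocol input is expected to spell out its spacing.
    return len(words) <= 1 or "space" in words


def run_pipeline(
    transcript: str,
    normalize: Callable[[str], str],
    process: Callable[[str], str],
) -> str:
    """Send clean input straight to the processor; normalize the rest first."""
    protocol = transcript if looks_clean(transcript) else normalize(transcript)
    return process(protocol)
```

Keeping the LLM behind a single `normalize` callable means the clean path never pays for a model call, which matches the "bypass LLM" branch in the diagram.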

## Structure

```
processor/          Deterministic symbol/number/casing processor
pipeline/           LLM + processor pipeline (zero-training normalizer)
eval/               Evaluation datasets (fuzzy + independent)
training/
  data/             Training data (syntax-reconstruction, dictation-to-bash)
  converters/       Scripts to generate training data from NL2Bash
  adapters/         Fine-tuned model adapters (LoRA/DoRA)
scripts/            Evaluation and benchmarking scripts
blog/               Writeup drafts and notes
```

## Quick start

```bash
# Run the procedural processor on clean protocol input
python3 processor/procedural.py eval/independent.json

# Run the normalizer pipeline (requires mlx-lm)
pip install mlx mlx-lm
python3 pipeline/normalizer.py eval/fuzzy.json --model mlx-community/Qwen2.5-1.5B-Instruct-4bit
```

## Results (zero-training, prompted only)

| Model | Clean | Fuzzy | Natural | Chaotic | Overall |
|---|---|---|---|---|---|
| Processor only | 92% | 0% | 0% | 2% | 23.5% |
| Qwen 2.5 1.5B | 90% | 20% | 54% | 24% | 47% |
| Qwen 2.5 0.5B | 90% | 12% | 44% | 20% | 41.5% |
| Llama 3.2 1B | 92% | 14% | 34% | 10% | 37.5% |
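
The Overall column is numerically consistent with an unweighted mean of the four category scores. That would imply the four categories are equally sized, which is an inference from the numbers rather than something stated here. A quick check, with the scores copied from the table and a hypothetical `overall` helper:

```python
# Category scores (Clean, Fuzzy, Natural, Chaotic) copied from the table above.
scores = {
    "Processor only": (92, 0, 0, 2),
    "Qwen 2.5 1.5B": (90, 20, 54, 24),
    "Qwen 2.5 0.5B": (90, 12, 44, 20),
    "Llama 3.2 1B": (92, 14, 34, 10),
}


def overall(category_scores: tuple) -> float:
    """Unweighted mean across categories (assumes equally sized categories)."""
    return sum(category_scores) / len(category_scores)


for model, cats in scores.items():
    print(f"{model}: {overall(cats):.1f}%")
```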

## Protocol format

The "space-as-a-word" protocol eliminates spacing ambiguity:

- `"space"` → literal space between tokens
- Symbol words: `dash dot slash pipe colon quote` etc.
- Casing: `camel case`, `snake case`, `pascal case`, `kebab case`
- Numbers: `zero` through `nineteen`, `twenty`...`ninety`, `hundred`, `thousand`
- Capitalization: `capital X`, `all caps WORD`
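
To make the protocol concrete, here is a minimal sketch of a token scanner for a small subset of it. It is not the implementation in `processor/procedural.py`: the `process` function and `SYMBOLS` table are hypothetical, the `quote` mapping to a double quote is an assumption, and number words, casing directives, and `all caps` are left out.

```python
# Symbol words mapped to characters. "dash" follows the README example;
# the "quote" mapping is an assumption.
SYMBOLS = {
    "dash": "-",
    "dot": ".",
    "slash": "/",
    "pipe": "|",
    "colon": ":",
    "quote": '"',
}


def process(protocol: str) -> str:
    """Convert protocol words into syntax, one token at a time."""
    tokens = protocol.split()
    out = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok == "space":
            out.append(" ")  # spacing is always explicit
        elif tok in SYMBOLS:
            out.append(SYMBOLS[tok])
        elif tok == "capital" and i + 1 < len(tokens):
            i += 1  # consume the word being capitalized
            out.append(tokens[i][:1].upper() + tokens[i][1:])
        else:
            out.append(tok)  # ordinary words pass through unchanged
        i += 1
    return "".join(out)


print(process("git space push space dash u space origin space main"))
```

Because nothing is emitted between tokens unless `space` is spoken, `dash u` joins into `-u` without any guessing about where whitespace belongs.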