arach committed on Mar 4

Commit

04558eb

1 Parent(s): 3079a1d

🧪 initial commit — voice-to-syntax training lab

Processor, pipeline, eval data, training infrastructure, and blog drafts
from the voice dictation → programming syntax experiments.

Architecture: LLM normalizer → procedural processor (split approach).
Zero-training pipeline achieves 47% on mixed-difficulty eval (2x baseline).

Files changed (49)

.gitignore +22 -0
README.md +67 -0
blog/part1-teaching-tiny-model.md +369 -0
blog/part2-split-architecture-notes.md +130 -0
eval/fuzzy.json +1202 -0
eval/independent.json +102 -0
pipeline/normalizer.py +415 -0
processor/procedural.py +377 -0
scripts/bench-inference.py +68 -0
scripts/eval-bash-v2.py +184 -0
scripts/eval-bash-v3.py +194 -0
scripts/eval-bash.py +137 -0
scripts/eval-finetune.py +150 -0
scripts/eval-holdout.py +122 -0
training/adapters/llama-1b-lora-v2/adapter_config.json +40 -0
training/adapters/llama-1b-lora/adapter_config.json +40 -0
training/adapters/qwen-0.5b-bash-v1/adapter_config.json +40 -0
training/adapters/qwen-0.5b-dora/adapter_config.json +40 -0
training/adapters/qwen-0.5b-lora-v2/adapter_config.json +40 -0
training/adapters/qwen-0.5b-lora-v3/adapter_config.json +40 -0
training/adapters/qwen-0.5b-lora/adapter_config.json +40 -0
training/adapters/qwen-1.5b-bash-v1/adapter_config.json +40 -0
training/adapters/qwen-1.5b-bash-v2-minimal/adapter_config.json +40 -0
training/adapters/qwen-1.5b-bash-v2-protocol/adapter_config.json +40 -0
training/adapters/qwen-1.5b-bash-v3-minimal/adapter_config.json +40 -0
training/adapters/qwen-1.5b-lora-v1/adapter_config.json +40 -0
training/converters/bash-to-dictation-v2.py +394 -0
training/converters/bash-to-dictation-v3.py +486 -0
training/converters/prepare-finetune.py +91 -0
training/data/bash-v2/minimal/test.jsonl +0 -0
training/data/bash-v2/minimal/train.jsonl +0 -0
training/data/bash-v2/minimal/valid.jsonl +0 -0
training/data/bash-v2/protocol/test.jsonl +0 -0
training/data/bash-v2/protocol/train.jsonl +0 -0
training/data/bash-v2/protocol/valid.jsonl +0 -0
training/data/bash-v3/minimal/test.jsonl +0 -0
training/data/bash-v3/minimal/train.jsonl +0 -0
training/data/bash-v3/minimal/valid.jsonl +0 -0
training/data/bash-v3/protocol/test.jsonl +0 -0
training/data/bash-v3/protocol/train.jsonl +0 -0
training/data/bash-v3/protocol/valid.jsonl +0 -0
training/data/chat/test.jsonl +66 -0
training/data/chat/train.jsonl +0 -0
training/data/chat/valid.jsonl +59 -0
training/data/completions/test.jsonl +66 -0
training/data/completions/train.jsonl +491 -0
training/data/completions/valid.jsonl +59 -0
training/data/dictation-to-bash.json +666 -0
training/data/syntax-reconstruction.json +685 -0

.gitignore ADDED

	@@ -0,0 +1,22 @@

+__pycache__/
+*.pyc
+.DS_Store
+*.egg-info/
+dist/
+build/
+# Secrets
+.hf
+.env
+# Large model files
+*.safetensors
+*.bin
+*.gguf
+# NL2Bash source (download separately)
+nl2bash-repo/
+# Adapter weights (track configs, not weights)
+training/adapters/*/adapter.safetensors
+training/adapters/*/model.safetensors.index.json

README.md ADDED

	@@ -0,0 +1,67 @@

+# training-lab
+Experiments in voice dictation to programming syntax. Teaching small models to understand spoken code.
+## Domain
+Converting spoken dictation like `"git space push space dash u space origin space main"` into actual syntax: `git push -u origin main`.
+The challenge: users don't always speak in perfect protocol format. They use synonyms ("minus" for "dash"), skip separator words, add conversational filler ("okay so the command is..."), and make mid-sentence corrections ("no wait, actually...").
+## Architecture
+```
+Raw speech transcript
+  → Protocol detector (is it already clean?)
+  → IF clean: bypass LLM → procedural processor
+  → IF messy: LLM normalizer → procedural processor
+  → Final syntax output
+```
+**Procedural processor** — deterministic token scanner. Symbol vocabulary, number words, casing directives. 93% on clean input (independent eval), zero hallucination, instant.
+**LLM normalizer** — rewrites messy dictation into clean protocol format. Strips filler, resolves corrections, inserts spacing keywords. The LLM never outputs actual symbols — it only outputs protocol words.
+## Structure
+```
+processor/          Deterministic symbol/number/casing processor
+pipeline/           LLM + processor pipeline (zero-training normalizer)
+eval/               Evaluation datasets (fuzzy + independent)
+training/
+  data/             Training data (syntax-reconstruction, dictation-to-bash)
+  converters/       Scripts to generate training data from NL2Bash
+  adapters/         Fine-tuned model adapters (LoRA/DoRA)
+scripts/            Evaluation and benchmarking scripts
+blog/               Writeup drafts and notes
+```
+## Quick start
+```bash
+# Run the procedural processor on clean protocol input
+python3 processor/procedural.py eval/independent.json
+# Run the normalizer pipeline (requires mlx-lm)
+pip install mlx mlx-lm
+python3 pipeline/normalizer.py eval/fuzzy.json --model mlx-community/Qwen2.5-1.5B-Instruct-4bit
+```
+## Results (zero-training, prompted only)
+| Model | Clean | Fuzzy | Natural | Chaotic | Overall |
+|---|---|---|---|---|---|
+| Processor only | 92% | 0% | 0% | 2% | 23.5% |
+| Qwen 2.5 1.5B | 90% | 20% | 54% | 24% | 47% |
+| Qwen 2.5 0.5B | 90% | 12% | 44% | 20% | 41.5% |
+| Llama 3.2 1B | 92% | 14% | 34% | 10% | 37.5% |
+## Protocol format
+The "space-as-a-word" protocol eliminates spacing ambiguity:
+- `"space"` → literal space between tokens
+- Symbol words: `dash dot slash pipe colon quote` etc.
+- Casing: `camel case`, `snake case`, `pascal case`, `kebab case`
+- Numbers: `zero` through `nineteen`, `twenty`...`ninety`, `hundred`, `thousand`
+- Capitalization: `capital X`, `all caps WORD`
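A minimal sketch of a processor for this protocol, for illustration only (it is not the code in `processor/procedural.py`). It assumes that words not separated by `space` concatenate, that a tens word followed by a units word forms one number (`eighty six` → `86`), and that `capital` and `all caps` apply to the next word only. The symbol table is a subset, `hundred`/`thousand` are not handled, and `process` is a hypothetical name.

```python
# Hypothetical sketch of a space-as-a-word processor; covers a subset of the vocabulary.
SYMBOLS = {
    "dash": "-", "dot": ".", "slash": "/", "pipe": "|", "colon": ":",
    "quote": '"', "star": "*", "tilde": "~", "at": "@", "underscore": "_",
    "equals": "=", "dollar": "$", "semicolon": ";", "backslash": "\\",
}
ONES = {w: n for n, w in enumerate(
    "zero one two three four five six seven eight nine ten eleven twelve thirteen "
    "fourteen fifteen sixteen seventeen eighteen nineteen".split())}
TENS = {w: 10 * n for n, w in enumerate(
    "twenty thirty forty fifty sixty seventy eighty ninety".split(), start=2)}
CASINGS = {
    "camel": lambda ws: ws[0] + "".join(w.capitalize() for w in ws[1:]),
    "pascal": lambda ws: "".join(w.capitalize() for w in ws),
    "snake": lambda ws: "_".join(ws),
    "kebab": lambda ws: "-".join(ws),
}


def process(dictated: str) -> str:
    """Convert protocol-format dictation into syntax; tokens join with no separator."""
    words = dictated.split()
    out = []
    i = 0
    while i < len(words):
        w = words[i]
        nxt = words[i + 1] if i + 1 < len(words) else None
        if w == "space":
            out.append(" ")
        elif w in SYMBOLS:
            out.append(SYMBOLS[w])
        elif w in TENS:
            value = TENS[w]
            if nxt in ONES and 1 <= ONES[nxt] <= 9:  # "eighty six" -> 86
                value += ONES[nxt]
                i += 1
            out.append(str(value))
        elif w in ONES:
            out.append(str(ONES[w]))  # digit-by-digit: "one two seven" -> 127
        elif w == "capital" and nxt is not None:
            out.append(nxt[:1].upper() + nxt[1:])  # capitalize the next word only
            i += 1
        elif w == "all" and nxt == "caps" and i + 2 < len(words):
            out.append(words[i + 2].upper())  # upper-case the next word only
            i += 2
        elif w in CASINGS and nxt == "case":
            j = i + 2
            while j < len(words) and words[j] != "space":
                j += 1  # the directive covers every word up to the next "space"
            if j > i + 2:
                out.append(CASINGS[w](words[i + 2:j]))
            i = j - 1
        else:
            out.append(w)  # ordinary word, emitted as-is
        i += 1
    return "".join(out)


print(process("git space push space dash u space origin space main"))
print(process("camel case get user profile"))
```

Feeding it the `clean` entries of `eval/fuzzy.json` and diffing against `expected` is a quick way to see which rules a sketch like this is missing.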

blog/part1-teaching-tiny-model.md ADDED

	@@ -0,0 +1,369 @@

+# Teaching a Tiny Model to Hear Bash
+> Fine-tuning a 1.5B model to reconstruct shell commands from voice. 97% accuracy, 3GB of RAM, under a second on a phone.
+<!-- METADATA
+slug: teaching-a-tiny-model-to-hear-bash
+date: 2026-03-04
+tags: mlx, fine-tuning, lora, on-device-ml, voice, speech-to-code
+author: Arach
+-->
+---
+<!-- ILLUSTRATION: hero
+Style: Dark terminal window with a waveform entering from the left (speech),
+flowing through a small glowing chip/brain icon in the center, and clean bash
+syntax emerging on the right. The waveform is warm orange, the chip is a cool
+blue, the output text is green on black (classic terminal).
+Alt: Speech waveform flowing through a tiny neural network into bash syntax
+-->
+## The Problem
+I'm building [Talkie](https://usetalkie.com), a voice-first productivity app. One of its features is keyboard dictation. You speak into your phone, it types into whatever app you're using.
+For regular prose, off-the-shelf speech-to-text handles it. For terminal commands, it falls apart completely.
+<!-- ILLUSTRATION: side-by-side comparison
+Left panel (labeled "What you say"):
+  A speech bubble containing: "find dot dash name star dot txt"
+Right panel (labeled "What transcription gives you"):
+  A terminal showing: find dot dash name star dot text
+Below both (labeled "What you meant"):
+  A terminal showing: find . -name *.txt
+Style: Clean, minimal. The "what you meant" panel should feel correct/resolved
+— maybe a subtle green checkmark or highlight.
+-->
+Say "find dot dash name star dot txt" to any transcription engine and you get back a faithful transcription of your words. Not the command you meant. The gap between spoken description and intended syntax is the problem.
+## The Bet: A Tiny Model, On-Device
+I wanted to know if a model small enough to run on a phone could learn this mapping end-to-end. Not a rule engine. Not a cloud API call to GPT-4. A model that fits in pocket-sized RAM and returns an answer before the user notices it's thinking.
+**Model:** Qwen2.5-1.5B-Instruct, 4-bit quantized via MLX. Fits in ~3GB.
+**Method:** LoRA fine-tuning on Apple Silicon. Rank 8, scale 20, no dropout. The whole training run uses under 3GB of memory.
+**Data:** 6,304 examples of dictated bash paired with intended syntax — 5,044 train, 630 validation, 630 test. Each example is a simple chat turn:
+```json
+{
+  "messages": [
+    {"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."},
+    {"role": "user", "content": "find dot dash name star dot txt"},
+    {"role": "assistant", "content": "find . -name *.txt"}
+  ]
+}
+```
+The data covers a wide surface of Unix — `find`, `grep`, `ssh`, `tar`, `chmod`, piped chains, quoted arguments, nested subshells, escape sequences. The dictation convention is consistent: symbols are spoken as English words ("dash", "dot", "slash", "pipe") and numbers are spelled digit-by-digit ("one two seven" for `127`).
+## Training
+```bash
+mlx_lm.lora \
+  --model mlx-community/Qwen2.5-1.5B-Instruct-4bit \
+  --train \
+  --data training/data/bash-v2/minimal \
+  --batch-size 4 \
+  --lora-layers 16 \
+  --iters 1000 \
+  --learning-rate 1e-4 \
+  --mask-prompt
+```
+One flag worth calling out: `--mask-prompt`. The model only learns to predict the assistant response, not the system and user turns. All the training signal goes to the actual reconstruction task.
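Conceptually, `--mask-prompt` changes which tokens the loss is averaged over. A toy sketch with made-up per-token losses (the numbers are illustrative, not from this run, and `masked_mean_loss` is a hypothetical helper, not part of mlx-lm):

```python
def masked_mean_loss(token_losses: list[float], prompt_len: int, mask_prompt: bool = True) -> float:
    """Mean per-token loss; with mask_prompt, the first prompt_len tokens are ignored."""
    counted = token_losses[prompt_len:] if mask_prompt else token_losses
    return sum(counted) / len(counted)


# Toy sequence: 3 prompt tokens (system + user turns) followed by 2 response tokens.
losses = [2.0, 2.0, 2.0, 0.5, 0.5]
print(masked_mean_loss(losses, prompt_len=3))                     # 0.5
print(masked_mean_loss(losses, prompt_len=3, mask_prompt=False))  # 1.4
```

With masking on, the prompt tokens contribute nothing to the gradient, so every update is about the reconstruction.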
+It converged fast.
+<!-- ILLUSTRATION: training-curve
+A clean line chart with two lines:
+  - Train loss (blue): drops steeply from ~2.5 to ~0.05
+  - Val loss (orange): drops from ~2.8 to 0.109 at iter 800, ticks up to 0.137 at 1000
+X-axis: Iteration (0 to 1000)
+Y-axis: Loss (0 to 3.0)
+A small annotation at iter 800: "best checkpoint"
+A subtle shaded region after 800 labeled "mild overfit"
+Style: Minimal, no gridlines. Just the curves and the annotation.
+-->
+| Iter | Train Loss | Val Loss |
+|------|-----------|----------|
+| 200 | 0.337 | 0.213 |
+| 400 | 0.108 | 0.204 |
+| 600 | 0.068 | 0.137 |
+| **800** | **0.049** | **0.109** |
+| 1000 | 0.052 | 0.137 |
+Best validation loss at iteration 800. A mild overfit signal by 1000. Final test loss: 0.098, perplexity: 1.103.
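Perplexity is the exponential of the mean per-token loss, so the two test figures agree:

$$
\mathrm{PPL} = \exp\left(\mathcal{L}_{\text{test}}\right) = e^{0.098} \approx 1.103
$$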
+Peak memory during training: 2.95 GB. Total wall time: about 35 minutes on a MacBook.
+## Beyond Val Loss: Does It Actually Get Commands Right?
+Validation loss says the model is learning. It doesn't say whether it produces correct commands. So I ran the full 630-example test set through inference, compared each output character-for-character against the expected command, and sorted the results into buckets.
+<!-- ILLUSTRATION: results-bar
+A horizontal stacked bar chart, single bar, full width:
+  - Green (76.2%): "Exact" — label inside
+  - Light green (21.0%): "Near" — label inside
+  - Yellow (2.4%): "Partial" — label inside or above
+  - Red (0.5%): "Wrong" — tiny sliver, label above
+Below the bar: "97.1% effective accuracy (exact + near)"
+Style: Clean, bold. The green dominates. The red sliver is barely visible.
+-->
+```
+Exact match              : 480 / 630 (76.2%)
+Near match (>90% similar): 132 / 630 (21.0%)
+Partial (70-90%)         :  15 / 630 (2.4%)
+Wrong (<70%)             :   3 / 630 (0.5%)
+Effective accuracy: 97.1%
+```
+Average inference time: 0.69 seconds per command on Apple Silicon.
+The "near match" bucket is mostly whitespace and trivial formatting — extra spaces around operators, minor quoting style differences. Functionally identical outputs. The interesting signal is in the failures.
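One way to reproduce this bucketing is sketched below. It assumes the similarity score is `difflib.SequenceMatcher.ratio()` on the raw strings; the post doesn't say which metric the eval scripts use, and `bucket`/`summarize` are hypothetical names. The `case_insensitive` flag lowercases both sides before comparing, for rescoring casing-only differences.

```python
from collections import Counter
from difflib import SequenceMatcher


def bucket(got: str, expected: str, case_insensitive: bool = False) -> str:
    """Classify one output as exact / near / partial / wrong."""
    if case_insensitive:
        got, expected = got.lower(), expected.lower()
    if got == expected:
        return "exact"
    ratio = SequenceMatcher(None, got, expected).ratio()
    if ratio > 0.9:
        return "near"      # >90% similar
    if ratio >= 0.7:
        return "partial"   # 70-90%
    return "wrong"         # <70%


def summarize(pairs: list[tuple[str, str]], case_insensitive: bool = False) -> tuple[Counter, float]:
    """Count buckets and return effective accuracy (exact + near)."""
    counts = Counter(bucket(g, e, case_insensitive) for g, e in pairs)
    effective = (counts["exact"] + counts["near"]) / len(pairs)
    return counts, effective


pairs = [
    ("find . -name *.txt", "find . -name *.txt"),   # exact
    ("find . -name  *.txt", "find . -name *.txt"),  # extra space: near
    ("df -iH", "df -ih"),                           # casing only
    ("ls", "find . -name *.txt"),                   # wrong
]
print(summarize(pairs))
print(summarize(pairs, case_insensitive=True))
```

Under this metric short commands are unforgiving: `df -iH` against `df -ih` scores about 0.83, so a single wrong-case letter can push an output into the partial bucket rather than near.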
+## Anatomy of the 3%
+Nearly every failure fell into one of two categories, with a small tail of structural errors.
+### Repeated Digits
+When the input contains a long spoken digit sequence — "one zero zero zero zero zero" for `100000` — the model starts generating correctly, then falls into a repetition loop.
+<!-- ILLUSTRATION: repetition-failure
+Show 3 examples as "cards" or terminal snippets:
+Card 1:
+  Voice: "split dash L one zero zero zero zero zero database dot sql"
+  Expected: split -l 100000 database.sql database-
+  Got: split -l 100̶0̶0̶0̶0̶0̶0̶0̶0̶0̶0̶… (trailing zeros shown as fading/struck)
+Card 2:
+  Voice: "head dash N nine nine nine nine nine nine nine nine"
+  Expected: head -n99999999 file1.txt
+  Got: head -n 99̶9̶9̶9̶9̶9̶… (same treatment)
+Card 3:
+  Voice: "ping eight dot eight dot eight dot eight"
+  Expected: ping 8.8.8.8
+  Got: ping 8̶.̶8̶.̶8̶… (same treatment)
+Style: The correct portion in white/green, the degenerate tail in red/faded,
+visually showing where the model "loses the plot."
+-->
+```
+IN:  "one zero zero zero zero zero"             →  GOT: 100[000000000…]  EXPECTED: 100000
+IN:  "nine nine nine nine nine nine nine nine"  →  GOT: 99[999999999…]   EXPECTED: 99999999
+IN:  "eight dot eight dot eight dot eight"      →  GOT: 8[.8.8.8.8.…]    EXPECTED: 8.8.8.8
+```
+This is a known weakness of small language models with repeated tokens. The model sees "I just generated a zero" and assigns high probability to the next token also being a zero. The attention pattern becomes self-reinforcing.
+All 3 of the "wrong" results in the evaluation were this exact failure mode.
+### Casing Ambiguity
+```
+IN:  "df dash I H"          →  GOT: df -iH      EXPECTED: df -ih
+IN:  "diff dash Y A B"      →  GOT: diff -y A B  EXPECTED: diff -y a b
+IN:  "cp dash R S /mnt/..." →  GOT: cp -R s/...  EXPECTED: cp -rs /...
+```
+When someone says "dash I H", should it be `-ih` or `-iH`? Both are valid commands, though `df` reads them differently (`-h` reports sizes in powers of 1024, `-H` in powers of 1000). The model preserves the casing from the spoken input, which is a reasonable default but doesn't always match the expected answer.
+21 of 630 examples (3.3%) differed only in letter casing. Score case-insensitively and they're all correct.
+The remaining 14 partial matches were structural — a doubled token, a missed path segment, a quoting difference. Real model limitations, but minor ones.
+## The Insight
+Here's the thing I didn't expect going in.
+Looking at the dictation vocabulary across the entire dataset, the mapping from spoken words to symbols is *completely deterministic*:
+<!-- ILLUSTRATION: vocabulary-table
+A visual "lookup table" or "decoder ring" showing the spoken-to-symbol mapping.
+Two columns. Left: spoken word in a speech bubble or rounded tag. Right: the
+symbol in a monospace/terminal style.
+Show the top ~15 mappings arranged in a visually interesting grid or flow:
+  dash → -       pipe → |        star → *
+  dot → .        backslash → \   semicolon → ;
+  slash → /      dollar → $      plus → +
+  quote → "      underscore → _  equals → =
+  single quote → '   tilde → ~  colon → :
+  open brace → {    close brace → }
+Below: "30 spoken tokens → 30 symbols. No ambiguity. No ML needed."
+Style: This should be visually striking — the contrast between fuzzy human speech
+and precise symbols is the point. Maybe the left side feels organic/warm and the
+right side feels precise/mechanical.
+-->
+| Spoken | Symbol | Occurrences |
+|--------|--------|-------------|
+| dash | `-` | 11,207 |
+| quote | `"` | 4,676 |
+| dot | `.` | 4,297 |
+| slash | `/` | 4,079 |
+| pipe | `\|` | 1,791 |
+| star | `*` | 1,730 |
+| backslash | `\` | 924 |
+| semicolon | `;` | 766 |
+| dollar | `$` | 636 |
+| ... | ... | ... |
+Thirty spoken tokens mapping to thirty symbols. No ambiguity. No context-dependence. A lookup table handles it perfectly.
+Same for digits: "zero" through "nine" map 1:1 to `0`-`9`, spoken digit-by-digit and concatenated. "One two seven" is always `127`. "Zero six four four" is always `0644`.
+A large share of the training signal, and of the small LoRA adapter's capacity, goes into learning these fixed mappings. Every training example where "dash" becomes `-` is a wasted gradient. The model likely had this down after the first hundred examples, and then saw "dash" eleven thousand more times.
+**The fix isn't more training. It's less work for the model.**
+## The Architecture That Emerges
+<!-- ILLUSTRATION: pipeline-architecture
+A vertical flow diagram with three stages, each as a distinct box/card:
+Stage 1 — PREPROCESSOR (labeled "Deterministic Code"):
+  Input: "find dot dash name star dot txt"
+  Processing: symbol/digit lookup table
+  Output: "find . - name * . txt"
+  Visual style: Mechanical, precise. Gear icon or lookup table icon.
+Stage 2 — MODEL (labeled "Fine-tuned 1.5B LM"):
+  Input: "find . - name * . txt"
+  Processing: structural reasoning (spacing, quoting, grouping)
+  Output: "find . -name *.txt"
+  Visual style: Neural/organic. Small brain or network icon.
+Stage 3 — POST-PROCESSOR (labeled "Deterministic Code"):
+  Input: "find . -name *.txt"
+  Processing: repetition guard, balanced quotes check
+  Output: "find . -name *.txt" ✓
+  Visual style: Same mechanical feel as Stage 1. Shield or checkmark icon.
+The key visual idea: the ML part is sandwiched between two layers of regular code.
+Most of the "intelligence" is deterministic. The model handles the narrow slice
+that actually requires judgment.
+-->
+```
+  "find dot dash name star dot txt"
+              │
+     ┌────────▼─────────┐
+     │   Preprocessor   │  Deterministic: symbol + digit expansion
+     │   (no ML)        │  "find . - name * . txt"
+     └────────┬─────────┘
+              │
+     ┌────────▼─────────┐
+     │  Fine-tuned LM   │  Structural reasoning only
+     │  (1.5B, LoRA)    │  find . -name *.txt
+     └────────┬─────────┘
+              │
+     ┌────────▼─────────┐
+     │  Post-processor   │  Repetition guard, sanity checks
+     │  (no ML)         │
+     └────────┬─────────┘
+              │
+              ▼
+     find . -name *.txt
+```
+**Preprocessor** — deterministic code, no model involved:
+- Symbol words to literal characters: `dash` → `-`, `pipe` → `|`, `open brace` → `{`
+- Digit sequences to numbers: `one two seven` → `127`, `zero six four four` → `0644`
+- Compound numbers to digits: `twenty three` → `23`, `twelve` → `12`
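A minimal sketch of this preprocessor, under a few assumptions: tokens stay space-separated (as in `"find . - name * . txt"`), every run of consecutive number words merges into one number (so two numbers spoken back to back would be fused), and the symbol table is a subset of the spoken vocabulary. `preprocess` is a hypothetical name.

```python
# Hypothetical preprocessor: symbol words -> characters, number words -> merged digits.
SYMBOLS = {
    "open brace": "{", "close brace": "}", "single quote": "'",
    "dash": "-", "dot": ".", "slash": "/", "pipe": "|", "star": "*",
    "quote": '"', "backslash": "\\", "semicolon": ";", "dollar": "$",
    "plus": "+", "underscore": "_", "equals": "=", "tilde": "~", "colon": ":",
}
UNITS = {w: n for n, w in enumerate("zero one two three four five six seven eight nine".split())}
TEENS = {w: n for n, w in enumerate(
    "ten eleven twelve thirteen fourteen fifteen sixteen seventeen eighteen nineteen".split(), start=10)}
TENS = {w: 10 * n for n, w in enumerate(
    "twenty thirty forty fifty sixty seventy eighty ninety".split(), start=2)}


def preprocess(dictated: str) -> str:
    words = dictated.split()
    tokens: list[str] = []
    digits = ""  # digits of the number currently being spoken
    i = 0
    while i < len(words):
        w = words[i]
        if w in TENS:
            value = TENS[w]
            nxt = words[i + 1] if i + 1 < len(words) else None
            if nxt in UNITS and UNITS[nxt] > 0:  # "twenty three" -> 23
                value += UNITS[nxt]
                i += 1
            digits += str(value)
        elif w in TEENS:
            digits += str(TEENS[w])               # "twelve" -> 12
        elif w in UNITS:
            digits += str(UNITS[w])               # "one two seven" -> 127
        else:
            if digits:                            # a non-number word ends the number
                tokens.append(digits)
                digits = ""
            pair = " ".join(words[i:i + 2])
            if pair in SYMBOLS:                   # two-word symbols like "open brace"
                tokens.append(SYMBOLS[pair])
                i += 1
            else:
                tokens.append(SYMBOLS.get(w, w))  # single symbol word, or plain word
        i += 1
    if digits:
        tokens.append(digits)
    return " ".join(tokens)


print(preprocess("find dot dash name star dot txt"))         # find . - name * . txt
print(preprocess("chmod zero six four four notes dot txt"))  # chmod 0644 notes . txt
```

Because the digit run is collapsed before the model sees it, an input like "one zero zero zero zero zero" reaches the model as the single token `100000`.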
+**Model** — the only part that requires ML, and now its job is purely structural:
+- Where do spaces go? (`-name` vs `- name`)
+- What gets quoted? (`"*.txt"` vs `*.txt`)
+- How do tokens group? (`-exec rm -f {} \;` as a unit)
+- What's a flag vs. an argument? (`-rs` vs `-R s`)
+**Post-processor** — deterministic code again:
+- Repetition detection: same n-gram 3+ times in a row, truncate
+- Structural validation: balanced quotes, balanced braces, no trailing artifacts
+The model becomes a structural reasoner instead of a lookup table. It stops memorizing that "dash" means `-` and starts focusing on the actually hard part: how these symbols compose into valid commands.
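The two post-processor checks can be sketched as follows, with two assumptions flagged. The repetition guard works on whitespace-separated tokens, because a character-level "3+ repeats" rule would also cut legitimate outputs such as `100000` or `8.8.8.8`; digit loops are left to the preprocessor. The balance check is a simplified model of shell quoting (backslash escapes outside single quotes, brackets ignored inside quotes), not a full parser. All function names are hypothetical.

```python
from typing import Optional

PAIRS = {")": "(", "]": "[", "}": "{"}


def truncate_repeats(text: str, max_n: int = 3, min_repeats: int = 3) -> str:
    """Cut the output after the first copy of any token n-gram repeated min_repeats+ times in a row."""
    tokens = text.split()
    for start in range(len(tokens)):
        for n in range(1, max_n + 1):
            gram = tokens[start:start + n]
            if len(gram) < n:
                break
            reps = 1
            while tokens[start + reps * n:start + (reps + 1) * n] == gram:
                reps += 1
            if reps >= min_repeats:
                return " ".join(tokens[:start + n])  # keep one copy, drop the loop
    return text


def is_balanced(cmd: str) -> bool:
    """True if quotes are closed and (), [], {} outside quotes are balanced."""
    stack: list[str] = []
    quote = None      # the open quote character, or None outside quotes
    escaped = False
    for ch in cmd:
        if escaped:
            escaped = False
            continue
        if ch == "\\" and quote != "'":
            escaped = True  # backslash escapes the next character, except inside '...'
            continue
        if quote:
            if ch == quote:
                quote = None
            continue
        if ch in "'\"":
            quote = ch
        elif ch in "([{":
            stack.append(ch)
        elif ch in PAIRS and (not stack or stack.pop() != PAIRS[ch]):
            return False  # closer with no opener, or the wrong opener
    return quote is None and not stack and not escaped


def post_process(output: str) -> Optional[str]:
    """Return the cleaned command, or None if it still looks malformed."""
    cleaned = truncate_repeats(output.strip())
    return cleaned if is_balanced(cleaned) else None


print(post_process("echo hi hi hi hi"))
print(post_process("find . -exec rm -f {} \\;"))
```

Returning `None` leaves the recovery policy (retry, or show the raw transcript) to the caller.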
+## What the Numbers Mean
+<!-- ILLUSTRATION: production-stats
+Three "stat cards" in a row, bold numbers with subtle icons:
+Card 1: "97%" with subtext "effective accuracy"
+  Icon: target/bullseye
+Card 2: "3 GB" with subtext "total memory"
+  Icon: chip/RAM stick
+Card 3: "0.7s" with subtext "per command"
+  Icon: stopwatch
+Below: "On a phone. Offline. No cloud."
+Style: Clean, confident. These numbers should feel impressive without being flashy.
+-->
+97% accuracy from a model that fits in 3GB and runs in under a second. On a phone. Offline. No API call, no network dependency, no usage fees.
+The imperfect outputs break down into three groups (the casing group overlaps the near-match bucket, so these shares don't sum to 3%):
+- **Repeated digits** (~0.6%): eliminated entirely by the preprocessor — digits never reach the model
+- **Casing** (~3.3%): arguably not errors — both casings are valid bash. Case-insensitive accuracy is already ~99%
+- **Structural** (~2.2%): genuine model limitations, mostly minor — a doubled token, a missed path segment
+With the preprocessing pipeline handling symbols and digits, the model's effective job shrinks substantially, and I'd expect accuracy above 98% without any retraining.
+## Practical Notes
+**Training cost.** 35 minutes on a MacBook, 3GB RAM. No GPU cluster. MLX makes LoRA fine-tuning on Apple Silicon feel like running a build.
+**Data efficiency.** 5,044 training examples were enough for 97%. The model converged in 800 iterations: 3,200 examples at batch size 4, less than one full pass over the training set. Small, focused datasets beat large noisy ones when the task is narrow.
+**Checkpoint selection.** Best validation loss at iteration 800 (0.109). Iteration 1000 showed mild overfitting (0.137). In practice the difference was small — both produced similar accuracy in full evaluation.
+**Inference.** 0.69 seconds average. Fast enough to run between when you stop speaking and when text appears. The user doesn't wait.
+## What's Next
+<!-- ILLUSTRATION: domains-expansion
+A grid of domain "cards" showing where this same approach applies:
+  bash     →  ✅ done (this post)
+  SQL      →  "select star from users where..."
+  regex    →  "caret open bracket A dash Z close bracket plus dollar"
+  URLs     →  "H T T P S colon slash slash..."
+  math     →  "integral from zero to infinity..."
+  file paths → "slash users slash arach slash..."
+Each card has the domain name, a tiny example of spoken → written, and a
+status indicator (done, next, future).
+Style: Grid layout. "bash" card is highlighted/completed. Others are dimmed
+or outlined, suggesting a roadmap.
+-->
+Building the preprocessing pipeline is the immediate next step — the deterministic symbol and digit expander that feeds cleaned input to the model.
+Beyond that, the approach generalizes to any domain with a consistent spoken-to-written mapping. SQL, regex, file paths, URLs, mathematical notation. The model architecture stays the same. You change the training data and the preprocessor's lookup table.
+The broader point: the right role for a small model isn't doing everything. It's doing the one thing that only a model can do, sandwiched between deterministic code that handles the rest.
+<!-- ILLUSTRATION: closing
+A minimal callback to the hero image — the same speech-to-syntax flow, but
+now with the preprocessor and post-processor stages visible as small nodes
+in the pipeline. The model in the center is smaller/lighter, because its
+job is smaller now. The deterministic stages are doing the heavy lifting.
+Or alternatively: a phone lying on a desk, terminal open, with a speech bubble
+above it containing "find dot dash name star dot txt" and the terminal showing
+the correct output. Simple, confident, done.
+-->

blog/part2-split-architecture-notes.md ADDED

	@@ -0,0 +1,130 @@

+# Part 2: When Fine-Tuning Isn't the Answer (Yet)
+> Follow-up to "Teaching a Tiny Model to Hear Bash"
+> Working title — refine before publishing
+## Narrative arc
+Part 1 ended on a high: 97% accuracy, 3GB RAM, under a second. But there's a catch we glossed over — that 97% is on **clean protocol input**. When users speak naturally ("okay so the command is...") or make corrections mid-sentence ("dash dash no wait just dash v"), the model falls apart.
+This post is about what we tried next, what we learned, and the architectural insight that changed our approach.
+## Key beats
+### 1. The 97% Illusion
+The fine-tuned model is great... if you speak its language perfectly. Real users don't.
+Four difficulty levels, with processor-only accuracy on each:
+- **Clean**: "git space push space dash u space origin space main" → 92%
+- **Fuzzy**: "git commit minus m quote fix login bug quote" → 0% (no "space" keywords)
+- **Natural**: "okay so the command is git push dash u origin main" → 0% (filler)
+- **Chaotic**: "dash dash no wait just dash v" → 2% (self-corrections)
+The training data was clean. Reality isn't.
+### 2. The Procedural Processor Discovery
+Before throwing more ML at it, we asked: how much of this task is deterministic?
+Answer: almost all of it. "dash" always means "-". "dot" always means ".". A rule-based token scanner gets **93% on clean input** (the independent eval) with zero hallucination, zero latency, zero training.
+This raised the question: what is the LLM actually contributing? Mostly, it's memorizing fixed mappings. "Dash" appears 11,207 times in the dataset; the model learned every one of them, but a dictionary lookup does the same job.
+### 3. The Split Architecture
+The insight: **use each tool for what it's good at.**
+```
+Raw speech → LLM (language understanding) → Protocol text → Processor (deterministic) → Final syntax
+```
+The LLM's job shrinks dramatically:
+- Strip conversational filler
+- Resolve self-corrections ("no wait, actually...")
+- Insert "space" keywords between arguments
+- Replace synonyms (minus→dash, period→dot)
+It never outputs symbols. It never converts "dash" into `-`. It just cleans up natural language into a constrained protocol format, and the processor handles the rest.
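Because the normalizer's output should contain only protocol words, a cheap check can catch the cases where it leaks syntax anyway. A minimal sketch, assuming any character other than ASCII letters and whitespace counts as a leak (numbers are spelled out in the protocol, so digits count too). `leaked_syntax` is a hypothetical helper, and what to do on a leak (retry, or fall back to the raw transcript) is left open.

```python
import re

# Anything that is not an ASCII letter or whitespace should never appear in protocol text.
LEAK = re.compile(r"[^A-Za-z\s]")


def leaked_syntax(normalized: str) -> list[str]:
    """Return the raw symbol or digit characters found in the normalizer's output."""
    return LEAK.findall(normalized)


print(leaked_syntax("git space push space dash u space origin space main"))  # []
print(leaked_syntax("git push -u origin main"))                              # ['-']
```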
+### 4. Zero-Training Results
+We tested this with pure prompting (no fine-tuning) across 3 models:
+| Model | Clean | Fuzzy | Natural | Chaotic | Overall |
+|---|---|---|---|---|---|
+| Processor only | 92% | 0% | 0% | 2% | 23.5% |
+| Qwen 2.5 1.5B | 90% | 20% | 54% | 24% | 47% |
+| Qwen 2.5 0.5B | 90% | 12% | 44% | 20% | 41.5% |
+| Llama 3.2 1B | 92% | 14% | 34% | 10% | 37.5% |
+Key findings:
+- Qwen 2.5 1.5B doubles the processor-only baseline (47% vs 23.5%) with zero training
+- Clean input maintained at 90%+ (protocol bypass — if input already has "space" keywords, skip the LLM entirely)
+- Natural/chaotic show real improvement (filler stripping, self-correction resolution work)
+- Fuzzy is the bottleneck (20%) — inserting "space" keywords requires understanding command structure
+### 5. The Hybrid Architecture
+The winning trick: **don't send everything through the LLM.**
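+```python
+# Hypothetical filler list for illustration; the real detector may use a different one.
+FILLER = {"okay", "so", "um", "uh", "wait", "actually"}
+
+def route(transcript: str) -> str:
+    words = transcript.lower().split()
+    if "space" in words and not FILLER.intersection(words):
+        return "processor"      # bypass LLM, send directly to processor
+    return "llm+processor"      # LLM normalizes, then processor converts
+```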
+This gives us:
+- 96% on clean independent eval (up from 93% processor baseline)
+- Near-zero latency for protocol-format input
+- LLM only called when genuinely needed (26% of inputs bypassed)
+### 6. Where Prompting Hits Its Ceiling
+Fuzzy normalization is the hard problem. The LLM needs to understand:
+- `cat file period txt` → "cat" and "file.txt" are separate tokens (need "space")
+- But within "file.txt", "file" + "dot" + "txt" concatenate (no "space")
+- `dash dash verbose` → compound flag, stays together
+- `dash u space origin` → flag and argument, need "space"
+This requires understanding command structure: which words are commands, flags, paths, filenames. A 1.5B model can't learn this from 12 few-shot examples. But it should be able to learn it from 5,000 training examples, the same scale that worked end-to-end in Part 1.
+### 7. The Path Forward
+The fine-tuning task just got dramatically simpler:
+- Old task: dictated text → final syntax (model must learn ALL symbol mappings)
+- New task: dictated text → protocol text (model only learns WHERE to put "space")
+Same training data. Same model. Much simpler output space. The processor handles the rest.
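To make the new task concrete, here is one way a training pair could be built from a clean protocol string, reusing the chat format from Part 1: the protocol text is the target, and the input is the same text with "space" keywords dropped. This follows the "randomly drop space keywords" idea from the open questions; the system prompt and `make_record` are hypothetical, and filler insertion is left out.

```python
import json
import random
from typing import Optional

# Hypothetical system prompt for the normalizer task.
SYSTEM = "Rewrite the dictated text in protocol format. Output only the result."


def make_record(protocol: str, drop_prob: float = 1.0, rng: Optional[random.Random] = None) -> dict:
    """Build one chat-format example: input with "space" keywords dropped, target = protocol."""
    rng = rng or random.Random(0)
    fuzzy = " ".join(
        w for w in protocol.split() if not (w == "space" and rng.random() < drop_prob)
    )
    return {
        "messages": [
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": fuzzy},
            {"role": "assistant", "content": protocol},
        ]
    }


record = make_record("git space push space dash u space origin space main")
print(json.dumps(record, indent=2))
```

With `drop_prob` below 1.0 only some keywords are removed, giving a mix of partly and fully "fuzzy" inputs; writing one `json.dumps(record)` per line produces the JSONL layout that `mlx_lm.lora` reads.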
+## Themes to emphasize
+- **Don't teach an LLM what a dictionary can do.** Deterministic mappings belong in code.
+- **Split tasks at the boundary of language understanding.** The LLM handles ambiguity; code handles rules.
+- **Zero-training experiments reveal architecture.** Prompting told us exactly where the value is (filler stripping, correction resolution) and where it isn't (symbol conversion, space insertion).
+- **Evaluation infrastructure matters.** The 4-difficulty eval set (clean/fuzzy/natural/chaotic) made it possible to see WHERE each approach fails, not just a single accuracy number.
+## Data to include
+- The results table above (all 3 models x 4 difficulties)
+- Architecture diagram (raw → LLM → protocol → processor → syntax)
+- Comparison: end-to-end fine-tuning vs split pipeline
+- Error examples showing what the LLM gets right and wrong
+- Latency numbers (2.5s with LLM vs ~0ms bypassed)
+## Code references
+All code in this repository:
+- `processor/procedural.py` — the deterministic backbone
+- `pipeline/normalizer.py` — the zero-training pipeline
+- `eval/fuzzy.json` — 200 entries, 4 difficulty levels
+- `eval/independent.json` — 100 clean protocol entries
+- Fine-tuning infrastructure in `training/` (from Part 1)
+## Open questions for Part 3
+- How much does fine-tuning the normalizer improve fuzzy accuracy?
+- Can we generate training data programmatically? (take clean protocol, randomly drop "space" keywords, add filler)
+- Is there a sweet spot between prompting and fine-tuning? (e.g., fine-tune on 100 examples instead of 5000)
+- Should the normalizer be a separate model from the transcription engine?

eval/fuzzy.json ADDED

	@@ -0,0 +1,1202 @@

+[
+  {
+    "dictated": "git space push space dash u space origin space main",
+    "expected": "git push -u origin main",
+    "category": "git",
+    "difficulty": "clean"
+  },
+  {
+    "dictated": "docker space run space dash dash rm space dash p space eight zero eight zero colon eight zero space nginx",
+    "expected": "docker run --rm -p 8080:80 nginx",
+    "category": "docker",
+    "difficulty": "clean"
+  },
+  {
+    "dictated": "npm space install space dash capital D space typescript space at five",
+    "expected": "npm install -D typescript@5",
+    "category": "npm",
+    "difficulty": "clean"
+  },
+  {
+    "dictated": "ssh space dash i space tilde slash dot ssh slash id underscore rsa space root at one nine two dot one six eight dot one dot one",
+    "expected": "ssh -i ~/.ssh/id_rsa root@192.168.1.1",
+    "category": "ssh",
+    "difficulty": "clean"
+  },
+  {
+    "dictated": "curl space dash capital X space all caps POST space dash capital H space quote capital Content dash capital Type colon space application slash json quote space https colon slash slash api dot example dot com slash v one slash users",
+    "expected": "curl -X POST -H \"Content-Type: application/json\" https://api.example.com/v1/users",
+    "category": "curl",
+    "difficulty": "clean"
+  },
+  {
+    "dictated": "kubectl space get space pods space dash n space kube dash system",
+    "expected": "kubectl get pods -n kube-system",
+    "category": "kubectl",
+    "difficulty": "clean"
+  },
+  {
+    "dictated": "terraform space plan space dash var dash file equals production dot tfvars",
+    "expected": "terraform plan -var-file=production.tfvars",
+    "category": "terraform",
+    "difficulty": "clean"
+  },
+  {
+    "dictated": "chmod space zero seven five five space slash usr slash local slash bin slash deploy dot sh",
+    "expected": "chmod 0755 /usr/local/bin/deploy.sh",
+    "category": "filesystem",
+    "difficulty": "clean"
+  },
+  {
+    "dictated": "export space all caps DATABASE underscore URL equals quote postgres colon slash slash admin colon secret at localhost colon five four three two slash mydb quote",
+    "expected": "export DATABASE_URL=\"postgres://admin:secret@localhost:5432/mydb\"",
+    "category": "env",
+    "difficulty": "clean"
+  },
+  {
+    "dictated": "rsync space dash a v z space dash e space ssh space dot slash dist slash space user at one seven two dot sixteen dot zero dot one colon slash var slash www slash",
+    "expected": "rsync -avz -e ssh ./dist/ user@172.16.0.1:/var/www/",
+    "category": "rsync",
+    "difficulty": "clean"
+  },
+  {
+    "dictated": "psql space dash h space localhost space dash capital U space postgres space dash d space production",
+    "expected": "psql -h localhost -U postgres -d production",
+    "category": "psql",
+    "difficulty": "clean"
+  },
+  {
+    "dictated": "redis dash cli space dash h space one two seven dot zero dot zero dot one space dash p space six three seven nine space all caps PING",
+    "expected": "redis-cli -h 127.0.0.1 -p 6379 PING",
+    "category": "redis",
+    "difficulty": "clean"
+  },
+  {
+    "dictated": "brew space install space dash dash cask space visual dash studio dash code",
+    "expected": "brew install --cask visual-studio-code",
+    "category": "brew",
+    "difficulty": "clean"
+  },
+  {
+    "dictated": "systemctl space restart space nginx dot service",
+    "expected": "systemctl restart nginx.service",
+    "category": "systemctl",
+    "difficulty": "clean"
+  },
+  {
+    "dictated": "make space dash j space eight space all caps CC equals gcc space all caps CFLAGS equals quote dash capital O two dash capital Wall quote",
+    "expected": "make -j 8 CC=gcc CFLAGS=\"-O2 -Wall\"",
+    "category": "make",
+    "difficulty": "clean"
+  },
+  {
+    "dictated": "cargo space build space dash dash release space dash dash target space x eighty six underscore sixty four dash unknown dash linux dash gnu",
+    "expected": "cargo build --release --target x86_64-unknown-linux-gnu",
+    "category": "rust",
+    "difficulty": "clean"
+  },
+  {
+    "dictated": "go space build space dash o space bin slash server space dot slash cmd slash server",
+    "expected": "go build -o bin/server ./cmd/server",
+    "category": "go",
+    "difficulty": "clean"
+  },
+  {
+    "dictated": "swift space build space dash c space release space dash dash triple space arm sixty four dash apple dash macosx",
+    "expected": "swift build -c release --triple arm64-apple-macosx",
+    "category": "swift",
+    "difficulty": "clean"
+  },
+  {
+    "dictated": "aws space s three space cp space s three colon slash slash my dash bucket slash data dot csv space dot slash",
+    "expected": "aws s3 cp s3://my-bucket/data.csv ./",
+    "category": "aws",
+    "difficulty": "clean"
+  },
+  {
+    "dictated": "ffmpeg space dash i space input dot mp four space dash vf space quote scale equals one nine twenty colon one zero eighty quote space dash c colon a space copy space output dot mp four",
+    "expected": "ffmpeg -i input.mp4 -vf \"scale=1920:1080\" -c:a copy output.mp4",
+    "category": "ffmpeg",
+    "difficulty": "clean"
+  },
+  {
+    "dictated": "openssl space req space dash x five zero nine space dash newkey space rsa colon four zero nine six space dash keyout space key dot pem space dash out space cert dot pem space dash days space three six five",
+    "expected": "openssl req -x509 -newkey rsa:4096 -keyout key.pem -out cert.pem -days 365",
+    "category": "openssl",
+    "difficulty": "clean"
+  },
+  {
+    "dictated": "camel case get user profile",
+    "expected": "getUserProfile",
+    "category": "casing",
+    "difficulty": "clean"
+  },
+  {
+    "dictated": "snake case api response handler",
+    "expected": "api_response_handler",
+    "category": "casing",
+    "difficulty": "clean"
+  },
+  {
+    "dictated": "pascal case user authentication service",
+    "expected": "UserAuthenticationService",
+    "category": "casing",
+    "difficulty": "clean"
+  },
+  {
+    "dictated": "kebab case my awesome component",
+    "expected": "my-awesome-component",
+    "category": "casing",
+    "difficulty": "clean"
+  },
+  {
+    "dictated": "git space log space dash dash oneline space dash dash graph space dash n space twenty",
+    "expected": "git log --oneline --graph -n 20",
+    "category": "git",
+    "difficulty": "clean"
+  },
+  {
+    "dictated": "docker space compose space dash f space docker dash compose dot prod dot yml space up space dash d",
+    "expected": "docker compose -f docker-compose.prod.yml up -d",
+    "category": "docker",
+    "difficulty": "clean"
+  },
+  {
+    "dictated": "python space dash m space venv space dot venv space ampersand ampersand space source space dot venv slash bin slash activate",
+    "expected": "python -m venv .venv && source .venv/bin/activate",
+    "category": "python",
+    "difficulty": "clean"
+  },
+  {
+    "dictated": "pip space install space dash r space requirements dot txt space dash dash upgrade",
+    "expected": "pip install -r requirements.txt --upgrade",
+    "category": "python",
+    "difficulty": "clean"
+  },
+  {
+    "dictated": "grep space dash r space dash n space dash i space quote all caps TODO quote space dot slash src slash",
+    "expected": "grep -r -n -i \"TODO\" ./src/",
+    "category": "grep",
+    "difficulty": "clean"
+  },
+  {
+    "dictated": "tar space dash x z f space archive dot tar dot gz space dash capital C space slash opt slash app",
+    "expected": "tar -xzf archive.tar.gz -C /opt/app",
+    "category": "filesystem",
+    "difficulty": "clean"
+  },
+  {
+    "dictated": "find space dot space dash name space quote star dot log quote space dash mtime space plus seven space dash delete",
+    "expected": "find . -name \"*.log\" -mtime +7 -delete",
+    "category": "filesystem",
+    "difficulty": "clean"
+  },
+  {
+    "dictated": "echo space dollar all caps HOME slash dot config slash app dot yml",
+    "expected": "echo $HOME/.config/app.yml",
+    "category": "shell",
+    "difficulty": "clean"
+  },
+  {
+    "dictated": "sed space dash i space quote s slash old dash text slash new dash text slash g quote space config dot yaml",
+    "expected": "sed -i \"s/old-text/new-text/g\" config.yaml",
+    "category": "sed",
+    "difficulty": "clean"
+  },
+  {
+    "dictated": "awk space quote open brace print space dollar two close brace quote space data dot tsv",
+    "expected": "awk \"{print $2}\" data.tsv",
+    "category": "awk",
+    "difficulty": "clean"
+  },
+  {
+    "dictated": "xcodebuild space dash workspace space capital Talkie dot xcworkspace space dash scheme space capital Talkie space dash configuration space capital Release",
+    "expected": "xcodebuild -workspace Talkie.xcworkspace -scheme Talkie -configuration Release",
+    "category": "xcode",
+    "difficulty": "clean"
+  },
+  {
+    "dictated": "git space remote space add space upstream space https colon slash slash github dot com slash owner slash repo dot git",
+    "expected": "git remote add upstream https://github.com/owner/repo.git",
+    "category": "git",
+    "difficulty": "clean"
+  },
+  {
+    "dictated": "scp space dash capital P space two two zero two space user at ten dot zero dot zero dot five colon slash tmp slash dump dot sql space dot slash",
+    "expected": "scp -P 2202 user@10.0.0.5:/tmp/dump.sql ./",
+    "category": "scp",
+    "difficulty": "clean"
+  },
+  {
+    "dictated": "cmake space dash capital B space build space dash capital D space all caps CMAKE underscore BUILD underscore TYPE equals capital Release",
+    "expected": "cmake -B build -D CMAKE_BUILD_TYPE=Release",
+    "category": "cmake",
+    "difficulty": "clean"
+  },
+  {
+    "dictated": "gradle space dash dash parallel space dash dash build dash cache space assembleRelease",
+    "expected": "gradle --parallel --build-cache assembleRelease",
+    "category": "gradle",
+    "difficulty": "clean"
+  },
+  {
+    "dictated": "convert space input dot png space dash resize space eight hundred x six hundred space dash quality space eighty five space output dot jpg",
+    "expected": "convert input.png -resize 800x600 -quality 85 output.jpg",
+    "category": "imagemagick",
+    "difficulty": "clean"
+  },
+  {
+    "dictated": "cat space slash etc slash hosts space pipe space grep space one nine two dot one six eight",
+    "expected": "cat /etc/hosts | grep 192.168",
+    "category": "shell",
+    "difficulty": "clean"
+  },
+  {
+    "dictated": "git space stash space push space dash m space quote wip colon space refactor auth quote",
+    "expected": "git stash push -m \"wip: refactor auth\"",
+    "category": "git",
+    "difficulty": "clean"
+  },
+  {
+    "dictated": "kubectl space apply space dash f space dash space less than space manifest dot yaml",
+    "expected": "kubectl apply -f - < manifest.yaml",
+    "category": "kubectl",
+    "difficulty": "clean"
+  },
+  {
+    "dictated": "npx space create dash next dash app at latest space my dash app space dash dash typescript space dash dash tailwind space dash dash eslint",
+    "expected": "npx create-next-app@latest my-app --typescript --tailwind --eslint",
+    "category": "npm",
+    "difficulty": "clean"
+  },
+  {
+    "dictated": "docker space build space dash t space my dash app colon v one dot two dot three space dash f space capital Dockerfile dot prod space dot",
+    "expected": "docker build -t my-app:v1.2.3 -f Dockerfile.prod .",
+    "category": "docker",
+    "difficulty": "clean"
+  },
+  {
+    "dictated": "git space rebase space dash dash onto space main space feature dash base space feature dash work",
+    "expected": "git rebase --onto main feature-base feature-work",
+    "category": "git",
+    "difficulty": "clean"
+  },
+  {
+    "dictated": "env space all caps NODE underscore ENV equals production space node space server dot js",
+    "expected": "env NODE_ENV=production node server.js",
+    "category": "env",
+    "difficulty": "clean"
+  },
+  {
+    "dictated": "aws space ec two space describe dash instances space dash dash filters space quote capital Name equals instance dash state dash name comma capital Values equals running quote space dash dash query space quote capital Reservations open bracket close bracket dot capital Instances open bracket close bracket dot capital Instance capital Id quote",
+    "expected": "aws ec2 describe-instances --filters \"Name=instance-state-name,Values=running\" --query \"Reservations[].Instances[].InstanceId\"",
+    "category": "aws",
+    "difficulty": "clean"
+  },
+  {
+    "dictated": "wc space dash l space less than space data dot csv space pipe space xargs space echo space quote total rows colon quote",
+    "expected": "wc -l < data.csv | xargs echo \"total rows:\"",
+    "category": "shell",
+    "difficulty": "clean"
+  },
+  {
+    "dictated": "git commit minus m quote fix login bug quote",
+    "expected": "git commit -m \"fix login bug\"",
+    "category": "git",
+    "difficulty": "fuzzy"
+  },
+  {
+    "dictated": "ls minus l minus a slash var slash log",
+    "expected": "ls -l -a /var/log",
+    "category": "filesystem",
+    "difficulty": "fuzzy"
+  },
+  {
+    "dictated": "cat file period txt",
+    "expected": "cat file.txt",
+    "category": "filesystem",
+    "difficulty": "fuzzy"
+  },
+  {
+    "dictated": "cd forward slash usr forward slash local forward slash bin",
+    "expected": "cd /usr/local/bin",
+    "category": "filesystem",
+    "difficulty": "fuzzy"
+  },
+  {
+    "dictated": "python server period py double dash port eight thousand",
+    "expected": "python server.py --port 8000",
+    "category": "python",
+    "difficulty": "fuzzy"
+  },
+  {
+    "dictated": "git push hyphen u origin main",
+    "expected": "git push -u origin main",
+    "category": "git",
+    "difficulty": "fuzzy"
+  },
+  {
+    "dictated": "grep asterisk period log forward slash var forward slash log forward slash",
+    "expected": "grep *.log /var/log/",
+    "category": "grep",
+    "difficulty": "fuzzy"
+  },
+  {
+    "dictated": "npm install hyphen hyphen save dev eslint",
+    "expected": "npm install --save-dev eslint",
+    "category": "npm",
+    "difficulty": "fuzzy"
+  },
+  {
+    "dictated": "echo hashtag this is a comment",
+    "expected": "echo # this is a comment",
+    "category": "shell",
+    "difficulty": "fuzzy"
+  },
+  {
+    "dictated": "docker run minus minus rm minus it ubuntu",
+    "expected": "docker run --rm -it ubuntu",
+    "category": "docker",
+    "difficulty": "fuzzy"
+  },
+  {
+    "dictated": "curl minus capital X capital POST minus capital H content hyphen type colon application forward slash json",
+    "expected": "curl -X POST -H Content-Type: application/json",
+    "category": "curl",
+    "difficulty": "fuzzy"
+  },
+  {
+    "dictated": "ssh minus i tilde forward slash period ssh forward slash key period pem user at sign server",
+    "expected": "ssh -i ~/.ssh/key.pem user@server",
+    "category": "ssh",
+    "difficulty": "fuzzy"
+  },
+  {
+    "dictated": "git diff double dash staged",
+    "expected": "git diff --staged",
+    "category": "git",
+    "difficulty": "fuzzy"
+  },
+  {
+    "dictated": "tar minus xzf backup period tar period gz",
+    "expected": "tar -xzf backup.tar.gz",
+    "category": "filesystem",
+    "difficulty": "fuzzy"
+  },
+  {
+    "dictated": "find period minus name asterisk period py minus type f",
+    "expected": "find . -name *.py -type f",
+    "category": "filesystem",
+    "difficulty": "fuzzy"
+  },
+  {
+    "dictated": "export capital NODE underscore capital ENV equals sign production",
+    "expected": "export NODE_ENV=production",
+    "category": "env",
+    "difficulty": "fuzzy"
+  },
+  {
+    "dictated": "kubectl get pods minus n default double dash output json",
+    "expected": "kubectl get pods -n default --output json",
+    "category": "kubectl",
+    "difficulty": "fuzzy"
+  },
+  {
+    "dictated": "chmod seven five five script period sh",
+    "expected": "chmod 755 script.sh",
+    "category": "filesystem",
+    "difficulty": "fuzzy"
+  },
+  {
+    "dictated": "pip install flask equals sign equals sign two period zero period zero",
+    "expected": "pip install flask==2.0.0",
+    "category": "python",
+    "difficulty": "fuzzy"
+  },
+  {
+    "dictated": "git checkout minus b feature forward slash auth",
+    "expected": "git checkout -b feature/auth",
+    "category": "git",
+    "difficulty": "fuzzy"
+  },
+  {
+    "dictated": "docker compose up minus d double dash build",
+    "expected": "docker compose up -d --build",
+    "category": "docker",
+    "difficulty": "fuzzy"
+  },
+  {
+    "dictated": "psql minus capital U postgres minus d mydb minus c quote select asterisk from users quote",
+    "expected": "psql -U postgres -d mydb -c \"select * from users\"",
+    "category": "psql",
+    "difficulty": "fuzzy"
+  },
+  {
+    "dictated": "scp file period txt user at sign one ninety two period one sixty eight period one period one hundred colon tilde forward slash",
+    "expected": "scp file.txt user@192.168.1.100:~/",
+    "category": "scp",
+    "difficulty": "fuzzy"
+  },
+  {
+    "dictated": "git log double dash oneline minus n ten",
+    "expected": "git log --oneline -n 10",
+    "category": "git",
+    "difficulty": "fuzzy"
+  },
+  {
+    "dictated": "brew install double dash cask firefox",
+    "expected": "brew install --cask firefox",
+    "category": "brew",
+    "difficulty": "fuzzy"
+  },
+  {
+    "dictated": "rsync minus avz period forward slash source forward slash user at sign host colon forward slash dest forward slash",
+    "expected": "rsync -avz ./source/ user@host:/dest/",
+    "category": "rsync",
+    "difficulty": "fuzzy"
+  },
+  {
+    "dictated": "aws s3 sync period forward slash build forward slash s3 colon forward slash forward slash my hyphen bucket forward slash static",
+    "expected": "aws s3 sync ./build/ s3://my-bucket/static",
+    "category": "aws",
+    "difficulty": "fuzzy"
+  },
+  {
+    "dictated": "camelcase handle form submit",
+    "expected": "handleFormSubmit",
+    "category": "casing",
+    "difficulty": "fuzzy"
+  },
+  {
+    "dictated": "snake_case max retry count",
+    "expected": "max_retry_count",
+    "category": "casing",
+    "difficulty": "fuzzy"
+  },
+  {
+    "dictated": "git stash pop at sign open brace zero close brace",
+    "expected": "git stash pop @{0}",
+    "category": "git",
+    "difficulty": "fuzzy"
+  },
+  {
+    "dictated": "backslash n backslash t hello world",
+    "expected": "\\n\\t hello world",
+    "category": "escape",
+    "difficulty": "fuzzy"
+  },
+  {
+    "dictated": "echo dollar sign open parenthesis date close parenthesis",
+    "expected": "echo $(date)",
+    "category": "shell",
+    "difficulty": "fuzzy"
+  },
+  {
+    "dictated": "ffmpeg minus i input period mov minus codec copy output period mp4",
+    "expected": "ffmpeg -i input.mov -codec copy output.mp4",
+    "category": "ffmpeg",
+    "difficulty": "fuzzy"
+  },
+  {
+    "dictated": "terraform apply minus auto hyphen approve",
+    "expected": "terraform apply -auto-approve",
+    "category": "terraform",
+    "difficulty": "fuzzy"
+  },
+  {
+    "dictated": "go test period forward slash period period period minus v minus race",
+    "expected": "go test ./... -v -race",
+    "category": "go",
+    "difficulty": "fuzzy"
+  },
+  {
+    "dictated": "cargo run double dash double dash release",
+    "expected": "cargo run -- --release",
+    "category": "rust",
+    "difficulty": "fuzzy"
+  },
+  {
+    "dictated": "git rebase minus minus interactive capital HEAD tilde three",
+    "expected": "git rebase --interactive HEAD~3",
+    "category": "git",
+    "difficulty": "fuzzy"
+  },
+  {
+    "dictated": "nginx minus t ampersand ampersand systemctl reload nginx",
+    "expected": "nginx -t && systemctl reload nginx",
+    "category": "systemctl",
+    "difficulty": "fuzzy"
+  },
+  {
+    "dictated": "sed minus i quote s forward slash http forward slash https forward slash g quote config period yml",
+    "expected": "sed -i \"s/http/https/g\" config.yml",
+    "category": "sed",
+    "difficulty": "fuzzy"
+  },
+  {
+    "dictated": "curl minus s minus o forward slash dev forward slash null minus w quote percent open brace http underscore code close brace quote http colon forward slash forward slash localhost colon three thousand",
+    "expected": "curl -s -o /dev/null -w \"%{http_code}\" http://localhost:3000",
+    "category": "curl",
+    "difficulty": "fuzzy"
+  },
+  {
+    "dictated": "docker exec minus it container underscore name bash",
+    "expected": "docker exec -it container_name bash",
+    "category": "docker",
+    "difficulty": "fuzzy"
+  },
+  {
+    "dictated": "openssl s underscore client minus connect example period com colon four four three",
+    "expected": "openssl s_client -connect example.com:443",
+    "category": "openssl",
+    "difficulty": "fuzzy"
+  },
+  {
+    "dictated": "all-caps MAX underscore RETRIES equals sign five",
+    "expected": "MAX_RETRIES=5",
+    "category": "env",
+    "difficulty": "fuzzy"
+  },
+  {
+    "dictated": "pascal-case user profile view model",
+    "expected": "UserProfileViewModel",
+    "category": "casing",
+    "difficulty": "fuzzy"
+  },
+  {
+    "dictated": "kebab-case my react component",
+    "expected": "my-react-component",
+    "category": "casing",
+    "difficulty": "fuzzy"
+  },
+  {
+    "dictated": "redis minus cli set my colon key quote hello world quote",
+    "expected": "redis-cli set my:key \"hello world\"",
+    "category": "redis",
+    "difficulty": "fuzzy"
+  },
+  {
+    "dictated": "jq period data open bracket close bracket period name input period json",
+    "expected": "jq .data[].name input.json",
+    "category": "shell",
+    "difficulty": "fuzzy"
+  },
+  {
+    "dictated": "okay so the command is git space push space dash u space origin space main",
+    "expected": "git push -u origin main",
+    "category": "git",
+    "difficulty": "natural"
+  },
+  {
+    "dictated": "I wanna set the variable name to camel case get user profile",
+    "expected": "getUserProfile",
+    "category": "casing",
+    "difficulty": "natural"
+  },
+  {
+    "dictated": "change that to snake case api response handler",
+    "expected": "api_response_handler",
+    "category": "casing",
+    "difficulty": "natural"
+  },
+  {
+    "dictated": "the path should be slash usr slash local slash bin",
+    "expected": "/usr/local/bin",
+    "category": "filesystem",
+    "difficulty": "natural"
+  },
+  {
+    "dictated": "make it all caps DATABASE underscore URL",
+    "expected": "DATABASE_URL",
+    "category": "casing",
+    "difficulty": "natural"
+  },
+  {
+    "dictated": "can you type out docker space run space dash dash rm space nginx",
+    "expected": "docker run --rm nginx",
+    "category": "docker",
+    "difficulty": "natural"
+  },
+  {
+    "dictated": "so like the function name would be camel case handle click event",
+    "expected": "handleClickEvent",
+    "category": "casing",
+    "difficulty": "natural"
+  },
+  {
+    "dictated": "let's do npm space install space dash capital D space typescript",
+    "expected": "npm install -D typescript",
+    "category": "npm",
+    "difficulty": "natural"
+  },
+  {
+    "dictated": "um the flag is dash dash verbose",
+    "expected": "--verbose",
+    "category": "flags",
+    "difficulty": "natural"
+  },
+  {
+    "dictated": "and then pipe it to grep space dash i space error",
+    "expected": "| grep -i error",
+    "category": "shell",
+    "difficulty": "natural"
+  },
+  {
+    "dictated": "basically run git space fetch space dash dash all space ampersand ampersand space git space pull",
+    "expected": "git fetch --all && git pull",
+    "category": "git",
+    "difficulty": "natural"
+  },
+  {
+    "dictated": "I think we need kubectl space scale space deployment slash api space dash dash replicas equals three",
+    "expected": "kubectl scale deployment/api --replicas=3",
+    "category": "kubectl",
+    "difficulty": "natural"
+  },
+  {
+    "dictated": "right so set it to snake case max connection pool size",
+    "expected": "max_connection_pool_size",
+    "category": "casing",
+    "difficulty": "natural"
+  },
+  {
+    "dictated": "type out ssh space dash capital L space eight zero eight zero colon localhost colon five four three two space bastion",
+    "expected": "ssh -L 8080:localhost:5432 bastion",
+    "category": "ssh",
+    "difficulty": "natural"
+  },
+  {
+    "dictated": "so for the environment variable it's all caps AWS underscore SECRET underscore ACCESS underscore KEY",
+    "expected": "AWS_SECRET_ACCESS_KEY",
+    "category": "env",
+    "difficulty": "natural"
+  },
+  {
+    "dictated": "the terraform command should be terraform space init space dash backend dash config equals quote key equals prod slash terraform dot tfstate quote",
+    "expected": "terraform init -backend-config=\"key=prod/terraform.tfstate\"",
+    "category": "terraform",
+    "difficulty": "natural"
+  },
+  {
+    "dictated": "okay let me type the curl command so it's curl space dash s space dash capital H space quote all caps Authorization colon space capital Bearer space dollar all caps TOKEN quote space https colon slash slash api dot example dot com slash me",
+    "expected": "curl -s -H \"Authorization: Bearer $TOKEN\" https://api.example.com/me",
+    "category": "curl",
+    "difficulty": "natural"
+  },
+  {
+    "dictated": "I want to run docker space exec space dash it space postgres underscore db space psql space dash capital U space admin",
+    "expected": "docker exec -it postgres_db psql -U admin",
+    "category": "docker",
+    "difficulty": "natural"
+  },
+  {
+    "dictated": "um so the function is called pascal case create payment intent",
+    "expected": "CreatePaymentIntent",
+    "category": "casing",
+    "difficulty": "natural"
+  },
+  {
+    "dictated": "let's see we need to do pip space install space dash e space dot open bracket dev close bracket",
+    "expected": "pip install -e .[dev]",
+    "category": "python",
+    "difficulty": "natural"
+  },
+  {
+    "dictated": "the redis command I want is redis dash cli space dash dash scan space match space quote session colon star quote",
+    "expected": "redis-cli --scan --match \"session:*\"",
+    "category": "redis",
+    "difficulty": "natural"
+  },
+  {
+    "dictated": "so basically what we wanna run is git space cherry dash pick space dash dash no dash commit space abc one two three four",
+    "expected": "git cherry-pick --no-commit abc1234",
+    "category": "git",
+    "difficulty": "natural"
+  },
+  {
+    "dictated": "and the port number should be colon three thousand",
+    "expected": ":3000",
+    "category": "network",
+    "difficulty": "natural"
+  },
+  {
+    "dictated": "alright change the class name to pascal case authenticated user session",
+    "expected": "AuthenticatedUserSession",
+    "category": "casing",
+    "difficulty": "natural"
+  },
+  {
+    "dictated": "I need the output piped to jq space dot open bracket close bracket dot name",
+    "expected": "| jq .[].name",
+    "category": "shell",
+    "difficulty": "natural"
+  },
+  {
+    "dictated": "set the variable to dollar open brace all caps HOME close brace slash dot config slash app dot toml",
+    "expected": "${HOME}/.config/app.toml",
+    "category": "env",
+    "difficulty": "natural"
+  },
+  {
+    "dictated": "like the regex pattern is caret open bracket a dash z A dash Z close bracket plus dollar",
+    "expected": "^[a-zA-Z]+$",
+    "category": "regex",
+    "difficulty": "natural"
+  },
+  {
+    "dictated": "for the flag use dash dash dry dash run please",
+    "expected": "--dry-run",
+    "category": "flags",
+    "difficulty": "natural"
+  },
+  {
+    "dictated": "so the full path is tilde slash capital Library slash capital Application space capital Support slash capital Talkie slash talkie dot sqlite",
+    "expected": "~/Library/Application Support/Talkie/talkie.sqlite",
+    "category": "filesystem",
+    "difficulty": "natural"
+  },
+  {
+    "dictated": "go ahead and type git space reset space dash dash soft space capital HEAD tilde one",
+    "expected": "git reset --soft HEAD~1",
+    "category": "git",
+    "difficulty": "natural"
+  },
+  {
+    "dictated": "let me think um yeah the image tag is my dash registry dot io slash api colon v two dot one dash rc one",
+    "expected": "my-registry.io/api:v2.1-rc1",
+    "category": "docker",
+    "difficulty": "natural"
+  },
+  {
+    "dictated": "make it cargo space test space dash dash lib space dash dash space dash dash test dash threads space one",
+    "expected": "cargo test --lib -- --test-threads 1",
+    "category": "rust",
+    "difficulty": "natural"
+  },
+  {
+    "dictated": "we should set the cron to star space star slash two space star space star space star",
+    "expected": "* */2 * * *",
+    "category": "cron",
+    "difficulty": "natural"
+  },
+  {
+    "dictated": "the nginx location block should match tilde space caret slash api slash v open bracket zero dash nine close bracket plus",
+    "expected": "~ ^/api/v[0-9]+",
+    "category": "regex",
+    "difficulty": "natural"
+  },
+  {
+    "dictated": "and at the end redirect with two greater than space slash dev slash null",
+    "expected": "2> /dev/null",
+    "category": "shell",
+    "difficulty": "natural"
+  },
+  {
+    "dictated": "I need to add the alias um alias space ll equals quote ls space dash la quote to my bashrc",
+    "expected": "alias ll=\"ls -la\"",
+    "category": "shell",
+    "difficulty": "natural"
+  },
+  {
+    "dictated": "run it with like xargs space dash capital I space open brace close brace space cp space open brace close brace space backup slash",
+    "expected": "xargs -I {} cp {} backup/",
+    "category": "shell",
+    "difficulty": "natural"
+  },
+  {
+    "dictated": "the go struct tag should be json colon quote camel case user name comma omitempty quote",
+    "expected": "json:\"userName,omitempty\"",
+    "category": "go",
+    "difficulty": "natural"
+  },
+  {
+    "dictated": "so like make the webpack config output to dist slash open bracket name close bracket dot open bracket contenthash colon eight close bracket dot js",
+    "expected": "dist/[name].[contenthash:8].js",
+    "category": "webpack",
+    "difficulty": "natural"
+  },
+  {
+    "dictated": "we want to write iptables minus capital A capital INPUT minus p tcp double dash dport four four three minus j all caps ACCEPT",
+    "expected": "iptables -A INPUT -p tcp --dport 443 -j ACCEPT",
+    "category": "network",
+    "difficulty": "natural"
+  },
+  {
+    "dictated": "dash dash no wait just dash v",
+    "expected": "-v",
+    "category": "flags",
+    "difficulty": "chaotic"
+  },
+  {
+    "dictated": "the API endpoint is slash api slash v two slash users slash colon id",
+    "expected": "/api/v2/users/:id",
+    "category": "url",
+    "difficulty": "chaotic"
+  },
+  {
+    "dictated": "so we need to... actually let's just do git stash",
+    "expected": "git stash",
+    "category": "git",
+    "difficulty": "chaotic"
+  },
+  {
+    "dictated": "type ssh at sign root at the server at one ninety two dot one sixty eight dot one dot fifty",
+    "expected": "ssh root@192.168.1.50",
+    "category": "ssh",
+    "difficulty": "chaotic"
+  },
+  {
+    "dictated": "run it on port three thousand",
+    "expected": "3000",
+    "category": "network",
+    "difficulty": "chaotic"
+  },
+  {
+    "dictated": "camel case is authenticated",
+    "expected": "isAuthenticated",
+    "category": "casing",
+    "difficulty": "chaotic"
+  },
+  {
+    "dictated": "just the flag dash dash dry dash run",
+    "expected": "--dry-run",
+    "category": "flags",
+    "difficulty": "chaotic"
+  },
+  {
+    "dictated": "open curly brace newline tab quote name quote colon quote capital John quote comma newline tab quote age quote colon twenty five newline close curly brace",
+    "expected": "{\n\t\"name\": \"John\",\n\t\"age\": 25\n}",
+    "category": "json",
+    "difficulty": "chaotic"
+  },
+  {
+    "dictated": "wait no not dash dash force I meant dash dash force dash with dash lease",
+    "expected": "--force-with-lease",
+    "category": "git",
+    "difficulty": "chaotic"
+  },
+  {
+    "dictated": "the connection string is postgres colon slash slash wait what was the password oh right admin colon p at sign s s w zero r d at localhost colon five four three two slash production underscore db",
+    "expected": "postgres://admin:p@ssw0rd@localhost:5432/production_db",
+    "category": "database",
+    "difficulty": "chaotic"
+  },
+  {
+    "dictated": "kubectl get pods no actually I want kubectl get deployments dash o wide",
+    "expected": "kubectl get deployments -o wide",
+    "category": "kubectl",
+    "difficulty": "chaotic"
+  },
+  {
+    "dictated": "um okay so like the variable name should be um camel case handle submit and then no wait pascal case handle submit because it's a component",
+    "expected": "HandleSubmit",
+    "category": "casing",
+    "difficulty": "chaotic"
+  },
+  {
+    "dictated": "three files in the directory",
+    "expected": "three files in the directory",
+    "category": "ambiguous",
+    "difficulty": "chaotic"
+  },
+  {
+    "dictated": "port eighty four forty three I mean port eight four four three",
+    "expected": "8443",
+    "category": "network",
+    "difficulty": "chaotic"
+  },
+  {
+    "dictated": "the address is like http no https colon slash slash api dot production dot our company dot com slash v three slash webhook",
+    "expected": "https://api.production.our-company.com/v3/webhook",
+    "category": "url",
+    "difficulty": "chaotic"
+  },
+  {
+    "dictated": "so I need to pipe it through like four commands cat the file then grep for errors then sort then unique with count so cat space log dot txt space pipe space grep space error space pipe space sort space pipe space uniq space dash c",
+    "expected": "cat log.txt | grep error | sort | uniq -c",
+    "category": "shell",
+    "difficulty": "chaotic"
+  },
+  {
+    "dictated": "the docker tag is ghcr dot io slash my org slash my app colon sha dash wait how do you say git sha... the sha prefix",
+    "expected": "ghcr.io/my-org/my-app:sha-",
+    "category": "docker",
+    "difficulty": "chaotic"
+  },
+  {
+    "dictated": "okay write open bracket open bracket colon minus s colon minus d close bracket close bracket double ampersand echo pass pipe pipe echo fail",
+    "expected": "[[ -s -d ]] && echo pass || echo fail",
+    "category": "shell",
+    "difficulty": "chaotic"
+  },
+  {
+    "dictated": "no no no go back the command was git diff HEAD tilde two dot dot HEAD that's HEAD tilde the number two then two dots then HEAD",
+    "expected": "git diff HEAD~2..HEAD",
+    "category": "git",
+    "difficulty": "chaotic"
+  },
+  {
+    "dictated": "I need an awk command to... you know what just do awk space quote open brace if dollar three greater than one hundred print dollar zero close brace quote space data dot csv",
+    "expected": "awk \"{if $3 > 100 print $0}\" data.csv",
+    "category": "awk",
+    "difficulty": "chaotic"
+  },
+  {
+    "dictated": "the method signature is func space camel case fetch user open paren underscore id colon capital String close paren space async space throws space dash greater than capital User",
+    "expected": "func fetchUser(_ id: String) async throws -> User",
+    "category": "swift",
+    "difficulty": "chaotic"
+  },
+  {
+    "dictated": "umm set the crontab to... it should run every fifteen minutes so star slash fifteen space star space star space star space star",
+    "expected": "*/15 * * * *",
+    "category": "cron",
+    "difficulty": "chaotic"
+  },
+  {
+    "dictated": "the kubernetes label selector is app equals my dash app comma version in open paren v one comma v two close paren",
+    "expected": "app=my-app,version in (v1,v2)",
+    "category": "kubectl",
+    "difficulty": "chaotic"
+  },
+  {
+    "dictated": "type the regex... hmm it's like caret open bracket A dash Z close bracket open bracket a dash z A dash Z zero dash nine close bracket star at sign open bracket a dash z close bracket plus backslash dot open bracket a dash z close bracket open brace two comma close brace dollar",
+    "expected": "^[A-Z][a-zA-Z0-9]*@[a-z]+\\.[a-z]{2,}$",
+    "category": "regex",
+    "difficulty": "chaotic"
+  },
+  {
+    "dictated": "git rebase onto main the commit from tuesday I think it was like a b c one two three four five six seven",
+    "expected": "git rebase --onto main abc1234567",
+    "category": "git",
+    "difficulty": "chaotic"
+  },
+  {
+    "dictated": "webpack serve open paren or is it webpack dash dev dash server I always forget close paren with dash dash hot and dash dash port nine thousand",
+    "expected": "webpack serve --hot --port 9000",
+    "category": "webpack",
+    "difficulty": "chaotic"
+  },
+  {
+    "dictated": "the nginx config upstream block upstream space my underscore backend space open curly brace newline space space server space one two seven dot zero dot zero dot one colon three thousand weight equals five semicolon newline space space server space one two seven dot zero dot zero dot one colon three thousand one weight equals three semicolon newline close curly brace",
+    "expected": "upstream my_backend {\n  server 127.0.0.1:3000 weight=5;\n  server 127.0.0.1:3001 weight=3;\n}",
+    "category": "nginx",
+    "difficulty": "chaotic"
+  },
+  {
+    "dictated": "the TypeScript type is capital Record less than string comma Array less than open curly brace id colon number semicolon name colon string close curly brace greater than greater than",
+    "expected": "Record<string, Array<{id: number; name: string}>>",
+    "category": "typescript",
+    "difficulty": "chaotic"
+  },
+  {
+    "dictated": "wait I need to escape the dollar signs in the dockerfile so it's backslash dollar open paren cat slash run slash secrets slash db underscore password close paren",
+    "expected": "\\$(cat /run/secrets/db_password)",
+    "category": "docker",
+    "difficulty": "chaotic"
+  },
+  {
+    "dictated": "do the curl but like with retries so curl space dash dash retry space three space dash dash retry dash delay space two space then the url",
+    "expected": "curl --retry 3 --retry-delay 2",
+    "category": "curl",
+    "difficulty": "chaotic"
+  },
+  {
+    "dictated": "the swift property wrapper at capital Published var camel case selected tab colon capital Tab equals dot home",
+    "expected": "@Published var selectedTab: Tab = .home",
+    "category": "swift",
+    "difficulty": "chaotic"
+  },
+  {
+    "dictated": "ssh minus capital J jump dash host user at final dash host so that's proxy jumping through the bastion",
+    "expected": "ssh -J jump-host user@final-host",
+    "category": "ssh",
+    "difficulty": "chaotic"
+  },
+  {
+    "dictated": "actually scratch what I said before just do a simple ls minus la slash tmp",
+    "expected": "ls -la /tmp",
+    "category": "filesystem",
+    "difficulty": "chaotic"
+  },
+  {
+    "dictated": "the helm values should be set dash dash set image dot tag equals v one dot four dot two dash rc one and set dash dash set replicas equals three",
+    "expected": "--set image.tag=v1.4.2-rc1 --set replicas=3",
+    "category": "helm",
+    "difficulty": "chaotic"
+  },
+  {
+    "dictated": "open bracket dollar open paren date plus percent capital Y minus percent m minus percent d close paren close bracket underscore backup dot sql",
+    "expected": "[$(date +%Y-%m-%d)]_backup.sql",
+    "category": "shell",
+    "difficulty": "chaotic"
+  },
+  {
+    "dictated": "the go generics syntax is func capital Map open bracket capital T any comma capital U any close bracket open paren slice open bracket close bracket capital T comma fn func open paren capital T close paren capital U close paren open bracket close bracket capital U",
+    "expected": "func Map[T any, U any](slice []T, fn func(T) U) []U",
+    "category": "go",
+    "difficulty": "chaotic"
+  },
+  {
+    "dictated": "I want to write the SQL query but like dynamically so select star from users where created underscore at greater than dollar one and status equals quote active quote order by id limit dollar two",
+    "expected": "select * from users where created_at > $1 and status = 'active' order by id limit $2",
+    "category": "sql",
+    "difficulty": "chaotic"
+  },
+  {
+    "dictated": "the whole pipeline is cat access dot log pipe grep five hundred pipe awk open brace print dollar one close brace pipe sort pipe uniq minus c pipe sort minus r n pipe head minus five",
+    "expected": "cat access.log | grep 500 | awk {print $1} | sort | uniq -c | sort -rn | head -5",
+    "category": "shell",
+    "difficulty": "chaotic"
+  },
+  {
+    "dictated": "for the terraform block it's resource quote aws underscore lambda underscore function quote quote my underscore function quote",
+    "expected": "resource \"aws_lambda_function\" \"my_function\"",
+    "category": "terraform",
+    "difficulty": "chaotic"
+  },
+  {
+    "dictated": "I keep getting the args wrong okay the ffmpeg command is ffmpeg minus i concat colon file one dot ts pipe file two dot ts minus c copy output dot mp four",
+    "expected": "ffmpeg -i concat:file1.ts|file2.ts -c copy output.mp4",
+    "category": "ffmpeg",
+    "difficulty": "chaotic"
+  },
+  {
+    "dictated": "whoops I meant to say camel case not pascal case so it should be camel case on authentication state changed",
+    "expected": "onAuthenticationStateChanged",
+    "category": "casing",
+    "difficulty": "chaotic"
+  },
+  {
+    "dictated": "paste in the IP ten dot two forty four dot one thirty seven dot ninety two",
+    "expected": "10.244.137.92",
+    "category": "network",
+    "difficulty": "chaotic"
+  },
+  {
+    "dictated": "the package json script is quote dev quote colon quote concurrently backslash quote next dev backslash quote backslash quote tailwind minus minus watch backslash quote quote",
+    "expected": "\"dev\": \"concurrently \\\"next dev\\\" \\\"tailwind --watch\\\"\"",
+    "category": "json",
+    "difficulty": "chaotic"
+  },
+  {
+    "dictated": "should I use single or double quotes whatever just do git commit minus m quote feat colon add user authentication quote",
+    "expected": "git commit -m \"feat: add user authentication\"",
+    "category": "git",
+    "difficulty": "chaotic"
+  },
+  {
+    "dictated": "the makefile target is dot capital PHONY colon test newline test colon newline tab go test dot slash dot dot dot minus v minus count equals one",
+    "expected": ".PHONY: test\ntest:\n\tgo test ./... -v -count=1",
+    "category": "make",
+    "difficulty": "chaotic"
+  },
+  {
+    "dictated": "okay the whole thing end to end is curl minus s https colon slash slash api dot github dot com slash repos slash facebook slash react slash releases slash latest pipe jq minus r dot tag underscore name",
+    "expected": "curl -s https://api.github.com/repos/facebook/react/releases/latest | jq -r .tag_name",
+    "category": "curl",
+    "difficulty": "chaotic"
+  },
+  {
+    "dictated": "I need it to match like the group syntax so open paren question mark less than name greater than open bracket a dash z close bracket plus close paren at open bracket zero dash nine close bracket plus",
+    "expected": "(?<name>[a-z]+)@[0-9]+",
+    "category": "regex",
+    "difficulty": "chaotic"
+  },
+  {
+    "dictated": "my dot env file has like the whole nine yards so all caps REDIS underscore URL equals redis colon slash slash colon my password at redis dash primary dot internal colon six three seven nine slash zero",
+    "expected": "REDIS_URL=redis://:mypassword@redis-primary.internal:6379/0",
+    "category": "env",
+    "difficulty": "chaotic"
+  },
+  {
+    "dictated": "the github action step is uses colon actions slash checkout at v four with colon newline space space fetch dash depth colon zero",
+    "expected": "uses: actions/checkout@v4\nwith:\n  fetch-depth: 0",
+    "category": "yaml",
+    "difficulty": "chaotic"
+  },
+  {
+    "dictated": "git merge minus minus no minus ff feature forward slash payments",
+    "expected": "git merge --no-ff feature/payments",
+    "category": "git",
+    "difficulty": "fuzzy"
+  },
+  {
+    "dictated": "mkdir minus p src forward slash components forward slash auth",
+    "expected": "mkdir -p src/components/auth",
+    "category": "filesystem",
+    "difficulty": "fuzzy"
+  },
+  {
+    "dictated": "lsof minus i colon eight zero eight zero",
+    "expected": "lsof -i :8080",
+    "category": "network",
+    "difficulty": "fuzzy"
+  },
+  {
+    "dictated": "so for this one we want to type out camel case validate email address",
+    "expected": "validateEmailAddress",
+    "category": "casing",
+    "difficulty": "natural"
+  },
+  {
+    "dictated": "I want to set the environment variable to all caps NEXT underscore PUBLIC underscore API underscore URL equals https colon slash slash api dot useTalkie dot com",
+    "expected": "NEXT_PUBLIC_API_URL=https://api.useTalkie.com",
+    "category": "env",
+    "difficulty": "natural"
+  },
+  {
+    "dictated": "right so basically just add dash dash no dash cache to the docker build command",
+    "expected": "--no-cache",
+    "category": "docker",
+    "difficulty": "natural"
+  },
+  {
+    "dictated": "the file glob should be star star slash star dot test dot ts",
+    "expected": "**/*.test.ts",
+    "category": "filesystem",
+    "difficulty": "natural"
+  },
+  {
+    "dictated": "use the flag dash capital R for recursive copy",
+    "expected": "-R",
+    "category": "flags",
+    "difficulty": "natural"
+  },
+  {
+    "dictated": "can you type kubectl space logs space dash f space dash l space app equals web space dash c space nginx",
+    "expected": "kubectl logs -f -l app=web -c nginx",
+    "category": "kubectl",
+    "difficulty": "natural"
+  },
+  {
+    "dictated": "the version constraint is greater than equals two dot zero dot zero comma less than three dot zero dot zero",
+    "expected": ">=2.0.0,<3.0.0",
+    "category": "semver",
+    "difficulty": "natural"
+  },
+  {
+    "dictated": "so uh the test command would be pytest space dash v space dash dash cov equals src space tests slash",
+    "expected": "pytest -v --cov=src tests/",
+    "category": "python",
+    "difficulty": "natural"
+  },
+  {
+    "dictated": "just write out xattr space dash r space dash d space com dot apple dot quarantine space slash Applications slash capital My capital App dot app",
+    "expected": "xattr -r -d com.apple.quarantine /Applications/MyApp.app",
+    "category": "macos",
+    "difficulty": "natural"
+  },
+  {
+    "dictated": "oh and also add the flag dash dash prefer dash offline to the pnpm install",
+    "expected": "--prefer-offline",
+    "category": "npm",
+    "difficulty": "natural"
+  },
+  {
+    "dictated": "the whole thing is a mess but it should be something like docker run minus minus network equals host minus e all caps POSTGRES underscore PASSWORD equals not a real password minus d postgres colon sixteen dash alpine",
+    "expected": "docker run --network=host -e POSTGRES_PASSWORD=notarealpassword -d postgres:16-alpine",
+    "category": "docker",
+    "difficulty": "chaotic"
+  }
+]
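The fuzzy set tags each record with a `difficulty` tier (`fuzzy`, `natural`, `chaotic`), so a mixed-difficulty score can be broken down per tier. Below is a minimal sketch of that breakdown. It assumes exact string match after trimming whitespace (the repo's `scripts/eval-*.py` may normalize differently), and `predict` is a hypothetical stand-in for the normalizer + processor pipeline.

```python
"""Per-difficulty exact-match scoring for fuzzy.json-style eval records.

Sketch only: `predict` is a hypothetical stand-in for the real pipeline,
and exact match after .strip() is an assumed scoring rule.
"""
from collections import defaultdict
from typing import Callable, Dict, Iterable, Tuple


def score_by_difficulty(
    records: Iterable[dict],
    predict: Callable[[str], str],
) -> Dict[str, Tuple[int, int, float]]:
    """Return {difficulty: (correct, total, accuracy)}."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for rec in records:
        # independent.json rows carry no "difficulty" field; bucket them separately
        tier = rec.get("difficulty", "unlabeled")
        total[tier] += 1
        # Only surrounding whitespace is forgiven; everything else must match exactly
        if predict(rec["dictated"]).strip() == rec["expected"].strip():
            correct[tier] += 1
    return {t: (correct[t], total[t], correct[t] / total[t]) for t in total}


if __name__ == "__main__":
    sample = [
        {"dictated": "lsof minus i colon eight zero eight zero",
         "expected": "lsof -i :8080", "difficulty": "fuzzy"},
        {"dictated": "use the flag dash capital R for recursive copy",
         "expected": "-R", "difficulty": "natural"},
    ]
    # Hypothetical predictor that only "knows" the first example
    lookup = {"lsof minus i colon eight zero eight zero": "lsof -i :8080"}
    print(score_by_difficulty(sample, lambda s: lookup.get(s, "")))
```

To score the real file, pass `json.load(open("eval/fuzzy.json"))` as `records`. Rows from `eval/independent.json` have no `difficulty` field, so they land in an `unlabeled` bucket instead of being mixed into a tier.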

eval/independent.json ADDED

	@@ -0,0 +1,102 @@

+[
+  {"dictated": "git space push space dash u space origin space main", "expected": "git push -u origin main", "category": "git"},
+  {"dictated": "git space commit space dash m space quote fix space login space bug quote", "expected": "git commit -m \"fix login bug\"", "category": "git"},
+  {"dictated": "git space log space dash dash oneline space dash n space ten", "expected": "git log --oneline -n 10", "category": "git"},
+  {"dictated": "git space stash space pop", "expected": "git stash pop", "category": "git"},
+  {"dictated": "git space checkout space dash b space feature slash auth", "expected": "git checkout -b feature/auth", "category": "git"},
+  {"dictated": "git space diff space dash dash staged space dot", "expected": "git diff --staged .", "category": "git"},
+  {"dictated": "git space rebase space dash i space capital h capital e capital a capital d tilde three", "expected": "git rebase -i HEAD~3", "category": "git"},
+  {"dictated": "git space remote space add space origin space git at github dot com colon user slash repo dot git", "expected": "git remote add origin git@github.com:user/repo.git", "category": "git"},
+  {"dictated": "docker space run space dash dash rm space dash p space eight zero eight zero colon eight zero space nginx", "expected": "docker run --rm -p 8080:80 nginx", "category": "docker"},
+  {"dictated": "docker space compose space up space dash d", "expected": "docker compose up -d", "category": "docker"},
+  {"dictated": "docker space build space dash t space my dash app colon latest space dot", "expected": "docker build -t my-app:latest .", "category": "docker"},
+  {"dictated": "docker space exec space dash i t space my dash container space bash", "expected": "docker exec -it my-container bash", "category": "docker"},
+  {"dictated": "docker space logs space dash f space dash dash tail space one hundred space web", "expected": "docker logs -f --tail 100 web", "category": "docker"},
+  {"dictated": "docker space ps space dash a", "expected": "docker ps -a", "category": "docker"},
+  {"dictated": "docker space system space prune space dash dash all space dash f", "expected": "docker system prune --all -f", "category": "docker"},
+  {"dictated": "kubectl space get space pods space dash n space production", "expected": "kubectl get pods -n production", "category": "kubernetes"},
+  {"dictated": "kubectl space apply space dash f space deployment dot yaml", "expected": "kubectl apply -f deployment.yaml", "category": "kubernetes"},
+  {"dictated": "kubectl space logs space dash f space dash l space app equals api space dash n space default", "expected": "kubectl logs -f -l app=api -n default", "category": "kubernetes"},
+  {"dictated": "kubectl space describe space pod space my dash pod dash one two three", "expected": "kubectl describe pod my-pod-123", "category": "kubernetes"},
+  {"dictated": "kubectl space scale space deployment slash web space dash dash replicas equals three", "expected": "kubectl scale deployment/web --replicas=3", "category": "kubernetes"},
+  {"dictated": "npm space install space dash capital d space typescript", "expected": "npm install -D typescript", "category": "npm"},
+  {"dictated": "npx space create dash next dash app at latest space my dash app", "expected": "npx create-next-app@latest my-app", "category": "npm"},
+  {"dictated": "pnpm space add space dash capital d space eslint space prettier", "expected": "pnpm add -D eslint prettier", "category": "npm"},
+  {"dictated": "npm space run space build space and and space npm space test", "expected": "npm run build && npm test", "category": "npm"},
+  {"dictated": "curl space dash s capital l space https colon slash slash api dot github dot com slash repos", "expected": "curl -sL https://api.github.com/repos", "category": "curl"},
+  {"dictated": "curl space dash capital x space all caps post space dash capital h space quote capital content dash capital type colon space application slash json quote space dash d space single quote open brace quote key quote colon quote value quote close brace single quote space http colon slash slash localhost colon three thousand", "expected": "curl -X POST -H \"Content-Type: application/json\" -d '{\"key\":\"value\"}' http://localhost:3000", "category": "curl"},
+  {"dictated": "curl space dash o space file dot zip space dash capital l space https colon slash slash example dot com slash download", "expected": "curl -o file.zip -L https://example.com/download", "category": "curl"},
+  {"dictated": "curl space dash i space dash capital x space all caps delete space https colon slash slash api dot example dot com slash users slash four two", "expected": "curl -i -X DELETE https://api.example.com/users/42", "category": "curl"},
+  {"dictated": "ssh space dash i space tilde slash dot ssh slash id underscore rsa space ubuntu at one nine two dot one six eight dot one dot one hundred", "expected": "ssh -i ~/.ssh/id_rsa ubuntu@192.168.1.100", "category": "ssh"},
+  {"dictated": "ssh space dash capital l space eight zero eight zero colon localhost colon five four three two space user at bastion dot example dot com", "expected": "ssh -L 8080:localhost:5432 user@bastion.example.com", "category": "ssh"},
+  {"dictated": "scp space dash r space user at host colon slash var slash log slash space dot slash logs slash", "expected": "scp -r user@host:/var/log/ ./logs/", "category": "ssh"},
+  {"dictated": "ls space dash l a h", "expected": "ls -lah", "category": "filesystem"},
+  {"dictated": "find space dot space dash name space quote star dot ts quote space dash not space dash path space quote star slash node underscore modules slash star quote", "expected": "find . -name \"*.ts\" -not -path \"*/node_modules/*\"", "category": "filesystem"},
+  {"dictated": "chmod space seven five five space script dot sh", "expected": "chmod 755 script.sh", "category": "filesystem"},
+  {"dictated": "chown space dash capital r space www dash data colon www dash data space slash var slash www", "expected": "chown -R www-data:www-data /var/www", "category": "filesystem"},
+  {"dictated": "mkdir space dash p space src slash components slash ui", "expected": "mkdir -p src/components/ui", "category": "filesystem"},
+  {"dictated": "tar space dash x z f space archive dot tar dot gz space dash capital c space slash tmp", "expected": "tar -xzf archive.tar.gz -C /tmp", "category": "filesystem"},
+  {"dictated": "du space dash s h space star space pipe space sort space dash h r space pipe space head space dash n space ten", "expected": "du -sh * | sort -hr | head -n 10", "category": "filesystem"},
+  {"dictated": "cp space dash r space src slash space dist slash", "expected": "cp -r src/ dist/", "category": "filesystem"},
+  {"dictated": "python space dash m space venv space dot venv", "expected": "python -m venv .venv", "category": "python"},
+  {"dictated": "pip space install space dash r space requirements dot txt", "expected": "pip install -r requirements.txt", "category": "python"},
+  {"dictated": "python space dash c space quote import space sys semicolon space print open paren sys dot version close paren quote", "expected": "python -c \"import sys; print(sys.version)\"", "category": "python"},
+  {"dictated": "pytest space dash v space dash x space tests slash", "expected": "pytest -v -x tests/", "category": "python"},
+  {"dictated": "pip space install space dash e space dot open bracket dev close bracket", "expected": "pip install -e .[dev]", "category": "python"},
+  {"dictated": "uvicorn space main colon app space dash dash reload space dash dash port space eight thousand", "expected": "uvicorn main:app --reload --port 8000", "category": "python"},
+  {"dictated": "cargo space build space dash dash release", "expected": "cargo build --release", "category": "rust"},
+  {"dictated": "cargo space test space dash dash space dash dash nocapture", "expected": "cargo test -- --nocapture", "category": "rust"},
+  {"dictated": "cargo space add space serde space dash capital f space derive", "expected": "cargo add serde -F derive", "category": "rust"},
+  {"dictated": "export space all caps database underscore all caps url equals quote postgres colon slash slash user colon pass at localhost colon five four three two slash mydb quote", "expected": "export DATABASE_URL=\"postgres://user:pass@localhost:5432/mydb\"", "category": "env"},
+  {"dictated": "export space all caps path equals dollar all caps path colon slash usr slash local slash bin", "expected": "export PATH=$PATH:/usr/local/bin", "category": "env"},
+  {"dictated": "echo space dollar all caps home", "expected": "echo $HOME", "category": "env"},
+  {"dictated": "all caps node underscore all caps env equals production space node space server dot js", "expected": "NODE_ENV=production node server.js", "category": "env"},
+  {"dictated": "env space pipe space grep space dash i space all caps aws", "expected": "env | grep -i AWS", "category": "env"},
+  {"dictated": "brew space install space dash dash cask space visual dash studio dash code", "expected": "brew install --cask visual-studio-code", "category": "brew"},
+  {"dictated": "brew space upgrade space and and space brew space cleanup", "expected": "brew upgrade && brew cleanup", "category": "brew"},
+  {"dictated": "brew space services space start space postgresql at fifteen", "expected": "brew services start postgresql@15", "category": "brew"},
+  {"dictated": "brew space tap space homebrew slash cask dash fonts", "expected": "brew tap homebrew/cask-fonts", "category": "brew"},
+  {"dictated": "grep space dash r n space quote all caps todo quote space dash dash include equals quote star dot swift quote space dot", "expected": "grep -rn \"TODO\" --include=\"*.swift\" .", "category": "grep"},
+  {"dictated": "grep space dash capital e space quote caret open bracket hash close bracket quote space config dot yaml", "expected": "grep -E \"^[#]\" config.yaml", "category": "grep"},
+  {"dictated": "grep space dash v space quote caret dollar quote space file dot txt space pipe space wc space dash l", "expected": "grep -v \"^$\" file.txt | wc -l", "category": "grep"},
+  {"dictated": "swift space build space dash c space release", "expected": "swift build -c release", "category": "swift"},
+  {"dictated": "swift space test space dash dash filter space capital my capital tests", "expected": "swift test --filter MyTests", "category": "swift"},
+  {"dictated": "swift space package space init space dash dash name space capital my capital package space dash dash type space library", "expected": "swift package init --name MyPackage --type library", "category": "swift"},
+  {"dictated": "xcodebuild space dash scheme space capital talkie space dash configuration space capital release space build", "expected": "xcodebuild -scheme Talkie -configuration Release build", "category": "xcode"},
+  {"dictated": "xcodebuild space dash project space capital my capital app dot xcodeproj space dash list", "expected": "xcodebuild -project MyApp.xcodeproj -list", "category": "xcode"},
+  {"dictated": "xcrun space simctl space boot space quote i capital phone space fifteen quote", "expected": "xcrun simctl boot \"iPhone 15\"", "category": "xcode"},
+  {"dictated": "psql space dash capital u space postgres space dash d space mydb space dash c space quote all caps select space star space all caps from space users semicolon quote", "expected": "psql -U postgres -d mydb -c \"SELECT * FROM users;\"", "category": "database"},
+  {"dictated": "redis dash cli space dash h space one two seven dot zero dot zero dot one space dash p space six three seven nine space ping", "expected": "redis-cli -h 127.0.0.1 -p 6379 ping", "category": "database"},
+  {"dictated": "pg underscore dump space dash capital f c space dash d space mydb space redirect space backup dot dump", "expected": "pg_dump -Fc -d mydb > backup.dump", "category": "database"},
+  {"dictated": "sed space dash i space single quote single quote space single quote s slash old slash new slash g single quote space file dot txt", "expected": "sed -i '' 's/old/new/g' file.txt", "category": "sed"},
+  {"dictated": "awk space single quote open brace print space dollar one comma space dollar three close brace single quote space data dot csv", "expected": "awk '{print $1, $3}' data.csv", "category": "awk"},
+  {"dictated": "cat space access dot log space pipe space awk space single quote open brace print space dollar one close brace single quote space pipe space sort space pipe space uniq space dash c space pipe space sort space dash r n", "expected": "cat access.log | awk '{print $1}' | sort | uniq -c | sort -rn", "category": "pipeline"},
+  {"dictated": "ps space aux space pipe space grep space node space pipe space grep space dash v space grep", "expected": "ps aux | grep node | grep -v grep", "category": "pipeline"},
+  {"dictated": "echo space quote hello quote space append space output dot log", "expected": "echo \"hello\" >> output.log", "category": "redirect"},
+  {"dictated": "cat space error dot log space redirect space slash dev slash null", "expected": "cat error.log > /dev/null", "category": "redirect"},
+  {"dictated": "make space dash j space four space and and space make space install", "expected": "make -j 4 && make install", "category": "build"},
+  {"dictated": "cmake space dash capital b space build space dash capital d all caps cmake underscore all caps build underscore all caps type equals all caps release", "expected": "cmake -B build -DCMAKE_BUILD_TYPE=RELEASE", "category": "build"},
+  {"dictated": "go space build space dash o space bin slash server space dot slash cmd slash server", "expected": "go build -o bin/server ./cmd/server", "category": "go"},
+  {"dictated": "go space test space dash v space dash race space dot slash dot dot dot", "expected": "go test -v -race ./...", "category": "go"},
+  {"dictated": "go space mod space tidy", "expected": "go mod tidy", "category": "go"},
+  {"dictated": "terraform space plan space dash var space quote region equals us dash west dash two quote", "expected": "terraform plan -var \"region=us-west-2\"", "category": "terraform"},
+  {"dictated": "terraform space apply space dash auto dash approve", "expected": "terraform apply -auto-approve", "category": "terraform"},
+  {"dictated": "aws space s three space cp space dash dash recursive space s three colon slash slash my dash bucket slash data slash space dot slash data slash", "expected": "aws s3 cp --recursive s3://my-bucket/data/ ./data/", "category": "aws"},
+  {"dictated": "aws space ec two space describe dash instances space dash dash filters space quote capital name equals tag colon capital name comma capital values equals prod quote", "expected": "aws ec2 describe-instances --filters \"Name=tag:Name,Values=prod\"", "category": "aws"},
+  {"dictated": "gh space pr space create space dash dash title space quote fix space auth space bug quote space dash dash draft", "expected": "gh pr create --title \"fix auth bug\" --draft", "category": "github"},
+  {"dictated": "gh space issue space list space dash dash label space bug space dash dash state space open", "expected": "gh issue list --label bug --state open", "category": "github"},
+  {"dictated": "rsync space dash a v z space dash dash progress space src slash space user at host colon slash backup slash", "expected": "rsync -avz --progress src/ user@host:/backup/", "category": "network"},
+  {"dictated": "ping space dash c space four space eight dot eight dot eight dot eight", "expected": "ping -c 4 8.8.8.8", "category": "network"},
+  {"dictated": "dig space plus short space example dot com space all caps a", "expected": "dig +short example.com A", "category": "network"},
+  {"dictated": "openssl space s underscore client space dash connect space example dot com colon four four three", "expected": "openssl s_client -connect example.com:443", "category": "network"},
+  {"dictated": "systemctl space status space nginx dot service", "expected": "systemctl status nginx.service", "category": "systemd"},
+  {"dictated": "journalctl space dash u space docker space dash dash since space quote one space hour space ago quote", "expected": "journalctl -u docker --since \"1 hour ago\"", "category": "systemd"},
+  {"dictated": "bun space run space dash dash watch space src slash server dot ts", "expected": "bun run --watch src/server.ts", "category": "bun"},
+  {"dictated": "bun space add space at hono slash hono", "expected": "bun add @hono/hono", "category": "bun"},
+  {"dictated": "deno space run space dash dash allow dash net space dash dash allow dash read space server dot ts", "expected": "deno run --allow-net --allow-read server.ts", "category": "deno"},
+  {"dictated": "ffmpeg space dash i space input dot mp four space dash c colon v space libx two six four space dash crf space twenty three space output dot mp four", "expected": "ffmpeg -i input.mp4 -c:v libx264 -crf 23 output.mp4", "category": "media"},
+  {"dictated": "convert space input dot png space dash resize space fifty percent space output dot png", "expected": "convert input.png -resize 50% output.png", "category": "media"},
+  {"dictated": "jq space single quote dot data open bracket close bracket space pipe space dot name single quote space response dot json", "expected": "jq '.data[] | .name' response.json", "category": "json"},
+  {"dictated": "cat space config dot json space pipe space jq space single quote dot dot dependencies single quote", "expected": "cat config.json | jq '..dependencies'", "category": "json"},
+  {"dictated": "wrangler space deploy space dash dash env space production", "expected": "wrangler deploy --env production", "category": "cloudflare"}
+]
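Every record in the independent set spells separators out as `space`, so literal whitespace in the dictation carries no meaning. For example, `ls space dash l a h` becomes `ls -lah`. A toy decoder makes that rule concrete. This is a hypothetical sketch, not `processor/procedural.py`. The symbol table, single-digit number words, and per-word `capital` / `all caps` handling are assumptions chosen to cover a few of the rows above.

```python
"""Toy decoder for the explicit "space" protocol used in independent.json.

Hypothetical sketch, NOT processor/procedural.py. Covers only a small,
assumed subset of the vocabulary: symbol words, single-digit number words,
"capital X" and "all caps X". Any other word is glued on verbatim, because
in this protocol whitespace only ever comes from the word "space".
"""
from typing import List

SYMBOLS = {
    "space": " ", "dash": "-", "dot": ".", "slash": "/", "colon": ":",
    "underscore": "_", "pipe": "|", "star": "*", "tilde": "~", "at": "@",
    "equals": "=", "comma": ",", "quote": '"', "dollar": "$",
}
DIGITS = {w: str(i) for i, w in enumerate(
    "zero one two three four five six seven eight nine".split())}


def decode(dictated: str) -> str:
    words = dictated.split()
    out: List[str] = []
    i = 0
    while i < len(words):
        w = words[i].lower()
        if w == "capital" and i + 1 < len(words):
            # "capital my" -> "My": capitalize just the next word
            out.append(words[i + 1].capitalize())
            i += 2
        elif w == "all" and i + 2 < len(words) and words[i + 1].lower() == "caps":
            # "all caps home" -> "HOME": uppercase just the next word
            out.append(words[i + 2].upper())
            i += 3
        elif w in SYMBOLS:
            out.append(SYMBOLS[w])
            i += 1
        elif w in DIGITS:
            out.append(DIGITS[w])
            i += 1
        else:
            out.append(words[i])  # plain token, concatenated with no gap
            i += 1
    return "".join(out)


if __name__ == "__main__":
    for s in ["ls space dash l a h",
              "chmod space seven five five space script dot sh"]:
        print(decode(s))
```

The eval rows use plenty of terms this toy skips: compound numbers (`one hundred`, `eight thousand`), bracket words (`open paren`), `and and`, and `redirect`/`append`. Rows that use those terms will not decode correctly, which is what separates a real processor from this sketch.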

pipeline/normalizer.py ADDED

	@@ -0,0 +1,415 @@

+#!/usr/bin/env python3
+"""Zero-training normalizer pipeline.
+Architecture:
+  Raw transcript
+    → Protocol detector (is it already in protocol format?)
+    → IF protocol: strip filler procedurally → processor
+    → IF NOT protocol: LLM normalize → processor
+    → Final syntax output
+The LLM only handles non-protocol input (fuzzy dictation, natural language).
+Protocol-format input bypasses the LLM entirely for deterministic handling.
+"""
+import json
+import sys
+import time
+import re
+import os
+import argparse
+from collections import defaultdict
+from mlx_lm import load, generate
+from mlx_lm.sample_utils import make_sampler
+# Import the procedural processor
+sys.path.insert(0, os.path.join(os.path.dirname(__file__), '..', 'processor'))
+from procedural import process_dictation
+# ── Protocol detection ───────────────────────────────────────────────────
+# Words that are part of the protocol vocabulary (not filler)
+PROTOCOL_VOCAB = {
+    'space', 'dash', 'dot', 'slash', 'pipe', 'star', 'bang', 'hash',
+    'tilde', 'at', 'dollar', 'percent', 'caret', 'ampersand', 'equals',
+    'plus', 'colon', 'semicolon', 'underscore', 'comma', 'backslash',
+    'quote', 'backtick', 'redirect', 'append',
+    'capital', 'camel', 'snake', 'pascal', 'kebab', 'screaming',
+}
+# Common conversational filler patterns to strip
+FILLER_PREFIXES = [
+    r"^okay\s+so\s+(?:the\s+command\s+is\s+|like\s+)?",
+    r"^so\s+(?:the\s+command\s+is\s+|like\s+|it's\s+)?",
+    r"^um+\s+(?:so\s+)?(?:the\s+)?",
+    # Patterns run in order, so the specific "I want to run" must precede the
+    # generic "I want to" pattern; otherwise the generic one leaves "run" behind.
+    r"^I\s+want\s+to\s+run\s+",
+    r"^(?:I\s+wanna?|I\s+want\s+to)\s+(?:\w+\s+)*?(?:to\s+|is\s+)?",
+    r"^can\s+you\s+(?:type\s+(?:out\s+)?)?",
+    r"^(?:let's\s+(?:do|see|try)\s+)",
+    r"^basically\s+(?:run\s+|do\s+|type\s+)?",
+    r"^(?:and\s+then|then)\s+",
+    r"^right\s+so\s+",
+    r"^(?:type\s+(?:out\s+)?)",
+    r"^okay\s+(?:let\s+me\s+type\s+)?(?:the\s+)?(?:\w+\s+)?(?:command\s+)?(?:so\s+)?(?:it's\s+)?",
+    r"^I\s+think\s+we\s+need\s+",
+    r"^(?:so\s+)?for\s+the\s+\w+\s+(?:variable\s+)?(?:it's\s+)?",
+]
+FILLER_SUFFIXES = [
+    r"\s+I\s+think$",
+    r"\s+right$",
+    r"\s+yeah$",
+]
+FILLER_WORDS = {
+    'okay', 'ok', 'so', 'um', 'uh', 'like', 'basically', 'actually',
+    'i', 'the', 'can', 'right', 'wait', 'well', 'and',
+    'we', 'you', 'hmm', "let's", 'just',
+    'then', "i'm", "it's", "that's",
+    'should', 'would', 'could', 'maybe',
+}
+SELF_CORRECTION = {'wait', 'no', 'actually', 'meant', 'not'}
+def is_pure_protocol(text):
+    """Check if text is pure protocol format (no leading filler, no corrections).
+    Returns True only if:
+    1. Input contains "space" as a separator (protocol format)
+    2. Does NOT start with a filler word (conversational lead-in)
+    3. Does NOT contain self-correction markers
+    """
+    words = text.lower().split()
+    if not words:
+        return False
+    # Must contain "space" keyword
+    if 'space' not in words:
+        return False
+    # Must not start with filler
+    if words[0] in FILLER_WORDS:
+        return False
+    # Must not contain self-correction patterns
+    word_set = set(words)
+    if word_set & SELF_CORRECTION:
+        return False
+    return True
+def strip_filler(text):
+    """Procedurally strip conversational filler from text."""
+    result = text
+    for pattern in FILLER_PREFIXES:
+        result = re.sub(pattern, '', result, flags=re.IGNORECASE)
+    for pattern in FILLER_SUFFIXES:
+        result = re.sub(pattern, '', result, flags=re.IGNORECASE)
+    return result.strip()
+# ── LLM prompt (optimized for non-protocol input) ───────────────────────
+SYSTEM_PROMPT = """You normalize voice dictation into clean protocol format for a processor.
+YOUR JOB:
+1. If the input already contains "space" keywords with conversational filler → strip the filler, output the protocol content VERBATIM
+2. If input is natural speech without "space" keywords → normalize it:
+   a) Replace synonyms: minus→dash, hyphen→dash, period→dot, forward slash→slash, asterisk→star, hashtag→hash, double dash→dash dash
+   b) Insert "space" between separate arguments/tokens
+   c) Do NOT insert "space" within: paths (slash-separated), dotted names (file dot txt), compound flags (dash dash verbose)
+3. Resolve self-corrections (no wait, actually, I meant) → keep only the FINAL intent
+4. Output ONLY protocol words — never output actual symbols like - . / @ etc.
+PROTOCOL KEYWORDS (output as words):
+Separator: space
+Symbols: dash dot slash pipe star bang hash tilde at dollar percent caret ampersand equals plus colon semicolon underscore comma backslash quote backtick redirect append
+Multi-word: dash dash, single quote, open/close paren, open/close brace, open/close bracket, less than, question mark, and and, pipe pipe, dot dot, new line
+Casing: camel case, snake case, pascal case, kebab case (followed by the words to transform)
+Capitalization: capital (next word), all caps (next word)
+Numbers: zero through nineteen, twenty/thirty/.../ninety, hundred, thousand
+Output ONLY the normalized protocol text. Nothing else."""
+FEW_SHOT = [
+    # Fuzzy: missing spaces, synonym replacement needed
+    {
+        "input": "git commit minus m quote fix login bug quote",
+        "output": "git space commit space dash m space quote fix space login space bug quote"
+    },
+    {
+        "input": "cat file period txt",
+        "output": "cat space file dot txt"
+    },
+    {
+        "input": "ls minus l minus a slash var slash log",
+        "output": "ls space dash l space dash a space slash var slash log"
+    },
+    {
+        "input": "docker run minus minus rm minus it ubuntu",
+        "output": "docker space run space dash dash rm space dash it space ubuntu"
+    },
+    {
+        "input": "cd forward slash usr forward slash local forward slash bin",
+        "output": "cd space slash usr slash local slash bin"
+    },
+    {
+        "input": "python server period py double dash port eight thousand",
+        "output": "python space server dot py space dash dash port space eight thousand"
+    },
+    {
+        "input": "git push hyphen u origin main",
+        "output": "git space push space dash u space origin space main"
+    },
+    {
+        "input": "npm install hyphen hyphen save dev eslint",
+        "output": "npm space install space dash dash save dash dev space eslint"
+    },
+    # Casing: pass through verbatim; no "space" keywords between the words after the directive
+    {
+        "input": "snake case api response handler",
+        "output": "snake case api response handler"
+    },
+    {
+        "input": "camel case is authenticated",
+        "output": "camel case is authenticated"
+    },
+    # Natural: filler around protocol content, strip filler and pass through protocol
+    {
+        "input": "okay so the command is git space push space dash u space origin space main",
+        "output": "git space push space dash u space origin space main"
+    },
+    {
+        "input": "can you type out docker space run space dash dash rm space nginx",
+        "output": "docker space run space dash dash rm space nginx"
+    },
+    {
+        "input": "I wanna set the variable name to camel case get user profile",
+        "output": "camel case get user profile"
+    },
+    {
+        "input": "the path should be slash usr slash local slash bin",
+        "output": "slash usr slash local slash bin"
+    },
+    {
+        "input": "um the flag is dash dash verbose",
+        "output": "dash dash verbose"
+    },
+    {
+        "input": "so for the environment variable it's all caps AWS underscore SECRET underscore ACCESS underscore KEY",
+        "output": "all caps AWS underscore SECRET underscore ACCESS underscore KEY"
+    },
+    # Chaotic: self-corrections, plus a bare value buried in conversational speech
+    {
+        "input": "dash dash no wait just dash v",
+        "output": "dash v"
+    },
+    {
+        "input": "run it on port three thousand",
+        "output": "three thousand"
+    },
+    {
+        "input": "wait no not dash dash force I meant dash dash force dash with dash lease",
+        "output": "dash dash force dash with dash lease"
+    },
+    {
+        "input": "so we need to... actually let's just do git stash",
+        "output": "git space stash"
+    },
+]
+def build_prompt(tokenizer, user_input):
+    """Build the full prompt with system instructions, few-shot examples, and the user input."""
+    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
+    for ex in FEW_SHOT:
+        messages.append({"role": "user", "content": ex["input"]})
+        messages.append({"role": "assistant", "content": ex["output"]})
+    messages.append({"role": "user", "content": user_input})
+    return tokenizer.apply_chat_template(
+        messages, tokenize=False, add_generation_prompt=True
+    )
+def llm_normalize(model, tokenizer, raw_input, max_tokens=200):
+    """Use the LLM to normalize raw dictation into protocol format."""
+    prompt = build_prompt(tokenizer, raw_input)
+    sampler = make_sampler(temp=0.0)
+    output = generate(
+        model, tokenizer, prompt=prompt,
+        max_tokens=max_tokens, verbose=False,
+        sampler=sampler,
+    )
+    # Clean up: remove markdown code fences first (stripping backticks
+    # beforehand would leave the fence's language tag, e.g. "bash", behind),
+    # then remove any wrapping quotes/backticks.
+    result = output.strip()
+    result = re.sub(r'^```\w*\n?', '', result)
+    result = re.sub(r'\n?```$', '', result)
+    result = result.strip().strip('`').strip('"').strip("'")
+    return result.strip()
+def run_pipeline(model, tokenizer, raw_input):
+    """Full pipeline: detect format → normalize if needed → processor."""
+    t0 = time.perf_counter()
+    if is_pure_protocol(raw_input):
+        # Already in protocol format — strip filler procedurally, skip LLM
+        protocol_text = strip_filler(raw_input)
+        used_llm = False
+    else:
+        # Needs LLM normalization
+        protocol_text = llm_normalize(model, tokenizer, raw_input)
+        used_llm = True
+    t_norm = time.perf_counter()
+    final_output = process_dictation(protocol_text)
+    t_proc = time.perf_counter()
+    return {
+        'protocol': protocol_text,
+        'output': final_output,
+        'used_llm': used_llm,
+        'norm_ms': (t_norm - t0) * 1000,
+        'proc_ms': (t_proc - t_norm) * 1000,
+        'total_ms': (t_proc - t0) * 1000,
+    }
+def main():
+    parser = argparse.ArgumentParser(description='Zero-training normalizer pipeline evaluation')
+    parser.add_argument('eval_file', help='Path to evaluation JSON file')
+    parser.add_argument('--model', default='mlx-community/Qwen2.5-1.5B-Instruct-4bit',
+                        help='MLX model to use')
+    parser.add_argument('--limit', type=int, default=0,
+                        help='Limit number of entries to evaluate (0 = all)')
+    parser.add_argument('--show-all', action='store_true',
+                        help='Show all results, not just errors')
+    parser.add_argument('--show-protocol', action='store_true',
+                        help='Show normalized protocol output for each entry')
+    args = parser.parse_args()
+    # Load model
+    print(f'Loading model: {args.model}')
+    model, tokenizer = load(args.model)
+    print('Model loaded.\n')
+    # Load eval data
+    with open(args.eval_file) as f:
+        data = json.load(f)
+    if args.limit:
+        data = data[:args.limit]
+    n = len(data)
+    if n == 0:
+        print(f'No entries to evaluate in {args.eval_file}')
+        return
+    exact = ws = 0
+    llm_calls = 0
+    errors = []
+    by_difficulty = defaultdict(list)
+    latencies = []
+    print(f'Evaluating {n} entries from {args.eval_file}')
+    print(f'Pipeline: Protocol Detect → LLM ({args.model.split("/")[-1]}) / Filler Strip → Processor')
+    print('=' * 70)
+    for idx, d in enumerate(data):
+        result = run_pipeline(model, tokenizer, d['dictated'])
+        if result['used_llm']:
+            llm_calls += 1
+        expected = d['expected']
+        got = result['output']
+        ws_got = re.sub(r'\s+', ' ', got.strip())
+        ws_exp = re.sub(r'\s+', ' ', expected.strip())
+        is_exact = got == expected
+        is_ws = ws_got == ws_exp
+        if is_exact:
+            exact += 1
+        if is_ws:
+            ws += 1
+        diff = d.get('difficulty', 'unknown')
+        by_difficulty[diff].append(is_exact)
+        latencies.append(result['total_ms'])
+        marker = '.' if is_exact else 'x'
+        sys.stdout.write(marker)
+        sys.stdout.flush()
+        if (idx + 1) % 50 == 0:
+            sys.stdout.write(f' [{idx+1}/{n}]\n')
+            sys.stdout.flush()
+        if args.show_all or (args.show_protocol and not is_exact):
+            llm_tag = 'LLM' if result['used_llm'] else 'SKIP'
+            print(f'\n  [{diff:>7}] [{d.get("category", "")}] {"PASS" if is_exact else "FAIL"} ({llm_tag})')
+            print(f'    input:    {d["dictated"][:120]}')
+            if args.show_protocol:
+                print(f'    protocol: {result["protocol"][:120]}')
+            print(f'    expected: {expected[:100]}')
+            print(f'    got:      {got[:100]}')
+            print(f'    latency:  {result["total_ms"]:.0f}ms')
+        if not is_exact:
+            errors.append({
+                'dictated': d['dictated'][:120],
+                'expected': expected[:100],
+                'got': got[:100],
+                'protocol': result['protocol'][:120],
+                'category': d.get('category', ''),
+                'difficulty': diff,
+                'used_llm': result['used_llm'],
+                'latency_ms': result['total_ms'],
+            })
+    # Ensure newline after progress dots
+    if n % 50 != 0:
+        print(f' [{n}/{n}]')
+    print()
+    # ── Results ──
+    print(f'NORMALIZER PIPELINE — {args.eval_file}')
+    print(f'Model: {args.model}')
+    print('=' * 70)
+    print(f'  Exact:   {exact}/{n} ({exact/n*100:.1f}%)')
+    print(f'  WS-norm: {ws}/{n} ({ws/n*100:.1f}%)')
+    print(f'  LLM calls: {llm_calls}/{n} ({llm_calls/n*100:.0f}% needed LLM)')
+    print()
+    if len(by_difficulty) > 1 or 'unknown' not in by_difficulty:
+        print('BY DIFFICULTY:')
+        for diff in ['clean', 'fuzzy', 'natural', 'chaotic', 'unknown']:
+            if diff in by_difficulty:
+                results = by_difficulty[diff]
+                ex = sum(results)
+                tot = len(results)
+                print(f'  {diff:>10}: {ex}/{tot} ({ex/tot*100:.0f}%)')
+        print()
+    avg_lat = sum(latencies) / len(latencies) if latencies else 0
+    p50 = sorted(latencies)[len(latencies) // 2] if latencies else 0
+    p95 = sorted(latencies)[int(len(latencies) * 0.95)] if latencies else 0
+    print('LATENCY:')
+    print(f'  avg: {avg_lat:.0f}ms  p50: {p50:.0f}ms  p95: {p95:.0f}ms')
+    print()
+    print(f'ERRORS ({len(errors)}, showing first 25):')
+    print('-' * 70)
+    for e in errors[:25]:
+        llm_tag = 'LLM' if e['used_llm'] else 'SKIP'
+        print(f'  [{e["difficulty"]:>7}] [{e["category"]}] ({llm_tag})')
+        print(f'    input:    {e["dictated"]}')
+        print(f'    protocol: {e["protocol"]}')
+        print(f'    expected: {e["expected"]}')
+        print(f'    got:      {e["got"]}')
+        print()
+if __name__ == '__main__':
+    main()
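The routing rule in `is_pure_protocol` can be exercised on its own. The sketch below re-implements it with the filler and self-correction vocabularies trimmed to a few entries (an assumption for brevity; the real sets above are larger), and `route` is a hypothetical name, so it runs without `mlx_lm` or the processor.

```python
# Trimmed copies of the normalizer's vocabularies (assumption: subset only).
FILLER_WORDS = {"okay", "ok", "so", "um", "uh", "like", "the", "i", "can", "wait"}
SELF_CORRECTION = {"wait", "no", "actually", "meant", "not"}


def route(text: str) -> str:
    """Return 'procedural' if text looks like pure protocol, otherwise 'llm'."""
    words = text.lower().split()
    if not words or "space" not in words:
        return "llm"  # no explicit separators, so the LLM must insert them
    if words[0] in FILLER_WORDS:
        return "llm"  # conversational lead-in
    if set(words) & SELF_CORRECTION:
        return "llm"  # "no wait ... I meant" needs intent resolution
    return "procedural"


if __name__ == "__main__":
    examples = [
        "git space push space dash u space origin space main",
        "git commit minus m quote fix login bug quote",
        "okay so the command is git space stash",
        "dash dash force space no wait dash v",
    ]
    for utterance in examples:
        print(f"{route(utterance):>10}  {utterance}")
```

Only the first example skips the model. A single leading filler word is enough to send otherwise well-formed protocol input to the LLM, which is the case the "strip filler and pass through protocol" few-shot examples are written for.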

processor/procedural.py ADDED

	@@ -0,0 +1,377 @@

+#!/usr/bin/env python3
+"""Procedural dictation → syntax processor.
+No LLM needed. Just token scanning with a symbol vocabulary.
+Rules:
+  1. "space" → literal space
+  2. Symbol words → their character (dash→-, dot→., etc.)
+  3. Synonyms: minus→-, period→., forward slash→/, etc.
+  4. Number words → digits (one→1, forty two→42, one hundred→100, eight thousand→8000)
+  5. Casing directives: camel case, snake case, pascal case, kebab case, screaming case
+  6. "capital X" → X (uppercase), "capital word" → Word, "all caps word" → WORD
+  7. Everything else → pass through literally
+"""
+import json
+import re
+# ── Symbol vocabulary ────────────────────────────────────────────────────
+SYMBOLS = {
+    # Primary protocol words
+    'dash': '-',
+    'dot': '.',
+    'slash': '/',
+    'pipe': '|',
+    'redirect': '>',
+    'append': '>>',
+    'less': None,  # needs lookahead for "less than"
+    'star': '*',
+    'bang': '!',
+    'hash': '#',
+    'tilde': '~',
+    'at': '@',
+    'dollar': '$',
+    'percent': '%',
+    'caret': '^',
+    'ampersand': '&',
+    'equals': '=',
+    'plus': '+',
+    'colon': ':',
+    'semicolon': ';',
+    'underscore': '_',
+    'comma': ',',
+    'backslash': '\\',
+    'quote': '"',
+    'backtick': '`',
+    'question': None,  # needs lookahead for "question mark"
+    # Synonyms — common alternatives people use
+    'minus': '-',
+    'hyphen': '-',
+    'period': '.',
+    'asterisk': '*',
+    'hashtag': '#',
+}
+# Two-word symbols (checked before single-word)
+TWO_WORD_SYMBOLS = {
+    ('single', 'quote'): "'",
+    ('open', 'paren'): '(',
+    ('close', 'paren'): ')',
+    ('open', 'brace'): '{',
+    ('close', 'brace'): '}',
+    ('open', 'bracket'): '[',
+    ('close', 'bracket'): ']',
+    ('open', 'angle'): '<',
+    ('close', 'angle'): '>',
+    ('open', 'curly'): '{',
+    ('close', 'curly'): '}',
+    ('less', 'than'): '<',
+    ('question', 'mark'): '?',
+    ('dash', 'dash'): '--',
+    ('double', 'dash'): '--',
+    ('minus', 'minus'): '--',
+    ('and', 'and'): '&&',
+    ('pipe', 'pipe'): '||',
+    ('dot', 'dot'): '..',
+    ('two', 'redirect'): '2>',
+    ('forward', 'slash'): '/',
+    ('back', 'slash'): '\\',
+    ('equals', 'sign'): '=',
+    ('at', 'sign'): '@',
+    ('dollar', 'sign'): '$',
+    ('open', 'parenthesis'): '(',
+    ('close', 'parenthesis'): ')',
+    ('new', 'line'): '\n',
+}
+# Three-word symbols
+THREE_WORD_SYMBOLS = {
+    ('two', 'redirect', 'ampersand'): '2>&',
+}
+# ── Number words ─────────────────────────────────────────────────────────
+ONES = {
+    'zero': 0, 'one': 1, 'two': 2, 'three': 3, 'four': 4,
+    'five': 5, 'six': 6, 'seven': 7, 'eight': 8, 'nine': 9,
+    'ten': 10, 'eleven': 11, 'twelve': 12, 'thirteen': 13,
+    'fourteen': 14, 'fifteen': 15, 'sixteen': 16, 'seventeen': 17,
+    'eighteen': 18, 'nineteen': 19,
+}
+TENS = {
+    'twenty': 20, 'thirty': 30, 'forty': 40, 'fifty': 50,
+    'sixty': 60, 'seventy': 70, 'eighty': 80, 'ninety': 90,
+}
+MULTIPLIERS = {
+    'hundred': 100,
+    'thousand': 1000,
+}
+ALL_NUMBER_WORDS = set(ONES.keys()) | set(TENS.keys()) | set(MULTIPLIERS.keys())
+def is_number_word(w):
+    return w in ALL_NUMBER_WORDS
+def consume_number(words, i):
+    """Try to consume a number starting at position i.
+    Handles:
+    - Single: "five" → 5
+    - Teens: "twelve" → 12
+    - Compound: "forty two" → 42
+    - Multipliers: "three thousand" → 3000, "one hundred" → 100
+    - Digit sequences: "one nine two" → 192 (two or more consecutive single digits)
+    - Mixed: "eight thousand" → 8000
+    """
+    w = words[i]
+    # Tens word: twenty, thirty, etc.
+    if w in TENS:
+        val = TENS[w]
+        j = i + 1
+        # "forty two" compound
+        if j < len(words) and words[j] in ONES and ONES[words[j]] < 10:
+            val += ONES[words[j]]
+            j += 1
+        # Check for multiplier: "forty thousand"
+        if j < len(words) and words[j] in MULTIPLIERS:
+            val *= MULTIPLIERS[words[j]]
+            j += 1
+        return str(val), j
+    # Single/teens: zero through nineteen
+    if w in ONES:
+        val = ONES[w]
+        j = i + 1
+        # Check for multiplier: "three thousand", "one hundred"
+        if j < len(words) and words[j] in MULTIPLIERS:
+            val *= MULTIPLIERS[words[j]]
+            j += 1
+            # Compound forms are not combined: "three thousand two hundred"
+            # yields "3000" here, and the following "two hundred" becomes a
+            # separate "200", so the output reads "3000200".
+            return str(val), j
+        # Check for digit sequence: "one nine two" → "192"
+        # Only if next word is ALSO a single digit (0-9)
+        result = str(val)
+        while j < len(words) and words[j] in ONES and ONES[words[j]] < 10:
+            result += str(ONES[words[j]])
+            j += 1
+        if j > i + 1:
+            return result, j
+        return str(val), i + 1
+    return None, i
+# ── Casing directives ───────────────────────────────────────────────────
+CASING_DIRECTIVES = {'camel', 'snake', 'pascal', 'kebab', 'screaming'}
+def consume_casing(words, i):
+    """Try to consume a casing directive and its arguments.
+    "camel case get user profile" → "getUserProfile"
+    "snake case api key" → "api_key"
+    "pascal case my component" → "MyComponent"
+    "kebab case my component" → "my-component"
+    Consumes words until "space" or end of input.
+    Returns (result, new_i) or (None, i).
+    """
+    w = words[i].lower()
+    if w not in CASING_DIRECTIVES:
+        return None, i
+    if i + 1 >= len(words) or words[i + 1].lower() != 'case':
+        return None, i
+    style = w
+    j = i + 2
+    # Consume words until "space" or end or another directive/symbol
+    parts = []
+    while j < len(words):
+        next_w = words[j]
+        if next_w == 'space':
+            break
+        if next_w in SYMBOLS:
+            break
+        if next_w in CASING_DIRECTIVES and j + 1 < len(words) and words[j + 1] == 'case':
+            break
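+        # TWO_WORD_SYMBOLS is keyed by word pairs, so look ahead one word
+        # to stop before "open paren", "new line", "forward slash", etc.
+        if j + 1 < len(words) and (next_w, words[j + 1]) in TWO_WORD_SYMBOLS:
+            break
+        if next_w == 'all' or next_w == 'capital':
+            break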
+        parts.append(next_w.lower())
+        j += 1
+    if not parts:
+        return None, i
+    if style == 'camel':
+        result = parts[0] + ''.join(p.capitalize() for p in parts[1:])
+    elif style == 'pascal':
+        result = ''.join(p.capitalize() for p in parts)
+    elif style == 'snake':
+        result = '_'.join(parts)
+    elif style == 'kebab':
+        result = '-'.join(parts)
+    elif style == 'screaming':
+        result = '_'.join(p.upper() for p in parts)
+    else:
+        return None, i
+    return result, j
+def process_dictation(text):
+    """Convert dictated text to syntax using purely procedural rules."""
+    words = text.split()
+    output = []
+    i = 0
+    n = len(words)
+    while i < n:
+        w = words[i]
+        # ── "space" → literal space ──
+        if w == 'space':
+            output.append(' ')
+            i += 1
+            continue
+        # ── Three-word symbols ──
+        if i + 2 < n:
+            triple = (words[i], words[i+1], words[i+2])
+            if triple in THREE_WORD_SYMBOLS:
+                output.append(THREE_WORD_SYMBOLS[triple])
+                i += 3
+                continue
+        # ── Casing directives ──
+        cased, new_i = consume_casing(words, i)
+        if cased is not None:
+            output.append(cased)
+            i = new_i
+            continue
+        # ── Two-word symbols ──
+        if i + 1 < n:
+            pair = (words[i], words[i+1])
+            if pair in TWO_WORD_SYMBOLS:
+                output.append(TWO_WORD_SYMBOLS[pair])
+                i += 2
+                continue
+        # ── "all caps <word>" ──
+        if w == 'all' and i + 2 < n and words[i+1] == 'caps':
+            output.append(words[i+2].upper())
+            i += 3
+            continue
+        # ── "capital <letter or word>" ──
+        if w == 'capital' and i + 1 < n:
+            next_w = words[i+1]
+            if len(next_w) == 1:
+                output.append(next_w.upper())
+            else:
+                output.append(next_w[0].upper() + next_w[1:])
+            i += 2
+            continue
+        # ── Single-word symbols ──
+        if w in SYMBOLS and SYMBOLS[w] is not None:
+            output.append(SYMBOLS[w])
+            i += 1
+            continue
+        # ── Number words ──
+        if is_number_word(w):
+            num_str, new_i = consume_number(words, i)
+            if num_str is not None:
+                output.append(num_str)
+                i = new_i
+                continue
+        # ── Regular word → pass through ──
+        output.append(w)
+        i += 1
+    return ''.join(output)
+# ── Main: evaluate ──────────────────────────────────────────────────────
+if __name__ == '__main__':
+    import sys
+    from collections import defaultdict
+    eval_file = sys.argv[1] if len(sys.argv) > 1 else 'datasets/eval-independent.json'
+    with open(eval_file) as f:
+        data = json.load(f)
+    n = len(data)
+    if n == 0:
+        sys.exit(f'No entries to evaluate in {eval_file}')
+    exact = ws = wsc = 0
+    errors = []
+    # Group by difficulty if present
+    by_difficulty = defaultdict(list)
+    for d in data:
+        got = process_dictation(d['dictated'])
+        expected = d['expected']
+        ws_got = re.sub(r'\s+', ' ', got.strip())
+        ws_exp = re.sub(r'\s+', ' ', expected.strip())
+        is_exact = got == expected
+        is_ws = ws_got == ws_exp
+        is_wsc = ws_got.lower() == ws_exp.lower()
+        if is_exact: exact += 1
+        if is_ws: ws += 1
+        if is_wsc: wsc += 1
+        diff = d.get('difficulty', 'unknown')
+        by_difficulty[diff].append(is_exact)
+        if not is_exact:
+            errors.append({
+                'dictated': d['dictated'][:80],
+                'expected': expected[:60],
+                'got': got[:60],
+                'category': d.get('category', ''),
+                'difficulty': diff,
+            })
+    print(f'PROCEDURAL PROCESSOR — {eval_file}')
+    print('=' * 70)
+    print(f'  Exact:   {exact}/{n} ({exact/n*100:.1f}%)')
+    print(f'  WS-norm: {ws}/{n} ({ws/n*100:.1f}%)')
+    print(f'  WS+case: {wsc}/{n} ({wsc/n*100:.1f}%)')
+    print()
+    if len(by_difficulty) > 1:
+        print('BY DIFFICULTY:')
+        for diff in ['clean', 'fuzzy', 'natural', 'chaotic', 'unknown']:
+            if diff in by_difficulty:
+                results = by_difficulty[diff]
+                ex = sum(results)
+                tot = len(results)
+                print(f'  {diff:>10}: {ex}/{tot} ({ex/tot*100:.0f}%)')
+        print()
+    print(f'ERRORS ({len(errors)}, showing first 15):')
+    print('-' * 70)
+    for e in errors[:15]:
+        print(f'  [{e["difficulty"]:>7}] [{e["category"]}]')
+        print(f'    expected: {e["expected"]}')
+        print(f'    got:      {e["got"]}')
+        print()
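The processor's core is a greedy longest-match scan: three-word phrases are tried before two-word ones, which are tried before single words, so `two redirect ampersand` wins over `two redirect`, which wins over `redirect`. A minimal sketch of that ordering, assuming a single merged phrase table (the processor keeps separate tables and also interleaves casing, capitalization, and number handling, all omitted here). `PHRASES` and `to_syntax` are hypothetical names.

```python
# A merged phrase table (assumption: the processor keeps these in separate
# SYMBOLS / TWO_WORD_SYMBOLS / THREE_WORD_SYMBOLS dicts).
PHRASES = {
    ("two", "redirect", "ampersand"): "2>&",
    ("two", "redirect"): "2>",
    ("dash", "dash"): "--",
    ("redirect",): ">",
    ("dash",): "-",
    ("dot",): ".",
    ("slash",): "/",
    ("space",): " ",
}
MAX_LEN = max(len(key) for key in PHRASES)


def to_syntax(text: str) -> str:
    """Scan left to right, always preferring the longest matching phrase."""
    words = text.split()
    out = []
    i = 0
    while i < len(words):
        for size in range(MAX_LEN, 0, -1):
            key = tuple(words[i:i + size])
            # Near the end the slice can be shorter than `size`; skip those.
            if len(key) == size and key in PHRASES:
                out.append(PHRASES[key])
                i += size
                break
        else:
            # No phrase matched: pass the word through literally.
            out.append(words[i])
            i += 1
    return "".join(out)


if __name__ == "__main__":
    print(to_syntax("make space two redirect space slash dev slash null"))
    print(to_syntax("ls space dash dash all space redirect space out dot txt"))
```

Run directly, it prints `make 2> /dev/null` and then `ls --all > out.txt`. Trying longer phrases first is what keeps `dash dash` from being read as two separate `-` characters.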

scripts/bench-inference.py ADDED

	@@ -0,0 +1,68 @@

+#!/usr/bin/env python3
+"""Benchmark inference latency for LoRA vs DoRA vs base model."""
+import time
+from mlx_lm import load, generate
+SYS = "Convert the dictated text into the exact syntax it represents. Output only the result."
+# A few representative samples at different lengths
+samples = [
+    {"label": "short", "input": "dash dash verbose", "expected": "--verbose"},
+    {"label": "medium", "input": "export all caps API underscore KEY equals quote my dash key dash one two three quote", "expected": 'export API_KEY="my-key-123"'},
+    {"label": "long", "input": "git add dash A and and git commit dash M quote fix typo quote and and git push", "expected": 'git add -A && git commit -m "fix typo" && git push'},
+]
+configs = [
+    {"label": "Qwen 0.5B (base)", "model": "mlx-community/Qwen2.5-0.5B-Instruct-4bit", "adapter": None},
+    {"label": "Qwen 0.5B + LoRA", "model": "mlx-community/Qwen2.5-0.5B-Instruct-4bit", "adapter": "/Users/arach/dev/talkie/datasets/finetune/adapters/qwen-0.5b-lora"},
+    {"label": "Qwen 0.5B + DoRA", "model": "mlx-community/Qwen2.5-0.5B-Instruct-4bit", "adapter": "/Users/arach/dev/talkie/datasets/finetune/adapters/qwen-0.5b-dora"},
+]
+WARMUP = 2
+RUNS = 10
+for cfg in configs:
+    print(f"\n{'='*60}")
+    print(f"  {cfg['label']}")
+    print(f"{'='*60}")
+    kwargs = {"adapter_path": cfg["adapter"]} if cfg["adapter"] else {}
+    model, tokenizer = load(cfg["model"], **kwargs)
+    for sample in samples:
+        messages = [
+            {"role": "system", "content": SYS},
+            {"role": "user", "content": sample["input"]},
+        ]
+        prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+        # Warmup
+        for _ in range(WARMUP):
+            generate(model, tokenizer, prompt=prompt, max_tokens=60, verbose=False)
+        # Timed runs
+        times = []
+        outputs = []
+        for _ in range(RUNS):
+            t0 = time.perf_counter()
+            out = generate(model, tokenizer, prompt=prompt, max_tokens=60, verbose=False)
+            t1 = time.perf_counter()
+            times.append((t1 - t0) * 1000)  # ms
+            outputs.append(out.strip())
+        avg = sum(times) / len(times)
+        mn = min(times)
+        mx = max(times)
+        last_out = outputs[-1]
+        match = "✓" if last_out == sample["expected"] else "✗"
+        print(f"\n  [{sample['label']}] \"{sample['input'][:50]}{'...' if len(sample['input'])>50 else ''}\"")
+        print(f"    avg: {avg:.1f}ms  min: {mn:.1f}ms  max: {mx:.1f}ms  ({RUNS} runs)")
+        print(f"    output: {last_out}  {match}")
+    del model, tokenizer
+print(f"\n{'='*60}")
+print("  Done")
+print(f"{'='*60}")
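`bench-inference.py` reports avg/min/max per sample, while the normalizer evaluation reports p50/p95 by indexing into the sorted latencies. Both summaries can come from one helper. A sketch, assuming a zero-argument callable to time (`bench` is a hypothetical name); in `bench-inference.py` the callable would wrap the existing `generate(model, tokenizer, prompt=prompt, max_tokens=60, verbose=False)` call.

```python
import time
from typing import Callable, Dict


def bench(fn: Callable[[], object], warmup: int = 2, runs: int = 10) -> Dict[str, float]:
    """Time fn() in milliseconds after `warmup` untimed calls."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    for _ in range(warmup):
        fn()  # untimed: lets caches and lazy compilation settle
    times = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fn()
        times.append((time.perf_counter() - t0) * 1000)
    ordered = sorted(times)
    return {
        "avg": sum(times) / len(times),
        "min": ordered[0],
        "max": ordered[-1],
        # Same index-into-sorted-list percentiles as the normalizer eval.
        "p50": ordered[len(ordered) // 2],
        "p95": ordered[int(len(ordered) * 0.95)],
    }


if __name__ == "__main__":
    stats = bench(lambda: sum(range(100_000)))
    print("  ".join(f"{k}: {v:.2f}ms" for k, v in stats.items()))
```

With only 10 runs, p95 is simply the slowest run, so the percentile is mostly useful on larger sample counts such as a full eval file.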

scripts/eval-bash-v2.py ADDED

	@@ -0,0 +1,184 @@

+#!/usr/bin/env python3
+"""Evaluate v2-trained models on the NL2Bash holdout set.
+Compares minimal vs protocol prompts on the same held-out data.
+Reports both exact match AND whitespace-normalized match.
+"""
+import json
+import re
+import time
+from collections import defaultdict
+from mlx_lm import load, generate
+# ── Prompts (must match training) ────────────────────────────────────────
+SYSTEM_MINIMAL = (
+    "Reconstruct the intended syntax from the dictated text. "
+    "Output only the result."
+)
+SYSTEM_PROTOCOL = (
+    "Convert dictated syntax to code.\n"
+    "Symbol words: dash(-) dot(.) slash(/) pipe(|) star(*) bang(!) "
+    "hash(#) tilde(~) at(@) dollar($) percent(%) caret(^) equals(=) "
+    "plus(+) colon(:) semicolon(;) underscore(_) comma(,) backslash(\\)\n"
+    "Quotes: quote(\") single quote(') backtick(`)\n"
+    "Brackets: open/close paren()  brace{}  bracket[]  angle<>\n"
+    "Pairs: dash dash(--) and and(&&) pipe pipe(||) dot dot(..)\n"
+    "Casing: camel case(camelCase) snake case(snake_case) "
+    "kebab case(kebab-case) pascal case(PascalCase) all caps(ALLCAPS)\n"
+    "Spacing: no space(join words)\n"
+    "Letters after dash are flags: dash L A → -la\n"
+    "Numbers spoken as words: forty two → 42\n"
+    "Output only the result."
+)
+def ws_normalize(s: str) -> str:
+    """Normalize whitespace for lenient comparison."""
+    return re.sub(r'\s+', ' ', s.strip())
+# ── Load test data ───────────────────────────────────────────────────────
+# Both prompt variants share the same held-out entries, so load the minimal
+# test file; its stored system prompt is ignored because each config
+# supplies its own system prompt at inference time.
+with open("datasets/finetune/bash-v2/minimal/test.jsonl") as f:
+    tests = []
+    for line in f:
+        msg = json.loads(line)["messages"]
+        tests.append({
+            "dictated": msg[1]["content"],
+            "expected": msg[2]["content"],
+        })
+print(f"Loaded {len(tests)} held-out bash test entries (v2 converter)\n")
+configs = [
+    {
+        "label": "1.5B + minimal prompt",
+        "model": "mlx-community/Qwen2.5-1.5B-Instruct-4bit",
+        "adapter": "datasets/finetune/adapters/qwen-1.5b-bash-v2-minimal",
+        "system": SYSTEM_MINIMAL,
+    },
+    {
+        "label": "1.5B + protocol prompt",
+        "model": "mlx-community/Qwen2.5-1.5B-Instruct-4bit",
+        "adapter": "datasets/finetune/adapters/qwen-1.5b-bash-v2-protocol",
+        "system": SYSTEM_PROTOCOL,
+    },
+]
+all_results = {}
+for cfg in configs:
+    print(f"\n{'='*60}")
+    print(f"  {cfg['label']}")
+    print(f"{'='*60}\n")
+    try:
+        model, tokenizer = load(cfg["model"], adapter_path=cfg["adapter"])
+    except Exception as e:
+        print(f"  SKIPPED — {e}\n")
+        continue
+    results = []
+    total_time = 0
+    errors_shown = 0
+    for i, t in enumerate(tests):
+        messages = [
+            {"role": "system", "content": cfg["system"]},
+            {"role": "user", "content": t["dictated"]},
+        ]
+        prompt = tokenizer.apply_chat_template(
+            messages, tokenize=False, add_generation_prompt=True
+        )
+        start = time.perf_counter()
+        got = generate(model, tokenizer, prompt=prompt, max_tokens=120, verbose=False)
+        elapsed = time.perf_counter() - start
+        total_time += elapsed
+        got = got.strip()
+        exact = got == t["expected"]
+        ws_match = ws_normalize(got) == ws_normalize(t["expected"])
+        case_match = got.lower() == t["expected"].lower()
+        ws_case = ws_normalize(got).lower() == ws_normalize(t["expected"]).lower()
+        results.append({
+            "exact": exact,
+            "ws_match": ws_match,
+            "case_match": case_match,
+            "ws_case": ws_case,
+            "got": got,
+            "expected": t["expected"],
+            "dictated": t["dictated"],
+        })
+        if not exact and errors_shown < 25:
+            tag = "~" if ws_match else "✗"
+            color = "\033[33m" if ws_match else "\033[31m"
+            print(f"{color}{tag}\033[0m {i+1:>3}. \"{t['dictated'][:60]}\"")
+            print(f"      expected: {t['expected']}")
+            print(f"      got:      {got}")
+            errors_shown += 1
+    exact_correct = sum(1 for r in results if r["exact"])
+    ws_correct = sum(1 for r in results if r["ws_match"])
+    wscase_correct = sum(1 for r in results if r["ws_case"])
+    total = len(results)
+    avg_ms = round(total_time / total * 1000)
+    print(f"\n  Exact match:   {exact_correct}/{total} ({round(exact_correct/total*100, 1)}%)")
+    print(f"  WS-normalized: {ws_correct}/{total} ({round(ws_correct/total*100, 1)}%)")
+    print(f"  WS+case norm:  {wscase_correct}/{total} ({round(wscase_correct/total*100, 1)}%)")
+    print(f"  Avg latency:   {avg_ms}ms")
+    all_results[cfg["label"]] = results
+    del model, tokenizer
+# ── Summary ──────────────────────────────────────────────────────────────
+print(f"\n{'='*60}")
+print("  RESULTS SUMMARY")
+print(f"{'='*60}\n")
+print(f"  {'Model':<30} {'Exact':>8} {'WS-norm':>8} {'WS+case':>8}")
+print(f"  {'-'*30} {'-'*8} {'-'*8} {'-'*8}")
+for label, results in all_results.items():
+    total = len(results)
+    exact = sum(1 for r in results if r["exact"])
+    ws = sum(1 for r in results if r["ws_match"])
+    wsc = sum(1 for r in results if r["ws_case"])
+    print(f"  {label:<30} {exact/total*100:>7.1f}% {ws/total*100:>7.1f}% {wsc/total*100:>7.1f}%")
+# ── Error Categories ─────────────────────────────────────────────────────
+if all_results:
+    print(f"\n{'='*60}")
+    print("  ERROR CATEGORIES (first model)")
+    print(f"{'='*60}\n")
+    first_results = list(all_results.values())[0]
+    errors = [r for r in first_results if not r["exact"]]
+    cats = defaultdict(int)
+    for r in errors:
+        # ws_match and case_match each imply ws_case, so check them first
+        if r["ws_match"]:
+            cats["spacing only"] += 1
+        elif r["case_match"]:
+            cats["case only"] += 1
+        elif r["ws_case"]:
+            cats["spacing+case only"] += 1
+        elif len(r["got"]) > len(r["expected"]) * 2:
+            cats["hallucination"] += 1
+        elif abs(len(r["got"]) - len(r["expected"])) <= 3:
+            cats["minor diff"] += 1
+        else:
+            cats["structural"] += 1
+    for cat, count in sorted(cats.items(), key=lambda x: -x[1]):
+        print(f"  {cat:<20} {count:>4}")
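
Whitespace normalization and lowercasing both discard information, so the three flags nest: `ws_match` and `case_match` each imply `ws_case`. Any ladder that tests `ws_case` first will therefore file every whitespace-only or case-only miss under "spacing+case only". The sketch below shows an ordering that keeps the buckets distinct. It reuses the `ws_normalize` definition from `eval-bash-v3.py`. The `categorize` name is a hypothetical helper, and the sample pairs are illustrative, not eval data.

```python
import re
from collections import Counter


def ws_normalize(s: str) -> str:
    """Collapse whitespace runs, as in eval-bash-v3.py."""
    return re.sub(r"\s+", " ", s.strip())


def categorize(got: str, expected: str) -> str:
    """Put one prediction into exactly one bucket, most specific check first."""
    if got == expected:
        return "exact"
    ws_match = ws_normalize(got) == ws_normalize(expected)
    case_match = got.lower() == expected.lower()
    ws_case = ws_normalize(got).lower() == ws_normalize(expected).lower()
    # ws_match and case_match each imply ws_case, so they must be tested first.
    if ws_match:
        return "spacing only"
    if case_match:
        return "case only"
    if ws_case:
        return "spacing+case only"
    if len(got) > len(expected) * 2:
        return "hallucination"
    if abs(len(got) - len(expected)) <= 3:
        return "minor diff"
    return "structural"


pairs = [
    ("ls  -la", "ls -la"),    # extra space
    ("LS -la", "ls -la"),     # wrong case
    ("LS  -la", "ls -la"),    # both
    ("ls -al", "ls -la"),     # small edit
]
print(Counter(categorize(got, exp) for got, exp in pairs))
```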

scripts/eval-bash-v3.py ADDED

	@@ -0,0 +1,194 @@

+#!/usr/bin/env python3
+"""Evaluate v3-trained models on the NL2Bash holdout set.
+v3 uses "space as a word" model:
+- The word "space" in dictation = literal space in output
+- Everything else concatenates
+- Casing qualifiers: "capital X", "all caps var"
+Compares v3 to v2 baseline using the SAME bash commands but different
+dictation conventions.
+"""
+import json
+import re
+import time
+from collections import defaultdict
+from mlx_lm import load, generate
+# ── Prompts (must match training) ────────────────────────────────────────
+SYSTEM_V3_MINIMAL = (
+    "Reconstruct the intended syntax from the dictated text. "
+    "The word 'space' means insert a literal space. "
+    "Everything else concatenates. "
+    "Output only the result."
+)
+SYSTEM_V2_MINIMAL = (
+    "Reconstruct the intended syntax from the dictated text. "
+    "Output only the result."
+)
+def ws_normalize(s: str) -> str:
+    """Normalize whitespace for lenient comparison."""
+    return re.sub(r'\s+', ' ', s.strip())
+# ── Load test data ───────────────────────────────────────────────────────
+# Load v3 test data
+with open("datasets/finetune/bash-v3/minimal/test.jsonl") as f:
+    v3_tests = []
+    for line in f:
+        msg = json.loads(line)["messages"]
+        v3_tests.append({
+            "dictated": msg[1]["content"],
+            "expected": msg[2]["content"],
+        })
+print(f"Loaded {len(v3_tests)} held-out bash test entries (v3 converter)\n")
+configs = [
+    {
+        "label": "v3: 1.5B + space-as-word",
+        "model": "mlx-community/Qwen2.5-1.5B-Instruct-4bit",
+        "adapter": "datasets/finetune/adapters/qwen-1.5b-bash-v3-minimal",
+        "system": SYSTEM_V3_MINIMAL,
+        "tests": v3_tests,
+    },
+]
+# Optionally load v2 for comparison (same model, different adapter + data)
+try:
+    with open("datasets/finetune/bash-v2/minimal/test.jsonl") as f:
+        v2_tests = []
+        for line in f:
+            msg = json.loads(line)["messages"]
+            v2_tests.append({
+                "dictated": msg[1]["content"],
+                "expected": msg[2]["content"],
+            })
+    configs.append({
+        "label": "v2: 1.5B + baseline",
+        "model": "mlx-community/Qwen2.5-1.5B-Instruct-4bit",
+        "adapter": "datasets/finetune/adapters/qwen-1.5b-bash-v2-minimal",
+        "system": SYSTEM_V2_MINIMAL,
+        "tests": v2_tests,
+    })
+except FileNotFoundError:
+    pass
+all_results = {}
+for cfg in configs:
+    print(f"\n{'='*60}")
+    print(f"  {cfg['label']}")
+    print(f"{'='*60}\n")
+    try:
+        model, tokenizer = load(cfg["model"], adapter_path=cfg["adapter"])
+    except Exception as e:
+        print(f"  SKIPPED — {e}\n")
+        continue
+    results = []
+    total_time = 0
+    errors_shown = 0
+    tests = cfg["tests"]
+    for i, t in enumerate(tests):
+        messages = [
+            {"role": "system", "content": cfg["system"]},
+            {"role": "user", "content": t["dictated"]},
+        ]
+        prompt = tokenizer.apply_chat_template(
+            messages, tokenize=False, add_generation_prompt=True
+        )
+        start = time.perf_counter()
+        got = generate(model, tokenizer, prompt=prompt, max_tokens=120, verbose=False)
+        elapsed = time.perf_counter() - start
+        total_time += elapsed
+        got = got.strip()
+        exact = got == t["expected"]
+        ws_match = ws_normalize(got) == ws_normalize(t["expected"])
+        case_match = got.lower() == t["expected"].lower()
+        ws_case = ws_normalize(got).lower() == ws_normalize(t["expected"]).lower()
+        results.append({
+            "exact": exact,
+            "ws_match": ws_match,
+            "case_match": case_match,
+            "ws_case": ws_case,
+            "got": got,
+            "expected": t["expected"],
+            "dictated": t["dictated"],
+        })
+        if not exact and errors_shown < 25:
+            tag = "~" if ws_match else "✗"
+            color = "\033[33m" if ws_match else "\033[31m"
+            print(f"{color}{tag}\033[0m {i+1:>3}. \"{t['dictated'][:60]}\"")
+            print(f"      expected: {t['expected']}")
+            print(f"      got:      {got}")
+            errors_shown += 1
+    exact_correct = sum(1 for r in results if r["exact"])
+    ws_correct = sum(1 for r in results if r["ws_match"])
+    wscase_correct = sum(1 for r in results if r["ws_case"])
+    total = len(results)
+    avg_ms = round(total_time / total * 1000)
+    print(f"\n  Exact match:   {exact_correct}/{total} ({round(exact_correct/total*100, 1)}%)")
+    print(f"  WS-normalized: {ws_correct}/{total} ({round(ws_correct/total*100, 1)}%)")
+    print(f"  WS+case norm:  {wscase_correct}/{total} ({round(wscase_correct/total*100, 1)}%)")
+    print(f"  Avg latency:   {avg_ms}ms")
+    all_results[cfg["label"]] = results
+    del model, tokenizer
+# ── Summary ──────────────────────────────────────────────────────────────
+print(f"\n{'='*60}")
+print("  RESULTS SUMMARY")
+print(f"{'='*60}\n")
+print(f"  {'Model':<30} {'Exact':>8} {'WS-norm':>8} {'WS+case':>8}")
+print(f"  {'-'*30} {'-'*8} {'-'*8} {'-'*8}")
+for label, results in all_results.items():
+    total = len(results)
+    exact = sum(1 for r in results if r["exact"])
+    ws = sum(1 for r in results if r["ws_match"])
+    wsc = sum(1 for r in results if r["ws_case"])
+    print(f"  {label:<30} {exact/total*100:>7.1f}% {ws/total*100:>7.1f}% {wsc/total*100:>7.1f}%")
+# ── Error Categories ─────────────────────────────────────────────────────
+for label, results in all_results.items():
+    print(f"\n{'='*60}")
+    print(f"  ERROR CATEGORIES — {label}")
+    print(f"{'='*60}\n")
+    errors = [r for r in results if not r["exact"]]
+    cats = defaultdict(int)
+    for r in errors:
+        # ws_match and case_match each imply ws_case, so check them first
+        if r["ws_match"]:
+            cats["spacing only"] += 1
+        elif r["case_match"]:
+            cats["case only"] += 1
+        elif r["ws_case"]:
+            cats["spacing+case only"] += 1
+        elif len(r["got"]) > len(r["expected"]) * 2:
+            cats["hallucination"] += 1
+        elif abs(len(r["got"]) - len(r["expected"])) <= 3:
+            cats["minor diff"] += 1
+        else:
+            cats["structural"] += 1
+    for cat, count in sorted(cats.items(), key=lambda x: -x[1]):
+        print(f"  {cat:<20} {count:>4}")
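
The v3 docstring describes the "space as a word" convention. The spoken word "space" becomes a literal space. Casing qualifiers such as "capital X" and "all caps var" adjust the next word. Everything else concatenates. The toy decoder below illustrates that rule in the direction the model has to learn (dictation to text). It is not the repo's `bash-to-dictation-v3.py` converter. The `SYMBOLS` table is a hypothetical subset of spoken symbol names taken from the test prompts elsewhere in this commit. "All caps" is assumed to apply to exactly one following word.

```python
# Hypothetical spoken-symbol vocabulary; the real converter's table is not shown here.
SYMBOLS = {
    "dash": "-",
    "dot": ".",
    "slash": "/",
    "pipe": "|",
    "underscore": "_",
    "equals": "=",
    "tilde": "~",
}


def decode_v3(dictated: str) -> str:
    """Toy decoder: 'space' inserts a space, casing qualifiers modify the next word,
    and every other token (symbol or word) is concatenated with no separator."""
    words = dictated.split()
    out = []
    i = 0
    while i < len(words):
        w = words[i].lower()
        if w == "space":
            out.append(" ")
            i += 1
        elif w == "capital" and i + 1 < len(words):
            nxt = words[i + 1]
            out.append(nxt[:1].upper() + nxt[1:])  # capitalize first letter only
            i += 2
        elif w == "all" and i + 2 < len(words) and words[i + 1].lower() == "caps":
            out.append(words[i + 2].upper())       # assumed: applies to one word
            i += 3
        else:
            out.append(SYMBOLS.get(w, words[i]))   # symbol name or literal word
            i += 1
    return "".join(out)


print(decode_v3("ls space dash la space pipe space grep space dot swift"))
```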

scripts/eval-bash.py ADDED

	@@ -0,0 +1,137 @@

+#!/usr/bin/env python3
+"""Evaluate fine-tuned models on the NL2Bash holdout set (604 entries).
+Compares multiple adapters on the same held-out dictation → bash pairs.
+"""
+import json
+import time
+from collections import defaultdict
+from mlx_lm import load, generate
+SYS = "Reconstruct the intended syntax from the dictated text. Output only the result."
+# Load test split
+with open("datasets/finetune/bash/test.jsonl") as f:
+    tests = []
+    for line in f:
+        msg = json.loads(line)["messages"]
+        tests.append({
+            "dictated": msg[1]["content"],
+            "expected": msg[2]["content"],
+        })
+print(f"Loaded {len(tests)} held-out bash test entries\n")
+configs = [
+    {
+        "label": "Qwen 0.5B + LoRA (bash)",
+        "model": "mlx-community/Qwen2.5-0.5B-Instruct-4bit",
+        "adapter": "datasets/finetune/adapters/qwen-0.5b-bash-v1",
+    },
+    {
+        "label": "Qwen 1.5B + LoRA (bash)",
+        "model": "mlx-community/Qwen2.5-1.5B-Instruct-4bit",
+        "adapter": "datasets/finetune/adapters/qwen-1.5b-bash-v1",
+    },
+]
+all_results = {}
+for cfg in configs:
+    print(f"\n{'='*60}")
+    print(f"  {cfg['label']}")
+    print(f"{'='*60}\n")
+    try:
+        model, tokenizer = load(cfg["model"], adapter_path=cfg["adapter"])
+    except Exception as e:
+        print(f"  SKIPPED — {e}\n")
+        continue
+    results = []
+    total_time = 0
+    errors_shown = 0
+    for i, t in enumerate(tests):
+        messages = [
+            {"role": "system", "content": SYS},
+            {"role": "user", "content": t["dictated"]},
+        ]
+        prompt = tokenizer.apply_chat_template(
+            messages, tokenize=False, add_generation_prompt=True
+        )
+        start = time.perf_counter()
+        got = generate(model, tokenizer, prompt=prompt, max_tokens=120, verbose=False)
+        elapsed = time.perf_counter() - start
+        total_time += elapsed
+        got = got.strip()
+        match = got == t["expected"]
+        results.append({
+            "match": match,
+            "got": got,
+            "expected": t["expected"],
+            "dictated": t["dictated"],
+        })
+        if not match and errors_shown < 30:
+            icon = "\033[31m✗\033[0m"
+            print(f"{icon} {i+1:>3}. \"{t['dictated']}\"")
+            print(f"      expected: {t['expected']}")
+            print(f"      got:      {got}")
+            errors_shown += 1
+    correct = sum(1 for r in results if r["match"])
+    pct = round(correct / len(results) * 100, 1)
+    avg_ms = round(total_time / len(results) * 1000)
+    print(f"\nScore: {correct}/{len(results)} ({pct}%)")
+    print(f"Avg latency: {avg_ms}ms")
+    n_errors = len(results) - correct
+    if errors_shown < n_errors:
+        print(f"({n_errors} total errors, showing first {errors_shown})")
+    all_results[cfg["label"]] = results
+    del model, tokenizer
+# Summary
+print(f"\n{'='*60}")
+print("  RESULTS SUMMARY")
+print(f"{'='*60}\n")
+for label, results in all_results.items():
+    correct = sum(1 for r in results if r["match"])
+    pct = round(correct / len(results) * 100, 1)
+    print(f"  {label}: {correct}/{len(results)} ({pct}%)")
+# Error analysis: categorize common failure patterns
+if all_results:
+    print(f"\n{'='*60}")
+    print("  ERROR ANALYSIS (first model)")
+    print(f"{'='*60}\n")
+    first_results = list(all_results.values())[0]
+    errors = [r for r in first_results if not r["match"]]
+    # Categorize errors
+    categories = defaultdict(list)
+    for r in errors:
+        got, exp = r["got"], r["expected"]
+        if len(got) > len(exp) * 2:
+            categories["hallucination"].append(r)
+        elif got.lower() == exp.lower():
+            categories["case_mismatch"].append(r)
+        elif got.replace(" ", "") == exp.replace(" ", ""):
+            categories["spacing"].append(r)
+        elif abs(len(got) - len(exp)) <= 3:
+            categories["minor_diff"].append(r)
+        else:
+            categories["structural"].append(r)
+    for cat, items in sorted(categories.items(), key=lambda x: -len(x[1])):
+        print(f"  {cat}: {len(items)}")
+        for item in items[:3]:
+            print(f"    \"{item['dictated'][:50]}...\"")
+            print(f"      exp: {item['expected'][:60]}")
+            print(f"      got: {item['got'][:60]}")
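
Every eval script in this commit repeats the same loop. It reads a chat-format JSONL split and pulls the user and assistant turns out by position (`msg[1]`, `msg[2]`). A small shared helper could centralize that loop and fail loudly if a file's role layout ever changes. This sketch assumes each line is `{"messages": [system, user, assistant]}`, which is the layout the positional indexing implies. `load_chat_split` is a hypothetical name, and the demo row is illustrative.

```python
import json
import tempfile
from pathlib import Path


def load_chat_split(path):
    """Read a chat-format JSONL split into {"dictated", "expected"} dicts.

    Assumes every line is {"messages": [system, user, assistant]}, the layout
    the eval scripts rely on when they index msg[1] and msg[2].
    """
    tests = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines, e.g. a trailing newline
            msgs = json.loads(line)["messages"]
            roles = [m["role"] for m in msgs[:3]]
            if roles != ["system", "user", "assistant"]:
                raise ValueError(f"{path}:{lineno}: unexpected roles {roles}")
            tests.append({"dictated": msgs[1]["content"], "expected": msgs[2]["content"]})
    return tests


# Demo on a throwaway file with one illustrative row.
row = {"messages": [
    {"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."},
    {"role": "user", "content": "ls dash la"},
    {"role": "assistant", "content": "ls -la"},
]}
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "test.jsonl"
    path.write_text(json.dumps(row) + "\n\n", encoding="utf-8")
    tests = load_chat_split(path)
print(f"Loaded {len(tests)} entries: {tests}")
```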

scripts/eval-finetune.py ADDED

	@@ -0,0 +1,150 @@

+#!/usr/bin/env python3
+"""Evaluate fine-tuned LoRA models on expanded bakeoff test set.
+30 tests organized by tier:
+  Tier 1 (1-10):  Core functionality — basic symbol subs, casing, paths, URLs
+  Tier 2 (11-20): Compound patterns — multi-symbol, git commands, env vars, pipes
+  Tier 3 (21-30): Known failure modes — dot dot, compound &&, numbers, fidelity
+"""
+import json
+import time
+from collections import defaultdict
+from mlx_lm import load, generate
+SYS = "Reconstruct the intended syntax from the dictated text. Output only the result."
+tests = [
+    # --- Tier 1: Core functionality ---
+    {"id": 1, "cat": "symbols-basic", "dictated": "hello dash world", "expected": "hello-world"},
+    {"id": 2, "cat": "symbols-basic", "dictated": "hello underscore world", "expected": "hello_world"},
+    {"id": 3, "cat": "symbols-compound", "dictated": "dash dash verbose", "expected": "--verbose"},
+    {"id": 4, "cat": "symbols-compound", "dictated": "equals equals equals", "expected": "==="},
+    {"id": 5, "cat": "casing", "dictated": "camel case get user name", "expected": "getUserName"},
+    {"id": 6, "cat": "casing", "dictated": "snake case total tokens generated", "expected": "total_tokens_generated"},
+    {"id": 7, "cat": "casing", "dictated": "kebab case dark mode toggle", "expected": "dark-mode-toggle"},
+    {"id": 8, "cat": "quotes", "dictated": "quote hello world quote", "expected": "\"hello world\""},
+    {"id": 9, "cat": "paths", "dictated": "tilde slash dev slash talkie", "expected": "~/dev/talkie"},
+    {"id": 10, "cat": "urls", "dictated": "HTTPS colon slash slash GitHub dot com slash arach slash talkie", "expected": "https://github.com/arach/talkie"},
+    # --- Tier 2: Compound patterns ---
+    {"id": 11, "cat": "mixed", "dictated": "git commit dash M quote fix latency quote", "expected": "git commit -m \"fix latency\""},
+    {"id": 12, "cat": "mixed", "dictated": "export all caps API underscore KEY equals quote my dash key dash one two three quote", "expected": "export API_KEY=\"my-key-123\""},
+    {"id": 13, "cat": "mixed", "dictated": "shebang slash bin slash bash", "expected": "#!/bin/bash"},
+    {"id": 14, "cat": "mixed", "dictated": "docker run dash D dash P eighty eighty colon eighty eighty nginx", "expected": "docker run -d -p 8080:8080 nginx"},
+    {"id": 15, "cat": "mixed", "dictated": "func camel case view did load open paren close paren", "expected": "func viewDidLoad()"},
+    {"id": 16, "cat": "mixed", "dictated": "import open brace camel case use state close brace from single quote react single quote", "expected": "import { useState } from 'react'"},
+    {"id": 17, "cat": "mixed", "dictated": "LS dash L A pipe grep dot swift", "expected": "ls -la | grep .swift"},
+    {"id": 18, "cat": "mixed", "dictated": "GH PR create dash dash title quote fix inference latency quote dash dash body quote added TTFT tracking and latency instrumentation quote", "expected": "gh pr create --title \"fix inference latency\" --body \"Added TTFT tracking and latency instrumentation\""},
+    {"id": 19, "cat": "identifiers", "dictated": "dot E N V dot local", "expected": ".env.local"},
+    {"id": 20, "cat": "operators", "dictated": "open paren X close paren fat arrow open brace close brace", "expected": "(x) => {}"},
+    # --- Tier 3: Known failure modes ---
+    {"id": 21, "cat": "symbols-compound", "dictated": "dot dot slash dev", "expected": "../dev"},
+    {"id": 22, "cat": "paths", "dictated": "dot dot slash dot dot slash dot dot slash", "expected": "../../../"},
+    {"id": 23, "cat": "paths", "dictated": "dot dot slash configs", "expected": "../configs"},
+    {"id": 24, "cat": "operators", "dictated": "A and and B and and C", "expected": "a && b && c"},
+    {"id": 25, "cat": "numbers", "dictated": "zero point seven", "expected": "0.7"},
+    {"id": 26, "cat": "numbers", "dictated": "one two seven dot zero dot zero dot one", "expected": "127.0.0.1"},
+    {"id": 27, "cat": "mixed", "dictated": "git add dash A and and git commit dash M quote fix typo quote and and git push", "expected": "git add -A && git commit -m \"fix typo\" && git push"},
+    {"id": 28, "cat": "spacing", "dictated": "no space git hub", "expected": "github"},
+    {"id": 29, "cat": "brackets", "dictated": "open bracket colon colon dash one close bracket", "expected": "[::-1]"},
+    {"id": 30, "cat": "mixed", "dictated": "dash dash temp zero point seven dash dash tokens five twelve", "expected": "--temp 0.7 --tokens 512"},
+]
+configs = [
+    {
+        "label": "QWEN 0.5B + LoRA v3",
+        "model": "mlx-community/Qwen2.5-0.5B-Instruct-4bit",
+        "adapter": "/Users/arach/dev/talkie/datasets/finetune/adapters/qwen-0.5b-lora-v3",
+    },
+]
+all_results = {}
+for cfg in configs:
+    print(f"\n{'='*60}")
+    print(f"  {cfg['label']}")
+    print(f"{'='*60}\n")
+    model, tokenizer = load(cfg["model"], adapter_path=cfg["adapter"])
+    results = []
+    total_time = 0
+    for t in tests:
+        messages = [
+            {"role": "system", "content": SYS},
+            {"role": "user", "content": t["dictated"]},
+        ]
+        prompt = tokenizer.apply_chat_template(
+            messages, tokenize=False, add_generation_prompt=True
+        )
+        start = time.perf_counter()
+        got = generate(
+            model, tokenizer, prompt=prompt,
+            max_tokens=80, verbose=False
+        )
+        elapsed = time.perf_counter() - start
+        total_time += elapsed
+        got = got.strip()
+        match = got == t["expected"]
+        results.append({"id": t["id"], "cat": t["cat"], "match": match, "got": got})
+        icon = "\033[32m✓\033[0m" if match else "\033[31m✗\033[0m"
+        print(f"{icon} {str(t['id']).rjust(2)}. \"{t['dictated']}\"")
+        print(f"     expected: {t['expected']}")
+        if not match:
+            print(f"     got:      {got}")
+    correct = sum(1 for r in results if r["match"])
+    pct = round(correct / len(results) * 100)
+    avg_ms = round(total_time / len(results) * 1000)
+    print(f"\nScore: {correct}/{len(results)} ({pct}%)")
+    print(f"Avg latency: {avg_ms}ms per inference")
+    all_results[cfg["label"]] = results
+    # --- Per-category breakdown ---
+    print("\nPer-category accuracy:")
+    cat_results = defaultdict(lambda: {"correct": 0, "total": 0})
+    for r in results:
+        cat_results[r["cat"]]["total"] += 1
+        if r["match"]:
+            cat_results[r["cat"]]["correct"] += 1
+    for cat in sorted(cat_results.keys()):
+        cr = cat_results[cat]
+        cat_pct = round(cr["correct"] / cr["total"] * 100)
+        bar = "█" * cr["correct"] + "░" * (cr["total"] - cr["correct"])
+        print(f"  {cat:<20} {cr['correct']}/{cr['total']} ({cat_pct:>3}%) {bar}")
+    # --- Per-tier breakdown ---
+    print("\nPer-tier accuracy:")
+    tiers = [
+        ("Tier 1: Core", results[0:10]),
+        ("Tier 2: Compound", results[10:20]),
+        ("Tier 3: Failure modes", results[20:30]),
+    ]
+    for tier_name, tier_results in tiers:
+        tier_correct = sum(1 for r in tier_results if r["match"])
+        tier_pct = round(tier_correct / len(tier_results) * 100)
+        print(f"  {tier_name:<25} {tier_correct}/{len(tier_results)} ({tier_pct}%)")
+    del model, tokenizer
+# Summary
+print(f"\n{'='*60}")
+print("  RESULTS SUMMARY")
+print(f"{'='*60}\n")
+for label, results in all_results.items():
+    correct = sum(1 for r in results if r["match"])
+    pct = round(correct / len(results) * 100)
+    print(f"  {label}: {correct}/{len(results)} ({pct}%)")
+print("\nBaselines:")
+print("  LoRA v1 (240 train): 13/15 (87%)")
+print("  LoRA v2 (474 train): 27/30 (90%)")
+print("  Claude:              14/15 (93%)")
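
The per-tier breakdown in `eval-finetune.py` slices `results[0:10]`, `[10:20]` and `[20:30]`. That is only correct while the test list stays in id order with exactly ten tests per tier. Grouping by the `id` field, using the ranges from the docstring, keeps the tiers correct if tests are reordered, filtered, or added. In the minimal sketch below, `tier_of` and `accuracy_by` are hypothetical helpers and the `results` rows are made up.

```python
from collections import defaultdict


def tier_of(test_id: int) -> str:
    """Map a test id to its tier using the docstring's ranges (1-10, 11-20, 21-30)."""
    if 1 <= test_id <= 10:
        return "Tier 1: Core"
    if 11 <= test_id <= 20:
        return "Tier 2: Compound"
    if 21 <= test_id <= 30:
        return "Tier 3: Failure modes"
    raise ValueError(f"test id {test_id} is outside the documented tiers")


def accuracy_by(results, key):
    """Return {group: (correct, total)} for any grouping function."""
    counts = defaultdict(lambda: [0, 0])
    for r in results:
        group = counts[key(r)]
        group[0] += int(r["match"])
        group[1] += 1
    return {g: tuple(c) for g, c in counts.items()}


# Illustrative rows in the shape the script appends to `results`.
results = [
    {"id": 1, "cat": "symbols-basic", "match": True},
    {"id": 12, "cat": "mixed", "match": False},
    {"id": 21, "cat": "symbols-compound", "match": True},
    {"id": 27, "cat": "mixed", "match": True},
]
by_tier = accuracy_by(results, lambda r: tier_of(r["id"]))
by_cat = accuracy_by(results, lambda r: r["cat"])
for name, (ok, n) in sorted(by_tier.items()):
    print(f"  {name:<25} {ok}/{n} ({round(ok / n * 100)}%)")
```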

scripts/eval-holdout.py ADDED

	@@ -0,0 +1,122 @@

+#!/usr/bin/env python3
+"""Compare v2 vs v3 on the full held-out test split (66 entries)."""
+import json
+import time
+from collections import defaultdict
+from mlx_lm import load, generate
+SYS = "Reconstruct the intended syntax from the dictated text. Output only the result."
+# Load test split
+with open("datasets/finetune/chat/test.jsonl") as f:
+    tests = []
+    for line in f:
+        msg = json.loads(line)["messages"]
+        tests.append({
+            "dictated": msg[1]["content"],
+            "expected": msg[2]["content"],
+        })
+print(f"Loaded {len(tests)} held-out test entries\n")
+configs = [
+    {
+        "label": "LoRA v2",
+        "adapter": "datasets/finetune/adapters/qwen-0.5b-lora-v2",
+    },
+    {
+        "label": "LoRA v3",
+        "adapter": "datasets/finetune/adapters/qwen-0.5b-lora-v3",
+    },
+]
+all_results = {}
+for cfg in configs:
+    print(f"\n{'='*60}")
+    print(f"  {cfg['label']}")
+    print(f"{'='*60}\n")
+    model, tokenizer = load(
+        "mlx-community/Qwen2.5-0.5B-Instruct-4bit",
+        adapter_path=cfg["adapter"],
+    )
+    results = []
+    total_time = 0
+    for i, t in enumerate(tests):
+        messages = [
+            {"role": "system", "content": SYS},
+            {"role": "user", "content": t["dictated"]},
+        ]
+        prompt = tokenizer.apply_chat_template(
+            messages, tokenize=False, add_generation_prompt=True
+        )
+        start = time.perf_counter()
+        got = generate(model, tokenizer, prompt=prompt, max_tokens=80, verbose=False)
+        elapsed = time.perf_counter() - start
+        total_time += elapsed
+        got = got.strip()
+        match = got == t["expected"]
+        results.append({"match": match, "got": got, "expected": t["expected"], "dictated": t["dictated"]})
+        if not match:
+            icon = "\033[31m✗\033[0m"
+            print(f"{icon} {i+1:>2}. \"{t['dictated']}\"")
+            print(f"     expected: {t['expected']}")
+            print(f"     got:      {got}")
+    correct = sum(1 for r in results if r["match"])
+    pct = round(correct / len(results) * 100, 1)
+    avg_ms = round(total_time / len(results) * 1000)
+    print(f"\nScore: {correct}/{len(results)} ({pct}%)")
+    print(f"Avg latency: {avg_ms}ms")
+    all_results[cfg["label"]] = results
+    del model, tokenizer
+# Head-to-head comparison
+print(f"\n{'='*60}")
+print("  HEAD TO HEAD")
+print(f"{'='*60}\n")
+v2 = all_results["LoRA v2"]
+v3 = all_results["LoRA v3"]
+v2_only = []  # v2 right, v3 wrong
+v3_only = []  # v3 right, v2 wrong
+both_wrong = []
+for i in range(len(tests)):
+    if v2[i]["match"] and not v3[i]["match"]:
+        v2_only.append((i, tests[i], v3[i]["got"]))
+    elif v3[i]["match"] and not v2[i]["match"]:
+        v3_only.append((i, tests[i], v2[i]["got"]))
+    elif not v2[i]["match"] and not v3[i]["match"]:
+        both_wrong.append((i, tests[i], v2[i]["got"], v3[i]["got"]))
+v2_correct = sum(1 for r in v2 if r["match"])
+v3_correct = sum(1 for r in v3 if r["match"])
+print(f"LoRA v2: {v2_correct}/{len(tests)} ({round(v2_correct/len(tests)*100,1)}%)")
+print(f"LoRA v3: {v3_correct}/{len(tests)} ({round(v3_correct/len(tests)*100,1)}%)")
+if v2_only:
+    print(f"\nv2 correct, v3 wrong ({len(v2_only)}):")
+    for i, t, got in v2_only:
+        print(f"  {i+1}. \"{t['dictated']}\" → expected \"{t['expected']}\" got \"{got}\"")
+if v3_only:
+    print(f"\nv3 correct, v2 wrong ({len(v3_only)}):")
+    for i, t, got in v3_only:
+        print(f"  {i+1}. \"{t['dictated']}\" → expected \"{t['expected']}\" got \"{got}\"")
+if both_wrong:
+    print(f"\nBoth wrong ({len(both_wrong)}):")
+    for i, t, v2_got, v3_got in both_wrong:
+        print(f"  {i+1}. \"{t['dictated']}\"")
+        print(f"      expected: \"{t['expected']}\"")
+        print(f"      v2:       \"{v2_got}\"")
+        print(f"      v3:       \"{v3_got}\"")
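
The head-to-head section already collects the discordant items (`v2_only`, `v3_only`). On a 66-entry split, a handful of flips in either direction can be noise, and a paired test on those counts is one way to check. The sketch below swaps in the exact McNemar test, which is not part of the original script. Under the null hypothesis, each discordant item is equally likely to favour either adapter. Here `b` would be `len(v2_only)` and `c` would be `len(v3_only)`; the counts in the example are hypothetical.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from discordant counts.

    b = items only model A gets right, c = items only model B gets right.
    Under H0, min(b, c) follows Binomial(b + c, 0.5); items both models get
    right (or both get wrong) carry no information and are ignored.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical counts: 2 items only v2 solves, 6 only v3 solves.
print(f"p = {mcnemar_exact(2, 6):.4f}")
```

With so few discordant items, even a 2-vs-6 split is far from conventional significance thresholds, which is worth keeping in mind before reading much into a small v2/v3 gap.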

training/adapters/llama-1b-lora-v2/adapter_config.json ADDED

	@@ -0,0 +1,40 @@

+{
+    "adapter_path": "datasets/finetune/adapters/llama-1b-lora-v2",
+    "batch_size": 4,
+    "config": null,
+    "data": "datasets/finetune/chat",
+    "fine_tune_type": "lora",
+    "grad_accumulation_steps": 1,
+    "grad_checkpoint": false,
+    "iters": 500,
+    "learning_rate": 0.0001,
+    "lora_parameters": {
+        "rank": 8,
+        "dropout": 0.0,
+        "scale": 20.0
+    },
+    "lr_schedule": null,
+    "mask_prompt": true,
+    "max_seq_length": 2048,
+    "model": "mlx-community/Llama-3.2-1B-Instruct-4bit",
+    "num_layers": 16,
+    "optimizer": "adam",
+    "optimizer_config": {
+        "adam": {},
+        "adamw": {},
+        "muon": {},
+        "sgd": {},
+        "adafactor": {}
+    },
+    "project_name": null,
+    "report_to": null,
+    "resume_adapter_file": null,
+    "save_every": 100,
+    "seed": 42,
+    "steps_per_eval": 50,
+    "steps_per_report": 10,
+    "test": true,
+    "test_batches": 500,
+    "train": true,
+    "val_batches": 25
+}
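
These configs set `rank: 8` and `scale: 20.0` under `lora_parameters`. As far as we understand mlx-lm's `LoRALinear`, the frozen projection's output is summed with `scale * (x @ A) @ B`. `B` is zero-initialized, so the adapter starts as a no-op, and `scale` is applied directly rather than as alpha/rank. The NumPy sketch below shows that computation on toy dimensions (not Llama-3.2-1B's real shapes). It also demonstrates the parameter saving and that the update can be merged into a single weight.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 64, 32          # toy sizes, not the model's real dimensions
rank, scale = 8, 20.0         # values from lora_parameters above

W = rng.standard_normal((d_out, d_in)) / np.sqrt(d_in)   # frozen base weight, (out, in)
A = rng.uniform(-1, 1, (d_in, rank)) / np.sqrt(d_in)     # trainable down-projection
B = np.zeros((rank, d_out))                               # trainable up-projection, zero init


def lora_forward(x, W, A, B, scale):
    """Frozen projection plus the scaled low-rank update."""
    return x @ W.T + scale * ((x @ A) @ B)


x = rng.standard_normal((4, d_in))
# Zero-initialized B means the adapter changes nothing before training.
assert np.allclose(lora_forward(x, W, A, B, scale), x @ W.T)

# After some (simulated) training, the update can be folded into one matrix.
B = rng.standard_normal((rank, d_out)) * 0.01
W_merged = W + scale * (A @ B).T
assert np.allclose(x @ W_merged.T, lora_forward(x, W, A, B, scale))

full_params = d_in * d_out            # 2048 for a full fine-tune of this layer
lora_params = rank * (d_in + d_out)   # 768 trainable adapter parameters
print(full_params, lora_params)
```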

training/adapters/llama-1b-lora/adapter_config.json ADDED

	@@ -0,0 +1,40 @@

+{
+    "adapter_path": "/Users/arach/dev/talkie/datasets/finetune/adapters/llama-1b-lora",
+    "batch_size": 4,
+    "config": null,
+    "data": "/Users/arach/dev/talkie/datasets/finetune/chat",
+    "fine_tune_type": "lora",
+    "grad_accumulation_steps": 1,
+    "grad_checkpoint": false,
+    "iters": 200,
+    "learning_rate": 0.0001,
+    "lora_parameters": {
+        "rank": 8,
+        "dropout": 0.0,
+        "scale": 20.0
+    },
+    "lr_schedule": null,
+    "mask_prompt": true,
+    "max_seq_length": 2048,
+    "model": "mlx-community/Llama-3.2-1B-Instruct-4bit",
+    "num_layers": 16,
+    "optimizer": "adam",
+    "optimizer_config": {
+        "adam": {},
+        "adamw": {},
+        "muon": {},
+        "sgd": {},
+        "adafactor": {}
+    },
+    "project_name": null,
+    "report_to": null,
+    "resume_adapter_file": null,
+    "save_every": 50,
+    "seed": 42,
+    "steps_per_eval": 50,
+    "steps_per_report": 10,
+    "test": true,
+    "test_batches": 500,
+    "train": true,
+    "val_batches": 25
+}

training/adapters/qwen-0.5b-bash-v1/adapter_config.json ADDED

	@@ -0,0 +1,40 @@

+{
+    "adapter_path": "datasets/finetune/adapters/qwen-0.5b-bash-v1",
+    "batch_size": 4,
+    "config": null,
+    "data": "datasets/finetune/bash",
+    "fine_tune_type": "lora",
+    "grad_accumulation_steps": 1,
+    "grad_checkpoint": false,
+    "iters": 1000,
+    "learning_rate": 0.0001,
+    "lora_parameters": {
+        "rank": 8,
+        "dropout": 0.0,
+        "scale": 20.0
+    },
+    "lr_schedule": null,
+    "mask_prompt": true,
+    "max_seq_length": 2048,
+    "model": "mlx-community/Qwen2.5-0.5B-Instruct-4bit",
+    "num_layers": 16,
+    "optimizer": "adam",
+    "optimizer_config": {
+        "adam": {},
+        "adamw": {},
+        "muon": {},
+        "sgd": {},
+        "adafactor": {}
+    },
+    "project_name": null,
+    "report_to": null,
+    "resume_adapter_file": null,
+    "save_every": 200,
+    "seed": 42,
+    "steps_per_eval": 200,
+    "steps_per_report": 50,
+    "test": true,
+    "test_batches": 500,
+    "train": true,
+    "val_batches": 25
+}

training/adapters/qwen-0.5b-dora/adapter_config.json ADDED

	@@ -0,0 +1,40 @@

+{
+    "adapter_path": "/Users/arach/dev/talkie/datasets/finetune/adapters/qwen-0.5b-dora",
+    "batch_size": 4,
+    "config": null,
+    "data": "/Users/arach/dev/talkie/datasets/finetune/chat",
+    "fine_tune_type": "dora",
+    "grad_accumulation_steps": 1,
+    "grad_checkpoint": false,
+    "iters": 300,
+    "learning_rate": 5e-05,
+    "lora_parameters": {
+        "rank": 8,
+        "dropout": 0.0,
+        "scale": 20.0
+    },
+    "lr_schedule": null,
+    "mask_prompt": true,
+    "max_seq_length": 2048,
+    "model": "mlx-community/Qwen2.5-0.5B-Instruct-4bit",
+    "num_layers": 16,
+    "optimizer": "adam",
+    "optimizer_config": {
+        "adam": {},
+        "adamw": {},
+        "muon": {},
+        "sgd": {},
+        "adafactor": {}
+    },
+    "project_name": null,
+    "report_to": null,
+    "resume_adapter_file": null,
+    "save_every": 100,
+    "seed": 42,
+    "steps_per_eval": 50,
+    "steps_per_report": 10,
+    "test": true,
+    "test_batches": 500,
+    "train": true,
+    "val_batches": 25
+}
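
This config is the only DoRA run among the adapters shown (`fine_tune_type: dora`, with a lower `learning_rate` of 5e-05). DoRA splits each weight into a trainable per-output magnitude `m` and a direction `W0 + scale * ΔW`. The direction is renormalized, so the low-rank update can only rotate each row while `m` sets its length. The NumPy sketch below follows that decomposition as described in the DoRA paper, on toy shapes. mlx-lm's `DoRALinear` may differ in details such as quantized weights and where gradients are stopped through the norm.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank, scale = 64, 32, 8, 20.0   # toy sizes; rank/scale from the config

W0 = rng.standard_normal((d_out, d_in)) / np.sqrt(d_in)   # frozen base weight, (out, in)
A = rng.uniform(-1, 1, (d_in, rank)) / np.sqrt(d_in)      # low-rank direction update
B = np.zeros((rank, d_out))                                # zero init, as in LoRA
m = np.linalg.norm(W0, axis=1)                             # trainable magnitude per output row


def dora_weight(W0, A, B, m, scale):
    """Rescale each row of the updated direction to its learned magnitude."""
    V = W0 + scale * (A @ B).T            # direction: base plus low-rank update
    norms = np.linalg.norm(V, axis=1)     # per-output-row norm
    return (m / norms)[:, None] * V


W_init = dora_weight(W0, A, B, m, scale)
assert np.allclose(W_init, W0)            # identity at initialization

B = rng.standard_normal((rank, d_out)) * 0.01   # simulate a trained update
W_new = dora_weight(W0, A, B, m, scale)
# Row lengths stay pinned to m; only the row directions moved.
assert np.allclose(np.linalg.norm(W_new, axis=1), m)
```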

training/adapters/qwen-0.5b-lora-v2/adapter_config.json ADDED

	@@ -0,0 +1,40 @@

+{
+    "adapter_path": "datasets/finetune/adapters/qwen-0.5b-lora-v2",
+    "batch_size": 4,
+    "config": null,
+    "data": "datasets/finetune/chat",
+    "fine_tune_type": "lora",
+    "grad_accumulation_steps": 1,
+    "grad_checkpoint": false,
+    "iters": 500,
+    "learning_rate": 0.0001,
+    "lora_parameters": {
+        "rank": 8,
+        "dropout": 0.0,
+        "scale": 20.0
+    },
+    "lr_schedule": null,
+    "mask_prompt": true,
+    "max_seq_length": 2048,
+    "model": "mlx-community/Qwen2.5-0.5B-Instruct-4bit",
+    "num_layers": 16,
+    "optimizer": "adam",
+    "optimizer_config": {
+        "adam": {},
+        "adamw": {},
+        "muon": {},
+        "sgd": {},
+        "adafactor": {}
+    },
+    "project_name": null,
+    "report_to": null,
+    "resume_adapter_file": null,
+    "save_every": 100,
+    "seed": 42,
+    "steps_per_eval": 50,
+    "steps_per_report": 10,
+    "test": true,
+    "test_batches": 500,
+    "train": true,
+    "val_batches": 25
+}
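
With `grad_accumulation_steps: 1`, each of the 500 `iters` consumes one batch of 4, so this run sees 2,000 training examples. `eval-finetune.py` lists LoRA v2 as trained on 474 examples. Assuming that is this adapter's training set, the run makes a little over four passes over the data. The helper below does the arithmetic approximately, ignoring how the loader handles partial batches. Its second example assumes the 200-iteration `qwen-0.5b-lora` run is the "LoRA v1 (240 train)" baseline.

```python
def training_exposure(iters: int, batch_size: int, n_train: int):
    """Return (examples seen, approximate epochs).

    Assumes one batch per iteration, which holds when grad_accumulation_steps
    is 1, as in every config shown here.
    """
    seen = iters * batch_size
    return seen, seen / n_train


# qwen-0.5b-lora-v2: iters=500, batch_size=4; 474 examples per eval-finetune.py's baseline note
seen_v2, epochs_v2 = training_exposure(500, 4, 474)
# Assumes qwen-0.5b-lora (iters=200) is the "LoRA v1 (240 train)" baseline
seen_v1, epochs_v1 = training_exposure(200, 4, 240)
print(f"v2: {seen_v2} examples, ~{epochs_v2:.1f} epochs")
print(f"v1: {seen_v1} examples, ~{epochs_v1:.1f} epochs")
```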

training/adapters/qwen-0.5b-lora-v3/adapter_config.json ADDED

	@@ -0,0 +1,40 @@

+{
+    "adapter_path": "datasets/finetune/adapters/qwen-0.5b-lora-v3",
+    "batch_size": 4,
+    "config": null,
+    "data": "datasets/finetune/chat",
+    "fine_tune_type": "lora",
+    "grad_accumulation_steps": 1,
+    "grad_checkpoint": false,
+    "iters": 500,
+    "learning_rate": 0.0001,
+    "lora_parameters": {
+        "rank": 8,
+        "dropout": 0.0,
+        "scale": 20.0
+    },
+    "lr_schedule": null,
+    "mask_prompt": true,
+    "max_seq_length": 2048,
+    "model": "mlx-community/Qwen2.5-0.5B-Instruct-4bit",
+    "num_layers": 16,
+    "optimizer": "adam",
+    "optimizer_config": {
+        "adam": {},
+        "adamw": {},
+        "muon": {},
+        "sgd": {},
+        "adafactor": {}
+    },
+    "project_name": null,
+    "report_to": null,
+    "resume_adapter_file": null,
+    "save_every": 100,
+    "seed": 42,
+    "steps_per_eval": 50,
+    "steps_per_report": 10,
+    "test": true,
+    "test_batches": 500,
+    "train": true,
+    "val_batches": 25
+}
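
Apart from `adapter_path`, this config is identical to `qwen-0.5b-lora-v2`, with the same `data` directory, iterations, learning rate, and LoRA parameters. Whatever separates v2 from v3 in `eval-holdout.py` therefore presumably comes from the contents of `datasets/finetune/chat` at training time, not from hyperparameters. A recursive key-by-key diff makes this kind of check quick when comparing these JSON files. In the sketch below, `diff_configs` and `load_config` are hypothetical helpers, and the demo dicts are trimmed excerpts of the two configs.

```python
import json

MISSING = "<missing>"


def load_config(path):
    """Read one adapter_config.json."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def diff_configs(a: dict, b: dict, prefix: str = "") -> dict:
    """Return {dotted.key: (a_value, b_value)} for every leaf that differs."""
    out = {}
    for key in sorted(set(a) | set(b)):
        va, vb = a.get(key, MISSING), b.get(key, MISSING)
        path = f"{prefix}{key}"
        if isinstance(va, dict) and isinstance(vb, dict):
            out.update(diff_configs(va, vb, prefix=path + "."))  # recurse into nested sections
        elif va != vb:
            out[path] = (va, vb)
    return out


# Trimmed excerpts of the v2 and v3 configs; with real files, call load_config(...) on each.
v2 = {
    "adapter_path": "datasets/finetune/adapters/qwen-0.5b-lora-v2",
    "data": "datasets/finetune/chat",
    "iters": 500,
    "learning_rate": 0.0001,
    "lora_parameters": {"rank": 8, "dropout": 0.0, "scale": 20.0},
}
v3 = {**v2, "adapter_path": "datasets/finetune/adapters/qwen-0.5b-lora-v3"}
for key, (old, new) in diff_configs(v2, v3).items():
    print(f"{key}: {old!r} -> {new!r}")
```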

training/adapters/qwen-0.5b-lora/adapter_config.json ADDED

	@@ -0,0 +1,40 @@

+{
+    "adapter_path": "/Users/arach/dev/talkie/datasets/finetune/adapters/qwen-0.5b-lora",
+    "batch_size": 4,
+    "config": null,
+    "data": "/Users/arach/dev/talkie/datasets/finetune/chat",
+    "fine_tune_type": "lora",
+    "grad_accumulation_steps": 1,
+    "grad_checkpoint": false,
+    "iters": 200,
+    "learning_rate": 0.0001,
+    "lora_parameters": {
+        "rank": 8,
+        "dropout": 0.0,
+        "scale": 20.0
+    },
+    "lr_schedule": null,
+    "mask_prompt": true,
+    "max_seq_length": 2048,
+    "model": "mlx-community/Qwen2.5-0.5B-Instruct-4bit",
+    "num_layers": 16,
+    "optimizer": "adam",
+    "optimizer_config": {
+        "adam": {},
+        "adamw": {},
+        "muon": {},
+        "sgd": {},
+        "adafactor": {}
+    },
+    "project_name": null,
+    "report_to": null,
+    "resume_adapter_file": null,
+    "save_every": 50,
+    "seed": 42,
+    "steps_per_eval": 50,
+    "steps_per_report": 10,
+    "test": true,
+    "test_batches": 500,
+    "train": true,
+    "val_batches": 25
+}

training/adapters/qwen-1.5b-bash-v1/adapter_config.json ADDED

	@@ -0,0 +1,40 @@

+{
+    "adapter_path": "datasets/finetune/adapters/qwen-1.5b-bash-v1",
+    "batch_size": 4,
+    "config": null,
+    "data": "datasets/finetune/bash",
+    "fine_tune_type": "lora",
+    "grad_accumulation_steps": 1,
+    "grad_checkpoint": false,
+    "iters": 1000,
+    "learning_rate": 0.0001,
+    "lora_parameters": {
+        "rank": 8,
+        "dropout": 0.0,
+        "scale": 20.0
+    },
+    "lr_schedule": null,
+    "mask_prompt": true,
+    "max_seq_length": 2048,
+    "model": "mlx-community/Qwen2.5-1.5B-Instruct-4bit",
+    "num_layers": 16,
+    "optimizer": "adam",
+    "optimizer_config": {
+        "adam": {},
+        "adamw": {},
+        "muon": {},
+        "sgd": {},
+        "adafactor": {}
+    },
+    "project_name": null,
+    "report_to": null,
+    "resume_adapter_file": null,
+    "save_every": 200,
+    "seed": 42,
+    "steps_per_eval": 200,
+    "steps_per_report": 50,
+    "test": true,
+    "test_batches": 500,
+    "train": true,
+    "val_batches": 25
+}

training/adapters/qwen-1.5b-bash-v2-minimal/adapter_config.json ADDED

	@@ -0,0 +1,40 @@

+{
+    "adapter_path": "datasets/finetune/adapters/qwen-1.5b-bash-v2-minimal",
+    "batch_size": 4,
+    "config": null,
+    "data": "datasets/finetune/bash-v2/minimal",
+    "fine_tune_type": "lora",
+    "grad_accumulation_steps": 1,
+    "grad_checkpoint": false,
+    "iters": 1000,
+    "learning_rate": 0.0001,
+    "lora_parameters": {
+        "rank": 8,
+        "dropout": 0.0,
+        "scale": 20.0
+    },
+    "lr_schedule": null,
+    "mask_prompt": true,
+    "max_seq_length": 2048,
+    "model": "mlx-community/Qwen2.5-1.5B-Instruct-4bit",
+    "num_layers": 16,
+    "optimizer": "adam",
+    "optimizer_config": {
+        "adam": {},
+        "adamw": {},
+        "muon": {},
+        "sgd": {},
+        "adafactor": {}
+    },
+    "project_name": null,
+    "report_to": null,
+    "resume_adapter_file": null,
+    "save_every": 200,
+    "seed": 42,
+    "steps_per_eval": 200,
+    "steps_per_report": 50,
+    "test": true,
+    "test_batches": 500,
+    "train": true,
+    "val_batches": 25
+}

training/adapters/qwen-1.5b-bash-v2-protocol/adapter_config.json ADDED

	@@ -0,0 +1,40 @@

+{
+    "adapter_path": "datasets/finetune/adapters/qwen-1.5b-bash-v2-protocol",
+    "batch_size": 4,
+    "config": null,
+    "data": "datasets/finetune/bash-v2/protocol",
+    "fine_tune_type": "lora",
+    "grad_accumulation_steps": 1,
+    "grad_checkpoint": false,
+    "iters": 1000,
+    "learning_rate": 0.0001,
+    "lora_parameters": {
+        "rank": 8,
+        "dropout": 0.0,
+        "scale": 20.0
+    },
+    "lr_schedule": null,
+    "mask_prompt": true,
+    "max_seq_length": 2048,
+    "model": "mlx-community/Qwen2.5-1.5B-Instruct-4bit",
+    "num_layers": 16,
+    "optimizer": "adam",
+    "optimizer_config": {
+        "adam": {},
+        "adamw": {},
+        "muon": {},
+        "sgd": {},
+        "adafactor": {}
+    },
+    "project_name": null,
+    "report_to": null,
+    "resume_adapter_file": null,
+    "save_every": 200,
+    "seed": 42,
+    "steps_per_eval": 200,
+    "steps_per_report": 50,
+    "test": true,
+    "test_batches": 500,
+    "train": true,
+    "val_batches": 25
+}

training/adapters/qwen-1.5b-bash-v3-minimal/adapter_config.json ADDED

	@@ -0,0 +1,40 @@

+{
+    "adapter_path": "datasets/finetune/adapters/qwen-1.5b-bash-v3-minimal",
+    "batch_size": 4,
+    "config": null,
+    "data": "datasets/finetune/bash-v3/minimal",
+    "fine_tune_type": "lora",
+    "grad_accumulation_steps": 1,
+    "grad_checkpoint": false,
+    "iters": 800,
+    "learning_rate": 0.0001,
+    "lora_parameters": {
+        "rank": 8,
+        "dropout": 0.0,
+        "scale": 20.0
+    },
+    "lr_schedule": null,
+    "mask_prompt": true,
+    "max_seq_length": 2048,
+    "model": "mlx-community/Qwen2.5-1.5B-Instruct-4bit",
+    "num_layers": 16,
+    "optimizer": "adam",
+    "optimizer_config": {
+        "adam": {},
+        "adamw": {},
+        "muon": {},
+        "sgd": {},
+        "adafactor": {}
+    },
+    "project_name": null,
+    "report_to": null,
+    "resume_adapter_file": "datasets/finetune/adapters/qwen-1.5b-bash-v3-minimal/adapters.safetensors",
+    "save_every": 200,
+    "seed": 42,
+    "steps_per_eval": 200,
+    "steps_per_report": 50,
+    "test": true,
+    "test_batches": 500,
+    "train": true,
+    "val_batches": 25
+}

training/adapters/qwen-1.5b-lora-v1/adapter_config.json ADDED

	@@ -0,0 +1,40 @@

+{
+    "adapter_path": "datasets/finetune/adapters/qwen-1.5b-lora-v1",
+    "batch_size": 4,
+    "config": null,
+    "data": "datasets/finetune/chat",
+    "fine_tune_type": "lora",
+    "grad_accumulation_steps": 1,
+    "grad_checkpoint": false,
+    "iters": 500,
+    "learning_rate": 0.0001,
+    "lora_parameters": {
+        "rank": 8,
+        "dropout": 0.0,
+        "scale": 20.0
+    },
+    "lr_schedule": null,
+    "mask_prompt": true,
+    "max_seq_length": 2048,
+    "model": "mlx-community/Qwen2.5-1.5B-Instruct-4bit",
+    "num_layers": 16,
+    "optimizer": "adam",
+    "optimizer_config": {
+        "adam": {},
+        "adamw": {},
+        "muon": {},
+        "sgd": {},
+        "adafactor": {}
+    },
+    "project_name": null,
+    "report_to": null,
+    "resume_adapter_file": null,
+    "save_every": 100,
+    "seed": 42,
+    "steps_per_eval": 50,
+    "steps_per_report": 10,
+    "test": true,
+    "test_batches": 500,
+    "train": true,
+    "val_batches": 25
+}
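
The adapter configs above share nearly every setting (LoRA rank 8, scale 20.0, learning rate 1e-4, batch size 4, 16 layers) and differ mainly in base model, dataset, iteration count, checkpoint/eval cadence, and resume file. A minimal sketch for listing only the keys that differ, assuming the committed layout `training/adapters/<name>/adapter_config.json`; `load_configs` and `diff_configs` are hypothetical helpers, not part of this repo:

```python
import json
from pathlib import Path


def load_configs(root: Path) -> dict[str, dict]:
    """Map adapter directory name -> parsed adapter_config.json under `root`."""
    if not root.is_dir():
        return {}
    return {
        path.parent.name: json.loads(path.read_text())
        for path in sorted(root.glob('*/adapter_config.json'))
    }


def diff_configs(configs: dict[str, dict]) -> dict[str, dict[str, object]]:
    """Return {key: {adapter: value}} for every key whose value is not
    identical across all configs (a missing key counts as None)."""
    keys = set().union(*(cfg.keys() for cfg in configs.values()))
    diffs = {}
    for key in sorted(keys):
        values = {name: cfg.get(key) for name, cfg in configs.items()}
        # json.dumps(sort_keys=True) gives a comparable form for nested dicts
        if len({json.dumps(v, sort_keys=True) for v in values.values()}) > 1:
            diffs[key] = values
    return diffs


if __name__ == '__main__':
    root = Path('training/adapters')  # assumed repo-relative location
    for key, values in diff_configs(load_configs(root)).items():
        print(key)
        for name, value in values.items():
            print(f'    {name}: {value}')
```

Serializing values before comparing keeps nested entries such as `lora_parameters` comparable regardless of key order.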

training/converters/bash-to-dictation-v2.py ADDED

	@@ -0,0 +1,394 @@

+#!/usr/bin/env python3
+"""Convert NL2Bash commands → dictation training pairs for syntax reconstruction.
+v2: Improved converter following the Talkie Dictation Protocol.
+Key improvements over v1:
+- Natural number words (0-99) instead of digit-by-digit
+- Smarter flag handling: -la → "dash L A", -name → "dash name"
+- Token-level processing: handles tightly-bound tokens like 2>/dev/null
+- Consistent spacing: whitespace-separated tokens in bash = word boundaries
+- Better multi-char pattern matching
+"""
+import json
+import random
+import re
+from pathlib import Path
+SEED = 42
+# ── System Prompts ───────────────────────────────────────────────────────
+SYSTEM_MINIMAL = (
+    "Reconstruct the intended syntax from the dictated text. "
+    "Output only the result."
+)
+SYSTEM_PROTOCOL = (
+    "Convert dictated syntax to code.\n"
+    "Symbol words: dash(-) dot(.) slash(/) pipe(|) star(*) bang(!) "
+    "hash(#) tilde(~) at(@) dollar($) percent(%) caret(^) equals(=) "
+    "plus(+) colon(:) semicolon(;) underscore(_) comma(,) backslash(\\)\n"
+    "Quotes: quote(\") single quote(') backtick(`)\n"
+    "Brackets: open/close paren()  brace{}  bracket[]  angle<>\n"
+    "Pairs: dash dash(--) and and(&&) pipe pipe(||) dot dot(..)\n"
+    "Casing: camel case(camelCase) snake case(snake_case) "
+    "kebab case(kebab-case) pascal case(PascalCase) all caps(ALLCAPS)\n"
+    "Spacing: no space(join words)\n"
+    "Letters after dash are flags: dash L A → -la\n"
+    "Numbers spoken as words: forty two → 42\n"
+    "Output only the result."
+)
+# ── Number Words ─────────────────────────────────────────────────────────
+ONES = [
+    'zero', 'one', 'two', 'three', 'four', 'five', 'six', 'seven',
+    'eight', 'nine', 'ten', 'eleven', 'twelve', 'thirteen', 'fourteen',
+    'fifteen', 'sixteen', 'seventeen', 'eighteen', 'nineteen',
+]
+TENS = [
+    '', '', 'twenty', 'thirty', 'forty', 'fifty',
+    'sixty', 'seventy', 'eighty', 'ninety',
+]
+def number_to_words(num_str: str) -> str:
+    """Convert a number string to spoken words.
+    Rules:
+    - 0-99: natural words (zero, twelve, forty two, ninety nine)
+    - 100+: digit-by-digit (one two seven, eight zero eight zero)
+    - Leading zeros: always digit-by-digit (zero six four four for 0644)
+    Takes string not int to preserve leading zeros.
+    """
+    # Leading zeros → always digit-by-digit
+    if len(num_str) > 1 and num_str[0] == '0':
+        return ' '.join(ONES[int(d)] for d in num_str)
+    n = int(num_str)
+    # 0-99: natural spoken form
+    if n < 20:
+        return ONES[n]
+    if n < 100:
+        if n % 10 == 0:
+            return TENS[n // 10]
+        return f"{TENS[n // 10]} {ONES[n % 10]}"
+    # 100+: digit-by-digit (unambiguous, matches how people dictate IPs, ports, etc.)
+    return ' '.join(ONES[int(d)] for d in num_str)
+# ── Single char map ──────────────────────────────────────────────────────
+CHAR_MAP = {
+    '-': 'dash',
+    '.': 'dot',
+    '/': 'slash',
+    '|': 'pipe',
+    '>': 'greater than',
+    '<': 'less than',
+    '(': 'open paren',
+    ')': 'close paren',
+    '{': 'open brace',
+    '}': 'close brace',
+    '[': 'open bracket',
+    ']': 'close bracket',
+    '"': 'quote',
+    "'": 'single quote',
+    '`': 'backtick',
+    '*': 'star',
+    '~': 'tilde',
+    '@': 'at',
+    '#': 'hash',
+    '$': 'dollar',
+    '%': 'percent',
+    '^': 'caret',
+    '&': 'ampersand',
+    '=': 'equals',
+    '+': 'plus',
+    ':': 'colon',
+    ';': 'semicolon',
+    '?': 'question mark',
+    '!': 'bang',
+    '\\': 'backslash',
+    '_': 'underscore',
+    ',': 'comma',
+}
+# ── Filters ──────────────────────────────────────────────────────────────
+SKIP_PATTERNS = [
+    re.compile(r'`'),           # backtick subshells
+    re.compile(r'\$\('),        # $() subshells
+    re.compile(r'\\[nt"\'()]'), # escape sequences
+    re.compile(r'\\\\'),        # double backslash
+    re.compile(r"awk\s+'"),     # awk scripts
+    re.compile(r"sed\s+'"),     # sed scripts
+    re.compile(r"sed\s+-"),     # sed with flags
+    re.compile(r'\{[0-9]'),     # brace expansion {1..10}
+    re.compile(r'<<<'),         # here-strings
+    re.compile(r'<<'),          # here-docs
+    re.compile(r'\[\['),        # bash test brackets
+    re.compile(r'\bif\s'),      # conditionals
+    re.compile(r'\bfor\s'),     # loops
+    re.compile(r'\bwhile\s'),   # loops
+    re.compile(r'\bdo\b'),      # loop body
+    re.compile(r'\bdone\b'),    # loop end
+    re.compile(r'\bthen\b'),    # conditional body
+    re.compile(r'\bfi\b'),      # conditional end
+    re.compile(r'\bcase\s'),    # case statements
+    re.compile(r'\besac\b'),    # case end
+    re.compile(r'\bfunction\b'),# function defs
+    re.compile(r'\(\)'),        # function parens
+    re.compile(r'printf\s'),    # printf format strings
+    re.compile(r'\$\{'),        # parameter expansion
+    re.compile(r'[^\x00-\x7F]'), # non-ASCII characters
+]
+def should_skip(cmd: str) -> bool:
+    """Return True if command is too complex for dictation."""
+    if len(cmd) > 60:
+        return True
+    if len(cmd) < 5:
+        return True
+    if cmd.count('|') > 2:
+        return True
+    for pat in SKIP_PATTERNS:
+        if pat.search(cmd):
+            return True
+    return False
+# ── Token Converter ──────────────────────────────────────────────────────
+def convert_token(token: str) -> str | None:
+    """Convert a single whitespace-delimited bash token to dictation.
+    Processes the token character by character, accumulating letter runs
+    and converting symbols/numbers to spoken form.
+    Returns None if the token contains unconvertible characters.
+    """
+    parts = []
+    i = 0
+    n = len(token)
+    while i < n:
+        c = token[i]
+        # ── Multi-char patterns ──
+        # && and ||
+        if c == '&' and i + 1 < n and token[i + 1] == '&':
+            parts.append('and and')
+            i += 2
+            continue
+        if c == '|' and i + 1 < n and token[i + 1] == '|':
+            parts.append('pipe pipe')
+            i += 2
+            continue
+        # -- (double dash)
+        if c == '-' and i + 1 < n and token[i + 1] == '-':
+            parts.append('dash dash')
+            i += 2
+            continue
+        # .. (double dot)
+        if c == '.' and i + 1 < n and token[i + 1] == '.':
+            parts.append('dot dot')
+            i += 2
+            continue
+        # >> (append redirect)
+        if c == '>' and i + 1 < n and token[i + 1] == '>':
+            parts.append('append')
+            i += 2
+            continue
+        # 2> (stderr redirect): only at the start of a token, i.e. after a space
+        if c == '2' and i == 0 and i + 1 < n and token[i + 1] == '>':
+            if i + 2 < n and token[i + 2] == '&':
+                parts.append('two redirect ampersand')
+                i += 3
+            else:
+                parts.append('two redirect')
+                i += 2
+            continue
+        # ── Number runs ──
+        if c.isdigit():
+            num_start = i
+            while i < n and token[i].isdigit():
+                i += 1
+            num_str = token[num_start:i]
+            parts.append(number_to_words(num_str))
+            continue
+        # ── Letter runs ──
+        if c.isalpha():
+            word_start = i
+            while i < n and token[i].isalpha():
+                i += 1
+            word = token[word_start:i]
+            # After a dash, short letter runs (1-3 chars) are flags → spell out
+            if len(word) <= 3 and parts and parts[-1] == 'dash':
+                parts.extend(ch.upper() for ch in word)
+            elif len(word) == 1:
+                # Standalone single letter → uppercase
+                parts.append(word.upper())
+            else:
+                # Regular word
+                parts.append(word)
+            continue
+        # ── Single symbols ──
+        if c in CHAR_MAP:
+            parts.append(CHAR_MAP[c])
+            i += 1
+            continue
+        # Unknown character → bail
+        return None
+    return ' '.join(parts) if parts else None
+def bash_to_dictation(cmd: str) -> str | None:
+    """Convert a bash command to its dictated form.
+    Splits by whitespace, then converts each token individually. Caveat:
+    parts inside a token are also joined with spaces, so token boundaries
+    are lost (`tail -n1` and `tail -n 1` both become "tail dash N one").
+    Returns None if any token can't be cleanly converted.
+    """
+    # Split into whitespace-delimited bash tokens
+    tokens = cmd.split()
+    if not tokens:
+        return None
+    dictated_tokens = []
+    for token in tokens:
+        converted = convert_token(token)
+        if converted is None:
+            return None
+        dictated_tokens.append(converted)
+    result = ' '.join(dictated_tokens)
+    return result if result else None
+# ── Main ─────────────────────────────────────────────────────────────────
+def main():
+    cm_path = Path("datasets/nl2bash-repo/data/bash/all.cm")
+    out_dir = Path("datasets/finetune/bash-v2")
+    out_dir.mkdir(parents=True, exist_ok=True)
+    commands = cm_path.read_text().strip().split('\n')
+    print(f"Total commands in NL2Bash: {len(commands)}")
+    # Filter and convert
+    pairs = []
+    skipped_filter = 0
+    skipped_convert = 0
+    seen = set()
+    for cmd in commands:
+        cmd = cmd.strip()
+        if not cmd:
+            continue
+        if should_skip(cmd):
+            skipped_filter += 1
+            continue
+        if cmd in seen:
+            continue
+        seen.add(cmd)
+        dictated = bash_to_dictation(cmd)
+        if dictated is None:
+            skipped_convert += 1
+            continue
+        # Skip if dictation is too long or too short
+        word_count = len(dictated.split())
+        if word_count > 40 or word_count < 3:
+            skipped_convert += 1
+            continue
+        pairs.append({
+            "dictated": dictated,
+            "expected": cmd,
+        })
+    print(f"Filtered out (complexity): {skipped_filter}")
+    print(f"Filtered out (conversion): {skipped_convert}")
+    print(f"Usable pairs: {len(pairs)}")
+    # Shuffle and split: 80/10/10
+    random.seed(SEED)
+    random.shuffle(pairs)
+    n = len(pairs)
+    test_size = max(1, n // 10)
+    valid_size = max(1, n // 10)
+    train_size = n - test_size - valid_size
+    train = pairs[:train_size]
+    valid = pairs[train_size:train_size + valid_size]
+    test = pairs[train_size + valid_size:]
+    print(f"Train: {len(train)}, Valid: {len(valid)}, Test: {len(test)}")
+    # Write BOTH prompt variants
+    for label, system_prompt in [("minimal", SYSTEM_MINIMAL), ("protocol", SYSTEM_PROTOCOL)]:
+        sub_dir = out_dir / label
+        sub_dir.mkdir(exist_ok=True)
+        def write_jsonl(path: Path, data: list):
+            with open(path, 'w') as f:
+                for item in data:
+                    entry = {
+                        "messages": [
+                            {"role": "system", "content": system_prompt},
+                            {"role": "user", "content": item["dictated"]},
+                            {"role": "assistant", "content": item["expected"]},
+                        ]
+                    }
+                    f.write(json.dumps(entry) + '\n')
+        write_jsonl(sub_dir / "train.jsonl", train)
+        write_jsonl(sub_dir / "valid.jsonl", valid)
+        write_jsonl(sub_dir / "test.jsonl", test)
+        print(f"  Wrote {label}/ splits")
+    # Print sample conversions
+    print(f"\n{'='*70}")
+    print("  SAMPLE PAIRS (v2 converter)")
+    print(f"{'='*70}\n")
+    for p in pairs[:20]:
+        print(f"  dictated: {p['dictated']}")
+        print(f"  expected: {p['expected']}")
+        print()
+    # Show number conversion examples
+    print(f"{'='*70}")
+    print("  NUMBER HANDLING EXAMPLES")
+    print(f"{'='*70}\n")
+    examples = ['0', '1', '5', '10', '12', '20', '42', '80', '99',
+                '100', '127', '0644', '0755', '255', '443',
+                '1024', '3000', '5432', '8080', '9090', '65535']
+    for s in examples:
+        print(f"  {s:>5} → {number_to_words(s)}")
+if __name__ == "__main__":
+    main()
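
`number_to_words` is built so that a single number's spoken form decodes back to exactly one digit string: 0-99 use one word, or a tens word plus an optional one-to-nine word, while two or more single-digit words can only come from the digit-by-digit branch (100+ or a leading zero). A minimal sketch of that inverse, assuming the number's words have already been isolated from the surrounding dictation; `words_to_number` is a hypothetical helper, not part of this repo:

```python
# Same tables as number_to_words above.
ONES = ['zero', 'one', 'two', 'three', 'four', 'five', 'six', 'seven',
        'eight', 'nine', 'ten', 'eleven', 'twelve', 'thirteen', 'fourteen',
        'fifteen', 'sixteen', 'seventeen', 'eighteen', 'nineteen']
TENS = ['', '', 'twenty', 'thirty', 'forty', 'fifty',
        'sixty', 'seventy', 'eighty', 'ninety']


def words_to_number(spoken: str) -> str:
    """Invert number_to_words for one number (hypothetical sketch).

    - a single ONES word (zero..nineteen) -> its value
    - a TENS word, optionally followed by one..nine -> 20..99
    - two or more digit words -> digits concatenated, leading zeros kept
    """
    words = spoken.split()
    if len(words) == 1 and words[0] in ONES:
        return str(ONES.index(words[0]))
    if words and words[0] in TENS[2:]:
        value = TENS.index(words[0]) * 10
        if len(words) == 1:
            return str(value)
        if len(words) == 2 and words[1] in ONES[1:10]:
            return str(value + ONES.index(words[1]))
    if len(words) > 1 and all(w in ONES[:10] for w in words):
        return ''.join(str(ONES.index(w)) for w in words)
    raise ValueError(f'not a number_to_words output: {spoken!r}')
```

Keeping the leading-zero case digit-by-digit is what lets `0644` and `644` stay distinct after decoding.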

training/converters/bash-to-dictation-v3.py ADDED

	@@ -0,0 +1,486 @@

+#!/usr/bin/env python3
+"""Convert NL2Bash commands → dictation training pairs for syntax reconstruction.
+v3: "Space is a word" model.
+Core idea: the word "space" is an explicit token in the dictation that means
+"insert a literal space here." Everything else concatenates. This eliminates
+the entire class of spacing ambiguity from v2.
+v2 problem:
+    tail -n1  → "tail dash N one"
+    tail -n 1 → "tail dash N one"   ← SAME DICTATION, different output!
+v3 fix:
+    tail -n1  → "tail space dash n one"        → tail + ' ' + -n1
+    tail -n 1 → "tail space dash n space one"  → tail + ' ' + -n + ' ' + 1
+Rules for the model:
+    1. Map each spoken word to its text form (dash→-, one→1, etc.)
+    2. Concatenate everything by default
+    3. The word "space" → insert a literal space
+"""
+import json
+import random
+import re
+from pathlib import Path
+SEED = 42
+# ── System Prompts ───────────────────────────────────────────────────────
+SYSTEM_MINIMAL = (
+    "Reconstruct the intended syntax from the dictated text. "
+    "The word 'space' means insert a literal space. "
+    "Everything else concatenates. "
+    "Output only the result."
+)
+SYSTEM_PROTOCOL = (
+    "Reconstruct syntax from dictated text.\n"
+    "Default: all words concatenate into one token.\n"
+    "The word 'space' inserts a literal space (argument boundary).\n"
+    "Symbol words: dash(-) dot(.) slash(/) pipe(|) star(*) bang(!) "
+    "hash(#) tilde(~) at(@) dollar($) percent(%) caret(^) equals(=) "
+    "plus(+) colon(:) semicolon(;) underscore(_) comma(,) backslash(\\)\n"
+    "Quotes: quote(\") single quote(') backtick(`)\n"
+    "Brackets: open/close paren()  brace{}  bracket[]  angle<>\n"
+    "Pairs: dash dash(--) and and(&&) pipe pipe(||) dot dot(..)\n"
+    "Append: append(>>)  Redirect: redirect(>)\n"
+    "Numbers: spoken as words (one→1, forty two→42)\n"
+    "Flags: letters after dash are flag letters (dash l a → -la, dash capital X → -X)\n"
+    "Capital: capital X → X, capital foto → Foto, all caps var → VAR\n"
+    "Output only the result."
+)
+# ── Number Words ─────────────────────────────────────────────────────────
+ONES = [
+    'zero', 'one', 'two', 'three', 'four', 'five', 'six', 'seven',
+    'eight', 'nine', 'ten', 'eleven', 'twelve', 'thirteen', 'fourteen',
+    'fifteen', 'sixteen', 'seventeen', 'eighteen', 'nineteen',
+]
+TENS = [
+    '', '', 'twenty', 'thirty', 'forty', 'fifty',
+    'sixty', 'seventy', 'eighty', 'ninety',
+]
+def number_to_words(num_str: str) -> str:
+    """Convert a number string to spoken words.
+    Rules:
+    - 0-99: natural words (zero, twelve, forty two, ninety nine)
+    - 100+: digit-by-digit (one two seven, eight zero eight zero)
+    - Leading zeros: always digit-by-digit (zero six four four for 0644)
+    Takes string not int to preserve leading zeros.
+    """
+    # Leading zeros → always digit-by-digit
+    if len(num_str) > 1 and num_str[0] == '0':
+        return ' '.join(ONES[int(d)] for d in num_str)
+    n = int(num_str)
+    # 0-99: natural spoken form
+    if n < 20:
+        return ONES[n]
+    if n < 100:
+        if n % 10 == 0:
+            return TENS[n // 10]
+        return f"{TENS[n // 10]} {ONES[n % 10]}"
+    # 100+: digit-by-digit
+    return ' '.join(ONES[int(d)] for d in num_str)
+# ── Single char map ──────────────────────────────────────────────────────
+CHAR_MAP = {
+    '-': 'dash',
+    '.': 'dot',
+    '/': 'slash',
+    '|': 'pipe',
+    '>': 'redirect',
+    '<': 'less than',
+    '(': 'open paren',
+    ')': 'close paren',
+    '{': 'open brace',
+    '}': 'close brace',
+    '[': 'open bracket',
+    ']': 'close bracket',
+    '"': 'quote',
+    "'": 'single quote',
+    '`': 'backtick',
+    '*': 'star',
+    '~': 'tilde',
+    '@': 'at',
+    '#': 'hash',
+    '$': 'dollar',
+    '%': 'percent',
+    '^': 'caret',
+    '&': 'ampersand',
+    '=': 'equals',
+    '+': 'plus',
+    ':': 'colon',
+    ';': 'semicolon',
+    '?': 'question mark',
+    '!': 'bang',
+    '\\': 'backslash',
+    '_': 'underscore',
+    ',': 'comma',
+}
+# ── Filters ──────────────────────────────────────────────────────────────
+SKIP_PATTERNS = [
+    re.compile(r'`'),           # backtick subshells
+    re.compile(r'\$\('),        # $() subshells
+    re.compile(r'\\[nt"\'()]'), # escape sequences
+    re.compile(r'\\\\'),        # double backslash
+    re.compile(r"awk\s+'"),     # awk scripts
+    re.compile(r"sed\s+'"),     # sed scripts
+    re.compile(r"sed\s+-"),     # sed with flags
+    re.compile(r'\{[0-9]'),     # brace expansion {1..10}
+    re.compile(r'<<<'),         # here-strings
+    re.compile(r'<<'),          # here-docs
+    re.compile(r'\[\['),        # bash test brackets
+    re.compile(r'\bif\s'),      # conditionals
+    re.compile(r'\bfor\s'),     # loops
+    re.compile(r'\bwhile\s'),   # loops
+    re.compile(r'\bdo\b'),      # loop body
+    re.compile(r'\bdone\b'),    # loop end
+    re.compile(r'\bthen\b'),    # conditional body
+    re.compile(r'\bfi\b'),      # conditional end
+    re.compile(r'\bcase\s'),    # case statements
+    re.compile(r'\besac\b'),    # case end
+    re.compile(r'\bfunction\b'),# function defs
+    re.compile(r'\(\)'),        # function parens
+    re.compile(r'printf\s'),    # printf format strings
+    re.compile(r'\$\{'),        # parameter expansion
+    re.compile(r'[^\x00-\x7F]'), # non-ASCII characters
+]
+def should_skip(cmd: str) -> bool:
+    """Return True if command is too complex for dictation."""
+    if len(cmd) > 60:
+        return True
+    if len(cmd) < 5:
+        return True
+    if cmd.count('|') > 2:
+        return True
+    for pat in SKIP_PATTERNS:
+        if pat.search(cmd):
+            return True
+    return False
+# ── Token Converter ──────────────────────────────────────────────────────
+def emit_word(word: str) -> str | None:
+    """Convert a word to its dictation form with casing qualifiers.
+    Casing rules:
+    - all lowercase → as-is (default, no qualifier needed)
+    - ALL UPPERCASE → "all caps <word>"
+    - Capitalized   → "capital <word>"
+    - Mixed case    → bail (return None) — too complex to dictate
+    The model also learns contextual casing (env vars after $ are uppercase,
+    commands are lowercase) so qualifiers aren't always required in practice.
+    But the training data should include them so the model knows the convention.
+    """
+    if word.islower():
+        return word
+    if word.isupper():
+        return f'all caps {word.lower()}'
+    if word[0].isupper() and word[1:].islower():
+        return f'capital {word.lower()}'
+    # Mixed case (camelCase, etc.) — skip for now
+    return None
+def convert_token(token: str) -> str | None:
+    """Convert a single whitespace-delimited bash token to dictation.
+    Processes the token character by character, accumulating letter runs
+    and converting symbols/numbers to spoken form.
+    v3 changes:
+    - Casing qualifiers: "capital X", "all caps var"
+    - Flag casing preserved: -x → "dash x", -X → "dash capital X"
+    - Multi-char uppercase words: VAR → "all caps var"
+    - Capitalized words: Foto → "capital foto"
+    Returns None if the token contains unconvertible characters.
+    """
+    parts = []
+    i = 0
+    n = len(token)
+    while i < n:
+        c = token[i]
+        # ── Multi-char patterns ──
+        # && and ||
+        if c == '&' and i + 1 < n and token[i + 1] == '&':
+            parts.append('and and')
+            i += 2
+            continue
+        if c == '|' and i + 1 < n and token[i + 1] == '|':
+            parts.append('pipe pipe')
+            i += 2
+            continue
+        # -- (double dash)
+        if c == '-' and i + 1 < n and token[i + 1] == '-':
+            parts.append('dash dash')
+            i += 2
+            continue
+        # .. (double dot)
+        if c == '.' and i + 1 < n and token[i + 1] == '.':
+            parts.append('dot dot')
+            i += 2
+            continue
+        # >> (append redirect)
+        if c == '>' and i + 1 < n and token[i + 1] == '>':
+            parts.append('append')
+            i += 2
+            continue
+        # 2> (stderr redirect): only at the start of a token, i.e. after a space
+        if c == '2' and i == 0 and i + 1 < n and token[i + 1] == '>':
+            if i + 2 < n and token[i + 2] == '&':
+                parts.append('two redirect ampersand')
+                i += 3
+            else:
+                parts.append('two redirect')
+                i += 2
+            continue
+        # ── Number runs ──
+        if c.isdigit():
+            num_start = i
+            while i < n and token[i].isdigit():
+                i += 1
+            num_str = token[num_start:i]
+            parts.append(number_to_words(num_str))
+            continue
+        # ── Letter runs ──
+        if c.isalpha():
+            word_start = i
+            while i < n and token[i].isalpha():
+                i += 1
+            word = token[word_start:i]
+            # After a dash, short letter runs are flags — spell out
+            # Also handle longer mixed-case flag runs like -aPSHiv
+            is_flag = parts and parts[-1] == 'dash'
+            is_short = len(word) <= 3
+            has_mixed_case = not word.islower() and not word.isupper()
+            if is_flag and (is_short or has_mixed_case):
+                for ch in word:
+                    if ch.isupper():
+                        parts.append(f'capital {ch}')
+                    else:
+                        parts.append(ch)
+            elif len(word) == 1:
+                # Standalone single letter — preserve case
+                if word.isupper():
+                    parts.append(f'capital {word}')
+                else:
+                    parts.append(word)
+            else:
+                # Multi-char word — use casing qualifiers
+                emitted = emit_word(word)
+                if emitted is None:
+                    return None  # mixed case too complex
+                parts.append(emitted)
+            continue
+        # ── Single symbols ──
+        if c in CHAR_MAP:
+            parts.append(CHAR_MAP[c])
+            i += 1
+            continue
+        # Unknown character → bail
+        return None
+    return ' '.join(parts) if parts else None
+def bash_to_dictation(cmd: str) -> str | None:
+    """Convert a bash command to its dictated form.
+    v3 key change: token boundaries are marked with the word "space".
+    In v2, `tail -n1` and `tail -n 1` both became "tail dash N one".
+    In v3:
+        tail -n1  → "tail space dash n one"
+        tail -n 1 → "tail space dash n space one"
+    The model learns: "space" = literal space, everything else concatenates.
+    """
+    tokens = cmd.split()
+    if not tokens:
+        return None
+    dictated_tokens = []
+    for token in tokens:
+        converted = convert_token(token)
+        if converted is None:
+            return None
+        dictated_tokens.append(converted)
+    # v3: join with " space " — the word "space" is an explicit token
+    result = ' space '.join(dictated_tokens)
+    return result if result else None
+# ── Main ─────────────────────────────────────────────────────────────────
+def main():
+    cm_path = Path("datasets/nl2bash-repo/data/bash/all.cm")
+    out_dir = Path("datasets/finetune/bash-v3")
+    out_dir.mkdir(parents=True, exist_ok=True)
+    commands = cm_path.read_text().strip().split('\n')
+    print(f"Total commands in NL2Bash: {len(commands)}")
+    # Filter and convert
+    pairs = []
+    skipped_filter = 0
+    skipped_convert = 0
+    seen = set()
+    for cmd in commands:
+        cmd = cmd.strip()
+        if not cmd:
+            continue
+        if should_skip(cmd):
+            skipped_filter += 1
+            continue
+        if cmd in seen:
+            continue
+        seen.add(cmd)
+        dictated = bash_to_dictation(cmd)
+        if dictated is None:
+            skipped_convert += 1
+            continue
+        # Skip if dictation is too long or too short
+        word_count = len(dictated.split())
+        if word_count > 50 or word_count < 3:  # bumped from 40 — "space" adds words
+            skipped_convert += 1
+            continue
+        pairs.append({
+            "dictated": dictated,
+            "expected": cmd,
+        })
+    print(f"Filtered out (complexity): {skipped_filter}")
+    print(f"Filtered out (conversion): {skipped_convert}")
+    print(f"Usable pairs: {len(pairs)}")
+    # Shuffle and split: 80/10/10
+    random.seed(SEED)
+    random.shuffle(pairs)
+    n = len(pairs)
+    test_size = max(1, n // 10)
+    valid_size = max(1, n // 10)
+    train_size = n - test_size - valid_size
+    train = pairs[:train_size]
+    valid = pairs[train_size:train_size + valid_size]
+    test = pairs[train_size + valid_size:]
+    print(f"Train: {len(train)}, Valid: {len(valid)}, Test: {len(test)}")
+    # Write BOTH prompt variants
+    for label, system_prompt in [("minimal", SYSTEM_MINIMAL), ("protocol", SYSTEM_PROTOCOL)]:
+        sub_dir = out_dir / label
+        sub_dir.mkdir(exist_ok=True)
+        def write_jsonl(path: Path, data: list):
+            with open(path, 'w') as f:
+                for item in data:
+                    entry = {
+                        "messages": [
+                            {"role": "system", "content": system_prompt},
+                            {"role": "user", "content": item["dictated"]},
+                            {"role": "assistant", "content": item["expected"]},
+                        ]
+                    }
+                    f.write(json.dumps(entry) + '\n')
+        write_jsonl(sub_dir / "train.jsonl", train)
+        write_jsonl(sub_dir / "valid.jsonl", valid)
+        write_jsonl(sub_dir / "test.jsonl", test)
+        print(f"  Wrote {label}/ splits")
+    # ── Show the key improvement ──
+    print(f"\n{'='*70}")
+    print("  V3 vs V2: AMBIGUITY RESOLUTION")
+    print(f"{'='*70}\n")
+    demo_pairs = [
+        ("tail -n1", "tail -n 1"),
+        ("cut -c1-10", "cut -c 1-10"),
+        ("top -b -n1", "top -b -n 1"),
+        ("ssh -X user@server", "ssh -x user@server"),
+        ("chmod 0644 file.txt", "chmod 644 file.txt"),
+    ]
+    for a, b in demo_pairs:
+        da = bash_to_dictation(a)
+        db = bash_to_dictation(b)
+        same = "SAME ✗" if da == db else "DIFFERENT ✓"
+        print(f"  {a:<25} → {da}")
+        print(f"  {b:<25} → {db}")
+        print(f"  {'':25}   {same}")
+        print()
+    # Show casing qualifiers
+    print(f"{'='*70}")
+    print("  CASING QUALIFIERS")
+    print(f"{'='*70}\n")
+    case_examples = [
+        "export PATH=/usr/bin",
+        "echo $HOME",
+        "find ~ -name 'Foto*'",
+        "read -n10 -e VAR",
+        "ls -la /tmp/MyDir",
+        "grep -i ERROR log.txt",
+    ]
+    for cmd in case_examples:
+        d = bash_to_dictation(cmd)
+        if d:
+            print(f"  {cmd:<35} → {d}")
+        else:
+            print(f"  {cmd:<35} → SKIPPED")
+    print()
+    # Print sample conversions
+    print(f"{'='*70}")
+    print("  SAMPLE PAIRS (v3 converter)")
+    print(f"{'='*70}\n")
+    for p in pairs[:15]:
+        print(f"  dictated: {p['dictated']}")
+        print(f"  expected: {p['expected']}")
+        print()
+if __name__ == "__main__":
+    main()
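The demo above spot-checks five hand-picked pairs for the failure mode v3 targets: two different commands that dictate to the same text. A minimal sketch of the same check over a whole dataset, assuming `pairs` is the list of `{"dictated", "expected"}` dicts built in `main()`. `find_collisions` is a hypothetical helper, not part of the committed script, and the sample dictations below are toy strings, not real converter output.

```python
from collections import defaultdict


def find_collisions(pairs: list[dict]) -> dict[str, set[str]]:
    """Return dictated strings that map to more than one expected command.

    Each such input is an ambiguity the model cannot resolve from the
    dictated text alone, so those examples cap achievable accuracy.
    """
    targets: dict[str, set[str]] = defaultdict(set)
    for p in pairs:
        targets[p["dictated"]].add(p["expected"])
    # Keep only inputs that point at conflicting targets
    return {d: cmds for d, cmds in targets.items() if len(cmds) > 1}


if __name__ == "__main__":
    # Toy data (hypothetical dictations, not produced by bash_to_dictation)
    sample = [
        {"dictated": "tail dash n one", "expected": "tail -n1"},
        {"dictated": "tail dash n one", "expected": "tail -n 1"},
        {"dictated": "ls dash la", "expected": "ls -la"},
    ]
    for dictated, cmds in find_collisions(sample).items():
        print(f"{dictated!r} -> {sorted(cmds)}")
```

Run on the real `pairs` before the 80/10/10 split, an empty result would confirm that no two kept commands share a dictation.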

training/converters/prepare-finetune.py ADDED

	@@ -0,0 +1,91 @@

+#!/usr/bin/env python3
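+"""Convert syntax-reconstruction.json → train/valid/test JSONL for mlx_lm.
+Splits are stratified by category (80/10/10 within each category), so every
+split keeps roughly the same category mix. Writes two formats: completions
+(prompt/completion) and chat (messages with a system prompt).
+"""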
+import json
+import random
+import os
+from collections import defaultdict
+SRC = os.path.join(os.path.dirname(__file__), "syntax-reconstruction.json")
+OUT = os.path.join(os.path.dirname(__file__), "finetune")
+SYSTEM = "Reconstruct the intended syntax from the dictated text. Output only the result."
+with open(SRC) as f:
+    entries = json.load(f)["entries"]
+random.seed(42)
+# --- Stratified split by category ---
+by_cat = defaultdict(list)
+for e in entries:
+    by_cat[e["cat"]].append(e)
+train, valid, test = [], [], []
+for cat, items in sorted(by_cat.items()):
+    random.shuffle(items)
+    n = len(items)
+    train_end = int(n * 0.80)
+    valid_end = int(n * 0.90)
+    train.extend(items[:train_end])
+    valid.extend(items[train_end:valid_end])
+    test.extend(items[valid_end:])
+# Shuffle within each split so categories are interleaved
+random.shuffle(train)
+random.shuffle(valid)
+random.shuffle(test)
+splits = {"train": train, "valid": valid, "test": test}
+# Print category distribution
+print("Category distribution:")
+for cat in sorted(by_cat.keys()):
+    total = len(by_cat[cat])
+    t = sum(1 for e in train if e["cat"] == cat)
+    v = sum(1 for e in valid if e["cat"] == cat)
+    te = sum(1 for e in test if e["cat"] == cat)
+    print(f"  {cat:<20} {total:>4} total → {t:>4} train / {v:>3} valid / {te:>3} test")
+# --- Completions format ---
+comp_dir = os.path.join(OUT, "completions")
+os.makedirs(comp_dir, exist_ok=True)
+for split_name, split_entries in splits.items():
+    path = os.path.join(comp_dir, f"{split_name}.jsonl")
+    with open(path, "w") as f:
+        for e in split_entries:
+            line = {
+                "prompt": e["dictated"],
+                "completion": e["output"],
+            }
+            f.write(json.dumps(line) + "\n")
+    print(f"  {split_name}: {len(split_entries)} → {path}")
+# --- Chat format ---
+chat_dir = os.path.join(OUT, "chat")
+os.makedirs(chat_dir, exist_ok=True)
+for split_name, split_entries in splits.items():
+    path = os.path.join(chat_dir, f"{split_name}.jsonl")
+    with open(path, "w") as f:
+        for e in split_entries:
+            line = {
+                "messages": [
+                    {"role": "system", "content": SYSTEM},
+                    {"role": "user", "content": e["dictated"]},
+                    {"role": "assistant", "content": e["output"]},
+                ]
+            }
+            f.write(json.dumps(line) + "\n")
+    print(f"  {split_name}: {len(split_entries)} → {path}")
+n = len(entries)
+print(f"\nTotal: {n} entries → {len(train)} train / {len(valid)} valid / {len(test)} test")
+print(f"System prompt: \"{SYSTEM}\"")
+print("Formats: completions (prompt/completion), chat (messages with system prompt)")
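The stratified split in prepare-finetune.py can be sketched as a standalone function, which makes it testable without reading syntax-reconstruction.json. This is a minimal sketch, assuming each entry is a dict with a `"cat"` key as in the script; `stratified_split` is a hypothetical name, not part of the commit. It keeps the script's cut points (`int(n * 0.80)` and `int(n * 0.90)`) and per-category shuffle, but uses a local `random.Random(seed)` instead of seeding the global RNG. One consequence of truncating with `int()`: a category with five or fewer entries gets no validation examples, and a single-entry category lands entirely in test.

```python
import random
from collections import defaultdict


def stratified_split(entries, key="cat", seed=42):
    """Split entries into train/valid/test, cutting each category at 80%/90%.

    Mirrors the loop in prepare-finetune.py: shuffle each category, slice it
    at int(n * 0.80) and int(n * 0.90), then shuffle each split so that
    categories are interleaved.
    """
    rng = random.Random(seed)  # local RNG, independent of global random state
    by_cat = defaultdict(list)
    for e in entries:
        by_cat[e[key]].append(e)
    train, valid, test = [], [], []
    for _, items in sorted(by_cat.items()):
        items = items[:]  # copy so the caller's grouping is not mutated
        rng.shuffle(items)
        n = len(items)
        train_end = int(n * 0.80)
        valid_end = int(n * 0.90)
        train.extend(items[:train_end])
        valid.extend(items[train_end:valid_end])
        test.extend(items[valid_end:])
    for split in (train, valid, test):
        rng.shuffle(split)
    return train, valid, test
```

With 10 entries in one category and 20 in another, the cut points give 8/1/1 and 16/2/2, so 24/3/3 overall.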

training/data/bash-v2/minimal/test.jsonl ADDED