arach committed on
Commit 04558eb · 1 Parent(s): 3079a1d

🧪 initial commit — voice-to-syntax training lab


Processor, pipeline, eval data, training infrastructure, and blog drafts
from the voice dictation → programming syntax experiments.

Architecture: LLM normalizer → procedural processor (split approach).
Zero-training pipeline achieves 47% on mixed-difficulty eval (2x baseline).

Files changed (49)
  1. .gitignore +22 -0
  2. README.md +67 -0
  3. blog/part1-teaching-tiny-model.md +369 -0
  4. blog/part2-split-architecture-notes.md +130 -0
  5. eval/fuzzy.json +1202 -0
  6. eval/independent.json +102 -0
  7. pipeline/normalizer.py +415 -0
  8. processor/procedural.py +377 -0
  9. scripts/bench-inference.py +68 -0
  10. scripts/eval-bash-v2.py +184 -0
  11. scripts/eval-bash-v3.py +194 -0
  12. scripts/eval-bash.py +137 -0
  13. scripts/eval-finetune.py +150 -0
  14. scripts/eval-holdout.py +122 -0
  15. training/adapters/llama-1b-lora-v2/adapter_config.json +40 -0
  16. training/adapters/llama-1b-lora/adapter_config.json +40 -0
  17. training/adapters/qwen-0.5b-bash-v1/adapter_config.json +40 -0
  18. training/adapters/qwen-0.5b-dora/adapter_config.json +40 -0
  19. training/adapters/qwen-0.5b-lora-v2/adapter_config.json +40 -0
  20. training/adapters/qwen-0.5b-lora-v3/adapter_config.json +40 -0
  21. training/adapters/qwen-0.5b-lora/adapter_config.json +40 -0
  22. training/adapters/qwen-1.5b-bash-v1/adapter_config.json +40 -0
  23. training/adapters/qwen-1.5b-bash-v2-minimal/adapter_config.json +40 -0
  24. training/adapters/qwen-1.5b-bash-v2-protocol/adapter_config.json +40 -0
  25. training/adapters/qwen-1.5b-bash-v3-minimal/adapter_config.json +40 -0
  26. training/adapters/qwen-1.5b-lora-v1/adapter_config.json +40 -0
  27. training/converters/bash-to-dictation-v2.py +394 -0
  28. training/converters/bash-to-dictation-v3.py +486 -0
  29. training/converters/prepare-finetune.py +91 -0
  30. training/data/bash-v2/minimal/test.jsonl +0 -0
  31. training/data/bash-v2/minimal/train.jsonl +0 -0
  32. training/data/bash-v2/minimal/valid.jsonl +0 -0
  33. training/data/bash-v2/protocol/test.jsonl +0 -0
  34. training/data/bash-v2/protocol/train.jsonl +0 -0
  35. training/data/bash-v2/protocol/valid.jsonl +0 -0
  36. training/data/bash-v3/minimal/test.jsonl +0 -0
  37. training/data/bash-v3/minimal/train.jsonl +0 -0
  38. training/data/bash-v3/minimal/valid.jsonl +0 -0
  39. training/data/bash-v3/protocol/test.jsonl +0 -0
  40. training/data/bash-v3/protocol/train.jsonl +0 -0
  41. training/data/bash-v3/protocol/valid.jsonl +0 -0
  42. training/data/chat/test.jsonl +66 -0
  43. training/data/chat/train.jsonl +0 -0
  44. training/data/chat/valid.jsonl +59 -0
  45. training/data/completions/test.jsonl +66 -0
  46. training/data/completions/train.jsonl +491 -0
  47. training/data/completions/valid.jsonl +59 -0
  48. training/data/dictation-to-bash.json +666 -0
  49. training/data/syntax-reconstruction.json +685 -0
.gitignore ADDED
@@ -0,0 +1,22 @@
1
+ __pycache__/
2
+ *.pyc
3
+ .DS_Store
4
+ *.egg-info/
5
+ dist/
6
+ build/
7
+
8
+ # Secrets
9
+ .hf
10
+ .env
11
+
12
+ # Large model files
13
+ *.safetensors
14
+ *.bin
15
+ *.gguf
16
+
17
+ # NL2Bash source (download separately)
18
+ nl2bash-repo/
19
+
20
+ # Adapter weights (track configs, not weights)
21
+ training/adapters/*/adapter.safetensors
22
+ training/adapters/*/model.safetensors.index.json
README.md ADDED
@@ -0,0 +1,67 @@
1
+ # training-lab
2
+
3
+ Experiments in voice dictation to programming syntax. Teaching small models to understand spoken code.
4
+
5
+ ## Domain
6
+
7
+ Converting spoken dictation like `"git space push space dash u space origin space main"` into actual syntax: `git push -u origin main`.
8
+
9
+ The challenge: users don't always speak in perfect protocol format. They use synonyms ("minus" for "dash"), skip separator words, add conversational filler ("okay so the command is..."), and make mid-sentence corrections ("no wait, actually...").
10
+
11
+ ## Architecture
12
+
13
+ ```
14
+ Raw speech transcript
15
+ → Protocol detector (is it already clean?)
16
+ → IF clean: bypass LLM → procedural processor
17
+ → IF messy: LLM normalizer → procedural processor
18
+ → Final syntax output
19
+ ```
20
+
21
+ **Procedural processor** — deterministic token scanner. Symbol vocabulary, number words, casing directives. 93% on clean input, zero hallucination, instant.
22
+
23
+ **LLM normalizer** — rewrites messy dictation into clean protocol format. Strips filler, resolves corrections, inserts spacing keywords. The LLM never outputs actual symbols — it only outputs protocol words.
24
+
25
+ ## Structure
26
+
27
+ ```
28
+ processor/ Deterministic symbol/number/casing processor
29
+ pipeline/ LLM + processor pipeline (zero-training normalizer)
30
+ eval/ Evaluation datasets (fuzzy + independent)
31
+ training/
32
+ data/ Training data (syntax-reconstruction, dictation-to-bash)
33
+ converters/ Scripts to generate training data from NL2Bash
34
+ adapters/ Fine-tuned model adapters (LoRA/DoRA)
35
+ scripts/ Evaluation and benchmarking scripts
36
+ blog/ Writeup drafts and notes
37
+ ```
38
+
39
+ ## Quick start
40
+
41
+ ```bash
42
+ # Run the procedural processor on clean protocol input
43
+ python3 processor/procedural.py eval/independent.json
44
+
45
+ # Run the normalizer pipeline (requires mlx-lm)
46
+ pip install mlx mlx-lm
47
+ python3 pipeline/normalizer.py eval/fuzzy.json --model mlx-community/Qwen2.5-1.5B-Instruct-4bit
48
+ ```
49
+
50
+ ## Results (zero-training, prompted only)
51
+
52
+ | Model | Clean | Fuzzy | Natural | Chaotic | Overall |
53
+ |---|---|---|---|---|---|
54
+ | Processor only | 92% | 0% | 0% | 2% | 23.5% |
55
+ | Qwen 2.5 1.5B | 90% | 20% | 54% | 24% | 47% |
56
+ | Qwen 2.5 0.5B | 90% | 12% | 44% | 20% | 41.5% |
57
+ | Llama 3.2 1B | 92% | 14% | 34% | 10% | 37.5% |
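
The Overall column is consistent with an unweighted mean of the four difficulty columns. The check below is only a sanity sketch: the equal weighting is inferred from the numbers in the table, not stated by the README, and `overall` is a name made up for this example.

```python
# Scores copied from the results table (percent correct per difficulty:
# Clean, Fuzzy, Natural, Chaotic).
RESULTS = {
    "Processor only": [92, 0, 0, 2],
    "Qwen 2.5 1.5B": [90, 20, 54, 24],
    "Qwen 2.5 0.5B": [90, 12, 44, 20],
    "Llama 3.2 1B": [92, 14, 34, 10],
}


def overall(scores):
    """Unweighted mean across difficulty levels (an assumed weighting)."""
    return sum(scores) / len(scores)


for model, scores in RESULTS.items():
    print(f"{model}: {overall(scores):.1f}%")
```

Under that assumption the best prompted model (47%) lands at exactly twice the processor-only baseline (23.5%), which matches the "2x baseline" figure in the commit message.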
58
+
59
+ ## Protocol format
60
+
61
+ The "space-as-a-word" protocol eliminates spacing ambiguity:
62
+
63
+ - `"space"` → literal space between tokens
64
+ - Symbol words: `dash dot slash pipe colon quote` etc.
65
+ - Casing: `camel case`, `snake case`, `pascal case`, `kebab case`
66
+ - Numbers: `zero` through `nineteen`, `twenty`...`ninety`, `hundred`, `thousand`
67
+ - Capitalization: `capital X`, `all caps WORD`
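
To make the protocol concrete, here is a minimal sketch of a token scanner for it. This is not the code in `processor/procedural.py`: the `SYMBOLS` table is a small hypothetical subset, `protocol_to_syntax` is an illustrative name, and number words, casing directives, and `all caps` are left out.

```python
# Hypothetical, partial vocabulary; the real processor covers more symbols.
SYMBOLS = {
    "dash": "-", "dot": ".", "slash": "/", "pipe": "|",
    "colon": ":", "quote": '"', "star": "*",
}


def protocol_to_syntax(protocol: str) -> str:
    """Concatenate tokens, emitting a space only for the word "space"."""
    out = []
    capitalize_next = False
    for word in protocol.split():
        if word == "space":
            out.append(" ")
        elif word == "capital":
            # "capital X": upper-case the first letter of the next token.
            capitalize_next = True
        elif capitalize_next:
            out.append(word[:1].upper() + word[1:])
            capitalize_next = False
        else:
            # Symbol words become symbols; anything else passes through.
            out.append(SYMBOLS.get(word, word))
    return "".join(out)


print(protocol_to_syntax("git space push space dash u space origin space main"))
```

Because tokens are joined with no separator, `dash u` becomes `-u` while `space` is the only way to produce a gap, which is how the protocol removes spacing ambiguity.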
blog/part1-teaching-tiny-model.md ADDED
@@ -0,0 +1,369 @@
1
+ # Teaching a Tiny Model to Hear Bash
2
+
3
+ > Fine-tuning a 1.5B model to reconstruct shell commands from voice. 97% accuracy, 3GB of RAM, under a second on a phone.
4
+
5
+ <!-- METADATA
6
+ slug: teaching-a-tiny-model-to-hear-bash
7
+ date: 2026-03-04
8
+ tags: mlx, fine-tuning, lora, on-device-ml, voice, speech-to-code
9
+ author: Arach
10
+ -->
11
+
12
+ ---
13
+
14
+ <!-- ILLUSTRATION: hero
15
+ Style: Dark terminal window with a waveform entering from the left (speech),
16
+ flowing through a small glowing chip/brain icon in the center, and clean bash
17
+ syntax emerging on the right. The waveform is warm orange, the chip is a cool
18
+ blue, the output text is green on black (classic terminal).
19
+ Alt: Speech waveform flowing through a tiny neural network into bash syntax
20
+ -->
21
+
22
+ ## The Problem
23
+
24
+ I'm building [Talkie](https://usetalkie.com), a voice-first productivity app. One of its features is keyboard dictation. You speak into your phone, it types into whatever app you're using.
25
+
26
+ For regular prose, off-the-shelf speech-to-text handles it. For terminal commands, it falls apart completely.
27
+
28
+ <!-- ILLUSTRATION: side-by-side comparison
29
+ Left panel (labeled "What you say"):
30
+ A speech bubble containing: "find dot dash name star dot txt"
31
+ Right panel (labeled "What transcription gives you"):
32
+ A terminal showing: find dot dash name star dot text
33
+ Below both (labeled "What you meant"):
34
+ A terminal showing: find . -name *.txt
35
+ Style: Clean, minimal. The "what you meant" panel should feel correct/resolved
36
+ — maybe a subtle green checkmark or highlight.
37
+ -->
38
+
39
+ Say "find dot dash name star dot txt" to any transcription engine and you get back a faithful transcription of your words. Not the command you meant. The gap between spoken description and intended syntax is the problem.
40
+
41
+ ## The Bet: A Tiny Model, On-Device
42
+
43
+ I wanted to know if a model small enough to run on a phone could learn this mapping end-to-end. Not a rule engine. Not a cloud API call to GPT-4. A model that fits in pocket-sized RAM and returns an answer before the user notices it's thinking.
44
+
45
+ **Model:** Qwen2.5-1.5B-Instruct, 4-bit quantized via MLX. Fits in ~3GB.
46
+
47
+ **Method:** LoRA fine-tuning on Apple Silicon. Rank 8, scale 20, no dropout. The whole training run uses under 3GB of memory.
48
+
49
+ **Data:** 6,304 examples of dictated bash paired with intended syntax — 5,044 train, 630 validation, 630 test. Each example is a simple chat turn:
50
+
51
+ ```json
52
+ {
53
+ "messages": [
54
+ {"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."},
55
+ {"role": "user", "content": "find dot dash name star dot txt"},
56
+ {"role": "assistant", "content": "find . -name *.txt"}
57
+ ]
58
+ }
59
+ ```
60
+
61
+ The data covers a wide surface of Unix — `find`, `grep`, `ssh`, `tar`, `chmod`, piped chains, quoted arguments, nested subshells, escape sequences. The dictation convention is consistent: symbols are spoken as English words ("dash", "dot", "slash", "pipe") and numbers are spelled digit-by-digit ("one two seven" for `127`).
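
A record in this format can be produced with a few lines of Python. The sketch below is not the repo's converter; `make_example` and `to_jsonl` are names invented for illustration, and only the system prompt and the sample pair come from the example above.

```python
import json

SYSTEM_PROMPT = (
    "Reconstruct the intended syntax from the dictated text. "
    "Output only the result."
)


def make_example(dictated: str, command: str) -> dict:
    """Wrap one (dictation, command) pair as a three-turn chat record."""
    return {
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": dictated},
            {"role": "assistant", "content": command},
        ]
    }


def to_jsonl(pairs) -> str:
    """Serialize pairs as JSON Lines: one record per line."""
    return "\n".join(json.dumps(make_example(d, c)) for d, c in pairs)


print(to_jsonl([("find dot dash name star dot txt", "find . -name *.txt")]))
```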
62
+
63
+ ## Training
64
+
65
+ ```bash
66
+ mlx_lm.lora \
67
+ --model mlx-community/Qwen2.5-1.5B-Instruct-4bit \
68
+ --data datasets/finetune/bash-v2/minimal \
69
+ --batch-size 4 \
70
+ --lora-layers 16 \
71
+ --iters 1000 \
72
+ --learning-rate 1e-4 \
73
+ --mask-prompt
74
+ ```
75
+
76
+ One flag worth calling out: `--mask-prompt`. The model only learns to predict the assistant response, not the system and user turns. All the training signal goes to the actual reconstruction task.
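
Conceptually, prompt masking means the loss is averaged over response tokens only. The sketch below illustrates that idea in plain Python; it is not how `mlx_lm` implements the flag, and the per-token losses are made-up numbers.

```python
def prompt_mask(prompt_len: int, total_len: int) -> list[int]:
    """1 for tokens that contribute to the loss, 0 for masked prompt tokens."""
    return [0] * prompt_len + [1] * (total_len - prompt_len)


def masked_mean_loss(token_losses: list[float], mask: list[int]) -> float:
    """Average the per-token loss over unmasked positions only."""
    kept = [loss for loss, keep in zip(token_losses, mask) if keep]
    return sum(kept) / len(kept)


# Illustrative numbers only: 4 prompt tokens followed by 2 response tokens.
losses = [2.0, 1.5, 1.0, 0.5, 0.4, 0.2]
mask = prompt_mask(prompt_len=4, total_len=6)
print(mask, masked_mean_loss(losses, mask))
```

With the mask applied, the high losses on the system and user turns no longer influence the gradient.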
77
+
78
+ It converged fast.
79
+
80
+ <!-- ILLUSTRATION: training-curve
81
+ A clean line chart with two lines:
82
+ - Train loss (blue): drops steeply from ~2.5 to ~0.05
83
+ - Val loss (orange): drops from ~2.8 to 0.109 at iter 800, ticks up to 0.137 at 1000
84
+ X-axis: Iteration (0 to 1000)
85
+ Y-axis: Loss (0 to 3.0)
86
+ A small annotation at iter 800: "best checkpoint"
87
+ A subtle shaded region after 800 labeled "mild overfit"
88
+ Style: Minimal, no gridlines. Just the curves and the annotation.
89
+ -->
90
+
91
+ | Iter | Train Loss | Val Loss |
92
+ |------|-----------|----------|
93
+ | 200 | 0.337 | 0.213 |
94
+ | 400 | 0.108 | 0.204 |
95
+ | 600 | 0.068 | 0.137 |
96
+ | **800** | **0.049** | **0.109** |
97
+ | 1000 | 0.052 | 0.137 |
98
+
99
+ Best validation loss at iteration 800. A mild overfit signal by 1000. Final test loss: 0.098, perplexity: 1.103.
100
+
101
+ Peak memory during training: 2.95 GB. Total wall time: about 35 minutes on a MacBook.
102
+
103
+ ## Beyond Val Loss: Does It Actually Get Commands Right?
104
+
105
+ Validation loss says the model is learning. It doesn't say whether it produces correct commands. So I ran the full 630-example test set through inference, compared each output character-for-character against the expected command, and sorted the results into buckets.
106
+
107
+ <!-- ILLUSTRATION: results-bar
108
+ A horizontal stacked bar chart, single bar, full width:
109
+ - Green (76.2%): "Exact" — label inside
110
+ - Light green (21.0%): "Near" — label inside
111
+ - Yellow (2.4%): "Partial" — label inside or above
112
+ - Red (0.5%): "Wrong" — tiny sliver, label above
113
+ Below the bar: "97.1% effective accuracy (exact + near)"
114
+ Style: Clean, bold. The green dominates. The red sliver is barely visible.
115
+ -->
116
+
117
+ ```
118
+ Exact match : 480 / 630 (76.2%)
119
+ Near match (>90% similar): 132 / 630 (21.0%)
120
+ Partial (70-90%) : 15 / 630 (2.4%)
121
+ Wrong (<70%) : 3 / 630 (0.5%)
122
+
123
+ Effective accuracy: 97.1%
124
+ ```
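
The bucketing can be sketched as follows. The post does not say which similarity measure was used, so `difflib.SequenceMatcher` is an assumption here, as are the function names; the thresholds and counts are the ones reported above.

```python
from difflib import SequenceMatcher


def bucket(expected: str, got: str) -> str:
    """Classify one prediction by character-level similarity to the target."""
    if got == expected:
        return "exact"
    ratio = SequenceMatcher(None, expected, got).ratio()
    if ratio > 0.9:
        return "near"
    if ratio >= 0.7:
        return "partial"
    return "wrong"


def effective_accuracy(counts: dict) -> float:
    """Share of exact plus near matches, in percent."""
    total = sum(counts.values())
    return 100 * (counts["exact"] + counts["near"]) / total


# Counts from the evaluation above.
counts = {"exact": 480, "near": 132, "partial": 15, "wrong": 3}
print(f"{effective_accuracy(counts):.1f}%")
```

The effective figure is (480 + 132) / 630, so it counts every near match as correct; a stricter reading would report the 76.2% exact-match rate alongside it.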
125
+
126
+ Average inference time: 0.69 seconds per command on Apple Silicon.
127
+
128
+ The "near match" bucket is mostly whitespace and trivial formatting — extra spaces around operators, minor quoting style differences. Functionally identical outputs. The interesting signal is in the failures.
129
+
130
+ ## Anatomy of the 3%
131
+
132
+ Nearly every failure fell into one of two categories.
133
+
134
+ ### Repeated Digits
135
+
136
+ When the input contains a long spoken digit sequence — "one zero zero zero zero zero" for `100000` — the model starts generating correctly, then falls into a repetition loop.
137
+
138
+ <!-- ILLUSTRATION: repetition-failure
139
+ Show 3 examples as "cards" or terminal snippets:
140
+
141
+ Card 1:
142
+ Voice: "split dash L one zero zero zero zero zero database dot sql"
143
+ Expected: split -l 100000 database.sql database-
144
+ Got: split -l 100̶0̶0̶0̶0̶0̶0̶0̶0̶0̶0̶… (trailing zeros shown as fading/struck)
145
+
146
+ Card 2:
147
+ Voice: "head dash N nine nine nine nine nine nine nine nine"
148
+ Expected: head -n99999999 file1.txt
149
+ Got: head -n 99̶9̶9̶9̶9̶9̶… (same treatment)
150
+
151
+ Card 3:
152
+ Voice: "ping eight dot eight dot eight dot eight"
153
+ Expected: ping 8.8.8.8
154
+ Got: ping 8̶.̶8̶.̶8̶… (same treatment)
155
+
156
+ Style: The correct portion in white/green, the degenerate tail in red/faded,
157
+ visually showing where the model "loses the plot."
158
+ -->
159
+
160
+ ```
161
+ IN: "one zero zero zero zero zero" → GOT: 100[000000000…] EXPECTED: 100000
162
+ IN: "nine nine nine nine nine nine nine nine" → GOT: 99[999999999…] EXPECTED: 99999999
163
+ IN: "eight dot eight dot eight dot eight" → GOT: 8[.8.8.8.8.…] EXPECTED: 8.8.8.8
164
+ ```
165
+
166
+ This is a known weakness of small language models with repeated tokens. The model sees "I just generated a zero" and assigns high probability to the next token also being a zero. The attention pattern becomes self-reinforcing.
167
+
168
+ All 3 of the "wrong" results in the evaluation were this exact failure mode.
169
+
170
+ ### Casing Ambiguity
171
+
172
+ ```
173
+ IN: "df dash I H" → GOT: df -iH EXPECTED: df -ih
174
+ IN: "diff dash Y A B" → GOT: diff -y A B EXPECTED: diff -y a b
175
+ IN: "cp dash R S /mnt/..." → GOT: cp -R s/... EXPECTED: cp -rs /...
176
+ ```
177
+
178
+ When someone says "dash I H" — should it be `-ih` or `-iH`? Both are valid bash. The model preserves the casing from the spoken input, which is a reasonable default but doesn't always match the expected answer.
179
+
180
+ 21 of 630 examples (3.3%) differed only in letter casing. Score case-insensitively and they're all correct.
181
+
182
+ The remaining 14 partial matches were structural — a doubled token, a missed path segment, a quoting difference. Real model limitations, but minor ones.
183
+
184
+ ## The Insight
185
+
186
+ Here's the thing I didn't expect going in.
187
+
188
+ Looking at the dictation vocabulary across the entire dataset, the mapping from spoken words to symbols is *completely deterministic*:
189
+
190
+ <!-- ILLUSTRATION: vocabulary-table
191
+ A visual "lookup table" or "decoder ring" showing the spoken-to-symbol mapping.
192
+ Two columns. Left: spoken word in a speech bubble or rounded tag. Right: the
193
+ symbol in a monospace/terminal style.
194
+
195
+ Show the top ~15 mappings arranged in a visually interesting grid or flow:
196
+ dash → - pipe → | star → *
197
+ dot → . backslash → \ semicolon → ;
198
+ slash → / dollar → $ plus → +
199
+ quote → " underscore → _ equals → =
200
+ single quote → ' tilde → ~ colon → :
201
+ open brace → { close brace → }
202
+
203
+ Below: "30 spoken tokens → 30 symbols. No ambiguity. No ML needed."
204
+
205
+ Style: This should be visually striking — the contrast between fuzzy human speech
206
+ and precise symbols is the point. Maybe the left side feels organic/warm and the
207
+ right side feels precise/mechanical.
208
+ -->
209
+
210
+ | Spoken | Symbol | Occurrences |
211
+ |--------|--------|-------------|
212
+ | dash | `-` | 11,207 |
213
+ | quote | `"` | 4,676 |
214
+ | dot | `.` | 4,297 |
215
+ | slash | `/` | 4,079 |
216
+ | pipe | `\|` | 1,791 |
217
+ | star | `*` | 1,730 |
218
+ | backslash | `\` | 924 |
219
+ | semicolon | `;` | 766 |
220
+ | dollar | `$` | 636 |
221
+ | ... | ... | ... |
222
+
223
+ Thirty spoken tokens mapping to thirty symbols. No ambiguity. No context-dependence. A lookup table handles it perfectly.
224
+
225
+ Same for digits: "zero" through "nine" map 1:1 to `0`-`9`, spoken digit-by-digit and concatenated. "One two seven" is always `127`. "Zero six four four" is always `0644`.
226
+
227
+ The model is spending a huge chunk of its 1.5 billion parameters learning these fixed mappings. Every training example where "dash" becomes `-` is a wasted gradient. The model figured this out after the first hundred examples and then saw it eleven thousand more times.
228
+
229
+ **The fix isn't more training. It's less work for the model.**
230
+
231
+ ## The Architecture That Emerges
232
+
233
+ <!-- ILLUSTRATION: pipeline-architecture
234
+ A vertical flow diagram with three stages, each as a distinct box/card:
235
+
236
+ Stage 1 — PREPROCESSOR (labeled "Deterministic Code"):
237
+ Input: "find dot dash name star dot txt"
238
+ Processing: symbol/digit lookup table
239
+ Output: "find . - name * . txt"
240
+ Visual style: Mechanical, precise. Gear icon or lookup table icon.
241
+
242
+ Stage 2 — MODEL (labeled "Fine-tuned 1.5B LM"):
243
+ Input: "find . - name * . txt"
244
+ Processing: structural reasoning (spacing, quoting, grouping)
245
+ Output: "find . -name *.txt"
246
+ Visual style: Neural/organic. Small brain or network icon.
247
+
248
+ Stage 3 — POST-PROCESSOR (labeled "Deterministic Code"):
249
+ Input: "find . -name *.txt"
250
+ Processing: repetition guard, balanced quotes check
251
+ Output: "find . -name *.txt" ✓
252
+ Visual style: Same mechanical feel as Stage 1. Shield or checkmark icon.
253
+
254
+ The key visual idea: the ML part is sandwiched between two layers of regular code.
255
+ Most of the "intelligence" is deterministic. The model handles the narrow slice
256
+ that actually requires judgment.
257
+ -->
258
+
259
+ ```
260
+ "find dot dash name star dot txt"
261
+
262
+ ┌────────▼─────────┐
263
+ │ Preprocessor │ Deterministic: symbol + digit expansion
264
+ │ (no ML) │ "find . - name * . txt"
265
+ └────────┬─────────┘
266
+
267
+ ┌────────▼─────────┐
268
+ │ Fine-tuned LM │ Structural reasoning only
269
+ │ (1.5B, LoRA) │ find . -name *.txt
270
+ └────────┬─────────┘
271
+
272
+ ┌────────▼─────────┐
273
+ │ Post-processor │ Repetition guard, sanity checks
274
+ │ (no ML) │
275
+ └────────┬─────────┘
276
+
277
+
278
+ find . -name *.txt
279
+ ```
280
+
281
+ **Preprocessor** — deterministic code, no model involved:
282
+ - Symbol words to literal characters: `dash` → `-`, `pipe` → `|`, `open brace` → `{`
283
+ - Digit sequences to numbers: `one two seven` → `127`, `zero six four four` → `0644`
284
+ - Compound numbers to digits: `twenty three` → `23`, `twelve` → `12`
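
The three preprocessor rules above can be sketched as a single pass over the words. This is an illustration under stated assumptions, not the planned implementation: the vocabularies are partial, `expand` is a made-up name, and the rule that a tens word absorbs a following non-zero digit is one possible reading of "compound numbers".

```python
# Partial, illustrative vocabularies.
SYMBOLS = {"dash": "-", "dot": ".", "slash": "/", "pipe": "|", "star": "*"}
PAIR_SYMBOLS = {"open brace": "{", "close brace": "}", "single quote": "'"}
DIGITS = {"zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
          "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9"}
TEENS = {"ten": "10", "eleven": "11", "twelve": "12", "thirteen": "13",
         "fourteen": "14", "fifteen": "15", "sixteen": "16",
         "seventeen": "17", "eighteen": "18", "nineteen": "19"}
TENS = {"twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
        "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90}


def expand(text: str) -> str:
    """Replace symbol and number words; leave spacing decisions to the model."""
    words = text.split()
    out = []
    i = 0
    while i < len(words):
        pair = " ".join(words[i:i + 2])
        if pair in PAIR_SYMBOLS:            # two-word symbols first
            out.append(PAIR_SYMBOLS[pair])
            i += 2
        elif words[i] in TENS:              # "twenty three" -> 23
            value = TENS[words[i]]
            i += 1
            if i < len(words) and words[i] in DIGITS and words[i] != "zero":
                value += int(DIGITS[words[i]])
                i += 1
            out.append(str(value))
        elif words[i] in TEENS:             # "twelve" -> 12
            out.append(TEENS[words[i]])
            i += 1
        elif words[i] in DIGITS:            # digit run -> concatenated digits
            run = ""
            while i < len(words) and words[i] in DIGITS:
                run += DIGITS[words[i]]
                i += 1
            out.append(run)
        else:                               # single-word symbol or plain word
            out.append(SYMBOLS.get(words[i], words[i]))
            i += 1
    return " ".join(out)


print(expand("find dot dash name star dot txt"))
```

The output keeps a space between every token (`find . - name * . txt`), which is exactly the form the diagram hands to the model for structural cleanup.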
285
+
286
+ **Model** — the only part that requires ML, and now its job is purely structural:
287
+ - Where do spaces go? (`-name` vs `- name`)
288
+ - What gets quoted? (`"*.txt"` vs `*.txt`)
289
+ - How do tokens group? (`-exec rm -f {} \;` as a unit)
290
+ - What's a flag vs. an argument? (`-rs` vs `-R s`)
291
+
292
+ **Post-processor** — deterministic code again:
293
+ - Repetition detection: same n-gram 3+ times in a row, truncate
294
+ - Structural validation: balanced quotes, balanced braces, no trailing artifacts
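
A rough sketch of both post-processor checks follows. Several choices here are assumptions rather than the author's design: n-grams are whitespace-separated tokens, "truncate" is read as keeping only the first copy of the repeated n-gram, and the balance check is deliberately naive (it ignores escaping and single quotes). The function names are invented.

```python
def repetition_guard(text: str, min_repeats: int = 3, max_n: int = 4) -> str:
    """Cut the output after the first copy of any token n-gram that
    repeats min_repeats or more times in a row."""
    tokens = text.split()
    for start in range(len(tokens)):
        for n in range(1, max_n + 1):
            gram = tokens[start:start + n]
            if len(gram) < n:
                break
            repeats = 1
            pos = start + n
            while tokens[pos:pos + n] == gram:
                repeats += 1
                pos += n
            if repeats >= min_repeats:
                return " ".join(tokens[:start + n])
    return text


def is_balanced(text: str) -> bool:
    """Naive check: even number of double quotes, matching brace counts."""
    return text.count('"') % 2 == 0 and text.count("{") == text.count("}")


print(repetition_guard("echo hi hi hi hi"))
print(is_balanced('grep "foo'))
```

A token-level guard like this would not catch a runaway digit string inside a single token such as `100000000…`; that case would need a character-level variant, or the preprocessor keeping digits away from the model in the first place.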
295
+
296
+ The model becomes a structural reasoner instead of a lookup table. It stops memorizing that "dash" means `-` and starts focusing on the actually hard part: how these symbols compose into valid commands.
297
+
298
+ ## What the Numbers Mean
299
+
300
+ <!-- ILLUSTRATION: production-stats
301
+ Three "stat cards" in a row, bold numbers with subtle icons:
302
+
303
+ Card 1: "97%" with subtext "effective accuracy"
304
+ Icon: target/bullseye
305
+ Card 2: "3 GB" with subtext "total memory"
306
+ Icon: chip/RAM stick
307
+ Card 3: "0.7s" with subtext "per command"
308
+ Icon: stopwatch
309
+
310
+ Below: "On a phone. Offline. No cloud."
311
+
312
+ Style: Clean, confident. These numbers should feel impressive without being flashy.
313
+ -->
314
+
315
+ 97% accuracy from a model that fits in 3GB and runs in under a second. On a phone. Offline. No API call, no network dependency, no usage fees.
316
+
317
+ The remaining 3% breaks down cleanly:
318
+
319
+ - **Repeated digits** (~0.6%): eliminated entirely by the preprocessor — digits never reach the model
320
+ - **Casing** (~3.3%): arguably not errors — both casings are valid bash. Case-insensitive accuracy is already ~99%
321
+ - **Structural** (~2.2%): genuine model limitations, mostly minor — a doubled token, a missed path segment
322
+
323
+ With the preprocessing pipeline handling symbols and digits, the model's effective job shrinks substantially, and I'd expect accuracy above 98% without any retraining.
324
+
325
+ ## Practical Notes
326
+
327
+ **Training cost.** 35 minutes on a MacBook, 3GB RAM. No GPU cluster. MLX makes LoRA fine-tuning on Apple Silicon feel like running a build.
328
+
329
+ **Data efficiency.** 5,044 training examples was enough for 97%. The model converged in 800 iterations — 3,200 examples at batch size 4. Small, focused datasets beat large noisy ones when the task is narrow.
330
+
331
+ **Checkpoint selection.** Best validation loss at iteration 800 (0.109). Iteration 1000 showed mild overfitting (0.137). In practice the difference was small — both produced similar accuracy in full evaluation.
332
+
333
+ **Inference.** 0.69 seconds average. Fast enough to run between when you stop speaking and when text appears. The user doesn't wait.
334
+
335
+ ## What's Next
336
+
337
+ <!-- ILLUSTRATION: domains-expansion
338
+ A grid of domain "cards" showing where this same approach applies:
339
+
340
+ bash → ✅ done (this post)
341
+ SQL → "select star from users where..."
342
+ regex → "caret open bracket A dash Z close bracket plus dollar"
343
+ URLs → "H T T P S colon slash slash..."
344
+ math → "integral from zero to infinity..."
345
+ file paths → "slash users slash arach slash..."
346
+
347
+ Each card has the domain name, a tiny example of spoken → written, and a
348
+ status indicator (done, next, future).
349
+
350
+ Style: Grid layout. "bash" card is highlighted/completed. Others are dimmed
351
+ or outlined, suggesting a roadmap.
352
+ -->
353
+
354
+ Building the preprocessing pipeline is the immediate next step — the deterministic symbol and digit expander that feeds cleaned input to the model.
355
+
356
+ Beyond that, the approach generalizes to any domain with a consistent spoken-to-written mapping. SQL, regex, file paths, URLs, mathematical notation. The model architecture stays the same. You change the training data and the preprocessor's lookup table.
357
+
358
+ The broader point: the right role for a small model isn't doing everything. It's doing the one thing that only a model can do, sandwiched between deterministic code that handles the rest.
359
+
360
+ <!-- ILLUSTRATION: closing
361
+ A minimal callback to the hero image — the same speech-to-syntax flow, but
362
+ now with the preprocessor and post-processor stages visible as small nodes
363
+ in the pipeline. The model in the center is smaller/lighter, because its
364
+ job is smaller now. The deterministic stages are doing the heavy lifting.
365
+
366
+ Or alternatively: a phone lying on a desk, terminal open, with a speech bubble
367
+ above it containing "find dot dash name star dot txt" and the terminal showing
368
+ the correct output. Simple, confident, done.
369
+ -->
blog/part2-split-architecture-notes.md ADDED
@@ -0,0 +1,130 @@
1
+ # Part 2: When Fine-Tuning Isn't the Answer (Yet)
2
+
3
+ > Follow-up to "Teaching a Tiny Model to Hear Bash"
4
+ > Working title — refine before publishing
5
+
6
+ ## Narrative arc
7
+
8
+ Part 1 ended on a high: 97% accuracy, 3GB RAM, under a second. But there's a catch we glossed over — that 97% is on **clean protocol input**. When users speak naturally ("okay so the command is...") or make corrections mid-sentence ("dash dash no wait just dash v"), the model falls apart.
9
+
10
+ This post is about what we tried next, what we learned, and the architectural insight that changed our approach.
11
+
12
+ ## Key beats
13
+
14
+ ### 1. The 97% Illusion
15
+
16
+ The fine-tuned model is great... if you speak its language perfectly. Real users don't.
17
+
18
+ Four difficulty levels:
19
+ - **Clean**: "git space push space dash u space origin space main" → 93% (processor alone)
20
+ - **Fuzzy**: "git commit minus m quote fix login bug quote" → 0% (no "space" keywords)
21
+ - **Natural**: "okay so the command is git push dash u origin main" → 0% (filler)
22
+ - **Chaotic**: "dash dash no wait just dash v" → 0% (self-corrections)
23
+
24
+ The training data was clean. Reality isn't.
25
+
26
+ ### 2. The Procedural Processor Discovery
27
+
28
+ Before throwing more ML at it, we asked: how much of this task is deterministic?
29
+
30
+ Answer: almost all of it. "dash" always means "-". "dot" always means ".". A rule-based token scanner gets **93% on clean input** with zero hallucination, zero latency, zero training.
31
+
32
+ This raised the question: what is the LLM actually contributing? It's memorizing fixed mappings. The 11,207 times "dash" appears in training — the model learned them all, but a dictionary lookup does the same job.
33
+
34
+ ### 3. The Split Architecture
35
+
36
+ The insight: **use each tool for what it's good at.**
37
+
38
+ ```
39
+ Raw speech → LLM (language understanding) → Protocol text → Processor (deterministic) → Final syntax
40
+ ```
41
+
42
+ The LLM's job shrinks dramatically:
43
+ - Strip conversational filler
44
+ - Resolve self-corrections ("no wait, actually...")
45
+ - Insert "space" keywords between arguments
46
+ - Replace synonyms (minus→dash, period→dot)
47
+
48
+ It never outputs symbols. It never converts "dash" to `-` itself. It just cleans up natural language into a constrained protocol format, and the processor handles the rest.
49
+
50
+ ### 4. Zero-Training Results
51
+
52
+ We tested this with pure prompting (no fine-tuning) across 3 models:
53
+
54
+ | Model | Clean | Fuzzy | Natural | Chaotic | Overall |
55
+ |---|---|---|---|---|---|
56
+ | Processor only | 92% | 0% | 0% | 2% | 23.5% |
57
+ | Qwen 2.5 1.5B | 90% | 20% | 54% | 24% | 47% |
58
+ | Qwen 2.5 0.5B | 90% | 12% | 44% | 20% | 41.5% |
59
+ | Llama 3.2 1B | 92% | 14% | 34% | 10% | 37.5% |
60
+
61
+ Key findings:
62
+ - 2x baseline with zero training
63
+ - Clean input maintained at 90%+ (protocol bypass — if input already has "space" keywords, skip the LLM entirely)
64
+ - Natural/chaotic show real improvement (filler stripping, self-correction resolution work)
65
+ - Fuzzy is the bottleneck (20%) — inserting "space" keywords requires understanding command structure
66
+
67
+ ### 5. The Hybrid Architecture
68
+
69
+ The winning trick: **don't send everything through the LLM.**
70
+
71
+ ```
72
+ if input contains "space" keywords and no filler:
73
+ → bypass LLM, send directly to processor
74
+ else:
75
+ → LLM normalizes, then processor converts
76
+ ```
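
A runnable version of that routing decision might look like the sketch below. The filler lists are invented for illustration and the function names are hypothetical; the actual detector in `pipeline/normalizer.py` may use different signals.

```python
# Made-up filler vocabulary, for illustration only.
FILLER_WORDS = {"okay", "um", "uh", "wait", "actually"}
FILLER_PHRASES = ("the command is", "no wait")


def looks_like_protocol(text: str) -> bool:
    """True when the input has "space" keywords and no obvious filler."""
    lowered = text.lower()
    words = lowered.split()
    has_space_keyword = "space" in words
    has_filler = any(w in FILLER_WORDS for w in words) or any(
        phrase in lowered for phrase in FILLER_PHRASES
    )
    return has_space_keyword and not has_filler


def route(text: str) -> str:
    """Pick the pipeline path for one transcript."""
    return "processor" if looks_like_protocol(text) else "llm+processor"


print(route("git space push space dash u space origin space main"))
print(route("okay so the command is git push dash u origin main"))
```

A word-list detector like this is cheap but blunt: a legitimate argument named `wait` would be routed through the LLM unnecessarily, which costs latency but not correctness.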
77
+
78
+ This gives us:
79
+ - 96% on clean independent eval (up from 93% processor baseline)
80
+ - Near-zero latency for protocol-format input
81
+ - LLM only called when genuinely needed (26% of inputs bypassed)
82
+
83
+ ### 6. Where Prompting Hits Its Ceiling
84
+
85
+ Fuzzy normalization is the hard problem. The LLM needs to understand:
86
+ - `cat file period txt` → "cat" and "file.txt" are separate tokens (need "space")
87
+ - But within "file.txt", "file" + "dot" + "txt" concatenate (no "space")
88
+ - `dash dash verbose` → compound flag, stays together
89
+ - `dash u space origin` → flag and argument, need "space"
90
+
91
+ This requires understanding command structure — which words are commands, flags, paths, filenames. A 1.5B model can't learn this from 12 few-shot examples. But it CAN learn it from 5,000 training examples.
92
+
93
+ ### 7. The Path Forward
94
+
95
+ The fine-tuning task just got dramatically simpler:
96
+ - Old task: dictated text → final syntax (model must learn ALL symbol mappings)
97
+ - New task: dictated text → protocol text (model only learns WHERE to put "space")
98
+
99
+ Same training data. Same model. Much simpler output space. The processor handles the rest.
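
Retargeting the existing pairs would need the final syntax rewritten as protocol text. A minimal sketch of that inverse mapping, assuming a small symbol table and ignoring digits and capitalization, could look like this (`to_protocol` and `SYMBOL_WORDS` are illustrative names, not the repo's converter):

```python
# Partial, illustrative symbol table (symbol -> spoken word).
SYMBOL_WORDS = {
    "-": "dash", ".": "dot", "/": "slash", "|": "pipe",
    "*": "star", ":": "colon", '"': "quote",
}


def to_protocol(command: str) -> str:
    """Spell out spaces and known symbols; pass other characters through."""
    tokens = []
    word = ""
    for ch in command:
        if ch == " " or ch in SYMBOL_WORDS:
            if word:
                tokens.append(word)
                word = ""
            tokens.append("space" if ch == " " else SYMBOL_WORDS[ch])
        else:
            word += ch
    if word:
        tokens.append(word)
    return " ".join(tokens)


print(to_protocol("git push -u origin main"))
```

Each existing (dictation, command) pair could then become a (dictation, protocol) pair, so the model's target contains only words and never symbols.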
100
+
101
+ ## Themes to emphasize
102
+
103
+ - **Don't teach an LLM what a dictionary can do.** Deterministic mappings belong in code.
104
+ - **Split tasks at the boundary of language understanding.** The LLM handles ambiguity; code handles rules.
105
+ - **Zero-training experiments reveal architecture.** Prompting told us exactly where the value is (filler stripping, correction resolution) and where it isn't (symbol conversion, space insertion).
106
+ - **Evaluation infrastructure matters.** The 4-difficulty eval set (clean/fuzzy/natural/chaotic) made it possible to see WHERE each approach fails, not just a single accuracy number.
107
+
108
+ ## Data to include
109
+
110
+ - The results table above (all 3 models x 4 difficulties)
111
+ - Architecture diagram (raw → LLM → protocol → processor → syntax)
112
+ - Comparison: end-to-end fine-tuning vs split pipeline
113
+ - Error examples showing what the LLM gets right and wrong
114
+ - Latency numbers (2.5s with LLM vs ~0ms bypassed)
115
+
116
+ ## Code references
117
+
118
+ All code is in this repository:
119
+ - `processor/procedural.py`: the deterministic backbone
120
+ - `pipeline/normalizer.py`: the zero-training pipeline
121
+ - `eval/fuzzy.json`: 200 entries, 4 difficulty levels
122
+ - `eval/independent.json`: 100 clean protocol entries
123
+ - Fine-tuning infrastructure in `training/` (from Part 1)
124
+
125
+ ## Open questions for Part 3
126
+
127
+ - How much does fine-tuning the normalizer improve fuzzy accuracy?
128
+ - Can we generate training data programmatically? (take clean protocol, randomly drop "space" keywords, add filler)
129
+ - Is there a sweet spot between prompting and fine-tuning? (e.g., fine-tune on 100 examples instead of 5000)
130
+ - Should the normalizer be a separate model from the transcription engine?
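
The second question (programmatic data generation) can be prototyped quickly. The sketch below takes clean protocol text, drops each "space" keyword with some probability, and optionally prepends filler. The filler phrases, the function name, and the parameters are all invented for illustration; nothing here has been evaluated.

```python
import random

# Made-up filler phrases, for illustration only.
FILLERS = ["okay so the command is", "um", "let me think"]


def make_fuzzy(protocol: str, drop_prob: float, rng: random.Random,
               add_filler: bool = True) -> str:
    """Degrade clean protocol text into a noisier training input."""
    # The random draw happens only for "space" words (short-circuit `or`).
    kept = [
        word for word in protocol.split()
        if word != "space" or rng.random() >= drop_prob
    ]
    text = " ".join(kept)
    if add_filler:
        text = f"{rng.choice(FILLERS)} {text}"
    return text


rng = random.Random(0)  # seeded so a generated dataset is reproducible
clean = "git space push space dash u space origin space main"
print(make_fuzzy(clean, drop_prob=0.5, rng=rng))
```

Each degraded string paired with its clean source gives one (input, target) example for a normalizer, with `drop_prob` controlling how far the input drifts toward the Fuzzy difficulty level.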
eval/fuzzy.json ADDED
@@ -0,0 +1,1202 @@
1
+ [
2
+ {
3
+ "dictated": "git space push space dash u space origin space main",
4
+ "expected": "git push -u origin main",
5
+ "category": "git",
6
+ "difficulty": "clean"
7
+ },
8
+ {
9
+ "dictated": "docker space run space dash dash rm space dash p space eight zero eight zero colon eight zero space nginx",
10
+ "expected": "docker run --rm -p 8080:80 nginx",
11
+ "category": "docker",
12
+ "difficulty": "clean"
13
+ },
14
+ {
15
+ "dictated": "npm space install space dash capital D space typescript space at five",
16
+ "expected": "npm install -D typescript@5",
17
+ "category": "npm",
18
+ "difficulty": "clean"
19
+ },
20
+ {
21
+ "dictated": "ssh space dash i space tilde slash dot ssh slash id underscore rsa space root at one nine two dot one six eight dot one dot one",
22
+ "expected": "ssh -i ~/.ssh/id_rsa root@192.168.1.1",
23
+ "category": "ssh",
24
+ "difficulty": "clean"
25
+ },
26
+ {
27
+ "dictated": "curl space dash capital X space all caps POST space dash capital H space quote capital Content dash capital Type colon space application slash json quote space https colon slash slash api dot example dot com slash v one slash users",
28
+ "expected": "curl -X POST -H \"Content-Type: application/json\" https://api.example.com/v1/users",
29
+ "category": "curl",
30
+ "difficulty": "clean"
31
+ },
32
+ {
33
+ "dictated": "kubectl space get space pods space dash n space kube dash system",
34
+ "expected": "kubectl get pods -n kube-system",
35
+ "category": "kubectl",
36
+ "difficulty": "clean"
37
+ },
38
+ {
39
+ "dictated": "terraform space plan space dash var dash file equals production dot tfvars",
40
+ "expected": "terraform plan -var-file=production.tfvars",
41
+ "category": "terraform",
42
+ "difficulty": "clean"
43
+ },
44
+ {
45
+ "dictated": "chmod space zero seven five five space slash usr slash local slash bin slash deploy dot sh",
46
+ "expected": "chmod 0755 /usr/local/bin/deploy.sh",
47
+ "category": "filesystem",
48
+ "difficulty": "clean"
49
+ },
50
+ {
51
+ "dictated": "export space all caps DATABASE underscore URL equals quote postgres colon slash slash admin colon secret at localhost colon five four three two slash mydb quote",
52
+ "expected": "export DATABASE_URL=\"postgres://admin:secret@localhost:5432/mydb\"",
53
+ "category": "env",
54
+ "difficulty": "clean"
55
+ },
56
+ {
57
+ "dictated": "rsync space dash a v z space dash e space ssh space dot slash dist slash space user at one seven two dot sixteen dot zero dot one colon slash var slash www slash",
58
+ "expected": "rsync -avz -e ssh ./dist/ user@172.16.0.1:/var/www/",
59
+ "category": "rsync",
60
+ "difficulty": "clean"
61
+ },
62
+ {
63
+ "dictated": "psql space dash h space localhost space dash capital U space postgres space dash d space production",
64
+ "expected": "psql -h localhost -U postgres -d production",
65
+ "category": "psql",
66
+ "difficulty": "clean"
67
+ },
68
+ {
69
+ "dictated": "redis dash cli space dash h space one two seven dot zero dot zero dot one space dash p space six three seven nine space all caps PING",
70
+ "expected": "redis-cli -h 127.0.0.1 -p 6379 PING",
71
+ "category": "redis",
72
+ "difficulty": "clean"
73
+ },
74
+ {
75
+ "dictated": "brew space install space dash dash cask space visual dash studio dash code",
76
+ "expected": "brew install --cask visual-studio-code",
77
+ "category": "brew",
78
+ "difficulty": "clean"
79
+ },
80
+ {
81
+ "dictated": "systemctl space restart space nginx dot service",
82
+ "expected": "systemctl restart nginx.service",
83
+ "category": "systemctl",
84
+ "difficulty": "clean"
85
+ },
86
+ {
87
+ "dictated": "make space dash j space eight space all caps CC equals gcc space all caps CFLAGS equals quote dash capital O two dash capital Wall quote",
88
+ "expected": "make -j 8 CC=gcc CFLAGS=\"-O2 -Wall\"",
89
+ "category": "make",
90
+ "difficulty": "clean"
91
+ },
92
+ {
93
+ "dictated": "cargo space build space dash dash release space dash dash target space x eighty six underscore sixty four dash unknown dash linux dash gnu",
94
+ "expected": "cargo build --release --target x86_64-unknown-linux-gnu",
95
+ "category": "rust",
96
+ "difficulty": "clean"
97
+ },
98
+ {
99
+ "dictated": "go space build space dash o space bin slash server space dot slash cmd slash server",
100
+ "expected": "go build -o bin/server ./cmd/server",
101
+ "category": "go",
102
+ "difficulty": "clean"
103
+ },
104
+ {
105
+ "dictated": "swift space build space dash c space release space dash dash triple space arm sixty four dash apple dash macosx",
106
+ "expected": "swift build -c release --triple arm64-apple-macosx",
107
+ "category": "swift",
108
+ "difficulty": "clean"
109
+ },
110
+ {
111
+ "dictated": "aws space s three space cp space s three colon slash slash my dash bucket slash data dot csv space dot slash",
112
+ "expected": "aws s3 cp s3://my-bucket/data.csv ./",
113
+ "category": "aws",
114
+ "difficulty": "clean"
115
+ },
116
+ {
117
+ "dictated": "ffmpeg space dash i space input dot mp four space dash vf space quote scale equals one nine twenty colon one zero eighty quote space dash c colon a space copy space output dot mp four",
118
+ "expected": "ffmpeg -i input.mp4 -vf \"scale=1920:1080\" -c:a copy output.mp4",
119
+ "category": "ffmpeg",
120
+ "difficulty": "clean"
121
+ },
122
+ {
123
+ "dictated": "openssl space req space dash x five zero nine space dash newkey space rsa colon four zero nine six space dash keyout space key dot pem space dash out space cert dot pem space dash days space three six five",
124
+ "expected": "openssl req -x509 -newkey rsa:4096 -keyout key.pem -out cert.pem -days 365",
125
+ "category": "openssl",
126
+ "difficulty": "clean"
127
+ },
128
+ {
129
+ "dictated": "camel case get user profile",
130
+ "expected": "getUserProfile",
131
+ "category": "casing",
132
+ "difficulty": "clean"
133
+ },
134
+ {
135
+ "dictated": "snake case api response handler",
136
+ "expected": "api_response_handler",
137
+ "category": "casing",
138
+ "difficulty": "clean"
139
+ },
140
+ {
141
+ "dictated": "pascal case user authentication service",
142
+ "expected": "UserAuthenticationService",
143
+ "category": "casing",
144
+ "difficulty": "clean"
145
+ },
146
+ {
147
+ "dictated": "kebab case my awesome component",
148
+ "expected": "my-awesome-component",
149
+ "category": "casing",
150
+ "difficulty": "clean"
151
+ },
152
+ {
153
+ "dictated": "git space log space dash dash oneline space dash dash graph space dash n space twenty",
154
+ "expected": "git log --oneline --graph -n 20",
155
+ "category": "git",
156
+ "difficulty": "clean"
157
+ },
158
+ {
159
+ "dictated": "docker space compose space dash f space docker dash compose dot prod dot yml space up space dash d",
160
+ "expected": "docker compose -f docker-compose.prod.yml up -d",
161
+ "category": "docker",
162
+ "difficulty": "clean"
163
+ },
164
+ {
165
+ "dictated": "python space dash m space venv space dot venv space ampersand ampersand space source space dot venv slash bin slash activate",
166
+ "expected": "python -m venv .venv && source .venv/bin/activate",
167
+ "category": "python",
168
+ "difficulty": "clean"
169
+ },
170
+ {
171
+ "dictated": "pip space install space dash r space requirements dot txt space dash dash upgrade",
172
+ "expected": "pip install -r requirements.txt --upgrade",
173
+ "category": "python",
174
+ "difficulty": "clean"
175
+ },
176
+ {
177
+ "dictated": "grep space dash r space dash n space dash i space quote all caps TODO quote space dot slash src slash",
178
+ "expected": "grep -r -n -i \"TODO\" ./src/",
179
+ "category": "grep",
180
+ "difficulty": "clean"
181
+ },
182
+ {
183
+ "dictated": "tar space dash x z f space archive dot tar dot gz space dash capital C space slash opt slash app",
184
+ "expected": "tar -xzf archive.tar.gz -C /opt/app",
185
+ "category": "filesystem",
186
+ "difficulty": "clean"
187
+ },
188
+ {
189
+ "dictated": "find space dot space dash name space quote star dot log quote space dash mtime space plus seven space dash delete",
190
+ "expected": "find . -name \"*.log\" -mtime +7 -delete",
191
+ "category": "filesystem",
192
+ "difficulty": "clean"
193
+ },
194
+ {
195
+ "dictated": "echo space dollar all caps HOME slash dot config slash app dot yml",
196
+ "expected": "echo $HOME/.config/app.yml",
197
+ "category": "shell",
198
+ "difficulty": "clean"
199
+ },
200
+ {
201
+ "dictated": "sed space dash i space quote s slash old dash text slash new dash text slash g quote space config dot yaml",
202
+ "expected": "sed -i \"s/old-text/new-text/g\" config.yaml",
203
+ "category": "sed",
204
+ "difficulty": "clean"
205
+ },
206
+ {
207
+ "dictated": "awk space quote open brace print space dollar two close brace quote space data dot tsv",
208
+ "expected": "awk \"{print $2}\" data.tsv",
209
+ "category": "awk",
210
+ "difficulty": "clean"
211
+ },
212
+ {
213
+ "dictated": "xcodebuild space dash workspace space capital Talkie dot xcworkspace space dash scheme space capital Talkie space dash configuration space capital Release",
214
+ "expected": "xcodebuild -workspace Talkie.xcworkspace -scheme Talkie -configuration Release",
215
+ "category": "xcode",
216
+ "difficulty": "clean"
217
+ },
218
+ {
219
+ "dictated": "git space remote space add space upstream space https colon slash slash github dot com slash owner slash repo dot git",
220
+ "expected": "git remote add upstream https://github.com/owner/repo.git",
221
+ "category": "git",
222
+ "difficulty": "clean"
223
+ },
224
+ {
225
+ "dictated": "scp space dash capital P space two two zero two space user at ten dot zero dot zero dot five colon slash tmp slash dump dot sql space dot slash",
226
+ "expected": "scp -P 2202 user@10.0.0.5:/tmp/dump.sql ./",
227
+ "category": "scp",
228
+ "difficulty": "clean"
229
+ },
230
+ {
231
+ "dictated": "cmake space dash capital B space build space dash capital D space all caps CMAKE underscore BUILD underscore TYPE equals capital Release",
232
+ "expected": "cmake -B build -D CMAKE_BUILD_TYPE=Release",
233
+ "category": "cmake",
234
+ "difficulty": "clean"
235
+ },
236
+ {
237
+ "dictated": "gradle space dash dash parallel space dash dash build dash cache space assembleRelease",
238
+ "expected": "gradle --parallel --build-cache assembleRelease",
239
+ "category": "gradle",
240
+ "difficulty": "clean"
241
+ },
242
+ {
243
+ "dictated": "convert space input dot png space dash resize space eight hundred x six hundred space dash quality space eighty five space output dot jpg",
244
+ "expected": "convert input.png -resize 800x600 -quality 85 output.jpg",
245
+ "category": "imagemagick",
246
+ "difficulty": "clean"
247
+ },
248
+ {
249
+ "dictated": "cat space slash etc slash hosts space pipe space grep space one nine two dot one six eight",
250
+ "expected": "cat /etc/hosts | grep 192.168",
251
+ "category": "shell",
252
+ "difficulty": "clean"
253
+ },
254
+ {
255
+ "dictated": "git space stash space push space dash m space quote wip colon space refactor auth quote",
256
+ "expected": "git stash push -m \"wip: refactor auth\"",
257
+ "category": "git",
258
+ "difficulty": "clean"
259
+ },
260
+ {
261
+ "dictated": "kubectl space apply space dash f space dash space less than space manifest dot yaml",
262
+ "expected": "kubectl apply -f - < manifest.yaml",
263
+ "category": "kubectl",
264
+ "difficulty": "clean"
265
+ },
266
+ {
267
+ "dictated": "npx space create dash next dash app at latest space my dash app space dash dash typescript space dash dash tailwind space dash dash eslint",
268
+ "expected": "npx create-next-app@latest my-app --typescript --tailwind --eslint",
269
+ "category": "npm",
270
+ "difficulty": "clean"
271
+ },
272
+ {
273
+ "dictated": "docker space build space dash t space my dash app colon v one dot two dot three space dash f space capital Dockerfile dot prod space dot",
274
+ "expected": "docker build -t my-app:v1.2.3 -f Dockerfile.prod .",
275
+ "category": "docker",
276
+ "difficulty": "clean"
277
+ },
278
+ {
279
+ "dictated": "git space rebase space dash dash onto space main space feature dash base space feature dash work",
280
+ "expected": "git rebase --onto main feature-base feature-work",
281
+ "category": "git",
282
+ "difficulty": "clean"
283
+ },
284
+ {
285
+ "dictated": "env space all caps NODE underscore ENV equals production space node space server dot js",
286
+ "expected": "env NODE_ENV=production node server.js",
287
+ "category": "env",
288
+ "difficulty": "clean"
289
+ },
290
+ {
291
+ "dictated": "aws space ec two space describe dash instances space dash dash filters space quote capital Name equals instance dash state dash name comma capital Values equals running quote space dash dash query space quote capital Reservations open bracket close bracket dot capital Instances open bracket close bracket dot capital Instance capital Id quote",
292
+ "expected": "aws ec2 describe-instances --filters \"Name=instance-state-name,Values=running\" --query \"Reservations[].Instances[].InstanceId\"",
293
+ "category": "aws",
294
+ "difficulty": "clean"
295
+ },
296
+ {
297
+ "dictated": "wc space dash l space less than space data dot csv space pipe space xargs space echo space quote total rows colon quote",
298
+ "expected": "wc -l < data.csv | xargs echo \"total rows:\"",
299
+ "category": "shell",
300
+ "difficulty": "clean"
301
+ },
302
+ {
303
+ "dictated": "git commit minus m quote fix login bug quote",
304
+ "expected": "git commit -m \"fix login bug\"",
305
+ "category": "git",
306
+ "difficulty": "fuzzy"
307
+ },
308
+ {
309
+ "dictated": "ls minus l minus a slash var slash log",
310
+ "expected": "ls -l -a /var/log",
311
+ "category": "filesystem",
312
+ "difficulty": "fuzzy"
313
+ },
314
+ {
315
+ "dictated": "cat file period txt",
316
+ "expected": "cat file.txt",
317
+ "category": "filesystem",
318
+ "difficulty": "fuzzy"
319
+ },
320
+ {
321
+ "dictated": "cd forward slash usr forward slash local forward slash bin",
322
+ "expected": "cd /usr/local/bin",
323
+ "category": "filesystem",
324
+ "difficulty": "fuzzy"
325
+ },
326
+ {
327
+ "dictated": "python server period py double dash port eight thousand",
328
+ "expected": "python server.py --port 8000",
329
+ "category": "python",
330
+ "difficulty": "fuzzy"
331
+ },
332
+ {
333
+ "dictated": "git push hyphen u origin main",
334
+ "expected": "git push -u origin main",
335
+ "category": "git",
336
+ "difficulty": "fuzzy"
337
+ },
338
+ {
339
+ "dictated": "grep asterisk period log forward slash var forward slash log forward slash",
340
+ "expected": "grep *.log /var/log/",
341
+ "category": "grep",
342
+ "difficulty": "fuzzy"
343
+ },
344
+ {
345
+ "dictated": "npm install hyphen hyphen save dev eslint",
346
+ "expected": "npm install --save-dev eslint",
347
+ "category": "npm",
348
+ "difficulty": "fuzzy"
349
+ },
350
+ {
351
+ "dictated": "echo hashtag this is a comment",
352
+ "expected": "echo # this is a comment",
353
+ "category": "shell",
354
+ "difficulty": "fuzzy"
355
+ },
356
+ {
357
+ "dictated": "docker run minus minus rm minus it ubuntu",
358
+ "expected": "docker run --rm -it ubuntu",
359
+ "category": "docker",
360
+ "difficulty": "fuzzy"
361
+ },
362
+ {
363
+ "dictated": "curl minus capital X capital POST minus capital H content hyphen type colon application forward slash json",
364
+ "expected": "curl -X POST -H Content-Type: application/json",
365
+ "category": "curl",
366
+ "difficulty": "fuzzy"
367
+ },
368
+ {
369
+ "dictated": "ssh minus i tilde forward slash period ssh forward slash key period pem user at sign server",
370
+ "expected": "ssh -i ~/.ssh/key.pem user@server",
371
+ "category": "ssh",
372
+ "difficulty": "fuzzy"
373
+ },
374
+ {
375
+ "dictated": "git diff double dash staged",
376
+ "expected": "git diff --staged",
377
+ "category": "git",
378
+ "difficulty": "fuzzy"
379
+ },
380
+ {
381
+ "dictated": "tar minus xzf backup period tar period gz",
382
+ "expected": "tar -xzf backup.tar.gz",
383
+ "category": "filesystem",
384
+ "difficulty": "fuzzy"
385
+ },
386
+ {
387
+ "dictated": "find period minus name asterisk period py minus type f",
388
+ "expected": "find . -name *.py -type f",
389
+ "category": "filesystem",
390
+ "difficulty": "fuzzy"
391
+ },
392
+ {
393
+ "dictated": "export capital NODE underscore capital ENV equals sign production",
394
+ "expected": "export NODE_ENV=production",
395
+ "category": "env",
396
+ "difficulty": "fuzzy"
397
+ },
398
+ {
399
+ "dictated": "kubectl get pods minus n default double dash output json",
400
+ "expected": "kubectl get pods -n default --output json",
401
+ "category": "kubectl",
402
+ "difficulty": "fuzzy"
403
+ },
404
+ {
405
+ "dictated": "chmod seven five five script period sh",
406
+ "expected": "chmod 755 script.sh",
407
+ "category": "filesystem",
408
+ "difficulty": "fuzzy"
409
+ },
410
+ {
411
+ "dictated": "pip install flask equals sign equals sign two period zero period zero",
412
+ "expected": "pip install flask==2.0.0",
413
+ "category": "python",
414
+ "difficulty": "fuzzy"
415
+ },
416
+ {
417
+ "dictated": "git checkout minus b feature forward slash auth",
418
+ "expected": "git checkout -b feature/auth",
419
+ "category": "git",
420
+ "difficulty": "fuzzy"
421
+ },
422
+ {
423
+ "dictated": "docker compose up minus d double dash build",
424
+ "expected": "docker compose up -d --build",
425
+ "category": "docker",
426
+ "difficulty": "fuzzy"
427
+ },
428
+ {
429
+ "dictated": "psql minus capital U postgres minus d mydb minus c quote select asterisk from users quote",
430
+ "expected": "psql -U postgres -d mydb -c \"select * from users\"",
431
+ "category": "psql",
432
+ "difficulty": "fuzzy"
433
+ },
434
+ {
435
+ "dictated": "scp file period txt user at sign one ninety two period one sixty eight period one period one hundred colon tilde forward slash",
436
+ "expected": "scp file.txt user@192.168.1.100:~/",
437
+ "category": "scp",
438
+ "difficulty": "fuzzy"
439
+ },
440
+ {
441
+ "dictated": "git log double dash oneline minus n ten",
442
+ "expected": "git log --oneline -n 10",
443
+ "category": "git",
444
+ "difficulty": "fuzzy"
445
+ },
446
+ {
447
+ "dictated": "brew install double dash cask firefox",
448
+ "expected": "brew install --cask firefox",
449
+ "category": "brew",
450
+ "difficulty": "fuzzy"
451
+ },
452
+ {
453
+ "dictated": "rsync minus avz period forward slash source forward slash user at sign host colon forward slash dest forward slash",
454
+ "expected": "rsync -avz ./source/ user@host:/dest/",
455
+ "category": "rsync",
456
+ "difficulty": "fuzzy"
457
+ },
458
+ {
459
+ "dictated": "aws s3 sync period forward slash build forward slash s3 colon forward slash forward slash my hyphen bucket forward slash static",
460
+ "expected": "aws s3 sync ./build/ s3://my-bucket/static",
461
+ "category": "aws",
462
+ "difficulty": "fuzzy"
463
+ },
464
+ {
465
+ "dictated": "camelcase handle form submit",
466
+ "expected": "handleFormSubmit",
467
+ "category": "casing",
468
+ "difficulty": "fuzzy"
469
+ },
470
+ {
471
+ "dictated": "snake_case max retry count",
472
+ "expected": "max_retry_count",
473
+ "category": "casing",
474
+ "difficulty": "fuzzy"
475
+ },
476
+ {
477
+ "dictated": "git stash pop at sign open brace zero close brace",
478
+ "expected": "git stash pop @{0}",
479
+ "category": "git",
480
+ "difficulty": "fuzzy"
481
+ },
482
+ {
483
+ "dictated": "backslash n backslash t hello world",
484
+ "expected": "\\n\\t hello world",
485
+ "category": "escape",
486
+ "difficulty": "fuzzy"
487
+ },
488
+ {
489
+ "dictated": "echo dollar sign open parenthesis date close parenthesis",
490
+ "expected": "echo $(date)",
491
+ "category": "shell",
492
+ "difficulty": "fuzzy"
493
+ },
494
+ {
495
+ "dictated": "ffmpeg minus i input period mov minus codec copy output period mp4",
496
+ "expected": "ffmpeg -i input.mov -codec copy output.mp4",
497
+ "category": "ffmpeg",
498
+ "difficulty": "fuzzy"
499
+ },
500
+ {
501
+ "dictated": "terraform apply minus auto hyphen approve",
502
+ "expected": "terraform apply -auto-approve",
503
+ "category": "terraform",
504
+ "difficulty": "fuzzy"
505
+ },
506
+ {
507
+ "dictated": "go test period forward slash period period period minus v minus race",
508
+ "expected": "go test ./... -v -race",
509
+ "category": "go",
510
+ "difficulty": "fuzzy"
511
+ },
512
+ {
513
+ "dictated": "cargo run double dash double dash release",
514
+ "expected": "cargo run -- --release",
515
+ "category": "rust",
516
+ "difficulty": "fuzzy"
517
+ },
518
+ {
519
+ "dictated": "git rebase minus minus interactive capital HEAD tilde three",
520
+ "expected": "git rebase --interactive HEAD~3",
521
+ "category": "git",
522
+ "difficulty": "fuzzy"
523
+ },
524
+ {
525
+ "dictated": "nginx minus t ampersand ampersand systemctl reload nginx",
526
+ "expected": "nginx -t && systemctl reload nginx",
527
+ "category": "systemctl",
528
+ "difficulty": "fuzzy"
529
+ },
530
+ {
531
+ "dictated": "sed minus i quote s forward slash http forward slash https forward slash g quote config period yml",
532
+ "expected": "sed -i \"s/http/https/g\" config.yml",
533
+ "category": "sed",
534
+ "difficulty": "fuzzy"
535
+ },
536
+ {
537
+ "dictated": "curl minus s minus o forward slash dev forward slash null minus w quote percent open brace http underscore code close brace quote http colon forward slash forward slash localhost colon three thousand",
538
+ "expected": "curl -s -o /dev/null -w \"%{http_code}\" http://localhost:3000",
539
+ "category": "curl",
540
+ "difficulty": "fuzzy"
541
+ },
542
+ {
543
+ "dictated": "docker exec minus it container underscore name bash",
544
+ "expected": "docker exec -it container_name bash",
545
+ "category": "docker",
546
+ "difficulty": "fuzzy"
547
+ },
548
+ {
549
+ "dictated": "openssl s underscore client minus connect example period com colon four four three",
550
+ "expected": "openssl s_client -connect example.com:443",
551
+ "category": "openssl",
552
+ "difficulty": "fuzzy"
553
+ },
554
+ {
555
+ "dictated": "all-caps MAX underscore RETRIES equals sign five",
556
+ "expected": "MAX_RETRIES=5",
557
+ "category": "env",
558
+ "difficulty": "fuzzy"
559
+ },
560
+ {
561
+ "dictated": "pascal-case user profile view model",
562
+ "expected": "UserProfileViewModel",
563
+ "category": "casing",
564
+ "difficulty": "fuzzy"
565
+ },
566
+ {
567
+ "dictated": "kebab-case my react component",
568
+ "expected": "my-react-component",
569
+ "category": "casing",
570
+ "difficulty": "fuzzy"
571
+ },
572
+ {
573
+ "dictated": "redis minus cli set my colon key quote hello world quote",
574
+ "expected": "redis-cli set my:key \"hello world\"",
575
+ "category": "redis",
576
+ "difficulty": "fuzzy"
577
+ },
578
+ {
579
+ "dictated": "jq period data open bracket close bracket period name input period json",
580
+ "expected": "jq .data[].name input.json",
581
+ "category": "shell",
582
+ "difficulty": "fuzzy"
583
+ },
584
+ {
585
+ "dictated": "okay so the command is git space push space dash u space origin space main",
586
+ "expected": "git push -u origin main",
587
+ "category": "git",
588
+ "difficulty": "natural"
589
+ },
590
+ {
591
+ "dictated": "I wanna set the variable name to camel case get user profile",
592
+ "expected": "getUserProfile",
593
+ "category": "casing",
594
+ "difficulty": "natural"
595
+ },
596
+ {
597
+ "dictated": "change that to snake case api response handler",
598
+ "expected": "api_response_handler",
599
+ "category": "casing",
600
+ "difficulty": "natural"
601
+ },
602
+ {
603
+ "dictated": "the path should be slash usr slash local slash bin",
604
+ "expected": "/usr/local/bin",
605
+ "category": "filesystem",
606
+ "difficulty": "natural"
607
+ },
608
+ {
609
+ "dictated": "make it all caps DATABASE underscore URL",
610
+ "expected": "DATABASE_URL",
611
+ "category": "casing",
612
+ "difficulty": "natural"
613
+ },
614
+ {
615
+ "dictated": "can you type out docker space run space dash dash rm space nginx",
616
+ "expected": "docker run --rm nginx",
617
+ "category": "docker",
618
+ "difficulty": "natural"
619
+ },
620
+ {
621
+ "dictated": "so like the function name would be camel case handle click event",
622
+ "expected": "handleClickEvent",
623
+ "category": "casing",
624
+ "difficulty": "natural"
625
+ },
626
+ {
627
+ "dictated": "let's do npm space install space dash capital D space typescript",
628
+ "expected": "npm install -D typescript",
629
+ "category": "npm",
630
+ "difficulty": "natural"
631
+ },
632
+ {
633
+ "dictated": "um the flag is dash dash verbose",
634
+ "expected": "--verbose",
635
+ "category": "flags",
636
+ "difficulty": "natural"
637
+ },
638
+ {
639
+ "dictated": "and then pipe it to grep space dash i space error",
640
+ "expected": "| grep -i error",
641
+ "category": "shell",
642
+ "difficulty": "natural"
643
+ },
644
+ {
645
+ "dictated": "basically run git space fetch space dash dash all space ampersand ampersand space git space pull",
646
+ "expected": "git fetch --all && git pull",
647
+ "category": "git",
648
+ "difficulty": "natural"
649
+ },
650
+ {
651
+ "dictated": "I think we need kubectl space scale space deployment slash api space dash dash replicas equals three",
652
+ "expected": "kubectl scale deployment/api --replicas=3",
653
+ "category": "kubectl",
654
+ "difficulty": "natural"
655
+ },
656
+ {
657
+ "dictated": "right so set it to snake case max connection pool size",
658
+ "expected": "max_connection_pool_size",
659
+ "category": "casing",
660
+ "difficulty": "natural"
661
+ },
662
+ {
663
+ "dictated": "type out ssh space dash capital L space eight zero eight zero colon localhost colon five four three two space bastion",
664
+ "expected": "ssh -L 8080:localhost:5432 bastion",
665
+ "category": "ssh",
666
+ "difficulty": "natural"
667
+ },
668
+ {
669
+ "dictated": "so for the environment variable it's all caps AWS underscore SECRET underscore ACCESS underscore KEY",
670
+ "expected": "AWS_SECRET_ACCESS_KEY",
671
+ "category": "env",
672
+ "difficulty": "natural"
673
+ },
674
+ {
675
+ "dictated": "the terraform command should be terraform space init space dash backend dash config equals quote key equals prod slash terraform dot tfstate quote",
676
+ "expected": "terraform init -backend-config=\"key=prod/terraform.tfstate\"",
677
+ "category": "terraform",
678
+ "difficulty": "natural"
679
+ },
680
+ {
681
+ "dictated": "okay let me type the curl command so it's curl space dash s space dash capital H space quote capital Authorization colon space capital Bearer space dollar all caps TOKEN quote space https colon slash slash api dot example dot com slash me",
682
+ "expected": "curl -s -H \"Authorization: Bearer $TOKEN\" https://api.example.com/me",
683
+ "category": "curl",
684
+ "difficulty": "natural"
685
+ },
686
+ {
687
+ "dictated": "I want to run docker space exec space dash it space postgres underscore db space psql space dash capital U space admin",
688
+ "expected": "docker exec -it postgres_db psql -U admin",
689
+ "category": "docker",
690
+ "difficulty": "natural"
691
+ },
692
+ {
693
+ "dictated": "um so the function is called pascal case create payment intent",
694
+ "expected": "CreatePaymentIntent",
695
+ "category": "casing",
696
+ "difficulty": "natural"
697
+ },
698
+ {
699
+ "dictated": "let's see we need to do pip space install space dash e space dot open bracket dev close bracket",
700
+ "expected": "pip install -e .[dev]",
701
+ "category": "python",
702
+ "difficulty": "natural"
703
+ },
704
+ {
705
+ "dictated": "the redis command I want is redis dash cli space dash dash scan space dash dash match space quote session colon star quote",
706
+ "expected": "redis-cli --scan --match \"session:*\"",
707
+ "category": "redis",
708
+ "difficulty": "natural"
709
+ },
710
+ {
711
+ "dictated": "so basically what we wanna run is git space cherry dash pick space dash dash no dash commit space abc one two three four",
712
+ "expected": "git cherry-pick --no-commit abc1234",
713
+ "category": "git",
714
+ "difficulty": "natural"
715
+ },
716
+ {
717
+ "dictated": "and the port number should be colon three thousand",
718
+ "expected": ":3000",
719
+ "category": "network",
720
+ "difficulty": "natural"
721
+ },
722
+ {
723
+ "dictated": "alright change the class name to pascal case authenticated user session",
724
+ "expected": "AuthenticatedUserSession",
725
+ "category": "casing",
726
+ "difficulty": "natural"
727
+ },
728
+ {
729
+ "dictated": "I need the output piped to jq space dot open bracket close bracket dot name",
730
+ "expected": "| jq .[].name",
731
+ "category": "shell",
732
+ "difficulty": "natural"
733
+ },
734
+ {
735
+ "dictated": "set the variable to dollar open brace all caps HOME close brace slash dot config slash app dot toml",
736
+ "expected": "${HOME}/.config/app.toml",
737
+ "category": "env",
738
+ "difficulty": "natural"
739
+ },
740
+ {
741
+ "dictated": "like the regex pattern is caret open bracket a dash z A dash Z close bracket plus dollar",
742
+ "expected": "^[a-zA-Z]+$",
743
+ "category": "regex",
744
+ "difficulty": "natural"
745
+ },
746
+ {
747
+ "dictated": "for the flag use dash dash dry dash run please",
748
+ "expected": "--dry-run",
749
+ "category": "flags",
750
+ "difficulty": "natural"
751
+ },
752
+ {
753
+ "dictated": "so the full path is tilde slash capital Library slash capital Application space capital Support slash capital Talkie slash talkie dot sqlite",
754
+ "expected": "~/Library/Application Support/Talkie/talkie.sqlite",
755
+ "category": "filesystem",
756
+ "difficulty": "natural"
757
+ },
758
+ {
759
+ "dictated": "go ahead and type git space reset space dash dash soft space capital HEAD tilde one",
760
+ "expected": "git reset --soft HEAD~1",
761
+ "category": "git",
762
+ "difficulty": "natural"
763
+ },
764
+ {
765
+ "dictated": "let me think um yeah the image tag is my dash registry dot io slash api colon v two dot one dash rc one",
766
+ "expected": "my-registry.io/api:v2.1-rc1",
767
+ "category": "docker",
768
+ "difficulty": "natural"
769
+ },
770
+ {
771
+ "dictated": "make it cargo space test space dash dash lib space dash dash space dash dash test dash threads space one",
772
+ "expected": "cargo test --lib -- --test-threads 1",
773
+ "category": "rust",
774
+ "difficulty": "natural"
775
+ },
776
+ {
777
+ "dictated": "we should set the cron to star space star slash two space star space star space star",
778
+ "expected": "* */2 * * *",
779
+ "category": "cron",
780
+ "difficulty": "natural"
781
+ },
782
+ {
783
+ "dictated": "the nginx location block should match tilde space caret slash api slash v open bracket zero dash nine close bracket plus",
784
+ "expected": "~ ^/api/v[0-9]+",
785
+ "category": "regex",
786
+ "difficulty": "natural"
787
+ },
788
+ {
789
+ "dictated": "and at the end redirect with two greater than space slash dev slash null",
790
+ "expected": "2> /dev/null",
791
+ "category": "shell",
792
+ "difficulty": "natural"
793
+ },
794
+ {
795
+ "dictated": "I need to add the alias um alias space ll equals quote ls space dash la quote to my bashrc",
796
+ "expected": "alias ll=\"ls -la\"",
797
+ "category": "shell",
798
+ "difficulty": "natural"
799
+ },
800
+ {
801
+ "dictated": "run it with like xargs space dash capital I space open brace close brace space cp space open brace close brace space backup slash",
802
+ "expected": "xargs -I {} cp {} backup/",
803
+ "category": "shell",
804
+ "difficulty": "natural"
805
+ },
806
+ {
807
+ "dictated": "the go struct tag should be json colon quote camel case user name comma omitempty quote",
808
+ "expected": "json:\"userName,omitempty\"",
809
+ "category": "go",
810
+ "difficulty": "natural"
811
+ },
812
+ {
813
+ "dictated": "so like make the webpack config output to dist slash open bracket name close bracket dot open bracket contenthash colon eight close bracket dot js",
814
+ "expected": "dist/[name].[contenthash:8].js",
815
+ "category": "webpack",
816
+ "difficulty": "natural"
817
+ },
818
+ {
819
+ "dictated": "we want to write iptables minus capital A capital INPUT minus p tcp double dash dport four four three minus j all caps ACCEPT",
820
+ "expected": "iptables -A INPUT -p tcp --dport 443 -j ACCEPT",
821
+ "category": "network",
822
+ "difficulty": "natural"
823
+ },
824
+ {
825
+ "dictated": "dash dash no wait just dash v",
826
+ "expected": "-v",
827
+ "category": "flags",
828
+ "difficulty": "chaotic"
829
+ },
830
+ {
831
+ "dictated": "the API endpoint is slash api slash v two slash users slash colon id",
832
+ "expected": "/api/v2/users/:id",
833
+ "category": "url",
834
+ "difficulty": "chaotic"
835
+ },
836
+ {
837
+ "dictated": "so we need to... actually let's just do git stash",
838
+ "expected": "git stash",
839
+ "category": "git",
840
+ "difficulty": "chaotic"
841
+ },
842
+ {
843
+ "dictated": "type ssh at sign root at the server at one ninety two dot one sixty eight dot one dot fifty",
844
+ "expected": "ssh root@192.168.1.50",
845
+ "category": "ssh",
846
+ "difficulty": "chaotic"
847
+ },
848
+ {
849
+ "dictated": "run it on port three thousand",
850
+ "expected": "3000",
851
+ "category": "network",
852
+ "difficulty": "chaotic"
853
+ },
854
+ {
855
+ "dictated": "camel case is authenticated",
856
+ "expected": "isAuthenticated",
857
+ "category": "casing",
858
+ "difficulty": "chaotic"
859
+ },
860
+ {
861
+ "dictated": "just the flag dash dash dry dash run",
862
+ "expected": "--dry-run",
863
+ "category": "flags",
864
+ "difficulty": "chaotic"
865
+ },
866
+ {
867
+ "dictated": "open curly brace newline tab quote name quote colon quote capital John quote comma newline tab quote age quote colon twenty five newline close curly brace",
868
+ "expected": "{\n\t\"name\": \"John\",\n\t\"age\": 25\n}",
869
+ "category": "json",
870
+ "difficulty": "chaotic"
871
+ },
872
+ {
873
+ "dictated": "wait no not dash dash force I meant dash dash force dash with dash lease",
874
+ "expected": "--force-with-lease",
875
+ "category": "git",
876
+ "difficulty": "chaotic"
877
+ },
878
+ {
879
+ "dictated": "the connection string is postgres colon slash slash wait what was the password oh right admin colon p at sign s s w zero r d at localhost colon five four three two slash production underscore db",
880
+ "expected": "postgres://admin:p@ssw0rd@localhost:5432/production_db",
881
+ "category": "database",
882
+ "difficulty": "chaotic"
883
+ },
884
+ {
885
+ "dictated": "kubectl get pods no actually I want kubectl get deployments dash o wide",
886
+ "expected": "kubectl get deployments -o wide",
887
+ "category": "kubectl",
888
+ "difficulty": "chaotic"
889
+ },
890
+ {
891
+ "dictated": "um okay so like the variable name should be um camel case handle submit and then no wait pascal case handle submit because it's a component",
892
+ "expected": "HandleSubmit",
893
+ "category": "casing",
894
+ "difficulty": "chaotic"
895
+ },
896
+ {
897
+ "dictated": "three files in the directory",
898
+ "expected": "three files in the directory",
899
+ "category": "ambiguous",
900
+ "difficulty": "chaotic"
901
+ },
902
+ {
903
+ "dictated": "port eighty four forty three I mean port eight four four three",
904
+ "expected": "8443",
905
+ "category": "network",
906
+ "difficulty": "chaotic"
907
+ },
908
+ {
909
+ "dictated": "the address is like http no https colon slash slash api dot production dot our company dot com slash v three slash webhook",
910
+ "expected": "https://api.production.our-company.com/v3/webhook",
911
+ "category": "url",
912
+ "difficulty": "chaotic"
913
+ },
914
+ {
915
+ "dictated": "so I need to pipe it through like four commands cat the file then grep for errors then sort then unique with count so cat space log dot txt space pipe space grep space error space pipe space sort space pipe space uniq space dash c",
916
+ "expected": "cat log.txt | grep error | sort | uniq -c",
917
+ "category": "shell",
918
+ "difficulty": "chaotic"
919
+ },
920
+ {
921
+ "dictated": "the docker tag is ghcr dot io slash my org slash my app colon sha dash wait how do you say git sha... the sha prefix",
922
+ "expected": "ghcr.io/my-org/my-app:sha-",
923
+ "category": "docker",
924
+ "difficulty": "chaotic"
925
+ },
926
+ {
927
+ "dictated": "okay write open bracket open bracket colon minus s colon minus d close bracket close bracket double ampersand echo pass pipe pipe echo fail",
928
+ "expected": "[[ -s -d ]] && echo pass || echo fail",
929
+ "category": "shell",
930
+ "difficulty": "chaotic"
931
+ },
932
+ {
933
+ "dictated": "no no no go back the command was git diff HEAD tilde two dot dot HEAD that's HEAD tilde the number two then two dots then HEAD",
934
+ "expected": "git diff HEAD~2..HEAD",
935
+ "category": "git",
936
+ "difficulty": "chaotic"
937
+ },
938
+ {
939
+ "dictated": "I need an awk command to... you know what just do awk space quote open brace if dollar three greater than one hundred print dollar zero close brace quote space data dot csv",
940
+ "expected": "awk \"{if $3 > 100 print $0}\" data.csv",
941
+ "category": "awk",
942
+ "difficulty": "chaotic"
943
+ },
944
+ {
945
+ "dictated": "the method signature is func space camel case fetch user open paren underscore id colon capital String close paren space async space throws space dash greater than capital User",
946
+ "expected": "func fetchUser(_ id: String) async throws -> User",
947
+ "category": "swift",
948
+ "difficulty": "chaotic"
949
+ },
950
+ {
951
+ "dictated": "umm set the crontab to... it should run every fifteen minutes so star slash fifteen space star space star space star space star",
952
+ "expected": "*/15 * * * *",
953
+ "category": "cron",
954
+ "difficulty": "chaotic"
955
+ },
956
+ {
957
+ "dictated": "the kubernetes label selector is app equals my dash app comma version in open paren v one comma v two close paren",
958
+ "expected": "app=my-app,version in (v1,v2)",
959
+ "category": "kubectl",
960
+ "difficulty": "chaotic"
961
+ },
962
+ {
963
+ "dictated": "type the regex... hmm it's like caret open bracket A dash Z close bracket open bracket a dash z A dash Z zero dash nine close bracket star at sign open bracket a dash z close bracket plus backslash dot open bracket a dash z close bracket open brace two comma close brace dollar",
964
+ "expected": "^[A-Z][a-zA-Z0-9]*@[a-z]+\\.[a-z]{2,}$",
965
+ "category": "regex",
966
+ "difficulty": "chaotic"
967
+ },
968
+ {
969
+ "dictated": "git rebase onto main the commit from tuesday I think it was like a b c one two three four five six seven",
970
+ "expected": "git rebase --onto main abc1234567",
971
+ "category": "git",
972
+ "difficulty": "chaotic"
973
+ },
974
+ {
975
+ "dictated": "webpack serve open paren or is it webpack dash dev dash server I always forget close paren with dash dash hot and dash dash port nine thousand",
976
+ "expected": "webpack serve --hot --port 9000",
977
+ "category": "webpack",
978
+ "difficulty": "chaotic"
979
+ },
980
+ {
981
+ "dictated": "the nginx config upstream block upstream space my underscore backend space open curly brace newline space space server space one two seven dot zero dot zero dot one colon three thousand weight equals five semicolon newline space space server space one two seven dot zero dot zero dot one colon three thousand one weight equals three semicolon newline close curly brace",
982
+ "expected": "upstream my_backend {\n  server 127.0.0.1:3000 weight=5;\n  server 127.0.0.1:3001 weight=3;\n}",
983
+ "category": "nginx",
984
+ "difficulty": "chaotic"
985
+ },
986
+ {
987
+ "dictated": "the TypeScript type is capital Record less than string comma Array less than open curly brace id colon number semicolon name colon string close curly brace greater than greater than",
988
+ "expected": "Record<string, Array<{id: number; name: string}>>",
989
+ "category": "typescript",
990
+ "difficulty": "chaotic"
991
+ },
992
+ {
993
+ "dictated": "wait I need to escape the dollar signs in the dockerfile so it's backslash dollar open paren cat slash run slash secrets slash db underscore password close paren",
994
+ "expected": "\\$(cat /run/secrets/db_password)",
995
+ "category": "docker",
996
+ "difficulty": "chaotic"
997
+ },
998
+ {
999
+ "dictated": "do the curl but like with retries so curl space dash dash retry space three space dash dash retry dash delay space two space then the url",
1000
+ "expected": "curl --retry 3 --retry-delay 2",
1001
+ "category": "curl",
1002
+ "difficulty": "chaotic"
1003
+ },
1004
+ {
1005
+ "dictated": "the swift property wrapper at capital Published var camel case selected tab colon capital Tab equals dot home",
1006
+ "expected": "@Published var selectedTab: Tab = .home",
1007
+ "category": "swift",
1008
+ "difficulty": "chaotic"
1009
+ },
1010
+ {
1011
+ "dictated": "ssh minus capital J jump dash host user at final dash host so that's proxy jumping through the bastion",
1012
+ "expected": "ssh -J jump-host user@final-host",
1013
+ "category": "ssh",
1014
+ "difficulty": "chaotic"
1015
+ },
1016
+ {
1017
+ "dictated": "actually scratch what I said before just do a simple ls minus la slash tmp",
1018
+ "expected": "ls -la /tmp",
1019
+ "category": "filesystem",
1020
+ "difficulty": "chaotic"
1021
+ },
1022
+ {
1023
+ "dictated": "the helm values should be set dash dash set image dot tag equals v one dot four dot two dash rc one and set dash dash set replicas equals three",
1024
+ "expected": "--set image.tag=v1.4.2-rc1 --set replicas=3",
1025
+ "category": "helm",
1026
+ "difficulty": "chaotic"
1027
+ },
1028
+ {
1029
+ "dictated": "open bracket dollar open paren date plus percent capital Y minus percent m minus percent d close paren close bracket underscore backup dot sql",
1030
+ "expected": "[$(date +%Y-%m-%d)]_backup.sql",
1031
+ "category": "shell",
1032
+ "difficulty": "chaotic"
1033
+ },
1034
+ {
1035
+ "dictated": "the go generics syntax is func capital Map open bracket capital T any comma capital U any close bracket open paren slice open bracket close bracket capital T comma fn func open paren capital T close paren capital U close paren open bracket close bracket capital U",
1036
+ "expected": "func Map[T any, U any](slice []T, fn func(T) U) []U",
1037
+ "category": "go",
1038
+ "difficulty": "chaotic"
1039
+ },
1040
+ {
1041
+ "dictated": "I want to write the SQL query but like dynamically so select star from users where created underscore at greater than dollar one and status equals quote active quote order by id limit dollar two",
1042
+ "expected": "select * from users where created_at > $1 and status = 'active' order by id limit $2",
1043
+ "category": "sql",
1044
+ "difficulty": "chaotic"
1045
+ },
1046
+ {
1047
+ "dictated": "the whole pipeline is cat access dot log pipe grep five hundred pipe awk open brace print dollar one close brace pipe sort pipe uniq minus c pipe sort minus r n pipe head minus five",
1048
+ "expected": "cat access.log | grep 500 | awk {print $1} | sort | uniq -c | sort -rn | head -5",
1049
+ "category": "shell",
1050
+ "difficulty": "chaotic"
1051
+ },
1052
+ {
1053
+ "dictated": "for the terraform block it's resource quote aws underscore lambda underscore function quote quote my underscore function quote",
1054
+ "expected": "resource \"aws_lambda_function\" \"my_function\"",
1055
+ "category": "terraform",
1056
+ "difficulty": "chaotic"
1057
+ },
1058
+ {
1059
+ "dictated": "I keep getting the args wrong okay the ffmpeg command is ffmpeg minus i concat colon file one dot ts pipe file two dot ts minus c copy output dot mp four",
1060
+ "expected": "ffmpeg -i concat:file1.ts|file2.ts -c copy output.mp4",
1061
+ "category": "ffmpeg",
1062
+ "difficulty": "chaotic"
1063
+ },
1064
+ {
1065
+ "dictated": "whoops I meant to say camel case not pascal case so it should be camel case on authentication state changed",
1066
+ "expected": "onAuthenticationStateChanged",
1067
+ "category": "casing",
1068
+ "difficulty": "chaotic"
1069
+ },
1070
+ {
1071
+ "dictated": "paste in the IP ten dot two forty four dot one thirty seven dot ninety two",
1072
+ "expected": "10.244.137.92",
1073
+ "category": "network",
1074
+ "difficulty": "chaotic"
1075
+ },
1076
+ {
1077
+ "dictated": "the package json script is quote dev quote colon quote concurrently backslash quote next dev backslash quote backslash quote tailwind minus minus watch backslash quote quote",
1078
+ "expected": "\"dev\": \"concurrently \\\"next dev\\\" \\\"tailwind --watch\\\"\"",
1079
+ "category": "json",
1080
+ "difficulty": "chaotic"
1081
+ },
1082
+ {
1083
+ "dictated": "should I use single or double quotes whatever just do git commit minus m quote feat colon add user authentication quote",
1084
+ "expected": "git commit -m \"feat: add user authentication\"",
1085
+ "category": "git",
1086
+ "difficulty": "chaotic"
1087
+ },
1088
+ {
1089
+ "dictated": "the makefile target is dot capital PHONY colon test newline test colon newline tab go test dot slash dot dot dot minus v minus count equals one",
1090
+ "expected": ".PHONY: test\ntest:\n\tgo test ./... -v -count=1",
1091
+ "category": "make",
1092
+ "difficulty": "chaotic"
1093
+ },
1094
+ {
1095
+ "dictated": "okay the whole thing end to end is curl minus s https colon slash slash api dot github dot com slash repos slash facebook slash react slash releases slash latest pipe jq minus r dot tag underscore name",
1096
+ "expected": "curl -s https://api.github.com/repos/facebook/react/releases/latest | jq -r .tag_name",
1097
+ "category": "curl",
1098
+ "difficulty": "chaotic"
1099
+ },
1100
+ {
1101
+ "dictated": "I need it to match like the group syntax so open paren question mark less than name greater than open bracket a dash z close bracket plus close paren at open bracket zero dash nine close bracket plus",
1102
+ "expected": "(?<name>[a-z]+)@[0-9]+",
1103
+ "category": "regex",
1104
+ "difficulty": "chaotic"
1105
+ },
1106
+ {
1107
+ "dictated": "my dot env file has like the whole nine yards so all caps REDIS underscore URL equals redis colon slash slash colon my password at redis dash primary dot internal colon six three seven nine slash zero",
1108
+ "expected": "REDIS_URL=redis://:mypassword@redis-primary.internal:6379/0",
1109
+ "category": "env",
1110
+ "difficulty": "chaotic"
1111
+ },
1112
+ {
1113
+ "dictated": "the github action step is uses colon actions slash checkout at v four with colon newline space space fetch dash depth colon zero",
1114
+ "expected": "uses: actions/checkout@v4\nwith:\n  fetch-depth: 0",
1115
+ "category": "yaml",
1116
+ "difficulty": "chaotic"
1117
+ },
1118
+ {
1119
+ "dictated": "git merge minus minus no minus ff feature forward slash payments",
1120
+ "expected": "git merge --no-ff feature/payments",
1121
+ "category": "git",
1122
+ "difficulty": "fuzzy"
1123
+ },
1124
+ {
1125
+ "dictated": "mkdir minus p src forward slash components forward slash auth",
1126
+ "expected": "mkdir -p src/components/auth",
1127
+ "category": "filesystem",
1128
+ "difficulty": "fuzzy"
1129
+ },
1130
+ {
1131
+ "dictated": "lsof minus i colon eight zero eight zero",
1132
+ "expected": "lsof -i :8080",
1133
+ "category": "network",
1134
+ "difficulty": "fuzzy"
1135
+ },
1136
+ {
1137
+ "dictated": "so for this one we want to type out camel case validate email address",
1138
+ "expected": "validateEmailAddress",
1139
+ "category": "casing",
1140
+ "difficulty": "natural"
1141
+ },
1142
+ {
1143
+ "dictated": "I want to set the environment variable to all caps NEXT underscore PUBLIC underscore API underscore URL equals https colon slash slash api dot useTalkie dot com",
1144
+ "expected": "NEXT_PUBLIC_API_URL=https://api.useTalkie.com",
1145
+ "category": "env",
1146
+ "difficulty": "natural"
1147
+ },
1148
+ {
1149
+ "dictated": "right so basically just add dash dash no dash cache to the docker build command",
1150
+ "expected": "--no-cache",
1151
+ "category": "docker",
1152
+ "difficulty": "natural"
1153
+ },
1154
+ {
1155
+ "dictated": "the file glob should be star star slash star dot test dot ts",
1156
+ "expected": "**/*.test.ts",
1157
+ "category": "filesystem",
1158
+ "difficulty": "natural"
1159
+ },
1160
+ {
1161
+ "dictated": "use the flag dash capital R for recursive copy",
1162
+ "expected": "-R",
1163
+ "category": "flags",
1164
+ "difficulty": "natural"
1165
+ },
1166
+ {
1167
+ "dictated": "can you type kubectl space logs space dash f space dash l space app equals web space dash c space nginx",
1168
+ "expected": "kubectl logs -f -l app=web -c nginx",
1169
+ "category": "kubectl",
1170
+ "difficulty": "natural"
1171
+ },
1172
+ {
1173
+ "dictated": "the version constraint is greater than equals two dot zero dot zero comma less than three dot zero dot zero",
1174
+ "expected": ">=2.0.0,<3.0.0",
1175
+ "category": "semver",
1176
+ "difficulty": "natural"
1177
+ },
1178
+ {
1179
+ "dictated": "so uh the test command would be pytest space dash v space dash dash cov equals src space tests slash",
1180
+ "expected": "pytest -v --cov=src tests/",
1181
+ "category": "python",
1182
+ "difficulty": "natural"
1183
+ },
1184
+ {
1185
+ "dictated": "just write out xattr space dash r space dash d space com dot apple dot quarantine space slash Applications slash capital My capital App dot app",
1186
+ "expected": "xattr -r -d com.apple.quarantine /Applications/MyApp.app",
1187
+ "category": "macos",
1188
+ "difficulty": "natural"
1189
+ },
1190
+ {
1191
+ "dictated": "oh and also add the flag dash dash prefer dash offline to the pnpm install",
1192
+ "expected": "--prefer-offline",
1193
+ "category": "npm",
1194
+ "difficulty": "natural"
1195
+ },
1196
+ {
1197
+ "dictated": "the whole thing is a mess but it should be something like docker run minus minus network equals host minus e all caps POSTGRES underscore PASSWORD equals not a real password minus d postgres colon sixteen dash alpine",
1198
+ "expected": "docker run --network=host -e POSTGRES_PASSWORD=notarealpassword -d postgres:16-alpine",
1199
+ "category": "docker",
1200
+ "difficulty": "chaotic"
1201
+ }
1202
+ ]
eval/independent.json ADDED
@@ -0,0 +1,102 @@
1
+ [
2
+ {"dictated": "git space push space dash u space origin space main", "expected": "git push -u origin main", "category": "git"},
3
+ {"dictated": "git space commit space dash m space quote fix space login space bug quote", "expected": "git commit -m \"fix login bug\"", "category": "git"},
4
+ {"dictated": "git space log space dash dash oneline space dash n space ten", "expected": "git log --oneline -n 10", "category": "git"},
5
+ {"dictated": "git space stash space pop", "expected": "git stash pop", "category": "git"},
6
+ {"dictated": "git space checkout space dash b space feature slash auth", "expected": "git checkout -b feature/auth", "category": "git"},
7
+ {"dictated": "git space diff space dash dash staged space dot", "expected": "git diff --staged .", "category": "git"},
8
+ {"dictated": "git space rebase space dash i space capital h capital e capital a capital d tilde three", "expected": "git rebase -i HEAD~3", "category": "git"},
9
+ {"dictated": "git space remote space add space origin space git at github dot com colon user slash repo dot git", "expected": "git remote add origin git@github.com:user/repo.git", "category": "git"},
10
+ {"dictated": "docker space run space dash dash rm space dash p space eight zero eight zero colon eight zero space nginx", "expected": "docker run --rm -p 8080:80 nginx", "category": "docker"},
11
+ {"dictated": "docker space compose space up space dash d", "expected": "docker compose up -d", "category": "docker"},
12
+ {"dictated": "docker space build space dash t space my dash app colon latest space dot", "expected": "docker build -t my-app:latest .", "category": "docker"},
13
+ {"dictated": "docker space exec space dash i t space my dash container space bash", "expected": "docker exec -it my-container bash", "category": "docker"},
14
+ {"dictated": "docker space logs space dash f space dash dash tail space one hundred space web", "expected": "docker logs -f --tail 100 web", "category": "docker"},
15
+ {"dictated": "docker space ps space dash a", "expected": "docker ps -a", "category": "docker"},
16
+ {"dictated": "docker space system space prune space dash dash all space dash f", "expected": "docker system prune --all -f", "category": "docker"},
17
+ {"dictated": "kubectl space get space pods space dash n space production", "expected": "kubectl get pods -n production", "category": "kubernetes"},
18
+ {"dictated": "kubectl space apply space dash f space deployment dot yaml", "expected": "kubectl apply -f deployment.yaml", "category": "kubernetes"},
19
+ {"dictated": "kubectl space logs space dash f space dash l space app equals api space dash n space default", "expected": "kubectl logs -f -l app=api -n default", "category": "kubernetes"},
20
+ {"dictated": "kubectl space describe space pod space my dash pod dash one two three", "expected": "kubectl describe pod my-pod-123", "category": "kubernetes"},
21
+ {"dictated": "kubectl space scale space deployment slash web space dash dash replicas equals three", "expected": "kubectl scale deployment/web --replicas=3", "category": "kubernetes"},
22
+ {"dictated": "npm space install space dash capital d space typescript", "expected": "npm install -D typescript", "category": "npm"},
23
+ {"dictated": "npx space create dash next dash app at latest space my dash app", "expected": "npx create-next-app@latest my-app", "category": "npm"},
24
+ {"dictated": "pnpm space add space dash capital d space eslint space prettier", "expected": "pnpm add -D eslint prettier", "category": "npm"},
25
+ {"dictated": "npm space run space build space and and space npm space test", "expected": "npm run build && npm test", "category": "npm"},
26
+ {"dictated": "curl space dash s capital l space https colon slash slash api dot github dot com slash repos", "expected": "curl -sL https://api.github.com/repos", "category": "curl"},
27
+ {"dictated": "curl space dash capital x space all caps post space dash capital h space quote capital content dash capital type colon space application slash json quote space dash d space single quote open brace quote key quote colon quote value quote close brace single quote space http colon slash slash localhost colon three thousand", "expected": "curl -X POST -H \"Content-Type: application/json\" -d '{\"key\":\"value\"}' http://localhost:3000", "category": "curl"},
28
+ {"dictated": "curl space dash o space file dot zip space dash capital l space https colon slash slash example dot com slash download", "expected": "curl -o file.zip -L https://example.com/download", "category": "curl"},
29
+ {"dictated": "curl space dash i space dash capital x space all caps delete space https colon slash slash api dot example dot com slash users slash four two", "expected": "curl -i -X DELETE https://api.example.com/users/42", "category": "curl"},
30
+ {"dictated": "ssh space dash i space tilde slash dot ssh slash id underscore rsa space ubuntu at one nine two dot one six eight dot one dot one hundred", "expected": "ssh -i ~/.ssh/id_rsa ubuntu@192.168.1.100", "category": "ssh"},
31
+ {"dictated": "ssh space dash capital l space eight zero eight zero colon localhost colon five four three two space user at bastion dot example dot com", "expected": "ssh -L 8080:localhost:5432 user@bastion.example.com", "category": "ssh"},
32
+ {"dictated": "scp space dash r space user at host colon slash var slash log slash space dot slash logs slash", "expected": "scp -r user@host:/var/log/ ./logs/", "category": "ssh"},
33
+ {"dictated": "ls space dash l a h", "expected": "ls -lah", "category": "filesystem"},
34
+ {"dictated": "find space dot space dash name space quote star dot ts quote space dash not space dash path space quote star slash node underscore modules slash star quote", "expected": "find . -name \"*.ts\" -not -path \"*/node_modules/*\"", "category": "filesystem"},
35
+ {"dictated": "chmod space seven five five space script dot sh", "expected": "chmod 755 script.sh", "category": "filesystem"},
36
+ {"dictated": "chown space dash capital r space www dash data colon www dash data space slash var slash www", "expected": "chown -R www-data:www-data /var/www", "category": "filesystem"},
37
+ {"dictated": "mkdir space dash p space src slash components slash ui", "expected": "mkdir -p src/components/ui", "category": "filesystem"},
38
+ {"dictated": "tar space dash x z f space archive dot tar dot gz space dash capital c space slash tmp", "expected": "tar -xzf archive.tar.gz -C /tmp", "category": "filesystem"},
39
+ {"dictated": "du space dash s h space star space pipe space sort space dash h r space pipe space head space dash n space ten", "expected": "du -sh * | sort -hr | head -n 10", "category": "filesystem"},
40
+ {"dictated": "cp space dash r space src slash space dist slash", "expected": "cp -r src/ dist/", "category": "filesystem"},
41
+ {"dictated": "python space dash m space venv space dot venv", "expected": "python -m venv .venv", "category": "python"},
42
+ {"dictated": "pip space install space dash r space requirements dot txt", "expected": "pip install -r requirements.txt", "category": "python"},
43
+ {"dictated": "python space dash c space quote import space sys semicolon space print open paren sys dot version close paren quote", "expected": "python -c \"import sys; print(sys.version)\"", "category": "python"},
44
+ {"dictated": "pytest space dash v space dash x space tests slash", "expected": "pytest -v -x tests/", "category": "python"},
45
+ {"dictated": "pip space install space dash e space dot open bracket dev close bracket", "expected": "pip install -e .[dev]", "category": "python"},
46
+ {"dictated": "uvicorn space main colon app space dash dash reload space dash dash port space eight thousand", "expected": "uvicorn main:app --reload --port 8000", "category": "python"},
47
+ {"dictated": "cargo space build space dash dash release", "expected": "cargo build --release", "category": "rust"},
48
+ {"dictated": "cargo space test space dash dash space dash dash nocapture", "expected": "cargo test -- --nocapture", "category": "rust"},
49
+ {"dictated": "cargo space add space serde space dash capital f space derive", "expected": "cargo add serde -F derive", "category": "rust"},
50
+ {"dictated": "export space all caps database underscore all caps url equals quote postgres colon slash slash user colon pass at localhost colon five four three two slash mydb quote", "expected": "export DATABASE_URL=\"postgres://user:pass@localhost:5432/mydb\"", "category": "env"},
51
+ {"dictated": "export space all caps path equals dollar all caps path colon slash usr slash local slash bin", "expected": "export PATH=$PATH:/usr/local/bin", "category": "env"},
52
+ {"dictated": "echo space dollar all caps home", "expected": "echo $HOME", "category": "env"},
53
+ {"dictated": "all caps node underscore all caps env equals production space node space server dot js", "expected": "NODE_ENV=production node server.js", "category": "env"},
54
+ {"dictated": "env space pipe space grep space dash i space all caps aws", "expected": "env | grep -i AWS", "category": "env"},
55
+ {"dictated": "brew space install space dash dash cask space visual dash studio dash code", "expected": "brew install --cask visual-studio-code", "category": "brew"},
56
+ {"dictated": "brew space upgrade space and and space brew space cleanup", "expected": "brew upgrade && brew cleanup", "category": "brew"},
57
+ {"dictated": "brew space services space start space postgresql at fifteen", "expected": "brew services start postgresql@15", "category": "brew"},
58
+ {"dictated": "brew space tap space homebrew slash cask dash fonts", "expected": "brew tap homebrew/cask-fonts", "category": "brew"},
59
+ {"dictated": "grep space dash r n space quote all caps todo quote space dash dash include equals quote star dot swift quote space dot", "expected": "grep -rn \"TODO\" --include=\"*.swift\" .", "category": "grep"},
60
+ {"dictated": "grep space dash capital e space quote caret open bracket hash close bracket quote space config dot yaml", "expected": "grep -E \"^[#]\" config.yaml", "category": "grep"},
61
+ {"dictated": "grep space dash v space quote caret dollar quote space file dot txt space pipe space wc space dash l", "expected": "grep -v \"^$\" file.txt | wc -l", "category": "grep"},
62
+ {"dictated": "swift space build space dash c space release", "expected": "swift build -c release", "category": "swift"},
63
+ {"dictated": "swift space test space dash dash filter space capital my capital tests", "expected": "swift test --filter MyTests", "category": "swift"},
64
+ {"dictated": "swift space package space init space dash dash name space capital my capital package space dash dash type space library", "expected": "swift package init --name MyPackage --type library", "category": "swift"},
65
+ {"dictated": "xcodebuild space dash scheme space capital talkie space dash configuration space capital release space build", "expected": "xcodebuild -scheme Talkie -configuration Release build", "category": "xcode"},
66
+ {"dictated": "xcodebuild space dash project space capital my capital app dot xcodeproj space dash list", "expected": "xcodebuild -project MyApp.xcodeproj -list", "category": "xcode"},
67
+ {"dictated": "xcrun space simctl space boot space quote i capital phone space fifteen quote", "expected": "xcrun simctl boot \"iPhone 15\"", "category": "xcode"},
68
+ {"dictated": "psql space dash capital u space postgres space dash d space mydb space dash c space quote all caps select space star space from space users semicolon quote", "expected": "psql -U postgres -d mydb -c \"SELECT * FROM users;\"", "category": "database"},
69
+ {"dictated": "redis dash cli space dash h space one two seven dot zero dot zero dot one space dash p space six three seven nine space ping", "expected": "redis-cli -h 127.0.0.1 -p 6379 ping", "category": "database"},
70
+ {"dictated": "pg underscore dump space dash capital f c space dash d space mydb space redirect space backup dot dump", "expected": "pg_dump -Fc -d mydb > backup.dump", "category": "database"},
71
+ {"dictated": "sed space dash i space single quote single quote space single quote s slash old slash new slash g single quote space file dot txt", "expected": "sed -i '' 's/old/new/g' file.txt", "category": "sed"},
72
+ {"dictated": "awk space single quote open brace print space dollar one comma space dollar three close brace single quote space data dot csv", "expected": "awk '{print $1, $3}' data.csv", "category": "awk"},
73
+ {"dictated": "cat space access dot log space pipe space awk space single quote open brace print space dollar one close brace single quote space pipe space sort space pipe space uniq space dash c space pipe space sort space dash r n", "expected": "cat access.log | awk '{print $1}' | sort | uniq -c | sort -rn", "category": "pipeline"},
74
+ {"dictated": "ps space aux space pipe space grep space node space pipe space grep space dash v space grep", "expected": "ps aux | grep node | grep -v grep", "category": "pipeline"},
75
+ {"dictated": "echo space quote hello quote space append space output dot log", "expected": "echo \"hello\" >> output.log", "category": "redirect"},
76
+ {"dictated": "cat space error dot log space redirect space slash dev slash null", "expected": "cat error.log > /dev/null", "category": "redirect"},
77
+ {"dictated": "make space dash j space four space and and space make space install", "expected": "make -j 4 && make install", "category": "build"},
78
+ {"dictated": "cmake space dash capital b space build space dash capital d all caps cmake underscore all caps build underscore all caps type equals all caps release", "expected": "cmake -B build -DCMAKE_BUILD_TYPE=RELEASE", "category": "build"},
79
+ {"dictated": "go space build space dash o space bin slash server space dot slash cmd slash server", "expected": "go build -o bin/server ./cmd/server", "category": "go"},
80
+ {"dictated": "go space test space dash v space dash race space dot slash dot dot dot", "expected": "go test -v -race ./...", "category": "go"},
81
+ {"dictated": "go space mod space tidy", "expected": "go mod tidy", "category": "go"},
82
+ {"dictated": "terraform space plan space dash var space quote region equals us dash west dash two quote", "expected": "terraform plan -var \"region=us-west-2\"", "category": "terraform"},
83
+ {"dictated": "terraform space apply space dash auto dash approve", "expected": "terraform apply -auto-approve", "category": "terraform"},
84
+ {"dictated": "aws space s three space cp space dash dash recursive space s three colon slash slash my dash bucket slash data slash space dot slash data slash", "expected": "aws s3 cp --recursive s3://my-bucket/data/ ./data/", "category": "aws"},
85
+ {"dictated": "aws space ec two space describe dash instances space dash dash filters space quote capital name equals tag colon capital name comma capital values equals prod quote", "expected": "aws ec2 describe-instances --filters \"Name=tag:Name,Values=prod\"", "category": "aws"},
86
+ {"dictated": "gh space pr space create space dash dash title space quote fix space auth space bug quote space dash dash draft", "expected": "gh pr create --title \"fix auth bug\" --draft", "category": "github"},
87
+ {"dictated": "gh space issue space list space dash dash label space bug space dash dash state space open", "expected": "gh issue list --label bug --state open", "category": "github"},
88
+ {"dictated": "rsync space dash a v z space dash dash progress space src slash space user at host colon slash backup slash", "expected": "rsync -avz --progress src/ user@host:/backup/", "category": "network"},
89
+ {"dictated": "ping space dash c space four space eight dot eight dot eight dot eight", "expected": "ping -c 4 8.8.8.8", "category": "network"},
90
+ {"dictated": "dig space plus short space example dot com space all caps a", "expected": "dig +short example.com A", "category": "network"},
91
+ {"dictated": "openssl space s underscore client space dash connect space example dot com colon four four three", "expected": "openssl s_client -connect example.com:443", "category": "network"},
92
+ {"dictated": "systemctl space status space nginx dot service", "expected": "systemctl status nginx.service", "category": "systemd"},
93
+ {"dictated": "journalctl space dash u space docker space dash dash since space quote one space hour space ago quote", "expected": "journalctl -u docker --since \"1 hour ago\"", "category": "systemd"},
94
+ {"dictated": "bun space run space dash dash watch space src slash server dot ts", "expected": "bun run --watch src/server.ts", "category": "bun"},
95
+ {"dictated": "bun space add space at hono slash hono", "expected": "bun add @hono/hono", "category": "bun"},
96
+ {"dictated": "deno space run space dash dash allow dash net space dash dash allow dash read space server dot ts", "expected": "deno run --allow-net --allow-read server.ts", "category": "deno"},
97
+ {"dictated": "ffmpeg space dash i space input dot mp four space dash c colon v space libx two six four space dash crf space twenty three space output dot mp four", "expected": "ffmpeg -i input.mp4 -c:v libx264 -crf 23 output.mp4", "category": "media"},
98
+ {"dictated": "convert space input dot png space dash resize space fifty percent space output dot png", "expected": "convert input.png -resize 50% output.png", "category": "media"},
99
+ {"dictated": "jq space single quote dot data open bracket close bracket space pipe space dot name single quote space response dot json", "expected": "jq '.data[] | .name' response.json", "category": "json"},
100
+ {"dictated": "cat space config dot json space pipe space jq space single quote dot dot dependencies single quote", "expected": "cat config.json | jq '..dependencies'", "category": "json"},
101
+ {"dictated": "wrangler space deploy space dash dash env space production", "expected": "wrangler deploy --env production", "category": "cloudflare"}
102
+ ]
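Each entry in the eval file above is a flat object with `dictated`, `expected`, and `category` string fields. As a rough sketch, a loader could sanity-check that shape before an eval run. The helper name `validate_entries` and the `REQUIRED_KEYS` tuple are hypothetical and not part of this commit.

```python
import json

# Assumed schema, inferred from the entries shown in eval/independent.json
REQUIRED_KEYS = ("dictated", "expected", "category")


def validate_entries(entries):
    """Return (index, problem) pairs for entries that break the assumed schema."""
    problems = []
    for idx, entry in enumerate(entries):
        for key in REQUIRED_KEYS:
            value = entry.get(key)
            # Every field is expected to be a non-empty string
            if not isinstance(value, str) or not value.strip():
                problems.append((idx, f"missing or empty '{key}'"))
    return problems


sample = json.loads("""
[
  {"dictated": "go space mod space tidy", "expected": "go mod tidy", "category": "go"},
  {"dictated": "swift space build", "expected": "", "category": "swift"}
]
""")

for idx, problem in validate_entries(sample):
    print(f"entry {idx}: {problem}")
```

Running a check like this first keeps malformed entries from showing up as spurious exact-match failures.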
pipeline/normalizer.py ADDED
@@ -0,0 +1,415 @@
1
+ #!/usr/bin/env python3
2
+ """Zero-training normalizer pipeline.
3
+
4
+ Architecture:
5
+ Raw transcript
6
+ → Protocol detector (is it already in protocol format?)
7
+ → IF protocol: strip filler procedurally → processor
8
+ → IF NOT protocol: LLM normalize → processor
9
+ → Final syntax output
10
+
11
+ The LLM only handles non-protocol input (fuzzy dictation, natural language).
12
+ Protocol-format input bypasses the LLM entirely for deterministic handling.
13
+ """
14
+
15
+ import json
16
+ import sys
17
+ import time
18
+ import re
19
+ import os
20
+ import argparse
21
+ from collections import defaultdict
22
+
23
+ from mlx_lm import load, generate
24
+ from mlx_lm.sample_utils import make_sampler
25
+
26
+ # Import the procedural processor
27
+ sys.path.insert(0, os.path.join(os.path.dirname(__file__), '..', 'processor'))
28
+ from procedural import process_dictation
29
+
30
+ # ── Protocol detection ───────────────────────────────────────────────────
31
+
32
+ # Words that are part of the protocol vocabulary (not filler)
33
+ PROTOCOL_VOCAB = {
34
+ 'space', 'dash', 'dot', 'slash', 'pipe', 'star', 'bang', 'hash',
35
+ 'tilde', 'at', 'dollar', 'percent', 'caret', 'ampersand', 'equals',
36
+ 'plus', 'colon', 'semicolon', 'underscore', 'comma', 'backslash',
37
+ 'quote', 'backtick', 'redirect', 'append',
38
+ 'capital', 'camel', 'snake', 'pascal', 'kebab', 'screaming',
39
+ }
40
+
41
+ # Common conversational filler patterns to strip
42
+ FILLER_PREFIXES = [
43
+ r"^okay\s+so\s+(?:the\s+command\s+is\s+|like\s+)?",
44
+ r"^so\s+(?:the\s+command\s+is\s+|like\s+|it's\s+)?",
45
+ r"^um+\s+(?:so\s+)?(?:the\s+)?",
46
+ r"^(?:I\s+wanna?|I\s+want\s+to)\s+(?:\w+\s+)*?(?:to\s+|is\s+)?",
47
+ r"^can\s+you\s+(?:type\s+(?:out\s+)?)?",
48
+ r"^(?:let's\s+(?:do|see|try)\s+)",
49
+ r"^basically\s+(?:run\s+|do\s+|type\s+)?",
50
+ r"^(?:and\s+then|then)\s+",
51
+ r"^right\s+so\s+",
52
+ r"^(?:type\s+(?:out\s+)?)",
53
+ r"^okay\s+(?:let\s+me\s+type\s+)?(?:the\s+)?(?:\w+\s+)?(?:command\s+)?(?:so\s+)?(?:it's\s+)?",
54
+ r"^I\s+think\s+we\s+need\s+",
55
+ r"^(?:so\s+)?for\s+the\s+\w+\s+(?:variable\s+)?(?:it's\s+)?",
56
+ r"^I\s+want\s+to\s+run\s+",
57
+ ]
58
+
59
+ FILLER_SUFFIXES = [
60
+ r"\s+I\s+think$",
61
+ r"\s+right$",
62
+ r"\s+yeah$",
63
+ ]
64
+
65
+
66
+ FILLER_WORDS = {
67
+ 'okay', 'ok', 'so', 'um', 'uh', 'like', 'basically', 'actually',
68
+ 'i', 'the', 'can', 'right', 'wait', 'well', 'and',
69
+ 'we', 'you', 'hmm', "let's", 'just',
70
+ 'then', "i'm", "it's", "that's",
71
+ 'should', 'would', 'could', 'maybe',
72
+ }
73
+
74
+ SELF_CORRECTION = {'wait', 'no', 'actually', 'meant', 'not'}
75
+
76
+
+ def is_pure_protocol(text):
+     """Check if text is pure protocol format (no filler, no corrections).
+
+     Returns True only if:
+     1. Input contains "space" as separator (protocol format)
+     2. Does NOT start with filler words (conversational)
+     3. Does NOT contain self-correction markers
+     """
+     words = text.lower().split()
+     if not words:
+         return False
+
+     # Must contain "space" keyword
+     if 'space' not in words:
+         return False
+
+     # Must not start with filler
+     if words[0] in FILLER_WORDS:
+         return False
+
+     # Must not contain self-correction patterns
+     word_set = set(words)
+     if word_set & SELF_CORRECTION:
+         return False
+
+     return True
+
+
+ def strip_filler(text):
+     """Procedurally strip conversational filler from text."""
+     result = text
+     for pattern in FILLER_PREFIXES:
+         result = re.sub(pattern, '', result, flags=re.IGNORECASE)
+     for pattern in FILLER_SUFFIXES:
+         result = re.sub(pattern, '', result, flags=re.IGNORECASE)
+     return result.strip()
113
+
114
+
115
+ # ── LLM prompt (optimized for non-protocol input) ───────────────────────
116
+
117
+ SYSTEM_PROMPT = """You normalize voice dictation into clean protocol format for a processor.
118
+
119
+ YOUR JOB:
120
+ 1. If the input already contains "space" keywords with conversational filler → strip the filler, output the protocol content VERBATIM
121
+ 2. If input is natural speech without "space" keywords → normalize it:
122
+ a) Replace synonyms: minus→dash, hyphen→dash, period→dot, forward slash→slash, asterisk→star, hashtag→hash, double dash→dash dash
123
+ b) Insert "space" between separate arguments/tokens
124
+ c) Do NOT insert "space" within: paths (slash-separated), dotted names (file dot txt), compound flags (dash dash verbose)
125
+ 3. Resolve self-corrections (no wait, actually, I meant) → keep only the FINAL intent
126
+ 4. Output ONLY protocol words — never output actual symbols like - . / @ etc.
127
+
128
+ PROTOCOL KEYWORDS (output as words):
129
+ Separator: space
130
+ Symbols: dash dot slash pipe star bang hash tilde at dollar percent caret ampersand equals plus colon semicolon underscore comma backslash quote backtick redirect append
131
+ Multi-word: dash dash, single quote, open/close paren, open/close brace, open/close bracket, less than, question mark, and and, pipe pipe, dot dot, new line
132
+ Casing: camel case, snake case, pascal case, kebab case (followed by the words to transform)
133
+ Capitalization: capital (next word), all caps (next word)
134
+ Numbers: zero through nineteen, twenty/thirty/.../ninety, hundred, thousand
135
+
136
+ Output ONLY the normalized protocol text. Nothing else."""
137
+
138
+ FEW_SHOT = [
139
+ # Fuzzy: missing spaces, synonym replacement needed
140
+ {
141
+ "input": "git commit minus m quote fix login bug quote",
142
+ "output": "git space commit space dash m space quote fix space login space bug quote"
143
+ },
144
+ {
145
+ "input": "cat file period txt",
146
+ "output": "cat space file dot txt"
147
+ },
148
+ {
149
+ "input": "ls minus l minus a slash var slash log",
150
+ "output": "ls space dash l space dash a space slash var slash log"
151
+ },
152
+ {
153
+ "input": "docker run minus minus rm minus it ubuntu",
154
+ "output": "docker space run space dash dash rm space dash it space ubuntu"
155
+ },
156
+ {
157
+ "input": "cd forward slash usr forward slash local forward slash bin",
158
+ "output": "cd space slash usr slash local slash bin"
159
+ },
160
+ {
161
+ "input": "python server period py double dash port eight thousand",
162
+ "output": "python space server dot py space dash dash port space eight thousand"
163
+ },
164
+ {
165
+ "input": "git push hyphen u origin main",
166
+ "output": "git space push space dash u space origin space main"
167
+ },
168
+ {
169
+ "input": "npm install hyphen hyphen save dev eslint",
170
+ "output": "npm space install space dash dash save dash dev space eslint"
171
+ },
172
+ # Casing: pass through verbatim, no spaces between words after the directive
173
+ {
174
+ "input": "snake case api response handler",
175
+ "output": "snake case api response handler"
176
+ },
177
+ {
178
+ "input": "camel case is authenticated",
179
+ "output": "camel case is authenticated"
180
+ },
181
+ # Natural: filler around protocol content, strip filler and pass through protocol
182
+ {
183
+ "input": "okay so the command is git space push space dash u space origin space main",
184
+ "output": "git space push space dash u space origin space main"
185
+ },
186
+ {
187
+ "input": "can you type out docker space run space dash dash rm space nginx",
188
+ "output": "docker space run space dash dash rm space nginx"
189
+ },
190
+ {
191
+ "input": "I wanna set the variable name to camel case get user profile",
192
+ "output": "camel case get user profile"
193
+ },
194
+ {
195
+ "input": "the path should be slash usr slash local slash bin",
196
+ "output": "slash usr slash local slash bin"
197
+ },
198
+ {
199
+ "input": "um the flag is dash dash verbose",
200
+ "output": "dash dash verbose"
201
+ },
202
+ {
203
+ "input": "so for the environment variable it's all caps AWS underscore SECRET underscore ACCESS underscore KEY",
204
+ "output": "all caps AWS underscore SECRET underscore ACCESS underscore KEY"
205
+ },
206
+ # Chaotic: self-corrections
207
+ {
208
+ "input": "dash dash no wait just dash v",
209
+ "output": "dash v"
210
+ },
211
+ {
212
+ "input": "run it on port three thousand",
213
+ "output": "three thousand"
214
+ },
215
+ {
216
+ "input": "wait no not dash dash force I meant dash dash force dash with dash lease",
217
+ "output": "dash dash force dash with dash lease"
218
+ },
219
+ {
220
+ "input": "so we need to... actually let's just do git stash",
221
+ "output": "git space stash"
222
+ },
223
+ ]
224
+
225
+
+ def build_prompt(tokenizer, user_input):
+     """Build the full prompt with system instructions, few-shot examples, and the user input."""
+     messages = [{"role": "system", "content": SYSTEM_PROMPT}]
+
+     for ex in FEW_SHOT:
+         messages.append({"role": "user", "content": ex["input"]})
+         messages.append({"role": "assistant", "content": ex["output"]})
+
+     messages.append({"role": "user", "content": user_input})
+
+     return tokenizer.apply_chat_template(
+         messages, tokenize=False, add_generation_prompt=True
+     )
239
+
240
+
+ def llm_normalize(model, tokenizer, raw_input, max_tokens=200):
+     """Use the LLM to normalize raw dictation into protocol format."""
+     prompt = build_prompt(tokenizer, raw_input)
+     sampler = make_sampler(temp=0.0)
+     output = generate(
+         model, tokenizer, prompt=prompt,
+         max_tokens=max_tokens, verbose=False,
+         sampler=sampler,
+     )
+     result = output.strip()
+     # Remove markdown code fences first. Stripping backticks before this step
+     # would leave the fence's language tag in the output. The opening-fence
+     # pattern requires a newline so an inline ```...``` span keeps its first word.
+     result = re.sub(r'^```\w*\n', '', result)
+     result = re.sub(r'\n?```$', '', result)
+     # Then remove any wrapping quotes/backticks
+     result = result.strip().strip('`').strip('"').strip("'")
+     return result.strip()
257
+
258
+
+ def run_pipeline(model, tokenizer, raw_input):
+     """Full pipeline: detect format → normalize if needed → processor."""
+     t0 = time.perf_counter()
+
+     if is_pure_protocol(raw_input):
+         # Already in protocol format: strip filler procedurally, skip LLM
+         protocol_text = strip_filler(raw_input)
+         used_llm = False
+     else:
+         # Needs LLM normalization
+         protocol_text = llm_normalize(model, tokenizer, raw_input)
+         used_llm = True
+
+     t_norm = time.perf_counter()
+     final_output = process_dictation(protocol_text)
+     t_proc = time.perf_counter()
+
+     return {
+         'protocol': protocol_text,
+         'output': final_output,
+         'used_llm': used_llm,
+         'norm_ms': (t_norm - t0) * 1000,
+         'proc_ms': (t_proc - t_norm) * 1000,
+         'total_ms': (t_proc - t0) * 1000,
+     }
284
+
285
+
+ def main():
+     parser = argparse.ArgumentParser(description='Zero-training normalizer pipeline evaluation')
+     parser.add_argument('eval_file', help='Path to evaluation JSON file')
+     parser.add_argument('--model', default='mlx-community/Qwen2.5-1.5B-Instruct-4bit',
+                         help='MLX model to use')
+     parser.add_argument('--limit', type=int, default=0,
+                         help='Limit number of entries to evaluate (0 = all)')
+     parser.add_argument('--show-all', action='store_true',
+                         help='Show all results, not just errors')
+     parser.add_argument('--show-protocol', action='store_true',
+                         help='Show normalized protocol output for each entry')
+     args = parser.parse_args()
+
+     # Load model
+     print(f'Loading model: {args.model}')
+     model, tokenizer = load(args.model)
+     print('Model loaded.\n')
+
+     # Load eval data (context manager closes the file handle)
+     with open(args.eval_file) as f:
+         data = json.load(f)
+     if args.limit:
+         data = data[:args.limit]
+
+     n = len(data)
+     if n == 0:
+         # Avoid dividing by zero in the summary below
+         print('No entries to evaluate.')
+         return
+     exact = ws = 0
+     llm_calls = 0
+     errors = []
+     by_difficulty = defaultdict(list)
+     latencies = []
+
+     print(f'Evaluating {n} entries from {args.eval_file}')
+     print(f'Pipeline: Protocol Detect → LLM ({args.model.split("/")[-1]}) / Filler Strip → Processor')
+     print('=' * 70)
+
+     for idx, d in enumerate(data):
+         result = run_pipeline(model, tokenizer, d['dictated'])
+         if result['used_llm']:
+             llm_calls += 1
+
+         expected = d['expected']
+         got = result['output']
+
+         # Two scores: exact match, and match after collapsing whitespace runs
+         ws_got = re.sub(r'\s+', ' ', got.strip())
+         ws_exp = re.sub(r'\s+', ' ', expected.strip())
+         is_exact = got == expected
+         is_ws = ws_got == ws_exp
+
+         if is_exact:
+             exact += 1
+         if is_ws:
+             ws += 1
+
+         diff = d.get('difficulty', 'unknown')
+         by_difficulty[diff].append(is_exact)
+         latencies.append(result['total_ms'])
+
+         marker = '.' if is_exact else 'x'
+         sys.stdout.write(marker)
+         sys.stdout.flush()
+         if (idx + 1) % 50 == 0:
+             sys.stdout.write(f' [{idx+1}/{n}]\n')
+             sys.stdout.flush()
+
+         if args.show_all or (args.show_protocol and not is_exact):
+             llm_tag = 'LLM' if result['used_llm'] else 'SKIP'
+             print(f'\n [{diff:>7}] [{d.get("category", "")}] {"PASS" if is_exact else "FAIL"} ({llm_tag})')
+             print(f' input: {d["dictated"][:120]}')
+             if args.show_protocol:
+                 print(f' protocol: {result["protocol"][:120]}')
+             print(f' expected: {expected[:100]}')
+             print(f' got: {got[:100]}')
+             print(f' latency: {result["total_ms"]:.0f}ms')
+
+         if not is_exact:
+             errors.append({
+                 'dictated': d['dictated'][:120],
+                 'expected': expected[:100],
+                 'got': got[:100],
+                 'protocol': result['protocol'][:120],
+                 'category': d.get('category', ''),
+                 'difficulty': diff,
+                 'used_llm': result['used_llm'],
+                 'latency_ms': result['total_ms'],
+             })
+
+     # Ensure newline after progress dots
+     if n % 50 != 0:
+         print(f' [{n}/{n}]')
+     print()
+
+     # ── Results ──
+     print(f'NORMALIZER PIPELINE — {args.eval_file}')
+     print(f'Model: {args.model}')
+     print('=' * 70)
+     print(f' Exact: {exact}/{n} ({exact/n*100:.1f}%)')
+     print(f' WS-norm: {ws}/{n} ({ws/n*100:.1f}%)')
+     print(f' LLM calls: {llm_calls}/{n} ({llm_calls/n*100:.0f}% needed LLM)')
+     print()
+
+     if len(by_difficulty) > 1 or 'unknown' not in by_difficulty:
+         print('BY DIFFICULTY:')
+         for diff in ['clean', 'fuzzy', 'natural', 'chaotic', 'unknown']:
+             if diff in by_difficulty:
+                 results = by_difficulty[diff]
+                 ex = sum(results)
+                 tot = len(results)
+                 print(f' {diff:>10}: {ex}/{tot} ({ex/tot*100:.0f}%)')
+         print()
+
+     avg_lat = sum(latencies) / len(latencies) if latencies else 0
+     p50 = sorted(latencies)[len(latencies) // 2] if latencies else 0
+     p95 = sorted(latencies)[int(len(latencies) * 0.95)] if latencies else 0
+     print('LATENCY:')
+     print(f' avg: {avg_lat:.0f}ms p50: {p50:.0f}ms p95: {p95:.0f}ms')
+     print()
+
+     print(f'ERRORS ({len(errors)}, showing first 25):')
+     print('-' * 70)
+     for e in errors[:25]:
+         llm_tag = 'LLM' if e['used_llm'] else 'SKIP'
+         print(f' [{e["difficulty"]:>7}] [{e["category"]}] ({llm_tag})')
+         print(f' input: {e["dictated"]}')
+         print(f' protocol: {e["protocol"]}')
+         print(f' expected: {e["expected"]}')
+         print(f' got: {e["got"]}')
+         print()
+
+
+ if __name__ == '__main__':
+     main()
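The routing rule at the top of the pipeline can be exercised without loading a model. The sketch below mirrors the three checks in `is_pure_protocol` using shortened word lists. The `route` helper is hypothetical and its sets are a subset of the ones in `pipeline/normalizer.py`, so treat it as an illustration of the decision and not as the shipped logic.

```python
# Reduced word lists for illustration only; the real sets are longer.
FILLER_WORDS = {"okay", "ok", "so", "um", "uh", "like", "basically"}
SELF_CORRECTION = {"wait", "no", "actually", "meant", "not"}


def route(text):
    """Return 'procedural' when the LLM can be skipped, else 'llm'."""
    words = text.lower().split()
    # No explicit "space" separators means the input is not protocol format
    if not words or "space" not in words:
        return "llm"
    # Conversational lead-in: leave the cleanup to the LLM
    if words[0] in FILLER_WORDS:
        return "llm"
    # Self-corrections need the final intent resolved
    if set(words) & SELF_CORRECTION:
        return "llm"
    return "procedural"


samples = [
    "git space push space dash u space origin space main",
    "okay so the command is git space push",
    "git commit minus m quote fix quote",
    "git space no wait git space stash",
]
for sample in samples:
    print(f"{route(sample):>10}  {sample}")
```

Only the first sample takes the deterministic path. The other three fall through to the LLM because of a filler opener, missing `space` keywords, and a self-correction respectively.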
processor/procedural.py ADDED
@@ -0,0 +1,377 @@
1
+ #!/usr/bin/env python3
2
+ """Procedural dictation → syntax processor.
3
+
4
+ No LLM needed. Just token scanning with a symbol vocabulary.
5
+
6
+ Rules:
7
+ 1. "space" → literal space
8
+ 2. Symbol words → their character (dash→-, dot→., etc.)
9
+ 3. Synonyms: minus→-, period→., forward slash→/, etc.
10
+ 4. Number words → digits (one→1, forty two→42, one hundred→100, three thousand→3000)
11
+ 5. Casing directives: camel case, snake case, pascal case, kebab case
12
+ 6. "capital X" → X (uppercase), "all caps word" → WORD
13
+ 7. Everything else → pass through literally
14
+ """
15
+
16
+ import json
17
+ import re
18
+
19
+ # ── Symbol vocabulary ────────────────────────────────────────────────────
20
+
21
+ SYMBOLS = {
22
+ # Primary protocol words
23
+ 'dash': '-',
24
+ 'dot': '.',
25
+ 'slash': '/',
26
+ 'pipe': '|',
27
+ 'redirect': '>',
28
+ 'append': '>>',
29
+ 'less': None, # needs lookahead for "less than"
30
+ 'star': '*',
31
+ 'bang': '!',
32
+ 'hash': '#',
33
+ 'tilde': '~',
34
+ 'at': '@',
35
+ 'dollar': '$',
36
+ 'percent': '%',
37
+ 'caret': '^',
38
+ 'ampersand': '&',
39
+ 'equals': '=',
40
+ 'plus': '+',
41
+ 'colon': ':',
42
+ 'semicolon': ';',
43
+ 'underscore': '_',
44
+ 'comma': ',',
45
+ 'backslash': '\\',
46
+ 'quote': '"',
47
+ 'backtick': '`',
48
+ 'question': None, # needs lookahead for "question mark"
49
+
50
+ # Synonyms — common alternatives people use
51
+ 'minus': '-',
52
+ 'hyphen': '-',
53
+ 'period': '.',
54
+ 'asterisk': '*',
55
+ 'hashtag': '#',
56
+ }
57
+
58
+ # Two-word symbols (checked before single-word)
59
+ TWO_WORD_SYMBOLS = {
60
+ ('single', 'quote'): "'",
61
+ ('open', 'paren'): '(',
62
+ ('close', 'paren'): ')',
63
+ ('open', 'brace'): '{',
64
+ ('close', 'brace'): '}',
65
+ ('open', 'bracket'): '[',
66
+ ('close', 'bracket'): ']',
67
+ ('open', 'angle'): '<',
68
+ ('close', 'angle'): '>',
69
+ ('open', 'curly'): '{',
70
+ ('close', 'curly'): '}',
71
+ ('less', 'than'): '<',
72
+ ('question', 'mark'): '?',
73
+ ('dash', 'dash'): '--',
74
+ ('double', 'dash'): '--',
75
+ ('minus', 'minus'): '--',
76
+ ('and', 'and'): '&&',
77
+ ('pipe', 'pipe'): '||',
78
+ ('dot', 'dot'): '..',
79
+ ('two', 'redirect'): '2>',
80
+ ('forward', 'slash'): '/',
81
+ ('back', 'slash'): '\\',
82
+ ('equals', 'sign'): '=',
83
+ ('at', 'sign'): '@',
84
+ ('dollar', 'sign'): '$',
85
+ ('open', 'parenthesis'): '(',
86
+ ('close', 'parenthesis'): ')',
87
+ ('new', 'line'): '\n',
88
+ }
89
+
90
+ # Three-word symbols
91
+ THREE_WORD_SYMBOLS = {
92
+ ('two', 'redirect', 'ampersand'): '2>&',
93
+ }
94
+
95
+ # ── Number words ─────────────────────────────────────────────────────────
96
+
97
+ ONES = {
98
+ 'zero': 0, 'one': 1, 'two': 2, 'three': 3, 'four': 4,
99
+ 'five': 5, 'six': 6, 'seven': 7, 'eight': 8, 'nine': 9,
100
+ 'ten': 10, 'eleven': 11, 'twelve': 12, 'thirteen': 13,
101
+ 'fourteen': 14, 'fifteen': 15, 'sixteen': 16, 'seventeen': 17,
102
+ 'eighteen': 18, 'nineteen': 19,
103
+ }
104
+
105
+ TENS = {
106
+ 'twenty': 20, 'thirty': 30, 'forty': 40, 'fifty': 50,
107
+ 'sixty': 60, 'seventy': 70, 'eighty': 80, 'ninety': 90,
108
+ }
109
+
110
+ MULTIPLIERS = {
111
+ 'hundred': 100,
112
+ 'thousand': 1000,
113
+ }
114
+
115
+ ALL_NUMBER_WORDS = set(ONES.keys()) | set(TENS.keys()) | set(MULTIPLIERS.keys())
116
+
117
+
118
+ def is_number_word(w):
119
+ return w in ALL_NUMBER_WORDS
120
+
121
+
+ def consume_number(words, i):
+     """Try to consume a number starting at position i.
+
+     Handles:
+     - Single: "five" → 5
+     - Teens: "twelve" → 12
+     - Compound: "forty two" → 42
+     - Multipliers: "three thousand" → 3000, "one hundred" → 100
+     - Digit sequences: "one nine two" → 192 (two or more consecutive digit words)
+     - Mixed: "eight thousand" → 8000
+
+     Returns (digits_as_string, new_i), or (None, i) if words[i] does not
+     start a number.
+     """
+     w = words[i]
+
+     # Tens word: twenty, thirty, etc.
+     if w in TENS:
+         val = TENS[w]
+         j = i + 1
+         # "forty two" compound
+         if j < len(words) and words[j] in ONES and ONES[words[j]] < 10:
+             val += ONES[words[j]]
+             j += 1
+         # Check for multiplier: "forty thousand"
+         if j < len(words) and words[j] in MULTIPLIERS:
+             val *= MULTIPLIERS[words[j]]
+             j += 1
+         return str(val), j
+
+     # Single/teens: zero through nineteen
+     if w in ONES:
+         val = ONES[w]
+         j = i + 1
+
+         # Check for multiplier: "three thousand", "one hundred"
+         if j < len(words) and words[j] in MULTIPLIERS:
+             val *= MULTIPLIERS[words[j]]
+             j += 1
+             # "three thousand two hundred" etc. is not handled: only one
+             # multiplier is consumed, to keep it simple for now
+             return str(val), j
+
+         # Check for digit sequence: "one nine two" → "192"
+         # Following words are appended only while they are single digits (0-9)
+         result = str(val)
+         while j < len(words) and words[j] in ONES and ONES[words[j]] < 10:
+             result += str(ONES[words[j]])
+             j += 1
+         if j > i + 1:
+             return result, j
+
+         return str(val), i + 1
+
+     return None, i
174
+
175
+
176
+ # ── Casing directives ───────────────────────────────────────────────────
177
+
178
+ CASING_DIRECTIVES = {'camel', 'snake', 'pascal', 'kebab', 'screaming'}
179
+
180
+
+ def consume_casing(words, i):
+     """Try to consume a casing directive and its arguments.
+
+     "camel case get user profile" → "getUserProfile"
+     "snake case api key" → "api_key"
+     "pascal case my component" → "MyComponent"
+     "kebab case my component" → "my-component"
+     "screaming case api key" → "API_KEY"
+
+     Consumes words until "space", a symbol, another directive, or end of input.
+     Returns (result, new_i) or (None, i).
+     """
+     w = words[i].lower()
+     if w not in CASING_DIRECTIVES:
+         return None, i
+     if i + 1 >= len(words) or words[i + 1].lower() != 'case':
+         return None, i
+
+     style = w
+     j = i + 2
+
+     # Consume words until "space" or end or another directive/symbol
+     parts = []
+     while j < len(words):
+         next_w = words[j]
+         if next_w == 'space':
+             break
+         if next_w in SYMBOLS:
+             break
+         if next_w in CASING_DIRECTIVES and j + 1 < len(words) and words[j + 1] == 'case':
+             break
+         if next_w == 'all' or next_w == 'capital':
+             break
+         # TWO_WORD_SYMBOLS is keyed by (first, second) tuples, so the pair
+         # must be looked up; a bare string is never a member of that dict
+         if j + 1 < len(words) and (next_w, words[j + 1]) in TWO_WORD_SYMBOLS:
+             break
+         parts.append(next_w.lower())
+         j += 1
+
+     if not parts:
+         return None, i
+
+     if style == 'camel':
+         result = parts[0] + ''.join(p.capitalize() for p in parts[1:])
+     elif style == 'pascal':
+         result = ''.join(p.capitalize() for p in parts)
+     elif style == 'snake':
+         result = '_'.join(parts)
+     elif style == 'kebab':
+         result = '-'.join(parts)
+     elif style == 'screaming':
+         result = '_'.join(p.upper() for p in parts)
+     else:
+         return None, i
+
+     return result, j
233
+
234
+
+ def process_dictation(text):
+     """Convert dictated text to syntax using purely procedural rules."""
+     words = text.split()
+     output = []
+     i = 0
+     n = len(words)
+
+     while i < n:
+         w = words[i]
+
+         # ── "space" → literal space ──
+         if w == 'space':
+             output.append(' ')
+             i += 1
+             continue
+
+         # ── Three-word symbols ──
+         if i + 2 < n:
+             triple = (words[i], words[i+1], words[i+2])
+             if triple in THREE_WORD_SYMBOLS:
+                 output.append(THREE_WORD_SYMBOLS[triple])
+                 i += 3
+                 continue
+
+         # ── Casing directives ──
+         cased, new_i = consume_casing(words, i)
+         if cased is not None:
+             output.append(cased)
+             i = new_i
+             continue
+
+         # ── Two-word symbols ──
+         if i + 1 < n:
+             pair = (words[i], words[i+1])
+             if pair in TWO_WORD_SYMBOLS:
+                 output.append(TWO_WORD_SYMBOLS[pair])
+                 i += 2
+                 continue
+
+         # ── "all caps <word>" ──
+         if w == 'all' and i + 2 < n and words[i+1] == 'caps':
+             output.append(words[i+2].upper())
+             i += 3
+             continue
+
+         # ── "capital <letter or word>" ──
+         if w == 'capital' and i + 1 < n:
+             next_w = words[i+1]
+             if len(next_w) == 1:
+                 output.append(next_w.upper())
+             else:
+                 output.append(next_w[0].upper() + next_w[1:])
+             i += 2
+             continue
+
+         # ── Single-word symbols ──
+         if w in SYMBOLS and SYMBOLS[w] is not None:
+             output.append(SYMBOLS[w])
+             i += 1
+             continue
+
+         # ── Number words ──
+         if is_number_word(w):
+             num_str, new_i = consume_number(words, i)
+             if num_str is not None:
+                 output.append(num_str)
+                 i = new_i
+                 continue
+
+         # ── Regular word → pass through ──
+         output.append(w)
+         i += 1
+
+     return ''.join(output)
309
+
310
+
311
+ # ── Main: evaluate ──────────────────────────────────────────────────────
312
+
313
+ if __name__ == '__main__':
314
+ import sys
315
+ from collections import defaultdict
316
+
317
+ eval_file = sys.argv[1] if len(sys.argv) > 1 else 'datasets/eval-independent.json'
318
+ data = json.load(open(eval_file))
319
+
320
+ n = len(data)
321
+ exact = ws = wsc = 0
322
+ errors = []
323
+ cat_results = defaultdict(lambda: {'exact': 0, 'total': 0})
324
+
325
+ # Group by difficulty if present
326
+ by_difficulty = defaultdict(list)
327
+
328
+ for d in data:
329
+ got = process_dictation(d['dictated'])
330
+ expected = d['expected']
331
+
332
+ ws_got = re.sub(r'\s+', ' ', got.strip())
333
+ ws_exp = re.sub(r'\s+', ' ', expected.strip())
334
+ is_exact = got == expected
335
+ is_ws = ws_got == ws_exp
336
+ is_wsc = ws_got.lower() == ws_exp.lower()
337
+
338
+ if is_exact: exact += 1
339
+ if is_ws: ws += 1
340
+ if is_wsc: wsc += 1
341
+
342
+ diff = d.get('difficulty', 'unknown')
343
+ by_difficulty[diff].append(is_exact)
344
+
345
+ if not is_exact:
346
+ errors.append({
347
+ 'dictated': d['dictated'][:80],
348
+ 'expected': expected[:60],
349
+ 'got': got[:60],
350
+ 'category': d.get('category', ''),
351
+ 'difficulty': diff,
352
+ })
353
+
354
+ print(f'PROCEDURAL PROCESSOR — {eval_file}')
355
+ print('=' * 70)
356
+ print(f' Exact: {exact}/{n} ({exact/n*100:.1f}%)')
357
+ print(f' WS-norm: {ws}/{n} ({ws/n*100:.1f}%)')
358
+ print(f' WS+case: {wsc}/{n} ({wsc/n*100:.1f}%)')
359
+ print()
360
+
361
+ if len(by_difficulty) > 1:
362
+ print('BY DIFFICULTY:')
363
+ for diff in ['clean', 'fuzzy', 'natural', 'chaotic', 'unknown']:
364
+ if diff in by_difficulty:
365
+ results = by_difficulty[diff]
366
+ ex = sum(results)
367
+ tot = len(results)
368
+ print(f' {diff:>10}: {ex}/{tot} ({ex/tot*100:.0f}%)')
369
+ print()
370
+
371
+ print(f'ERRORS ({len(errors)}, showing first 15):')
372
+ print('-' * 70)
373
+ for e in errors[:15]:
374
+ print(f' [{e["difficulty"]:>7}] [{e["category"]}]')
375
+ print(f' expected: {e["expected"]}')
376
+ print(f' got: {e["got"]}')
377
+ print()
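The token loop in `process_dictation` above can be read as a greedy longest-match scan. The following is a minimal sketch of that idea only, not the processor itself: the `SYMBOLS_BY_LENGTH` table and the `convert` function are hypothetical names, the handful of entries are assumed from the symbol words used elsewhere in this commit, and casing directives and number words are left out.

```python
# Hypothetical, deliberately tiny symbol tables (assumed entries, not the
# real THREE_WORD_SYMBOLS / TWO_WORD_SYMBOLS / SYMBOLS from procedural.py).
SYMBOLS_BY_LENGTH = {
    2: {("dash", "dash"): "--", ("and", "and"): "&&", ("dot", "dot"): ".."},
    1: {("dash",): "-", ("dot",): ".", ("slash",): "/", ("quote",): '"'},
}


def convert(text: str) -> str:
    """Greedy longest-match conversion of dictated words to syntax."""
    words = text.split()
    out = []
    i = 0
    while i < len(words):
        # The word "space" is the only source of literal spaces.
        if words[i] == "space":
            out.append(" ")
            i += 1
            continue
        # Try the longest phrase first so "dash dash" wins over "dash".
        for size in (2, 1):
            key = tuple(words[i:i + size])
            if len(key) == size and key in SYMBOLS_BY_LENGTH[size]:
                out.append(SYMBOLS_BY_LENGTH[size][key])
                i += size
                break
        else:
            # No symbol matched: pass the word through unchanged.
            out.append(words[i])
            i += 1
    # Everything concatenates; spaces only come from the "space" word.
    return "".join(out)


if __name__ == "__main__":
    print(convert("git space add space dash A space and and space git space push"))
```

Trying longer phrases before shorter ones is what keeps `dash dash` from being emitted as two separate `-` symbols that merely happen to look the same; with pairs such as `and and` the difference is visible, since `and` alone is an ordinary word.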
scripts/bench-inference.py ADDED
@@ -0,0 +1,68 @@
1
+ #!/usr/bin/env python3
2
+ """Benchmark inference latency for LoRA vs DoRA vs base model."""
3
+
4
+ import time
5
+ from mlx_lm import load, generate
6
+
7
+ SYS = "Convert the dictated text into the exact syntax it represents. Output only the result."
8
+
9
+ # A few representative samples at different lengths
10
+ samples = [
11
+ {"label": "short", "input": "dash dash verbose", "expected": "--verbose"},
12
+ {"label": "medium", "input": "export all caps API underscore KEY equals quote my dash key dash one two three quote", "expected": 'export API_KEY="my-key-123"'},
13
+ {"label": "long", "input": "git add dash A and and git commit dash M quote fix typo quote and and git push", "expected": 'git add -A && git commit -m "fix typo" && git push'},
14
+ ]
15
+
16
+ configs = [
17
+ {"label": "Qwen 0.5B (base)", "model": "mlx-community/Qwen2.5-0.5B-Instruct-4bit", "adapter": None},
18
+ {"label": "Qwen 0.5B + LoRA", "model": "mlx-community/Qwen2.5-0.5B-Instruct-4bit", "adapter": "/Users/arach/dev/talkie/datasets/finetune/adapters/qwen-0.5b-lora"},
19
+ {"label": "Qwen 0.5B + DoRA", "model": "mlx-community/Qwen2.5-0.5B-Instruct-4bit", "adapter": "/Users/arach/dev/talkie/datasets/finetune/adapters/qwen-0.5b-dora"},
20
+ ]
21
+
22
+ WARMUP = 2
23
+ RUNS = 10
24
+
25
+ for cfg in configs:
26
+ print(f"\n{'='*60}")
27
+ print(f" {cfg['label']}")
28
+ print(f"{'='*60}")
29
+
30
+ kwargs = {"adapter_path": cfg["adapter"]} if cfg["adapter"] else {}
31
+ model, tokenizer = load(cfg["model"], **kwargs)
32
+
33
+ for sample in samples:
34
+ messages = [
35
+ {"role": "system", "content": SYS},
36
+ {"role": "user", "content": sample["input"]},
37
+ ]
38
+ prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
39
+
40
+ # Warmup
41
+ for _ in range(WARMUP):
42
+ generate(model, tokenizer, prompt=prompt, max_tokens=60, verbose=False)
43
+
44
+ # Timed runs
45
+ times = []
46
+ outputs = []
47
+ for _ in range(RUNS):
48
+ t0 = time.perf_counter()
49
+ out = generate(model, tokenizer, prompt=prompt, max_tokens=60, verbose=False)
50
+ t1 = time.perf_counter()
51
+ times.append((t1 - t0) * 1000) # ms
52
+ outputs.append(out.strip())
53
+
54
+ avg = sum(times) / len(times)
55
+ mn = min(times)
56
+ mx = max(times)
57
+ last_out = outputs[-1]
58
+ match = "✓" if last_out == sample["expected"] else "✗"
59
+
60
+ print(f"\n [{sample['label']}] \"{sample['input'][:50]}{'...' if len(sample['input'])>50 else ''}\"")
61
+ print(f" avg: {avg:.1f}ms min: {mn:.1f}ms max: {mx:.1f}ms ({RUNS} runs)")
62
+ print(f" output: {last_out} {match}")
63
+
64
+ del model, tokenizer
65
+
66
+ print(f"\n{'='*60}")
67
+ print(" Done")
68
+ print(f"{'='*60}")
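The warmup-then-measure pattern used in the benchmark above can be factored into a small helper. This is a sketch under assumptions: `benchmark` is a hypothetical name, it times any zero-argument callable rather than `mlx_lm.generate` specifically, and the median is an addition that the script itself does not report.

```python
import statistics
import time


def benchmark(fn, warmup=2, runs=10):
    """Time a zero-argument callable and return latency stats in milliseconds."""
    # Warmup calls are executed but not timed, so one-off costs
    # (caches, lazy initialization) do not skew the measurements.
    for _ in range(warmup):
        fn()

    times_ms = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fn()
        times_ms.append((time.perf_counter() - t0) * 1000)

    return {
        "runs": runs,
        "avg": statistics.mean(times_ms),
        "median": statistics.median(times_ms),
        "min": min(times_ms),
        "max": max(times_ms),
    }


if __name__ == "__main__":
    stats = benchmark(lambda: sum(range(100_000)))
    print(f"avg: {stats['avg']:.2f}ms  median: {stats['median']:.2f}ms  "
          f"min: {stats['min']:.2f}ms  max: {stats['max']:.2f}ms")
```

With only ten runs a single slow outlier can move the average noticeably, which is why the sketch also returns the median.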
scripts/eval-bash-v2.py ADDED
@@ -0,0 +1,184 @@
1
+ #!/usr/bin/env python3
2
+ """Evaluate v2-trained models on the NL2Bash holdout set.
3
+
4
+ Compares minimal vs protocol prompts on the same held-out data.
5
+ Reports both exact match AND whitespace-normalized match.
6
+ """
7
+
8
+ import json
9
+ import re
10
+ import time
11
+ from collections import defaultdict
12
+ from mlx_lm import load, generate
13
+
14
+ # ── Prompts (must match training) ────────────────────────────────────────
15
+
16
+ SYSTEM_MINIMAL = (
17
+ "Reconstruct the intended syntax from the dictated text. "
18
+ "Output only the result."
19
+ )
20
+
21
+ SYSTEM_PROTOCOL = (
22
+ "Convert dictated syntax to code.\n"
23
+ "Symbol words: dash(-) dot(.) slash(/) pipe(|) star(*) bang(!) "
24
+ "hash(#) tilde(~) at(@) dollar($) percent(%) caret(^) equals(=) "
25
+ "plus(+) colon(:) semicolon(;) underscore(_) comma(,) backslash(\\)\n"
26
+ "Quotes: quote(\") single quote(') backtick(`)\n"
27
+ "Brackets: open/close paren() brace{} bracket[] angle<>\n"
28
+ "Pairs: dash dash(--) and and(&&) pipe pipe(||) dot dot(..)\n"
29
+ "Casing: camel case(camelCase) snake case(snake_case) "
30
+ "kebab case(kebab-case) pascal case(PascalCase) all caps(ALLCAPS)\n"
31
+ "Spacing: no space(join words)\n"
32
+ "Letters after dash are flags: dash L A → -la\n"
33
+ "Numbers spoken as words: forty two → 42\n"
34
+ "Output only the result."
35
+ )
36
+
37
+
38
+ def ws_normalize(s: str) -> str:
39
+ """Normalize whitespace for lenient comparison."""
40
+ return re.sub(r'\s+', ' ', s.strip())
41
+
42
+
43
+ # ── Load test data ───────────────────────────────────────────────────────
44
+
45
+ # Use the minimal test file: both prompt variants share the same test data,
46
+ # and the stored system prompt is ignored because we override it at inference time.
47
+ with open("datasets/finetune/bash-v2/minimal/test.jsonl") as f:
48
+ tests = []
49
+ for line in f:
50
+ msg = json.loads(line)["messages"]
51
+ tests.append({
52
+ "dictated": msg[1]["content"],
53
+ "expected": msg[2]["content"],
54
+ })
55
+
56
+ print(f"Loaded {len(tests)} held-out bash test entries (v2 converter)\n")
57
+
58
+ configs = [
59
+ {
60
+ "label": "1.5B + minimal prompt",
61
+ "model": "mlx-community/Qwen2.5-1.5B-Instruct-4bit",
62
+ "adapter": "datasets/finetune/adapters/qwen-1.5b-bash-v2-minimal",
63
+ "system": SYSTEM_MINIMAL,
64
+ },
65
+ {
66
+ "label": "1.5B + protocol prompt",
67
+ "model": "mlx-community/Qwen2.5-1.5B-Instruct-4bit",
68
+ "adapter": "datasets/finetune/adapters/qwen-1.5b-bash-v2-protocol",
69
+ "system": SYSTEM_PROTOCOL,
70
+ },
71
+ ]
72
+
73
+ all_results = {}
74
+
75
+ for cfg in configs:
76
+ print(f"\n{'='*60}")
77
+ print(f" {cfg['label']}")
78
+ print(f"{'='*60}\n")
79
+
80
+ try:
81
+ model, tokenizer = load(cfg["model"], adapter_path=cfg["adapter"])
82
+ except Exception as e:
83
+ print(f" SKIPPED — {e}\n")
84
+ continue
85
+
86
+ results = []
87
+ total_time = 0
88
+ errors_shown = 0
89
+
90
+ for i, t in enumerate(tests):
91
+ messages = [
92
+ {"role": "system", "content": cfg["system"]},
93
+ {"role": "user", "content": t["dictated"]},
94
+ ]
95
+ prompt = tokenizer.apply_chat_template(
96
+ messages, tokenize=False, add_generation_prompt=True
97
+ )
98
+
99
+ start = time.perf_counter()
100
+ got = generate(model, tokenizer, prompt=prompt, max_tokens=120, verbose=False)
101
+ elapsed = time.perf_counter() - start
102
+ total_time += elapsed
103
+
104
+ got = got.strip()
105
+ exact = got == t["expected"]
106
+ ws_match = ws_normalize(got) == ws_normalize(t["expected"])
107
+ case_match = got.lower() == t["expected"].lower()
108
+ ws_case = ws_normalize(got).lower() == ws_normalize(t["expected"]).lower()
109
+
110
+ results.append({
111
+ "exact": exact,
112
+ "ws_match": ws_match,
113
+ "case_match": case_match,
114
+ "ws_case": ws_case,
115
+ "got": got,
116
+ "expected": t["expected"],
117
+ "dictated": t["dictated"],
118
+ })
119
+
120
+ if not exact and errors_shown < 25:
121
+ tag = "~" if ws_match else "✗"
122
+ color = "\033[33m" if ws_match else "\033[31m"
123
+ print(f"{color}{tag}\033[0m {i+1:>3}. \"{t['dictated'][:60]}\"")
124
+ print(f" expected: {t['expected']}")
125
+ print(f" got: {got}")
126
+ errors_shown += 1
127
+
128
+ exact_correct = sum(1 for r in results if r["exact"])
129
+ ws_correct = sum(1 for r in results if r["ws_match"])
130
+ wscase_correct = sum(1 for r in results if r["ws_case"])
131
+ total = len(results)
132
+ avg_ms = round(total_time / total * 1000)
133
+
134
+ print(f"\n Exact match: {exact_correct}/{total} ({round(exact_correct/total*100, 1)}%)")
135
+ print(f" WS-normalized: {ws_correct}/{total} ({round(ws_correct/total*100, 1)}%)")
136
+ print(f" WS+case norm: {wscase_correct}/{total} ({round(wscase_correct/total*100, 1)}%)")
137
+ print(f" Avg latency: {avg_ms}ms")
138
+
139
+ all_results[cfg["label"]] = results
140
+ del model, tokenizer
141
+
142
+ # ── Summary ──────────────────────────────────────────────────────────────
143
+
144
+ print(f"\n{'='*60}")
145
+ print(" RESULTS SUMMARY")
146
+ print(f"{'='*60}\n")
147
+
148
+ print(f" {'Model':<30} {'Exact':>8} {'WS-norm':>8} {'WS+case':>8}")
149
+ print(f" {'-'*30} {'-'*8} {'-'*8} {'-'*8}")
150
+
151
+ for label, results in all_results.items():
152
+ total = len(results)
153
+ exact = sum(1 for r in results if r["exact"])
154
+ ws = sum(1 for r in results if r["ws_match"])
155
+ wsc = sum(1 for r in results if r["ws_case"])
156
+ print(f" {label:<30} {exact/total*100:>7.1f}% {ws/total*100:>7.1f}% {wsc/total*100:>7.1f}%")
157
+
158
+ # ── Error Categories ─────────────────────────────────────────────────────
159
+
160
+ if all_results:
161
+ print(f"\n{'='*60}")
162
+ print(" ERROR CATEGORIES (first model)")
163
+ print(f"{'='*60}\n")
164
+
165
+ first_results = list(all_results.values())[0]
166
+ errors = [r for r in first_results if not r["exact"]]
167
+
168
+ cats = defaultdict(int)
169
+ for r in errors:
170
+ if r["ws_match"]:
171
+ cats["spacing only"] += 1
172
+ elif r["case_match"]:
173
+ cats["case only"] += 1
174
+ elif r["ws_case"]:
175
+ cats["spacing+case only"] += 1
176
+ elif len(r["got"]) > len(r["expected"]) * 2:
177
+ cats["hallucination"] += 1
178
+ elif abs(len(r["got"]) - len(r["expected"])) <= 3:
179
+ cats["minor diff"] += 1
180
+ else:
181
+ cats["structural"] += 1
182
+
183
+ for cat, count in sorted(cats.items(), key=lambda x: -x[1]):
184
+ print(f" {cat:<20} {count:>4}")
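The error buckets in this script are nested: any output that matches after whitespace normalization also matches after whitespace and case normalization. A minimal sketch of one ordering that keeps every bucket reachable follows. The `classify` function is a hypothetical helper, and the length thresholds are copied from the heuristics in the script rather than tuned.

```python
import re


def ws_normalize(s: str) -> str:
    """Collapse runs of whitespace and trim the ends."""
    return re.sub(r"\s+", " ", s.strip())


def classify(got: str, expected: str) -> str:
    """Bucket a prediction, testing the most specific condition first."""
    if got == expected:
        return "exact"
    if ws_normalize(got) == ws_normalize(expected):
        return "spacing only"
    if got.lower() == expected.lower():
        return "case only"
    # Reached only when neither spacing nor case alone explains the diff.
    if ws_normalize(got).lower() == ws_normalize(expected).lower():
        return "spacing+case only"
    # Length-based heuristics, same thresholds as the script.
    if len(got) > len(expected) * 2:
        return "hallucination"
    if abs(len(got) - len(expected)) <= 3:
        return "minor diff"
    return "structural"


if __name__ == "__main__":
    for got in ["ls  -la", "LS -la", "LS  -la", "ls -l", "ls"]:
        print(f"{got!r:>12} -> {classify(got, 'ls -la')}")
```

Testing the combined whitespace and case condition before the two narrower ones would absorb every spacing-only and case-only error into the combined bucket, so the narrower checks come first here.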
scripts/eval-bash-v3.py ADDED
@@ -0,0 +1,194 @@
1
+ #!/usr/bin/env python3
2
+ """Evaluate v3-trained models on the NL2Bash holdout set.
3
+
4
+ v3 uses the "space as a word" model:
5
+ - The word "space" in dictation = literal space in output
6
+ - Everything else concatenates
7
+ - Casing qualifiers: "capital X", "all caps var"
8
+
9
+ Compares v3 to the v2 baseline using the SAME bash commands but different
10
+ dictation conventions.
11
+ """
12
+
13
+ import json
14
+ import re
15
+ import time
16
+ from collections import defaultdict
17
+ from mlx_lm import load, generate
18
+
19
+ # ── Prompts (must match training) ────────────────────────────────────────
20
+
21
+ SYSTEM_V3_MINIMAL = (
22
+ "Reconstruct the intended syntax from the dictated text. "
23
+ "The word 'space' means insert a literal space. "
24
+ "Everything else concatenates. "
25
+ "Output only the result."
26
+ )
27
+
28
+ SYSTEM_V2_MINIMAL = (
29
+ "Reconstruct the intended syntax from the dictated text. "
30
+ "Output only the result."
31
+ )
32
+
33
+
34
+ def ws_normalize(s: str) -> str:
35
+ """Normalize whitespace for lenient comparison."""
36
+ return re.sub(r'\s+', ' ', s.strip())
37
+
38
+
39
+ # ── Load test data ───────────────────────────────────────────────────────
40
+
41
+ # Load v3 test data
42
+ with open("datasets/finetune/bash-v3/minimal/test.jsonl") as f:
43
+ v3_tests = []
44
+ for line in f:
45
+ msg = json.loads(line)["messages"]
46
+ v3_tests.append({
47
+ "dictated": msg[1]["content"],
48
+ "expected": msg[2]["content"],
49
+ })
50
+
51
+ print(f"Loaded {len(v3_tests)} held-out bash test entries (v3 converter)\n")
52
+
53
+ configs = [
54
+ {
55
+ "label": "v3: 1.5B + space-as-word",
56
+ "model": "mlx-community/Qwen2.5-1.5B-Instruct-4bit",
57
+ "adapter": "datasets/finetune/adapters/qwen-1.5b-bash-v3-minimal",
58
+ "system": SYSTEM_V3_MINIMAL,
59
+ "tests": v3_tests,
60
+ },
61
+ ]
62
+
63
+ # Optionally load v2 for comparison (same model, different adapter + data)
64
+ try:
65
+ with open("datasets/finetune/bash-v2/minimal/test.jsonl") as f:
66
+ v2_tests = []
67
+ for line in f:
68
+ msg = json.loads(line)["messages"]
69
+ v2_tests.append({
70
+ "dictated": msg[1]["content"],
71
+ "expected": msg[2]["content"],
72
+ })
73
+ configs.append({
74
+ "label": "v2: 1.5B + baseline",
75
+ "model": "mlx-community/Qwen2.5-1.5B-Instruct-4bit",
76
+ "adapter": "datasets/finetune/adapters/qwen-1.5b-bash-v2-minimal",
77
+ "system": SYSTEM_V2_MINIMAL,
78
+ "tests": v2_tests,
79
+ })
80
+ except FileNotFoundError:
81
+ pass
82
+
83
+ all_results = {}
84
+
85
+ for cfg in configs:
86
+ print(f"\n{'='*60}")
87
+ print(f" {cfg['label']}")
88
+ print(f"{'='*60}\n")
89
+
90
+ try:
91
+ model, tokenizer = load(cfg["model"], adapter_path=cfg["adapter"])
92
+ except Exception as e:
93
+ print(f" SKIPPED — {e}\n")
94
+ continue
95
+
96
+ results = []
97
+ total_time = 0
98
+ errors_shown = 0
99
+ tests = cfg["tests"]
100
+
101
+ for i, t in enumerate(tests):
102
+ messages = [
103
+ {"role": "system", "content": cfg["system"]},
104
+ {"role": "user", "content": t["dictated"]},
105
+ ]
106
+ prompt = tokenizer.apply_chat_template(
107
+ messages, tokenize=False, add_generation_prompt=True
108
+ )
109
+
110
+ start = time.perf_counter()
111
+ got = generate(model, tokenizer, prompt=prompt, max_tokens=120, verbose=False)
112
+ elapsed = time.perf_counter() - start
113
+ total_time += elapsed
114
+
115
+ got = got.strip()
116
+ exact = got == t["expected"]
117
+ ws_match = ws_normalize(got) == ws_normalize(t["expected"])
118
+ case_match = got.lower() == t["expected"].lower()
119
+ ws_case = ws_normalize(got).lower() == ws_normalize(t["expected"]).lower()
120
+
121
+ results.append({
122
+ "exact": exact,
123
+ "ws_match": ws_match,
124
+ "case_match": case_match,
125
+ "ws_case": ws_case,
126
+ "got": got,
127
+ "expected": t["expected"],
128
+ "dictated": t["dictated"],
129
+ })
130
+
131
+ if not exact and errors_shown < 25:
132
+ tag = "~" if ws_match else "✗"
133
+ color = "\033[33m" if ws_match else "\033[31m"
134
+ print(f"{color}{tag}\033[0m {i+1:>3}. \"{t['dictated'][:60]}\"")
135
+ print(f" expected: {t['expected']}")
136
+ print(f" got: {got}")
137
+ errors_shown += 1
138
+
139
+ exact_correct = sum(1 for r in results if r["exact"])
140
+ ws_correct = sum(1 for r in results if r["ws_match"])
141
+ wscase_correct = sum(1 for r in results if r["ws_case"])
142
+ total = len(results)
143
+ avg_ms = round(total_time / total * 1000)
144
+
145
+ print(f"\n Exact match: {exact_correct}/{total} ({round(exact_correct/total*100, 1)}%)")
146
+ print(f" WS-normalized: {ws_correct}/{total} ({round(ws_correct/total*100, 1)}%)")
147
+ print(f" WS+case norm: {wscase_correct}/{total} ({round(wscase_correct/total*100, 1)}%)")
148
+ print(f" Avg latency: {avg_ms}ms")
149
+
150
+ all_results[cfg["label"]] = results
151
+ del model, tokenizer
152
+
153
+ # ── Summary ──────────────────────────────────────────────────────────────
154
+
155
+ print(f"\n{'='*60}")
156
+ print(" RESULTS SUMMARY")
157
+ print(f"{'='*60}\n")
158
+
159
+ print(f" {'Model':<30} {'Exact':>8} {'WS-norm':>8} {'WS+case':>8}")
160
+ print(f" {'-'*30} {'-'*8} {'-'*8} {'-'*8}")
161
+
162
+ for label, results in all_results.items():
163
+ total = len(results)
164
+ exact = sum(1 for r in results if r["exact"])
165
+ ws = sum(1 for r in results if r["ws_match"])
166
+ wsc = sum(1 for r in results if r["ws_case"])
167
+ print(f" {label:<30} {exact/total*100:>7.1f}% {ws/total*100:>7.1f}% {wsc/total*100:>7.1f}%")
168
+
169
+ # ── Error Categories ─────────────────────────────────────────────────────
170
+
171
+ for label, results in all_results.items():
172
+ print(f"\n{'='*60}")
173
+ print(f" ERROR CATEGORIES — {label}")
174
+ print(f"{'='*60}\n")
175
+
176
+ errors = [r for r in results if not r["exact"]]
177
+
178
+ cats = defaultdict(int)
179
+ for r in errors:
180
+ if r["ws_match"]:
180
+ cats["spacing only"] += 1
181
+ elif r["case_match"]:
182
+ cats["case only"] += 1
183
+ elif r["ws_case"]:
184
+ cats["spacing+case only"] += 1
186
+ elif len(r["got"]) > len(r["expected"]) * 2:
187
+ cats["hallucination"] += 1
188
+ elif abs(len(r["got"]) - len(r["expected"])) <= 3:
189
+ cats["minor diff"] += 1
190
+ else:
191
+ cats["structural"] += 1
192
+
193
+ for cat, count in sorted(cats.items(), key=lambda x: -x[1]):
194
+ print(f" {cat:<20} {count:>4}")
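The three accuracy columns reported above (exact, whitespace-normalized, whitespace plus case) can be computed in one pass over prediction pairs. The sketch below assumes a hypothetical `score` helper that takes `(got, expected)` tuples. Unlike the script, it returns zeros for an empty result list instead of dividing by zero, which is an added design choice.

```python
import re


def ws_normalize(s: str) -> str:
    """Collapse runs of whitespace and trim the ends."""
    return re.sub(r"\s+", " ", s.strip())


def score(pairs):
    """Return exact / WS-norm / WS+case accuracy (percent) for (got, expected) pairs."""
    pairs = list(pairs)
    total = len(pairs)
    if total == 0:
        # Nothing was evaluated (for example, the adapter failed to load).
        return {"total": 0, "exact": 0.0, "ws": 0.0, "ws_case": 0.0}

    exact = sum(g == e for g, e in pairs)
    ws = sum(ws_normalize(g) == ws_normalize(e) for g, e in pairs)
    ws_case = sum(
        ws_normalize(g).lower() == ws_normalize(e).lower() for g, e in pairs
    )
    return {
        "total": total,
        "exact": exact / total * 100,
        "ws": ws / total * 100,
        "ws_case": ws_case / total * 100,
    }


if __name__ == "__main__":
    demo = [("ls -la", "ls -la"), ("ls  -la", "ls -la"), ("LS -la", "ls -la")]
    print(score(demo))
```

Because each metric is a relaxation of the previous one, the three percentages never decrease from left to right, which makes a quick sanity check on any summary table.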
scripts/eval-bash.py ADDED
@@ -0,0 +1,137 @@
1
+ #!/usr/bin/env python3
2
+ """Evaluate fine-tuned models on the NL2Bash holdout set (604 entries).
3
+
4
+ Compares multiple adapters on the same held-out dictation → bash pairs.
5
+ """
6
+
7
+ import json
8
+ import time
9
+ from collections import defaultdict
10
+ from mlx_lm import load, generate
11
+
12
+ SYS = "Reconstruct the intended syntax from the dictated text. Output only the result."
13
+
14
+ # Load test split
15
+ with open("datasets/finetune/bash/test.jsonl") as f:
16
+ tests = []
17
+ for line in f:
18
+ msg = json.loads(line)["messages"]
19
+ tests.append({
20
+ "dictated": msg[1]["content"],
21
+ "expected": msg[2]["content"],
22
+ })
23
+
24
+ print(f"Loaded {len(tests)} held-out bash test entries\n")
25
+
26
+ configs = [
27
+ {
28
+ "label": "Qwen 0.5B + LoRA (bash)",
29
+ "model": "mlx-community/Qwen2.5-0.5B-Instruct-4bit",
30
+ "adapter": "datasets/finetune/adapters/qwen-0.5b-bash-v1",
31
+ },
32
+ {
33
+ "label": "Qwen 1.5B + LoRA (bash)",
34
+ "model": "mlx-community/Qwen2.5-1.5B-Instruct-4bit",
35
+ "adapter": "datasets/finetune/adapters/qwen-1.5b-bash-v1",
36
+ },
37
+ ]
38
+
39
+ all_results = {}
40
+
41
+ for cfg in configs:
42
+ print(f"\n{'='*60}")
43
+ print(f" {cfg['label']}")
44
+ print(f"{'='*60}\n")
45
+
46
+ try:
47
+ model, tokenizer = load(cfg["model"], adapter_path=cfg["adapter"])
48
+ except Exception as e:
49
+ print(f" SKIPPED — {e}\n")
50
+ continue
51
+
52
+ results = []
53
+ total_time = 0
54
+ errors_shown = 0
55
+
56
+ for i, t in enumerate(tests):
57
+ messages = [
58
+ {"role": "system", "content": SYS},
59
+ {"role": "user", "content": t["dictated"]},
60
+ ]
61
+ prompt = tokenizer.apply_chat_template(
62
+ messages, tokenize=False, add_generation_prompt=True
63
+ )
64
+
65
+ start = time.perf_counter()
66
+ got = generate(model, tokenizer, prompt=prompt, max_tokens=120, verbose=False)
67
+ elapsed = time.perf_counter() - start
68
+ total_time += elapsed
69
+
70
+ got = got.strip()
71
+ match = got == t["expected"]
72
+ results.append({
73
+ "match": match,
74
+ "got": got,
75
+ "expected": t["expected"],
76
+ "dictated": t["dictated"],
77
+ })
78
+
79
+ if not match and errors_shown < 30:
80
+ icon = "\033[31m✗\033[0m"
81
+ print(f"{icon} {i+1:>3}. \"{t['dictated']}\"")
82
+ print(f" expected: {t['expected']}")
83
+ print(f" got: {got}")
84
+ errors_shown += 1
85
+
86
+ correct = sum(1 for r in results if r["match"])
87
+ pct = round(correct / len(results) * 100, 1)
88
+ avg_ms = round(total_time / len(results) * 1000)
89
+ print(f"\nScore: {correct}/{len(results)} ({pct}%)")
90
+ print(f"Avg latency: {avg_ms}ms")
91
+
92
+ if errors_shown < sum(1 for r in results if not r["match"]):
93
+ print(f"({sum(1 for r in results if not r['match'])} total errors, showing first {errors_shown})")
94
+
95
+ all_results[cfg["label"]] = results
96
+ del model, tokenizer
97
+
98
+ # Summary
99
+ print(f"\n{'='*60}")
100
+ print(" RESULTS SUMMARY")
101
+ print(f"{'='*60}\n")
102
+
103
+ for label, results in all_results.items():
104
+ correct = sum(1 for r in results if r["match"])
105
+ pct = round(correct / len(results) * 100, 1)
106
+ print(f" {label}: {correct}/{len(results)} ({pct}%)")
107
+
108
+ # Error analysis: categorize common failure patterns
109
+ if all_results:
110
+ print(f"\n{'='*60}")
111
+ print(" ERROR ANALYSIS (first model)")
112
+ print(f"{'='*60}\n")
113
+
114
+ first_results = list(all_results.values())[0]
115
+ errors = [r for r in first_results if not r["match"]]
116
+
117
+ # Categorize errors
118
+ categories = defaultdict(list)
119
+ for r in errors:
120
+ got, exp = r["got"], r["expected"]
121
+ if len(got) > len(exp) * 2:
122
+ categories["hallucination"].append(r)
123
+ elif got.lower() == exp.lower():
124
+ categories["case_mismatch"].append(r)
125
+ elif got.replace(" ", "") == exp.replace(" ", ""):
126
+ categories["spacing"].append(r)
127
+ elif abs(len(got) - len(exp)) <= 3:
128
+ categories["minor_diff"].append(r)
129
+ else:
130
+ categories["structural"].append(r)
131
+
132
+ for cat, items in sorted(categories.items(), key=lambda x: -len(x[1])):
133
+ print(f" {cat}: {len(items)}")
134
+ for item in items[:3]:
135
+ print(f" \"{item['dictated'][:50]}...\"")
136
+ print(f" exp: {item['expected'][:60]}")
137
+ print(f" got: {item['got'][:60]}")
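Each eval script repeats the same loop for reading chat-format JSONL test files. A shared loader might look like the sketch below. `load_chat_tests` is a hypothetical helper, and looking messages up by their `role` key (rather than by position, as the scripts do) assumes every message in the file carries `role` and `content` fields.

```python
import json


def load_chat_tests(path):
    """Read chat-format JSONL and return dictated/expected pairs."""
    tests = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines, e.g. a trailing newline
            messages = json.loads(line)["messages"]
            # Assumption: each message is {"role": ..., "content": ...}.
            by_role = {m["role"]: m["content"] for m in messages}
            tests.append({
                "dictated": by_role["user"],
                "expected": by_role["assistant"],
            })
    return tests
```

Looking up by role keeps the loader working if a file omits the system message, where positional indexing with `msg[1]` and `msg[2]` would silently pick the wrong entries or raise an `IndexError`.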
scripts/eval-finetune.py ADDED
@@ -0,0 +1,150 @@
1
+ #!/usr/bin/env python3
2
+ """Evaluate fine-tuned LoRA models on expanded bakeoff test set.
3
+
4
+ 30 tests organized by tier:
5
+ Tier 1 (1-10): Core functionality — basic symbol subs, casing, paths, URLs
6
+ Tier 2 (11-20): Compound patterns — multi-symbol, git commands, env vars, pipes
7
+ Tier 3 (21-30): Known failure modes — dot dot, compound &&, numbers, fidelity
8
+ """
9
+
10
+ import json
11
+ import time
12
+ from collections import defaultdict
13
+ from mlx_lm import load, generate
14
+
15
+ SYS = "Reconstruct the intended syntax from the dictated text. Output only the result."
16
+
17
+ tests = [
18
+ # --- Tier 1: Core functionality ---
19
+ {"id": 1, "cat": "symbols-basic", "dictated": "hello dash world", "expected": "hello-world"},
20
+ {"id": 2, "cat": "symbols-basic", "dictated": "hello underscore world", "expected": "hello_world"},
21
+ {"id": 3, "cat": "symbols-compound", "dictated": "dash dash verbose", "expected": "--verbose"},
22
+ {"id": 4, "cat": "symbols-compound", "dictated": "equals equals equals", "expected": "==="},
23
+ {"id": 5, "cat": "casing", "dictated": "camel case get user name", "expected": "getUserName"},
24
+ {"id": 6, "cat": "casing", "dictated": "snake case total tokens generated", "expected": "total_tokens_generated"},
25
+ {"id": 7, "cat": "casing", "dictated": "kebab case dark mode toggle", "expected": "dark-mode-toggle"},
26
+ {"id": 8, "cat": "quotes", "dictated": "quote hello world quote", "expected": "\"hello world\""},
27
+ {"id": 9, "cat": "paths", "dictated": "tilde slash dev slash talkie", "expected": "~/dev/talkie"},
28
+ {"id": 10, "cat": "urls", "dictated": "HTTPS colon slash slash GitHub dot com slash arach slash talkie", "expected": "https://github.com/arach/talkie"},
29
+
30
+ # --- Tier 2: Compound patterns ---
31
+ {"id": 11, "cat": "mixed", "dictated": "git commit dash M quote fix latency quote", "expected": "git commit -m \"fix latency\""},
32
+ {"id": 12, "cat": "mixed", "dictated": "export all caps API underscore KEY equals quote my dash key dash one two three quote", "expected": "export API_KEY=\"my-key-123\""},
33
+ {"id": 13, "cat": "mixed", "dictated": "shebang slash bin slash bash", "expected": "#!/bin/bash"},
34
+ {"id": 14, "cat": "mixed", "dictated": "docker run dash D dash P eighty eighty colon eighty eighty nginx", "expected": "docker run -d -p 8080:8080 nginx"},
35
+ {"id": 15, "cat": "mixed", "dictated": "func camel case view did load open paren close paren", "expected": "func viewDidLoad()"},
36
+ {"id": 16, "cat": "mixed", "dictated": "import open brace camel case use state close brace from single quote react single quote", "expected": "import { useState } from 'react'"},
37
+ {"id": 17, "cat": "mixed", "dictated": "LS dash L A pipe grep dot swift", "expected": "ls -la | grep .swift"},
38
+ {"id": 18, "cat": "mixed", "dictated": "GH PR create dash dash title quote fix inference latency quote dash dash body quote added TTFT tracking and latency instrumentation quote", "expected": "gh pr create --title \"fix inference latency\" --body \"Added TTFT tracking and latency instrumentation\""},
39
+ {"id": 19, "cat": "identifiers", "dictated": "dot E N V dot local", "expected": ".env.local"},
40
+ {"id": 20, "cat": "operators", "dictated": "open paren X close paren fat arrow open brace close brace", "expected": "(x) => {}"},
41
+
42
+ # --- Tier 3: Known failure modes ---
43
+ {"id": 21, "cat": "symbols-compound", "dictated": "dot dot slash dev", "expected": "../dev"},
44
+ {"id": 22, "cat": "paths", "dictated": "dot dot slash dot dot slash dot dot slash", "expected": "../../../"},
45
+ {"id": 23, "cat": "paths", "dictated": "dot dot slash configs", "expected": "../configs"},
46
+ {"id": 24, "cat": "operators", "dictated": "A and and B and and C", "expected": "a && b && c"},
47
+ {"id": 25, "cat": "numbers", "dictated": "zero point seven", "expected": "0.7"},
48
+ {"id": 26, "cat": "numbers", "dictated": "one two seven dot zero dot zero dot one", "expected": "127.0.0.1"},
49
+ {"id": 27, "cat": "mixed", "dictated": "git add dash A and and git commit dash M quote fix typo quote and and git push", "expected": "git add -A && git commit -m \"fix typo\" && git push"},
50
+ {"id": 28, "cat": "spacing", "dictated": "no space git hub", "expected": "github"},
51
+ {"id": 29, "cat": "brackets", "dictated": "open bracket colon colon dash one close bracket", "expected": "[::-1]"},
52
+ {"id": 30, "cat": "mixed", "dictated": "dash dash temp zero point seven dash dash tokens five twelve", "expected": "--temp 0.7 --tokens 512"},
53
+ ]
54
+
55
+ configs = [
56
+ {
57
+ "label": "QWEN 0.5B + LoRA v3",
58
+ "model": "mlx-community/Qwen2.5-0.5B-Instruct-4bit",
59
+ "adapter": "/Users/arach/dev/talkie/datasets/finetune/adapters/qwen-0.5b-lora-v3",
60
+ },
61
+ ]
62
+
63
+ all_results = {}
64
+
65
+ for cfg in configs:
66
+ print(f"\n{'='*60}")
67
+ print(f" {cfg['label']}")
68
+ print(f"{'='*60}\n")
69
+
70
+ model, tokenizer = load(cfg["model"], adapter_path=cfg["adapter"])
71
+
72
+ results = []
73
+ total_time = 0
74
+ for t in tests:
75
+ messages = [
76
+ {"role": "system", "content": SYS},
77
+ {"role": "user", "content": t["dictated"]},
78
+ ]
79
+
80
+ prompt = tokenizer.apply_chat_template(
81
+ messages, tokenize=False, add_generation_prompt=True
82
+ )
83
+
84
+ start = time.perf_counter()
85
+ got = generate(
86
+ model, tokenizer, prompt=prompt,
87
+ max_tokens=80, verbose=False
88
+ )
89
+ elapsed = time.perf_counter() - start
90
+ total_time += elapsed
91
+
92
+ got = got.strip()
93
+ match = got == t["expected"]
94
+ results.append({"id": t["id"], "cat": t["cat"], "match": match, "got": got})
95
+
96
+ icon = "\033[32m✓\033[0m" if match else "\033[31m✗\033[0m"
97
+ print(f"{icon} {str(t['id']).rjust(2)}. \"{t['dictated']}\"")
98
+ print(f" expected: {t['expected']}")
99
+ if not match:
100
+ print(f" got: {got}")
101
+
102
+ correct = sum(1 for r in results if r["match"])
103
+ pct = round(correct / len(results) * 100)
104
+ avg_ms = round(total_time / len(results) * 1000)
105
+ print(f"\nScore: {correct}/{len(results)} ({pct}%)")
106
+ print(f"Avg latency: {avg_ms}ms per inference")
107
+ all_results[cfg["label"]] = results
108
+
109
+ # --- Per-category breakdown ---
110
+ print(f"\nPer-category accuracy:")
111
+ cat_results = defaultdict(lambda: {"correct": 0, "total": 0})
112
+ for r in results:
113
+ cat_results[r["cat"]]["total"] += 1
114
+ if r["match"]:
115
+ cat_results[r["cat"]]["correct"] += 1
116
+
117
+ for cat in sorted(cat_results.keys()):
118
+ cr = cat_results[cat]
119
+ cat_pct = round(cr["correct"] / cr["total"] * 100)
120
+ bar = "█" * cr["correct"] + "░" * (cr["total"] - cr["correct"])
121
+ print(f" {cat:<20} {cr['correct']}/{cr['total']} ({cat_pct:>3}%) {bar}")
122
+
123
+ # --- Per-tier breakdown ---
124
+ print(f"\nPer-tier accuracy:")
125
+ tiers = [
126
+ ("Tier 1: Core", results[0:10]),
127
+ ("Tier 2: Compound", results[10:20]),
128
+ ("Tier 3: Failure modes", results[20:30]),
129
+ ]
130
+ for tier_name, tier_results in tiers:
131
+ tier_correct = sum(1 for r in tier_results if r["match"])
132
+ tier_pct = round(tier_correct / len(tier_results) * 100)
133
+ print(f" {tier_name:<25} {tier_correct}/{len(tier_results)} ({tier_pct}%)")
134
+
135
+ del model, tokenizer
136
+
137
+ # Summary
138
+ print(f"\n{'='*60}")
139
+ print(" RESULTS SUMMARY")
140
+ print(f"{'='*60}\n")
141
+
142
+ for label, results in all_results.items():
143
+ correct = sum(1 for r in results if r["match"])
144
+ pct = round(correct / len(results) * 100)
145
+ print(f" {label}: {correct}/{len(results)} ({pct}%)")
146
+
147
+ print("\nBaselines:")
148
+ print(" LoRA v1 (240 train): 13/15 (87%)")
149
+ print(" LoRA v2 (474 train): 27/30 (90%)")
150
+ print(" Claude: 14/15 (93%)")
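The per-tier breakdown above slices the result list by position, which relies on the tests staying in id order. As a sketch of an alternative, the tiers can be keyed on the test `id` instead. `TIERS` and `tier_accuracy` are hypothetical names, and the id ranges are taken from the docstring of this script.

```python
from collections import defaultdict

# Tier boundaries as described in the script's docstring (ids 1-30).
TIERS = [
    ("Tier 1: Core", range(1, 11)),
    ("Tier 2: Compound", range(11, 21)),
    ("Tier 3: Failure modes", range(21, 31)),
]


def tier_accuracy(results):
    """Return {tier name: (correct, total)} for result dicts with 'id' and 'match'."""
    counts = defaultdict(lambda: [0, 0])
    for r in results:
        for name, ids in TIERS:
            if r["id"] in ids:
                counts[name][0] += int(r["match"])
                counts[name][1] += 1
                break
    return {name: tuple(pair) for name, pair in counts.items()}


if __name__ == "__main__":
    demo = [
        {"id": 1, "match": True},
        {"id": 12, "match": False},
        {"id": 30, "match": True},
    ]
    for name, (correct, total) in tier_accuracy(demo).items():
        print(f"  {name:<25} {correct}/{total}")
```

Keying on the id means a reordered or filtered test list still lands each result in the right tier.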
scripts/eval-holdout.py ADDED
@@ -0,0 +1,122 @@
1
+ #!/usr/bin/env python3
2
+ """Compare v2 vs v3 on the full held-out test split (66 entries)."""
3
+
4
+ import json
5
+ import time
6
+ from collections import defaultdict
7
+ from mlx_lm import load, generate
8
+
9
+ SYS = "Reconstruct the intended syntax from the dictated text. Output only the result."
10
+
11
+ # Load test split
12
+ with open("datasets/finetune/chat/test.jsonl") as f:
13
+ tests = []
14
+ for line in f:
15
+ msg = json.loads(line)["messages"]
16
+ tests.append({
17
+ "dictated": msg[1]["content"],
18
+ "expected": msg[2]["content"],
19
+ })
20
+
21
+ print(f"Loaded {len(tests)} held-out test entries\n")
22
+
23
+ configs = [
24
+ {
25
+ "label": "LoRA v2",
26
+ "adapter": "datasets/finetune/adapters/qwen-0.5b-lora-v2",
27
+ },
28
+ {
29
+ "label": "LoRA v3",
30
+ "adapter": "datasets/finetune/adapters/qwen-0.5b-lora-v3",
31
+ },
32
+ ]
33
+
34
+ all_results = {}
35
+
36
+ for cfg in configs:
37
+ print(f"\n{'='*60}")
38
+ print(f" {cfg['label']}")
39
+ print(f"{'='*60}\n")
40
+
41
+ model, tokenizer = load(
42
+ "mlx-community/Qwen2.5-0.5B-Instruct-4bit",
43
+ adapter_path=cfg["adapter"],
44
+ )
45
+
46
+ results = []
47
+ total_time = 0
48
+ for i, t in enumerate(tests):
49
+ messages = [
50
+ {"role": "system", "content": SYS},
51
+ {"role": "user", "content": t["dictated"]},
52
+ ]
53
+ prompt = tokenizer.apply_chat_template(
54
+ messages, tokenize=False, add_generation_prompt=True
55
+ )
56
+
57
+ start = time.perf_counter()
58
+ got = generate(model, tokenizer, prompt=prompt, max_tokens=80, verbose=False)
59
+ elapsed = time.perf_counter() - start
60
+ total_time += elapsed
61
+
62
+ got = got.strip()
63
+ match = got == t["expected"]
64
+ results.append({"match": match, "got": got, "expected": t["expected"], "dictated": t["dictated"]})
65
+
66
+ icon = "\033[32m✓\033[0m" if match else "\033[31m✗\033[0m"
67
+ if not match:
68
+ print(f"{icon} {i+1:>2}. \"{t['dictated']}\"")
69
+ print(f" expected: {t['expected']}")
70
+ print(f" got: {got}")
71
+
72
+ correct = sum(1 for r in results if r["match"])
73
+ pct = round(correct / len(results) * 100, 1)
74
+ avg_ms = round(total_time / len(results) * 1000)
75
+ print(f"\nScore: {correct}/{len(results)} ({pct}%)")
76
+ print(f"Avg latency: {avg_ms}ms")
77
+ all_results[cfg["label"]] = results
78
+ del model, tokenizer
79
+
80
+ # Head-to-head comparison
81
+ print(f"\n{'='*60}")
82
+ print(" HEAD TO HEAD")
83
+ print(f"{'='*60}\n")
84
+
85
+ v2 = all_results["LoRA v2"]
86
+ v3 = all_results["LoRA v3"]
87
+
88
+ v2_only = [] # v2 right, v3 wrong
89
+ v3_only = [] # v3 right, v2 wrong
90
+ both_wrong = []
91
+
92
+ for i in range(len(tests)):
93
+ if v2[i]["match"] and not v3[i]["match"]:
94
+ v2_only.append((i, tests[i], v3[i]["got"]))
95
+ elif v3[i]["match"] and not v2[i]["match"]:
96
+ v3_only.append((i, tests[i], v2[i]["got"]))
97
+ elif not v2[i]["match"] and not v3[i]["match"]:
98
+ both_wrong.append((i, tests[i], v2[i]["got"], v3[i]["got"]))
99
+
100
+ v2_correct = sum(1 for r in v2 if r["match"])
101
+ v3_correct = sum(1 for r in v3 if r["match"])
102
+
103
+ print(f"LoRA v2: {v2_correct}/{len(tests)} ({round(v2_correct/len(tests)*100,1)}%)")
104
+ print(f"LoRA v3: {v3_correct}/{len(tests)} ({round(v3_correct/len(tests)*100,1)}%)")
105
+
106
+ if v2_only:
107
+ print(f"\nv2 correct, v3 wrong ({len(v2_only)}):")
108
+ for i, t, got in v2_only:
109
+ print(f" {i+1}. \"{t['dictated']}\" → expected \"{t['expected']}\" got \"{got}\"")
110
+
111
+ if v3_only:
112
+ print(f"\nv3 correct, v2 wrong ({len(v3_only)}):")
113
+ for i, t, got in v3_only:
114
+ print(f" {i+1}. \"{t['dictated']}\" → expected \"{t['expected']}\" got \"{got}\"")
115
+
116
+ if both_wrong:
117
+ print(f"\nBoth wrong ({len(both_wrong)}):")
118
+ for i, t, v2_got, v3_got in both_wrong:
119
+ print(f" {i+1}. \"{t['dictated']}\"")
120
+ print(f" expected: \"{t['expected']}\"")
121
+ print(f" v2: \"{v2_got}\"")
122
+ print(f" v3: \"{v3_got}\"")
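The head-to-head section of this script sorts each test index into "v2 only", "v3 only", or "both wrong". A minimal sketch of that partition as a standalone function is shown here, assuming both runs cover the same examples in the same order. `head_to_head` and the toy inputs are hypothetical names used for illustration, not part of the commit.

```python
def head_to_head(a, b):
    """Partition example indices by which of two runs got them right.

    `a` and `b` are lists of dicts with a boolean "match" key, aligned
    by index. Illustrative helper; the committed script does this inline.
    """
    if len(a) != len(b):
        raise ValueError("both runs must cover the same examples")
    buckets = {"both_right": [], "a_only": [], "b_only": [], "both_wrong": []}
    for i, (ra, rb) in enumerate(zip(a, b)):
        if ra["match"] and rb["match"]:
            key = "both_right"
        elif ra["match"]:
            key = "a_only"
        elif rb["match"]:
            key = "b_only"
        else:
            key = "both_wrong"
        buckets[key].append(i)
    return buckets


# Toy data covering all four outcomes.
v2 = [{"match": m} for m in (True, True, False, False)]
v3 = [{"match": m} for m in (True, False, True, False)]

buckets = head_to_head(v2, v3)
for name, indices in buckets.items():
    print(f"{name}: {indices}")
```

Raising on mismatched lengths guards against comparing runs that were evaluated on different test splits.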
training/adapters/llama-1b-lora-v2/adapter_config.json ADDED
@@ -0,0 +1,40 @@
1
+ {
2
+ "adapter_path": "datasets/finetune/adapters/llama-1b-lora-v2",
3
+ "batch_size": 4,
4
+ "config": null,
5
+ "data": "datasets/finetune/chat",
6
+ "fine_tune_type": "lora",
7
+ "grad_accumulation_steps": 1,
8
+ "grad_checkpoint": false,
9
+ "iters": 500,
10
+ "learning_rate": 0.0001,
11
+ "lora_parameters": {
12
+ "rank": 8,
13
+ "dropout": 0.0,
14
+ "scale": 20.0
15
+ },
16
+ "lr_schedule": null,
17
+ "mask_prompt": true,
18
+ "max_seq_length": 2048,
19
+ "model": "mlx-community/Llama-3.2-1B-Instruct-4bit",
20
+ "num_layers": 16,
21
+ "optimizer": "adam",
22
+ "optimizer_config": {
23
+ "adam": {},
24
+ "adamw": {},
25
+ "muon": {},
26
+ "sgd": {},
27
+ "adafactor": {}
28
+ },
29
+ "project_name": null,
30
+ "report_to": null,
31
+ "resume_adapter_file": null,
32
+ "save_every": 100,
33
+ "seed": 42,
34
+ "steps_per_eval": 50,
35
+ "steps_per_report": 10,
36
+ "test": true,
37
+ "test_batches": 500,
38
+ "train": true,
39
+ "val_batches": 25
40
+ }
training/adapters/llama-1b-lora/adapter_config.json ADDED
@@ -0,0 +1,40 @@
1
+ {
2
+ "adapter_path": "/Users/arach/dev/talkie/datasets/finetune/adapters/llama-1b-lora",
3
+ "batch_size": 4,
4
+ "config": null,
5
+ "data": "/Users/arach/dev/talkie/datasets/finetune/chat",
6
+ "fine_tune_type": "lora",
7
+ "grad_accumulation_steps": 1,
8
+ "grad_checkpoint": false,
9
+ "iters": 200,
10
+ "learning_rate": 0.0001,
11
+ "lora_parameters": {
12
+ "rank": 8,
13
+ "dropout": 0.0,
14
+ "scale": 20.0
15
+ },
16
+ "lr_schedule": null,
17
+ "mask_prompt": true,
18
+ "max_seq_length": 2048,
19
+ "model": "mlx-community/Llama-3.2-1B-Instruct-4bit",
20
+ "num_layers": 16,
21
+ "optimizer": "adam",
22
+ "optimizer_config": {
23
+ "adam": {},
24
+ "adamw": {},
25
+ "muon": {},
26
+ "sgd": {},
27
+ "adafactor": {}
28
+ },
29
+ "project_name": null,
30
+ "report_to": null,
31
+ "resume_adapter_file": null,
32
+ "save_every": 50,
33
+ "seed": 42,
34
+ "steps_per_eval": 50,
35
+ "steps_per_report": 10,
36
+ "test": true,
37
+ "test_batches": 500,
38
+ "train": true,
39
+ "val_batches": 25
40
+ }
training/adapters/qwen-0.5b-bash-v1/adapter_config.json ADDED
@@ -0,0 +1,40 @@
1
+ {
2
+ "adapter_path": "datasets/finetune/adapters/qwen-0.5b-bash-v1",
3
+ "batch_size": 4,
4
+ "config": null,
5
+ "data": "datasets/finetune/bash",
6
+ "fine_tune_type": "lora",
7
+ "grad_accumulation_steps": 1,
8
+ "grad_checkpoint": false,
9
+ "iters": 1000,
10
+ "learning_rate": 0.0001,
11
+ "lora_parameters": {
12
+ "rank": 8,
13
+ "dropout": 0.0,
14
+ "scale": 20.0
15
+ },
16
+ "lr_schedule": null,
17
+ "mask_prompt": true,
18
+ "max_seq_length": 2048,
19
+ "model": "mlx-community/Qwen2.5-0.5B-Instruct-4bit",
20
+ "num_layers": 16,
21
+ "optimizer": "adam",
22
+ "optimizer_config": {
23
+ "adam": {},
24
+ "adamw": {},
25
+ "muon": {},
26
+ "sgd": {},
27
+ "adafactor": {}
28
+ },
29
+ "project_name": null,
30
+ "report_to": null,
31
+ "resume_adapter_file": null,
32
+ "save_every": 200,
33
+ "seed": 42,
34
+ "steps_per_eval": 200,
35
+ "steps_per_report": 50,
36
+ "test": true,
37
+ "test_batches": 500,
38
+ "train": true,
39
+ "val_batches": 25
40
+ }
training/adapters/qwen-0.5b-dora/adapter_config.json ADDED
@@ -0,0 +1,40 @@
1
+ {
2
+ "adapter_path": "/Users/arach/dev/talkie/datasets/finetune/adapters/qwen-0.5b-dora",
3
+ "batch_size": 4,
4
+ "config": null,
5
+ "data": "/Users/arach/dev/talkie/datasets/finetune/chat",
6
+ "fine_tune_type": "dora",
7
+ "grad_accumulation_steps": 1,
8
+ "grad_checkpoint": false,
9
+ "iters": 300,
10
+ "learning_rate": 5e-05,
11
+ "lora_parameters": {
12
+ "rank": 8,
13
+ "dropout": 0.0,
14
+ "scale": 20.0
15
+ },
16
+ "lr_schedule": null,
17
+ "mask_prompt": true,
18
+ "max_seq_length": 2048,
19
+ "model": "mlx-community/Qwen2.5-0.5B-Instruct-4bit",
20
+ "num_layers": 16,
21
+ "optimizer": "adam",
22
+ "optimizer_config": {
23
+ "adam": {},
24
+ "adamw": {},
25
+ "muon": {},
26
+ "sgd": {},
27
+ "adafactor": {}
28
+ },
29
+ "project_name": null,
30
+ "report_to": null,
31
+ "resume_adapter_file": null,
32
+ "save_every": 100,
33
+ "seed": 42,
34
+ "steps_per_eval": 50,
35
+ "steps_per_report": 10,
36
+ "test": true,
37
+ "test_batches": 500,
38
+ "train": true,
39
+ "val_batches": 25
40
+ }
training/adapters/qwen-0.5b-lora-v2/adapter_config.json ADDED
@@ -0,0 +1,40 @@
1
+ {
2
+ "adapter_path": "datasets/finetune/adapters/qwen-0.5b-lora-v2",
3
+ "batch_size": 4,
4
+ "config": null,
5
+ "data": "datasets/finetune/chat",
6
+ "fine_tune_type": "lora",
7
+ "grad_accumulation_steps": 1,
8
+ "grad_checkpoint": false,
9
+ "iters": 500,
10
+ "learning_rate": 0.0001,
11
+ "lora_parameters": {
12
+ "rank": 8,
13
+ "dropout": 0.0,
14
+ "scale": 20.0
15
+ },
16
+ "lr_schedule": null,
17
+ "mask_prompt": true,
18
+ "max_seq_length": 2048,
19
+ "model": "mlx-community/Qwen2.5-0.5B-Instruct-4bit",
20
+ "num_layers": 16,
21
+ "optimizer": "adam",
22
+ "optimizer_config": {
23
+ "adam": {},
24
+ "adamw": {},
25
+ "muon": {},
26
+ "sgd": {},
27
+ "adafactor": {}
28
+ },
29
+ "project_name": null,
30
+ "report_to": null,
31
+ "resume_adapter_file": null,
32
+ "save_every": 100,
33
+ "seed": 42,
34
+ "steps_per_eval": 50,
35
+ "steps_per_report": 10,
36
+ "test": true,
37
+ "test_batches": 500,
38
+ "train": true,
39
+ "val_batches": 25
40
+ }
training/adapters/qwen-0.5b-lora-v3/adapter_config.json ADDED
@@ -0,0 +1,40 @@
1
+ {
2
+ "adapter_path": "datasets/finetune/adapters/qwen-0.5b-lora-v3",
3
+ "batch_size": 4,
4
+ "config": null,
5
+ "data": "datasets/finetune/chat",
6
+ "fine_tune_type": "lora",
7
+ "grad_accumulation_steps": 1,
8
+ "grad_checkpoint": false,
9
+ "iters": 500,
10
+ "learning_rate": 0.0001,
11
+ "lora_parameters": {
12
+ "rank": 8,
13
+ "dropout": 0.0,
14
+ "scale": 20.0
15
+ },
16
+ "lr_schedule": null,
17
+ "mask_prompt": true,
18
+ "max_seq_length": 2048,
19
+ "model": "mlx-community/Qwen2.5-0.5B-Instruct-4bit",
20
+ "num_layers": 16,
21
+ "optimizer": "adam",
22
+ "optimizer_config": {
23
+ "adam": {},
24
+ "adamw": {},
25
+ "muon": {},
26
+ "sgd": {},
27
+ "adafactor": {}
28
+ },
29
+ "project_name": null,
30
+ "report_to": null,
31
+ "resume_adapter_file": null,
32
+ "save_every": 100,
33
+ "seed": 42,
34
+ "steps_per_eval": 50,
35
+ "steps_per_report": 10,
36
+ "test": true,
37
+ "test_batches": 500,
38
+ "train": true,
39
+ "val_batches": 25
40
+ }
training/adapters/qwen-0.5b-lora/adapter_config.json ADDED
@@ -0,0 +1,40 @@
1
+ {
2
+ "adapter_path": "/Users/arach/dev/talkie/datasets/finetune/adapters/qwen-0.5b-lora",
3
+ "batch_size": 4,
4
+ "config": null,
5
+ "data": "/Users/arach/dev/talkie/datasets/finetune/chat",
6
+ "fine_tune_type": "lora",
7
+ "grad_accumulation_steps": 1,
8
+ "grad_checkpoint": false,
9
+ "iters": 200,
10
+ "learning_rate": 0.0001,
11
+ "lora_parameters": {
12
+ "rank": 8,
13
+ "dropout": 0.0,
14
+ "scale": 20.0
15
+ },
16
+ "lr_schedule": null,
17
+ "mask_prompt": true,
18
+ "max_seq_length": 2048,
19
+ "model": "mlx-community/Qwen2.5-0.5B-Instruct-4bit",
20
+ "num_layers": 16,
21
+ "optimizer": "adam",
22
+ "optimizer_config": {
23
+ "adam": {},
24
+ "adamw": {},
25
+ "muon": {},
26
+ "sgd": {},
27
+ "adafactor": {}
28
+ },
29
+ "project_name": null,
30
+ "report_to": null,
31
+ "resume_adapter_file": null,
32
+ "save_every": 50,
33
+ "seed": 42,
34
+ "steps_per_eval": 50,
35
+ "steps_per_report": 10,
36
+ "test": true,
37
+ "test_batches": 500,
38
+ "train": true,
39
+ "val_batches": 25
40
+ }
training/adapters/qwen-1.5b-bash-v1/adapter_config.json ADDED
@@ -0,0 +1,40 @@
1
+ {
2
+ "adapter_path": "datasets/finetune/adapters/qwen-1.5b-bash-v1",
3
+ "batch_size": 4,
4
+ "config": null,
5
+ "data": "datasets/finetune/bash",
6
+ "fine_tune_type": "lora",
7
+ "grad_accumulation_steps": 1,
8
+ "grad_checkpoint": false,
9
+ "iters": 1000,
10
+ "learning_rate": 0.0001,
11
+ "lora_parameters": {
12
+ "rank": 8,
13
+ "dropout": 0.0,
14
+ "scale": 20.0
15
+ },
16
+ "lr_schedule": null,
17
+ "mask_prompt": true,
18
+ "max_seq_length": 2048,
19
+ "model": "mlx-community/Qwen2.5-1.5B-Instruct-4bit",
20
+ "num_layers": 16,
21
+ "optimizer": "adam",
22
+ "optimizer_config": {
23
+ "adam": {},
24
+ "adamw": {},
25
+ "muon": {},
26
+ "sgd": {},
27
+ "adafactor": {}
28
+ },
29
+ "project_name": null,
30
+ "report_to": null,
31
+ "resume_adapter_file": null,
32
+ "save_every": 200,
33
+ "seed": 42,
34
+ "steps_per_eval": 200,
35
+ "steps_per_report": 50,
36
+ "test": true,
37
+ "test_batches": 500,
38
+ "train": true,
39
+ "val_batches": 25
40
+ }
training/adapters/qwen-1.5b-bash-v2-minimal/adapter_config.json ADDED
@@ -0,0 +1,40 @@
1
+ {
2
+ "adapter_path": "datasets/finetune/adapters/qwen-1.5b-bash-v2-minimal",
3
+ "batch_size": 4,
4
+ "config": null,
5
+ "data": "datasets/finetune/bash-v2/minimal",
6
+ "fine_tune_type": "lora",
7
+ "grad_accumulation_steps": 1,
8
+ "grad_checkpoint": false,
9
+ "iters": 1000,
10
+ "learning_rate": 0.0001,
11
+ "lora_parameters": {
12
+ "rank": 8,
13
+ "dropout": 0.0,
14
+ "scale": 20.0
15
+ },
16
+ "lr_schedule": null,
17
+ "mask_prompt": true,
18
+ "max_seq_length": 2048,
19
+ "model": "mlx-community/Qwen2.5-1.5B-Instruct-4bit",
20
+ "num_layers": 16,
21
+ "optimizer": "adam",
22
+ "optimizer_config": {
23
+ "adam": {},
24
+ "adamw": {},
25
+ "muon": {},
26
+ "sgd": {},
27
+ "adafactor": {}
28
+ },
29
+ "project_name": null,
30
+ "report_to": null,
31
+ "resume_adapter_file": null,
32
+ "save_every": 200,
33
+ "seed": 42,
34
+ "steps_per_eval": 200,
35
+ "steps_per_report": 50,
36
+ "test": true,
37
+ "test_batches": 500,
38
+ "train": true,
39
+ "val_batches": 25
40
+ }
training/adapters/qwen-1.5b-bash-v2-protocol/adapter_config.json ADDED
@@ -0,0 +1,40 @@
1
+ {
2
+ "adapter_path": "datasets/finetune/adapters/qwen-1.5b-bash-v2-protocol",
3
+ "batch_size": 4,
4
+ "config": null,
5
+ "data": "datasets/finetune/bash-v2/protocol",
6
+ "fine_tune_type": "lora",
7
+ "grad_accumulation_steps": 1,
8
+ "grad_checkpoint": false,
9
+ "iters": 1000,
10
+ "learning_rate": 0.0001,
11
+ "lora_parameters": {
12
+ "rank": 8,
13
+ "dropout": 0.0,
14
+ "scale": 20.0
15
+ },
16
+ "lr_schedule": null,
17
+ "mask_prompt": true,
18
+ "max_seq_length": 2048,
19
+ "model": "mlx-community/Qwen2.5-1.5B-Instruct-4bit",
20
+ "num_layers": 16,
21
+ "optimizer": "adam",
22
+ "optimizer_config": {
23
+ "adam": {},
24
+ "adamw": {},
25
+ "muon": {},
26
+ "sgd": {},
27
+ "adafactor": {}
28
+ },
29
+ "project_name": null,
30
+ "report_to": null,
31
+ "resume_adapter_file": null,
32
+ "save_every": 200,
33
+ "seed": 42,
34
+ "steps_per_eval": 200,
35
+ "steps_per_report": 50,
36
+ "test": true,
37
+ "test_batches": 500,
38
+ "train": true,
39
+ "val_batches": 25
40
+ }
training/adapters/qwen-1.5b-bash-v3-minimal/adapter_config.json ADDED
@@ -0,0 +1,40 @@
1
+ {
2
+ "adapter_path": "datasets/finetune/adapters/qwen-1.5b-bash-v3-minimal",
3
+ "batch_size": 4,
4
+ "config": null,
5
+ "data": "datasets/finetune/bash-v3/minimal",
6
+ "fine_tune_type": "lora",
7
+ "grad_accumulation_steps": 1,
8
+ "grad_checkpoint": false,
9
+ "iters": 800,
10
+ "learning_rate": 0.0001,
11
+ "lora_parameters": {
12
+ "rank": 8,
13
+ "dropout": 0.0,
14
+ "scale": 20.0
15
+ },
16
+ "lr_schedule": null,
17
+ "mask_prompt": true,
18
+ "max_seq_length": 2048,
19
+ "model": "mlx-community/Qwen2.5-1.5B-Instruct-4bit",
20
+ "num_layers": 16,
21
+ "optimizer": "adam",
22
+ "optimizer_config": {
23
+ "adam": {},
24
+ "adamw": {},
25
+ "muon": {},
26
+ "sgd": {},
27
+ "adafactor": {}
28
+ },
29
+ "project_name": null,
30
+ "report_to": null,
31
+ "resume_adapter_file": "datasets/finetune/adapters/qwen-1.5b-bash-v3-minimal/adapters.safetensors",
32
+ "save_every": 200,
33
+ "seed": 42,
34
+ "steps_per_eval": 200,
35
+ "steps_per_report": 50,
36
+ "test": true,
37
+ "test_batches": 500,
38
+ "train": true,
39
+ "val_batches": 25
40
+ }
training/adapters/qwen-1.5b-lora-v1/adapter_config.json ADDED
@@ -0,0 +1,40 @@
1
+ {
2
+ "adapter_path": "datasets/finetune/adapters/qwen-1.5b-lora-v1",
3
+ "batch_size": 4,
4
+ "config": null,
5
+ "data": "datasets/finetune/chat",
6
+ "fine_tune_type": "lora",
7
+ "grad_accumulation_steps": 1,
8
+ "grad_checkpoint": false,
9
+ "iters": 500,
10
+ "learning_rate": 0.0001,
11
+ "lora_parameters": {
12
+ "rank": 8,
13
+ "dropout": 0.0,
14
+ "scale": 20.0
15
+ },
16
+ "lr_schedule": null,
17
+ "mask_prompt": true,
18
+ "max_seq_length": 2048,
19
+ "model": "mlx-community/Qwen2.5-1.5B-Instruct-4bit",
20
+ "num_layers": 16,
21
+ "optimizer": "adam",
22
+ "optimizer_config": {
23
+ "adam": {},
24
+ "adamw": {},
25
+ "muon": {},
26
+ "sgd": {},
27
+ "adafactor": {}
28
+ },
29
+ "project_name": null,
30
+ "report_to": null,
31
+ "resume_adapter_file": null,
32
+ "save_every": 100,
33
+ "seed": 42,
34
+ "steps_per_eval": 50,
35
+ "steps_per_report": 10,
36
+ "test": true,
37
+ "test_batches": 500,
38
+ "train": true,
39
+ "val_batches": 25
40
+ }
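The adapter configs in this commit share most fields and differ in only a few (for example `fine_tune_type`, `iters`, `learning_rate`, `save_every`, and the data/adapter paths). A small sketch for listing the differing keys between two configs follows. `diff_configs` is a hypothetical helper, and the two dicts are abbreviated copies of the `qwen-0.5b-lora` and `qwen-0.5b-dora` values shown in this diff, not the full files.

```python
def diff_configs(a, b):
    """Return {key: (value_in_a, value_in_b)} for top-level keys that differ.

    Nested dicts are compared as whole values. Illustrative helper only.
    """
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


# Abbreviated configs; values copied from the two adapter_config.json files.
lora = {
    "fine_tune_type": "lora",
    "iters": 200,
    "learning_rate": 0.0001,
    "lora_parameters": {"rank": 8, "dropout": 0.0, "scale": 20.0},
    "save_every": 50,
}
dora = {
    "fine_tune_type": "dora",
    "iters": 300,
    "learning_rate": 5e-05,
    "lora_parameters": {"rank": 8, "dropout": 0.0, "scale": 20.0},
    "save_every": 100,
}

changes = diff_configs(lora, dora)
for key, (old, new) in changes.items():
    print(f"{key}: {old} -> {new}")
```

With real files, each dict would come from `json.load` on the corresponding `adapter_config.json`.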
training/converters/bash-to-dictation-v2.py ADDED
@@ -0,0 +1,394 @@
1
+ #!/usr/bin/env python3
2
+ """Convert NL2Bash commands → dictation training pairs for syntax reconstruction.
3
+
4
+ v2: Improved converter following the Talkie Dictation Protocol.
5
+
6
+ Key improvements over v1:
7
+ - Natural number words (0-99) instead of digit-by-digit
8
+ - Smarter flag handling: -la → "dash L A", -name → "dash name"
9
+ - Token-level processing: handles tightly-bound tokens like 2>/dev/null
10
+ - Consistent spacing: whitespace-separated tokens in bash = word boundaries
11
+ - Better multi-char pattern matching
12
+ """
13
+
14
+ import json
15
+ import random
16
+ import re
17
+ from pathlib import Path
18
+
19
+ SEED = 42
20
+
21
+ # ── System Prompts ───────────────────────────────────────────────────────
22
+
23
+ SYSTEM_MINIMAL = (
24
+ "Reconstruct the intended syntax from the dictated text. "
25
+ "Output only the result."
26
+ )
27
+
28
+ SYSTEM_PROTOCOL = (
29
+ "Convert dictated syntax to code.\n"
30
+ "Symbol words: dash(-) dot(.) slash(/) pipe(|) star(*) bang(!) "
31
+ "hash(#) tilde(~) at(@) dollar($) percent(%) caret(^) equals(=) "
32
+ "plus(+) colon(:) semicolon(;) underscore(_) comma(,) backslash(\\)\n"
33
+ "Quotes: quote(\") single quote(') backtick(`)\n"
34
+ "Brackets: open/close paren() brace{} bracket[] angle<>\n"
35
+ "Pairs: dash dash(--) and and(&&) pipe pipe(||) dot dot(..)\n"
36
+ "Casing: camel case(camelCase) snake case(snake_case) "
37
+ "kebab case(kebab-case) pascal case(PascalCase) all caps(ALLCAPS)\n"
38
+ "Spacing: no space(join words)\n"
39
+ "Letters after dash are flags: dash L A → -la\n"
40
+ "Numbers spoken as words: forty two → 42\n"
41
+ "Output only the result."
42
+ )
43
+
44
+ # ── Number Words ─────────────────────────────────────────────────────────
45
+
46
+ ONES = [
47
+ 'zero', 'one', 'two', 'three', 'four', 'five', 'six', 'seven',
48
+ 'eight', 'nine', 'ten', 'eleven', 'twelve', 'thirteen', 'fourteen',
49
+ 'fifteen', 'sixteen', 'seventeen', 'eighteen', 'nineteen',
50
+ ]
51
+ TENS = [
52
+ '', '', 'twenty', 'thirty', 'forty', 'fifty',
53
+ 'sixty', 'seventy', 'eighty', 'ninety',
54
+ ]
55
+
56
+
57
+ def number_to_words(num_str: str) -> str:
58
+ """Convert a number string to spoken words.
59
+
60
+ Rules:
61
+ - 0-99: natural words (zero, twelve, forty two, ninety nine)
62
+ - 100+: digit-by-digit (one two seven, eight zero eight zero)
63
+ - Leading zeros: always digit-by-digit (zero six four four for 0644)
64
+
65
+ Takes string not int to preserve leading zeros.
66
+ """
67
+ # Leading zeros → always digit-by-digit
68
+ if len(num_str) > 1 and num_str[0] == '0':
69
+ return ' '.join(ONES[int(d)] for d in num_str)
70
+
71
+ n = int(num_str)
72
+
73
+ # 0-99: natural spoken form
74
+ if n < 20:
75
+ return ONES[n]
76
+ if n < 100:
77
+ if n % 10 == 0:
78
+ return TENS[n // 10]
79
+ return f"{TENS[n // 10]} {ONES[n % 10]}"
80
+
81
+ # 100+: digit-by-digit (unambiguous, matches how people dictate IPs, ports, etc.)
82
+ return ' '.join(ONES[int(d)] for d in num_str)
83
+
84
+
85
+ # ── Single char map ──────────────────────────────────────────────────────
86
+
87
+ CHAR_MAP = {
88
+ '-': 'dash',
89
+ '.': 'dot',
90
+ '/': 'slash',
91
+ '|': 'pipe',
92
+ '>': 'greater than',
93
+ '<': 'less than',
94
+ '(': 'open paren',
95
+ ')': 'close paren',
96
+ '{': 'open brace',
97
+ '}': 'close brace',
98
+ '[': 'open bracket',
99
+ ']': 'close bracket',
100
+ '"': 'quote',
101
+ "'": 'single quote',
102
+ '`': 'backtick',
103
+ '*': 'star',
104
+ '~': 'tilde',
105
+ '@': 'at',
106
+ '#': 'hash',
107
+ '$': 'dollar',
108
+ '%': 'percent',
109
+ '^': 'caret',
110
+ '&': 'ampersand',
111
+ '=': 'equals',
112
+ '+': 'plus',
113
+ ':': 'colon',
114
+ ';': 'semicolon',
115
+ '?': 'question mark',
116
+ '!': 'bang',
117
+ '\\': 'backslash',
118
+ '_': 'underscore',
119
+ ',': 'comma',
120
+ }
121
+
122
+ # ── Filters ──────────────────────────────────────────────────────────────
123
+
124
+ SKIP_PATTERNS = [
125
+ re.compile(r'`'), # backtick subshells
126
+ re.compile(r'\$\('), # $() subshells
127
+ re.compile(r'\\[nt"\'()]'), # escape sequences
128
+ re.compile(r'\\\\'), # double backslash
129
+ re.compile(r"awk\s+'"), # awk scripts
130
+ re.compile(r"sed\s+'"), # sed scripts
131
+ re.compile(r"sed\s+-"), # sed with flags
132
+ re.compile(r'\{[0-9]'), # brace expansion {1..10}
133
+ re.compile(r'<<<'), # here-strings
134
+ re.compile(r'<<'), # here-docs
135
+ re.compile(r'\[\['), # bash test brackets
136
+ re.compile(r'\bif\s'), # conditionals
137
+ re.compile(r'\bfor\s'), # loops
138
+ re.compile(r'\bwhile\s'), # loops
139
+ re.compile(r'\bdo\b'), # loop body
140
+ re.compile(r'\bdone\b'), # loop end
141
+ re.compile(r'\bthen\b'), # conditional body
142
+ re.compile(r'\bfi\b'), # conditional end
143
+ re.compile(r'\bcase\s'), # case statements
144
+ re.compile(r'\besac\b'), # case end
145
+ re.compile(r'\bfunction\b'),# function defs
146
+ re.compile(r'\(\)'), # function parens
147
+ re.compile(r'printf\s'), # printf format strings
148
+ re.compile(r'\$\{'), # parameter expansion
149
+ re.compile(r'[^\x00-\x7F]'), # non-ASCII characters
150
+ ]
151
+
152
+
153
+ def should_skip(cmd: str) -> bool:
154
+ """Return True if command is too complex for dictation."""
155
+ if len(cmd) > 60:
156
+ return True
157
+ if len(cmd) < 5:
158
+ return True
159
+ if cmd.count('|') > 2:
160
+ return True
161
+ for pat in SKIP_PATTERNS:
162
+ if pat.search(cmd):
163
+ return True
164
+ return False
165
+
166
+
167
+ # ── Token Converter ──────────────────────────────────────────────────────
168
+
169
+ def convert_token(token: str) -> str | None:
170
+ """Convert a single whitespace-delimited bash token to dictation.
171
+
172
+ Processes the token character by character, accumulating letter runs
173
+ and converting symbols/numbers to spoken form.
174
+
175
+ Returns None if the token contains unconvertible characters.
176
+ """
177
+ parts = []
178
+ i = 0
179
+ n = len(token)
180
+
181
+ while i < n:
182
+ c = token[i]
183
+
184
+ # ── Multi-char patterns ──
185
+ # && and ||
186
+ if c == '&' and i + 1 < n and token[i + 1] == '&':
187
+ parts.append('and and')
188
+ i += 2
189
+ continue
190
+ if c == '|' and i + 1 < n and token[i + 1] == '|':
191
+ parts.append('pipe pipe')
192
+ i += 2
193
+ continue
194
+
195
+ # -- (double dash)
196
+ if c == '-' and i + 1 < n and token[i + 1] == '-':
197
+ parts.append('dash dash')
198
+ i += 2
199
+ continue
200
+
201
+ # .. (double dot)
202
+ if c == '.' and i + 1 < n and token[i + 1] == '.':
203
+ parts.append('dot dot')
204
+ i += 2
205
+ continue
206
+
207
+ # >> (append redirect)
208
+ if c == '>' and i + 1 < n and token[i + 1] == '>':
209
+ parts.append('append')
210
+ i += 2
211
+ continue
212
+
213
+ # 2> (stderr redirect): matched wherever a "2" that begins a digit run is directly followed by ">"; the position within the token is not checked
214
+ if c == '2' and i + 1 < n and token[i + 1] == '>':
215
+ if i + 2 < n and token[i + 2] == '&':
216
+ parts.append('two redirect ampersand')
217
+ i += 3
218
+ else:
219
+ parts.append('two redirect')
220
+ i += 2
221
+ continue
222
+
223
+ # ── Number runs ──
224
+ if c.isdigit():
225
+ num_start = i
226
+ while i < n and token[i].isdigit():
227
+ i += 1
228
+ num_str = token[num_start:i]
229
+ parts.append(number_to_words(num_str))
230
+ continue
231
+
232
+ # ── Letter runs ──
233
+ if c.isalpha():
234
+ word_start = i
235
+ while i < n and token[i].isalpha():
236
+ i += 1
237
+ word = token[word_start:i]
238
+
239
+ # After a dash, short letter runs (1-3 chars) are flags → spell out
240
+ if len(word) <= 3 and parts and parts[-1] == 'dash':
241
+ parts.extend(ch.upper() for ch in word)
242
+ elif len(word) == 1:
243
+ # Standalone single letter → uppercase
244
+ parts.append(word.upper())
245
+ else:
246
+ # Regular word
247
+ parts.append(word)
248
+ continue
249
+
250
+ # ── Single symbols ──
251
+ if c in CHAR_MAP:
252
+ parts.append(CHAR_MAP[c])
253
+ i += 1
254
+ continue
255
+
256
+ # Unknown character → bail
257
+ return None
258
+
259
+ return ' '.join(parts) if parts else None
260
+
261
+
262
+ def bash_to_dictation(cmd: str) -> str | None:
263
+ """Convert a bash command to its dictated form.
264
+
265
+ Splits by whitespace first (preserving bash token boundaries),
266
+ then converts each token individually. Tokens, and the parts within each
266
+ token, are all joined by single spaces, so the output does not record where the original whitespace was.
268
+
269
+ Returns None if any token can't be cleanly converted.
270
+ """
271
+ # Split by whitespace, preserving token boundaries
272
+ tokens = cmd.split()
273
+ if not tokens:
274
+ return None
275
+
276
+ dictated_tokens = []
277
+ for token in tokens:
278
+ converted = convert_token(token)
279
+ if converted is None:
280
+ return None
281
+ dictated_tokens.append(converted)
282
+
283
+ result = ' '.join(dictated_tokens)
284
+ return result if result else None
285
+
286
+
287
+ # ── Main ─────────────────────────────────────────────────────────────────
288
+
289
+ def main():
290
+ cm_path = Path("datasets/nl2bash-repo/data/bash/all.cm")
291
+ out_dir = Path("datasets/finetune/bash-v2")
292
+ out_dir.mkdir(parents=True, exist_ok=True)
293
+
294
+ commands = cm_path.read_text().strip().split('\n')
295
+ print(f"Total commands in NL2Bash: {len(commands)}")
296
+
297
+ # Filter and convert
298
+ pairs = []
299
+ skipped_filter = 0
300
+ skipped_convert = 0
301
+ seen = set()
302
+
303
+ for cmd in commands:
304
+ cmd = cmd.strip()
305
+ if not cmd:
306
+ continue
307
+
308
+ if should_skip(cmd):
309
+ skipped_filter += 1
310
+ continue
311
+
312
+ if cmd in seen:
313
+ continue
314
+ seen.add(cmd)
315
+
316
+ dictated = bash_to_dictation(cmd)
317
+ if dictated is None:
318
+ skipped_convert += 1
319
+ continue
320
+
321
+ # Skip if dictation is too long or too short
322
+ word_count = len(dictated.split())
323
+ if word_count > 40 or word_count < 3:
324
+ skipped_convert += 1
325
+ continue
326
+
327
+ pairs.append({
328
+ "dictated": dictated,
329
+ "expected": cmd,
330
+ })
331
+
332
+ print(f"Filtered out (complexity): {skipped_filter}")
333
+ print(f"Filtered out (conversion): {skipped_convert}")
334
+ print(f"Usable pairs: {len(pairs)}")
335
+
336
+ # Shuffle and split: 80/10/10
337
+ random.seed(SEED)
338
+ random.shuffle(pairs)
339
+
340
+ n = len(pairs)
341
+ test_size = max(1, n // 10)
342
+ valid_size = max(1, n // 10)
343
+ train_size = n - test_size - valid_size
344
+
345
+ train = pairs[:train_size]
346
+ valid = pairs[train_size:train_size + valid_size]
347
+ test = pairs[train_size + valid_size:]
348
+
349
+ print(f"Train: {len(train)}, Valid: {len(valid)}, Test: {len(test)}")
350
+
351
+ # Write BOTH prompt variants
352
+ for label, system_prompt in [("minimal", SYSTEM_MINIMAL), ("protocol", SYSTEM_PROTOCOL)]:
353
+ sub_dir = out_dir / label
354
+ sub_dir.mkdir(exist_ok=True)
355
+
356
+ def write_jsonl(path: Path, data: list):
357
+ with open(path, 'w') as f:
358
+ for item in data:
359
+ entry = {
360
+ "messages": [
361
+ {"role": "system", "content": system_prompt},
362
+ {"role": "user", "content": item["dictated"]},
363
+ {"role": "assistant", "content": item["expected"]},
364
+ ]
365
+ }
366
+ f.write(json.dumps(entry) + '\n')
367
+
368
+ write_jsonl(sub_dir / "train.jsonl", train)
369
+ write_jsonl(sub_dir / "valid.jsonl", valid)
370
+ write_jsonl(sub_dir / "test.jsonl", test)
371
+ print(f" Wrote {label}/ splits")
372
+
373
+ # Print sample conversions
374
+ print(f"\n{'='*70}")
375
+ print(" SAMPLE PAIRS (v2 converter)")
376
+ print(f"{'='*70}\n")
377
+ for p in pairs[:20]:
378
+ print(f" dictated: {p['dictated']}")
379
+ print(f" expected: {p['expected']}")
380
+ print()
381
+
382
+ # Show number conversion examples
383
+ print(f"{'='*70}")
384
+ print(" NUMBER HANDLING EXAMPLES")
385
+ print(f"{'='*70}\n")
386
+ examples = ['0', '1', '5', '10', '12', '20', '42', '80', '99',
387
+ '100', '127', '0644', '0755', '255', '443',
388
+ '1024', '3000', '5432', '8080', '9090', '65535']
389
+ for s in examples:
390
+ print(f" {s:>5} → {number_to_words(s)}")
391
+
392
+
393
+ if __name__ == "__main__":
394
+ main()
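The number rules in `number_to_words` (natural words for 0-99, digit-by-digit for 100 and above or for leading zeros) imply a matching inverse that the fine-tuned model has to learn. A minimal sketch of that inverse for a single number is given here, assuming the input is exactly one number's worth of words. `words_to_number` is a hypothetical helper that does not exist in the commit; it only illustrates the mapping.

```python
ONES = [
    'zero', 'one', 'two', 'three', 'four', 'five', 'six', 'seven',
    'eight', 'nine', 'ten', 'eleven', 'twelve', 'thirteen', 'fourteen',
    'fifteen', 'sixteen', 'seventeen', 'eighteen', 'nineteen',
]
TENS = [
    '', '', 'twenty', 'thirty', 'forty', 'fifty',
    'sixty', 'seventy', 'eighty', 'ninety',
]


def words_to_number(words: str) -> str:
    """Invert the v2 number rules for one spoken number.

    Returns a string so that leading zeros survive (e.g. "0644").
    Illustrative assumption: the caller has already isolated the number words.
    """
    parts = words.split()
    if not parts:
        raise ValueError("empty input")

    # Tens word, optionally followed by a ones word 1-9: "forty two" -> 42.
    if parts[0] in TENS[2:]:
        tens = TENS.index(parts[0]) * 10
        if len(parts) == 1:
            return str(tens)
        if len(parts) == 2 and parts[1] in ONES[1:10]:
            return str(tens + ONES.index(parts[1]))
        raise ValueError(f"not a valid 20-99 number: {words!r}")

    # A single word from zero to nineteen: "twelve" -> 12.
    if len(parts) == 1 and parts[0] in ONES:
        return str(ONES.index(parts[0]))

    # Otherwise digit-by-digit: "zero six four four" -> 0644.
    digits = []
    for w in parts:
        if w not in ONES[:10]:
            raise ValueError(f"unexpected word in digit sequence: {w!r}")
        digits.append(str(ONES.index(w)))
    return ''.join(digits)


for spoken in ["forty two", "twelve", "one two seven", "zero six four four"]:
    print(f"{spoken!r} -> {words_to_number(spoken)}")
```

Returning a string rather than an int mirrors the converter's own choice of taking `num_str`, since file modes like 0644 would otherwise lose their leading zero.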
training/converters/bash-to-dictation-v3.py ADDED
@@ -0,0 +1,486 @@
+ #!/usr/bin/env python3
+ """Convert NL2Bash commands → dictation training pairs for syntax reconstruction.
+
+ v3: "Space is a word" model.
+
+ Core idea: the word "space" is an explicit token in the dictation that means
+ "insert a literal space here." Everything else concatenates. This eliminates
+ the entire class of spacing ambiguity from v2.
+
+ v2 problem:
+     tail -n1   → "tail dash N one"
+     tail -n 1  → "tail dash N one"  ← SAME DICTATION, different output!
+
+ v3 fix:
+     tail -n1   → "tail space dash n one"        → tail + ' ' + -n1
+     tail -n 1  → "tail space dash n space one"  → tail + ' ' + -n + ' ' + 1
+
+ Rules for the model:
+     1. Map each spoken word to its text form (dash→-, one→1, etc.)
+     2. Concatenate everything by default
+     3. The word "space" → insert a literal space
+ """
+
+ import json
+ import random
+ import re
+ from pathlib import Path
+
+ SEED = 42
+
+ # ── System Prompts ───────────────────────────────────────────────────────
+
+ SYSTEM_MINIMAL = (
+     "Reconstruct the intended syntax from the dictated text. "
+     "The word 'space' means insert a literal space. "
+     "Everything else concatenates. "
+     "Output only the result."
+ )
+
+ SYSTEM_PROTOCOL = (
+     "Reconstruct syntax from dictated text.\n"
+     "Default: all words concatenate into one token.\n"
+     "The word 'space' inserts a literal space (argument boundary).\n"
+     "Symbol words: dash(-) dot(.) slash(/) pipe(|) star(*) bang(!) "
+     "hash(#) tilde(~) at(@) dollar($) percent(%) caret(^) equals(=) "
+     "plus(+) colon(:) semicolon(;) underscore(_) comma(,) backslash(\\)\n"
+     "Quotes: quote(\") single quote(') backtick(`)\n"
+     "Brackets: open/close paren() brace{} bracket[] angle<>\n"
+     "Pairs: dash dash(--) and and(&&) pipe pipe(||) dot dot(..)\n"
+     "Append: append(>>) Redirect: redirect(>)\n"
+     "Numbers: spoken as words (one→1, forty two→42)\n"
+     "Flags: letters after dash are lowercase flags (dash L A → -la)\n"
+     "Capital: capital X → X (preserves uppercase)\n"
+     "Output only the result."
+ )
+
+ # ── Number Words ─────────────────────────────────────────────────────────
+
+ ONES = [
+     'zero', 'one', 'two', 'three', 'four', 'five', 'six', 'seven',
+     'eight', 'nine', 'ten', 'eleven', 'twelve', 'thirteen', 'fourteen',
+     'fifteen', 'sixteen', 'seventeen', 'eighteen', 'nineteen',
+ ]
+ TENS = [
+     '', '', 'twenty', 'thirty', 'forty', 'fifty',
+     'sixty', 'seventy', 'eighty', 'ninety',
+ ]
+
+
+ def number_to_words(num_str: str) -> str:
+     """Convert a number string to spoken words.
+
+     Rules:
+     - 0-99: natural words (zero, twelve, forty two, ninety nine)
+     - 100+: digit-by-digit (one two seven, eight zero eight zero)
+     - Leading zeros: always digit-by-digit (zero six four four for 0644)
+
+     Takes string not int to preserve leading zeros.
+     """
+     # Leading zeros → always digit-by-digit
+     if len(num_str) > 1 and num_str[0] == '0':
+         return ' '.join(ONES[int(d)] for d in num_str)
+
+     n = int(num_str)
+
+     # 0-99: natural spoken form
+     if n < 20:
+         return ONES[n]
+     if n < 100:
+         if n % 10 == 0:
+             return TENS[n // 10]
+         return f"{TENS[n // 10]} {ONES[n % 10]}"
+
+     # 100+: digit-by-digit
+     return ' '.join(ONES[int(d)] for d in num_str)
+
+
+ # ── Single char map ──────────────────────────────────────────────────────
+
+ CHAR_MAP = {
+     '-': 'dash',
+     '.': 'dot',
+     '/': 'slash',
+     '|': 'pipe',
+     '>': 'redirect',
+     '<': 'less than',
+     '(': 'open paren',
+     ')': 'close paren',
+     '{': 'open brace',
+     '}': 'close brace',
+     '[': 'open bracket',
+     ']': 'close bracket',
+     '"': 'quote',
+     "'": 'single quote',
+     '`': 'backtick',
+     '*': 'star',
+     '~': 'tilde',
+     '@': 'at',
+     '#': 'hash',
+     '$': 'dollar',
+     '%': 'percent',
+     '^': 'caret',
+     '&': 'ampersand',
+     '=': 'equals',
+     '+': 'plus',
+     ':': 'colon',
+     ';': 'semicolon',
+     '?': 'question mark',
+     '!': 'bang',
+     '\\': 'backslash',
+     '_': 'underscore',
+     ',': 'comma',
+ }
+
+ # ── Filters ──────────────────────────────────────────────────────────────
+
+ SKIP_PATTERNS = [
+     re.compile(r'`'),  # backtick subshells
+     re.compile(r'\$\('),  # $() subshells
+     re.compile(r'\\[nt"\'()]'),  # escape sequences
+     re.compile(r'\\\\'),  # double backslash
+     re.compile(r"awk\s+'"),  # awk scripts
+     re.compile(r"sed\s+'"),  # sed scripts
+     re.compile(r"sed\s+-"),  # sed with flags
+     re.compile(r'\{[0-9]'),  # brace expansion {1..10}
+     re.compile(r'<<<'),  # here-strings
+     re.compile(r'<<'),  # here-docs
+     re.compile(r'\[\['),  # bash test brackets
+     re.compile(r'\bif\s'),  # conditionals
+     re.compile(r'\bfor\s'),  # loops
+     re.compile(r'\bwhile\s'),  # loops
+     re.compile(r'\bdo\b'),  # loop body
+     re.compile(r'\bdone\b'),  # loop end
+     re.compile(r'\bthen\b'),  # conditional body
+     re.compile(r'\bfi\b'),  # conditional end
+     re.compile(r'\bcase\s'),  # case statements
+     re.compile(r'\besac\b'),  # case end
+     re.compile(r'\bfunction\b'),  # function defs
+     re.compile(r'\(\)'),  # function parens
+     re.compile(r'printf\s'),  # printf format strings
+     re.compile(r'\$\{'),  # parameter expansion
+     re.compile(r'[^\x00-\x7F]'),  # non-ASCII characters
+ ]
+
+
+ def should_skip(cmd: str) -> bool:
+     """Return True if command is too complex for dictation."""
+     if len(cmd) > 60:
+         return True
+     if len(cmd) < 5:
+         return True
+     if cmd.count('|') > 2:
+         return True
+     for pat in SKIP_PATTERNS:
+         if pat.search(cmd):
+             return True
+     return False
+
+
+ # ── Token Converter ──────────────────────────────────────────────────────
+
+ def emit_word(word: str) -> str | None:
+     """Convert a word to its dictation form with casing qualifiers.
+
+     Casing rules:
+     - all lowercase → as-is (default, no qualifier needed)
+     - ALL UPPERCASE → "all caps <word>"
+     - Capitalized → "capital <word>"
+     - Mixed case → bail (return None), too complex to dictate
+
+     The model also learns contextual casing (env vars after $ are uppercase,
+     commands are lowercase) so qualifiers aren't always required in practice.
+     But the training data should include them so the model knows the convention.
+     """
+     if word.islower():
+         return word
+     if word.isupper():
+         return f'all caps {word.lower()}'
+     if word[0].isupper() and word[1:].islower():
+         return f'capital {word.lower()}'
+     # Mixed case (camelCase, etc.): skip for now
+     return None
+
+
+ def convert_token(token: str) -> str | None:
+     """Convert a single whitespace-delimited bash token to dictation.
+
+     Processes the token character by character, accumulating letter runs
+     and converting symbols/numbers to spoken form.
+
+     v3 changes:
+     - Casing qualifiers: "capital X", "all caps var"
+     - Flag casing preserved: -x → "dash x", -X → "dash capital X"
+     - Multi-char uppercase words: VAR → "all caps var"
+     - Capitalized words: Foto → "capital foto"
+
+     Returns None if the token contains unconvertible characters.
+     """
+     parts = []
+     i = 0
+     n = len(token)
+
+     while i < n:
+         c = token[i]
+
+         # ── Multi-char patterns ──
+         # && and ||
+         if c == '&' and i + 1 < n and token[i + 1] == '&':
+             parts.append('and and')
+             i += 2
+             continue
+         if c == '|' and i + 1 < n and token[i + 1] == '|':
+             parts.append('pipe pipe')
+             i += 2
+             continue
+
+         # -- (double dash)
+         if c == '-' and i + 1 < n and token[i + 1] == '-':
+             parts.append('dash dash')
+             i += 2
+             continue
+
+         # .. (double dot)
+         if c == '.' and i + 1 < n and token[i + 1] == '.':
+             parts.append('dot dot')
+             i += 2
+             continue
+
+         # >> (append redirect)
+         if c == '>' and i + 1 < n and token[i + 1] == '>':
+             parts.append('append')
+             i += 2
+             continue
+
+         # 2> (stderr redirect); note: no start-of-token check is applied here
+         if c == '2' and i + 1 < n and token[i + 1] == '>':
+             if i + 2 < n and token[i + 2] == '&':
+                 parts.append('two redirect ampersand')
+                 i += 3
+             else:
+                 parts.append('two redirect')
+                 i += 2
+             continue
+
+         # ── Number runs ──
+         if c.isdigit():
+             num_start = i
+             while i < n and token[i].isdigit():
+                 i += 1
+             num_str = token[num_start:i]
+             parts.append(number_to_words(num_str))
+             continue
+
+         # ── Letter runs ──
+         if c.isalpha():
+             word_start = i
+             while i < n and token[i].isalpha():
+                 i += 1
+             word = token[word_start:i]
+
+             # After a dash, short letter runs are flags: spell them out
+             # Also handle longer mixed-case flag runs like -aPSHiv
+             is_flag = parts and parts[-1] == 'dash'
+             is_short = len(word) <= 3
+             has_mixed_case = not word.islower() and not word.isupper()
+
+             if is_flag and (is_short or has_mixed_case):
+                 for ch in word:
+                     if ch.isupper():
+                         parts.append(f'capital {ch}')
+                     else:
+                         parts.append(ch)
+             elif len(word) == 1:
+                 # Standalone single letter: preserve case
+                 if word.isupper():
+                     parts.append(f'capital {word}')
+                 else:
+                     parts.append(word)
+             else:
+                 # Multi-char word: use casing qualifiers
+                 emitted = emit_word(word)
+                 if emitted is None:
+                     return None  # mixed case too complex
+                 parts.append(emitted)
+             continue
+
+         # ── Single symbols ──
+         if c in CHAR_MAP:
+             parts.append(CHAR_MAP[c])
+             i += 1
+             continue
+
+         # Unknown character → bail
+         return None
+
+     return ' '.join(parts) if parts else None
+
+
+ def bash_to_dictation(cmd: str) -> str | None:
+     """Convert a bash command to its dictated form.
+
+     v3 key change: token boundaries are marked with the word "space".
+
+     In v2, `tail -n1` and `tail -n 1` both became "tail dash N one".
+     In v3:
+         tail -n1   → "tail space dash n one"
+         tail -n 1  → "tail space dash n space one"
+
+     The model learns: "space" = literal space, everything else concatenates.
+     """
+     tokens = cmd.split()
+     if not tokens:
+         return None
+
+     dictated_tokens = []
+     for token in tokens:
+         converted = convert_token(token)
+         if converted is None:
+             return None
+         dictated_tokens.append(converted)
+
+     # v3: join with " space "; the word "space" is an explicit token
+     result = ' space '.join(dictated_tokens)
+     return result if result else None
+
+
+ # ── Main ─────────────────────────────────────────────────────────────────
+
+ def main():
+     cm_path = Path("datasets/nl2bash-repo/data/bash/all.cm")
+     out_dir = Path("datasets/finetune/bash-v3")
+     out_dir.mkdir(parents=True, exist_ok=True)
+
+     commands = cm_path.read_text().strip().split('\n')
+     print(f"Total commands in NL2Bash: {len(commands)}")
+
+     # Filter and convert
+     pairs = []
+     skipped_filter = 0
+     skipped_convert = 0
+     seen = set()
+
+     for cmd in commands:
+         cmd = cmd.strip()
+         if not cmd:
+             continue
+
+         if should_skip(cmd):
+             skipped_filter += 1
+             continue
+
+         if cmd in seen:
+             continue
+         seen.add(cmd)
+
+         dictated = bash_to_dictation(cmd)
+         if dictated is None:
+             skipped_convert += 1
+             continue
+
+         # Skip if dictation is too long or too short
+         word_count = len(dictated.split())
+         if word_count > 50 or word_count < 3:  # bumped from 40; "space" adds words
+             skipped_convert += 1
+             continue
+
+         pairs.append({
+             "dictated": dictated,
+             "expected": cmd,
+         })
+
+     print(f"Filtered out (complexity): {skipped_filter}")
+     print(f"Filtered out (conversion): {skipped_convert}")
+     print(f"Usable pairs: {len(pairs)}")
+
+     # Shuffle and split: 80/10/10
+     random.seed(SEED)
+     random.shuffle(pairs)
+
+     n = len(pairs)
+     test_size = max(1, n // 10)
+     valid_size = max(1, n // 10)
+     train_size = n - test_size - valid_size
+
+     train = pairs[:train_size]
+     valid = pairs[train_size:train_size + valid_size]
+     test = pairs[train_size + valid_size:]
+
+     print(f"Train: {len(train)}, Valid: {len(valid)}, Test: {len(test)}")
+
+     # Write BOTH prompt variants
+     for label, system_prompt in [("minimal", SYSTEM_MINIMAL), ("protocol", SYSTEM_PROTOCOL)]:
+         sub_dir = out_dir / label
+         sub_dir.mkdir(exist_ok=True)
+
+         def write_jsonl(path: Path, data: list):
+             with open(path, 'w') as f:
+                 for item in data:
+                     entry = {
+                         "messages": [
+                             {"role": "system", "content": system_prompt},
+                             {"role": "user", "content": item["dictated"]},
+                             {"role": "assistant", "content": item["expected"]},
+                         ]
+                     }
+                     f.write(json.dumps(entry) + '\n')
+
+         write_jsonl(sub_dir / "train.jsonl", train)
+         write_jsonl(sub_dir / "valid.jsonl", valid)
+         write_jsonl(sub_dir / "test.jsonl", test)
+         print(f" Wrote {label}/ splits")
+
+     # ── Show the key improvement ──
+     print(f"\n{'='*70}")
+     print(" V3 vs V2: AMBIGUITY RESOLUTION")
+     print(f"{'='*70}\n")
+
+     demo_pairs = [
+         ("tail -n1", "tail -n 1"),
+         ("cut -c1-10", "cut -c 1-10"),
+         ("top -b -n1", "top -b -n 1"),
+         ("ssh -X user@server", "ssh -x user@server"),
+         ("chmod 0644 file.txt", "chmod 644 file.txt"),
+     ]
+     for a, b in demo_pairs:
+         da = bash_to_dictation(a)
+         db = bash_to_dictation(b)
+         same = "SAME ✗" if da == db else "DIFFERENT ✓"
+         print(f" {a:<25} → {da}")
+         print(f" {b:<25} → {db}")
+         print(f" {'':25} {same}")
+         print()
+
+     # Show casing qualifiers
+     print(f"\n{'='*70}")
+     print(" CASING QUALIFIERS")
+     print(f"{'='*70}\n")
+
+     case_examples = [
+         "export PATH=/usr/bin",
+         "echo $HOME",
+         "find ~ -name 'Foto*'",
+         "read -n10 -e VAR",
+         "ls -la /tmp/MyDir",
+         "grep -i ERROR log.txt",
+     ]
+     for cmd in case_examples:
+         d = bash_to_dictation(cmd)
+         if d:
+             print(f" {cmd:<35} → {d}")
+         else:
+             print(f" {cmd:<35} → SKIPPED")
+     print()
+
+     # Print sample conversions
+     print(f"{'='*70}")
+     print(" SAMPLE PAIRS (v3 converter)")
+     print(f"{'='*70}\n")
+     for p in pairs[:15]:
+         print(f" dictated: {p['dictated']}")
+         print(f" expected: {p['expected']}")
+         print()
+
+
+ if __name__ == "__main__":
+     main()
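Review note: the "space is a word" rules in the v3 docstring can be sanity-checked by running them in reverse. The following is a minimal sketch under stated assumptions, not part of the commit: `decode_dictation` and `WORD_TO_TEXT` are hypothetical names, only a small subset of the vocabulary is covered, and qualifiers such as "capital" / "all caps" and multi-word numbers like "forty two" are deliberately not handled.

```python
# Hypothetical inverse of the v3 converter for a tiny vocabulary subset.
# Assumption: every dictated word is either "space", a key in WORD_TO_TEXT,
# or a literal word that is copied through unchanged.
WORD_TO_TEXT = {
    "dash": "-", "dot": ".", "slash": "/", "pipe": "|",
    "zero": "0", "one": "1", "two": "2", "three": "3",
}


def decode_dictation(dictated: str) -> str:
    """Concatenate mapped words; the word 'space' inserts a literal space."""
    out = []
    for word in dictated.split():
        if word == "space":
            out.append(" ")
        else:
            out.append(WORD_TO_TEXT.get(word, word))
    return "".join(out)


print(decode_dictation("tail space dash n one"))        # tail -n1
print(decode_dictation("tail space dash n space one"))  # tail -n 1
```

The two dictations differ only by one "space" word, and the decoder recovers two different commands, which is the ambiguity v2 could not express.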
training/converters/prepare-finetune.py ADDED
@@ -0,0 +1,91 @@
+ #!/usr/bin/env python3
+ """Convert syntax-reconstruction.json → train/valid/test JSONL for mlx_lm.
+
+ Uses stratified splitting by category so each split has proportional
+ category representation.
+ """
+
+ import json
+ import random
+ import os
+ from collections import defaultdict
+
+ SRC = os.path.join(os.path.dirname(__file__), "syntax-reconstruction.json")
+ OUT = os.path.join(os.path.dirname(__file__), "finetune")
+
+ SYSTEM = "Reconstruct the intended syntax from the dictated text. Output only the result."
+
+ with open(SRC) as f:
+     entries = json.load(f)["entries"]
+
+ random.seed(42)
+
+ # --- Stratified split by category ---
+ by_cat = defaultdict(list)
+ for e in entries:
+     by_cat[e["cat"]].append(e)
+
+ train, valid, test = [], [], []
+
+ for cat, items in sorted(by_cat.items()):
+     random.shuffle(items)
+     n = len(items)
+     train_end = int(n * 0.80)
+     valid_end = int(n * 0.90)
+     train.extend(items[:train_end])
+     valid.extend(items[train_end:valid_end])
+     test.extend(items[valid_end:])
+
+ # Shuffle within each split so categories are interleaved
+ random.shuffle(train)
+ random.shuffle(valid)
+ random.shuffle(test)
+
+ splits = {"train": train, "valid": valid, "test": test}
+
+ # Print category distribution
+ print("Category distribution:")
+ for cat in sorted(by_cat.keys()):
+     total = len(by_cat[cat])
+     t = sum(1 for e in train if e["cat"] == cat)
+     v = sum(1 for e in valid if e["cat"] == cat)
+     te = sum(1 for e in test if e["cat"] == cat)
+     print(f" {cat:<20} {total:>4} total → {t:>4} train / {v:>3} valid / {te:>3} test")
+
+ # --- Completions format ---
+ comp_dir = os.path.join(OUT, "completions")
+ os.makedirs(comp_dir, exist_ok=True)
+
+ for split_name, split_entries in splits.items():
+     path = os.path.join(comp_dir, f"{split_name}.jsonl")
+     with open(path, "w") as f:
+         for e in split_entries:
+             line = {
+                 "prompt": e["dictated"],
+                 "completion": e["output"],
+             }
+             f.write(json.dumps(line) + "\n")
+     print(f" {split_name}: {len(split_entries)} → {path}")
+
+ # --- Chat format ---
+ chat_dir = os.path.join(OUT, "chat")
+ os.makedirs(chat_dir, exist_ok=True)
+
+ for split_name, split_entries in splits.items():
+     path = os.path.join(chat_dir, f"{split_name}.jsonl")
+     with open(path, "w") as f:
+         for e in split_entries:
+             line = {
+                 "messages": [
+                     {"role": "system", "content": SYSTEM},
+                     {"role": "user", "content": e["dictated"]},
+                     {"role": "assistant", "content": e["output"]},
+                 ]
+             }
+             f.write(json.dumps(line) + "\n")
+     print(f" {split_name}: {len(split_entries)} → {path}")
+
+ n = len(entries)
+ print(f"\nTotal: {n} entries → {len(train)} train / {len(valid)} valid / {len(test)} test")
+ print(f"System prompt: \"{SYSTEM}\"")
+ print("Formats: completions (prompt/completion), chat (messages with system prompt)")
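Review note: because the per-category boundaries are computed with `int(n * 0.80)` and `int(n * 0.90)`, small categories do not divide into a clean 80/10/10. The sketch below only replays that arithmetic; `split_sizes` is a hypothetical helper introduced for illustration and does not exist in the script.

```python
def split_sizes(n: int) -> tuple[int, int, int]:
    """Return (train, valid, test) counts for a category of n items,
    using the same truncating boundaries as the stratified split."""
    train_end = int(n * 0.80)
    valid_end = int(n * 0.90)
    return train_end, valid_end - train_end, n - valid_end


for n in (5, 10, 23):
    print(n, split_sizes(n))
```

For a category of 5 items the validation share is empty, so very small categories may be missing from `valid.jsonl` even though the split is stratified.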
training/data/bash-v2/minimal/test.jsonl ADDED
The diff for this file is too large to render. See raw diff
 
training/data/bash-v2/minimal/train.jsonl ADDED
The diff for this file is too large to render. See raw diff
 
training/data/bash-v2/minimal/valid.jsonl ADDED
The diff for this file is too large to render. See raw diff
 
training/data/bash-v2/protocol/test.jsonl ADDED
The diff for this file is too large to render. See raw diff
 
training/data/bash-v2/protocol/train.jsonl ADDED
The diff for this file is too large to render. See raw diff
 
training/data/bash-v2/protocol/valid.jsonl ADDED
The diff for this file is too large to render. See raw diff
 
training/data/bash-v3/minimal/test.jsonl ADDED
The diff for this file is too large to render. See raw diff
 
training/data/bash-v3/minimal/train.jsonl ADDED
The diff for this file is too large to render. See raw diff
 
training/data/bash-v3/minimal/valid.jsonl ADDED
The diff for this file is too large to render. See raw diff
 
training/data/bash-v3/protocol/test.jsonl ADDED
The diff for this file is too large to render. See raw diff
 
training/data/bash-v3/protocol/train.jsonl ADDED
The diff for this file is too large to render. See raw diff
 
training/data/bash-v3/protocol/valid.jsonl ADDED
The diff for this file is too large to render. See raw diff
 
training/data/chat/test.jsonl ADDED
@@ -0,0 +1,66 @@
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "X percent equals two"}, {"role": "assistant", "content": "x%=2"}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "button dot T S X"}, {"role": "assistant", "content": "Button.tsx"}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "triple backtick"}, {"role": "assistant", "content": "```"}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "hello equals world"}, {"role": "assistant", "content": "hello=world"}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "find dot dash name quote star dot T S X quote dash delete"}, {"role": "assistant", "content": "find . -name \"*.tsx\" -delete"}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "ten S"}, {"role": "assistant", "content": "10s"}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "if X bang equals equals Y"}, {"role": "assistant", "content": "if x !== y"}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "no space fire base"}, {"role": "assistant", "content": "firebase"}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "HTTP colon slash slash one two seven dot zero dot zero dot one colon three thousand"}, {"role": "assistant", "content": "http://127.0.0.1:3000"}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "dot yaml"}, {"role": "assistant", "content": ".yaml"}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "NPM run build and and NPM run test"}, {"role": "assistant", "content": "npm run build && npm run test"}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "set dash E and and set dash O camel case pipe fail"}, {"role": "assistant", "content": "set -e && set -o pipefail"}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "dot slash"}, {"role": "assistant", "content": "./"}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "quote test dash value quote"}, {"role": "assistant", "content": "\"test-value\""}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "let camel case file URL colon all caps URL"}, {"role": "assistant", "content": "let fileURL: URL"}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "open paren close paren"}, {"role": "assistant", "content": "()"}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "no space type script"}, {"role": "assistant", "content": "typescript"}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "zero point nine nine"}, {"role": "assistant", "content": "0.99"}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "one two seven dot zero dot zero dot one colon five four three two"}, {"role": "assistant", "content": "127.0.0.1:5432"}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "double colon"}, {"role": "assistant", "content": "::"}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "SSH dash I tilde slash dot SSH slash ID underscore RSA user at one nine two dot one six eight dot one dot one"}, {"role": "assistant", "content": "ssh -i ~/.ssh/id_rsa user@192.168.1.1"}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "HTTPS colon slash slash example dot com slash search question mark Q equals hello"}, {"role": "assistant", "content": "https://example.com/search?q=hello"}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "A question mark B colon C"}, {"role": "assistant", "content": "a ? b : c"}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "ten dot zero dot zero dot one"}, {"role": "assistant", "content": "10.0.0.1"}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "snake case created at"}, {"role": "assistant", "content": "created_at"}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "pipe grep quote physical footprint quote"}, {"role": "assistant", "content": "| grep \"Physical footprint\""}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "record open angle string comma unknown close angle"}, {"role": "assistant", "content": "Record<string, unknown>"}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "dash L A"}, {"role": "assistant", "content": "-la"}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "if X not equals Y"}, {"role": "assistant", "content": "if x != y"}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "mac OS slash talkie slash app slash app delegate dot swift"}, {"role": "assistant", "content": "macOS/Talkie/App/AppDelegate.swift"}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "hello comma world"}, {"role": "assistant", "content": "hello, world"}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "eighty eighty"}, {"role": "assistant", "content": "8080"}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "one word hello world"}, {"role": "assistant", "content": "helloworld"}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "demo dash typing dot M D"}, {"role": "assistant", "content": "demo-typing.md"}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "docker run dash D dash P nine zero nine zero colon nine zero nine zero grafana"}, {"role": "assistant", "content": "docker run -d -p 9090:9090 grafana"}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "git stash push dash M quote work in progress quote"}, {"role": "assistant", "content": "git stash push -m \"work in progress\""}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "dash X"}, {"role": "assistant", "content": "-x"}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "camel case content view"}, {"role": "assistant", "content": "contentView"}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "percent D"}, {"role": "assistant", "content": "%d"}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "two greater than slash dev slash null"}, {"role": "assistant", "content": "2>/dev/null"}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "const camel case app equals express open paren close paren"}, {"role": "assistant", "content": "const app = express()"}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "star star"}, {"role": "assistant", "content": "**"}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "camel case get value"}, {"role": "assistant", "content": "getValue"}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "brew install dash dash cask visual dash studio dash code"}, {"role": "assistant", "content": "brew install --cask visual-studio-code"}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "open paren X colon pascal case int comma Y colon pascal case int close paren"}, {"role": "assistant", "content": "(x: Int, y: Int)"}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "open angle close angle"}, {"role": "assistant", "content": "<>"}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "docker build dash T my dash app colon latest dot"}, {"role": "assistant", "content": "docker build -t my-app:latest ."}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "dot dot slash dev"}, {"role": "assistant", "content": "../dev"}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "dash dash tokens"}, {"role": "assistant", "content": "--tokens"}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "CD dot dot"}, {"role": "assistant", "content": "cd .."}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "A semicolon B semicolon C"}, {"role": "assistant", "content": "A; B; C"}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "dot dot slash dot dot slash node underscore modules"}, {"role": "assistant", "content": "../../node_modules"}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "dot star"}, {"role": "assistant", "content": ".*"}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "double quote hello world double quote"}, {"role": "assistant", "content": "\"hello world\""}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "snake case batch size"}, {"role": "assistant", "content": "batch_size"}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "dollar all caps USER"}, {"role": "assistant", "content": "$USER"}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "snake case total tokens generated"}, {"role": "assistant", "content": "total_tokens_generated"}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "dot dot slash packages"}, {"role": "assistant", "content": "../packages"}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "dash dash prod"}, {"role": "assistant", "content": "--prod"}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "export all caps API underscore KEY equals quote my dash key dash one two three quote"}, {"role": "assistant", "content": "export API_KEY=\"my-key-123\""}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "colon equals"}, {"role": "assistant", "content": ":="}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "SRC slash components slash button dot T S X"}, {"role": "assistant", "content": "src/components/Button.tsx"}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "if X less than or equal Y"}, {"role": "assistant", "content": "if x <= y"}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "pip install dash R requirements dot T X T"}, {"role": "assistant", "content": "pip install -r requirements.txt"}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "capital hello"}, {"role": "assistant", "content": "Hello"}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "pascal case debug kit"}, {"role": "assistant", "content": "DebugKit"}]}
training/data/chat/train.jsonl ADDED
The diff for this file is too large to render. See raw diff
 
training/data/chat/valid.jsonl ADDED
@@ -0,0 +1,59 @@
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "kubectl get pods dash N production"}, {"role": "assistant", "content": "kubectl get pods -n production"}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "single quote R single quote"}, {"role": "assistant", "content": "'r'"}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "all caps HTTPS"}, {"role": "assistant", "content": "HTTPS"}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "hello dollar world"}, {"role": "assistant", "content": "hello$world"}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "X code build dash scheme talkie dash destination quote platform equals mac OS quote"}, {"role": "assistant", "content": "xcodebuild -scheme Talkie -destination \"platform=macOS\""}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "no space home brew"}, {"role": "assistant", "content": "homebrew"}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "hash endif"}, {"role": "assistant", "content": "#endif"}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "dash dash version"}, {"role": "assistant", "content": "--version"}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "single quote single quote"}, {"role": "assistant", "content": "''"}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "pascal case service bridge"}, {"role": "assistant", "content": "ServiceBridge"}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "less than less than"}, {"role": "assistant", "content": "<<"}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "camel case is palindrome"}, {"role": "assistant", "content": "isPalindrome"}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "no space make file"}, {"role": "assistant", "content": "makefile"}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "shebang"}, {"role": "assistant", "content": "#!"}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "all caps URL"}, {"role": "assistant", "content": "URL"}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "tilde slash dev slash talkie"}, {"role": "assistant", "content": "~/dev/talkie"}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "hash hash"}, {"role": "assistant", "content": "##"}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "dash dash json"}, {"role": "assistant", "content": "--json"}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "dot dot dot"}, {"role": "assistant", "content": "..."}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "open brace key colon value close brace"}, {"role": "assistant", "content": "{key: value}"}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "var camel case is generating colon all caps bool equals false"}, {"role": "assistant", "content": "var isGenerating: Bool = false"}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "dollar open brace all caps DATABASE underscore URL close brace"}, {"role": "assistant", "content": "${DATABASE_URL}"}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "one"}, {"role": "assistant", "content": "1"}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "X star star Y"}, {"role": "assistant", "content": "x ** y"}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "screaming snake case node env"}, {"role": "assistant", "content": "NODE_ENV"}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "dot J S"}, {"role": "assistant", "content": ".js"}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "process dot env dot all caps API underscore KEY"}, {"role": "assistant", "content": "process.env.API_KEY"}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "dollar open paren date plus percent S close paren"}, {"role": "assistant", "content": "$(date +%s)"}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "kebab case audio player"}, {"role": "assistant", "content": "audio-player"}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "open angle string close angle"}, {"role": "assistant", "content": "<string>"}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "dot dot slash services"}, {"role": "assistant", "content": "../services"}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "HTTPS colon slash slash raw dot github user content dot com"}, {"role": "assistant", "content": "https://raw.githubusercontent.com"}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "one nine eight two two"}, {"role": "assistant", "content": "19822"}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "not A"}, {"role": "assistant", "content": "!a"}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "hello star world"}, {"role": "assistant", "content": "hello*world"}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "tilde slash library slash keychains slash login dot keychain dash DB"}, {"role": "assistant", "content": "~/Library/Keychains/login.keychain-db"}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "open brace quote key quote colon quote value quote close brace"}, {"role": "assistant", "content": "{\"key\": \"value\"}"}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "talkie inference generate quote what is two plus two quote dash dash model MLX dash community slash llama dash three dot two dash one B dash instruct dash four bit dash dash verbose dash dash temp zero point seven"}, {"role": "assistant", "content": "talkie inference generate \"What is two plus two\" --model mlx-community/Llama-3.2-1B-Instruct-4bit --verbose --temp 0.7"}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "dash dash title"}, {"role": "assistant", "content": "--title"}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "quote fix colon resolve timeout quote"}, {"role": "assistant", "content": "\"fix: resolve timeout\""}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "X fat arrow Y"}, {"role": "assistant", "content": "x => y"}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "open bracket dash F dot next close bracket and and RM dash RF dot next"}, {"role": "assistant", "content": "[ -f .next ] && rm -rf .next"}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "dash V"}, {"role": "assistant", "content": "-v"}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "shebang slash usr slash bin slash env python three"}, {"role": "assistant", "content": "#!/usr/bin/env python3"}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "promise open angle void close angle"}, {"role": "assistant", "content": "Promise<void>"}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "GH issue list dash dash label bug dash dash state open"}, {"role": "assistant", "content": "gh issue list --label bug --state open"}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "hello bang world"}, {"role": "assistant", "content": "hello!world"}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "tilde X"}, {"role": "assistant", "content": "~x"}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "dollar open brace all caps HOME colon dash slash tmp close brace"}, {"role": "assistant", "content": "${HOME:-/tmp}"}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "guard let container equals self dot camel case model container else"}, {"role": "assistant", "content": "guard let container = self.modelContainer else"}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "dash M quote fix colon resolve XPC timeout on cold start quote"}, {"role": "assistant", "content": "-m \"fix: resolve XPC timeout on cold start\""}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "five one eight nine"}, {"role": "assistant", "content": "5189"}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "dot camel case on appear open brace close brace"}, {"role": "assistant", "content": ".onAppear {}"}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "tsconfig dot JSON"}, {"role": "assistant", "content": "tsconfig.json"}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "pipe X args dash N one echo"}, {"role": "assistant", "content": "| xargs -n1 echo"}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "mac OS slash talkie slash services slash"}, {"role": "assistant", "content": "macOS/Talkie/Services/"}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "FTP colon slash slash files dot example dot com"}, {"role": "assistant", "content": "ftp://files.example.com"}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "A bang bang"}, {"role": "assistant", "content": "a!!"}]}
+ {"messages": [{"role": "system", "content": "Reconstruct the intended syntax from the dictated text. Output only the result."}, {"role": "user", "content": "dash equals"}, {"role": "assistant", "content": "-="}]}
training/data/completions/test.jsonl ADDED
@@ -0,0 +1,66 @@
+ {"prompt": "X percent equals two", "completion": "x%=2"}
+ {"prompt": "button dot T S X", "completion": "Button.tsx"}
+ {"prompt": "triple backtick", "completion": "```"}
+ {"prompt": "hello equals world", "completion": "hello=world"}
+ {"prompt": "find dot dash name quote star dot T S X quote dash delete", "completion": "find . -name \"*.tsx\" -delete"}
+ {"prompt": "ten S", "completion": "10s"}
+ {"prompt": "if X bang equals equals Y", "completion": "if x !== y"}
+ {"prompt": "no space fire base", "completion": "firebase"}
+ {"prompt": "HTTP colon slash slash one two seven dot zero dot zero dot one colon three thousand", "completion": "http://127.0.0.1:3000"}
+ {"prompt": "dot yaml", "completion": ".yaml"}
+ {"prompt": "NPM run build and and NPM run test", "completion": "npm run build && npm run test"}
+ {"prompt": "set dash E and and set dash O camel case pipe fail", "completion": "set -e && set -o pipefail"}
+ {"prompt": "dot slash", "completion": "./"}
+ {"prompt": "quote test dash value quote", "completion": "\"test-value\""}
+ {"prompt": "let camel case file URL colon all caps URL", "completion": "let fileURL: URL"}
+ {"prompt": "open paren close paren", "completion": "()"}
+ {"prompt": "no space type script", "completion": "typescript"}
+ {"prompt": "zero point nine nine", "completion": "0.99"}
+ {"prompt": "one two seven dot zero dot zero dot one colon five four three two", "completion": "127.0.0.1:5432"}
+ {"prompt": "double colon", "completion": "::"}
+ {"prompt": "SSH dash I tilde slash dot SSH slash ID underscore RSA user at one nine two dot one six eight dot one dot one", "completion": "ssh -i ~/.ssh/id_rsa user@192.168.1.1"}
+ {"prompt": "HTTPS colon slash slash example dot com slash search question mark Q equals hello", "completion": "https://example.com/search?q=hello"}
+ {"prompt": "A question mark B colon C", "completion": "a ? b : c"}
+ {"prompt": "ten dot zero dot zero dot one", "completion": "10.0.0.1"}
+ {"prompt": "snake case created at", "completion": "created_at"}
+ {"prompt": "pipe grep quote physical footprint quote", "completion": "| grep \"Physical footprint\""}
+ {"prompt": "record open angle string comma unknown close angle", "completion": "Record<string, unknown>"}
+ {"prompt": "dash L A", "completion": "-la"}
+ {"prompt": "if X not equals Y", "completion": "if x != y"}
+ {"prompt": "mac OS slash talkie slash app slash app delegate dot swift", "completion": "macOS/Talkie/App/AppDelegate.swift"}
+ {"prompt": "hello comma world", "completion": "hello, world"}
+ {"prompt": "eighty eighty", "completion": "8080"}
+ {"prompt": "one word hello world", "completion": "helloworld"}
+ {"prompt": "demo dash typing dot M D", "completion": "demo-typing.md"}
+ {"prompt": "docker run dash D dash P nine zero nine zero colon nine zero nine zero grafana", "completion": "docker run -d -p 9090:9090 grafana"}
+ {"prompt": "git stash push dash M quote work in progress quote", "completion": "git stash push -m \"work in progress\""}
+ {"prompt": "dash X", "completion": "-x"}
+ {"prompt": "camel case content view", "completion": "contentView"}
+ {"prompt": "percent D", "completion": "%d"}
+ {"prompt": "two greater than slash dev slash null", "completion": "2>/dev/null"}
+ {"prompt": "const camel case app equals express open paren close paren", "completion": "const app = express()"}
+ {"prompt": "star star", "completion": "**"}
+ {"prompt": "camel case get value", "completion": "getValue"}
+ {"prompt": "brew install dash dash cask visual dash studio dash code", "completion": "brew install --cask visual-studio-code"}
+ {"prompt": "open paren X colon pascal case int comma Y colon pascal case int close paren", "completion": "(x: Int, y: Int)"}
+ {"prompt": "open angle close angle", "completion": "<>"}
+ {"prompt": "docker build dash T my dash app colon latest dot", "completion": "docker build -t my-app:latest ."}
+ {"prompt": "dot dot slash dev", "completion": "../dev"}
+ {"prompt": "dash dash tokens", "completion": "--tokens"}
+ {"prompt": "CD dot dot", "completion": "cd .."}
+ {"prompt": "A semicolon B semicolon C", "completion": "A; B; C"}
+ {"prompt": "dot dot slash dot dot slash node underscore modules", "completion": "../../node_modules"}
+ {"prompt": "dot star", "completion": ".*"}
+ {"prompt": "double quote hello world double quote", "completion": "\"hello world\""}
+ {"prompt": "snake case batch size", "completion": "batch_size"}
+ {"prompt": "dollar all caps USER", "completion": "$USER"}
+ {"prompt": "snake case total tokens generated", "completion": "total_tokens_generated"}
+ {"prompt": "dot dot slash packages", "completion": "../packages"}
+ {"prompt": "dash dash prod", "completion": "--prod"}
+ {"prompt": "export all caps API underscore KEY equals quote my dash key dash one two three quote", "completion": "export API_KEY=\"my-key-123\""}
+ {"prompt": "colon equals", "completion": ":="}
+ {"prompt": "SRC slash components slash button dot T S X", "completion": "src/components/Button.tsx"}
+ {"prompt": "if X less than or equal Y", "completion": "if x <= y"}
+ {"prompt": "pip install dash R requirements dot T X T", "completion": "pip install -r requirements.txt"}
+ {"prompt": "capital hello", "completion": "Hello"}
+ {"prompt": "pascal case debug kit", "completion": "DebugKit"}
training/data/completions/train.jsonl ADDED
@@ -0,0 +1,491 @@
+ {"prompt": "pascal case floating pill", "completion": "FloatingPill"}
+ {"prompt": "two fifty six", "completion": "256"}
+ {"prompt": "X plus equals one", "completion": "x+=1"}
+ {"prompt": "four zero nine six", "completion": "4096"}
+ {"prompt": "dot dot slash dot dot slash dot dot slash config", "completion": "../../../config"}
+ {"prompt": "dot dot slash dot dot slash dot dot slash", "completion": "../../../"}
+ {"prompt": "space separated A B C", "completion": "A B C"}
+ {"prompt": "camel case on submit", "completion": "onSubmit"}
+ {"prompt": "dot dot slash", "completion": "../"}
+ {"prompt": "at types slash node", "completion": "@types/node"}
+ {"prompt": "git reset dash dash soft all caps HEAD tilde one", "completion": "git reset --soft HEAD~1"}
+ {"prompt": "at escaping open paren close paren arrow capital void", "completion": "@escaping () -> Void"}
+ {"prompt": "async function camel case inference call open paren method colon string close paren", "completion": "async function inferenceCall(method: string)"}
+ {"prompt": "quote service bridge quote", "completion": "\"ServiceBridge\""}
+ {"prompt": "hello at world", "completion": "hello@world"}
+ {"prompt": "camel case set value", "completion": "setValue"}
+ {"prompt": "test dash D dot venv and and source dot venv slash bin slash activate", "completion": "test -d .venv && source .venv/bin/activate"}
+ {"prompt": "if X equals equals equals Y", "completion": "if x === y"}
+ {"prompt": "pascal case app delegate", "completion": "AppDelegate"}
+ {"prompt": "X dash equals one", "completion": "x-=1"}
+ {"prompt": "GH PR view dash dash web", "completion": "gh pr view --web"}
+ {"prompt": "dash dash title quote fix inference latency quote", "completion": "--title \"fix inference latency\""}
+ {"prompt": "underscore underscore init underscore underscore", "completion": "__init__"}
+ {"prompt": "no space hello world", "completion": "helloworld"}
+ {"prompt": "at OpenAI slash codex", "completion": "@openai/codex"}
+ {"prompt": "SSH user at host dash P two two two two", "completion": "ssh user@host -p 2222"}
+ {"prompt": "A plus B star C", "completion": "a + b * c"}
+ {"prompt": "A and and B", "completion": "a && b"}
+ {"prompt": "if X less than Y", "completion": "if x < y"}
+ {"prompt": "at main actor", "completion": "@MainActor"}
+ {"prompt": "zero", "completion": "0"}
+ {"prompt": "forty two", "completion": "42"}
+ {"prompt": "typeof X equals equals equals single quote string single quote", "completion": "typeof x === 'string'"}
+ {"prompt": "docker run dash D dash P four four three colon four four three nginx colon alpine", "completion": "docker run -d -p 443:443 nginx:alpine"}
+ {"prompt": "GH PR create dash dash title quote fix inference latency quote dash dash body quote added TTFT tracking and latency instrumentation quote", "completion": "gh pr create --title \"fix inference latency\" --body \"Added TTFT tracking and latency instrumentation\""}
+ {"prompt": "dot dot slash dot dot slash dot dot slash dot dot slash", "completion": "../../../../"}
+ {"prompt": "for I in star dot log semicolon do echo dollar I semicolon done", "completion": "for i in *.log; do echo $i; done"}
+ {"prompt": "git log dash dash oneline dash N ten", "completion": "git log --oneline -n 10"}
+ {"prompt": "pip install dash E dot", "completion": "pip install -e ."}
+ {"prompt": "one point zero", "completion": "1.0"}
+ {"prompt": "A question mark dot B", "completion": "a?.b"}
+ {"prompt": "dash dash verbose", "completion": "--verbose"}
+ {"prompt": "open angle T colon pascal case codable close angle", "completion": "<T: Codable>"}
+ {"prompt": "open bracket colon colon dash one close bracket", "completion": "[::-1]"}
+ {"prompt": "launch D dot S H", "completion": "launchd.sh"}
+ {"prompt": "SSH dash P two two two two user at server", "completion": "ssh -p 2222 user@server"}
+ {"prompt": "HTTPS colon slash slash example dot com slash page hash section", "completion": "https://example.com/page#section"}
+ {"prompt": "hash if all caps DEBUG", "completion": "#if DEBUG"}
+ {"prompt": "const camel case bridge port equals one nine eight two two", "completion": "const bridgePort = 19822"}
+ {"prompt": "dot frame open paren width colon two hundred comma height colon one hundred close paren", "completion": ".frame(width: 200, height: 100)"}
+ {"prompt": "open brace close brace", "completion": "{}"}
+ {"prompt": "dot P Y", "completion": ".py"}
+ {"prompt": "dot next", "completion": ".next"}
+ {"prompt": "A pipe greater than B", "completion": "a |> b"}
+ {"prompt": "pascal case user defaults", "completion": "UserDefaults"}
+ {"prompt": "camel case will appear", "completion": "willAppear"}
+ {"prompt": "private var camel case is recording equals false", "completion": "private var isRecording = false"}
+ {"prompt": "pascal case notch composer", "completion": "NotchComposer"}
+ {"prompt": "HTTPS colon slash slash N P M J S dot com slash package slash commander", "completion": "https://npmjs.com/package/commander"}
+ {"prompt": "X star equals two", "completion": "x*=2"}
+ {"prompt": "all caps hello", "completion": "HELLO"}
+ {"prompt": "open bracket pascal case int close bracket", "completion": "[Int]"}
+ {"prompt": "document dot camel case query selector open paren single quote dot container single quote close paren", "completion": "document.querySelector('.container')"}
+ {"prompt": "vercel dash labs slash agent dash skills", "completion": "vercel-labs/agent-skills"}
+ {"prompt": "VM map dash summary dollar open paren P grep dash X talkie close paren two greater than slash dev slash null pipe head dash twenty five", "completion": "vmmap -summary $(pgrep -x Talkie) 2>/dev/null | head -25"}
+ {"prompt": "dash dash body quote added TTFT tracking quote", "completion": "--body \"Added TTFT tracking\""}
+ {"prompt": "hello pipe world", "completion": "hello|world"}
+ {"prompt": "dot dot slash dot dot slash dot dot slash utils", "completion": "../../../utils"}
+ {"prompt": "A pipe pipe B", "completion": "a || b"}
+ {"prompt": "dash F S S L", "completion": "-fsSL"}
+ {"prompt": "dash dash since", "completion": "--since"}
+ {"prompt": "docker exec dash I T container underscore name bash", "completion": "docker exec -it container_name bash"}
+ {"prompt": "index dot T S", "completion": "index.ts"}
+ {"prompt": "forward slash", "completion": "/"}
+ {"prompt": "open paren close paren arrow capital void", "completion": "() -> Void"}
76
+ {"prompt": "func camel case view did load open paren close paren", "completion": "func viewDidLoad()"}
77
+ {"prompt": "screaming snake case default timeout", "completion": "DEFAULT_TIMEOUT"}
78
+ {"prompt": "try await container dot perform", "completion": "try await container.perform"}
79
+ {"prompt": "if X equals equals Y", "completion": "if x == y"}
80
+ {"prompt": "dot dot slash configs", "completion": "../configs"}
81
+ {"prompt": "hello dash world", "completion": "hello-world"}
82
+ {"prompt": "import open brace camel case use effect close brace from single quote react single quote", "completion": "import { useEffect } from 'react'"}
83
+ {"prompt": "console dot error open paren quote failed quote close paren", "completion": "console.error(\"failed\")"}
84
+ {"prompt": "if X greater than or equal Y", "completion": "if x >= y"}
85
+ {"prompt": "guard let X equals Y else", "completion": "guard let x = y else"}
86
+ {"prompt": "dot dot slash dot dot slash assets", "completion": "../../assets"}
87
+ {"prompt": "dot dot slash dot dot slash dot dot slash index dot T S", "completion": "../../../index.ts"}
88
+ {"prompt": "python three dash M HTTP dot server eighty eighty", "completion": "python3 -m http.server 8080"}
89
+ {"prompt": "dot dot slash dot dot slash src", "completion": "../../src"}
90
+ {"prompt": "kebab case hello world", "completion": "hello-world"}
91
+ {"prompt": "dot S H", "completion": ".sh"}
92
+ {"prompt": "at types slash react", "completion": "@types/react"}
93
+ {"prompt": "at user", "completion": "@user"}
94
+ {"prompt": "kebab case file upload", "completion": "file-upload"}
95
+ {"prompt": "quote hello world quote", "completion": "\"hello world\""}
96
+ {"prompt": "open bracket dash Z dollar all caps API underscore KEY close bracket and and echo quote missing key quote", "completion": "[ -z $API_KEY ] && echo \"missing key\""}
97
+ {"prompt": "HTTPS colon slash slash my dot use talkie dot com slash account", "completion": "https://my.useTalkie.com/account"}
98
+ {"prompt": "all caps API", "completion": "API"}
99
+ {"prompt": "snake case file path", "completion": "file_path"}
100
+ {"prompt": "screaming snake case launch daemon", "completion": "LAUNCH_DAEMON"}
101
+ {"prompt": "dash dash model", "completion": "--model"}
102
+ {"prompt": "git clone git at GitHub dot com colon arach slash use talkie dot com dot git tilde slash dev slash use talkie dot com", "completion": "git clone git@github.com:arach/usetalkie.com.git ~/dev/usetalkie.com"}
103
+ {"prompt": "slash dev slash null", "completion": "/dev/null"}
104
+ {"prompt": "single quote hello world single quote", "completion": "'hello world'"}
105
+ {"prompt": "snake case max tokens", "completion": "max_tokens"}
106
+ {"prompt": "star slash", "completion": "*/"}
107
+ {"prompt": "at objc", "completion": "@objc"}
108
+ {"prompt": "HTTPS colon slash slash API dot github dot com slash repos", "completion": "https://api.github.com/repos"}
109
+ {"prompt": "open bracket one colon five close bracket", "completion": "[1:5]"}
110
+ {"prompt": "no space time stamp", "completion": "timestamp"}
111
+ {"prompt": "curl dash F S S L HTTPS colon slash slash example dot com slash install dot S H pipe bash", "completion": "curl -fsSL https://example.com/install.sh | bash"}
112
+ {"prompt": "dash dash no dash verify", "completion": "--no-verify"}
113
+ {"prompt": "arach at tchoupani dot com", "completion": "arach@tchoupani.com"}
114
+ {"prompt": "dash A", "completion": "-A"}
115
+ {"prompt": "swift build dash C release", "completion": "swift build -c release"}
116
+ {"prompt": "X slash equals two", "completion": "x/=2"}
117
+ {"prompt": "tilde slash dot SSH slash config", "completion": "~/.ssh/config"}
118
+ {"prompt": "LS dash L A pipe grep dot swift", "completion": "ls -la | grep .swift"}
119
+ {"prompt": "backtick hello world backtick", "completion": "`hello world`"}
120
+ {"prompt": "dot dot less than", "completion": "..<"}
121
+ {"prompt": "M K dir dash P SRC slash components and and touch SRC slash components slash button dot T S X", "completion": "mkdir -p src/components && touch src/components/Button.tsx"}
122
+ {"prompt": "plus plus", "completion": "++"}
123
+ {"prompt": "echo quote hello world quote pipe P B copy", "completion": "echo \"hello world\" | pbcopy"}
124
+ {"prompt": "single quote UTF dash eight single quote", "completion": "'utf-8'"}
125
+ {"prompt": "hash selector open paren camel case handle tap open paren underscore colon close paren close paren", "completion": "#selector(handleTap(_:))"}
126
+ {"prompt": "question mark colon", "completion": "?:"}
127
+ {"prompt": "kebab case web design guidelines", "completion": "web-design-guidelines"}
128
+ {"prompt": "quote hello comma world quote", "completion": "\"hello, world\""}
129
+ {"prompt": "one twenty eight", "completion": "128"}
130
+ {"prompt": "hello plus world", "completion": "hello+world"}
131
+ {"prompt": "dash dash skill", "completion": "--skill"}
132
+ {"prompt": "git at GitHub dot com colon arach slash use talkie dot com dot git", "completion": "git@github.com:arach/usetalkie.com.git"}
133
+ {"prompt": "dot E N V", "completion": ".env"}
134
+ {"prompt": "pipe pipe", "completion": "||"}
135
+ {"prompt": "star dot swift", "completion": "*.swift"}
136
+ {"prompt": "docker run dash D dash P eighty eighty colon eighty eighty nginx", "completion": "docker run -d -p 8080:8080 nginx"}
137
+ {"prompt": "bang equals equals", "completion": "!=="}
138
+ {"prompt": "hello underscore world", "completion": "hello_world"}
139
+ {"prompt": "one nine two dot one six eight dot one dot one", "completion": "192.168.1.1"}
140
+ {"prompt": "J Q single quote dot name single quote package dot JSON", "completion": "jq '.name' package.json"}
141
+ {"prompt": "screaming snake case base URL", "completion": "BASE_URL"}
142
+ {"prompt": "snake case user id", "completion": "user_id"}
143
+ {"prompt": "dollar open paren PWD close paren", "completion": "$(pwd)"}
144
+ {"prompt": "if open bracket dash F config dot yaml close bracket semicolon then source config dot yaml semicolon fi", "completion": "if [ -f config.yaml ]; then source config.yaml; fi"}
145
+ {"prompt": "hello question mark world", "completion": "hello?world"}
146
+ {"prompt": "zero X F F", "completion": "0xFF"}
147
+ {"prompt": "X ampersand Y", "completion": "x & y"}
148
+ {"prompt": "camel case on change", "completion": "onChange"}
149
+ {"prompt": "open bracket one comma two comma three close bracket", "completion": "[1, 2, 3]"}
150
+ {"prompt": "hello exclamation point world", "completion": "hello!world"}
151
+ {"prompt": "node dash dash max dash old dash space dash size equals four zero nine six", "completion": "node --max-old-space-size=4096"}
152
+ {"prompt": "dash P", "completion": "-p"}
153
+ {"prompt": "A pipe pipe B pipe pipe C", "completion": "a || b || c"}
154
+ {"prompt": "screaming snake case hello world", "completion": "HELLO_WORLD"}
155
+ {"prompt": "kebab case search results", "completion": "search-results"}
156
+ {"prompt": "less than", "completion": "<"}
157
+ {"prompt": "python three dash M pip install dash dash upgrade pip", "completion": "python3 -m pip install --upgrade pip"}
158
+ {"prompt": "export all caps NODE underscore ENV equals quote production quote", "completion": "export NODE_ENV=\"production\""}
159
+ {"prompt": "try await self dot camel case load model open paren close paren", "completion": "try await self.loadModel()"}
160
+ {"prompt": "process dot exit open paren one close paren", "completion": "process.exit(1)"}
161
+ {"prompt": "arach slash clipper", "completion": "arach/clipper"}
162
+ {"prompt": "open bracket pascal case string close bracket", "completion": "[String]"}
163
+ {"prompt": "no space host name", "completion": "hostname"}
164
+ {"prompt": "open angle pascal case input comma pascal case output close angle", "completion": "<Input, Output>"}
165
+ {"prompt": "A W K single quote open brace print dollar one close brace single quote", "completion": "awk '{print $1}'"}
166
+ {"prompt": "slash slash", "completion": "//"}
167
+ {"prompt": "git add dash A and and git commit dash M quote fix typo quote and and git push", "completion": "git add -A && git commit -m \"fix typo\" && git push"}
168
+ {"prompt": "snake case learning rate", "completion": "learning_rate"}
169
+ {"prompt": "backtick dollar open brace name close brace backtick", "completion": "`${name}`"}
170
+ {"prompt": "eight thousand", "completion": "8000"}
171
+ {"prompt": "pascal case table view", "completion": "TableView"}
172
+ {"prompt": "X dot dot less than Y", "completion": "x..<y"}
173
+ {"prompt": "at binding var text colon pascal case string", "completion": "@Binding var text: String"}
174
+ {"prompt": "pascal case inference model catalog", "completion": "InferenceModelCatalog"}
175
+ {"prompt": "thirty D", "completion": "30d"}
176
+ {"prompt": "five M", "completion": "5m"}
177
+ {"prompt": "quote content dash type quote", "completion": "\"content-type\""}
178
+ {"prompt": "zero point five", "completion": "0.5"}
179
+ {"prompt": "echo dollar all caps HOME", "completion": "echo $HOME"}
180
+ {"prompt": "try question mark X", "completion": "try? x"}
181
+ {"prompt": "git at GitHub dot com colon arach slash clipper dot git", "completion": "git@github.com:arach/clipper.git"}
182
+ {"prompt": "hello percent world", "completion": "hello%world"}
183
+ {"prompt": "package dot JSON", "completion": "package.json"}
184
+ {"prompt": "three thousand", "completion": "3000"}
185
+ {"prompt": "dash dash port", "completion": "--port"}
186
+ {"prompt": "nine zero nine zero", "completion": "9090"}
187
+ {"prompt": "no space data base", "completion": "database"}
188
+ {"prompt": "camel case did finish loading", "completion": "didFinishLoading"}
189
+ {"prompt": "func camel case get status open paren close paren arrow pascal case inference status", "completion": "func getStatus() -> InferenceStatus"}
190
+ {"prompt": "hash mark dash life cycle", "completion": "// MARK: - Life Cycle"}
191
+ {"prompt": "git checkout dash B feature slash add dash auth", "completion": "git checkout -b feature/add-auth"}
192
+ {"prompt": "dot T X T", "completion": ".txt"}
193
+ {"prompt": "tilde slash dot SSH slash ID underscore RSA", "completion": "~/.ssh/id_rsa"}
194
+ {"prompt": "snake case bridge port", "completion": "bridge_port"}
195
+ {"prompt": "camel case view model", "completion": "viewModel"}
196
+ {"prompt": "slash usr slash local slash bin", "completion": "/usr/local/bin"}
197
+ {"prompt": "dollar zero", "completion": "$0"}
198
+ {"prompt": "talkie dash dev dev rebuild agent", "completion": "talkie-dev dev rebuild agent"}
199
+ {"prompt": "hello semicolon world", "completion": "hello;world"}
200
+ {"prompt": "slash star", "completion": "/*"}
201
+ {"prompt": "cat package dot JSON pipe python three dash C quote import JSON semicolon print open paren JSON dot load open paren open open paren single quote R single quote close paren close paren close paren quote", "completion": "cat package.json | python3 -c \"import json; print(json.load(open('r')))\""}
202
+ {"prompt": "JSON dot stringify open paren obj close paren", "completion": "JSON.stringify(obj)"}
203
+ {"prompt": "PS aux pipe grep node", "completion": "ps aux | grep node"}
204
+ {"prompt": "const camel case fetch data equals async open paren close paren fat arrow", "completion": "const fetchData = async () =>"}
205
+ {"prompt": "open bracket pascal case string colon any close bracket", "completion": "[String: Any]"}
206
+ {"prompt": "A question mark question mark B", "completion": "a ?? b"}
207
+ {"prompt": "quote quote", "completion": "\"\""}
208
+ {"prompt": "open bracket colon five close bracket", "completion": "[:5]"}
209
+ {"prompt": "camel case should auto rotate", "completion": "shouldAutoRotate"}
210
+ {"prompt": "slash usr slash local slash bin slash python three", "completion": "/usr/local/bin/python3"}
211
+ {"prompt": "quote star dot log quote", "completion": "\"*.log\""}
212
+ {"prompt": "zero point seven", "completion": "0.7"}
213
+ {"prompt": "no space code base", "completion": "codebase"}
214
+ {"prompt": "dash dash system", "completion": "--system"}
215
+ {"prompt": "hello asterisk world", "completion": "hello*world"}
216
+ {"prompt": "empty string", "completion": "\"\""}
217
+ {"prompt": "HTTPS colon slash slash example dot com slash API slash V one slash users", "completion": "https://example.com/api/v1/users"}
218
+ {"prompt": "equals equals equals", "completion": "==="}
219
+ {"prompt": "zero dot zero dot zero dot zero colon eighty eighty", "completion": "0.0.0.0:8080"}
220
+ {"prompt": "X equals five", "completion": "x=5"}
221
+ {"prompt": "triple backtick python", "completion": "```python"}
222
+ {"prompt": "X question mark dot Y question mark dot Z", "completion": "x?.y?.z"}
223
+ {"prompt": "tilde slash dot bash profile", "completion": "~/.bash_profile"}
224
+ {"prompt": "slash tmp", "completion": "/tmp"}
225
+ {"prompt": "HTTPS colon slash slash hugging face dot co slash MLX dash community", "completion": "https://huggingface.co/mlx-community"}
226
+ {"prompt": "X less than less than Y", "completion": "x << y"}
227
+ {"prompt": "docker run dash P three thousand colon three thousand redis", "completion": "docker run -p 3000:3000 redis"}
228
+ {"prompt": "JDI dot talkie dot inference dot dev", "completion": "jdi.talkie.inference.dev"}
229
+ {"prompt": "if X greater than Y", "completion": "if x > y"}
230
+ {"prompt": "dot dot slash dot dot slash lib", "completion": "../../lib"}
231
+ {"prompt": "open bracket zero colon close bracket", "completion": "[0:]"}
232
+ {"prompt": "sixty four", "completion": "64"}
233
+ {"prompt": "HTTPS colon slash slash cloud dot use talkie dot com", "completion": "https://cloud.useTalkie.com"}
234
+ {"prompt": "NPM install dash capital D typescript", "completion": "npm install -D typescript"}
235
+ {"prompt": "SRC slash utils slash index dot T S", "completion": "src/utils/index.ts"}
236
+ {"prompt": "and and", "completion": "&&"}
237
+ {"prompt": "HTTPS colon slash slash API dot use talkie dot com slash API slash user", "completion": "https://api.useTalkie.com/api/user"}
238
+ {"prompt": "camel case get user name", "completion": "getUserName"}
239
+ {"prompt": "dash dash help", "completion": "--help"}
240
+ {"prompt": "mac OS slash talkie kit slash", "completion": "macOS/TalkieKit/"}
241
+ {"prompt": "dash G", "completion": "-g"}
242
+ {"prompt": "all caps PORT equals three thousand", "completion": "PORT=3000"}
243
+ {"prompt": "for file in star dot log semicolon do RM dollar file semicolon done", "completion": "for file in *.log; do rm $file; done"}
244
+ {"prompt": "no space drop down", "completion": "dropdown"}
245
+ {"prompt": "snake case hello world", "completion": "hello_world"}
246
+ {"prompt": "equals equals", "completion": "=="}
247
+ {"prompt": "sed single quote S slash old slash new slash G single quote file dot T X T", "completion": "sed 's/old/new/g' file.txt"}
248
+ {"prompt": "open paren A comma B close paren arrow pascal case int", "completion": "(a, b) -> Int"}
249
+ {"prompt": "X bang", "completion": "x!"}
250
+ {"prompt": "X dot dot dot Y", "completion": "x...y"}
251
+ {"prompt": "HTTPS colon slash slash accounts dot use talkie dot com slash sign dash in", "completion": "https://accounts.useTalkie.com/sign-in"}
252
+ {"prompt": "one two seven dot zero dot zero dot one", "completion": "127.0.0.1"}
253
+ {"prompt": "slash users slash arach slash dev slash dot E N V", "completion": "/Users/arach/dev/.env"}
254
+ {"prompt": "bang equals", "completion": "!="}
255
+ {"prompt": "screaming snake case bundle ID", "completion": "BUNDLE_ID"}
256
+ {"prompt": "dash dash temp", "completion": "--temp"}
257
+ {"prompt": "new line A new line B new line C", "completion": "A\nB\nC"}
258
+ {"prompt": "S equals S open bracket colon colon dash one close bracket", "completion": "s = s[::-1]"}
259
+ {"prompt": "MLX dash community slash Qwen two point five dash zero point five B dash instruct dash four bit", "completion": "mlx-community/Qwen2.5-0.5B-Instruct-4bit"}
260
+ {"prompt": "dot dot slash SRC", "completion": "../src"}
261
+ {"prompt": "localhost colon eighty eighty", "completion": "localhost:8080"}
262
+ {"prompt": "WC dash L star dot swift", "completion": "wc -l *.swift"}
263
+ {"prompt": "one hundred K", "completion": "100k"}
264
+ {"prompt": "quote application slash JSON quote", "completion": "\"application/json\""}
265
+ {"prompt": "new line hello new line world", "completion": "hello\nworld"}
266
+ {"prompt": "double greater than", "completion": ">>"}
267
+ {"prompt": "not equals", "completion": "!="}
268
+ {"prompt": "shebang slash bin slash bash", "completion": "#!/bin/bash"}
269
+ {"prompt": "at available open paren mac OS thirteen comma star close paren", "completion": "@available(macOS 13, *)"}
270
+ {"prompt": "quote fix colon improve performance quote", "completion": "\"fix: improve performance\""}
271
+ {"prompt": "quote chore colon update dependencies quote", "completion": "\"chore: update dependencies\""}
272
+ {"prompt": "import pascal case talkie kit", "completion": "import TalkieKit"}
273
+ {"prompt": "X percent Y", "completion": "x % y"}
274
+ {"prompt": "A comma B comma C", "completion": "A, B, C"}
275
+ {"prompt": "dot T S X", "completion": ".tsx"}
276
+ {"prompt": "X args dash I open brace close brace echo open brace close brace", "completion": "xargs -I {} echo {}"}
277
+ {"prompt": "camel case hello world", "completion": "helloWorld"}
278
+ {"prompt": "open angle T close angle", "completion": "<T>"}
279
+ {"prompt": "chmod plus X install dot S H", "completion": "chmod +x install.sh"}
280
+ {"prompt": "open bracket weak self close bracket", "completion": "[weak self]"}
281
+ {"prompt": "GH API repos slash arach slash talkie slash pulls", "completion": "gh api repos/arach/talkie/pulls"}
282
+ {"prompt": "hello caret world", "completion": "hello^world"}
283
+ {"prompt": "slash var slash log", "completion": "/var/log"}
284
+ {"prompt": "plus equals", "completion": "+="}
285
+ {"prompt": "com dot JDI dot talkie", "completion": "com.jdi.talkie"}
286
+ {"prompt": "all caps HTTP", "completion": "HTTP"}
287
+ {"prompt": "dash M", "completion": "-m"}
288
+ {"prompt": "one hundred", "completion": "100"}
289
+ {"prompt": "open bracket zero close bracket", "completion": "[0]"}
290
+ {"prompt": "two point five", "completion": "2.5"}
291
+ {"prompt": "dollar open paren P grep dash X talkie close paren", "completion": "$(pgrep -x Talkie)"}
292
+ {"prompt": "git push dash U origin feature slash my dash feature", "completion": "git push -u origin feature/my-feature"}
293
+ {"prompt": "dot dot slash utils", "completion": "../utils"}
294
+ {"prompt": "JSON dot parse open paren data close paren", "completion": "JSON.parse(data)"}
295
+ {"prompt": "case dot camel case success open paren let value close paren", "completion": "case .success(let value)"}
296
+ {"prompt": "no space name space", "completion": "namespace"}
297
+ {"prompt": "greater than", "completion": ">"}
298
+ {"prompt": "no space git hub", "completion": "github"}
299
+ {"prompt": "open brace quote model quote colon quote llama quote close brace", "completion": "{\"model\": \"llama\"}"}
300
+ {"prompt": "require open paren single quote path single quote close paren", "completion": "require('path')"}
301
+ {"prompt": "hello backslash world", "completion": "hello\\world"}
302
+ {"prompt": "no space java script", "completion": "javascript"}
303
+ {"prompt": "no space H top", "completion": "htop"}
304
+ {"prompt": "open paren X close paren fat arrow open brace close brace", "completion": "(x) => {}"}
305
+ {"prompt": "array open angle string close angle", "completion": "Array<string>"}
306
+ {"prompt": "slash etc slash hosts", "completion": "/etc/hosts"}
307
+ {"prompt": "dash dash temp zero point seven dash dash tokens five twelve", "completion": "--temp 0.7 --tokens 512"}
308
+ {"prompt": "no space work space", "completion": "workspace"}
309
+ {"prompt": "camel case text field", "completion": "textField"}
310
+ {"prompt": "MLX dash community slash llama dash three dot two dash one B dash instruct dash four bit", "completion": "mlx-community/Llama-3.2-1B-Instruct-4bit"}
311
+ {"prompt": "dollar open brace one colon dash default close brace", "completion": "${1:-default}"}
312
+ {"prompt": "export all caps DATABASE underscore URL equals quote postgres colon slash slash localhost slash DB quote", "completion": "export DATABASE_URL=\"postgres://localhost/db\""}
313
+ {"prompt": "slash users slash arach slash dev slash vif slash scripts slash launch D dot S H", "completion": "/Users/arach/dev/vif/scripts/launchd.sh"}
314
+ {"prompt": "hello hash world", "completion": "hello#world"}
315
+ {"prompt": "single quote W single quote", "completion": "'w'"}
316
+ {"prompt": "git diff dash dash staged", "completion": "git diff --staged"}
317
+ {"prompt": "arrow", "completion": "->"}
318
+ {"prompt": "X greater than greater than Y", "completion": "x >> y"}
319
+ {"prompt": "find dot dash name quote star dot swift quote", "completion": "find . -name \"*.swift\""}
320
+ {"prompt": "five twelve", "completion": "512"}
321
+ {"prompt": "dot swift", "completion": ".swift"}
322
+ {"prompt": "self dot camel case loaded model ID", "completion": "self.loadedModelId"}
323
+ {"prompt": "quote star dot T S quote", "completion": "\"*.ts\""}
324
+ {"prompt": "alias L L equals single quote LS dash L A single quote", "completion": "alias ll='ls -la'"}
325
+ {"prompt": "dollar open brace all caps API underscore KEY close brace", "completion": "${API_KEY}"}
326
+ {"prompt": "pipe tee output dot log", "completion": "| tee output.log"}
327
+ {"prompt": "pascal case shared settings", "completion": "SharedSettings"}
328
+ {"prompt": "docker compose up dash D", "completion": "docker compose up -d"}
329
+ {"prompt": "no space web pack", "completion": "webpack"}
330
+ {"prompt": "open paren X close paren", "completion": "(x)"}
331
+ {"prompt": "cargo build dash dash release", "completion": "cargo build --release"}
332
+ {"prompt": "node underscore modules slash dot bin", "completion": "node_modules/.bin"}
333
+ {"prompt": "zero point zero one", "completion": "0.01"}
334
+ {"prompt": "fat arrow", "completion": "=>"}
335
+ {"prompt": "import open brace camel case use state close brace from single quote react single quote", "completion": "import { useState } from 'react'"}
336
+ {"prompt": "greater than greater than greater than", "completion": ">>>"}
337
+ {"prompt": "four four three", "completion": "443"}
338
+ {"prompt": "dollar open brace all caps HOME close brace slash dot config", "completion": "${HOME}/.config"}
339
+ {"prompt": "open angle pascal case string comma pascal case string close angle", "completion": "<String, String>"}
340
+ {"prompt": "X caret Y", "completion": "x ^ y"}
341
+ {"prompt": "snake case updated at", "completion": "updated_at"}
342
+ {"prompt": "X question mark", "completion": "x?"}
343
+ {"prompt": "pipe head dash twenty five", "completion": "| head -25"}
344
+ {"prompt": "pascal case audio recorder", "completion": "AudioRecorder"}
345
+ {"prompt": "install dot S H", "completion": "install.sh"}
346
+ {"prompt": "X as bang Y", "completion": "x as! y"}
347
+ {"prompt": "percent S", "completion": "%s"}
348
+ {"prompt": "snake case num layers", "completion": "num_layers"}
349
+ {"prompt": "output dot T X T", "completion": "output.txt"}
350
+ {"prompt": "security unlock dash keychain tilde slash library slash keychains slash login dot keychain dash DB", "completion": "security unlock-keychain ~/Library/Keychains/login.keychain-db"}
351
+ {"prompt": "talkie dash dev dev logs agent dash dash since ten M", "completion": "talkie-dev dev logs agent --since 10m"}
352
+ {"prompt": "dot dot slash models", "completion": "../models"}
353
+ {"prompt": "dollar open brace", "completion": "${"}
354
+ {"prompt": "dot dot slash tests", "completion": "../tests"}
355
+ {"prompt": "hello ampersand world", "completion": "hello&world"}
356
+ {"prompt": "A and and B and and C", "completion": "a && b && c"}
357
+ {"prompt": "pascal case hello world", "completion": "HelloWorld"}
358
+ {"prompt": "let result colon pascal case result open angle pascal case string comma pascal case error close angle", "completion": "let result: Result<String, Error>"}
359
+ {"prompt": "dot M D", "completion": ".md"}
360
+ {"prompt": "quote feat colon add dark mode quote", "completion": "\"feat: add dark mode\""}
361
+ {"prompt": "HTTPS colon slash slash open code dot AI slash install", "completion": "https://opencode.ai/install"}
362
+ {"prompt": "X arrow Y", "completion": "x -> y"}
363
+ {"prompt": "dash I", "completion": "-i"}
364
+ {"prompt": "module dot exports equals open brace close brace", "completion": "module.exports = {}"}
365
+ {"prompt": "at observable", "completion": "@Observable"}
366
+ {"prompt": "dot dot slash scripts", "completion": "../scripts"}
367
+ {"prompt": "dash dash no dash verify", "completion": "--no-verify"}
368
+ {"prompt": "dot JSON", "completion": ".json"}
369
+ {"prompt": "underscore", "completion": "_"}
370
+ {"prompt": "all caps JSON", "completion": "JSON"}
371
+ {"prompt": "dot dot slash dot dot slash", "completion": "../../"}
372
+ {"prompt": "dash dash since five M two greater than ampersand one pipe grep error", "completion": "--since 5m 2>&1 | grep error"}
373
+ {"prompt": "camel case model container", "completion": "modelContainer"}
374
+ {"prompt": "dot git ignore", "completion": ".gitignore"}
375
+ {"prompt": "localhost colon three thousand", "completion": "localhost:3000"}
376
+ {"prompt": "hello slash world", "completion": "hello/world"}
377
+ {"prompt": "talkie colon slash slash dictate", "completion": "talkie://dictate"}
378
+ {"prompt": "dash R", "completion": "-r"}
379
+ {"prompt": "launch CTL load tilde slash library slash launch agents slash com dot JDI dot talkie dot plist", "completion": "launchctl load ~/Library/LaunchAgents/com.jdi.talkie.plist"}
380
+ {"prompt": "open bracket close bracket", "completion": "[]"}
381
+ {"prompt": "left paren right paren", "completion": "()"}
382
+ {"prompt": "CLI slash SRC slash commands slash inference dot T S", "completion": "cli/src/commands/inference.ts"}
383
+ {"prompt": "kebab case my component", "completion": "my-component"}
384
+ {"prompt": "tilde slash dot config", "completion": "~/.config"}
385
+ {"prompt": "dot T S", "completion": ".ts"}
386
+ {"prompt": "git branch dash capital D feature slash fix dash typo", "completion": "git branch -D feature/fix-typo"}
387
+ {"prompt": "all caps XPC", "completion": "XPC"}
388
+ {"prompt": "WS colon slash slash one twenty seven dot zero dot zero dot one colon one nine eight two two", "completion": "ws://127.0.0.1:19822"}
389
+ {"prompt": "backtick dollar open brace camel case user name close brace backtick", "completion": "`${userName}`"}
390
+ {"prompt": "open bracket open bracket test close bracket close bracket", "completion": "[[test]]"}
391
+ {"prompt": "echo dollar all caps PATH", "completion": "echo $PATH"}
392
+ {"prompt": "talkie colon slash slash auth slash callback", "completion": "talkie://auth/callback"}
393
+ {"prompt": "open dash A pascal case safari HTTPS colon slash slash example dot com", "completion": "open -a Safari https://example.com"}
394
+ {"prompt": "star star slash star dot T S X", "completion": "**/*.tsx"}
395
+ {"prompt": "dash dash verbose", "completion": "--verbose"}
396
+ {"prompt": "let camel case max tokens colon pascal case int equals five twelve", "completion": "let maxTokens: Int = 512"}
397
+ {"prompt": "tab A tab B tab C", "completion": "A\tB\tC"}
398
+ {"prompt": "no space open code", "completion": "opencode"}
399
+ {"prompt": "dollar one", "completion": "$1"}
400
+ {"prompt": "no space key chain", "completion": "keychain"}
401
+ {"prompt": "at state private var camel case is showing equals false", "completion": "@State private var isShowing = false"}
402
+ {"prompt": "dash dash resume", "completion": "--resume"}
403
+ {"prompt": "git config dash dash global user dot email quote arach at tchoupani dot com quote", "completion": "git config --global user.email \"arach@tchoupani.com\""}
404
+ {"prompt": "dollar open paren", "completion": "$("}
405
+ {"prompt": "N C dash Z localhost eighty eighty and and echo quote up quote", "completion": "nc -z localhost 8080 && echo \"up\""}
406
+ {"prompt": "camel case file manager", "completion": "fileManager"}
407
+ {"prompt": "dot E N V dot local", "completion": ".env.local"}
408
+ {"prompt": "kill dash nine dollar open paren P grep node close paren", "completion": "kill -9 $(pgrep node)"}
409
+ {"prompt": "camel case handle request", "completion": "handleRequest"}
410
+ {"prompt": "docker run dash P five four three two colon five four three two postgres", "completion": "docker run -p 5432:5432 postgres"}
411
+ {"prompt": "dot slash SRC", "completion": "./src"}
412
+ {"prompt": "screaming snake case max retry count", "completion": "MAX_RETRY_COUNT"}
413
+ {"prompt": "back tick", "completion": "`"}
414
+ {"prompt": "dot slash scripts slash sync dash xcode dash files dot P Y", "completion": "./scripts/sync-xcode-files.py"}
415
+ {"prompt": "X pipe Y", "completion": "x | y"}
416
+ {"prompt": "open paren A plus B close paren star C", "completion": "(a + b) * c"}
417
+ {"prompt": "J Q single quote dot data open bracket close bracket single quote response dot JSON", "completion": "jq '.data[]' response.json"}
418
+ {"prompt": "open angle T colon pascal case comparable close angle", "completion": "<T: Comparable>"}
419
+ {"prompt": "zero X zero zero", "completion": "0x00"}
420
+ {"prompt": "if let X equals Y", "completion": "if let x = y"}
421
+ {"prompt": "lowercase HELLO", "completion": "hello"}
422
+ {"prompt": "dollar all caps SHELL", "completion": "$SHELL"}
423
+ {"prompt": "dash dash dry dash run", "completion": "--dry-run"}
424
+ {"prompt": "backtick camel case is loading backtick", "completion": "`isLoading`"}
425
+ {"prompt": "talkie dash dev dev logs inference dash dash since five M two greater than ampersand one pipe head dash twenty", "completion": "talkie-dev dev logs inference --since 5m 2>&1 | head -20"}
426
+ {"prompt": "hello tilde world", "completion": "hello~world"}
427
+ {"prompt": "tilde slash library slash application support slash talkie", "completion": "~/Library/Application Support/Talkie"}
428
+ {"prompt": "tilde slash dot zshrc", "completion": "~/.zshrc"}
429
+ {"prompt": "MK dir dash P SRC slash lib and and touch SRC slash lib slash utils dot T S", "completion": "mkdir -p src/lib && touch src/lib/utils.ts"}
430
+ {"prompt": "dot dot", "completion": ".."}
431
+ {"prompt": "open paren A comma B comma C close paren", "completion": "(a, b, c)"}
432
+ {"prompt": "sort dash U pipe head dash ten", "completion": "sort -u | head -10"}
433
+ {"prompt": "map open angle string comma number close angle", "completion": "Map<string, number>"}
434
+ {"prompt": "HTTPS colon slash slash pypi dot org slash project slash torch", "completion": "https://pypi.org/project/torch"}
435
+ {"prompt": "two greater than ampersand one", "completion": "2>&1"}
436
+ {"prompt": "NPX skills add HTTPS colon slash slash GitHub dot com slash vercel dash labs slash agent dash skills dash dash skill web dash design dash guidelines", "completion": "npx skills add https://github.com/vercel-labs/agent-skills --skill web-design-guidelines"}
437
+ {"prompt": "dot P B X proj", "completion": ".pbxproj"}
438
+ {"prompt": "X as question mark Y", "completion": "x as? y"}
439
+ {"prompt": "hello dot world", "completion": "hello.world"}
440
+ {"prompt": "three point one four", "completion": "3.14"}
441
+ {"prompt": "triple backtick swift", "completion": "```swift"}
442
+ {"prompt": "single quote use strict single quote", "completion": "'use strict'"}
443
+ {"prompt": "two zero four eight", "completion": "2048"}
444
+ {"prompt": "say dash V ava dash R two eighty quote capital hi comma I'm your new premium voice quote", "completion": "say -v ava -r 280 \"Hi, I'm your new premium voice\""}
445
+ {"prompt": "hash one", "completion": "#1"}
446
+ {"prompt": "dollar dollar", "completion": "$$"}
447
+ {"prompt": "star dot star", "completion": "*.*"}
448
+ {"prompt": "backtick echo dollar all caps HOME backtick", "completion": "`echo $HOME`"}
449
+ {"prompt": "defaults write com dot apple dot dock camel case auto hide dash bool true", "completion": "defaults write com.apple.dock autohide -bool true"}
450
+ {"prompt": "export default function pascal case app open paren close paren", "completion": "export default function App()"}
451
+ {"prompt": "open paren underscore colon pascal case string close paren", "completion": "(_: String)"}
452
+ {"prompt": "console dot log open paren quote hello quote close paren", "completion": "console.log(\"hello\")"}
453
+ {"prompt": "GH release create V one dot zero dot zero dash dash notes quote first release quote", "completion": "gh release create v1.0.0 --notes \"first release\""}
454
+ {"prompt": "open paren close paren fat arrow open brace close brace", "completion": "() => {}"}
455
+ {"prompt": "tab separated hello world", "completion": "hello\tworld"}
456
+ {"prompt": "tilde slash", "completion": "~/"}
457
+ {"prompt": "star star slash star dot T S", "completion": "**/*.ts"}
458
+ {"prompt": "open bracket dash one close bracket", "completion": "[-1]"}
459
+ {"prompt": "services slash talkie dash API", "completion": "services/talkie-api"}
460
+ {"prompt": "try bang X", "completion": "try! x"}
461
+ {"prompt": "no space user name", "completion": "username"}
462
+ {"prompt": "source dot venv slash bin slash activate", "completion": "source .venv/bin/activate"}
463
+ {"prompt": "fetch open paren quote slash API slash data quote close paren", "completion": "fetch(\"/api/data\")"}
464
+ {"prompt": "hello colon world", "completion": "hello:world"}
465
+ {"prompt": "app delegate dot swift", "completion": "AppDelegate.swift"}
466
+ {"prompt": "camel case load model", "completion": "loadModel"}
467
+ {"prompt": "HTTPS colon slash slash GitHub dot com slash arach slash talkie", "completion": "https://github.com/arach/talkie"}
468
+ {"prompt": "git push origin main colon main", "completion": "git push origin main:main"}
469
+ {"prompt": "ten twenty four", "completion": "1024"}
470
+ {"prompt": "tar dash X Z F archive dot tar dot G Z", "completion": "tar -xzf archive.tar.gz"}
471
+ {"prompt": "python three dash C quote import sys semicolon print open paren sys dot version close paren quote", "completion": "python3 -c \"import sys; print(sys.version)\""}
472
+ {"prompt": "guard let self else open brace return close brace", "completion": "guard let self else { return }"}
473
+ {"prompt": "pascal case recording overlay", "completion": "RecordingOverlay"}
474
+ {"prompt": "server dot T S", "completion": "server.ts"}
475
+ {"prompt": "dot dot slash index dot T S", "completion": "../index.ts"}
476
+ {"prompt": "dash dash body", "completion": "--body"}
477
+ {"prompt": "packages slash talkie kit slash sources", "completion": "Packages/TalkieKit/Sources"}
478
+ {"prompt": "case dot camel case failure open paren let error close paren", "completion": "case .failure(let error)"}
479
+ {"prompt": "dash N", "completion": "-n"}
480
+ {"prompt": "dash C", "completion": "-c"}
481
+ {"prompt": "kebab case dark mode toggle", "completion": "dark-mode-toggle"}
482
+ {"prompt": "pascal case screen capture", "completion": "ScreenCapture"}
483
+ {"prompt": "three dot two", "completion": "3.2"}
484
+ {"prompt": "hash bang", "completion": "#!"}
485
+ {"prompt": "git tag dash A V one dot zero dot zero dash M quote initial release quote", "completion": "git tag -a v1.0.0 -m \"initial release\""}
486
+ {"prompt": "at published", "completion": "@Published"}
487
+ {"prompt": "RM dash R F node underscore modules and and RM package dash lock dot JSON", "completion": "rm -rf node_modules && rm package-lock.json"}
488
+ {"prompt": "inference service dot swift", "completion": "InferenceService.swift"}
489
+ {"prompt": "P N P M add dash capital D eslint", "completion": "pnpm add -D eslint"}
490
+ {"prompt": "quote dot slash SRC quote", "completion": "\"./src\""}
491
+ {"prompt": "pascal case view controller", "completion": "ViewController"}
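Several records above turn spoken numbers into digits ("three point one four" → `3.14`, "ten twenty four" → `1024`, "two eighty" → `280`). The sketch below shows one simple rule that reproduces those pairs: each number word becomes its numeral, a tens word absorbs a following unit, and the pieces are concatenated. It is not taken from `processor/procedural.py`; the name `spoken_to_number` and the word tables are assumptions made for illustration, and words such as "hundred" are deliberately left unhandled.

```python
# Hypothetical helper, not part of the repository: converts a run of spoken
# number words into a digit string by concatenating numerals.
UNITS = {"zero": 0, "one": 1, "two": 2, "three": 3, "four": 4,
         "five": 5, "six": 6, "seven": 7, "eight": 8, "nine": 9}
TEENS = {"ten": 10, "eleven": 11, "twelve": 12, "thirteen": 13,
         "fourteen": 14, "fifteen": 15, "sixteen": 16, "seventeen": 17,
         "eighteen": 18, "nineteen": 19}
TENS = {"twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
        "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90}


def spoken_to_number(text: str) -> str:
    """Concatenate numerals for each spoken number word; 'point' becomes '.'."""
    words = text.lower().split()
    pieces = []
    i = 0
    while i < len(words):
        word = words[i]
        if word == "point":
            pieces.append(".")
        elif word in TENS:
            value = TENS[word]
            # "twenty four" -> 24: a tens word absorbs a following unit 1-9.
            following = words[i + 1] if i + 1 < len(words) else None
            if following in UNITS and UNITS[following] != 0:
                value += UNITS[following]
                i += 1
            pieces.append(str(value))
        elif word in TEENS:
            pieces.append(str(TEENS[word]))
        elif word in UNITS:
            pieces.append(str(UNITS[word]))
        else:
            # Anything else ("hundred", "M", ...) is out of scope for this sketch.
            raise ValueError(f"not a supported number word: {word!r}")
        i += 1
    return "".join(pieces)


if __name__ == "__main__":
    for phrase in ("three point one four", "ten twenty four", "two eighty"):
        print(phrase, "->", spoken_to_number(phrase))
```

Concatenation rather than arithmetic is the point of the design: "two eighty" is read the way people say port numbers and rates, which a strict cardinal-number parser would reject.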
training/data/completions/valid.jsonl ADDED
@@ -0,0 +1,59 @@
1
+ {"prompt": "kubectl get pods dash N production", "completion": "kubectl get pods -n production"}
2
+ {"prompt": "single quote R single quote", "completion": "'r'"}
3
+ {"prompt": "all caps HTTPS", "completion": "HTTPS"}
4
+ {"prompt": "hello dollar world", "completion": "hello$world"}
5
+ {"prompt": "X code build dash scheme talkie dash destination quote platform equals mac OS quote", "completion": "xcodebuild -scheme Talkie -destination \"platform=macOS\""}
6
+ {"prompt": "no space home brew", "completion": "homebrew"}
7
+ {"prompt": "hash endif", "completion": "#endif"}
8
+ {"prompt": "dash dash version", "completion": "--version"}
9
+ {"prompt": "single quote single quote", "completion": "''"}
10
+ {"prompt": "pascal case service bridge", "completion": "ServiceBridge"}
11
+ {"prompt": "less than less than", "completion": "<<"}
12
+ {"prompt": "camel case is palindrome", "completion": "isPalindrome"}
13
+ {"prompt": "no space make file", "completion": "makefile"}
14
+ {"prompt": "shebang", "completion": "#!"}
15
+ {"prompt": "all caps URL", "completion": "URL"}
16
+ {"prompt": "tilde slash dev slash talkie", "completion": "~/dev/talkie"}
17
+ {"prompt": "hash hash", "completion": "##"}
18
+ {"prompt": "dash dash json", "completion": "--json"}
19
+ {"prompt": "dot dot dot", "completion": "..."}
20
+ {"prompt": "open brace key colon value close brace", "completion": "{key: value}"}
21
+ {"prompt": "var camel case is generating colon pascal case bool equals false", "completion": "var isGenerating: Bool = false"}
22
+ {"prompt": "dollar open brace all caps DATABASE underscore URL close brace", "completion": "${DATABASE_URL}"}
23
+ {"prompt": "one", "completion": "1"}
24
+ {"prompt": "X star star Y", "completion": "x ** y"}
25
+ {"prompt": "screaming snake case node env", "completion": "NODE_ENV"}
26
+ {"prompt": "dot J S", "completion": ".js"}
27
+ {"prompt": "process dot env dot all caps API underscore KEY", "completion": "process.env.API_KEY"}
28
+ {"prompt": "dollar open paren date plus percent S close paren", "completion": "$(date +%s)"}
29
+ {"prompt": "kebab case audio player", "completion": "audio-player"}
30
+ {"prompt": "open angle string close angle", "completion": "<string>"}
31
+ {"prompt": "dot dot slash services", "completion": "../services"}
32
+ {"prompt": "HTTPS colon slash slash raw dot github user content dot com", "completion": "https://raw.githubusercontent.com"}
33
+ {"prompt": "one nine eight two two", "completion": "19822"}
34
+ {"prompt": "not A", "completion": "!a"}
35
+ {"prompt": "hello star world", "completion": "hello*world"}
36
+ {"prompt": "tilde slash library slash keychains slash login dot keychain dash DB", "completion": "~/Library/Keychains/login.keychain-db"}
37
+ {"prompt": "open brace quote key quote colon quote value quote close brace", "completion": "{\"key\": \"value\"}"}
38
+ {"prompt": "talkie inference generate quote what is two plus two quote dash dash model MLX dash community slash llama dash three dot two dash one B dash instruct dash four bit dash dash verbose dash dash temp zero point seven", "completion": "talkie inference generate \"What is two plus two\" --model mlx-community/Llama-3.2-1B-Instruct-4bit --verbose --temp 0.7"}
39
+ {"prompt": "dash dash title", "completion": "--title"}
40
+ {"prompt": "quote fix colon resolve timeout quote", "completion": "\"fix: resolve timeout\""}
41
+ {"prompt": "X fat arrow Y", "completion": "x => y"}
42
+ {"prompt": "open bracket dash F dot next close bracket and and RM dash RF dot next", "completion": "[ -f .next ] && rm -rf .next"}
43
+ {"prompt": "dash V", "completion": "-v"}
44
+ {"prompt": "shebang slash usr slash bin slash env python three", "completion": "#!/usr/bin/env python3"}
45
+ {"prompt": "promise open angle void close angle", "completion": "Promise<void>"}
46
+ {"prompt": "GH issue list dash dash label bug dash dash state open", "completion": "gh issue list --label bug --state open"}
47
+ {"prompt": "hello bang world", "completion": "hello!world"}
48
+ {"prompt": "tilde X", "completion": "~x"}
49
+ {"prompt": "dollar open brace all caps HOME colon dash slash tmp close brace", "completion": "${HOME:-/tmp}"}
50
+ {"prompt": "guard let container equals self dot camel case model container else", "completion": "guard let container = self.modelContainer else"}
51
+ {"prompt": "dash M quote fix colon resolve XPC timeout on cold start quote", "completion": "-m \"fix: resolve XPC timeout on cold start\""}
52
+ {"prompt": "five one eight nine", "completion": "5189"}
53
+ {"prompt": "dot camel case on appear open brace close brace", "completion": ".onAppear {}"}
54
+ {"prompt": "tsconfig dot JSON", "completion": "tsconfig.json"}
55
+ {"prompt": "pipe X args dash N one echo", "completion": "| xargs -n1 echo"}
56
+ {"prompt": "mac OS slash talkie slash services slash", "completion": "macOS/Talkie/Services/"}
57
+ {"prompt": "FTP colon slash slash files dot example dot com", "completion": "ftp://files.example.com"}
58
+ {"prompt": "A bang bang", "completion": "a!!"}
59
+ {"prompt": "dash equals", "completion": "-="}
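The records in this file are flat `prompt`/`completion` pairs, one JSON object per line. As a rough illustration of how such a validation split could be consumed, the sketch below parses the lines and scores a converter by exact string match. Exact match is an assumption here, chosen because the targets are literal syntax; the names `parse_jsonl` and `exact_match` are hypothetical and are not taken from the repository's eval scripts.

```python
import json
from typing import Callable, Iterable


def parse_jsonl(lines: Iterable[str]) -> list[dict]:
    """Parse prompt/completion records, skipping blank lines."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Each record is expected to carry two string fields.
        if not isinstance(record.get("prompt"), str) or not isinstance(
            record.get("completion"), str
        ):
            raise ValueError(f"malformed record: {line!r}")
        records.append(record)
    return records


def exact_match(records: list[dict], convert: Callable[[str], str]) -> float:
    """Fraction of records where convert(prompt) equals completion exactly."""
    if not records:
        return 0.0
    hits = sum(convert(r["prompt"]) == r["completion"] for r in records)
    return hits / len(records)


# Three lines copied from the diff above, standing in for the full file.
SAMPLE = [
    '{"prompt": "dash dash version", "completion": "--version"}',
    '{"prompt": "dot dot dot", "completion": "..."}',
    '{"prompt": "one", "completion": "1"}',
]

records = parse_jsonl(SAMPLE)
# The identity function is a useful floor: it only scores on records that
# need no transformation at all.
identity_score = exact_match(records, lambda text: text)

if __name__ == "__main__":
    print(f"{len(records)} records, identity baseline = {identity_score:.2f}")
```

To score the real split, the same functions could be fed `open(path, encoding="utf-8")` in place of `SAMPLE`.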
training/data/dictation-to-bash.json ADDED
@@ -0,0 +1,666 @@
1
+ {
2
+ "meta": {
3
+ "name": "dictation-to-bash",
4
+ "version": "1.0",
5
+ "description": "Round-trip test: spoken bash commands → reconstructed syntax. Each entry pairs the original bash command ('bash', which is also the expected reconstruction) with how a transcription engine (Whisper/Parakeet) would capture it when spoken aloud ('dictated'), plus 'notes' on what makes the entry tricky.",
6
+ "categories": {
7
+ "nav": "Directory navigation (cd, pwd, ls)",
8
+ "git": "Git operations",
9
+ "flags": "Commands with single/double dash flags",
10
+ "paths": "Commands with complex paths (tilde, dots, escaping)",
11
+ "pipes": "Piped and chained commands (|, &&, ;)",
12
+ "quotes": "Commands with quoted arguments",
13
+ "numbers": "Spoken numbers, decimals, units",
14
+ "redirects": "Output redirection (>, >>, 2>&1)",
15
+ "compound": "Multi-part commands combining several syntax patterns",
16
+ "tools": "Project-specific CLI tools (talkie-dev, bun, brew, etc.)"
17
+ },
18
+ "difficulty_levels": {
19
+ "easy": "Single command, minimal syntax transformation",
20
+ "medium": "One or two syntax transformations needed",
21
+ "hard": "Multiple transformations, complex syntax, easy to get wrong"
22
+ }
23
+ },
24
+ "entries": [
25
+ {
26
+ "id": 1,
27
+ "category": "nav",
28
+ "difficulty": "easy",
29
+ "bash": "cd ..",
30
+ "dictated": "CD dot dot",
31
+ "notes": "Basic parent directory navigation"
32
+ },
33
+ {
34
+ "id": 2,
35
+ "category": "nav",
36
+ "difficulty": "easy",
37
+ "bash": "cd ../dev",
38
+ "dictated": "CD dot dot slash dev",
39
+ "notes": "Relative path with parent"
40
+ },
41
+ {
42
+ "id": 3,
43
+ "category": "nav",
44
+ "difficulty": "easy",
45
+ "bash": "cd ../talkie",
46
+ "dictated": "CD dot dot slash talkie",
47
+ "notes": "Sibling directory"
48
+ },
49
+ {
50
+ "id": 4,
51
+ "category": "nav",
52
+ "difficulty": "easy",
53
+ "bash": "pwd",
54
+ "dictated": "PWD",
55
+ "notes": "Single command, all caps in speech"
56
+ },
57
+ {
58
+ "id": 5,
59
+ "category": "nav",
60
+ "difficulty": "medium",
61
+ "bash": "cd dev/talkie",
62
+ "dictated": "CD dev slash talkie",
63
+ "notes": "Forward path"
64
+ },
65
+ {
66
+ "id": 6,
67
+ "category": "nav",
68
+ "difficulty": "medium",
69
+ "bash": "cd ~/dev/talkie",
70
+ "dictated": "CD tilde slash dev slash talkie",
71
+ "notes": "Home directory expansion"
72
+ },
73
+ {
74
+ "id": 7,
75
+ "category": "nav",
76
+ "difficulty": "medium",
77
+ "bash": "cd services/talkie-api",
78
+ "dictated": "CD services slash talkie dash API",
79
+ "notes": "Hyphenated directory name"
80
+ },
81
+ {
82
+ "id": 8,
83
+ "category": "flags",
84
+ "difficulty": "easy",
85
+ "bash": "ls -l",
86
+ "dictated": "LS dash L",
87
+ "notes": "Single flag"
88
+ },
89
+ {
90
+ "id": 9,
91
+ "category": "flags",
92
+ "difficulty": "easy",
93
+ "bash": "ls -la",
94
+ "dictated": "LS dash LA",
95
+ "notes": "Combined flags"
96
+ },
97
+ {
98
+ "id": 10,
99
+ "category": "flags",
100
+ "difficulty": "medium",
101
+ "bash": "rm -rf .next",
102
+ "dictated": "RM dash RF dot next",
103
+ "notes": "Destructive command with dot-prefixed dir"
104
+ },
105
+ {
106
+ "id": 11,
107
+ "category": "flags",
108
+ "difficulty": "medium",
109
+ "bash": "rm -rf .next/",
110
+ "dictated": "RM dash RF dot next slash",
111
+ "notes": "Trailing slash variant"
112
+ },
113
+ {
114
+ "id": 12,
115
+ "category": "git",
116
+ "difficulty": "easy",
117
+ "bash": "git status",
118
+ "dictated": "git status",
119
+ "notes": "No transformation needed — baseline"
120
+ },
121
+ {
122
+ "id": 13,
123
+ "category": "git",
124
+ "difficulty": "easy",
125
+ "bash": "git pull",
126
+ "dictated": "git pull",
127
+ "notes": "Baseline"
128
+ },
129
+ {
130
+ "id": 14,
131
+ "category": "git",
132
+ "difficulty": "easy",
133
+ "bash": "git push origin master",
134
+ "dictated": "git push origin master",
135
+ "notes": "Multi-word, no syntax transformation"
136
+ },
137
+ {
138
+ "id": 15,
139
+ "category": "git",
140
+ "difficulty": "medium",
141
+ "bash": "git add -A",
142
+ "dictated": "git add dash A",
143
+ "notes": "Flag with uppercase letter"
144
+ },
145
+ {
146
+ "id": 16,
147
+ "category": "git",
148
+ "difficulty": "medium",
149
+ "bash": "git clone git@github.com:arach/clipper.git",
150
+ "dictated": "git clone git at GitHub dot com colon arach slash clipper dot git",
151
+ "notes": "SSH URL with special characters"
152
+ },
153
+ {
154
+ "id": 17,
155
+ "category": "quotes",
156
+ "difficulty": "medium",
157
+ "bash": "git commit -m \"fix latency\"",
158
+ "dictated": "git commit dash M quote fix latency quote",
159
+ "notes": "Quoted commit message"
160
+ },
161
+ {
162
+ "id": 18,
163
+ "category": "quotes",
164
+ "difficulty": "medium",
165
+ "bash": "git commit -m \"update model catalog\"",
166
+ "dictated": "git commit dash M quote update model catalog quote",
167
+ "notes": "Longer quoted message"
168
+ },
169
+ {
170
+ "id": 19,
171
+ "category": "quotes",
172
+ "difficulty": "hard",
173
+ "bash": "git commit -m \"fix: resolve XPC timeout on cold start\"",
174
+ "dictated": "git commit dash M quote fix colon resolve XPC timeout on cold start quote",
175
+ "notes": "Quoted message with colon and abbreviation"
176
+ },
177
+ {
178
+ "id": 20,
179
+ "category": "quotes",
180
+ "difficulty": "medium",
181
+ "bash": "say -v ava \"Hi, I'm your new premium voice\"",
182
+ "dictated": "say dash V ava quote hi I'm your new premium voice quote",
183
+ "notes": "Quoted string with apostrophe"
184
+ },
185
+ {
186
+ "id": 21,
187
+ "category": "tools",
188
+ "difficulty": "easy",
189
+ "bash": "bun run dev",
190
+ "dictated": "bun run dev",
191
+ "notes": "No transformation"
192
+ },
193
+ {
194
+ "id": 22,
195
+ "category": "tools",
196
+ "difficulty": "easy",
197
+ "bash": "bun install",
198
+ "dictated": "bun install",
199
+ "notes": "No transformation"
200
+ },
201
+ {
202
+ "id": 23,
203
+ "category": "tools",
204
+ "difficulty": "easy",
205
+ "bash": "bun dev",
206
+ "dictated": "bun dev",
207
+ "notes": "Shorthand"
208
+ },
209
+ {
210
+ "id": 24,
211
+ "category": "tools",
212
+ "difficulty": "medium",
213
+ "bash": "bun run dev --port 5189",
214
+ "dictated": "bun run dev dash dash port 5189",
215
+ "notes": "Double dash flag with number"
216
+ },
217
+ {
218
+ "id": 25,
219
+ "category": "tools",
220
+ "difficulty": "easy",
221
+ "bash": "brew install age",
222
+ "dictated": "brew install age",
223
+ "notes": "No transformation"
224
+ },
225
+ {
226
+ "id": 26,
227
+ "category": "tools",
228
+ "difficulty": "easy",
229
+ "bash": "brew upgrade gh",
230
+ "dictated": "brew upgrade GH",
231
+ "notes": "Abbreviation casing"
232
+ },
233
+ {
234
+ "id": 27,
235
+ "category": "tools",
236
+ "difficulty": "medium",
237
+ "bash": "npm i -g @openai/codex",
238
+ "dictated": "NPM I dash G at OpenAI slash codex",
239
+ "notes": "Scoped package with @ symbol"
240
+ },
241
+ {
242
+ "id": 28,
243
+ "category": "tools",
244
+ "difficulty": "medium",
245
+ "bash": "npx vercel --prod",
246
+ "dictated": "NPX vercel dash dash prod",
247
+ "notes": "Double dash flag"
248
+ },
249
+ {
250
+ "id": 29,
251
+ "category": "tools",
252
+ "difficulty": "medium",
253
+ "bash": "talkie-dev dev rebuild inference",
254
+ "dictated": "talkie dash dev dev rebuild inference",
255
+ "notes": "Hyphenated tool name"
256
+ },
257
+ {
258
+ "id": 30,
259
+ "category": "tools",
260
+ "difficulty": "medium",
261
+ "bash": "talkie-dev dev status",
262
+ "dictated": "talkie dash dev dev status",
263
+ "notes": "Hyphenated tool name"
264
+ },
265
+ {
266
+ "id": 31,
267
+ "category": "tools",
268
+ "difficulty": "hard",
269
+ "bash": "talkie-dev dev logs inference --since 5m",
270
+ "dictated": "talkie dash dev dev logs inference dash dash since five M",
271
+ "notes": "Hyphenated tool, double dash flag, spoken number with unit"
272
+ },
273
+ {
274
+ "id": 32,
275
+ "category": "tools",
276
+ "difficulty": "medium",
277
+ "bash": "talkie inference generate \"hello\" --verbose",
278
+ "dictated": "talkie inference generate quote hello quote dash dash verbose",
279
+ "notes": "Quoted arg and double dash flag"
280
+ },
281
+ {
282
+ "id": 33,
283
+ "category": "numbers",
284
+ "difficulty": "medium",
285
+ "bash": "talkie inference generate \"test\" --temp 0.7",
286
+ "dictated": "talkie inference generate quote test quote dash dash temp zero point seven",
287
+ "notes": "Decimal number from speech"
288
+ },
289
+ {
290
+ "id": 34,
291
+ "category": "numbers",
292
+ "difficulty": "medium",
293
+ "bash": "talkie inference generate \"test\" --tokens 512",
294
+ "dictated": "talkie inference generate quote test quote dash dash tokens five twelve",
295
+ "notes": "Spoken number, ambiguous: 'five twelve' could mean 512 or 5 followed by 12"
296
+ },
297
+ {
298
+ "id": 35,
299
+ "category": "numbers",
300
+ "difficulty": "medium",
301
+ "bash": "talkie inference generate \"test\" --tokens 100",
302
+ "dictated": "talkie inference generate quote test quote dash dash tokens one hundred",
303
+ "notes": "Spoken number"
304
+ },
305
+ {
306
+ "id": 36,
307
+ "category": "numbers",
308
+ "difficulty": "hard",
309
+ "bash": "say -v ava -r 280 \"hello world\"",
310
+ "dictated": "say dash V ava dash R two eighty quote hello world quote",
311
+ "notes": "Multiple flags with spoken number and quoted string"
312
+ },
313
+ {
314
+ "id": 37,
315
+ "category": "pipes",
316
+ "difficulty": "medium",
317
+ "bash": "ps aux | grep bun",
318
+ "dictated": "PS aux pipe grep bun",
319
+ "notes": "Single pipe"
320
+ },
321
+ {
322
+ "id": 38,
323
+ "category": "pipes",
324
+ "difficulty": "medium",
325
+ "bash": "cd cli && bun run build",
326
+ "dictated": "CD CLI and and bun run build",
327
+ "notes": "Double ampersand from 'and and'"
328
+ },
329
+ {
330
+ "id": 39,
331
+ "category": "pipes",
332
+ "difficulty": "medium",
333
+ "bash": "cd landing; bun run dev",
334
+ "dictated": "CD landing semicolon bun run dev",
335
+ "notes": "Semicolon separator"
336
+ },
337
+ {
338
+ "id": 40,
339
+ "category": "pipes",
340
+ "difficulty": "hard",
341
+ "bash": "git add -A && git commit -m \"fix typo\" && git push",
342
+ "dictated": "git add dash A and and git commit dash M quote fix typo quote and and git push",
343
+ "notes": "Triple chained with quotes"
344
+ },
345
+ {
346
+ "id": 41,
347
+ "category": "paths",
348
+ "difficulty": "medium",
349
+ "bash": "ls ~/Library/Application\\ Support/Talkie/",
350
+ "dictated": "LS tilde slash library slash application support slash talkie slash",
351
+ "notes": "Escaped space in path, home directory"
352
+ },
353
+ {
354
+ "id": 42,
355
+ "category": "paths",
356
+ "difficulty": "medium",
357
+ "bash": "cat .env",
358
+ "dictated": "cat dot E N V",
359
+ "notes": "Dot-prefixed hidden file, spelled out"
360
+ },
361
+ {
362
+ "id": 43,
363
+ "category": "paths",
364
+ "difficulty": "medium",
365
+ "bash": "vim .env.local",
366
+ "dictated": "vim dot E N V dot local",
367
+ "notes": "Multiple dots in filename"
368
+ },
369
+ {
370
+ "id": 44,
371
+ "category": "paths",
372
+ "difficulty": "medium",
373
+ "bash": "cat demo-typing.md",
374
+ "dictated": "cat demo dash typing dot MD",
375
+ "notes": "Hyphenated filename with extension"
376
+ },
377
+ {
378
+ "id": 45,
379
+ "category": "paths",
380
+ "difficulty": "hard",
381
+ "bash": "less package.json",
382
+ "dictated": "less package dot JSON",
383
+ "notes": "Dot-separated filename"
384
+ },
385
+ {
386
+ "id": 46,
387
+ "category": "paths",
388
+ "difficulty": "hard",
389
+ "bash": "/Users/arach/dev/vif/scripts/launchd.sh restart",
390
+ "dictated": "slash users slash arach slash dev slash vif slash scripts slash launch D dot SH restart",
391
+ "notes": "Absolute path with many segments"
392
+ },
393
+ {
394
+ "id": 47,
395
+ "category": "quotes",
396
+ "difficulty": "hard",
397
+ "bash": "grep -r \"ServiceBridge\" macOS/TalkieKit/",
398
+ "dictated": "grep dash R quote service bridge quote mac OS slash talkie kit slash",
399
+ "notes": "CamelCase inside quotes, compound path"
400
+ },
401
+ {
402
+ "id": 48,
403
+ "category": "compound",
404
+ "difficulty": "hard",
405
+ "bash": "vmmap -summary $(pgrep -x Talkie) 2>/dev/null | head -25",
406
+ "dictated": "VM map dash summary dollar paren P grep dash X talkie close paren two greater than slash dev slash null pipe head dash twenty five",
407
+ "notes": "Command substitution, stderr redirect, pipe"
408
+ },
409
+ {
410
+ "id": 49,
411
+ "category": "compound",
412
+ "difficulty": "hard",
413
+ "bash": "vmmap $(pgrep -x Talkie) | grep \"Physical footprint\"",
414
+ "dictated": "VM map dollar paren P grep dash X talkie close paren pipe grep quote physical footprint quote",
415
+ "notes": "Command substitution with pipe and quoted grep"
416
+ },
417
+ {
418
+ "id": 50,
419
+ "category": "redirects",
420
+ "difficulty": "medium",
421
+ "bash": "echo \"hello\" > output.txt",
422
+ "dictated": "echo quote hello quote greater than output dot TXT",
423
+ "notes": "Simple redirect to file"
424
+ },
425
+ {
426
+ "id": 51,
427
+ "category": "redirects",
428
+ "difficulty": "medium",
429
+ "bash": "echo \"hello\" >> output.txt",
430
+ "dictated": "echo quote hello quote double greater than output dot TXT",
431
+ "notes": "Append redirect"
432
+ },
433
+ {
434
+ "id": 52,
435
+ "category": "compound",
436
+ "difficulty": "hard",
437
+ "bash": "curl -fsSL https://opencode.ai/install | bash",
438
+ "dictated": "curl dash F S S L HTTPS colon slash slash open code dot AI slash install pipe bash",
439
+ "notes": "URL with protocol, piped to bash"
440
+ },
441
+ {
442
+ "id": 53,
443
+ "category": "git",
444
+ "difficulty": "hard",
445
+ "bash": "git clone git@github.com:arach/usetalkie.com.git ~/dev/usetalkie.com",
446
+ "dictated": "git clone git at GitHub dot com colon arach slash use talkie dot com dot git tilde slash dev slash use talkie dot com",
447
+ "notes": "SSH clone with dots in repo name and custom destination"
448
+ },
449
+ {
450
+ "id": 54,
451
+ "category": "tools",
452
+ "difficulty": "medium",
453
+ "bash": "npx skills add arach/arc",
454
+ "dictated": "NPX skills add arach slash arc",
455
+ "notes": "Slash-separated argument"
456
+ },
457
+ {
458
+ "id": 55,
459
+ "category": "tools",
460
+ "difficulty": "hard",
461
+ "bash": "npx skills add https://github.com/vercel-labs/agent-skills --skill web-design-guidelines",
462
+ "dictated": "NPX skills add HTTPS colon slash slash GitHub dot com slash vercel dash labs slash agent dash skills dash dash skill web dash design dash guidelines",
463
+ "notes": "URL argument with double dash flag and hyphenated value"
464
+ },
465
+ {
466
+ "id": 56,
467
+ "category": "tools",
468
+ "difficulty": "easy",
469
+ "bash": "claude --resume",
470
+ "dictated": "claude dash dash resume",
471
+ "notes": "Double dash flag"
472
+ },
473
+ {
474
+ "id": 57,
475
+ "category": "tools",
476
+ "difficulty": "easy",
477
+ "bash": "htop",
478
+ "dictated": "H top",
479
+ "notes": "Single word, might be transcribed with space"
480
+ },
481
+ {
482
+ "id": 58,
483
+ "category": "tools",
484
+ "difficulty": "easy",
485
+ "bash": "top",
486
+ "dictated": "top",
487
+ "notes": "Baseline — no transformation"
488
+ },
489
+ {
490
+ "id": 59,
491
+ "category": "nav",
492
+ "difficulty": "easy",
493
+ "bash": "cd www",
494
+ "dictated": "CD www",
495
+ "notes": "Simple directory"
496
+ },
497
+ {
498
+ "id": 60,
499
+ "category": "nav",
500
+ "difficulty": "easy",
501
+ "bash": "cd web",
502
+ "dictated": "CD web",
503
+ "notes": "Simple directory"
504
+ },
505
+ {
506
+ "id": 61,
507
+ "category": "tools",
508
+ "difficulty": "hard",
509
+ "bash": "security unlock-keychain ~/Library/Keychains/login.keychain-db",
510
+ "dictated": "security unlock dash keychain tilde slash library slash keychains slash login dot keychain dash DB",
511
+ "notes": "Long compound path with hyphens and dots"
512
+ },
513
+ {
514
+ "id": 62,
515
+ "category": "pipes",
516
+ "difficulty": "hard",
517
+ "bash": "ps aux | grep Talkie | grep -v grep",
518
+ "dictated": "PS aux pipe grep talkie pipe grep dash V grep",
519
+ "notes": "Double pipe chain with flag"
520
+ },
521
+ {
522
+ "id": 63,
523
+ "category": "flags",
524
+ "difficulty": "medium",
525
+ "bash": "claude -h",
526
+ "dictated": "claude dash H",
527
+ "notes": "Single letter flag"
528
+ },
529
+ {
530
+ "id": 64,
531
+ "category": "paths",
532
+ "difficulty": "hard",
533
+ "bash": "vim /Users/arach/dev/.env",
534
+ "dictated": "vim slash users slash arach slash dev slash dot E N V",
535
+ "notes": "Absolute path to hidden file"
536
+ },
537
+ {
538
+ "id": 65,
539
+ "category": "compound",
540
+ "difficulty": "hard",
541
+ "bash": "pgrep -x TalkieAgent | xargs kill -9",
542
+ "dictated": "P grep dash X talkie agent pipe X args kill dash nine",
543
+ "notes": "Pipe with signal number"
544
+ },
545
+ {
546
+ "id": 66,
547
+ "category": "redirects",
548
+ "difficulty": "hard",
549
+ "bash": "talkie-dev dev logs inference --since 5m 2>&1 | grep error",
550
+ "dictated": "talkie dash dev dev logs inference dash dash since five M two greater than ampersand one pipe grep error",
551
+ "notes": "stderr redirect merged with pipe"
552
+ },
553
+ {
554
+ "id": 67,
555
+ "category": "tools",
556
+ "difficulty": "medium",
557
+ "bash": "talkie inference load mlx-community/Llama-3.2-1B-Instruct-4bit",
558
+ "dictated": "talkie inference load MLX dash community slash llama dash three dot two dash one B dash instruct dash four bit",
559
+ "notes": "HuggingFace model ID with many hyphens and dots"
560
+ },
561
+ {
562
+ "id": 68,
563
+ "category": "tools",
564
+ "difficulty": "medium",
565
+ "bash": "talkie inference unload",
566
+ "dictated": "talkie inference unload",
567
+ "notes": "No transformation"
568
+ },
569
+ {
570
+ "id": 69,
571
+ "category": "tools",
572
+ "difficulty": "medium",
573
+ "bash": "talkie inference status --json",
574
+ "dictated": "talkie inference status dash dash JSON",
575
+ "notes": "Double dash flag"
576
+ },
577
+ {
578
+ "id": 70,
579
+ "category": "compound",
580
+ "difficulty": "hard",
581
+ "bash": "cd cli && bun run build && talkie inference status",
582
+ "dictated": "CD CLI and and bun run build and and talkie inference status",
583
+ "notes": "Triple chain with 'and and'"
584
+ },
585
+ {
586
+ "id": 71,
587
+ "category": "git",
588
+ "difficulty": "medium",
589
+ "bash": "gh repo clone arach/clipper",
590
+ "dictated": "GH repo clone arach slash clipper",
591
+ "notes": "GitHub CLI with slash"
592
+ },
593
+ {
594
+ "id": 72,
595
+ "category": "git",
596
+ "difficulty": "hard",
597
+ "bash": "gh pr create --title \"fix inference latency\" --body \"Added TTFT tracking\"",
598
+ "dictated": "GH PR create dash dash title quote fix inference latency quote dash dash body quote added TTFT tracking quote",
599
+ "notes": "Multiple double dash flags with quoted values"
600
+ },
601
+ {
602
+ "id": 73,
603
+ "category": "numbers",
604
+ "difficulty": "hard",
605
+ "bash": "talkie inference generate \"test\" --temp 0.7 --tokens 512 --model mlx-community/Qwen2.5-0.5B-Instruct-4bit",
606
+ "dictated": "talkie inference generate quote test quote dash dash temp zero point seven dash dash tokens five hundred twelve dash dash model MLX dash community slash Qwen two point five dash zero point five B dash instruct dash four bit",
607
+ "notes": "Multiple numeric conversions and complex model ID"
608
+ },
609
+ {
610
+ "id": 74,
611
+ "category": "pipes",
612
+ "difficulty": "medium",
613
+ "bash": "cat package.json | python3 -c \"import sys,json; print(json.load(sys.stdin))\"",
614
+ "dictated": "cat package dot JSON pipe python three dash C quote import sys comma JSON semicolon print open paren JSON dot load open paren sys dot stdin close paren close paren quote",
615
+ "notes": "Inline Python with nested parens — probably too hard for any small model"
616
+ },
617
+ {
618
+ "id": 75,
619
+ "category": "nav",
620
+ "difficulty": "easy",
621
+ "bash": "cd ../dewey",
622
+ "dictated": "CD dot dot slash Dewey",
623
+ "notes": "Sibling directory"
624
+ },
625
+ {
626
+ "id": 76,
627
+ "category": "nav",
628
+ "difficulty": "easy",
629
+ "bash": "cd ../vif",
630
+ "dictated": "CD dot dot slash VIF",
631
+ "notes": "Sibling directory, short name"
632
+ },
633
+ {
634
+ "id": 77,
635
+ "category": "nav",
636
+ "difficulty": "medium",
637
+ "bash": "cd dewey/www",
638
+ "dictated": "CD Dewey slash www",
639
+ "notes": "Nested path"
640
+ },
641
+ {
642
+ "id": 78,
643
+ "category": "tools",
644
+ "difficulty": "medium",
645
+ "bash": "opencode auth login",
646
+ "dictated": "open code auth login",
647
+ "notes": "Tool name that Whisper might split into two words"
648
+ },
649
+ {
650
+ "id": 79,
651
+ "category": "flags",
652
+ "difficulty": "hard",
653
+ "bash": "curl -fsSL https://example.com/install.sh",
654
+ "dictated": "curl dash F S S L HTTPS colon slash slash example dot com slash install dot SH",
655
+ "notes": "Packed flags and URL"
656
+ },
657
+ {
658
+ "id": 80,
659
+ "category": "compound",
660
+ "difficulty": "hard",
661
+ "bash": "mkdir -p src/components && touch src/components/Button.tsx",
662
+ "dictated": "M K dir dash P SRC slash components and and touch SRC slash components slash button dot TSX",
663
+ "notes": "mkdir + touch chain, CamelCase filename"
664
+ }
665
+ ]
666
+ }
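A minimal sketch of how entries shaped like the ones above could be scored by exact match per category. This is not the repo's eval script: `exact_match_by_category` and `stub_predict` are hypothetical names introduced here for illustration, and the stub stands in for whatever model or pipeline produces the prediction.

```python
from collections import defaultdict


def exact_match_by_category(entries, predict, target_key="bash"):
    """Return {category: fraction of entries whose prediction equals the target exactly}."""
    totals, hits = defaultdict(int), defaultdict(int)
    for entry in entries:
        category = entry["category"]
        totals[category] += 1
        if predict(entry["dictated"]) == entry[target_key]:
            hits[category] += 1
    return {c: hits[c] / totals[c] for c in totals}


# Two entries copied from the eval file above (fields trimmed to the ones used here).
SAMPLE = [
    {"id": 75, "category": "nav", "bash": "cd ../dewey", "dictated": "CD dot dot slash Dewey"},
    {"id": 78, "category": "tools", "bash": "opencode auth login", "dictated": "open code auth login"},
]


def stub_predict(dictated):
    # Placeholder for a real model (assumption): knows one answer,
    # otherwise echoes the input lowercased.
    known = {"CD dot dot slash Dewey": "cd ../dewey"}
    return known.get(dictated, dictated.lower())


scores = exact_match_by_category(SAMPLE, stub_predict)
print(scores)
```

Exact string equality is the strictest possible criterion; a looser comparison (for example, ignoring whitespace) would be a separate design choice and is not assumed here.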
training/data/syntax-reconstruction.json ADDED
@@ -0,0 +1,685 @@
1
+ {
2
+ "meta": {
3
+ "name": "syntax-reconstruction",
4
+ "version": "2.0",
5
+ "description": "Training data for converting spoken/dictated text back into syntactically correct strings. Focus: punctuation, symbols, casing, spacing — not command recognition. Each entry: 'dictated' is what a transcription engine outputs, 'output' is the target string.",
6
+ "task": "Given the dictated text, reconstruct the exact syntactic string.",
7
+ "categories": [
8
+ "symbols-basic",
9
+ "symbols-compound",
10
+ "casing",
11
+ "spacing",
12
+ "quotes",
13
+ "brackets",
14
+ "paths",
15
+ "urls",
16
+ "identifiers",
17
+ "numbers",
18
+ "operators",
19
+ "mixed"
20
+ ]
21
+ },
22
+ "entries": [
23
+
24
+ {"id": 1, "cat": "symbols-basic", "dictated": "hello dash world", "output": "hello-world"},
25
+ {"id": 2, "cat": "symbols-basic", "dictated": "hello underscore world", "output": "hello_world"},
26
+ {"id": 3, "cat": "symbols-basic", "dictated": "hello dot world", "output": "hello.world"},
27
+ {"id": 4, "cat": "symbols-basic", "dictated": "hello slash world", "output": "hello/world"},
28
+ {"id": 5, "cat": "symbols-basic", "dictated": "hello backslash world", "output": "hello\\world"},
29
+ {"id": 6, "cat": "symbols-basic", "dictated": "hello colon world", "output": "hello:world"},
30
+ {"id": 7, "cat": "symbols-basic", "dictated": "hello semicolon world", "output": "hello;world"},
31
+ {"id": 8, "cat": "symbols-basic", "dictated": "hello comma world", "output": "hello, world"},
32
+ {"id": 9, "cat": "symbols-basic", "dictated": "hello at world", "output": "hello@world"},
33
+ {"id": 10, "cat": "symbols-basic", "dictated": "hello hash world", "output": "hello#world"},
34
+ {"id": 11, "cat": "symbols-basic", "dictated": "hello dollar world", "output": "hello$world"},
35
+ {"id": 12, "cat": "symbols-basic", "dictated": "hello ampersand world", "output": "hello&world"},
36
+ {"id": 13, "cat": "symbols-basic", "dictated": "hello asterisk world", "output": "hello*world"},
37
+ {"id": 14, "cat": "symbols-basic", "dictated": "hello star world", "output": "hello*world"},
38
+ {"id": 15, "cat": "symbols-basic", "dictated": "hello pipe world", "output": "hello|world"},
39
+ {"id": 16, "cat": "symbols-basic", "dictated": "hello tilde world", "output": "hello~world"},
40
+ {"id": 17, "cat": "symbols-basic", "dictated": "hello caret world", "output": "hello^world"},
41
+ {"id": 18, "cat": "symbols-basic", "dictated": "hello percent world", "output": "hello%world"},
42
+ {"id": 19, "cat": "symbols-basic", "dictated": "hello plus world", "output": "hello+world"},
43
+ {"id": 20, "cat": "symbols-basic", "dictated": "hello equals world", "output": "hello=world"},
44
+ {"id": 21, "cat": "symbols-basic", "dictated": "hello exclamation point world", "output": "hello!world"},
45
+ {"id": 22, "cat": "symbols-basic", "dictated": "hello bang world", "output": "hello!world"},
46
+ {"id": 23, "cat": "symbols-basic", "dictated": "hello question mark world", "output": "hello?world"},
47
+
48
+ {"id": 24, "cat": "symbols-compound", "dictated": "dash dash verbose", "output": "--verbose"},
49
+ {"id": 25, "cat": "symbols-compound", "dictated": "dash dash no dash verify", "output": "--no-verify"},
50
+ {"id": 26, "cat": "symbols-compound", "dictated": "dot dot slash", "output": "../"},
51
+ {"id": 27, "cat": "symbols-compound", "dictated": "dot dot slash dot dot slash", "output": "../../"},
52
+ {"id": 28, "cat": "symbols-compound", "dictated": "tilde slash", "output": "~/"},
53
+ {"id": 29, "cat": "symbols-compound", "dictated": "dot slash", "output": "./"},
54
+ {"id": 30, "cat": "symbols-compound", "dictated": "and and", "output": "&&"},
55
+ {"id": 31, "cat": "symbols-compound", "dictated": "pipe pipe", "output": "||"},
56
+ {"id": 32, "cat": "symbols-compound", "dictated": "greater than", "output": ">"},
57
+ {"id": 33, "cat": "symbols-compound", "dictated": "less than", "output": "<"},
58
+ {"id": 34, "cat": "symbols-compound", "dictated": "double greater than", "output": ">>"},
59
+ {"id": 35, "cat": "symbols-compound", "dictated": "equals equals", "output": "=="},
60
+ {"id": 36, "cat": "symbols-compound", "dictated": "not equals", "output": "!="},
61
+ {"id": 37, "cat": "symbols-compound", "dictated": "bang equals", "output": "!="},
62
+ {"id": 38, "cat": "symbols-compound", "dictated": "plus equals", "output": "+="},
63
+ {"id": 39, "cat": "symbols-compound", "dictated": "dash equals", "output": "-="},
64
+ {"id": 40, "cat": "symbols-compound", "dictated": "arrow", "output": "->"},
65
+ {"id": 41, "cat": "symbols-compound", "dictated": "fat arrow", "output": "=>"},
66
+ {"id": 42, "cat": "symbols-compound", "dictated": "double colon", "output": "::"},
67
+ {"id": 43, "cat": "symbols-compound", "dictated": "dot dot dot", "output": "..."},
68
+ {"id": 44, "cat": "symbols-compound", "dictated": "hash bang", "output": "#!"},
69
+ {"id": 45, "cat": "symbols-compound", "dictated": "shebang", "output": "#!"},
70
+ {"id": 46, "cat": "symbols-compound", "dictated": "two greater than ampersand one", "output": "2>&1"},
71
+ {"id": 47, "cat": "symbols-compound", "dictated": "slash dev slash null", "output": "/dev/null"},
72
+ {"id": 48, "cat": "symbols-compound", "dictated": "star dot star", "output": "*.*"},
73
+ {"id": 49, "cat": "symbols-compound", "dictated": "star dot swift", "output": "*.swift"},
74
+ {"id": 50, "cat": "symbols-compound", "dictated": "star star slash star dot T S", "output": "**/*.ts"},
75
+ {"id": 51, "cat": "symbols-compound", "dictated": "star star slash star dot T S X", "output": "**/*.tsx"},
76
+ {"id": 52, "cat": "symbols-compound", "dictated": "dollar open paren", "output": "$("},
77
+ {"id": 53, "cat": "symbols-compound", "dictated": "dollar open brace", "output": "${"},
78
+
79
+ {"id": 54, "cat": "quotes", "dictated": "quote hello world quote", "output": "\"hello world\""},
80
+ {"id": 55, "cat": "quotes", "dictated": "double quote hello world double quote", "output": "\"hello world\""},
81
+ {"id": 56, "cat": "quotes", "dictated": "single quote hello world single quote", "output": "'hello world'"},
82
+ {"id": 57, "cat": "quotes", "dictated": "backtick hello world backtick", "output": "`hello world`"},
83
+ {"id": 58, "cat": "quotes", "dictated": "quote quote", "output": "\"\""},
84
+ {"id": 59, "cat": "quotes", "dictated": "empty string", "output": "\"\""},
85
+ {"id": 60, "cat": "quotes", "dictated": "single quote single quote", "output": "''"},
86
+ {"id": 61, "cat": "quotes", "dictated": "quote fix colon resolve timeout quote", "output": "\"fix: resolve timeout\""},
87
+ {"id": 62, "cat": "quotes", "dictated": "quote star dot log quote", "output": "\"*.log\""},
88
+ {"id": 63, "cat": "quotes", "dictated": "quote service bridge quote", "output": "\"ServiceBridge\""},
89
+ {"id": 64, "cat": "quotes", "dictated": "quote hello comma world quote", "output": "\"hello, world\""},
90
+
91
+ {"id": 65, "cat": "brackets", "dictated": "open paren close paren", "output": "()"},
92
+ {"id": 66, "cat": "brackets", "dictated": "open bracket close bracket", "output": "[]"},
93
+ {"id": 67, "cat": "brackets", "dictated": "open brace close brace", "output": "{}"},
94
+ {"id": 68, "cat": "brackets", "dictated": "open angle close angle", "output": "<>"},
95
+ {"id": 69, "cat": "brackets", "dictated": "left paren right paren", "output": "()"},
96
+ {"id": 70, "cat": "brackets", "dictated": "open paren X close paren", "output": "(x)"},
97
+ {"id": 71, "cat": "brackets", "dictated": "open bracket zero close bracket", "output": "[0]"},
98
+ {"id": 72, "cat": "brackets", "dictated": "open brace key colon value close brace", "output": "{key: value}"},
99
+ {"id": 73, "cat": "brackets", "dictated": "open angle string close angle", "output": "<string>"},
100
+ {"id": 74, "cat": "brackets", "dictated": "open bracket open bracket test close bracket close bracket", "output": "[[test]]"},
101
+
102
+ {"id": 75, "cat": "casing", "dictated": "capital hello", "output": "Hello"},
103
+ {"id": 76, "cat": "casing", "dictated": "all caps hello", "output": "HELLO"},
104
+ {"id": 77, "cat": "casing", "dictated": "lowercase HELLO", "output": "hello"},
105
+ {"id": 78, "cat": "casing", "dictated": "camel case hello world", "output": "helloWorld"},
106
+ {"id": 79, "cat": "casing", "dictated": "pascal case hello world", "output": "HelloWorld"},
107
+ {"id": 80, "cat": "casing", "dictated": "snake case hello world", "output": "hello_world"},
108
+ {"id": 81, "cat": "casing", "dictated": "screaming snake case hello world", "output": "HELLO_WORLD"},
109
+ {"id": 82, "cat": "casing", "dictated": "kebab case hello world", "output": "hello-world"},
110
+ {"id": 83, "cat": "casing", "dictated": "camel case get user name", "output": "getUserName"},
111
+ {"id": 84, "cat": "casing", "dictated": "pascal case service bridge", "output": "ServiceBridge"},
112
+ {"id": 85, "cat": "casing", "dictated": "snake case total tokens generated", "output": "total_tokens_generated"},
113
+ {"id": 86, "cat": "casing", "dictated": "screaming snake case max retry count", "output": "MAX_RETRY_COUNT"},
114
+ {"id": 87, "cat": "casing", "dictated": "kebab case web design guidelines", "output": "web-design-guidelines"},
115
+ {"id": 88, "cat": "casing", "dictated": "camel case is palindrome", "output": "isPalindrome"},
116
+ {"id": 89, "cat": "casing", "dictated": "pascal case inference model catalog", "output": "InferenceModelCatalog"},
117
+ {"id": 90, "cat": "casing", "dictated": "snake case bridge port", "output": "bridge_port"},
118
+ {"id": 91, "cat": "casing", "dictated": "all caps API", "output": "API"},
119
+ {"id": 92, "cat": "casing", "dictated": "all caps URL", "output": "URL"},
120
+ {"id": 93, "cat": "casing", "dictated": "all caps JSON", "output": "JSON"},
121
+ {"id": 94, "cat": "casing", "dictated": "all caps XPC", "output": "XPC"},
122
+ {"id": 95, "cat": "casing", "dictated": "all caps HTTP", "output": "HTTP"},
123
+ {"id": 96, "cat": "casing", "dictated": "all caps HTTPS", "output": "HTTPS"},
124
+
125
+ {"id": 97, "cat": "spacing", "dictated": "no space hello world", "output": "helloworld"},
126
+ {"id": 98, "cat": "spacing", "dictated": "one word hello world", "output": "helloworld"},
127
+ {"id": 99, "cat": "spacing", "dictated": "no space web pack", "output": "webpack"},
128
+ {"id": 100, "cat": "spacing", "dictated": "no space type script", "output": "typescript"},
129
+ {"id": 101, "cat": "spacing", "dictated": "no space java script", "output": "javascript"},
130
+ {"id": 102, "cat": "spacing", "dictated": "no space open code", "output": "opencode"},
131
+ {"id": 103, "cat": "spacing", "dictated": "no space H top", "output": "htop"},
132
+ {"id": 104, "cat": "spacing", "dictated": "space separated A B C", "output": "A B C"},
133
+ {"id": 105, "cat": "spacing", "dictated": "tab separated hello world", "output": "hello\tworld"},
134
+ {"id": 106, "cat": "spacing", "dictated": "new line hello new line world", "output": "hello\nworld"},
135
+
136
+ {"id": 107, "cat": "numbers", "dictated": "zero", "output": "0"},
137
+ {"id": 108, "cat": "numbers", "dictated": "one", "output": "1"},
138
+ {"id": 109, "cat": "numbers", "dictated": "forty two", "output": "42"},
139
+ {"id": 110, "cat": "numbers", "dictated": "one hundred", "output": "100"},
140
+ {"id": 111, "cat": "numbers", "dictated": "two fifty six", "output": "256"},
141
+ {"id": 112, "cat": "numbers", "dictated": "five twelve", "output": "512"},
142
+ {"id": 113, "cat": "numbers", "dictated": "ten twenty four", "output": "1024"},
143
+ {"id": 114, "cat": "numbers", "dictated": "eighty eighty", "output": "8080"},
144
+ {"id": 115, "cat": "numbers", "dictated": "zero point seven", "output": "0.7"},
145
+ {"id": 116, "cat": "numbers", "dictated": "zero point zero one", "output": "0.01"},
146
+ {"id": 117, "cat": "numbers", "dictated": "three point one four", "output": "3.14"},
147
+ {"id": 118, "cat": "numbers", "dictated": "one point zero", "output": "1.0"},
148
+ {"id": 119, "cat": "numbers", "dictated": "zero point five", "output": "0.5"},
149
+ {"id": 120, "cat": "numbers", "dictated": "two point five", "output": "2.5"},
150
+ {"id": 121, "cat": "numbers", "dictated": "five M", "output": "5m"},
151
+ {"id": 122, "cat": "numbers", "dictated": "ten S", "output": "10s"},
152
+ {"id": 123, "cat": "numbers", "dictated": "thirty D", "output": "30d"},
153
+ {"id": 124, "cat": "numbers", "dictated": "one hundred K", "output": "100k"},
154
+ {"id": 125, "cat": "numbers", "dictated": "five one eight nine", "output": "5189"},
155
+ {"id": 126, "cat": "numbers", "dictated": "one nine eight two two", "output": "19822"},
156
+ {"id": 127, "cat": "numbers", "dictated": "three dot two", "output": "3.2"},
157
+ {"id": 128, "cat": "numbers", "dictated": "zero X F F", "output": "0xFF"},
158
+ {"id": 129, "cat": "numbers", "dictated": "zero X zero zero", "output": "0x00"},
159
+
160
+ {"id": 130, "cat": "identifiers", "dictated": "dash A", "output": "-A"},
161
+ {"id": 131, "cat": "identifiers", "dictated": "dash R", "output": "-r"},
162
+ {"id": 132, "cat": "identifiers", "dictated": "dash L A", "output": "-la"},
163
+ {"id": 133, "cat": "identifiers", "dictated": "dash F S S L", "output": "-fsSL"},
164
+ {"id": 134, "cat": "identifiers", "dictated": "dash V", "output": "-v"},
165
+ {"id": 135, "cat": "identifiers", "dictated": "dash M", "output": "-m"},
166
+ {"id": 136, "cat": "identifiers", "dictated": "dash G", "output": "-g"},
167
+ {"id": 137, "cat": "identifiers", "dictated": "dash X", "output": "-x"},
168
+ {"id": 138, "cat": "identifiers", "dictated": "dash P", "output": "-p"},
169
+ {"id": 139, "cat": "identifiers", "dictated": "dash N", "output": "-n"},
170
+ {"id": 140, "cat": "identifiers", "dictated": "dash I", "output": "-i"},
171
+ {"id": 141, "cat": "identifiers", "dictated": "dash C", "output": "-c"},
172
+ {"id": 142, "cat": "identifiers", "dictated": "dash dash help", "output": "--help"},
173
+ {"id": 143, "cat": "identifiers", "dictated": "dash dash version", "output": "--version"},
174
+ {"id": 144, "cat": "identifiers", "dictated": "dash dash prod", "output": "--prod"},
175
+ {"id": 145, "cat": "identifiers", "dictated": "dash dash port", "output": "--port"},
176
+ {"id": 146, "cat": "identifiers", "dictated": "dash dash since", "output": "--since"},
177
+ {"id": 147, "cat": "identifiers", "dictated": "dash dash json", "output": "--json"},
178
+ {"id": 148, "cat": "identifiers", "dictated": "dash dash verbose", "output": "--verbose"},
179
+ {"id": 149, "cat": "identifiers", "dictated": "dash dash temp", "output": "--temp"},
180
+ {"id": 150, "cat": "identifiers", "dictated": "dash dash tokens", "output": "--tokens"},
181
+ {"id": 151, "cat": "identifiers", "dictated": "dash dash model", "output": "--model"},
182
+ {"id": 152, "cat": "identifiers", "dictated": "dash dash system", "output": "--system"},
183
+ {"id": 153, "cat": "identifiers", "dictated": "dash dash resume", "output": "--resume"},
184
+ {"id": 154, "cat": "identifiers", "dictated": "dash dash skill", "output": "--skill"},
185
+ {"id": 155, "cat": "identifiers", "dictated": "dash dash title", "output": "--title"},
186
+ {"id": 156, "cat": "identifiers", "dictated": "dash dash body", "output": "--body"},
187
+ {"id": 157, "cat": "identifiers", "dictated": "dash dash no dash verify", "output": "--no-verify"},
188
+ {"id": 158, "cat": "identifiers", "dictated": "dash dash dry dash run", "output": "--dry-run"},
189
+
190
+ {"id": 159, "cat": "identifiers", "dictated": "dot E N V", "output": ".env"},
191
+ {"id": 160, "cat": "identifiers", "dictated": "dot E N V dot local", "output": ".env.local"},
192
+ {"id": 161, "cat": "identifiers", "dictated": "dot git ignore", "output": ".gitignore"},
193
+ {"id": 162, "cat": "identifiers", "dictated": "dot next", "output": ".next"},
194
+ {"id": 163, "cat": "identifiers", "dictated": "dot swift", "output": ".swift"},
195
+ {"id": 164, "cat": "identifiers", "dictated": "dot T S X", "output": ".tsx"},
196
+ {"id": 165, "cat": "identifiers", "dictated": "dot T S", "output": ".ts"},
197
+ {"id": 166, "cat": "identifiers", "dictated": "dot J S", "output": ".js"},
198
+ {"id": 167, "cat": "identifiers", "dictated": "dot JSON", "output": ".json"},
199
+ {"id": 168, "cat": "identifiers", "dictated": "dot M D", "output": ".md"},
200
+ {"id": 169, "cat": "identifiers", "dictated": "dot S H", "output": ".sh"},
201
+ {"id": 170, "cat": "identifiers", "dictated": "dot P Y", "output": ".py"},
202
+ {"id": 171, "cat": "identifiers", "dictated": "dot yaml", "output": ".yaml"},
203
+ {"id": 172, "cat": "identifiers", "dictated": "dot T X T", "output": ".txt"},
204
+ {"id": 173, "cat": "identifiers", "dictated": "dot P B X proj", "output": ".pbxproj"},
205
+
206
+ {"id": 174, "cat": "identifiers", "dictated": "package dot JSON", "output": "package.json"},
207
+ {"id": 175, "cat": "identifiers", "dictated": "tsconfig dot JSON", "output": "tsconfig.json"},
208
+ {"id": 176, "cat": "identifiers", "dictated": "index dot T S", "output": "index.ts"},
209
+ {"id": 177, "cat": "identifiers", "dictated": "server dot T S", "output": "server.ts"},
210
+ {"id": 178, "cat": "identifiers", "dictated": "app delegate dot swift", "output": "AppDelegate.swift"},
211
+ {"id": 179, "cat": "identifiers", "dictated": "inference service dot swift", "output": "InferenceService.swift"},
212
+ {"id": 180, "cat": "identifiers", "dictated": "launch D dot S H", "output": "launchd.sh"},
213
+ {"id": 181, "cat": "identifiers", "dictated": "demo dash typing dot M D", "output": "demo-typing.md"},
214
+ {"id": 182, "cat": "identifiers", "dictated": "button dot T S X", "output": "Button.tsx"},
215
+ {"id": 183, "cat": "identifiers", "dictated": "output dot T X T", "output": "output.txt"},
216
+ {"id": 184, "cat": "identifiers", "dictated": "install dot S H", "output": "install.sh"},
217
+
218
+ {"id": 185, "cat": "paths", "dictated": "dot dot slash dev", "output": "../dev"},
219
+ {"id": 186, "cat": "paths", "dictated": "dot dot slash dot dot slash src", "output": "../../src"},
220
+ {"id": 187, "cat": "paths", "dictated": "tilde slash dev slash talkie", "output": "~/dev/talkie"},
221
+ {"id": 188, "cat": "paths", "dictated": "tilde slash library slash application support slash talkie", "output": "~/Library/Application Support/Talkie"},
222
+ {"id": 189, "cat": "paths", "dictated": "tilde slash library slash keychains slash login dot keychain dash DB", "output": "~/Library/Keychains/login.keychain-db"},
223
+ {"id": 190, "cat": "paths", "dictated": "slash users slash arach slash dev slash dot E N V", "output": "/Users/arach/dev/.env"},
224
+ {"id": 191, "cat": "paths", "dictated": "slash users slash arach slash dev slash vif slash scripts slash launch D dot S H", "output": "/Users/arach/dev/vif/scripts/launchd.sh"},
225
+ {"id": 192, "cat": "paths", "dictated": "SRC slash components slash button dot T S X", "output": "src/components/Button.tsx"},
226
+ {"id": 193, "cat": "paths", "dictated": "mac OS slash talkie kit slash", "output": "macOS/TalkieKit/"},
227
+ {"id": 194, "cat": "paths", "dictated": "mac OS slash talkie slash services slash", "output": "macOS/Talkie/Services/"},
228
+ {"id": 195, "cat": "paths", "dictated": "CLI slash SRC slash commands slash inference dot T S", "output": "cli/src/commands/inference.ts"},
229
+ {"id": 196, "cat": "paths", "dictated": "services slash talkie dash API", "output": "services/talkie-api"},
230
+ {"id": 197, "cat": "paths", "dictated": "dot slash scripts slash sync dash xcode dash files dot P Y", "output": "./scripts/sync-xcode-files.py"},
231
+
232
+ {"id": 198, "cat": "urls", "dictated": "HTTPS colon slash slash GitHub dot com slash arach slash talkie", "output": "https://github.com/arach/talkie"},
233
+ {"id": 199, "cat": "urls", "dictated": "HTTPS colon slash slash open code dot AI slash install", "output": "https://opencode.ai/install"},
234
+ {"id": 200, "cat": "urls", "dictated": "HTTPS colon slash slash hugging face dot co slash MLX dash community", "output": "https://huggingface.co/mlx-community"},
235
+ {"id": 201, "cat": "urls", "dictated": "HTTPS colon slash slash API dot use talkie dot com slash API slash user", "output": "https://api.usetalkie.com/api/user"},
236
+ {"id": 202, "cat": "urls", "dictated": "HTTPS colon slash slash accounts dot use talkie dot com slash sign dash in", "output": "https://accounts.usetalkie.com/sign-in"},
237
+ {"id": 203, "cat": "urls", "dictated": "git at GitHub dot com colon arach slash clipper dot git", "output": "git@github.com:arach/clipper.git"},
238
+ {"id": 204, "cat": "urls", "dictated": "git at GitHub dot com colon arach slash use talkie dot com dot git", "output": "git@github.com:arach/usetalkie.com.git"},
239
+ {"id": 205, "cat": "urls", "dictated": "WS colon slash slash one twenty seven dot zero dot zero dot one colon one nine eight two two", "output": "ws://127.0.0.1:19822"},
240
+ {"id": 206, "cat": "urls", "dictated": "localhost colon three thousand", "output": "localhost:3000"},
241
+ {"id": 207, "cat": "urls", "dictated": "localhost colon eighty eighty", "output": "localhost:8080"},
242
+
243
+ {"id": 208, "cat": "identifiers", "dictated": "at OpenAI slash codex", "output": "@openai/codex"},
244
+ {"id": 209, "cat": "identifiers", "dictated": "at types slash node", "output": "@types/node"},
245
+ {"id": 210, "cat": "identifiers", "dictated": "at types slash react", "output": "@types/react"},
246
+ {"id": 211, "cat": "identifiers", "dictated": "MLX dash community slash llama dash three dot two dash one B dash instruct dash four bit", "output": "mlx-community/Llama-3.2-1B-Instruct-4bit"},
247
+ {"id": 212, "cat": "identifiers", "dictated": "MLX dash community slash Qwen two point five dash zero point five B dash instruct dash four bit", "output": "mlx-community/Qwen2.5-0.5B-Instruct-4bit"},
248
+ {"id": 213, "cat": "identifiers", "dictated": "arach slash clipper", "output": "arach/clipper"},
249
+ {"id": 214, "cat": "identifiers", "dictated": "vercel dash labs slash agent dash skills", "output": "vercel-labs/agent-skills"},
250
+ {"id": 215, "cat": "identifiers", "dictated": "arach at tchoupani dot com", "output": "arach@tchoupani.com"},
251
+ {"id": 216, "cat": "identifiers", "dictated": "JDI dot talkie dot inference dot dev", "output": "jdi.talkie.inference.dev"},
252
+ {"id": 217, "cat": "identifiers", "dictated": "com dot JDI dot talkie", "output": "com.jdi.talkie"},
253
+
254
+ {"id": 218, "cat": "operators", "dictated": "X equals five", "output": "x=5"},
255
+ {"id": 219, "cat": "operators", "dictated": "X plus equals one", "output": "x+=1"},
256
+ {"id": 220, "cat": "operators", "dictated": "X dash equals one", "output": "x-=1"},
257
+ {"id": 221, "cat": "operators", "dictated": "if X equals equals Y", "output": "if x == y"},
258
+ {"id": 222, "cat": "operators", "dictated": "if X not equals Y", "output": "if x != y"},
259
+ {"id": 223, "cat": "operators", "dictated": "if X greater than Y", "output": "if x > y"},
260
+ {"id": 224, "cat": "operators", "dictated": "if X less than Y", "output": "if x < y"},
261
+ {"id": 225, "cat": "operators", "dictated": "if X greater than or equal Y", "output": "if x >= y"},
262
+ {"id": 226, "cat": "operators", "dictated": "if X less than or equal Y", "output": "if x <= y"},
263
+ {"id": 227, "cat": "operators", "dictated": "X arrow Y", "output": "x -> y"},
264
+ {"id": 228, "cat": "operators", "dictated": "X fat arrow Y", "output": "x => y"},
265
+ {"id": 229, "cat": "operators", "dictated": "open paren X close paren fat arrow open brace close brace", "output": "(x) => {}"},
266
+ {"id": 230, "cat": "operators", "dictated": "open paren close paren fat arrow open brace close brace", "output": "() => {}"},
267
+ {"id": 231, "cat": "operators", "dictated": "A pipe pipe B", "output": "a || b"},
268
+ {"id": 232, "cat": "operators", "dictated": "A and and B", "output": "a && b"},
269
+ {"id": 233, "cat": "operators", "dictated": "not A", "output": "!a"},
270
+ {"id": 234, "cat": "operators", "dictated": "A question mark B colon C", "output": "a ? b : c"},
271
+ {"id": 235, "cat": "operators", "dictated": "A question mark dot B", "output": "a?.b"},
272
+ {"id": 236, "cat": "operators", "dictated": "A bang bang", "output": "a!!"},
273
+ {"id": 237, "cat": "operators", "dictated": "A question mark question mark B", "output": "a ?? b"},
274
+
275
+ {"id": 238, "cat": "mixed", "dictated": "export all caps API underscore KEY equals quote my dash key dash one two three quote", "output": "export API_KEY=\"my-key-123\""},
276
+ {"id": 239, "cat": "mixed", "dictated": "echo dollar all caps HOME", "output": "echo $HOME"},
277
+ {"id": 240, "cat": "mixed", "dictated": "echo dollar all caps PATH", "output": "echo $PATH"},
278
+ {"id": 241, "cat": "mixed", "dictated": "dollar open brace all caps HOME close brace slash dot config", "output": "${HOME}/.config"},
279
+ {"id": 242, "cat": "mixed", "dictated": "dash M quote fix colon resolve XPC timeout on cold start quote", "output": "-m \"fix: resolve XPC timeout on cold start\""},
280
+ {"id": 243, "cat": "mixed", "dictated": "dash dash title quote fix inference latency quote", "output": "--title \"fix inference latency\""},
281
+ {"id": 244, "cat": "mixed", "dictated": "dash dash body quote added TTFT tracking quote", "output": "--body \"Added TTFT tracking\""},
282
+ {"id": 245, "cat": "mixed", "dictated": "dash dash temp zero point seven dash dash tokens five twelve", "output": "--temp 0.7 --tokens 512"},
283
+ {"id": 246, "cat": "mixed", "dictated": "dash dash since five M two greater than ampersand one pipe grep error", "output": "--since 5m 2>&1 | grep error"},
284
+ {"id": 247, "cat": "mixed", "dictated": "at main actor", "output": "@MainActor"},
285
+ {"id": 248, "cat": "mixed", "dictated": "at published", "output": "@Published"},
286
+ {"id": 249, "cat": "mixed", "dictated": "at observable", "output": "@Observable"},
287
+ {"id": 250, "cat": "mixed", "dictated": "at objc", "output": "@objc"},
288
+ {"id": 251, "cat": "mixed", "dictated": "hash if all caps DEBUG", "output": "#if DEBUG"},
289
+ {"id": 252, "cat": "mixed", "dictated": "hash endif", "output": "#endif"},
290
+ {"id": 253, "cat": "mixed", "dictated": "hash mark dash life cycle", "output": "// MARK: - Life Cycle"},
291
+ {"id": 254, "cat": "mixed", "dictated": "shebang slash bin slash bash", "output": "#!/bin/bash"},
292
+ {"id": 255, "cat": "mixed", "dictated": "shebang slash usr slash bin slash env python three", "output": "#!/usr/bin/env python3"},
293
+
294
+ {"id": 256, "cat": "mixed", "dictated": "func camel case get status open paren close paren arrow pascal case inference status", "output": "func getStatus() -> InferenceStatus"},
295
+ {"id": 257, "cat": "mixed", "dictated": "let camel case max tokens colon capital int equals five twelve", "output": "let maxTokens: Int = 512"},
296
+ {"id": 258, "cat": "mixed", "dictated": "var camel case is generating colon capital bool equals false", "output": "var isGenerating: Bool = false"},
297
+ {"id": 259, "cat": "mixed", "dictated": "guard let container equals self dot camel case model container else", "output": "guard let container = self.modelContainer else"},
298
+ {"id": 260, "cat": "mixed", "dictated": "import pascal case talkie kit", "output": "import TalkieKit"},
299
+
300
+ {"id": 261, "cat": "mixed", "dictated": "const camel case bridge port equals one nine eight two two", "output": "const bridgePort = 19822"},
301
+ {"id": 262, "cat": "mixed", "dictated": "async function camel case inference call open paren method colon string close paren", "output": "async function inferenceCall(method: string)"},
302
+ {"id": 263, "cat": "mixed", "dictated": "open brace quote model quote colon quote llama quote close brace", "output": "{\"model\": \"llama\"}"},
303
+ {"id": 264, "cat": "mixed", "dictated": "open bracket one comma two comma three close bracket", "output": "[1, 2, 3]"},
304
+ {"id": 265, "cat": "mixed", "dictated": "record open angle string comma unknown close angle", "output": "Record<string, unknown>"},
305
+ {"id": 266, "cat": "mixed", "dictated": "promise open angle void close angle", "output": "Promise<void>"},
306
+ {"id": 267, "cat": "mixed", "dictated": "array open angle string close angle", "output": "Array<string>"},
307
+ {"id": 268, "cat": "mixed", "dictated": "map open angle string comma number close angle", "output": "Map<string, number>"},
308
+
309
+ {"id": 269, "cat": "mixed", "dictated": "console dot log open paren quote hello quote close paren", "output": "console.log(\"hello\")"},
310
+ {"id": 270, "cat": "mixed", "dictated": "process dot env dot all caps API underscore KEY", "output": "process.env.API_KEY"},
311
+ {"id": 271, "cat": "mixed", "dictated": "JSON dot parse open paren data close paren", "output": "JSON.parse(data)"},
312
+ {"id": 272, "cat": "mixed", "dictated": "JSON dot stringify open paren obj close paren", "output": "JSON.stringify(obj)"},
313
+ {"id": 273, "cat": "mixed", "dictated": "self dot camel case loaded model ID", "output": "self.loadedModelId"},
314
+ {"id": 274, "cat": "mixed", "dictated": "try await container dot perform", "output": "try await container.perform"},
315
+ {"id": 275, "cat": "mixed", "dictated": "S equals S open bracket colon colon dash one close bracket", "output": "s = s[::-1]"},
316
+
317
+ {"id": 276, "cat": "mixed", "dictated": "dollar open paren P grep dash X talkie close paren", "output": "$(pgrep -x Talkie)"},
318
+ {"id": 277, "cat": "mixed", "dictated": "two greater than slash dev slash null", "output": "2>/dev/null"},
319
+ {"id": 278, "cat": "mixed", "dictated": "pipe head dash twenty five", "output": "| head -25"},
320
+ {"id": 279, "cat": "mixed", "dictated": "pipe grep quote physical footprint quote", "output": "| grep \"Physical footprint\""},
321
+ {"id": 280, "cat": "mixed", "dictated": "open bracket dash F dot next close bracket and and RM dash RF dot next", "output": "[ -f .next ] && rm -rf .next"},
322
+
323
+ {"id": 281, "cat": "mixed", "dictated": "for I in star dot log semicolon do echo dollar I semicolon done", "output": "for i in *.log; do echo $i; done"},
324
+ {"id": 282, "cat": "mixed", "dictated": "find dot dash name quote star dot swift quote", "output": "find . -name \"*.swift\""},
325
+ {"id": 283, "cat": "mixed", "dictated": "find dot dash name quote star dot T S X quote dash delete", "output": "find . -name \"*.tsx\" -delete"},
326
+ {"id": 284, "cat": "mixed", "dictated": "X args dash I open brace close brace echo open brace close brace", "output": "xargs -I {} echo {}"},
+
+ {"id": 285, "cat": "mixed", "dictated": "docker compose up dash D", "output": "docker compose up -d"},
+ {"id": 286, "cat": "mixed", "dictated": "docker exec dash I T container underscore name bash", "output": "docker exec -it container_name bash"},
+ {"id": 287, "cat": "mixed", "dictated": "SSH user at host dash P two two two two", "output": "ssh user@host -p 2222"},
+ {"id": 288, "cat": "mixed", "dictated": "SSH dash I tilde slash dot SSH slash ID underscore RSA user at one nine two dot one six eight dot one dot one", "output": "ssh -i ~/.ssh/id_rsa user@192.168.1.1"},
+
+ {"id": 289, "cat": "mixed", "dictated": "M K dir dash P SRC slash components and and touch SRC slash components slash button dot T S X", "output": "mkdir -p src/components && touch src/components/Button.tsx"},
+ {"id": 290, "cat": "mixed", "dictated": "cat package dot JSON pipe python three dash C quote import JSON semicolon print open paren JSON dot load open paren open open paren quote R quote close paren close paren close paren quote", "output": "cat package.json | python3 -c \"import json; print(json.load(open('r')))\""},
+ {"id": 291, "cat": "mixed", "dictated": "curl dash F S S L HTTPS colon slash slash example dot com slash install dot S H pipe bash", "output": "curl -fsSL https://example.com/install.sh | bash"},
+ {"id": 292, "cat": "mixed", "dictated": "git add dash A and and git commit dash M quote fix typo quote and and git push", "output": "git add -A && git commit -m \"fix typo\" && git push"},
+ {"id": 293, "cat": "mixed", "dictated": "talkie dash dev dev logs inference dash dash since five M two greater than ampersand one pipe head dash twenty", "output": "talkie-dev dev logs inference --since 5m 2>&1 | head -20"},
+ {"id": 294, "cat": "mixed", "dictated": "NPX skills add HTTPS colon slash slash GitHub dot com slash vercel dash labs slash agent dash skills dash dash skill web dash design dash guidelines", "output": "npx skills add https://github.com/vercel-labs/agent-skills --skill web-design-guidelines"},
+ {"id": 295, "cat": "mixed", "dictated": "git clone git at GitHub dot com colon arach slash use talkie dot com dot git tilde slash dev slash use talkie dot com", "output": "git clone git@github.com:arach/usetalkie.com.git ~/dev/usetalkie.com"},
+ {"id": 296, "cat": "mixed", "dictated": "VM map dash summary dollar open paren P grep dash X talkie close paren two greater than slash dev slash null pipe head dash twenty five", "output": "vmmap -summary $(pgrep -x Talkie) 2>/dev/null | head -25"},
+ {"id": 297, "cat": "mixed", "dictated": "security unlock dash keychain tilde slash library slash keychains slash login dot keychain dash DB", "output": "security unlock-keychain ~/Library/Keychains/login.keychain-db"},
+
+ {"id": 298, "cat": "mixed", "dictated": "say dash V ava dash R two eighty quote hi I'm your new premium voice quote", "output": "say -v ava -r 280 \"Hi, I'm your new premium voice\""},
+ {"id": 299, "cat": "mixed", "dictated": "talkie inference generate quote what is two plus two quote dash dash model MLX dash community slash llama dash three dot two dash one B dash instruct dash four bit dash dash verbose dash dash temp zero point seven", "output": "talkie inference generate \"What is two plus two\" --model mlx-community/Llama-3.2-1B-Instruct-4bit --verbose --temp 0.7"},
+ {"id": 300, "cat": "mixed", "dictated": "GH PR create dash dash title quote fix inference latency quote dash dash body quote added TTFT tracking and latency instrumentation quote", "output": "gh pr create --title \"fix inference latency\" --body \"Added TTFT tracking and latency instrumentation\""},
+
+ {"id": 301, "cat": "symbols-basic", "dictated": "forward slash", "output": "/"},
+ {"id": 302, "cat": "symbols-basic", "dictated": "back tick", "output": "`"},
+ {"id": 303, "cat": "symbols-basic", "dictated": "A comma B comma C", "output": "A, B, C"},
+ {"id": 304, "cat": "symbols-basic", "dictated": "underscore underscore init underscore underscore", "output": "__init__"},
+ {"id": 305, "cat": "symbols-basic", "dictated": "hash one", "output": "#1"},
+ {"id": 306, "cat": "symbols-basic", "dictated": "at user", "output": "@user"},
+ {"id": 307, "cat": "symbols-basic", "dictated": "dollar one", "output": "$1"},
+ {"id": 308, "cat": "symbols-basic", "dictated": "dollar zero", "output": "$0"},
+ {"id": 309, "cat": "symbols-basic", "dictated": "percent S", "output": "%s"},
+ {"id": 310, "cat": "symbols-basic", "dictated": "percent D", "output": "%d"},
+ {"id": 311, "cat": "symbols-basic", "dictated": "A semicolon B semicolon C", "output": "A; B; C"},
+ {"id": 312, "cat": "symbols-basic", "dictated": "underscore", "output": "_"},
+
+ {"id": 313, "cat": "symbols-compound", "dictated": "equals equals equals", "output": "==="},
+ {"id": 314, "cat": "symbols-compound", "dictated": "bang equals equals", "output": "!=="},
+ {"id": 315, "cat": "symbols-compound", "dictated": "star star", "output": "**"},
+ {"id": 316, "cat": "symbols-compound", "dictated": "dollar dollar", "output": "$$"},
+ {"id": 317, "cat": "symbols-compound", "dictated": "dot dot less than", "output": "..<"},
+ {"id": 318, "cat": "symbols-compound", "dictated": "plus plus", "output": "++"},
+ {"id": 319, "cat": "symbols-compound", "dictated": "slash star", "output": "/*"},
+ {"id": 320, "cat": "symbols-compound", "dictated": "star slash", "output": "*/"},
+ {"id": 321, "cat": "symbols-compound", "dictated": "slash slash", "output": "//"},
+ {"id": 322, "cat": "symbols-compound", "dictated": "hash hash", "output": "##"},
+ {"id": 323, "cat": "symbols-compound", "dictated": "colon equals", "output": ":="},
+ {"id": 324, "cat": "symbols-compound", "dictated": "question mark colon", "output": "?:"},
+ {"id": 325, "cat": "symbols-compound", "dictated": "dot star", "output": ".*"},
+ {"id": 326, "cat": "symbols-compound", "dictated": "greater than greater than greater than", "output": ">>>"},
+ {"id": 327, "cat": "symbols-compound", "dictated": "less than less than", "output": "<<"},
+
+ {"id": 328, "cat": "quotes", "dictated": "backtick echo dollar all caps HOME backtick", "output": "`echo $HOME`"},
+ {"id": 329, "cat": "quotes", "dictated": "triple backtick", "output": "```"},
+ {"id": 330, "cat": "quotes", "dictated": "triple backtick swift", "output": "```swift"},
+ {"id": 331, "cat": "quotes", "dictated": "triple backtick python", "output": "```python"},
+ {"id": 332, "cat": "quotes", "dictated": "single quote R single quote", "output": "'r'"},
+ {"id": 333, "cat": "quotes", "dictated": "single quote W single quote", "output": "'w'"},
+ {"id": 334, "cat": "quotes", "dictated": "quote fix colon improve performance quote", "output": "\"fix: improve performance\""},
+ {"id": 335, "cat": "quotes", "dictated": "quote feat colon add dark mode quote", "output": "\"feat: add dark mode\""},
+ {"id": 336, "cat": "quotes", "dictated": "quote chore colon update dependencies quote", "output": "\"chore: update dependencies\""},
+ {"id": 337, "cat": "quotes", "dictated": "single quote use strict single quote", "output": "'use strict'"},
+ {"id": 338, "cat": "quotes", "dictated": "single quote UTF dash eight single quote", "output": "'utf-8'"},
+ {"id": 339, "cat": "quotes", "dictated": "quote dot slash SRC quote", "output": "\"./src\""},
+ {"id": 340, "cat": "quotes", "dictated": "quote star dot T S quote", "output": "\"*.ts\""},
+ {"id": 341, "cat": "quotes", "dictated": "backtick dollar open brace name close brace backtick", "output": "`${name}`"},
+ {"id": 342, "cat": "quotes", "dictated": "backtick dollar open brace camel case user name close brace backtick", "output": "`${userName}`"},
+ {"id": 343, "cat": "quotes", "dictated": "backtick camel case is loading backtick", "output": "`isLoading`"},
+ {"id": 344, "cat": "quotes", "dictated": "quote test dash value quote", "output": "\"test-value\""},
+ {"id": 345, "cat": "quotes", "dictated": "quote application slash JSON quote", "output": "\"application/json\""},
+ {"id": 346, "cat": "quotes", "dictated": "quote content dash type quote", "output": "\"content-type\""},
+
+ {"id": 347, "cat": "brackets", "dictated": "open bracket one colon five close bracket", "output": "[1:5]"},
+ {"id": 348, "cat": "brackets", "dictated": "open bracket colon colon dash one close bracket", "output": "[::-1]"},
+ {"id": 349, "cat": "brackets", "dictated": "open bracket zero colon close bracket", "output": "[0:]"},
+ {"id": 350, "cat": "brackets", "dictated": "open bracket colon five close bracket", "output": "[:5]"},
+ {"id": 351, "cat": "brackets", "dictated": "open angle T close angle", "output": "<T>"},
+ {"id": 352, "cat": "brackets", "dictated": "open angle T colon pascal case comparable close angle", "output": "<T: Comparable>"},
+ {"id": 353, "cat": "brackets", "dictated": "open angle T colon pascal case codable close angle", "output": "<T: Codable>"},
+ {"id": 354, "cat": "brackets", "dictated": "open bracket pascal case string colon any close bracket", "output": "[String: Any]"},
+ {"id": 355, "cat": "brackets", "dictated": "open bracket pascal case string close bracket", "output": "[String]"},
+ {"id": 356, "cat": "brackets", "dictated": "open bracket pascal case int close bracket", "output": "[Int]"},
+ {"id": 357, "cat": "brackets", "dictated": "open paren X colon pascal case int comma Y colon pascal case int close paren", "output": "(x: Int, y: Int)"},
+ {"id": 358, "cat": "brackets", "dictated": "open paren underscore colon pascal case string close paren", "output": "(_: String)"},
+ {"id": 359, "cat": "brackets", "dictated": "open angle pascal case string comma pascal case string close angle", "output": "<String, String>"},
+ {"id": 360, "cat": "brackets", "dictated": "open brace quote key quote colon quote value quote close brace", "output": "{\"key\": \"value\"}"},
+ {"id": 361, "cat": "brackets", "dictated": "open bracket dash one close bracket", "output": "[-1]"},
+ {"id": 362, "cat": "brackets", "dictated": "open paren A comma B comma C close paren", "output": "(a, b, c)"},
+ {"id": 363, "cat": "brackets", "dictated": "open angle pascal case input comma pascal case output close angle", "output": "<Input, Output>"},
+ {"id": 364, "cat": "brackets", "dictated": "open paren close paren arrow capital void", "output": "() -> Void"},
+
+ {"id": 365, "cat": "casing", "dictated": "camel case handle request", "output": "handleRequest"},
+ {"id": 366, "cat": "casing", "dictated": "camel case did finish loading", "output": "didFinishLoading"},
+ {"id": 367, "cat": "casing", "dictated": "camel case will appear", "output": "willAppear"},
+ {"id": 368, "cat": "casing", "dictated": "camel case should auto rotate", "output": "shouldAutoRotate"},
+ {"id": 369, "cat": "casing", "dictated": "camel case set value", "output": "setValue"},
+ {"id": 370, "cat": "casing", "dictated": "camel case get value", "output": "getValue"},
+ {"id": 371, "cat": "casing", "dictated": "camel case load model", "output": "loadModel"},
+ {"id": 372, "cat": "casing", "dictated": "camel case model container", "output": "modelContainer"},
+ {"id": 373, "cat": "casing", "dictated": "camel case file manager", "output": "fileManager"},
+ {"id": 374, "cat": "casing", "dictated": "camel case text field", "output": "textField"},
+ {"id": 375, "cat": "casing", "dictated": "camel case on change", "output": "onChange"},
+ {"id": 376, "cat": "casing", "dictated": "camel case on submit", "output": "onSubmit"},
+ {"id": 377, "cat": "casing", "dictated": "camel case view model", "output": "viewModel"},
+ {"id": 378, "cat": "casing", "dictated": "camel case content view", "output": "contentView"},
+ {"id": 379, "cat": "casing", "dictated": "pascal case view controller", "output": "ViewController"},
+ {"id": 380, "cat": "casing", "dictated": "pascal case app delegate", "output": "AppDelegate"},
+ {"id": 381, "cat": "casing", "dictated": "pascal case table view", "output": "TableView"},
+ {"id": 382, "cat": "casing", "dictated": "pascal case audio recorder", "output": "AudioRecorder"},
+ {"id": 383, "cat": "casing", "dictated": "pascal case notch composer", "output": "NotchComposer"},
+ {"id": 384, "cat": "casing", "dictated": "pascal case floating pill", "output": "FloatingPill"},
+ {"id": 385, "cat": "casing", "dictated": "pascal case recording overlay", "output": "RecordingOverlay"},
+ {"id": 386, "cat": "casing", "dictated": "pascal case shared settings", "output": "SharedSettings"},
+ {"id": 387, "cat": "casing", "dictated": "pascal case debug kit", "output": "DebugKit"},
+ {"id": 388, "cat": "casing", "dictated": "pascal case screen capture", "output": "ScreenCapture"},
+ {"id": 389, "cat": "casing", "dictated": "pascal case user defaults", "output": "UserDefaults"},
+ {"id": 390, "cat": "casing", "dictated": "snake case user id", "output": "user_id"},
+ {"id": 391, "cat": "casing", "dictated": "snake case created at", "output": "created_at"},
+ {"id": 392, "cat": "casing", "dictated": "snake case updated at", "output": "updated_at"},
+ {"id": 393, "cat": "casing", "dictated": "snake case file path", "output": "file_path"},
+ {"id": 394, "cat": "casing", "dictated": "snake case batch size", "output": "batch_size"},
+ {"id": 395, "cat": "casing", "dictated": "snake case learning rate", "output": "learning_rate"},
+ {"id": 396, "cat": "casing", "dictated": "snake case num layers", "output": "num_layers"},
+ {"id": 397, "cat": "casing", "dictated": "snake case max tokens", "output": "max_tokens"},
+ {"id": 398, "cat": "casing", "dictated": "screaming snake case default timeout", "output": "DEFAULT_TIMEOUT"},
+ {"id": 399, "cat": "casing", "dictated": "screaming snake case base URL", "output": "BASE_URL"},
+ {"id": 400, "cat": "casing", "dictated": "screaming snake case node env", "output": "NODE_ENV"},
+ {"id": 401, "cat": "casing", "dictated": "screaming snake case bundle ID", "output": "BUNDLE_ID"},
+ {"id": 402, "cat": "casing", "dictated": "screaming snake case launch daemon", "output": "LAUNCH_DAEMON"},
+ {"id": 403, "cat": "casing", "dictated": "kebab case my component", "output": "my-component"},
+ {"id": 404, "cat": "casing", "dictated": "kebab case dark mode toggle", "output": "dark-mode-toggle"},
+ {"id": 405, "cat": "casing", "dictated": "kebab case audio player", "output": "audio-player"},
+ {"id": 406, "cat": "casing", "dictated": "kebab case file upload", "output": "file-upload"},
+ {"id": 407, "cat": "casing", "dictated": "kebab case search results", "output": "search-results"},
+
+ {"id": 408, "cat": "spacing", "dictated": "no space git hub", "output": "github"},
+ {"id": 409, "cat": "spacing", "dictated": "no space home brew", "output": "homebrew"},
+ {"id": 410, "cat": "spacing", "dictated": "no space key chain", "output": "keychain"},
+ {"id": 411, "cat": "spacing", "dictated": "no space make file", "output": "makefile"},
+ {"id": 412, "cat": "spacing", "dictated": "no space name space", "output": "namespace"},
+ {"id": 413, "cat": "spacing", "dictated": "no space user name", "output": "username"},
+ {"id": 414, "cat": "spacing", "dictated": "no space data base", "output": "database"},
+ {"id": 415, "cat": "spacing", "dictated": "no space host name", "output": "hostname"},
+ {"id": 416, "cat": "spacing", "dictated": "no space time stamp", "output": "timestamp"},
+ {"id": 417, "cat": "spacing", "dictated": "no space fire base", "output": "firebase"},
+ {"id": 418, "cat": "spacing", "dictated": "no space code base", "output": "codebase"},
+ {"id": 419, "cat": "spacing", "dictated": "no space drop down", "output": "dropdown"},
+ {"id": 420, "cat": "spacing", "dictated": "no space work space", "output": "workspace"},
+ {"id": 421, "cat": "spacing", "dictated": "tab A tab B tab C", "output": "A\tB\tC"},
+ {"id": 422, "cat": "spacing", "dictated": "new line A new line B new line C", "output": "A\nB\nC"},
+
+ {"id": 423, "cat": "numbers", "dictated": "one two seven dot zero dot zero dot one", "output": "127.0.0.1"},
+ {"id": 424, "cat": "numbers", "dictated": "one nine two dot one six eight dot one dot one", "output": "192.168.1.1"},
+ {"id": 425, "cat": "numbers", "dictated": "ten dot zero dot zero dot one", "output": "10.0.0.1"},
+ {"id": 426, "cat": "numbers", "dictated": "four four three", "output": "443"},
+ {"id": 427, "cat": "numbers", "dictated": "three thousand", "output": "3000"},
+ {"id": 428, "cat": "numbers", "dictated": "eight thousand", "output": "8000"},
+ {"id": 429, "cat": "numbers", "dictated": "nine zero nine zero", "output": "9090"},
+ {"id": 430, "cat": "numbers", "dictated": "sixty four", "output": "64"},
+ {"id": 431, "cat": "numbers", "dictated": "one twenty eight", "output": "128"},
+ {"id": 432, "cat": "numbers", "dictated": "two zero four eight", "output": "2048"},
+ {"id": 433, "cat": "numbers", "dictated": "four zero nine six", "output": "4096"},
+ {"id": 434, "cat": "numbers", "dictated": "zero point nine nine", "output": "0.99"},
+
+ {"id": 435, "cat": "paths", "dictated": "slash etc slash hosts", "output": "/etc/hosts"},
+ {"id": 436, "cat": "paths", "dictated": "slash usr slash local slash bin", "output": "/usr/local/bin"},
+ {"id": 437, "cat": "paths", "dictated": "slash usr slash local slash bin slash python three", "output": "/usr/local/bin/python3"},
+ {"id": 438, "cat": "paths", "dictated": "slash var slash log", "output": "/var/log"},
+ {"id": 439, "cat": "paths", "dictated": "slash tmp", "output": "/tmp"},
+ {"id": 440, "cat": "paths", "dictated": "tilde slash dot SSH slash ID underscore RSA", "output": "~/.ssh/id_rsa"},
+ {"id": 441, "cat": "paths", "dictated": "tilde slash dot SSH slash config", "output": "~/.ssh/config"},
+ {"id": 442, "cat": "paths", "dictated": "tilde slash dot config", "output": "~/.config"},
+ {"id": 443, "cat": "paths", "dictated": "tilde slash dot bash profile", "output": "~/.bash_profile"},
+ {"id": 444, "cat": "paths", "dictated": "tilde slash dot zshrc", "output": "~/.zshrc"},
+ {"id": 445, "cat": "paths", "dictated": "dot dot slash configs", "output": "../configs"},
+ {"id": 446, "cat": "paths", "dictated": "dot dot slash dot dot slash dot dot slash", "output": "../../../"},
+ {"id": 447, "cat": "paths", "dictated": "dot dot slash dot dot slash lib", "output": "../../lib"},
+ {"id": 448, "cat": "paths", "dictated": "node underscore modules slash dot bin", "output": "node_modules/.bin"},
+ {"id": 449, "cat": "paths", "dictated": "mac OS slash talkie slash app slash app delegate dot swift", "output": "macOS/Talkie/App/AppDelegate.swift"},
+ {"id": 450, "cat": "paths", "dictated": "packages slash talkie kit slash sources", "output": "Packages/TalkieKit/Sources"},
+ {"id": 451, "cat": "paths", "dictated": "SRC slash utils slash index dot T S", "output": "src/utils/index.ts"},
+
+ {"id": 452, "cat": "urls", "dictated": "HTTPS colon slash slash N P M J S dot com slash package slash commander", "output": "https://npmjs.com/package/commander"},
+ {"id": 453, "cat": "urls", "dictated": "talkie colon slash slash dictate", "output": "talkie://dictate"},
+ {"id": 454, "cat": "urls", "dictated": "talkie colon slash slash auth slash callback", "output": "talkie://auth/callback"},
+ {"id": 455, "cat": "urls", "dictated": "HTTPS colon slash slash example dot com slash API slash V one slash users", "output": "https://example.com/api/v1/users"},
+ {"id": 456, "cat": "urls", "dictated": "HTTPS colon slash slash example dot com slash search question mark Q equals hello", "output": "https://example.com/search?q=hello"},
+ {"id": 457, "cat": "urls", "dictated": "HTTPS colon slash slash example dot com slash page hash section", "output": "https://example.com/page#section"},
+ {"id": 458, "cat": "urls", "dictated": "HTTP colon slash slash one two seven dot zero dot zero dot one colon three thousand", "output": "http://127.0.0.1:3000"},
+ {"id": 459, "cat": "urls", "dictated": "HTTPS colon slash slash raw dot github user content dot com", "output": "https://raw.githubusercontent.com"},
+ {"id": 460, "cat": "urls", "dictated": "HTTPS colon slash slash pypi dot org slash project slash torch", "output": "https://pypi.org/project/torch"},
+ {"id": 461, "cat": "urls", "dictated": "FTP colon slash slash files dot example dot com", "output": "ftp://files.example.com"},
+ {"id": 462, "cat": "urls", "dictated": "HTTPS colon slash slash API dot github dot com slash repos", "output": "https://api.github.com/repos"},
+ {"id": 463, "cat": "urls", "dictated": "one two seven dot zero dot zero dot one colon five four three two", "output": "127.0.0.1:5432"},
+ {"id": 464, "cat": "urls", "dictated": "zero dot zero dot zero dot zero colon eighty eighty", "output": "0.0.0.0:8080"},
+ {"id": 465, "cat": "urls", "dictated": "HTTPS colon slash slash cloud dot use talkie dot com", "output": "https://cloud.usetalkie.com"},
+ {"id": 466, "cat": "urls", "dictated": "HTTPS colon slash slash my dot use talkie dot com slash account", "output": "https://my.usetalkie.com/account"},
+
+ {"id": 467, "cat": "operators", "dictated": "X star equals two", "output": "x*=2"},
+ {"id": 468, "cat": "operators", "dictated": "X slash equals two", "output": "x/=2"},
+ {"id": 469, "cat": "operators", "dictated": "X percent equals two", "output": "x%=2"},
+ {"id": 470, "cat": "operators", "dictated": "X star star Y", "output": "x ** y"},
+ {"id": 471, "cat": "operators", "dictated": "X percent Y", "output": "x % y"},
+ {"id": 472, "cat": "operators", "dictated": "X ampersand Y", "output": "x & y"},
+ {"id": 473, "cat": "operators", "dictated": "X pipe Y", "output": "x | y"},
+ {"id": 474, "cat": "operators", "dictated": "X caret Y", "output": "x ^ y"},
+ {"id": 475, "cat": "operators", "dictated": "tilde X", "output": "~x"},
+ {"id": 476, "cat": "operators", "dictated": "X less than less than Y", "output": "x << y"},
+ {"id": 477, "cat": "operators", "dictated": "X greater than greater than Y", "output": "x >> y"},
+ {"id": 478, "cat": "operators", "dictated": "if X equals equals equals Y", "output": "if x === y"},
+ {"id": 479, "cat": "operators", "dictated": "if X bang equals equals Y", "output": "if x !== y"},
+ {"id": 480, "cat": "operators", "dictated": "X dot dot dot Y", "output": "x...y"},
+ {"id": 481, "cat": "operators", "dictated": "X dot dot less than Y", "output": "x..<y"},
+ {"id": 482, "cat": "operators", "dictated": "try question mark X", "output": "try? x"},
+ {"id": 483, "cat": "operators", "dictated": "try bang X", "output": "try! x"},
+ {"id": 484, "cat": "operators", "dictated": "X as question mark Y", "output": "x as? y"},
+ {"id": 485, "cat": "operators", "dictated": "X as bang Y", "output": "x as! y"},
+ {"id": 486, "cat": "operators", "dictated": "X question mark dot Y question mark dot Z", "output": "x?.y?.z"},
+ {"id": 487, "cat": "operators", "dictated": "X bang", "output": "x!"},
+ {"id": 488, "cat": "operators", "dictated": "X question mark", "output": "x?"},
+ {"id": 489, "cat": "operators", "dictated": "open paren A comma B close paren arrow pascal case int", "output": "(a, b) -> Int"},
+ {"id": 490, "cat": "operators", "dictated": "A plus B star C", "output": "a + b * c"},
+ {"id": 491, "cat": "operators", "dictated": "open paren A plus B close paren star C", "output": "(a + b) * c"},
+ {"id": 492, "cat": "operators", "dictated": "A pipe greater than B", "output": "a |> b"},
+ {"id": 493, "cat": "operators", "dictated": "A and and B and and C", "output": "a && b && c"},
+ {"id": 494, "cat": "operators", "dictated": "A pipe pipe B pipe pipe C", "output": "a || b || c"},
+ {"id": 495, "cat": "operators", "dictated": "if let X equals Y", "output": "if let x = y"},
+ {"id": 496, "cat": "operators", "dictated": "guard let X equals Y else", "output": "guard let x = y else"},
+
+ {"id": 497, "cat": "mixed", "dictated": "git branch dash capital D feature slash fix dash typo", "output": "git branch -D feature/fix-typo"},
+ {"id": 498, "cat": "mixed", "dictated": "git checkout dash B feature slash add dash auth", "output": "git checkout -b feature/add-auth"},
+ {"id": 499, "cat": "mixed", "dictated": "git log dash dash oneline dash N ten", "output": "git log --oneline -n 10"},
+ {"id": 500, "cat": "mixed", "dictated": "git diff dash dash staged", "output": "git diff --staged"},
+ {"id": 501, "cat": "mixed", "dictated": "git stash push dash M quote work in progress quote", "output": "git stash push -m \"work in progress\""},
+ {"id": 502, "cat": "mixed", "dictated": "git tag dash A V one dot zero dot zero dash M quote initial release quote", "output": "git tag -a v1.0.0 -m \"initial release\""},
+ {"id": 503, "cat": "mixed", "dictated": "git reset dash dash soft all caps HEAD tilde one", "output": "git reset --soft HEAD~1"},
+ {"id": 504, "cat": "mixed", "dictated": "git push origin main colon main", "output": "git push origin main:main"},
+ {"id": 505, "cat": "mixed", "dictated": "git config dash dash global user dot email quote arach at tchoupani dot com quote", "output": "git config --global user.email \"arach@tchoupani.com\""},
+ {"id": 506, "cat": "mixed", "dictated": "git push dash U origin feature slash my dash feature", "output": "git push -u origin feature/my-feature"},
+
+ {"id": 507, "cat": "mixed", "dictated": "export all caps NODE underscore ENV equals quote production quote", "output": "export NODE_ENV=\"production\""},
+ {"id": 508, "cat": "mixed", "dictated": "export all caps DATABASE underscore URL equals quote postgres colon slash slash localhost slash DB quote", "output": "export DATABASE_URL=\"postgres://localhost/db\""},
+ {"id": 509, "cat": "mixed", "dictated": "dollar all caps SHELL", "output": "$SHELL"},
+ {"id": 510, "cat": "mixed", "dictated": "dollar all caps USER", "output": "$USER"},
+ {"id": 511, "cat": "mixed", "dictated": "dollar open brace all caps API underscore KEY close brace", "output": "${API_KEY}"},
+ {"id": 512, "cat": "mixed", "dictated": "dollar open brace all caps DATABASE underscore URL close brace", "output": "${DATABASE_URL}"},
+ {"id": 513, "cat": "mixed", "dictated": "all caps PORT equals three thousand", "output": "PORT=3000"},
+
+ {"id": 514, "cat": "mixed", "dictated": "func camel case view did load open paren close paren", "output": "func viewDidLoad()"},
+ {"id": 515, "cat": "mixed", "dictated": "private var camel case is recording equals false", "output": "private var isRecording = false"},
+ {"id": 516, "cat": "mixed", "dictated": "let camel case file URL colon all caps URL", "output": "let fileURL: URL"},
+ {"id": 517, "cat": "mixed", "dictated": "open bracket weak self close bracket", "output": "[weak self]"},
+ {"id": 518, "cat": "mixed", "dictated": "at escaping open paren close paren arrow capital void", "output": "@escaping () -> Void"},
+ {"id": 519, "cat": "mixed", "dictated": "try await self dot camel case load model open paren close paren", "output": "try await self.loadModel()"},
+ {"id": 520, "cat": "mixed", "dictated": "case dot camel case success open paren let value close paren", "output": "case .success(let value)"},
+ {"id": 521, "cat": "mixed", "dictated": "case dot camel case failure open paren let error close paren", "output": "case .failure(let error)"},
+ {"id": 522, "cat": "mixed", "dictated": "guard let self else open brace return close brace", "output": "guard let self else { return }"},
+ {"id": 523, "cat": "mixed", "dictated": "hash selector open paren camel case handle tap open paren underscore colon close paren close paren", "output": "#selector(handleTap(_:))"},
+ {"id": 524, "cat": "mixed", "dictated": "at state private var camel case is showing equals false", "output": "@State private var isShowing = false"},
+ {"id": 525, "cat": "mixed", "dictated": "at binding var text colon pascal case string", "output": "@Binding var text: String"},
+ {"id": 526, "cat": "mixed", "dictated": "dot camel case on appear open brace close brace", "output": ".onAppear {}"},
+ {"id": 527, "cat": "mixed", "dictated": "dot frame open paren width colon two hundred comma height colon one hundred close paren", "output": ".frame(width: 200, height: 100)"},
+ {"id": 528, "cat": "mixed", "dictated": "let result colon pascal case result open angle pascal case string comma pascal case error close angle", "output": "let result: Result<String, Error>"},
+
+ {"id": 529, "cat": "mixed", "dictated": "const camel case app equals express open paren close paren", "output": "const app = express()"},
+ {"id": 530, "cat": "mixed", "dictated": "import open brace camel case use state close brace from single quote react single quote", "output": "import { useState } from 'react'"},
+ {"id": 531, "cat": "mixed", "dictated": "import open brace camel case use effect close brace from single quote react single quote", "output": "import { useEffect } from 'react'"},
+ {"id": 532, "cat": "mixed", "dictated": "export default function pascal case app open paren close paren", "output": "export default function App()"},
+ {"id": 533, "cat": "mixed", "dictated": "const camel case fetch data equals async open paren close paren fat arrow", "output": "const fetchData = async () =>"},
+ {"id": 534, "cat": "mixed", "dictated": "document dot camel case query selector open paren single quote dot container single quote close paren", "output": "document.querySelector('.container')"},
+ {"id": 535, "cat": "mixed", "dictated": "process dot exit open paren one close paren", "output": "process.exit(1)"},
+ {"id": 536, "cat": "mixed", "dictated": "require open paren single quote path single quote close paren", "output": "require('path')"},
+ {"id": 537, "cat": "mixed", "dictated": "module dot exports equals open brace close brace", "output": "module.exports = {}"},
+ {"id": 538, "cat": "mixed", "dictated": "console dot error open paren quote failed quote close paren", "output": "console.error(\"failed\")"},
+ {"id": 539, "cat": "mixed", "dictated": "typeof X equals equals equals single quote string single quote", "output": "typeof x === 'string'"},
+ {"id": 540, "cat": "mixed", "dictated": "fetch open paren quote slash API slash data quote close paren", "output": "fetch(\"/api/data\")"},
+
+ {"id": 541, "cat": "mixed", "dictated": "LS dash L A pipe grep dot swift", "output": "ls -la | grep .swift"},
+ {"id": 542, "cat": "mixed", "dictated": "PS aux pipe grep node", "output": "ps aux | grep node"},
+ {"id": 543, "cat": "mixed", "dictated": "echo quote hello world quote pipe P B copy", "output": "echo \"hello world\" | pbcopy"},
+ {"id": 544, "cat": "mixed", "dictated": "WC dash L star dot swift", "output": "wc -l *.swift"},
+ {"id": 545, "cat": "mixed", "dictated": "chmod plus X install dot S H", "output": "chmod +x install.sh"},
+ {"id": 546, "cat": "mixed", "dictated": "tar dash X Z F archive dot tar dot G Z", "output": "tar -xzf archive.tar.gz"},
+ {"id": 547, "cat": "mixed", "dictated": "kill dash nine dollar open paren P grep node close paren", "output": "kill -9 $(pgrep node)"},
+ {"id": 548, "cat": "mixed", "dictated": "sed single quote S slash old slash new slash G single quote file dot T X T", "output": "sed 's/old/new/g' file.txt"},
+ {"id": 549, "cat": "mixed", "dictated": "sort dash U pipe head dash ten", "output": "sort -u | head -10"},
+ {"id": 550, "cat": "mixed", "dictated": "A W K single quote open brace print dollar one close brace single quote", "output": "awk '{print $1}'"},
+ {"id": 551, "cat": "mixed", "dictated": "alias L L equals single quote LS dash L A single quote", "output": "alias ll='ls -la'"},
+ {"id": 552, "cat": "mixed", "dictated": "for file in star dot log semicolon do RM dollar file semicolon done", "output": "for file in *.log; do rm $file; done"},
+ {"id": 553, "cat": "mixed", "dictated": "if open bracket dash F config dot yaml close bracket semicolon then source config dot yaml semicolon fi", "output": "if [ -f config.yaml ]; then source config.yaml; fi"},
+ {"id": 554, "cat": "mixed", "dictated": "test dash D dot venv and and source dot venv slash bin slash activate", "output": "test -d .venv && source .venv/bin/activate"},
+ {"id": 555, "cat": "mixed", "dictated": "dollar open paren date plus percent S close paren", "output": "$(date +%s)"},
+ {"id": 556, "cat": "mixed", "dictated": "dollar open paren PWD close paren", "output": "$(pwd)"},
+ {"id": 557, "cat": "mixed", "dictated": "dollar open brace one colon dash default close brace", "output": "${1:-default}"},
+ {"id": 558, "cat": "mixed", "dictated": "dollar open brace all caps HOME colon dash slash tmp close brace", "output": "${HOME:-/tmp}"},
+ {"id": 559, "cat": "mixed", "dictated": "pipe X args dash N one echo", "output": "| xargs -n1 echo"},
+ {"id": 560, "cat": "mixed", "dictated": "pipe tee output dot log", "output": "| tee output.log"},
+
+ {"id": 561, "cat": "mixed", "dictated": "NPM install dash capital D typescript", "output": "npm install -D typescript"},
+ {"id": 562, "cat": "mixed", "dictated": "NPM run build and and NPM run test", "output": "npm run build && npm run test"},
+ {"id": 563, "cat": "mixed", "dictated": "pip install dash R requirements dot T X T", "output": "pip install -r requirements.txt"},
+ {"id": 564, "cat": "mixed", "dictated": "pip install dash E dot", "output": "pip install -e ."},
+ {"id": 565, "cat": "mixed", "dictated": "cargo build dash dash release", "output": "cargo build --release"},
+ {"id": 566, "cat": "mixed", "dictated": "P N P M add dash capital D eslint", "output": "pnpm add -D eslint"},
+ {"id": 567, "cat": "mixed", "dictated": "swift build dash C release", "output": "swift build -c release"},
+
+ {"id": 568, "cat": "mixed", "dictated": "docker run dash D dash P eighty eighty colon eighty eighty nginx", "output": "docker run -d -p 8080:8080 nginx"},
+ {"id": 569, "cat": "mixed", "dictated": "docker build dash T my dash app colon latest dot", "output": "docker build -t my-app:latest ."},
+ {"id": 570, "cat": "mixed", "dictated": "kubectl get pods dash N production", "output": "kubectl get pods -n production"},
+ {"id": 571, "cat": "mixed", "dictated": "X code build dash scheme talkie dash destination quote platform equals mac OS quote", "output": "xcodebuild -scheme Talkie -destination \"platform=macOS\""},
+ {"id": 572, "cat": "mixed", "dictated": "brew install dash dash cask visual dash studio dash code", "output": "brew install --cask visual-studio-code"},
+ {"id": 573, "cat": "mixed", "dictated": "open dash A pascal case safari HTTPS colon slash slash example dot com", "output": "open -a Safari https://example.com"},
+ {"id": 574, "cat": "mixed", "dictated": "defaults write com dot apple dot dock camel case auto hide dash bool true", "output": "defaults write com.apple.dock autohide -bool true"},
+ {"id": 575, "cat": "mixed", "dictated": "launch CTL load tilde slash library slash launch agents slash com dot JDI dot talkie dot plist", "output": "launchctl load ~/Library/LaunchAgents/com.jdi.talkie.plist"},
+ {"id": 576, "cat": "mixed", "dictated": "MK dir dash P SRC slash lib and and touch SRC slash lib slash utils dot T S", "output": "mkdir -p src/lib && touch src/lib/utils.ts"},
+ {"id": 577, "cat": "mixed", "dictated": "RM dash R F node underscore modules and and RM package dash lock dot JSON", "output": "rm -rf node_modules && rm package-lock.json"},
+
+ {"id": 578, "cat": "mixed", "dictated": "GH issue list dash dash label bug dash dash state open", "output": "gh issue list --label bug --state open"},
+ {"id": 579, "cat": "mixed", "dictated": "GH PR view dash dash web", "output": "gh pr view --web"},
+ {"id": 580, "cat": "mixed", "dictated": "GH release create V one dot zero dot zero dash dash notes quote first release quote", "output": "gh release create v1.0.0 --notes \"first release\""},
+ {"id": 581, "cat": "mixed", "dictated": "GH API repos slash arach slash talkie slash pulls", "output": "gh api repos/arach/talkie/pulls"},
+ {"id": 582, "cat": "mixed", "dictated": "python three dash M pip install dash dash upgrade pip", "output": "python3 -m pip install --upgrade pip"},
+ {"id": 583, "cat": "mixed", "dictated": "python three dash C quote import sys semicolon print open paren sys dot version close paren quote", "output": "python3 -c \"import sys; print(sys.version)\""},
+ {"id": 584, "cat": "mixed", "dictated": "J Q single quote dot data open bracket close bracket single quote response dot JSON", "output": "jq '.data[]' response.json"},
+ {"id": 585, "cat": "mixed", "dictated": "J Q single quote dot name single quote package dot JSON", "output": "jq '.name' package.json"},
+ {"id": 586, "cat": "mixed", "dictated": "talkie dash dev dev rebuild agent", "output": "talkie-dev dev rebuild agent"},
+ {"id": 587, "cat": "mixed", "dictated": "talkie dash dev dev logs agent dash dash since ten M", "output": "talkie-dev dev logs agent --since 10m"},
+
+ {"id": 588, "cat": "mixed", "dictated": "at available open paren mac OS thirteen comma star close paren", "output": "@available(macOS 13, *)"},
+ {"id": 589, "cat": "mixed", "dictated": "set dash E and and set dash O camel case pipe fail", "output": "set -e && set -o pipefail"},
+ {"id": 590, "cat": "mixed", "dictated": "open bracket dash Z dollar all caps API underscore KEY close bracket and and echo quote missing key quote", "output": "[ -z $API_KEY ] && echo \"missing key\""},
+ {"id": 591, "cat": "mixed", "dictated": "N C dash Z localhost eighty eighty and and echo quote up quote", "output": "nc -z localhost 8080 && echo \"up\""},
+ {"id": 592, "cat": "mixed", "dictated": "node dash dash max dash old dash space dash size equals four zero nine six", "output": "node --max-old-space-size=4096"},
+ {"id": 593, "cat": "mixed", "dictated": "source dot venv slash bin slash activate", "output": "source .venv/bin/activate"},
+
+ {"id": 594, "cat": "paths", "dictated": "dot slash SRC", "output": "./src"},
+ {"id": 595, "cat": "paths", "dictated": "dot dot slash SRC", "output": "../src"},
+ {"id": 596, "cat": "paths", "dictated": "dot dot slash utils", "output": "../utils"},
+ {"id": 597, "cat": "paths", "dictated": "dot dot slash models", "output": "../models"},
+ {"id": 598, "cat": "paths", "dictated": "dot dot slash services", "output": "../services"},
+ {"id": 599, "cat": "symbols-compound", "dictated": "dot dot", "output": ".."},
+ {"id": 600, "cat": "mixed", "dictated": "CD dot dot", "output": "cd .."},
+ {"id": 601, "cat": "paths", "dictated": "dot dot slash packages", "output": "../packages"},
+ {"id": 602, "cat": "paths", "dictated": "dot dot slash scripts", "output": "../scripts"},
+ {"id": 603, "cat": "paths", "dictated": "dot dot slash tests", "output": "../tests"},
+ {"id": 604, "cat": "paths", "dictated": "dot dot slash index dot T S", "output": "../index.ts"},
+
+ {"id": 605, "cat": "paths", "dictated": "dot dot slash dot dot slash assets", "output": "../../assets"},
+ {"id": 606, "cat": "paths", "dictated": "dot dot slash dot dot slash dot dot slash config", "output": "../../../config"},
+ {"id": 607, "cat": "paths", "dictated": "dot dot slash dot dot slash dot dot slash utils", "output": "../../../utils"},
+ {"id": 608, "cat": "paths", "dictated": "dot dot slash dot dot slash dot dot slash dot dot slash", "output": "../../../../"},
+ {"id": 609, "cat": "paths", "dictated": "dot dot slash dot dot slash dot dot slash index dot T S", "output": "../../../index.ts"},
+ {"id": 610, "cat": "paths", "dictated": "dot dot slash dot dot slash node underscore modules", "output": "../../node_modules"},
+
+ {"id": 611, "cat": "mixed", "dictated": "docker run dash P three thousand colon three thousand redis", "output": "docker run -p 3000:3000 redis"},
+ {"id": 612, "cat": "mixed", "dictated": "docker run dash P five four three two colon five four three two postgres", "output": "docker run -p 5432:5432 postgres"},
+ {"id": 613, "cat": "mixed", "dictated": "docker run dash D dash P nine zero nine zero colon nine zero nine zero grafana", "output": "docker run -d -p 9090:9090 grafana"},
+ {"id": 614, "cat": "mixed", "dictated": "docker run dash D dash P four four three colon four four three nginx colon alpine", "output": "docker run -d -p 443:443 nginx:alpine"},
+ {"id": 615, "cat": "mixed", "dictated": "SSH dash P two two two two user at server", "output": "ssh -p 2222 user@server"},
+ {"id": 616, "cat": "mixed", "dictated": "python three dash M HTTP dot server eighty eighty", "output": "python3 -m http.server 8080"}
+ ]
+ }