arach committed
Commit d4490f7 · 1 parent: 04558eb

πŸ“ add Part 3 outline: the 40-millisecond classifier gate


Trained embedding classifier (logistic regression on NLEmbedding)
decides whether dictation needs LLM normalization. 100% accuracy
on held-out data, trained in 40ms on 120 examples using BLAS.

Covers: hand-crafted vs trained classifier, Accelerate/BLAS speedup,
fan-out architecture, and the routing gate concept.

Files changed (1): blog/part3-classifier-gate-outline.md (new file, +172 lines)
# Part 3: The 40-Millisecond Gate

> A trained embedding classifier decides whether to call the LLM — 100% accuracy on held-out data, trained in 40ms on 120 examples.

<!-- METADATA
slug: the-40-millisecond-gate
series: teaching-a-tiny-model-to-hear-bash
part: 3
date: TBD
tags: nlembedding, classifier, on-device-ml, voice, apple, accelerate
author: Arach
-->

## Series context

- **Part 1** — Fine-tuned a 1.5B model to reconstruct bash from dictation. 97% accuracy, 3GB RAM, 0.7s inference.
- **Part 2** — Discovered the split architecture: a deterministic processor handles symbols/digits, the LLM handles language understanding (filler stripping, corrections, normalization).
- **Part 3** (this post) — The routing decision: does this input even need the LLM? A classifier trained in 40ms answers with 100% accuracy.

## Opening hook

Part 2 ended with a bypass rule: if the input contains "space" keywords and no conversational filler, skip the LLM entirely. But that rule was hand-crafted. It worked for clean protocol input but missed edge cases.

The question: can we learn the routing decision instead of hand-coding it?

## Beat 1: The hand-crafted classifier

We started with `NeedsLLMClassifier` — a rule-based system that scores inputs on:
- Protocol vocabulary density ("space", "dash", "colon", etc.)
- Conversational markers ("okay", "um", "like", "actually")
- Structural patterns (corrections, false starts, hedging)

It's fast (< 0.01ms per classification). On our 40-case eval set spanning four difficulty levels, it hits 95%. Good, but it took significant iteration to build, and every edge case is another rule.

**The question behind the question:** Can we replace human pattern-matching with a trained model that's just as fast but doesn't require hand-tuning?

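A rule-based gate of this shape can be sketched in a few lines of Swift. The keyword lists and the 0.25 threshold below are illustrative only, not the actual `NeedsLLMClassifier` internals:

```swift
// Hypothetical sketch of a rule-based needs-LLM gate.
// Keyword lists and the 0.25 density threshold are illustrative.
struct HandCraftedGate {
    let protocolWords: Set<String> = ["space", "dash", "colon", "slash", "dot"]
    let fillerWords: Set<String> = ["okay", "um", "like", "actually", "wait"]

    func needsLLM(_ transcript: String) -> Bool {
        let tokens = transcript.lowercased().split(separator: " ").map(String.init)
        guard !tokens.isEmpty else { return false }
        // Protocol vocabulary density: fraction of tokens that are formatting keywords.
        let density = Double(tokens.filter(protocolWords.contains).count) / Double(tokens.count)
        // Any conversational marker means the LLM should clean this up.
        let hasFiller = tokens.contains(where: fillerWords.contains)
        return hasFiller || density < 0.25
    }
}
```

Every missed edge case becomes another keyword list entry or threshold tweak — exactly the maintenance burden a trained head removes.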
## Beat 2: The embedding insight

Apple ships `NLEmbedding` as a system framework. It's a 512-dimensional word embedding model, already on every Mac and iPhone. No download. No setup. One API call gives you a feature vector.

The key property: word-averaged embeddings of protocol-heavy input ("git space push space dash u space origin space main") land in a completely different region of embedding space than conversational input ("okay so like the command is git push"). The words "space", "dash", "colon" cluster differently from "um", "actually", "wait".

**The bet:** If the embedding already separates these two classes, a simple linear classifier on top should work. No deep learning. No fine-tuning. Just logistic regression.

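The one-API-call claim looks like this in practice. This sketch uses the real `NLEmbedding` API from Apple's NaturalLanguage framework; the word-averaging loop is our own addition, since the classifier works on word-averaged sentence vectors:

```swift
import NaturalLanguage

// Average the per-word vectors from Apple's built-in English word embedding.
// Returns nil if the embedding is unavailable or no word is in-vocabulary.
func sentenceVector(for text: String) -> [Double]? {
    guard let embedding = NLEmbedding.wordEmbedding(for: .english) else { return nil }
    var sum = [Double](repeating: 0, count: embedding.dimension)
    var count = 0
    for word in text.lowercased().split(separator: " ") {
        guard let vector = embedding.vector(for: String(word)) else { continue }
        for i in 0..<vector.count { sum[i] += vector[i] }
        count += 1
    }
    guard count > 0 else { return nil }
    return sum.map { $0 / Double(count) }  // word-averaged sentence vector
}
```

Out-of-vocabulary words are simply skipped here; whether to skip or zero-fill them is a design choice the sketch doesn't settle.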
## Beat 3: The training data

We already had the eval dataset from Part 2: 200 dictation examples across four difficulty levels.

| Difficulty | Description | needsLLM |
|---|---|---|
| clean | Protocol-formatted, "space"/"dash" keywords | false |
| fuzzy | Synonym substitutions ("minus", "period", "forward slash") | true |
| natural | Conversational wrapping ("okay so the command is...") | true |
| chaotic | Self-corrections, false starts, mid-sentence changes | true |

Split: 120 for training, 40 for testing, 40 held out.

The label is binary: `clean` maps to "doesn't need LLM" (the deterministic processor handles it). Everything else maps to "needs LLM."

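The labeling rule collapses the four difficulty levels into a binary target. A sketch with hypothetical type names (the real dataset lives in `eval-fuzzy.json`):

```swift
// Hypothetical types; only the clean tier skips the LLM.
enum Difficulty: String { case clean, fuzzy, natural, chaotic }

struct Example {
    let transcript: String
    let difficulty: Difficulty
    // Binary routing label derived from the difficulty tier.
    var needsLLM: Bool { difficulty != .clean }
}
```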
## Beat 4: The implementation

Logistic regression on 512-dimensional embeddings. The entire classifier is:
- A weight vector (512 doubles)
- A bias term (1 double)
- A sigmoid activation

Training: batch gradient descent with L2 regularization. Standardize features internally, un-transform weights at the end so the deployed head works on raw embeddings. No data pipeline. No framework. 80 lines of Swift using Apple's Accelerate (BLAS) for the matrix math.

Hyperparameters that mattered:
- **Learning rate 0.1** (not 0.01 — standardized features converge fast)
- **Lambda 0.01** (not 1.0 — light regularization on 512 dims, heavy regularization starves the model)
- First attempt with lr=0.01 and lambda=1.0: 88% accuracy, 2134ms training
- After fix: 100% accuracy, 40ms training

The 50x speedup came from BLAS. The 12-point accuracy jump came from letting the model actually fit the data instead of regularizing it to death.

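The training loop described above fits in a screenful of Swift. A compressed sketch, not the actual implementation: it omits the standardize/un-transform step for brevity, the names are illustrative, and the BLAS calls (`cblas_ddot`, `cblas_daxpy`) stand in for whatever Accelerate routines the real code uses:

```swift
import Foundation
import Accelerate

// Logistic-regression head: 512 weights + 1 bias + sigmoid.
struct LogisticHead {
    var weights: [Double]
    var bias: Double

    func probability(_ x: [Double]) -> Double {
        let z = cblas_ddot(Int32(x.count), weights, 1, x, 1) + bias
        return 1.0 / (1.0 + exp(-z))
    }
}

// Batch gradient descent with L2 regularization on the weights.
func train(xs: [[Double]], ys: [Double], lr: Double = 0.1,
           lambda: Double = 0.01, epochs: Int = 500) -> LogisticHead {
    let dim = xs[0].count
    let n = Double(xs.count)
    var head = LogisticHead(weights: .init(repeating: 0, count: dim), bias: 0)
    for _ in 0..<epochs {
        var gradW = [Double](repeating: 0, count: dim)
        var gradB = 0.0
        for (x, y) in zip(xs, ys) {
            let err = head.probability(x) - y              // p - y
            cblas_daxpy(Int32(dim), err, x, 1, &gradW, 1)  // gradW += err * x
            gradB += err
        }
        // L2 penalty on the weights only, not the bias.
        cblas_daxpy(Int32(dim), lambda, head.weights, 1, &gradW, 1)
        cblas_daxpy(Int32(dim), -lr / n, gradW, 1, &head.weights, 1)
        head.bias -= lr * gradB / n
    }
    return head
}
```

The BLAS calls are where the 50x comes from: the per-example work is two 512-length vector operations, which Accelerate vectorizes instead of looping scalar-by-scalar.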
## Beat 5: The results

```
ACCURACY (vs ground truth labels)
Hand-Crafted: 95.0% (38/40)
Trained Head: 100.0% (40/40)

TRAINING
Cases: 120
Time: 40ms

LATENCY
Embedding: 0.05ms median
Classification: 0.00ms median
```

The trained head beats the hand-crafted classifier on every metric:
- Higher accuracy (100% vs 95%)
- Trained in 40ms (vs hours of manual rule iteration)
- Same inference speed (< 0.1ms total)

Per-difficulty breakdown — where the hand-crafted classifier fails and the trained head doesn't.

## Beat 6: The fan-out insight

The embedding is the expensive part (~0.05ms). The classifier head is essentially free (~0.001ms). This means you can run N different classifier heads on the same embedding for almost no extra cost.

```
Embed once: 0.05ms
1 head:     0.001ms
4 heads:    0.004ms
```

One embedding, multiple decisions:
- Does this need an LLM?
- Is this a command, a variable name, or prose?
- Which domain? (bash, SQL, regex, URL)
- What's the confidence level?

The shared backbone pattern: compute the embedding once, fan out to cheap task-specific heads. Each head is 512 weights + 1 bias, trained in milliseconds.

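Because each head is a single dot product plus a sigmoid, the fan-out is literally an array of heads applied to one shared vector. A self-contained sketch with illustrative names:

```swift
import Foundation

// One cheap linear head: weights + bias + sigmoid. Illustrative names.
struct Head {
    let name: String
    let weights: [Double]
    let bias: Double

    func score(_ embedding: [Double]) -> Double {
        let z = zip(weights, embedding).reduce(bias) { $0 + $1.0 * $1.1 }
        return 1.0 / (1.0 + exp(-z))
    }
}

// Embed once (the expensive ~0.05ms step), then fan out to N heads,
// each costing one ~0.001ms dot product.
func fanOut(embedding: [Double], heads: [Head]) -> [String: Double] {
    var decisions: [String: Double] = [:]
    for head in heads { decisions[head.name] = head.score(embedding) }
    return decisions
}
```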
## Beat 7: What this means architecturally

```
Raw transcription
      |
      v
[ NLEmbedding ]             ← 0.05ms, system framework, no download
      |
      +--> [ needsLLM? ]    ← 0.001ms, trained head
      +--> [ domain? ]      ← 0.001ms, trained head (future)
      +--> [ confidence? ]  ← 0.001ms, trained head (future)
      |
      v
Route to:
  - Deterministic processor (clean protocol input)
  - On-device LLM (fuzzy/natural, needs normalization)
  - Cloud LLM (chaotic, high ambiguity)
```

The classifier gate sits between transcription and processing. It costs essentially nothing. It routes inputs to the cheapest processor that can handle them correctly.

For Talkie's keyboard dictation, this means:
- 25% of inputs (clean protocol) get instant results — no LLM, no latency
- 75% of inputs go through the LLM normalizer from Part 2
- The user never notices the routing. They just see fast, correct output.

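The three-way routing at the bottom of the diagram reduces to a tiny switch. The thresholds and the confidence input are hypothetical (the confidence head is marked "future" above):

```swift
enum Route { case deterministic, onDeviceLLM, cloudLLM }

// Hypothetical routing: the needsLLM head gates the deterministic path;
// a future confidence head would escalate ambiguous input to the cloud.
func route(needsLLMScore: Double, confidence: Double) -> Route {
    if needsLLMScore < 0.5 { return .deterministic }  // clean protocol input
    if confidence >= 0.5 { return .onDeviceLLM }      // fuzzy/natural
    return .cloudLLM                                  // chaotic, high ambiguity
}
```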
## Closing: The meta-lesson

Three posts. Three layers of the same insight.

**Part 1:** Don't use a big model when a small one works. (1.5B vs GPT-4)
**Part 2:** Don't use a model when code works. (Processor vs fine-tuned LLM)
**Part 3:** Don't use a model to decide whether to use a model — unless training it takes 40ms. Then do.

The whole pipeline costs less than a single GPT-4 API call. It runs offline. It fits on a phone. And the most expensive operation in the entire stack is a 0.05ms embedding lookup that Apple ships for free.

## Appendix notes

### Code references
- `ClassifierPipelineBenchmark.swift` — benchmark runner, training, eval
- `NeedsLLMClassifier.swift` — hand-crafted classifier (the baseline)
- `eval-fuzzy.json` — 200 labeled examples across 4 difficulties

### Numbers to verify on-device before publishing
- Exact HC accuracy % (currently 95% on 40 cases)
- Exact trained accuracy % (currently 100% on 40 cases)
- Training time range across multiple runs
- Per-difficulty breakdown
- Fan-out latency at N=4, N=8, N=16

### Illustration ideas
- Hero: a fork in the road — one path labeled "LLM" (longer, scenic), one labeled "processor" (short, direct). A tiny gate at the fork.
- Training visualization: 120 dots in 2D (PCA of embeddings), colored by class, with the decision boundary drawn through them.
- Speed comparison: a race track showing 40ms training vs hours of hand-coding rules.
- Fan-out diagram: one embedding node at top, multiple classifier heads branching below, each labeled with a different question.