# Part 3: The 40-Millisecond Gate

> A trained embedding classifier decides whether to call the LLM: 100% accuracy on held-out data, trained in 40ms on 120 examples.

<!-- METADATA
slug: the-40-millisecond-gate
series: teaching-a-tiny-model-to-hear-bash
part: 3
date: TBD
tags: nlembedding, classifier, on-device-ml, voice, apple, accelerate
author: Arach
-->

## Series context

- **Part 1:** Fine-tuned a 1.5B model to reconstruct bash from dictation. 97% accuracy, 3GB RAM, 0.7s inference.
- **Part 2:** Discovered the split architecture: a deterministic processor handles symbols and digits; the LLM handles language understanding (filler stripping, corrections, normalization).
- **Part 3** (this post): The routing decision: does this input even need the LLM? A classifier trained in 40ms answers with 100% accuracy.

## Opening hook

Part 2 ended with a bypass rule: if the input contains "space" keywords and no conversational filler, skip the LLM entirely. But that rule was hand-crafted. It worked for clean protocol input but missed edge cases.

The question: can we learn the routing decision instead of hand-coding it?

## Beat 1: The hand-crafted classifier

We started with `NeedsLLMClassifier`, a rule-based system that scores inputs on:
- Protocol vocabulary density ("space", "dash", "colon", etc.)
- Conversational markers ("okay", "um", "like", "actually")
- Structural patterns (corrections, false starts, hedging)

It's fast (< 0.01ms per classification). On our 40-case eval set spanning four difficulty levels, it hits 95%. Good, but it took significant iteration to build, and every edge case is another rule.

**The question behind the question:** Can we replace human pattern-matching with a trained model that's just as fast but doesn't require hand-tuning?

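To make the rule-based shape concrete, here is a minimal sketch of this kind of gate in Python. The word lists and the 0.3 density threshold are illustrative assumptions; the real `NeedsLLMClassifier` is Swift and scores more features than this.

```python
# Hypothetical rule-based gate: protocol-word density vs filler markers.
# Vocabulary and threshold are illustrative, not the shipped rules.
PROTOCOL = {"space", "dash", "underscore", "colon", "slash", "dot", "equals"}
FILLER = {"okay", "um", "uh", "like", "actually", "so", "wait"}

def needs_llm(transcript: str) -> bool:
    words = transcript.lower().split()
    protocol_hits = sum(w in PROTOCOL for w in words)
    filler_hits = sum(w in FILLER for w in words)
    # Dense protocol vocabulary and zero conversational markers -> bypass the LLM.
    if filler_hits == 0 and protocol_hits / max(len(words), 1) > 0.3:
        return False
    return True
```

Every edge case ("forward slash", "minus", a stray "so" in otherwise clean input) means touching these lists by hand, which is exactly the maintenance cost the trained head removes.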
## Beat 2: The embedding insight

Apple ships `NLEmbedding` as a system framework. It's a 512-dimensional word embedding model, already on every Mac and iPhone. No download. No setup. One API call gives you a feature vector.

The key property: word-averaged embeddings of protocol-heavy input ("git space push space dash u space origin space main") land in a completely different region of embedding space than conversational input ("okay so like the command is git push"). The words "space", "dash", "colon" cluster differently from "um", "actually", "wait".

**The bet:** If the embedding already separates these two classes, a simple linear classifier on top should work. No deep learning. No fine-tuning. Just logistic regression.

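The "word-averaged embedding" step is just mean pooling. A toy Python sketch, with made-up 3-dimensional vectors standing in for NLEmbedding's 512-dimensional ones, shows how the two vocabularies pull the pooled vector toward different regions:

```python
# Toy mean pooling over word vectors. TOY_VECS is fabricated for illustration;
# on-device, NLEmbedding supplies real 512-dim vectors per word.
TOY_VECS = {
    "space": [1.0, 0.0, 0.0], "dash": [0.9, 0.1, 0.0],
    "um":    [0.0, 1.0, 0.0], "okay": [0.1, 0.9, 0.0],
}

def embed(text: str, vecs=TOY_VECS, dim=3):
    words = [w for w in text.lower().split() if w in vecs]
    if not words:
        return [0.0] * dim
    # Average the word vectors into one fixed-size feature vector.
    return [sum(vecs[w][i] for w in words) / len(words) for i in range(dim)]

protocol_vec = embed("git space push dash")
chatty_vec = embed("um okay git")
```

With real embeddings the separation is softer than this toy, but the linear-separability bet is the same.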
## Beat 3: The training data

We already had the eval dataset from Part 2: 200 dictation examples across four difficulty levels.

| Difficulty | Description | needsLLM |
|---|---|---|
| clean | Protocol-formatted, "space"/"dash" keywords | false |
| fuzzy | Synonym substitutions ("minus", "period", "forward slash") | true |
| natural | Conversational wrapping ("okay so the command is...") | true |
| chaotic | Self-corrections, false starts, mid-sentence changes | true |

Split: 120 for training, 40 for testing, 40 held out.

The label is binary: `clean` maps to "doesn't need LLM" (the deterministic processor handles it). Everything else maps to "needs LLM."

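The label derivation is one line. A small sketch, assuming an even 50 examples per difficulty (an assumption for illustration; the real distribution lives in `eval-fuzzy.json`):

```python
# Illustrative label mapping and 120/40/40 split.
# 50 examples per difficulty is assumed, not taken from the real dataset.
DIFFICULTIES = ["clean", "fuzzy", "natural", "chaotic"]
examples = [{"difficulty": d} for d in DIFFICULTIES for _ in range(50)]

for ex in examples:
    # Binary label: only clean protocol input skips the LLM.
    ex["needsLLM"] = ex["difficulty"] != "clean"

train_set, test_set, holdout = examples[:120], examples[120:160], examples[160:]
```

(A real split would shuffle and stratify across difficulties so each slice sees all four classes; the plain slicing here is just to show the 120/40/40 counts.)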
## Beat 4: The implementation

Logistic regression on 512-dimensional embeddings. The entire classifier is:
- A weight vector (512 doubles)
- A bias term (1 double)
- A sigmoid activation

Training: batch gradient descent with L2 regularization. Standardize features internally, un-transform weights at the end so the deployed head works on raw embeddings. No data pipeline. No framework. 80 lines of Swift using Apple's Accelerate (BLAS) for the matrix math.

Hyperparameters that mattered:
- **Learning rate 0.1** (not 0.01: standardized features converge fast)
- **Lambda 0.01** (not 1.0: light regularization on 512 dims; heavy regularization starves the model)
- First attempt, with lr=0.01 and lambda=1.0: 88% accuracy, 2134ms training
- After the fix: 100% accuracy, 40ms training

The 50x speedup came from BLAS. The 12-point accuracy jump came from letting the model actually fit the data instead of regularizing it to death.

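The training loop described above fits in a page. Here is a minimal Python sketch of the same math (the post's implementation is Swift with Accelerate/BLAS); synthetic 2-dim points stand in for 512-dim embeddings, and the function names are mine, not the post's:

```python
import math

def train_logreg(X, y, lr=0.1, lam=0.01, epochs=300):
    """Batch GD with L2 on standardized features; weights un-transformed at the end."""
    n, d = len(X), len(X[0])
    mu = [sum(row[j] for row in X) / n for j in range(d)]
    sd = [max(1e-9, math.sqrt(sum((row[j] - mu[j]) ** 2 for row in X) / n))
          for j in range(d)]
    Z = [[(row[j] - mu[j]) / sd[j] for j in range(d)] for row in X]
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * d, 0.0
        for zi, yi in zip(Z, y):
            p = 1 / (1 + math.exp(-(sum(wj * zj for wj, zj in zip(w, zi)) + b)))
            err = p - yi                      # dLoss/dlogit for logistic loss
            for j in range(d):
                gw[j] += err * zi[j]
            gb += err
        for j in range(d):
            w[j] -= lr * (gw[j] / n + lam * w[j])   # L2 only on weights
        b -= lr * gb / n
    # Fold standardization into the weights so the deployed head
    # takes raw embeddings: w.(x - mu)/sd + b = (w/sd).x + (b - w.mu/sd)
    w_raw = [w[j] / sd[j] for j in range(d)]
    b_raw = b - sum(w[j] * mu[j] / sd[j] for j in range(d))
    return w_raw, b_raw

def predict(w, b, x):
    return 1 / (1 + math.exp(-(sum(wj * xj for wj, xj in zip(w, x)) + b))) > 0.5

X = [[0.9, 0.1], [0.8, 0.2], [1.0, 0.0], [0.1, 0.9], [0.2, 0.8], [0.0, 1.0]]
y = [0, 0, 0, 1, 1, 1]
w, b = train_logreg(X, y)
```

The inner loops are exactly what BLAS replaces in the Swift version: the per-example dot products become one matrix-vector multiply, which is where the roughly 50x training speedup comes from.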
## Beat 5: The results

```
ACCURACY (vs ground truth labels)
  Hand-Crafted:  95.0% (38/40)
  Trained Head: 100.0% (40/40)

TRAINING
  Cases: 120
  Time:  40ms

LATENCY
  Embedding:      0.05ms median
  Classification: 0.00ms median
```

The trained head beats the hand-crafted classifier on every metric:
- Higher accuracy (100% vs 95%)
- Trained in 40ms (vs hours of manual rule iteration)
- Same inference speed (< 0.1ms total)

Per-difficulty breakdown: where the hand-crafted classifier fails and the trained head doesn't.

## Beat 6: The fan-out insight

The embedding is the expensive part (~0.05ms). The classifier head is essentially free (~0.001ms). This means you can run N different classifier heads on the same embedding for almost no extra cost.

```
Embed once: 0.05ms
1 head:     0.001ms
4 heads:    0.004ms
```

One embedding, multiple decisions:
- Does this need an LLM?
- Is this a command, a variable name, or prose?
- Which domain? (bash, SQL, regex, URL)
- What's the confidence level?

The shared backbone pattern: compute the embedding once, fan out to cheap task-specific heads. Each head is 512 weights + 1 bias, trained in milliseconds.

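The fan-out is a dictionary of (weights, bias) pairs applied to one shared vector. A hedged Python sketch (head names and weight values are illustrative, not the trained parameters; 2-dim vectors stand in for 512):

```python
import math

def sigmoid(z: float) -> float:
    return 1 / (1 + math.exp(-z))

def run_heads(embedding, heads):
    """Score every cheap linear head against the same shared embedding."""
    return {
        name: sigmoid(sum(w * x for w, x in zip(weights, embedding)) + bias)
        for name, (weights, bias) in heads.items()
    }

# Illustrative heads: each is just a weight vector plus a bias.
heads = {
    "needsLLM":  ([2.0, -2.0], 0.0),
    "isCommand": ([1.5,  0.5], -1.0),
}
scores = run_heads([0.8, 0.1], heads)   # embed once, score N heads
```

Adding a head is adding one dictionary entry: no new embedding pass, no new model file, just 512 more weights and a bias.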
## Beat 7: What this means architecturally

```
Raw transcription
        |
        v
[ NLEmbedding ]                    0.05ms, system framework, no download
        |
        +--> [ needsLLM? ]         0.001ms, trained head
        +--> [ domain? ]           0.001ms, trained head (future)
        +--> [ confidence? ]       0.001ms, trained head (future)
        |
        v
Route to:
  - Deterministic processor (clean protocol input)
  - On-device LLM (fuzzy/natural, needs normalization)
  - Cloud LLM (chaotic, high ambiguity)
```

The classifier gate sits between transcription and processing. It costs essentially nothing. It routes inputs to the cheapest processor that can handle them correctly.

For Talkie's keyboard dictation, this means:
- 25% of inputs (clean protocol) get instant results: no LLM, no latency
- 75% of inputs go through the LLM normalizer from Part 2
- The user never notices the routing. They just see fast, correct output.

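The routing step itself reduces to a couple of threshold checks on head outputs. A sketch, with the 0.5 thresholds assumed for illustration (the confidence head is marked "future" in the diagram, so the cloud tier here is speculative):

```python
# Illustrative routing gate over classifier-head probabilities.
# Thresholds and the cloud tier are assumptions, not shipped behavior.
def route(needs_llm_prob: float, confidence: float) -> str:
    if needs_llm_prob < 0.5:
        return "deterministic-processor"   # clean protocol input: instant path
    if confidence >= 0.5:
        return "on-device-llm"             # fuzzy/natural: needs normalization
    return "cloud-llm"                     # chaotic, high-ambiguity input
```

The cheapest-capable-processor policy lives entirely in this function; swapping thresholds or adding tiers never touches the embedding or the heads.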
## Closing: The meta-lesson

Three posts. Three layers of the same insight.

**Part 1:** Don't use a big model when a small one works. (1.5B vs GPT-4)
**Part 2:** Don't use a model when code works. (Processor vs fine-tuned LLM)
**Part 3:** Don't use a model to decide whether to use a model, unless training it takes 40ms. Then do.

The whole pipeline costs less than a single GPT-4 API call. It runs offline. It fits on a phone. And the most expensive operation in the entire stack is a 0.05ms embedding lookup that Apple ships for free.

## Appendix notes

### Code references
- `ClassifierPipelineBenchmark.swift`: benchmark runner, training, eval
- `NeedsLLMClassifier.swift`: hand-crafted classifier (the baseline)
- `eval-fuzzy.json`: 200 labeled examples across 4 difficulties

### Numbers to verify on-device before publishing
- Exact HC accuracy % (currently 95% on 40 cases)
- Exact trained accuracy % (currently 100% on 40 cases)
- Training time range across multiple runs
- Per-difficulty breakdown
- Fan-out latency at N=4, N=8, N=16

### Illustration ideas
- Hero: a fork in the road, one path labeled "LLM" (longer, scenic), one labeled "processor" (short, direct). A tiny gate at the fork.
- Training visualization: 120 dots in 2D (PCA of embeddings), colored by class, with the decision boundary drawn through them.
- Speed comparison: a race track showing 40ms training vs hours of hand-coding rules.
- Fan-out diagram: one embedding node at top, multiple classifier heads branching below, each labeled with a different question.