training-lab / blog /part4-segment-classifier.md

🧪 add segment classifier, protocol processor test harness, and full chain tests

35717ca about 1 month ago

preview code

raw

history blame contribute delete

13 kB

Part 4: Splitting the Stream

The whole-text classifier asks "should this go to the model?" The segment classifier asks "which words?"

Series context

Part 1 — Fine-tuned a 1.5B model to reconstruct bash from dictation. 97% accuracy, 3GB RAM, 0.7s inference.
Part 2 — Split architecture: deterministic processor for symbols, LLM for language understanding.
Part 3 — A classifier trained in 40ms gates whether input needs an LLM at all. 100% accuracy on held-out data.
Part 4 (this post) — Mixed dictations break the gate. A per-word classifier segments the stream so only the command parts hit the model.

The problem Part 3 didn't solve

Part 3's classifier works on whole utterances. Clean protocol input goes to the processor. Fuzzy input goes to the LLM. But real dictation isn't one or the other — it's both at once.

"I want to check the directory ls dash la and then we can talk about the project timeline."

If you send all of that to the fused model — a quantized 0.6B fine-tuned exclusively on bash reconstruction — you get hallucinated garbage where the natural speech was. The model has never seen conversational English. It tries to reconstruct "I want to check the directory" as if it were dictated protocol and produces something unrecognizable.

If you skip the model entirely, the protocol segment "ls dash la" passes through as raw dictated text. The user sees the words "ls dash la" instead of ls -la.

The gate needs to operate inside the sentence, not around it.

First attempt: heuristics

The obvious approach: maintain a set of protocol words ("dash", "dot", "slash", "space"), scan the text, mark any word that matches, expand a window around matches to capture adjacent command tokens like "ls" or "git".

This works for the simple cases. Then it breaks:

"at three o'clock" — "at" is in protocol vocab (for @), "three" is a number word. Flagged as protocol. Now the model tries to reconstruct a meeting time as bash.
"the performance numbers look really good" — clean pass. Correct.
"talkie-dev dash-help" — hyphens embedded in words. The heuristic checks for standalone "dash" but not for syntax characters inside words. Missed.

Every edge case is another rule. Every rule has its own edge cases. This is the exact problem the Part 3 classifier solved for whole-text routing. Same insight applies here — learn the boundary instead of coding it.

The vocabulary split

The key insight that made training work: not all protocol words are equally diagnostic.

"dash" almost never appears in natural English speech. If someone says "dash", they mean a hyphen. Strong signal.

"at" appears constantly in natural speech. "I arrived at the office at nine." Two uses of "at", zero protocol intent. Weak signal.

Split the vocabulary into two tiers:

Strong protocol (28 words): dash, dot, slash, space, colon, underscore, tilde, hash, pipe, backtick, semicolon, equals, backslash, capital, caps, camel, snake, pascal, kebab, paren, brace, bracket, ampersand, percent, caret, dollar, minus, asterisk.

Weak protocol (30 words): at, star, bang, question, comma, quote, plus, open, close, new, line, all, case, one, two, three, four, five, six, seven, eight, nine, ten, zero, back, sign, mark, double, single, less.

A strong protocol word is almost always protocol. A weak protocol word is only protocol when it appears near strong ones. "at" next to "tilde slash dot" is protocol. "at" next to "the office" is not.

Per-word features

14 features per word, extracted with a ±2 context window:

#	Feature	What it captures
0	is_strong_protocol	Unambiguous protocol word
1	is_weak_protocol	Ambiguous protocol word
2	is_expanded_symbol	Already-converted symbol (-, ., /)
3	has_syntax_chars	Contains syntax characters
4	word_length_norm	Length / 10
5	is_short_word	Length <= 3
6	context_strong_density	Fraction of ±2 window that's strong protocol
7	context_any_density	Fraction of ±2 window that's any protocol
8	left_is_strong	Left neighbor is strong protocol
9	right_is_strong	Right neighbor is strong protocol
10	is_number_like	Number word or digit
11	strong_neighbor_count	Raw count of strong protocol in window
12	is_all_lower	All lowercase letters
13	position_ratio	Position / total words

The most powerful feature: strong_neighbor_count (weight +1.04). Being near unambiguous protocol words is the single strongest signal that a word is part of a command. This is the learned version of the ±2 expansion heuristic — but the model decides how much weight to give it rather than a hard-coded rule.

Training

Synthetic mixed dictations: natural English fragments (from user manuals, business writing, HR policies, medical literature) concatenated with actual bash command dictations. 600 examples producing ~8,000 labeled words.

Same logistic regression as Part 3. Same gradient descent. Trains in under a second.

Two-pass architecture

The classifier identifies protocol anchor words — the unambiguous markers that a command is present. Then a deterministic expansion captures adjacent command tokens (the "ls" next to "dash la", the "git" before "dash dash help").

"I want to check the directory ls dash la and then discuss the timeline"
                                    ↓ ML classifier
Word scores:  0.1 0.1 0.1 0.1 0.1 0.2 0.3 [0.9] 0.4 0.1 0.1 0.1 0.1 0.1
                                                ↑ anchor: "dash"
                                    ↓ ±2 expansion
Segments:  [natural: "I want to check the directory"]
           [protocol: "ls dash la"]
           [natural: "and then discuss the timeline"]

Only the protocol segment goes to the fused model. Natural speech passes through untouched.

Results

Evaluated on 50 hand-labeled mixed dictations — real-world patterns from documentation, business writing, legal text, and technical manuals mixed with bash commands.

WORD-LEVEL
  Accuracy:   92.5%
  Precision:  84.7%
  Recall:     99.4%
  F1:         0.915

EXAMPLE-LEVEL
  Pure natural text:  19/21 perfect (90%)
  Mixed text:         protocol segments found in all 29 cases

FALSE POSITIVE RATE: 12.2%
FALSE NEGATIVE RATE:  0.6%

The asymmetry is deliberate. A false negative means a protocol segment passes through as raw text — the user sees "dash la" instead of -la. Annoying but comprehensible. A false positive means natural speech gets sent to the fused model — the model hallucinates on English and produces garbage. Much worse.

So we optimized for recall (99.4%) at the cost of precision (84.7%). Miss almost nothing, accept some boundary bleed.

The false positives are almost entirely transition words at command boundaries — "with", "run", "use", "to" — words that sit right before the command starts. Including them in the protocol segment is harmless; the fused model ignores them.

The two-classifier architecture

This is the full routing stack now:

Raw transcription
      |
      v
  [ NeedsLLMClassifier ]     ← whole-text: "does this need cleanup?"
      |                          → routes to frontier model (Claude, GPT)
      v
  [ ProtocolSegmentClassifier ] ← per-word: "which words are commands?"
      |                          → routes segments to fused Qwen3 0.6B
      v
  Output: natural speech untouched, commands reconstructed

Two classifiers, two jobs, two scales:

NeedsLLMClassifier: whole-dictation gate for quality cleanup with a large model
ProtocolSegmentClassifier: per-word segmentation for fast on-device command reconstruction

Both are logistic regression. Both are under 1KB of weights. Both run in microseconds. The expensive operations — the LLM inference, the embedding lookup — only happen when the classifiers say they should.

The procedural processor

The segment classifier decides which words go to the model. But what does the model actually replace? A deterministic procedural processor that needs no neural network at all.

The processor is a token scanner with a symbol vocabulary:

"dash" → -, "dot" → ., "slash" → /, "pipe" → |
Two-word: "open paren" → (, "dash dash" → --, "and and" → &&
Numbers: "eight zero eight zero" → 8080, "forty two" → 42
Casing: "capital A" → A, "all caps POST" → POST, "camel case get user" → getUserProfile
"space" as explicit token separator

A test harness validates the processor against 100 real commands across 31 categories (git, docker, curl, ssh, kubernetes, terraform, etc.) plus 92 isolated unit tests for every feature:

UNIT TESTS: 92/92 (100%)
  symbols, synonyms, two-word, three-word, numbers,
  casing, directives, spacing, mixed commands, redirects

EVAL TESTS: 100 commands across 31 categories
  Exact match:      96/100 (96.0%)
  Case-insensitive: 100/100 (100.0%)
  Median latency:   11us per command

  git: 8/8  docker: 7/7  kubernetes: 5/5  ssh: 3/3
  python: 6/6  rust: 3/3  go: 3/3  brew: 4/4  ...

The 4 non-exact matches are all casing edge cases — all caps scope interacting with adjacent capital directives. Every symbol, number, and structural transformation is correct. At 11 microseconds per command, the processor adds essentially zero latency.

This is the piece the fused model is meant to replace for fuzzy input. When the dictation is clean protocol, the processor handles it directly — no model needed. The model only activates for segments where the dictation is ambiguous or natural-sounding.

End-to-end chain test

The full pipeline runs without a microphone — synthetic test cases simulate mixed dictations through every stage:

Dictation text
      |
      v
  [ ProtocolSegmentClassifier ]  ← ML: which words are commands?
      |
      v
  [ Procedural Processor ]       ← deterministic: symbols, numbers, casing
      or
  [ Fused Model (Qwen3 0.6B) ]  ← neural: ambiguous reconstructions
      |
      v
  Reassembled output

12 test cases covering simple commands, pure protocol, pure natural speech, sandwiched commands, two-command bridges, and complex multi-flag commands (SSH with IPs, iptables, curl with headers):

CLASSIFIER SEGMENTATION: 12/12 correct
  - 3 pure natural correctly passed through (no model call)
  - 9 command cases correctly detected protocol segments

MODEL INFERENCE:
  - Simple commands (ls -la, ps -aux): correct
  - Medium commands (git push, npm install): correct
  - Complex commands (iptables, curl): drops arguments

The classifier is the strong link. The model is the current bottleneck on complex commands — a 0.6B model can reconstruct simple-to-medium bash but struggles with long multi-flag commands. This is a training data and model capacity issue, not a pipeline issue.

What's next

Better training data. The current training set is synthetic — programmatically assembled from fragments. Real mixed dictations from actual users would capture patterns the synthetic data misses.
Model capacity. The 0.6B fused model handles simple commands well but drops arguments on complex ones (iptables, curl with headers). Either more targeted training data or stepping up to 1.5B for the fused model.
Number word bridging. Removing number words from protocol vocabulary eliminated false positives on natural speech ("the board voted six to three") but created a gap: isolated number sequences between protocol anchors (like IP address digits) can fall outside the ±2 expansion window.

The classifier is deployed and running in Talkie's live dictation pipeline. Every dictation in iTerm2 now goes through the segment classifier before hitting the fused model. Natural speech stays natural. Commands get reconstructed. The user doesn't know which words went where.

Code references

datasets/train-segment-classifier.py — training pipeline, synthetic data generation
datasets/eval-segment-classifier.py — evaluation harness with boundary analysis
datasets/eval-segment-classifier.json — 50 labeled mixed dictation examples
datasets/test-protocol-processor.py — processor test harness (92 unit tests + 100 eval commands)
datasets/test-full-chain.py — end-to-end chain test (classifier + model + reassembly)
datasets/procedural-processor.py — deterministic protocol processor
datasets/eval-independent.json — 100 labeled dictation-to-command pairs across 31 categories
macOS/TalkieKit/Sources/TalkieKit/ProtocolSegmentClassifier.swift — deployed classifier
macOS/TalkieKit/Sources/TalkieKit/NeedsLLMClassifier.swift — whole-text classifier (Part 3)