Update OptGuideOnDeviceClassifierModel/CLASSIFIER_MODEL_ANALYSIS.md

#17
by oliseg - opened
OptGuideOnDeviceClassifierModel/CLASSIFIER_MODEL_ANALYSIS.md CHANGED
@@ -1,275 +1,392 @@
1
- # OptGuideOnDeviceClassifierModel Complete Analysis
2
-
3
- ## Overview
4
-
5
- **OptGuideOnDeviceClassifierModel** is a 120 MB on-device language model shipped with Chrome Canary as a Chrome Component. Its manifest names it **"Optimization Guide On Device Taxonomy Model"**, with a base model spec called **`taxonomy-tiny`**.
6
-
7
- It is a **Gemma 2 variant** purpose-built for **page-level classification** — specifically extracting the **brand** and **intent** of web pages for Chrome's client-side **scam/phishing detection** pipeline.
8
-
9
- | Field | Value |
10
- |---|---|
11
- | Manifest name | Optimization Guide On Device Taxonomy Model |
12
- | Base model | `taxonomy-tiny` v0.0.0.0 |
13
- | Component version | `2026.2.12.1554` |
14
- | Component ID (CRX) | `eidcjfoningnkhpoelgpjemmhmopkeoi` |
15
- | File | `weights.bin` (126,025,728 bytes / 120.19 MB) |
16
- | Execution config | Empty (0 bytes) — no prompt template bundled |
17
- | Performance hint | `3` |
18
- | Availability | **Chrome Canary** (not tested in Stable) |
19
- | Optimization target | `OPTIMIZATION_TARGET_MODEL_EXECUTION_FEATURE_CLASSIFIER` (ID 72) |
20
- | Chrome feature flag | `ClientSideDetectionBrandAndIntentForScamDetection` |
21
-
22
- ---
23
-
24
- ## Purpose: Scam Detection via Brand + Intent Classification
25
-
26
- Chrome's Client-Side Detection (CSD) system extracts page text from suspicious websites and sends it to this model with the following prompt (decoded from `on_device_model_execution_config.pb` of model ID 55):
27
-
28
- ```
29
- You are a web page text scanner. Your task is to carefully review text from
30
- a web page and answer the following questions in English:
31
-
32
- 1) What brand does the page represent?
33
- 2) In one complete sentence, summarize what this page aims to do.
34
- Do not leak PII data.
35
-
36
- You should output your answers strictly in the following JSON format:
37
-
38
- {"brand": "<brand>", "intent": "<intent>"}
39
-
40
- Do not use ```json``` block in your output.
41
-
42
- Text: [PAGE CONTENT HERE]
43
- ```
44
-
45
- The expected response conforms to this JSON schema:
46
-
47
- ```json
48
- {
49
- "type": "object",
50
- "additionalProperties": false,
51
- "properties": {
52
- "brand": { "type": "string" },
53
- "intent": { "type": "string" }
54
- },
55
- "required": ["brand", "intent"]
56
- }
57
- ```
58
-
59
- When the detected brand/intent combination is inconsistent with the actual page behavior (e.g., a page claiming to be PayPal but actually harvesting credentials on an unrelated domain), Chrome flags the page as a potential scam via Safe Browsing.
60
-
61
- ---
62
-
63
- ## Binary Format: LITERTLM Container
64
-
65
- The `weights.bin` file is **not** a raw TFLite model. It uses the **LITERTLM** (LiteRT Language Model) container format — a proprietary Google ODML packaging format with a FlatBuffer header and multiple embedded submodels.
66
-
67
- ### File Layout
68
-
69
- ```
70
- Offset Component Size
71
- ────────────────────────────────────────────────────────────
72
- 0x00000000 LITERTLM FlatBuffer header 32 KB
73
- Magic: "LITERTLM"
74
- Version: 1
75
- Submodels: 4 entries declared
76
- Metadata:
77
- model_type = "tf_lite_prefill_decode"
78
- model_type = "tf_lite_embedder"
79
- model_version = "1.0.1"
80
- Authors = "ODML team"
81
-
82
- 0x00008000 TFLite #1 — Embedder 8.20 MB (8,601,600 bytes)
83
- Input: token_ids [1, 1] int32
84
- Output: embeddings [1, 1, 1024] float32
85
- Op: lookup_embedding_table
86
- TFLite runtime: 2.18.0
87
-
88
- 0x0083C000 TFLite #2 — Prefill + Decode 111.63 MB (117,055,216 bytes)
89
- 2 signatures: "prefill" and "decode"
90
- 39 inputs (embeddings + position + mask + 36 KV cache)
91
- 37 outputs (36 KV cache + logits [1, 1, 16384])
92
- 18 transformer layers
93
- Full Gemma 2 architecture
94
-
95
- 0x077E0000 SentencePiece tokenizer 305.6 KB (312,918 bytes)
96
- Vocab size: 16,384 tokens
97
- Special tokens: <pad>=0, </s>=1, <s>=2, <unk>=3
98
- 256 byte-fallback tokens
99
- Normalizer: nmt_nfkc
100
-
101
- 0x0782C656 Zero padding to alignment 14.7 KB
102
- 0x07830000 End of file 126,025,728 bytes total
103
- ```
104
-
105
- ### How to Extract the Submodels
106
-
107
- ```python
108
- data = open('weights.bin', 'rb').read()
109
-
110
- # TFLite embedder
111
- open('embedder.tflite', 'wb').write(data[0x8000:0x83C000])
112
-
113
- # TFLite prefill+decode transformer
114
- open('decoder.tflite', 'wb').write(data[0x83C000:0x77DDEF0])
115
-
116
- # SentencePiece tokenizer
117
- open('tokenizer.model', 'wb').write(data[0x77E0000:0x782C656])
118
- ```
119
-
120
- ---
121
-
122
- ## Architecture: Gemma 2 "taxonomy-tiny"
123
-
124
- The model is a **distilled Gemma 2** with reduced dimensions, confirmed by layer name analysis of the TFLite graph.
125
-
126
- ### Specifications
127
-
128
- | Parameter | Value | Evidence |
129
- |---|---|---|
130
- | Architecture family | **Gemma 2** | QK normalization + post-FFN norm = Gemma 2 exclusive features |
131
- | Transformer layers | **18** | `layer_0` through `layer_17` in tensor names |
132
- | Embedding dimension | **1024** | Embedder output shape `[1, 1, 1024]` |
133
- | KV cache dimension | **256** per layer | KV input/output shape `[1, 1, 1, 256]` |
134
- | Vocabulary size | **16,384** | Logits output shape `[1, 1, 16384]`; SentencePiece vocab |
135
- | Normalization | **RMSNorm** | `rms_norm/mul`, `rms_norm/rsqrt`, `rms_norm/square` |
136
- | Pre-attention norm | **Yes** | `pre_attention_norm/rms_norm` |
137
- | Pre-FFN norm | **Yes** | `pre_ffw_norm` patterns |
138
- | Post-FFN norm | **Yes** | Post-FFN norm present (Gemma 2 specific) |
139
- | QK normalization | **Yes** | `key_norm/rms_norm` (Gemma 2 specific) |
140
- | Positional encoding | **RoPE** | `maybe_rope/concatenate` |
141
- | Attention type | **Full attention** | No sliding window patterns found |
142
- | Activation | **GeLU** (likely) | Standard for Gemma 2 |
143
- | Quantization | **Mixed INT4/INT8** | 120 MB for 18 layers with 1024 dim implies heavy quantization |
144
- | Estimated parameters | **~100–200M** | Based on file size and quantization assumptions |
145
- | TFLite signatures | `prefill` (no logits) + `decode` (with logits) | Standard ODML LLM execution pattern |
146
-
147
- ### Comparison with Known Models
148
-
149
- | | **taxonomy-tiny** | Gemma 2 2B | Gemini Nano v3 |
150
- |---|---|---|---|
151
- | Layers | 18 | 26 | ~32 |
152
- | Embed dim | 1,024 | 2,304 | unknown |
153
- | Vocab size | 16,384 | 256,128 | 256,128 |
154
- | File size | 120 MB | ~2.6 GB | 4.07 GB |
155
- | QK norm | Yes | Yes | Yes |
156
- | Post-FFN norm | Yes | Yes | Yes |
157
- | Sliding window | No | Yes (alternating) | Yes |
158
- | Purpose | Classification | General | General |
159
-
160
- ### Single Transformer Block Structure
161
-
162
- From tensor name analysis, each of the 18 layers contains:
163
-
164
- ```
165
- layer_N/
166
- ├── layer_N.pre_qkv/
167
- │ ├── pre_attention_norm/rms_norm/ (RMSNorm)
168
- │ └── attn._pre_attention_fn/
169
- │ └── maybe_rope/ (RoPE positional encoding)
170
- ├── attn.dot_product_attention/
171
- │ └── dot_attn._qkv_fn/
172
- │ ├── key_norm/rms_norm/ (QK normalization)
173
- │ ├── dot_general (Q*K)
174
- │ ├── tfl_softmax
175
- │ ├── dot_general (attn*V)
176
- │ └── reshape/transpose
177
- ├── layer_N.post_qkv/
178
- │ ├── attn.post_qkv/attn_vec_einsum/ (output projection)
179
- │ ├── add (residual)
180
- │ └── add1 (post-attention residual)
181
- ├── layer_N.update_cache/
182
- │ └── attn.update_cache/concatenate (KV cache update)
183
- └── [pre_ffw_norm + FFN + post_ffw_norm] (feed-forward block)
184
- ```
185
-
186
- Final output: `final_norm/rms_norm` → `decode_softmax` → logits `[1, 1, 16384]`
187
-
188
- ---
189
-
190
- ## Tokenizer: Reduced Gemma Vocabulary
191
-
192
- The embedded SentencePiece model uses a **16,384-token vocabulary** — a dramatic reduction from Gemma's standard 256,128 tokens. This is consistent with a classification-focused model that doesn't need the full multilingual generative vocabulary.
193
-
194
- | Property | Value |
195
- |---|---|
196
- | Vocab size | 16,384 |
197
- | BOS token | `<s>` (id=2) |
198
- | EOS token | `</s>` (id=1) |
199
- | PAD token | `<pad>` (id=0) |
200
- | UNK token | `<unk>` (id=3) |
201
- | Byte fallbacks | 256 tokens (`<0x00>` through `<0xFF>`) |
202
- | Normalizer | `nmt_nfkc` |
203
-
204
- Notably, Gemma's conversation tokens (`<start_of_turn>`, `<end_of_turn>`) are **absent** from this vocabulary — they map to UNK (id=3). The model does not use chat-turn formatting.
205
-
206
- Sample vocabulary entries:
207
-
208
- ```
209
- [ 260] = '.' [ 500] = '▁such' [ 1000] = '▁amount'
210
- [ 2000] = '▁Q' [ 5000] = '▁tradition' [10000] = '▁Computer'
211
- [15000] = '▁Philosophy' [16383] = '▁<custom370>'
212
- ```
213
-
214
- ---
215
-
216
- ## Chrome Integration Pipeline
217
-
218
- ```
219
- User visits a page
220
-
221
-
222
- ┌─────────────────────────────┐
223
- │ Safe Browsing Heuristics │ Pre-filter: URL reputation, phishing
224
- │ (CSD - Client Side Det.) │ patterns, keyboard lock API, etc.
225
- └──────────┬──────────────────┘
226
- Page flagged as suspicious
227
-
228
- ┌─────────────────────────────┐
229
- │ Page Text Extraction │ Extract visible text content from DOM
230
- └──────────┬──────────────────┘
231
-
232
-
233
- ┌─────────────────────────────┐
234
- │ Prompt Construction │ "You are a web page text scanner..."
235
- (from model ID 55 config) │ + page text appended
236
- └──────────┬──────────────────┘
237
-
238
- ┌─────┴──────┐
239
- ▼ ▼
240
- ┌─────────┐ ┌──────────────┐
241
- Gemini │ │ taxonomy- │ Whichever model is available
242
- Nano │ │ tiny │ (taxonomy-tiny is 33x smaller)
243
- (4 GB) │ │ (120 MB) │
244
- └────┬────┘ └──────┬───────┘
245
- │ │
246
- └──────┬───────┘
247
-
248
- ┌─────────────────────────────┐
249
- │ JSON Response Parsing │ {"brand": "PayPal",
250
- │ │ "intent": "credential harvesting"}
251
- └──────────┬──────────────────┘
252
-
253
-
254
- ┌─────────────────────────────┐
255
- │ Verdict Logic │ Compare brand claim vs. actual domain,
256
- │ │ intent vs. page behavior
257
- └──────────┬──────────────────┘
258
-
259
-
260
- ┌─────────────────────────────┐
261
- │ Safe Browsing Warning │ Red interstitial page shown to user
262
- └─────────────────────────────┘
263
- ```
264
-
265
- ### Trigger Conditions
266
-
267
- The classifier does **not** run on every page. It triggers when Chrome's CSD heuristics detect suspicious signals:
268
-
269
- - Phishing URL patterns (Safe Browsing prefix match)
270
- - Keyboard Lock API usage (common in tech support scams)
271
- - Aggressive popups or fullscreen requests
272
- - Form fields requesting sensitive data (passwords, SSN, credit cards)
273
- - Urgency language patterns
274
-
275
- ---
1
+ # Chrome taxonomy-tiny: Observed Facts
2
+
3
+ **Date:** 2026-04-19
4
+ **Analysts:** Local research (reverse engineering) + public Chromium sources
5
+
6
+ ---
7
+
8
+ ## 1. The Chrome Component
9
+
10
+ ### Identity
11
+ | Field | Value | Source |
12
+ |-------|-------|--------|
13
+ | CRX ID | `eidcjfoningnkhpoelgpjemmhmopkeoi` | `verified_contents.json` |
14
+ | Name | `Optimization Guide On Device Taxonomy Model` | `manifest.json` |
15
+ | Version | `2026.2.12.1554` | `manifest.json` |
16
+ | BaseModelSpec.name | `taxonomy-tiny` | `manifest.json` |
17
+ | BaseModelSpec.version | `0.0.0.0` | `manifest.json` |
18
+ | Performance hints | `[3]` | `manifest.json` |
19
+
20
+ ### Signed Files (CRX verified_contents)
21
+ The component contains exactly 3 files signed by Google:
22
+
23
+ | File | Size | Content |
24
+ |------|------|---------|
25
+ | `manifest.json` | 247 bytes | Component metadata |
26
+ | `on_device_model_execution_config.pb` | **0 bytes** | Empty |
27
+ | `weights.bin` | 126,025,728 bytes (120.2 MB) | LITERTLM container |
28
+
29
+ No `adaptation_weights.bin`, `adapter.bin`, `lora.bin`, or `model-info.pb` file exists in the component.
30
+
31
+ ### LITERTLM Container (weights.bin)
32
+ | Entry | Size | Description |
33
+ |-------|------|-------------|
34
+ | `embedder.tflite` | 8,601,600 bytes | TFLite v3 |
35
+ | `prefill_decode.tflite` | 117,391,360 bytes | TFLite v3 |
36
+ | `tokenizer.spm` | 312,918 bytes (LITERTLM header; extracted file: 312,917 bytes) | SentencePiece |
37
+ | `model_version` | 5 bytes | String `"1.0.1"` |
38
+
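+ The offsets of the embedded blobs are declared in the LITERTLM FlatBuffer header. As a quick alternative, the sketch below locates the two TFLite submodels by their FlatBuffer file identifier (`TFL3`, which sits 4 bytes into a `.tflite` file) and slices them out with the sizes from the table above. The scan is naive (a chance `TFL3` byte sequence inside the weight data would throw it off) and the output file names are simply the ones used in the rest of this report; parsing the header is the robust route.
+
+ ```python
+ # Naive extraction sketch -- assumes weights.bin in the working directory
+ # and uses the entry sizes listed in the table above.
+ data = open("weights.bin", "rb").read()
+
+ # Candidate starts of embedded .tflite blobs: the "TFL3" identifier sits
+ # 4 bytes after the start of a TFLite FlatBuffer.
+ starts, pos = [], data.find(b"TFL3")
+ while pos != -1:
+     starts.append(pos - 4)
+     pos = data.find(b"TFL3", pos + 1)
+
+ sizes = {"embedder.tflite": 8_601_600, "prefill_decode.tflite": 117_391_360}
+ for (name, size), start in zip(sizes.items(), starts[:2]):
+     open(name, "wb").write(data[start:start + size])
+ # tokenizer.spm follows the large TFLite blob; its offset is best read from
+ # the LITERTLM header (not shown here).
+ ```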
39
+ ---
40
+
41
+ ## 2. Model Architecture
42
+
43
+ ### Transformer Specifications
44
+ | Parameter | Value | Measurement Method |
45
+ |-----------|-------|--------------------|
46
+ | Family | Gemma (named "TinyGemma" in Chromium code) | Chromium C++ source |
47
+ | Layers | 18 transformer decoder | TFLite inspection |
48
+ | Parameters | ~319M estimated (INT4 quantized) | Estimated: embedding 16384x1024 + 18 layers x (gate 2048x1024 + up 2048x1024 + down 1024x2048 + o_proj 1024x1024 + norms) + tied LM head |
49
+ | Embedding dim | 1024 | Embedder output tensor |
50
+ | KV head dim | 256, 1 head per layer | KV cache tensors |
51
+ | Quantization | INT4 stored as INT8 (15 unique values) | Weight analysis |
52
+ | Embedding table vs LM head | Byte-for-byte identical matrices | Full matrix comparison |
53
+ | Execution | CPU only | Chromium C++ source |
54
+
55
+ ### TFLite Signatures
56
+
57
+ **embedder.tflite:**
58
+ | Input/Output | Shape | Type |
59
+ |--------------|-------|------|
60
+ | Input: `token_ids` | `[1, 1]` | int32 |
61
+ | Output: `embeddings` | `[1, 1, 1024]` | float32 |
62
+
63
+ **prefill_decode.tflite -- `prefill` signature:**
64
+ | Input/Output | Shape | Type |
65
+ |--------------|-------|------|
66
+ | Input: `embeddings` | `[1, N, 1024]` | float32 |
67
+ | Input: `input_pos` | `[N]` | int32 |
68
+ | Input: `mask` | `[1, 1, N, N]` | float32 |
69
+ | Input: 36 KV caches | `[1, N, 1, 256]` | float32 |
70
+ | Output: 36 KV caches (updated) | `[1, N, 1, 256]` | float32 |
71
+ | Output: **no logits** | -- | -- |
72
+
73
+ **prefill_decode.tflite -- `decode` signature:**
74
+ | Input/Output | Shape | Type |
75
+ |--------------|-------|------|
76
+ | Input: `embeddings` | `[1, 1, 1024]` | float32 |
77
+ | Input: `input_pos` | `[1]` | int32 |
78
+ | Input: `mask` | `[1, 1, 1, 1]` | float32 |
79
+ | Input: 36 KV caches | `[1, ?, 1, 256]` | float32 |
80
+ | Output: 36 KV caches (updated) | `[1, ?, 1, 256]` | float32 |
81
+ | Output: `logits` | `[1, 1, 16384]` | float32 |
82
+
83
+ ### KV Cache: Observed Behavior
84
+ The model writes new KV values **at position `input_pos` within the existing buffer**. Output has the same size as input. The cache must be **replaced** with the output, not concatenated.
85
+
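+ The sketch below drives the two extracted submodels with `tf.lite.Interpreter` signature runners to illustrate this replace-not-concatenate behavior. The signature names (`prefill`, `decode`), the input names `token_ids`, `embeddings`, `input_pos`, `mask`, and the output names `embeddings` and `logits` come from the tables above; everything else (a single signature on the embedder, cache outputs reusing the input names, an all-zero mask) is an assumption made to keep the example short.
+
+ ```python
+ import numpy as np
+ import tensorflow as tf
+
+ emb = tf.lite.Interpreter(model_path="embedder.tflite")
+ dec = tf.lite.Interpreter(model_path="prefill_decode.tflite")
+ embed = emb.get_signature_runner()            # assumes a single embedder signature
+ decode = dec.get_signature_runner("decode")   # the "prefill" runner is obtained the same way
+
+ # Build zeroed KV caches with the shapes the decode signature declares.
+ det = decode.get_input_details()
+ kv_names = [n for n in det if n not in ("embeddings", "input_pos", "mask")]
+ cache = {n: np.zeros(det[n]["shape"], np.float32) for n in kv_names}
+
+ def decode_step(token_id, pos, cache):
+     e = embed(token_ids=np.array([[token_id]], np.int32))["embeddings"]
+     out = decode(embeddings=e,
+                  input_pos=np.array([pos], np.int32),
+                  mask=np.zeros((1, 1, 1, 1), np.float32),  # placeholder mask
+                  **cache)
+     # Replace the cache with the updated tensors (assumes cache outputs carry
+     # the same names as the inputs; adjust the mapping if they differ).
+     return out["logits"], {n: out[n] for n in kv_names}
+ ```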
86
+ ---
87
+
88
+ ## 3. Tokenizer
89
+
90
+ ### Vocabulary Structure (16,384 tokens)
91
+ | Range | Tokens | Role |
92
+ |-------|--------|------|
93
+ | 0 | `<pad>` | Padding |
94
+ | 1 | `</s>` | End of sequence |
95
+ | 2 | `<s>` | Beginning of sequence (BOS) |
96
+ | 3 | `<unk>` | Unknown |
97
+ | 4--259 | `<0x00>`--`<0xFF>` | Byte fallback |
98
+ | 260--15999 | Text tokens | Standard SentencePiece vocabulary |
99
+ | 16000--16001 | `▁<start_of_audio>`, `▁<end_of_audio>` | Audio markers |
100
+ | 16002--16003 | `▁<start_of_image>`, `▁<end_of_image>` | Image markers |
101
+ | 16004--16013 | `▁<ctrl1>` -- `▁<ctrl10>` | Control tokens |
102
+ | 16014--16383 | `▁<custom1>` -- `▁<custom370>` | Custom tokens |
103
+
104
+ ### Tokenization of `<ctrl1>`
105
+ - `" <ctrl1>"` (with preceding space) -> single token 16004
106
+ - `"<ctrl1>"` (without space) -> decomposed into individual characters (byte fallback)
107
+ - `sp.PieceToId('<ctrl1>')` returns 3 (UNK) because the actual piece name is `▁<ctrl1>`
108
+
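+ These observations can be reproduced directly against the extracted `tokenizer.spm` with the `sentencepiece` Python package; the expected values in the comments are the ones reported above.
+
+ ```python
+ import sentencepiece as spm
+
+ sp = spm.SentencePieceProcessor(model_file="tokenizer.spm")
+
+ print(sp.encode(" <ctrl1>", out_type=int))   # reported: single id 16004 ("▁<ctrl1>")
+ print(sp.encode("<ctrl1>", out_type=int))    # reported: per-character fallback ids
+ print(sp.piece_to_id("<ctrl1>"))             # 3 (UNK) -- the stored piece is "▁<ctrl1>"
+ print(sp.piece_to_id("▁<ctrl1>"))            # 16004
+ ```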
109
+ ---
110
+
111
+ ## 4. Chromium Source Code
112
+
113
+ ### File: `optimization_guide_on_device_model_installer.cc`
114
+
115
+ Confirmed against the Chromium source at:
116
+ `https://chromium.googlesource.com/experimental/chromium/src/+/refs/heads/main/chrome/browser/component_updater/optimization_guide_on_device_model_installer.cc`
117
+
118
+ ```cpp
119
+ // Extension id is eidcjfoningnkhpoelgpjemmhmopkeoi.
120
+ constexpr char kClassifierModelManifestName[] =
121
+ "Optimization Guide On Device Taxonomy Model";
122
+ constexpr base::FilePath::CharType kClassifierModelInstallationRelativePath[] =
123
+ FILE_PATH_LITERAL("OptGuideOnDeviceClassifierModel");
124
+ ```
125
+
126
+ The component is registered through an `OptimizationGuideOnDeviceClassifierModelInstallerPolicy` class that inherits from `OnDeviceModelInstallerPolicy`.
127
+
128
+ The `OnDeviceModelType` enum distinguishes two model types:
129
+ ```cpp
130
+ enum OnDeviceModelType { kBaseModel, kClassifierModel };
131
+ ```
132
+
133
+ ### File: `on_device_model_classifier_controller.cc`
134
+
135
+ Hardcoded classification config (Chromium source, `experimental/chromium/src` branch, Copyright 2026 in file header):
136
+ ```cpp
137
+ // Prompt template
138
+ substitution->set_string_template("%s <ctrl1>");
139
+
140
+ // Proto types
141
+ // Request: optimization_guide.proto.ClassifyApiRequest { string text = 1; }
142
+ // Response: optimization_guide.proto.ClassifyApiResponse { string output = 1; }
143
+
144
+ // Feature ID
145
+ config.set_feature(proto::ModelExecutionFeature::MODEL_EXECUTION_FEATURE_CLASSIFIER);
146
+ // = 33
147
+
148
+ // Execution
149
+ params->backend_type = ml::ModelBackendType::kCpuBackend;
150
+ params->max_tokens = kOnDeviceModelMaxTokens; // 10240
151
+
152
+ // Log on disconnect
153
+ LOG(ERROR) << "TinyGemma model disconnected unexpectedly.";
154
+ ```
155
+
156
+ ### Proto: `substitution.proto`
157
+ ```protobuf
158
+ enum ControlToken {
159
+ CONTROL_TOKEN_UNSPECIFIED = 0;
160
+ SYSTEM = 1; // <ctrl1>
161
+ MODEL = 2; // <ctrl2>
162
+ USER = 3; // <ctrl3>
163
+ END = 4; // <ctrl4>
164
+ }
165
+ ```
166
+
167
+ ### Execution Pipeline in Chrome
168
+ ```
169
+ Input text -> ClassifyApiRequest { text: "..." }
170
+ -> FormatString: "%s <ctrl1>" -> "text <ctrl1>"
171
+ -> BOS + tokenize -> embedder -> prefill/decode -> logits -> sample
172
+ -> ParseResponse -> ClassifyApiResponse { output: "..." }
173
+ ```
174
+
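+ A minimal end-to-end sketch of this pipeline, reusing `sp` from the tokenizer sketch and `decode_step()`/`cache` from the KV-cache sketch: the caller passes a freshly zeroed cache, the `"%s <ctrl1>"` template and greedy sampling mirror the Chromium config, and for brevity the prompt is prefilled one token at a time through the `decode` signature rather than the batched `prefill` signature.
+
+ ```python
+ import numpy as np
+
+ def classify(text, cache, max_new_tokens=64):
+     prompt = "%s <ctrl1>" % text                        # FormatString step
+     ids = [sp.bos_id()] + sp.encode(prompt, out_type=int)
+
+     pos, logits = 0, None
+     for tok in ids:                                     # prefill via decode signature
+         logits, cache = decode_step(tok, pos, cache)
+         pos += 1
+
+     out = []
+     for _ in range(max_new_tokens):                     # greedy decode
+         nxt = int(np.argmax(logits[0, -1]))
+         if nxt == sp.eos_id():                          # </s> = id 1
+             break
+         out.append(nxt)
+         logits, cache = decode_step(nxt, pos, cache)
+         pos += 1
+     return sp.decode(out)                               # ClassifyApiResponse.output string
+ ```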
175
+ ---
176
+
177
+ ## 5. Observed Inference Behavior
178
+
179
+ ### Text Generation
180
+ When provided with text followed by `<ctrl1>`, the model **echoes and reformats** the input:
181
+
182
+ | Input | Generated Output |
183
+ |-------|-----------------|
184
+ | `best laptop deals gaming electronics` | `Best laptop deals: gaming electronics` |
185
+ | `Italian cooking recipes pasta carbonara` | `Italian cooking recipes Pasta carbonara` |
186
+ | `NBA basketball scores live updates` | `NBA Basketball Scores Live Update` |
187
+ | `stock market investment trading` | `Stock market investment trading` |
188
+
189
+ Across the 4 tests performed, the output contains capitalization and punctuation changes relative to the input. No classification labels or category identifiers were observed in any tested output.
190
+
191
+ ### Custom Tokens in Generation
192
+ | Metric | Value |
193
+ |--------|-------|
194
+ | Custom token max logit | -19 to -21 |
195
+ | Regular token max logit | +19 to +27 |
196
+ | Gap | ~37 to 46 logit units |
197
+ | Custom token entropy ratio | 1.0000 (uniform) |
198
+
199
+ Across all tests performed, custom tokens (16014-16383) had logits between -19 and -21, producing negligible probability after softmax.
200
+
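+ A quick numeric check of why that gap matters: softmax probabilities fall off exponentially with the logit difference, so a token sitting ~40 logits below the top candidate gets essentially zero mass. The values below are representative numbers from the table, not measured logit vectors.
+
+ ```python
+ import numpy as np
+
+ logits = np.array([27.0, 19.0, -19.0, -21.0])  # top regular vs. custom tokens
+ p = np.exp(logits - logits.max())
+ p /= p.sum()
+ print(p)  # custom-token entries land around exp(-46) ~ 1e-20
+ ```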
201
+ ### Embedding Quality (Layer 17 KV cache as representation)
202
+ | Metric | Last Position | Mean-Pooled |
203
+ |--------|--------------|-------------|
204
+ | Within-category similarity | 0.849 | 0.783 |
205
+ | Across-category similarity | 0.805 | 0.756 |
206
+ | Separation gap | 0.044 | 0.027 |
207
+
208
+ Within-category similarity exceeds across-category similarity by 0.044 (last position) and 0.027 (mean-pooled).
209
+
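+ The similarity numbers above can be computed as mean pairwise cosines; the sketch below shows the scheme under the assumption that `reps` maps each test category to a list of layer-17 KV-cache vectors (last-position or mean-pooled, as in the table).
+
+ ```python
+ from itertools import combinations
+ import numpy as np
+
+ def cosine(a, b):
+     return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
+
+ def separation(reps):
+     """reps: {category: [vector, ...]} -> (within, across, gap)."""
+     within = [cosine(a, b) for vecs in reps.values() for a, b in combinations(vecs, 2)]
+     across = [cosine(a, b) for v1, v2 in combinations(reps.values(), 2)
+               for a in v1 for b in v2]
+     return np.mean(within), np.mean(across), np.mean(within) - np.mean(across)
+ ```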
210
+ ### Performance
211
+ | Operation | Duration (~16 tokens) |
212
+ |-----------|-----------------------|
213
+ | Decode only (token by token) | ~2.0--2.4s |
214
+ | Prefill batch | ~1.3s |
215
+ | Prefill + full decode | ~3.0s |
216
+
217
+ ---
218
+
219
+ ## 6. Weight Analysis
220
+
221
+ ### Embedding Table: Custom vs Regular Tokens
222
+
223
+ | Metric | Custom (370) | Regular (370) | Multimodal (14) |
224
+ |--------|-------------|---------------|-----------------|
225
+ | Intra-group cosine | **0.999937** | 0.051 | 0.886 |
226
+ | L2 norm mean | 32.045 | 29.616 | 32.029 |
227
+ | L2 norm std | **0.007** | 3.019 | 0.077 |
228
+ | Per-dimension variance (mean) | **0.0004** | 6.600 | -- |
229
+ | Regular/custom variance ratio | **14,817x** | -- | -- |
230
+ | K-means k=50: distinct clusters | 33 (collapse) | 50 | -- |
231
+ | DBSCAN: clusters (eps x1.0) | 2 | 1 | 1 |
232
+
233
+ K-means at k=50 collapses to 33 clusters (ConvergenceWarning). DBSCAN at eps x1.0 finds 2 clusters. No significant intra/inter-group separation detected for the grid patterns tested (10x37, 37x10, 74x5, 185x2).
234
+
235
+ Multimodal tokens (16000-16013) show 2 sub-clusters with cosine ~0.20 between groups. Custom tokens show no sub-clusters.
236
+
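+ The clustering checks can be reproduced roughly as below. Here `emb` stands for the dequantized 16384x1024 embedding table (extraction not shown), the 370-token regular sample mirrors the table above, and the eps heuristic (median pairwise distance times a scale factor) is an assumption about the report's "eps x1.0" notation.
+
+ ```python
+ import numpy as np
+ from scipy.spatial.distance import pdist
+ from sklearn.cluster import KMeans, DBSCAN
+
+ rng = np.random.default_rng(0)
+ custom = emb[16014:16384]                                   # the 370 custom tokens
+ regular = emb[rng.choice(np.arange(260, 16000), 370, replace=False)]
+
+ def clusters(x, k=50, eps_scale=1.0):
+     km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(x)
+     kmeans_distinct = len(np.unique(km.labels_))            # < k when centroids collapse
+     db = DBSCAN(eps=np.median(pdist(x)) * eps_scale, min_samples=5).fit(x)
+     dbscan_clusters = len(set(db.labels_) - {-1})           # -1 = noise
+     return kmeans_distinct, dbscan_clusters
+ ```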
237
+ ### Transformer Layers: Weight Distribution
238
+
239
+ | Metric | Value (54 FFN tensors, 18 layers) |
240
+ |--------|-----------------------------------|
241
+ | Mean std | 2.6252 |
242
+ | Std of stds | 0.0093 |
243
+ | Coefficient of variation | **0.36%** |
244
+ | % outliers | 0.00% (all layers) |
245
+ | % zeros | 15.3--16.4% |
246
+
247
+ The coefficient of variation across the 54 FFN tensors is 0.36%. Differences between the early, mid, and late layer groups:
248
+
249
+ | Weight type | Early (0-5) | Mid (6-11) | Late (12-17) | Variation |
250
+ |-------------|-------------|------------|--------------|-----------|
251
+ | gate_proj std | 2.6319 | 2.6243 | 2.6307 | 0.8% |
252
+ | up_proj std | 2.6314 | 2.6246 | 2.6318 | 0.8% |
253
+ | down_proj std | 2.6217 | 2.6090 | 2.6214 | 1.3% |
254
+
255
+ `o_proj` has std of 18.49 (7x the FFN layers), constant across all 18 layers (18.492--18.496).
256
+
257
+ ### Normalization Layers
258
+ - Layer 0: all RMSNorm values are zero
259
+ - Layers 1-17: RMSNorm values range from 3.89 to 14.02
260
+
261
+ ---
262
+
263
+ ## 7. Comparison with Other Chrome Targets
264
+
265
+ ### Target 72 vs Targets 43 and 74
266
+
267
+ | Attribute | Target 43 (Embedder) | Target 72 (Classifier) | Target 74 (Shopping) |
268
+ |-----------|---------------------|----------------------|---------------------|
269
+ | Optimization target | `PASSAGE_EMBEDDER` | `CLASSIFIER` | `SHOPPING_CLASSIFIER` |
270
+ | Timestamp in model-info.pb / version | 2024-06-25 (timestamp) | 2026-02-12 (inferred from version `2026.2.12.1554`) | 2026-03-30 (timestamp) |
271
+ | Config populated | Yes (147 bytes) | **No (0 bytes)** | Yes (121 bytes) |
272
+ | model-info.pb | Present | **Absent** | Present |
273
+ | Metadata type | `PassageEmbeddingsModelMetadata` | None | `CategoryClassifierMetadata` |
274
+ | Architecture | Sentence embedder TFLite | TinyGemma LLM 18 layers | Linear classifier head TFLite |
275
+ | Model size | 112 MB | 120 MB | 789 KB |
276
+ | Input | int32[1,64] token IDs | Text + `<ctrl1>` | float32[1,1536] embeddings |
277
+ | Output | float32[1,768] embedding | Generated text | float32[1,1] score |
278
+ | Tokenizer | sentencepiece.model (794 KB) | tokenizer.spm (313 KB) | Uses target 43's |
279
+ | Input window | 64 tokens | ~10240 tokens | N/A |
280
+ | Depends on another target | No | No | Yes (target 43, v >= 2026-02-02) |
281
+
282
+ Target 74's metadata declares a dependency on target 43 via `required_embedder_version`. Target 74's input shape (1536-dim) matches the concatenation of two 768-dim embeddings from target 43. Target 72 has no declared dependency.
283
+
284
+ ### Target 72 vs Target 62 (Proofreader)
285
+
286
+ | Attribute | Target 62 | Target 72 |
287
+ |-----------|-----------|-----------|
288
+ | Own model | No (uses shared Gemini Nano ~1.7 GB) | Yes (standalone TinyGemma 120 MB) |
289
+ | LoRA adapter | Yes (`adaptation_weights.bin`, 16.6 MB) | No |
290
+ | Config | Non-empty | Empty (0 bytes) |
291
+
292
+ ### Legacy Topics System: PAGE_TOPICS_V2
293
+
294
+ | Attribute | PAGE_TOPICS_V2 | Target 72 |
295
+ |-----------|---------------|-----------|
296
+ | Target ID | 15 | 72 |
297
+ | Architecture | BERT (BertNLClassifier) | TinyGemma (Gemma) |
298
+ | Size | ~2.7 MB | ~120 MB |
299
+ | Input | Cleaned hostname | Full page content |
300
+ | Output | 469 topic probabilities | Generated text |
301
+ | Taxonomy | v2, 469 topics (IDs 1-629) | None in the component |
302
+ | Override list | ~47,128 domain-to-topic mappings | None |
303
+
304
+ ---
305
+
306
+ ## 8. Verified External Context
307
+
308
+ ### Topics API: Deprecation Status
309
+ - **October 17, 2025**: Google announces retirement of the Topics API, Protected Audience, and Attribution Reporting
310
+ - **Deprecation** scheduled for Chrome 144
311
+ - **Full removal** scheduled for Chrome 150
312
+ - **Reason**: insufficient ecosystem adoption
313
+ - Source: [Privacy Sandbox Update](https://privacysandbox.google.com/blog/update-on-plans-for-privacy-sandbox-technologies)
314
+
315
+ ### Classifier API: Public Proposal (GitHub)
316
+ An explainer published by the Chrome Built-in AI Team exists on GitHub:
317
+ - **Repo**: [`explainers-by-googlers/classifier-api`](https://github.com/explainers-by-googlers/classifier-api)
318
+ - **Description**: "A proposal for a high-performance, on-device browser API to classify text."
319
+ - **Status**: "a tentative and early design sketch [...] to solicit feedback"
320
+
321
+ Proposal contents:
322
+
323
+ | Element | Detail |
324
+ |---------|--------|
325
+ | JavaScript API | `Classifier.availability()`, `Classifier.create()`, `classifier.classify(text)` |
326
+ | Taxonomy | IAB Content Taxonomy V3.1 (experimental default) |
327
+ | IAB v3.1 size | 706 total categories, 588 leaf nodes ([source TSV](https://github.com/InteractiveAdvertisingBureau/Taxonomies/blob/develop/Content%20Taxonomies/Content%20Taxonomy%203.1.tsv)) |
328
+ | Output | Array of `{id: string, confidence: number}` — ID-based, not human-readable |
329
+ | Model | Described as "dedicated, on-device expert model optimized for high-speed classification" |
330
+ | Target latency | "Millisecond-level inference" |
331
+ | vs generic LLM | "A generic LLM is 'overkill' for classification" |
332
+ | Privacy | "Raw text never leaves the browser", "stateless by design" |
333
+ | Model updates | "managed by the browser component updater" |
334
+ | Non-goals | Summarization, translation, sentiment analysis; human-readable string outputs |
335
+ | Out of MVP scope | Streaming, multilingual, quota API, download progress, AbortSignal |
336
+ | Language | English only (MVP) |
337
+ | Author | Chrome Built-in AI Team |
338
+
339
+ ### Shopping Classifier: Working Pipeline (Target 43+74)
340
+ Documented by Dejan Petrovic ([blog](https://dejan.ai/blog/google-shopping-classifier/), [HuggingFace](https://huggingface.co/dejanseo/chrome_models)):
341
+
342
+ | Step | Detail |
343
+ |------|--------|
344
+ | Chunking | 100 words max per passage, 10 passages max, truncated to 64 tokens |
345
+ | Embedding | Target 43: text -> 768-dim embedding |
346
+ | Title+URL | Concatenated into a single embedding |
347
+ | Classifier input | `[title_emb(768) || mean_pool(passage_embs)(768)]` = 1536 dim |
348
+ | Output | Score 0--1 (shopping probability) |
349
+ | Storage | `VisitContentAnnotations` in Chrome's history database |
350
+ | Usage | Aggregated via `OPTIMIZATION_TARGET_SEGMENTATION_SHOPPING_USER` to enable commerce features |
351
+
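+ For reference, the 1536-dim classifier input described above reduces to a simple concatenation; `embed()` below is a placeholder for running the target 43 passage embedder (64-token window, 768-dim output).
+
+ ```python
+ import numpy as np
+
+ def shopping_features(title, url, passages):
+     title_emb = embed(title + " " + url)                        # (768,) title+URL embedding
+     passage_embs = np.stack([embed(p) for p in passages[:10]])  # at most 10 passages
+     return np.concatenate([title_emb, passage_embs.mean(axis=0)])  # (1536,) for target 74
+ ```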
352
+ ### LiteRT-LM
353
+ Open-source C++ framework by Google for on-device language model inference. Used in Chrome, Chromebook Plus, and Pixel Watch according to Google. Handles KV cache, session cloning, and CPU/GPU/NPU acceleration. The `LITERTLM` container format in which taxonomy-tiny is packaged corresponds to this framework.
354
+ Source: [Google Developers Blog](https://developers.googleblog.com/on-device-genai-in-chrome-chromebook-plus-and-pixel-watch-with-litert-lm/)
355
+
356
+ ---
357
+
358
+ ## 9. What the Facts Do Not Tell Us
359
+
360
+ The following points are not resolved by available data:
361
+
362
+ 1. **The link between target 72 and the Classifier API** -- The GitHub explainer does not name the underlying model. Target 72 has no reference to the Classifier API in its files. The correspondence is plausible but unconfirmed.
363
+
364
+ 2. **The intended use of the 370 custom tokens** -- They exist in the vocabulary and are untrained. IAB Content Taxonomy V3.1 has 706 categories (588 leaf nodes); 370 does not match either count. Their correspondence to any taxonomy is not documented anywhere.
365
+
366
+ 3. **Why a generative LLM rather than an embedding-based classifier** -- The explainer says "expert model", not "LLM". It says "millisecond-level inference", which does not match the ~3s observed on target 72. The explainer's output format is structured (`{id, confidence}`), while the Chromium controller reads a plain string (`ClassifyApiResponse { string output = 1; }`).
367
+
368
+ 4. **The relationship with the Topics API** -- Target 72 appeared (2026-02-12) after the Topics API retirement announcement (2025-10-17). Both involve taxonomies and page classification, but no source code explicitly links them.
369
+
370
+ 5. **When or whether fine-tuning will be deployed** -- The model is a pre-trained base model. No public information exists on a fine-tuning timeline.
371
+
372
+ 6. **The semantics of `BaseModelSpec.version = "0.0.0.0"`** -- This format is not documented anywhere. Other Chrome components use normal numeric versions.
373
+
374
+ ---
375
+
376
+ ## 10. References
377
+
378
+ | Source | Type | URL / Path |
379
+ |--------|------|------------|
380
+ | Chrome component (CRX) | Local file | `classifier_analysis/classifier.crx3` |
381
+ | Manifest.json | Local file | `classifier_analysis/extracted/manifest.json` |
382
+ | Installer source | Chromium | [optimization_guide_on_device_model_installer.cc](https://chromium.googlesource.com/experimental/chromium/src/+/refs/heads/main/chrome/browser/component_updater/optimization_guide_on_device_model_installer.cc) |
383
+ | Controller source | Chromium | `on_device_model_classifier_controller.cc` (read via Chromium Code Search, `experimental/chromium/src` branch, no stable direct URL) |
384
+ | Classifier API explainer | GitHub | [explainers-by-googlers/classifier-api](https://github.com/explainers-by-googlers/classifier-api) |
385
+ | Shopping classifier blog | Web | [dejan.ai/blog/google-shopping-classifier](https://dejan.ai/blog/google-shopping-classifier/) |
386
+ | Privacy Sandbox update | Web | [privacysandbox.google.com/blog/update-on-plans](https://privacysandbox.google.com/blog/update-on-plans-for-privacy-sandbox-technologies) |
387
+ | Topics deprecation | Chromium Groups | [topics-api-announce](https://groups.google.com/a/chromium.org/g/topics-api-announce/c/iQX7PC3S0Ds) |
388
+ | LiteRT-LM blog | Web | [developers.googleblog.com](https://developers.googleblog.com/on-device-genai-in-chrome-chromebook-plus-and-pixel-watch-with-litert-lm/) |
389
+ | Mission 1 report | Local file | `results/MISSION_01/target_72_inventory.md` |
390
+ | Mission 2 report | Local file | `results/MISSION_02/target_configs_comparison.md` |
391
+ | Mission 4 report | Local file | `results/MISSION_04/weight_variance_analysis.md` |
392
+ | Mission 5 report | Local file | `results/MISSION_05/custom_tokens_clustering.md` |