Update OptGuideOnDeviceClassifierModel/CLASSIFIER_MODEL_ANALYSIS.md

#17
by oliseg - opened
OptGuideOnDeviceClassifierModel/CLASSIFIER_MODEL_ANALYSIS.md CHANGED
@@ -1,275 +1,392 @@
1
- # OptGuideOnDeviceClassifierModel Complete Analysis
2
-
3
- ## Overview
4
-
5
- **OptGuideOnDeviceClassifierModel** is a 120 MB on-device language model shipped with Chrome Canary as a Chrome Component. Its manifest names it **"Optimization Guide On Device Taxonomy Model"**, with a base model spec called **`taxonomy-tiny`**.
6
-
7
- It is a **Gemma 2 variant** purpose-built for **page-level classification** — specifically extracting the **brand** and **intent** of web pages for Chrome's client-side **scam/phishing detection** pipeline.
8
-
9
- | Field | Value |
10
- |---|---|
11
- | Manifest name | Optimization Guide On Device Taxonomy Model |
12
- | Base model | `taxonomy-tiny` v0.0.0.0 |
13
- | Component version | `2026.2.12.1554` |
14
- | Component ID (CRX) | `eidcjfoningnkhpoelgpjemmhmopkeoi` |
15
- | File | `weights.bin` (126,025,728 bytes / 120.19 MB) |
16
- | Execution config | Empty (0 bytes) — no prompt template bundled |
17
- | Performance hint | `3` |
18
- | Availability | **Chrome Canary** (not tested in Stable) |
19
- | Optimization target | `OPTIMIZATION_TARGET_MODEL_EXECUTION_FEATURE_CLASSIFIER` (ID 72) |
20
- | Chrome feature flag | `ClientSideDetectionBrandAndIntentForScamDetection` |
21
-
22
- ---
23
-
24
- ## Purpose: Scam Detection via Brand + Intent Classification
25
-
26
- Chrome's Client-Side Detection (CSD) system extracts page text from suspicious websites and sends it to this model with the following prompt (decoded from `on_device_model_execution_config.pb` of model ID 55):
27
-
28
- ```
29
- You are a web page text scanner. Your task is to carefully review text from
30
- a web page and answer the following questions in English:
31
-
32
- 1) What brand does the page represent?
33
- 2) In one complete sentence, summarize what this page aims to do.
34
- Do not leak PII data.
35
-
36
- You should output your answers strictly in the following JSON format:
37
-
38
- {"brand": "<brand>", "intent": "<intent>"}
39
-
40
- Do not use ```json``` block in your output.
41
-
42
- Text: [PAGE CONTENT HERE]
43
- ```
44
-
45
- The expected response conforms to this JSON schema:
46
-
47
- ```json
48
- {
49
- "type": "object",
50
- "additionalProperties": false,
51
- "properties": {
52
- "brand": { "type": "string" },
53
- "intent": { "type": "string" }
54
- },
55
- "required": ["brand", "intent"]
56
- }
57
- ```
58
-
59
- When the detected brand/intent combination is inconsistent with the actual page behavior (e.g., a page claiming to be PayPal but actually harvesting credentials on an unrelated domain), Chrome flags the page as a potential scam via Safe Browsing.
60
-
61
- ---
62
-
63
- ## Binary Format: LITERTLM Container
64
-
65
- The `weights.bin` file is **not** a raw TFLite model. It uses the **LITERTLM** (LiteRT Language Model) container format — a proprietary Google ODML packaging format with a FlatBuffer header and multiple embedded submodels.
66
-
67
- ### File Layout
68
-
69
- ```
70
- Offset Component Size
71
- ────────────────────────────────────────────────────────────
72
- 0x00000000 LITERTLM FlatBuffer header 32 KB
73
- Magic: "LITERTLM"
74
- Version: 1
75
- Submodels: 4 entries declared
76
- Metadata:
77
- model_type = "tf_lite_prefill_decode"
78
- model_type = "tf_lite_embedder"
79
- model_version = "1.0.1"
80
- Authors = "ODML team"
81
-
82
- 0x00008000 TFLite #1 — Embedder 8.20 MB (8,601,600 bytes)
83
- Input: token_ids [1, 1] int32
84
- Output: embeddings [1, 1, 1024] float32
85
- Op: lookup_embedding_table
86
- TFLite runtime: 2.18.0
87
-
88
- 0x0083C000 TFLite #2 — Prefill + Decode 111.63 MB (117,055,216 bytes)
89
- 2 signatures: "prefill" and "decode"
90
- 39 inputs (embeddings + position + mask + 36 KV cache)
91
- 37 outputs (36 KV cache + logits [1, 1, 16384])
92
- 18 transformer layers
93
- Full Gemma 2 architecture
94
-
95
- 0x077E0000 SentencePiece tokenizer 305.6 KB (312,918 bytes)
96
- Vocab size: 16,384 tokens
97
- Special tokens: <pad>=0, </s>=1, <s>=2, <unk>=3
98
- 256 byte-fallback tokens
99
- Normalizer: nmt_nfkc
100
-
101
- 0x0782C656 Zero padding to alignment 14.7 KB
102
- 0x07830000 End of file 126,025,728 bytes total
103
- ```
104
-
105
- ### How to Extract the Submodels
106
-
107
- ```python
108
- data = open('weights.bin', 'rb').read()
109
-
110
- # TFLite embedder
111
- open('embedder.tflite', 'wb').write(data[0x8000:0x83C000])
112
-
113
- # TFLite prefill+decode transformer
114
- open('decoder.tflite', 'wb').write(data[0x83C000:0x77DDEF0])
115
-
116
- # SentencePiece tokenizer
117
- open('tokenizer.model', 'wb').write(data[0x77E0000:0x782C656])
118
- ```
119
-
120
- ---
121
-
122
- ## Architecture: Gemma 2 "taxonomy-tiny"
123
-
124
- The model is a **distilled Gemma 2** with reduced dimensions, confirmed by layer name analysis of the TFLite graph.
125
-
126
- ### Specifications
127
-
128
- | Parameter | Value | Evidence |
129
- |---|---|---|
130
- | Architecture family | **Gemma 2** | QK normalization + post-FFN norm = Gemma 2 exclusive features |
131
- | Transformer layers | **18** | `layer_0` through `layer_17` in tensor names |
132
- | Embedding dimension | **1024** | Embedder output shape `[1, 1, 1024]` |
133
- | KV cache dimension | **256** per layer | KV input/output shape `[1, 1, 1, 256]` |
134
- | Vocabulary size | **16,384** | Logits output shape `[1, 1, 16384]`; SentencePiece vocab |
135
- | Normalization | **RMSNorm** | `rms_norm/mul`, `rms_norm/rsqrt`, `rms_norm/square` |
136
- | Pre-attention norm | **Yes** | `pre_attention_norm/rms_norm` |
137
- | Pre-FFN norm | **Yes** | `pre_ffw_norm` patterns |
138
- | Post-FFN norm | **Yes** | Post-FFN norm present (Gemma 2 specific) |
139
- | QK normalization | **Yes** | `key_norm/rms_norm` (Gemma 2 specific) |
140
- | Positional encoding | **RoPE** | `maybe_rope/concatenate` |
141
- | Attention type | **Full attention** | No sliding window patterns found |
142
- | Activation | **GeLU** (likely) | Standard for Gemma 2 |
143
- | Quantization | **Mixed INT4/INT8** | 120 MB for 18 layers with 1024 dim implies heavy quantization |
144
- | Estimated parameters | **~100–200M** | Based on file size and quantization assumptions |
145
- | TFLite signatures | `prefill` (no logits) + `decode` (with logits) | Standard ODML LLM execution pattern |
146
-
147
- ### Comparison with Known Models
148
-
149
- | | **taxonomy-tiny** | Gemma 2 2B | Gemini Nano v3 |
150
- |---|---|---|---|
151
- | Layers | 18 | 26 | ~32 |
152
- | Embed dim | 1,024 | 2,304 | unknown |
153
- | Vocab size | 16,384 | 256,128 | 256,128 |
154
- | File size | 120 MB | ~2.6 GB | 4.07 GB |
155
- | QK norm | Yes | Yes | Yes |
156
- | Post-FFN norm | Yes | Yes | Yes |
157
- | Sliding window | No | Yes (alternating) | Yes |
158
- | Purpose | Classification | General | General |
159
-
160
- ### Single Transformer Block Structure
161
-
162
- From tensor name analysis, each of the 18 layers contains:
163
-
164
- ```
165
- layer_N/
166
- ├── layer_N.pre_qkv/
167
- │ ├── pre_attention_norm/rms_norm/ (RMSNorm)
168
- │ └── attn._pre_attention_fn/
169
- │ └── maybe_rope/ (RoPE positional encoding)
170
- ├── attn.dot_product_attention/
171
- │ └── dot_attn._qkv_fn/
172
- │ ├── key_norm/rms_norm/ (QK normalization)
173
- │ ├── dot_general (Q*K)
174
- │ ├── tfl_softmax
175
- │ ├── dot_general (attn*V)
176
- │ └── reshape/transpose
177
- ├── layer_N.post_qkv/
178
- │ ├── attn.post_qkv/attn_vec_einsum/ (output projection)
179
- │ ├── add (residual)
180
- │ └── add1 (post-attention residual)
181
- ├── layer_N.update_cache/
182
- │ └── attn.update_cache/concatenate (KV cache update)
183
- └── [pre_ffw_norm + FFN + post_ffw_norm] (feed-forward block)
184
- ```
185
-
186
- Final output: `final_norm/rms_norm` → `decode_softmax` → logits `[1, 1, 16384]`
187
-
188
- ---
189
-
190
- ## Tokenizer: Reduced Gemma Vocabulary
191
-
192
- The embedded SentencePiece model uses a **16,384-token vocabulary** — a dramatic reduction from Gemma's standard 256,128 tokens. This is consistent with a classification-focused model that doesn't need the full multilingual generative vocabulary.
193
-
194
- | Property | Value |
195
- |---|---|
196
- | Vocab size | 16,384 |
197
- | BOS token | `<s>` (id=2) |
198
- | EOS token | `</s>` (id=1) |
199
- | PAD token | `<pad>` (id=0) |
200
- | UNK token | `<unk>` (id=3) |
201
- | Byte fallbacks | 256 tokens (`<0x00>` through `<0xFF>`) |
202
- | Normalizer | `nmt_nfkc` |
203
-
204
- Notably, Gemma's conversation tokens (`<start_of_turn>`, `<end_of_turn>`) are **absent** from this vocabulary — they map to UNK (id=3). The model does not use chat-turn formatting.
205
-
206
- Sample vocabulary entries:
207
-
208
- ```
209
- [ 260] = '.' [ 500] = '▁such' [ 1000] = '▁amount'
210
- [ 2000] = '▁Q' [ 5000] = '▁tradition' [10000] = '▁Computer'
211
- [15000] = '▁Philosophy' [16383] = '▁<custom370>'
212
- ```
213
-
214
- ---
215
-
216
- ## Chrome Integration Pipeline
217
-
218
- ```
219
- User visits a page
220
-
221
-
222
- ┌─────────────────────────────┐
223
- │ Safe Browsing Heuristics │ Pre-filter: URL reputation, phishing
224
- │ (CSD - Client Side Det.) │ patterns, keyboard lock API, etc.
225
- └──────────┬──────────────────┘
226
- Page flagged as suspicious
227
-
228
- ┌─────────────────────────────┐
229
- │ Page Text Extraction │ Extract visible text content from DOM
230
- └──────────┬──────────────────┘
231
-
232
-
233
- ┌─────────────────────────────┐
234
- │ Prompt Construction │ "You are a web page text scanner..."
235
- (from model ID 55 config) │ + page text appended
236
- └──────────┬──────────────────┘
237
-
238
- ┌─────┴──────┐
239
- ▼ ▼
240
- ┌─────────┐ ┌──────────────┐
241
- Gemini │ │ taxonomy- │ Whichever model is available
242
- Nano │ │ tiny │ (taxonomy-tiny is 33x smaller)
243
- (4 GB) │ │ (120 MB) │
244
- └────┬────┘ └──────┬───────┘
245
- │ │
246
- └──────┬───────┘
247
-
248
- ┌─────────────────────────────┐
249
- │ JSON Response Parsing │ {"brand": "PayPal",
250
- │ │ "intent": "credential harvesting"}
251
- └──────────┬──────────────────┘
252
-
253
-
254
- ┌─────────────────────────────┐
255
- │ Verdict Logic │ Compare brand claim vs. actual domain,
256
- │ │ intent vs. page behavior
257
- └──────────┬──────────────────┘
258
-
259
-
260
- ┌─────────────────────────────┐
261
- │ Safe Browsing Warning │ Red interstitial page shown to user
262
- └─────────────────────────────┘
263
- ```
264
-
265
- ### Trigger Conditions
266
-
267
- The classifier does **not** run on every page. It triggers when Chrome's CSD heuristics detect suspicious signals:
268
-
269
- - Phishing URL patterns (Safe Browsing prefix match)
270
- - Keyboard Lock API usage (common in tech support scams)
271
- - Aggressive popups or fullscreen requests
272
- - Form fields requesting sensitive data (passwords, SSN, credit cards)
273
- - Urgency language patterns
274
-
275
- ---
1
+ # Chrome taxonomy-tiny: Observed Facts
2
+
3
+ **Date:** 2026-04-19
4
+ **Analysts:** Local research (reverse engineering) + public Chromium sources
5
+
6
+ ---
7
+
8
+ ## 1. The Chrome Component
9
+
10
+ ### Identity
11
+ | Field | Value | Source |
12
+ |-------|-------|--------|
13
+ | CRX ID | `eidcjfoningnkhpoelgpjemmhmopkeoi` | `verified_contents.json` |
14
+ | Name | `Optimization Guide On Device Taxonomy Model` | `manifest.json` |
15
+ | Version | `2026.2.12.1554` | `manifest.json` |
16
+ | BaseModelSpec.name | `taxonomy-tiny` | `manifest.json` |
17
+ | BaseModelSpec.version | `0.0.0.0` | `manifest.json` |
18
+ | Performance hints | `[3]` | `manifest.json` |
19
+
20
+ ### Signed Files (CRX verified_contents)
21
+ The component contains exactly 3 files signed by Google:
22
+
23
+ | File | Size | Content |
24
+ |------|------|---------|
25
+ | `manifest.json` | 247 bytes | Component metadata |
26
+ | `on_device_model_execution_config.pb` | **0 bytes** | Empty |
27
+ | `weights.bin` | 126,025,728 bytes (120.2 MB) | LITERTLM container |
28
+
29
+ No `adaptation_weights.bin`, `adapter.bin`, `lora.bin`, or `model-info.pb` file exists in the component.
30
+
31
+ ### LITERTLM Container (weights.bin)
32
+ | Entry | Size | Description |
33
+ |-------|------|-------------|
34
+ | `embedder.tflite` | 8,601,600 bytes | TFLite v3 |
35
+ | `prefill_decode.tflite` | 117,391,360 bytes | TFLite v3 |
36
+ | `tokenizer.spm` | 312,918 bytes (LITERTLM header; extracted file: 312,917 bytes) | SentencePiece |
37
+ | `model_version` | 5 bytes | String `"1.0.1"` |
38
+
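+ The offsets of the embedded blobs are declared in the LITERTLM FlatBuffer header. As a quick alternative, the sketch below locates the two TFLite submodels by their FlatBuffer file identifier (`TFL3`, which sits 4 bytes into a `.tflite` file) and slices them out with the sizes from the table above. The scan is naive (a chance `TFL3` byte sequence inside the weight data would throw it off) and the output file names are simply the ones used in the rest of this report; parsing the header is the robust route.
+
+ ```python
+ # Naive extraction sketch -- assumes weights.bin in the working directory
+ # and uses the entry sizes listed in the table above.
+ data = open("weights.bin", "rb").read()
+
+ # Candidate starts of embedded .tflite blobs: the "TFL3" identifier sits
+ # 4 bytes after the start of a TFLite FlatBuffer.
+ starts, pos = [], data.find(b"TFL3")
+ while pos != -1:
+     starts.append(pos - 4)
+     pos = data.find(b"TFL3", pos + 1)
+
+ sizes = {"embedder.tflite": 8_601_600, "prefill_decode.tflite": 117_391_360}
+ for (name, size), start in zip(sizes.items(), starts[:2]):
+     open(name, "wb").write(data[start:start + size])
+ # tokenizer.spm follows the large TFLite blob; its offset is best read from
+ # the LITERTLM header (not shown here).
+ ```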
39
+ ---
40
+
41
+ ## 2. Model Architecture
42
+
43
+ ### Transformer Specifications
44
+ | Parameter | Value | Measurement Method |
45
+ |-----------|-------|--------------------|
46
+ | Family | Gemma (named "TinyGemma" in Chromium code) | Chromium C++ source |
47
+ | Layers | 18 transformer decoder | TFLite inspection |
48
+ | Parameters | ~319M estimated (INT4 quantized) | Estimated: embedding 16384x1024 + 18 layers x (gate 2048x1024 + up 2048x1024 + down 1024x2048 + o_proj 1024x1024 + norms) + tied LM head |
49
+ | Embedding dim | 1024 | Embedder output tensor |
50
+ | KV head dim | 256, 1 head per layer | KV cache tensors |
51
+ | Quantization | INT4 stored as INT8 (15 unique values) | Weight analysis |
52
+ | Embedding table vs LM head | Byte-for-byte identical matrices | Full matrix comparison |
53
+ | Execution | CPU only | Chromium C++ source |
54
+
55
+ ### TFLite Signatures
56
+
57
+ **embedder.tflite:**
58
+ | Input/Output | Shape | Type |
59
+ |--------------|-------|------|
60
+ | Input: `token_ids` | `[1, 1]` | int32 |
61
+ | Output: `embeddings` | `[1, 1, 1024]` | float32 |
62
+
63
+ **prefill_decode.tflite -- `prefill` signature:**
64
+ | Input/Output | Shape | Type |
65
+ |--------------|-------|------|
66
+ | Input: `embeddings` | `[1, N, 1024]` | float32 |
67
+ | Input: `input_pos` | `[N]` | int32 |
68
+ | Input: `mask` | `[1, 1, N, N]` | float32 |
69
+ | Input: 36 KV caches | `[1, N, 1, 256]` | float32 |
70
+ | Output: 36 KV caches (updated) | `[1, N, 1, 256]` | float32 |
71
+ | Output: **no logits** | -- | -- |
72
+
73
+ **prefill_decode.tflite -- `decode` signature:**
74
+ | Input/Output | Shape | Type |
75
+ |--------------|-------|------|
76
+ | Input: `embeddings` | `[1, 1, 1024]` | float32 |
77
+ | Input: `input_pos` | `[1]` | int32 |
78
+ | Input: `mask` | `[1, 1, 1, 1]` | float32 |
79
+ | Input: 36 KV caches | `[1, ?, 1, 256]` | float32 |
80
+ | Output: 36 KV caches (updated) | `[1, ?, 1, 256]` | float32 |
81
+ | Output: `logits` | `[1, 1, 16384]` | float32 |
82
+
83
+ ### KV Cache: Observed Behavior
84
+ The model writes new KV values **at position `input_pos` within the existing buffer**. Output has the same size as input. The cache must be **replaced** with the output, not concatenated.
85
+
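+ The sketch below drives the two extracted submodels with `tf.lite.Interpreter` signature runners to illustrate this replace-not-concatenate behavior. The signature names (`prefill`, `decode`), the input names `token_ids`, `embeddings`, `input_pos`, `mask`, and the output names `embeddings` and `logits` come from the tables above; everything else (a single signature on the embedder, cache outputs reusing the input names, an all-zero mask) is an assumption made to keep the example short.
+
+ ```python
+ import numpy as np
+ import tensorflow as tf
+
+ emb = tf.lite.Interpreter(model_path="embedder.tflite")
+ dec = tf.lite.Interpreter(model_path="prefill_decode.tflite")
+ embed = emb.get_signature_runner()            # assumes a single embedder signature
+ decode = dec.get_signature_runner("decode")   # the "prefill" runner is obtained the same way
+
+ # Build zeroed KV caches with the shapes the decode signature declares.
+ det = decode.get_input_details()
+ kv_names = [n for n in det if n not in ("embeddings", "input_pos", "mask")]
+ cache = {n: np.zeros(det[n]["shape"], np.float32) for n in kv_names}
+
+ def decode_step(token_id, pos, cache):
+     e = embed(token_ids=np.array([[token_id]], np.int32))["embeddings"]
+     out = decode(embeddings=e,
+                  input_pos=np.array([pos], np.int32),
+                  mask=np.zeros((1, 1, 1, 1), np.float32),  # placeholder mask
+                  **cache)
+     # Replace the cache with the updated tensors (assumes cache outputs carry
+     # the same names as the inputs; adjust the mapping if they differ).
+     return out["logits"], {n: out[n] for n in kv_names}
+ ```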
86
+ ---
87
+
88
+ ## 3. Tokenizer
89
+
90
+ ### Vocabulary Structure (16,384 tokens)
91
+ | Range | Tokens | Role |
92
+ |-------|--------|------|
93
+ | 0 | `<pad>` | Padding |
94
+ | 1 | `</s>` | End of sequence |
95
+ | 2 | `<s>` | Beginning of sequence (BOS) |
96
+ | 3 | `<unk>` | Unknown |
97
+ | 4--259 | `<0x00>`--`<0xFF>` | Byte fallback |
98
+ | 260--15999 | Text tokens | Standard SentencePiece vocabulary |
99
+ | 16000--16001 | `▁<start_of_audio>`, `▁<end_of_audio>` | Audio markers |
100
+ | 16002--16003 | `▁<start_of_image>`, `▁<end_of_image>` | Image markers |
101
+ | 16004--16013 | `▁<ctrl1>` -- `▁<ctrl10>` | Control tokens |
102
+ | 16014--16383 | `▁<custom1>` -- `▁<custom370>` | Custom tokens |
103
+
104
+ ### Tokenization of `<ctrl1>`
105
+ - `" <ctrl1>"` (with preceding space) -> single token 16004
106
+ - `"<ctrl1>"` (without space) -> decomposed into individual characters (byte fallback)
107
+ - `sp.PieceToId('<ctrl1>')` returns 3 (UNK) because the actual piece name is `▁<ctrl1>`
108
+
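+ These observations can be reproduced directly against the extracted `tokenizer.spm` with the `sentencepiece` Python package; the expected values in the comments are the ones reported above.
+
+ ```python
+ import sentencepiece as spm
+
+ sp = spm.SentencePieceProcessor(model_file="tokenizer.spm")
+
+ print(sp.encode(" <ctrl1>", out_type=int))   # reported: single id 16004 ("▁<ctrl1>")
+ print(sp.encode("<ctrl1>", out_type=int))    # reported: per-character fallback ids
+ print(sp.piece_to_id("<ctrl1>"))             # 3 (UNK) -- the stored piece is "▁<ctrl1>"
+ print(sp.piece_to_id("▁<ctrl1>"))            # 16004
+ ```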
109
+ ---
110
+
111
+ ## 4. Chromium Source Code
112
+
113
+ ### File: `optimization_guide_on_device_model_installer.cc`
114
+
115
+ Confirmed against the Chromium source at:
116
+ `https://chromium.googlesource.com/experimental/chromium/src/+/refs/heads/main/chrome/browser/component_updater/optimization_guide_on_device_model_installer.cc`
117
+
118
+ ```cpp
119
+ // Extension id is eidcjfoningnkhpoelgpjemmhmopkeoi.
120
+ constexpr char kClassifierModelManifestName[] =
121
+ "Optimization Guide On Device Taxonomy Model";
122
+ constexpr base::FilePath::CharType kClassifierModelInstallationRelativePath[] =
123
+ FILE_PATH_LITERAL("OptGuideOnDeviceClassifierModel");
124
+ ```
125
+
126
+ The component is registered through an `OptimizationGuideOnDeviceClassifierModelInstallerPolicy` class that inherits from `OnDeviceModelInstallerPolicy`.
127
+
128
+ The `OnDeviceModelType` enum distinguishes two model types:
129
+ ```cpp
130
+ enum OnDeviceModelType { kBaseModel, kClassifierModel };
131
+ ```
132
+
133
+ ### File: `on_device_model_classifier_controller.cc`
134
+
135
+ Hardcoded classification config (Chromium source, `experimental/chromium/src` branch, Copyright 2026 in file header):
136
+ ```cpp
137
+ // Prompt template
138
+ substitution->set_string_template("%s <ctrl1>");
139
+
140
+ // Proto types
141
+ // Request: optimization_guide.proto.ClassifyApiRequest { string text = 1; }
142
+ // Response: optimization_guide.proto.ClassifyApiResponse { string output = 1; }
143
+
144
+ // Feature ID
145
+ config.set_feature(proto::ModelExecutionFeature::MODEL_EXECUTION_FEATURE_CLASSIFIER);
146
+ // = 33
147
+
148
+ // Execution
149
+ params->backend_type = ml::ModelBackendType::kCpuBackend;
150
+ params->max_tokens = kOnDeviceModelMaxTokens; // 10240
151
+
152
+ // Log on disconnect
153
+ LOG(ERROR) << "TinyGemma model disconnected unexpectedly.";
154
+ ```
155
+
156
+ ### Proto: `substitution.proto`
157
+ ```protobuf
158
+ enum ControlToken {
159
+ CONTROL_TOKEN_UNSPECIFIED = 0;
160
+ SYSTEM = 1; // <ctrl1>
161
+ MODEL = 2; // <ctrl2>
162
+ USER = 3; // <ctrl3>
163
+ END = 4; // <ctrl4>
164
+ }
165
+ ```
166
+
167
+ ### Execution Pipeline in Chrome
168
+ ```
169
+ Input text -> ClassifyApiRequest { text: "..." }
170
+ -> FormatString: "%s <ctrl1>" -> "text <ctrl1>"
171
+ -> BOS + tokenize -> embedder -> prefill/decode -> logits -> sample
172
+ -> ParseResponse -> ClassifyApiResponse { output: "..." }
173
+ ```
174
+
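+ A minimal end-to-end sketch of this pipeline, reusing `sp` from the tokenizer sketch and `decode_step()`/`cache` from the KV-cache sketch: the caller passes a freshly zeroed cache, the `"%s <ctrl1>"` template and greedy sampling mirror the Chromium config, and for brevity the prompt is prefilled one token at a time through the `decode` signature rather than the batched `prefill` signature.
+
+ ```python
+ import numpy as np
+
+ def classify(text, cache, max_new_tokens=64):
+     prompt = "%s <ctrl1>" % text                        # FormatString step
+     ids = [sp.bos_id()] + sp.encode(prompt, out_type=int)
+
+     pos, logits = 0, None
+     for tok in ids:                                     # prefill via decode signature
+         logits, cache = decode_step(tok, pos, cache)
+         pos += 1
+
+     out = []
+     for _ in range(max_new_tokens):                     # greedy decode
+         nxt = int(np.argmax(logits[0, -1]))
+         if nxt == sp.eos_id():                          # </s> = id 1
+             break
+         out.append(nxt)
+         logits, cache = decode_step(nxt, pos, cache)
+         pos += 1
+     return sp.decode(out)                               # ClassifyApiResponse.output string
+ ```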
175
+ ---
176
+
177
+ ## 5. Observed Inference Behavior
178
+
179
+ ### Text Generation
180
+ When provided with text followed by `<ctrl1>`, the model **echoes and reformats** the input:
181
+
182
+ | Input | Generated Output |
183
+ |-------|-----------------|
184
+ | `best laptop deals gaming electronics` | `Best laptop deals: gaming electronics` |
185
+ | `Italian cooking recipes pasta carbonara` | `Italian cooking recipes Pasta carbonara` |
186
+ | `NBA basketball scores live updates` | `NBA Basketball Scores Live Update` |
187
+ | `stock market investment trading` | `Stock market investment trading` |
188
+
189
+ Across the 4 tests performed, the output contains capitalization and punctuation changes relative to the input. No classification labels or category identifiers were observed in any tested output.
190
+
191
+ ### Custom Tokens in Generation
192
+ | Metric | Value |
193
+ |--------|-------|
194
+ | Custom token max logit | -19 to -21 |
195
+ | Regular token max logit | +19 to +27 |
196
+ | Gap | ~37 to 46 logit units |
197
+ | Custom token entropy ratio | 1.0000 (uniform) |
198
+
199
+ Across all tests performed, custom tokens (16014-16383) had logits between -19 and -21, producing negligible probability after softmax.
200
+
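+ A quick numeric check of why that gap matters: softmax probabilities fall off exponentially with the logit difference, so a token sitting ~40 logits below the top candidate gets essentially zero mass. The values below are representative numbers from the table, not measured logit vectors.
+
+ ```python
+ import numpy as np
+
+ logits = np.array([27.0, 19.0, -19.0, -21.0])  # top regular vs. custom tokens
+ p = np.exp(logits - logits.max())
+ p /= p.sum()
+ print(p)  # custom-token entries land around exp(-46) ~ 1e-20
+ ```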
201
+ ### Embedding Quality (Layer 17 KV cache as representation)
202
+ | Metric | Last Position | Mean-Pooled |
203
+ |--------|--------------|-------------|
204
+ | Within-category similarity | 0.849 | 0.783 |
205
+ | Across-category similarity | 0.805 | 0.756 |
206
+ | Separation gap | 0.044 | 0.027 |
207
+
208
+ Within-category similarity exceeds across-category similarity by 0.044 (last position) and 0.027 (mean-pooled).
209
+
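+ The similarity numbers above can be computed as mean pairwise cosines; the sketch below shows the scheme under the assumption that `reps` maps each test category to a list of layer-17 KV-cache vectors (last-position or mean-pooled, as in the table).
+
+ ```python
+ from itertools import combinations
+ import numpy as np
+
+ def cosine(a, b):
+     return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
+
+ def separation(reps):
+     """reps: {category: [vector, ...]} -> (within, across, gap)."""
+     within = [cosine(a, b) for vecs in reps.values() for a, b in combinations(vecs, 2)]
+     across = [cosine(a, b) for v1, v2 in combinations(reps.values(), 2)
+               for a in v1 for b in v2]
+     return np.mean(within), np.mean(across), np.mean(within) - np.mean(across)
+ ```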
210
+ ### Performance
211
+ | Operation | Duration (~16 tokens) |
212
+ |-----------|-----------------------|
213
+ | Decode only (token by token) | ~2.0--2.4s |
214
+ | Prefill batch | ~1.3s |
215
+ | Prefill + full decode | ~3.0s |
216
+
217
+ ---
218
+
219
+ ## 6. Weight Analysis
220
+
221
+ ### Embedding Table: Custom vs Regular Tokens
222
+
223
+ | Metric | Custom (370) | Regular (370) | Multimodal (14) |
224
+ |--------|-------------|---------------|-----------------|
225
+ | Intra-group cosine | **0.999937** | 0.051 | 0.886 |
226
+ | L2 norm mean | 32.045 | 29.616 | 32.029 |
227
+ | L2 norm std | **0.007** | 3.019 | 0.077 |
228
+ | Per-dimension variance (mean) | **0.0004** | 6.600 | -- |
229
+ | Regular/custom variance ratio | **14,817x** | -- | -- |
230
+ | K-means k=50: distinct clusters | 33 (collapse) | 50 | -- |
231
+ | DBSCAN: clusters (eps x1.0) | 2 | 1 | 1 |
232
+
233
+ K-means at k=50 collapses to 33 clusters (ConvergenceWarning). DBSCAN at eps x1.0 finds 2 clusters. No significant intra/inter-group separation detected for the grid patterns tested (10x37, 37x10, 74x5, 185x2).
234
+
235
+ Multimodal tokens (16000-16013) show 2 sub-clusters with cosine ~0.20 between groups. Custom tokens show no sub-clusters.
236
+
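+ The clustering checks can be reproduced roughly as below. Here `emb` stands for the dequantized 16384x1024 embedding table (extraction not shown), the 370-token regular sample mirrors the table above, and the eps heuristic (median pairwise distance times a scale factor) is an assumption about the report's "eps x1.0" notation.
+
+ ```python
+ import numpy as np
+ from scipy.spatial.distance import pdist
+ from sklearn.cluster import KMeans, DBSCAN
+
+ rng = np.random.default_rng(0)
+ custom = emb[16014:16384]                                   # the 370 custom tokens
+ regular = emb[rng.choice(np.arange(260, 16000), 370, replace=False)]
+
+ def clusters(x, k=50, eps_scale=1.0):
+     km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(x)
+     kmeans_distinct = len(np.unique(km.labels_))            # < k when centroids collapse
+     db = DBSCAN(eps=np.median(pdist(x)) * eps_scale, min_samples=5).fit(x)
+     dbscan_clusters = len(set(db.labels_) - {-1})           # -1 = noise
+     return kmeans_distinct, dbscan_clusters
+ ```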
237
+ ### Transformer Layers: Weight Distribution
238
+
239
+ | Metric | Value (54 FFN tensors, 18 layers) |
240
+ |--------|-----------------------------------|
241
+ | Mean std | 2.6252 |
242
+ | Std of stds | 0.0093 |
243
+ | Coefficient of variation | **0.36%** |
244
+ | % outliers | 0.00% (all layers) |
245
+ | % zeros | 15.3--16.4% |
246
+
247
+ The coefficient of variation across the 54 FFN tensors is 0.36%. Differences between the early, mid, and late layer groups:
248
+
249
+ | Weight type | Early (0-5) | Mid (6-11) | Late (12-17) | Variation |
250
+ |-------------|-------------|------------|--------------|-----------|
251
+ | gate_proj std | 2.6319 | 2.6243 | 2.6307 | 0.8% |
252
+ | up_proj std | 2.6314 | 2.6246 | 2.6318 | 0.8% |
253
+ | down_proj std | 2.6217 | 2.6090 | 2.6214 | 1.3% |
254
+
255
+ `o_proj` has std of 18.49 (7x the FFN layers), constant across all 18 layers (18.492--18.496).
256
+
257
+ ### Normalization Layers
258
+ - Layer 0: all RMSNorm values are zero
259
+ - Layers 1-17: RMSNorm values range from 3.89 to 14.02
260
+
261
+ ---
262
+
263
+ ## 7. Comparison with Other Chrome Targets
264
+
265
+ ### Target 72 vs Targets 43 and 74
266
+
267
+ | Attribute | Target 43 (Embedder) | Target 72 (Classifier) | Target 74 (Shopping) |
268
+ |-----------|---------------------|----------------------|---------------------|
269
+ | Optimization target | `PASSAGE_EMBEDDER` | `CLASSIFIER` | `SHOPPING_CLASSIFIER` |
270
+ | Timestamp in model-info.pb / version | 2024-06-25 (timestamp) | 2026-02-12 (inferred from version `2026.2.12.1554`) | 2026-03-30 (timestamp) |
271
+ | Config populated | Yes (147 bytes) | **No (0 bytes)** | Yes (121 bytes) |
272
+ | model-info.pb | Present | **Absent** | Present |
273
+ | Metadata type | `PassageEmbeddingsModelMetadata` | None | `CategoryClassifierMetadata` |
274
+ | Architecture | Sentence embedder TFLite | TinyGemma LLM 18 layers | Linear classifier head TFLite |
275
+ | Model size | 112 MB | 120 MB | 789 KB |
276
+ | Input | int32[1,64] token IDs | Text + `<ctrl1>` | float32[1,1536] embeddings |
277
+ | Output | float32[1,768] embedding | Generated text | float32[1,1] score |
278
+ | Tokenizer | sentencepiece.model (794 KB) | tokenizer.spm (313 KB) | Uses target 43's |
279
+ | Input window | 64 tokens | ~10240 tokens | N/A |
280
+ | Depends on another target | No | No | Yes (target 43, v >= 2026-02-02) |
281
+
282
+ Target 74's metadata declares a dependency on target 43 via `required_embedder_version`. Target 74's input shape (1536-dim) matches the concatenation of two 768-dim embeddings from target 43. Target 72 has no declared dependency.
283
+
284
+ ### Target 72 vs Target 62 (Proofreader)
285
+
286
+ | Attribute | Target 62 | Target 72 |
287
+ |-----------|-----------|-----------|
288
+ | Own model | No (uses shared Gemini Nano ~1.7 GB) | Yes (standalone TinyGemma 120 MB) |
289
+ | LoRA adapter | Yes (`adaptation_weights.bin`, 16.6 MB) | No |
290
+ | Config | Non-empty | Empty (0 bytes) |
291
+
292
+ ### Legacy Topics System: PAGE_TOPICS_V2
293
+
294
+ | Attribute | PAGE_TOPICS_V2 | Target 72 |
295
+ |-----------|---------------|-----------|
296
+ | Target ID | 15 | 72 |
297
+ | Architecture | BERT (BertNLClassifier) | TinyGemma (Gemma) |
298
+ | Size | ~2.7 MB | ~120 MB |
299
+ | Input | Cleaned hostname | Full page content |
300
+ | Output | 469 topic probabilities | Generated text |
301
+ | Taxonomy | v2, 469 topics (IDs 1-629) | None in the component |
302
+ | Override list | ~47,128 domain-to-topic mappings | None |
303
+
304
+ ---
305
+
306
+ ## 8. Verified External Context
307
+
308
+ ### Topics API: Deprecation Status
309
+ - **October 17, 2025**: Google announces retirement of the Topics API, Protected Audience, and Attribution Reporting
310
+ - **Deprecation** scheduled for Chrome 144
311
+ - **Full removal** scheduled for Chrome 150
312
+ - **Reason**: insufficient ecosystem adoption
313
+ - Source: [Privacy Sandbox Update](https://privacysandbox.google.com/blog/update-on-plans-for-privacy-sandbox-technologies)
314
+
315
+ ### Classifier API: Public Proposal (GitHub)
316
+ An explainer published by the Chrome Built-in AI Team exists on GitHub:
317
+ - **Repo**: [`explainers-by-googlers/classifier-api`](https://github.com/explainers-by-googlers/classifier-api)
318
+ - **Description**: "A proposal for a high-performance, on-device browser API to classify text."
319
+ - **Status**: "a tentative and early design sketch [...] to solicit feedback"
320
+
321
+ Proposal contents:
322
+
323
+ | Element | Detail |
324
+ |---------|--------|
325
+ | JavaScript API | `Classifier.availability()`, `Classifier.create()`, `classifier.classify(text)` |
326
+ | Taxonomy | IAB Content Taxonomy V3.1 (experimental default) |
327
+ | IAB v3.1 size | 706 total categories, 588 leaf nodes ([source TSV](https://github.com/InteractiveAdvertisingBureau/Taxonomies/blob/develop/Content%20Taxonomies/Content%20Taxonomy%203.1.tsv)) |
328
+ | Output | Array of `{id: string, confidence: number}` — ID-based, not human-readable |
329
+ | Model | Described as "dedicated, on-device expert model optimized for high-speed classification" |
330
+ | Target latency | "Millisecond-level inference" |
331
+ | vs generic LLM | "A generic LLM is 'overkill' for classification" |
332
+ | Privacy | "Raw text never leaves the browser", "stateless by design" |
333
+ | Model updates | "managed by the browser component updater" |
334
+ | Non-goals | Summarization, translation, sentiment analysis; human-readable string outputs |
335
+ | Out of MVP scope | Streaming, multilingual, quota API, download progress, AbortSignal |
336
+ | Language | English only (MVP) |
337
+ | Author | Chrome Built-in AI Team |
338
+
339
+ ### Shopping Classifier: Working Pipeline (Target 43+74)
340
+ Documented by Dejan Petrovic ([blog](https://dejan.ai/blog/google-shopping-classifier/), [HuggingFace](https://huggingface.co/dejanseo/chrome_models)):
341
+
342
+ | Step | Detail |
343
+ |------|--------|
344
+ | Chunking | 100 words max per passage, 10 passages max, truncated to 64 tokens |
345
+ | Embedding | Target 43: text -> 768-dim embedding |
346
+ | Title+URL | Concatenated into a single embedding |
347
+ | Classifier input | `[title_emb(768) || mean_pool(passage_embs)(768)]` = 1536 dim |
348
+ | Output | Score 0--1 (shopping probability) |
349
+ | Storage | `VisitContentAnnotations` in Chrome's history database |
350
+ | Usage | Aggregated via `OPTIMIZATION_TARGET_SEGMENTATION_SHOPPING_USER` to enable commerce features |
351
+
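+ For reference, the 1536-dim classifier input described above reduces to a simple concatenation; `embed()` below is a placeholder for running the target 43 passage embedder (64-token window, 768-dim output).
+
+ ```python
+ import numpy as np
+
+ def shopping_features(title, url, passages):
+     title_emb = embed(title + " " + url)                        # (768,) title+URL embedding
+     passage_embs = np.stack([embed(p) for p in passages[:10]])  # at most 10 passages
+     return np.concatenate([title_emb, passage_embs.mean(axis=0)])  # (1536,) for target 74
+ ```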
352
+ ### LiteRT-LM
353
+ Open-source C++ framework by Google for on-device language model inference. Used in Chrome, Chromebook Plus, and Pixel Watch according to Google. Handles KV cache, session cloning, and CPU/GPU/NPU acceleration. The `LITERTLM` container format in which taxonomy-tiny is packaged corresponds to this framework.
354
+ Source: [Google Developers Blog](https://developers.googleblog.com/on-device-genai-in-chrome-chromebook-plus-and-pixel-watch-with-litert-lm/)
355
+
356
+ ---
357
+
358
+ ## 9. What the Facts Do Not Tell Us
359
+
360
+ The following points are not resolved by available data:
361
+
362
+ 1. **The link between target 72 and the Classifier API** -- The GitHub explainer does not name the underlying model. Target 72 has no reference to the Classifier API in its files. The correspondence is plausible but unconfirmed.
363
+
364
+ 2. **The intended use of the 370 custom tokens** -- They exist in the vocabulary and are untrained. IAB Content Taxonomy V3.1 has 706 categories (588 leaf nodes); 370 does not match either count. Their correspondence to any taxonomy is not documented anywhere.
365
+
366
+ 3. **Why a generative LLM rather than an embedding-based classifier** -- The explainer says "expert model", not "LLM". It says "millisecond-level inference", which does not match the ~3s observed on target 72. The explainer's output format is structured (`{id, confidence}`), while the Chromium controller reads a plain string (`ClassifyApiResponse { string output = 1; }`).
367
+
368
+ 4. **The relationship with the Topics API** -- Target 72 appeared (2026-02-12) after the Topics API retirement announcement (2025-10-17). Both involve taxonomies and page classification, but no source code explicitly links them.
369
+
370
+ 5. **When or whether fine-tuning will be deployed** -- The model is a pre-trained base model. No public information exists on a fine-tuning timeline.
371
+
372
+ 6. **The semantics of `BaseModelSpec.version = "0.0.0.0"`** -- This format is not documented anywhere. Other Chrome components use normal numeric versions.
373
+
374
+ ---
375
+
376
+ ## 10. References
377
+
378
+ | Source | Type | URL / Path |
379
+ |--------|------|------------|
380
+ | Chrome component (CRX) | Local file | `classifier_analysis/classifier.crx3` |
381
+ | Manifest.json | Local file | `classifier_analysis/extracted/manifest.json` |
382
+ | Installer source | Chromium | [optimization_guide_on_device_model_installer.cc](https://chromium.googlesource.com/experimental/chromium/src/+/refs/heads/main/chrome/browser/component_updater/optimization_guide_on_device_model_installer.cc) |
383
+ | Controller source | Chromium | `on_device_model_classifier_controller.cc` (read via Chromium Code Search, `experimental/chromium/src` branch, no stable direct URL) |
384
+ | Classifier API explainer | GitHub | [explainers-by-googlers/classifier-api](https://github.com/explainers-by-googlers/classifier-api) |
385
+ | Shopping classifier blog | Web | [dejan.ai/blog/google-shopping-classifier](https://dejan.ai/blog/google-shopping-classifier/) |
386
+ | Privacy Sandbox update | Web | [privacysandbox.google.com/blog/update-on-plans](https://privacysandbox.google.com/blog/update-on-plans-for-privacy-sandbox-technologies) |
387
+ | Topics deprecation | Chromium Groups | [topics-api-announce](https://groups.google.com/a/chromium.org/g/topics-api-announce/c/iQX7PC3S0Ds) |
388
+ | LiteRT-LM blog | Web | [developers.googleblog.com](https://developers.googleblog.com/on-device-genai-in-chrome-chromebook-plus-and-pixel-watch-with-litert-lm/) |
389
+ | Mission 1 report | Local file | `results/MISSION_01/target_72_inventory.md` |
390
+ | Mission 2 report | Local file | `results/MISSION_02/target_configs_comparison.md` |
391
+ | Mission 4 report | Local file | `results/MISSION_04/weight_variance_analysis.md` |
392
+ | Mission 5 report | Local file | `results/MISSION_05/custom_tokens_clustering.md` |