---
license: apache-2.0
tags:
- gemma
- tokenizer
- sentencepiece
- bpe
---

# Gemma 4 E2B SentencePiece BPE Tokenizer (derived)

A SentencePiece `.model` file synthesized from the upstream
`onnx-community/gemma-4-E2B-it-ONNX` `tokenizer.json`, plus a companion
`atomic_tokens.json` that lists the special / added tokens for an HF-style
atomic pre-pass.

Byte-equivalence with the upstream HF tokenizer was verified on **1000 / 1000**
diverse strings (English prose, source code, CJK / Cyrillic / Arabic / Hindi
text, whitespace edge cases, emoji, NFKC-sensitive ligatures, and special-token
literals). Verification runs an atomic pre-pass over `atomic_tokens.json`,
then calls `sentencepiece.encode` on the remaining text.
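
The comparison described above can be sketched as a small harness. This is a minimal illustration, not the script that produced the 1000 / 1000 result: it assumes equivalence is checked by comparing the id sequences returned by the two tokenizers, and `find_mismatches`, `reference_encode`, and `candidate_encode` are hypothetical names.

```python
from typing import Callable, Iterable, List, Tuple

# Any callable that maps a string to a list of token ids.
Encoder = Callable[[str], List[int]]


def find_mismatches(
    samples: Iterable[str],
    reference_encode: Encoder,
    candidate_encode: Encoder,
) -> List[Tuple[str, List[int], List[int]]]:
    """Return (sample, reference_ids, candidate_ids) for every disagreement."""
    mismatches = []
    for sample in samples:
        expected = reference_encode(sample)  # e.g. the upstream HF tokenizer
        actual = candidate_encode(sample)    # e.g. pre-pass + SentencePiece
        if expected != actual:
            mismatches.append((sample, expected, actual))
    return mismatches
```

An empty return value means the two encoders agreed on every sample. In practice `reference_encode` would wrap the upstream tokenizer and `candidate_encode` would wrap the pre-pass plus `tokenizer.model`; how special tokens such as `<bos>` are added on the reference side is left to the caller here.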

## Files

- `tokenizer.model`: SentencePiece BPE ModelProto, 262,144 pieces,
  `byte_fallback=true`, `normalizer_spec.name="identity"`,
  `normalizer_spec.add_dummy_prefix=false`.
- `atomic_tokens.json`: array of `{ piece, id }` entries, sorted longest-first,
  covering every `added_tokens` entry from upstream `tokenizer.json` with
  `special: true` (e.g. `<bos>`, `<eos>`, `<pad>`, `<unk>`, `<mask>`, `<|tool>`,
  …). Consumers run a greedy, overlap-free, longest-match-first scan over the
  raw input and emit the listed `id` directly for each matched substring, then
  invoke the SentencePiece tokenizer on the text between matches.
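
The pre-pass described for `atomic_tokens.json` can be sketched as follows. This is one possible reading, assuming the scan runs left to right and takes the longest piece that matches at each position; `encode_with_atomic_prepass` and `sp_encode` are hypothetical names, and `sp_encode` stands in for whatever callable encodes plain text with `tokenizer.model`.

```python
from typing import Callable, List


def encode_with_atomic_prepass(
    text: str,
    atomic_tokens: List[dict],
    sp_encode: Callable[[str], List[int]],
) -> List[int]:
    """Emit ids for atomic pieces directly; pass the text in between to sp_encode."""
    # Re-sort defensively so the longest piece wins at any given position.
    entries = sorted(atomic_tokens, key=lambda e: len(e["piece"]), reverse=True)
    ids: List[int] = []
    pending = 0  # start of the plain-text run that has not been encoded yet
    pos = 0
    while pos < len(text):
        hit = next(
            (e for e in entries if e["piece"] and text.startswith(e["piece"], pos)),
            None,
        )
        if hit is None:
            pos += 1
            continue
        if pending < pos:
            ids.extend(sp_encode(text[pending:pos]))  # text before the match
        ids.append(hit["id"])                         # atomic token id as listed
        pos += len(hit["piece"])                      # skip past it: no overlaps
        pending = pos
    if pending < len(text):
        ids.extend(sp_encode(text[pending:]))         # trailing plain text
    return ids
```

With the real files, `atomic_tokens` would come from `json.load` on `atomic_tokens.json`, and `sp_encode` could be the `encode` method of `sentencepiece.SentencePieceProcessor(model_file="tokenizer.model")`. The linear scan over all entries at every position is kept for clarity; a trie or a compiled regular expression would be the usual choice for long inputs.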

## Hashes (sha256)

- `tokenizer.model`: `57dc9e2498ca33228ae89bed2721a57817ded224a7e02d3a79a03a9619f2ff33`
- `atomic_tokens.json`: `492ec2ad2531b146d6b8396327704e25b339264bef8356d251e3e3ee787c2423`
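
A downloaded copy can be checked against these digests with the standard library alone. The sketch below assumes both files sit in one local directory; `sha256_of_file` and `check_hashes` are hypothetical helper names.

```python
import hashlib
from pathlib import Path
from typing import Dict

# Digests copied from the list above.
EXPECTED_SHA256 = {
    "tokenizer.model": "57dc9e2498ca33228ae89bed2721a57817ded224a7e02d3a79a03a9619f2ff33",
    "atomic_tokens.json": "492ec2ad2531b146d6b8396327704e25b339264bef8356d251e3e3ee787c2423",
}


def sha256_of_file(path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_hashes(directory, expected: Dict[str, str] = EXPECTED_SHA256) -> Dict[str, bool]:
    """Map each expected file name to True if its digest matches."""
    root = Path(directory)
    return {name: sha256_of_file(root / name) == want for name, want in expected.items()}
```

Calling `check_hashes(".")` in the directory that holds both files returns a name-to-bool mapping; a missing file raises `FileNotFoundError` rather than reporting `False`.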

## License

Apache-2.0 (matching upstream Gemma 4 license).