---
license: apache-2.0
tags:
  - gemma
  - tokenizer
  - sentencepiece
  - bpe
---

# Gemma 4 E2B SentencePiece BPE Tokenizer (derived)

A SentencePiece `.model` file synthesized from the upstream
`onnx-community/gemma-4-E2B-it-ONNX` `tokenizer.json`, plus a companion
`atomic_tokens.json` that lists the special / added tokens for an HF-style
atomic pre-pass.

Byte-equivalence was verified against the upstream HF tokenizer on **1000 / 1000**
diverse strings (English prose, source code, CJK / Cyrillic / Arabic / Hindi
text, whitespace edge cases, emoji, NFKC-sensitive ligatures, and special-token
literals). Verification uses an atomic pre-pass over `atomic_tokens.json`
followed by `sentencepiece.encode`.

## Files

- `tokenizer.model` — SentencePiece BPE ModelProto, 262 144 pieces,
  `byte_fallback=true`, `normalizer_spec.name="identity"`,
  `normalizer_spec.add_dummy_prefix=false`.
- `atomic_tokens.json` — array of `{ piece, id }` entries, sorted longest-first,
  covering every `added_tokens` entry from upstream `tokenizer.json` with
  `special: true` (e.g. `<bos>`, `<eos>`, `<pad>`, `<unk>`, `<mask>`, `<|tool>`,
  …). Consumers run a greedy, overlap-free, longest-match-first scan over the
  raw input and emit the listed `id` directly for matched substrings before
  invoking the SentencePiece tokenizer on the surrounding text.
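
The pre-pass described for `atomic_tokens.json` can be sketched roughly as below. This is a minimal illustration, not the reference implementation: the function names `load_atomic_tokens` and `encode_with_atomic_prepass` are hypothetical, "longest" is assumed to mean length in Unicode code points, and the plain-text encoder is passed in as a callable so the scan itself does not depend on any particular library.

```python
import json
from typing import Callable, List, Sequence, Tuple


def load_atomic_tokens(path: str) -> List[Tuple[str, int]]:
    """Read the `{ piece, id }` array and return (piece, id) pairs."""
    with open(path, encoding="utf-8") as f:
        entries = json.load(f)
    return [(entry["piece"], entry["id"]) for entry in entries]


def encode_with_atomic_prepass(
    text: str,
    atomic: Sequence[Tuple[str, int]],
    encode_plain: Callable[[str], List[int]],
) -> List[int]:
    """Greedy, overlap-free, longest-match-first scan over the raw input.

    Matched atomic pieces emit their listed id directly; the text between
    matches is handed to `encode_plain` (e.g. a SentencePiece encoder).
    """
    # Re-sort defensively and drop empty pieces (assumption: length is
    # measured in code points, which the source does not specify).
    ordered = sorted((p for p in atomic if p[0]), key=lambda p: len(p[0]), reverse=True)

    ids: List[int] = []
    run_start = 0  # start of the pending plain-text run
    i = 0
    while i < len(text):
        for piece, token_id in ordered:
            if text.startswith(piece, i):
                if run_start < i:
                    ids.extend(encode_plain(text[run_start:i]))
                ids.append(token_id)
                i += len(piece)
                run_start = i
                break
        else:
            i += 1  # no atomic token starts here; keep scanning
    if run_start < len(text):
        ids.extend(encode_plain(text[run_start:]))
    return ids
```

With the `sentencepiece` Python package installed, the plain-text encoder would typically be the `encode` method of `sentencepiece.SentencePieceProcessor(model_file="tokenizer.model")`, passed as `encode_plain`. The linear scan over all atomic pieces at every position is simple rather than fast; a trie or a compiled regular expression alternation would be a reasonable replacement for long inputs.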

## Hashes (sha256)

- `tokenizer.model`: `57dc9e2498ca33228ae89bed2721a57817ded224a7e02d3a79a03a9619f2ff33`
- `atomic_tokens.json`: `492ec2ad2531b146d6b8396327704e25b339264bef8356d251e3e3ee787c2423`
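
A downloaded copy can be checked against the digests above with a short script along these lines. It is a sketch only: the helper names `sha256_of` and `verify_files` are hypothetical, and the files are assumed to sit in a single local directory under the names listed above.

```python
import hashlib
import os
from typing import Dict

# Digests copied from the list above.
EXPECTED_SHA256: Dict[str, str] = {
    "tokenizer.model": "57dc9e2498ca33228ae89bed2721a57817ded224a7e02d3a79a03a9619f2ff33",
    "atomic_tokens.json": "492ec2ad2531b146d6b8396327704e25b339264bef8356d251e3e3ee787c2423",
}


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_files(directory: str = ".") -> Dict[str, bool]:
    """Map each expected file name to whether its digest matches."""
    return {
        name: sha256_of(os.path.join(directory, name)) == expected
        for name, expected in EXPECTED_SHA256.items()
    }
```

Calling `verify_files("path/to/download")` returns a dictionary of booleans, one per file; a missing file raises `FileNotFoundError` rather than reporting a mismatch.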

## License

Apache-2.0 (matching the upstream Gemma 4 license).