sepehrn committed · Commit 2a756fd · verified · 1 Parent(s): 5538344

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +42 -0
README.md ADDED
@@ -0,0 +1,42 @@
+ ---
+ license: apache-2.0
+ tags:
+ - gemma
+ - tokenizer
+ - sentencepiece
+ - bpe
+ ---
+
+ # Gemma 4 E2B SentencePiece BPE Tokenizer (derived)
+
+ A SentencePiece `.model` file synthesized from the upstream
+ `onnx-community/gemma-4-E2B-it-ONNX` `tokenizer.json`, plus a companion
+ `atomic_tokens.json` that lists the special / added tokens for an HF-style
+ atomic pre-pass.
+
+ Byte-equivalence with the upstream HF tokenizer was verified on **1000 / 1000**
+ diverse strings (English prose, source code, CJK / Cyrillic / Arabic / Hindi
+ text, whitespace edge cases, emoji, NFKC-sensitive ligatures, and special-token
+ literals). Verification uses an atomic pre-pass over `atomic_tokens.json`
+ followed by `sentencepiece.encode`.
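
A comparison of this kind can be sketched as a small harness that runs two encoders over the same strings and counts agreements. This is a minimal sketch, not the author's verification script: `compare_encoders` and the `Encoder` alias are hypothetical names, and the stand-in encoders in the demo exist only so the block runs without model files. In practice the reference would presumably be the upstream HF tokenizer and the candidate the pre-pass plus SentencePiece pipeline.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical alias: any function mapping a string to a list of token ids.
Encoder = Callable[[str], List[int]]


def compare_encoders(
    samples: Iterable[str], reference: Encoder, candidate: Encoder
) -> Tuple[int, List[str]]:
    """Return (number of samples with identical ids, samples that differ)."""
    matches = 0
    mismatches: List[str] = []
    for text in samples:
        if reference(text) == candidate(text):
            matches += 1
        else:
            mismatches.append(text)
    return matches, mismatches


if __name__ == "__main__":
    # Stand-in encoders (assumption): both simply emit UTF-8 byte values.
    reference = lambda s: list(s.encode("utf-8"))
    candidate = lambda s: [b for b in s.encode("utf-8")]

    samples = ["hello world", "  leading spaces", "日本語", "<bos>"]
    ok, bad = compare_encoders(samples, reference, candidate)
    print(f"{ok} / {len(samples)} matched; mismatches: {bad}")
```

Returning the mismatching strings rather than only a count makes it easier to inspect which input class (whitespace, ligatures, special-token literals) causes a divergence.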
+
+ ## Files
+
+ - `tokenizer.model`: SentencePiece BPE ModelProto, 262 144 pieces,
+ `byte_fallback=true`, `normalizer_spec.name="identity"`,
+ `normalizer_spec.add_dummy_prefix=false`.
+ - `atomic_tokens.json`: array of `{ piece, id }` entries, sorted longest-first,
+ covering every `added_tokens` entry from upstream `tokenizer.json` with
+ `special: true` (e.g. `<bos>`, `<eos>`, `<pad>`, `<unk>`, `<mask>`, `<|tool>`,
+ …). Consumers run a greedy, overlap-free, longest-match-first scan over the
+ raw input and emit the listed `id` directly for matched substrings before
+ invoking the SentencePiece tokenizer on the surrounding text.
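
The greedy, longest-match-first scan described for `atomic_tokens.json` could look roughly like the sketch below. It is an illustration under assumptions, not the reference implementation: `encode_with_atomic_prepass` is a hypothetical name, the ids in the demo are made up, and `encode_plain` stands in for whatever encodes ordinary text (for example the `encode` method of a `sentencepiece.SentencePieceProcessor` loaded from `tokenizer.model`).

```python
from typing import Callable, Dict, List


def encode_with_atomic_prepass(
    text: str,
    atomic: List[Dict],
    encode_plain: Callable[[str], List[int]],
) -> List[int]:
    """Emit listed ids for atomic pieces; encode the text between them normally."""
    # Longest pieces first, so a longer token wins over a shorter one that
    # starts at the same position. Empty pieces are dropped to avoid looping.
    entries = sorted(
        (e for e in atomic if e["piece"]),
        key=lambda e: len(e["piece"]),
        reverse=True,
    )
    ids: List[int] = []
    run_start = 0  # start of the pending run of ordinary text
    pos = 0
    while pos < len(text):
        hit = next((e for e in entries if text.startswith(e["piece"], pos)), None)
        if hit is None:
            pos += 1
            continue
        if run_start < pos:
            ids.extend(encode_plain(text[run_start:pos]))
        ids.append(hit["id"])
        pos += len(hit["piece"])  # skip the whole match, so matches never overlap
        run_start = pos
    if run_start < len(text):
        ids.extend(encode_plain(text[run_start:]))
    return ids


if __name__ == "__main__":
    # Hypothetical entries: these ids are placeholders, not the real Gemma ids.
    atomic = [{"piece": "<bos>", "id": 2}, {"piece": "<eos>", "id": 1}]
    # Stand-in for the SentencePiece encoder (assumption): UTF-8 byte values.
    encode_plain = lambda s: list(s.encode("utf-8"))
    print(encode_with_atomic_prepass("<bos>hi<eos>", atomic, encode_plain))
```

The linear probe over all entries at every position is simple but slow for long inputs; a trie or a compiled alternation would do the same job faster without changing the matching rule.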
+
+ ## Hashes (sha256)
+
+ - `tokenizer.model`: `57dc9e2498ca33228ae89bed2721a57817ded224a7e02d3a79a03a9619f2ff33`
+ - `atomic_tokens.json`: `492ec2ad2531b146d6b8396327704e25b339264bef8356d251e3e3ee787c2423`
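
One way to check downloaded files against the listed digests is a short `hashlib` script. This is a sketch under assumptions: the helper names `sha256_of` and `verify` are hypothetical, and the files are assumed to sit in the directory passed to `verify`.

```python
import hashlib
from pathlib import Path
from typing import Dict

# Digests copied from the list above.
EXPECTED: Dict[str, str] = {
    "tokenizer.model": "57dc9e2498ca33228ae89bed2721a57817ded224a7e02d3a79a03a9619f2ff33",
    "atomic_tokens.json": "492ec2ad2531b146d6b8396327704e25b339264bef8356d251e3e3ee787c2423",
}


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files are never fully loaded in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(directory: Path) -> Dict[str, bool]:
    """Map each expected file name to True only if it exists and its digest matches."""
    results: Dict[str, bool] = {}
    for name, expected in EXPECTED.items():
        path = directory / name
        results[name] = path.is_file() and sha256_of(path) == expected
    return results


if __name__ == "__main__":
    for name, ok in verify(Path(".")).items():
        print(f"{name}: {'OK' if ok else 'MISSING OR MISMATCH'}")
```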
+
+ ## License
+
+ Apache-2.0 (matching the upstream Gemma 4 license).