Upload README.md with huggingface_hub
README.md
ADDED
@@ -0,0 +1,42 @@
---
license: apache-2.0
tags:
- gemma
- tokenizer
- sentencepiece
- bpe
---

# Gemma 4 E2B SentencePiece BPE Tokenizer (derived)

A SentencePiece `.model` file synthesized from the upstream
`onnx-community/gemma-4-E2B-it-ONNX` `tokenizer.json`, plus a companion
`atomic_tokens.json` that lists the special / added tokens for an HF-style
atomic pre-pass.

Byte-equivalence verified against the upstream HF tokenizer on **1000 / 1000**
diverse strings (English prose, source code, CJK / Cyrillic / Arabic / Hindi
text, whitespace edge cases, emoji, NFKC-sensitive ligatures, and special-token
literals). Verification uses an atomic pre-pass over `atomic_tokens.json`
followed by `sentencepiece.encode`.

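A check of this kind can be sketched as a small comparison harness. This is only an illustration, assuming the verification compares token-ID sequences string by string; `find_mismatches`, `reference_encode`, and `candidate_encode` are hypothetical names, and the actual 1000-string corpus is not reproduced here.

```python
from typing import Callable, Iterable

Encoder = Callable[[str], list[int]]


def find_mismatches(
    samples: Iterable[str],
    reference_encode: Encoder,  # e.g. the upstream HF tokenizer (assumption)
    candidate_encode: Encoder,  # e.g. atomic pre-pass + SentencePiece (assumption)
) -> list[tuple[str, list[int], list[int]]]:
    """Return (text, expected_ids, actual_ids) for every sample that differs."""
    mismatches = []
    for text in samples:
        expected = list(reference_encode(text))
        actual = list(candidate_encode(text))
        if expected != actual:
            mismatches.append((text, expected, actual))
    return mismatches
```

An empty result over the whole sample set would correspond to a full pass such as the 1000 / 1000 figure reported above.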
## Files

- `tokenizer.model`: SentencePiece BPE ModelProto, 262 144 pieces,
  `byte_fallback=true`, `normalizer_spec.name="identity"`,
  `normalizer_spec.add_dummy_prefix=false`.
- `atomic_tokens.json`: array of `{ piece, id }` entries, sorted longest-first,
  covering every `added_tokens` entry from upstream `tokenizer.json` with
  `special: true` (e.g. `<bos>`, `<eos>`, `<pad>`, `<unk>`, `<mask>`, `<|tool>`,
  …). Consumers run a greedy, overlap-free, longest-match-first scan over the
  raw input and emit the listed `id` directly for matched substrings before
  invoking the SentencePiece tokenizer on the surrounding text.

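The pre-pass described for `atomic_tokens.json` could look roughly like the sketch below. It is one plausible reading of "greedy, overlap-free, longest-match-first" (scan left to right, take the longest literal that matches at each position), not necessarily the exact procedure used for verification. The function name `encode_with_atomic_prepass` and the `encode_plain` parameter are hypothetical; `encode_plain` stands in for whatever encodes ordinary text, such as the `encode` method of a `sentencepiece.SentencePieceProcessor`.

```python
from typing import Callable, Iterable


def encode_with_atomic_prepass(
    text: str,
    atomic_tokens: Iterable[dict],
    encode_plain: Callable[[str], list[int]],
) -> list[int]:
    """Emit ids for atomic literals directly; encode the text between them."""
    # Re-sort longest-first (the file is described as already sorted) and
    # drop empty pieces, which would otherwise match at every position.
    atoms = sorted(
        (t for t in atomic_tokens if t["piece"]),
        key=lambda t: len(t["piece"]),
        reverse=True,
    )
    ids: list[int] = []
    run_start = 0  # start of the pending run of ordinary text
    pos = 0
    while pos < len(text):
        match = next((a for a in atoms if text.startswith(a["piece"], pos)), None)
        if match is None:
            pos += 1
            continue
        if run_start < pos:
            ids.extend(encode_plain(text[run_start:pos]))
        ids.append(match["id"])
        pos += len(match["piece"])  # skip the literal, so matches never overlap
        run_start = pos
    if run_start < len(text):
        ids.extend(encode_plain(text[run_start:]))
    return ids


# Possible usage, assuming both files are in the working directory:
#   import json, sentencepiece as spm
#   sp = spm.SentencePieceProcessor(model_file="tokenizer.model")
#   with open("atomic_tokens.json", encoding="utf-8") as f:
#       atoms = json.load(f)
#   ids = encode_with_atomic_prepass("<bos>Hello", atoms, sp.encode)
```

The linear scan over all literals at every position is the simplest form of the idea; a trie or a compiled regular-expression alternation would be a natural replacement for long inputs.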
## Hashes (sha256)

- `tokenizer.model`: `57dc9e2498ca33228ae89bed2721a57817ded224a7e02d3a79a03a9619f2ff33`
- `atomic_tokens.json`: `492ec2ad2531b146d6b8396327704e25b339264bef8356d251e3e3ee787c2423`

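Downloaded copies can be checked against these digests with the standard library. The sketch below is a minimal example; the helper names `sha256_of` and `verify` and the default directory are assumptions for illustration.

```python
import hashlib
from pathlib import Path

# Digests copied from the list above.
EXPECTED = {
    "tokenizer.model": "57dc9e2498ca33228ae89bed2721a57817ded224a7e02d3a79a03a9619f2ff33",
    "atomic_tokens.json": "492ec2ad2531b146d6b8396327704e25b339264bef8356d251e3e3ee787c2423",
}


def sha256_of(path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(directory=".") -> dict[str, bool]:
    """Map each expected file name to True if it exists and its digest matches."""
    results = {}
    for name, expected in EXPECTED.items():
        path = Path(directory) / name
        results[name] = path.is_file() and sha256_of(path) == expected
    return results
```

Calling `verify()` in the directory that holds both files should return `True` for each entry when the downloads are intact.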
## License

Apache-2.0 (matching upstream Gemma 4 license).