# Gemma 4 E2B SentencePiece BPE Tokenizer (derived)
A SentencePiece `.model` file synthesized from the upstream
`onnx-community/gemma-4-E2B-it-ONNX` `tokenizer.json`, plus a companion
`atomic_tokens.json` that lists the special / added tokens for an HF-style
atomic pre-pass.
Byte-equivalence was verified against the upstream HF tokenizer on 1000 / 1000
diverse strings (English prose, source code, CJK / Cyrillic / Arabic / Hindi
text, whitespace edge cases, emoji, NFKC-sensitive ligatures, and special-token
literals). Verification uses an atomic pre-pass over `atomic_tokens.json`
followed by `SentencePieceProcessor.encode`.
## Files
- `tokenizer.model` — SentencePiece BPE `ModelProto`, 262 144 pieces, `byte_fallback=true`, `normalizer_spec.name="identity"`, `normalizer_spec.add_dummy_prefix=false`.
- `atomic_tokens.json` — array of `{ piece, id }` entries, sorted longest-first, covering every `added_tokens` entry from the upstream `tokenizer.json` with `special: true` (e.g. `<bos>`, `<eos>`, `<pad>`, `<unk>`, `<mask>`, `<|tool>`, …). Consumers run a greedy, overlap-free, longest-match-first scan over the raw input and emit the listed `id` directly for matched substrings before invoking the SentencePiece tokenizer on the surrounding text.
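The atomic pre-pass described above can be sketched in plain Python. `atomic_pre_pass` and the toy token list here are illustrative names, not part of the shipped files; in real use `encode_fn` would be the `encode` method of a `SentencePieceProcessor` loaded from `tokenizer.model`, and the token list would come from `atomic_tokens.json` (already sorted longest-first).

```python
def atomic_pre_pass(text, atomic_tokens, encode_fn):
    """Greedy, overlap-free, longest-match-first scan.

    atomic_tokens: list of {"piece": str, "id": int}, sorted longest-first
                   (as atomic_tokens.json is).
    encode_fn: tokenizer for the plain-text spans between matches,
               e.g. a SentencePieceProcessor's encode method.
    """
    ids, start, i, n = [], 0, 0, len(text)
    while i < n:
        for tok in atomic_tokens:  # longest-first, so the first hit wins
            piece = tok["piece"]
            if text.startswith(piece, i):
                if start < i:  # flush the plain text before the match
                    ids.extend(encode_fn(text[start:i]))
                ids.append(tok["id"])  # emit the listed id directly
                i += len(piece)
                start = i
                break
        else:
            i += 1  # no atomic token starts here; keep scanning
    if start < n:  # flush the trailing plain text
        ids.extend(encode_fn(text[start:n]))
    return ids

# Toy demonstration with a stand-in encoder (real ids come from the model):
toy_atomic = [{"piece": "<bos>", "id": 2}, {"piece": "<eos>", "id": 1}]
fake_encode = lambda s: [ord(c) for c in s]  # stand-in for sp.encode
atomic_pre_pass("<bos>hi<eos>", toy_atomic, fake_encode)
# → [2, 104, 105, 1]
```

Because the scan is longest-match-first and overlap-free, a piece such as `<|tool>` can never be split by a shorter atomic token that happens to be one of its substrings.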
## Hashes (sha256)
- `tokenizer.model`: `57dc9e2498ca33228ae89bed2721a57817ded224a7e02d3a79a03a9619f2ff33`
- `atomic_tokens.json`: `492ec2ad2531b146d6b8396327704e25b339264bef8356d251e3e3ee787c2423`
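A minimal sketch for checking the downloaded files against the digests above; `sha256_file` is an illustrative helper, and the paths assume the files sit in the current directory.

```python
import hashlib

EXPECTED = {
    "tokenizer.model": "57dc9e2498ca33228ae89bed2721a57817ded224a7e02d3a79a03a9619f2ff33",
    "atomic_tokens.json": "492ec2ad2531b146d6b8396327704e25b339264bef8356d251e3e3ee787c2423",
}

def sha256_file(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large files never sit fully in RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

# Uncomment after downloading the files:
# for name, digest in EXPECTED.items():
#     assert sha256_file(name) == digest, f"hash mismatch for {name}"
```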
## License
Apache-2.0 (matching upstream Gemma 4 license).