Gemma 4 E2B SentencePiece BPE Tokenizer (derived)

A SentencePiece .model file synthesized from the upstream onnx-community/gemma-4-E2B-it-ONNX tokenizer.json, plus a companion atomic_tokens.json that lists the special / added tokens for an HF-style atomic pre-pass.

Byte-equivalence verified against the upstream HF tokenizer on 1000 / 1000 diverse strings (English prose, source code, CJK / Cyrillic / Arabic / Hindi text, whitespace edge cases, emoji, NFKC-sensitive ligatures, and special-token literals). Verification uses an atomic pre-pass over atomic_tokens.json followed by sentencepiece.encode.

Files

  • tokenizer.model — SentencePiece BPE ModelProto, 262 144 pieces, byte_fallback=true, normalizer_spec.name="identity", normalizer_spec.add_dummy_prefix=false.
  • atomic_tokens.json — array of { piece, id } entries, sorted longest-first, covering every added_tokens entry from upstream tokenizer.json with special: true (e.g. <bos>, <eos>, <pad>, <unk>, <mask>, <|tool>, …). Consumers run a greedy, overlap-free, longest-match-first scan over the raw input, emit the listed id directly for each matched substring, and invoke the SentencePiece tokenizer only on the surrounding text.
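
The pre-pass described above can be sketched as follows. This is a minimal illustration, not the reference consumer: the function name and the injectable encode_rest callback are hypothetical; in practice encode_rest would be a SentencePiece processor's encode method loaded from tokenizer.model.

```python
def atomic_prepass_encode(text, atomic_tokens, encode_rest):
    """Greedy, overlap-free, longest-match-first scan (hypothetical sketch).

    atomic_tokens: list of {"piece": str, "id": int}, sorted longest-first
                   as in atomic_tokens.json.
    encode_rest:   callable mapping plain text to token ids, e.g. a
                   sentencepiece.SentencePieceProcessor's encode.
    """
    ids = []
    buf = []  # plain-text run awaiting the fallback tokenizer
    i = 0
    while i < len(text):
        for tok in atomic_tokens:  # longest pieces tried first
            piece = tok["piece"]
            if text.startswith(piece, i):
                if buf:  # flush pending plain text before the atomic id
                    ids.extend(encode_rest("".join(buf)))
                    buf = []
                ids.append(tok["id"])
                i += len(piece)
                break
        else:
            buf.append(text[i])
            i += 1
    if buf:
        ids.extend(encode_rest("".join(buf)))
    return ids
```

For example, with toy ids and a fallback that maps characters to code points:

```python
atomic = [{"piece": "<bos>", "id": 2}, {"piece": "<eos>", "id": 1}]
atomic_prepass_encode("<bos>hi<eos>", atomic, lambda s: [ord(c) for c in s])
# → [2, 104, 105, 1]
```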

Hashes (sha256)

  • tokenizer.model: 57dc9e2498ca33228ae89bed2721a57817ded224a7e02d3a79a03a9619f2ff33
  • atomic_tokens.json: 492ec2ad2531b146d6b8396327704e25b339264bef8356d251e3e3ee787c2423
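
A minimal sketch for checking the digests above after download (the helper names are illustrative, and the file paths assume the files sit in the working directory):

```python
import hashlib

# Expected digests from this card.
EXPECTED = {
    "tokenizer.model": "57dc9e2498ca33228ae89bed2721a57817ded224a7e02d3a79a03a9619f2ff33",
    "atomic_tokens.json": "492ec2ad2531b146d6b8396327704e25b339264bef8356d251e3e3ee787c2423",
}

def sha256_of(path):
    """Stream a file through SHA-256 and return the hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def verify(path):
    """True if the file at `path` matches the digest published here."""
    return sha256_of(path) == EXPECTED[path]

# e.g. assert verify("tokenizer.model") and verify("atomic_tokens.json")
```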

License

Apache-2.0 (matching upstream Gemma 4 license).
