
GBM Tokenizer

A SentencePiece unigram tokenizer optimized for Garhwali (ISO 639-3: gbm), a Central Pahari language of Uttarakhand, India. It handles Devanagari script and mixed Garhwali–English text, making it suitable for language modeling, machine translation, and other NLP tasks on Garhwali corpora.

Key highlights

  • 128,000 tokens — Vocabulary size aligned with modern LLM tokenizers (e.g. Llama 3).
  • Optimized for Garhwali — Trained on a domain-specific corpus (621K+ lines).
  • Efficient tokenization — ~2.11 tokens per word (comparable to GPT-4o’s ~1.92).
  • Strong compression — ~2.66 characters per token on the evaluation set.
  • Fast — ~2.2M tokens/sec encoding on CPU (single-threaded, Apple Silicon).
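The fertility and compression figures above are simple ratios that can be computed for any tokenizer. A minimal sketch of both metrics; the whitespace split below is a stand-in for illustration only, not the GBM tokenizer itself:

```python
# Sketch of the fertility and compression metrics quoted above.
# A whitespace split stands in for a real tokenizer here; the published
# figures come from the GBM SentencePiece model on its test set.

def fertility(tokens: list[str], text: str) -> float:
    """Tokens per whitespace-delimited word (lower is better)."""
    return len(tokens) / len(text.split())

def compression(tokens: list[str], text: str) -> float:
    """Characters per token (higher means stronger compression)."""
    return len(text) / len(tokens)

text = "गढ़वळि भाषा उत्तराखंड मा बोले जांदी"
tokens = text.split()  # stand-in: one token per word, so fertility is 1.0

print(fertility(tokens, text))
print(compression(tokens, text))
```

Swapping in a real tokenizer's token list reproduces the numbers reported in the comparison table below.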

Model details

| Property | Value |
|---|---|
| Vocab size | 128,000 |
| Model type | Unigram (SentencePiece) |
| Scripts | Devanagari, Latin |
| Special tokens | `pad_id=0`, `unk_id=1`, `bos_id=2`, `eos_id=3` |
| Normalization | NMT NFKC, with byte fallback for full coverage |
| Max piece length | 20 |
| License | Unlicense (public domain) |
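The NFKC part of the normalization listed above can be previewed with Python's standard library. Note this is only an approximation: SentencePiece's `nmt_nfkc` rule adds NMT-specific whitespace handling on top of plain NFKC.

```python
import unicodedata

# NFKC canonicalizes Devanagari: the precomposed letter क़ (U+0958) is a
# composition exclusion, so it normalizes to क (U+0915) + nukta (U+093C).
# Compatibility characters like the no-break space fold to plain forms.
raw = "\u0958\u00a0text"  # क़ followed by a no-break space
norm = unicodedata.normalize("NFKC", raw)

print([hex(ord(c)) for c in norm])
```

Byte fallback then covers anything left outside the vocabulary by encoding it as raw UTF-8 byte pieces, which is why the model card claims full coverage.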

Evaluation (comparison)

Benchmarks on 152 test cases (Devanagari, English, mixed, code, math):

| Tokenizer | Vocab size | Compression | Round-trip accuracy | Fertility (tokens/word) | Speed |
|---|---|---|---|---|---|
| GBM Tokenizer | 128,000 | 2.66× | 98.5% | 2.11 | ~2.2M t/s |
| GPT-4o (o200k) | 199,998 | 2.93× | 100% | 1.92 | ~1.2M t/s |
| Gemma 3 | 262,144 | 3.06× | 100% | 1.84 | ~0.5M t/s |
| Llama 3 | 128,000 | 2.51× | 99.5% | 2.24 | ~0.4M t/s |
| GPT-4/Claude | 100,256 | 1.77× | 100% | 3.18 | ~1.6M t/s |
| Sarvam-1 | 68,096 | 2.31× | 100% | 2.44 | ~0.6M t/s |

Metrics are from the project’s eval.txt test set; speed figures are hardware-dependent (measured single-threaded on Apple Silicon).
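Round-trip accuracy in the table above is the fraction of test cases for which decoding the encoded text reproduces the input exactly. A minimal harness sketch; the `encode`/`decode` pair below is a trivial lossless placeholder, to be swapped for any of the compared tokenizers:

```python
# Sketch of the round-trip accuracy metric from the comparison table.
# `encode`/`decode` are placeholder callables; substitute a real
# tokenizer (e.g. a SentencePieceProcessor) to reproduce the metric.

def round_trip_accuracy(cases, encode, decode):
    ok = sum(1 for text in cases if decode(encode(text)) == text)
    return ok / len(cases)

# Placeholder pair: splitting on single spaces and rejoining is lossless.
encode = lambda s: s.split(" ")
decode = lambda toks: " ".join(toks)

cases = ["गढ़वळि पाठ", "hello world", "mixed गढ़वळि text"]
print(round_trip_accuracy(cases, encode, decode))  # 1.0 for this lossless pair
```

Scores below 100% (as for the GBM tokenizer and Llama 3) typically come from normalization folding characters before encoding, so the decoded text differs from the raw input by design.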

Usage

With SentencePiece (Python)

import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load("gbm_tokenizer.model")  # or path from hf_hub_download

tokens = sp.encode("गढ़वळि पाठ", out_type=str)
ids = sp.encode("गढ़वळि पाठ", out_type=int)
decoded = sp.decode(ids)

Download from Hub (gated — request access first)

from huggingface_hub import hf_hub_download
import sentencepiece as spm

path = hf_hub_download(repo_id="somu9/gbm-tokenizer", filename="gbm_tokenizer.model")
sp = spm.SentencePieceProcessor()
sp.load(path)

Training

  • Algorithm: SentencePiece unigram with default splitting (Unicode script, whitespace, numbers).
  • Corpus: Garhwali-focused text; configurable via the training repo (train.py, corpus.txt).
