# GBM Tokenizer
A SentencePiece unigram tokenizer optimized for Garhwali (ISO 639-3: `gbm`), a Central Pahari language of Uttarakhand, India. It handles Devanagari script and mixed Garhwali–English text, making it suitable for language modeling, machine translation, and other NLP tasks on Garhwali corpora.
## Key highlights
- 128,000 tokens — Vocabulary size aligned with modern LLM tokenizers (e.g. Llama 3).
- Optimized for Garhwali — Trained on a domain-specific corpus (621K+ lines).
- Efficient tokenization — ~2.11 tokens per word (comparable to GPT-4o’s ~1.92).
- Strong compression — ~2.66× characters per token.
- Fast — ~2.2M tokens/sec encoding on CPU (single-threaded, Apple Silicon).
## Model details
| Property | Value |
|---|---|
| Vocab size | 128,000 |
| Model type | Unigram (SentencePiece) |
| Scripts | Devanagari, Latin |
| Special tokens | pad_id=0, unk_id=1, bos_id=2, eos_id=3 |
| Normalization | NMT NFKC, byte fallback for full coverage |
| Max piece length | 20 |
| License | Unlicense (public domain) |
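By default, SentencePiece reserves the ids in the table above but `sp.encode` does not insert them for you; a minimal sketch of wrapping an id sequence for language-model training, assuming the ids from the table (the helper `add_specials` is hypothetical, not part of this project):

```python
# Special-token ids from the model details table above.
PAD_ID, UNK_ID, BOS_ID, EOS_ID = 0, 1, 2, 3

def add_specials(ids, max_len=None):
    """Wrap token ids with BOS/EOS and right-pad with PAD up to max_len."""
    ids = [BOS_ID] + list(ids) + [EOS_ID]
    if max_len is not None:
        ids += [PAD_ID] * (max_len - len(ids))
    return ids
```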
## Evaluation
Benchmarks on 152 test cases (Devanagari, English, mixed, code, math):
| Tokenizer | Vocab size | Compression | Round-trip accuracy | Fertility (tokens/word) | Speed |
|---|---|---|---|---|---|
| GBM Tokenizer | 128,000 | 2.66× | 98.5% | 2.11 | ~2.2M t/s |
| GPT-4o (o200k) | 199,998 | 2.93× | 100% | 1.92 | ~1.2M t/s |
| Gemma 3 | 262,144 | 3.06× | 100% | 1.84 | ~0.5M t/s |
| Llama 3 | 128,000 | 2.51× | 99.5% | 2.24 | ~0.4M t/s |
| GPT-4/Claude | 100,256 | 1.77× | 100% | 3.18 | ~1.6M t/s |
| Sarvam-1 | 68,096 | 2.31× | 100% | 2.44 | ~0.6M t/s |
Metrics are from the project’s `eval.txt` test set; speeds were measured single-threaded on Apple Silicon and will vary with hardware.
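The fertility and compression columns can be reproduced for any tokenizer with two short metrics. A sketch, using a toy stand-in for `sp.encode(text, out_type=str)`; the exact counting conventions (e.g. whether spaces count as characters) are assumptions, not the project's documented methodology:

```python
def tokenize(text):
    """Toy stand-in for sp.encode(text, out_type=str): splits each
    whitespace word into a 3-character head and the remainder."""
    return [p for word in text.split() for p in (word[:3], word[3:]) if p]

def fertility(text):
    """Average tokens per whitespace-delimited word (lower is better)."""
    return len(tokenize(text)) / len(text.split())

def compression(text):
    """Average characters per token, counting non-space characters here."""
    return len(text.replace(" ", "")) / len(tokenize(text))
```

Swapping `tokenize` for a real `SentencePieceProcessor.encode` and averaging over the test set yields the table's fertility and compression figures.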
## Usage
### With SentencePiece (Python)
```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load("gbm_tokenizer.model")  # or the path returned by hf_hub_download

tokens = sp.encode("गढ़वळि पाठ", out_type=str)  # subword pieces
ids = sp.encode("गढ़वळि पाठ", out_type=int)     # token ids
decoded = sp.decode(ids)                        # round-trips back to text
```
### Download from the Hub (gated — request access first)
```python
from huggingface_hub import hf_hub_download
import sentencepiece as spm

path = hf_hub_download(repo_id="somu9/gbm-tokenizer", filename="gbm_tokenizer.model")
sp = spm.SentencePieceProcessor()
sp.load(path)
```
## Training
- Algorithm: SentencePiece unigram with default splitting (by Unicode script, whitespace, and digits).
- Corpus: Garhwali-focused text; configurable via the training repo (`train.py`, `corpus.txt`).
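A training invocation consistent with the model details above can be sketched with the SentencePiece Python trainer. Only the vocab size, model type, special-token ids, max piece length, normalization, and byte fallback are documented by this card; the corpus path and character coverage below are illustrative assumptions, not the project's exact settings in `train.py`:

```python
# Assumed trainer settings reconstructed from the model card.
TRAIN_ARGS = dict(
    input="corpus.txt",            # training corpus, one sentence per line
    model_prefix="gbm_tokenizer",  # writes gbm_tokenizer.model / .vocab
    vocab_size=128_000,
    model_type="unigram",
    pad_id=0, unk_id=1, bos_id=2, eos_id=3,
    max_sentencepiece_length=20,
    normalization_rule_name="nmt_nfkc",
    byte_fallback=True,            # full coverage via byte-level pieces
    character_coverage=0.9995,     # assumed; a common choice for Indic scripts
)

if __name__ == "__main__":
    import sentencepiece as spm
    spm.SentencePieceTrainer.train(**TRAIN_ARGS)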