# GBM Tokenizer
A SentencePiece unigram tokenizer optimized for Garhwali (ISO 639-3: `gbm`), a Central Pahari language of Uttarakhand, India. It handles Devanagari script and mixed Garhwali–English text, making it suitable for language modeling, machine translation, and other NLP tasks on Garhwali corpora.
## Key highlights
- 128,000 tokens — Vocabulary size aligned with modern LLM tokenizers (e.g. Llama 3).
- Optimized for Garhwali — Trained on a domain-specific corpus (621K+ lines).
- Efficient tokenization — ~2.11 tokens per word (comparable to GPT-4o’s ~1.92).
- Strong compression — ~2.66× characters per token.
- Fast — ~2.2M tokens/sec encoding on CPU (single-threaded, Apple Silicon).
## Model details
| Property | Value |
|---|---|
| Vocab size | 128,000 |
| Model type | Unigram (SentencePiece) |
| Scripts | Devanagari, Latin |
| Special tokens | pad_id=0, unk_id=1, bos_id=2, eos_id=3 |
| Normalization | NMT NFKC, byte fallback for full coverage |
| Max piece length | 20 |
| License | Unlicense (public domain) |
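By default, SentencePiece reserves the ids in the table above but `sp.encode` does not insert them for you; a minimal sketch of wrapping an id sequence for language-model training, assuming the ids from the table (the helper `add_specials` is hypothetical, not part of this project):

```python
# Special-token ids from the model details table above.
PAD_ID, UNK_ID, BOS_ID, EOS_ID = 0, 1, 2, 3

def add_specials(ids, max_len=None):
    """Wrap token ids with BOS/EOS and right-pad with PAD up to max_len."""
    ids = [BOS_ID] + list(ids) + [EOS_ID]
    if max_len is not None:
        ids += [PAD_ID] * (max_len - len(ids))
    return ids
```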
## Evaluation
Benchmarks on 152 test cases (Devanagari, English, mixed, code, math):
| Tokenizer | Vocab size | Compression | Round-trip accuracy | Fertility (tokens/word) | Speed |
|---|---|---|---|---|---|
| GBM Tokenizer | 128,000 | 2.66× | 98.5% | 2.11 | ~2.2M t/s |
| GPT-4o (o200k) | 199,998 | 2.93× | 100% | 1.92 | ~1.2M t/s |
| Gemma 3 | 262,144 | 3.06× | 100% | 1.84 | ~0.5M t/s |
| Llama 3 | 128,000 | 2.51× | 99.5% | 2.24 | ~0.4M t/s |
| GPT-4/Claude | 100,256 | 1.77× | 100% | 3.18 | ~1.6M t/s |
| Sarvam-1 | 68,096 | 2.31× | 100% | 2.44 | ~0.6M t/s |
Metrics are from the project’s `eval.txt` test set; speeds were measured single-threaded on Apple Silicon and will vary with hardware.
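The fertility and compression columns can be reproduced for any tokenizer with two short metrics. A sketch, using a toy stand-in for `sp.encode(text, out_type=str)`; the exact counting conventions (e.g. whether spaces count as characters) are assumptions, not the project's documented methodology:

```python
def tokenize(text):
    """Toy stand-in for sp.encode(text, out_type=str): splits each
    whitespace word into a 3-character head and the remainder."""
    return [p for word in text.split() for p in (word[:3], word[3:]) if p]

def fertility(text):
    """Average tokens per whitespace-delimited word (lower is better)."""
    return len(tokenize(text)) / len(text.split())

def compression(text):
    """Average characters per token, counting non-space characters here."""
    return len(text.replace(" ", "")) / len(tokenize(text))
```

Swapping `tokenize` for a real `SentencePieceProcessor.encode` and averaging over the test set yields the table's fertility and compression figures.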
## Usage
### With SentencePiece (Python)
```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load("gbm_tokenizer.model")  # or the path returned by hf_hub_download

tokens = sp.encode("गढ़वळि पाठ", out_type=str)  # subword pieces
ids = sp.encode("गढ़वळि पाठ", out_type=int)     # token ids
decoded = sp.decode(ids)                        # round-trips back to text
```
### Download from the Hub (gated — request access first)
```python
from huggingface_hub import hf_hub_download
import sentencepiece as spm

path = hf_hub_download(repo_id="somu9/gbm-tokenizer", filename="gbm_tokenizer.model")
sp = spm.SentencePieceProcessor()
sp.load(path)
```
## Training
- Algorithm: SentencePiece unigram with default splitting (by Unicode script, whitespace, and digits).
- Corpus: Garhwali-focused text; configurable via the training repo (`train.py`, `corpus.txt`).
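A training invocation consistent with the model details above can be sketched with the SentencePiece Python trainer. Only the vocab size, model type, special-token ids, max piece length, normalization, and byte fallback are documented by this card; the corpus path and character coverage below are illustrative assumptions, not the project's exact settings in `train.py`:

```python
# Assumed trainer settings reconstructed from the model card.
TRAIN_ARGS = dict(
    input="corpus.txt",            # training corpus, one sentence per line
    model_prefix="gbm_tokenizer",  # writes gbm_tokenizer.model / .vocab
    vocab_size=128_000,
    model_type="unigram",
    pad_id=0, unk_id=1, bos_id=2, eos_id=3,
    max_sentencepiece_length=20,
    normalization_rule_name="nmt_nfkc",
    byte_fallback=True,            # full coverage via byte-level pieces
    character_coverage=0.9995,     # assumed; a common choice for Indic scripts
)

if __name__ == "__main__":
    import sentencepiece as spm
    spm.SentencePieceTrainer.train(**TRAIN_ARGS)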