|
|
--- |
|
|
license: unlicense |
|
|
language: |
|
|
- gbm |
|
|
- hi |
|
|
- en |
|
|
tags: |
|
|
- tokenizer |
|
|
- sentencepiece |
|
|
- garhwali |
|
|
- devanagari |
|
|
library_name: sentencepiece |
|
|
gated: true |
|
|
extra_gated_heading: "Request access to the GBM tokenizer" |
|
|
extra_gated_description: "By agreeing you share your Hugging Face username and email with the model authors." |
|
|
extra_gated_button_content: "Agree and send request" |
|
|
--- |
|
|
|
|
|
# GBM Tokenizer |
|
|
|
|
|
A **SentencePiece unigram** tokenizer optimized for **Garhwali** (ISO 639-3: `gbm`), a Central Pahari language of Uttarakhand, India. It handles **Devanagari** script and mixed Garhwali–English text, making it suitable for language modeling, machine translation, and other NLP tasks on Garhwali corpora.
|
|
|
|
|
## Key highlights |
|
|
|
|
|
- **128,000 tokens** — Vocabulary size aligned with modern LLM tokenizers (e.g. Llama 3). |
|
|
- **Optimized for Garhwali** — Trained on a domain-specific corpus (621K+ lines). |
|
|
- **Efficient tokenization** — ~2.11 tokens per word (comparable to GPT-4o’s ~1.92). |
|
|
- **Strong compression** — ~2.66 characters per token on average.
|
|
- **Fast** — ~2.2M tokens/sec encoding on CPU (single-threaded, Apple Silicon). |
|
|
|
|
|
## Model details |
|
|
|
|
|
| Property | Value | |
|
|
|----------|--------| |
|
|
| Vocab size | 128,000 | |
|
|
| Model type | Unigram (SentencePiece) | |
|
|
| Scripts | Devanagari, Latin | |
|
|
| Special tokens | `pad_id=0`, `unk_id=1`, `bos_id=2`, `eos_id=3` | |
|
|
| Normalization | NMT NFKC (`nmt_nfkc`), with byte fallback for full coverage |
|
|
| Max piece length | 20 | |
|
|
| License | Unlicense (public domain) | |
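The `nmt_nfkc` rule is, roughly, Unicode NFKC normalization plus NMT-specific whitespace and control-character cleanup. Its NFKC component can be previewed with Python's standard `unicodedata` module (an approximation of the effect, not the exact SentencePiece rule):

```python
import unicodedata

# NFKC folds compatibility characters, e.g. full-width Latin to ASCII.
print(unicodedata.normalize("NFKC", "ＧＢＭ tokenizer"))  # "GBM tokenizer"

# Devanagari text is largely unchanged, aside from canonical reordering
# of combining marks such as the nukta.
print(unicodedata.normalize("NFKC", "गढ़वळि पाठ"))
```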
|
|
|
|
|
## Evaluation (comparison) |
|
|
|
|
|
Benchmarks on 152 test cases (Devanagari, English, mixed, code, math): |
|
|
|
|
|
| Tokenizer | Vocab size | Compression | Round-trip accuracy | Fertility (tokens/word) | Speed | |
|
|
|-----------|------------|--------------|----------------------|-------------------------|-------| |
|
|
| **GBM Tokenizer** | 128,000 | 2.66× | 98.5% | 2.11 | ~2.2M t/s | |
|
|
| GPT-4o (o200k) | 199,998 | 2.93× | 100% | 1.92 | ~1.2M t/s | |
|
|
| Gemma 3 | 262,144 | 3.06× | 100% | 1.84 | ~0.5M t/s | |
|
|
| Llama 3 | 128,000 | 2.51× | 99.5% | 2.24 | ~0.4M t/s | |
|
|
| GPT-4/Claude | 100,256 | 1.77× | 100% | 3.18 | ~1.6M t/s | |
|
|
| Sarvam-1 | 68,096 | 2.31× | 100% | 2.44 | ~0.6M t/s | |
|
|
|
|
|
*Metrics come from the project’s `eval.txt` test set; speed figures are hardware-dependent (measured on Apple Silicon).*
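Fertility and compression can be computed as sketched below (illustrative helpers, not the project’s actual `eval.txt` harness; in practice `encode` would be `sp.encode` from a loaded model):

```python
def fertility(texts, encode):
    """Average tokens per whitespace-delimited word (lower is better)."""
    total_tokens = sum(len(encode(t)) for t in texts)
    total_words = sum(len(t.split()) for t in texts)
    return total_tokens / total_words

def compression(texts, encode):
    """Average characters per token (higher is better)."""
    total_chars = sum(len(t) for t in texts)
    total_tokens = sum(len(encode(t)) for t in texts)
    return total_chars / total_tokens

# Demo with a toy whitespace "tokenizer" (one token per word).
texts = ["hello world", "garhwali text"]
print(fertility(texts, str.split))    # 1.0
print(compression(texts, str.split))  # 6.0
```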
|
|
|
|
|
## Usage |
|
|
|
|
|
### With SentencePiece (Python) |
|
|
|
|
|
```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load("gbm_tokenizer.model")  # or path from hf_hub_download

tokens = sp.encode("गढ़वळि पाठ", out_type=str)  # subword pieces
ids = sp.encode("गढ़वळि पाठ", out_type=int)     # token IDs
decoded = sp.decode(ids)                        # round-trip back to text
```
|
|
|
|
|
### Download from Hub (gated — request access first) |
|
|
|
|
|
```python
from huggingface_hub import hf_hub_download
import sentencepiece as spm

# Requires an accepted access request; authenticate first
# (`huggingface-cli login` or the HF_TOKEN environment variable).
path = hf_hub_download(repo_id="somu9/gbm-tokenizer", filename="gbm_tokenizer.model")
sp = spm.SentencePieceProcessor()
sp.load(path)
```
|
|
|
|
|
## Training |
|
|
|
|
|
- **Algorithm**: SentencePiece unigram with default pre-tokenization splitting (by Unicode script, whitespace, and numbers).
|
|
- **Corpus**: Garhwali-focused text; configurable via the [training repo](https://github.com/sumitesh9/gbm-tokenizer) (`train.py`, `corpus.txt`). |
|
|
|
|
|
## References |
|
|
|
|
|
- [SentencePiece](https://github.com/google/sentencepiece) |
|
|
- [ISO 639-3: GBM (Garhwali)](https://iso639-3.sil.org/code/gbm) |
|
|
- [Project: gbm-tokenizer](https://github.com/sumitesh9/gbm-tokenizer) |
|
|
|