---
license: unlicense
language:
- gbm
- hi
- en
tags:
- tokenizer
- sentencepiece
- garhwali
- devanagari
library_name: sentencepiece
gated: true
extra_gated_heading: "Request access to the GBM tokenizer"
extra_gated_description: "By agreeing you share your Hugging Face username and email with the model authors."
extra_gated_button_content: "Agree and send request"
---
# GBM Tokenizer
A **SentencePiece unigram** tokenizer optimized for **Garhwali** (ISO 639-3: `gbm`), a Central Pahari language of Uttarakhand, India. It handles **Devanagari** script and mixed Garhwali–English text, making it suitable for language modeling, machine translation, and NLP on Garhwali corpora.
## Key highlights
- **128,000 tokens** — Vocabulary size aligned with modern LLM tokenizers (e.g. Llama 3).
- **Optimized for Garhwali** — Trained on a domain-specific corpus (621K+ lines).
- **Efficient tokenization** — ~2.11 tokens per word (comparable to GPT-4o’s ~1.92).
- **Strong compression** — ~2.66 characters per token on average.
- **Fast** — ~2.2M tokens/sec encoding on CPU (single-threaded, Apple Silicon).
## Model details
| Property | Value |
|----------|--------|
| Vocab size | 128,000 |
| Model type | Unigram (SentencePiece) |
| Scripts | Devanagari, Latin |
| Special tokens | `pad_id=0`, `unk_id=1`, `bos_id=2`, `eos_id=3` |
| Normalization | NFKC (`nmt_nfkc`), with byte fallback for full character coverage |
| Max piece length | 20 |
| License | Unlicense (public domain) |
## Evaluation (comparison)
Benchmarks on 152 test cases (Devanagari, English, mixed, code, math):
| Tokenizer | Vocab size | Compression | Round-trip accuracy | Fertility (tokens/word) | Speed |
|-----------|------------|--------------|----------------------|-------------------------|-------|
| **GBM Tokenizer** | 128,000 | 2.66× | 98.5% | 2.11 | ~2.2M t/s |
| GPT-4o (o200k) | 199,998 | 2.93× | 100% | 1.92 | ~1.2M t/s |
| Gemma 3 | 262,144 | 3.06× | 100% | 1.84 | ~0.5M t/s |
| Llama 3 | 128,000 | 2.51× | 99.5% | 2.24 | ~0.4M t/s |
| GPT-4/Claude | 100,256 | 1.77× | 100% | 3.18 | ~1.6M t/s |
| Sarvam-1 | 68,096 | 2.31× | 100% | 2.44 | ~0.6M t/s |
*Metrics come from the project’s `eval.txt` test set; speed figures are hardware-dependent (measured single-threaded on Apple Silicon).*
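The fertility and compression columns above can be reproduced with a small helper. A minimal sketch — note the whitespace word split is a simplification; the project’s `eval.txt` harness may segment words differently:

```python
def tokenization_metrics(text, tokens):
    """Compute fertility (tokens per whitespace-separated word) and
    compression (characters per token) for a single sample."""
    words = text.split()
    fertility = len(tokens) / len(words)
    chars_per_token = len(text) / len(tokens)
    return fertility, chars_per_token

# Example with a hand-tokenized sample; in real use the tokens would
# come from sp.encode(text, out_type=str) on the loaded model.
text = "hello world"
tokens = ["hel", "lo", "wor", "ld"]
fertility, compression = tokenization_metrics(text, tokens)
print(fertility, compression)  # 2.0 tokens/word, 2.75 chars/token
```

Averaging these two numbers over the full test set yields the per-tokenizer figures reported in the table.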
## Usage
### With SentencePiece (Python)
```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load("gbm_tokenizer.model")  # or the path returned by hf_hub_download

tokens = sp.encode("गढ़वळि पाठ", out_type=str)  # subword pieces
ids = sp.encode("गढ़वळि पाठ", out_type=int)     # token IDs
decoded = sp.decode(ids)                        # round-trips back to the text
```
### Download from Hub (gated — request access first)
```python
from huggingface_hub import hf_hub_download
import sentencepiece as spm
path = hf_hub_download(repo_id="somu9/gbm-tokenizer", filename="gbm_tokenizer.model")
sp = spm.SentencePieceProcessor()
sp.load(path)
```
## Training
- **Algorithm**: SentencePiece unigram with default pre-splitting (Unicode script boundaries, whitespace, digits).
- **Corpus**: Garhwali-focused text; configurable via the [training repo](https://github.com/sumitesh9/gbm-tokenizer) (`train.py`, `corpus.txt`).
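A trainer invocation consistent with the model-details table might look like the following. This is a sketch only; the authoritative configuration lives in `train.py` in the training repo, and the corpus path is illustrative:

```python
import sentencepiece as spm

# Sketch of a SentencePiece trainer call matching the published
# model details (unigram, 128K vocab, NFKC normalization, byte
# fallback, max piece length 20, and the listed special-token IDs).
spm.SentencePieceTrainer.train(
    input="corpus.txt",                  # Garhwali-focused training text
    model_prefix="gbm_tokenizer",        # writes gbm_tokenizer.{model,vocab}
    model_type="unigram",
    vocab_size=128000,
    normalization_rule_name="nmt_nfkc",
    byte_fallback=True,                  # full coverage for unseen bytes
    max_sentencepiece_length=20,
    pad_id=0, unk_id=1, bos_id=2, eos_id=3,
)
```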
## References
- [SentencePiece](https://github.com/google/sentencepiece)
- [ISO 639-3: GBM (Garhwali)](https://iso639-3.sil.org/code/gbm)
- [Project: gbm-tokenizer](https://github.com/sumitesh9/gbm-tokenizer)