|
|
--- |
|
|
license: unlicense |
|
|
language: |
|
|
- gbm |
|
|
- hi |
|
|
- en |
|
|
tags: |
|
|
- tokenizer |
|
|
- sentencepiece |
|
|
- garhwali |
|
|
- devanagari |
|
|
library_name: sentencepiece |
|
|
gated: true |
|
|
extra_gated_heading: "Request access to the GBM tokenizer" |
|
|
extra_gated_description: "By agreeing you share your Hugging Face username and email with the model authors." |
|
|
extra_gated_button_content: "Agree and send request" |
|
|
--- |
|
|
|
|
|
# GBM Tokenizer |
|
|
|
|
|
A **SentencePiece unigram** tokenizer optimized for **Garhwali** (ISO 639-3: `gbm`), a Central Pahari language of Uttarakhand, India. It handles **Devanagari** script and mixed Garhwali–English text, making it suitable for language modeling, machine translation, and other NLP tasks on Garhwali corpora.
|
|
|
|
|
## Key highlights |
|
|
|
|
|
- **128,000 tokens** — Vocabulary size aligned with modern LLM tokenizers (e.g. Llama 3). |
|
|
- **Optimized for Garhwali** — Trained on a domain-specific corpus (621K+ lines). |
|
|
- **Efficient tokenization** — ~2.11 tokens per word (comparable to GPT-4o’s ~1.92). |
|
|
- **Strong compression** — ~2.66 characters per token on average.
|
|
- **Fast** — ~2.2M tokens/sec encoding on CPU (single-threaded, Apple Silicon). |
|
|
|
|
|
## Model details |
|
|
|
|
|
| Property | Value | |
|
|
|----------|--------| |
|
|
| Vocab size | 128,000 | |
|
|
| Model type | Unigram (SentencePiece) | |
|
|
| Scripts | Devanagari, Latin | |
|
|
| Special tokens | `pad_id=0`, `unk_id=1`, `bos_id=2`, `eos_id=3` | |
|
|
| Normalization | NMT NFKC (`nmt_nfkc`), with byte fallback for full coverage |
|
|
| Max piece length | 20 | |
|
|
| License | Unlicense (public domain) | |
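The `nmt_nfkc` rule is, roughly, Unicode NFKC normalization plus NMT-specific whitespace and control-character cleanup. Its NFKC component can be previewed with Python's standard `unicodedata` module (an approximation of the effect, not the exact SentencePiece rule):

```python
import unicodedata

# NFKC folds compatibility characters, e.g. full-width Latin to ASCII.
print(unicodedata.normalize("NFKC", "ＧＢＭ tokenizer"))  # "GBM tokenizer"

# Devanagari text is largely unchanged, aside from canonical reordering
# of combining marks such as the nukta.
print(unicodedata.normalize("NFKC", "गढ़वळि पाठ"))
```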
|
|
|
|
|
## Evaluation (comparison) |
|
|
|
|
|
Benchmarks on 152 test cases (Devanagari, English, mixed, code, math): |
|
|
|
|
|
| Tokenizer | Vocab size | Compression | Round-trip accuracy | Fertility (tokens/word) | Speed | |
|
|
|-----------|------------|--------------|----------------------|-------------------------|-------| |
|
|
| **GBM Tokenizer** | 128,000 | 2.66× | 98.5% | 2.11 | ~2.2M t/s | |
|
|
| GPT-4o (o200k) | 199,998 | 2.93× | 100% | 1.92 | ~1.2M t/s | |
|
|
| Gemma 3 | 262,144 | 3.06× | 100% | 1.84 | ~0.5M t/s | |
|
|
| Llama 3 | 128,000 | 2.51× | 99.5% | 2.24 | ~0.4M t/s | |
|
|
| GPT-4/Claude | 100,256 | 1.77× | 100% | 3.18 | ~1.6M t/s | |
|
|
| Sarvam-1 | 68,096 | 2.31× | 100% | 2.44 | ~0.6M t/s | |
|
|
|
|
|
*Metrics come from the project’s `eval.txt` test set; speed figures are hardware-dependent (measured on Apple Silicon).*
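Fertility and compression can be computed as sketched below (illustrative helpers, not the project’s actual `eval.txt` harness; in practice `encode` would be `sp.encode` from a loaded model):

```python
def fertility(texts, encode):
    """Average tokens per whitespace-delimited word (lower is better)."""
    total_tokens = sum(len(encode(t)) for t in texts)
    total_words = sum(len(t.split()) for t in texts)
    return total_tokens / total_words

def compression(texts, encode):
    """Average characters per token (higher is better)."""
    total_chars = sum(len(t) for t in texts)
    total_tokens = sum(len(encode(t)) for t in texts)
    return total_chars / total_tokens

# Demo with a toy whitespace "tokenizer" (one token per word).
texts = ["hello world", "garhwali text"]
print(fertility(texts, str.split))    # 1.0
print(compression(texts, str.split))  # 6.0
```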
|
|
|
|
|
## Usage |
|
|
|
|
|
### With SentencePiece (Python) |
|
|
|
|
|
```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load("gbm_tokenizer.model")  # or path from hf_hub_download

tokens = sp.encode("गढ़वळि पाठ", out_type=str)  # subword pieces
ids = sp.encode("गढ़वळि पाठ", out_type=int)     # token IDs
decoded = sp.decode(ids)                        # round-trip back to text
```
|
|
|
|
|
### Download from Hub (gated — request access first) |
|
|
|
|
|
```python
from huggingface_hub import hf_hub_download
import sentencepiece as spm

# Requires an accepted access request; authenticate first
# (`huggingface-cli login` or the HF_TOKEN environment variable).
path = hf_hub_download(repo_id="somu9/gbm-tokenizer", filename="gbm_tokenizer.model")
sp = spm.SentencePieceProcessor()
sp.load(path)
```
|
|
|
|
|
## Training |
|
|
|
|
|
- **Algorithm**: SentencePiece unigram with default pre-tokenization splitting (by Unicode script, whitespace, and numbers).
|
|
- **Corpus**: Garhwali-focused text; configurable via the [training repo](https://github.com/sumitesh9/gbm-tokenizer) (`train.py`, `corpus.txt`). |
|
|
|
|
|
## References |
|
|
|
|
|
- [SentencePiece](https://github.com/google/sentencepiece) |
|
|
- [ISO 639-3: GBM (Garhwali)](https://iso639-3.sil.org/code/gbm) |
|
|
- [Project: gbm-tokenizer](https://github.com/sumitesh9/gbm-tokenizer) |
|
|
|