---
license: unlicense
language:
  - gbm
  - hi
  - en
tags:
  - tokenizer
  - sentencepiece
  - garhwali
  - devanagari
library_name: sentencepiece
gated: true
extra_gated_heading: Request access to the GBM tokenizer
extra_gated_description: >-
  By agreeing you share your Hugging Face username and email with the model
  authors.
extra_gated_button_content: Agree and send request
---

# GBM Tokenizer

A SentencePiece unigram tokenizer optimized for Garhwali (ISO 639-3: gbm), a Central Pahari language of Uttarakhand, India. It handles Devanagari script and mixed Garhwali–English text, making it suitable for language modeling, machine translation, and other NLP tasks on Garhwali corpora.

## Key highlights

- **128,000 tokens** — vocabulary size aligned with modern LLM tokenizers (e.g. Llama 3).
- **Optimized for Garhwali** — trained on a domain-specific corpus (621K+ lines).
- **Efficient tokenization** — ~2.11 tokens per word (comparable to GPT-4o's ~1.92).
- **Strong compression** — ~2.66 characters per token.
- **Fast** — ~2.2M tokens/sec encoding on CPU (single-threaded, Apple Silicon).
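The fertility and compression figures above can be reproduced with two short helpers. This is a minimal sketch: `tokenize` is a placeholder for the real tokenizer (with SentencePiece you would pass `lambda t: sp.encode(t, out_type=str)`), and the toy whitespace splitter is for illustration only.

```python
# Sketch of the metrics quoted above; `tokenize` maps text -> list of tokens.

def fertility(texts, tokenize):
    """Average number of tokens per whitespace-delimited word."""
    words = sum(len(t.split()) for t in texts)
    tokens = sum(len(tokenize(t)) for t in texts)
    return tokens / words

def compression(texts, tokenize):
    """Average number of characters per token."""
    chars = sum(len(t) for t in texts)
    tokens = sum(len(tokenize(t)) for t in texts)
    return chars / tokens

# Toy tokenizer (whitespace split) for illustration only.
toy = lambda s: s.split()
sample = ["hello world", "garhwali text here"]
print(fertility(sample, toy))    # 1.0 with the toy splitter
print(compression(sample, toy))
```

With the real model, a fertility near 2.11 and compression near 2.66 on the evaluation set would match the numbers reported above.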

## Model details

| Property | Value |
|---|---|
| Vocab size | 128,000 |
| Model type | Unigram (SentencePiece) |
| Scripts | Devanagari, Latin |
| Special tokens | `pad_id=0`, `unk_id=1`, `bos_id=2`, `eos_id=3` |
| Normalization | NMT NFKC, with byte fallback for full coverage |
| Max piece length | 20 |
| License | Unlicense (public domain) |
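The fixed special-token ids in the table are typically used when assembling model inputs. The helper below is an illustrative sketch (the function name and truncation policy are assumptions, not part of the tokenizer itself):

```python
# Special-token ids as listed in the table above.
PAD_ID, UNK_ID, BOS_ID, EOS_ID = 0, 1, 2, 3

def build_input(ids, max_len):
    """Wrap token ids with BOS/EOS and right-pad with PAD to max_len.

    Truncates the body if it would overflow (illustrative policy)."""
    body = ids[: max_len - 2]
    seq = [BOS_ID] + body + [EOS_ID]
    return seq + [PAD_ID] * (max_len - len(seq))

print(build_input([11, 12, 13], 6))  # [2, 11, 12, 13, 3, 0]
```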

## Evaluation (comparison)

Benchmarks on 152 test cases (Devanagari, English, mixed, code, math):

| Tokenizer | Vocab size | Compression | Round-trip accuracy | Fertility (tokens/word) | Speed |
|---|---|---|---|---|---|
| GBM Tokenizer | 128,000 | 2.66× | 98.5% | 2.11 | ~2.2M t/s |
| GPT-4o (o200k) | 199,998 | 2.93× | 100% | 1.92 | ~1.2M t/s |
| Gemma 3 | 262,144 | 3.06× | 100% | 1.84 | ~0.5M t/s |
| Llama 3 | 128,000 | 2.51× | 99.5% | 2.24 | ~0.4M t/s |
| GPT-4/Claude | 100,256 | 1.77× | 100% | 3.18 | ~1.6M t/s |
| Sarvam-1 | 68,096 | 2.31× | 100% | 2.44 | ~0.6M t/s |

Metrics are from the project's eval.txt test set; speed is hardware-dependent (measured here on Apple Silicon).
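Round-trip accuracy in the table is the fraction of test cases that survive encode → decode unchanged. A minimal sketch, with `encode`/`decode` standing in for `sp.encode`/`sp.decode` (the identity codec below is for illustration only):

```python
# Share of test cases where decode(encode(text)) reproduces the input exactly.

def round_trip_accuracy(cases, encode, decode):
    ok = sum(1 for text in cases if decode(encode(text)) == text)
    return ok / len(cases)

# Identity codec for illustration; a real lossy tokenizer can score below 1.0.
enc = lambda s: list(s)
dec = lambda toks: "".join(toks)
print(round_trip_accuracy(["गढ़वळि", "hello"], enc, dec))  # 1.0
```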

## Usage

### With SentencePiece (Python)

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load("gbm_tokenizer.model")  # or path from hf_hub_download

tokens = sp.encode("गढ़वळि पाठ", out_type=str)  # token strings
ids = sp.encode("गढ़वळि पाठ", out_type=int)     # token ids
decoded = sp.decode(ids)                         # back to text
```

### Download from the Hub (gated — request access first)

```python
from huggingface_hub import hf_hub_download
import sentencepiece as spm

path = hf_hub_download(repo_id="somu9/gbm-tokenizer", filename="gbm_tokenizer.model")
sp = spm.SentencePieceProcessor()
sp.load(path)
```

## Training

- **Algorithm**: SentencePiece unigram with default splitting (Unicode script, whitespace, numbers).
- **Corpus**: Garhwali-focused text; configurable via the training repo (`train.py`, `corpus.txt`).
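A training invocation consistent with the properties listed under Model details might look like the sketch below. This is a hedged reconstruction, not the project's actual `train.py`: every flag value is taken from the tables above, but the exact options used in the repo may differ.

```python
import sentencepiece as spm

# Hypothetical trainer call reconstructed from the Model details table;
# consult the project's train.py for the authoritative configuration.
spm.SentencePieceTrainer.train(
    input="corpus.txt",                   # Garhwali-focused corpus
    model_prefix="gbm_tokenizer",
    model_type="unigram",
    vocab_size=128000,
    normalization_rule_name="nmt_nfkc",   # NMT NFKC normalization
    byte_fallback=True,                   # full coverage via byte pieces
    max_sentencepiece_length=20,
    pad_id=0, unk_id=1, bos_id=2, eos_id=3,
)
```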

## References