---
license: unlicense
language:
  - gbm
  - hi
  - en
tags:
  - tokenizer
  - sentencepiece
  - garhwali
  - devanagari
library_name: sentencepiece
gated: true
extra_gated_heading: Request access to the GBM tokenizer
extra_gated_description: >-
  By agreeing you share your Hugging Face username and email with the model
  authors.
extra_gated_button_content: Agree and send request
---

# GBM Tokenizer

A SentencePiece unigram tokenizer optimized for Garhwali (ISO 639-3: gbm), a Central Pahari language of Uttarakhand, India. It handles Devanagari script and mixed Garhwali–English text, making it suitable for language modeling, machine translation, and other NLP tasks on Garhwali corpora.

## Key highlights

- **128,000 tokens** — vocabulary size aligned with modern LLM tokenizers (e.g. Llama 3).
- **Optimized for Garhwali** — trained on a domain-specific corpus (621K+ lines).
- **Efficient tokenization** — ~2.11 tokens per word (comparable to GPT-4o's ~1.92).
- **Strong compression** — ~2.66 characters per token.
- **Fast** — ~2.2M tokens/sec encoding on CPU (single-threaded, Apple Silicon).
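The fertility and compression figures above can be reproduced with two short helpers. This is a minimal sketch: `tokenize` is a placeholder for the real tokenizer (with SentencePiece you would pass `lambda t: sp.encode(t, out_type=str)`), and the toy whitespace splitter is for illustration only.

```python
# Sketch of the metrics quoted above; `tokenize` maps text -> list of tokens.

def fertility(texts, tokenize):
    """Average number of tokens per whitespace-delimited word."""
    words = sum(len(t.split()) for t in texts)
    tokens = sum(len(tokenize(t)) for t in texts)
    return tokens / words

def compression(texts, tokenize):
    """Average number of characters per token."""
    chars = sum(len(t) for t in texts)
    tokens = sum(len(tokenize(t)) for t in texts)
    return chars / tokens

# Toy tokenizer (whitespace split) for illustration only.
toy = lambda s: s.split()
sample = ["hello world", "garhwali text here"]
print(fertility(sample, toy))    # 1.0 with the toy splitter
print(compression(sample, toy))
```

With the real model, a fertility near 2.11 and compression near 2.66 on the evaluation set would match the numbers reported above.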

## Model details

| Property | Value |
|---|---|
| Vocab size | 128,000 |
| Model type | Unigram (SentencePiece) |
| Scripts | Devanagari, Latin |
| Special tokens | `pad_id=0`, `unk_id=1`, `bos_id=2`, `eos_id=3` |
| Normalization | NMT NFKC, with byte fallback for full coverage |
| Max piece length | 20 |
| License | Unlicense (public domain) |
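The fixed special-token ids in the table are typically used when assembling model inputs. The helper below is an illustrative sketch (the function name and truncation policy are assumptions, not part of the tokenizer itself):

```python
# Special-token ids as listed in the table above.
PAD_ID, UNK_ID, BOS_ID, EOS_ID = 0, 1, 2, 3

def build_input(ids, max_len):
    """Wrap token ids with BOS/EOS and right-pad with PAD to max_len.

    Truncates the body if it would overflow (illustrative policy)."""
    body = ids[: max_len - 2]
    seq = [BOS_ID] + body + [EOS_ID]
    return seq + [PAD_ID] * (max_len - len(seq))

print(build_input([11, 12, 13], 6))  # [2, 11, 12, 13, 3, 0]
```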

## Evaluation (comparison)

Benchmarks on 152 test cases (Devanagari, English, mixed, code, math):

| Tokenizer | Vocab size | Compression | Round-trip accuracy | Fertility (tokens/word) | Speed |
|---|---|---|---|---|---|
| GBM Tokenizer | 128,000 | 2.66× | 98.5% | 2.11 | ~2.2M t/s |
| GPT-4o (o200k) | 199,998 | 2.93× | 100% | 1.92 | ~1.2M t/s |
| Gemma 3 | 262,144 | 3.06× | 100% | 1.84 | ~0.5M t/s |
| Llama 3 | 128,000 | 2.51× | 99.5% | 2.24 | ~0.4M t/s |
| GPT-4/Claude | 100,256 | 1.77× | 100% | 3.18 | ~1.6M t/s |
| Sarvam-1 | 68,096 | 2.31× | 100% | 2.44 | ~0.6M t/s |

Metrics are from the project's eval.txt test set; speed is hardware-dependent (measured here on Apple Silicon).
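Round-trip accuracy in the table is the fraction of test cases that survive encode → decode unchanged. A minimal sketch, with `encode`/`decode` standing in for `sp.encode`/`sp.decode` (the identity codec below is for illustration only):

```python
# Share of test cases where decode(encode(text)) reproduces the input exactly.

def round_trip_accuracy(cases, encode, decode):
    ok = sum(1 for text in cases if decode(encode(text)) == text)
    return ok / len(cases)

# Identity codec for illustration; a real lossy tokenizer can score below 1.0.
enc = lambda s: list(s)
dec = lambda toks: "".join(toks)
print(round_trip_accuracy(["गढ़वळि", "hello"], enc, dec))  # 1.0
```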

## Usage

### With SentencePiece (Python)

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load("gbm_tokenizer.model")  # or path from hf_hub_download

tokens = sp.encode("गढ़वळि पाठ", out_type=str)  # token strings
ids = sp.encode("गढ़वळि पाठ", out_type=int)     # token ids
decoded = sp.decode(ids)                         # back to text
```

### Download from the Hub (gated — request access first)

```python
from huggingface_hub import hf_hub_download
import sentencepiece as spm

path = hf_hub_download(repo_id="somu9/gbm-tokenizer", filename="gbm_tokenizer.model")
sp = spm.SentencePieceProcessor()
sp.load(path)
```

## Training

- **Algorithm**: SentencePiece unigram with default splitting (Unicode script, whitespace, numbers).
- **Corpus**: Garhwali-focused text; configurable via the training repo (`train.py`, `corpus.txt`).
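A training invocation consistent with the properties listed under Model details might look like the sketch below. This is a hedged reconstruction, not the project's actual `train.py`: every flag value is taken from the tables above, but the exact options used in the repo may differ.

```python
import sentencepiece as spm

# Hypothetical trainer call reconstructed from the Model details table;
# consult the project's train.py for the authoritative configuration.
spm.SentencePieceTrainer.train(
    input="corpus.txt",                   # Garhwali-focused corpus
    model_prefix="gbm_tokenizer",
    model_type="unigram",
    vocab_size=128000,
    normalization_rule_name="nmt_nfkc",   # NMT NFKC normalization
    byte_fallback=True,                   # full coverage via byte pieces
    max_sentencepiece_length=20,
    pad_id=0, unk_id=1, bos_id=2, eos_id=3,
)
```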

## References