Akshar-32k: BPE Tokenizer for Hinglish and Minglish

Akshar is a BPE tokenizer trained on Romanized code-mixed text, the kind of Hindi and Marathi people actually type online. Not Devanagari. Not formal. The "bhai kya kar raha hai" kind.

Mainstream tokenizers (Llama-3, GPT-4o) are built primarily for English and handle Romanized Indic text badly. They fragment words into meaningless pieces: bhai becomes b, ha, i. The model then has to reconstruct meaning from that noise. Akshar fixes this for Hinglish and Minglish specifically, with a 32k vocabulary spent entirely on Romanized code-mixed subwords.

Fertility Comparison

Model        Vocab Size   Avg. Fertility
Akshar       32k          1.34
GPT-4o       ~200k        1.49
Gemma        ~256k        1.58
Llama-3      128k         1.61
Qwen2        151k         1.65
GPT-2        50k          1.79
Mistral      32k          1.93
Sarvam-2B    ~150k        2.10


Fertility = average tokens per word; lower is better. On Romanized text, Llama-3 emits roughly 20% more tokens per word than Akshar (1.61 vs 1.34), despite having a 4x larger vocabulary. The gains come from specialization, not scale.
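
Fertility is straightforward to measure. A minimal sketch, assuming whitespace word splitting (the exact word segmentation behind the numbers above is not documented):

from transformers import AutoTokenizer

def fertility(tokenizer, sentences):
    # total subword tokens divided by total whitespace-delimited words
    n_tokens = sum(len(tokenizer.encode(s, add_special_tokens=False)) for s in sentences)
    n_words = sum(len(s.split()) for s in sentences)
    return n_tokens / n_words

tok = AutoTokenizer.from_pretrained("Sujalvc/akshar-32k")
print(fertility(tok, ["bhai kya kar raha hai", "bhai kya scene hai, python sikha de"]))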

Vocabulary

32,768 tokens. Special tokens:

Token           Role
<s>             Beginning of stream
</s>            End of stream
<pad>           Padding
<unk>           Unknown
<|user|>        Chat: user turn
<|assistant|>   Chat: assistant turn
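
A quick sanity check that these are single vocabulary entries rather than fragments (a sketch; it assumes the tokens are registered under exactly the strings listed above):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Sujalvc/akshar-32k")
for token in ["<s>", "</s>", "<pad>", "<unk>", "<|user|>", "<|assistant|>"]:
    # a registered special token maps to one id, not a sequence of pieces
    print(token, tok.convert_tokens_to_ids(token))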

Training Data

~40M tokens, 20.2M sentences from four sources:

  • L3Cube-HingCorpus: Hindi-English code-mixed, scraped from Twitter
  • L3Cube-MeCorpus: Marathi-English code-mixed, social media
  • findnitai/english-to-hinglish: parallel translation data, heavily capped because the dataset skews US-centric
  • YouTube comments: the most natural of the four sources

Pipeline

Three preprocessing steps that made the most difference:

CamelCase splitting: #MumbaiIndians becomes Mumbai Indians before BPE sees it. Stops the algorithm from wasting vocabulary slots on monolithic hashtags.
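
A lookahead regex along these lines does the split (the exact pattern used in training is an assumption, including stripping the leading #):

import re

# break at every lowercase-to-uppercase boundary, then drop the hashtag marker
print(re.sub(r"(?<=[a-z])(?=[A-Z])", " ", "#MumbaiIndians").lstrip("#"))  # Mumbai Indians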

Elongated character collapsing: bhaaaaaai becomes bhai, gooooood becomes good. Social media text has a lot of this. Without handling it, these variations produce noise tokens that never appear at inference time.
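
A backreference regex handles the collapsing (a sketch; whether runs collapse to one repeat or two is an assumption, since the examples above keep one a in bhai but two o's in good):

import re

# collapse any character repeated three or more times down to a single occurrence
print(re.sub(r"(.)\1{2,}", r"\1", "bhaaaaaai"))  # bhai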

Digit isolation: 2024 becomes 2 0 2 4. Forces numerical generalization instead of memorizing specific years and quantities.
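
One substitution plus whitespace cleanup covers it (the exact pattern is an assumption):

import re

# surround every digit with spaces, then collapse runs of whitespace
text = re.sub(r"(\d)", r" \1 ", "2024")
print(re.sub(r"\s+", " ", text).strip())  # 2 0 2 4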

Everything is lowercased. No casing redundancy means the 32k slots go toward actual linguistic variety rather than storing Bhai, bhai, and BHAI as separate entries.

Usage

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Sujalvc/akshar-32k")

text = "bhai kya scene hai, python sikha de"
tokens = tokenizer.encode(text)
print(tokenizer.convert_ids_to_tokens(tokens))
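
Decoding back to text uses the standard transformers API:

print(tokenizer.decode(tokens, skip_special_tokens=True))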

Limitations

Spelling variation is the hardest unsolved problem. "kaise" / "kese" / "kayse" are all common and all produce different token sequences. High-frequency spellings tokenize well; rare variants still fragment. There is no clean fix without a normalization layer or significantly more training data.
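
The fragmentation is easy to inspect by tokenizing the variants side by side (a sketch; it makes no claim about which variant stays whole):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Sujalvc/akshar-32k")
for variant in ["kaise", "kese", "kayse"]:
    # the more frequent spellings should come back as fewer pieces
    print(variant, tok.tokenize(variant))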

Minglish coverage is thinner than Hinglish coverage. MeCorpus helped, but Marathi Romanization has wider spelling variance than Hindi, and 40M tokens is not enough to cover the long tail.

The findnitai dataset introduced some Western geographic noise (US city names ended up as tokens). Mitigated by capping its contribution to the corpus, but not fully eliminated.

Acknowledgements

  • L3Cube-Pune for MeCorpus and HingCorpus. Their work in Romanized Indic NLP is the foundation of this project.
  • findnitai for the English-to-Hinglish parallel corpus.
  • Sarvam AI for Sarvam-2B, which provided a useful Indic baseline for comparison.
  • The Hugging Face Tokenizers team for the Rust-based BPE engine.

Citation

@misc{choudhari2026akshar,
  author       = {Sujal Choudhari},
  title        = {Akshar: A High-Efficiency BPE Tokenizer for Romanized Code-Mixed Indic Text},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Sujalvc/akshar-32k}}
}