Marathi BPE Tokenizer

A Byte Pair Encoding (BPE) tokenizer for Marathi, trained on text written in the Devanagari script.

Model Details

  • Model Type: BPE Tokenizer
  • Language: Marathi (mr)
  • Script: Devanagari
  • Vocabulary Size: 4845 tokens (845 base graphemes + 4000 merges)
  • Base Vocabulary: 845 graphemes
  • Merge Operations: 4000
  • License: MIT

Training Details

The tokenizer was trained with a custom Byte Pair Encoding implementation optimized for the Devanagari script:

  • Starting Unit: Unicode extended grapheme clusters (not bytes)
  • Training Corpus Size: 92,627 characters
  • Compression Ratio (Grapheme): 2.84x
  • Compression Ratio (Byte): 12.30x
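
The compression ratios above are presumably computed as input units per output token. Here is a minimal sketch of that computation, assuming "grapheme" counts extended grapheme clusters and "byte" counts UTF-8 bytes, and using the third-party regex module (whose \X pattern matches grapheme clusters; the stdlib re module does not support it):

import regex  # third-party module: pip install regex

def compression_ratios(text, token_ids):
    # Presumed definitions: input graphemes (or UTF-8 bytes)
    # divided by the number of BPE tokens produced.
    n_graphemes = len(regex.findall(r'\X', text))
    n_bytes = len(text.encode('utf-8'))
    return n_graphemes / len(token_ids), n_bytes / len(token_ids)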

Usage

from tokenizers import Tokenizer

# Load the tokenizer
tokenizer = Tokenizer.from_pretrained("pandurangpatil/sample-marathi-bpe-tokenizer")

# Encode text
text = "नमस्कार! हे एक मराठी टोकनायझर आहे."
encoded = tokenizer.encode(text)
print(f"Token IDs: {encoded.ids}")
print(f"Tokens: {encoded.tokens}")

# Decode back to text
decoded = tokenizer.decode(encoded.ids)
print(f"Decoded: {decoded}")

Using with Custom Scripts

If you want to work with the raw artifacts (vocab.json, merges.json, id_to_token.json) directly:

import json
from tokenizer_utils import encode, decode

# Load artifacts
with open('vocab.json', 'r', encoding='utf-8') as f:
    token_to_id = json.load(f)

with open('merges.json', 'r', encoding='utf-8') as f:
    merges_str = json.load(f)
    merges = {tuple(map(int, k.split(','))): v for k, v in merges_str.items()}

with open('id_to_token.json', 'r', encoding='utf-8') as f:
    id_to_token = {int(k): v for k, v in json.load(f).items()}

# Encode and decode
text = "मराठी मजकूर"
token_ids = encode(merges, token_to_id, text)
reconstructed = decode(id_to_token, token_ids)
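
The tokenizer_utils module itself is not shown above. The following is a minimal sketch of what its encode and decode helpers might look like, under three assumptions not confirmed by the repository: that merges maps (left_id, right_id) pairs to the merged token id, that lower merged ids were learned earlier and therefore merge first, and that every grapheme in the input exists in the base vocabulary (there is no UNK token, per the Limitations below):

import regex  # third-party; \X matches extended grapheme clusters

def encode(merges, token_to_id, text):
    # Map each grapheme cluster to its base-token id. Assumes every
    # grapheme is already in the vocabulary (no UNK token exists).
    ids = [token_to_id[g] for g in regex.findall(r'\X', text)]
    while len(ids) >= 2:
        # Among adjacent pairs, merge the one learned earliest
        # (assumed here to be the pair with the lowest merged id).
        candidates = [p for p in zip(ids, ids[1:]) if p in merges]
        if not candidates:
            break
        pair = min(candidates, key=lambda p: merges[p])
        new_id, merged, i = merges[pair], [], 0
        while i < len(ids):
            if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
                merged.append(new_id)
                i += 2
            else:
                merged.append(ids[i])
                i += 1
        ids = merged
    return ids

def decode(id_to_token, token_ids):
    # Tokens are literal substrings, so decoding is concatenation.
    return ''.join(id_to_token[i] for i in token_ids)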

Grapheme-based Approach

Unlike traditional byte-level BPE, this tokenizer:

  • Starts with Unicode grapheme clusters as base units
  • Properly handles Devanagari combining characters (matras, virama)
  • Maintains linguistic meaning at the subword level
  • Achieves better compression for Devanagari text

Example of grapheme segmentation:

  • नमस्कार → [न, म, स्, का, र] (five grapheme clusters)
  • Each grapheme cluster preserves its visual and phonetic integrity
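
A quick way to reproduce this segmentation is the third-party regex module. Note that the exact clustering depends on the Unicode segmentation rules your regex build implements (newer rules may keep the conjunct स्का together as one cluster), so treat this as an illustrative sketch:

import regex  # third-party module: pip install regex

word = "नमस्कार"
print(regex.findall(r'\X', word))
# e.g. ['न', 'म', 'स्', 'का', 'र'] — 5 clusters from 7 code points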

Limitations

  • Trained on a small corpus (92,627 characters)
  • May not generalize well to domains outside the training data
  • Does not include special tokens for ML models (PAD, UNK, BOS, EOS); see the sketch below for one way to add them
  • Designed for tokenization research and experimentation
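
If you need special tokens for model training, one possible workaround is to add them after loading, using the tokenizers library. This is a sketch, not part of this tokenizer; the token strings below are arbitrary choices:

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("pandurangpatil/sample-marathi-bpe-tokenizer")
# Appends new ids after the existing 4845-token vocabulary.
tokenizer.add_special_tokens(["[PAD]", "[UNK]", "[BOS]", "[EOS]"])
print(tokenizer.token_to_id("[PAD]"))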

Citation

@misc{marathi-bpe-tokenizer,
  author = {Your Name},
  title = {Marathi BPE Tokenizer},
  year = {2025},
  publisher = {HuggingFace},
  url = {https://huggingface.co/pandurangpatil/sample-marathi-bpe-tokenizer}
}

License

MIT License - see LICENSE file for details.
