Marathi BPE Tokenizer

A Byte Pair Encoding (BPE) tokenizer for Marathi, trained on text written in the Devanagari script.

Model Details

  • Model Type: BPE Tokenizer
  • Language: Marathi (mr)
  • Script: Devanagari
  • Vocabulary Size: 4845 tokens (845 base graphemes + 4000 merges)
  • Base Vocabulary: 845 graphemes
  • Merge Operations: 4000
  • License: MIT

Training Details

The tokenizer was trained with a custom Byte Pair Encoding implementation optimized for the Devanagari script:

  • Starting Unit: Unicode extended grapheme clusters (not bytes)
  • Training Corpus Size: 92,627 characters
  • Compression Ratio (Grapheme): 2.84x
  • Compression Ratio (Byte): 12.30x
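
The compression ratios above are presumably computed as input units per output token. Here is a minimal sketch of that computation, assuming "grapheme" counts extended grapheme clusters and "byte" counts UTF-8 bytes, and using the third-party regex module (whose \X pattern matches grapheme clusters; the stdlib re module does not support it):

import regex  # third-party module: pip install regex

def compression_ratios(text, token_ids):
    # Presumed definitions: input graphemes (or UTF-8 bytes)
    # divided by the number of BPE tokens produced.
    n_graphemes = len(regex.findall(r'\X', text))
    n_bytes = len(text.encode('utf-8'))
    return n_graphemes / len(token_ids), n_bytes / len(token_ids)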

Usage

from tokenizers import Tokenizer

# Load the tokenizer
tokenizer = Tokenizer.from_pretrained("pandurangpatil/sample-marathi-bpe-tokenizer")

# Encode text
text = "नमस्कार! हे एक मराठी टोकनायझर आहे."
encoded = tokenizer.encode(text)
print(f"Token IDs: {encoded.ids}")
print(f"Tokens: {encoded.tokens}")

# Decode back to text
decoded = tokenizer.decode(encoded.ids)
print(f"Decoded: {decoded}")

Using with Custom Scripts

If you want to work with the raw artifacts (vocab.json, merges.json, id_to_token.json) directly:

import json
from tokenizer_utils import encode, decode

# Load artifacts
with open('vocab.json', 'r', encoding='utf-8') as f:
    token_to_id = json.load(f)

with open('merges.json', 'r', encoding='utf-8') as f:
    merges_str = json.load(f)
    merges = {tuple(map(int, k.split(','))): v for k, v in merges_str.items()}

with open('id_to_token.json', 'r', encoding='utf-8') as f:
    id_to_token = {int(k): v for k, v in json.load(f).items()}

# Encode and decode
text = "मराठी मजकूर"
token_ids = encode(merges, token_to_id, text)
reconstructed = decode(id_to_token, token_ids)
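
The tokenizer_utils module itself is not shown above. The following is a minimal sketch of what its encode and decode helpers might look like, under three assumptions not confirmed by the repository: that merges maps (left_id, right_id) pairs to the merged token id, that lower merged ids were learned earlier and therefore merge first, and that every grapheme in the input exists in the base vocabulary (there is no UNK token, per the Limitations below):

import regex  # third-party; \X matches extended grapheme clusters

def encode(merges, token_to_id, text):
    # Map each grapheme cluster to its base-token id. Assumes every
    # grapheme is already in the vocabulary (no UNK token exists).
    ids = [token_to_id[g] for g in regex.findall(r'\X', text)]
    while len(ids) >= 2:
        # Among adjacent pairs, merge the one learned earliest
        # (assumed here to be the pair with the lowest merged id).
        candidates = [p for p in zip(ids, ids[1:]) if p in merges]
        if not candidates:
            break
        pair = min(candidates, key=lambda p: merges[p])
        new_id, merged, i = merges[pair], [], 0
        while i < len(ids):
            if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
                merged.append(new_id)
                i += 2
            else:
                merged.append(ids[i])
                i += 1
        ids = merged
    return ids

def decode(id_to_token, token_ids):
    # Tokens are literal substrings, so decoding is concatenation.
    return ''.join(id_to_token[i] for i in token_ids)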

Grapheme-based Approach

Unlike traditional byte-level BPE, this tokenizer:

  • Starts with Unicode grapheme clusters as base units
  • Properly handles Devanagari combining characters (matras, virama)
  • Maintains linguistic meaning at the subword level
  • Achieves better compression for Devanagari text

Example of grapheme segmentation:

  • नमस्कार → [न, म, स्, का, र] (five grapheme clusters)
  • Each grapheme cluster preserves its visual and phonetic integrity
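
A quick way to reproduce this segmentation is the third-party regex module. Note that the exact clustering depends on the Unicode segmentation rules your regex build implements (newer rules may keep the conjunct स्का together as one cluster), so treat this as an illustrative sketch:

import regex  # third-party module: pip install regex

word = "नमस्कार"
print(regex.findall(r'\X', word))
# e.g. ['न', 'म', 'स्', 'का', 'र'] — 5 clusters from 7 code points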

Limitations

  • Trained on a small corpus (92,627 characters)
  • May not generalize well to domains outside the training data
  • Does not include special tokens for ML models (PAD, UNK, BOS, EOS); see the sketch below for one way to add them
  • Designed for tokenization research and experimentation
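
If you need special tokens for model training, one possible workaround is to add them after loading, using the tokenizers library. This is a sketch, not part of this tokenizer; the token strings below are arbitrary choices:

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("pandurangpatil/sample-marathi-bpe-tokenizer")
# Appends new ids after the existing 4845-token vocabulary.
tokenizer.add_special_tokens(["[PAD]", "[UNK]", "[BOS]", "[EOS]"])
print(tokenizer.token_to_id("[PAD]"))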

Citation

@misc{marathi-bpe-tokenizer,
  author = {Your Name},
  title = {Marathi BPE Tokenizer},
  year = {2025},
  publisher = {HuggingFace},
  url = {https://huggingface.co/pandurangpatil/sample-marathi-bpe-tokenizer}
}

License

MIT License - see LICENSE file for details.
