Marathi BPE Tokenizer
A Byte Pair Encoding (BPE) tokenizer trained on Marathi text using the Devanagari script.
Model Details
- Model Type: BPE Tokenizer
- Language: Marathi (mr)
- Script: Devanagari
- Vocabulary Size: 4845 tokens
- Base Vocabulary: 845 graphemes
- Merge Operations: 4000
- License: MIT
Training Details
The tokenizer was trained using a custom Byte Pair Encoding implementation optimized for Devanagari script:
- Starting Unit: Unicode extended grapheme clusters (not bytes)
- Training Corpus Size: 92,627 characters
- Compression Ratio (Grapheme): 2.84x
- Compression Ratio (Byte): 12.30x
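The merge procedure behind those numbers can be sketched in a few lines. This is a toy BPE trainer over a list of base units, not the card's actual training code: plain characters stand in for grapheme clusters, and the function names are illustrative.

```python
from collections import Counter

def most_frequent_pair(units):
    """Count adjacent pairs in the unit sequence; return the most common."""
    pairs = Counter(zip(units, units[1:]))
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(units, pair, new_unit):
    """Replace every occurrence of `pair` with the merged unit."""
    out, i = [], 0
    while i < len(units):
        if i + 1 < len(units) and (units[i], units[i + 1]) == pair:
            out.append(new_unit)
            i += 2
        else:
            out.append(units[i])
            i += 1
    return out

def train_bpe(units, num_merges):
    """Learn up to `num_merges` merge rules; return (merges, final sequence)."""
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(units)
        if pair is None:
            break
        merges.append(pair)
        units = merge_pair(units, pair, pair[0] + pair[1])
    return merges, units

corpus = list("abababcabc")
merges, final = train_bpe(corpus, 2)
```

The grapheme compression ratio reported above is simply the number of base units divided by the number of tokens after all merges are applied.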
Usage
from tokenizers import Tokenizer
# Load the tokenizer
tokenizer = Tokenizer.from_pretrained("pandurangpatil/sample-marathi-bpe-tokenizer")
# Encode text ("Hello! This is a Marathi tokenizer.")
text = "नमस्कार! हे एक मराठी टोकनायझर आहे."
encoded = tokenizer.encode(text)
print(f"Token IDs: {encoded.ids}")
print(f"Tokens: {encoded.tokens}")
# Decode back to text
decoded = tokenizer.decode(encoded.ids)
print(f"Decoded: {decoded}")
Using with Custom Scripts
If you want to use the raw artifacts:
import json
from tokenizer_utils import encode, decode
# Load artifacts
with open('vocab.json', 'r', encoding='utf-8') as f:
    token_to_id = json.load(f)
with open('merges.json', 'r', encoding='utf-8') as f:
    merges_str = json.load(f)
merges = {tuple(map(int, k.split(','))): v for k, v in merges_str.items()}
with open('id_to_token.json', 'r', encoding='utf-8') as f:
    id_to_token = {int(k): v for k, v in json.load(f).items()}
# Encode and decode ("मराठी मजकूर" = "Marathi text")
text = "मराठी मजकूर"
token_ids = encode(merges, token_to_id, text)
reconstructed = decode(id_to_token, token_ids)
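The `tokenizer_utils` module itself is not included in this card. As a rough guide to what its `encode` and `decode` could look like, here is a minimal sketch, assuming `merges` maps `(id, id)` pairs to the merged token's id (as the loading code above suggests) and `id_to_token` inverts the vocabulary; the real functions segment text into grapheme clusters, whereas this sketch uses plain characters:

```python
def encode(merges, token_to_id, text, segment=list):
    """Greedy BPE encode: segment text into base units, then repeatedly
    apply the earliest-learned merge that matches an adjacent pair."""
    ids = [token_to_id[u] for u in segment(text)]
    while len(ids) > 1:
        # candidate merges, ranked by merge id (learning order), then position
        candidates = [(merges[p], i)
                      for i, p in enumerate(zip(ids, ids[1:])) if p in merges]
        if not candidates:
            break
        new_id, i = min(candidates)
        ids = ids[:i] + [new_id] + ids[i + 2:]
    return ids

def decode(id_to_token, token_ids):
    """Map ids back to token strings and concatenate."""
    return "".join(id_to_token[i] for i in token_ids)

# toy vocabulary: characters stand in for graphemes
token_to_id = {"a": 0, "b": 1, "c": 2, "ab": 3}
id_to_token = {v: k for k, v in token_to_id.items()}
merges = {(0, 1): 3}  # merge "a" + "b" -> "ab"
ids = encode(merges, token_to_id, "abcab")
assert decode(id_to_token, ids) == "abcab"
```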
Grapheme-based Approach
Unlike traditional byte-level BPE, this tokenizer:
- Starts with Unicode grapheme clusters as base units
- Properly handles Devanagari combining characters (matras, virama)
- Maintains linguistic meaning at the subword level
- Achieves better compression for Devanagari text
Example of grapheme segmentation:
- नमस्कार → [न, म, स्, का, र] (graphemes)
- Each grapheme preserves visual/phonetic integrity
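The segmentation shown above can be approximated with the standard library alone. This is a simplification of the full UAX #29 grapheme-cluster rules, not the tokenizer's actual segmenter: it attaches combining marks (Unicode categories Mn and Mc, which cover matras and the virama) to the preceding base character.

```python
import unicodedata

def graphemes(text):
    """Rough Devanagari grapheme segmentation: a new cluster starts at
    every non-combining character; combining marks join the previous one."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch) in ("Mn", "Mc"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

print(graphemes("नमस्कार"))  # → ['न', 'म', 'स्', 'का', 'र']
```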
Limitations
- Trained on a limited corpus size
- May not generalize well to domains outside training data
- Does not include special tokens for ML models (PAD, UNK, BOS, EOS)
- Designed for tokenization research and experimentation
Citation
@misc{marathi-bpe-tokenizer,
  author    = {Your Name},
  title     = {Marathi BPE Tokenizer},
  year      = {2025},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/pandurangpatil/sample-marathi-bpe-tokenizer}
}
License
MIT License - see LICENSE file for details.