|
|
--- |
|
|
language: |
|
|
- hi |
|
|
tags: |
|
|
- tokenizer |
|
|
- bpe |
|
|
- hindi |
|
|
- devanagari |
|
|
- byte-pair-encoding |
|
|
- nlp |
|
|
license: mit |
|
|
library_name: custom |
|
|
--- |
|
|
|
|
|
# Hindi BPE Tokenizer |
|
|
|
|
|
A Byte Pair Encoding (BPE) tokenizer optimized for Hindi text using Devanagari script. |
|
|
|
|
|
## 🎯 Model Description |
|
|
|
|
|
- **Vocabulary Size:** 5,500 tokens |
|
|
- **Compression Ratio:** 6.52X average (up to 10.44X on technical text) |
|
|
- **Training Corpus:** 575K characters (1.5MB) of diverse Hindi text |
|
|
- **Decoding Accuracy:** 100% |
|
|
- **Training Time:** ~30 seconds |
|
|
|
|
|
## ✨ Features |
|
|
|
|
|
- **Hindi-Optimized:** Devanagari Unicode prioritization (`\u0900-\u097F`) |
|
|
- **High Compression:** 6.52X average, up to 10.44X on technical text |
|
|
- **Perfect Decoding:** 100% accuracy in text reconstruction |
|
|
- **Simple API:** Easy encode/decode with compression stats |
|
|
- **Fast Training:** Train from scratch in ~30 seconds |
|
|
|
|
|
## 📦 Installation |
|
|
|
|
|
```bash |
|
|
# Clone the repository |
|
|
git clone https://huggingface.co/ansul90/hindi-bpe-tokenizer |
|
|
cd hindi-bpe-tokenizer |
|
|
|
|
|
# Install dependencies |
|
|
pip install regex numpy |
|
|
# Or with uv: |
|
|
uv add regex numpy |
|
|
``` |
|
|
|
|
|
## 🚀 Quick Start |
|
|
|
|
|
### Step 1: Train the Tokenizer |
|
|
|
|
|
> **⚠️ Note:** This repository does not include the pre-trained model file (543MB). |
|
|
> You need to train it once locally, which takes only ~30 seconds. |
|
|
|
|
|
```bash |
|
|
python train_bpe_simple.py |
|
|
``` |
|
|
|
|
|
This will: |
|
|
- Load the Hindi corpus (included) |
|
|
- Train the BPE tokenizer |
|
|
- Generate `hindi_bpe_tokenizer.json` (~543MB) |
|
|
- Test on 8 Hindi samples |
|
|
- Display performance metrics |
|
|
|
|
|
### Step 2: Use the Tokenizer |
|
|
|
|
|
```python |
|
|
from hindi_bpe_tokenizer import HindiBPETokenizer |
|
|
|
|
|
# Load trained tokenizer |
|
|
tokenizer = HindiBPETokenizer() |
|
|
tokenizer.load('hindi_bpe_tokenizer.json') |
|
|
|
|
|
# Encode Hindi text |
|
|
text = "भारत एक महान देश है।" |
|
|
tokens = tokenizer.encode(text) |
|
|
print(f"Tokens: {tokens}") |
|
|
|
|
|
# Decode back to text |
|
|
decoded = tokenizer.decode(tokens) |
|
|
print(f"Decoded: {decoded}") |
|
|
|
|
|
# Get compression statistics |
|
|
stats = tokenizer.get_compression_stats(text) |
|
|
print(f"Compression ratio: {stats['compression_ratio']:.2f}X") |
|
|
print(f"Original bytes: {stats['original_bytes']}") |
|
|
print(f"Compressed tokens: {stats['compressed_tokens']}") |
|
|
``` |
|
|
|
|
|
## 📊 Performance Metrics |
|
|
|
|
|
| Metric | Value | |
|
|
|--------|-------| |
|
|
| Vocabulary Size | 5,500 tokens | |
|
|
| Compression Ratio | 6.52X (avg), 10.44X (best) | |
|
|
| Decoding Accuracy | 100% | |
|
|
| Training Corpus | 575K chars, 1.5MB | |
|
|
| Training Time | ~30 seconds | |
|
|
|
|
|
### Test Results on Different Text Types |
|
|
|
|
|
| Category | Original Bytes | Compressed Tokens | Compression Ratio | |
|
|
|----------|---------------|-------------------|-------------------| |
|
|
| Space Mission | 204 | 31 | 6.58X | |
|
|
| Cricket News | 146 | 27 | 5.41X | |
|
|
| Science & Tech | 188 | 18 | **10.44X** | |
|
|
| Language | 123 | 18 | 6.83X | |
|
|
| Education | 140 | 17 | 8.24X | |
|
|
| Environment | 132 | 21 | 6.29X | |
|
|
| Mixed Content | 125 | 34 | 3.68X | |
|
|
| Long Sentence | 240 | 33 | 7.27X | |
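
The ratios in the table follow directly from the byte and token counts: original UTF-8 bytes divided by compressed token count. A quick check, using three rows from the table above:

```python
# Compression ratio = original UTF-8 bytes / compressed token count.
# Values are taken from the test-results table above.
rows = {
    "Space Mission": (204, 31),
    "Science & Tech": (188, 18),
    "Mixed Content": (125, 34),
}

for name, (orig_bytes, tokens) in rows.items():
    print(f"{name}: {orig_bytes / tokens:.2f}X")
```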
|
|
|
|
|
## 🔧 Advanced Usage |
|
|
|
|
|
### Custom Training |
|
|
|
|
|
```python |
|
|
from hindi_bpe_tokenizer import HindiBPETokenizer |
|
|
|
|
|
# Create tokenizer with custom vocabulary size |
|
|
tokenizer = HindiBPETokenizer(vocab_size=8000) |
|
|
|
|
|
# Load your custom Hindi corpus |
|
|
with open('my_corpus.txt', 'r', encoding='utf-8') as f:
    corpus = f.read()
|
|
|
|
|
# Train |
|
|
tokenizer.train(corpus, verbose=True) |
|
|
|
|
|
# Save |
|
|
tokenizer.save('my_custom_tokenizer.json') |
|
|
``` |
|
|
|
|
|
### Get Detailed Statistics |
|
|
|
|
|
```python |
|
|
stats = tokenizer.get_compression_stats("हिंदी टेक्स्ट") |
|
|
print(f"Original characters: {stats['original_chars']}") |
|
|
print(f"Original bytes: {stats['original_bytes']}") |
|
|
print(f"Compressed tokens: {stats['compressed_tokens']}") |
|
|
print(f"Compression ratio: {stats['compression_ratio']:.2f}X") |
|
|
print(f"Vocabulary size: {stats['vocab_size']:,}") |
|
|
``` |
|
|
|
|
|
## 📁 Repository Structure |
|
|
|
|
|
``` |
|
|
hindi-bpe-tokenizer/ |
|
|
├── hindi_bpe_tokenizer.py # Core implementation (8KB) |
|
|
├── train_bpe_simple.py # Training script (5KB) |
|
|
├── create_diverse_hindi_corpus.py # Corpus generator (17KB) |
|
|
├── hindi_corpus.txt # Training data (1.5MB) |
|
|
├── training_results.json # Performance metrics (2KB) |
|
|
├── pyproject.toml # Dependencies |
|
|
└── README.md # This file |
|
|
``` |
|
|
|
|
|
**Note:** `hindi_bpe_tokenizer.json` (543MB) is generated when you run `train_bpe_simple.py`.
|
|
|
|
|
## 🎓 Training Data |
|
|
|
|
|
The tokenizer was trained on diverse Hindi content including: |
|
|
|
|
|
- **News:** Cricket, space missions, current events |
|
|
- **Science & Technology:** विज्ञान (science), प्रौद्योगिकी (technology) vocabulary


- **Education & Environment:** शिक्षा (education), पर्यावरण (environment) topics


- **Politics & Governance:** राजनीति (politics), संविधान (constitution) terms
|
|
- **Daily Life:** Common phrases, daily vocabulary |
|
|
- **Complete Alphabet:** All Devanagari letters, vowels, consonants |
|
|
- **Numbers:** Both Arabic (0-9) and Devanagari (०-९) numerals |
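
The two numeral systems occupy different Unicode ranges (ASCII `0-9` vs. Devanagari `०-९`, U+0966 to U+096F). A small stdlib sketch (not part of the tokenizer itself) mapping one to the other:

```python
# Map Devanagari digits (U+0966-U+096F) to ASCII digits.
DEVANAGARI_TO_ASCII = str.maketrans("०१२३४५६७८९", "0123456789")

converted = "वर्ष २०२५".translate(DEVANAGARI_TO_ASCII)
print(converted)  # → वर्ष 2025
```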
|
|
|
|
|
## 🔬 Technical Details |
|
|
|
|
|
### BPE Algorithm |
|
|
|
|
|
1. Start with a base vocabulary of all 256 byte values
|
|
2. Find most frequent byte pair in corpus |
|
|
3. Merge pair into new token |
|
|
4. Repeat until target vocabulary size reached |
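
The four steps above can be sketched in plain Python. This is an illustrative toy, not the repository's `HindiBPETokenizer` implementation; the function names and the stopping rule (stop when no pair repeats) are assumptions of the sketch:

```python
from collections import Counter

def get_pair_counts(ids):
    """Count adjacent token pairs in the sequence."""
    return Counter(zip(ids, ids[1:]))

def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train_bpe(text, vocab_size):
    """Steps 1-4: start from the 256 byte values, merge greedily."""
    ids = list(text.encode("utf-8"))
    merges = {}
    for new_id in range(256, vocab_size):
        counts = get_pair_counts(ids)
        best, freq = counts.most_common(1)[0]
        if freq < 2:          # no pair repeats: nothing left worth merging
            break
        merges[best] = new_id
        ids = merge(ids, best, new_id)
    return merges, ids

text = "भारत एक महान देश है। " * 4
merges, compressed = train_bpe(text, 300)
```

Because Hindi characters take 3 bytes each in UTF-8, byte-level merges recover multi-byte characters first, then frequent syllables and words.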
|
|
|
|
|
### Hindi-Specific Optimizations |
|
|
|
|
|
- Devanagari Unicode block prioritized (`\u0900-\u097F`); the pattern also matches the Bengali block (`\u0980-\u09FF`)
|
|
- Optimized regex pattern for Hindi word boundaries |
|
|
- JSON-based serialization for easy sharing |
|
|
|
|
|
### Regex Pattern |
|
|
|
|
|
```python |
|
|
r""" ?[\u0900-\u097F]+| ?[\u0980-\u09FF]+| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""" |
|
|
``` |
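
The `\p{L}`/`\p{N}` classes in the full pattern require the third-party `regex` module, but the Devanagari alternative can be demonstrated with the stdlib `re` alone. A minimal sketch of how the pattern chunks Hindi text before BPE is applied:

```python
import re

# Stdlib-only sketch of the Devanagari alternative from the pattern
# above (an optional leading space, then a run of Devanagari chars).
dev = re.compile(r" ?[\u0900-\u097F]+")

chunks = dev.findall("भारत एक महान देश है।")
print(chunks)  # → ['भारत', ' एक', ' महान', ' देश', ' है।']
```

Note how the leading space stays attached to each word, so word boundaries survive the byte-level merging.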
|
|
|
|
|
## 🤝 Use Cases |
|
|
|
|
|
- Text compression for Hindi documents |
|
|
- Tokenization for Hindi NLP models |
|
|
- Language model preprocessing |
|
|
- Text analysis and statistics |
|
|
- Educational purposes (understanding BPE) |
|
|
|
|
|
## 📝 Requirements |
|
|
|
|
|
- Python 3.13+ |
|
|
- `regex` library (for Unicode support) |
|
|
- `numpy` (optional, for numerical operations) |
|
|
|
|
|
## ⚠️ Limitations |
|
|
|
|
|
- Trained on specific corpus (may need retraining for domain-specific text) |
|
|
- Best for Devanagari script (Hindi) |
|
|
- Compression ratio varies by text type |
|
|
- Not optimized for mixed Hindi-English text (compression drops to ~3.68X) |
|
|
|
|
|
## 🛠️ Troubleshooting |
|
|
|
|
|
### Issue: `ModuleNotFoundError: No module named 'regex'`
|
|
```bash |
|
|
pip install regex |
|
|
``` |
|
|
|
|
|
### Issue: `hindi_bpe_tokenizer.json` not found
|
|
```bash |
|
|
# Train the tokenizer first |
|
|
python train_bpe_simple.py |
|
|
``` |
|
|
|
|
|
### Issue: UnicodeDecodeError |
|
|
```python
# Ensure files are read with UTF-8 encoding
with open('file.txt', 'r', encoding='utf-8') as f:
    content = f.read()
```
|
|
|
|
|
## 📚 Citation |
|
|
|
|
|
If you use this tokenizer in your research or project, please cite: |
|
|
|
|
|
```bibtex |
|
|
@misc{hindi_bpe_tokenizer_2025, |
|
|
title={Hindi BPE Tokenizer: Byte Pair Encoding for Devanagari Script}, |
|
|
author={Your Name}, |
|
|
year={2025}, |
|
|
publisher={Hugging Face}, |
|
|
url={https://huggingface.co/ansul90/hindi-bpe-tokenizer} |
|
|
} |
|
|
``` |
|
|
|
|
|
## 📄 License |
|
|
|
|
|
MIT License. See the `LICENSE` file for details.
|
|
|
|
|
## 🙏 Acknowledgments |
|
|
|
|
|
- Inspired by OpenAI's GPT-2 BPE implementation |
|
|
- Built for the Hindi NLP community |
|
|
- Based on the paper: "Neural Machine Translation of Rare Words with Subword Units" (Sennrich et al., 2016) |
|
|
|
|
|
## 📧 Contact & Links |
|
|
|
|
|
- **Hugging Face:** [ansul90/hindi-bpe-tokenizer](https://huggingface.co/ansul90/hindi-bpe-tokenizer) |
|
|
- **GitHub:** Your GitHub link |
|
|
- **Email:** Your email |
|
|
|
|
|
--- |
|
|
|
|
|
**धन्यवाद (Thank you) for using Hindi BPE Tokenizer!** 🙏 |
|
|
|
|
|
|