|
|
--- |
|
|
language: |
|
|
- hi |
|
|
tags: |
|
|
- tokenizer |
|
|
- bpe |
|
|
- hindi |
|
|
- devanagari |
|
|
- byte-pair-encoding |
|
|
- nlp |
|
|
license: mit |
|
|
library_name: custom |
|
|
--- |
|
|
|
|
|
# Hindi BPE Tokenizer |
|
|
|
|
|
A Byte Pair Encoding (BPE) tokenizer optimized for Hindi text using Devanagari script. |
|
|
|
|
|
## 🎯 Model Description |
|
|
|
|
|
- **Vocabulary Size:** 5,500 tokens |
|
|
- **Compression Ratio:** 6.52X average (up to 10.44X on technical text) |
|
|
- **Training Corpus:** 575K characters (1.5MB) of diverse Hindi text |
|
|
- **Decoding Accuracy:** 100% |
|
|
- **Training Time:** ~30 seconds |
|
|
|
|
|
## ✨ Features |
|
|
|
|
|
- **Hindi-Optimized:** Devanagari Unicode prioritization (`\u0900-\u097F`) |
|
|
- **High Compression:** 6.52X average, up to 10.44X on technical text |
|
|
- **Perfect Decoding:** 100% accuracy in text reconstruction |
|
|
- **Simple API:** Easy encode/decode with compression stats |
|
|
- **Fast Training:** Train from scratch in ~30 seconds |
|
|
|
|
|
## 📦 Installation |
|
|
|
|
|
```bash |
|
|
# Clone the repository |
|
|
git clone https://huggingface.co/ansul90/hindi-bpe-tokenizer |
|
|
cd hindi-bpe-tokenizer |
|
|
|
|
|
# Install dependencies |
|
|
pip install regex numpy |
|
|
# Or with uv: |
|
|
uv add regex numpy |
|
|
``` |
|
|
|
|
|
## 🚀 Quick Start |
|
|
|
|
|
### Step 1: Train the Tokenizer |
|
|
|
|
|
> **⚠️ Note:** This repository does not include the pre-trained model file (543MB). |
|
|
> You need to train it once locally, which takes only ~30 seconds. |
|
|
|
|
|
```bash |
|
|
python train_bpe_simple.py |
|
|
``` |
|
|
|
|
|
This will: |
|
|
- Load the Hindi corpus (included) |
|
|
- Train the BPE tokenizer |
|
|
- Generate `hindi_bpe_tokenizer.json` (~543MB) |
|
|
- Test on 8 Hindi samples |
|
|
- Display performance metrics |
|
|
|
|
|
### Step 2: Use the Tokenizer |
|
|
|
|
|
```python |
|
|
from hindi_bpe_tokenizer import HindiBPETokenizer |
|
|
|
|
|
# Load trained tokenizer |
|
|
tokenizer = HindiBPETokenizer() |
|
|
tokenizer.load('hindi_bpe_tokenizer.json') |
|
|
|
|
|
# Encode Hindi text |
|
|
text = "भारत एक महान देश है।" |
|
|
tokens = tokenizer.encode(text) |
|
|
print(f"Tokens: {tokens}") |
|
|
|
|
|
# Decode back to text |
|
|
decoded = tokenizer.decode(tokens) |
|
|
print(f"Decoded: {decoded}") |
|
|
|
|
|
# Get compression statistics |
|
|
stats = tokenizer.get_compression_stats(text) |
|
|
print(f"Compression ratio: {stats['compression_ratio']:.2f}X") |
|
|
print(f"Original bytes: {stats['original_bytes']}") |
|
|
print(f"Compressed tokens: {stats['compressed_tokens']}") |
|
|
``` |
|
|
|
|
|
## 📊 Performance Metrics |
|
|
|
|
|
| Metric | Value | |
|
|
|--------|-------| |
|
|
| Vocabulary Size | 5,500 tokens | |
|
|
| Compression Ratio | 6.52X (avg), 10.44X (best) | |
|
|
| Decoding Accuracy | 100% | |
|
|
| Training Corpus | 575K chars, 1.5MB | |
|
|
| Training Time | ~30 seconds | |
|
|
|
|
|
### Test Results on Different Text Types |
|
|
|
|
|
| Category | Original Bytes | Compressed Tokens | Compression Ratio | |
|
|
|----------|---------------|-------------------|-------------------| |
|
|
| Space Mission | 204 | 31 | 6.58X | |
|
|
| Cricket News | 146 | 27 | 5.41X | |
|
|
| Science & Tech | 188 | 18 | **10.44X** | |
|
|
| Language | 123 | 18 | 6.83X | |
|
|
| Education | 140 | 17 | 8.24X | |
|
|
| Environment | 132 | 21 | 6.29X | |
|
|
| Mixed Content | 125 | 34 | 3.68X | |
|
|
| Long Sentence | 240 | 33 | 7.27X | |
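
The ratios in the table follow directly from the byte and token counts: original UTF-8 bytes divided by compressed token count. A quick check, using three rows from the table above:

```python
# Compression ratio = original UTF-8 bytes / compressed token count.
# Values are taken from the test-results table above.
rows = {
    "Space Mission": (204, 31),
    "Science & Tech": (188, 18),
    "Mixed Content": (125, 34),
}

for name, (orig_bytes, tokens) in rows.items():
    print(f"{name}: {orig_bytes / tokens:.2f}X")
```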
|
|
|
|
|
## 🔧 Advanced Usage |
|
|
|
|
|
### Custom Training |
|
|
|
|
|
```python |
|
|
from hindi_bpe_tokenizer import HindiBPETokenizer |
|
|
|
|
|
# Create tokenizer with custom vocabulary size |
|
|
tokenizer = HindiBPETokenizer(vocab_size=8000) |
|
|
|
|
|
# Load your custom Hindi corpus |
|
|
with open('my_corpus.txt', 'r', encoding='utf-8') as f:
    corpus = f.read()
|
|
|
|
|
# Train |
|
|
tokenizer.train(corpus, verbose=True) |
|
|
|
|
|
# Save |
|
|
tokenizer.save('my_custom_tokenizer.json') |
|
|
``` |
|
|
|
|
|
### Get Detailed Statistics |
|
|
|
|
|
```python |
|
|
stats = tokenizer.get_compression_stats("हिंदी टेक्स्ट") |
|
|
print(f"Original characters: {stats['original_chars']}") |
|
|
print(f"Original bytes: {stats['original_bytes']}") |
|
|
print(f"Compressed tokens: {stats['compressed_tokens']}") |
|
|
print(f"Compression ratio: {stats['compression_ratio']:.2f}X") |
|
|
print(f"Vocabulary size: {stats['vocab_size']:,}") |
|
|
``` |
|
|
|
|
|
## 📁 Repository Structure |
|
|
|
|
|
``` |
|
|
hindi-bpe-tokenizer/ |
|
|
├── hindi_bpe_tokenizer.py # Core implementation (8KB) |
|
|
├── train_bpe_simple.py # Training script (5KB) |
|
|
├── create_diverse_hindi_corpus.py # Corpus generator (17KB) |
|
|
├── hindi_corpus.txt # Training data (1.5MB) |
|
|
├── training_results.json # Performance metrics (2KB) |
|
|
├── pyproject.toml # Dependencies |
|
|
└── README.md # This file |
|
|
``` |
|
|
|
|
|
**Note:** `hindi_bpe_tokenizer.json` (543MB) is generated when you run `train_bpe_simple.py`.
|
|
|
|
|
## 🎓 Training Data |
|
|
|
|
|
The tokenizer was trained on diverse Hindi content including: |
|
|
|
|
|
- **News:** Cricket, space missions, current events |
|
|
- **Science & Technology:** विज्ञान (science), प्रौद्योगिकी (technology) vocabulary


- **Education & Environment:** शिक्षा (education), पर्यावरण (environment) topics


- **Politics & Governance:** राजनीति (politics), संविधान (constitution) terms
|
|
- **Daily Life:** Common phrases, daily vocabulary |
|
|
- **Complete Alphabet:** All Devanagari letters, vowels, consonants |
|
|
- **Numbers:** Both Arabic (0-9) and Devanagari (०-९) numerals |
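
The two numeral systems occupy different Unicode ranges (ASCII `0-9` vs. Devanagari `०-९`, U+0966 to U+096F). A small stdlib sketch (not part of the tokenizer itself) mapping one to the other:

```python
# Map Devanagari digits (U+0966-U+096F) to ASCII digits.
DEVANAGARI_TO_ASCII = str.maketrans("०१२३४५६७८९", "0123456789")

converted = "वर्ष २०२५".translate(DEVANAGARI_TO_ASCII)
print(converted)  # → वर्ष 2025
```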
|
|
|
|
|
## 🔬 Technical Details |
|
|
|
|
|
### BPE Algorithm |
|
|
|
|
|
1. Start with a base vocabulary of all 256 byte values
|
|
2. Find most frequent byte pair in corpus |
|
|
3. Merge pair into new token |
|
|
4. Repeat until target vocabulary size reached |
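
The four steps above can be sketched in plain Python. This is an illustrative toy, not the repository's `HindiBPETokenizer` implementation; the function names and the stopping rule (stop when no pair repeats) are assumptions of the sketch:

```python
from collections import Counter

def get_pair_counts(ids):
    """Count adjacent token pairs in the sequence."""
    return Counter(zip(ids, ids[1:]))

def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train_bpe(text, vocab_size):
    """Steps 1-4: start from the 256 byte values, merge greedily."""
    ids = list(text.encode("utf-8"))
    merges = {}
    for new_id in range(256, vocab_size):
        counts = get_pair_counts(ids)
        best, freq = counts.most_common(1)[0]
        if freq < 2:          # no pair repeats: nothing left worth merging
            break
        merges[best] = new_id
        ids = merge(ids, best, new_id)
    return merges, ids

text = "भारत एक महान देश है। " * 4
merges, compressed = train_bpe(text, 300)
```

Because Hindi characters take 3 bytes each in UTF-8, byte-level merges recover multi-byte characters first, then frequent syllables and words.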
|
|
|
|
|
### Hindi-Specific Optimizations |
|
|
|
|
|
- Devanagari Unicode block prioritized (`\u0900-\u097F`); the pattern also matches the Bengali block (`\u0980-\u09FF`)
|
|
- Optimized regex pattern for Hindi word boundaries |
|
|
- JSON-based serialization for easy sharing |
|
|
|
|
|
### Regex Pattern |
|
|
|
|
|
```python |
|
|
r""" ?[\u0900-\u097F]+| ?[\u0980-\u09FF]+| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""" |
|
|
``` |
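
The `\p{L}`/`\p{N}` classes in the full pattern require the third-party `regex` module, but the Devanagari alternative can be demonstrated with the stdlib `re` alone. A minimal sketch of how the pattern chunks Hindi text before BPE is applied:

```python
import re

# Stdlib-only sketch of the Devanagari alternative from the pattern
# above (an optional leading space, then a run of Devanagari chars).
dev = re.compile(r" ?[\u0900-\u097F]+")

chunks = dev.findall("भारत एक महान देश है।")
print(chunks)  # → ['भारत', ' एक', ' महान', ' देश', ' है।']
```

Note how the leading space stays attached to each word, so word boundaries survive the byte-level merging.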
|
|
|
|
|
## 🤝 Use Cases |
|
|
|
|
|
- Text compression for Hindi documents |
|
|
- Tokenization for Hindi NLP models |
|
|
- Language model preprocessing |
|
|
- Text analysis and statistics |
|
|
- Educational purposes (understanding BPE) |
|
|
|
|
|
## 📝 Requirements |
|
|
|
|
|
- Python 3.13+ |
|
|
- `regex` library (for Unicode support) |
|
|
- `numpy` (optional, for numerical operations) |
|
|
|
|
|
## ⚠️ Limitations |
|
|
|
|
|
- Trained on specific corpus (may need retraining for domain-specific text) |
|
|
- Best for Devanagari script (Hindi) |
|
|
- Compression ratio varies by text type |
|
|
- Not optimized for mixed Hindi-English text (compression drops to ~3.68X) |
|
|
|
|
|
## 🛠️ Troubleshooting |
|
|
|
|
|
### Issue: `ModuleNotFoundError: No module named 'regex'`
|
|
```bash |
|
|
pip install regex |
|
|
``` |
|
|
|
|
|
### Issue: `hindi_bpe_tokenizer.json` not found
|
|
```bash |
|
|
# Train the tokenizer first |
|
|
python train_bpe_simple.py |
|
|
``` |
|
|
|
|
|
### Issue: UnicodeDecodeError |
|
|
```python
# Ensure files are read with UTF-8 encoding
with open('file.txt', 'r', encoding='utf-8') as f:
    content = f.read()
```
|
|
|
|
|
## 📚 Citation |
|
|
|
|
|
If you use this tokenizer in your research or project, please cite: |
|
|
|
|
|
```bibtex |
|
|
@misc{hindi_bpe_tokenizer_2025, |
|
|
title={Hindi BPE Tokenizer: Byte Pair Encoding for Devanagari Script}, |
|
|
author={Your Name}, |
|
|
year={2025}, |
|
|
publisher={Hugging Face}, |
|
|
url={https://huggingface.co/ansul90/hindi-bpe-tokenizer} |
|
|
} |
|
|
``` |
|
|
|
|
|
## 📄 License |
|
|
|
|
|
MIT License. See the `LICENSE` file for details.
|
|
|
|
|
## 🙏 Acknowledgments |
|
|
|
|
|
- Inspired by OpenAI's GPT-2 BPE implementation |
|
|
- Built for the Hindi NLP community |
|
|
- Based on the paper: "Neural Machine Translation of Rare Words with Subword Units" (Sennrich et al., 2016) |
|
|
|
|
|
## 📧 Contact & Links |
|
|
|
|
|
- **Hugging Face:** [ansul90/hindi-bpe-tokenizer](https://huggingface.co/ansul90/hindi-bpe-tokenizer) |
|
|
- **GitHub:** Your GitHub link |
|
|
- **Email:** Your email |
|
|
|
|
|
--- |
|
|
|
|
|
**धन्यवाद (Thank you) for using Hindi BPE Tokenizer!** 🙏 |
|
|
|
|
|
|