---
language:
- hi
tags:
- tokenizer
- bpe
- hindi
- devanagari
- byte-pair-encoding
- nlp
license: mit
library_name: custom
---

# Hindi BPE Tokenizer

A Byte Pair Encoding (BPE) tokenizer optimized for Hindi text in Devanagari script.

## 🎯 Model Description

- **Vocabulary Size:** 5,500 tokens
- **Compression Ratio:** 6.52X average (up to 10.44X on technical text)
- **Training Corpus:** 575K characters (1.5MB) of diverse Hindi text
- **Decoding Accuracy:** 100%
- **Training Time:** ~30 seconds

## ✨ Features

- **Hindi-Optimized:** Devanagari Unicode prioritization (`\u0900-\u097F`)
- **High Compression:** 6.52X average, up to 10.44X on technical text
- **Perfect Decoding:** 100% accuracy in text reconstruction
- **Simple API:** Easy encode/decode with compression stats
- **Fast Training:** Train from scratch in ~30 seconds

## 📦 Installation

```bash
# Clone the repository
git clone https://huggingface.co/ansul90/hindi-bpe-tokenizer
cd hindi-bpe-tokenizer

# Install dependencies
pip install regex numpy
# Or with uv:
uv add regex numpy
```

## 🚀 Quick Start

### Step 1: Train the Tokenizer

> **⚠️ Note:** This repository does not include the pre-trained model file (543MB).
> You need to train it once locally, which takes only ~30 seconds.
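Under the hood, training runs the classic BPE loop described under Technical Details: start from raw UTF-8 bytes, repeatedly merge the most frequent adjacent pair. The following toy, self-contained sketch shows that loop for intuition only; the function name and structure are illustrative, not this repository's implementation:

```python
from collections import Counter

def toy_bpe_train(text: str, num_merges: int):
    """Learn `num_merges` byte-pair merges from `text` (toy illustration)."""
    tokens = list(text.encode("utf-8"))  # start from the 256-value byte vocabulary
    merges = {}
    next_id = 256
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        pair = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges[pair] = next_id
        merged, i = [], 0
        while i < len(tokens):            # replace every occurrence with the new token
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
                merged.append(next_id)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
        next_id += 1
    return tokens, merges

tokens, merges = toy_bpe_train("भारत भारत भारत", num_merges=10)
print(len("भारत भारत भारत".encode("utf-8")), "bytes ->", len(tokens), "tokens")
```

Each learned merge shortens the token sequence, which is where the compression figures above come from.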
```bash
python train_bpe_simple.py
```

This will:

- Load the Hindi corpus (included)
- Train the BPE tokenizer
- Generate `hindi_bpe_tokenizer.json` (~543MB)
- Test on 8 Hindi samples
- Display performance metrics

### Step 2: Use the Tokenizer

```python
from hindi_bpe_tokenizer import HindiBPETokenizer

# Load trained tokenizer
tokenizer = HindiBPETokenizer()
tokenizer.load('hindi_bpe_tokenizer.json')

# Encode Hindi text
text = "भारत एक महान देश है।"
tokens = tokenizer.encode(text)
print(f"Tokens: {tokens}")

# Decode back to text
decoded = tokenizer.decode(tokens)
print(f"Decoded: {decoded}")

# Get compression statistics
stats = tokenizer.get_compression_stats(text)
print(f"Compression ratio: {stats['compression_ratio']:.2f}X")
print(f"Original bytes: {stats['original_bytes']}")
print(f"Compressed tokens: {stats['compressed_tokens']}")
```

## 📊 Performance Metrics

| Metric | Value |
|--------|-------|
| Vocabulary Size | 5,500 tokens |
| Compression Ratio | 6.52X (avg), 10.44X (best) |
| Decoding Accuracy | 100% |
| Training Corpus | 575K chars, 1.5MB |
| Training Time | ~30 seconds |

### Test Results on Different Text Types

| Category | Original Bytes | Compressed Tokens | Compression Ratio |
|----------|----------------|-------------------|-------------------|
| Space Mission | 204 | 31 | 6.58X |
| Cricket News | 146 | 27 | 5.41X |
| Science & Tech | 188 | 18 | **10.44X** |
| Language | 123 | 18 | 6.83X |
| Education | 140 | 17 | 8.24X |
| Environment | 132 | 21 | 6.29X |
| Mixed Content | 125 | 34 | 3.68X |
| Long Sentence | 240 | 33 | 7.27X |

## 🔧 Advanced Usage

### Custom Training

```python
from hindi_bpe_tokenizer import HindiBPETokenizer

# Create tokenizer with custom vocabulary size
tokenizer = HindiBPETokenizer(vocab_size=8000)

# Load your custom Hindi corpus
with open('my_corpus.txt', 'r', encoding='utf-8') as f:
    corpus = f.read()

# Train
tokenizer.train(corpus, verbose=True)

# Save
tokenizer.save('my_custom_tokenizer.json')
```

### Get Detailed Statistics
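The ratio reported alongside these statistics is original UTF-8 bytes divided by token count, as the metrics table above reflects (e.g. 204 bytes / 31 tokens = 6.58X). A quick, tokenizer-independent sanity check of that arithmetic, where the token count of 8 is a hypothetical value rather than measured output:

```python
text = "भारत एक महान देश है।"       # same sample as in Quick Start
utf8_bytes = len(text.encode("utf-8"))  # each Devanagari character is 3 bytes in UTF-8
num_tokens = 8                          # hypothetical; a real value comes from tokenizer.encode(text)
ratio = utf8_bytes / num_tokens
print(f"{utf8_bytes} bytes / {num_tokens} tokens = {ratio:.2f}X")
# → 52 bytes / 8 tokens = 6.50X
```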
```python
stats = tokenizer.get_compression_stats("हिंदी टेक्स्ट")
print(f"Original characters: {stats['original_chars']}")
print(f"Original bytes: {stats['original_bytes']}")
print(f"Compressed tokens: {stats['compressed_tokens']}")
print(f"Compression ratio: {stats['compression_ratio']:.2f}X")
print(f"Vocabulary size: {stats['vocab_size']:,}")
```

## 📁 Repository Structure

```
hindi-bpe-tokenizer/
├── hindi_bpe_tokenizer.py          # Core implementation (8KB)
├── train_bpe_simple.py             # Training script (5KB)
├── create_diverse_hindi_corpus.py  # Corpus generator (17KB)
├── hindi_corpus.txt                # Training data (1.5MB)
├── training_results.json           # Performance metrics (2KB)
├── pyproject.toml                  # Dependencies
└── README.md                       # This file
```

**Note:** `hindi_bpe_tokenizer.json` (543MB) is generated when you run `train_bpe_simple.py`.

## 🎓 Training Data

The tokenizer was trained on diverse Hindi content, including:

- **News:** Cricket, space missions, current events
- **Science & Technology:** science (विज्ञान) and technology (प्रौद्योगिकी) vocabulary
- **Education & Environment:** education (शिक्षा) and environment (पर्यावरण) topics
- **Politics & Governance:** politics (राजनीति) and constitution (संविधान) terms
- **Daily Life:** Common phrases, everyday vocabulary
- **Complete Alphabet:** All Devanagari letters, vowels, and consonants
- **Numbers:** Both Arabic (0-9) and Devanagari (०-९) numerals

## 🔬 Technical Details

### BPE Algorithm

1. Start with the 256-token byte vocabulary
2. Find the most frequent byte pair in the corpus
3. Merge that pair into a new token
4. Repeat until the target vocabulary size is reached

### Hindi-Specific Optimizations

- Unicode blocks matched first in the pattern: `\u0900-\u097F` (Devanagari) and `\u0980-\u09FF` (Bengali)
- Optimized regex pattern for Hindi word boundaries
- JSON-based serialization for easy sharing

### Regex Pattern

```python
r""" ?[\u0900-\u097F]+| ?[\u0980-\u09FF]+| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
```

## 🤝 Use Cases

- Text compression for Hindi documents
- Tokenization for Hindi NLP models
- Language model preprocessing
- Text analysis and statistics
- Educational purposes (understanding BPE)

## 📝 Requirements

- Python 3.13+
- `regex` library (for Unicode property support)
- `numpy` (optional, for numerical operations)

## ⚠️ Limitations

- Trained on a specific corpus (may need retraining for domain-specific text)
- Best suited to Devanagari script (Hindi)
- Compression ratio varies by text type
- Not optimized for mixed Hindi-English text (compression drops to ~3.68X)

## 🛠️ Troubleshooting

### Issue: `ModuleNotFoundError: No module named 'regex'`

```bash
pip install regex
```

### Issue: `hindi_bpe_tokenizer.json` not found

```bash
# Train the tokenizer first
python train_bpe_simple.py
```

### Issue: `UnicodeDecodeError`

```python
# Ensure files are read with UTF-8 encoding
with open('file.txt', 'r', encoding='utf-8') as f:
    content = f.read()
```

## 📚 Citation

If you use this tokenizer in your research or project, please cite:

```bibtex
@misc{hindi_bpe_tokenizer_2025,
  title={Hindi BPE Tokenizer: Byte Pair Encoding for Devanagari Script},
  author={Your Name},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/ansul90/hindi-bpe-tokenizer}
}
```

## 📄 License

MIT License - see the LICENSE file for details.

## 🙏 Acknowledgments

- Inspired by OpenAI's GPT-2 BPE implementation
- Built for the Hindi NLP community
- Based on the paper "Neural Machine Translation of Rare Words with Subword Units" (Sennrich et al., 2016)

## 📧 Contact & Links

- **Hugging Face:** [ansul90/hindi-bpe-tokenizer](https://huggingface.co/ansul90/hindi-bpe-tokenizer)
- **GitHub:** Your GitHub link
- **Email:** Your email

---

**धन्यवाद (Thank you) for using the Hindi BPE Tokenizer!** 🙏