# 📘 Hing-BERT Language Identification Module
**Hing-BERT Language Identifier** is a Python module for **Hindi-English token-level language detection** and **transliteration**.
It wraps the **L3Cube HingBERT-LID** model and adds transliteration logic and heuristic rules to detect Hindi words in romanized or mixed Hindi-English text more reliably.
---
## 🧩 Features
- Detects Hindi (`HI`) and English (`EN`) tokens in mixed text (Hinglish).
- Uses **L3Cube HingBERT-LID** Transformer model.
- Integrates **pattern-based heuristics** for Hindi-like tokens.
- Supports **dictionary-based and model-based transliteration**.
- Works both as a **CLI tool** and an **importable Python module**.
- Outputs Hindi word detections and transliterations to console or file.
---
## 🗂️ Folder Structure
```
hing_bert_module/
│
├── __init__.py
├── classifier.py # HingBERT model loading + token classification
├── transliteration.py # Dictionary and transliteration functions
├── utils.py # Logging and helper utilities
├── main.py # CLI entry point
├── hing-bert-lid/ # Pretrained model folder
└── dictionary.txt # Transliteration dictionary
```
---
## ⚙️ Installation
### 1. Clone this repository
```bash
git clone https://github.com/yourusername/hing_bert_module.git
cd hing_bert_module
```
### 2. Install dependencies
Create a virtual environment (recommended):
```bash
python -m venv env
source env/bin/activate # or env\Scripts\activate on Windows
```
Then install:
```bash
pip install torch transformers hindi-xlit
```
If you plan to use the dictionary transliteration:
```bash
pip install indic-transliteration
```
---
## 🚀 Usage
### **Option 1: Command-Line Interface (CLI)**
Run directly as a command:
```bash
python -m hing_bert_module.main --text "Ram went to Ayodhya with Sita"
```
#### 🧾 Example Output
```
Token-level predictions:
------------------------
Ram -> HI (0.97)
went -> EN (0.99)
to -> EN (0.99)
Ayodhya -> HI (0.98)
with -> EN (0.99)
Sita -> HI (0.96)
Reconstructed Output:
राम went to अयोध्या with सीता
```
#### Available CLI arguments
| Argument | Description | Example |
|-----------|--------------|----------|
| `--text` | Input sentence for classification | `"Ram went to Ayodhya"` |
| `--file` | Input text file (alternative to --text) | `input.txt` |
| `--threshold` | Confidence threshold (default: 0.80) | `--threshold 0.85` |
| `--out` | Output file to save predictions | `--out results.txt` |
---
### **Option 2: Import as a Python Module**
```python
from hing_bert_module import load_model, classify_text, load_dictionary, get_transliteration
from hindi_xlit import HindiTransliterator

# Load model
tokenizer, model, device = load_model()

# Run inference
text = "Ram went to Ayodhya with Sita"
predictions = classify_text(text, tokenizer, model, device, threshold=0.8)

# Display predictions
for p in predictions:
    print(f"{p.token} -> {p.label} ({p.confidence:.2f})")

# Load dictionary
dictionary = load_dictionary('dictionary.txt')

# Transliterate Hindi words
hindi_tokens = [p.token for p in predictions if p.label == "HI"]
transliterator = HindiTransliterator(scheme='itrans')
for token in hindi_tokens:
    print(token, "→", get_transliteration(token, dictionary, transliterator))
```
---
## 🧠 How It Works
1. **Tokenizer + Model Inference:**
Each token is passed through the HingBERT model for token classification.
2. **Heuristic Rules:**
Custom rules adjust predictions based on:
- Hindi phonetic clusters (`bh`, `chh`, `th`, etc.)
- Common suffixes (`-a`, `-am`, `-iya`, etc.)
- Stopword filtering for English tokens.
3. **Confidence Thresholding:**
Only tokens with probability above the given threshold are considered confidently Hindi.
4. **Transliteration:**
Hindi-classified tokens are converted into **Devanagari script** using:
- Custom dictionary (if available)
- Model-based transliteration fallback
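Steps 2–4 above can be sketched in a few lines. This is a minimal, self-contained illustration of the pattern, not the module's actual code: the cluster/suffix lists, `looks_hindi`, and `transliterate` names are hypothetical stand-ins for the rules in `classifier.py` and `transliteration.py`.

```python
import re

# Hypothetical rule data for illustration (the real rules live in classifier.py).
HINDI_CLUSTERS = re.compile(r"(bh|chh|th|dh|kh)")   # Hindi phonetic clusters
HINDI_SUFFIXES = ("a", "am", "iya")                 # common Hindi-like suffixes
EN_STOPWORDS = {"went", "to", "with", "and", "the"} # filtered as English

def looks_hindi(token: str) -> bool:
    """Pattern-based heuristic: phonetic clusters or typical suffixes."""
    t = token.lower()
    if t in EN_STOPWORDS:
        return False
    return bool(HINDI_CLUSTERS.search(t)) or t.endswith(HINDI_SUFFIXES)

def transliterate(token: str, dictionary: dict, fallback) -> str:
    """Dictionary lookup first, model-based transliteration as fallback."""
    return dictionary.get(token.lower()) or fallback(token)

# Toy usage with a two-entry dictionary and an identity fallback:
dictionary = {"ram": "राम", "sita": "सीता"}
print(transliterate("Ram", dictionary, lambda t: t))  # dictionary hit: राम
print(looks_hindi("Ayodhya"))  # True ("dh" cluster, "-a" suffix)
print(looks_hindi("went"))     # False (English stopword)
```

In the real pipeline, the heuristic adjusts borderline model predictions rather than replacing them, and the fallback is the `hindi-xlit` transliterator rather than an identity function.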
---
## 🧪 Example Integration
```python
from hing_bert_module.main import process_text
result = process_text("Rama and Sita returned to Ayodhya.")
print(result["reconstructed_text"])
# Output: "राम and सीता returned to अयोध्या."
```
---
## ⚡ Performance Tips
- Use GPU (`cuda`) for faster inference.
- Keep model loaded between multiple text calls.
- Use thresholds between `0.75` and `0.85` for a good precision/recall balance.
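The "keep the model loaded" tip amounts to caching the loaded artifacts instead of reloading them per call. A minimal sketch of that pattern, where `build_model` is a hypothetical stand-in for the module's `load_model()`:

```python
from functools import lru_cache

def build_model():
    # Stand-in for the expensive load_model() call (tokenizer, model, device).
    return ("tokenizer", "model", "cpu")

@lru_cache(maxsize=1)
def get_model():
    # First call performs the load; every later call returns the cached tuple.
    return build_model()

tokenizer, model, device = get_model()
# Subsequent calls are free: the same objects come back, no reload happens.
assert get_model() is get_model()
```

Any long-lived process (web server, batch job) benefits from this: pay the load cost once, then reuse the same tokenizer/model objects across requests.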
---
## 📄 License
This project is licensed under the MIT License.
Model weights belong to **L3Cube Pune** under their research license.
---
## 🤝 Acknowledgements
- **L3Cube Pune** – HingBERT Language Identification Model
- **Hindi-Xlit** – Transliteration utility
- **HuggingFace Transformers** – Model backbone