# 📘 Hing-BERT Language Identification Module
**Hing-BERT Language Identifier** is a Python module for **Hindi-English token-level language detection** and **transliteration**.
It wraps the **L3Cube HingBERT-LID** model and adds transliteration logic and heuristic rules to detect Hindi words accurately in romanized or mixed Hindi-English text.
---
## 🧩 Features
- Detects Hindi (`HI`) and English (`EN`) tokens in mixed text (Hinglish).
- Uses **L3Cube HingBERT-LID** Transformer model.
- Integrates **pattern-based heuristics** for Hindi-like tokens.
- Supports **dictionary-based and model-based transliteration**.
- Works both as a **CLI tool** and an **importable Python module**.
- Outputs Hindi word detections and transliterations to console or file.
---
## 🗂️ Folder Structure
```
hing_bert_module/
├── __init__.py
├── classifier.py # HingBERT model loading + token classification
├── transliteration.py # Dictionary and transliteration functions
├── utils.py # Logging and helper utilities
├── main.py # CLI entry point
├── hing-bert-lid/ # Pretrained model folder
└── dictionary.txt # Transliteration dictionary
```
---
## ⚙️ Installation
### 1. Clone this repository
```bash
git clone https://github.com/yourusername/hing_bert_module.git
cd hing_bert_module
```
### 2. Install dependencies
Create a virtual environment (recommended):
```bash
python -m venv env
source env/bin/activate # or env\Scripts\activate on Windows
```
Then install:
```bash
pip install torch transformers hindi-xlit
```
If you plan to use the dictionary transliteration:
```bash
pip install indic-transliteration
```
---
## 🚀 Usage
### **Option 1: Command-Line Interface (CLI)**
Run directly as a command:
```bash
python -m hing_bert_module.main --text "Ram went to Ayodhya with Sita"
```
#### 🧾 Example Output
```
Token-level predictions:
------------------------
Ram -> HI (0.97)
went -> EN (0.99)
to -> EN (0.99)
Ayodhya -> HI (0.98)
with -> EN (0.99)
Sita -> HI (0.96)
Reconstructed Output:
राम went to अयोध्या with सीता
```
#### Available CLI arguments
| Argument | Description | Example |
|-----------|--------------|----------|
| `--text` | Input sentence for classification | `"Ram went to Ayodhya"` |
| `--file` | Input text file (alternative to `--text`) | `input.txt` |
| `--threshold` | Confidence threshold (default: 0.80) | `--threshold 0.85` |
| `--out` | Output file to save predictions | `--out results.txt` |
---
### **Option 2: Import as a Python Module**
```python
from hing_bert_module import load_model, classify_text, load_dictionary, get_transliteration
from hindi_xlit import HindiTransliterator
# Load model
tokenizer, model, device = load_model()
# Run inference
text = "Ram went to Ayodhya with Sita"
predictions = classify_text(text, tokenizer, model, device, threshold=0.8)
# Display predictions
for p in predictions:
    print(f"{p.token} -> {p.label} ({p.confidence:.2f})")
# Load dictionary
dictionary = load_dictionary('dictionary.txt')
# Transliterate Hindi words
hindi_tokens = [p.token for p in predictions if p.label == "HI"]
transliterator = HindiTransliterator(scheme='itrans')
for token in hindi_tokens:
    print(token, "→", get_transliteration(token, dictionary, transliterator))
```
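The format of `dictionary.txt` is not documented above. A minimal sketch of a loader, assuming one tab-separated `roman<TAB>देवनागरी` pair per line (the actual `load_dictionary` in `transliteration.py` may expect a different format):

```python
def load_dictionary(path: str) -> dict[str, str]:
    """Parse a transliteration dictionary into a roman -> Devanagari map.

    Assumes one tab-separated pair per line, e.g. ``ram\tराम``;
    blank lines and ``#`` comments are skipped.
    """
    mapping: dict[str, str] = {}
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            roman, devanagari = line.split("\t", 1)
            mapping[roman.lower()] = devanagari
    return mapping
```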
---
## 🧠 How It Works
1. **Tokenizer + Model Inference:**
Each token is passed through the HingBERT model for token classification.
2. **Heuristic Rules:**
Custom rules adjust predictions based on:
- Hindi phonetic clusters (`bh`, `chh`, `th`, etc.)
- Common suffixes (`-a`, `-am`, `-iya`, etc.)
- Stopword filtering for English tokens.
3. **Confidence Thresholding:**
Only tokens with probability above the given threshold are considered confidently Hindi.
4. **Transliteration:**
Hindi-classified tokens are converted into **Devanagari script** using:
- Custom dictionary (if available)
- Model-based transliteration fallback
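The heuristic layer in step 2 can be sketched as a simple surface-pattern check. This is an illustrative stand-in, not the actual rules in `classifier.py`; the cluster, suffix, and stopword lists here are placeholders drawn from the bullets above:

```python
# Hypothetical sketch of the heuristic rules described above.
HINDI_CLUSTERS = ("bh", "chh", "th", "dh", "kh")   # Hindi phonetic clusters
HINDI_SUFFIXES = ("a", "am", "iya")                # common Hindi suffixes
EN_STOPWORDS = {"the", "to", "with", "and", "of"}  # illustrative stopword set

def looks_hindi(token: str) -> bool:
    """Return True if the token matches Hindi-like surface patterns."""
    t = token.lower()
    if t in EN_STOPWORDS:        # stopword filtering for English tokens
        return False
    if any(cluster in t for cluster in HINDI_CLUSTERS):
        return True
    return t.endswith(HINDI_SUFFIXES)
```

In the pipeline, a rule like this would only nudge tokens whose model confidence falls near the threshold, rather than override high-confidence predictions.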
---
## 🧪 Example Integration
```python
from hing_bert_module.main import process_text
result = process_text("Rama and Sita returned to Ayodhya.")
print(result["reconstructed_text"])
# Output: "राम and सीता returned to अयोध्या."
```
---
## ⚡ Performance Tips
- Use a GPU (`cuda`) for faster inference.
- Keep the model loaded between calls rather than reloading it for each text.
- Thresholds between `0.75` and `0.85` give a good balance of precision and recall.
---
## 📄 License
This project is licensed under the MIT License.
Model weights belong to **L3Cube Pune** under their research license.
---
## 🤝 Acknowledgements
- **L3Cube Pune** – HingBERT Language Identification Model
- **Hindi-Xlit** – Transliteration utility
- **HuggingFace Transformers** – Model backbone