# 📘 Hing-BERT Language Identification Module
**Hing-BERT Language Identifier** is a Python module for **Hindi-English token-level language detection** and **transliteration**.
It wraps the **L3Cube HingBERT-LID** model and adds transliteration logic and heuristic rules to detect Hindi words in romanized or mixed Hindi-English text more reliably.
---
## 🧩 Features
- Detects Hindi (`HI`) and English (`EN`) tokens in mixed text (Hinglish).
- Uses **L3Cube HingBERT-LID** Transformer model.
- Integrates **pattern-based heuristics** for Hindi-like tokens.
- Supports **dictionary-based and model-based transliteration**.
- Works both as a **CLI tool** and an **importable Python module**.
- Outputs Hindi word detections and transliterations to console or file.
---
## 🗂️ Folder Structure
```
hing_bert_module/
│
├── __init__.py
├── classifier.py # HingBERT model loading + token classification
├── transliteration.py # Dictionary and transliteration functions
├── utils.py # Logging and helper utilities
├── main.py # CLI entry point
├── hing-bert-lid/ # Pretrained model folder
└── dictionary.txt # Transliteration dictionary
```
---
## ⚙️ Installation
### 1. Clone this repository
```bash
git clone https://github.com/yourusername/hing_bert_module.git
cd hing_bert_module
```
### 2. Install dependencies
Create a virtual environment (recommended):
```bash
python -m venv env
source env/bin/activate # or env\Scripts\activate on Windows
```
Then install:
```bash
pip install torch transformers hindi-xlit
```
If you plan to use the dictionary transliteration:
```bash
pip install indic-transliteration
```
---
## 🚀 Usage
### **Option 1: Command-Line Interface (CLI)**
Run directly as a command:
```bash
python -m hing_bert_module.main --text "Ram went to Ayodhya with Sita"
```
#### 🧾 Example Output
```
Token-level predictions:
------------------------
Ram -> HI (0.97)
went -> EN (0.99)
to -> EN (0.99)
Ayodhya -> HI (0.98)
with -> EN (0.99)
Sita -> HI (0.96)
Reconstructed Output:
राम went to अयोध्या with सीता
```
#### Available CLI arguments
| Argument | Description | Example |
|-----------|--------------|----------|
| `--text` | Input sentence for classification | `"Ram went to Ayodhya"` |
| `--file` | Input text file (alternative to --text) | `input.txt` |
| `--threshold` | Confidence threshold (default: 0.80) | `--threshold 0.85` |
| `--out` | Output file to save predictions | `--out results.txt` |
---
### **Option 2: Import as a Python Module**
```python
from hing_bert_module import load_model, classify_text, load_dictionary, get_transliteration
from hindi_xlit import HindiTransliterator

# Load model
tokenizer, model, device = load_model()

# Run inference
text = "Ram went to Ayodhya with Sita"
predictions = classify_text(text, tokenizer, model, device, threshold=0.8)

# Display predictions
for p in predictions:
    print(f"{p.token} -> {p.label} ({p.confidence:.2f})")

# Load dictionary
dictionary = load_dictionary('dictionary.txt')

# Transliterate Hindi words
hindi_tokens = [p.token for p in predictions if p.label == "HI"]
transliterator = HindiTransliterator(scheme='itrans')
for token in hindi_tokens:
    print(token, "→", get_transliteration(token, dictionary, transliterator))
```
---
## 🧠 How It Works
1. **Tokenizer + Model Inference:**
Each token is passed through the HingBERT model for token classification.
2. **Heuristic Rules:**
Custom rules adjust predictions based on:
- Hindi phonetic clusters (`bh`, `chh`, `th`, etc.)
- Common suffixes (`-a`, `-am`, `-iya`, etc.)
- Stopword filtering for English tokens.
3. **Confidence Thresholding:**
Only tokens with probability above the given threshold are considered confidently Hindi.
4. **Transliteration:**
Hindi-classified tokens are converted into **Devanagari script** using:
- Custom dictionary (if available)
- Model-based transliteration fallback
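Steps 2–4 above can be sketched in a few lines. This is a minimal, self-contained illustration of the pattern, not the module's actual code: the cluster/suffix lists, `looks_hindi`, and `transliterate` names are hypothetical stand-ins for the rules in `classifier.py` and `transliteration.py`.

```python
import re

# Hypothetical rule data for illustration (the real rules live in classifier.py).
HINDI_CLUSTERS = re.compile(r"(bh|chh|th|dh|kh)")   # Hindi phonetic clusters
HINDI_SUFFIXES = ("a", "am", "iya")                 # common Hindi-like suffixes
EN_STOPWORDS = {"went", "to", "with", "and", "the"} # filtered as English

def looks_hindi(token: str) -> bool:
    """Pattern-based heuristic: phonetic clusters or typical suffixes."""
    t = token.lower()
    if t in EN_STOPWORDS:
        return False
    return bool(HINDI_CLUSTERS.search(t)) or t.endswith(HINDI_SUFFIXES)

def transliterate(token: str, dictionary: dict, fallback) -> str:
    """Dictionary lookup first, model-based transliteration as fallback."""
    return dictionary.get(token.lower()) or fallback(token)

# Toy usage with a two-entry dictionary and an identity fallback:
dictionary = {"ram": "राम", "sita": "सीता"}
print(transliterate("Ram", dictionary, lambda t: t))  # dictionary hit: राम
print(looks_hindi("Ayodhya"))  # True ("dh" cluster, "-a" suffix)
print(looks_hindi("went"))     # False (English stopword)
```

In the real pipeline, the heuristic adjusts borderline model predictions rather than replacing them, and the fallback is the `hindi-xlit` transliterator rather than an identity function.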
---
## 🧪 Example Integration
```python
from hing_bert_module.main import process_text
result = process_text("Rama and Sita returned to Ayodhya.")
print(result["reconstructed_text"])
# Output: "राम and सीता returned to अयोध्या."
```
---
## ⚡ Performance Tips
- Use GPU (`cuda`) for faster inference.
- Keep model loaded between multiple text calls.
- Use thresholds between `0.75` and `0.85` for a good precision/recall balance.
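The "keep the model loaded" tip amounts to caching the loaded artifacts instead of reloading them per call. A minimal sketch of that pattern, where `build_model` is a hypothetical stand-in for the module's `load_model()`:

```python
from functools import lru_cache

def build_model():
    # Stand-in for the expensive load_model() call (tokenizer, model, device).
    return ("tokenizer", "model", "cpu")

@lru_cache(maxsize=1)
def get_model():
    # First call performs the load; every later call returns the cached tuple.
    return build_model()

tokenizer, model, device = get_model()
# Subsequent calls are free: the same objects come back, no reload happens.
assert get_model() is get_model()
```

Any long-lived process (web server, batch job) benefits from this: pay the load cost once, then reuse the same tokenizer/model objects across requests.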
---
## 📄 License
This project is licensed under the MIT License.
Model weights belong to **L3Cube Pune** under their research license.
---
## 🤝 Acknowledgements
- **L3Cube Pune** – HingBERT Language Identification Model
- **Hindi-Xlit** – Transliteration utility
- **HuggingFace Transformers** – Model backbone