# 📘 Hing-BERT Language Identification Module

**Hing-BERT Language Identifier** is a Python module for **token-level Hindi-English language detection** and **transliteration**.  
It wraps the **L3Cube HingBERT-LID** model and adds transliteration logic and heuristic rules to accurately detect Hindi words in romanized or mixed Hindi-English text.

---

## 🧩 Features
- Detects Hindi (`HI`) and English (`EN`) tokens in mixed text (Hinglish).  
- Uses **L3Cube HingBERT-LID** Transformer model.  
- Integrates **pattern-based heuristics** for Hindi-like tokens.  
- Supports **dictionary-based and model-based transliteration**.  
- Works both as a **CLI tool** and an **importable Python module**.  
- Outputs Hindi word detections and transliterations to console or file.

---

## 🗂️ Folder Structure
```
hing_bert_module/
├── __init__.py
├── classifier.py         # HingBERT model loading + token classification
├── transliteration.py    # Dictionary and transliteration functions
├── utils.py              # Logging and helper utilities
├── main.py               # CLI entry point
├── hing-bert-lid/        # Pretrained model folder
└── dictionary.txt        # Transliteration dictionary
```

---

## ⚙️ Installation

### 1. Clone this repository
```bash
git clone https://github.com/yourusername/hing_bert_module.git
cd hing_bert_module
```

### 2. Install dependencies
Create a virtual environment (recommended):
```bash
python -m venv env
source env/bin/activate   # or env\Scripts\activate on Windows
```

Then install:
```bash
pip install torch transformers hindi-xlit
```

If you plan to use the dictionary transliteration:
```bash
pip install indic-transliteration
```

---

## 🚀 Usage

### **Option 1: Command-Line Interface (CLI)**

Run directly as a command:
```bash
python -m hing_bert_module.main --text "Ram went to Ayodhya with Sita"
```

#### 🧾 Example Output
```
Token-level predictions:
------------------------
Ram         -> HI (0.97)
went        -> EN (0.99)
to          -> EN (0.99)
Ayodhya     -> HI (0.98)
with        -> EN (0.99)
Sita        -> HI (0.96)

Reconstructed Output:
राम went to अयोध्या with सीता
```

#### Available CLI arguments
| Argument | Description | Example |
|-----------|--------------|----------|
| `--text` | Input sentence for classification | `"Ram went to Ayodhya"` |
| `--file` | Input text file (alternative to `--text`) | `input.txt` |
| `--threshold` | Confidence threshold (default: 0.80) | `--threshold 0.85` |
| `--out` | Output file to save predictions | `--out results.txt` |
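
A minimal argument parser matching the table above might look like the following. This is a sketch, not the actual contents of `main.py`; the real entry point may structure its arguments differently:

```python
import argparse

def build_parser():
    """Build a CLI parser mirroring the arguments in the table above (sketch)."""
    parser = argparse.ArgumentParser(
        description="Token-level Hindi-English language identification"
    )
    # Exactly one input source: an inline sentence or a text file.
    group = parser.add_mutually_exclusive_group(required=True)
    group.add_argument("--text", help="Input sentence for classification")
    group.add_argument("--file", help="Input text file (alternative to --text)")
    parser.add_argument("--threshold", type=float, default=0.80,
                        help="Confidence threshold (default: 0.80)")
    parser.add_argument("--out", help="Output file to save predictions")
    return parser

args = build_parser().parse_args(["--text", "Ram went to Ayodhya", "--threshold", "0.85"])
print(args.text, args.threshold)
```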

---

### **Option 2: Import as a Python Module**

```python
from hing_bert_module import load_model, classify_text, load_dictionary, get_transliteration
from hindi_xlit import HindiTransliterator

# Load model
tokenizer, model, device = load_model()

# Run inference
text = "Ram went to Ayodhya with Sita"
predictions = classify_text(text, tokenizer, model, device, threshold=0.8)

# Display predictions
for p in predictions:
    print(f"{p.token} -> {p.label} ({p.confidence:.2f})")

# Load dictionary
dictionary = load_dictionary('dictionary.txt')

# Transliterate Hindi words
hindi_tokens = [p.token for p in predictions if p.label == "HI"]
transliterator = HindiTransliterator(scheme='itrans')
for token in hindi_tokens:
    print(token, "→", get_transliteration(token, dictionary, transliterator))
```
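
The format of `dictionary.txt` is not specified here. One plausible layout, shown purely as an illustration, is tab-separated roman-to-Devanagari pairs, with the model-based transliterator used as a fallback for out-of-dictionary words. The function names below are hypothetical sketches, not the module's actual API:

```python
def load_dictionary_sketch(lines):
    """Parse 'roman<TAB>devanagari' lines into a lookup table (sketch)."""
    table = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        roman, devanagari = line.split("\t", 1)
        table[roman.lower()] = devanagari
    return table

def transliterate(token, table, fallback):
    """Dictionary lookup first, model-based fallback second."""
    return table.get(token.lower()) or fallback(token)

table = load_dictionary_sketch(["ram\tराम", "sita\tसीता"])
print(transliterate("Ram", table, fallback=lambda t: f"<{t}>"))      # राम
print(transliterate("Ayodhya", table, fallback=lambda t: f"<{t}>"))  # <Ayodhya>
```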

---

## 🧠 How It Works

1. **Tokenizer + Model Inference:**  
   Each token is passed through the HingBERT model for token classification.

2. **Heuristic Rules:**  
   Custom rules adjust predictions based on:
   - Hindi phonetic clusters (`bh`, `chh`, `th`, etc.)
   - Common suffixes (`-a`, `-am`, `-iya`, etc.)
   - Stopword filtering for English tokens.

3. **Confidence Thresholding:**  
   Only tokens with probability above the given threshold are considered confidently Hindi.

4. **Transliteration:**  
   Hindi-classified tokens are converted into **Devanagari script** using:
   - Custom dictionary (if available)
   - Model-based transliteration fallback

---

## 🧪 Example Integration

```python
from hing_bert_module.main import process_text

result = process_text("Rama and Sita returned to Ayodhya.")
print(result["reconstructed_text"])
# Output: "राम and सीता returned to अयोध्या."
```

---

## ⚡ Performance Tips
- Use a GPU (`cuda`) for faster inference.
- Keep the model loaded between calls instead of reloading it for each input.
- Use thresholds between `0.75` and `0.85` for balanced accuracy.
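
One simple way to keep the model loaded across calls is to cache the loader. The sketch below simulates an expensive load so it is self-contained; in practice the cached function would call the module's `load_model()`:

```python
from functools import lru_cache

@lru_cache(maxsize=1)
def get_model():
    """Load the tokenizer/model once and reuse it across calls (sketch)."""
    print("loading model...")  # printed only on the first call
    # Placeholder for: tokenizer, model, device = load_model()
    return {"tokenizer": "tok", "model": "net", "device": "cpu"}

first = get_model()    # triggers the (simulated) load
second = get_model()   # served from the cache, no reload
assert first is second
```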

---

## 📄 License
This project is licensed under the MIT License.  
Model weights belong to **L3Cube Pune** under their research license.

---

## 🤝 Acknowledgements
- **L3Cube Pune** – HingBERT Language Identification Model  
- **Hindi-Xlit** – Transliteration utility  
- **HuggingFace Transformers** – Model backbone