# 📘 Hing-BERT Language Identification Module

Hing-BERT Language Identifier is a Python module for token-level Hindi-English language detection and transliteration. It wraps the L3Cube HingBERT-LID model and adds transliteration logic and heuristic rules to accurately detect Hindi words in romanized or mixed Hindi-English text.
## 🧩 Features

- Detects Hindi (`HI`) and English (`EN`) tokens in mixed Hindi-English (Hinglish) text.
- Uses the L3Cube HingBERT-LID Transformer model.
- Applies pattern-based heuristics for Hindi-like tokens.
- Supports dictionary-based and model-based transliteration.
- Works both as a CLI tool and as an importable Python module.
- Outputs Hindi word detections and transliterations to the console or a file.
## 🗂️ Folder Structure

```
hing_bert_module/
│
├── __init__.py
├── classifier.py        # HingBERT model loading + token classification
├── transliteration.py   # Dictionary and transliteration functions
├── utils.py             # Logging and helper utilities
├── main.py              # CLI entry point
├── hing-bert-lid/       # Pretrained model folder
└── dictionary.txt       # Transliteration dictionary
```
## ⚙️ Installation

1. Clone this repository:

   ```bash
   git clone https://github.com/yourusername/hing_bert_module.git
   cd hing_bert_module
   ```

2. Install dependencies.

   Create a virtual environment (recommended):

   ```bash
   python -m venv env
   source env/bin/activate   # or env\Scripts\activate on Windows
   ```

   Then install:

   ```bash
   pip install torch transformers hindi-xlit
   ```

   If you plan to use dictionary-based transliteration:

   ```bash
   pip install indic-transliteration
   ```
## 🚀 Usage

### Option 1: Command-Line Interface (CLI)

Run directly as a command:

```bash
python -m hing_bert_module.main --text "Ram went to Ayodhya with Sita"
```
### 🧾 Example Output

```
Token-level predictions:
------------------------
Ram      -> HI (0.97)
went     -> EN (0.99)
to       -> EN (0.99)
Ayodhya  -> HI (0.98)
with     -> EN (0.99)
Sita     -> HI (0.96)

Reconstructed Output:
राम went to अयोध्या with सीता
```
### Available CLI arguments

| Argument | Description | Example |
|---|---|---|
| `--text` | Input sentence for classification | `"Ram went to Ayodhya"` |
| `--file` | Input text file (alternative to `--text`) | `input.txt` |
| `--threshold` | Confidence threshold (default: 0.80) | `--threshold 0.85` |
| `--out` | Output file to save predictions | `--out results.txt` |
### Option 2: Import as a Python Module

```python
from hing_bert_module import load_model, classify_text, load_dictionary, get_transliteration
from hindi_xlit import HindiTransliterator

# Load model
tokenizer, model, device = load_model()

# Run inference
text = "Ram went to Ayodhya with Sita"
predictions = classify_text(text, tokenizer, model, device, threshold=0.8)

# Display predictions
for p in predictions:
    print(f"{p.token} -> {p.label} ({p.confidence:.2f})")

# Load dictionary
dictionary = load_dictionary('dictionary.txt')

# Transliterate Hindi words
hindi_tokens = [p.token for p in predictions if p.label == "HI"]
transliterator = HindiTransliterator(scheme='itrans')
for token in hindi_tokens:
    print(token, "→", get_transliteration(token, dictionary, transliterator))
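The lookup order behind this step is dictionary first, model second. A minimal standalone sketch of that behavior, using a hypothetical `translit_with_fallback` helper (this is an illustration, not the module's actual `get_transliteration` implementation):

```python
def translit_with_fallback(token, dictionary, model_translit):
    # Hypothetical helper: prefer the curated dictionary entry,
    # fall back to model-based transliteration for unseen tokens.
    entry = dictionary.get(token.lower())
    if entry is not None:
        return entry
    return model_translit(token)

dictionary = {"ram": "राम", "sita": "सीता"}
print(translit_with_fallback("Ram", dictionary, str))      # dictionary hit: राम
print(translit_with_fallback("Ayodhya", dictionary, str))  # fallback (identity here)
```

Here `str` stands in for a model-based transliterator callable; in real use you would pass something backed by `hindi_xlit`.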
## 🧠 How It Works

1. **Tokenizer + Model Inference:** Each token is passed through the HingBERT model for token classification.
2. **Heuristic Rules:** Custom rules adjust predictions based on:
   - Hindi phonetic clusters (`bh`, `chh`, `th`, etc.)
   - Common suffixes (`-a`, `-am`, `-iya`, etc.)
   - Stopword filtering for English tokens
3. **Confidence Thresholding:** Only tokens with probability above the given threshold are considered confidently Hindi.
4. **Transliteration:** Hindi-classified tokens are converted into Devanagari script using:
   - A custom dictionary (if available)
   - Model-based transliteration as a fallback
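The heuristic rules above can be sketched as a small standalone predicate; the cluster, suffix, and stopword sets here are illustrative samples, not the module's exact lists:

```python
# Illustrative rule sets (not the module's exact lists)
HINDI_CLUSTERS = ("bh", "chh", "th", "dh")
HINDI_SUFFIXES = ("a", "am", "iya")
EN_STOPWORDS = {"the", "to", "with", "and", "of"}

def looks_hindi(token: str) -> bool:
    """Heuristic: does a romanized token look Hindi-like?"""
    t = token.lower()
    if t in EN_STOPWORDS:                      # stopword filtering for English
        return False
    if any(c in t for c in HINDI_CLUSTERS):    # Hindi phonetic clusters
        return True
    return t.endswith(HINDI_SUFFIXES)          # common Hindi suffixes

print(looks_hindi("Ayodhya"))  # True: 'dh' cluster, '-a' suffix
print(looks_hindi("with"))     # False: English stopword
```

In the module these signals only *adjust* the model's predictions; they are not a classifier on their own.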
## 🧪 Example Integration

```python
from hing_bert_module.main import process_text

result = process_text("Rama and Sita returned to Ayodhya.")
print(result["reconstructed_text"])
# Output: "राम and सीता returned to अयोध्या."
```
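Reconstruction amounts to swapping Hindi-labelled tokens for their Devanagari forms while leaving English tokens untouched. A minimal standalone sketch of that idea (an assumption about the approach, not the module's actual implementation):

```python
def reconstruct(tokens, labels, translit):
    # Replace HI-labelled tokens with their Devanagari form,
    # keep EN-labelled tokens as-is.
    out = []
    for tok, lab in zip(tokens, labels):
        out.append(translit.get(tok, tok) if lab == "HI" else tok)
    return " ".join(out)

tokens = ["Ram", "went", "to", "Ayodhya"]
labels = ["HI", "EN", "EN", "HI"]
translit = {"Ram": "राम", "Ayodhya": "अयोध्या"}
print(reconstruct(tokens, labels, translit))  # राम went to अयोध्या
```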
## ⚡ Performance Tips

- Use a GPU (`cuda`) for faster inference.
- Keep the model loaded between multiple text calls.
- Use thresholds between `0.75` and `0.85` for balanced accuracy.
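One simple way to keep the model loaded between calls is a process-level cache. This sketch uses `functools.lru_cache` around a hypothetical loader; a real version would call the module's `load_model()` inside it so the heavy load happens once:

```python
from functools import lru_cache

@lru_cache(maxsize=1)
def get_pipeline():
    # Hypothetical loader: in real use, return load_model() from
    # hing_bert_module here so weights are loaded a single time.
    return {"tokenizer": None, "model": None, "device": "cpu"}

first = get_pipeline()
second = get_pipeline()
print(first is second)  # True: the cached pipeline is reused
```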
## 📄 License

This project is licensed under the MIT License. Model weights belong to L3Cube Pune under their research license.
## 🤝 Acknowledgements

- **L3Cube Pune** – HingBERT Language Identification Model
- **Hindi-Xlit** – Transliteration utility
- **Hugging Face Transformers** – Model backbone