
📘 Hing-BERT Language Identification Module

Hing-BERT Language Identifier is a Python module for token-level Hindi-English language identification and transliteration.
It wraps the L3Cube HingBERT-LID model and adds transliteration logic and heuristic rules to accurately detect Hindi words in romanized or code-mixed Hindi-English (Hinglish) text.


🧩 Features

  • Detects Hindi (HI) and English (EN) tokens in mixed text (Hinglish).
  • Uses L3Cube HingBERT-LID Transformer model.
  • Integrates pattern-based heuristics for Hindi-like tokens.
  • Supports dictionary-based and model-based transliteration.
  • Works both as a CLI tool and an importable Python module.
  • Outputs Hindi word detections and transliterations to console or file.

🗂️ Folder Structure

hing_bert_module/
│
├── __init__.py
├── classifier.py         # HingBERT model loading + token classification
├── transliteration.py    # Dictionary and transliteration functions
├── utils.py              # Logging and helper utilities
├── main.py               # CLI entry point
├── hing-bert-lid/        # Pretrained model folder
└── dictionary.txt        # Transliteration dictionary

⚙️ Installation

1. Clone this repository

git clone https://github.com/yourusername/hing_bert_module.git
cd hing_bert_module

2. Install dependencies

Create a virtual environment (recommended):

python -m venv env
source env/bin/activate   # or env\Scripts\activate on Windows

Then install:

pip install torch transformers hindi-xlit

If you plan to use dictionary-based transliteration:

pip install indic-transliteration

🚀 Usage

Option 1: Command-Line Interface (CLI)

Run directly as a command:

python -m hing_bert_module.main --text "Ram went to Ayodhya with Sita"

🧾 Example Output

Token-level predictions:
------------------------
Ram         -> HI (0.97)
went        -> EN (0.99)
to          -> EN (0.99)
Ayodhya     -> HI (0.98)
with        -> EN (0.99)
Sita        -> HI (0.96)

Reconstructed Output:
राम went to अयोध्या with सीता

Available CLI arguments

| Argument | Description | Example |
|---|---|---|
| `--text` | Input sentence for classification | `--text "Ram went to Ayodhya"` |
| `--file` | Input text file (alternative to `--text`) | `--file input.txt` |
| `--threshold` | Confidence threshold (default: 0.80) | `--threshold 0.85` |
| `--out` | Output file to save predictions | `--out results.txt` |
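The flags can be combined. A sketch of a batch run, assuming an `input.txt` with one sentence per line exists in the working directory:

```shell
# Classify sentences from input.txt with a stricter confidence threshold
# and save the token-level predictions to results.txt instead of stdout.
python -m hing_bert_module.main --file input.txt --threshold 0.85 --out results.txt
```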

Option 2: Import as a Python Module

from hing_bert_module import load_model, classify_text, load_dictionary, get_transliteration
from hindi_xlit import HindiTransliterator

# Load model
tokenizer, model, device = load_model()

# Run inference
text = "Ram went to Ayodhya with Sita"
predictions = classify_text(text, tokenizer, model, device, threshold=0.8)

# Display predictions
for p in predictions:
    print(f"{p.token} -> {p.label} ({p.confidence:.2f})")

# Load dictionary
dictionary = load_dictionary('dictionary.txt')

# Transliterate Hindi words
hindi_tokens = [p.token for p in predictions if p.label == "HI"]
transliterator = HindiTransliterator(scheme='itrans')
for token in hindi_tokens:
    print(token, "→", get_transliteration(token, dictionary, transliterator))

🧠 How It Works

  1. Tokenizer + Model Inference:
    Each token is passed through the HingBERT model for token classification.

  2. Heuristic Rules:
    Custom rules adjust predictions based on:

    • Hindi phonetic clusters (bh, chh, th, etc.)
    • Common suffixes (-a, -am, -iya, etc.)
    • Stopword filtering for English tokens.
  3. Confidence Thresholding:
    Only tokens with probability above the given threshold are considered confidently Hindi.

  4. Transliteration:
    Hindi-classified tokens are converted into Devanagari script using:

    • Custom dictionary (if available)
    • Model-based transliteration fallback
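The heuristic and thresholding steps (2 and 3) can be sketched in plain Python. The cluster list, suffix list, stopword set, and the `adjust_label` helper below are illustrative placeholders, not the module's actual rules:

```python
# Illustrative heuristic layer: refine a model prediction using
# simple pattern evidence when the model's confidence is low.
HINDI_CLUSTERS = ("bh", "chh", "th", "dh", "kh")   # Hindi-like phonetic clusters
HINDI_SUFFIXES = ("a", "am", "iya")                # common romanized Hindi suffixes
EN_STOPWORDS = {"to", "with", "and", "the", "of"}  # always treated as English

def adjust_label(token: str, label: str, confidence: float,
                 threshold: float = 0.80) -> str:
    word = token.lower()
    # English stopwords are never relabeled as Hindi.
    if word in EN_STOPWORDS:
        return "EN"
    # Below the threshold, fall back to pattern evidence.
    if label == "HI" and confidence < threshold:
        if any(c in word for c in HINDI_CLUSTERS) or word.endswith(HINDI_SUFFIXES):
            return "HI"
        return "EN"
    return label

print(adjust_label("Sita", "HI", 0.96))    # confident model prediction kept
print(adjust_label("bhakti", "HI", 0.60))  # low confidence, rescued by "bh" cluster
```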

🧪 Example Integration

from hing_bert_module.main import process_text

result = process_text("Rama and Sita returned to Ayodhya.")
print(result["reconstructed_text"])
# Output: "राम and सीता returned to अयोध्या."

⚡ Performance Tips

  • Use a GPU (`cuda`) for faster inference.
  • Keep the model loaded across calls instead of reloading it for every text.
  • Use a confidence threshold between 0.75 and 0.85 for balanced accuracy.

📄 License

This project is licensed under the MIT License.
Model weights belong to L3Cube Pune under their research license.


🤝 Acknowledgements

  • L3Cube Pune – HingBERT Language Identification Model
  • Hindi-Xlit – Transliteration utility
  • HuggingFace Transformers – Model backbone