# 📘 Hing-BERT Language Identification Module

**Hing-BERT Language Identifier** is a Python module for **Hindi-English token-level language detection** and **transliteration**.
It wraps **L3Cube HingBERT-LID** and adds transliteration logic and heuristic improvements to accurately detect Hindi words in romanized or mixed Hindi-English text.
---

## 🧩 Features

- Detects Hindi (`HI`) and English (`EN`) tokens in mixed text (Hinglish).
- Uses the **L3Cube HingBERT-LID** Transformer model.
- Integrates **pattern-based heuristics** for Hindi-like tokens.
- Supports **dictionary-based and model-based transliteration**.
- Works both as a **CLI tool** and an **importable Python module**.
- Outputs Hindi word detections and transliterations to the console or a file.

---

## 🗂️ Folder Structure

```
hing_bert_module/
│
├── __init__.py
├── classifier.py          # HingBERT model loading + token classification
├── transliteration.py     # Dictionary and transliteration functions
├── utils.py               # Logging and helper utilities
├── main.py                # CLI entry point
├── hing-bert-lid/         # Pretrained model folder
└── dictionary.txt         # Transliteration dictionary
```
---

## ⚙️ Installation

### 1. Clone this repository

```bash
git clone https://github.com/yourusername/hing_bert_module.git
cd hing_bert_module
```

### 2. Install dependencies

Create a virtual environment (recommended):

```bash
python -m venv env
source env/bin/activate  # or env\Scripts\activate on Windows
```

Then install:

```bash
pip install torch transformers hindi-xlit
```

If you plan to use dictionary-based transliteration:

```bash
pip install indic-transliteration
```
---

## 🚀 Usage

### **Option 1: Command-Line Interface (CLI)**

Run directly as a command:

```bash
python -m hing_bert_module.main --text "Ram went to Ayodhya with Sita"
```

#### 🧾 Example Output

```
Token-level predictions:
------------------------
Ram      -> HI (0.97)
went     -> EN (0.99)
to       -> EN (0.99)
Ayodhya  -> HI (0.98)
with     -> EN (0.99)
Sita     -> HI (0.96)

Reconstructed Output:
राम went to अयोध्या with सीता
```

#### Available CLI arguments

| Argument      | Description                               | Example                 |
|---------------|-------------------------------------------|-------------------------|
| `--text`      | Input sentence for classification         | `"Ram went to Ayodhya"` |
| `--file`      | Input text file (alternative to `--text`) | `input.txt`             |
| `--threshold` | Confidence threshold (default: `0.80`)    | `--threshold 0.85`      |
| `--out`       | Output file to save predictions           | `--out results.txt`     |
---

### **Option 2: Import as a Python Module**

```python
from hing_bert_module import load_model, classify_text, load_dictionary, get_transliteration
from hindi_xlit import HindiTransliterator

# Load model
tokenizer, model, device = load_model()

# Run inference
text = "Ram went to Ayodhya with Sita"
predictions = classify_text(text, tokenizer, model, device, threshold=0.8)

# Display predictions
for p in predictions:
    print(f"{p.token} -> {p.label} ({p.confidence:.2f})")

# Load dictionary
dictionary = load_dictionary('dictionary.txt')

# Transliterate Hindi words
hindi_tokens = [p.token for p in predictions if p.label == "HI"]
transliterator = HindiTransliterator(scheme='itrans')
for token in hindi_tokens:
    print(token, "→", get_transliteration(token, dictionary, transliterator))
```
---

## 🧠 How It Works

1. **Tokenizer + Model Inference:**
   Each token is passed through the HingBERT model for token classification.
2. **Heuristic Rules:**
   Custom rules adjust predictions based on:
   - Hindi phonetic clusters (`bh`, `chh`, `th`, etc.)
   - Common suffixes (`-a`, `-am`, `-iya`, etc.)
   - Stopword filtering for English tokens.
3. **Confidence Thresholding:**
   Only tokens with a probability above the given threshold are considered confidently Hindi.
4. **Transliteration:**
   Hindi-classified tokens are converted into **Devanagari script** using:
   - a custom dictionary (if available), with
   - model-based transliteration as a fallback.
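Steps 2 and 3 can be sketched as a post-processing pass over the model's per-token predictions. Everything below — the `Prediction` shape, the cluster/suffix lists, and the stopword set — is illustrative, not the module's actual rule set:

```python
import re
from dataclasses import dataclass


@dataclass
class Prediction:
    token: str
    label: str        # "HI" or "EN"
    confidence: float


# Illustrative patterns only; the real rules live in classifier.py.
HINDI_CLUSTERS = re.compile(r"(bh|chh|dh|kh|gh)", re.IGNORECASE)
HINDI_SUFFIXES = ("a", "am", "iya")
EN_STOPWORDS = {"the", "to", "with", "and", "of", "in"}


def apply_heuristics(pred, threshold=0.80):
    token = pred.token.lower()
    # English stopwords are never flipped to Hindi.
    if token in EN_STOPWORDS:
        return Prediction(pred.token, "EN", max(pred.confidence, threshold))
    # Low-confidence tokens with Hindi-like phonetics get nudged to HI.
    if pred.confidence < threshold and (
        HINDI_CLUSTERS.search(token) or token.endswith(HINDI_SUFFIXES)
    ):
        return Prediction(pred.token, "HI", threshold)
    return pred
```

Predictions at or above the threshold pass through unchanged, so the heuristics only touch tokens the model itself is unsure about.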
---

## 🧪 Example Integration

```python
from hing_bert_module.main import process_text

result = process_text("Rama and Sita returned to Ayodhya.")
print(result["reconstructed_text"])
# Output: "राम and सीता returned to अयोध्या."
```

---

## ⚡ Performance Tips

- Use a GPU (`cuda`) for faster inference.
- Keep the model loaded across multiple calls instead of reloading it for each text.
- Use thresholds between `0.75` and `0.85` for balanced accuracy.
---

## 📄 License

This project is licensed under the MIT License.
Model weights belong to **L3Cube Pune** under their research license.

---

## 🤝 Acknowledgements

- **L3Cube Pune** – HingBERT Language Identification Model
- **Hindi-Xlit** – Transliteration utility
- **Hugging Face Transformers** – Model backbone