# 📘 Hing-BERT Language Identification Module

**Hing-BERT Language Identifier** is a Python module for **Hindi-English token-level language detection** and **transliteration**. It wraps the **L3Cube HingBERT-LID** model and adds transliteration logic and heuristic rules to detect Hindi words more accurately in romanized or mixed Hindi-English text.

---

## 🧩 Features

- Detects Hindi (`HI`) and English (`EN`) tokens in mixed (Hinglish) text.
- Uses the **L3Cube HingBERT-LID** Transformer model.
- Applies **pattern-based heuristics** for Hindi-like tokens.
- Supports **dictionary-based and model-based transliteration**.
- Works both as a **CLI tool** and as an **importable Python module**.
- Writes Hindi word detections and transliterations to the console or a file.

---

## 🗂️ Folder Structure

```
hing_bert_module/
│
├── __init__.py
├── classifier.py        # HingBERT model loading + token classification
├── transliteration.py   # Dictionary and transliteration functions
├── utils.py             # Logging and helper utilities
├── main.py              # CLI entry point
├── hing-bert-lid/       # Pretrained model folder
└── dictionary.txt       # Transliteration dictionary
```

---

## ⚙️ Installation

### 1. Clone this repository

```bash
git clone https://github.com/yourusername/hing_bert_module.git
cd hing_bert_module
```

### 2. Install dependencies
Create a virtual environment (recommended):

```bash
python -m venv env
source env/bin/activate   # or env\Scripts\activate on Windows
```

Then install:

```bash
pip install torch transformers hindi-xlit
```

If you plan to use dictionary-based transliteration:

```bash
pip install indic-transliteration
```

---

## 🚀 Usage

### **Option 1: Command-Line Interface (CLI)**

Run the module directly as a command:

```bash
python -m hing_bert_module.main --text "Ram went to Ayodhya with Sita"
```

#### 🧾 Example Output

```
Token-level predictions:
------------------------
Ram      -> HI (0.97)
went     -> EN (0.99)
to       -> EN (0.99)
Ayodhya  -> HI (0.98)
with     -> EN (0.99)
Sita     -> HI (0.96)

Reconstructed Output:
राम went to अयोध्या with सीता
```

#### Available CLI arguments

| Argument | Description | Example |
|-----------|--------------|----------|
| `--text` | Input sentence for classification | `"Ram went to Ayodhya"` |
| `--file` | Input text file (alternative to `--text`) | `input.txt` |
| `--threshold` | Confidence threshold (default: `0.80`) | `--threshold 0.85` |
| `--out` | Output file to save predictions | `--out results.txt` |

---

### **Option 2: Import as a Python Module**

```python
from hing_bert_module import load_model, classify_text, load_dictionary, get_transliteration
from hindi_xlit import HindiTransliterator

# Load model
tokenizer, model, device = load_model()

# Run inference
text = "Ram went to Ayodhya with Sita"
predictions = classify_text(text, tokenizer, model, device, threshold=0.8)

# Display predictions
for p in predictions:
    print(f"{p.token} -> {p.label} ({p.confidence:.2f})")

# Load dictionary
dictionary = load_dictionary('dictionary.txt')

# Transliterate Hindi words
hindi_tokens = [p.token for p in predictions if p.label == "HI"]
transliterator = HindiTransliterator(scheme='itrans')
for token in hindi_tokens:
    print(token, "→", get_transliteration(token, dictionary, transliterator))
```

---

## 🧠 How It Works
1. **Tokenizer + Model Inference:** Each token is passed through the HingBERT model for token classification.
2. **Heuristic Rules:** Custom rules adjust predictions based on:
   - Hindi phonetic clusters (`bh`, `chh`, `th`, etc.)
   - Common suffixes (`-a`, `-am`, `-iya`, etc.)
   - Stopword filtering for English tokens.
3. **Confidence Thresholding:** Only tokens whose probability exceeds the given threshold are treated as confidently Hindi.
4. **Transliteration:** Hindi-classified tokens are converted to **Devanagari script** using:
   - a custom dictionary (if available), with
   - model-based transliteration as a fallback.

---

## 🧪 Example Integration

```python
from hing_bert_module.main import process_text

result = process_text("Rama and Sita returned to Ayodhya.")
print(result["reconstructed_text"])
# Output: "राम and सीता returned to अयोध्या."
```

---

## ⚡ Performance Tips

- Use a GPU (`cuda`) for faster inference.
- Keep the model loaded between calls instead of reloading it for each text.
- Thresholds between `0.75` and `0.85` give a good balance of precision and recall.

---

## 📄 License

This project is licensed under the MIT License. The model weights belong to **L3Cube Pune** and are distributed under their research license.

---

## 🤝 Acknowledgements

- **L3Cube Pune** – HingBERT Language Identification model
- **Hindi-Xlit** – transliteration utility
- **HuggingFace Transformers** – model backbone
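---

## 📎 Appendix: Heuristic Sketch

The pattern-based heuristics described under "How It Works" (phonetic clusters, common suffixes, stopword filtering) could look roughly like this. This is a minimal sketch, not the module's actual rules: `looks_hindi_like` and the cluster/suffix/stopword lists are hypothetical and purely illustrative.

```python
# Hypothetical sketch of the heuristic rules; the cluster, suffix,
# and stopword lists are illustrative examples only.
HINDI_CLUSTERS = ("bh", "chh", "th", "dh", "kh", "gh")
HINDI_SUFFIXES = ("a", "am", "iya")
EN_STOPWORDS = {"the", "and", "to", "with", "went"}

def looks_hindi_like(token: str) -> bool:
    """Return True if a romanized token matches Hindi-like patterns."""
    t = token.lower()
    if t in EN_STOPWORDS:
        # Stopword filtering: common English function words are never Hindi.
        return False
    if any(cluster in t for cluster in HINDI_CLUSTERS):
        # Hindi phonetic clusters (aspirated consonants in roman spelling).
        return True
    # Common romanized-Hindi word endings.
    return t.endswith(HINDI_SUFFIXES)

print(looks_hindi_like("bhai"))   # True  (contains "bh")
print(looks_hindi_like("went"))   # False (English stopword)
```

In the real module such a check would only nudge borderline model predictions, not replace them.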
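---

## 📎 Appendix: Transliteration Fallback Sketch

Step 4 of "How It Works" tries the custom dictionary first and falls back to model-based transliteration. A minimal sketch of that lookup order, assuming nothing about the module's API: `transliterate_token` and the stubbed `fallback` callable are hypothetical stand-ins (the real fallback would call Hindi-Xlit).

```python
def transliterate_token(token, dictionary, model_fallback):
    """Dictionary lookup first; fall back to model-based transliteration.

    `model_fallback` is any str -> str callable standing in for the
    Hindi-Xlit model call.
    """
    key = token.lower()
    if key in dictionary:
        # The custom dictionary wins when it has an entry.
        return dictionary[key]
    # Otherwise defer to the (slower) model-based transliterator.
    return model_fallback(token)

# Usage with a stubbed fallback:
dictionary = {"ram": "राम", "sita": "सीता"}
fallback = lambda tok: f"<xlit:{tok}>"   # stand-in for the real model
print(transliterate_token("Ram", dictionary, fallback))      # राम
print(transliterate_token("Ayodhya", dictionary, fallback))  # <xlit:Ayodhya>
```

Keeping the dictionary check first makes frequent words both faster and more consistent than pure model output.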