
📘 Hing-BERT Language Identification Module

Hing-BERT Language Identifier is a Python module for token-level Hindi-English language identification and transliteration.
It wraps the L3Cube HingBERT-LID model and adds transliteration logic and heuristic rules to accurately detect Hindi words in romanized or code-mixed Hindi-English (Hinglish) text.


🧩 Features

  • Detects Hindi (HI) and English (EN) tokens in mixed text (Hinglish).
  • Uses L3Cube HingBERT-LID Transformer model.
  • Integrates pattern-based heuristics for Hindi-like tokens.
  • Supports dictionary-based and model-based transliteration.
  • Works both as a CLI tool and an importable Python module.
  • Outputs Hindi word detections and transliterations to console or file.

🗂️ Folder Structure

hing_bert_module/
│
├── __init__.py
├── classifier.py         # HingBERT model loading + token classification
├── transliteration.py    # Dictionary and transliteration functions
├── utils.py              # Logging and helper utilities
├── main.py               # CLI entry point
├── hing-bert-lid/        # Pretrained model folder
└── dictionary.txt        # Transliteration dictionary

⚙️ Installation

1. Clone this repository

git clone https://github.com/yourusername/hing_bert_module.git
cd hing_bert_module

2. Install dependencies

Create a virtual environment (recommended):

python -m venv env
source env/bin/activate   # or env\Scripts\activate on Windows

Then install:

pip install torch transformers hindi-xlit

If you plan to use dictionary-based transliteration:

pip install indic-transliteration

🚀 Usage

Option 1: Command-Line Interface (CLI)

Run directly as a command:

python -m hing_bert_module.main --text "Ram went to Ayodhya with Sita"

🧾 Example Output

Token-level predictions:
------------------------
Ram         -> HI (0.97)
went        -> EN (0.99)
to          -> EN (0.99)
Ayodhya     -> HI (0.98)
with        -> EN (0.99)
Sita        -> HI (0.96)

Reconstructed Output:
राम went to अयोध्या with सीता

Available CLI arguments

| Argument | Description | Example |
|---|---|---|
| `--text` | Input sentence for classification | `--text "Ram went to Ayodhya"` |
| `--file` | Input text file (alternative to `--text`) | `--file input.txt` |
| `--threshold` | Confidence threshold (default: 0.80) | `--threshold 0.85` |
| `--out` | Output file to save predictions | `--out results.txt` |
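The flags can be combined. A sketch of a batch run, assuming an `input.txt` with one sentence per line exists in the working directory:

```shell
# Classify sentences from input.txt with a stricter confidence threshold
# and save the token-level predictions to results.txt instead of stdout.
python -m hing_bert_module.main --file input.txt --threshold 0.85 --out results.txt
```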

Option 2: Import as a Python Module

from hing_bert_module import load_model, classify_text, load_dictionary, get_transliteration
from hindi_xlit import HindiTransliterator

# Load model
tokenizer, model, device = load_model()

# Run inference
text = "Ram went to Ayodhya with Sita"
predictions = classify_text(text, tokenizer, model, device, threshold=0.8)

# Display predictions
for p in predictions:
    print(f"{p.token} -> {p.label} ({p.confidence:.2f})")

# Load dictionary
dictionary = load_dictionary('dictionary.txt')

# Transliterate Hindi words
hindi_tokens = [p.token for p in predictions if p.label == "HI"]
transliterator = HindiTransliterator(scheme='itrans')
for token in hindi_tokens:
    print(token, "→", get_transliteration(token, dictionary, transliterator))

🧠 How It Works

  1. Tokenizer + Model Inference:
    Each token is passed through the HingBERT model for token classification.

  2. Heuristic Rules:
    Custom rules adjust predictions based on:

    • Hindi phonetic clusters (bh, chh, th, etc.)
    • Common suffixes (-a, -am, -iya, etc.)
    • Stopword filtering for English tokens.
  3. Confidence Thresholding:
    Only tokens with probability above the given threshold are considered confidently Hindi.

  4. Transliteration:
    Hindi-classified tokens are converted into Devanagari script using:

    • Custom dictionary (if available)
    • Model-based transliteration fallback
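The heuristic and thresholding steps (2 and 3) can be sketched in plain Python. The cluster list, suffix list, stopword set, and the `adjust_label` helper below are illustrative placeholders, not the module's actual rules:

```python
# Illustrative heuristic layer: refine a model prediction using
# simple pattern evidence when the model's confidence is low.
HINDI_CLUSTERS = ("bh", "chh", "th", "dh", "kh")   # Hindi-like phonetic clusters
HINDI_SUFFIXES = ("a", "am", "iya")                # common romanized Hindi suffixes
EN_STOPWORDS = {"to", "with", "and", "the", "of"}  # always treated as English

def adjust_label(token: str, label: str, confidence: float,
                 threshold: float = 0.80) -> str:
    word = token.lower()
    # English stopwords are never relabeled as Hindi.
    if word in EN_STOPWORDS:
        return "EN"
    # Below the threshold, fall back to pattern evidence.
    if label == "HI" and confidence < threshold:
        if any(c in word for c in HINDI_CLUSTERS) or word.endswith(HINDI_SUFFIXES):
            return "HI"
        return "EN"
    return label

print(adjust_label("Sita", "HI", 0.96))    # confident model prediction kept
print(adjust_label("bhakti", "HI", 0.60))  # low confidence, rescued by "bh" cluster
```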

🧪 Example Integration

from hing_bert_module.main import process_text

result = process_text("Rama and Sita returned to Ayodhya.")
print(result["reconstructed_text"])
# Output: "राम and सीता returned to अयोध्या."

⚡ Performance Tips

  • Use a GPU (`cuda`) for faster inference.
  • Keep the model loaded across calls instead of reloading it for every text.
  • Use a confidence threshold between 0.75 and 0.85 for balanced accuracy.

📄 License

This project is licensed under the MIT License.
Model weights belong to L3Cube Pune under their research license.


🤝 Acknowledgements

  • L3Cube Pune – HingBERT Language Identification Model
  • Hindi-Xlit – Transliteration utility
  • HuggingFace Transformers – Model backbone