# 📘 Hing-BERT Language Identification Module
**Hing-BERT Language Identifier** is a Python module for **Hindi-English token-level language detection** and **transliteration**.
It wraps the **L3Cube HingBERT-LID** model and adds transliteration logic and heuristic rules to detect Hindi words accurately in romanized or mixed Hindi-English text.
---
## 🧩 Features
- Detects Hindi (`HI`) and English (`EN`) tokens in mixed text (Hinglish).
- Uses **L3Cube HingBERT-LID** Transformer model.
- Integrates **pattern-based heuristics** for Hindi-like tokens.
- Supports **dictionary-based and model-based transliteration**.
- Works both as a **CLI tool** and an **importable Python module**.
- Outputs Hindi word detections and transliterations to console or file.
---
## 🗂️ Folder Structure
```
hing_bert_module/
├── __init__.py
├── classifier.py # HingBERT model loading + token classification
├── transliteration.py # Dictionary and transliteration functions
├── utils.py # Logging and helper utilities
├── main.py # CLI entry point
├── hing-bert-lid/ # Pretrained model folder
└── dictionary.txt # Transliteration dictionary
```
---
## ⚙️ Installation
### 1. Clone this repository
```bash
git clone https://github.com/yourusername/hing_bert_module.git
cd hing_bert_module
```
### 2. Install dependencies
Create a virtual environment (recommended):
```bash
python -m venv env
source env/bin/activate # or env\Scripts\activate on Windows
```
Then install:
```bash
pip install torch transformers hindi-xlit
```
If you plan to use the dictionary transliteration:
```bash
pip install indic-transliteration
```
---
## 🚀 Usage
### **Option 1: Command-Line Interface (CLI)**
Run directly as a command:
```bash
python -m hing_bert_module.main --text "Ram went to Ayodhya with Sita"
```
#### 🧾 Example Output
```
Token-level predictions:
------------------------
Ram -> HI (0.97)
went -> EN (0.99)
to -> EN (0.99)
Ayodhya -> HI (0.98)
with -> EN (0.99)
Sita -> HI (0.96)
Reconstructed Output:
राम went to अयोध्या with सीता
```
#### Available CLI arguments
| Argument | Description | Example |
|-----------|--------------|----------|
| `--text` | Input sentence for classification | `"Ram went to Ayodhya"` |
| `--file` | Input text file (alternative to `--text`) | `input.txt` |
| `--threshold` | Confidence threshold (default: 0.80) | `--threshold 0.85` |
| `--out` | Output file to save predictions | `--out results.txt` |
---
### **Option 2: Import as a Python Module**
```python
from hing_bert_module import load_model, classify_text, load_dictionary, get_transliteration
from hindi_xlit import HindiTransliterator
# Load model
tokenizer, model, device = load_model()
# Run inference
text = "Ram went to Ayodhya with Sita"
predictions = classify_text(text, tokenizer, model, device, threshold=0.8)
# Display predictions
for p in predictions:
    print(f"{p.token} -> {p.label} ({p.confidence:.2f})")
# Load dictionary
dictionary = load_dictionary('dictionary.txt')
# Transliterate Hindi words
hindi_tokens = [p.token for p in predictions if p.label == "HI"]
transliterator = HindiTransliterator(scheme='itrans')
for token in hindi_tokens:
    print(token, "→", get_transliteration(token, dictionary, transliterator))
```
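The format of `dictionary.txt` is not documented above. A minimal sketch of a loader, assuming one tab-separated `roman<TAB>देवनागरी` pair per line (the actual `load_dictionary` in `transliteration.py` may expect a different format):

```python
def load_dictionary(path: str) -> dict[str, str]:
    """Parse a transliteration dictionary into a roman -> Devanagari map.

    Assumes one tab-separated pair per line, e.g. ``ram\tराम``;
    blank lines and ``#`` comments are skipped.
    """
    mapping: dict[str, str] = {}
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            roman, devanagari = line.split("\t", 1)
            mapping[roman.lower()] = devanagari
    return mapping
```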
---
## 🧠 How It Works
1. **Tokenizer + Model Inference:**
Each token is passed through the HingBERT model for token classification.
2. **Heuristic Rules:**
Custom rules adjust predictions based on:
- Hindi phonetic clusters (`bh`, `chh`, `th`, etc.)
- Common suffixes (`-a`, `-am`, `-iya`, etc.)
- Stopword filtering for English tokens.
3. **Confidence Thresholding:**
Only tokens with probability above the given threshold are considered confidently Hindi.
4. **Transliteration:**
Hindi-classified tokens are converted into **Devanagari script** using:
- Custom dictionary (if available)
- Model-based transliteration fallback
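The heuristic layer in step 2 can be sketched as a simple surface-pattern check. This is an illustrative stand-in, not the actual rules in `classifier.py`; the cluster, suffix, and stopword lists here are placeholders drawn from the bullets above:

```python
# Hypothetical sketch of the heuristic rules described above.
HINDI_CLUSTERS = ("bh", "chh", "th", "dh", "kh")   # Hindi phonetic clusters
HINDI_SUFFIXES = ("a", "am", "iya")                # common Hindi suffixes
EN_STOPWORDS = {"the", "to", "with", "and", "of"}  # illustrative stopword set

def looks_hindi(token: str) -> bool:
    """Return True if the token matches Hindi-like surface patterns."""
    t = token.lower()
    if t in EN_STOPWORDS:        # stopword filtering for English tokens
        return False
    if any(cluster in t for cluster in HINDI_CLUSTERS):
        return True
    return t.endswith(HINDI_SUFFIXES)
```

In the pipeline, a rule like this would only nudge tokens whose model confidence falls near the threshold, rather than override high-confidence predictions.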
---
## 🧪 Example Integration
```python
from hing_bert_module.main import process_text
result = process_text("Rama and Sita returned to Ayodhya.")
print(result["reconstructed_text"])
# Output: "राम and सीता returned to अयोध्या."
```
---
## ⚡ Performance Tips
- Use a GPU (`cuda`) for faster inference.
- Keep the model loaded between calls rather than reloading it for each text.
- Thresholds between `0.75` and `0.85` give a good balance of precision and recall.
---
## 📄 License
This project is licensed under the MIT License.
Model weights belong to **L3Cube Pune** under their research license.
---
## 🤝 Acknowledgements
- **L3Cube Pune** – HingBERT Language Identification Model
- **Hindi-Xlit** – Transliteration utility
- **HuggingFace Transformers** – Model backbone