---
language: ur
tags:
- urdu
- language-model
- n-gram
- kenlm
---

# Urdu 5-gram Language Model

This is a 5-gram KenLM language model trained on Urdu text. It is intended for rescoring hypotheses during beam search decoding in automatic speech recognition (ASR).

## Model Details

- **Language**: Urdu (ur)
- **Model Type**: 5-gram KenLM
- **Training Data**: Combined Urdu ASR datasets
- **Use Case**: Beam search decoding for Urdu ASR

## Files

- `urdu_5gram.bin`: Binary n-gram model (KenLM format)
- `config.json`: Model configuration

## Usage

```python
from pyctcdecode import build_ctcdecoder

# Load the vocabulary from your processor, in the same order as the
# columns of the acoustic model's logits.
# The trailing ... is a placeholder: replace it with the remaining tokens.
vocab = ["<pad>", "<s>", "</s>", "<unk>", "|", ...]  # Your vocab here

# Build a beam search decoder that rescores hypotheses with the 5-gram model
decoder = build_ctcdecoder(
    vocab,
    kenlm_model_path='urdu_5gram.bin',
    alpha=0.5,  # language model weight
    beta=1.5,   # per-word bonus
)
```
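
To make the roles of `alpha` and `beta` concrete, the sketch below shows one common way a decoder combines an acoustic score with n-gram scores. It is a simplified illustration, not pyctcdecode's actual code: the function name `fused_score` and the example numbers are hypothetical, and real decoders add further terms such as a penalty for unknown words.

```python
import math

# KenLM reports log10 probabilities; this factor converts them to natural log.
LOG10_TO_LN = 1.0 / math.log10(math.e)


def fused_score(acoustic_logp, lm_log10_probs, alpha=0.5, beta=1.5):
    """Combine an acoustic score with per-word language model scores.

    acoustic_logp: natural-log score of the hypothesis from the CTC model.
    lm_log10_probs: one log10 probability per word, as an n-gram model
        would report them (hypothetical values in this sketch).
    alpha: weight applied to every language model score.
    beta: constant bonus added once per word.
    """
    lm_total = sum(alpha * p * LOG10_TO_LN + beta for p in lm_log10_probs)
    return acoustic_logp + lm_total


# Two hypothetical two-word hypotheses: B has the better acoustic score,
# but A is far more plausible according to the language model.
score_a = fused_score(-4.0, [-1.0, -1.5])
score_b = fused_score(-3.8, [-2.5, -3.0])
print(score_a > score_b)
```

Raising `alpha` makes the decoder trust the language model more relative to the acoustic model, while raising `beta` favors hypotheses with more words. Both values are usually tuned on a held-out set.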

## Training Details

- N-gram order: 5
- Pruning: Minimal, with KenLM thresholds `0 0 0 1` (4-grams and 5-grams that occur only once are dropped; orders 1 to 3 are unpruned)
- Backend: KenLM
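
A model with these settings could be built along the following lines. This is a sketch under stated assumptions, not the exact procedure used for this model: the file names `corpus.txt` and `urdu_5gram.arpa` and the helper `kenlm_build_commands` are hypothetical, and the KenLM binaries `lmplz` and `build_binary` are assumed to be on the `PATH`. The helper only assembles the command lines and does not run them.

```python
import shlex


def kenlm_build_commands(corpus_path, arpa_path, binary_path,
                         order=5, prune=(0, 0, 0, 1)):
    """Return the two KenLM command lines as argument lists.

    Step 1 (lmplz) estimates an ARPA model from a text corpus with one
    sentence per line. Step 2 (build_binary) converts the ARPA file to
    KenLM's binary format.
    """
    lmplz = [
        "lmplz",
        "-o", str(order),
        "--prune", *[str(t) for t in prune],
        "--text", corpus_path,
        "--arpa", arpa_path,
    ]
    build_binary = ["build_binary", arpa_path, binary_path]
    return [lmplz, build_binary]


# Print the commands for inspection (hypothetical file names).
for cmd in kenlm_build_commands("corpus.txt", "urdu_5gram.arpa", "urdu_5gram.bin"):
    print(shlex.join(cmd))
```

Each argument list can be passed to `subprocess.run(cmd, check=True)` to execute the step once KenLM is installed.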

## Citation

If you use this model, please cite the original datasets used for training.