metadata
language: ur
tags:
- urdu
- language-model
- n-gram
- kenlm
Urdu 5-gram Language Model
This is a 5-gram language model trained on Urdu text for ASR decoding.
Model Details
- Language: Urdu (ur)
- Model Type: 5-gram KenLM
- Training Data: Combined Urdu ASR datasets
- Use Case: Beam search decoding for Urdu ASR
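As a rough illustration of how an n-gram model enters beam search, hypotheses are commonly ranked by a weighted sum of the acoustic log-probability, the language model log-probability, and a word-count bonus. The sketch below shows only that general idea, not pyctcdecode's exact internals. The function name `fused_score` and all numbers are made up for illustration; `alpha` and `beta` mirror the decoder parameters shown under Usage.

```python
def fused_score(acoustic_logp, lm_logp, n_words, alpha=0.5, beta=1.5):
    """Combine acoustic and language model evidence for one hypothesis.

    alpha scales the language model log-probability,
    beta adds a fixed bonus per word in the hypothesis.
    """
    return acoustic_logp + alpha * lm_logp + beta * n_words


# Two hypothetical 3-word hypotheses with invented scores.
# B is slightly better acoustically, but the language model finds it unlikely.
score_a = fused_score(acoustic_logp=-4.0, lm_logp=-6.0, n_words=3)
score_b = fused_score(acoustic_logp=-3.5, lm_logp=-12.0, n_words=3)

best = "A" if score_a > score_b else "B"
print(score_a, score_b, best)
```

With these invented numbers the language model term outweighs the small acoustic advantage of hypothesis B, so A is ranked first.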
Files
- urdu_5gram.bin: Binary n-gram model (KenLM format)
- config.json: Model configuration
Usage
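from pyctcdecode import build_ctcdecoder

# Load vocabulary (from your processor). Labels must be ordered by token id
# so that they line up with the columns of the acoustic model's logits.
vocab = ["<pad>", "<s>", "</s>", "<unk>", "|", ...]  # Your vocab here

# Build a beam search decoder that scores hypotheses with the 5-gram model
decoder = build_ctcdecoder(
    vocab,
    kenlm_model_path="urdu_5gram.bin",
    alpha=0.5,  # language model weight
    beta=1.5,  # per-word length bonus
)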
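The vocabulary placeholder above has to be filled in from the acoustic model's tokenizer. A minimal sketch, assuming the tokenizer exposes a token-to-id mapping (Hugging Face tokenizers do this through `get_vocab()`). The helper name `labels_from_vocab` and the toy mapping are hypothetical and not part of this repository.

```python
def labels_from_vocab(token_to_id):
    """Return tokens ordered by id, one label per logit column."""
    ids = sorted(token_to_id.values())
    # A CTC decoder needs exactly one label per output index.
    if ids != list(range(len(ids))):
        raise ValueError("token ids must be contiguous and start at 0")
    ordered = sorted(token_to_id.items(), key=lambda item: item[1])
    return [token for token, _ in ordered]


# Toy mapping for illustration only; a real one would come from the processor.
toy_vocab = {"|": 4, "<pad>": 0, "</s>": 2, "<s>": 1, "<unk>": 3, "ا": 5, "ب": 6}
labels = labels_from_vocab(toy_vocab)
print(labels)
```

The resulting list can be passed as the first argument to `build_ctcdecoder`.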
Training Details
- N-gram order: 5
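- Pruning: Minimal (KenLM prune thresholds 0 0 0 1: orders 1 to 3 are not pruned; 4-grams and 5-grams seen only once are dropped)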
- Backend: KenLM
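The settings above correspond to KenLM's standard `lmplz` and `build_binary` command-line tools. The exact commands used for this model are not stated, so the following is only a sketch of how such a model could be rebuilt. The file names `corpus.txt` and `urdu_5gram.arpa` and the helper `kenlm_commands` are assumptions for illustration. The code only assembles and prints the commands; it does not run them.

```python
import shlex


def kenlm_commands(corpus, arpa, binary, order=5, prune=(0, 0, 0, 1)):
    """Build the argument lists for estimating and binarizing an n-gram model."""
    lmplz = [
        "lmplz",
        "-o", str(order),                  # n-gram order
        "--prune", *map(str, prune),       # count thresholds per order
        "--text", corpus,                  # training text, one sentence per line
        "--arpa", arpa,                    # intermediate ARPA output
    ]
    build_binary = ["build_binary", arpa, binary]
    return lmplz, build_binary


# Hypothetical paths; only urdu_5gram.bin is named in this repository.
lmplz_cmd, binary_cmd = kenlm_commands("corpus.txt", "urdu_5gram.arpa", "urdu_5gram.bin")
print(shlex.join(lmplz_cmd))
print(shlex.join(binary_cmd))
```

To execute the commands, each list could be passed to `subprocess.run(cmd, check=True)` on a machine where the KenLM binaries are on the PATH.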
Citation
If you use this model, please cite the original datasets used for training.