Urdu 5-gram Language Model

This is a 5-gram language model trained on Urdu text for ASR decoding.

Model Details

  • Language: Urdu (ur)
  • Model Type: 5-gram KenLM
  • Training Data: Combined Urdu ASR datasets
  • Use Case: Beam search decoding for Urdu ASR

Files

  • urdu_5gram.bin: Binary n-gram model (KenLM format)
  • config.json: Model configuration

Usage

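from pyctcdecode import build_ctcdecoder

# Load the vocabulary from your ASR processor.
# The label order must match the order of the acoustic model's output logits.
vocab = ["<pad>", "<s>", "</s>", "<unk>", "|", ...]  # Your vocab here

# Build the decoder with the KenLM binary from this repository
decoder = build_ctcdecoder(
    vocab,
    kenlm_model_path="urdu_5gram.bin",
    alpha=0.5,  # language model weight
    beta=1.5,   # per-word length bonus
)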
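To give a feel for what alpha and beta control, here is a minimal sketch of shallow-fusion scoring in plain Python. It is a simplification under stated assumptions, not pyctcdecode's actual implementation: the function name fused_score and all numbers are hypothetical, and real decoders apply further terms (for example, handling of unknown or partial words).

```python
import math

LN10 = math.log(10.0)  # KenLM reports log10 probabilities; convert to natural log


def fused_score(acoustic_logp, lm_log10_probs, alpha=0.5, beta=1.5):
    """Combine an acoustic score with per-word language model scores.

    acoustic_logp:  natural-log score of a hypothesis from the CTC outputs
    lm_log10_probs: one log10 probability per word from the n-gram model
    alpha:          weight of the language model term
    beta:           bonus added once per word (offsets the LM's bias
                    toward shorter hypotheses)
    """
    lm_term = alpha * LN10 * sum(lm_log10_probs)
    length_bonus = beta * len(lm_log10_probs)
    return acoustic_logp + lm_term + length_bonus


# Hypothetical two-word hypotheses: B is slightly better acoustically,
# but the language model considers A far more plausible.
hyp_a = fused_score(-4.0, [-1.0, -1.5])
hyp_b = fused_score(-3.8, [-2.5, -3.0])
best = "A" if hyp_a > hyp_b else "B"
print(best)
```

Raising alpha makes the decoder trust the n-gram model more; raising beta favors hypotheses with more words. With real model output, decoder.decode(logits) takes a NumPy array of shape (time steps, vocabulary size) and returns the decoded text.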

Training Details

  • N-gram order: 5
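  • Pruning: Minimal (0 0 0 1). Read as per-order count thresholds, this keeps all 1- to 3-grams and drops 4-grams and 5-grams seen only once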
  • Backend: KenLM
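The pruning setting can be illustrated with a small sketch. It assumes that "0 0 0 1" is a list of per-order count thresholds in the style of KenLM's --prune option, where n-grams with a count at or below the threshold are dropped and the last value extends to any remaining orders. The helper names and the toy token sequence are hypothetical, and the sketch covers only count thresholding, not the modified Kneser-Ney smoothing that KenLM applies.

```python
from collections import Counter


def count_ngrams(tokens, max_order):
    """Count every n-gram of order 1..max_order in a token sequence."""
    counts = {n: Counter() for n in range(1, max_order + 1)}
    for n in range(1, max_order + 1):
        for i in range(len(tokens) - n + 1):
            counts[n][tuple(tokens[i:i + n])] += 1
    return counts


def expand_thresholds(thresholds, max_order):
    """Extend the threshold list so the last value covers remaining orders."""
    missing = max_order - len(thresholds)
    return list(thresholds) + [thresholds[-1]] * missing


def prune(counts, thresholds):
    """Keep only n-grams whose count is strictly above their order's threshold."""
    per_order = expand_thresholds(thresholds, max(counts))
    return {
        n: Counter({g: c for g, c in grams.items() if c > per_order[n - 1]})
        for n, grams in counts.items()
    }


# Toy corpus with placeholder tokens (not real training data)
tokens = "a b c d e a b c d e f".split()
counts = count_ngrams(tokens, max_order=5)
pruned = prune(counts, [0, 0, 0, 1])

for n in sorted(pruned):
    print(n, len(counts[n]), "->", len(pruned[n]))
```

Orders 1 to 3 pass through unchanged, while singleton 4-grams and 5-grams are removed. On real text this mainly shrinks the model by discarding rare long contexts.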

Citation

If you use this model, please cite the original datasets used for training.
