language: ur
tags:
  - urdu
  - language-model
  - n-gram
  - kenlm

Urdu 5-gram Language Model

This is a 5-gram KenLM language model trained on Urdu text. It is intended for scoring hypotheses during beam search decoding in Urdu automatic speech recognition (ASR).

Model Details

Language: Urdu (ur)
Model Type: 5-gram KenLM
Training Data: Combined Urdu ASR datasets
Use Case: Beam search decoding for Urdu ASR
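
To make the use case concrete, here is a minimal sketch of how an n-gram model can influence beam search through shallow fusion: each hypothesis gets the acoustic score plus a weighted language model score plus a per-word bonus. The function name `fused_score`, the weights, and the sample numbers are all hypothetical, and this is a simplified illustration rather than the exact scoring used inside any particular decoder.

```python
import math


def fused_score(acoustic_logprob, lm_log10_prob, num_words, alpha=0.5, beta=1.5):
    """Combine acoustic and LM evidence for one hypothesis (natural-log domain).

    alpha weights the language model, beta rewards each emitted word.
    Both defaults are illustrative assumptions.
    """
    # KenLM reports log10 probabilities, so convert to natural log first.
    lm_logprob = lm_log10_prob * math.log(10)
    return acoustic_logprob + alpha * lm_logprob + beta * num_words


# Made-up candidates: (text, acoustic log-prob, LM log10-prob, word count)
hypotheses = [
    ("hypothesis A", -4.0, -6.0, 2),
    ("hypothesis B", -4.5, -3.0, 2),
]

# Pick the hypothesis with the highest fused score.
best = max(hypotheses, key=lambda h: fused_score(h[1], h[2], h[3]))
print(best[0])
```

In this toy case the acoustically weaker hypothesis wins because the language model finds it much more likely, which is the effect an n-gram model is meant to have during decoding.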

Files

urdu_5gram.bin: Binary n-gram model (KenLM format)
config.json: Model configuration

Usage

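from pyctcdecode import build_ctcdecoder

# Load vocabulary (from your processor), ordered by token index
vocab = ["<pad>", "<s>", "</s>", "<unk>", "|", ...]  # Your vocab here

# Build a beam search decoder backed by the KenLM model
decoder = build_ctcdecoder(
    vocab,
    kenlm_model_path='urdu_5gram.bin',
    alpha=0.5,  # weight of the language model score
    beta=1.5,   # weight of the length (word count) adjustment
)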
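
The label list has to follow the tokenizer's index order, since each logit column is matched to the label at the same position. A minimal sketch of deriving that list from a token-to-id mapping (for example, the kind of dictionary a tokenizer's `get_vocab()` returns) might look like this. The helper name `labels_from_vocab` and the toy mapping are hypothetical.

```python
def labels_from_vocab(vocab_dict):
    """Return tokens sorted by integer id, so position i labels logit column i."""
    return [token for token, _ in sorted(vocab_dict.items(), key=lambda kv: kv[1])]


# Hypothetical toy mapping; a real one would come from your processor's tokenizer.
example_vocab = {"<pad>": 0, "|": 4, "<s>": 1, "<unk>": 3, "</s>": 2}

labels = labels_from_vocab(example_vocab)
print(labels)
```

The resulting list can then be passed as the first argument to `build_ctcdecoder`.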

Training Details

N-gram order: 5
Pruning: Minimal (0 0 0 1)
Backend: KenLM
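
The exact training command is not given here. As a rough sketch, and assuming the values `0 0 0 1` are KenLM `lmplz --prune` count thresholds, a comparable model could be built as below. With fewer thresholds than the model order, KenLM extends the last value to higher orders, so this setting would drop only singleton 4-grams and 5-grams. The file names `corpus.txt` and `urdu_5gram.arpa` and the helper names are hypothetical, and the KenLM binaries `lmplz` and `build_binary` are assumed to be on the `PATH`.

```python
import subprocess


def lmplz_command(text_path, arpa_path, order=5, prune=(0, 0, 0, 1)):
    """Build the argv for estimating an ARPA model with KenLM's lmplz."""
    return [
        "lmplz",
        "-o", str(order),
        "--prune", *map(str, prune),
        "--text", text_path,
        "--arpa", arpa_path,
    ]


def build_binary_command(arpa_path, bin_path):
    """Build the argv for converting an ARPA file to KenLM's binary format."""
    return ["build_binary", arpa_path, bin_path]


def train(text_path, arpa_path, bin_path):
    """Run both steps; raises CalledProcessError if either tool fails."""
    subprocess.run(lmplz_command(text_path, arpa_path), check=True)
    subprocess.run(build_binary_command(arpa_path, bin_path), check=True)


# Only print the command here; call train(...) to actually run KenLM.
print(" ".join(lmplz_command("corpus.txt", "urdu_5gram.arpa")))
```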

Citation

If you use this model, please cite the original datasets used for training.