abidanoaman/urdu-language-model

	---
	language: ur
	tags:
	- urdu
	- language-model
	- n-gram
	- kenlm
	---

	# Urdu 5-gram Language Model

	A 5-gram KenLM language model trained on Urdu text, intended for scoring hypotheses during beam search decoding in Urdu automatic speech recognition (ASR).

	## Model Details

	- Language: Urdu (ur)
	- Model Type: 5-gram KenLM
	- Training Data: Combined Urdu ASR datasets
	- Use Case: Beam search decoding for Urdu ASR
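
As a rough illustration of how an n-gram model enters beam search decoding: CTC decoders such as pyctcdecode typically add a weighted language model score and a per-word bonus to the acoustic score of each hypothesis (shallow fusion). The sketch below is a simplified version of that scoring rule, not pyctcdecode's actual implementation. The function name `fused_score` and the example numbers are hypothetical; `alpha` and `beta` take the values used in the Usage section.

```python
import math


def fused_score(acoustic_logp: float, lm_log10_prob: float, n_words: int,
                alpha: float = 0.5, beta: float = 1.5) -> float:
    """Simplified shallow-fusion score for one beam hypothesis."""
    # KenLM reports log10 probabilities; convert to natural log first.
    lm_logp = lm_log10_prob * math.log(10)
    # alpha weights the language model, beta rewards each decoded word.
    return acoustic_logp + alpha * lm_logp + beta * n_words


# Hypothetical numbers: hypothesis `a` has a slightly worse acoustic score
# but a much better language model score than hypothesis `b`.
a = fused_score(acoustic_logp=-4.0, lm_log10_prob=-3.0, n_words=3)
b = fused_score(acoustic_logp=-3.8, lm_log10_prob=-6.0, n_words=3)
print("LM prefers a:", a > b)
```

A larger `alpha` makes the decoder trust the language model more; a larger `beta` offsets the tendency of the language model score to favour shorter transcripts.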

	## Files

	- `urdu_5gram.bin`: Binary n-gram model (KenLM format)
	- `config.json`: Model configuration
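
A quick way to sanity-check the binary file is to score a sentence with the `kenlm` Python module. This is a minimal sketch, assuming `kenlm` is installed and that the model was trained on whitespace-separated words (the card does not state the tokenization). The helper names `perplexity` and `score_sentence` are hypothetical.

```python
def perplexity(log10_prob: float, n_tokens: int) -> float:
    """Convert a total log10 probability into per-token perplexity."""
    return 10.0 ** (-log10_prob / n_tokens)


def score_sentence(model_path: str, sentence: str) -> float:
    """Load a KenLM model and return the perplexity of one sentence."""
    import kenlm  # imported lazily so the helper above works without it

    model = kenlm.Model(model_path)
    # score() returns the log10 probability, including <s> and </s> here.
    log10_prob = model.score(sentence, bos=True, eos=True)
    n_tokens = len(sentence.split()) + 1  # +1 for the end-of-sentence token
    return perplexity(log10_prob, n_tokens)


# Usage (requires the model file on disk):
# print(score_sentence("urdu_5gram.bin", "یہ ایک جملہ ہے"))
```

Lower perplexity on well-formed Urdu text than on shuffled or out-of-domain text suggests the file loaded correctly.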

	## Usage

	```python
	from pyctcdecode import build_ctcdecoder

	# Load vocabulary (from your processor); labels must be in token-id order
	vocab = ["<pad>", "<s>", "</s>", "<unk>", "|", ...]  # Your vocab here

	# Build decoder
	decoder = build_ctcdecoder(
	    vocab,
	    kenlm_model_path="urdu_5gram.bin",
	    alpha=0.5,  # language model weight
	    beta=1.5,   # per-word bonus
	)
	```
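
The vocabulary list has to line up with the columns of the acoustic model's output, so it is usually derived from the tokenizer rather than typed by hand. The sketch below assumes a Wav2Vec2-style `processor` from `transformers` whose tokenizer exposes `get_vocab()`, and logits given as a NumPy array of shape `(time, vocab_size)`. The helper names `labels_from_vocab` and `decode_logits` are hypothetical.

```python
from typing import Dict, List


def labels_from_vocab(vocab: Dict[str, int]) -> List[str]:
    """Return tokens ordered by id so index i matches logit column i."""
    return [token for token, _ in sorted(vocab.items(), key=lambda kv: kv[1])]


def decode_logits(logits, vocab: Dict[str, int],
                  kenlm_model_path: str = "urdu_5gram.bin") -> str:
    """Decode one utterance's CTC logits with the n-gram model."""
    from pyctcdecode import build_ctcdecoder  # imported lazily

    decoder = build_ctcdecoder(
        labels_from_vocab(vocab),
        kenlm_model_path=kenlm_model_path,
        alpha=0.5,
        beta=1.5,
    )
    return decoder.decode(logits)


# Usage (assumes `processor`, `model` and `input_values` already exist):
# vocab = processor.tokenizer.get_vocab()
# logits = model(input_values).logits[0].cpu().numpy()
# print(decode_logits(logits, vocab))
```

Building the decoder is relatively expensive, so in practice it is worth constructing it once and reusing it across utterances.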

	## Training Details

	- N-gram order: 5
	- Pruning: Minimal (`0 0 0 1`; read as KenLM `lmplz --prune` thresholds, this drops only 4-grams and 5-grams seen once)
	- Backend: KenLM
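
For reference, settings like these correspond to a KenLM build along the following lines. This is a sketch under assumptions, not necessarily the command used for this model: it assumes the `lmplz` and `build_binary` executables are on `PATH`, and the file names `corpus.txt` and `urdu_5gram.arpa` are hypothetical.

```python
import subprocess
from typing import List, Sequence


def lmplz_command(corpus: str, arpa: str, order: int = 5,
                  prune: Sequence[int] = (0, 0, 0, 1)) -> List[str]:
    """Build the lmplz invocation that estimates an ARPA model."""
    return ["lmplz", "-o", str(order),
            "--prune", *[str(p) for p in prune],
            "--text", corpus, "--arpa", arpa]


def build_binary_command(arpa: str, binary: str) -> List[str]:
    """Build the invocation that converts ARPA to KenLM binary format."""
    return ["build_binary", arpa, binary]


def train(corpus: str, arpa: str, binary: str) -> None:
    """Run both steps; raises CalledProcessError if either tool fails."""
    subprocess.run(lmplz_command(corpus, arpa), check=True)
    subprocess.run(build_binary_command(arpa, binary), check=True)


# Usage (requires KenLM and a one-sentence-per-line text corpus):
# train("corpus.txt", "urdu_5gram.arpa", "urdu_5gram.bin")
```

The last pruning threshold applies to all remaining higher orders, which is why four values are enough for a 5-gram model.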

	## Citation

	If you use this model, please cite the original datasets used for training.