	---
	library_name: transformers
	language:
	- ne
	- en
	datasets:
	- adkta/nep_eng_code-mixed_asr_dataset
	metrics:
	- wer
	results:
	- task:
	    type: ASR
	  metrics:
	  - name: WER
	    value: 21.83
	---

	# Model Card for adkta/nep_eng_code-mixed_translit_lm

	<!-- Provide a quick summary of what the model is/does. -->



	## Model Details

	### Model Description

	This is an automatic speech recognition (ASR, speech-to-text) model for Nepali-English code-mixed speech.
	The model is wav2vec 2.0 with a CTC head.
	Transcripts are produced in native script, i.e. Nepali in Devanagari script and English in Roman script.
	The model was created as part of a thesis submitted in partial fulfillment of the M.Sc. in Information and Communication Engineering at Pulchowk Campus, IOE, TU, Nepal.
	It is the best-performing model from that work, with a WER of 21.83.
	The model is trained on a Devanagarized version of the transcripts, in which English words are also written in Devanagari.
	It uses a language model (LM) during decoding.
	Conversion back to native script is performed in post-processing, using a transliteration dictionary created as part of the thesis and an LM.
	For more details, please contact the author at the email address given at the end of this card.

	- Developed by: Ashish Devkota
	- Model type: ASR
	- Language(s) (NLP): Nepali-English Code-mixed
	- License: [More Information Needed]
	- Finetuned from model [optional]: wav2vec 2.0
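The WER quoted in the description is a word error rate. As a rough illustration, WER is commonly computed as the word-level edit distance (substitutions, deletions, and insertions) between a hypothesis and its reference, divided by the number of reference words. The sketch below is a minimal pure-Python version. The function name `wer` and the example sentences are illustrative, and the thesis may have used a different tool or text normalization.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far
    # and the first j hypothesis words
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Illustrative sentences only (not taken from the dataset)
print(wer("म school जान्छु", "म स्कूल जान्छु"))
```

Multiplying the returned fraction by 100 gives a percentage-style figure.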


	## Bias, Risks, and Limitations

	Only about 2 hours of the training data come from a manually created Nepali-English code-mixed ASR dataset. The rest of the training data was created
	automatically from YouTube transcripts plus Gemini and is less accurate than the manually created data. More manually transcribed data is needed for training.


	## How to Get Started with the Model

	Use the code below to get started with the model. The commands use notebook syntax (`!` and `%`), e.g. for Colab.
	Install the dependencies: torch, torchcodec, torchaudio, flashlight-text (for the CTC decoder), and Hugging Face transformers (to load this model):
	```
	!pip install torch
	!pip install torchcodec
	!pip install torchaudio
	!pip install flashlight-text
	!pip install transformers
	```
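After installing, a quick check that the packages are importable can save debugging time later. The sketch below uses only the standard library. The helper name `find_missing` is illustrative, and the top-level import name `flashlight` for the flashlight-text package is an assumption.

```python
from importlib.util import find_spec

# Top-level import names; "flashlight" for the flashlight-text package is an assumption.
REQUIRED_MODULES = ["torch", "torchcodec", "torchaudio", "flashlight", "transformers"]


def find_missing(modules):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in modules if find_spec(name) is None]


missing = find_missing(REQUIRED_MODULES)
print("Missing modules:", missing or "none")
```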


	Install KenLM (for using LM):
	```
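	!git clone https://github.com/kpu/kenlm.git
	!sudo apt-get install -y libboost-all-dev --fix-missing # For Colab

	%cd kenlm
	!mkdir -p build
	%cd build
	!cmake ..
	!make -j 1
	%cd ..

	# `!export` only affects a temporary subshell, so set the variables from Python
	import os
	os.environ["KENLM_ROOT"] = os.getcwd()
	os.environ["USE_CUDA"] = "0"  # for CPU

	# Install the KenLM Python module from the cloned repo
	!pip install .
	%cd ..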
	```

	Install the transliteration repo and its dependencies for the disambiguation pipeline:
	```
	!pip install nepali-num2word
	!git clone https://github.com/adkta/nepali_arabic_num_to_word.git
	!python -m pip install -U symspellpy
	!rm -rf /content/transliteration  # remove any earlier clone (Colab path)
	!git clone https://github.com/adkta/transliteration.git
	```

	Download LM for disambiguation:
	```
	!wget -L https://raw.githubusercontent.com/adkta/Devkota_2026_nep_eng_asr/main/disambiguation_lm.binary
	```

	Download Reduction Dictionary for disambiguation:
	```
	!wget -L https://raw.githubusercontent.com/adkta/transliteration/main/dictionaries/Nep_Eng_Code-Mixed_Reduct_Dict_Gemini.json
	```

	Download lexicon and language model for decoder (or train your own KenLM):
	```
	!wget -L https://raw.githubusercontent.com/adkta/Devkota_2026_nep_eng_asr/main/Indic_Combined_Translit_LM/lexicon.lst
	!wget -L https://raw.githubusercontent.com/adkta/Devkota_2026_nep_eng_asr/main/Indic_Combined_Translit_LM/lm.binary
	```
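If you train your own KenLM, the decoder also needs a matching lexicon. torchaudio's lexicon-based CTC decoder expects one line per word: the word, followed by its token spelling separated by spaces. The sketch below assumes that tokens are single characters (Unicode code points) and that the word-boundary token is `|`. Whether this matches the model's tokenizer and the provided `lexicon.lst` is an assumption to check against `processor.tokenizer.get_vocab()`. The helper name and word list are hypothetical.

```python
def make_lexicon_lines(words, word_boundary="|"):
    """Build lexicon lines of the form '<word> <tok1> <tok2> ... |'."""
    lines = []
    for word in sorted(set(words)):  # de-duplicate and sort for a stable file
        spelling = " ".join(list(word) + [word_boundary])
        lines.append(f"{word} {spelling}")
    return lines


# Hypothetical word list; in practice this would come from the LM training text.
words = ["नेपाल", "स्कूल", "नेपाल"]
for line in make_lexicon_lines(words):
    print(line)
```

Every token used in a spelling has to exist in the model's vocabulary, so words containing other characters would need to be filtered out or normalized first.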

	Download an audio file of your choice (say, `test_audio.mp3`) and preprocess it to 16 kHz mono:

	```python
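	import torchaudio
	from torch import mean

	filename = "./test_audio.mp3"  # path to your input audio

	# Load the audio; the tensor shape is (channels, samples)
	audio, sample_rate = torchaudio.load(filename)
	# Downmix to mono by averaging the channels
	mono = audio if audio.shape[0] == 1 else mean(audio, dim=0, keepdim=True)
	# Resample to 16 kHz
	resampled_mono = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=16000)(mono)
	# Save the preprocessed audio (overwrites the file if it is also the input)
	torchaudio.save(uri="./test_audio.mp3", src=resampled_mono, sample_rate=16000)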
	```

	Load the model:
	```python
	from transformers import AutoModelForCTC, AutoProcessor

	model_path = "adkta/nep_eng_code-mixed_translit_lm"  # Hugging Face repo ID

	model = AutoModelForCTC.from_pretrained(model_path)
	processor = AutoProcessor.from_pretrained(model_path)  # processor files from the same repo
	```

	Create the CTC decoder:
	```python
	from torchaudio.models.decoder import ctc_decoder

	# Token list ordered by token ID, so that it lines up with the model's output columns
	vocab_dict = processor.tokenizer.get_vocab()
	sorted_vocab_list = [key for key, val in sorted(vocab_dict.items(), key=lambda item: item[1])]

	decoder = ctc_decoder(
	    lexicon='./lexicon.lst',
	    beam_size=128,
	    beam_size_token=100,
	    beam_threshold=25.0,
	    tokens=sorted_vocab_list,
	    lm='./lm.binary',
	    word_score=1.0,
	    nbest=1,
	    lm_weight=2,
	    blank_token='<s>',
	    sil_token='|',
	)
	```
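The decoder's `tokens` list has to follow token-ID order, because position i in the list is matched to column i of the model output. The sketch below shows the same sorting on a toy vocabulary and adds a contiguity check. The helper name `tokens_in_id_order` and the toy vocabulary are hypothetical.

```python
def tokens_in_id_order(vocab):
    """Return tokens sorted by ID, checking that IDs are exactly 0..len(vocab)-1."""
    ordered = sorted(vocab.items(), key=lambda item: item[1])
    ids = [token_id for _, token_id in ordered]
    if ids != list(range(len(vocab))):
        raise ValueError("token IDs are not contiguous from 0; tokens would not align with logits")
    return [token for token, _ in ordered]


# Hypothetical toy vocabulary (the real one comes from processor.tokenizer.get_vocab())
toy_vocab = {"क": 2, "<s>": 0, "|": 1, "ख": 3}
print(tokens_in_id_order(toy_vocab))
```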

	Generate hypothesis in Devanagari:
	```python
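	import torch

	with torch.no_grad():
	    logits = model(resampled_mono).logits  # shape: (batch, frames, vocab)

	ctc_hypo = decoder(logits)  # beam search with the lexicon and LM
	print(ctc_hypo[0][0])  # best hypothesis for the first (only) utterance
	deva_text = " ".join(ctc_hypo[0][0].words)
	print(deva_text)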
	```
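For a quick sanity check without the lexicon or LM, greedy CTC decoding can be applied to the per-frame argmax IDs: collapse consecutive repeats, drop blanks, and treat the word-boundary token as a space. The sketch below is a pure-Python illustration of that rule, not the beam search used above. The blank token `<s>` and separator `|` follow the decoder configuration in this card, and the toy vocabulary is hypothetical.

```python
def greedy_ctc_decode(frame_ids, tokens, blank_token="<s>", word_sep="|"):
    """Collapse repeated IDs, drop blanks, and turn the word separator into spaces."""
    chars = []
    previous = None
    for token_id in frame_ids:
        if token_id != previous:          # collapse consecutive repeats
            token = tokens[token_id]
            if token != blank_token:      # drop CTC blanks
                chars.append(" " if token == word_sep else token)
        previous = token_id
    return " ".join("".join(chars).split())  # normalize whitespace


# Toy example with a hypothetical four-token vocabulary
toy_tokens = ["<s>", "|", "a", "b"]
toy_frame_ids = [2, 2, 0, 2, 3, 3, 1, 0, 2]  # per-frame argmax IDs
print(greedy_ctc_decode(toy_frame_ids, toy_tokens))
```

With the real model, the frame IDs would presumably come from `logits.argmax(dim=-1)[0].tolist()` and the token list from `sorted_vocab_list`. The greedy output is expected to be less accurate than LM-based decoding.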

	Convert to native format (Nepali in Devanagari script, English in Roman script):
	```python
	#conversion to native
	from transliteration.examples.disambiguation_examples import disambiguate
	from transliteration.utils import get_reverse_dict
	from transliteration.transliterator import TranslitDict

	import kenlm

	#LM
	LM_PATH = "./disambiguation_lm.binary"
	lang_model = kenlm.LanguageModel(LM_PATH)

	#Reverse-Reduction Dictionary
	reduc_dict_path = "./Nep_Eng_Code-Mixed_Reduct_Dict_Gemini.json"
	reverse_dict = get_reverse_dict(dictionary = TranslitDict.load(reduc_dict_path))

	native_text = disambiguate(
	    sentence=deva_text,
	    model=lang_model,
	    reverse_dict=reverse_dict,
	    sym_spell=None,
	    edit_dist=0,
	    lang_scoring=False,
	    sep_case_plural=True,
	)
	print(native_text)
	```
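Conceptually, the conversion above chooses, for each Devanagarized token, between keeping it in Devanagari and replacing it with a Roman-script candidate from the reverse dictionary, using a language model to rank the options. The toy sketch below illustrates that idea only. It is not the implementation in the `transliteration` repo, and the dictionary entry, the scores, and the function names are all made up.

```python
from itertools import product

# Hypothetical reverse dictionary: Devanagarized token -> possible Roman-script words
toy_reverse_dict = {"स्कूल": ["school"]}

# Hypothetical log-scores standing in for a language model
toy_scores = {"school": -1.0, "स्कूल": -3.0}
UNKNOWN_SCORE = -5.0


def score(tokens):
    """Sum of per-token scores; a real LM would score the sentence in context."""
    return sum(toy_scores.get(token, UNKNOWN_SCORE) for token in tokens)


def to_native(deva_text):
    """Pick the highest-scoring combination of per-token candidates."""
    options = [[token] + toy_reverse_dict.get(token, []) for token in deva_text.split()]
    return " ".join(max(product(*options), key=score))


print(to_native("म स्कूल जान्छु"))
```

Enumerating every combination grows exponentially with sentence length, so this brute-force search is only suitable as an illustration.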

	## Citation

	<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
	BibTeX:

	```bibtex
	@mastersthesis{nep_en_cm_asr_devkota_2026,
	author = "Devkota, Ashish",
	title = "Beyond Monolingual: Leveraging Multilingual Pre-trained Models for End-to-End Nepali-English Code-mixed Speech Recognition",
	school = "Tribhuvan University, Institute of Engineering, Pulchowk Campus",
	year = "2026",
	type = "M.Sc. Engg. Thesis",
	address = "Lalitpur, Nepal",
	month = "January"
	}
	```

	Contact: Ashish Devkota <devkota.ashish@outlook.com> for the report or any other details about this model.