---
library_name: transformers
language:
- ne
- en
datasets:
- adkta/nep_eng_code-mixed_asr_dataset
metrics:
- wer
results:
- task:
    type: ASR
  metrics:
  - name: WER
    value: 21.83
---

# Model Card for Model ID

## Model Details

### Model Description

This is an ASR (speech-to-text) model for Nepali-English code-mixed speech. The model is wav2vec 2.0 with a CTC head. Transcripts are generated in native script, i.e. Nepali in Devanagari script and English in Roman script. The model was created as part of thesis work in partial fulfillment of the M.Sc. in Information and Communication Engineering offered at Pulchowk Campus, IOE, TU, Nepal.

This is the best-performing model from that work, with a WER of 21.83. The model is trained on a Devanagarized version of the transcripts and uses a language model (LM) during decoding. Conversion to native script is performed in post-processing, using a transliteration dictionary created as part of the thesis together with an LM. For more details, please reach out to the author by the email stated at the end of this card.

- **Developed by:** Ashish Devkota
- **Model type:** ASR
- **Language(s) (NLP):** Nepali-English Code-mixed
- **License:** [More Information Needed]
- **Finetuned from model [optional]:** wav2vec 2.0

## Bias, Risks, and Limitations

Only about 2 hours of the training data come from a manually created Nepali-English code-mixed ASR dataset. The rest of the training data was created automatically from YouTube transcripts and Gemini, and is not as accurate as the manually created portion. The model needs more manually created data for training.

## How to Get Started with the Model

Use the code below to get started with the model.

Install dependencies: torch, torchcodec, torchaudio, flashlight-text (for the CTC decoder), and Hugging Face transformers (to use this model):

```
!pip install torch
!pip install torchcodec
!pip install torchaudio
!pip install flashlight-text
!pip install transformers
```

Install KenLM (for using the LM):

```
!git clone https://github.com/kpu/kenlm.git
!sudo apt-get install -y libboost-all-dev --fix-missing # For colab
%cd kenlm
!mkdir -p build
%cd build
!cmake ..
!make -j 1
%cd ..
# `!export` does not persist across notebook cells, so set the variables from Python
import os
os.environ["KENLM_ROOT"] = os.getcwd()
os.environ["USE_CUDA"] = "0"  # for cpu
!pip install .
%cd ..
```

Install the transliteration repo for the disambiguation pipeline:

```
!pip install nepali-num2word
!git clone https://github.com/adkta/nepali_arabic_num_to_word.git
!python -m pip install -U symspellpy
!rm -rf /content/transliteration
!git clone https://github.com/adkta/transliteration.git
```

Download the LM for disambiguation:

```
!wget -L https://raw.githubusercontent.com/adkta/Devkota_2026_nep_eng_asr/main/disambiguation_lm.binary
```

Download the reduction dictionary for disambiguation:

```
!wget -L https://raw.githubusercontent.com/adkta/transliteration/main/dictionaries/Nep_Eng_Code-Mixed_Reduct_Dict_Gemini.json
```

Download the lexicon and language model for the decoder (or train your own KenLM):

```
!wget -L https://raw.githubusercontent.com/adkta/Devkota_2026_nep_eng_asr/main/Indic_Combined_Translit_LM/lexicon.lst
!wget -L https://raw.githubusercontent.com/adkta/Devkota_2026_nep_eng_asr/main/Indic_Combined_Translit_LM/lm.binary
```

Download an audio file of your choice (say, `test_audio.mp3`).
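
Before moving on, it may help to confirm that the files from the download steps above are in the working directory, since a failed download tends to surface later as a confusing decoder or LM loading error. The sketch below is only a convenience check: the file names are taken from the commands above, while `REQUIRED_FILES` and `missing_files` are hypothetical names introduced here and are not part of the model repository.

```python
from pathlib import Path

# File names come from the download steps above; the helper itself is
# illustrative and not part of the model repository.
REQUIRED_FILES = [
    "lexicon.lst",
    "lm.binary",
    "disambiguation_lm.binary",
    "Nep_Eng_Code-Mixed_Reduct_Dict_Gemini.json",
    "test_audio.mp3",
]


def missing_files(directory=".", names=REQUIRED_FILES):
    """Return the names that are absent or empty in `directory`."""
    base = Path(directory)
    return [
        name
        for name in names
        if not (base / name).is_file() or (base / name).stat().st_size == 0
    ]


missing = missing_files()
if missing:
    print("Missing or empty files:", ", ".join(missing))
else:
    print("All required files are present.")
```

An empty file is treated as missing because an interrupted download can leave a zero-byte file behind.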

Preprocess the input audio:

```python
import torchaudio
from torch import mean

filename = './test_audio.mp3'  # path to the downloaded audio

audio, sample_rate = torchaudio.load(filename)
# Downmix to mono if the audio has more than one channel
mono = audio if audio.shape[0] == 1 else mean(audio, dim=0, keepdim=True)
# Resample to 16 kHz
resampled_mono = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=16000)(mono)
# Write the preprocessed audio back to disk (this overwrites the input file)
torchaudio.save(uri='./test_audio.mp3', src=resampled_mono, sample_rate=16000)
```

Load the model:

```python
from transformers import AutoModelForCTC, AutoProcessor

model_path = "adkta/nep_eng_code-mixed_translit_lm"  # REPO
model = AutoModelForCTC.from_pretrained(model_path)
processor = AutoProcessor.from_pretrained(model_path)  # Assuming you saved processor files too
```

Create the CTC decoder:

```python
from torchaudio.models.decoder import ctc_decoder

# Order the tokens by their vocabulary index so they line up with the model's output dimension
vocab_dict = processor.tokenizer.get_vocab()
sorted_vocab_list = [key for key, val in sorted(vocab_dict.items(), key=lambda item: item[1])]

decoder = ctc_decoder(
    lexicon='./lexicon.lst',
    beam_size=128,
    beam_size_token=100,
    beam_threshold=25.0,
    tokens=sorted_vocab_list,
    lm='./lm.binary',
    word_score=1.0,
    nbest=1,
    lm_weight=2,
    blank_token=processor.tokenizer.pad_token,  # the tokenizer's pad token serves as the CTC blank
    sil_token='|'
)
```

Generate the hypothesis in Devanagari:

```python
import torch

with torch.no_grad():
    logits = model(resampled_mono).logits

ctc_hypo = decoder(logits)
print(ctc_hypo[0][0])

# Join the decoded words of the best hypothesis into one sentence
deva_text = " ".join(ctc_hypo[0][0].words)
print(deva_text)
```

Convert to native format (Nepali in Devanagari script, English in Roman script):

```python
# Conversion to native
from transliteration.examples.disambiguation_examples import disambiguate
from transliteration.utils import get_reverse_dict
from transliteration.transliterator import TranslitDict
import kenlm

# LM
LM_PATH = "./disambiguation_lm.binary"
lang_model = kenlm.LanguageModel(LM_PATH)

# Reverse-reduction dictionary
reduc_dict_path = "./Nep_Eng_Code-Mixed_Reduct_Dict_Gemini.json"
reverse_dict = get_reverse_dict(dictionary=TranslitDict.load(reduc_dict_path))

native_text = disambiguate(sentence=deva_text,
                           model=lang_model,
                           reverse_dict=reverse_dict,
                           sym_spell=None,
                           edit_dist=0,
                           lang_scoring=False,
                           sep_case_plural=True)
print(native_text)
```

## Citation

**BibTeX:**

```bibtex
@mastersthesis{nep_en_cm_asr_devkota_2026,
  author  = "Devkota, Ashish",
  title   = "Beyond Monolingual: Leveraging Multilingual Pre-trained Models for End-to-End Nepali-English Code-mixed Speech Recognition",
  school  = "Tribhuvan University, Institute of Engineering, Pulchowk Campus",
  year    = "2026",
  type    = "M.Sc. Engg. Thesis",
  address = "Lalitpur, Nepal",
  month   = "January"
}
```

Contact: Ashish Devkota, for the thesis report or any other details about this model.