Model Card for adkta/nep_eng_code-mixed_translit_lm

Model Details

Model Description

This is an automatic speech recognition (ASR, speech-to-text) model for Nepali-English code-mixed speech. It is a wav2vec 2.0 model with a CTC head, and it uses a language model (LM) during decoding. The model is trained on a Devanagarized version of the transcripts, so the decoder outputs Devanagari text. Final transcripts are produced in native script, i.e. Nepali in Devanagari script and English in Roman script; this conversion is performed in post-processing using an LM and a transliteration dictionary created as part of the thesis. This is the best-performing model, with a WER of 21.83. The model was created as part of thesis work in partial fulfillment of the M.Sc. in Information and Communication Engineering offered at Pulchowk Campus, IOE, TU, Nepal. For more details, please contact the author at the email address given at the end of this card.

  • Developed by: Ashish Devkota
  • Model type: ASR
  • Language(s) (NLP): Nepali-English Code-mixed
  • License: [More Information Needed]
  • Finetuned from model: wav2vec 2.0
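
The WER quoted above is conventionally the word-level edit distance between the reference and the hypothesis, divided by the number of reference words. The sketch below is a minimal illustration of that definition, assuming the reported figure is a percentage and that words are split on whitespace. The function name `word_error_rate` is hypothetical, and this card does not say which tool or text normalization was used for the thesis evaluation.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER as a percentage (illustrative helper, not the thesis tooling)."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return 100.0 * prev[-1] / len(ref)


# One substitution and one deletion against a four-word reference.
print(word_error_rate("a b c d", "a x c"))
```

On code-mixed text the score depends on whether it is computed on the Devanagarized or the native-script transcripts, since a word emitted in the wrong script counts as a substitution under this definition.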

Bias, Risks, and Limitations

Only ~2 hours of the training data come from a manually created Nepali-English code-mixed ASR dataset. The rest of the training data was created automatically from YouTube transcripts and Gemini, and is less accurate than the manually created data. More manually transcribed data is needed for training.

How to Get Started with the Model

Use the code below to get started with the model. Install the dependencies: torch, torchcodec, torchaudio, flashlight-text (for the CTC decoder), and Hugging Face transformers (to load this model):

!pip install torch
!pip install torchcodec
!pip install torchaudio
!pip install flashlight-text
!pip install transformers

Install KenLM (for using LM):

!git clone https://github.com/kpu/kenlm.git
!sudo apt-get install libboost-all-dev --fix-missing # For colab

%cd kenlm
!mkdir -p build
%cd build
!cmake ..
!make -j 1
%cd ..
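import os
# "!export" only affects its own subshell, so set the variables from Python instead
os.environ["KENLM_ROOT"] = os.getcwd()
os.environ["USE_CUDA"] = "0"  # for cpu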
%cd ..

%cd kenlm
!pip install .
%cd ..

Transliteration repo for disambiguation pipeline:

!pip install nepali-num2word
!git clone https://github.com/adkta/nepali_arabic_num_to_word.git
!python -m pip install -U symspellpy
!rm -rf /content/transliteration  # remove any previous clone (Colab path); -f avoids an error if it does not exist
!git clone https://github.com/adkta/transliteration.git

Download LM for disambiguation:

!wget -L https://raw.githubusercontent.com/adkta/Devkota_2026_nep_eng_asr/main/disambiguation_lm.binary

Download Reduction Dictionary for disambiguation:

!wget -L https://raw.githubusercontent.com/adkta/transliteration/main/dictionaries/Nep_Eng_Code-Mixed_Reduct_Dict_Gemini.json

Download lexicon and language model for decoder (or train your own KenLM):

!wget -L https://raw.githubusercontent.com/adkta/Devkota_2026_nep_eng_asr/main/Indic_Combined_Translit_LM/lexicon.lst
!wget -L https://raw.githubusercontent.com/adkta/Devkota_2026_nep_eng_asr/main/Indic_Combined_Translit_LM/lm.binary
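
If you train your own KenLM model instead of downloading one, the decoder also needs a lexicon that spells every LM word as a sequence of model tokens. A minimal sketch follows, assuming a character-level vocabulary (one token per Unicode code point) and a trailing `|` word-boundary token, as in the torchaudio CTC decoder tutorial. The helper name `build_lexicon_lines` and the sample words are hypothetical, and the downloaded lexicon.lst may follow a different convention, so compare against it before relying on this.

```python
def build_lexicon_lines(words, sil_token="|"):
    """Return lexicon lines of the form 'word t o k e n s |' (assumed format)."""
    lines = []
    for word in sorted(set(words)):
        # Assumption: every Unicode code point of the word is one model token.
        spelling = " ".join(list(word) + [sil_token])
        lines.append(f"{word} {spelling}")
    return lines


# Hypothetical sample words; in practice, use the vocabulary of your LM corpus.
sample_words = ["नमस्ते", "स्कूल", "नमस्ते"]
for line in build_lexicon_lines(sample_words):
    print(line)
```

Every token used in a spelling has to exist in the model vocabulary, so it is worth checking the generated lines against `processor.tokenizer.get_vocab()` once the model is loaded.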

Download an audio file of your choice (say, 'test_audio.mp3'), then preprocess it:

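import torchaudio
from torch import mean

filename = "./test_audio.mp3"  # path to your input audio

# Load the audio; the tensor has shape (channels, samples)
audio, sample_rate = torchaudio.load(filename)
# Downmix to mono by averaging the channels
mono = audio if audio.shape[0] == 1 else mean(audio, dim=0, keepdim=True)
# Resample to 16 kHz, the sampling rate wav2vec 2.0 expects
resampled_mono = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=16000)(mono)
# Save the preprocessed audio (this overwrites './test_audio.mp3')
torchaudio.save(uri="./test_audio.mp3", src=resampled_mono, sample_rate=16000)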

Load the model:

from transformers import AutoModelForCTC, AutoProcessor

model_path = "adkta/nep_eng_code-mixed_translit_lm" #REPO

model = AutoModelForCTC.from_pretrained(model_path)
processor = AutoProcessor.from_pretrained(model_path) # Assuming you saved processor files too

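Create the CTC decoder:

from torchaudio.models.decoder import ctc_decoder

# Token list ordered by token ID, so that list index i corresponds to model output i
vocab_dict = processor.tokenizer.get_vocab()
sorted_vocab_list = [key for key, val in sorted(vocab_dict.items(), key=lambda item: item[1])]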

decoder = ctc_decoder(
    lexicon='./lexicon.lst',
    beam_size = 128,
    beam_size_token= 100,
    beam_threshold= 25.0,
    tokens=sorted_vocab_list,
    lm='./lm.binary',
    word_score=1.0,
    nbest=1,
    lm_weight = 2,
    blank_token = '<s>',
    sil_token = '|'
)

Generate hypothesis in Devanagari:

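import torch

with torch.no_grad():
    # resampled_mono has shape (1, samples), i.e. a batch with one waveform
    logits = model(resampled_mono).logits
    # ctc_hypo[0][0] is the best hypothesis for the first (and only) batch item
    ctc_hypo = decoder(logits)
    print(ctc_hypo[0][0])
    deva_text = " ".join(ctc_hypo[0][0].words)
    print(deva_text)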
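
For intuition about what the decoder does with the per-frame outputs, the basic CTC rule is: merge consecutive repeated tokens, drop blanks, then treat the separator token as a word boundary. The sketch below applies that rule greedily to a hand-written token sequence. It is an illustration only and not the lexicon-constrained beam search with LM scoring used above; the function name `ctc_greedy_collapse` and the sample frames are hypothetical, while the blank token `<s>` and separator `|` follow the decoder configuration in this card.

```python
def ctc_greedy_collapse(frame_tokens, blank="<s>", word_sep="|"):
    """Collapse a per-frame token sequence into text using the CTC rule."""
    collapsed = []
    prev = None
    for tok in frame_tokens:
        # Keep a token only if it differs from the previous frame
        # and is not the blank symbol.
        if tok != prev and tok != blank:
            collapsed.append(tok)
        prev = tok
    # The separator token marks word boundaries.
    return "".join(collapsed).replace(word_sep, " ").strip()


# Hypothetical frame-level argmax output; a blank between the two "o"
# frames is what allows a genuine double letter to survive.
frames = ["<s>", "h", "h", "<s>", "i", "|", "|", "<s>", "y", "o", "o", "<s>", "o"]
print(ctc_greedy_collapse(frames))
```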

Convert to native format (Nepali in Devanagari script, English in Roman script):

#conversion to native
from transliteration.examples.disambiguation_examples import disambiguate
from transliteration.utils import get_reverse_dict
from transliteration.transliterator import TranslitDict

import kenlm

#LM
LM_PATH = "./disambiguation_lm.binary"
lang_model = kenlm.LanguageModel(LM_PATH)

#Reverse-Reduction Dictionary
reduc_dict_path = "./Nep_Eng_Code-Mixed_Reduct_Dict_Gemini.json"
reverse_dict = get_reverse_dict(dictionary = TranslitDict.load(reduc_dict_path))

native_text = disambiguate(sentence=deva_text, model = lang_model, reverse_dict = reverse_dict, sym_spell = None, edit_dist = 0, lang_scoring = False, sep_case_plural = True)
print(native_text)
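
Conceptually, the conversion step has to decide, for each Devanagari token, whether it stays as it is or maps back to a Roman-script English word, using a dictionary of candidates and a language model score. The toy sketch below illustrates that idea by scoring every candidate sentence and keeping the best one. It is not the implementation in the transliteration repo: `disambiguate_toy`, the tiny dictionary, and the scoring function are all hypothetical stand-ins.

```python
from itertools import product


def disambiguate_toy(tokens, reverse_dict, score):
    """Return the candidate sentence with the highest score.

    tokens: Devanagari tokens from the decoder.
    reverse_dict: token -> list of native-script candidates (toy data here).
    score: callable mapping a sentence to a number, higher is better.
           With KenLM, `lang_model.score` could play this role.
    """
    # Tokens missing from the dictionary are kept unchanged.
    options = [reverse_dict.get(tok, [tok]) for tok in tokens]
    best = max(product(*options), key=lambda cand: score(" ".join(cand)))
    return " ".join(best)


# Toy data: hypothetical entries, not taken from the thesis dictionary.
school_dev, bus_dev = "स्कूल", "बस"
toy_reverse_dict = {
    school_dev: [school_dev, "school"],
    bus_dev: [bus_dev, "bus"],
}
preferred = {"school", "bus"}


def toy_score(sentence):
    # Stand-in for an LM: count the words found in a small preferred set.
    return sum(word in preferred for word in sentence.split())


tokens = ["म", school_dev, bus_dev, "मा", "गएँ"]
result = disambiguate_toy(tokens, toy_reverse_dict, toy_score)
print(result)
```

Enumerating every combination grows exponentially with sentence length, so this form is only suitable for short illustrations.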

Citation

BibTeX:

@mastersthesis{nep_en_cm_asr_devkota_2026,
  author  = "Devkota, Ashish",
  title   = "Beyond Monolingual: Leveraging Multilingual Pre-trained Models for End-to-End Nepali-English Code-mixed Speech Recognition",
  school  = "Tribhuvan University, Institute of Engineering, Pulchowk Campus",
  year    = "2026",
  type    = "M.Sc. Engg. Thesis",
  address = "Lalitpur, Nepal",
  month   = "January"
}

Contact: Ashish Devkota <devkota.ashish@outlook.com> for the report or any other details about this model.