---
library_name: transformers
language:
- ne
- en
datasets:
- adkta/nep_eng_code-mixed_asr_dataset
metrics:
- wer
results:
  - task:
      type: ASR
    metrics:
      - name: WER
        value: 21.83
---

# Model Card for adkta/nep_eng_code-mixed_translit_lm



## Model Details

### Model Description

This is an automatic speech recognition (ASR, speech-to-text) model for Nepali-English code-mixed speech.
The model is wav2vec 2.0 with a CTC head.
Transcripts are produced in native script, i.e. Nepali in Devanagari script and English in Roman script.
The model was created as part of a thesis submitted in partial fulfillment of the M.Sc. in Information and Communication Engineering at Pulchowk Campus, IOE, TU, Nepal.
It is the best-performing model from that work, with a WER of 21.83.
The model is trained on a Devanagarized version of the transcripts.
A language model (LM) is used during decoding.
Conversion to native script is performed in post-processing, using a transliteration dictionary created as part of the thesis together with an LM.
For more details, please contact the author at the email address given at the end of this card.

- **Developed by:** Ashish Devkota
- **Model type:** ASR
- **Language(s) (NLP):** Nepali-English Code-mixed
- **License:** [More Information Needed]
- **Finetuned from model [optional]:** wav2vec 2.0
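
For readers unfamiliar with the metric, WER is the word-level edit distance between the reference and the hypothesis, divided by the number of reference words. The sketch below is a minimal pure-Python illustration of that definition. The function name `word_error_rate` and the sample sentences are made up for illustration, and this is not the evaluation script used in the thesis. The value reported on this card (21.83) is presumably this ratio expressed as a percentage.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Made-up example: one substituted word out of four reference words.
print(word_error_rate("मेरो phone charge छैन", "मेरो phone charge छ"))  # 0.25
```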


## Bias, Risks, and Limitations

Only about 2 hours of the training data comes from a manually created Nepali-English code-mixed ASR dataset. The rest of the training data was created
automatically from YouTube transcripts and Gemini and is less accurate than the manually created portion. More manually transcribed data is needed for training.


## How to Get Started with the Model

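Use the code below to get started with the model. The shell commands use notebook syntax (`!` and `%` prefixes) and were written with Google Colab in mind.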
Install dependencies: torch, torchcodec, torchaudio, flashlight-text (for CTC Decoder), huggingface transformers (to use this model):
```
!pip install torch
!pip install torchcodec
!pip install torchaudio
!pip install flashlight-text
!pip install transformers
```


Install KenLM (for using LM): 
```
!git clone https://github.com/kpu/kenlm.git
!sudo apt-get install libboost-all-dev --fix-missing # For colab

%cd kenlm
!mkdir -p build
%cd build
!cmake ..
!make -j 1
%cd ..
!export KENLM_ROOT=$PWD
!export USE_CUDA=0 ## for cpu
%cd ..

%cd kenlm
!pip install .
%cd ..
```

Transliteration repo for disambiguation pipeline:
```
!pip install nepali-num2word
!git clone https://github.com/adkta/nepali_arabic_num_to_word.git
!python -m pip install -U symspellpy
!rm -rf /content/transliteration # remove any earlier clone (Colab path); -f avoids an error if it does not exist
!git clone https://github.com/adkta/transliteration.git
```

Download LM for disambiguation:
```
!wget -L https://raw.githubusercontent.com/adkta/Devkota_2026_nep_eng_asr/main/disambiguation_lm.binary
```

Download Reduction Dictionary for disambiguation:
```
!wget -L https://raw.githubusercontent.com/adkta/transliteration/main/dictionaries/Nep_Eng_Code-Mixed_Reduct_Dict_Gemini.json
```

Download lexicon and language model for decoder (or train your own KenLM):
```
!wget -L https://raw.githubusercontent.com/adkta/Devkota_2026_nep_eng_asr/main/Indic_Combined_Translit_LM/lexicon.lst
!wget -L https://raw.githubusercontent.com/adkta/Devkota_2026_nep_eng_asr/main/Indic_Combined_Translit_LM/lm.binary
```
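
If you build your own lexicon instead of downloading one, the lexicon-based decoder expects one entry per line: the word, followed by its space-separated token spelling. The sketch below generates such lines from a word list. It assumes that each token is a single Unicode code point and that every spelling ends with the `|` word-boundary token (the same symbol passed as `sil_token` to the decoder later in this card). Both assumptions should be checked against the model's actual vocabulary. The function name `lexicon_lines` and the sample words are hypothetical.

```python
def lexicon_lines(words, word_boundary="|"):
    """Return one 'word t1 t2 ... |' line per unique word, in sorted order."""
    lines = []
    for word in sorted(set(words)):
        # Spell the word code point by code point, then close it
        # with the word-boundary token.
        spelling = " ".join(list(word) + [word_boundary])
        lines.append(f"{word} {spelling}")
    return lines


# Hypothetical vocabulary of Devanagari-script words.
sample_words = ["घर", "फोन"]
for line in lexicon_lines(sample_words):
    print(line)
```

The resulting lines can be written to a UTF-8 text file and passed as the `lexicon` argument in place of `lexicon.lst`.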

Download an audio file of your choice (say `test_audio.mp3`), then preprocess it to 16 kHz mono:

```python
import torchaudio
from torch import mean

filename = './test_audio.mp3'  # path to the downloaded audio file

audio, sample_rate = torchaudio.load(filename)
# Downmix to mono by averaging the channels if the file has more than one.
mono = audio if audio.shape[0] == 1 else mean(audio, dim=0, keepdim=True)
# Resample to 16 kHz.
resampled_mono = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=16000)(mono)
torchaudio.save(uri='./test_audio.mp3', src=resampled_mono, sample_rate=16000)
```

Load the model and processor:
```python
from transformers import AutoModelForCTC, AutoProcessor

model_path = "adkta/nep_eng_code-mixed_translit_lm" #REPO

model = AutoModelForCTC.from_pretrained(model_path)
processor = AutoProcessor.from_pretrained(model_path) # processor files are loaded from the same repo
```

Create the CTC decoder:
```python
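from torchaudio.models.decoder import ctc_decoder

# Build the token list ordered by token id, so that list position i is token id i.
vocab_dict = processor.tokenizer.get_vocab()
sorted_vocab_list = [key for key, val in sorted(vocab_dict.items(), key=lambda item: item[1])]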

decoder = ctc_decoder(
    lexicon='./lexicon.lst',
    beam_size = 128,
    beam_size_token= 100,
    beam_threshold= 25.0,
    tokens=sorted_vocab_list,
    lm='./lm.binary',
    word_score=1.0,
    nbest=1,
    lm_weight = 2,
    blank_token = '<s>',
    sil_token = '|'
)
```

Generate hypothesis in Devanagari:
```python
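import torch

with torch.no_grad():
    # resampled_mono has shape (1, num_samples), i.e. a batch of one utterance.
    logits = model(resampled_mono).logits
    ctc_hypo = decoder(logits)
    # Best hypothesis for the first (and only) utterance in the batch.
    print(ctc_hypo[0][0])
    deva_text = " ".join(ctc_hypo[0][0].words)
    print(deva_text)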
```
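
As a quick sanity check without the language model, CTC output can also be decoded greedily: take the most likely token per frame, collapse consecutive repeats, then drop blanks. The sketch below shows that collapse step on plain Python lists. The toy vocabulary and frame sequence are invented for illustration, and the blank and word-boundary symbols follow the decoder configuration above (`<s>` and `|`). It is not part of the pipeline used for the reported WER, which relies on the LM-based decoder.

```python
def ctc_greedy_collapse(frame_ids, id_to_token, blank="<s>", word_boundary="|"):
    """Collapse repeated frame labels, drop blanks, and split into words."""
    tokens = []
    previous = None
    for idx in frame_ids:
        if idx != previous:  # keep only the first frame of each run
            tokens.append(id_to_token[idx])
        previous = idx
    text = "".join(t for t in tokens if t != blank)
    return text.replace(word_boundary, " ").split()


# Toy vocabulary (hypothetical; the real one comes from processor.tokenizer.get_vocab()).
toy_vocab = {0: "<s>", 1: "|", 2: "घ", 3: "र", 4: "म", 5: "ा"}
# Per-frame argmax ids, e.g. logits.argmax(dim=-1)[0].tolist() on real model output.
frames = [0, 2, 2, 0, 3, 1, 1, 4, 4, 5, 0]
print(ctc_greedy_collapse(frames, toy_vocab))  # ['घर', 'मा']
```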

Convert to native format (Nepali in Devanagari script, English in Roman script):
```python
#conversion to native
from transliteration.examples.disambiguation_examples import disambiguate
from transliteration.utils import get_reverse_dict
from transliteration.transliterator import TranslitDict

import kenlm

#LM
LM_PATH = "./disambiguation_lm.binary"
lang_model = kenlm.LanguageModel(LM_PATH)

#Reverse-Reduction Dictionary
reduc_dict_path = "./Nep_Eng_Code-Mixed_Reduct_Dict_Gemini.json"
reverse_dict = get_reverse_dict(dictionary = TranslitDict.load(reduc_dict_path))

native_text = disambiguate(sentence=deva_text, model = lang_model, reverse_dict = reverse_dict, sym_spell = None, edit_dist = 0, lang_scoring = False, sep_case_plural = True)
print(native_text)
```
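
To illustrate the idea behind this step (not the `transliteration` package's actual implementation), the sketch below maps each Devanagari token to its candidate native forms through a reverse dictionary and keeps the candidate sentence that a scoring function rates highest. The dictionary entries, the `toy_score` function, and all names here are hypothetical stand-ins. In the real pipeline the candidates come from the reduction dictionary and the score from the KenLM model.

```python
from itertools import product


def pick_native(sentence, reverse_dict, score):
    """Return the highest-scoring combination of per-token candidates."""
    # Tokens missing from the dictionary are kept unchanged.
    options = [reverse_dict.get(tok, [tok]) for tok in sentence.split()]
    candidates = (" ".join(combo) for combo in product(*options))
    return max(candidates, key=score)


# Hypothetical reverse dictionary: Devanagari form -> possible native forms.
toy_reverse_dict = {"फोन": ["फोन", "phone"], "चार्ज": ["चार्ज", "charge"]}

# Hypothetical scorer standing in for a language model: it simply
# counts words that appear in a small list of known English words.
ENGLISH_WORDS = {"phone", "charge"}


def toy_score(text):
    return sum(word in ENGLISH_WORDS for word in text.split())


print(pick_native("मेरो फोन चार्ज छैन", toy_reverse_dict, toy_score))
# मेरो phone charge छैन
```

This exhaustive search grows exponentially with the number of ambiguous tokens, so it is only suitable for short sentences. A KenLM model could be plugged in as the scorer through its `score` method, for example `lambda s: lang_model.score(s)`.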

## Citation

**BibTeX:**

```bibtex
@mastersthesis{nep_en_cm_asr_devkota_2026,
  author  = "Devkota, Ashish",
  title   = "Beyond Monolingual: Leveraging Multilingual Pre-trained Models for End-to-End Nepali-English Code-mixed Speech Recognition",
  school  = "Tribhuvan University, Institute of Engineering, Pulchowk Campus",
  year    = "2026",
  type    = "M.Sc. Engg. Thesis",
  address = "Lalitpur, Nepal",
  month   = "January"
}
```

Contact: Ashish Devkota <devkota.ashish@outlook.com> for the thesis report or any other details about this model.