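How to use espnet/marathi_lrec2020 with ESPnet:

```python
import soundfile
from espnet2.bin.asr_inference import Speech2Text

# Download the pretrained model and build the inference pipeline.
model = Speech2Text.from_pretrained("espnet/marathi_lrec2020")

# Read the waveform (the training config uses a 16 kHz frontend).
speech, rate = soundfile.read("speech.wav")

# Keep the text of the best hypothesis from the n-best list.
text, *_ = model(speech)[0]
```
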
tags:
- espnet
- audio
- automatic-speech-recognition
language: mr
datasets:
- marathi_lrec2020
license: cc-by-4.0

## ESPnet2 ASR model

### `espnet/marathi_lrec2020`

This model was trained by [Aniket Tathe](https://github.com/Aniket-Tathe) using the `marathi_lrec2020` recipe in [espnet](https://github.com/espnet/espnet/).

### Demo: How to use in ESPnet2

Follow the [ESPnet installation instructions](https://espnet.github.io/espnet/installation.html)
if you haven't done that already.

```bash
cd espnet
git checkout 6241a3e3ad9fef6a686ac82c6a7799d40d96cd27
pip install -e .
cd egs2/marathi_lrec2020/asr1
./run.sh --skip_data_prep false --skip_train true --download_model espnet/marathi_lrec2020
```

# RESULTS

## Environment

- **Date:** `Sat Oct 25 08:19:46 UTC 2025`
- **Python version:** `3.9.23`
- **ESPnet version:** `202509`
- **PyTorch version:** `2.3.0+cu121`
- **CUDA version:** `12.1`
- **ESPnet Git hash (upstream):** `53e09761cb164b28f299e178262bf2056d8059d7`
- **Commit date:** `Fri Oct 24 11:26:46 2025 +0900`

---

## Marathi ASR: `marathi_lrec2020`

Recipe for **Marathi** ASR on the
[**IndicCorpora Marathi subset**](https://www.cse.iitb.ac.in/~pjyothi/indiccorpora/#marathi).

Training uses **`conf/train_asr_transformer.yaml`** (character Conformer: 3 blocks, 256-dim encoder, `batch_bins: 16000000`, `accum_grad: 4`, Adam `lr: 0.0005`, warmup 20k, SpecAugment, hybrid CTC/attention `ctc_weight: 0.3`).

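
The learning-rate settings above (Adam `lr: 0.0005`, 20k warmup steps) can be made concrete with a small sketch. It assumes the `warmuplr` scheduler follows the usual inverse-square-root warmup rule, in which the rate rises linearly to the configured `lr` at `warmup_steps` and then decays as the inverse square root of the step. The function name `warmup_lr` is illustrative and not part of the recipe.

```python
def warmup_lr(step: int, base_lr: float = 0.0005, warmup_steps: int = 20000) -> float:
    """Learning rate at optimizer step `step` (1-indexed) under inverse-sqrt warmup.

    Assumed rule: base_lr * warmup_steps**0.5 * min(step**-0.5, step * warmup_steps**-1.5)
    """
    if step < 1:
        raise ValueError("step must be >= 1")
    return base_lr * warmup_steps**0.5 * min(step**-0.5, step * warmup_steps**-1.5)


# Linear ramp before step 20000, peak at step 20000, slow decay afterwards.
for step in (1, 5000, 20000, 80000):
    print(f"step {step:>6}: lr = {warmup_lr(step):.3g}")
```
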
Decoding without LM: **`conf/decode_asr.yaml`** (`lm_weight: 0.0`).

Decoding with LM (to reproduce the reported LM-fusion results): **`conf/decode_asr_lm.yaml`** (beam 20, `ctc_weight: 0.5`, `lm_weight: 0.3`).

---

### Test-set decoding (`marathi_test`)

Beam **20**, **CTC weight 0.5** unless noted.

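
To illustrate what the CTC and LM weights control, here is a minimal sketch of how joint decoding typically ranks hypotheses. It assumes the usual hybrid CTC/attention convention, in which the attention decoder score is weighted by `1 - ctc_weight`, the CTC score by `ctc_weight`, and the LM score by `lm_weight`. The function `fused_score`, the hypothesis names, and all log-probabilities are made up for illustration, and any length bonus is left out.

```python
from typing import Dict, Tuple


def fused_score(att_logp: float, ctc_logp: float, lm_logp: float,
                ctc_weight: float = 0.5, lm_weight: float = 0.0) -> float:
    """Weighted sum of per-hypothesis log-probabilities (length bonus omitted)."""
    return (1.0 - ctc_weight) * att_logp + ctc_weight * ctc_logp + lm_weight * lm_logp


# Made-up (attention, CTC, LM) log-probabilities for two competing hypotheses.
toy_hyps: Dict[str, Tuple[float, float, float]] = {
    "hyp_a": (-4.0, -5.0, -12.0),  # slightly better acoustically, unlikely under the LM
    "hyp_b": (-4.2, -5.2, -6.0),   # slightly worse acoustically, likely under the LM
}


def best(lm_weight: float) -> str:
    """Name of the top-ranked toy hypothesis at CTC weight 0.5."""
    return max(toy_hyps, key=lambda name: fused_score(*toy_hyps[name], lm_weight=lm_weight))


print(best(lm_weight=0.0))  # hyp_a: -4.5 beats -4.7
print(best(lm_weight=0.3))  # hyp_b: -6.5 beats -8.1
```

In this toy case the LM term changes which hypothesis ranks first. LM fusion lowers the error rate when such re-rankings favor the correct transcript.
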
#### Beam 20, CTC 0.5 (no LM)

| Metric (%) | Corr | Sub | Del | Ins | Err | S.Err |
|--------|-----:|----:|----:|----:|----:|------:|
| **CER** | 88.9 | 7.1 | 4.0 | 1.9 | 13.0 | 77.7 |
| **WER** | 73.8 | 23.8 | 2.4 | 3.2 | 29.4 | 78.5 |

#### Beam 20, CTC 0.5, LM weight 0.3

| Metric (%) | Corr | Sub | Del | Ins | Err | S.Err |
|--------|-----:|----:|----:|----:|----:|------:|
| **CER** | 89.0 | 6.6 | 4.4 | 1.7 | 12.6 | 74.3 |
| **WER** | 76.0 | 21.6 | 2.4 | 3.0 | 27.0 | 75.0 |

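
The columns in these tables are related by simple arithmetic. The sketch below assumes sclite-style scoring, where every percentage is taken relative to the number of reference tokens, `Corr + Sub + Del = 100`, and `Err = Sub + Del + Ins`. The helpers `align_counts` and `to_percentages` and the toy sentences are illustrative only and are not part of the recipe's scoring scripts.

```python
from typing import Dict, Sequence


def align_counts(ref: Sequence[str], hyp: Sequence[str]) -> Dict[str, int]:
    """Count correct/substituted/deleted/inserted tokens along one minimum-edit alignment."""
    n, m = len(ref), len(hyp)
    # cost[i][j] = edit distance between ref[:i] and hyp[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            cost[i][j] = min(diag, cost[i - 1][j] + 1, cost[i][j - 1] + 1)

    # Trace back from the end of both sequences to classify each edit.
    counts = {"corr": 0, "sub": 0, "del": 0, "ins": 0}
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            counts["corr" if ref[i - 1] == hyp[j - 1] else "sub"] += 1
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            counts["del"] += 1
            i -= 1
        else:
            counts["ins"] += 1
            j -= 1
    return counts


def to_percentages(counts: Dict[str, int]) -> Dict[str, float]:
    """Express counts as percentages of the reference length and add the total error."""
    n_ref = counts["corr"] + counts["sub"] + counts["del"]
    if n_ref == 0:
        raise ValueError("reference must contain at least one token")
    pct = {name: 100.0 * value / n_ref for name, value in counts.items()}
    pct["err"] = pct["sub"] + pct["del"] + pct["ins"]
    return pct


# Toy word-level example: one substitution (b -> x) and one insertion (e).
counts = align_counts("a b c d".split(), "a x c d e".split())
print(counts)                  # {'corr': 3, 'sub': 1, 'del': 0, 'ins': 1}
print(to_percentages(counts))  # err is 50.0 for this pair
```
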
---

### Dataset reference

> P. Jyothi et al., *“IndicCorpora: A Large Multilingual Corpus for Indic Languages.”*
> [IIT Bombay IndicCorpora: Marathi](https://www.cse.iitb.ac.in/~pjyothi/indiccorpora/#marathi)

## ASR config

<details><summary>expand</summary>

```
config: conf/tuning/train_asr_transformer.yaml
print_config: false
log_level: INFO
drop_last_iter: false
dry_run: false
iterator_type: sequence
valid_iterator_type: null
output_dir: exp/asr_marathi_conf_old_try_3conf_16Mbin_4grad
ngpu: 1
seed: 777
num_workers: 8
num_att_plot: 3
dist_backend: nccl
dist_init_method: env://
dist_world_size: null
dist_rank: null
local_rank: 0
dist_master_addr: null
dist_master_port: null
dist_launcher: null
multiprocessing_distributed: false
unused_parameters: false
sharded_ddp: false
use_deepspeed: false
deepspeed_config: null
gradient_as_bucket_view: true
ddp_comm_hook: null
cudnn_enabled: true
cudnn_benchmark: false
cudnn_deterministic: true
use_tf32: false
collect_stats: false
write_collected_feats: false
max_epoch: 60
patience: null
val_scheduler_criterion:
- valid
- loss
early_stopping_criterion:
- valid
- loss
- min
best_model_criterion:
-   - valid
    - loss
    - min
keep_nbest_models: 10
nbest_averaging_interval: 0
grad_clip: 5.0
grad_clip_type: 2.0
grad_noise: false
accum_grad: 4
no_forward_run: false
resume: false
train_dtype: float32
use_amp: false
log_interval: null
use_matplotlib: true
use_tensorboard: true
create_graph_in_tensorboard: false
use_wandb: false
wandb_project: null
wandb_id: null
wandb_entity: null
wandb_name: null
wandb_model_log_interval: -1
detect_anomaly: false
use_adapter: false
adapter: lora
save_strategy: all
adapter_conf: {}
pretrain_path: null
init_param: []
ignore_init_mismatch: false
freeze_param: []
num_iters_per_epoch: null
batch_size: 20
valid_batch_size: null
batch_bins: 16000000
valid_batch_bins: null
category_sample_size: 10
upsampling_factor: 0.5
category_upsampling_factor: 0.5
dataset_upsampling_factor: 0.5
dataset_scaling_factor: 1.2
max_batch_size: null
min_batch_size: 1
train_shape_file:
- exp/asr_stats_raw_char/train/speech_shape
- exp/asr_stats_raw_char/train/text_shape.char
valid_shape_file:
- exp/asr_stats_raw_char/valid/speech_shape
- exp/asr_stats_raw_char/valid/text_shape.char
batch_type: numel
valid_batch_type: null
fold_length:
- 80000
- 150
sort_in_batch: descending
shuffle_within_batch: false
sort_batch: descending
multiple_iterator: false
chunk_length: 500
chunk_shift_ratio: 0.5
num_cache_chunks: 1024
chunk_excluded_key_prefixes: []
chunk_default_fs: null
chunk_max_abs_length: null
chunk_discard_short_samples: true
train_data_path_and_name_and_type:
-   - dump/raw/marathi_train_sp/wav.scp
    - speech
    - sound
-   - dump/raw/marathi_train_sp/text
    - text
    - text
valid_data_path_and_name_and_type:
-   - dump/raw/marathi_dev/wav.scp
    - speech
    - sound
-   - dump/raw/marathi_dev/text
    - text
    - text
multi_task_dataset: false
allow_variable_data_keys: false
max_cache_size: 0.0
max_cache_fd: 32
allow_multi_rates: false
valid_max_cache_size: null
exclude_weight_decay: false
exclude_weight_decay_conf: {}
optim: adam
optim_conf:
    lr: 0.0005
scheduler: warmuplr
scheduler_conf:
    warmup_steps: 20000
token_list:
- <blank>
- <unk>
- <space>
- ा
- े
- र
- ्
- क
- स
- ल
- म
- ि
- ं
- ी
- य
- त
- न
- व
- ग
- ह
- ट
- च
- प
- ड
- आ
- .
- ज
- श
- ो
- द
- ब
- अ
- ू
- ु
- '?'
- ण
- इ
- ध
- ए
- फ
- ख
- ॉ
- ॅ
- ळ
- ँ
- भ
- थ
- ठ
- ई
- झ
- ष
- उ
- ऑ
- ऊ
- घ
- ढ
- ै
- ओ
- '2'
- (
- )
- '0'
- ृ
- ौ
- '-'
- '3'
- '1'
- ऐ
-
- '5'
- '8'
- छ
- '4'
- '"'
- ','
- '9'
- औ
- '!'
- ़
- ञ
- '7'
- '6'
- ९
- ऋ
- e
- g
- ३
- X
- १
- ०
- <sos/eos>
init: xavier_uniform
input_size: null
ctc_conf:
    dropout_rate: 0.0
    ctc_type: builtin
    reduce: true
    ignore_nan_grad: null
    zero_infinity: true
    brctc_risk_strategy: exp
    brctc_group_strategy: end
    brctc_risk_factor: 0.0
joint_net_conf: null
use_preprocessor: true
use_lang_prompt: false
use_nlp_prompt: false
token_type: char
bpemodel: null
non_linguistic_symbols: null
cleaner: null
g2p: null
speech_volume_normalize: null
rir_scp: null
rir_apply_prob: 1.0
noise_scp: null
noise_apply_prob: 1.0
noise_db_range: '13_15'
short_noise_thres: 0.5
aux_ctc_tasks: []
frontend: default
frontend_conf:
    fs: 16k
specaug: specaug
specaug_conf:
    apply_time_warp: true
    time_warp_window: 5
    time_warp_mode: bicubic
    apply_freq_mask: true
    freq_mask_width_range:
    - 0
    - 30
    num_freq_mask: 2
    apply_time_mask: true
    time_mask_width_range:
    - 0
    - 40
    num_time_mask: 2
normalize: global_mvn
normalize_conf:
    stats_file: exp/asr_stats_raw_char/train/feats_stats.npz
model: espnet
model_conf:
    ctc_weight: 0.3
    report_cer: true
    report_wer: true
preencoder: null
preencoder_conf: {}
encoder: conformer
encoder_conf:
    output_size: 256
    attention_heads: 4
    linear_units: 1024
    num_blocks: 3
    dropout_rate: 0.2
    positional_dropout_rate: 0.2
    attention_dropout_rate: 0.2
    input_layer: conv2d
    normalize_before: true
    macaron_style: false
    pos_enc_layer_type: rel_pos
    selfattention_layer_type: rel_selfattn
    activation_type: swish
    use_cnn_module: true
    cnn_module_kernel: 17
postencoder: null
postencoder_conf: {}
decoder: transformer
decoder_conf:
    attention_heads: 4
    linear_units: 1024
    num_blocks: 3
    dropout_rate: 0.2
    positional_dropout_rate: 0.2
    self_attention_dropout_rate: 0.2
    src_attention_dropout_rate: 0.2
preprocessor: default
preprocessor_conf: {}
required:
- output_dir
- token_list
version: '202509'
distributed: false
```

</details>

### Citing ESPnet

```BibTex
@inproceedings{watanabe2018espnet,
  author={Shinji Watanabe and Takaaki Hori and Shigeki Karita and Tomoki Hayashi and Jiro Nishitoba and Yuya Unno and Nelson Yalta and Jahn Heymann and Matthew Wiesner and Nanxin Chen and Adithya Renduchintala and Tsubasa Ochiai},
  title={{ESPnet}: End-to-End Speech Processing Toolkit},
  year={2018},
  booktitle={Proceedings of Interspeech},
  pages={2207--2211},
  doi={10.21437/Interspeech.2018-1456},
  url={http://dx.doi.org/10.21437/Interspeech.2018-1456}
}
```

or arXiv:

```bibtex
@misc{watanabe2018espnet,
  title={ESPnet: End-to-End Speech Processing Toolkit},
  author={Shinji Watanabe and Takaaki Hori and Shigeki Karita and Tomoki Hayashi and Jiro Nishitoba and Yuya Unno and Nelson Yalta and Jahn Heymann and Matthew Wiesner and Nanxin Chen and Adithya Renduchintala and Tsubasa Ochiai},
  year={2018},
  eprint={1804.00015},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```