Indic Conformer ASR — Hindi (600M)

600M-parameter Conformer encoder for Hindi automatic speech recognition, evaluated on all 7 subsets of the Vistaar benchmark. Achieves 12.09% average WER with a custom 5-gram KenLM across read speech, noisy speech, broadcast, conversational, and rural dialectal Hindi.

Runs locally on CPU, Apple Silicon MPS, and NVIDIA CUDA; a discrete GPU is not required. On an Apple M4 CPU the real-time factor (RTF) is 0.27 (about 3.7× faster than real time). On Apple MPS the RTF is ~0.03–0.05 (roughly 20–33× faster than real time).
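The relationship between RTF and the "faster than real time" figures can be sketched as below. This is a minimal illustration, not the repository's benchmarking code; the function names and the 60-second clip timing are hypothetical.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent transcribing / duration of the audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def speedup(rtf: float) -> float:
    """How many times faster than real time a given RTF is (the reciprocal)."""
    return 1.0 / rtf


# Hypothetical timing: a 60 s clip transcribed in 16.2 s.
rtf = real_time_factor(16.2, 60.0)
print(f"RTF {rtf:.2f}, {speedup(rtf):.1f}x faster than real time")
```

An RTF below 1.0 means the audio is transcribed faster than it plays back.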

Code and evaluation scripts: github.com/abhayverma6300/indic-asr-conformer

Vistaar Results

WER with Devanagari-aware normalisation (dandas and punctuation stripped). Beam width 100.

Dataset	Domain	Greedy WER	+ Hindi-5M LM
Kathbath	Read speech	10.34%	9.00%
Kathbath Noisy	Noisy read speech	11.86%	10.19%
FLEURS	Broadcast / read	12.68%	11.18%
CommonVoice	Crowd-sourced read	16.57%	12.54%
IndicTTS	TTS-derived	9.49%	8.55%
MUCS	Conversational	10.41%	9.05%
Gramvaani	Rural / dialectal	27.61%	24.09%
Average		14.14%	12.09%
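For readers who want to reproduce the scoring, here is a minimal sketch of a WER computation with danda and punctuation stripping. It is an assumption about what "Devanagari-aware normalisation" involves, not the repository's evaluation script: the function names are hypothetical, punctuation is replaced by a space, and WER is pooled over the whole test set.

```python
import unicodedata


def normalise(text: str) -> list[str]:
    """NFC-normalise, replace dandas and Unicode punctuation with spaces, split into words."""
    text = unicodedata.normalize("NFC", text)
    kept = []
    for ch in text:
        # U+0964 (danda) and U+0965 (double danda) are listed explicitly for clarity;
        # any character in a Unicode punctuation category is dropped as well.
        if ch in "\u0964\u0965" or unicodedata.category(ch).startswith("P"):
            kept.append(" ")
        else:
            kept.append(ch)
    return "".join(kept).split()


def word_errors(ref: list[str], hyp: list[str]) -> int:
    """Word-level Levenshtein distance (substitutions + insertions + deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]


def corpus_wer(pairs) -> float:
    """Total word errors over total reference words, as a percentage."""
    errors = words = 0
    for ref_text, hyp_text in pairs:
        ref, hyp = normalise(ref_text), normalise(hyp_text)
        errors += word_errors(ref, hyp)
        words += len(ref)
    return 100.0 * errors / words


print(corpus_wer([("मैं घर जा रहा हूँ।", "मैं घर गया हूँ")]))
```

In the example, the five-word reference differs from the hypothesis by one substitution and one deletion; the trailing danda does not count as an error.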

Leaderboard context

Model	Avg WER	Open weights	CPU inference
Indic Conformer 600M + Hindi-5M LM	12.09%	yes	yes
IndicWhisper (Whisper-medium fine-tuned)	13.6%	yes	slow
Nvidia NeMo large	18.6%	yes	no
Azure STT	~20%	no	no
Google STT	~24%	no	no

Numbers for the other models are taken from the Vistaar paper (AI4Bharat, 2023).

Model files

File	Size	Description
`am_model.pt`	2.4 GB	Original TorchScript AM (CUDA device literals)
`am_model_cpu.pt`	2.4 GB	Patched for CPU inference
`am_model_mps.pt`	2.4 GB	Patched for Apple Silicon MPS
`preprocessor.pt`	~92 KB	Log-Mel frontend
`lm/hindi/hi.bin`	145 MB	5-gram KenLM (Hindi-5M)
`lm/hindi/unigrams.txt`	—	201k Hindi words for pyctcdecode
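Because there is one acoustic-model file per backend, a small lookup keeps the choice explicit. The sketch below uses the file names from the table; the helper name `am_path` and the directory argument are hypothetical and not part of the repository.

```python
from pathlib import Path

# File names come from the table above; the helper itself is illustrative.
AM_FILES = {
    "cuda": "am_model.pt",     # original export with CUDA device literals
    "cpu": "am_model_cpu.pt",  # patched for CPU inference
    "mps": "am_model_mps.pt",  # patched for Apple Silicon MPS
}


def am_path(model_dir: str, device: str) -> Path:
    """Return the acoustic-model file that matches the target device."""
    try:
        return Path(model_dir) / AM_FILES[device]
    except KeyError:
        raise ValueError(
            f"unsupported device {device!r}; expected one of {sorted(AM_FILES)}"
        ) from None


print(am_path("extracted_models_v3", "mps"))
# A TorchScript file is typically loaded with torch.jit.load(str(path));
# whether extra arguments are needed for these exports is not stated in this card.
```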

Quickstart

Install dependencies

pip install torch torchaudio pyctcdecode kenlm

CPU inference

git clone https://github.com/Abhay-Verma031/indic-asr-conformer
cd indic-asr-conformer

huggingface-cli download Abhay-Verma031/indic-conformer-600m \
    --local-dir extracted_models_v3/

python inference/cpu_infer.py \
    --audio speech.wav \
    --language hi \
    --preprocessor extracted_models_v3/preprocessor.pt \
    --am extracted_models_v3/am_model_cpu.pt \
    --lm extracted_models_v3/lm/hindi/hi.bin

Apple Silicon MPS

python inference/cpu_infer.py \
    --audio speech.wav \
    --language hi \
    --preprocessor extracted_models_v3/preprocessor.pt \
    --am extracted_models_v3/am_model_mps.pt \
    --device mps \
    --lm extracted_models_v3/lm/hindi/hi.bin

NVIDIA GPU

python inference/gpu_infer.py \
    --audio speech.wav \
    --language hi \
    --preprocessor extracted_models_v3/preprocessor.pt \
    --am extracted_models_v3/am_model.pt \
    --lm extracted_models_v3/lm/hindi/hi.bin

Architecture

AUDIO (16 kHz mono, FP32)
        │
        ▼
  asr_preprocessor      80-dim log-Mel filterbank  [B, 80, T']
        │
        ▼
      asr_am             Conformer encoder, ~600M params
                         output: CTC logprobs  [B, T', 257]
                         (256 Hindi BPE tokens + CTC blank)
        │
        ▼
    asr_decoder          pyctcdecode CTC beam search + KenLM
                         α=0.3  β=1.0  beam_width=100
        │
        ▼
    TRANSCRIPT

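The AM is a multilingual model covering all 22 scheduled Indian languages via a 5633-token multilingual BPE vocabulary, which is consistent with 22 slices of 256 tokens (5632) plus one CTC blank. Each language uses a 256-token slice at a fixed offset; for Hindi the slice starts at offset 1536. The model is exported as TorchScript, so the preprocessor and acoustic model need only torch and torchaudio at inference time; beam-search decoding with the LM additionally uses pyctcdecode.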
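The vocabulary arithmetic can be checked with a few lines. This is a sketch under stated assumptions: the position of the CTC blank in the full vocabulary (taken here to be the last index) and the helper names are hypothetical, and how the exported model performs the slicing internally is not described in this card.

```python
TOKENS_PER_LANGUAGE = 256
NUM_LANGUAGES = 22
VOCAB_SIZE = NUM_LANGUAGES * TOKENS_PER_LANGUAGE + 1  # 5632 tokens + 1 blank = 5633
HINDI_OFFSET = 1536
BLANK_INDEX = VOCAB_SIZE - 1  # assumption: blank is the final entry


def slot_index(offset: int) -> int:
    """Zero-based position of a language slice among the 256-token slices."""
    return offset // TOKENS_PER_LANGUAGE


def language_frame(full_frame: list, offset: int) -> list:
    """Pick one language's 256 scores and append the blank score (257 values)."""
    return full_frame[offset:offset + TOKENS_PER_LANGUAGE] + [full_frame[BLANK_INDEX]]


# Stand-in frame whose values equal their own indices, to make the slice visible.
frame = list(range(VOCAB_SIZE))
hindi = language_frame(frame, HINDI_OFFSET)
print(slot_index(HINDI_OFFSET), len(hindi), hindi[0], hindi[-2], hindi[-1])
```

Offset 1536 corresponds to zero-based slot 6, and the resulting 257-value frame matches the output width shown in the diagram.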

Hindi language model

The greedy CTC baseline (14.14% avg WER) is already competitive. The Hindi-5M KenLM lowers it to 12.09%, an absolute reduction of 2.05 pp (about 14.5% relative), by rescoring beam candidates with 5-gram language model scores.
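For reference, greedy CTC decoding takes the best token in each frame, merges consecutive repeats, and drops blanks. The sketch below shows that standard procedure on a toy three-symbol vocabulary; it is not the repository's decoder, and the blank index used here is chosen only for the example.

```python
def greedy_ctc(frames, blank: int) -> list[int]:
    """frames: one list of scores per time step; returns token ids after CTC collapse."""
    tokens = []
    prev = blank
    for scores in frames:
        best = max(range(len(scores)), key=scores.__getitem__)  # argmax over the vocabulary
        # Emit a token only when it is not blank and differs from the previous frame.
        if best != blank and best != prev:
            tokens.append(best)
        prev = best
    return tokens


# Toy log-probabilities over {0, 1, blank=2}; the per-frame argmax is 0, 0, 2, 0, 1.
frames = [
    [-0.1, -3.0, -2.5],
    [-0.2, -2.0, -3.0],
    [-4.0, -3.0, -0.1],
    [-0.3, -2.0, -2.0],
    [-3.0, -0.1, -2.5],
]
print(greedy_ctc(frames, blank=2))
```

The blank between the second and fourth frames is what allows token 0 to be emitted twice. The resulting ids would then be mapped to BPE pieces and joined into text.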

	Hindi-5M
Order	5-gram
Binary size	145 MB
Training sentences	5,000,000
Unigrams	201,136
α	0.3
β	1.0
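The roles of α and β can be illustrated with the usual shallow-fusion score: acoustic log-probability plus α times the LM log-probability plus β per word. The sketch below is an approximation of how CTC beam-search decoders such as pyctcdecode combine scores, not a copy of its internals; the hypothesis scores are made up for illustration.

```python
import math

ALPHA = 0.3  # LM weight, from the table above
BETA = 1.0   # per-word bonus, from the table above


def fused_score(acoustic_logprob: float, lm_log10_prob: float, num_words: int,
                alpha: float = ALPHA, beta: float = BETA) -> float:
    """Combine acoustic and LM evidence for one beam hypothesis."""
    lm_logprob = lm_log10_prob * math.log(10)  # KenLM reports log10; convert to natural log
    return acoustic_logprob + alpha * lm_logprob + beta * num_words


# Hypothetical three-word hypotheses: A is acoustically stronger, B is more fluent.
score_a = fused_score(acoustic_logprob=-4.0, lm_log10_prob=-6.0, num_words=3)
score_b = fused_score(acoustic_logprob=-4.2, lm_log10_prob=-3.0, num_words=3)
print("B preferred" if score_b > score_a else "A preferred")
```

With α set to 0 the ranking falls back to the acoustic scores alone. The β term offsets the LM's tendency to favour hypotheses with fewer words.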

Training corpus: Wikipedia (hi), CC-100 (hi), CulturaX (hi), OSCAR-2301 (hi), C4 (hi) — ~5M sentences after deduplication and Devanagari filtering.
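A minimal sketch of the deduplication and Devanagari filtering step is shown below. The 0.9 threshold, the exact-match deduplication, and the function names are assumptions made for illustration; the actual corpus pipeline is not described in this card.

```python
def devanagari_ratio(sentence: str) -> float:
    """Fraction of non-space characters inside the Devanagari block (U+0900 to U+097F)."""
    chars = [ch for ch in sentence if not ch.isspace()]
    if not chars:
        return 0.0
    return sum("\u0900" <= ch <= "\u097F" for ch in chars) / len(chars)


def clean_corpus(sentences, min_ratio: float = 0.9):
    """Yield unique, whitespace-normalised sentences that are mostly Devanagari."""
    seen = set()
    for sentence in sentences:
        sentence = " ".join(sentence.split())  # collapse runs of whitespace
        if sentence and sentence not in seen and devanagari_ratio(sentence) >= min_ratio:
            seen.add(sentence)
            yield sentence


raw = [
    "यह एक वाक्य है",
    "यह  एक वाक्य है",       # duplicate once whitespace is normalised
    "This is English",
    "यह mixed है sentence",  # too much Latin script
]
print(list(clean_corpus(raw)))
```

Only the first sentence survives in this example. The kept sentences would then be passed to KenLM's training tools to build the 5-gram model.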

Citation

@misc{indic-conformer-600m,
  author = {Abhay Verma},
  title  = {Indic Conformer ASR — Hindi 600M},
  year   = {2026},
  url    = {https://huggingface.co/Abhay-Verma031/indic-conformer-600m}
}


Paper

Vistaar: Diverse Benchmarks and Training Sets for Indian Language ASR (arXiv:2305.15386, published Aug 2, 2023)
