Indic Conformer ASR — Hindi (600M)
600M-parameter Conformer encoder for Hindi automatic speech recognition, evaluated on all 7 subsets of the Vistaar benchmark. Achieves 12.09% average WER with a custom 5-gram KenLM across read speech, noisy speech, broadcast, conversational, and rural dialectal Hindi.
Runs locally on CPU, Apple Silicon MPS, and NVIDIA CUDA — no GPU required. On Apple M4 CPU: 0.27× RTF (3.7× faster than real-time). On Apple MPS: ~0.03–0.05× RTF (20–30× faster than real-time).
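The real-time factor (RTF) quoted above is processing time divided by audio duration, and the speed-up is its reciprocal. A minimal sketch of that arithmetic follows; the 60 s clip and 16.2 s timing are hypothetical values chosen to reproduce the 0.27× figure, not measurements from this repository.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent transcribing / duration of the audio (lower is faster)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def speedup(rtf: float) -> float:
    """How many times faster than real time a given RTF is."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return 1.0 / rtf


# Hypothetical timing: a 60 s clip transcribed in 16.2 s.
rtf = real_time_factor(16.2, 60.0)
print(f"RTF = {rtf:.2f}x, {speedup(rtf):.1f}x faster than real time")
```

An RTF below 1.0 means the model keeps up with live audio; the MPS range of 0.03–0.05× corresponds to a 20–33× speed-up by the same formula.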
Code and evaluation scripts: github.com/abhayverma6300/indic-asr-conformer
Vistaar Results
WER with Devanagari-aware normalisation (dandas and punctuation stripped). Beam width 100.
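As a rough illustration of what "Devanagari-aware normalisation" can mean, the sketch below strips the danda (U+0964), double danda (U+0965) and ASCII punctuation before computing word-level edit distance. The exact rules in the evaluation scripts may differ; the function names `normalise` and `wer` are hypothetical.

```python
import string
import unicodedata

# Map ASCII punctuation plus danda and double danda to spaces.
_DROP = str.maketrans({ch: " " for ch in string.punctuation + "\u0964\u0965"})


def normalise(text: str) -> list[str]:
    """Unicode-normalise, strip punctuation/dandas, and split into words."""
    text = unicodedata.normalize("NFC", text)
    return text.translate(_DROP).split()


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = normalise(reference), normalise(hypothesis)
    if not ref:
        raise ValueError("reference is empty after normalisation")
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)


# The trailing danda does not count as an error after normalisation.
print(wer("यह एक परीक्षण है।", "यह एक परीक्षण है"))
# One dropped word out of four reference words.
print(wer("यह एक परीक्षण है।", "यह परीक्षण है"))
```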
| Dataset | Domain | Greedy WER | + Hindi-5M LM |
|---|---|---|---|
| Kathbath | Read speech | 10.34% | 9.00% |
| Kathbath Noisy | Noisy read speech | 11.86% | 10.19% |
| FLEURS | Broadcast / read | 12.68% | 11.18% |
| CommonVoice | Crowd-sourced read | 16.57% | 12.54% |
| IndicTTS | TTS-derived | 9.49% | 8.55% |
| MUCS | Conversational | 10.41% | 9.05% |
| Gramvaani | Rural / dialectal | 27.61% | 24.09% |
| Average | | 14.14% | 12.09% |
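The averages in the last row are unweighted means over the seven subsets. The sketch below recomputes them from the table values and reports which subset gains most from the LM; it is a sanity check on the published numbers, not part of the evaluation code.

```python
# (greedy WER %, WER % with Hindi-5M LM), copied from the table above.
results = {
    "Kathbath": (10.34, 9.00),
    "Kathbath Noisy": (11.86, 10.19),
    "FLEURS": (12.68, 11.18),
    "CommonVoice": (16.57, 12.54),
    "IndicTTS": (9.49, 8.55),
    "MUCS": (10.41, 9.05),
    "Gramvaani": (27.61, 24.09),
}

# Unweighted (macro) average: each subset counts equally regardless of size.
greedy_avg = sum(g for g, _ in results.values()) / len(results)
lm_avg = sum(lm for _, lm in results.values()) / len(results)

# Subset with the largest absolute WER reduction from the LM.
biggest_gain = max(results, key=lambda name: results[name][0] - results[name][1])

print(f"greedy avg = {greedy_avg:.2f}%, LM avg = {lm_avg:.2f}%")
print(f"largest absolute gain: {biggest_gain}")
```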
Leaderboard context
| Model | Avg WER | Open weights | CPU inference |
|---|---|---|---|
| Indic Conformer 600M + Hindi-5M LM | 12.09% | yes | yes |
| IndicWhisper (Whisper-medium fine-tuned) | 13.6% | yes | slow |
| Nvidia NeMo large | 18.6% | yes | no |
| Azure STT | ~20% | no | no |
| Google STT | ~24% | no | no |
Numbers for other models from the Vistaar paper (AI4Bharat, 2023).
Model files
| File | Size | Description |
|---|---|---|
| am_model.pt | 2.4 GB | Original TorchScript AM (CUDA device literals) |
| am_model_cpu.pt | 2.4 GB | Patched for CPU inference |
| am_model_mps.pt | 2.4 GB | Patched for Apple Silicon MPS |
| preprocessor.pt | ~92 KB | Log-Mel frontend |
| lm/hindi/hi.bin | 145 MB | 5-gram KenLM (Hindi-5M) |
| lm/hindi/unigrams.txt | - | 201k Hindi words for pyctcdecode |
Quickstart
Install dependencies
pip install torch torchaudio pyctcdecode
CPU inference
git clone https://github.com/Abhay-Verma031/indic-asr-conformer
cd indic-asr-conformer
huggingface-cli download Abhay-Verma031/indic-conformer-600m \
--local-dir extracted_models_v3/
python inference/cpu_infer.py \
--audio speech.wav \
--language hi \
--preprocessor extracted_models_v3/preprocessor.pt \
--am extracted_models_v3/am_model_cpu.pt \
--lm extracted_models_v3/lm/hindi/hi.bin
Apple Silicon MPS
python inference/cpu_infer.py \
--audio speech.wav \
--language hi \
--preprocessor extracted_models_v3/preprocessor.pt \
--am extracted_models_v3/am_model_mps.pt \
--device mps \
--lm extracted_models_v3/lm/hindi/hi.bin
NVIDIA GPU
python inference/gpu_infer.py \
--audio speech.wav \
--language hi \
--preprocessor extracted_models_v3/preprocessor.pt \
--am extracted_models_v3/am_model.pt \
--lm extracted_models_v3/lm/hindi/hi.bin
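The three commands above differ only in the script, the acoustic-model file, and the optional `--device` flag. If you want to drive them from Python, one possible sketch is a small helper that assembles the argument list per device; `build_command` and `VARIANTS` are hypothetical names, and the flags are exactly those shown in the commands above.

```python
from pathlib import Path

MODEL_DIR = Path("extracted_models_v3")

# device -> (inference script, acoustic-model file, extra CLI flags),
# mirroring the CPU, MPS and NVIDIA GPU commands in this Quickstart.
VARIANTS = {
    "cpu": ("inference/cpu_infer.py", "am_model_cpu.pt", []),
    "mps": ("inference/cpu_infer.py", "am_model_mps.pt", ["--device", "mps"]),
    "cuda": ("inference/gpu_infer.py", "am_model.pt", []),
}


def build_command(audio: str, device: str = "cpu") -> list[str]:
    """Return the argv list for one of the Quickstart commands."""
    if device not in VARIANTS:
        raise ValueError(f"unknown device {device!r}; expected one of {sorted(VARIANTS)}")
    script, am_file, extra = VARIANTS[device]
    return [
        "python", script,
        "--audio", audio,
        "--language", "hi",
        "--preprocessor", str(MODEL_DIR / "preprocessor.pt"),
        "--am", str(MODEL_DIR / am_file),
        *extra,
        "--lm", str(MODEL_DIR / "lm" / "hindi" / "hi.bin"),
    ]


print(" ".join(build_command("speech.wav", "mps")))
```

The returned list can be passed to `subprocess.run(..., check=True)` from the repository root once the model files have been downloaded.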
Architecture
AUDIO (16 kHz mono, FP32)
│
▼
asr_preprocessor 80-dim log-Mel filterbank [B, 80, T']
│
▼
asr_am Conformer encoder, ~600M params
output: CTC logprobs [B, T', 257]
(256 Hindi BPE tokens + CTC blank)
│
▼
asr_decoder pyctcdecode CTC beam search + KenLM
α=0.3 β=1.0 beam_width=100
│
▼
TRANSCRIPT
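For reference, the "Greedy WER" column corresponds to decoding without the beam search and LM stage of this pipeline. A minimal sketch of greedy CTC decoding is shown below: take the per-frame argmax, collapse consecutive repeats, then drop blanks. It uses a toy 4-symbol vocabulary; the position of the blank among the model's 257 classes is an assumption, so it is passed in as a parameter.

```python
def greedy_ctc_decode(logprobs, blank_id):
    """Collapse a [T', V] grid of CTC log-probabilities into token ids.

    logprobs: list of frames, each a list of V log-probabilities.
    blank_id: index of the CTC blank (assumed; not confirmed by the source).
    """
    tokens = []
    previous = None
    for frame in logprobs:
        best = max(range(len(frame)), key=frame.__getitem__)  # per-frame argmax
        # Emit a token only when it differs from the previous frame and is not blank.
        if best != previous and best != blank_id:
            tokens.append(best)
        previous = best
    return tokens


def one_hot_frame(index, size, low=-10.0, high=-0.1):
    """Toy frame whose argmax is `index` (illustration only)."""
    frame = [low] * size
    frame[index] = high
    return frame


BLANK = 3  # toy vocabulary of 4 symbols; the real model emits 257 classes per frame
path = [1, 1, BLANK, 1, 2, 2, BLANK]
frames = [one_hot_frame(i, 4) for i in path]
print(greedy_ctc_decode(frames, BLANK))  # [1, 1, 2]
```

The blank between the two runs of token 1 is what lets CTC emit the same token twice in a row.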
The AM is a multilingual model covering all 22 scheduled Indian languages via a 5633-token multilingual BPE vocabulary. Each language uses a 256-token slice at a fixed offset; for Hindi the slice starts at offset 1536. The model is exported as TorchScript, so running the AM requires only torch and torchaudio (beam-search decoding with the LM additionally uses pyctcdecode).
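The vocabulary arithmetic works out as 22 × 256 + 1 = 5633, which is consistent with one shared CTC blank, and the Hindi offset 1536 = 6 × 256 makes Hindi the seventh slice. The sketch below shows how a 257-class Hindi frame could be cut from a full 5633-class frame. Placing the blank at the last index is an assumption, and the exported model may already perform this slicing internally, since its documented output is `[B, T', 257]`.

```python
TOKENS_PER_LANGUAGE = 256
NUM_LANGUAGES = 22
FULL_VOCAB_SIZE = 5633
HINDI_OFFSET = 1536

# Assumption: the shared CTC blank is the final entry of the full vocabulary.
ASSUMED_BLANK_INDEX = FULL_VOCAB_SIZE - 1


def language_frame(full_frame, offset=HINDI_OFFSET, blank_index=ASSUMED_BLANK_INDEX):
    """Return the 256 language-specific scores followed by the blank score."""
    if len(full_frame) != FULL_VOCAB_SIZE:
        raise ValueError(f"expected {FULL_VOCAB_SIZE} classes, got {len(full_frame)}")
    language_scores = list(full_frame[offset:offset + TOKENS_PER_LANGUAGE])
    return language_scores + [full_frame[blank_index]]


# Use the class index as a stand-in score so the slice boundaries are visible.
frame = list(range(FULL_VOCAB_SIZE))
hindi = language_frame(frame)
print(len(hindi), hindi[0], hindi[255], hindi[256])
```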
Hindi language model
The greedy CTC baseline (14.14% avg WER) is already competitive. Rescoring beam candidates with the Hindi-5M 5-gram KenLM lowers this to 12.09%, an absolute reduction of 2.05 percentage points (about 14.5% relative).
| Hindi-5M | |
|---|---|
| Order | 5-gram |
| Binary size | 145 MB |
| Training sentences | 5,000,000 |
| Unigrams | 201,136 |
| α | 0.3 |
| β | 1.0 |
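To make the roles of α and β concrete, here is a minimal sketch of shallow-fusion scoring in the style used by CTC beam-search decoders: the acoustic log-probability is combined with α times the LM log-probability plus a β bonus per word. This is an illustration, not pyctcdecode's exact internals, and the two candidates with their scores are invented for the example.

```python
import math
from typing import NamedTuple

ALPHA = 0.3  # LM weight, from the table above
BETA = 1.0   # per-word insertion bonus, from the table above


class Candidate(NamedTuple):
    text: str
    acoustic_logprob: float  # natural log, summed over the CTC path
    lm_log10prob: float      # KenLM reports log10 probabilities


def fused_score(c: Candidate, alpha: float = ALPHA, beta: float = BETA) -> float:
    """acoustic + alpha * LM (converted to natural log) + beta * word count."""
    lm_ln = c.lm_log10prob * math.log(10)
    return c.acoustic_logprob + alpha * lm_ln + beta * len(c.text.split())


# Invented scores: the second candidate is slightly better acoustically,
# but the LM considers its final word much less likely.
candidates = [
    Candidate("मैं घर जा रहा हूँ", acoustic_logprob=-4.2, lm_log10prob=-6.0),
    Candidate("मैं घर जा रहा हु", acoustic_logprob=-4.0, lm_log10prob=-9.5),
]

acoustic_only = max(candidates, key=lambda c: c.acoustic_logprob)
with_lm = max(candidates, key=fused_score)
print("acoustic only:", acoustic_only.text)
print("with LM      :", with_lm.text)
```

A larger α trusts the LM more; β offsets the LM's tendency to prefer hypotheses with fewer words.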
Training corpus: Wikipedia (hi), CC-100 (hi), CulturaX (hi), OSCAR-2301 (hi), C4 (hi) — ~5M sentences after deduplication and Devanagari filtering.
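The corpus preparation is described only as deduplication plus Devanagari filtering. One plausible sketch of such a pass is given below: exact-match deduplication after whitespace normalisation, and a threshold on the share of characters in the Devanagari Unicode block (U+0900 to U+097F). The 0.9 threshold and the function names are hypothetical, not taken from the actual pipeline.

```python
def devanagari_ratio(sentence: str) -> float:
    """Fraction of non-space characters that fall in the Devanagari block."""
    chars = [ch for ch in sentence if not ch.isspace()]
    if not chars:
        return 0.0
    return sum(1 for ch in chars if "\u0900" <= ch <= "\u097F") / len(chars)


def clean_corpus(sentences, min_ratio=0.9):
    """Yield unique sentences that are predominantly Devanagari.

    min_ratio is a hypothetical threshold chosen for illustration.
    """
    seen = set()
    for sentence in sentences:
        sentence = " ".join(sentence.split())  # collapse runs of whitespace
        if sentence in seen or devanagari_ratio(sentence) < min_ratio:
            continue
        seen.add(sentence)
        yield sentence


raw = [
    "यह एक वाक्य है",
    "यह  एक वाक्य है",   # duplicate once whitespace is normalised
    "This is English",   # filtered out: no Devanagari characters
    "यह एक वाक्य है",
]
print(list(clean_corpus(raw)))
```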
Citation
@misc{indic-conformer-600m,
author = {Abhay Verma},
title = {Indic Conformer ASR — Hindi 600M},
year = {2026},
url = {https://huggingface.co/abhayverma6300/indic-conformer-600m}
}