# Shrutam-2: LLM-Powered Multilingual Indic Speech Recognition
Shrutam-2 is an LLM-based automatic speech recognition (ASR) system for 12 major Indian languages. It bridges a Conformer speech encoder with a pretrained LLM decoder through a Mixture-of-Experts (MoE) projection layer, enabling high-quality, prompt-controllable transcription across diverse Indic languages.
## Architecture Overview
Unlike conventional CTC/Attention ASR systems that map audio directly to text tokens, Shrutam-2 reframes speech recognition as a conditional language generation task. A speech encoder produces frame-level audio representations, which are then projected into the LLM's embedding space and fed to a frozen LLM decoder alongside a text prompt.
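To make this conditioning scheme concrete, here is a minimal NumPy sketch of how projected audio frames and an embedded text prompt could be combined before decoding. All dimensions, names, and the prompt/audio ordering below are illustrative assumptions, not the released model's actual sizes or layout:

```python
import numpy as np

# Toy dimensions for illustration only (the real model uses much larger ones)
llm_dim = 24
audio_frames = np.random.randn(120, 16)    # (T, encoder_dim) from the Conformer
projector = np.random.randn(16, llm_dim)   # stands in for the MoE projector
audio_emb = audio_frames @ projector       # (T, llm_dim): audio in LLM space

prompt_tokens = [5, 17, 3]                 # hypothetical token ids for the prompt
embed_table = np.random.randn(100, llm_dim)
prompt_emb = embed_table[prompt_tokens]    # (3, llm_dim)

# The frozen LLM decoder attends over the combined sequence and generates
# text tokens; one possible ordering is [prompt ; audio].
decoder_input = np.concatenate([prompt_emb, audio_emb], axis=0)
print(decoder_input.shape)  # (123, 24)
```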
The key architectural contribution is the MoE Projector that bridges the encoder and the LLM:
| Component | Details |
|---|---|
| Downsampler | Two-stage Conv1D that reduces the encoder frame rate for efficient LLM consumption |
| MoE Projector | 8 linear experts with SMEAR (Soft Merging of Experts with Adaptive Routing) — utterance-level soft gating computes a weighted merge of all expert parameters into a single projector per input, avoiding discrete top-k routing and its associated load-balancing issues |
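To make the downsampler's frame-rate reduction concrete, here is a toy two-stage strided Conv1D in NumPy. Kernel sizes, strides, and channel counts are illustrative assumptions, not the released model's values:

```python
import numpy as np

def conv1d(x, w, stride):
    """Valid 1D convolution along time. x: (T, C_in), w: (K, C_in, C_out)."""
    K = w.shape[0]
    T_out = (x.shape[0] - K) // stride + 1
    return np.stack([
        np.einsum('kc,kco->o', x[t * stride : t * stride + K], w)
        for t in range(T_out)
    ])

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 16))       # 100 encoder frames, 16 channels (toy sizes)
w1 = rng.normal(size=(3, 16, 16))    # stage 1: kernel 3, stride 2
w2 = rng.normal(size=(3, 16, 16))    # stage 2: kernel 3, stride 2
y = conv1d(conv1d(x, w1, 2), w2, 2)  # roughly 4x fewer frames for the LLM
print(y.shape)  # (24, 16)
```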
Each expert is a two-layer MLP (encoder_dim → 2048 → llm_dim). Rather than routing each frame to a single expert, SMEAR computes frame-wise router probabilities, averages them at the utterance level, and produces a single merged weight matrix per utterance. This yields a smooth, fully differentiable routing mechanism with a simple MSE-based load-balancing loss.
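The soft-merging step can be sketched as follows. This is a minimal NumPy illustration of SMEAR-style routing under assumed details (toy dimensions, a ReLU activation, and random weights), not the released implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def smear_project(frames, router_w, w1_e, b1_e, w2_e, b2_e):
    """SMEAR-style projection: average frame-wise router probabilities over
    the utterance, merge all expert parameters with that gate, and apply the
    single merged two-layer MLP to every frame."""
    logits = frames @ router_w                 # (T, num_experts)
    probs = softmax(logits, axis=-1)           # frame-wise routing probabilities
    gate = probs.mean(axis=0)                  # (num_experts,) utterance-level gate
    # Weighted merge of expert parameters (sum over the expert axis)
    w1 = np.einsum('e,eij->ij', gate, w1_e)    # (encoder_dim, hidden)
    b1 = np.einsum('e,ej->j', gate, b1_e)
    w2 = np.einsum('e,eij->ij', gate, w2_e)    # (hidden, llm_dim)
    b2 = np.einsum('e,ej->j', gate, b2_e)
    hidden = np.maximum(frames @ w1 + b1, 0.0)  # ReLU assumed for illustration
    return hidden @ w2 + b2                     # (T, llm_dim)

rng = np.random.default_rng(0)
T, enc, hid, llm, E = 50, 16, 32, 24, 8   # toy sizes; the model uses hid = 2048
out = smear_project(
    rng.normal(size=(T, enc)),
    rng.normal(size=(enc, E)),
    rng.normal(size=(E, enc, hid)) * 0.1, np.zeros((E, hid)),
    rng.normal(size=(E, hid, llm)) * 0.1, np.zeros((E, llm)),
)
print(out.shape)  # (50, 24)
```

Because the merge is a convex combination of expert weights, every expert receives gradient on every utterance, which is what makes the routing fully differentiable without discrete top-k selection.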
## Why LLM-Based ASR?
Traditional ASR pipelines rely on acoustic models trained exclusively on speech-text pairs. By grounding transcription in a pretrained LLM, this approach gains several advantages:
- Rich linguistic priors — The LLM's language knowledge reduces hallucinations and improves fluency, especially for low-resource languages.
- Prompt controllability — Transcription behavior can be steered through natural-language prompts without retraining.
- Unified multilingual capacity — A single model serves all 12 languages, with the MoE layer learning language-adaptive projections.
## Languages Supported
| # | Language | Script | ISO 639-1 |
|---|---|---|---|
| 1 | Hindi | Devanagari | hi |
| 2 | Marathi | Devanagari | mr |
| 3 | Tamil | Tamil | ta |
| 4 | Telugu | Telugu | te |
| 5 | Malayalam | Malayalam | ml |
| 6 | Kannada | Kannada | kn |
| 7 | Odia | Odia | or |
| 8 | Bengali | Bengali | bn |
| 9 | Urdu | Nastaliq | ur |
| 10 | Assamese | Bengali | as |
| 11 | Gujarati | Gujarati | gu |
| 12 | Punjabi | Gurmukhi | pa |
## Extended Capabilities
Note: The capabilities below are not fully tested and are presented as potential directions. They can be unlocked or significantly enhanced with task-specific fine-tuning.
### Prompt Customisation
Because the LLM decoder conditions on both audio embeddings and a text prompt, you can control transcription behavior at inference time by changing the prompt.
Basic transcription:

```text
Transcribe speech to text.
```

Language-specific prompting:

```text
Transcribe the following Hindi speech to text.
Transcribe the following Tamil speech to Devanagari text.
```

Domain-specific prompting:

```text
Transcribe the following medical conversation in Hindi.
Transcribe the following legal proceeding in Bengali.
```
### Few-Shot Prompting
The LLM backbone enables few-shot prompting where you provide example transcriptions in the prompt to bias the model toward a specific vocabulary, style, or domain:
```text
The following are examples of transcriptions from a banking domain:
- 'मुझे अपने खाते का बैलेंस जानना है'
- 'कृपया मेरा पिन रीसेट कर दीजिए'
Now transcribe the following speech to text.
```
This is particularly useful for:
- Domain adaptation — Bias the decoder toward domain-specific terminology (medical, legal, financial) without retraining.
- Named entity handling — Provide example transcriptions containing proper nouns, brand names, or technical terms so the model calibrates its output vocabulary.
- Script/transliteration control — Guide the model toward a particular script or romanization convention.
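A few-shot prompt of this shape can be assembled programmatically. The helper below is a hypothetical illustration (the function name and exact wording are assumptions; the model's expected prompt format may differ):

```python
def few_shot_prompt(domain, examples):
    """Build a few-shot transcription prompt from domain example sentences."""
    lines = [f"The following are examples of transcriptions from a {domain} domain:"]
    lines += [f"- '{ex}'" for ex in examples]
    lines.append("Now transcribe the following speech to text.")
    return "\n".join(lines)

prompt = few_shot_prompt("banking", [
    "मुझे अपने खाते का बैलेंस जानना है",   # "I want to know my account balance"
    "कृपया मेरा पिन रीसेट कर दीजिए",      # "Please reset my PIN"
])
print(prompt)
```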
### Code-Switching Support
The multilingual nature of both the speech encoder and the LLM enables handling of code-switched speech (e.g., Hindi-English) when prompted appropriately:
```text
Transcribe the following Hindi-English code-mixed speech to text.
```
## Usage
### Requirements
```shell
pip install torch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0 --index-url https://download.pytorch.org/whl/cu118
pip install transformers==4.56.3 huggingface_hub==0.36.0 pyyaml
```
### Quick Start
Update inference_config.yaml with your model paths (see Configuration below), then run:
```shell
python inference_script.py
```
The script loads the full pipeline (encoder, MoE projector, LLM), transcribes the audio file, and prints the output text.
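A minimal `inference_config.yaml` might look like the following. All key names and paths here are hypothetical placeholders; consult the released configuration file for the actual schema:

```yaml
# Hypothetical example — field names are placeholders, not the actual schema
encoder_path: /path/to/conformer_encoder.pt
projector_path: /path/to/moe_projector.pt
llm_path: /path/to/llm_checkpoint
audio_file: /path/to/audio.wav
prompt: "Transcribe speech to text."
```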
## License
This model is released under the BharatGen non-commercial license. Please refer to the LICENSE file for detailed terms and conditions.
For more details about the model, see https://arxiv.org/abs/2601.19451.