
Indic Conformer ASR - INT8 Quantized ONNX Models

Quantized version of AI4Bharat's Indic Conformer 600M multilingual ASR model for efficient on-device inference.

πŸ“‹ Model Details

  • Original Model: ai4bharat/indic-conformer-600m-multilingual
  • Quantization: INT8 (via ONNX Runtime quantization)
  • Framework: ONNX Runtime
  • Languages: 23 Indian languages (Hindi, Bengali, Tamil, Telugu, Marathi, Gujarati, Kannada, Malayalam, Punjabi, Odia, Assamese, Sanskrit, and more)
  • Use Case: Offline speech recognition on mobile/edge devices

πŸ—‚οΈ Files Included

Core Models

  • encoder_int8.onnx (622 MB) - Quantized Conformer encoder
    • Input: Log-mel spectrogram features [1, 80, T]
    • Output: Encoded features [1, 1024, T_sub]
  • ctc_decoder_int8.onnx (5.5 MB) - Quantized CTC decoder
    • Input: Encoded features [1, T_sub, 1024]
    • Output: Log probabilities [1, T_sub, 5633]

Supporting Files

  • vocab.json - Per-language vocabularies (257 tokens each)
  • language_indices.json - CTC vocab masking indices for language-specific decoding
  • mel_filters.json - Mel filterbank (257 x 80)
  • hanning_window.json - Hanning window for STFT (400 samples)
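The supporting JSON files are sufficient to implement the feature extraction that the Quick Start below assumes. Here is a minimal NumPy sketch; the FFT size of 512 (giving 257 bins) is inferred from the filterbank shape, and the 160-sample hop and power-spectrum/log normalization are assumptions that should be verified against the original model's preprocessing:

```python
import json
import numpy as np

def extract_mel_features(audio, sample_rate=16000,
                         mel_filters=None, window=None,
                         n_fft=512, hop_length=160):
    """Log-mel spectrogram of shape [1, 80, T] for the encoder.

    mel_filters: [257, 80] filterbank (defaults to mel_filters.json)
    window:      400-sample analysis window (defaults to hanning_window.json)
    Assumed: n_fft=512 (257 FFT bins) and a 10 ms hop; check these against
    the original preprocessing before relying on the output.
    """
    if mel_filters is None:
        with open("mel_filters.json") as f:
            mel_filters = np.asarray(json.load(f), dtype=np.float32)
    if window is None:
        with open("hanning_window.json") as f:
            window = np.asarray(json.load(f), dtype=np.float32)

    win_length = len(window)
    n_frames = 1 + (len(audio) - win_length) // hop_length
    # Frame the waveform, apply the window, take the real-FFT power spectrum
    frames = np.stack([audio[i * hop_length:i * hop_length + win_length] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2  # [T, 257]
    # Apply the mel filterbank and take the log (small offset avoids log(0))
    logmel = np.log(power @ mel_filters + 1e-10)               # [T, 80]
    return logmel.T[np.newaxis].astype(np.float32)             # [1, 80, T]
```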

πŸ“Š Compression Stats

  Model     Original (FP32)   Quantized (INT8)   Reduction
  Encoder   ~2.5 GB           622 MB             ~75%
  Decoder   ~22 MB            5.5 MB             ~75%
  Total     ~2.52 GB          627.5 MB           ~75%

πŸš€ Quick Start

Python (ONNX Runtime)

import onnxruntime as ort
import numpy as np
import json

# Load models
encoder_session = ort.InferenceSession("encoder_int8.onnx")
decoder_session = ort.InferenceSession("ctc_decoder_int8.onnx")

# Load vocabularies
with open("vocab.json") as f:
    vocab = json.load(f)
with open("language_indices.json") as f:
    language_indices = json.load(f)

# Prepare audio features: extract_mel_features must produce a
# log-mel spectrogram of shape [1, 80, T] (see Supporting Files)
features = extract_mel_features(audio, sample_rate=16000)
length = np.array([[features.shape[2]]], dtype=np.int64)

# Run encoder
encoder_output = encoder_session.run(
    ["outputs"],
    {"input": features, "length": length}
)[0]

# Transpose encoder output from [1, 1024, T_sub] to the decoder's
# expected [1, T_sub, 1024] layout (see Files Included)
encoder_output = encoder_output.transpose(0, 2, 1)

# Run decoder
logprobs = decoder_session.run(
    ["logprobs"],
    {"encoder_output": encoder_output}
)[0]

# Greedy CTC decoding with language-specific masking
language = "hin"  # Hindi
active_indices = language_indices[language]
vocab_list = vocab[language]

transcript = ""
prev_idx = -1
for t in range(logprobs.shape[1]):
    # Get argmax among active vocab indices
    scores = logprobs[0, t, active_indices]
    max_idx = np.argmax(scores)

    # CTC deduplication
    if max_idx != 256 and max_idx != prev_idx:  # 256 is blank
        transcript += vocab_list[max_idx]
    prev_idx = max_idx

print(transcript)
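The greedy loop above can be packaged as a small self-contained function. It assumes the vocabulary layout described earlier: each language has its tokens listed in the same order as its masking indices, with the CTC blank at position 256 of that masked list:

```python
import numpy as np

def greedy_ctc_decode(logprobs, active_indices, vocab_list, blank_id=256):
    """Greedy CTC decoding over language-masked log-probabilities.

    logprobs:       [1, T, V] log-probabilities from the CTC decoder
    active_indices: indices of this language's tokens in the full vocab
    vocab_list:     this language's tokens (same order as active_indices)
    blank_id:       position of the CTC blank within the masked list
    """
    tokens = []
    prev = -1
    for t in range(logprobs.shape[1]):
        idx = int(np.argmax(logprobs[0, t, active_indices]))
        # Standard CTC collapse: emit on change, and never emit blanks
        if idx != blank_id and idx != prev:
            tokens.append(vocab_list[idx])
        prev = idx
    return "".join(tokens)
```

Usage mirrors the loop above: `greedy_ctc_decode(logprobs, language_indices["hin"], vocab["hin"])`.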

🎯 Supported Languages

23 Indian languages including Hindi, Bengali, Tamil, Telugu, Marathi, Gujarati, Kannada, Malayalam, Punjabi, Odia, Assamese, Sanskrit, Urdu, Dogri, Konkani, Maithili, Manipuri, Nepali, Santali, and Sindhi.

πŸ“ Model Architecture

Audio (16kHz)
    ↓
Mel Spectrogram (80 bins)
    ↓
Conformer Encoder (24 layers, 600M params)
    ↓ [1, 1024, T_sub]
CTC Decoder
    ↓ [1, T_sub, 5633]
Language-specific masking
    ↓
Greedy/Beam Search
    ↓
Transcription

βš™οΈ Performance

Tested on Android (Pixel 7):

  • Encoder: ~200-500ms for 3-5 second audio
  • Decoder: ~10-50ms
  • Total latency: ~250-550ms end-to-end
  • Memory: ~800MB peak
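Latencies like those above can be measured with a simple wall-clock harness around `session.run`; the helper below is illustrative, not part of this repo, and warmup iterations are included because the first few runs pay one-time initialization costs:

```python
import time

def mean_latency_ms(run, feed, warmup=3, iters=20):
    """Average wall-clock latency (ms) of an ONNX Runtime
    session.run-style callable, after warmup calls."""
    for _ in range(warmup):
        run(None, feed)
    start = time.perf_counter()
    for _ in range(iters):
        run(None, feed)
    return (time.perf_counter() - start) / iters * 1e3

# Example, with sessions and inputs as created in Quick Start:
# print(mean_latency_ms(encoder_session.run,
#                       {"input": features, "length": length}))
```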

πŸ“ Citation

@article{conformer2023,
  title={Scaling Speech Technology to 1000+ Languages},
  author={AI4Bharat},
  journal={arXiv preprint},
  year={2023}
}

πŸ—οΈ Original Creators

AI4Bharat - IIT Madras
Original model: https://huggingface.co/ai4bharat/indic-conformer-600m-multilingual

πŸ“„ License

Same as the original model; check AI4Bharat's repository for licensing terms.


Quantized for mobile deployment | January 2026
