# Indic Conformer ASR - INT8 Quantized ONNX Models
Quantized version of AI4Bharat's Indic Conformer 600M multilingual ASR model for efficient on-device inference.
## Model Details
- Original Model: ai4bharat/indic-conformer-600m-multilingual
- Quantization: INT8 (via ONNX Runtime quantization)
- Framework: ONNX Runtime
- Languages: 23 Indian languages (Hindi, Bengali, Tamil, Telugu, Marathi, Gujarati, Kannada, Malayalam, Punjabi, Odia, Assamese, Sanskrit, and more)
- Use Case: Offline speech recognition on mobile/edge devices
## Files Included

### Core Models
- `encoder_int8.onnx` (622 MB) - Quantized Conformer encoder
  - Input: Log-mel spectrogram features `[1, 80, T]`
  - Output: Encoded features `[1, 1024, T_sub]`
- `ctc_decoder_int8.onnx` (5.5 MB) - Quantized CTC decoder
  - Input: Encoded features `[1, T_sub, 1024]`
  - Output: Log probabilities `[1, T_sub, 5633]`

### Supporting Files
- `vocab.json` - Per-language vocabularies (257 tokens each)
- `language_indices.json` - CTC vocab masking indices for language-specific decoding
- `mel_filters.json` - Mel filterbank (257 x 80)
- `hanning_window.json` - Hanning window for STFT (400 samples)
## Compression Stats
| Model | Original (FP32) | Quantized (INT8) | Reduction |
|---|---|---|---|
| Encoder | ~2.5 GB | 622 MB | ~75% |
| Decoder | ~22 MB | 5.5 MB | ~75% |
| Total | ~2.52 GB | 627.5 MB | ~75% |
## Quick Start
### Python (ONNX Runtime)

```python
import json

import numpy as np
import onnxruntime as ort

# Load models
encoder_session = ort.InferenceSession("encoder_int8.onnx")
decoder_session = ort.InferenceSession("ctc_decoder_int8.onnx")

# Load vocabularies
with open("vocab.json") as f:
    vocab = json.load(f)
with open("language_indices.json") as f:
    language_indices = json.load(f)

# Prepare audio features (log-mel spectrogram), shape [1, 80, T].
# extract_mel_features is your own front end, built from
# mel_filters.json and hanning_window.json.
features = extract_mel_features(audio, sample_rate=16000)
length = np.array([[features.shape[2]]], dtype=np.int64)

# Run encoder
encoder_output = encoder_session.run(
    ["outputs"],
    {"input": features, "length": length},
)[0]

# Run decoder
logprobs = decoder_session.run(
    ["logprobs"],
    {"encoder_output": encoder_output},
)[0]

# Greedy CTC decoding with language-specific masking
language = "hin"  # Hindi
active_indices = language_indices[language]
vocab_list = vocab[language]

transcript = ""
prev_idx = -1
for t in range(logprobs.shape[1]):
    # Argmax among this language's active vocab indices
    scores = logprobs[0, t, active_indices]
    max_idx = np.argmax(scores)
    # CTC collapse: skip blanks and repeated tokens
    if max_idx != 256 and max_idx != prev_idx:  # 256 is blank
        transcript += vocab_list[max_idx]
    prev_idx = max_idx  # update every frame so a blank breaks a repeat

print(transcript)
```
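The Quick Start assumes an `extract_mel_features` front end. Below is a self-contained NumPy sketch of one. The STFT settings (`n_fft=512`, 160-sample hop) are assumptions chosen so the filterbank shape matches the shipped 257 x 80 `mel_filters.json` and the 400-sample `hanning_window.json`; check them against the original NeMo preprocessor, which may also apply per-feature normalization.

```python
import numpy as np


def mel_filterbank(sr, n_fft, n_mels):
    """Triangular HTK-style mel filters, shape [n_fft // 2 + 1, n_mels]."""
    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)

    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_fft // 2 + 1, n_mels))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fb[k, m - 1] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[k, m - 1] = (right - k) / max(right - center, 1)
    return fb


def extract_mel_features(audio, sample_rate=16000, n_fft=512,
                         win_length=400, hop=160, n_mels=80):
    """Log-mel features [1, n_mels, T] from mono float audio at 16 kHz."""
    window = np.hanning(win_length)
    # Framed magnitude spectrogram via rFFT, zero-padded to n_fft
    frames = [np.abs(np.fft.rfft(audio[s:s + win_length] * window, n=n_fft))
              for s in range(0, len(audio) - win_length + 1, hop)]
    power = np.stack(frames, axis=1) ** 2                # [n_fft//2+1, T]
    mel = mel_filterbank(sample_rate, n_fft, n_mels).T @ power  # [n_mels, T]
    logmel = np.log(mel + 1e-10)
    return logmel[np.newaxis].astype(np.float32)         # [1, n_mels, T]
```

In production you would load `mel_filters.json` and `hanning_window.json` instead of recomputing them, which guarantees bit-identical features to the mobile pipeline.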
## Supported Languages
23 Indian languages including Hindi, Bengali, Tamil, Telugu, Marathi, Gujarati, Kannada, Malayalam, Punjabi, Odia, Assamese, Sanskrit, Urdu, Dogri, Konkani, Maithili, Manipuri, Nepali, Santali, and Sindhi.
## Model Architecture

```
Audio (16 kHz)
      ↓
Mel Spectrogram (80 bins)
      ↓
Conformer Encoder (24 layers, 600M params)
      ↓  [1, 1024, T_sub]
CTC Decoder
      ↓  [1, T_sub, 5633]
Language-specific masking
      ↓
Greedy / Beam Search
      ↓
Transcription
```
## Performance
Tested on Android (Pixel 7):
- Encoder: ~200-500ms for 3-5 second audio
- Decoder: ~10-50ms
- Total latency: ~250-550ms end-to-end
- Memory: ~800MB peak
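The figures above come from an on-device harness; the measurement pattern itself can be reproduced on any machine with a sketch like this (input shapes are taken from the I/O spec above, and the block is guarded so it is skipped when the model file is absent):

```python
import os
import time

import numpy as np

if os.path.exists("encoder_int8.onnx"):
    import onnxruntime as ort

    sess = ort.InferenceSession("encoder_int8.onnx")
    feats = np.random.randn(1, 80, 300).astype(np.float32)  # ~3 s at 10 ms hop
    length = np.array([[300]], dtype=np.int64)
    # Warm up once (session initialization dominates the first call),
    # then time a single steady-state run.
    sess.run(["outputs"], {"input": feats, "length": length})
    t0 = time.perf_counter()
    sess.run(["outputs"], {"input": feats, "length": length})
    print(f"encoder latency: {(time.perf_counter() - t0) * 1000:.1f} ms")
```

Desktop numbers will differ from the Pixel 7 figures; on Android, latency also depends on the execution provider (NNAPI vs. CPU) and thread count.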
## Citation

```bibtex
@article{conformer2023,
  title={Scaling Speech Technology to 1000+ Languages},
  author={AI4Bharat},
  journal={arXiv preprint},
  year={2023}
}
```
## Original Creators
AI4Bharat - IIT Madras
Original model: https://huggingface.co/ai4bharat/indic-conformer-600m-multilingual
## License
Same as original model (check AI4Bharat's repository for licensing terms)
Quantized for mobile deployment | January 2026