
Indic Conformer ASR - INT8 Quantized ONNX Models

Quantized version of AI4Bharat's Indic Conformer 600M multilingual ASR model for efficient on-device inference.

πŸ“‹ Model Details

  • Original Model: ai4bharat/indic-conformer-600m-multilingual
  • Quantization: INT8 (via ONNX Runtime quantization)
  • Framework: ONNX Runtime
  • Languages: 23 Indian languages (Hindi, Bengali, Tamil, Telugu, Marathi, Gujarati, Kannada, Malayalam, Punjabi, Odia, Assamese, Sanskrit, and more)
  • Use Case: Offline speech recognition on mobile/edge devices

πŸ—‚οΈ Files Included

Core Models

  • encoder_int8.onnx (622 MB) - Quantized Conformer encoder
    • Input: Log-mel spectrogram features [1, 80, T]
    • Output: Encoded features [1, 1024, T_sub]
  • ctc_decoder_int8.onnx (5.5 MB) - Quantized CTC decoder
    • Input: Encoded features [1, T_sub, 1024]
    • Output: Log probabilities [1, T_sub, 5633]

Supporting Files

  • vocab.json - Per-language vocabularies (257 tokens each)
  • language_indices.json - CTC vocab masking indices for language-specific decoding
  • mel_filters.json - Mel filterbank (257 x 80)
  • hanning_window.json - Hanning window for STFT (400 samples)
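The supporting JSON files are sufficient to implement the feature extraction that the Quick Start below assumes. Here is a minimal NumPy sketch; the FFT size of 512 (giving 257 bins) is inferred from the filterbank shape, and the 160-sample hop and power-spectrum/log normalization are assumptions that should be verified against the original model's preprocessing:

```python
import json
import numpy as np

def extract_mel_features(audio, sample_rate=16000,
                         mel_filters=None, window=None,
                         n_fft=512, hop_length=160):
    """Log-mel spectrogram of shape [1, 80, T] for the encoder.

    mel_filters: [257, 80] filterbank (defaults to mel_filters.json)
    window:      400-sample analysis window (defaults to hanning_window.json)
    Assumed: n_fft=512 (257 FFT bins) and a 10 ms hop; check these against
    the original preprocessing before relying on the output.
    """
    if mel_filters is None:
        with open("mel_filters.json") as f:
            mel_filters = np.asarray(json.load(f), dtype=np.float32)
    if window is None:
        with open("hanning_window.json") as f:
            window = np.asarray(json.load(f), dtype=np.float32)

    win_length = len(window)
    n_frames = 1 + (len(audio) - win_length) // hop_length
    # Frame the waveform, apply the window, take the real-FFT power spectrum
    frames = np.stack([audio[i * hop_length:i * hop_length + win_length] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2  # [T, 257]
    # Apply the mel filterbank and take the log (small offset avoids log(0))
    logmel = np.log(power @ mel_filters + 1e-10)               # [T, 80]
    return logmel.T[np.newaxis].astype(np.float32)             # [1, 80, T]
```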

πŸ“Š Compression Stats

  Model     Original (FP32)   Quantized (INT8)   Reduction
  Encoder   ~2.5 GB           622 MB             ~75%
  Decoder   ~22 MB            5.5 MB             ~75%
  Total     ~2.52 GB          627.5 MB           ~75%

πŸš€ Quick Start

Python (ONNX Runtime)

import onnxruntime as ort
import numpy as np
import json

# Load models
encoder_session = ort.InferenceSession("encoder_int8.onnx")
decoder_session = ort.InferenceSession("ctc_decoder_int8.onnx")

# Load vocabularies
with open("vocab.json") as f:
    vocab = json.load(f)
with open("language_indices.json") as f:
    language_indices = json.load(f)

# Prepare audio features: extract_mel_features must produce a
# log-mel spectrogram of shape [1, 80, T] (see Supporting Files)
features = extract_mel_features(audio, sample_rate=16000)
length = np.array([[features.shape[2]]], dtype=np.int64)

# Run encoder
encoder_output = encoder_session.run(
    ["outputs"],
    {"input": features, "length": length}
)[0]

# Transpose encoder output from [1, 1024, T_sub] to the decoder's
# expected [1, T_sub, 1024] layout (see Files Included)
encoder_output = encoder_output.transpose(0, 2, 1)

# Run decoder
logprobs = decoder_session.run(
    ["logprobs"],
    {"encoder_output": encoder_output}
)[0]

# Greedy CTC decoding with language-specific masking
language = "hin"  # Hindi
active_indices = language_indices[language]
vocab_list = vocab[language]

transcript = ""
prev_idx = -1
for t in range(logprobs.shape[1]):
    # Get argmax among active vocab indices
    scores = logprobs[0, t, active_indices]
    max_idx = np.argmax(scores)

    # CTC deduplication
    if max_idx != 256 and max_idx != prev_idx:  # 256 is blank
        transcript += vocab_list[max_idx]
    prev_idx = max_idx

print(transcript)
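The greedy loop above can be packaged as a small self-contained function. It assumes the vocabulary layout described earlier: each language has its tokens listed in the same order as its masking indices, with the CTC blank at position 256 of that masked list:

```python
import numpy as np

def greedy_ctc_decode(logprobs, active_indices, vocab_list, blank_id=256):
    """Greedy CTC decoding over language-masked log-probabilities.

    logprobs:       [1, T, V] log-probabilities from the CTC decoder
    active_indices: indices of this language's tokens in the full vocab
    vocab_list:     this language's tokens (same order as active_indices)
    blank_id:       position of the CTC blank within the masked list
    """
    tokens = []
    prev = -1
    for t in range(logprobs.shape[1]):
        idx = int(np.argmax(logprobs[0, t, active_indices]))
        # Standard CTC collapse: emit on change, and never emit blanks
        if idx != blank_id and idx != prev:
            tokens.append(vocab_list[idx])
        prev = idx
    return "".join(tokens)
```

Usage mirrors the loop above: `greedy_ctc_decode(logprobs, language_indices["hin"], vocab["hin"])`.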

🎯 Supported Languages

23 Indian languages including Hindi, Bengali, Tamil, Telugu, Marathi, Gujarati, Kannada, Malayalam, Punjabi, Odia, Assamese, Sanskrit, Urdu, Dogri, Konkani, Maithili, Manipuri, Nepali, Santali, and Sindhi.

πŸ“ Model Architecture

Audio (16kHz)
    ↓
Mel Spectrogram (80 bins)
    ↓
Conformer Encoder (24 layers, 600M params)
    ↓ [1, 1024, T_sub]
CTC Decoder
    ↓ [1, T_sub, 5633]
Language-specific masking
    ↓
Greedy/Beam Search
    ↓
Transcription

βš™οΈ Performance

Tested on Android (Pixel 7):

  • Encoder: ~200-500ms for 3-5 second audio
  • Decoder: ~10-50ms
  • Total latency: ~250-550ms end-to-end
  • Memory: ~800MB peak
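Latencies like those above can be measured with a simple wall-clock harness around `session.run`; the helper below is illustrative, not part of this repo, and warmup iterations are included because the first few runs pay one-time initialization costs:

```python
import time

def mean_latency_ms(run, feed, warmup=3, iters=20):
    """Average wall-clock latency (ms) of an ONNX Runtime
    session.run-style callable, after warmup calls."""
    for _ in range(warmup):
        run(None, feed)
    start = time.perf_counter()
    for _ in range(iters):
        run(None, feed)
    return (time.perf_counter() - start) / iters * 1e3

# Example, with sessions and inputs as created in Quick Start:
# print(mean_latency_ms(encoder_session.run,
#                       {"input": features, "length": length}))
```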

πŸ“ Citation

@article{conformer2023,
  title={Scaling Speech Technology to 1000+ Languages},
  author={AI4Bharat},
  journal={arXiv preprint},
  year={2023}
}

πŸ—οΈ Original Creators

AI4Bharat - IIT Madras
Original model: https://huggingface.co/ai4bharat/indic-conformer-600m-multilingual

πŸ“„ License

Same as the original model; check AI4Bharat's repository for licensing terms.


Quantized for mobile deployment | January 2026
