# IndicTrans2 - INT8 Quantized ONNX Models

Quantized version of AI4Bharat's IndicTrans2 for efficient on-device Indian language translation.

## Model Details
- Original Model: ai4bharat/indictrans2-indic-en-1B
- Quantization: INT8 (via ONNX Runtime quantization)
- Framework: ONNX Runtime
- Task: Translation from 22 Indian languages to English
- Use Case: Offline translation on mobile/edge devices
## Files Included

### Core Models

- `encoder_int8.onnx` (116 MB) - Quantized encoder
  - Input: Token IDs `[batch, seq_len]`
  - Output: Hidden states `[batch, seq_len, 1024]`
- `decoder_int8.onnx` (92 MB) - Quantized decoder with self-attention
  - Input: Decoder token IDs + encoder hidden states
  - Output: Logits `[batch, seq_len, vocab_size]`

### Supporting Files

- `vocab_src.json` - Source vocabulary (Indian languages)
- `vocab_tgt.json` - Target vocabulary (English)
- `special_tokens.json` - Special tokens mapping
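A small helper for loading the vocabulary files. It assumes each JSON maps token strings to integer IDs (the key layout is an assumption, not verified against this repo), so detokenization needs the inverted target vocabulary:

```python
import json


def load_vocab(path):
    """Load a token->id vocabulary JSON and return it with its inverse.

    Assumes the file maps token strings to integer IDs; adjust if this
    repo's vocab files use a different layout.
    """
    with open(path, encoding="utf-8") as f:
        token_to_id = json.load(f)
    id_to_token = {i: t for t, i in token_to_id.items()}
    return token_to_id, id_to_token
```

With `vocab_tgt.json`, the inverted `id_to_token` dict is what a detokenization step would index into.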
## Compression Stats
| Model | Original (FP32) | Quantized (INT8) | Reduction |
|---|---|---|---|
| Encoder | ~464 MB | 116 MB | ~75% |
| Decoder | ~368 MB | 92 MB | ~75% |
| Total | ~832 MB | 208 MB | ~75% |
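The ~75% reduction follows directly from storing weights as 1-byte INT8 instead of 4-byte FP32, plus a small per-tensor scale. A minimal sketch of symmetric per-tensor quantization, illustrative rather than ONNX Runtime's exact scheme:

```python
import numpy as np


def quantize_int8(w):
    """Symmetric per-tensor INT8 quantization: w ~= scale * q."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale


def dequantize(q, scale):
    """Map INT8 codes back to approximate FP32 values."""
    return q.astype(np.float32) * scale


rng = np.random.default_rng(0)
w = rng.standard_normal((1024, 1024)).astype(np.float32)
q, scale = quantize_int8(w)

print(w.nbytes // q.nbytes)  # 4x smaller, matching the table above
```

Rounding error is bounded by half a quantization step (`0.5 * scale`), which is why INT8 weights translate nearly as well as FP32 for this model size.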
## Quick Start

### Python (ONNX Runtime)
```python
import json

import numpy as np
import onnxruntime as ort

# Load models
encoder_session = ort.InferenceSession("encoder_int8.onnx")
decoder_session = ort.InferenceSession("decoder_int8.onnx")

# Load vocabularies
with open("vocab_src.json") as f:
    src_vocab = json.load(f)
with open("vocab_tgt.json") as f:
    tgt_vocab = json.load(f)

# Tokenize input (add language tags).
# `tokenize`/`detokenize` are placeholders for your tokenizer.
text = "नमस्ते दुनिया"  # "Hello world" in Hindi
tokens = tokenize(f"<2en> <hin> {text}", src_vocab)
input_ids = np.array([tokens], dtype=np.int64)

# Run encoder
encoder_output = encoder_session.run(
    ["hidden_states"],
    {"input_ids": input_ids},
)[0]

# Autoregressive (greedy) decoding
generated_ids = [2]  # Start token
max_length = 50
for _ in range(max_length):
    decoder_ids = np.array([generated_ids], dtype=np.int64)
    logits = decoder_session.run(
        ["logits"],
        {
            "input_ids": decoder_ids,
            "encoder_hidden_states": encoder_output,
        },
    )[0]
    next_token = np.argmax(logits[0, -1, :])
    if next_token == 2:  # EOS token
        break
    generated_ids.append(int(next_token))

# Decode output
translation = detokenize(generated_ids, tgt_vocab)
print(translation)  # "Hello world"
```
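The snippet above leaves `tokenize` and `detokenize` undefined. IndicTrans2 uses SentencePiece subwords, so the whitespace-split version below is only an illustrative stand-in, assuming the vocab JSONs map token strings to IDs and that IDs below 4 are special tokens:

```python
UNK_ID = 3  # assumed unknown-token id


def tokenize(text, vocab):
    """Whitespace tokenizer stand-in; real use needs SentencePiece."""
    return [vocab.get(tok, UNK_ID) for tok in text.split()]


def detokenize(ids, vocab):
    """Map ids back to tokens, skipping special tokens (< 4 by assumption)."""
    id_to_token = {i: t for t, i in vocab.items()}
    return " ".join(id_to_token[i] for i in ids if i >= 4)


# Toy vocabulary for illustration only
vocab = {"<s>": 2, "<unk>": 3, "Hello": 4, "world": 5}
print(tokenize("Hello world", vocab))        # [4, 5]
print(detokenize([2, 4, 5], vocab))          # "Hello world"
```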
## Supported Languages

Translation from any of these 22 languages to English:
Assamese, Bengali, Bodo, Dogri, Gujarati, Hindi, Kannada, Kashmiri, Konkani, Maithili, Malayalam, Manipuri, Marathi, Nepali, Odia, Punjabi, Sanskrit, Santali, Sindhi, Tamil, Telugu, Urdu
## Model Architecture

```
Input Text (+ lang tags)
        ↓
   Tokenization
        ↓
Encoder (Transformer, 1B params)
        ↓ [hidden_states]
Decoder (Autoregressive)
        ↓
  Output Tokens
        ↓
English Translation
```
## Performance

Tested on Android (Pixel 7):
- Encoder: ~50-150ms
- Decoder (per token): ~20-40ms
- Total (20 tokens output): ~600ms-1s
- Memory: ~400MB peak
## Language Tags

Prefix the input with the target tag followed by a source-language tag:

- `<2en>` - Translate to English
- `<hin>` - Hindi
- `<ben>` - Bengali
- `<tam>` - Tamil
- `<tel>` - Telugu
- etc.

Example: `<2en> <hin> यह एक परीक्षण है` ("This is a test")
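Building the tagged input can be wrapped in a tiny helper (tag names taken from the list above; this card's short tags are a repo-specific convention):

```python
def tag_input(text, src_lang):
    """Prefix text with the <2en> target tag and a source-language tag,
    following the tag convention listed above."""
    return f"<2en> <{src_lang}> {text}"


print(tag_input("यह एक परीक्षण है", "hin"))
# <2en> <hin> यह एक परीक्षण है
```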
## Citation

```bibtex
@article{gala2023indictrans,
  title={IndicTrans2: Towards High-Quality and Accessible Machine Translation Models for all 22 Scheduled Indian Languages},
  author={Gala, Jay and others},
  journal={Transactions on Machine Learning Research},
  year={2023}
}
```
## Original Creators
AI4Bharat - IIT Madras
Original model: https://huggingface.co/ai4bharat/indictrans2-indic-en-1B
## License
MIT License (same as original model)
## Related
- Original FP32 model: ai4bharat/indictrans2-indic-en-1B
- ASR model: Indic Conformer INT8
---

*Quantized for mobile deployment | January 2026*