Indic Translation Model: 55 Indian Languages

A fine-tuned translation model supporting 55 Indian languages including 33 low-resource languages. Translates bidirectionally between English and any supported Indic language.

Key Features

  • 55 Indian languages: covers both major and low-resource regional languages
  • Bidirectional: English to Indic and Indic to English
  • 1.3B parameter encoder-decoder model, ready to use out of the box
  • Fast inference: supports CTranslate2 int8 quantization for production deployments
  • Extended vocabulary: 256,243 tokens with 33 newly added language codes

Supported Languages

Language        Code        Script
Ahirani         ahr_Deva    Devanagari
Assamese        asm_Beng    Bengali
Awadhi          awa_Deva    Devanagari
Bagheli         bfy_Deva    Devanagari
Bagri           bgq_Deva    Devanagari
Banjari         brj_Deva    Devanagari
Bengali         ben_Beng    Bengali
Bhili           bhb_Deva    Devanagari
Bhojpuri        bho_Deva    Devanagari
Bodo            brx_Deva    Devanagari
Braj Bhasha     bra_Deva    Devanagari
Bundeli         bns_Deva    Devanagari
Chhattisgarhi   hne_Deva    Devanagari
Dakhini         dcc_Deva    Devanagari
Dogri           doi_Deva    Devanagari
Garhwali        gbm_Deva    Devanagari
Garo            grt_Latn    Latin
Gondi           gon_Deva    Devanagari
Gujarati        guj_Gujr    Gujarati
Haryanvi        bgc_Deva    Devanagari
Hindi           hin_Deva    Devanagari
Ho              hoc_Deva    Devanagari
Kangri          xnr_Deva    Devanagari
Kannada         kan_Knda    Kannada
Kashmiri        kas_Arab    Arabic
Kashmiri        kas_Deva    Devanagari
Khasi           kha_Latn    Latin
Khortha         kho_Deva    Devanagari
Kodava          kfa_Knda    Kannada
Konkani         kok_Deva    Devanagari
Kumaoni         kfy_Deva    Devanagari
Kurukh          kru_Deva    Devanagari
Kutchi          kfr_Deva    Devanagari
Magahi          mag_Deva    Devanagari
Maithili        mai_Deva    Devanagari
Malayalam       mal_Mlym    Malayalam
Manipuri        mni_Beng    Bengali
Manipuri        mni_Mtei    Meitei Mayek
Marathi         mar_Deva    Devanagari
Marwari         mwr_Deva    Devanagari
Mewari          mtr_Deva    Devanagari
Mizo            lus_Latn    Latin
Nepali          npi_Deva    Devanagari
Odia            ory_Orya    Odia
Pahari          phr_Deva    Devanagari
Punjabi         pan_Guru    Gurmukhi
Rajasthani      raj_Deva    Devanagari
Sambalpuri      spv_Orya    Odia
Sanskrit        san_Deva    Devanagari
Santali         sat_Olck    Ol Chiki
Sindhi          snd_Arab    Arabic
Sora            srb_Latn    Latin
Tamil           tam_Taml    Tamil
Telugu          tel_Telu    Telugu
Tulu            tcy_Knda    Kannada
Urdu            urd_Arab    Arabic
Wagdi           wbr_Deva    Devanagari
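
Each code above is a literal token in the tokenizer's vocabulary (including the 33 newly added codes), so you can confirm a code is supported by looking it up. A minimal sketch:

from transformers import AutoTokenizer

# Sanity-check that a language code exists in the extended vocabulary.
tokenizer = AutoTokenizer.from_pretrained("Anonym-050326/nirukti-translate-1.3b")
print("mwr_Deva" in tokenizer.get_vocab())  # True for supported codes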

How to Use

Quick Start (Transformers)

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load model
model = AutoModelForSeq2SeqLM.from_pretrained("Anonym-050326/nirukti-translate-1.3b")
tokenizer = AutoTokenizer.from_pretrained("Anonym-050326/nirukti-translate-1.3b")

# Translate English to Hindi
tokenizer.src_lang = "eng_Latn"
inputs = tokenizer("Hello, how are you?", return_tensors="pt")
translated = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("hin_Deva"),
    max_new_tokens=128,
)
print(tokenizer.decode(translated[0], skip_special_tokens=True))
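
By default the quick start runs on CPU. A minimal sketch of moving the same objects to GPU, assuming a CUDA device is available:

import torch

# Optional: move the quick-start model and inputs to GPU before generating.
# Reuses model and inputs from the snippet above; assumes CUDA is available.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
inputs = {k: v.to(device) for k, v in inputs.items()}
# ...then call model.generate(**inputs, ...) exactly as above.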

CTranslate2 (Fastest, Recommended for Production)

import ctranslate2
from transformers import AutoTokenizer

# "ct2-int8" is the local directory produced by the conversion step shown under Running Evaluation
translator = ctranslate2.Translator("ct2-int8", device="cuda", compute_type="int8_float16")
tokenizer = AutoTokenizer.from_pretrained("Anonym-050326/nirukti-translate-1.3b")
tokenizer.src_lang = "eng_Latn"

text = "Hello, how are you?"
encoded = tokenizer(text, return_tensors=None, max_length=256, truncation=True)
tokens = tokenizer.convert_ids_to_tokens(encoded["input_ids"])

result = translator.translate_batch([tokens], target_prefix=[["hin_Deva"]], beam_size=5)
output_tokens = result[0].hypotheses[0][1:]  # skip language token
output_ids = tokenizer.convert_tokens_to_ids(output_tokens)
print(tokenizer.decode(output_ids, skip_special_tokens=True))
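
translate_batch accepts many sentences per call, which is where CTranslate2's throughput advantage shows. A minimal sketch, reusing the translator and tokenizer loaded above:

texts = ["Good morning.", "Where is the railway station?"]
batch = [
    tokenizer.convert_ids_to_tokens(
        tokenizer(t, max_length=256, truncation=True)["input_ids"]
    )
    for t in texts
]
results = translator.translate_batch(
    batch,
    target_prefix=[["hin_Deva"]] * len(batch),  # one target code per sentence
    beam_size=5,
)
for r in results:
    ids = tokenizer.convert_tokens_to_ids(r.hypotheses[0][1:])  # drop language token
    print(tokenizer.decode(ids, skip_special_tokens=True))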

Translating Between Any Language Pair

Set tokenizer.src_lang to the source language code and use the target code as forced_bos_token_id (Transformers) or target_prefix (CTranslate2).

# Tamil to English (reuses model and tokenizer from the quick start)
tokenizer.src_lang = "tam_Taml"
inputs = tokenizer("வணக்கம், எப்படி இருக்கிறீர்கள்?", return_tensors="pt")
out = model.generate(**inputs, forced_bos_token_id=tokenizer.convert_tokens_to_ids("eng_Latn"), max_new_tokens=128)
print(tokenizer.decode(out[0], skip_special_tokens=True))

# English to Marwari
tokenizer.src_lang = "eng_Latn"
inputs = tokenizer("Hello, how are you?", return_tensors="pt")
out = model.generate(**inputs, forced_bos_token_id=tokenizer.convert_tokens_to_ids("mwr_Deva"), max_new_tokens=128)
print(tokenizer.decode(out[0], skip_special_tokens=True))

Testing Scripts

This repository includes evaluation and testing scripts in the testing_scripts/ directory:

Script                                Description
testing_scripts/run_evaluation.py     Multi-GPU evaluation runner (BLEU + chrF++)
testing_scripts/download_datasets.py  Download standard Indic translation benchmarks
testing_scripts/excel_to_csv.py       Convert Excel test datasets to evaluation format

Running Evaluation

# 1. Convert to CTranslate2 int8 for fast inference
ct2-transformers-converter --model Anonym-050326/nirukti-translate-1.3b --output_dir ct2-int8 --quantization int8

# 2. Download benchmark datasets
python testing_scripts/download_datasets.py --output-dir eval_data

# 3. Run multi-GPU evaluation
python testing_scripts/run_evaluation.py \
    --model ct2-int8 \
    --tokenizer Anonym-050326/nirukti-translate-1.3b \
    --manifest eval_data/manifest.json \
    --output-dir results \
    --num-gpus 4 --batch-size 64 --beam-size 5
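
run_evaluation.py reports BLEU and chrF++. To score a single hypothesis file against a reference yourself, a minimal sketch with sacrebleu (the file paths here are placeholders, not outputs of the runner):

import sacrebleu

# Score one hypothesis file against one line-aligned reference file.
with open("results/hyp.txt", encoding="utf-8") as f:
    hyps = [line.strip() for line in f]
with open("eval_data/ref.txt", encoding="utf-8") as f:
    refs = [line.strip() for line in f]

bleu = sacrebleu.corpus_bleu(hyps, [refs])
chrf = sacrebleu.corpus_chrf(hyps, [refs], word_order=2)  # word_order=2 gives chrF++
print(f"BLEU: {bleu.score:.2f}  chrF++: {chrf.score:.2f}")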

Training Details

Architecture

  • Base: 1.3B parameter encoder-decoder translation model
  • Fine-tuning: LoRA (rank 32, alpha 64) on attention and FFN layers, merged into the base weights (see the sketch after this list)
  • Trainable parameters: ~0.75% of total during training
  • Two-stage training: embedding divergence followed by LoRA fine-tuning
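
A hypothetical sketch of that LoRA configuration using the peft library; the target module names assume an NLLB/M2M-style encoder-decoder and are not confirmed by this card:

from peft import LoraConfig, get_peft_model

# Assumed LoRA setup: rank 32, alpha 64, attention + FFN projections.
# Module names are a guess for an NLLB/M2M-style model, not from the card.
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=["q_proj", "k_proj", "v_proj", "out_proj", "fc1", "fc2"],
    task_type="SEQ_2_SEQ_LM",
)
# model = get_peft_model(base_model, lora_config)
# After training, merge the adapters into the base weights:
# merged = model.merge_and_unload()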

Hyperparameters

Parameter             Value
Learning rate         3e-5 (cosine schedule)
Warmup steps          1,000
Effective batch size  256
Epochs                5
Max sequence length   128 tokens
Precision             BF16
Label smoothing       0.1
Language balancing    Temperature sampling (T=5.0)

Training Data

Trained on a curated combination of parallel corpora covering all 55 languages, with hash-based deduplication and quality filtering. Low-resource language pairs are upsampled using temperature-based balancing to ensure adequate representation.
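
The exact balancing recipe is not spelled out here; a common formulation of temperature sampling (and a plausible reading of T=5.0 in the table above) raises each language's share of the data to the power 1/T and renormalizes. A sketch with made-up corpus sizes:

import numpy as np

# Temperature-based balancing: p_i proportional to (n_i / N) ** (1 / T).
# T = 5.0 flattens the distribution, upsampling low-resource pairs.
# Corpus sizes below are illustrative placeholders, not real training counts.
corpus_sizes = {"hin_Deva": 1_000_000, "mwr_Deva": 20_000, "kfa_Knda": 5_000}
T = 5.0

sizes = np.array(list(corpus_sizes.values()), dtype=np.float64)
probs = (sizes / sizes.sum()) ** (1.0 / T)
probs /= probs.sum()

for lang, p in zip(corpus_sizes, probs):
    print(f"{lang}: {p:.3f}")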

Compute

  • Training: 2x NVIDIA A100-SXM4-80GB, PyTorch DDP
  • Inference: CTranslate2 int8 quantization

Limitations

  • Low-resource languages (33 newly added) have less training data and lower translation quality than high-resource languages
  • Performance varies by script: languages with multiple script variants (e.g., Kashmiri and Manipuri) may show different quality levels across scripts
  • Training data is primarily from news, government, and religious domains; informal or conversational text may translate differently
  • Very short sentences (1-5 words) are harder for the model due to limited context