OneNMT v3b — Indic Multilingual Neural Machine Translation

OneNMT v3b is a multilingual sequence-to-sequence translation model covering English and 22 Indic languages (23 languages in total, and 30+ language codes once script variants are counted). It is based on the MBart architecture, was trained with fairseq, and has been converted to HuggingFace format for easy deployment.

⚠️ Important: This model uses a custom fairseq vocabulary (fairseq_dict.json) and a custom SentencePiece model (onemtv3b_spm.model). The standard HuggingFace MBart tokenizer will not work. Use the provided hf_inference.py script.


Supported Languages

| Short Code | Language | Script | FLORES-200 Code |
|------------|----------|--------|-----------------|
| eng | English | Latin | eng_Latn |
| hin | Hindi | Devanagari | hin_Deva |
| tel | Telugu | Telugu | tel_Telu |
| tam | Tamil | Tamil | tam_Taml |
| mal | Malayalam | Malayalam | mal_Mlym |
| kan | Kannada | Kannada | kan_Knda |
| ben | Bengali | Bengali | ben_Beng |
| guj | Gujarati | Gujarati | guj_Gujr |
| mar | Marathi | Devanagari | mar_Deva |
| pan | Punjabi | Gurmukhi | pan_Guru |
| urd | Urdu | Arabic | urd_Arab |
| asm | Assamese | Bengali | asm_Beng |
| npi | Nepali | Devanagari | npi_Deva |
| ory | Odia | Odia | ory_Orya |
| san | Sanskrit | Devanagari | san_Deva |
| mai | Maithili | Devanagari | mai_Deva |
| brx | Bodo | Devanagari | brx_Deva |
| doi | Dogri | Devanagari | doi_Deva |
| gom | Konkani | Devanagari | gom_Deva |
| mni | Meitei | Bengali | mni_Beng |
| sat | Santali | Ol Chiki | sat_Olck |
| kas | Kashmiri | Arabic | kas_Arab |
| snd | Sindhi | Arabic | snd_Arab |

Script variants (e.g. kas_deva, mni_mtei, snd_deva) are also supported — see hf_inference.py for the full language mapping.
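A minimal sketch of how such a code mapping can be resolved. The names `FLORES_CODES`, `SCRIPT_VARIANTS`, and `resolve_code` are hypothetical illustrations; the authoritative mapping lives in hf_inference.py.

```python
# Hypothetical sketch of the short-code → FLORES-200 mapping (the real
# mapping is defined in hf_inference.py).
FLORES_CODES = {
    "eng": "eng_Latn",
    "hin": "hin_Deva",
    "tel": "tel_Telu",
    # ... remaining languages from the table above ...
}

# Script variants map the same language to an alternative script code.
SCRIPT_VARIANTS = {
    "kas_deva": "kas_Deva",
    "mni_mtei": "mni_Mtei",
    "snd_deva": "snd_Deva",
}

def resolve_code(code: str) -> str:
    """Resolve a user-facing language code to a FLORES-200 code."""
    return SCRIPT_VARIANTS.get(code, FLORES_CODES.get(code, code))
```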


Installation

```shell
pip install -r requirements.txt
```

Usage

Single Sentence

```python
from hf_inference import translate_onemt

# English → Telugu
result = translate_onemt("Hello, how are you?", "eng", "tel")
print(result)  # హలో, మీరు ఎలా ఉన్నారు?

# Hindi (code-mixed) → Telugu
result = translate_onemt("मुझे meeting attend करनी है।", "hin", "tel")
print(result)

# Telugu → English
result = translate_onemt("నమస్కారం, మీరు ఎలా ఉన్నారు?", "tel", "eng")
print(result)
```

Batch Translation

```python
from hf_inference import translate_batch

sentences = [
    "Good morning.",
    "Thank you very much.",
    "The train arrives at six o'clock.",
]

results = translate_batch(sentences, sl="eng", tl="tel", batch_size=32)
for src, tgt in zip(sentences, results):
    print(f"{src} → {tgt}")
```

Long Documents

Texts longer than 200 words are automatically chunked into 100-word parts and translated sequentially — no special handling needed.
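A minimal sketch of the word-boundary chunking described above, assuming the thresholds stated (200-word trigger, 100-word chunks); the function name `chunk_text` is hypothetical, and the actual implementation is inside hf_inference.py.

```python
def chunk_text(text: str, threshold: int = 200, chunk_size: int = 100) -> list:
    """Split text into chunk_size-word parts once it exceeds threshold words."""
    words = text.split()
    if len(words) <= threshold:
        return [text]  # short texts pass through untouched
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]
```

Each chunk would then be translated sequentially and the outputs joined.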


How It Works

The model uses a fairseq-style language tag prepended to the source text to control translation direction:

```
###hin_Deva-to-tel_Telu### मुझे meeting attend करनी है।
```

This is handled automatically by hf_inference.py. The tokenization pipeline is:

```
source text
  → prepend ###src-to-tgt### tag
  → SentencePiece (onemtv3b_spm.model) → subword pieces
  → look up fairseq dictionary IDs (fairseq_dict.json)   ← NOT SPM-native IDs
  → feed to MBart encoder
```

The fairseq dict IDs and SPM-native IDs differ completely. Using the wrong vocab mapping will produce garbage output.
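The pipeline above can be sketched with a toy vocabulary. The helper names (`make_tag`, `encode`) and the UNK ID are hypothetical; `encode_pieces` stands in for SentencePiece's encode-as-pieces on onemtv3b_spm.model, and `fairseq_dict` for the piece-to-ID mapping in fairseq_dict.json.

```python
def make_tag(src_flores: str, tgt_flores: str) -> str:
    # Build the fairseq-style direction tag, e.g. ###hin_Deva-to-tel_Telu###
    return f"###{src_flores}-to-{tgt_flores}###"

def encode(text, src_flores, tgt_flores, encode_pieces, fairseq_dict, unk_id=3):
    tagged = f"{make_tag(src_flores, tgt_flores)} {text}"
    pieces = encode_pieces(tagged)                         # SPM subword pieces
    return [fairseq_dict.get(p, unk_id) for p in pieces]   # fairseq dict IDs

# Toy demo: whitespace "tokenizer" and a tiny fairseq-style dict
toy_dict = {"###eng_Latn-to-tel_Telu###": 10, "hello": 11}
ids = encode("hello world", "eng_Latn", "tel_Telu", str.split, toy_dict)
# → [10, 11, 3]  ("world" is OOV in the toy dict)
```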

Generation Config

| Parameter | Value |
|-----------|-------|
| Beam size | 5 |
| Max new tokens | 256 |
| decoder_start_token_id | 2 (EOS, fairseq convention) |
| no_repeat_ngram_size | 3 |
| repetition_penalty | 1.3 |

Repository Files

| File | Description |
|------|-------------|
| model.safetensors | Model weights |
| config.json | MBart model config |
| generation_config.json | Default generation parameters |
| fairseq_dict.json | Custom vocab mapping (required for inference) |
| onemtv3b_spm.model | SentencePiece tokenizer (required for inference) |
| hf_inference.py | Inference script with translate_onemt() and translate_batch() |
| sentencepiece.bpe.model | Standard MBart50 SPM (not used for inference) |
| tokenizer_config.json | HF tokenizer config (do not use directly) |
| requirements.txt | Python dependencies |

Conversion Notes

This model was converted from a fairseq checkpoint using custom conversion scripts (fix_weight_tying.py, fix_vocab.py). The key challenge during conversion was preserving the fairseq vocabulary ordering, since the model's embedding matrix is indexed by fairseq dict IDs — not SPM-native IDs.
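A toy illustration of why the vocabulary ordering matters: embedding rows stored in SPM-native order must be permuted into fairseq-dict order, or every token looks up the wrong vector. The function name `reorder_embeddings` is hypothetical and not part of the actual conversion scripts.

```python
def reorder_embeddings(embeddings, spm_vocab, fairseq_vocab):
    """Re-index embedding rows from SPM-native order to fairseq-dict order.

    embeddings: list of row vectors, where row i belongs to spm_vocab[i].
    Returns rows such that row j belongs to fairseq_vocab[j].
    """
    spm_id = {piece: i for i, piece in enumerate(spm_vocab)}
    return [embeddings[spm_id[piece]] for piece in fairseq_vocab]

# Toy demo: three tokens whose IDs differ between the two vocabularies
spm_vocab = ["a", "b", "c"]
fairseq_vocab = ["c", "a", "b"]
emb = [[0.0], [1.0], [2.0]]   # SPM order: a→0.0, b→1.0, c→2.0
reordered = reorder_embeddings(emb, spm_vocab, fairseq_vocab)
# → [[2.0], [0.0], [1.0]]
```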


Limitations

- Chunking at word boundaries (every 100 words) may lose cross-sentence context for very long documents.
- no_repeat_ngram_size=3 and repetition_penalty=1.3 may occasionally hurt quality on morphologically rich languages; tune as needed.
- The standard HuggingFace MBart tokenizer is incompatible with this model.

Model Size

1B parameters, F32 safetensors.