OneNMT v3b — Indic Multilingual Neural Machine Translation
OneNMT v3b is a multilingual sequence-to-sequence translation model supporting 22 Indic languages plus English (23 languages in total, covering 30+ language codes including script variants). It is based on the MBart architecture, trained with fairseq, and converted to HuggingFace format for easy deployment.
⚠️ Important: This model uses a custom fairseq vocabulary (fairseq_dict.json) and a custom SentencePiece model (onemtv3b_spm.model). The standard HuggingFace MBart tokenizer will not work. Use the provided hf_inference.py script.
Supported Languages
| Short Code | Language | Script | FLORES-200 Code |
|---|---|---|---|
| eng | English | Latin | eng_Latn |
| hin | Hindi | Devanagari | hin_Deva |
| tel | Telugu | Telugu | tel_Telu |
| tam | Tamil | Tamil | tam_Taml |
| mal | Malayalam | Malayalam | mal_Mlym |
| kan | Kannada | Kannada | kan_Knda |
| ben | Bengali | Bengali | ben_Beng |
| guj | Gujarati | Gujarati | guj_Gujr |
| mar | Marathi | Devanagari | mar_Deva |
| pan | Punjabi | Gurmukhi | pan_Guru |
| urd | Urdu | Arabic | urd_Arab |
| asm | Assamese | Bengali | asm_Beng |
| npi | Nepali | Devanagari | npi_Deva |
| ory | Odia | Odia | ory_Orya |
| san | Sanskrit | Devanagari | san_Deva |
| mai | Maithili | Devanagari | mai_Deva |
| brx | Bodo | Devanagari | brx_Deva |
| doi | Dogri | Devanagari | doi_Deva |
| gom | Konkani | Devanagari | gom_Deva |
| mni | Meitei | Bengali | mni_Beng |
| sat | Santali | Ol Chiki | sat_Olck |
| kas | Kashmiri | Arabic | kas_Arab |
| snd | Sindhi | Arabic | snd_Arab |
Script variants (e.g. kas_deva, mni_mtei, snd_deva) are also supported — see hf_inference.py for the full language mapping.
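The short-code-to-FLORES mapping can be pictured as a plain dictionary. The snippet below is an illustrative subset only (the names `LANG_TO_FLORES` and `flores_tag` are assumptions for this sketch); the authoritative mapping lives in hf_inference.py.

```python
# Illustrative subset of the short-code -> FLORES-200 tag mapping.
# The full mapping, including all script variants, is defined in hf_inference.py.
LANG_TO_FLORES = {
    "eng": "eng_Latn",
    "hin": "hin_Deva",
    "tel": "tel_Telu",
    "kas": "kas_Arab",       # default script
    "kas_deva": "kas_Deva",  # script variant
}

def flores_tag(code: str) -> str:
    """Resolve a short language code to its FLORES-200 tag."""
    try:
        return LANG_TO_FLORES[code]
    except KeyError:
        raise ValueError(f"Unsupported language code: {code}")
```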
Installation
pip install -r requirements.txt
Usage
Single Sentence
from hf_inference import translate_onemt
# English → Telugu
result = translate_onemt("Hello, how are you?", "eng", "tel")
print(result) # హలో, మీరు ఎలా ఉన్నారు?
# Hindi (code-mixed) → Telugu
result = translate_onemt("मुझे meeting attend करनी है।", "hin", "tel")
print(result)
# Telugu → English
result = translate_onemt("నమస్కారం, మీరు ఎలా ఉన్నారు?", "tel", "eng")
print(result)
Batch Translation
from hf_inference import translate_batch
sentences = [
    "Good morning.",
    "Thank you very much.",
    "The train arrives at six o'clock.",
]
results = translate_batch(sentences, sl="eng", tl="tel", batch_size=32)
for src, tgt in zip(sentences, results):
    print(f"{src} → {tgt}")
Long Documents
Texts longer than 200 words are automatically chunked into 100-word parts and translated sequentially — no special handling needed.
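The chunking rule above can be sketched roughly as follows. This is a simplified illustration (the helper name `chunk_text` is an assumption), not the exact logic in hf_inference.py:

```python
def chunk_text(text: str, limit: int = 200, chunk_size: int = 100) -> list[str]:
    """Split text at word boundaries into ~chunk_size-word parts,
    but only once it exceeds `limit` words."""
    words = text.split()
    if len(words) <= limit:
        return [text]
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)]

# Each chunk would then be translated independently and the results joined:
#   translated = " ".join(translate_onemt(c, "eng", "tel") for c in chunk_text(doc))
```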
How It Works
The model uses a fairseq-style language tag prepended to the source text to control translation direction:
###hin_Deva-to-tel_Telu### मुझे meeting attend करनी है।
This is handled automatically by hf_inference.py. The tokenization pipeline is:
source text
→ prepend ###src-to-tgt### tag
→ SentencePiece (onemtv3b_spm.model) → subword pieces
→ look up fairseq dictionary IDs (fairseq_dict.json) ← NOT SPM-native IDs
→ feed to MBart encoder
The fairseq dict IDs and SPM-native IDs differ completely. Using the wrong vocab mapping will produce garbage output.
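In code, the pipeline above amounts to roughly the following. This is a sketch: the `encode` helper and the fallback `unk_id=3` (the standard fairseq `<unk>` index) are assumptions; only the file names come from this repository.

```python
def encode(text: str, src: str, tgt: str, sp, fairseq_dict: dict,
           unk_id: int = 3) -> list[int]:
    """Prepend the direction tag, split into SentencePiece pieces, then map
    each piece through the fairseq dictionary -- NOT sp.piece_to_id(),
    whose native IDs differ completely."""
    tagged = f"###{src}-to-{tgt}### {text}"
    pieces = sp.encode(tagged, out_type=str)  # subword pieces as strings
    return [fairseq_dict.get(piece, unk_id) for piece in pieces]

# With the real assets:
#   import json, sentencepiece as spm
#   sp = spm.SentencePieceProcessor(model_file="onemtv3b_spm.model")
#   fairseq_dict = json.load(open("fairseq_dict.json"))
#   input_ids = encode("मुझे meeting attend करनी है।", "hin_Deva", "tel_Telu",
#                      sp, fairseq_dict)
```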
Generation Config
| Parameter | Value |
|---|---|
| Beam size | 5 |
| Max new tokens | 256 |
| decoder_start_token_id | 2 (EOS — fairseq convention) |
| no_repeat_ngram_size | 3 |
| repetition_penalty | 1.3 |
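These defaults are stored in generation_config.json. Passed explicitly to transformers' `generate()`, they would look like the sketch below (the model-loading lines are commented out and assume a standard MBartForConditionalGeneration checkpoint plus correctly mapped input IDs):

```python
# Generation defaults as keyword arguments for model.generate().
GENERATION_KWARGS = dict(
    num_beams=5,               # beam size
    max_new_tokens=256,
    decoder_start_token_id=2,  # EOS, per fairseq convention
    no_repeat_ngram_size=3,
    repetition_penalty=1.3,
)

# from transformers import MBartForConditionalGeneration
# model = MBartForConditionalGeneration.from_pretrained("vishnu-vizz/onemtv3b")
# output_ids = model.generate(input_ids, **GENERATION_KWARGS)
```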
Repository Files
| File | Description |
|---|---|
| model.safetensors | Model weights |
| config.json | MBart model config |
| generation_config.json | Default generation parameters |
| fairseq_dict.json | Custom vocab mapping — required for inference |
| onemtv3b_spm.model | SentencePiece tokenizer — required for inference |
| hf_inference.py | Inference script with translate_onemt() and translate_batch() |
| sentencepiece.bpe.model | Standard MBart50 SPM (not used for inference) |
| tokenizer_config.json | HF tokenizer config (do not use directly) |
| requirements.txt | Python dependencies |
Conversion Notes
This model was converted from a fairseq checkpoint using custom conversion scripts (fix_weight_tying.py, fix_vocab.py). The key challenge during conversion was preserving the fairseq vocabulary ordering, since the model's embedding matrix is indexed by fairseq dict IDs — not SPM-native IDs.
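One way to picture the ordering requirement is as a row permutation: every embedding row must sit at the index the fairseq dictionary assigns to its piece, not the SPM-native index. The helper below is a hypothetical sketch (not the actual fix_vocab.py code) using plain lists, and assumes the two vocabularies contain the same pieces:

```python
def reorder_embeddings(embeddings, spm_vocab, fairseq_dict):
    """Reorder embedding rows so that row i corresponds to fairseq ID i.

    embeddings   : list of rows, indexed by SPM-native ID
    spm_vocab    : list mapping SPM-native ID -> piece string
    fairseq_dict : dict mapping piece string -> fairseq ID
    """
    reordered = [None] * len(embeddings)
    for spm_id, piece in enumerate(spm_vocab):
        reordered[fairseq_dict[piece]] = embeddings[spm_id]
    return reordered
```

Skipping this permutation leaves every token pointing at another token's embedding, which is exactly the "garbage output" failure mode described above.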
Limitations
- Chunking at word boundaries (every 100 words) may lose cross-sentence context for very long documents.
- no_repeat_ngram_size=3 and repetition_penalty=1.3 may occasionally hurt quality on morphologically rich languages — tune as needed.
- The standard HuggingFace MBart tokenizer is incompatible with this model.
Base Model
facebook/mbart-large-50