OneNMT v3b — Indic Multilingual Neural Machine Translation
OneNMT v3b is a multilingual sequence-to-sequence translation model supporting 22 Indic languages plus English (23 languages in total, covering 30+ language codes including script variants). It is based on the MBart architecture, trained with fairseq, and converted to HuggingFace format for easy deployment.
⚠️ Important: This model uses a custom fairseq vocabulary (fairseq_dict.json) and a custom SentencePiece model (onemtv3b_spm.model). The standard HuggingFace MBart tokenizer will not work. Use the provided hf_inference.py script.
Supported Languages
| Short Code | Language | Script | FLORES-200 Code |
|---|---|---|---|
| eng | English | Latin | eng_Latn |
| hin | Hindi | Devanagari | hin_Deva |
| tel | Telugu | Telugu | tel_Telu |
| tam | Tamil | Tamil | tam_Taml |
| mal | Malayalam | Malayalam | mal_Mlym |
| kan | Kannada | Kannada | kan_Knda |
| ben | Bengali | Bengali | ben_Beng |
| guj | Gujarati | Gujarati | guj_Gujr |
| mar | Marathi | Devanagari | mar_Deva |
| pan | Punjabi | Gurmukhi | pan_Guru |
| urd | Urdu | Arabic | urd_Arab |
| asm | Assamese | Bengali | asm_Beng |
| npi | Nepali | Devanagari | npi_Deva |
| ory | Odia | Odia | ory_Orya |
| san | Sanskrit | Devanagari | san_Deva |
| mai | Maithili | Devanagari | mai_Deva |
| brx | Bodo | Devanagari | brx_Deva |
| doi | Dogri | Devanagari | doi_Deva |
| gom | Konkani | Devanagari | gom_Deva |
| mni | Meitei | Bengali | mni_Beng |
| sat | Santali | Ol Chiki | sat_Olck |
| kas | Kashmiri | Arabic | kas_Arab |
| snd | Sindhi | Arabic | snd_Arab |
Script variants (e.g. kas_deva, mni_mtei, snd_deva) are also supported — see hf_inference.py for the full language mapping.
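The short-code-to-FLORES mapping can be pictured as a plain dictionary. The snippet below is an illustrative subset only (the names `LANG_TO_FLORES` and `flores_tag` are assumptions for this sketch); the authoritative mapping lives in hf_inference.py.

```python
# Illustrative subset of the short-code -> FLORES-200 tag mapping.
# The full mapping, including all script variants, is defined in hf_inference.py.
LANG_TO_FLORES = {
    "eng": "eng_Latn",
    "hin": "hin_Deva",
    "tel": "tel_Telu",
    "kas": "kas_Arab",       # default script
    "kas_deva": "kas_Deva",  # script variant
}

def flores_tag(code: str) -> str:
    """Resolve a short language code to its FLORES-200 tag."""
    try:
        return LANG_TO_FLORES[code]
    except KeyError:
        raise ValueError(f"Unsupported language code: {code}")
```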
Installation
pip install -r requirements.txt
Usage
Single Sentence
from hf_inference import translate_onemt
# English → Telugu
result = translate_onemt("Hello, how are you?", "eng", "tel")
print(result) # హలో, మీరు ఎలా ఉన్నారు?
# Hindi (code-mixed) → Telugu
result = translate_onemt("मुझे meeting attend करनी है।", "hin", "tel")
print(result)
# Telugu → English
result = translate_onemt("నమస్కారం, మీరు ఎలా ఉన్నారు?", "tel", "eng")
print(result)
Batch Translation
from hf_inference import translate_batch
sentences = [
    "Good morning.",
    "Thank you very much.",
    "The train arrives at six o'clock.",
]
results = translate_batch(sentences, sl="eng", tl="tel", batch_size=32)
for src, tgt in zip(sentences, results):
    print(f"{src} → {tgt}")
Long Documents
Texts longer than 200 words are automatically chunked into 100-word parts and translated sequentially — no special handling needed.
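The chunking rule above can be sketched roughly as follows. This is a simplified illustration (the helper name `chunk_text` is an assumption), not the exact logic in hf_inference.py:

```python
def chunk_text(text: str, limit: int = 200, chunk_size: int = 100) -> list[str]:
    """Split text at word boundaries into ~chunk_size-word parts,
    but only once it exceeds `limit` words."""
    words = text.split()
    if len(words) <= limit:
        return [text]
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)]

# Each chunk would then be translated independently and the results joined:
#   translated = " ".join(translate_onemt(c, "eng", "tel") for c in chunk_text(doc))
```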
How It Works
The model uses a fairseq-style language tag prepended to the source text to control translation direction:
###hin_Deva-to-tel_Telu### मुझे meeting attend करनी है।
This is handled automatically by hf_inference.py. The tokenization pipeline is:
source text
→ prepend ###src-to-tgt### tag
→ SentencePiece (onemtv3b_spm.model) → subword pieces
→ look up fairseq dictionary IDs (fairseq_dict.json) ← NOT SPM-native IDs
→ feed to MBart encoder
The fairseq dict IDs and SPM-native IDs differ completely. Using the wrong vocab mapping will produce garbage output.
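In code, the pipeline above amounts to roughly the following. This is a sketch: the `encode` helper and the fallback `unk_id=3` (the standard fairseq `<unk>` index) are assumptions; only the file names come from this repository.

```python
def encode(text: str, src: str, tgt: str, sp, fairseq_dict: dict,
           unk_id: int = 3) -> list[int]:
    """Prepend the direction tag, split into SentencePiece pieces, then map
    each piece through the fairseq dictionary -- NOT sp.piece_to_id(),
    whose native IDs differ completely."""
    tagged = f"###{src}-to-{tgt}### {text}"
    pieces = sp.encode(tagged, out_type=str)  # subword pieces as strings
    return [fairseq_dict.get(piece, unk_id) for piece in pieces]

# With the real assets:
#   import json, sentencepiece as spm
#   sp = spm.SentencePieceProcessor(model_file="onemtv3b_spm.model")
#   fairseq_dict = json.load(open("fairseq_dict.json"))
#   input_ids = encode("मुझे meeting attend करनी है।", "hin_Deva", "tel_Telu",
#                      sp, fairseq_dict)
```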
Generation Config
| Parameter | Value |
|---|---|
| Beam size | 5 |
| Max new tokens | 256 |
| decoder_start_token_id | 2 (EOS — fairseq convention) |
| no_repeat_ngram_size | 3 |
| repetition_penalty | 1.3 |
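These defaults are stored in generation_config.json. Passed explicitly to transformers' `generate()`, they would look like the sketch below (the model-loading lines are commented out and assume a standard MBartForConditionalGeneration checkpoint plus correctly mapped input IDs):

```python
# Generation defaults as keyword arguments for model.generate().
GENERATION_KWARGS = dict(
    num_beams=5,               # beam size
    max_new_tokens=256,
    decoder_start_token_id=2,  # EOS, per fairseq convention
    no_repeat_ngram_size=3,
    repetition_penalty=1.3,
)

# from transformers import MBartForConditionalGeneration
# model = MBartForConditionalGeneration.from_pretrained("vishnu-vizz/onemtv3b")
# output_ids = model.generate(input_ids, **GENERATION_KWARGS)
```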
Repository Files
| File | Description |
|---|---|
| model.safetensors | Model weights |
| config.json | MBart model config |
| generation_config.json | Default generation parameters |
| fairseq_dict.json | Custom vocab mapping — required for inference |
| onemtv3b_spm.model | SentencePiece tokenizer — required for inference |
| hf_inference.py | Inference script with translate_onemt() and translate_batch() |
| sentencepiece.bpe.model | Standard MBart50 SPM (not used for inference) |
| tokenizer_config.json | HF tokenizer config (do not use directly) |
| requirements.txt | Python dependencies |
Conversion Notes
This model was converted from a fairseq checkpoint using custom conversion scripts (fix_weight_tying.py, fix_vocab.py). The key challenge during conversion was preserving the fairseq vocabulary ordering, since the model's embedding matrix is indexed by fairseq dict IDs — not SPM-native IDs.
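One way to picture the ordering requirement is as a row permutation: every embedding row must sit at the index the fairseq dictionary assigns to its piece, not the SPM-native index. The helper below is a hypothetical sketch (not the actual fix_vocab.py code) using plain lists, and assumes the two vocabularies contain the same pieces:

```python
def reorder_embeddings(embeddings, spm_vocab, fairseq_dict):
    """Reorder embedding rows so that row i corresponds to fairseq ID i.

    embeddings   : list of rows, indexed by SPM-native ID
    spm_vocab    : list mapping SPM-native ID -> piece string
    fairseq_dict : dict mapping piece string -> fairseq ID
    """
    reordered = [None] * len(embeddings)
    for spm_id, piece in enumerate(spm_vocab):
        reordered[fairseq_dict[piece]] = embeddings[spm_id]
    return reordered
```

Skipping this permutation leaves every token pointing at another token's embedding, which is exactly the "garbage output" failure mode described above.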
Limitations
- Chunking at word boundaries (every 100 words) may lose cross-sentence context for very long documents.
- no_repeat_ngram_size=3 and repetition_penalty=1.3 may occasionally hurt quality on morphologically rich languages — tune as needed.
- The standard HuggingFace MBart tokenizer is incompatible with this model.
Base Model
facebook/mbart-large-50