OneNMT v3b — Indic Multilingual Neural Machine Translation
OneNMT v3b is a 1.1B parameter multilingual seq2seq translation model supporting 36 × 36 language pairs across 23 Indian languages and English. It is based on a deep transformer_18_18 architecture (18 encoder + 18 decoder layers) trained with fairseq and converted to the HuggingFace MBart format.
⚠️ Tokenization Warning: OneNMT v3b uses a custom fairseq dictionary (
fairseq_dict.json) for token ID mapping. UsingAutoTokenizeror any standard MBart tokenizer will produce wrong IDs and garbled output. Always use the providedhf_inference.pyor thetranslate_onemt()wrapper.
Architecture
| Property | Value |
|---|---|
| Architecture | transformer_18_18 (MBartForConditionalGeneration wrapper) |
| Encoder / Decoder layers | 18 + 18 |
| Hidden size | 1024 |
| Attention heads | 16 |
| Parameters | ~1.1B |
| Positional embeddings | Sinusoidal (not learned), max 256 positions |
| Normalization | Pre-norm (normalize_before=True) |
| Vocabulary | 65,400 tokens (fairseq dictionary ordering) |
| Language tag format | ###src_flores-to-tgt_flores### {text} |
| Decoder start token | EOS (id=2) — fairseq convention |
Supported Languages
| Short Code | Language | Script | FLORES-200 Code |
|---|---|---|---|
eng |
English | Latin | eng_Latn |
hin |
Hindi | Devanagari | hin_Deva |
tel |
Telugu | Telugu | tel_Telu |
tam |
Tamil | Tamil | tam_Taml |
mal |
Malayalam | Malayalam | mal_Mlym |
kan |
Kannada | Kannada | kan_Knda |
ben |
Bengali | Bengali | ben_Beng |
guj |
Gujarati | Gujarati | guj_Gujr |
mar |
Marathi | Devanagari | mar_Deva |
pan |
Punjabi | Gurmukhi | pan_Guru |
urd |
Urdu | Arabic | urd_Arab |
asm |
Assamese | Bengali | asm_Beng |
npi |
Nepali | Devanagari | npi_Deva |
ory |
Odia | Odia | ory_Orya |
san |
Sanskrit | Devanagari | san_Deva |
mai |
Maithili | Devanagari | mai_Deva |
brx |
Bodo | Devanagari | brx_Deva |
doi |
Dogri | Devanagari | doi_Deva |
gom |
Konkani | Devanagari | gom_Deva |
mni |
Meitei | Bengali | mni_Beng |
sat |
Santali | Ol Chiki | sat_Olck |
kas |
Kashmiri | Arabic | kas_Arab |
snd |
Sindhi | Arabic | snd_Arab |
Installation
pip install -r requirements.txt
Quick Start
from hf_inference import translate_onemt, translate_batch
# Single sentence — English → Telugu
print(translate_onemt("Hello, how are you?", "eng", "tel"))
# హలో, మీరు ఎలా ఉన్నారు?
# Code-mixed Hindi → Telugu
print(translate_onemt("मुझे meeting attend करनी है।", "hin", "tel"))
# నాకు మీటింగ్కు హాజరు కావాలి.
# Telugu → English
print(translate_onemt("నమస్కారం, మీరు ఎలా ఉన్నారు?", "tel", "eng"))
# Hello, how are you?
# Tamil → Hindi
print(translate_onemt("இன்று வானிலை நன்றாக உள்ளது.", "tam", "hin"))
# आज मौसम अच्छा है।
# Batch translation
sentences = ["Good morning.", "Thank you.", "See you tomorrow."]
results = translate_batch(sentences, sl="eng", tl="hin", batch_size=32)
for src, tgt in zip(sentences, results):
print(f"{src} → {tgt}")
How Tokenization Works
Unlike standard MBart, OneNMT v3b uses a two-step tokenization pipeline:
source text
→ prepend language tag: ###eng_Latn-to-tel_Telu### {text}
→ SentencePiece encoding → subword pieces
→ fairseq dict lookup → integer IDs ← NOT SPM-native IDs
→ MBart encoder
The fairseq dictionary orders tokens by training frequency (descending), not SPM's internal ordering. This means the same subword piece maps to completely different IDs in the two systems — for example, T is ID 29 in the fairseq dict but ID 35376 via the HF tokenizer. Wrong IDs → wrong embeddings → garbled output.
The hf_inference.py script handles this correctly using onemtv3b_spm.model for segmentation and fairseq_dict.json for ID lookup.
Repository Files
| File | Description |
|---|---|
model.safetensors |
Model weights |
config.json |
HF model architecture config |
generation_config.json |
Default generation parameters |
fairseq_dict.json |
Required — fairseq vocab ID mapping (piece → ID) |
onemtv3b_spm.model |
Required — SentencePiece model for subword segmentation |
hf_inference.py |
Ready-to-use inference script |
tokenizer.json,tokenizer_config.json,special_tokens_map.jsonare present for HF Hub format compliance but must not be used for inference — they produce incorrect token IDs due to the SPM vs fairseq ordering mismatch described above.
Generation Parameters
| Parameter | Value | Notes |
|---|---|---|
decoder_start_token_id |
2 | EOS token — fairseq convention (not BOS) |
forced_bos_token_id |
None | No forced BOS |
num_beams |
5 | Beam search |
max_new_tokens |
256 | Matches model's trained max position |
no_repeat_ngram_size |
3 | Reduces repetition |
repetition_penalty |
1.3 | Additional repetition control |
Citation
@article{onenmt2024,
title = {OneNMT: A Unified Multilingual Neural Machine Translation System for Indian Languages},
year = {2024},
url = {https://arxiv.org/pdf/2412.04351}
}
License
MIT
- Downloads last month
- 10