Universal Translation System

A compact, production-ready multilingual neural machine translation model supporting 20 languages (190 language pairs). Trained on curated OPUS-100 data with synthetic augmentation, knowledge distillation, and neural quality filtering.

Model Architecture

Component Configuration
Encoder 6-layer Transformer, 512 hidden dim, 8 heads
Decoder 8-layer Transformer, 768 hidden dim, 12 heads
Vocab 32K tokens, script-grouped (latin, cjk, arabic, devanagari, cyrillic, thai)
Params ~40MB (compact), ~150M total
Precision BF16 mixed-precision training

Supported Languages

Group Languages
Latin en, es, fr, de, it, pt, nl, sv, pl, id, vi, tr
CJK zh, ja, ko
Arabic ar
Devanagari hi
Cyrillic ru, uk
Thai th

Usage

Via the CLI (uts)

# Translate a sentence
uts serve --config config/base.yaml
curl -X POST http://localhost:8000/translate \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello world", "source": "en", "target": "es"}'

Via Python

from runtime.encoder.universal_encoder import UniversalEncoder
from runtime.cloud_decoder import OptimizedUniversalDecoder

encoder = UniversalEncoder.from_pretrained("code-with-zeeshan/Universal-Translation-System")
decoder = OptimizedUniversalDecoder.from_pretrained("code-with-zeeshan/Universal-Translation-System")
# See docs/API.md for full inference examples

Via Hugging Face Hub

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained("code-with-zeeshan/Universal-Translation-System")
tokenizer = AutoTokenizer.from_pretrained("code-with-zeeshan/Universal-Translation-System")

Training

The model was trained using the Universal Translation System pipeline:

  1. Data pipeline β€” OPUS-100 download, sampling, augmentation (false friends, idioms, backtranslation), COMET quality filtering
  2. Knowledge distillation β€” NLLB-3.3B teacher β†’ compact student
  3. Vocabulary β€” Script-grouped SentencePiece tokenizer (32K per group)
  4. Training β€” BF16 mixed-precision, dynamic batch sizing, gradient checkpointing. ~10 epochs with cosine LR schedule.

Evaluation

Metric Score
BLEU (average across 190 pairs) Coming soon
COMET (average) Coming soon

Files

  • encoder/ β€” Universal encoder weights
  • decoder/ β€” Optimized decoder weights
  • vocab/ β€” Script-grouped vocabulary packs
  • config.yaml β€” Training configuration used for this model

License

Apache 2.0

Downloads last month

-

Downloads are not tracked for this model. How to track
Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support

Dataset used to train code-with-zeeshan/Universal-Translation-System