---
language:
- multilingual
pipeline_tag: translation
tags:
- universal-translation
- nmt
- transformer
- encoder-decoder
- pytorch
license: apache-2.0
datasets:
- code-with-zeeshan/UTS-Datasets
library_name: universal-translation-system
---

# Universal Translation System

A compact, production-ready multilingual neural machine translation model supporting **20 languages** (190 language pairs). Trained on curated OPUS-100 data with synthetic augmentation, knowledge distillation, and neural quality filtering.

## Model Architecture

| Component | Configuration |
|-----------|--------------|
| Encoder | 6-layer Transformer, 512 hidden dim, 8 heads |
| Decoder | 8-layer Transformer, 768 hidden dim, 12 heads |
| Vocab | 32K tokens, script-grouped (latin, cjk, arabic, devanagari, cyrillic, thai) |
| Params | ~40MB (compact), ~150M total |
| Precision | BF16 mixed-precision training |

## Supported Languages

| Group | Languages |
|-------|-----------|
| Latin | en, es, fr, de, it, pt, nl, sv, pl, id, vi, tr |
| CJK | zh, ja, ko |
| Arabic | ar |
| Devanagari | hi |
| Cyrillic | ru, uk |
| Thai | th |

## Usage

### Via the CLI (`uts`)

```bash
# Start the translation server
uts serve --config config/base.yaml

# Translate a sentence
curl -X POST http://localhost:8000/translate \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello world", "source": "en", "target": "es"}'
```

### Via Python

```python
from runtime.encoder.universal_encoder import UniversalEncoder
from runtime.cloud_decoder import OptimizedUniversalDecoder

encoder = UniversalEncoder.from_pretrained("code-with-zeeshan/Universal-Translation-System")
decoder = OptimizedUniversalDecoder.from_pretrained("code-with-zeeshan/Universal-Translation-System")

# See docs/API.md for full inference examples
```

### Via Hugging Face Hub

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained("code-with-zeeshan/Universal-Translation-System")
tokenizer = AutoTokenizer.from_pretrained("code-with-zeeshan/Universal-Translation-System")
```

## Training

The model was trained using the [Universal Translation System](https://github.com/code-with-zeeshan/universal-translation-system) pipeline:

1. **Data pipeline**: OPUS-100 download, sampling, augmentation (false friends, idioms, backtranslation), COMET quality filtering
2. **Knowledge distillation**: NLLB-3.3B teacher → compact student
3. **Vocabulary**: Script-grouped SentencePiece tokenizer (32K per group)
4. **Training**: BF16 mixed-precision, dynamic batch sizing, gradient checkpointing. ~10 epochs with cosine LR schedule.

## Evaluation

| Metric | Score |
|--------|-------|
| BLEU (average across 190 pairs) | *Coming soon* |
| COMET (average) | *Coming soon* |

## Files

- `encoder/`: Universal encoder weights
- `decoder/`: Optimized decoder weights
- `vocab/`: Script-grouped vocabulary packs
- `config.yaml`: Training configuration used for this model

## License

Apache 2.0
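
## Example: Python client for the `/translate` endpoint

The `curl` request in the Usage section can also be sent from Python using only the standard library. The following is a minimal sketch, not part of the project's API: the helper names `build_translate_request` and `translate` are hypothetical, the host and port are taken from the `curl` example, and the response is assumed to be JSON (its schema is not documented here, so it is returned unmodified).

```python
import json
import urllib.request

# Assumption: the server started by `uts serve` listens on localhost:8000,
# as in the curl example from the Usage section.
DEFAULT_BASE_URL = "http://localhost:8000"


def build_translate_request(text, source, target, base_url=DEFAULT_BASE_URL):
    """Build a POST request whose JSON body mirrors the curl example."""
    payload = {"text": text, "source": source, "target": target}
    return urllib.request.Request(
        url=f"{base_url}/translate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def translate(text, source, target, base_url=DEFAULT_BASE_URL, timeout=30.0):
    """Send the request and return the decoded response.

    Assumption: the server replies with a JSON document. No field names
    are assumed, so the parsed object is returned as-is.
    """
    request = build_translate_request(text, source, target, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With a server running (`uts serve --config config/base.yaml`), calling `translate("Hello world", "en", "es")` would send the same request as the `curl` command. Keeping request construction separate from sending makes the payload easy to inspect without a live server.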