---
language:
- multilingual
pipeline_tag: translation
tags:
- universal-translation
- nmt
- transformer
- encoder-decoder
- pytorch
license: apache-2.0
datasets:
- code-with-zeeshan/UTS-Datasets
library_name: universal-translation-system
---

# Universal Translation System

A compact, production-ready multilingual neural machine translation model supporting **20 languages** (190 language pairs). The model was trained on curated OPUS-100 data with synthetic augmentation, knowledge distillation, and neural quality filtering.

## Model Architecture

| Component | Configuration |
|-----------|---------------|
| Encoder | 6-layer Transformer, 512 hidden dim, 8 heads |
| Decoder | 8-layer Transformer, 768 hidden dim, 12 heads |
| Vocab | 32K tokens, script-grouped (latin, cjk, arabic, devanagari, cyrillic, thai) |
| Params | ~40MB (compact), ~150M total |
| Precision | BF16 mixed-precision training |

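As a rough sanity check on the size figures in the table, the parameter count can be estimated from the layer counts and hidden sizes alone. This is a back-of-the-envelope sketch, not the project's own accounting. It assumes a 4x feed-forward expansion, separate (untied) 32,000-entry embedding tables for the encoder and decoder, cross-attention running at the decoder width, and it ignores biases and layer norms. The function names are hypothetical.

```python
def encoder_layer_params(d_model: int, ffn_mult: int = 4) -> int:
    """Self-attention (Q, K, V, output projections) plus a two-layer feed-forward block."""
    attention = 4 * d_model * d_model
    feed_forward = 2 * ffn_mult * d_model * d_model
    return attention + feed_forward


def decoder_layer_params(d_model: int, ffn_mult: int = 4) -> int:
    """An encoder-style layer plus one cross-attention block (assumed at decoder width)."""
    cross_attention = 4 * d_model * d_model
    return encoder_layer_params(d_model, ffn_mult) + cross_attention


def estimate_total_params(vocab_size: int = 32_000) -> dict:
    # Layer counts and hidden sizes come from the table; everything else is assumed.
    encoder = 6 * encoder_layer_params(512)
    decoder = 8 * decoder_layer_params(768)
    embeddings = vocab_size * 512 + vocab_size * 768
    return {
        "encoder": encoder,
        "decoder": decoder,
        "embeddings": embeddings,
        "total": encoder + decoder + embeddings,
    }


if __name__ == "__main__":
    for name, count in estimate_total_params().items():
        print(f"{name:>10}: {count / 1e6:.1f}M")
```

Under these assumptions the estimate comes out at roughly 135M parameters, which is the same order of magnitude as the ~150M in the table if that figure is read as a parameter count. Any of the simplifications above could account for the gap.
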
## Supported Languages

| Group | Languages |
|-------|-----------|
| Latin | en, es, fr, de, it, pt, nl, sv, pl, id, vi, tr |
| CJK | zh, ja, ko |
| Arabic | ar |
| Devanagari | hi |
| Cyrillic | ru, uk |
| Thai | th |

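The 190-pair figure follows from counting unordered pairs over these 20 codes (20 × 19 / 2). The sketch below reproduces that count and looks up the script group for a given code. The names `LANGUAGES_BY_SCRIPT` and `script_group` are illustrative only and are not part of the project's API.

```python
from itertools import combinations

# Language codes copied from the table, grouped by script.
LANGUAGES_BY_SCRIPT = {
    "latin": ["en", "es", "fr", "de", "it", "pt", "nl", "sv", "pl", "id", "vi", "tr"],
    "cjk": ["zh", "ja", "ko"],
    "arabic": ["ar"],
    "devanagari": ["hi"],
    "cyrillic": ["ru", "uk"],
    "thai": ["th"],
}

ALL_LANGUAGES = [code for codes in LANGUAGES_BY_SCRIPT.values() for code in codes]
UNORDERED_PAIRS = list(combinations(ALL_LANGUAGES, 2))


def script_group(code: str) -> str:
    """Return the script group a language code belongs to."""
    for group, codes in LANGUAGES_BY_SCRIPT.items():
        if code in codes:
            return group
    raise ValueError(f"unsupported language code: {code!r}")


if __name__ == "__main__":
    print(len(ALL_LANGUAGES), len(UNORDERED_PAIRS))  # 20 190
```
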
## Usage

### Via the CLI (`uts`)

```bash
# Start the translation server (keep it running, e.g. in a separate terminal)
uts serve --config config/base.yaml

# Translate a sentence through the HTTP endpoint
curl -X POST http://localhost:8000/translate \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello world", "source": "en", "target": "es"}'
```

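The same request can be sent from Python using only the standard library. This is a minimal sketch that mirrors the curl call (same URL, header, and JSON fields). The helper names are hypothetical, and because the response schema is not documented here, the parsed JSON is returned unchanged.

```python
import json
import urllib.request


def build_translate_request(text, source, target, base_url="http://localhost:8000"):
    """Build a POST request equivalent to the curl command."""
    payload = json.dumps({"text": text, "source": source, "target": target}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/translate",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def translate(text, source, target, base_url="http://localhost:8000", timeout=30.0):
    """Send the request to a running server and return the decoded JSON response."""
    request = build_translate_request(text, source, target, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With `uts serve` running, `translate("Hello world", "en", "es")` returns whatever JSON the endpoint sends back.
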
### Via Python

```python
from runtime.encoder.universal_encoder import UniversalEncoder
from runtime.cloud_decoder import OptimizedUniversalDecoder

# Load the pretrained encoder and decoder weights from the Hub repository
encoder = UniversalEncoder.from_pretrained("code-with-zeeshan/Universal-Translation-System")
decoder = OptimizedUniversalDecoder.from_pretrained("code-with-zeeshan/Universal-Translation-System")
# See docs/API.md for full inference examples
```

### Via Hugging Face Hub

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained("code-with-zeeshan/Universal-Translation-System")
tokenizer = AutoTokenizer.from_pretrained("code-with-zeeshan/Universal-Translation-System")
```

## Training

The model was trained using the [Universal Translation System](https://github.com/code-with-zeeshan/universal-translation-system) pipeline:

1. **Data pipeline**: OPUS-100 download, sampling, augmentation (false friends, idioms, backtranslation), COMET quality filtering
2. **Knowledge distillation**: NLLB-3.3B teacher distilled into a compact student
3. **Vocabulary**: Script-grouped SentencePiece tokenizer (32K per group)
4. **Training**: BF16 mixed-precision, dynamic batch sizing, gradient checkpointing. ~10 epochs with cosine LR schedule.

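For readers unfamiliar with the cosine LR schedule mentioned in the training step, the sketch below shows the usual shape: the learning rate decays from a peak to a floor along half a cosine wave. The peak, floor, and step counts are placeholder values, and the optional linear warmup is an assumption; neither is taken from this model's training configuration.

```python
import math


def cosine_lr(step, total_steps, peak_lr=5e-4, min_lr=1e-6, warmup_steps=0):
    """Learning rate at `step` for optional linear warmup followed by cosine decay."""
    if warmup_steps > 0 and step < warmup_steps:
        # Linear ramp that reaches the peak on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, max(0.0, (step - warmup_steps) / decay_steps))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    # Print a few points along a hypothetical 1,000-step run.
    for step in (0, 250, 500, 750, 1000):
        print(step, f"{cosine_lr(step, total_steps=1000):.6f}")
```

The schedule starts at `peak_lr`, passes the midpoint between peak and floor halfway through, and ends at `min_lr`.
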
## Evaluation

| Metric | Score |
|--------|-------|
| BLEU (average across 190 pairs) | *Coming soon* |
| COMET (average) | *Coming soon* |

## Files

- `encoder/`: Universal encoder weights
- `decoder/`: Optimized decoder weights
- `vocab/`: Script-grouped vocabulary packs
- `config.yaml`: Training configuration used for this model

## License

Apache 2.0