---
language:
- multilingual
pipeline_tag: translation
tags:
- universal-translation
- nmt
- transformer
- encoder-decoder
- pytorch
license: apache-2.0
datasets:
- code-with-zeeshan/UTS-Datasets
library_name: universal-translation-system
---

# Universal Translation System

A compact, production-ready multilingual neural machine translation model supporting **20 languages** (190 unordered language pairs). Trained on curated OPUS-100 data with synthetic augmentation, knowledge distillation, and neural quality filtering.

## Model Architecture

| Component | Configuration |
|-----------|--------------|
| Encoder | 6-layer Transformer, 512 hidden dim, 8 heads |
| Decoder | 8-layer Transformer, 768 hidden dim, 12 heads |
| Vocab | 32K tokens, script-grouped (latin, cjk, arabic, devanagari, cyrillic, thai) |
| Size / params | ~40 MB (compact), ~150M parameters total |
| Precision | BF16 mixed-precision training |

## Supported Languages

| Group | Languages |
|-------|-----------|
| Latin | en, es, fr, de, it, pt, nl, sv, pl, id, vi, tr |
| CJK | zh, ja, ko |
| Arabic | ar |
| Devanagari | hi |
| Cyrillic | ru, uk |
| Thai | th |
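
The count of 190 language pairs follows from the table above: 20 languages give 20 × 19 / 2 = 190 unordered pairs. A minimal sketch of that arithmetic, with the groups copied from the table; the `SCRIPT_GROUPS` name is illustrative and not part of the project's API:

```python
from itertools import combinations

# Script groups copied from the table above (the dict name is illustrative).
SCRIPT_GROUPS = {
    "latin": ["en", "es", "fr", "de", "it", "pt", "nl", "sv", "pl", "id", "vi", "tr"],
    "cjk": ["zh", "ja", "ko"],
    "arabic": ["ar"],
    "devanagari": ["hi"],
    "cyrillic": ["ru", "uk"],
    "thai": ["th"],
}

# Flatten the groups into one list of language codes.
languages = [code for group in SCRIPT_GROUPS.values() for code in group]

# Every unordered pair of distinct languages, e.g. ("en", "es").
unordered_pairs = list(combinations(languages, 2))

print(len(languages))            # 20
print(len(unordered_pairs))      # 190
print(2 * len(unordered_pairs))  # 380 if each translation direction is counted separately
```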

## Usage

### Via the CLI (`uts`)

```bash
# Start the translation server
uts serve --config config/base.yaml

# Once the server is up, translate a sentence via the HTTP API
curl -X POST http://localhost:8000/translate \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello world", "source": "en", "target": "es"}'
```
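
The same request can be sent from Python using only the standard library. This is a sketch under assumptions: the endpoint, port, and request fields are taken from the `curl` command above, the helper names `build_translate_request` and `translate` are hypothetical, and the response is assumed to be JSON (its schema is not documented here, so it is returned unmodified).

```python
import json
import urllib.request


def build_translate_request(text, source, target, base_url="http://localhost:8000"):
    """Build the same POST request as the curl command above."""
    payload = json.dumps({"text": text, "source": source, "target": target})
    return urllib.request.Request(
        f"{base_url}/translate",
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def translate(text, source, target, base_url="http://localhost:8000", timeout=30):
    """Send the request and return the decoded JSON body as-is.

    Assumption: the server answers with JSON. No particular keys are assumed.
    """
    request = build_translate_request(text, source, target, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage (requires a running `uts serve` instance):
#   print(translate("Hello world", "en", "es"))
```

Separating request construction from sending makes it possible to inspect the payload without a running server.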

### Via Python

```python
from runtime.encoder.universal_encoder import UniversalEncoder
from runtime.cloud_decoder import OptimizedUniversalDecoder

encoder = UniversalEncoder.from_pretrained("code-with-zeeshan/Universal-Translation-System")
decoder = OptimizedUniversalDecoder.from_pretrained("code-with-zeeshan/Universal-Translation-System")
# See docs/API.md for full inference examples
```

### Via Hugging Face Hub

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained("code-with-zeeshan/Universal-Translation-System")
tokenizer = AutoTokenizer.from_pretrained("code-with-zeeshan/Universal-Translation-System")
```

## Training

The model was trained using the [Universal Translation System](https://github.com/code-with-zeeshan/universal-translation-system) pipeline:

1. **Data pipeline**: OPUS-100 download, sampling, augmentation (false friends, idioms, backtranslation), COMET quality filtering
2. **Knowledge distillation**: NLLB-3.3B teacher → compact student
3. **Vocabulary**: Script-grouped SentencePiece tokenizer (32K per group)
4. **Training**: BF16 mixed-precision, dynamic batch sizing, gradient checkpointing. ~10 epochs with cosine LR schedule.
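
As an illustration of the cosine LR schedule mentioned in step 4, the sketch below computes the learning rate at a given step. Only the cosine shape comes from the list above; the peak rate, the minimum rate, the optional linear warmup, and the function name `cosine_lr` are assumptions made for the example, not the values used to train this model.

```python
import math


def cosine_lr(step, total_steps, peak_lr=5e-4, min_lr=0.0, warmup_steps=0):
    """Learning rate at `step`: optional linear warmup, then cosine decay.

    All default values are illustrative assumptions.
    """
    # Linear warmup from peak_lr / warmup_steps up to peak_lr.
    if warmup_steps > 0 and step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps

    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, max(0.0, progress))

    # Half a cosine period: peak_lr at progress 0, min_lr at progress 1.
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Print the schedule at a few points of a hypothetical 1,000-step run.
for step in (0, 250, 500, 750, 1000):
    print(step, cosine_lr(step, total_steps=1000))
```

The rate starts at `peak_lr`, passes through the midpoint between `peak_lr` and `min_lr` halfway through the run, and reaches `min_lr` at the final step.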

## Evaluation

| Metric | Score |
|--------|-------|
| BLEU (average across 190 pairs) | *Coming soon* |
| COMET (average) | *Coming soon* |

## Files

- `encoder/`: Universal encoder weights
- `decoder/`: Optimized decoder weights
- `vocab/`: Script-grouped vocabulary packs
- `config.yaml`: Training configuration used for this model

## License

Apache 2.0