	---
	language:
	- multilingual
	pipeline_tag: translation
	tags:
	- universal-translation
	- nmt
	- transformer
	- encoder-decoder
	- pytorch
	license: apache-2.0
	datasets:
	- code-with-zeeshan/UTS-Datasets
	library_name: universal-translation-system
	---

	# Universal Translation System

	A compact, production-ready multilingual neural machine translation model supporting 20 languages (190 language pairs). It was trained on curated OPUS-100 data with synthetic augmentation, knowledge distillation from an NLLB-3.3B teacher, and COMET-based quality filtering.

	## Model Architecture

	| Component | Configuration |
	|-----------|---------------|
	| Encoder | 6-layer Transformer, 512 hidden dim, 8 heads |
	| Decoder | 8-layer Transformer, 768 hidden dim, 12 heads |
	| Vocab | 32K tokens, script-grouped (latin, cjk, arabic, devanagari, cyrillic, thai) |
	| Size | ~150M parameters total; ~40 MB (compact) |
	| Precision | BF16 mixed-precision training |
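As a rough sanity check on the size figures, the layer counts and hidden sizes in the table can be turned into a back-of-the-envelope parameter estimate. The sketch below is not the author's accounting. It assumes standard Transformer blocks with a feed-forward width of 4x the hidden size, a 32,000-entry embedding table on each side, and it ignores biases, layer norms, and any projection between the 512-dim encoder and the 768-dim decoder.

```python
# Back-of-the-envelope parameter estimate for the architecture table.
# Assumptions (not stated in the model card): FFN width = 4 * hidden,
# vocab = 32_000 entries, separate encoder/decoder embedding tables,
# biases and layer norms ignored.

def attention_params(hidden: int) -> int:
    """Q, K, V and output projections, each hidden x hidden."""
    return 4 * hidden * hidden

def ffn_params(hidden: int, ffn_mult: int = 4) -> int:
    """Two linear layers: hidden -> ffn_mult * hidden -> hidden."""
    return 2 * hidden * (ffn_mult * hidden)

def encoder_params(layers: int, hidden: int) -> int:
    """Each encoder layer has self-attention and an FFN."""
    return layers * (attention_params(hidden) + ffn_params(hidden))

def decoder_params(layers: int, hidden: int) -> int:
    """Each decoder layer has self-attention, cross-attention, and an FFN."""
    return layers * (2 * attention_params(hidden) + ffn_params(hidden))

VOCAB = 32_000
enc = encoder_params(layers=6, hidden=512)
dec = decoder_params(layers=8, hidden=768)
emb = VOCAB * 512 + VOCAB * 768
total = enc + dec + emb

print(f"encoder ~{enc / 1e6:.1f}M, decoder ~{dec / 1e6:.1f}M, "
      f"embeddings ~{emb / 1e6:.1f}M, total ~{total / 1e6:.1f}M")
```

Under these assumptions the estimate comes to roughly 135M parameters, the same order of magnitude as the ~150M quoted. The remaining gap depends on details this card does not state, such as the actual feed-forward width and how the script-grouped vocabularies are embedded.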

	## Supported Languages

	| Group | Languages |
	|-------|-----------|
	| Latin | en, es, fr, de, it, pt, nl, sv, pl, id, vi, tr |
	| CJK | zh, ja, ko |
	| Arabic | ar |
	| Devanagari | hi |
	| Cyrillic | ru, uk |
	| Thai | th |
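The "20 languages, 190 language pairs" figure can be reproduced from the table: 190 is the number of unordered pairs among 20 languages. The short sketch below counts both unordered pairs and ordered translation directions. The dictionary simply copies the table, and whether every direction is actually served is not stated in this card.

```python
from itertools import combinations, permutations

# Language codes grouped by script, copied from the table above.
LANGUAGES = {
    "latin": ["en", "es", "fr", "de", "it", "pt", "nl", "sv", "pl", "id", "vi", "tr"],
    "cjk": ["zh", "ja", "ko"],
    "arabic": ["ar"],
    "devanagari": ["hi"],
    "cyrillic": ["ru", "uk"],
    "thai": ["th"],
}

codes = [code for group in LANGUAGES.values() for code in group]
pairs = list(combinations(codes, 2))       # unordered: (en, es) is the same pair as (es, en)
directions = list(permutations(codes, 2))  # ordered: en->es and es->en are counted separately

print(len(codes), len(pairs), len(directions))  # 20 190 380
```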

	## Usage

	### Via the CLI (`uts`)

	```bash
	# Start the translation server (keep it running)
	uts serve --config config/base.yaml

	# In a second terminal, translate a sentence via the HTTP endpoint
	curl -X POST http://localhost:8000/translate \
	  -H "Content-Type: application/json" \
	  -d '{"text": "Hello world", "source": "en", "target": "es"}'
	```
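The same request can be sent from Python with only the standard library. This is a minimal sketch: the URL and JSON fields come from the curl example above, while the helper names `build_translate_request` and `translate` are hypothetical. The response schema is not documented in this card, so the sketch returns the raw response body instead of parsing specific fields.

```python
import json
import urllib.request

# Endpoint and JSON fields are taken from the curl example; the helper
# names below are hypothetical and not part of the `uts` package.
TRANSLATE_URL = "http://localhost:8000/translate"

def build_translate_request(text: str, source: str, target: str,
                            url: str = TRANSLATE_URL) -> urllib.request.Request:
    """Build the same POST request that the curl command sends."""
    payload = json.dumps({"text": text, "source": source, "target": target})
    return urllib.request.Request(
        url,
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def translate(text: str, source: str, target: str) -> str:
    """Send the request and return the raw response body as text."""
    request = build_translate_request(text, source, target)
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8")
```

With the server from `uts serve` running, calling `translate("Hello world", "en", "es")` returns whatever body the endpoint responds with.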

	### Via Python

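	```python
	# Assumption: the `runtime` package comes from the universal-translation-system
	# repository and must be importable (for example, by running from the repo root).
	from runtime.encoder.universal_encoder import UniversalEncoder
	from runtime.cloud_decoder import OptimizedUniversalDecoder

	# Load the encoder and decoder weights from the Hub repository
	encoder = UniversalEncoder.from_pretrained("code-with-zeeshan/Universal-Translation-System")
	decoder = OptimizedUniversalDecoder.from_pretrained("code-with-zeeshan/Universal-Translation-System")
	# See docs/API.md for full inference examples
	```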

	### Via Hugging Face Hub

	```python
	from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

	model = AutoModelForSeq2SeqLM.from_pretrained("code-with-zeeshan/Universal-Translation-System")
	tokenizer = AutoTokenizer.from_pretrained("code-with-zeeshan/Universal-Translation-System")
	```

	## Training

	The model was trained using the [Universal Translation System](https://github.com/code-with-zeeshan/universal-translation-system) pipeline:

	1. Data pipeline — OPUS-100 download, sampling, augmentation (false friends, idioms, backtranslation), COMET quality filtering
	2. Knowledge distillation — NLLB-3.3B teacher → compact student
	3. Vocabulary — Script-grouped SentencePiece tokenizer (32K per group)
	4. Training — BF16 mixed-precision, dynamic batch sizing, gradient checkpointing. ~10 epochs with cosine LR schedule.
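The cosine LR schedule mentioned in step 4 can be sketched as below. This is a generic formulation, not the pipeline's actual code: the card does not state the peak learning rate, the minimum learning rate, the number of steps, or whether warmup is used, so every parameter here is a hypothetical placeholder and the optional linear warmup is an assumption.

```python
import math

def cosine_lr(step: int, total_steps: int, peak_lr: float,
              min_lr: float = 0.0, warmup_steps: int = 0) -> float:
    """Learning rate at `step`: optional linear warmup, then cosine decay.

    After warmup, with progress p in [0, 1]:
        lr = min_lr + 0.5 * (peak_lr - min_lr) * (1 + cos(pi * p))
    """
    if warmup_steps > 0 and step < warmup_steps:
        # Linear ramp that reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    # cos goes from 1 to -1 as progress goes from 0 to 1,
    # so the factor falls smoothly from 1 to 0.
    factor = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * factor

# Hypothetical values for illustration only.
for step in (0, 2_500, 5_000, 10_000):
    print(step, cosine_lr(step, total_steps=10_000, peak_lr=5e-4))
```

The rate starts at `peak_lr`, passes through the midpoint between `peak_lr` and `min_lr` halfway through training, and ends at `min_lr`.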

	## Evaluation

	| Metric | Score |
	|--------|-------|
	| BLEU (average across 190 pairs) | Coming soon |
	| COMET (average) | Coming soon |

	## Files

	- `encoder/` — Universal encoder weights
	- `decoder/` — Optimized decoder weights
	- `vocab/` — Script-grouped vocabulary packs
	- `config.yaml` — Training configuration used for this model

	## License

	Apache 2.0