language:
  - multilingual
pipeline_tag: translation
tags:
  - universal-translation
  - nmt
  - transformer
  - encoder-decoder
  - pytorch
license: apache-2.0
datasets:
  - code-with-zeeshan/UTS-Datasets
library_name: universal-translation-system

Universal Translation System

A compact, production-ready multilingual neural machine translation model supporting 20 languages (190 language pairs, or 380 translation directions when each pair is counted both ways). It was trained on curated OPUS-100 data with synthetic augmentation, knowledge distillation, and neural quality filtering.

Model Architecture

Component	Configuration
Encoder	6-layer Transformer, 512 hidden dim, 8 heads
Decoder	8-layer Transformer, 768 hidden dim, 12 heads
Vocab	32K tokens, script-grouped (latin, cjk, arabic, devanagari, cyrillic, thai)
Size	~150M parameters total; ~40 MB (compact)
Precision	BF16 mixed-precision training
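
As a rough sanity check on the table above, the layer counts and hidden sizes can be turned into a back-of-the-envelope parameter estimate. This is only a sketch under assumptions that the table does not state: a feed-forward width of 4x the hidden size, a single 32,000-entry vocabulary embedded once on each side, and no biases, layer norms, or encoder-to-decoder projection. The helper names are hypothetical and are not part of the project.

```python
def encoder_layer_params(d_model: int, ffn_mult: int = 4) -> int:
    """Approximate weights in one Transformer encoder layer (assumed layout)."""
    attention = 4 * d_model * d_model                 # Q, K, V and output projections
    feed_forward = 2 * ffn_mult * d_model * d_model   # two linear layers
    return attention + feed_forward


def decoder_layer_params(d_model: int, ffn_mult: int = 4) -> int:
    """Approximate weights in one decoder layer: self-attention,
    cross-attention and feed-forward (assumed layout)."""
    self_attention = 4 * d_model * d_model
    cross_attention = 4 * d_model * d_model
    feed_forward = 2 * ffn_mult * d_model * d_model
    return self_attention + cross_attention + feed_forward


# Values from the table: 6 x 512 encoder, 8 x 768 decoder, 32K vocabulary.
VOCAB_SIZE = 32_000  # assumption: "32K" read as 32,000 entries in total

encoder_params = 6 * encoder_layer_params(512)
decoder_params = 8 * decoder_layer_params(768)
embedding_params = VOCAB_SIZE * 512 + VOCAB_SIZE * 768
total_params = encoder_params + decoder_params + embedding_params

print(f"encoder   ~{encoder_params / 1e6:.1f}M")
print(f"decoder   ~{decoder_params / 1e6:.1f}M")
print(f"embedding ~{embedding_params / 1e6:.1f}M")
print(f"total     ~{total_params / 1e6:.1f}M")
```

Under these assumptions the estimate lands at roughly 135M parameters, the same order of magnitude as the ~150M quoted in the table. The gap would be covered by whatever the sketch leaves out, so treat the figure as a plausibility check, not a measurement.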

Supported Languages

Group	Languages
Latin	en, es, fr, de, it, pt, nl, sv, pl, id, vi, tr
CJK	zh, ja, ko
Arabic	ar
Devanagari	hi
Cyrillic	ru, uk
Thai	th
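
The grouping above can be written down as a small lookup table, which also makes it easy to confirm the language and pair counts. This is a minimal sketch: the names `SCRIPT_GROUPS` and `script_group` are hypothetical and are not taken from the project's code.

```python
from itertools import combinations

# Script groups copied from the table above.
SCRIPT_GROUPS = {
    "latin": ["en", "es", "fr", "de", "it", "pt", "nl", "sv", "pl", "id", "vi", "tr"],
    "cjk": ["zh", "ja", "ko"],
    "arabic": ["ar"],
    "devanagari": ["hi"],
    "cyrillic": ["ru", "uk"],
    "thai": ["th"],
}

# Reverse index: language code -> script group.
LANG_TO_GROUP = {
    lang: group for group, langs in SCRIPT_GROUPS.items() for lang in langs
}


def script_group(lang: str) -> str:
    """Return the script group for a language code, or raise ValueError."""
    try:
        return LANG_TO_GROUP[lang]
    except KeyError:
        raise ValueError(f"unsupported language code: {lang!r}") from None


languages = sorted(LANG_TO_GROUP)
pairs = list(combinations(languages, 2))  # unordered pairs, e.g. ("en", "es")

print(len(languages), len(pairs))  # 20 190
print(script_group("ja"))          # cjk
```

Counting each pair in both directions doubles the figure to 380 translation directions.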

Usage

Via the CLI (`uts`)

# Start the translation server
uts serve --config config/base.yaml

# Then, from another terminal, translate a sentence
curl -X POST http://localhost:8000/translate \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello world", "source": "en", "target": "es"}'
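
The same request can be sent from Python using only the standard library. The sketch below mirrors the curl call above (same URL, header, and JSON fields). The function names are hypothetical, and because the response schema is not documented here, the body is returned as raw text without being parsed into specific fields.

```python
import json
import urllib.request

# Endpoint taken from the curl example above.
TRANSLATE_URL = "http://localhost:8000/translate"


def build_translate_request(text: str, source: str, target: str,
                            url: str = TRANSLATE_URL) -> urllib.request.Request:
    """Build the POST request that the curl example sends."""
    payload = json.dumps({"text": text, "source": source, "target": target})
    return urllib.request.Request(
        url,
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def translate(text: str, source: str, target: str, timeout: float = 30.0) -> str:
    """Send the request and return the raw response body as text.

    Requires a server started with `uts serve` to be listening locally.
    """
    request = build_translate_request(text, source, target)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    print(translate("Hello world", "en", "es"))
```

Keeping request construction separate from sending makes the payload easy to inspect or log without a running server.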

Via Python

from runtime.encoder.universal_encoder import UniversalEncoder
from runtime.cloud_decoder import OptimizedUniversalDecoder

encoder = UniversalEncoder.from_pretrained("code-with-zeeshan/Universal-Translation-System")
decoder = OptimizedUniversalDecoder.from_pretrained("code-with-zeeshan/Universal-Translation-System")
# See docs/API.md for full inference examples

Via Hugging Face Hub

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained("code-with-zeeshan/Universal-Translation-System")
tokenizer = AutoTokenizer.from_pretrained("code-with-zeeshan/Universal-Translation-System")

Training

The model was trained using the Universal Translation System pipeline:

Data pipeline — OPUS-100 download, sampling, augmentation (false friends, idioms, backtranslation), COMET quality filtering
Knowledge distillation — NLLB-3.3B teacher → compact student
Vocabulary — Script-grouped SentencePiece tokenizer (32K per group)
Training — BF16 mixed-precision, dynamic batch sizing, gradient checkpointing. ~10 epochs with cosine LR schedule.
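
For readers unfamiliar with the cosine LR schedule mentioned in the last step, here is a minimal sketch of the standard cosine decay formula. It is not taken from the project's training code: the function name, the peak learning rate, and the step count in the example are placeholders, and any warmup phase the real pipeline may use is left out.

```python
import math


def cosine_lr(step: int, total_steps: int, peak_lr: float, min_lr: float = 0.0) -> float:
    """Cosine decay from peak_lr at step 0 down to min_lr at total_steps."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp so that steps outside [0, total_steps] stay at the endpoints.
    progress = min(max(step, 0), total_steps) / total_steps
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    # Hypothetical values, chosen only for illustration.
    total_steps, peak_lr = 10_000, 5e-4
    for step in (0, 2_500, 5_000, 7_500, 10_000):
        print(step, f"{cosine_lr(step, total_steps, peak_lr):.6f}")
```

The rate falls slowly at first, fastest around the midpoint, and flattens out again near the end of training.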

Evaluation

Metric	Score
BLEU (average across 190 pairs)	Coming soon
COMET (average)	Coming soon

Files

encoder/ — Universal encoder weights
decoder/ — Optimized decoder weights
vocab/ — Script-grouped vocabulary packs
config.yaml — Training configuration used for this model

License

Apache 2.0