# Pumatic English-Polish Translation Model

A neural machine translation model for English-to-Polish translation, trained entirely from scratch using the MarianMT architecture.
## Model Description
- Model type: Encoder-Decoder (MarianMT architecture)
- Language pair: English → Polish
- Parameters: ~157M
- Training approach: From scratch (randomly initialized weights, custom tokenizer)
- GPU: 4x NVIDIA H200
- Trained by: pumad
Note: This model was not fine-tuned from any existing pre-trained model. Both the model weights and the SentencePiece tokenizer were trained from scratch on the parallel corpus.
## Architecture
| Component | Configuration |
|---|---|
| d_model | 768 |
| Encoder layers | 8 |
| Decoder layers | 8 |
| Attention heads | 12 |
| FFN dimension | 3072 |
| Vocabulary size | 32,000 |
| Max position embeddings | 512 |
| Activation function | GELU |
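As a sanity check, the hyperparameters above are consistent with the stated ~157M parameters. The back-of-envelope count below ignores biases and layer norms and assumes a tied input/output embedding matrix (an assumption on our part):

```python
d_model, ffn, vocab, n_layers = 768, 3072, 32000, 8

attn = 4 * d_model * d_model              # Q, K, V and output projections
enc_layer = attn + 2 * d_model * ffn      # self-attention + feed-forward
dec_layer = 2 * attn + 2 * d_model * ffn  # cross-attention adds one more attention block
embeddings = vocab * d_model              # assumed tied embedding matrix

total = n_layers * (enc_layer + dec_layer) + embeddings
print(f"{total / 1e6:.1f}M")  # 156.7M, in line with the stated ~157M
```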
## Training Details

### Training Data
The model was trained on high-quality parallel corpora:
- OPUS-100 - Multilingual parallel corpus
- Europarl - European Parliament proceedings
- UN Parallel Corpus (UNPC) - United Nations documents
### Training Procedure
- Hardware: 4x NVIDIA H200 GPUs (distributed training)
- Framework: Hugging Face Transformers + Accelerate
- Batch size: 512 per GPU (2048 effective)
- Learning rate: 3e-4 with cosine decay
- Warmup: 6% of training steps
- Epochs: 10
- Optimizer: Fused AdamW
- Precision: bf16 mixed precision
- Max sequence length: 128 tokens
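The schedule above can be sketched numerically. The total step count below is a hypothetical placeholder (the real value depends on corpus size and epochs), and the cosine variant shown is a common implementation, not necessarily the exact one used in training:

```python
import math

per_gpu_batch, num_gpus = 512, 4
effective_batch = per_gpu_batch * num_gpus  # 2048, matching the card

total_steps = 100_000                   # hypothetical placeholder
warmup_steps = int(0.06 * total_steps)  # 6% warmup

def lr_at(step, peak=3e-4):
    """Linear warmup to the peak LR, then cosine decay to zero."""
    if step < warmup_steps:
        return peak * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak * 0.5 * (1.0 + math.cos(math.pi * progress))

print(effective_batch, lr_at(warmup_steps))  # 2048 0.0003
```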
### Tokenizer
A custom SentencePiece tokenizer (unigram model) was trained on the parallel corpus with:
- 32,000 vocabulary size
- 99.95% character coverage
- Language tag support (`>>pl<<`)
### Data Preprocessing
- Quality filtering: Removed pairs with fewer than 5 words or more than 200 words
- Length ratio filtering: Excluded pairs with extreme length ratios (< 0.5 or > 2.0)
- Deduplication: Removed duplicate source sentences
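The three preprocessing steps can be sketched as a single pass over the corpus. Whitespace tokenization and a source/target word-count ratio are our assumptions about how the filters were defined:

```python
def clean_corpus(pairs):
    """Apply quality, length-ratio, and deduplication filters to
    (source, target) sentence pairs, mirroring the steps listed above."""
    seen_sources = set()
    kept = []
    for src, tgt in pairs:
        n_src, n_tgt = len(src.split()), len(tgt.split())
        if not (5 <= n_src <= 200 and 5 <= n_tgt <= 200):
            continue  # quality filter: 5-200 words (assumed on both sides)
        ratio = n_src / n_tgt
        if ratio < 0.5 or ratio > 2.0:
            continue  # length-ratio filter
        if src in seen_sources:
            continue  # deduplicate on the source sentence
        seen_sources.add(src)
        kept.append((src, tgt))
    return kept
```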
## Usage

### Using the Transformers library
```python
from transformers import MarianMTModel, MarianTokenizer

model_name = "pumad/pumatic-en-pl"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

text = "Hello, how are you today?"
inputs = tokenizer(text, return_tensors="pt", padding=True)
translated = model.generate(**inputs)
output = tokenizer.decode(translated[0], skip_special_tokens=True)
print(output)
```
### Using the Pipeline API
```python
from transformers import pipeline

translator = pipeline("translation", model="pumad/pumatic-en-pl")
result = translator("The quick brown fox jumps over the lazy dog.")
print(result[0]["translation_text"])
```
## Demo
Try this model live at pumatic.eu
API documentation available at pumatic.eu/docs
## Limitations
- Optimized for general-purpose translation; domain-specific terminology may vary in quality
- Maximum input length of ~400 characters per chunk for optimal results
- Best performance on formal/written text; colloquial expressions may be less accurate
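Given the ~400-character guidance, long inputs can be split at sentence boundaries before translation. A minimal chunking sketch (this heuristic is ours, not something shipped with the model):

```python
import re

def chunk_text(text, max_chars=400):
    """Greedily pack whole sentences into chunks of at most max_chars.
    A single sentence longer than max_chars becomes its own chunk."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be passed to the pipeline separately and the translations re-joined.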
## License
Apache 2.0
## Citation
If you use this model, please cite:
```bibtex
@misc{pumatic-en-pl,
  author    = {pumad},
  title     = {Pumatic English-Polish Translation Model},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/pumad/pumatic-en-pl}
}
```