Pumatic English-Polish Translation Model

A neural machine translation model for English to Polish translation, trained entirely from scratch using the MarianMT architecture.

Model Description

  • Model type: Encoder-Decoder (MarianMT architecture)
  • Language pair: English → Polish
  • Parameters: ~157M
  • Training approach: From scratch (randomly initialized weights, custom tokenizer)
  • GPU: 4x NVIDIA H200
  • Trained by: pumad

Note: This model was not fine-tuned from any existing pre-trained model. Both the model weights and the SentencePiece tokenizer were trained from scratch on the parallel corpus.

Architecture

Component                 Configuration
------------------------  -------------
d_model                   768
Encoder layers            8
Decoder layers            8
Attention heads           12
FFN dimension             3072
Vocabulary size           32,000
Max position embeddings   512
Activation function       GELU
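The table above maps directly onto a Hugging Face MarianConfig. The sketch below reconstructs the configuration from the card for illustration; the released config.json may set additional fields not listed here.

```python
from transformers import MarianConfig

# Reconstruction of the architecture table above (not the shipped config.json).
config = MarianConfig(
    vocab_size=32_000,
    d_model=768,
    encoder_layers=8,
    decoder_layers=8,
    encoder_attention_heads=12,
    decoder_attention_heads=12,
    encoder_ffn_dim=3072,
    decoder_ffn_dim=3072,
    max_position_embeddings=512,
    activation_function="gelu",
)
```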

Training Details

Training Data

The model was trained on the following publicly available parallel corpora:

  • OPUS-100 - Multilingual parallel corpus
  • Europarl - European Parliament proceedings
  • UN Parallel Corpus (UNPC) - United Nations documents

Training Procedure

  • Hardware: 4x NVIDIA H200 GPUs (distributed training)
  • Framework: Hugging Face Transformers + Accelerate
  • Batch size: 512 per GPU (2048 effective)
  • Learning rate: 3e-4 with cosine decay
  • Warmup: 6% of training steps
  • Epochs: 10
  • Optimizer: Fused AdamW
  • Precision: bf16 mixed precision
  • Max sequence length: 128 tokens
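The numbers above imply 512 sequences × 4 GPUs = 2048 sequences per optimizer step. The warmup-plus-cosine schedule can be sketched as a pure function; this is an illustration of the schedule described, not the actual training script (which would typically use the scheduler utilities in Transformers):

```python
import math

PEAK_LR = 3e-4      # peak learning rate from the card
WARMUP_FRAC = 0.06  # 6% of training steps

def lr_at(step: int, total_steps: int) -> float:
    """Learning rate at a given step: linear warmup, then cosine decay to 0."""
    warmup_steps = int(WARMUP_FRAC * total_steps)
    if step < warmup_steps:
        return PEAK_LR * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * PEAK_LR * (1.0 + math.cos(math.pi * progress))
```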

Tokenizer

A custom SentencePiece tokenizer (unigram model) was trained on the parallel corpus with:

  • 32,000 vocabulary size
  • 99.95% character coverage
  • Language tag support (>>pl<<)
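The language tag is prepended to the source text so the model knows the target language. A minimal illustration (the helper name is an assumption, not part of the released tokenizer API):

```python
def add_lang_tag(text: str, tag: str = ">>pl<<") -> str:
    """Prepend the target-language tag expected by the tokenizer."""
    return f"{tag} {text}"
```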

Data Preprocessing

  • Quality filtering: Removed pairs with fewer than 5 words or more than 200 words
  • Length ratio filtering: Excluded pairs with extreme length ratios (< 0.5 or > 2.0)
  • Deduplication: Removed duplicate source sentences
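The three filters above can be sketched as follows. Function names and the exact comparison logic are illustrative assumptions; the released preprocessing pipeline may differ in detail:

```python
def keep_pair(src: str, tgt: str) -> bool:
    """Word-count filter (5-200 words) and length-ratio filter (0.5-2.0)."""
    src_len, tgt_len = len(src.split()), len(tgt.split())
    if not (5 <= src_len <= 200 and 5 <= tgt_len <= 200):
        return False
    return 0.5 <= src_len / tgt_len <= 2.0

def dedup(pairs):
    """Keep only the first occurrence of each source sentence."""
    seen, kept = set(), []
    for src, tgt in pairs:
        if src not in seen:
            seen.add(src)
            kept.append((src, tgt))
    return kept
```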

Usage

Using the Transformers library

from transformers import MarianMTModel, MarianTokenizer

model_name = "pumad/pumatic-en-pl"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

text = "Hello, how are you today?"
inputs = tokenizer(text, return_tensors="pt", padding=True)
translated = model.generate(**inputs)
output = tokenizer.decode(translated[0], skip_special_tokens=True)
print(output)

Using the Pipeline API

from transformers import pipeline

translator = pipeline("translation", model="pumad/pumatic-en-pl")
result = translator("The quick brown fox jumps over the lazy dog.")
print(result[0]['translation_text'])

Demo

Try this model live at pumatic.eu

API documentation available at pumatic.eu/docs

Limitations

  • Optimized for general-purpose translation; domain-specific terminology may vary in quality
  • Trained with a maximum sequence length of 128 tokens; for best results, split longer inputs into chunks of roughly 400 characters or fewer
  • Best performance on formal/written text; colloquial expressions may be less accurate
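One way to work within the ~400-character limit is to split input on sentence boundaries and pack sentences greedily into chunks. This helper is a sketch, not part of the model's tooling:

```python
import re

def chunk_text(text: str, max_chars: int = 400) -> list[str]:
    """Greedily pack sentence-sized pieces into chunks of at most max_chars."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sent in sentences:
        if current and len(current) + 1 + len(sent) > max_chars:
            chunks.append(current)
            current = sent
        else:
            current = f"{current} {sent}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be translated independently and the outputs concatenated.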

License

Apache 2.0

Citation

If you use this model, please cite:

@misc{pumatic-en-pl,
  author = {pumad},
  title = {Pumatic English-Polish Translation Model},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/pumad/pumatic-en-pl}
}