EN-ZH Transformer Model
A Transformer-based neural machine translation model for English-to-Chinese translation, trained on 100,000 parallel sentence pairs from diverse domains.
Model Description
This model implements a standard encoder-decoder Transformer architecture optimized for English-to-Chinese translation. It uses BPE tokenization for English and character-level BPE tokenization for Chinese, which suits Chinese's character-based writing system.
- Model type: Transformer (Sequence-to-Sequence)
- Language pair: English (source) → Simplified Chinese (target)
- License: MIT
- Parameters: 65,894,519 (~66M)
Performance
Evaluated on 500 test samples:
| Metric | Greedy Decoding | Beam Search (size=5) |
|---|---|---|
| chrF | 51.24 | 54.47 |
| BLEU | 66.82 | 69.15 |
Beam search yields a +3.23 chrF (and +2.33 BLEU) improvement over greedy decoding.
Model Architecture
- Parameters: 65,894,519 (~66M)
- Embedding dimension (d_model): 512
- Feed-forward dimension: 2048
- Attention heads: 8
- Encoder layers: 6
- Decoder layers: 6
- Dropout rate: 0.4
- Positional encoding: Sinusoidal (max length: 5000)
- Normalization: Pre-norm (norm_first=True)
Vocabulary Sizes:
- English (source): 32,000 tokens (BPE)
- Chinese (target): 5,239 tokens (Character-level BPE)
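The hyperparameters above map directly onto PyTorch's built-in `nn.Transformer`. The following is a minimal sketch of such a model, not the repository's actual code: class and attribute names are illustrative, and with the listed defaults the parameter count lands close to the ~66M reported.

```python
import math
import torch
import torch.nn as nn

class TranslationTransformer(nn.Module):
    """Illustrative encoder-decoder Transformer matching the card's
    hyperparameters (names are hypothetical, not the repo's code)."""
    def __init__(self, src_vocab=32000, tgt_vocab=5239, d_model=512,
                 nhead=8, num_layers=6, dim_ff=2048, dropout=0.4, max_len=5000):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, d_model)
        self.tgt_embed = nn.Embedding(tgt_vocab, d_model)
        # Sinusoidal positional encoding, precomputed up to max_len positions
        pe = torch.zeros(max_len, d_model)
        pos = torch.arange(max_len, dtype=torch.float).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float)
                        * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            dim_feedforward=dim_ff, dropout=dropout,
            batch_first=True, norm_first=True)  # pre-norm, as stated above
        self.generator = nn.Linear(d_model, tgt_vocab)

    def forward(self, src, tgt):
        s = self.src_embed(src) + self.pe[:src.size(1)]
        t = self.tgt_embed(tgt) + self.pe[:tgt.size(1)]
        # Causal mask so each target position attends only to earlier positions
        causal = self.transformer.generate_square_subsequent_mask(tgt.size(1))
        h = self.transformer(s, t, tgt_mask=causal)
        return self.generator(h)  # (batch, tgt_len, tgt_vocab)
```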
Uses
Direct Use
This model is suitable for:
- General domain English-to-Chinese translation
- Political and diplomatic text translation
- Educational purposes
- Research and experimentation
Best for: Formal and political text, as the model was trained primarily on such content.
Demo
Try the model interactively: EN-ZH Transformer Space
How to Use
Using the Space
The easiest way to use this model is through the Gradio interface:
- Visit the Space
- Enter English text in the input box
- Click "Submit" to get the Chinese translation
Using with Python
```python
from huggingface_hub import hf_hub_download
import torch
import json
from tokenizers import Tokenizer

# Download model files
REPO_ID = "jiaxinnnnn/EN-ZH_Transformer_Model"
config_path = hf_hub_download(repo_id=REPO_ID, filename="config.json")
model_path = hf_hub_download(repo_id=REPO_ID, filename="best_model_finetuned.pt")
en_tokenizer_path = hf_hub_download(repo_id=REPO_ID, filename="en_tokenizer.json")
zh_tokenizer_path = hf_hub_download(repo_id=REPO_ID, filename="zh_tokenizer.json")

# Load tokenizers
en_tokenizer = Tokenizer.from_file(en_tokenizer_path)
zh_tokenizer = Tokenizer.from_file(zh_tokenizer_path)

# Load model (see Space code for full implementation)
```
For complete implementation details, refer to the Space code.
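Once the weights are loaded into a compatible model class (defined in the Space code), translation runs as an autoregressive loop. Below is a hedged sketch of greedy decoding only; the actual model class, forward signature, and token-id conventions live in the Space code, and `[BOS]`/`[EOS]` are the special tokens listed later in this card.

```python
import torch

def greedy_translate(model, en_tokenizer, zh_tokenizer, text, max_len=128):
    """Illustrative greedy decoding loop: feed the growing target prefix
    back into the model and take the argmax token at each step."""
    model.eval()
    bos = zh_tokenizer.token_to_id("[BOS]")
    eos = zh_tokenizer.token_to_id("[EOS]")
    src = torch.tensor([en_tokenizer.encode(text).ids])
    tgt = torch.tensor([[bos]])
    with torch.no_grad():
        for _ in range(max_len):
            logits = model(src, tgt)              # (1, tgt_len, vocab)
            next_id = logits[0, -1].argmax().item()
            if next_id == eos:
                break
            tgt = torch.cat([tgt, torch.tensor([[next_id]])], dim=1)
    return zh_tokenizer.decode(tgt[0, 1:].tolist())
```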
Training Details
Training Data
- Dataset: 100,000 parallel English-Chinese sentence pairs
- Source: Open-source corpus from GitCode
- Domains: Political documents, diplomatic text, news articles, general conversation
- Split: 80,000 training pairs (80%), 20,000 test pairs (20%)
Training Procedure
Tokenization
- English: BPE (Byte-Pair Encoding) with vocabulary size 32,000
- Chinese: Character-level BPE with vocabulary size 5,239
- Special tokens: [PAD], [UNK], [BOS], [EOS]
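A setup like the one above can be reproduced with the Hugging Face `tokenizers` library. This is a minimal sketch for the English side only, assuming a whitespace pre-tokenizer; the actual training corpus and any normalization steps used for this model are not specified in the card.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

# English BPE tokenizer with the card's vocabulary size and special tokens
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(
    vocab_size=32000,
    special_tokens=["[PAD]", "[UNK]", "[BOS]", "[EOS]"],
)
# In practice, pass an iterator over the 80k English training sentences
tokenizer.train_from_iterator(["example english sentence"], trainer=trainer)
```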
Training Hyperparameters
- Optimizer: AdamW
- Learning rate: 0.0005
- Betas: (0.9, 0.98)
- Epsilon: 1e-9
- Weight decay: 0.01
- LR Scheduler: OneCycleLR with cosine annealing
- Max learning rate: 0.001
- Warmup: 10% of training
- Batch size: 64
- Label smoothing: 0.15
- Dropout: 0.4 (high dropout to prevent overfitting on 100k dataset)
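In PyTorch, the optimizer, scheduler, and loss described above wire together roughly as follows. This is a sketch, not the training script: the `nn.Linear` stands in for the full Transformer, the step count is derived from the 80k-pair training split at batch size 64, and `ignore_index=0` assumes `[PAD]` has id 0.

```python
import torch

model = torch.nn.Linear(512, 5239)          # stand-in for the Transformer
steps_per_epoch, epochs = 80000 // 64, 60   # 80k train pairs, batch size 64

optimizer = torch.optim.AdamW(
    model.parameters(), lr=5e-4, betas=(0.9, 0.98),
    eps=1e-9, weight_decay=0.01)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=1e-3, total_steps=steps_per_epoch * epochs,
    pct_start=0.1, anneal_strategy="cos")   # 10% warmup, cosine annealing
# Label smoothing, with PAD positions excluded from the loss
criterion = torch.nn.CrossEntropyLoss(label_smoothing=0.15, ignore_index=0)
```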
Features
- Beam search decoding: Beam size 5 for higher quality translations
- Positional encoding: Sinusoidal up to 5000 tokens
- Pre-normalization: Applied before each sub-layer for training stability
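The beam search idea can be illustrated independently of the model. In the sketch below, `step_log_probs(prefix)` is a stand-in for the decoder's per-step output distribution, not the model's actual API; real NMT implementations also typically length-normalize scores, which is omitted here.

```python
def beam_search(step_log_probs, bos, eos, beam_size=5, max_len=50):
    """Generic beam search sketch: keep the beam_size highest-scoring
    prefixes at every step instead of committing to the single argmax."""
    beams = [([bos], 0.0)]          # (token prefix, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for tok, lp in step_log_probs(prefix).items():
                cand = (prefix + [tok], score + lp)
                (finished if tok == eos else candidates).append(cand)
        if not candidates:
            break
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]
    if not finished:                # nothing reached EOS within max_len
        finished = beams
    return max(finished, key=lambda c: c[1])[0]
```

Because the beam keeps alternatives alive, it can recover sequences whose first token is not the greedy choice, which is one reason beam search scores higher than greedy decoding in the table above.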
Evaluation
Metrics
The model uses two primary metrics:
- chrF (Character n-gram F-score): Well-suited for character-rich languages like Chinese
- BLEU (Bilingual Evaluation Understudy): Standard MT metric
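To make the chrF metric concrete, here is a minimal pure-Python sketch of the character n-gram F-score. Reported numbers should come from a standard toolkit (e.g. sacreBLEU), which handles whitespace, multiple references, and edge cases more carefully.

```python
from collections import Counter

def chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Minimal chrF sketch: average character n-gram precision and recall
    over orders 1..max_n, combined into an F-score that weights recall
    by beta (chrF2 uses beta=2)."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        ref = Counter(reference[i:i + n] for i in range(len(reference) - n + 1))
        if not ref:          # reference shorter than n characters
            continue
        hyp = Counter(hypothesis[i:i + n] for i in range(len(hypothesis) - n + 1))
        overlap = sum((hyp & ref).values())
        precisions.append(overlap / max(sum(hyp.values()), 1))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return 100 * (1 + beta ** 2) * p * r / (beta ** 2 * p + r)
```

Operating on characters rather than words is what makes chrF robust for Chinese, where word segmentation is ambiguous.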
Example Translations
High-Quality Examples (Political Domain)
| English | Model Translation (Beam Search) |
|---|---|
| Iraqi president saddam expressed satisfaction with this resolution on 4 july. | 伊拉克 总统 萨达姆 4日 对 这 一 决议 表示 满意。 |
| The people of cambodia have expressed their lofty respect for president jiang. | 柬埔寨 人民 在 街头 前 向 江主席 表示 崇高 的 敬意,欢迎 他 来访。 |
Note: The model performs best on political and diplomatic text similar to its training data.
Limitations and Biases
Known Limitations
Domain Specificity: The model is optimized for political/diplomatic text and may underperform on:
- Technical or scientific content
- Casual conversation
- Domain-specific terminology (legal, medical, etc.)
Idiomatic Expressions: Struggles with culturally-specific idioms and expressions, often producing overly literal translations.
Length Issues: Tends to under-generate, producing translations slightly shorter than human references.
Specialized Terminology: Difficulty with formal titles and organizational terms specific to Chinese administrative contexts.
Sentence Length: Optimized for sentences up to 128 tokens; very long texts should be split into smaller segments.
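Splitting long input at sentence boundaries can be done with a small helper like the one below. This is a hypothetical utility, not part of the repository; `count_tokens` stands in for the tokenizer's encoded length, e.g. `lambda s: len(en_tokenizer.encode(s).ids)`.

```python
import re

def split_for_translation(text, count_tokens, max_tokens=128):
    """Greedily pack whole sentences into segments that each stay
    within the model's token budget."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    segments, current = [], ""
    for sent in sentences:
        candidate = (current + " " + sent).strip()
        if current and count_tokens(candidate) > max_tokens:
            segments.append(current)   # budget exceeded: start a new segment
            current = sent
        else:
            current = candidate
    if current:
        segments.append(current)
    return segments
```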
Training Data Bias
The model reflects the characteristics of its training data:
- Primary focus on political and diplomatic language
- Limited exposure to casual or colloquial Chinese
- May not generalize well to modern internet slang or informal speech
Technical Specifications
Model Architecture Details
- Type: Encoder-Decoder Transformer
- Framework: PyTorch
- Special tokens: BOS (Beginning), EOS (End), PAD (Padding), UNK (Unknown)
- Weight initialization: Xavier Uniform
Compute Infrastructure
- Hardware: CUDA-compatible GPU
- Training time: ~9 minutes per epoch on GPU
- Total training: 60 epochs with early stopping
Model Files
This repository contains:
- `best_model_finetuned.pt` - Trained model weights (~250MB)
- `config.json` - Model configuration
- `en_tokenizer.json` - English BPE tokenizer
- `zh_tokenizer.json` - Chinese character-level BPE tokenizer
Citation
If you use this model in your research or applications, please cite:
```bibtex
@misc{en-zh-transformer-2025,
  author = {jiaxinnnnn},
  title = {EN-ZH Transformer: English to Chinese Neural Machine Translation},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/jiaxinnnnn/EN-ZH_Transformer_Model}},
  note = {chrF: 54.47, BLEU: 69.15}
}
```
Model Card Contact
For questions, issues, or feedback:
- Open an issue in the model repository
- Visit the Community discussions