EN-ZH Transformer Model

A Transformer-based neural machine translation model for English-to-Chinese translation, trained on 100,000 parallel sentence pairs from diverse domains.

Model Description

This model implements a standard encoder-decoder Transformer architecture for English-to-Chinese translation. It uses BPE tokenization for English and character-level BPE for Chinese, which keeps the target vocabulary compact and avoids reliance on Chinese word segmentation.

  • Model type: Transformer (Sequence-to-Sequence)
  • Language pair: English (source) → Chinese Simplified (target)
  • License: MIT
  • Parameters: 65,894,519 (~66M)

Performance

Evaluated on 500 test samples:

Metric   Greedy Decoding   Beam Search (size=5)
chrF     51.24             54.47
BLEU     66.82             69.15

Beam search improves chrF by +3.23 and BLEU by +2.33 over greedy decoding.

Model Architecture

  • Parameters: 65,894,519 (~66M)
  • Embedding dimension (d_model): 512
  • Feed-forward dimension: 2048
  • Attention heads: 8
  • Encoder layers: 6
  • Decoder layers: 6
  • Dropout rate: 0.4
  • Positional encoding: Sinusoidal (max length: 5000)
  • Normalization: Pre-norm (norm_first=True)
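
The hyperparameters above map directly onto PyTorch's built-in nn.Transformer. A minimal sketch of the architecture (class and variable names are illustrative, not the repository's actual code) that reproduces the parameter count stated above:

```python
import math

import torch
import torch.nn as nn

class SinusoidalPositionalEncoding(nn.Module):
    """Fixed sinusoidal positional encoding, max length 5000 (no learned params)."""
    def __init__(self, d_model: int, max_len: int = 5000):
        super().__init__()
        position = torch.arange(max_len).unsqueeze(1).float()
        div_term = torch.exp(torch.arange(0, d_model, 2).float()
                             * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        self.register_buffer("pe", pe.unsqueeze(0))  # (1, max_len, d_model)

    def forward(self, x):                 # x: (batch, seq, d_model)
        return x + self.pe[:, : x.size(1)]

class TranslationModel(nn.Module):
    def __init__(self, src_vocab=32000, tgt_vocab=5239, d_model=512,
                 nhead=8, num_layers=6, ffn_dim=2048, dropout=0.4):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, d_model)
        self.tgt_embed = nn.Embedding(tgt_vocab, d_model)
        self.pos_enc = SinusoidalPositionalEncoding(d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            dim_feedforward=ffn_dim, dropout=dropout,
            batch_first=True, norm_first=True)   # pre-norm, per the card
        self.out_proj = nn.Linear(d_model, tgt_vocab)

    def forward(self, src, tgt, tgt_mask=None):
        src = self.pos_enc(self.src_embed(src))
        tgt = self.pos_enc(self.tgt_embed(tgt))
        return self.out_proj(self.transformer(src, tgt, tgt_mask=tgt_mask))

n_params = sum(p.numel() for p in TranslationModel().parameters())
print(n_params)  # 65,894,519 — matches the parameter count above
```

The exact match of the parameter count suggests the model follows this standard layout (embeddings, nn.Transformer with default biases, and an untied output projection).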

Vocabulary Sizes:

  • English (source): 32,000 tokens (BPE)
  • Chinese (target): 5,239 tokens (Character-level BPE)

Uses

Direct Use

This model is suitable for:

  • General domain English-to-Chinese translation
  • Political and diplomatic text translation
  • Educational purposes
  • Research and experimentation

Best for: Formal and political text, as the model was trained primarily on such content.

Demo

Try the model interactively: EN-ZH Transformer Space

How to Use

Using the Space

The easiest way to use this model is through the Gradio interface:

  1. Visit the Space
  2. Enter English text in the input box
  3. Click "Submit" to get the Chinese translation

Using with Python

from huggingface_hub import hf_hub_download
import torch
import json
from tokenizers import Tokenizer

# Download model files
REPO_ID = "jiaxinnnnn/EN-ZH_Transformer_Model"
config_path = hf_hub_download(repo_id=REPO_ID, filename="config.json")
model_path = hf_hub_download(repo_id=REPO_ID, filename="best_model_finetuned.pt")
en_tokenizer_path = hf_hub_download(repo_id=REPO_ID, filename="en_tokenizer.json")
zh_tokenizer_path = hf_hub_download(repo_id=REPO_ID, filename="zh_tokenizer.json")

# Load tokenizers
en_tokenizer = Tokenizer.from_file(en_tokenizer_path)
zh_tokenizer = Tokenizer.from_file(zh_tokenizer_path)

# Load configuration and weights (checkpoint layout follows the Space code)
with open(config_path) as f:
    config = json.load(f)
state_dict = torch.load(model_path, map_location="cpu")

# Rebuild the model architecture from `config` and load `state_dict`
# (see the Space code for the full model definition)

For complete implementation details, refer to the Space code.
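
Inference additionally requires a decoding loop. A minimal greedy-decoding sketch, assuming the model's forward pass returns logits of shape (batch, tgt_len, vocab) and that the [BOS]/[EOS] IDs come from the Chinese tokenizer (function and argument names here are illustrative, not the Space's actual API):

```python
import torch

def greedy_translate(model, en_tokenizer, zh_tokenizer, text, max_len=128):
    """Greedy decoding: append the most likely next token until [EOS]."""
    model.eval()
    bos = zh_tokenizer.token_to_id("[BOS]")
    eos = zh_tokenizer.token_to_id("[EOS]")
    src = torch.tensor([en_tokenizer.encode(text).ids])
    tgt = torch.tensor([[bos]])                  # start with [BOS]
    with torch.no_grad():
        for _ in range(max_len):
            logits = model(src, tgt)             # (1, tgt_len, vocab)
            next_id = logits[0, -1].argmax().item()
            tgt = torch.cat([tgt, torch.tensor([[next_id]])], dim=1)
            if next_id == eos:
                break
    return zh_tokenizer.decode(tgt[0].tolist(), skip_special_tokens=True)
```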

Training Details

Training Data

  • Dataset: 100,000 parallel English-Chinese sentence pairs
  • Source: Open-source corpus from GitCode
  • Domains: Political documents, diplomatic text, news articles, general conversation
  • Split: 80,000 training pairs (80%), 20,000 test pairs (20%)

Training Procedure

Tokenization

  • English: BPE (Byte-Pair Encoding) with vocabulary size 32,000
  • Chinese: Character-level BPE with vocabulary size 5,239
  • Special tokens: [PAD], [UNK], [BOS], [EOS]
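
A sketch of how such tokenizers can be trained with the Hugging Face tokenizers library (the corpus below is a stand-in, not the actual training files; the Chinese tokenizer would presumably use the same recipe with vocab_size=5239, which keeps most tokens at the character level):

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[BOS]", "[EOS]"]

# English BPE tokenizer (vocab 32,000 on the real corpus)
en_tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
en_tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(vocab_size=32000, special_tokens=SPECIAL_TOKENS)

# Placeholder corpus — the real model trained on 80,000 sentence pairs
en_tokenizer.train_from_iterator(["this is a tiny example corpus"], trainer)

ids = en_tokenizer.encode("this is a test").ids
```

Special tokens are assigned the first IDs in order, so [PAD] = 0, [UNK] = 1, [BOS] = 2, [EOS] = 3.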

Training Hyperparameters

  • Optimizer: AdamW
    • Learning rate: 0.0005
    • Betas: (0.9, 0.98)
    • Epsilon: 1e-9
    • Weight decay: 0.01
  • LR Scheduler: OneCycleLR with cosine annealing
    • Max learning rate: 0.001
    • Warmup: 10% of training
  • Batch size: 64
  • Label smoothing: 0.15
  • Dropout: 0.4 (high dropout to prevent overfitting on 100k dataset)
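
Assuming training used PyTorch's built-in optimizer and scheduler classes, the setup above might look like the following (all names are illustrative; note that OneCycleLR controls the effective learning rate, so the optimizer's initial lr is overridden by the schedule):

```python
import torch
from torch import nn, optim

model = nn.Linear(512, 5239)          # stand-in for the full Transformer
steps_per_epoch, epochs = 1250, 60    # 80,000 pairs / batch size 64

optimizer = optim.AdamW(model.parameters(), lr=5e-4,
                        betas=(0.9, 0.98), eps=1e-9, weight_decay=0.01)
scheduler = optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=1e-3,
    total_steps=steps_per_epoch * epochs,
    pct_start=0.1,                    # 10% warmup
    anneal_strategy="cos")            # cosine annealing
# Cross-entropy with label smoothing; [PAD] is assumed to have ID 0
criterion = nn.CrossEntropyLoss(label_smoothing=0.15, ignore_index=0)
```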

Features

  • Beam search decoding: Beam size 5 for higher quality translations
  • Positional encoding: Sinusoidal up to 5000 tokens
  • Pre-normalization: Applied before each sub-layer for training stability
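
The beam-search decoder can be sketched model-agnostically; here step_logprobs stands in for one decoder forward pass returning next-token log-probabilities (an illustrative reconstruction, not the Space's actual code):

```python
import math

def beam_search(step_logprobs, bos, eos, beam_size=5, max_len=128):
    """Generic beam search sketch. `step_logprobs(prefix)` returns a dict
    {token_id: log_prob} over candidate next tokens."""
    beams = [([bos], 0.0)]            # (sequence, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == eos:        # hypothesis already complete
                finished.append((seq, score))
                continue
            for tok, lp in step_logprobs(seq).items():
                candidates.append((seq + [tok], score + lp))
        if not candidates:
            beams = []
            break
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    finished.extend(beams)            # hypotheses still open at max_len
    # Length normalization so longer hypotheses are not unfairly penalized
    return max(finished, key=lambda c: c[1] / len(c[0]))[0]
```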

Evaluation

Metrics

Translation quality is evaluated with two primary metrics:

  • chrF (Character n-gram F-score): Well-suited for character-rich languages like Chinese
  • BLEU (Bilingual Evaluation Understudy): Standard MT metric
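
The character n-gram matching behind chrF can be illustrated with a toy implementation (a simplified sketch of the metric's core idea; real evaluation should use sacreBLEU or an equivalent standard scorer):

```python
from collections import Counter

def toy_chrf(hyp: str, ref: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified character n-gram F-score (the idea behind chrF).
    Omits chrF's whitespace handling and word n-grams."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        h = Counter(hyp[i:i + n] for i in range(len(hyp) - n + 1))
        r = Counter(ref[i:i + n] for i in range(len(ref) - n + 1))
        if not h or not r:
            continue                  # n-gram order longer than the text
        overlap = sum((h & r).values())
        precisions.append(overlap / sum(h.values()))
        recalls.append(overlap / sum(r.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    rc = sum(recalls) / len(recalls)
    if p + rc == 0:
        return 0.0
    return 100 * (1 + beta**2) * p * rc / (beta**2 * p + rc)
```

Because it matches characters rather than words, chrF needs no Chinese word segmentation, which is why it suits this language pair.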

Example Translations

High-Quality Examples (Political Domain)

  • EN: Iraqi president saddam expressed satisfaction with this resolution on 4 july.
    ZH: 伊拉克 总统 萨达姆 4日 对 这 一 决议 表示 满意。
  • EN: The people of cambodia have expressed their lofty respect for president jiang.
    ZH: 柬埔寨 人民 在 街头 前 向 江主席 表示 崇高 的 敬意,欢迎 他 来访。

Note: The model performs best on political and diplomatic text similar to its training data.

Limitations and Biases

Known Limitations

  1. Domain Specificity: The model is optimized for political/diplomatic text and may underperform on:

    • Technical or scientific content
    • Casual conversation
    • Domain-specific terminology (legal, medical, etc.)
  2. Idiomatic Expressions: Struggles with culturally-specific idioms and expressions, often producing overly literal translations.

  3. Length Issues: Tends to under-generate, producing translations slightly shorter than human references.

  4. Specialized Terminology: Difficulty with formal titles and organizational terms specific to Chinese administrative contexts.

  5. Sentence Length: Optimized for sentences up to 128 tokens; very long texts should be split into smaller segments.
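
For inputs beyond the 128-token limit (item 5 above), a naive splitting helper might look like this (purely illustrative; in practice token counts should come from the BPE tokenizer rather than whitespace words):

```python
import re

def split_long_text(text: str, max_tokens: int = 128) -> list[str]:
    """Split at sentence-final punctuation, then greedily pack sentences
    into chunks that stay under the token budget."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], []
    for sent in sentences:
        # Whitespace word count approximates the token count here
        if current and len(" ".join(current + [sent]).split()) > max_tokens:
            chunks.append(" ".join(current))
            current = []
        current.append(sent)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Each chunk can then be translated independently and the outputs concatenated.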

Training Data Bias

The model reflects the characteristics of its training data:

  • Primary focus on political and diplomatic language
  • Limited exposure to casual or colloquial Chinese
  • May not generalize well to modern internet slang or informal speech

Technical Specifications

Model Architecture Details

  • Type: Encoder-Decoder Transformer
  • Framework: PyTorch
  • Special tokens: BOS (Beginning), EOS (End), PAD (Padding), UNK (Unknown)
  • Weight initialization: Xavier Uniform

Compute Infrastructure

  • Hardware: CUDA-compatible GPU
  • Training time: ~9 minutes per epoch on GPU
  • Total training: 60 epochs with early stopping

Model Files

This repository contains:

  • best_model_finetuned.pt - Trained model weights (~250MB)
  • config.json - Model configuration
  • en_tokenizer.json - English BPE tokenizer
  • zh_tokenizer.json - Chinese character-level BPE tokenizer

Citation

If you use this model in your research or applications, please cite:

@misc{en-zh-transformer-2025,
  author = {jiaxinnnnn},
  title = {EN-ZH Transformer: English to Chinese Neural Machine Translation},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/jiaxinnnnn/EN-ZH_Transformer_Model}},
  note = {chrF: 54.47, BLEU: 69.15}
}

Model Card Contact

For questions, issues, or feedback, please open a discussion on the model repository.
