EN-ZH Transformer Model

A Transformer-based neural machine translation model for English-to-Chinese translation, trained on 100,000 parallel sentence pairs from diverse domains.

Model Description

This model implements a standard encoder-decoder Transformer architecture for English-to-Chinese translation. It uses BPE tokenization for English and character-level BPE for Chinese, which keeps the target vocabulary compact and avoids reliance on Chinese word segmentation.

  • Model type: Transformer (Sequence-to-Sequence)
  • Language pair: English (source) → Chinese Simplified (target)
  • License: MIT
  • Parameters: 65,894,519 (~66M)

Performance

Evaluated on 500 test samples:

Metric   Greedy Decoding   Beam Search (size=5)
chrF     51.24             54.47
BLEU     66.82             69.15

Beam search improves chrF by +3.23 and BLEU by +2.33 over greedy decoding.

Model Architecture

  • Parameters: 65,894,519 (~66M)
  • Embedding dimension (d_model): 512
  • Feed-forward dimension: 2048
  • Attention heads: 8
  • Encoder layers: 6
  • Decoder layers: 6
  • Dropout rate: 0.4
  • Positional encoding: Sinusoidal (max length: 5000)
  • Normalization: Pre-norm (norm_first=True)
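
The hyperparameters above map directly onto PyTorch's built-in nn.Transformer. A minimal sketch of the architecture (class and variable names are illustrative, not the repository's actual code) that reproduces the parameter count stated above:

```python
import math

import torch
import torch.nn as nn

class SinusoidalPositionalEncoding(nn.Module):
    """Fixed sinusoidal positional encoding, max length 5000 (no learned params)."""
    def __init__(self, d_model: int, max_len: int = 5000):
        super().__init__()
        position = torch.arange(max_len).unsqueeze(1).float()
        div_term = torch.exp(torch.arange(0, d_model, 2).float()
                             * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        self.register_buffer("pe", pe.unsqueeze(0))  # (1, max_len, d_model)

    def forward(self, x):                 # x: (batch, seq, d_model)
        return x + self.pe[:, : x.size(1)]

class TranslationModel(nn.Module):
    def __init__(self, src_vocab=32000, tgt_vocab=5239, d_model=512,
                 nhead=8, num_layers=6, ffn_dim=2048, dropout=0.4):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, d_model)
        self.tgt_embed = nn.Embedding(tgt_vocab, d_model)
        self.pos_enc = SinusoidalPositionalEncoding(d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            dim_feedforward=ffn_dim, dropout=dropout,
            batch_first=True, norm_first=True)   # pre-norm, per the card
        self.out_proj = nn.Linear(d_model, tgt_vocab)

    def forward(self, src, tgt, tgt_mask=None):
        src = self.pos_enc(self.src_embed(src))
        tgt = self.pos_enc(self.tgt_embed(tgt))
        return self.out_proj(self.transformer(src, tgt, tgt_mask=tgt_mask))

n_params = sum(p.numel() for p in TranslationModel().parameters())
print(n_params)  # 65,894,519 — matches the parameter count above
```

The exact match of the parameter count suggests the model follows this standard layout (embeddings, nn.Transformer with default biases, and an untied output projection).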

Vocabulary Sizes:

  • English (source): 32,000 tokens (BPE)
  • Chinese (target): 5,239 tokens (Character-level BPE)

Uses

Direct Use

This model is suitable for:

  • General domain English-to-Chinese translation
  • Political and diplomatic text translation
  • Educational purposes
  • Research and experimentation

Best for: Formal and political text, as the model was trained primarily on such content.

Demo

Try the model interactively: EN-ZH Transformer Space

How to Use

Using the Space

The easiest way to use this model is through the Gradio interface:

  1. Visit the Space
  2. Enter English text in the input box
  3. Click "Submit" to get the Chinese translation

Using with Python

from huggingface_hub import hf_hub_download
import torch
import json
from tokenizers import Tokenizer

# Download model files
REPO_ID = "jiaxinnnnn/EN-ZH_Transformer_Model"
config_path = hf_hub_download(repo_id=REPO_ID, filename="config.json")
model_path = hf_hub_download(repo_id=REPO_ID, filename="best_model_finetuned.pt")
en_tokenizer_path = hf_hub_download(repo_id=REPO_ID, filename="en_tokenizer.json")
zh_tokenizer_path = hf_hub_download(repo_id=REPO_ID, filename="zh_tokenizer.json")

# Load tokenizers
en_tokenizer = Tokenizer.from_file(en_tokenizer_path)
zh_tokenizer = Tokenizer.from_file(zh_tokenizer_path)

# Load configuration and weights (checkpoint layout follows the Space code)
with open(config_path) as f:
    config = json.load(f)
state_dict = torch.load(model_path, map_location="cpu")

# Rebuild the model architecture from `config` and load `state_dict`
# (see the Space code for the full model definition)

For complete implementation details, refer to the Space code.
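
Inference additionally requires a decoding loop. A minimal greedy-decoding sketch, assuming the model's forward pass returns logits of shape (batch, tgt_len, vocab) and that the [BOS]/[EOS] IDs come from the Chinese tokenizer (function and argument names here are illustrative, not the Space's actual API):

```python
import torch

def greedy_translate(model, en_tokenizer, zh_tokenizer, text, max_len=128):
    """Greedy decoding: append the most likely next token until [EOS]."""
    model.eval()
    bos = zh_tokenizer.token_to_id("[BOS]")
    eos = zh_tokenizer.token_to_id("[EOS]")
    src = torch.tensor([en_tokenizer.encode(text).ids])
    tgt = torch.tensor([[bos]])                  # start with [BOS]
    with torch.no_grad():
        for _ in range(max_len):
            logits = model(src, tgt)             # (1, tgt_len, vocab)
            next_id = logits[0, -1].argmax().item()
            tgt = torch.cat([tgt, torch.tensor([[next_id]])], dim=1)
            if next_id == eos:
                break
    return zh_tokenizer.decode(tgt[0].tolist(), skip_special_tokens=True)
```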

Training Details

Training Data

  • Dataset: 100,000 parallel English-Chinese sentence pairs
  • Source: Open-source corpus from GitCode
  • Domains: Political documents, diplomatic text, news articles, general conversation
  • Split: 80,000 training pairs (80%), 20,000 test pairs (20%)

Training Procedure

Tokenization

  • English: BPE (Byte-Pair Encoding) with vocabulary size 32,000
  • Chinese: Character-level BPE with vocabulary size 5,239
  • Special tokens: [PAD], [UNK], [BOS], [EOS]
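
A sketch of how such tokenizers can be trained with the Hugging Face tokenizers library (the corpus below is a stand-in, not the actual training files; the Chinese tokenizer would presumably use the same recipe with vocab_size=5239, which keeps most tokens at the character level):

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[BOS]", "[EOS]"]

# English BPE tokenizer (vocab 32,000 on the real corpus)
en_tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
en_tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(vocab_size=32000, special_tokens=SPECIAL_TOKENS)

# Placeholder corpus — the real model trained on 80,000 sentence pairs
en_tokenizer.train_from_iterator(["this is a tiny example corpus"], trainer)

ids = en_tokenizer.encode("this is a test").ids
```

Special tokens are assigned the first IDs in order, so [PAD] = 0, [UNK] = 1, [BOS] = 2, [EOS] = 3.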

Training Hyperparameters

  • Optimizer: AdamW
    • Learning rate: 0.0005
    • Betas: (0.9, 0.98)
    • Epsilon: 1e-9
    • Weight decay: 0.01
  • LR Scheduler: OneCycleLR with cosine annealing
    • Max learning rate: 0.001
    • Warmup: 10% of training
  • Batch size: 64
  • Label smoothing: 0.15
  • Dropout: 0.4 (high dropout to prevent overfitting on 100k dataset)
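
Assuming training used PyTorch's built-in optimizer and scheduler classes, the setup above might look like the following (all names are illustrative; note that OneCycleLR controls the effective learning rate, so the optimizer's initial lr is overridden by the schedule):

```python
import torch
from torch import nn, optim

model = nn.Linear(512, 5239)          # stand-in for the full Transformer
steps_per_epoch, epochs = 1250, 60    # 80,000 pairs / batch size 64

optimizer = optim.AdamW(model.parameters(), lr=5e-4,
                        betas=(0.9, 0.98), eps=1e-9, weight_decay=0.01)
scheduler = optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=1e-3,
    total_steps=steps_per_epoch * epochs,
    pct_start=0.1,                    # 10% warmup
    anneal_strategy="cos")            # cosine annealing
# Cross-entropy with label smoothing; [PAD] is assumed to have ID 0
criterion = nn.CrossEntropyLoss(label_smoothing=0.15, ignore_index=0)
```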

Features

  • Beam search decoding: Beam size 5 for higher quality translations
  • Positional encoding: Sinusoidal up to 5000 tokens
  • Pre-normalization: Applied before each sub-layer for training stability
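
The beam-search decoder can be sketched model-agnostically; here step_logprobs stands in for one decoder forward pass returning next-token log-probabilities (an illustrative reconstruction, not the Space's actual code):

```python
import math

def beam_search(step_logprobs, bos, eos, beam_size=5, max_len=128):
    """Generic beam search sketch. `step_logprobs(prefix)` returns a dict
    {token_id: log_prob} over candidate next tokens."""
    beams = [([bos], 0.0)]            # (sequence, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == eos:        # hypothesis already complete
                finished.append((seq, score))
                continue
            for tok, lp in step_logprobs(seq).items():
                candidates.append((seq + [tok], score + lp))
        if not candidates:
            beams = []
            break
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    finished.extend(beams)            # hypotheses still open at max_len
    # Length normalization so longer hypotheses are not unfairly penalized
    return max(finished, key=lambda c: c[1] / len(c[0]))[0]
```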

Evaluation

Metrics

Translation quality is evaluated with two primary metrics:

  • chrF (Character n-gram F-score): Well-suited for character-rich languages like Chinese
  • BLEU (Bilingual Evaluation Understudy): Standard MT metric
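
The character n-gram matching behind chrF can be illustrated with a toy implementation (a simplified sketch of the metric's core idea; real evaluation should use sacreBLEU or an equivalent standard scorer):

```python
from collections import Counter

def toy_chrf(hyp: str, ref: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified character n-gram F-score (the idea behind chrF).
    Omits chrF's whitespace handling and word n-grams."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        h = Counter(hyp[i:i + n] for i in range(len(hyp) - n + 1))
        r = Counter(ref[i:i + n] for i in range(len(ref) - n + 1))
        if not h or not r:
            continue                  # n-gram order longer than the text
        overlap = sum((h & r).values())
        precisions.append(overlap / sum(h.values()))
        recalls.append(overlap / sum(r.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    rc = sum(recalls) / len(recalls)
    if p + rc == 0:
        return 0.0
    return 100 * (1 + beta**2) * p * rc / (beta**2 * p + rc)
```

Because it matches characters rather than words, chrF needs no Chinese word segmentation, which is why it suits this language pair.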

Example Translations

High-Quality Examples (Political Domain)

  • EN: Iraqi president saddam expressed satisfaction with this resolution on 4 july.
    ZH: 伊拉克 总统 萨达姆 4日 对 这 一 决议 表示 满意。
  • EN: The people of cambodia have expressed their lofty respect for president jiang.
    ZH: 柬埔寨 人民 在 街头 前 向 江主席 表示 崇高 的 敬意,欢迎 他 来访。

Note: The model performs best on political and diplomatic text similar to its training data.

Limitations and Biases

Known Limitations

  1. Domain Specificity: The model is optimized for political/diplomatic text and may underperform on:

    • Technical or scientific content
    • Casual conversation
    • Domain-specific terminology (legal, medical, etc.)
  2. Idiomatic Expressions: Struggles with culturally-specific idioms and expressions, often producing overly literal translations.

  3. Length Issues: Tends to under-generate, producing translations slightly shorter than human references.

  4. Specialized Terminology: Difficulty with formal titles and organizational terms specific to Chinese administrative contexts.

  5. Sentence Length: Optimized for sentences up to 128 tokens; very long texts should be split into smaller segments.
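
For inputs beyond the 128-token limit (item 5 above), a naive splitting helper might look like this (purely illustrative; in practice token counts should come from the BPE tokenizer rather than whitespace words):

```python
import re

def split_long_text(text: str, max_tokens: int = 128) -> list[str]:
    """Split at sentence-final punctuation, then greedily pack sentences
    into chunks that stay under the token budget."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], []
    for sent in sentences:
        # Whitespace word count approximates the token count here
        if current and len(" ".join(current + [sent]).split()) > max_tokens:
            chunks.append(" ".join(current))
            current = []
        current.append(sent)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Each chunk can then be translated independently and the outputs concatenated.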

Training Data Bias

The model reflects the characteristics of its training data:

  • Primary focus on political and diplomatic language
  • Limited exposure to casual or colloquial Chinese
  • May not generalize well to modern internet slang or informal speech

Technical Specifications

Model Architecture Details

  • Type: Encoder-Decoder Transformer
  • Framework: PyTorch
  • Special tokens: BOS (Beginning), EOS (End), PAD (Padding), UNK (Unknown)
  • Weight initialization: Xavier Uniform

Compute Infrastructure

  • Hardware: CUDA-compatible GPU
  • Training time: ~9 minutes per epoch on GPU
  • Total training: 60 epochs with early stopping

Model Files

This repository contains:

  • best_model_finetuned.pt - Trained model weights (~250MB)
  • config.json - Model configuration
  • en_tokenizer.json - English BPE tokenizer
  • zh_tokenizer.json - Chinese character-level BPE tokenizer

Citation

If you use this model in your research or applications, please cite:

@misc{en-zh-transformer-2025,
  author = {jiaxinnnnn},
  title = {EN-ZH Transformer: English to Chinese Neural Machine Translation},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/jiaxinnnnn/EN-ZH_Transformer_Model}},
  note = {chrF: 54.47, BLEU: 69.15}
}

Model Card Contact

For questions, issues, or feedback, please open a discussion on the model repository.
