---
license: apache-2.0
base_model: Helsinki-NLP/opus-mt-en-zh
tags:
- generated_from_trainer
- translation
- machine-translation
- english
- traditional-chinese
- transformer
- fine-tuned
datasets:
- agentlans/en-zhtw-google-translate
language:
- en
- zh
pipeline_tag: translation
---
|
|
<details> |
|
|
<summary>English-to-Traditional Chinese Translator</summary> |
|
|
|
|
|
This model is a fine-tuned version of [Helsinki-NLP/opus-mt-en-zh](https://huggingface.co/Helsinki-NLP/opus-mt-en-zh), trained on the [agentlans/en-zhtw-google-translate](https://huggingface.co/datasets/agentlans/en-zhtw-google-translate) dataset. |
|
|
|
|
|
It is optimized to produce **Traditional Chinese output by default**, with more natural and fluent phrasing than the base model.
|
|
|
|
|
## Model Description |
|
|
|
|
|
- **Input:** English text only |
|
|
- **Output:** Traditional Chinese translation |
|
|
|
|
|
</details> |
|
|
|
|
|
<details> |
|
|
<summary>英文至繁體中文翻譯模型</summary> |
|
|
|
|
|
本模型為 [Helsinki-NLP/opus-mt-en-zh](https://huggingface.co/Helsinki-NLP/opus-mt-en-zh) 的微調版本,使用 [agentlans/en-zhtw-google-translate](https://huggingface.co/datasets/agentlans/en-zhtw-google-translate) 資料集進行訓練。 |
|
|
|
|
|
模型已針對輸出繁體中文進行最佳化,提升了翻譯結果的自然度與流暢性。 |
|
|
|
|
|
## 模型說明 |
|
|
|
|
|
- **輸入:** 僅支援英文文本 |
|
|
- **輸出:** 繁體中文翻譯 |
|
|
</details> |
|
|
|
|
|
## How to use / 如何使用 |
|
|
|
|
|
```python
from transformers import pipeline

# Load the translation model
# 載入翻譯模型
model_checkpoint = "agentlans/en-zhtw"
translator = pipeline("translation", model=model_checkpoint)

# Convert English punctuation marks to their Traditional Chinese (full-width) equivalents.
# 將英文標點符號轉換為對應的繁體中文(全形)標點。
def en_to_zh_punct(text):
    punct = {
        '!': '!', '?': '?', ',': ',', '.': '。',
        ':': ':', ';': ';', '(': '(', ')': ')',
        '[': '【', ']': '】', '{': '{', '}': '}'
    }
    result, in_dq, in_sq = [], False, False
    for ch in text:
        if ch == '"':
            # Alternate between opening and closing corner brackets.
            result.append("」" if in_dq else "「")
            in_dq = not in_dq
        elif ch == "'":
            # Note: straight apostrophes in contractions also toggle this state.
            result.append("』" if in_sq else "『")
            in_sq = not in_sq
        else:
            result.append(punct.get(ch, ch))
    return "".join(result)

# The main function for translating English to Traditional Chinese.
# 將英文翻譯成繁體中文的主要函式。
def translate(en_text):
    return [en_to_zh_punct(x["translation_text"]) for x in translator(en_text)]

# Example
# 範例
translate(
    [
        "Trump announces new tariffs on penguin islands. The penguins plan to tax U.S. imports in retaliation.",
        "We now return to the White House for the latest developments on the trade war.",
    ]
)
# ['川普宣佈對企鵝島徵收新關稅,企鵝打算對美國進口產品徵稅報復。', '我們現在回到白宮尋找貿易戰的最新發展。']
```
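
If you prefer direct model access instead of the `pipeline` helper, the checkpoint can also be loaded with the standard seq2seq classes. A minimal sketch (the generation settings here are illustrative, not tuned values from this card):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("agentlans/en-zhtw")
model = AutoModelForSeq2SeqLM.from_pretrained("agentlans/en-zhtw")

# Tokenize, generate, and decode one sentence.
inputs = tokenizer("The weather is nice today.", return_tensors="pt")
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=128, num_beams=4)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```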
|
|
|
|
|
## Limitations / 限制 |
|
|
|
|
|
<details> |
|
|
<summary>Limitations</summary> |
|
|
|
|
|
- Handles only one- or two-sentence English inputs effectively; quality degrades on longer passages (see the sentence-splitting sketch after this section).
|
|
- Struggles with English spelling, names, abbreviations, and especially technical terminology. |
|
|
- May emit English punctuation (for example, the English comma) where the Chinese equivalent is expected; the `en_to_zh_punct` helper in the usage example above works around this.
|
|
- Has difficulty understanding context. |
|
|
- As a result, may generate inaccurate information or omit important details. |
|
|
- Sometimes uses incorrect words because the base model was trained primarily on Simplified Chinese, whose vocabulary does not always map directly onto Traditional Chinese usage.
|
|
</details> |
|
|
|
|
|
<details> |
|
|
<summary>限制</summary> |
|
|
|
|
|
- 僅適用於處理一至兩句英文句子的輸入,處理較長段落時效果有限(參見下方的分句範例)。
|
|
- 難以準確掌握英語拼字、專有名詞及縮寫,尤其在處理技術術語時表現不佳。 |
|
|
- 常出現標點符號使用不當的情況,例如以英文逗號取代中文逗號。 |
|
|
- 對語境的理解能力有限。 |
|
|
- 可能導致資訊不準確或遺漏重要細節。 |
|
|
- 由於基礎模型主要以簡體中文語料訓練,有時會使用不自然或錯誤的詞語,簡體與繁體用語之間也未必能精確對應。 |
|
|
</details> |
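
Because the model is most reliable on one or two sentences at a time, longer passages are best split into sentences and translated in batches. A minimal sketch, assuming a naive regex-based splitter (a dedicated sentence segmenter would be more robust) and the `translate` function defined in the usage example above:

```python
import re

def translate_long(text, batch_size=8):
    # Naive split on sentence-final punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    # Translate in small batches and join without spaces, as is usual in Chinese.
    translated = []
    for i in range(0, len(sentences), batch_size):
        translated.extend(translate(sentences[i:i + batch_size]))
    return "".join(translated)
```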
|
|
|
|
|
## Training procedure / 訓練過程 |
|
|
|
|
|
<details> |
|
|
<summary>Click here / 點這裡</summary> |
|
|
|
|
|
### Training hyperparameters |
|
|
|
|
|
The following hyperparameters were used during training: |
|
|
- learning_rate: 5e-05 |
|
|
- train_batch_size: 8 |
|
|
- eval_batch_size: 8 |
|
|
- seed: 42 |
|
|
- optimizer: AdamW (`adamw_torch`) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
|
|
- lr_scheduler_type: linear |
|
|
- num_epochs: 5.0 |
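
For reference, these hyperparameters roughly correspond to the following `Seq2SeqTrainingArguments`. This is a sketch for orientation, not the published training script; `output_dir` and any arguments not listed above are assumptions:

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="en-zhtw",           # illustrative; actual path not published
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    seed=42,
    optim="adamw_torch",            # AdamW with betas=(0.9, 0.999), eps=1e-08
    lr_scheduler_type="linear",
    num_train_epochs=5.0,
)
```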
|
|
|
|
|
### Training results |
|
|
|
|
|
| Training Loss | Epoch | Step | Validation Loss | Input Tokens Seen |
|:-------------:|:-----:|:------:|:---------------:|:-----------------:|
| 1.3993 | 1.0 | 99952 | 1.2487 | 54454616 |
| 1.2801 | 2.0 | 199904 | 1.1701 | 108935048 |
| 1.1728 | 3.0 | 299856 | 1.1232 | 163424808 |
| 1.1001 | 4.0 | 399808 | 1.0871 | 217911400 |
| 1.0243 | 5.0 | 499760 | 1.0584 | 272407288 |
|
|
|
|
|
|
|
|
### Framework versions |
|
|
|
|
|
- Transformers 4.51.3 |
|
|
- Pytorch 2.6.0+cu124 |
|
|
- Datasets 3.2.0 |
|
|
- Tokenizers 0.21.0 |
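
To confirm a local environment matches these versions, a small convenience check:

```python
# Print installed versions for comparison with the list above.
import transformers, torch, datasets, tokenizers

for name, module in [("Transformers", transformers), ("PyTorch", torch),
                     ("Datasets", datasets), ("Tokenizers", tokenizers)]:
    print(f"{name}: {module.__version__}")
```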
|
|
|
|
|
</details> |