---
license: apache-2.0
base_model: Helsinki-NLP/opus-mt-en-zh
tags:
- generated_from_trainer
- translation
- machine-translation
- english
- traditional-chinese
- transformer
- fine-tuned
datasets:
- agentlans/en-zhtw-google-translate
language:
- en
- zh
pipeline_tag: translation
---
|
|
<details> |
|
|
<summary>English-to-Traditional Chinese Translator</summary> |
|
|
|
|
|
This model is a fine-tuned version of [Helsinki-NLP/opus-mt-en-zh](https://huggingface.co/Helsinki-NLP/opus-mt-en-zh), trained on the [agentlans/en-zhtw-google-translate](https://huggingface.co/datasets/agentlans/en-zhtw-google-translate) dataset. |
|
|
|
|
|
It is optimized to produce **Traditional Chinese output by default**, with more natural and fluent phrasing than the base model.
|
|
|
|
|
## Model Description |
|
|
|
|
|
- **Input:** English text only |
|
|
- **Output:** Traditional Chinese translation |
|
|
|
|
|
</details> |
|
|
|
|
|
<details> |
|
|
<summary>英文至繁體中文翻譯模型</summary> |
|
|
|
|
|
本模型為 [Helsinki-NLP/opus-mt-en-zh](https://huggingface.co/Helsinki-NLP/opus-mt-en-zh) 的微調版本,使用 [agentlans/en-zhtw-google-translate](https://huggingface.co/datasets/agentlans/en-zhtw-google-translate) 資料集進行訓練。 |
|
|
|
|
|
模型已針對輸出繁體中文進行最佳化,提升了翻譯結果的自然度與流暢性。 |
|
|
|
|
|
## 模型說明 |
|
|
|
|
|
- **輸入:** 僅支援英文文本 |
|
|
- **輸出:** 繁體中文翻譯 |
|
|
</details> |
|
|
|
|
|
## How to use / 如何使用 |
|
|
|
|
|
```python
from transformers import pipeline

# Load the translation model
# 載入翻譯模型
model_checkpoint = "agentlans/en-zhtw"
translator = pipeline("translation", model=model_checkpoint)

# Convert English punctuation marks to their Traditional Chinese (full-width) equivalents.
# 將英文標點符號轉換為對應的繁體中文(全形)標點。
def en_to_zh_punct(text):
    punct = {
        '!': '!', '?': '?', ',': ',', '.': '。',
        ':': ':', ';': ';', '(': '(', ')': ')',
        '[': '【', ']': '】', '{': '{', '}': '}'
    }
    result, in_dq, in_sq = [], False, False
    for ch in text:
        if ch == '"':
            # Alternate between opening and closing corner brackets.
            result.append("」" if in_dq else "「")
            in_dq = not in_dq
        elif ch == "'":
            # Note: straight apostrophes in contractions also toggle this state.
            result.append("』" if in_sq else "『")
            in_sq = not in_sq
        else:
            result.append(punct.get(ch, ch))
    return "".join(result)

# The main function for translating English to Traditional Chinese.
# 將英文翻譯成繁體中文的主要函式。
def translate(en_text):
    return [en_to_zh_punct(x["translation_text"]) for x in translator(en_text)]

# Example
# 範例
translate(
    [
        "Trump announces new tariffs on penguin islands. The penguins plan to tax U.S. imports in retaliation.",
        "We now return to the White House for the latest developments on the trade war.",
    ]
)
# ['川普宣佈對企鵝島徵收新關稅,企鵝打算對美國進口產品徵稅報復。', '我們現在回到白宮尋找貿易戰的最新發展。']
```
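
If you prefer direct model access instead of the `pipeline` helper, the checkpoint can also be loaded with the standard seq2seq classes. A minimal sketch (the generation settings here are illustrative, not tuned values from this card):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("agentlans/en-zhtw")
model = AutoModelForSeq2SeqLM.from_pretrained("agentlans/en-zhtw")

# Tokenize, generate, and decode one sentence.
inputs = tokenizer("The weather is nice today.", return_tensors="pt")
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=128, num_beams=4)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```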
|
|
|
|
|
## Limitations / 限制 |
|
|
|
|
|
<details> |
|
|
<summary>Limitations</summary> |
|
|
|
|
|
- Handles only one- or two-sentence English inputs effectively; quality degrades on longer passages (see the sentence-splitting sketch after this section).
|
|
- Struggles with English spelling, names, abbreviations, and especially technical terminology. |
|
|
- May emit English punctuation (for example, the English comma) where the Chinese equivalent is expected; the `en_to_zh_punct` helper in the usage example above works around this.
|
|
- Has difficulty understanding context. |
|
|
- As a result, may generate inaccurate information or omit important details. |
|
|
- Sometimes uses incorrect words because the base model was trained primarily on Simplified Chinese, whose vocabulary does not always map directly onto Traditional Chinese usage.
|
|
</details> |
|
|
|
|
|
<details> |
|
|
<summary>限制</summary> |
|
|
|
|
|
- 僅適用於處理一至兩句英文句子的輸入,處理較長段落時效果有限(參見下方的分句範例)。
|
|
- 難以準確掌握英語拼字、專有名詞及縮寫,尤其在處理技術術語時表現不佳。 |
|
|
- 常出現標點符號使用不當的情況,例如以英文逗號取代中文逗號。 |
|
|
- 對語境的理解能力有限。 |
|
|
- 可能導致資訊不準確或遺漏重要細節。 |
|
|
- 由於基礎模型主要以簡體中文語料訓練,有時會使用不自然或錯誤的詞語,簡體與繁體用語之間也未必能精確對應。 |
|
|
</details> |
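
Because the model is most reliable on one or two sentences at a time, longer passages are best split into sentences and translated in batches. A minimal sketch, assuming a naive regex-based splitter (a dedicated sentence segmenter would be more robust) and the `translate` function defined in the usage example above:

```python
import re

def translate_long(text, batch_size=8):
    # Naive split on sentence-final punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    # Translate in small batches and join without spaces, as is usual in Chinese.
    translated = []
    for i in range(0, len(sentences), batch_size):
        translated.extend(translate(sentences[i:i + batch_size]))
    return "".join(translated)
```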
|
|
|
|
|
## Training procedure / 訓練過程 |
|
|
|
|
|
<details> |
|
|
<summary>Click here / 點這裡</summary> |
|
|
|
|
|
### Training hyperparameters |
|
|
|
|
|
The following hyperparameters were used during training: |
|
|
- learning_rate: 5e-05 |
|
|
- train_batch_size: 8 |
|
|
- eval_batch_size: 8 |
|
|
- seed: 42 |
|
|
- optimizer: AdamW (`adamw_torch`) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
|
|
- lr_scheduler_type: linear |
|
|
- num_epochs: 5.0 |
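
For reference, these hyperparameters roughly correspond to the following `Seq2SeqTrainingArguments`. This is a sketch for orientation, not the published training script; `output_dir` and any arguments not listed above are assumptions:

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="en-zhtw",           # illustrative; actual path not published
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    seed=42,
    optim="adamw_torch",            # AdamW with betas=(0.9, 0.999), eps=1e-08
    lr_scheduler_type="linear",
    num_train_epochs=5.0,
)
```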
|
|
|
|
|
### Training results |
|
|
|
|
|
| Training Loss | Epoch | Step | Validation Loss | Input Tokens Seen |
|:-------------:|:-----:|:------:|:---------------:|:-----------------:|
| 1.3993 | 1.0 | 99952 | 1.2487 | 54454616 |
| 1.2801 | 2.0 | 199904 | 1.1701 | 108935048 |
| 1.1728 | 3.0 | 299856 | 1.1232 | 163424808 |
| 1.1001 | 4.0 | 399808 | 1.0871 | 217911400 |
| 1.0243 | 5.0 | 499760 | 1.0584 | 272407288 |
|
|
|
|
|
|
|
|
### Framework versions |
|
|
|
|
|
- Transformers 4.51.3 |
|
|
- Pytorch 2.6.0+cu124 |
|
|
- Datasets 3.2.0 |
|
|
- Tokenizers 0.21.0 |
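
To confirm a local environment matches these versions, a small convenience check:

```python
# Print installed versions for comparison with the list above.
import transformers, torch, datasets, tokenizers

for name, module in [("Transformers", transformers), ("PyTorch", torch),
                     ("Datasets", datasets), ("Tokenizers", tokenizers)]:
    print(f"{name}: {module.__version__}")
```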
|
|
|
|
|
</details> |