---
library_name: pytorch, transformers
base_model: KikKoh/Hok2Han
license: apache-2.0
tags:
- seq2seq
- transformer
- taiwanese
- pinyin-to-chinese
---
# Model Card for Hok2Han

This model is a Seq2Seq Transformer implemented in PyTorch that converts Taiwanese romanization (pinyin) into Taiwanese Han characters.
## Model Details

### Model Description

The model uses a custom Seq2Seq Transformer architecture to learn the mapping from Taiwanese romanization sequences to the corresponding Han character sequences. The training data consists of a large Taiwanese corpus with aligned romanization annotations. The architecture has 6 encoder layers and 6 decoder layers, 512-dimensional embeddings, and 8-head attention.

* **Developed by:** KikKoh
* **Model type:** Seq2Seq Transformer
* **Language(s):** Taiwanese (Hokkien)
* **License:** Apache-2.0
* **Finetuned from model:** custom architecture; not a standard pretrained model

### Model Sources

* **Repository:** [https://huggingface.co/KikKoh/Hok2Han](https://huggingface.co/KikKoh/Hok2Han)
* **Config and weights:** Hugging Face Hub
## Uses

### Direct Use

Automatic conversion of Taiwanese romanization to Han characters, for example as a post-processing step after speech recognition.

### Out-of-Scope Use

Not suitable for non-Taiwanese romanization input, translation between other languages, or direct speech recognition from audio.

## Bias, Risks, and Limitations

The model is trained only on Taiwanese romanization data, so it may perform poorly on other dialects, accents, or non-standard romanizations. Be aware of the limited diversity of the training corpus and the possibility of incorrect conversions.
## How to Get Started with the Model

Install the dependencies with `pip install torch transformers huggingface_hub`, then:

```python
from transformers import Wav2Vec2Processor, BertTokenizer
from hok2han_model import Seq2SeqTransformer

# Load the model weights from the Hub
model = Seq2SeqTransformer.from_pretrained("KikKoh/Hok2Han")
model.eval()

# Load the tokenizers (replace with your input/output tokenizer paths or repos)
input_processor = Wav2Vec2Processor.from_pretrained("your-input-tokenizer-path-or-repo")
output_tokenizer = BertTokenizer.from_pretrained("your-output-tokenizer-path-or-repo")

# Example inference: `input_ids` and `tgt_ids` are token-id tensors
# prepared with the tokenizers above
output = model(src=input_ids, tgt=tgt_ids,
               src_pad_idx=input_processor.tokenizer.pad_token_id,
               tgt_pad_idx=output_tokenizer.pad_token_id)

pred_ids = output.argmax(dim=-1)
pred_text = output_tokenizer.decode(pred_ids[0], skip_special_tokens=True)
print(pred_text)
```
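The example above scores a teacher-forced pass and takes the argmax at every position; at inference time, output is typically generated autoregressively instead. A minimal greedy-decoding sketch follows; the `step_fn` callable, the toy model, and the token ids are illustrative assumptions, not part of this repository:

```python
import torch

def greedy_decode(step_fn, bos_id, eos_id, max_len=64):
    """Autoregressively pick the highest-probability token at each step.
    `step_fn(tgt_ids)` must return logits of shape (1, len(ids), vocab)."""
    ids = [bos_id]
    for _ in range(max_len):
        logits = step_fn(torch.tensor([ids]))
        next_id = int(logits[0, -1].argmax())
        ids.append(next_id)
        if next_id == eos_id:
            break
    return ids

# Toy step function for illustration: emits token 3 twice, then EOS (= 2)
def toy_step(tgt):
    logits = torch.zeros(1, tgt.shape[1], 5)
    logits[0, -1, 3 if tgt.shape[1] < 3 else 2] = 1.0
    return logits

print(greedy_decode(toy_step, bos_id=1, eos_id=2))  # [1, 3, 3, 2]
```

In a real setup `step_fn` would wrap a forward pass of the trained model, feeding back the tokens generated so far.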
## Training Details

### Training Data

Parallel Taiwanese romanization–Han character corpora, including both public and self-collected data.

### Training Procedure

Standard Seq2Seq Transformer training with cross-entropy loss and the AdamW optimizer.
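As a hedged illustration of that setup (the pad id, vocabulary size, and tensor shapes are made-up values, not this repository's actual configuration):

```python
import torch
import torch.nn as nn

PAD_ID = 0   # assumed padding token id
VOCAB = 100  # assumed vocabulary size

logits = torch.randn(2, 6, VOCAB, requires_grad=True)  # (batch, seq_len, vocab)
targets = torch.randint(1, VOCAB, (2, 6))              # gold token ids

# Cross-entropy over flattened positions; padding positions contribute no loss
criterion = nn.CrossEntropyLoss(ignore_index=PAD_ID)
loss = criterion(logits.reshape(-1, VOCAB), targets.reshape(-1))

# AdamW update over the trainable parameters (here just the dummy tensor)
optimizer = torch.optim.AdamW([logits], lr=3e-4)
loss.backward()
optimizer.step()
```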
## Evaluation

Evaluation focuses on romanization-to-Han conversion accuracy and on the fluency of the output sentences.
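The card does not specify the exact metric; a naive position-wise character accuracy, shown here only as a plausible stand-in, could look like:

```python
def char_accuracy(pred: str, ref: str) -> float:
    """Fraction of aligned positions where prediction and reference agree.
    Illustrative only; not necessarily the metric used for this model."""
    if not ref:
        return 0.0
    matches = sum(p == r for p, r in zip(pred, ref))
    return matches / max(len(pred), len(ref))

print(char_accuracy("abcd", "abxd"))  # 0.75
```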
|
| 95 |
+
|
| 96 |
+
## Environmental Impact
|
| 97 |
+
|
| 98 |
+
訓練過程使用標準GPU伺服器,耗電與碳排放量中等。
## Technical Specifications

### Model Architecture and Objective

6 encoder layers and 6 decoder layers, 512-dimensional embeddings, 8-head multi-head attention.
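These hyperparameters correspond to what `torch.nn.Transformer` calls `num_encoder_layers`, `num_decoder_layers`, `d_model`, and `nhead`. A sketch of an equivalently sized stack, as an illustration rather than the repository's actual `Seq2SeqTransformer` class:

```python
import torch
import torch.nn as nn

# Same-sized stack as described: 6 + 6 layers, d_model=512, 8 heads
transformer = nn.Transformer(d_model=512, nhead=8,
                             num_encoder_layers=6, num_decoder_layers=6,
                             batch_first=True)

src = torch.randn(1, 10, 512)  # (batch, src_len, d_model)
tgt = torch.randn(1, 7, 512)   # (batch, tgt_len, d_model)
out = transformer(src, tgt)    # output shape follows tgt: (1, 7, 512)
```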
---

Feel free to contact KikKoh:
Facebook: [https://www.facebook.com/kikkoh.2024](https://www.facebook.com/kikkoh.2024)

---