KikKoh committed on
Commit
390cf22
·
1 Parent(s): a40e8a1

Update README.md

Files changed (1): README.md +219 -0
@@ -10,3 +10,222 @@ short_description: 台語語音辨識示範,使用 Wav2Vec2 模型將錄音轉
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
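
# Taiwanese Hokkien Speech-to-Text

This project aims to provide a complete, extensible speech recognition solution for Taiwanese Hokkien (Taiwanese Southern Min). It covers a two-stage architecture (speech ➜ romanization ➜ Han characters) as well as LoRA fine-tuned Whisper models, and supports both local and cloud deployment. The stack is built on PyTorch, Hugging Face Transformers, PEFT, and Accelerate, chosen for training performance, inference efficiency, and ease of use.

🔗 **Live demo**: [Hugging Face Spaces - KikKoh/Hokkien](https://huggingface.co/spaces/KikKoh/Hokkien)
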
---

## 🎯 Project Highlights

* **Two-stage architecture**:

  * **Stage 1**: Taiwanese speech ➜ romanization (Tâi-lô) (my-wav2vec2 module)
  * **Stage 2**: romanization ➜ Taiwanese Han characters (hok2han module)
* **LoRA fine-tuned Whisper models**:
  * **Mode 1**: Taiwanese speech ➜ Taiwanese Han characters (lora-whisper module)
  * **Mode 2**: Taiwanese speech ➜ Chinese text (lora-whisper-zh module)
* **Multiple model families**: CTC-based (Wav2Vec2), Transformer Seq2Seq, Whisper + LoRA fine-tuning
* **Efficient training**: mixed precision (AMP), parameter-efficient LoRA fine-tuning, distributed training with Accelerate
* **Easy deployment**: one-step upload to Hugging Face Hub / Spaces, Dockerfile
* **Open source**: Apache 2.0; academic and non-commercial use is welcome

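To make "parameter-efficient" concrete: LoRA freezes a weight matrix of shape d × k and trains two low-rank factors, B (d × r) and A (r × k), so the trainable count for that matrix drops from d·k to r·(d + k). The sketch below works through that arithmetic. The rank `r = 8` and the function name are illustrative assumptions, not values taken from this project; 768 is the hidden size of `openai/whisper-small`.

```python
def lora_param_counts(d: int, k: int, r: int) -> tuple[int, int]:
    """Return (full, lora) trainable-parameter counts for one d x k weight."""
    full = d * k        # every entry of W is trainable in full fine-tuning
    lora = r * (d + k)  # B is d x r, A is r x k; W itself stays frozen
    return full, lora


# Assumed example: one 768 x 768 attention projection with a hypothetical rank of 8.
full, lora = lora_param_counts(d=768, k=768, r=8)
print(f"full fine-tuning: {full:,} parameters")
print(f"LoRA (r=8):       {lora:,} parameters ({lora / full:.2%} of full)")
```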
---

## 📂 Project Structure

```
taiwanese-speech-to-text/
├── data/
│   ├── 詞條音檔/...        # audio for dictionary entries
│   ├── 例句音檔/...        # audio for example sentences
│   └── kautian.ods
│
├── model/
│   ├── my-wav2vec2/...
│   ├── hok2han/...
│   ├── lora-whisper/...
│   └── lora-whisper-zh/...
│
├── Dockerfile
├── requirements.txt
├── LICENSE
└── README.md
```

---

## 📥 Installation and Environment Setup

1. **Clone the project**

   ```bash
   git clone https://github.com/KikKoh/taiwanese-speech-to-text.git
   cd taiwanese-speech-to-text
   ```
2. **Create a virtual environment** (`venv` or `conda` recommended)

   ```bash
   python3.10 -m venv venv
   source venv/bin/activate
   ```
3. **Install dependencies**

   ```bash
   pip install --upgrade pip
   pip install -r requirements.txt
   ```
4. **(Optional) Install GPU / Accelerate support**

   ```bash
   pip install accelerate
   accelerate config  # initialize the configuration
   ```

---

## 🚀 Quick Start

### 1. Inference: speech ➜ romanization

```python
import soundfile as sf
import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

device = "cuda" if torch.cuda.is_available() else "cpu"
target_sample_rate = 16000  # the sample rate this snippet resamples to

processor = Wav2Vec2Processor.from_pretrained("my-wav2vec2")
model = Wav2Vec2ForCTC.from_pretrained("my-wav2vec2").to(device)
model.eval()

# Load the audio as a (channels, samples) float tensor
waveform, sr = sf.read("audio.wav")
waveform = torch.tensor(waveform).float()
if waveform.dim() == 1:
    waveform = waveform.unsqueeze(0)
else:
    waveform = waveform.permute(1, 0)  # soundfile returns (samples, channels)
if waveform.shape[0] > 1:
    waveform = waveform.mean(dim=0, keepdim=True)  # downmix to mono
if sr != target_sample_rate:
    resampler = torchaudio.transforms.Resample(sr, target_sample_rate)
    waveform = resampler(waveform)

# The processor takes one 1-D array per utterance
audio_input = processor(waveform.squeeze(0).numpy(), sampling_rate=target_sample_rate, return_tensors="pt")
with torch.no_grad():
    logits = model(audio_input.input_values.to(device)).logits
pred_ids = torch.argmax(logits, dim=-1)  # greedy CTC decoding, shape (batch, frames)
romaji = processor.batch_decode(pred_ids)[0]
print("Romanization:", romaji)
```

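The `argmax` followed by `batch_decode` in the snippet above is greedy CTC decoding: take the most likely token for each frame, merge consecutive repeats, then drop the blank token. A minimal pure-Python sketch of that collapse rule follows. The toy vocabulary and the blank id `0` are assumptions for illustration only, not the tokenizer shipped with `my-wav2vec2`.

```python
from itertools import groupby

BLANK_ID = 0  # assumed blank id for this toy example
VOCAB = {1: "t", 2: "a", 3: "i"}  # hypothetical vocabulary


def ctc_greedy_collapse(frame_ids: list[int], blank_id: int = BLANK_ID) -> list[int]:
    """Merge consecutive duplicate ids, then remove blanks."""
    merged = [token for token, _ in groupby(frame_ids)]
    return [token for token in merged if token != blank_id]


# Per-frame argmax ids; a blank between two identical ids keeps both of them.
frames = [1, 1, 0, 2, 2, 2, 0, 0, 3, 0, 3]
ids = ctc_greedy_collapse(frames)
print("".join(VOCAB[i] for i in ids))
```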
### 2. Inference: romanization ➜ Taiwanese Han characters

```python
import torch
from transformers import AutoTokenizer
from model.hok2han import Seq2SeqTransformer

device = "cuda" if torch.cuda.is_available() else "cpu"
max_len = 128  # assumed value; match the length used when the model was trained


def generate_square_subsequent_mask(size):
    # Causal mask: -inf above the diagonal blocks attention to future positions
    return torch.triu(torch.full((size, size), float("-inf")), diagonal=1)


input_tokenizer = AutoTokenizer.from_pretrained("KikKoh/Hok2Han", subfolder="input_tokenizer")
output_tokenizer = AutoTokenizer.from_pretrained("KikKoh/Hok2Han", subfolder="output_tokenizer")
model = Seq2SeqTransformer.from_pretrained("KikKoh/Hok2Han").to(device)
model.eval()

# Replace "pinyin" with the romanized text produced in step 1
encoded = input_tokenizer("pinyin", max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")
input_ids = encoded["input_ids"].to(device)
attention_mask = encoded["attention_mask"].to(device)

start_token_id = output_tokenizer.cls_token_id or output_tokenizer.convert_tokens_to_ids('<s>')
end_token_id = output_tokenizer.sep_token_id or output_tokenizer.convert_tokens_to_ids('</s>')

# Greedy autoregressive decoding: append the most likely token at each step
tgt_ids = torch.tensor([[start_token_id]], device=device)
with torch.no_grad():
    for _ in range(max_len - 1):
        # Built as in the original snippet, which does not pass it to the model;
        # check whether Seq2SeqTransformer.forward expects a causal mask argument.
        tgt_mask = generate_square_subsequent_mask(tgt_ids.size(1)).to(device)
        tgt_key_padding_mask = (tgt_ids == output_tokenizer.pad_token_id).to(device)

        # NOTE: the keyword name of the fourth argument was lost in the source,
        # so it is passed positionally here; verify against Seq2SeqTransformer.forward.
        outputs = model(input_ids, tgt_ids, input_tokenizer.pad_token_id, output_tokenizer.pad_token_id,
                        src_key_padding_mask=(attention_mask == 0), tgt_key_padding_mask=tgt_key_padding_mask)
        next_token_logits = outputs[:, -1, :]
        probs = torch.softmax(next_token_logits, dim=-1)
        next_token = torch.argmax(probs, dim=-1).unsqueeze(1)
        tgt_ids = torch.cat([tgt_ids, next_token], dim=1)
        if next_token.item() == end_token_id:
            break
translation = output_tokenizer.decode(tgt_ids[0], skip_special_tokens=True).replace(" ", "")
print("Taiwanese Han characters:", translation)
```

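The decoding loop above builds a square subsequent (causal) mask for the target sequence. By the usual convention, row i of such a mask allows attention to positions up to and including i and blocks later positions with negative infinity. The sketch below prints that pattern using plain Python lists; it assumes this convention is the one `Seq2SeqTransformer` expects, which should be checked against the code in `model/hok2han`, and the function name is hypothetical.

```python
def causal_mask(size: int) -> list[list[float]]:
    """Row i holds 0.0 for columns <= i (visible) and -inf for columns > i (blocked)."""
    neg_inf = float("-inf")
    return [[0.0 if col <= row else neg_inf for col in range(size)] for row in range(size)]


# Each printed row is one target position; -inf entries are future positions.
for row in causal_mask(3):
    print(row)
```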
### 3. Inference: Whisper + LoRA (zh)

```python
import soundfile as sf
import torch
import torchaudio
from peft import PeftModel
from transformers import WhisperForConditionalGeneration, WhisperProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
target_sample_rate = 16000  # the sample rate this snippet resamples to

processor = WhisperProcessor.from_pretrained("openai/whisper-small", language="zh", task="transcribe")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
# Attach the LoRA adapter weights to the base Whisper model
model = PeftModel.from_pretrained(model, "demo/lora-whisper").to(device).eval()

# Load the audio as a (channels, samples) float tensor
waveform, sr = sf.read("audio.wav")
waveform = torch.tensor(waveform).float()
if waveform.dim() == 1:
    waveform = waveform.unsqueeze(0)
else:
    waveform = waveform.permute(1, 0)  # soundfile returns (samples, channels)
if waveform.shape[0] > 1:
    waveform = waveform.mean(dim=0, keepdim=True)  # downmix to mono
if sr != target_sample_rate:
    resampler = torchaudio.transforms.Resample(sr, target_sample_rate)
    waveform = resampler(waveform)

# The feature extractor pads or truncates to Whisper's 30-second window by default
inputs = processor(waveform.squeeze(0).numpy(), sampling_rate=target_sample_rate, return_tensors="pt")
with torch.no_grad():
    generated_ids = model.generate(input_features=inputs.input_features.to(device), task="transcribe", language="zh")
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print("Transcription:", transcription)
```

---

## 🛠️ Custom Training

Each submodule ships with its own complete `README.md`. The example training pipelines cover:

* **my-wav2vec2**: CTC fine-tuning, AMP, linear warm-up scheduler
* **hok2han**: custom-built Seq2Seq Transformer, CrossEntropyLoss, early stopping
* **lora-whisper / lora-whisper-zh**: LoRA fine-tuning, distributed training with Accelerate, WER validation

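As a rough illustration of the linear warm-up scheduler listed for `my-wav2vec2`: the learning rate climbs linearly from 0 to its peak over the warm-up steps and then, in the common variant sketched here, decays linearly back to 0 (the same shape as `get_linear_schedule_with_warmup` in Transformers). The function name, step counts, and peak learning rate below are hypothetical, and the submodule's actual schedule may differ.

```python
def linear_warmup_factor(step: int, warmup_steps: int, total_steps: int) -> float:
    """Multiplier applied to the peak learning rate at a given step."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)  # ramp up from 0 to 1
    remaining = total_steps - step
    return max(0.0, remaining / max(1, total_steps - warmup_steps))  # decay from 1 to 0


peak_lr = 3e-4  # assumed value for illustration
for step in (0, 50, 100, 550, 1000):
    lr = peak_lr * linear_warmup_factor(step, warmup_steps=100, total_steps=1000)
    print(f"step {step:4d}: lr = {lr:.2e}")
```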
See the `README.md` in the corresponding folder, and tune the hyperparameters to your GPU resources and corpus size.

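The WER validation listed for the Whisper modules scores each hypothesis against its reference by edit distance: WER = (substitutions + deletions + insertions) / reference length. The following is a minimal pure-Python sketch, not the project's evaluation code. Han-character text is usually written without spaces, so the example can also tokenize by character, which strictly yields a character error rate; that tokenization choice, the function names, and the toy strings are assumptions.

```python
def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(reference: str, hypothesis: str, by_char: bool = False) -> float:
    """Edit distance divided by reference length (WER, or CER when by_char is True)."""
    ref = list(reference) if by_char else reference.split()
    hyp = list(hypothesis) if by_char else hypothesis.split()
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# One deleted word out of three reference words
print(error_rate("li ho bo", "li ho"))
# One deleted character out of four reference characters
print(error_rate("台語真好", "台語好", by_char=True))
```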
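---

## 🤝 Contributing

Taiwanese-language enthusiasts, speech processing researchers, and AI developers are all welcome to take part:

1. Fork this repository and create a branch (`git checkout -b feature/xxx`)
2. When development is done, open a PR that describes your changes and test results
3. Propose or discuss new features in an Issue
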
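---

## 📜 License

* **Code**: Apache 2.0 License ([LICENSE](./LICENSE))
* **Corpus data**: subject to the CC BY-ND 3.0 TW terms of the Dictionary of Frequently-Used Taiwan Minnan (《臺灣閩南語常用詞辭典》) published by the Ministry of Education, Republic of China; for academic research and non-commercial use only
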
---

## ✉️ Contact

If you have questions, get in touch through a GitHub Issue or a direct message:

* GitHub: [10809104/taigi-speech-to-text](https://github.com/10809104/taigi-speech-to-text)
* Hugging Face Spaces: [KikKoh](https://huggingface.co/KikKoh)
* Facebook: [KikKoh2024](https://www.facebook.com/kikkoh.2024)

---

Good luck with your research; we look forward to your contributions!