metadata
title: Hokkien
emoji: 👀
colorFrom: green
colorTo: pink
sdk: docker
pinned: false
license: apache-2.0
short_description: 台語語音辨識示範,使用 Wav2Vec2 模型將錄音轉成羅馬拼音;使用 whisper 模型將錄音轉成中文。
Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
台語語音辨識系統(Taiwanese Hokkien Speech-to-Text)
本專案旨在提供一套完整且可擴展的台語(臺灣閩南語)語音辨識解決方案,涵蓋從語音 ➜ 拼音 ➜ 漢字的雙階段架構,以及基於 LoRA 微調的 Whisper 模型,並同時支援本地與雲端部署。整體技術棧採用 PyTorch、Hugging Face Transformers、PEFT、Accelerate 等先進工具,確保訓練效能、推論效率與易用性。
🔗 線上體驗:Hugging Face Spaces - KikKoh/Hokkien
🎯 專案亮點
雙階段架構:
- Stage 1:台語語音 ➜ 羅馬拼音(台羅) (my-wav2vec2 模組)
- Stage 2:羅馬拼音 ➜ 台語漢字 (hok2han 模組)
基於 LoRA 微調的 Whisper 模型:
- Mode 1:台語語音 ➜ 台語漢字 (lora-whisper 模組)
- Mode 2:台語語音 ➜ 中文文字 (lora-whisper-zh 模組)
多模型支援:CTC-Based (Wav2Vec2)、Transformer Seq2Seq、Whisper + LoRA 微調
高效訓練:混合精度 (AMP)、LoRA 參數高效微調、Accelerate 分散式訓練
易用部署:Hugging Face Hub / Spaces 一鍵上傳、Dockerfile
開放原始碼:Apache 2.0,歡迎學術與非商業用途
📂 專案結構
taiwanese-speech-to-text/
├── data/
│ ├── 詞條音檔/...
│ ├── 例句音檔/...
│ └── kautian.ods
│
├── model/
│ ├── my-wav2vec2/...
│ ├── hok2han/...
│ ├── lora-whisper/...
│ └── lora-whisper-zh/...
│
├── Dockerfile
├── requirements.txt
├── LICENSE
└── README.md
📥 安裝與環境準備
Clone 專案
git clone https://github.com/KikKoh/taiwanese-speech-to-text.git cd taiwanese-speech-to-text建立虛擬環境(建議使用
venv或conda)python3.10 -m venv venv source venv/bin/activate安裝相依套件
pip install --upgrade pip pip install -r requirements.txt(可選) 安裝 GPU / Accelerate 支援
pip install accelerate accelerate config # 初始化設定
🚀 快速上手
1. 推論:語音 ➜ 羅馬拼音
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import soundfile as sf
processor = Wav2Vec2Processor.from_pretrained("my-wav2vec2")
model = Wav2Vec2ForCTC.from_pretrained("my-wav2vec2").to(device)
model.eval()
waveform, sr = sf.read("audio.wav")
waveform = torch.tensor(waveform).float()
if waveform.dim() == 1:
waveform = waveform.unsqueeze(0)
else:
waveform = waveform.permute(1, 0)
if waveform.shape[0] > 1:
waveform = waveform.mean(dim=0, keepdim=True)
if sr != target_sample_rate:
resampler = torchaudio.transforms.Resample(sr, 16000)
waveform = resampler(waveform)
audio_input = processor(waveform, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
logits = model(audio_input.input_values).logits
pred_ids = torch.argmax(logits, dim=-1)[0].tolist()
romaji = processor.batch_decode(pred_ids)
print("羅馬拼音:", romaji)
2. 推論:拼音 ➜ 台語漢字
from transformers import AutoTokenizer
from model.hok2han import Seq2SeqTransformer
input_tokenizer = AutoTokenizer.from_pretrained("KikKoh/Hok2Han", subfolder="input_tokenizer")
output_tokenizer = AutoTokenizer.from_pretrained("KikKoh/Hok2Han", subfolder="output_tokenizer")
model = Seq2SeqTransformer.from_pretrained("KikKoh/Hok2Han").to(device)
model.eval()
encoded = input_tokenizer("pinyin", max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")
input_ids = encoded["input_ids"].to(device)
attention_mask = encoded["attention_mask"].to(device)
start_token_id = output_tokenizer.cls_token_id or output_tokenizer.convert_tokens_to_ids('<s>')
end_token_id = output_tokenizer.sep_token_id or output_tokenizer.convert_tokens_to_ids('</s>')
tgt_ids = torch.tensor([[start_token_id]], device=device)
total_confidence = 0.0
token_count = 0
for _ in range(max_len - 1):
tgt_mask = generate_square_subsequent_mask(tgt_ids.size(1)).to(device)
tgt_key_padding_mask = (tgt_ids == output_tokenizer.pad_token_id).to(device)
outputs = model(input_ids, tgt_ids, input_tokenizer.pad_token_id, =output_tokenizer.pad_token_id,
src_key_padding_mask=(attention_mask == 0), tgt_key_padding_mask=tgt_key_padding_mask)
next_token_logits = outputs[:, -1, :]
probs = softmax(next_token_logits, dim=-1)
next_token = torch.argmax(probs, dim=-1).unsqueeze(1)
tgt_ids = torch.cat([tgt_ids, next_token], dim=1)
if next_token.item() == end_token_id:
break
translation = output_tokenizer.decode(tgt_ids[0], skip_special_tokens=True).replace(" ", "")
print("台語漢字:", translation)
3. 推論:Whisper + LoRA(zh)
from peft import PeftModel
from transformers import WhisperForConditionalGeneration, WhisperProcessor
processor = WhisperProcessor.from_pretrained("openai/whisper-small", language="zh", task="transcribe")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
model = PeftModel.from_pretrained(model, "demo/lora-whisper").to(device).eval()
waveform, sr = sf.read("audio.wav")
waveform = torch.tensor(waveform).float()
if waveform.dim() == 1:
waveform = waveform.unsqueeze(0)
else:
waveform = waveform.permute(1, 0)
if waveform.shape[0] > 1:
waveform = waveform.mean(dim=0, keepdim=True)
if sr != target_sample_rate:
resampler = torchaudio.transforms.Resample(sr, 16000)
waveform = resampler(waveform)
inputs = processor(waveform, sampling_rate=sample_rate, return_tensors="pt", padding=True)
with torch.no_grad():
generated_ids = model.generate(input_features=inputs.input_features.to(device), task="transcribe", language="zh")
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print("生成文本:", transcription)
🛠️ 自訂訓練
各子模組已提供完整 README.md,範例訓練流程包含:
- my-wav2vec2:CTC 微調、AMP、線性 warm-up scheduler
- hok2han:自架 Seq2Seq Transformer、CrossEntropyLoss、Early Stopping
- lora-whisper / lora-whisper-zh:LoRA 微調、Accelerate 分散式訓練、WER 驗證
請參考對應資料夾下的 README.md,並依據 GPU 計算資源、語料規模調整超參數。
🤝 貢獻指南
歡迎台語愛好者、語音處理研究者、AI 開發者參與:
- Fork 本倉庫並建立分支 (
git checkout -b feature/xxx) - 完成開發後提交 PR,詳述修改內容與測試結果
- Issue 中提案或討論新功能
📜 授權條款
- 程式碼:Apache 2.0 License (LICENSE)
- 語料資料:依據中華民國教育部《臺灣閩南語常用詞辭典》CC BY-ND 3.0 TW 條款,僅用於學術研究與非商業用途
✉️ 聯絡方式
如有疑問,請透過 GitHub Issue 或私訊聯絡:
- GitHub: 10809104/taigi-speech-to-text
- Hugging Face Spaces: KikKoh
- facebook: KikKoh2024)
祝研究順利,期待您的貢獻!