GPT-SoVITS Taiwanese (Hokkien) — Trilingual S1 + r4 e15 S2
Pre-trained weights for the Taiwanese (Hokkien / Pe̍h-ōe-jī) fork of GPT-SoVITS. The S1 is trilingual (TW + ZH + weak EN) thanks to embedding transplant; the S2 is a v2ProTw vocoder finetuned on Taiwanese audio.
Inference code, sandhi preprocessor, training recipe, and Traditional Chinese documentation: github.com/KaedeTai/GPT-SoVITS · TAIWANESE.md · TAIWANESE.zh-tw.md
Files
| File | Size | What |
|---|---|---|
| s1_trilingual.ckpt | 156 MB | S1 GPT: TW (sandhi-trained, e15) + transplanted base ZH/EN embeddings |
| s2_r4_e15.pth | 952 MB | S2 SoVITS v2ProTw: full-state ckpt at epoch 15 of finetune run r4 |
Quick start
git clone https://github.com/KaedeTai/GPT-SoVITS.git
cd GPT-SoVITS
python3.11 -m venv .venv && source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
./download_pretrained.sh # upstream base pretraineds (BERT, hubert, etc.)
# Pull these weights
hf download KaedeTai/gpt-sovits-tw s1_trilingual.ckpt --local-dir ./models
hf download KaedeTai/gpt-sovits-tw s2_r4_e15.pth --local-dir ./models
# One-line synthesis (POJ-with-diacritics in, mp3 out)
python -m tw_inference.tts_cli "Lí hó, sè-kài!" -o hello.mp3
Or the local web UI:
python -m tw_inference.webui # → http://127.0.0.1:5557/
Quality
| Language | Fluency | Pronunciation | Notes |
|---|---|---|---|
| Taiwanese (POJ) | ~80 / 100 | ~75 / 100 | Single trained speaker; long sentences (>60 syllables) occasionally drift. |
| Mandarin (中文) | usable | usable | Preserved via embedding transplant from s1v3 base. |
| English | weak | weak | Base never had real English; included for completeness only. |
Code-switching within one utterance is not supported; use `{tw:...}` / `{zh:...}` markup per segment.
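To make the per-segment markup concrete, here is a minimal sketch of how such a string could be split into language-tagged segments. The exact grammar the fork accepts is not documented here, so `SEGMENT_RE` and `split_segments` are hypothetical names, and the sketch assumes flat, non-nested `{tw:...}` / `{zh:...}` blocks only.

```python
import re

# Assumption: segments are flat {lang:text} blocks with lang in {tw, zh}.
# This is an illustration, not the parser used by tw_inference.
SEGMENT_RE = re.compile(r"\{(tw|zh):([^{}]*)\}")


def split_segments(marked_up: str) -> list[tuple[str, str]]:
    """Return (language, text) pairs in order of appearance."""
    return [(m.group(1), m.group(2).strip()) for m in SEGMENT_RE.finditer(marked_up)]


segments = split_segments("{tw:Lí hó} {zh:你好}")
```

Each pair could then be synthesized separately, which matches the note that a single utterance stays in one language.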
Architecture
Two-stage TTS:
- S1 (GPT): autoregressive token model mapping POJ phoneme tokens → SoVITS semantic codes. Vocabulary expanded from 732 → 1033 (301 Taiwanese `tw_*` tokens added on top of the upstream Mandarin vocabulary). The trilingual variant preserves Mandarin by transplanting rows 0..731 of the embedding table from a clean `s1v3` checkpoint back into the TW-finetuned ckpt.
- S2 (SoVITS v2Pro / v2ProTw): non-autoregressive vocoder; takes semantic codes + a speaker embedding (cnhubert + sv) and produces 32 kHz mono waveform.
- Sandhi preprocessor: applies standard Taiwanese tone-sandhi rules to citation-tone POJ before tokenization, so the model sees the tone sequence speakers actually produce. 13 flags; defaults match the eval configuration that produced our best reported CER.
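The embedding transplant described for S1 amounts to a row-wise copy. The sketch below shows the idea on plain nested lists so it runs without dependencies; with a real checkpoint the same slice would be applied to the embedding tensor, whose key name inside the ckpt is not given here. `transplant_rows` is a hypothetical helper, not a function from the repo.

```python
def transplant_rows(finetuned, base, n_rows=732):
    """Return a copy of `finetuned` whose first `n_rows` rows come from `base`.

    Rows 0..n_rows-1 are the upstream vocabulary; rows from n_rows onward are
    the added tw_* tokens and are kept exactly as trained.
    """
    if len(base) < n_rows or len(finetuned) < n_rows:
        raise ValueError("both tables need at least n_rows rows")
    restored = [list(row) for row in base[:n_rows]]    # clean upstream rows
    kept = [list(row) for row in finetuned[n_rows:]]   # trained TW rows, untouched
    return restored + kept


# Toy tables standing in for a 732-row base and a 1033-row finetuned table.
base = [[0.0, 0.0] for _ in range(732)]
finetuned = [[1.0, 1.0] for _ in range(1033)]
merged = transplant_rows(finetuned, base)
```

The merged table keeps the expanded size (1033 rows), so the TW-finetuned model's output shape is unchanged.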
Training data
- MoE Tâi-uân-gí 教育部臺灣閩南語常用詞辭典 example sentences (majority of the corpus).
- Common Voice `nan-tw` validated split.
- Multi-speaker. Per-segment 3-12 s, 32 kHz mono, loudness normalised.
- Labels: POJ with diacritics, pre-processed with the sandhi preprocessor so the written form matches the audio realisation.
Total: roughly 15-25 hours of paired audio + POJ.
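To illustrate what sandhi-preprocessed labels mean, the following sketch reads the citation tone from POJ diacritics and applies the textbook sandhi chain to every syllable except the last. It is a strong simplification under stated assumptions: the whole phrase is treated as one tone group, tone 5 follows the southern pattern (5 → 7), and none of the preprocessor's 13 flags are modelled. All function names are hypothetical.

```python
import re
import unicodedata

# POJ combining tone marks: acute=2, grave=3, circumflex=5, macron=7,
# vertical line above=8. Unmarked syllables are tone 1, or tone 4 if checked.
TONE_MARKS = {"\u0301": 2, "\u0300": 3, "\u0302": 5, "\u0304": 7, "\u030d": 8}
PLAIN_SANDHI = {1: 7, 2: 1, 3: 2, 5: 7, 7: 3}  # assumption: southern 5 -> 7


def _final_letter(syllable: str) -> str:
    # Drop the nasal marker (superscript n) before inspecting the final consonant.
    stripped = unicodedata.normalize("NFD", syllable.lower()).rstrip("\u207f")
    return stripped[-1] if stripped else ""


def citation_tone(syllable: str) -> int:
    for ch in unicodedata.normalize("NFD", syllable):
        if ch in TONE_MARKS:
            return TONE_MARKS[ch]
    return 4 if _final_letter(syllable) in "ptkh" else 1


def sandhi_tone(syllable: str) -> int:
    tone = citation_tone(syllable)
    ends_in_h = _final_letter(syllable) == "h"
    if tone == 4:
        return 2 if ends_in_h else 8
    if tone == 8:
        return 3 if ends_in_h else 4
    return PLAIN_SANDHI[tone]


def surface_tones(phrase: str) -> list[int]:
    """Tone numbers as spoken, treating the phrase as a single tone group."""
    syllables = re.findall(r"[^\s\-.,!?;:]+", phrase)
    if not syllables:
        return []
    return [sandhi_tone(s) for s in syllables[:-1]] + [citation_tone(syllables[-1])]
```

For example, `surface_tones("Tâi-oân-lâng")` turns three citation tone-5 syllables into tones 7, 7, 5: only the final syllable keeps its citation tone.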
Evaluation
Reported quality is from human listening; ASR-based CER was used for ablations but flattens out at the top of the quality curve.
| Test set | Stack | Mean POJ-CER (BreezeASR-26-derived) |
|---|---|---|
| Canonical 5-sentence | S1 trilingual e15 + S2 r4 e15 + sandhi v1 | 4.44% |
| 13-sentence long content | same | ~15% |
Per-sentence breakdown for the 5-sentence set is in `tw_samples/eval_summary.json` in the GitHub repo. Demo mp3s are in `tw_samples/demo_*.mp3`.
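As a reminder of what the CER column measures, here is a minimal sketch of character error rate: Levenshtein distance over characters divided by the reference length. How the eval script normalises POJ (case, hyphens, combining diacritics) and how the ASR output is mapped to POJ are not specified here; the NFC normalisation below is an assumption, and the function names are hypothetical.

```python
import unicodedata


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance (insertions, deletions, substitutions all cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate of `hypothesis` against `reference`."""
    ref = unicodedata.normalize("NFC", reference)   # assumption: NFC on both sides
    hyp = unicodedata.normalize("NFC", hypothesis)
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)
```

Because the denominator is the reference length, CER can exceed 100% when the hypothesis is much longer than the reference.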
Known limitations
- English is weak. Don't ship this for English use cases.
- Long sentences drift past ~60 syllables. The inference pipeline splits at punctuation to mitigate but doesn't eliminate this.
- Code-switching not supported within a single utterance.
- Fidelity to any single speaker is capped by the heterogeneity of the multi-speaker corpus; with a single-speaker corpus we'd expect higher voice consistency but narrower coverage.
- POJ input only. No built-in Han-character → POJ pipeline.
- MPS nondeterminism. Same seed + same machine still produces audibly different output across runs (5-10% spread).
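The punctuation-based splitting mentioned for long sentences can be sketched as a greedy packer: cut after punctuation, then merge neighbouring pieces while the chunk stays within a syllable budget. This is not the repo's implementation; the punctuation set, the syllable counting rule (POJ syllables separated by hyphens or whitespace), and the 60-syllable default are assumptions taken from the limits quoted above.

```python
import re

PUNCT = ".,!?;:"  # assumption: ASCII punctuation only


def count_syllables(text: str) -> int:
    """Count POJ syllables, treating hyphens, spaces and punctuation as separators."""
    return len(re.findall(r"[^\s\-" + re.escape(PUNCT) + r"]+", text))


def split_at_punctuation(text: str, max_syllables: int = 60) -> list[str]:
    """Greedily pack punctuation-delimited pieces into chunks under the budget."""
    pieces = [p.strip() for p in re.split(r"(?<=[" + re.escape(PUNCT) + r"])\s*", text)]
    chunks, current = [], ""
    for piece in filter(None, pieces):
        candidate = f"{current} {piece}".strip()
        if current and count_syllables(candidate) > max_syllables:
            chunks.append(current)   # budget exceeded: close the current chunk
            current = piece
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

A single piece longer than the budget is still emitted whole, which is consistent with the note that splitting mitigates drift without eliminating it.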
How this was built (short version)
The long version with lessons learned and what we'd do differently is in TAIWANESE.md. Short version:
- S2 first (~24 h on M1 Max): full SoVITS v2ProTw finetune from `s2Gv2Pro.pth`. 15 epochs.
- S1 next (~12-30 h): `s1_train_mps_arpa_freeze.py` from `s1v3.ckpt`, ARPA-row freeze, warmup → cosine LR (peak 1e-2, end 1e-4, 2000-step warmup, 40k-step decay). Critical patch: upstream `lr_schedulers.py` had a hardcode locking every run to LR=0.002 regardless of yaml; that's now removed.
- Sandhi-aligned labels are non-negotiable. Training on citation-tone POJ when the recordings have natural sandhi produces a systematically mispronouncing model.
- Embedding transplant for the trilingual variant: copy rows 0..731 from a clean `s1v3` back into the TW-finetuned ckpt. Restores Mandarin without touching the trained TW rows.
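The warmup → cosine schedule quoted for S1 can be written as a function of the step. This is a sketch under two assumptions not stated above: the warmup is linear from zero, and the 40k decay steps are counted after the warmup ends. `lr_at` is a hypothetical name, not the scheduler class in `lr_schedulers.py`.

```python
import math


def lr_at(step: int, peak: float = 1e-2, end: float = 1e-4,
          warmup_steps: int = 2000, decay_steps: int = 40_000) -> float:
    """Learning rate at `step`: linear warmup, then cosine decay to `end`."""
    if step < warmup_steps:
        return peak * step / warmup_steps          # assumption: linear from 0
    progress = min((step - warmup_steps) / decay_steps, 1.0)
    return end + 0.5 * (peak - end) * (1.0 + math.cos(math.pi * progress))
```

Under these assumptions the rate reaches the peak at step 2000, hits the floor at step 42000, and stays there afterwards. Printing `lr_at` at a few steps is a quick way to confirm that the yaml values are in effect and no hardcoded constant overrides them.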
License & credits
- License: MIT (matches upstream GPT-SoVITS).
- Upstream: RVC-Boss/GPT-SoVITS.
- TW adaptation: KaedeTai.
- Acknowledgments: MoE 教育部臺灣閩南語常用詞辭典 example sentence corpus, Common Voice `nan-tw` (Mozilla), BreezeASR-26 (MediaTek) for TW ASR eval, linshoufan/whisper-small-nan-tw-pinyin for POJ ASR.
Citation
If you find this useful in academic work, please cite the upstream GPT-SoVITS and this fork:
@misc{gpt-sovits-tw-2026,
title = {GPT-SoVITS Taiwanese (Hokkien) trilingual fork},
author = {KaedeTai},
year = {2026},
howpublished = {\url{https://huggingface.co/KaedeTai/gpt-sovits-tw}}
}