GPT-SoVITS Taiwanese (Hokkien) — Trilingual S1 + r4 e15 S2

Pre-trained weights for the Taiwanese (Hokkien, written in Pe̍h-ōe-jī) fork of GPT-SoVITS. S1 is trilingual (TW + ZH + weak EN) thanks to an embedding transplant; S2 is a v2ProTw vocoder finetuned on Taiwanese audio.

Inference code, sandhi preprocessor, training recipe, and Traditional Chinese documentation: github.com/KaedeTai/GPT-SoVITS · TAIWANESE.md · TAIWANESE.zh-tw.md

Files

| File | Size | What |
|---|---|---|
| s1_trilingual.ckpt | 156 MB | S1 GPT: TW (sandhi-trained, e15) + transplanted base ZH/EN embeddings |
| s2_r4_e15.pth | 952 MB | S2 SoVITS v2ProTw: full-state ckpt at epoch 15 of finetune run r4 |

Quick start

git clone https://github.com/KaedeTai/GPT-SoVITS.git
cd GPT-SoVITS
python3.11 -m venv .venv && source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
./download_pretrained.sh   # upstream base pretraineds (BERT, hubert, etc.)

# Pull these weights
hf download KaedeTai/gpt-sovits-tw s1_trilingual.ckpt --local-dir ./models
hf download KaedeTai/gpt-sovits-tw s2_r4_e15.pth      --local-dir ./models

# One-line synthesis (POJ-with-diacritics in, mp3 out)
python -m tw_inference.tts_cli "Lí hó, sè-kài!" -o hello.mp3

Or the local web UI:

python -m tw_inference.webui   # → http://127.0.0.1:5557/

Quality

| Language | Fluency | Pronunciation | Notes |
|---|---|---|---|
| Taiwanese (POJ) | ~80 / 100 | ~75 / 100 | Single trained speaker; long sentences (>60 syllables) occasionally drift. |
| Mandarin (中文) | usable | usable | Preserved via embedding transplant from s1v3 base. |
| English | weak | weak | Base never had real English; included for completeness only. |

Code-switching within one utterance is not supported — use {tw:...} / {zh:...} markup per segment.
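
As a rough illustration of per-segment markup, the sketch below splits a marked-up string into (language, text) pairs that could then be synthesized one at a time. The grammar is an assumption based only on the `{tw:...}` / `{zh:...}` forms shown above; `split_segments` and its `default_lang` parameter are hypothetical names, not part of the fork's API.

```python
import re

# Assumed grammar: "{lang:text}" with lang either "tw" or "zh", no nesting.
SEGMENT_RE = re.compile(r"\{(tw|zh):([^{}]*)\}")


def split_segments(marked_up: str, default_lang: str = "tw") -> list[tuple[str, str]]:
    """Split marked-up text into (language, text) pairs, in order.

    Text outside any {lang:...} span is assigned to default_lang
    (a hypothetical convention chosen for this sketch).
    """
    segments = []
    pos = 0
    for match in SEGMENT_RE.finditer(marked_up):
        plain = marked_up[pos:match.start()].strip()
        if plain:
            segments.append((default_lang, plain))
        text = match.group(2).strip()
        if text:
            segments.append((match.group(1), text))
        pos = match.end()
    tail = marked_up[pos:].strip()
    if tail:
        segments.append((default_lang, tail))
    return segments


segments = split_segments("{tw:Lí hó} {zh:你好}")
```

Each resulting pair would be passed to the synthesizer separately and the audio concatenated afterwards.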

Architecture

Two-stage TTS:

  • S1 (GPT) — autoregressive token model mapping POJ phoneme tokens → SoVITS semantic codes. Vocabulary expanded from 732 → 1033 (301 Taiwanese tw_* tokens added on top of the upstream Mandarin vocabulary). The trilingual variant preserves Mandarin by transplanting rows 0..731 of the embedding table from a clean s1v3 checkpoint back into the TW-finetuned ckpt.
  • S2 (SoVITS v2Pro / v2ProTw) — non-autoregressive vocoder; takes semantic codes + a speaker embedding (cnhubert + sv) and produces 32 kHz mono waveform.
  • Sandhi preprocessor — applies standard Taiwanese tone-sandhi rules to citation-tone POJ before tokenization, so the model sees the tone sequence speakers actually produce. 13 flags; defaults match the eval configuration that produced our best reported CER.
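
The embedding transplant described for S1 amounts to overwriting the first 732 rows of the finetuned table with the corresponding rows of the base table. The following is a minimal sketch on toy nested lists, not the fork's actual script: `transplant_rows` is a hypothetical helper, and on a real checkpoint the same row-wise copy would be applied to the embedding tensor under whatever key the checkpoint uses.

```python
BASE_VOCAB = 732   # upstream rows 0..731 (from the text above)
FULL_VOCAB = 1033  # after adding 301 tw_* tokens (from the text above)


def transplant_rows(finetuned, base, n_rows=BASE_VOCAB):
    """Return a copy of `finetuned` whose first n_rows rows come from `base`.

    Rows n_rows.. (the trained tw_* rows) are left untouched.
    """
    if len(base) < n_rows or len(finetuned) < n_rows:
        raise ValueError("both tables need at least n_rows rows")
    merged = [list(row) for row in finetuned]  # copy; do not mutate the input
    for i in range(n_rows):
        if len(base[i]) != len(merged[i]):
            raise ValueError(f"row {i}: embedding widths differ")
        merged[i] = list(base[i])
    return merged


# Toy tables: base rows are all 0.0, finetuned rows are all 1.0.
dim = 4
base = [[0.0] * dim for _ in range(BASE_VOCAB)]
finetuned = [[1.0] * dim for _ in range(FULL_VOCAB)]
merged = transplant_rows(finetuned, base)
```

In the toy result, rows 0..731 carry the base values and rows 732..1032 keep the finetuned values, which mirrors "restore Mandarin without touching the trained TW rows".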

Training data

  • MoE Tâi-uân-gí 教育部臺灣閩南語常用詞辭典 (Ministry of Education Dictionary of Frequently-Used Taiwan Minnan) example sentences (majority of the corpus).
  • Common Voice nan-tw validated split.
  • Multi-speaker; segments are 3-12 s each, 32 kHz mono, loudness-normalised.
  • Labels: POJ with diacritics, pre-processed with the sandhi preprocessor so the written form matches the audio realisation.

Total: roughly 15-25 hours of paired audio + POJ.
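
To make the label preprocessing concrete, here is a small sketch of the textbook Taiwanese tone-sandhi circle applied within one tone group. It is not the repository's preprocessor and ignores its 13 flags. Assumptions: syllables arrive with numeric tones (e.g. `tai5`) rather than POJ diacritics, tone-group boundaries are already known, neutral tones and other special cases are not handled, and `sandhi_tone` / `apply_group_sandhi` are hypothetical names.

```python
import re

SYLLABLE_RE = re.compile(r"^([a-zA-Z]+)([1-8])$")

# Unchecked tones: 1 -> 7, 7 -> 3, 3 -> 2, 2 -> 1 (tone 5 is dialect-dependent).
UNCHECKED = {1: 7, 7: 3, 3: 2, 2: 1}


def sandhi_tone(body: str, tone: int, tone5_to: int = 7) -> int:
    """Map a citation tone to its sandhi tone (textbook rules only)."""
    if tone in UNCHECKED:
        return UNCHECKED[tone]
    if tone == 5:
        return tone5_to  # 7 in southern varieties, 3 in northern ones
    glottal = body.endswith("h")
    if tone == 4:
        return 2 if glottal else 8  # -h vs -p/-t/-k finals
    if tone == 8:
        return 3 if glottal else 4
    return tone


def apply_group_sandhi(group: list[str], tone5_to: int = 7) -> list[str]:
    """Apply sandhi to every syllable of a tone group except the last."""
    out = []
    for i, syllable in enumerate(group):
        match = SYLLABLE_RE.match(syllable)
        if match is None or i == len(group) - 1:
            out.append(syllable)  # final syllable keeps its citation tone
            continue
        body, tone = match.group(1), int(match.group(2))
        out.append(f"{body}{sandhi_tone(body, tone, tone5_to)}")
    return out


southern = apply_group_sandhi(["tai5", "uan5", "gi2"])
```

The hard part in practice is deciding where tone groups end, which this sketch takes as given.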

Evaluation

Reported quality is from human listening; ASR-based CER was used for ablations but flattens out at the top of the quality curve.

| Test set | Stack | Mean POJ-CER (BreezeASR-26-derived) |
|---|---|---|
| Canonical 5-sentence | S1 trilingual e15 + S2 r4 e15 + sandhi v1 | 4.44% |
| 13-sentence long content | same | ~15% |

Per-sentence breakdown for the 5-sentence set is in tw_samples/eval_summary.json in the GitHub repo. Demo mp3s are in tw_samples/demo_*.mp3.
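
For readers unfamiliar with the metric, character error rate is the Levenshtein edit distance between the reference text and the ASR transcript, divided by the reference length. The sketch below is a generic implementation, not the evaluation script used here; the NFC normalization step and the per-sentence averaging in `mean_cer` are assumptions about how a "mean POJ-CER" might be computed.

```python
import unicodedata


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate of `hyp` against `ref`."""
    # Assumption: normalize so the same diacritic is encoded consistently.
    ref = unicodedata.normalize("NFC", ref)
    hyp = unicodedata.normalize("NFC", hyp)
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def mean_cer(pairs: list[tuple[str, str]]) -> float:
    """Average of per-sentence CERs (one possible reading of 'mean CER')."""
    return sum(cer(ref, hyp) for ref, hyp in pairs) / len(pairs)
```

Because CER is normalized by reference length, a single wrong syllable costs more on a short sentence than on a long one, which is one reason small test sets give noisy numbers.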

Known limitations

  • English is weak. Don't ship this for English use cases.
  • Long sentences drift past ~60 syllables. The inference pipeline splits at punctuation to mitigate but doesn't eliminate this.
  • Code-switching not supported within a single utterance.
  • Per-speaker voice fidelity is capped by the heterogeneity of the multi-speaker corpus; with a single-speaker corpus we'd expect higher voice consistency but narrower coverage.
  • POJ input only. No built-in Han-character → POJ pipeline.
  • MPS nondeterminism. Same seed + same machine still produces audibly different output across runs (5-10% spread).
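
The punctuation-based splitting mentioned above can be sketched as a greedy packer that keeps each chunk under a syllable budget. This is an illustration only, not the inference pipeline's code: the 60-syllable budget comes from the limitation noted above, while the punctuation set, the syllable-counting rule (POJ syllables separated by hyphens or spaces), and the function names are assumptions.

```python
import re

MAX_SYLLABLES = 60  # drift threshold quoted in the limitations above


def count_syllables(text: str) -> int:
    """Count POJ syllables, assuming hyphens or whitespace separate them."""
    parts = re.split(r"[\s\-]+", text)
    return sum(1 for part in parts if any(ch.isalpha() for ch in part))


def split_at_punctuation(text: str, max_syllables: int = MAX_SYLLABLES) -> list[str]:
    """Greedily pack punctuation-delimited clauses into chunks within budget."""
    # Each clause is a run of non-punctuation plus its trailing punctuation.
    clauses = [c.strip() for c in re.findall(r"[^,.;:!?]+[,.;:!?]*", text)]
    chunks, current, current_n = [], [], 0
    for clause in clauses:
        if not clause:
            continue
        n = count_syllables(clause)
        if current and current_n + n > max_syllables:
            chunks.append(" ".join(current))
            current, current_n = [], 0
        current.append(clause)
        current_n += n
    if current:
        chunks.append(" ".join(current))
    return chunks


chunks = split_at_punctuation("Lí hó, sè-kài!", max_syllables=2)
```

A single clause longer than the budget is still emitted whole, which is consistent with splitting mitigating drift without eliminating it.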

How this was built (short version)

The long version with lessons learned and what we'd do differently is in TAIWANESE.md. Short version:

  1. S2 first (~24 h on M1 Max): full SoVITS v2ProTw finetune from s2Gv2Pro.pth. 15 epochs.
  2. S1 next (~12-30 h): s1_train_mps_arpa_freeze.py from s1v3.ckpt, ARPA-row freeze, warmup → cosine LR (peak 1e-2, end 1e-4, 2000-step warmup, 40k-step decay). Critical patch: upstream lr_schedulers.py hard-coded LR=0.002 for every run regardless of the yaml; that hardcode is now removed.
  3. Sandhi-aligned labels are non-negotiable. Training on citation-tone POJ when the recordings have natural sandhi produces a systematically mispronouncing model.
  4. Embedding transplant for the trilingual variant: copy rows 0..731 from a clean s1v3 back into the TW-finetuned ckpt. Restores Mandarin without touching the trained TW rows.
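
The warmup → cosine schedule from step 2 can be written as a single function of the step index. This sketch uses the quoted numbers (peak 1e-2, end 1e-4, 2000 warmup steps, 40k decay steps) but makes two assumptions the text does not settle: that the warmup is linear, and that the 40k decay steps are counted after the warmup ends. `lr_at` is a hypothetical name, not a function in the repository.

```python
import math


def lr_at(step: int, peak: float = 1e-2, end: float = 1e-4,
          warmup: int = 2000, decay: int = 40_000) -> float:
    """Learning rate at a given optimizer step (0-indexed)."""
    if step < warmup:
        # Assumed linear warmup that reaches `peak` on the last warmup step.
        return peak * (step + 1) / warmup
    # Cosine decay from `peak` to `end`, then hold at `end`.
    progress = min((step - warmup) / decay, 1.0)
    return end + 0.5 * (peak - end) * (1.0 + math.cos(math.pi * progress))


# Sample the schedule at a few points of interest.
schedule = {step: lr_at(step) for step in (0, 2000, 22_000, 42_000)}
```

Logging a few values like these at startup is a cheap way to confirm that the yaml settings, rather than a hard-coded constant, are driving the learning rate.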

License & credits

  • License: MIT (matches upstream GPT-SoVITS).
  • Upstream: RVC-Boss/GPT-SoVITS.
  • TW adaptation: KaedeTai.
  • Acknowledgments: MoE 教育部臺灣閩南語常用詞辭典 example sentence corpus, Common Voice nan-tw (Mozilla), BreezeASR-26 (MediaTek) for TW ASR eval, linshoufan/whisper-small-nan-tw-pinyin for POJ ASR.

Citation

If you find this useful in academic work, please cite the upstream GPT-SoVITS and this fork:

@misc{gpt-sovits-tw-2026,
  title  = {GPT-SoVITS Taiwanese (Hokkien) trilingual fork},
  author = {KaedeTai},
  year   = {2026},
  howpublished = {\url{https://huggingface.co/KaedeTai/gpt-sovits-tw}}
}