GPT-SoVITS Taiwanese (Hokkien) — Trilingual S1 + r4 e15 S2

Pre-trained weights for the Taiwanese (Hokkien / Pe̍h-ōe-jī) fork of GPT-SoVITS. The S1 is trilingual (TW + ZH + weak EN) thanks to embedding transplant; the S2 is a v2ProTw vocoder finetuned on Taiwanese audio.

Inference code, sandhi preprocessor, training recipe, and Traditional Chinese documentation: github.com/KaedeTai/GPT-SoVITS · TAIWANESE.md · TAIWANESE.zh-tw.md

Files

File	Size	What
`s1_trilingual.ckpt`	156 MB	S1 GPT — TW (sandhi-trained, e15) + transplanted base ZH/EN embeddings
`s2_r4_e15.pth`	952 MB	S2 SoVITS v2ProTw — full-state ckpt at epoch 15 of finetune run r4

Quick start

```shell
git clone https://github.com/KaedeTai/GPT-SoVITS.git
cd GPT-SoVITS
python3.11 -m venv .venv && source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
./download_pretrained.sh   # upstream base pretraineds (BERT, hubert, etc.)

# Pull these weights
hf download KaedeTai/gpt-sovits-tw s1_trilingual.ckpt --local-dir ./models
hf download KaedeTai/gpt-sovits-tw s2_r4_e15.pth      --local-dir ./models

# One-line synthesis (POJ-with-diacritics in, mp3 out)
python -m tw_inference.tts_cli "Lí hó, sè-kài!" -o hello.mp3
```

Or the local web UI:

```shell
python -m tw_inference.webui   # → http://127.0.0.1:5557/
```

Quality

Language	Fluency	Pronunciation	Notes
Taiwanese (POJ)	~80 / 100	~75 / 100	Single trained speaker; long sentences (>60 syllables) occasionally drift.
Mandarin (中文)	usable	usable	Preserved via embedding transplant from `s1v3` base.
English	weak	weak	Base never had real English; included for completeness only.

Code-switching within one utterance is not supported; wrap each segment in {tw:...} or {zh:...} markup instead.
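
To illustrate the per-segment markup, here is a minimal sketch of how marked-up input could be split into single-language segments before synthesis. It is not the fork's parser: the function name `split_segments`, the rule that untagged text falls back to Taiwanese, and the assumption that tags never nest are all hypothetical.

```python
import re

# Matches one tagged span such as {tw:Lí hó} or {zh:你好}.
# Assumption: tags never nest and the tagged text contains no literal '}'.
TAG_RE = re.compile(r"\{(tw|zh):([^}]*)\}")


def split_segments(text, default_lang="tw"):
    """Split marked-up text into (lang, text) pairs, one per segment."""
    segments = []
    pos = 0
    for match in TAG_RE.finditer(text):
        # Untagged text before this tag falls back to the default language.
        plain = text[pos:match.start()].strip()
        if plain:
            segments.append((default_lang, plain))
        inner = match.group(2).strip()
        if inner:
            segments.append((match.group(1), inner))
        pos = match.end()
    # Whatever follows the last tag is also treated as default-language text.
    tail = text[pos:].strip()
    if tail:
        segments.append((default_lang, tail))
    return segments
```

Each returned pair can then be synthesised as its own utterance and the audio concatenated, which keeps every model call monolingual.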

Architecture

Two-stage TTS:

S1 (GPT) — autoregressive token model mapping POJ phoneme tokens → SoVITS semantic codes. Vocabulary expanded from 732 → 1033 (301 Taiwanese tw_* tokens added on top of the upstream Mandarin vocabulary). The trilingual variant preserves Mandarin by transplanting rows 0..731 of the embedding table from a clean s1v3 checkpoint back into the TW-finetuned ckpt.
S2 (SoVITS v2Pro / v2ProTw) — non-autoregressive vocoder; takes semantic codes + a speaker embedding (cnhubert + sv) and produces 32 kHz mono waveform.
Sandhi preprocessor — applies standard Taiwanese tone-sandhi rules to citation-tone POJ before tokenization, so the model sees the tone sequence speakers actually produce. 13 flags; defaults match the eval configuration that produced our best reported CER.
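
The embedding transplant described for S1 amounts to a row-wise copy between two tables of different height. The sketch below shows the idea on plain Python lists so it runs without any dependencies; with a real checkpoint the same step would be a slice assignment on the embedding tensor. The function name `transplant_rows` and the toy width of 4 are assumptions, and the actual state-dict key is not given here.

```python
BASE_VOCAB = 732   # rows 0..731: upstream vocabulary
TW_VOCAB = 1033    # 732 upstream rows + 301 tw_* rows


def transplant_rows(finetuned, base, n_rows=BASE_VOCAB):
    """Return a copy of `finetuned` whose first `n_rows` rows come from `base`."""
    if len(base) < n_rows or len(finetuned) < n_rows:
        raise ValueError("both tables need at least n_rows rows")
    if len(base[0]) != len(finetuned[0]):
        raise ValueError("embedding widths differ")
    restored = [list(row) for row in base[:n_rows]]   # clean upstream rows
    kept = [list(row) for row in finetuned[n_rows:]]  # trained tw_* rows
    return restored + kept


# Toy tables: base rows are all 0.0, finetuned rows are all 1.0.
base = [[0.0] * 4 for _ in range(BASE_VOCAB)]
finetuned = [[1.0] * 4 for _ in range(TW_VOCAB)]
merged = transplant_rows(finetuned, base)
```

Rows 732 and above are left untouched, which is why the trained Taiwanese tokens keep working after the copy.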

Training data

MoE Tâi-uân-gí 教育部臺灣閩南語常用詞辭典 example sentences (majority of the corpus).
Common Voice nan-tw validated split.
Multi-speaker. Segments are 3-12 s each, 32 kHz mono, loudness-normalised.
Labels: POJ with diacritics, pre-processed with the sandhi preprocessor so the written form matches the audio realisation.
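
To make the sandhi step concrete, here is a minimal sketch of the textbook tone-sandhi mapping. It is deliberately simplified and is not the fork's 13-flag preprocessor: it works on numeric-tone romanisation (for example `li2 ho2`) rather than diacritics, treats the whole input as a single tone group so that every syllable except the last changes, and uses the 5→7 variant for tone 5. The function names are hypothetical.

```python
import re

# One numeric-tone syllable: letters followed by a tone digit, e.g. "ho2".
SYL_RE = re.compile(r"([A-Za-z]+)([1-8])")

# Non-checked tones: 1→7, 2→1, 3→2, 5→7, 7→3.
SANDHI_PLAIN = {1: 7, 2: 1, 3: 2, 5: 7, 7: 3}


def sandhi_syllable(syl):
    """Return the sandhi form of one numeric-tone syllable."""
    match = SYL_RE.fullmatch(syl)
    if match is None:
        return syl  # leave anything unrecognised untouched
    body, tone = match.group(1), int(match.group(2))
    if tone in (4, 8):
        if body.lower().endswith("h"):
            new_tone = 2 if tone == 4 else 3   # -h finals: 4→2, 8→3
        else:
            new_tone = 8 if tone == 4 else 4   # -p/-t/-k finals: 4↔8
    else:
        new_tone = SANDHI_PLAIN.get(tone, tone)
    return f"{body}{new_tone}"


def apply_sandhi(phrase):
    """Apply sandhi to every syllable except the last one in the phrase."""
    # Keep the separators (spaces, hyphens) so the phrase can be rebuilt.
    parts = re.split(r"([\s-]+)", phrase)
    syllable_positions = [i for i, p in enumerate(parts) if SYL_RE.fullmatch(p)]
    for i in syllable_positions[:-1]:
        parts[i] = sandhi_syllable(parts[i])
    return "".join(parts)
```

For instance, `apply_sandhi("li2 ho2")` returns `"li1 ho2"`: the first syllable shifts from tone 2 to tone 1 while the phrase-final syllable keeps its citation tone.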

Total: roughly 15-25 hours of paired audio + POJ.
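
The segment constraints and the corpus total can be checked with the standard library alone. The sketch below assumes the segments are stored as PCM WAV files, which is not stated above; the function names are hypothetical and loudness normalisation is not checked.

```python
import wave

TARGET_RATE = 32_000          # 32 kHz
MIN_SEC, MAX_SEC = 3.0, 12.0  # per-segment duration window


def segment_seconds(path):
    """Return the duration in seconds if the file meets the constraints, else None."""
    with wave.open(path, "rb") as wf:
        if wf.getnchannels() != 1 or wf.getframerate() != TARGET_RATE:
            return None
        seconds = wf.getnframes() / wf.getframerate()
    return seconds if MIN_SEC <= seconds <= MAX_SEC else None


def total_hours(paths):
    """Sum the duration of all accepted segments, in hours."""
    durations = (segment_seconds(p) for p in paths)
    return sum(d for d in durations if d is not None) / 3600.0
```

Running `total_hours` over the prepared segment list gives a quick sanity check against the expected corpus size.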

Evaluation

The quality scores above come from human listening. ASR-based CER was used for ablations, but it flattens out at the top of the quality curve, so it cannot rank the best configurations.

Test set	Stack	Mean POJ-CER (BreezeASR-26-derived)
Canonical 5-sentence	S1 trilingual e15 + S2 r4 e15 + sandhi v1	4.44%
13-sentence long content	same	~15%

Per-sentence breakdown for the 5-sentence set is in tw_samples/eval_summary.json in the GitHub repo. Demo mp3s are in tw_samples/demo_*.mp3.
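
For readers unfamiliar with the metric, character error rate is the Levenshtein distance between the ASR transcript and the reference, divided by the reference length. The sketch below shows that standard definition only. The exact text normalisation used in the evaluation is not described here, so the NFC step and the per-sentence averaging are assumptions.

```python
import unicodedata


def edit_distance(ref, hyp):
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate of `hyp` against a non-empty reference."""
    # NFC keeps precomposed POJ letters as single characters where possible.
    ref = unicodedata.normalize("NFC", ref)
    hyp = unicodedata.normalize("NFC", hyp)
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def mean_cer(pairs):
    """Average per-sentence CER over (reference, hypothesis) pairs."""
    scores = [cer(ref, hyp) for ref, hyp in pairs]
    return sum(scores) / len(scores)
```

Averaging per sentence weights short and long sentences equally; pooling edits over all characters would give a different number for the same transcripts.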

Known limitations

English is weak. Don't ship this for English use cases.
Long sentences drift past ~60 syllables. The inference pipeline splits at punctuation to mitigate but doesn't eliminate this.
Code-switching not supported within a single utterance.
Single-speaker fidelity is capped by the heterogeneity of the multi-speaker corpus; with a single-speaker corpus we'd expect higher voice consistency but narrower coverage.
POJ input only. No built-in Han-character → POJ pipeline.
MPS nondeterminism. The same seed on the same machine still produces audibly different output across runs (5-10% spread).
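
The punctuation-based splitting mentioned for long sentences can be sketched as follows. This is an illustration of the general approach, not the pipeline's actual splitter: the function names, the punctuation set, and the rule of counting syllables by spaces and hyphens are assumptions.

```python
import re

MAX_SYLLABLES = 60  # drift is reported beyond roughly this length


def count_syllables(text):
    """Count POJ syllables, assuming they are separated by spaces or hyphens."""
    return len([s for s in re.split(r"[\s-]+", text) if re.search(r"\w", s)])


def split_for_tts(text, max_syllables=MAX_SYLLABLES):
    """Greedily pack punctuation-delimited clauses into chunks under the limit."""
    # Split after a punctuation mark, keeping the mark with its clause.
    clauses = [c.strip() for c in re.split(r"(?<=[.,;:!?])\s+", text) if c.strip()]
    chunks, current = [], ""
    for clause in clauses:
        candidate = f"{current} {clause}".strip()
        if current and count_syllables(candidate) > max_syllables:
            chunks.append(current)
            current = clause
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

A single clause longer than the limit is passed through whole, which matches the note above that splitting mitigates drift without eliminating it.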

How this was built (short version)

The long version with lessons learned and what we'd do differently is in TAIWANESE.md. Short version:

S2 first (~24 h on M1 Max): full SoVITS v2ProTw finetune from s2Gv2Pro.pth. 15 epochs.
S1 next (~12-30 h): s1_train_mps_arpa_freeze.py from s1v3.ckpt, ARPA-row freeze, warmup → cosine LR (peak 1e-2, end 1e-4, 2000-step warmup, 40k-step decay). Critical patch: upstream lr_schedulers.py hardcoded LR=0.002, which locked every run to that value regardless of the yaml; the hardcode is now removed.
Sandhi-aligned labels are non-negotiable. Training on citation-tone POJ when the recordings have natural sandhi produces a systematically mispronouncing model.
Embedding transplant for the trilingual variant: copy rows 0..731 from a clean s1v3 back into the TW-finetuned ckpt. Restores Mandarin without touching the trained TW rows.
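
The S1 learning-rate schedule can be written as a small function of the step count. The sketch below uses the peak, end, warmup, and decay values quoted above, but the warmup shape (linear from zero) and the choice to start the 40k-step decay when warmup ends are assumptions; the function name `lr_at` is hypothetical.

```python
import math

PEAK_LR = 1e-2
END_LR = 1e-4
WARMUP_STEPS = 2_000
DECAY_STEPS = 40_000


def lr_at(step):
    """Learning rate at a given optimizer step: linear warmup, then cosine decay."""
    if step < WARMUP_STEPS:
        # Assumed linear ramp from 0 up to the peak.
        return PEAK_LR * step / WARMUP_STEPS
    # Fraction of the decay phase completed, clamped at 1.0 afterwards.
    progress = min((step - WARMUP_STEPS) / DECAY_STEPS, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return END_LR + (PEAK_LR - END_LR) * cosine
```

Printing `lr_at` at a few steps and comparing against the training log is a quick way to confirm that the yaml values are actually in effect, which is the failure the hardcoded LR used to hide.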

License & credits

License: MIT (matches upstream GPT-SoVITS).
Upstream: RVC-Boss/GPT-SoVITS.
TW adaptation: KaedeTai.
Acknowledgments: MoE 教育部臺灣閩南語常用詞辭典 example sentence corpus, Common Voice nan-tw (Mozilla), BreezeASR-26 (MediaTek) for TW ASR eval, linshoufan/whisper-small-nan-tw-pinyin for POJ ASR.

Citation

If you find this useful in academic work, please cite the upstream GPT-SoVITS and this fork:

```bibtex
@misc{gpt-sovits-tw-2026,
  title  = {GPT-SoVITS Taiwanese (Hokkien) trilingual fork},
  author = {KaedeTai},
  year   = {2026},
  howpublished = {\url{https://huggingface.co/KaedeTai/gpt-sovits-tw}}
}
```

