| ---
|
| title: GPT-SoVITS-CPUFast
|
| emoji: 🚀
|
| colorFrom: blue
|
| colorTo: blue
|
| sdk: gradio
|
| sdk_version: 5.34.2
|
| app_file: api.py
|
| python_version: "3.10"
|
| pinned: false
|
| license: apache-2.0
|
| ---
|
|
|
|
|
| # GPT-SoVITS CPU Inference Fork
|
|
|
| Inference-only GPT-SoVITS fork focused on CPU deployment and CPU-side optimization on Windows, Linux, and macOS.
|
|
|
| [中文简体](./docs/cn/README.md)
|
|
|
| ## What This Repo Is
|
|
|
| - Inference-only fork of GPT-SoVITS.
|
| - Designed around CPU usage rather than GPU-first features.
|
| - Current practical focus is the S2 `v2Pro` / `v2ProPlus` path, while keeping versioned pretrained downloads for `v1`, `v2`, `v2Pro`, and `v2ProPlus`.
|
|
|
| ## What Was Removed
|
|
|
| This repository no longer keeps most training and dataset-preparation features from upstream.
|
|
|
| - Training entrypoints and training-only utilities
|
| - Dataset slicing / denoise / ASR / labeling workflows
|
| - UVR5 and other non-inference WebUI tools
|
|
|
| The remaining goal is straightforward: run GPT-SoVITS inference on CPU with less installation friction and a smaller runtime surface.
|
|
|
| ## What Still Works
|
|
|
| - `webui.py`: minimal inference launcher
|
| - `GPT_SoVITS/inference_webui_fast.py`: high-performance CPU inference WebUI
|
| - `api.py` and `api_v2.py`: inference APIs
|
|
|
| ## Quick Start
|
|
|
| ### 1. Create Environment
|
|
|
| Use Miniconda or an existing Conda environment:
|
|
|
| ```bash
|
| conda create -n GPTSoVits python=3.10 -y
|
| conda activate GPTSoVits
|
| ```
|
|
|
| ### 2. Install Dependencies and Download Inference Weights
|
|
|
| CPU example with ModelScope and `v2ProPlus`:
|
|
|
| ```bash
|
| bash install.sh --source ModelScope --version v2ProPlus
|
| ```
|
|
|
| Windows PowerShell:
|
|
|
| ```powershell
|
| .\install.ps1 -Source ModelScope -Version v2ProPlus
|
| ```
|
|
|
| Available versions:
|
|
|
| - `v1`
|
| - `v2`
|
| - `v2Pro`
|
| - `v2ProPlus`
|
| - `all`
|
|
|
| ### 3. Launch
|
|
|
| Recommended:
|
|
|
| ```bash
|
| python webui.py
|
| ```
|
|
|
| Direct high-performance inference WebUI:
|
|
|
| ```bash
|
| python GPT_SoVITS/inference_webui_fast.py
|
| ```
|
|
|
| ## Notes
|
|
|
| - This fork is aimed at CPU inference, not training.
|
| - Chinese inference is still heavier than English / Japanese / Korean because text preprocessing needs extra frontend work such as `g2pw` and BERT features.
|
| - `install.sh` and `install.ps1` are now CPU-only installers and download inference assets by version instead of the full pretrained bundle.
|
| - `NLTK` and `OpenJTalk` dictionary downloads remain enabled by default.
|
|
|
| ## Speed Summary
|
|
|
| Without changing recognition behavior, prosody, speaker identity, or audio quality, this CPU fork currently delivers:
|
|
|
| - Chinese end-to-end `zh_pure`: `wall_sec 15.136264 -> 8.309471`, about `-45.1%`
|
| - End-to-end: `wall_sec 10.431826 -> 7.046743`, about `-32.4%`
|
| - Preprocessing: `frontend_sec 0.867065 -> 0.569571`, about `-34.3%`
|
| - T2S: `t2s_sec 4.023657 -> 2.268571`, about `-43.6%`
|
| - VITS: `vits_sec 5.137876 -> 3.969286`, about `-22.7%`
|
|
|
| Largest landed stage wins so far:
|
|
|
| - Preprocessing:
|
| - Chinese warm multi-sentence BERT frontend: `0.855s -> 0.131s`, about `-84.7%`
|
| - Korean cold frontend path: `0.791s -> 0.155s`, about `-80.4%`
|
| - T2S: `stable_batch_remap` 7-case comparison `1.986543 -> 1.684446`, about `-15.2%`
|
| - VITS: `remove_weight_norm` `vits_only` comparison `4.256119 -> 4.102004`, about `-3.6%`
|
|
|
| ## How The Speedups Were Achieved
|
|
|
| These gains do not come from sentence-level caching, quantization, or quality tradeoffs. The main work was removing CPU-side Python overhead, repeated preparation, repeated copies, and unnecessary cold-start costs from the real inference path.
|
|
|
| Chinese shows an even larger end-to-end gain than the overall average because it started from the heaviest path: it pays for both `g2pw` and BERT text features, so frontend reductions propagate directly into total latency.
|
|
|
| ### What Was Deliberately Not Used
|
|
|
| - ONNX / ORT was not adopted. `dec-only ORT`, `flow + dec ORT`, and larger graph-level ORT experiments were all tried, but on the current machine and dependency stack they did not produce a solution that was simultaneously quality-safe, faster, and lighter than the PyTorch path, so the runtime stayed on pure PyTorch and the ONNX compatibility path was removed.
|
| - The Chinese path was not simplified by dropping quality-critical frontend pieces. `g2pw` stayed, Chinese BERT stayed, and the project did not switch to lighter but lower-quality replacements such as `g2pm`, which are more likely to cause noticeable G2P errors on harder text, especially literary or classical material.
|
| - The main path also does not rely on secondary splitting just to manufacture larger batches. That direction was explored in benchmark-only form, but the results did not stay stable, and the `repeats=3` verification did not justify moving it into the runtime path.
|
| - VITS parallel synthesis is not exposed in the CPU WebUI. Local CPU measurements showed it can add overhead and slow inference, so the high-performance WebUI keeps VITS synthesis serial while preserving T2S parallel inference.
|
|
|
| ### Preprocessing / Frontend
|
|
|
| - English frontend work focused on cold start. Heavy first-request dependencies around `g2p_en.G2p` were reduced, `wordsegment.load()` was made lazy, `nltk.pos_tag` was narrowed to heteronym cases, and import-time overhead from `inflect/typeguard` was trimmed.
|
| - Korean frontend work removed an unrelated cold-start chain. `g2pk2` used to pull in `nltk/cmudict` even for pure Korean input; that path is now stubbed lazily so Korean requests no longer pay for the English dictionary on first use.
|
| - The biggest Chinese frontend win came from changing pure-Chinese multi-sentence BERT extraction from sentence-by-sentence serial execution to a batched path. The current path batches tokenization, BERT forward, and `word2ph` alignment for pure Chinese multi-sentence requests.
|
| - Chinese `g2pw` cold start was also reduced by lazily importing `requests`, shrinking tokenizer initialization down to direct `tokenizer.json` loading, and avoiding the larger `transformers` auto-dispatch chain during startup.
|
| - Chinese segmentation / POS loading was further reduced with local static assets so `jieba_fast.posseg` initialization does less repeated work, without introducing any sentence-level cache tied to user input.
|
| - Non-Chinese zero-BERT paths also gained simple length-based zero-tensor reuse so repeated all-zero feature allocation does less work.
|
|
|
| ### T2S
|
|
|
| - The first layer of gains came from removing pure Python overhead from the hot path, including `tqdm` and hot-loop prints during decoding. These changes do not alter model numerics, but they do matter on CPU.
|
| - The next layer came from the shrink path. Previously, when some rows in a batch finished early, the code copied whole future-capacity buffers with `index_select`, including token buffers, KV caches, and masks that were never going to be used. The current path compacts only the valid prefix.
|
| - Another landed gain is `stable_batch_remap`. It is not a “never shrink” hack. It keeps exact behavior while stabilizing how active rows are remapped inside the batch, which reduces unnecessary compaction churn.
|
| - The deeper T2S gains came from addressing the actual `addmm` hotspots. The main path now uses exact-safe hybrid linear execution for high-frequency layers such as `MLP`, `qkv_proj`, and `out_proj`: `rows == 1` keeps the original `F.linear`, while larger cases use a more CPU-friendly `torch.addmm` path.
|
| - That backend work only became safe after fixing the load path as well. `t2s_transformer` is rebuilt after checkpoint loading so cached transposed weights are bound to the real loaded parameters instead of the pre-load initialization weights.
|
|
|
| ### VITS
|
|
|
| - VITS optimization stayed conservative. The main strategy was to remove work that was being repeated for every batch instead of rewriting the model structure.
|
| - The first landed step was a run-level runtime cache in `TTS.run()` for reference-side objects that do not depend on the current input text, such as `refer_audio_spec`, `sv_emb`, `prompt_semantic_tokens`, and `prompt_phones`.
|
| - The second landed step moved decode-condition preparation out of repeated decode calls. `build_decode_condition()` lets `ge / ge_text` be computed once before the batch loop and then reused across decode calls.
|
| - The third landed step applies `remove_weight_norm()` directly to non-vocoder `Generator.dec`, removing weight-norm reparameterization overhead during inference. This is the main clearly logged stage-local VITS win currently kept in the codebase.
|
| - The worklog also records more aggressive VITS experiments that were tested but not kept, such as traced `dec` and more aggressive decode layout variants. The README only describes the parts that actually landed.
|
|
|
| ### Load Path And Memory
|
|
|
| - `t2s_only` benchmark loading used to instantiate the full `TTS()` pipeline and then trim unused objects, which pushed peak memory far too high. That was replaced with a true lightweight load path that initializes only the minimum T2S-side objects.
|
| - The main inference load path was also reordered to avoid rebuilding large transposed-weight structures while the checkpoint dictionary was still alive in memory.
|
| - The biggest memory-side change is that inference no longer builds `self.h` at all on the main T2S path. The runtime now rebuilds `t2s_transformer` directly from the state dict and only keeps what inference actually uses.
|
| - This is not benchmark-only machinery. It is wired into the real inference path and mainly helps reduce steady-state RSS and peak RSS so CPU machines are less likely to stall or be reclaimed under memory pressure.
|
|
|
| ## Upstream and Credits
|
|
|
| This project is based on and uses code from:
|
|
|
| - [RVC-Boss/GPT-SoVITS](https://github.com/RVC-Boss/GPT-SoVITS)
|
|
|
| This fork keeps upstream credits and referenced projects below.
|
|
|
| ## Referenced Projects
|
|
|
| ### Theoretical Research
|
|
|
| - [ar-vits](https://github.com/innnky/ar-vits)
|
| - [SoundStorm](https://github.com/yangdongchao/SoundStorm/tree/master/soundstorm/s1/AR)
|
| - [vits](https://github.com/jaywalnut310/vits)
|
| - [TransferTTS](https://github.com/hcy71o/TransferTTS/blob/master/models.py#L556)
|
| - [contentvec](https://github.com/auspicious3000/contentvec/)
|
| - [hifi-gan](https://github.com/jik876/hifi-gan)
|
| - [fish-speech](https://github.com/fishaudio/fish-speech/blob/main/tools/llama/generate.py#L41)
|
|
|
| ### Main Model / Training / Vocoder Related
|
|
|
| - [RVC-Boss/GPT-SoVITS](https://github.com/RVC-Boss/GPT-SoVITS)
|
| - [SoVITS](https://github.com/voicepaw/so-vits-svc-fork)
|
| - [GPT-SoVITS-beta](https://github.com/lj1995/GPT-SoVITS/tree/gsv-v2beta)
|
| - [Chinese Speech Pretrain](https://github.com/TencentGameMate/chinese_speech_pretrain)
|
| - [Chinese-Roberta-WWM-Ext-Large](https://huggingface.co/hfl/chinese-roberta-wwm-ext-large)
|
| - [eresnetv2](https://modelscope.cn/models/iic/speech_eres2netv2w24s4ep4_sv_zh-cn_16k-common)
|
|
|
| ### Text Frontend for Inference
|
|
|
| - [paddlespeech zh_normalization](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/paddlespeech/t2s/frontend/zh_normalization)
|
| - [split-lang](https://github.com/DoodleBears/split-lang)
|
| - [g2pW](https://github.com/GitYCC/g2pW)
|
| - [pypinyin-g2pW](https://github.com/mozillazg/pypinyin-g2pW)
|
| - [paddlespeech g2pw](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/paddlespeech/t2s/frontend/g2pw)
|
|
|
| ### Inherited Upstream Tool References
|
|
|
| These projects were referenced by upstream GPT-SoVITS. Some related modules are removed in this inference-only fork, but credits are preserved here.
|
|
|
| - [ultimatevocalremovergui](https://github.com/Anjok07/ultimatevocalremovergui)
|
| - [audio-slicer](https://github.com/openvpi/audio-slicer)
|
| - [SubFix](https://github.com/cronrpc/SubFix)
|
| - [FFmpeg](https://github.com/FFmpeg/FFmpeg)
|
| - [gradio](https://github.com/gradio-app/gradio)
|
| - [faster-whisper](https://github.com/SYSTRAN/faster-whisper)
|
| - [FunASR](https://github.com/alibaba-damo-academy/FunASR)
|
| - [AP-BWE](https://github.com/yxlu-0102/AP-BWE)
|
|
|