---
title: GPT-SoVITS-CPUFast
emoji: 🚀
colorFrom: blue
colorTo: blue
sdk: gradio
sdk_version: 5.34.2
app_file: api.py
python_version: "3.10"
pinned: false
license: apache-2.0
---

# GPT-SoVITS CPU Inference Fork

Inference-only GPT-SoVITS fork focused on CPU deployment and CPU-side optimization on Windows, Linux, and macOS.

[中文简体](./docs/cn/README.md)

## What This Repo Is

- Inference-only fork of GPT-SoVITS.
- Designed around CPU usage rather than GPU-first features.
- Current practical focus is the S2 `v2Pro` / `v2ProPlus` path, while keeping versioned pretrained downloads for `v1`, `v2`, `v2Pro`, and `v2ProPlus`.

## What Was Removed

This repository no longer keeps most training and dataset-preparation features from upstream.

- Training entrypoints and training-only utilities
- Dataset slicing / denoise / ASR / labeling workflows
- UVR5 and other non-inference WebUI tools

The remaining goal is straightforward: run GPT-SoVITS inference on CPU with less installation friction and a smaller runtime surface.

## What Still Works

- `webui.py`: minimal inference launcher
- `GPT_SoVITS/inference_webui_fast.py`: high-performance CPU inference WebUI
- `api.py` and `api_v2.py`: inference APIs

## Quick Start

### 1. Create Environment

Use Miniconda or an existing Conda environment:

```bash
conda create -n GPTSoVits python=3.10 -y
conda activate GPTSoVits
```

### 2. Install Dependencies and Download Inference Weights

CPU example with ModelScope and `v2ProPlus`:

```bash
bash install.sh --source ModelScope --version v2ProPlus
```

Windows PowerShell:

```powershell
.\install.ps1 -Source ModelScope -Version v2ProPlus
```

Available versions:

- `v1`
- `v2`
- `v2Pro`
- `v2ProPlus`
- `all`

### 3. Launch

Recommended:

```bash
python webui.py
```

Direct high-performance inference WebUI:

```bash
python GPT_SoVITS/inference_webui_fast.py
```

## Notes

- This fork is aimed at CPU inference, not training.
- Chinese inference is still heavier than English / Japanese / Korean because text preprocessing needs extra frontend work such as `g2pw` and BERT features.
- `install.sh` and `install.ps1` are now CPU-only installers and download inference assets by version instead of the full pretrained bundle.
- `NLTK` and `OpenJTalk` dictionary downloads remain enabled by default.

## Speed Summary

Without changing recognition behavior, prosody, speaker identity, or audio quality, this CPU fork currently delivers:

- Chinese end-to-end `zh_pure`: `wall_sec 15.136264 -> 8.309471`, about `-45.1%`
- End-to-end: `wall_sec 10.431826 -> 7.046743`, about `-32.4%`
- Preprocessing: `frontend_sec 0.867065 -> 0.569571`, about `-34.3%`
- T2S: `t2s_sec 4.023657 -> 2.268571`, about `-43.6%`
- VITS: `vits_sec 5.137876 -> 3.969286`, about `-22.7%`

Largest landed stage wins so far:

- Preprocessing:
  - Chinese warm multi-sentence BERT frontend: `0.855s -> 0.131s`, about `-84.7%`
  - Korean cold frontend path: `0.791s -> 0.155s`, about `-80.4%`
- T2S: `stable_batch_remap` 7-case comparison `1.986543 -> 1.684446`, about `-15.2%`
- VITS: `remove_weight_norm` `vits_only` comparison `4.256119 -> 4.102004`, about `-3.6%`

## How The Speedups Were Achieved

These gains do not come from sentence-level caching, quantization, or quality tradeoffs. The main work was removing CPU-side Python overhead, repeated preparation, repeated copies, and unnecessary cold-start costs from the real inference path.

Chinese shows an even larger end-to-end gain than the overall average because it started from the heaviest path: it pays for both `g2pw` and BERT text features, so frontend reductions propagate directly into total latency.

### What Was Deliberately Not Used

- ONNX / ORT was not adopted.
  `dec-only ORT`, `flow + dec ORT`, and larger graph-level ORT experiments were all tried, but on the current machine and dependency stack they did not produce a solution that was simultaneously quality-safe, faster, and lighter than the PyTorch path, so the runtime stayed on pure PyTorch and the ONNX compatibility path was removed.
- The Chinese path was not simplified by dropping quality-critical frontend pieces. `g2pw` stayed, Chinese BERT stayed, and the project did not switch to lighter but lower-quality replacements such as `g2pm`, which are more likely to cause noticeable G2P errors on harder text, especially literary or classical material.
- The main path also does not rely on secondary splitting just to manufacture larger batches. That direction was explored in benchmark-only form, but the results did not stay stable, and the `repeats=3` verification did not justify moving it into the runtime path.
- VITS parallel synthesis is not exposed in the CPU WebUI. Local CPU measurements showed it can add overhead and slow inference, so the high-performance WebUI keeps VITS synthesis serial while preserving T2S parallel inference.

### Preprocessing / Frontend

- English frontend work focused on cold start. Heavy first-request dependencies around `g2p_en.G2p` were reduced, `wordsegment.load()` was made lazy, `nltk.pos_tag` was narrowed to heteronym cases, and import-time overhead from `inflect/typeguard` was trimmed.
- Korean frontend work removed an unrelated cold-start chain. `g2pk2` used to pull in `nltk/cmudict` even for pure Korean input; that path is now stubbed lazily so Korean requests no longer pay for the English dictionary on first use.
- The biggest Chinese frontend win came from changing pure-Chinese multi-sentence BERT extraction from sentence-by-sentence serial execution to a batched path. The current path batches tokenization, BERT forward, and `word2ph` alignment for pure Chinese multi-sentence requests.
- Chinese `g2pw` cold start was also reduced by lazily importing `requests`, shrinking tokenizer initialization down to direct `tokenizer.json` loading, and avoiding the larger `transformers` auto-dispatch chain during startup.
- Chinese segmentation / POS loading was further reduced with local static assets so `jieba_fast.posseg` initialization does less repeated work, without introducing any sentence-level cache tied to user input.
- Non-Chinese zero-BERT paths also gained simple length-based zero-tensor reuse so repeated all-zero feature allocation does less work.

### T2S

- The first layer of gains came from removing pure Python overhead from the hot path, including `tqdm` and hot-loop prints during decoding. These changes do not alter model numerics, but they do matter on CPU.
- The next layer came from the shrink path. Previously, when some rows in a batch finished early, the code copied whole future-capacity buffers with `index_select`, including token buffers, KV caches, and masks that were never going to be used. The current path compacts only the valid prefix.
- Another landed gain is `stable_batch_remap`. It is not a “never shrink” hack. It keeps exact behavior while stabilizing how active rows are remapped inside the batch, which reduces unnecessary compaction churn.
- The deeper T2S gains came from addressing the actual `addmm` hotspots. The main path now uses exact-safe hybrid linear execution for high-frequency layers such as `MLP`, `qkv_proj`, and `out_proj`: `rows == 1` keeps the original `F.linear`, while larger cases use a more CPU-friendly `torch.addmm` path.
- That backend work only became safe after fixing the load path as well. `t2s_transformer` is rebuilt after checkpoint loading so cached transposed weights are bound to the real loaded parameters instead of the pre-load initialization weights.

### VITS

- VITS optimization stayed conservative.
  The main strategy was to remove work that was being repeated for every batch instead of rewriting the model structure.
- The first landed step was a run-level runtime cache in `TTS.run()` for reference-side objects that do not depend on the current input text, such as `refer_audio_spec`, `sv_emb`, `prompt_semantic_tokens`, and `prompt_phones`.
- The second landed step moved decode-condition preparation out of repeated decode calls. `build_decode_condition()` lets `ge / ge_text` be computed once before the batch loop and then reused across decode calls.
- The third landed step applies `remove_weight_norm()` directly to non-vocoder `Generator.dec`, removing weight-norm reparameterization overhead during inference. This is the main clearly logged stage-local VITS win currently kept in the codebase.
- The worklog also records more aggressive VITS experiments that were tested but not kept, such as traced `dec` and more aggressive decode layout variants. The README only describes the parts that actually landed.

### Load Path And Memory

- `t2s_only` benchmark loading used to instantiate the full `TTS()` pipeline and then trim unused objects, which pushed peak memory far too high. That was replaced with a true lightweight load path that initializes only the minimum T2S-side objects.
- The main inference load path was also reordered to avoid rebuilding large transposed-weight structures while the checkpoint dictionary was still alive in memory.
- The biggest memory-side change is that inference no longer builds `self.h` at all on the main T2S path. The runtime now rebuilds `t2s_transformer` directly from the state dict and only keeps what inference actually uses.
- This is not benchmark-only machinery. It is wired into the real inference path and mainly helps reduce steady-state RSS and peak RSS so CPU machines are less likely to stall or be reclaimed under memory pressure.
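
As a rough illustration of why a transposed-weight cache has to be rebuilt after checkpoint loading, and of the `rows == 1` versus multi-row dispatch described under T2S, the following is a framework-free toy in plain Python. `ToyLinear`, `load_state`, and `rebuild_cache` are hypothetical names invented for this sketch, not identifiers from this repository, and the real path operates on PyTorch tensors rather than nested lists.

```python
from typing import Dict, List

Matrix = List[List[float]]


def transpose(m: Matrix) -> Matrix:
    """Return a transposed copy of m (a copy, not a view)."""
    return [list(col) for col in zip(*m)]


class ToyLinear:
    """Hypothetical stand-in for a linear layer with a cached transposed weight."""

    def __init__(self, weight: Matrix) -> None:
        self.weight = weight               # out_features x in_features
        self.weight_t = transpose(weight)  # cached copy, in_features x out_features

    def load_state(self, state: Dict[str, Matrix]) -> None:
        # Swaps in the checkpoint weight but deliberately leaves the cache alone,
        # mimicking a load that happens after the cache was first built.
        self.weight = state["weight"]

    def rebuild_cache(self) -> None:
        # Re-derive the cache from the weights that are actually loaded.
        self.weight_t = transpose(self.weight)

    def forward(self, rows: Matrix) -> Matrix:
        if len(rows) == 1:
            # Single-row case: use the original weight directly.
            x = rows[0]
            return [[sum(w * v for w, v in zip(w_row, x)) for w_row in self.weight]]
        # Multi-row case: use the cached transposed weight.
        n_out = len(self.weight_t[0])
        return [
            [sum(x[k] * self.weight_t[k][j] for k in range(len(x))) for j in range(n_out)]
            for x in rows
        ]


layer = ToyLinear([[0.0, 0.0], [0.0, 0.0]])            # placeholder init weights
layer.load_state({"weight": [[1.0, 2.0], [3.0, 4.0]]})

batch = [[1.0, 0.0], [0.0, 1.0]]
stale = layer.forward(batch)          # cache still reflects the init weights
layer.rebuild_cache()
fresh = layer.forward(batch)          # cache now reflects the loaded weights
single = layer.forward([[1.0, 1.0]])  # single-row path never used the cache
```

In this toy, `stale` comes out as all zeros because the multi-row path still reads the cache built from the placeholder weights, while `fresh` matches the loaded checkpoint once the cache is rebuilt. The single-row path is unaffected either way, which is one reason a stale cache of this kind can go unnoticed when only one row is active.
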
## Upstream and Credits

This project is based on and uses code from:

- [RVC-Boss/GPT-SoVITS](https://github.com/RVC-Boss/GPT-SoVITS)

This fork keeps upstream credits and referenced projects below.

## Referenced Projects

### Theoretical Research

- [ar-vits](https://github.com/innnky/ar-vits)
- [SoundStorm](https://github.com/yangdongchao/SoundStorm/tree/master/soundstorm/s1/AR)
- [vits](https://github.com/jaywalnut310/vits)
- [TransferTTS](https://github.com/hcy71o/TransferTTS/blob/master/models.py#L556)
- [contentvec](https://github.com/auspicious3000/contentvec/)
- [hifi-gan](https://github.com/jik876/hifi-gan)
- [fish-speech](https://github.com/fishaudio/fish-speech/blob/main/tools/llama/generate.py#L41)

### Main Model / Training / Vocoder Related

- [RVC-Boss/GPT-SoVITS](https://github.com/RVC-Boss/GPT-SoVITS)
- [SoVITS](https://github.com/voicepaw/so-vits-svc-fork)
- [GPT-SoVITS-beta](https://github.com/lj1995/GPT-SoVITS/tree/gsv-v2beta)
- [Chinese Speech Pretrain](https://github.com/TencentGameMate/chinese_speech_pretrain)
- [Chinese-Roberta-WWM-Ext-Large](https://huggingface.co/hfl/chinese-roberta-wwm-ext-large)
- [eresnetv2](https://modelscope.cn/models/iic/speech_eres2netv2w24s4ep4_sv_zh-cn_16k-common)

### Text Frontend for Inference

- [paddlespeech zh_normalization](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/paddlespeech/t2s/frontend/zh_normalization)
- [split-lang](https://github.com/DoodleBears/split-lang)
- [g2pW](https://github.com/GitYCC/g2pW)
- [pypinyin-g2pW](https://github.com/mozillazg/pypinyin-g2pW)
- [paddlespeech g2pw](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/paddlespeech/t2s/frontend/g2pw)

### Inherited Upstream Tool References

These projects were referenced by upstream GPT-SoVITS. Some related modules are removed in this inference-only fork, but credits are preserved here.

- [ultimatevocalremovergui](https://github.com/Anjok07/ultimatevocalremovergui)
- [audio-slicer](https://github.com/openvpi/audio-slicer)
- [SubFix](https://github.com/cronrpc/SubFix)
- [FFmpeg](https://github.com/FFmpeg/FFmpeg)
- [gradio](https://github.com/gradio-app/gradio)
- [faster-whisper](https://github.com/SYSTRAN/faster-whisper)
- [FunASR](https://github.com/alibaba-damo-academy/FunASR)
- [AP-BWE](https://github.com/yxlu-0102/AP-BWE)