# S2 Pro GGUF

> **ALPHA / EXPERIMENTAL**: The inference engine (s2.cpp) is an early-stage, community-built project. Expect rough edges and breaking changes. Not production-ready.

GGUF-quantized weights of Fish Audio S2 Pro, a high-quality multilingual text-to-speech model with voice-cloning support, packaged for local inference with s2.cpp, a pure C++/GGML engine with no Python dependency.

License: Fish Audio Research License. Free for research and non-commercial use; commercial use requires a separate license from Fish Audio. See LICENSE.md and fish.audio.
## Files

| File | Size | Notes |
|---|---|---|
| s2-pro-f16.gguf | 9.3 GB | Full precision; reference quality |
| s2-pro-q8_0.gguf | 5.7 GB | Near-lossless; recommended for 8+ GB VRAM |
| s2-pro-q6_k.gguf | 4.8 GB | Good quality/size balance; recommended for 6+ GB VRAM |
| tokenizer.json | – | Qwen3 BPE tokenizer (required) |
Each GGUF file contains both the transformer weights and the audio codec; no separate codec file is needed.
## Requirements
- GPU with Vulkan support (AMD/NVIDIA/Intel) or CPU with enough RAM
- s2.cpp built from source (C++17 + CMake)
## VRAM guide

| VRAM | Recommended |
|---|---|
| ≥ 10 GB | q8_0 |
| 6–9 GB | q6_k |
| CPU only | f16 (slow) |
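For scripted setups, the table above can be expressed as a tiny helper. This is only a sketch with the thresholds taken straight from the table; `pick_quant` is a hypothetical name, not part of s2.cpp:

```sh
# pick_quant GB: echo the quant from the VRAM guide for a given budget in GB
pick_quant() {
  if [ "$1" -ge 10 ]; then echo "q8_0"
  elif [ "$1" -ge 6 ]; then echo "q6_k"
  else echo "f16 (CPU, slow)"
  fi
}

pick_quant 8   # q6_k
```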
## Quick start

```sh
# Clone and build s2.cpp
git clone --recurse-submodules https://github.com/rodrigomatta/s2.cpp.git
cd s2.cpp
cmake -B build -DCMAKE_BUILD_TYPE=Release -DS2_VULKAN=ON
cmake --build build --parallel $(nproc)

# Download model files (example with huggingface-cli)
huggingface-cli download rodrigomt/s2-pro-gguf s2-pro-q6_k.gguf tokenizer.json --local-dir .

# Synthesize
./build/s2 \
  -m s2-pro-q6_k.gguf \
  -t tokenizer.json \
  -text "Hello, this is a test." \
  -v 0 \
  -o output.wav
```
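Long inputs can overrun the default 512-token generation cap (see -max-tokens below). One hedged workaround, assuming roughly one token per word, is to re-group the text into fixed-size word chunks and synthesize each chunk separately; `chunk_words` is a hypothetical helper, not an s2.cpp feature:

```sh
# chunk_words FILE N: print FILE's words re-grouped N per output line,
# so each line can be passed to ./build/s2 -text as its own chunk
chunk_words() {
  tr -s '[:space:]' '\n' < "$1" | xargs -n "$2"
}
```

Each output line can then be fed to `./build/s2 -text "$chunk"` in a loop, writing one WAV per chunk.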
## Voice cloning

```sh
./build/s2 \
  -m s2-pro-q6_k.gguf \
  -t tokenizer.json \
  -pa reference.wav \
  -pt "Transcript of the reference audio." \
  -text "Text to synthesize in that voice." \
  -v 0 \
  -o output.wav
```

Reference audio: 5–30 seconds, clean recording, WAV or MP3.
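To check a clip against that 5–30 second window before cloning, its duration can be estimated from the file size. A rough sketch, assuming a canonical 44-byte header and 16-bit mono PCM at 44.1 kHz (real WAVs may differ, and `wav_seconds` is a hypothetical helper):

```sh
# wav_seconds FILE: rough duration of a 44.1 kHz 16-bit mono WAV from its size
wav_seconds() {
  bytes=$(wc -c < "$1")
  echo $(( (bytes - 44) / (44100 * 2) ))
}
```

For stereo or other sample rates the divisor changes (rate × channels × bytes per sample); a tool like ffprobe gives exact numbers.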
## All CLI options

| Flag | Default | Description |
|---|---|---|
| -m, --model | model.gguf | Path to GGUF model file |
| -t, --tokenizer | tokenizer.json | Path to tokenizer.json |
| -text | "Hello world" | Text to synthesize |
| -pa, --prompt-audio | – | Reference audio (WAV/MP3) |
| -pt, --prompt-text | – | Transcript of reference audio |
| -o, --output | out.wav | Output WAV path |
| -v, --vulkan | -1 (CPU) | Vulkan device index |
| -threads N | 4 | CPU threads |
| -max-tokens N | 512 | Max tokens (~21 s of audio per 440 tokens) |
| -temp F | 0.7 | Sampling temperature |
| -top-p F | 0.7 | Top-p (nucleus) sampling |
| -top-k N | 30 | Top-k sampling |
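The ~21 s per 440 tokens figure above gives a quick way to size -max-tokens for a target duration. A back-of-envelope sketch only; the ratio is approximate, and the default cap is 512 (`tokens_for` is a hypothetical helper):

```sh
# tokens_for SECONDS: approximate -max-tokens needed for a target duration,
# using the ~440 tokens per ~21 s figure from the table above
tokens_for() {
  awk -v s="$1" 'BEGIN { printf "%d\n", s * 440 / 21 + 0.5 }'
}

tokens_for 21   # 440
```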
## Model architecture

S2 Pro uses a Dual-AR architecture (~4.56B parameters total):

- Slow-AR: 36-layer Qwen3 transformer (4.13B params), GQA (32 heads / 8 KV heads), RoPE 1M base, persistent KV cache
- Fast-AR: 4-layer transformer (0.42B params) generating 10 acoustic codebook tokens per semantic step
- Audio codec: convolutional RVQ encoder/decoder (10 codebooks × 4096 entries)
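Those codec numbers imply a very low token bitrate: each RVQ step emits 10 codes, and each code indexes a 4096-entry codebook, i.e. log2(4096) = 12 bits per code. A hedged back-of-envelope, also assuming the ~21 semantic steps per second implied by the ~440 tokens / 21 s figure above:

```sh
# bits per semantic step: 10 codebooks, each code indexes 4096 = 2^12 entries
codebooks=10
bits_per_code=12          # log2(4096)
steps_per_sec=21          # ~440 semantic tokens / ~21 s, rounded
echo "$(( codebooks * bits_per_code )) bits/step"
echo "$(( codebooks * bits_per_code * steps_per_sec )) bits/s"
```

That works out to roughly 2.5 kbit/s of acoustic codes, before the decoder reconstructs the waveform.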
## License

The model weights are licensed under the Fish Audio Research License.

- Research and non-commercial use: free under this license
- Commercial use: requires a separate written license from Fish Audio

Attribution: "This model is licensed under the Fish Audio Research License, Copyright © 39 AI, INC. All Rights Reserved."

Full terms: LICENSE.md · Commercial: fish.audio · business@fish.audio
## Model tree

Base model: fishaudio/s2-pro