S2 Pro β€” GGUF

ALPHA — EXPERIMENTAL: the inference engine (s2.cpp) is an early-stage, community-built project. Expect rough edges and breaking changes; it is not production-ready.

GGUF-quantized weights of Fish Audio S2 Pro, a high-quality multilingual text-to-speech model with voice-cloning support, packaged for local inference with s2.cpp — a pure C++/GGML engine with no Python dependency.

License: Fish Audio Research License — free for research and non-commercial use. Commercial use requires a separate license from Fish Audio. See LICENSE.md and fish.audio.


Files

| File | Size | Notes |
|---|---|---|
| s2-pro-f16.gguf | 9.3 GB | Full precision — reference quality |
| s2-pro-q8_0.gguf | 5.7 GB | Near-lossless — recommended for 8+ GB VRAM |
| s2-pro-q6_k.gguf | 4.8 GB | Good quality/size balance — recommended for 6+ GB VRAM |
| tokenizer.json | — | Qwen3 BPE tokenizer (required) |

All GGUF files contain both the transformer weights and the audio codec in a single file.
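As a quick sanity check after download, the fixed-size GGUF header (magic, version, tensor count, metadata count) can be read with a few lines of Python. This follows the public GGUF spec and does not depend on s2.cpp:

```python
import struct

def read_gguf_header(path):
    """Read the fixed GGUF header: magic, version, tensor count, metadata KV count."""
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            raise ValueError(f"not a GGUF file: magic={magic!r}")
        # version (u32), tensor_count (u64), metadata_kv_count (u64), all little-endian
        version, n_tensors, n_kv = struct.unpack("<IQQ", f.read(20))
        return {"version": version, "tensors": n_tensors, "metadata_kv": n_kv}
```

A file that passes this check is at least structurally a GGUF container; it says nothing about whether the weights inside are complete.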


Requirements

  • GPU with Vulkan support (AMD/NVIDIA/Intel) or CPU with enough RAM
  • s2.cpp built from source (C++17 + CMake)

VRAM guide

| VRAM | Recommended |
|---|---|
| ≥ 10 GB | q8_0 |
| 6–9 GB | q6_k |
| CPU only | f16 (slow) |
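The guide above can be written as a tiny helper. The thresholds mirror the table; `pick_quant` itself is just an illustrative name, not part of s2.cpp:

```python
def pick_quant(vram_gb):
    """Map available VRAM (in GB, or None for CPU-only) to the quant from the guide."""
    if vram_gb is None:   # no GPU: run f16 on CPU (slow)
        return "f16"
    if vram_gb >= 10:
        return "q8_0"
    if vram_gb >= 6:
        return "q6_k"
    # below 6 GB the guide gives no GPU recommendation; fall back to CPU f16
    return "f16"
```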

Quick start

```shell
# Clone and build s2.cpp
git clone --recurse-submodules https://github.com/rodrigomatta/s2.cpp.git
cd s2.cpp
cmake -B build -DCMAKE_BUILD_TYPE=Release -DS2_VULKAN=ON
cmake --build build --parallel $(nproc)

# Download model files (example with huggingface-cli)
huggingface-cli download rodrigomt/s2-pro-gguf s2-pro-q6_k.gguf tokenizer.json --local-dir .

# Synthesize
./build/s2 \
  -m s2-pro-q6_k.gguf \
  -t tokenizer.json \
  -text "Hello, this is a test." \
  -v 0 \
  -o output.wav
```

Voice cloning

```shell
./build/s2 \
  -m s2-pro-q6_k.gguf \
  -t tokenizer.json \
  -pa reference.wav \
  -pt "Transcript of the reference audio." \
  -text "Text to synthesize in that voice." \
  -v 0 \
  -o output.wav
```

Reference audio: 5–30 seconds, clean recording, WAV or MP3.
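A quick way to confirm a reference clip falls inside that 5–30 second window is the hypothetical checker below (WAV only — Python's stdlib `wave` module cannot read MP3):

```python
import wave

def check_reference(path, min_s=5.0, max_s=30.0):
    """Return (ok, duration_s) for a reference WAV against the 5-30 s guideline."""
    with wave.open(path, "rb") as w:
        duration = w.getnframes() / w.getframerate()
    return min_s <= duration <= max_s, duration
```

Clip length is only one of the stated requirements; a clean, noise-free recording matters just as much and has to be judged by ear.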


All CLI options

| Flag | Default | Description |
|---|---|---|
| -m, --model | model.gguf | Path to GGUF model file |
| -t, --tokenizer | tokenizer.json | Path to tokenizer.json |
| -text | "Hello world" | Text to synthesize |
| -pa, --prompt-audio | — | Reference audio (WAV/MP3) |
| -pt, --prompt-text | — | Transcript of reference audio |
| -o, --output | out.wav | Output WAV path |
| -v, --vulkan | -1 (CPU) | Vulkan device index |
| -threads N | 4 | CPU threads |
| -max-tokens N | 512 | Max tokens (~21 s per 440 tokens) |
| -temp F | 0.7 | Sampling temperature |
| -top-p F | 0.7 | Top-p (nucleus) sampling |
| -top-k N | 30 | Top-k sampling |
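The three sampling flags interact: temperature reshapes the logits, top-k caps how many candidates survive, and top-p trims the tail of the remaining probability mass. The sketch below shows one common ordering (temperature, then top-k, then top-p); the actual order inside s2.cpp has not been verified and `sample` is an illustrative name:

```python
import math, random

def sample(logits, temp=0.7, top_p=0.7, top_k=30, rng=random):
    """Illustrative temperature -> top-k -> top-p sampling over raw logits."""
    # temperature scaling (guard against temp == 0)
    scaled = [l / max(temp, 1e-6) for l in logits]
    # keep the top_k highest logits, highest first
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]
    # softmax over the survivors (subtract max for numerical stability)
    m = max(scaled[i] for i in order)
    exps = [(i, math.exp(scaled[i] - m)) for i in order]
    z = sum(e for _, e in exps)
    probs = [(i, e / z) for i, e in exps]
    # nucleus cut: keep the smallest prefix whose cumulative mass reaches top_p
    kept, mass = [], 0.0
    for i, p in probs:
        kept.append((i, p))
        mass += p
        if mass >= top_p:
            break
    # draw from the kept set, renormalised to its own mass
    r = rng.random() * mass
    for i, p in kept:
        r -= p
        if r <= 0:
            return i
    return kept[-1][0]
```

Lower temperature and smaller top-p/top-k make output more deterministic; the defaults (0.7 / 0.7 / 30) leave moderate variation between runs.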

Model architecture

S2 Pro uses a Dual-AR architecture (~4.56B parameters total):

  • Slow-AR — 36-layer Qwen3 transformer (4.13B params), GQA (32 heads / 8 KV heads), RoPE 1M base, persistent KV cache
  • Fast-AR — 4-layer transformer (0.42B params) generating 10 acoustic codebook tokens per semantic step
  • Audio codec — convolutional RVQ encoder/decoder (10 codebooks × 4096 entries)
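The Dual-AR decode loop described above can be sketched at the shape level. `slow_step` and `fast_step` are hypothetical stand-ins for the two transformers; the real s2.cpp implementation differs in every detail except the token layout (one semantic token per step, 10 acoustic tokens per semantic token, codebook entries in [0, 4096)):

```python
import random

N_CODEBOOKS = 10      # acoustic codebooks per semantic step
CODEBOOK_SIZE = 4096  # entries per codebook

def slow_step(ctx):
    """Stand-in for the 36-layer slow AR: emit the next semantic token."""
    return random.randrange(CODEBOOK_SIZE)

def fast_step(semantic_token):
    """Stand-in for the 4-layer fast AR: emit 10 codebook tokens for one step."""
    return [random.randrange(CODEBOOK_SIZE) for _ in range(N_CODEBOOKS)]

def generate(max_tokens=8):
    """Shape-level Dual-AR loop: one fast-AR frame per slow-AR semantic token."""
    frames, ctx = [], []
    for _ in range(max_tokens):
        s = slow_step(ctx)           # slow AR extends the semantic sequence
        ctx.append(s)                # (the real engine keeps a persistent KV cache)
        frames.append(fast_step(s))  # fast AR fills in the 10 acoustic codebooks
    return frames                    # frames then go through the RVQ decoder to audio
```

This also makes the `-max-tokens` budget concrete: each loop iteration is one semantic step, and the table above estimates roughly 21 seconds of audio per 440 such steps.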

License

The model weights are licensed under the Fish Audio Research License.

  • Research and non-commercial use: free under this license
  • Commercial use: requires a separate written license from Fish Audio

Attribution: "This model is licensed under the Fish Audio Research License, Copyright © 39 AI, INC. All Rights Reserved."

Full terms: LICENSE.md · Commercial: fish.audio · business@fish.audio

Base model: fishaudio/s2-pro (this repository hosts its quantized GGUF files).