These GGUF-formatted models are for use with my s2.cpp inference API server (https://github.com/mach92432/s2.cpp), a fork of an earlier version (https://github.com/rodrigomatta/s2.cpp). My fork demonstrates that the excellent inference quality of the original model (https://huggingface.co/fishaudio/s2-pro) can be achieved with less VRAM.

The q4_k_m version, which pairs s2-pro-q4_k_m-transformer-only.gguf with s2-pro-q4_k_m-codec-only.gguf, uses only 4 GB of VRAM.

The f16 version, which pairs s2-pro-f16-transformer-only.gguf with s2-pro-f16-codec-only.gguf, uses only 12 GB of VRAM. Any transformer-only + codec-only combination can be used. The GGUF files for the original s2.cpp (https://huggingface.co/rodrigomt/s2-pro-gguf) also work, but they are more VRAM-intensive because the complete model must be loaded twice, in VRAM or RAM.
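As a rough rule of thumb, the VRAM figures quoted in this card (4 GB for q4_k_m, 7 GB for q8_0, 12 GB for f16) can be turned into a small helper that picks the largest quantization fitting a given VRAM budget. This is my own illustrative sketch; the function and the table are not part of s2.cpp:

```python
# Approximate VRAM needed per transformer+codec GGUF pair, in GB,
# taken from the figures quoted in this model card.
VRAM_GB = {"q4_k_m": 4, "q8_0": 7, "f16": 12}

def pick_quant(available_vram_gb):
    """Return the highest-precision quantization that fits the VRAM budget,
    or None if even q4_k_m does not fit."""
    best = None
    for quant, need in sorted(VRAM_GB.items(), key=lambda kv: kv[1]):
        if need <= available_vram_gb:
            best = quant
    return best
```

For example, an RTX 3090 (24 GB) fits every pair, an 8 GB card fits up to q8_0, and a 4 GB card can still run the q4_k_m pair.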

See this link for details and installation of s2.cpp: https://github.com/mach92432/s2.cpp

Startup speed of the s2.cpp service: The s2.cpp service starts significantly faster than the official Fish Audio version, taking between 7 and 15 seconds, including loading the reference audio when the files are present.

Inference speed: On an RTX 3090, the RTF is 1.3, slightly short of real-time response. Real-time response should be achievable with the RTX 40xx series. This RTF is the same whether you use the original model (which consumes 21 GB of VRAM) or the models in this collection. If needed, the VRAM savings can be used to launch multiple instances of the s2.cpp server on different ports.

I also encourage you to use the official API hosted by Fish Audio in parallel with one or more local s2.cpp servers. The Fish Audio API's RTF is approximately 0.3. For example, a chatbot response can be broken into chunks: the first chunk is processed by the Fish Audio API, so sound is generated almost instantly, and while the user listens to that audio, the local s2.cpp server(s) synthesize the Text-to-Speech (TTS) of the following chunks so they are ready for playback. The /v1/tts endpoint is shared by both the cloud and local services.
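The hybrid cloud/local pipeline described above can be sketched as follows. Only the /v1/tts path comes from this card; the chunking rule, the request payload, the URLs, and all function names are illustrative assumptions, not the actual s2.cpp or Fish Audio client API:

```python
import json
import re
import urllib.request

def split_chunks(text, max_chars=200):
    """Split a chatbot response into sentence-aligned chunks of at most
    max_chars characters (a single over-long sentence becomes one chunk)."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for s in sentences:
        if current and len(current) + 1 + len(s) > max_chars:
            chunks.append(current)
            current = s
        else:
            current = f"{current} {s}".strip()
    if current:
        chunks.append(current)
    return chunks

def plan_requests(chunks, cloud_base, local_bases):
    """Send the first chunk to the fast cloud API (RTF ~0.3) so playback
    starts almost immediately; spread the remaining chunks round-robin over
    the local s2.cpp servers, which synthesize while the user listens."""
    plan = []
    for i, chunk in enumerate(chunks):
        base = cloud_base if i == 0 else local_bases[(i - 1) % len(local_bases)]
        plan.append((base + "/v1/tts", chunk))  # same endpoint, cloud or local
    return plan

def synthesize(url, text):
    """POST one chunk and return the audio bytes (hypothetical JSON payload)."""
    req = urllib.request.Request(
        url,
        data=json.dumps({"text": text}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```

For example, `plan_requests(split_chunks(reply), cloud_base, ["http://localhost:8080", "http://localhost:8081"])` yields one cloud request followed by requests alternating over the two local servers; the host names and ports here are placeholders.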

Compatibility of models and s2.cpp: Tests were performed with one or two RTX 3090 GPUs. No tests were conducted with other Vulkan-compatible GPUs, but according to Nvidia's documentation this should work with the RTX 20xx generation and newer. I personally managed to generate TTS correctly on a GTX 1070 with 8 GB of VRAM; to do so, I had to build a version of s2.cpp that uses CUDA for the transformer and runs the codec on the CPU. In that configuration, it is mandatory to use s2-pro-f16-codec-only.gguf for the codec; on the transformer side, I used s2-pro-q8_0-transformer-only.gguf. Generation is slow, mainly because the codec runs on the CPU: server startup takes 70 seconds and the RTF is 10!

Samples

Here is a comparison of generation quality using an ordinary French voice clone. The text below was generated by a chatbot and used without any modification.

[excited tone] Oh, tu veux voir l'autre moi ? C'est parti ! [giggles] [whispers softly] Attends, je vais te faire une petite farce... Tu sais quoi ? En fait, je ne suis pas Anaïs du tout ! Je suis juste Anna qui fait semblant ! [laughs softly] Bon, bon, je plaisante ! [smiling tone] Voici Anaïs, prête à papoter, à dessiner et à te faire quelques blagues ! [enthusiastic] Alors, Marc, qu'est-ce qu'on fait aujourd'hui ? Tu veux qu'on parle de quelque chose de précis, ou on laisse filer l'imagination ? [warm tone]

TTS generated with the Fish Audio API

TTS generated using Fish Audio's local API server (https://speech.fish.audio/server/), using 21 GB of VRAM. This gives the best results without streaming, but it is slow (RTF 3).

TTS generated using the s2.cpp API, pairing s2-pro-f16-transformer-only.gguf with s2-pro-f16-codec-only.gguf. Runs entirely on the GPU with 11 GB of VRAM.

TTS generated using the s2.cpp API, pairing s2-pro-q8_0-transformer-only.gguf with s2-pro-q8_0-codec-only.gguf. Runs entirely on the GPU with 7 GB of VRAM.

TTS generated using the s2.cpp API, pairing s2-pro-q4_k_m-transformer-only.gguf with s2-pro-q4_k_m-codec-only.gguf. Runs entirely on the GPU with 4 GB of VRAM.

Format: GGUF
Model size: 0.7B params
Architecture: fish-speech-codec
Model tree for mach9243/s2-pro-gguf
Base model: fishaudio/s2-pro
Quantized versions: 4 (including this model)