drbaph/s2-pro-fp8

@@ -87,6 +87,8 @@ license: other
 license_name: fish-audio-research-license
 license_link: LICENSE.md
 pipeline_tag: text-to-speech
 tags:
 - text-to-speech
 - instruction-following
@@ -95,10 +97,12 @@ tags:
 - fp8
 - comfyui
 - comfy
 inference: false
-extra_gated_prompt: >-
-  You agree to not use the model to generate contents that violate DMCA or local
-  laws.
 extra_gated_fields:
   Country: country
   Specific date: date_picker
@@ -116,6 +120,14 @@ extra_gated_fields:
 ---
 ## What is this?
 This is a weight-only FP8 quantization of Fish Audio S2 Pro — a state-of-the-art open-source TTS model with fine-grained inline prosody and emotion control across 80+ languages. The quantization cuts the on-disk size roughly in half and reduces VRAM usage from ~24 GB to ~12 GB, with no perceptible quality loss in practice.

 license_name: fish-audio-research-license
 license_link: LICENSE.md
 pipeline_tag: text-to-speech
+library_name: transformers
+base_model: fishaudio/s2-pro
 tags:
 - text-to-speech
 - instruction-following
 - fp8
 - comfyui
 - comfy
+- multi-turn
+- multi-speaker
+- sglang
 inference: false
+extra_gated_prompt: You agree to not use the model to generate contents that violate
+  DMCA or local laws.
 extra_gated_fields:
   Country: country
   Specific date: date_picker
 ---
+## Paper Summary
+Fish Audio S2 is an open-source text-to-speech system featuring multi-speaker, multi-turn generation and instruction-following control via natural-language descriptions. It is trained with a multi-stage recipe and a staged data pipeline covering video and speech captioning. S2 Pro specifically uses a Dual-Autoregressive (Dual-AR) architecture:
+- **Slow AR (4B):** Predicts the primary semantic codebook along the time axis.
+- **Fast AR (400M):** Generates the remaining residual codebooks to reconstruct fine-grained acoustic detail.
+---
 ## What is this?
 This is a weight-only FP8 quantization of Fish Audio S2 Pro — a state-of-the-art open-source TTS model with fine-grained inline prosody and emotion control across 80+ languages. The quantization cuts the on-disk size roughly in half and reduces VRAM usage from ~24 GB to ~12 GB, with no perceptible quality loss in practice.
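As a rough illustration of why weight-only FP8 halves the stored size, the sketch below estimates weight storage from the parameter counts given in the Paper Summary (about 4B for the Slow AR, about 400M for the Fast AR). It assumes the original checkpoint stores weights at 2 bytes per parameter (BF16), which this README does not state, and the helper name `weight_gib` is hypothetical. It counts weights only, so it does not reproduce the ~24 GB and ~12 GB VRAM figures, which also include activations, caches, and other components.

```python
# Back-of-the-envelope weight storage estimate (a sketch, not a measurement).
# Assumption: original weights are BF16 (2 bytes/param); FP8 uses 1 byte/param.
BYTES_PER_PARAM = {"bf16": 2, "fp8": 1}

# Approximate parameter counts taken from the Paper Summary.
PARAMS = {"slow_ar": 4_000_000_000, "fast_ar": 400_000_000}


def weight_gib(params: dict, dtype: str) -> float:
    """Return the total weight storage in GiB for the given dtype."""
    total_bytes = sum(params.values()) * BYTES_PER_PARAM[dtype]
    return total_bytes / 2**30


bf16_gib = weight_gib(PARAMS, "bf16")
fp8_gib = weight_gib(PARAMS, "fp8")

print(f"BF16 weights: {bf16_gib:.1f} GiB")
print(f"FP8 weights:  {fp8_gib:.1f} GiB")
print(f"FP8 / BF16 ratio: {fp8_gib / bf16_gib:.2f}")
```

The ratio is what matters here: storing one byte per weight instead of two halves the weight footprint regardless of the exact parameter count.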