Add base_model and library_name metadata
#1 opened by nielsr (HF Staff)
README.md CHANGED
|
@@ -87,6 +87,8 @@ license: other
|
|
| 87 |
license_name: fish-audio-research-license
|
| 88 |
license_link: LICENSE.md
|
| 89 |
pipeline_tag: text-to-speech
|
| 90 |
tags:
|
| 91 |
- text-to-speech
|
| 92 |
- instruction-following
|
|
@@ -95,10 +97,12 @@ tags:
|
|
| 95 |
- fp8
|
| 96 |
- comfyui
|
| 97 |
- comfy
|
| 98 |
inference: false
|
| 99 |
- extra_gated_prompt:
|
| 100 |
-
|
| 101 |
- laws.
|
| 102 |
extra_gated_fields:
|
| 103 |
Country: country
|
| 104 |
Specific date: date_picker
|
|
@@ -116,6 +120,14 @@ extra_gated_fields:
|
|
| 116 |
|
| 117 |
---
|
| 118 |
|
| 119 |
## What is this?
|
| 120 |
|
| 121 |
This is a weight-only FP8 quantization of Fish Audio S2 Pro — a state-of-the-art open-source TTS model with fine-grained inline prosody and emotion control across 80+ languages. The quantization cuts the on-disk size roughly in half and reduces VRAM usage from ~24 GB to ~12 GB, with no perceptible quality loss in practice.
|
|
|
|
| 87 |
license_name: fish-audio-research-license
|
| 88 |
license_link: LICENSE.md
|
| 89 |
pipeline_tag: text-to-speech
|
| 90 |
+ library_name: transformers
|
| 91 |
+ base_model: fishaudio/s2-pro
|
| 92 |
tags:
|
| 93 |
- text-to-speech
|
| 94 |
- instruction-following
|
|
|
|
| 97 |
- fp8
|
| 98 |
- comfyui
|
| 99 |
- comfy
|
| 100 |
+ - multi-turn
|
| 101 |
+ - multi-speaker
|
| 102 |
+ - sglang
|
| 103 |
inference: false
|
| 104 |
+ extra_gated_prompt: You agree to not use the model to generate contents that violate
|
| 105 |
+ DMCA or local laws.
|
|
|
|
| 106 |
extra_gated_fields:
|
| 107 |
Country: country
|
| 108 |
Specific date: date_picker
|
|
|
|
| 120 |
|
| 121 |
---
|
| 122 |
|
| 123 |
+ ## Paper Summary
|
| 124 |
+
|
| 125 |
+ Fish Audio S2 is an open-sourced text-to-speech system featuring multi-speaker, multi-turn generation, and instruction-following control via natural-language descriptions. The system utilizes a multi-stage training recipe and a staged data pipeline covering video and speech captioning. S2 Pro specifically uses a Dual-Autoregressive (Dual-AR) architecture:
|
| 126 |
+ - **Slow AR (4B):** Predicts the primary semantic codebook along the time axis.
|
| 127 |
+ - **Fast AR (400M):** Generates the remaining residual codebooks to reconstruct fine-grained acoustic detail.
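
As a rough illustration of the Dual-AR split described in the two bullets above, the following toy Python sketch shows only the control flow: one slow step per frame picks the semantic token, then a fast inner loop fills in the residual codebooks for that frame. The functions `toy_slow_ar` and `toy_fast_ar`, the codebook count, and the vocabulary size are hypothetical stand-ins and are not taken from the S2 Pro code or paper; random sampling replaces the actual networks.

```python
import random
from typing import List

NUM_CODEBOOKS = 4  # assumption for illustration; not stated in the model card
VOCAB_SIZE = 16    # assumption: toy codebook size


def toy_slow_ar(history: List[List[int]], rng: random.Random) -> int:
    """Hypothetical stand-in for the Slow AR: semantic token for the next frame."""
    return rng.randrange(VOCAB_SIZE)


def toy_fast_ar(semantic: int, partial: List[int], rng: random.Random) -> int:
    """Hypothetical stand-in for the Fast AR: next residual token within a frame."""
    return rng.randrange(VOCAB_SIZE)


def generate(num_frames: int, seed: int = 0) -> List[List[int]]:
    """Produce num_frames frames, each holding NUM_CODEBOOKS tokens."""
    rng = random.Random(seed)
    frames: List[List[int]] = []
    for _ in range(num_frames):
        # Slow path: one step along the time axis yields the semantic codebook.
        semantic = toy_slow_ar(frames, rng)
        codes = [semantic]
        # Fast path: fill in the remaining residual codebooks for this frame.
        for _ in range(NUM_CODEBOOKS - 1):
            codes.append(toy_fast_ar(semantic, codes, rng))
        frames.append(codes)
    return frames


frames = generate(num_frames=3)
for index, codes in enumerate(frames):
    print(f"frame {index}: semantic={codes[0]} residual={codes[1:]}")
```

The point of the nesting is that the large model runs once per frame while the small model runs once per residual codebook, which is one plausible reading of why the two components differ so much in size.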
|
| 128 |
+
|
| 129 |
+
---
|
| 130 |
+
|
| 131 |
## What is this?
|
| 132 |
|
| 133 |
This is a weight-only FP8 quantization of Fish Audio S2 Pro — a state-of-the-art open-source TTS model with fine-grained inline prosody and emotion control across 80+ languages. The quantization cuts the on-disk size roughly in half and reduces VRAM usage from ~24 GB to ~12 GB, with no perceptible quality loss in practice.
|
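
The "roughly in half" on-disk figure follows from bytes per weight. Here is a back-of-envelope sketch, assuming the original checkpoint stores weights in a 16-bit format (an assumption, not stated in this card) and treating the 4B and 400M parameter counts from the Paper Summary as approximate totals:

```python
# Parameter counts quoted in the Paper Summary (approximate).
SLOW_AR_PARAMS = 4_000_000_000  # "Slow AR (4B)"
FAST_AR_PARAMS = 400_000_000    # "Fast AR (400M)"

# Assumption: the unquantized checkpoint uses a 16-bit weight format.
BYTES_PER_PARAM_16BIT = 2
BYTES_PER_PARAM_FP8 = 1


def weight_gigabytes(num_params: int, bytes_per_param: int) -> float:
    """Raw weight storage in GB (10**9 bytes), ignoring file-format overhead."""
    return num_params * bytes_per_param / 1e9


total_params = SLOW_AR_PARAMS + FAST_AR_PARAMS
size_16bit = weight_gigabytes(total_params, BYTES_PER_PARAM_16BIT)
size_fp8 = weight_gigabytes(total_params, BYTES_PER_PARAM_FP8)
ratio = size_fp8 / size_16bit

print(f"16-bit weights: {size_16bit:.1f} GB")
print(f"FP8 weights:    {size_fp8:.1f} GB")
print(f"FP8 / 16-bit:   {ratio:.2f}")
```

These totals cover raw weights only. A weight-only quantization may keep some tensors in higher precision, so the real ratio is approximate, and the VRAM figures quoted above also include runtime memory beyond the weights, so they are not expected to match these numbers.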