Add base_model and library_name metadata

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +15 -3
README.md CHANGED
@@ -87,6 +87,8 @@ license: other
  license_name: fish-audio-research-license
  license_link: LICENSE.md
  pipeline_tag: text-to-speech
+ library_name: transformers
+ base_model: fishaudio/s2-pro
  tags:
  - text-to-speech
  - instruction-following
@@ -95,10 +97,12 @@ tags:
  - fp8
  - comfyui
  - comfy
+ - multi-turn
+ - multi-speaker
+ - sglang
  inference: false
- extra_gated_prompt: >-
- You agree to not use the model to generate contents that violate DMCA or local
- laws.
+ extra_gated_prompt: You agree to not use the model to generate contents that violate
+ DMCA or local laws.
  extra_gated_fields:
  Country: country
  Specific date: date_picker
@@ -116,6 +120,14 @@ extra_gated_fields:

  ---

+ ## Paper Summary
+
+ Fish Audio S2 is an open-sourced text-to-speech system featuring multi-speaker, multi-turn generation, and instruction-following control via natural-language descriptions. The system utilizes a multi-stage training recipe and a staged data pipeline covering video and speech captioning. S2 Pro specifically uses a Dual-Autoregressive (Dual-AR) architecture:
+ - **Slow AR (4B):** Predicts the primary semantic codebook along the time axis.
+ - **Fast AR (400M):** Generates the remaining residual codebooks to reconstruct fine-grained acoustic detail.
+
+ ---
+
  ## What is this?

  This is a weight-only FP8 quantization of Fish Audio S2 Pro — a state-of-the-art open-source TTS model with fine-grained inline prosody and emotion control across 80+ languages. The quantization cuts the on-disk size roughly in half and reduces VRAM usage from ~24 GB to ~12 GB, with no perceptible quality loss in practice.
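
For readers of the Paper Summary added in this change, the Dual-AR split can be pictured as two nested loops: an outer loop over time steps for the Slow AR and an inner loop over residual codebooks for the Fast AR. The sketch below is a toy illustration only, not Fish Audio's implementation. The functions `slow_ar_step` and `fast_ar_step`, the constants `NUM_RESIDUAL_CODEBOOKS` and `CODEBOOK_SIZE`, and the choice to condition the Fast AR on the current semantic token are all hypothetical stand-ins; the source states none of these details.

```python
import random

# Hypothetical values: the source does not state the codebook count or size.
NUM_RESIDUAL_CODEBOOKS = 3
CODEBOOK_SIZE = 16


def slow_ar_step(semantic_history, rng):
    """Stand-in for the Slow AR (4B): pick the next semantic token.

    A real model would condition on semantic_history; this stub samples
    uniformly so the control flow can run without any weights.
    """
    return rng.randrange(CODEBOOK_SIZE)


def fast_ar_step(semantic_token, residuals_so_far, rng):
    """Stand-in for the Fast AR (400M): pick the next residual code.

    Conditioning on semantic_token and residuals_so_far is an assumption
    made for this sketch; the stub ignores both and samples uniformly.
    """
    return rng.randrange(CODEBOOK_SIZE)


def generate(num_frames, seed=0):
    """Return one list of codes per time step: [semantic, residual_1, ...]."""
    rng = random.Random(seed)
    semantic_history = []
    frames = []
    for _ in range(num_frames):
        # Outer loop (time axis): one primary semantic code per step.
        semantic = slow_ar_step(semantic_history, rng)
        semantic_history.append(semantic)

        # Inner loop (codebook axis): remaining residual codes for this step.
        residuals = []
        for _ in range(NUM_RESIDUAL_CODEBOOKS):
            residuals.append(fast_ar_step(semantic, residuals, rng))

        frames.append([semantic] + residuals)
    return frames


frames = generate(num_frames=4)
```

Each entry of `frames` holds the semantic code followed by the residual codes for one time step, which is the shape a codec decoder would consume to reconstruct audio.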