---
language:
- zh
- en
- ja
- ko
- es
- pt
- ar
- ru
- fr
- de
- sv
- it
- tr
- 'no'
- nl
- cy
- eu
- ca
- da
- gl
- ta
- hu
- fi
- pl
- et
- hi
- la
- ur
- th
- vi
- jw
- bn
- yo
- sl
- cs
- sw
- nn
- he
- ms
- uk
- id
- kk
- bg
- lv
- my
- tl
- sk
- ne
- fa
- af
- el
- bo
- hr
- ro
- sn
- mi
- yi
- am
- be
- km
- is
- az
- sd
- br
- sq
- ps
- mn
- ht
- ml
- sr
- sa
- te
- ka
- bs
- pa
- lt
- kn
- si
- hy
- mr
- as
- gu
- fo
license: other
license_name: fish-audio-research-license
license_link: LICENSE.md
pipeline_tag: text-to-speech
library_name: transformers
base_model: fishaudio/s2-pro
tags:
- text-to-speech
- instruction-following
- multilingual
- quantized
- fp8
- comfyui
- comfy
- multi-turn
- multi-speaker
- sglang
inference: false
extra_gated_prompt: You agree to not use the model to generate contents that violate
  DMCA or local laws.
extra_gated_fields:
  Country: country
  Specific date: date_picker
  I agree to use this model for non-commercial use ONLY: checkbox
---

# Fish Audio S2 Pro — FP8 Weight-Only Quantized

**FP8 quantized version of [fishaudio/s2-pro](https://huggingface.co/fishaudio/s2-pro).**

[**Original Model**](https://huggingface.co/fishaudio/s2-pro) | [**Technical Report**](https://huggingface.co/papers/2603.08823) | [**GitHub**](https://github.com/fishaudio/fish-speech) | [**Playground**](https://fish.audio) | [**ComfyUI Node**](https://github.com/Saganaki22/ComfyUI-FishAudioS2)


![Screenshot 2026-03-11 211919](https://cdn-uploads.huggingface.co/production/uploads/63473b59e5c0717e6737b872/l4qA1zM2PAtmnjNe3ZaQo.png)

---

## Paper Summary

Fish Audio S2 is an open-source text-to-speech system featuring multi-speaker, multi-turn generation and instruction-following control via natural-language descriptions. The system uses a multi-stage training recipe and a staged data pipeline covering video and speech captioning. S2 Pro specifically uses a Dual-Autoregressive (Dual-AR) architecture:
- **Slow AR (4B):** Predicts the primary semantic codebook along the time axis.
- **Fast AR (400M):** Generates the remaining residual codebooks to reconstruct fine-grained acoustic detail.

---

## What is this?

This is a weight-only FP8 quantization of Fish Audio S2 Pro — a state-of-the-art open-source TTS model with fine-grained inline prosody and emotion control across 80+ languages. The quantization cuts the on-disk size roughly in half and reduces VRAM usage from ~24 GB to ~12 GB, with no perceptible quality loss in practice.

| | Original (s2-pro) | This (s2-pro-fp8) |
|---|---|---|
| **Weight dtype** | bfloat16 | float8_e4m3fn |
| **Activation dtype** | bfloat16 | bfloat16 |
| **Scale** | — | per-row float32 |
| **File size** | ~12 GB | ~6.2 GB |
| **VRAM (inference)** | ~24 GB | ~12 GB |
| **Extra dependencies** | none | none |

---

## Quantization Details

**What is quantized:** All `nn.Linear` weight matrices in both the Slow AR (4B) and Fast AR (400M) backbones — 201 layers in total. Non-linear weights (embeddings, layer norms, codec) remain in bfloat16.

**Method: Per-row symmetric FP8**

Each output row of every weight matrix has its own `float32` scale factor:

```
scale = max(abs(row)) / FP8_MAX          # FP8_MAX = 448.0 for float8_e4m3fn
W_fp8 = (W_bf16 / scale).to(fp8_e4m3)    # quantize (round to nearest FP8 value)
W_deq = W_fp8.to(bfloat16) * scale       # dequantize at inference (≈ W_bf16)
```

Per-row scaling captures the per-channel magnitude variation in transformer weight matrices far better than a single per-tensor scale, significantly reducing quantization error at negligible overhead (one float32 scale per output row).
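
To see why per-row scaling matters, here is a dependency-free toy demo. It uses a uniform 8-level grid as a stand-in for FP8 (the real `float8_e4m3fn` grid is non-uniform with a max of 448.0, but the scaling logic is identical): with a single per-tensor scale, a small-magnitude row is flushed entirely to zero, while a per-row scale preserves it.

```python
GRID_MAX = 7.0  # stand-in for FP8_MAX = 448.0


def quantize(row, scale):
    """Symmetric quantization: scale down, round to the grid, scale back up."""
    return [round(x / scale) * scale for x in row]


# Two rows with very different magnitudes, as in real transformer weights.
W = [
    [0.01, -0.02, 0.015, -0.005],  # small-magnitude row
    [3.0, -7.0, 5.5, -2.25],       # large-magnitude row
]

# Per-tensor: one scale for the whole matrix, set by the global max (= 1.0 here).
per_tensor_scale = max(abs(x) for row in W for x in row) / GRID_MAX
small_row_pt = quantize(W[0], per_tensor_scale)
print("per-tensor:", small_row_pt)  # every value rounds to 0.0

# Per-row: the small row gets a scale matched to its own magnitude.
row_scale = max(abs(x) for x in W[0]) / GRID_MAX
small_row_pr = quantize(W[0], row_scale)
print("per-row:   ", small_row_pr)  # values survive quantization
```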

**No external quantization library required.** Dequantization is implemented in pure PyTorch inside a custom `FP8Linear` module — no torchao, bitsandbytes, or AutoGPTQ needed. The model loads and runs on any machine with PyTorch 2.1+.

**File layout inside `model.safetensors`:**
- `<layer>.weight` — `float8_e4m3fn` tensor (quantized weights)
- `<layer>.weight.scale` — `float32` tensor, shape `[out_features, 1]` (per-row scales)
- All other tensors — `bfloat16` (embeddings, norms, codec, etc.)

---

## Hardware Requirements

- **GPU:** NVIDIA GPU with CUDA support
- **VRAM:** ~12 GB
- **Native FP8 tensor cores:** Ada Lovelace or Blackwell (RTX 4090, RTX 5090, H100, etc.) — recommended for full speed
- **Older GPUs (Ampere and below):** Will load and run correctly. Dequantization to bfloat16 happens on all hardware, so you still get the ~12 GB VRAM footprint benefit even without native FP8 cores.

---

## Usage — ComfyUI (Recommended)

The easiest way to use this model is with **[ComfyUI-FishAudioS2](https://github.com/Saganaki22/ComfyUI-FishAudioS2)**, which has native support for this FP8 model with zero extra setup.

### Installation

1. Install the ComfyUI node via **ComfyUI Manager** (search `FishAudioS2`) or manually:
   ```bash
   cd ComfyUI/custom_nodes
   git clone https://github.com/Saganaki22/ComfyUI-FishAudioS2.git
   ```

2. The model **auto-downloads on first use** — select `s2-pro-fp8` from the model dropdown in any Fish S2 node.

3. Or download manually:
   ```bash
   huggingface-cli download drbaph/s2-pro-fp8 --local-dir ComfyUI/models/fishaudioS2/s2-pro-fp8
   ```

### Recommended settings

- `precision`: `auto` or `bfloat16` — matches the activation dtype
- `attention`: `auto` or `sage_attention` for best performance
- `keep_model_loaded`: `True` if running multiple generations back-to-back

Works with all three nodes: **Fish S2 TTS**, **Fish S2 Voice Clone TTS**, **Fish S2 Multi-Speaker TTS**.

---

## About Fish Audio S2 Pro

Fish Audio S2 Pro is a leading text-to-speech model with fine-grained inline control of prosody and emotion. Trained on more than 10 million hours of audio data across 80+ languages, it combines reinforcement learning alignment with a **Dual-Autoregressive (Dual-AR)** architecture.

### Architecture

S2 Pro builds on a decoder-only transformer combined with an RVQ-based audio codec (10 codebooks, ~21 Hz frame rate):

- **Slow AR** (4B parameters): Operates along the time axis and predicts the primary semantic codebook.
- **Fast AR** (400M parameters): Generates the remaining 9 residual codebooks at each time step, reconstructing fine-grained acoustic detail.

This asymmetric design keeps inference efficient while preserving audio fidelity. The Dual-AR architecture is structurally isomorphic to standard autoregressive LLMs, inheriting LLM-native serving optimizations — continuous batching, paged KV cache, CUDA graph replay, RadixAttention-based prefix caching.

### Fine-Grained Inline Control

Embed natural-language instructions directly in the text using `[tag]` syntax. S2 Pro accepts **free-form descriptions** — not a fixed tag vocabulary:

`[pause]` `[emphasis]` `[laughing]` `[inhale]` `[chuckle]` `[tsk]` `[singing]` `[excited]` `[volume up]` `[echo]` `[angry]` `[sigh]` `[whisper]` `[screaming]` `[shouting]` `[surprised]` `[short pause]` `[exhale]` `[delight]` `[sad]` `[clearing throat]` `[shocked]` `[with strong accent]` `[professional broadcast tone]` `[pitch up]` `[pitch down]`

**Free-form examples:** `[whisper in small voice]` · `[super happy and excited]` · `[speaking slowly and clearly]` · `[sarcastic tone]`

15,000+ unique tags supported.

### Supported Languages

**Tier 1 (Best Quality):** Japanese (ja), English (en), Chinese (zh)

**Tier 2:** Korean (ko), Spanish (es), Portuguese (pt), Arabic (ar), Russian (ru), French (fr), German (de)

**80+ total:** sv, it, tr, no, nl, cy, eu, ca, da, gl, ta, hu, fi, pl, et, hi, la, ur, th, vi, jw, bn, yo, sl, cs, sw, nn, he, ms, uk, id, kk, bg, lv, my, tl, sk, ne, fa, af, el, bo, hr, ro, sn, mi, yi, am, be, km, is, az, sd, br, sq, ps, mn, ht, ml, sr, sa, te, ka, bs, pa, lt, kn, si, hy, mr, as, gu, fo

### Production Streaming Performance (original model, H200)

- **Real-Time Factor (RTF):** 0.195
- **Time-to-first-audio:** ~100 ms
- **Throughput:** 3,000+ acoustic tokens/s while maintaining RTF below 0.5

---

## Citation

```bibtex
@misc{liao2026fishaudios2technical,
      title={Fish Audio S2 Technical Report}, 
      author={Shijia Liao and Yuxuan Wang and Songting Liu and Yifan Cheng and Ruoyi Zhang and Tianyu Li and Shidong Li and Yisheng Zheng and Xingwei Liu and Qingzheng Wang and Zhizhuo Zhou and Jiahua Liu and Xin Chen and Dawei Han},
      year={2026},
      eprint={2603.08823},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2603.08823},
}
```

---

## License

This model inherits the [Fish Audio Research License](LICENSE.md) from [fishaudio/s2-pro](https://huggingface.co/fishaudio/s2-pro). Research and non-commercial use are permitted free of charge. Commercial use requires a separate license from Fish Audio — contact business@fish.audio.

The FP8 quantization was produced by [drbaph](https://huggingface.co/drbaph) and is released under the same license.