---
language:
- zh
- en
- ja
- ko
- es
- pt
- ar
- ru
- fr
- de
- sv
- it
- tr
- 'no'
- nl
- cy
- eu
- ca
- da
- gl
- ta
- hu
- fi
- pl
- et
- hi
- la
- ur
- th
- vi
- jw
- bn
- yo
- sl
- cs
- sw
- nn
- he
- ms
- uk
- id
- kk
- bg
- lv
- my
- tl
- sk
- ne
- fa
- af
- el
- bo
- hr
- ro
- sn
- mi
- yi
- am
- be
- km
- is
- az
- sd
- br
- sq
- ps
- mn
- ht
- ml
- sr
- sa
- te
- ka
- bs
- pa
- lt
- kn
- si
- hy
- mr
- as
- gu
- fo
license: other
license_name: fish-audio-research-license
license_link: LICENSE.md
pipeline_tag: text-to-speech
library_name: transformers
base_model: fishaudio/s2-pro
tags:
- text-to-speech
- instruction-following
- multilingual
- quantized
- fp8
- comfyui
- comfy
- multi-turn
- multi-speaker
- sglang
inference: false
extra_gated_prompt: You agree to not use the model to generate contents that violate
  DMCA or local laws.
extra_gated_fields:
  Country: country
  Specific date: date_picker
  I agree to use this model for non-commercial use ONLY: checkbox
---

# Fish Audio S2 Pro — FP8 Weight-Only Quantized

**FP8 quantized version of [fishaudio/s2-pro](https://huggingface.co/fishaudio/s2-pro).**

[**Original Model**](https://huggingface.co/fishaudio/s2-pro) | [**Technical Report**](https://huggingface.co/papers/2603.08823) | [**GitHub**](https://github.com/fishaudio/fish-speech) | [**Playground**](https://fish.audio) | [**ComfyUI Node**](https://github.com/Saganaki22/ComfyUI-FishAudioS2)


![Screenshot 2026-03-11 211919](https://cdn-uploads.huggingface.co/production/uploads/63473b59e5c0717e6737b872/l4qA1zM2PAtmnjNe3ZaQo.png)

---

## Paper Summary

Fish Audio S2 is an open-source text-to-speech system featuring multi-speaker, multi-turn generation and instruction-following control via natural-language descriptions. The system uses a multi-stage training recipe and a staged data pipeline covering video and speech captioning. S2 Pro specifically uses a Dual-Autoregressive (Dual-AR) architecture:
- **Slow AR (4B):** Predicts the primary semantic codebook along the time axis.
- **Fast AR (400M):** Generates the remaining residual codebooks to reconstruct fine-grained acoustic detail.

---

## What is this?

This is a weight-only FP8 quantization of Fish Audio S2 Pro — a state-of-the-art open-source TTS model with fine-grained inline prosody and emotion control across 80+ languages. The quantization cuts the on-disk size roughly in half and reduces VRAM usage from ~24 GB to ~12 GB, with no perceptible quality loss in practice.

| | Original (s2-pro) | This (s2-pro-fp8) |
|---|---|---|
| **Weight dtype** | bfloat16 | float8_e4m3fn |
| **Activation dtype** | bfloat16 | bfloat16 |
| **Scale** | — | per-row float32 |
| **File size** | ~12 GB | ~6.2 GB |
| **VRAM (inference)** | ~24 GB | ~12 GB |
| **Extra dependencies** | none | none |

---

## Quantization Details

**What is quantized:** All `nn.Linear` weight matrices in both the Slow AR (4B) and Fast AR (400M) backbones — 201 layers in total. Non-linear weights (embeddings, layer norms, codec) remain in bfloat16.

**Method: Per-row symmetric FP8**

Each output row of every weight matrix has its own `float32` scale factor:

```
scale = max(abs(row)) / FP8_MAX          # FP8_MAX = 448.0 for float8_e4m3fn
W_fp8 = (W_bf16 / scale).to(fp8_e4m3)    # quantize (round to nearest FP8 value)
W_deq = W_fp8.to(bfloat16) * scale       # dequantize at inference (≈ W_bf16)
```

Per-row scaling captures the per-channel magnitude variation in transformer weight matrices far better than a single per-tensor scale, significantly reducing quantization error at negligible overhead (one float32 scale per output row).
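
To see why per-row scaling matters, here is a dependency-free toy demo. It uses a uniform 8-level grid as a stand-in for FP8 (the real `float8_e4m3fn` grid is non-uniform with a max of 448.0, but the scaling logic is identical): with a single per-tensor scale, a small-magnitude row is flushed entirely to zero, while a per-row scale preserves it.

```python
GRID_MAX = 7.0  # stand-in for FP8_MAX = 448.0


def quantize(row, scale):
    """Symmetric quantization: scale down, round to the grid, scale back up."""
    return [round(x / scale) * scale for x in row]


# Two rows with very different magnitudes, as in real transformer weights.
W = [
    [0.01, -0.02, 0.015, -0.005],  # small-magnitude row
    [3.0, -7.0, 5.5, -2.25],       # large-magnitude row
]

# Per-tensor: one scale for the whole matrix, set by the global max (= 1.0 here).
per_tensor_scale = max(abs(x) for row in W for x in row) / GRID_MAX
small_row_pt = quantize(W[0], per_tensor_scale)
print("per-tensor:", small_row_pt)  # every value rounds to 0.0

# Per-row: the small row gets a scale matched to its own magnitude.
row_scale = max(abs(x) for x in W[0]) / GRID_MAX
small_row_pr = quantize(W[0], row_scale)
print("per-row:   ", small_row_pr)  # values survive quantization
```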

**No external quantization library required.** Dequantization is implemented in pure PyTorch inside a custom `FP8Linear` module — no torchao, bitsandbytes, or AutoGPTQ needed. The model loads and runs on any machine with PyTorch 2.1+.

**File layout inside `model.safetensors`:**
- `<layer>.weight` — `float8_e4m3fn` tensor (quantized weights)
- `<layer>.weight.scale` — `float32` tensor, shape `[out_features, 1]` (per-row scales)
- All other tensors — `bfloat16` (embeddings, norms, codec, etc.)

---

## Hardware Requirements

- **GPU:** NVIDIA GPU with CUDA support
- **VRAM:** ~12 GB
- **Native FP8 tensor cores:** Ada Lovelace or Blackwell (RTX 4090, RTX 5090, H100, etc.) — recommended for full speed
- **Older GPUs (Ampere and below):** Will load and run correctly. Dequantization to bfloat16 happens on all hardware, so you still get the ~12 GB VRAM footprint benefit even without native FP8 cores.

---

## Usage — ComfyUI (Recommended)

The easiest way to use this model is with **[ComfyUI-FishAudioS2](https://github.com/Saganaki22/ComfyUI-FishAudioS2)**, which has native support for this FP8 model with zero extra setup.

### Installation

1. Install the ComfyUI node via **ComfyUI Manager** (search `FishAudioS2`) or manually:
   ```bash
   cd ComfyUI/custom_nodes
   git clone https://github.com/Saganaki22/ComfyUI-FishAudioS2.git
   ```

2. The model **auto-downloads on first use** — select `s2-pro-fp8` from the model dropdown in any Fish S2 node.

3. Or download manually:
   ```bash
   huggingface-cli download drbaph/s2-pro-fp8 --local-dir ComfyUI/models/fishaudioS2/s2-pro-fp8
   ```

### Recommended settings

- `precision`: `auto` or `bfloat16` — matches the activation dtype
- `attention`: `auto` or `sage_attention` for best performance
- `keep_model_loaded`: `True` if running multiple generations back-to-back

Works with all three nodes: **Fish S2 TTS**, **Fish S2 Voice Clone TTS**, **Fish S2 Multi-Speaker TTS**.

---

## About Fish Audio S2 Pro

Fish Audio S2 Pro is a leading text-to-speech model with fine-grained inline control of prosody and emotion. Trained on more than 10 million hours of audio data across 80+ languages, it combines reinforcement learning alignment with a **Dual-Autoregressive (Dual-AR)** architecture.

### Architecture

S2 Pro builds on a decoder-only transformer combined with an RVQ-based audio codec (10 codebooks, ~21 Hz frame rate):

- **Slow AR** (4B parameters): Operates along the time axis and predicts the primary semantic codebook.
- **Fast AR** (400M parameters): Generates the remaining 9 residual codebooks at each time step, reconstructing fine-grained acoustic detail.

This asymmetric design keeps inference efficient while preserving audio fidelity. The Dual-AR architecture is structurally isomorphic to standard autoregressive LLMs, inheriting LLM-native serving optimizations — continuous batching, paged KV cache, CUDA graph replay, RadixAttention-based prefix caching.

### Fine-Grained Inline Control

Embed natural-language instructions directly in the text using `[tag]` syntax. S2 Pro accepts **free-form descriptions** — not a fixed tag vocabulary:

`[pause]` `[emphasis]` `[laughing]` `[inhale]` `[chuckle]` `[tsk]` `[singing]` `[excited]` `[volume up]` `[echo]` `[angry]` `[sigh]` `[whisper]` `[screaming]` `[shouting]` `[surprised]` `[short pause]` `[exhale]` `[delight]` `[sad]` `[clearing throat]` `[shocked]` `[with strong accent]` `[professional broadcast tone]` `[pitch up]` `[pitch down]`

**Free-form examples:** `[whisper in small voice]` · `[super happy and excited]` · `[speaking slowly and clearly]` · `[sarcastic tone]`

15,000+ unique tags supported.

### Supported Languages

**Tier 1 (Best Quality):** Japanese (ja), English (en), Chinese (zh)

**Tier 2:** Korean (ko), Spanish (es), Portuguese (pt), Arabic (ar), Russian (ru), French (fr), German (de)

**80+ total:** sv, it, tr, no, nl, cy, eu, ca, da, gl, ta, hu, fi, pl, et, hi, la, ur, th, vi, jw, bn, yo, sl, cs, sw, nn, he, ms, uk, id, kk, bg, lv, my, tl, sk, ne, fa, af, el, bo, hr, ro, sn, mi, yi, am, be, km, is, az, sd, br, sq, ps, mn, ht, ml, sr, sa, te, ka, bs, pa, lt, kn, si, hy, mr, as, gu, fo

### Production Streaming Performance (original model, H200)

- **Real-Time Factor (RTF):** 0.195
- **Time-to-first-audio:** ~100 ms
- **Throughput:** 3,000+ acoustic tokens/s while maintaining RTF below 0.5

---

## Citation

```bibtex
@misc{liao2026fishaudios2technical,
      title={Fish Audio S2 Technical Report}, 
      author={Shijia Liao and Yuxuan Wang and Songting Liu and Yifan Cheng and Ruoyi Zhang and Tianyu Li and Shidong Li and Yisheng Zheng and Xingwei Liu and Qingzheng Wang and Zhizhuo Zhou and Jiahua Liu and Xin Chen and Dawei Han},
      year={2026},
      eprint={2603.08823},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2603.08823},
}
```

---

## License

This model inherits the [Fish Audio Research License](LICENSE.md) from [fishaudio/s2-pro](https://huggingface.co/fishaudio/s2-pro). Research and non-commercial use are permitted free of charge. Commercial use requires a separate license from Fish Audio — contact business@fish.audio.

The FP8 quantization was produced by [drbaph](https://huggingface.co/drbaph) and is released under the same license.