---
tags:
- text-to-speech
license: other
license_name: fish-audio-research-license
license_link: LICENSE.md
language:
- zh
- en
- ja
- ko
- es
- pt
- ar
- ru
- fr
- de
- sv
- it
- tr
- "no"
- nl
- cy
- eu
- ca
- da
- gl
- ta
- hu
- fi
- pl
- et
- hi
- la
- ur
- th
- vi
- jw
- bn
- yo
- sl
- cs
- sw
- nn
- he
- ms
- uk
- id
- kk
- bg
- lv
- my
- tl
- sk
- ne
- fa
- af
- el
- bo
- hr
- ro
- sn
- mi
- yi
- am
- be
- km
- is
- az
- sd
- br
- sq
- ps
- mn
- ht
- ml
- sr
- sa
- te
- ka
- bs
- pa
- lt
- kn
- si
- hy
- mr
- as
- gu
- fo
pipeline_tag: text-to-speech
inference: false
extra_gated_prompt: >-
  You agree to not use the model to generate contents that violate DMCA or local
  laws.
extra_gated_fields:
  Country: country
  Specific date: date_picker
  I agree to use this model for non-commercial use ONLY: checkbox
---

# Fish Audio S2 Pro

<img src="overview.png" alt="Fish Audio S2 Pro overview — fine-grained control, multi-speaker multi-turn generation, low-latency streaming, and long-context inference." width="100%">

**Fish Audio S2 Pro** is a leading text-to-speech (TTS) model with fine-grained inline control of prosody and emotion. Trained on 10M+ hours of audio data across 80+ languages, the system combines reinforcement learning alignment with a dual-autoregressive architecture. The release includes model weights, fine-tuning code, and an SGLang-based streaming inference engine.

## Architecture

S2 Pro builds on a decoder-only transformer combined with an RVQ-based audio codec (10 codebooks, ~21 Hz frame rate) using a **Dual-Autoregressive (Dual-AR)** architecture:

- **Slow AR** (4B parameters): Operates along the time axis and predicts the primary semantic codebook.
- **Fast AR** (400M parameters): Generates the remaining 9 residual codebooks at each time step, reconstructing fine-grained acoustic detail.

This asymmetric design keeps inference efficient while preserving audio fidelity. Because the Dual-AR architecture is structurally isomorphic to standard autoregressive LLMs, it inherits all LLM-native serving optimizations from SGLang — including continuous batching, paged KV cache, CUDA graph replay, and RadixAttention-based prefix caching.

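The Dual-AR generation loop described above can be sketched as follows. This is a minimal illustration of the control flow only — `slow_ar_step` and `fast_ar_step` are hypothetical placeholders (here returning random tokens) standing in for the 4B and 400M models, and the codebook vocabulary size is assumed, not taken from the release:

```python
import random

NUM_CODEBOOKS = 10    # 1 semantic + 9 residual, per the RVQ codec
CODEBOOK_SIZE = 1024  # hypothetical per-codebook vocabulary size

def slow_ar_step(semantic_history):
    """Placeholder for the Slow AR (4B) model: predicts the next
    semantic token from the time-axis history."""
    return random.randrange(CODEBOOK_SIZE)

def fast_ar_step(semantic_token, residuals_so_far):
    """Placeholder for the Fast AR (400M) model: predicts one
    residual token conditioned on the current frame's semantic token."""
    return random.randrange(CODEBOOK_SIZE)

def generate(num_frames):
    """Dual-AR loop: one slow step per frame along the time axis,
    then 9 fast steps filling that frame's residual codebooks."""
    frames, semantic_history = [], []
    for _ in range(num_frames):
        semantic = slow_ar_step(semantic_history)
        semantic_history.append(semantic)
        residuals = []
        for _ in range(NUM_CODEBOOKS - 1):
            residuals.append(fast_ar_step(semantic, residuals))
        frames.append([semantic] + residuals)
    return frames  # num_frames x NUM_CODEBOOKS token grid

frames = generate(21)  # ~1 second of audio at the ~21 Hz frame rate
```

The asymmetry is visible in the loop structure: the expensive slow model runs once per frame, while the cheap fast model runs 9 times per frame over a much shorter (intra-frame) context.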
## Fine-Grained Inline Control

S2 Pro enables localized control over speech generation by embedding natural-language instructions directly within the text using `[tag]` syntax. Rather than relying on a fixed set of predefined tags, S2 Pro accepts **free-form textual descriptions** — such as `[whisper in small voice]`, `[professional broadcast tone]`, or `[pitch up]` — allowing open-ended expression control at the word level.

**Common tags (15,000+ unique tags supported):**

`[pause]` `[emphasis]` `[laughing]` `[inhale]` `[chuckle]` `[tsk]` `[singing]` `[excited]` `[laughing tone]` `[interrupting]` `[chuckling]` `[excited tone]` `[volume up]` `[echo]` `[angry]` `[low volume]` `[sigh]` `[low voice]` `[whisper]` `[screaming]` `[shouting]` `[loud]` `[surprised]` `[short pause]` `[exhale]` `[delight]` `[panting]` `[audience laughter]` `[with strong accent]` `[volume down]` `[clearing throat]` `[sad]` `[moaning]` `[shocked]`

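A prompt using this inline syntax is just a plain string with bracketed instructions interleaved with the text to be spoken. The snippet below shows an example prompt and an illustrative regex split of tags versus spoken text — the model consumes the string as-is; this parsing is for demonstration only and is not the model's actual tokenizer:

```python
import re

# Example S2 Pro prompt: free-form [tag] instructions embedded in text.
text = "[whisper] Don't wake the baby. [short pause] [excited tone] She's here!"

# Illustrative split: bracketed control tags vs. the words to be spoken.
tags = re.findall(r"\[([^\]]+)\]", text)
spoken = re.sub(r"\[[^\]]+\]\s*", "", text).strip()

print(tags)    # ['whisper', 'short pause', 'excited tone']
print(spoken)  # Don't wake the baby. She's here!
```

Because the tags are free-form descriptions rather than a closed enum, anything readable as a natural-language instruction (e.g. `[professional broadcast tone]`) fits the same `[...]` slot.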
## Supported Languages

S2 Pro supports 80+ languages.

**Tier 1:** Japanese (ja), English (en), Chinese (zh)

**Tier 2:** Korean (ko), Spanish (es), Portuguese (pt), Arabic (ar), Russian (ru), French (fr), German (de)

**Other supported languages:** sv, it, tr, no, nl, cy, eu, ca, da, gl, ta, hu, fi, pl, et, hi, la, ur, th, vi, jw, bn, yo, sl, cs, sw, nn, he, ms, uk, id, kk, bg, lv, my, tl, sk, ne, fa, af, el, bo, hr, ro, sn, mi, yi, am, be, km, is, az, sd, br, sq, ps, mn, ht, ml, sr, sa, te, ka, bs, pa, lt, kn, si, hy, mr, as, gu, fo, and more.

## Production Streaming Performance

On a single NVIDIA H200 GPU:

- **Real-Time Factor (RTF):** 0.195
- **Time-to-first-audio:** ~100 ms
- **Throughput:** 3,000+ acoustic tokens/s while maintaining RTF below 0.5

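To put the numbers above in context: RTF is wall-clock generation time divided by the duration of the audio produced, so values below 1.0 are faster than real time. A quick back-of-the-envelope check, using only figures stated above (the per-frame token count follows from the 10-codebook, ~21 Hz codec):

```python
def rtf(generation_seconds, audio_seconds):
    """Real-Time Factor: wall-clock time / audio duration (lower is better)."""
    return generation_seconds / audio_seconds

# At RTF 0.195, a 60 s clip takes about 11.7 s to generate.
gen_time = 0.195 * 60  # ~11.7 s

# The ~21 Hz codec emits 10 tokens per frame, i.e. ~210 acoustic tokens
# per second of audio. So 3,000+ tokens/s of aggregate throughput is
# roughly 14 s of audio generated per wall-clock second across a batch.
audio_per_wall_second = 3000 / (21 * 10)  # ~14.3
```

Note the throughput figure is an aggregate across concurrent streams (hence "while maintaining RTF below 0.5" per stream), not a single-stream speed.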
## Links

- [Fish Speech GitHub](https://github.com/fishaudio/fish-speech)
- [Fish Audio Playground](https://fish.audio)
- [Blog & Tech Report](https://fish.audio/blog/fish-audio-open-sources-s2/)

## License

This model is licensed under the [Fish Audio Research License](LICENSE.md). Research and non-commercial use are permitted free of charge. Commercial use requires a separate license from Fish Audio — contact business@fish.audio.