Duplicate from fishaudio/s2-pro

tags:
  - text-to-speech
license: other
license_name: fish-audio-research-license
license_link: LICENSE.md
language:
  - zh
  - en
  - ja
  - ko
  - es
  - pt
  - ar
  - ru
  - fr
  - de
  - sv
  - it
  - tr
  - 'no'
  - nl
  - cy
  - eu
  - ca
  - da
  - gl
  - ta
  - hu
  - fi
  - pl
  - et
  - hi
  - la
  - ur
  - th
  - vi
  - jw
  - bn
  - yo
  - sl
  - cs
  - sw
  - nn
  - he
  - ms
  - uk
  - id
  - kk
  - bg
  - lv
  - my
  - tl
  - sk
  - ne
  - fa
  - af
  - el
  - bo
  - hr
  - ro
  - sn
  - mi
  - yi
  - am
  - be
  - km
  - is
  - az
  - sd
  - br
  - sq
  - ps
  - mn
  - ht
  - ml
  - sr
  - sa
  - te
  - ka
  - bs
  - pa
  - lt
  - kn
  - si
  - hy
  - mr
  - as
  - gu
  - fo
pipeline_tag: text-to-speech
inference: false
extra_gated_prompt: >-
  You agree not to use the model to generate content that violates the DMCA or
  local laws.
extra_gated_fields:
  Country: country
  Specific date: date_picker
  I agree to use this model for non-commercial use ONLY: checkbox

Fish Audio S2 Pro

Fish Audio S2 Pro is a leading text-to-speech (TTS) model with fine-grained inline control of prosody and emotion. Trained on more than 10 million hours of audio across 80+ languages, the system combines reinforcement learning alignment with a dual-autoregressive architecture. The release includes model weights, fine-tuning code, and an SGLang-based streaming inference engine.

Architecture

S2 Pro pairs a decoder-only transformer with an RVQ-based audio codec (10 codebooks, ~21 Hz frame rate) in a Dual-Autoregressive (Dual-AR) architecture with two components:

Slow AR (4B parameters): Operates along the time axis and predicts the primary semantic codebook.
Fast AR (400M parameters): Generates the remaining 9 residual codebooks at each time step, reconstructing fine-grained acoustic detail.

This asymmetric design keeps inference efficient while preserving audio fidelity. Because the Dual-AR architecture is structurally isomorphic to standard autoregressive LLMs, it inherits all LLM-native serving optimizations from SGLang — including continuous batching, paged KV cache, CUDA graph replay, and RadixAttention-based prefix caching.
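The two-level generation order described above can be sketched as a nested loop. This is a toy illustration only, not the released implementation: `toy_slow_ar` and `toy_fast_ar` are hypothetical stand-ins that return random tokens, and `CODEBOOK_SIZE` is an assumed value that the model card does not state. Only the codebook count (10) and the frame rate (~21 Hz) come from the text.

```python
import random
from typing import List

NUM_CODEBOOKS = 10    # from the text: 1 semantic + 9 residual codebooks
CODEBOOK_SIZE = 1024  # assumption: the per-codebook vocabulary size is not given


def toy_slow_ar(history: List[List[int]], rng: random.Random) -> int:
    """Hypothetical Slow AR: picks the semantic token for the next frame.

    A real model would condition on the text prompt and on `history`.
    """
    return rng.randrange(CODEBOOK_SIZE)


def toy_fast_ar(partial_frame: List[int], rng: random.Random) -> int:
    """Hypothetical Fast AR: picks the next residual token within one frame.

    A real model would condition on the tokens already in `partial_frame`.
    """
    return rng.randrange(CODEBOOK_SIZE)


def generate_frames(num_frames: int, seed: int = 0) -> List[List[int]]:
    """Generate `num_frames` frames, each holding NUM_CODEBOOKS tokens."""
    rng = random.Random(seed)
    frames: List[List[int]] = []
    for _ in range(num_frames):
        # Outer loop: one Slow AR step per time step (semantic codebook).
        frame = [toy_slow_ar(frames, rng)]
        # Inner loop: the Fast AR fills in the 9 residual codebooks.
        for _ in range(NUM_CODEBOOKS - 1):
            frame.append(toy_fast_ar(frame, rng))
        frames.append(frame)
    return frames


# About one second of audio at the ~21 Hz frame rate quoted above.
frames = generate_frames(num_frames=21)
print(len(frames), len(frames[0]))
```

The outer loop runs once per frame and the inner loop nine times per frame, which is why the small Fast AR can handle most of the token count while the large Slow AR runs only at the frame rate.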

Fine-Grained Inline Control

S2 Pro enables localized control over speech generation by embedding natural-language instructions directly within the text using [tag] syntax. Rather than relying on a fixed set of predefined tags, S2 Pro accepts free-form textual descriptions — such as [whisper in small voice], [professional broadcast tone], or [pitch up] — allowing open-ended expression control at the word level.

Common tags (15,000+ unique tags supported):

[pause] [emphasis] [laughing] [inhale] [chuckle] [tsk] [singing] [excited] [laughing tone] [interrupting] [chuckling] [excited tone] [volume up] [echo] [angry] [low volume] [sigh] [low voice] [whisper] [screaming] [shouting] [loud] [surprised] [short pause] [exhale] [delight] [panting] [audience laughter] [with strong accent] [volume down] [clearing throat] [sad] [moaning] [shocked]
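As an illustration of the [tag] syntax, the sketch below splits an input string into plain-text and tag segments on the caller side, for example to inspect or validate a prompt before sending it. It is an assumption-laden helper, not part of the Fish Audio tooling: the function name `split_inline_tags` and the regular expression are hypothetical, and the model itself consumes the raw tagged text.

```python
import re
from typing import List, Tuple

# Matches one bracketed tag with no nested brackets, e.g. "[pitch up]".
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def split_inline_tags(text: str) -> List[Tuple[str, str]]:
    """Split text into ("tag", ...) and ("text", ...) segments, in order."""
    segments: List[Tuple[str, str]] = []
    pos = 0
    for match in TAG_PATTERN.finditer(text):
        plain = text[pos:match.start()].strip()
        if plain:
            segments.append(("text", plain))
        segments.append(("tag", match.group(1).strip()))
        pos = match.end()
    tail = text[pos:].strip()
    if tail:
        segments.append(("text", tail))
    return segments


example = "[whisper] Don't tell anyone. [pause] [excited] We won!"
for kind, value in split_inline_tags(example):
    print(f"{kind:>4}: {value}")
```

Because tags are free-form descriptions rather than a fixed vocabulary, the helper does not check tag names against a list; it only separates them from the spoken text.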

Supported Languages

S2 Pro supports 80+ languages.

Tier 1: Japanese (ja), English (en), Chinese (zh)

Tier 2: Korean (ko), Spanish (es), Portuguese (pt), Arabic (ar), Russian (ru), French (fr), German (de)

Other supported languages: sv, it, tr, no, nl, cy, eu, ca, da, gl, ta, hu, fi, pl, et, hi, la, ur, th, vi, jw, bn, yo, sl, cs, sw, nn, he, ms, uk, id, kk, bg, lv, my, tl, sk, ne, fa, af, el, bo, hr, ro, sn, mi, yi, am, be, km, is, az, sd, br, sq, ps, mn, ht, ml, sr, sa, te, ka, bs, pa, lt, kn, si, hy, mr, as, gu, fo, and more.

Production Streaming Performance

On a single NVIDIA H200 GPU:

Real-Time Factor (RTF): 0.195
Time-to-first-audio: ~100 ms
Throughput: 3,000+ acoustic tokens/s while maintaining RTF below 0.5
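To make these figures concrete, the short calculation below relates them to each other. It rests on two assumptions that the model card does not confirm: that RTF is defined as synthesis time divided by audio duration, and that "acoustic tokens" counts one token per codebook per frame (10 codebooks at ~21 Hz, so about 210 tokens per second of audio). If the throughput figure counts tokens differently, the last result changes accordingly.

```python
FRAME_RATE_HZ = 21               # from the Architecture section (~21 Hz)
NUM_CODEBOOKS = 10               # from the Architecture section
RTF = 0.195                      # reported real-time factor
THROUGHPUT_TOKENS_PER_S = 3000   # reported lower bound on throughput


def synthesis_time_s(audio_seconds: float, rtf: float = RTF) -> float:
    """Wall-clock time to synthesize a clip, assuming RTF = time / duration."""
    return audio_seconds * rtf


def tokens_per_audio_second() -> int:
    """Tokens per second of audio, assuming every codebook token is counted."""
    return FRAME_RATE_HZ * NUM_CODEBOOKS


def audio_seconds_per_wall_second() -> float:
    """Aggregate seconds of audio produced per wall-clock second."""
    return THROUGHPUT_TOKENS_PER_S / tokens_per_audio_second()


print(f"60 s clip takes about {synthesis_time_s(60):.1f} s to synthesize")
print(f"{tokens_per_audio_second()} tokens per second of audio")
print(f"about {audio_seconds_per_wall_second():.1f} s of audio per wall-clock second")
```

Under these assumptions, a one-minute clip takes roughly 12 seconds to synthesize for a single stream, and the throughput figure corresponds to roughly 14 seconds of audio per wall-clock second summed across batched requests.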

License

This model is licensed under the Fish Audio Research License. Research and non-commercial use is permitted free of charge. Commercial use requires a separate license from Fish Audio — contact business@fish.audio.
