lengyue233 nielsr HF Staff committed on
Commit 1de9996 · 1 Parent(s): 3f07fe8

Add link to technical report and GitHub, fix citation, and refine metadata (#4)

- Add link to technical report and GitHub, fix citation, and refine metadata (a1a729b263c88654dad6d50601f269613ab40717)


Co-authored-by: Niels Rogge <nielsr@users.noreply.huggingface.co>

Files changed (1)
  1. README.md +15 -12
README.md CHANGED
@@ -1,9 +1,4 @@
 ---
-tags:
-- text-to-speech
-license: other
-license_name: fish-audio-research-license
-license_link: LICENSE.md
 language:
 - zh
 - en
@@ -18,7 +13,7 @@ language:
 - sv
 - it
 - tr
-- "no"
+- 'no'
 - nl
 - cy
 - eu
@@ -88,22 +83,29 @@ language:
 - as
 - gu
 - fo
+license: other
+license_name: fish-audio-research-license
+license_link: LICENSE.md
 pipeline_tag: text-to-speech
+tags:
+- text-to-speech
+- instruction-following
+- multilingual
 inference: false
-extra_gated_prompt: >-
-  You agree to not use the model to generate contents that violate DMCA or local
-  laws.
+extra_gated_prompt: You agree to not use the model to generate contents that violate
+  DMCA or local laws.
 extra_gated_fields:
   Country: country
   Specific date: date_picker
   I agree to use this model for non-commercial use ONLY: checkbox
 ---

-
 # Fish Audio S2 Pro

 <img src="overview.png" alt="Fish Audio S2 Pro overview — fine-grained control, multi-speaker multi-turn generation, low-latency streaming, and long-context inference." width="100%">

+[**Technical Report**](https://huggingface.co/papers/2603.08823) | [**GitHub**](https://github.com/fishaudio/fish-speech) | [**Playground**](https://fish.audio)
+
 **Fish Audio S2 Pro** is a leading text-to-speech (TTS) model with fine-grained inline control of prosody and emotion. Trained on over 10M+ hours of audio data across 80+ languages, the system combines reinforcement learning alignment with a dual-autoregressive architecture. The release includes model weights, fine-tuning code, and an SGLang-based streaming inference engine.

 ## Architecture
@@ -131,7 +133,7 @@ S2 Pro supports 80+ languages.

 **Tier 2:** Korean (ko), Spanish (es), Portuguese (pt), Arabic (ar), Russian (ru), French (fr), German (de)

-**Other supported languages:** sv, it, tr, no, nl, cy, eu, ca, da, gl, ta, hu, fi, pl, et, hi, la, ur, th, vi, jw, bn, yo, sl, cs, sw, nn, he, ms, uk, id, kk, bg, lv, my, tl, sk, ne, fa, af, el, bo, hr, ro, sn, mi, yi, am, be, km, is, az, sd, br, sq, ps, mn, ht, ml, sr, sa, te, ka, bs, pa, lt, kn, si, hy, mr, as, gu, fo, and more.
+**Other supported languages:** sv, it, tr, no, nl, cy, eu, ca, da, gl, ta, hu, fi, pl, et, hi, la, ur, th, vi, jw, bn, yo, sl, cs, sw, nn, he, ms, uk, id, kk, bg, lv, my, tl, sk, ne, fa, af, el, bo, hr, ro, sn, mi, yi, am, be, km, is, az, sd, br, sq, ps, mn, ht, ml, sr, sa, te, ka, bs, pa, lt, kn, si, hy, mr, as, gu, fo, and more.

 ## Production Streaming Performance

@@ -160,8 +162,9 @@ If you find our work useful, please consider citing our report:
   archivePrefix={arXiv},
   primaryClass={cs.SD},
   url={https://arxiv.org/abs/2603.08823},
+}
 ```

 ## License

-This model is licensed under the [Fish Audio Research License](LICENSE.md). Research and non-commercial use is permitted free of charge. Commercial use requires a separate license from Fish Audio — contact business@fish.audio.
+This model is licensed under the [Fish Audio Research License](LICENSE.md). Research and non-commercial use is permitted free of charge. Commercial use requires a separate license from Fish Audio — contact business@fish.audio.
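
The citation fix in this commit adds the closing `}` that the BibTeX entry was missing. A minimal sketch of how such an omission could be caught before committing is shown below. The helper name `unbalanced_braces` and the entry key `placeholder_key` are hypothetical and not part of the repository; only the three fields visible in the diff are reused, and escaped braces such as `\{` are deliberately ignored for simplicity.

```python
def unbalanced_braces(entry: str) -> int:
    """Return the number of '{' left unclosed in a BibTeX entry (0 means balanced).

    Simplification: every brace is counted, including escaped ones like '\\{'.
    """
    depth = 0
    for ch in entry:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:
                raise ValueError("closing brace without a matching opening brace")
    return depth


# The three fields shown as context in the diff; the entry key is a placeholder.
FIELDS = (
    "  archivePrefix={arXiv},\n"
    "  primaryClass={cs.SD},\n"
    "  url={https://arxiv.org/abs/2603.08823},\n"
)
before = "@misc{placeholder_key,\n" + FIELDS   # entry as it stood before the commit
after = before + "}\n"                         # entry with the added closing brace

print(unbalanced_braces(before))  # 1: the entry is never closed
print(unbalanced_braces(after))   # 0: balanced
```

A check like this is only a rough guard; a full BibTeX parser would also validate field syntax and entry types.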