Commit 1de9996
Parent(s): 3f07fe8
Add link to technical report and GitHub, fix citation, and refine metadata (#4)
- Add link to technical report and GitHub, fix citation, and refine metadata (a1a729b263c88654dad6d50601f269613ab40717)
Co-authored-by: Niels Rogge <nielsr@users.noreply.huggingface.co>
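One of the metadata changes in the diff below lists the Norwegian language code as `- 'no'`. A plausible reason for quoting it (the commit message does not state one) is that YAML 1.1 loaders such as PyYAML resolve a bare `no` to the boolean `false`. The sketch below is a simplified model of that single rule, not a YAML parser; `resolve_scalar` and `parse_language_list` are hypothetical helper names used only for illustration.

```python
# Simplified model of YAML 1.1 boolean resolution for single scalars.
# Assumption: only the yes/no/true/false/on/off words are modelled;
# numbers, nulls, and every other YAML feature are ignored.

YAML11_BOOL_WORDS = {
    "yes": True, "no": False,
    "true": True, "false": False,
    "on": True, "off": False,
}


def resolve_scalar(token: str):
    """Return the value a YAML 1.1-style loader would assign to one scalar."""
    token = token.strip()
    # Quoted scalars are always strings, whatever they contain.
    if len(token) >= 2 and token[0] == token[-1] and token[0] in ("'", '"'):
        return token[1:-1]
    # Plain scalars match the boolean words in lower, Title, or UPPER case only.
    spelled_like_bool = token in (token.lower(), token.capitalize(), token.upper())
    if spelled_like_bool and token.lower() in YAML11_BOOL_WORDS:
        return YAML11_BOOL_WORDS[token.lower()]
    return token


def parse_language_list(lines):
    """Resolve each '- item' line of a flat YAML list (hypothetical helper)."""
    items = []
    for line in lines:
        stripped = line.strip()
        if stripped.startswith("- "):
            items.append(resolve_scalar(stripped[2:]))
    return items


print(parse_language_list(["- tr", "- no", "- nl"]))    # ['tr', False, 'nl']
print(parse_language_list(["- tr", "- 'no'", "- nl"]))  # ['tr', 'no', 'nl']
```

With the quotes in place, the entry stays the string `no` instead of turning into a boolean in the language list.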
README.md CHANGED
@@ -1,9 +1,4 @@
 ---
-tags:
-- text-to-speech
-license: other
-license_name: fish-audio-research-license
-license_link: LICENSE.md
 language:
 - zh
 - en
@@ -18,7 +13,7 @@ language:
 - sv
 - it
 - tr
--
+- 'no'
 - nl
 - cy
 - eu
@@ -88,22 +83,29 @@ language:
 - as
 - gu
 - fo
+license: other
+license_name: fish-audio-research-license
+license_link: LICENSE.md
 pipeline_tag: text-to-speech
+tags:
+- text-to-speech
+- instruction-following
+- multilingual
 inference: false
-extra_gated_prompt:
-
-laws.
+extra_gated_prompt: You agree to not use the model to generate contents that violate
+  DMCA or local laws.
 extra_gated_fields:
   Country: country
   Specific date: date_picker
   I agree to use this model for non-commercial use ONLY: checkbox
 ---
 
-
 # Fish Audio S2 Pro
 
 <img src="overview.png" alt="Fish Audio S2 Pro overview — fine-grained control, multi-speaker multi-turn generation, low-latency streaming, and long-context inference." width="100%">
 
+[**Technical Report**](https://huggingface.co/papers/2603.08823) | [**GitHub**](https://github.com/fishaudio/fish-speech) | [**Playground**](https://fish.audio)
+
 **Fish Audio S2 Pro** is a leading text-to-speech (TTS) model with fine-grained inline control of prosody and emotion. Trained on over 10M+ hours of audio data across 80+ languages, the system combines reinforcement learning alignment with a dual-autoregressive architecture. The release includes model weights, fine-tuning code, and an SGLang-based streaming inference engine.
 
 ## Architecture
@@ -131,7 +133,7 @@ S2 Pro supports 80+ languages.
 
 **Tier 2:** Korean (ko), Spanish (es), Portuguese (pt), Arabic (ar), Russian (ru), French (fr), German (de)
 
-**Other supported languages:** sv, it, tr, no, nl, cy, eu, ca, da, gl, ta, hu, fi, pl, et, hi, la, ur, th, vi, jw, bn, yo,
+**Other supported languages:** sv, it, tr, no, nl, cy, eu, ca, da, gl, ta, hu, fi, pl, et, hi, la, ur, th, vi, jw, bn, yo, xsl, cs, sw, nn, he, ms, uk, id, kk, bg, lv, my, tl, sk, ne, fa, af, el, bo, hr, ro, sn, mi, yi, am, be, km, is, az, sd, br, sq, ps, mn, ht, ml, sr, sa, te, ka, bs, pa, lt, kn, si, hy, mr, as, gu, fo, and more.
 
 ## Production Streaming Performance
 
@@ -160,8 +162,9 @@ If you find our work useful, please consider citing our report:
 archivePrefix={arXiv},
 primaryClass={cs.SD},
 url={https://arxiv.org/abs/2603.08823},
+}
 ```
 
 ## License
 
-This model is licensed under the [Fish Audio Research License](LICENSE.md). Research and non-commercial use is permitted free of charge. Commercial use requires a separate license from Fish Audio — contact business@fish.audio.
+This model is licensed under the [Fish Audio Research License](LICENSE.md). Research and non-commercial use is permitted free of charge. Commercial use requires a separate license from Fish Audio — contact business@fish.audio.
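The citation fix in the last hunk adds the closing `}` of the BibTeX entry. As a rough illustration of how such an omission could be caught, the sketch below counts unmatched braces. The entry key and title are placeholders, since the diff shows only the last three fields; `brace_balance` is a hypothetical helper, and it does not handle escaped braces such as `\{`.

```python
def brace_balance(text: str) -> int:
    """Return the net count of '{' minus '}' in text (0 means balanced)."""
    depth = 0
    for ch in text:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
    return depth


# Placeholder entry: only archivePrefix, primaryClass, and url appear in the diff.
before = (
    "@misc{example_key,\n"
    "  title={Placeholder Title},\n"
    "  archivePrefix={arXiv},\n"
    "  primaryClass={cs.SD},\n"
    "  url={https://arxiv.org/abs/2603.08823},\n"
)
after = before + "}\n"

print(brace_balance(before))  # 1: the entry is never closed
print(brace_balance(after))   # 0: balanced once the closing brace is added
```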