Text-to-Speech
Safetensors
MLX
mlx-audio
fish_qwen3_omni
speech
speech generation
voice cloning
tts
8-bit precision
Create README.md
README.md
ADDED
@@ -0,0 +1,181 @@
---
library_name: mlx-audio
tags:
- mlx
- text-to-speech
- speech
- speech generation
- voice cloning
- tts
- mlx-audio
license: other
license_name: fish-audio-research
license_link: https://huggingface.co/fishaudio/s2-pro/blob/main/LICENSE
language:
- en
- zh
- ja
- ko
- es
- pt
- ar
- ru
- fr
- de
- sv
- it
- tr
- "no"
- nl
- cy
- eu
- ca
- da
- gl
- ta
- hu
- fi
- pl
- et
- hi
- la
- ur
- th
- vi
- jv
- bn
- yo
- cs
- sw
- he
- ms
- uk
- id
- kk
- bg
- lv
- my
- tl
- sk
- ne
- fa
- af
- el
- bo
- hr
- ro
- sn
- mi
- yi
- am
- be
- km
- is
- az
- sd
- br
- sq
- ps
- mn
- ht
- ml
- sr
- sa
- te
- kn
- si
- hy
- mr
- as
- gu
- fo
pipeline_tag: text-to-speech
base_model: fishaudio/s2-pro
---

# mlx-community/fish-audio-s2-pro-8bit

This model was converted to MLX format from [`fishaudio/s2-pro`](https://huggingface.co/fishaudio/s2-pro) using mlx-audio version **0.4.0**.

Refer to the [original model card](https://huggingface.co/fishaudio/s2-pro) for more details on the model.

## Model Overview

Fish Audio S2 Pro is a leading text-to-speech model with fine-grained inline control of prosody and emotion. Trained on **10M+ hours** of audio data across **80+ languages**, it combines reinforcement learning alignment with a Dual-Autoregressive architecture.

### Architecture

| Attribute | Value |
|-----------|-------|
| Total Parameters | 5B |
| Slow AR | 4B (time-axis, primary semantic codebook) |
| Fast AR | 400M (residual codebooks per time step) |
| Audio Codec | 10 codebooks @ ~21 Hz frame rate |
| Tensor Type | BF16 |

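
To make the codec figures concrete, the sketch below estimates how many audio codes a clip of a given length corresponds to, using the table's values of 10 codebooks at roughly 21 Hz. The function name `estimate_codec_tokens` is hypothetical, and the split into one Slow AR step per frame plus nine residual codes from the Fast AR is an assumed reading of the table, not something taken from mlx-audio.

```python
import math

FRAME_RATE_HZ = 21   # approximate frame rate from the table above
NUM_CODEBOOKS = 10   # codebooks per frame from the table above


def estimate_codec_tokens(duration_s: float) -> dict:
    """Rough code counts for a clip of `duration_s` seconds (illustrative only)."""
    frames = math.ceil(duration_s * FRAME_RATE_HZ)
    return {
        # Assumption: the Slow AR emits one primary code per frame.
        "frames": frames,
        # Assumption: the Fast AR fills in the remaining codebooks per frame.
        "residual_codes": frames * (NUM_CODEBOOKS - 1),
        "total_codes": frames * NUM_CODEBOOKS,
    }


print(estimate_codec_tokens(10.0))
```

Under these assumptions, a 10-second clip is about 210 frames, or 2,100 codes in total.
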
### Fine-Grained Inline Control

Control speech generation at specific points in the text using `[tag]` syntax with free-form textual descriptions (15,000+ supported tags):

```
[whisper in small voice]
[professional broadcast tone]
[pitch up]
```

**Common Tags:**
`[pause]` `[emphasis]` `[laughing]` `[inhale]` `[chuckle]` `[tsk]` `[singing]` `[excited]` `[volume up]` `[echo]` `[angry]` `[whisper]` `[screaming]` `[sad]` `[shocked]` and many more.

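
Because the tags are plain bracketed text embedded in the prompt, they can be composed and inspected with ordinary string handling. The following is a minimal sketch: `with_tag` and `find_tags` are hypothetical helper names, placing a tag directly before the text it should affect is an assumption, and nothing here calls mlx-audio or checks that the model recognizes a given tag.

```python
import re

# Matches any bracketed inline tag such as [pause] or [whisper in small voice].
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def with_tag(tag: str, text: str) -> str:
    """Prefix `text` with an inline control tag written in [tag] syntax."""
    return f"[{tag.strip()}] {text}"


def find_tags(prompt: str) -> list[str]:
    """Return the tag descriptions found in `prompt`, in order of appearance."""
    return [m.strip() for m in TAG_PATTERN.findall(prompt)]


prompt = (
    with_tag("whisper", "Keep this between us.")
    + " [pause] "
    + with_tag("excited", "We won!")
)
print(prompt)
print(find_tags(prompt))
```

The resulting `prompt` string can then be passed as the `text` argument when generating audio.
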
### Supported Languages

- **Tier 1 (Full Support):** Japanese, English, Chinese
- **Tier 2 (Strong Support):** Korean, Spanish, Portuguese, Arabic, Russian, French, German
- **Additional:** 70+ more languages

## Use with mlx-audio

```bash
pip install -U mlx-audio
```

### CLI Example:
```bash
python -m mlx_audio.tts.generate --model mlx-community/fish-audio-s2-pro-8bit --text "Hello, this is a test."
```

### Python Example:
```python
from mlx_audio.tts.utils import load_model
from mlx_audio.tts.generate import generate_audio

# Load the 8-bit MLX model by its Hub repository ID.
model = load_model("mlx-community/fish-audio-s2-pro-8bit")

# Synthesize speech for the given text.
generate_audio(
    model=model,
    text="Hello, this is a test.",
    ref_audio="path_to_audio.wav",  # placeholder: path to a reference clip for voice cloning
    file_prefix="test_audio",  # prefix for the generated audio file name
)
```

## Citation

```bibtex
@misc{liao2026fishaudios2technical,
      title={Fish Audio S2 Technical Report},
      author={Shijia Liao and Yuxuan Wang and Songting Liu and Yifan Cheng and Ruoyi Zhang and Tianyu Li and Shidong Li and Yisheng Zheng and Xingwei Liu and Qingzheng Wang and Zhizhuo Zhou and Jiahua Liu and Xin Chen and Dawei Han},
      year={2026},
      eprint={2603.08823},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2603.08823},
}
```

## License

This model is released under the **Fish Audio Research License**:
- Research use: Free
- Non-commercial use: Free
- Commercial use: Requires separate license from Fish Audio (contact: business@fish.audio)

See the [original model](https://huggingface.co/fishaudio/s2-pro) for full license details.