---
license: agpl-3.0
base_model:
- H5N1AIDS/F5-TTS-ONNX
tags:
- rknn
---

# F5-TTS-RKNN2

## (English README below)

在RK3588上运行超高质量的F5-TTS文字转语音/零样本音色克隆模型!

- 推理速度(RK3588, 生成9秒音频): 每次迭代用时11s, 迭代32步, 总用时352s
- 内存占用(RK3588): 2.2GB

## 使用方法

1. 克隆或者下载此仓库到本地。模型较大，请确保有足够的磁盘空间。

2. 安装依赖

```bash
pip install "numpy<2" rknn-toolkit-lite2 jieba torch onnxruntime soundfile pydub pypinyin tqdm
```

3. 运行

```bash
python F5-TTS-ONNX-Inference-rknn2.py
```

你可以修改`F5-TTS-ONNX-Inference-rknn2.py`中的文本等参数来生成不同的音频。

## 模型转换

1. 下载ONNX模型文件

2. 安装依赖

```bash
pip install "numpy<2" rknn-toolkit2==2.3.0 onnx onnxruntime
```

3. 转换模型

```bash
python convert_opset.py
python convert_F5_Transformer_opset19.py
```

## 已知问题

- 由于RKNN2不支持动态输入，这里把序列长度固定为了1536，并通过缩放音频速度来补偿。在差距不大的情况下效果可以接受。
- DiT中的RoPE位置编码部分有一个Transpose操作无法在NPU上运行，造成推理速度下降约15%。这个问题应该可以通过修改原模型来解决，但我懒得改了，因为改完之后推理还是会非常慢，序列长度实在太长了。
- 只有DiT部分使用了NPU，其他部分都是CPU推理。但其他部分运行速度快，总体上不会对推理速度有太大影响。

## 参考

- [F5-TTS-ONNX](https://github.com/DakeQQ/F5-TTS-ONNX)
- [F5-TTS](https://github.com/SWivid/F5-TTS)

## English README

Run the ultra-high-quality F5-TTS text-to-speech / zero-shot voice cloning model on RK3588!

- Inference speed (RK3588, generating 9 seconds of audio): 11s per iteration, 32 iterations, total time ~352s
- Memory usage (RK3588): 2.2GB

## Usage

1. Clone or download this repository locally. The models are large, so make sure you have sufficient disk space.

2. Install dependencies:

```bash
pip install "numpy<2" rknn-toolkit-lite2 jieba torch onnxruntime soundfile pydub pypinyin tqdm
```

3. Run:

```bash
python F5-TTS-ONNX-Inference-rknn2.py
```

You can modify parameters such as the text within `F5-TTS-ONNX-Inference-rknn2.py` to generate different audio.
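As an illustration of one parameter worth understanding when editing the script, F5-TTS-style pipelines commonly derive the output length from the reference audio and the ratio of generated-text length to reference-text length. The sketch below is a simplified model of that heuristic; the function name and arguments are assumptions for illustration, not the script's actual API.

```python
# Sketch of a text-length-proportional duration heuristic (illustrative only):
# the generated segment is assumed to need roughly the same frames-per-character
# rate as the reference audio.
def estimate_total_frames(ref_frames: int, ref_text: str, gen_text: str,
                          speed: float = 1.0) -> int:
    """Total mel frames: reference frames plus a text-length-scaled estimate
    for the generated text; `speed` > 1 shortens the result."""
    per_char = ref_frames / max(len(ref_text), 1)
    return ref_frames + int(per_char * len(gen_text) / speed)
```

With a 400-frame, 2-character reference and 4 characters of new text, this predicts 400 + 800 = 1200 frames; a fixed-length model such as this one must then compensate by scaling audio speed (see Known Issues).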

## Model Conversion

1. Download the ONNX model files.

2. Install dependencies:

```bash
pip install "numpy<2" rknn-toolkit2==2.3.0 onnx onnxruntime
```

3. Convert the models:

```bash
python convert_opset.py
python convert_F5_Transformer_opset19.py
```
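The conversion scripts above are part of this repo, but for readers adapting them to other models, a minimal conversion step with rknn-toolkit2 looks roughly like the sketch below. The `convert` helper, its filenames, and its error handling are assumptions; only the `RKNN` API calls (`config`, `load_onnx`, `build`, `export_rknn`) come from rknn-toolkit2 itself.

```python
from pathlib import Path

def rknn_path_for(onnx_path: str) -> str:
    # Derive the .rknn output name from the .onnx input name.
    return str(Path(onnx_path).with_suffix(".rknn"))

def convert(onnx_path: str, target: str = "rk3588") -> str:
    # rknn-toolkit2 runs on an x86 Linux host, not on the board itself.
    from rknn.api import RKNN

    rknn = RKNN()
    rknn.config(target_platform=target)
    if rknn.load_onnx(model=onnx_path) != 0:
        raise RuntimeError("load_onnx failed")
    # Build in float (no INT8 quantization) to preserve audio quality.
    if rknn.build(do_quantization=False) != 0:
        raise RuntimeError("build failed")
    out = rknn_path_for(onnx_path)
    if rknn.export_rknn(out) != 0:
        raise RuntimeError("export_rknn failed")
    return out
```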

## Known Issues

- RKNN2 does not support dynamic input shapes, so the sequence length is fixed at 1536 and audio speed scaling is used to compensate. Quality is acceptable as long as the natural output length is not far from the fixed length.
- A `Transpose` operation in the RoPE (Rotary Positional Embedding) part of the DiT (Diffusion Transformer) cannot run on the NPU, causing a roughly 15% drop in inference speed. This could probably be fixed by modifying the original model, but I chose not to: inference would still be very slow, since the sequence length is simply too long.
- Only the DiT runs on the NPU; everything else runs on the CPU. Those parts are fast, however, and do not significantly affect overall inference speed.
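The speed-scaling compensation in the first item can be pictured as follows. The hop size and sample rate here (256 / 24000) are typical mel-spectrogram values chosen for illustration, not necessarily the ones this repo uses.

```python
FIXED_LEN = 1536           # mel frames, fixed because RKNN2 lacks dynamic shapes
HOP_SECONDS = 256 / 24000  # seconds of audio per mel frame (illustrative values)

def speed_factor(target_seconds: float) -> float:
    """Factor by which the fixed-length output must be sped up (>1) or
    slowed down (<1) to match the requested duration."""
    fixed_seconds = FIXED_LEN * HOP_SECONDS  # 16.384 s at these settings
    return fixed_seconds / target_seconds
```

At these assumed settings, generating 9 seconds of audio would mean speeding playback up by about 1.82x; the closer the target duration is to the fixed length's natural duration, the smaller the distortion.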

## References

- [F5-TTS-ONNX](https://github.com/DakeQQ/F5-TTS-ONNX)
- [F5-TTS](https://github.com/SWivid/F5-TTS)