Commit 4bf21ee (verified) by happyme531, parent 119f8ea: Update README.md
---
license: agpl-3.0
base_model:
- H5N1AIDS/F5-TTS-ONNX
tags:
- rknn
---

# F5-TTS-RKNN2

Run the ultra-high-quality F5-TTS text-to-speech / zero-shot voice-cloning model on the RK3588!

- Inference speed (RK3588, generating 9 seconds of audio): 11 s per iteration × 32 iterations, ~352 s total
- Memory usage (RK3588): 2.2 GB

## Usage

1. Clone or download this repository. The models are large, so make sure you have enough disk space.

2. Install the dependencies:

```bash
pip install "numpy<2" rknn-toolkit-lite2 jieba torch onnxruntime soundfile pydub pypinyin tqdm
```

3. Run the inference script:

```bash
python F5-TTS-ONNX-Inference-rknn2.py
```

You can edit parameters such as the input text in `F5-TTS-ONNX-Inference-rknn2.py` to generate different audio.
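The script's actual variable names are not documented here, so the block below is purely illustrative of the kind of parameters a zero-shot voice-cloning script exposes:

```python
# Illustrative parameter block -- the real variable names inside
# F5-TTS-ONNX-Inference-rknn2.py may differ.
reference_audio = "reference.wav"   # a few seconds of clean speech to clone
reference_text = "Transcript of the reference audio."
generated_text = "The sentence the cloned voice should speak."
output_path = "generated.wav"       # where the synthesized audio is written
```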

## Model Conversion

1. Download the ONNX model files.

2. Install the dependencies:

```bash
pip install "numpy<2" rknn-toolkit2==2.3.0 onnx onnxruntime
```

3. Convert the models:

```bash
python convert_opset.py
python convert_F5_Transformer_opset19.py
```
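The conversion scripts themselves are not reproduced here, but an ONNX-to-RKNN conversion with rknn-toolkit2 generally follows the pattern below. This is a sketch, not the repository's actual code, and the model file names are placeholders:

```python
# Sketch of a typical rknn-toolkit2 ONNX -> RKNN conversion for RK3588.
# File names are placeholders; the real scripts in this repo may differ.
from rknn.api import RKNN

rknn = RKNN(verbose=True)
rknn.config(target_platform="rk3588")                # build for the RK3588 NPU
rknn.load_onnx(model="F5_Transformer_opset19.onnx")  # placeholder file name
rknn.build(do_quantization=False)                    # keep float weights, no quantization
rknn.export_rknn("F5_Transformer.rknn")
rknn.release()
```

Running this requires the Rockchip toolkit and a real ONNX file, so it is only meant to show the shape of the API.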

## Known Issues

- RKNN2 does not support dynamic input shapes, so the sequence length is fixed at 1536 and the audio speed is scaled to compensate. The result is acceptable as long as the mismatch is small.
- A `Transpose` operation in the RoPE (rotary positional embedding) part of the DiT (Diffusion Transformer) cannot run on the NPU, costing roughly 15% of inference speed. This could probably be fixed by modifying the original model, but inference would still be very slow because the sequence length is so long.
- Only the DiT runs on the NPU; everything else runs on the CPU. Those parts are fast, however, so they barely affect the overall inference time.
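One way to picture the fixed-length limitation: assuming F5-TTS's usual 24 kHz sample rate and hop length of 256 (common defaults, not confirmed by this repository), 1536 mel frames correspond to about 16.4 s of audio, and the speed compensation can be sketched as:

```python
# Assumed F5-TTS front-end constants (not confirmed by this repository).
SAMPLE_RATE = 24000   # Hz
HOP_LENGTH = 256      # samples per mel frame
FIXED_FRAMES = 1536   # sequence length baked into the RKNN model

def frames_for(seconds: float) -> float:
    """Mel frames occupied by a clip of the given duration."""
    return seconds * SAMPLE_RATE / HOP_LENGTH

def speed_factor(ref_seconds: float, gen_seconds: float) -> float:
    """Time-stretch factor so that reference + generated speech exactly
    fill the fixed 1536-frame window (a sketch of the compensation idea)."""
    free_frames = FIXED_FRAMES - frames_for(ref_seconds)
    return frames_for(gen_seconds) / free_frames
```

A factor near 1.0 means little audible distortion; the further the requested duration is from the space left in the window after the reference clip, the more noticeable the speed change becomes.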

## References

- [F5-TTS-ONNX](https://github.com/DakeQQ/F5-TTS-ONNX)
- [F5-TTS](https://github.com/SWivid/F5-TTS)