Commit 4bf21ee (verified) by happyme531, parent 119f8ea: Update README.md
---
license: agpl-3.0
base_model:
- H5N1AIDS/F5-TTS-ONNX
tags:
- rknn
---

# F5-TTS-RKNN2

Run the ultra-high-quality F5-TTS text-to-speech / zero-shot voice-cloning model on the RK3588!

- Inference speed (RK3588, generating 9 seconds of audio): 11 s per iteration × 32 iterations, ~352 s total
- Memory usage (RK3588): 2.2 GB

## Usage

1. Clone or download this repository. The models are large, so make sure you have enough disk space.

2. Install the dependencies:

```bash
pip install "numpy<2" rknn-toolkit-lite2 jieba torch onnxruntime soundfile pydub pypinyin tqdm
```

3. Run the inference script:

```bash
python F5-TTS-ONNX-Inference-rknn2.py
```

You can edit parameters such as the input text in `F5-TTS-ONNX-Inference-rknn2.py` to generate different audio.
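The script's actual variable names are not documented here, so the block below is purely illustrative of the kind of parameters a zero-shot voice-cloning script exposes:

```python
# Illustrative parameter block -- the real variable names inside
# F5-TTS-ONNX-Inference-rknn2.py may differ.
reference_audio = "reference.wav"   # a few seconds of clean speech to clone
reference_text = "Transcript of the reference audio."
generated_text = "The sentence the cloned voice should speak."
output_path = "generated.wav"       # where the synthesized audio is written
```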

## Model Conversion

1. Download the ONNX model files.

2. Install the dependencies:

```bash
pip install "numpy<2" rknn-toolkit2==2.3.0 onnx onnxruntime
```

3. Convert the models:

```bash
python convert_opset.py
python convert_F5_Transformer_opset19.py
```
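The conversion scripts themselves are not reproduced here, but an ONNX-to-RKNN conversion with rknn-toolkit2 generally follows the pattern below. This is a sketch, not the repository's actual code, and the model file names are placeholders:

```python
# Sketch of a typical rknn-toolkit2 ONNX -> RKNN conversion for RK3588.
# File names are placeholders; the real scripts in this repo may differ.
from rknn.api import RKNN

rknn = RKNN(verbose=True)
rknn.config(target_platform="rk3588")                # build for the RK3588 NPU
rknn.load_onnx(model="F5_Transformer_opset19.onnx")  # placeholder file name
rknn.build(do_quantization=False)                    # keep float weights, no quantization
rknn.export_rknn("F5_Transformer.rknn")
rknn.release()
```

Running this requires the Rockchip toolkit and a real ONNX file, so it is only meant to show the shape of the API.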

## Known Issues

- RKNN2 does not support dynamic input shapes, so the sequence length is fixed at 1536 and the audio speed is scaled to compensate. The result is acceptable as long as the mismatch is small.
- A `Transpose` operation in the RoPE (rotary positional embedding) part of the DiT (Diffusion Transformer) cannot run on the NPU, costing roughly 15% of inference speed. This could probably be fixed by modifying the original model, but inference would still be very slow because the sequence length is so long.
- Only the DiT runs on the NPU; everything else runs on the CPU. Those parts are fast, however, so they barely affect the overall inference time.
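One way to picture the fixed-length limitation: assuming F5-TTS's usual 24 kHz sample rate and hop length of 256 (common defaults, not confirmed by this repository), 1536 mel frames correspond to about 16.4 s of audio, and the speed compensation can be sketched as:

```python
# Assumed F5-TTS front-end constants (not confirmed by this repository).
SAMPLE_RATE = 24000   # Hz
HOP_LENGTH = 256      # samples per mel frame
FIXED_FRAMES = 1536   # sequence length baked into the RKNN model

def frames_for(seconds: float) -> float:
    """Mel frames occupied by a clip of the given duration."""
    return seconds * SAMPLE_RATE / HOP_LENGTH

def speed_factor(ref_seconds: float, gen_seconds: float) -> float:
    """Time-stretch factor so that reference + generated speech exactly
    fill the fixed 1536-frame window (a sketch of the compensation idea)."""
    free_frames = FIXED_FRAMES - frames_for(ref_seconds)
    return frames_for(gen_seconds) / free_frames
```

A factor near 1.0 means little audible distortion; the further the requested duration is from the space left in the window after the reference clip, the more noticeable the speed change becomes.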

## References

- [F5-TTS-ONNX](https://github.com/DakeQQ/F5-TTS-ONNX)
- [F5-TTS](https://github.com/SWivid/F5-TTS)