bweng committed (verified)
Commit e8fa119 · 1 Parent(s): 13b8352

Update README.md

Files changed (1): README.md (+84 -0)

Based on the original kokoro model, see https://github.com/FluidInference/FluidAudio for inference

## Benchmark

We generated the same set of strings with each pipeline, producing between 1 s and ~300 s of audio, to test speed across a range of input lengths on PyTorch CPU, PyTorch MPS, and the MLX pipeline, and compared them against the native Swift version with Core ML models.

Each pipeline warmed up the model by running it once on pseudo inputs; we then compared raw inference times with the model already loaded. You can see that for the Core ML model, we traded lower memory usage and slightly faster inference for a longer initial warm-up.

Note that the PyTorch kokoro model has a known memory-leak issue: https://github.com/hexgrad/kokoro/issues/152
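
The "Peak GB" column below reports process-wide peak memory. One way to read that figure on macOS or Linux is `ru_maxrss` from `getrusage`; this is a minimal sketch of the idea, not necessarily how each harness measured it:

```python
import resource
import sys

def peak_rss_gb() -> float:
    """Peak resident set size of the current process, in GB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in bytes on macOS but in kilobytes on Linux.
    scale = 1 if sys.platform == "darwin" else 1024
    return peak * scale / 1024**3

print(f"Peak memory usage (process-wide): {peak_rss_gb():.3f} GB")
```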

The following tests were run on an M4 Pro MacBook Pro with 48 GB RAM. If you have another device, please do try replicating them as well!

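The per-test numbers in the tables are seconds of output audio, raw inference seconds (after warm-up), and RTFx, the real-time factor: audio seconds divided by inference seconds. A minimal harness sketch of that methodology, where `synthesize` is a placeholder for whichever pipeline is being timed:

```python
import time

def rtfx(audio_seconds: float, inference_seconds: float) -> float:
    """Real-time factor: seconds of audio produced per second of inference."""
    return audio_seconds / inference_seconds

def benchmark(synthesize, texts):
    """Time synthesize(text) -> seconds of audio, after one warm-up call."""
    synthesize("warm-up")  # exclude model load/compile from the timed runs
    rows = []
    for i, text in enumerate(texts, start=1):
        start = time.perf_counter()
        audio_seconds = synthesize(text)
        inference = time.perf_counter() - start
        rows.append((i, len(text), audio_seconds, inference,
                     rtfx(audio_seconds, inference)))
    return rows
```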

### Kokoro-82M PyTorch (CPU)

```bash
KPipeline benchmark for voice af_heart (warm-up took 0.175s) using hexgrad/kokoro
Test    Chars   Output (s)   Inf (s)   RTFx      Peak GB
1       42      2.750        0.187     14.737x   1.44
2       129     8.625        0.530     16.264x   1.85
3       254     15.525       0.923     16.814x   2.65
4       93      6.125        0.349     17.566x   2.66
5       104     7.200        0.410     17.567x   2.70
6       130     9.300        0.504     18.443x   2.72
7       197     12.850       0.726     17.711x   2.83
8       6       1.350        0.098     13.823x   2.83
9       1228    76.200       4.342     17.551x   3.19
10      567     35.200       2.069     17.014x   4.85
11      4615    286.525      17.041    16.814x   4.78
Total   -       461.650      27.177    16.987x   4.85
```

### Kokoro-82M PyTorch (MPS)

I wasn't able to run the MPS model for longer durations; even with `PYTORCH_ENABLE_MPS_FALLBACK=1` enabled, it kept crashing on the longer strings.

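If you want to try reproducing the MPS run, note that the fallback flag has to be set before PyTorch initializes. A hedged sketch of the setup (it did not prevent the crashes described above):

```python
import os

# Must be set before torch is imported; unsupported MPS ops then fall back to CPU.
os.environ.setdefault("PYTORCH_ENABLE_MPS_FALLBACK", "1")

try:
    import torch
    device = "mps" if torch.backends.mps.is_available() else "cpu"
except ImportError:  # torch not installed
    device = "cpu"

print(f"Running on: {device}")
```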
```bash
KPipeline benchmark for voice af_heart (warm-up took 0.568s) using pip package
Test    Chars   Output (s)   Inf (s)   RTFx      Peak GB
1       42      2.750        0.414     6.649x    1.41
2       129     8.625        0.729     11.839x   1.54
Total   -       11.375       1.142     9.960x    1.54
```

### Kokoro-82M MLX Pipeline

```bash
TTS benchmark for voice af_heart (warm-up took an extra 2.155s) using model prince-canuma/Kokoro-82M
Test    Chars   Output (s)   Inf (s)   RTFx      Peak GB
1       42      2.750        0.347     7.932x    1.12
2       129     8.650        0.597     14.497x   2.47
3       254     15.525       0.825     18.829x   2.65
4       93      6.125        0.306     20.039x   2.65
5       104     7.200        0.343     21.001x   2.65
6       130     9.300        0.560     16.611x   2.65
7       197     12.850       0.596     21.573x   2.65
8       6       1.350        0.364     3.706x    2.65
9       1228    76.200       2.979     25.583x   3.29
10      567     35.200       1.374     25.615x   3.37
11      4615    286.500      11.112    25.783x   3.37
Total   -       461.650      19.401    23.796x   3.37
```

### Swift + FluidAudio Core ML models

Note that it takes `~15s` to compile the model on the first run; subsequent runs are shorter, and we expect ~2s to load.

```bash
> swift run fluidaudio tts --benchmark
...
FluidAudio TTS benchmark for voice af_heart (warm-up took an extra 2.348s)
Test    Chars   Output (s)   Inf (s)   RTFx
1       42      2.825        0.440     6.424x
2       129     7.725        0.594     13.014x
3       254     13.400       0.776     17.278x
4       93      5.875        0.587     10.005x
5       104     6.675        0.613     10.889x
6       130     8.075        0.621     13.008x
7       197     10.650       0.627     16.983x
8       6       0.825        0.360     2.290x
9       1228    67.625       2.362     28.625x
10      567     33.025       1.341     24.619x
11      4269    247.600      9.087     27.248x
Total   -       404.300      17.408    23.225x

Peak memory usage (process-wide): 1.503 GB
```