ANE support?
Is this model ANE optimized?
Not yet, the first version was just about getting it running on CoreML end to end. @alexwengg is working on improving the model now
Okay, great news then :). Btw, an Xcode Core ML Performance Report screenshot would be interesting in the model card, since it is Core ML format.
I found a potentially useful repository, although I haven't been able to get it working: https://github.com/mattmireles/kokoro-coreml
This ONNX version can also be tested: https://huggingface.co/xun/kokoro-v1.1-zh-onnx. It also has an 8-bit quantized version.
@alexwengg appreciate the offer! Unfortunately I'm tied up with other things right now and won't have much free time for a while.
@alexwengg thanks! I started the kokoro coreml project because an iOS app of mine needs it. I don't have any future projects in mind right now.
But based on my experience, in the future I might train a new small model like kokoro that addresses some headaches I ran into:
- a streaming architecture, for fast first-utterance response time and better performance on long paragraphs
- fixed-shape input, for better ANE memory optimization
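To make the fixed-shape-input point concrete: an ANE-friendly model is typically converted with a constant input shape, so variable-length token sequences get padded (or chunked) before inference. A minimal sketch, where `MAX_TOKENS`, `PAD_ID`, and `pad_tokens` are illustrative assumptions, not Kokoro's actual configuration:

```python
import numpy as np

# Hypothetical fixed-length padding for an ANE model converted with a
# constant input shape. MAX_TOKENS and PAD_ID are illustrative values.
MAX_TOKENS = 128
PAD_ID = 0

def pad_tokens(tokens):
    """Pad a token sequence to the fixed (1, MAX_TOKENS) shape."""
    if len(tokens) > MAX_TOKENS:
        raise ValueError("sequence too long; chunk the input text first")
    out = np.full((1, MAX_TOKENS), PAD_ID, dtype=np.int32)
    out[0, :len(tokens)] = tokens
    return out
```

The trade-off is that long inputs must be chunked, but the fixed shape lets the compiler pre-plan ANE memory instead of handling dynamic shapes.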
@alexwengg i didn't try any other style tts models besides kokoro. i tried kitten tts in the public space; not very impressive.
I see there’s also magpie tts if you wanna take a look. I am still converting magpie tts but I haven’t had the time for style tts
got it, thanks for bringing them up
@laishere i am curious what made you decide to go with the 7 model routes. i was aware of this feasibility, but i had concluded that fragmenting the model into several sub-models would degrade the speed gained from ANE due to the swift-to-mlmodelc transitions. input and output transfer can be quite the inference cost.
not to mention the total model is under 1 gb, which is relatively small.
also, do you have any social media accounts where we could continue this conversation? i would like to dig more into how you achieved this ANE breakthrough.
any reason why albert, postalbert, and alignment were not fused, given they all use ANE & CPU? prosody + noise both use GPU & CPU too.
good question. postalbert has lstm ops, which are unsupported on ANE, but fusing might still be possible (i treated them as a single encoder in my earlier tests and it seemed to perform well too).
and fusing postalbert and alignment is likely possible, since they have the same config (fp16 + all units).
noise runs in fp32, which is unsupported on ANE. i can't fuse it with prosody without breaking the ANE path, since most of the prosody ops run on ANE.
anyway, I split the model into small pieces because that makes it easier to get most of the work scheduled on the ANE. but I think fusing is likely possible.
my email laishereu@gmail.com
so in theory it's possible, but the ANE scheduler is a black box.
what was the procedure you used for testing this? did you end up breaking up the model based on its key pytorch modules (ignoring the vocoder and KokoroTail models), or was the breakup largely from experimentation?
yeah, in theory, as long as it's fp16, the ANE-compatible ops are supposed to run on ANE. but the scheduler...
my procedure is to split the pipeline into smaller stages and check where the bottleneck is. the original modules are clean boundaries.
but sometimes, inside a single module, we still need to split further to isolate the ANE or quality issues.
i see, so a good strategy would be to split the mlpackages by pytorch modules and slowly merge any paths that can be merged.
i think what you have done is quite impressive. how did you arrive at the cos solution and pinpoint sin as the cause of the compounding errors?
any reason why the Noise and Tail models couldn't be fp16? what was so unique about them that you needed them in fp32?
noise and tail in fp16 will degrade the audio quality; for example, tail will amplify fp16 errors, resulting in high-frequency noise for some utterances.
i would have thought the prosody model would be more sensitive to change, not noise.
the noise module contains SineGen, which uses the cumsum op; it's not just pure noise.
Not sure how that would be a serious issue, since my impression is that exponent operations are the main cause of quality degradation.
you could run the noise mlpackage in fp16 to compare and verify. I think cumsum can easily hit numerical errors in fp16.
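The fp16 cumsum problem is easy to demonstrate in isolation: once the running total of a long accumulation (like SineGen's phase accumulation) grows large enough, the per-step increment falls below half the fp16 spacing and gets rounded away. A small numpy illustration (the step size and length are illustrative, not Kokoro's actual values):

```python
import numpy as np

# Long cumsum of small steps: fine in fp64, stalls in fp16 once the
# running total reaches a magnitude where fp16 spacing exceeds 2x the step.
steps = np.full(10000, 0.01)

true_sum = np.cumsum(steps.astype(np.float64))[-1]   # ~100.0
fp16_sum = np.cumsum(steps.astype(np.float16))[-1]   # stalls far below 100

rel_err = abs(float(fp16_sum) - true_sum) / true_sum
```

For a sine generator, a drifting or stalled phase accumulator translates directly into audible pitch and noise artifacts, which matches the quality degradation described above.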
I see it was an experimental finding.
Is this your first time finetuning a model, or have you done this before?
it's my first time
honestly impressive. do you have a math background?
no, just a normal CS undergraduate. I did and learned this project with the help of AI - the only thing I needed to know was how to tackle the engineering problems, which I already knew as a software engineer.
I like what you have done here. I will try to keep an eye out for what you do in the future.
@laishere how long did this project take to complete? was it part-time work as well?
i have also finally managed to port the Mandarin model, thanks to your assistance:
https://github.com/FluidInference/FluidAudio/pull/570/changes
about a week, including the distillation attempt. yes, it was part-time work.