---
license: apache-2.0
pipeline_tag: audio-to-audio
---
| # ESPERNet |
ESPERNet is a set of AI models for audio-to-audio speech processing. The versions available here have been trained on the Mozilla Common Voice dataset.
|
|
| **This model is still in development! Weights will be uploaded as soon as hyperparameter tuning and training are complete.** |
|
|
| ESPERNet is built from three models: an encoder, a decoder, and a classifier. |
| Together, they form a VLGAN architecture. |
***(Lee, Je-Yeol, and Sang-Il Choi. "Improvement of Learning Stability of Generative Adversarial Network Using Variational Learning." Applied Sciences 10.13 (2020): 4528. doi:10.3390/app10134528.)***
|
|
The VAE latent space is split into two parts: a low-dimensional, time-dependent phoneme space and a larger, time-constant style-and-speaker space.
This asymmetry in dimensionality and time resolution encourages the encoder to separate speech content from speaker identity.
| ***(Xie, Yuying, et al. "Speaker and style disentanglement of speech based on contrastive predictive coding supported factorized variational autoencoder." 2024 32nd European Signal Processing Conference (EUSIPCO). IEEE, 2024.)*** |
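The factorized latent layout can be illustrated with a minimal sketch. The dimensions, the linear projections, and the mean-pooling used here are placeholder assumptions for illustration only; they are not taken from ESPERNet's actual architecture or hyperparameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions, chosen for illustration only
FRAME_DIM = 80    # input feature dimension per frame
PHONEME_DIM = 8   # low-dimensional, time-dependent content space
STYLE_DIM = 64    # larger, time-constant style-and-speaker space

# Stand-in linear projections for the two encoder heads
W_phoneme = rng.standard_normal((FRAME_DIM, PHONEME_DIM)) * 0.01
W_style = rng.standard_normal((FRAME_DIM, STYLE_DIM)) * 0.01

def encode(frames: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Split an utterance into content and speaker latents.

    frames: (T, FRAME_DIM) array of per-frame audio features.
    Returns a (T, PHONEME_DIM) phoneme trajectory plus a single
    (STYLE_DIM,) style vector, pooled over time so it stays
    constant for the whole utterance.
    """
    phoneme = frames @ W_phoneme             # one latent per frame
    style = (frames @ W_style).mean(axis=0)  # time-constant pooling
    return phoneme, style

utterance = rng.standard_normal((120, FRAME_DIM))  # 120 frames
phoneme, style = encode(utterance)
print(phoneme.shape, style.shape)  # (120, 8) (64,)
```

The key point is the shape difference: the phoneme latent keeps one small vector per frame, while the style latent is a single larger vector shared across all frames.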
|
|
Because the decoder is trained adversarially to produce high-quality outputs, the full encoder-decoder model can transfer speech from one speaker to another.
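The transfer itself amounts to recombining latents: decode speaker A's phoneme trajectory with speaker B's style vector. The stub decoder below (dimensions, concatenation, and linear projection are all illustrative assumptions, not ESPERNet's real decoder) shows the recombination pattern.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical dimensions, matching nothing in the released model
PHONEME_DIM, STYLE_DIM, FRAME_DIM = 8, 64, 80

# Stand-in projection back to frame features
W_out = rng.standard_normal((PHONEME_DIM + STYLE_DIM, FRAME_DIM)) * 0.01

def decode(phoneme: np.ndarray, style: np.ndarray) -> np.ndarray:
    """Combine a (T, PHONEME_DIM) content trajectory with one
    (STYLE_DIM,) style vector into (T, FRAME_DIM) output frames
    by broadcasting the style across time and projecting."""
    T = phoneme.shape[0]
    combined = np.concatenate([phoneme, np.tile(style, (T, 1))], axis=1)
    return combined @ W_out

# Speech transfer: content from speaker A, identity from speaker B
phoneme_a = rng.standard_normal((120, PHONEME_DIM))  # what A said
style_b = rng.standard_normal(STYLE_DIM)             # how B sounds
transferred = decode(phoneme_a, style_b)
print(transferred.shape)  # (120, 80)
```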
|
|
Unlike most audio-processing networks, ESPERNet uses the ESPER format instead of mel spectrograms.
The ESPER format was designed specifically for speech processing and enables high-quality speech reconstruction without an auxiliary network, as well as a range of effects, augmentations, and transforms through [libESPER-V2](https://github.com/CdrSonan/libESPER-V2).
|
|
| Training and other related scripts are [available on GitHub](https://github.com/CdrSonan/ESPERNet). |