---
license: apache-2.0
pipeline_tag: audio-to-audio
---

# ESPERNet

ESPERNet is a set of AI models for audio-to-audio speech processing. The versions available here were trained on the Mozilla Common Voice dataset.

**This model is still in development! Weights will be uploaded as soon as hyperparameter tuning and training are complete.**

ESPERNet is built from three models: an encoder, a decoder, and a classifier. Together, they form a VLGAN architecture. ***(Lee, Je-Yeol & Choi, Sang-Il. (2020). Improvement of Learning Stability of Generative Adversarial Network Using Variational Learning. Applied Sciences, 10(13), 4528. doi:10.3390/app10134528)***

The VAE latent space is split into two parts: a low-dimensional, time-dependent phoneme space and a larger, time-constant style-and-speaker space. This difference in dimensionality enables the encoder to separate speech content from speaker identity. ***(Xie, Yuying, et al. "Speaker and Style Disentanglement of Speech Based on Contrastive Predictive Coding Supported Factorized Variational Autoencoder." 2024 32nd European Signal Processing Conference (EUSIPCO). IEEE, 2024.)*** Paired with the decoder, which benefits from adversarial training to produce high-quality outputs, the model can be used to transfer speech from one speaker to another.

Unlike most audio-processing networks, ESPERNet uses the ESPER format instead of mel spectrograms. This format was designed specifically for speech processing and enables high-quality speech reconstruction without an auxiliary network, as well as a range of effects, augmentations, and transforms through [libESPER-V2](https://github.com/CdrSonan/libESPER-V2).

Training and other related scripts are [available on GitHub](https://github.com/CdrSonan/ESPERNet).
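The factorized latent space described above can be sketched as a minimal encoder. This is purely illustrative: the dimensions, the linear projections, and the mean-pooling used to obtain the time-constant latent are assumptions for the sake of the example, not ESPERNet's actual architecture or configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumptions, not ESPERNet's real values):
# a low-dimensional per-frame phoneme latent and a larger,
# utterance-level style/speaker latent.
N_FEATURES, PHONEME_DIM, STYLE_DIM = 128, 8, 64

W_phoneme = rng.standard_normal((PHONEME_DIM, N_FEATURES))
W_style = rng.standard_normal((STYLE_DIM, N_FEATURES))

def encode(frames):
    """frames: (n_frames, N_FEATURES) array of per-frame audio features."""
    # Time-dependent phoneme latent: one low-dimensional vector per frame,
    # intended to carry the speech content.
    phoneme_latent = frames @ W_phoneme.T           # (n_frames, PHONEME_DIM)
    # Time-constant style/speaker latent: pooled over the whole utterance,
    # so it cannot track frame-level content.
    style_latent = frames.mean(axis=0) @ W_style.T  # (STYLE_DIM,)
    return phoneme_latent, style_latent

frames = rng.standard_normal((200, N_FEATURES))     # a 200-frame utterance
phoneme, style = encode(frames)
print(phoneme.shape, style.shape)  # (200, 8) (64,)
```

Speaker transfer then amounts to decoding the phoneme latent of one utterance together with the style latent of another; the dimensionality gap between the two latents is what pushes content into one and identity into the other.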