| --- |
| base_model: |
| - spellbrush/AliasingFreeNeuralAudioSynthesis |
| datasets: |
| - sarulab-speech/mls_sidon |
| - mythicinfinity/Libriheavy-HQ |
| language: |
| - en |
| - ja |
| - nl |
| - fr |
| - de |
| - it |
| - pl |
| - pt |
| - es |
| license: mit |
| pipeline_tag: audio-to-audio |
| tags: |
| - speech |
| - audio |
| - vocoder |
| --- |
| |
| # MioVocoder: High-Resolution Aliasing-Free Neural Vocoder for High-Fidelity Speech Generation |
|
|
| [GitHub](https://github.com/Aratako/MioCodec) |
| [Paper](https://huggingface.co/papers/2512.20211) |
|
|
| **MioVocoder** is a high-resolution, aliasing-free neural vocoder designed for high-fidelity speech generation. It is a fine-tuned version of **Pupu-Vocoder (Small)**, introduced in the paper [Aliasing-Free Neural Audio Synthesis](https://huggingface.co/papers/2512.20211) (AFGen). |
|
|
| ## 🌟 Overview |
|
|
| MioVocoder is optimized to serve as the vocoder backend for **[MioCodec-25Hz](https://huggingface.co/Aratako/MioCodec-25Hz)**. While the original Pupu-Vocoder is a versatile model, MioVocoder has been fine-tuned primarily to improve reconstruction quality for **Japanese speech**. Trained on a large-scale Japanese corpus alongside multilingual data at 44.1 kHz, it is robust and natural-sounding across a wide range of Japanese speakers. |
|
|
| ### Key Features |
| * **Aliasing-Free:** Inherits the architecture of AFGen, the first work to achieve efficient aliasing-free upsampling-based audio generation. |
| * **High-Resolution:** Native support for **44.1 kHz** sampling rate. |
| * **Lightweight:** Based on the "Small" architecture with only **15.2M parameters**, making it fast and efficient for inference. |
| * **Multilingual Expertise:** Fine-tuned on roughly 28,000 hours of speech (Japanese, English, and seven other European languages) to ensure natural prosody and timbre. |
|
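As background for the "aliasing-free" claim, the sketch below illustrates only the basic folding effect that aliasing refers to: a tone above the Nyquist frequency (22.05 kHz at a 44.1 kHz sampling rate) produces exactly the same samples as a lower-frequency tone. It is not a description of how AFGen suppresses aliasing, and the helper names `alias_frequency` and `sampled_cosine` are hypothetical.

```python
import math

SAMPLE_RATE = 44_100  # Hz, MioVocoder's native sampling rate
NYQUIST = SAMPLE_RATE / 2  # 22,050 Hz


def alias_frequency(freq_hz: float, sample_rate: float = SAMPLE_RATE) -> float:
    """Return the frequency in [0, sample_rate / 2] that freq_hz folds to once sampled."""
    folded = freq_hz % sample_rate
    return min(folded, sample_rate - folded)


def sampled_cosine(freq_hz: float, n_samples: int,
                   sample_rate: float = SAMPLE_RATE) -> list[float]:
    """Sample cos(2*pi*f*t) at t = n / sample_rate for n = 0 .. n_samples - 1."""
    return [math.cos(2 * math.pi * freq_hz * n / sample_rate)
            for n in range(n_samples)]


tone = 30_000.0                 # a tone above the Nyquist frequency
alias = alias_frequency(tone)   # the in-band frequency it folds to
above = sampled_cosine(tone, 64)
below = sampled_cosine(alias, 64)

# The two sample sequences agree up to floating-point rounding.
max_diff = max(abs(a - b) for a, b in zip(above, below))
print(f"{tone:.0f} Hz folds to {alias:.0f} Hz; max sample difference = {max_diff:.2e}")
```

Once sampled, the 30 kHz tone cannot be told apart from a 14.1 kHz tone, which is why out-of-band content created inside a generator can show up as audible in-band artifacts.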
|
| ## 📊 Model Specifications |
|
|
| | Property | Value | |
| | :--- | :--- | |
| | **Architecture** | Pupu-Vocoder (Small) | |
| | **Parameters** | 15.2M | |
| | **Sampling Rate** | 44.1 kHz | |
| | **Base Model** | [spellbrush/AliasingFreeNeuralAudioSynthesis](https://huggingface.co/spellbrush/AliasingFreeNeuralAudioSynthesis) | |
|
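The vocoder generates audio at 44.1 kHz, so reference recordings are easiest to compare against its output when they share that rate. A minimal sketch, using only the standard-library `wave` module, of checking a PCM WAV file's sample rate; the helper names and file names are hypothetical and not part of `miocodec`, and the silent files exist only to make the example self-contained.

```python
import os
import tempfile
import wave

TARGET_SAMPLE_RATE = 44_100  # Hz, from the specification table


def wav_sample_rate(path: str) -> int:
    """Read the sample rate from a PCM WAV header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def matches_target_rate(path: str) -> bool:
    """True if the file is already at the vocoder's native rate."""
    return wav_sample_rate(path) == TARGET_SAMPLE_RATE


def write_silent_wav(path: str, sample_rate: int, seconds: float = 0.1) -> None:
    """Write a short mono 16-bit silent WAV, used here only as demo input."""
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes = 16-bit samples
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(b"\x00\x00" * int(sample_rate * seconds))


with tempfile.TemporaryDirectory() as tmp_dir:
    ok_path = os.path.join(tmp_dir, "reference_44k.wav")
    low_path = os.path.join(tmp_dir, "reference_16k.wav")
    write_silent_wav(ok_path, 44_100)
    write_silent_wav(low_path, 16_000)
    results = {os.path.basename(p): matches_target_rate(p)
               for p in (ok_path, low_path)}

print(results)
```

Files that fail the check would need to be resampled with an audio library of your choice before a like-for-like comparison.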
|
| ## 📚 Training Data |
|
|
| The model was fine-tuned on a large-scale multilingual corpus, with significant emphasis on Japanese high-fidelity speech data. |
|
|
| | Language | Approx. Hours | Dataset | |
| | :--- | :--- | :--- | |
| | **Japanese** | ~15,000h | Various public HF datasets | |
| | **English** | ~7,500h | [Libriheavy-HQ](https://huggingface.co/datasets/mythicinfinity/Libriheavy-HQ/tree/main), [MLS-Sidon](https://huggingface.co/datasets/sarulab-speech/mls_sidon) | |
| | **German** | ~1,950h | [MLS-Sidon](https://huggingface.co/datasets/sarulab-speech/mls_sidon) | |
| | **Dutch** | ~1,550h | [MLS-Sidon](https://huggingface.co/datasets/sarulab-speech/mls_sidon) | |
| | **French** | ~1,050h | [MLS-Sidon](https://huggingface.co/datasets/sarulab-speech/mls_sidon) | |
| | **Spanish** | ~900h | [MLS-Sidon](https://huggingface.co/datasets/sarulab-speech/mls_sidon) | |
| | **Italian** | ~240h | [MLS-Sidon](https://huggingface.co/datasets/sarulab-speech/mls_sidon) | |
| | **Portuguese** | ~160h | [MLS-Sidon](https://huggingface.co/datasets/sarulab-speech/mls_sidon) | |
| | **Polish** | ~100h | [MLS-Sidon](https://huggingface.co/datasets/sarulab-speech/mls_sidon) | |
|
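To make the Japanese emphasis concrete, the approximate hours in the table can be summed and converted to shares. This is a small arithmetic sketch over the rounded figures above, not an exact accounting of the training set.

```python
# Approximate hours per language, copied from the table above.
HOURS = {
    "Japanese": 15_000,
    "English": 7_500,
    "German": 1_950,
    "Dutch": 1_550,
    "French": 1_050,
    "Spanish": 900,
    "Italian": 240,
    "Portuguese": 160,
    "Polish": 100,
}

total_hours = sum(HOURS.values())


def share(language: str) -> float:
    """Return a language's share of the total hours, as a percentage."""
    return 100.0 * HOURS[language] / total_hours


for language, hours in HOURS.items():
    print(f"{language:<11}{hours:>7,}h  {share(language):5.1f}%")
print(f"{'Total':<11}{total_hours:>7,}h")
```

With these figures the corpus totals about 28,450 hours, of which Japanese accounts for roughly 53% and English for roughly 26%.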
|
| ## ⚠️ Limitations |
|
|
| As MioVocoder is highly optimized for specific use cases, please note the following: |
|
|
| * **Language Performance:** Since the primary goal was to improve reconstruction quality for Japanese, quality for other languages may be slightly lower than with the original Pupu-Vocoder. |
| * **Speech-Centric:** The fine-tuning process utilized speech-only datasets. Unlike the base model, which may handle general audio or music, MioVocoder’s performance on non-speech audio (e.g., music, singing, environmental noise) may be degraded. |
|
|
| ## 🚀 Usage |
|
|
| Since MioVocoder maintains the original Pupu-Vocoder architecture, it can be used with the [official codebase](https://github.com/sizigi/AliasingFreeNeuralAudioSynthesis) or via the `miocodec` helper library. |
|
|
| ### Integration with MioCodec |
|
|
| ```python |
| from miocodec import load_vocoder |
| |
| # Load the MioVocoder weights from the Hugging Face Hub and move the model to the GPU. |
| vocoder = load_vocoder( |
|     backend="pupu", |
|     hf_repo="Aratako/MioVocoder", |
|     hf_config_path="config.json", |
|     hf_checkpoint_path="model.safetensors", |
| ).cuda() |
| ``` |
|
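For use with the official codebase, the same two files can be fetched manually. The sketch below assumes the `huggingface_hub` package is installed and uses its `hf_hub_download` function; the wrapper `download_miovocoder_files` is a hypothetical helper, and the file names are the ones passed to `load_vocoder` above. How the official codebase expects these paths to be supplied is not covered here.

```python
REPO_ID = "Aratako/MioVocoder"
FILENAMES = ("config.json", "model.safetensors")


def download_miovocoder_files(repo_id: str = REPO_ID,
                              filenames: tuple = FILENAMES) -> dict:
    """Download the checkpoint files and return {filename: local cache path}."""
    # Imported lazily so this module can be loaded without the dependency;
    # the call below needs network access on first use.
    from huggingface_hub import hf_hub_download

    return {
        name: hf_hub_download(repo_id=repo_id, filename=name)
        for name in filenames
    }
```

Calling `download_miovocoder_files()` returns the local cache paths of the config and checkpoint, which can then be passed to whatever loading routine you use.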
|
| ## 📜 Acknowledgements |
|
|
| * **Original Architecture & Paper:** [Aliasing-Free Neural Audio Synthesis](https://arxiv.org/abs/2512.20211) (AFGen). |
| * **Official Code:** [GitHub Repository](https://github.com/sizigi/AliasingFreeNeuralAudioSynthesis) |
| * **Base Weights:** Provided by the [Spellbrush](https://huggingface.co/spellbrush) team. |
|
|
| ## 🖊️ Citation |
|
|
| If you use MioVocoder in your research, please cite both the original paper and this model checkpoint: |
|
|
| **Original Architecture (AFGen):** |
| ```bibtex |
| @article{afgen, |
| title = {Aliasing-Free Neural Audio Synthesis}, |
| author = {Yicheng Gu and Junan Zhang and Chaoren Wang and Jerry Li and Zhizheng Wu and Lauri Juvela}, |
| year = {2025}, |
| journal = {arXiv:2512.20211}, |
| } |
| ``` |
|
|
| **MioVocoder Checkpoint:** |
|
|
| ```bibtex |
| @misc{miovocoder, |
| author = {Chihiro Arata}, |
| title = {MioVocoder: High-Resolution Aliasing-Free Neural Vocoder for Japanese Speech}, |
| year = {2026}, |
| publisher = {Hugging Face}, |
| journal = {Hugging Face repository}, |
| howpublished = {\url{https://huggingface.co/Aratako/MioVocoder}} |
| } |
| ``` |