---
license: mit
language:
- en
- ja
- nl
- fr
- de
- it
- pl
- pt
- es
tags:
- speech
- audio
- vocoder
datasets:
- sarulab-speech/mls_sidon
- mythicinfinity/Libriheavy-HQ
base_model:
- spellbrush/AliasingFreeNeuralAudioSynthesis
---

# MioVocoder: High-Resolution Aliasing-Free Neural Vocoder for High-Fidelity Speech Generation

[![GitHub](https://img.shields.io/badge/Code-GitHub-black)](https://github.com/Aratako/MioCodec)

**MioVocoder** is a high-resolution, aliasing-free neural vocoder designed for high-fidelity speech generation. It is a fine-tuned version of the **Pupu-Vocoder (Small)** from the [Aliasing-Free Neural Audio Synthesis](https://github.com/sizigi/AliasingFreeNeuralAudioSynthesis) (AFGen) project.

## 🌟 Overview

MioVocoder is specifically optimized to serve as the backend for **[MioCodec-25Hz](https://huggingface.co/Aratako/MioCodec-25Hz)**. While the original Pupu-Vocoder is a versatile model, MioVocoder has been fine-tuned with a primary focus on enhancing reconstruction quality for **Japanese speech**. By leveraging a large-scale Japanese corpus alongside multilingual data at 44.1 kHz, it achieves high robustness and naturalness across a wide range of Japanese speaker characteristics.

### Key Features

* **Aliasing-Free:** Inherits the architecture of AFGen, the first work to achieve efficient aliasing-free upsampling-based audio generation.
* **High-Resolution:** Native support for a **44.1 kHz** sampling rate.
* **Lightweight:** Based on the "Small" architecture with only **15.2M parameters**, making inference fast and efficient.
* **Multilingual Expertise:** Fine-tuned on a massive corpus (including Japanese, English, and European languages) to ensure natural prosody and timbre.
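Since MioCodec-25Hz emits tokens at 25 Hz while MioVocoder outputs audio at 44.1 kHz, each codec frame maps to a fixed number of output samples. A quick back-of-the-envelope check (plain arithmetic for illustration; the actual hop and upsampling factors are defined in the model's `config.json`):

```python
# Relationship between the MioCodec-25Hz frame rate and MioVocoder's
# native 44.1 kHz output rate. Illustrative arithmetic only.

SAMPLE_RATE = 44_100  # Hz, native output rate of MioVocoder
FRAME_RATE = 25       # Hz, token rate of MioCodec-25Hz

samples_per_frame = SAMPLE_RATE // FRAME_RATE
print(samples_per_frame)               # 1764 output samples per codec frame

# One second of speech is 25 frames, i.e. 44,100 samples:
print(samples_per_frame * FRAME_RATE)  # 44100
```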
## 📊 Model Specifications

| Property | Value |
| :--- | :--- |
| **Architecture** | Pupu-Vocoder (Small) |
| **Parameters** | 15.2M |
| **Sampling Rate** | 44.1 kHz |
| **Base Model** | [spellbrush/AliasingFreeNeuralAudioSynthesis](https://huggingface.co/spellbrush/AliasingFreeNeuralAudioSynthesis) |

## 📚 Training Data

The model was fine-tuned on a large-scale multilingual corpus, with significant emphasis on high-fidelity Japanese speech data.

| Language | Approx. Hours | Dataset |
| :--- | :--- | :--- |
| **Japanese** | ~15,000h | Various public HF datasets |
| **English** | ~7,500h | [Libriheavy-HQ](https://huggingface.co/datasets/mythicinfinity/Libriheavy-HQ/tree/main), [MLS-Sidon](https://huggingface.co/datasets/sarulab-speech/mls_sidon) |
| **German** | ~1,950h | [MLS-Sidon](https://huggingface.co/datasets/sarulab-speech/mls_sidon) |
| **Dutch** | ~1,550h | [MLS-Sidon](https://huggingface.co/datasets/sarulab-speech/mls_sidon) |
| **French** | ~1,050h | [MLS-Sidon](https://huggingface.co/datasets/sarulab-speech/mls_sidon) |
| **Spanish** | ~900h | [MLS-Sidon](https://huggingface.co/datasets/sarulab-speech/mls_sidon) |
| **Italian** | ~240h | [MLS-Sidon](https://huggingface.co/datasets/sarulab-speech/mls_sidon) |
| **Portuguese** | ~160h | [MLS-Sidon](https://huggingface.co/datasets/sarulab-speech/mls_sidon) |
| **Polish** | ~100h | [MLS-Sidon](https://huggingface.co/datasets/sarulab-speech/mls_sidon) |

## ⚠️ Limitations

Because MioVocoder is highly optimized for specific use cases, please note the following:

* **Language Performance:** Since the primary goal was to improve Japanese accuracy, reconstruction quality for other languages may be slightly inferior to that of the original Pupu-Vocoder.
* **Speech-Centric:** The fine-tuning process used speech-only datasets. Unlike the base model, which may handle general audio or music, MioVocoder's performance on non-speech audio (e.g., music, singing, environmental noise) may be degraded.
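Because the model operates natively at 44.1 kHz, audio recorded at other rates should be resampled before reconstruction. A minimal sketch using `scipy` (an assumption for illustration; the model card does not prescribe a resampler, and any high-quality polyphase resampler works):

```python
import numpy as np
from scipy.signal import resample_poly

# Hypothetical example: upsample a 16 kHz waveform to the 44.1 kHz
# rate MioVocoder expects. The ratio 44100/16000 reduces to 441/160.
sr_in, sr_out = 16_000, 44_100
x = np.random.randn(sr_in)  # 1 second of dummy 16 kHz audio

y = resample_poly(x, up=441, down=160)
print(len(y))  # 44100 samples, i.e. 1 second at 44.1 kHz
```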
## 🚀 Usage

Since MioVocoder retains the original Pupu-Vocoder architecture, it can be used with the [official codebase](https://github.com/sizigi/AliasingFreeNeuralAudioSynthesis) or via the `miocodec` helper library.

### Integration with MioCodec

```python
from miocodec import load_vocoder

# Load the MioVocoder checkpoint from the Hugging Face Hub
vocoder = load_vocoder(
    backend="pupu",
    hf_repo="Aratako/MioVocoder",
    hf_config_path="config.json",
    hf_checkpoint_path="model.safetensors",
).cuda()
```

## 📜 Acknowledgements

* **Original Architecture & Paper:** [Aliasing-Free Neural Audio Synthesis](https://arxiv.org/abs/2512.20211) (AFGen).
* **Base Weights:** Provided by the [Spellbrush](https://huggingface.co/spellbrush) team.

## 🖊️ Citation

If you use MioVocoder in your research, please cite both the original paper and this model checkpoint.

**Original Architecture (AFGen):**

```bibtex
@article{afgen,
  title   = {Aliasing Free Neural Audio Synthesis},
  author  = {Yicheng Gu and Junan Zhang and Chaoren Wang and Jerry Li and Zhizheng Wu and Lauri Juvela},
  year    = {2025},
  journal = {arXiv:2512.20211},
}
```

**MioVocoder Checkpoint:**

```bibtex
@misc{miovocoder,
  author       = {Chihiro Arata},
  title        = {MioVocoder: High-Resolution Aliasing-Free Neural Vocoder for Japanese Speech},
  year         = {2026},
  publisher    = {Hugging Face},
  journal      = {Hugging Face repository},
  howpublished = {\url{https://huggingface.co/Aratako/MioVocoder}}
}
```