---
license: mit
language:
- en
- ja
- nl
- fr
- de
- it
- pl
- pt
- es
tags:
- speech
- audio
- vocoder
datasets:
- sarulab-speech/mls_sidon
- mythicinfinity/Libriheavy-HQ
base_model:
- spellbrush/AliasingFreeNeuralAudioSynthesis
---
# MioVocoder: High-Resolution Aliasing-Free Neural Vocoder for High-Fidelity Speech Generation
[![GitHub](https://img.shields.io/badge/Code-GitHub-black)](https://github.com/Aratako/MioCodec)
**MioVocoder** is a high-resolution, aliasing-free neural vocoder designed for high-fidelity speech generation. It is a fine-tuned version of the **Pupu-Vocoder (Small)** from the [Aliasing-Free Neural Audio Synthesis](https://github.com/sizigi/AliasingFreeNeuralAudioSynthesis) (AFGen) project.
## 🌟 Overview
MioVocoder is specifically optimized to serve as the backend for **[MioCodec-25Hz](https://huggingface.co/Aratako/MioCodec-25Hz)**. While the original Pupu-Vocoder is a versatile model, MioVocoder has been fine-tuned with a primary focus on enhancing reconstruction quality for **Japanese speech**. By leveraging a large-scale Japanese corpus alongside multilingual data at 44.1kHz, it achieves high robustness and naturalness for various Japanese speaker characteristics.
### Key Features
* **Aliasing-Free:** Inherits the architecture of AFGen, the first work to achieve efficient aliasing-free upsampling-based audio generation.
* **High-Resolution:** Native support for **44.1 kHz** sampling rate.
* **Lightweight:** Based on the "Small" architecture with only **15.2M parameters**, making it fast and efficient for inference.
* **Multilingual Expertise:** Fine-tuned on roughly 28,000 hours of speech spanning Japanese, English, and seven European languages, ensuring natural prosody and timbre.
## 📊 Model Specifications
| Property | Value |
| :--- | :--- |
| **Architecture** | Pupu-Vocoder (Small) |
| **Parameters** | 15.2M |
| **Sampling Rate** | 44.1 kHz |
| **Base Model** | [spellbrush/AliasingFreeNeuralAudioSynthesis](https://huggingface.co/spellbrush/AliasingFreeNeuralAudioSynthesis) |
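
Because MioVocoder is the 44.1 kHz backend for MioCodec-25Hz, the ratio between the token rate and the sampling rate fixes how many waveform samples the vocoder must synthesize per token. The sketch below works that out; note that the 25 Hz frame rate is taken from the "MioCodec-25Hz" name, and the resulting 1,764-sample hop is derived arithmetic, not a documented MioVocoder parameter.

```python
# Relationship between a 25 Hz token stream and 44.1 kHz audio.
# NOTE: the per-frame hop below is derived from the model names,
# not a documented configuration value.
SAMPLE_RATE = 44_100  # native sampling rate of MioVocoder
FRAME_RATE = 25       # token rate implied by "MioCodec-25Hz"

samples_per_frame = SAMPLE_RATE // FRAME_RATE  # 1764 samples per token

def frames_to_samples(n_frames: int) -> int:
    """Audio samples the vocoder must generate for n_frames tokens."""
    return n_frames * samples_per_frame

print(samples_per_frame)      # 1764
print(frames_to_samples(25))  # 44100 -> one second of audio
```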
## 📚 Training Data
The model was fine-tuned on a large-scale multilingual corpus, with significant emphasis on Japanese high-fidelity speech data.
| Language | Approx. Hours | Dataset |
| :--- | :--- | :--- |
| **Japanese** | ~15,000h | Various public HF datasets |
| **English** | ~7,500h | [Libriheavy-HQ](https://huggingface.co/datasets/mythicinfinity/Libriheavy-HQ/tree/main), [MLS-Sidon](https://huggingface.co/datasets/sarulab-speech/mls_sidon) |
| **German** | ~1,950h | [MLS-Sidon](https://huggingface.co/datasets/sarulab-speech/mls_sidon) |
| **Dutch** | ~1,550h | [MLS-Sidon](https://huggingface.co/datasets/sarulab-speech/mls_sidon) |
| **French** | ~1,050h | [MLS-Sidon](https://huggingface.co/datasets/sarulab-speech/mls_sidon) |
| **Spanish** | ~900h | [MLS-Sidon](https://huggingface.co/datasets/sarulab-speech/mls_sidon) |
| **Italian** | ~240h | [MLS-Sidon](https://huggingface.co/datasets/sarulab-speech/mls_sidon) |
| **Portuguese** | ~160h | [MLS-Sidon](https://huggingface.co/datasets/sarulab-speech/mls_sidon) |
| **Polish** | ~100h | [MLS-Sidon](https://huggingface.co/datasets/sarulab-speech/mls_sidon) |
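
Summing the approximate hours in the table gives a sense of the corpus balance; a quick sketch (figures copied from the table above, so the total is only approximate):

```python
# Approximate fine-tuning corpus size, summed from the table above.
hours = {
    "Japanese": 15_000,
    "English": 7_500,
    "German": 1_950,
    "Dutch": 1_550,
    "French": 1_050,
    "Spanish": 900,
    "Italian": 240,
    "Portuguese": 160,
    "Polish": 100,
}

total = sum(hours.values())
japanese_share = hours["Japanese"] / total

print(total)                    # ~28,450 hours in total
print(f"{japanese_share:.0%}")  # Japanese is roughly half the corpus
```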
## ⚠️ Limitations
As MioVocoder is highly optimized for specific use cases, please note the following:
* **Language Performance:** Since the primary goal was to improve reconstruction quality for Japanese, quality for other languages may be slightly inferior to the original Pupu-Vocoder.
* **Speech-Centric:** The fine-tuning process utilized speech-only datasets. Unlike the base model, which may handle general audio or music, MioVocoder’s performance on non-speech audio (e.g., music, singing, environmental noise) may be degraded.
## 🚀 Usage
Since MioVocoder maintains the original Pupu-Vocoder architecture, it can be used with the [official codebase](https://github.com/sizigi/AliasingFreeNeuralAudioSynthesis) or via the `miocodec` helper library.
### Integration with MioCodec
```python
from miocodec import load_vocoder

# Load the fine-tuned Pupu-Vocoder weights from the Hugging Face Hub
# and move the model to the GPU for inference.
vocoder = load_vocoder(
    backend="pupu",
    hf_repo="Aratako/MioVocoder",
    hf_config_path="config.json",
    hf_checkpoint_path="model.safetensors",
).cuda()
```
## 📜 Acknowledgements
* **Original Architecture & Paper:** [Aliasing-Free Neural Audio Synthesis](https://arxiv.org/abs/2512.20211) (AFGen).
* **Base Weights:** Provided by the [Spellbrush](https://huggingface.co/spellbrush) team.
## 🖊️ Citation
If you use MioVocoder in your research, please cite both the original paper and this model checkpoint:
**Original Architecture (AFGen):**
```bibtex
@article{afgen,
  title   = {Aliasing Free Neural Audio Synthesis},
  author  = {Yicheng Gu and Junan Zhang and Chaoren Wang and Jerry Li and Zhizheng Wu and Lauri Juvela},
  year    = {2025},
  journal = {arXiv preprint arXiv:2512.20211},
}
```
**MioVocoder Checkpoint:**
```bibtex
@misc{miovocoder,
  author       = {Chihiro Arata},
  title        = {MioVocoder: High-Resolution Aliasing-Free Neural Vocoder for Japanese Speech},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Aratako/MioVocoder}},
}
```