---
license: mit
language:
- en
- ja
- nl
- fr
- de
- it
- pl
- pt
- es
tags:
- speech
- audio
- vocoder
datasets:
- sarulab-speech/mls_sidon
- mythicinfinity/Libriheavy-HQ
base_model:
- spellbrush/AliasingFreeNeuralAudioSynthesis
---
|
|
|
|
|
# MioVocoder: High-Resolution Aliasing-Free Neural Vocoder for High-Fidelity Speech Generation
|
|
|
|
|
|
|
|
[GitHub: Aratako/MioCodec](https://github.com/Aratako/MioCodec)
|
|
|
|
|
**MioVocoder** is a high-resolution, aliasing-free neural vocoder designed for high-fidelity speech generation. It is a fine-tuned version of the **Pupu-Vocoder (Small)** from the [Aliasing-Free Neural Audio Synthesis](https://github.com/sizigi/AliasingFreeNeuralAudioSynthesis) (AFGen) project.
|
|
|
|
|
## 🌟 Overview
|
|
|
|
|
MioVocoder is specifically optimized to serve as the backend for **[MioCodec-25Hz](https://huggingface.co/Aratako/MioCodec-25Hz)**. While the original Pupu-Vocoder is a versatile model, MioVocoder has been fine-tuned with a primary focus on enhancing reconstruction quality for **Japanese speech**. By leveraging a large-scale Japanese corpus alongside multilingual data at 44.1 kHz, it achieves high robustness and naturalness across varied Japanese speaker characteristics.
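As back-of-envelope arithmetic (illustrative only; the exact hop configuration is defined by the model's config, not by this snippet), decoding a 25 Hz token stream to 44.1 kHz audio means each codec frame maps to 1,764 output samples:

```python
# Illustrative arithmetic: samples of 44.1 kHz audio per 25 Hz codec frame.
sample_rate = 44100   # MioVocoder output sampling rate
frame_rate = 25       # MioCodec-25Hz token rate
samples_per_frame = sample_rate // frame_rate
print(samples_per_frame)  # 1764

# One second of speech is 25 frames: 25 * 1764 = 44100 samples.
print(frame_rate * samples_per_frame)  # 44100
```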
|
|
|
|
|
### Key Features

* **Aliasing-Free:** Inherits the architecture of AFGen, the first work to achieve efficient aliasing-free upsampling-based audio generation.
* **High-Resolution:** Native support for a **44.1 kHz** sampling rate.
* **Lightweight:** Based on the "Small" architecture with only **15.2M parameters**, making it fast and efficient for inference.
* **Multilingual Expertise:** Fine-tuned on a massive corpus (including Japanese, English, and European languages) to ensure natural prosody and timbre.
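To illustrate why aliasing-free resampling matters for upsampling-based vocoders, here is a minimal NumPy sketch (not taken from the AFGen codebase): a tone above the target Nyquist frequency, decimated without an anti-aliasing filter, folds down into the audible band as a spurious tone.

```python
import numpy as np

sr = 44100
t = np.arange(sr) / sr  # one second of audio
# An 18 kHz tone, above the Nyquist frequency (11.025 kHz) of a 22.05 kHz rate
tone = np.sin(2 * np.pi * 18000 * t)

# Naive decimation by 2 with no anti-aliasing filter: the 18 kHz tone
# folds to |22050 - 18000| = 4050 Hz, an audible alias
decimated = tone[::2]
spectrum = np.abs(np.fft.rfft(decimated))
freqs = np.fft.rfftfreq(decimated.size, d=2 / sr)
alias_hz = freqs[int(np.argmax(spectrum))]
print(round(alias_hz))  # 4050
```

Aliasing-free designs such as AFGen avoid introducing this kind of spectral folding inside the network's up/downsampling layers.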
|
|
|
|
|
## 📊 Model Specifications

| Property | Value |
| :--- | :--- |
| **Architecture** | Pupu-Vocoder (Small) |
| **Parameters** | 15.2M |
| **Sampling Rate** | 44.1 kHz |
| **Base Model** | [spellbrush/AliasingFreeNeuralAudioSynthesis](https://huggingface.co/spellbrush/AliasingFreeNeuralAudioSynthesis) |
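A rough weight-memory estimate follows from the parameter count (back-of-envelope only; it ignores activations, optimizer state, and framework overhead):

```python
# Back-of-envelope weight memory for a 15.2M-parameter model.
params = 15.2e6
for dtype, bytes_per_param in [("fp32", 4), ("fp16", 2)]:
    mib = params * bytes_per_param / 2**20
    print(f"{dtype}: {mib:.1f} MiB")  # fp32: 58.0 MiB / fp16: 29.0 MiB
```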
|
|
|
|
|
## 📚 Training Data

The model was fine-tuned on a large-scale multilingual corpus, with significant emphasis on Japanese high-fidelity speech data.
|
|
|
|
|
| Language | Approx. Hours | Dataset |
| :--- | :--- | :--- |
| **Japanese** | ~15,000h | Various public HF datasets |
| **English** | ~7,500h | [Libriheavy-HQ](https://huggingface.co/datasets/mythicinfinity/Libriheavy-HQ/tree/main), [MLS-Sidon](https://huggingface.co/datasets/sarulab-speech/mls_sidon) |
| **German** | ~1,950h | [MLS-Sidon](https://huggingface.co/datasets/sarulab-speech/mls_sidon) |
| **Dutch** | ~1,550h | [MLS-Sidon](https://huggingface.co/datasets/sarulab-speech/mls_sidon) |
| **French** | ~1,050h | [MLS-Sidon](https://huggingface.co/datasets/sarulab-speech/mls_sidon) |
| **Spanish** | ~900h | [MLS-Sidon](https://huggingface.co/datasets/sarulab-speech/mls_sidon) |
| **Italian** | ~240h | [MLS-Sidon](https://huggingface.co/datasets/sarulab-speech/mls_sidon) |
| **Portuguese** | ~160h | [MLS-Sidon](https://huggingface.co/datasets/sarulab-speech/mls_sidon) |
| **Polish** | ~100h | [MLS-Sidon](https://huggingface.co/datasets/sarulab-speech/mls_sidon) |
|
|
|
|
|
## ⚠️ Limitations

As MioVocoder is highly optimized for specific use cases, please note the following:

* **Language Performance:** Since the primary goal was to improve Japanese accuracy, the reconstruction quality for other languages may be slightly inferior to that of the original Pupu-Vocoder.
* **Speech-Centric:** The fine-tuning process used speech-only datasets. Unlike the base model, which may handle general audio or music, MioVocoder's performance on non-speech audio (e.g., music, singing, environmental noise) may be degraded.
|
|
|
|
|
## 🚀 Usage

Since MioVocoder maintains the original Pupu-Vocoder architecture, it can be used with the [official codebase](https://github.com/sizigi/AliasingFreeNeuralAudioSynthesis) or via the `miocodec` helper library.
|
|
|
|
|
### Integration with MioCodec
|
|
|
|
|
```python
from miocodec import load_vocoder

vocoder = load_vocoder(
    backend="pupu",
    hf_repo="Aratako/MioVocoder",
    hf_config_path="config.json",
    hf_checkpoint_path="model.safetensors",
).cuda()
```
|
|
|
|
|
## 📜 Acknowledgements

* **Original Architecture & Paper:** [Aliasing-Free Neural Audio Synthesis](https://arxiv.org/abs/2512.20211) (AFGen).
* **Base Weights:** Provided by the [Spellbrush](https://huggingface.co/spellbrush) team.
|
|
|
|
|
## 🖊️ Citation

If you use MioVocoder in your research, please cite both the original paper and this model checkpoint.

**Original Architecture (AFGen):**
|
|
```bibtex
@article{afgen,
  title   = {Aliasing Free Neural Audio Synthesis},
  author  = {Yicheng Gu and Junan Zhang and Chaoren Wang and Jerry Li and Zhizheng Wu and Lauri Juvela},
  journal = {arXiv:2512.20211},
  year    = {2025}
}
```
|
|
|
|
|
**MioVocoder Checkpoint:**
|
|
|
|
|
```bibtex
@misc{miovocoder,
  author       = {Chihiro Arata},
  title        = {MioVocoder: High-Resolution Aliasing-Free Neural Vocoder for Japanese Speech},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Aratako/MioVocoder}}
}
```