---
license: mit
language:
- en
- ja
- nl
- fr
- de
- it
- pl
- pt
- es
tags:
- speech
- audio
- vocoder
datasets:
- sarulab-speech/mls_sidon
- mythicinfinity/Libriheavy-HQ
base_model:
- spellbrush/AliasingFreeNeuralAudioSynthesis
---

# MioVocoder: High-Resolution Aliasing-Free Neural Vocoder for High-Fidelity Speech Generation


[![GitHub](https://img.shields.io/badge/Code-GitHub-black)](https://github.com/Aratako/MioCodec)

**MioVocoder** is a high-resolution, aliasing-free neural vocoder designed for high-fidelity speech generation. It is a fine-tuned version of the **Pupu-Vocoder (Small)** from the [Aliasing-Free Neural Audio Synthesis](https://github.com/sizigi/AliasingFreeNeuralAudioSynthesis) (AFGen) project.

## 🌟 Overview

MioVocoder is specifically optimized to serve as the backend for **[MioCodec-25Hz](https://huggingface.co/Aratako/MioCodec-25Hz)**. While the original Pupu-Vocoder is a versatile model, MioVocoder has been fine-tuned with a primary focus on enhancing reconstruction quality for **Japanese speech**. By leveraging a large-scale Japanese corpus alongside multilingual data at 44.1kHz, it achieves high robustness and naturalness for various Japanese speaker characteristics.
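
As a quick sanity check of what these rates imply (assuming the "25Hz" in MioCodec-25Hz denotes the codec's token frame rate, which this card does not state explicitly), the vocoder must upsample each frame by a fixed factor:

```python
# Implied upsampling factor from 25 Hz codec frames to 44.1 kHz audio.
# Assumption: "25Hz" in MioCodec-25Hz is the token/frame rate.
SAMPLE_RATE = 44_100  # MioVocoder output sampling rate (Hz)
FRAME_RATE = 25       # assumed MioCodec frame rate (Hz)

samples_per_frame = SAMPLE_RATE // FRAME_RATE
print(samples_per_frame)  # 1764 output samples per codec frame
```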

### Key Features
* **Aliasing-Free:** Inherits the architecture of AFGen, the first work to achieve efficient aliasing-free upsampling-based audio generation.
* **High-Resolution:** Native support for **44.1 kHz** sampling rate.
* **Lightweight:** Based on the "Small" architecture with only **15.2M parameters**, making it fast and efficient for inference.
* **Multilingual Expertise:** Fine-tuned on a massive corpus (including Japanese, English, and European languages) to ensure natural prosody and timbre.

## 📊 Model Specifications

| Property | Value |
| :--- | :--- |
| **Architecture** | Pupu-Vocoder (Small) |
| **Parameters** | 15.2M |
| **Sampling Rate** | 44.1 kHz |
| **Base Model** | [spellbrush/AliasingFreeNeuralAudioSynthesis](https://huggingface.co/spellbrush/AliasingFreeNeuralAudioSynthesis) |

## 📚 Training Data

The model was fine-tuned on a large-scale multilingual corpus, with significant emphasis on Japanese high-fidelity speech data.

| Language | Approx. Hours | Dataset |
| :--- | :--- | :--- |
| **Japanese** | ~15,000h | Various public HF datasets |
| **English** | ~7,500h | [Libriheavy-HQ](https://huggingface.co/datasets/mythicinfinity/Libriheavy-HQ/tree/main), [MLS-Sidon](https://huggingface.co/datasets/sarulab-speech/mls_sidon) |
| **German** | ~1,950h | [MLS-Sidon](https://huggingface.co/datasets/sarulab-speech/mls_sidon) |
| **Dutch** | ~1,550h | [MLS-Sidon](https://huggingface.co/datasets/sarulab-speech/mls_sidon) |
| **French** | ~1,050h | [MLS-Sidon](https://huggingface.co/datasets/sarulab-speech/mls_sidon) |
| **Spanish** | ~900h | [MLS-Sidon](https://huggingface.co/datasets/sarulab-speech/mls_sidon) |
| **Italian** | ~240h | [MLS-Sidon](https://huggingface.co/datasets/sarulab-speech/mls_sidon) |
| **Portuguese** | ~160h | [MLS-Sidon](https://huggingface.co/datasets/sarulab-speech/mls_sidon) |
| **Polish** | ~100h | [MLS-Sidon](https://huggingface.co/datasets/sarulab-speech/mls_sidon) |

## ⚠️ Limitations

As MioVocoder is highly optimized for specific use cases, please note the following:

* **Language Performance:** Because fine-tuning prioritized Japanese reconstruction quality, quality for other languages may be slightly lower than that of the original Pupu-Vocoder.
* **Speech-Centric:** The fine-tuning process utilized speech-only datasets. Unlike the base model, which may handle general audio or music, MioVocoder’s performance on non-speech audio (e.g., music, singing, environmental noise) may be degraded.

## 🚀 Usage

Since MioVocoder maintains the original Pupu-Vocoder architecture, it can be used with the [official codebase](https://github.com/sizigi/AliasingFreeNeuralAudioSynthesis) or via the `miocodec` helper library.

### Integration with MioCodec

```python
from miocodec import load_vocoder

vocoder = load_vocoder(
    backend="pupu",
    hf_repo="Aratako/MioVocoder",
    hf_config_path="config.json",
    hf_checkpoint_path="model.safetensors",
).cuda()
```
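
Once loaded, the vocoder presumably maps acoustic features to a 44.1 kHz waveform. The sketch below is only an assumption about the call convention (the direct `vocoder(...)` forward call, the `(batch, n_mels, frames)` layout, and the mel channel count are not documented here); consult the official codebase for the exact interface.

```python
import torch

# Hypothetical inference sketch -- the forward signature and the
# (batch, n_mels, frames) mel layout are assumptions, not documented here.
mel = torch.randn(1, 128, 200).cuda()  # dummy acoustic features
with torch.inference_mode():
    wav = vocoder(mel)  # expected: waveform tensor at 44.1 kHz
```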

## 📜 Acknowledgements

* **Original Architecture & Paper:** [Aliasing-Free Neural Audio Synthesis](https://arxiv.org/abs/2512.20211) (AFGen).
* **Base Weights:** Provided by the [Spellbrush](https://huggingface.co/spellbrush) team.

## 🖊️ Citation

If you use MioVocoder in your research, please cite both the original paper and this model checkpoint:

**Original Architecture (AFGen):**
```bibtex
@article{afgen,
  title        = {Aliasing Free Neural Audio Synthesis},
  author       = {Yicheng Gu and Junan Zhang and Chaoren Wang and Jerry Li and Zhizheng Wu and Lauri Juvela},
  year         = {2025},
  journal      = {arXiv:2512.20211},
}
```

**MioVocoder Checkpoint:**

```bibtex
@misc{miovocoder,
  author       = {Chihiro Arata},
  title        = {MioVocoder: High-Resolution Aliasing-Free Neural Vocoder for Japanese Speech},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Aratako/MioVocoder}}
}
```
```