Aratako committed (verified)
Commit 32f6a02 · 1 parent: 806764d

Add files using upload-large-folder tool

Files changed (3):
1. README.md (+120 −0)
2. config.json (+58 −0)
3. model.safetensors (+3 −0)
README.md ADDED
@@ -0,0 +1,120 @@
---
license: mit
language:
- en
- ja
- nl
- fr
- de
- it
- pl
- pt
- es
tags:
- speech
- audio
- vocoder
datasets:
- sarulab-speech/mls_sidon
- mythicinfinity/Libriheavy-HQ
base_model:
- spellbrush/AliasingFreeNeuralAudioSynthesis
---

# MioVocoder: High-Resolution Aliasing-Free Neural Vocoder for High-Fidelity Speech Generation

[![GitHub](https://img.shields.io/badge/Code-GitHub-black)](https://github.com/Aratako/MioCodec)

**MioVocoder** is a high-resolution, aliasing-free neural vocoder designed for high-fidelity speech generation. It is a fine-tuned version of the **Pupu-Vocoder (Small)** from the [Aliasing-Free Neural Audio Synthesis](https://github.com/sizigi/AliasingFreeNeuralAudioSynthesis) (AFGen) project.

## 🌟 Overview

MioVocoder is optimized to serve as the backend for **[MioCodec-25Hz](https://huggingface.co/Aratako/MioCodec-25Hz)**. While the original Pupu-Vocoder is a versatile model, MioVocoder was fine-tuned with a primary focus on improving reconstruction quality for **Japanese speech**. By leveraging a large-scale Japanese corpus alongside multilingual data at 44.1 kHz, it achieves high robustness and naturalness across a wide range of Japanese speaker characteristics.

### Key Features
* **Aliasing-Free:** Inherits the architecture of AFGen, the first work to achieve efficient aliasing-free upsampling-based audio generation.
* **High-Resolution:** Native support for a **44.1 kHz** sampling rate.
* **Lightweight:** Based on the "Small" architecture with only **15.2M parameters**, making inference fast and efficient.
* **Multilingual:** Fine-tuned on a large corpus (Japanese, English, and several European languages) to ensure natural prosody and timbre.

## 📊 Model Specifications

| Property | Value |
| :--- | :--- |
| **Architecture** | Pupu-Vocoder (Small) |
| **Parameters** | 15.2M |
| **Sampling Rate** | 44.1 kHz |
| **Base Model** | [spellbrush/AliasingFreeNeuralAudioSynthesis](https://huggingface.co/spellbrush/AliasingFreeNeuralAudioSynthesis) |
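
The frame-level behavior implied by these specifications can be checked against the `upsample_rates` in the accompanying `config.json`. A small sketch, assuming (as in HiFi-GAN-style generators, which this config resembles) that the hop size equals the product of the per-stage upsample rates:

```python
from math import prod

# Per-stage upsample rates, taken from config.json
upsample_rates = [8, 8, 2, 2, 2]
sample_rate = 44_100  # Hz, from the model card

# Hop size = samples of waveform generated per input mel frame
hop_size = prod(upsample_rates)      # 512 samples
frame_rate = sample_rate / hop_size  # mel frames per second

print(hop_size)               # 512
print(round(frame_rate, 2))   # 86.13
```

So the vocoder consumes roughly 86 mel frames per second of 44.1 kHz audio, under the hop-size assumption above.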

## 📚 Training Data

The model was fine-tuned on a large-scale multilingual corpus, with significant emphasis on Japanese high-fidelity speech data.

| Language | Approx. Hours | Dataset |
| :--- | :--- | :--- |
| **Japanese** | ~15,000h | Various public HF datasets |
| **English** | ~7,500h | [Libriheavy-HQ](https://huggingface.co/datasets/mythicinfinity/Libriheavy-HQ/tree/main), [MLS-Sidon](https://huggingface.co/datasets/sarulab-speech/mls_sidon) |
| **German** | ~1,950h | [MLS-Sidon](https://huggingface.co/datasets/sarulab-speech/mls_sidon) |
| **Dutch** | ~1,550h | [MLS-Sidon](https://huggingface.co/datasets/sarulab-speech/mls_sidon) |
| **French** | ~1,050h | [MLS-Sidon](https://huggingface.co/datasets/sarulab-speech/mls_sidon) |
| **Spanish** | ~900h | [MLS-Sidon](https://huggingface.co/datasets/sarulab-speech/mls_sidon) |
| **Italian** | ~240h | [MLS-Sidon](https://huggingface.co/datasets/sarulab-speech/mls_sidon) |
| **Portuguese** | ~160h | [MLS-Sidon](https://huggingface.co/datasets/sarulab-speech/mls_sidon) |
| **Polish** | ~100h | [MLS-Sidon](https://huggingface.co/datasets/sarulab-speech/mls_sidon) |
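
Summing the approximate hours above shows how heavily the mix skews toward Japanese (a back-of-the-envelope check on the table, not an official figure):

```python
# Approximate fine-tuning hours per language, taken from the table above
hours = {
    "Japanese": 15_000, "English": 7_500, "German": 1_950,
    "Dutch": 1_550, "French": 1_050, "Spanish": 900,
    "Italian": 240, "Portuguese": 160, "Polish": 100,
}

total_hours = sum(hours.values())
japanese_share = hours["Japanese"] / total_hours

print(total_hours)               # 28450
print(round(japanese_share, 2))  # 0.53 -> roughly half the corpus
```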

## ⚠️ Limitations

As MioVocoder is highly optimized for specific use cases, please note the following:

* **Language Performance:** Since the primary goal was to improve Japanese accuracy, reconstruction quality for other languages may be slightly inferior to that of the original Pupu-Vocoder.
* **Speech-Centric:** The fine-tuning process used speech-only datasets. Unlike the base model, which may handle general audio or music, MioVocoder's performance on non-speech audio (e.g., music, singing, environmental noise) may be degraded.

## 🚀 Usage

Since MioVocoder retains the original Pupu-Vocoder architecture, it can be used with the [official codebase](https://github.com/sizigi/AliasingFreeNeuralAudioSynthesis) or via the `miocodec` helper library.

### Integration with MioCodec

```python
from miocodec import load_vocoder

vocoder = load_vocoder(
    backend="pupu",
    hf_repo="Aratako/MioVocoder",
    hf_config_path="config.json",
    hf_checkpoint_path="model.safetensors",
).cuda()
```

## 📜 Acknowledgements

* **Original Architecture & Paper:** [Aliasing-Free Neural Audio Synthesis](https://arxiv.org/abs/2512.20211) (AFGen).
* **Base Weights:** Provided by the [Spellbrush](https://huggingface.co/spellbrush) team.

## 🖊️ Citation

If you use MioVocoder in your research, please cite both the original paper and this model checkpoint:

**Original Architecture (AFGen):**

```bibtex
@article{afgen,
  title   = {Aliasing Free Neural Audio Synthesis},
  author  = {Yicheng Gu and Junan Zhang and Chaoren Wang and Jerry Li and Zhizheng Wu and Lauri Juvela},
  year    = {2025},
  journal = {arXiv:2512.20211},
}
```

**MioVocoder Checkpoint:**

```bibtex
@misc{miovocoder,
  author       = {Chihiro Arata},
  title        = {MioVocoder: High-Resolution Aliasing-Free Neural Vocoder for Japanese Speech},
  year         = {2026},
  publisher    = {Hugging Face},
  journal      = {Hugging Face repository},
  howpublished = {\url{https://huggingface.co/Aratako/MioVocoder}}
}
```
config.json ADDED
@@ -0,0 +1,58 @@
{
  "base_config": "egs/exp_config_pupuvocoder_base.json",
  "model_type": "PupuVocoder",
  "model": {
    "generator": "pupuvocoder",
    "pupuvocoder": {
      "resblock": "1",
      "upsample_rates": [8, 8, 2, 2, 2],
      "upsample_kernel_sizes": [16, 16, 4, 4, 4],
      "upsample_initial_channel": 512,
      "resblock_kernel_sizes": [3, 7, 11],
      "resblock_dilation_sizes": [
        [1, 3, 5],
        [1, 3, 5],
        [1, 3, 5]
      ]
    }
  },
  "train": {
    "criterions": [
      "feature",
      "discriminator",
      "generator",
      "multimel"
    ]
  },
  "inference": {
    "batch_size": 1
  }
}
model.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:e1a73d7fb10d1bf1e84aacc7bf096d77e5816529ad6bf4dd4a35a09b1efa1597
size 60989884