Aratako committed (verified)
Commit 32f6a02 · 1 parent: 806764d

Add files using upload-large-folder tool

Files changed (3):
1. README.md (+120 −0)
2. config.json (+58 −0)
3. model.safetensors (+3 −0)
README.md ADDED
@@ -0,0 +1,120 @@
---
license: mit
language:
- en
- ja
- nl
- fr
- de
- it
- pl
- pt
- es
tags:
- speech
- audio
- vocoder
datasets:
- sarulab-speech/mls_sidon
- mythicinfinity/Libriheavy-HQ
base_model:
- spellbrush/AliasingFreeNeuralAudioSynthesis
---

# MioVocoder: High-Resolution Aliasing-Free Neural Vocoder for High-Fidelity Speech Generation

[![GitHub](https://img.shields.io/badge/Code-GitHub-black)](https://github.com/Aratako/MioCodec)

**MioVocoder** is a high-resolution, aliasing-free neural vocoder designed for high-fidelity speech generation. It is a fine-tuned version of the **Pupu-Vocoder (Small)** from the [Aliasing-Free Neural Audio Synthesis](https://github.com/sizigi/AliasingFreeNeuralAudioSynthesis) (AFGen) project.

## 🌟 Overview

MioVocoder is optimized to serve as the backend for **[MioCodec-25Hz](https://huggingface.co/Aratako/MioCodec-25Hz)**. While the original Pupu-Vocoder is a versatile model, MioVocoder was fine-tuned with a primary focus on improving reconstruction quality for **Japanese speech**. By leveraging a large-scale Japanese corpus alongside multilingual data at 44.1 kHz, it achieves high robustness and naturalness across a wide range of Japanese speaker characteristics.

### Key Features
* **Aliasing-Free:** Inherits the architecture of AFGen, the first work to achieve efficient aliasing-free upsampling-based audio generation.
* **High-Resolution:** Native support for a **44.1 kHz** sampling rate.
* **Lightweight:** Based on the "Small" architecture with only **15.2M parameters**, making inference fast and efficient.
* **Multilingual:** Fine-tuned on a large corpus (Japanese, English, and several European languages) to ensure natural prosody and timbre.

## 📊 Model Specifications

| Property | Value |
| :--- | :--- |
| **Architecture** | Pupu-Vocoder (Small) |
| **Parameters** | 15.2M |
| **Sampling Rate** | 44.1 kHz |
| **Base Model** | [spellbrush/AliasingFreeNeuralAudioSynthesis](https://huggingface.co/spellbrush/AliasingFreeNeuralAudioSynthesis) |
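
The frame-level behavior implied by these specifications can be checked against the `upsample_rates` in the accompanying `config.json`. A small sketch, assuming (as in HiFi-GAN-style generators, which this config resembles) that the hop size equals the product of the per-stage upsample rates:

```python
from math import prod

# Per-stage upsample rates, taken from config.json
upsample_rates = [8, 8, 2, 2, 2]
sample_rate = 44_100  # Hz, from the model card

# Hop size = samples of waveform generated per input mel frame
hop_size = prod(upsample_rates)      # 512 samples
frame_rate = sample_rate / hop_size  # mel frames per second

print(hop_size)               # 512
print(round(frame_rate, 2))   # 86.13
```

So the vocoder consumes roughly 86 mel frames per second of 44.1 kHz audio, under the hop-size assumption above.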

## 📚 Training Data

The model was fine-tuned on a large-scale multilingual corpus, with significant emphasis on Japanese high-fidelity speech data.

| Language | Approx. Hours | Dataset |
| :--- | :--- | :--- |
| **Japanese** | ~15,000h | Various public HF datasets |
| **English** | ~7,500h | [Libriheavy-HQ](https://huggingface.co/datasets/mythicinfinity/Libriheavy-HQ/tree/main), [MLS-Sidon](https://huggingface.co/datasets/sarulab-speech/mls_sidon) |
| **German** | ~1,950h | [MLS-Sidon](https://huggingface.co/datasets/sarulab-speech/mls_sidon) |
| **Dutch** | ~1,550h | [MLS-Sidon](https://huggingface.co/datasets/sarulab-speech/mls_sidon) |
| **French** | ~1,050h | [MLS-Sidon](https://huggingface.co/datasets/sarulab-speech/mls_sidon) |
| **Spanish** | ~900h | [MLS-Sidon](https://huggingface.co/datasets/sarulab-speech/mls_sidon) |
| **Italian** | ~240h | [MLS-Sidon](https://huggingface.co/datasets/sarulab-speech/mls_sidon) |
| **Portuguese** | ~160h | [MLS-Sidon](https://huggingface.co/datasets/sarulab-speech/mls_sidon) |
| **Polish** | ~100h | [MLS-Sidon](https://huggingface.co/datasets/sarulab-speech/mls_sidon) |
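
Summing the approximate hours above shows how heavily the mix skews toward Japanese (a back-of-the-envelope check on the table, not an official figure):

```python
# Approximate fine-tuning hours per language, taken from the table above
hours = {
    "Japanese": 15_000, "English": 7_500, "German": 1_950,
    "Dutch": 1_550, "French": 1_050, "Spanish": 900,
    "Italian": 240, "Portuguese": 160, "Polish": 100,
}

total_hours = sum(hours.values())
japanese_share = hours["Japanese"] / total_hours

print(total_hours)               # 28450
print(round(japanese_share, 2))  # 0.53 -> roughly half the corpus
```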

## ⚠️ Limitations

As MioVocoder is highly optimized for specific use cases, please note the following:

* **Language Performance:** Since the primary goal was to improve Japanese accuracy, reconstruction quality for other languages may be slightly inferior to that of the original Pupu-Vocoder.
* **Speech-Centric:** The fine-tuning process used speech-only datasets. Unlike the base model, which may handle general audio or music, MioVocoder's performance on non-speech audio (e.g., music, singing, environmental noise) may be degraded.

## 🚀 Usage

Since MioVocoder retains the original Pupu-Vocoder architecture, it can be used with the [official codebase](https://github.com/sizigi/AliasingFreeNeuralAudioSynthesis) or via the `miocodec` helper library.

### Integration with MioCodec

```python
from miocodec import load_vocoder

vocoder = load_vocoder(
    backend="pupu",
    hf_repo="Aratako/MioVocoder",
    hf_config_path="config.json",
    hf_checkpoint_path="model.safetensors",
).cuda()
```

## 📜 Acknowledgements

* **Original Architecture & Paper:** [Aliasing-Free Neural Audio Synthesis](https://arxiv.org/abs/2512.20211) (AFGen).
* **Base Weights:** Provided by the [Spellbrush](https://huggingface.co/spellbrush) team.

## 🖊️ Citation

If you use MioVocoder in your research, please cite both the original paper and this model checkpoint:

**Original Architecture (AFGen):**

```bibtex
@article{afgen,
  title   = {Aliasing Free Neural Audio Synthesis},
  author  = {Yicheng Gu and Junan Zhang and Chaoren Wang and Jerry Li and Zhizheng Wu and Lauri Juvela},
  year    = {2025},
  journal = {arXiv:2512.20211},
}
```

**MioVocoder Checkpoint:**

```bibtex
@misc{miovocoder,
  author       = {Chihiro Arata},
  title        = {MioVocoder: High-Resolution Aliasing-Free Neural Vocoder for Japanese Speech},
  year         = {2026},
  publisher    = {Hugging Face},
  journal      = {Hugging Face repository},
  howpublished = {\url{https://huggingface.co/Aratako/MioVocoder}}
}
```
config.json ADDED
@@ -0,0 +1,58 @@
{
  "base_config": "egs/exp_config_pupuvocoder_base.json",
  "model_type": "PupuVocoder",
  "model": {
    "generator": "pupuvocoder",
    "pupuvocoder": {
      "resblock": "1",
      "upsample_rates": [8, 8, 2, 2, 2],
      "upsample_kernel_sizes": [16, 16, 4, 4, 4],
      "upsample_initial_channel": 512,
      "resblock_kernel_sizes": [3, 7, 11],
      "resblock_dilation_sizes": [
        [1, 3, 5],
        [1, 3, 5],
        [1, 3, 5]
      ]
    }
  },
  "train": {
    "criterions": [
      "feature",
      "discriminator",
      "generator",
      "multimel"
    ]
  },
  "inference": {
    "batch_size": 1
  }
}
model.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:e1a73d7fb10d1bf1e84aacc7bf096d77e5816529ad6bf4dd4a35a09b1efa1597
size 60989884