	---
	base_model:
	- spellbrush/AliasingFreeNeuralAudioSynthesis
	datasets:
	- sarulab-speech/mls_sidon
	- mythicinfinity/Libriheavy-HQ
	language:
	- en
	- ja
	- nl
	- fr
	- de
	- it
	- pl
	- pt
	- es
	license: mit
	pipeline_tag: audio-to-audio
	tags:
	- speech
	- audio
	- vocoder
	---

	# MioVocoder: High-Resolution Aliasing-Free Neural Vocoder for High-Fidelity Speech Generation

	[![GitHub](https://img.shields.io/badge/Code-GitHub-black)](https://github.com/Aratako/MioCodec)
	[![arXiv](https://img.shields.io/badge/arXiv-2512.20211-b31b1b.svg)](https://huggingface.co/papers/2512.20211)

	MioVocoder is a high-resolution, aliasing-free neural vocoder designed for high-fidelity speech generation. It is a fine-tuned version of the Pupu-Vocoder (Small) from the paper [Aliasing-Free Neural Audio Synthesis](https://huggingface.co/papers/2512.20211) (AFGen).

	## 🌟 Overview

	MioVocoder is optimized to serve as the vocoder backend for [MioCodec-25Hz](https://huggingface.co/Aratako/MioCodec-25Hz). While the original Pupu-Vocoder is a versatile model, MioVocoder has been fine-tuned primarily to improve reconstruction quality for Japanese speech. Trained on a large-scale Japanese corpus alongside multilingual data at 44.1 kHz, it achieves robust, natural reconstruction across a wide range of Japanese speaker characteristics.

	### Key Features
	* Aliasing-Free: Inherits the architecture of AFGen, the first work to achieve efficient aliasing-free upsampling-based audio generation.
	* High-Resolution: Native support for 44.1 kHz sampling rate.
	* Lightweight: Based on the "Small" architecture with only 15.2M parameters, making it fast and efficient for inference.
	* Multilingual Expertise: Fine-tuned on roughly 28,000 hours of speech (Japanese, English, and seven other European languages) to ensure natural prosody and timbre.

	## 📊 Model Specifications

	| Property | Value |
	| :--- | :--- |
	| Architecture | Pupu-Vocoder (Small) |
	| Parameters | 15.2M |
	| Sampling Rate | 44.1 kHz |
	| Base Model | [spellbrush/AliasingFreeNeuralAudioSynthesis](https://huggingface.co/spellbrush/AliasingFreeNeuralAudioSynthesis) |
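
As a rough sanity check on the "lightweight" claim, the parameter count above can be converted into an approximate weight size. The sketch below assumes 4 bytes per parameter for float32 and 2 bytes for float16; the storage precision of the released checkpoint is not stated here, so treat the figures as estimates rather than the actual file size.

```python
# Estimate raw weight storage from the parameter count in the table above.
# Assumption: weights only (no optimizer state, buffers, or file metadata).

PARAMS = 15.2e6  # 15.2M parameters, from the specification table

BYTES_PER_PARAM = {"float32": 4, "float16": 2}  # standard IEEE widths


def weight_size_mb(num_params: float, dtype: str) -> float:
    """Return the approximate weight size in megabytes (1 MB = 10**6 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e6


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: ~{weight_size_mb(PARAMS, dtype):.1f} MB")
```

Under these assumptions the weights occupy about 60.8 MB in float32 and about 30.4 MB in float16.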

	## 📚 Training Data

	The model was fine-tuned on a large-scale multilingual corpus, with significant emphasis on Japanese high-fidelity speech data.

	| Language | Approx. Hours | Dataset |
	| :--- | :--- | :--- |
	| Japanese | ~15,000h | Various public HF datasets |
	| English | ~7,500h | [Libriheavy-HQ](https://huggingface.co/datasets/mythicinfinity/Libriheavy-HQ/tree/main), [MLS-Sidon](https://huggingface.co/datasets/sarulab-speech/mls_sidon) |
	| German | ~1,950h | [MLS-Sidon](https://huggingface.co/datasets/sarulab-speech/mls_sidon) |
	| Dutch | ~1,550h | [MLS-Sidon](https://huggingface.co/datasets/sarulab-speech/mls_sidon) |
	| French | ~1,050h | [MLS-Sidon](https://huggingface.co/datasets/sarulab-speech/mls_sidon) |
	| Spanish | ~900h | [MLS-Sidon](https://huggingface.co/datasets/sarulab-speech/mls_sidon) |
	| Italian | ~240h | [MLS-Sidon](https://huggingface.co/datasets/sarulab-speech/mls_sidon) |
	| Portuguese | ~160h | [MLS-Sidon](https://huggingface.co/datasets/sarulab-speech/mls_sidon) |
	| Polish | ~100h | [MLS-Sidon](https://huggingface.co/datasets/sarulab-speech/mls_sidon) |
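
To make the Japanese emphasis concrete, the approximate hours above can be summed and converted into per-language shares. The inputs are the rounded values from the table, so the result is indicative only.

```python
# Approximate training hours per language, copied from the table above.
HOURS = {
    "Japanese": 15_000,
    "English": 7_500,
    "German": 1_950,
    "Dutch": 1_550,
    "French": 1_050,
    "Spanish": 900,
    "Italian": 240,
    "Portuguese": 160,
    "Polish": 100,
}

total_hours = sum(HOURS.values())


def share(language: str) -> float:
    """Return the language's share of total training hours, in percent."""
    return 100 * HOURS[language] / total_hours


print(f"Total: ~{total_hours:,}h")
for language in HOURS:
    print(f"{language:<10} {share(language):5.1f}%")
```

With these figures the corpus totals about 28,450 hours, of which Japanese accounts for slightly more than half (about 53%) and English for roughly a quarter.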

	## ⚠️ Limitations

	As MioVocoder is highly optimized for specific use cases, please note the following:

	* Language Performance: Because fine-tuning prioritized Japanese reconstruction quality, results for other languages may be slightly worse than with the original Pupu-Vocoder.
	* Speech-Centric: The fine-tuning process utilized speech-only datasets. Unlike the base model, which may handle general audio or music, MioVocoder’s performance on non-speech audio (e.g., music, singing, environmental noise) may be degraded.

	## 🚀 Usage

	Since MioVocoder maintains the original Pupu-Vocoder architecture, it can be used with the [official codebase](https://github.com/sizigi/AliasingFreeNeuralAudioSynthesis) or via the `miocodec` helper library.

	### Integration with MioCodec

	```python
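	from miocodec import load_vocoder

	# Load the MioVocoder weights from the Hugging Face Hub using the "pupu" backend.
	vocoder = load_vocoder(
	    backend="pupu",
	    hf_repo="Aratako/MioVocoder",
	    hf_config_path="config.json",
	    hf_checkpoint_path="model.safetensors",
	).cuda()  # move the model to the GPU; requires a CUDA-capable device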
	```
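
For use with the official codebase, the two files referenced above can also be fetched directly with `huggingface_hub`. This is a minimal sketch, assuming `huggingface_hub` is installed and that `config.json` and `model.safetensors` sit at the repository root, as the snippet above suggests. `fetch_miovocoder_files` is a hypothetical helper name for illustration, not part of `miocodec` or the official code.

```python
REPO_ID = "Aratako/MioVocoder"
# File names taken from the load_vocoder call above; their location at the
# repository root is an assumption.
FILENAMES = ("config.json", "model.safetensors")


def fetch_miovocoder_files(repo_id: str = REPO_ID) -> dict[str, str]:
    """Download the config and checkpoint, returning {filename: local path}.

    Hypothetical helper for illustration only.
    """
    # Imported lazily so this module can be imported without the dependency.
    from huggingface_hub import hf_hub_download

    return {
        name: hf_hub_download(repo_id=repo_id, filename=name)
        for name in FILENAMES
    }


# Example (performs network access and caches the files locally):
# paths = fetch_miovocoder_files()
# print(paths["model.safetensors"])
```

The returned paths point into the local Hugging Face cache and can then be passed to whatever loading routine the official codebase expects.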

	## 📜 Acknowledgements

	* Original Architecture & Paper: [Aliasing-Free Neural Audio Synthesis](https://arxiv.org/abs/2512.20211) (AFGen).
	* Official Code: [GitHub Repository](https://github.com/sizigi/AliasingFreeNeuralAudioSynthesis)
	* Base Weights: Provided by the [Spellbrush](https://huggingface.co/spellbrush) team.

	## 🖊️ Citation

	If you use MioVocoder in your research, please cite both the original paper and this model checkpoint:

	Original Architecture (AFGen):
	```bibtex
	@article{afgen,
	  title   = {Aliasing-Free Neural Audio Synthesis},
	  author  = {Yicheng Gu and Junan Zhang and Chaoren Wang and Jerry Li and Zhizheng Wu and Lauri Juvela},
	  year    = {2025},
	  journal = {arXiv:2512.20211},
	}
	```

	MioVocoder Checkpoint:

	```bibtex
	@misc{miovocoder,
	  author       = {Chihiro Arata},
	  title        = {MioVocoder: High-Resolution Aliasing-Free Neural Vocoder for Japanese Speech},
	  year         = {2026},
	  publisher    = {Hugging Face},
	  journal      = {Hugging Face repository},
	  howpublished = {\url{https://huggingface.co/Aratako/MioVocoder}}
	}
	```