---
license: mit
language:
- en
- ja
- nl
- fr
- de
- it
- pl
- pt
- es
- ko
- zh
tags:
- speech
- audio
- tokenizer
datasets:
- sarulab-speech/mls_sidon
- mythicinfinity/Libriheavy-HQ
- nvidia/hifitts-2
- amphion/Emilia-Dataset
pipeline_tag: audio-to-audio
---

# MioCodec-25Hz-24kHz: Lightweight Neural Audio Codec for Efficient Spoken Language Modeling

[GitHub Repository](https://github.com/Aratako/MioCodec)

**MioCodec-25Hz-24kHz** is a lightweight, fast neural audio codec designed for efficient spoken language modeling. Based on the [Kanade-Tokenizer](https://github.com/frothywater/kanade-tokenizer) implementation, this model features an integrated wave decoder (iSTFTHead) that synthesizes waveforms directly, without requiring an external vocoder.

For higher audio fidelity at 44.1 kHz, see [MioCodec-25Hz-44.1kHz](https://huggingface.co/Aratako/MioCodec-25Hz-44.1kHz).

## Overview

MioCodec decomposes speech into two distinct components:

1. **Content tokens:** discrete representations that primarily capture linguistic information and phonetic content ("what" is being said) at a low frame rate (25 Hz).
2. **Global embedding:** a continuous vector representing broad acoustic characteristics ("how" it is said), including speaker identity, recording environment, and microphone traits.

By disentangling these elements, MioCodec is well suited to **spoken language modeling**, since a language model only needs to operate on the compact content token stream.

### Key features

* **Lightweight & fast:** the integrated wave decoder (iSTFTHead) enables direct waveform synthesis without an external vocoder.
* **Ultra-low bitrate:** achieves high-fidelity reconstruction at only **341 bps**.
* **End-to-end design:** a single model architecture from audio input to waveform output.
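The 341 bps figure follows from the token rate and vocabulary size alone: 25 tokens per second, each drawn from a 12,800-entry codebook, carries 25 × log2(12800) ≈ 341 bits per second. A quick sanity check:

```python
import math

token_rate_hz = 25    # content tokens per second
vocab_size = 12_800   # content token vocabulary

bits_per_token = math.log2(vocab_size)        # ~13.64 bits per token
print(round(token_rate_hz * bits_per_token))  # 341 (bps)

# The same arithmetic reproduces the 12.5 Hz variant's rate:
print(round(12.5 * bits_per_token))           # 171 (bps)
```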

## Model Comparison

| Model | Token Rate | Vocab Size | Bit Rate | Sample Rate | SSL Encoder | Vocoder | Parameters | Highlights |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :--- |
| **MioCodec-25Hz-24kHz** | **25 Hz** | **12,800** | **341 bps** | **24 kHz** | **WavLM-base+** | **None (iSTFTHead)** | **132M** | **Lightweight, fast inference** |
| MioCodec-25Hz-44.1kHz | 25 Hz | 12,800 | 341 bps | 44.1 kHz | WavLM-base+ | [MioVocoder](https://huggingface.co/Aratako/MioVocoder) (jointly tuned) | 118M (w/o vocoder) | High quality, high sample rate |
| kanade-25hz | 25 Hz | 12,800 | 341 bps | 24 kHz | WavLM-base+ | Vocos 24kHz | 118M (w/o vocoder) | Original 25 Hz model |
| kanade-12.5hz | 12.5 Hz | 12,800 | 171 bps | 24 kHz | WavLM-base+ | Vocos 24kHz | 120M (w/o vocoder) | Original 12.5 Hz model |

## Quick Start

### Installation

```bash
# Install via pip
pip install git+https://github.com/Aratako/MioCodec

# Or using uv
uv add git+https://github.com/Aratako/MioCodec
```

### Basic Inference

Basic usage for encoding and decoding audio:

```python
import soundfile as sf

from miocodec import MioCodecModel, load_audio

# 1. Load the model
model = MioCodecModel.from_pretrained("Aratako/MioCodec-25Hz-24kHz").eval().cuda()

# 2. Load audio at the model's sample rate
waveform = load_audio("input.wav", sample_rate=model.config.sample_rate).cuda()

# 3. Encode to content tokens and a global embedding
features = model.encode(waveform)

# 4. Decode back to a waveform (directly, no vocoder needed)
resynth = model.decode(
    content_token_indices=features.content_token_indices,
    global_embedding=features.global_embedding,
)

# 5. Save the reconstruction
sf.write("output.wav", resynth.cpu().numpy(), model.config.sample_rate)
```
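One practical consequence of the 25 Hz token rate: sequence lengths for downstream language modeling are easy to budget, since at 24 kHz each content token covers 960 input samples. A small sketch (the helper below is illustrative, not part of the miocodec API):

```python
SAMPLE_RATE = 24_000   # Hz, the model's native sample rate
TOKEN_RATE = 25        # content tokens per second

SAMPLES_PER_TOKEN = SAMPLE_RATE // TOKEN_RATE  # 960 samples per token

def approx_token_count(num_samples: int) -> int:
    """Rough content-token count for a clip of num_samples (ignores padding)."""
    return num_samples // SAMPLES_PER_TOKEN

print(approx_token_count(SAMPLE_RATE * 10))  # 250 tokens for 10 s of audio
```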

### Voice Conversion (Zero-shot)

MioCodec allows you to swap speaker identities by combining the content tokens of a source with the global embedding of a reference.

```python
source = load_audio("source_content.wav", sample_rate=model.config.sample_rate).cuda()
reference = load_audio("target_speaker.wav", sample_rate=model.config.sample_rate).cuda()

# Perform conversion
vc_wave = model.voice_conversion(source, reference)
sf.write("converted.wav", vc_wave.cpu().numpy(), model.config.sample_rate)
```

## Training Methodology

MioCodec-25Hz-24kHz was trained in two phases, with an integrated wave decoder that synthesizes waveforms directly via iSTFT.

### Phase 1: Feature Alignment

The model is trained to minimize both a **multi-resolution Mel-spectrogram loss** and an **SSL feature reconstruction loss** (using WavLM-base+). Since the wave decoder generates waveforms directly, both losses are computed on the reconstructed audio.

* **Multi-resolution Mel-spectrogram loss:** window lengths of `[32, 64, 128, 256, 512, 1024, 2048]`.
* **SSL feature reconstruction loss:** WavLM-base+ features.

### Phase 2: Adversarial Refinement

Building on Phase 1, adversarial training is introduced to improve perceptual quality. The training objectives include:

* **Multi-resolution Mel-spectrogram loss:** window lengths of `[32, 64, 128, 256, 512, 1024, 2048]`.
* **SSL feature reconstruction loss:** WavLM-base+ features.
* **Multi-Period Discriminator (MPD):** periods of `[2, 3, 5, 7, 11, 17, 23]`.
* **Multi-Scale STFT Discriminator (MS-STFTD):** FFT sizes of `[118, 190, 310, 502, 814, 1314, 2128, 3444]`.
* **RMS loss:** stabilizes energy and volume.
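To make the multi-resolution idea concrete, here is a minimal NumPy sketch of such a loss, using plain magnitude STFTs rather than the Mel front end the model actually uses; only the window lengths are taken from the list above, everything else (Hann window, 75% overlap, L1 distance) is an illustrative assumption:

```python
import numpy as np

def stft_mag(x: np.ndarray, win_len: int) -> np.ndarray:
    """Magnitude STFT with a Hann window and 75% overlap."""
    hop = win_len // 4
    win = np.hanning(win_len)
    n_frames = 1 + (len(x) - win_len) // hop
    frames = np.stack([x[i * hop : i * hop + win_len] * win for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=-1))

def multi_resolution_loss(ref: np.ndarray, est: np.ndarray,
                          win_lens=(32, 64, 128, 256, 512, 1024, 2048)) -> float:
    """Average L1 distance between magnitude spectra across all resolutions."""
    return float(np.mean([np.mean(np.abs(stft_mag(ref, w) - stft_mag(est, w)))
                          for w in win_lens]))

t = np.linspace(0, 1, 24_000, endpoint=False)
clean = np.sin(2 * np.pi * 220 * t)
print(multi_resolution_loss(clean, clean))            # 0.0 for identical signals
print(multi_resolution_loss(clean, 0.5 * clean) > 0)  # True: mismatches are penalized
```

Combining several window lengths penalizes errors at both fine temporal and fine spectral scales, which a single fixed resolution cannot do.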

## Training Data

The training datasets are listed below:

| Language | Approx. Hours | Dataset |
| :--- | :--- | :--- |
| **Japanese** | ~22,500h | Various public HF datasets |
| **English** | ~500h | [Libriheavy-HQ](https://huggingface.co/datasets/mythicinfinity/Libriheavy-HQ) |
| **English** | ~4,000h | [MLS-Sidon](https://huggingface.co/datasets/sarulab-speech/mls_sidon) |
| **English** | ~9,000h | [HiFiTTS-2](https://huggingface.co/datasets/nvidia/hifitts-2) |
| **English** | ~27,000h | [Emilia-YODAS](https://huggingface.co/datasets/amphion/Emilia-Dataset) |
| **German** | ~1,950h | [MLS-Sidon](https://huggingface.co/datasets/sarulab-speech/mls_sidon) |
| **German** | ~5,600h | [Emilia-YODAS](https://huggingface.co/datasets/amphion/Emilia-Dataset) |
| **Dutch** | ~1,550h | [MLS-Sidon](https://huggingface.co/datasets/sarulab-speech/mls_sidon) |
| **French** | ~1,050h | [MLS-Sidon](https://huggingface.co/datasets/sarulab-speech/mls_sidon) |
| **French** | ~7,400h | [Emilia-YODAS](https://huggingface.co/datasets/amphion/Emilia-Dataset) |
| **Spanish** | ~900h | [MLS-Sidon](https://huggingface.co/datasets/sarulab-speech/mls_sidon) |
| **Italian** | ~240h | [MLS-Sidon](https://huggingface.co/datasets/sarulab-speech/mls_sidon) |
| **Portuguese** | ~160h | [MLS-Sidon](https://huggingface.co/datasets/sarulab-speech/mls_sidon) |
| **Polish** | ~100h | [MLS-Sidon](https://huggingface.co/datasets/sarulab-speech/mls_sidon) |
| **Korean** | ~7,300h | [Emilia-YODAS](https://huggingface.co/datasets/amphion/Emilia-Dataset) |
| **Chinese** | ~300h | [Emilia-YODAS](https://huggingface.co/datasets/amphion/Emilia-Dataset) |

## Acknowledgements

* **Codec architecture:** based on the brilliant work of [kanade-tokenizer](https://github.com/frothywater/kanade-tokenizer).
* **Decoder design:** inspired by [XCodec2](https://github.com/zhenye234/X-Codec-2.0).
* **Training techniques:** training objectives were inspired by [XCodec2](https://github.com/zhenye234/X-Codec-2.0) and [Inworld TTS-1](https://arxiv.org/html/2507.21138v1).

## Citation

```bibtex
@misc{miocodec-25hz-24khz,
  author       = {Chihiro Arata},
  title        = {MioCodec: High-Fidelity Neural Audio Codec for Efficient Spoken Language Modeling},
  year         = {2026},
  publisher    = {Hugging Face},
  journal      = {Hugging Face repository},
  howpublished = {\url{https://huggingface.co/Aratako/MioCodec-25Hz-24kHz}}
}
```