Add files using upload-large-folder tool

- README.md +123 -0
- config.yaml +101 -0
- model.safetensors +3 -0

README.md (ADDED)
---
license: mit
language:
- en
- ja
- nl
- fr
- de
- it
- pl
- pt
- es
- ko
- zh
tags:
- speech
- audio
- tokenizer
datasets:
- sarulab-speech/mls_sidon
- mythicinfinity/Libriheavy-HQ
- nvidia/hifitts-2
pipeline_tag: audio-to-audio
base_model:
- Aratako/MioCodec-25Hz-24kHz
---

# MioCodec-25Hz-44.1kHz-v2: Lightweight Neural Audio Codec for Efficient Spoken Language Modeling

[GitHub](https://github.com/Aratako/MioCodec)

**MioCodec-25Hz-44.1kHz-v2** is an upsampled, high-fidelity version of the [MioCodec-25Hz-24kHz](https://huggingface.co/Aratako/MioCodec-25Hz-24kHz) model.

By integrating an **UpsamplerBlock** inspired by [Inworld TTS-1](https://arxiv.org/abs/2507.21138) into the decoder, this model reconstructs 44.1 kHz audio from the standard 25 Hz token stream.

## 🌟 What's New in v2

This model is a fine-tuned version of `MioCodec-25Hz-24kHz` with the following architectural enhancements:

* **44.1 kHz Output:** Achieves higher audio fidelity than the base 24 kHz model.
* **UpsamplerBlock + SnakeBeta:** We adopted the UpsamplerBlock architecture from [Inworld TTS-1](https://arxiv.org/abs/2507.21138) and enhanced it with SnakeBeta activations. This combination lets the decoder predict and generate high-frequency components, enabling clear 44.1 kHz reconstruction from the lower-resolution input.
* **Token Compatibility:** The content branch was frozen during fine-tuning, so the discrete tokens produced by this model are identical to those from `MioCodec-25Hz-24kHz`. Any TTS model trained on the 24 kHz tokens can simply swap in this v2 codec at inference time to upgrade its output quality to 44.1 kHz.
## 📊 Model Comparison

| Model | Token Rate | Vocab Size | Bit Rate | Sample Rate | SSL Encoder | Vocoder | Parameters | Highlights |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :--- |
| **MioCodec-25Hz-44.1kHz-v2** | **25 Hz** | **12,800** | **341 bps** | **44.1 kHz** | **WavLM-base+** | **- (iSTFTHead)** | **133M** | **Fast inference, good quality** |
| MioCodec-25Hz-24kHz | 25 Hz | 12,800 | 341 bps | 24 kHz | WavLM-base+ | - (iSTFTHead) | 132M | Lightweight, fast inference |
| MioCodec-25Hz-44.1kHz | 25 Hz | 12,800 | 341 bps | 44.1 kHz | WavLM-base+ | [MioVocoder](https://huggingface.co/Aratako/MioVocoder) (jointly tuned) | 118M (w/o vocoder) | High quality, high sample rate |
| kanade-25hz | 25 Hz | 12,800 | 341 bps | 24 kHz | WavLM-base+ | Vocos 24kHz | 118M (w/o vocoder) | Original 25 Hz model |
| kanade-12.5hz | 12.5 Hz | 12,800 | 171 bps | 24 kHz | WavLM-base+ | Vocos 24kHz | 120M (w/o vocoder) | Original 12.5 Hz model |
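The bit rates above follow directly from token rate times bits per token: a 12,800-entry codebook carries log2(12800) ≈ 13.64 bits per token, so 25 Hz yields ≈ 341 bps and 12.5 Hz yields ≈ 171 bps. A quick sanity check:

```python
import math

def bitrate_bps(token_rate_hz: float, vocab_size: int) -> float:
    """Bits per second = tokens per second * bits per token."""
    return token_rate_hz * math.log2(vocab_size)

print(round(bitrate_bps(25.0, 12_800)))   # 341
print(round(bitrate_bps(12.5, 12_800)))   # 171
```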
## 🚀 Quick Start

### Installation

```bash
# Install via pip
pip install git+https://github.com/Aratako/MioCodec

# Or using uv
uv add git+https://github.com/Aratako/MioCodec
```

### Basic Inference

Basic usage for encoding and decoding audio:

```python
import soundfile as sf

from miocodec import MioCodecModel, load_audio

# 1. Load model
model = MioCodecModel.from_pretrained("Aratako/MioCodec-25Hz-44.1kHz-v2").eval().cuda()

# 2. Load audio
waveform = load_audio("input.wav", sample_rate=model.config.sample_rate).cuda()

# 3. Encode audio
features = model.encode(waveform)

# 4. Decode to waveform (directly, no vocoder needed)
resynth = model.decode(
    content_token_indices=features.content_token_indices,
    global_embedding=features.global_embedding,
)

# 5. Save
sf.write("output.wav", resynth.cpu().numpy(), model.config.sample_rate)
```

### Voice Conversion (Zero-shot)

MioCodec lets you swap speaker identities by combining the content tokens of a source with the global embedding of a reference:

```python
source = load_audio("source_content.wav", sample_rate=model.config.sample_rate).cuda()
reference = load_audio("target_speaker.wav", sample_rate=model.config.sample_rate).cuda()

# Perform conversion
vc_wave = model.voice_conversion(source, reference)
sf.write("converted.wav", vc_wave.cpu().numpy(), model.config.sample_rate)
```

## 📜 Acknowledgements

* **Codec Architecture:** Based on the brilliant work of [kanade-tokenizer](https://github.com/frothywater/kanade-tokenizer).
* **Decoder Design:** Inspired by [XCodec2](https://github.com/zhenye234/X-Codec-2.0) and [Inworld TTS-1](https://arxiv.org/abs/2507.21138).

## 🖊️ Citation

```bibtex
@misc{miocodec-25hz-44.1khz-v2,
  author       = {Chihiro Arata},
  title        = {MioCodec: High-Fidelity Neural Audio Codec for Efficient Spoken Language Modeling},
  year         = {2026},
  publisher    = {Hugging Face},
  journal      = {Hugging Face repository},
  howpublished = {\url{https://huggingface.co/Aratako/MioCodec-25Hz-44.1kHz-v2}}
}
```
config.yaml (ADDED)
```yaml
model:
  class_path: miocodec.model.MioCodecModel
  init_args:
    config:
      # SSL feature settings
      local_ssl_layers: [6, 9]
      global_ssl_layers: [1, 2]
      normalize_ssl_features: true

      # Down/up-sampling settings
      downsample_factor: 2
      use_conv_downsample: true

      # Audio settings - 44.1kHz with xcodec2-style iSTFT
      sample_rate: 44100
      n_fft: 392       # hop_length * 4
      hop_length: 98   # Same as Anime-XCodec2, gives ~450 fps STFT

      # Wave decoder settings
      use_wave_decoder: true
      wave_upsample_factor: 2  # Conv upsample: 25 Hz tokens -> 50 Hz
      wave_interpolation_mode: linear
      wave_decoder_dim: 512
      wave_resnet_num_blocks: 2
      wave_resnet_kernel_size: 3
      wave_resnet_num_groups: 32
      wave_resnet_dropout: 0.1
      istft_padding: same
      # Upsampler with SnakeBeta: 50 Hz -> 450 Hz (9x upsampling for 44.1 kHz output)
      wave_upsampler_factors: [3, 3]
      wave_upsampler_kernel_sizes: [9, 9]

    ssl_feature_extractor:
      class_path: miocodec.module.ssl_extractor.SSLFeatureExtractor
      init_args:
        model_name: wavlm_base_plus
        output_layer: 9
        sample_rate: 44100

    local_encoder:
      class_path: miocodec.module.transformer.Transformer
      init_args:
        dim: 768
        n_layers: 6
        n_heads: 12
        window_size: 125
        use_rope: true
        rope_theta: 10000.0
        max_seq_len: 512
        use_flash_attention: true

    local_quantizer:
      class_path: miocodec.module.fsq.FiniteScalarQuantizer
      init_args:
        input_dim: 768
        output_dim: 768
        levels: [8, 8, 8, 5, 5]  # 12800

    feature_decoder: null

    global_encoder:
      class_path: miocodec.module.global_encoder.GlobalEncoder
      init_args:
        input_channels: 768
        output_channels: 128
        num_layers: 4
        dim: 384
        intermediate_dim: 1152

    # Mel decoder not used
    mel_prenet: null
    mel_decoder: null
    mel_postnet: null

    # Wave decoder components
    wave_prenet:
      class_path: miocodec.module.transformer.Transformer
      init_args:
        dim: 768
        output_dim: 512
        n_layers: 6
        n_heads: 12
        window_size: 65
        use_rope: true
        rope_theta: 10000.0
        max_seq_len: 512
        use_flash_attention: true

    wave_decoder:
      class_path: miocodec.module.transformer.Transformer
      init_args:
        dim: 512
        n_layers: 8
        n_heads: 8
        window_size: 65
        use_rope: true
        rope_theta: 10000.0
        max_seq_len: 512
        adanorm_condition_dim: 128
        use_adaln_zero: true
        use_flash_attention: true
```
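The rate bookkeeping in this config is internally consistent and easy to verify: the FSQ levels multiply to the 12,800-entry vocabulary, the conv upsample plus the `[3, 3]` upsampler factors take 25 Hz tokens to 450 STFT frames per second, and 450 frames/s times `hop_length` 98 recovers the 44.1 kHz sample rate:

```python
import math

levels = [8, 8, 8, 5, 5]              # local_quantizer FSQ levels
print(math.prod(levels))              # 12800 codebook entries

token_rate = 25                       # Hz
frame_rate = token_rate * 2 * 3 * 3   # wave_upsample_factor, then wave_upsampler_factors [3, 3]
print(frame_rate)                     # 450 frames per second

hop_length = 98
print(frame_rate * hop_length)        # 44100 samples per second
```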
model.safetensors (ADDED)

Git LFS pointer (the weights themselves are stored in LFS):

```
version https://git-lfs.github.com/spec/v1
oid sha256:8e319ef2231bad184f17cb73fd5a21b685c25c6c1622ef33ed9271187e81cd4a
size 528105436
```