Aratako committed · Commit 67faba3 · verified · 1 Parent(s): 25cfe08

Add files using upload-large-folder tool

Files changed (3)
  1. README.md +123 -0
  2. config.yaml +101 -0
  3. model.safetensors +3 -0
README.md ADDED
@@ -0,0 +1,123 @@
---
license: mit
language:
- en
- ja
- nl
- fr
- de
- it
- pl
- pt
- es
- ko
- zh
tags:
- speech
- audio
- tokenizer
datasets:
- sarulab-speech/mls_sidon
- mythicinfinity/Libriheavy-HQ
- nvidia/hifitts-2
pipeline_tag: audio-to-audio
base_model:
- Aratako/MioCodec-25Hz-24kHz
---

# MioCodec-25Hz-44.1kHz-v2: Lightweight Neural Audio Codec for Efficient Spoken Language Modeling

[![GitHub](https://img.shields.io/badge/Code-GitHub-black)](https://github.com/Aratako/MioCodec)

**MioCodec-25Hz-44.1kHz-v2** is an upsampled, high-fidelity version of the [MioCodec-25Hz-24kHz](https://huggingface.co/Aratako/MioCodec-25Hz-24kHz) model.

By integrating an **UpsamplerBlock** inspired by [Inworld TTS-1](https://arxiv.org/abs/2507.21138) into the decoder, this model reconstructs 44.1 kHz audio from the standard 25 Hz token stream.

## 🌟 What's New in v2

This model is a fine-tuned version of `MioCodec-25Hz-24kHz` with the following architectural enhancements:

* **44.1 kHz Output:** Achieves higher audio fidelity than the base 24 kHz model.
* **UpsamplerBlock + SnakeBeta:** We adopted the UpsamplerBlock architecture from [Inworld TTS-1](https://arxiv.org/abs/2507.21138) and enhanced it with SnakeBeta activations. This combination lets the decoder predict and generate high-frequency components, enabling clear 44.1 kHz reconstruction from the lower-resolution input.
* **Token Compatibility:** The content branch was frozen during fine-tuning, so the discrete tokens produced by this model are identical to those from `MioCodec-25Hz-24kHz`. Any TTS model trained on the 24 kHz tokens can swap in this v2 codec at inference time to upgrade its output quality to 44.1 kHz.

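To make the activation concrete, here is a minimal NumPy sketch of SnakeBeta as defined in the BigVGAN line of work (x + sin²(αx)/β); the actual MioCodec implementation may parameterize α and β differently (e.g. as learnable, per-channel, log-scale tensors), so treat this as an illustration rather than the model's code:

```python
import numpy as np

def snake_beta(x: np.ndarray, alpha: float, beta: float, eps: float = 1e-9) -> np.ndarray:
    """SnakeBeta activation: x + (1 / beta) * sin^2(alpha * x).

    The periodic term gives the decoder an inductive bias toward
    harmonic structure, which helps when generating high frequencies.
    """
    return x + (1.0 / (beta + eps)) * np.sin(alpha * x) ** 2

t = np.linspace(0.0, 1.0, 8)
y = snake_beta(t, alpha=2.0, beta=1.0)
```

In the codec itself, `alpha` and `beta` would be per-channel parameters learned jointly with the UpsamplerBlock weights.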
## 📊 Model Comparison

| Model | Token Rate | Vocab Size | Bit Rate | Sample Rate | SSL Encoder | Vocoder | Parameters | Highlights |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :--- |
| **MioCodec-25Hz-44.1kHz-v2** | **25 Hz** | **12,800** | **341 bps** | **44.1 kHz** | **WavLM-base+** | **- (iSTFTHead)** | **133M** | **Fast inference, good quality** |
| MioCodec-25Hz-24kHz | 25 Hz | 12,800 | 341 bps | 24 kHz | WavLM-base+ | - (iSTFTHead) | 132M | Lightweight, fast inference |
| MioCodec-25Hz-44.1kHz | 25 Hz | 12,800 | 341 bps | 44.1 kHz | WavLM-base+ | [MioVocoder](https://huggingface.co/Aratako/MioVocoder) (Jointly Tuned) | 118M (w/o vocoder) | High-quality, high sample rate |
| kanade-25hz | 25 Hz | 12,800 | 341 bps | 24 kHz | WavLM-base+ | Vocos 24kHz | 118M (w/o vocoder) | Original 25 Hz model |
| kanade-12.5hz | 12.5 Hz | 12,800 | 171 bps | 24 kHz | WavLM-base+ | Vocos 24kHz | 120M (w/o vocoder) | Original 12.5 Hz model |

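The bit rates in the table follow directly from token rate × bits per token (log₂ of the vocabulary size), which is easy to verify:

```python
import math

def bitrate_bps(token_rate_hz: float, vocab_size: int) -> float:
    # bits per second = tokens per second * bits per token
    return token_rate_hz * math.log2(vocab_size)

print(round(bitrate_bps(25, 12800)))    # 25 Hz models -> 341 bps
print(round(bitrate_bps(12.5, 12800)))  # kanade-12.5hz -> 171 bps
```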
## 🚀 Quick Start

### Installation

```bash
# Install via pip
pip install git+https://github.com/Aratako/MioCodec

# Or using uv
uv add git+https://github.com/Aratako/MioCodec
```

### Basic Inference

Basic usage for encoding and decoding audio:

```python
from miocodec import MioCodecModel, load_audio
import soundfile as sf

# 1. Load the model
model = MioCodecModel.from_pretrained("Aratako/MioCodec-25Hz-44.1kHz-v2").eval().cuda()

# 2. Load audio at the model's sample rate
waveform = load_audio("input.wav", sample_rate=model.config.sample_rate).cuda()

# 3. Encode audio into discrete content tokens and a global (speaker) embedding
features = model.encode(waveform)

# 4. Decode back to a waveform (directly, no vocoder needed)
resynth = model.decode(
    content_token_indices=features.content_token_indices,
    global_embedding=features.global_embedding,
)

# 5. Save
sf.write("output.wav", resynth.cpu().numpy(), model.config.sample_rate)
```

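Because the token stream is only 25 Hz, sequence lengths for downstream language models stay short. A back-of-envelope helper (not part of the `miocodec` API) makes the point:

```python
def num_content_tokens(duration_s: float, token_rate_hz: float = 25.0) -> int:
    # discrete content tokens produced for a clip of the given duration
    return round(duration_s * token_rate_hz)

print(num_content_tokens(10.0))  # 10 s of audio -> 250 tokens
```

Compare that with roughly 441,000 raw samples for the same 10 s clip at 44.1 kHz.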
### Voice Conversion (Zero-shot)

MioCodec lets you swap speaker identities by combining the content tokens of a source utterance with the global embedding of a reference speaker.

```python
source = load_audio("source_content.wav", sample_rate=model.config.sample_rate).cuda()
reference = load_audio("target_speaker.wav", sample_rate=model.config.sample_rate).cuda()

# Content from `source`, voice from `reference`
vc_wave = model.voice_conversion(source, reference)
sf.write("converted.wav", vc_wave.cpu().numpy(), model.config.sample_rate)
```

## 📜 Acknowledgements

* **Codec Architecture:** Based on the brilliant work of [kanade-tokenizer](https://github.com/frothywater/kanade-tokenizer).
* **Decoder Design:** Inspired by [XCodec2](https://github.com/zhenye234/X-Codec-2.0) and [Inworld TTS-1](https://arxiv.org/abs/2507.21138).

## 🖊️ Citation

```bibtex
@misc{miocodec-25hz-44.1khz-v2,
  author       = {Chihiro Arata},
  title        = {MioCodec: High-Fidelity Neural Audio Codec for Efficient Spoken Language Modeling},
  year         = {2026},
  publisher    = {Hugging Face},
  journal      = {Hugging Face repository},
  howpublished = {\url{https://huggingface.co/Aratako/MioCodec-25Hz-44.1kHz-v2}}
}
```
config.yaml ADDED
@@ -0,0 +1,101 @@
model:
  class_path: miocodec.model.MioCodecModel
  init_args:
    config:
      # SSL feature settings
      local_ssl_layers: [6, 9]
      global_ssl_layers: [1, 2]
      normalize_ssl_features: true

      # Down/up-sampling settings
      downsample_factor: 2
      use_conv_downsample: true

      # Audio settings - 44.1 kHz with xcodec2-style iSTFT
      sample_rate: 44100
      n_fft: 392  # hop_length * 4
      hop_length: 98  # same as Anime-XCodec2; gives a 450 fps STFT (44100 / 98)

      # Wave decoder settings
      use_wave_decoder: true
      wave_upsample_factor: 2  # conv upsample: 25 Hz tokens -> 50 Hz
      wave_interpolation_mode: linear
      wave_decoder_dim: 512
      wave_resnet_num_blocks: 2
      wave_resnet_kernel_size: 3
      wave_resnet_num_groups: 32
      wave_resnet_dropout: 0.1
      istft_padding: same
      # UpsamplerBlock with SnakeBeta: 50 Hz -> 450 Hz (9x upsampling for 44.1 kHz output)
      wave_upsampler_factors: [3, 3]
      wave_upsampler_kernel_sizes: [9, 9]

    ssl_feature_extractor:
      class_path: miocodec.module.ssl_extractor.SSLFeatureExtractor
      init_args:
        model_name: wavlm_base_plus
        output_layer: 9
        sample_rate: 44100

    local_encoder:
      class_path: miocodec.module.transformer.Transformer
      init_args:
        dim: 768
        n_layers: 6
        n_heads: 12
        window_size: 125
        use_rope: true
        rope_theta: 10000.0
        max_seq_len: 512
        use_flash_attention: true

    local_quantizer:
      class_path: miocodec.module.fsq.FiniteScalarQuantizer
      init_args:
        input_dim: 768
        output_dim: 768
        levels: [8, 8, 8, 5, 5]  # 8 * 8 * 8 * 5 * 5 = 12800 codes

    feature_decoder: null

    global_encoder:
      class_path: miocodec.module.global_encoder.GlobalEncoder
      init_args:
        input_channels: 768
        output_channels: 128
        num_layers: 4
        dim: 384
        intermediate_dim: 1152

    # Mel decoder not used
    mel_prenet: null
    mel_decoder: null
    mel_postnet: null

    # Wave decoder components
    wave_prenet:
      class_path: miocodec.module.transformer.Transformer
      init_args:
        dim: 768
        output_dim: 512
        n_layers: 6
        n_heads: 12
        window_size: 65
        use_rope: true
        rope_theta: 10000.0
        max_seq_len: 512
        use_flash_attention: true

    wave_decoder:
      class_path: miocodec.module.transformer.Transformer
      init_args:
        dim: 512
        n_layers: 8
        n_heads: 8
        window_size: 65
        use_rope: true
        rope_theta: 10000.0
        max_seq_len: 512
        adanorm_condition_dim: 128
        use_adaln_zero: true
        use_flash_attention: true
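Two arithmetic sanity checks on the numbers in this config (plain Python, independent of the `miocodec` code): the FSQ levels multiply out to the 12,800-entry vocabulary, and the upsampling chain times the iSTFT hop reproduces the 44.1 kHz output rate.

```python
from math import prod

# FSQ codebook size: product of per-dimension levels
levels = [8, 8, 8, 5, 5]
assert prod(levels) == 12800             # vocab size in the model card

# Sample-rate chain: 25 Hz tokens -> x2 conv -> x3, x3 UpsamplerBlock -> iSTFT
token_rate, hop_length, n_fft = 25, 98, 392
frame_rate = token_rate * 2 * prod([3, 3])
assert frame_rate == 450                 # STFT frames per second
assert frame_rate * hop_length == 44100  # output sample rate
assert n_fft == hop_length * 4           # matches the config comment
```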
model.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:8e319ef2231bad184f17cb73fd5a21b685c25c6c1622ef33ed9271187e81cd4a
size 528105436