---
license: cc-by-nc-4.0
---
<div align="center">
    <h1>
    DistilCodec
    </h1>
    <p>
    <b><em>DistilCodec: A Single Codebook Audio Codec For Universal Audio</em></b>
   </p>
    <p>
    </p>
    <a href="https://arxiv.org/abs/2505.17426" style="color:red">Paper </a> |  
    <a href="https://huggingface.co/IDEA-Emdoor/DistilCodec-v1.0" style="color:#FFD700">HuggingFace Model</a> |
    <a href="https://github.com/IDEA-Emdoor-Lab/DistilCodec" style="color:gray">Code</a>
     <p>
        <img src="./idea_logo.png" alt="Institution 1" style="width: 200px; height: 60px;">
     </p>
     <p>
        <img src="./yidao_logo.png" alt="Institution 2" style="width: 200px; height: 60px;">
        <img src="./yijiayiban.png" alt="Institution 3" style="width: 200px; height: 60px;">
    </p>
</div>


# 🔥 News
- *2025.05.26*: We release DistilCodec-v1.0 checkpoint on [huggingface](https://huggingface.co/IDEA-Emdoor/DistilCodec-v1.0).
- *2025.05.26*: The paper is available on [arxiv](https://arxiv.org/abs/2505.17426).
- *2025.05.23*: We submitted the paper to arXiv.

## Introduction of DistilCodec
The Joint Laboratory of the International Digital Economy Academy (IDEA) and Emdoor, in collaboration with Emdoor Information Technology Co., Ltd., and Shenzhen Yijiayiban Information Technology Co., Ltd., has launched DistilCodec, a single-codebook Neural Audio Codec (NAC) with 32,768 codes trained on universal audio. The foundational network architecture of DistilCodec adopts an Encoder-VQ-Decoder framework similar to that proposed in SoundStream. The encoder employs a ConvNeXt-V2 structure, while the vector quantization module implements the GRFVQ scheme. The decoder employs a ConvTranspose1d-based architecture similar to HiFi-GAN. The training methodology of DistilCodec likewise follows HiFi-GAN, incorporating three types of discriminators: Multi-Period Discriminator (MPD), Multi-Scale Discriminator (MSD), and Multi-STFT Discriminator (MSFTFD). Here is the architecture of DistilCodec:
<img src="./figure.jpg" alt="The Architecture of DistilCodec" style="width: 100%; height: auto;" />
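To illustrate the single-codebook quantization step at the core of this design, here is a minimal nearest-neighbor VQ sketch in NumPy. The function names and tensor shapes are illustrative only, not DistilCodec's actual API:

```python
import numpy as np

def quantize(latents, codebook):
    """Map each latent frame to its nearest codebook entry (squared L2 distance)."""
    # latents: (T, D) encoder outputs; codebook: (K, D) learned code vectors
    dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    codes = dists.argmin(axis=1)      # (T,) discrete audio tokens
    return codes, codebook[codes]     # token IDs and their dequantized vectors

rng = np.random.default_rng(0)
codebook = rng.normal(size=(32768, 8)).astype(np.float32)  # 32768 codes, as in DistilCodec
latents = rng.normal(size=(16, 8)).astype(np.float32)      # 16 toy latent frames
codes, dequantized = quantize(latents, codebook)
```

The decoder then reconstructs the waveform from the dequantized vectors; a single large codebook keeps each frame a single token, which is what makes the codes convenient as LLM inputs.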
The distribution of DistilCodec's training data is shown in the table below:
| **Data Category**           | **Data Size (in hours)** |
|-----------------------------|--------------------------|
| Chinese Audiobook           | 38000                    |
| Chinese Common Audio        | 20000                    |
| English Audiobook           | 10000                    |
| English Speech              | 30000                    |
| Music                       | 2000                     |
| **Total**                   | **100000**               |

## Inference of DistilCodec
The code is in github [DistilCodec](https://github.com/IDEA-Emdoor-Lab/DistilCodec).

### Part1: Generating discrete audio tokens from DistilCodec

```python
from distil_codec import DistilCodec, demo_for_generate_audio_codes

codec_model_config_path = '/path/to/distilcodec/model_config.json'
codec_ckpt_path = '/path/to/distilcodec_ckpt'
step = 204000

codec = DistilCodec.from_pretrained(
    config_path=codec_model_config_path,
    model_path=codec_ckpt_path,
    load_steps=step,
    use_generator=True,
    is_debug=False).eval()

audio_path = '/path/to/audio_file'
audio_tokens = demo_for_generate_audio_codes(
    codec,
    audio_path,
    target_sr=24000,
    plus_llm_offset=True  # If True, the LLM's vocabulary size is added to each audio token. DistilCodec's default offset is taken from Qwen2.5-7B.
)
print(audio_tokens)
```
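The `plus_llm_offset` behavior amounts to simple index shifting so that codec tokens and LLM text tokens occupy disjoint ID ranges. A minimal sketch (the offset value below is a placeholder for illustration, not the actual Qwen2.5-7B figure):

```python
# Placeholder offset for illustration; the real value is the LLM's vocabulary size.
llm_vocab_size = 150_000

codec_codes = [17, 4096, 32767]                      # raw DistilCodec token IDs
shifted = [c + llm_vocab_size for c in codec_codes]  # what plus_llm_offset=True yields
restored = [t - llm_vocab_size for t in shifted]     # what minus_token_offset=True undoes

assert restored == codec_codes
```

This is why `minus_token_offset` must mirror `plus_llm_offset` during decoding (see Part 2): the shift has to be removed before the codes index the codebook again.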

### Part2: Reconstruct audio from raw audio
```python
from distil_codec import DistilCodec, demo_for_generate_audio_codes

codec_model_config_path = '/path/to/distilcodec/model_config.json'
codec_ckpt_path = '/path/to/distilcodec_ckpt'
step = 204000

codec = DistilCodec.from_pretrained(
    config_path=codec_model_config_path,
    model_path=codec_ckpt_path,
    load_steps=step,
    use_generator=True,
    is_debug=False).eval()

audio_path = '/path/to/audio_file'
audio_tokens = demo_for_generate_audio_codes(
    codec,
    audio_path,
    target_sr=24000,
    plus_llm_offset=True  # If True, the LLM's vocabulary size is added to each audio token. DistilCodec's default offset is taken from Qwen2.5-7B.
)
print(audio_tokens)

# The generated audio is saved to f'{gen_audio_save_path}/{audio_name}.wav'
gen_audio_save_path = '/path/to/audio_save_path'
audio_name = 'audio_name'
y_gen = codec.decode_from_codes(
    audio_tokens,
    minus_token_offset=True  # If plus_llm_offset was True in demo_for_generate_audio_codes, minus_token_offset must also be True.
)
codec.save_wav(
    audio_gen_batch=y_gen,
    nhop_lengths=[y_gen.shape[-1]],
    save_path=gen_audio_save_path,
    name_tag=audio_name
)
```

## Available DistilCodec models
| Model Version | Huggingface | Corpus | Tokens/s | Domain |
|-----------------------|---------|---------------|---------------|-----------------------------------|
| DistilCodec-v1.0 | [HuggingFace](https://huggingface.co/IDEA-Emdoor/DistilCodec-v1.0) | Universal Audio | 93 | Universal Audio |
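Since a single codebook of 32,768 entries carries log2(32768) = 15 bits per token, the 93 tokens/s rate in the table implies a bitrate of roughly 1.4 kbps:

```python
import math

tokens_per_second = 93
codebook_size = 32768

bits_per_token = math.log2(codebook_size)         # 15.0 bits per token
bitrate_bps = tokens_per_second * bits_per_token  # 1395.0 bps, i.e. about 1.4 kbps
print(bitrate_bps)
```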


## Citation

If you find our work useful in your research, please cite our work:

```
@misc{wang2025unittsendtoendttsdecoupling,
      title={UniTTS: An end-to-end TTS system without decoupling of acoustic and semantic information}, 
      author={Rui Wang and Qianguo Sun and Tianrong Chen and Zhiyun Zeng and Junlong Wu and Jiaxing Zhang},
      year={2025},
      eprint={2505.17426},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2505.17426}, 
}
```


## Disclaimer

DistilCodec provides universal audio discretization capabilities for academic research purposes only. We encourage the community to uphold safety and ethical principles in AI research and applications.

Important Notes:

- Compliance with the model's open-source license is mandatory.

- Unauthorized voice replication applications are strictly prohibited.

- The developers bear no responsibility for any misuse of this model.


## License
<a href="https://arxiv.org/abs/2505.17426">UniTTS: An end-to-end TTS system without decoupling of acoustic and semantic information</a> © 2025 by <a href="https://creativecommons.org">Rui Wang, Qianguo Sun, Tianrong Chen, Zhiyun Zeng, Junlong Wu, Jiaxing Zhang</a> is licensed under <a href="https://creativecommons.org/licenses/by-nc-nd/4.0/">CC BY-NC-ND 4.0</a><img src="https://mirrors.creativecommons.org/presskit/icons/cc.svg" style="max-width: 1em;max-height:1em;margin-left: .2em;"><img src="https://mirrors.creativecommons.org/presskit/icons/by.svg" style="max-width: 1em;max-height:1em;margin-left: .2em;"><img src="https://mirrors.creativecommons.org/presskit/icons/nc.svg" style="max-width: 1em;max-height:1em;margin-left: .2em;"><img src="https://mirrors.creativecommons.org/presskit/icons/nd.svg" style="max-width: 1em;max-height:1em;margin-left: .2em;">