---
license: cc-by-nc-4.0
---
# DistilCodec: A Single Codebook Audio Codec For Universal Audio
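[Paper](https://arxiv.org/abs/2505.17426) | [HuggingFace Model](https://huggingface.co/IDEA-Emdoor/DistilCodec-v1.0) | [Code](https://github.com/IDEA-Emdoor-Lab/DistilCodec)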
The distribution of DistilCodec's training data is shown in the table below:
| **Data Category** | **Data Size (in hours)** |
|-----------------------------|--------------------------|
| Chinese Audiobook | 38000 |
| Chinese Common Audio | 20000 |
| English Audiobook | 10000 |
| English Speech | 30000 |
| Music | 2000 |
| **Total** | **100000** |
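As a quick arithmetic check on the table, the illustrative Python snippet below recomputes the total and each category's share of training hours. The numbers are copied directly from the table; nothing else is assumed.

```python
# Training-data hours per category, copied from the table above.
hours = {
    "Chinese Audiobook": 38000,
    "Chinese Common Audio": 20000,
    "English Audiobook": 10000,
    "English Speech": 30000,
    "Music": 2000,
}

total = sum(hours.values())  # should match the Total row (100000)

# Share of each category as a percentage of all training hours.
shares = {name: 100 * h / total for name, h in hours.items()}

for name, pct in shares.items():
    print(f"{name:<22} {pct:5.1f}%")
```

By these figures, Chinese data makes up 58% of the corpus, English data 40%, and music 2%.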
## Inference of DistilCodec
The code is available on GitHub: [DistilCodec](https://github.com/IDEA-Emdoor-Lab/DistilCodec).
### Part 1: Generating discrete audio tokens with DistilCodec
```python
from distil_codec import DistilCodec, demo_for_generate_audio_codes

# Paths to the DistilCodec model config and checkpoint.
codec_model_config_path = '/path/to/distilcodec/model_config.json'
codec_ckpt_path = '/path/to/distilcodec_ckpt'
step = 204000  # training step of the checkpoint to load

# Load the codec and switch it to evaluation mode.
codec = DistilCodec.from_pretrained(
    config_path=codec_model_config_path,
    model_path=codec_ckpt_path,
    load_steps=step,
    use_generator=True,
    is_debug=False,
).eval()

audio_path = '/path/to/audio_file'
# Encode the audio file into discrete tokens at a target sample rate of 24 kHz.
audio_tokens = demo_for_generate_audio_codes(
    codec,
    audio_path,
    target_sr=24000,
    # If True, the LLM's vocabulary size is added to every audio token;
    # by default DistilCodec uses the vocabulary size of Qwen2.5-7B.
    plus_llm_offset=True,
)
print(audio_tokens)
```
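The `plus_llm_offset` option shifts audio token ids past the LLM's text vocabulary so that both kinds of tokens can share one embedding table. The sketch below illustrates that idea, assuming the offset is a plain additive shift by the vocabulary size, as the comment on `plus_llm_offset` describes. The helpers `add_llm_offset` and `remove_llm_offset` and the value `LLM_VOCAB_SIZE = 1000` are hypothetical placeholders, not part of the `distil_codec` API; the real offset is whatever Qwen2.5-7B vocabulary size DistilCodec is configured with.

```python
from typing import List


def add_llm_offset(codes: List[int], llm_vocab_size: int) -> List[int]:
    """Hypothetical helper: shift codec token ids past the LLM's text vocabulary."""
    return [c + llm_vocab_size for c in codes]


def remove_llm_offset(tokens: List[int], llm_vocab_size: int) -> List[int]:
    """Hypothetical helper: undo add_llm_offset before decoding with the codec."""
    if any(t < llm_vocab_size for t in tokens):
        raise ValueError("token id below the offset; was the offset applied?")
    return [t - llm_vocab_size for t in tokens]


# Placeholder vocabulary size for illustration only (not Qwen2.5-7B's real value).
LLM_VOCAB_SIZE = 1000
codes = [0, 17, 511]
shifted = add_llm_offset(codes, LLM_VOCAB_SIZE)
print(shifted)  # [1000, 1017, 1511]
assert remove_llm_offset(shifted, LLM_VOCAB_SIZE) == codes
```

This is also why decoding must remove the same offset that encoding added: if the two flags disagree, the decoder receives token ids outside the codec's own codebook range.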
### Part 2: Reconstructing audio from raw audio
```python
from distil_codec import DistilCodec, demo_for_generate_audio_codes

# Paths to the DistilCodec model config and checkpoint.
codec_model_config_path = '/path/to/distilcodec/model_config.json'
codec_ckpt_path = '/path/to/distilcodec_ckpt'
step = 204000  # training step of the checkpoint to load

codec = DistilCodec.from_pretrained(
    config_path=codec_model_config_path,
    model_path=codec_ckpt_path,
    load_steps=step,
    use_generator=True,
    is_debug=False,
).eval()

# Step 1: encode the input audio into discrete tokens.
audio_path = '/path/to/audio_file'
audio_tokens = demo_for_generate_audio_codes(
    codec,
    audio_path,
    target_sr=24000,
    # If True, the LLM's vocabulary size is added to every audio token;
    # by default DistilCodec uses the vocabulary size of Qwen2.5-7B.
    plus_llm_offset=True,
)
print(audio_tokens)

# Step 2: decode the tokens back into a waveform.
y_gen = codec.decode_from_codes(
    audio_tokens,
    # Must be True whenever plus_llm_offset=True was used when encoding,
    # so that the LLM vocabulary offset is removed before decoding.
    minus_token_offset=True,
)

# Step 3: save the result to f'{gen_audio_save_path}/{audio_name}.wav'.
gen_audio_save_path = '/path/to/audio_save_path'
audio_name = 'audio_name'
codec.save_wav(
    audio_gen_batch=y_gen,
    nhop_lengths=[y_gen.shape[-1]],
    save_path=gen_audio_save_path,
    name_tag=audio_name,
)
```
## Available DistilCodec models
| Model Version | HuggingFace | Corpus | Tokens/s | Domain |
|-----------------------|---------|---------------|---------------|-----------------------------------|
| DistilCodec-v1.0 | [HuggingFace](https://huggingface.co/IDEA-Emdoor/DistilCodec-v1.0) | Universal Audio | 93 | Universal Audio |
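Because DistilCodec uses a single codebook, each frame yields one token, so the 93 tokens/s rate translates directly into sequence length for a downstream LLM. The sketch below estimates token counts for clips of a given duration. It is only an approximation: `estimate_audio_tokens` is a hypothetical helper, and the exact count produced by the codec may differ slightly depending on padding and framing at the clip boundaries.

```python
import math

TOKENS_PER_SECOND = 93  # DistilCodec-v1.0 token rate from the table above


def estimate_audio_tokens(duration_s: float, rate: float = TOKENS_PER_SECOND) -> int:
    """Approximate number of single-codebook tokens for a clip of duration_s seconds."""
    return math.ceil(duration_s * rate)


for seconds in (1, 10, 60):
    print(f"{seconds:>3} s -> {estimate_audio_tokens(seconds)} tokens")
```

For example, a one-minute recording comes to roughly 5,580 audio tokens, which is useful when budgeting an LLM's context window.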
## Citation
If you find our work useful in your research, please cite it:
```
@misc{wang2025unittsendtoendttsdecoupling,
title={UniTTS: An end-to-end TTS system without decoupling of acoustic and semantic information},
author={Rui Wang and Qianguo Sun and Tianrong Chen and Zhiyun Zeng and Junlong Wu and Jiaxing Zhang},
year={2025},
eprint={2505.17426},
archivePrefix={arXiv},
primaryClass={cs.SD},
url={https://arxiv.org/abs/2505.17426},
}
```
## Disclaimer
DistilCodec provides universal audio discretization for academic research purposes only. We encourage the community to uphold safety and ethical principles in AI research and applications.
Important Notes:
- Compliance with the model's open-source license is mandatory.
- Unauthorized voice replication applications are strictly prohibited.
- Developers bear no responsibility for any misuse of this model.
## License
UniTTS: An end-to-end TTS system without decoupling of acoustic and semantic information © 2025 by Rui Wang, Qianguo Sun, Tianrong Chen, Zhiyun Zeng, Junlong Wu, Jiaxing Zhang is licensed under CC BY-NC-ND 4.0