---
license: apache-2.0
pipeline_tag: text-to-speech
---
# OmniCodec
OmniCodec: Low Frame Rate Universal Audio Codec with Semantic-Acoustic Disentanglement
- Demo Page: [OmniCodec Demo Page](https://hujingbin1.github.io/OmniCodec-Demo-Page/)
- Hugging Face: [ASLP-lab/OmniCodec](https://huggingface.co/ASLP-lab/OmniCodec)
- arXiv: [arXiv](https://arxiv.org/html/2603.20638v1)
## Overview
This repo contains:
- **Training**: `train.py` (Accelerate + GAN / WavLM-related losses per config)
- **Dataset**: `dataset.py` (multi-domain mixing; loads audio paths from `scp`)
- **Inference**: `infer.py` (reconstructs audio with a pretrained checkpoint)
- **Config**: `config/config_omnicodec.yaml`
## Environment
### Requirements
Install the Python dependencies:
```bash
pip install -r requirements.txt
```
Notes:
- `requirements.txt` contains an editable install line `-e OmniCodec/transformers-main`. Make sure the referenced path exists in your environment, or adjust/remove that line if you already have `transformers` installed.
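If `transformers` is already installed in your environment, one way to skip the editable line without hand-editing the file is to filter it out first (the output filename here is just an example):

```shell
# Write a copy of requirements.txt without the editable install line
# (only do this if `transformers` is already available in your env):
grep -v '^-e ' requirements.txt > requirements.no-editable.txt
```

Then install with `pip install -r requirements.no-editable.txt`.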
## Data preparation (scp)
The training config expects 3 **scp files** (one per domain): speech / music / sound.
Each line in an scp file can take either form:
- `utt_id /abs/or/rel/path/to/audio.wav`
- `/abs/or/rel/path/to/audio.wav` (the utterance ID is inferred from the filename)
Example:
```text
utt0001 /data/speech/utt0001.wav
utt0002 /data/speech/utt0002.wav
```
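The two line formats above can be parsed as in this minimal sketch (the helper name `parse_scp_line` is illustrative, not the repo's actual function):

```python
from pathlib import Path

def parse_scp_line(line: str):
    """Parse one scp line into (utt_id, audio_path).

    Accepts both formats described above:
      'utt_id /path/to/audio.wav'  -> explicit utterance ID
      '/path/to/audio.wav'         -> ID inferred from the filename stem
    """
    parts = line.strip().split(maxsplit=1)
    if len(parts) == 2:
        utt_id, path = parts
    else:
        path = parts[0]
        utt_id = Path(path).stem  # e.g. '/data/x/utt0001.wav' -> 'utt0001'
    return utt_id, path
```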
### What dataset does
For each item, `dataset.py` will:
- load audio with `librosa.load(..., sr=sample_rate, mono=True)`
- apply `librosa.util.normalize(wav) * 0.95`
- crop/pad/repeat to `segment_size` (default: 240000 samples @ 24kHz = 10s)
- return a dict: `{"wav": Tensor[T], "utt": str, "text": None}`
Samples that fail to load return `None` and are filtered out by `collate_fn` in `train.py`.
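The crop/pad/repeat step and the collate-time filtering can be sketched as follows (an illustrative reimplementation of the behaviour described above, not the repo's exact code; waveforms are plain sample lists here for clarity):

```python
import random

def fit_to_segment(wav, segment_size=240000):
    """Force a waveform to exactly `segment_size` samples.

    Short clips are tiled (repeated) until they cover the segment;
    long clips get a random crop.
    """
    if len(wav) < segment_size:
        reps = segment_size // len(wav) + 1
        return (wav * reps)[:segment_size]
    if len(wav) > segment_size:
        start = random.randint(0, len(wav) - segment_size)
        return wav[start:start + segment_size]
    return wav

def drop_failed(batch):
    """Mimic the collate-time filtering: discard items that loaded as None."""
    return [item for item in batch if item is not None]
```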
## Configure
Edit `config/config_omnicodec.yaml`:
- **Data**
- `data.speech_train_shards_dir`: path to `speech.scp`
- `data.music_train_shards_dir`: path to `music.scp`
- `data.sound_train_shards_dir`: path to `sound.scp`
- `data.sample_rate`: default `24000`
- `data.segment_size`: default `240000`
- **Pretrained SSL (WavLM)**
- `model.wavlmloss.ckpt_path`: default `pretrain_model/ssl/wavlm-base-plus`
- `wav_lm_model`: default `pretrain_model/ssl/wavlm_model/wavlm`
- **Output**
- `train.save_dir`: default `./exps/omnicodec`
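Putting the keys above together, an edited config might look like the fragment below (paths are placeholders, and the exact nesting should be checked against `config/config_omnicodec.yaml`):

```yaml
data:
  speech_train_shards_dir: /data/scp/speech.scp
  music_train_shards_dir: /data/scp/music.scp
  sound_train_shards_dir: /data/scp/sound.scp
  sample_rate: 24000
  segment_size: 240000
model:
  wavlmloss:
    ckpt_path: pretrain_model/ssl/wavlm-base-plus
wav_lm_model: pretrain_model/ssl/wavlm_model/wavlm
train:
  save_dir: ./exps/omnicodec
```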
## Training
Run training with the provided config:
```bash
python train.py -c config/config_omnicodec.yaml
```
Checkpoints and logs are written to `train.save_dir` (default: `./exps/omnicodec`).
## Inference (reconstruction)
### Prepare checkpoint
`infer.py` loads the checkpoint from:
- `pretrained_model/omnicodec.pth`
Place your pretrained weights at that path (or edit `infer.py` to point to your checkpoint).
### Run
Put test audio files in:
- `./testset/speech/`
Then run:
```bash
python infer.py -c config/config_omnicodec.yaml
```
Outputs will be written to:
- `./outputs/`
## Project structure
```text
.
├── config/
│   └── config_omnicodec.yaml
├── dataset.py
├── train.py
├── infer.py
├── models/
├── modules/
├── quantization/
├── discriminators/
├── losses/
├── utils/
└── requirements.txt
```
## Acknowledgements
This repo benefits from the following projects:
- [moshi](https://github.com/kyutai-labs/moshi)
- [Qwen3Omni](https://github.com/QwenLM/Qwen3-Omni)
- [DAC](https://github.com/descriptinc/descript-audio-codec)
- [BigVGAN](https://github.com/NVIDIA/BigVGAN)
- [SpeechTokenizer](https://github.com/zhangxinfd/speechtokenizer)
## Citation
If you use this work, please cite:
```bibtex
@misc{hu2026omnicodeclowframerate,
  title={OmniCodec: Low Frame Rate Universal Audio Codec with Semantic-Acoustic Disentanglement},
  author={Jingbin Hu and Haoyu Zhang and Dake Guo and Qirui Zhan and Wenhao Li and Huakang Chen and Guobin Ma and Hanke Xie and Chengyou Wang and Pengyuan Xie and Chuan Xie and Qiang Zhang and Lei Xie},
  year={2026},
  eprint={2603.20638},
  archivePrefix={arXiv},
  primaryClass={eess.AS},
  url={https://arxiv.org/abs/2603.20638},
}
```
## License
This project is released under the Apache-2.0 license; see the repository license file for details.