---
license: apache-2.0
pipeline_tag: text-to-speech
---

# OmniCodec

OmniCodec: Low Frame Rate Universal Audio Codec with Semantic-Acoustic Disentanglement

- Demo page: [OmniCodec Demo Page](https://hujingbin1.github.io/OmniCodec-Demo-Page/)
- Hugging Face: [ASLP-lab/OmniCodec](https://huggingface.co/ASLP-lab/OmniCodec)
- arXiv: [2603.20638](https://arxiv.org/html/2603.20638v1)

## Overview

This repo contains:

- **Training**: `train.py` (Accelerate + GAN / WavLM-related losses per config)
- **Dataset**: `dataset.py` (multi-domain mixing; loads audio paths from `scp` files)
- **Inference**: `infer.py` (reconstructs audio with a pretrained checkpoint)
- **Config**: `config/config_omnicodec.yaml`

## Environment

### Requirements

Install the Python dependencies:

```bash
pip install -r requirements.txt
```

Notes:

- `requirements.txt` contains an editable install line, `-e OmniCodec/transformers-main`. Make sure the referenced path exists in your environment, or adjust/remove that line if you already have `transformers` installed.

## Data preparation (scp)

The training config expects three **scp files**, one per domain: speech, music, and sound.

Each scp line can take either form:

- `utt_id /abs/or/rel/path/to/audio.wav`
- `/abs/or/rel/path/to/audio.wav` (the utterance ID is inferred from the filename)

Example:

```text
utt0001 /data/speech/utt0001.wav
utt0002 /data/speech/utt0002.wav
```
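As a minimal sketch (illustrative only, not code from `dataset.py`), parsing both line formats could look like this, assuming the paths themselves contain no spaces:

```python
from pathlib import Path

def parse_scp_line(line: str) -> tuple[str, str]:
    """Parse one scp line into (utt_id, audio_path).

    Handles both formats described above:
      - "utt_id /path/to/audio.wav"
      - "/path/to/audio.wav"  (utt_id falls back to the filename stem)
    """
    parts = line.strip().split(maxsplit=1)
    if len(parts) == 2:
        utt_id, path = parts
    else:
        path = parts[0]
        utt_id = Path(path).stem  # e.g. "utt0002" from "utt0002.wav"
    return utt_id, path
```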

### What the dataset does

For each item, `dataset.py` will:

- load audio with `librosa.load(..., sr=sample_rate, mono=True)`
- apply `librosa.util.normalize(wav) * 0.95`
- crop/pad/repeat to `segment_size` (default: 240000 samples @ 24 kHz = 10 s)
- return a dict: `{"wav": Tensor[T], "utt": str, "text": None}`

Failed samples return `None` and are filtered out by the `collate_fn` in `train.py`.
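The normalize/crop/pad/repeat steps above can be sketched with plain NumPy (a simplified illustration of the described behavior, not the actual `dataset.py` code; the `prepare_segment` name and the random-crop choice are assumptions):

```python
import numpy as np

def prepare_segment(wav: np.ndarray, segment_size: int = 240000) -> np.ndarray:
    """Peak-normalize to 0.95 (mirroring librosa.util.normalize(wav) * 0.95),
    then repeat short clips and crop to exactly `segment_size` samples."""
    peak = np.max(np.abs(wav))
    if peak > 0:
        wav = wav / peak * 0.95
    if len(wav) < segment_size:
        # Repeat the clip until it covers at least one full segment.
        reps = int(np.ceil(segment_size / len(wav)))
        wav = np.tile(wav, reps)
    # Take a segment_size window (training commonly uses a random start offset).
    start = np.random.randint(0, len(wav) - segment_size + 1)
    return wav[start:start + segment_size]
```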

## Configure

Edit `config/config_omnicodec.yaml`:

- **Data**
  - `data.speech_train_shards_dir`: path to `speech.scp`
  - `data.music_train_shards_dir`: path to `music.scp`
  - `data.sound_train_shards_dir`: path to `sound.scp`
  - `data.sample_rate`: default `24000`
  - `data.segment_size`: default `240000`
- **Pretrained SSL (WavLM)**
  - `model.wavlmloss.ckpt_path`: default `pretrain_model/ssl/wavlm-base-plus`
  - `wav_lm_model`: default `pretrain_model/ssl/wavlm_model/wavlm`
- **Output**
  - `train.save_dir`: default `./exps/omnicodec`
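Put together, the corresponding fragment of `config/config_omnicodec.yaml` might look like the sketch below. The paths are placeholders and the key nesting is inferred from the dotted names above, so check the shipped config for the authoritative layout:

```yaml
data:
  speech_train_shards_dir: /path/to/speech.scp
  music_train_shards_dir: /path/to/music.scp
  sound_train_shards_dir: /path/to/sound.scp
  sample_rate: 24000
  segment_size: 240000  # 10 s at 24 kHz

model:
  wavlmloss:
    ckpt_path: pretrain_model/ssl/wavlm-base-plus

wav_lm_model: pretrain_model/ssl/wavlm_model/wavlm

train:
  save_dir: ./exps/omnicodec
```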

## Training

Run training with the provided config:

```bash
python train.py -c config/config_omnicodec.yaml
```

Checkpoints and logs are written to `train.save_dir` (default: `./exps/omnicodec`).

## Inference (reconstruction)

### Prepare checkpoint

`infer.py` loads the checkpoint from:

- `pretrained_model/omnicodec.pth`

Place your pretrained weights at that path (or edit `infer.py` to point to your checkpoint).

### Run

Put test audio files in:

- `./testset/speech/`

Then run:

```bash
python infer.py -c config/config_omnicodec.yaml
```

Outputs will be written to:

- `./outputs/`

## Project structure

```text
.
├── config/
│   └── config_omnicodec.yaml
├── dataset.py
├── train.py
├── infer.py
├── models/
├── modules/
├── quantization/
├── discriminators/
├── losses/
├── utils/
└── requirements.txt
```

## Acknowledgements

This repo benefits from:

- [moshi](https://github.com/kyutai-labs/moshi)
- [Qwen3Omni](https://github.com/QwenLM/Qwen3-Omni)
- [DAC](https://github.com/descriptinc/descript-audio-codec)
- [BigVGAN](https://github.com/NVIDIA/BigVGAN)
- [SpeechTokenizer](https://github.com/zhangxinfd/speechtokenizer)

## Citation

If you use this work, please cite:

```bibtex
@misc{hu2026omnicodeclowframerate,
      title={OmniCodec: Low Frame Rate Universal Audio Codec with Semantic-Acoustic Disentanglement},
      author={Jingbin Hu and Haoyu Zhang and Dake Guo and Qirui Zhan and Wenhao Li and Huakang Chen and Guobin Ma and Hanke Xie and Chengyou Wang and Pengyuan Xie and Chuan Xie and Qiang Zhang and Lei Xie},
      year={2026},
      eprint={2603.20638},
      archivePrefix={arXiv},
      primaryClass={eess.AS},
      url={https://arxiv.org/abs/2603.20638},
}
```

## License

Apache-2.0. See the repository license file for details.