---
license: apache-2.0
pipeline_tag: text-to-speech
---
# OmniCodec

OmniCodec: Low Frame Rate Universal Audio Codec with Semantic–Acoustic Disentanglement

- Demo page: [OmniCodec Demo Page](https://hujingbin1.github.io/OmniCodec-Demo-Page/)
- Hugging Face: [ASLP-lab/OmniCodec](https://huggingface.co/ASLP-lab/OmniCodec)
- arXiv: [2603.20638](https://arxiv.org/html/2603.20638v1)


## Overview

This repo contains:

- **Training**: `train.py` (Accelerate + GAN / WavLM-related losses per config)
- **Dataset**: `dataset.py` (multi-domain mixing; loads audio paths from `scp`)
- **Inference**: `infer.py` (reconstructs audio with a pretrained checkpoint)
- **Config**: `config/config_omnicodec.yaml`

## Environment

### Requirements

Install python dependencies:

```bash
pip install -r requirements.txt
```

Notes:

- `requirements.txt` contains an editable install line `-e OmniCodec/transformers-main`. Make sure the referenced path exists in your environment, or adjust/remove that line if you already have `transformers` installed.

## Data preparation (scp)

The training config expects three **scp files** (one per domain): speech, music, and sound.

Each line of an scp file can be either:

- `utt_id /abs/or/rel/path/to/audio.wav`
- `/abs/or/rel/path/to/audio.wav` (the utterance ID is inferred from the filename)

Example:

```text
utt0001 /data/speech/utt0001.wav
utt0002 /data/speech/utt0002.wav
```
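The two accepted line formats can be handled with a small parser like the following sketch (the function name is illustrative, not part of this repo):

```python
from pathlib import Path


def parse_scp_line(line: str) -> tuple[str, str]:
    """Parse one scp line into (utt_id, audio_path).

    Accepts either "utt_id path" or a bare path; in the latter
    case the utt_id is derived from the filename stem.
    """
    parts = line.strip().split(maxsplit=1)
    if len(parts) == 2:
        return parts[0], parts[1]
    path = parts[0]
    return Path(path).stem, path


# parse_scp_line("utt0001 /data/speech/utt0001.wav")
#   -> ("utt0001", "/data/speech/utt0001.wav")
# parse_scp_line("/data/music/song.wav")
#   -> ("song", "/data/music/song.wav")
```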

### What dataset does

For each item, `dataset.py` will:

- load audio with `librosa.load(..., sr=sample_rate, mono=True)`
- apply `librosa.util.normalize(wav) * 0.95`
- crop/pad/repeat to `segment_size` (default: 240000 samples @ 24kHz = 10s)
- return a dict: `{"wav": Tensor[T], "utt": str, "text": None}`

Samples that fail to load return `None` and are filtered out by `collate_fn` in `train.py`.
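The crop/pad/repeat step above can be sketched in NumPy as follows (a simplified version; the actual `dataset.py` may differ in details such as random crop offsets, and peak normalization is done beforehand via `librosa.util.normalize`):

```python
import numpy as np


def fit_to_segment(wav: np.ndarray, segment_size: int = 240000) -> np.ndarray:
    """Crop or tile a 1-D waveform to exactly `segment_size` samples."""
    if len(wav) >= segment_size:
        # Crop: take the leading segment (a real loader would typically
        # pick a random offset for augmentation).
        return wav[:segment_size]
    # Shorter than the segment: repeat the clip until long enough, then trim.
    reps = int(np.ceil(segment_size / len(wav)))
    return np.tile(wav, reps)[:segment_size]


short = np.ones(100_000, dtype=np.float32)   # 100k samples -> tiled up to 240k
long_ = np.ones(300_000, dtype=np.float32)   # 300k samples -> cropped to 240k
assert fit_to_segment(short).shape == (240_000,)
assert fit_to_segment(long_).shape == (240_000,)
```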

## Configure

Edit `config/config_omnicodec.yaml`:

- **Data**
  - `data.speech_train_shards_dir`: path to `speech.scp`
  - `data.music_train_shards_dir`: path to `music.scp`
  - `data.sound_train_shards_dir`: path to `sound.scp`
  - `data.sample_rate`: default `24000`
  - `data.segment_size`: default `240000`
- **Pretrained SSL (WavLM)**
  - `model.wavlmloss.ckpt_path`: default `pretrain_model/ssl/wavlm-base-plus`
  - `wav_lm_model`: default `pretrain_model/ssl/wavlm_model/wavlm`
- **Output**
  - `train.save_dir`: default `./exps/omnicodec`
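
Put together, the data block of the config might look like this (an illustrative fragment; check it against the actual key layout in `config/config_omnicodec.yaml`):

```yaml
data:
  speech_train_shards_dir: /path/to/speech.scp
  music_train_shards_dir: /path/to/music.scp
  sound_train_shards_dir: /path/to/sound.scp
  sample_rate: 24000
  segment_size: 240000   # 10 s at 24 kHz
```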

## Training

Run training with the provided config:

```bash
python train.py -c config/config_omnicodec.yaml
```

Checkpoints and logs are written to `train.save_dir` (default: `./exps/omnicodec`).

## Inference (reconstruction)

### Prepare checkpoint

`infer.py` loads the checkpoint from:

- `pretrained_model/omnicodec.pth`

Place your pretrained weights at that path (or edit `infer.py` to point to your checkpoint).

### Run

Put test audio files in:

- `./testset/speech/`

Then run:

```bash
python infer.py -c config/config_omnicodec.yaml
```

Outputs will be written to:

- `./outputs/`

## Project structure

```text
.
β”œβ”€ config/
β”‚  └─ config_omnicodec.yaml
β”œβ”€ dataset.py
β”œβ”€ train.py
β”œβ”€ infer.py
β”œβ”€ models/
β”œβ”€ modules/
β”œβ”€ quantization/
β”œβ”€ discriminators/
β”œβ”€ losses/
β”œβ”€ utils/
└─ requirements.txt
```

## Acknowledgements

- This repo benefits from [moshi](https://github.com/kyutai-labs/moshi)
- This repo benefits from [Qwen3Omni](https://github.com/QwenLM/Qwen3-Omni)
- This repo benefits from [DAC](https://github.com/descriptinc/descript-audio-codec)
- This repo benefits from [BigVGAN](https://github.com/NVIDIA/BigVGAN)
- This repo benefits from [SpeechTokenizer](https://github.com/zhangxinfd/speechtokenizer)

## Citation

If you use this work, please cite:

```bibtex
@misc{hu2026omnicodeclowframerate,
      title={OmniCodec: Low Frame Rate Universal Audio Codec with Semantic-Acoustic Disentanglement}, 
      author={Jingbin Hu and Haoyu Zhang and Dake Guo and Qirui Zhan and Wenhao Li and Huakang Chen and Guobin Ma and Hanke Xie and Chengyou Wang and Pengyuan Xie and Chuan Xie and Qiang Zhang and Lei Xie},
      year={2026},
      eprint={2603.20638},
      archivePrefix={arXiv},
      primaryClass={eess.AS},
      url={https://arxiv.org/abs/2603.20638}, 
}
```

## License

See the repository license file.