# OmniCodec

OmniCodec: Low Frame Rate Universal Audio Codec with Semantic–Acoustic Disentanglement

- Demo Page: [OmniCodec Demo Page](https://hujingbin1.github.io/OmniCodec-Demo-Page/)
- Hugging Face: [ASLP-lab/OmniCodec](https://huggingface.co/ASLP-lab/OmniCodec)
- arXiv: [2603.20638](https://arxiv.org/html/2603.20638v1)

## Overview

This repo contains:

- **Training**: `train.py` (Accelerate + GAN / WavLM-related losses, per config)
- **Dataset**: `dataset.py` (multi-domain mixing; loads audio paths from `scp` files)
- **Inference**: `infer.py` (reconstructs audio with a pretrained checkpoint)
- **Config**: `config/config_omnicodec.yaml`

## Environment

### Requirements

Install Python dependencies:

```bash
pip install -r requirements.txt
```

Notes:

- `requirements.txt` contains an editable install line `-e OmniCodec/transformers-main`. Make sure the referenced path exists in your environment, or adjust/remove that line if you already have `transformers` installed.

## Data preparation (scp)

The training config expects three **scp files**, one per domain: speech, music, and sound.

Each line of an scp file can be either:

- `utt_id /abs/or/rel/path/to/audio.wav`
- `/abs/or/rel/path/to/audio.wav` (the utterance ID is inferred from the filename)

Example:

```text
utt0001 /data/speech/utt0001.wav
utt0002 /data/speech/utt0002.wav
```

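Both accepted line formats can be handled by a small parser. The helper below is an illustrative sketch (the function name is not from this repo):

```python
import os

def parse_scp_line(line: str):
    """Split one scp line into (utt_id, path).

    Accepts "utt_id /path/to/audio.wav" or a bare "/path/to/audio.wav";
    in the latter case the utterance ID is the filename stem.
    """
    parts = line.strip().split(maxsplit=1)
    if len(parts) == 2:
        return parts[0], parts[1]
    path = parts[0]
    utt_id = os.path.splitext(os.path.basename(path))[0]
    return utt_id, path
```
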
### What `dataset.py` does

For each item, `dataset.py` will:

- load audio with `librosa.load(..., sr=sample_rate, mono=True)`
- apply `librosa.util.normalize(wav) * 0.95`
- crop/pad/repeat to `segment_size` (default: 240000 samples @ 24 kHz = 10 s)
- return a dict: `{"wav": Tensor[T], "utt": str, "text": None}`

Samples that fail to load return `None` and are filtered out by `collate_fn` in `train.py`.
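The fixed-length segmentation and the `None` filtering might look roughly like the sketch below; the function bodies are assumptions for illustration, not the repo's actual implementation in `dataset.py`/`train.py`:

```python
import numpy as np

SEGMENT_SIZE = 240000  # 10 s at 24 kHz, the config default

def fix_length(wav: np.ndarray, segment_size: int = SEGMENT_SIZE) -> np.ndarray:
    """Crop a long clip at a random offset, or tile-and-trim a short one."""
    if len(wav) >= segment_size:
        start = np.random.randint(0, len(wav) - segment_size + 1)
        return wav[start:start + segment_size]
    reps = -(-segment_size // len(wav))  # ceiling division
    return np.tile(wav, reps)[:segment_size]

def collate_fn(batch):
    """Drop samples that failed to load (None) and stack the rest."""
    batch = [item for item in batch if item is not None]
    if not batch:
        return None
    return {
        "wav": np.stack([item["wav"] for item in batch]),
        "utt": [item["utt"] for item in batch],
    }
```
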
## Configure

Edit `config/config_omnicodec.yaml`:

- **Data**
  - `data.speech_train_shards_dir`: path to `speech.scp`
  - `data.music_train_shards_dir`: path to `music.scp`
  - `data.sound_train_shards_dir`: path to `sound.scp`
  - `data.sample_rate`: default `24000`
  - `data.segment_size`: default `240000`
- **Pretrained SSL (WavLM)**
  - `model.wavlmloss.ckpt_path`: default `pretrain_model/ssl/wavlm-base-plus`
  - `wav_lm_model`: default `pretrain_model/ssl/wavlm_model/wavlm`
- **Output**
  - `train.save_dir`: default `./exps/omnicodec`
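Put together, those fields would correspond to a YAML fragment along these lines (values are the defaults listed above; the exact nesting is an assumption, and the full config file contains more keys):

```yaml
data:
  speech_train_shards_dir: /path/to/speech.scp
  music_train_shards_dir: /path/to/music.scp
  sound_train_shards_dir: /path/to/sound.scp
  sample_rate: 24000
  segment_size: 240000

model:
  wavlmloss:
    ckpt_path: pretrain_model/ssl/wavlm-base-plus

wav_lm_model: pretrain_model/ssl/wavlm_model/wavlm

train:
  save_dir: ./exps/omnicodec
```
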
## Training

Run training with the provided config:

```bash
python train.py -c config/config_omnicodec.yaml
```

Checkpoints and logs are written to `train.save_dir` (default: `./exps/omnicodec`).
## Inference (reconstruction)

### Prepare checkpoint

`infer.py` loads its checkpoint from:

- `pretrained_model/omnicodec.pth`

Place your pretrained weights at that path, or edit `infer.py` to point to your own checkpoint.

### Run

Put test audio files in:

- `./testset/speech/`

Then run:

```bash
python infer.py -c config/config_omnicodec.yaml
```

Outputs are written to:

- `./outputs/`
## Project structure

```text
.
├─ config/
│  └─ config_omnicodec.yaml
├─ dataset.py
├─ train.py
├─ infer.py
├─ models/
├─ modules/
├─ quantization/
├─ discriminators/
├─ losses/
├─ utils/
└─ requirements.txt
```
## Acknowledgements

This repo benefits from:

- [moshi](https://github.com/kyutai-labs/moshi)
- [Qwen3Omni](https://github.com/QwenLM/Qwen3-Omni)
- [DAC](https://github.com/descriptinc/descript-audio-codec)
- [BigVGAN](https://github.com/NVIDIA/BigVGAN)
- [SpeechTokenizer](https://github.com/zhangxinfd/speechtokenizer)
## Citation

If you use this work, please cite:

```bibtex
@misc{hu2026omnicodeclowframerate,
  title={OmniCodec: Low Frame Rate Universal Audio Codec with Semantic-Acoustic Disentanglement},
  author={Jingbin Hu and Haoyu Zhang and Dake Guo and Qirui Zhan and Wenhao Li and Huakang Chen and Guobin Ma and Hanke Xie and Chengyou Wang and Pengyuan Xie and Chuan Xie and Qiang Zhang and Lei Xie},
  year={2026},
  eprint={2603.20638},
  archivePrefix={arXiv},
  primaryClass={eess.AS},
  url={https://arxiv.org/abs/2603.20638}
}
```
## License

See the repository's license file.