File size: 7,157 Bytes
1408a41
 
 
 
 
 
706a5c0
 
 
9d10e60
1408a41
 
9d10e60
 
 
1408a41
 
 
 
 
 
 
 
 
 
 
 
 
9d10e60
1408a41
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
9d10e60
1408a41
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
9d10e60
1408a41
 
 
 
 
 
9d10e60
1408a41
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
9d10e60
1408a41
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
9d10e60
1408a41
 
9d10e60
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
# UniSE: A Unified Framework for Decoder-only Autoregressive LM-based Speech Enhancement

<p align="center">
  <a href="https://arxiv.org/abs/2510.20441">
    <img src="https://img.shields.io/badge/Paper-ArXiv-red.svg" alt="Paper">
  </a>
  <a href="https://github.com/alibaba/unified-audio/tree/main/QuarkAudio-UniSE">
    <img src="https://img.shields.io/badge/GitHub-Code-green.svg" alt="GitHub">
  </a>
  <a href="https://huggingface.co/QuarkAudio/QuarkAudio-UniSE/">
    <img src="https://img.shields.io/badge/Model-Hugging%20Face-yellow.svg" alt="Hugging Face">
  </a>
  <a href="https://www.modelscope.cn/models/QuarkAudio/QuarkAudio-UniSE/">
    <img src="https://img.shields.io/badge/Model-%20%E9%AD%94%E6%90%AD-orange.svg" alt="ModelScope">
  </a>
</p>

<p align="center">
  <a href="https://arxiv.org/abs/2510.20441"><img src="QuarkAudio-UniSE.png" width="70%" /></a>
</p>
πŸ”Š **UniSE**: A Unified, Prompt-Free, Autoregressive Speech Enhancement Framework Based on Decoder-only Language Models

πŸš€ **Key Highlights**:
- βœ… **Unified & Prompt-Free**: Handles multiple tasks without explicit instruction.
- βš™οΈ **Decoder-only AR-LM Backbone**: Leverages LLM-style autoregressive generation for speech token prediction.
- πŸ”„ **End-to-End Compatible**: Integrates WavLM (feature extractor), BiCodec (discrete codec), and LM into one pipeline.
- 🌍 **Multitask Support**: SE, SR, TSE, SS, and more β€” all in a single model.

πŸ“„ **Paper**: [arXiv:2510.20441](https://arxiv.org/abs/2510.20441)  | πŸ€— **Model**: [Hugging Face Spaces]https://huggingface.co/QuarkAudio/QuarkAudio-UniSE/)

---

## πŸ“‹ Supported Tasks

| Task | Full Name | Status | Description |
|------|-----------|--------|-------------|
| **SR** | Speech Restoration | βœ… Stable | General-purpose denoising and clarity improvemen (e.g., noise, reverb, packet loss) |
| **TSE** | Target Speaker Extraction | βœ… Stable | Extract target speaker using reference enrollment audio |
| **SS** | Speech Separation | βœ… Stable | Separate mixed speakers or sound sources |
| **AEC** | Acoustic Echo Cancellation | ⏳ Developing | Coming soon in next release |

> πŸ’‘ Unlike traditional models requiring task-specific prompts or modules, **UniSE autonomously infers the task type** from input context β€” enabled by powerful LLM comprehension.

---

## 🎯 Quick Start: Run Inference in 3 Minutes

### 1. Clone Repository

```bash
git clone https://github.com/alibaba/unified-audio.git
cd QuarkAudio-UniSE
```

### 2. Create a Conda environment and install dependencies

```bash
conda create -n unise python=3.10
conda activate unise
pip install -r requirements.txt
```

### 3. Download Checkpoints

QuarkAudio-UniSE requires three additional **WavLM** and **BiCodec** pre-trained models and checkpoint of the middle LM on Huggingface to function properly. You can download three of them using the provided shell script:

```bash
cd checkpoints
bash download.sh
```
Additionally, download WavLM-Large.pt from this [URL](https://huggingface.co/microsoft/wavlm-base-plus) and put it at `./ckpt/WavLM-Large.pt` .

Alternatively, you can download them manually and place them in the `./model/bicodec/` directory.

After Downloading, the tree should be like this:

## Train
+ Quick start

```bash
#!/bin/bash
python ./train.py --config conf/config.yaml
```
| Parameter        | Description                                                                                                                                                            |
| ---------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `resume` | if want to resume, specify ckpt path                                                                                                  |
| `simulation_config` | data simulate config                                                                                                                        |
| `speech_scp_path`        | SCP of clean audio files                                                       |
| `noise_scp_path`        | SCP of noise audio files                                                                   
 | `rir_scp_path`        | SCP of rir audio files                                                                       |
| `mode`           | Task type: `se` (Noise Suppression,Speech Restoration,Packet Loss Concealment), `tse` (Target Speaker Extraction), `SS` (Speech Separation). |


## Inference
+ Quick start
The main inference script is **`test.py`**. The inference process consists of two stages:

1. Extract hidden states from all WavLM layers and obtain a single representation by averaging them across layers.
2. Use the language model (LM) to predict speech tokens, and then decode them into audio using **BiCodec**.

### Running Inference
+ Quick start
To run test.py, configure the parameters in `./conf/config.yaml`:

| Parameter        | Description                                                                                                                                                            |
| ---------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `ckpt_path` | pretrained weight                                                                                                             |
| `enroll_duration` | Number of inference iterations.                                                                                                                                        |
| `data_src_dir`        | Directory of processed audio files directory.                                                        |
| `data_tgt_dir`        | Directory of processed audio files directory.                                                                                                                                    |
| `mode`           | Task type: `se` (Noise Suppression,Speech Restoration,Packet Loss Concealment), `se` (Target Speaker Extraction), `SS` (Speech Separation). |

Command to run inference:

```python
python test.py
```


## Model Checkpoints

Our pretrained model is available on [Hugging Face](https://huggingface.co/QuarkAudio/QuarkAudio-UniSE/).

## Hints

Our approach focuses on leveraging the LLM's comprehension capabilities to enable autonomous determination of task types, though this may exhibit instability in certain scenarios. A more stable and robust iteration will be released in the upcoming version.

## Citation

```
@misc{yan2025uniseunifiedframeworkdecoderonly,
      title={UniSE: A Unified Framework for Decoder-only Autoregressive LM-based Speech Enhancement}, 
      author={Haoyin Yan and Chengwei Liu and Shaofei Xue and Xiaotao Liang and Zheng Xue},
      year={2025},
      eprint={2510.20441},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2510.20441}, 
}
```


## Contact
For any questions, please contact: `yanhaoyin.yhy@alibaba-inc.com`