---
license: apache-2.0
pipeline_tag: audio-to-audio
tags:
- speech
- audio
- codec
- speech-codec
- whisper
- low-bitrate
- audio-compression
language:
- en
datasets:
- librispeech
library_name: pytorch
---


<div align="center">

# 🎙️ SimWhisper-Codec

### Speaking Clearly: A Simplified Whisper-Based Codec for Low-Bitrate Speech Coding

<p>
  <a href="https://zhangxinwhut.github.io/SimWhisper-Codec/"><img src="https://img.shields.io/badge/🎧_Demo-Online-brightgreen" alt="Demo"></a>
  <a href="https://arxiv.org/pdf/2510.20504"><img src="https://img.shields.io/badge/Paper-Arxiv-red" alt="paper"></a>
  <a href="https://huggingface.co/xxx123456/SimWhisper_Codec"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Model%20Page-yellow" alt="Hugging Face"></a>
  <a href="https://github.com/ZhangXinWhut/SimWhisper-Codec"><img src="https://img.shields.io/badge/GitHub-Code-black?logo=github" alt="GitHub"></a>
</p>

*A semantic-first speech codec that achieves superior performance through architectural simplification rather than complex supervision.*

</div>

---

## ✨ Highlights

- 🚀 **Low Bitrate**: Only **1.1 kbps** at a 16 kHz sampling rate
- 🔊 **High-Quality Speech Reconstruction**: UTMOS 4.00, WER 2.75 (hubert-large-ls960-ft), SIM 0.83 (wavlm_large_finetune), STOI 0.93, PESQ-NB 3.29, and PESQ-WB 2.72 on LibriSpeech test-clean reconstruction (ground truth: WER 2.16, UTMOS 4.09)
- 🧊 **Frozen Encoder**: The Whisper encoder is used without fine-tuning
- ⚡ **Simple & Efficient**: Architectural simplification over complex supervision
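To put the 1.1 kbps figure in perspective, the short sketch below compares it with uncompressed audio at the same sampling rate. The 16-bit mono PCM baseline is an assumption made for illustration; only the 16 kHz sampling rate and the 1.1 kbps bitrate come from this card.

```python
# Assumption: the uncompressed reference is 16-bit mono PCM.
# Only SAMPLE_RATE_HZ and CODEC_KBPS are taken from the model card.
SAMPLE_RATE_HZ = 16_000
BITS_PER_SAMPLE = 16  # assumed PCM bit depth
CODEC_KBPS = 1.1


def pcm_kbps(sample_rate_hz: int, bits_per_sample: int) -> float:
    """Bitrate of uncompressed mono PCM in kbps."""
    return sample_rate_hz * bits_per_sample / 1000


def compression_ratio(pcm: float, codec: float) -> float:
    """How many times smaller the codec stream is than the PCM stream."""
    return pcm / codec


raw = pcm_kbps(SAMPLE_RATE_HZ, BITS_PER_SAMPLE)
ratio = compression_ratio(raw, CODEC_KBPS)
print(f"PCM: {raw:.0f} kbps, codec: {CODEC_KBPS} kbps, ratio: {ratio:.0f}x")
# prints "PCM: 256 kbps, codec: 1.1 kbps, ratio: 233x"
```

Under that assumption, the codec stream is roughly 233 times smaller than the raw waveform.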

## 📊 Performance

| Model | Bitrate | WER ↓ | PESQ-NB ↑ | PESQ-WB ↑ | STOI ↑ | SIM ↑ | UTMOS ↑ |
|:------|:-------:|:-----:|:---------:|:---------:|:------:|:-----:|:-------:|
| XCodec2.0 | 0.8 kbps | 2.61 | 3.04 | 2.43 | 0.92 | 0.82 | **4.13** |
| XY-Tokenizer | 1.0 kbps | **2.46** | 3.00 | 2.41 | 0.91 | **0.84** | 3.98 |
| **SimWhisper-Codec** | 1.1 kbps | 2.75 | **3.29** | **2.72** | **0.93** | 0.83 | 4.00 |

*Evaluated on LibriSpeech test-clean*

## 🚀 Quick Start

### Installation

```bash
# Clone repository
git clone https://github.com/ZhangXinWhut/SimWhisper-Codec.git && cd SimWhisper-Codec

# Create and activate conda environment
conda create -n swcodec python=3.10 -y && conda activate swcodec

# Install dependencies
pip install -r requirements.txt
```

## Available Models 🗂️

| Model Name | Hugging Face | Training Data |
|:----------:|:-------------:|:---------------:|
| SimWhisper-Codec | [🤗](https://huggingface.co/xxx123456/SimWhisper_Codec) | LibriSpeech |


### Download Model Weights

Download the SimWhisper-Codec model weights from the [SimWhisper-Codec Hugging Face repository](https://huggingface.co/xxx123456/SimWhisper_Codec) into `./weights/`:

```bash
mkdir -p ./weights && huggingface-cli download xxx123456/SimWhisper_Codec SimWhisperCodec.pt --local-dir ./weights/
```
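If you prefer to fetch the checkpoint from Python, the same download can be sketched with `hf_hub_download` from the `huggingface_hub` package. This is a minimal sketch, not part of the repository: the helper name `download_weights` is hypothetical, and it assumes `huggingface_hub` is installed in your environment.

```python
from pathlib import Path

# Repository and file name taken from the CLI command above.
REPO_ID = "xxx123456/SimWhisper_Codec"
FILENAME = "SimWhisperCodec.pt"


def download_weights(local_dir: str = "./weights") -> Path:
    """Download the checkpoint into local_dir and return its path.

    Hypothetical helper for illustration; it is not part of the codebase.
    """
    # Imported here so this module still loads when huggingface_hub is absent.
    from huggingface_hub import hf_hub_download

    Path(local_dir).mkdir(parents=True, exist_ok=True)
    path = hf_hub_download(repo_id=REPO_ID, filename=FILENAME, local_dir=local_dir)
    return Path(path)
```

Calling `download_weights()` should place `SimWhisperCodec.pt` under `./weights/`, mirroring the CLI command.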

### Inference

```bash
python inference.py --input_dir /path/to/LibriSpeech/test-clean
```

The reconstructed audio files will be available in the `output_wavs/` directory.
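After a run, it can be useful to confirm that every input utterance produced a reconstruction. The sketch below is one way to check this under a loud assumption: each output is a `.wav` file that keeps the stem of its input `.flac` file. The naming scheme of `inference.py` is not documented here, so adjust the matching rule if it differs. The helper name `find_missing_outputs` is hypothetical.

```python
from pathlib import Path


def find_missing_outputs(input_dir: str, output_dir: str = "output_wavs") -> list[str]:
    """Return stems of input .flac files with no matching .wav in output_dir.

    Assumption: outputs are named <input stem>.wav. This is a guess about
    inference.py, not documented behavior.
    """
    # LibriSpeech stores utterances as .flac in nested speaker/chapter folders.
    expected = {p.stem for p in Path(input_dir).rglob("*.flac")}
    produced = {p.stem for p in Path(output_dir).rglob("*.wav")}
    return sorted(expected - produced)
```

An empty list from `find_missing_outputs("/path/to/LibriSpeech/test-clean")` would indicate that every utterance has a matching file under the assumed naming scheme.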

## 🙏 Acknowledgements

Our codebase builds upon the [XY-Tokenizer](https://github.com/gyt1145028706/XY-Tokenizer). We thank the authors for their excellent work.

## 📝 Citation

If you find this work useful in your research, please cite our paper:

```bibtex
@misc{zhang2025speakingclearlysimplifiedwhisperbased,
      title={Speaking Clearly: A Simplified Whisper-Based Codec for Low-Bitrate Speech Coding}, 
      author={Xin Zhang and Lin Li and Xiangni Lu and Jianquan Liu and Kong Aik Lee},
      year={2025},
      eprint={2510.20504},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2510.20504}, 
}
```