---
license: mit
pipeline_tag: audio-to-audio
library_name: sq_codec
---

# SQCodec

This repository contains the implementation of SQCodec, a lightweight audio codec built on a single quantizer. It was introduced in the paper "One Quantizer is Enough: Toward a Lightweight Audio Codec".

[Paper](https://arxiv.org/abs/2504.04949)

[Code](https://github.com/zhai-lw/SQCodec)

## install

```bash
pip install sq_codec
```

### demo

First, install the librosa package, which the demo uses to load the example audio file:

```bash
pip install librosa
```

Then use the following code to load a sample audio file, encode it with the SQCodec model, and decode it back
to audio. The code also computes the mean squared error (MSE) between the original and the reconstructed audio.

```python
import librosa
import torch
import sq_codec

# List the available pretrained configurations.
all_models = sq_codec.list_models()
print(f"Available models: {all_models}")

MODEL_USED = '6kbps'
codec = sq_codec.get_model(MODEL_USED)
print(f"loaded codec({MODEL_USED}) and codec sample rate: {codec.config.sample_rate}")

# Load the example clip and add a batch dimension: (samples,) -> (1, samples).
sample_audio, sample_rate = librosa.load(librosa.example("libri1"))
sample_audio = sample_audio[None, :]
print(f"loaded sample audio and audio sample_rate: {sample_rate}")

# Resample to the sample rate the codec expects.
sample_audio = librosa.resample(sample_audio, orig_sr=sample_rate, target_sr=codec.config.sample_rate)

# Use the GPU when available, otherwise fall back to the CPU
# (assumes codec.network is a standard torch.nn.Module).
device = 'cuda' if torch.cuda.is_available() else 'cpu'
codec.network.to(device)
codec.network.eval()
with torch.inference_mode():
    audio_in = torch.tensor(sample_audio, dtype=torch.float32, device=device)
    _, audio_length = audio_in.shape
    print(f"{audio_in.shape=}")
    q_feature, indices = codec.encode_audio(audio_in)
    # Decode from the quantized features, or alternatively from the token indices.
    audio_out = codec.decode_audio(q_feature)  # or
    # audio_out = codec.decode_audio(indices=indices)
    # Trim the output to the input length before comparing.
    generated_audio = audio_out[:, :audio_length].detach().cpu().numpy()

mse = ((sample_audio - generated_audio) ** 2).mean().item()
print(f"codec({MODEL_USED}) mse: {mse}")
```
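
The MSE line in the demo subtracts two arrays directly, which only works when they have the same length. As a minimal sketch, assuming the decoded audio might come back slightly shorter or longer than the input, the comparison can be restricted to the overlapping samples. The helper name `aligned_mse` is hypothetical and is not part of `sq_codec`.

```python
import numpy as np


def aligned_mse(reference, estimate):
    """Mean squared error over the overlapping samples of two arrays.

    Both inputs are expected to be shaped (..., samples). The longer one is
    truncated along the last axis, so a small length mismatch does not raise.
    """
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    n = min(reference.shape[-1], estimate.shape[-1])
    diff = reference[..., :n] - estimate[..., :n]
    return float(np.mean(diff ** 2))
```

In the demo above, this would be called as `aligned_mse(sample_audio, generated_audio)`.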

### available models

| config_name  | Sample rate (Hz) | Tokens/s | Codebook size | Bitrate (bps) |
|--------------|------------------|----------|---------------|---------------|
| 0k75bps      | 16,000           | 44.44    | 117,649       | 748.6         |
| 1k5bps       | 16,000           | 88.89    | 117,649       | 1497.3        |
| 3kbps        | 16,000           | 177.78   | 117,649       | 2994.5        |
| 6kbps        | 16,000           | 355.56   | 117,649       | 5989.0        |
| 12kbps       | 16,000           | 666.67   | 250,047       | 11954.6       |
| 12kbps_24khz | 24,000           | 666.67   | 250,047       | 11954.6       |
| 24kbps_24khz | 24,000           | 1333.33  | 250,047       | 23909.1       |
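
The bitrate column is consistent with the usual single-codebook relation, bitrate = tokens/s × log2(codebook size). The sketch below recomputes it from the table values as a sanity check. The function name `bitrate_bps` and the `MODELS` dictionary are illustrative only and are not part of `sq_codec`.

```python
import math


def bitrate_bps(tokens_per_second, codebook_size):
    """Bits per second when each token indexes one entry of a single codebook."""
    return tokens_per_second * math.log2(codebook_size)


# (tokens/s, codebook size) copied from the table above.
MODELS = {
    "0k75bps": (44.44, 117_649),
    "1k5bps": (88.89, 117_649),
    "3kbps": (177.78, 117_649),
    "6kbps": (355.56, 117_649),
    "12kbps": (666.67, 250_047),
    "12kbps_24khz": (666.67, 250_047),
    "24kbps_24khz": (1333.33, 250_047),
}

for name, (tokens_per_second, codebook_size) in MODELS.items():
    print(f"{name}: {bitrate_bps(tokens_per_second, codebook_size):.1f} bps")
```

Because the tokens/s values in the table are rounded to two decimals, the recomputed figures can differ from the listed bitrates in the last digit.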