| | --- |
| | license: mit |
| | pipeline_tag: audio-to-audio |
| | library_name: sq_codec |
| | --- |
| | |
| | # SQCodec |
| |
|
| | This repository contains the implementation of SQCodec, a lightweight audio codec based on a single quantizer, introduced in the paper titled "One Quantizer is Enough: Toward a Lightweight Audio Codec". |
| |
|
| | [Paper](https://arxiv.org/abs/2504.04949) |
| |
|
| | [Code](https://github.com/zhai-lw/SQCodec) |
| |
|
| | ## install |
| |
|
| | ``` |
| | pip install sq_codec |
| | ``` |
| |
|
| | ### demo |
| |
|
| | Firstly, make sure you have installed the librosa package to load the example audio file. You can install it using pip: |
| |
|
| | ``` |
| | pip install librosa |
| | ``` |
| |
|
| | Then, you can use the following code to load a sample audio file, encode it using the SQCodec model, and decode it back |
| | to audio. The code also calculates the mean squared error (MSE) between the original and generated audio. |
| |
|
| | ```python |
| | import librosa |
| | import torch |
| | import sq_codec |
| | |
| | all_models = sq_codec.list_models() |
| | print(f"Available models: {all_models}") |
| | |
| | MODEL_USED = '6kbps' |
| | codec = sq_codec.get_model(MODEL_USED) |
| | print(f"loaded codec({MODEL_USED}) and codec sample rate: {codec.config.sample_rate}") |
| | |
| | sample_audio, sample_rate = librosa.load(librosa.example("libri1")) |
| | sample_audio = sample_audio[None, :] |
| | print(f"loaded sample audio and audio sample_rate :{sample_rate}") |
| | |
| | sample_audio = librosa.resample(sample_audio, orig_sr=sample_rate, target_sr=codec.config.sample_rate) |
| | |
| | codec.network.cuda() |
| | codec.network.eval() |
| | with torch.inference_mode(): |
| | audio_in = torch.tensor(sample_audio, dtype=torch.float32, device='cuda') |
| | _, audio_length = audio_in.shape |
| | print(f"{audio_in.shape=}") |
| | q_feature, indices = codec.encode_audio(audio_in) |
| | audio_out = codec.decode_audio(q_feature) # or |
| | # audio_out = codec.decode_audio(indices=indices) |
| | generated_audio = audio_out[:, :audio_length].detach().cpu().numpy() |
| | |
| | mse = ((sample_audio - generated_audio) ** 2).mean().item() |
| | print(f"codec({MODEL_USED}) mse: {mse}") |
| | ``` |
| |
|
| | ### available models |
| |
|
| | | config_name | Sample rate(Hz) | tokens/s | Codebook size | Bitrate(bps) | |
| | |--------------|-----------------|----------|---------------|--------------| |
| | | 0k75bps | 16,000 | 44.44 | 117,649 | 748.6 | |
| | | 1k5bps | 16,000 | 88.89 | 117,649 | 1497.3 | |
| | | 3kbps | 16,000 | 177.78 | 117,649 | 2994.5 | |
| | | 6kbps | 16,000 | 355.56 | 117,649 | 5989.0 | |
| | | 12kbps | 16,000 | 666.67 | 250,047 | 11954.6 | |
| | | 12kbps_24khz | 24,000 | 666.67 | 250,047 | 11954.6 | |
| | | 24kbps_24khz | 24,000 | 1333.33 | 250,047 | 23909.1 | |