---
license: mit
---
# SCOREQ-PyTorch
## About
This is an unofficial `fairseq`-free implementation of the SCOREQ Speech Quality Assessment system proposed in [SCOREQ: Speech Quality Assessment with Contrastive Regression](https://arxiv.org/abs/2410.06675).
The [original implementation](https://github.com/alessandroragano/scoreq) provides a `fairseq`-based PyTorch model and an ONNX variant. In practice, the `fairseq` dependency can be difficult to install with recent Python, PyTorch, and dependency versions. The ONNX variant avoids `fairseq`, but it can be less convenient for PyTorch-based research workflows and may be difficult to run with GPU acceleration on `ARM/aarch64` systems.
A [recent study from ICASSP 2026](https://arxiv.org/abs/2509.24457) highlights the high correlation of SCOREQ with subjective listening scores for neural codecs. An easy-to-install SCOREQ implementation is therefore useful for modern neural audio codec and TTS research.
We provide a `fairseq`-free implementation written directly in `PyTorch` that matches the [original system](https://github.com/alessandroragano/scoreq) using converted weights and reimplemented modules.
We also provide a `TorchScript` variant that can be loaded with only PyTorch, without installing this package.
The PyTorch and TorchScript versions are validated against the original implementation and produce matching scores.
> [!NOTE]
> In contrast to the original implementation, we support batched audio assessment. However, we recommend running SCOREQ with **batch size 1** to avoid metric shifts caused by padding. Batching can be used for faster evaluation when small padding-related score differences are acceptable.
## Model Types
As in the [original system](https://github.com/alessandroragano/scoreq), we support four SCOREQ model types, formed by combining two data domains with two modes.
Data domain (what kind of audio is evaluated):
- `natural`: used for audio that was created from genuine human speech (Audio Codecs, VoIP, Telephony, Speech Enhancement, Audio Restoration).
- `synthetic`: used for audio that was synthesized by a machine (Text-to-Speech (TTS), Voice Conversion (VC), Generative Speech Models).
Mode (whether there is a reference audio to compare with):
- `nr`: no-reference mode. Assesses the quality of audio, **the higher the better**, without relying on any reference.
- `ref`: reference mode. Calculates the distance between the embeddings of the provided audio and the reference audio, **the lower the better**.
We refer the user to the [original repository](https://github.com/alessandroragano/scoreq) and [paper](https://arxiv.org/abs/2410.06675) for more details on model types.
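Because the two modes point in opposite directions, comparing several systems needs a mode-aware choice between maximum and minimum. The following is a minimal sketch, assuming the scores have already been computed and collected as plain floats. The helper `best_system` and the example values are hypothetical and not part of this package:

```python
# Hypothetical helper, not part of scoreq-pytorch.
# "nr" scores are quality estimates (higher is better);
# "ref" scores are embedding distances (lower is better).
HIGHER_IS_BETTER = {"nr": True, "ref": False}


def best_system(scores, mode):
    """Return the name of the best system given {name: score} and the mode."""
    if mode not in HIGHER_IS_BETTER:
        raise ValueError(f"mode must be 'nr' or 'ref', got {mode!r}")
    if not scores:
        raise ValueError("scores must not be empty")
    pick = max if HIGHER_IS_BETTER[mode] else min
    return pick(scores, key=scores.get)


# Made-up numbers, for illustration only.
example_scores = {"codec_a": 3.1, "codec_b": 4.2}
print(best_system(example_scores, "nr"))   # codec_b
print(best_system(example_scores, "ref"))  # codec_a
```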
## Usage
You can install the repo as a package:
```bash
pip install scoreq-pytorch
```
Or from source:
```bash
git clone https://github.com/Blinorot/scoreq-pytorch.git
cd scoreq-pytorch
pip install -e .
```
The code requires:
| Package | Version |
| --------------- | ------- |
| Python | >=3.9 |
| PyTorch | >=2.2.0 |
| HuggingFace Hub | >=0.20 |
The TorchScript checkpoint was scripted with `PyTorch 2.5.1`. We have tested that it works on `PyTorch 2.2.0`; however, `PyTorch >=2.5.1` is recommended for the TorchScript variant.
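If you are unsure whether an existing environment meets these requirements, a quick check along the following lines may help. This is only a sketch using the standard library: `parse_version` and `check_environment` are hypothetical helpers, the version parsing is deliberately naive, and the distribution names `torch` and `huggingface-hub` are assumed to correspond to the PyTorch and HuggingFace Hub rows of the table.

```python
import sys
from importlib import metadata

# Minimum versions taken from the requirements table.
# Keys are assumed pip distribution names.
MIN_PYTHON = (3, 9)
MIN_VERSIONS = {"torch": (2, 2, 0), "huggingface-hub": (0, 20)}


def parse_version(text):
    """Turn '2.5.1+cu121' into (2, 5, 1); stops at the first non-numeric part."""
    parts = []
    for piece in text.split("+")[0].split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def check_environment():
    """Return a list of human-readable problems; empty if all checks pass."""
    problems = []
    if sys.version_info[:2] < MIN_PYTHON:
        problems.append(f"Python {sys.version_info[:2]} is older than {MIN_PYTHON}")
    for name, minimum in MIN_VERSIONS.items():
        try:
            installed = parse_version(metadata.version(name))
        except metadata.PackageNotFoundError:
            problems.append(f"{name} is not installed")
            continue
        if installed < minimum:
            problems.append(f"{name} {installed} is older than {minimum}")
    return problems


print(check_environment() or "environment looks compatible")
```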
Then, you can run the model as follows:
```python
import torchaudio
from scoreq_pytorch import SCOREQScoreTorch

device = "cpu"  # set to "cuda" to use on GPU
data_domain = "natural"  # or "synthetic"
mode = "nr"  # or "ref"

scoreq = SCOREQScoreTorch(
    data_domain=data_domain,
    mode=mode,
    device=device
)  # already in eval mode

# load an audio file, e.g. using torchaudio
test_audio_path = ...  # path to an audio file
test_wav, sr = torchaudio.load(test_audio_path)

# convert to MONO 16 kHz
TARGET_SR = 16000
if test_wav.shape[0] != 1:
    test_wav = test_wav[0:1]  # keep only the first channel
if sr != TARGET_SR:
    test_wav = torchaudio.functional.resample(test_wav, orig_freq=sr, new_freq=TARGET_SR)

# put on device
test_wav = test_wav.to(device)

# for "ref" mode, you need a reference audio
# same loading and pre-processing procedure
if mode == "ref":
    ref_wav = ...
else:
    ref_wav = None

# calculate the score
# accepts T, 1xT, Bx1xT
scoreq_score = scoreq.score(test_wav, ref_wav)  # tensor of shape (batch_size,)
```
You can replace `SCOREQScoreTorch` with `SCOREQScoreScripted` to use the `TorchScript` variant instead. On first use, the package downloads converted SCOREQ weights from [Hugging Face Hub](https://huggingface.co/Blinorot/SCOREQ-PyTorch) and caches them locally using the Hugging Face cache.
For `TorchScript`, you can skip installing the package and load the model directly:
```python
import torch
import torchaudio
import wget

data_domain = "natural"  # or "synthetic"
mode = "nr"  # or "ref"

# download scripted checkpoint, e.g. using wget
checkpoint_url = f"https://huggingface.co/Blinorot/SCOREQ-PyTorch/resolve/main/scoreq_{data_domain}_{mode}_scripted.pt"
checkpoint_path = ...  # path to saved checkpoint
wget.download(checkpoint_url, checkpoint_path)

# load directly with torch.jit
device = "cpu"  # set to "cuda" to use on GPU
scoreq = torch.jit.load(checkpoint_path, map_location=device)
scoreq.eval()

# load an audio file, e.g. using torchaudio
test_audio_path = ...  # path to an audio file
test_wav, sr = torchaudio.load(test_audio_path)

# convert to MONO 16 kHz
TARGET_SR = 16000
if test_wav.shape[0] != 1:
    test_wav = test_wav[0:1]  # keep only the first channel
if sr != TARGET_SR:
    test_wav = torchaudio.functional.resample(test_wav, orig_freq=sr, new_freq=TARGET_SR)

# put on device
test_wav = test_wav.to(device)

# for "ref" mode, you need a reference audio
# same loading and pre-processing procedure
if mode == "ref":
    ref_wav = ...
else:
    ref_wav = None

# calculate the score
# accepts T, 1xT, Bx1xT
with torch.no_grad():
    scoreq_score = scoreq(test_wav, ref_wav)  # tensor of shape (batch_size,)
```
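The `wget` module used above is a third-party package. If you prefer to stay within the standard library, the download step could be sketched as follows. The URL pattern is the one shown above; the helper names `scripted_checkpoint_url` and `fetch_scripted_checkpoint` and the `checkpoints` directory are hypothetical choices for illustration.

```python
import os
import urllib.request

REPO_URL = "https://huggingface.co/Blinorot/SCOREQ-PyTorch/resolve/main"


def scripted_checkpoint_url(data_domain, mode):
    """Build the checkpoint URL from the data domain and mode."""
    if data_domain not in ("natural", "synthetic"):
        raise ValueError(f"unknown data_domain: {data_domain!r}")
    if mode not in ("nr", "ref"):
        raise ValueError(f"unknown mode: {mode!r}")
    return f"{REPO_URL}/scoreq_{data_domain}_{mode}_scripted.pt"


def fetch_scripted_checkpoint(data_domain, mode, cache_dir="checkpoints"):
    """Download the checkpoint once and return its local path."""
    url = scripted_checkpoint_url(data_domain, mode)
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir, url.rsplit("/", 1)[-1])
    if not os.path.exists(path):  # skip the download on later runs
        tmp_path = path + ".part"
        urllib.request.urlretrieve(url, tmp_path)
        os.replace(tmp_path, path)  # avoid leaving a half-written checkpoint
    return path
```

The returned path can then be passed to `torch.jit.load` in place of `checkpoint_path`.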
### Notes
The model expects audio sampled at **16 kHz**.
Accepted tensor shapes:
| Shape | Meaning |
| ----------- | ------------------------------------------------ |
| `(T,)`      | single mono test waveform                        |
| `(1, T)`    | single mono test waveform with channel dimension |
| `(B, 1, T)` | batch of mono test waveforms                     |
The input should be a floating-point PyTorch tensor. Stereo audio should be converted to mono before scoring. `scoreq.score(test_wav)` returns a tensor of shape `(batch_size,)`, where each value is a predicted quality score.
In reference (`ref`) mode, a reference audio `ref_wav` must also be provided: `scoreq.score(test_wav, ref_wav)`.
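To catch shape mistakes (for example, a stereo `(2, T)` tensor) before scoring, a small pre-check can mirror the table of accepted shapes. This is a sketch only: `batch_size_from_shape` is a hypothetical helper, and it assumes that any shape outside that table should be treated as an error, which the package itself may or may not enforce.

```python
def batch_size_from_shape(shape):
    """Map an input shape to its batch size, or raise if it is not accepted."""
    shape = tuple(shape)
    if len(shape) == 1:                    # (T,)
        return 1
    if len(shape) == 2 and shape[0] == 1:  # (1, T)
        return 1
    if len(shape) == 3 and shape[1] == 1:  # (B, 1, T)
        return shape[0]
    raise ValueError(
        f"unsupported shape {shape}: expected (T,), (1, T) or (B, 1, T); "
        "convert stereo audio to mono first"
    )


print(batch_size_from_shape((16000,)))       # 1
print(batch_size_from_shape((1, 16000)))     # 1
print(batch_size_from_shape((4, 1, 16000)))  # 4
```

With a real tensor, the call would be `batch_size_from_shape(test_wav.shape)`.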
Note that `score()` and `forward()` return the same values. The only difference is that `score()` is decorated with `torch.no_grad()` for convenient inference. Since the raw TorchScript module exposes `forward()`, it is called directly as `scoreq(test_wav, ref_wav)` rather than through the package wrapper's `scoreq.score(test_wav, ref_wav)`.
**Batch size 1 is recommended to avoid padding-related score shifts.**
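One way to follow this recommendation for a list of files is to score each waveform separately, so that no padding is needed. The sketch below is model-agnostic: `score_one_by_one` is a hypothetical helper that accepts any callable with the `(test_wav, ref_wav)` signature, such as `scoreq.score`, and `fake_score` is a stand-in used only to make the example runnable without the model.

```python
def score_one_by_one(score_fn, test_wavs, ref_wavs=None):
    """Score each waveform separately so that no padding is ever needed.

    score_fn is any callable score_fn(test_wav, ref_wav) that returns a
    sequence with one score per batch item, e.g. scoreq.score.
    """
    if ref_wavs is None:
        ref_wavs = [None] * len(test_wavs)
    if len(ref_wavs) != len(test_wavs):
        raise ValueError("need exactly one reference per test waveform")
    scores = []
    for test_wav, ref_wav in zip(test_wavs, ref_wavs):
        out = score_fn(test_wav, ref_wav)  # one score for a single waveform
        scores.append(float(out[0]))
    return scores


# Stand-in scorer for illustration only: returns the waveform length.
def fake_score(test_wav, ref_wav):
    return [float(len(test_wav))]


print(score_one_by_one(fake_score, [[0.0] * 3, [0.0] * 5]))  # [3.0, 5.0]
```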
API classes:
| Class | Description |
| --------------------- | ----------------------------------------------- |
| `SCOREQScoreTorch` | PyTorch implementation using converted weights. |
| `SCOREQScoreScripted` | Wrapper around the TorchScript checkpoint. |
## Citation
If you use this package, please cite the original SCOREQ paper:
```bibtex
@article{ragano2024scoreq,
title={SCOREQ: Speech quality assessment with contrastive regression},
author={Ragano, Alessandro and Skoglund, Jan and Hines, Andrew},
journal={Advances in Neural Information Processing Systems},
volume={37},
pages={105702--105729},
year={2024}
}
```