---
license: mit
---

# SCOREQ-PyTorch

## About

This is an unofficial `fairseq`-free implementation of the SCOREQ Speech Quality Assessment system proposed in [SCOREQ: Speech Quality Assessment with Contrastive Regression](https://arxiv.org/abs/2410.06675).

The [original implementation](https://github.com/alessandroragano/scoreq) provides a `fairseq`-based PyTorch model and an ONNX variant. In practice, the `fairseq` dependency can be difficult to install with recent Python, PyTorch, and dependency versions. The ONNX variant avoids `fairseq`, but it can be less convenient for PyTorch-based research workflows and may be difficult to run with GPU acceleration on `ARM/aarch64` systems.

A [recent study from ICASSP 2026](https://arxiv.org/abs/2509.24457) highlights the high correlation of SCOREQ with subjective listening scores for neural codecs. Modern neural audio codec and TTS research therefore benefits from an easy-to-install SCOREQ implementation.

We provide a `fairseq`-free implementation written directly in `PyTorch` that matches the [original system](https://github.com/alessandroragano/scoreq) using converted weights and reimplemented modules.

We also provide a `TorchScript` variant that can be loaded with only PyTorch, without installing this package.

The PyTorch and TorchScript versions are validated against the original implementation and produce matching scores.

> [!NOTE]
> In contrast to the original implementation, we support batched audio assessment. However, we recommend running SCOREQ with **batch size 1** to avoid metric shifts caused by padding. Batching can be used for faster evaluation when small padding-related score differences are acceptable.

## Model Types

As in the [original system](https://github.com/alessandroragano/scoreq), we support four SCOREQ variants: two data domains combined with two modes.

Data domain (what kind of audio is evaluated):

- `natural`: used for audio derived from genuine human speech (Audio Codecs, VoIP, Telephony, Speech Enhancement, Audio Restoration).
- `synthetic`: used for audio synthesized by a machine (Text-to-Speech (TTS), Voice Conversion (VC), Generative Speech Models).

Mode (whether there is a reference audio to compare with):

- `nr`: no-reference mode. Assesses the quality of the audio without relying on any reference, **the higher the better**.
- `ref`: reference mode. Calculates the distance between the embeddings of the provided audio and the reference audio, **the lower the better**.

We refer the user to the [original repository](https://github.com/alessandroragano/scoreq) and [paper](https://arxiv.org/abs/2410.06675) for more details on model types.

## Usage

You can install the repo as a package:

```bash
pip install scoreq-pytorch
```

Or from source:

```bash
git clone https://github.com/Blinorot/scoreq-pytorch.git
cd scoreq-pytorch
pip install -e .
```

The code requires:

| Package         | Version |
| --------------- | ------- |
| Python          | >=3.9   |
| PyTorch         | >=2.2.0 |
| HuggingFace Hub | >=0.20  |

The TorchScript checkpoint was scripted with `PyTorch 2.5.1`. We have tested that it also works with `PyTorch 2.2.0`; however, `PyTorch >=2.5.1` is recommended for the TorchScript variant.

Then, you can run the model as follows:

```python
import torchaudio
from scoreq_pytorch import SCOREQScoreTorch

device = "cpu"  # set to "cuda" to run on GPU
data_domain = "natural"  # or "synthetic"
mode = "nr"  # or "ref"
scoreq = SCOREQScoreTorch(
    data_domain=data_domain,
    mode=mode,
    device=device,
)  # already in eval mode

# load an audio file, e.g. using torchaudio
test_audio_path = ...  # path to an audio file
test_wav, sr = torchaudio.load(test_audio_path)

# convert to MONO 16 kHz
TARGET_SR = 16000
if test_wav.shape[0] != 1:
    test_wav = test_wav[0:1]  # keep only the first channel
if sr != TARGET_SR:
    test_wav = torchaudio.functional.resample(test_wav, orig_freq=sr, new_freq=TARGET_SR)
# put on device
test_wav = test_wav.to(device)

# for "ref" mode, you need a reference audio
# same loading and pre-processing procedure
if mode == "ref":
    ref_wav = ...
else:
    ref_wav = None

# calculate the score
# accepts T, 1xT, Bx1xT
scoreq_score = scoreq.score(test_wav, ref_wav)  # tensor of shape (batch_size,)
```

You can replace `SCOREQScoreTorch` with `SCOREQScoreScripted` to use the `TorchScript` variant instead. On first use, the package downloads the converted SCOREQ weights from [Hugging Face Hub](https://huggingface.co/Blinorot/SCOREQ-PyTorch) and stores them in the local Hugging Face cache.

For `TorchScript`, you can also skip installing the package and load the model directly:

```python
import torch
import torchaudio
import wget

data_domain = "natural"  # or "synthetic"
mode = "nr"  # or "ref"

# download scripted checkpoint, e.g. using wget
checkpoint_url = f"https://huggingface.co/Blinorot/SCOREQ-PyTorch/resolve/main/scoreq_{data_domain}_{mode}_scripted.pt"
checkpoint_path = ...  # path to saved checkpoint
wget.download(checkpoint_url, checkpoint_path)

# load directly with torch.jit
device = "cpu"  # set to "cuda" to run on GPU
scoreq = torch.jit.load(checkpoint_path, map_location=device)
scoreq.eval()

# load an audio file, e.g. using torchaudio
test_audio_path = ...  # path to an audio file
test_wav, sr = torchaudio.load(test_audio_path)

# convert to MONO 16 kHz
TARGET_SR = 16000
if test_wav.shape[0] != 1:
    test_wav = test_wav[0:1]  # keep only the first channel
if sr != TARGET_SR:
    test_wav = torchaudio.functional.resample(test_wav, orig_freq=sr, new_freq=TARGET_SR)
# put on device
test_wav = test_wav.to(device)

# for "ref" mode, you need a reference audio
# same loading and pre-processing procedure
if mode == "ref":
    ref_wav = ...
else:
    ref_wav = None

# calculate the score
# accepts T, 1xT, Bx1xT
with torch.no_grad():
    scoreq_score = scoreq(test_wav, ref_wav)  # tensor of shape (batch_size,)
```
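
The checkpoint naming used in the snippet above can be wrapped in a small helper. This is only a sketch: `scripted_checkpoint_url` and `download_scripted_checkpoint` are hypothetical names that are not part of the package, the code assumes that all four domain/mode combinations follow the same `scoreq_{data_domain}_{mode}_scripted.pt` pattern, and it swaps `wget` for the standard-library `urllib.request.urlretrieve`.

```python
from urllib.request import urlretrieve

# Base URL taken from the snippet above.
REPO_URL = "https://huggingface.co/Blinorot/SCOREQ-PyTorch/resolve/main"
DATA_DOMAINS = ("natural", "synthetic")
MODES = ("nr", "ref")


def scripted_checkpoint_url(data_domain: str, mode: str) -> str:
    """Build the URL of a scripted checkpoint (hypothetical helper)."""
    if data_domain not in DATA_DOMAINS:
        raise ValueError(f"data_domain must be one of {DATA_DOMAINS}, got {data_domain!r}")
    if mode not in MODES:
        raise ValueError(f"mode must be one of {MODES}, got {mode!r}")
    # Assumption: every combination uses the same file naming pattern.
    return f"{REPO_URL}/scoreq_{data_domain}_{mode}_scripted.pt"


def download_scripted_checkpoint(data_domain: str, mode: str, checkpoint_path: str) -> str:
    """Download the checkpoint to `checkpoint_path` and return that path."""
    urlretrieve(scripted_checkpoint_url(data_domain, mode), checkpoint_path)
    return checkpoint_path
```

The returned path can then be passed to `torch.jit.load` exactly as in the example above.
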

### Notes

The model expects audio sampled at **16 kHz**.

Accepted tensor shapes:

| Shape       | Meaning                                     |
| ----------- | ------------------------------------------- |
| `(T,)`      | single mono waveform                        |
| `(1, T)`    | single mono waveform with channel dimension |
| `(B, 1, T)` | batch of mono waveforms                     |

The input should be a floating point PyTorch tensor. Stereo audio should be converted to mono before scoring. `scoreq.score(test_wav)` returns a tensor of shape `(batch_size,)`, where each value is a predicted quality score.

In `ref` mode, a reference audio `ref_wav` must also be provided: `scoreq.score(test_wav, ref_wav)`. Each returned value is then the distance between the test and reference embeddings.

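
As an illustration of the input conventions above, a hypothetical helper (not part of the package API) could validate a waveform and bring it to `(B, 1, T)` before scoring. It is a minimal sketch that assumes the three shapes listed in the table are the only accepted layouts:

```python
import torch


def to_batched_mono(wav: torch.Tensor) -> torch.Tensor:
    """Return `wav` as a (B, 1, T) tensor; raise on unsupported input."""
    if not torch.is_floating_point(wav):
        raise TypeError(f"expected a floating point tensor, got {wav.dtype}")
    if wav.dim() == 1:  # (T,)
        return wav.reshape(1, 1, -1)
    if wav.dim() == 2 and wav.shape[0] == 1:  # (1, T)
        return wav.unsqueeze(0)
    if wav.dim() == 3 and wav.shape[1] == 1:  # (B, 1, T)
        return wav
    raise ValueError(
        f"unsupported shape {tuple(wav.shape)}; convert stereo audio to mono first"
    )


# One second of silence at 16 kHz, given as a flat (T,) tensor.
batched = to_batched_mono(torch.zeros(16000))
```
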

Note that `score()` and `forward()` return the same values. The only difference is that `score()` is decorated with `torch.no_grad()` for convenient inference. Since the raw TorchScript module exposes `forward()`, it is called directly as `scoreq(test_wav, ref_wav)` rather than through the package wrapper's `scoreq.score(test_wav, ref_wav)`.

**Batch size 1 is recommended to avoid padding-related score shifts.**

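
To see where the padding comes from, the sketch below builds a `(B, 1, T)` batch from waveforms of different lengths. It assumes zero right-padding via `torch.nn.utils.rnn.pad_sequence`, which is one common choice and not necessarily the padding scheme the package expects; `pad_to_batch` is a hypothetical helper name.

```python
import torch
from torch.nn.utils.rnn import pad_sequence


def pad_to_batch(wavs):
    """Zero-pad 1-D waveforms of shape (T_i,) into one (B, 1, T_max) batch."""
    lengths = torch.tensor([w.shape[-1] for w in wavs])
    batch = pad_sequence(wavs, batch_first=True)  # (B, T_max), zeros on the right
    return batch.unsqueeze(1), lengths  # add the channel dimension


# Two mono clips at 16 kHz: 1.0 s and 1.5 s long.
wavs = [torch.randn(16000), torch.randn(24000)]
batch, lengths = pad_to_batch(wavs)

# The shorter clip now ends with 8000 zero samples that the model also sees,
# which is the kind of padding that can shift its score.
num_padded = int(batch.shape[-1] - lengths[0])
```
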

API classes:

| Class                 | Description                                     |
| --------------------- | ----------------------------------------------- |
| `SCOREQScoreTorch`    | PyTorch implementation using converted weights. |
| `SCOREQScoreScripted` | Wrapper around the TorchScript checkpoint.      |

## Citation

If you use this package, please cite the original SCOREQ paper:

```bibtex
@article{ragano2024scoreq,
  title={SCOREQ: Speech quality assessment with contrastive regression},
  author={Ragano, Alessandro and Skoglund, Jan and Hines, Andrew},
  journal={Advances in Neural Information Processing Systems},
  volume={37},
  pages={105702--105729},
  year={2024}
}
```