	---
	license: mit
	---

	# SCOREQ-PyTorch

	## About

	This is an unofficial `fairseq`-free implementation of the SCOREQ Speech Quality Assessment system proposed in [SCOREQ: Speech Quality Assessment with Contrastive Regression](https://arxiv.org/abs/2410.06675).

	The [original implementation](https://github.com/alessandroragano/scoreq) provides a `fairseq`-based PyTorch model and an ONNX variant. In practice, the `fairseq` dependency can be difficult to install with recent Python, PyTorch, and dependency versions. The ONNX variant avoids `fairseq`, but it can be less convenient for PyTorch-based research workflows and may be difficult to run with GPU acceleration on `ARM/aarch64` systems.

	A [recent study from ICASSP 2026](https://arxiv.org/abs/2509.24457) highlights that SCOREQ correlates highly with subjective listening scores for neural codecs. Modern neural audio codec and TTS research therefore benefits from an easy-to-install SCOREQ implementation.

	We provide a `fairseq`-free implementation written directly in `PyTorch` that matches the [original system](https://github.com/alessandroragano/scoreq) using converted weights and reimplemented modules.

	We also provide a `TorchScript` variant that can be loaded with only PyTorch, without installing this package.

	The PyTorch and TorchScript versions are validated against the original implementation and produce matching scores.

	> [!NOTE]
	> In contrast to the original implementation, we support batched audio assessment. However, we recommend running SCOREQ with batch size 1 to avoid metric shifts caused by padding. Batching can be used for faster evaluation when small padding-related score differences are acceptable.

	## Model Types

	As in the [original system](https://github.com/alessandroragano/scoreq), we support four SCOREQ variants, formed by combining two data domains with two modes.

	Data domain (what kind of audio is evaluated):

	- `natural`: used for audio that originates from genuine human speech (Audio Codecs, VoIP, Telephony, Speech Enhancement, Audio Restoration).
	- `synthetic`: used for audio that was synthesized by a machine (Text-to-Speech (TTS), Voice Conversion (VC), Generative Speech Models).

	Mode (whether there is a reference audio to compare with):

	- `nr`: no-reference mode. Assesses the quality of audio, the higher the better, without relying on any reference.
	- `ref`: reference mode. Calculates the distance between the embeddings of the provided audio and the reference audio, the lower the better.

	We refer the user to the [original repository](https://github.com/alessandroragano/scoreq) and [paper](https://arxiv.org/abs/2410.06675) for more details on model types.
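
To make the `ref` mode more concrete, here is a minimal sketch of a distance between two embeddings. The README only says "distance", so the Euclidean norm used below is an assumption for illustration, and `embedding_distance` is a hypothetical helper rather than part of the package API. The embeddings are plain tensors standing in for the real model outputs.

```python
import torch


def embedding_distance(test_emb: torch.Tensor, ref_emb: torch.Tensor) -> torch.Tensor:
    """Return one distance per batch item for embeddings of shape (B, D).

    Assumption: Euclidean (L2) distance; the actual metric used by SCOREQ
    should be checked against the original repository and paper.
    """
    return torch.linalg.norm(test_emb - ref_emb, dim=-1)


# Stand-in embeddings, not real SCOREQ outputs.
test_emb = torch.tensor([[3.0, 4.0], [1.0, 1.0]])
ref_emb = torch.tensor([[0.0, 0.0], [1.0, 1.0]])
distances = embedding_distance(test_emb, ref_emb)  # lower means closer to the reference
```

A distance of zero means the two embeddings are identical, which matches the "lower is better" reading of the `ref` mode.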

	## Usage

	You can install the repo as a package:

	```bash
	pip install scoreq-pytorch
	```

	Or from source:

	```bash
	git clone https://github.com/Blinorot/scoreq-pytorch.git
	cd scoreq-pytorch
	pip install -e .
	```

	The code requires:

	| Package         | Version |
	| --------------- | ------- |
	| Python          | >=3.9   |
	| PyTorch         | >=2.2.0 |
	| HuggingFace Hub | >=0.20  |

	The TorchScript checkpoint was scripted with `PyTorch 2.5.1`. We have verified that it also runs on `PyTorch 2.2.0`, but `PyTorch >=2.5.1` is recommended for the TorchScript variant.
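
If you want to check an environment against these version bounds programmatically, something like the following sketch may help. The helper names (`parse_version`, `check_torch_version`) and the three labels are hypothetical; only the version numbers come from the requirements above.

```python
import re

MIN_SUPPORTED = (2, 2, 0)         # minimum PyTorch version from the requirements table
RECOMMENDED_SCRIPTED = (2, 5, 1)  # version the TorchScript checkpoint was scripted with


def parse_version(version: str) -> tuple:
    """Extract leading numeric components, e.g. '2.5.1+cu121' -> (2, 5, 1)."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    # A missing patch component (e.g. '2.6') is treated as 0.
    return tuple(int(part or 0) for part in match.groups())


def check_torch_version(version: str) -> str:
    """Classify a PyTorch version string against the bounds above."""
    parsed = parse_version(version)
    if parsed < MIN_SUPPORTED:
        return "unsupported"
    if parsed < RECOMMENDED_SCRIPTED:
        return "supported"
    return "recommended"
```

In practice you would call `check_torch_version(torch.__version__)`; local build suffixes such as `+cu121` are ignored by the parser.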

	Then, you can run the model as follows:

	```python
	import torchaudio
	from scoreq_pytorch import SCOREQScoreTorch

	device = "cpu" # set to "cuda" to use on GPU
	data_domain = "natural" # or "synthetic"
	mode = "nr" # or "ref"
	scoreq = SCOREQScoreTorch(
	    data_domain=data_domain,
	    mode=mode,
	    device=device,
	)  # already in eval mode

	# load an audio file, e.g. using torchaudio
	test_audio_path = ... # path to an audio file
	test_wav, sr = torchaudio.load(test_audio_path)

	# convert to MONO 16 kHz
	TARGET_SR = 16000
	if test_wav.shape[0] != 1:
	    test_wav = test_wav[0:1]
	if sr != TARGET_SR:
	    test_wav = torchaudio.functional.resample(test_wav, orig_freq=sr, new_freq=TARGET_SR)
	# put on device
	test_wav = test_wav.to(device)

	# for "ref" mode, you need a reference audio
	# same loading and pre-processing procedure
	if mode == "ref":
	    ref_wav = ...
	else:
	    ref_wav = None

	# calculate the score
	# accepts T, 1xT, Bx1xT
	scoreq_score = scoreq.score(test_wav, ref_wav) # tensor of shape (batch_size,)
	```

	You can replace `SCOREQScoreTorch` with `SCOREQScoreScripted` to use the `TorchScript` variant instead. On first use, the package downloads converted SCOREQ weights from [Hugging Face Hub](https://huggingface.co/Blinorot/SCOREQ-PyTorch) and caches them locally using the Hugging Face cache.

	For `TorchScript`, you can skip installing the package and load the model directly:

	```python
	import torch
	import torchaudio
	import wget  # third-party package: pip install wget

	data_domain = "natural" # or "synthetic"
	mode = "nr" # or "ref"

	# download scripted checkpoint, e.g. using wget
	checkpoint_url = f"https://huggingface.co/Blinorot/SCOREQ-PyTorch/resolve/main/scoreq_{data_domain}_{mode}_scripted.pt"
	checkpoint_path = ... # path to saved checkpoint
	wget.download(checkpoint_url, checkpoint_path)

	# load directly with torch.jit
	device = "cpu" # set to "cuda" to use on GPU
	scoreq = torch.jit.load(checkpoint_path, map_location=device)
	scoreq.eval()

	# load an audio file, e.g. using torchaudio
	test_audio_path = ... # path to an audio file
	test_wav, sr = torchaudio.load(test_audio_path)

	# convert to MONO 16 kHz
	TARGET_SR = 16000
	if test_wav.shape[0] != 1:
	    test_wav = test_wav[0:1]
	if sr != TARGET_SR:
	    test_wav = torchaudio.functional.resample(test_wav, orig_freq=sr, new_freq=TARGET_SR)
	# put on device
	test_wav = test_wav.to(device)

	# for "ref" mode, you need a reference audio
	# same loading and pre-processing procedure
	if mode == "ref":
	    ref_wav = ...
	else:
	    ref_wav = None

	# calculate the score
	# accepts T, 1xT, Bx1xT
	with torch.no_grad():
	    scoreq_score = scoreq(test_wav, ref_wav)  # tensor of shape (batch_size,)
	```
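
The checkpoint URL in the example above is built from the data domain and the mode. A small sketch of that naming scheme with input validation could look as follows. The function and constant names are hypothetical; the URL pattern itself is copied from the example.

```python
VALID_DOMAINS = ("natural", "synthetic")
VALID_MODES = ("nr", "ref")
REPO_URL = "https://huggingface.co/Blinorot/SCOREQ-PyTorch"


def scripted_checkpoint_name(data_domain: str, mode: str) -> str:
    """Return the TorchScript checkpoint filename for a domain/mode pair."""
    if data_domain not in VALID_DOMAINS:
        raise ValueError(f"data_domain must be one of {VALID_DOMAINS}, got {data_domain!r}")
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {VALID_MODES}, got {mode!r}")
    return f"scoreq_{data_domain}_{mode}_scripted.pt"


def scripted_checkpoint_url(data_domain: str, mode: str) -> str:
    """Return the direct download URL, following the pattern used in the example."""
    return f"{REPO_URL}/resolve/main/{scripted_checkpoint_name(data_domain, mode)}"


# Enumerate all four variants (2 domains x 2 modes).
all_urls = [scripted_checkpoint_url(d, m) for d in VALID_DOMAINS for m in VALID_MODES]
```

If `huggingface_hub` is installed, `hf_hub_download(repo_id="Blinorot/SCOREQ-PyTorch", filename=scripted_checkpoint_name(data_domain, mode))` should be a possible alternative to `wget`, assuming the files sit at the repository root as the URL pattern suggests.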

	### Notes

	The model expects audio sampled at 16 kHz.

	Accepted tensor shapes:

	| Shape       | Meaning                                     |
	| ----------- | ------------------------------------------- |
	| `(T,)`      | single mono waveform                        |
	| `(1, T)`    | single mono waveform with channel dimension |
	| `(B, 1, T)` | batch of mono waveforms                     |

	The input should be a floating point PyTorch tensor. Stereo audio should be converted to mono before scoring. `scoreq.score(test_wav)` returns a tensor of shape `(batch_size,)`, where each value is a predicted quality score.
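
As an illustration of the accepted shapes, the sketch below brings any of them to the common `(B, 1, T)` layout and rejects non-float or multi-channel input. `to_batched_mono` is a hypothetical helper; whether the package performs the same normalization internally is not stated here.

```python
import torch


def to_batched_mono(wav: torch.Tensor) -> torch.Tensor:
    """Normalize (T,), (1, T) or (B, 1, T) input to shape (B, 1, T)."""
    if not torch.is_floating_point(wav):
        raise TypeError(f"expected a floating point tensor, got dtype {wav.dtype}")
    if wav.dim() == 1:                          # (T,) -> (1, 1, T)
        return wav[None, None, :]
    if wav.dim() == 2 and wav.shape[0] == 1:    # (1, T) -> (1, 1, T)
        return wav[None, :, :]
    if wav.dim() == 3 and wav.shape[1] == 1:    # (B, 1, T) is already in the target layout
        return wav
    raise ValueError(f"unsupported shape {tuple(wav.shape)}; convert to mono first")


one_second = torch.zeros(16000)  # one second of silence at 16 kHz
batched = to_batched_mono(one_second)
```

A stereo tensor of shape `(2, T)` raises a `ValueError` in this sketch, which is a reminder to convert to mono before scoring.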

	In reference (`ref`) mode, a reference audio `ref_wav` must also be provided: `scoreq.score(test_wav, ref_wav)`.

	Note that `score()` and `forward()` return the same values. The only difference is that `score()` is decorated with `torch.no_grad()` for convenient inference. Since the raw TorchScript module exposes `forward()`, it is called directly as `scoreq(test_wav, ref_wav)` rather than through the package wrapper's `scoreq.score(test_wav, ref_wav)`.
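
The relationship between `score()` and `forward()` can be illustrated with a toy module. `ToyScorer` below is a hypothetical stand-in with a single linear layer, not the SCOREQ model; it only shows the effect of decorating a method with `torch.no_grad()`.

```python
import torch
from torch import nn


class ToyScorer(nn.Module):
    """Toy stand-in: maps 4 features to one score per batch item."""

    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(4, 1)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.linear(features).squeeze(-1)

    @torch.no_grad()  # same computation as forward(), but no autograd graph is built
    def score(self, features: torch.Tensor) -> torch.Tensor:
        return self.forward(features)


toy = ToyScorer().eval()
features = torch.randn(3, 4)
with_grad = toy(features)           # tracks gradients through the parameters
without_grad = toy.score(features)  # same values, detached from autograd
```

Here `with_grad.requires_grad` is `True` while `without_grad.requires_grad` is `False`, and both hold the same values. This is why the raw TorchScript example wraps the call in `with torch.no_grad():` explicitly.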

	Batch size 1 is recommended to avoid padding-related score shifts.
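
The sketch below shows why padding can shift scores. It zero-pads waveforms of different lengths into one batch and compares that with scoring each waveform on its own. `pad_to_batch`, `score_one_by_one`, and `toy_score` are hypothetical helpers; `toy_score` (mean absolute amplitude) is only a stand-in that happens to be sensitive to padding, not the SCOREQ metric.

```python
import torch


def pad_to_batch(wavs):
    """Right-pad a list of (1, T_i) waveforms with zeros into a (B, 1, T_max) batch."""
    max_len = max(w.shape[-1] for w in wavs)
    padded = [torch.nn.functional.pad(w, (0, max_len - w.shape[-1])) for w in wavs]
    return torch.stack(padded, dim=0)


def score_one_by_one(score_fn, wavs):
    """Score each waveform separately (batch size 1), so no padding is involved."""
    return torch.cat([score_fn(w).reshape(-1) for w in wavs])


def toy_score(wav):
    """Stand-in scorer: mean absolute amplitude per item, returns shape (batch_size,)."""
    return wav.reshape(-1, wav.shape[-1]).abs().mean(dim=-1)


wavs = [torch.ones(1, 4), torch.ones(1, 8)]
batch = pad_to_batch(wavs)                        # shape (2, 1, 8)
batched_scores = toy_score(batch)                 # first item is diluted by padding
single_scores = score_one_by_one(toy_score, wavs)
```

With this stand-in, the shorter waveform scores 1.0 on its own but 0.5 inside the padded batch. With the real model you would pass `scoreq.score` as `score_fn`; the size of the shift there is not quantified in this README.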

	API classes:

	| Class                 | Description                                     |
	| --------------------- | ----------------------------------------------- |
	| `SCOREQScoreTorch`    | PyTorch implementation using converted weights. |
	| `SCOREQScoreScripted` | Wrapper around the TorchScript checkpoint.      |

	## Citation

	If you use this package, please cite the original SCOREQ paper:

	```bibtex
	@article{ragano2024scoreq,
	title={SCOREQ: Speech quality assessment with contrastive regression},
	author={Ragano, Alessandro and Skoglund, Jan and Hines, Andrew},
	journal={Advances in Neural Information Processing Systems},
	volume={37},
	pages={105702--105729},
	year={2024}
	}
	```