Blinorot/UTMOS-PyTorch

Blinorot committed 3 days ago

Commit b1fea21 (verified) · 1 parent: 91a1fa1

Create README.md

Files changed (1)

README.md +152 -0

README.md ADDED

	@@ -0,0 +1,152 @@

+---
+license: mit
+---
+## About
+This is an unofficial `fairseq`-free implementation of the UTMOS MOS Prediction system proposed in [UTMOS: UTokyo-SaruLab System for VoiceMOS Challenge 2022](https://arxiv.org/abs/2204.02152).
+The [original implementation](https://github.com/sarulab-speech/UTMOS22) is based on [fairseq](https://github.com/facebookresearch/fairseq). However, `fairseq` is difficult to install with recent versions of Python, PyTorch, and other dependencies, which makes UTMOS hard to use in modern environments. A [recent study from ICASSP 2026](https://arxiv.org/abs/2509.24457) highlights the high correlation of UTMOS with subjective listening scores for neural codecs, so modern neural audio codec and TTS research benefits from an easy-to-install UTMOS implementation.
+We provide a `fairseq`-free implementation written in `PyTorch` that matches the [original system](https://github.com/sarulab-speech/UTMOS22) using converted weights and rewritten modules.
+We also provide a `TorchScript` variant that can be loaded with only PyTorch, without installing this package.
+The PyTorch and TorchScript versions are validated against the original implementation and produce matching scores.
+**Note**: As in the original version, we recommend running UTMOS with batch size 1 to avoid metric shifts caused by padding.
+See the [GitHub repository](https://github.com/Blinorot/utmos-pytorch) for the source code.
+## Usage
+You can install the package with `pip`:
+```bash
+pip install utmos-pytorch
+```
+Or from source:
+```bash
+git clone https://github.com/Blinorot/UTMOS-PyTorch.git
+cd UTMOS-PyTorch
+pip install -e .
+```
+The code requires:
+| Package         | Version |
+| --------------- | ------- |
+| Python          | >=3.9   |
+| PyTorch         | >=2.2.0 |
+| HuggingFace Hub | >=0.20  |
+The TorchScript checkpoint was scripted with `PyTorch 2.5.1`. Loading it with older
+PyTorch versions is not guaranteed to work; `PyTorch >=2.5.1` is recommended for the
+TorchScript variant.
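One way to act on this recommendation is to check the installed PyTorch version before loading the TorchScript checkpoint. The following is a minimal sketch under stated assumptions: the helper names `meets_minimum` and `torch_ok_for_torchscript` are hypothetical and not part of `utmos-pytorch`, and only the leading numeric part of a version string such as `torch.__version__` is compared.

```python
import re


def meets_minimum(version: str, minimum: tuple = (2, 5, 1)) -> bool:
    """Return True if `version` is at least `minimum`.

    Hypothetical helper (not part of utmos-pytorch). Only the leading
    numeric components are compared, so build tags such as "+cu121"
    are ignored.
    """
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    # A missing patch component (e.g. "2.6") is treated as 0.
    parsed = tuple(int(part or 0) for part in match.groups())
    return parsed >= minimum


def torch_ok_for_torchscript() -> bool:
    """Check the installed PyTorch against the recommended 2.5.1."""
    import torch  # imported lazily so meets_minimum needs only the stdlib

    return meets_minimum(torch.__version__)
```

Comparing integer tuples rather than raw strings avoids ordering `2.10.0` below `2.5.1`.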
+Then, you can run the model as follows:
+```python
+import torchaudio
+from utmos_pytorch import UTMOSScoreTorch
+device = "cpu"  # set to "cuda" to run on GPU
+utmos = UTMOSScoreTorch(device=device)  # already in eval mode
+# load an audio file, e.g. using torchaudio
+audio_path = ...  # path to an audio file
+wav, sr = torchaudio.load(audio_path)  # wav has shape (channels, T)
+# convert to mono by keeping only the first channel
+TARGET_SR = 16000
+if wav.shape[0] != 1:
+    wav = wav[0:1]
+# resample to 16 kHz if needed
+if sr != TARGET_SR:
+    wav = torchaudio.functional.resample(wav, orig_freq=sr, new_freq=TARGET_SR)
+# move the waveform to the same device as the model
+wav = wav.to(device)
+# calculate the score
+# accepts shapes (T,), (1, T), and (B, 1, T)
+utmos_score = utmos.score(wav)  # tensor of shape (batch_size,)
+```
+You can replace `UTMOSScoreTorch` with `UTMOSScoreScripted` to use the `TorchScript` variant instead. On first use, the package downloads the converted UTMOS weights from the [Hugging Face Hub](https://huggingface.co/Blinorot/UTMOS-PyTorch) and stores them in the local Hugging Face cache.
+For `TorchScript`, you can skip installing this package and load the model directly:
+```python
+import torch
+import torchaudio
+import wget
+# download scripted checkpoint, e.g. using wget
+checkpoint_url = "https://huggingface.co/Blinorot/UTMOS-PyTorch/resolve/main/utmos_scripted.pt"
+checkpoint_path = ...  # path to saved checkpoint
+wget.download(checkpoint_url, checkpoint_path)
+# load directly with torch.jit
+device = "cpu"  # set to "cuda" to run on GPU
+utmos = torch.jit.load(checkpoint_path, map_location=device)
+utmos.eval()
+# load an audio file, e.g. using torchaudio
+audio_path = ...  # path to an audio file
+wav, sr = torchaudio.load(audio_path)  # wav has shape (channels, T)
+# convert to mono by keeping only the first channel
+TARGET_SR = 16000
+if wav.shape[0] != 1:
+    wav = wav[0:1]
+# resample to 16 kHz if needed
+if sr != TARGET_SR:
+    wav = torchaudio.functional.resample(wav, orig_freq=sr, new_freq=TARGET_SR)
+# move the waveform to the same device as the model
+wav = wav.to(device)
+# calculate the score without tracking gradients
+# accepts shapes (T,), (1, T), and (B, 1, T)
+with torch.no_grad():
+    utmos_score = utmos.score(wav)  # tensor of shape (batch_size,)
+```
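The example above downloads the checkpoint with the third-party `wget` package every time it runs. If you would rather avoid that extra dependency and repeated downloads, the sketch below does the same step with the standard library and skips the download when the file is already present. The helper name `fetch_checkpoint` is hypothetical (not part of `utmos-pytorch`), and it assumes the URL can be fetched with a plain, unauthenticated request.

```python
import os
import urllib.request


def fetch_checkpoint(url: str, dest: str) -> str:
    """Download `url` to `dest` unless `dest` already exists; return `dest`.

    Hypothetical helper (not part of utmos-pytorch).
    """
    if os.path.exists(dest):
        return dest  # reuse the previously downloaded file
    parent = os.path.dirname(dest)
    if parent:
        os.makedirs(parent, exist_ok=True)
    tmp = dest + ".part"
    # Download under a temporary name first, then rename, so that an
    # interrupted download is never mistaken for a complete checkpoint.
    urllib.request.urlretrieve(url, tmp)
    os.replace(tmp, dest)
    return dest
```

The returned path can then be passed to `torch.jit.load` exactly as `checkpoint_path` is in the example above.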
+### Notes
+The model expects audio sampled at **16 kHz**.
+Accepted tensor shapes:
+| Shape       | Meaning                                     |
+| ----------- | ------------------------------------------- |
+| `(T,)`      | single mono waveform                        |
+| `(1, T)`    | single mono waveform with channel dimension |
+| `(B, 1, T)` | batch of mono waveforms                     |
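To make the shape convention in the table concrete, the sketch below maps each accepted shape to the batched form `(B, 1, T)` and rejects anything else. It is an illustration of the table only: `canonical_shape` is a hypothetical helper, and the package's own input handling may differ.

```python
def canonical_shape(shape: tuple) -> tuple:
    """Map an accepted waveform shape to the batched form (B, 1, T).

    Hypothetical helper that mirrors the table of accepted shapes;
    it is not the package's own input handling.
    """
    shape = tuple(shape)
    if len(shape) == 1:  # (T,): single mono waveform
        return (1, 1, shape[0])
    if len(shape) == 2 and shape[0] == 1:  # (1, T): mono with channel dim
        return (1, 1, shape[1])
    if len(shape) == 3 and shape[1] == 1:  # (B, 1, T): batch of mono waveforms
        return shape
    raise ValueError(f"unsupported waveform shape: {shape}")
```

For a tensor `wav`, `wav.reshape(canonical_shape(wav.shape))` would give the batched view. A stereo input such as `(2, T)` raises `ValueError`, which is a cue to convert to mono first.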
+The input should be a floating-point PyTorch tensor. Stereo audio should be converted to mono before scoring. `utmos.score(wav)` returns a tensor of shape `(batch_size,)`, where each value is a predicted MOS; higher is better. **Batch size 1 is recommended to avoid padding-related score shifts.**
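Following the batch-size-1 recommendation, a set of clips can be scored one at a time and averaged afterwards, so that no clip is padded to the length of another. The sketch below is written against a generic scoring callable; `score_one_by_one` and `mean_score` are hypothetical helpers, not part of `utmos-pytorch`.

```python
from statistics import fmean
from typing import Callable, Iterable, List


def score_one_by_one(score_fn: Callable, clips: Iterable) -> List[float]:
    """Call `score_fn` on each clip separately and collect float scores.

    Hypothetical helper: because every clip is scored on its own,
    clips of different lengths never need to be padded into one batch.
    """
    return [float(score_fn(clip)) for clip in clips]


def mean_score(scores: List[float]) -> float:
    """Average the per-clip scores into a single system-level number."""
    return fmean(scores)
```

With the model from the examples above, a call could look like `score_one_by_one(lambda w: utmos.score(w).item(), wavs)`, assuming `wavs` is a list of mono 16 kHz tensors already on the right device.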
+API classes:
+| Class                | Description                                     |
+| -------------------- | ----------------------------------------------- |
+| `UTMOSScoreTorch`    | PyTorch implementation using converted weights. |
+| `UTMOSScoreScripted` | Wrapper around the TorchScript checkpoint.      |
+## Citation
+If you use this package, please cite the original UTMOS paper:
+```bibtex
+@inproceedings{saeki22c_interspeech,
+  title     = {{UTMOS: UTokyo-SaruLab System for VoiceMOS Challenge 2022}},
+  author    = {Takaaki Saeki and Detai Xin and Wataru Nakata and Tomoki Koriyama and Shinnosuke Takamichi and Hiroshi Saruwatari},
+  year      = {2022},
+  booktitle = {{Interspeech 2022}},
+  pages     = {4521--4525},
+  doi       = {10.21437/Interspeech.2022-439},
+  issn      = {2958-1796},
+}
+```