---
datasets:
- openslr/librispeech_asr
- facebook/multilingual_librispeech
- mozilla-foundation/common_voice_17_0
- speechcolab/gigaspeech
- facebook/voxpopuli
- agkphysics/AudioSet
language:
- en
library_name: transformers
license: bsd-3-clause
pipeline_tag: feature-extraction
tags:
- automatic-speech-recognition
- audio-classification
- audio
- speech
- music
---

# USAD: Universal Speech and Audio Representation via Distillation

The model was presented in the paper [USAD: Universal Speech and Audio Representation via Distillation](https://huggingface.co/papers/2506.18843).

The abstract of the paper is as follows:

Self-supervised learning (SSL) has revolutionized audio representations, yet models often remain domain-specific, focusing on either speech or non-speech tasks. In this work, we present Universal Speech and Audio Distillation (USAD), a unified approach to audio representation learning that integrates diverse audio types - speech, sound, and music - into a single model. USAD employs efficient layer-to-layer distillation from domain-specific SSL models to train a student on a comprehensive audio dataset. USAD offers competitive performance across various benchmarks and datasets, including frame and instance-level speech processing tasks, audio tagging, and sound classification, achieving near state-of-the-art results with a single encoder on SUPERB and HEAR benchmarks.

**Universal Speech and Audio Distillation (USAD)** is a unified **speech**, **sound**, and **music** encoder distilled from domain-specific teachers.
Trained on 126k hours of mixed data, USAD delivers competitive performance across diverse benchmarks (SUPERB, HEAR, and AudioSet) with a single model.

[πŸ‘€ **Read Full Paper**](https://arxiv.org/abs/2506.18843)

Code: [MIT-SLS/USAD](https://github.com/MIT-SLS/USAD)

---

## πŸ—‚οΈ Models

USAD models are all transformer encoders operating at a **50 Hz frame rate**. The teacher models are **WavLM Base+** and **ATST Frame**.

| Model      | Parameters | Dim  | Layer | Checkpoint                                        |
| ---------- | ---------- | ---- | ----- | ------------------------------------------------- |
| USAD Small | 24M        | 384  | 12    | [link](https://huggingface.co/MIT-SLS/USAD-Small) |
| USAD Base  | 94M        | 768  | 12    | [link](https://huggingface.co/MIT-SLS/USAD-Base)  |
| USAD Large | 330M       | 1024 | 24    | [link](https://huggingface.co/MIT-SLS/USAD-Large) |
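
Since all variants run at the 50 Hz frame rate noted above, the expected number of output frames follows directly from the clip duration. A minimal sanity check, assuming 16 kHz input audio (the rate the usage example below resamples to); the helper name `expected_frames` is illustrative, not part of the released API:

```python
# Estimate the USAD output frame count from a raw waveform length.
# Assumes 16 kHz input audio and the 50 Hz output frame rate stated above.
SAMPLE_RATE = 16_000
FRAME_RATE = 50

def expected_frames(num_samples: int) -> int:
    """Approximate number of 50 Hz output frames for a waveform."""
    return num_samples * FRAME_RATE // SAMPLE_RATE

# A 10-second clip (160,000 samples) yields roughly 500 frames.
print(expected_frames(160_000))  # → 500
```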

---


## πŸš€ How To Use

**Installation**
```bash
pip install -U transformers
```

**Load Model and Extract Features**
```python
import torch
from transformers import AutoModel

# Load pre-trained model
model = AutoModel.from_pretrained("MIT-SLS/USAD-Base", trust_remote_code=True).cuda().eval()

# Load audio and resample to 16kHz
wav = model.load_audio("path/to/audio").unsqueeze(0)  # (batch_size, wav_len)
# wav is a float tensor on the same device as the model
# You can also load waveforms directly with torchaudio.load

# Extract features
with torch.no_grad():
    results = model(wav)

# results["x"]:             model final output (batch_size, seq_len, encoder_dim)
# results["mel"]:           mel fbank (batch_size, seq_len * 2, mel_dim)
# results["hidden_states"]: list of (batch_size, seq_len, encoder_dim)
# results["ffn"]:           list of (batch_size, seq_len, encoder_dim)
```

See [usad_model.py](https://huggingface.co/MIT-SLS/USAD-Base/blob/main/usad_model.py) for more details about the model.
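
For utterance-level downstream tasks, a common recipe (used in SUPERB-style probing) is a learnable weighted sum over the per-layer hidden states followed by mean pooling over time. A minimal sketch with random tensors standing in for the model output above; the layer count (12 for USAD-Base) matches the table, but the pooling itself is a conventional technique, not part of the released API:

```python
import torch

# Stand-in for results["hidden_states"]: a list of per-layer features,
# each of shape (batch_size, seq_len, encoder_dim). 12 layers as in USAD-Base.
batch_size, seq_len, encoder_dim, num_layers = 1, 500, 768, 12
hidden_states = [torch.randn(batch_size, seq_len, encoder_dim) for _ in range(num_layers)]

# Learnable weighted sum over layers (initialized uniform), then mean pool over time.
layer_weights = torch.softmax(torch.zeros(num_layers), dim=0)
stacked = torch.stack(hidden_states, dim=0)                        # (L, B, T, D)
weighted = (layer_weights.view(-1, 1, 1, 1) * stacked).sum(dim=0)  # (B, T, D)
utterance_embedding = weighted.mean(dim=1)                         # (B, D)
print(utterance_embedding.shape)  # torch.Size([1, 768])
```

In practice `layer_weights` would be a trainable parameter optimized jointly with a lightweight downstream head while the encoder stays frozen.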

---

## πŸ“– Citation

```bibtex
@article{chang2025usad,
  title={{USAD}: Universal Speech and Audio Representation via Distillation},
  author={Chang, Heng-Jui and Bhati, Saurabhchand and Glass, James and Liu, Alexander H.},
  journal={arXiv preprint arXiv:2506.18843},
  year={2025}
}
```

---

## πŸ™ Acknowledgement

Our implementation is based on the awesome [facebookresearch/fairseq](https://github.com/facebookresearch/fairseq), [cwx-worst-one/EAT](https://github.com/cwx-worst-one/EAT), and [sooftware/conformer](https://github.com/sooftware/conformer) repositories.