MIT-SLS/USAD-Base

@@ -25,17 +25,19 @@ Trained on 126k hours of mixed data, USAD delivers competitive performance acros
 [👀 **Read Full Paper**](https://arxiv.org/abs/2506.18843)
 ---
 ## 🗂️ Models
 USAD models are all transformer encoders operating at **50Hz frame rate**. The teacher models are **WavLM Base+** and **ATST Frame**.
-| Model      | Parameters | Dim  | Layer | Checkpoint                                        |
-| ---------- | ---------- | ---- | ----- | ------------------------------------------------- |
-| USAD Small | 24M        | 384  | 12    | [link](https://huggingface.co/MIT-SLS/USAD-Small) |
-| USAD Base  | 94M        | 768  | 12    | [link](https://huggingface.co/MIT-SLS/USAD-Base)  |
-| USAD Large | 330M       | 1024 | 24    | [link](https://huggingface.co/MIT-SLS/USAD-Large) |
 ---
@@ -44,7 +46,7 @@ USAD models are all transformer encoders operating at **50Hz frame rate**. The t
 **Installation**
 ```
-pip install -U transformers
 ```
 **Load Model and Extract Features**
@@ -77,10 +79,10 @@ See [usad_model.py](https://huggingface.co/MIT-SLS/USAD-Base/blob/main/usad_mode
 ## 📖 Citation
 ```bibtex
-@article{chang2025usad,
   title={{USAD}: Universal Speech and Audio Representation via Distillation},
   author={Chang, Heng-Jui and Bhati, Saurabhchand and Glass, James and Liu, Alexander H.},
-  journal={arXiv preprint arXiv:2506.18843},
   year={2025}
 }
 ```
@@ -89,4 +91,4 @@ See [usad_model.py](https://huggingface.co/MIT-SLS/USAD-Base/blob/main/usad_mode
 ## 🙏 Acknowledgement
-Our implementation is based on the awesome [facebookresearch/fairseq](https://github.com/facebookresearch/fairseq), [cwx-worst-one/EAT](https://github.com/cwx-worst-one/EAT), and [sooftware/conformer](https://github.com/sooftware/conformer) repositories.

 [👀 **Read Full Paper**](https://arxiv.org/abs/2506.18843)
+[🛠️ **GitHub**](https://github.com/vectominist/usad)
 ---
 ## 🗂️ Models
 USAD models are all transformer encoders operating at **50Hz frame rate**. The teacher models are **WavLM Base+** and **ATST Frame**.
+| Model                                                     | Parameters | Dim  | Layer |
+| :-------------------------------------------------------- | ---------: | ---: | ----: |
+| [USAD Small](https://huggingface.co/MIT-SLS/USAD-Small)   | 24M        | 384  | 12    |
+| [USAD Base](https://huggingface.co/MIT-SLS/USAD-Base)     | 94M        | 768  | 12    |
+| [USAD Large](https://huggingface.co/MIT-SLS/USAD-Large)   | 330M       | 1024 | 24    |
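
As a rough illustration of what the 50Hz frame rate and the Dim column imply, the sketch below estimates the shape of the features extracted from one clip. It assumes the extracted features have the width listed under Dim and that the frame count is simply duration times 50; the exact count depends on the model's frontend and padding, which this page does not describe. The helper name `approx_feature_shape` is hypothetical and not part of the USAD code.

```python
import math
from typing import Tuple

FRAME_RATE_HZ = 50  # frame rate stated in the model card

# Hidden sizes copied from the Dim column of the table above.
HIDDEN_DIM = {"USAD Small": 384, "USAD Base": 768, "USAD Large": 1024}


def approx_feature_shape(duration_s: float, model: str) -> Tuple[int, int]:
    """Estimate (frames, dim) for one clip.

    Assumption: frames = floor(duration * 50). The real model may differ
    by a frame or two depending on how its frontend pads the waveform.
    """
    frames = math.floor(duration_s * FRAME_RATE_HZ)
    return frames, HIDDEN_DIM[model]


# A 10-second clip through USAD Base: about 500 frames of width 768.
print(approx_feature_shape(10.0, "USAD Base"))
```

Each frame therefore covers roughly 20 ms of audio, regardless of model size; only the feature width changes between Small, Base, and Large.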
 ---
 **Installation**
 ```
+pip install -U torch torchaudio transformers
 ```
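
After installing, a quick check like the one below can confirm that the three packages named in the install command are importable. This is a minimal sketch using only the standard library; the helper name `missing_packages` is hypothetical, and the check does not verify versions or GPU support.

```python
from importlib.util import find_spec

# Package names taken from the pip command above.
REQUIRED = ("torch", "torchaudio", "transformers")


def missing_packages(names=REQUIRED):
    """Return the top-level packages that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages()
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```

`find_spec` only locates a package without importing it, so the check stays fast even when `torch` is installed.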
 **Load Model and Extract Features**
 ## 📖 Citation
 ```bibtex
+@inproceedings{chang2025usad,
   title={{USAD}: Universal Speech and Audio Representation via Distillation},
   author={Chang, Heng-Jui and Bhati, Saurabhchand and Glass, James and Liu, Alexander H.},
+  booktitle={IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)},
   year={2025}
 }
 ```
 ## 🙏 Acknowledgement
+Our implementation is based on the awesome [facebookresearch/fairseq](https://github.com/facebookresearch/fairseq), [cwx-worst-one/EAT](https://github.com/cwx-worst-one/EAT), and [sooftware/conformer](https://github.com/sooftware/conformer) repositories.