---
license: cc-by-nc-sa-4.0
pipeline_tag: audio-classification
tags:
- music
- audio
- speech
- audio-representation-learning
- arch-benchmark
- general-audio
---

# Model Card: Pre-trained Audio Representation Models on AudioSet

## Overview

This model card describes the pre-trained audio representation models released by ALM. All models are pre-trained on the full AudioSet dataset and are intended for general-purpose Audio Representation Learning (ARL) tasks.

## Models

### 1. [ALM/hubert-base-audioset](https://huggingface.co/ALM/hubert-base-audioset)

- **Architecture**: HuBERT (HuBERT-Base), transformer-based
- **Description**: Based on the HuBERT architecture and pre-trained on the full AudioSet dataset.

### 2. [ALM/hubert-large-audioset](https://huggingface.co/ALM/hubert-large-audioset)

- **Architecture**: HuBERT (HuBERT-Large), transformer-based
- **Description**: A larger variant of hubert-base-audioset, with more capacity for learning audio representations from the full AudioSet dataset.

### 3. [ALM/wav2vec2-base-audioset](https://huggingface.co/ALM/wav2vec2-base-audioset)

- **Architecture**: Wav2Vec 2.0 (Wav2Vec2-Base), transformer-based
- **Description**: Based on the Wav2Vec 2.0 architecture and pre-trained on the full AudioSet dataset using self-supervised learning (SSL) with contrastive predictive coding (CPC). It offers an alternative pre-training approach to the HuBERT models.

### 4. [ALM/wav2vec2-large-audioset](https://huggingface.co/ALM/wav2vec2-large-audioset)

- **Architecture**: Wav2Vec 2.0 (Wav2Vec2-Large), transformer-based
- **Description**: A larger variant of wav2vec2-base-audioset, with more capacity for learning audio representations from the full AudioSet dataset.

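As a rough illustration of how the four checkpoints listed above might be used for feature extraction, the sketch below mean-pools the encoder's last hidden state into a single clip-level embedding. It assumes (this card does not confirm it) that the repositories load through the `transformers` `AutoModel` and `AutoFeatureExtractor` classes, that the input is mono audio sampled at 16 kHz, and that the checkpoints keep the standard Wav2Vec 2.0 / HuBERT convolutional front end. The names `embed`, `num_frames`, and the constants are illustrative, not part of any released API.

```python
# Checkpoint IDs listed in this model card.
MODEL_IDS = [
    "ALM/hubert-base-audioset",
    "ALM/hubert-large-audioset",
    "ALM/wav2vec2-base-audioset",
    "ALM/wav2vec2-large-audioset",
]

# Assumption: standard Wav2Vec 2.0 / HuBERT convolutional front end
# (kernel sizes and strides below) and 16 kHz mono input.
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)
SAMPLE_RATE = 16_000


def num_frames(num_samples: int) -> int:
    """Number of encoder frames produced for a waveform of `num_samples`."""
    length = num_samples
    for kernel, stride in zip(CONV_KERNELS, CONV_STRIDES):
        length = (length - kernel) // stride + 1
    return length


def embed(waveform, model_id: str = MODEL_IDS[0]):
    """Return a mean-pooled clip embedding for a mono 16 kHz waveform."""
    # Imported lazily so the helpers above work without these packages.
    import torch
    from transformers import AutoFeatureExtractor, AutoModel

    # Assumption: the repository ships a preprocessor config.
    extractor = AutoFeatureExtractor.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id).eval()

    inputs = extractor(waveform, sampling_rate=SAMPLE_RATE, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, frames, hidden_size)
    return hidden.mean(dim=1).squeeze(0)  # (hidden_size,)
```

Under these assumptions a one-second clip yields `num_frames(16_000)` = 49 frames, and calling `embed(waveform)` with a 1-D float array downloads the weights on first use.
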
## Intended Use

These pre-trained models are intended for a wide range of ARL tasks, including but not limited to speech recognition, music classification, and acoustic event detection. They can be used as feature extractors or fine-tuned on task-specific datasets for downstream applications.
Although these models are versatile across audio domains, they may perform worse on speech-related tasks than speech-specialized models such as the original Wav2Vec 2.0 and HuBERT models.
This is because AudioSet, the dataset used for pre-training, covers a wide range of audio sources beyond speech.

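One possible way to fine-tune a checkpoint on a task-specific dataset is to attach a classification head and train it while the pre-trained encoder stays frozen. The sketch below is only an illustration under stated assumptions: it presumes the checkpoints load through `transformers`' `AutoModelForAudioClassification`, and the helper names and the label set in the usage comment are hypothetical. The training loop itself is omitted.

```python
from typing import Dict, List, Tuple


def label_maps(labels: List[str]) -> Tuple[Dict[str, int], Dict[int, str]]:
    """Build the label <-> id mappings stored in the model config."""
    label2id = {label: idx for idx, label in enumerate(labels)}
    id2label = {idx: label for label, idx in label2id.items()}
    return label2id, id2label


def build_classifier(model_id: str, labels: List[str], freeze_encoder: bool = True):
    """Load a checkpoint with a freshly initialized classification head."""
    # Imported lazily so `label_maps` works without transformers installed.
    from transformers import AutoModelForAudioClassification

    label2id, id2label = label_maps(labels)
    model = AutoModelForAudioClassification.from_pretrained(
        model_id,
        num_labels=len(labels),
        label2id=label2id,
        id2label=id2label,
    )
    if freeze_encoder:
        # Keep the pre-trained encoder fixed; only the new head is trained.
        for param in model.base_model.parameters():
            param.requires_grad = False
    return model


# Hypothetical usage (downloads weights on first call):
# model = build_classifier("ALM/hubert-base-audioset", ["speech", "music", "event"])
```

Setting `freeze_encoder=False` fine-tunes the whole network instead, which typically needs more data and compute.
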
## Limitations and Considerations

- The models are pre-trained on the full AudioSet dataset, which may not cover all audio domains comprehensively.
- Fine-tuning on domain-specific data may be necessary to achieve optimal performance on certain tasks.
- Deploying and fine-tuning these models, especially the larger variants, may require substantial computational resources.

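To get a feel for the resource point above, the following back-of-the-envelope sketch estimates how much memory the weights alone occupy. The parameter counts in the usage comment (roughly 95M for Base and 317M for Large) are the figures reported for the original Wav2Vec 2.0 and HuBERT architectures and are assumed, not confirmed, to apply to these checkpoints; `count_parameters` can be used to check a loaded model instead. Both helper names are illustrative.

```python
def weights_megabytes(num_params: int, bytes_per_param: int = 4) -> float:
    """Memory taken by the weights alone, in MB (10^6 bytes).

    bytes_per_param is 4 for float32 and 2 for float16/bfloat16.
    """
    return num_params * bytes_per_param / 1e6


def count_parameters(model) -> int:
    """Count the parameters of a loaded PyTorch model."""
    return sum(p.numel() for p in model.parameters())


# Assumed, approximate sizes taken from the original architectures:
# weights_megabytes(95_000_000)   # Base in float32
# weights_megabytes(317_000_000)  # Large in float32
```

Fine-tuning needs considerably more than this, since gradients, optimizer state, and activations are held in memory alongside the weights.
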
## Citation

If you use these pre-trained models in your work, please cite the following paper:

```bib
@INPROCEEDINGS{ARCH,
  author={La Quatra, Moreno and Koudounas, Alkis and Vaiani, Lorenzo and Baralis, Elena and Cagliero, Luca and Garza, Paolo and Siniscalchi, Sabato Marco},
  booktitle={2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW)},
  title={Benchmarking Representations for Speech, Music, and Acoustic Events},
  year={2024},
  pages={505-509},
  keywords={Representation learning; Systematics; Conferences; Benchmark testing; Signal processing; Acoustics; Data models; Audio Representation Learning; Benchmark; Pre-trained Models; Self-Supervised Learning},
  doi={10.1109/ICASSPW62465.2024.10625960}
}
```

[arXiv version: arxiv.org/abs/2405.00934](https://arxiv.org/abs/2405.00934)