---
license: mit
tags:
- Audio
- SSL
- SSLAM
library_name: transformers
---
# SSLAM AudioSet-2M Finetuned (ViT-Base, mAP: 50.2)

This repository provides an SSLAM checkpoint formatted for use with Hugging Face Transformers. It is intended for feature extraction in audio LLMs, sound event detection, and general-purpose audio representation learning. The implementation follows the EAT code path while swapping in the SSLAM AudioSet-2M finetuned weights.
## 🔧 Usage
You can load and use the model for feature extraction directly via Hugging Face Transformers:
```python
import torch
import torchaudio
import soundfile as sf
from transformers import AutoModel

model_id = "ta012/SSLAM_AS2M_Finetuned"
model = AutoModel.from_pretrained(model_id, trust_remote_code=True).eval().cuda()

source_file = "/path/to/input.wav"
target_length = 1024  # Recommended: 1024 for 10 s audio
norm_mean = -4.268
norm_std = 4.569

# Load and resample audio to 16 kHz
wav, sr = sf.read(source_file)
waveform = torch.tensor(wav).float().cuda()
if sr != 16000:
    waveform = torchaudio.functional.resample(waveform, sr, 16000)

# Remove DC offset and convert to a mel-spectrogram
waveform = waveform - waveform.mean()
mel = torchaudio.compliance.kaldi.fbank(
    waveform.unsqueeze(0),
    htk_compat=True,
    sample_frequency=16000,
    use_energy=False,
    window_type='hanning',
    num_mel_bins=128,
    dither=0.0,
    frame_shift=10,
).unsqueeze(0)

# Pad or truncate to target_length frames
n_frames = mel.shape[1]
if n_frames < target_length:
    mel = torch.nn.ZeroPad2d((0, 0, 0, target_length - n_frames))(mel)
else:
    mel = mel[:, :target_length, :]

# Normalize with dataset statistics
mel = (mel - norm_mean) / (norm_std * 2)
mel = mel.unsqueeze(0).cuda()  # shape: [1, 1, T, F]

# Extract features
with torch.no_grad():
    feat = model.extract_features(mel)

feat = feat.squeeze(0).cpu().numpy()
print(f"Feature shape: {feat.shape}")
```
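For many downstream tasks (retrieval, linear probing, clustering) you need a single clip-level vector rather than per-frame features. Assuming `extract_features` returns a frame/patch-level sequence of shape `[N, D]` after the squeeze above, a common approach is to mean-pool over the sequence dimension. The sketch below uses a random tensor as a stand-in for real model output; the shapes (`N=512`, `D=768`) are illustrative, and whether a leading CLS token should be excluded depends on the checkpoint, so verify against the actual output.

```python
import torch

# Stand-in for the extracted features: [N, D] frame/patch embeddings.
# With a real checkpoint, this would be the `feat` tensor from the usage
# snippet (before the .numpy() conversion).
feat = torch.randn(512, 768)

# Mean-pool over the sequence dimension to get one clip-level embedding.
clip_embedding = feat.mean(dim=0)  # shape: [D]

# L2-normalize if the embedding will be used for cosine-similarity retrieval.
clip_embedding = clip_embedding / clip_embedding.norm()

print(clip_embedding.shape)
```

Mean pooling is a simple baseline; attentive pooling or using a dedicated CLS token (if the model exposes one) may work better for some tasks.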
## 📝 Notes
See the feature extraction guide for more instructions.
## 🙏 Acknowledgments

This repository builds on the EAT implementation for Hugging Face models. We remap the SSLAM weights to that interface.
- Paper: EAT: Self-Supervised Pre-Training with Efficient Audio Transformer
- Code: https://github.com/cwx-worst-one/EAT
We are not affiliated with the EAT authors. All credit for the original implementation belongs to them.
## 📖 Citation
If you find our work useful, please cite it as:
```bibtex
@inproceedings{alex2025sslam,
  title={{SSLAM}: Enhancing Self-Supervised Models with Audio Mixtures for Polyphonic Soundscapes},
  author={Tony Alex and Sara Atito and Armin Mustafa and Muhammad Awais and Philip J B Jackson},
  booktitle={The Thirteenth International Conference on Learning Representations},
  year={2025},
  url={https://openreview.net/forum?id=odU59TxdiB}
}
```
Please also cite EAT:
```bibtex
@article{chen2024eat,
  title={EAT: Self-supervised pre-training with efficient audio transformer},
  author={Chen, Wenxi and Liang, Yuzhe and Ma, Ziyang and Zheng, Zhisheng and Chen, Xie},
  journal={arXiv preprint arXiv:2401.03497},
  year={2024}
}
```