---
base_model:
- nvidia/audio-flamingo-2
pipeline_tag: feature-extraction
---

# AFClap - Audio Flamingo CLAP

The CLAP checkpoint (`epoch_16.pt`) from Audio-Flamingo-2.

## Model Info

- **Base**: HTSAT + T5
- **Audio Embed Dim**: 2048
- **Sample Rate**: 16000 Hz
- **Original**: [nvidia/audio-flamingo-2](https://huggingface.co/nvidia/audio-flamingo-2)

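The model expects 16 kHz input, so audio at other sample rates should be resampled before embedding; with the dependencies listed under Installation, `librosa.load(path, sr=16000)` is the usual route. Purely as an illustration of that step, here is a naive linear-interpolation resampler in plain torch (a sketch only; real pipelines should use librosa or torchaudio, which apply proper anti-aliasing filtering):

```python
import torch

def resample_linear(wav: torch.Tensor, src_sr: int, dst_sr: int) -> torch.Tensor:
    """Naive linear-interpolation resample (illustration only)."""
    new_len = int(round(wav.numel() * dst_sr / src_sr))
    # interpolate expects a (batch, channels, length) tensor
    return torch.nn.functional.interpolate(
        wav.view(1, 1, -1), size=new_len, mode="linear", align_corners=False
    ).view(-1)

# One second of a 440 Hz tone at 44.1 kHz -> one second at the model's 16 kHz
wav_44k = torch.sin(2 * torch.pi * 440 * torch.arange(44100) / 44100)
wav_16k = resample_linear(wav_44k, 44100, 16000)
print(wav_16k.shape)  # torch.Size([16000])
```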
## Installation

```bash
pip install laion-clap librosa soundfile
```

## Usage

```python
import torch
from laion_clap import CLAP_Module

def load_afclap(ckpt_path):
    model = CLAP_Module(
        enable_fusion=True,
        amodel='HTSAT-afclap',
        tmodel='t5'
    ).cuda()
    model.load_afclap_ckpt(ckpt=ckpt_path, verbose=True)
    return model

# Load model
model = load_afclap("epoch_16.pt")

# Get audio embeddings (use_tensor=True returns torch tensors)
audio_embed = model.get_audio_embedding_from_filelist(
    ["audio1.wav", "audio2.wav"],
    sr=16000,
    use_tensor=True
)

# Get text embeddings
text_embed = model.get_text_embedding(
    ["This is a classical song.", "This is a rock song."],
    use_tensor=True
)

# Compute audio-text similarity (embeddings are already tensors,
# so no torch.tensor() wrapping is needed)
similarities = audio_embed @ text_embed.t()
print(similarities)
```
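The similarity matrix above can be turned into zero-shot label probabilities with a softmax over the text axis. A minimal sketch using dummy L2-normalized embeddings in place of the real model outputs (the 2048-dim shape matches the AFClap audio embedding; the softmax-over-texts step is the standard CLAP zero-shot recipe, not something specific to this checkpoint):

```python
import torch

# Dummy stand-ins for model outputs: 2 audio clips, 3 text prompts,
# each a 2048-dim L2-normalized embedding.
torch.manual_seed(0)
audio_embed = torch.nn.functional.normalize(torch.randn(2, 2048), dim=-1)
text_embed = torch.nn.functional.normalize(torch.randn(3, 2048), dim=-1)

# Cosine similarity matrix of shape (num_audio, num_text)
similarities = audio_embed @ text_embed.t()

# Softmax over the text axis gives per-clip label probabilities
probs = similarities.softmax(dim=-1)
best = probs.argmax(dim=-1)  # index of the best-matching prompt per clip
print(probs.shape, best.shape)  # torch.Size([2, 3]) torch.Size([2])
```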