---
license: cc-by-nc-4.0
datasets:
- DBD-research-group/BirdSet
base_model:
- facebook/convnext-base-224-22k
pipeline_tag: audio-classification
library_name: transformers
tags:
- audio-classification
- audio
---

# AudioProtoPNet: An Interpretable Deep Learning Model for Bird Sound Classification

## Abstract

Deep learning models have significantly advanced acoustic bird monitoring by recognizing numerous bird species
based on their vocalizations. However, traditional deep learning models are often "black boxes," providing
limited insight into their underlying computations, which restricts their utility for ornithologists and machine
learning engineers. Explainable models, on the other hand, can facilitate debugging, knowledge discovery,
trust, and interdisciplinary collaboration.

This work introduces **AudioProtoPNet**, an adaptation of the Prototypical Part Network (ProtoPNet)
designed for multi-label bird sound classification. AudioProtoPNet is inherently interpretable, leveraging a
ConvNeXt backbone to extract embeddings and a prototype learning classifier trained on these embeddings.
The classifier learns prototypical patterns of each bird species' vocalizations from spectrograms of
instances in the training data.

During inference, recordings are classified by comparing them to learned prototypes in the embedding space,
providing explanations for the model's decisions and insights into the most informative embeddings of each
bird species.

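
The comparison step can be pictured with a small NumPy sketch. It is only an illustration under assumptions that are not taken from this card: cosine similarity as the similarity measure, a maximum over all spectrogram locations, and a plain linear layer from prototype similarities to class logits. The function name `prototype_logits`, all shapes, and the random inputs are hypothetical and do not reflect the released implementation.

```python
import numpy as np


def prototype_logits(embeddings, prototypes, class_weights):
    """Score classes by comparing embedding locations to prototypes (illustrative only).

    embeddings:    (batch, channels, height, width) feature map from a backbone
    prototypes:    (num_prototypes, channels) prototype vectors
    class_weights: (num_classes, num_prototypes) weights linking prototypes to classes
    """
    # L2-normalise along the channel axis so that dot products become cosine similarities.
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    protos = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)

    # Similarity of every prototype with every location: (batch, num_prototypes, height, width).
    sim_maps = np.einsum("bchw,pc->bphw", emb, protos)

    # Keep the best-matching location per prototype: (batch, num_prototypes).
    max_sim = sim_maps.reshape(sim_maps.shape[0], sim_maps.shape[1], -1).max(axis=2)

    # Combine prototype similarities into one logit per class: (batch, num_classes).
    logits = max_sim @ class_weights.T
    return logits, sim_maps


# Hypothetical toy inputs: 2 recordings, 16 channels, a 4x8 embedding grid,
# 6 prototypes and 3 classes.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(2, 16, 4, 8))
prototypes = rng.normal(size=(6, 16))
class_weights = rng.random(size=(3, 6))

logits, sim_maps = prototype_logits(embeddings, prototypes, class_weights)
# Multi-label setting: an independent sigmoid per class instead of a softmax.
probabilities = 1.0 / (1.0 + np.exp(-logits))
print(logits.shape, sim_maps.shape)  # (2, 3) (2, 6, 4, 8)
```

In a sketch like this, the per-prototype similarity maps (`sim_maps`) are what make the prediction inspectable: the position of each maximum shows which region of the embedded spectrogram matched a prototype most closely.
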

- **Paper**: https://www.sciencedirect.com/science/article/pii/S1574954125000901

## Model Description

### Training Data

The model was trained on the **BirdSet training dataset**, which comprises 9,734 bird species and over 6,800
hours of recordings.

### Evaluation

AudioProtoPNet's performance was evaluated on seven BirdSet test datasets covering diverse geographical
regions. The model outperformed state-of-the-art bird sound classification
models such as Perch (which itself outperforms BirdNET). AudioProtoPNet achieved an average AUROC of 0.90
and a cmAP of 0.42, representing relative improvements of 7.1% and 16.7% over Perch, respectively.

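
The relative improvements quoted above can be reproduced from the mean scores in the 'Score' column of Table 2 (AUROC 0.90 vs. 0.84 and cmAP 0.42 vs. 0.36 for Perch). The short check below uses the rounded table values; the helper name `relative_improvement` is introduced here for illustration only.

```python
# Mean test-set scores taken from the 'Score' column of Table 2 (rounded values).
scores = {
    "AUROC": {"AudioProtoPNet-5": 0.90, "Perch": 0.84},
    "cmAP": {"AudioProtoPNet-5": 0.42, "Perch": 0.36},
}


def relative_improvement(new: float, baseline: float) -> float:
    """Relative improvement of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


improvements = {
    metric: relative_improvement(values["AudioProtoPNet-5"], values["Perch"])
    for metric, values in scores.items()
}

for metric, value in improvements.items():
    print(f"{metric}: +{value:.1f}%")
# AUROC: +7.1%
# cmAP: +16.7%
```
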

These results highlight the feasibility of developing powerful yet interpretable deep learning models for the
challenging task of multi-label bird sound classification, offering valuable insights for professionals in
ornithology and machine learning.

### Evaluation Results

**Table 1: Mean Performance of AudioProtoPNet Models with Varying Prototypes**

Mean performance of AudioProtoPNet models with one, five, ten, and twenty prototypes per class for the
validation dataset POW and the seven test datasets, averaged over five different random seeds. The 'Score'
column represents the average of the respective metric across all test datasets. Best values for each metric are
**bolded**. While models with five, ten, and twenty prototypes performed
similarly, the model with only one prototype per class showed slightly lower performance.

| Model | Metric | POW | PER | NES | UHH | HSN | NBP | SSW | SNE | Score |
|-------------------|--------|-----|-----|-----|-----|-----|-----|-----|-----|-------|
| AudioProtoPNet-1 | cmAP | 0.49 | **0.30** | 0.36 | 0.28 | 0.50 | 0.66 | 0.40 | 0.32 | 0.40 |
| | AUROC | 0.88 | 0.79 | 0.92 | 0.85 | 0.91 | 0.92 | 0.96 | 0.84 | 0.88 |
| | T1-Acc | **0.87** | 0.59 | 0.49 | 0.42 | 0.64 | 0.71 | 0.64 | 0.70 | 0.60 |
| AudioProtoPNet-5 | cmAP | **0.50** | **0.30** | **0.38** | **0.31** | **0.54** | **0.68** | 0.42 | 0.33 | **0.42** |
| | AUROC | 0.88 | 0.79 | 0.93 | **0.87** | **0.92** | **0.93** | **0.97** | **0.88** | **0.90** |
| | T1-Acc | 0.84 | 0.59 | **0.52** | **0.49** | **0.65** | 0.71 | 0.66 | 0.74 | **0.62** |
| AudioProtoPNet-10 | cmAP | **0.50** | **0.30** | **0.38** | 0.30 | **0.54** | **0.68** | 0.42 | **0.34** | **0.42** |
| | AUROC | 0.88 | **0.80** | **0.94** | 0.86 | **0.92** | **0.93** | **0.97** | 0.86 | **0.90** |
| | T1-Acc | 0.85 | 0.59 | **0.52** | 0.47 | 0.64 | **0.72** | 0.67 | 0.74 | **0.62** |
| AudioProtoPNet-20 | cmAP | **0.50** | **0.30** | **0.38** | **0.31** | **0.54** | **0.68** | **0.43** | 0.33 | **0.42** |
| | AUROC | **0.89** | **0.80** | **0.94** | 0.86 | **0.92** | **0.93** | **0.97** | 0.87 | **0.90** |
| | T1-Acc | **0.87** | **0.60** | **0.52** | 0.42 | **0.65** | **0.72** | **0.68** | **0.75** | **0.62** |

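
As a sanity check on how the 'Score' column is formed, the sketch below averages the seven test-set values of the AudioProtoPNet-5 cmAP and T1-Acc rows, leaving out the validation dataset POW. It starts from the rounded table entries, so for other rows a recomputed mean may differ from the published one in the last digit.

```python
# Per-dataset values of AudioProtoPNet-5, copied from the table (rounded to two decimals).
# POW is the validation dataset and is therefore excluded from the 'Score' average.
test_sets = ["PER", "NES", "UHH", "HSN", "NBP", "SSW", "SNE"]
rows = {
    "cmAP": [0.30, 0.38, 0.31, 0.54, 0.68, 0.42, 0.33],
    "T1-Acc": [0.59, 0.52, 0.49, 0.65, 0.71, 0.66, 0.74],
}

# 'Score' is the plain mean of one metric over the seven test datasets.
score = {metric: sum(values) / len(test_sets) for metric, values in rows.items()}

for metric, value in score.items():
    print(f"{metric}: {value:.2f}")
# cmAP: 0.42
# T1-Acc: 0.62
```
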

**Table 2: Comparative Performance of AudioProtoPNet, ConvNeXt, and Perch**

Mean performance of AudioProtoPNet-5, ConvNeXt, and Perch for the validation dataset POW and the seven
test datasets, averaged over five different random seeds. The 'Score' column represents the average of the
respective metric across all test datasets. Best values for each metric are **bolded**.
AudioProtoPNet-5 outperformed both Perch and ConvNeXt on the mean cmAP, AUROC,
and top-1 accuracy scores.

| Model | Metric | POW | PER | NES | UHH | HSN | NBP | SSW | SNE | Score |
| :---------------- | :------ | :----- | :----- | :----- | :----- | :----- | :----- | :----- | :----- | :----- |
| AudioProtoPNet-5 | cmAP | 0.50 | **0.30** | 0.38 | **0.31** | **0.54** | **0.68** | **0.42** | **0.33** | **0.42** |
| | AUROC | 0.88 | **0.79** | **0.93** | **0.87** | **0.92** | **0.93** | **0.97** | **0.86** | **0.90** |
| | T1-Acc | 0.84 | **0.59** | 0.52 | 0.49 | **0.65** | **0.71** | **0.66** | **0.74** | **0.62** |
| ConvNeXt | cmAP | 0.41 | 0.21 | 0.35 | 0.25 | 0.49 | 0.66 | 0.38 | 0.31 | 0.38 |
| | AUROC | 0.83 | 0.73 | 0.89 | 0.72 | 0.88 | 0.92 | 0.93 | 0.83 | 0.84 |
| | T1-Acc | 0.75 | 0.43 | 0.49 | 0.43 | 0.60 | 0.69 | 0.58 | 0.62 | 0.56 |
| Perch | cmAP | 0.30 | 0.18 | **0.39** | 0.27 | 0.45 | 0.63 | 0.28 | 0.29 | 0.36 |
| | AUROC | 0.84 | 0.70 | 0.90 | 0.76 | 0.86 | 0.91 | 0.91 | 0.83 | 0.84 |
| | T1-Acc | 0.85 | 0.48 | **0.66** | **0.57** | 0.58 | 0.69 | 0.62 | 0.69 | 0.61 |

## Example

The model can be loaded and used for inference with the `transformers` library. Because the repository ships
custom model code, `trust_remote_code=True` is required.

```python
from transformers import AutoFeatureExtractor, AutoModelForSequenceClassification
import librosa
import torch

MODEL_ID = "DBD-research-group/AudioProtoPNet-10-BirdSet-XCL"

# Load the model and feature extractor (custom code from the model repository)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID, trust_remote_code=True)
feature_extractor = AutoFeatureExtractor.from_pretrained(MODEL_ID, trust_remote_code=True)
model.eval()

# Load an example audio file
audio_path = librosa.ex('robin')
label = "eurrob1"  # The eBird label for the European Robin.

# The model is trained on audio sampled at 32,000 Hz
audio, sample_rate = librosa.load(audio_path, sr=32_000)

# Convert the waveform into the mel spectrogram the model expects
mel_spectrogram = feature_extractor(audio)

# Multi-label model: apply a sigmoid to obtain an independent probability per species
outputs = model(mel_spectrogram)
probabilities = torch.sigmoid(outputs[0]).detach()

# Get the top 5 predictions by confidence
top_n_probs, top_n_indices = torch.topk(probabilities, k=5, dim=-1)

label2id = model.config.label2id
id2label = model.config.id2label

print("Selected species with confidence:")
print(f"{label:<7} - {probabilities[:, label2id[label]].item():.2%}")
print("\nTop 5 Predictions with confidence:")
for idx, conf in zip(top_n_indices.squeeze(), top_n_probs.squeeze()):
    print(f"{id2label[idx.item()]:<7} - {conf.item():.2%}")
```

**Expected output**

```
Selected species with confidence:
eurrob1 - 26.81%

Top 5 Predictions with confidence:
coatit2 - 49.99%
sablar2 - 48.29%
palwar5 - 41.58%
gretit1 - 37.51%
verdin  - 34.72%
```
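
Because the model is multi-label, each species receives an independent sigmoid score, so the values above need not sum to 100%. One common way to turn such scores into a set of predicted species is a fixed decision threshold. The sketch below applies a hypothetical threshold of 0.40 (an assumption chosen for illustration, not a value from the paper) to the probabilities shown in the expected output.

```python
# Probabilities copied from the expected output above.
probabilities = {
    "coatit2": 0.4999,
    "sablar2": 0.4829,
    "palwar5": 0.4158,
    "gretit1": 0.3751,
    "verdin": 0.3472,
    "eurrob1": 0.2681,
}

# Hypothetical decision threshold; in practice it would be tuned on validation data.
THRESHOLD = 0.40

# Every species whose score reaches the threshold counts as predicted.
predicted = [label for label, p in probabilities.items() if p >= THRESHOLD]
print(predicted)  # ['coatit2', 'sablar2', 'palwar5']

# Independent sigmoid scores are not a probability distribution over species.
print(f"Sum of listed scores: {sum(probabilities.values()):.2f}")  # 2.39
```
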

## More Details

For more details, refer to our paper: https://www.sciencedirect.com/science/article/pii/S1574954125000901

## Citation

```
@misc{heinrich2024audioprotopnet,
  title={AudioProtoPNet: An interpretable deep learning model for bird sound classification},
  author={René Heinrich and Lukas Rauch and Bernhard Sick and Christoph Scholz},
  year={2024},
  url={https://www.sciencedirect.com/science/article/pii/S1574954125000901},
}
```