---
license: cc-by-nc-4.0
datasets:
- DBD-research-group/BirdSet
base_model:
- facebook/convnext-base-224-22k
pipeline_tag: audio-classification
library_name: transformers
tags:
- audio-classification
- audio
---
# AudioProtoPNet: An Interpretable Deep Learning Model for Bird Sound Classification

## Abstract

Deep learning models have significantly advanced acoustic bird monitoring by recognizing numerous bird species based on their vocalizations. However, traditional deep learning models are often "black boxes," providing limited insight into their underlying computations, which restricts their utility for ornithologists and machine learning engineers. Explainable models, on the other hand, can facilitate debugging, knowledge discovery, trust, and interdisciplinary collaboration.

This work introduces **AudioProtoPNet**, an adaptation of the Prototypical Part Network (ProtoPNet) designed for multi-label bird sound classification. AudioProtoPNet is inherently interpretable, leveraging a ConvNeXt backbone to extract embeddings and a prototype learning classifier trained on these embeddings. The classifier learns prototypical patterns of each bird species' vocalizations from spectrograms of instances in the training data.

During inference, recordings are classified by comparing them to the learned prototypes in the embedding space, providing explanations for the model's decisions and insights into the most informative embeddings of each bird species.
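
For intuition, here is a minimal, hypothetical sketch of such a prototype classification head. It is not the released implementation: the shapes, the use of cosine similarity, and the max-pooling over patches and prototypes are illustrative assumptions.

```python
import torch
import torch.nn as nn


class PrototypeHead(nn.Module):
    """Illustrative prototype classifier: class logits from prototype similarity."""

    def __init__(self, num_classes: int, prototypes_per_class: int, embed_dim: int):
        super().__init__()
        # Learnable prototype vectors, a fixed number per class.
        self.prototypes = nn.Parameter(
            torch.randn(num_classes, prototypes_per_class, embed_dim)
        )

    def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
        # embeddings: (batch, patches, embed_dim) from the backbone's feature map.
        e = nn.functional.normalize(embeddings, dim=-1)
        p = nn.functional.normalize(self.prototypes, dim=-1)
        # Cosine similarity of every patch embedding to every prototype:
        # result has shape (batch, classes, patches, prototypes_per_class).
        sims = torch.einsum("bte,cke->bctk", e, p)
        # A prototype's evidence is its best match anywhere in the spectrogram;
        # the class logit is the max over that class's prototypes.
        return sims.amax(dim=(2, 3))


# Example shapes: 9,734 BirdSet classes, 1024-dim ConvNeXt-Base embeddings.
head = PrototypeHead(num_classes=9734, prototypes_per_class=5, embed_dim=1024)
logits = head(torch.randn(2, 64, 1024))  # -> (2, 9734)
```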

- **Paper**: [Elsevier](https://www.sciencedirect.com/science/article/pii/S1574954125000901)

## Model Description

### Training Data

The model was trained on the **BirdSet** training dataset (XCL), which comprises 9,734 bird species and over 6,800 hours of recordings.

### Evaluation

AudioProtoPNet's performance was evaluated on seven BirdSet test datasets covering diverse geographical regions. The model outperformed state-of-the-art bird sound classification models such as Perch (which itself outperforms BirdNET), achieving an average AUROC of 0.90 and a cmAP of 0.42, relative improvements of 7.1% and 16.7% over Perch, respectively.

These results highlight the feasibility of developing powerful yet interpretable deep learning models for the challenging task of multi-label bird sound classification, offering valuable insights for professionals in ornithology and machine learning.
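
For reference, the metrics reported below can be computed from multi-hot labels and per-class scores. This small sketch uses scikit-learn as an assumed tool; it is not necessarily the paper's evaluation code.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

# Toy multi-label data: 3 recordings, 3 species (multi-hot labels and scores).
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
y_score = np.array([[0.9, 0.2, 0.6], [0.1, 0.8, 0.3], [0.7, 0.6, 0.2]])

# cmAP: average precision per class, then the mean over classes.
cmap = average_precision_score(y_true, y_score, average="macro")
# AUROC, likewise macro-averaged over classes.
auroc = roc_auc_score(y_true, y_score, average="macro")
print(f"cmAP: {cmap:.2f}, AUROC: {auroc:.2f}")
```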

### Evaluation Results

**Table 1: Mean Performance of AudioProtoPNet Models with Varying Numbers of Prototypes**

Mean performance of AudioProtoPNet models with one, five, ten, and twenty prototypes per class on the validation dataset POW and the seven test datasets, averaged over five random seeds. The 'Score' column is the average of the respective metric across all test datasets. Best values for each metric are shown in **bold**. While the models with five, ten, and twenty prototypes performed similarly, the model with only one prototype per class showed slightly lower performance.

| Model | Metric | POW | PER | NES | UHH | HSN | NBP | SSW | SNE | Score |
| :---------------- | :------ | :----- | :----- | :----- | :----- | :----- | :----- | :----- | :----- | :----- |
| AudioProtoPNet-1 | cmAP | 0.49 | 0.30 | 0.36 | 0.28 | **0.50** | **0.66** | 0.40 | 0.32 | 0.40 |
| | AUROC | 0.88 | 0.79 | 0.92 | 0.85 | 0.91 | 0.92 | **0.96** | 0.84 | 0.88 |
| | T1-Acc | **0.87** | 0.59 | 0.49 | 0.42 | 0.64 | 0.71 | 0.64 | 0.70 | 0.60 |
| AudioProtoPNet-5 | cmAP | **0.50** | 0.30 | 0.38 | 0.31 | 0.54 | **0.68** | 0.42 | 0.33 | 0.42 |
| | AUROC | 0.88 | 0.79 | 0.93 | 0.87 | 0.92 | 0.93 | **0.97** | 0.88 | **0.90** |
| | T1-Acc | 0.84 | **0.59** | 0.52 | **0.49** | 0.65 | **0.71** | 0.66 | **0.74** | 0.62 |
| AudioProtoPNet-10 | cmAP | **0.50** | **0.30** | **0.38** | **0.30** | **0.54** | **0.68** | **0.42** | **0.34** | **0.42** |
| | AUROC | 0.88 | **0.80** | **0.94** | 0.86 | **0.92** | **0.93** | **0.97** | 0.86 | **0.90** |
| | T1-Acc | 0.85 | **0.59** | **0.52** | 0.47 | **0.64** | **0.72** | **0.67** | **0.74** | **0.62** |
| AudioProtoPNet-20 | cmAP | **0.50** | **0.30** | **0.38** | **0.31** | **0.54** | **0.68** | **0.43** | **0.33** | **0.42** |
| | AUROC | **0.89** | **0.80** | **0.94** | **0.86** | **0.92** | **0.93** | **0.97** | **0.87** | **0.90** |
| | T1-Acc | **0.87** | **0.60** | **0.52** | 0.42 | **0.65** | **0.72** | **0.68** | **0.75** | **0.62** |

**Table 2: Comparative Performance of AudioProtoPNet-5, ConvNeXt, and Perch**

Mean performance of AudioProtoPNet-5, ConvNeXt, and Perch on the validation dataset POW and the seven test datasets, averaged over five random seeds. The 'Score' column is the average of the respective metric across all test datasets. Best values for each metric are shown in **bold**. AudioProtoPNet-5 notably outperformed both Perch and ConvNeXt in cmAP, AUROC, and top-1 accuracy.

| Model | Metric | POW | PER | NES | UHH | HSN | NBP | SSW | SNE | Score |
| :---------------- | :------ | :----- | :----- | :----- | :----- | :----- | :----- | :----- | :----- | :----- |
| AudioProtoPNet-5 | cmAP | 0.50 | **0.30** | 0.38 | **0.31** | **0.54** | **0.68** | **0.42** | **0.33** | **0.42** |
| | AUROC | 0.88 | **0.79** | **0.93** | **0.87** | **0.92** | **0.93** | **0.97** | **0.86** | **0.90** |
| | T1-Acc | 0.84 | **0.59** | 0.52 | 0.49 | **0.65** | **0.71** | **0.66** | **0.74** | **0.62** |
| ConvNeXt | cmAP | 0.41 | 0.21 | 0.35 | 0.25 | 0.49 | 0.66 | 0.38 | 0.31 | 0.38 |
| | AUROC | 0.83 | 0.73 | 0.89 | 0.72 | 0.88 | 0.92 | 0.93 | 0.83 | 0.84 |
| | T1-Acc | 0.75 | 0.43 | 0.49 | 0.43 | 0.60 | 0.69 | 0.58 | 0.62 | 0.56 |
| Perch | cmAP | 0.30 | 0.18 | **0.39** | 0.27 | 0.45 | 0.63 | 0.28 | 0.29 | 0.36 |
| | AUROC | 0.84 | 0.70 | 0.90 | 0.76 | 0.86 | 0.91 | 0.91 | 0.83 | 0.84 |
| | T1-Acc | 0.85 | 0.48 | **0.66** | **0.57** | 0.58 | 0.69 | 0.62 | 0.69 | 0.61 |

## Example

This model can be loaded and used for inference with the `transformers` library.

```python
from transformers import AutoFeatureExtractor, AutoModelForSequenceClassification
import librosa
import torch

# Load the model and feature extractor
model = AutoModelForSequenceClassification.from_pretrained(
    "DBD-research-group/AudioProtoPNet-20-BirdSet-XCL", trust_remote_code=True
)
feature_extractor = AutoFeatureExtractor.from_pretrained(
    "DBD-research-group/AudioProtoPNet-20-BirdSet-XCL", trust_remote_code=True
)
model.eval()

# Load an example audio file
audio_path = librosa.ex('robin')
label = "eurrob1"  # The eBird label for the European Robin.

# The model was trained on audio sampled at 32,000 Hz.
audio, sample_rate = librosa.load(audio_path, sr=32_000)

# Convert the waveform into the mel spectrogram the model expects.
mel_spectrogram = feature_extractor(audio)

# Multi-label classification: apply a sigmoid to the logits.
outputs = model(mel_spectrogram)
probabilities = torch.sigmoid(outputs[0]).detach()

# Get the top 5 predictions by confidence
top_n_probs, top_n_indices = torch.topk(probabilities, k=5, dim=-1)

label2id = model.config.label2id
id2label = model.config.id2label

print("Selected species with confidence:")
print(f"{label:<7} - {probabilities[:, label2id[label]].item():.2%}")
print("\nTop 5 Predictions with confidence:")
for idx, conf in zip(top_n_indices.squeeze(), top_n_probs.squeeze()):
    print(f"{id2label[idx.item()]:<7} - {conf:.2%}")
```

**Expected output**
```
Selected species with confidence:
eurrob1 - 65.40%

Top 5 Predictions with confidence:
eurrob1 - 65.40%
blutit  - 34.11%
eugplo  - 33.66%
sablar2 - 33.50%
dunnoc1 - 32.35%
```
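
Continuing from the example above (reusing `audio`, `feature_extractor`, and `model`), a longer field recording can be scored window by window, keeping each species' maximum confidence. This is a hedged sketch: the 5-second window length is an assumption, not a documented property of the model.

```python
import numpy as np
import torch

window = 5 * 32_000  # samples per window at 32 kHz (length is an assumption)
scores = []
for start in range(0, len(audio), window):
    chunk = audio[start:start + window]
    if len(chunk) < window:
        # Zero-pad the final window to full length.
        chunk = np.pad(chunk, (0, window - len(chunk)))
    spec = feature_extractor(chunk)
    with torch.no_grad():
        probs = torch.sigmoid(model(spec)[0])
    scores.append(probs)

# One confidence per species: the maximum over all windows.
recording_probs = torch.stack(scores).amax(dim=0)
```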

## More Details

For more details, refer to our paper: https://www.sciencedirect.com/science/article/pii/S1574954125000901

## Citation

```
@misc{heinrich2024audioprotopnet,
  title={AudioProtoPNet: An interpretable deep learning model for bird sound classification},
  author={René Heinrich and Lukas Rauch and Bernhard Sick and Christoph Scholz},
  year={2024},
  url={https://www.sciencedirect.com/science/article/pii/S1574954125000901},
}
```