---
license: cc-by-nc-4.0
datasets:
- DBD-research-group/BirdSet
base_model:
- facebook/convnext-base-224-22k
pipeline_tag: audio-classification
library_name: transformers
tags:
- audio-classification
- audio
---
# AudioProtoPNet: An Interpretable Deep Learning Model for Bird Sound Classification



## Abstract

Deep learning models have significantly advanced acoustic bird monitoring by recognizing numerous bird species
based on their vocalizations. However, traditional deep learning models are often "black boxes," providing
limited insight into their underlying computations, which restricts their utility for ornithologists and machine
learning engineers. Explainable models, on the other hand, can facilitate debugging, knowledge discovery,
trust, and interdisciplinary collaboration.

This work introduces **AudioProtoPNet**, an adaptation of the Prototypical Part Network (ProtoPNet)
designed for multi-label bird sound classification. AudioProtoPNet is inherently interpretable, leveraging a
ConvNeXt backbone to extract embeddings and a prototype learning classifier trained on these embeddings.
The classifier learns prototypical patterns of each bird species' vocalizations from spectrograms of
instances in the training data.

During inference, recordings are classified by comparing them to learned prototypes in the embedding space,
providing explanations for the model's decisions and insights into the most informative embeddings of each
bird species.
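The comparison to learned prototypes can be pictured with a minimal, illustrative sketch: each class logit is the best match between any patch embedding and any of that class's prototypes. The shapes, cosine similarity, and max-pooling below are assumptions for exposition, not the exact AudioProtoPNet formulation.

```python
import torch
import torch.nn.functional as F

def prototype_logits(embeddings: torch.Tensor, prototypes: torch.Tensor) -> torch.Tensor:
    """Toy prototype-based scoring (illustrative only).

    embeddings: (batch, num_patches, dim) patch embeddings from the backbone.
    prototypes: (num_classes, num_protos, dim) learned class prototypes.
    Each class logit is the maximum cosine similarity between any patch
    embedding and any prototype of that class.
    """
    e = F.normalize(embeddings, dim=-1)         # (B, P, D)
    p = F.normalize(prototypes, dim=-1)         # (C, K, D)
    sims = torch.einsum("bpd,ckd->bpck", e, p)  # (B, P, C, K)
    return sims.amax(dim=(1, 3))                # max over patches and prototypes -> (B, C)

# Example: 1 recording, 8 patches, 64-dim embeddings, 3 classes, 5 prototypes each
logits = prototype_logits(torch.randn(1, 8, 64), torch.randn(3, 5, 64))
print(logits.shape)  # torch.Size([1, 3])
```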

- **Paper**: https://www.sciencedirect.com/science/article/pii/S1574954125000901

## Model Description

### Training Data

The model was trained on the **BirdSet training dataset**, which comprises 9,734 bird species and over 6,800
hours of recordings.

### Evaluation

AudioProtoPNet's performance was evaluated on seven BirdSet test datasets, covering diverse geographical
regions. The model demonstrated superior performance compared to state-of-the-art bird sound classification
models such as Perch (which itself outperforms BirdNET). AudioProtoPNet achieved an average AUROC of 0.90
and a cmAP of 0.42, representing relative improvements of 7.1% and 16.7% over Perch, respectively.

These results highlight the feasibility of developing powerful yet interpretable deep learning models for the
challenging task of multi-label bird sound classification, offering valuable insights for professionals in
ornithology and machine learning.
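The two metrics reported below can be reproduced with standard tooling: cmAP is the class-wise (macro-averaged) average precision, and AUROC is macro-averaged over classes. A sketch with toy multi-label data (not from the paper), using `scikit-learn`:

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

# Toy multi-label ground truth (n_samples, n_classes) and predicted probabilities.
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]])
y_prob = np.array([[0.9, 0.2, 0.1], [0.1, 0.8, 0.3], [0.6, 0.7, 0.2], [0.2, 0.1, 0.9]])

# cmAP: average precision computed per class, then averaged over classes.
cmap = average_precision_score(y_true, y_prob, average="macro")
# AUROC: area under the ROC curve, macro-averaged over classes.
auroc = roc_auc_score(y_true, y_prob, average="macro")
print(f"cmAP={cmap:.2f}, AUROC={auroc:.2f}")  # cmAP=0.92, AUROC=0.83
```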

### Evaluation Results

**Table 1: Mean Performance of AudioProtoPNet Models with Varying Prototypes**

Mean performance of AudioProtoPNet models with one, five, ten, and twenty prototypes per class for the
validation dataset POW and the seven test datasets, averaged over five different random seeds. The 'Score'
column represents the average of the respective metric across all test datasets. Best values for each metric are
**bolded**. While models with five, ten, and twenty prototypes performed
similarly, the model with only one prototype per class showed slightly lower performance.

|                      | Metric  |  POW  |  PER  |  NES  |  UHH  |  HSN  |  NBP  |  SSW  |  SNE  | Score |
|----------------------|---------|-------|-------|-------|-------|-------|-------|-------|-------|-------|
| AudioProtoPNet-1     | cmAP    | 0.49  | **0.30** | 0.36  | 0.28  | 0.50  | 0.66  | 0.40  | 0.32  | 0.40  |
|                      | AUROC   | 0.88  | 0.79  | 0.92  | 0.85  | 0.91  | 0.92  | 0.96  | 0.84  | 0.88  |
|                      | T1-Acc  | **0.87** | 0.59  | 0.49  | 0.42  | 0.64  | 0.71  | 0.64  | 0.70  | 0.60  |
| AudioProtoPNet-5     | cmAP    | **0.50** | **0.30** | **0.38** | **0.31** | **0.54** | **0.68** | 0.42  | 0.33  | **0.42** |
|                      | AUROC   | 0.88  | 0.79  | 0.93  | **0.87** | **0.92** | **0.93** | **0.97** | **0.88** | **0.90** |
|                      | T1-Acc  | 0.84  | 0.59  | **0.52** | **0.49** | **0.65** | 0.71  | 0.66  | 0.74  | **0.62** |
| AudioProtoPNet-10    | cmAP    | **0.50** | **0.30** | **0.38** | 0.30  | **0.54** | **0.68** | 0.42  | **0.34** | **0.42** |
|                      | AUROC   | 0.88  | **0.80** | **0.94** | 0.86  | **0.92** | **0.93** | **0.97** | 0.86  | **0.90** |
|                      | T1-Acc  | 0.85  | 0.59  | **0.52** | 0.47  | 0.64  | **0.72** | 0.67  | 0.74  | **0.62** |
| AudioProtoPNet-20    | cmAP    | **0.50** | **0.30** | **0.38** | **0.31** | **0.54** | **0.68** | **0.43** | 0.33  | **0.42** |
|                      | AUROC   | **0.89** | **0.80** | **0.94** | 0.86  | **0.92** | **0.93** | **0.97** | 0.87  | **0.90** |
|                      | T1-Acc  | **0.87** | **0.60** | **0.52** | 0.42  | **0.65** | **0.72** | **0.68** | **0.75** | **0.62** |

**Table 2: Comparative Performance of AudioProtoPNet, ConvNeXt, and Perch**

Mean performance of AudioProtoPNet-5, ConvNeXt, and Perch for the validation dataset POW and the seven
test datasets, averaged over five different random seeds. The 'Score' column represents the average of the
respective metric across all test datasets. Best values for each metric are **bolded**. AudioProtoPNet-5 notably outperformed both Perch and ConvNeXt in terms of cmAP, AUROC,
and top-1 accuracy scores.

| Model             | Metric  | POW    | PER    | NES    | UHH    | HSN    | NBP    | SSW    | SNE    | Score  |
| :---------------- | :------ | :----- | :----- | :----- | :----- | :----- | :----- | :----- | :----- | :----- |
| AudioProtoPNet-5  | cmAP    | 0.50   | **0.30** | 0.38   | **0.31** | **0.54** | **0.68** | **0.42** | **0.33** | **0.42** |
|                   | AUROC   | 0.88   | **0.79** | **0.93** | **0.87** | **0.92** | **0.93** | **0.97** | **0.86** | **0.90** |
|                   | T1-Acc  | 0.84   | **0.59** | 0.52   | 0.49   | **0.65** | **0.71** | **0.66** | **0.74** | **0.62** |
| ConvNeXt          | cmAP    | 0.41   | 0.21   | 0.35   | 0.25   | 0.49   | 0.66   | 0.38   | 0.31   | 0.38   |
|                   | AUROC   | 0.83   | 0.73   | 0.89   | 0.72   | 0.88   | 0.92   | 0.93   | 0.83   | 0.84   |
|                   | T1-Acc  | 0.75   | 0.43   | 0.49   | 0.43   | 0.60   | 0.69   | 0.58   | 0.62   | 0.56   |
| Perch             | cmAP    | 0.30   | 0.18   | **0.39** | 0.27   | 0.45   | 0.63   | 0.28   | 0.29   | 0.36   |
|                   | AUROC   | 0.84   | 0.70   | 0.90   | 0.76   | 0.86   | 0.91   | 0.91   | 0.83   | 0.84   |
|                   | T1-Acc  | 0.85   | 0.48   | **0.66** | **0.57** | 0.58   | 0.69   | 0.62   | 0.69   | 0.61   |



## Example

This model can be easily loaded and used for inference with the `transformers` library.

```python
from transformers import AutoFeatureExtractor, AutoModelForSequenceClassification
import librosa
import torch

# Load the model and feature extractor
model = AutoModelForSequenceClassification.from_pretrained("DBD-research-group/AudioProtoPNet-10-BirdSet-XCL", trust_remote_code=True)
feature_extractor = AutoFeatureExtractor.from_pretrained("DBD-research-group/AudioProtoPNet-10-BirdSet-XCL", trust_remote_code=True)
model.eval()

# Load an example audio file
audio_path = librosa.ex('robin')
label = "eurrob1"  # The eBird label for the European Robin.

# The model is trained on audio sampled at 32,000 Hz
audio, sample_rate = librosa.load(audio_path, sr=32_000)

mel_spectrogram = feature_extractor(audio)

outputs = model(mel_spectrogram)
probabilities = torch.sigmoid(outputs[0]).detach()

# Get the top 5 predictions by confidence
top_n_probs, top_n_indices = torch.topk(probabilities, k=5, dim=-1)

label2id = model.config.label2id
id2label = model.config.id2label

print("Selected species with confidence:")
print(f"{label:<7} - {probabilities[:, label2id[label]].item():.2%}")
print("\nTop 5 Predictions with confidence:")
for idx, conf in zip(top_n_indices.squeeze(), top_n_probs.squeeze()):
    print(f"{id2label[idx.item()]:<7} - {conf:.2%}")
```

**Expected output**
```
Selected species with confidence:
eurrob1 - 26.81%
Top 5 Predictions with confidence:
coatit2 - 49.99%
sablar2 - 48.29%
palwar5 - 41.58%
gretit1 - 37.51%
verdin  - 34.72%
```
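Field recordings are often longer than the fixed window a feature extractor expects. A hypothetical helper for splitting a long recording into equal-length segments before running inference (the 5-second window is an assumption here; check the model's training configuration for the exact window length):

```python
import numpy as np

def chunk_audio(audio: np.ndarray, sample_rate: int = 32_000, seconds: float = 5.0) -> np.ndarray:
    """Split a long recording into fixed-length segments, zero-padding the last one.

    The 5 s segment length is an assumption for illustration; use whatever
    window the feature extractor was trained with.
    """
    n = int(sample_rate * seconds)
    pad = (-len(audio)) % n          # samples needed to fill the final segment
    padded = np.pad(audio, (0, pad))
    return padded.reshape(-1, n)     # (num_segments, n)

chunks = chunk_audio(np.zeros(32_000 * 12))  # 12 s of audio at 32 kHz
print(chunks.shape)  # (3, 160000)
```

Each row can then be passed through the feature extractor and model individually, and per-class scores pooled (e.g. by taking the maximum over segments).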

## More Details
For more details refer to our paper at: https://www.sciencedirect.com/science/article/pii/S1574954125000901

## Citation
```
@misc{heinrich2024audioprotopnet,
      title={AudioProtoPNet: An interpretable deep learning model for bird sound classification}, 
      author={René Heinrich and Lukas Rauch and Bernhard Sick and Christoph Scholz},
      year={2024},
      url={https://www.sciencedirect.com/science/article/pii/S1574954125000901}, 
}
```