---
license: cc-by-nc-4.0
datasets:
- DBD-research-group/BirdSet
base_model:
- facebook/convnext-base-224-22k
pipeline_tag: audio-classification
library_name: transformers
tags:
- audio-classification
- audio
---
# AudioProtoPNet: An Interpretable Deep Learning Model for Bird Sound Classification

## Abstract

Deep learning models have significantly advanced acoustic bird monitoring by recognizing numerous bird species based on their vocalizations. However, traditional deep learning models are "black boxes" that provide limited insight into their underlying computations, which restricts their utility for ornithologists and machine learning engineers. Explainable models, on the other hand, can facilitate debugging, knowledge discovery, trust, and interdisciplinary collaboration.

This work introduces **AudioProtoPNet**, an adaptation of the Prototypical Part Network (ProtoPNet) designed for multi-label bird sound classification. AudioProtoPNet is inherently interpretable: it uses a ConvNeXt backbone to extract embeddings and a prototype learning classifier trained on these embeddings. The classifier learns prototypical patterns of each bird species' vocalizations from spectrograms of instances in the training data.

During inference, recordings are classified by comparing them to the learned prototypes in the embedding space, which provides explanations for the model's decisions and insights into the most informative embeddings of each bird species.

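The prototype comparison described above can be sketched roughly as follows. This is a minimal illustration under assumptions, not the model's actual implementation: the tensor shapes, the cosine similarity, and the pooling choices are all invented for the example.

```python
import torch
import torch.nn.functional as F

# Hypothetical shapes for illustration only.
num_classes, protos_per_class, embed_dim = 3, 5, 8
num_patches = 12  # spatial positions in the backbone's feature map

# Learned prototypes: a set of prototypical embeddings per class.
prototypes = torch.randn(num_classes, protos_per_class, embed_dim)

# Patch embeddings extracted from one input spectrogram.
patches = torch.randn(num_patches, embed_dim)

# Cosine similarity between every patch and every prototype.
p = F.normalize(prototypes, dim=-1)
x = F.normalize(patches, dim=-1)
sims = torch.einsum("cpe,ne->cpn", p, x)  # (classes, protos, patches)

# Each prototype's activation is its best match over all patches;
# the patch index of that maximum is what localizes the evidence
# in the spectrogram, which is what makes the decision explainable.
proto_activations = sims.max(dim=-1).values   # (classes, protos)
class_logits = proto_activations.mean(dim=-1)  # (classes,)
print(class_logits.shape)  # torch.Size([3])
```

The key property is that each class logit decomposes into per-prototype similarities, so every prediction can be traced back to specific spectrogram regions.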
- **Paper**: [Elsevier](https://www.sciencedirect.com/science/article/pii/S1574954125000901)

## Model Description

### Training Data

The model was trained on the **BirdSet training dataset**, which comprises 9,734 bird species and over 6,800 hours of recordings.

### Evaluation

AudioProtoPNet's performance was evaluated on seven BirdSet test datasets covering diverse geographical regions. The model outperformed state-of-the-art bird sound classification models such as Perch (which itself outperforms BirdNET). AudioProtoPNet achieved an average AUROC of 0.90 and a cmAP of 0.42, relative improvements of 7.1% and 16.7% over Perch, respectively.

These results demonstrate the feasibility of developing powerful yet interpretable deep learning models for the challenging task of multi-label bird sound classification, offering valuable insights for professionals in ornithology and machine learning.

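For reference, the two headline metrics can be computed with scikit-learn as in the sketch below. The targets and scores are invented toy values, and interpreting cmAP as class-wise (macro-averaged) average precision is an assumption based on that metric's usual definition.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

# Toy multi-label targets and scores for 4 recordings x 3 species;
# the numbers are illustrative only, not model outputs.
y_true = np.array([[1, 0, 0],
                   [1, 1, 0],
                   [0, 1, 1],
                   [0, 0, 1]])
y_score = np.array([[0.9, 0.2,  0.1],
                    [0.8, 0.7,  0.3],
                    [0.2, 0.6,  0.8],
                    [0.1, 0.65, 0.7]])

# cmAP: average precision per class (column), then mean over classes.
cmap = np.mean([average_precision_score(y_true[:, c], y_score[:, c])
                for c in range(y_true.shape[1])])

# AUROC, macro-averaged over classes.
auroc = roc_auc_score(y_true, y_score, average="macro")
print(f"cmAP: {cmap:.2f}, AUROC: {auroc:.2f}")  # cmAP: 0.94, AUROC: 0.92
```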
### Evaluation Results

**Table 1: Mean Performance of AudioProtoPNet Models with Varying Prototypes**

Mean performance of AudioProtoPNet models with one, five, ten, and twenty prototypes per class on the validation dataset POW and the seven test datasets, averaged over five random seeds. The 'Score' column is the average of the respective metric across all test datasets. Best values for each metric are **bolded**, and second-best values are *italicized*. While models with five, ten, and twenty prototypes performed similarly, the model with only one prototype per class showed slightly lower performance.

| Model | Metric | POW | PER | NES | UHH | HSN | NBP | SSW | SNE | Score |
| :---------------- | :------ | :----- | :----- | :----- | :----- | :----- | :----- | :----- | :----- | :----- |
| AudioProtoPNet-1 | cmAP | 0.49 | 0.30 | 0.36 | 0.28 | **0.50** | **0.66** | 0.40 | 0.32 | 0.40 |
| | AUROC | 0.88 | 0.79 | 0.92 | 0.85 | 0.91 | 0.92 | **0.96** | 0.84 | 0.88 |
| | T1-Acc | **0.87** | 0.59 | 0.49 | 0.42 | 0.64 | 0.71 | 0.64 | 0.70 | 0.60 |
| AudioProtoPNet-5 | cmAP | **0.50** | 0.30 | 0.38 | 0.31 | 0.54 | **0.68** | 0.42 | 0.33 | 0.42 |
| | AUROC | 0.88 | 0.79 | 0.93 | 0.87 | 0.92 | 0.93 | **0.97** | 0.88 | **0.90** |
| | T1-Acc | 0.84 | **0.59** | 0.52 | **0.49** | 0.65 | **0.71** | 0.66 | **0.74** | 0.62 |
| AudioProtoPNet-10 | cmAP | **0.50** | **0.30** | **0.38** | **0.30** | **0.54** | **0.68** | **0.42** | **0.34** | **0.42** |
| | AUROC | 0.88 | **0.80** | **0.94** | 0.86 | **0.92** | **0.93** | **0.97** | 0.86 | **0.90** |
| | T1-Acc | 0.85 | **0.59** | **0.52** | 0.47 | **0.64** | **0.72** | **0.67** | **0.74** | **0.62** |
| AudioProtoPNet-20 | cmAP | **0.50** | **0.30** | **0.38** | **0.31** | **0.54** | **0.68** | **0.43** | **0.33** | **0.42** |
| | AUROC | **0.89** | **0.80** | **0.94** | **0.86** | **0.92** | **0.93** | **0.97** | **0.87** | **0.90** |
| | T1-Acc | **0.87** | **0.60** | **0.52** | 0.42 | **0.65** | **0.72** | **0.68** | **0.75** | **0.62** |

**Table 2: Comparative Performance of AudioProtoPNet, ConvNeXt, and Perch**

Mean performance of AudioProtoPNet-5, ConvNeXt, and Perch on the validation dataset POW and the seven test datasets, averaged over five random seeds. The 'Score' column is the average of the respective metric across all test datasets. Best values for each metric are **bolded**, and second-best values are *italicized*. AudioProtoPNet-5 notably outperformed both Perch and ConvNeXt in cmAP, AUROC, and top-1 accuracy.

| Model | Metric | POW | PER | NES | UHH | HSN | NBP | SSW | SNE | Score |
| :---------------- | :------ | :----- | :----- | :----- | :----- | :----- | :----- | :----- | :----- | :----- |
| AudioProtoPNet-5 | cmAP | 0.50 | **0.30** | 0.38 | **0.31** | **0.54** | **0.68** | **0.42** | **0.33** | **0.42** |
| | AUROC | 0.88 | **0.79** | **0.93** | **0.87** | **0.92** | **0.93** | **0.97** | **0.86** | **0.90** |
| | T1-Acc | 0.84 | **0.59** | 0.52 | 0.49 | **0.65** | **0.71** | **0.66** | **0.74** | **0.62** |
| ConvNeXt | cmAP | 0.41 | 0.21 | 0.35 | 0.25 | 0.49 | 0.66 | 0.38 | 0.31 | 0.38 |
| | AUROC | 0.83 | 0.73 | 0.89 | 0.72 | 0.88 | 0.92 | 0.93 | 0.83 | 0.84 |
| | T1-Acc | 0.75 | 0.43 | 0.49 | 0.43 | 0.60 | 0.69 | 0.58 | 0.62 | 0.56 |
| Perch | cmAP | 0.30 | 0.18 | **0.39** | 0.27 | 0.45 | 0.63 | 0.28 | 0.29 | 0.36 |
| | AUROC | 0.84 | 0.70 | 0.90 | 0.76 | 0.86 | 0.91 | 0.91 | 0.83 | 0.84 |
| | T1-Acc | 0.85 | 0.48 | **0.66** | **0.57** | 0.58 | 0.69 | 0.62 | 0.69 | 0.61 |

## Example

This model can be loaded and used for inference with the `transformers` library.

```python
from transformers import AutoFeatureExtractor, AutoModelForSequenceClassification
import librosa
import torch

# Load the model and feature extractor
model = AutoModelForSequenceClassification.from_pretrained(
    "DBD-research-group/AudioProtoPNet-20-BirdSet-XCL", trust_remote_code=True
)
feature_extractor = AutoFeatureExtractor.from_pretrained(
    "DBD-research-group/AudioProtoPNet-20-BirdSet-XCL", trust_remote_code=True
)
model.eval()

# Load an example audio file
audio_path = librosa.ex('robin')
label = "eurrob1"  # The eBird label for the European Robin.

# The model was trained on audio sampled at 32,000 Hz
audio, sample_rate = librosa.load(audio_path, sr=32_000)

# Convert the waveform to a mel spectrogram and run inference
mel_spectrogram = feature_extractor(audio)
with torch.no_grad():
    outputs = model(mel_spectrogram)
probabilities = torch.sigmoid(outputs[0])

# Get the top 5 predictions by confidence
top_n_probs, top_n_indices = torch.topk(probabilities, k=5, dim=-1)

label2id = model.config.label2id
id2label = model.config.id2label

print("Selected species with confidence:")
print(f"{label:<7} - {probabilities[:, label2id[label]].item():.2%}")
print("\nTop 5 predictions with confidence:")
for idx, conf in zip(top_n_indices.squeeze(), top_n_probs.squeeze()):
    print(f"{id2label[idx.item()]:<7} - {conf:.2%}")
```

**Expected output**
```
Selected species with confidence:
eurrob1 - 65.40%

Top 5 predictions with confidence:
eurrob1 - 65.40%
blutit  - 34.11%
eugplo  - 33.66%
sablar2 - 33.50%
dunnoc1 - 32.35%
```

## More Details
For more details, refer to our paper: https://www.sciencedirect.com/science/article/pii/S1574954125000901

## Citation
```
@misc{heinrich2024audioprotopnet,
  title={AudioProtoPNet: An interpretable deep learning model for bird sound classification},
  author={René Heinrich and Lukas Rauch and Bernhard Sick and Christoph Scholz},
  year={2024},
  url={https://www.sciencedirect.com/science/article/pii/S1574954125000901},
}
```