---
license: cc-by-nc-4.0
datasets:
- DBD-research-group/BirdSet
base_model:
- facebook/convnext-base-224-22k
pipeline_tag: audio-classification
library_name: transformers
tags:
- audio-classification
- audio
---
# AudioProtoPNet: An Interpretable Deep Learning Model for Bird Sound Classification

## Model Description

### Abstract

Deep learning models have significantly advanced acoustic bird monitoring by recognizing numerous bird species based on their vocalizations. However, traditional deep learning models are often "black boxes," providing limited insight into their underlying computations, which restricts their utility for ornithologists and machine learning engineers. Explainable models, in contrast, facilitate debugging, knowledge discovery, trust, and interdisciplinary collaboration.

This work introduces **AudioProtoPNet**, an adaptation of the Prototypical Part Network (ProtoPNet) for multi-label bird sound classification. AudioProtoPNet is inherently interpretable: it uses a ConvNeXt backbone to extract embeddings and a prototype learning classifier trained on those embeddings. The classifier learns prototypical patterns of each bird species' vocalizations from spectrograms of the training instances.

During inference, recordings are classified by comparing them to the learned prototypes in embedding space, which yields explanations for the model's decisions and insights into the most informative embeddings of each bird species.
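The prototype comparison described above can be sketched in a few lines. This is an illustrative simplification, not the model's actual forward pass: the patch/prototype shapes, the use of cosine similarity, and the max-pooling over patches and prototypes are all assumptions made for the sketch.

```python
import torch

def prototype_logits(embeddings: torch.Tensor, prototypes: torch.Tensor) -> torch.Tensor:
    """Score each class by the best match between any patch embedding
    and any of that class's prototypes (illustrative only).

    embeddings: (num_patches, dim)             patch embeddings from the backbone
    prototypes: (num_classes, num_protos, dim) learned class prototypes
    returns:    (num_classes,)                 one similarity-based logit per class
    """
    # Cosine similarity between every spectrogram patch and every prototype
    emb = torch.nn.functional.normalize(embeddings, dim=-1)    # (P, D)
    proto = torch.nn.functional.normalize(prototypes, dim=-1)  # (C, K, D)
    sims = torch.einsum("pd,ckd->ckp", emb, proto)             # (C, K, P)
    # Max over patches (where in the spectrogram) and over each class's prototypes
    return sims.amax(dim=(1, 2))

# Toy example: 6 patches, 3 classes, 2 prototypes per class, embedding dim 8
logits = prototype_logits(torch.randn(6, 8), torch.randn(3, 2, 8))
print(logits.shape)  # torch.Size([3])
```

Because each logit traces back to a specific patch-prototype pair, the location of the best-matching patch in the spectrogram can serve as the explanation for that class's score.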
### Training Data

The model was trained on the **BirdSet** training dataset, which comprises 9,734 bird species and over 6,800 hours of recordings.

### Evaluation

AudioProtoPNet's performance was evaluated on seven BirdSet test datasets covering diverse geographical regions. The model outperformed state-of-the-art bird sound classifiers such as Perch (which itself outperforms BirdNET), achieving an average AUROC of 0.90 and a cmAP of 0.42, relative improvements of 7.1% and 16.7% over Perch, respectively.

These results demonstrate the feasibility of developing powerful yet interpretable deep learning models for the challenging task of multi-label bird sound classification, offering valuable insights for professionals in ornithology and machine learning.
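cmAP (class-wise mean average precision) computes average precision per species and averages over species that have at least one positive label. The following NumPy sketch is an illustrative reimplementation of that definition, not the official BirdSet evaluation code:

```python
import numpy as np

def cmap_score(y_true: np.ndarray, y_score: np.ndarray) -> float:
    """Class-wise mean average precision for multi-label predictions.

    y_true:  (num_recordings, num_classes) binary labels
    y_score: (num_recordings, num_classes) predicted scores
    """
    aps = []
    for c in range(y_true.shape[1]):
        y, s = y_true[:, c], y_score[:, c]
        if y.sum() == 0:
            continue  # AP is undefined for classes without positives
        order = np.argsort(-s)  # rank recordings by descending score
        hits = y[order]
        precision = np.cumsum(hits) / (np.arange(len(hits)) + 1)
        aps.append(precision[hits == 1].mean())  # precision at each positive rank
    return float(np.mean(aps))

# Toy multi-label example: 4 recordings, 3 species
y_true = np.array([[1, 0, 0], [0, 1, 0], [1, 1, 0], [0, 0, 1]])
y_score = np.array([[0.4, 0.2, 0.1],
                    [0.6, 0.8, 0.2],
                    [0.7, 0.6, 0.1],
                    [0.1, 0.2, 0.9]])
print(round(cmap_score(y_true, y_score), 3))  # 0.944
```

Averaging per class rather than per recording means rare species weigh as much as common ones, which is why cmAP is a stricter metric than AUROC on long-tailed bird data.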
### Evaluation Results

**Table 1: Mean Performance of AudioProtoPNet Models with Varying Prototypes**

Mean performance of AudioProtoPNet models with one, five, ten, and twenty prototypes per class for the validation dataset POW and the seven test datasets, averaged over five different random seeds. The 'Score' column is the average of the respective metric across all test datasets. Best values for each metric are **bolded**, and second-best values are *underlined*. While models with five, ten, and twenty prototypes performed similarly, the model with only one prototype per class showed slightly lower performance.

| Model | Metric | POW | PER | NES | UHH | HSN | NBP | SSW | SNE | Score |
| :---------------- | :------ | :----- | :----- | :----- | :----- | :----- | :----- | :----- | :----- | :----- |
| AudioProtoPNet-1 | cmAP | 0.49 | 0.30 | 0.36 | 0.28 | **0.50** | **0.66** | 0.40 | 0.32 | 0.40 |
| | AUROC | 0.88 | 0.79 | 0.92 | 0.85 | 0.91 | 0.92 | **0.96** | 0.84 | 0.88 |
| | T1-Acc | **0.87** | 0.59 | 0.49 | 0.42 | 0.64 | 0.71 | 0.64 | 0.70 | 0.60 |
| AudioProtoPNet-5 | cmAP | **0.50** | 0.30 | 0.38 | 0.31 | 0.54 | **0.68** | 0.42 | 0.33 | 0.42 |
| | AUROC | 0.88 | 0.79 | 0.93 | 0.87 | 0.92 | 0.93 | **0.97** | 0.88 | **0.90** |
| | T1-Acc | 0.84 | **0.59** | 0.52 | **0.49** | 0.65 | **0.71** | 0.66 | **0.74** | 0.62 |
| AudioProtoPNet-10 | cmAP | **0.50** | **0.30** | **0.38** | **0.30** | **0.54** | **0.68** | **0.42** | **0.34** | **0.42** |
| | AUROC | 0.88 | **0.80** | **0.94** | 0.86 | **0.92** | **0.93** | **0.97** | 0.86 | **0.90** |
| | T1-Acc | 0.85 | **0.59** | **0.52** | 0.47 | **0.64** | **0.72** | **0.67** | **0.74** | **0.62** |
| AudioProtoPNet-20 | cmAP | **0.50** | **0.30** | **0.38** | **0.31** | **0.54** | **0.68** | **0.43** | **0.33** | **0.42** |
| | AUROC | **0.89** | **0.80** | **0.94** | **0.86** | **0.92** | **0.93** | **0.97** | **0.87** | **0.90** |
| | T1-Acc | **0.87** | **0.60** | **0.52** | 0.42 | **0.65** | **0.72** | **0.68** | **0.75** | **0.62** |

**Table 2: Comparative Performance of AudioProtoPNet, ConvNeXt, and Perch**

Mean performance of AudioProtoPNet-5, ConvNeXt, and Perch for the validation dataset POW and the seven test datasets, averaged over five different random seeds. The 'Score' column is the average of the respective metric across all test datasets. Best values for each metric are **bolded**, and second-best values are *underlined*. AudioProtoPNet-5 outperformed both Perch and ConvNeXt in cmAP, AUROC, and top-1 accuracy.

| Model | Metric | POW | PER | NES | UHH | HSN | NBP | SSW | SNE | Score |
| :---------------- | :------ | :----- | :----- | :----- | :----- | :----- | :----- | :----- | :----- | :----- |
| AudioProtoPNet-5 | cmAP | 0.50 | **0.30** | 0.38 | **0.31** | **0.54** | **0.68** | **0.42** | **0.33** | **0.42** |
| | AUROC | 0.88 | **0.79** | **0.93** | **0.87** | **0.92** | **0.93** | **0.97** | **0.86** | **0.90** |
| | T1-Acc | 0.84 | **0.59** | 0.52 | 0.49 | **0.65** | **0.71** | **0.66** | **0.74** | **0.62** |
| ConvNeXt | cmAP | 0.41 | 0.21 | 0.35 | 0.25 | 0.49 | 0.66 | 0.38 | 0.31 | 0.38 |
| | AUROC | 0.83 | 0.73 | 0.89 | 0.72 | 0.88 | 0.92 | 0.93 | 0.83 | 0.84 |
| | T1-Acc | 0.75 | 0.43 | 0.49 | 0.43 | 0.60 | 0.69 | 0.58 | 0.62 | 0.56 |
| Perch | cmAP | 0.30 | 0.18 | **0.39** | 0.27 | 0.45 | 0.63 | 0.28 | 0.29 | 0.36 |
| | AUROC | 0.84 | 0.70 | 0.90 | 0.76 | 0.86 | 0.91 | 0.91 | 0.83 | 0.84 |
| | T1-Acc | 0.85 | 0.48 | **0.66** | **0.57** | 0.58 | 0.69 | 0.62 | 0.69 | 0.61 |
## Usage

The model can be loaded and used for inference with the `transformers` library.

```python
from transformers import AutoFeatureExtractor, AutoModelForSequenceClassification
import librosa
import torch

# Load the model and feature extractor
model = AutoModelForSequenceClassification.from_pretrained(
    "DBD-research-group/AudioProtoPNet-20-BirdSet-XCL", trust_remote_code=True
)
feature_extractor = AutoFeatureExtractor.from_pretrained(
    "DBD-research-group/AudioProtoPNet-20-BirdSet-XCL", trust_remote_code=True
)
model.eval()

# Load an example audio file
audio_path = librosa.ex('robin')
label = "eurrob1"  # The eBird label for the European Robin

# The model was trained on audio sampled at 32,000 Hz
audio, sample_rate = librosa.load(audio_path, sr=32_000)

# Convert the waveform to a mel spectrogram and run the model
mel_spectrogram = feature_extractor(audio)
with torch.no_grad():
    outputs = model(mel_spectrogram)
probabilities = torch.sigmoid(outputs[0])

# Get the top 5 predictions by confidence
top_n_probs, top_n_indices = torch.topk(probabilities, k=5, dim=-1)

label2id = model.config.label2id
id2label = model.config.id2label

print("Selected species with confidence:")
print(f"{label:<7} - {probabilities[:, label2id[label]].item():.2%}")
print("\nTop 5 Predictions with confidence:")
for idx, conf in zip(top_n_indices.squeeze(), top_n_probs.squeeze()):
    print(f"{id2label[idx.item()]:<7} - {conf:.2%}")
```

**Expected output**
```
Selected species with confidence:
eurrob1 - 65.40%

Top 5 Predictions with confidence:
eurrob1 - 65.40%
blutit  - 34.11%
eugplo  - 33.66%
sablar2 - 33.50%
dunnoc1 - 32.35%
```
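Field recordings are often much longer than the short clips such models are trained on. One common pattern is to score a recording in fixed-length windows and aggregate the per-window probabilities. This is a sketch only: the 5-second window length is an assumption for illustration, so check the model's expected input length before using it.

```python
import numpy as np

def chunk_audio(audio: np.ndarray, sample_rate: int = 32_000,
                window_s: float = 5.0) -> list[np.ndarray]:
    """Split a mono waveform into fixed-length windows, zero-padding the last one."""
    win = int(window_s * sample_rate)
    chunks = []
    for start in range(0, len(audio), win):
        chunk = audio[start:start + win]
        if len(chunk) < win:
            chunk = np.pad(chunk, (0, win - len(chunk)))  # pad the final window
        chunks.append(chunk)
    return chunks

# A 12-second mono recording becomes three 5-second windows
audio = np.zeros(12 * 32_000, dtype=np.float32)
chunks = chunk_audio(audio)
print(len(chunks), len(chunks[0]))  # 3 160000
```

Per-window probabilities can then be aggregated, e.g. with an element-wise maximum across windows, before thresholding per species.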
## More Details

For more details, refer to our paper: https://www.sciencedirect.com/science/article/pii/S1574954125000901

## Citation

```
@misc{heinrich2024audioprotopnet,
  title={AudioProtoPNet: An interpretable deep learning model for bird sound classification},
  author={René Heinrich and Lukas Rauch and Bernhard Sick and Christoph Scholz},
  year={2024},
  url={https://www.sciencedirect.com/science/article/pii/S1574954125000901},
}
```