---
license: cc-by-nc-4.0
datasets:
- DBD-research-group/BirdSet
base_model:
- facebook/convnext-base-224-22k
pipeline_tag: audio-classification
library_name: transformers
tags:
- audio-classification
- audio
---
# AudioProtoPNet: An Interpretable Deep Learning Model for Bird Sound Classification

## Abstract

Deep learning models have significantly advanced acoustic bird monitoring by recognizing numerous bird species based on their vocalizations. However, traditional deep learning models are often "black boxes," providing limited insight into their underlying computations, which restricts their utility for ornithologists and machine learning engineers. Explainable models, on the other hand, can facilitate debugging, knowledge discovery, trust, and interdisciplinary collaboration.

This work introduces **AudioProtoPNet**, an adaptation of the Prototypical Part Network (ProtoPNet) designed for multi-label bird sound classification. AudioProtoPNet is inherently interpretable, leveraging a ConvNeXt backbone to extract embeddings and a prototype learning classifier trained on these embeddings. The classifier learns prototypical patterns of each bird species' vocalizations from spectrograms of instances in the training data.

During inference, recordings are classified by comparing them to the learned prototypes in the embedding space, providing explanations for the model's decisions and insights into the most informative embeddings of each bird species.
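
For intuition, here is a minimal, hypothetical sketch of such a prototype classification head. It is not the released implementation: the shapes, the use of cosine similarity, and the max-pooling over patches and prototypes are illustrative assumptions.

```python
import torch
import torch.nn as nn


class PrototypeHead(nn.Module):
    """Illustrative prototype classifier: class logits from prototype similarity."""

    def __init__(self, num_classes: int, prototypes_per_class: int, embed_dim: int):
        super().__init__()
        # Learnable prototype vectors, a fixed number per class.
        self.prototypes = nn.Parameter(
            torch.randn(num_classes, prototypes_per_class, embed_dim)
        )

    def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
        # embeddings: (batch, patches, embed_dim) from the backbone's feature map.
        e = nn.functional.normalize(embeddings, dim=-1)
        p = nn.functional.normalize(self.prototypes, dim=-1)
        # Cosine similarity of every patch embedding to every prototype:
        # result has shape (batch, classes, patches, prototypes_per_class).
        sims = torch.einsum("bte,cke->bctk", e, p)
        # A prototype's evidence is its best match anywhere in the spectrogram;
        # the class logit is the max over that class's prototypes.
        return sims.amax(dim=(2, 3))


# Example shapes: 9,734 BirdSet classes, 1024-dim ConvNeXt-Base embeddings.
head = PrototypeHead(num_classes=9734, prototypes_per_class=5, embed_dim=1024)
logits = head(torch.randn(2, 64, 1024))  # -> (2, 9734)
```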

- **Paper**: [Elsevier](https://www.sciencedirect.com/science/article/pii/S1574954125000901)

## Model Description

### Training Data

The model was trained on the **BirdSet** training dataset (XCL), which comprises 9,734 bird species and over 6,800 hours of recordings.

### Evaluation

AudioProtoPNet's performance was evaluated on seven BirdSet test datasets covering diverse geographical regions. The model outperformed state-of-the-art bird sound classification models such as Perch (which itself outperforms BirdNET), achieving an average AUROC of 0.90 and a cmAP of 0.42, relative improvements of 7.1% and 16.7% over Perch, respectively.

These results highlight the feasibility of developing powerful yet interpretable deep learning models for the challenging task of multi-label bird sound classification, offering valuable insights for professionals in ornithology and machine learning.
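
For reference, the metrics reported below can be computed from multi-hot labels and per-class scores. This small sketch uses scikit-learn as an assumed tool; it is not necessarily the paper's evaluation code.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

# Toy multi-label data: 3 recordings, 3 species (multi-hot labels and scores).
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
y_score = np.array([[0.9, 0.2, 0.6], [0.1, 0.8, 0.3], [0.7, 0.6, 0.2]])

# cmAP: average precision per class, then the mean over classes.
cmap = average_precision_score(y_true, y_score, average="macro")
# AUROC, likewise macro-averaged over classes.
auroc = roc_auc_score(y_true, y_score, average="macro")
print(f"cmAP: {cmap:.2f}, AUROC: {auroc:.2f}")
```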

### Evaluation Results

**Table 1: Mean Performance of AudioProtoPNet Models with Varying Numbers of Prototypes**

Mean performance of AudioProtoPNet models with one, five, ten, and twenty prototypes per class on the validation dataset POW and the seven test datasets, averaged over five random seeds. The 'Score' column is the average of the respective metric across all test datasets. Best values for each metric are shown in **bold**. While the models with five, ten, and twenty prototypes performed similarly, the model with only one prototype per class showed slightly lower performance.

| Model | Metric | POW | PER | NES | UHH | HSN | NBP | SSW | SNE | Score |
| :---------------- | :------ | :----- | :----- | :----- | :----- | :----- | :----- | :----- | :----- | :----- |
| AudioProtoPNet-1 | cmAP | 0.49 | 0.30 | 0.36 | 0.28 | **0.50** | **0.66** | 0.40 | 0.32 | 0.40 |
| | AUROC | 0.88 | 0.79 | 0.92 | 0.85 | 0.91 | 0.92 | **0.96** | 0.84 | 0.88 |
| | T1-Acc | **0.87** | 0.59 | 0.49 | 0.42 | 0.64 | 0.71 | 0.64 | 0.70 | 0.60 |
| AudioProtoPNet-5 | cmAP | **0.50** | 0.30 | 0.38 | 0.31 | 0.54 | **0.68** | 0.42 | 0.33 | 0.42 |
| | AUROC | 0.88 | 0.79 | 0.93 | 0.87 | 0.92 | 0.93 | **0.97** | 0.88 | **0.90** |
| | T1-Acc | 0.84 | **0.59** | 0.52 | **0.49** | 0.65 | **0.71** | 0.66 | **0.74** | 0.62 |
| AudioProtoPNet-10 | cmAP | **0.50** | **0.30** | **0.38** | **0.30** | **0.54** | **0.68** | **0.42** | **0.34** | **0.42** |
| | AUROC | 0.88 | **0.80** | **0.94** | 0.86 | **0.92** | **0.93** | **0.97** | 0.86 | **0.90** |
| | T1-Acc | 0.85 | **0.59** | **0.52** | 0.47 | **0.64** | **0.72** | **0.67** | **0.74** | **0.62** |
| AudioProtoPNet-20 | cmAP | **0.50** | **0.30** | **0.38** | **0.31** | **0.54** | **0.68** | **0.43** | **0.33** | **0.42** |
| | AUROC | **0.89** | **0.80** | **0.94** | **0.86** | **0.92** | **0.93** | **0.97** | **0.87** | **0.90** |
| | T1-Acc | **0.87** | **0.60** | **0.52** | 0.42 | **0.65** | **0.72** | **0.68** | **0.75** | **0.62** |

**Table 2: Comparative Performance of AudioProtoPNet-5, ConvNeXt, and Perch**

Mean performance of AudioProtoPNet-5, ConvNeXt, and Perch on the validation dataset POW and the seven test datasets, averaged over five random seeds. The 'Score' column is the average of the respective metric across all test datasets. Best values for each metric are shown in **bold**. AudioProtoPNet-5 notably outperformed both Perch and ConvNeXt in cmAP, AUROC, and top-1 accuracy.

| Model | Metric | POW | PER | NES | UHH | HSN | NBP | SSW | SNE | Score |
| :---------------- | :------ | :----- | :----- | :----- | :----- | :----- | :----- | :----- | :----- | :----- |
| AudioProtoPNet-5 | cmAP | 0.50 | **0.30** | 0.38 | **0.31** | **0.54** | **0.68** | **0.42** | **0.33** | **0.42** |
| | AUROC | 0.88 | **0.79** | **0.93** | **0.87** | **0.92** | **0.93** | **0.97** | **0.86** | **0.90** |
| | T1-Acc | 0.84 | **0.59** | 0.52 | 0.49 | **0.65** | **0.71** | **0.66** | **0.74** | **0.62** |
| ConvNeXt | cmAP | 0.41 | 0.21 | 0.35 | 0.25 | 0.49 | 0.66 | 0.38 | 0.31 | 0.38 |
| | AUROC | 0.83 | 0.73 | 0.89 | 0.72 | 0.88 | 0.92 | 0.93 | 0.83 | 0.84 |
| | T1-Acc | 0.75 | 0.43 | 0.49 | 0.43 | 0.60 | 0.69 | 0.58 | 0.62 | 0.56 |
| Perch | cmAP | 0.30 | 0.18 | **0.39** | 0.27 | 0.45 | 0.63 | 0.28 | 0.29 | 0.36 |
| | AUROC | 0.84 | 0.70 | 0.90 | 0.76 | 0.86 | 0.91 | 0.91 | 0.83 | 0.84 |
| | T1-Acc | 0.85 | 0.48 | **0.66** | **0.57** | 0.58 | 0.69 | 0.62 | 0.69 | 0.61 |

## Example

This model can be loaded and used for inference with the `transformers` library.

```python
from transformers import AutoFeatureExtractor, AutoModelForSequenceClassification
import librosa
import torch

# Load the model and feature extractor
model = AutoModelForSequenceClassification.from_pretrained(
    "DBD-research-group/AudioProtoPNet-20-BirdSet-XCL", trust_remote_code=True
)
feature_extractor = AutoFeatureExtractor.from_pretrained(
    "DBD-research-group/AudioProtoPNet-20-BirdSet-XCL", trust_remote_code=True
)
model.eval()

# Load an example audio file
audio_path = librosa.ex('robin')
label = "eurrob1"  # The eBird label for the European Robin.

# The model was trained on audio sampled at 32,000 Hz.
audio, sample_rate = librosa.load(audio_path, sr=32_000)

# Convert the waveform into the mel spectrogram the model expects.
mel_spectrogram = feature_extractor(audio)

# Multi-label classification: apply a sigmoid to the logits.
outputs = model(mel_spectrogram)
probabilities = torch.sigmoid(outputs[0]).detach()

# Get the top 5 predictions by confidence
top_n_probs, top_n_indices = torch.topk(probabilities, k=5, dim=-1)

label2id = model.config.label2id
id2label = model.config.id2label

print("Selected species with confidence:")
print(f"{label:<7} - {probabilities[:, label2id[label]].item():.2%}")
print("\nTop 5 Predictions with confidence:")
for idx, conf in zip(top_n_indices.squeeze(), top_n_probs.squeeze()):
    print(f"{id2label[idx.item()]:<7} - {conf:.2%}")
```

**Expected output**
```
Selected species with confidence:
eurrob1 - 65.40%

Top 5 Predictions with confidence:
eurrob1 - 65.40%
blutit  - 34.11%
eugplo  - 33.66%
sablar2 - 33.50%
dunnoc1 - 32.35%
```
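
Continuing from the example above (reusing `audio`, `feature_extractor`, and `model`), a longer field recording can be scored window by window, keeping each species' maximum confidence. This is a hedged sketch: the 5-second window length is an assumption, not a documented property of the model.

```python
import numpy as np
import torch

window = 5 * 32_000  # samples per window at 32 kHz (length is an assumption)
scores = []
for start in range(0, len(audio), window):
    chunk = audio[start:start + window]
    if len(chunk) < window:
        # Zero-pad the final window to full length.
        chunk = np.pad(chunk, (0, window - len(chunk)))
    spec = feature_extractor(chunk)
    with torch.no_grad():
        probs = torch.sigmoid(model(spec)[0])
    scores.append(probs)

# One confidence per species: the maximum over all windows.
recording_probs = torch.stack(scores).amax(dim=0)
```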

## More Details

For more details, refer to our paper: https://www.sciencedirect.com/science/article/pii/S1574954125000901

## Citation

```
@misc{heinrich2024audioprotopnet,
  title={AudioProtoPNet: An interpretable deep learning model for bird sound classification},
  author={René Heinrich and Lukas Rauch and Bernhard Sick and Christoph Scholz},
  year={2024},
  url={https://www.sciencedirect.com/science/article/pii/S1574954125000901},
}
```