---
license: cc-by-nc-4.0
tags:
- audio
- bird
- nature
- bioacoustics
- embeddings
- onnx
- backbone
pipeline_tag: feature-extraction
base_model: justinchuby/BirdNET-onnx
---

# BirdNET v2.4 ONNX Backbone

Backbone-only ONNX exports of the [BirdNET v2.4](https://huggingface.co/justinchuby/BirdNET-onnx) bird sound classifier.
The classification head has been removed, leaving only the audio frontend and the feature-extraction layers.

Two variants are provided, one for each original in [justinchuby/BirdNET-onnx](https://huggingface.co/justinchuby/BirdNET-onnx/tree/main): `model_backbone.onnx` (from `model.onnx`) and `birdnet_backbone.onnx` (from `birdnet.onnx`). Both models output a single tensor named **`embedding`** with shape `(1, 1024)`.

Embeddings from both backbones match the reference TF SavedModel published on Zenodo
([BirdNET_v2.4_protobuf](https://zenodo.org/records/15050749)) within `rtol=1e-3`, `atol=1e-3`.

---

## Quick start

```python
import numpy as np
import onnxruntime as ort
from huggingface_hub import hf_hub_download

# Download backbone
path = hf_hub_download(
    repo_id="biodiversica/BirdNET-onnx-backbone",
    filename="model_backbone.onnx",
)

sess = ort.InferenceSession(path)

# 3 s of audio at 48 kHz = 144,000 samples (zeros here as a placeholder)
audio = np.zeros((1, 144000), dtype=np.float32)
(embedding,) = sess.run(["embedding"], {"INPUT": audio})
print(embedding.shape)  # (1, 1024)
```

For `birdnet_backbone.onnx` the input key is `"input"` (lowercase):

```python
# Reuses `audio` and the imports from the snippet above.
path = hf_hub_download(
    repo_id="biodiversica/BirdNET-onnx-backbone",
    filename="birdnet_backbone.onnx",
)
sess = ort.InferenceSession(path)
(embedding,) = sess.run(["embedding"], {"input": audio})
print(embedding.shape)  # (1, 1024)
```
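Both snippets above feed the model exactly one 3 s window of shape `(1, 144000)`. For longer recordings, one option is to cut the waveform into consecutive 3 s windows and embed them one at a time. The sketch below is a minimal illustration of that idea, not part of this repository: the helper names `split_into_windows` and `embed_recording` are hypothetical, and zero-padding the final window is an assumption rather than documented BirdNET behaviour.

```python
import numpy as np

SAMPLE_RATE = 48_000      # Hz, as in the quick start
WINDOW_SAMPLES = 144_000  # 3 s at 48 kHz


def split_into_windows(waveform):
    """Split a mono waveform into non-overlapping 3 s windows.

    The last window is zero-padded (an assumption made for this sketch).
    Returns an array of shape (n_windows, WINDOW_SAMPLES).
    """
    waveform = np.asarray(waveform, dtype=np.float32).ravel()
    # Ceiling division, with at least one window even for empty input.
    n_windows = max(1, -(-len(waveform) // WINDOW_SAMPLES))
    padded = np.zeros(n_windows * WINDOW_SAMPLES, dtype=np.float32)
    padded[: len(waveform)] = waveform
    return padded.reshape(n_windows, WINDOW_SAMPLES)


def embed_recording(run_window, waveform):
    """Call run_window on each (1, 144000) window and stack the results.

    run_window is any callable that maps a (1, 144000) float32 array to a
    (1, 1024) embedding, for example a thin wrapper around sess.run.
    """
    windows = split_into_windows(waveform)
    return np.concatenate(
        [run_window(w[np.newaxis, :]) for w in windows], axis=0
    )
```

With the session from the quick start, a call could look like `embed_recording(lambda x: sess.run(["embedding"], {"INPUT": x})[0], waveform)`, which would return one 1024-dimensional row per window.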

---

## Extraction procedure

The extraction and testing procedure can be reproduced using `extract_backbone.py`. The script will:

1. Download `model.onnx` and `birdnet.onnx` from [justinchuby/BirdNET-onnx](https://huggingface.co/justinchuby/BirdNET-onnx).
2. Download the BirdNET v2.4 TF SavedModel from Zenodo ([BirdNET_v2.4_protobuf](https://zenodo.org/records/15050749)).
3. Extract the backbone subgraph (everything up to and including the `model/GLOBAL_AVG_POOL/Mean_reduced_0` node), renaming the output to `embedding`.
4. Save `model_backbone.onnx` and `birdnet_backbone.onnx`.
5. Run a numerical comparison between ONNX and TF SavedModel embeddings on a fixed random waveform (seed 42, 3 s at 48 kHz).

Expected output:

```
=== Downloading models ===
Downloaded model.onnx -> ...
Downloaded birdnet.onnx -> ...
Downloading BirdNET protobuf from Zenodo...
Extracted audio-model -> ...

=== Extracting backbones ===
Backbone saved -> model_backbone.onnx
  inputs : ['INPUT']
  outputs: ['embedding']
Backbone saved -> birdnet_backbone.onnx
  inputs : ['input']
  outputs: ['embedding']

=== Comparing embeddings against Zenodo TF SavedModel ===
PB embedding shape: (1, 1024)

model_backbone.onnx:
  ONNX embedding shape: (1, 1024)
  |diff| mean=1.230468e-06  max=9.298325e-06
  Embeddings match PB reference with rtol=1e-03, atol=1e-03  PASSED

birdnet_backbone.onnx:
  ONNX embedding shape: (1, 1024)
  |diff| mean=6.440870e-05  max=5.004406e-04
  Embeddings match PB reference with rtol=1e-03, atol=1e-03  PASSED
```
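The `|diff|` statistics and the `rtol`/`atol` check in the log can be reproduced with plain NumPy. The following is a sketch of such a comparison, not the code from `extract_backbone.py`: the function name `compare_embeddings` is hypothetical, the embeddings are stand-in arrays rather than real model outputs, and the uniform `[-1, 1)` distribution for the seed-42 waveform is an assumption, since only the seed and duration are stated above.

```python
import numpy as np


def compare_embeddings(candidate, reference, rtol=1e-3, atol=1e-3):
    """Return (mean |diff|, max |diff|, passed) for two embedding arrays."""
    candidate = np.asarray(candidate, dtype=np.float64)
    reference = np.asarray(reference, dtype=np.float64)
    diff = np.abs(candidate - reference)
    passed = bool(np.allclose(candidate, reference, rtol=rtol, atol=atol))
    return float(diff.mean()), float(diff.max()), passed


# Fixed random waveform: seed 42, 3 s at 48 kHz.
# The distribution is an assumption; the script may draw samples differently.
rng = np.random.default_rng(42)
waveform = rng.uniform(-1.0, 1.0, size=(1, 144_000)).astype(np.float32)

# Stand-in embeddings; in practice these would come from the ONNX session
# and the TF SavedModel, both evaluated on `waveform`.
reference = rng.standard_normal((1, 1024)).astype(np.float32)
candidate = reference + np.float32(1e-5)

mean_diff, max_diff, passed = compare_embeddings(candidate, reference)
print(f"|diff| mean={mean_diff:e}  max={max_diff:e}  passed={passed}")
```

Using the same tolerance for both variants keeps the check simple, at the cost of not reflecting that `model_backbone.onnx` agrees with the reference far more tightly than `birdnet_backbone.onnx` in the log above.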

---

## How extraction works

The `_extract` function in `extract_backbone.py` runs a backward breadth-first search (BFS)
from the `model/GLOBAL_AVG_POOL/Mean_reduced_0` output (the global average pool). It collects
every node that contributes to that output and discards everything downstream (the
classification dense layer). It then rebuilds a minimal ONNX graph containing only the
retained nodes and their initializers, with the output tensor renamed to `embedding`.
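To make the traversal concrete, here is a small pure-Python sketch of a backward BFS over a toy graph. It is an illustration under stated assumptions, not the actual `_extract` implementation: the `Node` class is a hypothetical stand-in for an ONNX node, and the four-node graph and its tensor names are invented for the example.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Toy stand-in for an ONNX node: a name plus input/output tensor names."""
    name: str
    inputs: tuple
    outputs: tuple


def backward_bfs(nodes, target_tensor):
    """Return the nodes needed to compute target_tensor, in original order."""
    # Map each tensor name to the node that produces it.
    producer = {out: node for node in nodes for out in node.outputs}
    needed = set()
    seen = {target_tensor}
    queue = deque([target_tensor])
    while queue:
        tensor = queue.popleft()
        node = producer.get(tensor)
        if node is None:
            # Graph input or initializer: nothing upstream to visit.
            continue
        needed.add(node.name)
        for inp in node.inputs:
            if inp not in seen:
                seen.add(inp)
                queue.append(inp)
    # Filtering the original list preserves its (topological) ordering.
    return [n for n in nodes if n.name in needed]


# Hypothetical graph: frontend -> conv -> pool -> dense head.
graph = [
    Node("frontend", ("INPUT",), ("spec",)),
    Node("conv", ("spec", "conv_w"), ("features",)),
    Node("pool", ("features",), ("pooled",)),
    Node("dense", ("pooled", "dense_w"), ("logits",)),
]
kept = backward_bfs(graph, "pooled")
print([n.name for n in kept])  # ['frontend', 'conv', 'pool']
```

The `dense` node is never reached because nothing upstream of `pooled` depends on it, which mirrors how the classification head drops out of the extracted backbone.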

---

## Credits

- Original ONNX conversion: [justinchuby/BirdNET-onnx](https://huggingface.co/justinchuby/BirdNET-onnx)
- Original BirdNET model: [BirdNET Team](https://birdnet.cornell.edu/)