+---
+license: cc-by-nc-4.0
+tags:
+- audio
+- bird
+- nature
+- bioacoustics
+- embeddings
+- onnx
+- backbone
+pipeline_tag: feature-extraction
+base_model: justinchuby/BirdNET-onnx
+---
+# BirdNET v2.4 ONNX Backbone
+Backbone-only ONNX exports of the [BirdNET v2.4](https://huggingface.co/justinchuby/BirdNET-onnx) bird sound classifier.
+The classification head has been removed, leaving only the audio frontend and the feature-extraction layers.
+Two variants are provided, matching the originals from [justinchuby/BirdNET-onnx](https://huggingface.co/justinchuby/BirdNET-onnx/tree/main): `model_backbone.onnx` and `birdnet_backbone.onnx`. Both models output a single tensor named **`embedding`** with shape `(1, 1024)`.
+Embeddings are numerically verified against the reference TF SavedModel published on Zenodo
+([BirdNET_v2.4_protobuf](https://zenodo.org/records/15050749)).
+---
+## Quick start
+```python
+import numpy as np
+import onnxruntime as ort
+from huggingface_hub import hf_hub_download
+# Download backbone
+path = hf_hub_download(
+    repo_id="biodiversica/BirdNET-onnx-backbone",
+    filename="model_backbone.onnx",
+)
+sess = ort.InferenceSession(path)
+# 3 s of audio at 48 kHz (144000 samples); zeros stand in for a real recording
+audio = np.zeros((1, 144000), dtype=np.float32)
+(embedding,) = sess.run(["embedding"], {"INPUT": audio})
+print(embedding.shape)  # (1, 1024)
+```
+For `birdnet_backbone.onnx` the input key is `"input"` (lowercase). This snippet reuses the imports and the `audio` array from the example above:
+```python
+path = hf_hub_download(
+    repo_id="biodiversica/BirdNET-onnx-backbone",
+    filename="birdnet_backbone.onnx",
+)
+sess = ort.InferenceSession(path)
+(embedding,) = sess.run(["embedding"], {"input": audio})
+print(embedding.shape)  # (1, 1024)
+```
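Both examples above feed exactly one 3 s window of 144000 samples. For longer recordings, one possible approach is to cut the waveform into consecutive windows first. The sketch below is only an illustration: the helper name `split_into_windows`, the non-overlapping layout, and the zero-padding of the last window are assumptions, not something this repository prescribes.

```python
import numpy as np

SAMPLE_RATE = 48_000       # Hz, as in the quick start
WINDOW = 3 * SAMPLE_RATE   # 144000 samples per model input


def split_into_windows(waveform):
    """Split a 1-D waveform into non-overlapping 3 s windows.

    The last window is zero-padded (an assumption made for this sketch).
    """
    waveform = np.asarray(waveform, dtype=np.float32).ravel()
    # Ceiling division, with at least one window even for empty input.
    n_windows = max(1, -(-len(waveform) // WINDOW))
    padded = np.zeros(n_windows * WINDOW, dtype=np.float32)
    padded[: len(waveform)] = waveform
    return padded.reshape(n_windows, WINDOW)


# 10 s of silence as a placeholder recording
windows = split_into_windows(np.zeros(10 * SAMPLE_RATE, dtype=np.float32))
print(windows.shape)  # (4, 144000)
```

Each row can then be passed to the session one at a time as `window[None, :]`, which restores the `(1, 144000)` input shape used in the quick start.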
+---
+## Extraction procedure
+The extraction and testing procedure can be reproduced using `extract_backbone.py`. The script will:
+1. Download `model.onnx` and `birdnet.onnx` from [justinchuby/BirdNET-onnx](https://huggingface.co/justinchuby/BirdNET-onnx).
+2. Download the BirdNET v2.4 TF SavedModel from Zenodo ([BirdNET_v2.4_protobuf](https://zenodo.org/records/15050749)).
+3. Extract the backbone subgraph (everything up to and including the `model/GLOBAL_AVG_POOL/Mean_reduced_0` node), renaming the output to `embedding`.
+4. Save `model_backbone.onnx` and `birdnet_backbone.onnx`.
+5. Run a numerical comparison between ONNX and TF SavedModel embeddings on a fixed random waveform (seed 42, 3 s at 48 kHz).
+Expected output:
+```
+=== Downloading models ===
+Downloaded model.onnx -> ...
+Downloaded birdnet.onnx -> ...
+Downloading BirdNET protobuf from Zenodo...
+Extracted audio-model -> ...
+=== Extracting backbones ===
+Backbone saved -> model_backbone.onnx
+  inputs : ['INPUT']
+  outputs: ['embedding']
+Backbone saved -> birdnet_backbone.onnx
+  inputs : ['input']
+  outputs: ['embedding']
+=== Comparing embeddings against Zenodo TF SavedModel ===
+PB embedding shape: (1, 1024)
+model_backend.onnx:
+  ONNX embedding shape: (1, 1024)
+  |diff| mean=1.230468e-06  max=9.298325e-06
+  Embeddings match PB reference with rtol=1e-03, atol=1e-03  PASSED
+birdnet_backend.onnx:
+  ONNX embedding shape: (1, 1024)
+  |diff| mean=6.440870e-05  max=5.004406e-04
+  Embeddings match PB reference with rtol=1e-03, atol=1e-03  PASSED
+```
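The `|diff|` statistics and the pass/fail line in the log above correspond to a standard tolerance check. The following is a minimal sketch of such a check, assuming it behaves like `numpy.allclose`; the function name `compare_embeddings` and the stand-in arrays are hypothetical and are not taken from `extract_backbone.py`.

```python
import numpy as np


def compare_embeddings(reference, candidate, rtol=1e-3, atol=1e-3):
    """Report mean/max absolute difference and an allclose-style verdict."""
    reference = np.asarray(reference, dtype=np.float32)
    candidate = np.asarray(candidate, dtype=np.float32)
    diff = np.abs(reference - candidate)
    return {
        "mean": float(diff.mean()),
        "max": float(diff.max()),
        # Passes where |candidate - reference| <= atol + rtol * |reference|
        "passed": bool(np.allclose(candidate, reference, rtol=rtol, atol=atol)),
    }


# Stand-in data: random "reference" embeddings plus a small perturbation.
# Real usage would compare the TF SavedModel output with the ONNX output.
rng = np.random.default_rng(42)
reference = rng.standard_normal((1, 1024)).astype(np.float32)
candidate = reference + np.float32(1e-5)

report = compare_embeddings(reference, candidate)
print(report["passed"])
```

With `rtol=1e-03` and `atol=1e-03`, the maximum differences shown in the log (about `9.3e-06` and `5.0e-04`) both fall below the absolute tolerance alone, which is consistent with the two `PASSED` lines.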
+---
+## How extraction works
+The `_extract` function in `extract_backbone.py` performs a backward breadth-first search
+(BFS) from the `model/GLOBAL_AVG_POOL/Mean_reduced_0` output node (the global average
+pool), collecting every node that contributes to that output and discarding everything
+downstream (the classification dense layer). The function then renames the output tensor
+to `embedding` and rebuilds a minimal ONNX graph containing only the retained nodes and
+their initializers.
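The traversal described above can be illustrated on a toy graph. The sketch below is not the actual `_extract` implementation: it uses a hypothetical `Node` dataclass instead of ONNX protobuf objects, and the node and tensor names are invented for illustration. It shows only the backward BFS idea: start at the target tensor, follow producers upstream, and keep whatever is reached.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Simplified stand-in for an ONNX node (hypothetical, for illustration)."""
    name: str
    inputs: tuple
    outputs: tuple


def collect_backbone(nodes, target):
    """Return the nodes that contribute to `target`, in their original order."""
    # Map each tensor name to the node that produces it.
    producer = {out: node for node in nodes for out in node.outputs}
    keep = set()
    seen = {target}
    queue = deque([target])
    while queue:
        tensor = queue.popleft()
        node = producer.get(tensor)
        if node is None:
            # No producer: a graph input or an initializer (weight).
            continue
        keep.add(node.name)
        for inp in node.inputs:
            if inp not in seen:
                seen.add(inp)
                queue.append(inp)
    # Filtering the original list preserves its topological order.
    return [n for n in nodes if n.name in keep]


toy_graph = [
    Node("frontend", ("INPUT",), ("spec",)),
    Node("conv", ("spec", "conv_w"), ("features",)),
    Node("pool", ("features",), ("pooled",)),
    Node("dense", ("pooled", "dense_w"), ("logits",)),
]
kept = collect_backbone(toy_graph, "pooled")
print([n.name for n in kept])  # ['frontend', 'conv', 'pool']
```

The `dense` node is never visited because it lies downstream of the target, which mirrors how the classification layer is dropped from the backbone.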
+---
+## Credits
+- Original ONNX conversion: [justinchuby/BirdNET-onnx](https://huggingface.co/justinchuby/BirdNET-onnx)
+- Reference protobuf: [BirdNET_v2.4_protobuf on Zenodo](https://zenodo.org/records/15050749)