Perch v2 ONNX Backbone
Backbone-only ONNX exports of the Perch v2 bird vocalization classifier. The classification head has been removed, leaving only the frontend and the feature-extraction backbone.
Two variants are provided, matching the originals from justinchuby/Perch-onnx: perch_v2_backbone.onnx and perch_v2_no_dft_backbone.onnx. Both models output a single tensor named embedding with shape (1, 1536).
Embeddings are numerically verified against the reference TF SavedModel published on Kaggle (google/bird-vocalization-classifier).
Quick start
import numpy as np
import onnxruntime as ort
from huggingface_hub import hf_hub_download
# Download backbone
path = hf_hub_download(
    repo_id="biodiversica/Perch-onnx-backbone",
    filename="perch_v2_backbone.onnx",
)
sess = ort.InferenceSession(path)
# 5 s of audio at 32 kHz
audio = np.zeros((1, 160000), dtype=np.float32)
(embedding,) = sess.run(["embedding"], {"inputs": audio})
print(embedding.shape) # (1, 1536)
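Recordings are rarely exactly 5 s long, so a small helper for cutting a longer mono waveform into model-sized windows may be useful. The sketch below is illustrative only: `split_into_windows` is a hypothetical helper, and the use of non-overlapping windows with zero-padding of the last one is an assumption, not something this model card prescribes.

```python
import numpy as np

SAMPLE_RATE = 32_000      # Hz, as in the quick start
WINDOW_SAMPLES = 160_000  # 5 s at 32 kHz


def split_into_windows(waveform):
    """Split a mono waveform into (n_windows, 1, WINDOW_SAMPLES) float32 windows.

    Assumption: windows do not overlap and the last one is zero-padded.
    """
    waveform = np.asarray(waveform, dtype=np.float32).ravel()
    # Ceiling division; always produce at least one window.
    n_windows = max(1, -(-len(waveform) // WINDOW_SAMPLES))
    padded = np.zeros(n_windows * WINDOW_SAMPLES, dtype=np.float32)
    padded[: len(waveform)] = waveform
    return padded.reshape(n_windows, 1, WINDOW_SAMPLES)


# 12 s of silence -> three windows, the last one partly padding
windows = split_into_windows(np.zeros(12 * SAMPLE_RATE, dtype=np.float32))
print(windows.shape)  # (3, 1, 160000)
```

Each `windows[i]` has shape (1, 160000) and can be passed as `{"inputs": windows[i]}` to `sess.run`, exactly as in the quick start.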
Extraction procedure
The extraction and testing procedure can be reproduced using extract_backbone.py. The script will:
- Download perch_v2.onnx and perch_v2_no_dft.onnx from justinchuby/Perch-onnx.
- Download the Perch v2 TF SavedModel from Kaggle (google/bird-vocalization-classifier).
- Extract the backbone subgraph (everything up to and including the embedding node).
- Save perch_v2_backbone.onnx and perch_v2_no_dft_backbone.onnx.
- Run a numerical comparison between ONNX and TF SavedModel embeddings on a fixed random waveform (seed 42, 5 s at 32 kHz).
Expected output:
=== Downloading models ===
Downloaded perch_v2.onnx -> ...
Downloaded perch_v2_no_dft.onnx -> ...
Downloaded Kaggle model -> ...
=== Extracting backbones ===
Backbone saved -> perch_v2_backbone.onnx
inputs : ['inputs']
outputs: ['embedding']
Backbone saved -> perch_v2_no_dft_backbone.onnx
inputs : ['inputs']
outputs: ['embedding']
=== Comparing embeddings against Kaggle TF SavedModel ===
PB embedding shape: (1, 1536)
perch_v2:
ONNX embedding shape: (1, 1536)
|diff| mean=<small value> max=<small value>
Embeddings match PB reference PASSED
perch_v2_no_dft:
ONNX embedding shape: (1, 1536)
|diff| mean=<small value> max=<small value>
Embeddings match PB reference PASSED
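The numerical comparison step of the procedure can be sketched roughly as follows. This is not the code from extract_backbone.py: the random generator, the waveform range, the `compare_embeddings` helper and the `atol` tolerance are all assumptions chosen for illustration, and the stand-in arrays take the place of real model outputs.

```python
import numpy as np

# Assumption: the real script may seed and scale its waveform differently.
rng = np.random.default_rng(42)
waveform = rng.uniform(-1.0, 1.0, size=(1, 160_000)).astype(np.float32)


def compare_embeddings(onnx_emb, ref_emb, atol=1e-4):
    """Return (mean, max, passed) for the element-wise absolute difference.

    `atol` is a hypothetical tolerance, not the one used by the script.
    """
    onnx_emb = np.asarray(onnx_emb, dtype=np.float64)
    ref_emb = np.asarray(ref_emb, dtype=np.float64)
    if onnx_emb.shape != ref_emb.shape:
        raise ValueError(f"shape mismatch: {onnx_emb.shape} vs {ref_emb.shape}")
    diff = np.abs(onnx_emb - ref_emb)
    return float(diff.mean()), float(diff.max()), bool(diff.max() <= atol)


# Stand-in arrays; in practice these would be the ONNX and TF SavedModel
# embeddings computed from `waveform`.
ref_emb = np.zeros((1, 1536), dtype=np.float32)
onnx_emb = ref_emb + 1e-6
mean_diff, max_diff, passed = compare_embeddings(onnx_emb, ref_emb)
print(f"|diff| mean={mean_diff:.2e} max={max_diff:.2e} passed={passed}")
```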
How extraction works
The _extract function in extract_backbone.py performs a backward breadth-first search (BFS) from the
embedding output node, collecting every node that contributes to that output and
discarding everything downstream (the classification head). It then rebuilds a minimal
ONNX graph containing only the retained nodes and their initializers.
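To make the traversal concrete, here is a minimal, standard-library-only sketch of a backward BFS over a toy graph. It is not the actual `_extract` implementation: nodes are modelled as plain dicts with `inputs` and `outputs` tensor-name lists (mirroring the `input` and `output` fields of ONNX nodes), and the function name, node names and tensor names are hypothetical.

```python
from collections import deque


def backward_reachable(nodes, target_output):
    """Return the indices of the nodes that contribute to `target_output`."""
    # Map each tensor name to the index of the node that produces it.
    producer = {
        name: i for i, node in enumerate(nodes) for name in node["outputs"]
    }

    keep = set()
    seen = {target_output}
    queue = deque([target_output])
    while queue:
        tensor = queue.popleft()
        idx = producer.get(tensor)
        if idx is None:
            # Graph input or initializer: no node produces it.
            continue
        keep.add(idx)
        for name in nodes[idx]["inputs"]:
            if name not in seen:
                seen.add(name)
                queue.append(name)
    # Sorting restores the original node order (ONNX stores nodes
    # in topological order).
    return sorted(keep)


# Toy graph: frontend -> encoder -> classification head
toy_graph = [
    {"name": "frontend", "inputs": ["inputs"], "outputs": ["spectrogram"]},
    {"name": "encoder", "inputs": ["spectrogram", "weights"], "outputs": ["embedding"]},
    {"name": "head", "inputs": ["embedding", "head_weights"], "outputs": ["logits"]},
]
kept = backward_reachable(toy_graph, "embedding")
print([toy_graph[i]["name"] for i in kept])  # ['frontend', 'encoder']
```

The head node is never visited because the search only follows edges from a tensor to its producer, so anything downstream of embedding is dropped. Initializers such as `weights` would then be kept only if a retained node lists them as an input.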
Credits
- Original ONNX conversion: justinchuby/Perch-onnx
- Original model: Google, bird-vocalization-classifier on Kaggle
- Perch paper: Google Research — Perch