# Perch V2 — Optimized TFLite Models for Raspberry Pi
Optimized variants of Google's Perch V2 bird vocalization classifier for edge deployment on Raspberry Pi and ARM64 devices.
Three model variants converted directly from the official Google SavedModel, each targeting a different performance/quality trade-off.
## Models
| Model | Size | Inference (RPi 5) | Embedding cosine | Top-1 agree | Top-5 agree | Best for |
|---|---|---|---|---|---|---|
| `perch_v2_original.tflite` | 409 MB | 435 ms | baseline | baseline | baseline | Reference / high-RAM devices |
| `perch_v2_fp16.tflite` | 205 MB | 384 ms | 0.9999 | 100% | 99% | RPi 5 (recommended) |
| `perch_v2_dynint8.tflite` | 105 MB | 299 ms | 0.9927 | 93% | 90% | RPi 4 / low-RAM devices |
Benchmarked on Raspberry Pi 5 Model B (8GB, Cortex-A76 @ 2.4GHz), 20 real bird recordings from 20 species, 5 runs each, 4 threads.
## Quick Start
### Choose your model
- **RPi 5 (4-8 GB):** use `perch_v2_fp16.tflite` for near-perfect accuracy at half the size of the original
- **RPi 4 (2-4 GB):** use `perch_v2_dynint8.tflite`, 4x smaller and 31% faster with very good accuracy
- **Desktop / reference:** use `perch_v2_original.tflite`, the exact Google baseline
### Usage
```python
# Works with ai-edge-litert, tflite-runtime, or tensorflow
from ai_edge_litert.interpreter import Interpreter
import numpy as np

model_path = "perch_v2_fp16.tflite"  # or dynint8, or original
interpreter = Interpreter(model_path=model_path, num_threads=4)
interpreter.allocate_tensors()

inp = interpreter.get_input_details()
out = interpreter.get_output_details()

# Input: 5 seconds of audio at 32 kHz
audio = np.zeros((1, 160000), dtype=np.float32)  # replace with real audio
interpreter.set_tensor(inp[0]["index"], audio)
interpreter.invoke()

# Get species logits (14,795 classes)
logits = interpreter.get_tensor(out[3]["index"])[0]
top_species = np.argsort(logits)[-5:][::-1]
```
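To turn the logits above into readable predictions, normalize them and map the top indices to class names. A minimal sketch, assuming `labels.txt` holds one class name per line in index order (dummy logits and labels stand in here for the real model output and file):

```python
import numpy as np

# Dummy logits stand in for the model output from the usage example above
rng = np.random.default_rng(0)
logits = rng.standard_normal(14795).astype(np.float32)

# Assumption: labels.txt has one class name per line, line i = class index i.
# A generated list stands in for the real file:
labels = [f"species_{i}" for i in range(14795)]
# labels = open("labels.txt").read().splitlines()

# Softmax over the logits gives normalized scores for ranking/reporting
probs = np.exp(logits - logits.max())
probs /= probs.sum()

top5 = np.argsort(probs)[-5:][::-1]
for idx in top5:
    print(f"{labels[idx]}: {probs[idx]:.4f}")
```

If you only need a ranking, sorting the raw logits (as in the usage example) gives the same order; the softmax is only needed when you want comparable confidence scores.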
### Download a single model
```python
from huggingface_hub import hf_hub_download

# Download only the model you need
model_path = hf_hub_download(
    "ernensbjorn/perch-v2-int8-tflite",
    "perch_v2_fp16.tflite",
)
```
## Model Details
### Architecture
- Backbone: EfficientNet-B3 (~12M params for embeddings)
- Classification head: 91M params (101.8M total)
- Input: 5.0 seconds @ 32,000 Hz = 160,000 float32 samples
- Outputs:
- Index 0: Spatial embeddings (16 x 4 x 1536)
- Index 1: Temporal features
- Index 2: 1536-dim global embedding
- Index 3: 14,795 species logits (use this for classification)
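Because the model expects exactly 160,000 samples, longer recordings must be split into 5-second windows before inference. A minimal framing sketch with non-overlapping windows and a zero-padded tail (`frame_audio` is a hypothetical helper, not part of this package):

```python
import math
import numpy as np

SAMPLE_RATE = 32_000
WINDOW = 5 * SAMPLE_RATE  # 160,000 samples = one model input

def frame_audio(audio: np.ndarray) -> np.ndarray:
    """Split a mono float32 recording into (n, 160000) non-overlapping
    windows, zero-padding the final window to full length."""
    n = max(1, math.ceil(len(audio) / WINDOW))
    padded = np.zeros(n * WINDOW, dtype=np.float32)
    padded[: len(audio)] = audio
    return padded.reshape(n, WINDOW)

# A 12-second recording yields three 5-second windows (last one padded)
chunks = frame_audio(np.zeros(12 * SAMPLE_RATE, dtype=np.float32))
print(chunks.shape)  # (3, 160000)
```

Each row of the result can be passed as `audio[np.newaxis, :]` to the interpreter from the usage example.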
### Species Coverage
10,340 bird species + frogs, insects, mammals (14,795 total classes).
Use the included `labels.txt` for class names and `bird_indices.json` to filter to bird-only classes.
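A minimal sketch of bird-only filtering, assuming `bird_indices.json` is a flat JSON list of class indices (a small inline list stands in for the real file):

```python
import json
import numpy as np

# Assumption: bird_indices.json is a flat JSON list of class indices.
# An inline example stands in for the real file:
bird_indices = np.array(json.loads("[0, 2, 3]"))
# bird_indices = np.array(json.load(open("bird_indices.json")))

logits = np.array([1.0, 9.0, 2.0, 0.5, 7.0], dtype=np.float32)

# Restrict the argmax to bird classes only: indices 1 (9.0) and 4 (7.0)
# are non-bird in this toy setup, so the best bird class is index 2.
best_bird = bird_indices[np.argmax(logits[bird_indices])]
print(best_bird)  # 2
```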
### Quantization Methods
| Variant | Method | What's quantized | File size reduction |
|---|---|---|---|
| original | None (float32 baseline) | Nothing | 1x |
| fp16 | TFLite float16 quantization | Weights stored as float16, dequantized at runtime | 2x smaller |
| dynint8 | TFLite dynamic range quantization | Weights quantized to int8, activations remain float32 | 4x smaller |
All variants were converted directly from the official Google Perch V2 SavedModel using `tf.lite.TFLiteConverter` with the appropriate optimization flags. No binary patching or post-hoc manipulation.
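A configuration sketch of the converter flags implied above, based on the standard `tf.lite.TFLiteConverter` post-training quantization API; `saved_model_dir` is a placeholder path, and the exact flags used to produce these files are an assumption:

```python
import tensorflow as tf

# Placeholder path to the official Perch V2 SavedModel
converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")

# fp16 variant: weights stored as float16, dequantized at runtime
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]

# dynint8 variant: drop the float16 target and keep only the
# DEFAULT optimization flag for dynamic-range int8 weights:
# converter.optimizations = [tf.lite.Optimize.DEFAULT]

tflite_model = converter.convert()
```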
## Detailed Benchmarks
### Raspberry Pi 5 (8 GB, Cortex-A76 @ 2.4 GHz, 4 threads)
| Model | Size | p50 latency | p95 latency | Embedding cosine (mean) | Embedding cosine (min) | Top-1 | Top-5 |
|---|---|---|---|---|---|---|---|
| original | 409 MB | 435 ms | 534 ms | baseline | baseline | baseline | baseline |
| fp16 | 205 MB | 384 ms | 477 ms | 0.999994 | 0.999991 | 100% | 99% |
| dynint8 | 105 MB | 299 ms | 405 ms | 0.992748 | 0.972732 | 93% | 90% |
- Embedding cosine: Cosine similarity of the 1536-dim embedding vector vs the float32 baseline. Values > 0.99 indicate negligible quality loss for downstream tasks.
- Top-1/Top-5 agreement: How often the quantized model's top-1 prediction matches the original's top-1, and how often it falls within the original's top-5.
- Test data: 20 real field recordings from 20 species (European Robin, Eurasian Curlew, Redwing, Eurasian Teal, Water Rail, etc.)
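The two quality metrics above can be reproduced with a few lines of NumPy. A sketch, using the definitions given in the bullet list (`cosine` and `top1_in_topk` are illustrative helpers, exercised here on toy data):

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def top1_in_topk(quant_logits: np.ndarray, ref_logits: np.ndarray, k: int) -> float:
    """Fraction of clips where the quantized model's top-1 class
    appears in the reference model's top-k classes."""
    hits = 0
    for q, r in zip(quant_logits, ref_logits):
        hits += int(np.argmax(q) in np.argsort(r)[-k:])
    return hits / len(quant_logits)

# Sanity check: identical outputs give perfect agreement and similarity
rng = np.random.default_rng(0)
ref = rng.standard_normal((4, 10))
print(top1_in_topk(ref, ref, 1), cosine(ref[0], ref[0]))
```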
### Raspberry Pi 4 Estimates
The RPi 4 (Cortex-A72 @ 1.8 GHz) is roughly 2-3x slower than the RPi 5. Expected latencies:
| Model | Estimated p50 | RAM needed |
|---|---|---|
| original | ~1000-1300 ms | ~500 MB |
| fp16 | ~900-1150 ms | ~300 MB |
| dynint8 | ~700-900 ms | ~150 MB |
For RPi 4 with 2 GB RAM, dynint8 is strongly recommended.
## Origin
Converted from the official Google Perch V2 SavedModel (hosted by Google researcher cgeorgiaw on HuggingFace).
Created as part of the Birdash project — an open-source bird detection dashboard and engine for Raspberry Pi.
## License
Apache 2.0 (same as the original Perch V2 model by Google)
## Citation
If you use these models, please cite the original Perch V2 work:
```bibtex
@article{ghani2023global,
  title={Global birdsong embeddings enable superior transfer learning for bioacoustic classification},
  author={Ghani, Burooj and Denton, Tom and Kahl, Stefan and Klinck, Holger},
  journal={Scientific Reports},
  year={2023}
}
```