---
license: apache-2.0
tags:
- bioacoustics
- audio-classification
- birdclef
- onnx
- cpu-inference
---

# Kaggle BirdCLEF+ 2026: Inference-only submission

CPU-only inference pipeline for the [BirdCLEF+ 2026 competition](https://www.kaggle.com/competitions/birdclef-2026): 234 species, Pantanal (Brazilian wetland) soundscapes, 90-minute CPU runtime budget at scoring time.

This repository is a **mirror of the code-only project**: <https://github.com/SergheiBrinza/kaggle-hackathon-birdclef-2026>

## What this repo is (and is not)

This is **not a trained or fine-tuned model**. It is the inference glue code I wrote and configured around publicly released pre-trained artifacts.

- Nothing in this repository was trained or fine-tuned by me.
- The pre-trained artifacts (Google Perch 2.0, the chaneyma MoE bundle, the rishikeshjani ONNX export of Perch) are used **as released** and are **not redistributed here**.
- This matches my published declaration for the submission.
- See [THIRD_PARTY_LICENSES.md](THIRD_PARTY_LICENSES.md) for sources and licenses; download the artifacts yourself from the original locations.

## Method

The submission ensembles a frozen audio foundation model with a small mixture of experts:

- **Perch 2.0 (frozen teacher).** Google's bioacoustics foundation model, used via an ONNX export for CPU inference. Outputs per-class logits plus a 1536-dim embedding.
- **chaneyma MoE.** 4× ProtoSSM folds (selective state-space + class prototypes, consuming the Perch embedding) plus a Student CNN and a Student CRNN on log-mel features.
- **Site / hour prior.** Empirical priors over the 234 classes conditioned on recording site (`S\d+` parsed from filename) and hour-of-day (24 bins), fit from the BirdCLEF+ 2026 `train_soundscapes_labels.csv` only.
- **Temporal smoothing.** Per file, over adjacent 5-second windows (12 windows per 60-second file).
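
As a rough illustration of the site / hour prior, the sketch below parses a site token with the `S\d+` pattern and adds a scaled log-odds prior to per-class logits. It is not taken from the inference script: the function names, the `eps` clipping value, and the example filename are hypothetical, and the per-class prior probabilities are assumed to have been computed already for the file's site and hour.

```python
import math
import re
from typing import List, Optional, Sequence

SITE_PATTERN = re.compile(r"S\d+")  # site token pattern named in the Method list


def parse_site(filename: str) -> Optional[str]:
    """Return the first `S<digits>` token in the filename, or None if absent."""
    match = SITE_PATTERN.search(filename)
    return match.group(0) if match else None


def log_odds(p: float, eps: float = 1e-6) -> float:
    """Logit of a probability, clipped away from 0 and 1 (eps is an assumption)."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def add_prior(
    logits: Sequence[float],
    prior_probs: Sequence[float],
    prior_scale: float,
) -> List[float]:
    """Add the scaled log-odds prior to the per-class logits of one window."""
    return [z + prior_scale * log_odds(p) for z, p in zip(logits, prior_probs)]


# Hypothetical filename; the real naming scheme may differ.
site = parse_site("BC2026_S07_20250301_053000.ogg")
adjusted = add_prior([1.0, -2.0], [0.5, 0.9], prior_scale=0.20)
```

A prior probability of 0.5 has zero log-odds and leaves the logit untouched, so classes without a strong site / hour signal are effectively passed through.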

### My contributions

1. **ONNX patch of Perch.** The upstream chaneyma inference script (CC0) loaded Perch through `tf.saved_model.load`, which requires TensorFlow and was too heavy for the Kaggle CPU budget. I replaced the TF code path with `onnxruntime` plus a CC0 ONNX export of Perch (rishikeshjani), removing the TensorFlow dependency entirely.
2. **Blend-weight search** over the `(perch, student_cnn, student_crnn)` weights used to mix the pre-sigmoid logits.
3. **Prior-scale sweep** over the multiplier applied to the site/hour log-odds prior before it is added to the Perch logits.
4. **Per-file temporal smoothing**, averaging each 5-second window with its immediate neighbours.
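
The blending and smoothing steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the script's actual code: the function names are hypothetical, the default weights are the shipped values from the experiments table (0.80 / 0.13 / 0.07 for the blend, 0.8 / 0.1 / 0.1 for smoothing), and the edge handling (a missing neighbour is replaced by the window itself) is a guess.

```python
from typing import List, Sequence, Tuple


def blend_logits(
    perch: Sequence[float],
    cnn: Sequence[float],
    crnn: Sequence[float],
    weights: Tuple[float, float, float] = (0.80, 0.13, 0.07),
) -> List[float]:
    """Weighted sum of pre-sigmoid logits from the three models, per class."""
    w_perch, w_cnn, w_crnn = weights
    return [
        w_perch * p + w_cnn * c + w_crnn * r
        for p, c, r in zip(perch, cnn, crnn)
    ]


def smooth_windows(
    scores: Sequence[float],
    center: float = 0.8,
    side: float = 0.1,
) -> List[float]:
    """Mix each window's score with its immediate neighbours within one file.

    Assumption: at the first and last window the missing neighbour is
    replaced by the window itself, so the weights always sum to 1.
    """
    n = len(scores)
    smoothed = []
    for i in range(n):
        left = scores[i - 1] if i > 0 else scores[i]
        right = scores[i + 1] if i < n - 1 else scores[i]
        smoothed.append(center * scores[i] + side * left + side * right)
    return smoothed


# One class, three consecutive 5-second windows of a single file.
blended = blend_logits([2.0], [0.0], [-1.0])
smoothed = smooth_windows([1.0, 0.0, 0.0])
```

`smooth_windows` works on the scores of one class in one file, so it is never applied across file boundaries.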

> Inference script originally by chaneyma (CC0), patched for ONNX/CPU inference by Serghei Brinza.

## Experiments (public leaderboard)

Metric: macro-averaged ROC-AUC (the competition metric, ranking-based).

| Variant | Blend (P / CNN / CRNN) | `--prior-scale` | Postprocessing | Public LB |
|---|---|---|---|---|
| Prior sweep | 0.80 / 0.13 / 0.07 | 0.60 | smoothing 0.8 / 0.1 / 0.1 | n/a |
| Prior sweep | 0.80 / 0.13 / 0.07 | 0.40 | smoothing 0.8 / 0.1 / 0.1 | n/a |
| Prior sweep | 0.80 / 0.13 / 0.07 | 0.20 | smoothing 0.8 / 0.1 / 0.1 | **0.914** |
| Prior sweep | 0.80 / 0.13 / 0.07 | 0.10 | smoothing 0.8 / 0.1 / 0.1 | **0.914** |
| Power transform on probs | 0.80 / 0.13 / 0.07 | 0.20 | + power transform | no change |
| Alt smoothing | 0.80 / 0.13 / 0.07 | 0.20 | smoothing 0.7 / 0.15 / 0.15 | 0.913 |

Best final public-LB score: **0.914**.

A few honest caveats:

- The power transform on probabilities had no effect: macro ROC-AUC depends only on how scores are ranked within each class, and a monotonic transform leaves that ranking unchanged.
- The 0.7 / 0.15 / 0.15 temporal-smoothing variant (0.913) was a **negative result**: slightly worse than the shipped 0.8 / 0.1 / 0.1 version. Logged because honest experiment logs include the runs that did not help.
- The intermediate prior-scale rows (0.60, 0.40) are marked **n/a** rather than filled in with guessed numbers.
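
The power-transform caveat can be checked on toy data. The snippet below is a self-contained illustration, not part of the submission: it computes ROC-AUC for one class by pairwise comparison and shows that raising the probabilities to a power leaves the value unchanged. The labels, probabilities, and the exponent `gamma` are made up for the example.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC-AUC as the share of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Quadratic in the number of samples,
    which is fine for a toy check.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


labels = [0, 0, 1, 1, 0, 1]
probs = [0.10, 0.40, 0.35, 0.80, 0.20, 0.70]

gamma = 2.0  # hypothetical exponent; any positive value is monotonic on [0, 1]
powered = [p ** gamma for p in probs]

auc_raw = roc_auc(labels, probs)
auc_powered = roc_auc(labels, powered)
print(auc_raw == auc_powered)
```

Because the same argument holds for every class separately, the macro average over classes is unchanged as well.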

## Required external artifacts (not redistributed)

You must download these yourself; this repo does not host any third-party weights.

| Component | Source | License |
|---|---|---|
| Google Perch 2.0 (frozen teacher) | <https://huggingface.co/cgeorgiaw/Perch> | Apache License 2.0 |
| Perch ONNX export for BirdCLEF+ 2026 | <https://www.kaggle.com/datasets/rishikeshjani/perch-onnx-for-birdclef-2026> | CC0 1.0 |
| chaneyma MoE artifacts (4× ProtoSSM folds + StudentCNN + StudentCRNN) | <https://www.kaggle.com/datasets/chaneyma/birdclef-2026-cv9245-moe-artifacts> | CC0 1.0 |

These artifacts are used **as released**: no fine-tuning, distillation, or re-training was performed in this submission.

## Runtime

- **CPU only.** Perch runs on CPU via `onnxruntime`; ProtoSSM, Student CNN and Student CRNN run on CPU via PyTorch. There is no GPU code path.
- **Kaggle 90-minute CPU budget.** Designed around the BirdCLEF+ 2026 scoring environment.
- **Audio:** 32 kHz, 60-second `.ogg` test files, processed as 12 non-overlapping 5-second windows per file.
- **Classes:** 234, taken from the competition `sample_submission.csv`.
- **Region:** Brazilian Pantanal soundscapes.
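
The window arithmetic above works out to 32,000 × 5 = 160,000 samples per window and 12 windows per 60-second file. A minimal sketch of the split, assuming the audio is already decoded into a flat sequence of samples and that any trailing partial window is dropped (both the helper name and that drop rule are assumptions, not the script's documented behaviour):

```python
from typing import List, Sequence

SAMPLE_RATE = 32_000   # Hz, from the runtime notes
WINDOW_SECONDS = 5     # window length, from the runtime notes


def split_windows(
    samples: Sequence[float],
    sample_rate: int = SAMPLE_RATE,
    window_seconds: int = WINDOW_SECONDS,
) -> List[Sequence[float]]:
    """Cut a decoded recording into non-overlapping fixed-length windows.

    Assumption: a trailing partial window is dropped rather than padded.
    """
    size = sample_rate * window_seconds
    n_windows = len(samples) // size
    return [samples[i * size:(i + 1) * size] for i in range(n_windows)]


# Silent stand-in for a decoded 60-second file.
audio = [0.0] * (60 * SAMPLE_RATE)
windows = split_windows(audio)
print(len(windows), len(windows[0]))
```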

## How to run

```bash
python src/infer_moe_onnx.py \
  --blend-perch 0.80 \
  --blend-cnn 0.13 \
  --blend-crnn 0.07 \
  --prior-scale 0.20 \
  --out submission.csv
```

The script exposes 15 CLI flags in total (paths to artifact directories, fold weight prefix, legacy single-student fallback, proto model dim, etc.). Run `python src/infer_moe_onnx.py --help` for the full list. Only the five flags shown above were varied during experiments; the rest stayed at their defaults.

You will need to download the pre-trained artifacts yourself from the sources listed above and arrange them under the paths the defaults expect, or override via the path flags.

## Considered but not pursued

A short design-space review. These directions are listed because I read about them while preparing this submission; **I did not run any of them in this work**.

- Audio foundation models beyond Perch (BirdMAE, NatureLM-audio). Potentially stronger embeddings, but unclear whether they fit the 90-minute CPU budget without distillation work I did not want to claim.
- Semi-supervised distillation from a larger teacher into a smaller CPU student. Out of scope here (no training in this repo).
- SED (sound-event-detection) heads with frame-wise localisation rather than per-window logits. Would change the I/O contract; not pursued.

## References

- van Merrienboer, B. et al. (2025). *Perch 2.0*. arXiv:2508.04665.
- Sydorskyi, V. & Goncalves, F. (2025). *BirdCLEF+ 2025: 2nd-place CLEF Working Note*. CEUR Workshop Proceedings, Vol. 4038. (Related-work reference; not used as code or data here.)

## Licensing

- **My code in this repository:** Apache License 2.0, see [LICENSE](LICENSE).
- **Pre-trained artifacts and the upstream inference script:** see [THIRD_PARTY_LICENSES.md](THIRD_PARTY_LICENSES.md). They are not redistributed here.

## Author

Serghei Brinza (`sergheibrinza` on GitHub).

---

<sub>Unofficial, independent submission. Not affiliated with or endorsed by Kaggle, Google, or the BirdCLEF organizers.</sub>