---
license: apache-2.0
tags:
- bioacoustics
- audio-classification
- birdclef
- onnx
- cpu-inference
---

# Kaggle BirdCLEF+ 2026 — Inference-only submission

CPU-only inference pipeline for the [BirdCLEF+ 2026 competition](https://www.kaggle.com/competitions/birdclef-2026): 234 species, Pantanal (Brazilian wetland) soundscapes, 90-minute CPU runtime budget at scoring time.

This repository is a **mirror of the code-only project**: <https://github.com/SergheiBrinza/kaggle-hackathon-birdclef-2026>

## What this repo is (and is not)

This is **not a trained or fine-tuned model**. It is the inference glue code I wrote and configured around publicly released pre-trained artifacts.

- Nothing in this repository was trained or fine-tuned by me.
- The pre-trained artifacts (Google Perch 2.0, the chaneyma MoE bundle, the rishikeshjani ONNX export of Perch) are used **as released** and are **not redistributed here**.
- This matches my published declaration for the submission.
- See [THIRD_PARTY_LICENSES.md](THIRD_PARTY_LICENSES.md) for sources and licenses; download the artifacts yourself from the original locations.

## Method

The submission ensembles a frozen audio foundation model with a small mixture of experts:

- **Perch 2.0 (frozen teacher).** Google's bioacoustics foundation model, used via an ONNX export for CPU inference. Outputs per-class logits plus a 1536-dim embedding.
- **chaneyma MoE.** 4× ProtoSSM folds (selective state-space + class prototypes, consuming the Perch embedding) plus a Student CNN and a Student CRNN on log-mel features.
- **Site / hour prior.** Empirical priors over the 234 classes conditioned on recording site (`S\d+` parsed from filename) and hour-of-day (24 bins), fit from the BirdCLEF+ 2026 `train_soundscapes_labels.csv` only.
- **Temporal smoothing.** Applied per file across adjacent 5-second windows (12 windows per 60-second file).
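
The site / hour prior described above can be sketched roughly as follows. This is a minimal illustration, not the code in `src/infer_moe_onnx.py`: the function names, the additive smoothing constant `alpha`, and the choice of grouping key are all assumptions. Only the `S\d+` site pattern and the idea of an empirical per-class prior come from the description above. Hour parsing is left out because the filename layout is not specified here.

```python
import re
from typing import Dict, Hashable, Optional, Sequence

import numpy as np

SITE_RE = re.compile(r"S\d+")  # site token pattern quoted in the text


def parse_site(filename: str) -> Optional[str]:
    """Return the first `S<digits>` token found in the filename, else None."""
    match = SITE_RE.search(filename)
    return match.group(0) if match else None


def fit_log_odds_prior(
    labels: np.ndarray,
    groups: Sequence[Hashable],
    alpha: float = 1.0,
) -> Dict[Hashable, np.ndarray]:
    """Per-group empirical class frequencies, returned as log-odds.

    labels: (n_rows, n_classes) binary matrix.
    groups: one key per row, e.g. a site, an hour bin, or a (site, hour) tuple.
    alpha:  additive smoothing constant (an assumption, not from the source).
    """
    prior = {}
    for key in set(groups):
        mask = np.array([g == key for g in groups])
        positives = labels[mask].sum(axis=0)
        # Smoothed frequency keeps p strictly inside (0, 1), so log-odds stay finite.
        p = (positives + alpha) / (mask.sum() + 2.0 * alpha)
        prior[key] = np.log(p) - np.log1p(-p)
    return prior
```

Working in log-odds lets the prior be added directly to pre-sigmoid logits, which is how the prior scale is described later in this README.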

### My contributions

1. **ONNX patch of Perch.** The upstream chaneyma inference script (CC0) loaded Perch through `tf.saved_model.load`, which requires TensorFlow and was too heavy for the Kaggle CPU budget. I replaced the TF code path with `onnxruntime` plus a CC0 ONNX export of Perch (rishikeshjani), removing the TensorFlow dependency entirely.
2. **Blend-weight search** over `(perch, student_cnn, student_crnn)` mixing weights of the pre-sigmoid logits.
3. **Prior-scale sweep** over the multiplier applied to the site/hour log-odds prior before it is added to the Perch logits.
4. **Per-file temporal smoothing**, a weighted average of each 5-second window with its immediate neighbours (0.8 / 0.1 / 0.1 in the shipped version).
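
As a rough sketch of how contributions 2 to 4 fit together, the NumPy snippet below adds a scaled prior to the Perch logits, blends three logit streams, and smooths across windows. Several details are assumptions rather than facts about the shipped script: the function names, the edge handling (boundary windows reuse their own value for the missing neighbour), reading 0.8 as the centre weight, and applying the smoothing to blended logits. The ProtoSSM folds are omitted.

```python
import numpy as np


def add_prior(perch_logits, prior_log_odds, prior_scale=0.20):
    """Add the scaled site/hour log-odds prior to the Perch logits."""
    return perch_logits + prior_scale * prior_log_odds


def blend_logits(perch, cnn, crnn, weights=(0.80, 0.13, 0.07)):
    """Weighted sum of pre-sigmoid logits from the three streams."""
    w_perch, w_cnn, w_crnn = weights
    return w_perch * perch + w_cnn * cnn + w_crnn * crnn


def smooth_windows(scores, weights=(0.1, 0.8, 0.1)):
    """Weighted average of each window with its previous and next neighbour.

    scores:  (n_windows, n_classes) array for a single file.
    weights: (previous, centre, next); edge padding is an assumption.
    """
    w_prev, w_centre, w_next = weights
    padded = np.pad(scores, ((1, 1), (0, 0)), mode="edge")
    return w_prev * padded[:-2] + w_centre * padded[1:-1] + w_next * padded[2:]


# Toy usage: one 60-second file, 12 windows, 234 classes, random logits.
rng = np.random.default_rng(0)
perch, cnn, crnn = (rng.normal(size=(12, 234)) for _ in range(3))
prior = rng.normal(size=234)  # stand-in for one site/hour log-odds vector

blended = blend_logits(add_prior(perch, prior), cnn, crnn)
smoothed = smooth_windows(blended)
probs = 1.0 / (1.0 + np.exp(-smoothed))  # sigmoid to per-class probabilities
```

Because the weights in each step sum to 1, a constant input passes through unchanged, which makes the functions easy to sanity-check.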

> Inference script originally by chaneyma (CC0), patched for ONNX/CPU inference by Serghei Brinza.

## Experiments (public leaderboard)

Metric: macro-averaged ROC-AUC (the competition metric, ranking-based).

| Variant | Blend (P / CNN / CRNN) | `--prior-scale` | Postprocessing | Public LB |
|---|---|---|---|---|
| Prior sweep | 0.80 / 0.13 / 0.07 | 0.60 | smoothing 0.8 / 0.1 / 0.1 | n/a |
| Prior sweep | 0.80 / 0.13 / 0.07 | 0.40 | smoothing 0.8 / 0.1 / 0.1 | n/a |
| Prior sweep | 0.80 / 0.13 / 0.07 | 0.20 | smoothing 0.8 / 0.1 / 0.1 | **0.914** |
| Prior sweep | 0.80 / 0.13 / 0.07 | 0.10 | smoothing 0.8 / 0.1 / 0.1 | **0.914** |
| Power transform on probs | 0.80 / 0.13 / 0.07 | 0.20 | + power transform | no change |
| Alt smoothing | 0.80 / 0.13 / 0.07 | 0.20 | smoothing 0.7 / 0.15 / 0.15 | 0.913 |

Best final public-LB score: **0.914**.

A few honest caveats:

- The power transform on probabilities had no effect. This is expected: ROC-AUC depends only on how scores are ranked within each class, and a monotonic transform leaves that ranking unchanged.
- The 0.7 / 0.15 / 0.15 temporal-smoothing variant (0.913) was a **negative result**: slightly worse than the shipped 0.8 / 0.1 / 0.1 version. Logged because honest experiment logs include the runs that did not help.
- Intermediate prior-scale rows (0.60, 0.40) are marked **n/a** rather than filled in with guessed numbers.
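
The power-transform caveat can be checked on synthetic data. The snippet below is a simplified stand-in for the competition metric: a pairwise ROC-AUC written in plain NumPy, macro-averaged over classes. The data, the exponent 0.5, and the helper names are illustrative assumptions.

```python
import numpy as np


def roc_auc(y_true, scores):
    """Pairwise ROC-AUC: fraction of (positive, negative) pairs ranked correctly."""
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()  # ties count as half
    return wins / (len(pos) * len(neg))


def macro_roc_auc(y_true, scores):
    """Unweighted mean of per-class ROC-AUC over the columns."""
    n_classes = y_true.shape[1]
    return float(np.mean([roc_auc(y_true[:, c], scores[:, c]) for c in range(n_classes)]))


# Synthetic probabilities and labels (not competition data).
rng = np.random.default_rng(0)
probs = rng.uniform(size=(200, 5))
y_true = (rng.uniform(size=(200, 5)) < probs).astype(int)

before = macro_roc_auc(y_true, probs)
after = macro_roc_auc(y_true, probs ** 0.5)  # monotonic power transform
print(before == after)
```

The two values agree because the score only counts which of each positive/negative pair is larger, and raising every probability to the same positive power does not change any of those comparisons.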

## Required external artifacts (not redistributed)

You must download these yourself; this repo does not host any third-party weights.

| Component | Source | License |
|---|---|---|
| Google Perch 2.0 (frozen teacher) | <https://huggingface.co/cgeorgiaw/Perch> | Apache License 2.0 |
| Perch ONNX export for BirdCLEF+ 2026 | <https://www.kaggle.com/datasets/rishikeshjani/perch-onnx-for-birdclef-2026> | CC0 1.0 |
| chaneyma MoE artifacts (4× ProtoSSM folds + StudentCNN + StudentCRNN) | <https://www.kaggle.com/datasets/chaneyma/birdclef-2026-cv9245-moe-artifacts> | CC0 1.0 |

These artifacts are used **as released**: no fine-tuning, distillation, or re-training was performed in this submission.

## Runtime

- **CPU only.** Perch runs on CPU via `onnxruntime`; ProtoSSM, Student CNN and Student CRNN run on CPU via PyTorch. There is no GPU code path.
- **Kaggle 90-minute CPU budget.** Designed around the BirdCLEF+ 2026 scoring environment.
- **Audio:** 32 kHz, 60-second `.ogg` test files, processed as 12 non-overlapping 5-second windows per file.
- **Classes:** 234, taken from the competition `sample_submission.csv`.
- **Region:** Brazilian Pantanal soundscapes.
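
The window arithmetic above works out to 32,000 × 5 = 160,000 samples per window and 1,920,000 samples per file. A minimal sketch of the split, assuming mono audio already decoded to a 1-D array; the zero-padding and truncation of files that are not exactly 60 seconds is an assumption, not something this README specifies:

```python
import numpy as np

SAMPLE_RATE = 32_000   # Hz
WINDOW_SECONDS = 5
N_WINDOWS = 12         # 60-second file / 5-second windows


def split_windows(audio: np.ndarray) -> np.ndarray:
    """Split a 1-D waveform into 12 non-overlapping 5-second windows.

    Shorter inputs are zero-padded and longer ones truncated (assumed behaviour).
    """
    window_len = SAMPLE_RATE * WINDOW_SECONDS   # 160_000 samples
    target_len = window_len * N_WINDOWS         # 1_920_000 samples
    if audio.shape[0] < target_len:
        audio = np.pad(audio, (0, target_len - audio.shape[0]))
    return audio[:target_len].reshape(N_WINDOWS, window_len)


windows = split_windows(np.zeros(60 * SAMPLE_RATE, dtype=np.float32))
print(windows.shape)  # (12, 160000)
```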

## How to run

```bash
python src/infer_moe_onnx.py \
  --blend-perch 0.80 \
  --blend-cnn   0.13 \
  --blend-crnn  0.07 \
  --prior-scale 0.20 \
  --out submission.csv
```

The script exposes 15 CLI flags in total (paths to artifact directories, fold weight prefix, legacy single-student fallback, proto model dim, etc.). Run `python src/infer_moe_onnx.py --help` for the full list. Only the five flags shown above were varied during experiments; the rest stayed at their defaults.

You will need to download the pre-trained artifacts yourself from the sources listed above and arrange them under the paths the defaults expect, or override via the path flags.

## Considered but not pursued

A short design-space review. These directions are listed because I read about them while preparing this submission; **I did not run any of them in this work**.

- Audio foundation models beyond Perch (BirdMAE, NatureLM-audio). Potentially stronger embeddings, but unclear whether they fit the 90-minute CPU budget without distillation work I did not want to claim.
- Semi-supervised distillation from a larger teacher into a smaller CPU student. Out of scope here (no training in this repo).
- SED (sound-event-detection) heads with frame-wise localisation rather than per-window logits. Would change the I/O contract; not pursued.

## References

- van Merrienboer, B. et al. (2025). *Perch 2.0*. arXiv:2508.04665.
- Sydorskyi, V. & Goncalves, F. (2025). *BirdCLEF+ 2025: 2nd-place CLEF Working Note*. CEUR Workshop Proceedings, Vol. 4038. (Related-work reference; not used as code or data here.)

## Licensing

- **My code in this repository:** Apache License 2.0, see [LICENSE](LICENSE).
- **Pre-trained artifacts and the upstream inference script:** see [THIRD_PARTY_LICENSES.md](THIRD_PARTY_LICENSES.md). They are not redistributed here.

## Author

Serghei Brinza (`sergheibrinza` on GitHub).

---

<sub>Unofficial, independent submission. Not affiliated with or endorsed by Kaggle, Google, or the BirdCLEF organizers.</sub>