	---
	license: apache-2.0
	tags:
	- bioacoustics
	- audio-classification
	- birdclef
	- onnx
	- cpu-inference
	---

	# Kaggle BirdCLEF+ 2026 — Inference-only submission

	CPU-only inference pipeline for the [BirdCLEF+ 2026 competition](https://www.kaggle.com/competitions/birdclef-2026): 234 species, Pantanal (Brazilian wetland) soundscapes, 90-minute CPU runtime budget at scoring time.

	This repository is a mirror of the code-only project: <https://github.com/SergheiBrinza/kaggle-hackathon-birdclef-2026>

	## What this repo is (and is not)

	This is not a trained or fine-tuned model. It is the inference glue code I wrote and configured around publicly released pre-trained artifacts.

	- Nothing in this repository was trained or fine-tuned by me.
	- The pre-trained artifacts (Google Perch 2.0, the chaneyma MoE bundle, the rishikeshjani ONNX export of Perch) are used as released and are not redistributed here.
	- This matches my published declaration for the submission.
	- See [THIRD_PARTY_LICENSES.md](THIRD_PARTY_LICENSES.md) for sources and licenses; download the artifacts yourself from the original locations.

	## Method

	The submission ensembles a frozen audio foundation model with a small mixture of experts:

	- Perch 2.0 (frozen teacher). Google's bioacoustics foundation model, used via an ONNX export for CPU inference. Outputs per-class logits plus a 1536-dim embedding.
	- chaneyma MoE. 4× ProtoSSM folds (selective state-space + class prototypes, consuming the Perch embedding) plus a Student CNN and a Student CRNN on log-mel features.
	- Site / hour prior. Empirical priors over the 234 classes conditioned on recording site (`S\d+` parsed from filename) and hour-of-day (24 bins), fit from the BirdCLEF+ 2026 `train_soundscapes_labels.csv` only.
	- Temporal smoothing. Applied per file across adjacent 5-second windows (12 windows per 60-second file).
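
As a rough illustration of the site / hour prior described above, the sketch below parses a site token and an hour from a filename and turns label counts into smoothed log-odds. Only the `S\d+` site pattern and the 24 hour bins come from the text. The `_HHMMSS.ogg` filename ending, the Laplace smoothing constant `alpha`, the input layout and all function names are assumptions made for this example, not the actual script or data format.

```python
import math
import re
from collections import defaultdict

SITE_RE = re.compile(r"S\d+")                  # site token, as described in the text
TIME_RE = re.compile(r"_(\d{2})\d{4}\.ogg$")   # ASSUMPTION: name ends in _HHMMSS.ogg


def parse_site_hour(filename):
    """Return (site, hour) parsed from a filename; either part may be None."""
    site = SITE_RE.search(filename)
    time = TIME_RE.search(filename)
    return (site.group(0) if site else None,
            int(time.group(1)) if time else None)


def fit_log_odds_prior(rows, classes, alpha=1.0):
    """Fit a per-(site, hour) log-odds prior from labelled windows.

    rows: iterable of (filename, set_of_present_classes), one per window.
    Returns {(site, hour): {class: log_odds}}.
    ASSUMPTION: Laplace smoothing with `alpha` keeps the log-odds finite.
    """
    totals = defaultdict(int)
    positives = defaultdict(lambda: defaultdict(int))
    for filename, present in rows:
        key = parse_site_hour(filename)
        totals[key] += 1
        for c in present:
            positives[key][c] += 1

    prior = {}
    for key, n in totals.items():
        prior[key] = {}
        for c in classes:
            p = (positives[key][c] + alpha) / (n + 2 * alpha)
            prior[key][c] = math.log(p / (1 - p))
    return prior
```

A class that never appears at a given site and hour gets a negative log-odds value instead of minus infinity, which is what makes a scaled additive prior usable on logits.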

	### My contributions

	1. ONNX patch of Perch. The upstream chaneyma inference script (CC0) loaded Perch through `tf.saved_model.load`, which requires TensorFlow and was too heavy for the Kaggle CPU budget. I replaced the TF code path with `onnxruntime` plus a CC0 ONNX export of Perch (rishikeshjani), removing the TensorFlow dependency entirely.
	2. Blend-weight search over the `(perch, student_cnn, student_crnn)` weights used to mix the pre-sigmoid logits.
	3. Prior-scale sweep over the multiplier applied to the site/hour log-odds prior before it is added to the Perch logits.
	4. Per-file temporal smoothing, averaging each 5-second window with its immediate neighbours.
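
The blend, prior and smoothing steps listed above can be sketched roughly as follows in plain Python. The default weights are the ones from the experiments table. Everything else is an assumption made for illustration: the function names, the order of operations, applying smoothing to whatever score vectors are passed in, and the handling of the first and last window (the missing neighbour is dropped and the remaining weights are renormalised). How the ProtoSSM folds enter the blend is not shown here.

```python
def blend_logits(perch, cnn, crnn, prior, w=(0.80, 0.13, 0.07), prior_scale=0.20):
    """Blend per-class logits for one 5-second window.

    The scaled site/hour log-odds prior is added to the Perch logits first,
    then the three logit vectors are mixed with weights `w`.
    """
    w_p, w_cnn, w_crnn = w
    return [
        w_p * (p + prior_scale * pr) + w_cnn * a + w_crnn * b
        for p, a, b, pr in zip(perch, cnn, crnn, prior)
    ]


def smooth_windows(scores, centre=0.8, side=0.1):
    """Mix each window with its immediate neighbours within one file.

    `scores` is a list of per-window score vectors (12 per 60-second file).
    ASSUMPTION: at the file edges the missing neighbour's weight is dropped
    and the remaining weights are renormalised to sum to 1.
    """
    n = len(scores)
    out = []
    for t in range(n):
        parts = [(centre, scores[t])]
        if t > 0:
            parts.append((side, scores[t - 1]))
        if t < n - 1:
            parts.append((side, scores[t + 1]))
        total = sum(wt for wt, _ in parts)
        out.append([
            sum(wt * vec[c] for wt, vec in parts) / total
            for c in range(len(scores[t]))
        ])
    return out
```

Smoothing is done per file so that a window never borrows evidence from a different recording.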

	> Inference script originally by chaneyma (CC0), patched for ONNX/CPU inference by Serghei Brinza.

	## Experiments (public leaderboard)

	Metric: macro-averaged ROC-AUC (the competition metric, ranking-based).

	| Variant | Blend (P / CNN / CRNN) | `--prior-scale` | Postprocessing | Public LB |
	|---|---|---|---|---|
	| Prior sweep | 0.80 / 0.13 / 0.07 | 0.60 | smoothing 0.8 / 0.1 / 0.1 | n/a |
	| Prior sweep | 0.80 / 0.13 / 0.07 | 0.40 | smoothing 0.8 / 0.1 / 0.1 | n/a |
	| Prior sweep | 0.80 / 0.13 / 0.07 | 0.20 | smoothing 0.8 / 0.1 / 0.1 | 0.914 |
	| Prior sweep | 0.80 / 0.13 / 0.07 | 0.10 | smoothing 0.8 / 0.1 / 0.1 | 0.914 |
	| Power transform on probs | 0.80 / 0.13 / 0.07 | 0.20 | + power transform | no change |
	| Alt smoothing | 0.80 / 0.13 / 0.07 | 0.20 | smoothing 0.7 / 0.15 / 0.15 | 0.913 |

	Best final public-LB score: 0.914.

	A few honest caveats:

	- The power transform on probabilities had no effect: it is monotonic, so it preserves the per-class ordering of scores, and the macro ROC-AUC metric depends only on that ordering.
	- The 0.7 / 0.15 / 0.15 temporal-smoothing variant (0.913) was a negative result: slightly worse than the shipped 0.8 / 0.1 / 0.1 version. Logged because honest experiment logs include the runs that did not help.
	- The 0.60 and 0.40 prior-scale rows are marked n/a rather than filled in with guessed numbers.
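
To make the first caveat concrete, here is a small self-contained check that a power transform leaves ROC-AUC unchanged. The AUC routine is a generic pairwise (Mann-Whitney style) implementation written for this example, and the toy labels, scores and exponent are made up. It is not the competition's scoring code.

```python
def roc_auc(labels, scores):
    """Pairwise ROC-AUC: share of (positive, negative) pairs ranked correctly.

    Ties count as half a win. O(P * N), which is fine for a toy check.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC is undefined without both classes")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def macro_roc_auc(label_cols, score_cols):
    """Unweighted mean of per-class AUCs (one column per class)."""
    aucs = [roc_auc(y, s) for y, s in zip(label_cols, score_cols)]
    return sum(aucs) / len(aucs)


# Toy data: two classes, five windows each (made-up numbers).
labels = [[1, 0, 1, 0, 0], [0, 0, 1, 1, 0]]
probs = [[0.9, 0.2, 0.4, 0.5, 0.1], [0.3, 0.6, 0.7, 0.2, 0.1]]

# p -> p ** 0.5 is strictly increasing on [0, 1], so per-class order is kept.
powered = [[p ** 0.5 for p in col] for col in probs]

before = macro_roc_auc(labels, probs)
after = macro_roc_auc(labels, powered)
assert before == after
```

The same argument does not cover the blend weights or the prior scale: those act before scores from different sources are combined, so they can change the ordering within a class.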

	## Required external artifacts (not redistributed)

	You must download these yourself; this repo does not host any third-party weights.

	| Component | Source | License |
	|---|---|---|
	| Google Perch 2.0 (frozen teacher) | <https://huggingface.co/cgeorgiaw/Perch> | Apache License 2.0 |
	| Perch ONNX export for BirdCLEF+ 2026 | <https://www.kaggle.com/datasets/rishikeshjani/perch-onnx-for-birdclef-2026> | CC0 1.0 |
	| chaneyma MoE artifacts (4× ProtoSSM folds + StudentCNN + StudentCRNN) | <https://www.kaggle.com/datasets/chaneyma/birdclef-2026-cv9245-moe-artifacts> | CC0 1.0 |

	These artifacts are used as released: no fine-tuning, distillation, or re-training was performed in this submission.

	## Runtime

	- CPU only. Perch runs on CPU via `onnxruntime`; ProtoSSM, Student CNN and Student CRNN run on CPU via PyTorch. There is no GPU code path.
	- Kaggle 90-minute CPU budget. Designed around the BirdCLEF+ 2026 scoring environment.
	- Audio: 32 kHz, 60-second `.ogg` test files, processed as 12 non-overlapping 5-second windows per file.
	- Classes: 234, taken from the competition `sample_submission.csv`.
	- Region: Brazilian Pantanal soundscapes.
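
The window arithmetic implied by these numbers is small enough to sketch: at 32 kHz a 5-second window is 160,000 samples, and a 60-second file yields 12 of them. The helper below is illustrative only. Its name and the choice to drop a trailing partial window are assumptions, not the script's documented behaviour.

```python
SAMPLE_RATE = 32_000                            # Hz, from the text
WINDOW_SECONDS = 5
WINDOW_SAMPLES = SAMPLE_RATE * WINDOW_SECONDS   # 160_000 samples per window


def window_bounds(num_samples):
    """Return (start, end) sample indices of each full, non-overlapping window.

    ASSUMPTION: a trailing partial window is dropped rather than padded.
    """
    n_windows = num_samples // WINDOW_SAMPLES
    return [(i * WINDOW_SAMPLES, (i + 1) * WINDOW_SAMPLES) for i in range(n_windows)]


bounds = window_bounds(60 * SAMPLE_RATE)   # one 60-second file
assert len(bounds) == 12
```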

	## How to run

	```bash
	python src/infer_moe_onnx.py \
	--blend-perch 0.80 \
	--blend-cnn 0.13 \
	--blend-crnn 0.07 \
	--prior-scale 0.20 \
	--out submission.csv
	```

	The script exposes 15 CLI flags in total (paths to artifact directories, fold weight prefix, legacy single-student fallback, proto model dim, etc.). Run `python src/infer_moe_onnx.py --help` for the full list. Only the five flags shown above were varied during experiments; the rest stayed at their defaults.

	You will need to download the pre-trained artifacts yourself from the sources listed above and arrange them under the paths the defaults expect, or override via the path flags.

	## Considered but not pursued

	A short design-space review. These directions are listed because I read about them while preparing this submission; I did not run any of them in this work.

	- Audio foundation models beyond Perch (BirdMAE, NatureLM-audio). Potentially stronger embeddings, but unclear whether they fit the 90-minute CPU budget without distillation work I did not want to claim.
	- Semi-supervised distillation from a larger teacher into a smaller CPU student. Out of scope here (no training in this repo).
	- SED (sound-event-detection) heads with frame-wise localisation rather than per-window logits. Would change the I/O contract; not pursued.

	## References

	- van Merrienboer, B. et al. (2025). Perch 2.0. arXiv:2508.04665.
	- Sydorskyi, V. & Goncalves, F. (2025). BirdCLEF+ 2025: 2nd-place CLEF Working Note. CEUR Workshop Proceedings, Vol. 4038. (Related-work reference; not used as code or data here.)

	## Licensing

	- My code in this repository: Apache License 2.0, see [LICENSE](LICENSE).
	- Pre-trained artifacts and the upstream inference script: see [THIRD_PARTY_LICENSES.md](THIRD_PARTY_LICENSES.md). They are not redistributed here.

	## Author

	Serghei Brinza (`sergheibrinza` on GitHub).

	---

	<sub>Unofficial, independent submission. Not affiliated with or endorsed by Kaggle, Google, or the BirdCLEF organizers.</sub>