---
library_name: finch
pipeline_tag: audio-classification
tags:
- model_hub_mixin
- pytorch_model_hub_mixin
- bioacoustics
- bird-species
license: apache-2.0
language:
- en
datasets:
- birdclef-2021
---

# FINCH: Adaptive Evidence Weighting for Audio-Spatiotemporal Fusion

This is the Stage B checkpoint for **FINCH**, a bioacoustic species identification framework that fuses audio classification with spatiotemporal priors from [eBird](https://ebird.org/) abundance data. A frozen [NatureLM-audio](https://huggingface.co/EarthSpeciesProject/NatureLM-audio) encoder is paired with a learned gating network that adaptively weights audio evidence against space-time context.

- **Paper:** [arXiv:2602.03817](https://arxiv.org/abs/2602.03817)
- **Code:** [github.com/leharris3/birdnoise](https://github.com/leharris3/birdnoise)

## Model description

The Stage B model computes fused logits as:

```
final_logits = audio_logits / T + w(a, x, t) * log(prior + eps)
```

where `w(a, x, t)` is a small gating MLP conditioned on audio confidence, prior confidence, location, and time of year. `T` and `eps` are learned scalars.

Only the trainable parameters (classifier head, gating network, temperature, epsilon) are stored here (~3 MB). The frozen NatureLM-audio encoder (~8B params) is downloaded separately from [EarthSpeciesProject/NatureLM-audio](https://huggingface.co/EarthSpeciesProject/NatureLM-audio) at load time.

## How to use

Requires Python 3.10+, [uv](https://docs.astral.sh/uv/), and a HuggingFace account with access to [Meta Llama 3.1 8B Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct) (required by NatureLM-audio).

```bash
# Clone the repository
git clone https://github.com/leharris3/birdnoise.git
cd birdnoise

# Install dependencies
uv sync
cd NatureLM-audio && uv pip install -r requirements.txt && cd ..
# Log in to HuggingFace
huggingface-cli login
```

```python
from Models import HFStageBModel

# Download the Stage B weights from the Hub
model = HFStageBModel.from_pretrained("leharris3/FINCH")
```

## Training details

- **Dataset:** BirdCLEF 2021 (184 species)
- **Encoder:** NatureLM-audio (frozen)
- **Stage A:** Linear probe + scalar fusion weight (30 epochs)
- **Stage B:** Gating network `w(a, x, t)` + learned `T`, `eps` (warm-started from Stage A)
- **Best val accuracy:** 82.6%

## Citation

```bibtex
@article{ovanger2026adaptive,
  title   = {Adaptive Evidence Weighting for Audio-Spatiotemporal Fusion},
  author  = {Oscar Ovanger and Levi Harris and Timothy H. Keitt},
  journal = {arXiv preprint arXiv:2602.03817},
  year    = {2026},
  url     = {https://arxiv.org/abs/2602.03817}
}
```
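For intuition, the fusion rule from the *Model description* section can be sketched in a few lines of NumPy. This is a minimal illustrative sketch, not the released implementation: `fuse_logits` is a hypothetical helper, the gating output `w` is passed in directly rather than computed by the MLP, and `T`/`eps` are fixed here instead of learned.

```python
import numpy as np

def fuse_logits(audio_logits, prior, w, T=1.5, eps=1e-6):
    """final_logits = audio_logits / T + w * log(prior + eps).

    audio_logits : per-species logits from the frozen audio encoder
    prior        : per-species eBird abundance prior (probabilities)
    w            : gating weight, i.e. the output of w(a, x, t)
    T, eps       : temperature and prior floor (learned scalars in
                   FINCH; fixed here for illustration)
    """
    return audio_logits / T + w * np.log(prior + eps)

# Toy example: two species, audio slightly favors species 0,
# but the spatiotemporal prior strongly favors species 1.
audio_logits = np.array([1.2, 1.0])
prior = np.array([0.01, 0.60])

print(np.argmax(fuse_logits(audio_logits, prior, w=0.0)))  # → 0 (audio only)
print(np.argmax(fuse_logits(audio_logits, prior, w=1.0)))  # → 1 (prior flips it)
```

With the gate closed (`w = 0`) the prediction follows the audio alone; as the gate opens, the log-prior term can overrule weak audio evidence, which is the adaptive weighting behavior the gating network learns.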