Model Card for esp-aves2-effnetb0-bio

Model Details

Model Description

esp-aves2-effnetb0-bio is a supervised bioacoustic encoder trained to produce transferable embeddings for downstream bioacoustic tasks. It follows the paper’s EfficientNet post-training recipe, trained on the curated, bioacoustics-only Bio data mix.

Developed by: Marius Miron, David Robinson, Milad Alizadeh, Ellen Gilsenan-McMahon, Gagan Narula, Emmanuel Chemla, Maddie Cusimano, Felix Effenberger, Masato Hagiwara, Benjamin Hoffman, Sara Keen, Diane Kim, Jane K. Lawton, Jen-Yu Liu, Aza Raskin, Olivier Pietquin, Matthieu Geist
Funded by: More info at https://www.earthspecies.org/about-us#support
Shared by: Earth Species Project
Model type: Audio representation learning model (CNN; EfficientNet-B0 backbone)
License: CC-BY-NC-SA
Finetuned from model: EfficientNet-B0 pretrained on ImageNet (see Parent Models)

Model Sources

Repository: https://github.com/earthspecies/avex
Paper: What Matters for Bioacoustic Encoding
Hugging Face Model: ESP-AVES2 Collection
Configuration: train_config.yaml

Parent Models

EfficientNet-B0 (ImageNet)
- Source: https://docs.pytorch.org/vision/main/models/generated/torchvision.models.efficientnet_b0.html
- Description: ImageNet-pretrained EfficientNet-B0 initialization used before supervised post-training.
- License: See upstream repository

Uses

Direct Use

esp-aves2-effnetb0-bio can be used directly as an embedding model for bioacoustic tasks such as species classification/detection, retrieval, clustering, individual ID, and repertoire analysis.
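
As one illustration of the retrieval use case, the sketch below ranks a small gallery of embeddings by cosine similarity to a query. It assumes embeddings have already been extracted and pooled to fixed-size vectors. The function names and the toy three-dimensional vectors are hypothetical and are not part of avex.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, gallery):
    """Return gallery indices sorted from most to least similar to the query."""
    scores = [cosine_similarity(query, g) for g in gallery]
    return sorted(range(len(gallery)), key=lambda i: scores[i], reverse=True)


# Toy embeddings standing in for pooled model outputs (hypothetical values).
gallery = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
]
query = [1.0, 0.05, 0.0]
ranking = rank_by_similarity(query, gallery)
```

Cosine similarity is only one possible choice of similarity measure; other distances may suit a given task better.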

Downstream Use

Use frozen embeddings with linear probes, or fine-tune on your target dataset. Suitable for deployment as a feature extractor.
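
To show what a linear probe on frozen embeddings amounts to, here is a minimal, library-free sketch: a single-label softmax classifier fitted with plain SGD while the embeddings stay fixed. It does not use avex, and the function names, learning rate, epoch count, and toy two-dimensional embeddings are all hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def train_linear_probe(embeddings, labels, num_classes, lr=0.5, epochs=200, seed=0):
    """Fit weights W and biases b of a linear classifier with plain SGD.

    The embeddings are treated as fixed inputs: only W and b are updated,
    which is what "frozen" means here.
    """
    rng = random.Random(seed)
    dim = len(embeddings[0])
    W = [[rng.uniform(-0.01, 0.01) for _ in range(dim)] for _ in range(num_classes)]
    b = [0.0] * num_classes
    for _ in range(epochs):
        for x, y in zip(embeddings, labels):
            logits = [sum(w * xi for w, xi in zip(W[c], x)) + b[c]
                      for c in range(num_classes)]
            probs = softmax(logits)
            for c in range(num_classes):
                # Gradient of the cross-entropy loss w.r.t. the logit of class c.
                grad = probs[c] - (1.0 if c == y else 0.0)
                for j in range(dim):
                    W[c][j] -= lr * grad * x[j]
                b[c] -= lr * grad
    return W, b


def predict(W, b, x):
    """Return the index of the highest-scoring class."""
    scores = [sum(w * xi for w, xi in zip(W[c], x)) + b[c] for c in range(len(W))]
    return scores.index(max(scores))


# Toy 2-D "embeddings" for two classes (hypothetical values).
train_x = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.8]]
train_y = [0, 0, 1, 1]
W, b = train_linear_probe(train_x, train_y, num_classes=2)
predictions = [predict(W, b, x) for x in train_x]
```

In practice the embeddings would come from the model and the probe would be trained with a deep-learning framework; the point here is only that the backbone receives no gradient updates.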

Out-of-Scope Use

Not a generative model; does not output text.

Bias, Risks, and Limitations

Bias: Because training uses bioacoustic data only, the model may reflect the biases of citizen-science and curated archives, for example in taxa, region, and recording conditions.
Risks: Potential misuse in sensitive wildlife contexts.
Limitations: Audio is standardized to 16 kHz in the paper, so the model may not capture higher-frequency information that is important for some taxa.
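
To make the 16 kHz limitation concrete: by the Nyquist sampling theorem, audio sampled at rate f_s can represent frequencies only up to f_s / 2. Assuming inputs are resampled to 16 kHz as in the paper, this gives:

```latex
f_{\max} = \frac{f_s}{2} = \frac{16\,\text{kHz}}{2} = 8\,\text{kHz}
```

Vocalization energy above 8 kHz is therefore not available to the model at this sample rate.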

How to Get Started with the Model

Loading this model requires the avex library (AVEX, Animal Vocalization Encoder).

Installation

pip install avex

Or with uv:

uv add avex

For more details, see https://github.com/earthspecies/avex.

Loading the Model

from avex import load_model

model = load_model("esp_aves2_effnetb0_bio", device="cuda")

Using the Model

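import torch
from avex import load_model

# audio_tensor below is your input audio batch as a torch.Tensor (not defined in this snippet).

# Case 1: embedding extraction (features only)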
backbone = load_model("esp_aves2_effnetb0_bio", device="cuda", return_features_only=True)

with torch.no_grad():
    embeddings = backbone(audio_tensor)
    # Shape: (batch, channels, height, width) for EfficientNet

# Pool to get fixed-size embedding
embedding = embeddings.mean(dim=(2, 3))  # Shape: (batch, channels)

# Case 2: supervised predictions (logits over label IDs; see label_map.json)
model = load_model("esp_aves2_effnetb0_bio", device="cuda")

with torch.no_grad():
    logits = model(audio_tensor)
    predicted_class = logits.argmax(dim=-1)  # Shape: (batch,); call .item() only for a single clip

Transfer Learning with Probes

from avex import load_model
from avex.configs import ProbeConfig
from avex.models.probes import build_probe_from_config

# Load backbone for feature extraction
base = load_model("esp_aves2_effnetb0_bio", return_features_only=True, device="cuda")

# Define a probe head for your task
probe_config = ProbeConfig(
    probe_type="linear",
    target_layers=["last_layer"],
    aggregation="mean",
    freeze_backbone=True,
    online_training=True,
)

probe = build_probe_from_config(
    probe_config=probe_config,
    base_model=base,
    num_classes=10,  # Your number of classes
    device="cuda",
)

Class Label Mapping

The class label mapping for this supervised learning model can be found at label_map.json in the Hugging Face repository.
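
Because the model is post-trained with a multi-label objective, one option besides taking the single top class is to apply a sigmoid to each logit and keep every label above a threshold. The sketch below assumes, hypothetically, that label_map.json maps label names to integer IDs. The actual file layout, the example label names, and the 0.5 threshold are assumptions to check against the repository.

```python
import json
import math


def sigmoid(x):
    """Logistic function, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def decode_multilabel(logits, id_to_label, threshold=0.5):
    """Return (label, probability) pairs whose probability meets the threshold."""
    results = []
    for class_id, logit in enumerate(logits):
        p = sigmoid(logit)
        if p >= threshold:
            results.append((id_to_label[class_id], p))
    # Most confident labels first.
    return sorted(results, key=lambda pair: pair[1], reverse=True)


# Hypothetical label map contents; in practice this would be read from
# label_map.json, whose real structure may differ.
label_map = json.loads('{"species_a": 0, "species_b": 1, "species_c": 2}')
id_to_label = {idx: name for name, idx in label_map.items()}

logits = [2.0, -1.5, 0.3]  # toy values for a single clip
predicted = decode_multilabel(logits, id_to_label)
```

A fixed threshold is a simplification; per-class thresholds tuned on validation data are a common alternative.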

Training Details

Training Data

esp-aves2-effnetb0-bio follows the paper’s supervised post-training recipe on Bio only, starting from ImageNet-pretrained EfficientNet-B0.

Training Data Sources

Dataset	Description	Source	License	Size
Xeno-canto	birds	Link	CC (varies)	10416 hours
iNaturalist	diverse taxa	Link	CC (varies)	1539 hours
Watkins	marine mammals	Link	licensing agreement (paper)	27 hours
Animal Sound Archive	diverse taxa	Link	See archive terms	78 hours

Training Procedure

Initialization: EfficientNet-B0 pretrained on ImageNet.
Supervised post-training: on Bio with a multi-label objective.
Augmentations: random additive noise (p=0.5, SNR in [-10, 20] dB); mixup-style within-batch mixing (p=0.5) with union of labels.
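
The two augmentations can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the training code: the uniform SNR sampling, the fixed 50/50 mixing weight, the order in which the two steps are applied, and all function names are hypothetical. The paper and train_config.yaml remain the reference.

```python
import math
import random


def power(signal):
    """Mean squared amplitude of a waveform."""
    return sum(s * s for s in signal) / len(signal)


def add_noise_at_snr(signal, noise, snr_db):
    """Scale `noise` so the signal-to-noise ratio equals `snr_db`, then add it."""
    p_signal, p_noise = power(signal), power(noise)
    if p_signal == 0.0 or p_noise == 0.0:
        return list(signal)  # nothing meaningful to scale
    # SNR_dB = 10 * log10(p_signal / (scale^2 * p_noise))  =>  solve for scale.
    scale = math.sqrt(p_signal / (p_noise * 10 ** (snr_db / 10)))
    return [s + scale * n for s, n in zip(signal, noise)]


def mix_with_label_union(wave_a, labels_a, wave_b, labels_b, weight=0.5):
    """Mix two waveforms and take the union of their multi-hot labels."""
    mixed = [weight * a + (1.0 - weight) * b for a, b in zip(wave_a, wave_b)]
    union = [max(la, lb) for la, lb in zip(labels_a, labels_b)]
    return mixed, union


def augment(wave, labels, noise, partner_wave, partner_labels, rng):
    """Apply each augmentation independently with probability 0.5."""
    if rng.random() < 0.5:
        wave = add_noise_at_snr(wave, noise, rng.uniform(-10.0, 20.0))
    if rng.random() < 0.5:
        # The partner clip is assumed to come from the same batch.
        wave, labels = mix_with_label_union(wave, labels, partner_wave, partner_labels)
    return wave, labels


rng = random.Random(0)
clip = [math.sin(0.1 * t) for t in range(1000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(1000)]
noisy = add_noise_at_snr(clip, noise, snr_db=10.0)
residual = [n - c for n, c in zip(noisy, clip)]
achieved_snr_db = 10 * math.log10(power(clip) / power(residual))
_, union = mix_with_label_union(clip, [1, 0, 0], noise, [0, 0, 1])
```

Taking the element-wise maximum of two multi-hot vectors is what "union of labels" means here: a mixed clip carries every label present in either source clip.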

Training Hyperparameters

Training hyperparameters are specified in train_config.yaml.

Evaluation

Testing Data, Factors & Metrics

Testing Data

The paper evaluates on:

BEANS (classification and detection): https://github.com/earthspecies/beans
BirdSet (detection): https://huggingface.co/datasets/DBD-research-group/BirdSet
Individual ID: Pipit, Chiffchaff, Little Owl, Macaques
Vocal Repertoire: Zebra Finch, Giant Otters, Bengalese Finch, Killer Whale

Metrics

Linear probing: accuracy / mAP
Retrieval: ROC AUC
Clustering: NMI
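
For readers unfamiliar with two of these metrics, the sketch below gives generic, library-free definitions of ROC AUC and (mean) average precision. It is not the paper's evaluation code: how retrieval queries and positives are constructed is not shown here, and the function names and toy inputs are hypothetical.

```python
def roc_auc(labels, scores):
    """ROC AUC as the probability that a positive outscores a negative.

    Ties count as half. `labels` are 0/1, `scores` are real-valued.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("ROC AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def average_precision(labels, scores):
    """Average of the precision values at the rank of each positive item."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, precisions = 0, []
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            precisions.append(hits / rank)
    if not precisions:
        raise ValueError("average precision needs at least one positive")
    return sum(precisions) / len(precisions)


def mean_average_precision(label_columns, score_columns):
    """Mean of per-class average precision (one column per class)."""
    aps = [average_precision(y, s) for y, s in zip(label_columns, score_columns)]
    return sum(aps) / len(aps)


labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
auc = roc_auc(labels, scores)
ap = average_precision(labels, scores)
```

The pairwise ROC AUC above is quadratic in the number of items and is meant for clarity; rank-based implementations are preferable for large evaluation sets.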

Results

Aggregate results for esp-aves2-effnetb0-bio with a frozen base model (linear probing, retrieval, and clustering):

Benchmark	Task	Metric	Score
BEANS Classification	Probe	Accuracy	0.786
BEANS Classification	Retrieval	ROC AUC	0.799
BEANS Classification	Clustering	NMI	0.563
BEANS Detection	Probe	mAP	0.365
BEANS Detection	Retrieval	ROC AUC	0.695
BirdSet	Probe	mAP	0.279
BirdSet	Retrieval	ROC AUC	0.704
Individual ID	Probe	Accuracy	0.457
Individual ID	Retrieval	ROC AUC	0.683
Vocal Repertoire	Retrieval	ROC AUC	0.806
Vocal Repertoire	Clustering	NMI	0.568

Citation

BibTeX:

@inproceedings{miron2025matters,
  title={What Matters for Bioacoustic Encoding},
  author={Miron, Marius and Robinson, David and Alizadeh, Milad and Gilsenan-McMahon, Ellen and Narula, Gagan and Chemla, Emmanuel and Cusimano, Maddie and Effenberger, Felix and Hagiwara, Masato and Hoffman, Benjamin and Keen, Sara and Kim, Diane and Lawton, Jane K. and Liu, Jen-Yu and Raskin, Aza and Pietquin, Olivier and Geist, Matthieu},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026}
}

Model Card Contact

Contact: marius@earthspecies.org, david@earthspecies.org, milad@earthspecies.org, gagan@earthspecies.org

Paper: What Matters for Bioacoustic Encoding (arXiv:2508.11845, published Aug 15, 2025)