Model Card for esp-aves2-eat-bio

Model Details

Model Description

esp-aves2-eat-bio is a self-supervised audio representation learning model (bioacoustic encoder) based on the EAT (Efficient Audio Transformer) architecture, trained with self-supervised learning on a curated bioacoustic corpus (Bio), as described in What Matters for Bioacoustic Encoding.

  • Developed by: Marius Miron, David Robinson, Milad Alizadeh, Ellen Gilsenan-McMahon, Gagan Narula, Emmanuel Chemla, Maddie Cusimano, Felix Effenberger, Masato Hagiwara, Benjamin Hoffman, Sara Keen, Diane Kim, Jane K. Lawton, Jen-Yu Liu, Aza Raskin, Olivier Pietquin, Matthieu Geist
  • Funded by: More info at https://www.earthspecies.org/about-us#support
  • Shared by: Earth Species Project
  • Model type: Transformer; EAT backbone (self-supervised)
  • License: CC-BY-NC-SA
  • Finetuned from model: N/A (self-supervised pretraining checkpoint)

Model Sources

Parent Models

  1. EAT (Efficient Audio Transformer)
    • Source: http://github.com/cwx-worst-one/EAT
    • Description: Open-source EAT implementation used as the reference architecture/training codebase.
    • License: See upstream repository

Uses

Direct Use

esp-aves2-eat-bio can be used as an embedding model for downstream bioacoustic tasks (via probes or finetuning), including species classification/detection, retrieval and clustering, and as a bioacoustics-focused SSL baseline.

Downstream Use

Use frozen embeddings with linear probes, or fine-tune on target datasets (taxa-, habitat-, or device-specific).

Out-of-Scope Use

esp-aves2-eat-bio is an encoder, not a generative model: it outputs embeddings, not text or audio.

Bias, Risks, and Limitations

  • Bias: The bioacoustic training mix may over-represent certain taxa (e.g., birds) and geographic regions due to public recording availability; performance may vary on under-represented taxa.
  • Risks: Potential misuse for sensitive wildlife monitoring without safeguards.
  • Limitations: The paper standardizes evaluations at 16 kHz; higher-frequency information may be important for some taxa.
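
Because the paper standardizes evaluation at 16 kHz, recordings at other sample rates should be resampled before embedding. A minimal sketch using SciPy's polyphase resampler (the 16 kHz target comes from the paper; the helper name `to_16k` is ours, not part of any library):

```python
from math import gcd

import numpy as np
from scipy.signal import resample_poly

def to_16k(waveform: np.ndarray, sr: int, target_sr: int = 16_000) -> np.ndarray:
    """Resample a 1-D waveform to target_sr via polyphase filtering."""
    if sr == target_sr:
        return waveform
    g = gcd(sr, target_sr)
    # Upsample by target_sr // g, then downsample by sr // g
    return resample_poly(waveform, target_sr // g, sr // g)

# Example: one second at 44.1 kHz becomes 16,000 samples
audio_44k = np.random.randn(44_100)
audio_16k = to_16k(audio_44k, 44_100)
```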

How to Get Started with the Model

Loading this model requires the AVEX (Animal Vocalization Encoder) library to be installed.

Installation

pip install avex

Or with uv:

uv add avex

For more details, see https://github.com/earthspecies/avex.

Loading the Model

from avex import load_model

model = load_model("esp_aves2_eat_bio", device="cuda")

Embedding Extraction

import torch
from avex import load_model

model = load_model("esp_aves2_eat_bio", device="cuda")

# audio_tensor: input waveform batch, resampled to 16 kHz (see Limitations)
with torch.no_grad():
    embeddings = model(audio_tensor)
    # Shape: (batch, time_steps, 768) for EAT

# Pool over time to get a fixed-size embedding
embedding = embeddings.mean(dim=1)  # Shape: (batch, 768)

Transfer Learning with Probes

from avex.models.probes import build_probe_from_config
from avex.configs import ProbeConfig

# Load backbone for feature extraction
base = load_model("esp_aves2_eat_bio", device="cuda")

# Define a probe head for your task
probe_config = ProbeConfig(
    probe_type="linear",
    target_layers=["last_layer"],
    aggregation="mean",
    freeze_backbone=True,
    online_training=True,
)

probe = build_probe_from_config(
    probe_config=probe_config,
    base_model=base,
    num_classes=10,  # Your number of classes
    device="cuda",
)
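
This model card does not show avex's own training utilities, so the computation a linear probe performs can be sketched framework-agnostically: a softmax classifier trained by full-batch gradient descent on frozen, mean-pooled 768-d embeddings. Everything below (data, names, hyperparameters) is a stand-in for illustration, not avex API:

```python
import numpy as np

rng = np.random.default_rng(0)
num_classes, dim, n = 10, 768, 256

# Stand-ins for frozen, mean-pooled backbone embeddings and integer labels
X = rng.standard_normal((n, dim))
y = rng.integers(0, num_classes, size=n)

W = np.zeros((dim, num_classes))  # linear probe weights
b = np.zeros(num_classes)

def cross_entropy(X, y, W, b):
    logits = X @ W + b
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(y)), y].mean()

lr = 0.1
for _ in range(200):  # full-batch gradient descent
    logits = X @ W + b
    logits = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    probs[np.arange(n), y] -= 1.0  # d(loss)/d(logits) for softmax + CE
    W -= lr * (X.T @ probs) / n
    b -= lr * probs.mean(axis=0)

final_loss = cross_entropy(X, y, W, b)  # below the chance level, ln(10)
```

In practice the same loop runs in PyTorch with the probe head from build_probe_from_config and the backbone kept frozen (freeze_backbone=True above).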

Training Details

Training Data

Self-supervised pretraining on a curated bioacoustic mix (Bio). Labels are ignored during SSL.

Training Data Sources

| Dataset | Description | Source | License | Size |
|---|---|---|---|---|
| Xeno-canto | birds | Link | CC (varies) | 10416 hours |
| iNaturalist | diverse taxa | Link | CC (varies) | 1539 hours |
| Watkins | marine mammals | Link | licensing agreement (paper) | 27 hours |
| Animal Sound Archive | diverse taxa | Link | See archive terms | 78 hours |

Training Procedure

As described in the paper, EAT uses a self-supervised objective combining teacher distillation with reconstruction of masked spectrogram patches.
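
Schematically, the masked-prediction part of such an objective compares student predictions against teacher features at masked patch positions only. The sketch below is illustrative NumPy with hypothetical shapes and a plain MSE term; it is not the actual EAT implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
num_patches, dim = 64, 768  # spectrogram split into patch tokens

# Stand-ins: teacher features (e.g. from an EMA teacher) and student predictions
teacher_targets = rng.standard_normal((num_patches, dim))
student_preds = teacher_targets + 0.1 * rng.standard_normal((num_patches, dim))

# Mask a random subset of patches; the student must predict those from the rest
mask = rng.random(num_patches) < 0.8

# Reconstruction-style term: error between student and teacher at masked patches
loss = float(np.mean((student_preds[mask] - teacher_targets[mask]) ** 2))
```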

Training Hyperparameters

Training hyperparameters are specified in train_config.yaml.

Evaluation

Testing Data, Factors & Metrics

Testing Data

The paper evaluates on:

  • BEANS (classification and detection): https://github.com/earthspecies/beans
  • BirdSet (detection): https://huggingface.co/datasets/DBD-research-group/BirdSet
  • Individual ID: Pipit, Chiffchaff, Little Owl, Macaques
  • Vocal Repertoire: Zebra Finch, Giant Otters, Bengalese Finch, Killer Whale

Metrics

  • Linear probing: accuracy / mAP
  • Retrieval: ROC AUC
  • Clustering: NMI
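
The retrieval metric can be computed directly from ranked similarity scores. A minimal rank-based ROC AUC (the Mann-Whitney formulation, ignoring ties), independent of any particular library:

```python
import numpy as np

def roc_auc(scores: np.ndarray, labels: np.ndarray) -> float:
    """ROC AUC via ranks: the probability a positive outscores a negative."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)  # 1-based ranks, ties ignored
    n_pos = int(labels.sum())
    n_neg = len(labels) - n_pos
    # Mann-Whitney U statistic, normalized to [0, 1]
    u = ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2
    return float(u / (n_pos * n_neg))

# Perfect separation: both positives outrank both negatives -> 1.0
auc = roc_auc(np.array([0.9, 0.8, 0.3, 0.2]), np.array([1, 1, 0, 0]))
```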

Results

Aggregate results with the frozen esp-aves2-eat-bio backbone (the EAT Bio SSL checkpoint), as reported in the paper:

| Benchmark | Task | Metric | Score |
|---|---|---|---|
| BEANS Classification | Probe | Accuracy | 0.692 |
| BEANS Classification | Retrieval | ROC AUC | 0.671 |
| BEANS Classification | Clustering | NMI | 0.410 |
| BEANS Detection | Probe | mAP | 0.311 |
| BEANS Detection | Retrieval | ROC AUC | 0.679 |
| BirdSet | Probe | mAP | 0.143 |
| BirdSet | Retrieval | ROC AUC | 0.631 |
| Individual ID | Probe | Accuracy | 0.378 |
| Individual ID | Retrieval | ROC AUC | 0.627 |
| Vocal Repertoire | Retrieval | ROC AUC | 0.757 |
| Vocal Repertoire | Clustering | NMI | 0.466 |

Citation

BibTeX:

@inproceedings{miron2025matters,
  title={What Matters for Bioacoustic Encoding},
  author={Miron, Marius and Robinson, David and Alizadeh, Milad and Gilsenan-McMahon, Ellen and Narula, Gagan and Chemla, Emmanuel and Cusimano, Maddie and Effenberger, Felix and Hagiwara, Masato and Hoffman, Benjamin and Keen, Sara and Kim, Diane and Lawton, Jane K. and Liu, Jen-Yu and Raskin, Aza and Pietquin, Olivier and Geist, Matthieu},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026}
}

Model Card Contact

Contact: marius@earthspecies.org, david@earthspecies.org, milad@earthspecies.org, gagan@earthspecies.org
