BirdCLEF 2025 Model
This repository hosts a PyTorch model trained for the BirdCLEF 2025 challenge, classifying bird species from audio spectrograms.
Model Overview
This model combines EfficientNetV2 with attention mechanisms to classify bird species from mel-spectrogram representations of audio recordings.
Architecture
- Backbone: EfficientNetV2-S (tf_efficientnetv2_s_in21k)
- Input: Single-channel mel-spectrograms (128 mel bins)
- Attention Module: Convolutional attention mechanism
- Output: Multi-class classification (number of classes depends on dataset)
```python
import timm
import torch
import torch.nn as nn


class BirdCLEFModel(torch.nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        self.backbone = timm.create_model(
            'tf_efficientnetv2_s_in21k',
            pretrained=False,
            in_chans=1,
            num_classes=num_classes
        )
        # Lightweight spatial attention over the input spectrogram
        self.attention = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=1),
            nn.ReLU(),
            nn.Conv2d(16, 1, kernel_size=1),
            nn.Sigmoid()
        )

    def forward(self, x):
        attn = self.attention(x)
        x = x * attn
        return self.backbone(x)
```
Audio Processing Pipeline
Parameters
- Sample Rate: 32,000 Hz
- Duration: 5-second segments
- Mel Bins: 128
- Frequency Range: 20 Hz - 16,000 Hz
- Hop Length: 512
- FFT Size: 2048
Preprocessing Steps
- Load audio at 32kHz sampling rate
- Segment into 5-second chunks
- Generate mel-spectrograms using librosa
- Convert to log-scale (dB)
- Normalize to [-1, 1] range
- Apply attention weighting
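The loading and segmentation steps above can be sketched as follows. This is a minimal illustration, not code from this repository; `segment_audio` is a hypothetical helper that zero-pads the final chunk to a full 5 seconds:

```python
import numpy as np

SR = 32_000                 # sample rate from the parameters above
CHUNK_LEN = SR * 5          # 5-second segments

def segment_audio(y: np.ndarray) -> list:
    """Split a waveform into fixed 5-second chunks, zero-padding the tail."""
    chunks = []
    for start in range(0, len(y), CHUNK_LEN):
        chunk = y[start:start + CHUNK_LEN]
        if len(chunk) < CHUNK_LEN:
            chunk = np.pad(chunk, (0, CHUNK_LEN - len(chunk)))
        chunks.append(chunk)
    return chunks
```

Each chunk then goes through the mel-spectrogram, log-scaling, and normalization steps independently.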
Usage
Installation
```bash
pip install torch torchvision torchaudio
pip install librosa numpy pandas tqdm
pip install timm
```
Basic Usage
```python
import torch
import librosa
import numpy as np
from model import BirdCLEFModel

# Load model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = BirdCLEFModel(num_classes=your_num_classes)
model.load_state_dict(torch.load("path_to_model.pth", map_location=device))
model.to(device)
model.eval()

# Process audio
def create_mel_spectrogram(audio_path):
    y, sr = librosa.load(audio_path, sr=32000, mono=True, duration=5)
    S = librosa.feature.melspectrogram(
        y=y, sr=sr, n_mels=128, fmin=20, fmax=16000,
        hop_length=512, n_fft=2048
    )
    S = librosa.power_to_db(S, ref=np.max)
    S = (S - S.min()) / (S.max() - S.min()) * 2 - 1  # normalize to [-1, 1]
    return torch.tensor(S).unsqueeze(0).unsqueeze(0).float()

# Inference
spec = create_mel_spectrogram("bird_audio.wav").to(device)
with torch.no_grad():
    logits = model(spec)
    probabilities = torch.softmax(logits, dim=1)
```
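To turn the probability tensor into readable predictions, you can take the top-k classes and map them back to species labels. A self-contained sketch with made-up logits standing in for real model output (`class_names` is a hypothetical list that must match the training label order):

```python
import torch

# Hypothetical example: 10 classes with random logits in place of model output
class_names = [f"species_{i}" for i in range(10)]
logits = torch.randn(1, 10)
probabilities = torch.softmax(logits, dim=1)

top_probs, top_idx = probabilities.topk(3, dim=1)
predictions = [(class_names[i], float(p)) for p, i in zip(top_probs[0], top_idx[0])]
print(predictions)
```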
Batch Processing
```python
def process_audio_batch(audio_files, model, device):
    specs = [create_mel_spectrogram(f) for f in audio_files]
    # Each spectrogram already has a leading batch dimension of 1, so
    # concatenate along dim 0; torch.stack would add a spurious extra axis.
    batch = torch.cat(specs, dim=0).to(device)
    with torch.no_grad():
        logits = model(batch)
        probabilities = torch.softmax(logits, dim=1)
    return probabilities.cpu().numpy()
```
Model Features
Attention Mechanism
- Lightweight convolutional attention module
- Helps focus on relevant frequency-time regions
- Improves model interpretability and performance
Optimizations
- Mixed precision training support (AMP)
- Efficient batch processing
- Memory-optimized inference pipeline
- Fast mel-spectrogram computation
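The AMP support mentioned above can also be used at inference time. A minimal sketch (not the repository's actual pipeline; requires torch>=1.10 for `torch.autocast`, and falls back to full precision on CPU):

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def predict_amp(model, spec):
    """Run softmax inference under autocast when a GPU is available."""
    model.eval()
    with torch.no_grad(), torch.autocast(device_type=device.type,
                                         enabled=device.type == "cuda"):
        return torch.softmax(model(spec), dim=1)
```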
Training Details
Data Processing
- Audio files segmented into 5-second chunks
- Random cropping during training for data augmentation
- Mel-spectrogram normalization per sample
- Optional SpecAugment for time/frequency masking
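The random-cropping augmentation above can be sketched like this; `random_crop` is an illustrative helper, not code from this repository:

```python
import numpy as np

SR = 32_000
CROP_LEN = SR * 5  # 5-second training crop

def random_crop(y: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Take a random 5-second window from a waveform (zero-pad if shorter)."""
    if len(y) <= CROP_LEN:
        return np.pad(y, (0, CROP_LEN - len(y)))
    start = rng.integers(0, len(y) - CROP_LEN)
    return y[start:start + CROP_LEN]
```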
Augmentation
```python
import torchaudio.transforms as T


class SpecAugment:
    def __init__(self):
        self.time_mask = T.TimeMasking(time_mask_param=20)
        self.freq_mask = T.FrequencyMasking(freq_mask_param=10)

    def __call__(self, spec):
        spec = self.time_mask(spec)
        spec = self.freq_mask(spec)
        return spec
```
Performance
Competition Results
- Developed for BirdCLEF 2025 competition
- Optimized for real-time inference
- Efficient processing of long audio recordings
Inference Speed
- Batch processing capability
- GPU acceleration support
- Optimized for competition time constraints
Model Variants
This architecture can be adapted for different scenarios:
- Lightweight: Use EfficientNet-B0 for faster inference
- High Accuracy: Use EfficientNet-B4 or larger models
- Multi-scale: Process multiple segment lengths
- Ensemble: Combine multiple model predictions
File Structure
```
birdclef-model/
├── model.py           # Model architecture
├── inference.py       # Inference pipeline
├── preprocessing.py   # Audio processing utilities
├── config.py          # Configuration parameters
├── requirements.txt   # Dependencies
└── README.md          # This file
```
Requirements
```
torch>=1.9.0
torchaudio>=0.9.0
librosa>=0.8.1
numpy>=1.21.0
pandas>=1.3.0
timm>=0.6.0
tqdm>=4.62.0
```
Citation
If you use this model in your research, please cite:
```bibtex
@misc{birdclef2025model,
  title={BirdCLEF 2025: Efficient Bird Species Classification with Attention-Enhanced EfficientNetV2},
  author={Your Name},
  year={2025},
  howpublished={Kaggle Competition}
}
```
License
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgments
- BirdCLEF 2025 competition organizers
- timm library for pre-trained models
- librosa for audio processing utilities
- PyTorch team for the deep learning framework
Contact
For questions or issues, please open an issue on GitHub or contact javokhirraimov1@gmail.com.
Note: This model was specifically designed for the BirdCLEF 2025 competition format. For other bird classification tasks, you may need to adjust the audio processing parameters and retrain the model.