BirdCLEF 2025: Bird Species Classification Model

This repository hosts a PyTorch model for automated bird species identification from audio recordings, developed for the BirdCLEF 2025 competition.

Model Overview

This model combines EfficientNetV2 with attention mechanisms to classify bird species from mel-spectrogram representations of audio recordings.

Architecture

  • Backbone: EfficientNetV2-S (tf_efficientnetv2_s_in21k)
  • Input: Single-channel mel-spectrograms (128 mel bins)
  • Attention Module: Convolutional attention mechanism
  • Output: Multi-class classification (number of classes depends on dataset)
import torch
import torch.nn as nn
import timm

class BirdCLEFModel(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        self.backbone = timm.create_model(
            'tf_efficientnetv2_s_in21k',
            pretrained=False,
            in_chans=1,          # single-channel mel-spectrogram input
            num_classes=num_classes
        )
        # Lightweight convolutional attention over the spectrogram
        self.attention = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=1),
            nn.ReLU(),
            nn.Conv2d(16, 1, kernel_size=1),
            nn.Sigmoid()
        )

    def forward(self, x):
        # Re-weight time-frequency regions before the backbone
        attn = self.attention(x)
        x = x * attn
        return self.backbone(x)

Audio Processing Pipeline

Parameters

  • Sample Rate: 32,000 Hz
  • Duration: 5-second segments
  • Mel Bins: 128
  • Frequency Range: 20 Hz - 16,000 Hz
  • Hop Length: 512
  • FFT Size: 2048

Preprocessing Steps

  1. Load audio at 32kHz sampling rate
  2. Segment into 5-second chunks
  3. Generate mel-spectrograms using librosa
  4. Convert to log-scale (dB)
  5. Normalize to [-1, 1] range
  6. Apply attention weighting
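Under these parameters, the spectrogram shape is fixed: a 5-second clip at 32 kHz gives 160,000 samples, and a hop length of 512 (with librosa's default centered framing) yields 313 frames. A quick sanity check:

```python
# Derive the expected mel-spectrogram shape from the parameters above
sr = 32000          # sample rate (Hz)
duration = 5        # clip length (s)
hop_length = 512
n_mels = 128

num_samples = sr * duration                 # 160000 samples per clip
num_frames = 1 + num_samples // hop_length  # librosa's default center=True framing
print((n_mels, num_frames))                 # (128, 313)
```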

Usage

Installation

pip install torch torchvision torchaudio
pip install librosa numpy pandas tqdm
pip install timm

Basic Usage

import torch
import librosa
import numpy as np
from model import BirdCLEFModel

# Load model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = BirdCLEFModel(num_classes=your_num_classes)  # set to your dataset's class count
model.load_state_dict(torch.load("path_to_model.pth", map_location=device))
model.to(device)
model.eval()

# Process audio
def create_mel_spectrogram(audio_path):
    # Load a 5-second clip at 32 kHz, matching the training parameters
    y, sr = librosa.load(audio_path, sr=32000, mono=True, duration=5)
    S = librosa.feature.melspectrogram(
        y=y, sr=sr, n_mels=128, fmin=20, fmax=16000,
        hop_length=512, n_fft=2048
    )
    S = librosa.power_to_db(S, ref=np.max)
    # Normalize to [-1, 1]; the epsilon guards against silent (constant) clips
    S = (S - S.min()) / (S.max() - S.min() + 1e-8) * 2 - 1
    return torch.tensor(S).unsqueeze(0).unsqueeze(0).float()  # (1, 1, 128, T)

# Inference
spec = create_mel_spectrogram("bird_audio.wav").to(device)
with torch.no_grad():
    logits = model(spec)
    probabilities = torch.softmax(logits, dim=1)
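To turn the probability tensor into readable predictions, a top-k lookup works; `species_names` below is a hypothetical label list standing in for the class order used at training time:

```python
import torch

# Hypothetical label list; in practice, load the class order from training
species_names = ["amecro", "bkcchi", "norcar"]

probabilities = torch.tensor([[0.10, 0.15, 0.75]])  # stand-in for model output

# Indices and scores of the k most likely species
topk = torch.topk(probabilities, k=2, dim=1)
for score, idx in zip(topk.values[0], topk.indices[0]):
    print(f"{species_names[idx]}: {score:.2f}")
```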

Batch Processing

def process_audio_batch(audio_files, model, device):
    # Assumes all clips have the same duration, so spectrograms share one shape
    specs = [create_mel_spectrogram(f) for f in audio_files]

    # Each spec is (1, 1, 128, T); concatenate along the batch dimension
    # (torch.stack would add an extra dimension and produce a 5-D input)
    batch = torch.cat(specs, dim=0).to(device)

    with torch.no_grad():
        logits = model(batch)
        probabilities = torch.softmax(logits, dim=1)

    return probabilities.cpu().numpy()

Model Features

Attention Mechanism

  • Lightweight convolutional attention module
  • Helps focus on relevant frequency-time regions
  • Improves model interpretability and performance

Optimizations

  • Mixed precision training support (AMP)
  • Efficient batch processing
  • Memory-optimized inference pipeline
  • Fast mel-spectrogram computation
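The AMP support mentioned above can be sketched with torch.autocast. The tiny linear model and random input here are placeholders for the real BirdCLEFModel and spectrogram batch; on GPU you would use device_type="cuda" with float16:

```python
import torch
import torch.nn as nn

# Stand-in model and input; substitute BirdCLEFModel and a spectrogram batch
model = nn.Linear(128, 10).eval()
x = torch.randn(4, 128)

# Autocast runs eligible ops in lower precision to save memory and time
with torch.no_grad(), torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    logits = model(x)

print(logits.shape)  # torch.Size([4, 10])
```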

Training Details

Data Processing

  • Audio files segmented into 5-second chunks
  • Random cropping during training for data augmentation
  • Mel-spectrogram normalization per sample
  • Optional SpecAugment for time/frequency masking

Augmentation

import torchaudio.transforms as T

class SpecAugment:
    def __init__(self):
        # Mask up to 20 time frames and 10 mel bins per call
        self.time_mask = T.TimeMasking(time_mask_param=20)
        self.freq_mask = T.FrequencyMasking(freq_mask_param=10)

    def __call__(self, spec):
        spec = self.time_mask(spec)
        spec = self.freq_mask(spec)
        return spec
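The random time-cropping mentioned under Data Processing can be sketched as a random slice along the time axis. This is a minimal version under the assumption that spectrograms longer than the target width are cropped and shorter ones are passed through unchanged:

```python
import torch

def random_time_crop(spec, target_frames):
    # spec: (channels, n_mels, T); take a random window of target_frames columns
    total = spec.shape[-1]
    if total <= target_frames:
        return spec  # already short enough; nothing to crop
    start = torch.randint(0, total - target_frames + 1, (1,)).item()
    return spec[..., start:start + target_frames]

spec = torch.randn(1, 128, 500)
print(random_time_crop(spec, 313).shape)  # torch.Size([1, 128, 313])
```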

Performance

Competition Results

  • Developed for BirdCLEF 2025 competition
  • Optimized for real-time inference
  • Efficient processing of long audio recordings

Inference Speed

  • Batch processing capability
  • GPU acceleration support
  • Optimized for competition time constraints

Model Variants

This architecture can be adapted for different scenarios:

  • Lightweight: Use a smaller backbone (e.g. EfficientNetV2-B0) for faster inference
  • High Accuracy: Use EfficientNetV2-M or larger models
  • Multi-scale: Process multiple segment lengths
  • Ensemble: Combine multiple model predictions
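The ensemble variant can be as simple as averaging per-model probabilities. The stand-in linear models below are placeholders for separately trained checkpoints:

```python
import torch
import torch.nn as nn

# Stand-ins for separately trained checkpoints of the same task
models = [nn.Linear(128, 10).eval() for _ in range(3)]
x = torch.randn(4, 128)

with torch.no_grad():
    # Softmax each model's logits, then average across the ensemble
    probs = torch.stack([torch.softmax(m(x), dim=1) for m in models]).mean(dim=0)

print(probs.shape)  # torch.Size([4, 10]); each row still sums to ~1.0
```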

File Structure

birdclef-model/
├── model.py              # Model architecture
├── inference.py          # Inference pipeline
├── preprocessing.py      # Audio processing utilities
├── config.py             # Configuration parameters
├── requirements.txt      # Dependencies
└── README.md             # This file

Requirements

torch>=1.9.0
torchaudio>=0.9.0
librosa>=0.8.1
numpy>=1.21.0
pandas>=1.3.0
timm>=0.6.0
tqdm>=4.62.0

Citation

If you use this model in your research, please cite:

@misc{birdclef2025model,
  title={BirdCLEF 2025: Efficient Bird Species Classification with Attention-Enhanced EfficientNetV2},
  author={Your Name},
  year={2025},
  howpublished={Kaggle Competition}
}

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

  • BirdCLEF 2025 competition organizers
  • Timm library for pre-trained models
  • Librosa for audio processing utilities
  • PyTorch team for the deep learning framework

Contact

For questions or issues, please open an issue on GitHub or contact javokhirraimov1@gmail.com.


Note: This model was specifically designed for the BirdCLEF 2025 competition format. For other bird classification tasks, you may need to adjust the audio processing parameters and retrain the model.
