BirdCLEF 2025 Model
This repository hosts a PyTorch model trained for the BirdCLEF 2025 challenge, classifying bird species from audio spectrograms.
Model Overview
This model combines EfficientNetV2 with attention mechanisms to classify bird species from mel-spectrogram representations of audio recordings.
Architecture
- Backbone: EfficientNetV2-S (tf_efficientnetv2_s_in21k)
- Input: Single-channel mel-spectrograms (128 mel bins)
- Attention Module: Convolutional attention mechanism
- Output: Multi-class classification (number of classes depends on dataset)
```python
import timm
import torch
import torch.nn as nn


class BirdCLEFModel(torch.nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        self.backbone = timm.create_model(
            'tf_efficientnetv2_s_in21k',
            pretrained=False,
            in_chans=1,
            num_classes=num_classes
        )
        # Lightweight spatial attention over the input spectrogram
        self.attention = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=1),
            nn.ReLU(),
            nn.Conv2d(16, 1, kernel_size=1),
            nn.Sigmoid()
        )

    def forward(self, x):
        attn = self.attention(x)
        x = x * attn
        return self.backbone(x)
```
Audio Processing Pipeline
Parameters
- Sample Rate: 32,000 Hz
- Duration: 5-second segments
- Mel Bins: 128
- Frequency Range: 20 Hz - 16,000 Hz
- Hop Length: 512
- FFT Size: 2048
Preprocessing Steps
- Load audio at 32kHz sampling rate
- Segment into 5-second chunks
- Generate mel-spectrograms using librosa
- Convert to log-scale (dB)
- Normalize to [-1, 1] range
- Apply attention weighting
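The loading and segmentation steps above can be sketched as follows. This is a minimal illustration, not code from this repository; `segment_audio` is a hypothetical helper that zero-pads the final chunk to a full 5 seconds:

```python
import numpy as np

SR = 32_000                 # sample rate from the parameters above
CHUNK_LEN = SR * 5          # 5-second segments

def segment_audio(y: np.ndarray) -> list:
    """Split a waveform into fixed 5-second chunks, zero-padding the tail."""
    chunks = []
    for start in range(0, len(y), CHUNK_LEN):
        chunk = y[start:start + CHUNK_LEN]
        if len(chunk) < CHUNK_LEN:
            chunk = np.pad(chunk, (0, CHUNK_LEN - len(chunk)))
        chunks.append(chunk)
    return chunks
```

Each chunk then goes through the mel-spectrogram, log-scaling, and normalization steps independently.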
Usage
Installation
```bash
pip install torch torchvision torchaudio
pip install librosa numpy pandas tqdm
pip install timm
```
Basic Usage
```python
import torch
import librosa
import numpy as np
from model import BirdCLEFModel

# Load model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = BirdCLEFModel(num_classes=your_num_classes)
model.load_state_dict(torch.load("path_to_model.pth", map_location=device))
model.to(device)
model.eval()

# Process audio
def create_mel_spectrogram(audio_path):
    y, sr = librosa.load(audio_path, sr=32000, mono=True, duration=5)
    S = librosa.feature.melspectrogram(
        y=y, sr=sr, n_mels=128, fmin=20, fmax=16000,
        hop_length=512, n_fft=2048
    )
    S = librosa.power_to_db(S, ref=np.max)
    S = (S - S.min()) / (S.max() - S.min()) * 2 - 1  # normalize to [-1, 1]
    return torch.tensor(S).unsqueeze(0).unsqueeze(0).float()

# Inference
spec = create_mel_spectrogram("bird_audio.wav").to(device)
with torch.no_grad():
    logits = model(spec)
    probabilities = torch.softmax(logits, dim=1)
```
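To turn the probability tensor into readable predictions, you can take the top-k classes and map them back to species labels. A self-contained sketch with made-up logits standing in for real model output (`class_names` is a hypothetical list that must match the training label order):

```python
import torch

# Hypothetical example: 10 classes with random logits in place of model output
class_names = [f"species_{i}" for i in range(10)]
logits = torch.randn(1, 10)
probabilities = torch.softmax(logits, dim=1)

top_probs, top_idx = probabilities.topk(3, dim=1)
predictions = [(class_names[i], float(p)) for p, i in zip(top_probs[0], top_idx[0])]
print(predictions)
```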
Batch Processing
```python
def process_audio_batch(audio_files, model, device):
    specs = [create_mel_spectrogram(f) for f in audio_files]
    # Each spectrogram already has a leading batch dimension of 1, so
    # concatenate along dim 0; torch.stack would add a spurious extra axis.
    batch = torch.cat(specs, dim=0).to(device)
    with torch.no_grad():
        logits = model(batch)
        probabilities = torch.softmax(logits, dim=1)
    return probabilities.cpu().numpy()
```
Model Features
Attention Mechanism
- Lightweight convolutional attention module
- Helps focus on relevant frequency-time regions
- Improves model interpretability and performance
Optimizations
- Mixed precision training support (AMP)
- Efficient batch processing
- Memory-optimized inference pipeline
- Fast mel-spectrogram computation
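The AMP support mentioned above can also be used at inference time. A minimal sketch (not the repository's actual pipeline; requires torch>=1.10 for `torch.autocast`, and falls back to full precision on CPU):

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def predict_amp(model, spec):
    """Run softmax inference under autocast when a GPU is available."""
    model.eval()
    with torch.no_grad(), torch.autocast(device_type=device.type,
                                         enabled=device.type == "cuda"):
        return torch.softmax(model(spec), dim=1)
```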
Training Details
Data Processing
- Audio files segmented into 5-second chunks
- Random cropping during training for data augmentation
- Mel-spectrogram normalization per sample
- Optional SpecAugment for time/frequency masking
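The random-cropping augmentation above can be sketched like this; `random_crop` is an illustrative helper, not code from this repository:

```python
import numpy as np

SR = 32_000
CROP_LEN = SR * 5  # 5-second training crop

def random_crop(y: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Take a random 5-second window from a waveform (zero-pad if shorter)."""
    if len(y) <= CROP_LEN:
        return np.pad(y, (0, CROP_LEN - len(y)))
    start = rng.integers(0, len(y) - CROP_LEN)
    return y[start:start + CROP_LEN]
```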
Augmentation
```python
import torchaudio.transforms as T


class SpecAugment:
    def __init__(self):
        self.time_mask = T.TimeMasking(time_mask_param=20)
        self.freq_mask = T.FrequencyMasking(freq_mask_param=10)

    def __call__(self, spec):
        spec = self.time_mask(spec)
        spec = self.freq_mask(spec)
        return spec
```
Performance
Competition Results
- Developed for BirdCLEF 2025 competition
- Optimized for real-time inference
- Efficient processing of long audio recordings
Inference Speed
- Batch processing capability
- GPU acceleration support
- Optimized for competition time constraints
Model Variants
This architecture can be adapted for different scenarios:
- Lightweight: Use EfficientNet-B0 for faster inference
- High Accuracy: Use EfficientNet-B4 or larger models
- Multi-scale: Process multiple segment lengths
- Ensemble: Combine multiple model predictions
File Structure
```
birdclef-model/
├── model.py           # Model architecture
├── inference.py       # Inference pipeline
├── preprocessing.py   # Audio processing utilities
├── config.py          # Configuration parameters
├── requirements.txt   # Dependencies
└── README.md          # This file
```
Requirements
```
torch>=1.9.0
torchaudio>=0.9.0
librosa>=0.8.1
numpy>=1.21.0
pandas>=1.3.0
timm>=0.6.0
tqdm>=4.62.0
```
Citation
If you use this model in your research, please cite:
```bibtex
@misc{birdclef2025model,
  title={BirdCLEF 2025: Efficient Bird Species Classification with Attention-Enhanced EfficientNetV2},
  author={Your Name},
  year={2025},
  howpublished={Kaggle Competition}
}
```
License
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgments
- BirdCLEF 2025 competition organizers
- timm library for pre-trained models
- librosa for audio processing utilities
- PyTorch team for the deep learning framework
Contact
For questions or issues, please open an issue on GitHub or contact javokhirraimov1@gmail.com.
Note: This model was specifically designed for the BirdCLEF 2025 competition format. For other bird classification tasks, you may need to adjust the audio processing parameters and retrain the model.