---
tags:
- ml-intern
---
# BirdCLEF+ 2026 — Improved Pipeline (Target: 0.90+)

This repository contains an improved 4-notebook pipeline for BirdCLEF+ 2026, based on lessons learned from a baseline that scored 0.815.

🔗 **Competition**: https://www.kaggle.com/competitions/birdclef-2026  
🔗 **Author Repo**: https://huggingface.co/hello9972/birdclef-2026-improved

---

## Why You Were Stuck at 0.815

Your original pipeline had the following problems, which kept it from reaching 0.90+:

### ❌ What Destroyed Your Score

| Mistake | Impact | Why |
|---------|--------|-----|
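| **Threshold boosting** (`p * 0.85 + mask * 0.15`) | 0.815 → **0.52** | The BirdCLEF metric is AUC-based (rank-sensitive), so a thresholded boost only helps if it improves the ranking of scores. Here it made the ranking worse. |
| **Mixup + Label Smoothing** | Softened outputs to the 0.05-0.2 range | AUC rewards clear separation between positives and negatives; the softened targets compressed that separation. |
| **Aggressive calibration** (`p ** 0.75`) | 0.815 → **0.53** | Rescaling probabilities cannot help a rank-based metric: a monotonic transform leaves per-class ranking unchanged at best, and in this pipeline the score dropped. |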
| **2-model ensemble only** | Ceiling ~0.82 | Top solutions use 5-20 models. |
| **No 5-fold CV** | Could not ensemble diverse models | Models trained on the same data make nearly the same predictions, so averaging them gains little. |
| **No pseudo-labeling** | Missing an estimated 5-8% boost from test-domain adaptation | Top solutions use noisy student on test predictions. |

### ✅ What Actually Works for BirdCLEF

- **Raw sigmoid outputs** — NO thresholds, NO calibration
- **Simple ensemble** — mean logits, not probabilities
- **Exact sample submission alignment** — `sample[["row_id"]].merge(sub, ...)`
- **Pure PyTorch inference** — No ONNX in Kaggle submissions
- **Minimal post-processing** — tiny clip only
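
As a minimal sketch of the "mean logits, not probabilities" point above, the snippet below averages per-model logits and applies the sigmoid once at the end. The function name `ensemble_mean_logits` and the array layout are assumptions for illustration, not code taken from the notebooks.

```python
import numpy as np

def sigmoid(x):
    """Elementwise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))

def ensemble_mean_logits(logits_per_model):
    """Average logits over models, then squash once.

    logits_per_model: array-like of shape (n_models, n_rows, n_classes).
    Returns probabilities of shape (n_rows, n_classes).
    """
    stacked = np.asarray(logits_per_model, dtype=np.float64)
    return sigmoid(stacked.mean(axis=0))

# Two hypothetical models, one row, two classes.
logits = np.array([
    [[2.0, -1.0]],  # model A
    [[0.0, -3.0]],  # model B
])
probs = ensemble_mean_logits(logits)
```

Averaging in logit space keeps one confident model from being diluted as strongly as it would be when averaging probabilities, which is one plausible reason for the recommendation.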

---

## New Architecture Overview

```
NB1 → Data Prep + StratifiedKFold(5)
NB2 → 5-Fold Training (AsymmetricLoss, SpecAugment, Energy Crop)
NB3 → Pseudo-Labeling (Noisy Student on train_soundscapes)
NB4 → Inference (10-model ensemble, TTA, rank averaging)
```

---

## Key Improvements

### 1. Loss Function: AsymmetricLoss (NOT BCE)

Replaces `BCEWithLogitsLoss` with AsymmetricLoss from [arXiv:2009.14119](https://arxiv.org/abs/2009.14119):

```python
class AsymmetricLoss(nn.Module):
    def __init__(self, gamma_neg=4, gamma_pos=0, clip=0.05):
        ...
```

**Why**: Down-weights easy negatives (background noise, empty segments) while preserving signal for rare species. Does NOT squash logits like label smoothing.
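
To make the down-weighting concrete, here is a NumPy reference of the per-element math as described in the AsymmetricLoss paper. It is a sketch for inspection, not the `nn.Module` used in `nb02_training.py`; the function name and the `eps` guard are assumptions, and a PyTorch version would follow the same steps with tensor operations.

```python
import numpy as np

def asymmetric_loss(logits, targets, gamma_neg=4, gamma_pos=0, clip=0.05, eps=1e-8):
    """Summed asymmetric loss for multi-label targets in {0, 1}."""
    logits = np.asarray(logits, dtype=np.float64)
    targets = np.asarray(targets, dtype=np.float64)
    p_pos = 1.0 / (1.0 + np.exp(-logits))
    # Probability shifting: negatives predicted below `clip` contribute zero loss.
    p_neg = np.minimum(1.0 - p_pos + clip, 1.0)
    log_lik = (targets * np.log(np.maximum(p_pos, eps))
               + (1.0 - targets) * np.log(np.maximum(p_neg, eps)))
    # Focusing weight (1 - p_t) ** gamma, with a separate gamma per label sign.
    p_t = p_pos * targets + p_neg * (1.0 - targets)
    gamma = gamma_pos * targets + gamma_neg * (1.0 - targets)
    weight = np.power(1.0 - p_t, gamma)
    return float(-(weight * log_lik).sum())

easy_negative = asymmetric_loss([-5.0], [0.0])  # confident, correct "absent"
hard_negative = asymmetric_loss([0.0], [0.0])   # uncertain on an absent class
positive = asymmetric_loss([0.0], [1.0])        # uncertain on a present class
```

With `gamma_pos=0`, positives keep the plain cross-entropy term, while easy negatives are suppressed both by the focusing weight and by the `clip` shift.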

### 2. Energy-Based Window Selection (Perch 2.0 Trick)

For training clips longer than 5 seconds, the loader picks the window with the highest audio energy instead of taking a random crop:

```python
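import librosa
import numpy as np

def _energy_crop(self, wav):
    energy = librosa.feature.rms(y=wav, frame_length=2048, hop_length=512)[0]
    # Moving-average smoothing; the 9-frame kernel is an assumed value
    smoothed_energy = np.convolve(energy, np.ones(9) / 9, mode="same")
    peak_frame = np.argmax(smoothed_energy)
    # center window around peak with jitter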
```

**Why**: Bird calls are often brief. Random crops miss them. Energy-based crops hit them 3-4x more often.
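
The centering and jitter step can be sketched end to end in plain NumPy. This is an illustration under assumptions rather than the notebook code: `energy_crop`, `max_jitter`, and the right-padding of short clips are hypothetical choices, RMS is computed by hand instead of with librosa, and `win_len` is assumed to be at least `frame_length`.

```python
import numpy as np

def energy_crop(wav, win_len, frame_length=2048, hop_length=512,
                max_jitter=0, rng=None):
    """Return a win_len-sample slice centered near the loudest region of wav."""
    wav = np.asarray(wav, dtype=np.float32)
    if len(wav) <= win_len:
        # Short clip: zero-pad on the right instead of cropping.
        return np.pad(wav, (0, win_len - len(wav)))
    # Frame-level RMS energy; frame i starts at sample i * hop_length.
    n_frames = 1 + (len(wav) - frame_length) // hop_length
    starts = np.arange(n_frames) * hop_length
    rms = np.array([np.sqrt(np.mean(wav[s:s + frame_length] ** 2)) for s in starts])
    peak_center = int(starts[int(np.argmax(rms))]) + frame_length // 2
    # Optional jitter so the call is not always dead center.
    if rng is None:
        rng = np.random.default_rng()
    jitter = int(rng.integers(-max_jitter, max_jitter + 1)) if max_jitter else 0
    start = int(np.clip(peak_center - win_len // 2 + jitter, 0, len(wav) - win_len))
    return wav[start:start + win_len]

# Synthetic clip: silence with a short burst at samples 700..719.
demo = np.zeros(1000, dtype=np.float32)
demo[700:720] = 1.0
crop = energy_crop(demo, win_len=200, frame_length=64, hop_length=16)
short = energy_crop(np.ones(50), win_len=200, frame_length=64, hop_length=16)
```

Clamping `start` keeps the window inside the clip when the peak sits near either end.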

### 3. Augmentations (Waveform + Spectrogram)

| Augmentation | Probability | Parameters | Purpose |
|--------------|-------------|------------|---------|
| Cyclic roll | 100% | n/a | Time-shift invariance |
| Colored noise | 30% | SNR 3-30 dB, f^-decay | Domain adaptation to soundscapes |
| Background noise | 50% | Real soundscape mixing | Simulates multi-species recordings |
| Gain | 30% | ±12 dB | Loudness invariance |
| SpecAugment (freq mask) | 50% | 24 bins | Frequency invariance |
| SpecAugment (time mask) | 50% | 40 frames | Time invariance |

**NO mixup. NO label smoothing.** Both destroyed your score.
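
Two of the augmentations above can be sketched in a few lines of NumPy. The snippet assumes that "24 bins" and "40 frames" are maximum mask widths and that masked cells are filled with zero; both are guesses about the notebook settings, and the function names are hypothetical.

```python
import numpy as np

def spec_augment(spec, rng, max_freq_bins=24, max_time_frames=40, p=0.5):
    """Zero out one random frequency band and one random time span.

    spec: array of shape (n_mels, n_frames). Returns a masked copy.
    """
    out = spec.copy()
    n_mels, n_frames = out.shape
    if rng.random() < p:
        width = int(rng.integers(0, min(max_freq_bins, n_mels) + 1))
        start = int(rng.integers(0, n_mels - width + 1))
        out[start:start + width, :] = 0.0
    if rng.random() < p:
        width = int(rng.integers(0, min(max_time_frames, n_frames) + 1))
        start = int(rng.integers(0, n_frames - width + 1))
        out[:, start:start + width] = 0.0
    return out

def cyclic_roll(wav, rng):
    """Circularly shift the waveform by a random offset."""
    return np.roll(wav, int(rng.integers(0, len(wav))))

rng = np.random.default_rng(0)
spec = np.ones((128, 300), dtype=np.float32)
masked = spec_augment(spec, rng, p=1.0)
rolled = cyclic_roll(np.arange(10), rng)
```

For log-scaled spectrograms a fill value such as the spectrogram minimum or mean may be a better choice than zero.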

### 4. 5-Fold StratifiedKFold

```python
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
```

Each fold preserves the species distribution. Five models trained on different splits give the ensemble its diversity.

### 5. Layer-Wise LR Decay

```python
lr_scale = layer_decay ** (num_blocks - layer_idx)
```

Earlier layers (closer to the input) get a smaller LR, which limits overfitting in those layers.
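
A small worked example of the formula, assuming index 0 is the block closest to the input and index `num_blocks` is the classification head. The helper name `layerwise_lrs` and the sample values are illustrative only.

```python
def layerwise_lrs(base_lr, num_blocks, layer_decay):
    """Learning rate per block; index 0 is closest to the input."""
    return [base_lr * layer_decay ** (num_blocks - layer_idx)
            for layer_idx in range(num_blocks + 1)]

# Hypothetical settings: 4 backbone blocks plus a head, decay of 0.5.
lrs = layerwise_lrs(base_lr=1e-3, num_blocks=4, layer_decay=0.5)
for idx, lr in enumerate(lrs):
    print(f"block {idx}: lr = {lr:.2e}")
```

In PyTorch these values would typically be passed as the `lr` of separate optimizer parameter groups, one group per block.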

### 6. Test-Time Augmentation (TTA)

4 variants per chunk:
- Original
- Time-reversed
- +3dB gain
- -3dB gain

Average logits across all variants.
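
The four variants can be sketched as follows. `predict_fn` stands in for whatever maps one waveform chunk to a vector of logits; it and the helper names are assumptions, not the interface used in `nb04_inference.py`.

```python
import numpy as np

def db_to_gain(db):
    """Convert a decibel change to a linear amplitude factor."""
    return 10.0 ** (db / 20.0)

def tta_logits(predict_fn, wav):
    """Average logits over four views of one audio chunk."""
    views = [
        wav,                     # original
        wav[::-1].copy(),        # time-reversed
        wav * db_to_gain(+3.0),  # +3 dB gain
        wav * db_to_gain(-3.0),  # -3 dB gain
    ]
    return np.mean([predict_fn(v) for v in views], axis=0)

# Stand-in model that records its inputs and returns two fake logits.
calls = []
def fake_predict(w):
    calls.append(w.copy())
    return np.array([float(w[0]), 1.0])

out = tta_logits(fake_predict, np.array([1.0, 2.0, 4.0]))
```

The reversed view is copied so that frameworks which reject negative strides receive a contiguous array.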

### 7. Pseudo-Labeling (Noisy Student)

Use confident predictions (>0.5) on `train_soundscapes` as additional training data. Retrain with these pseudo-labels + original data.

**Expected boost**: 0.84 → 0.88
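
The selection step might look like the sketch below. It assumes a chunk is kept when its highest class probability exceeds the threshold and that the full soft probability vector is stored as the label; the actual rule in `nb03_pseudo_labeling.py` is not spelled out here, so treat both as assumptions.

```python
import numpy as np

def select_pseudo_labels(probs, threshold=0.5):
    """Keep chunks whose top class probability exceeds the threshold.

    probs: array-like of shape (n_chunks, n_classes) with sigmoid outputs.
    Returns (row_indices, soft_labels) for the retained chunks.
    """
    probs = np.asarray(probs, dtype=np.float32)
    keep = probs.max(axis=1) > threshold
    return np.flatnonzero(keep), probs[keep]

# Three hypothetical soundscape chunks, two classes.
idx, soft = select_pseudo_labels([[0.9, 0.1], [0.3, 0.2], [0.1, 0.6]])
```

The threshold here only filters training data. It is never applied to submitted predictions.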

---

## Expected Score Improvements

| Stage | Technique | Expected Score |
|-------|-----------|----------------|
| Baseline | Your 0.815 pipeline | **0.815** |
| NB2 improvement | AsymmetricLoss + energy crop + no mixup | **0.83-0.85** |
| 5-fold ensemble | 10 models (5 folds × 2 backbones) | **0.85-0.87** |
| TTA | 4 variants per chunk | **0.86-0.88** |
| Pseudo-labeling | Noisy student on soundscapes | **0.88-0.91** |
| + Better backbone | Bird-MAE or ConvNeXt | **0.90-0.93** |

---

## Files

| File | Purpose |
|------|---------|
| `nb01_data_prep.py` | Data cleaning, VAD, StratifiedKFold(5) |
| `nb02_training.py` | 5-fold training with AsymmetricLoss, SpecAugment |
| `nb03_pseudo_labeling.py` | Generate pseudo-labels, noisy student |
| `nb04_inference.py` | 10-model ensemble, TTA, submission generation |

---

## How to Run on Kaggle

### Step 1: Create Dataset from NB1 Output

After running `nb01_data_prep.py`, create a Kaggle dataset from `/kaggle/working/`:

```
train_cleaned_stratified.csv
soundscape_labels_with_folds.csv
species_list.csv
rare_species.csv
```

### Step 2: NB2 Training

```python
# In Kaggle notebook, attach:
# - Competition data: birdclef-2026
# - NB1 output dataset
# Run nb02_training.py → produces 10 .pt files in /kaggle/working/models/
```

Save models as a new Kaggle dataset.

### Step 3: NB3 Pseudo-Labeling

```python
# Attach NB2 model dataset + NB1 data
# Run nb03_pseudo_labeling.py → produces pseudo_labels_soft.csv
```

### Step 4: NB4 Inference (Submission)

```python
# Attach NB2 model dataset + competition test data
# Run nb04_inference.py → produces submission.csv
```

---

## Critical Rules for BirdCLEF

1. **NEVER threshold predictions**: it destroys AUC ranking.
2. **NEVER apply non-linear calibration** (`p**0.75`, `p/(p+1)`, etc.): a rank-based metric gains nothing from rescaling, and the `p ** 0.75` attempt dropped the score to 0.53.
3. **NEVER mixup or label-smooth**: in this pipeline it compressed outputs into a narrow range and reduced the separation AUC depends on.
4. **ALWAYS align submission with sample_submission.csv**: `sample[["row_id"]].merge(sub, ...)`
5. **ALWAYS ensemble diverse models**: the same model on the same folds gives no gain.
6. **ALWAYS use raw sigmoid outputs**: AUC depends only on ranking, so calibration is unnecessary.
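
Rule 4 can be sketched with pandas as below. The helper `align_to_sample`, the `fill_value` default, and the toy column names are assumptions; only the `sample[["row_id"]].merge(sub, ...)` pattern comes from this README.

```python
import pandas as pd

def align_to_sample(sample, sub, fill_value=0.0):
    """Reorder sub to match the sample's row_id order and class columns."""
    class_cols = [c for c in sample.columns if c != "row_id"]
    # Left merge keeps every row_id of the sample, in the sample's order.
    aligned = sample[["row_id"]].merge(sub, on="row_id", how="left")
    # Enforce the sample's column order; absent classes become NaN.
    aligned = aligned.reindex(columns=["row_id"] + class_cols)
    aligned[class_cols] = aligned[class_cols].fillna(fill_value)
    return aligned

# Toy frames standing in for sample_submission.csv and model predictions.
sample = pd.DataFrame({"row_id": ["a", "b", "c"], "sp1": 0.0, "sp2": 0.0})
sub = pd.DataFrame({"row_id": ["c", "a"], "sp2": [0.2, 0.4], "sp1": [0.7, 0.9]})
out = align_to_sample(sample, sub)
```

Rows missing from the predictions are filled rather than dropped, so the submission keeps the expected shape.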

---

## References

- AsymmetricLoss: [arXiv:2009.14119](https://arxiv.org/abs/2009.14119)
- Bird-MAE: [arXiv:2504.12880](https://arxiv.org/abs/2504.12880)
- sl-BEATs: [arXiv:2508.11845](https://arxiv.org/abs/2508.11845)
- Top solution reference: [minalkharat12/birdclef-2026-solution](https://huggingface.co/minalkharat12/birdclef-2026-solution)

---

## License

MIT — Competition code for educational purposes.

<!-- ml-intern-provenance -->
## Generated by ML Intern

This model repository was generated by [ML Intern](https://github.com/huggingface/ml-intern), an agent for machine learning research and development on the Hugging Face Hub.

- Try ML Intern: https://smolagents-ml-intern.hf.space
- Source code: https://github.com/huggingface/ml-intern
