	---
	tags:
	- ml-intern
	---
	# BirdCLEF+ 2026 — Improved Pipeline (Target: 0.90+)

	This repository contains an improved 4-notebook pipeline for BirdCLEF+ 2026, based on lessons learned from a 0.815 score baseline.

	🔗 Competition: https://www.kaggle.com/competitions/birdclef-2026
	🔗 Author Repo: https://huggingface.co/hello9972/birdclef-2026-improved

	---

	## Why You Got Stuck at 0.815

	Your original pipeline had these fatal problems that prevented reaching 0.90+:

	### ❌ What Destroyed Your Score

	| Mistake | Impact | Why |
	|---------|--------|-----|
	| Threshold boosting (`p * 0.85 + mask * 0.15`) | 0.815 → 0.52 | Thresholds destroy probability ranking. The BirdCLEF metric is AUC-based (rank-sensitive). |
	| Mixup + Label Smoothing | Softened outputs to 0.05-0.2 range | Destroyed calibration needed for AUC. AUC needs spread, not softening. |
	| Aggressive calibration (`p ** 0.75`) | 0.815 → 0.53 | Non-linear transforms distort ranking order. |
	| 2-model ensemble only | Ceiling ~0.82 | Top solutions use 5-20 models. |
	| No 5-fold CV | Could not ensemble diverse models | Same data, same predictions = no ensemble gain. |
	| No pseudo-labeling | Missing 5-8% boost from test-domain adaptation | Top solutions use noisy student on test predictions. |

	### ✅ What Actually Works for BirdCLEF

	- Raw sigmoid outputs — NO thresholds, NO calibration
	- Simple ensemble — mean logits, not probabilities
	- Exact sample submission alignment — `sample[["row_id"]].merge(sub, ...)`
	- Pure PyTorch inference — No ONNX in Kaggle submissions
	- Minimal post-processing — tiny clip only
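
As a minimal sketch of the "mean logits, not probabilities" point above: the helper below averages raw logits across models and applies a single sigmoid afterwards. The function names and the example logits are illustrative assumptions, not code from the notebooks.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def ensemble_mean_logits(per_model_logits):
    """Average raw logits across models, then apply one sigmoid per class."""
    n_models = len(per_model_logits)
    n_classes = len(per_model_logits[0])
    mean_logits = [sum(m[c] for m in per_model_logits) / n_models
                   for c in range(n_classes)]
    return [sigmoid(z) for z in mean_logits]

# Hypothetical logits from three models for two classes
logits = [[4.0, -2.0], [0.0, -3.0], [2.0, -1.0]]
probs = ensemble_mean_logits(logits)
```

Averaging in logit space keeps one confident model from being flattened by the sigmoid before the models are combined, which is presumably why the pipeline prefers it.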

	---

	## New Architecture Overview

	```
	NB1 → Data Prep + StratifiedKFold(5)
	NB2 → 5-Fold Training (AsymmetricLoss, SpecAugment, Energy Crop)
	NB3 → Pseudo-Labeling (Noisy Student on train_soundscapes)
	NB4 → Inference (10-model ensemble, TTA, rank averaging)
	```

	---

	## Key Improvements

	### 1. Loss Function: AsymmetricLoss (NOT BCE)

	Replaces `BCEWithLogitsLoss` with AsymmetricLoss from [arXiv:2009.14119](https://arxiv.org/abs/2009.14119):

	```python
	import torch.nn as nn

	class AsymmetricLoss(nn.Module):
	    def __init__(self, gamma_neg=4, gamma_pos=0, clip=0.05):
	        ...
	```

	Why: Down-weights easy negatives (background noise, empty segments) while preserving signal for rare species. Does NOT squash logits like label smoothing.
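
To make the down-weighting concrete, here is a NumPy reference sketch of the forward computation described in the paper. It is not the repository's PyTorch module. The function name, the `eps` guard, and the sum reduction are assumptions made for illustration.

```python
import numpy as np

def asymmetric_loss(logits, targets, gamma_neg=4.0, gamma_pos=0.0, clip=0.05, eps=1e-8):
    """Summed asymmetric loss for multi-label targets (NumPy reference sketch)."""
    logits = np.asarray(logits, dtype=np.float64)
    targets = np.asarray(targets, dtype=np.float64)

    p = 1.0 / (1.0 + np.exp(-logits))        # sigmoid probabilities
    p_neg = np.minimum(1.0 - p + clip, 1.0)  # shifted negative-side probability, capped at 1

    # Cross-entropy terms for positive and negative labels
    ce = targets * np.log(np.maximum(p, eps)) \
        + (1.0 - targets) * np.log(np.maximum(p_neg, eps))

    # Focusing weight (1 - p_t) ** gamma, with a separate gamma per label polarity
    p_t = p * targets + p_neg * (1.0 - targets)
    gamma = gamma_pos * targets + gamma_neg * (1.0 - targets)
    weight = np.power(1.0 - p_t, gamma)

    return float(-(ce * weight).sum())
```

With `gamma_neg=0`, `gamma_pos=0`, and `clip=0` this reduces to summed binary cross-entropy. With the defaults, a negative whose probability is already below `clip` contributes zero loss.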

	### 2. Energy-Based Window Selection (Perch 2.0 Trick)

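	For training clips longer than 5 seconds, the pipeline picks the window with the highest audio energy instead of a random crop:

	```python
	def _energy_crop(self, wav):
	    # Frame-wise RMS energy of the waveform
	    energy = librosa.feature.rms(y=wav, frame_length=2048, hop_length=512)[0]
	    # Smoothing step is not shown in the original; a 5-frame moving average is assumed here
	    smoothed_energy = np.convolve(energy, np.ones(5) / 5, mode="same")
	    peak_frame = np.argmax(smoothed_energy)
	    # center window around peak with jitter
	```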

	Why: Bird calls are often brief. Random crops miss them. Energy-based crops hit them 3-4x more often.
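
A self-contained sketch of the whole crop, under stated assumptions: frame-wise RMS is computed in plain NumPy as a stand-in for `librosa.feature.rms`, and the sample rate, smoothing width, and jitter range are illustrative defaults rather than values taken from the notebooks.

```python
import numpy as np

def energy_crop(wav, sr=32000, crop_sec=5.0, frame_length=2048, hop_length=512,
                smooth_frames=31, max_jitter_sec=0.5, rng=None):
    """Return a crop_sec window of `wav` centred near the highest-energy frame."""
    rng = rng if rng is not None else np.random.default_rng()
    crop_len = int(crop_sec * sr)
    if len(wav) <= crop_len:
        return np.pad(wav, (0, crop_len - len(wav)))  # short clips: zero-pad

    # Frame-wise RMS energy (NumPy stand-in for librosa.feature.rms)
    n_frames = 1 + (len(wav) - frame_length) // hop_length
    starts = np.arange(n_frames) * hop_length
    frames = wav[starts[:, None] + np.arange(frame_length)[None, :]]
    energy = np.sqrt(np.mean(frames ** 2, axis=1))

    # Moving-average smoothing so a single click does not win
    kernel = np.ones(smooth_frames) / smooth_frames
    smoothed = np.convolve(energy, kernel, mode="same")

    # Centre of the peak frame in samples, shifted by a random jitter
    peak_sample = int(np.argmax(smoothed)) * hop_length + frame_length // 2
    jitter = int(rng.uniform(-max_jitter_sec, max_jitter_sec) * sr)
    start = peak_sample + jitter - crop_len // 2
    start = int(np.clip(start, 0, len(wav) - crop_len))  # keep window inside the clip
    return wav[start:start + crop_len]
```

The jitter keeps the call from always sitting at the exact centre of the window, so the model does not learn a positional shortcut.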

	### 3. Augmentations (Waveform + Spectrogram)

	| Augmentation | Probability | Parameters | Purpose |
	|--------------|-------------|------------|---------|
	| Cyclic roll | 100% | | Time-shift invariance |
	| Colored noise | 30% | SNR 3-30 dB, f^-decay | Domain adaptation to soundscapes |
	| Background noise | 50% | Real soundscape mixing | Simulates multi-species recordings |
	| Gain | 30% | ±12 dB | Loudness invariance |
	| SpecAugment (freq mask) | 50% | 24 bins | Frequency invariance |
	| SpecAugment (time mask) | 50% | 40 frames | Time invariance |

	NO mixup. NO label smoothing. Both destroyed your score.
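
A few of the augmentations above can be sketched in NumPy as follows. This is an illustration under assumptions: "24 bins" and "40 frames" are read as maximum mask widths, the mask fill value is the spectrogram minimum, and both function names are hypothetical.

```python
import numpy as np

def augment_waveform(wav, rng, gain_prob=0.3, max_gain_db=12.0):
    """Cyclic roll (always) plus random gain (with probability gain_prob)."""
    wav = np.roll(wav, rng.integers(0, len(wav)))  # wrap-around time shift
    if rng.random() < gain_prob:
        gain_db = rng.uniform(-max_gain_db, max_gain_db)
        wav = wav * 10.0 ** (gain_db / 20.0)       # dB to linear amplitude
    return wav

def spec_augment(mel, rng, freq_width=24, time_width=40, prob=0.5):
    """Mask one frequency band and one time span on a (n_mels, n_frames) array."""
    mel = mel.copy()
    n_mels, n_frames = mel.shape
    fill = mel.min()  # fill choice is an assumption
    if rng.random() < prob:
        width = rng.integers(0, freq_width + 1)
        start = rng.integers(0, max(n_mels - width, 0) + 1)
        mel[start:start + width, :] = fill
    if rng.random() < prob:
        width = rng.integers(0, time_width + 1)
        start = rng.integers(0, max(n_frames - width, 0) + 1)
        mel[:, start:start + width] = fill
    return mel
```

Both functions take a `numpy.random.Generator`, so a fixed seed reproduces the same augmentations.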

	### 4. 5-Fold StratifiedKFold

	```python
	skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
	```

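	Each fold gets approximately the same species distribution. Training one model per fold yields five models that each saw a different 80% of the data, and that diversity is what makes averaging them useful.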

	### 5. Layer-Wise LR Decay

	```python
	lr_scale = layer_decay ** (num_blocks - layer_idx)
	```

	Earlier layers (closer to the input) get a smaller LR, so pretrained low-level features change slowly while the later blocks and the head adapt faster.
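
The scaling formula above can be expanded into a per-block schedule. A minimal sketch, assuming index 0 is the block nearest the input and index `num_blocks` is the classification head. The helper name and the example values are illustrative.

```python
def layerwise_lr(base_lr, num_blocks, layer_decay=0.75):
    """Per-block learning rates: index 0 is the block nearest the input."""
    return [base_lr * layer_decay ** (num_blocks - layer_idx)
            for layer_idx in range(num_blocks + 1)]

lrs = layerwise_lr(base_lr=1e-3, num_blocks=4, layer_decay=0.5)
# lrs[0] is the smallest rate (input side); lrs[-1] equals base_lr (head)
```

In PyTorch these rates would typically be passed as one optimizer parameter group per block, each with its own `"lr"` entry.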

	### 6. Test-Time Augmentation (TTA)

	4 variants per chunk:
	- Original
	- Time-reversed
	- +3dB gain
	- -3dB gain

	Average logits across all variants.
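
The four variants can be sketched as below. `predict_logits` is a hypothetical callable that maps one waveform chunk to a vector of class logits. It stands in for whatever the inference notebook actually uses.

```python
import numpy as np

def tta_logits(predict_logits, chunk, gain_db=3.0):
    """Average logits over four views: original, time-reversed, +gain, -gain."""
    up = 10.0 ** (gain_db / 20.0)  # +3 dB as an amplitude factor
    views = [chunk, chunk[::-1].copy(), chunk * up, chunk / up]
    return np.mean([predict_logits(v) for v in views], axis=0)
```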

	### 7. Pseudo-Labeling (Noisy Student)

	Use confident predictions (>0.5) on `train_soundscapes` as additional training data. Retrain with these pseudo-labels + original data.

	Expected boost: 0.84 → 0.88
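
A minimal sketch of the selection step, assuming the ensemble has already produced a `(n_chunks, n_classes)` probability matrix for `train_soundscapes`. Zeroing the sub-threshold classes in the soft labels is an assumed convention, and the function name is hypothetical.

```python
import numpy as np

def select_pseudo_labels(probs, threshold=0.5):
    """Keep soundscape chunks whose top class probability exceeds threshold.

    probs: (n_chunks, n_classes) ensemble probabilities.
    Returns (kept_indices, soft_labels); probabilities at or below the
    threshold are zeroed in the soft labels (an assumed convention).
    """
    probs = np.asarray(probs, dtype=np.float32)
    keep = probs.max(axis=1) > threshold
    soft = np.where(probs > threshold, probs, 0.0)[keep]
    return np.flatnonzero(keep), soft
```

The threshold here only filters training data. It is never applied to submitted predictions, so it does not conflict with the no-threshold rule for inference.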

	---

	## Expected Score Improvements

	| Stage | Technique | Expected Score |
	|-------|-----------|----------------|
	| Baseline | Your 0.815 pipeline | 0.815 |
	| NB2 improvement | AsymmetricLoss + energy crop + no mixup | 0.83-0.85 |
	| 5-fold ensemble | 10 models (5 folds × 2 backbones) | 0.85-0.87 |
	| TTA | 4 variants per chunk | 0.86-0.88 |
	| Pseudo-labeling | Noisy student on soundscapes | 0.88-0.91 |
	| + Better backbone | Bird-MAE or ConvNeXt | 0.90-0.93 |

	---

	## Files

	| File | Purpose |
	|------|---------|
	| `nb01_data_prep.py` | Data cleaning, VAD, StratifiedKFold(5) |
	| `nb02_training.py` | 5-fold training with AsymmetricLoss, SpecAugment |
	| `nb03_pseudo_labeling.py` | Generate pseudo-labels, noisy student |
	| `nb04_inference.py` | 10-model ensemble, TTA, submission generation |

	---

	## How to Run on Kaggle

	### Step 1: Create Dataset from NB1 Output

	After running `nb01_data_prep.py`, create a Kaggle dataset from `/kaggle/working/`:

	```
	train_cleaned_stratified.csv
	soundscape_labels_with_folds.csv
	species_list.csv
	rare_species.csv
	```

	### Step 2: NB2 Training

	```python
	# In Kaggle notebook, attach:
	# - Competition data: birdclef-2026
	# - NB1 output dataset
	# Run nb02_training.py → produces 10 .pt files in /kaggle/working/models/
	```

	Save models as a new Kaggle dataset.

	### Step 3: NB3 Pseudo-Labeling

	```python
	# Attach NB2 model dataset + NB1 data
	# Run nb03_pseudo_labeling.py → produces pseudo_labels_soft.csv
	```

	### Step 4: NB4 Inference (Submission)

	```python
	# Attach NB2 model dataset + competition test data
	# Run nb04_inference.py → produces submission.csv
	```

	---

	## Critical Rules for BirdCLEF

	1. NEVER threshold predictions — It destroys AUC ranking.
	2. NEVER apply non-linear calibration (`p**0.75`, `p/(p+1)`, etc.) — It distorts rank order.
	3. NEVER mixup or label-smooth — It squashes logits into a narrow range, killing AUC spread.
	4. ALWAYS align submission with sample_submission.csv — `sample[["row_id"]].merge(sub, ...)`
	5. ALWAYS ensemble diverse models — Same model, same folds = no gain.
	6. ALWAYS use raw sigmoid outputs — Let the metric handle calibration.
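
Rule 4 can be sketched with pandas as follows. The `how="left"` join, the column reindex, and the fill value for missing rows are assumptions about how the alignment is completed. The tiny frames use made-up row ids and species codes.

```python
import pandas as pd

def align_submission(sample, sub, fill_value=0.0):
    """Reorder `sub` to match sample_submission rows and columns exactly."""
    aligned = sample[["row_id"]].merge(sub, on="row_id", how="left")
    # Same column order as the sample; species missing from `sub` become NaN
    aligned = aligned.reindex(columns=sample.columns)
    return aligned.fillna(fill_value)

# Tiny illustrative frames; row ids and species codes are made up
sample = pd.DataFrame({"row_id": ["a_5", "a_10"], "sp1": [0.0, 0.0], "sp2": [0.0, 0.0]})
sub = pd.DataFrame({"row_id": ["a_10", "a_5"], "sp2": [0.2, 0.4], "sp1": [0.9, 0.1]})
aligned = align_submission(sample, sub)
```

A left merge keeps the row order of `sample`, so the result lines up with `sample_submission.csv` even when predictions were generated in a different order.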

	---

	## References

	- AsymmetricLoss: [arXiv:2009.14119](https://arxiv.org/abs/2009.14119)
	- Bird-MAE: [arXiv:2504.12880](https://arxiv.org/abs/2504.12880)
	- sl-BEATs: [arXiv:2508.11845](https://arxiv.org/abs/2508.11845)
	- Top solution reference: [minalkharat12/birdclef-2026-solution](https://huggingface.co/minalkharat12/birdclef-2026-solution)

	---

	## License

	MIT — Competition code for educational purposes.

	<!-- ml-intern-provenance -->
	## Generated by ML Intern

	This model repository was generated by [ML Intern](https://github.com/huggingface/ml-intern), an agent for machine learning research and development on the Hugging Face Hub.

	- Try ML Intern: https://smolagents-ml-intern.hf.space
	- Source code: https://github.com/huggingface/ml-intern

	## Usage

	Assuming the repository holds only the pipeline scripts listed under Files (no Transformers checkpoint), one way to fetch them is with `huggingface_hub`:

	```python
	from huggingface_hub import snapshot_download

	# Download every file in the repo to the local Hugging Face cache
	repo_dir = snapshot_download(repo_id="hello9972/birdclef-2026-improved")
	print(repo_dir)  # local folder containing the nb0*.py scripts
	```