bhavin273/best_ast_model.pth

gulabjam committed on Mar 11

Commit

b243717

1 Parent(s): 2ffbfde

Added ReadMe

AST_README.md +225 -0

AST_README.md ADDED

	@@ -0,0 +1,225 @@

+# Audio Spectrogram Transformer (AST) for Music Genre Classification
+Fine-tuned [Audio Spectrogram Transformer](https://huggingface.co/MIT/ast-finetuned-audioset-10-10-0.4593) for classifying audio tracks into **10 music genres**. This model achieved the best performance among all approaches tried in this project, reaching a **macro F1 of 0.886 on validation** and **0.857 on the Kaggle leaderboard**.
+---
+## Table of Contents
+- [Overview](#overview)
+- [Model Architecture](#model-architecture)
+- [Preprocessing Pipeline](#preprocessing-pipeline)
+- [Training](#training)
+- [Results](#results)
+- [Usage](#usage)
+- [File Structure](#file-structure)
+- [Acknowledgements](#acknowledgements)
+---
+## Overview
+The Audio Spectrogram Transformer (AST) is a convolution-free, purely attention-based model for audio classification. It was originally pretrained on [AudioSet](https://research.google.com/audioset/) and is fine-tuned here on a custom **messy_mashup** music genre dataset with 10 genres:
+> blues, classical, country, disco, hiphop, jazz, metal, pop, reggae, rock
+Each training sample is synthesized on-the-fly by mixing separated stems (drums, vocals, bass, other) from a random song and injecting environmental noise from the ESC-50 dataset.
+---
+## Model Architecture
+```
+Pretrained Checkpoint: MIT/ast-finetuned-audioset-10-10-0.4593
+Input: Mel spectrogram (1024 frames × 128 mel bins)
+  → Patch embedding (16×16 patches)
+  → 12-layer Vision Transformer encoder
+  → [CLS] token pooling
+  → Linear classifier (AudioSet's 527-class head replaced by a re-initialized 10-class head)
+```
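+The checkpoint's 527-class head is swapped for a 10-class output layer by passing `num_labels=10` together with `ignore_mismatched_sizes=True`, which re-initializes the head instead of raising a shape error. All layers are fine-tuned end-to-end.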
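+```python
+import torch.nn as nn
+from transformers import ASTForAudioClassification
+
+class MusicGenreAST(nn.Module):
+    def __init__(self, num_classes):
+        super().__init__()
+        # Load the AudioSet checkpoint; the mismatched 527-class head is
+        # dropped and a fresh num_classes-way head is initialized.
+        self.ast = ASTForAudioClassification.from_pretrained(
+            "MIT/ast-finetuned-audioset-10-10-0.4593",
+            num_labels=num_classes,
+            ignore_mismatched_sizes=True
+        )
+
+    def forward(self, x):
+        # x: (batch, 1024, 128) spectrograms. Returns the full Hugging Face
+        # output object; class scores are in `outputs.logits`.
+        outputs = self.ast(x)
+        return outputs
+```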
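For a sense of the sequence length the encoder sees, the patch count can be worked out from the input shape. The sketch below assumes the 16×16 patches overlap with a stride of 10 on both axes, as the `10-10` in the checkpoint name suggests; `num_patches` is a hypothetical helper, not part of `AST_Pipeline.py`.

```python
def num_patches(size, patch=16, stride=10):
    """Number of patch positions along one axis for a strided patch embedding."""
    return (size - patch) // stride + 1

freq_patches = num_patches(128)    # mel-bin axis
time_patches = num_patches(1024)   # time-frame axis
total_patches = freq_patches * time_patches

print(freq_patches, time_patches, total_patches)  # 12 101 1212
```

Under that assumption each 10-second clip becomes 1,212 patch tokens (plus the special tokens), which is part of why a small batch size is needed on a single GPU.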
+---
+## Preprocessing Pipeline
+### Audio Construction (Training)
+1. **Genre selection**: A random genre is chosen per sample
+2. **Stem loading**: Each of the 4 stems (drums, vocals, bass, other) is loaded at 16 kHz from a random song, starting at a random offset within the track
+3. **Stem dropout**: Each stem has a 15% chance of being excluded — this teaches the model to classify with incomplete information
+4. **Random gain**: Each included stem is scaled by a random factor in `[0.4, 1.2]` to simulate varying mix balances
+5. **Mixing**: All included stems are summed and peak-normalized
+6. **Noise injection**: A random ESC-50 clip is added at a random level, with the noise divided by a factor sampled uniformly from `[2.0, 8.0]` (a larger divisor means quieter noise)
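Steps 3 to 6 above can be sketched with NumPy as follows. This is an illustrative reconstruction, not the code in `AST_Pipeline.py`: the names `mix_stems` and `peak_normalize` are hypothetical, stems are assumed to be already loaded as equal-length arrays, and normalizing the noise clip before dividing it is an assumption.

```python
import numpy as np

STEMS = ("drums", "vocals", "bass", "other")

def peak_normalize(x):
    """Scale x so its largest absolute sample is 1 (silence is returned unchanged)."""
    peak = np.max(np.abs(x))
    return x / peak if peak > 0 else x

def mix_stems(stems, noise, rng, drop_prob=0.15,
              gain_range=(0.4, 1.2), noise_div_range=(2.0, 8.0)):
    """stems: dict of stem name -> 1-D float array; noise: array of the same length."""
    mix = np.zeros(len(noise), dtype=np.float64)
    for name in STEMS:
        if rng.random() < drop_prob:                    # stem dropout
            continue
        mix += rng.uniform(*gain_range) * stems[name]   # random gain, then sum
    mix = peak_normalize(mix)
    divisor = rng.uniform(*noise_div_range)
    # Assumption: the noise clip is peak-normalized before being divided.
    return mix + peak_normalize(noise) / divisor

rng = np.random.default_rng(67)
stems = {name: rng.standard_normal(160_000) for name in STEMS}
noise = rng.standard_normal(160_000)
sample = mix_stems(stems, noise, rng)   # one 10 s training waveform at 16 kHz
```

Passing a seeded `numpy.random.Generator` keeps the augmentation reproducible, which matters when samples are generated on-the-fly rather than stored.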
+### Feature Extraction
+| Parameter | Value |
+|-----------|-------|
+| Sample rate | 16,000 Hz |
+| Duration | 10 seconds |
+| Mel bands | 128 |
+| FFT size | 400 |
+| Hop length | 160 |
+| Target frames | 1,024 |
+| Normalization | `(mel_dB + 4.26) / 4.56` |
+The mel spectrogram is transposed to shape `(1024, 128)` — 1024 time frames × 128 mel bins — matching the AST's expected input format. Shorter clips are zero-padded; longer clips are truncated.
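The frame arithmetic and the pad/truncate step can be checked with a short sketch. It assumes a centered STFT (librosa's default), for which a clip of `n` samples yields `1 + n // hop_length` frames, and it assumes padding is applied after normalization; `expected_frames` and `to_ast_input` are hypothetical names.

```python
import numpy as np

SAMPLE_RATE = 16_000
DURATION_S = 10
HOP_LENGTH = 160
TARGET_FRAMES = 1024

def expected_frames(num_samples, hop_length=HOP_LENGTH):
    """Frame count of a centered STFT: 1 + num_samples // hop_length."""
    return 1 + num_samples // hop_length

def to_ast_input(mel_db, target_frames=TARGET_FRAMES):
    """(n_mels, frames) dB spectrogram -> normalized (target_frames, n_mels) array."""
    feats = (mel_db.T + 4.26) / 4.56          # time-major, then normalize
    frames, n_mels = feats.shape
    if frames < target_frames:                # zero-pad short clips
        pad = np.zeros((target_frames - frames, n_mels), dtype=feats.dtype)
        feats = np.concatenate([feats, pad], axis=0)
    return feats[:target_frames]              # truncate long clips

print(expected_frames(SAMPLE_RATE * DURATION_S))  # 1001
```

A 10-second clip therefore produces 1,001 frames, so under these assumptions the last 23 rows of each `(1024, 128)` input are padding.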
+### Test-Time Processing
+Test audio is loaded directly (10s at 16 kHz), peak-normalized, and converted to a mel spectrogram using the same parameters. No augmentation is applied at inference.
+---
+## Training
+### Hyperparameters
+| Parameter | Value |
+|-----------|-------|
+| Optimizer | AdamW |
+| Learning rate | 1 × 10⁻⁵ |
+| Weight decay | 0.01 |
+| Batch size | 4 |
+| Gradient accumulation | 4 steps (effective batch size = 16) |
+| Max epochs | 15 |
+| Early stopping patience | 7 epochs |
+| Loss function | CrossEntropyLoss |
+| LR scheduler | ReduceLROnPlateau (factor=0.5, patience=2, min_lr=1e-7) |
+| Training samples | 1,000 per epoch (generated on-the-fly) |
+| Validation samples | 500 per epoch |
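The batch-related rows of the table imply the following per-epoch arithmetic. `accumulation_plan` is a hypothetical helper; whether the leftover mini-batches trigger an extra optimizer step depends on the training loop and is not stated in the README.

```python
import math

def accumulation_plan(num_samples, batch_size, accum_steps):
    """Per-epoch bookkeeping for gradient accumulation."""
    num_batches = math.ceil(num_samples / batch_size)
    full_steps, leftover = divmod(num_batches, accum_steps)
    return {
        "effective_batch_size": batch_size * accum_steps,
        "batches_per_epoch": num_batches,
        "full_optimizer_steps": full_steps,
        "leftover_batches": leftover,
    }

plan = accumulation_plan(num_samples=1000, batch_size=4, accum_steps=4)
print(plan)
# {'effective_batch_size': 16, 'batches_per_epoch': 250,
#  'full_optimizer_steps': 62, 'leftover_batches': 2}
```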
+### Training Strategy
+- **Gradient accumulation** (4 steps) is used to simulate a larger effective batch size while fitting within GPU VRAM
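+- **ReduceLROnPlateau** monitors the macro F1 score and halves the learning rate once the score has gone more than 2 consecutive epochs without improving (PyTorch's `patience=2` tolerates two stagnant epochs and cuts on the third)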
+- **Early stopping** triggers after 7 consecutive epochs without a new best F1 score
+- Best model weights are saved to `best_ast_model.pth` whenever a new best F1 is achieved
+- **WandB** logs all training metrics (train loss, val loss, F1 score, learning rate) per epoch
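How the scheduler, early stopping, and checkpointing interact can be traced with a small pure-Python simulation. This is a simplified stand-in, not PyTorch's `ReduceLROnPlateau` itself: it ignores the scheduler's improvement threshold and cooldown, and `simulate_schedule` is a hypothetical name.

```python
def simulate_schedule(f1_per_epoch, lr=1e-5, factor=0.5, lr_patience=2,
                      min_lr=1e-7, stop_patience=7):
    """Return (best_f1, history) where history holds (epoch, lr, saved) tuples."""
    best_f1 = float("-inf")
    stale_for_lr = 0      # stagnant epochs since the last improvement or LR cut
    stale_for_stop = 0    # stagnant epochs since the last improvement
    history = []
    for epoch, f1 in enumerate(f1_per_epoch, start=1):
        saved = f1 > best_f1
        if saved:                       # new best: checkpoint would be written
            best_f1 = f1
            stale_for_lr = stale_for_stop = 0
        else:
            stale_for_lr += 1
            stale_for_stop += 1
        if stale_for_lr > lr_patience:  # cut on the third stagnant epoch
            lr = max(lr * factor, min_lr)
            stale_for_lr = 0
        history.append((epoch, lr, saved))
        if stale_for_stop >= stop_patience:
            break                       # early stopping
    return best_f1, history

best, history = simulate_schedule([0.50, 0.60, 0.60, 0.59, 0.58, 0.70])
for epoch, lr, saved in history:
    print(epoch, lr, saved)
```

In this made-up F1 sequence the learning rate is halved at epoch 5, after three stagnant epochs, and a new checkpoint would be written at epochs 1, 2, and 6.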
+### Seeds
+| Seed | Value |
+|------|-------|
+| Data seed | 67 |
+| Training seed | 1234 |
+| Train/Val split seed | 42 |
+---
+## Results
+| Metric | Score |
+|--------|:-----:|
+| **Max Validation F1 (macro)** | **0.8861** |
+| **Kaggle Leaderboard Score** | **0.85708** |
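Macro F1 is the unweighted mean of the per-class F1 scores, so each of the 10 genres counts equally regardless of how many samples it has. A minimal pure-Python sketch of the metric follows (the project itself installs scikit-learn, which provides `f1_score(..., average="macro")`); the labels and predictions here are made up for illustration.

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1; a class with no support or predictions scores 0."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

y_true = ["rock", "rock", "jazz", "jazz"]
y_pred = ["rock", "jazz", "jazz", "jazz"]
score = macro_f1(y_true, y_pred, labels=["rock", "jazz"])
print(round(score, 4))  # 0.7333
```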
+### Comparison with Other Models
+| Model | Val F1 | Leaderboard |
+|-------|:------:|:-----------:|
+| CRNN (scratch) | 0.5800 | 0.33103 |
+| EfficientNet-B0 | 0.5258 | 0.31641 |
+| **AST (this model)** | **0.8861** | **0.85708** |
+### Why AST Outperforms
+- **Large-scale pretraining**: The base checkpoint was pretrained on AudioSet (2M+ audio clips), providing robust audio representations
+- **Longer input context**: 10s clips capture more musical structure than the 5s clips used for the other models
+- **Mel spectrogram input**: 128-bin mel spectrograms retain richer frequency detail than MFCCs
+- **Self-attention**: Transformers can model long-range temporal dependencies that CNNs, and even RNNs, struggle to capture
+- **Aggressive augmentation**: Stem dropout, variable gain, and variable SNR noise injection improve generalization
+---
+## Usage
+### Prerequisites
+```bash
+pip install torch transformers librosa numpy pandas scikit-learn wandb
+```
+### Training
+```python
+from AST_Pipeline import MusicGenreAST, train_ast
+model = MusicGenreAST(num_classes=10)
+train_ast(model)
+# Best weights saved to best_ast_model.pth
+```
+### Inference
+```python
+from AST_Pipeline import MusicGenreAST, predict
+results = predict(
+    model_instance=MusicGenreAST(10),
+    model_path='best_ast_model.pth'
+)
+# results: list of genre strings, e.g. ['rock', 'jazz', 'blues', ...]
+```
+### Generating a Submission
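+```python
+import pandas as pd
+
+# `results` is the list of predicted genres returned by predict() above;
+# it is assumed to follow the row order of sample_submission.csv.
+submission_df = pd.read_csv('sample_submission.csv')
+assert len(results) == len(submission_df), "expected one prediction per test row"
+submission = pd.DataFrame({
+    "id": submission_df['id'],
+    "genre": results
+})
+submission.to_csv("submission.csv", index=False)
+```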
+---
+## File Structure
+```
+├── AST_Pipeline.py          # Full pipeline: dataset, model, training, prediction
+├── best_ast_model.pth       # Saved model weights (best validation F1)
+├── requirements.txt         # Python dependencies
+└── AST_README.md            # This file
+```
+### Key Classes & Functions in AST_Pipeline.py
+| Name | Type | Description |
+|------|------|-------------|
+| `ASTAudioDataset` | Dataset | Training/validation dataset with on-the-fly stem mixing and augmentation |
+| `ASTTestDataset` | Dataset | Test dataset — loads audio and converts to mel spectrogram |
+| `MusicGenreAST` | nn.Module | Wrapper around `ASTForAudioClassification` with 10-class head |
+| `build_dataset()` | Function | Builds train/val dictionaries with stratified split |
+| `train_ast()` | Function | Full training loop with gradient accumulation, scheduler, early stopping, and WandB logging |
+| `predict()` | Function | Loads saved weights and runs inference on the test set |
+---
+## Acknowledgements
+- [MIT AST](https://huggingface.co/MIT/ast-finetuned-audioset-10-10-0.4593) — Pretrained Audio Spectrogram Transformer by Yuan Gong et al.
+- [ESC-50](https://github.com/karolpiczak/ESC-50) — Environmental Sound Classification dataset used for noise augmentation
+- [Weights & Biases](https://wandb.ai/) — Experiment tracking
+- [librosa](https://librosa.org/) — Audio analysis and feature extraction