# Beat Tracking Challenge
A challenge for detecting beats and downbeats in music audio, with a focus on handling dynamic tempo changes common in rhythm game charts.
## Goal
The goal is to **detect and identify beats and downbeats** in audio to assist composers by providing a flexible timing grid when working with samples that have dynamic BPM changes.
- **Beat**: A regular pulse in music (e.g., quarter notes in 4/4 time)
- **Downbeat**: The first beat of each measure (the "1" in counting "1-2-3-4")
This is particularly useful for:
- Music production with samples of varying tempos
- Rhythm game chart creation and verification
- Audio analysis and music information retrieval (MIR)
---
## Dataset
The dataset is derived from Taiko no Tatsujin rhythm game charts, providing high-quality human-annotated beat and downbeat ground truth.
**Source**: [`JacobLinCool/taiko-1000-parsed`](https://huggingface.co/datasets/JacobLinCool/taiko-1000-parsed)
| Split | Tracks | Duration | Description |
|-------|--------|----------|-------------|
| `train` | ~1000 | 1-3 min each | Training data with beat/downbeat annotations |
| `test` | ~100 | 1-3 min each | Held-out test set for final evaluation |
### Data Features
Each example contains:
| Field | Type | Description |
|-------|------|-------------|
| `audio` | `Audio` | Audio waveform at 16kHz sample rate |
| `title` | `str` | Track title |
| `beats` | `list[float]` | Beat timestamps in seconds |
| `downbeats` | `list[float]` | Downbeat timestamps in seconds |
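For orientation, the example below loads one track and reads these fields. It is a minimal sketch, assuming the Hugging Face `datasets` library and a standard `Audio` column; see the dataset card for the authoritative schema.

```python
from datasets import load_dataset

# Minimal loading sketch (assumes the `datasets` library is installed).
ds = load_dataset("JacobLinCool/taiko-1000-parsed", split="train")

example = ds[0]
waveform = example["audio"]["array"]        # float waveform, 16 kHz
sr = example["audio"]["sampling_rate"]      # 16000
beats = example["beats"]                    # beat timestamps in seconds
downbeats = example["downbeats"]            # downbeat timestamps in seconds
print(example["title"], len(beats), len(downbeats))
```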
### Dataset Characteristics
- **Dynamic BPM**: Many tracks feature tempo changes mid-song
- **Variable Time Signatures**: Common patterns include 4/4, 3/4, 6/8, and more exotic meters
- **Diverse Genres**: Japanese pop, anime themes, classical arrangements, electronic music
- **High-Quality Annotations**: Derived from professional rhythm game charts
---
## Evaluation Metrics
The evaluation considers both **timing accuracy** and **metrical correctness**. Models are evaluated on both beat and downbeat detection tasks.
### Primary Metrics
#### 1. Weighted F1-Score (Main Ranking Metric)
F1-scores are calculated at multiple timing thresholds (3ms to 30ms), then combined with inverse-threshold weighting:
| Threshold | Weight | Rationale |
|-----------|--------|-----------|
| 3ms | 1.000 | Full weight for highest precision |
| 6ms | 0.500 | Half weight |
| 9ms | 0.333 | One-third weight |
| 12ms | 0.250 | ... |
| 15ms | 0.200 | |
| 18ms | 0.167 | |
| 21ms | 0.143 | |
| 24ms | 0.125 | |
| 27ms | 0.111 | |
| 30ms | 0.100 | Minimum weight for coarsest threshold |
**Formula:**
```
Weighted F1 = Σ(w_t × F1_t) / Σ(w_t)
where w_t = 3ms / t (inverse threshold weighting)
```
This weighting scheme rewards models that achieve high precision at tight tolerances while still considering coarser thresholds.
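A minimal sketch of this computation, assuming `mir_eval` for the per-threshold F1; the repository's own implementation in `exp/data/eval.py` is the reference and may differ in detail.

```python
import numpy as np
import mir_eval

def weighted_f1(reference, estimated, thresholds_ms=range(3, 31, 3)):
    """Combine per-threshold F1 scores with inverse-threshold weights w_t = 3ms / t."""
    reference, estimated = np.asarray(reference), np.asarray(estimated)
    total, weight_sum = 0.0, 0.0
    for t in thresholds_ms:
        f1 = mir_eval.beat.f_measure(reference, estimated, f_measure_threshold=t / 1000.0)
        w = 3.0 / t
        total += w * f1
        weight_sum += w
    return total / weight_sum
```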
#### 2. Continuity Metrics (CMLt, AMLt)
Based on the MIREX beat tracking evaluation protocol using `mir_eval`:
| Metric | Full Name | Description |
|--------|-----------|-------------|
| **CMLt** | Correct Metrical Level Total | Percentage of beats correctly tracked at the exact metrical level (±17.5% of beat interval) |
| **AMLt** | Any Metrical Level Total | Same as CMLt, but allows for acceptable metrical variations (double/half tempo, off-beat) |
| **CMLc** | Correct Metrical Level Continuous | Longest continuous correctly-tracked segment at exact metrical level |
| **AMLc** | Any Metrical Level Continuous | Longest continuous segment at any acceptable metrical level |
**Note:** Continuity metrics use a default `min_beat_time` of 5.0 seconds (the first 5 seconds are skipped) to avoid evaluating the potentially unstable tempo at the beginning of tracks.
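A sketch of how these metrics can be computed with `mir_eval` (the repository's `exp/data/eval.py` may wrap this differently):

```python
import numpy as np
import mir_eval

def continuity_metrics(gt_beats, pred_beats, min_beat_time=5.0):
    """Return (CMLc, CMLt, AMLc, AMLt), ignoring beats before `min_beat_time` seconds."""
    ref = mir_eval.beat.trim_beats(np.asarray(gt_beats), min_beat_time=min_beat_time)
    est = mir_eval.beat.trim_beats(np.asarray(pred_beats), min_beat_time=min_beat_time)
    return mir_eval.beat.continuity(ref, est)
```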
### Metric Interpretation
| Metric | What it measures | Good Score |
|--------|------------------|------------|
| Weighted F1 | Precise timing accuracy | > 0.7 |
| CMLt | Correct tempo tracking | > 0.8 |
| AMLt | Tempo tracking (flexible) | > 0.9 |
| CMLc | Longest stable segment | > 0.5 |
### Evaluation Summary
For each model, we report:
```
Beat Detection:
Weighted F1: X.XXXX
CMLt: X.XXXX AMLt: X.XXXX
CMLc: X.XXXX AMLc: X.XXXX
Downbeat Detection:
Weighted F1: X.XXXX
CMLt: X.XXXX AMLt: X.XXXX
CMLc: X.XXXX AMLc: X.XXXX
Combined Weighted F1: X.XXXX (average of beat and downbeat)
```
### Benchmark Results
Results evaluated on 100 tracks from the test set:
| Model | Combined F1 | Beat F1 | Downbeat F1 | CMLt (Beat) | CMLt (Downbeat) |
|-------|-------------|---------|-------------|-------------|-----------------|
| **Baseline 1 (ODCNN)** | 0.0765 | 0.0861 | 0.0669 | 0.0731 | 0.0321 |
| **Baseline 2 (ResNet-SE)** | **0.2775** | **0.3292** | **0.2258** | **0.3287** | **0.1146** |
*Note: Baseline 2 (ResNet-SE) performs significantly better due to its larger context window and deeper architecture.*
---
## Quick Start
### Setup
```bash
uv sync
```
### Train Models
```bash
# Train Baseline 1 (ODCNN)
uv run -m exp.baseline1.train
# Train Baseline 2 (ResNet-SE)
uv run -m exp.baseline2.train
# Train specific target only (e.g. for Baseline 2)
uv run -m exp.baseline2.train --target beats
uv run -m exp.baseline2.train --target downbeats
```
### Run Evaluation
```bash
# Evaluation (replace baseline1 with baseline2 to evaluate the new model)
uv run -m exp.baseline1.eval
# Full evaluation with visualization and audio
uv run -m exp.baseline1.eval --visualize --synthesize --summary-plot
# Evaluate on more samples with custom output directory
uv run -m exp.baseline1.eval --num-samples 50 --output-dir outputs/eval_baseline1
```
### Evaluation Options
| Option | Description |
|--------|-------------|
| `--model-dir DIR` | Model directory (default: `outputs/baseline1`) |
| `--num-samples N` | Number of samples to evaluate (default: 20) |
| `--output-dir DIR` | Output directory (default: `outputs/eval`) |
| `--visualize` | Generate visualization plots for each track |
| `--synthesize` | Generate audio files with click tracks |
| `--viz-tracks N` | Number of tracks to visualize/synthesize (default: 5) |
| `--time-range START END` | Limit visualization time range (seconds) |
| `--click-volume FLOAT` | Click sound volume (0.0 to 1.0, default: 0.5) |
| `--summary-plot` | Generate summary evaluation bar charts |
---
## Visualization & Audio Tools
### Beat Visualization
Generate plots comparing predicted vs ground truth beats:
```bash
uv run -m exp.baseline1.eval --visualize --viz-tracks 10
```
Output: `outputs/eval/plots/track_XXX.png`
### Click Track Audio
Generate audio files with click sounds overlaid on the original music:
```bash
uv run -m exp.baseline1.eval --synthesize
```
Output files in `outputs/eval/audio/`:
- `track_XXX_pred.wav` - Original audio + predicted beat clicks (1000Hz beat, 1500Hz downbeat)
- `track_XXX_gt.wav` - Original audio + ground truth clicks (800Hz beat, 1200Hz downbeat)
- `track_XXX_both.wav` - Original audio + both prediction and ground truth clicks
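Conceptually, overlaying clicks amounts to mixing short sine bursts into the waveform. The sketch below assumes `numpy` and `soundfile`; `audio`, `sr`, `pred_beats`, and `pred_downbeats` are placeholders, and the actual implementation lives in `exp/data/audio.py`.

```python
import numpy as np
import soundfile as sf

def add_clicks(audio, sr, times, freq, click_dur=0.03, volume=0.5):
    """Mix short sine bursts into `audio` (mono float array) at the given timestamps in seconds."""
    out = audio.copy()
    n = int(click_dur * sr)
    click = volume * np.sin(2 * np.pi * freq * np.arange(n) / sr) * np.hanning(n)
    for time in times:
        start = int(time * sr)
        if start >= len(out):
            continue
        end = min(start + n, len(out))
        out[start:end] += click[: end - start]
    return out

# `audio`, `sr`, `pred_beats`, `pred_downbeats` are placeholders for a loaded track and model output.
mixed = add_clicks(add_clicks(audio, sr, pred_beats, 1000), sr, pred_downbeats, 1500)
sf.write("track_000_pred.wav", np.clip(mixed, -1.0, 1.0), sr)
```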
### Summary Plot
Generate bar charts summarizing F1 scores and continuity metrics:
```bash
uv run -m exp.baseline1.eval --summary-plot
```
Output: `outputs/eval/evaluation_summary.png`
---
## Models
### Baseline 1: ODCNN
A classic baseline model from roughly a decade ago: <https://ieeexplore.ieee.org/document/6854953>.
The original baseline implements the **Onset Detection CNN (ODCNN)** architecture:
#### Architecture
- **Input**: Multi-view mel spectrogram (3 window sizes: 23ms, 46ms, 93ms)
- **CNN Backbone**: 3 convolutional blocks with max pooling
- **Output**: Frame-level beat/downbeat probability
- **Inference**: ±7 frames context (±70ms)
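A sketch of the multi-view input described above, assuming `librosa`; the hop length (10 ms) and 80 mel bands are illustrative choices, not necessarily those used in `exp/baseline1/data.py`.

```python
import librosa
import numpy as np

def multi_view_mel(y, sr=16000, win_ms=(23, 46, 93), hop_ms=10, n_mels=80):
    """Stack log-mel spectrograms computed with three window sizes into one 3-channel input."""
    hop = int(sr * hop_ms / 1000)
    views = []
    for ms in win_ms:
        n_fft = int(sr * ms / 1000)
        mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft, hop_length=hop, n_mels=n_mels)
        views.append(librosa.power_to_db(mel))
    return np.stack(views)  # (3, n_mels, n_frames): one channel per window size
```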
### Baseline 2: ResNet-SE
Inspired by Squeeze-and-Excitation networks: <https://arxiv.org/abs/1709.01507>.
A modernized architecture designed to capture longer temporal context:
#### Architecture
- **Input**: Mel spectrogram with larger context
- **Backbone**: ResNet with Squeeze-and-Excitation (SE) blocks
- **Context**: **±50 frames (~1s)** window
- **Features**: Deeper network (4 stages) with channel attention
- **Parameters**: ~400k (small and efficient)
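For illustration, a minimal squeeze-and-excitation block in PyTorch; the actual blocks live in `exp/baseline2/model.py` and may differ in details such as the reduction ratio.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Channel attention: squeeze spatial dims, then rescale each channel."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):  # x: (batch, channels, freq, time)
        b, c, _, _ = x.shape
        weights = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * weights
```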
### Training Details
Both models use similar training loops:
- **Optimizer**: SGD (Baseline 1) / AdamW (Baseline 2)
- **Learning Rate**: Cosine annealing
- **Loss**: Binary Cross-Entropy
- **Epochs**: 50 (Baseline 1) / 3 (Baseline 2)
- **Batch Size**: 512 (Baseline 1) / 128 (Baseline 2)
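Schematically, the Baseline 2 recipe looks like the sketch below; `model` and `train_loader` are placeholders, the learning rate is illustrative, and the real loop is in `exp/baseline2/train.py`.

```python
import torch
import torch.nn as nn

num_epochs = 3  # Baseline 2 setting from the list above
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)            # `model` is a placeholder
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_epochs)
criterion = nn.BCEWithLogitsLoss()                                    # frame-level binary targets

for epoch in range(num_epochs):
    for features, targets in train_loader:                            # `train_loader` is a placeholder
        optimizer.zero_grad()
        loss = criterion(model(features), targets)
        loss.backward()
        optimizer.step()
    scheduler.step()
```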
---
## Project Structure
```
exp-onset/
├── exp/
│   ├── baseline1/              # Baseline 1 (ODCNN)
│   │   ├── model.py            # ODCNN architecture
│   │   ├── train.py
│   │   ├── eval.py
│   │   ├── data.py
│   │   └── utils.py
│   ├── baseline2/              # Baseline 2 (ResNet-SE)
│   │   ├── model.py            # ResNet-SE
│   │   ├── train.py
│   │   ├── eval.py
│   │   └── data.py
│   └── data/
│       ├── load.py             # Dataset loading & preprocessing
│       ├── eval.py             # Evaluation metrics (F1, CML, AML)
│       ├── audio.py            # Click track synthesis
│       └── viz.py              # Visualization utilities
├── outputs/
│   ├── baseline1/              # Trained models (Baseline 1)
│   ├── baseline2/              # Trained models (Baseline 2)
│   └── eval/                   # Evaluation outputs
│       ├── plots/              # Visualization images
│       ├── audio/              # Click track audio files
│       └── evaluation_summary.png
├── README.md
├── DATASET.md                  # Raw dataset specification
└── pyproject.toml
```
---
## License
This project is for research and educational purposes. The dataset is derived from publicly available rhythm game charts.