# Beat Tracking Challenge
A challenge for detecting beats and downbeats in music audio, with a focus on handling dynamic tempo changes common in rhythm game charts.
## Goal
The goal is to **detect and identify beats and downbeats** in audio to assist composers by providing a flexible timing grid when working with samples that have dynamic BPM changes.
- **Beat**: A regular pulse in music (e.g., quarter notes in 4/4 time)
- **Downbeat**: The first beat of each measure (the "1" in counting "1-2-3-4")
This is particularly useful for:
- Music production with samples of varying tempos
- Rhythm game chart creation and verification
- Audio analysis and music information retrieval (MIR)
---
## Dataset
The dataset is derived from Taiko no Tatsujin rhythm game charts, providing high-quality human-annotated beat and downbeat ground truth.
**Source**: [`JacobLinCool/taiko-1000-parsed`](https://huggingface.co/datasets/JacobLinCool/taiko-1000-parsed)
| Split | Tracks | Duration | Description |
|-------|--------|----------|-------------|
| `train` | ~1000 | 1-3 min each | Training data with beat/downbeat annotations |
| `test` | ~100 | 1-3 min each | Held-out test set for final evaluation |
### Data Features
Each example contains:
| Field | Type | Description |
|-------|------|-------------|
| `audio` | `Audio` | Audio waveform at 16kHz sample rate |
| `title` | `str` | Track title |
| `beats` | `list[float]` | Beat timestamps in seconds |
| `downbeats` | `list[float]` | Downbeat timestamps in seconds |
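For reference, a single example can be loaded with the Hugging Face `datasets` library. This is a minimal sketch; the field access simply follows the schema above:
```python
# Minimal sketch: load one training example and inspect its annotations.
from datasets import load_dataset

ds = load_dataset("JacobLinCool/taiko-1000-parsed", split="train")
example = ds[0]
audio = example["audio"]  # dict with "array" (waveform) and "sampling_rate" (16000)
print(example["title"])
print(f"{len(example['beats'])} beats, {len(example['downbeats'])} downbeats")
```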
### Dataset Characteristics
- **Dynamic BPM**: Many tracks feature tempo changes mid-song
- **Variable Time Signatures**: Common patterns include 4/4, 3/4, 6/8, and more exotic meters
- **Diverse Genres**: Japanese pop, anime themes, classical arrangements, electronic music
- **High-Quality Annotations**: Derived from professional rhythm game charts
---
## Evaluation Metrics
The evaluation considers both **timing accuracy** and **metrical correctness**. Models are evaluated on both beat and downbeat detection tasks.
### Primary Metrics
#### 1. Weighted F1-Score (Main Ranking Metric)
F1-scores are calculated at multiple timing thresholds (3ms to 30ms), then combined with inverse-threshold weighting:
| Threshold | Weight | Rationale |
|-----------|--------|-----------|
| 3ms | 1.000 | Full weight for highest precision |
| 6ms | 0.500 | Half weight |
| 9ms | 0.333 | One-third weight |
| 12ms | 0.250 | ... |
| 15ms | 0.200 | |
| 18ms | 0.167 | |
| 21ms | 0.143 | |
| 24ms | 0.125 | |
| 27ms | 0.111 | |
| 30ms | 0.100 | Minimum weight for coarsest threshold |
**Formula:**
```
Weighted F1 = Ξ£(w_t Γ F1_t) / Ξ£(w_t)
where w_t = 3ms / t (inverse threshold weighting)
```
This weighting scheme rewards models that achieve high precision at tight tolerances while still considering coarser thresholds.
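As an illustration, the weighted score can be computed from per-threshold F1 values obtained with `mir_eval.beat.f_measure`. This is a sketch of the scheme described above, not the repository's `exp/data/eval.py` implementation:
```python
# Sketch of the inverse-threshold weighting: w_t = 3ms / t.
import numpy as np
import mir_eval

def weighted_f1(reference_beats, estimated_beats):
    thresholds = np.arange(1, 11) * 0.003  # 3 ms .. 30 ms, in seconds
    weights = 0.003 / thresholds           # 1.000, 0.500, ..., 0.100
    f1s = np.array([
        mir_eval.beat.f_measure(reference_beats, estimated_beats,
                                f_measure_threshold=t)
        for t in thresholds
    ])
    return float((weights * f1s).sum() / weights.sum())
```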
#### 2. Continuity Metrics (CMLt, AMLt)
Based on the MIREX beat tracking evaluation protocol using `mir_eval`:
| Metric | Full Name | Description |
|--------|-----------|-------------|
| **CMLt** | Correct Metrical Level Total | Percentage of beats correctly tracked at the exact metrical level (±17.5% of beat interval) |
| **AMLt** | Any Metrical Level Total | Same as CMLt, but allows for acceptable metrical variations (double/half tempo, off-beat) |
| **CMLc** | Correct Metrical Level Continuous | Longest continuous correctly-tracked segment at exact metrical level |
| **AMLc** | Any Metrical Level Continuous | Longest continuous segment at any acceptable metrical level |
**Note:** Continuity metrics are computed with `min_beat_time=5.0` (beats in the first 5 seconds are discarded) to avoid evaluating the potentially unstable tempo at the start of a track.
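These scores map directly onto `mir_eval.beat`. A minimal sketch, assuming beat lists are 1-D arrays of times in seconds (`trim_beats` applies the 5-second cutoff, and the ±17.5% tolerance is `mir_eval`'s default):
```python
# Sketch: continuity metrics with mir_eval, skipping the first 5 seconds.
import mir_eval

def continuity_scores(reference_beats, estimated_beats):
    ref = mir_eval.beat.trim_beats(reference_beats, min_beat_time=5.0)
    est = mir_eval.beat.trim_beats(estimated_beats, min_beat_time=5.0)
    cmlc, cmlt, amlc, amlt = mir_eval.beat.continuity(ref, est)
    return {"CMLc": cmlc, "CMLt": cmlt, "AMLc": amlc, "AMLt": amlt}
```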
### Metric Interpretation
| Metric | What it measures | Good Score |
|--------|------------------|------------|
| Weighted F1 | Precise timing accuracy | > 0.7 |
| CMLt | Correct tempo tracking | > 0.8 |
| AMLt | Tempo tracking (flexible) | > 0.9 |
| CMLc | Longest stable segment | > 0.5 |
### Evaluation Summary
For each model, we report:
```
Beat Detection:
  Weighted F1: X.XXXX
  CMLt: X.XXXX   AMLt: X.XXXX
  CMLc: X.XXXX   AMLc: X.XXXX

Downbeat Detection:
  Weighted F1: X.XXXX
  CMLt: X.XXXX   AMLt: X.XXXX
  CMLc: X.XXXX   AMLc: X.XXXX

Combined Weighted F1: X.XXXX (average of beat and downbeat)
```
### Benchmark Results
Results evaluated on 100 tracks from the test set:
| Model | Combined F1 | Beat F1 | Downbeat F1 | CMLt (Beat) | CMLt (Downbeat) |
|-------|-------------|---------|-------------|-------------|-----------------|
| **Baseline 1 (ODCNN)** | 0.0765 | 0.0861 | 0.0669 | 0.0731 | 0.0321 |
| **Baseline 2 (ResNet-SE)** | **0.2775** | **0.3292** | **0.2258** | **0.3287** | **0.1146** |
*Note: Baseline 2 (ResNet-SE) performs significantly better thanks to its larger context window and deeper architecture.*
---
## Quick Start
### Setup
```bash
uv sync
```
### Train Models
```bash
# Train Baseline 1 (ODCNN)
uv run -m exp.baseline1.train
# Train Baseline 2 (ResNet-SE)
uv run -m exp.baseline2.train
# Train specific target only (e.g. for Baseline 2)
uv run -m exp.baseline2.train --target beats
uv run -m exp.baseline2.train --target downbeats
```
### Run Evaluation
```bash
# Evaluation (replace baseline1 with baseline2 to evaluate the new model)
uv run -m exp.baseline1.eval
# Full evaluation with visualization and audio
uv run -m exp.baseline1.eval --visualize --synthesize --summary-plot
# Evaluate on more samples with custom output directory
uv run -m exp.baseline1.eval --num-samples 50 --output-dir outputs/eval_baseline1
```
### Evaluation Options
| Option | Description |
|--------|-------------|
| `--model-dir DIR` | Model directory (default: `outputs/baseline1`) |
| `--num-samples N` | Number of samples to evaluate (default: 20) |
| `--output-dir DIR` | Output directory (default: `outputs/eval`) |
| `--visualize` | Generate visualization plots for each track |
| `--synthesize` | Generate audio files with click tracks |
| `--viz-tracks N` | Number of tracks to visualize/synthesize (default: 5) |
| `--time-range START END` | Limit visualization time range (seconds) |
| `--click-volume FLOAT` | Click sound volume (0.0 to 1.0, default: 0.5) |
| `--summary-plot` | Generate summary evaluation bar charts |
---
## Visualization & Audio Tools
### Beat Visualization
Generate plots comparing predicted vs ground truth beats:
```bash
uv run -m exp.baseline1.eval --visualize --viz-tracks 10
```
Output: `outputs/eval/plots/track_XXX.png`
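A minimal matplotlib sketch of the same idea (hypothetical; the repository's plots come from `exp/data/viz.py`):
```python
# Sketch: ground truth vs. predicted beats as vertical lines over an excerpt.
import matplotlib.pyplot as plt

def plot_beats(gt_beats, pred_beats, start=0.0, end=30.0, path="track_000.png"):
    fig, ax = plt.subplots(figsize=(12, 2))
    for t in (b for b in gt_beats if start <= b <= end):
        ax.axvline(t, color="tab:green", alpha=0.7)          # ground truth
    for t in (b for b in pred_beats if start <= b <= end):
        ax.axvline(t, color="tab:red", ls="--", alpha=0.7)   # predictions
    ax.set_xlim(start, end)
    ax.set_xlabel("Time (s)")
    fig.savefig(path, bbox_inches="tight")
    plt.close(fig)
```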
### Click Track Audio
Generate audio files with click sounds overlaid on the original music:
```bash
uv run -m exp.baseline1.eval --synthesize
```
Output files in `outputs/eval/audio/`:
- `track_XXX_pred.wav` - Original audio + predicted beat clicks (1000Hz beat, 1500Hz downbeat)
- `track_XXX_gt.wav` - Original audio + ground truth clicks (800Hz beat, 1200Hz downbeat)
- `track_XXX_both.wav` - Original audio + both prediction and ground truth clicks
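A minimal click-overlay sketch, using the predicted-beat frequencies listed above (a hypothetical helper, not `exp/data/audio.py`; assumes mono float audio in [-1, 1]):
```python
# Sketch: overlay short decaying sine clicks at beat/downbeat times.
import numpy as np

def add_clicks(audio, sr, beats, downbeats, volume=0.5):
    out = audio.astype(np.float32).copy()
    t = np.arange(int(0.03 * sr)) / sr  # 30 ms click
    for times, freq in [(beats, 1000.0), (downbeats, 1500.0)]:
        click = volume * np.sin(2 * np.pi * freq * t) * np.exp(-t * 200)
        for sec in times:
            start = int(sec * sr)
            end = min(start + len(click), len(out))
            out[start:end] += click[: end - start]
    return np.clip(out, -1.0, 1.0)
```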
### Summary Plot
Generate bar charts summarizing F1 scores and continuity metrics:
```bash
uv run -m exp.baseline1.eval --summary-plot
```
Output: `outputs/eval/evaluation_summary.png`
---
## Models
### Baseline 1: ODCNN
A classic baseline model from 2014: <https://ieeexplore.ieee.org/document/6854953>.
This baseline implements the **Onset Detection CNN (ODCNN)** architecture:
#### Architecture
- **Input**: Multi-view mel spectrogram (3 window sizes: 23ms, 46ms, 93ms; see the sketch below)
- **CNN Backbone**: 3 convolutional blocks with max pooling
- **Output**: Frame-level beat/downbeat probability
- **Inference**: ±7 frames context (±70ms)
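A hedged sketch of the multi-view input (hop length and mel-band count are assumptions, not the repository's exact parameters):
```python
# Sketch: three mel spectrograms with different window lengths,
# stacked as channels to form the multi-view input.
import numpy as np
import librosa

def multi_view_mel(y, sr=16000, hop_length=160, n_mels=80):
    views = []
    for win_ms in (23, 46, 93):
        n_fft = int(sr * win_ms / 1000)
        mel = librosa.feature.melspectrogram(
            y=y, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels)
        views.append(librosa.power_to_db(mel))
    return np.stack(views)  # shape: (3, n_mels, n_frames)
```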
### Baseline 2: ResNet-SE
Inspired by Squeeze-and-Excitation networks: <https://arxiv.org/abs/1709.01507>.
A modernized architecture designed to capture longer temporal context:
#### Architecture
- **Input**: Mel spectrogram with larger context
- **Backbone**: ResNet with Squeeze-and-Excitation (SE) blocks
- **Context**: **±50 frames (~1s)** window
- **Features**: Deeper network (4 stages) with effective channel attention
- **Parameters**: ~400k (small and efficient)
### Training Details
Both models use similar training loops (see the sketch after this list):
- **Optimizer**: SGD (Baseline 1) / AdamW (Baseline 2)
- **Learning Rate**: Cosine annealing
- **Loss**: Binary Cross-Entropy
- **Epochs**: 50 (Baseline 1) / 3 (Baseline 2)
- **Batch Size**: 512 (Baseline 1) / 128 (Baseline 2)
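A minimal PyTorch sketch of the Baseline 2 settings (the learning rate, model, and dataloader are assumptions):
```python
# Sketch: AdamW + cosine annealing + BCE on frame-level logits.
import torch

def train(model, loader, epochs=3, lr=1e-3, device="cuda"):
    model.to(device)
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=epochs)
    loss_fn = torch.nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        for spec, target in loader:  # mel patches and 0/1 frame labels
            opt.zero_grad()
            loss = loss_fn(model(spec.to(device)), target.to(device))
            loss.backward()
            opt.step()
        sched.step()
    return model
```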
---
## Project Structure
```
exp-onset/
├── exp/
│   ├── baseline1/            # Baseline 1 (ODCNN)
│   │   ├── model.py          # ODCNN architecture
│   │   ├── train.py
│   │   ├── eval.py
│   │   ├── data.py
│   │   └── utils.py
│   ├── baseline2/            # Baseline 2 (ResNet-SE)
│   │   ├── model.py          # ResNet-SE
│   │   ├── train.py
│   │   ├── eval.py
│   │   └── data.py
│   └── data/
│       ├── load.py           # Dataset loading & preprocessing
│       ├── eval.py           # Evaluation metrics (F1, CML, AML)
│       ├── audio.py          # Click track synthesis
│       └── viz.py            # Visualization utilities
├── outputs/
│   ├── baseline1/            # Trained models (Baseline 1)
│   ├── baseline2/            # Trained models (Baseline 2)
│   └── eval/                 # Evaluation outputs
│       ├── plots/            # Visualization images
│       ├── audio/            # Click track audio files
│       └── evaluation_summary.png
├── README.md
├── DATASET.md                # Raw dataset specification
└── pyproject.toml
```
---
## License
This project is for research and educational purposes. The dataset is derived from publicly available rhythm game charts.