# Beat Tracking Challenge
A challenge for detecting beats and downbeats in music audio, with a focus on handling dynamic tempo changes common in rhythm game charts.
## Goal
The goal is to **detect and identify beats and downbeats** in audio to assist composers by providing a flexible timing grid when working with samples that have dynamic BPM changes.
- **Beat**: A regular pulse in music (e.g., quarter notes in 4/4 time)
- **Downbeat**: The first beat of each measure (the "1" in counting "1-2-3-4")
This is particularly useful for:
- Music production with samples of varying tempos
- Rhythm game chart creation and verification
- Audio analysis and music information retrieval (MIR)
---
## Dataset
The dataset is derived from Taiko no Tatsujin rhythm game charts, providing high-quality human-annotated beat and downbeat ground truth.
**Source**: [`JacobLinCool/taiko-1000-parsed`](https://huggingface.co/datasets/JacobLinCool/taiko-1000-parsed)
| Split | Tracks | Duration | Description |
|-------|--------|----------|-------------|
| `train` | ~1000 | 1-3 min each | Training data with beat/downbeat annotations |
| `test` | ~100 | 1-3 min each | Held-out test set for final evaluation |
### Data Features
Each example contains:
| Field | Type | Description |
|-------|------|-------------|
| `audio` | `Audio` | Audio waveform at 16kHz sample rate |
| `title` | `str` | Track title |
| `beats` | `list[float]` | Beat timestamps in seconds |
| `downbeats` | `list[float]` | Downbeat timestamps in seconds |
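For orientation, the example below loads one track and reads these fields. It is a minimal sketch, assuming the Hugging Face `datasets` library and a standard `Audio` column; see the dataset card for the authoritative schema.

```python
from datasets import load_dataset

# Minimal loading sketch (assumes the `datasets` library is installed).
ds = load_dataset("JacobLinCool/taiko-1000-parsed", split="train")

example = ds[0]
waveform = example["audio"]["array"]        # float waveform, 16 kHz
sr = example["audio"]["sampling_rate"]      # 16000
beats = example["beats"]                    # beat timestamps in seconds
downbeats = example["downbeats"]            # downbeat timestamps in seconds
print(example["title"], len(beats), len(downbeats))
```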
### Dataset Characteristics
- **Dynamic BPM**: Many tracks feature tempo changes mid-song
- **Variable Time Signatures**: Common patterns include 4/4, 3/4, 6/8, and more exotic meters
- **Diverse Genres**: Japanese pop, anime themes, classical arrangements, electronic music
- **High-Quality Annotations**: Derived from professional rhythm game charts
---
## Evaluation Metrics
The evaluation considers both **timing accuracy** and **metrical correctness**. Models are evaluated on both beat and downbeat detection tasks.
### Primary Metrics
#### 1. Weighted F1-Score (Main Ranking Metric)
F1-scores are calculated at multiple timing thresholds (3ms to 30ms), then combined with inverse-threshold weighting:
| Threshold | Weight | Rationale |
|-----------|--------|-----------|
| 3ms | 1.000 | Full weight for highest precision |
| 6ms | 0.500 | Half weight |
| 9ms | 0.333 | One-third weight |
| 12ms | 0.250 | ... |
| 15ms | 0.200 | |
| 18ms | 0.167 | |
| 21ms | 0.143 | |
| 24ms | 0.125 | |
| 27ms | 0.111 | |
| 30ms | 0.100 | Minimum weight for coarsest threshold |
**Formula:**
```
Weighted F1 = Σ(w_t × F1_t) / Σ(w_t)
where w_t = 3ms / t (inverse threshold weighting)
```
This weighting scheme rewards models that achieve high precision at tight tolerances while still considering coarser thresholds.
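A minimal sketch of this computation, assuming `mir_eval` for the per-threshold F1; the repository's own implementation in `exp/data/eval.py` is the reference and may differ in detail.

```python
import numpy as np
import mir_eval

def weighted_f1(reference, estimated, thresholds_ms=range(3, 31, 3)):
    """Combine per-threshold F1 scores with inverse-threshold weights w_t = 3ms / t."""
    reference, estimated = np.asarray(reference), np.asarray(estimated)
    total, weight_sum = 0.0, 0.0
    for t in thresholds_ms:
        f1 = mir_eval.beat.f_measure(reference, estimated, f_measure_threshold=t / 1000.0)
        w = 3.0 / t
        total += w * f1
        weight_sum += w
    return total / weight_sum
```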
#### 2. Continuity Metrics (CMLt, AMLt)
Based on the MIREX beat tracking evaluation protocol using `mir_eval`:
| Metric | Full Name | Description |
|--------|-----------|-------------|
| **CMLt** | Correct Metrical Level Total | Percentage of beats correctly tracked at the exact metrical level (±17.5% of beat interval) |
| **AMLt** | Any Metrical Level Total | Same as CMLt, but allows for acceptable metrical variations (double/half tempo, off-beat) |
| **CMLc** | Correct Metrical Level Continuous | Longest continuous correctly-tracked segment at exact metrical level |
| **AMLc** | Any Metrical Level Continuous | Longest continuous segment at any acceptable metrical level |
**Note:** Continuity metrics use a default `min_beat_time` of 5.0 seconds (the first 5 seconds are skipped) to avoid evaluating the potentially unstable tempo at the beginning of tracks.
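A sketch of how these metrics can be computed with `mir_eval` (the repository's `exp/data/eval.py` may wrap this differently):

```python
import numpy as np
import mir_eval

def continuity_metrics(gt_beats, pred_beats, min_beat_time=5.0):
    """Return (CMLc, CMLt, AMLc, AMLt), ignoring beats before `min_beat_time` seconds."""
    ref = mir_eval.beat.trim_beats(np.asarray(gt_beats), min_beat_time=min_beat_time)
    est = mir_eval.beat.trim_beats(np.asarray(pred_beats), min_beat_time=min_beat_time)
    return mir_eval.beat.continuity(ref, est)
```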
### Metric Interpretation
| Metric | What it measures | Good Score |
|--------|------------------|------------|
| Weighted F1 | Precise timing accuracy | > 0.7 |
| CMLt | Correct tempo tracking | > 0.8 |
| AMLt | Tempo tracking (flexible) | > 0.9 |
| CMLc | Longest stable segment | > 0.5 |
### Evaluation Summary
For each model, we report:
```
Beat Detection:
Weighted F1: X.XXXX
CMLt: X.XXXX AMLt: X.XXXX
CMLc: X.XXXX AMLc: X.XXXX
Downbeat Detection:
Weighted F1: X.XXXX
CMLt: X.XXXX AMLt: X.XXXX
CMLc: X.XXXX AMLc: X.XXXX
Combined Weighted F1: X.XXXX (average of beat and downbeat)
```
### Benchmark Results
Results evaluated on 100 tracks from the test set:
| Model | Combined F1 | Beat F1 | Downbeat F1 | CMLt (Beat) | CMLt (Downbeat) |
|-------|-------------|---------|-------------|-------------|-----------------|
| **Baseline 1 (ODCNN)** | 0.0765 | 0.0861 | 0.0669 | 0.0731 | 0.0321 |
| **Baseline 2 (ResNet-SE)** | **0.2775** | **0.3292** | **0.2258** | **0.3287** | **0.1146** |
*Note: Baseline 2 (ResNet-SE) performs significantly better due to its larger context window and deeper architecture.*
---
## Quick Start
### Setup
```bash
uv sync
```
### Train Models
```bash
# Train Baseline 1 (ODCNN)
uv run -m exp.baseline1.train
# Train Baseline 2 (ResNet-SE)
uv run -m exp.baseline2.train
# Train specific target only (e.g. for Baseline 2)
uv run -m exp.baseline2.train --target beats
uv run -m exp.baseline2.train --target downbeats
```
### Run Evaluation
```bash
# Evaluation (replace baseline1 with baseline2 to evaluate the new model)
uv run -m exp.baseline1.eval
# Full evaluation with visualization and audio
uv run -m exp.baseline1.eval --visualize --synthesize --summary-plot
# Evaluate on more samples with custom output directory
uv run -m exp.baseline1.eval --num-samples 50 --output-dir outputs/eval_baseline1
```
### Evaluation Options
| Option | Description |
|--------|-------------|
| `--model-dir DIR` | Model directory (default: `outputs/baseline1`) |
| `--num-samples N` | Number of samples to evaluate (default: 20) |
| `--output-dir DIR` | Output directory (default: `outputs/eval`) |
| `--visualize` | Generate visualization plots for each track |
| `--synthesize` | Generate audio files with click tracks |
| `--viz-tracks N` | Number of tracks to visualize/synthesize (default: 5) |
| `--time-range START END` | Limit visualization time range (seconds) |
| `--click-volume FLOAT` | Click sound volume (0.0 to 1.0, default: 0.5) |
| `--summary-plot` | Generate summary evaluation bar charts |
---
## Visualization & Audio Tools
### Beat Visualization
Generate plots comparing predicted vs ground truth beats:
```bash
uv run -m exp.baseline1.eval --visualize --viz-tracks 10
```
Output: `outputs/eval/plots/track_XXX.png`
### Click Track Audio
Generate audio files with click sounds overlaid on the original music:
```bash
uv run -m exp.baseline1.eval --synthesize
```
Output files in `outputs/eval/audio/`:
- `track_XXX_pred.wav` - Original audio + predicted beat clicks (1000Hz beat, 1500Hz downbeat)
- `track_XXX_gt.wav` - Original audio + ground truth clicks (800Hz beat, 1200Hz downbeat)
- `track_XXX_both.wav` - Original audio + both prediction and ground truth clicks
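Conceptually, overlaying clicks amounts to mixing short sine bursts into the waveform. The sketch below assumes `numpy` and `soundfile`; `audio`, `sr`, `pred_beats`, and `pred_downbeats` are placeholders, and the actual implementation lives in `exp/data/audio.py`.

```python
import numpy as np
import soundfile as sf

def add_clicks(audio, sr, times, freq, click_dur=0.03, volume=0.5):
    """Mix short sine bursts into `audio` (mono float array) at the given timestamps in seconds."""
    out = audio.copy()
    n = int(click_dur * sr)
    click = volume * np.sin(2 * np.pi * freq * np.arange(n) / sr) * np.hanning(n)
    for time in times:
        start = int(time * sr)
        if start >= len(out):
            continue
        end = min(start + n, len(out))
        out[start:end] += click[: end - start]
    return out

# `audio`, `sr`, `pred_beats`, `pred_downbeats` are placeholders for a loaded track and model output.
mixed = add_clicks(add_clicks(audio, sr, pred_beats, 1000), sr, pred_downbeats, 1500)
sf.write("track_000_pred.wav", np.clip(mixed, -1.0, 1.0), sr)
```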
### Summary Plot
Generate bar charts summarizing F1 scores and continuity metrics:
```bash
uv run -m exp.baseline1.eval --summary-plot
```
Output: `outputs/eval/evaluation_summary.png`
---
## Models
### Baseline 1: ODCNN
A classic baseline model from roughly a decade ago: <https://ieeexplore.ieee.org/document/6854953>.
The original baseline implements the **Onset Detection CNN (ODCNN)** architecture:
#### Architecture
- **Input**: Multi-view mel spectrogram (3 window sizes: 23ms, 46ms, 93ms)
- **CNN Backbone**: 3 convolutional blocks with max pooling
- **Output**: Frame-level beat/downbeat probability
- **Inference**: ±7 frames context (±70ms)
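A sketch of the multi-view input described above, assuming `librosa`; the hop length (10 ms) and 80 mel bands are illustrative choices, not necessarily those used in `exp/baseline1/data.py`.

```python
import librosa
import numpy as np

def multi_view_mel(y, sr=16000, win_ms=(23, 46, 93), hop_ms=10, n_mels=80):
    """Stack log-mel spectrograms computed with three window sizes into one 3-channel input."""
    hop = int(sr * hop_ms / 1000)
    views = []
    for ms in win_ms:
        n_fft = int(sr * ms / 1000)
        mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft, hop_length=hop, n_mels=n_mels)
        views.append(librosa.power_to_db(mel))
    return np.stack(views)  # (3, n_mels, n_frames): one channel per window size
```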
### Baseline 2: ResNet-SE
Inspired by Squeeze-and-Excitation networks: <https://arxiv.org/abs/1709.01507>.
A modernized architecture designed to capture longer temporal context:
#### Architecture
- **Input**: Mel spectrogram with larger context
- **Backbone**: ResNet with Squeeze-and-Excitation (SE) blocks
- **Context**: **±50 frames (~1s)** window
- **Features**: Deeper network (4 stages) with channel attention
- **Parameters**: ~400k (small and efficient)
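For illustration, a minimal squeeze-and-excitation block in PyTorch; the actual blocks live in `exp/baseline2/model.py` and may differ in details such as the reduction ratio.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Channel attention: squeeze spatial dims, then rescale each channel."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):  # x: (batch, channels, freq, time)
        b, c, _, _ = x.shape
        weights = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * weights
```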
### Training Details
Both models use similar training loops:
- **Optimizer**: SGD (Baseline 1) / AdamW (Baseline 2)
- **Learning Rate**: Cosine annealing
- **Loss**: Binary Cross-Entropy
- **Epochs**: 50 (Baseline 1) / 3 (Baseline 2)
- **Batch Size**: 512 (Baseline 1) / 128 (Baseline 2)
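Schematically, the Baseline 2 recipe looks like the sketch below; `model` and `train_loader` are placeholders, the learning rate is illustrative, and the real loop is in `exp/baseline2/train.py`.

```python
import torch
import torch.nn as nn

num_epochs = 3  # Baseline 2 setting from the list above
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)            # `model` is a placeholder
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_epochs)
criterion = nn.BCEWithLogitsLoss()                                    # frame-level binary targets

for epoch in range(num_epochs):
    for features, targets in train_loader:                            # `train_loader` is a placeholder
        optimizer.zero_grad()
        loss = criterion(model(features), targets)
        loss.backward()
        optimizer.step()
    scheduler.step()
```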
---
## Project Structure
```
exp-onset/
├── exp/
│   ├── baseline1/              # Baseline 1 (ODCNN)
│   │   ├── model.py            # ODCNN architecture
│   │   ├── train.py
│   │   ├── eval.py
│   │   ├── data.py
│   │   └── utils.py
│   ├── baseline2/              # Baseline 2 (ResNet-SE)
│   │   ├── model.py            # ResNet-SE
│   │   ├── train.py
│   │   ├── eval.py
│   │   └── data.py
│   └── data/
│       ├── load.py             # Dataset loading & preprocessing
│       ├── eval.py             # Evaluation metrics (F1, CML, AML)
│       ├── audio.py            # Click track synthesis
│       └── viz.py              # Visualization utilities
├── outputs/
│   ├── baseline1/              # Trained models (Baseline 1)
│   ├── baseline2/              # Trained models (Baseline 2)
│   └── eval/                   # Evaluation outputs
│       ├── plots/              # Visualization images
│       ├── audio/              # Click track audio files
│       └── evaluation_summary.png
├── README.md
├── DATASET.md                  # Raw dataset specification
└── pyproject.toml
```
---
## License
This project is for research and educational purposes. The dataset is derived from publicly available rhythm game charts.