# Beat Tracking Challenge

A challenge for detecting beats and downbeats in music audio, with a focus on handling dynamic tempo changes common in rhythm game charts.

## Goal

The goal is to **detect and identify beats and downbeats** in audio to assist composers by providing a flexible timing grid when working with samples that have dynamic BPM changes.

- **Beat**: A regular pulse in music (e.g., quarter notes in 4/4 time)
- **Downbeat**: The first beat of each measure (the "1" in counting "1-2-3-4")

This is particularly useful for:

- Music production with samples of varying tempos
- Rhythm game chart creation and verification
- Audio analysis and music information retrieval (MIR)

---
## Dataset

The dataset is derived from Taiko no Tatsujin rhythm game charts, providing high-quality human-annotated beat and downbeat ground truth.

**Source**: [`JacobLinCool/taiko-1000-parsed`](https://huggingface.co/datasets/JacobLinCool/taiko-1000-parsed)

| Split | Tracks | Duration | Description |
|-------|--------|----------|-------------|
| `train` | ~1000 | 1-3 min each | Training data with beat/downbeat annotations |
| `test` | ~100 | 1-3 min each | Held-out test set for final evaluation |
### Data Features

Each example contains:

| Field | Type | Description |
|-------|------|-------------|
| `audio` | `Audio` | Audio waveform at 16kHz sample rate |
| `title` | `str` | Track title |
| `beats` | `list[float]` | Beat timestamps in seconds |
| `downbeats` | `list[float]` | Downbeat timestamps in seconds |
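
For orientation, a minimal sketch of reading these fields with the Hugging Face `datasets` library (the access pattern is an assumption based on the feature table above):

```python
from datasets import load_dataset

# Load the training split and inspect one example
ds = load_dataset("JacobLinCool/taiko-1000-parsed", split="train")

example = ds[0]
waveform = example["audio"]["array"]      # float waveform samples
sr = example["audio"]["sampling_rate"]    # 16000
print(example["title"])
print("beats:", example["beats"][:4])         # first beat times (s)
print("downbeats:", example["downbeats"][:2])
```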
### Dataset Characteristics

- **Dynamic BPM**: Many tracks feature tempo changes mid-song
- **Variable Time Signatures**: Common patterns include 4/4, 3/4, 6/8, and more exotic meters
- **Diverse Genres**: Japanese pop, anime themes, classical arrangements, electronic music
- **High-Quality Annotations**: Derived from professional rhythm game charts

---
## Evaluation Metrics

The evaluation considers both **timing accuracy** and **metrical correctness**. Models are evaluated on both beat and downbeat detection tasks.

### Primary Metrics

#### 1. Weighted F1-Score (Main Ranking Metric)

F1-scores are calculated at multiple timing thresholds (3ms to 30ms), then combined with inverse-threshold weighting:
| Threshold | Weight | Rationale |
|-----------|--------|-----------|
| 3ms | 1.000 | Full weight (3/3) for the tightest tolerance |
| 6ms | 0.500 | Half weight (3/6) |
| 9ms | 0.333 | One-third weight (3/9) |
| 12ms | 0.250 | 3/12 |
| 15ms | 0.200 | 3/15 |
| 18ms | 0.167 | 3/18 |
| 21ms | 0.143 | 3/21 |
| 24ms | 0.125 | 3/24 |
| 27ms | 0.111 | 3/27 |
| 30ms | 0.100 | Minimum weight (3/30) for the coarsest tolerance |
**Formula:**

```
Weighted F1 = Σ(w_t × F1_t) / Σ(w_t)

where w_t = 3ms / t (inverse threshold weighting)
```

This weighting scheme rewards models that achieve high precision at tight tolerances while still considering coarser thresholds.
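
A minimal sketch of this metric, assuming a greedy one-to-one matching of predicted to ground-truth beats within each tolerance (the matching strategy is an illustrative assumption, not necessarily the challenge's exact implementation):

```python
import numpy as np

def f1_at_threshold(pred, ref, tol):
    """F1 with greedy one-to-one matching of sorted timestamps within ±tol seconds."""
    pred, ref = np.sort(pred), np.sort(ref)
    i = j = hits = 0
    while i < len(pred) and j < len(ref):
        if abs(pred[i] - ref[j]) <= tol:
            hits, i, j = hits + 1, i + 1, j + 1
        elif pred[i] < ref[j]:
            i += 1
        else:
            j += 1
    if hits == 0:
        return 0.0
    precision, recall = hits / len(pred), hits / len(ref)
    return 2 * precision * recall / (precision + recall)

def weighted_f1(pred, ref):
    thresholds_ms = range(3, 31, 3)              # 3ms, 6ms, ..., 30ms
    weights = [3.0 / t for t in thresholds_ms]   # w_t = 3ms / t
    scores = [f1_at_threshold(pred, ref, t / 1000.0) for t in thresholds_ms]
    return float(np.dot(weights, scores) / np.sum(weights))
```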
#### 2. Continuity Metrics (CMLt, AMLt, CMLc, AMLc)

Based on the MIREX beat tracking evaluation protocol, computed with `mir_eval`:

| Metric | Full Name | Description |
|--------|-----------|-------------|
| **CMLt** | Correct Metrical Level Total | Fraction of beats correctly tracked at the exact metrical level (within ±17.5% of the beat interval) |
| **AMLt** | Any Metrical Level Total | Same as CMLt, but also accepts common metrical variations (double/half tempo, off-beat) |
| **CMLc** | Correct Metrical Level Continuous | Length of the longest continuously correct segment at the exact metrical level, as a fraction of the track |
| **AMLc** | Any Metrical Level Continuous | Length of the longest continuously correct segment at any acceptable metrical level, as a fraction of the track |

**Note:** Continuity metrics use a default `min_beat_time=5.0s` (skipping the first 5 seconds) to avoid evaluating the potentially unstable tempo at the beginning of tracks.
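
These come straight out of `mir_eval`; a small sketch with toy data (the trimming call mirrors the note above):

```python
import numpy as np
import mir_eval

# Toy data: a steady 120 BPM ground truth and a slightly offset estimate
ref = np.arange(0.0, 60.0, 0.5)
est = ref + 0.005

# Skip the first 5 seconds, matching min_beat_time above
ref = mir_eval.beat.trim_beats(ref, min_beat_time=5.0)
est = mir_eval.beat.trim_beats(est, min_beat_time=5.0)

# Returns CMLc, CMLt, AMLc, AMLt (±17.5% tolerances are mir_eval's defaults)
cmlc, cmlt, amlc, amlt = mir_eval.beat.continuity(ref, est)
print(f"CMLc={cmlc:.3f} CMLt={cmlt:.3f} AMLc={amlc:.3f} AMLt={amlt:.3f}")
```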
### Metric Interpretation

| Metric | What it measures | Good Score |
|--------|------------------|------------|
| Weighted F1 | Precise timing accuracy | > 0.7 |
| CMLt | Correct tempo tracking | > 0.8 |
| AMLt | Tempo tracking (flexible) | > 0.9 |
| CMLc | Longest stable segment | > 0.5 |
### Evaluation Summary

For each model, we report:

```
Beat Detection:
  Weighted F1: X.XXXX
  CMLt: X.XXXX  AMLt: X.XXXX
  CMLc: X.XXXX  AMLc: X.XXXX

Downbeat Detection:
  Weighted F1: X.XXXX
  CMLt: X.XXXX  AMLt: X.XXXX
  CMLc: X.XXXX  AMLc: X.XXXX

Combined Weighted F1: X.XXXX  (average of beat and downbeat)
```
### Benchmark Results

Results evaluated on 100 tracks from the test set:

| Model | Combined F1 | Beat F1 | Downbeat F1 | CMLt (Beat) | CMLt (Downbeat) |
|-------|-------------|---------|-------------|-------------|-----------------|
| **Baseline 1 (ODCNN)** | 0.0765 | 0.0861 | 0.0669 | 0.0731 | 0.0321 |
| **Baseline 2 (ResNet-SE)** | **0.2775** | **0.3292** | **0.2258** | **0.3287** | **0.1146** |
*Note: Baseline 2 (ResNet-SE) performs significantly better, thanks to its larger context window and deeper architecture.*

---
## Quick Start

### Setup

```bash
uv sync
```

### Train Models

```bash
# Train Baseline 1 (ODCNN)
uv run -m exp.baseline1.train

# Train Baseline 2 (ResNet-SE)
uv run -m exp.baseline2.train

# Train a specific target only (e.g. for Baseline 2)
uv run -m exp.baseline2.train --target beats
uv run -m exp.baseline2.train --target downbeats
```

### Run Evaluation

```bash
# Evaluation (replace baseline1 with baseline2 to evaluate the new model)
uv run -m exp.baseline1.eval

# Full evaluation with visualization and audio
uv run -m exp.baseline1.eval --visualize --synthesize --summary-plot

# Evaluate on more samples with a custom output directory
uv run -m exp.baseline1.eval --num-samples 50 --output-dir outputs/eval_baseline1
```
### Evaluation Options

| Option | Description |
|--------|-------------|
| `--model-dir DIR` | Model directory (default: `outputs/baseline1`) |
| `--num-samples N` | Number of samples to evaluate (default: 20) |
| `--output-dir DIR` | Output directory (default: `outputs/eval`) |
| `--visualize` | Generate visualization plots for each track |
| `--synthesize` | Generate audio files with click tracks |
| `--viz-tracks N` | Number of tracks to visualize/synthesize (default: 5) |
| `--time-range START END` | Limit visualization time range (seconds) |
| `--click-volume FLOAT` | Click sound volume (0.0 to 1.0, default: 0.5) |
| `--summary-plot` | Generate summary evaluation bar charts |

---
## Visualization & Audio Tools

### Beat Visualization

Generate plots comparing predicted vs. ground-truth beats:

```bash
uv run -m exp.baseline1.eval --visualize --viz-tracks 10
```

Output: `outputs/eval/plots/track_XXX.png`
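
The actual plotting code lives in `exp/data/viz.py`; a rough, hypothetical sketch of what such a comparison plot involves:

```python
import matplotlib.pyplot as plt

def plot_beats(pred, gt, out_path, time_range=(0.0, 30.0)):
    """Predicted beats (bottom) vs. ground truth (top) as vertical ticks."""
    fig, ax = plt.subplots(figsize=(12, 2.5))
    for t in gt:
        if time_range[0] <= t <= time_range[1]:
            ax.axvline(t, 0.55, 1.0, color="tab:green")
    for t in pred:
        if time_range[0] <= t <= time_range[1]:
            ax.axvline(t, 0.0, 0.45, color="tab:red")
    ax.set(xlim=time_range, yticks=[0.225, 0.775],
           yticklabels=["predicted", "ground truth"], xlabel="time (s)")
    fig.savefig(out_path, dpi=150, bbox_inches="tight")
    plt.close(fig)
```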
### Click Track Audio

Generate audio files with click sounds overlaid on the original music:

```bash
uv run -m exp.baseline1.eval --synthesize
```

Output files in `outputs/eval/audio/`:

- `track_XXX_pred.wav` - Original audio + predicted beat clicks (1000Hz beat, 1500Hz downbeat)
- `track_XXX_gt.wav` - Original audio + ground truth clicks (800Hz beat, 1200Hz downbeat)
- `track_XXX_both.wav` - Original audio + both prediction and ground truth clicks
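
Conceptually, the click overlay boils down to mixing short sine bursts into the waveform; a sketch (the click length and decay envelope are assumptions, the frequencies follow the list above):

```python
import numpy as np

def add_clicks(audio, sr, times, freq=1000.0, dur=0.03, gain=0.5):
    """Mix short decaying sine clicks into `audio` at the given times (seconds)."""
    out = audio.copy()
    n = int(dur * sr)
    t = np.arange(n) / sr
    click = gain * np.sin(2 * np.pi * freq * t) * np.exp(-t / (dur / 4))
    for time in times:
        start = int(time * sr)
        if start >= len(out):
            continue
        end = min(start + n, len(out))
        out[start:end] += click[: end - start]
    return np.clip(out, -1.0, 1.0)

# e.g. the "pred" file: 1000 Hz clicks on beats, 1500 Hz clicks on downbeats
# mixed = add_clicks(add_clicks(audio, sr, beats), sr, downbeats, freq=1500.0)
```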
### Summary Plot

Generate bar charts summarizing F1 scores and continuity metrics:

```bash
uv run -m exp.baseline1.eval --summary-plot
```

Output: `outputs/eval/evaluation_summary.png`

---
## Models

### Baseline 1: ODCNN

The original baseline implements the **Onset Detection CNN (ODCNN)** architecture, a classic onset-detection model from the mid-2010s: <https://ieeexplore.ieee.org/document/6854953>.

#### Architecture

- **Input**: Multi-view mel spectrogram (3 window sizes: 23ms, 46ms, 93ms)
- **CNN Backbone**: 3 convolutional blocks with max pooling
- **Output**: Frame-level beat/downbeat probability
- **Inference**: ±7 frames of context (±70ms)
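
A compact PyTorch sketch of this shape (layer widths are illustrative assumptions; see `exp/baseline1/model.py` for the actual definition):

```python
import torch
import torch.nn as nn

class ODCNN(nn.Module):
    """Frame classifier over a 3-channel (multi-window) mel-spectrogram patch."""

    def __init__(self, n_mels: int = 80, context: int = 15):  # 15 = ±7 frames
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        # Three 2x poolings shrink both axes by roughly 8 (floor division)
        feat = 64 * (n_mels // 8) * max(context // 8, 1)
        self.head = nn.Sequential(nn.Flatten(), nn.Linear(feat, 1))

    def forward(self, x):  # x: (batch, 3, n_mels, context)
        return torch.sigmoid(self.head(self.net(x))).squeeze(-1)
```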
### Baseline 2: ResNet-SE

Inspired by Squeeze-and-Excitation networks (<https://arxiv.org/abs/1709.01507>), this is a modernized architecture designed to capture longer temporal context:

#### Architecture

- **Input**: Mel spectrogram with larger context
- **Backbone**: ResNet with Squeeze-and-Excitation (SE) blocks
- **Context**: **±50 frames (~1s)** window
- **Features**: Deeper network (4 stages) with effective channel attention
- **Parameters**: ~400k (small and efficient)
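
The Squeeze-and-Excitation block at the heart of the backbone fits in a few lines; a sketch following the cited paper rather than the exact `exp/baseline2/model.py` code:

```python
import torch.nn as nn

class SEBlock(nn.Module):
    """Channel attention: squeeze (global average pool), then excite (gating MLP)."""

    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.gate = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):                        # x: (B, C, F, T)
        b, c = x.shape[:2]
        w = self.gate(self.pool(x).view(b, c))   # per-channel weights in (0, 1)
        return x * w.view(b, c, 1, 1)            # reweight each feature map
```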
### Training Details

Both models use similar training loops:

- **Optimizer**: SGD (Baseline 1) / AdamW (Baseline 2)
- **Learning Rate**: Cosine annealing
- **Loss**: Binary Cross-Entropy
- **Epochs**: 50 (Baseline 1) / 3 (Baseline 2)
- **Batch Size**: 512 (Baseline 1) / 128 (Baseline 2)
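
With Baseline 2's settings, the loop reduces to roughly the following (a sketch; `model`, `loader`, and the learning rate are placeholders/assumptions):

```python
import torch
from torch import nn, optim

def train(model: nn.Module, loader, epochs: int = 3, device: str = "cuda"):
    """BCE training with AdamW and per-step cosine annealing (Baseline 2 settings)."""
    model.to(device).train()
    opt = optim.AdamW(model.parameters(), lr=1e-3)  # lr is an assumption
    sched = optim.lr_scheduler.CosineAnnealingLR(opt, T_max=epochs * len(loader))
    loss_fn = nn.BCELoss()  # both models output frame-level probabilities
    for _ in range(epochs):
        for mel, target in loader:  # target: 1.0 on (down)beat frames, else 0.0
            mel, target = mel.to(device), target.to(device)
            opt.zero_grad()
            loss_fn(model(mel), target).backward()
            opt.step()
            sched.step()
```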
---

## Project Structure

```
exp-onset/
├── exp/
│   ├── baseline1/      # Baseline 1 (ODCNN)
│   │   ├── model.py    # ODCNN architecture
│   │   ├── train.py
│   │   ├── eval.py
│   │   ├── data.py
│   │   └── utils.py
│   ├── baseline2/      # Baseline 2 (ResNet-SE)
│   │   ├── model.py    # ResNet-SE
│   │   ├── train.py
│   │   ├── eval.py
│   │   └── data.py
│   └── data/
│       ├── load.py     # Dataset loading & preprocessing
│       ├── eval.py     # Evaluation metrics (F1, CML, AML)
│       ├── audio.py    # Click track synthesis
│       └── viz.py      # Visualization utilities
├── outputs/
│   ├── baseline1/      # Trained models (Baseline 1)
│   ├── baseline2/      # Trained models (Baseline 2)
│   └── eval/           # Evaluation outputs
│       ├── plots/      # Visualization images
│       ├── audio/      # Click track audio files
│       └── evaluation_summary.png
├── README.md
├── DATASET.md          # Raw dataset specification
└── pyproject.toml
```
---

## License

This project is for research and educational purposes. The dataset is derived from publicly available rhythm game charts.