JacobLinCool committed on
Commit 64bf319 · verified · 1 Parent(s): 6f01cc1

Upload folder using huggingface_hub

This view is limited to 50 files because the commit contains too many changes.

Files changed (50)
  1. .gitattributes +13 -0
  2. .gitignore +16 -0
  3. .python-version +1 -0
  4. BASELINE3_IMPROVEMENTS.md +163 -0
  5. README.md +299 -0
  6. SE/Squeeze-and-Excitation Networks 1.jpg +3 -0
  7. SE/Squeeze-and-Excitation Networks 10.jpg +3 -0
  8. SE/Squeeze-and-Excitation Networks 11.jpg +3 -0
  9. SE/Squeeze-and-Excitation Networks 12.jpg +3 -0
  10. SE/Squeeze-and-Excitation Networks 13.jpg +3 -0
  11. SE/Squeeze-and-Excitation Networks 2.jpg +3 -0
  12. SE/Squeeze-and-Excitation Networks 3.jpg +3 -0
  13. SE/Squeeze-and-Excitation Networks 4.jpg +3 -0
  14. SE/Squeeze-and-Excitation Networks 5.jpg +3 -0
  15. SE/Squeeze-and-Excitation Networks 6.jpg +3 -0
  16. SE/Squeeze-and-Excitation Networks 7.jpg +3 -0
  17. SE/Squeeze-and-Excitation Networks 8.jpg +3 -0
  18. SE/Squeeze-and-Excitation Networks 9.jpg +3 -0
  19. exp/__init__.py +0 -0
  20. exp/baseline1/__init__.py +0 -0
  21. exp/baseline1/data.py +128 -0
  22. exp/baseline1/eval.py +322 -0
  23. exp/baseline1/model.py +62 -0
  24. exp/baseline1/train.py +183 -0
  25. exp/baseline1/utils.py +53 -0
  26. exp/baseline2/__init__.py +0 -0
  27. exp/baseline2/data.py +137 -0
  28. exp/baseline2/eval.py +324 -0
  29. exp/baseline2/model.py +139 -0
  30. exp/baseline2/train.py +215 -0
  31. exp/baseline3/__init__.py +0 -0
  32. exp/baseline3/data.py +173 -0
  33. exp/baseline3/eval.py +336 -0
  34. exp/baseline3/model.py +173 -0
  35. exp/baseline3/train.py +433 -0
  36. exp/data/__init__.py +25 -0
  37. exp/data/audio.py +301 -0
  38. exp/data/eval.py +568 -0
  39. exp/data/load.py +91 -0
  40. exp/data/viz.py +441 -0
  41. outputs/baseline1/beats/README.md +10 -0
  42. outputs/baseline1/beats/config.json +3 -0
  43. outputs/baseline1/beats/final/README.md +10 -0
  44. outputs/baseline1/beats/final/config.json +3 -0
  45. outputs/baseline1/beats/final/model.safetensors +3 -0
  46. outputs/baseline1/beats/logs/events.out.tfevents.1766351314.msiit232.1284330.0 +3 -0
  47. outputs/baseline1/beats/model.safetensors +3 -0
  48. outputs/baseline1/downbeats/README.md +10 -0
  49. outputs/baseline1/downbeats/config.json +3 -0
  50. outputs/baseline1/downbeats/final/README.md +10 -0
.gitattributes CHANGED
@@ -33,3 +33,16 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ SE/Squeeze-and-Excitation[[:space:]]Networks[[:space:]]1.jpg filter=lfs diff=lfs merge=lfs -text
+ SE/Squeeze-and-Excitation[[:space:]]Networks[[:space:]]10.jpg filter=lfs diff=lfs merge=lfs -text
+ SE/Squeeze-and-Excitation[[:space:]]Networks[[:space:]]11.jpg filter=lfs diff=lfs merge=lfs -text
+ SE/Squeeze-and-Excitation[[:space:]]Networks[[:space:]]12.jpg filter=lfs diff=lfs merge=lfs -text
+ SE/Squeeze-and-Excitation[[:space:]]Networks[[:space:]]13.jpg filter=lfs diff=lfs merge=lfs -text
+ SE/Squeeze-and-Excitation[[:space:]]Networks[[:space:]]2.jpg filter=lfs diff=lfs merge=lfs -text
+ SE/Squeeze-and-Excitation[[:space:]]Networks[[:space:]]3.jpg filter=lfs diff=lfs merge=lfs -text
+ SE/Squeeze-and-Excitation[[:space:]]Networks[[:space:]]4.jpg filter=lfs diff=lfs merge=lfs -text
+ SE/Squeeze-and-Excitation[[:space:]]Networks[[:space:]]5.jpg filter=lfs diff=lfs merge=lfs -text
+ SE/Squeeze-and-Excitation[[:space:]]Networks[[:space:]]6.jpg filter=lfs diff=lfs merge=lfs -text
+ SE/Squeeze-and-Excitation[[:space:]]Networks[[:space:]]7.jpg filter=lfs diff=lfs merge=lfs -text
+ SE/Squeeze-and-Excitation[[:space:]]Networks[[:space:]]8.jpg filter=lfs diff=lfs merge=lfs -text
+ SE/Squeeze-and-Excitation[[:space:]]Networks[[:space:]]9.jpg filter=lfs diff=lfs merge=lfs -text
.gitignore ADDED
@@ -0,0 +1,16 @@
+ # Python-generated files
+ __pycache__/
+ *.py[oc]
+ build/
+ dist/
+ wheels/
+ *.egg-info
+
+ # Virtual environments
+ .venv
+
+ outputs/*
+ !outputs/baseline1/
+ !outputs/baseline2/
+
+ .ruff_cache/
.python-version ADDED
@@ -0,0 +1 @@
+ 3.12
BASELINE3_IMPROVEMENTS.md ADDED
@@ -0,0 +1,163 @@
+ # Baseline3 improvements (beats + downbeats)
+
+ This document summarizes the changes made in `exp/baseline3` relative to `exp/baseline2` during this session, emphasizing improvements intended to increase beat/downbeat F1 and continuity while keeping the training/eval workflow consistent with baseline2.
+
+ ## Scope / goals
+
+ - Keep the same overall pipeline as baseline2 (same dataset, same context window, same mel multi-view preprocessing, same peak-picking evaluation).
+ - Add SE-inspired improvements to the **model** (baseline3) while preserving the baseline2 ResNet backbone structure.
+ - Make training and TensorBoard curves **comparable** to baseline2.
+ - Support faster iteration when needed (optional), but allow returning to baseline2-style "full" training defaults.
+
+ ---
+
+ ## Model improvements (affects both beats + downbeats)
+
+ ### 1) Extra SE-inspired gating (temporal excitation)
+
+ - File: `exp/baseline3/model.py`
+ - Added an additional SE-style gating mechanism that is **time-dependent** (a "temporal excitation" in addition to channel excitation).
+ - The intent is to help the network emphasize temporally salient patterns that correspond to rhythmic events, improving peak sharpness and reducing spurious activations.
+
+ ### 2) SE block robustness
+
+ - File: `exp/baseline3/model.py`
+ - Made the SE hidden dimension robust for small channel counts (ensuring the intermediate dimension is never zero).
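A minimal sketch of what such a block could look like (hypothetical names and shapes; the actual `exp/baseline3/model.py` implementation may differ):

```python
import torch
import torch.nn as nn


class SETemporalBlock(nn.Module):
    """Channel SE gating plus a time-dependent (temporal) excitation gate."""

    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        # Guard against tiny channel counts: the hidden dim is never zero.
        hidden = max(1, channels // reduction)
        # Classic channel excitation: squeeze over (freq, time), gate channels.
        self.channel_fc = nn.Sequential(
            nn.Linear(channels, hidden),
            nn.ReLU(),
            nn.Linear(hidden, channels),
            nn.Sigmoid(),
        )
        # Temporal excitation: squeeze over (channel, freq), gate time steps.
        self.temporal_conv = nn.Sequential(
            nn.Conv1d(1, 1, kernel_size=7, padding=3), nn.Sigmoid()
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, freq, time)
        b, c, f, t = x.shape
        s = x.mean(dim=(2, 3))                     # squeeze -> (b, c)
        w_c = self.channel_fc(s).view(b, c, 1, 1)  # channel gate
        s_t = x.mean(dim=(1, 2)).unsqueeze(1)      # squeeze -> (b, 1, t)
        w_t = self.temporal_conv(s_t).view(b, 1, 1, t)  # temporal gate
        return x * w_c * w_t
```

Both gates are sigmoids, so the block only rescales activations and preserves the feature-map shape.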
+
+ ---
+
+ ## Data / sampling improvements (optional; applies to both beats + downbeats)
+
+ ### 3) Track capping support (optional)
+
+ - File: `exp/baseline3/data.py`
+ - Added support for limiting the number of tracks used when building indices.
+ - This was introduced for **fast iteration** runs (debugging / quick experiments). When not used, training uses the full dataset, as in baseline2.
+
+ ### 4) Hard-negative sampling near events (optional)
+
+ - File: `exp/baseline3/data.py`
+ - Added optional "hard negatives" close to ground-truth frames:
+   - For each beat/downbeat frame, add negative frames at offsets ±d for d = 2..R.
+   - Controlled by `hard_neg_radius` and `hard_neg_fraction`.
+ - Rationale: random negatives are often too easy; near-event negatives help reduce double peaks/jitter and can improve continuity.
+ - Status: kept **off by default** when running in baseline2-style mode.
+
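A sketch of the hard-negative generation described above (hypothetical helper; the exact parameter semantics in `exp/baseline3/data.py` are assumptions):

```python
import random


def hard_negative_frames(gt_frames, n_frames, hard_neg_radius=4,
                         hard_neg_fraction=1.0, seed=0):
    """Return negative frame indices near ground-truth frames.

    For each annotated frame f, offsets ±d for d = 2..hard_neg_radius are
    candidate hard negatives (d = 1 is excluded because baseline1/2 treat
    direct neighbors as fuzzy positives). `hard_neg_fraction` subsamples
    the candidate set.
    """
    positives = set(gt_frames)
    candidates = set()
    for f in gt_frames:
        for d in range(2, hard_neg_radius + 1):
            for g in (f - d, f + d):
                if 0 <= g < n_frames and g not in positives:
                    candidates.add(g)
    rng = random.Random(seed)
    k = int(len(candidates) * hard_neg_fraction)
    return sorted(rng.sample(sorted(candidates), k))
```

These indices would be appended with label 0.0, alongside the random negatives the baselines already draw.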
+ ---
+
+ ## Training-loop improvements
+
+ ### 5) Output directories fixed to avoid overwriting baseline2
+
+ - File: `exp/baseline3/train.py` (and, earlier in the session, the baseline3 eval defaults)
+ - Baseline3 outputs were adjusted to use baseline3-specific output directories so baseline2 artifacts aren't overwritten.
+
+ ### 6) Loss logging parity with baseline2
+
+ - File: `exp/baseline3/train.py`
+ - Baseline2 uses unweighted BCE (`nn.BCELoss`). Baseline3 introduced an optional weighted BCE objective for imbalance experiments.
+ - A key issue was discovered: TensorBoard curves looked "worse" in baseline3 because it was logging the weighted BCE as the main loss.
+ - Fix:
+   - `train/batch_loss` and `train/epoch_loss` are now **unweighted BCE** (baseline2-comparable).
+   - If weighting is enabled, the optimized objective is logged separately as `*_weighted`.
+
+ ### 7) Optional imbalance-aware objective (pos weighting)
+
+ - File: `exp/baseline3/train.py`
+ - Added an optional weighted BCE objective, controlled by `--pos-weight`.
+ - The default is `--pos-weight 0.0`, which matches baseline2 behavior.
+
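A sketch of the logging-parity arrangement (illustrative; the weighting scheme `1 + pos_weight * target` is an assumption about how `--pos-weight` is applied in `train.py`):

```python
import torch
import torch.nn as nn


def bce_losses(pred: torch.Tensor, target: torch.Tensor, pos_weight: float = 0.0):
    """Return (unweighted_bce, optimized_bce).

    The unweighted value is what gets logged as train/batch_loss for
    baseline2-comparable curves; the optimized value is the training
    objective (identical when pos_weight == 0.0, logged as *_weighted
    otherwise).
    """
    unweighted = nn.functional.binary_cross_entropy(pred, target)
    if pos_weight > 0.0:
        # Up-weight positive frames: per-element weight = 1 + pos_weight * target
        weight = 1.0 + pos_weight * target
        optimized = nn.functional.binary_cross_entropy(pred, target, weight=weight)
    else:
        optimized = unweighted
    return unweighted, optimized
```

The training loop would backprop through `optimized` but always log `unweighted`.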
+ ### 8) Optional gradient clipping
+
+ - File: `exp/baseline3/train.py`
+ - Added `--grad-clip` support to stabilize training when experimenting.
+ - For baseline2-style mode, the default was set back to **disabled** (`--grad-clip 0.0`).
+
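A minimal sketch of how `--grad-clip` could plug into the optimization step (illustrative; `0.0` disables clipping, matching the default):

```python
import torch


def step(model, loss, optimizer, grad_clip: float = 0.0):
    """One optimization step with optional gradient-norm clipping."""
    optimizer.zero_grad()
    loss.backward()
    if grad_clip > 0.0:
        # Rescales all gradients so their global L2 norm is at most grad_clip.
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=grad_clip)
    optimizer.step()
```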
+ ### 9) Fast-iteration controls (optional)
+
+ - File: `exp/baseline3/train.py`
+ - Added optional caps for quicker experiments:
+   - `--max-train-tracks`, `--max-val-tracks`
+   - `--max-train-steps`, `--max-val-steps`, `--max-steps-total`
+ - These are intended only for debugging/iteration. Baseline2-style training leaves them unset (0/unlimited).
+
+ ### 10) Back to baseline2-style default training mode
+
+ - File: `exp/baseline3/train.py`
+ - Returned baseline3 defaults to match the baseline2 training mode:
+   - `--epochs 3`
+   - `--patience 5`
+   - objective defaults to unweighted BCE when `--pos-weight 0.0`
+   - no grad clipping by default
+
+ ---
+
+ ## Evaluation improvements
+
+ ### 11) Mix-and-match beats and downbeats checkpoints
+
+ - File: `exp/baseline3/eval.py`
+ - Added support for evaluating with different model directories for beats vs. downbeats:
+   - `--beats-model-dir`
+   - `--downbeats-model-dir`
+ - This enables workflows like "new beats run + keep downbeats fixed".
+
+ ---
+
+ ## Beats-specific notes
+
+ - All model/training/eval improvements above apply to beats.
+ - A key gotcha found during quick experiments: some runs only saved the checkpoint under a `final/` subfolder. When evaluating, pointing at the correct folder matters.
+
+ ### Latest mixed eval result (beats improved)
+
+ Eval configuration used:
+
+ - Beats: `outputs/baseline3_b2mode_full3/beats`
+ - Downbeats: `outputs/baseline3_smoketest/downbeats`
+ - Output: `outputs/eval_mix_b3_b2modebeats_smoketestdownbeats`
+
+ Key metrics (116 tracks):
+
+ - Mean Beat Weighted F1: **0.3531**
+ - Beat continuity: CMLt **0.3567**, AMLt **0.3607**, CMLc **0.0603**, AMLc **0.0624**
+
+ Summary plot:
+
+ - `outputs/eval_mix_b3_b2modebeats_smoketestdownbeats/evaluation_summary.png`
+
+ ---
+
+ ## Downbeats-specific notes
+
+ - Downbeats training uses the same dataset/indexing logic, model architecture, and preprocessing as beats.
+ - The improvements (temporal excitation, loss logging parity, optional hard negatives, optional fast-iteration caps, mixed-checkpoint evaluation) all apply identically.
+ - In the mixed eval above, downbeats were held fixed at the baseline3 smoketest checkpoint.
+
+ ---
+
+ ## Repro commands
+
+ ### Full baseline2-style training (beats only)
+
+ ```bash
+ uv run -m exp.baseline3.train --target beats --output-dir outputs/baseline3_b2mode_full3
+ ```
+
+ ### Mixed evaluation (beats from a new run + downbeats from baseline3 smoketest)
+
+ ```bash
+ uv run -m exp.baseline3.eval \
+     --beats-model-dir outputs/baseline3_b2mode_full3/beats \
+     --downbeats-model-dir outputs/baseline3_smoketest/downbeats \
+     --output-dir outputs/eval_mix_b3_b2modebeats_smoketestdownbeats \
+     --summary-plot
+ ```
+
+ ---
+
+ ## Known warnings
+
+ - You may see repeated torchaudio warnings like:
+   - "At least one mel filterbank has all zero values…"
+ - This is produced by `torchaudio` mel filterbank construction for some parameter combinations and is not specific to baseline3.
README.md ADDED
@@ -0,0 +1,299 @@
+ # Beat Tracking Challenge
+
+ A challenge for detecting beats and downbeats in music audio, with a focus on handling the dynamic tempo changes common in rhythm game charts.
+
+ ## Goal
+
+ The goal is to **detect and identify beats and downbeats** in audio to assist composers by providing a flexible timing grid when working with samples that have dynamic BPM changes.
+
+ - **Beat**: A regular pulse in music (e.g., quarter notes in 4/4 time)
+ - **Downbeat**: The first beat of each measure (the "1" in counting "1-2-3-4")
+
+ This is particularly useful for:
+ - Music production with samples of varying tempos
+ - Rhythm game chart creation and verification
+ - Audio analysis and music information retrieval (MIR)
+
+ ---
+
+ ## Dataset
+
+ The dataset is derived from Taiko no Tatsujin rhythm game charts, providing high-quality human-annotated beat and downbeat ground truth.
+
+ **Source**: [`JacobLinCool/taiko-1000-parsed`](https://huggingface.co/datasets/JacobLinCool/taiko-1000-parsed)
+
+ | Split | Tracks | Duration | Description |
+ |-------|--------|----------|-------------|
+ | `train` | ~1000 | 1-3 min each | Training data with beat/downbeat annotations |
+ | `test` | ~100 | 1-3 min each | Held-out test set for final evaluation |
+
+ ### Data Features
+
+ Each example contains:
+
+ | Field | Type | Description |
+ |-------|------|-------------|
+ | `audio` | `Audio` | Audio waveform at 16kHz sample rate |
+ | `title` | `str` | Track title |
+ | `beats` | `list[float]` | Beat timestamps in seconds |
+ | `downbeats` | `list[float]` | Downbeat timestamps in seconds |
+
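The second-valued timestamps pair naturally with the 100 fps frame grid the baselines operate on (hop length 160 at 16 kHz); a typical conversion looks like this (illustrative helper, not part of the repo API):

```python
def times_to_frames(times_s, sample_rate=16000, hop_length=160):
    """Convert event timestamps in seconds to frame indices.

    With sample_rate=16000 and hop_length=160, one frame is 10 ms (100 fps),
    matching the resolution used by the baseline models.
    """
    return [int(t * sample_rate / hop_length) for t in times_s]
```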
+ ### Dataset Characteristics
+
+ - **Dynamic BPM**: Many tracks feature tempo changes mid-song
+ - **Variable Time Signatures**: Common patterns include 4/4, 3/4, 6/8, and more exotic meters
+ - **Diverse Genres**: Japanese pop, anime themes, classical arrangements, electronic music
+ - **High-Quality Annotations**: Derived from professional rhythm game charts
+
+ ---
+
+ ## Evaluation Metrics
+
+ The evaluation considers both **timing accuracy** and **metrical correctness**. Models are evaluated on both beat and downbeat detection tasks.
+
+ ### Primary Metrics
+
+ #### 1. Weighted F1-Score (Main Ranking Metric)
+
+ F1-scores are calculated at multiple timing thresholds (3ms to 30ms), then combined with inverse-threshold weighting:
+
+ | Threshold | Weight | Rationale |
+ |-----------|--------|-----------|
+ | 3ms | 1.000 | Full weight for highest precision |
+ | 6ms | 0.500 | Half weight |
+ | 9ms | 0.333 | One-third weight |
+ | 12ms | 0.250 | One-quarter weight |
+ | 15ms | 0.200 | |
+ | 18ms | 0.167 | |
+ | 21ms | 0.143 | |
+ | 24ms | 0.125 | |
+ | 27ms | 0.111 | |
+ | 30ms | 0.100 | Minimum weight for the coarsest threshold |
+
+ **Formula:**
+ ```
+ Weighted F1 = Σ(w_t × F1_t) / Σ(w_t)
+ where w_t = 3ms / t (inverse threshold weighting)
+ ```
+
+ This weighting scheme rewards models that achieve high precision at tight tolerances while still considering coarser thresholds.
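The formula translates directly into code; `weighted_f1` here is an illustrative helper, not the repo's actual `exp/data/eval.py` function:

```python
def weighted_f1(f1_by_threshold_ms: dict[float, float], base_ms: float = 3.0) -> float:
    """Combine per-threshold F1 scores with inverse-threshold weights w_t = base_ms / t."""
    num = sum((base_ms / t) * f1 for t, f1 in f1_by_threshold_ms.items())
    den = sum(base_ms / t for t in f1_by_threshold_ms)
    return num / den
```

If a model scores the same F1 at every threshold, the weighted score equals that F1; improvements at the tight 3 ms threshold move the score the most.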
80
+
81
+ #### 2. Continuity Metrics (CMLt, AMLt)
82
+
83
+ Based on the MIREX beat tracking evaluation protocol using `mir_eval`:
84
+
85
+ | Metric | Full Name | Description |
86
+ |--------|-----------|-------------|
87
+ | **CMLt** | Correct Metrical Level Total | Percentage of beats correctly tracked at the exact metrical level (±17.5% of beat interval) |
88
+ | **AMLt** | Any Metrical Level Total | Same as CMLt, but allows for acceptable metrical variations (double/half tempo, off-beat) |
89
+ | **CMLc** | Correct Metrical Level Continuous | Longest continuous correctly-tracked segment at exact metrical level |
90
+ | **AMLc** | Any Metrical Level Continuous | Longest continuous segment at any acceptable metrical level |
91
+
92
+ **Note:** Continuity metrics use a default `min_beat_time=5.0s` (skipping the first 5 seconds) to avoid evaluating potentially unstable tempo at the beginning of tracks.
93
+
94
+ ### Metric Interpretation
95
+
96
+ | Metric | What it measures | Good Score |
97
+ |--------|------------------|------------|
98
+ | Weighted F1 | Precise timing accuracy | > 0.7 |
99
+ | CMLt | Correct tempo tracking | > 0.8 |
100
+ | AMLt | Tempo tracking (flexible) | > 0.9 |
101
+ | CMLc | Longest stable segment | > 0.5 |
102
+
103
+ ### Evaluation Summary
104
+
105
+ For each model, we report:
106
+
107
+ ```
108
+ Beat Detection:
109
+ Weighted F1: X.XXXX
110
+ CMLt: X.XXXX AMLt: X.XXXX
111
+ CMLc: X.XXXX AMLc: X.XXXX
112
+
113
+ Downbeat Detection:
114
+ Weighted F1: X.XXXX
115
+ CMLt: X.XXXX AMLt: X.XXXX
116
+ CMLc: X.XXXX AMLc: X.XXXX
117
+
118
+ Combined Weighted F1: X.XXXX (average of beat and downbeat)
119
+ ```
120
+
121
+ ### Benchmark Results
122
+
123
+ Results evaluated on 100 tracks from the test set:
124
+
125
+ | Model | Combined F1 | Beat F1 | Downbeat F1 | CMLt (Beat) | CMLt (Downbeat) |
126
+ |-------|-------------|---------|-------------|-------------|-----------------|
127
+ | **Baseline 1 (ODCNN)** | 0.0765 | 0.0861 | 0.0669 | 0.0731 | 0.0321 |
128
+ | **Baseline 2 (ResNet-SE)** | **0.2775** | **0.3292** | **0.2258** | **0.3287** | **0.1146** |
129
+
130
+ *Note: Baseline 2 (ResNet-SE) demonstrates significantly better performance due to larger context window and deeper architecture.*
131
+
132
+ ---
133
+
+ ## Quick Start
+
+ ### Setup
+
+ ```bash
+ uv sync
+ ```
+
+ ### Train Models
+
+ ```bash
+ # Train Baseline 1 (ODCNN)
+ uv run -m exp.baseline1.train
+
+ # Train Baseline 2 (ResNet-SE)
+ uv run -m exp.baseline2.train
+
+ # Train a specific target only (e.g. for Baseline 2)
+ uv run -m exp.baseline2.train --target beats
+ uv run -m exp.baseline2.train --target downbeats
+ ```
+
+ ### Run Evaluation
+
+ ```bash
+ # Evaluation (replace baseline1 with baseline2 to evaluate the newer model)
+ uv run -m exp.baseline1.eval
+
+ # Full evaluation with visualization and audio
+ uv run -m exp.baseline1.eval --visualize --synthesize --summary-plot
+
+ # Evaluate more samples with a custom output directory
+ uv run -m exp.baseline1.eval --num-samples 50 --output-dir outputs/eval_baseline1
+ ```
+
+ ### Evaluation Options
+
+ | Option | Description |
+ |--------|-------------|
+ | `--model-dir DIR` | Model directory (default: `outputs/baseline1`) |
+ | `--num-samples N` | Number of samples to evaluate (default: 20) |
+ | `--output-dir DIR` | Output directory (default: `outputs/eval`) |
+ | `--visualize` | Generate visualization plots for each track |
+ | `--synthesize` | Generate audio files with click tracks |
+ | `--viz-tracks N` | Number of tracks to visualize/synthesize (default: 5) |
+ | `--time-range START END` | Limit visualization time range (seconds) |
+ | `--click-volume FLOAT` | Click sound volume (0.0 to 1.0, default: 0.5) |
+ | `--summary-plot` | Generate summary evaluation bar charts |
+
+ ---
+
+ ## Visualization & Audio Tools
+
+ ### Beat Visualization
+
+ Generate plots comparing predicted vs. ground-truth beats:
+
+ ```bash
+ uv run -m exp.baseline1.eval --visualize --viz-tracks 10
+ ```
+
+ Output: `outputs/eval/plots/track_XXX.png`
+
+ ### Click Track Audio
+
+ Generate audio files with click sounds overlaid on the original music:
+
+ ```bash
+ uv run -m exp.baseline1.eval --synthesize
+ ```
+
+ Output files in `outputs/eval/audio/`:
+ - `track_XXX_pred.wav` - Original audio + predicted beat clicks (1000Hz beat, 1500Hz downbeat)
+ - `track_XXX_gt.wav` - Original audio + ground truth clicks (800Hz beat, 1200Hz downbeat)
+ - `track_XXX_both.wav` - Original audio + both prediction and ground truth clicks
+
+ ### Summary Plot
+
+ Generate bar charts summarizing F1 scores and continuity metrics:
+
+ ```bash
+ uv run -m exp.baseline1.eval --summary-plot
+ ```
+
+ Output: `outputs/eval/evaluation_summary.png`
+
+ ---
+
+ ## Models
+
+ ### Baseline 1: ODCNN
+
+ A decade-old baseline model: <https://ieeexplore.ieee.org/document/6854953>.
+
+ The original baseline implements the **Onset Detection CNN (ODCNN)** architecture:
+
+ #### Architecture
+ - **Input**: Multi-view mel spectrogram (3 window sizes: 23ms, 46ms, 93ms)
+ - **CNN Backbone**: 3 convolutional blocks with max pooling
+ - **Output**: Frame-level beat/downbeat probability
+ - **Inference**: ±7 frames of context (±70ms)
+
+ ### Baseline 2: ResNet-SE
+
+ Inspired by Squeeze-and-Excitation networks: <https://arxiv.org/abs/1709.01507>.
+
+ A modernized architecture designed to capture longer temporal context:
+
+ #### Architecture
+ - **Input**: Mel spectrogram with larger context
+ - **Backbone**: ResNet with Squeeze-and-Excitation (SE) blocks
+ - **Context**: **±50 frames (~1s)** window
+ - **Features**: Deeper network (4 stages) with effective channel attention
+ - **Parameters**: ~400k (small and efficient)
+
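For reference, a minimal channel-SE block in the spirit of the paper linked above (a sketch; the exact `exp/baseline2/model.py` implementation may differ):

```python
import torch
import torch.nn as nn


class SEBlock(nn.Module):
    """Squeeze-and-Excitation over a (batch, channels, freq, time) feature map."""

    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        hidden = max(1, channels // reduction)
        self.fc = nn.Sequential(
            nn.Linear(channels, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        # Squeeze: global average over freq and time; excite: per-channel gate.
        w = self.fc(x.mean(dim=(2, 3))).view(b, c, 1, 1)
        return x * w
```

One such block after each residual stage adds only a few thousand parameters, consistent with the ~400k total.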
+ ### Training Details
+
+ Both models use similar training loops:
+ - **Optimizer**: SGD (Baseline 1) / AdamW (Baseline 2)
+ - **Learning Rate**: Cosine annealing
+ - **Loss**: Binary cross-entropy
+ - **Epochs**: 50 (Baseline 1) / 3 (Baseline 2)
+ - **Batch Size**: 512 (Baseline 1) / 128 (Baseline 2)
+
+ ---
+
+ ## Project Structure
+
+ ```
+ exp-onset/
+ ├── exp/
+ │   ├── baseline1/          # Baseline 1 (ODCNN)
+ │   │   ├── model.py        # ODCNN architecture
+ │   │   ├── train.py
+ │   │   ├── eval.py
+ │   │   ├── data.py
+ │   │   └── utils.py
+ │   ├── baseline2/          # Baseline 2 (ResNet-SE)
+ │   │   ├── model.py        # ResNet-SE
+ │   │   ├── train.py
+ │   │   ├── eval.py
+ │   │   └── data.py
+ │   └── data/
+ │       ├── load.py         # Dataset loading & preprocessing
+ │       ├── eval.py         # Evaluation metrics (F1, CML, AML)
+ │       ├── audio.py        # Click track synthesis
+ │       └── viz.py          # Visualization utilities
+ ├── outputs/
+ │   ├── baseline1/          # Trained models (Baseline 1)
+ │   ├── baseline2/          # Trained models (Baseline 2)
+ │   └── eval/               # Evaluation outputs
+ │       ├── plots/          # Visualization images
+ │       ├── audio/          # Click track audio files
+ │       └── evaluation_summary.png
+ ├── README.md
+ ├── DATASET.md              # Raw dataset specification
+ └── pyproject.toml
+ ```
+
+ ---
+
+ ## License
+
+ This project is for research and educational purposes. The dataset is derived from publicly available rhythm game charts.
SE/Squeeze-and-Excitation Networks 1.jpg ADDED

Git LFS Details

  • SHA256: d0380e82ecf8f2ffc4ff8553a4d40ab50f7503d8b9ffdc58d0ca067b67060c97
  • Pointer size: 132 Bytes
  • Size of remote file: 5.94 MB
SE/Squeeze-and-Excitation Networks 10.jpg ADDED

Git LFS Details

  • SHA256: 74d9f6c4dc9e6bffccf03e2a8f233fa55eb99bf4e2762a5e7cf6ec2cdd37837e
  • Pointer size: 132 Bytes
  • Size of remote file: 4.27 MB
SE/Squeeze-and-Excitation Networks 11.jpg ADDED

Git LFS Details

  • SHA256: ad07ce4dcb5540e4878dd0b56cfe8db4c7f13f92429587ac75aa6203a6a7700c
  • Pointer size: 132 Bytes
  • Size of remote file: 5.06 MB
SE/Squeeze-and-Excitation Networks 12.jpg ADDED

Git LFS Details

  • SHA256: e195ff80703b9c88ee1d661fb8a0cbf9a312cdc96732ed304cbb878d6efd7777
  • Pointer size: 132 Bytes
  • Size of remote file: 6.86 MB
SE/Squeeze-and-Excitation Networks 13.jpg ADDED

Git LFS Details

  • SHA256: 25065cec902bbed6e52ae28cf5e8c52613ec1da8d36dcd027765edd70cc27e1c
  • Pointer size: 132 Bytes
  • Size of remote file: 5.26 MB
SE/Squeeze-and-Excitation Networks 2.jpg ADDED

Git LFS Details

  • SHA256: 63dc30e35ecae244ffcb197c491dcbe237f8724ec418c20ddbd91b89e5512135
  • Pointer size: 132 Bytes
  • Size of remote file: 5.96 MB
SE/Squeeze-and-Excitation Networks 3.jpg ADDED

Git LFS Details

  • SHA256: 2653f086da04f5d990ea817531663115396a77fc52962524623b8a49508f9412
  • Pointer size: 132 Bytes
  • Size of remote file: 6.24 MB
SE/Squeeze-and-Excitation Networks 4.jpg ADDED

Git LFS Details

  • SHA256: b80814bb2e06269784482d396949c835263084d05944d580a4981cc6ebad70d4
  • Pointer size: 132 Bytes
  • Size of remote file: 5.35 MB
SE/Squeeze-and-Excitation Networks 5.jpg ADDED

Git LFS Details

  • SHA256: af5e33be7e6cdb093a1e12df355406919d19920343c475eb8e31f65c859d9f76
  • Pointer size: 132 Bytes
  • Size of remote file: 5.07 MB
SE/Squeeze-and-Excitation Networks 6.jpg ADDED

Git LFS Details

  • SHA256: 4f5e9d98eaa72167d038b5469e158a418eb535780d24d7784124027e7b08e571
  • Pointer size: 132 Bytes
  • Size of remote file: 5.95 MB
SE/Squeeze-and-Excitation Networks 7.jpg ADDED

Git LFS Details

  • SHA256: b719f3619270a097f05d6909b8446c351a0e38eeb60aac2f6c900b0fe4d5275b
  • Pointer size: 132 Bytes
  • Size of remote file: 5.91 MB
SE/Squeeze-and-Excitation Networks 8.jpg ADDED

Git LFS Details

  • SHA256: 662aab9fdddcf2f68dc72c2d8499480198b53c96e101a7f502c803a7c5388a05
  • Pointer size: 132 Bytes
  • Size of remote file: 5.64 MB
SE/Squeeze-and-Excitation Networks 9.jpg ADDED

Git LFS Details

  • SHA256: 9f2c401006c2d645d3c043f7ca79b61e54c36c5a136df8e1bdbc1a3b5eb74bf3
  • Pointer size: 132 Bytes
  • Size of remote file: 5.47 MB
exp/__init__.py ADDED
File without changes
exp/baseline1/__init__.py ADDED
File without changes
exp/baseline1/data.py ADDED
@@ -0,0 +1,128 @@
+ import torch
+ from torch.utils.data import Dataset
+ import numpy as np
+ from tqdm import tqdm
+ from .utils import extract_context
+
+
+ class BeatTrackingDataset(Dataset):
+     def __init__(
+         self, hf_dataset, target_type="beats", sample_rate=16000, hop_length=160
+     ):
+         """
+         Args:
+             hf_dataset: HuggingFace dataset object
+             target_type (str): "beats" or "downbeats". Determines which labels are treated as positive.
+         """
+         self.sr = sample_rate
+         self.hop_length = hop_length
+         self.target_type = target_type
+
+         # Context window size in samples (±7 frames = ±70ms at 100fps)
+         self.context_frames = 7
+         self.context_samples = (self.context_frames * 2 + 1) * hop_length + max(
+             [368, 736, 1488]
+         )  # extra for FFT window
+
+         # Cache audio arrays in memory for fast access
+         self.audio_cache = []
+         self.indices = []
+         self._prepare_indices(hf_dataset)
+
+     def _prepare_indices(self, hf_dataset):
+         """
+         Prepares balanced indices and caches audio.
+         Paper Section 4.5: uses "fuzzier" training examples (neighbors weighted less).
+         """
+         print(f"Preparing dataset indices for target: {self.target_type}...")
+
+         for i, item in tqdm(
+             enumerate(hf_dataset), total=len(hf_dataset), desc="Building indices"
+         ):
+             # Cache audio array (convert to numpy if tensor)
+             audio = item["audio"]["array"]
+             if hasattr(audio, "numpy"):
+                 audio = audio.numpy()
+             self.audio_cache.append(audio)
+
+             # Calculate total frames available in audio
+             audio_len = len(audio)
+             n_frames = int(audio_len / self.hop_length)
+
+             # Select ground truth based on target_type
+             if self.target_type == "downbeats":
+                 # Only downbeats are positives
+                 gt_times = item["downbeats"]
+             else:
+                 # All beats are positives (downbeats are also beats)
+                 gt_times = item["beats"]
+
+             # Convert to list if tensor
+             if hasattr(gt_times, "tolist"):
+                 gt_times = gt_times.tolist()
+
+             gt_frames = set([int(t * self.sr / self.hop_length) for t in gt_times])
+
+             # --- Positive examples (with fuzziness) ---
+             # "define a single frame before and after each annotated onset to be additional positive examples"
+             pos_frames = set()
+             for bf in gt_frames:
+                 if 0 <= bf < n_frames:
+                     self.indices.append((i, bf, 1.0))  # Center frame (sharp onset)
+                     pos_frames.add(bf)
+
+                     # Neighbors weighted at 0.25
+                     if 0 <= bf - 1 < n_frames:
+                         self.indices.append((i, bf - 1, 0.25))
+                         pos_frames.add(bf - 1)
+                     if 0 <= bf + 1 < n_frames:
+                         self.indices.append((i, bf + 1, 0.25))
+                         pos_frames.add(bf + 1)
+
+             # --- Negative examples ---
+             # Paper uses "all others as negative", but we balance 2:1 for stable SGD.
+             num_pos = len(pos_frames)
+             num_neg = num_pos * 2
+
+             count = 0
+             attempts = 0
+             while count < num_neg and attempts < num_neg * 5:
+                 f = np.random.randint(0, n_frames)
+                 if f not in pos_frames:
+                     self.indices.append((i, f, 0.0))
+                     count += 1
+                 attempts += 1
+
+         print(
+             f"Dataset ready. {len(self.indices)} samples, {len(self.audio_cache)} tracks cached."
+         )
+
+     def __len__(self):
+         return len(self.indices)
+
+     def __getitem__(self, idx):
+         track_idx, frame_idx, label = self.indices[idx]
+
+         # Fast lookup from cache
+         audio = self.audio_cache[track_idx]
+         audio_len = len(audio)
+
+         # Calculate sample range for context window
+         center_sample = frame_idx * self.hop_length
+         half_context = self.context_samples // 2
+         start = center_sample - half_context
+         end = center_sample + half_context
+
+         # Handle padding if needed
+         pad_left = max(0, -start)
+         pad_right = max(0, end - audio_len)
+         start = max(0, start)
+         end = min(audio_len, end)
+
+         # Extract audio chunk
+         chunk = audio[start:end]
+         if pad_left > 0 or pad_right > 0:
+             chunk = np.pad(chunk, (pad_left, pad_right), mode="constant")
+
+         waveform = torch.tensor(chunk, dtype=torch.float32)
+         return waveform, torch.tensor([label], dtype=torch.float32)
exp/baseline1/eval.py ADDED
@@ -0,0 +1,322 @@
+ import torch
+ import numpy as np
+ from tqdm import tqdm
+ from scipy.signal import find_peaks
+ import argparse
+ import os
+
+ from .model import ODCNN
+ from .utils import MultiViewSpectrogram
+ from ..data.load import ds
+ from ..data.eval import evaluate_all, format_results
+
+
+ def get_activation_function(model, waveform, device):
+     """
+     Computes probability curve over time.
+     """
+     processor = MultiViewSpectrogram().to(device)
+     waveform = waveform.unsqueeze(0).to(device)
+
+     with torch.no_grad():
+         spec = processor(waveform)
+
+         # Normalize
+         mean = spec.mean(dim=(2, 3), keepdim=True)
+         std = spec.std(dim=(2, 3), keepdim=True) + 1e-6
+         spec = (spec - mean) / std
+
+         # Batchify with sliding window
+         spec = torch.nn.functional.pad(spec, (7, 7))  # Pad time
+         windows = spec.unfold(3, 15, 1)  # (1, 3, 80, Time, 15)
+         windows = windows.permute(3, 0, 1, 2, 4).squeeze(1)  # (Time, 3, 80, 15)
+
+         # Inference
+         activations = []
+         batch_size = 512
+         for i in range(0, len(windows), batch_size):
+             batch = windows[i : i + batch_size]
+             out = model(batch)
+             activations.append(out.cpu().numpy())
+
+     return np.concatenate(activations).flatten()
+
+
+ def pick_peaks(activations, hop_length=160, sr=16000):
+     """
+     Smooth with Hamming window and report local maxima.
+     """
+     # Smoothing
+     window = np.hamming(5)
+     window /= window.sum()
+     smoothed = np.convolve(activations, window, mode="same")
+
+     # Peak Picking
+     peaks, _ = find_peaks(smoothed, height=0.5, distance=5)
+
+     timestamps = peaks * hop_length / sr
+     return timestamps.tolist()
+
+
+ def visualize_track(
+     audio: np.ndarray,
+     sr: int,
+     pred_beats: list[float],
+     pred_downbeats: list[float],
+     gt_beats: list[float],
+     gt_downbeats: list[float],
+     output_dir: str,
+     track_idx: int,
+     time_range: tuple[float, float] | None = None,
+ ):
+     """
+     Create and save visualizations for a single track.
+     """
+     from ..data.viz import plot_waveform_with_beats, save_figure
+
+     os.makedirs(output_dir, exist_ok=True)
+
+     # Full waveform plot
+     fig = plot_waveform_with_beats(
+         audio,
+         sr,
+         pred_beats,
+         gt_beats,
+         pred_downbeats,
+         gt_downbeats,
+         title=f"Track {track_idx}: Beat Comparison",
+         time_range=time_range,
+     )
+     save_figure(fig, os.path.join(output_dir, f"track_{track_idx:03d}.png"))
+
+
+ def synthesize_audio(
+     audio: np.ndarray,
+     sr: int,
+     pred_beats: list[float],
+     pred_downbeats: list[float],
+     gt_beats: list[float],
+     gt_downbeats: list[float],
+     output_dir: str,
+     track_idx: int,
+     click_volume: float = 0.5,
+ ):
+     """
+     Create and save audio files with click tracks for a single track.
+     """
+     from ..data.audio import create_comparison_audio, save_audio
+
+     os.makedirs(output_dir, exist_ok=True)
+
+     # Create comparison audio
+     audio_pred, audio_gt, audio_both = create_comparison_audio(
+         audio,
+         pred_beats,
+         pred_downbeats,
+         gt_beats,
+         gt_downbeats,
+         sr=sr,
+         click_volume=click_volume,
+     )
+
+     # Save audio files
+     save_audio(
+         audio_pred, os.path.join(output_dir, f"track_{track_idx:03d}_pred.wav"), sr
+     )
+     save_audio(audio_gt, os.path.join(output_dir, f"track_{track_idx:03d}_gt.wav"), sr)
+     save_audio(
+         audio_both, os.path.join(output_dir, f"track_{track_idx:03d}_both.wav"), sr
+     )
+
+
+ def main():
+     parser = argparse.ArgumentParser(
+         description="Evaluate beat tracking models with visualization and audio synthesis"
+     )
+     parser.add_argument(
+         "--model-dir",
+         type=str,
+         default="outputs/baseline1",
+         help="Base directory containing trained models (with 'beats' and 'downbeats' subdirs)",
+     )
+     parser.add_argument(
+         "--num-samples",
+         type=int,
+         default=116,
+         help="Number of samples to evaluate",
+     )
+     parser.add_argument(
+         "--output-dir",
+         type=str,
+         default="outputs/eval_baseline1",
+         help="Directory to save visualizations and audio",
+     )
+     parser.add_argument(
+         "--visualize",
+         action="store_true",
+         help="Generate visualization plots for each track",
+     )
+     parser.add_argument(
+         "--synthesize",
+         action="store_true",
+         help="Generate audio files with click tracks",
+     )
+     parser.add_argument(
+         "--viz-tracks",
+         type=int,
+         default=5,
+         help="Number of tracks to visualize/synthesize (default: 5)",
+     )
+     parser.add_argument(
+         "--time-range",
+         type=float,
+         nargs=2,
+         default=None,
+         metavar=("START", "END"),
+         help="Time range for visualization in seconds (default: full track)",
+     )
+     parser.add_argument(
+         "--click-volume",
+         type=float,
+         default=0.5,
+         help="Volume of click sounds relative to audio (0.0 to 1.0)",
+     )
+     parser.add_argument(
+         "--summary-plot",
+         action="store_true",
+         help="Generate summary evaluation plot",
+     )
+     args = parser.parse_args()
+
+     DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
+
+     # Load BOTH models using from_pretrained
+     beat_model = None
+     downbeat_model = None
+
+     has_beats = False
+     has_downbeats = False
+
+     beats_dir = os.path.join(args.model_dir, "beats")
+     downbeats_dir = os.path.join(args.model_dir, "downbeats")
+
+     if os.path.exists(os.path.join(beats_dir, "model.safetensors")):
+         beat_model = ODCNN.from_pretrained(beats_dir).to(DEVICE)
+         beat_model.eval()
+         has_beats = True
+         print(f"Loaded Beat Model from {beats_dir}")
+     else:
+         print(f"Warning: No beat model found in {beats_dir}")
+
+     if os.path.exists(os.path.join(downbeats_dir, "model.safetensors")):
+         downbeat_model = ODCNN.from_pretrained(downbeats_dir).to(DEVICE)
+         downbeat_model.eval()
+         has_downbeats = True
+         print(f"Loaded Downbeat Model from {downbeats_dir}")
+     else:
+         print(f"Warning: No downbeat model found in {downbeats_dir}")
+
+     if not has_beats and not has_downbeats:
+         print("No models found. Please run training first.")
+         return
+
+     predictions = []
+     ground_truths = []
+     audio_data = []  # Store audio for visualization/synthesis
+
+     # Eval on specified number of tracks
+     test_set = ds["train"].select(range(args.num_samples))
+
+     print("Running evaluation...")
+     for i, item in enumerate(tqdm(test_set)):
+         waveform = torch.tensor(item["audio"]["array"], dtype=torch.float32)
+         waveform_device = waveform.to(DEVICE)
+
+         pred_entry = {"beats": [], "downbeats": []}
+
+         # 1. Predict Beats
+         if has_beats:
+             act_b = get_activation_function(beat_model, waveform_device, DEVICE)
+             pred_entry["beats"] = pick_peaks(act_b)
+
+         # 2. Predict Downbeats
+         if has_downbeats:
+             act_d = get_activation_function(downbeat_model, waveform_device, DEVICE)
+             pred_entry["downbeats"] = pick_peaks(act_d)
+
+         predictions.append(pred_entry)
+         ground_truths.append({"beats": item["beats"], "downbeats": item["downbeats"]})
+
+         # Store audio for later visualization/synthesis
+         if args.visualize or args.synthesize:
+             if i < args.viz_tracks:
+                 audio_data.append(
+                     {
+                         "audio": waveform.numpy(),
+                         "sr": item["audio"]["sampling_rate"],
+                         "pred": pred_entry,
+                         "gt": ground_truths[-1],
+                     }
+                 )
+
+     # Run evaluation
+     results = evaluate_all(predictions, ground_truths)
+     print(format_results(results))
+
+     # Create output directory
+     if args.visualize or args.synthesize or args.summary_plot:
+         os.makedirs(args.output_dir, exist_ok=True)
+
+     # Generate visualizations
+     if args.visualize:
+         print(f"\nGenerating visualizations for {len(audio_data)} tracks...")
+         viz_dir = os.path.join(args.output_dir, "plots")
+         for i, data in enumerate(tqdm(audio_data, desc="Visualizing")):
+             time_range = tuple(args.time_range) if args.time_range else None
+             visualize_track(
+                 data["audio"],
+                 data["sr"],
+                 data["pred"]["beats"],
+                 data["pred"]["downbeats"],
+                 data["gt"]["beats"],
+                 data["gt"]["downbeats"],
+                 viz_dir,
+                 i,
+                 time_range=time_range,
+             )
+         print(f"Saved visualizations to {viz_dir}")
+
+     # Generate audio with clicks
+     if args.synthesize:
+         print(f"\nSynthesizing audio for {len(audio_data)} tracks...")
+         audio_dir = os.path.join(args.output_dir, "audio")
+         for i, data in enumerate(tqdm(audio_data, desc="Synthesizing")):
+             synthesize_audio(
+                 data["audio"],
+                 data["sr"],
+                 data["pred"]["beats"],
+                 data["pred"]["downbeats"],
+                 data["gt"]["beats"],
+                 data["gt"]["downbeats"],
+                 audio_dir,
+                 i,
+                 click_volume=args.click_volume,
+             )
+         print(f"Saved audio files to {audio_dir}")
+         print("  *_pred.wav - Original audio with predicted beat clicks")
+         print("  *_gt.wav - Original audio with ground truth beat clicks")
+         print("  *_both.wav - Original audio with both predicted and GT clicks")
+
+     # Generate summary plot
+     if args.summary_plot:
+         from ..data.viz import plot_evaluation_summary, save_figure
+
+         print("\nGenerating summary plot...")
+         fig = plot_evaluation_summary(results, title="Beat Tracking Evaluation Summary")
+         summary_path = os.path.join(args.output_dir, "evaluation_summary.png")
+         save_figure(fig, summary_path)
+         print(f"Saved summary plot to {summary_path}")
+
+
+ if __name__ == "__main__":
+     main()
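For reference, the smoothing-plus-peak-picking step in `pick_peaks` can be reproduced without scipy. The sketch below is a simplified stand-in (a naive local-maximum test with no minimum-distance constraint, unlike `find_peaks(..., distance=5)`), useful for checking the frame-to-seconds conversion:

```python
import numpy as np

def smooth_and_pick(activations, height=0.5, hop_length=160, sr=16000):
    # 5-tap normalized Hamming smoothing, as in pick_peaks above
    window = np.hamming(5)
    window /= window.sum()
    smoothed = np.convolve(activations, window, mode="same")
    # a frame is a peak if it clears the threshold and beats both neighbors
    peaks = [
        i for i in range(1, len(smoothed) - 1)
        if smoothed[i] > height
        and smoothed[i] >= smoothed[i - 1]
        and smoothed[i] > smoothed[i + 1]
    ]
    return [p * hop_length / sr for p in peaks]

act = np.zeros(100)
act[49:52] = 1.0  # a short burst of activation around frame 50
print(smooth_and_pick(act))  # -> [0.5]
```

Note that with the normalized 5-tap Hamming window, a single isolated spike of 1.0 is smoothed down to about 0.45 and would fall below the 0.5 height threshold; real beat activations span a few frames, as in the example.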
exp/baseline1/model.py ADDED
@@ -0,0 +1,62 @@
+ import torch
+ import torch.nn as nn
+ from huggingface_hub import PyTorchModelHubMixin
+
+
+ class ODCNN(nn.Module, PyTorchModelHubMixin):
+     def __init__(self, dropout_rate=0.5):
+         super().__init__()
+
+         # Input: 3 channels, 80 mel bands
+         # Conv 1: 3x7 (freq x time) filters -> 10 maps
+         self.conv1 = nn.Conv2d(3, 10, kernel_size=(3, 7))
+         self.relu1 = nn.ReLU()  # ReLU improvement
+         self.pool1 = nn.MaxPool2d(kernel_size=(3, 1), stride=(3, 1))
+
+         # Conv 2: 3x3 filters -> 20 maps
+         self.conv2 = nn.Conv2d(10, 20, kernel_size=(3, 3))
+         self.relu2 = nn.ReLU()
+         self.pool2 = nn.MaxPool2d(kernel_size=(3, 1), stride=(3, 1))
+
+         # Flatten size calculation based on architecture
+         # (20 feature maps * 8 freq bands * 7 time frames)
+         self.flatten_size = 20 * 8 * 7
+
+         # Dropout on FC inputs
+         self.dropout = nn.Dropout(p=dropout_rate)
+
+         # 256 hidden units
+         self.fc1 = nn.Linear(self.flatten_size, 256)
+         self.relu_fc = nn.ReLU()
+
+         # Output unit
+         self.fc2 = nn.Linear(256, 1)
+         self.sigmoid = nn.Sigmoid()
+
+     def forward(self, x):
+         x = self.conv1(x)
+         x = self.relu1(x)
+         x = self.pool1(x)
+
+         x = self.conv2(x)
+         x = self.relu2(x)
+         x = self.pool2(x)
+
+         x = x.view(x.size(0), -1)
+
+         x = self.dropout(x)
+         x = self.fc1(x)
+         x = self.relu_fc(x)
+
+         x = self.dropout(x)
+         x = self.fc2(x)
+         x = self.sigmoid(x)
+
+         return x
+
+
+ if __name__ == "__main__":
+     from torchinfo import summary
+
+     model = ODCNN()
+     summary(model, (1, 3, 80, 15))
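The hard-coded `flatten_size = 20 * 8 * 7` follows from walking the input shape (3, 80, 15) through the layers; a quick stdlib sketch of that arithmetic (not part of the commit):

```python
def conv_out(size: int, kernel: int) -> int:
    # valid convolution, stride 1, no padding
    return size - kernel + 1

def pool_out(size: int, kernel: int) -> int:
    # max pooling with stride equal to the kernel
    return size // kernel

freq, time = 80, 15                                # input: 80 mel bands x 15 frames
freq, time = conv_out(freq, 3), conv_out(time, 7)  # conv1 (3, 7) -> (78, 9)
freq = pool_out(freq, 3)                           # pool1 (3, 1) -> (26, 9)
freq, time = conv_out(freq, 3), conv_out(time, 3)  # conv2 (3, 3) -> (24, 7)
freq = pool_out(freq, 3)                           # pool2 (3, 1) -> (8, 7)
flatten_size = 20 * freq * time                    # 20 feature maps
print(flatten_size)  # -> 1120
```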
exp/baseline1/train.py ADDED
@@ -0,0 +1,183 @@
+ import torch
+ import torch.nn as nn
+ import torch.optim as optim
+ from torch.utils.data import DataLoader
+ from torch.utils.tensorboard import SummaryWriter
+ from tqdm import tqdm
+ import argparse
+ import os
+
+ from .model import ODCNN
+ from .data import BeatTrackingDataset
+ from .utils import MultiViewSpectrogram
+ from ..data.load import ds
+
+
+ def train(target_type: str, output_dir: str):
+     # Note: Paper uses SGD with Momentum, Dropout, and ReLU
+     DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
+     BATCH_SIZE = 512
+     EPOCHS = 50
+     LR = 0.05
+     MOMENTUM = 0.9
+     NUM_WORKERS = 4
+
+     print(f"--- Training Model for target: {target_type} ---")
+     print(f"Output directory: {output_dir}")
+
+     # Create output directory
+     os.makedirs(output_dir, exist_ok=True)
+
+     # TensorBoard writer
+     writer = SummaryWriter(log_dir=os.path.join(output_dir, "logs"))
+
+     # Data - use existing train/test splits
+     train_dataset = BeatTrackingDataset(ds["train"], target_type=target_type)
+     val_dataset = BeatTrackingDataset(ds["test"], target_type=target_type)
+
+     train_loader = DataLoader(
+         train_dataset,
+         batch_size=BATCH_SIZE,
+         shuffle=True,
+         num_workers=NUM_WORKERS,
+         pin_memory=True,
+         prefetch_factor=4,
+         persistent_workers=True,
+     )
+     val_loader = DataLoader(
+         val_dataset,
+         batch_size=BATCH_SIZE,
+         shuffle=False,
+         num_workers=NUM_WORKERS,
+         pin_memory=True,
+         prefetch_factor=4,
+         persistent_workers=True,
+     )
+
+     print(f"Train samples: {len(train_dataset)}, Val samples: {len(val_dataset)}")
+
+     # Model
+     model = ODCNN(dropout_rate=0.5).to(DEVICE)
+
+     # GPU Spectrogram Preprocessor
+     preprocessor = MultiViewSpectrogram(sample_rate=16000, hop_length=160).to(DEVICE)
+
+     # Optimizer
+     optimizer = optim.SGD(model.parameters(), lr=LR, momentum=MOMENTUM)
+     scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=EPOCHS)
+     criterion = nn.BCELoss()  # Binary Cross Entropy
+
+     best_val_loss = float("inf")
+     global_step = 0
+
+     for epoch in range(EPOCHS):
+         # Training
+         model.train()
+         total_train_loss = 0
+         for waveform, y in tqdm(
+             train_loader,
+             desc=f"[{target_type}] Epoch {epoch + 1}/{EPOCHS} Train",
+             leave=False,
+         ):
+             waveform, y = waveform.to(DEVICE), y.to(DEVICE)
+
+             # Compute spectrogram on GPU
+             with torch.no_grad():
+                 spec = preprocessor(waveform)  # (B, 3, 80, T)
+                 # Normalize
+                 mean = spec.mean(dim=(2, 3), keepdim=True)
+                 std = spec.std(dim=(2, 3), keepdim=True) + 1e-6
+                 spec = (spec - mean) / std
+                 # Take the 15 center frames as the model input
+                 x = spec[:, :, :, 7:22]
+
+             optimizer.zero_grad()
+             output = model(x)
+             loss = criterion(output, y)
+             loss.backward()
+             optimizer.step()
+
+             total_train_loss += loss.item()
+             global_step += 1
+
+             # Log batch loss
+             writer.add_scalar("train/batch_loss", loss.item(), global_step)
+
+         avg_train_loss = total_train_loss / len(train_loader)
+
+         # Validation
+         model.eval()
+         total_val_loss = 0
+         with torch.no_grad():
+             for waveform, y in tqdm(
+                 val_loader,
+                 desc=f"[{target_type}] Epoch {epoch + 1}/{EPOCHS} Val",
+                 leave=False,
+             ):
+                 waveform, y = waveform.to(DEVICE), y.to(DEVICE)
+
+                 # Compute spectrogram on GPU
+                 spec = preprocessor(waveform)  # (B, 3, 80, T)
+                 # Normalize
+                 mean = spec.mean(dim=(2, 3), keepdim=True)
+                 std = spec.std(dim=(2, 3), keepdim=True) + 1e-6
+                 spec = (spec - mean) / std
+                 # Take the 15 center frames
+                 x = spec[:, :, :, 7:22]
+
+                 output = model(x)
+                 loss = criterion(output, y)
+                 total_val_loss += loss.item()
+
+         avg_val_loss = total_val_loss / len(val_loader)
+
+         # Log epoch metrics
+         writer.add_scalar("train/epoch_loss", avg_train_loss, epoch)
+         writer.add_scalar("val/loss", avg_val_loss, epoch)
+         writer.add_scalar("train/learning_rate", scheduler.get_last_lr()[0], epoch)
+
+         # Step the scheduler
+         scheduler.step()
+
+         print(
+             f"[{target_type}] Epoch {epoch + 1}/{EPOCHS} - "
+             f"Train Loss: {avg_train_loss:.4f}, Val Loss: {avg_val_loss:.4f}"
+         )
+
+         # Save best model
+         if avg_val_loss < best_val_loss:
+             best_val_loss = avg_val_loss
+             model.save_pretrained(output_dir)
+             print(f" -> Saved best model (val_loss: {best_val_loss:.4f})")
+
+     writer.close()
+
+     # Save final model
+     final_dir = os.path.join(output_dir, "final")
+     model.save_pretrained(final_dir)
+     print(f"Saved final model to {final_dir}")
+
+
+ if __name__ == "__main__":
+     parser = argparse.ArgumentParser()
+     parser.add_argument(
+         "--target",
+         type=str,
+         choices=["beats", "downbeats"],
+         default=None,
+         help="Train a model for 'beats' or 'downbeats'. If not specified, trains both.",
+     )
+     parser.add_argument(
+         "--output-dir",
+         type=str,
+         default="outputs/baseline1",
+         help="Directory to save model and logs",
+     )
+     args = parser.parse_args()
+
+     # Determine which targets to train
+     targets = [args.target] if args.target else ["beats", "downbeats"]
+
+     for target in targets:
+         output_dir = os.path.join(args.output_dir, target)
+         train(target, output_dir)
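The per-example, per-channel standardization used in both the train and validation loops is easy to sanity-check with numpy (equivalent math, `axis=(2, 3)` in place of `dim=(2, 3)`; the shapes here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
spec = rng.normal(5.0, 2.0, size=(4, 3, 80, 29))  # (B, C, F, T), like the preprocessor output
mean = spec.mean(axis=(2, 3), keepdims=True)      # one mean per example and channel
std = spec.std(axis=(2, 3), keepdims=True) + 1e-6  # epsilon guards against silent inputs
norm = (spec - mean) / std
# every (example, channel) slice is now ~zero-mean, unit-variance
```

Because the statistics are computed per example and per channel, the normalization is identical at train and inference time and needs no stored dataset statistics.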
exp/baseline1/utils.py ADDED
@@ -0,0 +1,53 @@
+ import torch
+ import torch.nn as nn
+ import torchaudio.transforms as T
+ import numpy as np
+
+
+ class MultiViewSpectrogram(nn.Module):
+     def __init__(self, sample_rate=16000, n_mels=80, hop_length=160):
+         super().__init__()
+         # Window sizes: 23ms, 46ms, 93ms at 16 kHz
+         self.win_lengths = [368, 736, 1488]
+         self.transforms = nn.ModuleList()
+
+         for win_len in self.win_lengths:
+             n_fft = 2 ** int(np.ceil(np.log2(win_len)))
+             mel = T.MelSpectrogram(
+                 sample_rate=sample_rate,
+                 n_fft=n_fft,
+                 win_length=win_len,
+                 hop_length=hop_length,
+                 f_min=27.5,
+                 f_max=8000.0,  # Nyquist frequency for 16 kHz input
+                 n_mels=n_mels,
+                 power=1.0,
+                 center=True,
+             )
+             self.transforms.append(mel)
+
+     def forward(self, waveform):
+         specs = []
+         for transform in self.transforms:
+             # Scale magnitudes logarithmically
+             s = transform(waveform)
+             s = torch.log(s + 1e-9)
+             specs.append(s)
+         return torch.stack(specs, dim=1)
+
+
+ def extract_context(spec, center_frame, context=7):
+     # Context of +/- 70ms (7 frames at a 10 ms hop)
+     channels, n_mels, total_time = spec.shape
+     start = center_frame - context
+     end = center_frame + context + 1
+
+     pad_left = max(0, -start)
+     pad_right = max(0, end - total_time)
+
+     if pad_left > 0 or pad_right > 0:
+         spec = torch.nn.functional.pad(spec, (pad_left, pad_right))
+         start += pad_left
+         end += pad_left
+
+     return spec[:, :, start:end]
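The `n_fft = 2 ** int(np.ceil(np.log2(win_len)))` line rounds each window length up to the next power of two, so all three views share the 10 ms hop while using different FFT sizes. The same arithmetic with only the stdlib:

```python
import math

win_lengths = [368, 736, 1488]  # 23 ms, 46 ms, 93 ms at 16 kHz
n_ffts = [2 ** math.ceil(math.log2(w)) for w in win_lengths]
print(n_ffts)  # -> [512, 1024, 2048]
```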
exp/baseline2/__init__.py ADDED
File without changes
exp/baseline2/data.py ADDED
@@ -0,0 +1,137 @@
+ import torch
+ from torch.utils.data import Dataset
+ import numpy as np
+ from tqdm import tqdm
+
+
+ class BeatTrackingDataset(Dataset):
+     def __init__(
+         self,
+         hf_dataset,
+         target_type="beats",
+         sample_rate=16000,
+         hop_length=160,
+         context_frames=50,
+     ):
+         """
+         Args:
+             hf_dataset: HuggingFace dataset object
+             target_type (str): "beats" or "downbeats". Determines which labels are treated as positive.
+             context_frames (int): Number of frames before and after the center frame.
+                 Total frames = 2 * context_frames + 1.
+                 Default 50 means 101 frames (~1s).
+         """
+         self.sr = sample_rate
+         self.hop_length = hop_length
+         self.target_type = target_type
+
+         self.context_frames = context_frames
+         # Context window size in samples.
+         # We need enough samples for the center frame +/- context frames,
+         # PLUS the window size of the largest FFT to compute the edges correctly.
+         # Largest window in MultiViewSpectrogram is 1488.
+         self.context_samples = (self.context_frames * 2 + 1) * hop_length + 1488
+
+         # Cache audio arrays in memory for fast access
+         self.audio_cache = []
+         self.indices = []
+         self._prepare_indices(hf_dataset)
+
+     def _prepare_indices(self, hf_dataset):
+         """
+         Prepares balanced indices and caches audio.
+         Uses the same "fuzzier" training-example strategy as the baseline.
+         """
+         print(f"Preparing dataset indices for target: {self.target_type}...")
+
+         for i, item in tqdm(
+             enumerate(hf_dataset), total=len(hf_dataset), desc="Building indices"
+         ):
+             # Cache audio array (convert to numpy if tensor)
+             audio = item["audio"]["array"]
+             if hasattr(audio, "numpy"):
+                 audio = audio.numpy()
+             self.audio_cache.append(audio)
+
+             # Calculate total frames available in audio
+             audio_len = len(audio)
+             n_frames = int(audio_len / self.hop_length)
+
+             # Select ground truth based on target_type
+             if self.target_type == "downbeats":
+                 gt_times = item["downbeats"]
+             else:
+                 gt_times = item["beats"]
+
+             # Convert to list if tensor
+             if hasattr(gt_times, "tolist"):
+                 gt_times = gt_times.tolist()
+
+             gt_frames = set([int(t * self.sr / self.hop_length) for t in gt_times])
+
+             # --- Positive Examples (with Fuzziness) ---
+             pos_frames = set()
+             for bf in gt_frames:
+                 if 0 <= bf < n_frames:
+                     self.indices.append((i, bf, 1.0))  # Center frame
+                     pos_frames.add(bf)
+
+                     # Neighbors weighted at 0.25
+                     if 0 <= bf - 1 < n_frames:
+                         self.indices.append((i, bf - 1, 0.25))
+                         pos_frames.add(bf - 1)
+                     if 0 <= bf + 1 < n_frames:
+                         self.indices.append((i, bf + 1, 0.25))
+                         pos_frames.add(bf + 1)
+
+             # --- Negative Examples ---
+             # Balance 2:1
+             num_pos = len(pos_frames)
+             num_neg = num_pos * 2
+
+             count = 0
+             attempts = 0
+             while count < num_neg and attempts < num_neg * 5:
+                 f = np.random.randint(0, n_frames)
+                 if f not in pos_frames:
+                     self.indices.append((i, f, 0.0))
+                     count += 1
+                 attempts += 1
+
+         print(
+             f"Dataset ready. {len(self.indices)} samples, {len(self.audio_cache)} tracks cached."
+         )
+
+     def __len__(self):
+         return len(self.indices)
+
+     def __getitem__(self, idx):
+         track_idx, frame_idx, label = self.indices[idx]
+
+         # Fast lookup from cache
+         audio = self.audio_cache[track_idx]
+         audio_len = len(audio)
+
+         # Calculate sample range for context window
+         center_sample = frame_idx * self.hop_length
+         half_context = self.context_samples // 2
+
+         # We want the window centered around center_sample
+         start = center_sample - half_context
+         end = center_sample + half_context
+
+         # Handle padding if needed
+         pad_left = max(0, -start)
+         pad_right = max(0, end - audio_len)
+
+         valid_start = max(0, start)
+         valid_end = min(audio_len, end)
+
+         # Extract audio chunk
+         chunk = audio[valid_start:valid_end]
+
+         if pad_left > 0 or pad_right > 0:
+             chunk = np.pad(chunk, (pad_left, pad_right), mode="constant")
+
+         waveform = torch.tensor(chunk, dtype=torch.float32)
+         return waveform, torch.tensor([label], dtype=torch.float32)
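The "fuzzy" positive-labeling scheme in `_prepare_indices` (1.0 on the beat frame, 0.25 on its immediate neighbors) can be sketched in isolation; `fuzzy_positive_indices` below is a hypothetical helper for illustration, not part of the commit:

```python
def fuzzy_positive_indices(beat_times, sr=16000, hop_length=160, n_frames=1000):
    """Return (frame, label) pairs: center frame -> 1.0, neighbors -> 0.25."""
    pairs = []
    for t in beat_times:
        bf = int(t * sr / hop_length)  # beat time in seconds -> frame index
        for df, label in ((0, 1.0), (-1, 0.25), (1, 0.25)):
            f = bf + df
            if 0 <= f < n_frames:
                pairs.append((f, label))
    return pairs

print(fuzzy_positive_indices([0.5]))  # -> [(50, 1.0), (49, 0.25), (51, 0.25)]
```

The soft 0.25 targets make the BCE loss tolerant of one-frame (10 ms) annotation jitter instead of penalizing near-misses as hard negatives.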
exp/baseline2/eval.py ADDED
@@ -0,0 +1,324 @@
+ import torch
+ import numpy as np
+ from tqdm import tqdm
+ from scipy.signal import find_peaks
+ import argparse
+ import os
+
+ from .model import ResNet
+ from ..baseline1.utils import MultiViewSpectrogram
+ from ..data.load import ds
+ from ..data.eval import evaluate_all, format_results
+
+
+ def get_activation_function(model, waveform, device):
+     """
+     Computes probability curve over time.
+     """
+     processor = MultiViewSpectrogram().to(device)
+     waveform = waveform.unsqueeze(0).to(device)
+
+     with torch.no_grad():
+         spec = processor(waveform)
+
+         # Normalize
+         mean = spec.mean(dim=(2, 3), keepdim=True)
+         std = spec.std(dim=(2, 3), keepdim=True) + 1e-6
+         spec = (spec - mean) / std
+
+         # Batchify with sliding window.
+         # Context frames = 50, so total window = 101.
+         # Pad time by 50 on each side.
+         spec = torch.nn.functional.pad(spec, (50, 50))  # Pad time
+         windows = spec.unfold(3, 101, 1)  # (1, 3, 80, Time, 101)
+         windows = windows.permute(3, 0, 1, 2, 4).squeeze(1)  # (Time, 3, 80, 101)
+
+         # Inference
+         activations = []
+         batch_size = 128  # Reduced batch size
+         for i in range(0, len(windows), batch_size):
+             batch = windows[i : i + batch_size]
+             out = model(batch)
+             activations.append(out.cpu().numpy())
+
+     return np.concatenate(activations).flatten()
+
+
+ def pick_peaks(activations, hop_length=160, sr=16000):
+     """
+     Smooth with Hamming window and report local maxima.
+     """
+     # Smoothing
+     window = np.hamming(5)
+     window /= window.sum()
+     smoothed = np.convolve(activations, window, mode="same")
+
+     # Peak Picking
+     peaks, _ = find_peaks(smoothed, height=0.5, distance=5)
+
+     timestamps = peaks * hop_length / sr
+     return timestamps.tolist()
+
+
+ def visualize_track(
+     audio: np.ndarray,
+     sr: int,
+     pred_beats: list[float],
+     pred_downbeats: list[float],
+     gt_beats: list[float],
+     gt_downbeats: list[float],
+     output_dir: str,
+     track_idx: int,
+     time_range: tuple[float, float] | None = None,
+ ):
+     """
+     Create and save visualizations for a single track.
+     """
+     from ..data.viz import plot_waveform_with_beats, save_figure
+
+     os.makedirs(output_dir, exist_ok=True)
+
+     # Full waveform plot
+     fig = plot_waveform_with_beats(
+         audio,
+         sr,
+         pred_beats,
+         gt_beats,
+         pred_downbeats,
+         gt_downbeats,
+         title=f"Track {track_idx}: Beat Comparison",
+         time_range=time_range,
+     )
+     save_figure(fig, os.path.join(output_dir, f"track_{track_idx:03d}.png"))
+
+
+ def synthesize_audio(
+     audio: np.ndarray,
+     sr: int,
+     pred_beats: list[float],
+     pred_downbeats: list[float],
+     gt_beats: list[float],
+     gt_downbeats: list[float],
+     output_dir: str,
+     track_idx: int,
+     click_volume: float = 0.5,
+ ):
+     """
+     Create and save audio files with click tracks for a single track.
+     """
+     from ..data.audio import create_comparison_audio, save_audio
+
+     os.makedirs(output_dir, exist_ok=True)
+
+     # Create comparison audio
+     audio_pred, audio_gt, audio_both = create_comparison_audio(
+         audio,
+         pred_beats,
+         pred_downbeats,
+         gt_beats,
+         gt_downbeats,
+         sr=sr,
+         click_volume=click_volume,
+     )
+
+     # Save audio files
+     save_audio(
+         audio_pred, os.path.join(output_dir, f"track_{track_idx:03d}_pred.wav"), sr
+     )
+     save_audio(audio_gt, os.path.join(output_dir, f"track_{track_idx:03d}_gt.wav"), sr)
+     save_audio(
+         audio_both, os.path.join(output_dir, f"track_{track_idx:03d}_both.wav"), sr
+     )
+
+
+ def main():
+     parser = argparse.ArgumentParser(
+         description="Evaluate beat tracking models with visualization and audio synthesis"
+     )
+     parser.add_argument(
+         "--model-dir",
+         type=str,
+         default="outputs/baseline2",
+         help="Base directory containing trained models (with 'beats' and 'downbeats' subdirs)",
+     )
+     parser.add_argument(
+         "--num-samples",
+         type=int,
+         default=116,
+         help="Number of samples to evaluate",
+     )
+     parser.add_argument(
+         "--output-dir",
+         type=str,
+         default="outputs/eval_baseline2",
+         help="Directory to save visualizations and audio",
+     )
+     parser.add_argument(
+         "--visualize",
+         action="store_true",
+         help="Generate visualization plots for each track",
+     )
+     parser.add_argument(
+         "--synthesize",
+         action="store_true",
+         help="Generate audio files with click tracks",
+     )
+     parser.add_argument(
+         "--viz-tracks",
+         type=int,
+         default=5,
+         help="Number of tracks to visualize/synthesize (default: 5)",
+     )
+     parser.add_argument(
+         "--time-range",
+         type=float,
+         nargs=2,
+         default=None,
+         metavar=("START", "END"),
+         help="Time range for visualization in seconds (default: full track)",
+     )
+     parser.add_argument(
+         "--click-volume",
+         type=float,
+         default=0.5,
+         help="Volume of click sounds relative to audio (0.0 to 1.0)",
+     )
+     parser.add_argument(
+         "--summary-plot",
+         action="store_true",
+         help="Generate summary evaluation plot",
+     )
+     args = parser.parse_args()
+
+     DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
+
+     # Load BOTH models using from_pretrained
+     beat_model = None
+     downbeat_model = None
+
+     has_beats = False
+     has_downbeats = False
+
+     beats_dir = os.path.join(args.model_dir, "beats")
+     downbeats_dir = os.path.join(args.model_dir, "downbeats")
+
+     if os.path.exists(os.path.join(beats_dir, "model.safetensors")):
+         beat_model = ResNet.from_pretrained(beats_dir).to(DEVICE)
+         beat_model.eval()
+         has_beats = True
+         print(f"Loaded Beat Model from {beats_dir}")
+     else:
+         print(f"Warning: No beat model found in {beats_dir}")
+
+     if os.path.exists(os.path.join(downbeats_dir, "model.safetensors")):
+         downbeat_model = ResNet.from_pretrained(downbeats_dir).to(DEVICE)
+         downbeat_model.eval()
+         has_downbeats = True
+         print(f"Loaded Downbeat Model from {downbeats_dir}")
+     else:
+         print(f"Warning: No downbeat model found in {downbeats_dir}")
+
+     if not has_beats and not has_downbeats:
+         print("No models found. Please run training first.")
+         return
+
+     predictions = []
+     ground_truths = []
+     audio_data = []  # Store audio for visualization/synthesis
+
+     # Eval on specified number of tracks
+     test_set = ds["train"].select(range(args.num_samples))
+
+     print("Running evaluation...")
+     for i, item in enumerate(tqdm(test_set)):
+         waveform = torch.tensor(item["audio"]["array"], dtype=torch.float32)
+         waveform_device = waveform.to(DEVICE)
+
+         pred_entry = {"beats": [], "downbeats": []}
+
+         # 1. Predict Beats
+         if has_beats:
+             act_b = get_activation_function(beat_model, waveform_device, DEVICE)
+             pred_entry["beats"] = pick_peaks(act_b)
+
+         # 2. Predict Downbeats
+         if has_downbeats:
+             act_d = get_activation_function(downbeat_model, waveform_device, DEVICE)
+             pred_entry["downbeats"] = pick_peaks(act_d)
+
+         predictions.append(pred_entry)
+         ground_truths.append({"beats": item["beats"], "downbeats": item["downbeats"]})
+
+         # Store audio for later visualization/synthesis
+         if args.visualize or args.synthesize:
+             if i < args.viz_tracks:
+                 audio_data.append(
+                     {
+                         "audio": waveform.numpy(),
+                         "sr": item["audio"]["sampling_rate"],
+                         "pred": pred_entry,
+                         "gt": ground_truths[-1],
+                     }
+                 )
+
+     # Run evaluation
+     results = evaluate_all(predictions, ground_truths)
+     print(format_results(results))
+
+     # Create output directory
+     if args.visualize or args.synthesize or args.summary_plot:
+         os.makedirs(args.output_dir, exist_ok=True)
+
+     # Generate visualizations
+     if args.visualize:
+         print(f"\nGenerating visualizations for {len(audio_data)} tracks...")
+         viz_dir = os.path.join(args.output_dir, "plots")
+         for i, data in enumerate(tqdm(audio_data, desc="Visualizing")):
+             time_range = tuple(args.time_range) if args.time_range else None
+             visualize_track(
+                 data["audio"],
+                 data["sr"],
+                 data["pred"]["beats"],
+                 data["pred"]["downbeats"],
+                 data["gt"]["beats"],
+                 data["gt"]["downbeats"],
+                 viz_dir,
+                 i,
+                 time_range=time_range,
+             )
+         print(f"Saved visualizations to {viz_dir}")
+
+     # Generate audio with clicks
+     if args.synthesize:
+         print(f"\nSynthesizing audio for {len(audio_data)} tracks...")
+         audio_dir = os.path.join(args.output_dir, "audio")
+         for i, data in enumerate(tqdm(audio_data, desc="Synthesizing")):
+             synthesize_audio(
+                 data["audio"],
+                 data["sr"],
+                 data["pred"]["beats"],
+                 data["pred"]["downbeats"],
+                 data["gt"]["beats"],
+                 data["gt"]["downbeats"],
+                 audio_dir,
+                 i,
+                 click_volume=args.click_volume,
+             )
+         print(f"Saved audio files to {audio_dir}")
+         print("  *_pred.wav - Original audio with predicted beat clicks")
+         print("  *_gt.wav - Original audio with ground truth beat clicks")
+         print("  *_both.wav - Original audio with both predicted and GT clicks")
+
+     # Generate summary plot
+     if args.summary_plot:
+         from ..data.viz import plot_evaluation_summary, save_figure
+
+         print("\nGenerating summary plot...")
+         fig = plot_evaluation_summary(results, title="Beat Tracking Evaluation Summary")
318
+ summary_path = os.path.join(args.output_dir, "evaluation_summary.png")
319
+ save_figure(fig, summary_path)
320
+ print(f"Saved summary plot to {summary_path}")
321
+
322
+
323
+ if __name__ == "__main__":
324
+ main()
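The loop above compares predicted beat lists against ground truth via `evaluate_all`. As an illustration of what a beat F-measure computes, here is a naive pure-Python sketch with greedy one-to-one matching inside a ±70 ms tolerance window. The repo's `evaluate_all` is a separate implementation (presumably mir_eval-style); this stand-in is only for intuition, and `beat_f_measure` is a hypothetical helper name.

```python
def beat_f_measure(pred, gt, tolerance=0.07):
    """Illustrative F-measure: greedily match each predicted beat to the first
    unused ground-truth beat within +/- tolerance seconds."""
    if not pred or not gt:
        return 0.0
    matched = 0
    used = [False] * len(gt)
    for p in pred:
        for j, g in enumerate(gt):
            if not used[j] and abs(p - g) <= tolerance:
                used[j] = True
                matched += 1
                break
    if matched == 0:
        return 0.0
    precision = matched / len(pred)
    recall = matched / len(gt)
    return 2 * precision * recall / (precision + recall)


# Two of three predictions land within tolerance -> P = R = 2/3
print(beat_f_measure([0.5, 1.01, 1.5], [0.5, 1.0, 2.0]))
```

Note that greedy matching can differ from an optimal bipartite matching on dense beat grids, which is one reason real evaluators are more careful.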
exp/baseline2/model.py ADDED
@@ -0,0 +1,139 @@
+import torch
+import torch.nn as nn
+from huggingface_hub import PyTorchModelHubMixin
+
+
+class SEBlock(nn.Module):
+    def __init__(self, channels, reduction=16):
+        super().__init__()
+        self.avg_pool = nn.AdaptiveAvgPool2d(1)
+        self.fc = nn.Sequential(
+            nn.Linear(channels, channels // reduction, bias=False),
+            nn.ReLU(inplace=True),
+            nn.Linear(channels // reduction, channels, bias=False),
+            nn.Sigmoid(),
+        )
+
+    def forward(self, x):
+        b, c, _, _ = x.size()
+        y = self.avg_pool(x).view(b, c)
+        y = self.fc(y).view(b, c, 1, 1)
+        return x * y.expand_as(x)
+
+
+class ResBlock(nn.Module):
+    def __init__(self, in_channels, out_channels, stride=1, downsample=None):
+        super().__init__()
+        self.conv1 = nn.Conv2d(
+            in_channels,
+            out_channels,
+            kernel_size=3,
+            stride=stride,
+            padding=1,
+            bias=False,
+        )
+        self.bn1 = nn.BatchNorm2d(out_channels)
+        self.relu = nn.ReLU(inplace=True)
+        self.conv2 = nn.Conv2d(
+            out_channels, out_channels, kernel_size=3, padding=1, bias=False
+        )
+        self.bn2 = nn.BatchNorm2d(out_channels)
+        self.se = SEBlock(out_channels)
+        self.downsample = downsample
+
+    def forward(self, x):
+        identity = x
+        if self.downsample is not None:
+            identity = self.downsample(x)
+
+        out = self.conv1(x)
+        out = self.bn1(out)
+        out = self.relu(out)
+
+        out = self.conv2(out)
+        out = self.bn2(out)
+        out = self.se(out)
+
+        out += identity
+        out = self.relu(out)
+        return out
+
+
+class ResNet(nn.Module, PyTorchModelHubMixin):
+    def __init__(
+        self, layers=[2, 2, 2, 2], channels=[16, 24, 48, 96], dropout_rate=0.5
+    ):
+        super().__init__()
+        self.in_channels = 16
+
+        # Stem
+        self.conv1 = nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=1, bias=False)
+        self.bn1 = nn.BatchNorm2d(16)
+        self.relu = nn.ReLU(inplace=True)
+
+        # Stages
+        self.layer1 = self._make_layer(channels[0], layers[0], stride=1)
+        self.layer2 = self._make_layer(channels[1], layers[1], stride=2)
+        self.layer3 = self._make_layer(channels[2], layers[2], stride=2)
+        self.layer4 = self._make_layer(channels[3], layers[3], stride=2)
+
+        self.dropout = nn.Dropout(p=dropout_rate)
+
+        # Final classification head
+        # H, W shrink stage by stage. Assuming input is (3, 80, 101):
+        # L1: (16, 80, 101) (stride 1)
+        # L2: (24, 40, 51)  (stride 2)
+        # L3: (48, 20, 26)  (stride 2)
+        # L4: (96, 10, 13)  (stride 2)
+
+        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
+        self.fc = nn.Linear(channels[3], 1)
+        self.sigmoid = nn.Sigmoid()
+
+    def _make_layer(self, out_channels, blocks, stride=1):
+        downsample = None
+        if stride != 1 or self.in_channels != out_channels:
+            downsample = nn.Sequential(
+                nn.Conv2d(
+                    self.in_channels,
+                    out_channels,
+                    kernel_size=1,
+                    stride=stride,
+                    bias=False,
+                ),
+                nn.BatchNorm2d(out_channels),
+            )
+
+        layers = []
+        layers.append(ResBlock(self.in_channels, out_channels, stride, downsample))
+        self.in_channels = out_channels
+        for _ in range(1, blocks):
+            layers.append(ResBlock(self.in_channels, out_channels))
+
+        return nn.Sequential(*layers)
+
+    def forward(self, x):
+        # x: (B, 3, 80, 101)
+        x = self.conv1(x)
+        x = self.bn1(x)
+        x = self.relu(x)
+
+        x = self.layer1(x)
+        x = self.layer2(x)
+        x = self.layer3(x)
+        x = self.layer4(x)
+
+        x = self.avgpool(x)  # (B, 96, 1, 1)
+        x = torch.flatten(x, 1)  # (B, 96)
+        x = self.dropout(x)
+        x = self.fc(x)
+        x = self.sigmoid(x)
+
+        return x
+
+
+if __name__ == "__main__":
+    from torchinfo import summary
+
+    model = ResNet()
+    summary(model, (1, 3, 80, 101))
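The per-stage shape comments in `ResNet.__init__` follow from standard convolution arithmetic. A minimal pure-Python sketch (assuming the model's kernel-3 / padding-1 convolutions, and the default `channels=[16, 24, 48, 96]`):

```python
def conv_out(n, kernel=3, stride=1, padding=1):
    """Output length along one axis: floor((n + 2*padding - kernel) / stride) + 1."""
    return (n + 2 * padding - kernel) // stride + 1


def stage_shapes(h=80, w=101, channels=(16, 24, 48, 96), strides=(1, 2, 2, 2)):
    """Trace (C, H, W) through the four stages for an (3, 80, 101) input."""
    shapes = []
    for c, s in zip(channels, strides):
        h, w = conv_out(h, stride=s), conv_out(w, stride=s)
        shapes.append((c, h, w))
    return shapes


print(stage_shapes())
# -> [(16, 80, 101), (24, 40, 51), (48, 20, 26), (96, 10, 13)]
```

Only the first (strided) block of each stage changes the spatial size; the remaining blocks are stride-1 and shape-preserving, so tracing one conv per stage is enough.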
exp/baseline2/train.py ADDED
@@ -0,0 +1,215 @@
+import torch
+import torch.nn as nn
+import torch.optim as optim
+from torch.utils.data import DataLoader
+from torch.utils.tensorboard import SummaryWriter
+from tqdm import tqdm
+import argparse
+import os
+
+from .model import ResNet
+from .data import BeatTrackingDataset
+from ..baseline1.utils import MultiViewSpectrogram
+from ..data.load import ds
+
+
+def train(target_type: str, output_dir: str):
+    DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
+    BATCH_SIZE = 128  # Reduced batch size due to larger context
+    EPOCHS = 3
+    LR = 0.001  # Adjusted LR for Adam (ResNet usually prefers Adam/AdamW)
+    NUM_WORKERS = 4
+    CONTEXT_FRAMES = 50  # +/- 50 frames -> 101 frames total
+    PATIENCE = 5  # Early stopping patience
+
+    print(f"--- Training Model for target: {target_type} ---")
+    print(f"Output directory: {output_dir}")
+
+    # Create output directory
+    os.makedirs(output_dir, exist_ok=True)
+
+    # TensorBoard writer
+    writer = SummaryWriter(log_dir=os.path.join(output_dir, "logs"))
+
+    # Data
+    train_dataset = BeatTrackingDataset(
+        ds["train"], target_type=target_type, context_frames=CONTEXT_FRAMES
+    )
+    val_dataset = BeatTrackingDataset(
+        ds["test"], target_type=target_type, context_frames=CONTEXT_FRAMES
+    )
+
+    train_loader = DataLoader(
+        train_dataset,
+        batch_size=BATCH_SIZE,
+        shuffle=True,
+        num_workers=NUM_WORKERS,
+        pin_memory=True,
+        prefetch_factor=4,
+        persistent_workers=True,
+    )
+    val_loader = DataLoader(
+        val_dataset,
+        batch_size=BATCH_SIZE,
+        shuffle=False,
+        num_workers=NUM_WORKERS,
+        pin_memory=True,
+        prefetch_factor=4,
+        persistent_workers=True,
+    )
+
+    print(f"Train samples: {len(train_dataset)}, Val samples: {len(val_dataset)}")
+
+    # Model
+    model = ResNet(dropout_rate=0.5).to(DEVICE)
+
+    # GPU Spectrogram Preprocessor
+    preprocessor = MultiViewSpectrogram(sample_rate=16000, hop_length=160).to(DEVICE)
+
+    # Optimizer - Using AdamW for ResNet
+    optimizer = optim.AdamW(model.parameters(), lr=LR, weight_decay=1e-4)
+    scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=EPOCHS)
+    criterion = nn.BCELoss()  # Binary Cross Entropy
+
+    best_val_loss = float("inf")
+    patience_counter = 0
+    global_step = 0
+
+    for epoch in range(EPOCHS):
+        # Training
+        model.train()
+        total_train_loss = 0
+        for waveform, y in tqdm(
+            train_loader,
+            desc=f"[{target_type}] Epoch {epoch + 1}/{EPOCHS} Train",
+            leave=False,
+        ):
+            waveform, y = waveform.to(DEVICE), y.to(DEVICE)
+
+            # Compute spectrogram on GPU
+            with torch.no_grad():
+                spec = preprocessor(waveform)  # (B, 3, 80, T_raw)
+                # Normalize
+                mean = spec.mean(dim=(2, 3), keepdim=True)
+                std = spec.std(dim=(2, 3), keepdim=True) + 1e-6
+                spec = (spec - mean) / std
+
+                T_curr = spec.shape[-1]
+                target_T = CONTEXT_FRAMES * 2 + 1
+
+                if T_curr > target_T:
+                    start = (T_curr - target_T) // 2
+                    x = spec[:, :, :, start : start + target_T]
+                elif T_curr < target_T:
+                    # This shouldn't happen if dataset is correct, but just in case pad
+                    pad = target_T - T_curr
+                    x = torch.nn.functional.pad(spec, (0, pad))
+                else:
+                    x = spec
+
+            optimizer.zero_grad()
+            output = model(x)
+            loss = criterion(output, y)
+            loss.backward()
+            optimizer.step()
+
+            total_train_loss += loss.item()
+            global_step += 1
+
+            # Log batch loss
+            writer.add_scalar("train/batch_loss", loss.item(), global_step)
+
+        avg_train_loss = total_train_loss / len(train_loader)
+
+        # Validation
+        model.eval()
+        total_val_loss = 0
+        with torch.no_grad():
+            for waveform, y in tqdm(
+                val_loader,
+                desc=f"[{target_type}] Epoch {epoch + 1}/{EPOCHS} Val",
+                leave=False,
+            ):
+                waveform, y = waveform.to(DEVICE), y.to(DEVICE)
+
+                # Compute spectrogram on GPU
+                spec = preprocessor(waveform)  # (B, 3, 80, T)
+                # Normalize
+                mean = spec.mean(dim=(2, 3), keepdim=True)
+                std = spec.std(dim=(2, 3), keepdim=True) + 1e-6
+                spec = (spec - mean) / std
+
+                T_curr = spec.shape[-1]
+                target_T = CONTEXT_FRAMES * 2 + 1
+
+                if T_curr > target_T:
+                    start = (T_curr - target_T) // 2
+                    x = spec[:, :, :, start : start + target_T]
+                else:
+                    pad = target_T - T_curr
+                    x = torch.nn.functional.pad(spec, (0, pad))
+
+                output = model(x)
+                loss = criterion(output, y)
+                total_val_loss += loss.item()
+
+        avg_val_loss = total_val_loss / len(val_loader)
+
+        # Log epoch metrics
+        writer.add_scalar("train/epoch_loss", avg_train_loss, epoch)
+        writer.add_scalar("val/loss", avg_val_loss, epoch)
+        writer.add_scalar("train/learning_rate", scheduler.get_last_lr()[0], epoch)
+
+        # Step the scheduler
+        scheduler.step()
+
+        print(
+            f"[{target_type}] Epoch {epoch + 1}/{EPOCHS} - "
+            f"Train Loss: {avg_train_loss:.4f}, Val Loss: {avg_val_loss:.4f}"
+        )
+
+        # Save best model
+        if avg_val_loss < best_val_loss:
+            best_val_loss = avg_val_loss
+            patience_counter = 0
+            model.save_pretrained(output_dir)
+            print(f"  -> Saved best model (val_loss: {best_val_loss:.4f})")
+        else:
+            patience_counter += 1
+            print(f"  -> No improvement (patience: {patience_counter}/{PATIENCE})")
+
+        if patience_counter >= PATIENCE:
+            print("Early stopping triggered.")
+            break
+
+    writer.close()
+
+    # Save final model
+    final_dir = os.path.join(output_dir, "final")
+    model.save_pretrained(final_dir)
+    print(f"Saved final model to {final_dir}")
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+    parser.add_argument(
+        "--target",
+        type=str,
+        choices=["beats", "downbeats"],
+        default=None,
+        help="Train a model for 'beats' or 'downbeats'. If not specified, trains both.",
+    )
+    parser.add_argument(
+        "--output-dir",
+        type=str,
+        default="outputs/baseline2",
+        help="Directory to save model and logs",
+    )
+    args = parser.parse_args()
+
+    # Determine which targets to train
+    targets = [args.target] if args.target else ["beats", "downbeats"]
+
+    for target in targets:
+        output_dir = os.path.join(args.output_dir, target)
+        train(target, output_dir)
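Both the train and validation loops above center-crop (or right-pad) the spectrogram to exactly `2 * CONTEXT_FRAMES + 1` frames before the model sees it. The index arithmetic can be sketched in pure Python; `center_crop_bounds` is a hypothetical helper, not part of the repo:

```python
def center_crop_bounds(t_curr, context_frames=50):
    """Return (start, end, pad_right) for cropping/padding a T-frame spectrogram
    to target_t = 2 * context_frames + 1 frames, mirroring the training loop."""
    target_t = context_frames * 2 + 1
    if t_curr < target_t:
        # Keep everything, pad the remainder on the right (the rare fallback case).
        return 0, t_curr, target_t - t_curr
    start = (t_curr - target_t) // 2
    return start, start + target_t, 0


print(center_crop_bounds(103))  # two extra frames -> drop one from each side
```

For an even surplus the crop is symmetric; for an odd surplus the extra frame is dropped from the right, since `//` rounds the start index down.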
exp/baseline3/__init__.py ADDED
File without changes
exp/baseline3/data.py ADDED
@@ -0,0 +1,173 @@
+import torch
+from torch.utils.data import Dataset
+import numpy as np
+from tqdm import tqdm
+
+
+class BeatTrackingDataset(Dataset):
+    def __init__(
+        self,
+        hf_dataset,
+        target_type="beats",
+        sample_rate=16000,
+        hop_length=160,
+        context_frames=50,
+        max_tracks: int | None = None,
+        hard_neg_radius: int = 0,
+        hard_neg_fraction: float = 0.5,
+    ):
+        """
+        Args:
+            hf_dataset: HuggingFace dataset object
+            target_type (str): "beats" or "downbeats". Determines which labels are treated as positive.
+            context_frames (int): Number of frames before and after the center frame.
+                Total frames = 2 * context_frames + 1.
+                Default 50 means 101 frames (~1s).
+        """
+        self.sr = sample_rate
+        self.hop_length = hop_length
+        self.target_type = target_type
+        self.hard_neg_radius = int(hard_neg_radius)
+        self.hard_neg_fraction = float(hard_neg_fraction)
+
+        self.context_frames = context_frames
+        # Context window size in samples
+        # We need enough samples for the center frame +/- context frames
+        # PLUS the window size of the largest FFT to compute the edges correctly.
+        # Largest window in MultiViewSpectrogram is 1488.
+        self.context_samples = (self.context_frames * 2 + 1) * hop_length + 1488
+
+        # Cache audio arrays in memory for fast access
+        self.audio_cache = []
+        self.indices = []
+        self._prepare_indices(hf_dataset, max_tracks=max_tracks)
+
+    def _prepare_indices(self, hf_dataset, *, max_tracks: int | None):
+        """
+        Prepares balanced indices and caches audio.
+        Uses the same "Fuzzier" training examples strategy as the baseline.
+        """
+        print(f"Preparing dataset indices for target: {self.target_type}...")
+
+        total = len(hf_dataset)
+        if max_tracks is not None:
+            total = min(total, max_tracks)
+
+        for i, item in tqdm(
+            enumerate(hf_dataset), total=total, desc="Building indices"
+        ):
+            if max_tracks is not None and i >= max_tracks:
+                break
+            # Cache audio array (convert to numpy if tensor)
+            audio = item["audio"]["array"]
+            if hasattr(audio, "numpy"):
+                audio = audio.numpy()
+            self.audio_cache.append(audio)
+
+            # Calculate total frames available in audio
+            audio_len = len(audio)
+            n_frames = int(audio_len / self.hop_length)
+
+            # Select ground truth based on target_type
+            if self.target_type == "downbeats":
+                gt_times = item["downbeats"]
+            else:
+                gt_times = item["beats"]
+
+            # Convert to list if tensor
+            if hasattr(gt_times, "tolist"):
+                gt_times = gt_times.tolist()
+
+            gt_frames = set([int(t * self.sr / self.hop_length) for t in gt_times])
+
+            # --- Positive Examples (with Fuzziness) ---
+            pos_frames = set()
+            for bf in gt_frames:
+                if 0 <= bf < n_frames:
+                    self.indices.append((i, bf, 1.0))  # Center frame
+                    pos_frames.add(bf)
+
+                # Neighbors weighted at 0.25
+                if 0 <= bf - 1 < n_frames:
+                    self.indices.append((i, bf - 1, 0.25))
+                    pos_frames.add(bf - 1)
+                if 0 <= bf + 1 < n_frames:
+                    self.indices.append((i, bf + 1, 0.25))
+                    pos_frames.add(bf + 1)
+
+            # --- Negative Examples ---
+            # Balance 2:1
+            num_pos = len(pos_frames)
+            num_neg = num_pos * 2
+
+            # (Optional) hard negatives close to beats.
+            # Rationale: random negatives are often "easy" (silence/long gaps),
+            # while the model struggles most on near-beat confusions that cause
+            # double peaks / jitter.
+            hard_neg_target = 0
+            if self.hard_neg_radius > 1 and num_neg > 0:
+                hard_neg_target = int(num_neg * self.hard_neg_fraction)
+                hard_neg_target = max(0, min(num_neg, hard_neg_target))
+
+            hard_added = 0
+            if hard_neg_target > 0:
+                for bf in gt_frames:
+                    for d in range(2, self.hard_neg_radius + 1):
+                        for f in (bf - d, bf + d):
+                            if hard_added >= hard_neg_target:
+                                break
+                            if 0 <= f < n_frames and f not in pos_frames:
+                                self.indices.append((i, f, 0.0))
+                                hard_added += 1
+                        if hard_added >= hard_neg_target:
+                            break
+                    if hard_added >= hard_neg_target:
+                        break
+
+            count = 0
+            attempts = 0
+            remaining_neg = num_neg - hard_added
+            while count < remaining_neg and attempts < remaining_neg * 5:
+                f = np.random.randint(0, n_frames)
+                if f not in pos_frames:
+                    self.indices.append((i, f, 0.0))
+                    count += 1
+                attempts += 1
+
+        print(
+            f"Dataset ready. {len(self.indices)} samples, {len(self.audio_cache)} tracks cached."
+        )
+
+    def __len__(self):
+        return len(self.indices)
+
+    def __getitem__(self, idx):
+        track_idx, frame_idx, label = self.indices[idx]
+
+        # Fast lookup from cache
+        audio = self.audio_cache[track_idx]
+        audio_len = len(audio)
+
+        # Calculate sample range for context window
+        center_sample = frame_idx * self.hop_length
+        half_context = self.context_samples // 2
+
+        # We want the window centered around center_sample
+        start = center_sample - half_context
+        end = center_sample + half_context
+
+        # Handle padding if needed
+        pad_left = max(0, -start)
+        pad_right = max(0, end - audio_len)
+
+        valid_start = max(0, start)
+        valid_end = min(audio_len, end)
+
+        # Extract audio chunk
+        chunk = audio[valid_start:valid_end]
+
+        if pad_left > 0 or pad_right > 0:
+            chunk = np.pad(chunk, (pad_left, pad_right), mode="constant")
+
+        waveform = torch.tensor(chunk, dtype=torch.float32)
+        return waveform, torch.tensor([label], dtype=torch.float32)
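The dataset above maps beat times (seconds) to spectrogram frame indices at `sr / hop_length = 100` frames per second, then assigns "fuzzy" labels: 1.0 at the beat frame and 0.25 at its immediate neighbours. A small pure-Python sketch of that mapping (simplified: here overlapping labels are merged with `max`, whereas the dataset emits separate index entries; both function names are hypothetical):

```python
def beat_frames(beat_times, sr=16000, hop_length=160):
    """Beat times in seconds -> sorted, de-duplicated frame indices (100 fps)."""
    return sorted({int(t * sr / hop_length) for t in beat_times})


def fuzzy_labels(frames, n_frames):
    """1.0 at each beat frame, 0.25 at +/- 1 frame, clipped to [0, n_frames)."""
    labels = {}
    for bf in frames:
        for f, lab in ((bf, 1.0), (bf - 1, 0.25), (bf + 1, 0.25)):
            if 0 <= f < n_frames:
                labels[f] = max(labels.get(f, 0.0), lab)
    return labels


print(beat_frames([0.5, 1.0]))  # beats at 0.5 s and 1.0 s -> frames 50 and 100
```

At 100 fps a ±1-frame fuzz corresponds to ±10 ms, which is well inside the usual 70 ms beat-tracking tolerance.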
exp/baseline3/eval.py ADDED
@@ -0,0 +1,336 @@
+import torch
+import numpy as np
+from tqdm import tqdm
+from scipy.signal import find_peaks
+import argparse
+import os
+
+from .model import ResNet
+from ..baseline1.utils import MultiViewSpectrogram
+from ..data.load import ds
+from ..data.eval import evaluate_all, format_results
+
+
+def get_activation_function(model, waveform, device):
+    """
+    Computes probability curve over time.
+    """
+    processor = MultiViewSpectrogram().to(device)
+    waveform = waveform.unsqueeze(0).to(device)
+
+    with torch.no_grad():
+        spec = processor(waveform)
+
+        # Normalize
+        mean = spec.mean(dim=(2, 3), keepdim=True)
+        std = spec.std(dim=(2, 3), keepdim=True) + 1e-6
+        spec = (spec - mean) / std
+
+        # Batchify with sliding window
+        # Context frames = 50, so total window = 101.
+        # Pad time by 50 on each side.
+        spec = torch.nn.functional.pad(spec, (50, 50))  # Pad time
+        windows = spec.unfold(3, 101, 1)  # (1, 3, 80, Time, 101)
+        windows = windows.permute(3, 0, 1, 2, 4).squeeze(1)  # (Time, 3, 80, 101)
+
+        # Inference
+        activations = []
+        batch_size = 128  # Reduced batch size
+        for i in range(0, len(windows), batch_size):
+            batch = windows[i : i + batch_size]
+            out = model(batch)
+            activations.append(out.cpu().numpy())
+
+    return np.concatenate(activations).flatten()
+
+
+def pick_peaks(activations, hop_length=160, sr=16000):
+    """
+    Smooth with Hamming window and report local maxima.
+    """
+    # Smoothing
+    window = np.hamming(5)
+    window /= window.sum()
+    smoothed = np.convolve(activations, window, mode="same")
+
+    # Peak Picking
+    peaks, _ = find_peaks(smoothed, height=0.5, distance=5)
+
+    timestamps = peaks * hop_length / sr
+    return timestamps.tolist()
+
+
+def visualize_track(
+    audio: np.ndarray,
+    sr: int,
+    pred_beats: list[float],
+    pred_downbeats: list[float],
+    gt_beats: list[float],
+    gt_downbeats: list[float],
+    output_dir: str,
+    track_idx: int,
+    time_range: tuple[float, float] | None = None,
+):
+    """
+    Create and save visualizations for a single track.
+    """
+    from ..data.viz import plot_waveform_with_beats, save_figure
+
+    os.makedirs(output_dir, exist_ok=True)
+
+    # Full waveform plot
+    fig = plot_waveform_with_beats(
+        audio,
+        sr,
+        pred_beats,
+        gt_beats,
+        pred_downbeats,
+        gt_downbeats,
+        title=f"Track {track_idx}: Beat Comparison",
+        time_range=time_range,
+    )
+    save_figure(fig, os.path.join(output_dir, f"track_{track_idx:03d}.png"))
+
+
+def synthesize_audio(
+    audio: np.ndarray,
+    sr: int,
+    pred_beats: list[float],
+    pred_downbeats: list[float],
+    gt_beats: list[float],
+    gt_downbeats: list[float],
+    output_dir: str,
+    track_idx: int,
+    click_volume: float = 0.5,
+):
+    """
+    Create and save audio files with click tracks for a single track.
+    """
+    from ..data.audio import create_comparison_audio, save_audio
+
+    os.makedirs(output_dir, exist_ok=True)
+
+    # Create comparison audio
+    audio_pred, audio_gt, audio_both = create_comparison_audio(
+        audio,
+        pred_beats,
+        pred_downbeats,
+        gt_beats,
+        gt_downbeats,
+        sr=sr,
+        click_volume=click_volume,
+    )
+
+    # Save audio files
+    save_audio(
+        audio_pred, os.path.join(output_dir, f"track_{track_idx:03d}_pred.wav"), sr
+    )
+    save_audio(audio_gt, os.path.join(output_dir, f"track_{track_idx:03d}_gt.wav"), sr)
+    save_audio(
+        audio_both, os.path.join(output_dir, f"track_{track_idx:03d}_both.wav"), sr
+    )
+
+
+def main():
+    parser = argparse.ArgumentParser(
+        description="Evaluate beat tracking models with visualization and audio synthesis"
+    )
+    parser.add_argument(
+        "--model-dir",
+        type=str,
+        default="outputs/baseline3",
+        help="Base directory containing trained models (with 'beats' and 'downbeats' subdirs)",
+    )
+    parser.add_argument(
+        "--beats-model-dir",
+        type=str,
+        default=None,
+        help="Directory containing the trained beats model (overrides --model-dir/beats)",
+    )
+    parser.add_argument(
+        "--downbeats-model-dir",
+        type=str,
+        default=None,
+        help="Directory containing the trained downbeats model (overrides --model-dir/downbeats)",
+    )
+    parser.add_argument(
+        "--num-samples",
+        type=int,
+        default=116,
+        help="Number of samples to evaluate",
+    )
+    parser.add_argument(
+        "--output-dir",
+        type=str,
+        default="outputs/eval_baseline3",
+        help="Directory to save visualizations and audio",
+    )
+    parser.add_argument(
+        "--visualize",
+        action="store_true",
+        help="Generate visualization plots for each track",
+    )
+    parser.add_argument(
+        "--synthesize",
+        action="store_true",
+        help="Generate audio files with click tracks",
+    )
+    parser.add_argument(
+        "--viz-tracks",
+        type=int,
+        default=5,
+        help="Number of tracks to visualize/synthesize (default: 5)",
+    )
+    parser.add_argument(
+        "--time-range",
+        type=float,
+        nargs=2,
+        default=None,
+        metavar=("START", "END"),
+        help="Time range for visualization in seconds (default: full track)",
+    )
+    parser.add_argument(
+        "--click-volume",
+        type=float,
+        default=0.5,
+        help="Volume of click sounds relative to audio (0.0 to 1.0)",
+    )
+    parser.add_argument(
+        "--summary-plot",
+        action="store_true",
+        help="Generate summary evaluation plot",
+    )
+    args = parser.parse_args()
+
+    DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
+
+    # Load BOTH models using from_pretrained
+    beat_model = None
+    downbeat_model = None
+
+    has_beats = False
+    has_downbeats = False
+
+    beats_dir = args.beats_model_dir or os.path.join(args.model_dir, "beats")
+    downbeats_dir = args.downbeats_model_dir or os.path.join(
+        args.model_dir, "downbeats"
+    )
+
+    if os.path.exists(os.path.join(beats_dir, "model.safetensors")):
+        beat_model = ResNet.from_pretrained(beats_dir).to(DEVICE)
+        beat_model.eval()
+        has_beats = True
+        print(f"Loaded Beat Model from {beats_dir}")
+    else:
+        print(f"Warning: No beat model found in {beats_dir}")
+
+    if os.path.exists(os.path.join(downbeats_dir, "model.safetensors")):
+        downbeat_model = ResNet.from_pretrained(downbeats_dir).to(DEVICE)
+        downbeat_model.eval()
+        has_downbeats = True
+        print(f"Loaded Downbeat Model from {downbeats_dir}")
+    else:
+        print(f"Warning: No downbeat model found in {downbeats_dir}")
+
+    if not has_beats and not has_downbeats:
+        print("No models found. Please run training first.")
+        return
+
+    predictions = []
+    ground_truths = []
+    audio_data = []  # Store audio for visualization/synthesis
+
+    # Eval on specified number of tracks
+    test_set = ds["train"].select(range(args.num_samples))
+
+    print("Running evaluation...")
+    for i, item in enumerate(tqdm(test_set)):
+        waveform = torch.tensor(item["audio"]["array"], dtype=torch.float32)
+        waveform_device = waveform.to(DEVICE)
+
+        pred_entry = {"beats": [], "downbeats": []}
+
+        # 1. Predict Beats
+        if has_beats:
+            act_b = get_activation_function(beat_model, waveform_device, DEVICE)
+            pred_entry["beats"] = pick_peaks(act_b)
+
+        # 2. Predict Downbeats
+        if has_downbeats:
+            act_d = get_activation_function(downbeat_model, waveform_device, DEVICE)
+            pred_entry["downbeats"] = pick_peaks(act_d)
+
+        predictions.append(pred_entry)
+        ground_truths.append({"beats": item["beats"], "downbeats": item["downbeats"]})
+
+        # Store audio for later visualization/synthesis
+        if args.visualize or args.synthesize:
+            if i < args.viz_tracks:
+                audio_data.append(
+                    {
+                        "audio": waveform.numpy(),
+                        "sr": item["audio"]["sampling_rate"],
+                        "pred": pred_entry,
+                        "gt": ground_truths[-1],
+                    }
+                )
+
+    # Run evaluation
+    results = evaluate_all(predictions, ground_truths)
+    print(format_results(results))
+
+    # Create output directory
+    if args.visualize or args.synthesize or args.summary_plot:
+        os.makedirs(args.output_dir, exist_ok=True)
+
+    # Generate visualizations
+    if args.visualize:
+        print(f"\nGenerating visualizations for {len(audio_data)} tracks...")
+        viz_dir = os.path.join(args.output_dir, "plots")
+        for i, data in enumerate(tqdm(audio_data, desc="Visualizing")):
+            time_range = tuple(args.time_range) if args.time_range else None
+            visualize_track(
+                data["audio"],
+                data["sr"],
+                data["pred"]["beats"],
+                data["pred"]["downbeats"],
+                data["gt"]["beats"],
+                data["gt"]["downbeats"],
+                viz_dir,
+                i,
+                time_range=time_range,
+            )
+        print(f"Saved visualizations to {viz_dir}")
+
+    # Generate audio with clicks
+    if args.synthesize:
+        print(f"\nSynthesizing audio for {len(audio_data)} tracks...")
+        audio_dir = os.path.join(args.output_dir, "audio")
+        for i, data in enumerate(tqdm(audio_data, desc="Synthesizing")):
+            synthesize_audio(
+                data["audio"],
+                data["sr"],
+                data["pred"]["beats"],
+                data["pred"]["downbeats"],
+                data["gt"]["beats"],
+                data["gt"]["downbeats"],
+                audio_dir,
+                i,
+                click_volume=args.click_volume,
+            )
+        print(f"Saved audio files to {audio_dir}")
+        print("  *_pred.wav - Original audio with predicted beat clicks")
+        print("  *_gt.wav - Original audio with ground truth beat clicks")
+        print("  *_both.wav - Original audio with both predicted and GT clicks")
+
+    # Generate summary plot
+    if args.summary_plot:
+        from ..data.viz import plot_evaluation_summary, save_figure
+
+        print("\nGenerating summary plot...")
+        fig = plot_evaluation_summary(results, title="Beat Tracking Evaluation Summary")
+        summary_path = os.path.join(args.output_dir, "evaluation_summary.png")
+        save_figure(fig, summary_path)
+        print(f"Saved summary plot to {summary_path}")
+
+
+if __name__ == "__main__":
+    main()
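`pick_peaks` above delegates to `scipy.signal.find_peaks(height=0.5, distance=5)` and converts frame indices to seconds via `hop_length / sr`. As a dependency-free illustration, here is a simplified greedy stand-in (it scans left to right and enforces the minimum distance greedily, whereas `find_peaks` resolves conflicts by peak priority, so results can differ on dense activations; `pick_peaks_simple` is a hypothetical name):

```python
def pick_peaks_simple(activations, height=0.5, distance=5, hop_length=160, sr=16000):
    """Greedy local-maximum picking over a frame-rate activation curve.
    Returns timestamps in seconds at 100 fps (hop 160 samples @ 16 kHz)."""
    peaks = []
    for i in range(1, len(activations) - 1):
        a = activations[i]
        # Local maximum above the threshold...
        if a >= height and a > activations[i - 1] and a >= activations[i + 1]:
            # ...kept only if far enough from the previously accepted peak.
            if not peaks or i - peaks[-1] >= distance:
                peaks.append(i)
    return [p * hop_length / sr for p in peaks]


acts = [0.0, 0.1, 0.9, 0.2, 0.0, 0.0, 0.1, 0.8, 0.1, 0.0]
print(pick_peaks_simple(acts))  # peaks at frames 2 and 7 -> 0.02 s and 0.07 s
```

The `distance=5` floor (50 ms at 100 fps) is what suppresses the double peaks a noisy activation curve would otherwise produce around each beat.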
exp/baseline3/model.py ADDED
@@ -0,0 +1,173 @@
+ import torch
+ import torch.nn as nn
+ from huggingface_hub import PyTorchModelHubMixin
+
+
+ class SEBlock(nn.Module):
+     def __init__(self, channels: int, reduction: int = 16, use_max_pool: bool = True):
+         super().__init__()
+         self.avg_pool = nn.AdaptiveAvgPool2d(1)
+         self.max_pool = nn.AdaptiveMaxPool2d(1) if use_max_pool else None
+
+         hidden = max(1, channels // reduction)
+         self.fc = nn.Sequential(
+             nn.Linear(channels, hidden, bias=False),
+             nn.ReLU(inplace=True),
+             nn.Linear(hidden, channels, bias=False),
+             nn.Sigmoid(),
+         )
+
+     def forward(self, x):
+         b, c, _, _ = x.size()
+         y = self.avg_pool(x).view(b, c)
+         if self.max_pool is not None:
+             y = y + self.max_pool(x).view(b, c)
+         y = self.fc(y).view(b, c, 1, 1)
+         return x * y.expand_as(x)
+
+
+ class TemporalSEBlock(nn.Module):
+     """Temporal squeeze/excitation for (B, C, F, T) feature maps.
+
+     Squeezes across frequency (mean over F) to get a per-channel temporal descriptor
+     (B, C, T), then excites with a lightweight 1D bottleneck MLP implemented with
+     pointwise Conv1d.
+     """
+
+     def __init__(self, channels: int, reduction: int = 16):
+         super().__init__()
+         hidden = max(1, channels // reduction)
+         self.net = nn.Sequential(
+             nn.Conv1d(channels, hidden, kernel_size=1, bias=False),
+             nn.ReLU(inplace=True),
+             nn.Conv1d(hidden, channels, kernel_size=1, bias=False),
+             nn.Sigmoid(),
+         )
+
+     def forward(self, x: torch.Tensor) -> torch.Tensor:
+         # x: (B, C, F, T)
+         # squeeze over frequency -> (B, C, T)
+         t = x.mean(dim=2)
+         gate = self.net(t)  # (B, C, T)
+         return x * gate.unsqueeze(2)
+
+
+ class ResBlock(nn.Module):
+     def __init__(self, in_channels, out_channels, stride=1, downsample=None):
+         super().__init__()
+         self.conv1 = nn.Conv2d(
+             in_channels,
+             out_channels,
+             kernel_size=3,
+             stride=stride,
+             padding=1,
+             bias=False,
+         )
+         self.bn1 = nn.BatchNorm2d(out_channels)
+         self.relu = nn.ReLU(inplace=True)
+         self.conv2 = nn.Conv2d(
+             out_channels, out_channels, kernel_size=3, padding=1, bias=False
+         )
+         self.bn2 = nn.BatchNorm2d(out_channels)
+         # Baseline3: combine channel SE with a lightweight temporal SE gate.
+         self.cse = SEBlock(out_channels)
+         self.tse = TemporalSEBlock(out_channels)
+         self.downsample = downsample
+
+     def forward(self, x):
+         identity = x
+         if self.downsample is not None:
+             identity = self.downsample(x)
+
+         out = self.conv1(x)
+         out = self.bn1(out)
+         out = self.relu(out)
+
+         out = self.conv2(out)
+         out = self.bn2(out)
+         out = self.cse(out)
+         out = self.tse(out)
+
+         out += identity
+         out = self.relu(out)
+         return out
+
+
+ class ResNet(nn.Module, PyTorchModelHubMixin):
+     def __init__(
+         self, layers=[2, 2, 2, 2], channels=[16, 24, 48, 96], dropout_rate=0.5
+     ):
+         super().__init__()
+         self.in_channels = 16
+
+         # Stem
+         self.conv1 = nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=1, bias=False)
+         self.bn1 = nn.BatchNorm2d(16)
+         self.relu = nn.ReLU(inplace=True)
+
+         # Stages
+         self.layer1 = self._make_layer(channels[0], layers[0], stride=1)
+         self.layer2 = self._make_layer(channels[1], layers[1], stride=2)
+         self.layer3 = self._make_layer(channels[2], layers[2], stride=2)
+         self.layer4 = self._make_layer(channels[3], layers[3], stride=2)
+
+         self.dropout = nn.Dropout(p=dropout_rate)
+
+         # Final classification head.
+         # Spatial dims shrink stage by stage; with the default channels
+         # [16, 24, 48, 96] and input (3, 80, 101):
+         # L1: (16, 80, 101) (stride 1)
+         # L2: (24, 40, 51)  (stride 2)
+         # L3: (48, 20, 26)  (stride 2)
+         # L4: (96, 10, 13)  (stride 2)
+
+         self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
+         self.fc = nn.Linear(channels[3], 1)
+         self.sigmoid = nn.Sigmoid()
+
+     def _make_layer(self, out_channels, blocks, stride=1):
+         downsample = None
+         if stride != 1 or self.in_channels != out_channels:
+             downsample = nn.Sequential(
+                 nn.Conv2d(
+                     self.in_channels,
+                     out_channels,
+                     kernel_size=1,
+                     stride=stride,
+                     bias=False,
+                 ),
+                 nn.BatchNorm2d(out_channels),
+             )
+
+         layers = []
+         layers.append(ResBlock(self.in_channels, out_channels, stride, downsample))
+         self.in_channels = out_channels
+         for _ in range(1, blocks):
+             layers.append(ResBlock(self.in_channels, out_channels))
+
+         return nn.Sequential(*layers)
+
+     def forward(self, x):
+         # x: (B, 3, 80, 101)
+         x = self.conv1(x)
+         x = self.bn1(x)
+         x = self.relu(x)
+
+         x = self.layer1(x)
+         x = self.layer2(x)
+         x = self.layer3(x)
+         x = self.layer4(x)
+
+         x = self.avgpool(x)  # (B, 96, 1, 1) with default channels
+         x = torch.flatten(x, 1)  # (B, 96)
+         x = self.dropout(x)
+         x = self.fc(x)
+         x = self.sigmoid(x)
+
+         return x
+
+
+ if __name__ == "__main__":
+     from torchinfo import summary
+
+     model = ResNet()
+     summary(model, (1, 3, 80, 101))
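The two gates above differ only in what they squeeze over: the channel SE pools the whole (F, T) plane down to a (B, C) descriptor, while the temporal SE pools over frequency only, keeping a (B, C, T) descriptor. A minimal NumPy sketch of that broadcasting (a plain sigmoid stands in for the learned bottleneck MLPs; the shapes follow the model above):

```python
import numpy as np

rng = np.random.default_rng(0)
B, C, F, T = 2, 16, 80, 101
x = rng.standard_normal((B, C, F, T))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Channel SE: squeeze over (F, T) -> (B, C); gate broadcast as (B, C, 1, 1)
chan_desc = x.mean(axis=(2, 3))              # (B, C)
chan_gate = sigmoid(chan_desc)               # stand-in for the FC bottleneck
y_channel = x * chan_gate[:, :, None, None]  # (B, C, F, T)

# Temporal SE: squeeze over F only -> (B, C, T); gate broadcast as (B, C, 1, T)
temp_desc = x.mean(axis=2)                   # (B, C, T)
temp_gate = sigmoid(temp_desc)               # stand-in for the Conv1d bottleneck
y_temporal = x * temp_gate[:, :, None, :]    # (B, C, F, T)

assert y_channel.shape == x.shape and y_temporal.shape == x.shape
```

Both gates preserve the feature-map shape, so they can be dropped into the residual block without changing any downstream dimensions.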
exp/baseline3/train.py ADDED
@@ -0,0 +1,433 @@
1
+ import torch
2
+ import torch.nn as nn
3
+ import torch.optim as optim
4
+ from torch.utils.data import DataLoader
5
+ from torch.utils.tensorboard import SummaryWriter
6
+ from tqdm import tqdm
7
+ import argparse
8
+ import os
9
+
10
+ from .model import ResNet
11
+ from .data import BeatTrackingDataset
12
+ from ..baseline1.utils import MultiViewSpectrogram
13
+ from ..data.load import ds
14
+
15
+
16
+ def weighted_bce_loss(
17
+ y_pred: torch.Tensor, y_true: torch.Tensor, pos_weight: float
18
+ ) -> torch.Tensor:
19
+ """Weighted BCE on probabilities.
20
+
21
+ This training setup outputs probabilities (sigmoid in model). To better handle
22
+ the heavy class imbalance, we upweight positive-ish labels.
23
+
24
+ Labels are in {0.0, 0.25, 1.0}. We use a smooth weighting: w = 1 + pos_weight * y.
25
+ """
26
+
27
+ bce = torch.nn.functional.binary_cross_entropy(y_pred, y_true, reduction="none")
28
+ weights = 1.0 + (pos_weight * y_true)
29
+ return (bce * weights).mean()
30
+
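To see the effect of the smooth weighting, here is a pure-Python sketch of the same objective on scalar probabilities (toy numbers, not from the experiments): with labels in {0.0, 0.25, 1.0}, `pos_weight = 0` recovers plain BCE, and any positive `pos_weight` scales up the loss contribution of positive-ish labels.

```python
import math

def weighted_bce(probs, labels, pos_weight):
    # Per-element BCE on probabilities, weighted by w = 1 + pos_weight * y.
    total = 0.0
    for p, y in zip(probs, labels):
        bce = -(y * math.log(p) + (1 - y) * math.log(1 - p))
        total += (1.0 + pos_weight * y) * bce
    return total / len(probs)

probs = [0.9, 0.6, 0.2]
labels = [1.0, 0.25, 0.0]  # beat, near-beat, background
plain = weighted_bce(probs, labels, pos_weight=0.0)    # reduces to unweighted BCE
boosted = weighted_bce(probs, labels, pos_weight=3.0)  # upweights positive-ish labels
assert boosted > plain
```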
31
+
32
+ def unweighted_bce_loss(y_pred: torch.Tensor, y_true: torch.Tensor) -> torch.Tensor:
33
+ """Unweighted BCE on probabilities.
34
+
35
+ This matches baseline2's loss definition and is useful for apples-to-apples
36
+ TensorBoard comparisons and for early stopping / best checkpoint selection.
37
+ """
38
+
39
+ return torch.nn.functional.binary_cross_entropy(y_pred, y_true)
40
+
41
+
42
+ def train(
43
+ target_type: str,
44
+ output_dir: str,
45
+ *,
46
+ batch_size: int,
47
+ epochs: int,
48
+ lr: float,
49
+ weight_decay: float,
50
+ num_workers: int,
51
+ context_frames: int,
52
+ patience: int,
53
+ pos_weight: float,
54
+ grad_clip: float,
55
+ max_train_tracks: int | None,
56
+ max_val_tracks: int | None,
57
+ max_train_steps: int,
58
+ max_val_steps: int,
59
+ max_steps_total: int,
60
+ hard_neg_radius: int,
61
+ hard_neg_fraction: float,
62
+ ):
63
+ DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
64
+
65
+ print(f"--- Training Model for target: {target_type} ---")
66
+ print(f"Output directory: {output_dir}")
67
+
68
+ # Create output directory
69
+ os.makedirs(output_dir, exist_ok=True)
70
+
71
+ # TensorBoard writer
72
+ writer = SummaryWriter(log_dir=os.path.join(output_dir, "logs"))
73
+
74
+ # Data
75
+ train_dataset = BeatTrackingDataset(
76
+ ds["train"],
77
+ target_type=target_type,
78
+ context_frames=context_frames,
79
+ max_tracks=max_train_tracks,
80
+ hard_neg_radius=hard_neg_radius,
81
+ hard_neg_fraction=hard_neg_fraction,
82
+ )
83
+ val_dataset = BeatTrackingDataset(
84
+ ds["test"],
85
+ target_type=target_type,
86
+ context_frames=context_frames,
87
+ max_tracks=max_val_tracks,
88
+ hard_neg_radius=hard_neg_radius,
89
+ hard_neg_fraction=hard_neg_fraction,
90
+ )
91
+
92
+ train_loader = DataLoader(
93
+ train_dataset,
94
+ batch_size=batch_size,
95
+ shuffle=True,
96
+ num_workers=num_workers,
97
+ pin_memory=True,
98
+ prefetch_factor=4,
99
+ persistent_workers=True,
100
+ )
101
+ val_loader = DataLoader(
102
+ val_dataset,
103
+ batch_size=batch_size,
104
+ shuffle=False,
105
+ num_workers=num_workers,
106
+ pin_memory=True,
107
+ prefetch_factor=4,
108
+ persistent_workers=True,
109
+ )
110
+
111
+ print(f"Train samples: {len(train_dataset)}, Val samples: {len(val_dataset)}")
112
+
113
+ # Model
114
+ model = ResNet(dropout_rate=0.5).to(DEVICE)
115
+
116
+ # GPU Spectrogram Preprocessor
117
+ preprocessor = MultiViewSpectrogram(sample_rate=16000, hop_length=160).to(DEVICE)
118
+
119
+ # Optimizer - Using AdamW for ResNet
120
+ optimizer = optim.AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)
121
+ scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
122
+
123
+ # Match baseline2's objective by default (unweighted BCE).
124
+ criterion = nn.BCELoss()
125
+
126
+ best_val_loss = float("inf")
127
+ patience_counter = 0
128
+ global_step = 0
129
+
130
+ for epoch in range(epochs):
131
+ # Training
132
+ model.train()
133
+ total_train_loss = 0
134
+ total_train_loss_unweighted = 0
135
+ steps_this_epoch = 0
136
+ for waveform, y in tqdm(
137
+ train_loader,
138
+ desc=f"[{target_type}] Epoch {epoch + 1}/{epochs} Train",
139
+ leave=False,
140
+ ):
141
+ if max_steps_total > 0 and global_step >= max_steps_total:
142
+ break
143
+ waveform, y = waveform.to(DEVICE), y.to(DEVICE)
144
+
145
+ # Compute spectrogram on GPU
146
+ with torch.no_grad():
147
+ spec = preprocessor(waveform) # (B, 3, 80, T_raw)
148
+ # Normalize
149
+ mean = spec.mean(dim=(2, 3), keepdim=True)
150
+ std = spec.std(dim=(2, 3), keepdim=True) + 1e-6
151
+ spec = (spec - mean) / std
152
+
153
+ T_curr = spec.shape[-1]
154
+ target_T = context_frames * 2 + 1
155
+
156
+ if T_curr > target_T:
157
+ start = (T_curr - target_T) // 2
158
+ x = spec[:, :, :, start : start + target_T]
159
+ elif T_curr < target_T:
160
+ # This shouldn't happen if dataset is correct, but just in case pad
161
+ pad = target_T - T_curr
162
+ x = torch.nn.functional.pad(spec, (0, pad))
163
+ else:
164
+ x = spec
165
+
166
+ optimizer.zero_grad()
167
+ output = model(x)
168
+
169
+ loss_unweighted = criterion(output, y)
170
+ loss = (
171
+ weighted_bce_loss(output, y, pos_weight=pos_weight)
172
+ if pos_weight > 0
173
+ else loss_unweighted
174
+ )
175
+ loss.backward()
176
+
177
+ if grad_clip > 0:
178
+ torch.nn.utils.clip_grad_norm_(model.parameters(), grad_clip)
179
+ optimizer.step()
180
+
181
+ total_train_loss += loss.item()
182
+ total_train_loss_unweighted += loss_unweighted.item()
183
+ global_step += 1
184
+ steps_this_epoch += 1
185
+
186
+ # Log losses
187
+ # - train/batch_loss matches baseline2 (unweighted)
188
+ # - train/batch_loss_weighted is the optimized objective (if pos_weight>0)
189
+ writer.add_scalar("train/batch_loss", loss_unweighted.item(), global_step)
190
+ writer.add_scalar("train/batch_loss_weighted", loss.item(), global_step)
191
+
192
+ if max_train_steps > 0 and steps_this_epoch >= max_train_steps:
193
+ break
194
+
195
+ if steps_this_epoch == 0:
196
+ print("No training steps executed (max_steps reached).")
197
+ break
198
+
199
+ avg_train_loss = total_train_loss / steps_this_epoch
200
+ avg_train_loss_unweighted = total_train_loss_unweighted / steps_this_epoch
201
+
202
+ # Validation
203
+ model.eval()
204
+ total_val_loss = 0
205
+ total_val_loss_unweighted = 0
206
+ val_steps = 0
207
+ with torch.no_grad():
208
+ for waveform, y in tqdm(
209
+ val_loader,
210
+ desc=f"[{target_type}] Epoch {epoch + 1}/{epochs} Val",
211
+ leave=False,
212
+ ):
213
+ if max_steps_total > 0 and global_step >= max_steps_total:
214
+ break
215
+ waveform, y = waveform.to(DEVICE), y.to(DEVICE)
216
+
217
+ # Compute spectrogram on GPU
218
+ spec = preprocessor(waveform) # (B, 3, 80, T)
219
+ # Normalize
220
+ mean = spec.mean(dim=(2, 3), keepdim=True)
221
+ std = spec.std(dim=(2, 3), keepdim=True) + 1e-6
222
+ spec = (spec - mean) / std
223
+
224
+ T_curr = spec.shape[-1]
225
+ target_T = context_frames * 2 + 1
226
+
227
+ if T_curr > target_T:
228
+ start = (T_curr - target_T) // 2
229
+ x = spec[:, :, :, start : start + target_T]
230
+ else:
231
+ pad = target_T - T_curr
232
+ x = torch.nn.functional.pad(spec, (0, pad))
233
+
234
+ output = model(x)
235
+ loss_unweighted = criterion(output, y)
236
+ loss = (
237
+ weighted_bce_loss(output, y, pos_weight=pos_weight)
238
+ if pos_weight > 0
239
+ else loss_unweighted
240
+ )
241
+ total_val_loss_unweighted += loss_unweighted.item()
242
+ total_val_loss += loss.item()
243
+ val_steps += 1
244
+
245
+ if max_val_steps > 0 and val_steps >= max_val_steps:
246
+ break
247
+
248
+ if val_steps == 0:
249
+ print("No validation steps executed (max_steps reached).")
250
+ break
251
+
252
+ avg_val_loss = total_val_loss / val_steps
253
+ avg_val_loss_unweighted = total_val_loss_unweighted / val_steps
254
+
255
+ # Log epoch metrics
256
+ writer.add_scalar("train/epoch_loss", avg_train_loss_unweighted, epoch)
257
+ writer.add_scalar("train/epoch_loss_weighted", avg_train_loss, epoch)
258
+ writer.add_scalar("val/loss", avg_val_loss_unweighted, epoch)
259
+ writer.add_scalar("val/loss_weighted", avg_val_loss, epoch)
260
+ writer.add_scalar("train/learning_rate", scheduler.get_last_lr()[0], epoch)
261
+
262
+ # Step the scheduler
263
+ scheduler.step()
264
+
265
+ print(
266
+ f"[{target_type}] Epoch {epoch + 1}/{epochs} - "
267
+ f"Train Loss: {avg_train_loss_unweighted:.4f}, Val Loss: {avg_val_loss_unweighted:.4f}"
268
+ )
269
+
270
+ # Save best model
271
+ if avg_val_loss_unweighted < best_val_loss:
272
+ best_val_loss = avg_val_loss_unweighted
273
+ patience_counter = 0
274
+ model.save_pretrained(output_dir)
275
+ print(f" -> Saved best model (val_loss: {best_val_loss:.4f})")
276
+ else:
277
+ patience_counter += 1
278
+ print(f" -> No improvement (patience: {patience_counter}/{patience})")
279
+
280
+ if patience_counter >= patience:
281
+ print("Early stopping triggered.")
282
+ break
283
+
284
+ if max_steps_total > 0 and global_step >= max_steps_total:
285
+ print("Reached max_steps_total; stopping training.")
286
+ break
287
+
288
+ writer.close()
289
+
290
+ # Save final model
291
+ final_dir = os.path.join(output_dir, "final")
292
+ model.save_pretrained(final_dir)
293
+ print(f"Saved final model to {final_dir}")
294
+
295
+
296
+ if __name__ == "__main__":
297
+ parser = argparse.ArgumentParser()
298
+ parser.add_argument(
299
+ "--target",
300
+ type=str,
301
+ choices=["beats", "downbeats"],
302
+ default=None,
303
+ help="Train a model for 'beats' or 'downbeats'. If not specified, trains both.",
304
+ )
305
+ parser.add_argument(
306
+ "--output-dir",
307
+ type=str,
308
+ default="outputs/baseline3",
309
+ help="Directory to save model and logs",
310
+ )
311
+ parser.add_argument(
312
+ "--batch-size",
313
+ type=int,
314
+ default=128,
315
+ help="Batch size (default: 128)",
316
+ )
317
+ parser.add_argument(
318
+ "--epochs",
319
+ type=int,
320
+ default=3,
321
+ help="Max epochs (default: 3; early stopping may stop sooner)",
322
+ )
323
+ parser.add_argument(
324
+ "--lr",
325
+ type=float,
326
+ default=0.001,
327
+ help="AdamW learning rate (default: 0.001)",
328
+ )
329
+ parser.add_argument(
330
+ "--weight-decay",
331
+ type=float,
332
+ default=1e-4,
333
+ help="AdamW weight decay (default: 1e-4)",
334
+ )
335
+ parser.add_argument(
336
+ "--num-workers",
337
+ type=int,
338
+ default=4,
339
+ help="DataLoader workers (default: 4)",
340
+ )
341
+ parser.add_argument(
342
+ "--max-train-tracks",
343
+ type=int,
344
+ default=0,
345
+ help="Limit train split to first N tracks (default: 0 = all)",
346
+ )
347
+ parser.add_argument(
348
+ "--max-val-tracks",
349
+ type=int,
350
+ default=0,
351
+ help="Limit val split to first N tracks (default: 0 = all)",
352
+ )
353
+ parser.add_argument(
354
+ "--max-train-steps",
355
+ type=int,
356
+ default=0,
357
+ help="Max train batches per epoch (default: 0 = all)",
358
+ )
359
+ parser.add_argument(
360
+ "--max-val-steps",
361
+ type=int,
362
+ default=0,
363
+ help="Max val batches per epoch (default: 0 = all)",
364
+ )
365
+ parser.add_argument(
366
+ "--max-steps-total",
367
+ type=int,
368
+ default=0,
369
+ help="Stop training after N total train batches (default: 0 = unlimited)",
370
+ )
371
+ parser.add_argument(
372
+ "--hard-neg-radius",
373
+ type=int,
374
+ default=0,
375
+ help="Add hard negatives at +/-d frames from each beat, for each d in 2..R (default: 0 = off)",
376
+ )
377
+ parser.add_argument(
378
+ "--hard-neg-fraction",
379
+ type=float,
380
+ default=0.5,
381
+ help="Fraction of negatives reserved for hard negatives (default: 0.5)",
382
+ )
383
+ parser.add_argument(
384
+ "--context-frames",
385
+ type=int,
386
+ default=50,
387
+ help="Context frames on each side (default: 50 -> 101 total frames)",
388
+ )
389
+ parser.add_argument(
390
+ "--patience",
391
+ type=int,
392
+ default=5,
393
+ help="Early stopping patience (default: 5)",
394
+ )
395
+ parser.add_argument(
396
+ "--pos-weight",
397
+ type=float,
398
+ default=0.0,
399
+ help="Positive label upweight factor (default: 0.0; 0 matches baseline2)",
400
+ )
401
+ parser.add_argument(
402
+ "--grad-clip",
403
+ type=float,
404
+ default=0.0,
405
+ help="Clip gradient norm; set 0 to disable (default: 0.0)",
406
+ )
407
+ args = parser.parse_args()
408
+
409
+ # Determine which targets to train
410
+ targets = [args.target] if args.target else ["beats", "downbeats"]
411
+
412
+ for target in targets:
413
+ output_dir = os.path.join(args.output_dir, target)
414
+ train(
415
+ target,
416
+ output_dir,
417
+ batch_size=args.batch_size,
418
+ epochs=args.epochs,
419
+ lr=args.lr,
420
+ weight_decay=args.weight_decay,
421
+ num_workers=args.num_workers,
422
+ context_frames=args.context_frames,
423
+ patience=args.patience,
424
+ pos_weight=args.pos_weight,
425
+ grad_clip=args.grad_clip,
426
+ max_train_tracks=(args.max_train_tracks or None),
427
+ max_val_tracks=(args.max_val_tracks or None),
428
+ max_train_steps=args.max_train_steps,
429
+ max_val_steps=args.max_val_steps,
430
+ max_steps_total=args.max_steps_total,
431
+ hard_neg_radius=args.hard_neg_radius,
432
+ hard_neg_fraction=args.hard_neg_fraction,
433
+ )
exp/data/__init__.py ADDED
@@ -0,0 +1,25 @@
+ """
+ Data loading and evaluation utilities for beat tracking.
+
+ Modules:
+ - load: Dataset loading and preprocessing
+ - eval: Evaluation metrics and utilities
+ """
+
+ from exp.data.eval import (
+     DEFAULT_THRESHOLDS_MS,
+     evaluate_beats,
+     evaluate_track,
+     evaluate_all,
+     compute_weighted_f1,
+     format_results,
+ )
+
+ __all__ = [
+     "DEFAULT_THRESHOLDS_MS",
+     "evaluate_beats",
+     "evaluate_track",
+     "evaluate_all",
+     "compute_weighted_f1",
+     "format_results",
+ ]
exp/data/audio.py ADDED
@@ -0,0 +1,301 @@
1
+ """
2
+ Audio synthesis utilities for beat tracking evaluation.
3
+
4
+ This module provides functions to:
5
+ - Generate click sounds for beats and downbeats
6
+ - Mix click tracks with original audio
7
+ - Save audio files with beat annotations
8
+
9
+ Example usage:
10
+ from exp.data.audio import create_click_track, mix_audio, save_audio
11
+
12
+ # Create click track
13
+ clicks = create_click_track(
14
+ beat_times=pred_beats,
15
+ downbeat_times=pred_downbeats,
16
+ duration=30.0,
17
+ sr=16000
18
+ )
19
+
20
+ # Mix with original audio
21
+ mixed = mix_audio(original_audio, clicks, click_volume=0.5)
22
+
23
+ # Save to file
24
+ save_audio(mixed, "output.wav", sr=16000)
25
+ """
26
+
27
+ import numpy as np
28
+ from pathlib import Path
29
+
30
+
31
+ def generate_click(
32
+ frequency: float = 1000.0,
33
+ duration: float = 0.02,
34
+ sr: int = 16000,
35
+ attack: float = 0.002,
36
+ decay: float = 0.018,
37
+ ) -> np.ndarray:
38
+ """
39
+ Generate a single click sound.
40
+
41
+ Args:
42
+ frequency: Frequency of the click tone in Hz
43
+ duration: Duration of the click in seconds
44
+ sr: Sample rate
45
+ attack: Attack time in seconds
46
+ decay: Decay time in seconds
47
+
48
+ Returns:
49
+ Click waveform as numpy array
50
+ """
51
+ t = np.arange(int(duration * sr)) / sr
52
+
53
+ # Generate sine wave
54
+ wave = np.sin(2 * np.pi * frequency * t)
55
+
56
+ # Apply envelope (attack-decay)
57
+ envelope = np.ones_like(t)
58
+ attack_samples = int(attack * sr)
59
+ decay_samples = int(decay * sr)
60
+
61
+ if attack_samples > 0:
62
+ envelope[:attack_samples] = np.linspace(0, 1, attack_samples)
63
+ if decay_samples > 0:
64
+ decay_start = len(t) - decay_samples
65
+ if decay_start > 0:
66
+ envelope[decay_start:] = np.linspace(1, 0, decay_samples)
67
+
68
+ return wave * envelope
69
+
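For intuition, the envelope applied above is just a linear attack ramp into a linear decay ramp. A small NumPy sketch with the default timings (20 ms click, 2 ms attack, 18 ms decay at 16 kHz):

```python
import numpy as np

# Attack/decay envelope with generate_click's default parameters.
sr, duration, attack, decay = 16000, 0.02, 0.002, 0.018
n = int(duration * sr)  # 320 samples total

env = np.ones(n)
a = int(attack * sr)    # 32-sample linear ramp 0 -> 1
d = int(decay * sr)     # 288-sample linear ramp 1 -> 0
env[:a] = np.linspace(0, 1, a)
env[n - d:] = np.linspace(1, 0, d)

# The click fades in, peaks at full amplitude, and fades out to silence.
assert env[0] == 0.0 and env[-1] == 0.0 and env.max() == 1.0
```

Multiplying this envelope into the sine burst is what keeps each click free of audible onset/offset pops.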
70
+
71
+ def create_click_track(
72
+ beat_times: list[float] | np.ndarray,
73
+ downbeat_times: list[float] | np.ndarray | None = None,
74
+ duration: float | None = None,
75
+ sr: int = 16000,
76
+ beat_freq: float = 1000.0,
77
+ downbeat_freq: float = 1500.0,
78
+ click_duration: float = 0.03,
79
+ ) -> np.ndarray:
80
+ """
81
+ Create a click track from beat and downbeat times.
82
+
83
+ Args:
84
+ beat_times: List of beat times in seconds
85
+ downbeat_times: List of downbeat times in seconds (optional)
86
+ duration: Total duration in seconds (auto-detected if None)
87
+ sr: Sample rate
88
+ beat_freq: Frequency for beat clicks (Hz)
89
+ downbeat_freq: Frequency for downbeat clicks (Hz)
90
+ click_duration: Duration of each click in seconds
91
+
92
+ Returns:
93
+ Click track as numpy array
94
+ """
95
+ beat_times = np.array(beat_times) if len(beat_times) > 0 else np.array([])
96
+ if downbeat_times is not None:
97
+ downbeat_times = (
98
+ np.array(downbeat_times) if len(downbeat_times) > 0 else np.array([])
99
+ )
100
+ else:
101
+ downbeat_times = np.array([])
102
+
103
+ # Determine duration
104
+ if duration is None:
105
+ all_times = np.concatenate([beat_times, downbeat_times])
106
+ if len(all_times) == 0:
107
+ return np.array([])
108
+ duration = float(np.max(all_times)) + 1.0
109
+
110
+ # Create output array
111
+ total_samples = int(duration * sr)
112
+ output = np.zeros(total_samples, dtype=np.float32)
113
+
114
+ # Generate click templates
115
+ beat_click = generate_click(frequency=beat_freq, duration=click_duration, sr=sr)
116
+ downbeat_click = generate_click(
117
+ frequency=downbeat_freq, duration=click_duration, sr=sr
118
+ )
119
+
120
+ # Convert downbeat times to set for fast lookup
121
+ downbeat_set = set(np.round(downbeat_times, 3))
122
+
123
+ # Add beat clicks
124
+ for t in beat_times:
125
+ sample_idx = int(t * sr)
126
+ if sample_idx < 0 or sample_idx >= total_samples:
127
+ continue
128
+
129
+ # Use downbeat click if this is also a downbeat
130
+ is_downbeat = np.round(t, 3) in downbeat_set
131
+ click = downbeat_click if is_downbeat else beat_click
132
+
133
+ # Add click to output
134
+ end_idx = min(sample_idx + len(click), total_samples)
135
+ click_len = end_idx - sample_idx
136
+ output[sample_idx:end_idx] += click[:click_len]
137
+
138
+ # Add downbeat clicks (for downbeats not already in beats)
139
+ beat_set = set(np.round(beat_times, 3))
140
+ for t in downbeat_times:
141
+ if np.round(t, 3) in beat_set:
142
+ continue # Already added as beat
143
+
144
+ sample_idx = int(t * sr)
145
+ if sample_idx < 0 or sample_idx >= total_samples:
146
+ continue
147
+
148
+ end_idx = min(sample_idx + len(downbeat_click), total_samples)
149
+ click_len = end_idx - sample_idx
150
+ output[sample_idx:end_idx] += downbeat_click[:click_len]
151
+
152
+ return output
153
+
154
+
155
+ def mix_audio(
156
+ audio: np.ndarray,
157
+ click_track: np.ndarray,
158
+ click_volume: float = 0.5,
159
+ ) -> np.ndarray:
160
+ """
161
+ Mix original audio with a click track.
162
+
163
+ Args:
164
+ audio: Original audio waveform
165
+ click_track: Click track to overlay
166
+ click_volume: Volume of clicks relative to audio (0.0 to 1.0)
167
+
168
+ Returns:
169
+ Mixed audio
170
+ """
171
+ # Ensure same length
172
+ max_len = max(len(audio), len(click_track))
173
+ audio_padded = np.zeros(max_len, dtype=np.float32)
174
+ click_padded = np.zeros(max_len, dtype=np.float32)
175
+
176
+ audio_padded[: len(audio)] = audio
177
+ click_padded[: len(click_track)] = click_track
178
+
179
+ # Normalize audio
180
+ audio_max = np.abs(audio_padded).max()
181
+ if audio_max > 0:
182
+ audio_padded = audio_padded / audio_max * 0.8
183
+
184
+ # Normalize clicks
185
+ click_max = np.abs(click_padded).max()
186
+ if click_max > 0:
187
+ click_padded = click_padded / click_max * click_volume * 0.8
188
+
189
+ # Mix
190
+ mixed = audio_padded + click_padded
191
+
192
+ # Prevent clipping
193
+ max_val = np.abs(mixed).max()
194
+ if max_val > 1.0:
195
+ mixed = mixed / max_val * 0.95
196
+
197
+ return mixed.astype(np.float32)
198
+
199
+
200
+ def create_comparison_audio(
201
+ audio: np.ndarray,
202
+ pred_beats: list[float],
203
+ pred_downbeats: list[float],
204
+ gt_beats: list[float],
205
+ gt_downbeats: list[float],
206
+ sr: int = 16000,
207
+ click_volume: float = 0.5,
208
+ ) -> tuple[np.ndarray, np.ndarray, np.ndarray]:
209
+ """
210
+ Create audio files for comparison: prediction clicks, ground truth clicks, and combined.
211
+
212
+ Args:
213
+ audio: Original audio waveform
214
+ pred_beats: Predicted beat times
215
+ pred_downbeats: Predicted downbeat times
216
+ gt_beats: Ground truth beat times
217
+ gt_downbeats: Ground truth downbeat times
218
+ sr: Sample rate
219
+ click_volume: Volume of clicks
220
+
221
+ Returns:
222
+ Tuple of (audio_with_pred_clicks, audio_with_gt_clicks, audio_with_both)
223
+ """
224
+ duration = len(audio) / sr
225
+
226
+ # Create click tracks
227
+ pred_clicks = create_click_track(
228
+ pred_beats,
229
+ pred_downbeats,
230
+ duration=duration,
231
+ sr=sr,
232
+ beat_freq=1000.0,
233
+ downbeat_freq=1500.0,
234
+ )
235
+
236
+ gt_clicks = create_click_track(
237
+ gt_beats,
238
+ gt_downbeats,
239
+ duration=duration,
240
+ sr=sr,
241
+ beat_freq=800.0, # Different frequency for GT
242
+ downbeat_freq=1200.0,
243
+ )
244
+
245
+ # Mix
246
+ audio_pred = mix_audio(audio, pred_clicks, click_volume)
247
+ audio_gt = mix_audio(audio, gt_clicks, click_volume)
248
+ audio_both = mix_audio(audio, pred_clicks + gt_clicks, click_volume)
249
+
250
+ return audio_pred, audio_gt, audio_both
251
+
252
+
253
+ def save_audio(
254
+ audio: np.ndarray,
255
+ path: str | Path,
256
+ sr: int = 16000,
257
+ ) -> None:
258
+ """
259
+ Save audio to a WAV file.
260
+
261
+ Args:
262
+ audio: Audio waveform
263
+ path: Output file path
264
+ sr: Sample rate
265
+ """
266
+ import scipy.io.wavfile as wavfile
267
+
268
+ path = Path(path)
269
+ path.parent.mkdir(parents=True, exist_ok=True)
270
+
271
+ # Convert to int16
272
+ audio_int16 = (audio * 32767).astype(np.int16)
273
+ wavfile.write(str(path), sr, audio_int16)
274
+
275
+
276
+ if __name__ == "__main__":
277
+ # Demo
278
+ print("Audio synthesis demo...")
279
+
280
+ # Create a simple sine wave as "music"
281
+ sr = 16000
282
+ duration = 10.0
283
+ t = np.arange(int(duration * sr)) / sr
284
+ music = np.sin(2 * np.pi * 220 * t) * 0.3 # 220 Hz tone
285
+
286
+ # Beats every 0.5s, downbeats every 2s
287
+ beats = np.arange(0, duration, 0.5).tolist()
288
+ downbeats = np.arange(0, duration, 2.0).tolist()
289
+
290
+ # Create click track
291
+ clicks = create_click_track(beats, downbeats, duration=duration, sr=sr)
292
+
293
+ # Mix
294
+ mixed = mix_audio(music, clicks, click_volume=0.6)
295
+
296
+ print(f"Created mixed audio: {len(mixed)} samples ({len(mixed) / sr:.2f}s)")
297
+ print(f"Beats: {len(beats)}, Downbeats: {len(downbeats)}")
298
+
299
+ # Save demo
300
+ save_audio(mixed, "/tmp/beat_click_demo.wav", sr=sr)
301
+ print("Saved demo to /tmp/beat_click_demo.wav")
exp/data/eval.py ADDED
@@ -0,0 +1,568 @@
1
+ """
2
+ Evaluation utilities for beat and downbeat detection.
3
+
4
+ This module provides functions to evaluate beat/downbeat predictions against
5
+ ground truth annotations using F1-scores at various timing thresholds and
6
+ continuity-based metrics (CMLt, AMLt).
7
+
8
+ The evaluation metrics include:
9
+ - **F1-scores**: Calculated for timing thresholds from 3ms to 30ms
10
+ - **Weighted F1**: Weights are inversely proportional to threshold (e.g., 3ms: 1, 6ms: 1/2)
11
+ - **CMLt (Correct Metrical Level Total)**: Accuracy at the correct metrical level
12
+ - **AMLt (Any Metrical Level Total)**: Accuracy allowing for metrical variations
13
+ (double/half tempo, off-beat, etc.)
14
+ - **CMLc/AMLc**: Continuous versions (longest correct segment)
15
+
16
+ Example usage:
17
+ from ..data.eval import (
18
+ evaluate_beats, evaluate_all, compute_weighted_f1,
19
+ compute_continuity_metrics, format_results
20
+ )
21
+
22
+ # Evaluate single track
23
+ results = evaluate_beats(pred_beats, gt_beats)
24
+ print(f"Weighted F1: {results['weighted_f1']:.4f}")
25
+ print(f"CMLt: {results['continuity']['CMLt']:.4f}")
26
+ print(f"AMLt: {results['continuity']['AMLt']:.4f}")
27
+
28
+ # Evaluate with custom thresholds
29
+ results = evaluate_beats(pred_beats, gt_beats, thresholds_ms=[5, 10, 20])
30
+
31
+ # Evaluate all tracks in dataset
32
+ summary = evaluate_all(predictions, ground_truths)
33
+ print(format_results(summary))
34
+ """
35
+
36
+ from typing import Sequence
37
+ import numpy as np
38
+ import mir_eval
39
+
40
+
41
+ # Default timing thresholds in milliseconds (3ms to 30ms, step 3ms)
42
+ DEFAULT_THRESHOLDS_MS = [3, 6, 9, 12, 15, 18, 21, 24, 27, 30]
43
+
44
+ # Default minimum beat time for mir_eval metrics (can be set to 0 to use all beats)
45
+ DEFAULT_MIN_BEAT_TIME = 5.0
46
+
47
+
48
+ def match_events(
49
+ pred: np.ndarray,
50
+ gt: np.ndarray,
51
+ tolerance_sec: float,
52
+ ) -> tuple[int, int, int]:
53
+ """
54
+ Match predicted events to ground truth events within a tolerance.
55
+
56
+ Uses greedy matching: each ground truth event is matched to the closest
57
+ unmatched prediction within the tolerance window.
58
+
59
+ Args:
60
+ pred: Predicted event times in seconds, shape (N,)
61
+ gt: Ground truth event times in seconds, shape (M,)
62
+ tolerance_sec: Maximum time difference for a match (in seconds)
63
+
64
+ Returns:
65
+ Tuple of (true_positives, false_positives, false_negatives)
66
+ """
67
+ if len(gt) == 0:
68
+ return 0, len(pred), 0
69
+ if len(pred) == 0:
70
+ return 0, 0, len(gt)
71
+
72
+ pred = np.sort(pred)
73
+ gt = np.sort(gt)
74
+
75
+ matched_pred = np.zeros(len(pred), dtype=bool)
76
+ matched_gt = np.zeros(len(gt), dtype=bool)
77
+
78
+ # For each ground truth, find the closest unmatched prediction
79
+ for i, gt_time in enumerate(gt):
80
+ # Find predictions within tolerance
81
+ diffs = np.abs(pred - gt_time)
82
+ candidates = np.where((diffs <= tolerance_sec) & ~matched_pred)[0]
83
+
84
+ if len(candidates) > 0:
85
+ # Match to closest candidate
86
+ best_idx = candidates[np.argmin(diffs[candidates])]
87
+ matched_pred[best_idx] = True
88
+ matched_gt[i] = True
89
+
90
+ tp = int(matched_gt.sum())
+ fp = len(pred) - tp
+ fn = len(gt) - tp
96
+
97
+ return tp, fp, fn
98
+
99
+
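The greedy matching above can be illustrated with a small self-contained sketch (a standalone reimplementation of the same logic for demonstration, not the function itself):

```python
import numpy as np

def greedy_match(pred, gt, tol):
    # Mirror of match_events: each GT event claims its closest unmatched prediction
    pred, gt = np.sort(pred), np.sort(gt)
    used = np.zeros(len(pred), dtype=bool)
    tp = 0
    for g in gt:
        d = np.abs(pred - g)
        cand = np.where((d <= tol) & ~used)[0]
        if len(cand) > 0:
            used[cand[np.argmin(d[cand])]] = True
            tp += 1
    return tp, len(pred) - tp, len(gt) - tp

# GT beats at 1.0, 2.0, 3.0 s; one prediction drifts by 100 ms, one is spurious
tp, fp, fn = greedy_match(
    np.array([1.004, 2.1, 2.995, 5.0]), np.array([1.0, 2.0, 3.0]), tol=0.010
)
print(tp, fp, fn)  # 2 2 1
```

With a 10 ms tolerance only the 1.004 s and 2.995 s predictions match; the 2.1 s prediction counts as both a false positive and (for the missed 2.0 s beat) a false negative.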
100
+ def compute_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
101
+ """
102
+ Compute precision, recall, and F1-score from TP, FP, FN counts.
103
+
104
+ Args:
105
+ tp: True positives
106
+ fp: False positives
107
+ fn: False negatives
108
+
109
+ Returns:
110
+ Tuple of (precision, recall, f1_score)
111
+ """
112
+ precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
113
+ recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
114
+ f1 = (
115
+ 2 * precision * recall / (precision + recall)
116
+ if (precision + recall) > 0
117
+ else 0.0
118
+ )
119
+ return precision, recall, f1
120
+
121
+
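For instance, with the counts from a noisy tracker (2 hits, 2 spurious predictions, 1 missed beat), the formulas above give:

```python
tp, fp, fn = 2, 2, 1
precision = tp / (tp + fp)  # 0.5
recall = tp / (tp + fn)     # 2/3
f1 = 2 * precision * recall / (precision + recall)  # 4/7
print(round(f1, 4))  # 0.5714
```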
122
+ def compute_weighted_f1(
123
+ f1_scores: dict[int, float],
124
+ thresholds_ms: Sequence[int] | None = None,
125
+ ) -> float:
126
+ """
127
+ Compute weighted F1-score where weights are inversely proportional to threshold.
128
+
129
+ The weight for threshold T ms is 1 / (T / min_threshold).
130
+ For example, with thresholds [3, 6, 9, ...]:
131
+ - 3ms: weight = 1
132
+ - 6ms: weight = 0.5
133
+ - 9ms: weight = 0.333...
134
+
135
+ Args:
136
+ f1_scores: Dict mapping threshold (ms) to F1-score
137
+ thresholds_ms: List of thresholds used (for weight calculation)
138
+
139
+ Returns:
140
+ Weighted F1-score
141
+ """
142
+ if not f1_scores:
143
+ return 0.0
144
+
145
+ if thresholds_ms is None:
146
+ thresholds_ms = sorted(f1_scores.keys())
147
+
148
+ min_threshold = min(thresholds_ms)
149
+ total_weight = 0.0
150
+ weighted_sum = 0.0
151
+
152
+ for t in thresholds_ms:
153
+ if t in f1_scores:
154
+ weight = min_threshold / t # 3ms -> 1, 6ms -> 0.5, etc.
155
+ weighted_sum += weight * f1_scores[t]
156
+ total_weight += weight
157
+
158
+ return weighted_sum / total_weight if total_weight > 0 else 0.0
159
+
160
+
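The weighting can be checked by hand; with thresholds [3, 6, 9] ms the weights are 1, 1/2, and 1/3 (hypothetical F1 values):

```python
f1_by_threshold = {3: 0.60, 6: 0.80, 9: 0.90}
weights = {t: 3 / t for t in f1_by_threshold}  # min_threshold / t
weighted = sum(weights[t] * f1_by_threshold[t] for t in weights) / sum(weights.values())
print(round(weighted, 4))  # 0.7091
```

The weighted sum is 0.6 + 0.4 + 0.3 = 1.3 and the total weight is 11/6, so the tighter 3 ms threshold dominates the score.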
161
+ def compute_continuity_metrics(
162
+ pred_times: Sequence[float],
163
+ gt_times: Sequence[float],
164
+ min_beat_time: float = DEFAULT_MIN_BEAT_TIME,
165
+ phase_threshold: float = 0.175,
166
+ period_threshold: float = 0.175,
167
+ ) -> dict:
168
+ """
169
+ Compute continuity-based beat tracking metrics using mir_eval.
170
+
171
+ These metrics evaluate beat tracking accuracy accounting for metrical level:
172
+ - CMLt (Correct Metric Level Total): Accuracy at the correct metrical level
173
+ - AMLt (Any Metric Level Total): Accuracy allowing for metrical variations
174
+ (double/half tempo, off-beat, etc.)
175
+ - CMLc/AMLc: Continuous versions (longest correct segment)
176
+
177
+ Args:
178
+ pred_times: Predicted beat times in seconds
179
+ gt_times: Ground truth beat times in seconds
180
+ min_beat_time: Minimum time to start evaluation (default: 5.0s)
181
+ Set to 0.0 to use all beats, but note that early beats
182
+ may not have stable inter-beat intervals.
183
+ phase_threshold: Maximum phase error as ratio of beat interval (default: 0.175)
184
+ period_threshold: Maximum period error as ratio of beat interval (default: 0.175)
185
+
186
+ Returns:
187
+ Dict containing:
188
+ - 'CMLc': Correct Metric Level Continuous
189
+ - 'CMLt': Correct Metric Level Total
190
+ - 'AMLc': Any Metric Level Continuous
191
+ - 'AMLt': Any Metric Level Total
192
+ """
193
+ pred_arr = np.sort(np.array(pred_times, dtype=np.float64))
194
+ gt_arr = np.sort(np.array(gt_times, dtype=np.float64))
195
+
196
+ # Trim beats before min_beat_time (standard preprocessing)
197
+ pred_trimmed = mir_eval.beat.trim_beats(pred_arr, min_beat_time=min_beat_time)
198
+ gt_trimmed = mir_eval.beat.trim_beats(gt_arr, min_beat_time=min_beat_time)
199
+
200
+ # Handle edge cases where trimming results in too few beats
201
+ if len(gt_trimmed) < 2 or len(pred_trimmed) < 2:
202
+ return {
203
+ "CMLc": 0.0,
204
+ "CMLt": 0.0,
205
+ "AMLc": 0.0,
206
+ "AMLt": 0.0,
207
+ }
208
+
209
+ # Compute continuity metrics
210
+ CMLc, CMLt, AMLc, AMLt = mir_eval.beat.continuity(
211
+ gt_trimmed,
212
+ pred_trimmed,
213
+ continuity_phase_threshold=phase_threshold,
214
+ continuity_period_threshold=period_threshold,
215
+ )
216
+
217
+ return {
218
+ "CMLc": float(CMLc),
219
+ "CMLt": float(CMLt),
220
+ "AMLc": float(AMLc),
221
+ "AMLt": float(AMLt),
222
+ }
223
+
224
+
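The min_beat_time preprocessing amounts to dropping early events; a numpy-only sketch of the trim step (the actual mir_eval.beat.trim_beats call should be preferred):

```python
import numpy as np

beats = np.arange(0, 10, 0.5)            # a beat every 0.5 s for 10 s
min_beat_time = 5.0
trimmed = beats[beats >= min_beat_time]  # keep beats from 5.0 s onward
print(len(beats), len(trimmed))  # 20 10
```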
225
+ def evaluate_beats(
226
+ pred_times: Sequence[float],
227
+ gt_times: Sequence[float],
228
+ thresholds_ms: Sequence[int] | None = None,
229
+ min_beat_time: float = DEFAULT_MIN_BEAT_TIME,
230
+ ) -> dict:
231
+ """
232
+ Evaluate beat predictions against ground truth at multiple thresholds.
233
+
234
+ Args:
235
+ pred_times: Predicted beat times in seconds
236
+ gt_times: Ground truth beat times in seconds
237
+ thresholds_ms: Timing thresholds in milliseconds (default: 3ms to 30ms)
238
+ min_beat_time: Minimum time for continuity metrics (default: 5.0s)
239
+
240
+ Returns:
241
+ Dict containing:
242
+ - 'per_threshold': Dict[threshold_ms, {'precision', 'recall', 'f1'}]
243
+ - 'f1_scores': Dict[threshold_ms, f1_score] (convenience access)
244
+ - 'weighted_f1': Weighted F1-score across all thresholds
245
+ - 'continuity': Dict with CMLc, CMLt, AMLc, AMLt metrics
246
+ - 'num_predictions': Number of predictions
247
+ - 'num_ground_truth': Number of ground truth events
248
+ """
249
+ if thresholds_ms is None:
250
+ thresholds_ms = DEFAULT_THRESHOLDS_MS
251
+
252
+ pred_arr = np.array(pred_times, dtype=np.float64)
253
+ gt_arr = np.array(gt_times, dtype=np.float64)
254
+
255
+ per_threshold = {}
256
+ f1_scores = {}
257
+
258
+ for threshold_ms in thresholds_ms:
259
+ tolerance_sec = threshold_ms / 1000.0
260
+ tp, fp, fn = match_events(pred_arr, gt_arr, tolerance_sec)
261
+ precision, recall, f1 = compute_f1(tp, fp, fn)
262
+
263
+ per_threshold[threshold_ms] = {
264
+ "precision": precision,
265
+ "recall": recall,
266
+ "f1": f1,
267
+ "tp": tp,
268
+ "fp": fp,
269
+ "fn": fn,
270
+ }
271
+ f1_scores[threshold_ms] = f1
272
+
273
+ weighted_f1 = compute_weighted_f1(f1_scores, thresholds_ms)
274
+ continuity = compute_continuity_metrics(pred_times, gt_times, min_beat_time)
275
+
276
+ return {
277
+ "per_threshold": per_threshold,
278
+ "f1_scores": f1_scores,
279
+ "weighted_f1": weighted_f1,
280
+ "continuity": continuity,
281
+ "num_predictions": len(pred_arr),
282
+ "num_ground_truth": len(gt_arr),
283
+ }
284
+
285
+
286
+ def evaluate_track(
287
+ pred_beats: Sequence[float],
288
+ pred_downbeats: Sequence[float],
289
+ gt_beats: Sequence[float],
290
+ gt_downbeats: Sequence[float],
291
+ thresholds_ms: Sequence[int] | None = None,
292
+ min_beat_time: float = DEFAULT_MIN_BEAT_TIME,
293
+ ) -> dict:
294
+ """
295
+ Evaluate both beat and downbeat predictions for a single track.
296
+
297
+ Args:
298
+ pred_beats: Predicted beat times in seconds
299
+ pred_downbeats: Predicted downbeat times in seconds
300
+ gt_beats: Ground truth beat times in seconds
301
+ gt_downbeats: Ground truth downbeat times in seconds
302
+ thresholds_ms: Timing thresholds in milliseconds
303
+ min_beat_time: Minimum time for continuity metrics (default: 5.0s)
304
+
305
+ Returns:
306
+ Dict containing:
307
+ - 'beats': Results from evaluate_beats for beats
308
+ - 'downbeats': Results from evaluate_beats for downbeats
309
+ - 'combined_weighted_f1': Average of beat and downbeat weighted F1
310
+ """
311
+ beat_results = evaluate_beats(pred_beats, gt_beats, thresholds_ms, min_beat_time)
312
+ downbeat_results = evaluate_beats(
313
+ pred_downbeats, gt_downbeats, thresholds_ms, min_beat_time
314
+ )
315
+
316
+ combined_weighted_f1 = (
317
+ beat_results["weighted_f1"] + downbeat_results["weighted_f1"]
318
+ ) / 2
319
+
320
+ return {
321
+ "beats": beat_results,
322
+ "downbeats": downbeat_results,
323
+ "combined_weighted_f1": combined_weighted_f1,
324
+ }
325
+
326
+
327
+ def evaluate_all(
328
+ predictions: Sequence[dict],
329
+ ground_truths: Sequence[dict],
330
+ thresholds_ms: Sequence[int] | None = None,
331
+ min_beat_time: float = DEFAULT_MIN_BEAT_TIME,
332
+ verbose: bool = False,
333
+ ) -> dict:
334
+ """
335
+ Evaluate predictions for multiple tracks.
336
+
337
+ Args:
338
+ predictions: List of dicts with 'beats' and 'downbeats' keys
339
+ ground_truths: List of dicts with 'beats' and 'downbeats' keys
340
+ thresholds_ms: Timing thresholds in milliseconds
341
+ min_beat_time: Minimum time for continuity metrics (default: 5.0s)
342
+ verbose: If True, print per-track results
343
+
344
+ Returns:
345
+ Dict containing:
346
+ - 'per_track': List of per-track results
347
+ - 'mean_beat_weighted_f1': Mean weighted F1 for beats
348
+ - 'mean_downbeat_weighted_f1': Mean weighted F1 for downbeats
349
+ - 'mean_combined_weighted_f1': Mean combined weighted F1
350
+ - 'beat_f1_by_threshold': Mean F1 per threshold for beats
351
+ - 'downbeat_f1_by_threshold': Mean F1 per threshold for downbeats
352
+ - 'beat_continuity': Mean continuity metrics for beats
353
+ - 'downbeat_continuity': Mean continuity metrics for downbeats
354
+ """
355
+ if len(predictions) != len(ground_truths):
356
+ raise ValueError(
357
+ f"Number of predictions ({len(predictions)}) must match "
358
+ f"number of ground truths ({len(ground_truths)})"
359
+ )
360
+
361
+ if thresholds_ms is None:
362
+ thresholds_ms = DEFAULT_THRESHOLDS_MS
363
+
364
+ per_track = []
365
+ beat_weighted_f1s = []
366
+ downbeat_weighted_f1s = []
367
+ combined_weighted_f1s = []
368
+
369
+ beat_f1_by_threshold = {t: [] for t in thresholds_ms}
370
+ downbeat_f1_by_threshold = {t: [] for t in thresholds_ms}
371
+
372
+ # Continuity metrics tracking
373
+ beat_continuity = {"CMLc": [], "CMLt": [], "AMLc": [], "AMLt": []}
374
+ downbeat_continuity = {"CMLc": [], "CMLt": [], "AMLc": [], "AMLt": []}
375
+
376
+ for i, (pred, gt) in enumerate(zip(predictions, ground_truths)):
377
+ result = evaluate_track(
378
+ pred_beats=pred["beats"],
379
+ pred_downbeats=pred["downbeats"],
380
+ gt_beats=gt["beats"],
381
+ gt_downbeats=gt["downbeats"],
382
+ thresholds_ms=thresholds_ms,
383
+ min_beat_time=min_beat_time,
384
+ )
385
+
386
+ per_track.append(result)
387
+ beat_weighted_f1s.append(result["beats"]["weighted_f1"])
388
+ downbeat_weighted_f1s.append(result["downbeats"]["weighted_f1"])
389
+ combined_weighted_f1s.append(result["combined_weighted_f1"])
390
+
391
+ for t in thresholds_ms:
392
+ beat_f1_by_threshold[t].append(result["beats"]["f1_scores"][t])
393
+ downbeat_f1_by_threshold[t].append(result["downbeats"]["f1_scores"][t])
394
+
395
+ # Track continuity metrics
396
+ for metric in ["CMLc", "CMLt", "AMLc", "AMLt"]:
397
+ beat_continuity[metric].append(result["beats"]["continuity"][metric])
398
+ downbeat_continuity[metric].append(
399
+ result["downbeats"]["continuity"][metric]
400
+ )
401
+
402
+ if verbose:
403
+ beat_cont = result["beats"]["continuity"]
404
+ print(
405
+ f"Track {i}: Beat F1={result['beats']['weighted_f1']:.4f}, "
406
+ f"CMLt={beat_cont['CMLt']:.4f}, AMLt={beat_cont['AMLt']:.4f}, "
407
+ f"Downbeat F1={result['downbeats']['weighted_f1']:.4f}, "
408
+ f"Combined={result['combined_weighted_f1']:.4f}"
409
+ )
410
+
411
+ return {
412
+ "per_track": per_track,
413
+ "mean_beat_weighted_f1": float(np.mean(beat_weighted_f1s)),
414
+ "mean_downbeat_weighted_f1": float(np.mean(downbeat_weighted_f1s)),
415
+ "mean_combined_weighted_f1": float(np.mean(combined_weighted_f1s)),
416
+ "beat_f1_by_threshold": {
417
+ t: float(np.mean(v)) for t, v in beat_f1_by_threshold.items()
418
+ },
419
+ "downbeat_f1_by_threshold": {
420
+ t: float(np.mean(v)) for t, v in downbeat_f1_by_threshold.items()
421
+ },
422
+ "beat_continuity": {
423
+ metric: float(np.mean(values)) for metric, values in beat_continuity.items()
424
+ },
425
+ "downbeat_continuity": {
426
+ metric: float(np.mean(values))
427
+ for metric, values in downbeat_continuity.items()
428
+ },
429
+ "num_tracks": len(predictions),
430
+ }
431
+
432
+
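The per-track aggregation above boils down to collecting each metric into a list and averaging it; a minimal sketch with hypothetical per-track numbers:

```python
import numpy as np

per_track = [{"CMLt": 0.75, "AMLt": 0.5}, {"CMLt": 0.25, "AMLt": 1.0}]
agg = {k: [] for k in ("CMLt", "AMLt")}
for result in per_track:
    for k in agg:
        agg[k].append(result[k])
means = {k: float(np.mean(v)) for k, v in agg.items()}
print(means)  # {'CMLt': 0.5, 'AMLt': 0.75}
```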
433
+ def format_results(results: dict, title: str = "Evaluation Results") -> str:
434
+ """
435
+ Format evaluation results as a human-readable string.
436
+
437
+ Args:
438
+ results: Results dict from evaluate_all or evaluate_track
439
+ title: Title for the report
440
+
441
+ Returns:
442
+ Formatted string report
443
+ """
444
+ lines = [title, "=" * len(title), ""]
445
+
446
+ # Check if this is aggregate results (from evaluate_all)
447
+ if "num_tracks" in results:
448
+ lines.append(f"Number of tracks: {results['num_tracks']}")
449
+ lines.append("")
450
+ lines.append("Overall Metrics:")
451
+ lines.append(
452
+ f" Mean Beat Weighted F1: {results['mean_beat_weighted_f1']:.4f}"
453
+ )
454
+ lines.append(
455
+ f" Mean Downbeat Weighted F1: {results['mean_downbeat_weighted_f1']:.4f}"
456
+ )
457
+ lines.append(
458
+ f" Mean Combined Weighted F1: {results['mean_combined_weighted_f1']:.4f}"
459
+ )
460
+ lines.append("")
461
+
462
+ lines.append("Beat F1 by Threshold:")
463
+ for t, f1 in sorted(results["beat_f1_by_threshold"].items()):
464
+ lines.append(f" {t:2d}ms: {f1:.4f}")
465
+ lines.append("")
466
+
467
+ lines.append("Downbeat F1 by Threshold:")
468
+ for t, f1 in sorted(results["downbeat_f1_by_threshold"].items()):
469
+ lines.append(f" {t:2d}ms: {f1:.4f}")
470
+ lines.append("")
471
+
472
+ # Continuity metrics
473
+ if "beat_continuity" in results:
474
+ lines.append("Beat Continuity Metrics:")
475
+ bc = results["beat_continuity"]
476
+ lines.append(f" CMLt: {bc['CMLt']:.4f} (Correct Metrical Level Total)")
477
+ lines.append(f" AMLt: {bc['AMLt']:.4f} (Any Metrical Level Total)")
478
+ lines.append(
479
+ f" CMLc: {bc['CMLc']:.4f} (Correct Metrical Level Continuous)"
480
+ )
481
+ lines.append(f" AMLc: {bc['AMLc']:.4f} (Any Metrical Level Continuous)")
482
+ lines.append("")
483
+
484
+ if "downbeat_continuity" in results:
485
+ lines.append("Downbeat Continuity Metrics:")
486
+ dc = results["downbeat_continuity"]
487
+ lines.append(f" CMLt: {dc['CMLt']:.4f} (Correct Metrical Level Total)")
488
+ lines.append(f" AMLt: {dc['AMLt']:.4f} (Any Metrical Level Total)")
489
+ lines.append(
490
+ f" CMLc: {dc['CMLc']:.4f} (Correct Metrical Level Continuous)"
491
+ )
492
+ lines.append(f" AMLc: {dc['AMLc']:.4f} (Any Metrical Level Continuous)")
493
+
494
+ # Single track results (from evaluate_track)
495
+ elif "beats" in results and "downbeats" in results:
496
+ lines.append("Beat Detection:")
497
+ lines.append(f" Weighted F1: {results['beats']['weighted_f1']:.4f}")
498
+ lines.append(f" Predictions: {results['beats']['num_predictions']}")
499
+ lines.append(f" Ground Truth: {results['beats']['num_ground_truth']}")
500
+
501
+ # Beat continuity metrics
502
+ if "continuity" in results["beats"]:
503
+ bc = results["beats"]["continuity"]
504
+ lines.append(f" CMLt: {bc['CMLt']:.4f} AMLt: {bc['AMLt']:.4f}")
505
+ lines.append(f" CMLc: {bc['CMLc']:.4f} AMLc: {bc['AMLc']:.4f}")
506
+ lines.append("")
507
+
508
+ lines.append("Downbeat Detection:")
509
+ lines.append(f" Weighted F1: {results['downbeats']['weighted_f1']:.4f}")
510
+ lines.append(f" Predictions: {results['downbeats']['num_predictions']}")
511
+ lines.append(f" Ground Truth: {results['downbeats']['num_ground_truth']}")
512
+
513
+ # Downbeat continuity metrics
514
+ if "continuity" in results["downbeats"]:
515
+ dc = results["downbeats"]["continuity"]
516
+ lines.append(f" CMLt: {dc['CMLt']:.4f} AMLt: {dc['AMLt']:.4f}")
517
+ lines.append(f" CMLc: {dc['CMLc']:.4f} AMLc: {dc['AMLc']:.4f}")
518
+ lines.append("")
519
+
520
+ lines.append(f"Combined Weighted F1: {results['combined_weighted_f1']:.4f}")
521
+
522
+ return "\n".join(lines)
523
+
524
+
525
+ if __name__ == "__main__":
526
+ # Demo with synthetic data
527
+ print("Running evaluation demo...\n")
528
+
529
+ # Simulate ground truth beats at regular intervals (30s to have beats after 5s)
530
+ gt_beats = np.arange(0, 30, 0.5).tolist() # Beat every 0.5s for 30s
531
+ gt_downbeats = np.arange(0, 30, 2.0).tolist() # Downbeat every 2s
532
+
533
+ # Simulate predictions with some noise and missed/extra detections
534
+ np.random.seed(42)
535
+ pred_beats = (
536
+ np.array(gt_beats) + np.random.normal(0, 0.005, len(gt_beats))
537
+ ).tolist()
538
+ pred_beats = pred_beats[:-2] # Miss last 2 beats
539
+ pred_beats.append(15.25) # Add false positive
540
+
541
+ pred_downbeats = (
542
+ np.array(gt_downbeats) + np.random.normal(0, 0.003, len(gt_downbeats))
543
+ ).tolist()
544
+
545
+ # Evaluate single track
546
+ results = evaluate_track(
547
+ pred_beats=pred_beats,
548
+ pred_downbeats=pred_downbeats,
549
+ gt_beats=gt_beats,
550
+ gt_downbeats=gt_downbeats,
551
+ )
552
+
553
+ print(format_results(results, "Single Track Demo"))
554
+ print("\n" + "=" * 50 + "\n")
555
+
556
+ # Multi-track demo
557
+ predictions = [
558
+ {"beats": pred_beats, "downbeats": pred_downbeats},
559
+ {"beats": pred_beats, "downbeats": pred_downbeats},
560
+ ]
561
+ ground_truths = [
562
+ {"beats": gt_beats, "downbeats": gt_downbeats},
563
+ {"beats": gt_beats, "downbeats": gt_downbeats},
564
+ ]
565
+
566
+ all_results = evaluate_all(predictions, ground_truths, verbose=True)
567
+ print()
568
+ print(format_results(all_results, "Multi-Track Demo"))
exp/data/load.py ADDED
@@ -0,0 +1,91 @@
1
+ from datasets import load_dataset, Audio
2
+
3
+ N_PROC = None
4
+
5
+ ds = load_dataset("JacobLinCool/taiko-1000-parsed")
6
+ ds = ds.remove_columns(["tja", "hard", "normal", "easy", "ura"])
7
+
8
+
9
+ def filter_out_broken(example):
10
+ try:
+ # Accessing the array forces audio decoding; broken files raise here
+ example["audio"]["array"]
+ return True
+ except Exception:
+ return False
15
+
16
+
17
+ ds = ds.filter(filter_out_broken, num_proc=N_PROC, batch_size=32, writer_batch_size=32)
18
+ ds = ds.cast_column("audio", Audio(sampling_rate=16000))
19
+
20
+
21
+ def build_beat_and_downbeat_labels(example):
22
+ """
23
+ Extract beat and downbeat times from the chart segments.
24
+
25
+ - Downbeats: First beat of each measure (segment timestamp)
26
+ - Beats: All beats within each measure based on time signature
27
+
28
+ Returns lists of times in seconds.
29
+ """
30
+ title = example["metadata"]["TITLE"]
31
+ segments = example["oni"]["segments"]
32
+
33
+ beats = []
34
+ downbeats = []
35
+
36
+ for i, segment in enumerate(segments):
37
+ seg_timestamp = segment["timestamp"]
38
+ measure_num = segment["measure_num"] # numerator (e.g., 4 in 4/4)
39
+ measure_den = segment["measure_den"] # denominator (e.g., 4 in 4/4)
40
+ notes = segment["notes"]
41
+
42
+ # Downbeat is the start of each measure
43
+ downbeats.append(seg_timestamp)
44
+
45
+ # Get BPM from the first note in segment, or fallback to next segment's first note
46
+ bpm = None
47
+ if notes:
48
+ bpm = notes[0]["bpm"]
49
+ else:
50
+ # Look ahead for BPM if current segment has no notes
51
+ for j in range(i + 1, len(segments)):
52
+ if segments[j]["notes"]:
53
+ bpm = segments[j]["notes"][0]["bpm"]
54
+ break
55
+
56
+ if bpm is None or bpm <= 0:
57
+ bpm = 120.0 # fallback default BPM
58
+
59
+ # Calculate beat duration: one beat = 60/BPM seconds (for quarter note)
60
+ # Adjust for time signature denominator (4 = quarter, 8 = eighth, etc.)
61
+ beat_duration = (60.0 / bpm) * (4.0 / measure_den)
62
+
63
+ # Calculate beat positions within this measure
64
+ for beat_idx in range(measure_num):
65
+ beat_time = seg_timestamp + beat_idx * beat_duration
66
+ beats.append(beat_time)
67
+
68
+ # Sort and deduplicate (in case of overlapping segments)
69
+ beats = sorted(set(beats))
70
+ downbeats = sorted(set(downbeats))
71
+
72
+ return {
73
+ "title": title,
74
+ "beats": beats,
75
+ "downbeats": downbeats,
76
+ }
77
+
78
+
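The beat-duration arithmetic above can be sanity-checked on a non-4/4 case; in 7/8 at 180 BPM each beat is an eighth note (values chosen for illustration):

```python
bpm, measure_num, measure_den = 180.0, 7, 8
beat_duration = (60.0 / bpm) * (4.0 / measure_den)  # quarter-note length scaled to eighths
beats = [12.0 + i * beat_duration for i in range(measure_num)]  # measure starting at 12 s
print(round(beat_duration, 4), len(beats))  # 0.1667 7
```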
79
+ ds = ds.map(
80
+ build_beat_and_downbeat_labels,
81
+ num_proc=N_PROC,
82
+ batch_size=32,
83
+ writer_batch_size=32,
84
+ remove_columns=["oni", "metadata"],
85
+ )
86
+
87
+ ds = ds.with_format("torch")
88
+
89
+ if __name__ == "__main__":
90
+ print(ds)
91
+ print(ds["train"].features)
exp/data/viz.py ADDED
@@ -0,0 +1,441 @@
1
+ """
2
+ Visualization utilities for beat tracking evaluation.
3
+
4
+ This module provides functions to:
5
+ - Plot beat and downbeat predictions vs ground truth
6
+ - Create waveform visualizations with beat annotations
7
+ - Generate comparison plots for evaluation
8
+
9
+ Example usage:
10
+ from exp.data.viz import plot_beats, plot_waveform_with_beats, save_figure
11
+
12
+ # Plot beat comparison
13
+ fig = plot_beats(pred_beats, gt_beats, pred_downbeats, gt_downbeats)
14
+ save_figure(fig, "beat_comparison.png")
15
+
16
+ # Plot waveform with beats
17
+ fig = plot_waveform_with_beats(audio, sr, pred_beats, gt_beats)
18
+ save_figure(fig, "waveform.png")
19
+ """
20
+
21
+ import numpy as np
22
+ from pathlib import Path
23
+
24
+ # Try to import matplotlib, but make it optional
25
+ try:
26
+ import matplotlib.pyplot as plt
27
+ import matplotlib.patches as mpatches
28
+
29
+ HAS_MATPLOTLIB = True
30
+ except ImportError:
31
+ HAS_MATPLOTLIB = False
32
+
33
+
34
+ def _check_matplotlib():
35
+ if not HAS_MATPLOTLIB:
36
+ raise ImportError(
37
+ "matplotlib is required for visualization. "
38
+ "Install with: pip install matplotlib"
39
+ )
40
+
41
+
42
+ def plot_beats(
43
+ pred_beats: list[float] | np.ndarray,
44
+ gt_beats: list[float] | np.ndarray,
45
+ pred_downbeats: list[float] | np.ndarray | None = None,
46
+ gt_downbeats: list[float] | np.ndarray | None = None,
47
+ title: str = "Beat Tracking Comparison",
48
+ figsize: tuple[int, int] = (14, 4),
49
+ time_range: tuple[float, float] | None = None,
50
+ ) -> "plt.Figure":
51
+ """
52
+ Create a visualization comparing predicted and ground truth beats.
53
+
54
+ Args:
55
+ pred_beats: Predicted beat times in seconds
56
+ gt_beats: Ground truth beat times in seconds
57
+ pred_downbeats: Predicted downbeat times (optional)
58
+ gt_downbeats: Ground truth downbeat times (optional)
59
+ title: Plot title
60
+ figsize: Figure size (width, height)
61
+ time_range: Optional tuple (start, end) to limit time range
62
+
63
+ Returns:
64
+ matplotlib Figure object
65
+ """
66
+ _check_matplotlib()
67
+
68
+ fig, ax = plt.subplots(figsize=figsize)
69
+
70
+ pred_beats = np.array(pred_beats)
71
+ gt_beats = np.array(gt_beats)
72
+
73
+ # Apply time range filter
74
+ if time_range is not None:
75
+ start, end = time_range
76
+ pred_beats = pred_beats[(pred_beats >= start) & (pred_beats <= end)]
77
+ gt_beats = gt_beats[(gt_beats >= start) & (gt_beats <= end)]
78
+
79
+ if pred_downbeats is not None:
80
+ pred_downbeats = np.array(pred_downbeats)
81
+ pred_downbeats = pred_downbeats[
82
+ (pred_downbeats >= start) & (pred_downbeats <= end)
83
+ ]
84
+ if gt_downbeats is not None:
85
+ gt_downbeats = np.array(gt_downbeats)
86
+ gt_downbeats = gt_downbeats[(gt_downbeats >= start) & (gt_downbeats <= end)]
87
+
88
+ # Plot ground truth beats
89
+ ax.vlines(
90
+ gt_beats, 0, 0.4, colors="green", alpha=0.7, linewidth=1.5, label="GT Beats"
91
+ )
92
+
93
+ # Plot predicted beats
94
+ ax.vlines(
95
+ pred_beats,
96
+ 0.6,
97
+ 1.0,
98
+ colors="blue",
99
+ alpha=0.7,
100
+ linewidth=1.5,
101
+ label="Pred Beats",
102
+ )
103
+
104
+ # Plot downbeats if provided
105
+ if gt_downbeats is not None and len(gt_downbeats) > 0:
106
+ gt_downbeats = np.array(gt_downbeats)
107
+ ax.vlines(
108
+ gt_downbeats, 0, 0.4, colors="darkgreen", linewidth=3, label="GT Downbeats"
109
+ )
110
+
111
+ if pred_downbeats is not None and len(pred_downbeats) > 0:
112
+ pred_downbeats = np.array(pred_downbeats)
113
+ ax.vlines(
114
+ pred_downbeats,
115
+ 0.6,
116
+ 1.0,
117
+ colors="darkblue",
118
+ linewidth=3,
119
+ label="Pred Downbeats",
120
+ )
121
+
122
+ # Styling
123
+ ax.set_ylim(-0.1, 1.1)
124
+ ax.set_yticks([0.2, 0.8])
125
+ ax.set_yticklabels(["Ground Truth", "Prediction"])
126
+ ax.set_xlabel("Time (seconds)")
127
+ ax.set_title(title)
128
+ ax.legend(loc="upper right", ncol=4)
129
+ ax.grid(True, alpha=0.3)
130
+
131
+ # Set x-axis range
132
+ if time_range is not None:
133
+ ax.set_xlim(time_range)
134
+ else:
135
+ all_times = np.concatenate([pred_beats, gt_beats])
136
+ if len(all_times) > 0:
137
+ ax.set_xlim(0, np.max(all_times) + 0.5)
138
+
139
+ plt.tight_layout()
140
+ return fig
141
+
142
+
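The time-range filtering used throughout these plot functions is a plain boolean mask over the event times:

```python
import numpy as np

beats = np.array([0.5, 1.0, 4.8, 5.2, 9.9])
start, end = 1.0, 6.0
kept = beats[(beats >= start) & (beats <= end)]  # inclusive on both ends
print(kept.tolist())  # [1.0, 4.8, 5.2]
```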
143
+ def plot_waveform_with_beats(
144
+ audio: np.ndarray,
145
+ sr: int,
146
+ pred_beats: list[float] | np.ndarray,
147
+ gt_beats: list[float] | np.ndarray,
148
+ pred_downbeats: list[float] | np.ndarray | None = None,
149
+ gt_downbeats: list[float] | np.ndarray | None = None,
150
+ title: str = "Waveform with Beat Annotations",
151
+ figsize: tuple[int, int] = (14, 6),
152
+ time_range: tuple[float, float] | None = None,
153
+ ) -> "plt.Figure":
154
+ """
155
+ Create a waveform visualization with beat annotations.
156
+
157
+ Args:
158
+ audio: Audio waveform
159
+ sr: Sample rate
160
+ pred_beats: Predicted beat times
161
+ gt_beats: Ground truth beat times
162
+ pred_downbeats: Predicted downbeat times (optional)
163
+ gt_downbeats: Ground truth downbeat times (optional)
164
+ title: Plot title
165
+ figsize: Figure size
166
+ time_range: Optional tuple (start, end) to limit time range
167
+
168
+ Returns:
169
+ matplotlib Figure object
170
+ """
171
+ _check_matplotlib()
172
+
173
+ fig, (ax1, ax2) = plt.subplots(
174
+ 2, 1, figsize=figsize, sharex=True, height_ratios=[3, 1]
175
+ )
176
+
177
+ # Time axis
178
+ duration = len(audio) / sr
179
+ t = np.linspace(0, duration, len(audio))
180
+
181
+ # Apply time range
182
+ if time_range is not None:
183
+ start, end = time_range
184
+ start_idx = int(start * sr)
185
+ end_idx = int(end * sr)
186
+ t = t[start_idx:end_idx]
187
+ audio_plot = audio[start_idx:end_idx]
188
+ else:
189
+ audio_plot = audio
190
+ start, end = 0, duration
191
+
192
+ # Plot waveform
193
+ ax1.plot(t, audio_plot, color="gray", alpha=0.7, linewidth=0.5)
194
+ ax1.set_ylabel("Amplitude")
195
+ ax1.set_title(title)
196
+
197
+ # Filter beats to time range
198
+ pred_beats = np.array(pred_beats)
199
+ gt_beats = np.array(gt_beats)
200
+ pred_beats = pred_beats[(pred_beats >= start) & (pred_beats <= end)]
201
+ gt_beats = gt_beats[(gt_beats >= start) & (gt_beats <= end)]
202
+
203
+ # Plot beat markers on waveform
204
+ audio_max = np.abs(audio_plot).max() if len(audio_plot) > 0 else 1.0
205
+
206
+ for beat in gt_beats:
207
+ ax1.axvline(beat, color="green", alpha=0.5, linewidth=1)
208
+ for beat in pred_beats:
209
+ ax1.axvline(beat, color="blue", alpha=0.3, linewidth=1, linestyle="--")
210
+
211
+ # Add downbeat markers (thicker lines)
212
+ if gt_downbeats is not None:
213
+ gt_downbeats = np.array(gt_downbeats)
214
+ gt_downbeats = gt_downbeats[(gt_downbeats >= start) & (gt_downbeats <= end)]
215
+ for db in gt_downbeats:
216
+ ax1.axvline(db, color="darkgreen", alpha=0.8, linewidth=2)
217
+
218
+ if pred_downbeats is not None:
219
+ pred_downbeats = np.array(pred_downbeats)
220
+ pred_downbeats = pred_downbeats[
221
+ (pred_downbeats >= start) & (pred_downbeats <= end)
222
+ ]
223
+ for db in pred_downbeats:
224
+ ax1.axvline(db, color="darkblue", alpha=0.5, linewidth=2, linestyle="--")
225
+
226
+ ax1.set_ylim(-audio_max * 1.1, audio_max * 1.1)
227
+
228
+ # Beat comparison subplot
229
+ ax2.vlines(gt_beats, 0, 0.4, colors="green", alpha=0.7, linewidth=1.5)
230
+ ax2.vlines(pred_beats, 0.6, 1.0, colors="blue", alpha=0.7, linewidth=1.5)
231
+
232
+ if gt_downbeats is not None and len(gt_downbeats) > 0:
233
+ ax2.vlines(gt_downbeats, 0, 0.4, colors="darkgreen", linewidth=3)
234
+ if pred_downbeats is not None and len(pred_downbeats) > 0:
235
+ ax2.vlines(pred_downbeats, 0.6, 1.0, colors="darkblue", linewidth=3)
236
+
237
+ ax2.set_ylim(-0.1, 1.1)
238
+ ax2.set_yticks([0.2, 0.8])
239
+ ax2.set_yticklabels(["GT", "Pred"])
240
+ ax2.set_xlabel("Time (seconds)")
241
+
242
+ # Legend
243
+ legend_elements = [
244
+ mpatches.Patch(color="green", alpha=0.7, label="GT Beats"),
245
+ mpatches.Patch(color="blue", alpha=0.7, label="Pred Beats"),
246
+ mpatches.Patch(color="darkgreen", label="GT Downbeats"),
247
+        mpatches.Patch(color="darkblue", label="Pred Downbeats"),
+    ]
+    ax1.legend(handles=legend_elements, loc="upper right", ncol=4)
+
+    ax1.grid(True, alpha=0.3)
+    ax2.grid(True, alpha=0.3)
+
+    plt.tight_layout()
+    return fig
+
+
+def plot_evaluation_summary(
+    results: dict,
+    title: str = "Evaluation Summary",
+    figsize: tuple[int, int] = (12, 8),
+) -> "plt.Figure":
+    """
+    Create a summary visualization of evaluation results.
+
+    Args:
+        results: Results dict from evaluate_all
+        title: Plot title
+        figsize: Figure size
+
+    Returns:
+        matplotlib Figure object
+    """
+    _check_matplotlib()
+
+    fig, axes = plt.subplots(2, 2, figsize=figsize)
+
+    # F1 by threshold for beats
+    ax1 = axes[0, 0]
+    if "beat_f1_by_threshold" in results:
+        thresholds = sorted(results["beat_f1_by_threshold"].keys())
+        f1_scores = [results["beat_f1_by_threshold"][t] for t in thresholds]
+        ax1.bar(range(len(thresholds)), f1_scores, color="steelblue", alpha=0.8)
+        ax1.set_xticks(range(len(thresholds)))
+        ax1.set_xticklabels([f"{t}ms" for t in thresholds], rotation=45)
+        ax1.set_ylabel("F1 Score")
+        ax1.set_title("Beat F1 by Threshold")
+        ax1.set_ylim(0, 1)
+        ax1.grid(True, alpha=0.3)
+
+    # F1 by threshold for downbeats
+    ax2 = axes[0, 1]
+    if "downbeat_f1_by_threshold" in results:
+        thresholds = sorted(results["downbeat_f1_by_threshold"].keys())
+        f1_scores = [results["downbeat_f1_by_threshold"][t] for t in thresholds]
+        ax2.bar(range(len(thresholds)), f1_scores, color="coral", alpha=0.8)
+        ax2.set_xticks(range(len(thresholds)))
+        ax2.set_xticklabels([f"{t}ms" for t in thresholds], rotation=45)
+        ax2.set_ylabel("F1 Score")
+        ax2.set_title("Downbeat F1 by Threshold")
+        ax2.set_ylim(0, 1)
+        ax2.grid(True, alpha=0.3)
+
+    # Continuity metrics for beats
+    ax3 = axes[1, 0]
+    if "beat_continuity" in results:
+        metrics = ["CMLc", "CMLt", "AMLc", "AMLt"]
+        values = [results["beat_continuity"][m] for m in metrics]
+        colors = ["#2E86AB", "#A23B72", "#F18F01", "#C73E1D"]
+        bars = ax3.bar(metrics, values, color=colors, alpha=0.8)
+        ax3.set_ylabel("Score")
+        ax3.set_title("Beat Continuity Metrics")
+        ax3.set_ylim(0, 1)
+        ax3.grid(True, alpha=0.3)
+        # Add value labels
+        for bar, val in zip(bars, values):
+            ax3.text(
+                bar.get_x() + bar.get_width() / 2,
+                bar.get_height() + 0.02,
+                f"{val:.3f}",
+                ha="center",
+                fontsize=9,
+            )
+
+    # Continuity metrics for downbeats
+    ax4 = axes[1, 1]
+    if "downbeat_continuity" in results:
+        metrics = ["CMLc", "CMLt", "AMLc", "AMLt"]
+        values = [results["downbeat_continuity"][m] for m in metrics]
+        colors = ["#2E86AB", "#A23B72", "#F18F01", "#C73E1D"]
+        bars = ax4.bar(metrics, values, color=colors, alpha=0.8)
+        ax4.set_ylabel("Score")
+        ax4.set_title("Downbeat Continuity Metrics")
+        ax4.set_ylim(0, 1)
+        ax4.grid(True, alpha=0.3)
+        # Add value labels
+        for bar, val in zip(bars, values):
+            ax4.text(
+                bar.get_x() + bar.get_width() / 2,
+                bar.get_height() + 0.02,
+                f"{val:.3f}",
+                ha="center",
+                fontsize=9,
+            )
+
+    fig.suptitle(title, fontsize=14, fontweight="bold")
+    plt.tight_layout()
+    return fig
+
+
+def save_figure(
+    fig: "plt.Figure",
+    path: str | Path,
+    dpi: int = 150,
+) -> None:
+    """
+    Save a matplotlib figure to file.
+
+    Args:
+        fig: Figure to save
+        path: Output file path
+        dpi: Resolution (dots per inch)
+    """
+    _check_matplotlib()
+
+    path = Path(path)
+    path.parent.mkdir(parents=True, exist_ok=True)
+    fig.savefig(str(path), dpi=dpi, bbox_inches="tight")
+    plt.close(fig)
+
+
+if __name__ == "__main__":
+    # Demo
+    _check_matplotlib()
+    print("Visualization demo...")
+
+    # Generate synthetic data
+    np.random.seed(42)
+    gt_beats = np.arange(0, 10, 0.5)
+    gt_downbeats = np.arange(0, 10, 2.0)
+    pred_beats = gt_beats + np.random.normal(0, 0.02, len(gt_beats))
+    pred_downbeats = gt_downbeats + np.random.normal(0, 0.01, len(gt_downbeats))
+
+    # Generate fake audio
+    sr = 16000
+    duration = 10.0
+    t = np.arange(int(duration * sr)) / sr
+    audio = np.sin(2 * np.pi * 220 * t) * 0.3
+
+    # Create plots
+    fig1 = plot_beats(
+        pred_beats, gt_beats, pred_downbeats, gt_downbeats, title="Beat Comparison Demo"
+    )
+    save_figure(fig1, "/tmp/beat_comparison_demo.png")
+    print("Saved /tmp/beat_comparison_demo.png")
+
+    fig2 = plot_waveform_with_beats(
+        audio,
+        sr,
+        pred_beats,
+        gt_beats,
+        pred_downbeats,
+        gt_downbeats,
+        title="Waveform Demo",
+        time_range=(2, 8),
+    )
+    save_figure(fig2, "/tmp/waveform_demo.png")
+    print("Saved /tmp/waveform_demo.png")
+
+    # Fake evaluation results
+    results = {
+        "beat_f1_by_threshold": {
+            3: 0.5,
+            6: 0.7,
+            9: 0.85,
+            12: 0.9,
+            15: 0.95,
+            18: 0.96,
+            21: 0.97,
+            24: 0.97,
+            27: 0.98,
+            30: 0.98,
+        },
+        "downbeat_f1_by_threshold": {
+            3: 0.6,
+            6: 0.8,
+            9: 0.9,
+            12: 0.95,
+            15: 0.97,
+            18: 0.98,
+            21: 0.98,
+            24: 0.99,
+            27: 0.99,
+            30: 0.99,
+        },
+        "beat_continuity": {"CMLc": 0.75, "CMLt": 0.92, "AMLc": 0.80, "AMLt": 0.95},
+        "downbeat_continuity": {"CMLc": 0.85, "CMLt": 0.95, "AMLc": 0.88, "AMLt": 0.97},
+    }
+    fig3 = plot_evaluation_summary(results, title="Evaluation Summary Demo")
+    save_figure(fig3, "/tmp/eval_summary_demo.png")
+    print("Saved /tmp/eval_summary_demo.png")
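The demo above fills `beat_f1_by_threshold` with placeholder scores keyed by tolerance in milliseconds. For context, an F1 at a given tolerance comes from one-to-one matching of predicted and reference beat times within that window; the sketch below is a hedged illustration of such matching, not the repository's evaluation code (which presumably lives in `exp/data/eval.py`):

```python
def beat_f1(pred: list[float], gt: list[float], tol: float = 0.07) -> float:
    """F1 from greedy one-to-one matching of beat times within +/- tol seconds."""
    unmatched = list(gt)
    tp = 0
    for p in pred:
        hit = next((g for g in unmatched if abs(g - p) <= tol), None)
        if hit is not None:
            tp += 1
            unmatched.remove(hit)  # each reference beat may match at most once
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gt)
    return 2 * precision * recall / (precision + recall)


print(beat_f1([0.0, 0.52, 1.0], [0.0, 0.5, 1.0]))  # 1.0: every beat within 70 ms
print(beat_f1([0.0, 0.25, 0.5], [0.0, 0.5, 1.0]))  # only 2 of 3 beats match
```

For reference, `mir_eval.beat.f_measure` uses a similar tolerance window (70 ms by default), which is why the demo sweeps thresholds in the millisecond range.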
outputs/baseline1/beats/README.md ADDED
@@ -0,0 +1,10 @@
+---
+tags:
+- model_hub_mixin
+- pytorch_model_hub_mixin
+---
+
+This model has been pushed to the Hub using the [PytorchModelHubMixin](https://huggingface.co/docs/huggingface_hub/package_reference/mixins#huggingface_hub.PyTorchModelHubMixin) integration:
+- Code: [More Information Needed]
+- Paper: [More Information Needed]
+- Docs: [More Information Needed]
outputs/baseline1/beats/config.json ADDED
@@ -0,0 +1,3 @@
+{
+  "dropout_rate": 0.5
+}
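The `config.json` stored next to each checkpoint holds the model's constructor kwargs (here just `dropout_rate`), which is how `PyTorchModelHubMixin.from_pretrained` can rebuild the module with the same hyperparameters. A minimal torch-free sketch of that round trip; the names `TinyModel`, `save_config`, and `load_from_config` are illustrative, not the repo's or the mixin's API:

```python
import json
import tempfile
from pathlib import Path


class TinyModel:
    def __init__(self, dropout_rate: float = 0.1):
        self.dropout_rate = dropout_rate


def save_config(model: TinyModel, save_dir: str) -> None:
    # Mirrors what the mixin writes beside model.safetensors
    cfg = {"dropout_rate": model.dropout_rate}
    (Path(save_dir) / "config.json").write_text(json.dumps(cfg))


def load_from_config(save_dir: str) -> TinyModel:
    # Re-instantiate the model from the persisted constructor kwargs
    cfg = json.loads((Path(save_dir) / "config.json").read_text())
    return TinyModel(**cfg)


with tempfile.TemporaryDirectory() as d:
    save_config(TinyModel(dropout_rate=0.5), d)
    print(load_from_config(d).dropout_rate)  # 0.5
```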
outputs/baseline1/beats/final/README.md ADDED
@@ -0,0 +1,10 @@
+---
+tags:
+- model_hub_mixin
+- pytorch_model_hub_mixin
+---
+
+This model has been pushed to the Hub using the [PytorchModelHubMixin](https://huggingface.co/docs/huggingface_hub/package_reference/mixins#huggingface_hub.PyTorchModelHubMixin) integration:
+- Code: [More Information Needed]
+- Paper: [More Information Needed]
+- Docs: [More Information Needed]
outputs/baseline1/beats/final/config.json ADDED
@@ -0,0 +1,3 @@
+{
+  "dropout_rate": 0.5
+}
outputs/baseline1/beats/final/model.safetensors ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:f0ee01ee41360f0b486e16d6022f896a19f9ead901be0180bdbd9cad2a3b8597
+size 1159372
outputs/baseline1/beats/logs/events.out.tfevents.1766351314.msiit232.1284330.0 ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:7b2d91a22ba01091bf072f5a5e8f12fc7d49801d6538914c973ccb2700978934
+size 17749022
outputs/baseline1/beats/model.safetensors ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:1e7a0d5178bc5dfeee6da26345e7956aeb6bf64a21be7e541db4bcc37b290249
+size 1159372
outputs/baseline1/downbeats/README.md ADDED
@@ -0,0 +1,10 @@
+---
+tags:
+- model_hub_mixin
+- pytorch_model_hub_mixin
+---
+
+This model has been pushed to the Hub using the [PytorchModelHubMixin](https://huggingface.co/docs/huggingface_hub/package_reference/mixins#huggingface_hub.PyTorchModelHubMixin) integration:
+- Code: [More Information Needed]
+- Paper: [More Information Needed]
+- Docs: [More Information Needed]
outputs/baseline1/downbeats/config.json ADDED
@@ -0,0 +1,3 @@
+{
+  "dropout_rate": 0.5
+}
outputs/baseline1/downbeats/final/README.md ADDED
@@ -0,0 +1,10 @@
+---
+tags:
+- model_hub_mixin
+- pytorch_model_hub_mixin
+---
+
+This model has been pushed to the Hub using the [PytorchModelHubMixin](https://huggingface.co/docs/huggingface_hub/package_reference/mixins#huggingface_hub.PyTorchModelHubMixin) integration:
+- Code: [More Information Needed]
+- Paper: [More Information Needed]
+- Docs: [More Information Needed]