Baseline3 improvements (beats + downbeats)
This document summarizes the changes made in `exp/baseline3` relative to `exp/baseline2` during this session, focusing on improvements intended to increase beat/downbeat F1 and continuity while keeping the training/eval workflow consistent with baseline2.
Scope / goals
- Keep the same overall pipeline as baseline2 (same dataset, same context window, same mel multi-view preprocessing, same peak-picking evaluation).
- Add SE-inspired improvements to the model (baseline3) while preserving the baseline2 ResNet backbone structure.
- Make training and TensorBoard curves comparable to baseline2.
- Support faster iteration when needed (optional), but allow returning to baseline2-style “full” training defaults.
Model improvements (affects both beats + downbeats)
1) Extra SE-inspired gating (temporal excitation)
- File: `exp/baseline3/model.py`
- Added an additional SE-style gating mechanism that is time-dependent (a "temporal excitation" in addition to channel excitation).
- The intent is to help the network emphasize temporally-salient patterns that correspond to rhythmic events, improving peak sharpness and reducing spurious activations.
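As a rough illustration of the idea (not the actual baseline3 implementation; the module and parameter names here are hypothetical), a time-dependent gate can squeeze the channel dimension and emit one sigmoid weight per frame:

```python
import torch
import torch.nn as nn

class TemporalExcitation(nn.Module):
    """Hypothetical sketch of a time-dependent SE-style gate: compute one
    sigmoid weight per frame from the channel vector at that time step."""
    def __init__(self, channels: int, kernel_size: int = 7):
        super().__init__()
        # Learn the per-frame gate from all channels at each time step.
        self.proj = nn.Conv1d(channels, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time)
        gate = torch.sigmoid(self.proj(x))  # (batch, 1, time), one weight per frame
        return x * gate                     # broadcast the gate over channels
```

Because the gate lies in (0, 1), the block can only attenuate frames, which matches the stated intent of suppressing spurious activations while leaving salient frames mostly intact.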
2) SE block robustness
- File: `exp/baseline3/model.py`
- Made the SE hidden dimension robust for small channel counts (ensuring the intermediate dimension is never zero).
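The fix presumably amounts to clamping the integer-divided bottleneck width; a minimal sketch (function name and default reduction are assumptions, not baseline3's actual code):

```python
def se_hidden_dim(channels: int, reduction: int = 16) -> int:
    """Intermediate SE dimension; max(1, ...) guards against a zero-width
    bottleneck when channels < reduction (e.g. 8 channels, reduction 16)."""
    return max(1, channels // reduction)
```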
Data / sampling improvements (optional; applies to both beats + downbeats)
3) Track capping support (optional)
- File: `exp/baseline3/data.py`
- Added support for limiting the number of tracks used when building indices.
- This was introduced for fast iteration runs (debugging / quick experiments). When not used, training uses the full dataset like baseline2.
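The capping logic is likely a simple deterministic truncation of the track list; a sketch under that assumption (helper name is hypothetical, and 0 means "no cap" as in the CLI flags described later):

```python
def cap_tracks(track_ids: list, max_tracks: int = 0) -> list:
    """Limit the number of tracks for fast iteration; 0 disables the cap,
    so the full dataset is used (baseline2-style behavior)."""
    if max_tracks > 0:
        return track_ids[:max_tracks]
    return track_ids
```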
4) Hard-negative sampling near events (optional)
- File: `exp/baseline3/data.py`
- Added optional "hard negatives" close to ground-truth frames:
  - For each beat/downbeat frame, add negative frames at offsets ±d for d = 2..R.
  - Controlled by `hard_neg_radius` and `hard_neg_fraction`.
- Rationale: random negatives are often too easy; near-event negatives help reduce double-peaks/jitter and can improve continuity.
- Status: kept off by default when running in baseline2-style mode.
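The ±d enumeration described above can be sketched as follows (this is an illustrative reconstruction, not the baseline3 code; subsampling by `hard_neg_fraction` is omitted for brevity):

```python
def hard_negative_frames(event_frames: list, radius: int, num_frames: int) -> list:
    """For each annotated frame, collect negatives at offsets ±d for
    d = 2..radius, skipping out-of-range frames and frames that are
    themselves events."""
    events = set(event_frames)
    negatives = set()
    for f in event_frames:
        for d in range(2, radius + 1):
            for cand in (f - d, f + d):
                if 0 <= cand < num_frames and cand not in events:
                    negatives.add(cand)
    return sorted(negatives)
```

Starting at d = 2 leaves the frames immediately adjacent to an event untouched, which is consistent with tolerating small annotation jitter while still penalizing near-miss peaks.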
Training-loop improvements
5) Output directories fixed to avoid overwriting baseline2
- File: `exp/baseline3/train.py` (and, earlier in the session, the baseline3 eval defaults)
- Baseline3 outputs were adjusted to use baseline3-specific output directories so baseline2 artifacts aren't overwritten.
6) Loss logging parity with baseline2
- File: `exp/baseline3/train.py`
- Baseline2 uses unweighted BCE (`nn.BCELoss`). Baseline3 introduced an optional weighted BCE objective for imbalance experiments.
- A key issue was discovered: TensorBoard curves looked "worse" in baseline3 because it was logging weighted BCE as the main loss.
- Fix: `train/batch_loss` and `train/epoch_loss` are now unweighted BCE (baseline2-comparable). If weighting is enabled, the optimized objective is logged separately as `*_weighted`.
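A minimal sketch of this logging split (function and log-key handling are illustrative, not the exact baseline3 code): the unweighted BCE is always computed and logged, and the weighted loss only replaces the optimized objective when positive weighting is enabled.

```python
import torch
import torch.nn.functional as F

def bce_losses(pred: torch.Tensor, target: torch.Tensor, pos_weight: float = 0.0):
    """Always log unweighted BCE for baseline2-comparable curves; optimize
    a weighted BCE only when pos_weight > 0 (sketch, names hypothetical)."""
    plain = F.binary_cross_entropy(pred, target)
    logs = {"train/batch_loss": plain.item()}  # comparable across baselines
    objective = plain
    if pos_weight > 0.0:
        # Up-weight positive (event) frames in the optimized objective only.
        w = torch.where(target > 0.5,
                        torch.full_like(target, pos_weight),
                        torch.ones_like(target))
        objective = F.binary_cross_entropy(pred, target, weight=w)
        logs["train/batch_loss_weighted"] = objective.item()
    return objective, logs
```

With `pos_weight = 0.0` the weighted branch never runs, so the logged and optimized losses coincide with baseline2's plain BCE.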
7) Optional imbalance-aware objective (pos weighting)
- File: `exp/baseline3/train.py`
- Added an optional weighted BCE objective, controlled by `--pos-weight`.
- Default is `--pos-weight 0.0`, which matches baseline2 behavior.
8) Optional gradient clipping
- File: `exp/baseline3/train.py`
- Added `--grad-clip` support to stabilize training when experimenting.
- For baseline2-style mode, the default was set back to disabled (`--grad-clip 0.0`).
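The guarded clipping step likely looks like the following sketch (the helper name is hypothetical; `clip_grad_norm_` is the standard PyTorch utility):

```python
import torch
import torch.nn as nn

def training_step(model: nn.Module, optimizer: torch.optim.Optimizer,
                  loss: torch.Tensor, grad_clip: float = 0.0) -> None:
    """One optimization step; gradient-norm clipping runs only when
    grad_clip > 0, so 0.0 disables it (baseline2-style default)."""
    optimizer.zero_grad()
    loss.backward()
    if grad_clip > 0.0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), grad_clip)
    optimizer.step()
```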
9) Fast-iteration controls (optional)
- File: `exp/baseline3/train.py`
- Added optional caps for quicker experiments: `--max-train-tracks`, `--max-val-tracks`, `--max-train-steps`, `--max-val-steps`, `--max-steps-total`.
- These are intended only for debugging/iteration. Baseline2-style training leaves them unset (0/unlimited).
10) Back to baseline2-style default training mode
- File: `exp/baseline3/train.py`
- Returned baseline3 defaults to match baseline2 training mode:
  - `--epochs 3`, `--patience 5`
  - objective defaults to unweighted BCE when `--pos-weight 0.0`
  - no gradient clipping by default
Evaluation improvements
11) Mix-and-match beats and downbeats checkpoints
- File: `exp/baseline3/eval.py`
- Added support to evaluate using different model directories for beats vs downbeats: `--beats-model-dir` and `--downbeats-model-dir`.
- This enables workflows like “new beats run + keep downbeats fixed”.
Beats-specific notes
- All model/training/eval improvements above apply to beats.
- A key gotcha found during quick experiments: some runs only saved the checkpoint under a `final/` subfolder. When evaluating, pointing at the correct folder matters.
Latest mixed eval result (beats improved)
Eval command used:
- Beats: `outputs/baseline3_b2mode_full3/beats`
- Downbeats: `outputs/baseline3_smoketest/downbeats`
- Output: `outputs/eval_mix_b3_b2modebeats_smoketestdownbeats`
Key metrics (116 tracks):
- Mean Beat Weighted F1: 0.3531
- Beat continuity: CMLt 0.3567, AMLt 0.3607, CMLc 0.0603, AMLc 0.0624
Summary plot: `outputs/eval_mix_b3_b2modebeats_smoketestdownbeats/evaluation_summary.png`
Downbeats-specific notes
- Downbeats training uses the same dataset/indexing logic, model architecture, and preprocessing as beats.
- The improvements (temporal excitation, loss logging parity, optional hard negatives, optional fast-iteration, mixed-checkpoint evaluation) all apply identically.
- In the mixed eval above, downbeats were held fixed using the baseline3 smoketest checkpoint.
Repro commands
Full baseline2-style training (beats only)
```
uv run -m exp.baseline3.train --target beats --output-dir outputs/baseline3_b2mode_full3
```
Mixed evaluation (beats from a new run + downbeats from baseline3 smoketest)
```
uv run -m exp.baseline3.eval \
  --beats-model-dir outputs/baseline3_b2mode_full3/beats \
  --downbeats-model-dir outputs/baseline3_smoketest/downbeats \
  --output-dir outputs/eval_mix_b3_b2modebeats_smoketestdownbeats \
  --summary-plot
```
Known warnings
- You may see repeated torchaudio warnings like:
  - "At least one mel filterbank has all zero values…"
- This is produced by `torchaudio` mel filterbank construction for some parameter combinations and is not specific to baseline3.