Pitch Detection Algorithm Benchmark Report

Benchmark Methodology

Evaluation Setup

This benchmark evaluates pitch detection algorithms on multiple datasets with different characteristics, covering synthetic and real audio from the speech and music domains. Each algorithm is tested on noisy audio generated by mixing the clean recordings with CHiME background noise at signal-to-noise ratios of 10 to 30 dB, with voice gain varied from -6 to +6 dB.
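
The mixing step can be sketched as below. This is a minimal illustration rather than the benchmark's own code: the helper names `mean_power` and `mix_at_snr`, the power-based SNR definition, applying the voice gain before the noise is scaled, and looping the noise to match the signal length are all assumptions. The sine wave and Gaussian noise merely stand in for a clean recording and a CHiME noise clip.

```python
import math
import random

def mean_power(x):
    """Average signal power (mean of squared samples)."""
    return sum(v * v for v in x) / len(x)

def mix_at_snr(clean, noise, snr_db, voice_gain_db=0.0):
    """Hypothetical helper: apply a voice gain, then add noise at a target SNR."""
    gain = 10.0 ** (voice_gain_db / 20.0)  # dB to linear amplitude
    voice = [gain * v for v in clean]
    # Loop the noise so it covers the whole clean signal (assumed behaviour).
    looped = [noise[i % len(noise)] for i in range(len(voice))]
    # Choose the noise scale so that 10*log10(P_voice / P_noise) == snr_db.
    target_noise_power = mean_power(voice) / 10.0 ** (snr_db / 10.0)
    scale = math.sqrt(target_noise_power / mean_power(looped))
    scaled_noise = [scale * n for n in looped]
    mixed = [v + n for v, n in zip(voice, scaled_noise)]
    return mixed, voice, scaled_noise

random.seed(0)
sample_rate = 22050
# Stand-ins for a clean recording and a (shorter) noise clip.
clean = [0.5 * math.sin(2 * math.pi * 220.0 * n / sample_rate) for n in range(sample_rate)]
noise = [random.gauss(0.0, 1.0) for _ in range(sample_rate // 2)]

mixed, voice, scaled_noise = mix_at_snr(clean, noise, snr_db=10.0, voice_gain_db=-6.0)
achieved_snr_db = 10.0 * math.log10(mean_power(voice) / mean_power(scaled_noise))
print(f"achieved SNR: {achieved_snr_db:.2f} dB")
```

Measuring the SNR after the gain is applied keeps the two variations independent: the gain changes the overall level, while the SNR fixes the voice-to-noise ratio.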

Performance Metric Definition

The Overall Performance Rankings show the Harmonic Mean (HM) score as percentages, computed from six complementary components:

HM = 6 / (1/RPA + 1/CA + 1/P + 1/R + 1/OA + 1/GEA)

Where:

  • RPA (Raw Pitch Accuracy): Fraction of voiced frames within 50 cents of ground truth
  • CA (Cents Accuracy): exp(-mean_cents_error/500), penalizing larger deviations exponentially
  • P (Voicing Precision): TP/(TP+FP), fraction of predicted voiced frames that are truly voiced
  • R (Voicing Recall): TP/(TP+FN), fraction of truly voiced frames detected
  • OA (Octave Accuracy): exp(-10×octave_error_rate), robustness against octave errors
  • GEA (Gross Error Accuracy): exp(-5×gross_error_rate), penalizing deviations >200 cents
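
The formula and the three exponential transforms above can be written out directly. The sketch below is illustrative only: the function names are hypothetical, the input values are made up rather than taken from the benchmark, and returning 0 when any component is 0 is an assumed convention to avoid division by zero.

```python
import math

def cents_accuracy(mean_cents_error):
    """CA = exp(-mean_cents_error / 500)."""
    return math.exp(-mean_cents_error / 500.0)

def octave_accuracy(octave_error_rate):
    """OA = exp(-10 * octave_error_rate)."""
    return math.exp(-10.0 * octave_error_rate)

def gross_error_accuracy(gross_error_rate):
    """GEA = exp(-5 * gross_error_rate)."""
    return math.exp(-5.0 * gross_error_rate)

def harmonic_mean_score(rpa, ca, p, r, oa, gea):
    """HM = 6 / (1/RPA + 1/CA + 1/P + 1/R + 1/OA + 1/GEA)."""
    components = [rpa, ca, p, r, oa, gea]
    # Assumed convention: a zero component drives the whole score to zero.
    if any(c <= 0.0 for c in components):
        return 0.0
    return len(components) / sum(1.0 / c for c in components)

# Illustrative inputs only (not benchmark results).
hm = harmonic_mean_score(
    rpa=0.90,
    ca=cents_accuracy(30.0),
    p=0.90,
    r=0.85,
    oa=octave_accuracy(0.01),
    gea=gross_error_accuracy(0.02),
)
print(f"HM = {100.0 * hm:.1f}%")
```

Because the harmonic mean is dominated by its smallest term, an algorithm cannot compensate for poor voicing recall with excellent pitch accuracy, or vice versa.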

Speed Benchmark Details

CPU timing measurements are performed on 1-second audio signals at 22.05 kHz sample rate with 256-sample hop length. The reported CPU Time (ms) represents the average processing time per 1-second audio segment across multiple runs. Relative Speed shows performance relative to CREPE as the baseline algorithm.
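
A timing loop along these lines could look as follows. This is a sketch under stated assumptions, not the benchmark's harness: `time_per_second_of_audio`, `relative_speed`, and `dummy_estimator` are hypothetical names, the untimed warm-up run and the number of runs are arbitrary choices, and defining relative speed as baseline time divided by algorithm time (so values above 1 mean faster than the baseline) is an assumption about the direction of the ratio.

```python
import math
import time

SAMPLE_RATE = 22050   # Hz, as stated in the report
HOP_LENGTH = 256      # samples, as stated in the report

def dummy_estimator(signal):
    """Placeholder 'algorithm': counts upward zero crossings in each hop-sized frame."""
    frames = []
    for start in range(0, len(signal) - HOP_LENGTH + 1, HOP_LENGTH):
        frame = signal[start:start + HOP_LENGTH]
        frames.append(sum(1 for a, b in zip(frame, frame[1:]) if a < 0 <= b))
    return frames

def time_per_second_of_audio(estimator, n_runs=5):
    """Average wall-clock time in ms to process one 1-second signal."""
    signal = [math.sin(2 * math.pi * 220.0 * n / SAMPLE_RATE) for n in range(SAMPLE_RATE)]
    estimator(signal)  # warm-up run, not timed (assumed practice)
    start = time.perf_counter()
    for _ in range(n_runs):
        estimator(signal)
    return (time.perf_counter() - start) / n_runs * 1000.0

def relative_speed(baseline_ms, algorithm_ms):
    """Assumed definition: how many times faster than the baseline."""
    return baseline_ms / algorithm_ms

elapsed_ms = time_per_second_of_audio(dummy_estimator, n_runs=3)
print(f"{elapsed_ms:.2f} ms per 1-second segment")
```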

Optimal Threshold Analysis

The Optimal Threshold is the voicing confidence threshold that maximizes the Harmonic Mean score. For each algorithm, thresholds from 0.0 to 1.0 are tested in steps of 0.1 on each dataset, and the threshold yielding the highest score is selected. CV stands for Coefficient of Variation (std/mean) and measures consistency across datasets.
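
The threshold search amounts to a small grid search. In the sketch below, `sweep_thresholds` and `toy_score` are hypothetical names; `toy_score` stands in for "run the algorithm at this threshold and compute HM", and resolving ties in favour of the lowest threshold is an assumption.

```python
def sweep_thresholds(score_at_threshold):
    """Return (best_threshold, best_score) over 0.0, 0.1, ..., 1.0."""
    # Build the grid from integers to avoid accumulating floating-point error.
    thresholds = [round(0.1 * i, 1) for i in range(11)]
    # max() keeps the first maximum, so ties go to the lowest threshold.
    best = max(thresholds, key=score_at_threshold)
    return best, score_at_threshold(best)

def toy_score(threshold):
    """Stand-in for the HM score obtained at a given voicing threshold."""
    return 1.0 - abs(threshold - 0.7)

best_threshold, best_score = sweep_thresholds(toy_score)
print(best_threshold, best_score)
```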

Dataset Descriptions

The benchmark evaluates algorithms across diverse datasets covering speech, music, synthetic, and real-world conditions:

| Dataset | Domain | Type | Description |
|---|---|---|---|
| NSynth | Music | Synthetic | Single-note synthetic audio from musical instruments with accurate pitch labels. Lacks the temporal and spectral complexity of real-world environments. |
| PTDB | Speech | Real | Speech recordings with laryngograph signals capturing vocal fold vibrations. Ground truth is derived from high-pass filtered laryngograph signals processed with the RAPT algorithm. |
| PTDBNoisy | Speech | Real | Subset of 347 PTDB files (7.4%) with noticeable noise that were excluded from the main evaluation. |
| MIR1K | Music | Real | Vocal excerpts with pitch contours initially extracted algorithmically (e.g., YIN) and then manually corrected. Labels still reflect some algorithmic biases. |
| MDBStemSynth | Music | Synthetic | Musically structured synthetic audio with accurate pitch annotations. Valuable for controlled evaluation but lacks real-world acoustic variability. |
| Vocadito | Music | Real | Solo vocal recordings with pitch annotations derived from the pYIN algorithm and refined through manual verification. |
| Bach10Synth | Music | Synthetic | High-quality pitch labels for synthesized musical performances. Similar to MDBStemSynth but focused on Bach compositions. |
| SpeechSynth | Speech | Synthetic | Synthetic Mandarin speech generated with the LightSpeech TTS model, trained on 97.48 hours from the AISHELL-3 and Biaobei datasets. Provides exact pitch ground truth. |

Key Characteristics:

  • Synthetic datasets provide perfect ground truth but may lack real-world complexity
  • Real datasets capture natural acoustic variations but have imperfect ground truth annotations
  • Speech datasets focus on vocal pitch tracking challenges
  • Music datasets encompass instrumental and vocal music scenarios
  • SpeechSynth addresses the gap of lacking synthetic speech data with accurate pitch labels

Overall Performance Rankings

| Algorithm | Bach10Synth | MIR1K | PTDB | PTDBNoisy | SpeechSynth | Vocadito | Average |
|---|---|---|---|---|---|---|---|
| SwiftF0 | 98.0% | 94.9% | 91.2% | 75.7% | 90.7% | 95.0% | 90.9% |
| RMVPE | 98.4% | 96.0% | 86.1% | 66.2% | 90.5% | 97.1% | 89.0% |
| DJCM | 95.8% | 94.4% | 86.3% | 73.3% | 89.0% | 94.9% | 89.0% |

No speed benchmark results found.

Detailed Performance Analysis

Voicing Detection Performance

Measures how well algorithms distinguish between voiced (pitched) and unvoiced (unpitched) audio segments.

| Algorithm | Precision ↑ | Recall ↑ | F1-Score ↑ |
|---|---|---|---|
| DJCM | 0.911 | 0.857 | 0.882 |
| RMVPE | 0.886 | 0.826 | 0.854 |
| SwiftF0 | 0.891 | 0.889 | 0.888 |
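
For reference, the three voicing metrics can be computed from per-frame voiced/unvoiced decisions as sketched below. The function name `voicing_metrics` and the toy frame labels are illustrative, F1 is taken to be the usual harmonic mean of precision and recall, and returning 0 for an empty denominator is an assumed convention.

```python
def voicing_metrics(ref_voiced, est_voiced):
    """Precision, recall and F1 from per-frame boolean voicing decisions."""
    pairs = list(zip(ref_voiced, est_voiced))
    tp = sum(1 for r, e in pairs if r and e)        # voiced, predicted voiced
    fp = sum(1 for r, e in pairs if not r and e)    # unvoiced, predicted voiced
    fn = sum(1 for r, e in pairs if r and not e)    # voiced, predicted unvoiced
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2.0 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Toy example: 6 frames, one false alarm and one missed voiced frame.
ref = [True, True, True, False, False, True]
est = [True, True, False, True, False, True]
precision, recall, f1 = voicing_metrics(ref, est)
print(precision, recall, f1)
```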

Pitch Accuracy Metrics

Detailed pitch estimation accuracy across different error types and magnitudes.

| Algorithm | RPA ↑ | RCA ↑ | Cents Error ↓ | RMSE (Hz) ↓ | Octave Error ↓ | Gross Error ↓ |
|---|---|---|---|---|---|---|
| DJCM | 0.891 | 0.897 | 40.4 | 25.3 | 0.015 | 0.022 |
| RMVPE | 0.888 | 0.892 | 35.8 | 11.6 | 0.012 | 0.015 |
| SwiftF0 | 0.912 | 0.916 | 32.4 | 11.9 | 0.011 | 0.014 |

Additional Metric Definitions:

  • RCA (Raw Chroma Accuracy): Fraction with correct pitch class (note name), ignoring octave
  • Cents Error: Mean absolute pitch deviation in cents (raw error, before exponential transform used in CA)
  • RMSE: Root Mean Square Error in Hz
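
A minimal sketch of how these frame-level quantities relate is given below. It assumes the inputs contain only frames that are voiced in the reference and have a positive estimate, and it reuses the 50-cent tolerance for RCA after folding the error to the nearest octave. Both are assumptions, and `pitch_accuracy` is a hypothetical helper rather than the benchmark's implementation.

```python
import math

def cents_between(est_hz, ref_hz):
    """Signed pitch deviation in cents (1200 cents per octave)."""
    return 1200.0 * math.log2(est_hz / ref_hz)

def pitch_accuracy(ref_hz, est_hz, tolerance_cents=50.0):
    """Return (RPA, RCA, mean absolute cents error, RMSE in Hz)."""
    errors = [cents_between(e, r) for e, r in zip(est_hz, ref_hz)]
    n = len(errors)
    rpa = sum(1 for c in errors if abs(c) <= tolerance_cents) / n
    # Fold each error to the nearest octave so octave mistakes are ignored.
    folded = [c - 1200.0 * round(c / 1200.0) for c in errors]
    rca = sum(1 for c in folded if abs(c) <= tolerance_cents) / n
    mean_abs_cents = sum(abs(c) for c in errors) / n
    rmse_hz = math.sqrt(sum((e - r) ** 2 for e, r in zip(est_hz, ref_hz)) / n)
    return rpa, rca, mean_abs_cents, rmse_hz

# Toy frames: accurate, one octave too high, accurate, roughly a major third off.
ref = [220.0, 220.0, 440.0, 330.0]
est = [221.0, 440.0, 438.0, 400.0]
rpa, rca, mean_abs_cents, rmse_hz = pitch_accuracy(ref, est)
print(rpa, rca)
```

The octave-doubled frame counts against RPA but not RCA, which is why RCA is never lower than RPA in the table above.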

Pitch Contour Smoothness

Measures the temporal stability and continuity of pitch tracks.

| Algorithm | Relative Smoothness ↓ | Continuity Breaks ↓ | Overall Smoothness Rank ↓ |
|---|---|---|---|
| SwiftF0 | 1.425 | 0.720 | 1.5 |
| RMVPE | 1.253 | 0.868 | 2.0 |
| DJCM | 3.976 | 0.771 | 2.5 |

Metric Definitions:

  • Relative Smoothness: Coefficient of variation of consecutive pitch changes (std/mean of relative frame-to-frame changes)
  • Continuity Breaks: Fraction of ground-truth voiced segments where predicted voicing has gaps
  • Overall Smoothness Rank: Average rank across both smoothness metrics (1=best, lower is better)
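
The Overall Smoothness Rank column can be reproduced from the two metric columns of the table above. The sketch below uses a hypothetical helper, `average_ranks`, and assumes simple ordinal ranking with no special handling of ties.

```python
def average_ranks(metrics):
    """Average per-metric rank for each algorithm (lower metric value = better rank)."""
    algorithms = list(next(iter(metrics.values())))
    totals = {name: 0.0 for name in algorithms}
    for values in metrics.values():
        # Rank 1 goes to the smallest value of this metric.
        ordered = sorted(algorithms, key=lambda name: values[name])
        for rank, name in enumerate(ordered, start=1):
            totals[name] += rank
    return {name: totals[name] / len(metrics) for name in algorithms}

# Values copied from the smoothness table.
smoothness = {
    "relative_smoothness": {"SwiftF0": 1.425, "RMVPE": 1.253, "DJCM": 3.976},
    "continuity_breaks": {"SwiftF0": 0.720, "RMVPE": 0.868, "DJCM": 0.771},
}
ranks = average_ranks(smoothness)
print(ranks)  # {'SwiftF0': 1.5, 'RMVPE': 2.0, 'DJCM': 2.5}
```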

Optimal Threshold Analysis

Voicing confidence thresholds that maximize overall performance scores.

| Algorithm | Mean Threshold | Std Dev ↓ | Range |
|---|---|---|---|
| DJCM | 0.517 | 0.069 | 0.40-0.60 |
| RMVPE | 0.683 | 0.069 | 0.60-0.80 |
| SwiftF0 | 0.900 | 0.000 | 0.90-0.90 |

Algorithm Consistency

Measures performance stability across different datasets using Coefficient of Variation (CV = std/mean).

| Algorithm | Performance CV ↓ | Threshold CV ↓ |
|---|---|---|
| DJCM | 0.088 | 0.133 |
| RMVPE | 0.124 | 0.101 |
| SwiftF0 | 0.080 | 0.000 |
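
As a worked check, the Performance CV values can be recomputed from the rounded per-dataset scores in the Overall Performance Rankings. The figures match the table when the population standard deviation (`statistics.pstdev`) is used. Whether the benchmark computes it exactly this way is an assumption, and `coefficient_of_variation` is a hypothetical helper name.

```python
from statistics import mean, pstdev

def coefficient_of_variation(values):
    """CV = std / mean, using the population standard deviation."""
    return pstdev(values) / mean(values)

# Per-dataset HM scores (%) copied from the Overall Performance Rankings:
# Bach10Synth, MIR1K, PTDB, PTDBNoisy, SpeechSynth, Vocadito.
scores = {
    "DJCM": [95.8, 94.4, 86.3, 73.3, 89.0, 94.9],
    "RMVPE": [98.4, 96.0, 86.1, 66.2, 90.5, 97.1],
    "SwiftF0": [98.0, 94.9, 91.2, 75.7, 90.7, 95.0],
}
performance_cv = {name: coefficient_of_variation(v) for name, v in scores.items()}
for name, cv in performance_cv.items():
    print(f"{name}: {cv:.3f}")
```

The Threshold CV column follows the same definition: for DJCM, 0.069 / 0.517 ≈ 0.133.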

Performance by Dataset Subsets

By Origin

  • Synthetic: Bach10Synth, MDBStemSynth, SpeechSynth, NSynth
  • Real: MIR1K, PTDB, PTDBNoisy, Vocadito

| Algorithm | Synthetic | Real |
|---|---|---|
| DJCM | 92.4% | 87.2% |
| RMVPE | 94.4% | 86.3% |
| SwiftF0 | 94.4% | 89.2% |

By Domain

  • Speech: PTDB, PTDBNoisy, SpeechSynth
  • Music: Bach10Synth, MDBStemSynth, NSynth, Vocadito, MIR1K

| Algorithm | Speech | Music |
|---|---|---|
| DJCM | 82.9% | 95.0% |
| RMVPE | 80.9% | 97.1% |
| SwiftF0 | 85.9% | 96.0% |

By Cross-Dimension

  • Synthetic + Speech: SpeechSynth
  • Synthetic + Music: Bach10Synth, MDBStemSynth, NSynth
  • Real + Speech: PTDB, PTDBNoisy
  • Real + Music: Vocadito, MIR1K

| Algorithm | Synthetic + Speech | Synthetic + Music | Real + Speech | Real + Music |
|---|---|---|---|---|
| DJCM | 89.0% | 95.8% | 79.8% | 94.7% |
| RMVPE | 90.5% | 98.4% | 76.1% | 96.5% |
| SwiftF0 | 90.7% | 98.0% | 83.5% | 95.0% |
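
The subset figures appear consistent with an unweighted mean over the datasets that have a score in the Overall Performance Rankings (NSynth and MDBStemSynth have no column there, so they are skipped). The sketch below illustrates that reading for the By Origin and By Domain groupings. It is an assumption about how the aggregation works, not a description of the benchmark's code, and `subset_average` is a hypothetical helper.

```python
from statistics import mean

# Per-dataset HM scores (%) copied from the Overall Performance Rankings.
SCORES = {
    "DJCM": {"Bach10Synth": 95.8, "MIR1K": 94.4, "PTDB": 86.3,
             "PTDBNoisy": 73.3, "SpeechSynth": 89.0, "Vocadito": 94.9},
    "RMVPE": {"Bach10Synth": 98.4, "MIR1K": 96.0, "PTDB": 86.1,
              "PTDBNoisy": 66.2, "SpeechSynth": 90.5, "Vocadito": 97.1},
    "SwiftF0": {"Bach10Synth": 98.0, "MIR1K": 94.9, "PTDB": 91.2,
                "PTDBNoisy": 75.7, "SpeechSynth": 90.7, "Vocadito": 95.0},
}

SUBSETS = {
    "Synthetic": ["Bach10Synth", "MDBStemSynth", "SpeechSynth", "NSynth"],
    "Real": ["MIR1K", "PTDB", "PTDBNoisy", "Vocadito"],
    "Speech": ["PTDB", "PTDBNoisy", "SpeechSynth"],
    "Music": ["Bach10Synth", "MDBStemSynth", "NSynth", "Vocadito", "MIR1K"],
}

def subset_average(scores, datasets):
    """Unweighted mean over the subset's datasets that actually have a score."""
    return mean(scores[d] for d in datasets if d in scores)

averages = {
    algo: {name: subset_average(per_dataset, members) for name, members in SUBSETS.items()}
    for algo, per_dataset in SCORES.items()
}
for algo, row in averages.items():
    print(algo, {name: f"{value:.2f}" for name, value in row.items()})
```

Small differences in the last digit relative to the tables are expected, since the inputs here are already rounded to one decimal place.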