
Pitch Detection Algorithm Benchmark Report

Benchmark Methodology

Evaluation Setup

This benchmark evaluates pitch detection algorithms on multiple datasets with different characteristics, covering synthetic and real audio from the speech and music domains. Each algorithm is tested on noisy audio generated by mixing the clean datasets with CHiME background noise at signal-to-noise ratios of 10 to 30 dB, with the voice gain varied between -6 and +6 dB.
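The mixing step can be sketched as follows. This is a minimal illustration, not the benchmark's actual code: the function names, the uniform sampling of SNR and gain, the Gaussian noise standing in for CHiME recordings, and the choice to measure SNR after the voice gain is applied are all assumptions. Plain Python lists keep the sketch dependency-free.

```python
import math
import random


def power(signal):
    """Mean squared amplitude of a signal."""
    return sum(v * v for v in signal) / len(signal)


def mix_at_snr(voice, noise, snr_db, gain_db=0.0):
    """Apply a gain to the voice, then add noise scaled to the target SNR.

    Assumption: the SNR is defined relative to the voice after the gain.
    """
    gain = 10.0 ** (gain_db / 20.0)
    voice = [gain * v for v in voice]
    # Choose k so that 10*log10(P_voice / (k^2 * P_noise)) equals snr_db.
    k = math.sqrt(power(voice) / (power(noise) * 10.0 ** (snr_db / 10.0)))
    scaled_noise = [k * n for n in noise]
    mixture = [v + n for v, n in zip(voice, scaled_noise)]
    return mixture, voice, scaled_noise


rng = random.Random(0)
sr = 22050
clean = [math.sin(2 * math.pi * 220.0 * n / sr) for n in range(sr)]  # 1 s tone
background = [rng.gauss(0.0, 1.0) for _ in range(sr)]  # stand-in for CHiME noise
snr_db = rng.uniform(10.0, 30.0)   # assumed uniform over the stated range
gain_db = rng.uniform(-6.0, 6.0)   # assumed uniform over the stated range

mixture, voice, scaled_noise = mix_at_snr(clean, background, snr_db, gain_db)
achieved_snr_db = 10.0 * math.log10(power(voice) / power(scaled_noise))
```

Comparing `achieved_snr_db` against `snr_db` is a quick sanity check that the noise scaling is correct.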

Performance Metric Definition

The Overall Performance Rankings report the Harmonic Mean (HM) score as a percentage. HM is computed from six complementary components:

HM = 6 / (1/RPA + 1/CA + 1/P + 1/R + 1/OA + 1/GEA)

Where:

RPA (Raw Pitch Accuracy): Fraction of voiced frames within 50 cents of ground truth
CA (Cents Accuracy): exp(-mean_cents_error/500), penalizing larger deviations exponentially
P (Voicing Precision): TP/(TP+FP), fraction of predicted voiced frames that are truly voiced
R (Voicing Recall): TP/(TP+FN), fraction of truly voiced frames detected
OA (Octave Accuracy): exp(-10×octave_error_rate), robustness against octave errors
GEA (Gross Error Accuracy): exp(-5×gross_error_rate), penalizing deviations >200 cents
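As a worked example of the formula and component definitions above, the sketch below computes HM from raw error statistics. The function name and the zero-component guard are assumptions. The inputs are the aggregate DJCM values from the detailed tables in this report, used only for illustration; the report most likely computes HM per dataset, so this single number is not claimed to match any table entry.

```python
import math


def harmonic_mean_score(rpa, mean_cents_error, precision, recall,
                        octave_error_rate, gross_error_rate):
    """Combine the six components into HM, following the definitions above."""
    ca = math.exp(-mean_cents_error / 500.0)      # Cents Accuracy
    oa = math.exp(-10.0 * octave_error_rate)      # Octave Accuracy
    gea = math.exp(-5.0 * gross_error_rate)       # Gross Error Accuracy
    components = [rpa, ca, precision, recall, oa, gea]
    if min(components) <= 0.0:
        return 0.0  # assumption: any zero component yields a zero score
    return len(components) / sum(1.0 / c for c in components)


# Illustrative inputs: aggregate DJCM values from the detailed tables.
hm = harmonic_mean_score(
    rpa=0.891,
    mean_cents_error=40.4,
    precision=0.911,
    recall=0.857,
    octave_error_rate=0.015,
    gross_error_rate=0.022,
)
print(f"HM = {hm:.3f}")
```

Because the harmonic mean is dominated by its smallest term, a weak score on any single component pulls HM down more than an arithmetic mean would.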

Speed Benchmark Details

CPU timing measurements are performed on 1-second audio signals at 22.05 kHz sample rate with 256-sample hop length. The reported CPU Time (ms) represents the average processing time per 1-second audio segment across multiple runs. Relative Speed shows performance relative to CREPE as the baseline algorithm.
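A timing loop along these lines would produce such a measurement. It is a hedged sketch: `dummy_estimator` is a hypothetical placeholder for a real pitch tracker, the run and warm-up counts are assumptions, and `time.perf_counter` measures wall-clock time (`time.process_time` would be the choice for strict CPU time).

```python
import math
import time

SAMPLE_RATE = 22050  # Hz, as stated above
HOP_LENGTH = 256     # samples, as stated above


def dummy_estimator(audio, hop_length=HOP_LENGTH):
    """Hypothetical placeholder: one value (frame energy) per hop."""
    return [
        sum(v * v for v in audio[i:i + hop_length])
        for i in range(0, len(audio), hop_length)
    ]


def mean_time_ms(estimator, audio, runs=5, warmup=1):
    """Average processing time in milliseconds over several runs."""
    for _ in range(warmup):  # discard warm-up runs (caches, lazy init)
        estimator(audio)
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        estimator(audio)
        timings.append((time.perf_counter() - start) * 1000.0)
    return sum(timings) / len(timings)


# One second of audio at the benchmark sample rate.
audio = [math.sin(2 * math.pi * 220.0 * n / SAMPLE_RATE)
         for n in range(SAMPLE_RATE)]
cpu_ms = mean_time_ms(dummy_estimator, audio)
n_frames = len(dummy_estimator(audio))
```

With non-overlapping, non-padded framing as in this placeholder, one second of audio yields 87 frames; real estimators may pad or center frames differently.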

Optimal Threshold Analysis

The Optimal Threshold is the voicing confidence threshold that maximizes the Harmonic Mean score. For each algorithm and dataset, thresholds from 0.0 to 1.0 in steps of 0.1 are evaluated and the one yielding the highest HM score is selected. CV stands for Coefficient of Variation (std/mean) and measures consistency across datasets.
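The threshold search amounts to a small grid sweep. In the sketch below, frame-level voicing accuracy stands in for the HM score to keep the example short, and ties are resolved in favor of the lowest threshold; both choices, along with all names and the toy data, are assumptions.

```python
def best_threshold(score_at):
    """Sweep thresholds 0.0, 0.1, ..., 1.0 and return the best (threshold, score)."""
    thresholds = [round(0.1 * i, 1) for i in range(11)]
    scores = [score_at(t) for t in thresholds]
    # max() returns the first maximum, so ties go to the lowest threshold.
    best = max(range(len(thresholds)), key=scores.__getitem__)
    return thresholds[best], scores[best]


# Toy data: per-frame voicing confidence and ground-truth voicing.
confidence = [0.1, 0.4, 0.6, 0.9]
truly_voiced = [False, False, True, True]


def voicing_accuracy(threshold):
    """Stand-in score: fraction of frames whose voicing decision is correct."""
    predicted = [c >= threshold for c in confidence]
    correct = sum(p == t for p, t in zip(predicted, truly_voiced))
    return correct / len(truly_voiced)


threshold, score = best_threshold(voicing_accuracy)
```

In the benchmark setting, `score_at` would compute the full HM score for predictions gated at the given threshold.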

Dataset Descriptions

The benchmark evaluates algorithms across diverse datasets covering speech, music, synthetic, and real-world conditions:

Dataset	Domain	Type	Description
NSynth	Music	Synthetic	Single-note synthetic audio from musical instruments with accurate pitch labels. Lacks temporal/spectral complexity of real-world environments.
PTDB	Speech	Real	Speech recordings with laryngograph signals capturing vocal fold vibrations. Ground truth derived from high-pass filtered laryngograph signals processed with the RAPT algorithm.
PTDBNoisy	Speech	Real	Subset of 347 PTDB files (7.4%) with noticeable noise, excluded from the main PTDB evaluation and scored separately.
MIR1K	Music	Real	Vocal excerpts with pitch contours initially extracted algorithmically (e.g., YIN) followed by manual correction. Labels still reflect some algorithmic biases.
MDBStemSynth	Music	Synthetic	Musically structured synthetic audio with accurate pitch annotations. Valuable for controlled evaluation but lacks real-world acoustic variability.
Vocadito	Music	Real	Solo vocal recordings with pitch annotations derived from the pYIN algorithm and refined through manual verification.
Bach10Synth	Music	Synthetic	High-quality pitch labels for synthesized musical performances. Similar to MDBStemSynth but focused on Bach compositions.
SpeechSynth	Speech	Synthetic	Synthetic Mandarin speech generated using a LightSpeech TTS model trained on 97.48 hours of data from the AISHELL-3 and Biaobei datasets, providing exact pitch ground truth.

Key Characteristics:

Synthetic datasets provide perfect ground truth but may lack real-world complexity
Real datasets capture natural acoustic variations but have imperfect ground truth annotations
Speech datasets focus on vocal pitch tracking challenges
Music datasets encompass instrumental and vocal music scenarios
SpeechSynth fills the gap left by the lack of synthetic speech data with accurate pitch labels

Overall Performance Rankings

Algorithm	Bach10Synth	MIR1K	PTDB	PTDBNoisy	SpeechSynth	Vocadito	Average
SwiftF0	98.0%	94.9%	91.2%	75.7%	90.7%	95.0%	90.9%
RMVPE	98.4%	96.0%	86.1%	66.2%	90.5%	97.1%	89.0%
DJCM	95.8%	94.4%	86.3%	73.3%	89.0%	94.9%	89.0%

No speed benchmark results were found for this run, so CPU Time and Relative Speed are not reported.

Detailed Performance Analysis

Voicing Detection Performance

Measures how well algorithms distinguish between voiced (pitched) and unvoiced (unpitched) audio segments.

Algorithm	Precision ↑	Recall ↑	F1-Score ↑
DJCM	0.911	0.857	0.882
RMVPE	0.886	0.826	0.854
SwiftF0	0.891	0.889	0.888
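For reference, the voicing metrics can be computed from per-frame voicing decisions as sketched below. The function name and toy data are assumptions, and F1 = 2PR/(P+R) is the standard definition rather than something the report spells out. Applying that formula to the table's Precision and Recall gives values slightly different from the F1-Score column (about 0.883 versus 0.882 for DJCM), which would be consistent with F1 being averaged per dataset, though the report does not say so.

```python
def voicing_scores(true_voiced, pred_voiced):
    """Precision, recall and F1 for per-frame voicing decisions."""
    pairs = list(zip(true_voiced, pred_voiced))
    tp = sum(t and p for t, p in pairs)          # voiced, predicted voiced
    fp = sum((not t) and p for t, p in pairs)    # unvoiced, predicted voiced
    fn = sum(t and (not p) for t, p in pairs)    # voiced, predicted unvoiced
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Toy frames: ground-truth voicing and thresholded predictions.
truth = [True, True, True, False, False, True]
predicted = [True, False, True, True, False, True]
precision, recall, f1 = voicing_scores(truth, predicted)
```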

Pitch Accuracy Metrics

Detailed pitch estimation accuracy across different error types and magnitudes.

Algorithm	RPA ↑	RCA ↑	Cents Error ↓	RMSE (Hz) ↓	Octave Error ↓	Gross Error ↓
DJCM	0.891	0.897	40.4	25.3	0.015	0.022
RMVPE	0.888	0.892	35.8	11.6	0.012	0.015
SwiftF0	0.912	0.916	32.4	11.9	0.011	0.014

Additional Metric Definitions:

RCA (Raw Chroma Accuracy): Fraction of voiced frames with the correct pitch class (note name), ignoring octave
Cents Error: Mean absolute pitch deviation in cents (raw error, before exponential transform used in CA)
RMSE: Root Mean Square Error in Hz
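The pitch metrics can be sketched from paired frequency estimates as follows. This assumes every frame is voiced in both the reference and the estimate, treats the 50-cent tolerance as inclusive, and uses the common chroma-folding definition for RCA; the function names and toy values are assumptions. Octave error is omitted because the report does not define it precisely.

```python
import math


def cents(f_est, f_ref):
    """Signed pitch deviation in cents (1200 cents per octave)."""
    return 1200.0 * math.log2(f_est / f_ref)


def pitch_metrics(ref_hz, est_hz, tol=50.0, gross_tol=200.0):
    """RPA, RCA, mean cents error, RMSE (Hz) and gross error rate."""
    errors = [cents(e, r) for r, e in zip(ref_hz, est_hz)]
    n = len(errors)
    # RCA folds the error onto the nearest octave before thresholding.
    folded = [c - 1200.0 * round(c / 1200.0) for c in errors]
    return {
        "rpa": sum(abs(c) <= tol for c in errors) / n,
        "rca": sum(abs(c) <= tol for c in folded) / n,
        "cents_error": sum(abs(c) for c in errors) / n,
        "rmse_hz": math.sqrt(
            sum((e - r) ** 2 for r, e in zip(ref_hz, est_hz)) / n),
        "gross_error": sum(abs(c) > gross_tol for c in errors) / n,
    }


# Toy frames: exact, octave-up, slightly sharp, and far-off estimates.
reference = [220.0, 220.0, 220.0, 220.0]
estimate = [220.0, 440.0, 225.0, 300.0]
metrics = pitch_metrics(reference, estimate)
```

The octave-up frame counts as correct for RCA but not for RPA, which is why RCA is never lower than RPA.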

Pitch Contour Smoothness

Measures the temporal stability and continuity of pitch tracks.

Algorithm	Relative Smoothness ↓	Continuity Breaks ↓	Overall Smoothness Rank ↓
SwiftF0	1.425	0.720	1.5
RMVPE	1.253	0.868	2.0
DJCM	3.976	0.771	2.5

Metric Definitions:

Relative Smoothness: Coefficient of variation of consecutive pitch changes (std/mean of relative frame-to-frame changes)
Continuity Breaks: Fraction of ground-truth voiced segments where predicted voicing has gaps
Overall Smoothness Rank: Average rank across both smoothness metrics (1=best, lower is better)
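The Overall Smoothness Rank column can be reproduced from the other two columns of the table above. The sketch below ranks each metric (1 = lowest value) and averages the two ranks; the helper name is an assumption, and ties are not handled because none occur in these values.

```python
relative_smoothness = {"SwiftF0": 1.425, "RMVPE": 1.253, "DJCM": 3.976}
continuity_breaks = {"SwiftF0": 0.720, "RMVPE": 0.868, "DJCM": 0.771}


def rank_ascending(values):
    """Rank 1 for the smallest value; assumes no ties."""
    return {
        name: 1 + sum(other < value for other in values.values())
        for name, value in values.items()
    }


smoothness_rank = rank_ascending(relative_smoothness)
breaks_rank = rank_ascending(continuity_breaks)
overall_rank = {
    name: (smoothness_rank[name] + breaks_rank[name]) / 2
    for name in relative_smoothness
}
```

Averaging ranks weights both metrics equally regardless of scale, so DJCM's much larger Relative Smoothness value costs it only one rank position.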

Optimal Threshold Analysis

Voicing confidence thresholds that maximize overall performance scores.

Algorithm	Mean Threshold	Std Dev ↓	Range
DJCM	0.517	0.069	0.40-0.60
RMVPE	0.683	0.069	0.60-0.80
SwiftF0	0.900	0.000	0.90-0.90

Algorithm Consistency

Measures performance stability across different datasets using Coefficient of Variation (CV = std/mean).

Algorithm	Performance CV ↓	Threshold CV ↓
DJCM	0.088	0.133
RMVPE	0.124	0.101
SwiftF0	0.080	0.000
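The Performance CV values can be approximated from the per-dataset HM scores in the Overall Performance Rankings table. The sketch below uses the population standard deviation, which is an assumption (the report does not state which estimator it uses); with the rounded table values as input, it comes out close to the reported column.

```python
from statistics import mean, pstdev

# Per-dataset HM scores (%) copied from the Overall Performance Rankings table,
# in the order Bach10Synth, MIR1K, PTDB, PTDBNoisy, SpeechSynth, Vocadito.
hm_scores = {
    "SwiftF0": [98.0, 94.9, 91.2, 75.7, 90.7, 95.0],
    "RMVPE": [98.4, 96.0, 86.1, 66.2, 90.5, 97.1],
    "DJCM": [95.8, 94.4, 86.3, 73.3, 89.0, 94.9],
}


def coefficient_of_variation(values):
    """CV = std / mean, using the population standard deviation (assumed)."""
    return pstdev(values) / mean(values)


performance_cv = {
    name: coefficient_of_variation(scores)
    for name, scores in hm_scores.items()
}
for name, cv in performance_cv.items():
    print(f"{name}: CV = {cv:.3f}")
```

The same function applied to per-dataset optimal thresholds would give the Threshold CV column, which is why SwiftF0, with an identical threshold on every dataset, scores 0.000 there.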

Performance by Dataset Subsets

By Origin

Synthetic: Bach10Synth, MDBStemSynth, SpeechSynth, NSynth
Real: MIR1K, PTDB, PTDBNoisy, Vocadito

Algorithm	Synthetic	Real
DJCM	92.4%	87.2%
RMVPE	94.4%	86.3%
SwiftF0	94.4%	89.2%

By Domain

Speech: PTDB, PTDBNoisy, SpeechSynth
Music: Bach10Synth, MDBStemSynth, NSynth, Vocadito, MIR1K

Algorithm	Speech	Music
DJCM	82.9%	95.0%
RMVPE	80.9%	97.1%
SwiftF0	85.9%	96.0%

By Cross-Dimension

Synthetic + Speech: SpeechSynth
Synthetic + Music: Bach10Synth, MDBStemSynth, NSynth
Real + Speech: PTDB, PTDBNoisy
Real + Music: Vocadito, MIR1K

Algorithm	Synthetic + Speech	Synthetic + Music	Real + Speech	Real + Music
DJCM	89.0%	95.8%	79.8%	94.7%
RMVPE	90.5%	98.4%	76.1%	96.5%
SwiftF0	90.7%	98.0%	83.5%	95.0%
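The subset scores appear consistent with an unweighted mean of the per-dataset HM scores in the Overall Performance Rankings table, taken over the datasets in each group that have a column there (MDBStemSynth and NSynth have none). This is an inference from the rounded table values, not a documented procedure, and the sketch below only checks that the two agree to within rounding.

```python
from statistics import mean

# Per-dataset HM scores (%) from the Overall Performance Rankings table.
hm = {
    "DJCM": {"Bach10Synth": 95.8, "MIR1K": 94.4, "PTDB": 86.3,
             "PTDBNoisy": 73.3, "SpeechSynth": 89.0, "Vocadito": 94.9},
    "RMVPE": {"Bach10Synth": 98.4, "MIR1K": 96.0, "PTDB": 86.1,
              "PTDBNoisy": 66.2, "SpeechSynth": 90.5, "Vocadito": 97.1},
    "SwiftF0": {"Bach10Synth": 98.0, "MIR1K": 94.9, "PTDB": 91.2,
                "PTDBNoisy": 75.7, "SpeechSynth": 90.7, "Vocadito": 95.0},
}

# Group membership as listed in this section.
groups = {
    "Synthetic": ["Bach10Synth", "MDBStemSynth", "SpeechSynth", "NSynth"],
    "Real": ["MIR1K", "PTDB", "PTDBNoisy", "Vocadito"],
    "Speech": ["PTDB", "PTDBNoisy", "SpeechSynth"],
    "Music": ["Bach10Synth", "MDBStemSynth", "NSynth", "Vocadito", "MIR1K"],
}

# Values reported in the By Origin and By Domain tables.
reported = {
    "DJCM": {"Synthetic": 92.4, "Real": 87.2, "Speech": 82.9, "Music": 95.0},
    "RMVPE": {"Synthetic": 94.4, "Real": 86.3, "Speech": 80.9, "Music": 97.1},
    "SwiftF0": {"Synthetic": 94.4, "Real": 89.2, "Speech": 85.9, "Music": 96.0},
}


def subset_mean(scores, members):
    """Unweighted mean over the group members that have a score."""
    return mean(scores[d] for d in members if d in scores)


# Largest gap between recomputed and reported subset scores.
max_gap = max(
    abs(subset_mean(hm[algo], groups[g]) - reported[algo][g])
    for algo in hm
    for g in groups
)
```

Small gaps are expected because the inputs are already rounded to one decimal place.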