This model uses the original RMVPE architecture from yxlllc. It was not trained on the HPA RMVPE dataset; instead, it was trained on a different dataset that is richer in contextual information and contains a larger share of M4Singer data. Training was conducted on a hybrid dataset combining music and speech: M4Singer (≈40%), Batch10Synth, Vocadito, MIR1K, MDBStemSynth, and PTDB-TUG. The model is currently trained to 88,000 steps and may be trained further if I have more time. According to Vidalnt, when these models are trained beyond 100,000 steps, they begin to learn and amplify background noise as well.
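For reference, here is a minimal sketch of how a checkpoint like this one might be loaded for F0 extraction with yxlllc's RMVPE code. The `RMVPE` wrapper class, its `infer_from_audio` method, the import path, and the file names `model.pt` / `vocals.wav` are assumptions; argument names and defaults differ between forks of the repository, so adjust to the version you are using.

```python
# Minimal F0-extraction sketch (assumptions: wrapper class name, method
# signature, and file names follow yxlllc's reference implementation and
# are illustrative, not guaranteed).
import librosa
import torch

from rmvpe import RMVPE  # assumed import path; adjust to the fork you use

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the 88k-step checkpoint described above.
# Constructor arguments (device, hop length) vary between forks.
model = RMVPE("model.pt")

# Most forks expect 16 kHz mono input.
audio, sr = librosa.load("vocals.wav", sr=16000, mono=True)

# Some forks take only (audio, thred); others also accept sample_rate and
# a Viterbi-decoding flag. thred is the confidence threshold below which
# frames are treated as unvoiced (0 Hz).
f0 = model.infer_from_audio(audio, sample_rate=sr, thred=0.03)
print(f0.shape, f0[:10])
```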