This model uses the original RMVPE architecture from yxlllc. It was not trained on the HPA RMVPE dataset; instead, it was trained on a different dataset that is richer in contextual information and contains a larger share of M4Singer data. Training was conducted on a hybrid dataset combining music and speech: M4Singer (≈40%), Batch10Synth, Vocadito, MIR1K, MDBStemSynth, and PTDB-TUG. The model is currently trained to 88,000 steps and may be trained further if I have more time. According to Vidalnt, when these models are trained beyond 100,000 steps, they begin to learn and amplify background noise as well.
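For reference, here is a minimal sketch of how a checkpoint like this one might be loaded for F0 extraction with yxlllc's RMVPE code. The `RMVPE` wrapper class, its `infer_from_audio` method, the import path, and the file names `model.pt` / `vocals.wav` are assumptions; argument names and defaults differ between forks of the repository, so adjust to the version you are using.

```python
# Minimal F0-extraction sketch (assumptions: wrapper class name, method
# signature, and file names follow yxlllc's reference implementation and
# are illustrative, not guaranteed).
import librosa
import torch

from rmvpe import RMVPE  # assumed import path; adjust to the fork you use

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the 88k-step checkpoint described above.
# Constructor arguments (device, hop length) vary between forks.
model = RMVPE("model.pt")

# Most forks expect 16 kHz mono input.
audio, sr = librosa.load("vocals.wav", sr=16000, mono=True)

# Some forks take only (audio, thred); others also accept sample_rate and
# a Viterbi-decoding flag. thred is the confidence threshold below which
# frames are treated as unvoiced (0 Hz).
f0 = model.infer_from_audio(audio, sample_rate=sr, thred=0.03)
print(f0.shape, f0[:10])
```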