A large-scale pre-trained RVC model specialized for Japanese pronunciation

Overview

Kuromame is a pre-trained model that focuses on breath sounds (inhalation and exhalation) to achieve authentic Japanese pronunciation and realistic vocal delivery.

Why Only 32 kHz?

Higher sampling rates such as 48 kHz turned out to be extremely delicate to work with, with almost no audible benefit.
While they might slightly reduce latency during real-time inference, the difference is negligible.
When fine-tuning for a target speaker, higher rates often caused problems such as reverberation noise and instability.

After weighing these factors, the model was designed to output at 32 kHz.
If additional high-frequency range is desired, it is better to extend it afterward with post-processing, such as an expander or a similar tool.
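
For context, a signal sampled at a given rate can only represent frequencies up to half that rate (the Nyquist limit). The short sketch below works out which band a 32 kHz output cannot contain compared with a 48 kHz signal. The function and variable names are illustrative only and are not part of the model or its tooling.

```python
def nyquist_hz(sample_rate_hz: int) -> float:
    """Highest frequency a signal sampled at sample_rate_hz can represent."""
    return sample_rate_hz / 2


model_rate = 32_000   # output rate stated in the text
target_rate = 48_000  # the higher rate the text compares against

low = nyquist_hz(model_rate)
high = nyquist_hz(target_rate)

print(f"32 kHz output covers up to {low:.0f} Hz")
print(f"A 48 kHz signal covers up to {high:.0f} Hz")
print(f"Band left for post-processing to fill: {low:.0f}-{high:.0f} Hz")
```

Simply resampling the output to 48 kHz leaves that upper band empty, so any content there has to come from a separate processing step.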

Phase-wise Configuration and Fine-tuning

Each phase was carefully configured by adjusting settings such as the augmentations, the discriminator/generator (D/G) learning rates, lambda_adv, lambda_fm, and lambda_spectral, based on inference results.
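
As a rough illustration of how such per-phase settings fit together, the sketch below combines three generator loss terms as a weighted sum, which is a common pattern in GAN vocoder training. It assumes that lambda_adv, lambda_fm, and lambda_spectral weight an adversarial, a feature-matching, and a spectral loss respectively. The `PhaseConfig` class, the `generator_loss` helper, and every numeric value are hypothetical placeholders, not the settings used to train Kuromame.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhaseConfig:
    """Hypothetical per-phase settings; field names mirror the terms in the text."""
    name: str
    lr_g: float             # generator learning rate
    lr_d: float             # discriminator learning rate
    lambda_adv: float       # assumed: weight of the adversarial loss
    lambda_fm: float        # assumed: weight of the feature-matching loss
    lambda_spectral: float  # assumed: weight of the spectral loss


def generator_loss(cfg: PhaseConfig, adv: float, fm: float, spectral: float) -> float:
    """Weighted sum of the three generator loss terms for one training step."""
    return cfg.lambda_adv * adv + cfg.lambda_fm * fm + cfg.lambda_spectral * spectral


# Placeholder values for illustration only, not the real training settings.
phases = [
    PhaseConfig("phase_1", lr_g=1e-4, lr_d=1e-4,
                lambda_adv=1.0, lambda_fm=2.0, lambda_spectral=45.0),
    PhaseConfig("phase_2", lr_g=5e-5, lr_d=5e-5,
                lambda_adv=1.0, lambda_fm=1.0, lambda_spectral=30.0),
]

# The same raw loss values give a different total under each phase's weights.
for cfg in phases:
    total = generator_loss(cfg, adv=0.5, fm=0.25, spectral=0.125)
    print(cfg.name, total)
```

Keeping each phase as an immutable record makes it easy to compare phases side by side and to note which inference result motivated each change.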

Model Details

V1

  • RefineGAN / SpinV2
  • 119 speakers (all Japanese): 41 hours
  • SR: 32 kHz
  • Batch 64
  • bf16

V2

EQ and room reverb processing were applied manually to every dataset, and more varied expressions and phonemes were added.

  • RefineGAN / SpinV2
  • 109 speakers (all Japanese)
  • SR: 32 kHz
  • Batch 64
  • fp32