KLM X (10.0) Pretrained Model

KLM X (10.0) is a KLM/RVC-compatible pretrained foundation model designed around a Clean Voice Prior.

The main goal of KLM X is to help future voice-conversion fine-tuning produce cleaner, more stable, and more robust vocal models, even when user datasets contain mild imperfections such as reverb, room noise, codec damage, instrumental bleed, or source-separation artifacts.

Core idea:
Dirty conditioning should not force the model to learn dirty audio.
The model should learn to recover clean speaker identity from imperfect conditioning.


Warning

You must use KLM Trainer 0.5.0 for this pretrained model. https://huggingface.co/SeoulStreamingStation/KLM_RVC_KLM-HF_Trainer/resolve/main/KLM_Trainer_0_5_0.zip?download=true

Overview

KLM X is not a fine-tuned voice model meant for direct end-user use.

It is a pretrained foundation model intended to be used as a starting point for KLM/RVC-style voice model training.

The model is designed to preserve compatibility with existing KLM/RVC training and inference flows while adding a stronger robustness prior through a special pretraining strategy called Paired Training.

Dirty Conditioning -> Clean Voice Target

Instead of training the model to reproduce noise, reverb, or artifact-heavy targets, KLM X keeps the reconstruction target clean and uses contaminated paired variants only as imperfect conditioning sources during foundation pretraining.

Key Features

- KLM/RVC-compatible pretrained model
- Designed for vocal and singing voice conversion
- Clean Voice Prior foundation training
- Paired Training support during pretraining
- Dirty phone conditioning support
- Frame-level Dirty F0 Gate
- Clean target waveform and spectrogram are always preserved
- Existing architecture, forward signature, and checkpoint compatibility are maintained
- No paired data required from normal end users during fine-tuning

Designed to improve robustness against:

- reverb
- delay
- room noise
- fan noise
- codec damage
- instrumental bleed
- source-separation artifacts
- unstable silence regions

What Makes KLM X Different?

Traditional pretrained models may learn directly from imperfect training data.

If the training data contains reverb, noise, codec damage, or instrumental residue, those artifacts may become part of the learned voice prior.

KLM X uses a different foundation training strategy.

| Training Type | Target | Conditioning | Risk / Effect |
| --- | --- | --- | --- |
| Conventional Pretraining | May include imperfect audio | Same imperfect audio features | Noise/reverb may be absorbed |
| KLM X Paired Training | Always clean audio/spec | Dirty paired features may be used | Model learns to prefer clean voice identity |

The goal is not to make the model a denoiser.

The goal is to make the pretrained model develop a bias toward clean speaker identity.

Paired Training

Paired Training is a pretraining-only mechanism.

It is not a user fine-tuning requirement.

Normal users do not need to prepare paired datasets.

During foundation training, each clean training file may optionally have one or more contaminated paired versions.

The clean file remains the reconstruction target.

The paired file is used only as a dirty conditioning source.

clean audio/spec = target
dirty phone / gated F0 = conditioning

The model learns:

dirty conditioning -> clean speaker voice target

This teaches the pretrained model to interpret contaminated inputs as imperfect observations of a clean speaker identity.

Dataset Convention for Paired Training

Expected dataset structure:

    DatasetRoot/
      0/
        voice_0001.wav
        voice_0002.wav
        [Paired]/
          voice_0001__paired_mix.wav
          voice_0001__paired_fx2.wav

      1/
        voice_0001.wav
        [Paired]/
          voice_0001__paired_mix.wav

Rules:

- Clean files stay in the speaker folder root.
- Contaminated paired variants stay inside [Paired].
- Paired filenames should preserve the clean filename stem.
- Recognized suffix examples: __paired, __paired_mix, __paired_fx2
- Missing [Paired] folders are allowed.
- Missing paired variants are allowed.
- Unmatched paired files are skipped.
- Contaminated paired files must never be inserted into the normal clean target filelist.

Recommended contamination types:

- reverb
- room noise
- codec damage
- light instrumental bleed
- source-separation artifacts
- ambient noise
- light delay

Avoid paired variants where the voice is fully masked or the lead vocal identity becomes unclear.
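The matching rules above can be sketched in a few lines of Python. This is an illustrative sketch, not the trainer's actual code: the function name `match_paired_files`, the suffix pattern, and the restriction to `.wav` files are all assumptions made for the example.

```python
import re
from pathlib import Path

# Assumption: a paired file is named "<clean stem>__paired[_<tag>]<ext>",
# which covers __paired, __paired_mix, and __paired_fx2.
PAIRED_SUFFIX = re.compile(r"^(?P<stem>.+)__paired(?:_[A-Za-z0-9]+)?$")
AUDIO_EXTS = {".wav"}  # hypothetical; extend if other formats are used


def match_paired_files(speaker_dir):
    """Map each clean file stem in a speaker folder to its [Paired] variants.

    Returns (pairs, skipped): pairs maps clean stem -> list of paired paths,
    skipped lists paired files with no matching clean file.
    """
    speaker_dir = Path(speaker_dir)
    clean = {
        p.stem: p
        for p in sorted(speaker_dir.iterdir())
        if p.is_file() and p.suffix.lower() in AUDIO_EXTS
    }
    pairs = {stem: [] for stem in clean}
    skipped = []

    paired_dir = speaker_dir / "[Paired]"
    if not paired_dir.is_dir():
        # A missing [Paired] folder is allowed: clean-only speaker.
        return pairs, skipped

    for p in sorted(paired_dir.iterdir()):
        if not p.is_file() or p.suffix.lower() not in AUDIO_EXTS:
            continue
        m = PAIRED_SUFFIX.match(p.stem)
        if m and m.group("stem") in clean:
            pairs[m.group("stem")].append(p)
        else:
            # Unmatched paired files are skipped, never treated as clean targets.
            skipped.append(p)
    return pairs, skipped
```

Because paired files are only ever collected from inside [Paired], they cannot leak into the clean target list built from the speaker folder root.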

Paired Training Pipeline

KLM X Paired Training uses a sidecar design.

Normal clean training data remains compatible with existing KLM/RVC training.

Paired data is processed separately.

    experiment/
      sliced_audios/
      sliced_audios_16k/

      paired_sliced_audios/
      paired_sliced_audios_16k/

      f0/
      f0_voiced/
      extracted/

      paired_f0/
      paired_f0_voiced/
      paired_extracted/

      paired_manifest.json

The normal filelist.txt remains clean-only.

The paired manifest maps clean slice keys to optional dirty paired feature variants.

    {
      "0_123_0": [
        {
          "phone": "paired_extracted/0_123_p0_0.npy",
          "pitch": "paired_f0/0_123_p0_0.npy",
          "pitchf": "paired_f0_voiced/0_123_p0_0.npy"
        }
      ]
    }

Training Runtime Behavior

When Paired Training is enabled:

- The loader reads the normal clean training sample.
- It checks whether paired dirty variants exist.
- A schedule decides whether paired conditioning is activated.
- Dirty phone may replace clean phone.
- Dirty F0 must pass the F0 Gate before replacing clean F0.
- Failed F0 frames fall back to clean F0.
- Target waveform and target spectrogram always remain clean.

clean target waveform = always used
clean target spec = always used
dirty audio = never used as reconstruction target

Default Paired Training Schedule

target ratio: 0.25
warmup: 5000 steps
ramp: 10000 steps

During warmup, paired conditioning is inactive.

After warmup, the paired activation ratio gradually ramps toward the target ratio.

This avoids destabilizing the model early in training.
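As a rough illustration, the schedule can be written as a function of the training step. The source only says the ratio ramps "gradually", so the linear ramp below is an assumption, and the function name `paired_activation_ratio` is hypothetical.

```python
def paired_activation_ratio(step, target_ratio=0.25,
                            warmup_steps=5000, ramp_steps=10000):
    """Probability of activating paired conditioning at a given step.

    Assumption: a linear ramp from 0 to target_ratio after warmup.
    The real trainer may use a different ramp shape.
    """
    if step < warmup_steps:
        # Paired conditioning is inactive during warmup.
        return 0.0
    if ramp_steps <= 0:
        return target_ratio
    progress = min(1.0, (step - warmup_steps) / ramp_steps)
    return target_ratio * progress
```

With the default values, the ratio stays at 0 up to step 5000, reaches half the target (0.125) at step 10000, and holds at 0.25 from step 15000 onward. A loader could then activate paired conditioning for a sample when a uniform random draw falls below this ratio and a paired variant is available.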

Dirty Phone Replacement

For active paired samples:

phone = dirty_paired_phone

For inactive or unavailable paired samples:

phone = clean_phone

This encourages the model to treat dirty content conditioning as an imperfect representation of the same clean speaker target.
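A minimal sketch of this selection, using the paired manifest layout described earlier, might look as follows. The function names `load_manifest` and `select_phone`, the random choice among variants, and the frame-count check are assumptions for illustration only.

```python
import json
import random
from pathlib import Path

import numpy as np


def load_manifest(exp_dir):
    """Load paired_manifest.json, or return an empty mapping if it is absent."""
    path = Path(exp_dir) / "paired_manifest.json"
    if not path.is_file():
        return {}
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def select_phone(clean_phone, key, manifest, exp_dir, active, rng=random):
    """Return (phone, variant): dirty phone if usable, else the clean phone.

    variant is the manifest entry that was used, or None on fallback.
    """
    variants = manifest.get(key, [])
    if not active or not variants:
        return clean_phone, None          # inactive or unavailable -> clean
    variant = rng.choice(variants)        # assumption: pick one variant at random
    phone_path = Path(exp_dir) / variant["phone"]
    if not phone_path.is_file():
        return clean_phone, None          # missing paired feature -> clean
    dirty_phone = np.load(phone_path)
    if dirty_phone.shape != clean_phone.shape:
        # Assumption: dirty and clean features must be frame-aligned.
        return clean_phone, None
    return dirty_phone, variant
```

The returned variant entry can be reused to locate the matching dirty pitch and pitchf files, which still have to pass the F0 Gate before they replace the clean values.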

Dirty F0 Gate

Dirty F0 is not trusted blindly.

KLM X uses a frame-level F0 Gate.

Dirty paired pitch and pitchf are accepted only when the frame passes all gate conditions.

Default gate logic:

- clean F0 and dirty F0 are both voiced
- dirty F0 is inside valid range: 40 Hz to 1600 Hz
- |logF0_clean - logF0_dirty| < 0.20
- local delta difference < 0.35

If the frame passes:

pitch = dirty_pitch
pitchf = dirty_pitchf

If the frame fails:

pitch = clean_pitch
pitchf = clean_pitchf
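The frame-level gate can be sketched with NumPy. Several details here are assumptions, since the source does not define them: pitchf is taken to be F0 in Hz with 0 marking unvoiced frames, the logarithm is the natural log, and "local delta difference" is read as the difference between the clean and dirty frame-to-frame changes in log F0. The function names are hypothetical.

```python
import numpy as np


def f0_gate(clean_f0, dirty_f0, f0_min=40.0, f0_max=1600.0,
            max_log_diff=0.20, max_delta_diff=0.35):
    """Return a boolean mask of frames where dirty F0 is accepted.

    Assumptions: F0 in Hz, 0 = unvoiced, natural log, and local delta
    = change in log F0 from the previous frame.
    """
    clean_f0 = np.asarray(clean_f0, dtype=np.float64)
    dirty_f0 = np.asarray(dirty_f0, dtype=np.float64)

    voiced = (clean_f0 > 0) & (dirty_f0 > 0)
    in_range = (dirty_f0 >= f0_min) & (dirty_f0 <= f0_max)

    # Unvoiced frames are mapped to log(1) = 0 to avoid log(0).
    log_clean = np.log(np.where(clean_f0 > 0, clean_f0, 1.0))
    log_dirty = np.log(np.where(dirty_f0 > 0, dirty_f0, 1.0))
    close = np.abs(log_clean - log_dirty) < max_log_diff

    # Frame-to-frame change in log F0; the first frame has delta 0.
    d_clean = np.diff(log_clean, prepend=log_clean[:1])
    d_dirty = np.diff(log_dirty, prepend=log_dirty[:1])
    smooth = np.abs(d_clean - d_dirty) < max_delta_diff

    return voiced & in_range & close & smooth


def apply_f0_gate(clean_pitch, clean_pitchf, dirty_pitch, dirty_pitchf):
    """Use dirty pitch/pitchf on accepted frames, clean values elsewhere."""
    accept = f0_gate(clean_pitchf, dirty_pitchf)
    pitch = np.where(accept, dirty_pitch, clean_pitch)
    pitchf = np.where(accept, dirty_pitchf, clean_pitchf)
    return pitch, pitchf, accept
```

Under this reading of the delta condition, the frame immediately after a rejected outlier is usually rejected as well, because the dirty contour jumps back while the clean contour does not. That makes the sketch conservative around octave jumps.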

This helps prevent unstable pitch conditioning caused by:

- reverb
- bleed
- backing vocals
- octave jumps
- noisy F0 extraction
- source-separation artifacts

Why F0 Gate Matters

F0 is one of the most sensitive conditioning signals in RVC/KLM-style training.

If dirty F0 is wrong, the model may learn to distrust pitch conditioning or produce unstable pitch behavior.

The F0 Gate allows KLM X to use useful paired F0 information while rejecting dangerous frames.

Good dirty F0 -> accepted
Bad dirty F0 -> fallback to clean F0

This keeps paired training useful without sacrificing pitch stability.

Architecture Compatibility

KLM X Paired Training v1.0 does not add a new model branch.

The exported checkpoint remains compatible with the existing KLM/RVC inference and fine-tuning flow.

The following remain unchanged:

- generator architecture
- discriminator architecture
- forward signature
- checkpoint format
- normal fine-tuning workflow
- normal inference workflow

Paired Training uses existing conditioning slots:

- phone
- pitch
- pitchf
- speaker id

This means users can fine-tune from KLM X without preparing paired data.

What KLM X Is Not

KLM X is not:

- a standalone denoiser
- a source-separation model
- a guaranteed artifact remover
- a replacement for proper dataset cleaning
- a model that requires end users to prepare paired datasets
- a model trained to output dirty audio

KLM X is best understood as:

a clean-speaker-prior pretrained model

or:

a robustness-oriented foundation model for KLM/RVC-style voice conversion

Expected Benefits

When used as a pretrained base model, KLM X is designed to help with:

- cleaner fine-tuning results
- more stable vocal texture
- cleaner silence regions
- reduced learning of room tone
- reduced reverb imprinting
- reduced codec artifact imprinting
- better robustness to imperfect user datasets
- better adaptation to lightly contaminated vocal data
- stronger clean speaker identity prior

Actual results may vary depending on:

- dataset quality
- speaker coverage
- F0 extraction quality
- training duration
- fine-tuning settings
- inference settings
- index usage
- source separation quality

Recommended Fine-Tuning Usage

For normal users:

- Prepare a clean vocal dataset when possible.
- Avoid heavy reverb, doubled vocals, chorus, or instrumental bleed.
- Use KLM X as the pretrained base model.
- Fine-tune normally using the existing KLM/RVC workflow.
- No [Paired] folder is required for normal fine-tuning.

Recommended dataset characteristics:

- clean vocal
- dry or lightly processed
- minimal room noise
- minimal instrumental bleed
- consistent speaker identity
- stable volume
- proper slicing

Avoid:

- heavy reverb
- strong backing vocals
- chorus layers
- clipping
- distortion
- low-bitrate compression
- misaligned vocals
- strong instrumental residue

Recommended Paired Training Usage

Paired Training should be used only for foundation/pretrained model development.

It is not recommended for normal user fine-tuning.

Use Paired Training when you are training a foundation model and want to expose the model to controlled dirty conditioning while preserving clean reconstruction targets.

Paired Training requires KLM Trainer 0.5.0 (or higher), or the Paired pretrain model Trainer plug-in for Applio.

Recommended paired contamination balance:

- clean target data should remain dominant
- paired contamination should be realistic but not destructive
- voice identity should remain recognizable
- lead vocal timing should remain aligned
- paired files should preserve the clean filename stem

Monitoring Metrics

KLM X Paired Training can monitor the following metrics:

- paired_training/target_ratio
- paired_training/available_fraction
- paired_training/active_fraction
- paired_training/f0_gate_accept_fraction

Meaning:

| Metric | Meaning |
| --- | --- |
| target_ratio | Scheduled target probability of using paired conditioning |
| available_fraction | Fraction of batch samples that have paired variants |
| active_fraction | Fraction of batch samples where paired conditioning was actually activated |
| f0_gate_accept_fraction | Fraction of active paired voiced F0 frames accepted by the F0 Gate |

Healthy behavior:

- available_fraction should reflect dataset coverage
- active_fraction should gradually ramp toward target_ratio
- f0_gate_accept_fraction should be moderate
- clean losses should not destabilize when paired activation increases

If f0_gate_accept_fraction is extremely high, the gate may be too loose.

If it is near zero, paired F0 may be too contaminated or the gate may be too strict.
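For illustration, the four metrics could be computed per batch from a few boolean masks. This is a sketch under assumptions: the function name `paired_training_metrics` and its inputs are hypothetical, and the voiced mask is supplied by the caller because the source does not state exactly which frames count as voiced.

```python
import numpy as np


def paired_training_metrics(target_ratio, has_paired, is_active,
                            accept_mask, voiced_mask):
    """Compute the paired_training/* metrics for one batch.

    has_paired, is_active: shape [batch], per-sample flags.
    accept_mask, voiced_mask: shape [batch, frames], per-frame flags.
    """
    has_paired = np.asarray(has_paired, dtype=bool)
    is_active = np.asarray(is_active, dtype=bool)
    accept_mask = np.asarray(accept_mask, dtype=bool)
    voiced_mask = np.asarray(voiced_mask, dtype=bool)

    # Only voiced frames of samples with active paired conditioning count.
    eligible = voiced_mask & is_active[:, None]
    n_eligible = int(eligible.sum())
    accepted = int((accept_mask & eligible).sum())

    return {
        "paired_training/target_ratio": float(target_ratio),
        "paired_training/available_fraction":
            float(has_paired.mean()) if has_paired.size else 0.0,
        "paired_training/active_fraction":
            float(is_active.mean()) if is_active.size else 0.0,
        "paired_training/f0_gate_accept_fraction":
            accepted / n_eligible if n_eligible else 0.0,
    }
```

Reporting 0.0 when no frame is eligible is a choice made for this sketch. A logger might prefer to skip the accept fraction for such batches so that warmup steps do not look like a failing gate.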

Safety And Fallback Behavior

KLM X Paired Training is designed to fail safely.

Expected fallback behavior:

- no [Paired] folder -> clean-only training
- unmatched paired file -> skip
- missing paired feature -> fallback to clean conditioning
- Paired Training disabled -> ignore paired tensors
- Fast Training enabled -> bypass paired path

The clean training path must remain functional at all times.

Limitations

KLM X may not fully solve:

- heavily contaminated datasets
- severe source-separation artifacts
- incorrect speaker labels
- extreme backing-vocal leakage
- strong harmony overlap
- bad F0 extraction
- clipping or distorted recordings
- low-quality training data
- mismatched sample rates
- unstable slicing
- content encoder leakage

Paired Training improves robustness, but it does not replace careful dataset preparation.

Ethical Use

This model is intended for legitimate voice conversion, vocal research, singing voice conversion, and authorized creative use.

Do not use this model to impersonate real people without permission.

Do not use this model for fraud, harassment, misinformation, or deceptive identity cloning.

Users are responsible for ensuring that all datasets and generated outputs comply with applicable laws, platform policies, and consent requirements.

Version

- Model name: KLM X
- Version: 10.0
- Training role: Foundation / Pretrained Model
- Training concept: Clean Voice Prior
- Paired Training: v1.0
- F0 Gate: enabled
- Checkpoint compatibility: KLM-HF/RVC-compatible
