LoRA for VibeVoice 1.5B

Source:

Elizabeth Klett's narration of The House of the Vampire (public domain; MP3, 128 kbps)

Dataset prep/process

Source audio was segmented and transcribed with tts-dataset-generator (silence segmentation threshold 400 ms; target sample rate 24 kHz for VibeVoice). A fair number of clips were split mid-sentence ("intra-sentence" segmentation). Audio clips were normalized to -3 dB. Cumulative duration: 1h53m.
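As an illustration of the normalization step, here is a minimal sketch of scaling a clip to -3 dB. It assumes peak normalization on float samples in the range -1.0 to 1.0; the card does not say whether peak or loudness normalization was used, and the function name `peak_normalize` is hypothetical, not part of tts-dataset-generator.

```python
import math

def peak_normalize(samples, target_db=-3.0):
    """Scale float samples (-1.0..1.0) so the loudest one sits at target_db dBFS.

    Assumption: peak normalization. The original pipeline may work differently.
    """
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # silent or empty clip: nothing to scale
    target_amplitude = 10.0 ** (target_db / 20.0)  # -3 dB is roughly 0.708
    gain = target_amplitude / peak
    return [s * gain for s in samples]

# Tiny made-up clip, for illustration only
clip = [0.0, 0.25, -0.5, 0.1]
normalized = peak_normalize(clip)
peak_db = 20.0 * math.log10(max(abs(s) for s in normalized))
print(f"peak after normalization: {peak_db:.2f} dBFS")
```

In practice the samples would come from an audio library after resampling to 24 kHz; plain lists are used here only to keep the sketch dependency-free.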

Training details

Trained with VibeVoice-finetuning using the following command:

python -m src.finetune_vibevoice_lora --model_name_or_path microsoft/VibeVoice-1.5B --train_jsonl "path\to\metadata.jsonl" --text_column_name text --audio_column_name audio --output_dir "path\to\elizabeth_klett\lora" --per_device_train_batch_size 8 --gradient_accumulation_steps 4 --learning_rate 2.5e-5 --num_train_epochs 60 --logging_steps 10 --save_steps 200 --remove_unused_columns False --bf16 True --do_train --gradient_clipping --gradient_checkpointing False --ddpm_batch_mul 4 --diffusion_loss_weight 1.4 --train_diffusion_head True --ce_loss_weight 0.04 --voice_prompt_drop_rate 1 --lora_target_modules q_proj,k_proj,v_proj,o_proj,gate_proj,up_proj,down_proj --lr_scheduler_type cosine --warmup_ratio 0.03 --max_grad_norm 0.8 --report_to tensorboard
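To make the batch-related flags easier to read, the sketch below works out the effective batch size and a rough optimizer-step count from the values in the command. It assumes a single GPU and Hugging Face Trainer-style gradient accumulation; the clip count is not given on this card, so `num_clips` is a hypothetical placeholder and the step numbers are only approximate.

```python
import math

def schedule_summary(num_clips, per_device_batch=8, grad_accum=4,
                     epochs=60, warmup_ratio=0.03, num_devices=1):
    """Estimate batch and step counts for the flags used in the command above.

    Assumptions: one device, and accumulation that floors leftover batches
    per epoch; the actual trainer may count steps slightly differently.
    """
    effective_batch = per_device_batch * grad_accum * num_devices
    batches_per_epoch = math.ceil(num_clips / (per_device_batch * num_devices))
    steps_per_epoch = max(batches_per_epoch // grad_accum, 1)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }

# num_clips=1600 is a made-up figure, not the size of this dataset
summary = schedule_summary(num_clips=1600)
print(summary)
```

With `--per_device_train_batch_size 8` and `--gradient_accumulation_steps 4`, each optimizer step sees 32 clips per device, which is the figure to compare against `--save_steps 200` and `--logging_steps 10`.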

