# VibeVoice 1.5B Single-Speaker Fine-tuning Guide

This folder contains all the files needed to fine-tune VibeVoice 1.5B for a single speaker (the Elise voice).

## Key Improvements

1. **Fixed EOS Token Issue**: The modified `data_vibevoice.py` appends a proper `<|endoftext|>` token after speech generation to prevent repetition/looping (see the sketch after this list)
2. **Single-Speaker Training**: Uses `voice_prompt_drop_rate=1.0` to train without voice prompts
3. **Audio Quality Filter**: Removes training samples with abrupt cutoffs
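
The snippet below is a minimal sketch of the EOS fix, not the actual patch; the real change lives in `data_vibevoice.py`. The tokenizer name and the `append_eos` helper are illustrative assumptions (VibeVoice 1.5B is built on a Qwen2.5-1.5B backbone, whose tokenizer defines `<|endoftext|>`):

```python
# Hypothetical sketch of the EOS fix; the real change is in data_vibevoice.py.
# After the speech tokens, an explicit <|endoftext|> id is appended to both
# inputs and labels so the model learns to emit a stop signal instead of looping.
from transformers import AutoTokenizer

# Assumption: VibeVoice 1.5B uses the Qwen2.5-1.5B tokenizer, which has <|endoftext|>
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B")
eos_id = tokenizer.convert_tokens_to_ids("<|endoftext|>")

def append_eos(input_ids: list[int], labels: list[int]) -> tuple[list[int], list[int]]:
    """Append the EOS id so generation terminates cleanly after the speech segment."""
    return input_ids + [eos_id], labels + [eos_id]
```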
## Files Included

- `data_vibevoice.py` - CRITICAL: Modified data collator that adds the EOS token (replaces `src/data_vibevoice.py`)
- `prepare_jinsaryko_elise_dataset.py` - Downloads and prepares the Elise dataset
- `detect_audio_cutoffs.py` - Detects audio files with abrupt endings
- `finetune_elise_single_speaker.sh` - Training script for the single-speaker model
- `test_fixed_eos_dummy_voice.py` - Test script for inference
## Quick Start

1. **Prepare the dataset**:
   ```bash
   python prepare_jinsaryko_elise_dataset.py
   ```
2. **Detect and remove bad audio** (optional but recommended; a sketch of the cutoff heuristic follows these steps):
   ```bash
   python detect_audio_cutoffs.py
   # Creates an elise_cleaned/ folder containing only the good samples
   ```
3. **IMPORTANT: Replace the data collator**:
   ```bash
   cp data_vibevoice.py ../src/data_vibevoice.py
   ```
4. **Train the model**:
   ```bash
   ./finetune_elise_single_speaker.sh
   ```
5. **Test the model**:
   ```bash
   python test_fixed_eos_dummy_voice.py
   ```
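
As referenced in step 2, here is a minimal sketch of one plausible abrupt-ending heuristic. The actual checks live in `detect_audio_cutoffs.py`; the function name, tail window, and threshold below are illustrative assumptions:

```python
# Hypothetical cutoff heuristic; the real checks are in detect_audio_cutoffs.py.
# A natural utterance fades out, so high energy in the final few milliseconds
# of a clip suggests it was cut off mid-word.
import numpy as np
import soundfile as sf

def ends_abruptly(path: str, tail_ms: int = 30, rms_threshold: float = 0.02) -> bool:
    audio, sr = sf.read(path)
    if audio.ndim > 1:                          # mix stereo down to mono
        audio = audio.mean(axis=1)
    tail = audio[-int(sr * tail_ms / 1000):]    # final tail_ms of the clip
    rms = float(np.sqrt(np.mean(tail ** 2)))    # energy of that window
    return rms > rms_threshold
```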
## Training Configuration

Key settings in `finetune_elise_single_speaker.sh`:

- `voice_prompt_drop_rate 1.0` - Always drops voice prompts (single-speaker mode)
- `learning_rate 2.5e-5` - Conservative learning rate
- `ddpm_batch_mul 2` - Diffusion batch multiplier
- `diffusion_loss_weight 1.4` - Diffusion loss weight
- `ce_loss_weight 0.04` - Cross-entropy loss weight
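
The two loss weights plausibly combine as a weighted sum; the exact formulation is defined by the training code, so treat this as a sketch rather than the implementation:

```python
# Assumed weighted-sum combination of the two losses; check the training code
# for the exact formulation. ddpm_batch_mul=2 is assumed to repeat each sample
# for two diffusion noise draws per step, doubling the diffusion head's batch.
def total_loss(ce_loss: float, diffusion_loss: float,
               ce_loss_weight: float = 0.04,
               diffusion_loss_weight: float = 1.4) -> float:
    return ce_loss_weight * ce_loss + diffusion_loss_weight * diffusion_loss
```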
## How It Works

1. The model learns to associate the "Speaker 0:" prefix with Elise's voice
2. No voice samples are needed during inference
3. The appended EOS token ensures clean endings without repetition
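
A minimal sketch of how `voice_prompt_drop_rate=1.0` produces this behavior (hypothetical names; the real logic is in the data collator): with the rate at 1.0 the drop condition is always true, so the model never sees a voice prompt and the speaker label alone comes to identify the voice.

```python
import random

def maybe_drop_voice_prompt(voice_prompt, drop_rate: float = 1.0):
    # At drop_rate=1.0 this always returns None, so training never
    # conditions on a reference clip; "Speaker 0:" alone carries identity.
    return None if random.random() < drop_rate else voice_prompt
```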
## Dataset Format

The training data should be a JSONL file where each line has this format:

```json
{"text": "Speaker 0: Hello, this is a test.", "audio": "/path/to/audio.wav"}
```

Note: The "Speaker 0:" prefix is REQUIRED for all text entries.
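
A minimal sketch of emitting rows in this format (hypothetical paths and output filename; `prepare_jinsaryko_elise_dataset.py` does this for real):

```python
import json

rows = [
    {"text": "Speaker 0: Hello, this is a test.", "audio": "/data/elise/0001.wav"},
    {"text": "Speaker 0: Another training sentence.", "audio": "/data/elise/0002.wav"},
]
with open("train.jsonl", "w", encoding="utf-8") as f:
    for row in rows:
        assert row["text"].startswith("Speaker 0:")  # required prefix
        f.write(json.dumps(row, ensure_ascii=False) + "\n")
```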