# VibeVoice 1.5B Single-Speaker Fine-tuning Guide

This folder contains all the files needed to fine-tune VibeVoice 1.5B for a single speaker (the Elise voice).

## Key Improvements

1. **Fixed EOS Token Issue**: The modified `data_vibevoice.py` appends a proper `<|endoftext|>` token after speech generation to prevent repetition/looping (see the sketch after this list)
2. **Single-Speaker Training**: Uses `voice_prompt_drop_rate=1.0` to train without voice prompts
3. **Audio Quality Filter**: Removes training samples with abrupt cutoffs
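
The snippet below is a minimal sketch of the EOS fix, not the actual patch; the real change lives in `data_vibevoice.py`. The tokenizer name and the `append_eos` helper are illustrative assumptions (VibeVoice 1.5B is built on a Qwen2.5-1.5B backbone, whose tokenizer defines `<|endoftext|>`):

```python
# Hypothetical sketch of the EOS fix; the real change is in data_vibevoice.py.
# After the speech tokens, an explicit <|endoftext|> id is appended to both
# inputs and labels so the model learns to emit a stop signal instead of looping.
from transformers import AutoTokenizer

# Assumption: VibeVoice 1.5B uses the Qwen2.5-1.5B tokenizer, which has <|endoftext|>
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B")
eos_id = tokenizer.convert_tokens_to_ids("<|endoftext|>")

def append_eos(input_ids: list[int], labels: list[int]) -> tuple[list[int], list[int]]:
    """Append the EOS id so generation terminates cleanly after the speech segment."""
    return input_ids + [eos_id], labels + [eos_id]
```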
## Files Included

- `data_vibevoice.py` - CRITICAL: Modified data collator that adds the EOS token (replaces `src/data_vibevoice.py`)
- `prepare_jinsaryko_elise_dataset.py` - Downloads and prepares the Elise dataset
- `detect_audio_cutoffs.py` - Detects audio files with abrupt endings
- `finetune_elise_single_speaker.sh` - Training script for the single-speaker model
- `test_fixed_eos_dummy_voice.py` - Test script for inference
## Quick Start

1. **Prepare the dataset**:
   ```bash
   python prepare_jinsaryko_elise_dataset.py
   ```
2. **Detect and remove bad audio** (optional but recommended; a sketch of the cutoff heuristic follows these steps):
   ```bash
   python detect_audio_cutoffs.py
   # Creates an elise_cleaned/ folder containing only the good samples
   ```
3. **IMPORTANT: Replace the data collator**:
   ```bash
   cp data_vibevoice.py ../src/data_vibevoice.py
   ```
4. **Train the model**:
   ```bash
   ./finetune_elise_single_speaker.sh
   ```
5. **Test the model**:
   ```bash
   python test_fixed_eos_dummy_voice.py
   ```
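
As referenced in step 2, here is a minimal sketch of one plausible abrupt-ending heuristic. The actual checks live in `detect_audio_cutoffs.py`; the function name, tail window, and threshold below are illustrative assumptions:

```python
# Hypothetical cutoff heuristic; the real checks are in detect_audio_cutoffs.py.
# A natural utterance fades out, so high energy in the final few milliseconds
# of a clip suggests it was cut off mid-word.
import numpy as np
import soundfile as sf

def ends_abruptly(path: str, tail_ms: int = 30, rms_threshold: float = 0.02) -> bool:
    audio, sr = sf.read(path)
    if audio.ndim > 1:                          # mix stereo down to mono
        audio = audio.mean(axis=1)
    tail = audio[-int(sr * tail_ms / 1000):]    # final tail_ms of the clip
    rms = float(np.sqrt(np.mean(tail ** 2)))    # energy of that window
    return rms > rms_threshold
```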
## Training Configuration

Key settings in `finetune_elise_single_speaker.sh`:

- `voice_prompt_drop_rate 1.0` - Always drops voice prompts (single-speaker mode)
- `learning_rate 2.5e-5` - Conservative learning rate
- `ddpm_batch_mul 2` - Diffusion batch multiplier
- `diffusion_loss_weight 1.4` - Diffusion loss weight
- `ce_loss_weight 0.04` - Cross-entropy loss weight
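
The two loss weights plausibly combine as a weighted sum; the exact formulation is defined by the training code, so treat this as a sketch rather than the implementation:

```python
# Assumed weighted-sum combination of the two losses; check the training code
# for the exact formulation. ddpm_batch_mul=2 is assumed to repeat each sample
# for two diffusion noise draws per step, doubling the diffusion head's batch.
def total_loss(ce_loss: float, diffusion_loss: float,
               ce_loss_weight: float = 0.04,
               diffusion_loss_weight: float = 1.4) -> float:
    return ce_loss_weight * ce_loss + diffusion_loss_weight * diffusion_loss
```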
## How It Works

1. The model learns to associate the "Speaker 0:" prefix with Elise's voice
2. No voice samples are needed during inference
3. The appended EOS token ensures clean endings without repetition
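
A minimal sketch of how `voice_prompt_drop_rate=1.0` produces this behavior (hypothetical names; the real logic is in the data collator): with the rate at 1.0 the drop condition is always true, so the model never sees a voice prompt and the speaker label alone comes to identify the voice.

```python
import random

def maybe_drop_voice_prompt(voice_prompt, drop_rate: float = 1.0):
    # At drop_rate=1.0 this always returns None, so training never
    # conditions on a reference clip; "Speaker 0:" alone carries identity.
    return None if random.random() < drop_rate else voice_prompt
```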
## Dataset Format

The training data should be a JSONL file where each line has this format:

```json
{"text": "Speaker 0: Hello, this is a test.", "audio": "/path/to/audio.wav"}
```

Note: The "Speaker 0:" prefix is REQUIRED for all text entries.
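
A minimal sketch of emitting rows in this format (hypothetical paths and output filename; `prepare_jinsaryko_elise_dataset.py` does this for real):

```python
import json

rows = [
    {"text": "Speaker 0: Hello, this is a test.", "audio": "/data/elise/0001.wav"},
    {"text": "Speaker 0: Another training sentence.", "audio": "/data/elise/0002.wav"},
]
with open("train.jsonl", "w", encoding="utf-8") as f:
    for row in rows:
        assert row["text"].startswith("Speaker 0:")  # required prefix
        f.write(json.dumps(row, ensure_ascii=False) + "\n")
```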