# VibeVoice 1.5B Single-Speaker Fine-tuning Guide
This folder contains all the files needed to fine-tune VibeVoice 1.5B for a single speaker (Elise voice).
## Key Improvements
1. **Fixed EOS Token Issue**: The modified `data_vibevoice.py` appends the `<|endoftext|>` token after each speech segment so generation ends cleanly instead of repeating/looping
2. **Single-Speaker Training**: Uses `voice_prompt_drop_rate=1.0` to train without voice prompts
3. **Audio Quality Filter**: Removes training samples with abrupt cutoffs
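The spirit of the EOS fix can be sketched in a few lines. This is a simplified illustration only, assuming token sequences are plain Python lists and using a hypothetical token id; the real collator in `data_vibevoice.py` operates on full VibeVoice batches:

```python
# Sketch of the EOS fix (hypothetical id and shapes; the real collator in
# data_vibevoice.py works on full VibeVoice batches).
EOS_TOKEN_ID = 151643  # hypothetical; look up <|endoftext|> via the tokenizer

def append_eos(token_ids, eos_id=EOS_TOKEN_ID):
    """Append the EOS id after the speech tokens so the model learns to stop."""
    if not token_ids or token_ids[-1] != eos_id:
        return token_ids + [eos_id]
    return token_ids
```

Without that trailing token, the model never sees an example of "stop here," which is why generation can loop at the end of an utterance.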
## Files Included
- `data_vibevoice.py` - CRITICAL: Modified data collator that adds EOS token (replaces src/data_vibevoice.py)
- `prepare_jinsaryko_elise_dataset.py` - Downloads and prepares the Elise dataset
- `detect_audio_cutoffs.py` - Detects audio files with abrupt endings
- `finetune_elise_single_speaker.sh` - Training script for single-speaker model
- `test_fixed_eos_dummy_voice.py` - Test script for inference
## Quick Start
1. **Prepare the dataset**:
```bash
python prepare_jinsaryko_elise_dataset.py
```
2. **Detect and remove bad audio** (optional but recommended):
```bash
python detect_audio_cutoffs.py
# This will create elise_cleaned/ folder with good samples only
```
3. **IMPORTANT: Replace the data collator**:
```bash
cp data_vibevoice.py ../src/data_vibevoice.py
```
4. **Train the model**:
```bash
./finetune_elise_single_speaker.sh
```
5. **Test the model**:
```bash
python test_fixed_eos_dummy_voice.py
```
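The cutoff detection in step 2 boils down to checking whether a clip ends at high amplitude (speech cut mid-sound) rather than decaying into silence. A minimal sketch on raw samples, assuming audio is already loaded as floats in [-1, 1]; `detect_audio_cutoffs.py` may use different thresholds and loading code:

```python
def ends_abruptly(samples, tail=400, threshold=0.05):
    """Flag a clip whose final samples are still loud, i.e. likely
    truncated mid-sound instead of fading into silence."""
    if len(samples) < tail:
        return False  # too short to judge
    # RMS energy over the last `tail` samples
    tail_rms = (sum(s * s for s in samples[-tail:]) / tail) ** 0.5
    return tail_rms > threshold
```

A clip that fades out (`[0.5]*1000 + [0.0]*400`) passes, while one that ends at full amplitude (`[0.5]*1400`) is flagged.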
## Training Configuration
Key settings in `finetune_elise_single_speaker.sh`:
- `voice_prompt_drop_rate 1.0` - Always drops voice prompts (single-speaker mode)
- `learning_rate 2.5e-5` - Conservative learning rate
- `ddpm_batch_mul 2` - Diffusion batch multiplier
- `diffusion_loss_weight 1.4` - Diffusion loss weight
- `ce_loss_weight 0.04` - Cross-entropy loss weight
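Put together, the flags above might be passed like this. This is only a sketch: the entry-point name and argument style are assumptions, and `finetune_elise_single_speaker.sh` contains the real invocation:

```shell
# Sketch only -- the script name is an assumption; see
# finetune_elise_single_speaker.sh for the actual command.
python finetune_vibevoice.py \
  --voice_prompt_drop_rate 1.0 \
  --learning_rate 2.5e-5 \
  --ddpm_batch_mul 2 \
  --diffusion_loss_weight 1.4 \
  --ce_loss_weight 0.04
```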
## How It Works
1. The model learns to associate "Speaker 0:" with Elise's voice
2. No voice samples needed during inference
3. Proper EOS token ensures clean endings without repetition
## Dataset Format
The training data should be JSONL with this format:
```json
{"text": "Speaker 0: Hello, this is a test.", "audio": "/path/to/audio.wav"}
```
Note: The "Speaker 0:" prefix is REQUIRED for all text entries.