VoiceNet/voiceclap-large

@@ -38,8 +38,8 @@ determined by what is fed in via the multimodal chat template.
 ## Training data
-Trained for **1 epoch** on the open `voiceclap_10` mixture used in the
-VoiceNet paper:
 - `emolia-balanced-5M-subset` (annotated subset of [Emilia](https://huggingface.co/datasets/amphion/Emilia-Dataset))
 - `laions_got_talent_clean_with_captions`
@@ -47,6 +47,9 @@ VoiceNet paper:
 - `synthetic_vocal_bursts`
 - `improved_synthetic_vocal_bursts`
 - `ears`
 All clips are captioned with `MOSS-Audio-8B-Thinking`-derived dense
 vocal-style captions covering emotions, talking-style attributes, and

 ## Training data
+Trained for **1 epoch** on the open `voiceclap_10_safe` mixture (9 datasets)
+used in the VoiceNet paper:
 - `emolia-balanced-5M-subset` (annotated subset of [Emilia](https://huggingface.co/datasets/amphion/Emilia-Dataset))
 - `laions_got_talent_clean_with_captions`
 - `synthetic_vocal_bursts`
 - `improved_synthetic_vocal_bursts`
 - `ears`
+- `expresso`
+- `voxceleb1`
+- `voxceleb2`
 All clips are captioned with `MOSS-Audio-8B-Thinking`-derived dense
 vocal-style captions covering emotions, talking-style attributes, and