Common Voice
Common Voice is a public-domain speech corpus with 11.2K hours of read speech in 76 languages (as of the latest version, 7.0). We provide examples for building Transformer models on this dataset.
Data preparation
Download and unpack Common Voice v4 to a path ${DATA_ROOT}/${LANG_ID}.
Create splits and generate audio manifests with
python -m examples.speech_synthesis.preprocessing.get_common_voice_audio_manifest \
  --data-root ${DATA_ROOT} \
  --lang ${LANG_ID} \
  --output-manifest-root ${AUDIO_MANIFEST_ROOT} --convert-to-wav
To denoise audio and trim leading/trailing silence using signal-processing-based voice activity detection (VAD), run
for SPLIT in dev test train; do
  python -m examples.speech_synthesis.preprocessing.denoise_and_vad_audio \
    --audio-manifest ${AUDIO_MANIFEST_ROOT}/${SPLIT}.audio.tsv \
    --output-dir ${PROCESSED_DATA_ROOT} \
    --denoise --vad --vad-agg-level 2
done
which generates a new audio TSV manifest under ${PROCESSED_DATA_ROOT} with updated paths to the processed audio and
a new column for SNR.
To filter samples by character error rate (CER), follow the Automatic Evaluation section to
run the ASR model (add --eval-target to get_eval_manifest to evaluate on the reference audio; add --err-unit char
to eval_asr to compute CER instead of WER). The example-level CER is saved to
${EVAL_OUTPUT_ROOT}/uer_cer.${SPLIT}.tsv.
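For intuition about the quantity being thresholded, here is a minimal sketch of an example-level CER, assuming the common definition of character-level edit distance divided by reference length. The function names are hypothetical, and eval_asr may normalize text or count units differently, so treat this as an illustration rather than the script's implementation.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(
                min(
                    prev[j] + 1,         # delete a reference character
                    curr[j - 1] + 1,     # insert a hypothesis character
                    prev[j - 1] + cost,  # substitute or match
                )
            )
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate as a fraction of the reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


# One substituted character out of eleven reference characters.
example_cer = cer("hello world", "hallo world")
```

Under this definition, a 10% threshold tolerates roughly one character error per ten reference characters.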
Then, extract log-Mel spectrograms, generate feature manifest and create data configuration YAML with
python -m examples.speech_synthesis.preprocessing.get_feature_manifest \
  --audio-manifest-root ${AUDIO_MANIFEST_ROOT} \
  --output-root ${FEATURE_MANIFEST_ROOT} \
  --ipa-vocab --lang ${LANG_ID} \
  --snr-threshold 15 \
  --cer-threshold 0.1 --cer-tsv-path ${EVAL_OUTPUT_ROOT}/uer_cer.${SPLIT}.tsv
where we use phoneme inputs (--ipa-vocab) as an example. For sample filtering, we set the SNR and CER thresholds
to 15 and 10%, respectively.
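The filtering rule can be sketched as follows. This is an illustration only: the column names (`id`, `snr`, `uer`), the treatment of values exactly at a threshold, the handling of samples without a CER entry, and the assumption that CER is stored as a fraction are all hypothetical and may differ from what get_feature_manifest actually does.

```python
import csv
import io


def read_tsv(text):
    """Parse tab-separated text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def filter_samples(manifest_rows, cer_rows, snr_threshold=15.0, cer_threshold=0.1):
    """Keep rows whose SNR is high enough and whose CER is low enough."""
    # Hypothetical column names: "id" and "uer" in the CER table.
    id_to_cer = {row["id"]: float(row["uer"]) for row in cer_rows}
    kept = []
    for row in manifest_rows:
        if float(row["snr"]) < snr_threshold:
            continue  # too noisy
        sample_cer = id_to_cer.get(row["id"])
        # Assumption: samples missing from the CER table are kept.
        if sample_cer is not None and sample_cer > cer_threshold:
            continue  # ASR transcript deviates too much from the text
        kept.append(row)
    return kept


# Tiny made-up manifests, for illustration only.
manifest = read_tsv(
    "id\taudio\tsnr\n"
    "utt1\ta.wav\t22.5\n"
    "utt2\tb.wav\t9.0\n"
    "utt3\tc.wav\t18.0\n"
)
cers = read_tsv("id\tuer\nutt1\t0.04\nutt2\t0.02\nutt3\t0.35\n")
kept_ids = [row["id"] for row in filter_samples(manifest, cers)]
```

In this toy input, `utt2` fails the SNR check and `utt3` fails the CER check, so only `utt1` remains.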
Training
(Please refer to the LJSpeech example.)
Inference
(Please refer to the LJSpeech example.)
Automatic Evaluation
(Please refer to the LJSpeech example.)
Results
| Language | Speakers | --arch | Params | Test MCD | Model |
|---|---|---|---|---|---|
| English | 200 | tts_transformer | 54M | 3.8 | Download |
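For readers unfamiliar with the Test MCD column, the sketch below computes mel-cepstral distortion under its common textbook definition. It assumes the reference and synthesized frame sequences are already time-aligned and that any energy coefficient has been dropped beforehand. The evaluation used for the table may align sequences first (for example with dynamic time warping) and differ in other details, and the function name is hypothetical.

```python
import math


def mcd(ref_frames, syn_frames):
    """Mean mel-cepstral distortion in dB over time-aligned frames.

    Each frame is a sequence of mel-cepstral coefficients. Assumption:
    the 0th (energy) coefficient has already been removed by the caller.
    """
    if len(ref_frames) != len(syn_frames) or not ref_frames:
        raise ValueError("expected two non-empty sequences of equal length")
    scale = 10.0 / math.log(10.0) * math.sqrt(2.0)
    total = 0.0
    for ref, syn in zip(ref_frames, syn_frames):
        squared_diff = sum((r - s) ** 2 for r, s in zip(ref, syn))
        total += scale * math.sqrt(squared_diff)
    return total / len(ref_frames)


# Identical frames give zero distortion.
zero_mcd = mcd([[1.0, 2.0]], [[1.0, 2.0]])
```

Lower values indicate synthesized spectra closer to the reference.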