Common Voice

Common Voice is a public-domain speech corpus with 11.2K hours of read speech in 76 languages (as of the latest version, 7.0). We provide examples for building Transformer models on this dataset.

Data preparation

Download and unpack Common Voice v4 to the path ${DATA_ROOT}/${LANG_ID}. Create splits and generate audio manifests with

python -m examples.speech_synthesis.preprocessing.get_common_voice_audio_manifest \
  --data-root ${DATA_ROOT} \
  --lang ${LANG_ID} \
  --output-manifest-root ${AUDIO_MANIFEST_ROOT} --convert-to-wav

To denoise the audio and trim leading/trailing silence using signal-processing-based voice activity detection (VAD), run

for SPLIT in dev test train; do
    python -m examples.speech_synthesis.preprocessing.denoise_and_vad_audio \
      --audio-manifest ${AUDIO_MANIFEST_ROOT}/${SPLIT}.audio.tsv \
      --output-dir ${PROCESSED_DATA_ROOT} \
      --denoise --vad --vad-agg-level 2
done

This generates a new audio TSV manifest under ${PROCESSED_DATA_ROOT} with updated paths to the processed audio and a new column for the signal-to-noise ratio (SNR).

To filter by character error rate (CER), follow the Automatic Evaluation section to run an ASR model (add --eval-target to get_eval_manifest to evaluate on the reference audio; add --err-unit char to eval_asr to compute CER instead of WER). The example-level CER is saved to ${EVAL_OUTPUT_ROOT}/uer_cer.${SPLIT}.tsv.
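For reference, CER is conventionally the character-level edit distance between the ASR transcript and the reference text, divided by the number of reference characters. The following is a minimal sketch of that definition only; the function names edit_distance and char_error_rate are hypothetical, and eval_asr may apply text normalization (casing, punctuation) that this sketch does not.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (row-by-row dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def char_error_rate(reference, hypothesis):
    """Edit distance over characters, divided by the reference length.

    Assumption: no text normalization is applied before comparison.
    """
    ref_chars = list(reference)
    hyp_chars = list(hypothesis)
    if not ref_chars:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


# One missing character out of 11 reference characters.
print(char_error_rate("hello world", "hello word"))
```

Under this definition a CER of 0.1 corresponds to roughly one character error per ten reference characters.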

Then extract log-Mel spectrograms, generate the feature manifest, and create the data configuration YAML with

python -m examples.speech_synthesis.preprocessing.get_feature_manifest \
  --audio-manifest-root ${AUDIO_MANIFEST_ROOT} \
  --output-root ${FEATURE_MANIFEST_ROOT} \
  --ipa-vocab --lang ${LANG_ID} \
  --snr-threshold 15 \
  --cer-threshold 0.1 --cer-tsv-path ${EVAL_OUTPUT_ROOT}/uer_cer.${SPLIT}.tsv

Here we use phoneme inputs (--ipa-vocab) as an example. For sample filtering, we set the SNR and CER thresholds to 15 and 10%, respectively.
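The effect of the two thresholds can be illustrated with a small standalone sketch. It is not the logic of get_feature_manifest itself: the column names id, snr, and cer, the helper names, the inclusive/exclusive comparisons, and the choice to drop samples that have no CER entry are all assumptions made for illustration.

```python
import csv
import io


def read_tsv(text):
    """Parse tab-separated text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def filter_samples(samples, cer_by_id, snr_threshold=15.0, cer_threshold=0.1):
    """Keep samples with SNR >= snr_threshold and CER <= cer_threshold.

    Assumptions: each sample has "id" and "snr" fields, and samples
    without a CER entry are dropped.
    """
    kept = []
    for sample in samples:
        if float(sample["snr"]) < snr_threshold:
            continue  # too noisy
        cer = cer_by_id.get(sample["id"])
        if cer is None or cer > cer_threshold:
            continue  # no ASR score, or too many character errors
        kept.append(sample)
    return kept


# Toy manifest and CER table; all IDs and values are made up.
manifest = read_tsv("id\tsnr\nutt1\t22.5\nutt2\t9.0\nutt3\t18.0\n")
cer_rows = read_tsv("id\tcer\nutt1\t0.04\nutt2\t0.02\nutt3\t0.25\n")
cer_by_id = {row["id"]: float(row["cer"]) for row in cer_rows}

kept_ids = [s["id"] for s in filter_samples(manifest, cer_by_id)]
print(kept_ids)
```

In this toy data, utt2 is rejected for low SNR and utt3 for high CER, so only utt1 remains.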

Training

(Please refer to the LJSpeech example.)

Inference

(Please refer to the LJSpeech example.)

Automatic Evaluation

(Please refer to the LJSpeech example.)

Results

Language	Speakers	--arch	Params	Test MCD	Model
English	200	tts_transformer	54M	3.8	Download
