<meta charset="utf-8" /><meta name="hf:doc:metadata" content="{&quot;title&quot;:&quot;Text-to-speech datasets&quot;,&quot;local&quot;:&quot;text-to-speech-datasets&quot;,&quot;sections&quot;:[{&quot;title&quot;:&quot;LJSpeech&quot;,&quot;local&quot;:&quot;ljspeech&quot;,&quot;sections&quot;:[],&quot;depth&quot;:2},{&quot;title&quot;:&quot;Multilingual LibriSpeech&quot;,&quot;local&quot;:&quot;multilingual-librispeech&quot;,&quot;sections&quot;:[],&quot;depth&quot;:2},{&quot;title&quot;:&quot;VCTK (Voice Cloning Toolkit)&quot;,&quot;local&quot;:&quot;vctk-voice-cloning-toolkit&quot;,&quot;sections&quot;:[],&quot;depth&quot;:2},{&quot;title&quot;:&quot;Libri-TTS/ LibriTTS-R&quot;,&quot;local&quot;:&quot;libri-tts-libritts-r&quot;,&quot;sections&quot;:[],&quot;depth&quot;:2}],&quot;depth&quot;:1}">
<link href="/docs/audio-course/pr_201/en/_app/immutable/assets/0.e3b0c442.css" rel="modulepreload">
<link rel="modulepreload" href="/docs/audio-course/pr_201/en/_app/immutable/entry/start.367c4d78.js">
<link rel="modulepreload" href="/docs/audio-course/pr_201/en/_app/immutable/chunks/scheduler.f7e1785c.js">
<link rel="modulepreload" href="/docs/audio-course/pr_201/en/_app/immutable/chunks/singletons.0d70d4cc.js">
<link rel="modulepreload" href="/docs/audio-course/pr_201/en/_app/immutable/chunks/index.279db187.js">
<link rel="modulepreload" href="/docs/audio-course/pr_201/en/_app/immutable/chunks/paths.274f629d.js">
<link rel="modulepreload" href="/docs/audio-course/pr_201/en/_app/immutable/entry/app.4c54ebf9.js">
<link rel="modulepreload" href="/docs/audio-course/pr_201/en/_app/immutable/chunks/index.9f8f0838.js">
<link rel="modulepreload" href="/docs/audio-course/pr_201/en/_app/immutable/nodes/0.e329f606.js">
<link rel="modulepreload" href="/docs/audio-course/pr_201/en/_app/immutable/chunks/each.e59479a4.js">
<link rel="modulepreload" href="/docs/audio-course/pr_201/en/_app/immutable/nodes/42.79f1d3ea.js">
<link rel="modulepreload" href="/docs/audio-course/pr_201/en/_app/immutable/chunks/EditOnGithub.5a9bb8c5.js"><!-- HEAD_svelte-u9bgzb_START --><meta name="hf:doc:metadata" content="{&quot;title&quot;:&quot;Text-to-speech datasets&quot;,&quot;local&quot;:&quot;text-to-speech-datasets&quot;,&quot;sections&quot;:[{&quot;title&quot;:&quot;LJSpeech&quot;,&quot;local&quot;:&quot;ljspeech&quot;,&quot;sections&quot;:[],&quot;depth&quot;:2},{&quot;title&quot;:&quot;Multilingual LibriSpeech&quot;,&quot;local&quot;:&quot;multilingual-librispeech&quot;,&quot;sections&quot;:[],&quot;depth&quot;:2},{&quot;title&quot;:&quot;VCTK (Voice Cloning Toolkit)&quot;,&quot;local&quot;:&quot;vctk-voice-cloning-toolkit&quot;,&quot;sections&quot;:[],&quot;depth&quot;:2},{&quot;title&quot;:&quot;Libri-TTS/ LibriTTS-R&quot;,&quot;local&quot;:&quot;libri-tts-libritts-r&quot;,&quot;sections&quot;:[],&quot;depth&quot;:2}],&quot;depth&quot;:1}"><!-- HEAD_svelte-u9bgzb_END --> <p></p> <h1 class="relative group"><a id="text-to-speech-datasets" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#text-to-speech-datasets"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> 
<span>Text-to-speech datasets</span></h1> <p data-svelte-h="svelte-z6r8q5">The text-to-speech (TTS) task (also called <em>speech synthesis</em>) comes with a range of challenges.</p> <p data-svelte-h="svelte-ey4ks9">First, just like in the previously discussed automatic speech recognition (ASR), the alignment between text and speech can be tricky.<br>
However, unlike ASR, TTS is a <strong>one-to-many</strong> mapping problem, i.e., the same text can be synthesised in many different ways. Think about the diversity of voices and speaking styles in the speech you hear on a daily basis - each person has a different way of speaking the same sentence, but they are all valid and correct! Even different outputs (spectrograms or audio waveforms) can correspond to the same ground truth. The model has to learn to generate the correct duration and timing for each phoneme, word, or sentence, which can be challenging,
especially for long and complex sentences.</p> <p data-svelte-h="svelte-l4mvtu">Next, there’s the long-distance dependency problem: language has a temporal aspect, and understanding the meaning of a
sentence often requires considering the context of surrounding words. Ensuring that the TTS model captures and retains
contextual information over long sequences is crucial for generating coherent and natural-sounding speech.</p> <p data-svelte-h="svelte-7lz037">Finally, training TTS models typically requires pairs of text and corresponding speech recordings. On top of that, to ensure
the model can generate speech that sounds natural for various speakers and speaking styles, data should contain diverse and
representative speech samples from multiple speakers. Collecting such data is expensive and time-consuming, and for some
languages it is not feasible. You may think, why not just take a dataset designed for automatic speech recognition (ASR) and use it for
training a TTS model? Unfortunately, ASR datasets are not the best option. The features that
make them beneficial for ASR, such as excessive background noise, are typically undesirable in TTS. It’s great to be able to
pick out speech from a noisy street recording, but not so much if your voice assistant replies to you with cars honking
and construction going full swing in the background. Still, some ASR datasets can be useful for fine-tuning,
as finding top-quality, multilingual, and multi-speaker TTS datasets can be quite challenging.</p> <p data-svelte-h="svelte-1iermkw">Let’s explore a few datasets suitable for TTS that you can find on the 🤗 Hub.</p> <h2 class="relative group"><a id="ljspeech" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#ljspeech"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>LJSpeech</span></h2> <p data-svelte-h="svelte-1a5eqv7"><a href="https://huggingface.co/datasets/lj_speech" rel="nofollow">LJSpeech</a> is a dataset that consists of 13,100 English-language audio clips
paired with their corresponding transcriptions. The dataset contains recordings of a single speaker reading sentences
from 7 non-fiction books in English. LJSpeech is often used as a benchmark for evaluating TTS models
due to its high audio quality and diverse linguistic content.</p> <h2 class="relative group"><a id="multilingual-librispeech" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#multilingual-librispeech"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Multilingual LibriSpeech</span></h2> <p data-svelte-h="svelte-b09uew"><a href="https://huggingface.co/datasets/facebook/multilingual_librispeech" rel="nofollow">Multilingual LibriSpeech</a> is a multilingual extension
of the LibriSpeech dataset, which is a large-scale collection of read English-language audiobooks. Multilingual LibriSpeech
expands on this by including additional languages, such as German, Dutch, Spanish, French, Italian, Portuguese, and Polish.
It offers audio recordings along with aligned transcriptions for each language. The dataset provides a valuable resource
for developing multilingual TTS systems and exploring cross-lingual speech synthesis techniques.</p> <h2 class="relative group"><a id="vctk-voice-cloning-toolkit" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#vctk-voice-cloning-toolkit"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>VCTK (Voice Cloning Toolkit)</span></h2> <p data-svelte-h="svelte-1hzutj7"><a href="https://huggingface.co/datasets/vctk" rel="nofollow">VCTK</a> is a dataset specifically designed for text-to-speech research and development.
It contains audio recordings of 110 English speakers with various accents. Each speaker reads out about 400 sentences,
which were selected from a newspaper, the Rainbow Passage, and an elicitation paragraph used for the Speech Accent Archive.
VCTK offers a valuable resource for training TTS models with varied voices and accents, enabling more natural and diverse
speech synthesis.</p> <h2 class="relative group"><a id="libri-tts-libritts-r" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#libri-tts-libritts-r"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Libri-TTS/ LibriTTS-R</span></h2> <p data-svelte-h="svelte-97018c"><a href="https://huggingface.co/datasets/cdminix/libritts-r-aligned" rel="nofollow">Libri-TTS/ LibriTTS-R</a> is a multi-speaker English corpus of
approximately 585 hours of read English speech at a 24 kHz sampling rate, prepared by Heiga Zen with the assistance of Google
Speech and Google Brain team members. The LibriTTS corpus is designed for TTS research. It is derived from the original
materials (mp3 audio files from LibriVox and text files from Project Gutenberg) of the LibriSpeech corpus. The main
differences from the LibriSpeech corpus are listed below:</p> <ul data-svelte-h="svelte-14kma4g"><li>The audio files are at a 24 kHz sampling rate.</li> <li>The speech is split at sentence breaks.</li> <li>Both original and normalized texts are included.</li> <li>Contextual information (e.g., neighbouring sentences) can be extracted.</li> <li>Utterances with significant background noise are excluded.</li></ul> <p data-svelte-h="svelte-19vpf29">Assembling a good dataset for TTS is no easy task, because such a dataset would have to possess several key characteristics:</p> <ul data-svelte-h="svelte-i4kcdr"><li>High-quality and diverse recordings that cover a wide range of speech patterns, accents, languages, and emotions. The recordings should be clear, free from background noise, and exhibit natural speech characteristics.</li> <li>Transcriptions: Each audio recording should be accompanied by its corresponding text transcription.</li> <li>Variety of linguistic content: The dataset should contain a diverse range of linguistic content, including different types of sentences, phrases, and words. It should cover various topics, genres, and domains to ensure the model’s ability to handle different linguistic contexts.</li></ul> <p data-svelte-h="svelte-1uxfg47">The good news is, it is unlikely that you would have to train a TTS model from scratch. In the next section we’ll look into
pre-trained models available on the 🤗 Hub.</p> <p></p>
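<p>Some of the dataset characteristics listed above (a transcription for every clip, a consistent sampling rate, more than one speaker) can be checked from metadata alone before any training starts. The following is a minimal sketch of such a check, not part of any library: the <code>Clip</code> record, the <code>audit</code> function, the 24 kHz default and the toy rows are all hypothetical and only stand in for whatever manifest format your dataset actually uses. Audio quality itself, such as background noise, cannot be judged this way.</p>

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Clip:
    """One hypothetical manifest row describing a single audio clip."""
    speaker_id: str
    text: str            # transcription; empty if missing
    sampling_rate: int   # in Hz
    num_samples: int     # length of the waveform in samples


def audit(clips, expected_rate=24_000):
    """Summarise a few TTS-relevant properties of a list of clips."""
    # Clips whose transcription is empty or whitespace only.
    missing_text = [c for c in clips if not c.text.strip()]
    # Clips recorded at a rate other than the one we expect.
    wrong_rate = [c for c in clips if c.sampling_rate != expected_rate]
    # Number of clips contributed by each speaker.
    per_speaker = Counter(c.speaker_id for c in clips)
    # Duration of a clip in seconds is samples divided by sampling rate.
    total_seconds = sum(c.num_samples / c.sampling_rate for c in clips)
    return {
        "num_clips": len(clips),
        "num_speakers": len(per_speaker),
        "clips_per_speaker": dict(per_speaker),
        "missing_transcriptions": len(missing_text),
        "unexpected_sampling_rate": len(wrong_rate),
        "total_hours": total_seconds / 3600,
    }


# Toy data, invented for illustration only.
clips = [
    Clip("spk1", "Hello there.", 24_000, 48_000),
    Clip("spk1", "How are you?", 24_000, 36_000),
    Clip("spk2", "", 16_000, 32_000),
]
report = audit(clips)
print(report)
```

<p>On the toy rows, the report flags one clip without a transcription and one clip at an unexpected sampling rate. In practice, such clips would typically be resampled, fixed, or dropped before fine-tuning.</p>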
<script>
{
__sveltekit_yq3w38 = {
assets: "/docs/audio-course/pr_201/en",
base: "/docs/audio-course/pr_201/en",
env: {}
};
const element = document.currentScript.parentElement;
const data = [null,null];
Promise.all([
import("/docs/audio-course/pr_201/en/_app/immutable/entry/start.367c4d78.js"),
import("/docs/audio-course/pr_201/en/_app/immutable/entry/app.4c54ebf9.js")
]).then(([kit, app]) => {
kit.start(app, element, {
node_ids: [0, 42],
data,
form: null,
error: null
});
});
}
</script>
