<meta charset="utf-8" /><meta name="hf:doc:metadata" content="{&quot;title&quot;:&quot;Evaluating text-to-speech models&quot;,&quot;local&quot;:&quot;evaluating-text-to-speech-models&quot;,&quot;sections&quot;:[],&quot;depth&quot;:1}">
<link href="/docs/audio-course/pr_201/en/_app/immutable/assets/0.e3b0c442.css" rel="modulepreload">
<link rel="modulepreload" href="/docs/audio-course/pr_201/en/_app/immutable/entry/start.367c4d78.js">
<link rel="modulepreload" href="/docs/audio-course/pr_201/en/_app/immutable/chunks/scheduler.f7e1785c.js">
<link rel="modulepreload" href="/docs/audio-course/pr_201/en/_app/immutable/chunks/singletons.0d70d4cc.js">
<link rel="modulepreload" href="/docs/audio-course/pr_201/en/_app/immutable/chunks/index.279db187.js">
<link rel="modulepreload" href="/docs/audio-course/pr_201/en/_app/immutable/chunks/paths.274f629d.js">
<link rel="modulepreload" href="/docs/audio-course/pr_201/en/_app/immutable/entry/app.4c54ebf9.js">
<link rel="modulepreload" href="/docs/audio-course/pr_201/en/_app/immutable/chunks/index.9f8f0838.js">
<link rel="modulepreload" href="/docs/audio-course/pr_201/en/_app/immutable/nodes/0.e329f606.js">
<link rel="modulepreload" href="/docs/audio-course/pr_201/en/_app/immutable/chunks/each.e59479a4.js">
<link rel="modulepreload" href="/docs/audio-course/pr_201/en/_app/immutable/nodes/36.88a9c513.js">
<link rel="modulepreload" href="/docs/audio-course/pr_201/en/_app/immutable/chunks/EditOnGithub.5a9bb8c5.js"><!-- HEAD_svelte-u9bgzb_START --><meta name="hf:doc:metadata" content="{&quot;title&quot;:&quot;Evaluating text-to-speech models&quot;,&quot;local&quot;:&quot;evaluating-text-to-speech-models&quot;,&quot;sections&quot;:[],&quot;depth&quot;:1}"><!-- HEAD_svelte-u9bgzb_END --> <p></p> <h1 class="relative group"><a id="evaluating-text-to-speech-models" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#evaluating-text-to-speech-models"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Evaluating text-to-speech models</span></h1> <p data-svelte-h="svelte-5iywum">During training, text-to-speech (TTS) models optimize a mean squared error (MSE) or mean absolute error (MAE) loss between
the predicted spectrogram values and the target ones. Both MSE and MAE encourage the model to minimize the difference
between the predicted and target spectrograms. However, TTS is a one-to-many mapping problem: the spectrogram for a given text can take many different, equally valid forms. This makes evaluating the resulting TTS models much
more difficult.</p> <p data-svelte-h="svelte-okjap1">Unlike many other computational tasks that can be objectively
measured using quantitative metrics, such as accuracy or precision, evaluating TTS relies heavily on subjective human analysis.</p> <p data-svelte-h="svelte-dyvg0f">One of the most commonly employed evaluation methods for TTS systems is conducting qualitative assessments using mean
opinion scores (MOS). MOS is a subjective scoring system that allows human evaluators to rate the perceived quality of
synthesized speech on a scale from 1 to 5. These scores are typically gathered through listening tests, where human
participants listen to and rate the synthesized speech samples.</p> <p data-svelte-h="svelte-yxq6zq">One of the main reasons why objective metrics are challenging to develop for TTS evaluation is the subjective nature of
speech perception. Human listeners have diverse preferences and sensitivities to various aspects of speech, including
pronunciation, intonation, naturalness, and clarity. Capturing these perceptual nuances with a single numerical value
is a daunting task. At the same time, the subjectivity of the human evaluation makes it challenging to compare and
benchmark different TTS systems.</p> <p data-svelte-h="svelte-1ytx6z6">Furthermore, a single overall rating may overlook certain important aspects of speech synthesis, such as naturalness,
expressiveness, and emotional impact. These qualities are difficult to quantify objectively but are highly relevant in
applications where the synthesized speech needs to convey human-like qualities and evoke appropriate emotional responses.</p> <p data-svelte-h="svelte-1no78bk">In summary, evaluating text-to-speech models is a complex task due to the absence of a single truly objective metric. The most common
evaluation method, mean opinion scores (MOS), relies on subjective human analysis. While MOS provides valuable insights
into the quality of synthesized speech, it also introduces variability and subjectivity.</p> <p></p>
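<p>The mean opinion score described above is, at its core, an average of listener ratings. The following is a minimal sketch of that aggregation, not a prescribed protocol: the function name <code>mean_opinion_score</code>, the sample ratings, and the use of a normal-approximation 95% confidence interval are all assumptions made for illustration.</p>

```python
import math
import statistics


def mean_opinion_score(ratings, z=1.96):
    """Return (mos, half_width) for a list of listener ratings on the 1-5 scale.

    half_width is the half-width of an approximate 95% confidence interval,
    computed with a normal approximation (an assumption of this sketch).
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings to estimate an interval")
    if any(not 1 <= r <= 5 for r in ratings):
        raise ValueError("MOS ratings must lie on the 1-5 scale")
    mos = statistics.mean(ratings)
    # Sample standard deviation divided by sqrt(n) gives the standard error.
    half_width = z * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, half_width


# Hypothetical ratings from eight listeners for two TTS systems.
ratings_system_a = [4, 5, 4, 3, 4, 5, 4, 4]
ratings_system_b = [3, 4, 3, 3, 2, 4, 3, 3]

for name, ratings in [("A", ratings_system_a), ("B", ratings_system_b)]:
    mos, half_width = mean_opinion_score(ratings)
    print(f"System {name}: MOS = {mos:.2f} +/- {half_width:.2f}")
```

<p>Reporting the interval alongside the mean makes the variability between listeners visible: with a small panel the interval is wide, and two systems whose intervals overlap cannot be ranked with much confidence.</p>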
<script>
{
__sveltekit_yq3w38 = {
assets: "/docs/audio-course/pr_201/en",
base: "/docs/audio-course/pr_201/en",
env: {}
};
const element = document.currentScript.parentElement;
const data = [null,null];
Promise.all([
import("/docs/audio-course/pr_201/en/_app/immutable/entry/start.367c4d78.js"),
import("/docs/audio-course/pr_201/en/_app/immutable/entry/app.4c54ebf9.js")
]).then(([kit, app]) => {
kit.start(app, element, {
node_ids: [0, 36],
data,
form: null,
error: null
});
});
}
</script>
