	<meta charset="utf-8" /><meta name="hf:doc:metadata" content="{&quot;title&quot;:&quot;Evaluating text-to-speech models&quot;,&quot;local&quot;:&quot;evaluating-text-to-speech-models&quot;,&quot;sections&quot;:[],&quot;depth&quot;:1}">
	<link href="/docs/audio-course/pr_201/en/_app/immutable/assets/0.e3b0c442.css" rel="modulepreload">
	<link rel="modulepreload" href="/docs/audio-course/pr_201/en/_app/immutable/entry/start.367c4d78.js">
	<link rel="modulepreload" href="/docs/audio-course/pr_201/en/_app/immutable/chunks/scheduler.f7e1785c.js">
	<link rel="modulepreload" href="/docs/audio-course/pr_201/en/_app/immutable/chunks/singletons.0d70d4cc.js">
	<link rel="modulepreload" href="/docs/audio-course/pr_201/en/_app/immutable/chunks/index.279db187.js">
	<link rel="modulepreload" href="/docs/audio-course/pr_201/en/_app/immutable/chunks/paths.274f629d.js">
	<link rel="modulepreload" href="/docs/audio-course/pr_201/en/_app/immutable/entry/app.4c54ebf9.js">
	<link rel="modulepreload" href="/docs/audio-course/pr_201/en/_app/immutable/chunks/index.9f8f0838.js">
	<link rel="modulepreload" href="/docs/audio-course/pr_201/en/_app/immutable/nodes/0.e329f606.js">
	<link rel="modulepreload" href="/docs/audio-course/pr_201/en/_app/immutable/chunks/each.e59479a4.js">
	<link rel="modulepreload" href="/docs/audio-course/pr_201/en/_app/immutable/nodes/36.88a9c513.js">
	<link rel="modulepreload" href="/docs/audio-course/pr_201/en/_app/immutable/chunks/EditOnGithub.5a9bb8c5.js"><!-- HEAD_svelte-u9bgzb_START --><meta name="hf:doc:metadata" content="{&quot;title&quot;:&quot;Evaluating text-to-speech models&quot;,&quot;local&quot;:&quot;evaluating-text-to-speech-models&quot;,&quot;sections&quot;:[],&quot;depth&quot;:1}"><!-- HEAD_svelte-u9bgzb_END --> <p></p> <h1 class="relative group"><a id="evaluating-text-to-speech-models" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#evaluating-text-to-speech-models"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Evaluating text-to-speech models</span></h1> <p data-svelte-h="svelte-5iywum">During training, text-to-speech models optimize the mean squared error (MSE) loss, or the mean absolute error (MAE) loss, between
	the predicted spectrogram values and the target ones. Both MSE and MAE encourage the model to minimize the difference
	between the predicted and target spectrograms. However, TTS is a one-to-many mapping problem: the same text can be rendered as many different, equally valid spectrograms. A low training loss therefore does not guarantee good speech, which makes evaluating text-to-speech (TTS) models much
	more difficult.</p> <p data-svelte-h="svelte-okjap1">Unlike many other computational tasks that can be objectively
	measured using quantitative metrics, such as accuracy or precision, evaluating TTS relies heavily on subjective human analysis.</p> <p data-svelte-h="svelte-dyvg0f">One of the most common evaluation methods for TTS systems is the mean
	opinion score (MOS). MOS is a subjective scoring system in which human evaluators rate the perceived quality of
	synthesized speech on a scale from 1 (bad) to 5 (excellent), and the ratings are averaged. These scores are typically gathered through listening tests, where human
	participants listen to and rate the synthesized speech samples.</p> <p data-svelte-h="svelte-yxq6zq">One of the main reasons why objective metrics are challenging to develop for TTS evaluation is the subjective nature of
	speech perception. Human listeners have diverse preferences and sensitivities to various aspects of speech, including
	pronunciation, intonation, naturalness, and clarity. Capturing these perceptual nuances with a single numerical value
	is a daunting task. At the same time, the subjectivity of the human evaluation makes it challenging to compare and
	benchmark different TTS systems.</p> <p data-svelte-h="svelte-1ytx6z6">Furthermore, this kind of evaluation may overlook certain important aspects of speech synthesis, such as naturalness,
	expressiveness, and emotional impact. These qualities are difficult to quantify objectively but are highly relevant in
	applications where the synthesized speech needs to convey human-like qualities and evoke appropriate emotional responses.</p> <p data-svelte-h="svelte-1no78bk">In summary, evaluating text-to-speech models is a complex task because there is no single, truly objective metric. The most common
	evaluation method, mean opinion scores (MOS), relies on subjective human analysis. While MOS provides valuable insights
	into the quality of synthesized speech, it also introduces variability and subjectivity.</p> <p></p>
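	<p>To make the mean opinion score described above concrete, here is a minimal sketch of how listener ratings could be aggregated. It is an illustration only: the function name <code>mean_opinion_score</code>, the example ratings, and the choice of a normal-approximation 95% confidence interval are assumptions, not part of the course material. Reporting an interval alongside the mean is one way to show the variability that subjective ratings introduce.</p>

```python
import math
import statistics


def mean_opinion_score(ratings, z=1.96):
    """Return (mos, half_width) for a list of 1-5 listener ratings.

    half_width is the half-width of a normal-approximation confidence
    interval (z=1.96 corresponds to roughly 95%). This is an assumption
    made for illustration; with few ratings it is only a rough estimate.
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings to estimate spread")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie on the 1-5 scale")
    mos = statistics.mean(ratings)
    # Sample standard deviation (n - 1 in the denominator).
    std = statistics.stdev(ratings)
    half_width = z * std / math.sqrt(len(ratings))
    return mos, half_width


# Hypothetical ratings for two systems, made up for illustration only.
ratings_a = [4, 5, 4, 3, 4, 5, 4, 4]
ratings_b = [3, 4, 3, 3, 4, 2, 3, 4]

for name, ratings in [("system A", ratings_a), ("system B", ratings_b)]:
    mos, hw = mean_opinion_score(ratings)
    print(f"{name}: MOS = {mos:.2f} +/- {hw:.2f}")
```

	<p>When the intervals of two systems overlap heavily, the difference in their MOS may reflect listener variability rather than a real quality gap, which is one reason comparing TTS systems by MOS alone is difficult.</p>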

	<script>
	{
	__sveltekit_yq3w38 = {
	assets: "/docs/audio-course/pr_201/en",
	base: "/docs/audio-course/pr_201/en",
	env: {}
	};

	const element = document.currentScript.parentElement;

	const data = [null,null];

	Promise.all([
	import("/docs/audio-course/pr_201/en/_app/immutable/entry/start.367c4d78.js"),
	import("/docs/audio-course/pr_201/en/_app/immutable/entry/app.4c54ebf9.js")
	]).then(([kit, app]) => {
	kit.start(app, element, {
	node_ids: [0, 36],
	data,
	form: null,
	error: null
	});
	});
	}
	</script>
