| <meta charset="utf-8" /><meta name="hf:doc:metadata" content="{"title":"Pre-trained models for text-to-speech","local":"pre-trained-models-for-text-to-speech","sections":[{"title":"SpeechT5","local":"speecht5","sections":[],"depth":2},{"title":"Bark","local":"bark","sections":[],"depth":2},{"title":"Massive Multilingual Speech (MMS)","local":"massive-multilingual-speech-mms","sections":[],"depth":2}],"depth":1}"> | |
| <h1 class="relative group"><a id="pre-trained-models-for-text-to-speech" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#pre-trained-models-for-text-to-speech"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Pre-trained models for text-to-speech</span></h1> <p data-svelte-h="svelte-1daxplp">Compared to ASR (automatic speech recognition) and audio classification tasks, there are significantly fewer pre-trained | |
| model checkpoints available for text-to-speech. On the 🤗 Hub, you’ll find close to 300 suitable checkpoints. Among | |
| these pre-trained models we’ll focus on three architectures that are readily available for you in the 🤗 Transformers library: | |
| SpeechT5, Bark, and Massive Multilingual Speech (MMS). In this section, we’ll explore how to use these pre-trained models in the | |
| Transformers library for TTS.</p> <h2 class="relative group"><a id="speecht5" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#speecht5"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>SpeechT5</span></h2> <p data-svelte-h="svelte-yag015"><a href="https://arxiv.org/abs/2110.07205" rel="nofollow">SpeechT5</a> is a model published by Junyi Ao et al. from Microsoft that is capable of | |
| handling a range of speech tasks. While this unit focuses on the text-to-speech aspect, | |
| the model can also be tailored to speech-to-text tasks (automatic speech recognition or speaker identification), | |
| as well as speech-to-speech (e.g. speech enhancement or converting between different voices). This is due to how the model | |
| is designed and pre-trained.</p> <p data-svelte-h="svelte-c9tc8d">At the heart of SpeechT5 is a regular Transformer encoder-decoder model. Just like any other Transformer, the encoder-decoder | |
| network models a sequence-to-sequence transformation using hidden representations. This Transformer backbone is the same | |
| for all tasks SpeechT5 supports.</p> <p data-svelte-h="svelte-13972z8">This Transformer is complemented with six modality-specific (speech/text) <em>pre-nets</em> and <em>post-nets</em>. The input speech or text | |
| (depending on the task) is preprocessed through a corresponding pre-net to obtain the hidden representations that the Transformer | |
| can use. The Transformer’s output is then passed to a post-net that will use it to generate the output in the target modality.</p> <p data-svelte-h="svelte-wr8kuz">This is what the architecture looks like (image from the original paper):</p> <div class="flex justify-center" data-svelte-h="svelte-1j5439p"><img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/speecht5/architecture.jpg" alt="SpeechT5 architecture from the original paper"></div> <p data-svelte-h="svelte-1kbr6z9">SpeechT5 is first pre-trained using large-scale unlabeled speech and text data, to acquire a unified representation | |
| of different modalities. During the pre-training phase all pre-nets and post-nets are used simultaneously.</p> <p data-svelte-h="svelte-fdwn9q">After pre-training, the entire encoder-decoder backbone is fine-tuned for each individual task. At this step, only the | |
| pre-nets and post-nets relevant to the specific task are employed. For example, to use SpeechT5 for text-to-speech, you’d | |
| need the text encoder pre-net for the text inputs and the speech decoder pre- and post-nets for the speech outputs.</p> <p data-svelte-h="svelte-165k5vk">This approach makes it possible to obtain several models fine-tuned for different speech tasks that all benefit from the initial | |
| pre-training on unlabeled data.</p> <blockquote class="tip"><p data-svelte-h="svelte-186t2pn">Even though the fine-tuned models start out using the same set of weights from the shared pre-trained model, the | |
| final versions are all quite different in the end. You can’t take a fine-tuned ASR model and swap out the pre-nets and | |
| post-net to get a working TTS model, for example. SpeechT5 is flexible, but not that flexible ;)</p></blockquote> <p data-svelte-h="svelte-n345q3">Let’s look at the pre- and post-nets that SpeechT5 uses for the TTS task specifically:</p> <ul data-svelte-h="svelte-jd8qh8"><li>Text encoder pre-net: A text embedding layer that maps text tokens to the hidden representations that the encoder expects. This is similar to what happens in an NLP model such as BERT.</li> <li>Speech decoder pre-net: This takes a log mel spectrogram as input and uses a sequence of linear layers to compress the spectrogram into hidden representations.</li> <li>Speech decoder post-net: This predicts a residual to add to the output spectrogram and is used to refine the results.</li></ul> <p data-svelte-h="svelte-81isp3">When combined, this is what the SpeechT5 architecture for text-to-speech looks like:</p> <div class="flex justify-center" data-svelte-h="svelte-sze2fn"><img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/speecht5/tts.jpg" alt="SpeechT5 architecture for TTS"></div> <p data-svelte-h="svelte-1x4m0vh">As you can see, the output is a log mel spectrogram and not a final waveform. If you recall, we briefly touched on | |
| this topic in <a href="../chapter3/introduction#spectrogram-output">Unit 3</a>. It is common for models that generate audio to produce | |
| a log mel spectrogram, which needs to be converted to a waveform with an additional neural network known as a vocoder.</p> <p data-svelte-h="svelte-mpoa3m">Let’s see how you could do that.</p> <p data-svelte-h="svelte-1rb9653">First, let’s load the fine-tuned TTS SpeechT5 model from the 🤗 Hub, along with the processor object used for tokenization | |
| and feature extraction:</p> <div class="code-block relative "><pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">from</span> transformers <span class="hljs-keyword">import</span> SpeechT5Processor, SpeechT5ForTextToSpeech | |
| processor = SpeechT5Processor.from_pretrained(<span class="hljs-string">"microsoft/speecht5_tts"</span>) | |
| model = SpeechT5ForTextToSpeech.from_pretrained(<span class="hljs-string">"microsoft/speecht5_tts"</span>)<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-swfq00">Next, tokenize the input text.</p> <div class="code-block relative "><pre class=""><!-- HTML_TAG_START -->inputs = processor(text=<span class="hljs-string">"Don't count the days, make the days count."</span>, return_tensors=<span class="hljs-string">"pt"</span>)<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-74c7oy">The SpeechT5 TTS model is not limited to creating speech for a single speaker. Instead, it uses so-called speaker embeddings | |
| that capture a particular speaker’s voice characteristics.</p> <blockquote class="tip"><p data-svelte-h="svelte-mdrkn0">Speaker embeddings are a way of representing a speaker’s identity in a compact form, as a vector of | |
| fixed size, regardless of the length of the utterance. These embeddings capture essential information about a speaker’s | |
| voice, accent, intonation, and other unique characteristics that distinguish one speaker from another. Such embeddings can | |
| be used for speaker verification, speaker diarization, speaker identification, and more. | |
| The most common techniques for generating speaker embeddings include:</p> <ul data-svelte-h="svelte-1hi7sgd"><li>I-Vectors (identity vectors): I-Vectors are based on a Gaussian mixture model (GMM). They represent speakers as low-dimensional fixed-length vectors derived from the statistics of a speaker-specific GMM, and are obtained in an unsupervised manner.</li> <li>X-Vectors: X-Vectors are derived using deep neural networks (DNNs) and capture frame-level speaker information by incorporating temporal context.</li></ul> <p data-svelte-h="svelte-1jxhw7z"><a href="https://www.danielpovey.com/files/2018_icassp_xvectors.pdf" rel="nofollow">X-Vectors</a> are a state-of-the-art method that shows superior performance | |
| on evaluation datasets compared to I-Vectors. A deep neural network is used to obtain X-Vectors: it is trained to discriminate | |
| between speakers and maps variable-length utterances to fixed-dimensional embeddings. You can also load an X-Vector speaker embedding that has been computed ahead of time, which will encapsulate the speaking characteristics of a particular speaker.</p></blockquote> <p data-svelte-h="svelte-jz5sfv">Let’s load such a speaker embedding from a dataset on the Hub. The embeddings | |
| were obtained from the <a href="http://www.festvox.org/cmu_arctic/" rel="nofollow">CMU ARCTIC dataset</a> using | |
| <a href="https://huggingface.co/mechanicalsea/speecht5-vc/blob/main/manifest/utils/prep_cmu_arctic_spkemb.py" rel="nofollow">this script</a>, but | |
| any X-Vector embedding should work.</p> <div class="code-block relative "><pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">from</span> datasets <span class="hljs-keyword">import</span> load_dataset | |
| embeddings_dataset = load_dataset(<span class="hljs-string">"Matthijs/cmu-arctic-xvectors"</span>, split=<span class="hljs-string">"validation"</span>) | |
| <span class="hljs-keyword">import</span> torch | |
| speaker_embeddings = torch.tensor(embeddings_dataset[<span class="hljs-number">7306</span>][<span class="hljs-string">"xvector"</span>]).unsqueeze(<span class="hljs-number">0</span>)<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-f6dlnz">The speaker embedding is a tensor of shape (1, 512). This particular speaker embedding describes a female voice.</p> <p data-svelte-h="svelte-1awgz3s">At this point we already have enough inputs to generate a log mel spectrogram as an output. You can do it like this:</p> <div class="code-block relative "><pre class=""><!-- HTML_TAG_START -->spectrogram = model.generate_speech(inputs[<span class="hljs-string">"input_ids"</span>], speaker_embeddings)<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-jpr5s4">This outputs a tensor of shape (140, 80) containing a log mel spectrogram. 
The first dimension is the sequence length, and | |
| it may vary between runs as the speech decoder pre-net always applies dropout to the input sequence. This adds a bit of | |
| random variability to the generated speech.</p> <p data-svelte-h="svelte-ftado2">However, if we want to generate a speech waveform, we need to specify a vocoder to use for the spectrogram-to-waveform conversion. | |
| In theory, you can use any vocoder that works on 80-bin mel spectrograms. Conveniently, 🤗 Transformers offers a vocoder | |
| based on HiFi-GAN. Its weights were kindly provided by the original authors of SpeechT5.</p> <blockquote class="tip"><p data-svelte-h="svelte-1netc3w"><a href="https://arxiv.org/pdf/2010.05646v2.pdf" rel="nofollow">HiFi-GAN</a> is a state-of-the-art generative adversarial network (GAN) designed | |
| for high-fidelity speech synthesis. It is capable of generating high-quality and realistic audio waveforms from spectrogram inputs.</p> <p data-svelte-h="svelte-1c62h2f">At a high level, HiFi-GAN consists of one generator and two discriminators. The generator is a fully convolutional | |
| neural network that takes a mel-spectrogram as input and learns to produce raw audio waveforms. The discriminators’ | |
| role is to distinguish between real and generated audio. The two discriminators focus on different aspects of the audio.</p> <p data-svelte-h="svelte-guovqp">HiFi-GAN is trained on a large dataset of high-quality audio recordings. It uses so-called <em>adversarial training</em>, | |
| where the generator and discriminator networks compete against each other. Initially, the generator produces low-quality | |
| audio, and the discriminator can easily differentiate it from real audio. As training progresses, the generator improves | |
| its output, aiming to fool the discriminator. The discriminator, in turn, becomes more accurate in distinguishing real | |
| and generated audio. This adversarial feedback loop helps both networks improve over time. Ultimately, HiFi-GAN learns to | |
| generate high-fidelity audio that closely resembles the characteristics of the training data.</p></blockquote> <p data-svelte-h="svelte-44n2yf">Loading the vocoder is as easy as loading any other 🤗 Transformers model.</p> <div class="code-block relative "><pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">from</span> transformers <span class="hljs-keyword">import</span> SpeechT5HifiGan | |
| vocoder = SpeechT5HifiGan.from_pretrained(<span class="hljs-string">"microsoft/speecht5_hifigan"</span>)<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1kphbin">Now all you need to do is pass it as an argument when generating speech, and the outputs will be automatically converted to a speech waveform.</p> <div class="code-block relative "><pre class=""><!-- HTML_TAG_START -->speech = model.generate_speech(inputs[<span class="hljs-string">"input_ids"</span>], speaker_embeddings, vocoder=vocoder)<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-19xs0zc">Let’s listen to the result. 
The sample rate used by SpeechT5 is always 16 kHz.</p> <div class="code-block relative "><pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">from</span> IPython.display <span class="hljs-keyword">import</span> Audio | |
| Audio(speech, rate=<span class="hljs-number">16000</span>)<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1r9io3v">Neat!</p> <p data-svelte-h="svelte-1h3brza">Feel free to play with the SpeechT5 text-to-speech demo, explore other voices, and experiment with inputs. Note that this | |
| pre-trained checkpoint only supports the English language:</p> <iframe src="https://matthijs-speecht5-tts-demo.hf.space" frameborder="0" width="850" height="450" data-svelte-h="svelte-k7mwvh"></iframe> <h2 class="relative group"><a id="bark" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#bark"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Bark</span></h2> <p data-svelte-h="svelte-luezb9">Bark is a transformer-based text-to-speech model proposed by Suno AI in <a href="https://github.com/suno-ai/bark" rel="nofollow">suno-ai/bark</a>.</p> <p data-svelte-h="svelte-td1jzt">Unlike SpeechT5, Bark generates raw speech waveforms directly, eliminating the need for a separate vocoder during inference – it’s already integrated. 
This efficiency is achieved through the use of <a href="https://huggingface.co/docs/transformers/main/en/model_doc/encodec" rel="nofollow"><code>Encodec</code></a>, which serves as both a codec and a compression tool.</p> <p data-svelte-h="svelte-ayk4yt">With <code>Encodec</code>, you can compress audio into a lightweight format to reduce memory usage and subsequently decompress it to restore the original audio. This compression process is facilitated by 8 codebooks, each consisting of integer vectors. Think of these codebooks as representations or embeddings of the audio in integer form. It’s important to note that each successive codebook improves the quality of the audio reconstruction from the previous codebooks. As codebooks are integer vectors, they can be learned by transformer models, which are very efficient at this task. This is what Bark was specifically trained to do.</p> <p data-svelte-h="svelte-1bbb3eg">To be more specific, Bark is made of four main models:</p> <ul data-svelte-h="svelte-1ho1l4x"><li><code>BarkSemanticModel</code> (also referred to as the ‘text’ model): a causal autoregressive transformer model that takes tokenized text as input and predicts semantic text tokens that capture the meaning of the text.</li> <li><code>BarkCoarseModel</code> (also referred to as the ‘coarse acoustics’ model): a causal autoregressive transformer that takes as input the results of the <code>BarkSemanticModel</code> model. 
It predicts the first two audio codebooks needed by EnCodec.</li> <li><code>BarkFineModel</code> (also referred to as the ‘fine acoustics’ model): a non-causal autoencoder transformer that iteratively predicts the remaining codebooks based on the sum of the previous codebook embeddings.</li> <li><code>EncodecModel</code>: once all the codebook channels have been predicted, Bark uses it to decode the output audio array.</li></ul> <p data-svelte-h="svelte-birews">Note that each of the first three modules can support conditional speaker embeddings to condition the output sound according to a specific predefined voice.</p> <p data-svelte-h="svelte-mr83a0">Bark is a highly controllable text-to-speech model, meaning you can use it with various settings, as we are going to see.</p> <p data-svelte-h="svelte-hrck43">First, load the model and its processor.</p> <p data-svelte-h="svelte-hgto8u">The processor’s role here is twofold:</p> <ol data-svelte-h="svelte-1d9lyek"><li>It is used to tokenize the input text, i.e. 
to cut it into small pieces that the model can understand.</li> <li>It stores speaker embeddings, i.e. voice presets that can condition the generation.</li></ol> <div class="code-block relative "><pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">from</span> transformers <span class="hljs-keyword">import</span> BarkModel, BarkProcessor | |
model = BarkModel.from_pretrained(<span class="hljs-string">"suno/bark-small"</span>)
processor = BarkProcessor.from_pretrained(<span class="hljs-string">"suno/bark-small"</span>)<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1ep51cg">Bark is very versatile and can generate audio conditioned on a voice from <a href="https://suno-ai.notion.site/8b8e8749ed514b0cbf3f699013548683?v=bc67cff786b04b50b3ceb756fd05f68c" rel="nofollow">a library of speaker embeddings</a>, which can be loaded via the processor.</p> <div class="code-block relative "><pre class=""><!-- HTML_TAG_START --><span class="hljs-comment"># add a speaker embedding</span>
inputs = processor(<span class="hljs-string">"This is a test!"</span>, voice_preset=<span class="hljs-string">"v2/en_speaker_3"</span>)
speech_output = model.generate(**inputs).cpu().numpy()<!-- HTML_TAG_END --></pre></div> <audio controls="" data-svelte-h="svelte-1dcfkqn"><source src="https://huggingface.co/datasets/ylacombe/hf-course-audio-files/resolve/main/first_sample.wav" type="audio/wav">
Your browser does not support the audio element.</audio> <p data-svelte-h="svelte-1ezu4yu">Bark can also generate ready-to-use speech in other languages, such as French and Chinese. You can find a list of supported languages <a href="https://huggingface.co/suno/bark" rel="nofollow">here</a>. Unlike MMS, discussed below, you do not need to specify the language; simply write the input text in the language you want.</p> <div class="code-block relative "><pre class=""><!-- HTML_TAG_START --><span class="hljs-comment"># try it in French, let's also add a French speaker embedding</span>
inputs = processor(<span class="hljs-string">"C'est un test!"</span>, voice_preset=<span class="hljs-string">"v2/fr_speaker_1"</span>)
speech_output = model.generate(**inputs).cpu().numpy()<!-- HTML_TAG_END --></pre></div> <audio controls="" data-svelte-h="svelte-1mf2amh"><source src="https://huggingface.co/datasets/ylacombe/hf-course-audio-files/resolve/main/second_sample.wav" type="audio/wav">
Your browser does not support the audio element.</audio> <p data-svelte-h="svelte-1sw314w">The model can also generate <strong>non-verbal communication</strong> such as laughing, sighing and crying. You just have to add the corresponding cues to the input text, such as <code>[clears throat]</code>, <code>[laughter]</code>, or <code>...</code>.</p> <div class="code-block relative "><pre class=""><!-- HTML_TAG_START -->inputs = processor(
    <span class="hljs-string">"[clears throat] This is a test ... and I just took a long pause."</span>,
    voice_preset=<span class="hljs-string">"v2/fr_speaker_1"</span>,
)
speech_output = model.generate(**inputs).cpu().numpy()<!-- HTML_TAG_END --></pre></div> <audio controls="" data-svelte-h="svelte-18uewve"><source src="https://huggingface.co/datasets/ylacombe/hf-course-audio-files/resolve/main/third_sample.wav" type="audio/wav">
Your browser does not support the audio element.</audio> <p data-svelte-h="svelte-17xea6o">Bark can even generate music. You can help by adding ♪ musical notes ♪ around your words.</p> <div class="code-block relative "><pre class=""><!-- HTML_TAG_START -->inputs = processor(
    <span class="hljs-string">"♪ In the mighty jungle, I'm trying to generate barks."</span>,
)
speech_output = model.generate(**inputs).cpu().numpy()<!-- HTML_TAG_END --></pre></div> <audio controls="" data-svelte-h="svelte-xst92l"><source src="https://huggingface.co/datasets/ylacombe/hf-course-audio-files/resolve/main/fourth_sample.wav" type="audio/wav">
Your browser does not support the audio element.</audio> <p data-svelte-h="svelte-q2c458">In addition to all these features, Bark supports batch processing, which means you can process several text entries at the same time, at the expense of more intensive computation.
On some hardware, such as GPUs, batching enables faster overall generation, which means it can be faster to generate samples all at once than to generate them one by one.</p> <p data-svelte-h="svelte-1gu4p56">Let’s try generating a few examples:</p> <div class="code-block relative "><pre class=""><!-- HTML_TAG_START -->input_list = [
    <span class="hljs-string">"[clears throat] Hello uh ..., my dog is cute [laughter]"</span>,
    <span class="hljs-string">"Let's try generating speech, with Bark, a text-to-speech model"</span>,
    <span class="hljs-string">"♪ In the jungle, the mighty jungle, the lion barks tonight ♪"</span>,
]
<span class="hljs-comment"># also add a speaker embedding</span>
inputs = processor(input_list, voice_preset=<span class="hljs-string">"v2/en_speaker_3"</span>)
speech_output = model.generate(**inputs).cpu().numpy()<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-dtoew1">Let’s listen to the outputs one by one.</p> <p data-svelte-h="svelte-8xc0ic">First one:</p> <div class="code-block relative "><pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">from</span> IPython.display <span class="hljs-keyword">import</span> Audio
sampling_rate = model.generation_config.sample_rate
Audio(speech_output[<span class="hljs-number">0</span>], rate=sampling_rate)<!-- HTML_TAG_END --></pre></div> <audio controls="" data-svelte-h="svelte-v3niu0"><source src="https://huggingface.co/datasets/ylacombe/hf-course-audio-files/resolve/main/batch_1.wav" type="audio/wav">
Your browser does not support the audio element.</audio> <p data-svelte-h="svelte-kbivly">Second one:</p> <div class="code-block relative "><pre class=""><!-- HTML_TAG_START -->Audio(speech_output[<span class="hljs-number">1</span>], rate=sampling_rate)<!-- HTML_TAG_END --></pre></div> <audio controls="" data-svelte-h="svelte-1x1ywin"><source src="https://huggingface.co/datasets/ylacombe/hf-course-audio-files/resolve/main/batch_2.wav" type="audio/wav">
Your browser does not support the audio element.</audio> <p data-svelte-h="svelte-a5t5uz">Third one:</p> <div class="code-block relative "><pre class=""><!-- HTML_TAG_START -->Audio(speech_output[<span class="hljs-number">2</span>], rate=sampling_rate)<!-- HTML_TAG_END --></pre></div> <audio controls="" data-svelte-h="svelte-69zylm"><source src="https://huggingface.co/datasets/ylacombe/hf-course-audio-files/resolve/main/batch_3.wav" type="audio/wav">
Your browser does not support the audio element.</audio> <blockquote class="tip"><p data-svelte-h="svelte-19jzp61">Bark, like other 🤗 Transformers models, can be optimized for speed and memory usage in just a few lines of code. To find out how, see <a href="https://colab.research.google.com/github/ylacombe/notebooks/blob/main/Benchmark_Bark_HuggingFace.ipynb" rel="nofollow">this colab demonstration notebook</a>.</p></blockquote> <h2 class="relative group"><a id="massive-multilingual-speech-mms" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#massive-multilingual-speech-mms"></a> <span>Massive Multilingual Speech (MMS)</span></h2> <p data-svelte-h="svelte-1sn7aa3">What if you are looking for a pre-trained model in a language other than English? Massive Multilingual Speech (MMS) is
another model that covers an array of speech tasks, but it also supports a large number of languages. For instance, it can
synthesize speech in over 1,100 languages.</p> <p data-svelte-h="svelte-11mw9ju">MMS for text-to-speech is based on <a href="https://arxiv.org/pdf/2106.06103.pdf" rel="nofollow">VITS (Kim et al., 2021)</a>, which is one of the
state-of-the-art TTS approaches.</p> <p data-svelte-h="svelte-5uuyg0">VITS is a speech generation network that converts text into raw speech waveforms. It works like a conditional variational
autoencoder, estimating audio features from the input text. First, acoustic features, represented as spectrograms, are
generated. The waveform is then decoded using transposed convolutional layers adapted from HiFi-GAN.
During inference, the text encodings are upsampled and transformed into waveforms using the flow module and the HiFi-GAN decoder.
As with Bark, there’s no need for a vocoder, as waveforms are generated directly.</p> <blockquote class="warning">The MMS model is a recent addition to 🤗 Transformers. If your installed version does not include it, install the library from source:
<div class="code-block relative "><pre class=""><!-- HTML_TAG_START -->pip install git+https://github.com/huggingface/transformers.git<!-- HTML_TAG_END --></pre></div></blockquote> <p data-svelte-h="svelte-7htp4h">Let’s give MMS a go, and see how we can synthesize speech in a language other than English, e.g. German.
First, we’ll load the model checkpoint and the tokenizer for the correct language:</p> <div class="code-block relative "><pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">from</span> transformers <span class="hljs-keyword">import</span> VitsModel, VitsTokenizer
model = VitsModel.from_pretrained(<span class="hljs-string">"facebook/mms-tts-deu"</span>)
tokenizer = VitsTokenizer.from_pretrained(<span class="hljs-string">"facebook/mms-tts-deu"</span>)<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1s5fwe">You may notice that, to load the MMS model, you need to use <code>VitsModel</code> and <code>VitsTokenizer</code>. This is because MMS for text-to-speech
is based on the VITS model, as mentioned earlier.</p> <p data-svelte-h="svelte-cy5yeb">Let’s pick an example text in German, like these first two lines from a children’s song:</p> <div class="code-block relative "><pre class=""><!-- HTML_TAG_START -->text_example = (
    <span class="hljs-string">"Ich bin Schnappi das kleine Krokodil, komm aus Ägypten das liegt direkt am Nil."</span>
)<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1hq1hkh">To generate a waveform output, preprocess the text with the tokenizer, and pass it to the model:</p> <div class="code-block relative "><pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">import</span> torch
inputs = tokenizer(text_example, return_tensors=<span class="hljs-string">"pt"</span>)
input_ids = inputs[<span class="hljs-string">"input_ids"</span>]
<span class="hljs-comment"># inference only, so skip gradient tracking</span>
<span class="hljs-keyword">with</span> torch.no_grad():
    outputs = model(input_ids)
speech = outputs[<span class="hljs-string">"waveform"</span>]<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-cmwyek">Let’s listen to it:</p> <div class="code-block relative "><pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">from</span> IPython.display <span class="hljs-keyword">import</span> Audio
Audio(speech, rate=<span class="hljs-number">16000</span>)<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-159l1h7">Wunderbar! If you’d like to try MMS with another language, find other suitable <code>vits</code> checkpoints <a href="https://huggingface.co/models?filter=vits" rel="nofollow">on the 🤗 Hub</a>.</p> <p data-svelte-h="svelte-ul4o7z">Now let’s see how you can fine-tune a TTS model yourself!</p> <p></p>