
Title: ControlSpeech: Towards Simultaneous and Independent Zero-shot Speaker Cloning and Zero-shot Language Style Control

URL Source: https://arxiv.org/html/2406.01205

Shengpeng Ji 1,2, Qian Chen 2, Wen Wang 2, Jialong Zuo 1, Minghui Fang 1,

Ziyue Jiang 1, Hai Huang 1, Zehan Wang 1, Xize Cheng 1, Siqi Zheng 2, Zhou Zhao 1

1 Zhejiang University 2 Alibaba Tongyi Speech Lab

Abstract

In this paper, we present ControlSpeech, a text-to-speech (TTS) system capable of fully cloning the speaker’s voice while enabling arbitrary control and adjustment of speaking style. Prior zero-shot TTS models only mimic the speaker’s voice without further control and adjustment capabilities, while prior controllable TTS models cannot perform speaker-specific voice generation. Therefore, ControlSpeech focuses on a more challenging task: a TTS system with simultaneously controllable timbre, content, and style. ControlSpeech takes speech prompts, content prompts, and style prompts as inputs and utilizes bidirectional attention and mask-based parallel decoding to capture codec representations corresponding to timbre, content, and style in a discrete decoupling codec space. Moreover, we analyze the many-to-many issue in textual style control and propose the Style Mixture Semantic Density (SMSD) module, based on Gaussian mixture density networks, to resolve this problem. To facilitate empirical validation, we make available a new style-controllable dataset called VccmDataset. Our experimental results demonstrate that ControlSpeech exhibits comparable or state-of-the-art (SOTA) performance in terms of controllability, timbre similarity, audio quality, robustness, and generalizability. Code is available at https://github.com/jishengpeng/ControlSpeech.

Corresponding author: Zhou Zhao.


Figure 1: The voice prompt, the content description, and the style description correspond to the timbre, content, and style representations in the discrete codec space in the left panel. The right panel compares ControlSpeech with previous style-controllable TTS and zero-shot TTS systems. In this comparison, we use the amplitude and color of the waveform to represent the style and timbre, respectively.

1 Introduction

Over the past decade, the field of speech synthesis has seen remarkable advancements (Ren et al., 2020; Kim et al., 2021), achieving synthesized speech that rivals real human speech in expressiveness and naturalness (Tan et al., 2024). Recently, with the development of large language models (Brown et al., 2020; Touvron et al., 2023) and generative models in other domains (Ho et al., 2020; Kim et al., 2020), the tasks of zero-shot TTS (Wang et al., 2023; Shen et al., 2023; Le et al., 2023) and style-controllable speech synthesis (Guo et al., 2023; Yang et al., 2023b) have garnered significant attention in the speech domain due to their powerful zero-shot generation and controllability capabilities. Zero-shot TTS (Kharitonov et al., 2023) refers to the ability to faithfully clone an unseen speaker’s voice using only a few seconds of a speech prompt, commonly achieved by significantly scaling up both the training data and model size. Style-controllable TTS (Guo et al., 2023), on the other hand, supports control of a speaker’s style (prosody, accent, emotion, etc.) through textual descriptions.

However, these two types of models have their own limitations. As illustrated in the right panel of Figure 1, prior zero-shot TTS (Wang et al., 2023) can clone the voice of any speaker, but the style is fixed and cannot be further controlled or adjusted. Conversely, prior style-controllable TTS (Leng et al., 2023) can synthesize speech in any desired style, but it cannot specify the timbre of the synthesized voice. Although some efforts (Yang et al., 2023b; Liu et al., 2023) have been made to control timbre with speaker IDs, these approaches are limited to constrained in-domain datasets and lack zero-shot ability. As a result, current speech synthesis systems lack independent and flexible control over content, timbre, and style at the same time; for example, they are unable to synthesize speech in Trump’s voice with a child’s joyful style saying “Today is Monday”. To address these limitations, we propose a novel model called ControlSpeech. To the best of our knowledge, ControlSpeech is the first model to simultaneously and independently control timbre, content, and style, demonstrating competitive zero-shot voice cloning and style control abilities.

There are two main challenges in achieving simultaneous and independent control over content, timbre, and style in a TTS system. First, the information from the style prompt and the speech prompt can become entangled and interfere with or contradict each other. For instance, the speech prompt might carry a style different from that described by the textual style prompt; simply adding a style prompt control module or a speech prompt control module to previous model frameworks (Leng et al., 2023; Wang et al., 2023) is therefore insufficient. Second, there is a lack of large datasets that fulfill the requirements of both zero-shot TTS systems and textual style-controllable TTS systems. Specifically, due to the scarcity of style-descriptive textual data, the training data for mainstream style-controllable TTS systems (Guo et al., 2023; Liu et al., 2023) typically amounts to only a few hundred hours (Ji et al., 2023), far from the large-scale, multi-speaker training data (Kahn et al., 2020) crucial for robust zero-shot speaker cloning. To tackle these two challenges, ControlSpeech explores a novel approach that leverages a pre-trained disentangled representation space for controllable speech generation. On one hand, disentangled representations enable independent control over content, style, and timbre. On the other hand, a representation space pre-trained on a large-scale multi-speaker dataset ensures robust zero-shot capabilities. In this work, we use the disentangled representation space from (Ju et al., 2024), pre-trained on 60,000 hours of speech (Kahn et al., 2020). During speech synthesis, we adopt an encoder-decoder architecture (Ren et al., 2020) as the backbone and integrate a high-quality, non-autoregressive, confidence-based codec generator (Chang et al., 2022; Borsos et al., 2023; Villegas et al., 2022) as the decoder.

We also identify and analyze, for the first time, the many-to-many issue in textual style-controllable TTS: different textual style descriptions may correspond to the same audio, while a single textual style description may be associated with varying degrees of a particular style for the same speaker. For instance, the phrases “The man speaks at a very rapid pace” and “The man articulates his words with considerable speed” describe the same speech style, yet “The man speaks at a very rapid pace” can also correspond to many audio clips exhibiting different levels of high speaking rate. To address this many-to-many issue in style control, we propose a novel module called Style Mixture Semantic Density (SMSD). This module integrates the global semantic information of the style control and samples from a mixture distribution (Zen and Senior, 2014; Hwang et al., 2020) of style descriptions to achieve hierarchical control. Additionally, we incorporate a noise perturbation mechanism to further enhance style diversity. The design motivation and detailed architecture of SMSD are elaborated in Section 3.3.

To comprehensively evaluate ControlSpeech’s controllability, timbre similarity, audio quality, diversity, and generalization, we create a new open-source dataset called VccmDataset, based on TextrolSpeech (Ji et al., 2023), to foster advancements in controllable TTS. In summary, our contributions are as follows:

  • We conduct a detailed analysis of existing zero-shot TTS and style-controllable TTS models and identify their inability to simultaneously and independently control content, style, and timbre in a zero-shot setting. We propose ControlSpeech to achieve independent control over these speech factors at the same time.
  • To the best of our knowledge, this is the first work to identify and analyze the many-to-many issue in text style-controllable TTS. We propose a novel Style Mixture Semantic Density (SMSD) module and investigate integrating various noise perturbation mechanisms within SMSD to enhance control diversity.
  • We conduct comprehensive experiments and demonstrate that ControlSpeech exhibits comparable or state-of-the-art performance in terms of controllability, timbre similarity, audio quality, robustness, and generalizability. We also create VccmDataset, a new open-source dataset tailored for simultaneous style and timbre control.

2 Related Work

In this section, we summarize previous studies on text prompt-based controllable TTS. Detailed discussions of discrete codecs related to ControlSpeech are provided in Appendix A.

2.1 Text Prompt Based Controllable TTS

Some recent studies propose to control speech style through natural text prompts. PromptTTS (Guo et al., 2023) employs manually annotated text prompts to describe four to five attributes of speech (gender, pitch, speaking speed, volume, and emotion). InstructTTS (Yang et al., 2023b) employs a three-stage training approach to capture semantic information from natural-language style prompts as conditioning for the TTS system. TextrolSpeech (Ji et al., 2023) introduces an efficient architecture that treats textual controllable TTS as a language-model task. PromptStyle (Liu et al., 2023) proposes a two-stage TTS approach for cross-speaker style transfer with natural-language descriptions, based on VITS (Kim et al., 2021). PromptTTS 2 (Leng et al., 2023) proposes an automatic description-creation pipeline leveraging large language models (LLMs) (Bubeck et al., 2023) and adopts a diffusion model to capture the one-to-many relationship. Audiobox (Vyas et al., 2023) proposes a unified model based on flow matching that can generate and control various audio modalities. While Audiobox supports multiple inputs, it does not decouple the speech prompt from the style prompt; consequently, when the styles in the speech prompt and the style text prompt conflict, controllability degrades significantly. We validate the necessity of decoupling in the ablation study presented in Table 4. Notably, existing style-controllable TTS models are either speaker-independent or can only control timbre via speaker IDs, without the capability for timbre cloning. The introduction of ControlSpeech expands the scope of the controllable TTS task.

Furthermore, to the best of our knowledge, ControlSpeech is the first model to identify the many-to-many problem in the field of style control. While PromptTTS 2 (Leng et al., 2023) also identifies a one-to-many issue between style descriptions and audio, it is fundamentally different from the one we identify in ControlSpeech. PromptTTS 2 attributes its one-to-many issue to the absence of timbre information in style descriptions, and thus employs a Q-Former combined with a diffusion model to generate the missing latent speech features. In contrast, we argue that textual style descriptions themselves are inherently insufficient to capture the range of variation within one style, leading to the one-to-many issue.

2.2 Zero-shot TTS

Zero-shot speech synthesis refers to the ability to synthesize the voice of an unseen speaker based solely on a few seconds of audio prompt, also known as voice cloning. In recent months, with the advancement of large generative models, a plethora of outstanding works have emerged. VALL-E (Wang et al., 2023) leverages discrete codec representations and cascades autoregressive and non-autoregressive models, preserving the powerful contextual capabilities of language models. NaturalSpeech 2 (Shen et al., 2023) employs continuous vectors instead of discrete neural codec tokens and introduces in-context learning to a latent diffusion model. NaturalSpeech 3 (Ju et al., 2024) proposes a TTS system with novel factorized diffusion models to generate natural speech in a zero-shot manner; although NaturalSpeech 3 also employs a disentangled codec representation, all its codec targets are generated from the same textual content. SpearTTS (Kharitonov et al., 2023) and Make-A-Voice (Huang et al., 2023) utilize semantic tokens to reduce the gap between text and acoustic features. VoiceBox (Le et al., 2023) is a non-autoregressive flow-matching model trained to infill speech given audio context and text. Mega-TTS (Jiang et al., 2023c, b, a) utilizes traditional mel-spectrograms, decoupling timbre and prosody and further modeling the prosody with an autoregressive approach. VoiceBox (Le et al., 2023) and P-Flow (Kim et al., 2024) employ flow-based models as generators, demonstrating robust generative performance. SoundStorm (Borsos et al., 2023) and MobileSpeech (Ji et al., 2024c) utilize non-autoregressive, mask-based iterative generation, achieving an excellent balance between inference speed and generation quality. Notably, existing zero-shot TTS models (including NaturalSpeech 3) are unable to achieve arbitrary language style control and modification. ControlSpeech is the first TTS model capable of simultaneously and independently performing zero-shot timbre cloning and style control.

3 ControlSpeech


Figure 2: Figure (a) depicts the overall architecture of ControlSpeech, an encoder-decoder parallel disentangled codec generation model. Figure (b) provides a detailed illustration of the SMSD module, which addresses the many-to-many problem in style control by sampling from the style mixture semantic distribution and incorporating an additional noise perturbator. Figure (c) shows the process of the codec generator: through masking, it generates discrete codec representations in a fully non-autoregressive manner.

3.1 Overall Architecture

As illustrated in Figure 2 (a), ControlSpeech is fundamentally an encoder-decoder model (Ji et al., 2024c) designed for parallel codec generation (Borsos et al., 2023). ControlSpeech employs three separate encoders to encode the input content prompt, style prompt, and speech prompt, respectively. Specifically, the content text is converted into phonemes and fed into the text encoder, while the style text is prepended with the special [CLS] token and encoded at the word level using BERT’s tokenizer (Devlin et al., 2018). Meanwhile, the speech prompt is processed by the pre-trained codec encoder (Ju et al., 2024) and the timbre extractor to capture the timbre information. In Figure 2, dashed boxes represent frame-level features, while solid boxes represent global features. The Style Mixture Semantic Density (SMSD) module samples the style text to generate the corresponding global style representations, which are combined with the text representations from the text encoder via a cross-attention module. The combined representations are then fed into the duration predictor and subsequently into the codec generator, a non-autoregressive Conformer based on mask iteration and parallel generation. The timbre extractor is a Transformer encoder that converts the output of the speech encoder into a global vector representing the timbre attributes.
Given a style description $X_s$, a content text $X_c$, and a speech prompt $X_t$, ControlSpeech aims to sequentially generate the corresponding style codec $Y_s$, content codec $Y_c$, and timbre embedding $Y_t$. These representations are then concatenated and upsampled into speech through the pre-trained codec decoder (Ju et al., 2024).
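As a sketch of how the global style representation conditions the frame-level text representations, a minimal single-head cross-attention in numpy is shown below. The dimensions, the random weights, and the single-head formulation are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(text_h, style_h, d_k=64, seed=0):
    """Fuse text representations (queries) with style representations (keys/values)."""
    rng = np.random.default_rng(seed)
    d_text, d_style = text_h.shape[-1], style_h.shape[-1]
    Wq = rng.standard_normal((d_text, d_k)) / np.sqrt(d_text)
    Wk = rng.standard_normal((d_style, d_k)) / np.sqrt(d_style)
    Wv = rng.standard_normal((d_style, d_k)) / np.sqrt(d_style)
    Q = text_h @ Wq                         # (T, d_k) queries from text
    K = style_h @ Wk                        # (S, d_k) keys from style
    V = style_h @ Wv                        # (S, d_k) values from style
    attn = softmax(Q @ K.T / np.sqrt(d_k))  # (T, S) attention weights
    return attn @ V                         # (T, d_k) style-conditioned text features

text_h = np.random.default_rng(1).standard_normal((10, 256))  # 10 phoneme frames (toy)
style_h = np.random.default_rng(2).standard_normal((1, 768))  # one global style vector (toy)
fused = cross_attention(text_h, style_h)
print(fused.shape)  # (10, 64)
```

Because the style side here is a single global vector, each text frame attends to it with weight 1; with a longer style sequence the same code distributes attention across positions.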

3.2 Codec Decoupling and Generation

3.2.1 Decouple Content, Style, and Timbre

ControlSpeech leverages a pre-trained disentangled representation space to separate different aspects of speech. We utilize FACodec (Ju et al., 2024) as our codec disentangler and timbre extractor, since FACodec facilitates codec decoupling and is pre-trained on a large-scale, multi-speaker dataset, ensuring robust zero-shot TTS capabilities. Specifically, during the training of ControlSpeech, we freeze the codec encoder to obtain downsampled compressed audio frames $h$ from the target speech $Y$. The frames $h$ are processed through the disentangling quantizer and the timbre extractor (Ju et al., 2024) to derive the content codec $Y_c$, prosody codec $Y_p$, acoustic codec $Y_a$, and timbre information $Y_t$. Theoretically, after excluding the content $Y_c$ and timbre information $Y_t$, the remaining representation collectively constitutes the style codec $Y_s$. In practice, we concatenate the prosody codec $Y_p$ and the acoustic codec $Y_a$ along the channel dimension to obtain the style codec $Y_s$, as follows:

$Y_s = \mathrm{concat}(Y_p, Y_a)$ (1)
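The channel-wise concatenation of Eq. (1) can be illustrated with a toy numpy example; the codebook counts assigned to the prosody and acoustic codecs below are assumptions for illustration only:

```python
import numpy as np

T = 120                                          # downsampled frames (toy value)
rng = np.random.default_rng(0)
Yp = rng.integers(0, 1024, size=(T, 1))          # prosody codec tokens (assumed: 1 codebook)
Ya = rng.integers(0, 1024, size=(T, 2))          # acoustic codec tokens (assumed: 2 codebooks)

# Eq. (1): concatenate along the channel (codebook) dimension
Ys = np.concatenate([Yp, Ya], axis=1)
print(Ys.shape)  # (120, 3)
```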

3.2.2 Codec Generation Process

The codec generation comprises two stages.

In the first stage, we use paired text and speech data $\{X, Y_{codec}\}$, where $X = \{x_1, x_2, x_3, \cdots, x_T\}$ represents the cross-attention fusion of the global style representations and the aligned text representations, and $Y_{codec}$ denotes the speech representations obtained through vector quantization, formulated as follows:

$Y_{codec} = \mathrm{concat}(Y_s, Y_c) = C_{1:T,\,1:N} \in \mathbb{R}^{T \times N}$ (2)

where $T$ denotes the downsampled utterance length, equal to the text length extended by the duration predictor, and $N$ represents the number of channels per frame. The row vector $C_{t,\,1:N}$ of the acoustic code matrix represents the $N$ codes for frame $t$, and the column vector $C_{1:T,\,i}$ represents the $i$-th codebook sequence (of length $T$), where $i \in \{1, 2, \cdots, N\}$. Following VALL-E (Wang et al., 2023), during training we randomly select the $i$-th channel $C_{1:T,\,i}$ for training. For the generation of the $i$-th channel $P(C_{1:T,\,i} \mid X_{1:T}; \theta)$, as illustrated in Figure 2 (c), we employ a mask-based generative model as our parallel decoder.
We sample the mask $M_i \in \{0,1\}^T$ for codec level $i$ according to a cosine schedule (Chang et al., 2022): the masking ratio is $p = \cos(u')$ with $u' \sim \mathcal{U}[0, \frac{\pi}{2}]$, and the mask $M_i \sim \mathrm{Bernoulli}(p)$. Here, $M_i$ represents the masked portion of the $i$-th level, while $\bar{M_i}$ denotes the unmasked portion.
As shown in Figure 2 (c), the prediction of $C_{1:T,\,i}$ is refined based on the previous $j\,(j < i)$ channels $C_{1:T,\,<i}$, the target text $X_{1:T}$, and the unmasked portion $\bar{M_i} C_{1:T,\,i}$ of the $i$-th channel. The prediction can therefore be specified as $P(C_{1:T,\,i} \mid X_{1:T}; \theta) = P(M_i C_{1:T,\,i} \mid C_{1:T,\,<i}, X_{1:T}, \bar{M_i} C_{1:T,\,i}; \theta)$.
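The cosine-schedule mask sampling described above can be sketched as follows (a minimal illustration; the convention that `True` marks a masked position is ours):

```python
import numpy as np

def sample_mask(T, rng):
    """Sample a per-frame mask for one codec level via the cosine schedule
    (Chang et al., 2022): p = cos(u'), u' ~ U[0, pi/2], M_i ~ Bernoulli(p)."""
    u = rng.uniform(0.0, np.pi / 2)   # u' ~ U[0, pi/2]
    p = np.cos(u)                     # masking ratio in (0, 1]
    mask = rng.random(T) < p          # True = masked position, predicted by the model
    return p, mask

rng = np.random.default_rng(0)
p, mask = sample_mask(T=100, rng=rng)
# masked positions are predicted from the text, the lower codec levels,
# and the unmasked tokens of the same level
print(round(float(p), 3), int(mask.sum()))
```

Because $\cos$ is near 1 for small $u'$, ratios close to full masking are sampled often, which matches the heavy-masking regime the model must handle at the first inference iteration.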

In the second stage, as illustrated in Figure 2 (c), following AdaSpeech (Chen et al., 2021), we utilize a conditional normalization layer to fuse the previously obtained $Y_{codec}$ and the global timbre embedding $Y_t$, resulting in $Y'$. This result $Y'$ is then processed by the pre-trained codec decoder (Ju et al., 2024) to generate the final speech output $Y$. Specifically, we first use two simple linear layers $W_\gamma$ and $W_\beta$, which take the global timbre embedding $Y_t$ as input and output the scale vector $\gamma$ and bias vector $\beta$, respectively. These lightweight, learnable scale and bias vectors are then fused with $Y_{codec}$. This process can be represented by the following formula:

$Y = \mathrm{CodecDecoder}\!\left(W_\gamma Y_t \, \frac{Y_{codec} - \mu_c}{\sigma_c^{2}} + W_\beta Y_t\right)$ (3)

where $\mu_c$ and $\sigma_c^{2}$ are the mean and variance of the hidden representation of $Y_{codec}$.
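A minimal numpy sketch of this conditional normalization step follows. It normalizes by the standard deviation, as in standard layer normalization; all dimensions and weight scales are illustrative assumptions:

```python
import numpy as np

def conditional_norm(Y_codec, Y_t, W_gamma, W_beta, eps=1e-5):
    """Condition frame-level codec features on a global timbre embedding:
    normalize per frame, then apply timbre-derived scale and bias (cf. Eq. (3))."""
    mu = Y_codec.mean(axis=-1, keepdims=True)    # per-frame mean mu_c
    var = Y_codec.var(axis=-1, keepdims=True)    # per-frame variance sigma_c^2
    normed = (Y_codec - mu) / np.sqrt(var + eps)
    gamma = Y_t @ W_gamma                        # scale vector from timbre embedding
    beta = Y_t @ W_beta                          # bias vector from timbre embedding
    return gamma * normed + beta

rng = np.random.default_rng(0)
Y_codec = rng.standard_normal((120, 256))        # frame-level hidden features (toy)
Y_t = rng.standard_normal(128)                   # global timbre embedding (toy)
W_gamma = rng.standard_normal((128, 256)) * 0.01
W_beta = rng.standard_normal((128, 256)) * 0.01
Y_prime = conditional_norm(Y_codec, Y_t, W_gamma, W_beta)
print(Y_prime.shape)  # (120, 256)
```

The design choice mirrors AdaSpeech's conditional layer normalization: the timbre information enters only through the lightweight scale and bias, keeping the content/style pathway untouched.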

3.3 The Style Mixture Semantic Density (SMSD) Module

We identify a many-to-many relationship between style text descriptions and their corresponding audio. Specifically, different style descriptions can correspond to the same audio sample (many-to-one), while a single style description may correspond to multiple audio samples with varying degrees of the same style (one-to-many). More precisely, the many-to-one relationship arises because multiple textual descriptions can refer to the same style of speech. For example, both “Her speaking speed is considerably fast” and “Her speech rate is remarkably fast” refer to the “fast-speed” style and could correspond to the same audio sample. The one-to-many relationship, on the other hand, occurs because a single textual description cannot capture the varying degrees of a style. For instance, if we divide speech tempo into 100 levels, any speech with a tempo above 70 may be considered “fast-speed”; as a result, a description suggesting “fast speed” could correspond to different audio samples with speech rates of 75, 80, or even 90 for the same speaker.

To address the many-to-many issue in style control, we propose the Style Mixture Semantic Density (SMSD) module. To address the many-to-one issue, similar to previous approaches (Guo et al., 2023; Liu et al., 2023), we utilize a pre-trained BERT model within the SMSD module to extract the semantic representation $X_s'$ from style descriptions, thereby aligning different style texts into the same semantic space and enhancing generalization to out-of-domain style descriptions. To address the one-to-many issue, we observe that this phenomenon of a single style description corresponding to multiple audio samples with varying degrees of style closely aligns with the motivation of mixture density networks (MDN). We hypothesize that $X_s'$, as the semantic representation of style, can be modeled as a global mixture of Gaussian distributions, where different Gaussians represent varying degrees of a particular style. During training, each independent Gaussian distribution is weighted by a learnable coefficient and summed. By constraining the KL divergence between the style representation distribution of the target audio and this mixture density distribution, we establish a one-to-one correspondence between the style text and the target audio. This approach also enhances the diversity of style control driven directly by text descriptions. During inference, we sample an independent Gaussian distribution from the mixture of style semantic distributions, with each sampled distribution reflecting a different degree of the same style.
Additionally, to further enhance the diversity of style control, we incorporate a noise perturbation module within the MDN network of SMSD. The noise perturbation module controls the isotropy of perturbations across different dimensions.

Specifically, a raw style prompt $X_s = [X_1, X_2, X_3, \cdots, X_L]$, where $L$ denotes the length of the style prompt, is prepended with a $[CLS]$ token, converted into word embeddings, and fed into the BERT model. The hidden vector corresponding to the $[CLS]$ token is regarded as the global style semantic representation $X_s'$, which guides the generation and sampling of subsequent modules. Based on the MDN network (Zen and Senior, 2014; Duan, 2019; Du and Yu, 2021), we aim to regress the target style representation $Y_s' \in \mathbb{R}^d$ using the style semantic input representation $X_s' \in \mathbb{R}^n$ as covariates, where $d$ and $n$ are the respective dimensions.
We model the conditional distribution as a mixture of Gaussians, as follows:

$$P_{\theta}({Y_s}' \mid {X_s}') = \sum_{k=1}^{K} \pi_k \, \mathcal{N}\big(\mu^{(k)}, {\sigma^2}^{(k)}\big) \qquad (4)$$

where $K$ is a hyperparameter giving the number of independent Gaussian components, and the remaining mixture parameters $\pi_k$, $\mu^{(k)}$, ${\sigma^2}^{(k)}$ are outputs of a neural MDN network $f_{\theta}$ conditioned on the input style semantic representation ${X_s}'$, as follows:

$$\pi \in \Delta^{K-1},\quad \mu^{(k)} \in \mathbb{R}^{d},\quad {\sigma^2}^{(k)} \in S_{+}^{d} = f_{\theta}({X_s}') \qquad (5)$$

Note that the sum of the mixture weights is constrained to 1 during the training phase, which is achieved by applying a softmax function to the corresponding neural network outputs $a_k$, as follows:

$$\pi_k = \frac{\exp(a_k)}{\sum_{k=1}^{K} \exp(a_k)} \qquad (6)$$
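
As a concrete illustration, the normalization in Eq. (6) and the subsequent sampling from the mixture in Eq. (4) can be sketched in a few lines of NumPy. The component count, style dimension, and random logits below are illustrative assumptions, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(a):
    # Eq. (6): normalize raw logits a_k into mixture weights pi_k summing to 1
    e = np.exp(a - a.max())
    return e / e.sum()

def sample_style(pi, mu, sigma2, rng):
    # Sample from the K-component Gaussian mixture of Eq. (4):
    # pick a component k ~ Categorical(pi), then draw from N(mu^(k), sigma2^(k) I)
    k = rng.choice(len(pi), p=pi)
    return mu[k] + np.sqrt(sigma2[k]) * rng.standard_normal(mu.shape[1])

K, d = 4, 8                          # hypothetical component count and style dimension
a = rng.standard_normal(K)           # raw MDN outputs a_k (logits)
pi = softmax(a)
mu = rng.standard_normal((K, d))     # per-component means mu^(k)
sigma2 = np.full(K, 0.1)             # isotropic per-component variances

y = sample_style(pi, mu, sigma2, rng)
assert np.isclose(pi.sum(), 1.0) and y.shape == (d,)
```

Sampling a component first and then a Gaussian draw is what lets the same style prompt yield diverse yet plausible style representations at inference time.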

To further enhance the diversity of style control, we design a specialized noise perturbation module within the SMSD module to constrain the noise model. As illustrated by the circles within the SMSD module in Figure 2 (b), this noise perturbation module regulates the isotropy of the perturbation $\varepsilon$ across the dimensions of the variance ${\sigma^2}^{(k)}$. The four types of perturbations, from left to right in Figure 2 (b), are as follows:

  • **Fully factored**: ${\sigma^2}^{(k)} = f_{\theta}({X_s}') + f_{\theta}(\varepsilon) = \mathrm{diag}({\sigma^2}^{(k)}) \in \mathbb{R}_{+}^{d}$, which predicts the noise level for each dimension separately.
  • **Isotropic**: ${\sigma^2}^{(k)} = f_{\theta}({X_s}') + f_{\theta}(\varepsilon) = {\sigma^2}^{(k)} I \in \mathbb{R}_{+}$, which assumes the same noise level for every dimension over $d$.
  • **Isotropic across clusters**: ${\sigma^2}^{(k)} = f_{\theta}({X_s}') + f_{\theta}(\varepsilon) = \sigma^2 I \in \mathbb{R}_{+}$, which assumes the same noise level for every dimension over $d$ and across clusters.
  • **Fixed isotropic**: the same as isotropic across clusters, but $\sigma^2$ is not learned.
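
The four parameterizations differ only in how many variance scalars are learned. A schematic NumPy sketch, with random positive values standing in for the outputs of $f_\theta$ (an assumption for illustration):

```python
import numpy as np

K, d = 4, 8  # hypothetical component count and style dimension
raw = np.abs(np.random.default_rng(1).standard_normal((K, d))) + 1e-3

# Fully factored: one variance per component and per dimension -> shape (K, d)
fully_factored = raw

# Isotropic: one variance per component, shared across all d dimensions -> shape (K,)
isotropic = raw.mean(axis=1)

# Isotropic across clusters: a single variance shared by every component -> scalar
iso_across_clusters = raw.mean()

# Fixed isotropic: same scalar shape, but held constant instead of learned
fixed_isotropic = 0.1

assert fully_factored.shape == (K, d)
assert isotropic.shape == (K,)
```

The learned parameter count thus shrinks from $K \times d$ to $K$ to $1$ to $0$, trading per-dimension flexibility for regularization.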

As shown in the experimental results in Appendix I, isotropic across clusters outperforms the other types by striking a balance between accuracy and diversity, and is therefore used as the noise perturbation mode. With the noise perturbation module, we obtain more robust mean, variance, and weight parameters for the mixture of Gaussian distributions. The training objective of the SMSD module is the negative log-likelihood of the observation ${Y_s}'$ given its input ${X_s}'$. The loss function is formulated as $\mathcal{L}_{SMSD} = -\operatorname{logsumexp}_{k}\big(\log \pi_k - \frac{1}{2}\big\|\frac{{Y_s}' - \mu^{(k)}}{\sigma}\big\|^{2}\big)$.
Details for deriving the non-convex $\mathcal{L}_{SMSD}$ are in Appendix K.
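
Under the isotropic-across-clusters parameterization (a shared scalar $\sigma$), the loss above can be computed stably with the log-sum-exp trick; like the formula in the text, this sketch drops the additive normalization constant, which does not affect optimization of $\pi$ and $\mu$:

```python
import numpy as np

def smsd_nll(y, pi, mu, sigma):
    # L_SMSD = -logsumexp_k( log pi_k - 0.5 * || (y - mu^(k)) / sigma ||^2 )
    log_terms = np.log(pi) - 0.5 * np.sum(((y - mu) / sigma) ** 2, axis=1)
    m = log_terms.max()  # subtract the max for numerical stability
    return -(m + np.log(np.exp(log_terms - m).sum()))

rng = np.random.default_rng(2)
K, d = 4, 8
pi = np.full(K, 1.0 / K)              # uniform mixture weights for illustration
mu = rng.standard_normal((K, d))      # component means
sigma = 0.5                           # shared scalar (isotropic across clusters)

loss_near = smsd_nll(mu[0], pi, mu, sigma)        # target at a component mean
loss_far = smsd_nll(mu[0] + 5.0, pi, mu, sigma)   # target far from every mode
assert loss_near < loss_far   # the loss is lower (likelihood higher) near a mode
```

Without the max-subtraction, the exponentials underflow to zero whenever the target is far from all components, which is exactly the regime early in training.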

3.4 Training and Inference

During the training process, the duration predictor is optimized using a mean squared error loss, with extracted durations serving as the training target. We employ the Montreal Forced Aligner (MFA) tool (McAuliffe et al., 2017) to extract phoneme durations, and denote the duration predictor loss as $\mathcal{L}_{dur}$. The codec generator module is optimized using a cross-entropy loss; we randomly select a channel for optimization and denote this loss as $\mathcal{L}_{codec}$. In the SMSD module, the target style representation ${Y_s}'$ is the global style representation obtained by passing the style codec $Y_s$ through the style extractor. During training, we feed the ground-truth style representation ${Y_s}'$ and the ground-truth duration into the codec generator and the duration predictor, respectively. The overall loss $\mathcal{L}$ for ControlSpeech is the sum of losses:

$$\mathcal{L} = \mathcal{L}_{codec} + \mathcal{L}_{dur} + \mathcal{L}_{SMSD} \qquad (7)$$

During the inference stage, we first input the original stylistic descriptor $X_s$ into the BERT module to obtain the style semantic representation ${X_s}'$, and then feed ${X_s}'$ into the SMSD module to obtain the corresponding $\pi$, $\mu$, and $\sigma^2$. By sampling from the resulting mixture, we derive the predicted style representation. Subsequently, we iteratively generate discrete acoustic tokens by incorporating the predicted style into the text state and employing the confidence-based sampling scheme (Chang et al., 2022; Borsos et al., 2023). Specifically, we perform multiple forward passes; at each iteration $j$, we sample candidates for the masked positions and retain $P_j$ candidates based on their confidence scores, where $P_j$ follows a cosine schedule. Finally, by integrating the timbre prompt through the condition normalization layer and feeding it into the codec decoder, we generate the final speech output.
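
The iterative confidence-based decoding can be sketched as follows, in the style of MaskGIT (Chang et al., 2022). Random integers and random scores stand in for the model's sampled codec tokens and confidences (both are assumptions for illustration); the cosine schedule determines how many positions remain masked after each iteration:

```python
import numpy as np

def cosine_mask_schedule(T, steps):
    # Number of positions still masked after iteration j: T * cos(pi/2 * j/steps),
    # so early steps commit few tokens and later steps commit many.
    return [int(np.floor(T * np.cos(np.pi / 2 * (j + 1) / steps))) for j in range(steps)]

def iterative_decode(T, steps, rng):
    tokens = np.full(T, -1)                        # -1 marks a masked position
    for n_masked in cosine_mask_schedule(T, steps):
        cand = rng.integers(0, 1024, size=T)       # stand-in for sampled codec tokens
        conf = rng.random(T)                       # stand-in for model confidences
        conf[tokens != -1] = np.inf                # committed tokens stay fixed
        keep = np.argsort(-conf)[: T - n_masked]   # retain the P_j most confident positions
        tokens[keep] = np.where(tokens[keep] == -1, cand[keep], tokens[keep])
    return tokens

out = iterative_decode(T=16, steps=8, rng=np.random.default_rng(3))
assert (out != -1).all()   # every position is decoded after the final iteration
```

Because the schedule reaches zero masked positions at the last step, the full token sequence is produced in a fixed, small number of forward passes rather than one pass per token as in autoregressive decoding.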

4 Experiments

4.1 Experimental Setup

VccmDataset.

To the best of our knowledge, there is no large-scale TTS dataset that includes both text style prompts and speaker prompts. We therefore build upon the TextrolSpeech dataset (Ji et al., 2023) to create VccmDataset. Based on TextrolSpeech, we optimize the pitch distribution, label boundaries, and dataset splits, and then select new test sets. Specifically, we use LibriTTS and the emotional data from TextrolSpeech as the base databases, and annotate each speech sample with five attribute labels: gender, volume, speed, pitch, and emotion. We use the gender labels available in the online metadata. For volume, we compute the L2-norm of the amplitude of each short-time Fourier transform frame. We utilize the Montreal Forced Aligner tool (McAuliffe et al., 2017) to extract phoneme durations and silence segments, and then calculate the average duration of each phoneme within voiced segments as the speaking speed. The Parselmouth tool (https://github.com/YannickJadoul/Parselmouth) is employed to extract the fundamental frequency (f0), and the geometric mean across all voiced regions is taken as the pitch value. We partition the speech samples into three categories (high/normal/low) according to the proportion of speed, pitch, and volume values, respectively. Because attribute values of samples in adjacent categories lie close together, we exclude the 5% of data samples at the boundaries of each interval for each attribute. In particular, we use gender-specific thresholds to bin pitch into three levels. After obtaining more accurate labels through these procedures, we align each audio segment with the corresponding style description text in TextrolSpeech based on the labeled attributes to obtain VccmDataset. We then select four distinct test sets from VccmDataset: test sets A, B, C, and D. Details of the VccmDataset test sets are in Appendix C.
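
The three-way binning with boundary exclusion can be sketched as follows. The tertile cut points and the margin defined as a fraction of the value range are our illustrative assumptions; the paper partitions by the proportion of values and drops the 5% of samples nearest each boundary:

```python
import numpy as np

def bin_attribute(values, boundary_frac=0.05):
    # Split values into low/normal/high by tertiles, then drop samples whose
    # value lies within a margin of either cut point (ambiguous labels).
    lo, hi = np.percentile(values, [100 / 3, 200 / 3])
    labels = np.where(values < lo, "low", np.where(values > hi, "high", "normal"))
    margin = boundary_frac * (values.max() - values.min())
    keep = (np.abs(values - lo) > margin) & (np.abs(values - hi) > margin)
    return labels[keep], keep

rng = np.random.default_rng(4)
pitch = rng.normal(200.0, 40.0, size=1000)   # e.g. per-utterance geometric-mean f0 in Hz
labels, keep = bin_attribute(pitch)
assert set(labels.tolist()) <= {"low", "normal", "high"}
assert keep.sum() < len(pitch)               # some boundary samples are excluded
```

In the paper's pipeline this binning is applied per attribute (and with gender-specific cut points for pitch) before aligning samples with their style descriptions.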

Baselines.

To ensure a fair comparison of the actual performance of various models, we reimplement several SOTA style-controllable models, including PromptStyle (Liu et al., 2023), Salle (Ji et al., 2023), InstructTTS (Yang et al., 2023b), and PromptTTS 2 (Leng et al., 2023), to serve as the primary comparative models for evaluating the controllability of ControlSpeech. For the comparison of voice cloning effectiveness, we reimplement the VALL-E model (Wang et al., 2023) and the MobileSpeech model (Ji et al., 2024c), representatives of the autoregressive and parallel generation paradigms, respectively. All reproduced baselines will also be made publicly available.

Evaluation Metrics and Experimental Settings.

For objective evaluations, we adopt the common metrics used in prior works (Guo et al., 2023; Ji et al., 2023; Leng et al., 2023). To evaluate the model's style controllability, we use the accuracy of pitch, speaking speed, volume, and emotion as metrics, which measure the correspondence between the style factors in the output speech and those in the prompts. We evaluate timbre similarity (Spk-sv) between the original prompt and the synthesized speech, and evaluate speech synthesis accuracy and robustness by using an ASR system to transcribe the synthesized speech and computing the word error rate (WER) against the content prompt. For subjective evaluations, we conduct mean opinion score (MOS) evaluations on the test set via crowdsourcing to measure audio naturalness. We further analyze MOS in two aspects: MOS-Q (quality, assessing the clarity and naturalness of duration and pitch) and MOS-S (speaker similarity). We also design new subjective MOS metrics: MOS-TS (timbre similarity), MOS-SD (style diversity), and MOS-SA (style accuracy). Details of the evaluation metrics, experimental settings, and model architecture are provided in Appendices D, E, and F, respectively.
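
As a reference point for the robustness metric, WER is the word-level Levenshtein (edit) distance between the ASR transcript and the content prompt, normalized by the reference length. A minimal self-contained sketch:

```python
def wer(reference: str, hypothesis: str) -> float:
    # Word-level Levenshtein distance (substitutions + insertions + deletions)
    # divided by the number of reference words.
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

assert wer("the cat sat", "the cat sat") == 0.0
assert wer("the cat sat", "the bat sat") == 1 / 3   # one substitution out of three words
```

In practice, transcripts are normalized (casing, punctuation) before scoring, and a lower WER indicates more intelligible and robust synthesis.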

Table 1: The style controllability evaluation results of style-controlled models on VccmDataset test set A. Pitch, Speed, Volume, and Emotion denote the accuracy of the corresponding style factor. ± denotes standard deviation.

4.2 Results and Discussions

Evaluation on style controllability.

We first compare the performance of ControlSpeech with various SOTA models on the style controllability task. The evaluation is conducted on the 1,500-sample VccmDataset test set A. To eliminate the influence of timbre variations on the controllability results of ControlSpeech, we use the ground-truth (GT) timbre as the prompt for ControlSpeech. We compare the controllability of the models using pitch accuracy, speed accuracy, volume accuracy, and emotion accuracy, and measure the quality of the generated audio using WER, timbre similarity (Spk-sv), and MOS-Q. Results are shown in Table 1, from which we draw the following conclusions: 1) Comparing ControlSpeech with the other baselines on controllability metrics, we find that, except for pitch accuracy, ControlSpeech achieves the best results in volume, speed, and emotion classification accuracy. Upon analyzing the synthesized audio of ControlSpeech, we attribute the degraded pitch accuracy to the difficulty of simultaneously controlling different timbres and styles. 2) In terms of the Spk-sv, MOS-Q, and WER metrics, the audio generated by ControlSpeech demonstrates the best timbre similarity, audio quality, and robustness.

Evaluation on the timbre cloning task.

Table 2: The timbre cloning results of different zero-shot models on the VccmDataset test set B.

To evaluate the timbre cloning capability of ControlSpeech in an out-of-domain speaker scenario, we compare the performance of ControlSpeech with SOTA models such as VALL-E and MobileSpeech on the out-of-domain speaker test set (test set B). The experimental results are shown in Table 2. We observe that on the robustness metric (WER), zero-shot TTS systems trained on small datasets perform worse than ControlSpeech; we attribute these performance gains to ControlSpeech's pre-trained speaker prompt component. Additionally, on the MOS-Q and MOS-S metrics, ControlSpeech maintains performance comparable to zero-shot TTS systems on the timbre cloning task.

Evaluation on addressing the many-to-many issue.

To better evaluate the performance of style-controllable models on addressing the many-to-many issue, we compare ControlSpeech with controllable baseline models on the VccmDataset test set D. Results are shown in Table 3. We find that ControlSpeech markedly outperforms PromptStyle and InstructTTS on both the MOS-SA (style accuracy) and MOS-SD (style diversity) metrics, suggesting that the unique SMSD module in ControlSpeech enables the model to synthesize speech that is both accurate and diverse in style.

Table 3: The results under many-to-many style control conditions on VccmDataset test set D. MOS-TS, MOS-SA, and MOS-SD measure timbre similarity, style accuracy, and style diversity, respectively.

4.3 Ablation Studies

We validate the necessity of the codec decoupling and the SMSD module. We also investigate the impact of the hyperparameters of the mixture distributions and of various noise models in Appendices H and I.

Table 4: An ablation experiment on the impact of codec decoupling on VccmDataset test set A.

Decoupled codec. To analyze the impact of decoupling, we keep the main framework of ControlSpeech and directly encode the speech prompt and style prompt using the frozen speech EnCodec encoder and a style encoder (replicated from the structure of the text encoder), respectively, then feed them into the codec generator through cross-attention. We denote this model as ControlSpeech w/o decoupling. As shown in Table 4, ControlSpeech w/o decoupling performs substantially worse in controllability than ControlSpeech, suggesting that the speech prompt and style prompt may indeed interfere with each other.

The SMSD module. We replace the SMSD module with a style encoder (replicated from the structure of the text encoder) and denote this model as ControlSpeech w/o SMSD. As shown in Table 3, ControlSpeech w/o SMSD performs markedly worse in terms of MOS-SA and MOS-SD than ControlSpeech, which strongly validates that the SMSD module enables more fine-grained control of the model's style and increases style diversity through style sampling. We also visualize the distribution of the SMSD under varying pitch/speed/volume (details in Appendix B).

5 Conclusion

In this paper, we present ControlSpeech, the first TTS system capable of simultaneously performing zero-shot timbre cloning and zero-shot style control independently. Additionally, we identify a many-to-many problem in style control and design a unique SMSD module. We will also open source VccmDataset to foster community development.

6 Acknowledgments

This work was supported in part by the National Natural Science Foundation of China under Grant No. 62222211 and No. U24A20326.

Limitations

In this work, we introduce ControlSpeech, the first TTS system capable of simultaneously cloning timbre and controlling style independently. While ControlSpeech has demonstrated competitive controllability and cloning capabilities, there remains considerable scope for further research and improvement based on the current framework.

Larger Training Datasets.

The field of style-controllable TTS demands larger training datasets. Although TextrolSpeech and our VccmDataset have established a foundation, we hypothesize that achieving more advanced speech controllability may require datasets comprising tens of thousands of hours of speech with style descriptions.

Exploring Generative Models.

In this work, we experiment with decoupled codecs and non-autoregressive parallel generative models. In future research, we plan to explore a broader range of generative model architectures and audio representations.

References

  • Ahn et al. (2024) Sunghwan Ahn, Beom Jun Woo, Min Hyun Han, Chanyeong Moon, and Nam Soo Kim. 2024. Hilcodec: High fidelity and lightweight neural audio codec. arXiv preprint arXiv:2405.04752.
  • Borsos et al. (2023) Zalán Borsos, Matt Sharifi, Damien Vincent, Eugene Kharitonov, Neil Zeghidour, and Marco Tagliasacchi. 2023. Soundstorm: Efficient parallel audio generation. arXiv preprint arXiv:2305.09636.
  • Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901.
  • Bubeck et al. (2023) Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, et al. 2023. Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712.
  • Chang et al. (2022) Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, and William T Freeman. 2022. Maskgit: Masked generative image transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11315–11325.
  • Chen et al. (2021) Mingjian Chen, Xu Tan, Bohan Li, Yanqing Liu, Tao Qin, Sheng Zhao, and Tie-Yan Liu. 2021. Adaspeech: Adaptive text to speech for custom voice. arXiv preprint arXiv:2103.00993.
  • Chen et al. (2022) Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, et al. 2022. Wavlm: Large-scale self-supervised pre-training for full stack speech processing. IEEE Journal of Selected Topics in Signal Processing, 16(6):1505–1518.
  • Défossez et al. (2022) Alexandre Défossez, Jade Copet, Gabriel Synnaeve, and Yossi Adi. 2022. High fidelity neural audio compression. arXiv preprint arXiv:2210.13438.
  • Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  • Du and Yu (2021) Chenpeng Du and Kai Yu. 2021. Phone-level prosody modelling with gmm-based mdn for diverse and controllable speech synthesis. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 30:190–201.
  • Du et al. (2024) Zhihao Du, Shiliang Zhang, Kai Hu, and Siqi Zheng. 2024. Funcodec: A fundamental, reproducible and integrable open-source toolkit for neural speech codec. In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 591–595. IEEE.
  • Duan (2019) Tony Duan. 2019. tonyduan/mdn. GitHub repository.
  • Gulati et al. (2020) Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, et al. 2020. Conformer: Convolution-augmented transformer for speech recognition. arXiv preprint arXiv:2005.08100.
  • Guo et al. (2023) Zhifang Guo, Yichong Leng, Yihan Wu, Sheng Zhao, and Xu Tan. 2023. Prompttts: Controllable text-to-speech with text descriptions. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE.
  • Ho et al. (2020) Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851.
  • Huang et al. (2023) Rongjie Huang, Chunlei Zhang, Yongqi Wang, Dongchao Yang, Luping Liu, Zhenhui Ye, Ziyue Jiang, Chao Weng, Zhou Zhao, and Dong Yu. 2023. Make-a-voice: Unified voice synthesis with discrete representation. arXiv preprint arXiv:2305.19269.
  • Huang et al. (2024) Zhichao Huang, Chutong Meng, and Tom Ko. 2024. Repcodec: A speech representation codec for speech tokenization. Preprint, arXiv:2309.00169.
  • Hwang et al. (2020) Min-Jae Hwang, Eunwoo Song, Ryuichi Yamamoto, Frank Soong, and Hong-Goo Kang. 2020. Improving lpcnet-based text-to-speech with linear prediction-structured mixture density network. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7219–7223. IEEE.
  • Ji et al. (2024a) Shengpeng Ji, Minghui Fang, Ziyue Jiang, Rongjie Huang, Jialung Zuo, Shulei Wang, and Zhou Zhao. 2024a. Language-codec: Reducing the gaps between discrete codec representation and speech language models. arXiv preprint arXiv:2402.12208.
  • Ji et al. (2024b) Shengpeng Ji, Ziyue Jiang, Xize Cheng, Yifu Chen, Minghui Fang, Jialong Zuo, Qian Yang, Ruiqi Li, Ziang Zhang, Xiaoda Yang, et al. 2024b. Wavtokenizer: an efficient acoustic discrete codec tokenizer for audio language modeling. arXiv preprint arXiv:2408.16532.
  • Ji et al. (2024c) Shengpeng Ji, Ziyue Jiang, Hanting Wang, Jialong Zuo, and Zhou Zhao. 2024c. Mobilespeech: A fast and high-fidelity framework for mobile zero-shot text-to-speech. arXiv preprint arXiv:2402.09378.
  • Ji et al. (2023) Shengpeng Ji, Jialong Zuo, Minghui Fang, Ziyue Jiang, Feiyang Chen, Xinyu Duan, Baoxing Huai, and Zhou Zhao. 2023. Textrolspeech: A text style control speech corpus with codec language text-to-speech models. arXiv preprint arXiv:2308.14430.
  • Jiang et al. (2023a) Ziyue Jiang, Jinglin Liu, Yi Ren, Jinzheng He, Zhenhui Ye, Shengpeng Ji, Qian Yang, Chen Zhang, Pengfei Wei, Chunfeng Wang, et al. 2023a. Boosting prompting mechanisms for zero-shot speech synthesis. In The Twelfth International Conference on Learning Representations.
  • Jiang et al. (2023b) Ziyue Jiang, Jinglin Liu, Yi Ren, Jinzheng He, Chen Zhang, Zhenhui Ye, Pengfei Wei, Chunfeng Wang, Xiang Yin, Zejun Ma, et al. 2023b. Mega-tts 2: Zero-shot text-to-speech with arbitrary length speech prompts. arXiv preprint arXiv:2307.07218.
  • Jiang et al. (2023c) Ziyue Jiang, Yi Ren, Zhenhui Ye, Jinglin Liu, Chen Zhang, Qian Yang, Shengpeng Ji, Rongjie Huang, Chunfeng Wang, Xiang Yin, et al. 2023c. Mega-tts: Zero-shot text-to-speech at scale with intrinsic inductive bias. arXiv preprint arXiv:2306.03509.
  • Ju et al. (2024) Zeqian Ju, Yuancheng Wang, Kai Shen, Xu Tan, Detai Xin, Dongchao Yang, Yanqing Liu, Yichong Leng, Kaitao Song, Siliang Tang, et al. 2024. Naturalspeech 3: Zero-shot speech synthesis with factorized codec and diffusion models. arXiv preprint arXiv:2403.03100.
  • Kahn et al. (2020) Jacob Kahn, Morgane Rivière, Weiyi Zheng, Evgeny Kharitonov, Qiantong Xu, Pierre-Emmanuel Mazaré, Julien Karadayi, Vitaliy Liptchinsky, Ronan Collobert, Christian Fuegen, et al. 2020. Libri-light: A benchmark for asr with limited or no supervision. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7669–7673. IEEE.
  • Kharitonov et al. (2023) Eugene Kharitonov, Damien Vincent, Zalán Borsos, Raphaël Marinier, Sertan Girgin, Olivier Pietquin, Matt Sharifi, Marco Tagliasacchi, and Neil Zeghidour. 2023. Speak, read and prompt: High-fidelity text-to-speech with minimal supervision. arXiv preprint arXiv:2302.03540.
  • Kim et al. (2020) Jaehyeon Kim, Sungwon Kim, Jungil Kong, and Sungroh Yoon. 2020. Glow-tts: A generative flow for text-to-speech via monotonic alignment search. Advances in Neural Information Processing Systems, 33:8067–8077.
  • Kim et al. (2021) Jaehyeon Kim, Jungil Kong, and Juhee Son. 2021. Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech. In International Conference on Machine Learning, pages 5530–5540. PMLR.
  • Kim et al. (2024) Sungwon Kim, Kevin Shih, Joao Felipe Santos, Evelina Bakhturina, Mikyas Desta, Rafael Valle, Sungroh Yoon, Bryan Catanzaro, et al. 2024. P-flow: A fast and data-efficient zero-shot tts through speech prompting. Advances in Neural Information Processing Systems, 36.
  • Kumar et al. (2024) Rithesh Kumar, Prem Seetharaman, Alejandro Luebs, Ishaan Kumar, and Kundan Kumar. 2024. High-fidelity audio compression with improved rvqgan. Advances in Neural Information Processing Systems, 36.
  • Le et al. (2023) Matthew Le, Apoorv Vyas, Bowen Shi, Brian Karrer, Leda Sari, Rashel Moritz, Mary Williamson, Vimal Manohar, Yossi Adi, Jay Mahadeokar, et al. 2023. Voicebox: Text-guided multilingual universal speech generation at scale. arXiv preprint arXiv:2306.15687.
  • Lee et al. (2022) Sang-gil Lee, Wei Ping, Boris Ginsburg, Bryan Catanzaro, and Sungroh Yoon. 2022. Bigvgan: A universal neural vocoder with large-scale training. arXiv preprint arXiv:2206.04658.
  • Leng et al. (2023) Yichong Leng, Zhifang Guo, Kai Shen, Xu Tan, Zeqian Ju, Yanqing Liu, Yufei Liu, Dongchao Yang, Leying Zhang, Kaitao Song, et al. 2023. Prompttts 2: Describing and generating voices with text prompt. arXiv preprint arXiv:2309.02285.
  • Li et al. (2024) Hanzhao Li, Liumeng Xue, Haohan Guo, Xinfa Zhu, Yuanjun Lv, Lei Xie, Yunlin Chen, Hao Yin, and Zhifei Li. 2024. Single-codec: Single-codebook speech codec towards high-performance speech generation. arXiv preprint arXiv:2406.07422.
  • Liu et al. (2023) Guanghou Liu, Yongmao Zhang, Yi Lei, Yunlin Chen, Rui Wang, Zhifei Li, and Lei Xie. 2023. Promptstyle: Controllable style transfer for text-to-speech with natural language descriptions. arXiv preprint arXiv:2305.19522.
  • Liu et al. (2024) Haohe Liu, Xuenan Xu, Yi Yuan, Mengyue Wu, Wenwu Wang, and Mark D Plumbley. 2024. Semanticodec: An ultra low bitrate semantic audio codec for general sound. arXiv preprint arXiv:2405.00233.
  • Ma et al. (2023) Ziyang Ma, Zhisheng Zheng, Jiaxin Ye, Jinchao Li, Zhifu Gao, Shiliang Zhang, and Xie Chen. 2023. emotion2vec: Self-supervised pre-training for speech emotion representation. arXiv preprint arXiv:2312.15185.
  • McAuliffe et al. (2017) Michael McAuliffe, Michaela Socolof, Sarah Mihuc, Michael Wagner, and Morgan Sonderegger. 2017. Montreal forced aligner: Trainable text-speech alignment using kaldi. In Interspeech, volume 2017, pages 498–502.
  • Pan et al. (2024) Yu Pan, Lei Ma, and Jianjun Zhao. 2024. Promptcodec: High-fidelity neural speech codec using disentangled representation learning based adaptive feature-aware prompt encoders. arXiv preprint arXiv:2404.02702.
  • Ren et al. (2020) Yi Ren, Chenxu Hu, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. 2020. Fastspeech 2: Fast and high-quality end-to-end text to speech. In International Conference on Learning Representations.
  • Ren et al. (2024) Yong Ren, Tao Wang, Jiangyan Yi, Le Xu, Jianhua Tao, Chu Yuan Zhang, and Junzuo Zhou. 2024. Fewer-token neural speech codec with time-invariant codes. In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 12737–12741. IEEE.
  • Shen et al. (2023) Kai Shen, Zeqian Ju, Xu Tan, Yanqing Liu, Yichong Leng, Lei He, Tao Qin, Sheng Zhao, and Jiang Bian. 2023. Naturalspeech 2: Latent diffusion models are natural and zero-shot speech and singing synthesizers. arXiv preprint arXiv:2304.09116.
  • Siuzdak (2023) Hubert Siuzdak. 2023. Vocos: Closing the gap between time-domain and fourier-based neural vocoders for high-quality audio synthesis. arXiv preprint arXiv:2306.00814.
  • Tan et al. (2024) Xu Tan, Jiawei Chen, Haohe Liu, Jian Cong, Chen Zhang, Yanqing Liu, Xi Wang, Yichong Leng, Yuanhao Yi, Lei He, et al. 2024. Naturalspeech: End-to-end text-to-speech synthesis with human-level quality. IEEE Transactions on Pattern Analysis and Machine Intelligence.
  • Touvron et al. (2023) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
  • Villegas et al. (2022) Ruben Villegas, Mohammad Babaeizadeh, Pieter-Jan Kindermans, Hernan Moraldo, Han Zhang, Mohammad Taghi Saffar, Santiago Castro, Julius Kunze, and Dumitru Erhan. 2022. Phenaki: Variable length video generation from open domain textual description. arXiv preprint arXiv:2210.02399.
  • Vyas et al. (2023) Apoorv Vyas, Bowen Shi, Matthew Le, Andros Tjandra, Yi-Chiao Wu, Baishan Guo, Jiemin Zhang, Xinyue Zhang, Robert Adkins, William Ngan, et al. 2023. Audiobox: Unified audio generation with natural language prompts. arXiv preprint arXiv:2312.15821.
  • Wang et al. (2023) Chengyi Wang, Sanyuan Chen, Yu Wu, Ziqiang Zhang, Long Zhou, Shujie Liu, Zhuo Chen, Yanqing Liu, Huaming Wang, Jinyu Li, et al. 2023. Neural codec language models are zero-shot text to speech synthesizers. arXiv preprint arXiv:2301.02111.
  • Wu et al. (2023) Yi-Chiao Wu, Israel D Gebru, Dejan Marković, and Alexander Richard. 2023. Audiodec: An open-source streaming high-fidelity neural audio codec. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE.
  • Yang et al. (2023a) Dongchao Yang, Songxiang Liu, Rongjie Huang, Jinchuan Tian, Chao Weng, and Yuexian Zou. 2023a. Hifi-codec: Group-residual vector quantization for high fidelity audio codec. arXiv preprint arXiv:2305.02765.
  • Yang et al. (2023b) Dongchao Yang, Songxiang Liu, Rongjie Huang, Chao Weng, and Helen Meng. 2023b. Instructtts: Modelling expressive tts in discrete latent space with natural language style prompt. arXiv preprint arXiv:2301.13662.
  • Zeghidour et al. (2021) Neil Zeghidour, Alejandro Luebs, Ahmed Omran, Jan Skoglund, and Marco Tagliasacchi. 2021. Soundstream: An end-to-end neural audio codec. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 30:495–507.
  • Zen and Senior (2014) Heiga Zen and Andrew Senior. 2014. Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis. In 2014 IEEE international conference on acoustics, speech and signal processing (ICASSP), pages 3844–3848. IEEE.
  • Zhang et al. (2023) Xin Zhang, Dong Zhang, Shimin Li, Yaqian Zhou, and Xipeng Qiu. 2023. Speechtokenizer: Unified speech tokenizer for speech large language models. arXiv preprint arXiv:2308.16692.

Appendix A Related work

A.1 Acoustic Codec Models

In recent times, neural acoustic codecs (Zeghidour et al., 2021; Défossez et al., 2022; Kumar et al., 2024) have demonstrated remarkable capabilities in reconstructing high-quality audio at low bitrates. Typically, these methods employ an encoder to extract deep features in a latent space, which are quantized before being fed into the decoder. To elaborate, SoundStream (Zeghidour et al., 2021) compresses speech effectively with a fully convolutional encoder/decoder network and a residual vector quantizer (RVQ). Encodec (Défossez et al., 2022) employs a streaming encoder-decoder architecture with a quantized latent space, trained end-to-end. AudioDec (Wu et al., 2023) demonstrates the importance of discriminators. PromptCodec (Pan et al., 2024) enhances representation capabilities through additional input prompts. DAC (Kumar et al., 2024) significantly improves reconstruction quality through techniques such as quantizer dropout and a multi-scale STFT-based discriminator. Vocos (Siuzdak, 2023) eliminates codec noise artifacts by pairing a pre-trained Encodec with an inverse Fourier transform vocoder. HILCodec (Ahn et al., 2024) introduces the MFBD discriminator to guide codec modeling. APCodec (Ahn et al., 2024) further enhances reconstruction quality by incorporating ConvNeXt V2 modules in the encoder and decoder. HiFi-Codec (Yang et al., 2023a) proposes a parallel GRVQ structure, achieving good speech reconstruction with just four quantizers. Language-Codec (Ji et al., 2024a) introduces the MCRVQ mechanism to distribute information evenly across the first quantizer, likewise requiring only four quantizers for excellent performance across various generative models. Single-Codec (Li et al., 2024) adds BLSTM, hybrid sampling, and resampling modules to maintain basic performance with a single quantizer, though its reconstruction quality still needs improvement.
TiCodec (Ren et al., 2024) models the codec space by distinguishing between time-independent and time-dependent information. FACodec (Ju et al., 2024) further decouples the codec space into content, style, and acoustic-detail modules. Additionally, recognizing the importance of semantic information in generative models, recent efforts have begun integrating semantic information into codec models. RepCodec (Huang et al., 2024) learns a vector-quantization codebook by reconstructing speech representations from speech encoders such as HuBERT. SpeechTokenizer (Zhang et al., 2023) enriches the semantic content of the first quantizer through semantic distillation. FunCodec (Du et al., 2024) makes semantic tokens optional and explores different combinations. SemanticCodec (Liu et al., 2024) builds on quantized semantic tokens and further reconstructs acoustic information with an audio encoder and a diffusion model. WavTokenizer (Ji et al., 2024b) represents the latest state-of-the-art codec model, capable of reconstructing high-quality audio using only forty discrete codebooks. Given that ControlSpeech requires disentangled discrete audio representations pre-trained on large-scale multi-speaker data, we select FACodec (Ju et al., 2024) as the tokenizer for ControlSpeech.
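The residual vector quantization scheme described above can be sketched as a greedy cascade: each quantizer encodes whatever residual the previous quantizers left behind, and the decoder sums the selected entries. The toy pure-Python version below uses hypothetical two-entry codebooks for illustration; real codecs learn much larger codebooks (e.g. 1024 entries) jointly with the encoder and decoder.

```python
import math

def nearest(codebook, vec):
    """Index of the codebook entry closest (L2) to vec."""
    dists = [math.dist(entry, vec) for entry in codebook]
    return dists.index(min(dists))

def rvq_encode(codebooks, vec):
    """Greedy RVQ: quantizer i encodes the residual left by quantizers 0..i-1."""
    residual = list(vec)
    indices = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        indices.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices

def rvq_decode(codebooks, indices):
    """Reconstruction is the sum of the selected codebook entries."""
    out = [0.0] * len(codebooks[0][0])
    for codebook, idx in zip(codebooks, indices):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out
```

Training additionally learns the codebooks themselves (e.g. with straight-through gradients and commitment losses), which this sketch omits.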

Appendix B Distribution visualization

In this section, we visualize the distribution of the SMSD mixed density network. As shown in Figure 3, we select the original style descriptions from TextrolSpeech and visualize the distributions produced by the SMSD module under three experimental settings: varying pitch (high/low), speech rate (fast/slow), and volume (high/low). Each setting includes 1,000 different style descriptions, with all other factors held constant; for example, in the speech-rate experiment, both the pitch and volume descriptions are set to “normal.” We employ t-SNE for dimensionality reduction of the features. Our results show that the SMSD module effectively distinguishes between different types of styles, and the mixed density distribution is not confined to a small region, indicating that the style control module exhibits substantial diversity.


Figure 3: The t-SNE visualization of mixture density distribution after the SMSD module.

Appendix C VccmDataset test set

To further validate ControlSpeech’s ability to simultaneously control style and clone speaker timbre, we create four types of test sets in the VccmDataset: the main test set (test set A), the out-of-domain speaker test set (test set B), the out-of-domain style test set (test set C), and the special case test set (test set D). Each test set corresponds to one of four experiments: style controllability, out-of-domain speaker cloning, out-of-domain style controllability, and many-to-many style control, respectively. We randomly select 1,500 audio samples as the main test set (test set A) and match the corresponding prompt voice based on speaker IDs. Additionally, to evaluate ControlSpeech’s performance on out-of-domain timbres and styles, we further filter an appropriate test set (speakers not present in the training set) and enlist language experts to compose style descriptions distinct from those in TextrolSpeech. These two methods yield the out-of-domain speaker test set (test set B) and the out-of-domain style test set (test set C). Test set B consists of 1,086 test utterances, and we ensure that none of its speakers appear in the training set. The special case test set (test set D) is designed to evaluate the model’s performance under many-to-many style control conditions. First, we select four groups of speakers, each matched with 60 different style descriptions while the content text remains fixed; this set of test samples is referred to as test set D1. We further select six distinct style descriptions paired with 50 different timbre prompts, with the (pitch, speed, volume) labels set to the following combinations: (normal, fast, normal), (normal, slow, normal), (high, normal, normal), (low, normal, normal), (normal, normal, high), and (normal, normal, low). This set of special test samples is referred to as test set D2.

Appendix D Evaluation metrics

For objective evaluations, we adopt the metrics used in prior works (Guo et al., 2023; Ji et al., 2023; Leng et al., 2023). To evaluate the model’s style controllability, we use accuracy, which measures the correspondence between the style factors in the output speech and those in the prompts. The accuracy of pitch, speaking speed, and volume is calculated with signal processing tools. We fine-tune the official Emotion2vec model (Ma et al., 2023) on the emotional subset of VccmDataset and compute speech emotion classification accuracy with the fine-tuned model. To evaluate timbre similarity (Spk-sv) between the original prompt and the synthesized speech, we utilize the base-plus-sv version of WavLM (Chen et al., 2022). For Word Error Rate (WER), we transcribe the synthesized speech with an ASR model (https://huggingface.co/facebook/hubert-large-ls960-ft), a CTC-based HuBERT pre-trained on Libri-Light and fine-tuned on the 960-hour LibriSpeech training set. For subjective evaluations, we conduct mean opinion score (MOS) evaluations on the test set to measure audio naturalness via crowdsourcing. We randomly select 30 samples from the test set of each dataset for subjective evaluation, and each audio sample is listened to by at least 10 testers. We analyze MOS in two aspects: MOS-Q (Quality, assessing clarity and the naturalness of duration and pitch) and MOS-S (Speaker similarity).
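The WER metric reduces to a word-level edit distance between the reference text and the ASR transcript, normalized by the reference length. A minimal pure-Python sketch of that computation (the actual evaluation obtains the hypothesis from the HuBERT ASR model above):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + insertions + deletions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution (or match)
    return d[len(ref)][len(hyp)] / len(ref)
```

For example, a hypothesis that drops one word of a six-word reference scores a WER of 1/6.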

Furthermore, for evaluating style-controllable many-to-many scenarios on test set D, we design new subjective MOS metrics: MOS-TS (Timbre Similarity), MOS-SD (Style Diversity), and MOS-SA (Style Accuracy). Specifically, MOS-TS assesses whether the timbre remains stable across the 60 different style descriptions for the four speakers in test set D1. MOS-SA and MOS-SD measure the accuracy and diversity, respectively, of style control for each style description in test set D2.

Appendix E Training and Inference Settings

ControlSpeech is trained on VccmDataset using 8 NVIDIA A100 40G GPUs, with each batch accommodating 3,500 frames of the discrete codec. We optimize the models with the AdamW optimizer (β₁ = 0.9, β₂ = 0.95). The learning rate is warmed up over the first 5k updates to a peak of 5×10⁻⁴ and then linearly decayed. We use the voice-conversion version of the open-source FACodec as the codec encoder and decoder for ControlSpeech. The style-controllable baseline models are trained on the same VccmDataset training set to eliminate potential biases. We utilize a pre-trained BERT (Devlin et al., 2018) model with 12 hidden layers and 110M parameters. For the implementation of the basic MDN network, we largely follow the approach described in (Duan, 2019).
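The warmup-then-linear-decay schedule can be written as a simple function of the step index. The sketch below assumes a total step count, which the paper does not state; `TOTAL_STEPS` is a placeholder value.

```python
PEAK_LR = 5e-4       # peak learning rate from the paper
WARMUP_STEPS = 5_000 # warmup length from the paper
TOTAL_STEPS = 200_000  # assumed placeholder; not stated in the paper

def learning_rate(step: int, total_steps: int = TOTAL_STEPS) -> float:
    """Linear warmup to PEAK_LR over the first 5k updates, then linear decay to 0."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    frac = (step - WARMUP_STEPS) / (total_steps - WARMUP_STEPS)
    return PEAK_LR * max(0.0, 1.0 - frac)
```

In practice such a schedule is usually attached to the optimizer via a framework scheduler (e.g. a per-step lambda); the function above only captures the shape of the curve.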

Appendix F Model Architecture in ControlSpeech

Following (Ju et al., 2024), the basic architecture of the codec encoder and decoder follows (Kumar et al., 2024) and employs the SnakeBeta activation function (Lee et al., 2022). The timbre extractor consists of several Conformer (Gulati et al., 2020) blocks. We use N_qc = 2, N_qp = 1, and N_qd = 3 as the numbers of quantizers for the three FVQs Q^c, Q^p, and Q^d, respectively; the codebook size for all quantizers is 1024. The text encoder and variance adaptor share a similar architecture comprising several FFT blocks or attention layers, as used in FastSpeech 2 (Ren et al., 2020). The style extractor, a module comprising convolutional and LSTM networks from FACodec (Ju et al., 2024), outputs a 512-dimensional global ground-truth style vector. The codec generator is a decoder based primarily on Conformer blocks (Gulati et al., 2020), similar to MobileSpeech (Ji et al., 2024c), but we opt for fewer decoder layers (6) and a smaller parameter count.
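The quantizer layout above can be captured in a small configuration sketch; the class and field names are our own shorthand for the counts stated in the text, and the bits-per-frame figure simply follows from each index costing log2(1024) = 10 bits.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class QuantizerConfig:
    # Quantizer counts from the text: N_qc = 2, N_qp = 1, N_qd = 3.
    n_q_c: int = 2   # content quantizers (Q^c)
    n_q_p: int = 1   # Q^p quantizer
    n_q_d: int = 3   # acoustic-detail quantizers (Q^d)
    codebook_size: int = 1024  # shared by all quantizers

    @property
    def total_quantizers(self) -> int:
        return self.n_q_c + self.n_q_p + self.n_q_d

    @property
    def bits_per_frame(self) -> int:
        # Each codebook index costs log2(codebook_size) bits.
        return self.total_quantizers * int(math.log2(self.codebook_size))
```

With these values, each codec frame is represented by 6 indices, i.e. 60 bits.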

Appendix G Evaluation on the out-of-domain style control task

We further evaluate the controllability of style-controllable models with out-of-domain style descriptions, comparing ControlSpeech with the controllable baseline models on VccmDataset test set C. Test set C comprises 100 test utterances with style prompts rewritten by experts; none of these style prompts appear in the training set. Results are shown in Table 5. The generalization performance of ControlSpeech is remarkably better than that of the baseline models, which we attribute to the SMSD module and its underlying mixture density network mechanism. The accuracies of speech speed and volume from ControlSpeech are markedly better than those of the baselines, especially for volume. ControlSpeech also yields the best WER, MOS-Q, and speaker timbre similarity. As in Table 1, the pitch accuracy of ControlSpeech is slightly lower, which we believe is due to pitch inconsistencies arising from simultaneously controlling style and cloning timbre. Note that there is no significant difference between test sets A and C except that the style descriptions in test set C are out-of-domain while those in test set A are in-domain. Comparing Table 5 with Table 1, the degradations of ControlSpeech on all metrics are much smaller than those of the baselines.

Table 5: The out-of-domain style control results of different style-controlled models on the VccmDataset test set C. None of the style prompts are present in the training set.

Appendix H Ablation Experiments about Mixed Distributions

In this section, we investigate the impact of the number of mixtures in the SMSD module on model performance. We conduct ablation studies under the isotropic-across-clusters noise perturbation mode (the mode selected for ControlSpeech), examining the effects of using 3, 5, and 7 mixtures. As shown in Table 6, the differences in the MOS-SD metric are negligible. However, an increase in the number of mixtures leads to a decline in the MOS-SA metric, indicating that an excessive number of mixtures may reduce the model’s control accuracy.

Table 6: Under the Isotropic across clusters noise perturbation scheme, we investigate the influence of the number of Gaussian mixture components in the SMSD module on stylistic diversity. Subsequently, we analyze the corresponding outcomes using the MOS-SA and MOS-SD metrics.

Appendix I Ablation Experiments on Various Noise Modes

We analyze the impact of different noise perturbation modes on the many-to-many style control problem, with the number of mixture distributions fixed at 5. As shown in Table 7, the noise perturbation mode maintaining isotropy at the cluster centers achieves a balance between the MOS-SA and MOS-SD metrics and outperforms all other modes.

Table 7: The results of different noise perturbation modes on the MOS-SA and MOS-SD metrics.

Appendix J Ethics Statement

ControlSpeech is capable of zero-shot voice cloning; hence, there are potential risks from misuse, such as voice spoofing. For any real-world application involving unseen speakers, it is crucial to establish protocols ensuring that the speaker has authorized the use of their voice. To further mitigate these risks, we will also develop approaches such as speech watermarking to identify whether a given audio clip was synthesized by ControlSpeech.

Appendix K The SMSD Loss

The loss function for the SMSD module is the negative log conditional probability of the target global style ${Y_s}'$ given the input style representation ${X_s}'$. We further refine this into a maximum-likelihood loss over the style distribution parameters $\pi_k$, $\mu^{(k)}$, and ${\sigma^2}^{(k)}$ produced by the MDN network and the noise perturbation module. The detailed derivation of the loss function is as follows.

$$
\begin{aligned}
\mathcal{L}_{SMSD} &= -\log P_{\theta}({Y_s}' \mid {X_s}') \\
&\propto -\log \sum_{k=1}^{K} \pi_k \exp\!\Big(-\tfrac{1}{2}({Y_s}'-\mu^{(k)})^{T}\,({\sigma^2}^{(k)})^{-1}({Y_s}'-\mu^{(k)}) - \tfrac{1}{2}\log\det{\sigma^2}^{(k)}\Big) \\
&= -\operatorname{logsumexp}_k\Big(\log\pi_k - \tfrac{1}{2}({Y_s}'-\mu^{(k)})^{T}({\sigma^2}^{(k)})^{-1}({Y_s}'-\mu^{(k)}) - \tfrac{1}{2}\log\det{\sigma^2}^{(k)}\Big) \\
&= -\operatorname{logsumexp}_k\Big(\log\pi_k - \tfrac{1}{2}\Big\|\tfrac{{Y_s}'-\mu^{(k)}}{\sigma^{(k)}}\Big\|^2 - \big\|\log\sigma^{(k)}\big\|_1\Big) \\
&= -\operatorname{logsumexp}_k\Big(\log\pi_k - \tfrac{1}{2}\Big\|\tfrac{{Y_s}'-\mu^{(k)}}{\sigma^{(k)}}\Big\|^2 - d\log\sigma^{(k)}\Big) \\
&= -\operatorname{logsumexp}_k\Big(\log\pi_k - \tfrac{1}{2}\Big\|\tfrac{{Y_s}'-\mu^{(k)}}{\sigma^{(k)}}\Big\|^2 - d\log\sigma\Big) \\
&= -\operatorname{logsumexp}_k\Big(\log\pi_k - \tfrac{1}{2}\Big\|\tfrac{{Y_s}'-\mu^{(k)}}{\sigma}\Big\|^2\Big)
\end{aligned}
\tag{8}
$$
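The final simplified form of Equation (8) — a log-sum-exp over mixture components with a shared isotropic scale — can be evaluated in a numerically stable way by subtracting the maximum logit before exponentiating. The pure-Python sketch below is illustrative, not the training implementation:

```python
import math

def smsd_loss(y, pis, mus, sigma):
    """Negative log-likelihood of target style vector y under a Gaussian
    mixture with weights pis, means mus, and a shared isotropic scale sigma,
    in the simplified form of Eq. (8) (additive constants dropped)."""
    logits = []
    for pi_k, mu_k in zip(pis, mus):
        sq = sum(((yi - mi) / sigma) ** 2 for yi, mi in zip(y, mu_k))
        logits.append(math.log(pi_k) - 0.5 * sq)
    # Stable log-sum-exp: factor out the largest logit.
    m = max(logits)
    return -(m + math.log(sum(math.exp(l - m) for l in logits)))
```

When a single component centered exactly on the target carries all the weight, the loss is zero; mass assigned to distant components increases it.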
