arxiv:2606.10029

Interpreting and Steering a Text-to-Speech Language Model with Sparse Autoencoders

Published on Jun 8

Submitted by Nikita Balagansky on Jun 10

T-Tech

Abstract

Sparse autoencoders trained on language model representations reveal interpretable features for speech synthesis that can be manipulated to control linguistic and prosodic attributes.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Language models increasingly serve as the backbone of text-to-speech (TTS) systems, yet we understand little about the representations they build when text and generated speech tokens share a single residual stream. We train BatchTopK sparse autoencoders on the LM backbone of CosyVoice3 and introduce a modality-aware auto-interp pipeline that labels each feature from where it fires: text-prefix context, 1-second speech clips, or both. The recovered features are interpretable, spanning phonemes, laughter, accent prompts and speaker gender. Steering through the SAE latent space shows these features are causal rather than merely descriptive: targeted interventions raise laughter probability from 0.02 to 0.79, flip perceived speaker gender, and control speech rate while preserving spoken content. SAE features thus serve both as interpretability objects and as control directions for TTS synthesis.
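For readers unfamiliar with BatchTopK, the sketch below illustrates the general idea of the activation: rather than keeping the top k latents for every token separately, the k × batch_size largest pre-activations are kept across the whole batch, so individual tokens may use more or fewer latents. This is a plain-Python illustration only, not the authors' implementation. The function name `batch_topk`, the list-of-lists layout, and the choice to drop non-positive values (a ReLU) are assumptions made for the example.

```python
import heapq


def batch_topk(pre_acts, k):
    """Keep the k * batch_size largest positive pre-activations in the batch.

    pre_acts: list of rows (one per token), each a list of floats
              (one per SAE latent). Hypothetical layout for illustration.
    Returns a new list of rows with every other entry set to 0.0.
    """
    budget = k * len(pre_acts)  # sparsity budget shared by the whole batch
    # Flatten to (value, row, col); assumption: only positive values compete.
    candidates = [
        (value, r, c)
        for r, row in enumerate(pre_acts)
        for c, value in enumerate(row)
        if value > 0.0
    ]
    kept = heapq.nlargest(budget, candidates)
    out = [[0.0] * len(row) for row in pre_acts]
    for value, r, c in kept:
        out[r][c] = value
    return out


# Two tokens, four latents, k = 1: the budget is 2 activations in total.
batch = [[0.9, 0.1, 0.8, -0.3],
         [0.2, 0.05, -0.1, 0.0]]
print(batch_topk(batch, k=1))
```

In this toy batch both surviving activations come from the first token and the second token keeps none, which is the behavior that distinguishes a batch-level budget from a per-token top-k.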

Community

elephantmipt

Paper submitter about 24 hours ago

Bringing SAEs to text-to-speech models!

Currently, control over TTS models such as CosyVoice3 is limited to prompts or predefined tags. We found that model generations can be precisely edited by steering SAE features.
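As a rough illustration of what steering an SAE feature can look like, here is a minimal sketch of one common formulation; the exact procedure used in the paper may differ. It assumes a linear SAE decoder, in which case setting one latent to a target value and adding back the reconstruction error is the same as shifting the residual-stream vector along that latent's decoder direction. The names `steer`, `hidden`, and `decoder_direction` are hypothetical.

```python
def steer(hidden, decoder_direction, current_activation, target_activation):
    """Shift a residual-stream vector along one SAE decoder direction.

    Assumption: the SAE decoder is linear, so changing a single latent from
    current_activation to target_activation changes the reconstruction by
    (target - current) * decoder_direction, and everything the SAE does not
    explain (the reconstruction error) is left untouched.
    """
    delta = target_activation - current_activation
    return [h + delta * d for h, d in zip(hidden, decoder_direction)]


# Toy 3-dimensional example: push a feature from 0.5 up to 4.5.
hidden = [1.0, 2.0, 3.0]
direction = [0.0, 1.0, 0.0]  # hypothetical decoder row for the feature
print(steer(hidden, direction, current_activation=0.5, target_activation=4.5))
```

In a real model this edit would be applied inside a forward hook at the layer the SAE was trained on, for each position being steered.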

We also analyze these features: some are audio-only, some activate only on text, and some activate on both text and audio. In addition, we introduce an auto-interp pipeline that covers all three kinds.
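One simple way to make the text / audio / both split concrete is to look at where a feature fires within a sequence. The sketch below is an assumption-laden illustration rather than the pipeline from the paper: the function name `modality_label`, the "text" / "speech" position tags, and the 0.95 dominance threshold are all made up for the example.

```python
def modality_label(activations, modalities, dominance=0.95):
    """Label a feature by the modality of the positions where it fires.

    activations: per-position activation of one SAE feature.
    modalities:  per-position tag, "text" or "speech" (hypothetical tags).
    dominance:   hypothetical share above which one modality counts as exclusive.
    """
    fired = [m for a, m in zip(activations, modalities) if a > 0.0]
    if not fired:
        return "dead"  # the feature never fired on this data
    text_share = fired.count("text") / len(fired)
    if text_share >= dominance:
        return "text"
    if text_share <= 1.0 - dominance:
        return "audio"
    return "both"


tags = ["text", "speech", "text", "speech"]
print(modality_label([0.0, 1.2, 0.0, 0.7], tags))  # fires on speech positions only
print(modality_label([0.5, 0.3, 0.0, 0.0], tags))  # fires on both kinds of position
```

A label like this could then decide what evidence is shown to the labeling model: text context, short speech clips, or both.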

We plan to publish the SAE weights and code soon!