Papers
arxiv:2606.10029

Interpreting and Steering a Text-to-Speech Language Model with Sparse Autoencoders

Published on Jun 8
· Submitted by
Nikita Balagansky
on Jun 10
Authors:
,
,
,

Abstract

Sparse autoencoders trained on language model representations reveal interpretable features for speech synthesis that can be manipulated to control linguistic and prosodic attributes.

Language models increasingly serve as the backbone of text-to-speech (TTS) systems, yet we understand little about the representations they build when text and generated speech tokens share a single residual stream. We train BatchTopK sparse autoencoders on the LM backbone of CosyVoice3 and introduce a modality-aware auto-interp pipeline that labels each feature from where it fires-text-prefix context, 1-second speech clips, or both. The recovered features are interpretable, spanning phonemes, laughter, accent prompts and speaker gender. Steering through the SAE latent space shows these features are causal rather than merely descriptive: targeted interventions raise laughter probability from 0.02 to 0.79, flip perceived speaker gender, and control speech rate while preserving spoken content. SAE features thus serve both as interpretability objects and as control directions for TTS synthesis.

Community

Bringing SAEs to text-to-speech models!

Currently, control over TTS models such as CosyVoice3 is limited to prompts or predefined tags. We found that model generations can be precisely edited by steering SAE features.

We also analyze these features: some are audio-only, others activate only on text, and some activate on both text and audio. Additionally, we introduce an autointerp pipeline for all of them.

We plan to publish the SAE weights and code soon!

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any Paper on Hugging Face checkout this Space

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend

Sign up or log in to comment

Get this paper in your agent:

hf papers read 2606.10029
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2606.10029 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2606.10029 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2606.10029 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.