---
license: other
license_name: uc-research-education-not-for-profit
license_link: https://huggingface.co/Louis0324/StyleStream/blob/main/LICENSE
library_name: pytorch
tags:
  - voice-conversion
  - speech
  - audio
  - streaming
  - style-transfer
  - research
  - not-for-profit
---

<h1 align="center">
  StyleStream
</h1>

<p align="center">
  <a href="http://arxiv.org/abs/2602.20113"><img src="https://img.shields.io/badge/arXiv-2602.20113-b31b1b.svg?logo=arXiv" alt="arXiv" /></a>
  <a href="https://berkeley-speech-group.github.io/StyleStream/"><img src="https://img.shields.io/badge/GitHub-Demo-orange.svg" alt="demo" /></a>
  <a href="https://github.com/Berkeley-Speech-Group/StyleStream"><img src="https://img.shields.io/badge/GitHub-Code-black.svg?logo=github" alt="GitHub" /></a>
  <a href="https://huggingface.co/Louis0324/StyleStream/blob/main/LICENSE"><img src="https://img.shields.io/badge/License-Research%2FEducation-blue.svg" alt="license" /></a>
</p>

<p align="center">
  <strong>StyleStream: Real-Time Zero-Shot Voice Style Conversion</strong>
</p>

<p align="center">
  Official PyTorch model weights for streamable voice style conversion in timbre, accent, and emotion.
</p>

<p align="center">
  <img src="assets/figures/overview.png" alt="StyleStream overview" width="100%" />
</p>

**Release note:** To reduce voice-cloning misuse, this public release excludes the style encoder weights. Public inference is limited to curated, pre-extracted target speaker embeddings; cloning an arbitrary target speaker is not supported.

## News

- 2026/06/11: StyleStream offline and streaming inference code and weights are now open source! 🔥 🔥 🔥
- 2026/06/03: StyleStream was accepted to the INTERSPEECH 2026 long paper track! 🎉 🎉 🎉

## Files

This Hugging Face repo hosts the public inference assets:

- `stylizer-no-style-enc.ckpt`: stylizer checkpoint without style encoder weights
- `destylizer.ckpt`: destylizer checkpoint
- `vocos_causal_best.ckpt`: causal vocoder checkpoint
- `target_spkrs.tar`: curated target speaker inventory, larger than the small example set in the GitHub repo

Small target examples and the full inference code are available in the GitHub repo:

```text
https://github.com/Berkeley-Speech-Group/StyleStream
```

## Download

Install or upgrade the Hugging Face CLI if needed (the `hf` command used below ships with recent releases of `huggingface_hub`):

```bash
pip install -U huggingface_hub
```

From the StyleStream project root, download checkpoints:

```bash
hf download Louis0324/StyleStream \
  stylizer-no-style-enc.ckpt destylizer.ckpt vocos_causal_best.ckpt \
  --repo-type model --local-dir assets/ckpts
```

Download the larger target speaker inventory:

```bash
hf download Louis0324/StyleStream target_spkrs.tar --repo-type model --local-dir assets/target_spkrs
```

Expected local layout:

```text
assets/ckpts/
  stylizer-no-style-enc.ckpt
  destylizer.ckpt
  vocos_causal_best.ckpt

assets/target_spkrs/
  target_spkrs.tar
```
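To confirm the downloads landed where expected, a small check like the one below may help. It is a minimal sketch using only the Python standard library: the file names come from the layout above, while the helper names `missing_assets` and `unpack_targets` are hypothetical and not part of the StyleStream code. Whether the archive has to be unpacked by hand, and what it contains, is an assumption here; the GitHub repo's setup instructions are the authority.

```python
import tarfile
from pathlib import Path

# File names taken from the expected local layout above.
EXPECTED_CKPTS = (
    "stylizer-no-style-enc.ckpt",
    "destylizer.ckpt",
    "vocos_causal_best.ckpt",
)


def missing_assets(root="assets"):
    """Return the expected files that are not present under `root`."""
    root = Path(root)
    expected = [root / "ckpts" / name for name in EXPECTED_CKPTS]
    expected.append(root / "target_spkrs" / "target_spkrs.tar")
    return [str(path) for path in expected if not path.is_file()]


def unpack_targets(tar_path, dest):
    """Extract `tar_path` into `dest` and return its top-level entry names.

    Assumption: the inventory is meant to be unpacked before use.
    """
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    # Use the safer "data" extraction filter where this Python supports it.
    kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
    with tarfile.open(tar_path) as tar:
        names = tar.getnames()
        tar.extractall(dest, **kwargs)
    return sorted({Path(n).parts[0] for n in names if Path(n).parts})
```

Run from the StyleStream project root, `missing_assets()` returns an empty list when all four files are in place, and `unpack_targets("assets/target_spkrs/target_spkrs.tar", "assets/target_spkrs")` unpacks the inventory next to the archive.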

## Usage

Clone the GitHub repo and follow its setup instructions:

```bash
git clone https://github.com/Berkeley-Speech-Group/StyleStream.git
cd StyleStream
pip install -r requirements.txt
```

Offline Streamlit app:

```bash
streamlit run inference/offline_app.py
```

Recommended streaming inference:

```bash
python inference/streaming.py
```

Use this terminal script for the fastest real-time performance. It runs a speed test before opening audio I/O, selects an inference-step setting that can keep up with the stream, and lets you switch target styles by typing a target index.
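The idea of choosing a step count from a speed test can be illustrated with a short sketch. This is not the code in `inference/streaming.py`: the function names, the candidate step counts, and the chunk length are all hypothetical. The sketch only shows one plausible rule, which is to time one chunk per candidate and keep the largest step count whose real-time factor (processing time divided by chunk duration) stays below 1.

```python
import time


def time_chunk(run_chunk, steps, repeats=3):
    """Average wall-clock seconds for `run_chunk(steps)` over `repeats` runs.

    `run_chunk` is a hypothetical callable that processes one audio chunk.
    """
    start = time.perf_counter()
    for _ in range(repeats):
        run_chunk(steps)
    return (time.perf_counter() - start) / repeats


def pick_steps(candidates, chunk_seconds, measure, max_rtf=1.0):
    """Pick the largest step count whose real-time factor is below `max_rtf`.

    `measure(steps)` must return the seconds needed to process one chunk.
    Returns None when no candidate is fast enough.
    """
    best = None
    for steps in sorted(candidates):
        rtf = measure(steps) / chunk_seconds
        if rtf < max_rtf:
            best = steps
    return best


# Stand-in workload: pretend each step costs 10 ms on a 100 ms chunk.
chosen = pick_steps([4, 8, 16, 32], chunk_seconds=0.1, measure=lambda s: 0.01 * s)
print(chosen)  # 8
```

In practice `measure` would wrap a real model call, for example `lambda s: time_chunk(run_chunk, s)`, and a `max_rtf` somewhat below 1 leaves headroom for audio I/O.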

Streaming Streamlit app:

```bash
streamlit run inference/streaming_app.py
```

Use this when you want browser-based target selection, audio device selection, live status, and speed-test visualization. It has the same core streaming functionality, but is slower because of Streamlit overhead.

Command-line examples:

```bash
./inference/run_inference_offline.sh
./inference/run_inference_simulate_streaming.sh
```

## Style Inventory

Target styles use this folder format:

```text
target_name/
  target_name.wav
  target_name.npy
```

The `.wav` file provides the target mel/acoustic context. The `.npy` file holds the pre-extracted style embedding, a vector of shape `[768]`.
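A quick way to check that a directory of targets follows this format is sketched below. The helper names `npy_shape` and `list_targets` are hypothetical and not part of the StyleStream code. To stay dependency-free, the sketch reads only the `.npy` header rather than loading the array; with NumPy installed, `numpy.load(path).shape` gives the same information.

```python
import ast
import struct
from pathlib import Path

STYLE_DIM = 768  # embedding length stated in this section


def npy_shape(path):
    """Read the array shape from a .npy header without loading the data."""
    with open(path, "rb") as f:
        if f.read(6) != b"\x93NUMPY":
            raise ValueError(f"{path} is not a .npy file")
        major, _minor = f.read(2)
        # Format 1.0 stores the header length in 2 bytes, later versions in 4.
        size_fmt = "<H" if major == 1 else "<I"
        (header_len,) = struct.unpack(size_fmt, f.read(struct.calcsize(size_fmt)))
        header = ast.literal_eval(f.read(header_len).decode("latin1").strip())
    return tuple(header["shape"])


def list_targets(root):
    """Return the names of well-formed target folders under `root`, sorted.

    A folder qualifies when it holds `<name>.wav` and `<name>.npy` and the
    embedding has shape (STYLE_DIM,).
    """
    names = []
    for folder in sorted(Path(root).iterdir()):
        if not folder.is_dir():
            continue
        wav = folder / f"{folder.name}.wav"
        npy = folder / f"{folder.name}.npy"
        if wav.is_file() and npy.is_file() and npy_shape(npy) == (STYLE_DIM,):
            names.append(folder.name)
    return names
```

Folders that are missing either file, or whose embedding has a different shape, are skipped rather than reported, so comparing the returned list against the directory listing shows which targets need attention.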

## Intended Use

StyleStream is released for educational, research, and not-for-profit use. It is intended for voice style conversion research, benchmarking, comparison, and reproducible inference.

The public release does not include style encoder weights and does not support arbitrary target-speaker cloning.

## License

The code is released under a **research, educational, and not-for-profit software license**. Commercial use requires prior written permission from The Regents of the University of California.

See the `LICENSE` file in this Hugging Face model repo:

```text
https://huggingface.co/Louis0324/StyleStream/blob/main/LICENSE
```

## Acknowledgements

[F5-TTS](https://arxiv.org/abs/2410.06885): stylizer flow matching modules.

## Citation

If you find StyleStream useful, please consider starring the repo and citing the paper:

```bibtex
@article{liu2026stylestream,
  title={StyleStream: Real-Time Zero-Shot Voice Style Conversion},
  author={Yisi Liu and Nicholas Lee and Gopala Anumanchipalli},
  journal={arXiv preprint arXiv:2602.20113},
  year={2026}
}
```