---
license: other
license_name: uc-research-education-not-for-profit
license_link: https://huggingface.co/Louis0324/StyleStream/blob/main/LICENSE
library_name: pytorch
tags:
- voice-conversion
- speech
- audio
- streaming
- style-transfer
- research
- not-for-profit
---
<h1 align="center">
StyleStream
</h1>
<p align="center">
<a href="http://arxiv.org/abs/2602.20113"><img src="https://img.shields.io/badge/arXiv-2602.20113-b31b1b.svg?logo=arXiv" alt="arXiv" /></a>
<a href="https://berkeley-speech-group.github.io/StyleStream/"><img src="https://img.shields.io/badge/GitHub-Demo-orange.svg" alt="demo" /></a>
<a href="https://github.com/Berkeley-Speech-Group/StyleStream"><img src="https://img.shields.io/badge/GitHub-Code-black.svg?logo=github" alt="GitHub" /></a>
<a href="https://huggingface.co/Louis0324/StyleStream/blob/main/LICENSE"><img src="https://img.shields.io/badge/License-Research%2FEducation-blue.svg" alt="license" /></a>
</p>
<p align="center">
<strong>StyleStream: Real-Time Zero-Shot Voice Style Conversion</strong>
</p>
<p align="center">
Official PyTorch model weights for streamable voice style conversion in timbre, accent, and emotion.
</p>
<p align="center">
<img src="assets/figures/overview.png" alt="StyleStream overview" width="100%" />
</p>
**Release note:** To reduce voice-cloning misuse, this public release excludes the style encoder weights. Public inference uses curated target speaker embeddings, not arbitrary target-speaker cloning.
## News
- 2026/06/11: StyleStream offline / streaming inference code and weights are open-sourced! 🔥 🔥 🔥
- 2026/06/03: StyleStream was accepted to the INTERSPEECH 2026 long paper track!
## Files
This Hugging Face repo hosts the public inference assets:
- `stylizer-no-style-enc.ckpt`: stylizer checkpoint without style encoder weights
- `destylizer.ckpt`: destylizer checkpoint
- `vocos_causal_best.ckpt`: causal vocoder checkpoint
- `target_spkrs.tar`: curated target speaker inventory, larger than the small example set in the GitHub repo
Small target examples and the full inference code are available in the GitHub repo:
```text
https://github.com/Berkeley-Speech-Group/StyleStream
```
## Download
Install the Hugging Face CLI if needed:
```bash
pip install -U huggingface_hub
```
From the StyleStream project root, download checkpoints:
```bash
hf download Louis0324/StyleStream \
  stylizer-no-style-enc.ckpt destylizer.ckpt vocos_causal_best.ckpt \
  --repo-type model --local-dir assets/ckpts
```
Download the larger target speaker inventory:
```bash
hf download Louis0324/StyleStream target_spkrs.tar --repo-type model --local-dir assets/target_spkrs
```
Expected local layout:
```text
assets/ckpts/
  stylizer-no-style-enc.ckpt
  destylizer.ckpt
  vocos_causal_best.ckpt
assets/target_spkrs/
  target_spkrs.tar
```
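Before launching inference, the layout above can be checked with a short script. This is a sketch only: `missing_assets` and `EXPECTED_ASSETS` are hypothetical names, not part of the StyleStream code, and the paths are copied from the listing above.

```python
from pathlib import Path

# Relative paths copied from the expected local layout.
EXPECTED_ASSETS = (
    "assets/ckpts/stylizer-no-style-enc.ckpt",
    "assets/ckpts/destylizer.ckpt",
    "assets/ckpts/vocos_causal_best.ckpt",
    "assets/target_spkrs/target_spkrs.tar",
)


def missing_assets(project_root: Path) -> list[str]:
    """Return the expected asset paths that are absent under `project_root`."""
    return [rel for rel in EXPECTED_ASSETS if not (project_root / rel).is_file()]
```

Calling `missing_assets(Path("."))` from the project root returns an empty list when every download has landed in the right place.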
## Usage
Clone the GitHub repo and follow its setup instructions:
```bash
git clone https://github.com/Berkeley-Speech-Group/StyleStream.git
cd StyleStream
pip install -r requirements.txt
```
Offline Streamlit app:
```bash
streamlit run inference/offline_app.py
```
Recommended streaming inference:
```bash
python inference/streaming.py
```
Use this terminal script for the fastest real-time performance. It runs a speed test before starting audio I/O, selects an inference-step setting that is fast enough to stream, and lets you switch target styles by typing a target index.
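This README does not describe how the speed test picks its setting. As a conceptual sketch only, a setting can be called streamable when processing one audio chunk takes less time than the chunk lasts. Everything below is hypothetical and is not the script's actual logic: the function `pick_steps`, the candidate step counts, the timings, the safety margin, and the assumption that more steps are preferable whenever they fit.

```python
from typing import Optional


def pick_steps(
    measured: dict[int, float], chunk_seconds: float, margin: float = 0.8
) -> Optional[int]:
    """Return the largest step count that fits the real-time budget.

    `measured` maps a candidate step count to the measured seconds needed
    to process one chunk. All names and values here are illustrative.
    """
    # Keep some headroom so that timing jitter does not cause dropouts.
    budget = chunk_seconds * margin
    streamable = [steps for steps, seconds in measured.items() if seconds <= budget]
    return max(streamable) if streamable else None
```

With made-up timings `{1: 0.02, 2: 0.04, 4: 0.09}` and a 0.1 s chunk, the budget is 0.08 s, so the sketch would select 2 steps. If no candidate fits, it returns `None`.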
Streaming Streamlit app:
```bash
streamlit run inference/streaming_app.py
```
Use this when you want browser-based target selection, audio device selection, live status, and speed-test visualization. It has the same core streaming functionality, but is slower because of Streamlit overhead.
Command-line examples:
```bash
./inference/run_inference_offline.sh
./inference/run_inference_simulate_streaming.sh
```
## Style Inventory
Target styles use this folder format:
```text
target_name/
  target_name.wav
  target_name.npy
```
The `.wav` file provides the target mel/acoustic context. The `.npy` file is the pre-extracted style embedding with shape `[768]`.
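The folder format above can be scanned with a few lines of Python. This is a minimal sketch: `list_targets` is a hypothetical helper, and its 0-based alphabetical indexing is an assumption that need not match the indices shown by the inference scripts.

```python
from pathlib import Path


def list_targets(inventory_dir: Path) -> dict[int, str]:
    """Map an index to each target folder that holds both expected files.

    Hypothetical helper: folders missing the .wav or .npy file are skipped.
    """
    names = sorted(
        entry.name
        for entry in inventory_dir.iterdir()
        if entry.is_dir()
        and (entry / f"{entry.name}.wav").is_file()
        and (entry / f"{entry.name}.npy").is_file()
    )
    return dict(enumerate(names))
```

Each listed `.npy` file can then be loaded with `numpy.load` and its shape compared against `(768,)`.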
## Intended Use
StyleStream is released for educational, research, and not-for-profit use. It is intended for voice style conversion research, benchmarking, comparison, and reproducible inference.
The public release does not include style encoder weights and does not support arbitrary target-speaker cloning.
## License
The code is released under a **research, educational, and not-for-profit software license**. Commercial use requires prior written permission from The Regents of the University of California.
See the `LICENSE` file in this Hugging Face model repo:
```text
https://huggingface.co/Louis0324/StyleStream/blob/main/LICENSE
```
## Acknowledgements
[F5-TTS](https://arxiv.org/abs/2410.06885): stylizer flow matching modules.
## Citation
If you find StyleStream useful, please consider starring the GitHub repo and citing the paper:
```bibtex
@article{liu2026stylestream,
title={StyleStream: Real-Time Zero-Shot Voice Style Conversion},
author={Yisi Liu and Nicholas Lee and Gopala Anumanchipalli},
journal={arXiv preprint arXiv:2602.20113},
year={2026}
}
```