---
license: other
license_name: uc-research-education-not-for-profit
license_link: https://huggingface.co/Louis0324/StyleStream/blob/main/LICENSE
library_name: pytorch
tags:
- voice-conversion
- speech
- audio
- streaming
- style-transfer
- research
- not-for-profit
---
<h1 align="center">
StyleStream
</h1>
<p align="center">
<a href="http://arxiv.org/abs/2602.20113"><img src="https://img.shields.io/badge/arXiv-2602.20113-b31b1b.svg?logo=arXiv" alt="arXiv" /></a>
<a href="https://berkeley-speech-group.github.io/StyleStream/"><img src="https://img.shields.io/badge/GitHub-Demo-orange.svg" alt="demo" /></a>
<a href="https://github.com/Berkeley-Speech-Group/StyleStream"><img src="https://img.shields.io/badge/GitHub-Code-black.svg?logo=github" alt="GitHub" /></a>
<a href="https://huggingface.co/Louis0324/StyleStream/blob/main/LICENSE"><img src="https://img.shields.io/badge/License-Research%2FEducation-blue.svg" alt="license" /></a>
</p>
<p align="center">
<strong>StyleStream: Real-Time Zero-Shot Voice Style Conversion</strong>
</p>
<p align="center">
Official PyTorch model weights for streamable voice style conversion in timbre, accent, and emotion.
</p>
<p align="center">
<img src="assets/figures/overview.png" alt="StyleStream overview" width="100%" />
</p>
**Release note:** To reduce voice-cloning misuse, this public release excludes the style encoder weights. Public inference uses curated target speaker embeddings, not arbitrary target-speaker cloning.
## News
- 2026/06/11: StyleStream offline / streaming inference code and weights are open sourced! 🔥 🔥 🔥
- 2026/06/03: StyleStream was accepted to the INTERSPEECH 2026 long paper track! 🎉 🎉 🎉
## Files
This Hugging Face repo hosts the public inference assets:
- `stylizer-no-style-enc.ckpt`: stylizer checkpoint without style encoder weights
- `destylizer.ckpt`: destylizer checkpoint
- `vocos_causal_best.ckpt`: causal vocoder checkpoint
- `target_spkrs.tar`: larger curated target speaker inventory
Small target examples and the full inference code are available in the GitHub repo:
```text
https://github.com/Berkeley-Speech-Group/StyleStream
```
## Download
Install or upgrade `huggingface_hub`, which provides the `hf` CLI:
```bash
pip install -U huggingface_hub
```
From the StyleStream project root, download checkpoints:
```bash
hf download Louis0324/StyleStream \
  stylizer-no-style-enc.ckpt destylizer.ckpt vocos_causal_best.ckpt \
  --repo-type model --local-dir assets/ckpts
```
Download the larger target speaker inventory:
```bash
hf download Louis0324/StyleStream target_spkrs.tar --repo-type model --local-dir assets/target_spkrs
```
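For scripted setups, the same downloads can be driven from Python. The sketch below assumes the `hf_hub_download` function from `huggingface_hub`; the `DOWNLOAD_PLAN` mapping and the `download_assets` helper are illustrative names and are not part of the StyleStream code.

```python
from pathlib import Path

REPO_ID = "Louis0324/StyleStream"

# Local directory (relative to the project root) -> files to fetch into it.
DOWNLOAD_PLAN = {
    "assets/ckpts": [
        "stylizer-no-style-enc.ckpt",
        "destylizer.ckpt",
        "vocos_causal_best.ckpt",
    ],
    "assets/target_spkrs": ["target_spkrs.tar"],
}


def download_assets(project_root="."):
    """Download every file in DOWNLOAD_PLAN and return the local paths."""
    # Imported lazily so the plan above can be inspected without the package.
    from huggingface_hub import hf_hub_download

    paths = []
    for local_dir, filenames in DOWNLOAD_PLAN.items():
        for filename in filenames:
            path = hf_hub_download(
                repo_id=REPO_ID,
                filename=filename,
                repo_type="model",
                local_dir=Path(project_root) / local_dir,
            )
            paths.append(path)
    return paths
```

Calling `download_assets()` from the StyleStream project root should produce the same files as the two `hf download` commands above.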
Expected local layout:
```text
assets/ckpts/
  stylizer-no-style-enc.ckpt
  destylizer.ckpt
  vocos_causal_best.ckpt
assets/target_spkrs/
  target_spkrs.tar
```
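A quick way to confirm this layout before running inference is to check each expected path. This is a minimal sketch using only the standard library; `EXPECTED_ASSETS` and `missing_assets` are hypothetical names chosen for illustration.

```python
from pathlib import Path

# Paths taken from the expected layout, relative to the project root.
EXPECTED_ASSETS = [
    "assets/ckpts/stylizer-no-style-enc.ckpt",
    "assets/ckpts/destylizer.ckpt",
    "assets/ckpts/vocos_causal_best.ckpt",
    "assets/target_spkrs/target_spkrs.tar",
]


def missing_assets(project_root="."):
    """Return the expected asset paths that are not files under project_root."""
    root = Path(project_root)
    return [rel for rel in EXPECTED_ASSETS if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_assets()
    if missing:
        print("Missing assets:")
        for rel in missing:
            print(f"  {rel}")
    else:
        print("All expected assets are present.")
```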
## Usage
Clone the GitHub repo and follow its setup instructions:
```bash
git clone https://github.com/Berkeley-Speech-Group/StyleStream.git
cd StyleStream
pip install -r requirements.txt
```
Offline Streamlit app:
```bash
streamlit run inference/offline_app.py
```
Recommended streaming inference:
```bash
python inference/streaming.py
```
Use this terminal script for the fastest real-time performance. It runs a speed test before opening audio I/O, selects an inference-step setting that can keep up with streaming, and lets you switch target styles by typing a target index.
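This README does not spell out how the speed test chooses a setting. One plausible rule, sketched here purely as an assumption, is to time one chunk at several candidate step counts and keep the largest count whose processing time still fits within the chunk duration. The function name, the `safety` margin, and the example measurements are hypothetical and are not taken from `inference/streaming.py`.

```python
def select_inference_steps(latency_by_steps, chunk_seconds, safety=0.8):
    """Pick the largest step count whose latency fits the real-time budget.

    latency_by_steps: dict mapping step count -> seconds to process one chunk.
    chunk_seconds: duration of audio covered by one chunk.
    safety: fraction of the chunk duration allowed for processing
            (a hypothetical margin, not a StyleStream parameter).
    Returns None when no candidate is fast enough.
    """
    budget = safety * chunk_seconds
    feasible = [
        steps
        for steps, latency in latency_by_steps.items()
        if latency <= budget
    ]
    return max(feasible) if feasible else None


# Made-up measurements: more inference steps cost more time per chunk.
measured = {1: 0.03, 2: 0.05, 4: 0.09, 8: 0.17}
chosen = select_inference_steps(measured, chunk_seconds=0.12)
print(f"Selected inference steps: {chosen}")
```

Keeping a margin below the full chunk duration leaves headroom for audio I/O and scheduling jitter, which is why the sketch compares against `safety * chunk_seconds` rather than `chunk_seconds` itself.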
Streaming Streamlit app:
```bash
streamlit run inference/streaming_app.py
```
Use this when you want browser-based target selection, audio device selection, live status, and speed-test visualization. It has the same core streaming functionality, but is slower because of Streamlit overhead.
Command-line examples:
```bash
./inference/run_inference_offline.sh
./inference/run_inference_simulate_streaming.sh
```
## Style Inventory
Target styles use this folder format:
```text
target_name/
  target_name.wav
  target_name.npy
```
The `.wav` provides target mel/acoustic context. The `.npy` file is the pre-extracted style embedding with shape `[768]`.
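When adding or unpacking targets, it can help to check that each folder follows this format. The sketch below assumes NumPy is installed and that the inventory is a directory of target folders; `list_valid_targets` is a hypothetical helper, and it only checks that the `.wav` exists without decoding the audio.

```python
from pathlib import Path

import numpy as np

EMBEDDING_SHAPE = (768,)  # style embedding shape stated in this README


def list_valid_targets(inventory_dir):
    """Return names of target folders with a matching .wav and a [768] .npy."""
    valid = []
    for folder in sorted(Path(inventory_dir).iterdir()):
        if not folder.is_dir():
            continue
        wav = folder / f"{folder.name}.wav"
        npy = folder / f"{folder.name}.npy"
        if not (wav.is_file() and npy.is_file()):
            continue
        # Load the pre-extracted style embedding and verify its shape.
        embedding = np.load(npy)
        if embedding.shape == EMBEDDING_SHAPE:
            valid.append(folder.name)
    return valid
```

Folders that are missing either file, or whose embedding has a different shape, are skipped rather than raising an error, so the result can be compared against the full folder list to spot problems.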
## Intended Use
StyleStream is released for educational, research, and not-for-profit use. It is intended for voice style conversion research, benchmarking, comparison, and reproducible inference.
The public release does not include style encoder weights and does not support arbitrary target-speaker cloning.
## License
The code is released under a **research, educational, and not-for-profit software license**. Commercial use requires prior written permission from The Regents of the University of California.
See the `LICENSE` file in this Hugging Face model repo:
```text
https://huggingface.co/Louis0324/StyleStream/blob/main/LICENSE
```
## Acknowledgements
[F5-TTS](https://arxiv.org/abs/2410.06885): stylizer flow matching modules.
## Citation
If you find StyleStream useful, please consider starring the GitHub repo and citing the paper:
```bibtex
@article{liu2026stylestream,
  title={StyleStream: Real-Time Zero-Shot Voice Style Conversion},
  author={Yisi Liu and Nicholas Lee and Gopala Anumanchipalli},
  journal={arXiv preprint arXiv:2602.20113},
  year={2026}
}
```