| --- |
| license: other |
| license_name: uc-research-education-not-for-profit |
| license_link: https://huggingface.co/Louis0324/StyleStream/blob/main/LICENSE |
| library_name: pytorch |
| tags: |
| - voice-conversion |
| - speech |
| - audio |
| - streaming |
| - style-transfer |
| - research |
| - not-for-profit |
| --- |
| |
| <h1 align="center"> |
| StyleStream |
| </h1> |
|
|
| <p align="center"> |
| <a href="http://arxiv.org/abs/2602.20113"><img src="https://img.shields.io/badge/arXiv-2602.20113-b31b1b.svg?logo=arXiv" alt="arXiv" /></a> |
| <a href="https://berkeley-speech-group.github.io/StyleStream/"><img src="https://img.shields.io/badge/GitHub-Demo-orange.svg" alt="demo" /></a> |
| <a href="https://github.com/Berkeley-Speech-Group/StyleStream"><img src="https://img.shields.io/badge/GitHub-Code-black.svg?logo=github" alt="GitHub" /></a> |
| <a href="https://huggingface.co/Louis0324/StyleStream/blob/main/LICENSE"><img src="https://img.shields.io/badge/License-Research%2FEducation-blue.svg" alt="license" /></a> |
| </p> |
|
|
| <p align="center"> |
| <strong>StyleStream: Real-Time Zero-Shot Voice Style Conversion</strong> |
| </p> |
|
|
| <p align="center"> |
| Official PyTorch model weights for streamable voice style conversion in timbre, accent, and emotion. |
| </p> |
|
|
| <p align="center"> |
| <img src="assets/figures/overview.png" alt="StyleStream overview" width="100%" /> |
| </p> |
|
|
| **Release note:** To reduce voice-cloning misuse, this public release excludes the style encoder weights. Public inference uses curated target speaker embeddings, not arbitrary target-speaker cloning. |
|
|
| ## News |
|
|
| - 2026/06/11: StyleStream offline / streaming inference code and weights are open sourced! 🔥 🔥 🔥 |
| - 2026/06/03: StyleStream was accepted to the INTERSPEECH 2026 long paper track! 🎉 🎉 🎉 |
|
|
| ## Files |
|
|
| This Hugging Face repo hosts the public inference assets: |
|
|
| - `stylizer-no-style-enc.ckpt`: stylizer checkpoint without style encoder weights |
| - `destylizer.ckpt`: destylizer checkpoint |
| - `vocos_causal_best.ckpt`: causal vocoder checkpoint |
| - `target_spkrs.tar`: larger curated target speaker inventory |
|
|
| Small target examples and the full inference code are available in the GitHub repo: |
|
|
| ```text |
| https://github.com/Berkeley-Speech-Group/StyleStream |
| ``` |
|
|
| ## Download |
|
|
| Install or upgrade `huggingface_hub` if needed; the `hf` command used below ships with recent releases: |
|
|
| ```bash |
| pip install -U huggingface_hub |
| ``` |
|
|
| From the StyleStream project root, download checkpoints: |
|
|
| ```bash |
| hf download Louis0324/StyleStream \ |
| stylizer-no-style-enc.ckpt destylizer.ckpt vocos_causal_best.ckpt \ |
| --repo-type model --local-dir assets/ckpts |
| ``` |
|
|
| Download the larger target speaker inventory: |
|
|
| ```bash |
| hf download Louis0324/StyleStream target_spkrs.tar --repo-type model --local-dir assets/target_spkrs |
| ``` |
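
The archive has to be unpacked before its targets can be used. The following is a minimal Python sketch, assuming the targets should be extracted next to the archive. The archive's internal layout is not documented here, so inspect the result before relying on it. `extract_targets` is a hypothetical helper name, not part of the StyleStream code.

```python
import tarfile
from pathlib import Path


def extract_targets(archive: Path, dest: Path) -> list[str]:
    """Unpack a target-speaker archive into dest and return its member names."""
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive) as tar:
        names = tar.getnames()
        # Prefer the "data" extraction filter where this Python version has it;
        # it rejects unsafe members such as absolute paths.
        if hasattr(tarfile, "data_filter"):
            tar.extractall(dest, filter="data")
        else:
            tar.extractall(dest)
    return names


if __name__ == "__main__":
    # Path taken from the download command above; the destination is an assumption.
    archive = Path("assets/target_spkrs/target_spkrs.tar")
    if archive.exists():
        members = extract_targets(archive, archive.parent)
        print(f"extracted {len(members)} entries into {archive.parent}")
    else:
        print(f"{archive} not found; download it first")
```
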
|
|
| Expected local layout: |
|
|
| ```text |
| assets/ckpts/ |
| stylizer-no-style-enc.ckpt |
| destylizer.ckpt |
| vocos_causal_best.ckpt |
| |
| assets/target_spkrs/ |
| target_spkrs.tar |
| ``` |
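
To confirm the downloads landed where the layout above expects them, a short check like the following can be run from the project root. It is only a sketch: `missing_assets` is a hypothetical helper, and the path list simply mirrors the layout shown above.

```python
from pathlib import Path

# Paths copied from the expected layout, relative to the project root.
EXPECTED_ASSETS = [
    "assets/ckpts/stylizer-no-style-enc.ckpt",
    "assets/ckpts/destylizer.ckpt",
    "assets/ckpts/vocos_causal_best.ckpt",
    "assets/target_spkrs/target_spkrs.tar",
]


def missing_assets(root: Path) -> list[str]:
    """Return the expected asset paths that are not files under root."""
    return [rel for rel in EXPECTED_ASSETS if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_assets(Path.cwd())
    if missing:
        print("Missing files:")
        for rel in missing:
            print(f"  {rel}")
    else:
        print("All expected files are present.")
```
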
|
|
| ## Usage |
|
|
| Clone the GitHub repo and follow its setup instructions: |
|
|
| ```bash |
| git clone https://github.com/Berkeley-Speech-Group/StyleStream.git |
| cd StyleStream |
| pip install -r requirements.txt |
| ``` |
|
|
| Offline Streamlit app: |
|
|
| ```bash |
| streamlit run inference/offline_app.py |
| ``` |
|
|
| Recommended streaming inference: |
|
|
| ```bash |
| python inference/streaming.py |
| ``` |
|
|
| Use this terminal script for the fastest real-time performance. It runs a speed test before opening audio I/O, selects an inference-step setting that can keep up with streaming, and lets you switch target styles by typing a target index. |
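
The actual selection logic lives in `inference/streaming.py` and is not reproduced here. As a rough illustration of the idea only, the sketch below picks the largest step count whose measured per-chunk latency fits inside the chunk duration. Everything in it is an assumption: the function name, the `margin` parameter, the timings, and the premise that more steps are preferable when they fit the budget.

```python
from typing import Optional


def pick_inference_steps(
    latency_by_steps: dict[int, float],
    chunk_seconds: float,
    margin: float = 0.9,
) -> Optional[int]:
    """Return the largest step count that processes a chunk in real time.

    latency_by_steps maps a candidate step count to the measured seconds
    needed to process one chunk. margin leaves headroom below the chunk
    duration. Returns None when no candidate is fast enough.
    """
    budget = chunk_seconds * margin
    streamable = [steps for steps, secs in latency_by_steps.items() if secs <= budget]
    return max(streamable) if streamable else None


if __name__ == "__main__":
    # Made-up timings for illustration only, not measurements of StyleStream.
    timings = {1: 0.02, 2: 0.04, 4: 0.08, 8: 0.16}
    print(pick_inference_steps(timings, chunk_seconds=0.1))
```
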
|
|
| Streaming Streamlit app: |
|
|
| ```bash |
| streamlit run inference/streaming_app.py |
| ``` |
|
|
| Use this when you want browser-based target selection, audio device selection, live status, and speed-test visualization. It has the same core streaming functionality, but is slower because of Streamlit overhead. |
|
|
| Command-line examples: |
|
|
| ```bash |
| ./inference/run_inference_offline.sh |
| ./inference/run_inference_simulate_streaming.sh |
| ``` |
|
|
| ## Style Inventory |
|
|
| Target styles use this folder format: |
|
|
| ```text |
| target_name/ |
| target_name.wav |
| target_name.npy |
| ``` |
|
|
| The `.wav` file provides target mel/acoustic context. The `.npy` file is the pre-extracted style embedding, a 1-D array of shape `[768]`. |
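
A target folder can be sanity-checked against this format before it is passed to inference. The sketch below assumes NumPy is installed and that the embedding is stored as a plain 1-D array. `check_target` is a hypothetical helper, not part of the StyleStream code, and it only checks that the `.wav` exists without decoding it.

```python
from pathlib import Path

import numpy as np

EMBED_DIM = 768  # style embedding size stated above


def check_target(folder: Path) -> np.ndarray:
    """Validate one target folder and return its style embedding."""
    name = folder.name
    wav_path = folder / f"{name}.wav"
    npy_path = folder / f"{name}.npy"
    if not wav_path.is_file():
        raise FileNotFoundError(f"missing reference audio: {wav_path}")
    if not npy_path.is_file():
        raise FileNotFoundError(f"missing style embedding: {npy_path}")
    embedding = np.load(npy_path)
    if embedding.shape != (EMBED_DIM,):
        raise ValueError(
            f"{npy_path}: expected shape ({EMBED_DIM},), got {embedding.shape}"
        )
    return embedding
```

Calling `check_target(Path("path/to/target_name"))` returns the embedding on success and raises a descriptive error otherwise.
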
|
|
| ## Intended Use |
|
|
| StyleStream is released for educational, research, and not-for-profit use. It is intended for voice style conversion research, benchmarking, comparison, and reproducible inference. |
|
|
| The public release does not include style encoder weights and does not support arbitrary target-speaker cloning. |
|
|
| ## License |
|
|
| The code is released under a **research, educational, and not-for-profit software license**. Commercial use requires prior written permission from The Regents of the University of California. |
|
|
| See the `LICENSE` file in this Hugging Face model repo: |
|
|
| ```text |
| https://huggingface.co/Louis0324/StyleStream/blob/main/LICENSE |
| ``` |
|
|
| ## Acknowledgements |
|
|
| [F5-TTS](https://arxiv.org/abs/2410.06885): basis for the stylizer's flow-matching modules. |
|
|
| ## Citation |
|
|
| If you find StyleStream useful, please consider starring the GitHub repo and citing the paper: |
|
|
| ```bibtex |
| @article{liu2026stylestream, |
| title={StyleStream: Real-Time Zero-Shot Voice Style Conversion}, |
| author={Yisi Liu and Nicholas Lee and Gopala Anumanchipalli}, |
| journal={arXiv preprint arXiv:2602.20113}, |
| year={2026} |
| } |
| ``` |
|
|