	---
	license: other
	license_name: uc-research-education-not-for-profit
	license_link: https://huggingface.co/Louis0324/StyleStream/blob/main/LICENSE
	library_name: pytorch
	tags:
	- voice-conversion
	- speech
	- audio
	- streaming
	- style-transfer
	- research
	- not-for-profit
	---

	<h1 align="center">
	StyleStream
	</h1>

	<p align="center">
	<a href="http://arxiv.org/abs/2602.20113"><img src="https://img.shields.io/badge/arXiv-2602.20113-b31b1b.svg?logo=arXiv" alt="arXiv" /></a>
	<a href="https://berkeley-speech-group.github.io/StyleStream/"><img src="https://img.shields.io/badge/GitHub-Demo-orange.svg" alt="demo" /></a>
	<a href="https://github.com/Berkeley-Speech-Group/StyleStream"><img src="https://img.shields.io/badge/GitHub-Code-black.svg?logo=github" alt="GitHub" /></a>
	<a href="https://huggingface.co/Louis0324/StyleStream/blob/main/LICENSE"><img src="https://img.shields.io/badge/License-Research%2FEducation-blue.svg" alt="license" /></a>
	</p>

	<p align="center">
	<strong>StyleStream: Real-Time Zero-Shot Voice Style Conversion</strong>
	</p>

	<p align="center">
	Official PyTorch model weights for streamable voice style conversion in timbre, accent, and emotion.
	</p>

	<p align="center">
	<img src="assets/figures/overview.png" alt="StyleStream overview" width="100%" />
	</p>

	Release note: To reduce voice-cloning misuse, this public release excludes the style encoder weights. Public inference uses curated target speaker embeddings, not arbitrary target-speaker cloning.

	## News

	- 2026/06/11: StyleStream offline / streaming inference code and weights are open sourced! 🔥 🔥 🔥
	- 2026/06/03: StyleStream was accepted to the INTERSPEECH 2026 long paper track! 🎉 🎉 🎉

	## Files

	This Hugging Face repo hosts the public inference assets:

	- `stylizer-no-style-enc.ckpt`: stylizer checkpoint without style encoder weights
	- `destylizer.ckpt`: destylizer checkpoint
	- `vocos_causal_best.ckpt`: causal vocoder checkpoint
	- `target_spkrs.tar`: larger curated target speaker inventory

	Small target examples and the full inference code are available in the GitHub repo:

	```text
	https://github.com/Berkeley-Speech-Group/StyleStream
	```

	## Download

	Install the Hugging Face CLI if needed:

	```bash
	pip install -U huggingface_hub
	```

	From the StyleStream project root, download checkpoints:

	```bash
	hf download Louis0324/StyleStream \
	  stylizer-no-style-enc.ckpt destylizer.ckpt vocos_causal_best.ckpt \
	  --repo-type model --local-dir assets/ckpts
	```

	Download the larger target speaker inventory:

	```bash
	hf download Louis0324/StyleStream target_spkrs.tar --repo-type model --local-dir assets/target_spkrs
	```
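The same downloads can also be scripted from Python. The following is a minimal sketch using `hf_hub_download` from `huggingface_hub`; the `ASSETS` table and the `download_assets` helper are illustrative names and are not part of the StyleStream code.

```python
from pathlib import Path

REPO_ID = "Louis0324/StyleStream"

# Destination folder (relative to the project root) for each public file.
ASSETS = {
    "stylizer-no-style-enc.ckpt": "assets/ckpts",
    "destylizer.ckpt": "assets/ckpts",
    "vocos_causal_best.ckpt": "assets/ckpts",
    "target_spkrs.tar": "assets/target_spkrs",
}


def download_assets(root="."):
    """Download every file listed in ASSETS and return the local paths."""
    # Imported lazily so the table above can be inspected without the package.
    from huggingface_hub import hf_hub_download

    paths = []
    for filename, subdir in ASSETS.items():
        path = hf_hub_download(
            repo_id=REPO_ID,
            filename=filename,
            repo_type="model",
            local_dir=Path(root) / subdir,
        )
        paths.append(Path(path))
    return paths
```

Calling `download_assets()` from the StyleStream project root should place the files in the same folders as the CLI commands.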

	Expected local layout:

	```text
	assets/ckpts/
	  stylizer-no-style-enc.ckpt
	  destylizer.ckpt
	  vocos_causal_best.ckpt

	assets/target_spkrs/
	  target_spkrs.tar
	```
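To confirm the layout before running inference, a small check like the one below can help. It is a sketch only: the `EXPECTED` table mirrors the layout listed above, and `missing_assets` is a hypothetical helper name, not something the repo provides.

```python
from pathlib import Path

# Expected files per folder, relative to the project root.
EXPECTED = {
    "assets/ckpts": [
        "stylizer-no-style-enc.ckpt",
        "destylizer.ckpt",
        "vocos_causal_best.ckpt",
    ],
    "assets/target_spkrs": ["target_spkrs.tar"],
}


def missing_assets(root="."):
    """Return the expected files that are absent under `root`, as relative paths."""
    missing = []
    for folder, names in EXPECTED.items():
        for name in names:
            if not (Path(root) / folder / name).is_file():
                missing.append(f"{folder}/{name}")
    return missing
```

An empty list from `missing_assets()` means every expected file is in place.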

	## Usage

	Clone the GitHub repo and follow its setup instructions:

	```bash
	git clone https://github.com/Berkeley-Speech-Group/StyleStream.git
	cd StyleStream
	pip install -r requirements.txt
	```

	Offline Streamlit app:

	```bash
	streamlit run inference/offline_app.py
	```

	Recommended streaming inference:

	```bash
	python inference/streaming.py
	```

	Use this terminal script for the fastest real-time performance. It runs a speed test before opening audio I/O, selects an inference-step setting that can keep up with the stream, and lets you switch target styles by typing a target index.
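One way to picture the speed test: streaming only works if each audio chunk is processed in less time than the chunk lasts. The sketch below is hypothetical and is not taken from `inference/streaming.py`. The function names, the `safety` margin, and the assumption that more inference steps are preferable whenever they fit the time budget are all made up for illustration.

```python
import time


def measure_latency(process_chunk, steps, trials=3):
    """Return the slowest wall-clock time of `process_chunk(steps)` over a few trials."""
    worst = 0.0
    for _ in range(trials):
        start = time.perf_counter()
        process_chunk(steps)  # stand-in for one chunk of model inference
        worst = max(worst, time.perf_counter() - start)
    return worst


def pick_streamable_steps(latency_by_steps, chunk_seconds, safety=0.8):
    """Pick the largest step count whose latency fits the real-time budget.

    `latency_by_steps` maps a step count to its measured seconds per chunk.
    Returns None when no candidate is fast enough.
    """
    budget = chunk_seconds * safety  # headroom for audio I/O and timing jitter
    fitting = [steps for steps, sec in latency_by_steps.items() if sec <= budget]
    return max(fitting) if fitting else None
```

For example, with 100 ms chunks and measured latencies of 20, 40, 90, and 200 ms for 1, 2, 4, and 8 steps, the default margin leaves an 80 ms budget, so 2 steps would be chosen.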

	Streaming Streamlit app:

	```bash
	streamlit run inference/streaming_app.py
	```

	Use this when you want browser-based target selection, audio device selection, live status, and speed-test visualization. It has the same core streaming functionality, but is slower because of Streamlit overhead.

	Command-line examples:

	```bash
	./inference/run_inference_offline.sh
	./inference/run_inference_simulate_streaming.sh
	```

	## Style Inventory

	Target styles use this folder format:

	```text
	target_name/
	  target_name.wav
	  target_name.npy
	```

	The `.wav` provides target mel/acoustic context. The `.npy` file is the pre-extracted style embedding with shape `[768]`.
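When adding or curating targets, it can be useful to confirm that a folder follows this format. The sketch below checks for the two files and reads only the `.npy` header to compare the stored shape with `(768,)`. `npy_shape` and `check_target` are hypothetical helpers, not part of the StyleStream code. They use the standard library alone, though `numpy.load(path).shape` is the more usual route when NumPy is installed.

```python
import ast
import struct
from pathlib import Path

EMBED_SHAPE = (768,)


def npy_shape(path):
    """Read the array shape from a .npy header without loading the data."""
    with open(path, "rb") as f:
        if f.read(6) != b"\x93NUMPY":
            raise ValueError(f"{path} is not a .npy file")
        major, _minor = f.read(2)
        # Format 1.x stores the header length in 2 bytes, later versions in 4.
        fmt, size = ("<H", 2) if major == 1 else ("<I", 4)
        (header_len,) = struct.unpack(fmt, f.read(size))
        # The header is a Python dict literal describing dtype, order and shape.
        header = ast.literal_eval(f.read(header_len).decode("latin1"))
    return tuple(header["shape"])


def check_target(target_dir):
    """Return a list of problems in one target folder (empty if it looks valid)."""
    target_dir = Path(target_dir)
    name = target_dir.name
    wav = target_dir / f"{name}.wav"
    npy = target_dir / f"{name}.npy"
    problems = []
    if not wav.is_file():
        problems.append(f"missing {wav.name}")
    if not npy.is_file():
        problems.append(f"missing {npy.name}")
    else:
        shape = npy_shape(npy)
        if shape != EMBED_SHAPE:
            problems.append(f"{npy.name} has shape {shape}, expected {EMBED_SHAPE}")
    return problems
```

`check_target("path/to/target_name")` returns an empty list for a well-formed target and a list of human-readable issues otherwise.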

	## Intended Use

	StyleStream is released for educational, research, and not-for-profit use. It is intended for voice style conversion research, benchmarking, comparison, and reproducible inference.

	The public release does not include style encoder weights and does not support arbitrary target-speaker cloning.

	## License

	The code is released under a research, educational, and not-for-profit software license. Commercial use requires prior written permission from The Regents of the University of California.

	See the `LICENSE` file in this Hugging Face model repo:

	```text
	https://huggingface.co/Louis0324/StyleStream/blob/main/LICENSE
	```

	## Acknowledgements

	[F5-TTS](https://arxiv.org/abs/2410.06885): stylizer flow matching modules.

	## Citation

	If you find StyleStream useful, please consider starring the GitHub repo and citing the paper:

	```bibtex
	@article{liu2026stylestream,
	  title={StyleStream: Real-Time Zero-Shot Voice Style Conversion},
	  author={Yisi Liu and Nicholas Lee and Gopala Anumanchipalli},
	  journal={arXiv preprint arXiv:2602.20113},
	  year={2026}
	}
	```