llmvox / README.md

Update README.md

476af55 verified 11 months ago

5.72 kB

	---
	license: cc-by-nc-sa-4.0
	pipeline_tag: text-to-speech
	library_name: transformers
	tags:
	- tts
	- voice
	- wip
	---

	This repository contains the model as described in [LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM](https://hf.co/papers/2503.04724).

	For more information, check out the project page at https://mbzuai-oryx.github.io/LLMVoX/ and the code at https://github.com/mbzuai-oryx/LLMVoX.

	# LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM

	<div>
	<a href="https://mbzuai-oryx.github.io/LLMVoX/"><img src="https://img.shields.io/badge/Project-Page-blue" alt="Project Page"></a>
	<a href="https://arxiv.org/abs/2503.04724"><img src="https://img.shields.io/badge/arXiv-2503.04724-b31b1b.svg" alt="arXiv"></a>
	<a href="https://github.com/mbzuai-oryx/LLMVoX/"><img src="https://img.shields.io/badge/GitHub-LLMVoX-black?logo=github" alt="GitHub Repository"></a>
	<a href="https://github.com/mbzuai-oryx/LLMVoX/blob/main/LICENSE"><img src="https://img.shields.io/badge/License-MIT-yellow.svg" alt="License: MIT"></a>
	</div>

	Authors:
	[Sambal Shikar](https://github.com/mbzuai-oryx/LLMVoX?tab=readme-ov-file), [Mohammed Irfan K](https://scholar.google.com/citations?user=GJp0keYAAAAJ&hl=en), [Sahal Shaji Mullappilly](https://scholar.google.com/citations?user=LJWxVpUAAAAJ&hl=en), [Fahad Khan](https://sites.google.com/view/fahadkhans/home), [Jean Lahoud](https://scholar.google.com/citations?user=LsivLPoAAAAJ&hl=en), [Rao Muhammad Anwer](https://scholar.google.com/citations?hl=en&authuser=1&user=_KlvMVoAAAAJ), [Salman Khan](https://salman-h-khan.github.io/), [Hisham Cholakkal](https://scholar.google.com/citations?hl=en&user=bZ3YBRcAAAAJ)

	Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI), UAE

	<p align="center">
	<img src="assets/arch_diagram.svg" alt="LLMVoX Architecture" width="800px">
	</p>

	<video src="https://github.com/user-attachments/assets/6d305563-3c62-4f14-a8aa-acedf2143f76" width="500" controls></video>

	## Overview

	LLMVoX is a lightweight 30M-parameter, LLM-agnostic, autoregressive streaming Text-to-Speech (TTS) system designed to convert text outputs from Large Language Models into high-fidelity streaming speech with low latency.

	Key features:
	- 🚀 Lightweight & Fast: Only 30M parameters with end-to-end latency as low as 300ms
	- 🔌 LLM-Agnostic: Works with any LLM and Vision-Language Model without fine-tuning
	- 🌊 Multi-Queue Streaming: Enables continuous, low-latency speech generation
	- 🌐 Multilingual Support: Adaptable to new languages with dataset adaptation

	## Quick Start

	### Installation

	```bash
	# Requirements: CUDA 11.7+, Flash Attention 2.0+ compatible GPU

	git clone https://github.com/mbzuai-oryx/LLMVoX.git
	cd LLMVoX

	conda create -n llmvox python=3.9
	conda activate llmvox

	pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
	pip install flash-attn --no-build-isolation
	pip install -r requirements.txt

	# Download checkpoints from Hugging Face
	# https://huggingface.co/MBZUAI/LLMVoX/tree/main
	mkdir -p CHECKPOINTS
	# Download wavtokenizer_large_speech_320_24k.ckpt and ckpt_english_tiny.pt
	```

	### Voice Chat

	```bash
	# Basic usage
	python streaming_server.py --chat_type voice --llm_checkpoint "meta-llama/Llama-3.1-8B-Instruct"

	# With multiple GPUs
	python streaming_server.py --chat_type voice --llm_checkpoint "meta-llama/Llama-3.1-8B-Instruct" \
	--llm_device "cuda:0" --tts_device_1 1 --tts_device_2 2

	# Balance latency/quality
	python streaming_server.py --chat_type voice --llm_checkpoint "meta-llama/Llama-3.1-8B-Instruct" \
	--initial_dump_size_1 10 --initial_dump_size_2 160 --max_dump_size 1280
	```

	### Text Chat & Visual Speech

	```bash
	# Text-to-Speech
	python streaming_server.py --chat_type text --llm_checkpoint "meta-llama/Llama-3.1-8B-Instruct"

	# Visual Speech (Speech + Image → Speech)
	python streaming_server.py --chat_type visual_speech --llm_checkpoint "Qwen/Qwen2.5-VL-7B-Instruct" \
	--eos_token "<\|im_end\|>"

	# Multimodal (support for models like Phi-4)
	python streaming_server.py --chat_type multimodal --llm_checkpoint "microsoft/Phi-4-multimodal-instruct" \
	--eos_token "<\|end\|>"
	```

	## API Reference

	\| Endpoint \| Purpose \| Required Parameters \|
	\|----------\|---------\|---------------------\|
	\| `/tts` \| Text-to-speech \| `text`: String to convert \|
	\| `/voicechat` \| Voice conversations \| `audio_base64`, `source_language`, `target_language` \|
	\| `/multimodalchat` \| Voice + multiple images \| `audio_base64`, `image_list` \|
	\| `/vlmschat` \| Voice + single image \| `audio_base64`, `image_base64`, `source_language`, `target_language` \|

	## Local UI Demo

	<p align="center">
	<img src="assets/ui.png" alt="Demo UI" width="800px">
	</p>

	```bash
	# Start server
	python streaming_server.py --chat_type voice --llm_checkpoint "meta-llama/Llama-3.1-8B-Instruct" --api_port PORT

	# Launch UI
	python run_ui.py --ip STREAMING_SERVER_IP --port PORT
	```

	## Citation

	```bibtex
	@article{shikhar2025llmvox,
	title={LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM},
	author={Shikhar, Sambal and Kurpath, Mohammed Irfan and Mullappilly, Sahal Shaji and Lahoud, Jean and Khan, Fahad and Anwer, Rao Muhammad and Khan, Salman and Cholakkal, Hisham},
	journal={arXiv preprint arXiv:2503.04724},
	year={2025}
	}
	```

	## Acknowledgments

	- [Andrej Karpathy's NanoGPT](https://github.com/karpathy/nanoGPT)
	- [WavTokenizer](https://github.com/jishengpeng/WavTokenizer)
	- [Whisper](https://github.com/openai/whisper)
	- [Neural G2P](https://github.com/lingjzhu/CharsiuG2P)

	## License

	This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.