---
license: apache-2.0
tags:
- audio
- vocoder
- speech-synthesis
- streaming
- pytorch-lightning
- causal-conv
language:
- en
library_name: pytorch
---

# Streaming Vocos: Neural vocoder for fast streaming applications

**Streaming Vocos** is a streaming-friendly replication of the original **Vocos** neural vocoder, modified for **causal / streaming inference**. Unlike typical GAN vocoders that generate waveform samples directly in the time domain, Vocos predicts **spectral coefficients**, enabling fast waveform reconstruction via the inverse Fourier transform, which makes it well-suited for **low-latency** and **real-time** settings.

This implementation replaces vanilla CNN blocks with **causal CNNs** and provides a **streaming interface** with a dynamically adjustable chunk size (in multiples of the hop size).

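The causal constraint can be sketched with a left-padded `Conv1d`. This is an illustrative layer only, not the exact block used in this repo: output at time `t` depends on inputs up to `t`, so past outputs never change as new frames arrive.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """Left-padded 1-D convolution: output at time t sees inputs only up to t.
    Illustrative sketch, not the exact layer used in this repo."""
    def __init__(self, in_ch: int, out_ch: int, kernel_size: int, dilation: int = 1):
        super().__init__()
        self.left_pad = (kernel_size - 1) * dilation  # pad the past only
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, T)
        x = F.pad(x, (self.left_pad, 0))  # no right (future) padding
        return self.conv(x)

conv = CausalConv1d(1, 1, kernel_size=3).eval()
x = torch.randn(1, 1, 10)
with torch.no_grad():
    y = conv(x)
print(y.shape)  # output length matches input length
```

Because nothing to the right of `t` is ever read, perturbing a future input leaves all earlier outputs unchanged, which is exactly what stateful chunked decoding relies on.
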
- **Input:** 50 Hz log-mel spectrogram
  - window = 1024, hop = 320
- **Output:** 16 kHz waveform audio

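The hop of 320 samples at 16 kHz is what makes the mel stream 50 Hz; a quick check of the frame arithmetic:

```python
sample_rate = 16_000   # Hz, output waveform
hop = 320              # samples advanced per mel frame
frame_rate = sample_rate / hop        # mel frames per second
frame_ms = 1000 * hop / sample_rate   # audio duration covered by one mel frame
print(frame_rate, frame_ms)  # → 50.0 20.0
```
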
Training follows the GAN objective of the original Vocos, while adopting loss functions inspired by Descript's audio codec.

**Original Vocos resources:**
- Audio samples: https://gemelo-ai.github.io/vocos/
- Paper: https://arxiv.org/abs/2306.00814

---

## ⚡ Streaming Latency & Real-Time Performance

We benchmark **Streaming Vocos** in **streaming inference mode** using chunked mel-spectrogram decoding on both CPU and GPU.

### Benchmark setup

- **Audio duration:** 3.24 s
- **Sample rate:** 16 kHz
- **Mel hop size:** 320 samples (20 ms per mel frame)
- **Chunk size:** 5 mel frames (100 ms buffering latency)
- **Runs:** 100 warm-up + 1000 timed runs
- **Inference mode:** streaming (stateful causal decoding)

**Metrics**
- **Processing time per chunk**
- **End-to-end latency** = chunk buffering + processing time
- **RTF (real-time factor)** = processing time / audio duration

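These definitions can be checked against the CPU figures reported in the results table (14 ms per chunk, 464 ms total processing, 3.24 s of audio):

```python
hop_ms = 320 / 16_000 * 1000   # 20 ms of audio per mel frame
chunk_frames = 5
buffer_ms = chunk_frames * hop_ms   # chunk buffering latency: 100 ms

proc_per_chunk_ms = 14.0            # CPU figure from the results table
end_to_end_ms = buffer_ms + proc_per_chunk_ms

audio_s = 3.24
total_proc_ms = 464.0               # CPU total processing from the table
rtf = (total_proc_ms / 1000) / audio_s

print(end_to_end_ms, round(rtf, 2))  # → 114.0 0.14
```
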
---

### Results

#### Streaming performance (chunk size = 5 frames, 100 ms buffer)

| Device | Avg proc / chunk | First-chunk proc | End-to-end latency | Total proc (3.24 s audio) | RTF |
|--------|------------------|------------------|--------------------|---------------------------|-----|
| **CPU** | 14.0 ms | 14.0 ms | **114.0 ms** | 464 ms | 0.14 |
| **GPU (CUDA)** | **3.4 ms** | **3.3 ms** | **103.3 ms** | **113 ms** | **0.035** |

> End-to-end latency includes the **100 ms chunk buffering delay** required for streaming inference.

---

### Interpretation

- **Real-time capable on CPU**
  Streaming Vocos achieves an RTF of approximately **0.14**, i.e., inference runs about 7× faster than real time.

- **Ultra-low compute overhead on GPU**
  Chunk processing time drops to **~3.4 ms**, so overall latency is dominated by buffering rather than computation.

- **Streaming-friendly first-chunk behavior**
  First-chunk latency closely matches steady-state latency, indicating **no cold-start penalty** during streaming inference.

- **Latency–quality tradeoff**
  Smaller chunk sizes further reduce buffering latency (e.g., 1–2 frames → 20–40 ms), at the cost of slightly higher computational overhead.

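Since buffering latency is simply chunk size times the 20 ms frame duration, the tradeoff can be tabulated directly:

```python
hop_ms = 20  # one mel frame at hop 320 / 16 kHz
latencies = {c: c * hop_ms for c in (1, 2, 5)}  # chunk frames -> buffering ms
print(latencies)  # → {1: 20, 2: 40, 5: 100}
```
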
---

With a **chunk size of 1 frame (20 ms buffering)**, GPU end-to-end latency drops below **25 ms**, making **Streaming Vocos** suitable for **interactive and conversational TTS pipelines**.

## Checkpoints

This repo provides a PyTorch Lightning checkpoint:

- `epoch=3.ckpt`

You can download it from the “Files” tab, or directly via `hf_hub_download` (example below).

---

## Quickstart (inference)

### Install

```bash
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121  # or CPU wheels
pip install lightning librosa scipy matplotlib huggingface_hub
```

Clone the GitHub repo:

```bash
git clone https://github.com/warisqr007/vocos.git
cd vocos
```

### Run inference (offline)

```python
import torch
import librosa
from huggingface_hub import hf_hub_download

from src.modules import VocosVocoderModule  # from the GitHub codebase

ckpt_path = hf_hub_download(
    repo_id="warisqr007/StreamingVocos",
    filename="epoch=3.ckpt",
)

model = VocosVocoderModule.load_from_checkpoint(ckpt_path, map_location="cpu")
model.eval()

wav_path = "your_input.wav"
audio, _ = librosa.load(wav_path, sr=16000, mono=True)

audio_t = torch.from_numpy(audio).unsqueeze(0).unsqueeze(0)  # (B=1, C=1, T)

with torch.no_grad():
    mel = model.feature_extractor(audio_t)
    y = model(mel).squeeze().cpu().numpy()  # reconstructed waveform @ 16 kHz
```

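To listen to the result, the reconstructed array can be written to disk with `scipy` (installed in the Quickstart). The random array here is only a stand-in for the `y` produced by the snippet above:

```python
import numpy as np
from scipy.io import wavfile

# Stand-in for the `y` array produced by the snippet above.
y = np.random.uniform(-0.5, 0.5, 16_000).astype(np.float32)

wavfile.write("reconstructed.wav", 16_000, y)  # float32 WAV at 16 kHz
```
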
|
| | ### Streaming inference (chunked mel) |
| | ```python |
| | import torch |
| | |
| | chunk_size = 1 # mel frames per chunk (adjust as desired) |
| | |
| | with torch.no_grad(), model.decoder[0].streaming(chunk_size), model.decoder[1].streaming(chunk_size): |
| | y_chunks = [] |
| | for mel_chunk in mel.split(chunk_size, dim=2): |
| | y_chunks.append(model(mel_chunk)) |
| | y_stream = torch.cat(y_chunks, dim=2).squeeze().cpu().numpy() |
| | ``` |
| |
|
| | ### Space demo |
| | A Gradio demo Space is provided [here](https://huggingface.co/spaces/warisqr007/StreamingVocos_16khz) |
| |
|
### Acknowledgements

- [Vocos repo](https://github.com/gemelo-ai/vocos)
- [Moshi repo (streaming implementation)](https://github.com/kyutai-labs/moshi)
- [descript-audio-codec losses](https://github.com/descriptinc/descript-audio-codec)
- [lightning-template](https://github.com/DavidZhang73/pytorch-lightning-template)