---
license: apache-2.0
tags:
- audio
- vocoder
- speech-synthesis
- streaming
- pytorch-lightning
- causal-conv
language:
- en
library_name: pytorch
---

# Streaming Vocos: Neural vocoder for fast streaming applications

**Streaming Vocos** is a streaming-friendly replication of the original **Vocos** neural vocoder design, modified for **causal / streaming inference**. Unlike typical GAN vocoders that generate waveform samples in the time domain, Vocos predicts **spectral coefficients**, enabling fast waveform reconstruction via inverse Fourier transform, which makes it well-suited for **low-latency** and **real-time** settings.

This implementation replaces the vanilla CNN blocks with **causal CNNs** and provides a **streaming interface** with a dynamically adjustable chunk size (in multiples of the hop size).

- **Input:** 50 Hz log-mel spectrogram
  - window = 1024, hop = 320
- **Output:** 16 kHz waveform audio

Training follows the GAN objective of the original Vocos, while adopting loss functions inspired by Descript's audio codec.

**Original Vocos resources:**

- Audio samples: https://gemelo-ai.github.io/vocos/
- Paper: https://arxiv.org/abs/2306.00814

---

## ⚡ Streaming Latency & Real-Time Performance

We benchmark **Streaming Vocos** in **streaming inference mode** using chunked mel-spectrogram decoding on both CPU and GPU.

### Benchmark setup

- **Audio duration:** 3.24 s
- **Sample rate:** 16 kHz
- **Mel hop size:** 320 samples (20 ms per mel frame)
- **Chunk size:** 5 mel frames (100 ms buffering latency)
- **Runs:** 100 warm-up + 1000 timed runs
- **Inference mode:** Streaming (stateful causal decoding)

**Metrics**

- **Processing time per chunk**
- **End-to-end latency** = chunk buffering + processing time
- **RTF (Real-Time Factor)** = processing time / audio duration

---

### Results

#### Streaming performance (chunk size = 5 frames, 100 ms buffer)

| Device | Avg proc / chunk | First-chunk proc | End-to-end latency | Total proc (3.24 s audio) | RTF |
|------|------------------|------------------|--------------------|---------------------------|-----|
| **CPU** | 14.0 ms | 14.0 ms | **114.0 ms** | 464 ms | 0.14 |
| **GPU (CUDA)** | **3.4 ms** | **3.3 ms** | **103.3 ms** | **113 ms** | **0.035** |

> End-to-end latency includes the **100 ms chunk buffering delay** required for streaming inference.

---

### Interpretation

- **Real-time capable on CPU**
  Streaming Vocos achieves an RTF of approximately **0.14**, corresponding to inference running ~7× faster than real time.
- **Ultra-low compute overhead on GPU**
  Chunk processing time is reduced to **~3.4 ms**, so overall latency is dominated by buffering rather than computation.
- **Streaming-friendly first-chunk behavior**
  First-chunk latency closely matches steady-state latency, indicating **no cold-start penalty** during streaming inference.
- **Latency–compute tradeoff**
  Smaller chunk sizes further reduce buffering latency (e.g., 1–2 frames → <40 ms), at the cost of slightly increased computational overhead.

---

With a **chunk size of 1 frame (20 ms buffering)**, GPU end-to-end latency drops below **25 ms**, making **Streaming Vocos** suitable for **interactive and conversational TTS pipelines**.

## Checkpoints

This repo provides a PyTorch Lightning checkpoint:

- `epoch=3.ckpt`

You can download it from the “Files” tab, or directly via `hf_hub_download` (example below).
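A minimal download snippet (the quickstart below uses the same call):

```python
from huggingface_hub import hf_hub_download

# Fetch the Lightning checkpoint from this repo
ckpt_path = hf_hub_download(
    repo_id="warisqr007/StreamingVocos",
    filename="epoch=3.ckpt",
)
```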
---

## Quickstart (inference)

### Install

```bash
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121  # or cpu wheels
pip install lightning librosa scipy matplotlib huggingface_hub
```

Clone the GitHub repo:

```bash
git clone https://github.com/warisqr007/vocos.git
cd vocos
```

### Run inference (offline)

```python
import torch
import librosa
from huggingface_hub import hf_hub_download

from src.modules import VocosVocoderModule  # from the GitHub codebase

ckpt_path = hf_hub_download(
    repo_id="warisqr007/StreamingVocos",
    filename="epoch=3.ckpt",
)

model = VocosVocoderModule.load_from_checkpoint(ckpt_path, map_location="cpu")
model.eval()

wav_path = "your_input.wav"
audio, _ = librosa.load(wav_path, sr=16000, mono=True)
audio_t = torch.from_numpy(audio).unsqueeze(0).unsqueeze(0)  # (B=1, 1, T)

mel = model.feature_extractor(audio_t)

with torch.no_grad():
    y = model(mel).squeeze().cpu().numpy()  # reconstructed waveform @ 16 kHz
```

### Streaming inference (chunked mel)

```python
import torch

chunk_size = 1  # mel frames per chunk (adjust as desired)

with torch.no_grad(), model.decoder[0].streaming(chunk_size), model.decoder[1].streaming(chunk_size):
    y_chunks = []
    for mel_chunk in mel.split(chunk_size, dim=2):
        y_chunks.append(model(mel_chunk))

y_stream = torch.cat(y_chunks, dim=2).squeeze().cpu().numpy()
```

### Space demo

A Gradio demo Space is available [here](https://huggingface.co/spaces/warisqr007/StreamingVocos_16khz).

### Acknowledgements

- [Vocos Repo](https://github.com/gemelo-ai/vocos)
- [Moshi Repo for streaming implementation](https://github.com/kyutai-labs/moshi)
- [descript-audio-codec losses](https://github.com/descriptinc/descript-audio-codec)
- [lightning-template](https://github.com/DavidZhang73/pytorch-lightning-template)
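## Measuring streaming latency

The latency figures in the benchmark section follow the metric definitions given there (end-to-end latency = chunk buffering + processing time; RTF = processing time / audio duration). Below is a minimal timing sketch, not the exact benchmark script; it assumes `model` and `mel` from the quickstart above and reuses the streaming interface from the chunked-inference example.

```python
import time
import torch

chunk_size = 5               # mel frames per chunk (5 x 20 ms = 100 ms buffering)
frame_seconds = 320 / 16000  # hop = 320 samples at 16 kHz -> 20 ms per mel frame

chunk_times = []
with torch.no_grad(), model.decoder[0].streaming(chunk_size), model.decoder[1].streaming(chunk_size):
    for mel_chunk in mel.split(chunk_size, dim=2):
        t0 = time.perf_counter()
        model(mel_chunk)
        if torch.cuda.is_available():
            torch.cuda.synchronize()  # ensure GPU kernels finish before stopping the clock
        chunk_times.append(time.perf_counter() - t0)

audio_seconds = mel.shape[2] * frame_seconds
proc_seconds = sum(chunk_times)
print(f"avg proc / chunk  : {1000 * proc_seconds / len(chunk_times):.1f} ms")
print(f"first-chunk proc  : {1000 * chunk_times[0]:.1f} ms")
print(f"end-to-end latency: {1000 * (chunk_size * frame_seconds + chunk_times[0]):.1f} ms")
print(f"RTF               : {proc_seconds / audio_seconds:.3f}")
```

For stable numbers, add warm-up passes and average over many repetitions (the benchmark table above uses 100 warm-up + 1000 timed runs).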