---
license: apache-2.0
tags:
- audio
- vocoder
- speech-synthesis
- streaming
- pytorch-lightning
- causal-conv
language:
- en
library_name: pytorch
---
# Streaming Vocos: Neural vocoder for fast streaming applications
**Streaming Vocos** is a streaming-friendly reimplementation of the original **Vocos** neural vocoder, modified for **causal / streaming inference**. Unlike typical GAN vocoders that generate waveform samples directly in the time domain, Vocos predicts **spectral coefficients**, enabling fast waveform reconstruction via the inverse Fourier transform, which makes it well suited to **low-latency** and **real-time** settings.
This implementation replaces vanilla CNN blocks with **causal CNNs** and provides a **streaming interface** with dynamically adjustable chunk size (in multiples of the hop size).
- **Input:** 50 Hz log-mel spectrogram
  - window = 1024, hop = 320 (at 16 kHz)
- **Output:** 16 kHz waveform audio
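As a sanity check on these analysis parameters, a plain `torch.stft` call (not the checkpoint's bundled `feature_extractor`, which should be used in practice) reproduces the 50 Hz frame rate:

```python
import torch

# window = 1024, hop = 320 at 16 kHz -> 16000 / 320 = 50 frames per second.
sr, win, hop = 16_000, 1024, 320
audio = torch.randn(1, sr)  # 1 s of dummy audio

spec = torch.stft(
    audio, n_fft=win, hop_length=hop, win_length=win,
    window=torch.hann_window(win), center=True, return_complex=True,
)
print(spec.shape)  # (1, 513, 51): 513 freq bins, ~50 frames for 1 s of audio
```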
Training follows the GAN objective as in the original Vocos, while adopting loss functions inspired by Descript’s audio codec.
**Original Vocos resources:**
- Audio samples: https://gemelo-ai.github.io/vocos/
- Paper: https://arxiv.org/abs/2306.00814
---
## ⚡ Streaming Latency & Real-Time Performance
We benchmark **Streaming Vocos** in **streaming inference mode** using chunked mel-spectrogram decoding on both CPU and GPU.
### Benchmark setup
- **Audio duration:** 3.24 s
- **Sample rate:** 16 kHz
- **Mel hop size:** 320 samples (20 ms per mel frame)
- **Chunk size:** 5 mel frames (100 ms buffering latency)
- **Runs:** 100 warm-up + 1000 timed runs
- **Inference mode:** Streaming (stateful causal decoding)
**Metrics**
- **Processing time per chunk**
- **End-to-end latency** = chunk buffering + processing time
- **RTF (Real-Time Factor)** = processing time / audio duration
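These definitions can be checked with a few lines of arithmetic, using the values reported in the results table below:

```python
# Latency / RTF arithmetic from the benchmark numbers.
audio_s = 3.24                 # benchmark clip duration
buffer_ms = 5 * 20             # 5 mel frames * 20 ms per frame = 100 ms buffering

cpu_rtf = 0.464 / audio_s      # total CPU processing time / audio duration
gpu_rtf = 0.113 / audio_s      # total GPU processing time / audio duration

cpu_e2e_ms = buffer_ms + 14.0  # buffering + first-chunk CPU processing
gpu_e2e_ms = buffer_ms + 3.3   # buffering + first-chunk GPU processing

print(round(cpu_rtf, 3), round(gpu_rtf, 3))  # 0.143 0.035
print(cpu_e2e_ms, gpu_e2e_ms)                # 114.0 103.3
```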
---
### Results
#### Streaming performance (chunk size = 5 frames, 100 ms buffer)
| Device | Avg proc / chunk | First-chunk proc | End-to-end latency | Total proc (3.2 s audio) | RTF |
|------|------------------|------------------|--------------------|---------------------------|-----|
| **CPU** | 14.0 ms | 14.0 ms | **114.0 ms** | 464 ms | 0.14 |
| **GPU (CUDA)** | **3.4 ms** | **3.3 ms** | **103.3 ms** | **113 ms** | **0.035** |
> End-to-end latency includes the **100 ms chunk buffering delay** required for streaming inference.
---
### Interpretation
- **Real-time capable on CPU**
Streaming Vocos achieves an RTF of approximately **0.14**, corresponding to inference running ~7× faster than real time.
- **Ultra-low compute overhead on GPU**
Chunk processing time is reduced to **~3.4 ms**, making overall latency dominated by buffering rather than computation.
- **Streaming-friendly first-chunk behavior**
First-chunk latency closely matches steady-state latency, indicating **no cold-start penalty** during streaming inference.
- **Latency–quality tradeoff**
Smaller chunk sizes further reduce buffering latency (e.g., 1–2 frames → <40 ms), at the cost of slightly increased computational overhead.
---
With a **chunk size of 1 frame (20 ms buffering)**, GPU end-to-end latency drops below **25 ms**, making **Streaming Vocos** suitable for **interactive and conversational TTS pipelines**.
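The buffering arithmetic behind these chunk-size choices, including the sub-25 ms GPU figure, is a few lines (the 3.4 ms GPU processing time is taken from the results table above):

```python
# Each mel frame covers hop / sr = 320 / 16000 s = 20 ms of audio,
# so an N-frame chunk must buffer N * 20 ms before it can be decoded.
hop, sr = 320, 16_000
for chunk_frames in (1, 2, 5):
    buffer_ms = chunk_frames * hop * 1000 // sr
    print(f"{chunk_frames} frame(s) -> {buffer_ms} ms buffering")

# With 1-frame chunks, GPU end-to-end latency = buffering + processing:
gpu_proc_ms = 3.4              # avg GPU chunk processing time from the table
print(20 + gpu_proc_ms)        # 23.4 ms, i.e. below 25 ms
```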
## Checkpoints
This repo provides a PyTorch Lightning checkpoint:
- `epoch=3.ckpt`
You can download it from the “Files” tab, or directly via `hf_hub_download` (example below).
---
## Quickstart (inference)
### Install
```bash
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121 # or cpu wheels
pip install lightning librosa scipy matplotlib huggingface_hub
```
Clone the GitHub repo:
```bash
git clone https://github.com/warisqr007/vocos.git
cd vocos
```
### Run inference (offline)
```python
import torch
import librosa
from huggingface_hub import hf_hub_download

from src.modules import VocosVocoderModule  # from the GitHub codebase

# Download the Lightning checkpoint from the Hub.
ckpt_path = hf_hub_download(
    repo_id="warisqr007/StreamingVocos",
    filename="epoch=3.ckpt",
)
model = VocosVocoderModule.load_from_checkpoint(ckpt_path, map_location="cpu")
model.eval()

# Load audio at 16 kHz and add batch/channel dims.
wav_path = "your_input.wav"
audio, _ = librosa.load(wav_path, sr=16000, mono=True)
audio_t = torch.from_numpy(audio).unsqueeze(0).unsqueeze(0)  # (B=1, 1, T)

# Extract 50 Hz log-mel features, then reconstruct the waveform.
mel = model.feature_extractor(audio_t)
with torch.no_grad():
    y = model(mel).squeeze().cpu().numpy()  # reconstructed waveform @ 16 kHz
```
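To listen to the result, the NumPy waveform can be written to disk. A minimal sketch using `scipy` (already in the install list), with a placeholder array standing in for the `y` produced above:

```python
import numpy as np
from scipy.io import wavfile

# Placeholder standing in for the reconstructed waveform `y` from the snippet above.
y = np.zeros(16_000, dtype=np.float32)

# Scale float [-1, 1] audio to 16-bit PCM and write a 16 kHz wav file.
pcm = np.clip(y, -1.0, 1.0)
wavfile.write("reconstructed.wav", 16_000, (pcm * 32767).astype(np.int16))
```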
### Streaming inference (chunked mel)
```python
import torch

chunk_size = 1  # mel frames per chunk (adjust as desired)

# Put both decoder stages into stateful streaming mode, then decode chunk by chunk.
with torch.no_grad(), model.decoder[0].streaming(chunk_size), model.decoder[1].streaming(chunk_size):
    y_chunks = []
    for mel_chunk in mel.split(chunk_size, dim=2):
        y_chunks.append(model(mel_chunk))
    y_stream = torch.cat(y_chunks, dim=2).squeeze().cpu().numpy()
```
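The streaming context managers above belong to the repo's causal-conv machinery. The principle they rely on can be illustrated with a toy causal convolution, where chunked decoding with cached left context exactly matches one-shot decoding (a standalone sketch, not the repo's implementation):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
kernel = torch.randn(1, 1, 3)  # toy causal conv, kernel size 3
x = torch.randn(1, 1, 20)      # (B, C, T) toy "mel" sequence

# Offline: left-pad by (kernel_size - 1) and convolve the full sequence once.
full = F.conv1d(F.pad(x, (2, 0)), kernel)

# Streaming: carry the last (kernel_size - 1) input samples as state between chunks.
state = torch.zeros(1, 1, 2)
outs = []
for chunk in x.split(5, dim=2):
    buf = torch.cat([state, chunk], dim=2)
    outs.append(F.conv1d(buf, kernel))
    state = buf[..., -2:]
stream = torch.cat(outs, dim=2)

print(torch.allclose(full, stream, atol=1e-6))  # True: chunked == one-shot
```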
## Space demo
A Gradio demo Space is available [here](https://huggingface.co/spaces/warisqr007/StreamingVocos_16khz).
## Acknowledgements
- [Vocos Repo](https://github.com/gemelo-ai/vocos)
- [Moshi Repo for streaming implementation](https://github.com/kyutai-labs/moshi)
- [descript-audio-codec losses](https://github.com/descriptinc/descript-audio-codec)
- [lightning-template](https://github.com/DavidZhang73/pytorch-lightning-template)