---
license: apache-2.0
tags:
- audio
- vocoder
- speech-synthesis
- streaming
- pytorch-lightning
- causal-conv
language:
- en
library_name: pytorch
---
# Streaming Vocos: Neural vocoder for fast streaming applications
**Streaming Vocos** is a streaming-friendly reimplementation of the original **Vocos** neural vocoder, modified for **causal / streaming inference**. Unlike typical GAN vocoders that generate waveform samples directly in the time domain, Vocos predicts **spectral coefficients**, enabling fast waveform reconstruction via the inverse Fourier transform, which makes it well suited to **low-latency** and **real-time** settings.
This implementation replaces vanilla CNN blocks with **causal CNNs** and provides a **streaming interface** with dynamically adjustable chunk size (in multiples of the hop size).
- **Input:** 50 Hz log-mel spectrogram
  - window = 1024, hop = 320 (at 16 kHz)
- **Output:** 16 kHz waveform audio
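As a sanity check on these analysis parameters, a plain `torch.stft` call (not the checkpoint's bundled `feature_extractor`, which should be used in practice) reproduces the 50 Hz frame rate:

```python
import torch

# window = 1024, hop = 320 at 16 kHz -> 16000 / 320 = 50 frames per second.
sr, win, hop = 16_000, 1024, 320
audio = torch.randn(1, sr)  # 1 s of dummy audio

spec = torch.stft(
    audio, n_fft=win, hop_length=hop, win_length=win,
    window=torch.hann_window(win), center=True, return_complex=True,
)
print(spec.shape)  # (1, 513, 51): 513 freq bins, ~50 frames for 1 s of audio
```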
Training follows the GAN objective as in the original Vocos, while adopting loss functions inspired by Descript’s audio codec.
**Original Vocos resources:**
- Audio samples: https://gemelo-ai.github.io/vocos/
- Paper: https://arxiv.org/abs/2306.00814
---
## ⚡ Streaming Latency & Real-Time Performance
We benchmark **Streaming Vocos** in **streaming inference mode** using chunked mel-spectrogram decoding on both CPU and GPU.
### Benchmark setup
- **Audio duration:** 3.24 s
- **Sample rate:** 16 kHz
- **Mel hop size:** 320 samples (20 ms per mel frame)
- **Chunk size:** 5 mel frames (100 ms buffering latency)
- **Runs:** 100 warm-up + 1000 timed runs
- **Inference mode:** Streaming (stateful causal decoding)
**Metrics**
- **Processing time per chunk**
- **End-to-end latency** = chunk buffering + processing time
- **RTF (Real-Time Factor)** = processing time / audio duration
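These definitions can be checked with a few lines of arithmetic, using the values reported in the results table below:

```python
# Latency / RTF arithmetic from the benchmark numbers.
audio_s = 3.24                 # benchmark clip duration
buffer_ms = 5 * 20             # 5 mel frames * 20 ms per frame = 100 ms buffering

cpu_rtf = 0.464 / audio_s      # total CPU processing time / audio duration
gpu_rtf = 0.113 / audio_s      # total GPU processing time / audio duration

cpu_e2e_ms = buffer_ms + 14.0  # buffering + first-chunk CPU processing
gpu_e2e_ms = buffer_ms + 3.3   # buffering + first-chunk GPU processing

print(round(cpu_rtf, 3), round(gpu_rtf, 3))  # 0.143 0.035
print(cpu_e2e_ms, gpu_e2e_ms)                # 114.0 103.3
```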
---
### Results
#### Streaming performance (chunk size = 5 frames, 100 ms buffer)
| Device | Avg proc / chunk | First-chunk proc | End-to-end latency | Total proc (3.2 s audio) | RTF |
|------|------------------|------------------|--------------------|---------------------------|-----|
| **CPU** | 14.0 ms | 14.0 ms | **114.0 ms** | 464 ms | 0.14 |
| **GPU (CUDA)** | **3.4 ms** | **3.3 ms** | **103.3 ms** | **113 ms** | **0.035** |
> End-to-end latency includes the **100 ms chunk buffering delay** required for streaming inference.
---
### Interpretation
- **Real-time capable on CPU**
Streaming Vocos achieves an RTF of approximately **0.14**, corresponding to inference running ~7× faster than real time.
- **Ultra-low compute overhead on GPU**
Chunk processing time is reduced to **~3.4 ms**, making overall latency dominated by buffering rather than computation.
- **Streaming-friendly first-chunk behavior**
First-chunk latency closely matches steady-state latency, indicating **no cold-start penalty** during streaming inference.
- **Latency–quality tradeoff**
Smaller chunk sizes further reduce buffering latency (e.g., 1–2 frames → <40 ms), at the cost of slightly increased computational overhead.
---
With a **chunk size of 1 frame (20 ms buffering)**, GPU end-to-end latency drops below **25 ms**, making **Streaming Vocos** suitable for **interactive and conversational TTS pipelines**.
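The buffering arithmetic behind these chunk-size choices, including the sub-25 ms GPU figure, is a few lines (the 3.4 ms GPU processing time is taken from the results table above):

```python
# Each mel frame covers hop / sr = 320 / 16000 s = 20 ms of audio,
# so an N-frame chunk must buffer N * 20 ms before it can be decoded.
hop, sr = 320, 16_000
for chunk_frames in (1, 2, 5):
    buffer_ms = chunk_frames * hop * 1000 // sr
    print(f"{chunk_frames} frame(s) -> {buffer_ms} ms buffering")

# With 1-frame chunks, GPU end-to-end latency = buffering + processing:
gpu_proc_ms = 3.4              # avg GPU chunk processing time from the table
print(20 + gpu_proc_ms)        # 23.4 ms, i.e. below 25 ms
```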
## Checkpoints
This repo provides a PyTorch Lightning checkpoint:
- `epoch=3.ckpt`
You can download it from the “Files” tab, or directly via `hf_hub_download` (example below).
---
## Quickstart (inference)
### Install
```bash
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121 # or cpu wheels
pip install lightning librosa scipy matplotlib huggingface_hub
```
Clone the GitHub repo:
```bash
git clone https://github.com/warisqr007/vocos.git
cd vocos
```
### Run inference (offline)
```python
import torch
import librosa
from huggingface_hub import hf_hub_download

from src.modules import VocosVocoderModule  # from the GitHub codebase

# Download the Lightning checkpoint from the Hub.
ckpt_path = hf_hub_download(
    repo_id="warisqr007/StreamingVocos",
    filename="epoch=3.ckpt",
)
model = VocosVocoderModule.load_from_checkpoint(ckpt_path, map_location="cpu")
model.eval()

# Load audio at 16 kHz and add batch/channel dims.
wav_path = "your_input.wav"
audio, _ = librosa.load(wav_path, sr=16000, mono=True)
audio_t = torch.from_numpy(audio).unsqueeze(0).unsqueeze(0)  # (B=1, 1, T)

# Extract 50 Hz log-mel features, then reconstruct the waveform.
mel = model.feature_extractor(audio_t)
with torch.no_grad():
    y = model(mel).squeeze().cpu().numpy()  # reconstructed waveform @ 16 kHz
```
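To listen to the result, the NumPy waveform can be written to disk. A minimal sketch using `scipy` (already in the install list), with a placeholder array standing in for the `y` produced above:

```python
import numpy as np
from scipy.io import wavfile

# Placeholder standing in for the reconstructed waveform `y` from the snippet above.
y = np.zeros(16_000, dtype=np.float32)

# Scale float [-1, 1] audio to 16-bit PCM and write a 16 kHz wav file.
pcm = np.clip(y, -1.0, 1.0)
wavfile.write("reconstructed.wav", 16_000, (pcm * 32767).astype(np.int16))
```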
### Streaming inference (chunked mel)
```python
import torch

chunk_size = 1  # mel frames per chunk (adjust as desired)

# Put both decoder stages into stateful streaming mode, then decode chunk by chunk.
with torch.no_grad(), model.decoder[0].streaming(chunk_size), model.decoder[1].streaming(chunk_size):
    y_chunks = []
    for mel_chunk in mel.split(chunk_size, dim=2):
        y_chunks.append(model(mel_chunk))
    y_stream = torch.cat(y_chunks, dim=2).squeeze().cpu().numpy()
```
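The streaming context managers above belong to the repo's causal-conv machinery. The principle they rely on can be illustrated with a toy causal convolution, where chunked decoding with cached left context exactly matches one-shot decoding (a standalone sketch, not the repo's implementation):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
kernel = torch.randn(1, 1, 3)  # toy causal conv, kernel size 3
x = torch.randn(1, 1, 20)      # (B, C, T) toy "mel" sequence

# Offline: left-pad by (kernel_size - 1) and convolve the full sequence once.
full = F.conv1d(F.pad(x, (2, 0)), kernel)

# Streaming: carry the last (kernel_size - 1) input samples as state between chunks.
state = torch.zeros(1, 1, 2)
outs = []
for chunk in x.split(5, dim=2):
    buf = torch.cat([state, chunk], dim=2)
    outs.append(F.conv1d(buf, kernel))
    state = buf[..., -2:]
stream = torch.cat(outs, dim=2)

print(torch.allclose(full, stream, atol=1e-6))  # True: chunked == one-shot
```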
## Space demo
A Gradio demo Space is available [here](https://huggingface.co/spaces/warisqr007/StreamingVocos_16khz).
## Acknowledgements
- [Vocos Repo](https://github.com/gemelo-ai/vocos)
- [Moshi Repo for streaming implementation](https://github.com/kyutai-labs/moshi)
- [descript-audio-codec losses](https://github.com/descriptinc/descript-audio-codec)
- [lightning-template](https://github.com/DavidZhang73/pytorch-lightning-template)