---
license: apache-2.0
tags:
- audio
- vocoder
- speech-synthesis
- streaming
- pytorch-lightning
- causal-conv
language:
- en
library_name: pytorch
---

# Streaming Vocos: Neural vocoder for fast streaming applications

**Streaming Vocos** is a streaming-friendly replication of the original **Vocos** neural vocoder, modified for **causal / streaming inference**. Unlike typical GAN vocoders that generate waveform samples directly in the time domain, Vocos predicts **spectral coefficients**, enabling fast waveform reconstruction via the inverse Fourier transform, which makes it well-suited for **low-latency** and **real-time** settings.

This implementation replaces vanilla CNN blocks with **causal CNNs** and provides a **streaming interface** with a dynamically adjustable chunk size (in multiples of the hop size).

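The causal constraint can be sketched with a left-padded `Conv1d`. This is an illustrative layer only, not the exact block used in this repo: output at time `t` depends on inputs up to `t`, so past outputs never change as new frames arrive.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """Left-padded 1-D convolution: output at time t sees inputs only up to t.
    Illustrative sketch, not the exact layer used in this repo."""
    def __init__(self, in_ch: int, out_ch: int, kernel_size: int, dilation: int = 1):
        super().__init__()
        self.left_pad = (kernel_size - 1) * dilation  # pad the past only
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, T)
        x = F.pad(x, (self.left_pad, 0))  # no right (future) padding
        return self.conv(x)

conv = CausalConv1d(1, 1, kernel_size=3).eval()
x = torch.randn(1, 1, 10)
with torch.no_grad():
    y = conv(x)
print(y.shape)  # output length matches input length
```

Because nothing to the right of `t` is ever read, perturbing a future input leaves all earlier outputs unchanged, which is exactly what stateful chunked decoding relies on.
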
- **Input:** 50 Hz log-mel spectrogram
  - window = 1024, hop = 320
- **Output:** 16 kHz waveform audio

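The hop of 320 samples at 16 kHz is what makes the mel stream 50 Hz; a quick check of the frame arithmetic:

```python
sample_rate = 16_000   # Hz, output waveform
hop = 320              # samples advanced per mel frame
frame_rate = sample_rate / hop        # mel frames per second
frame_ms = 1000 * hop / sample_rate   # audio duration covered by one mel frame
print(frame_rate, frame_ms)  # → 50.0 20.0
```
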
Training follows the GAN objective of the original Vocos, while adopting loss functions inspired by Descript's audio codec.

**Original Vocos resources:**
- Audio samples: https://gemelo-ai.github.io/vocos/
- Paper: https://arxiv.org/abs/2306.00814

---

## ⚡ Streaming Latency & Real-Time Performance

We benchmark **Streaming Vocos** in **streaming inference mode** using chunked mel-spectrogram decoding on both CPU and GPU.

### Benchmark setup

- **Audio duration:** 3.24 s
- **Sample rate:** 16 kHz
- **Mel hop size:** 320 samples (20 ms per mel frame)
- **Chunk size:** 5 mel frames (100 ms buffering latency)
- **Runs:** 100 warm-up + 1000 timed runs
- **Inference mode:** streaming (stateful causal decoding)

**Metrics**
- **Processing time per chunk**
- **End-to-end latency** = chunk buffering + processing time
- **RTF (real-time factor)** = processing time / audio duration

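These definitions can be checked against the CPU figures reported in the results table (14 ms per chunk, 464 ms total processing, 3.24 s of audio):

```python
hop_ms = 320 / 16_000 * 1000   # 20 ms of audio per mel frame
chunk_frames = 5
buffer_ms = chunk_frames * hop_ms   # chunk buffering latency: 100 ms

proc_per_chunk_ms = 14.0            # CPU figure from the results table
end_to_end_ms = buffer_ms + proc_per_chunk_ms

audio_s = 3.24
total_proc_ms = 464.0               # CPU total processing from the table
rtf = (total_proc_ms / 1000) / audio_s

print(end_to_end_ms, round(rtf, 2))  # → 114.0 0.14
```
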
---

### Results

#### Streaming performance (chunk size = 5 frames, 100 ms buffer)

| Device | Avg proc / chunk | First-chunk proc | End-to-end latency | Total proc (3.24 s audio) | RTF |
|--------|------------------|------------------|--------------------|---------------------------|-----|
| **CPU** | 14.0 ms | 14.0 ms | **114.0 ms** | 464 ms | 0.14 |
| **GPU (CUDA)** | **3.4 ms** | **3.3 ms** | **103.3 ms** | **113 ms** | **0.035** |

> End-to-end latency includes the **100 ms chunk buffering delay** required for streaming inference.

---

### Interpretation

- **Real-time capable on CPU**
  Streaming Vocos achieves an RTF of approximately **0.14**, i.e., inference runs about 7× faster than real time.

- **Ultra-low compute overhead on GPU**
  Chunk processing time drops to **~3.4 ms**, so overall latency is dominated by buffering rather than computation.

- **Streaming-friendly first-chunk behavior**
  First-chunk latency closely matches steady-state latency, indicating **no cold-start penalty** during streaming inference.

- **Latency–quality tradeoff**
  Smaller chunk sizes further reduce buffering latency (e.g., 1–2 frames → 20–40 ms), at the cost of slightly higher computational overhead.

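Since buffering latency is simply chunk size times the 20 ms frame duration, the tradeoff can be tabulated directly:

```python
hop_ms = 20  # one mel frame at hop 320 / 16 kHz
latencies = {c: c * hop_ms for c in (1, 2, 5)}  # chunk frames -> buffering ms
print(latencies)  # → {1: 20, 2: 40, 5: 100}
```
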
---

With a **chunk size of 1 frame (20 ms buffering)**, GPU end-to-end latency drops below **25 ms**, making **Streaming Vocos** suitable for **interactive and conversational TTS pipelines**.

## Checkpoints

This repo provides a PyTorch Lightning checkpoint:

- `epoch=3.ckpt`

You can download it from the “Files” tab, or directly via `hf_hub_download` (example below).

---

## Quickstart (inference)

### Install

```bash
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121  # or CPU wheels
pip install lightning librosa scipy matplotlib huggingface_hub
```

Clone the GitHub repo:

```bash
git clone https://github.com/warisqr007/vocos.git
cd vocos
```

### Run inference (offline)

```python
import torch
import librosa
from huggingface_hub import hf_hub_download

from src.modules import VocosVocoderModule  # from the GitHub codebase

ckpt_path = hf_hub_download(
    repo_id="warisqr007/StreamingVocos",
    filename="epoch=3.ckpt",
)

model = VocosVocoderModule.load_from_checkpoint(ckpt_path, map_location="cpu")
model.eval()

wav_path = "your_input.wav"
audio, _ = librosa.load(wav_path, sr=16000, mono=True)

audio_t = torch.from_numpy(audio).unsqueeze(0).unsqueeze(0)  # (B=1, C=1, T)

with torch.no_grad():
    mel = model.feature_extractor(audio_t)
    y = model(mel).squeeze().cpu().numpy()  # reconstructed waveform @ 16 kHz
```

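To listen to the result, the reconstructed array can be written to disk with `scipy` (installed in the Quickstart). The random array here is only a stand-in for the `y` produced by the snippet above:

```python
import numpy as np
from scipy.io import wavfile

# Stand-in for the `y` array produced by the snippet above.
y = np.random.uniform(-0.5, 0.5, 16_000).astype(np.float32)

wavfile.write("reconstructed.wav", 16_000, y)  # float32 WAV at 16 kHz
```
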
|
| | ### Streaming inference (chunked mel) |
| | ```python |
| | import torch |
| | |
| | chunk_size = 1 # mel frames per chunk (adjust as desired) |
| | |
| | with torch.no_grad(), model.decoder[0].streaming(chunk_size), model.decoder[1].streaming(chunk_size): |
| | y_chunks = [] |
| | for mel_chunk in mel.split(chunk_size, dim=2): |
| | y_chunks.append(model(mel_chunk)) |
| | y_stream = torch.cat(y_chunks, dim=2).squeeze().cpu().numpy() |
| | ``` |
| |
|
| | ### Space demo |
| | A Gradio demo Space is provided [here](https://huggingface.co/spaces/warisqr007/StreamingVocos_16khz) |
| |
|
### Acknowledgements

- [Vocos repo](https://github.com/gemelo-ai/vocos)
- [Moshi repo (streaming implementation)](https://github.com/kyutai-labs/moshi)
- [descript-audio-codec losses](https://github.com/descriptinc/descript-audio-codec)
- [lightning-template](https://github.com/DavidZhang73/pytorch-lightning-template)