Vocoder produces click/pop artifact at the end of generated audio segments
Description:
When using s2-pro for TTS via sgl-omni serve, the generated audio segments frequently contain an audible click or pop sound at the very end. This happens regardless of the text content or speaker reference used.
Observed behavior:
- Roughly 90% or more of generated segments have an audible click/pop at the tail end of the audio
- The artifact appears to be random: the same text can produce it on one run and not on another
- The artifact is present in the raw WAV output from the server; no client-side processing is applied
How to reproduce:
import requests

payload = {
    "input": "Hello, how are you doing today?",
    "response_format": "wav",
    "references": [{"vq_codes": [...], "text": "..."}],  # any valid reference
}
resp = requests.post("http://localhost:8080/v1/audio/speech", json=payload)
resp.raise_for_status()

# Save the raw server response without any client-side processing
with open("out.wav", "wb") as f:
    f.write(resp.content)
# Listen to the end of out.wav: the click/pop is audible
Generate 20-30 segments with varied text; the vast majority will have the artifact at the end.
Root cause hypothesis:
The vocoder/codec decoder appears to stop generating abruptly before the waveform has decayed to zero, creating a discontinuity at the end of the audio. This is a classic cause of click/pop artifacts in digital audio.
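One way to check this hypothesis on saved output is to measure how far the final samples are from silence. The following is a minimal sketch, not part of the server or model code: `tail_level` is a hypothetical helper, and it assumes the returned WAV is 16-bit PCM (we have not confirmed the sample format the server emits). A `last_sample` value well above zero would be consistent with an abrupt cut-off.

```python
import array
import io
import sys
import wave


def tail_level(wav_bytes: bytes, tail_ms: float = 5.0) -> dict:
    """Report how far the end of a WAV is from silence.

    Hypothetical helper; assumes 16-bit PCM data.
    Values are normalized so that 1.0 is full scale.
    """
    with wave.open(io.BytesIO(wav_bytes), "rb") as wf:
        if wf.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        channels = wf.getnchannels()
        rate = wf.getframerate()
        samples = array.array("h", wf.readframes(wf.getnframes()))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian
    if not samples:
        raise ValueError("empty audio")

    # Number of interleaved samples covering the last `tail_ms` milliseconds
    n_tail = max(1, int(rate * tail_ms / 1000)) * channels
    tail = samples[-n_tail:]
    full_scale = 32768.0
    return {
        "last_sample": abs(samples[-1]) / full_scale,
        "tail_peak": max(abs(s) for s in tail) / full_scale,
    }
```

Running this over the 20-30 generated segments and comparing `last_sample` between clean and clicking files would show whether the artifact correlates with a non-zero final sample.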
Current workaround:
We trim the last 50-80ms off each generated segment and apply a short fade-out (15-30ms). This removes the artifact in most cases but occasionally clips the tail end of actual speech content, which is not ideal for short utterances.
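For reference, the workaround can be sketched roughly as follows. This is an illustrative reconstruction, not our exact production code: `trim_and_fade` is a hypothetical name, the defaults (50 ms trim, 20 ms linear fade) are arbitrary picks from the ranges above, and 16-bit PCM output is assumed.

```python
import array
import io
import sys
import wave


def trim_and_fade(wav_bytes: bytes, trim_ms: float = 50.0,
                  fade_ms: float = 20.0) -> bytes:
    """Drop the last `trim_ms` of audio, then apply a linear fade-out.

    Hypothetical helper; assumes 16-bit PCM data.
    """
    with wave.open(io.BytesIO(wav_bytes), "rb") as wf:
        params = wf.getparams()
        if params.sampwidth != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        samples = array.array("h", wf.readframes(params.nframes))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian

    ch = params.nchannels
    n_frames = len(samples) // ch

    # Trim: remove the final frames, where the click usually sits
    trim = min(int(params.framerate * trim_ms / 1000), n_frames)
    keep = n_frames - trim
    del samples[keep * ch:]

    # Fade: gain falls linearly and reaches exactly 0 on the last frame
    fade = min(int(params.framerate * fade_ms / 1000), keep)
    for i in range(fade):
        gain = (fade - 1 - i) / fade
        frame = keep - fade + i
        for c in range(ch):
            idx = frame * ch + c
            samples[idx] = int(samples[idx] * gain)

    if sys.byteorder == "big":
        samples.byteswap()  # back to little-endian before writing
    out = io.BytesIO()
    with wave.open(out, "wb") as wf:
        wf.setnchannels(ch)
        wf.setsampwidth(2)
        wf.setframerate(params.framerate)
        wf.writeframes(samples.tobytes())
    return out.getvalue()
```

The fixed trim length is what causes the clipping on short utterances: the code cannot tell whether the removed frames contain the artifact or the end of a word, which is why a server-side fix would be preferable.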
Environment:
- Model: fishaudio/s2-pro
- Server: sgl-omni serve --model-path fishaudio/s2-pro --config examples/configs/s2pro_tts.yaml
- Output format: WAV
Questions:
- Is there a server-side config (e.g. in s2pro_tts.yaml) that controls end-of-sequence behavior or adds padding?
- Could the model be made to generate a few extra silent frames at the end to ensure a clean tail-off?
- Is this a known issue with the vocoder decoder?