Re: Fragmented TTS generation – first chunk has raised tone, seeking optimal stitching method
You're seeing classic "initial inference state" behavior. The first fragment has no prior acoustic context, so the model defaults to a higher baseline pitch before settling into its neutral register.
Since you're on s2-pro, here's how to fix the immediate problem, but I'm going to redirect you at the end because you're fighting the wrong battle.
Why It's Happening
s2-pro (like most autoregressive TTS) carries no acoustic state between calls – every inference starts from a zero initial state. Generating each 200–250 char chunk as an isolated inference call therefore guarantees the first chunk will sound different from the rest.
If You Must Keep the Fragment‑Then‑Stitch Approach
1. Split by sentence boundaries, not character count – keeps prosodic contours intact.
2. Generate with acoustic context – after the first fragment, prime the model with the last 1–2 seconds of audio from the previous fragment.
3. Use lossless intermediate format – write fragments as 16‑bit PCM WAV or FLAC. MP3 adds encoder delay that desyncs seams.
4. Crossfade concatenation – simple cat creates clicks at the seam. sox's splice effect can smooth it, but it needs the splice position (the duration of the first file):
sox first.wav second.wav out.wav splice -q "$(soxi -D first.wav)",0.03
Or Python with pydub:
from pydub import AudioSegment  # pip install pydub (needs ffmpeg for non-WAV formats)
a = AudioSegment.from_wav("frag1.wav")
b = AudioSegment.from_wav("frag2.wav")
combined = a.append(b, crossfade=45)  # 45 ms crossfade across the seam
combined.export("final.wav", format="wav")
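For step 1, a minimal sentence-aware chunker is easy to sketch. This is an illustrative helper (the function name and the naive regex splitter are my own, not part of any TTS SDK): it packs whole sentences into chunks under a character cap instead of cutting at an arbitrary offset, so each TTS call gets intact prosodic units.

```python
import re

def sentence_chunks(text, max_chars=250):
    """Split text on sentence boundaries, then pack whole sentences
    into chunks no longer than max_chars, so each TTS call receives
    complete prosodic units rather than an arbitrary character slice."""
    # Naive splitter: break after ., !, or ? followed by whitespace.
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    chunks, current = [], ""
    for s in sentences:
        if current and len(current) + 1 + len(s) > max_chars:
            chunks.append(current)
            current = s
        else:
            current = f"{current} {s}".strip() if current else s
    if current:
        chunks.append(current)
    return chunks
```

A real pipeline would want a proper sentence tokenizer (abbreviations, decimals, quotes all break the naive regex), but the packing logic is the part that matters here.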
But Honestly? You're Solving the Wrong Problem
What you're trying to do – splitting text, then restitching the audio seamlessly – is a band-aid over a fundamental architectural mismatch.
The real solution is to use a streaming TTS model.
With models like vibe-voice or MOSS-Realtime, you don't split text, you don't generate fragments, and you don't stitch. You feed the entire text in one go, and the model begins outputting audio as soon as the first tokens are available. The prosody is continuous from the first syllable because the model operates with full context and streams the output in realtime.
Why this solves your problem:
- No "first fragment" edge case – the model's acoustic state is initialized once for the entire utterance
- No concatenation artifacts – no seams, no crossfades, no glitches
- No arbitrary character limits – the model handles arbitrarily long text via streaming
- Self-hosted – both vibe-voice and MOSS-Realtime run locally
What you should do:
- Ditch the fragment‑then‑stitch pipeline entirely
- Pick a streaming TTS model (vibe-voice for lightweight/fast, MOSS-Realtime for higher quality)
- Feed your full radio script as a single text input
- Stream the audio output directly to your output file or playback pipeline
You'll get a single, coherent audio file with natural prosody throughout, no weird pitch jumps on the first segment, and zero stitching headaches. This is the proper architecture for what you're building.
Edit:
And I would just add: the problem you're running into reveals a common misunderstanding of how modern TTS models work – the "raised tone on the first fragment" isn't a bug to be patched with stitching tricks, but an inherent consequence of isolated inference calls.
If you just want uniform, splicable audio with no perceptible "raised tone" on the first fragment, you could:
- Flatten the prosody entirely — set pitch variation to near zero, kill expressive contour, make the model sound like a monotone robot;
- Generate each fragment with identical neutral prosody;
- Splice away.
But then... what's the point of using a modern neural TTS model at all? You're basically asking for the one thing these models were designed not to do. You're taking a system built for natural, dynamic, context-aware speech and trying to hammer it into behaving like an old concatenative TTS from the 90s.
