Voice Agent Pipeline
Core stages from end-of-speech to first assistant audio
Initial Response
1
EOU Delay
Seconds from detected end of speech (VAD) until your turn is finalized, including transcription delay.
Turn Detection
Finalizing when your turn is complete
--
2
LLM TTFT
Thinking time: time from LLM request start to the first generated token (TTFT).
Thinking
Generating the first token
--
3
TTS TTFB
Voice generation startup only: time from TTS request start to the first audio chunk (TTFB).
Voice Generation
Starting audio synthesis
--
Total Round-Trip
End-to-End Latency
Sum of the three stages above: Turn Detection (EOU delay) + Thinking (LLM TTFT) + Voice Generation (TTS TTFB).
--
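The sum described in the End-to-End Latency card can be sketched in a few lines of Python. This is a minimal illustration, not the panel's actual implementation. The names `TurnLatency`, `end_to_end`, and `format_metric` are hypothetical, all values are assumed to be in seconds, and rendering a missing value as `--` is an assumption based on the placeholders shown above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TurnLatency:
    """Per-turn stage latencies in seconds; None means not yet measured."""

    eou_delay: Optional[float] = None  # end of speech -> turn finalized
    llm_ttft: Optional[float] = None   # LLM request start -> first token
    tts_ttfb: Optional[float] = None   # TTS request start -> first audio chunk

    def end_to_end(self) -> Optional[float]:
        """Sum of the three stages, or None if any stage is still missing."""
        stages = (self.eou_delay, self.llm_ttft, self.tts_ttfb)
        if any(s is None for s in stages):
            return None
        return sum(stages)


def format_metric(value: Optional[float]) -> str:
    """Render a latency value, using '--' as the assumed no-data placeholder."""
    return "--" if value is None else f"{value:.2f} s"


if __name__ == "__main__":
    idle = TurnLatency()
    print(format_metric(idle.end_to_end()))

    turn = TurnLatency(eou_delay=0.5, llm_ttft=0.25, tts_ttfb=0.25)
    print(format_metric(turn.end_to_end()))
```

Returning `None` until all three stages are known avoids reporting a partial sum as if it were the full round trip.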