Voice Agent Pipeline
Core stages from end-of-speech to first assistant audio
Initial Response
1
EOU Delay
Time in seconds from the end of speech, as detected by voice activity detection (VAD), until your turn is finalized, including transcription delay.
Turn Detection
Determining when your turn is complete
--
2
LLM TTFT
Thinking time: from LLM request start to the first generated token.
Thinking
Generating the first token
--
3
TTS TTFB
Voice generation startup only: time from TTS request start to the first audio chunk (time to first byte, TTFB).
Voice Generation
Starting audio synthesis
--
Total Round-Trip
End-to-End Latency
Sum of the three stages above: EOU Delay (Turn Detection) + LLM TTFT (Thinking) + TTS TTFB (Voice Generation).
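The sum above can be sketched in a few lines of Python. This is only an illustration: the class name, field names, and sample values are hypothetical and do not come from any particular SDK. It assumes all three stage latencies are reported in seconds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InitialResponseLatency:
    """Per-turn stage latencies in seconds (hypothetical field names)."""

    eou_delay: float  # end of speech (VAD) -> turn finalized, incl. transcription
    llm_ttft: float   # LLM request start -> first generated token
    tts_ttfb: float   # TTS request start -> first audio chunk

    def end_to_end(self) -> float:
        # End-to-End Latency: the sum of the three stages.
        return self.eou_delay + self.llm_ttft + self.tts_ttfb


# Made-up sample values, chosen only to show the arithmetic.
turn = InitialResponseLatency(eou_delay=0.5, llm_ttft=0.25, tts_ttfb=0.25)
print(f"end-to-end: {turn.end_to_end():.2f} s")
```

Keeping the three stages as separate fields, rather than storing only the total, makes it easy to see which stage dominates a slow turn.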
--
Tool Calls
Tool Call
Tool execution
Executing external function calls
No tools executed.
--
Post-Tool Response
5
LLM TTFT
Thinking
Generating the first token after tools
--
6
TTS TTFB
Voice Generation
Starting post-tool audio synthesis
--
Second Audio
Tool completion to post-tool assistant audio
Tool→Audio
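One way to derive the Tool→Audio figure is as the difference between two timestamps. The sketch below assumes both events are recorded in seconds on the same monotonic clock; the function and parameter names are hypothetical, and returning `None` when no post-tool audio exists is an illustrative choice.

```python
from typing import Optional


def tool_to_audio(tool_done_at: float, first_audio_at: Optional[float]) -> Optional[float]:
    """Seconds from tool completion to the first post-tool audio chunk.

    Both arguments are timestamps in seconds on the same clock.
    Returns None when no post-tool audio was produced.
    """
    if first_audio_at is None:
        return None
    if first_audio_at < tool_done_at:
        raise ValueError("first post-tool audio precedes tool completion")
    return first_audio_at - tool_done_at


# Made-up timestamps: tool finished at t=12.0 s, audio started at t=12.75 s.
print(tool_to_audio(12.0, 12.75))
```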
--
Total Turn Duration
Complete turn including tool execution and post-tool response
--