The Architecture Behind Sub-300ms Voice Latency
A look under the hood: streaming STT, speculative LLM decoding, and chunked TTS — and why naïve implementations sit at 1.5s.
Ravi Subramanian, Staff Engineer · April 2, 2026 · 11 min read
The naïve pipeline
Record → upload → STT → LLM → TTS → play. Each arrow is a network hop, and every stage waits for the previous one to finish, so the delays add up. You'll land at 1.2–1.8 seconds, and the conversation feels broken.
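A back-of-the-envelope sketch of why the serial pipeline lands where it does. The per-stage numbers below are hypothetical placeholders chosen for illustration, not measurements from any real system; the only point is that stages that wait on each other sum.

```python
# Hypothetical per-stage latencies in milliseconds (illustrative, not measured).
NAIVE_STAGES_MS = {
    "upload": 150,
    "stt": 400,
    "llm": 600,
    "tts": 300,
    "playback_start": 50,
}


def serial_latency_ms(stages: dict[str, int]) -> int:
    """In a strictly serial pipeline, end-to-end latency is the sum of all stages."""
    return sum(stages.values())


total = serial_latency_ms(NAIVE_STAGES_MS)
print(f"naive end-to-end: {total} ms")
```

With these assumed numbers no single stage looks unreasonable, yet the total sits in the middle of the 1.2–1.8 second range.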
What we do instead
- Streaming STT with VAD endpointing, so we start the LLM call before the caller stops talking.
- Speculative decoding that drafts the first 80 tokens while STT finalizes.
- Chunked TTS that starts playing at the first sentence boundary, not the full response.
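The chunked-TTS idea from the list above can be sketched as a small generator that buffers streamed LLM tokens and emits text as soon as a sentence boundary appears. This is a minimal illustration, not our production code: `sentence_chunks` is a hypothetical name, and the boundary rule (`.`, `!`, or `?` followed by whitespace) is a deliberately naive assumption that will split early on abbreviations like "Dr.".

```python
import re
from typing import Iterable, Iterator

# Assumed boundary rule: sentence-ending punctuation followed by whitespace.
_SENTENCE_END = re.compile(r"[.!?]\s")


def sentence_chunks(tokens: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as they appear in a token stream."""
    buffer = ""
    for token in tokens:
        buffer += token
        # A single token can complete more than one sentence, so keep scanning.
        while True:
            match = _SENTENCE_END.search(buffer)
            if match is None:
                break
            end = match.end()
            yield buffer[:end].strip()
            buffer = buffer[end:]
    # Flush whatever is left when the stream ends.
    if buffer.strip():
        yield buffer.strip()


if __name__ == "__main__":
    stream = ["Sure", ",", " I", " can", " help", ".", " What", " day", " works", "?"]
    for chunk in sentence_chunks(stream):
        print(chunk)  # each chunk could be handed to TTS immediately
```

Because it is a generator, the first sentence is available to the TTS stage while the LLM is still producing the rest of the response, which is what lets audio start at the first sentence boundary.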
The result
P50 end-of-utterance to first audio byte: 268 ms. P95: 412 ms. That's the threshold where callers stop noticing the AI.
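For readers who want to track the same metric, here is one way the percentiles could be computed from per-call timestamps. It is a sketch under assumptions: the nearest-rank percentile method, the `Call` record, and the sample values are all illustrative and are not taken from our measurement pipeline.

```python
import math
from dataclasses import dataclass


@dataclass
class Call:
    # Hypothetical record: both timestamps in milliseconds on the same clock.
    end_of_utterance_ms: int
    first_audio_byte_ms: int


def percentile(values: list[int], p: int) -> int:
    """Nearest-rank percentile: the smallest value with at least p% of samples at or below it."""
    if not values:
        raise ValueError("need at least one sample")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def latencies_ms(calls: list[Call]) -> list[int]:
    """End-of-utterance to first audio byte, per call."""
    return [c.first_audio_byte_ms - c.end_of_utterance_ms for c in calls]


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    sample = [Call(1000, 1000 + d) for d in (210, 240, 255, 268, 275, 290, 310, 350, 400, 430)]
    lat = latencies_ms(sample)
    print("P50:", percentile(lat, 50), "ms")
    print("P95:", percentile(lat, 95), "ms")
```

Reporting P95 alongside P50 matters here because a conversation feels broken on its slowest turns, not its median one.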