

The Architecture Behind Sub-300ms Voice Latency

A look under the hood at streaming STT, speculative LLM decoding, and chunked TTS, and why naïve implementations sit at around 1.5 s.

Ravi Subramanian, Staff Engineer · April 2, 2026 · 11 min read


The naïve pipeline

Record → upload → STT → LLM → TTS → play. Each arrow is a network hop, and each stage waits for the previous one to finish. End to end, you'll land at 1.2–1.8 seconds of latency, and the conversation feels broken.
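To make the arithmetic concrete, here is a minimal sketch of how latencies add up in a strictly sequential pipeline. The per-stage numbers are hypothetical round figures chosen for illustration, not measurements from this article.

```python
# Hypothetical per-stage latencies in milliseconds (illustrative only).
NAIVE_STAGES_MS = {
    "upload": 150,  # ship the full recording to the server
    "stt": 400,     # transcribe the whole utterance
    "llm": 600,     # generate the complete response
    "tts": 350,     # synthesize the complete audio
}


def sequential_latency_ms(stages: dict) -> int:
    """In a sequential pipeline each stage waits for the previous one,
    so the end-to-end latency is simply the sum of the stages."""
    return sum(stages.values())


total_ms = sequential_latency_ms(NAIVE_STAGES_MS)
print(f"naive end-to-end latency: {total_ms} ms")
```

With these assumed numbers the total comes to 1,500 ms, inside the 1.2–1.8 second range. No single stage is the culprit; the waiting between stages is.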

What we do instead

Streaming speech-to-text (STT) with voice activity detection (VAD) endpointing, so we start the LLM call before the caller stops talking.
Speculative decoding that drafts the first 80 tokens while STT finalizes.
Chunked TTS that starts playing at the first sentence boundary, rather than waiting for the full response.
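Two of these ideas can be sketched in a few lines. The sketch rests on assumptions: it treats the speculative step as starting the LLM on the interim transcript and keeping the draft only if the final transcript agrees, which is one plausible reading of the description above and not necessarily the production design. `fake_llm`, `fake_stt_final`, `respond`, and `sentence_chunks` are hypothetical stand-ins, and the 80-token cap on the draft is omitted.

```python
import asyncio
import re
from typing import Awaitable, Iterable, Iterator, Tuple


async def fake_llm(prompt: str) -> str:
    """Stand-in for a real LLM call; sleeps to simulate generation time."""
    await asyncio.sleep(0.05)
    return f"reply to: {prompt}"


async def fake_stt_final(text: str, delay: float = 0.01) -> str:
    """Stand-in for the STT engine delivering its final transcript."""
    await asyncio.sleep(delay)
    return text


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial STT revisions still match."""
    return " ".join(text.lower().split())


async def respond(partial: str, final_transcript: Awaitable[str]) -> Tuple[str, bool]:
    """Draft a reply from the interim transcript while STT finalizes.

    Returns (reply, draft_was_reused).
    """
    draft = asyncio.create_task(fake_llm(partial))  # starts running immediately
    final = await final_transcript
    if normalize(final) == normalize(partial):
        return await draft, True   # speculation paid off: reuse the draft
    draft.cancel()                 # transcript changed: discard the draft
    return await fake_llm(final), False


_SENTENCE_END = re.compile(r"[.!?]\s")


def sentence_chunks(tokens: Iterable[str]) -> Iterator[str]:
    """Yield text for TTS as soon as a sentence boundary shows up in the
    token stream, instead of waiting for the whole response.

    Simplification: a boundary is ., ! or ? followed by whitespace, so
    abbreviations such as "Dr. Smith" would be split incorrectly.
    """
    buffer = ""
    for token in tokens:
        buffer += token
        match = _SENTENCE_END.search(buffer)
        while match is not None:
            chunk = buffer[: match.end()].strip()
            if chunk:
                yield chunk
            buffer = buffer[match.end():]
            match = _SENTENCE_END.search(buffer)
    tail = buffer.strip()
    if tail:
        yield tail  # flush whatever is left when the stream ends


if __name__ == "__main__":
    reply, reused = asyncio.run(
        respond("where is my order", fake_stt_final("Where is my order"))
    )
    print(reply, reused)
    for chunk in sentence_chunks(["Sure", ".", " One", " moment", "."]):
        print("send to TTS:", chunk)
```

The trade-off in `respond` is wasted compute when the final transcript differs from the interim one, in exchange for hiding STT finalization time when it does not. In `sentence_chunks`, the first sentence can go to the synthesizer while the model is still generating the second.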

The result

P50 end-of-utterance to first audio byte: 268 ms. P95: 412 ms. That's the threshold where callers stop noticing the AI.
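For readers who want to compute the same metric on their own calls, here is a minimal sketch. It assumes you log two timestamps per turn, the moment the endpointer declares end of utterance and the moment the first audio byte is sent. The function names, the nearest-rank percentile method, and the sample data are illustrative assumptions, not the measurement code behind the figures above.

```python
import math
from typing import List, Sequence, Tuple


def first_audio_latencies_ms(turns: Sequence[Tuple[int, int]]) -> List[int]:
    """Each turn is (end_of_utterance_ms, first_audio_byte_ms)."""
    return [first_audio - end_of_utt for end_of_utt, first_audio in turns]


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))  # 1-based rank
    return ordered[rank - 1]


# Hypothetical timestamps in milliseconds, for illustration only.
turns_ms = [(1000, 1240), (5000, 5260), (9000, 9310), (13000, 13420)]
latencies = first_audio_latencies_ms(turns_ms)
print("P50:", percentile(latencies, 50), "ms")
print("P95:", percentile(latencies, 95), "ms")
```

Reporting P95 alongside P50 matters because the slow turns are the ones callers remember; a good median can hide a long tail.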

