The Architecture Behind Sub-300ms Voice Latency

A look under the hood: streaming STT, speculative LLM decoding, and chunked TTS — and why naïve implementations sit at 1.5s.

Ravi Subramanian, Staff Engineer · April 2, 2026 · 11 min read

The naïve pipeline

Record → upload → STT → LLM → TTS → play. Each arrow is a network hop, and each stage waits for the previous one to finish. You'll land at 1.2–1.8 seconds before the caller hears anything, and the conversation feels broken.
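To make the arithmetic concrete, here is a rough Python sketch of how a strictly sequential pipeline adds up. The per-stage processing times and the per-hop cost are hypothetical placeholders, chosen only so the total falls inside the range quoted above. They are not measurements. `naive_latency_ms` is an illustrative helper, not part of any real API.

```python
from typing import Dict

# Stages in the order given above; each arrow between them is one network hop.
STAGES = ["record", "upload", "stt", "llm", "tts", "play"]


def naive_latency_ms(processing_ms: Dict[str, float], hop_ms: float) -> float:
    """Total delay when every stage waits for the previous one to finish."""
    hops = len(STAGES) - 1  # five arrows in the pipeline
    # Stages without an entry are treated as contributing no processing time.
    processing = sum(processing_ms.get(stage, 0.0) for stage in STAGES)
    return processing + hops * hop_ms


# Placeholder numbers for illustration only.
example_total = naive_latency_ms({"stt": 300, "llm": 500, "tts": 300}, hop_ms=80)
```

With these placeholder inputs the hops alone account for 400 ms of the total, which is why removing or overlapping them matters as much as speeding up any single model.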

What we do instead

  • Streaming STT with VAD endpointing, so we start the LLM call before the caller stops talking.
  • Speculative decoding that drafts the first 80 tokens while STT finalizes.
  • Chunked TTS that starts playing at the first sentence boundary, not the full response.
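
A minimal Python sketch of how these pieces could fit together. It is not the production code. It assumes one plausible reading of the second bullet: the reply is drafted from the latest partial transcript and reused only if the finalized transcript matches. `normalize`, `draft_reply`, `finalize_stt`, `respond`, and `sentence_chunks` are hypothetical names, and the two coroutines sleep instead of calling real STT or LLM services.

```python
import asyncio
import re
from typing import Iterable, Iterator, Tuple

# A sentence ends at ., ! or ? followed by whitespace (naive: "Dr. Lee" also splits).
_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial STT differences compare equal."""
    return " ".join(text.lower().split())


async def draft_reply(transcript: str) -> str:
    """Hypothetical stand-in for a streaming LLM call."""
    await asyncio.sleep(0.05)
    return f"Got it. You said: {normalize(transcript)}."


async def finalize_stt(final_text: str) -> str:
    """Hypothetical stand-in for the STT engine finalizing its transcript."""
    await asyncio.sleep(0.03)
    return final_text


async def respond(partial: str, final_text: str) -> Tuple[str, bool]:
    """Draft from the partial transcript while STT finalizes; reuse the draft on a match."""
    draft_task = asyncio.create_task(draft_reply(partial))
    final = await finalize_stt(final_text)
    if normalize(final) == normalize(partial):
        return await draft_task, True  # the early draft is still valid
    draft_task.cancel()  # the transcript changed, so the draft is discarded
    return await draft_reply(final), False


def sentence_chunks(tokens: Iterable[str]) -> Iterator[str]:
    """Yield each sentence as soon as the token after its boundary arrives."""
    buffer = ""
    for token in tokens:
        buffer += token
        parts = _BOUNDARY.split(buffer)
        for sentence in parts[:-1]:  # everything but the tail is complete
            if sentence:
                yield sentence
        buffer = parts[-1]
    if buffer.strip():
        yield buffer.strip()  # flush whatever remains at end of stream
```

In this sketch, each string yielded by `sentence_chunks` would be handed to TTS immediately, so playback of the first sentence can begin while the LLM is still producing the rest. When the draft is discarded, the cost is one wasted LLM call, not extra delay for the caller.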

The result

P50 end-of-utterance to first audio byte: 268 ms. P95: 412 ms. The median sits under 300 ms, the threshold where callers stop noticing the AI.
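
For readers who want to compute the same two figures from their own logs, here is a small Python sketch using the nearest-rank percentile method. The article does not say which percentile method or timestamp source was used, so both are assumptions, and `first_audio_latency_ms` and `percentile` are hypothetical helper names.

```python
import math
from typing import Sequence


def first_audio_latency_ms(end_of_utterance_s: float, first_audio_byte_s: float) -> float:
    """Latency for one turn, from two timestamps in seconds on the same clock."""
    return (first_audio_byte_s - end_of_utterance_s) * 1000.0


def percentile(samples_ms: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples_ms:
        raise ValueError("need at least one latency sample")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(samples_ms)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]
```

Calling `percentile(latencies, 50)` and `percentile(latencies, 95)` on a list of per-turn latencies gives the P50 and P95. Nearest-rank always returns an observed sample, which keeps the tail figure honest on small datasets.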

Want to see this in your own call data?

Pilot VOXOS for 14 days on real traffic — no credit card.
