AI Systems · Voice AI · LLM Pipelines · Real-Time S2S

AI Systems Engineering
From Prototype to Production

Built and deployed a real-time speech-to-speech AI pipeline end-to-end: Twilio telephony → Deepgram STT → Groq LLaMA → Azure TTS → caller. Sub-second per-component, multi-tenant, circuit-broken, production-observable.

Groq LPU LLaMA 3.3 70B Deepgram nova-2 Azure TTS Twilio Media Streams WAHA / WhatsApp FastAPI async WebSocket Redis cache

Real-Time Speech-to-Speech Pipeline

5 provider APIs, 3 WebSocket connections, 2 persistent data stores — all orchestrated in a single async Python process per call. Each step observable in production.

01
Twilio → mulaw 8kHz WebSocket
Caller audio arrives as base64-encoded mulaw 8kHz chunks (20ms frames) over Twilio Media Streams WebSocket. FastAPI async handler receives and queues chunks with minimal buffering.
~20ms/chunk
02
mulaw → PCM → Deepgram STT
Chunks decoded from mulaw to 16kHz PCM and streamed to Deepgram nova-2 WebSocket (hi-IN language model). endpointing=250ms yields speech_final event on caller pause. Partial transcripts used for barge-in detection before the full turn completes.
250ms endpointing configured
03
Transcript → Groq LLaMA (streaming)
On speech_final: final transcript + last 8 conversation turns + tenant system prompt sent to Groq LLaMA 3.3 70B. Streaming API — tokens arrive as generated. Accumulated until sentence boundary (।?!), then dispatched to TTS immediately. Concurrency controlled per-tenant via asyncio.Semaphore + AIMD.
~150–300ms TTFT
04
Sentence → Azure TTS + Cache
Text normalised → SHA256 → Redis lookup. Cache hit: pre-computed mulaw bytes returned in <200ms. Cache miss: Azure Cognitive Services TTS synthesises MP3, decoded to mulaw, written to Redis (24hr TTL). Audio streamed in 20ms chunks back to caller.
<200ms cache hit · ~800ms miss
05
mulaw chunks → Twilio → Caller
mulaw chunks sent back over Twilio Media Streams WebSocket as base64 JSON payloads. Barge-in token checked at every chunk send — stale audio discarded if new caller speech detected mid-sentence. One 20ms chunk of bleed-in on interrupt is the accepted trade-off.
~20ms/chunk to caller

Observed Latency Profile

Production configuration. Provider latency variability not under our control — these are typical observed values, not guarantees. Click any card to see the controlling variable.

Deepgram Endpointing
250ms
configured · not variable
250ms silence triggers speech_final. Lower = more interruptions; higher = perceived hesitation in response.
Controlling Variable

endpointing parameter on Deepgram WebSocket connection. 250ms is the sweet spot for Hindi conversation — English callers tolerate lower values.

Tuning Trade-off

Reducing to 150ms cuts 100ms from turn latency but causes more false speech_final events from mid-sentence pauses. Increasing to 400ms feels unresponsive. 250ms is calibrated to APEX's primary Hindi use case.

Groq TTFT
~150–300ms
first token from LLM
LPU hardware advantage over GPU — 3–5× faster first-token than OpenAI GPT-4o at similar cost.
Controlling Variable

Groq LPU load + prompt length. Context window (8 turns + system prompt) kept short intentionally to minimise TTFT. Each additional 1K tokens adds ~20–40ms.

Worst Case

Under rate-limit AIMD throttle, calls queue behind semaphore. Queue time adds to perceived latency but TTFT itself stays consistent once the call acquires the semaphore.

First Sentence Complete
~300–600ms from TTFT
sentence boundary split
Dispatch to TTS on first sentence boundary, not full response — saves ~300ms vs waiting for complete LLM output.
Controlling Variable

Response sentence length + Groq token generation rate (~200 tokens/s on LPU). Short first sentences (e.g. greetings) dispatch in 50–100ms of TTFT.

Implementation Note

Sentence split on ।?! characters. Hindi full-stop (।) added specifically after testing showed English split-only missed natural pause points in Hindi sentences.

TTS Cache Hit
<200ms
Redis in-cluster RTT
Pre-computed mulaw bytes from Redis. Same namespace — sub-ms network hop. Greetings and common phrases always warm.
Controlling Variable

Redis RTT (sub-ms in same K8s namespace) + mulaw decode time. Cache hit rate is ~70–85% for FAQ-heavy tenants — greetings, confirmations, and static FAQ answers repeat across every call.

Pre-warming

On tenant activation: extract common phrases from system prompt → synthesise → cache. Startup cost ~500ms per tenant; amortised across thousands of calls.

TTS Cache Miss
~600–1200ms
Azure TTS API RTT
Full Azure synthesis cycle: HTTP call + MP3 generation + mulaw transcode. Primary latency risk in the pipeline.
Controlling Variable

Azure Cognitive Services API RTT + MP3 decode + mulaw conversion. Azure TTS has a circuit breaker (threshold: 3 failures) — tighter than Groq because silent audio is worse UX than slow response.

Mitigation

Cache miss on first occurrence; all subsequent calls for same text are cache hits. For dynamic responses (personalised replies), miss rate is higher but sentence-level splitting keeps per-sentence miss cost bounded.

End-to-End Turn Latency
~600ms – 1.5s
cache-hit path · observed production
250ms endpointing + 150ms TTFT + 300ms sentence + <200ms TTS cache hit = ~900ms nominal. Observed range accounts for provider variance.
Cache Hit Path (nominal)

Endpointing (250ms) + Groq TTFT (150ms) + first sentence tokens (200ms) + TTS cache hit (150ms) + audio transmission (20ms) ≈ 770ms. Observed p50 ~850ms.

Cache Miss Path (worst case)

Same as above but TTS takes ~800ms instead of 150ms. Total ≈ 1.4s. Still within the 1.5s target. p99 outliers caused by Groq rate-limit queuing.

LLM Integration Design Decisions

Why Groq over OpenAI for inference

LPU hardware delivers 10–20× faster TTFT vs GPU-based inference at comparable cost. For real-time voice, TTFT dominates perceived latency. GPT-4o not evaluated — Groq LLaMA sufficient and faster for this token budget.

8-turn context window

Context limited to last 8 conversation turns. Token cost + latency increase non-linearly with context length. 8 turns covers the conversational memory horizon for most IVR scenarios. Full history preserved in Firestore if needed for escalation.

Sentence-boundary streaming split

LLM response split at sentence boundaries (।?!). First sentence dispatched to TTS while LLM still generates the rest. Reduces perceived first-audio latency by ~300ms. Hindi full-stop (।) added after testing — critical for correct split on Hindi responses.

AIMD throttling on 429s

Rate limit response: halve global LLM concurrency immediately. On success: increment by 1. Adapts to actual API capacity in real-time. Alternative (fixed backoff with jitter) rejected — doesn't converge to optimal throughput under sustained load.

Circuit breaker per provider

6 independent circuit breakers: Deepgram, Groq, Azure TTS, Twilio, Firestore, WAHA. Azure TTS threshold tighter (3 failures) than Groq (5) — silent audio is worse UX than a slow response. Half-open probe re-closes on single success.

Barge-in interruption

On new Deepgram partial transcript mid-speech: set barge-in token. TTS send loop checks token before each 20ms chunk — stops sending if triggered. Prevents agent talking over caller. Trade-off: ~20ms audio bleed-in on interrupt (one chunk).

Multi-Channel: WhatsApp via WAHA

WAHA Core Tier Constraint
Single default session
WAHA Core (free) limitation
WAHA Core supports exactly 1 session. Multi-tenant WhatsApp requires WAHA Plus or separate WAHA instances. APEX handles the 422 gracefully — returns CORE_LIMIT status instead of propagating 500.
Operational Reality

WAHA Core (free tier): exactly 1 default session. Creating additional sessions → HTTP 422. Session state stored in WAHA internal store, not Redis. Session recovery on WAHA restart requires QR code re-scan.

Remediation Path

WAHA Plus upgrade enables multi-session support with programmatic management. Alternative: separate WAHA Core instance per tenant (higher ops overhead, lower cost). Current deployment: single shared WAHA Core — sufficient for single-tenant MVP.

Message Handling
Webhook → FastAPI
same pipeline as voice
WhatsApp messages route through the same LLM orchestration pipeline as voice — different I/O layer, identical AI logic.
Architecture

WAHA receives WhatsApp message → fires webhook → FastAPI handler → transcript stored in Redis session → LLM + response → WAHA sends reply. No TTS on WhatsApp — text response direct.

Session Handling

Redis session keyed on WhatsApp number instead of call_sid. TTL reset on each message. Same 8-turn context window. Firestore stores full conversation history for compliance.

AI Systems Engineering Engagements

Specific AI systems problems I can help build or fix, with production experience on each.

Real-time voice AI pipeline

STT → LLM → TTS end-to-end on any telephony provider (Twilio, Telnyx, Vonage). Latency-optimised, multi-tenant, production-deployed.

LLM integration & concurrency

Groq / OpenAI / Anthropic API integration with AIMD throttling, streaming, per-tenant semaphores, circuit breakers.

Multi-tenant AI SaaS architecture

Tenant isolation, per-tenant quotas, onboarding automation, system prompt customisation per tenant, billing integration.

AI agent reliability patterns

Circuit breakers, graceful degradation under provider failure, fallback chains, barge-in interruption handling, observability.

WhatsApp / messaging AI

WAHA / Twilio Conversations / direct WABA integration. Stateful conversation management in Redis + Firestore.

Latency optimisation

TTS caching strategy, sentence-boundary streaming, endpointing tuning, TTFT reduction via provider selection and context pruning.

Build an AI System Together

Describe the AI system you're building or the reliability problem you're solving. I'll respond with a candid assessment of fit and a proposed engagement structure.