AI Systems Engineering
From Prototype to Production
Built and deployed a real-time speech-to-speech AI pipeline end-to-end: Twilio telephony → Deepgram STT → Groq LLaMA → Azure TTS → caller. Sub-second per-component, multi-tenant, circuit-broken, production-observable.
Real-Time Speech-to-Speech Pipeline
5 provider APIs, 3 WebSocket connections, 2 persistent data stores — all orchestrated in a single async Python process per call. Each step observable in production.
Observed Latency Profile
Production configuration. Provider latency variability not under our control — these are typical observed values, not guarantees. Click any card to see the controlling variable.
Deepgram Endpointing ▾
250ms
configured · not variable
250ms silence triggers speech_final. Lower = more interruptions; higher = perceived hesitation in response.
Controlling Variable
endpointing parameter on Deepgram WebSocket connection. 250ms is the sweet spot for Hindi conversation — English callers tolerate lower values.
Tuning Trade-off
Reducing to 150ms cuts 100ms from turn latency but causes more false speech_final events from mid-sentence pauses. Increasing to 400ms feels unresponsive. 250ms is calibrated to APEX's primary Hindi use case.
Groq TTFT ▾
~150–300ms
first token from LLM
LPU hardware advantage over GPU — 3–5× faster first-token than OpenAI GPT-4o at similar cost.
Controlling Variable
Groq LPU load + prompt length. Context window (8 turns + system prompt) kept short intentionally to minimise TTFT. Each additional 1K tokens adds ~20–40ms.
Worst Case
Under rate-limit AIMD throttle, calls queue behind semaphore. Queue time adds to perceived latency but TTFT itself stays consistent once the call acquires the semaphore.
First Sentence Complete ▾
~300–600ms from TTFT
sentence boundary split
Dispatch to TTS on first sentence boundary, not full response — saves ~300ms vs waiting for complete LLM output.
Controlling Variable
Response sentence length + Groq token generation rate (~200 tokens/s on LPU). Short first sentences (e.g. greetings) dispatch in 50–100ms of TTFT.
Implementation Note
Sentence split on ।?! characters. Hindi full-stop (।) added specifically after testing showed English split-only missed natural pause points in Hindi sentences.
TTS Cache Hit ▾
<200ms
Redis in-cluster RTT
Pre-computed mulaw bytes from Redis. Same namespace — sub-ms network hop. Greetings and common phrases always warm.
Controlling Variable
Redis RTT (sub-ms in same K8s namespace) + mulaw decode time. Cache hit rate is ~70–85% for FAQ-heavy tenants — greetings, confirmations, and static FAQ answers repeat across every call.
Pre-warming
On tenant activation: extract common phrases from system prompt → synthesise → cache. Startup cost ~500ms per tenant; amortised across thousands of calls.
TTS Cache Miss ▾
~600–1200ms
Azure TTS API RTT
Full Azure synthesis cycle: HTTP call + MP3 generation + mulaw transcode. Primary latency risk in the pipeline.
Controlling Variable
Azure Cognitive Services API RTT + MP3 decode + mulaw conversion. Azure TTS has a circuit breaker (threshold: 3 failures) — tighter than Groq because silent audio is worse UX than slow response.
Mitigation
Cache miss on first occurrence; all subsequent calls for same text are cache hits. For dynamic responses (personalised replies), miss rate is higher but sentence-level splitting keeps per-sentence miss cost bounded.
End-to-End Turn Latency ▾
~600ms – 1.5s
cache-hit path · observed production
250ms endpointing + 150ms TTFT + 300ms sentence + <200ms TTS cache hit = ~900ms nominal. Observed range accounts for provider variance.
Cache Hit Path (nominal)
Endpointing (250ms) + Groq TTFT (150ms) + first sentence tokens (200ms) + TTS cache hit (150ms) + audio transmission (20ms) ≈ 770ms. Observed p50 ~850ms.
Cache Miss Path (worst case)
Same as above but TTS takes ~800ms instead of 150ms. Total ≈ 1.4s. Still within the 1.5s target. p99 outliers caused by Groq rate-limit queuing.
LLM Integration Design Decisions
Why Groq over OpenAI for inference
LPU hardware delivers 10–20× faster TTFT vs GPU-based inference at comparable cost. For real-time voice, TTFT dominates perceived latency. GPT-4o not evaluated — Groq LLaMA sufficient and faster for this token budget.
8-turn context window
Context limited to last 8 conversation turns. Token cost + latency increase non-linearly with context length. 8 turns covers the conversational memory horizon for most IVR scenarios. Full history preserved in Firestore if needed for escalation.
Sentence-boundary streaming split
LLM response split at sentence boundaries (।?!). First sentence dispatched to TTS while LLM still generates the rest. Reduces perceived first-audio latency by ~300ms. Hindi full-stop (।) added after testing — critical for correct split on Hindi responses.
AIMD throttling on 429s
Rate limit response: halve global LLM concurrency immediately. On success: increment by 1. Adapts to actual API capacity in real-time. Alternative (fixed backoff with jitter) rejected — doesn't converge to optimal throughput under sustained load.
Circuit breaker per provider
6 independent circuit breakers: Deepgram, Groq, Azure TTS, Twilio, Firestore, WAHA. Azure TTS threshold tighter (3 failures) than Groq (5) — silent audio is worse UX than a slow response. Half-open probe re-closes on single success.
Barge-in interruption
On new Deepgram partial transcript mid-speech: set barge-in token. TTS send loop checks token before each 20ms chunk — stops sending if triggered. Prevents agent talking over caller. Trade-off: ~20ms audio bleed-in on interrupt (one chunk).
Multi-Channel: WhatsApp via WAHA
WAHA Core Tier Constraint ▾
Single default session
WAHA Core (free) limitation
WAHA Core supports exactly 1 session. Multi-tenant WhatsApp requires WAHA Plus or separate WAHA instances. APEX handles the 422 gracefully — returns CORE_LIMIT status instead of propagating 500.
Operational Reality
WAHA Core (free tier): exactly 1 default session. Creating additional sessions → HTTP 422. Session state stored in WAHA internal store, not Redis. Session recovery on WAHA restart requires QR code re-scan.
Remediation Path
WAHA Plus upgrade enables multi-session support with programmatic management. Alternative: separate WAHA Core instance per tenant (higher ops overhead, lower cost). Current deployment: single shared WAHA Core — sufficient for single-tenant MVP.
Message Handling ▾
Webhook → FastAPI
same pipeline as voice
WhatsApp messages route through the same LLM orchestration pipeline as voice — different I/O layer, identical AI logic.
Architecture
WAHA receives WhatsApp message → fires webhook → FastAPI handler → transcript stored in Redis session → LLM + response → WAHA sends reply. No TTS on WhatsApp — text response direct.
Session Handling
Redis session keyed on WhatsApp number instead of call_sid. TTL reset on each message. Same 8-turn context window. Firestore stores full conversation history for compliance.
AI Systems Engineering Engagements
Specific AI systems problems I can help build or fix, with production experience on each.
Real-time voice AI pipeline
STT → LLM → TTS end-to-end on any telephony provider (Twilio, Telnyx, Vonage). Latency-optimised, multi-tenant, production-deployed.
LLM integration & concurrency
Groq / OpenAI / Anthropic API integration with AIMD throttling, streaming, per-tenant semaphores, circuit breakers.
Multi-tenant AI SaaS architecture
Tenant isolation, per-tenant quotas, onboarding automation, system prompt customisation per tenant, billing integration.
AI agent reliability patterns
Circuit breakers, graceful degradation under provider failure, fallback chains, barge-in interruption handling, observability.
WhatsApp / messaging AI
WAHA / Twilio Conversations / direct WABA integration. Stateful conversation management in Redis + Firestore.
Latency optimisation
TTS caching strategy, sentence-boundary streaming, endpointing tuning, TTFT reduction via provider selection and context pruning.
Build an AI System Together
Describe the AI system you're building or the reliability problem you're solving. I'll respond with a candid assessment of fit and a proposed engagement structure.