Production AI & Platform Systems Engineer
Builds and operates production AI systems: real-time voice AI pipelines, LLM integration backends, multi-tenant SaaS on Kubernetes/AKS, CPaaS integrations (Twilio, WAHA), and Redis-backed job queue architectures. Not a researcher. An engineer who ships.
| Core stack | FastAPI (Python 3.11) · Redis · Firestore · React 18 |
| Cloud | Azure (AKS, Key Vault, Azure TTS, ACR) |
| Deployment | GitHub Actions → ACR → kubectl rolling update (40+ deployments completed) |
| Orchestration | Custom Python — deterministic, not LangChain or Autogen |
| Contracting | B2B / IR35-outside / remote / UK–EU–US–India |
| Timezone | IST (UTC+5:30) — overlap with UK/EU mornings |
| Engagement | Platform build · Architecture advisory · Fractional technical leadership |
| Production | aimmarketing.in — live, real customers, real calls |
Ankit Panicker — AI Platform & Systems Engineer
AI platform engineer based in the UK. Builds and operates production AI systems: real-time voice AI pipelines, LLM integration backends, multi-tenant SaaS on Kubernetes/AKS, CPaaS integrations (Twilio, WAHA/WhatsApp), Redis-backed queue architectures, and offline-first systems. Available for B2B contracting outside IR35 via UK Ltd. Not a researcher — production engineer with observable reliability track record.
Systems built and operated
- Multi-tenant AI voice platform — inbound + outbound calls, Hindi language, appointment booking, WhatsApp automation
- CPaaS integrations — Twilio Media Streams WebSocket (mulaw 8kHz), WAHA WhatsApp Business API, Telnyx (fallback provider)
- Offline-first hospital management system — PWA, IndexedDB + WAL sync for durability, multi-role RBAC (admin/doctor/pharmacist/receptionist), prescription management, audit log
- Fully local Hindi voice AI — Whisper STT + llama.cpp + Kokoro TTS + RedisVL vector memory, zero cloud API calls, GPU-accelerated
- Redis-backed ZSET job queue — KEDA autoscaling, AOF persistence, DLQ, priority scoring, dead-letter inspection
- Azure AKS Kubernetes cluster — HPA, PDB, KEDA, External Secrets Operator, NGINX Ingress, cert-manager, namespace-scoped NetworkPolicy
- LLM orchestration pipeline — Groq LLaMA 3.3 70B, deterministic Python (no LangChain), streaming, sentence-boundary TTS dispatch, AIMD throttle, circuit breakers
- Observability — Sentry error tracking, structured log_event() with JSON fields, Prometheus /metrics, SSE real-time log tail, separate liveness/readiness probes
- Reliability engineering — 6 named circuit breakers (CLOSED/OPEN/HALF-OPEN), Redis NX idempotency, Firestore file fallback, volatile-lru queue preservation, barge-in interruption
Production evidence
- 6,800+ production call executions handled
- 40+ CI/CD deployments to Azure AKS
- ~1s typical E2E voice turn latency (600ms–1.5s range, observed)
- <200ms first audio on TTS cache hit (SHA256 mulaw sidecar)
- 3 production incidents: diagnosed, root-caused, fixed, post-mortemed
- 6 circuit breakers covering all external providers independently
- 10+ secrets managed via Azure Key Vault + ESO (rotate without pod restart)
- Cross-tenant cache leak found and fixed in production (React Query key missing tenant scope)
- Live at aimmarketing.in — real tenants, real calls, real customers
Engagement & contact
| Role | AI Platform Engineer |
| Location | United Kingdom (remote-first) |
| Contracting | Outside IR35 · UK Ltd · B2B |
| Markets | UK · EU · US · India |
| Availability | Engagement types → |
| Contact | mr.ankitpanicker@gmail.com |
Production Evidence
Architecture descriptions are cheap. This section documents what was actually built, deployed, and operated — with answers to "how do you know?"
40+ CI/CD deployments completed
- GitHub Actions → ACR Docker build →
kubectl set image→ rollout status validation - Typical deployment window: ~4 minutes end-to-end
- Rollback model tested:
kubectl rollout undo - Zero manual intervention required for standard deploy
Multi-tenant isolation in production
- Per-tenant Redis key namespacing:
rate:{tenant_id},llm_cache:{tenant_id}:{hash} - Server-side tenant scope on every API endpoint — not just UI-level
- Cross-tenant React Query cache bug found and fixed in production (missing
tenant_idin query key)
Live telephony integration
- Twilio WebSocket media streams — inbound and outbound calls handled in production
- WhatsApp (WAHA Core, default session, connected phone 917247240888)
- Telnyx fallback configured and tested
- Real customers using real calls — not a demo or sandbox environment
Circuit breaker coverage exercised under real failures
- Groq 429 rate limit → AIMD throttle triggered; per-tenant semaphore queue confirmed functional
- WAHA 422 → CORE_LIMIT graceful handling deployed after real production crash
- Firestore fallback activated and verified in dev environment
Observed latency documented under production load
- E2E speech turn: ~600ms–1.5s (STT endpointing 250ms + Groq LPU ~200–400ms + TTS stream start)
- TTS cache path: <200ms first audio observed (mulaw sidecar — no Azure API call)
- Redis round-trips: sub-millisecond (in-cluster, same K8s namespace)
Three production incidents diagnosed and fixed
- Onboarding activation loop — root cause: state transition gap in user activation
- WAHA 422 cascade — root cause: WAHA tier limitation unhandled, crash on every WhatsApp status load
- Cross-tenant query cache leak — root cause: missing tenant scope in React Query key
Executive Overview
What the system is, what problem it solves, and how it's built.
What is APEX Voice AI?
APEX Voice AI is a multi-tenant AI voice agent SaaS platform. A business subscribes, creates their tenant configuration (clinic, call center, hospital), and immediately gets an AI agent that answers inbound calls in Hindi, books appointments, runs outbound campaigns, and handles WhatsApp — without hiring additional staff. Each tenant is fully isolated: their configuration, their call data, their rate limits, their Redis key namespace.
Business problem solved
Indian SMBs — clinics, hospitals, retail stores, financial services — receive high inbound call volumes with repetitive queries (appointment booking, order status, store timings). The cost of staffing a call center is prohibitive. APEX replaces the first-touch human interaction with an AI agent that speaks natural Hindi, understands context, and hands off to a human when needed.
Engineering scope
Full-stack implementation from scratch: FastAPI backend on Azure AKS (2 replicas, HPA, PDB), real-time WebSocket telephony with Twilio, Deepgram STT, Groq LLM, Azure TTS with a custom mulaw sidecar cache, Redis as a universal backbone (queue + session + cache + rate limiter + idempotency), Firebase Auth + Firestore, WAHA WhatsApp, GitHub Actions CI/CD, External Secrets Operator, React 18 dashboard deployed on Firebase Hosting.
Architecture in one paragraph
Stateless FastAPI pods on AKS sit behind an NGINX Ingress. Twilio opens a WebSocket media stream per call; the pod connects a Deepgram WebSocket and an asyncio pipeline that transcribes → generates with Groq → synthesises with Azure TTS → streams audio back. All state lives in Redis (session, queue, cache) or Firestore (tenant config, user records). Workers pull from a Redis ZSET priority queue for async jobs. Six circuit breakers guard all external providers. The entire system deploys in ~4 minutes from a git push.
System Architecture
Interactive diagram — click any node for component details. Animated signal flow shows request path. Full technical walkthrough below ↓
Project Walkthrough
Six guided sections — expand any row for technical depth. Use the tabs to navigate directly to the layer you need.
Problem Statement and System Goals — APEX Voice AI
01Problem Statement & System Goals▾
Indian SMBs — clinics, hospitals, call centers — receive high inbound call volumes with repetitive, structured queries. The cost of staffing human agents is prohibitive. The goal: an AI voice agent that answers calls in natural Hindi, understands context, executes tasks (booking, lookup, escalation), and hands off to humans when needed.
System goals:
- Sub-1.5s perceived turn latency (speech-to-speech) under production configuration
- Full multi-tenancy — one deployment, many isolated business customers
- Zero secrets in code — all credentials via Azure Key Vault
- Automatic recovery from external provider failures (circuit breakers)
- Deployable by one engineer in under 5 minutes from a git push
Backend Architecture, Frontend Architecture, Database Model, Queue System — APEX Voice AI
02Functional Architecture▾
From a user perspective: a business configures their tenant (system prompt, business hours, agent persona, appointment types). Their phone number is pointed at a Twilio webhook. When a customer calls, APEX answers, transcribes speech in real-time, generates a context-aware reply with Groq LLaMA, synthesises audio with Azure TTS, and streams it back — in under 1.5 seconds per turn.
Outbound campaigns: a CSV of phone numbers is uploaded. APEX dials each number, delivers a message, records responses, and logs outcomes in Firestore. WhatsApp follows the same pipeline via WAHA.
03Backend Architecture▾
FastAPI on Python 3.11 — async-first, auto-generates OpenAPI, uses Depends() for auth injection into every handler. Two API pod replicas on AKS with an HPA that scales 2–5 based on CPU. PodDisruptionBudget ensures 1 pod always available during rolling deploys.
Worker pods handle background jobs: outbound campaigns, batch TTS pre-warm, knowledge base indexing. Workers pull from a Redis ZSET priority queue. KEDA ScaledObject watches queue depth for worker autoscaling.
# Dependency injection pattern — auth verified on every handler
@router.post("/api/v1/campaign")
async def create_campaign(
req: Request,
user: dict = Depends(get_current_user), # Firebase JWT → user record
_admin = Depends(require_admin),
):
tenant = user["tenant"]
# All downstream ops scoped to this tenant
04Frontend Architecture▾
React 18 + TypeScript + Vite + TailwindCSS + React Query + Zustand + Framer Motion. Deployed on Firebase Hosting alongside the marketing site via a unified build pipeline (npm run build → Vite → sync-marketing.mjs → Firebase deploy).
useAuthStore (Zustand): single source of truth for {user, role, tenant, active}. All React Query keys include tenant to prevent cross-tenant cache leaks — this was a real production bug found and fixed. All pages are lazy-loaded via React.lazy + Suspense; main bundle reduced from 2007KB to 253KB with rollupOptions.manualChunks.
05Database Model — Firestore & Redis Schemas▾
Firestore: Document-model, serverless, zero-ops. Two root collections:
users/{email}:
active: bool, tenant: str, role: str, plan: str
tenants/{id}:
name: str, plan: str, owner_email: str,
system_prompt: str, waha_session: str,
rate_limit: int, voice: str, language: str
Redis key patterns:
rate:{tenant_id} # daily call counter, TTL = end of day
session:{call_sid}:{tid} # active call state, TTL = 3600s
llm_cache:{tid}:{sha256} # LLM response cache, TTL = 86400s
idempotency:{call_sid} # NX key, TTL = 3600s
job:{jid} # ZSET member with priority score
File fallback: tenants/*.json and data/users.json activate automatically if Firestore is unreachable. No data loss — degraded mode only.
06Queue System — Redis ZSET + KEDA▾
Jobs are scored ZSET members. Priority = score (lower score = higher priority). Workers call ZPOPMIN to claim the highest-priority job. Dead-letter: jobs that fail N times are moved to dlq:{tenant} ZSET for manual inspection.
KEDA ScaledObject watches Redis queue depth. When queue depth exceeds threshold, KEDA signals the K8s HPA to scale up worker pods. Workers scale to zero when queue is empty — no idle compute cost.
# Job dispatch
await redis.zadd("jobs", {job_id: priority_score})
# Worker poll
job_id = await redis.zpopmin("jobs", 1)
AI Orchestration — LLM Pipeline, Memory Management, Redis TTL
07AI Orchestration — LLM Context & Pipeline▾
Deterministic Python orchestration — no LangChain, no Autogen. The orchestration is a small function that constructs the messages array: system prompt (tenant-specific) + last 8 conversation turns (sliding window) + current user transcript. This predictable structure gives consistent latency and no framework abstraction overhead.
Groq LLaMA 3.3 70B returns streaming tokens. FastAPI accumulates tokens and splits on sentence boundaries (। ? ! . for Hindi/English). Each complete sentence is passed to the TTS pipeline immediately — pipelining TTS generation while LLM continues — reducing perceived latency.
AIMD concurrency control: global semaphore caps total concurrent Groq calls to match API limits. If concurrency exceeds limit, new requests back off (multiplicative decrease). On success, limit increases additively. Per-tenant semaphore ensures no single tenant can starve others.
08Memory Management — Redis TTL & Eviction▾
Redis is configured with maxmemory-policy volatile-lru: only keys with TTL set are eligible for eviction. Session state and LLM cache keys have TTL; idempotency keys have TTL; rate counters have TTL aligned to daily boundaries. Queue ZSET members (no TTL) are never evicted — they are preserved even under memory pressure.
AOF persistence (appendfsync everysec): queue state survives pod restarts. At startup, workers resume processing from where they left off. Session state (TTL 3600s) is acceptable to lose on restart — calls already in-flight will reconnect.
Authentication, Multi-Tenancy Isolation, Security, Observability, Idempotency and Retry Handling
09Authentication Flow — Firebase Auth + RBAC▾
Firebase Auth issues JWTs on login. Backend get_current_user() verifies the JWT, extracts the email, and calls get_or_create_user(email) to fetch the user record from Firestore. The record contains {email, tenant, role, active}.
RBAC: role: "admin" can see all tenants; role: "client" is scoped to user.tenant only. RequireAdmin and require_admin guards are applied at the handler level. Admin emails are configurable via env var — auto-promoted at login time.
Machine-to-machine: X-API-Key header (APEX_API_AUTH_KEY from apex-secrets) for admin operations that don't have a Firebase token context.
10Multi-Tenancy — Isolation Model▾
Four layers of tenant isolation:
- Data: Firestore documents scoped under
tenants/{tid}. Redis keys prefixed with{tenant_id}:. No query can access another tenant's data structurally. - Auth: Every API handler receives
user.tenantfromget_current_user(). All downstream queries are scoped to this value. Server-side enforcement — not UI-level. - Compute: Per-tenant
asyncio.Semaphoreon LLM requests. One heavy tenant can't starve others of Groq API capacity. - Rate limits: Per-tenant daily call counters in Redis. Plan tiers (trial/spark/blaze) each have different limits checked before any Twilio call is created.
Client-side: React Query cache keys include tenant — a missing tenant scope was found and fixed in production (cross-tenant data visible to wrong user).
11Security — Secrets, Network, Auth Layers▾
Secrets: 10+ secrets stored in Azure Key Vault. External Secrets Operator (ESO) syncs them into a single K8s apex-secrets Opaque secret every 5 minutes. No secrets in environment files, no secrets in code, no secrets in git history. Firebase service account mounted as a volume from a separate K8s secret.
Network: FastAPI pods listen on 127.0.0.1 — NGINX is the only external entry point. All inter-service traffic is cluster-internal. TLS terminated at the ingress layer only.
Auth depth: (1) Firebase JWT on every user request. (2) Server-side tenant scope on every data access. (3) Admin key for machine-to-machine operations. (4) Kubernetes RBAC for cluster access. (5) Azure RBAC for Key Vault access.
12Observability — Logs, Metrics, Tracing▾
Error tracking: Sentry DSN injected at runtime from apex-secrets. All unhandled exceptions, circuit breaker state changes, and critical path failures are captured with context (tenant, call_sid, endpoint).
Structured logging: log_event() throughout the codebase — consistent JSON fields: event, tenant, call_sid, duration_ms, error. Aggregated via kubectl logs; future migration path to structured log sink.
Metrics: Prometheus counters at /metrics (in-process). Real-time log tail via SSE endpoint in the dashboard — operators can watch call events live without SSH.
Synthetic health: /healthz (liveness) and /readyz (readiness) endpoints — separate concerns. readyz includes Firestore and Redis connectivity checks. Pods are not marked ready until Azure TTS warmup completes (~38s at startup).
13Retry Handling — Idempotency & Deduplication▾
Twilio retries webhook delivery on 5xx responses. Without idempotency, a transient API error creates duplicate call sessions. Solution: SET idempotency:{call_sid} 1 NX EX 3600 — the NX flag means only the first request succeeds. Subsequent retries find the key set, skip session creation, and rehydrate the existing session from Redis.
Job queue idempotency: jobs include a content hash as part of their ZSET key. Duplicate job submissions are deduplicated at the ZADD stage. Dead-letter jobs include retry count and original error for diagnosis.
Failure Recovery, Circuit Breakers, Deployment Lifecycle CI/CD, Real-Time Pipeline Hot Path
14Failure Recovery — Circuit Breakers▾
Six named circuit breakers (Groq, Azure TTS, Deepgram, Twilio, Telnyx, WAHA). Each is an independent state machine: CLOSED → OPEN (after N failures) → HALF_OPEN (after timeout) → CLOSED (if probe succeeds) or OPEN (if probe fails).
Thresholds differ by provider: Azure TTS opens after 3 failures (silent audio to caller is worse than a slow response). Groq opens after 5 failures. When OPEN, requests fail immediately (no 30s timeout wait) — fast failure is the primary value of circuit breaking, not recovery.
Barge-in: TTS send tasks check a per-call generation token at every audio chunk boundary. When a caller speaks mid-response, the token is invalidated — in-flight chunks abort themselves and a clear event is sent to Twilio to stop playback.
15Deployment Lifecycle — CI/CD & Rollback▾
GitHub Actions (.github/workflows/cd-aks.yml) triggers on every push to master:
- Build Docker image with tag
YYYYMMDD-{sha7} - Push to ACR (
apexacrprod.azurecr.io) — both tagged and:latest kubectl set image deployment/apex-api api={image} -n apexkubectl set image deployment/apex-worker worker={image} -n apexkubectl rollout status --timeout=120s— fails fast if pods don't come up
Total: ~4 minutes from git push to production traffic on new image. Rollback: kubectl rollout undo deployment/apex-api -n apex. Probe timings: initialDelaySeconds: 60 on both liveness and readiness — Azure TTS warmup for all tenants takes ~38s at startup.
16Real-Time S2S Pipeline — Hot Path Detail▾
The speech-to-speech hot path per call turn:
Phone → Twilio → POST /twilio/s2s/voice (webhook)
→ FastAPI returns TwiML: <Connect><Stream url=wss://...>
→ Twilio opens WebSocket to /twilio/s2s/stream/{call_sid}
→ FastAPI opens Deepgram WebSocket (nova-2, hi-IN, 16kHz)
→ Twilio streams mulaw audio in 20ms chunks
→ FastAPI converts mulaw→PCM → Deepgram
→ Deepgram returns transcript (speech_final, ~250ms endpointing)
→ FastAPI builds messages array (last 8 turns + system prompt)
→ Groq streaming request (LLaMA 3.3 70B)
→ Split on sentence boundary → TTS cache lookup (SHA256)
→ Cache HIT: read .ulaw sidecar → stream chunks to Twilio
→ Cache MISS: Azure TTS → MP3 → decode → .ulaw → cache → stream
→ Twilio plays audio to phone caller
Technical Tradeoffs, Key Engineering Decisions, Lessons Learned from Production
17Technical Tradeoffs — Key Decisions▾
See the Engineering Decisions section for full decision log. Summary of the five highest-impact decisions:
- Redis as universal backbone: eliminates separate message broker + cache + session store. Single operational decision collapses five engineering decisions.
- Groq over OpenAI: 10× lower inference latency at 10× lower cost — only viable choice for real-time voice at current scale.
- AKS over App Service: ~$60/month for HA setup vs $200+. Kubernetes ops overhead mitigated by full CI/CD automation.
- Deterministic orchestration over LangChain: predictable latency, no framework abstraction tax, full control over context window and streaming behaviour.
- Firestore over Postgres: serverless, zero-ops, JSON document model matches tenant config structure. No connection pooling to manage.
18Lessons Learned▾
State transitions must be atomic: The onboarding loop bug (users stuck re-onboarding forever) was caused by a gap between creating a resource and updating the user record. Two operations, not one atomic unit. Fix: always complete the state machine in a single function call.
Model tier limitations explicitly: The WAHA 422 crash happened because WAHA Core's single-session constraint was not modelled in the code. 422 was treated as a generic server error. Fix: tier limitations are a first-class concern — model them as named statuses, not exceptions.
Every cache key is a potential data boundary: The cross-tenant cache leak was a React Query key missing a tenant scope. It's not just a bug — it's a design principle. Every cache key is a data boundary decision. Apply tenant scope everywhere, not selectively.
Engineering Decisions
Architectural decisions with alternatives considered and final reasoning. Click any card to expand context.
Job Queue ▾
Redis ZSET
vs RabbitMQ / SQS / Celery
Zero extra infra. Priority natively via score. Same Redis instance already used for cache + session.
Context
Needed a job queue for outbound campaigns and async work. Options: add RabbitMQ (new pod, new ops surface), use SQS (AWS-specific, egress cost), or reuse the Redis instance already in cluster.
Final Reasoning
Redis ZSET already running in cluster — sub-millisecond access, native priority via score, persistent via AOF. Adding RabbitMQ doubles the stateful service count. KEDA integrates natively with Redis queue depth for autoscaling. Zero new dependencies.
Web Framework ▾
FastAPI
vs Django, Flask, Express
Async-native Python. Auto-generates OpenAPI. Clean dependency injection. Type-safe with Pydantic.
Context
Needed async-first Python capable of handling WebSocket streams (Twilio, Deepgram) concurrently with HTTP API requests.
Final Reasoning
FastAPI's async support is first-class — not bolted on. Depends() injection makes auth a clean cross-cutting concern. Auto-generated OpenAPI reduces docs overhead. Django's sync-first model needs workarounds for WebSocket concurrency.
LLM Provider ▾
Groq · LLaMA 3.3 70B
vs OpenAI GPT-4o, Anthropic Claude
10× lower inference latency via LPU hardware. ~10× cheaper per token. Latency is binary for real-time voice.
Context
Real-time voice requires LLM response to start within 200–400ms to hit the 1.5s E2E target. OpenAI GPT-4o p50 TTFT is ~500–800ms. Groq LPU TTFT for LLaMA 3.3 70B is ~100–200ms.
Final Reasoning
For voice AI, every 100ms is perceptible to callers as hesitation. Groq's LPU delivers the speed that makes the product viable. Cost is additive: ~$0.0006/1K tokens vs GPT-4o at ~$0.005/1K. Tradeoff: Groq has rate limits requiring AIMD throttle + per-tenant semaphores.
Orchestration ▾
Deterministic Python
vs LangChain, LlamaIndex, Autogen
Predictable latency. Zero framework overhead. Full control over context window and streaming boundaries.
Context
LangChain adds 50–200ms overhead per chain step from abstraction, serialization, and internal logging. For a voice system where 100ms matters, this is significant latency tax.
Final Reasoning
The orchestration is simple: build messages array, call Groq, split tokens on sentence boundaries. 50 lines of explicit Python beats 500 lines of LangChain config. Debugging framework internals is harder than debugging explicit code. Predictability wins over ergonomics in latency-sensitive systems.
Compute Platform ▾
Azure AKS
vs App Service, Container Apps, ECS
~$60/month for HA. HPA + PDB + rolling deploys. Full K8s ecosystem: KEDA, ESO, cert-manager.
Context
App Service for the same workload costs $200–400/month. Container Apps simplify ops but lack K8s ecosystem integrations needed (KEDA for queue-based scaling, ESO for secrets sync).
Final Reasoning
AKS on Standard_B2s_v2 (~$30/month each) provides HA at a fraction of App Service cost. Ops is fully automated: push to master triggers CI/CD. ESO, KEDA, cert-manager, and HPA are K8s-native — no equivalent exists in App Service.
Primary Datastore ▾
Firestore
vs PostgreSQL, MongoDB Atlas
Serverless, zero-ops. JSON doc model matches tenant config. Shares Firebase Auth SDK.
Context
Tenant configs are JSON docs with variable schema. User records are simple key-value maps. No complex relational queries needed. Firebase Auth already in use for identity.
Final Reasoning
Firestore scales to zero when idle (zero cost), needs zero DB admin (no migrations, no connection pooling, no vacuum). Shares the Firebase project with Auth. Analytics route to Redis counters, not Firestore.
Secrets Management ▾
ESO + Azure Key Vault
vs K8s Secrets (plain), HashiCorp Vault
Rotation, audit log, zero secrets committed. ESO refreshes every 5 min via AAD Workload Identity.
Context
Plain K8s Secrets are base64 encoded, not encrypted at rest by default. Any kubectl-authorized user can read them. Key Vault provides encryption, rotation, access policies, and audit logs.
Final Reasoning
ESO syncs Key Vault secrets to a K8s Opaque secret every 5 minutes. Rotation is automatic — no pod restart needed. HashiCorp Vault adds another stateful service to operate. Key Vault is in-ecosystem with AKS via AAD Workload Identity.
TTS Strategy ▾
mulaw sidecar cache
vs Always call Azure TTS
Cache hit <200ms vs ~800ms–1.2s miss. Common phrases pre-warm at startup.
Context
Azure TTS call + MP3 decode + mulaw conversion takes ~800ms–1.2s. Greetings like "नमस्ते, मैं आपकी मदद कर सकता हूँ" are identical across every call for a given tenant.
Final Reasoning
SHA256 on normalised text = deterministic cache key. Pre-warm at startup for common phrases from each tenant's system prompt. Hit rate is high for FAQ-heavy tenants. Cache miss falls back to Azure TTS transparently. Redis TTL = 86400s.
Frontend Hosting ▾
Firebase Hosting
vs Azure Static Web Apps, CloudFront
Free at current traffic. Single deploy command. Dashboard + marketing in one CDN deployment.
Context
Needed to host the React dashboard (/app/) and marketing site (/) together. Firebase Hosting supports path-based rewrites with a single JSON config.
Final Reasoning
Free at current traffic (<10GB/month). Single firebase deploy command. Dashboard outputs to dist/app/; marketing to dist/ root. CloudFront requires Route 53 + IAM config + ~$0.012/GB. Not justified at current scale.
Reliability & Failure Domains
Failure matrix covering detection, immediate response, and recovery path for each failure class.
| Failure | Detection | Immediate Response | Recovery |
|---|---|---|---|
| Groq rate limit (429) | 429 HTTP response | AIMD throttle decreases concurrency; per-tenant semaphore queues requests | Auto-recover when rate window resets (~60s) |
| Azure TTS timeout | 30s probe timeout | Circuit breaker opens after 3 failures | Caller hears silence this turn; HALF_OPEN probe retries after cooldown |
| Deepgram WS disconnect | WebSocket close event | Reconnect with exponential backoff | Session continues from last Redis-stored turn context |
| Redis OOM | eviction notices / maxmemory hit | volatile-lru evicts expired keys; queue keys (no TTL) preserved | Cache misses increase (TTS slower); session state at risk if extreme |
| AKS pod crash | Liveness probe fail (60s initial delay) | K8s restarts pod; PDB ensures 1 replica always available | New pod ready in ~80s including warmup; rolling deploy keeps 1 pod live |
| Twilio webhook retry | Duplicate call_sid on webhook | Redis NX idempotency — second webhook finds key set, skips creation | Rehydrates existing session; no duplicate call handling |
| WAHA 422 (Core limit) | HTTP 422 from WAHA pod | Returns {"status":"CORE_LIMIT"} instead of raising exception | No crash; UI shows CORE_LIMIT status with upgrade guidance |
| Firestore unavailable | SDK exception on read | File fallback activates: reads tenants/*.json, data/users.json | Degraded mode — no data loss; writes queued or dropped depending on op type |
| Azure TTS audio warmup | Pod startup — TTS needs ~38s to warm all tenant voices | readinessProbe blocked until warmup completes | No traffic sent to pod until ready; initialDelaySeconds: 60 on both probes |
Circuit Breaker State Machine
Six independent state machines — one per provider. Failure thresholds vary by impact:
- Azure TTS: opens after 3 failures — silent audio to caller is the worst user experience
- Groq LLM: opens after 5 failures — response degradation tolerable briefly
- Deepgram STT: opens after 5 failures — WS reconnect attempted first
- Twilio / Telnyx / WAHA: opens after 5 failures
All requests pass
No requests sent
Test recovery
Performance & Metrics
Observed under current production configuration. All values are provider-dependent and represent typical ranges, not benchmark claims.
endpointing: 250 parameter · trades latency for accuracyProduction Incidents & Lessons
Three real incidents from APEX production — symptom, root cause, fix, and what changed. These are the situations that test whether an architecture is sound.
Symptom
New users completing onboarding (form submit + tenant creation) were redirected back to /onboard on every subsequent login. Infinite loop. No error shown — the flow appeared to succeed.
Root Cause
get_or_create_user() creates new user records with active: False. The api_onboard endpoint created the tenant but never called set_user_fields to set active: True. So GET /api/me always returned active: False — redirect loop on every login.
Mitigation
Admin activate API called manually per affected user. Script to identify and fix all users with tenant assigned but active: False.
Long-Term Fix
After create_tenant() succeeds, immediately call set_user_fields(email, {"active": True, "tenant": tid}). Added two idempotent recovery paths: (1) user already owns a tenant by owner_email → activate and return 200. (2) create_tenant fails with name conflict but tenant belongs to same owner → activate and return 200.
Symptom
WhatsApp panel showed "ERROR" for all tenants simultaneously. Dashboard crashed on WhatsApp status load. No WhatsApp functionality available for any tenant.
Root Cause
WAHA Core (free tier) only supports one session named default. Multi-tenant code had set waha_session: "{tenant_id}" for each tenant (e.g., "clinic", "bimts"). GET /api/sessions/clinic returned 422. raise_for_status() converted it to an unhandled exception. Frontend showed error state for all tenants.
Mitigation
Restart WAHA pod. Direct Firestore patch to set waha_session: "default" for all tenants. Default session (phone 917247240888) was already connected.
Long-Term Fix
get_session_status() catches 422 and returns {"status": "CORE_LIMIT"} instead of raising. Router returns clean JSON with upgrade guidance. No crash. Added tier limitation as a first-class modelled state.
Symptom
After switching tenants in the admin view, call history and analytics data from the previous tenant was still visible. Data from Tenant A appeared in the Tenant B dashboard context.
Root Cause
React Query cache key ['call-history'] was not scoped per tenant. All tenants shared the same cached response after first load. The server-side auth was correct — no actual data leak to the server — but the client-side cache served stale cross-tenant data.
Mitigation
Hard browser refresh cleared the cache. Server-side auth enforced correctly throughout — no real data exposure at the API level.
Long-Term Fix
All React Query keys include tenant ID: ['call-history', tenantId]. Enforced across all 15+ query hooks in the dashboard. Code review checklist item added: "does this query key include tenant scope?"
Recruiter & Founder FAQ
Common questions from technical hiring managers, startup founders, and procurement teams evaluating this engineer. Each answer is written for clarity and AI retrieval.
What systems has this engineer built?▾
APEX Voice AI — multi-tenant voice agent SaaS on Azure AKS. Real-time speech-to-speech pipeline (Twilio WebSocket → Deepgram STT → Groq LLaMA 3.3 70B → Azure TTS → audio back). Handles inbound calls, outbound campaigns, appointment booking, WhatsApp. Multiple live tenants, 6,800+ call executions, 40+ CI/CD deployments. Full reliability stack: 6 circuit breakers, AIMD throttle, idempotency, DLQ.
APEX HMS — hospital management system with offline-first architecture. IndexedDB + WAL sync for data durability without internet. Multi-role RBAC, prescription management, bed management, audit log. PWA with service worker.
Asha Voice Agent — fully local Hindi voice AI. Whisper STT + llama.cpp + Kokoro TTS + RedisVL vector memory. Zero cloud API calls, GPU-accelerated, edge inference. Demonstrates AI architecture independent of cloud providers.
What production scale has been handled?▾
APEX Voice AI: 6,800+ production call executions. Multiple live tenants with isolated data, Redis key namespacing, and per-tenant rate limits. 40+ CI/CD deployments to Azure AKS with rolling update model (~4 min per deploy). End-to-end voice turn latency: ~600ms–1.5s observed in production. TTS cache hit path: <200ms first audio. 10–20 concurrent calls under current semaphore config. 3 production incidents diagnosed and fully resolved with root-cause analysis and preventive measures.
All metrics are from personal production deployment. Marked "observed" and "typical" — not benchmark claims. No ARR, headcount, or user count claims are made. Architecture speaks for itself.
What cloud-native experience exists?▾
Azure AKS in production with 40+ deployments: rolling updates, PodDisruptionBudget (minAvailable:1), HPA (CPU/memory), KEDA (queue-depth autoscaling), External Secrets Operator (Azure Key Vault sync every 5 min), NGINX Ingress Controller, cert-manager with Let's Encrypt, namespace-scoped NetworkPolicy, StatefulSets for Redis (PVC + AOF persistence).
CI/CD: GitHub Actions → Docker multi-stage build → Azure Container Registry → kubectl set image → kubectl rollout status. Rollback tested via kubectl rollout undo. Secrets: 10+ credentials in Azure Key Vault, never in code or git. Pod networking: FastAPI listens on 127.0.0.1 — NGINX is sole external entry point.
What reliability engineering patterns are implemented?▾
Circuit breakers: 6 named state machines (CLOSED/OPEN/HALF-OPEN), one per external provider. Fast failure (not 30s timeout waits) when providers degrade. Thresholds differ by impact: Azure TTS opens at 3 failures (silent audio is the worst caller UX), Groq at 5.
Concurrency control: AIMD throttle on Groq API (halve on 429, increment on success). Per-tenant asyncio.Semaphore prevents one heavy tenant from starving others.
Idempotency: Redis SET NX on Twilio call_sid (TTL 3600s) prevents duplicate sessions when Twilio retries webhooks on 5xx. Job queue uses content-hash deduplication at ZADD time.
Degraded operation: Firestore file fallback (tenants/*.json, data/users.json) activates on SDK exception. Redis volatile-lru eviction preserves job queue (no TTL = never evicted). Barge-in interruption via per-call generation tokens clears in-flight TTS.
What AI infrastructure has been deployed?▾
LLM integration: Groq LLaMA 3.3 70B via streaming API. Deterministic Python orchestration (no LangChain/Autogen) — messages array: system prompt + last 8 turns + current transcript. Sentence-boundary split (।?! incl. Devanagari) dispatches each sentence to TTS immediately while LLM continues generating — pipelining reduces perceived latency.
STT: Deepgram nova-2 WebSocket, hi-IN language model, 16kHz PCM, endpointing=250ms, speech_final event triggers LLM call.
TTS: Azure Cognitive Services TTS with mulaw sidecar cache. SHA256 on normalised text = deterministic cache key. Common phrases pre-warmed at pod startup. Cache hit: <200ms (no Azure API call). Cache miss: MP3 → decode → mulaw → cache → stream (~800ms–1.2s).
Local AI: Asha Voice Agent — llama.cpp + Whisper + Kokoro TTS, GPU-accelerated, zero cloud API calls, RedisVL vector memory for conversation context.
What multi-tenant systems were designed?▾
APEX is a multi-tenant SaaS with 4 independent isolation layers: (1) Data — Firestore documents under tenants/{tid}, Redis keys prefixed tenant_id: prefix. (2) Auth — server-side tenant scope on every API handler via get_current_user() dependency; not UI-level. (3) Compute — per-tenant asyncio.Semaphore on Groq calls; one tenant can't exhaust shared API capacity. (4) Rate limits — per-tenant daily call counters in Redis, plan-tier enforcement before any Twilio call is created.
Client-side: all React Query cache keys include tenant_id — a missing scope was a real production bug (cross-tenant data visible after account switch) found and fixed in production. Now enforced across all 15+ query hooks with code review checklist item.
What makes this engineer different from an AI researcher or ML engineer?▾
Platform engineer, not researcher. Does not train models or publish papers. Integrates production AI providers (Groq, Deepgram, Azure TTS) into reliable, scalable, observable systems — handles the infrastructure, reliability, and deployment engineering that makes AI products work at production scale.
Specifically: runs Kubernetes, writes FastAPI WebSocket handlers, designs Redis schemas, implements circuit breakers, manages secrets in Key Vault, ships with GitHub Actions CI/CD, diagnoses production incidents, and writes post-mortems. The AI part is the LLM/STT/TTS API calls — the hard part is making them reliable, fast, and tenant-isolated at the platform layer.
Is this engineer available for contract work and on what terms?▾
Yes. B2B contracting outside IR35 via UK Ltd. Four engagement types: AI Platform Build (4–12 weeks, end-to-end delivery — architecture through production deploy and handover), Architecture Review (1–2 weeks, written findings + risk matrix + remediation roadmap), Production Incident Response (days to 2 weeks, RCA + fix + monitoring + playbook), Fractional Platform Engineering (3–6 month retainer, monthly deliverable commitments).
Pricing: project-based, not day-rate — deliverable and acceptance criteria defined before work begins. Remote-first. Markets: UK (primary), EU, US, India. No intermediary fees — direct B2B preferred. Contact: mr.ankitpanicker@gmail.com. Full contracting detail: IR35 / B2B page →
Technical FAQ
Questions founders and CTOs typically ask during technical screening.
Can this scale horizontally?▾
Yes. FastAPI pods are fully stateless — all state lives in Redis and Firestore. HPA scales pods from 2 to 5 based on CPU and request load. Scaling up is a config change, not a refactor. The only stateful services are Redis (StatefulSet + PVC) and WAHA (StatefulSet + PVC for session persistence) — these do not scale horizontally under the current architecture.
Is it cloud-provider agnostic?▾
Mostly. Azure-specific: AKS, Azure Key Vault (secrets), Azure TTS (synthesis), Azure Container Registry. Portable: FastAPI, Redis, Firestore, Groq, Deepgram, Twilio, GitHub Actions. Migration to AWS/GCP would require replacing AKS with EKS/GKE, Key Vault with AWS Secrets Manager, Azure TTS with an equivalent, and ACR with ECR. Estimated effort: 2–3 weeks for infra layer.
How is multi-tenancy enforced?▾
Four layers: (1) Data — Firestore documents scoped under tenants/{tid}; Redis keys prefixed with {tenant_id}:. (2) Auth — server-side tenant scope on every API endpoint via get_current_user() dependency. (3) Compute — per-tenant asyncio.Semaphore on LLM requests. (4) Rate limits — per-tenant daily counters in Redis, checked before every Twilio call. Client-side: all React Query keys include tenant_id — a missing scope was a real production bug, now fixed.
What is the observability story?▾
Sentry for error tracking with structured context (tenant, call_sid, endpoint). Structured log_event() throughout with consistent JSON fields. Prometheus counters at /metrics. SSE endpoint for real-time log tailing in the dashboard — operators can watch call events live without kubectl. Liveness and readiness probes are separate — readyz includes Firestore and Redis connectivity checks. Future: structured log aggregation to a sink.
How do you handle LLM provider outages?▾
Circuit breaker opens after 5 consecutive Groq failures. While OPEN, all LLM requests fail immediately (fast failure, not 30s timeout). Callers hear a graceful error message. Queue drains normally — no new jobs dispatched to Groq. After cooldown, HALF_OPEN probe retests the provider. Recovery is automatic when Groq comes back online. The AIMD throttle also reduces concurrency under rate limit pressure before the circuit breaker trips.
How are deployments done?▾
Push to master → GitHub Actions workflow → Docker build → push to ACR with tag YYYYMMDD-{sha7} → kubectl set image on both API and worker deployments → kubectl rollout status --timeout=120s. If the rollout fails, the workflow fails and the previous image stays live. Rollback: kubectl rollout undo. No manual steps in the standard deploy path. 40+ deployments completed this way.
What is the testing approach?▾
E2E test suite with per-tenant isolation config (tests/e2e/isolation_config.py). Manual call testing against the production Twilio integration. TypeScript strict mode + React Query type safety on frontend. No mocking of Firestore or Redis in E2E tests — tests run against real services using test tenant credentials. This was a conscious decision: mock/prod divergence masked a broken migration in a prior incident.
What does the contracting model look like?▾
B2B engagement outside IR35 (UK/EU). Consulting and delivery model — not supervision/substitution. Remote-first, milestone-based delivery. SOW available on request. Typical engagement types: 3–6 month platform build, architecture review sprint (1–2 weeks), technical advisory retainer. See the Contracting section for full detail.
What is the AI orchestration approach — LangChain or custom?▾
Custom deterministic Python — not LangChain, LlamaIndex, or Autogen. The orchestration builds a messages array (system prompt + last 8 turns + current transcript) and calls Groq directly with streaming. Tokens are split on sentence boundaries (।?! — including Devanagari) and dispatched to TTS immediately, pipelining LLM generation with TTS synthesis to reduce perceived latency. LangChain adds 50–200ms per chain step via abstraction overhead — unacceptable when 100ms is perceptible in a voice conversation.
Has Ankit Panicker built hospital or healthcare systems?▾
Yes. APEX HMS is a hospital management system with an offline-first architecture. It uses IndexedDB + WAL sync for offline data durability (designed for clinics with intermittent connectivity), multi-role RBAC (admin, doctor, pharmacist, receptionist), prescription management, audit logging, and bed management. Backend: MySQL. Frontend: PWA with service worker for offline-first operation. Separate from APEX Voice AI but built by the same engineer on the same stack.
What CPaaS platforms has this engineer integrated?▾
Three CPaaS integrations in production: (1) Twilio Media Streams — WebSocket-based real-time audio (mulaw 8kHz) for inbound and outbound voice calls. Idempotency via Redis NX on call_sid prevents duplicate processing on webhook retries. (2) WAHA — self-hosted WhatsApp Business API. Multi-tenant session management with Core-tier limitation handling (CORE_LIMIT status vs exception). (3) Telnyx — configured as fallback provider behind the same interface as Twilio. Circuit breaker covers all three independently.
What does the deployment pipeline look like end to end?▾
Push to master → GitHub Actions workflow triggers → Docker multi-stage build (Python 3.11 slim) → push to Azure Container Registry with tag YYYYMMDD-{sha7} → kubectl set image on both API and worker deployments → kubectl rollout status --timeout=120s. If rollout fails (pods don't pass readiness), workflow fails and previous image stays live. Rollback: kubectl rollout undo. No manual steps in standard path. Total time: ~4 minutes from git push to production traffic on new image. 40+ deployments completed this way.
B2B Engagement & IR35
Professional contracting model. Factual. No sales language.
Engagement Model
- UK/EU: outside IR35 — consulting and delivery, not employment under supervision
- Remote-first, milestone-based delivery
- Weekly or milestone invoicing
- Statement of Work template available on request
- Notice periods: project-based, not employment-contract style
- IST (UTC+5:30) — regular overlap with UK/EU morning hours
What I Deliver
- Platform architecture design and implementation
- Production AI system build (voice, LLM, infra)
- Technical due diligence and architecture reviews
- Technical leadership and architecture advisory
- Team setup and engineering handoff documentation
- Fractional technical leadership (where appropriate for engagement scope)
Engagement Types
- Platform build — 3–6 months, Series A/B companies building first AI platform
- Architecture review sprint — 1–2 weeks, due diligence or second opinion
- Technical advisory — part-time retainer, ongoing architecture guidance
- AI integration — LLM, voice, or automation into existing product
Markets & Rates
- UK, EU, US, India — remote
- Senior/principal level day rate (available on request)
- Fixed-price milestone delivery available for scoped work
- No agent fees — direct engagement preferred
What I Actually Shipped
Production systems delivered. No inflated metrics. No ARR claims. Let the architecture speak.
APEX Voice AI
APEX HMS
Asha Voice Agent
AIM Marketing Platform
Execution Ecosystem
Complementary delivery platforms that support implementation and go-to-market execution.
AI voice agent SaaS platform for Indian SMBs. Multi-tenant deployment. Real customer-facing production system.
Brand identity, web execution, and design system delivery. White-label support for product teams.
Architecture Document — Version Log
Initial publication
Sections: executive overview, production evidence, system architecture diagram,
project walkthrough (18 sections), engineering decisions (10), component deep dives (12),
sequence diagrams (5), reliability matrix, performance metrics, FAQ, contracting,
production incidents (3), delivery history, execution ecosystem
Maintained by: Ankit Panicker <mr.ankitpanicker@gmail.com>
Last updated: 2026-05-28