Production System · Architecture v1.0

Production AI & Platform Systems Engineer

Builds and operates production AI systems: real-time voice AI pipelines, LLM integration backends, multi-tenant SaaS on Kubernetes/AKS, CPaaS integrations (Twilio, WAHA), and Redis-backed job queue architectures. Not a researcher. An engineer who ships.

6,800+ call executions 40+ CI/CD deployments multi-tenant SaaS in production Azure AKS · FastAPI · Redis

FastAPI · Azure AKS · Groq LLaMA 3.3 70B · Deepgram · Redis · Firestore · Twilio · WAHA · GitHub Actions

Quick Evaluation
Core stackFastAPI (Python 3.11) · Redis · Firestore · React 18
CloudAzure (AKS, Key Vault, Azure TTS, ACR)
DeploymentGitHub Actions → ACR → kubectl rolling update (40+ deployments completed)
OrchestrationCustom Python — deterministic, not LangChain or Autogen
ContractingB2B / IR35-outside / remote / UK–EU–US–India
TimezoneIST (UTC+5:30) — overlap with UK/EU mornings
EngagementPlatform build · Architecture advisory · Fractional technical leadership
Productionaimmarketing.in — live, real customers, real calls

Ankit Panicker — AI Platform & Systems Engineer

AI platform engineer based in the UK. Builds and operates production AI systems: real-time voice AI pipelines, LLM integration backends, multi-tenant SaaS on Kubernetes/AKS, CPaaS integrations (Twilio, WAHA/WhatsApp), Redis-backed queue architectures, and offline-first systems. Available for B2B contracting outside IR35 via UK Ltd. Not a researcher — production engineer with observable reliability track record.

Systems built and operated

  • Multi-tenant AI voice platform — inbound + outbound calls, Hindi language, appointment booking, WhatsApp automation
  • CPaaS integrations — Twilio Media Streams WebSocket (mulaw 8kHz), WAHA WhatsApp Business API, Telnyx (fallback provider)
  • Offline-first hospital management system — PWA, IndexedDB + WAL sync for durability, multi-role RBAC (admin/doctor/pharmacist/receptionist), prescription management, audit log
  • Fully local Hindi voice AI — Whisper STT + llama.cpp + Kokoro TTS + RedisVL vector memory, zero cloud API calls, GPU-accelerated
  • Redis-backed ZSET job queue — KEDA autoscaling, AOF persistence, DLQ, priority scoring, dead-letter inspection
  • Azure AKS Kubernetes cluster — HPA, PDB, KEDA, External Secrets Operator, NGINX Ingress, cert-manager, namespace-scoped NetworkPolicy
  • LLM orchestration pipeline — Groq LLaMA 3.3 70B, deterministic Python (no LangChain), streaming, sentence-boundary TTS dispatch, AIMD throttle, circuit breakers
  • Observability — Sentry error tracking, structured log_event() with JSON fields, Prometheus /metrics, SSE real-time log tail, separate liveness/readiness probes
  • Reliability engineering — 6 named circuit breakers (CLOSED/OPEN/HALF-OPEN), Redis NX idempotency, Firestore file fallback, volatile-lru queue preservation, barge-in interruption

Production evidence

  • 6,800+ production call executions handled
  • 40+ CI/CD deployments to Azure AKS
  • ~1s typical E2E voice turn latency (600ms–1.5s range, observed)
  • <200ms first audio on TTS cache hit (SHA256 mulaw sidecar)
  • 3 production incidents: diagnosed, root-caused, fixed, post-mortemed
  • 6 circuit breakers covering all external providers independently
  • 10+ secrets managed via Azure Key Vault + ESO (rotate without pod restart)
  • Cross-tenant cache leak found and fixed in production (React Query key missing tenant scope)
  • Live at aimmarketing.in — real tenants, real calls, real customers

Engagement & contact

RoleAI Platform Engineer
LocationUnited Kingdom (remote-first)
ContractingOutside IR35 · UK Ltd · B2B
MarketsUK · EU · US · India
AvailabilityEngagement types →
Contactmr.ankitpanicker@gmail.com
Core Stack
FastAPI · Redis · AKS · Firestore
Architecture Style
Deterministic Python orchestration
Infra Model
Containerised + HPA autoscaling
Deployment
GH Actions → ACR → AKS rolling
Reliability
6 circuit breakers · AIMD throttle
Engagement
B2B / IR35-outside / remote

Production Evidence

Architecture descriptions are cheap. This section documents what was actually built, deployed, and operated — with answers to "how do you know?"

Build
Deploy
Observe
Incident
Mitigate
Automate
Redeploy
Production Validation — APEX Voice AI

40+ CI/CD deployments completed

  • GitHub Actions → ACR Docker build → kubectl set image → rollout status validation
  • Typical deployment window: ~4 minutes end-to-end
  • Rollback model tested: kubectl rollout undo
  • Zero manual intervention required for standard deploy

Evidence: git log shows 40+ commits tagged with deploy SHA; GitHub Actions run history

Multi-tenant isolation in production

  • Per-tenant Redis key namespacing: rate:{tenant_id}, llm_cache:{tenant_id}:{hash}
  • Server-side tenant scope on every API endpoint — not just UI-level
  • Cross-tenant React Query cache bug found and fixed in production (missing tenant_id in query key)

Evidence: code — app/api/dashboard.py tenant scope guards; app/services/admin_store.py Firestore paths

Live telephony integration

  • Twilio WebSocket media streams — inbound and outbound calls handled in production
  • WhatsApp (WAHA Core, default session, connected phone 917247240888)
  • Telnyx fallback configured and tested
  • Real customers using real calls — not a demo or sandbox environment

Evidence: live at aimmarketing.in; call logs in Firestore; Twilio console records

Circuit breaker coverage exercised under real failures

  • Groq 429 rate limit → AIMD throttle triggered; per-tenant semaphore queue confirmed functional
  • WAHA 422 → CORE_LIMIT graceful handling deployed after real production crash
  • Firestore fallback activated and verified in dev environment

Evidence: git history — commit "fix(waha): graceful CORE_LIMIT handling for multi-session 422"

Observed latency documented under production load

  • E2E speech turn: ~600ms–1.5s (STT endpointing 250ms + Groq LPU ~200–400ms + TTS stream start)
  • TTS cache path: <200ms first audio observed (mulaw sidecar — no Azure API call)
  • Redis round-trips: sub-millisecond (in-cluster, same K8s namespace)

Provider-dependent; measured under current production configuration — not a benchmark claim

Three production incidents diagnosed and fixed

  • Onboarding activation loop — root cause: state transition gap in user activation
  • WAHA 422 cascade — root cause: WAHA tier limitation unhandled, crash on every WhatsApp status load
  • Cross-tenant query cache leak — root cause: missing tenant scope in React Query key

Full incident post-mortems in Production Incidents section

Executive Overview

What the system is, what problem it solves, and how it's built.

<200ms
First audio on cache hit
Observed · mulaw sidecar, no TTS API call
~1s
Typical E2E speech turn latency
600ms–1.5s range · STT + LLM + TTS combined
6
Named circuit breakers
One per external provider — independent failure isolation
10+
Externally managed K8s secrets
Azure Key Vault + ESO · zero secrets in application code
40+
CI/CD deployments shipped
Rolling update · ~4 min window · rollback-safe
2–5
AKS nodes under HPA
Standard_B2s_v2 · scales on CPU/request load

What is APEX Voice AI?

APEX Voice AI is a multi-tenant AI voice agent SaaS platform. A business subscribes, creates their tenant configuration (clinic, call center, hospital), and immediately gets an AI agent that answers inbound calls in Hindi, books appointments, runs outbound campaigns, and handles WhatsApp — without hiring additional staff. Each tenant is fully isolated: their configuration, their call data, their rate limits, their Redis key namespace.

Business problem solved

Indian SMBs — clinics, hospitals, retail stores, financial services — receive high inbound call volumes with repetitive queries (appointment booking, order status, store timings). The cost of staffing a call center is prohibitive. APEX replaces the first-touch human interaction with an AI agent that speaks natural Hindi, understands context, and hands off to a human when needed.

Engineering scope

Full-stack implementation from scratch: FastAPI backend on Azure AKS (2 replicas, HPA, PDB), real-time WebSocket telephony with Twilio, Deepgram STT, Groq LLM, Azure TTS with a custom mulaw sidecar cache, Redis as a universal backbone (queue + session + cache + rate limiter + idempotency), Firebase Auth + Firestore, WAHA WhatsApp, GitHub Actions CI/CD, External Secrets Operator, React 18 dashboard deployed on Firebase Hosting.

Architecture in one paragraph

Stateless FastAPI pods on AKS sit behind an NGINX Ingress. Twilio opens a WebSocket media stream per call; the pod connects a Deepgram WebSocket and an asyncio pipeline that transcribes → generates with Groq → synthesises with Azure TTS → streams audio back. All state lives in Redis (session, queue, cache) or Firestore (tenant config, user records). Workers pull from a Redis ZSET priority queue for async jobs. Six circuit breakers guard all external providers. The entire system deploys in ~4 minutes from a git push.

System Architecture

Interactive diagram — click any node for component details. Animated signal flow shows request path. Full technical walkthrough below ↓

Phone PSTN / Twilio WhatsApp WAHA / Business API Twilio / WAHA WebSocket media stream NGINX Ingress TLS · cert-manager · Let's Encrypt FastAPI (Azure AKS) 2 replicas · HPA · PDB · Python 3.11 Redis Backbone Queue · Session · Cache · Rate · NX ↓ Worker Pool (asyncio + per-tenant semaphores) ↓ Deepgram STT nova-2 · hi-IN · 16kHz Groq LLM LLaMA 3.3 70B · AIMD throttle Azure TTS + mulaw sidecar cache ↓ Persistence & State ↓ Redis Cache Session · TTS cache Firestore Tenants · Users · Config Azure Services Key Vault · ACR · Blob Response → Twilio → Phone mulaw audio stream · real-time Caller hears audio
Click any node to see component details →
Live Pipeline
Real-Time Speech-to-Speech Pipeline
Click any node to explore details below. Simulate a call turn to watch data flow.
📞
Caller
mulaw 8kHz
📡
Twilio
Media Streams WS
FastAPI
AKS · Python 3.11
🎤
Deepgram
nova-2 · hi-IN
🧠
Groq LLM
LLaMA 3.3 70B
🔊
Azure TTS
+ Redis cache
🔈
Response
mulaw → caller
Ready — click to animate pipeline

Project Walkthrough

Six guided sections — expand any row for technical depth. Use the tabs to navigate directly to the layer you need.

Problem Statement and System Goals — APEX Voice AI

01Problem Statement & System Goals

Indian SMBs — clinics, hospitals, call centers — receive high inbound call volumes with repetitive, structured queries. The cost of staffing human agents is prohibitive. The goal: an AI voice agent that answers calls in natural Hindi, understands context, executes tasks (booking, lookup, escalation), and hands off to humans when needed.

System goals:

  • Sub-1.5s perceived turn latency (speech-to-speech) under production configuration
  • Full multi-tenancy — one deployment, many isolated business customers
  • Zero secrets in code — all credentials via Azure Key Vault
  • Automatic recovery from external provider failures (circuit breakers)
  • Deployable by one engineer in under 5 minutes from a git push

Backend Architecture, Frontend Architecture, Database Model, Queue System — APEX Voice AI

02Functional Architecture

From a user perspective: a business configures their tenant (system prompt, business hours, agent persona, appointment types). Their phone number is pointed at a Twilio webhook. When a customer calls, APEX answers, transcribes speech in real-time, generates a context-aware reply with Groq LLaMA, synthesises audio with Azure TTS, and streams it back — in under 1.5 seconds per turn.

Outbound campaigns: a CSV of phone numbers is uploaded. APEX dials each number, delivers a message, records responses, and logs outcomes in Firestore. WhatsApp follows the same pipeline via WAHA.

03Backend Architecture

FastAPI on Python 3.11 — async-first, auto-generates OpenAPI, uses Depends() for auth injection into every handler. Two API pod replicas on AKS with an HPA that scales 2–5 based on CPU. PodDisruptionBudget ensures 1 pod always available during rolling deploys.

Worker pods handle background jobs: outbound campaigns, batch TTS pre-warm, knowledge base indexing. Workers pull from a Redis ZSET priority queue. KEDA ScaledObject watches queue depth for worker autoscaling.

# Dependency injection pattern — auth verified on every handler
@router.post("/api/v1/campaign")
async def create_campaign(
    req: Request,
    user: dict = Depends(get_current_user),  # Firebase JWT → user record
    _admin = Depends(require_admin),
):
    tenant = user["tenant"]
    # All downstream ops scoped to this tenant
        
04Frontend Architecture

React 18 + TypeScript + Vite + TailwindCSS + React Query + Zustand + Framer Motion. Deployed on Firebase Hosting alongside the marketing site via a unified build pipeline (npm run build → Vite → sync-marketing.mjs → Firebase deploy).

useAuthStore (Zustand): single source of truth for {user, role, tenant, active}. All React Query keys include tenant to prevent cross-tenant cache leaks — this was a real production bug found and fixed. All pages are lazy-loaded via React.lazy + Suspense; main bundle reduced from 2007KB to 253KB with rollupOptions.manualChunks.

05Database Model — Firestore & Redis Schemas

Firestore: Document-model, serverless, zero-ops. Two root collections:

users/{email}:
  active: bool, tenant: str, role: str, plan: str

tenants/{id}:
  name: str, plan: str, owner_email: str,
  system_prompt: str, waha_session: str,
  rate_limit: int, voice: str, language: str
        

Redis key patterns:

rate:{tenant_id}           # daily call counter, TTL = end of day
session:{call_sid}:{tid}   # active call state, TTL = 3600s
llm_cache:{tid}:{sha256}   # LLM response cache, TTL = 86400s
idempotency:{call_sid}     # NX key, TTL = 3600s
job:{jid}                  # ZSET member with priority score
        

File fallback: tenants/*.json and data/users.json activate automatically if Firestore is unreachable. No data loss — degraded mode only.

06Queue System — Redis ZSET + KEDA

Jobs are scored ZSET members. Priority = score (lower score = higher priority). Workers call ZPOPMIN to claim the highest-priority job. Dead-letter: jobs that fail N times are moved to dlq:{tenant} ZSET for manual inspection.

KEDA ScaledObject watches Redis queue depth. When queue depth exceeds threshold, KEDA signals the K8s HPA to scale up worker pods. Workers scale to zero when queue is empty — no idle compute cost.

# Job dispatch
await redis.zadd("jobs", {job_id: priority_score})
# Worker poll
job_id = await redis.zpopmin("jobs", 1)
        

AI Orchestration — LLM Pipeline, Memory Management, Redis TTL

07AI Orchestration — LLM Context & Pipeline

Deterministic Python orchestration — no LangChain, no Autogen. The orchestration is a small function that constructs the messages array: system prompt (tenant-specific) + last 8 conversation turns (sliding window) + current user transcript. This predictable structure gives consistent latency and no framework abstraction overhead.

Groq LLaMA 3.3 70B returns streaming tokens. FastAPI accumulates tokens and splits on sentence boundaries (। ? ! . for Hindi/English). Each complete sentence is passed to the TTS pipeline immediately — pipelining TTS generation while LLM continues — reducing perceived latency.

AIMD concurrency control: global semaphore caps total concurrent Groq calls to match API limits. If concurrency exceeds limit, new requests back off (multiplicative decrease). On success, limit increases additively. Per-tenant semaphore ensures no single tenant can starve others.

08Memory Management — Redis TTL & Eviction

Redis is configured with maxmemory-policy volatile-lru: only keys with TTL set are eligible for eviction. Session state and LLM cache keys have TTL; idempotency keys have TTL; rate counters have TTL aligned to daily boundaries. Queue ZSET members (no TTL) are never evicted — they are preserved even under memory pressure.

AOF persistence (appendfsync everysec): queue state survives pod restarts. At startup, workers resume processing from where they left off. Session state (TTL 3600s) is acceptable to lose on restart — calls already in-flight will reconnect.

Authentication, Multi-Tenancy Isolation, Security, Observability, Idempotency and Retry Handling

09Authentication Flow — Firebase Auth + RBAC

Firebase Auth issues JWTs on login. Backend get_current_user() verifies the JWT, extracts the email, and calls get_or_create_user(email) to fetch the user record from Firestore. The record contains {email, tenant, role, active}.

RBAC: role: "admin" can see all tenants; role: "client" is scoped to user.tenant only. RequireAdmin and require_admin guards are applied at the handler level. Admin emails are configurable via env var — auto-promoted at login time.

Machine-to-machine: X-API-Key header (APEX_API_AUTH_KEY from apex-secrets) for admin operations that don't have a Firebase token context.

10Multi-Tenancy — Isolation Model

Four layers of tenant isolation:

  • Data: Firestore documents scoped under tenants/{tid}. Redis keys prefixed with {tenant_id}:. No query can access another tenant's data structurally.
  • Auth: Every API handler receives user.tenant from get_current_user(). All downstream queries are scoped to this value. Server-side enforcement — not UI-level.
  • Compute: Per-tenant asyncio.Semaphore on LLM requests. One heavy tenant can't starve others of Groq API capacity.
  • Rate limits: Per-tenant daily call counters in Redis. Plan tiers (trial/spark/blaze) each have different limits checked before any Twilio call is created.

Client-side: React Query cache keys include tenant — a missing tenant scope was found and fixed in production (cross-tenant data visible to wrong user).

11Security — Secrets, Network, Auth Layers

Secrets: 10+ secrets stored in Azure Key Vault. External Secrets Operator (ESO) syncs them into a single K8s apex-secrets Opaque secret every 5 minutes. No secrets in environment files, no secrets in code, no secrets in git history. Firebase service account mounted as a volume from a separate K8s secret.

Network: FastAPI pods listen on 127.0.0.1 — NGINX is the only external entry point. All inter-service traffic is cluster-internal. TLS terminated at the ingress layer only.

Auth depth: (1) Firebase JWT on every user request. (2) Server-side tenant scope on every data access. (3) Admin key for machine-to-machine operations. (4) Kubernetes RBAC for cluster access. (5) Azure RBAC for Key Vault access.

12Observability — Logs, Metrics, Tracing

Error tracking: Sentry DSN injected at runtime from apex-secrets. All unhandled exceptions, circuit breaker state changes, and critical path failures are captured with context (tenant, call_sid, endpoint).

Structured logging: log_event() throughout the codebase — consistent JSON fields: event, tenant, call_sid, duration_ms, error. Aggregated via kubectl logs; future migration path to structured log sink.

Metrics: Prometheus counters at /metrics (in-process). Real-time log tail via SSE endpoint in the dashboard — operators can watch call events live without SSH.

Synthetic health: /healthz (liveness) and /readyz (readiness) endpoints — separate concerns. readyz includes Firestore and Redis connectivity checks. Pods are not marked ready until Azure TTS warmup completes (~38s at startup).

13Retry Handling — Idempotency & Deduplication

Twilio retries webhook delivery on 5xx responses. Without idempotency, a transient API error creates duplicate call sessions. Solution: SET idempotency:{call_sid} 1 NX EX 3600 — the NX flag means only the first request succeeds. Subsequent retries find the key set, skip session creation, and rehydrate the existing session from Redis.

Job queue idempotency: jobs include a content hash as part of their ZSET key. Duplicate job submissions are deduplicated at the ZADD stage. Dead-letter jobs include retry count and original error for diagnosis.

Failure Recovery, Circuit Breakers, Deployment Lifecycle CI/CD, Real-Time Pipeline Hot Path

14Failure Recovery — Circuit Breakers

Six named circuit breakers (Groq, Azure TTS, Deepgram, Twilio, Telnyx, WAHA). Each is an independent state machine: CLOSED → OPEN (after N failures) → HALF_OPEN (after timeout) → CLOSED (if probe succeeds) or OPEN (if probe fails).

Thresholds differ by provider: Azure TTS opens after 3 failures (silent audio to caller is worse than a slow response). Groq opens after 5 failures. When OPEN, requests fail immediately (no 30s timeout wait) — fast failure is the primary value of circuit breaking, not recovery.

Barge-in: TTS send tasks check a per-call generation token at every audio chunk boundary. When a caller speaks mid-response, the token is invalidated — in-flight chunks abort themselves and a clear event is sent to Twilio to stop playback.

15Deployment Lifecycle — CI/CD & Rollback

GitHub Actions (.github/workflows/cd-aks.yml) triggers on every push to master:

  1. Build Docker image with tag YYYYMMDD-{sha7}
  2. Push to ACR (apexacrprod.azurecr.io) — both tagged and :latest
  3. kubectl set image deployment/apex-api api={image} -n apex
  4. kubectl set image deployment/apex-worker worker={image} -n apex
  5. kubectl rollout status --timeout=120s — fails fast if pods don't come up

Total: ~4 minutes from git push to production traffic on new image. Rollback: kubectl rollout undo deployment/apex-api -n apex. Probe timings: initialDelaySeconds: 60 on both liveness and readiness — Azure TTS warmup for all tenants takes ~38s at startup.

16Real-Time S2S Pipeline — Hot Path Detail

The speech-to-speech hot path per call turn:

Phone → Twilio → POST /twilio/s2s/voice (webhook)
  → FastAPI returns TwiML: <Connect><Stream url=wss://...>
  → Twilio opens WebSocket to /twilio/s2s/stream/{call_sid}
  → FastAPI opens Deepgram WebSocket (nova-2, hi-IN, 16kHz)
  → Twilio streams mulaw audio in 20ms chunks
  → FastAPI converts mulaw→PCM → Deepgram
  → Deepgram returns transcript (speech_final, ~250ms endpointing)
  → FastAPI builds messages array (last 8 turns + system prompt)
  → Groq streaming request (LLaMA 3.3 70B)
  → Split on sentence boundary → TTS cache lookup (SHA256)
  → Cache HIT: read .ulaw sidecar → stream chunks to Twilio
  → Cache MISS: Azure TTS → MP3 → decode → .ulaw → cache → stream
  → Twilio plays audio to phone caller
        

Technical Tradeoffs, Key Engineering Decisions, Lessons Learned from Production

17Technical Tradeoffs — Key Decisions

See the Engineering Decisions section for full decision log. Summary of the five highest-impact decisions:

  • Redis as universal backbone: eliminates separate message broker + cache + session store. Single operational decision collapses five engineering decisions.
  • Groq over OpenAI: 10× lower inference latency at 10× lower cost — only viable choice for real-time voice at current scale.
  • AKS over App Service: ~$60/month for HA setup vs $200+. Kubernetes ops overhead mitigated by full CI/CD automation.
  • Deterministic orchestration over LangChain: predictable latency, no framework abstraction tax, full control over context window and streaming behaviour.
  • Firestore over Postgres: serverless, zero-ops, JSON document model matches tenant config structure. No connection pooling to manage.
18Lessons Learned

State transitions must be atomic: The onboarding loop bug (users stuck re-onboarding forever) was caused by a gap between creating a resource and updating the user record. Two operations, not one atomic unit. Fix: always complete the state machine in a single function call.

Model tier limitations explicitly: The WAHA 422 crash happened because WAHA Core's single-session constraint was not modelled in the code. 422 was treated as a generic server error. Fix: tier limitations are a first-class concern — model them as named statuses, not exceptions.

Every cache key is a potential data boundary: The cross-tenant cache leak was a React Query key missing a tenant scope. It's not just a bug — it's a design principle. Every cache key is a data boundary decision. Apply tenant scope everywhere, not selectively.

Engineering Decisions

Architectural decisions with alternatives considered and final reasoning. Click any card to expand context.

Job Queue
Redis ZSET
vs RabbitMQ / SQS / Celery
Zero extra infra. Priority natively via score. Same Redis instance already used for cache + session.
Context

Needed a job queue for outbound campaigns and async work. Options: add RabbitMQ (new pod, new ops surface), use SQS (AWS-specific, egress cost), or reuse the Redis instance already in cluster.

Final Reasoning

Redis ZSET already running in cluster — sub-millisecond access, native priority via score, persistent via AOF. Adding RabbitMQ doubles the stateful service count. KEDA integrates natively with Redis queue depth for autoscaling. Zero new dependencies.

Web Framework
FastAPI
vs Django, Flask, Express
Async-native Python. Auto-generates OpenAPI. Clean dependency injection. Type-safe with Pydantic.
Context

Needed async-first Python capable of handling WebSocket streams (Twilio, Deepgram) concurrently with HTTP API requests.

Final Reasoning

FastAPI's async support is first-class — not bolted on. Depends() injection makes auth a clean cross-cutting concern. Auto-generated OpenAPI reduces docs overhead. Django's sync-first model needs workarounds for WebSocket concurrency.

LLM Provider
Groq · LLaMA 3.3 70B
vs OpenAI GPT-4o, Anthropic Claude
10× lower inference latency via LPU hardware. ~10× cheaper per token. Latency is binary for real-time voice.
Context

Real-time voice requires LLM response to start within 200–400ms to hit the 1.5s E2E target. OpenAI GPT-4o p50 TTFT is ~500–800ms. Groq LPU TTFT for LLaMA 3.3 70B is ~100–200ms.

Final Reasoning

For voice AI, every 100ms is perceptible to callers as hesitation. Groq's LPU delivers the speed that makes the product viable. Cost is additive: ~$0.0006/1K tokens vs GPT-4o at ~$0.005/1K. Tradeoff: Groq has rate limits requiring AIMD throttle + per-tenant semaphores.

Orchestration
Deterministic Python
vs LangChain, LlamaIndex, Autogen
Predictable latency. Zero framework overhead. Full control over context window and streaming boundaries.
Context

LangChain adds 50–200ms overhead per chain step from abstraction, serialization, and internal logging. For a voice system where 100ms matters, this is significant latency tax.

Final Reasoning

The orchestration is simple: build messages array, call Groq, split tokens on sentence boundaries. 50 lines of explicit Python beats 500 lines of LangChain config. Debugging framework internals is harder than debugging explicit code. Predictability wins over ergonomics in latency-sensitive systems.

Compute Platform
Azure AKS
vs App Service, Container Apps, ECS
~$60/month for HA. HPA + PDB + rolling deploys. Full K8s ecosystem: KEDA, ESO, cert-manager.
Context

App Service for the same workload costs $200–400/month. Container Apps simplify ops but lack K8s ecosystem integrations needed (KEDA for queue-based scaling, ESO for secrets sync).

Final Reasoning

AKS on Standard_B2s_v2 (~$30/month each) provides HA at a fraction of App Service cost. Ops is fully automated: push to master triggers CI/CD. ESO, KEDA, cert-manager, and HPA are K8s-native — no equivalent exists in App Service.

Primary Datastore
Firestore
vs PostgreSQL, MongoDB Atlas
Serverless, zero-ops. JSON doc model matches tenant config. Shares Firebase Auth SDK.
Context

Tenant configs are JSON docs with variable schema. User records are simple key-value maps. No complex relational queries needed. Firebase Auth already in use for identity.

Final Reasoning

Firestore scales to zero when idle (zero cost), needs zero DB admin (no migrations, no connection pooling, no vacuum). Shares the Firebase project with Auth. Analytics route to Redis counters, not Firestore.

Secrets Management
ESO + Azure Key Vault
vs K8s Secrets (plain), HashiCorp Vault
Rotation, audit log, zero secrets committed. ESO refreshes every 5 min via AAD Workload Identity.
Context

Plain K8s Secrets are base64 encoded, not encrypted at rest by default. Any kubectl-authorized user can read them. Key Vault provides encryption, rotation, access policies, and audit logs.

Final Reasoning

ESO syncs Key Vault secrets to a K8s Opaque secret every 5 minutes. Rotation is automatic — no pod restart needed. HashiCorp Vault adds another stateful service to operate. Key Vault is in-ecosystem with AKS via AAD Workload Identity.

TTS Strategy
mulaw sidecar cache
vs Always call Azure TTS
Cache hit <200ms vs ~800ms–1.2s miss. Common phrases pre-warm at startup.
Context

Azure TTS call + MP3 decode + mulaw conversion takes ~800ms–1.2s. Greetings like "नमस्ते, मैं आपकी मदद कर सकता हूँ" are identical across every call for a given tenant.

Final Reasoning

SHA256 on normalised text = deterministic cache key. Pre-warm at startup for common phrases from each tenant's system prompt. Hit rate is high for FAQ-heavy tenants. Cache miss falls back to Azure TTS transparently. Redis TTL = 86400s.

Frontend Hosting
Firebase Hosting
vs Azure Static Web Apps, CloudFront
Free at current traffic. Single deploy command. Dashboard + marketing in one CDN deployment.
Context

Needed to host the React dashboard (/app/) and marketing site (/) together. Firebase Hosting supports path-based rewrites with a single JSON config.

Final Reasoning

Free at current traffic (<10GB/month). Single firebase deploy command. Dashboard outputs to dist/app/; marketing to dist/ root. CloudFront requires Route 53 + IAM config + ~$0.012/GB. Not justified at current scale.

Reliability & Failure Domains

Failure matrix covering detection, immediate response, and recovery path for each failure class.

FailureDetectionImmediate ResponseRecovery
Groq rate limit (429)429 HTTP responseAIMD throttle decreases concurrency; per-tenant semaphore queues requestsAuto-recover when rate window resets (~60s)
Azure TTS timeout30s probe timeoutCircuit breaker opens after 3 failuresCaller hears silence this turn; HALF_OPEN probe retries after cooldown
Deepgram WS disconnectWebSocket close eventReconnect with exponential backoffSession continues from last Redis-stored turn context
Redis OOMeviction notices / maxmemory hitvolatile-lru evicts expired keys; queue keys (no TTL) preservedCache misses increase (TTS slower); session state at risk if extreme
AKS pod crashLiveness probe fail (60s initial delay)K8s restarts pod; PDB ensures 1 replica always availableNew pod ready in ~80s including warmup; rolling deploy keeps 1 pod live
Twilio webhook retryDuplicate call_sid on webhookRedis NX idempotency — second webhook finds key set, skips creationRehydrates existing session; no duplicate call handling
WAHA 422 (Core limit)HTTP 422 from WAHA podReturns {"status":"CORE_LIMIT"} instead of raising exceptionNo crash; UI shows CORE_LIMIT status with upgrade guidance
Firestore unavailableSDK exception on readFile fallback activates: reads tenants/*.json, data/users.jsonDegraded mode — no data loss; writes queued or dropped depending on op type
Azure TTS audio warmupPod startup — TTS needs ~38s to warm all tenant voicesreadinessProbe blocked until warmup completesNo traffic sent to pod until ready; initialDelaySeconds: 60 on both probes

Circuit Breaker State Machine

Six independent state machines — one per provider. Failure thresholds vary by impact:

  • Azure TTS: opens after 3 failures — silent audio to caller is the worst user experience
  • Groq LLM: opens after 5 failures — response degradation tolerable briefly
  • Deepgram STT: opens after 5 failures — WS reconnect attempted first
  • Twilio / Telnyx / WAHA: opens after 5 failures
CLOSED
Normal operation
All requests pass
N failures
OPEN
Fast fail
No requests sent
Cooldown
HALF-OPEN
Single probe
Test recovery
Success → CLOSED
CLOSED
Recovered

Performance & Metrics

Observed under current production configuration. All values are provider-dependent and represent typical ranges, not benchmark claims.

<200ms
First audio — cache hit
Observed · mulaw sidecar, no TTS API call in hot path
~1s
E2E speech turn latency
600ms–1.5s typical range · STT endpointing + LLM + TTS combined
250ms
Deepgram endpointing
Configured · endpointing: 250 parameter · trades latency for accuracy
~300ms
Groq LLM TTFT
Approximate · LPU hardware · LLaMA 3.3 70B · context-length dependent
<1ms
Redis round-trip
In-cluster · same K8s namespace · no external network hop
~4min
CI/CD deployment window
Observed typical · ACR build + push + kubectl rolling + rollout validation
40+
Deployments shipped
Rolling updates · validated via kubectl rollout status · rollback tested
10–20
Concurrent calls
Current semaphore config · bounded by Groq API rate limits · tunable
60s
Pod readiness delay
initialDelaySeconds: 60 · Azure TTS pre-warm for all tenants (~38s)
5min
Secret refresh interval
ESO polls Azure Key Vault every 5 min · rotation without pod restart

Production Incidents & Lessons

Three real incidents from APEX production — symptom, root cause, fix, and what changed. These are the situations that test whether an architecture is sound.

INCIDENT · 2026-05-26
Onboarding Activation Loop

Symptom

New users completing onboarding (form submit + tenant creation) were redirected back to /onboard on every subsequent login. Infinite loop. No error shown — the flow appeared to succeed.

Root Cause

get_or_create_user() creates new user records with active: False. The api_onboard endpoint created the tenant but never called set_user_fields to set active: True. So GET /api/me always returned active: False — redirect loop on every login.

Mitigation

Admin activate API called manually per affected user. Script to identify and fix all users with tenant assigned but active: False.

Long-Term Fix

After create_tenant() succeeds, immediately call set_user_fields(email, {"active": True, "tenant": tid}). Added two idempotent recovery paths: (1) user already owns a tenant by owner_email → activate and return 200. (2) create_tenant fails with name conflict but tenant belongs to same owner → activate and return 200.

Lesson: State transitions must be atomic. Creating a resource and activating the user are two operations — they must be treated as one logical unit. A gap between them creates an inconsistent state that's invisible to the user.
INCIDENT · 2026-05-26
WAHA 422 Cascade

Symptom

WhatsApp panel showed "ERROR" for all tenants simultaneously. Dashboard crashed on WhatsApp status load. No WhatsApp functionality available for any tenant.

Root Cause

WAHA Core (free tier) only supports one session named default. Multi-tenant code had set waha_session: "{tenant_id}" for each tenant (e.g., "clinic", "bimts"). GET /api/sessions/clinic returned 422. raise_for_status() converted it to an unhandled exception. Frontend showed error state for all tenants.

Mitigation

Restart WAHA pod. Direct Firestore patch to set waha_session: "default" for all tenants. Default session (phone 917247240888) was already connected.

Long-Term Fix

get_session_status() catches 422 and returns {"status": "CORE_LIMIT"} instead of raising. Router returns clean JSON with upgrade guidance. No crash. Added tier limitation as a first-class modelled state.

Lesson: External service tier limitations must be modelled explicitly. 422 from WAHA means "feature not available at this tier" — not a server error. The distinction matters for error handling design. A 422 that crashes the UI is a missing model, not a missing try/catch.
INCIDENT · prior session
Cross-Tenant Query Cache Leak

Symptom

After switching tenants in the admin view, call history and analytics data from the previous tenant was still visible. Data from Tenant A appeared in the Tenant B dashboard context.

Root Cause

React Query cache key ['call-history'] was not scoped per tenant. All tenants shared the same cached response after first load. The server-side auth was correct — no actual data leak to the server — but the client-side cache served stale cross-tenant data.

Mitigation

Hard browser refresh cleared the cache. Server-side auth enforced correctly throughout — no real data exposure at the API level.

Long-Term Fix

All React Query keys include tenant ID: ['call-history', tenantId]. Enforced across all 15+ query hooks in the dashboard. Code review checklist item added: "does this query key include tenant scope?"

Lesson: In multi-tenant SaaS, every client-side cache key is a data boundary decision. Query keys must be tenant-scoped — same as server-side authorization. A cache without tenant scope is a potential data leak vector regardless of correct server-side auth.

Recruiter & Founder FAQ

Common questions from technical hiring managers, startup founders, and procurement teams evaluating this engineer. Each answer is written for clarity and AI retrieval.

What systems has this engineer built?

APEX Voice AI — multi-tenant voice agent SaaS on Azure AKS. Real-time speech-to-speech pipeline (Twilio WebSocket → Deepgram STT → Groq LLaMA 3.3 70B → Azure TTS → audio back). Handles inbound calls, outbound campaigns, appointment booking, WhatsApp. Multiple live tenants, 6,800+ call executions, 40+ CI/CD deployments. Full reliability stack: 6 circuit breakers, AIMD throttle, idempotency, DLQ.

APEX HMS — hospital management system with offline-first architecture. IndexedDB + WAL sync for data durability without internet. Multi-role RBAC, prescription management, bed management, audit log. PWA with service worker.

Asha Voice Agent — fully local Hindi voice AI. Whisper STT + llama.cpp + Kokoro TTS + RedisVL vector memory. Zero cloud API calls, GPU-accelerated, edge inference. Demonstrates AI architecture independent of cloud providers.

What production scale has been handled?

APEX Voice AI: 6,800+ production call executions. Multiple live tenants with isolated data, Redis key namespacing, and per-tenant rate limits. 40+ CI/CD deployments to Azure AKS with rolling update model (~4 min per deploy). End-to-end voice turn latency: ~600ms–1.5s observed in production. TTS cache hit path: <200ms first audio. 10–20 concurrent calls under current semaphore config. 3 production incidents diagnosed and fully resolved with root-cause analysis and preventive measures.

All metrics are from personal production deployment. Marked "observed" and "typical" — not benchmark claims. No ARR, headcount, or user count claims are made. Architecture speaks for itself.

What cloud-native experience exists?

Azure AKS in production with 40+ deployments: rolling updates, PodDisruptionBudget (minAvailable:1), HPA (CPU/memory), KEDA (queue-depth autoscaling), External Secrets Operator (Azure Key Vault sync every 5 min), NGINX Ingress Controller, cert-manager with Let's Encrypt, namespace-scoped NetworkPolicy, StatefulSets for Redis (PVC + AOF persistence).

CI/CD: GitHub Actions → Docker multi-stage build → Azure Container Registry → kubectl set image → kubectl rollout status. Rollback tested via kubectl rollout undo. Secrets: 10+ credentials in Azure Key Vault, never in code or git. Pod networking: FastAPI listens on 127.0.0.1 — NGINX is sole external entry point.

What reliability engineering patterns are implemented?

Circuit breakers: 6 named state machines (CLOSED/OPEN/HALF-OPEN), one per external provider. Fast failure (not 30s timeout waits) when providers degrade. Thresholds differ by impact: Azure TTS opens at 3 failures (silent audio is the worst caller UX), Groq at 5.

Concurrency control: AIMD throttle on Groq API (halve on 429, increment on success). Per-tenant asyncio.Semaphore prevents one heavy tenant from starving others.

Idempotency: Redis SET NX on Twilio call_sid (TTL 3600s) prevents duplicate sessions when Twilio retries webhooks on 5xx. Job queue uses content-hash deduplication at ZADD time.

Degraded operation: Firestore file fallback (tenants/*.json, data/users.json) activates on SDK exception. Redis volatile-lru eviction preserves job queue (no TTL = never evicted). Barge-in interruption via per-call generation tokens clears in-flight TTS.

What AI infrastructure has been deployed?

LLM integration: Groq LLaMA 3.3 70B via streaming API. Deterministic Python orchestration (no LangChain/Autogen) — messages array: system prompt + last 8 turns + current transcript. Sentence-boundary split (।?! incl. Devanagari) dispatches each sentence to TTS immediately while LLM continues generating — pipelining reduces perceived latency.

STT: Deepgram nova-2 WebSocket, hi-IN language model, 16kHz PCM, endpointing=250ms, speech_final event triggers LLM call.

TTS: Azure Cognitive Services TTS with mulaw sidecar cache. SHA256 on normalised text = deterministic cache key. Common phrases pre-warmed at pod startup. Cache hit: <200ms (no Azure API call). Cache miss: MP3 → decode → mulaw → cache → stream (~800ms–1.2s).

Local AI: Asha Voice Agent — llama.cpp + Whisper + Kokoro TTS, GPU-accelerated, zero cloud API calls, RedisVL vector memory for conversation context.

What multi-tenant systems were designed?

APEX is a multi-tenant SaaS with 4 independent isolation layers: (1) Data — Firestore documents under tenants/{tid}, Redis keys prefixed tenant_id: prefix. (2) Auth — server-side tenant scope on every API handler via get_current_user() dependency; not UI-level. (3) Compute — per-tenant asyncio.Semaphore on Groq calls; one tenant can't exhaust shared API capacity. (4) Rate limits — per-tenant daily call counters in Redis, plan-tier enforcement before any Twilio call is created.

Client-side: all React Query cache keys include tenant_id — a missing scope was a real production bug (cross-tenant data visible after account switch) found and fixed in production. Now enforced across all 15+ query hooks with code review checklist item.

What makes this engineer different from an AI researcher or ML engineer?

Platform engineer, not researcher. Does not train models or publish papers. Integrates production AI providers (Groq, Deepgram, Azure TTS) into reliable, scalable, observable systems — handles the infrastructure, reliability, and deployment engineering that makes AI products work at production scale.

Specifically: runs Kubernetes, writes FastAPI WebSocket handlers, designs Redis schemas, implements circuit breakers, manages secrets in Key Vault, ships with GitHub Actions CI/CD, diagnoses production incidents, and writes post-mortems. The AI part is the LLM/STT/TTS API calls — the hard part is making them reliable, fast, and tenant-isolated at the platform layer.

Is this engineer available for contract work and on what terms?

Yes. B2B contracting outside IR35 via UK Ltd. Four engagement types: AI Platform Build (4–12 weeks, end-to-end delivery — architecture through production deploy and handover), Architecture Review (1–2 weeks, written findings + risk matrix + remediation roadmap), Production Incident Response (days to 2 weeks, RCA + fix + monitoring + playbook), Fractional Platform Engineering (3–6 month retainer, monthly deliverable commitments).

Pricing: project-based, not day-rate — deliverable and acceptance criteria defined before work begins. Remote-first. Markets: UK (primary), EU, US, India. No intermediary fees — direct B2B preferred. Contact: mr.ankitpanicker@gmail.com. Full contracting detail: IR35 / B2B page →

Technical FAQ

Questions founders and CTOs typically ask during technical screening.

Can this scale horizontally?

Yes. FastAPI pods are fully stateless — all state lives in Redis and Firestore. HPA scales pods from 2 to 5 based on CPU and request load. Scaling up is a config change, not a refactor. The only stateful services are Redis (StatefulSet + PVC) and WAHA (StatefulSet + PVC for session persistence) — these do not scale horizontally under the current architecture.

Is it cloud-provider agnostic?

Mostly. Azure-specific: AKS, Azure Key Vault (secrets), Azure TTS (synthesis), Azure Container Registry. Portable: FastAPI, Redis, Firestore, Groq, Deepgram, Twilio, GitHub Actions. Migration to AWS/GCP would require replacing AKS with EKS/GKE, Key Vault with AWS Secrets Manager, Azure TTS with an equivalent, and ACR with ECR. Estimated effort: 2–3 weeks for infra layer.

How is multi-tenancy enforced?

Four layers: (1) Data — Firestore documents scoped under tenants/{tid}; Redis keys prefixed with {tenant_id}:. (2) Auth — server-side tenant scope on every API endpoint via get_current_user() dependency. (3) Compute — per-tenant asyncio.Semaphore on LLM requests. (4) Rate limits — per-tenant daily counters in Redis, checked before every Twilio call. Client-side: all React Query keys include tenant_id — a missing scope was a real production bug, now fixed.

What is the observability story?

Sentry for error tracking with structured context (tenant, call_sid, endpoint). Structured log_event() throughout with consistent JSON fields. Prometheus counters at /metrics. SSE endpoint for real-time log tailing in the dashboard — operators can watch call events live without kubectl. Liveness and readiness probes are separate — readyz includes Firestore and Redis connectivity checks. Future: structured log aggregation to a sink.

How do you handle LLM provider outages?

Circuit breaker opens after 5 consecutive Groq failures. While OPEN, all LLM requests fail immediately (fast failure, not 30s timeout). Callers hear a graceful error message. Queue drains normally — no new jobs dispatched to Groq. After cooldown, HALF_OPEN probe retests the provider. Recovery is automatic when Groq comes back online. The AIMD throttle also reduces concurrency under rate limit pressure before the circuit breaker trips.

How are deployments done?

Push to master → GitHub Actions workflow → Docker build → push to ACR with tag YYYYMMDD-{sha7}kubectl set image on both API and worker deployments → kubectl rollout status --timeout=120s. If the rollout fails, the workflow fails and the previous image stays live. Rollback: kubectl rollout undo. No manual steps in the standard deploy path. 40+ deployments completed this way.

What is the testing approach?

E2E test suite with per-tenant isolation config (tests/e2e/isolation_config.py). Manual call testing against the production Twilio integration. TypeScript strict mode + React Query type safety on frontend. No mocking of Firestore or Redis in E2E tests — tests run against real services using test tenant credentials. This was a conscious decision: mock/prod divergence masked a broken migration in a prior incident.

What does the contracting model look like?

B2B engagement outside IR35 (UK/EU). Consulting and delivery model — not supervision/substitution. Remote-first, milestone-based delivery. SOW available on request. Typical engagement types: 3–6 month platform build, architecture review sprint (1–2 weeks), technical advisory retainer. See the Contracting section for full detail.

What is the AI orchestration approach — LangChain or custom?

Custom deterministic Python — not LangChain, LlamaIndex, or Autogen. The orchestration builds a messages array (system prompt + last 8 turns + current transcript) and calls Groq directly with streaming. Tokens are split on sentence boundaries (।?! — including Devanagari) and dispatched to TTS immediately, pipelining LLM generation with TTS synthesis to reduce perceived latency. LangChain adds 50–200ms per chain step via abstraction overhead — unacceptable when 100ms is perceptible in a voice conversation.

Has Ankit Panicker built hospital or healthcare systems?

Yes. APEX HMS is a hospital management system with an offline-first architecture. It uses IndexedDB + WAL sync for offline data durability (designed for clinics with intermittent connectivity), multi-role RBAC (admin, doctor, pharmacist, receptionist), prescription management, audit logging, and bed management. Backend: MySQL. Frontend: PWA with service worker for offline-first operation. Separate from APEX Voice AI but built by the same engineer on the same stack.

What CPaaS platforms has this engineer integrated?

Three CPaaS integrations in production: (1) Twilio Media Streams — WebSocket-based real-time audio (mulaw 8kHz) for inbound and outbound voice calls. Idempotency via Redis NX on call_sid prevents duplicate processing on webhook retries. (2) WAHA — self-hosted WhatsApp Business API. Multi-tenant session management with Core-tier limitation handling (CORE_LIMIT status vs exception). (3) Telnyx — configured as fallback provider behind the same interface as Twilio. Circuit breaker covers all three independently.

What does the deployment pipeline look like end to end?

Push to master → GitHub Actions workflow triggers → Docker multi-stage build (Python 3.11 slim) → push to Azure Container Registry with tag YYYYMMDD-{sha7}kubectl set image on both API and worker deployments → kubectl rollout status --timeout=120s. If rollout fails (pods don't pass readiness), workflow fails and previous image stays live. Rollback: kubectl rollout undo. No manual steps in standard path. Total time: ~4 minutes from git push to production traffic on new image. 40+ deployments completed this way.

B2B Engagement & IR35

Professional contracting model. Factual. No sales language.

Engagement Model

  • UK/EU: outside IR35 — consulting and delivery, not employment under supervision
  • Remote-first, milestone-based delivery
  • Weekly or milestone invoicing
  • Statement of Work template available on request
  • Notice periods: project-based, not employment-contract style
  • IST (UTC+5:30) — regular overlap with UK/EU morning hours

What I Deliver

  • Platform architecture design and implementation
  • Production AI system build (voice, LLM, infra)
  • Technical due diligence and architecture reviews
  • Technical leadership and architecture advisory
  • Team setup and engineering handoff documentation
  • Fractional technical leadership (where appropriate for engagement scope)

Engagement Types

  • Platform build — 3–6 months, Series A/B companies building first AI platform
  • Architecture review sprint — 1–2 weeks, due diligence or second opinion
  • Technical advisory — part-time retainer, ongoing architecture guidance
  • AI integration — LLM, voice, or automation into existing product

Markets & Rates

  • UK, EU, US, India — remote
  • Senior/principal level day rate (available on request)
  • Fixed-price milestone delivery available for scoped work
  • No agent fees — direct engagement preferred

What I Actually Shipped

Production systems delivered. No inflated metrics. No ARR claims. Let the architecture speak.

APEX Voice AI

Multi-tenant SaaS · Azure AKS · Production · Live customers
Problem solved
Indian SMBs (clinics, hospitals, call centres) receive high inbound call volumes with repetitive queries. Human agents too costly at SMB scale. Goal: AI voice agent answering calls in natural Hindi, booking appointments, running outbound campaigns, handling WhatsApp — at sub-1.5s response latency.
Stack
FastAPI (Python 3.11) · Azure AKS · Redis 7.x (StatefulSet + PVC + AOF) · Firestore · Groq LLaMA 3.3 70B · Deepgram nova-2 STT (hi-IN) · Azure Cognitive Services TTS · Twilio Media Streams · WAHA WhatsApp · GitHub Actions → ACR · React 18 + Firebase Hosting
Architecture
Stateless FastAPI pods on AKS (HPA 2–5 replicas, PDB minAvailable:1) behind NGINX Ingress. Per-call WebSocket pipeline: Twilio → mulaw decode → Deepgram STT → Groq LLM (streaming) → sentence-boundary split → Azure TTS → mulaw re-encode → Twilio. Redis as universal backbone: ZSET priority queue, STRING session state (3600s TTL), SHA256 TTS cache (86400s TTL), INCR rate counters, SET NX idempotency keys.
Reliability patterns
6 named circuit breakers (Groq, Azure TTS, Deepgram, Twilio, Telnyx, WAHA) — CLOSED/OPEN/HALF-OPEN state machines with provider-specific failure thresholds. AIMD concurrency control on Groq (halve on 429, increment on success). Per-tenant asyncio.Semaphore prevents resource starvation. Redis NX idempotency on Twilio webhook (call_sid, 3600s TTL) prevents duplicate sessions on retries. Firestore file fallback for degraded operation. volatile-lru eviction preserves job queue under memory pressure.
Production scale
6,800+ call executions. 40+ CI/CD deployments (~4 min per deploy). 10–20 concurrent calls under current semaphore config. Multiple live tenants. TTS cache hit rate high for FAQ-heavy tenants (common phrases pre-warmed at startup).
Outcome
Live production system at aimmarketing.in — real tenants, real calls, real customers. 3 production incidents diagnosed and fully resolved: onboarding activation loop, WAHA 422 cascade, cross-tenant React Query cache leak.

APEX HMS

Hospital Management System · Offline-first PWA
Problem solved
Indian clinics and hospitals need management systems that work without reliable internet connectivity. Standard web apps fail during outages — losing prescription data or appointment records is clinically unacceptable.
Stack
MySQL backend · PWA (Service Worker + IndexedDB) · Multi-role RBAC
Architecture
Offline-first PWA: all writes go to IndexedDB first (WAL pattern), sync to MySQL backend when connectivity restores. Service worker intercepts all API calls. Per-role views: admin sees full system, doctor sees assigned patients, pharmacist sees pending prescriptions, receptionist sees appointment queue.
Reliability patterns
IndexedDB WAL sync — no data loss on offline. Audit log immutable. Bed management with conflict detection on concurrent assignment.
Outcome
Functional prescription management, bed assignment, appointment tracking, and full audit trail — operational without internet.

Asha Voice Agent

Fully Local Hindi Voice AI · Edge Inference · Zero Cloud
Problem solved
Voice AI requiring zero cloud API calls, no data egress, and zero per-call cost. Designed for privacy-sensitive or low-connectivity environments where cloud latency or billing is a constraint.
Stack
Whisper STT (local) · llama.cpp (local LLM inference, GPU-accelerated) · Kokoro TTS (local synthesis) · RedisVL vector memory for conversation context
Architecture
Fully local inference pipeline: Whisper transcribes Hindi speech → llama.cpp generates response using conversation history stored in RedisVL vector store → Kokoro synthesises Hindi audio. No HTTP calls to external APIs. GPU acceleration via CUDA for llama.cpp inference.
Outcome
Functioning Hindi voice conversation with semantic memory — all processing on-device. Zero cloud cost per call. Demonstrates edge AI architecture pattern distinct from cloud-dependent APEX platform.

AIM Marketing Platform

Multi-page Marketing + SaaS Dashboard · Production · Firebase
Problem solved
Marketing site and SaaS dashboard needed a unified build pipeline deploying to a single CDN — different URL namespaces (/app/* for React SPA, /* for marketing static pages) with shared assets and one deploy command.
Stack
React 18 + TypeScript + Vite + TailwindCSS + React Query + Zustand + Framer Motion · Firebase Hosting · GitHub Actions
Architecture
Single Vite build with manual chunk splitting. Marketing pages as static HTML (separate build pipeline via sync-marketing.mjs). Firebase Hosting rewrites: /app/** → React SPA, /architecture/** → static architecture pages, ** → marketing SPA. All React Query cache keys include tenant_id (cross-tenant leak was a real production bug).
Production scale
Bundle reduced from 2,007KB to 253KB via rollupOptions.manualChunks. Lazy-loaded pages via React.lazy + Suspense. 15+ React Query hooks all tenant-scoped.

Execution Ecosystem

Complementary delivery platforms that support implementation and go-to-market execution.

AIM Marketing
aimmarketing.in

AI voice agent SaaS platform for Indian SMBs. Multi-tenant deployment. Real customer-facing production system.

AIM Studio
aimstudio.co.in

Brand identity, web execution, and design system delivery. White-label support for product teams.

Architecture Document — Version Log

v1.0  — 2026-05-28
  Initial publication
  Sections: executive overview, production evidence, system architecture diagram,
  project walkthrough (18 sections), engineering decisions (10), component deep dives (12),
  sequence diagrams (5), reliability matrix, performance metrics, FAQ, contracting,
  production incidents (3), delivery history, execution ecosystem

Maintained by: Ankit Panicker <mr.ankitpanicker@gmail.com>
Last updated: 2026-05-28
IR35 / B2B Contracting Detail → Platform Engineering Deep Dive → AI Systems Engineering →