Case study · Real-time multilingual voice agent

Voice-on-Call AI Platform

A phone call is all it takes — AI-powered civic and farmer support in any Indian language, no smartphone required.

Pipecat · Gemini · Sarvam AI · Deepgram · Twilio · FastAPI · React · WebSockets · PostgreSQL
Highlights
  • 99% uptime under 500+ concurrent calls
  • Sub-second response with VAD + barge-in
  • Cut manual call-centre workload by 60%
The problem

What needed solving

Millions of people in rural India — farmers, citizens needing civic services — don't have access to apps, smartphones, or English-language support systems. Existing call-centre infrastructure is expensive to scale and limited to business hours. The result: people who need help most are the ones least served by digital systems.

The solution

What I built

I built a real-time, multilingual voice AI agent that lets anyone call a regular phone number and have a natural, spoken conversation in their native Indian language — no app to download, no smartphone needed, works on any basic phone. The system listens, understands intent, looks up relevant information, and responds — all in the time it takes a human to answer a question.

How it works

Architecture

01

Call ingestion

Twilio handles inbound/outbound telephony, streaming raw audio over WebSockets into the pipeline.
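As a rough illustration of this step, here is a minimal sketch of pulling audio out of one Twilio Media Streams WebSocket message. Twilio sends JSON text frames whose `media.payload` field carries base64-encoded mu-law audio. The function name `extract_audio` is hypothetical and not taken from the project.

```python
import base64
import json
from typing import Optional


def extract_audio(message: str) -> Optional[bytes]:
    """Return the raw audio bytes from a Media Streams message.

    Non-media events (for example "connected", "start", "stop")
    carry no audio, so they yield None.
    """
    event = json.loads(message)
    if event.get("event") != "media":
        return None
    # The payload is base64-encoded mu-law audio.
    return base64.b64decode(event["media"]["payload"])
```

In a real handler the returned bytes would be forwarded to the speech-to-text stage as they arrive, rather than buffered until the caller stops talking.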

02

Speech-to-Text

Sarvam AI's STT converts spoken Indian-language audio into text in real time, tuned for regional accents and dialects.

03

Reasoning layer

Gemini LLM interprets intent, holds conversational context, and decides what information or action is needed.

04

Text-to-Speech

Deepgram TTS converts the LLM's response back into natural-sounding speech in the same language.

05

Orchestration

Pipecat ties these stages together as one low-latency, interruption-aware voice pipeline, with FastAPI as the backend and a React dashboard for monitoring live calls.

Engineering

Key challenges solved

Sub-second latency

Voice conversations break down when responses lag. I engineered a WebSocket-based audio streaming pipeline in which the STT, LLM, and TTS stages overlap rather than run sequentially, bringing end-to-end response time under one second.
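The overlapping-stage idea can be sketched with plain `asyncio` queues. This is an assumption-laden toy, not the production pipeline: the three stage functions are stand-ins (no real STT, LLM, or TTS calls), and every name here is hypothetical. What it shows is that each stage starts work on the first item it receives instead of waiting for the previous stage to finish.

```python
import asyncio

END = None  # sentinel marking the end of a stream


async def stt(audio_chunks, out: asyncio.Queue):
    # Stand-in STT: emit one partial transcript per audio chunk.
    for chunk in audio_chunks:
        await asyncio.sleep(0)  # simulate streaming I/O
        await out.put(f"word{chunk}")
    await out.put(END)


async def llm(inp: asyncio.Queue, out: asyncio.Queue):
    # Stand-in LLM: transform each partial transcript as it arrives.
    while (text := await inp.get()) is not END:
        await out.put(text.upper())
    await out.put(END)


async def tts(inp: asyncio.Queue, spoken: list):
    # Stand-in TTS: "synthesise" each token as soon as it is available.
    while (token := await inp.get()) is not END:
        spoken.append(f"<audio:{token}>")


async def run_pipeline(audio_chunks):
    q1, q2 = asyncio.Queue(), asyncio.Queue()
    spoken = []
    # All three stages run concurrently, connected by queues.
    await asyncio.gather(stt(audio_chunks, q1), llm(q1, q2), tts(q2, spoken))
    return spoken
```

Calling `asyncio.run(run_pipeline([1, 2]))` pushes two chunks through all three stages. In a sequential design the caller would wait for the sum of the stage latencies; with overlap, the wait is closer to the time to the first synthesised audio frame.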

Barge-in handling

Real conversations involve interruptions. I implemented Voice Activity Detection (VAD) and barge-in logic so the system stops speaking and listens the moment the caller starts talking — just like a human would.
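Barge-in can be sketched as a race between playback and a "caller is speaking" signal. The sketch below assumes a VAD component sets an `asyncio.Event` when it detects speech; the VAD itself is not shown, and `speak`, `speak_with_barge_in`, and `demo` are hypothetical names used only for illustration.

```python
import asyncio


async def speak(frames, played: list):
    """Stand-in playback: 'plays' one frame at a time."""
    for frame in frames:
        played.append(frame)
        await asyncio.sleep(0.01)  # each await is a cancellation point


async def speak_with_barge_in(frames, caller_speaking: asyncio.Event):
    """Play frames until done, or stop as soon as the caller speaks."""
    played = []
    playback = asyncio.create_task(speak(frames, played))
    interrupt = asyncio.create_task(caller_speaking.wait())
    done, _ = await asyncio.wait(
        {playback, interrupt}, return_when=asyncio.FIRST_COMPLETED
    )
    barged_in = interrupt in done and not playback.done()
    # Cancel whichever task lost the race, then let both settle.
    for task in (playback, interrupt):
        if not task.done():
            task.cancel()
    await asyncio.gather(playback, interrupt, return_exceptions=True)
    return played, barged_in


async def demo(frame_count: int, interrupt_after=None):
    """Simulate a call; optionally fire the VAD signal after a delay."""
    caller_speaking = asyncio.Event()
    if interrupt_after is not None:
        asyncio.get_running_loop().call_later(interrupt_after, caller_speaking.set)
    return await speak_with_barge_in(list(range(frame_count)), caller_speaking)
```

For example, `asyncio.run(demo(100, interrupt_after=0.025))` stops after only a few frames, while `asyncio.run(demo(2))` plays both frames and reports no barge-in.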

Concurrency at scale

The system holds 99% uptime under 500+ simultaneous calls. I designed the backend to keep many parallel conversation states isolated, so that load on one call does not affect another's latency.
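One way to keep parallel conversations isolated is to give each call its own state object and to cap admissions with a semaphore. The sketch below is an assumption about how such a backend could be structured, not a description of the actual implementation; `CallSession`, `SessionRegistry`, and the `max_calls` default are hypothetical.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class CallSession:
    """Per-call state; nothing here is shared between calls."""
    call_id: str
    language: str
    history: list = field(default_factory=list)


class SessionRegistry:
    def __init__(self, max_calls: int = 500):
        self.sessions = {}
        self._slots = asyncio.Semaphore(max_calls)  # admission cap

    async def handle_call(self, call_id: str, language: str, utterances):
        async with self._slots:
            session = self.sessions[call_id] = CallSession(call_id, language)
            try:
                for text in utterances:
                    await asyncio.sleep(0)  # yield so other calls progress
                    session.history.append(text)
                return list(session.history)
            finally:
                # Always release the call's state, even on errors.
                del self.sessions[call_id]


async def demo():
    registry = SessionRegistry(max_calls=2)
    results = await asyncio.gather(
        registry.handle_call("call-a", "hi", ["namaste", "mausam?"]),
        registry.handle_call("call-b", "ta", ["vanakkam"]),
    )
    return results, len(registry.sessions)
```

Both calls interleave on one event loop, yet each history contains only its own utterances, and the registry is empty once the calls end.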

Multilingual robustness

Indian languages vary widely in dialect and accent. Tuning Sarvam STT and prompt-engineering Gemini to produce natural, locally appropriate phrasing, rather than robotic translation, required continuous iteration.

Impact

What it shipped

  • 99% uptime under 500+ concurrent calls
  • Cut manual call-centre workload by 60%
  • Showcased at Mumbai Tech Week 2026
  • Built and owned end-to-end, solo: ML pipeline, backend, frontend dashboard, and telephony integration

Have a system that needs to ship?

I build production-grade AI — voice, vision, and full-stack. Open to senior AI engineering and founder roles.