Voice-on-Call AI Platform
A phone call is all it takes — AI-powered civic and farmer support in any Indian language, no smartphone required.

- 99% uptime under 500+ concurrent calls
- Sub-second response with VAD + barge-in
- Cut manual call-centre workload by 60%
What needed solving
Millions of people in rural India — farmers, citizens needing civic services — don't have access to apps, smartphones, or English-language support systems. Existing call-centre infrastructure is expensive to scale and limited to business hours. The result: people who need help most are the ones least served by digital systems.
What I built
I built a real-time, multilingual voice AI agent that lets anyone call a regular phone number and have a natural, spoken conversation in their native Indian language — no app to download, no smartphone needed, works on any basic phone. The system listens, understands intent, looks up relevant information, and responds — all in the time it takes a human to answer a question.
Architecture
Call ingestion
Twilio handles inbound/outbound telephony, streaming raw audio over WebSockets into the pipeline.
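As a rough illustration of the ingestion step: Twilio Media Streams deliver audio as JSON text frames over the WebSocket, with each "media" frame carrying a base64-encoded chunk of 8 kHz mu-law audio. The sketch below shows one way to pull the raw bytes out of such a frame. The function name extract_audio and the sample stream ID are hypothetical, and this is not the project's actual handler.

```python
import base64
import json
from typing import Optional


def extract_audio(message: str) -> Optional[bytes]:
    """Return the raw audio bytes from a Media Streams frame, or None.

    Only "media" frames carry audio; control frames such as "connected",
    "start", "mark", and "stop" are skipped here.
    """
    event = json.loads(message)
    if event.get("event") != "media":
        return None
    return base64.b64decode(event["media"]["payload"])


# A frame in the shape Twilio sends (IDs and payload are made up for the demo).
sample = json.dumps({
    "event": "media",
    "streamSid": "MZ-example",
    "media": {"payload": base64.b64encode(b"\xff\xff\xff\xff").decode("ascii")},
})
audio = extract_audio(sample)
```

In a live call the returned bytes would be forwarded to the speech-to-text stage as they arrive, rather than buffered until the caller finishes speaking.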
Speech-to-Text
Sarvam AI's STT converts spoken Indian-language audio into text in real time, tuned for regional accents and dialects.
Reasoning layer
Gemini LLM interprets intent, holds conversational context, and decides what information or action is needed.
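One simple way to hold conversational context is a rolling window of recent turns that is flattened into the prompt on every request. The sketch below assumes that approach. The class name ConversationContext, the window size, and the prompt layout are all illustrative, and no Gemini API call is shown.

```python
from collections import deque


class ConversationContext:
    """Keeps the most recent turns of one call (hypothetical helper)."""

    def __init__(self, system_prompt: str, max_turns: int = 6):
        self.system_prompt = system_prompt
        # Each turn is one caller entry plus one agent entry.
        self._entries = deque(maxlen=max_turns * 2)

    def add(self, role: str, text: str) -> None:
        self._entries.append((role, text))

    def as_prompt(self) -> str:
        """Flatten the system prompt and retained turns into one string."""
        lines = [self.system_prompt]
        lines += [f"{role}: {text}" for role, text in self._entries]
        return "\n".join(lines)


ctx = ConversationContext("Answer briefly in the caller's language.", max_turns=1)
ctx.add("caller", "What is the wheat price today?")
ctx.add("agent", "Which market are you asking about?")
ctx.add("caller", "The nearest one.")
prompt = ctx.as_prompt()
```

Capping the window keeps prompt size, and therefore LLM latency, roughly constant over a long call, at the cost of forgetting older turns.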
Text-to-Speech
Deepgram TTS converts the LLM's response back into natural-sounding speech in the same language.
Orchestration
Pipecat ties these stages together as one low-latency, interruption-aware voice pipeline, with FastAPI as the backend and a React dashboard for monitoring live calls.
Key challenges solved
Sub-second latency
Voice conversations break down if response time feels unnatural. I engineered a WebSocket-based audio streaming pipeline in which the STT, LLM, and TTS stages overlap rather than run one after another, bringing end-to-end response time under one second.
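The overlap idea can be sketched with three asyncio tasks joined by queues: each stage starts on a chunk as soon as the previous stage emits it, so the first audio is ready long before the whole utterance has been processed. The stage function, the fixed delays, and the string labels below are stand-ins for real STT, LLM, and TTS calls. This is a toy model and does not use Pipecat.

```python
import asyncio

STOP = None  # sentinel marking the end of the stream


async def stage(name, delay, inbox, outbox):
    """Take items from inbox, simulate work, and pass results downstream."""
    while True:
        item = await inbox.get()
        if item is STOP:
            await outbox.put(STOP)
            return
        await asyncio.sleep(delay)  # placeholder for a real STT/LLM/TTS call
        await outbox.put(f"{name}({item})")


async def run_pipeline(chunks, delay=0.01):
    q_audio, q_text, q_reply, q_speech = (asyncio.Queue() for _ in range(4))
    tasks = [
        asyncio.create_task(stage("stt", delay, q_audio, q_text)),
        asyncio.create_task(stage("llm", delay, q_text, q_reply)),
        asyncio.create_task(stage("tts", delay, q_reply, q_speech)),
    ]
    for chunk in chunks:
        await q_audio.put(chunk)
    await q_audio.put(STOP)

    out = []
    while (item := await q_speech.get()) is not STOP:
        out.append(item)
    await asyncio.gather(*tasks)
    return out


result = asyncio.run(run_pipeline(["c1", "c2", "c3"]))
```

With three chunks and equal stage delays, the first output is ready after three delays and each later one follows one delay apart, whereas running each stage over the whole utterance before starting the next would delay the first output until the very end.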
Barge-in handling
Real conversations involve interruptions. I implemented Voice Activity Detection (VAD) and barge-in logic so the system stops speaking and listens the moment the caller starts talking — just like a human would.
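A minimal sketch of the barge-in logic, under stated assumptions: a crude energy threshold stands in for a trained VAD model, and the agent's speech is a cancellable asyncio task. The names frame_energy, BargeInDetector, and speak, along with the threshold and frame counts, are hypothetical and are not taken from the project.

```python
import asyncio
import struct


def frame_energy(pcm: bytes) -> float:
    """Mean absolute amplitude of a 16-bit little-endian PCM frame."""
    n = len(pcm) // 2
    if n == 0:
        return 0.0
    samples = struct.unpack(f"<{n}h", pcm[: n * 2])
    return sum(abs(s) for s in samples) / n


class BargeInDetector:
    """Reports caller speech after several consecutive loud frames."""

    def __init__(self, threshold: float = 500.0, min_speech_frames: int = 3):
        self.threshold = threshold
        self.min_speech_frames = min_speech_frames
        self._run = 0  # current streak of loud frames

    def caller_is_speaking(self, pcm: bytes) -> bool:
        self._run = self._run + 1 if frame_energy(pcm) >= self.threshold else 0
        return self._run >= self.min_speech_frames


async def speak(words, spoken):
    """Stand-in for TTS playback: emits one word every 50 ms."""
    for word in words:
        spoken.append(word)
        await asyncio.sleep(0.05)


async def demo():
    detector = BargeInDetector()
    spoken = []
    playback = asyncio.create_task(speak(["word"] * 20, spoken))
    quiet = struct.pack("<160h", *([0] * 160))
    loud = struct.pack("<160h", *([4000] * 160))
    interrupted = False
    for frame in [quiet, loud, loud, loud]:
        await asyncio.sleep(0.005)  # next inbound audio frame arrives
        if detector.caller_is_speaking(frame) and not playback.done():
            playback.cancel()  # stop talking and hand the turn to the caller
            interrupted = True
            break
    try:
        await playback
    except asyncio.CancelledError:
        pass
    return interrupted, spoken


interrupted, spoken = asyncio.run(demo())
```

Requiring a short streak of loud frames before interrupting is one way to avoid cutting the agent off on a cough or line noise. Over telephony, audio already sent to the carrier also needs flushing, which Twilio Media Streams supports through a "clear" message.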
Concurrency at scale
The system holds 99% uptime under 500+ simultaneous calls. I designed the backend to handle many parallel conversation states so that one call's load does not affect another's latency.
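One possible shape for per-call isolation, offered as an assumption rather than the project's actual design: every call gets its own session object, nothing mutable is shared between calls, and a semaphore caps how many calls run at once. CallSession, CallRegistry, and the cap of 2 used in the demo are illustrative names and values.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class CallSession:
    """State owned by exactly one call."""
    call_id: str
    history: list = field(default_factory=list)


class CallRegistry:
    def __init__(self, max_calls: int):
        self._slots = asyncio.Semaphore(max_calls)  # admission control
        self._sessions = {}
        self.peak = 0  # highest number of calls active at once

    async def handle(self, call_id, utterances):
        async with self._slots:
            session = CallSession(call_id)
            self._sessions[call_id] = session
            self.peak = max(self.peak, len(self._sessions))
            try:
                for text in utterances:
                    await asyncio.sleep(0)  # yield so other calls can progress
                    session.history.append(text)
                return list(session.history)
            finally:
                del self._sessions[call_id]


async def main():
    registry = CallRegistry(max_calls=2)
    results = await asyncio.gather(
        *(registry.handle(f"call-{i}", [f"hello from {i}", "bye"]) for i in range(5))
    )
    return results, registry.peak


results, peak = asyncio.run(main())
```

Because each coroutine only touches its own CallSession, a slow call cannot corrupt another call's state, and the semaphore keeps excess calls waiting for a slot instead of degrading every active conversation.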
Multilingual robustness
Indian languages have huge dialect and accent variation. Tuning Sarvam STT and prompt-engineering Gemini's responses for natural, locally appropriate phrasing (rather than robotic translation) took continuous iteration.
What it shipped
