Looking for experts on Voice AI over telephony #153618

moonwisper · 2025-03-10T21:31:03Z

Body

I am building a conversational Voice AI that can initiate calls over SIP trunking or a VoIP API. I am not using Twilio, since it will not operate in the US; instead I plan to use a virtual PBX or, as mentioned, direct SIP trunking to phone numbers assigned by local telephony providers. What tech stack would you recommend?

Guidelines

I have read and understood this category's guidelines before making this post.

Boardson · 2025-09-24T17:31:27Z

I just rename the default branch to main in each new repo right after creating it. It’s a bit annoying, but only takes a second and works fine for now.

0 replies

eggduzao · 2025-09-24T22:35:06Z

Interesting problem space. If you want a conversational Voice AI that can originate and receive calls over SIP trunks (or a VoIP API) without Twilio, you basically need four layers to work together:

Telephony & Call Control (SIP, RTP, signaling)
Media I/O (low-latency audio in/out + barge-in)
Conversation Engine (ASR + LLM/NLU + tool use + TTS)
Glue/Orchestration + Observability (routing, scaling, compliance)

Below is a practical, vendor-agnostic stack/architecture built from widely deployed components. I'll give you two variants: a minimal PoC and a production-ready baseline.

0) SIP Trunk Providers (US)

If you're not using Twilio, common US options with good SIP support:

Bandwidth, Telnyx, SignalWire, Flowroute, VoIP.ms
Features to check: inbound/outbound trunks, CNAM, E911, number provisioning APIs, TLS/SRTP, STIR/SHAKEN, webhooks, call recording legality guidance.

1) Telephony & Call Control

Option A - Self-hosted PBX/B2BUA (flexible, OSS)

FreeSWITCH or Asterisk as your B2BUA/PBX.
Kamailio/OpenSIPS as SBC/edge if you need SIP routing at scale or multi-tenant isolation.
Janus or mediasoup if you also want WebRTC ingress/egress (e.g., for a browser agent or supervisor UI).

Why: You own the signaling and can script call flows (IVRs, transfers, conferencing), inject media, and fork streams to your AI.

Option B - Hosted programmable telephony (less ops)

Providers like SignalWire, Telnyx, Voximplant, and Plivo expose call control APIs and often media streaming hooks (WebSocket/WebRTC/SIPREC) that deliver call audio to your app. You keep your code thin and focus on the AI brain.

If you're new to SIP ops, Option B is a quick win. If you need custom behavior (e.g., complex bridging, RTP forking, on-prem edges), go with Asterisk/FreeSWITCH.

2) Media I/O (the "real-time" bit)

You want bidirectional, low-latency audio between the PSTN call and your AI:

RTP from PBX -> your media gateway (could be a FreeSWITCH mod, a Janus plugin, or your own service via GStreamer/aiortc).
Convert to 16 kHz 16-bit PCM (or 8 kHz if you must; 16 kHz helps ASR).
VAD (Voice Activity Detection) + barge-in (cut TTS when user starts speaking).
DTMF: capture out-of-band (RFC 4733) or in-band if needed.
Latency budget target: ~300-500 ms E2E feels good enough for phone conversation.
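
To make the decode/resample step above concrete, here is a minimal sketch in plain Python. It assumes the trunk delivers G.711 mu-law (PCMU) at 8 kHz, which is common but not something stated above, and the function names are illustrative. The 2x linear interpolation only matches a 16 kHz ASR input rate; it does not add bandwidth, and a real pipeline would more likely use a proper resampler (GStreamer's audioresample, for instance).

```python
import struct


def ulaw_byte_to_pcm16(byte: int) -> int:
    """Decode one G.711 mu-law byte to a signed 16-bit linear sample."""
    u = ~byte & 0xFF                 # mu-law bytes are stored bit-inverted
    sign = u & 0x80
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    sample = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -sample if sign else sample


# Precompute all 256 values once; decoding becomes a table lookup.
ULAW_TABLE = [ulaw_byte_to_pcm16(b) for b in range(256)]


def decode_ulaw(payload: bytes) -> list:
    """Decode an RTP PCMU payload into a list of 16-bit samples."""
    return [ULAW_TABLE[b] for b in payload]


def upsample_2x(samples: list) -> list:
    """Naive 8 kHz -> 16 kHz: insert the midpoint between neighbours."""
    out = []
    for i, s in enumerate(samples):
        nxt = samples[i + 1] if i + 1 < len(samples) else s
        out.append(s)
        out.append((s + nxt) // 2)
    return out


def to_pcm16_le(samples: list) -> bytes:
    """Pack samples as little-endian 16-bit PCM, the usual ASR input format."""
    return struct.pack(f"<{len(samples)}h", *samples)
```

A 20 ms PCMU frame (160 bytes) comes out as 640 bytes of 16 kHz PCM: `to_pcm16_le(upsample_2x(decode_ulaw(frame)))`.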

Implementation choices:

GStreamer pipeline (very reliable for RTP <-> PCM <-> TTS/ASR).
Python aiortc for WebRTC if you prefer encrypted media channel to your AI service.
SIPREC / RTP fork from FreeSWITCH if you want a simple "tap" to your recognizer.
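
If you end up writing your own tap instead of using a GStreamer element, the RTP side is small. The sketch below assumes RFC 3550 fixed headers and RFC 4733 telephone-event payloads for the out-of-band DTMF mentioned earlier. The telephone-event payload type is negotiated in SDP (101 is only a common choice), and `RtpPacket`, `parse_rtp`, and `parse_dtmf_event` are names made up for this example.

```python
import struct
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class RtpPacket:
    payload_type: int
    marker: bool
    sequence: int
    timestamp: int
    ssrc: int
    payload: bytes


def parse_rtp(data: bytes) -> RtpPacket:
    """Parse one RTP packet (RFC 3550), skipping CSRCs, extension, padding."""
    if len(data) < 12:
        raise ValueError("packet shorter than the fixed RTP header")
    b0, b1, seq, ts, ssrc = struct.unpack("!BBHII", data[:12])
    if b0 >> 6 != 2:
        raise ValueError("not RTP version 2")
    offset = 12 + 4 * (b0 & 0x0F)            # skip CSRC identifiers
    if b0 & 0x10:                            # header extension present
        if len(data) < offset + 4:
            raise ValueError("truncated header extension")
        (ext_words,) = struct.unpack("!H", data[offset + 2:offset + 4])
        offset += 4 + 4 * ext_words
    end = len(data)
    if b0 & 0x20:                            # last byte holds padding length
        end -= data[-1]
    if offset > end:
        raise ValueError("header fields exceed packet length")
    return RtpPacket(b1 & 0x7F, bool(b1 & 0x80), seq, ts, ssrc, data[offset:end])


DTMF_EVENTS = "0123456789*#ABCD"             # RFC 4733 event codes 0-15


def parse_dtmf_event(payload: bytes) -> Optional[Tuple[str, bool, int]]:
    """Return (digit, is_end_of_event, duration_in_timestamp_units) or None."""
    if len(payload) < 4:
        raise ValueError("telephone-event payload is 4 bytes")
    event, flags, duration = struct.unpack("!BBH", payload[:4])
    if event >= len(DTMF_EVENTS):
        return None                          # some other named event
    return DTMF_EVENTS[event], bool(flags & 0x80), duration
```

One digit press spans several packets that share an RTP timestamp, and the end-of-event packet is usually retransmitted, so the caller should deduplicate on timestamp before reporting a digit.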

3) Conversation Engine

ASR (streaming)

OSS: Vosk (fast, light), whisper.cpp (local; quality good, watch CPU), NVIDIA Riva (GPU).
Managed APIs: Google Cloud Speech-to-Text (streaming), Azure Cognitive Services, AWS Transcribe, Deepgram, AssemblyAI.
Must-haves: partial hypotheses, timestamps, low latency, endpointing.
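
Most streaming ASR services do their own endpointing, but it helps to see what the mechanism looks like, since the same signal drives barge-in. Below is a rough energy-based sketch, not a production VAD: the threshold and frame counts are placeholder values I picked for illustration, and a trained VAD model will behave much better on noisy lines.

```python
import math
import struct
from typing import Optional


def frame_rms(frame: bytes) -> float:
    """Root-mean-square level of a little-endian 16-bit PCM frame."""
    n = len(frame) // 2
    if n == 0:
        return 0.0
    samples = struct.unpack(f"<{n}h", frame[: 2 * n])
    return math.sqrt(sum(s * s for s in samples) / n)


class Endpointer:
    """Turns a stream of frames into speech_start / speech_end events."""

    def __init__(self, threshold: float = 500.0,
                 start_frames: int = 3, end_frames: int = 25):
        self.threshold = threshold        # RMS above this counts as voiced
        self.start_frames = start_frames  # voiced frames needed to open a turn
        self.end_frames = end_frames      # silent frames needed to close it
        self.in_speech = False
        self._voiced_run = 0
        self._silent_run = 0

    def push(self, frame: bytes) -> Optional[str]:
        voiced = frame_rms(frame) >= self.threshold
        if not self.in_speech:
            self._voiced_run = self._voiced_run + 1 if voiced else 0
            if self._voiced_run >= self.start_frames:
                self.in_speech = True
                self._silent_run = 0
                return "speech_start"
            return None
        self._silent_run = 0 if voiced else self._silent_run + 1
        if self._silent_run >= self.end_frames:
            self.in_speech = False
            self._voiced_run = 0
            return "speech_end"
        return None
```

With 20 ms frames, the defaults here mean roughly 60 ms of speech to open a turn and 500 ms of silence to close it; tune both against real call audio.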

NLU / Orchestrator

LLM with function/tool-calling (OpenAI-style, Anthropic, local Llama + function routing) or traditional Rasa/Dialogflow CX if your flows are structured.
Add a state machine layered over the LLM (for guardrails, retries, transitions).
Tooling: call CRM APIs, KB search, appointment systems, etc.
Turn-taking policy: pick a voice cadence; apply barge-in when user speaks.
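
As a sketch of the state-machine idea above: the LLM proposes an intent and, optionally, a tool call, and a small deterministic layer decides whether that is allowed in the current state. The states, intents, and tool names below (`lookup_customer`, `book_appointment`) are invented for an appointment-booking example and are not tied to any framework.

```python
from dataclasses import dataclass
from typing import Optional

# Which intents are legal in each state, and where they lead.
TRANSITIONS = {
    "greeting": {"provide_details": "collecting"},
    "collecting": {"provide_details": "collecting", "confirm": "confirming"},
    "confirming": {"confirm": "done", "deny": "collecting"},
}

# Tools the LLM may call in each state; anything else is rejected.
ALLOWED_TOOLS = {
    "collecting": {"lookup_customer"},
    "confirming": {"book_appointment"},
}

MAX_RETRIES = 2


@dataclass
class DialogState:
    state: str = "greeting"
    retries: int = 0


def step(dialog: DialogState, intent: str, tool: Optional[str] = None) -> str:
    """Validate one LLM-proposed turn.

    Returns "proceed" (state advanced), "reprompt" (ask again), or
    "handoff" (too many invalid turns in a row, escalate to a human).
    """
    next_state = TRANSITIONS.get(dialog.state, {}).get(intent)
    tool_ok = tool is None or tool in ALLOWED_TOOLS.get(dialog.state, set())
    if next_state is None or not tool_ok:
        dialog.retries += 1
        return "handoff" if dialog.retries > MAX_RETRIES else "reprompt"
    dialog.state = next_state
    dialog.retries = 0
    return "proceed"
```

The point of the design is that the model never changes state or triggers a side effect on its own; it only proposes, and the table decides.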

TTS (fast + natural)

Managed: Azure Neural TTS, ElevenLabs, Amazon Polly NTTS, Google WaveNet/Neural2.
OSS: Coqui XTTS (quality improving; consider caching).
Must-haves: <200 ms first audio, streaming output, SSML support (prosody, breaks).

Critical: Implement a "speak-or-listen" arbiter. When user audio arrives (VAD triggers), immediately pause/cancel TTS and switch to ASR capture.
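
A minimal asyncio sketch of such an arbiter follows. It assumes TTS audio arrives as an iterable of chunks and that `send` is a coroutine you supply to write a chunk toward the call; both are placeholders, as is the `TurnArbiter` name.

```python
import asyncio
from typing import Awaitable, Callable, Iterable, Optional


class TurnArbiter:
    """Owns the single TTS playback task and cancels it on barge-in."""

    def __init__(self) -> None:
        self.mode = "listening"
        self._task: Optional[asyncio.Task] = None

    def speak(self, chunks: Iterable[bytes],
              send: Callable[[bytes], Awaitable[None]]) -> asyncio.Task:
        """Start playback; any playback already in flight is cancelled."""
        self.barge_in()
        self.mode = "speaking"
        self._task = asyncio.create_task(self._play(chunks, send))
        return self._task

    async def _play(self, chunks, send) -> None:
        try:
            for chunk in chunks:
                await send(chunk)
        finally:
            # Only the current playback may flip the mode back, so a
            # cancelled older task cannot clobber a newer one.
            if self._task is asyncio.current_task():
                self.mode = "listening"

    def barge_in(self) -> bool:
        """Call when VAD reports user speech. True if TTS was interrupted."""
        task = self._task
        if task is not None and not task.done():
            task.cancel()
            self.mode = "listening"
            return True
        return False
```

Wire `barge_in()` to the VAD's speech-start event. Cancelling the task stops new chunks from being sent; audio already queued downstream (in FreeSWITCH or a jitter buffer) has to be flushed separately.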

4) Glue, Ops, Compliance

Call flow app: Node.js/TypeScript (excellent SIP/WebSocket ecosystem) or Python (asyncio) for orchestration.
Message bus: Redis/AMQP/Kafka to decouple ASR/LLM/TTS.
Observability: Prometheus + Grafana, logs with structured fields (call-id, asset-id, leg-id).
Recordings: record only where it is legal and disclosed; announce the recording; redact PCI/PII if needed.
Security: TLS/SRTP, firewall SIP ports, rate limiting, fraud detection.
Scaling: containerize (Docker), autoscale stateless components; pin real-time nodes to dedicated instances.
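
For the structured-logging point, one JSON object per line with the call identifiers attached is usually enough to trace a call across services. Here is a sketch using only the standard `logging` module; `make_logger`, `log_event`, and the field names are illustrative choices, not a convention from any of the tools above.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {"level": record.levelname, "msg": record.getMessage()}
        # Extra key/value pairs attached by log_event() below.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry, sort_keys=True)


def make_logger(name: str = "voicebot", stream=None) -> logging.Logger:
    """Create a logger that writes JSON lines to stream (stderr if None)."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def log_event(logger: logging.Logger, msg: str, **fields) -> None:
    """Log msg with structured fields such as call_id and leg_id."""
    logger.info(msg, extra={"fields": fields})
```

Usage would look like `log_event(log, "asr_final", call_id="c-1", leg_id="a", latency_ms=182)`, which keeps per-stage latency queryable by call.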

Minimal PoC Stack (works in a weekend)

Telnyx (SIP trunk) -> FreeSWITCH (B2BUA & RTP fork)
Media bridge service (Python aiortc or GStreamer) that:
- Receives RTP, converts to 16k PCM
- Streams to Deepgram (ASR)
- Sends prompts to an LLM (tool-calling) for replies
- Streams TTS from ElevenLabs back to FreeSWITCH
- Implements VAD/barge-in
Node.js or Python app for call logic (start/stop, transfers, retries)

Call flow sketch:

INVITE -> FreeSWITCH answers.
FreeSWITCH bridges caller's RTP to MediaBridge (fork).
MediaBridge -> ASR (stream) -> LLM (tool calls) -> TTS (stream) -> back as RTP.
If user talks while TTS plays -> VAD triggers -> cancel TTS -> resume ASR.
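
The shape of that flow can be sketched as a chain of async generators. Everything here is a stand-in: `fake_asr`, `fake_llm`, and `fake_tts` only mimic the streaming interfaces so the plumbing is visible, and an empty frame is used as a made-up end-of-utterance marker. In a real bridge each stage would wrap a provider's streaming client.

```python
import asyncio
from typing import AsyncIterator


async def fake_asr(frames: AsyncIterator[bytes]) -> AsyncIterator[str]:
    """Stand-in recognizer: emits one 'transcript' per utterance."""
    buffered = 0
    async for frame in frames:
        if frame:                      # non-empty frame: speech audio
            buffered += len(frame)
        elif buffered:                 # empty frame: end of utterance
            yield f"utterance of {buffered} bytes"
            buffered = 0


async def fake_llm(transcripts: AsyncIterator[str]) -> AsyncIterator[str]:
    """Stand-in LLM: produces one reply per final transcript."""
    async for text in transcripts:
        yield f"You said: {text}"


async def fake_tts(replies: AsyncIterator[str],
                   chunk_size: int = 8) -> AsyncIterator[bytes]:
    """Stand-in TTS: streams each reply back in small chunks."""
    async for reply in replies:
        audio = reply.encode()         # placeholder for synthesized audio
        for i in range(0, len(audio), chunk_size):
            yield audio[i:i + chunk_size]


async def run_call(frames: AsyncIterator[bytes]) -> list:
    """Wire the stages together; a real bridge would send chunks as RTP."""
    out = []
    async for chunk in fake_tts(fake_llm(fake_asr(frames))):
        out.append(chunk)
    return out
```

Because every stage is a stream, the first TTS chunk can leave before the reply is fully synthesized, which is where most of the latency budget is won or lost.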

Production-Ready Baseline

Kamailio/OpenSIPS on the edge (multi-tenant SIP routing, rate limits).
FreeSWITCH pool (B2BUA, conferencing, DTMF, recording).
Janus for WebRTC ingress (supervisor dashboard, browser monitoring).
Media Gateway: GStreamer services (separate pods for transcode; attach to RTP from FreeSWITCH).
ASR/TTS: pick 1-2 providers; build an abstraction; enable adaptive fallback.
Conversation service: TypeScript/Node with function-calling LLM, state machine, idempotent tool calls, per-call memory store (Redis).
Observability: call-id traced across services; jitter/packet-loss stats from RTP; transcripts stored with retention policy.
Compliance: STIR/SHAKEN, call recording prompts, PII redaction, secure key vault.

Concrete Tech Choices

SIP trunk: Bandwidth or Telnyx
SBC: Kamailio (optional at start)
B2BUA: FreeSWITCH
Media: GStreamer service (RTP in/out, PCM, WebSocket to ASR/TTS)
ASR: Deepgram streaming (or Google)
LLM: Your favorite function-calling model (or Rasa/Dialogflow CX)
TTS: Azure Neural or ElevenLabs (streamed)
Server: Node.js (or Python) orchestrator + Redis
Infra: Docker + K8s; Prometheus + Grafana; Loki/ELK for logs

Checklist of what to avoid

Not bounding audio size -> TTS/ASR latency spikes. Keep 16 kHz mono PCM, short chunks.
No barge-in -> users talk over TTS, you miss user input.
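Starting with 8 kHz -> okay for PSTN (G.711 is 8 kHz narrowband), but it can hurt ASR quality; 16 kHz is a sweet spot if your media path allows a wideband codec such as G.722 or Opus, since upsampling narrowband audio does not add detail.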
Reusing TTS connections poorly -> high first-byte delay. Use persistent connections or warm pools.
No jitter buffer -> choppy audio. Use jitter buffer in your media pipeline.
Single provider -> add fallback ASR/TTS for resiliency.
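
On the jitter buffer item: GStreamer and FreeSWITCH both ship one, so you rarely need to write your own, but a toy version shows what it does. This sketch reorders packets by sequence number and reports gaps as `None`. It deliberately ignores 16-bit sequence wraparound and adaptive depth, and the `JitterBuffer` class is illustrative only.

```python
import heapq
from typing import Optional


class JitterBuffer:
    """Fixed-depth reordering buffer keyed by RTP sequence number."""

    def __init__(self, depth: int = 3):
        self.depth = depth               # packets to collect before playout
        self._heap = []                  # min-heap of (sequence, payload)
        self._next_seq: Optional[int] = None

    def push(self, seq: int, payload: bytes) -> None:
        if self._next_seq is not None and seq < self._next_seq:
            return                       # arrived after its slot was played
        heapq.heappush(self._heap, (seq, payload))

    def pop(self) -> Optional[bytes]:
        """Call once per playout tick (for example every 20 ms).

        Returns the next payload in order, or None while pre-buffering or
        when the expected packet is missing (play silence or concealment).
        """
        if self._next_seq is None:
            if len(self._heap) < self.depth:
                return None              # still filling the buffer
            self._next_seq = self._heap[0][0]
        while self._heap and self._heap[0][0] < self._next_seq:
            heapq.heappop(self._heap)    # drop duplicates of played packets
        payload = None
        if self._heap and self._heap[0][0] == self._next_seq:
            payload = heapq.heappop(self._heap)[1]
        self._next_seq += 1
        return payload
```

Depth trades latency for smoothness: three 20 ms packets add about 60 ms, which comes straight out of the end-to-end budget.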

"What stack would you recommend?"

If you want minimal ops: Telnyx/SignalWire (for call control + media webhooks) + Node.js service that streams audio to Deepgram (ASR), uses an LLM with function calling, and streams ElevenLabs/Azure TTS back to the call. Add barge-in. Done.
If you want full control: FreeSWITCH (B2BUA) + GStreamer media bridge (RTP<->PCM) + Deepgram/Google ASR + LLM + Azure/ElevenLabs TTS. Wrap with a Node/Python orchestrator and add SBC (Kamailio) later.

Happy to sketch sample code for the media bridge (GStreamer or aiortc), or a FreeSWITCH dialplan snippet that forks RTP to your service. If you share more constraints (on-prem vs cloud, HIPAA/PCI, expected concurrency), I can tighten the blueprint.

0 replies
