Voice

Configure AI voice calls — inbound, outbound, voice models, speech recognition, interruption handling, and troubleshooting.

Intermediate
18 min read

Voice is how your AI speaks and listens on a real phone call or in a browser tab. The same workflow that runs your SMS and web chats runs your voice calls — but instead of reading text, the caller hears a synthetic voice, and instead of typing, they speak. Gravity Rail's voice stack handles speech recognition, low-latency AI responses, natural barge-in, and call routing end-to-end.

This guide covers everything you need to know as a workspace admin to enable voice, pick the right model and voice, and keep calls flowing cleanly.

Heads up — see also Phone & Voice. Phone & Voice is the quick-start for phone numbers, SMS, and work modes. This guide is the deep dive on the voice side: models, speech recognition, interruption behavior, and troubleshooting.

What Voice Does

When a caller reaches your number (or clicks the mic on a site), Gravity Rail:

  1. Answers the call — through Twilio for phone calls, or directly via WebSocket for browser voice.
  2. Plays a greeting (optional) — a TTS greeting with optional consent acknowledgment.
  3. Listens — streaming audio to a speech-to-text (STT) model or a native realtime model.
  4. Thinks — runs the current workflow task through the assigned AI model.
  5. Speaks — streams synthesized audio back to the caller in near-real time.
  6. Handles interruption — if the caller starts talking while the AI is speaking, the AI stops and listens.
  7. Persists the conversation — full transcripts saved to the chat, with a summary generated when the call ends.

The same AI agent can talk on the phone, on a website, or over WhatsApp — you pick the voice and model per agent, and the phone number or site decides which agent picks up.

Choosing a Voice Stack

Gravity Rail supports two broad architectures. Pick one when you configure your agent:

Architecture | How it works | When to use
Native realtime model | A single model (OpenAI GPT Realtime, Gemini Live, Nova Sonic, Grok Voice) does STT + LLM + TTS in one WebSocket. | Lowest latency; natural turn-taking; best for voice-first products.
Pipeline model | Separate STT (Deepgram / ElevenLabs / xAI), any chat LLM, and separate TTS (ElevenLabs / Polly / OpenAI / Google / Deepgram / xAI / Pocket). | Lets you mix and match, e.g. Claude for reasoning with ElevenLabs for voice. More tuning knobs.

Native realtime models are simpler to configure but offer fewer voice choices. Pipeline mode gives you access to every voice Gravity Rail supports (100+ voices across 7 TTS vendors) with any chat model you like.

Enabling Voice on a Workspace

Voice is enabled per phone number and per site. There's no workspace-wide "voice on/off" switch — if you have a phone number with Enable Voice on and a workflow connected, voice is live.

Prerequisites

  • An Agent with a voice configured (see People → Agents).
  • A Workflow that the agent will run on the call.
  • Either a Phone Number (for PSTN calls) or a Site with voice enabled (for browser calls).

If your org doesn't have any phone numbers yet, an org owner needs to purchase one first from the Organization's Phone Numbers tab. See Phone & Voice for the walkthrough.

Configuring Inbound Calls

Inbound calls are the common case: someone dials your number and your AI picks up.

1. Connect a phone number to a workflow

  1. Go to Channels → Phone Numbers.
  2. Open a phone number (or add one).
  3. Set:
    • Enable Voice → on.
    • Default Workflow → the workflow the agent will run on the call.
    • Brand Name → how the AI introduces itself ("Thanks for calling Acme Clinic").

Save. The first inbound call will be answered by your agent within a few seconds.

2. Pick a work mode

The work mode decides when the AI answers vs. routing the call elsewhere:

Mode | Behavior
default | AI always answers.
forward_off_hours | AI during business hours, forward to a human number after.
message_off_hours | AI during business hours, plays a recorded message after.
always_forward | Never let AI answer; forward every call.
always_message | Always play a recorded message and hang up; never let AI answer.
voicemail | Always take a voicemail (recorded + transcribed).
voicemail_off_hours | AI during hours, voicemail after.

Business hours are configured in Settings → Workspace Settings. The phone number uses those hours to decide what "off-hours" means.
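The routing decision in the table can be sketched as a small function. This is a minimal illustration, not Gravity Rail's implementation: the function names, the return labels, and the fixed 9:00 to 17:00 weekday window are assumptions, and real business hours come from Workspace Settings.

```python
from datetime import datetime, time

# Hypothetical labels for what ends up handling the call.
AI, FORWARD, MESSAGE, VOICEMAIL = "ai", "forward", "message", "voicemail"


def in_business_hours(now: datetime,
                      open_at: time = time(9, 0),
                      close_at: time = time(17, 0)) -> bool:
    """Assumed schedule: Monday to Friday, open_at (inclusive) to close_at (exclusive)."""
    return now.weekday() < 5 and open_at <= now.time() < close_at


def route_call(work_mode: str, now: datetime) -> str:
    """Map a work mode from the table above to a call handler."""
    always = {
        "default": AI,
        "always_forward": FORWARD,
        "always_message": MESSAGE,
        "voicemail": VOICEMAIL,
    }
    off_hours = {
        "forward_off_hours": FORWARD,
        "message_off_hours": MESSAGE,
        "voicemail_off_hours": VOICEMAIL,
    }
    if work_mode in always:
        return always[work_mode]
    if work_mode in off_hours:
        # The AI answers during business hours; the fallback applies after.
        return AI if in_business_hours(now) else off_hours[work_mode]
    raise ValueError(f"unknown work mode: {work_mode}")
```

For example, `route_call("forward_off_hours", datetime(2024, 1, 6, 10, 0))` falls on a Saturday, so under the assumed schedule the call is forwarded rather than answered by the AI.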

3. Greeting and consent

Some workflows, especially in healthcare or regulated industries, need to announce that the call is recorded or that an AI is answering. Configure this on the phone number:

  • voice_greeting_message — TTS text the caller hears before being connected to the AI. For example: "Thanks for calling Acme Clinic. This call may be recorded and is being answered by our AI assistant."
  • voice_require_consent — when on, the greeting plays inside a gather element. The caller must press 1 or say "yes" to continue. No response → a polite "No response received. Goodbye." and the call hangs up.

The greeting uses your configured TTS voice, so it sounds like the same agent that'll take the call. The consent flow lives in lib/routes/integrations/twilio/voice_consent.py if you need to trace the exact behavior.
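The consent gate described above can be sketched as TwiML built with Python's standard library. This is an illustration only: the real flow lives in voice_consent.py, which is not shown here, and the function names, the 5-second timeout, and the action URL are assumptions. Gather, Say, and Hangup are standard TwiML verbs, and Digits and SpeechResult are the parameters Twilio posts to the action URL.

```python
import xml.etree.ElementTree as ET


def consent_twiml(greeting: str, action_url: str) -> str:
    """Build a TwiML response that plays the greeting inside a Gather."""
    response = ET.Element("Response")
    gather = ET.SubElement(response, "Gather", {
        "input": "dtmf speech",   # accept a key press or a spoken answer
        "numDigits": "1",
        "timeout": "5",           # assumed value, in seconds
        "action": action_url,     # hypothetical endpoint that receives the answer
    })
    ET.SubElement(gather, "Say").text = greeting
    # Twilio only reaches these verbs when the Gather ends without input.
    ET.SubElement(response, "Say").text = "No response received. Goodbye."
    ET.SubElement(response, "Hangup")
    return ET.tostring(response, encoding="unicode")


def caller_consented(digits: str = "", speech_result: str = "") -> bool:
    """Treat a pressed 1 or a spoken 'yes' as consent."""
    spoken = speech_result.strip().lower().rstrip(".!")
    return digits == "1" or spoken == "yes"
```

The fall-through after the Gather is what produces the "No response received. Goodbye." hang-up when the caller stays silent.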

4. Anonymous callers

By default, an incoming call from a number that isn't a workspace member is routed through a signup consent flow — the AI asks if the caller wants to join as a new member. Two knobs on each phone number control this:

  • allow_anonymous → when on, unknown callers are admitted as "anonymous" members with a configured role (set anonymous_member_role_id). Useful for public support lines where you don't want to create a real account per caller.
  • allow_signup → when on (and allow_anonymous is off), unknown callers are asked for consent to be signed up as members. If they agree, an SMS is sent with a signup link.

If both are off, unknown callers are politely rejected. This is the right choice for private, member-only voice lines.
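The precedence between the two flags can be written out as a short decision function. This is a sketch under assumptions: the dataclass and the returned labels are hypothetical, and only the field names allow_anonymous, allow_signup, and anonymous_member_role_id come from this guide.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PhoneNumberFlags:
    """Hypothetical container for the per-number settings named above."""
    allow_anonymous: bool
    allow_signup: bool
    anonymous_member_role_id: Optional[int] = None


def admit_unknown_caller(flags: PhoneNumberFlags) -> str:
    """Decide what happens to a caller who is not a workspace member."""
    if flags.allow_anonymous:
        # Admitted as an anonymous member with the configured role.
        return "anonymous"
    if flags.allow_signup:
        # Asked for consent; an SMS signup link follows if they agree.
        return "signup_consent"
    # Both flags off: a private, member-only line.
    return "reject"
```

Note that allow_anonymous wins when both flags are on, matching the "and allow_anonymous is off" condition in the list above.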

Configuring Outbound Calls

Outbound (the AI calls out) is driven by the Phone Call action on a workflow. There is no "dial this number" button — outbound calls are always triggered by automation.

  1. Create a Phone Call action that places a call from one of your phone numbers.
  2. Trigger the action from an event rule, a schedule, or a workflow step.
  3. The outbound call uses the same agent + workflow wiring as inbound calls; the only difference is who dialed whom.

Typical triggers

  • An appointment is 24 hours away → place a reminder call.
  • A form field changes state (e.g. lab result marked abnormal) → call the patient.
  • A scheduled campaign runs → dial a list of members sequentially.

Outbound calls are billed per-minute just like inbound. See Analytics & Usage Reports for the per-number and per-agent cost breakdown.

Voice Models & TTS

Voices are picked per agent. Every voice maps to a TTS model (the engine that synthesizes speech) and a voice name (the specific speaker).

Go to People → Agents, pick your agent, and open the Voice section. You'll see a list of voices grouped by provider. Listen to previews and pick one.

Voice providers

Gravity Rail supports seven TTS providers out of the box. Each has trade-offs:

Provider | Voices | Strengths | Notes
ElevenLabs | ~30 multilingual voices (Rachel, Adam, Bella, etc.) | Most human-sounding; 30+ languages; emotion control. | Highest quality; pipeline-only (not used with native realtime models).
AWS Polly | ~60 voices across generative and neural engines (Ruth, Matthew, Joanna, etc.) | Reliable AWS infrastructure; predictable pricing; strong multi-language. | Uses polly-neural or polly-generative depending on voice.
OpenAI TTS | 11 voices (Alloy, Ash, Ballad, Coral, Echo, Sage, Shimmer, Verse, Marin, Nova, Onyx) | Very natural conversational tone; low latency. | Same voices also available on OpenAI Realtime.
Google Chirp | Puck, Kore, Charon, Fenrir, Aoede, Leda, Orus, Zephyr | High-quality; good multilingual support. | Used via Vertex AI or AI Studio.
Deepgram Aura 2 | Thalia, Asteria, Luna, Arcas, Perseus, and others | Low-latency; designed for real-time. | Good fit for pipeline mode where STT is also Deepgram.
xAI | Eve, Ara, Rex, Sal, Leo | | Paired with Grok Voice Agent.
Pocket TTS | One default voice | Ultra-low-latency internal model for testing. |

The full voice catalog lives in lib/voice/voice_models.py. Each VoiceModelMetadata entry tells the frontend which TTS model IDs work with which voices — the agent UI filters the list based on what you've selected.
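The filtering idea can be illustrated with a toy catalog. Everything in this sketch is hypothetical: the real VoiceModelMetadata fields are not shown in this guide, so VoiceEntry, its attributes, and the placeholder voice and model names are stand-ins chosen only to show the lookup.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VoiceEntry:
    """Hypothetical stand-in for a VoiceModelMetadata entry."""
    voice: str
    provider: str
    tts_model_ids: tuple  # TTS model IDs this voice works with


# Placeholder data; not the real catalog.
CATALOG = [
    VoiceEntry("voice_a", "provider_x", ("model-1",)),
    VoiceEntry("voice_b", "provider_y", ("model-2", "model-3")),
    VoiceEntry("voice_c", "provider_y", ("model-3",)),
]


def voices_for_model(catalog: list, tts_model_id: str) -> list:
    """Return the voice names compatible with the selected TTS model."""
    return [entry.voice for entry in catalog
            if tts_model_id in entry.tts_model_ids]
```

With this data, selecting "model-3" narrows the list to voice_b and voice_c, which is the kind of narrowing the agent UI performs when you change the TTS model.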

Native realtime voices

If your agent uses a realtime model (OpenAI GPT Realtime, Gemini Live, Nova Sonic, Grok Voice Agent), the voice is baked into the model configuration, not picked from the TTS catalog:

  • OpenAI GPT Realtime: Alloy, Ash, Ballad, Coral, Sage, Verse
  • Gemini Live: Puck, Kore, Charon, Fenrir (and more)
  • Nova Sonic: Matthew, Tiffany
  • Grok Voice: Eve, Ara, Rex, Sal, Leo

These voices skip the separate TTS step entirely — the model emits audio directly, which keeps latency low (~300ms end-to-end). See Realtime Models for the full matrix.

Switching voices mid-conversation

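You generally can't change voices within a single call: the voice is locked to the agent config at the start of the session. The exception is ElevenLabs language detection, which can switch the voice when the caller changes languages (see Language detection). If you want different voices for different use cases (support vs. sales), use different agents and route them with workflows.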

Speech Recognition (STT)

When you use a pipeline model (anything except a native realtime model), speech recognition happens separately from the LLM. Gravity Rail supports three STT providers:

Provider | Model | Strengths
Deepgram | nova-3, flux | Fastest streaming; strong on medical/technical vocabulary; keyterm prompting.
ElevenLabs Scribe | scribe_v1 | Strong multilingual; language detection; good for international callers.
xAI | Built into Grok Voice | 16kHz native; server-side transcription only.

Deepgram is the default and what we recommend for most workspaces. Its flux model is purpose-built for low-latency streaming with server-side end-of-turn (EOT) detection.

Keyterm prompting

If your workspace handles unusual vocabulary — drug names, procedure codes, product SKUs — you can give Deepgram a list of keyterms to bias transcription toward. Two sources feed the keyterm list:

  • Pronunciation terms configured on your workspace (account or org level). These flow through load_pronunciation_terms() in lib/voice/utils/pronunciation.py.
  • Manual keyterms set in the agent's STT config.

Both lists are merged and passed to Deepgram. Keyterm prompting is free — no reason not to use it if you have a known vocabulary.
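Merging the two sources might look like the sketch below. The exact merge rules are not documented here, so the whitespace trimming, case-insensitive de-duplication, and the function name are assumptions made for illustration.

```python
def merge_keyterms(pronunciation_terms: list, manual_keyterms: list) -> list:
    """Combine both keyterm sources, keeping first-seen order and dropping duplicates."""
    merged = []
    seen = set()
    for term in [*pronunciation_terms, *manual_keyterms]:
        cleaned = term.strip()
        key = cleaned.casefold()  # assumed: duplicates are compared case-insensitively
        if cleaned and key not in seen:
            seen.add(key)
            merged.append(cleaned)
    return merged


terms = merge_keyterms(
    ["Metformin", "CPT 99213"],        # e.g. workspace pronunciation terms
    ["metformin ", "SKU-4471", ""],    # e.g. manual keyterms on the agent
)
# terms == ["Metformin", "CPT 99213", "SKU-4471"]
```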

Language detection

ElevenLabs Scribe supports automatic language detection — useful when callers might speak any of several languages. Enable it by setting languageDetection: true in the agent's STT config. When on, the LanguageDetector in lib/voice/pipeline/language_switching.py watches the detected language per utterance and can switch the voice mid-call if the caller changes languages.

For native realtime models, language detection is handled by the model itself — you just instruct the agent to respond in the caller's language.

Barge-in & Interruption

Natural conversation means people interrupt each other. The voice pipeline is built around barge-in — the moment the caller starts speaking, the AI stops mid-sentence and listens.

How it works (pipeline mode)

  1. TTS audio is streaming to the caller through PlaybackController in lib/voice/pipeline/playback_controller.py.
  2. STT detects the caller speaking via VAD (voice activity detection).
  3. The pipeline enters the INTERRUPTED state — see the state machine in lib/voice/pipeline/state_machine.py.
  4. AudioPlaybackTracker records exactly how much audio the caller actually heard before being cut off (the "truncation offset").
  5. The AI's in-progress message is persisted with a barge_in truncation reason so the next turn has the right context — the AI knows the caller didn't hear the rest.
  6. A new turn starts; the AI listens.
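Steps 4 and 5 can be sketched as trimming the AI's message to what was actually played. This is a simplified illustration: AudioPlaybackTracker's real interface is not shown in this guide, so SpokenSegment, the per-segment durations, and the returned field names are hypothetical. Only the barge_in truncation reason comes from the text above.

```python
from dataclasses import dataclass


@dataclass
class SpokenSegment:
    """Hypothetical unit of synthesized speech with its audio duration."""
    text: str
    duration_ms: int


def truncate_at_barge_in(segments: list, played_ms: int) -> dict:
    """Keep only the segments the caller heard in full before interrupting."""
    heard = []
    elapsed = 0
    for segment in segments:
        if elapsed + segment.duration_ms > played_ms:
            break  # this segment was cut off mid-playback
        heard.append(segment.text)
        elapsed += segment.duration_ms
    return {
        "content": " ".join(heard),
        "truncation_reason": "barge_in",
        "truncation_offset_ms": played_ms,
    }


message = truncate_at_barge_in(
    [SpokenSegment("Your appointment is on Tuesday.", 1800),
     SpokenSegment("Would you like a reminder?", 1500)],
    played_ms=2000,
)
# message["content"] == "Your appointment is on Tuesday."
```

Persisting the trimmed content, rather than the full generated text, is what lets the next turn reason about what the caller did and did not hear.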

How it works (native realtime mode)

Each realtime provider has its own barge-in logic. OpenAI and Gemini detect interruption in-model and emit a cancel/interrupt event. Gravity Rail's socket managers translate those events into the same ServerInterruptPlaybackEvent that the pipeline uses, so the rest of the stack stays consistent.
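The translation step can be pictured as a small normalizer. The sketch below is an assumption-heavy illustration: the dataclass is a stand-in for the real ServerInterruptPlaybackEvent, the provider keys are made up, and the payload checks are simplified. They are modeled on OpenAI's `input_audio_buffer.speech_started` server event and on the `interrupted` flag in Gemini Live's `serverContent` messages; which provider signals Gravity Rail's socket managers actually key on is not documented here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ServerInterruptPlaybackEvent:
    """Hypothetical shape; the real event class is not shown in this guide."""
    provider: str


def to_interrupt_event(provider: str, payload: dict) -> Optional[ServerInterruptPlaybackEvent]:
    """Return a unified interrupt event, or None if the payload is not an interruption."""
    if provider == "openai":
        if payload.get("type") == "input_audio_buffer.speech_started":
            return ServerInterruptPlaybackEvent(provider)
    elif provider == "gemini":
        if payload.get("serverContent", {}).get("interrupted"):
            return ServerInterruptPlaybackEvent(provider)
    return None
```

Downstream code then handles one event type regardless of which realtime provider detected the interruption.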

When barge-in feels wrong

  • AI gets cut off by its own echo (on phone): usually means Twilio's echo cancellation is degraded — check for packet loss or a bad SIP trunk. This is not something the app can fix; escalate to the carrier.
  • AI doesn't stop when the caller speaks: VAD sensitivity is too low. If you're on native realtime, the turnDetection config on the agent controls this. For Deepgram, tune eot_threshold or vad_silence_threshold.
  • AI thinks it was interrupted when it wasn't (phantom barge-in): VAD is too hot — background noise is triggering it. Raise the VAD threshold or, on a pipeline model, bump min_speech_duration_ms.

Web Voice

Voice on a site works like phone voice, but the browser is the carrier. To enable it:

  1. Open a site under Channels → Sites.
  2. Enable voice in the site's settings.
  3. Attach a workflow with a voice-enabled agent.

Visitors click a microphone button, grant mic permission, and stream PCM16 audio over a WebSocket to /workspace/stream-ws. The same pipeline handles the audio — no phone number needed. Browser voice is free of per-minute Twilio costs, so it's a great fit for self-service portals.

Because the browser can send richer audio (24kHz vs. Twilio's 8kHz μ-law), voice quality is usually noticeably better on web than on the phone.
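The difference between the two formats is easy to quantify. The sketch below assumes 20ms audio frames and a helper name of our own; the sample rates and encodings (PCM16 at 24kHz, μ-law at 8kHz with one byte per sample) are the ones named above.

```python
def bytes_per_frame(sample_rate_hz: int, bytes_per_sample: int, frame_ms: int = 20) -> int:
    """Size of one audio frame in bytes for a mono stream."""
    return sample_rate_hz * bytes_per_sample * frame_ms // 1000


def max_audio_frequency_hz(sample_rate_hz: int) -> int:
    """Highest frequency a sample rate can represent (the Nyquist limit)."""
    return sample_rate_hz // 2


web_frame = bytes_per_frame(24_000, 2)    # PCM16: 2 bytes per sample -> 960 bytes
phone_frame = bytes_per_frame(8_000, 1)   # mu-law: 1 byte per sample -> 160 bytes

web_bandwidth = max_audio_frequency_hz(24_000)    # 12000 Hz
phone_bandwidth = max_audio_frequency_hz(8_000)   # 4000 Hz
```

The phone path tops out at 4kHz of audio bandwidth, which is why sibilants and similar-sounding consonants are harder to distinguish on PSTN calls than in the browser.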

Call Summaries & Transcripts

Every call generates a full transcript — saved alongside the chat in the same conversation view your team already uses for SMS and email. Open the chat, and you'll see:

  • Each turn with speaker attribution.
  • Tool calls the AI made during the conversation.
  • Any data the AI collected in forms.

When a call ends, a background Temporal activity generates a short AI-written summary of the call and saves it to the chat record. You don't have to wait — the summary shows up a few seconds after the call hangs up. Details of the summarization pipeline live in lib/voice/persistence.py and the chat-summary Temporal activity.

Audio recording is controlled by a workspace-level setting (enable_audio_recording on Workspace Settings) and is off by default. When enabled, it applies workspace-wide — you can't opt in or out per phone number today. Per-number granularity is on the roadmap. Transcripts are always persisted regardless of the audio-recording toggle.

To turn recording on and play calls back inside chat (including PHI safeguards), see Call Recordings.

Troubleshooting

"The caller said something but the AI didn't respond."

  • Check the chat in Gravity Rail. If the STT transcript is missing or garbled, it's a speech recognition issue — consider switching STT provider or adding keyterms.
  • Check Twilio console for media stream errors. If the WebSocket disconnected mid-call, the phone call record's status will be ERROR (see lib/models/workspace/phone_number.py for the enum).
  • VAD threshold too high: the caller's speech isn't crossing the detection threshold. Lower vad_threshold (ElevenLabs STT) or eot_threshold (Deepgram flux).

"The AI's voice sounds robotic or choppy."

  • Network latency: the most common cause. Pipeline mode is more sensitive than native realtime because it adds STT → LLM → TTS hops. Switch to a native realtime model if you can't fix the network.
  • Wrong sample rate: if you're seeing audio artifacts on Twilio calls specifically, the 8kHz μ-law conversion may be misconfigured. Every Twilio call must be 8kHz μ-law on the wire — the pipeline upsamples internally.
  • TTS provider is overloaded: ElevenLabs occasionally queues requests during peak hours. Polly is more predictable under load.

"The AI keeps interrupting itself / talking over the caller."

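  • If the AI cuts itself off mid-sentence, barge-in is too aggressive. See the Barge-in section for tuning.
  • If the AI starts talking before the caller has finished, end-of-turn detection is firing too early. On OpenAI Realtime, check turn_detection.silence_duration_ms: the default is ~500ms, which can misfire in noisy environments or when callers pause mid-thought. Raise it to 800–1000ms for phone calls.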
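As a concrete reference, a server-VAD turn detection block for OpenAI Realtime looks roughly like the dictionary below. The field names and default values follow OpenAI's documented server_vad settings; the helper function and the choice of 900ms are illustrative assumptions, and how the agent's turnDetection config maps onto this block in Gravity Rail is not shown here.

```python
# OpenAI's documented server_vad defaults.
DEFAULT_TURN_DETECTION = {
    "type": "server_vad",
    "threshold": 0.5,
    "prefix_padding_ms": 300,
    "silence_duration_ms": 500,
}


def tuned_for_phone(turn_detection: dict, silence_duration_ms: int = 900) -> dict:
    """Return a copy that waits longer before treating silence as end of turn."""
    if silence_duration_ms <= 0:
        raise ValueError("silence_duration_ms must be positive")
    return {**turn_detection, "silence_duration_ms": silence_duration_ms}


phone_turn_detection = tuned_for_phone(DEFAULT_TURN_DETECTION)
# phone_turn_detection["silence_duration_ms"] == 900
```

A longer silence window makes the AI slightly slower to respond, so test the value on real calls rather than picking the maximum.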

"The AI answered but hung up immediately."

  • Session setup error: check the voice route logs for validation errors (invalid realtime model, missing workflow, etc.).
  • Consent flow timed out: if voice_require_consent is on and the caller didn't respond, the flow hangs up deliberately with "No response received. Goodbye."
  • Off-hours + message_off_hours mode: the call was outside business hours, so the recorded message played and the call ended — this is the intended behavior.

"Calls work on phone but not on the web site."

  • Mic permission denied in the browser — visitors need to grant microphone access. The site widget will prompt, but some browsers block it by default.
  • JWT token expired — web calls auth via a short-lived token; if it expires between page load and the mic button being clicked, the WebSocket will refuse the connection. Refresh the page.
  • CORS / WSS misconfigured on a custom domain — custom site domains need the WebSocket upgrade path allowed through whatever proxy or CDN sits in front of the site.
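When debugging the expired-token case, it can help to inspect the token's expiry directly. The sketch below assumes the web token is a standard three-part JWT carrying an `exp` claim (seconds since the epoch, per RFC 7519), which this guide does not confirm. It decodes the payload without verifying the signature, so it is suitable for diagnostics only, never for authorization.

```python
import base64
import json
import time
from typing import Optional


def seconds_until_expiry(token: str, now: Optional[float] = None) -> float:
    """Read the exp claim from a JWT payload and return the seconds remaining."""
    payload_b64 = token.split(".")[1]
    # JWT segments are base64url without padding; restore it before decoding.
    padded = payload_b64 + "=" * (-len(payload_b64) % 4)
    payload = json.loads(base64.urlsafe_b64decode(padded))
    current = time.time() if now is None else now
    return payload["exp"] - current
```

A negative result means the token expired before the mic button was clicked, which matches the refused WebSocket connection described above.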

"The voice changed partway through the call."

  • Language detection switched it — if ElevenLabs language detection is on and the caller changed languages, the voice may switch to a more appropriate one for that language. If this is unwanted, disable languageDetection in the agent's STT config.
  • Agent was reassigned mid-call (rare) — this shouldn't happen, but if a workflow transition changes agents between tasks, the new agent's voice takes over at the next turn.

"The caller was charged for a call that never actually connected."

This shouldn't happen — Gravity Rail only logs usage once the call reaches STARTED status. If you see phantom billing entries, file an issue with the phone call UUID and we'll investigate. Billing is driven by Resource.PHONE_CALL usage in lib/services/ and cross-checked against Twilio's call records.

"Transcripts are missing or partial."

  • Call dropped before finalization: if the WebSocket closed abnormally, some audio that was in-flight may not have been transcribed. Check the chat — look for a message with an ERROR status.
  • STT provider outage: rare, but if Deepgram or ElevenLabs had a regional incident, transcription may have failed for a window. Check your provider's status page.

Common Setups

24/7 clinical triage line

  • Work mode: default (AI always answers)
  • Consent: voice_require_consent = true, greeting announces recording + AI
  • Agent: native realtime model (OpenAI GPT Realtime) for lowest latency
  • Voice: Coral or Sage (warm, professional)
  • Abilities: Calendar Booking, Forward Call (to nurse line), Hang Up
  • Anonymous callers: allow_signup = true → unknown callers sign up via SMS consent

Outbound appointment reminders

  • Workflow: reminder flow with Phone Call action
  • Schedule: event rule 24h before appointment
  • Voice: ElevenLabs Rachel (warm, familiar)
  • STT: Deepgram with keyterms for your clinic's vocabulary
  • Outcome tracking: workflow collects confirmation into a form field

Multilingual inbound support

  • STT: ElevenLabs Scribe with languageDetection: true
  • Voice: Google Chirp (strong multilingual)
  • Agent instructions: "Respond in the language the caller uses."
  • Fallback: if detection misfires, the caller can say "English please" and the agent will switch.

After-hours voicemail

  • Work mode: voicemail_off_hours
  • Business hours: 9am–5pm weekdays
  • Voicemail transcription: auto-transcribed and saved to the chat
  • Notification: event rule on new voicemail → Slack notification to on-call staff

Tips

  • Always test with a real phone — the browser mic doesn't exhibit the same quirks as a cellular call. Dial your number from a real phone before you go live.
  • Pick the shortest plausible greeting — every second of greeting is a second before the caller can talk. Consent prompts in particular should be as short as legally allowed.
  • Use pronunciation terms if you have non-standard vocabulary — it's the fastest way to improve transcription quality.
  • Keep instruction prompts tight — realtime models are especially sensitive to prompt length; long system prompts increase time-to-first-audio.
  • Monitor the Analytics dashboard for call duration, token usage, and failure rate trends. Sudden changes usually signal an upstream provider issue.