Prompting voice agents

A working integration and a good-sounding voice agent are different problems. The integration is what the rest of these docs cover. This page is about the instructions string you send in session.configure.

The minimum useful prompt

You are a warm, concise voice assistant. Reply in one or two short sentences.

Three things this gets right:

  1. Voice-first framing — “voice assistant”, not “AI”, not “chatbot”. Sets the persona toward spoken language.
  2. Length discipline — “one or two short sentences”. Without this, the model writes paragraphs and TTS plays them at length — fine for chat, terrible on a phone call.
  3. Warm tone — single-word style cue. The model carries it through prosody.

Don’t micromanage prosody in prose. Hydra adapts tone from context. Telling the model “speak slowly and carefully and pause between thoughts” mostly produces text that says “slowly and carefully and pause” rather than changing how it sounds. Shape the content and length; let Hydra handle delivery.

Anti-patterns

| Don’t | Why |
| --- | --- |
| “Be helpful and answer the user’s question accurately” | Generic. Drives Hydra toward chat-style answers. Be specific about voice. |
| “Format your response as a numbered list” | Numbered lists sound robotic when spoken. Phrase as “First, … Then, …” instead. |
| “Use bullet points for clarity” | There are no bullet points in speech. The model will say the word “bullet”. |
| “Provide as much detail as possible” | The opposite of what you want in voice. Constrain output length explicitly. |
| Long persona backstories | The model occasionally drifts into reciting the backstory. Keep persona to one or two sentences. |
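
The phrase-level anti-patterns above are easy to catch mechanically before a prompt ships. The following is a minimal sketch, not a Hydra feature: the function name `lint_voice_prompt` and the phrase list are assumptions made for illustration, and the list only covers the quoted phrases from the table.

```python
import re

# Hypothetical phrase list derived from the anti-patterns table above.
# Keys are regex patterns; values are the reason the phrase hurts a voice prompt.
ANTI_PATTERNS = {
    r"\bnumbered list\b": "Numbered lists sound robotic when spoken.",
    r"\bbullet points?\b": "There are no bullet points in speech.",
    r"\bas much detail as possible\b": "Constrain output length explicitly.",
}


def lint_voice_prompt(prompt: str) -> list[str]:
    """Return one warning for each anti-pattern phrase found in the prompt."""
    warnings = []
    for pattern, reason in ANTI_PATTERNS.items():
        if re.search(pattern, prompt, flags=re.IGNORECASE):
            warnings.append(reason)
    return warnings
```

An empty list means none of the listed phrases were found. It does not mean the prompt is good, only that it avoids these specific wordings.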

Turn-taking discipline

Hydra handles turn detection automatically, but the prompt still shapes how the model behaves around interruption.

You are a phone agent. If the user is mid-sentence, wait for them to finish.
Pause naturally between thoughts so the user can interject.
Never repeat yourself if interrupted — pick up where you left off.

This is more effective than relying on the model’s defaults, especially in noisy environments.

Tool use prompting

When you declare tools, also tell the model when to use them.

You are a weather assistant. When asked about weather conditions, use the
get_weather tool with the city name. Don't guess — call the tool every time.
If the tool returns an error, apologise and ask the user for a different city.

Without explicit instruction, the model sometimes answers from priors instead of calling the tool. Be direct.
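
One way to keep tool guidance from drifting away from the tool declarations is to generate the prompt text from the same data. The sketch below assumes a hypothetical `ToolHint` record and `tool_instructions` helper, neither of which is part of the product. It only assembles a string in the style of the weather example above.

```python
from dataclasses import dataclass


@dataclass
class ToolHint:
    """Hypothetical record pairing a declared tool with usage guidance."""
    name: str       # tool name exactly as declared to the model
    when: str       # situation that should trigger the call
    on_error: str   # what the agent should do if the call fails


def tool_instructions(persona: str, hints: list[ToolHint]) -> str:
    """Build an instructions string that tells the model when to call each tool."""
    sentences = [persona]
    for hint in hints:
        sentences.append(
            f"When {hint.when}, use the {hint.name} tool. "
            "Don't guess; call the tool every time."
        )
        sentences.append(f"If the tool returns an error, {hint.on_error}.")
    return " ".join(sentences)


weather_prompt = tool_instructions(
    "You are a weather assistant.",
    [ToolHint(
        name="get_weather",
        when="asked about weather conditions",
        on_error="apologise and ask the user for a different city",
    )],
)
```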

Greetings and generate_initial_response

Pair generate_initial_response: true with an explicit opening-line instruction:

You are a hotel concierge at the Grand Pacific. Open the call by greeting
the guest warmly in one short sentence, then ask how you can help.

Without a specific instruction, the model picks a generic opener. With it, you get the line you want.
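
As a rough illustration of pairing the two settings, the sketch below builds a configuration message as JSON. Only the names `session.configure`, `instructions`, and `generate_initial_response` come from this page. The envelope shape (a `type` key with flat fields) and the helper name `build_configure_message` are assumptions, so check the API reference for the real message format.

```python
import json


def build_configure_message(instructions: str, greet_first: bool) -> str:
    """Serialise a configuration message that pairs the prompt with the greeting flag."""
    # ASSUMPTION: the "type" key and flat field layout are illustrative only.
    message = {
        "type": "session.configure",
        "instructions": instructions,
        "generate_initial_response": greet_first,
    }
    return json.dumps(message)


concierge_message = build_configure_message(
    "You are a hotel concierge at the Grand Pacific. Open the call by greeting "
    "the guest warmly in one short sentence, then ask how you can help.",
    greet_first=True,
)
```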

Length and pacing

Voice users tolerate roughly one breath of latency between asking and hearing an answer. The model can’t make itself talk faster, but you can make it say less.

Default to one sentence. If the user asks for detail, take two. Three is too many.
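
To see whether a length instruction like this is holding, you can count sentences in logged replies. This is a rough sketch with assumed helper names (`count_sentences`, `is_too_long`). The splitter is deliberately naive and will miscount abbreviations such as "Dr. Smith", so treat the result as a signal rather than a measurement.

```python
import re

MAX_SENTENCES = 2  # from the guidance above: one by default, two when asked for detail


def count_sentences(reply: str) -> int:
    """Count sentences by splitting after ., ! or ? when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", reply.strip())
    return len([part for part in parts if part])


def is_too_long(reply: str) -> bool:
    """Flag replies that exceed the sentence budget for a voice turn."""
    return count_sentences(reply) > MAX_SENTENCES
```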

For long-form content (legal disclaimers, addresses, phone numbers), tell the model explicitly how to break it up:

When reading back a phone number, say each digit with a brief pause:
"Six… one… seven… nine…"
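
If you want the example line in the prompt to match real data, you can generate it rather than type it by hand. The sketch below uses a hypothetical helper, `spell_digits`, to turn a digit string into the spoken form shown above. It is an illustration only and ignores any character that is not an ASCII digit.

```python
DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]
ELLIPSIS = "\u2026"  # the single-character ellipsis used in the example above


def spell_digits(number: str) -> str:
    """Spell each ASCII digit as a word, followed by an ellipsis to suggest a pause."""
    words = [DIGIT_WORDS[int(ch)] for ch in number if ch in "0123456789"]
    if not words:
        return ""
    spoken = (ELLIPSIS + " ").join(words) + ELLIPSIS
    # Capitalise only the first letter so the result reads as a sentence start.
    return spoken[0].upper() + spoken[1:]
```

For the input `"6179"` this produces the same string as the quoted example in the prompt.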

Worked example: phone-banking concierge

You are Maya, a phone-banking assistant for Pacific Bank. Speak warmly and
concisely. One or two short sentences per turn.
Available tools:
- lookup_balance(account_id) — current balance
- lookup_recent_transactions(account_id, days) — list of transactions
Turn-taking:
- If the user interrupts, stop and listen. Don't repeat yourself.
- Pause naturally between sentences.
If the user asks anything outside banking, politely redirect:
"I can help with your accounts and recent transactions — is there
something specific I can look up?"