Real-Time Multimodal Agents

Your voice agent takes 6 seconds to respond. In that silence, the user has already assumed the call dropped, repeated themselves, or hung up. You've lost them.

Human conversations flow with turn-taking gaps of roughly 500ms. Cross the 800ms threshold, and the interaction starts to feel broken. Cross 2 seconds, and users begin talking over your agent, creating a feedback loop of confusion.

Real-time multimodal agents solve this by processing audio and video natively—no transcription bottleneck, no synthesis delay. They hear tone, see context, and respond before the awkward pause sets in.

Voice: Two Approaches

The Pipeline

Traditional voice assistants chain three separate systems: speech-to-text (STT), a language model (LLM), and text-to-speech (TTS).

Each stage adds latency. STT needs 300-500ms to transcribe. The LLM takes 500-2000ms to generate a response. TTS adds another 200-400ms to synthesize speech. Total: 1-3 seconds minimum. With network hops, often 3-5 seconds.
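The arithmetic is worth making explicit. A minimal sketch, using the per-stage ranges quoted above (the network figures are illustrative assumptions, not measurements):

```python
# Back-of-envelope latency budget for a cascaded STT -> LLM -> TTS pipeline.
# Per-stage ranges are the figures quoted in the text, not benchmarks.

PIPELINE_MS = {
    "stt": (300, 500),    # transcription
    "llm": (500, 2000),   # response generation
    "tts": (200, 400),    # speech synthesis
}

def latency_range_ms(stages, network_ms=(0, 0)):
    """Best- and worst-case end-to-end latency across sequential stages."""
    best = sum(lo for lo, _ in stages.values()) + network_ms[0]
    worst = sum(hi for _, hi in stages.values()) + network_ms[1]
    return best, worst

print(latency_range_ms(PIPELINE_MS))                        # (1000, 2900)
print(latency_range_ms(PIPELINE_MS, network_ms=(500, 2000)))  # (1500, 4900)
```

Because the stages run sequentially, the floors add up; no amount of optimizing one stage gets you under the sum of the others.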

But latency isn't the only problem.

"I need to book a flight to Phuket" becomes "I need to book a flight to bucket" after STT. The LLM responds about buckets. The user is confused. Every boundary between systems is an opportunity for meaning to degrade.

And emotion? The user sighs in frustration. STT transcribes it as "..." or nothing at all. The LLM doesn't know the user is upset. It continues cheerfully, making things worse.

Native Audio Models

Models like Gemini Live and GPT-4o Realtime take a different approach. No transcription. No synthesis. Audio in, audio out.

The model "hears" the raw waveform—including tone, pace, hesitation, and background noise. It generates audio tokens directly, preserving prosody and emotion in the output. Latency drops to 300-600ms. The "Phuket/bucket" problem disappears because there's no intermediate text representation to corrupt.

When to Use Which

Native audio models aren't always the right choice. The decision depends on your constraints:

Choose native audio when:

  • Sub-second response time is critical (customer-facing, real-time assistance)
  • Emotional nuance matters (therapy bots, companion apps, sales)
  • You need natural conversation flow without robotic turn-taking
  • Your domain has standard vocabulary that doesn't require specialized recognition

The pipeline still wins when:

  • You need domain-specific speech recognition (medical terminology, legal jargon, heavy accents in specific industries)—you can fine-tune the STT layer independently
  • You want to swap TTS voices without retraining the whole system
  • You need exact transcripts for compliance (some industries require verbatim records, and native audio's "understanding" doesn't give you that)
  • Cost is a primary constraint—native audio models charge significantly more per minute than running separate STT/LLM/TTS components

The hybrid approach works too: use fast native audio for the conversational flow, but run parallel STT in the background for logging and searchability. You get the UX benefits of native audio without losing the auditability of text.

The WebSocket Architecture

Real-time voice requires persistent, bidirectional connections. HTTP request-response won't work—you can't wait for the user to finish a 30-second utterance before sending anything.

You could connect directly from the client to Gemini Live—Google supports this, and it's simpler for prototypes. But a server proxy gives you security (API keys stay server-side, you mint short-lived tokens for clients), tool execution (when the model calls a tool, your server runs it), and logging for compliance and debugging.
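The proxy's core is a relay loop. A sketch of the idea, assuming generic async `send()`/`recv()` endpoints and a made-up message shape (real provider protocols differ; the `tool_call`/`tool_result` fields here are hypothetical):

```python
import asyncio

# Relay loop for a server-side proxy: client audio flows up to the model,
# model audio flows back down, and tool calls are intercepted and executed
# server-side so API keys and tool logic never reach the client.

async def relay(client, model, tools):
    async def upstream():
        # Forward client microphone chunks until the stream ends (None).
        while (chunk := await client.recv()) is not None:
            await model.send({"type": "audio", "data": chunk})

    async def downstream():
        while (msg := await model.recv()) is not None:
            if msg["type"] == "tool_call":
                # Run the tool on the server, return the result to the model.
                result = tools[msg["name"]](**msg["args"])
                await model.send({"type": "tool_result", "data": result})
            else:
                await client.send(msg)  # audio (and anything else) to the client

    await asyncio.gather(upstream(), downstream())
```

Both directions run concurrently, which is exactly what HTTP request-response can't give you: the model can start speaking while the user is still being streamed up.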

Turn-Taking: When to Speak

Humans don't wait for explicit "over" signals. We sense when someone's finished through pauses, falling intonation, and completed thoughts. Your agent needs the same intuition.

Voice Activity Detection (VAD) answers two questions: Is the user speaking right now? Have they stopped?

Simple VAD uses energy thresholds—if the audio amplitude exceeds a level for a certain duration, someone's talking. Sophisticated VAD uses neural networks trained to distinguish speech from background noise, even in cafes or cars.
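The energy-threshold version fits in a few lines. A minimal sketch, assuming 20ms PCM frames; the threshold and hangover values are illustrative and need tuning:

```python
# Energy-threshold VAD: speech is "on" when short-term RMS energy crosses
# a threshold, and "off" after enough consecutive quiet frames (hangover).

def rms(frame):
    """Root-mean-square energy of one frame of PCM samples."""
    return (sum(s * s for s in frame) / len(frame)) ** 0.5

class EnergyVAD:
    def __init__(self, threshold=500.0, silence_frames=30):
        self.threshold = threshold            # RMS level that counts as speech
        self.silence_frames = silence_frames  # ~600ms hangover at 20ms frames
        self.quiet = 0
        self.speaking = False

    def process(self, frame):
        """Feed one frame; returns 'start', 'end', or None."""
        if rms(frame) >= self.threshold:
            self.quiet = 0
            if not self.speaking:
                self.speaking = True
                return "start"
        elif self.speaking:
            self.quiet += 1
            if self.quiet >= self.silence_frames:
                self.speaking = False
                return "end"
        return None
```

The `silence_frames` hangover is the "silence threshold" discussed below: it's the single parameter that most shapes how the agent's turn-taking feels.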

The Tuning Problem

Getting VAD parameters right is harder than it sounds—and getting them wrong destroys the user experience faster than almost any other mistake.

Too aggressive (short silence threshold): Set your silence threshold to 300ms and watch users get interrupted mid-thought. They pause to think, and your agent jumps in. "Sorry, go ahead"—"No, you go"—the conversation devolves into awkward overlap. Users start speaking faster to avoid being cut off, which makes them harder to understand, which makes your agent's responses worse. A vicious cycle.

The failure is particularly bad for non-native speakers, who pause more frequently while finding words, and for complex topics, where even native speakers need thinking time. Your "responsive" agent becomes an interrupting nightmare.

Too conservative (long silence threshold): Set it to 1500ms and users wonder if you're still listening. "Hello? Are you there?" The long pause feels like lag, even though you're just being polite. Users might repeat themselves, creating duplicate inputs. Or they hang up—in voice commerce, every second of perceived latency costs conversions.

Speech sensitivity traps: Too sensitive, and every keyboard click, coffee sip, or background TV triggers the "user is speaking" state. The agent starts responding to noise, or worse, waits indefinitely for the "user" (actually a passing car outside) to finish talking. Too insensitive, and soft-spoken users get ignored. They speak louder, get frustrated, speak louder still—or give up entirely.

Environment-specific failures: What works in a quiet home office fails spectacularly in a café. What works for a confident American English speaker fails for a hesitant speaker with a strong accent. The "optimal" threshold doesn't exist universally.

Adaptive approaches help but add complexity. The best voice agents measure background noise levels and adjust thresholds dynamically—higher noise floor means you need stronger speech signals before triggering. They also learn user patterns: some users naturally pause longer, and the agent can adapt after a few turns.
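The noise-floor half of that adaptation can be sketched with a simple exponential moving average; the margin and smoothing factor here are assumptions to tune per deployment:

```python
# Adaptive trigger level: track a running noise-floor estimate and require
# speech to sit a fixed margin above it. Only non-speech frames are allowed
# to pull the estimate, so the user's own voice doesn't inflate the floor.

class AdaptiveThreshold:
    def __init__(self, margin_db=10.0, alpha=0.05):
        self.noise_floor_db = -60.0   # running estimate of background level
        self.margin_db = margin_db    # speech must exceed the floor by this
        self.alpha = alpha            # EMA smoothing factor

    def update(self, frame_db, is_speech_now):
        """Feed one frame's level; returns the current trigger level in dB."""
        if not is_speech_now:
            self.noise_floor_db += self.alpha * (frame_db - self.noise_floor_db)
        return self.noise_floor_db + self.margin_db
```

In a quiet room the trigger sinks low enough to catch soft-spoken users; in a café it rises so the espresso machine stops counting as speech.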

For a starting point: 600-800ms silence threshold, moderate speech sensitivity, and a feedback mechanism to collect user complaints. Then iterate. Voice UX is empirical—you cannot solve it purely in theory.

Handling Interruptions

User speaks. Agent responds. User interrupts mid-response. Now what?

Without proper handling, the agent keeps talking while also trying to process new input. The user hears stale audio while the model generates a response to their interruption. Both parties are confused.

With barge-in handling: VAD detects user speech during playback, audio stops immediately, the partial response is discarded, and fresh processing begins. This requires your audio pipeline to support instant cancellation—you can't wait for a queued buffer to drain. You need to kill it mid-syllable if necessary.
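The key structural requirement is that playback happens in small chunks with an interrupt check between each one. A sketch under those assumptions (`play_chunk` is a hypothetical callback that writes ~20ms of PCM to the output device):

```python
import threading

# Barge-in-capable playback: check an interrupt flag between small chunks,
# so cancellation lands mid-utterance instead of after a long buffer drains.

class Playback:
    def __init__(self, play_chunk):
        self.play_chunk = play_chunk        # writes one small chunk to the device
        self.interrupted = threading.Event()

    def barge_in(self):
        """Called by VAD when the user starts speaking during playback."""
        self.interrupted.set()

    def speak(self, audio, chunk_size=640):
        """Play audio; returns False if interrupted before finishing."""
        self.interrupted.clear()
        for i in range(0, len(audio), chunk_size):
            if self.interrupted.is_set():
                return False                # discard the rest of the response
            self.play_chunk(audio[i:i + chunk_size])
        return True
```

A `False` return is the signal to throw away the model's partial response and start fresh processing on the user's interruption.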

Prompting for Voice

Voice prompts differ from text prompts in ways that aren't obvious until you hear the results.

Brevity is Survival

Every extra word costs ~40ms of audio generation. A response that reads fine as text sounds painfully slow when spoken:

❌ "I'd be happy to help you with that! Let me check my calendar
    and see what times are available for scheduling your meeting
    with the marketing team this week."

By the time the agent finishes this sentence, the user has mentally checked out. Compare:

✅ "Checking your calendar... Tuesday at 2pm works. Should I send
    the invite?"

Same information. Half the time. The user stays engaged.

Instruct your model to keep responses under 2-3 sentences, lead with the answer before the explanation, and cut filler phrases ("I think," "basically," "you know"). Read your prompts aloud—if you'd get impatient listening, so will your users.

Filling the Silence

When your agent calls a tool—checking a calendar, querying a database—there's a processing delay. In text chat, the user sees a loading spinner. In voice, they hear nothing. Did the call drop? Did the agent crash?

Train your agent to verbalize intent before acting: "Let me check that for you," "Looking up your order now," "One moment while I search." These fillers buy time and signal that the agent is working, not frozen. Without them, even a 2-second tool call feels like an eternity of dead air.

Persona Bleeds Through

Voice reveals personality in ways text hides. The same words delivered with different pacing, pitch, and emphasis create entirely different experiences.

Your system prompt should specify pace ("measured, not rushed"), tone ("warm and professional, like a knowledgeable concierge"), and recovery behavior ("if you mishear something, ask for clarification naturally, not robotically").

You can also specify—or forbid—verbal tics. Some personas benefit from occasional "hmm" or "let's see" to sound human. Others (medical, legal) should avoid anything that sounds uncertain.

Context Management for Audio

Twenty minutes into your support call, the agent suddenly forgets the user's original problem. It asks them to repeat their account number—for the third time. What happened?

Audio fills your context window dramatically faster than text.

A user says "What's the weather in San Francisco?" That's 6 words, maybe 8 tokens as text. As audio? The same utterance takes ~3 seconds to speak. Depending on how the model tokenizes audio, that's 200-400 tokens for the same semantic content.

A 10-minute voice conversation can consume 100,000+ tokens. Your 128k context window suddenly looks cramped. The model starts dropping early conversation to make room for new input—and "early conversation" is where the user explained their actual problem.
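The math is worth running for your own window size. A rough calculator; the 170 tokens/second rate is an illustrative assumption chosen to match the "10 minutes, 100k+ tokens" figure, not any real tokenizer's number:

```python
# Back-of-envelope: how long a pure-audio conversation fits in a context
# window, after reserving room for the system prompt and response.

AUDIO_TOKENS_PER_SEC = 170   # illustrative; 10 min * 170/s * 60 ~= 102k tokens

def minutes_until_full(context_window, reserve=8_000):
    return (context_window - reserve) / (AUDIO_TOKENS_PER_SEC * 60)

print(round(minutes_until_full(128_000), 1))  # ~11.8 minutes of raw audio
```

Under twelve minutes of raw audio in a 128k window is the whole problem in one number: a routine support call outlives the context.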

Strategies That Work

Summarize aggressively. After every few turns, have a background process summarize the conversation into text and inject it as context. The summarizer doesn't need to be fancy—a simple prompt like "Summarize the key facts, decisions, and open questions from this conversation in under 200 words" works well. Discard the raw audio history but preserve the semantic content. You're trading fidelity for longevity.

The summarization can run in parallel with the main conversation. While the user is speaking, your background process digests the last 5 minutes into a paragraph. When you hit your context budget, you swap out the oldest audio chunks for their text summary. The user never notices.
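The swap itself is simple bookkeeping. A sketch, assuming turns tracked as dicts with token counts and a `summarize` hook standing in for the background LLM call (both are assumptions, not a specific API):

```python
# Evict the oldest turns when over budget, replacing them with one text
# summary turn at the front of the history. Recent turns are always kept
# in full fidelity for tone and continuity.

def compact(history, budget, summarize):
    """history: list of {'tokens': int, 'text': str, 'audio': ...} turns."""
    total = sum(t["tokens"] for t in history)
    kept = list(history)
    evicted = []
    while total > budget and len(kept) > 3:   # keep at least 3 recent turns
        turn = kept.pop(0)
        evicted.append(turn)
        total -= turn["tokens"]
    if evicted:
        summary = summarize(evicted)           # background "key facts" prompt
        kept.insert(0, {"tokens": 250, "text": summary, "audio": None})
    return kept
```

Because `summarize` runs on turns that have already been evicted from the live window, it can be slow and cheap; nothing in the conversation waits on it.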

Separate memory tiers. Keep the last 2-3 audio turns in full fidelity—you need recent context for tone and conversational continuity. If the user just said something sarcastically, the model needs to have heard that sarcasm, not read a summary that says "user expressed mild frustration."

Store older context as text summaries in a database, retrieving when relevant keywords appear. If the user says "like I mentioned earlier about the Tokyo flight," your retrieval layer fetches the relevant summary from 15 minutes ago and injects it. The model doesn't need to remember everything—it needs to access everything.

Reset at natural boundaries. For task-oriented agents (booking, support), natural conversation points like "Is there anything else I can help with?" are opportunities to start fresh. Beginning a new session with a brief text summary—"Continuing conversation with user about flight booking to Tokyo, departure March 15, returning March 22, window seat preference noted"—gives the model everything it needs without the 50,000-token audio history.

Users actually prefer this. A clean context with relevant facts produces better responses than a bloated context where important details get lost in the noise.

Parallel transcription for logs. Even if your model processes audio natively, run parallel transcription for your records. You'll want text logs for debugging ("why did the agent recommend the wrong hotel?"), compliance (regulated industries require conversation records), and training data for future model improvements. Text is 30x more storage-efficient than audio. It's also searchable—try grepping through 10,000 hours of audio to find every conversation about refund requests.

Adding Video

Video agents see while they listen. A support agent watches you navigate a confusing UI. A cooking assistant sees the ingredients on your counter. A fitness coach observes your form and corrects your posture mid-rep.

But video isn't just "voice plus a camera." It introduces new engineering challenges—and new failure modes.

Frame Sampling: Less Is More

You don't send 30fps video—that would be millions of tokens per minute. Instead, sample frames strategically.

Different use cases need different sampling strategies:

Screen sharing is high-detail but mostly static. A user reading documentation doesn't need 2fps—capture on mouse clicks, scroll events, or detected pixel changes. You might send only 5-10 frames across a 2-minute segment, each one capturing a meaningful state change.

Camera input is lower information density but more continuous. A cooking assistant needs regular 1-2fps sampling to notice when the garlic starts browning. A fitness coach might need higher rates during fast movements but can drop to 0.5fps during rest periods.

Hybrid approaches adapt dynamically. Start at low frequency, increase when the model detects activity (motion, speech, new objects entering frame), then drop back to baseline.
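The hybrid policy reduces to one decision per tick: which sampling gap applies right now? A sketch using the rates from the text (the 5-second activity window is an assumption):

```python
# Adaptive frame sampler: a slow baseline interval, tightened for a short
# window after any activity signal (motion, clicks, scrolls, pixel change).

class FrameSampler:
    def __init__(self, idle_fps=0.5, active_fps=2.0, active_window=5.0):
        self.idle_gap = 1.0 / idle_fps
        self.active_gap = 1.0 / active_fps
        self.active_window = active_window     # seconds of "active" after an event
        self.last_sent = float("-inf")
        self.last_activity = float("-inf")

    def notify_activity(self, now):
        """Call when motion or a UI event is detected; `now` is in seconds."""
        self.last_activity = now

    def should_send(self, now):
        active = now - self.last_activity <= self.active_window
        gap = self.active_gap if active else self.idle_gap
        if now - self.last_sent >= gap:
            self.last_sent = now
            return True
        return False
```

The same structure covers the screen-sharing case: skip the clock entirely and call `notify_activity` only on clicks, scrolls, and detected pixel changes.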

Bandwidth and Quality Trade-offs

A 720p frame is ~50-100KB compressed. At 2fps, that's 100-200KB/second upstream. Fine on WiFi, problematic on cellular. Consider these strategies:

  • Adaptive resolution: Start at 480p, upgrade to 720p/1080p only when the model requests detail ("Can you show me that label closer?")
  • Region of interest: If the model is watching a cooking pan, crop to the relevant quadrant rather than sending the full frame
  • Graceful degradation: On poor connections, drop frame rate before dropping resolution—temporal gaps are less jarring than pixelated images
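The degradation order above can be encoded as a two-step search. A sketch; the per-frame sizes are rough assumptions in the spirit of the estimates quoted earlier, not measurements:

```python
# Graceful degradation: shed frame rate first, then resolution, until the
# stream fits the available upstream bandwidth.

FRAME_KB = {"480p": 30, "720p": 75, "1080p": 150}   # assumed compressed sizes

def choose_settings(kbps_budget, prefer=("720p", 2.0)):
    res, fps = prefer
    # Step 1: halve the frame rate (down to 0.5fps) before touching quality.
    while fps > 0.5 and FRAME_KB[res] * fps * 8 > kbps_budget:
        fps /= 2
    # Step 2: only then step the resolution down.
    while res != "480p" and FRAME_KB[res] * fps * 8 > kbps_budget:
        res = "480p" if res == "720p" else "720p"
    return res, fps
```

On a healthy connection nothing changes; as the budget shrinks, the user sees slower updates before they see mud.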

When Video Adds Value (and When It Doesn't)

Video isn't always worth the overhead. Ask yourself:

Video helps when:

  • The task involves physical objects the user can show (cooking, repairs, shopping)
  • Visual context changes the answer ("Is this rash serious?" requires seeing the rash)
  • Real-time feedback on technique matters (sports, music, physical therapy)
  • The user would otherwise need to describe something complex in words

Voice-only is often sufficient when:

  • The conversation is purely informational ("What's the capital of France?")
  • The user is mobile and camera access is awkward (driving, walking)
  • Bandwidth is limited or unreliable
  • Privacy is a concern (users may not want cameras active)

Don't add video because you can. Add it because it makes the interaction meaningfully better.

Visual Hallucination: A New Failure Mode

Audio models can mishear. Visual models can mis-see.

A user shows a bottle of olive oil. The model confidently says, "I see you have canola oil there." The user corrects it. The model apologizes and continues—but now the user questions everything the agent "sees."

Visual hallucination is harder to catch than audio errors because there's no transcript to review. The model might:

  • Misidentify objects (especially similar-looking items, or items with obscured labels)
  • Miss objects outside the frame or in poor lighting
  • Confuse the current frame with a previous one (temporal confusion)
  • Invent details that seem plausible but aren't visible

Mitigation strategies:

  • Prompt for uncertainty: "If you're not sure what you're seeing, ask the user to confirm rather than guessing."
  • Confirm before acting: For consequential decisions ("That looks like peanut oil—are you sure that's safe for your guest with allergies?"), ask before proceeding.
  • Log frames for debugging: When users report errors, having the actual frames helps you understand what the model saw.

Guiding Visual Attention

Your system prompt should describe what the model should pay attention to. Without guidance, the model might fixate on irrelevant visual details—the color of your kitchen walls instead of the state of your sauté pan.

Be specific about the task domain: "The user is sharing their screen. Watch for error messages, modal dialogs, and UI elements they seem to struggle with. Ignore decorative elements and focus on functional interface components."

For camera input, guide attention to the action: "Watch the user's hands and the cooking surface. Notice changes in color, texture, and steam. If ingredients are being added, identify them."

A Multimodal Agent in Action

Picture this: you're staring at random ingredients after a long day. Tomatoes, garlic, half a box of pasta, olive oil. You could scroll through recipe apps, but instead you just... ask.

"Hey, what can I make with this stuff?"

The agent looks at your counter through your phone camera. "Looks like you've got the basics for aglio e olio—garlic, oil, pasta. Simple and really good. Want me to walk you through it?"

"Yeah, sure."

"First, get a big pot of water boiling. While that heats up, slice the garlic thin—I'm talking paper-thin, so it crisps up instead of burning..."

You start slicing. The agent watches.

"Those are a bit thick—try to get them more translucent. Yeah, like that."

This is the magic of real-time multimodal: the agent isn't responding to a query, it's participating in what you're doing. It sees, hears, and responds in the moment.

How It Works

The system prompt shapes the interaction:

You are a cooking assistant. You can see the user's kitchen through
their camera and hear them through their microphone.

When you see ingredients, mention them naturally—don't list them
robotically. "I see you've got some nice tomatoes there" not
"Detected: tomatoes, garlic, pasta, olive oil."

Keep responses brief. You're in the kitchen with them, not reading
a recipe book. Guide like a friend who happens to be a good cook.

Watch their technique. If they're about to make a mistake (knife
angle, heat too high, garlic about to burn), jump in. Don't wait
for them to ask.

Use the timer tool when they need to track time. Use recipe_search
if they want ideas beyond what you know.

If you can't see clearly, ask them to adjust the camera. "Can you
angle that down a bit? I can't quite see the pan."

The agent watches, listens, and participates. When the garlic hits the oil, it might say "Hear that sizzle? Perfect temperature" or "That's a bit quiet—give it another minute to heat up." It's not answering questions. It's cooking with you.


Next: On-Device & Hybrid—running models locally for privacy, offline access, and zero latency.