Memory & Persistence
In 2024, a healthcare startup deployed an AI triage assistant to help nurses prepare for patient consultations. It was designed to review patient history, flag concerns, and suggest questions for the doctor. On day three, a patient mentioned they were severely allergic to penicillin. Thirty minutes later—after discussing symptoms, running through a checklist, and asking clarifying questions—the same assistant recommended a penicillin-based antibiotic.
The allergy disclosure had scrolled out of the context window. The agent had "forgotten."
Fortunately, the nurse caught the error before any harm was done. But the incident exposed a fundamental truth about AI agents: they don't remember anything by default. Every interaction starts fresh. Every conversation is ephemeral. The impressive illusion of continuity we experience with chatbots is often just careful engineering behind the scenes.
This chapter teaches you that engineering. We'll build agents that remember your name, recall discussions from months ago, and maintain persistent knowledge across sessions—all while respecting the hard constraints of context windows.
"Memory" and "persistence" are overloaded terms in AI. Many resources make analogies to human memory—short-term, long-term, episodic, semantic—which adds confusion without clarity. We'll skip the metaphors and focus on what actually matters: storing data somewhere and retrieving it into the context window when needed.
1. The Context Window Problem
Before solving memory, we need to understand why agents forget in the first place.
The Hard Limit
The model can only "see" and process tokens within its context window. This is a hard limit enforced by the architecture:
| Model | Context Window |
|---|---|
| GPT-4o | 128k tokens |
| Claude 3.5 Sonnet | 200k tokens |
| Gemini 2.0 Flash | 1M tokens |
When you exceed this limit, something must be removed. There's no overflow buffer, no automatic compression—tokens simply get dropped.
The Trimming Dilemma
When conversation history grows beyond the context limit, we must trim it. The simplest approach: keep only the last N messages.
This is the core problem: trimming destroys information. The agent loses access to anything outside the current window.
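Last-N trimming can be sketched in a few lines (the `Message` type and function name here are illustrative, not a framework API):

```typescript
type Message = { role: "user" | "assistant"; content: string };

// Keep only the most recent n messages; everything older is dropped.
function trimToLastN(history: Message[], n: number): Message[] {
  return n > 0 ? history.slice(-n) : [];
}

const history: Message[] = [
  { role: "user", content: "I'm severely allergic to penicillin." },
  { role: "assistant", content: "Noted. What symptoms are you experiencing?" },
  { role: "user", content: "A sore throat and a fever." },
  { role: "assistant", content: "How long have the symptoms lasted?" },
];

// With n = 2, the allergy disclosure is silently dropped.
const trimmed = trimToLastN(history, 2);
```

Note the `n > 0` guard: `slice(-0)` returns the whole array, which is the opposite of "no memory."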
Why Not Just Use a Bigger Window?
Models like Gemini 2.0 offer 1M+ token windows. Problem solved?
Not quite. As we discussed in the RAG chapter:
- Data often exceeds even massive windows: Enterprise conversations, customer histories, and knowledge bases can easily surpass any context limit.
- Context is precious: Every token spent on "just in case" history is a token unavailable for instructions, reasoning, and the current task.
- Attention degrades: The "Lost in the Middle" phenomenon means models struggle to use information buried deep in long contexts. A fact mentioned 500k tokens ago might as well not exist.
The solution isn't bigger windows—it's smarter memory management.
2. The Mental Model
Before implementing anything, let's establish how to think about agent memory.
Memory = Storage + Retrieval
Memory is simply:
- Storage: Persisting data somewhere outside the context window
- Retrieval: Fetching relevant data back into the context window when needed
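Stripped to its essentials, that's just two operations. A minimal sketch, where an in-process `Map` stands in for a real database and the interface names are illustrative:

```typescript
interface MemoryStore {
  save(key: string, value: string): Promise<void>; // storage
  load(key: string): Promise<string | undefined>;  // retrieval
}

// In production this would be SQLite, Postgres, Redis, etc.
class InMemoryStore implements MemoryStore {
  private data = new Map<string, string>();
  async save(key: string, value: string): Promise<void> {
    this.data.set(key, value);
  }
  async load(key: string): Promise<string | undefined> {
    return this.data.get(key);
  }
}

// Persist a fact now; pull it back into the prompt later.
const store = new InMemoryStore();
store.save("user-456:allergies", "penicillin");
```

Every memory pattern in this chapter is some elaboration of this pair: what to save, when to save it, and how to decide what to load back.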
Memory Scoping: Who and Which Conversation?
Here's a question that trips up many new agent developers: whose memory is this, and which conversation does it belong to?
When you're prototyping, it's easy to think of your agent as a single entity with one memory. But in production, your agent might be:
- Talking to thousands of users simultaneously
- Handling multiple conversations per user (a billing question, then a product question, then a feature request)
- Running across multiple server instances with no shared state
Memory needs to be scoped correctly, or User A sees User B's conversation history. That's a privacy disaster and a UX nightmare.
Two Levels of Scoping
Most memory systems use two identifiers:
| Concept | What It Represents | Other Names You'll See |
|---|---|---|
| User/Resource | The entity you're storing memory about | user_id, resource_id, customer_id |
| Conversation/Thread | A specific conversation or session | thread_id, session_id, conversation_id, chat_id |
Think of it like email: the user is the person, the thread is a specific email chain.
Why Threads?
If a user can only have one conversation, why bother with threads at all?
In practice, users often have multiple concurrent contexts:
- A support agent might open a new ticket for each issue (each ticket = one thread)
- A user might start a fresh conversation without wanting old context ("Let's start over")
- Different channels might be different threads (web chat vs. mobile app)
Threads let you isolate conversations while still sharing user-level information.
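Concretely, scoping just means every stored message carries both identifiers, and retrieval filters on one or both. A sketch (the field names are illustrative):

```typescript
// Each stored message is keyed by both identifiers, so one user's
// threads stay isolated from each other and from other users.
type StoredMessage = {
  resourceId: string; // the user
  threadId: string;   // the conversation
  content: string;
};

const messages: StoredMessage[] = [
  { resourceId: "user-A", threadId: "t1", content: "billing question" },
  { resourceId: "user-A", threadId: "t2", content: "feature request" },
  { resourceId: "user-B", threadId: "t3", content: "refund request" },
];

// Thread scope: this conversation only.
const threadHistory = (r: string, t: string) =>
  messages.filter(m => m.resourceId === r && m.threadId === t);

// Resource scope: every conversation with this user.
const userHistory = (r: string) =>
  messages.filter(m => m.resourceId === r);
```

Filtering by both identifiers gives thread scope; filtering by the user alone gives resource scope. Forgetting the `resourceId` filter is exactly the "User A sees User B's history" bug described above.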
Scoping Your Memory
The key insight: different types of memory should be scoped differently.
| Memory Type | Typical Scope | Why |
|---|---|---|
| Message history | Thread | "What we discussed in this conversation" |
| User preferences | User | "They prefer Spanish" should apply everywhere |
| Task state | Thread | "We're working on refund #123" is specific to this chat |
| Medical allergies | User | Critical info that must follow them across all threads |
This gives you the flexibility to say: "Remember what we talked about today" (thread-scoped) vs. "Remember who I am" (user-scoped).
In Mastra
Mastra uses resourceId for the user and threadId for the conversation. Here's how scoping looks in practice (we'll cover the full setup in Section 3):
// When calling the agent
const response = await agent.generate("Continue our discussion", {
threadId: "conversation-123", // This specific chat
resourceId: "user-456", // This user (across all chats)
});
// When configuring memory scope
workingMemory: {
enabled: true,
scope: "resource", // Persists across all threads for this user
// or
scope: "thread", // Only within this conversation
}
You'll see threadId and resourceId in every Mastra code example. Now you know what they mean and why they exist.
Three Memory Patterns
We'll implement three patterns, each solving a different problem:
| Pattern | What It Stores | How It Retrieves | Best For |
|---|---|---|---|
| Message History | All messages | Last N messages | Recent context, conversational flow |
| Working Memory | Key facts | Always included in prompt | User preferences, critical info |
| Semantic Recall | Embeddings | Vector similarity search | "What did we discuss last year?" |
Let's implement each one.
3. Message History
The most basic form of memory: save every message to a database, then reload recent messages when the user returns.
The Pattern
Implementation with Mastra
Mastra handles storage and retrieval automatically. First, install the required packages:
pnpm add @mastra/core @mastra/memory @mastra/libsql
Then configure the lastMessages option and Mastra manages everything:
import { Agent } from "@mastra/core/agent";
import { Memory } from "@mastra/memory";
import { LibSQLStore } from "@mastra/libsql";
const agent = new Agent({
name: "support-agent",
instructions: "You are a helpful support agent.",
model: "openai/gpt-4o",
memory: new Memory({
storage: new LibSQLStore({
id: "support-storage",
url: "file:./conversations.db",
}),
options: {
lastMessages: 20, // Automatically fetches last 20 messages
},
}),
});
// Each call automatically loads history and saves new messages
const response = await agent.generate("Continue our discussion", {
threadId: "thread-123",
resourceId: "user-456",
});
Choosing N: How Many Messages?
The right value for "last N messages" depends on your use case:
| N Value | Trade-off |
|---|---|
| 5-10 | Minimal context, fast responses, but agent forgets quickly |
| 20-50 | Good for most conversational agents |
| 100+ | Full conversation awareness, but expensive and potentially noisy |
| All | Complete history—only viable for short conversations or large context windows |
Experiment: Try different values and observe how the agent's responses change. With N=0, the agent has no memory of previous messages. With N=50, it can reference discussions from earlier in the session.
4. Working Memory
Message history solves one problem: remembering recent messages. But what about facts that should always be available, regardless of how much conversation happens?
Consider: a user tells your agent "I'm allergic to penicillin" at the start of a medical consultation. Two hundred messages later, that fact might be trimmed from history—but it should never be forgotten.
The Pattern: Protected Facts
Working memory stores critical facts in a protected section of the prompt that survives trimming:
When trimming occurs, we remove old messages—but the protected facts section is never touched.
How the Agent Updates Working Memory
The key insight is that the agent itself manages this memory. Under the hood, Mastra gives the agent a tool to update its working memory. When users share important information, the agent recognizes it and calls this tool automatically.
You don't need to write extraction logic or manually parse user messages—the LLM handles the judgment of "is this worth remembering?" based on your template.
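Conceptually, that tool looks something like the sketch below. This is a hypothetical shape to show the mechanism, not Mastra's internal implementation; the real tool writes to the configured storage rather than a module-level variable:

```typescript
// A module-level variable stands in for the database-backed store.
let workingMemory = "# Customer Profile\n- **Name**:\n- **Account Type**:";

// Hypothetical tool definition exposed to the LLM. The model decides
// when to call it; the handler just persists the updated text.
const updateWorkingMemory = {
  name: "update_working_memory",
  description:
    "Replace working memory with an updated version. Call this whenever " +
    "the user shares durable facts worth remembering.",
  execute: async ({ memory }: { memory: string }): Promise<string> => {
    workingMemory = memory; // real implementation: write to storage
    return "memory updated";
  },
};
```

On each request, the current working-memory contents are injected into the system prompt, which is why these facts survive message trimming.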
Designing Effective Templates
The template you provide shapes what the agent remembers. A well-designed template:
- Signals what matters: The agent uses the template fields as hints for what to extract
- Stays concise: Working memory consumes context tokens on every request
- Uses clear labels: Ambiguous labels lead to inconsistent updates
Good template design:
# Customer Profile
- **Name**:
- **Account Type**: [Free/Pro/Enterprise]
- **Primary Use Case**:
- **Known Issues**:
- **Communication Preference**: [Formal/Casual]
Poor template design:
# Information
- Stuff about the user:
- Things they mentioned:
- Other notes:
The first template gives the agent clear categories. The second is vague—the agent won't know what "stuff" or "things" to extract.
Template vs Schema
Mastra supports two formats for working memory:
| Format | Syntax | Best For |
|---|---|---|
| Template (Markdown) | Free-form text with placeholders | Flexible facts, notes, summaries |
| Schema (Zod/JSON) | Typed object with defined fields | Structured data, programmatic access |
Template example (what we've been using):
workingMemory: {
enabled: true,
template: `
# Patient Information
- **Name**:
- **Allergies**:
- **Medications**:
`,
}
Schema example (for structured data):
import { z } from "zod";
workingMemory: {
enabled: true,
schema: z.object({
name: z.string().optional(),
allergies: z.array(z.string()).optional(),
medications: z.array(z.string()).optional(),
lastVisit: z.string().optional(),
}),
}
Use templates when you want flexibility and natural language. Use schemas when you need to programmatically read the working memory or validate its structure.
What Belongs in Working Memory?
Not everything should go in working memory. It's for facts that are:
- Critical: Must not be lost, even after hundreds of messages
- Stable: Don't change frequently during a conversation
- Compact: Can be expressed in a few words
| ✅ Put in Working Memory | ❌ Don't Put in Working Memory |
|---|---|
| User's name | Full conversation transcript |
| Language preference | Current task details (use message history) |
| Medical allergies | Temporary session state |
| Account tier | Large documents or data |
If something changes frequently or is only relevant to the current task, message history is a better fit.
Implementation with Mastra
import { Agent } from "@mastra/core/agent";
import { Memory } from "@mastra/memory";
import { LibSQLStore } from "@mastra/libsql";
const agent = new Agent({
name: "medical-assistant",
instructions: `You are a medical triage assistant.
When patients share medical information (allergies, medications, conditions),
remember it in your working memory. Always check allergies before making
any medication recommendations.`,
model: "openai/gpt-4o",
memory: new Memory({
storage: new LibSQLStore({
id: "medical-storage",
url: "file:./medical.db",
}),
options: {
lastMessages: 20,
workingMemory: {
enabled: true,
scope: "resource", // Persists across all conversations with this patient
template: `
# Patient Information
- **Name**:
- **Age**:
- **Known Allergies**:
- **Current Medications**:
- **Relevant Medical History**:
`,
},
},
}),
});
Notice the instructions explicitly tell the agent to use working memory. While Mastra provides the mechanism, you should guide the agent on when to use it. Without clear instructions, the agent might not recognize that "I'm allergic to penicillin" should be saved.
5. Semantic Recall
Message history gives you recent context. Working memory gives you persistent facts. But what about information from weeks ago that didn't make it into working memory?
"What did we discuss about the marketing budget last month?" — this query can't be answered by the last 20 messages or a predefined template. You need to search past conversations.
This is RAG applied to conversation history.
How Semantic Recall Works
The core idea: every message gets converted to a vector embedding and stored in a vector database. When a relevant query comes in, we search for semantically similar past messages.
The embedding model converts text into a high-dimensional vector that captures semantic meaning. "Marketing budget" and "The marketing budget is $50k" have similar vectors, even though the exact words differ.
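The "similar vectors" comparison is usually cosine similarity. A toy sketch (real embeddings have hundreds or thousands of dimensions and come from a learned model; these hand-made 3-d vectors only illustrate the math):

```typescript
// Cosine similarity: 1 means same direction, 0 means unrelated.
function cosineSimilarity(a: number[], b: number[]): number {
  const dot = a.reduce((sum, x, i) => sum + x * b[i], 0);
  const norm = (v: number[]) => Math.sqrt(v.reduce((s, x) => s + x * x, 0));
  return dot / (norm(a) * norm(b));
}

// Pretend embeddings: similar meanings point in similar directions.
const query     = [0.9, 0.1, 0.0];  // "marketing budget"
const budgetMsg = [0.8, 0.2, 0.1];  // "The marketing budget is $50k"
const lunchMsg  = [0.0, 0.1, 0.95]; // "Let's get lunch at noon"

const simBudget = cosineSimilarity(query, budgetMsg);
const simLunch  = cosineSimilarity(query, lunchMsg);
```

A vector database does this comparison at scale: it indexes the stored vectors so the top matches can be found without scanning every message.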
The messageRange Parameter
When you find a relevant message, the surrounding context often matters too. If someone said "Yes, I agree with that" — that's useless without knowing what "that" refers to.
The messageRange parameter retrieves messages before and after each match:
With messageRange: 2, each matched message brings 2 messages before and 2 after—giving you a 5-message window of context.
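The expansion itself is simple index arithmetic, clamped at the thread boundaries. A sketch (the function name is illustrative):

```typescript
// Expand a matched index into a window of surrounding messages.
function withContext(messages: string[], matchIndex: number, range: number): string[] {
  const start = Math.max(0, matchIndex - range);
  const end = Math.min(messages.length, matchIndex + range + 1);
  return messages.slice(start, end);
}

const thread = [
  "How big is the marketing budget?",
  "It's $50k for Q3.",
  "Yes, I agree with that.",  // useless on its own
  "Great, let's allocate it.",
  "Moving on to hiring...",
];

// messageRange: 1 → the match at index 2 brings its neighbors along.
const window = withContext(thread, 2, 1);
```

Now "Yes, I agree with that" arrives with the message it was agreeing to, so the agent can actually use it.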
Tuning Semantic Recall
Two key parameters control behavior:
| Parameter | What It Does | Trade-off |
|---|---|---|
| topK | Number of similar messages to retrieve | Higher = more context, but more noise and tokens |
| messageRange | Context window around each match | Higher = better context, but more tokens |
Conservative settings (faster, cheaper):
semanticRecall: {
topK: 3,
messageRange: 1,
}
Aggressive settings (more recall, higher cost):
semanticRecall: {
topK: 10,
messageRange: 3,
}
Start conservative and increase if the agent frequently misses relevant history.
When Semantic Recall Fails
Semantic search finds conceptually similar content. It struggles with:
| Query Type | Why It Fails | Better Approach |
|---|---|---|
| Exact IDs | "Order #12345" ≈ "Order #12346" semantically | Use tools with database lookup |
| Precise numbers | "$99.99" and "$89.99" have similar embeddings | Store in structured data |
| Recent context | Last 5 messages don't need vector search | Use message history |
| Negations | "I don't like pizza" ≈ "I like pizza" | Be aware of this limitation |
Semantic recall is a complement to—not a replacement for—message history and structured data retrieval.
Scope: Thread vs Resource
Like working memory, semantic recall can be scoped:
- Thread scope: Only search within the current conversation
- Resource scope: Search across all conversations with this user
semanticRecall: {
topK: 5,
messageRange: 2,
scope: "resource", // Search all conversations with this user
}
Resource scope is powerful for questions like "What did we discuss about X last month?" — but be careful with privacy. Make sure users expect their past conversations to inform new ones.
Performance Considerations
Semantic recall adds latency to every request:
- Embed the query (~50-100ms)
- Vector search (~10-50ms depending on DB size)
- Fetch message context (~10-20ms)
For most applications, this 100-200ms overhead is acceptable. For latency-sensitive use cases (like real-time voice), consider:
- Disabling semantic recall entirely
- Using a faster (but less accurate) embedding model
- Reducing topK to minimize search time
Implementation with Mastra
import { Agent } from "@mastra/core/agent";
import { Memory } from "@mastra/memory";
import { LibSQLStore, LibSQLVector } from "@mastra/libsql";
const agent = new Agent({
name: "assistant",
instructions: `You are a helpful assistant with long-term memory.
When users ask about past discussions, search your memory to find relevant
context. Reference specific details from past conversations when helpful.`,
model: "openai/gpt-4o",
memory: new Memory({
storage: new LibSQLStore({
id: "assistant-storage",
url: "file:./memory.db",
}),
vector: new LibSQLVector({
id: "assistant-vector",
connectionUrl: "file:./vectors.db",
}),
options: {
lastMessages: 20,
semanticRecall: {
topK: 5, // Retrieve 5 most similar messages
messageRange: 2, // Include 2 messages before/after each match
scope: "resource", // Search across all conversations with this user
},
},
}),
});
When you configure a vector store, semantic recall is enabled by default. The configuration above shows how to tune it. To disable it entirely, set semanticRecall: false.
With all three patterns covered, let's see how they work together in a real-world agent.
6. Putting It All Together
Real-world agents often combine all three patterns. Here's a complete example:
import { Agent } from "@mastra/core/agent";
import { Memory } from "@mastra/memory";
import { LibSQLStore, LibSQLVector } from "@mastra/libsql";
// Tools for structured data retrieval
const tools = [
{
name: "get_order_status",
description: "Check the status of a customer order",
parameters: { ... },
execute: async ({ orderId }) => await db.getOrder(orderId),
},
{
name: "get_product_info",
description: "Get product details and pricing",
parameters: { ... },
execute: async ({ productId }) => await db.getProduct(productId),
},
];
const agent = new Agent({
name: "support-agent",
instructions: `
You are a customer support agent with memory across conversations.
When users share personal information (name, preferences), remember it.
When they ask about past discussions, search your memory.
When they ask about orders or products, use the appropriate tools.
`,
model: "openai/gpt-4o",
tools,
memory: new Memory({
storage: new LibSQLStore({ id: "support-storage", url: "file:./support.db" }),
vector: new LibSQLVector({ id: "support-vector", connectionUrl: "file:./vectors.db" }),
options: {
// Recent conversation context
lastMessages: 30,
// Persistent user facts
workingMemory: {
enabled: true,
scope: "resource", // Persists across all threads for this user
template: `
# Customer Profile
- **Name**:
- **Account Type**:
- **Preferences**:
- **Previous Issues**:
`,
},
// Long-term semantic memory
semanticRecall: {
topK: 5,
messageRange: 2,
},
},
}),
});
// Usage
const response = await agent.generate(
"What did we discuss about my refund last month?",
{
threadId: "conversation-789",
resourceId: "customer-456",
}
);
This agent:
- Remembers recent messages (last 30) for conversational context
- Maintains persistent facts about the customer across all conversations
- Searches past conversations semantically when asked about historical discussions
- Retrieves structured data (orders, products) via tools
Notice the tools array in the example above. For structured data with known keys—order status, product specs, user profiles—you don't need memory patterns at all. Just define tools that query your database directly. A SQL query is faster, cheaper, and more reliable than semantic search when you know exactly what you're looking for.
Summary
Agent memory isn't magic—it's engineering around the context window constraint:
| Pattern | What It Does | Storage | Scope | Use When |
|---|---|---|---|---|
| Message History | Loads last N messages | SQL/NoSQL | Thread | Conversational continuity |
| Working Memory | Persists key facts in prompt | SQL/NoSQL | Thread or Resource | User preferences, critical info |
| Semantic Recall | Searches past conversations | Vector DB | Thread or Resource | "What did we discuss about X?" |
The key principles:
- Context is precious—don't waste it on information the agent doesn't need right now
- Trim strategically—protect critical facts, let old messages go
- Retrieve on demand—fetch relevant context when needed, not "just in case"
- Use the right pattern—recent context, persistent facts, or semantic search
In the next chapter, we'll explore how memory fits into the broader Agent Loop—the orchestration pattern that ties together reasoning, tool use, and memory into a coherent agent architecture.