Memory & Persistence

TODO: memory is needed only when the context window is exceeded, which normally happens in long-horizon tasks; however, so far this tutorial has not covered long-horizon tasks. Consider moving this chapter after 12-agentic-systems.

In 2024, a healthcare startup deployed an AI triage assistant to help nurses prepare for patient consultations. It was designed to review patient history, flag concerns, and suggest questions for the doctor. On day three, a patient mentioned they were severely allergic to penicillin. Thirty minutes later—after discussing symptoms, running through a checklist, and asking clarifying questions—the same assistant recommended a penicillin-based antibiotic.

The allergy disclosure had scrolled out of the context window. The agent had "forgotten."

Fortunately, the nurse caught the error before any harm was done. But the incident exposed a fundamental truth about AI agents: they don't remember anything by default. Every interaction starts fresh. Every conversation is ephemeral. The impressive illusion of continuity we experience with chatbots is often just careful engineering behind the scenes.

This chapter teaches you that engineering. We'll build agents that remember your name, recall discussions from months ago, and maintain persistent knowledge across sessions—all while respecting the hard constraints of context windows.

Cutting Through the Jargon

"Memory" and "persistence" are overloaded terms in AI. Many resources make analogies to human memory—short-term, long-term, episodic, semantic—which adds confusion without clarity. We'll skip the metaphors and focus on what actually matters: storing data somewhere and retrieving it into the context window when needed.


1. The Context Window Problem

Before solving memory, we need to understand why agents forget in the first place.

The Hard Limit

The model can only "see" and process tokens within its context window. This is a hard limit enforced by the architecture:

| Model | Context Window |
| --- | --- |
| GPT-4o | 128k tokens |
| Claude 3.5 Sonnet | 200k tokens |
| Gemini 2.0 Flash | 1M tokens |

When you exceed this limit, something must be removed. There's no overflow buffer, no automatic compression—tokens simply get dropped.
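
As a rough illustration, you can estimate whether a conversation still fits before calling the model. The 4-characters-per-token ratio below is a crude assumption for illustration only; use a real tokenizer (e.g. tiktoken) in practice:

// Rough sketch: estimate whether a conversation still fits in the window.
// The ~4 characters/token ratio is a crude heuristic, not a real count.
function fitsInWindow(
  messages: { content: string }[],
  windowTokens: number
): boolean {
  const approxTokens = messages.reduce(
    (sum, m) => sum + Math.ceil(m.content.length / 4),
    0
  );
  return approxTokens <= windowTokens;
}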

The Trimming Dilemma

When conversation history grows beyond the context limit, we must trim it. The simplest approach: keep only the last N messages.

This is the core problem: trimming destroys information. The agent loses access to anything outside the current window.
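
A minimal sketch of last-N trimming:

type Message = { role: "system" | "user" | "assistant"; content: string };

// Keep only the last N messages. Everything older is dropped, and the
// agent has no way to recover it from the context window alone.
function trimHistory(messages: Message[], n: number): Message[] {
  return messages.slice(-n);
}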

Why Not Just Use a Bigger Window?

Models like Gemini 2.0 offer 1M+ token windows. Problem solved?

Not quite. As we discussed in the RAG chapter:

  1. Data often exceeds even massive windows: Enterprise conversations, customer histories, and knowledge bases can easily surpass any context limit.

  2. Context is precious: Every token spent on "just in case" history is a token unavailable for instructions, reasoning, and the current task.

  3. Attention degrades: The "Lost in the Middle" phenomenon means models struggle to use information buried deep in long contexts. A fact mentioned 500k tokens ago might as well not exist.

The solution isn't bigger windows—it's smarter memory management.


2. The Mental Model

Before implementing anything, let's establish how to think about agent memory.

Memory = Storage + Retrieval

Memory is simply:

  1. Storage: Persisting data somewhere outside the context window
  2. Retrieval: Fetching relevant data back into the context window when needed
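
Everything in this chapter is a variation on that two-step loop. As a conceptual sketch (illustrative only, not Mastra's actual API):

// Every memory system in this chapter reduces to this shape.
interface MemoryStore {
  // Storage: persist data outside the context window
  save(scopeId: string, data: string): Promise<void>;
  // Retrieval: fetch relevant data back in when needed
  load(scopeId: string, query?: string): Promise<string[]>;
}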

Memory Scoping: Who and Which Conversation?

Here's a question that trips up many new agent developers: whose memory is this, and which conversation does it belong to?

When you're prototyping, it's easy to think of your agent as a single entity with one memory. But in production, your agent might be:

  • Talking to thousands of users simultaneously
  • Handling multiple conversations per user (a billing question, then a product question, then a feature request)
  • Running across multiple server instances with no shared state

Memory needs to be scoped correctly, or User A sees User B's conversation history. That's a privacy disaster and a UX nightmare.

Two Levels of Scoping

Most memory systems use two identifiers:

| Concept | What It Represents | Other Names You'll See |
| --- | --- | --- |
| User/Resource | The entity you're storing memory about | user_id, resource_id, customer_id |
| Conversation/Thread | A specific conversation or session | thread_id, session_id, conversation_id, chat_id |

Think of it like email: the user is the person, the thread is a specific email chain.

Why Threads?

If a user can only have one conversation, why bother with threads at all?

In practice, users often have multiple concurrent contexts:

  • A support agent might open a new ticket for each issue (each ticket = one thread)
  • A user might start a fresh conversation without wanting old context ("Let's start over")
  • Different channels might be different threads (web chat vs. mobile app)

Threads let you isolate conversations while still sharing user-level information.

Scoping Your Memory

The key insight: different types of memory should be scoped differently.

| Memory Type | Typical Scope | Why |
| --- | --- | --- |
| Message history | Thread | "What we discussed in this conversation" |
| User preferences | User | "They prefer Spanish" should apply everywhere |
| Task state | Thread | "We're working on refund #123" is specific to this chat |
| Medical allergies | User | Critical info that must follow them across all threads |

This gives you the flexibility to say: "Remember what we talked about today" (thread-scoped) vs. "Remember who I am" (user-scoped).

In Mastra

Mastra uses resourceId for the user and threadId for the conversation. Here's how scoping looks in practice (we'll cover the full setup in Section 3):

// When calling the agent
const response = await agent.generate("Continue our discussion", {
  threadId: "conversation-123",  // This specific chat
  resourceId: "user-456",        // This user (across all chats)
});
 
// When configuring memory scope
workingMemory: {
  enabled: true,
  scope: "resource",  // Persists across all threads for this user
  // or
  scope: "thread",    // Only within this conversation
}

You'll see threadId and resourceId in every Mastra code example. Now you know what they mean and why they exist.

Three Memory Patterns

We'll implement three patterns, each solving a different problem:

| Pattern | What It Stores | How It Retrieves | Best For |
| --- | --- | --- | --- |
| Message History | All messages | Last N messages | Recent context, conversational flow |
| Working Memory | Key facts | Always included in prompt | User preferences, critical info |
| Semantic Recall | Embeddings | Vector similarity search | "What did we discuss last year?" |

Let's implement each one.


3. Message History

The most basic form of memory: save every message to a database, then reload recent messages when the user returns.

The Pattern

On every turn: load the last N messages from storage, append the new user message, call the model, then persist both the new message and the response. Storage grows without bound, but the context window only ever sees the recent slice.
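
Written out by hand, one turn looks roughly like the sketch below. The loadMessages, saveMessages, and callModel functions are hypothetical stand-ins for your database and LLM client; Mastra automates all of this, as shown next.

type Msg = { role: "user" | "assistant"; content: string };

async function chatTurn(
  threadId: string,
  userText: string,
  deps: {
    loadMessages: (threadId: string, last: number) => Promise<Msg[]>;
    saveMessages: (threadId: string, msgs: Msg[]) => Promise<void>;
    callModel: (msgs: Msg[]) => Promise<string>;
  }
): Promise<string> {
  // Retrieval: reload the recent slice of this thread's history
  const history = await deps.loadMessages(threadId, 20);
  const userMsg: Msg = { role: "user", content: userText };

  // Generation: the model only sees the recent slice plus the new message
  const reply = await deps.callModel([...history, userMsg]);

  // Storage: persist both sides of the exchange for next time
  await deps.saveMessages(threadId, [
    userMsg,
    { role: "assistant", content: reply },
  ]);
  return reply;
}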

Implementation with Mastra

Mastra handles storage and retrieval automatically. First, install the required packages:

pnpm add @mastra/core @mastra/memory @mastra/libsql

Then configure the lastMessages option; Mastra handles loading history and saving new messages for you:

import { Agent } from "@mastra/core/agent";
import { Memory } from "@mastra/memory";
import { LibSQLStore } from "@mastra/libsql";
 
const agent = new Agent({
  name: "support-agent",
  instructions: "You are a helpful support agent.",
  model: "openai/gpt-4o",
  memory: new Memory({
    storage: new LibSQLStore({
      id: "support-storage",
      url: "file:./conversations.db",
    }),
    options: {
      lastMessages: 20, // Automatically fetches last 20 messages
    },
  }),
});
 
// Each call automatically loads history and saves new messages
const response = await agent.generate("Continue our discussion", {
  threadId: "thread-123",
  resourceId: "user-456",
});

Choosing N: How Many Messages?

The right value for "last N messages" depends on your use case:

| N Value | Trade-off |
| --- | --- |
| 5-10 | Minimal context, fast responses, but agent forgets quickly |
| 20-50 | Good for most conversational agents |
| 100+ | Full conversation awareness, but expensive and potentially noisy |
| All | Complete history—only viable for short conversations or large context windows |

Experiment: Try different values and observe how the agent's responses change. With N=0, the agent has no memory of previous messages. With N=50, it can reference discussions from earlier in the session.


4. Working Memory

Message history solves one problem: remembering recent messages. But what about facts that should always be available, regardless of how much conversation happens?

Consider: a user tells your agent "I'm allergic to penicillin" at the start of a medical consultation. Two hundred messages later, that fact might be trimmed from history—but it should never be forgotten.

The Pattern: Protected Facts

Working memory stores critical facts in a protected section of the prompt that survives trimming. When trimming occurs, old messages are removed, but the protected facts section is never touched.
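
A simplified sketch of that prompt assembly, assuming working memory is injected as a system message (Mastra handles this internally; the shape here is illustrative):

type ChatMessage = { role: "system" | "user" | "assistant"; content: string };

// Working memory is included on every request and is immune to trimming;
// only the message history gets sliced.
function buildPrompt(
  workingMemory: string,
  history: ChatMessage[],
  lastN: number
): ChatMessage[] {
  return [
    { role: "system", content: `Known facts about the user:\n${workingMemory}` },
    ...history.slice(-lastN), // older messages fall away
  ];
}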

How the Agent Updates Working Memory

The key insight is that the agent itself manages this memory. Under the hood, Mastra gives the agent a tool to update its working memory. When users share important information, the agent recognizes it and calls this tool automatically.

You don't need to write extraction logic or manually parse user messages—the LLM handles the judgment of "is this worth remembering?" based on your template.
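
Conceptually, the built-in tool looks something like the sketch below. The name, parameters, and store are illustrative assumptions, not Mastra's actual internals:

import { z } from "zod";

// Hypothetical stand-in for wherever working memory is persisted.
const workingMemoryStore = {
  save: async (contents: string) => {
    /* write to your database, keyed by resourceId or threadId */
  },
};

// Illustrative sketch only; Mastra registers its own equivalent internally.
const updateWorkingMemory = {
  name: "update_working_memory",
  description: "Save or update long-lived facts worth remembering about the user.",
  parameters: z.object({ contents: z.string() }),
  execute: async ({ contents }: { contents: string }) => {
    await workingMemoryStore.save(contents);
    return "Working memory updated.";
  },
};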

Designing Effective Templates

The template you provide shapes what the agent remembers. A well-designed template:

  1. Signals what matters: The agent uses the template fields as hints for what to extract
  2. Stays concise: Working memory consumes context tokens on every request
  3. Uses clear labels: Ambiguous labels lead to inconsistent updates

Good template design:

# Customer Profile
- **Name**: 
- **Account Type**: [Free/Pro/Enterprise]
- **Primary Use Case**: 
- **Known Issues**: 
- **Communication Preference**: [Formal/Casual]

Poor template design:

# Information
- Stuff about the user:
- Things they mentioned:
- Other notes:

The first template gives the agent clear categories. The second is vague—the agent won't know what "stuff" or "things" to extract.

Template vs Schema

Mastra supports two formats for working memory:

| Format | Syntax | Best For |
| --- | --- | --- |
| Template (Markdown) | Free-form text with placeholders | Flexible facts, notes, summaries |
| Schema (Zod/JSON) | Typed object with defined fields | Structured data, programmatic access |

Template example (what we've been using):

workingMemory: {
  enabled: true,
  template: `
# Patient Information
- **Name**: 
- **Allergies**: 
- **Medications**: 
`,
}

Schema example (for structured data):

import { z } from "zod";
 
workingMemory: {
  enabled: true,
  schema: z.object({
    name: z.string().optional(),
    allergies: z.array(z.string()).optional(),
    medications: z.array(z.string()).optional(),
    lastVisit: z.string().optional(),
  }),
}

Use templates when you want flexibility and natural language. Use schemas when you need to programmatically read the working memory or validate its structure.

What Belongs in Working Memory?

Not everything should go in working memory. It's for facts that are:

  • Critical: Must not be lost, even after hundreds of messages
  • Stable: Don't change frequently during a conversation
  • Compact: Can be expressed in a few words

| ✅ Put in Working Memory | ❌ Don't Put in Working Memory |
| --- | --- |
| User's name | Full conversation transcript |
| Language preference | Current task details (use message history) |
| Medical allergies | Temporary session state |
| Account tier | Large documents or data |

If something changes frequently or is only relevant to the current task, message history is a better fit.

Implementation with Mastra

import { Agent } from "@mastra/core/agent";
import { Memory } from "@mastra/memory";
import { LibSQLStore } from "@mastra/libsql";
 
const agent = new Agent({
  name: "medical-assistant",
  instructions: `You are a medical triage assistant. 
  
When patients share medical information (allergies, medications, conditions), 
remember it in your working memory. Always check allergies before making 
any medication recommendations.`,
  model: "openai/gpt-4o",
  memory: new Memory({
    storage: new LibSQLStore({
      id: "medical-storage",
      url: "file:./medical.db",
    }),
    options: {
      lastMessages: 20,
      workingMemory: {
        enabled: true,
        scope: "resource", // Persists across all conversations with this patient
        template: `
# Patient Information
- **Name**: 
- **Age**: 
- **Known Allergies**: 
- **Current Medications**: 
- **Relevant Medical History**: 
`,
      },
    },
  }),
});

Instructing the Agent

Notice the instructions explicitly tell the agent to use working memory. While Mastra provides the mechanism, you should guide the agent on when to use it. Without clear instructions, the agent might not recognize that "I'm allergic to penicillin" should be saved.


5. Semantic Recall

Message history gives you recent context. Working memory gives you persistent facts. But what about information from weeks ago that didn't make it into working memory?

"What did we discuss about the marketing budget last month?" — this query can't be answered by the last 20 messages or a predefined template. You need to search past conversations.

This is RAG applied to conversation history.

How Semantic Recall Works

The core idea: every message gets converted to a vector embedding and stored in a vector database. When a relevant query comes in, we search for semantically similar past messages.

The embedding model converts text into a high-dimensional vector that captures semantic meaning. "Marketing budget" and "The marketing budget is $50k" have similar vectors, even though the exact words differ.
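
To ground the idea, here is a bare-bones sketch of the retrieval step. Real systems delegate this to a vector database; this just shows the math:

// Cosine similarity between two embedding vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Return the k stored messages most similar to the query embedding.
function topKSimilar(
  queryVec: number[],
  stored: { text: string; vec: number[] }[],
  k: number
): { text: string; score: number }[] {
  return stored
    .map((m) => ({ text: m.text, score: cosine(queryVec, m.vec) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
}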

The messageRange Parameter

When you find a relevant message, the surrounding context often matters too. If someone said "Yes, I agree with that" — that's useless without knowing what "that" refers to.

The messageRange parameter retrieves messages before and after each match:

With messageRange: 2, each matched message brings 2 messages before and 2 after—giving you a 5-message window of context.
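
A sketch of that expansion step, assuming messages are stored in order:

// Expand each matched index into a window of surrounding messages.
// With range = 2, one match yields a 5-message window; overlapping
// windows are merged via the Set.
function expandMatches(
  messages: string[],
  matchIndexes: number[],
  range: number
): string[] {
  const keep = new Set<number>();
  for (const i of matchIndexes) {
    const start = Math.max(0, i - range);
    const end = Math.min(messages.length - 1, i + range);
    for (let j = start; j <= end; j++) keep.add(j);
  }
  return [...keep].sort((a, b) => a - b).map((i) => messages[i]);
}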

Tuning Semantic Recall

Two key parameters control behavior:

| Parameter | What It Does | Trade-off |
| --- | --- | --- |
| topK | Number of similar messages to retrieve | Higher = more context, but more noise and tokens |
| messageRange | Context window around each match | Higher = better context, but more tokens |

Conservative settings (faster, cheaper):

semanticRecall: {
  topK: 3,
  messageRange: 1,
}

Aggressive settings (more recall, higher cost):

semanticRecall: {
  topK: 10,
  messageRange: 3,
}

Start conservative and increase if the agent frequently misses relevant history.

When Semantic Recall Fails

Semantic search finds conceptually similar content. It struggles with:

| Query Type | Why It Fails | Better Approach |
| --- | --- | --- |
| Exact IDs | "Order #12345" ≈ "Order #12346" semantically | Use tools with database lookup |
| Precise numbers | "$99.99" and "$89.99" have similar embeddings | Store in structured data |
| Recent context | Last 5 messages don't need vector search | Use message history |
| Negations | "I don't like pizza" ≈ "I like pizza" | Be aware of this limitation |

Semantic recall is a complement to—not a replacement for—message history and structured data retrieval.

Scope: Thread vs Resource

Like working memory, semantic recall can be scoped:

  • Thread scope: Only search within the current conversation
  • Resource scope: Search across all conversations with this user

semanticRecall: {
  topK: 5,
  messageRange: 2,
  scope: "resource",  // Search all conversations with this user
}

Resource scope is powerful for questions like "What did we discuss about X last month?" — but be careful with privacy. Make sure users expect their past conversations to inform new ones.

Performance Considerations

Semantic recall adds latency to every request:

  1. Embed the query (~50-100ms)
  2. Vector search (~10-50ms depending on DB size)
  3. Fetch message context (~10-20ms)

For most applications, this 100-200ms overhead is acceptable. For latency-sensitive use cases (like real-time voice), consider:

  • Disabling semantic recall entirely
  • Using a faster (but less accurate) embedding model
  • Reducing topK to minimize search time

Implementation with Mastra

import { Agent } from "@mastra/core/agent";
import { Memory } from "@mastra/memory";
import { LibSQLStore, LibSQLVector } from "@mastra/libsql";
 
const agent = new Agent({
  name: "assistant",
  instructions: `You are a helpful assistant with long-term memory.
 
When users ask about past discussions, search your memory to find relevant 
context. Reference specific details from past conversations when helpful.`,
  model: "openai/gpt-4o",
  memory: new Memory({
    storage: new LibSQLStore({
      id: "assistant-storage",
      url: "file:./memory.db",
    }),
    vector: new LibSQLVector({
      id: "assistant-vector",
      connectionUrl: "file:./vectors.db",
    }),
    options: {
      lastMessages: 20,
      semanticRecall: {
        topK: 5,           // Retrieve 5 most similar messages
        messageRange: 2,   // Include 2 messages before/after each match
        scope: "resource", // Search across all conversations with this user
      },
    },
  }),
});

Enabled by Default

When you configure a vector store, semantic recall is enabled by default. The configuration above shows how to tune it. To disable it entirely, set semanticRecall: false.

With all three patterns covered, let's see how they work together in a real-world agent.


6. Putting It All Together

Real-world agents often combine all three patterns. Here's a complete example:

import { Agent } from "@mastra/core/agent";
import { Memory } from "@mastra/memory";
import { LibSQLStore, LibSQLVector } from "@mastra/libsql";
 
// Tools for structured data retrieval (schematic; parameters elided)
const tools = [
  {
    name: "get_order_status",
    description: "Check the status of a customer order",
    parameters: { ... },
    execute: async ({ orderId }) => await db.getOrder(orderId),
  },
  {
    name: "get_product_info",
    description: "Get product details and pricing",
    parameters: { ... },
    execute: async ({ productId }) => await db.getProduct(productId),
  },
];
 
const agent = new Agent({
  name: "support-agent",
  instructions: `
You are a customer support agent with memory across conversations.
 
When users share personal information (name, preferences), remember it.
When they ask about past discussions, search your memory.
When they ask about orders or products, use the appropriate tools.
  `,
  model: "openai/gpt-4o",
  tools,
  memory: new Memory({
    storage: new LibSQLStore({ id: "support-storage", url: "file:./support.db" }),
    vector: new LibSQLVector({ id: "support-vector", connectionUrl: "file:./vectors.db" }),
    options: {
      // Recent conversation context
      lastMessages: 30,
      
      // Persistent user facts
      workingMemory: {
        enabled: true,
        scope: "resource", // Persists across all threads for this user
        template: `
# Customer Profile
- **Name**: 
- **Account Type**: 
- **Preferences**: 
- **Previous Issues**: 
`,
      },
      
      // Long-term semantic memory
      semanticRecall: {
        topK: 5,
        messageRange: 2,
      },
    },
  }),
});
 
// Usage
const response = await agent.generate(
  "What did we discuss about my refund last month?",
  {
    threadId: "conversation-789",
    resourceId: "customer-456",
  }
);

This agent:

  1. Remembers recent messages (last 30) for conversational context
  2. Maintains persistent facts about the customer across all conversations
  3. Searches past conversations semantically when asked about historical discussions
  4. Retrieves structured data (orders, products) via tools

What About Structured Data?

Notice the tools array in the example above. For structured data with known keys—order status, product specs, user profiles—you don't need memory patterns at all. Just define tools that query your database directly. A SQL query is faster, cheaper, and more reliable than semantic search when you know exactly what you're looking for.


Summary

Agent memory isn't magic—it's engineering around the context window constraint:

| Pattern | What It Does | Storage | Scope | Use When |
| --- | --- | --- | --- | --- |
| Message History | Loads last N messages | SQL/NoSQL | Thread | Conversational continuity |
| Working Memory | Persists key facts in prompt | SQL/NoSQL | Thread or Resource | User preferences, critical info |
| Semantic Recall | Searches past conversations | Vector DB | Thread or Resource | "What did we discuss about X?" |

The key principles:

  1. Context is precious—don't waste it on information the agent doesn't need right now
  2. Trim strategically—protect critical facts, let old messages go
  3. Retrieve on demand—fetch relevant context when needed, not "just in case"
  4. Use the right pattern—recent context, persistent facts, or semantic search

In the next chapter, we'll explore how memory fits into the broader Agent Loop—the orchestration pattern that ties together reasoning, tool use, and memory into a coherent agent architecture.