Memory & Context Management

An agent without memory starts from a blank slate on every interaction. To build coherent agents, we need to manage two types of memory: Short-term (the conversation history) and Long-term (external knowledge).

Short-Term Memory: The Context Window

The context window (e.g., 128k tokens) is finite and expensive. You cannot simply append messages forever.

Strategies for Conversation Management

  1. Sliding Window: Keep only the last $N$ messages.

    • Pros: Simple, low latency.
    • Cons: Forgets early details/instructions.
  2. Summarization (Reflective Memory): Periodically ask an LLM to summarize the conversation so far, and inject that summary as the new "system" context; see the first sketch after this list.

    "The user J.P. is asking about Python lists. We have already covered append and pop."

  3. Token Budgeting: Truncate the oldest messages when the token count exceeds a threshold, but always preserve the System Prompt and the very first user message (which often contains the main goal); see the second sketch after this list.
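
First, the summarization strategy as a minimal sketch. Here llm.complete is a hypothetical helper wrapping your chat-completion API, not a specific SDK:

// Reflective memory: fold older turns into a rolling summary.
// `llm.complete` is a hypothetical helper, not a real library call.
declare const llm: { complete(prompt: string): Promise<string> };

async function compressHistory(
  summary: string,
  oldMessages: { role: string; content: string }[]
): Promise<string> {
  const transcript = oldMessages
    .map(m => `${m.role}: ${m.content}`)
    .join("\n");
  return llm.complete(
    `Current summary:\n${summary}\n\n` +
    `New messages:\n${transcript}\n\n` +
    `Update the summary. Keep user goals, names, and open questions.`
  );
}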
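
Second, token budgeting. The token count below is a crude characters-divided-by-four estimate; a real implementation should use the model's tokenizer (e.g., tiktoken for OpenAI models):

// Token budgeting: evict the oldest middle messages, but always pin
// the system prompt and the first user message (the main goal).
type Message = { role: "system" | "user" | "assistant"; content: string };

// Rough heuristic: ~4 characters per token for English text.
const countTokens = (m: Message) => Math.ceil(m.content.length / 4);

function fitToBudget(messages: Message[], budget: number): Message[] {
  const pinned = messages.slice(0, 2); // system prompt + first user turn
  const rest = messages.slice(2);
  let total = messages.reduce((sum, m) => sum + countTokens(m), 0);

  // Drop from the oldest end of the unpinned middle until we fit,
  // always keeping the most recent message.
  while (total > budget && rest.length > 1) {
    total -= countTokens(rest.shift()!);
  }
  return [...pinned, ...rest];
}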

Long-Term Memory: RAG (Retrieval-Augmented Generation)

When the data exceeds the context window (e.g., your entire documentation or codebase), you need Retrieval-Augmented Generation.

RAG connects the LLM to a database. It works in three steps: Index, Retrieve, Generate; in production, a Rerank step usually sits between retrieval and generation.

1. Indexing (Vector Embeddings)

We convert text chunks into vectors (lists of numbers) using an embedding model (e.g., OpenAI's text-embedding-3-small or Cohere's Embed models). Semantic meaning is encoded in the vector space: "Dog" sits mathematically close to "Puppy".
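
"Close" here typically means cosine similarity: the angle between two vectors, ignoring their lengths. A minimal sketch, treating embeddings as plain number arrays:

// Cosine similarity: ~1.0 for near-identical meaning, lower for
// unrelated text.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// With some hypothetical embed() helper:
// cosineSimilarity(embed("dog"), embed("puppy"))   -> high
// cosineSimilarity(embed("dog"), embed("invoice")) -> low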

2. Retrieval Strategies

A naive "Top-K cosine similarity" search is often insufficient for production agents.

  • Hybrid Search: Combine Semantic Search (Vectors) with Keyword Search (BM25). Vectors are good for concepts; Keywords are good for exact matches (IDs, names). See the fusion sketch after this list.
  • Hypothetical Document Embeddings (HyDE): Instead of searching with the user's raw query (which might be short or vague), have an LLM generate a hypothetical answer, then search for documents similar to that answer (second sketch below).
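
One common way to merge the two ranked lists from hybrid search is Reciprocal Rank Fusion (RRF), which combines rank positions rather than raw scores (BM25 and cosine scores live on incompatible scales). A sketch, assuming each search returns document IDs in ranked order; the constant 60 is the conventional RRF damping value:

// RRF: a document scores 1/(k + rank) in each list it appears in;
// documents ranked high in either list float to the top.
function reciprocalRankFusion(rankings: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, i) => {
      const rank = i + 1; // convert 0-based index to 1-based rank
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}

// const merged = reciprocalRankFusion([bm25Ids, vectorIds]);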
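
HyDE itself is just one extra LLM call before embedding. A sketch, using the same hypothetical llm helper as before plus an embeddings client:

// HyDE: embed a hypothetical *answer*, because answers look more like
// the documents we want to retrieve than terse queries do.
declare const llm: { complete(prompt: string): Promise<string> };
declare const embeddings: { create(text: string): Promise<number[]> };

async function hydeVector(query: string): Promise<number[]> {
  const hypothetical = await llm.complete(
    `Write a short passage that plausibly answers: ${query}`
  );
  return embeddings.create(hypothetical);
}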

3. Reranking (The Quality Filter)

Vector databases fetch the top 100 candidates quickly, but their ranking is approximate. Use a dedicated Reranker Model (like Cohere Rerank) to score those 100 results accurately and pick the top 5 to feed into the LLM context.

Retrieve 100 (fast, approximate) -> Rerank (slow, precise) -> Top 5 -> LLM Context

Memory in Code

// A simplified RAG flow. The three clients below are placeholders;
// swap in your provider's SDK (e.g., OpenAI embeddings, a
// hybrid-capable vector DB, Cohere Rerank).
declare const embeddings: { create(text: string): Promise<number[]> };
declare const vectorDB: {
  search(opts: { vector: number[]; query: string; hybrid: boolean; limit: number }): Promise<{ content: string }[]>;
};
declare const reranker: {
  rank(opts: { query: string; documents: { content: string }[]; topN: number }): Promise<{ content: string }[]>;
};

async function retrieveContext(query: string): Promise<string> {
  // 1. Embed the query
  const vector = await embeddings.create(query);

  // 2. Hybrid search: the vector covers semantics; the raw query
  //    text feeds the keyword (BM25) side
  const candidates = await vectorDB.search({
    vector,
    query,
    hybrid: true,
    limit: 100  // cast a wide net, per the funnel above
  });

  // 3. Rerank the wide candidate set down to a few precise hits
  const topDocs = await reranker.rank({
    query,
    documents: candidates,
    topN: 5
  });

  // Separate documents so the LLM can tell them apart
  return topDocs.map(d => d.content).join("\n---\n");
}
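
Wiring the retrieved context into a final answer might look like this; the grounding instruction is one reasonable prompt pattern, not the only one:

// Ground the model in retrieved documents and tell it to admit gaps
// rather than guess. `llm.complete` is the same hypothetical helper
// used in the sketches above.
declare const llm: { complete(prompt: string): Promise<string> };

async function answerWithRAG(question: string): Promise<string> {
  const context = await retrieveContext(question);
  return llm.complete(
    `Answer using ONLY the context below. If the answer is not ` +
    `in the context, say you don't know.\n\n` +
    `Context:\n${context}\n\nQuestion: ${question}`
  );
}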

Summary

Memory is the difference between a one-off tool and a learning companion.

  • Use Sliding Windows or Summarization to keep the immediate conversation fresh despite token limits.
  • Use Hybrid RAG + Reranking to give your agent access to vast libraries of information while keeping its answers grounded (retrieval reduces hallucination, though it does not eliminate it).