
Agent Memory — How AI Agents Remember (2026)

An AI agent without memory starts from zero every conversation. It cannot recall your preferences, past decisions, or previous mistakes. Agent memory is the architecture that fixes this — 4 tiers of storage that give agents the ability to remember, personalize, and improve over time. This guide covers each tier with Python code you can ship to production.

Without memory, every LLM call is stateless — agents cannot remember prior sessions, user preferences, or past decisions unless you explicitly architect persistence into the system.

Most agent tutorials skip memory entirely. They show you a ReAct loop, add a few tools, and call it done. The agent works for a single turn. Ask it a follow-up question 10 minutes later and it has no idea what you were talking about.

This is the default behavior of every LLM. Stateless. No persistence. Each API call is independent. The model does not remember anything from prior calls unless you explicitly pass that context back in.

For a chatbot answering one-off questions, statelessness is fine. For an agent that manages your calendar, reviews your code, or handles customer support tickets across multiple sessions — statelessness is a critical limitation.

Memory solves three problems:

  1. Continuity — the agent picks up where it left off across sessions
  2. Personalization — the agent adapts to user preferences and past behavior
  3. Efficiency — the agent avoids repeating work it has already done

Production agents at companies like Intercom, Notion, and Replit all implement multi-tier memory. The complexity varies, but the core architecture is consistent: short-term context in the prompt, medium-term summaries in a database, long-term knowledge in a vector store, and structured relationships in a graph.

This guide covers:

  • The 4 memory tiers and when to use each one
  • Human-brain analogies that make the architecture intuitive
  • Python implementations for each tier — not pseudocode, real classes you can adapt
  • LangChain and LangGraph memory integration patterns
  • Memory management: when to forget, how to consolidate, privacy considerations
  • Production patterns: user-scoped memory, memory as a microservice, backup strategies

If you have read AI Agents, you already understand the ReAct loop and tool use. This page goes deep on the one capability that turns a clever demo into a production system: remembering.


A stateful agent that remembers a prior interaction requires zero extra effort from the user — that single difference separates a tool people tolerate from one they trust.

Consider two versions of the same customer support agent:

Stateless agent (no memory):

  • User: “I ordered a laptop last week and it hasn’t arrived.”
  • Agent: Searches order database, finds the order, provides tracking info.
  • Next day — User: “Any update on my laptop?”
  • Agent: “I don’t have context about a previous order. Could you provide your order number?”

Stateful agent (with memory):

  • User: “I ordered a laptop last week and it hasn’t arrived.”
  • Agent: Searches order database, finds the order, provides tracking info. Stores: user has pending laptop order #4821, shipping delayed.
  • Next day — User: “Any update on my laptop?”
  • Agent: Retrieves stored context. “Your laptop (order #4821) shipped yesterday via FedEx. Estimated delivery is Thursday.”

The second interaction required zero extra effort from the user. The agent remembered. That single difference — remembering versus forgetting — is the gap between a tool people tolerate and a tool people trust.

Agent memory tiers map directly to how human memory works:

| Human Memory | Agent Memory Tier | Duration | Capacity |
| --- | --- | --- | --- |
| Working memory (what you are thinking right now) | In-context memory | Seconds to minutes | 7 +/- 2 items / context window |
| Short-term memory (what happened today) | Buffer memory | Minutes to hours | Conversation history |
| Long-term memory (learned facts and experiences) | Vector store memory | Days to years | Millions of entries |
| Structured knowledge (relationships, schemas, rules) | Knowledge graph memory | Permanent | Entity-relationship network |

This analogy is not just pedagogical. It directly informs architecture decisions. You do not store everything in working memory — that is expensive and slow. You do not put rarely-accessed knowledge in the context window. Each tier has a cost-access tradeoff.


The four memory tiers map directly to human memory — working memory (in-context), short-term (buffer), long-term (vector store), and structured knowledge (knowledge graph) — each with a distinct cost-access tradeoff.

4 Memory Tiers for AI Agents

Each tier trades off access speed, capacity, and persistence. Production agents combine multiple tiers.

  1. In-Context Memory — current conversation in the prompt window. Fast access (0ms), limited by the context window, lost on session end.
  2. Buffer Memory — sliding window or summary of recent interactions. Conversation history, token-efficient summaries, session-scoped persistence.
  3. Vector Store Memory — semantic search over past interactions and documents. Cross-session recall, similarity-based retrieval, scales to millions of memories.
  4. Knowledge Graph Memory — structured relationships between entities and facts. Entity relationships, reasoning over connections, highest setup complexity.

This is the simplest form. Every message in the current conversation sits in the LLM’s context window. The model can “see” everything that has been said so far.

How it works: The framework appends each new user message and assistant response to a growing list. That entire list is sent with every API call.
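That append-and-resend loop fits in a dozen lines. In this sketch, `call_llm` is a placeholder for whatever chat-completion call you actually use; the point is that the "memory" is nothing more than the growing message list passed in full on every call.

```python
def call_llm(messages: list[dict]) -> str:
    # Placeholder: a real implementation sends `messages` to a chat API
    return f"(reply to {messages[-1]['content']!r})"


class InContextMemory:
    def __init__(self, system_prompt: str):
        self.messages = [{"role": "system", "content": system_prompt}]

    def chat(self, user_message: str) -> str:
        self.messages.append({"role": "user", "content": user_message})
        reply = call_llm(self.messages)  # the FULL history is resent every call
        self.messages.append({"role": "assistant", "content": reply})
        return reply


memory = InContextMemory("You are a helpful assistant.")
memory.chat("My name is Alice.")
memory.chat("What is my name?")  # history: 1 system message + 2 full turns
```

Note the cost implication: every turn resends everything before it, which is exactly why the sliding window problem below bites on long conversations.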

Strengths:

  • Zero latency — the data is already in the prompt
  • Perfect recall of everything in the window
  • No external infrastructure needed

Limitations:

  • Context windows have hard limits: 128K tokens for GPT-4o, 200K for Claude 3.5
  • Cost scales linearly with conversation length — you pay per token, every call
  • Everything vanishes when the session ends

The sliding window problem: A 50-turn conversation easily exceeds 20,000 tokens. At GPT-4o pricing ($2.50/1M input tokens), a heavy user generating 100 conversations per day costs $5/day in context alone. Multiply by 10,000 users. This is why you do not rely on in-context memory alone.

Buffer memory manages conversation history intelligently. Instead of keeping every message, it applies a strategy: truncate old messages, summarize them, or keep only the most recent N turns.

Three buffer strategies:

  1. Window buffer — keep the last K messages, drop everything older
  2. Summary buffer — periodically compress older messages into a dense summary
  3. Token buffer — keep as many recent messages as fit within a token budget

The summary strategy is the most useful in production. You get the recency of a window buffer with the context preservation of full history — at a fraction of the token cost.

Vector store memory enables cross-session recall. Past conversations, documents, and facts are converted to embeddings and stored in a vector database (Qdrant, Pinecone, Weaviate, ChromaDB). When the agent needs to recall something, it runs a similarity search.

How it works:

  1. At the end of each session, key facts and conversation summaries are embedded
  2. On a new session, the user’s first message is embedded and used as a query
  3. The top-K most similar memories are retrieved and injected into the system prompt
  4. The agent now has relevant historical context without keeping full conversation logs

This is where agents go from “useful in a single session” to “useful over weeks and months.” A coding assistant that remembers your project structure, preferred patterns, and past debugging sessions is qualitatively different from one that does not.

Knowledge graph memory stores structured relationships between entities. Instead of “the user prefers Python over JavaScript” as a flat text embedding, a graph stores: User --prefers--> Python, User --avoids--> JavaScript, Python --is_a--> Language.

When graphs add value:

  • The agent needs to reason over relationships (e.g., “which of the user’s projects use a framework that has a known vulnerability?”)
  • Multiple entities have complex interconnections
  • You need explainable memory — you can query the graph to see exactly what the agent “knows”

When graphs add overhead without value:

  • Simple preference storage (vector store is sufficient)
  • Low entity count (<100 entities)
  • No relationship-based reasoning required

The following implementations cover the three tiers that require real code — buffer, vector store, and knowledge graph — as production-grade Python classes you can adapt directly. (In-context memory needs no implementation beyond the message list you already send with each API call.)

Here is a production-grade buffer memory class that supports both window and summary strategies:

from dataclasses import dataclass
from typing import Literal

from openai import OpenAI


@dataclass
class Message:
    role: str  # "user", "assistant", "system"
    content: str
    tokens: int = 0


class BufferMemory:
    """Manages conversation history with configurable retention."""

    def __init__(
        self,
        strategy: Literal["window", "summary", "token"] = "summary",
        max_messages: int = 20,
        max_tokens: int = 4000,
        client: OpenAI | None = None,
    ):
        self.strategy = strategy
        self.max_messages = max_messages
        self.max_tokens = max_tokens
        self.messages: list[Message] = []
        self.summary: str = ""
        self.client = client or OpenAI()

    def add(self, role: str, content: str, tokens: int = 0) -> None:
        self.messages.append(Message(role=role, content=content, tokens=tokens))
        self._enforce_limits()

    def get_context(self) -> list[dict]:
        """Return messages formatted for the LLM API."""
        context = []
        if self.summary:
            context.append({
                "role": "system",
                "content": f"Summary of earlier conversation:\n{self.summary}",
            })
        for msg in self.messages:
            context.append({"role": msg.role, "content": msg.content})
        return context

    def _enforce_limits(self) -> None:
        if self.strategy == "window":
            # Keep only the last N messages
            if len(self.messages) > self.max_messages:
                self.messages = self.messages[-self.max_messages:]
        elif self.strategy == "summary":
            # Summarize when buffer exceeds limit, keep recent messages
            if len(self.messages) > self.max_messages:
                old = self.messages[:-10]  # summarize all but last 10
                self._summarize(old)
                self.messages = self.messages[-10:]
        elif self.strategy == "token":
            total = sum(m.tokens for m in self.messages)
            while total > self.max_tokens and len(self.messages) > 1:
                removed = self.messages.pop(0)
                total -= removed.tokens

    def _summarize(self, messages: list[Message]) -> None:
        text = "\n".join(f"{m.role}: {m.content}" for m in messages)
        prev = f"Previous summary: {self.summary}\n\n" if self.summary else ""
        response = self.client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{
                "role": "user",
                "content": (
                    f"{prev}Summarize this conversation in 3-5 bullet points. "
                    f"Preserve key facts, decisions, and user preferences:\n\n{text}"
                ),
            }],
            max_tokens=300,
        )
        self.summary = response.choices[0].message.content

Key design decisions:

  • gpt-4o-mini for summarization — cheap ($0.15/1M input tokens) and fast
  • Summaries are cumulative — each new summary includes the previous one
  • The last 10 messages are always kept verbatim for recency

This example uses Qdrant, but the pattern applies to any vector database:

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct
from openai import OpenAI
import uuid
from datetime import datetime


class VectorMemory:
    """Cross-session memory using semantic search over past interactions."""

    def __init__(
        self,
        collection: str = "agent_memory",
        qdrant_url: str = "http://localhost:6333",
        embedding_model: str = "text-embedding-3-small",
    ):
        self.qdrant = QdrantClient(url=qdrant_url)
        self.openai = OpenAI()
        self.collection = collection
        self.embedding_model = embedding_model
        self._ensure_collection()

    def _ensure_collection(self) -> None:
        collections = [c.name for c in self.qdrant.get_collections().collections]
        if self.collection not in collections:
            self.qdrant.create_collection(
                collection_name=self.collection,
                vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
            )

    def _embed(self, text: str) -> list[float]:
        response = self.openai.embeddings.create(
            model=self.embedding_model, input=text
        )
        return response.data[0].embedding

    def store(self, content: str, user_id: str, metadata: dict | None = None) -> str:
        """Store a memory with user scoping."""
        point_id = str(uuid.uuid4())
        payload = {
            "content": content,
            "user_id": user_id,
            "timestamp": datetime.utcnow().isoformat(),
            **(metadata or {}),
        }
        self.qdrant.upsert(
            collection_name=self.collection,
            points=[PointStruct(
                id=point_id,
                vector=self._embed(content),
                payload=payload,
            )],
        )
        return point_id

    def recall(
        self,
        query: str,
        user_id: str,
        top_k: int = 5,
        score_threshold: float = 0.7,
    ) -> list[dict]:
        """Retrieve relevant memories for a user."""
        results = self.qdrant.search(
            collection_name=self.collection,
            query_vector=self._embed(query),
            query_filter={"must": [{"key": "user_id", "match": {"value": user_id}}]},
            limit=top_k,
            score_threshold=score_threshold,
        )
        return [
            {
                "content": r.payload["content"],
                "score": r.score,
                "timestamp": r.payload["timestamp"],
            }
            for r in results
        ]

Critical detail: user scoping. Every memory is tagged with user_id. When recalling, the filter ensures User A never sees User B’s memories. Skip this and you have a data privacy incident.

A basic graph memory using NetworkX for prototyping (swap to Neo4j for production):

import networkx as nx
from typing import Optional


class GraphMemory:
    """Structured entity-relationship memory for agents."""

    def __init__(self):
        self.graph = nx.DiGraph()

    def add_entity(self, entity_id: str, entity_type: str, properties: dict) -> None:
        self.graph.add_node(entity_id, type=entity_type, **properties)

    def add_relationship(
        self, source: str, target: str, relation: str, properties: dict | None = None
    ) -> None:
        self.graph.add_edge(source, target, relation=relation, **(properties or {}))

    def query_entity(self, entity_id: str) -> Optional[dict]:
        if entity_id not in self.graph:
            return None
        data = dict(self.graph.nodes[entity_id])
        # Gather all relationships in both directions
        outgoing = [
            {"target": t, **self.graph.edges[entity_id, t]}
            for t in self.graph.successors(entity_id)
        ]
        incoming = [
            {"source": s, **self.graph.edges[s, entity_id]}
            for s in self.graph.predecessors(entity_id)
        ]
        return {"properties": data, "outgoing": outgoing, "incoming": incoming}

    def find_path(self, source: str, target: str) -> list[str] | None:
        """Find how two entities are connected."""
        try:
            return nx.shortest_path(self.graph, source, target)
        except nx.NetworkXNoPath:
            return None

    def get_neighborhood(self, entity_id: str, depth: int = 2) -> dict:
        """Get all entities within N hops."""
        subgraph = nx.ego_graph(self.graph, entity_id, radius=depth)
        return {
            "nodes": [{"id": n, **subgraph.nodes[n]} for n in subgraph.nodes],
            "edges": [
                {"source": u, "target": v, **subgraph.edges[u, v]}
                for u, v in subgraph.edges
            ],
        }


# Usage: building a user knowledge graph
graph = GraphMemory()
graph.add_entity("user_42", "User", {"name": "Alice", "role": "ML Engineer"})
graph.add_entity("project_a", "Project", {"name": "RAG Pipeline", "status": "active"})
graph.add_entity("python", "Language", {"version": "3.11"})
graph.add_entity("qdrant", "Tool", {"category": "vector_db"})
graph.add_relationship("user_42", "project_a", "works_on")
graph.add_relationship("project_a", "python", "uses")
graph.add_relationship("project_a", "qdrant", "uses")
graph.add_relationship("user_42", "python", "prefers")

# Query: What does Alice's project use?
info = graph.query_entity("project_a")
# Returns: tools, languages, and relationships in structured form

No production agent uses a single memory tier — the standard architecture retrieves from vector store and graph before each LLM call and writes back after each response.

No production agent uses a single tier. The standard architecture combines all four:

Request flow:

  1. User sends a message
  2. Agent retrieves relevant long-term memories from vector store (Tier 3)
  3. Agent queries knowledge graph for structured context about the user (Tier 4)
  4. Buffer memory provides recent conversation summary (Tier 2)
  5. All retrieved context is injected into the system prompt alongside the current conversation (Tier 1)
  6. LLM reasons and responds
  7. After the response, new facts are extracted and stored back to Tier 3 and Tier 4

Token budget allocation (128K context window example):

| Component | Token Budget | Purpose |
| --- | --- | --- |
| System prompt + tools | 4,000 | Agent instructions and tool definitions |
| Vector store memories | 2,000 | Top-5 relevant past interactions |
| Graph context | 1,000 | User entity relationships |
| Buffer summary | 500 | Compressed earlier conversation |
| Recent messages (Tier 1) | 8,000 | Last 10-15 messages verbatim |
| Reserved for response | 4,000 | LLM output generation |
| Total used | 19,500 | ~15% of available context |

The remaining 85% of the context window is available for tool results, long documents, and extended reasoning. Over-allocating memory context is a common mistake — it crowds out space the agent needs for actual work.
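A minimal sketch of budget-aware context assembly, assuming a rough four-characters-per-token estimate (swap in `tiktoken` for exact counts). The component names and budgets here are illustrative:

```python
def estimate_tokens(text: str) -> int:
    # Crude heuristic: ~4 characters per token for English text
    return max(1, len(text) // 4)


def truncate_to_budget(text: str, budget: int) -> str:
    # Truncate using the same 4-chars-per-token estimate
    return text[: budget * 4]


def assemble_context(components: dict[str, str], budgets: dict[str, int]) -> str:
    """Fit each memory component into its token budget, in priority order."""
    parts = []
    for name, budget in budgets.items():
        text = components.get(name, "")
        if estimate_tokens(text) > budget:
            text = truncate_to_budget(text, budget)
        if text:
            parts.append(f"## {name}\n{text}")
    return "\n\n".join(parts)


budgets = {"system": 4000, "vector_memories": 2000,
           "graph_context": 1000, "buffer_summary": 500}
context = assemble_context(
    {"system": "You are a support agent.",
     "vector_memories": "- order #4821 delayed"},
    budgets,
)
```

Components that exceed their budget are cut rather than allowed to crowd out the rest, which enforces the allocation table mechanically instead of by convention.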

LangChain provides built-in memory classes that implement these patterns:

from langchain.memory import (
    ConversationBufferMemory,
    ConversationSummaryMemory,
    ConversationBufferWindowMemory,
    VectorStoreRetrieverMemory,
)
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_qdrant import QdrantVectorStore

# Tier 2a: Simple buffer — keeps all messages
buffer = ConversationBufferMemory(return_messages=True)

# Tier 2b: Window buffer — keeps last 10 exchanges
window = ConversationBufferWindowMemory(k=10, return_messages=True)

# Tier 2c: Summary buffer — compresses old messages
summary = ConversationSummaryMemory(
    llm=ChatOpenAI(model="gpt-4o-mini"),
    return_messages=True,
)

# Tier 3: Vector store memory — cross-session recall
vectorstore = QdrantVectorStore.from_existing_collection(
    embedding=OpenAIEmbeddings(model="text-embedding-3-small"),
    collection_name="agent_memory",
    url="http://localhost:6333",
)
vector_memory = VectorStoreRetrieverMemory(
    retriever=vectorstore.as_retriever(search_kwargs={"k": 5}),
)

LangGraph takes a different approach. Instead of implicit memory classes, it uses explicit state schemas with checkpointing:

from langgraph.graph import StateGraph, MessagesState
from langgraph.checkpoint.memory import MemorySaver
# State persists automatically across graph invocations
checkpointer = MemorySaver() # In-memory; use PostgresSaver for production
graph = StateGraph(MessagesState)
# ... define nodes and edges ...
app = graph.compile(checkpointer=checkpointer)
# Each thread_id maintains its own conversation state
config = {"configurable": {"thread_id": "user_42_session_1"}}
result = app.invoke({"messages": [("user", "What did we discuss yesterday?")]}, config)

LangGraph’s thread_id pattern naturally supports user-scoped memory. Each user gets an isolated state. The checkpointer persists state between invocations, giving you Tier 1 and Tier 2 out of the box. For Tier 3 and Tier 4, you add retrieval and graph queries as nodes in the graph.


These two examples show memory in context: a coding assistant that recalls project structure across sessions, and a support agent that retrieves past resolutions to reduce repeat effort.

Example 1: Personal Coding Assistant with Memory


A coding assistant that remembers your project context across sessions:

class CodingAssistantMemory:
    """Multi-tier memory for a personalized coding assistant."""

    def __init__(self, user_id: str):
        self.user_id = user_id
        self.buffer = BufferMemory(strategy="summary", max_messages=20)
        self.vector = VectorMemory(collection="coding_assistant")
        self.graph = GraphMemory()

    def build_context(self, user_message: str) -> str:
        """Assemble memory context for the system prompt."""
        # Tier 3: Retrieve relevant past interactions
        memories = self.vector.recall(user_message, self.user_id, top_k=5)
        memory_text = "\n".join(
            f"- [{m['timestamp'][:10]}] {m['content']}" for m in memories
        )

        # Tier 4: Get user's project and preference context
        user_info = self.graph.query_entity(self.user_id)
        graph_text = ""
        if user_info:
            projects = [
                r["target"] for r in user_info["outgoing"]
                if r["relation"] == "works_on"
            ]
            prefs = [
                r["target"] for r in user_info["outgoing"]
                if r["relation"] == "prefers"
            ]
            graph_text = f"Active projects: {', '.join(projects)}\n"
            graph_text += f"Preferences: {', '.join(prefs)}"

        # Tier 2: Recent conversation summary
        summary = self.buffer.summary or "No prior conversation in this session."

        return (
            f"## User Memory\n{memory_text}\n\n"
            f"## User Profile\n{graph_text}\n\n"
            f"## Session Summary\n{summary}"
        )

    def after_response(self, user_msg: str, assistant_msg: str) -> None:
        """Update memory after each interaction."""
        self.buffer.add("user", user_msg)
        self.buffer.add("assistant", assistant_msg)
        # Store significant interactions to vector memory
        # (in production, use an LLM call to decide if the interaction is worth storing)
        self.vector.store(
            f"User asked: {user_msg}\nAssistant helped with: {assistant_msg[:200]}",
            self.user_id,
        )

Example 2: Customer Support Agent with Ticket Memory

# When a support agent resolves a ticket, store the resolution pattern
def store_resolution(vector_memory, user_id, ticket_id, issue, resolution):
    vector_memory.store(
        content=f"Issue: {issue}\nResolution: {resolution}\nTicket: {ticket_id}",
        user_id=user_id,
        metadata={"type": "resolution", "ticket_id": ticket_id},
    )


# On the next interaction, the agent recalls past resolutions
def get_relevant_history(vector_memory, user_id, current_issue):
    past = vector_memory.recall(current_issue, user_id, top_k=3)
    if past:
        return "This user had similar issues before:\n" + "\n".join(
            f"- {m['content']}" for m in past
        )
    return "No relevant history found for this user."

This pattern reduces resolution time by 30-40% in production deployments. The agent does not ask the user to re-explain problems it has already solved.


The key failure modes in agent memory — stale memories, similarity threshold miscalibration, and unbounded growth — all have tractable mitigations if you design for them from the start.

Not all memories should persist forever. Three reasons to forget:

Privacy compliance. GDPR and CCPA require the ability to delete user data on request. If your agent memory is an opaque vector store with no deletion mechanism, you cannot comply. Every memory must be tagged with user_id and deletable by user.

Relevance decay. A user’s Python 3.9 preference from 6 months ago is irrelevant if they switched to 3.12. Stale memories actively degrade agent performance by providing outdated context. Implement TTL (time-to-live) on memories or periodic relevance scoring.

Context pollution. Too many retrieved memories crowd the context window and confuse the LLM. If you retrieve 20 memories when 5 would suffice, the agent spends tokens processing irrelevant context. Strict score_threshold and top_k limits are essential.
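The TTL and relevance-decay ideas above can be sketched in a few lines, assuming each recalled memory carries its similarity score and a `stored_at` timestamp (field names are illustrative): expired memories are dropped outright, and the rest are re-ranked with an exponential recency decay.

```python
from datetime import datetime, timedelta, timezone


def effective_score(similarity: float, stored_at: datetime,
                    now: datetime, half_life_days: float = 30.0) -> float:
    # Exponential decay: a memory's weight halves every `half_life_days`
    age_days = (now - stored_at).total_seconds() / 86400
    return similarity * 0.5 ** (age_days / half_life_days)


def filter_memories(memories: list[dict], now: datetime,
                    ttl_days: float = 180.0, top_k: int = 5) -> list[dict]:
    # Hard TTL first: anything older than `ttl_days` is dropped outright
    fresh = [m for m in memories
             if (now - m["stored_at"]) <= timedelta(days=ttl_days)]
    fresh.sort(key=lambda m: effective_score(m["score"], m["stored_at"], now),
               reverse=True)
    return fresh[:top_k]


now = datetime.now(timezone.utc)
memories = [
    {"content": "prefers Python 3.9", "score": 0.9,
     "stored_at": now - timedelta(days=200)},  # past TTL: expired
    {"content": "migrated to Python 3.12", "score": 0.8,
     "stored_at": now - timedelta(days=5)},
]
kept = filter_memories(memories, now)  # only the recent memory survives
```

The half-life and TTL values are tuning knobs, not constants; a coding assistant and a support agent will want very different decay rates.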

Vector similarity search returns the “most similar” results, not the “correct” results. A threshold of 0.7 cosine similarity sounds reasonable, but calibration is critical:

  • Too low (0.5): Retrieves tangentially related memories that confuse the agent
  • Too high (0.9): Misses relevant memories that are phrased differently
  • Right range (0.7-0.8): Depends on your embedding model and domain

Test your threshold with real user data. The “right” value varies by embedding model, domain vocabulary, and memory content structure.
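One way to run that test, sketched here under the assumption that you have hand-labeled pairs of (cosine similarity, judged relevant?) from real queries: sweep candidate thresholds and keep the one that maximizes F1.

```python
def f1_at_threshold(pairs: list[tuple[float, bool]], threshold: float) -> float:
    # pairs: (similarity score, human-judged relevance) for query-memory pairs
    tp = sum(1 for s, rel in pairs if s >= threshold and rel)
    fp = sum(1 for s, rel in pairs if s >= threshold and not rel)
    fn = sum(1 for s, rel in pairs if s < threshold and rel)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def calibrate(pairs, thresholds=(0.5, 0.6, 0.7, 0.8, 0.9)):
    # Pick the threshold with the best precision/recall balance
    return max(thresholds, key=lambda t: f1_at_threshold(pairs, t))


# Illustrative labeled data from real user queries
pairs = [(0.92, True), (0.81, True), (0.74, True), (0.69, False),
         (0.55, False), (0.86, False), (0.78, True), (0.48, False)]
best = calibrate(pairs)  # 0.7 wins on this sample
```

A low threshold inflates false positives (context pollution); a high one inflates false negatives (missed recall). F1 makes that tradeoff explicit instead of guessed.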

Over weeks of use, a user accumulates thousands of individual memories. Raw accumulation degrades retrieval quality — the signal-to-noise ratio drops. Production systems run periodic consolidation:

  1. Cluster similar memories using the same embedding space
  2. Merge clusters into consolidated summaries
  3. Replace individual memories with the consolidated version
  4. Preserve source references for auditability

This is analogous to how human memory consolidates during sleep — individual episodic memories are compressed into general knowledge.
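Steps 1 through 4 can be sketched with a greedy cosine-clustering pass. This is a simplification: production systems would use a proper clustering algorithm and an LLM call for the merge step, where here clusters are merged by concatenation.

```python
def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb)


def consolidate(memories: list[dict], threshold: float = 0.9) -> list[dict]:
    # Step 1: greedily cluster memories whose embeddings are near-duplicates
    clusters: list[list[dict]] = []
    for mem in memories:
        for cluster in clusters:
            if cosine(mem["embedding"], cluster[0]["embedding"]) >= threshold:
                cluster.append(mem)
                break
        else:
            clusters.append([mem])
    # Steps 2-4: merge each cluster (an LLM summary in production), replace
    # the originals, and keep source ids for auditability
    return [{
        "content": " | ".join(m["content"] for m in cluster),
        "embedding": cluster[0]["embedding"],
        "sources": [m["id"] for m in cluster],
    } for cluster in clusters]


memories = [
    {"id": "m1", "content": "likes pytest", "embedding": [1.0, 0.0]},
    {"id": "m2", "content": "prefers pytest over unittest", "embedding": [0.99, 0.05]},
    {"id": "m3", "content": "works on a RAG pipeline", "embedding": [0.0, 1.0]},
]
merged = consolidate(memories)  # m1 and m2 merge; m3 stays separate
```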

| Component | Cost | Notes |
| --- | --- | --- |
| OpenAI embeddings (text-embedding-3-small) | $0.02 per 1M tokens | Cheapest tier |
| Qdrant Cloud (1GB) | ~$25/month | Managed hosting |
| Pinecone Serverless | $0.002/1K reads | Pay per query |
| GPT-4o-mini summaries | $0.15/1M input tokens | For buffer consolidation |
| Neo4j AuraDB Free | $0/month | 200K nodes, 400K relationships |

For an agent serving 1,000 daily active users with 10 interactions each: embedding costs are roughly $0.20/day, vector queries $0.02/day, and summarization $0.45/day. Total memory infrastructure: under $1/day. The cost is negligible compared to LLM inference.


Agent memory is a reliable interview differentiator — most candidates describe the ReAct loop, but few can explain cross-session state management, user scoping, and GDPR deletion requirements.

Memory is a reliable differentiator in GenAI engineering interviews. Most candidates can describe the ReAct loop. Far fewer can explain how an agent maintains state across sessions. Here is what interviewers test:

“How would you design memory for an agent that handles customer support across multiple sessions?”

Strong answer structure:

  1. Start with the access pattern — users return with follow-up questions, need context from prior tickets
  2. Describe the tier architecture: buffer for current session, vector store for past tickets, graph for customer profile
  3. Address user scoping — memory isolation between customers
  4. Mention privacy compliance — GDPR deletion, data retention policies
  5. Discuss failure modes — stale memories, retrieval errors, cold-start for new users

“What happens when an agent’s memory is wrong?”

This tests whether you have built real systems. Correct answer: wrong memories are worse than no memories. The agent confidently acts on outdated or incorrect context. Mitigation: confidence scoring on retrieved memories, recency weighting, user feedback loops to flag incorrect recall, periodic memory audits.

“How do you scale memory for 1 million users?”

Key points:

  • Vector stores scale horizontally — Qdrant, Pinecone, and Weaviate all handle millions of vectors
  • User scoping via metadata filters, not separate collections per user
  • Memory consolidation to prevent unbounded growth
  • Tiered storage: hot memories in vector store, cold memories archived to object storage

“When would you NOT add memory to an agent?”

Best answer: when the task is inherently stateless. A code generation agent that takes a spec and returns code does not need memory. A translation agent does not need memory. Memory adds complexity, cost, and potential failure modes. Only add it when the user experience measurably improves.

  1. Confusing RAG with memory. RAG retrieves from a static document corpus. Memory retrieves from dynamically-created interaction history. The retrieval mechanism is similar, but the data source and lifecycle are fundamentally different.
  2. Ignoring privacy. Any memory system stores user data. Failing to mention GDPR, data retention, or deletion capabilities signals you have not built production systems.
  3. Over-engineering. Proposing a knowledge graph for an agent that only needs to remember 3 user preferences. Match the tier to the complexity.

At scale, memory is exposed as a dedicated internal microservice — the agent calls a memory API rather than querying storage directly, enabling independent scaling and privacy enforcement.

Every production memory system partitions by user. The architecture pattern:

User Request → Auth Layer → Extract user_id
        ↓
Memory Service (internal API)
  ├── GET    /memory/{user_id}/recall?query=...
  ├── POST   /memory/{user_id}/store
  └── DELETE /memory/{user_id}  (GDPR)
        ↓
┌──────────────────────┐
│ Vector Store         │ ← Tier 3
│ (user_id filter)     │
├──────────────────────┤
│ Graph DB             │ ← Tier 4
│ (user subgraph)      │
├──────────────────────┤
│ Session Store        │ ← Tier 2
│ (Redis/Postgres)     │
└──────────────────────┘

Memory as a microservice is the dominant pattern at scale. The agent does not directly query databases. It calls a memory API that handles retrieval, storage, privacy filtering, and rate limiting. This decouples the agent logic from the storage infrastructure and enables independent scaling.
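That API surface can be sketched as an in-process class. A real deployment would sit behind HTTP with authentication, and the dict here stands in for the tier databases; the substring `recall` is a placeholder for vector search with a `user_id` filter.

```python
import uuid


class MemoryService:
    """Sketch of the memory-service interface the agent calls."""

    def __init__(self):
        self._store: dict[str, list[dict]] = {}  # user_id -> memories

    def store(self, user_id: str, content: str) -> str:
        memory_id = str(uuid.uuid4())
        self._store.setdefault(user_id, []).append(
            {"id": memory_id, "content": content})
        return memory_id

    def recall(self, user_id: str, query: str, top_k: int = 5) -> list[dict]:
        # Placeholder ranking: substring match. Production: vector search
        # scoped by user_id, plus graph and session lookups.
        hits = [m for m in self._store.get(user_id, [])
                if query.lower() in m["content"].lower()]
        return hits[:top_k]

    def delete_user(self, user_id: str) -> int:
        """GDPR deletion: remove every memory for this user."""
        return len(self._store.pop(user_id, []))


svc = MemoryService()
svc.store("user_a", "pending laptop order #4821")
svc.store("user_b", "asked about refund policy")
svc.recall("user_b", "laptop")  # empty: user scoping holds
svc.delete_user("user_a")       # one memory erased
```

Because scoping and deletion live in the service, every agent that calls it inherits privacy enforcement for free.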

Memory is user data. Treat it with the same rigor as any other user data:

  • Daily backups of vector store and graph database
  • Point-in-time recovery — if a bug corrupts memories, you need to roll back
  • Export format — users may request their data (GDPR Article 20). Store memories in a format that can be exported as JSON
  • Migration path — switching from Qdrant to Pinecone should not require re-embedding everything. Store raw text alongside vectors
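The export point above is cheap to satisfy if raw text is stored alongside vectors. A minimal sketch, with illustrative row fields:

```python
import json


def export_user_memories(rows: list[dict], user_id: str) -> str:
    """GDPR Article 20 export: dump one user's memories as JSON."""
    owned = [
        {"content": r["content"], "timestamp": r["timestamp"]}
        for r in rows if r["user_id"] == user_id  # user scoping again
    ]
    return json.dumps({"user_id": user_id, "memories": owned}, indent=2)


rows = [
    {"user_id": "user_42", "content": "prefers Python", "timestamp": "2026-01-10"},
    {"user_id": "user_7", "content": "uses Go", "timestamp": "2026-01-11"},
]
payload = export_user_memories(rows, "user_42")  # only user_42's data
```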

Track these metrics in production:

| Metric | Healthy Range | Action If Out of Range |
| --- | --- | --- |
| Average recall latency | <100ms | Scale vector DB or reduce top_k |
| Memory retrieval hit rate | >60% | Lower score_threshold or check embedding quality |
| Memories per user (avg) | 50-500 | Run consolidation if above; check storage pipeline if below |
| Stale memory ratio | <20% | Increase consolidation frequency |
| GDPR deletion response time | <24 hours | Automate deletion pipeline |

Start with buffer memory for 80% of use cases, add vector store for cross-session recall, and only introduce knowledge graphs when the agent needs to reason over entity relationships.

Agent memory is not one thing. It is four tiers, each solving a different problem:

  1. In-context memory — the conversation so far, fast but ephemeral and expensive
  2. Buffer memory — managed conversation history with summarization, session-scoped
  3. Vector store memory — semantic search over past interactions, cross-session, scalable
  4. Knowledge graph memory — structured entity relationships, highest value for reasoning, highest setup cost
| Question | Answer | Recommended Tier |
| --- | --- | --- |
| Do you need multi-session continuity? | No | Tier 1 (in-context) is sufficient |
| Do you need multi-session continuity? | Yes | Add Tier 2 + Tier 3 |
| Do conversations exceed 20 turns? | Yes | Add Tier 2 (buffer with summary) |
| Do you need to recall past interactions? | Yes | Add Tier 3 (vector store) |
| Do you have complex entity relationships? | Yes | Add Tier 4 (knowledge graph) |
| Are you serving <100 users? | Yes | Start with Tier 1 + Tier 2 |
| Are you serving >10,000 users? | Yes | Full 4-tier architecture |
  1. Start simple. Buffer memory covers 80% of use cases. Add vector and graph only when you have evidence they improve the user experience.
  2. Scope everything by user. Memory without user isolation is a privacy violation waiting to happen.
  3. Set retrieval limits. Always use top_k and score_threshold. Unbounded retrieval degrades performance.
  4. Build deletion into day one. Retrofitting GDPR compliance into a memory system is painful and expensive.
  5. Monitor retrieval quality. Bad memories are worse than no memories. Track hit rates and user corrections.
  6. Consolidate regularly. Thousands of raw memories degrade retrieval. Merge and compress on a schedule.


Last updated: March 2026. Memory frameworks and vector database capabilities evolve rapidly; verify specific APIs against current documentation.
