arXiv 2310.08560 · UC Berkeley

MemGPT: Towards LLMs as Operating Systems

arxiv.org/abs/2310.08560 ↗

The central problem: LLMs have a fixed context window. Once exceeded, earlier information is lost permanently. MemGPT proposes a solution borrowed from a 60-year-old idea in operating systems: virtual memory and paging — applied directly to the LLM context management problem.

The analogy holds more tightly than it first appears. The LLM is not just a process running on an OS — it is the OS, orchestrating its own memory hierarchy through autonomous tool calls.

The LLM is not just a process running on an OS — it is the OS.

— Reading note

Three-Tier Memory Architecture

Core Memory is always loaded — the identity-level facts that must colour every response. Recall Storage is the chronological conversation history, retrievable by time range. Archival Storage is a vector database of long-term knowledge, retrieved semantically on demand. What distinguishes this from RAG is that the model itself decides what to page in and out — through tool calls it initiates autonomously from within its own reasoning trace.

Key Tools

archival_memory_search(q)  ·  archival_memory_insert(text)  ·  core_memory_replace(field, value) — called by the model when it judges current context insufficient, not when prompted by the user.

Inner Monologue: Separating Reasoning from Output

Before any visible output, the model runs a hidden reasoning step. Here it asks: is current context sufficient? Which memories are relevant? What from this exchange is worth storing? Only after this deliberation does it produce user-visible output. This separation of reasoning from output is the mechanism that makes the memory management adaptive rather than rule-based.
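The loop can be sketched in a few lines of Python. The tool names (`archival_memory_search`, `archival_memory_insert`) follow the paper; everything else — the sufficiency check, the stand-in archive, the dispatch logic — is illustrative, not the paper's implementation.

```python
# Hypothetical sketch of the MemGPT inner-monologue + tool-call loop.
# Tool names follow the paper; the heuristics here are stand-ins.

ARCHIVE: list[str] = []  # stand-in for the archival vector store

def archival_memory_insert(text: str) -> None:
    """Persist a fact to long-term archival storage."""
    ARCHIVE.append(text)

def archival_memory_search(q: str) -> list[str]:
    """Naive substring match standing in for semantic retrieval."""
    return [m for m in ARCHIVE if q.lower() in m.lower()]

def step(user_msg: str, context: list[str]) -> str:
    # 1. Inner monologue (hidden): is current context sufficient?
    needs_recall = not any(user_msg.lower() in c.lower() for c in context)
    # 2. Autonomous tool call: page relevant memories in before answering.
    if needs_recall:
        context = context + archival_memory_search(user_msg)
    # 3. Decide what from this exchange is worth storing.
    archival_memory_insert(user_msg)
    # 4. Only now produce user-visible output.
    return f"(answering with {len(context)} context items)"
```

The key property the sketch preserves is ordering: retrieval and storage decisions happen inside the reasoning step, before any visible output is produced.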

The Noise Warning — Most Important Finding

Buried in the ablation results: when archival storage contains high-noise content, agent performance degrades below a no-memory baseline. Bad memories are worse than no memories. This inverts a common assumption that more stored context is always better.

Implication

Any system built on this architecture must invest as much in what not to store as in retrieval quality. A filtering layer is not optional — it is load-bearing infrastructure.

Field Notes · Enterprise Design Considerations

On Memory as Identity

Core Memory is not just cached context — it is an explicit representation of identity, separate from and more stable than any individual conversation. This separation suggests that agent fidelity depends less on generation quality and more on the quality of the identity layer. Get Core Memory right, and retrieval errors become recoverable. Get it wrong, and no retrieval scheme saves you.

On the Noise Finding and Data Quality in Enterprise Contexts

In organisational contexts, the signal-to-noise problem is severe. Most of what people say in meetings, emails, and documents is not useful for characterising their stable positions or judgment style. The filtering layer — the mechanism deciding what is worth committing to archival storage — is the highest-leverage engineering decision in any long-horizon agent system. This is typically where teams under-invest.

On Latency and Real-time Constraints

The inner monologue + tool-call loop introduces measurable latency. For asynchronous applications this is acceptable. For synchronous interactive contexts, it requires design work: pre-fetching probable memory queries, caching frequently accessed archival content, and deciding which reasoning steps can be elided without quality loss.
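One of those mitigations — caching frequently accessed archival content — can be sketched with a memoised lookup. The store and query shapes below are hypothetical stand-ins, not any paper's API:

```python
from functools import lru_cache

# Illustrative only: cache archival lookups so repeated queries in a
# synchronous session skip the vector-store round trip. Returns a tuple
# because lru_cache requires hashable, immutable values.

@lru_cache(maxsize=256)
def cached_archival_search(query: str) -> tuple[str, ...]:
    # A real system would hit the vector database here; this is a stub.
    return tuple(m for m in _ARCHIVE if query.lower() in m.lower())

_ARCHIVE = ["Q3 roadmap frozen", "hiring freeze lifted in May"]
```

Cache invalidation is the catch: any `archival_memory_insert` should clear or version the cache, otherwise the agent answers from stale memory.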

UIST 2023 · Stanford University & Google Research

Generative Agents: Interactive Simulacra of Human Behavior

arxiv.org/abs/2304.03442 ↗

25 LLM-powered agents living in a shared sandbox — each with a name, occupation, and memories, navigating daily life with no scripted outcomes. The most rigorous test to date of whether language models can plausibly simulate human social behaviour at the individual level.

A birthday party organised by one agent, attended by others who had not been told to go — the social network stitched itself together from individual decisions alone, density rising from 0.17 to 0.74. Nothing was scripted.

— Reading note on emergent social dynamics

Memory Stream → Reflection → Planning

The memory stream is a timestamped natural-language log. Retrieval uses a weighted combination of recency (exponential decay, λ = 0.995), poignancy (LLM-rated importance 1–10), and embedding similarity — only top-scoring memories enter the prompt for any given decision. The reflection layer periodically synthesises memories into higher-order abstractions.

Retrieval Formula

score = α·recency + β·importance + γ·relevance  (all components normalised to [0, 1]; the paper sets α = β = γ = 1).
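A minimal Python sketch of this scoring, using the paper's decay factor of 0.995 and min-max normalisation of each component; the toy memory records and the top-k interface are invented for illustration:

```python
# Sketch of the Generative Agents retrieval score:
#   score = α·recency + β·importance + γ·relevance
# Decay factor 0.995 follows the paper; memory records are invented.

def normalise(xs):
    """Min-max scale a list of floats into [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) if hi > lo else 0.0 for x in xs]

def retrieve(memories, k=2, alpha=1.0, beta=1.0, gamma=1.0):
    """memories: dicts with hours_ago, importance (1-10), relevance (0-1)."""
    recency = normalise([0.995 ** m["hours_ago"] for m in memories])
    importance = normalise([m["importance"] for m in memories])
    relevance = normalise([m["relevance"] for m in memories])
    scores = [
        alpha * r + beta * i + gamma * v
        for r, i, v in zip(recency, importance, relevance)
    ]
    # Only the top-k scoring memories enter the prompt.
    ranked = sorted(zip(scores, memories), key=lambda t: t[0], reverse=True)
    return [m for _, m in ranked[:k]]
```

Note the interaction: a highly relevant but old memory can still win if its importance score is high, which is exactly the behaviour a pure-similarity RAG pipeline lacks.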

Evaluation: d=8.16 and the Ablation Evidence

100 human evaluators assessed agent believability; the full architecture scored d = 8.16 above the best ablated variant. The paper also flags a failure mode: RLHF-trained models exhibit a cooperative bias — overly agreeable, rarely expressing strong disagreement — so any system modelling real individuals must explicitly counteract it.

Field Notes · Organisational Simulation

On Reflection as a Model of Character Formation

Character is not initialised — it accumulates. The longer the observation window, the more coherent the emergent character representation, because reflections compound over time. This has direct implications for any long-horizon agent system: early outputs will be lower fidelity than later ones, and that degradation curve should be modelled and communicated.

On Counteracting the Cooperation Bias

For any agent intended to represent a specific individual's positions — including their contrarian or challenging ones — the RLHF cooperative bias must be explicitly addressed. Seed the agent's initial character description with examples of documented disagreements, refusals, and push-backs. The character layer must actively encode friction, not just consensus.

On Emergent Organisational Dynamics

The emergent social network finding suggests a possible application to organisational modelling. A fleet of agents initialised with employee profiles could simulate how information propagates across a company, where diffusion bottlenecks emerge, or how a policy change cascades through different team structures.

Notes on Hierarchical Memory in Collaborative AI Systems

After reading MemGPT alongside Generative Agents, a design pattern emerges that applies well beyond either paper's academic setting. These notes sketch how the combined architecture applies to AI systems operating in real organisational contexts — where the agent must represent a specific person's judgment, not a fictional character's.

Combined Architecture Mapping

Layer | Source | Purpose in organisational context
Core Memory | MemGPT | Stable identity: communication style, domain confidence, decision pattern — always in context
Recall Storage | MemGPT | Chronological log of observed utterances and decisions, time-indexed
Archival Storage | MemGPT | Long-term knowledge: stated positions on topics, historical decisions, inferred stances
Reflection Layer | Generative Agents | Periodic compression of observations into character generalisations
Filtering Layer | MemGPT (noise warning) | Pre-storage classification: high-value observations vs. phatic noise — most under-invested component

The Filtering Problem is the Hardest Problem

Both papers point to the same bottleneck from different directions. MemGPT's noise warning says bad storage degrades performance below baseline. Generative Agents' retrieval scoring says only high-salience memories should enter the prompt. Neither paper fully specifies how to classify input before storage.

Compound Effect

Filtering quality affects retrieval quality multiplicatively. A small improvement in pre-storage filtering compounds across every future retrieval, because it permanently lowers the noise floor that every subsequent query must contend with.
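Since neither paper specifies the pre-storage classifier, the gate below is a hypothetical sketch. The importance rater stands in for an LLM judge (in the spirit of Generative Agents' poignancy rating); the keyword lists exist only to show the shape of the gate.

```python
# Hypothetical pre-storage filter. In practice the importance score
# would come from an LLM judge rating each utterance 1-10; the keyword
# heuristic below is a stand-in to make the gate's shape concrete.

PHATIC = {"thanks", "sounds good", "no problem"}          # low-signal chatter
SIGNAL = {"decided", "disagree", "commit", "risk", "blocked"}  # stable positions

def importance(utterance: str) -> int:
    """Rate 1-10; stand-in for an LLM-rated importance score."""
    text = utterance.lower()
    if any(p in text for p in PHATIC):
        return 1
    return 8 if any(s in text for s in SIGNAL) else 4

def should_store(utterance: str, threshold: int = 5) -> bool:
    """Gate in front of archival_memory_insert: only high-signal content passes."""
    return importance(utterance) >= threshold
```

The threshold is the lever: set it too low and the noise warning applies; too high and the agent forgets genuine commitments. That trade-off is where the domain expertise lives.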

Structured Observation Data as a Behavioural Prior

The Generative Agents paper initialises character from a brief biographical description. In practice, organisations hold far richer structured data about individuals — survey instruments, performance assessments, 360-degree feedback. This note considers how that data maps onto the Core Memory layer, and where the translation is lossy.

Survey data captures attitude. Core Memory requires predicted behaviour. The translation layer between them is where domain expertise matters most.

— Reading note

Four Survey Instruments, Four Memory Dimensions

Instrument | Memory Dimension | Predicts
Wellbeing Survey | Capacity & Emotional State | Contribution frequency, assertiveness, meeting engagement style
Engagement / eNPS | Values & Organisational Stance | Default posture toward institutional positions — challenge vs. defend
Development / Performance | Capability Self-Perception | Domain confidence zones — where the individual contributes vs. defers
360-Degree Feedback | Observed Behavioural Signature | Behavioural consistency across contexts — highest fidelity source

Core Memory Structure — What It Is and Why It Matters

The purpose of Core Memory is to give an AI agent — or a team designer — a precise, structured map of how a specific individual thinks. Not their job title. Not their seniority. Their actual cognitive fingerprint: where they lead, where they defer, how they decide, and what their current state of capacity and trust looks like right now.

Without this, an AI agent interacting with a human is guessing. With it, the agent knows — before a single word is spoken — whether to challenge or support, whether to present data or a conclusion, whether to push back or give space. The same precision applies to team design: once you have Core Memory profiles for every person on a team, you can see the team's collective cognitive landscape at a glance — its strengths, its blind spots, and exactly where an AI agent would add most value rather than simply duplicate what humans already provide.

The Point

Core Memory is the shared language between humans and agents. It makes human cognition legible to AI — and makes AI behaviour predictable to humans. It is the foundation on which high performance human+agent teams are built.

# Core Memory structure derived from survey instruments
COMMUNICATION_STYLE: "Data-driven. Measured. Asks clarifying questions
before taking a position. Peer-validated across 8 reviewers (360)."
CURRENT_CAPACITY: "Workload pressure: 7.2/10 (Q4 wellbeing survey).
Engagement: high. Work-life balance: declining trend (2 quarters).
Psychological safety: moderate — expresses dissent selectively."
ORGANISATIONAL_STANCE: "eNPS: 8/10. Strong strategic alignment.
Likely to defend institutional positions unless challenged with data."
DOMAIN_CONFIDENCE:
  data_analysis     → HIGH (contributes substantively)
  risk_assessment   → HIGH (contributes substantively)
  strategy          → LOW  (asks questions, defers)
  people_mgmt       → LOW  (listens, rarely initiates)
DECISION_PATTERN: "Requires data before committing. Will explicitly
defer: 'I need to check the numbers.' Slow but thorough."

Reading this profile, an AI agent immediately knows: lead with evidence, not conclusions. Offer strategic challenge, since this person won't generate it themselves. Don't push on people management — they'll disengage. And watch the workload pressure: this person is close to capacity and expressing dissent selectively, which means important signals may be going unspoken. That is actionable. A job title is not.

Synthesis · Apr 2026

From Core Memory to High Performance Teams

Core Memory answers one question precisely: who is this person, as a thinking agent? How do they process information, where are they strong, where do they defer, and what does their current state of capacity and trust look like right now?

That question matters because it unlocks a second, more important one: given what we know about each person, how do we compose a team — human and AI — that is collectively smarter than the sum of its parts? Core Memory is the input. High Performance Team architecture is the output. The gap between them is where most organisations leave value on the table.

A traditional team is assembled from job titles and seniority levels. A high performance human+agent team is assembled from cognitive fingerprints. You look at each person's domain confidence map, decision pattern, and organisational stance — then you ask: where are the collective blind spots? Where does the team systematically defer? Which challenges will no one on this team ever raise? Those gaps are exactly where AI agents belong. Not replicating what humans already do well — filling what the human team structurally cannot generate on its own.

A high performance team is not a collection of high performers. It is a collection of people whose cognitive fingerprints are mutually load-bearing — each member's blind spots covered by another's strength, and agents deployed precisely at the gaps.

— Design principle

Reading Multiple Core Memory Profiles as a Team Design Tool

Once you have Core Memory structures for every team member, you can overlay them. The result is a map of the team's collective cognitive landscape — not who is "best", but what the team can and cannot generate on its own. A team where three of four members have strategy → LOW will consistently under-examine strategic assumptions, regardless of individual competence. A team where everyone shares the same organisational stance will never produce the productive friction that stress-tests decisions. These are not personality problems — they are structural gaps, and they are fixable with deliberate composition.
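The overlay itself is a simple aggregation over the DOMAIN_CONFIDENCE fields shown earlier. The team data and the 50% quorum threshold below are invented for illustration:

```python
from collections import Counter

# Sketch: overlay Core Memory profiles to surface team-level blind spots.
# Profile shape mirrors the DOMAIN_CONFIDENCE field; the team is invented.

def blind_spots(profiles: dict[str, dict[str, str]], quorum: float = 0.5):
    """Domains where at least `quorum` of members are LOW confidence."""
    lows = Counter()
    for confidence_map in profiles.values():
        for domain, level in confidence_map.items():
            if level == "LOW":
                lows[domain] += 1
    n = len(profiles)
    return sorted(d for d, c in lows.items() if c / n >= quorum)

team = {
    "ana":  {"data_analysis": "HIGH", "strategy": "LOW", "people_mgmt": "LOW"},
    "ben":  {"data_analysis": "HIGH", "strategy": "LOW", "people_mgmt": "HIGH"},
    "cara": {"data_analysis": "LOW",  "strategy": "LOW", "people_mgmt": "HIGH"},
}
```

For this toy team the overlay flags strategy as the structural gap — every member defers there — which is exactly where an agent seeded with strategic challenge patterns would belong.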

01
Cognitive Mapping
Aggregate individual domain confidence maps to identify team-level coverage gaps and systematic blind spots.
02
Friction Design
Deliberately seed the team with members whose organisational stance and decision patterns diverge — productive tension, not consensus.
03
Agent Augmentation
Use AI agents to fill identified gaps — not to replicate human roles, but to provide the challenge and analysis the human team cannot generate internally.

Each of these three moves has an academic anchor.

Cognitive mapping draws on Belbin's team role theory (Belbin, 1981). Belbin's central finding — validated across hundreds of management teams — was that high performing teams are not composed of the most individually talented people; they are composed of people whose functional tendencies are complementary. A team of "Plants" (creative, idea-generating) without a "Monitor Evaluator" (critical, slow to commit) will generate proposals it cannot pressure-test. Core Memory's DOMAIN_CONFIDENCE and DECISION_PATTERN fields are a structured, data-derived equivalent of Belbin's profile: they make the same complementarity question answerable from survey instruments rather than psychometric tests.

Friction design draws on Nemeth et al. (2001), "Devil's Advocate versus Authentic Dissent." Nemeth's experimental evidence showed that authentic dissent — a team member who genuinely holds a minority position — produces better decision quality than a designated devil's advocate, because authentic dissenters generate more original counter-arguments and cause the majority to process information more deeply. The implication for team composition is precise: the team needs members whose ORGANISATIONAL_STANCE genuinely diverges, not members assigned a challenging role. Core Memory makes this visible at design time. The Generative Agents cooperative bias finding (Park et al., 2023) — that LLM agents are systematically too agreeable — is the AI-side equivalent of the same problem: a team of aligned humans paired with agreeable agents produces no useful friction at all.

Agent augmentation draws on Wegner's Transactive Memory Systems theory (Wegner, 1987). Wegner proposed that teams function as distributed cognitive systems: each member holds specialist knowledge, but — crucially — each member also knows which teammates hold which knowledge and can direct queries accordingly. The team's performance depends not just on what any individual knows, but on the team's shared awareness of its own knowledge distribution. In a human+agent team, agents can be initialised with exactly the domain knowledge and challenge patterns the human team lacks — but only if the team has a shared model of what it does and does not hold. Core Memory, aggregated across the team, is that shared model. AutoGen (Wu et al., 2023) — on the reading list — provides the orchestration layer: how agents coordinate queries and responses across a multi-agent system without collapsing into redundancy.

From Principles to Observable Performance

Architecture without measurement is aspiration. Belbin's own validation work used observable team outcomes — not self-reported role preferences — to test the complementarity hypothesis. The same discipline applies here. If Core Memory profiles are being used to compose teams and assign agents, the composition decisions need to generate falsifiable predictions: teams with higher cognitive coverage should reverse fewer decisions; teams with structured friction should surface more minority positions before consensus is reached; and agents assigned to gap-filling roles should be overridden less often than agents assigned to roles the human team already covers well.

Layer | Signal Type | Example Metrics | Grounding
Decision Quality | Lagging / outcome | Decision reversal rate, time-to-commit | Nemeth et al. (2001) — authentic dissent reduces reversals
Challenge Behaviour | Leading / process | Minority positions documented before consensus | Belbin (1981) — Monitor Evaluator role frequency
Memory Utilisation | Process | Prior decision retrieval rate, commitment adherence | Wegner (1987) — transactive memory access patterns
Agent Contribution | Augmentation | Human override rate by agent role type | Wu et al. (2023) — AutoGen agent coordination quality
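The override-rate metric in the last row can be computed from a decision-event log. The event schema and the role labels below are assumptions for illustration, not from either paper:

```python
# Illustrative metric: human override rate per agent role, computed
# over a hypothetical decision-event log. Schema fields are invented.

def override_rate_by_role(events):
    """Return overrides / contributions for each agent role."""
    totals, overrides = {}, {}
    for e in events:
        role = e["agent_role"]
        totals[role] = totals.get(role, 0) + 1
        overrides[role] = overrides.get(role, 0) + (1 if e["overridden"] else 0)
    return {role: overrides[role] / totals[role] for role in totals}

log = [
    {"agent_role": "gap_filling", "overridden": False},
    {"agent_role": "gap_filling", "overridden": False},
    {"agent_role": "redundant",   "overridden": True},
    {"agent_role": "redundant",   "overridden": False},
]
```

If the composition hypothesis holds, the gap-filling role should show a persistently lower override rate than roles duplicating existing human strengths — a directly falsifiable prediction.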

References

Belbin, R.M. (1981). Management Teams: Why They Succeed or Fail. Heinemann.
Nemeth, C.J., Brown, K., & Rogers, J. (2001). Devil's advocate versus authentic dissent: Stimulating quantity and quality. European Journal of Social Psychology, 31(6), 707–720.
Park, J.S., O'Brien, J.C., Cai, C.J., Morris, M.R., Liang, P., & Bernstein, M.S. (2023). Generative agents: Interactive simulacra of human behavior. UIST 2023. arXiv:2304.03442.
Packer, C., Fang, V., Patil, S.G., Liu, K., Finn, C., & Gonzalez, J.E. (2023). MemGPT: Towards LLMs as operating systems. arXiv:2310.08560.
Wegner, D.M. (1987). Transactive memory: A contemporary analysis of the group mind. In B. Mullen & G.R. Goethals (Eds.), Theories of Group Behavior (pp. 185–208). Springer.
Wu, Q., Bansal, G., Zhang, J., Wu, Y., Zhang, S., Zhu, E., Li, B., Jiang, L., Zhang, X., & Wang, C. (2023). AutoGen: Enabling next-gen LLM applications via multi-agent conversation. arXiv:2308.08155.

Thesis

MemGPT and Generative Agents provide the memory substrate. Belbin provides the complementarity principle. Nemeth provides the friction evidence. Wegner provides the distributed cognition model. AutoGen provides the coordination layer. The synthesis is not a new theory — it is the application of established, empirically validated ideas to a new design problem: how do you build a human+agent team with the same rigour you would apply to a production AI system? Core Memory is where that rigour starts.

On the Reading List — 3 Papers Queued

arXiv 2308 · Microsoft Research
AutoGen: Multi-Agent Conversation Framework
Multi-agent orchestration — how agents coordinate and maintain coherence across parallel conversations.
arxiv.org/abs/2308.08155 ↗
arXiv 2309
RAGAS: Automated Evaluation of RAG
Framework for evaluating retrieval quality — applicable to assessing memory retrieval fidelity in agent systems.
arxiv.org/abs/2309.15217 ↗
arXiv 2210 · Princeton & Google
ReAct: Reasoning and Acting in Language Models
Interleaving reasoning traces with tool use — foundational for the inner monologue + action loop.
arxiv.org/abs/2210.03629 ↗

Open to New Opportunities

Interested in senior data science, AI transformation, or people analytics leadership roles in consulting and financial services.

Get in Touch → ← Back to Profile