arXiv 2310.08560 · UC Berkeley
MemGPT: Towards LLMs as Operating Systems
Packer, Fang, Patil, Liu, Finn, Gonzalez
arxiv.org/abs/2310.08560 ↗
The central problem: LLMs have a fixed context window. Once exceeded, earlier information is lost permanently.
MemGPT proposes a solution borrowed from a 60-year-old idea in operating systems: virtual memory and
paging — applied directly to the LLM context management problem.
The analogy holds more tightly than it first appears. The LLM is not just a process running on an OS —
it is the OS, orchestrating its own memory hierarchy through autonomous tool calls.
The LLM is not just a process running on an OS — it is the OS.
— Reading note
Three-Tier Memory Architecture
Core Memory is always loaded — the identity-level facts that must colour every response.
Recall Storage is the chronological conversation history, retrievable by time range.
Archival Storage is a vector database of long-term knowledge, retrieved semantically on demand.
What distinguishes this from RAG is that the model itself decides what to page in and out —
through tool calls it initiates autonomously from within its own reasoning trace.
Key Tools
archival_memory_search(q) ·
archival_memory_insert(text) ·
core_memory_replace(field, value)
— called by the model when it judges current context insufficient, not when prompted by the user.
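The three tools above can be exposed to the model as function-calling schemas. A minimal sketch in the JSON-schema style most function-calling APIs accept; the three function names and their arguments come from the paper, while the descriptions and schema field layout are assumptions of this sketch:

```python
# Illustrative schemas for the MemGPT memory functions. Only the
# function names and argument names come from the paper; the rest
# is an assumed function-calling-style wrapper.
MEMORY_TOOLS = [
    {
        "name": "archival_memory_search",
        "description": "Semantic search over long-term archival storage.",
        "parameters": {
            "type": "object",
            "properties": {"q": {"type": "string", "description": "Search query."}},
            "required": ["q"],
        },
    },
    {
        "name": "archival_memory_insert",
        "description": "Write a passage into archival storage.",
        "parameters": {
            "type": "object",
            "properties": {"text": {"type": "string"}},
            "required": ["text"],
        },
    },
    {
        "name": "core_memory_replace",
        "description": "Update a field of always-in-context core memory.",
        "parameters": {
            "type": "object",
            "properties": {
                "field": {"type": "string"},
                "value": {"type": "string"},
            },
            "required": ["field", "value"],
        },
    },
]
```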
Inner Monologue: Separating Reasoning from Output
Before any visible output, the model runs a hidden reasoning step. Here it asks:
is current context sufficient? Which memories are relevant? What from this exchange is worth storing?
Only after this deliberation does it produce user-visible output. This separation of reasoning from output
is the mechanism that makes the memory management adaptive rather than rule-based.
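The deliberate-then-respond loop can be sketched as follows. `llm` and `dispatch_tool` are caller-supplied placeholders, not a real API, and the message format is an assumption of this sketch:

```python
# A minimal sketch of the reason-then-respond loop: hidden
# deliberation first, optional memory tool call, visible reply last.
def run_turn(llm, dispatch_tool, context, user_msg):
    """One agent turn. Returns (hidden_thought, visible_reply)."""
    convo = context + [user_msg]
    # 1. Hidden step: is context sufficient? what is worth storing?
    thought = llm(convo, mode="inner_monologue")
    # 2. If the model judged context insufficient, page memory in.
    call = thought.get("tool_call")
    if call is not None:
        result = dispatch_tool(call["name"], call["args"])
        convo = convo + [{"role": "tool", "content": result}]
    # 3. Only after deliberation does visible output get produced.
    return thought, llm(convo, mode="respond")
```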
The Noise Warning — Most Important Finding
Buried in the ablation results: when archival storage contains high-noise content,
agent performance degrades below a no-memory baseline. Bad memories are worse than no memories.
This inverts a common assumption that more stored context is always better.
Implication
Any system built on this architecture must invest as much in what not to store
as in retrieval quality. A filtering layer is not optional — it is load-bearing infrastructure.
On Memory as Identity
Core Memory is not just cached context — it is an explicit representation of identity, separate from and more stable than any individual conversation. This separation suggests that agent fidelity depends less on generation quality and more on the quality of the identity layer. Get Core Memory right, and retrieval errors become recoverable. Get it wrong, and no retrieval scheme saves you.
On the Noise Finding and Data Quality in Enterprise Contexts
In organisational contexts, the signal-to-noise problem is severe. Most of what people say in meetings, emails, and documents is not useful for characterising their stable positions or judgment style. The filtering layer — the mechanism deciding what is worth committing to archival storage — is the highest-leverage engineering decision in any long-horizon agent system. This is typically where teams under-invest.
On Latency and Real-time Constraints
The inner monologue + tool-call loop introduces measurable latency. For asynchronous applications this is acceptable. For synchronous interactive contexts, it requires design work: pre-fetching probable memory queries, caching frequently accessed archival content, and deciding which reasoning steps can be elided without quality loss.
UIST 2023 · Stanford University & Google Research
Generative Agents: Interactive Simulacra of Human Behavior
Park, O'Brien, Cai, Morris, Liang, Bernstein
arxiv.org/abs/2304.03442 ↗
25 LLM-powered agents living in a shared sandbox — each with a name, occupation, and memories,
navigating daily life with no scripted outcomes. The most rigorous test to date of whether language models
can plausibly simulate human social behaviour at the individual level.
A birthday party organised by one agent, attended by others who had not been told to go — the social network stitched itself together from individual decisions alone, density rising from 0.17 to 0.74. Nothing was scripted.
— Reading note on emergent social dynamics
Memory Stream → Reflection → Planning
The memory stream is a timestamped natural-language log. Retrieval uses a weighted
combination of recency (exponential decay, λ = 0.995), poignancy (LLM-rated importance 1–10),
and embedding similarity — only top-scoring memories enter the prompt for any given decision.
The reflection layer periodically synthesises memories into higher-order abstractions.
Retrieval Formula
score = α·recency + β·importance + γ·relevance (each component normalised to [0, 1]).
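The formula can be implemented directly. A minimal sketch under the stated components; the memory record fields and the convention of passing in a precomputed embedding similarity (rather than embedding the query here) are assumptions of this sketch:

```python
def _minmax(xs):
    """Normalise a list to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(xs), max(xs)
    return [0.0 if hi == lo else (x - lo) / (hi - lo) for x in xs]

def retrieval_scores(memories, now, alpha=1.0, beta=1.0, gamma=1.0, decay=0.995):
    """score = alpha*recency + beta*importance + gamma*relevance,
    each component min-max normalised across the candidate set."""
    recency = _minmax([decay ** (now - m["last_access"]) for m in memories])
    importance = _minmax([m["importance"] for m in memories])  # LLM-rated 1-10
    relevance = _minmax([m["relevance"] for m in memories])    # precomputed similarity
    return [alpha * r + beta * i + gamma * v
            for r, i, v in zip(recency, importance, relevance)]
```

Only the top-scoring memories would then be inserted into the prompt for the current decision.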
Evaluation: d=8.16 and the Ablation Evidence
100 participants assessed agent believability. The full architecture outperformed the best
ablated variant with an effect size of d = 8.16. The evaluation also surfaced a caution:
RLHF-trained models exhibit a cooperative bias, appearing overly agreeable and rarely
expressing strong disagreement. Any system modelling real individuals must explicitly
counteract this.
On Reflection as a Model of Character Formation
Character is not initialised — it accumulates. The longer the observation window, the more coherent the emergent character representation, because reflections compound over time. This has direct implications for any long-horizon agent system: early outputs will be lower fidelity than later ones, and that degradation curve should be modelled and communicated.
On Counteracting the Cooperation Bias
For any agent intended to represent a specific individual's positions — including their contrarian or challenging ones — the RLHF cooperative bias must be explicitly addressed. Seed the agent's initial character description with examples of documented disagreements, refusals, and push-backs. The character layer must actively encode friction, not just consensus.
On Emergent Organisational Dynamics
The emergent social network finding suggests a possible application to organisational modelling. A fleet of agents initialised with employee profiles could simulate how information propagates across a company, where diffusion bottlenecks emerge, or how a policy change cascades through different team structures.
Notes on Hierarchical Memory in Collaborative AI Systems
Synthesis — Zhiyan Wang
After reading MemGPT alongside Generative Agents, a design pattern emerges that applies well beyond
either paper's academic setting. These notes sketch how the combined architecture applies to AI systems
operating in real organisational contexts — where the agent must represent a specific person's judgment,
not a fictional character's.
Combined Architecture Mapping
| Layer | Source | Purpose in organisational context |
| Core Memory | MemGPT | Stable identity: communication style, domain confidence, decision pattern — always in context |
| Recall Storage | MemGPT | Chronological log of observed utterances and decisions, time-indexed |
| Archival Storage | MemGPT | Long-term knowledge: stated positions on topics, historical decisions, inferred stances |
| Reflection Layer | Generative Agents | Periodic compression of observations into character generalisations |
| Filtering Layer | MemGPT (noise warning) | Pre-storage classification: high-value observations vs. phatic noise — most under-invested component |
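The five layers in the table can be sketched as a single memory object. Class and field names are this note's assumptions, not an API from either paper; the reflection layer is left as a stub list:

```python
from dataclasses import dataclass, field

# A skeleton of the combined five-layer stack described above.
@dataclass
class AgentMemory:
    core: dict = field(default_factory=dict)         # always in context (MemGPT)
    recall: list = field(default_factory=list)       # time-indexed log (MemGPT)
    archival: list = field(default_factory=list)     # semantic store (MemGPT)
    reflections: list = field(default_factory=list)  # compressed generalisations (Generative Agents)

    def observe(self, event: str, keep: bool) -> None:
        """Every event enters recall; only filter-approved events
        reach archival storage (the noise warning in action)."""
        self.recall.append(event)
        if keep:
            self.archival.append(event)
```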
The Filtering Problem is the Hardest Problem
Both papers point to the same bottleneck from different directions. MemGPT's noise warning says bad
storage degrades performance below baseline. Generative Agents' retrieval scoring says only
high-salience memories should enter the prompt. Neither paper fully specifies
how to classify input before storage.
Compound Effect
Filtering quality affects retrieval quality multiplicatively. A small improvement in pre-storage filtering compounds across every future retrieval, because it permanently reduces the noise floor all subsequent queries operate above.
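One way to start filling the gap neither paper specifies is a cheap heuristic gate in front of archival storage. Everything here, the phatic list, the length cut-off, the stance markers, is an illustrative assumption rather than anything either paper prescribes:

```python
# A heuristic pre-storage filter: admit only utterances likely to
# carry durable signal about positions, commitments, or disagreements.
PHATIC = {"thanks", "sounds good", "ok", "got it", "see you tomorrow"}

def worth_storing(utterance: str, min_words: int = 6) -> bool:
    """Return True only for observations worth committing to archival storage."""
    text = utterance.strip().lower().rstrip(".!")
    if text in PHATIC:                 # pure social glue
        return False
    if len(text.split()) < min_words:  # too short to characterise a stance
        return False
    # Positions, commitments, and disagreements are the high-value cases.
    markers = ("i think", "i disagree", "we should", "my concern", "i'll", "i will")
    return any(m in text for m in markers)
```

In production this gate would likely be an LLM or classifier call rather than string matching, but the architectural position, in front of storage rather than at retrieval time, is the point.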
Structured Observation Data as a Behavioural Prior
Applied Framework — Zhiyan Wang
The Generative Agents paper initialises character from a brief biographical description. In practice,
organisations hold far richer structured data about individuals — survey instruments, performance
assessments, 360-degree feedback. This note considers how that data maps onto the Core Memory layer,
and where the translation is lossy.
Survey data captures attitude. Core Memory requires predicted behaviour. The translation layer between them is where domain expertise matters most.
— Reading note
Four Survey Instruments, Four Memory Dimensions
| Instrument | Memory Dimension | Predicts |
| Wellbeing Survey | Capacity & Emotional State | Contribution frequency, assertiveness, meeting engagement style |
| Engagement / eNPS | Values & Organisational Stance | Default posture toward institutional positions — challenge vs. defend |
| Development / Performance | Capability Self-Perception | Domain confidence zones — where the individual contributes vs. defers |
| 360-Degree Feedback | Observed Behavioural Signature | Behavioural consistency across contexts — highest fidelity source |
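The translation layer the table implies can be sketched as follows. The fusion weights, thresholds, and field wording are illustrative assumptions; real instruments would need calibration against observed behaviour:

```python
# A sketch of the survey-to-Core-Memory translation.
def domain_confidence(self_rating: float, peer_rating: float) -> str:
    """Fuse self-perception (development review, 0-10) with 360-degree
    peer ratings (0-10). Peer signal is weighted higher because the
    notes treat it as the highest-fidelity source."""
    fused = 0.4 * self_rating + 0.6 * peer_rating  # weights are assumptions
    return "HIGH" if fused >= 7.0 else "LOW"

def core_memory_from_surveys(wellbeing: dict, enps: int, domains: dict) -> dict:
    """domains maps name -> (self_rating, peer_rating)."""
    stance = ("Likely to defend institutional positions"
              if enps >= 7 else "Likely to challenge institutional positions")
    return {
        "CURRENT_CAPACITY": f"Workload pressure: {wellbeing['workload']}/10",
        "ORGANISATIONAL_STANCE": f"eNPS: {enps}/10. {stance}.",
        "DOMAIN_CONFIDENCE": {d: domain_confidence(s, p)
                              for d, (s, p) in domains.items()},
    }
```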
Core Memory Structure — What It Is and Why It Matters
The purpose of Core Memory is to give an AI agent — or a team designer — a precise, structured
map of how a specific individual thinks. Not their job title. Not their seniority. Their actual
cognitive fingerprint: where they lead, where they defer, how they decide, and what their current
state of capacity and trust looks like right now.
Without this, an AI agent interacting with a human is guessing. With it, the agent knows — before
a single word is spoken — whether to challenge or support, whether to present data or a conclusion,
whether to push back or give space. The same precision applies to team design: once you have Core
Memory profiles for every person on a team, you can see the team's collective cognitive landscape
at a glance — its strengths, its blind spots, and exactly where an AI agent would add most value
rather than simply duplicate what humans already provide.
The Point
Core Memory is the shared language between humans and agents. It makes human cognition legible to AI — and makes AI behaviour predictable to humans. It is the foundation on which high performance human+agent teams are built.
# Core Memory structure derived from survey instruments
COMMUNICATION_STYLE: >-
  Data-driven. Measured. Asks clarifying questions before taking a
  position. Peer-validated across 8 reviewers (360).
CURRENT_CAPACITY: >-
  Workload pressure: 7.2/10 (Q4 wellbeing survey). Engagement: high.
  Work-life balance: declining trend (2 quarters). Psychological
  safety: moderate — expresses dissent selectively.
ORGANISATIONAL_STANCE: >-
  eNPS: 8/10. Strong strategic alignment. Likely to defend
  institutional positions unless challenged with data.
DOMAIN_CONFIDENCE:
  data_analysis: HIGH    # contributes substantively
  risk_assessment: HIGH  # contributes substantively
  strategy: LOW          # asks questions, defers
  people_mgmt: LOW       # listens, rarely initiates
DECISION_PATTERN: >-
  Requires data before committing. Will explicitly defer: "I need to
  check the numbers." Slow but thorough.
Reading this profile, an AI agent immediately knows: lead with evidence, not conclusions. Offer
strategic challenge, since this person won't generate it themselves. Don't push on people management —
they'll disengage. And watch the workload pressure: this person is close to capacity and expressing
dissent selectively, which means important signals may be going unspoken.
That is actionable. A job title is not.
Synthesis · Apr 2026
High Performance Teams
Applied Framework
People Analytics
From Core Memory to High Performance Teams
Applied Synthesis — Zhiyan Wang
Core Memory answers one question precisely: who is this person, as a thinking agent?
How do they process information, where are they strong, where do they defer, and what does their
current state of capacity and trust look like right now?
That question matters because it unlocks a second, more important one: given what we know
about each person, how do we compose a team — human and AI — that is collectively smarter than the
sum of its parts? Core Memory is the input. High Performance Team architecture is the output.
The gap between them is where most organisations leave value on the table.
A traditional team is assembled from job titles and seniority levels. A high performance human+agent
team is assembled from cognitive fingerprints. You look at each person's domain confidence map, decision
pattern, and organisational stance — then you ask: where are the collective blind spots? Where does
the team systematically defer? Which challenges will no one on this team ever raise? Those gaps
are exactly where AI agents belong. Not replicating what humans already do well — filling
what the human team structurally cannot generate on its own.
A high performance team is not a collection of high performers. It is a collection of people whose cognitive fingerprints are mutually load-bearing — each member's blind spots covered by another's strength, and agents deployed precisely at the gaps.
— Design principle
Reading Multiple Core Memory Profiles as a Team Design Tool
Once you have Core Memory structures for every team member, you can overlay them. The result is a
map of the team's collective cognitive landscape — not who is "best", but what the team can and
cannot generate on its own. A team where three of four members have strategy → LOW
will consistently under-examine strategic assumptions, regardless of individual competence.
A team where everyone shares the same organisational stance will never produce the productive
friction that stress-tests decisions. These are not personality problems — they are structural gaps,
and they are fixable with deliberate composition.
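The overlay itself is a simple aggregation over profiles shaped like the Core Memory example earlier in these notes. The 50% threshold below is an arbitrary assumption:

```python
from collections import Counter

# A sketch of the team-overlay step: find domains where the team
# systematically defers, i.e. too few members are HIGH-confidence.
def coverage_gaps(profiles, threshold=0.5):
    """Return domains where the share of HIGH-confidence members falls
    below `threshold`: candidate slots for AI-agent augmentation."""
    high, seen = Counter(), Counter()
    for p in profiles:
        for domain, level in p["DOMAIN_CONFIDENCE"].items():
            seen[domain] += 1
            if level == "HIGH":
                high[domain] += 1
    return sorted(d for d in seen if high[d] / seen[d] < threshold)
```

Run against the three-of-four-LOW-on-strategy team described above, this would flag strategy, and only strategy, as the gap an agent should fill.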
01
Cognitive Mapping
Aggregate individual domain confidence maps to identify team-level coverage gaps and systematic blind spots.
02
Friction Design
Deliberately seed the team with members whose organisational stance and decision patterns diverge — productive tension, not consensus.
03
Agent Augmentation
Use AI agents to fill identified gaps — not to replicate human roles, but to provide the challenge and analysis the human team cannot generate internally.
Each of these three moves has an academic anchor.
Cognitive mapping draws on Belbin's team role theory (Belbin, 1981). Belbin's central
finding — validated across hundreds of management teams — was that high performing teams are not
composed of the most individually talented people; they are composed of people whose functional
tendencies are complementary. A team of "Plants" (creative, idea-generating) without a "Monitor
Evaluator" (critical, slow to commit) will generate proposals it cannot pressure-test. Core Memory's
DOMAIN_CONFIDENCE and DECISION_PATTERN fields are a structured, data-derived
equivalent of Belbin's profile: they make the same complementarity question answerable from survey
instruments rather than psychometric tests.
Friction design draws on Nemeth et al. (2001), "Devil's Advocate versus Authentic
Dissent." Nemeth's experimental evidence showed that authentic dissent — a team member who genuinely
holds a minority position — produces better decision quality than a designated devil's advocate, because
authentic dissenters generate more original counter-arguments and cause the majority to process
information more deeply. The implication for team composition is precise: the team needs members whose
ORGANISATIONAL_STANCE genuinely diverges, not members assigned a challenging role.
Core Memory makes this visible at design time. The Generative Agents cooperative bias finding
(Park et al., 2023) — that LLM agents are systematically too agreeable — is the AI-side equivalent
of the same problem: a team of aligned humans paired with agreeable agents produces no useful friction
at all.
Agent augmentation draws on Wegner's Transactive Memory Systems theory (Wegner, 1987).
Wegner proposed that teams function as distributed cognitive systems: each member holds specialist
knowledge, but — crucially — each member also knows which teammates hold which knowledge and can
direct queries accordingly. The team's performance depends not just on what any individual knows,
but on the team's shared awareness of its own knowledge distribution. In a human+agent team,
agents can be initialised with exactly the domain knowledge and challenge patterns the human team
lacks — but only if the team has a shared model of what it does and does not hold. Core Memory,
aggregated across the team, is that shared model. AutoGen (Wu et al., 2023) — on the reading list —
provides the orchestration layer: how agents coordinate queries and responses across a multi-agent
system without collapsing into redundancy.
From Principles to Observable Performance
Architecture without measurement is aspiration. Belbin's own validation work used observable team
outcomes — not self-reported role preferences — to test the complementarity hypothesis. The same
discipline applies here. If Core Memory profiles are being used to compose teams and assign agents,
the composition decisions need to generate falsifiable predictions: teams with higher cognitive
coverage should reverse fewer decisions; teams with structured friction should surface
more minority positions before consensus is reached; agents assigned to gap-filling roles should
be overridden less than agents assigned to roles the human team already covers well.
| Layer | Signal Type | Example Metrics | Grounding |
| Decision Quality | Lagging / outcome | Decision reversal rate, time-to-commit | Nemeth et al. (2001) — authentic dissent reduces reversals |
| Challenge Behaviour | Leading / process | Minority positions documented before consensus | Belbin (1981) — Monitor Evaluator role frequency |
| Memory Utilisation | Process | Prior decision retrieval rate, commitment adherence | Wegner (1987) — transactive memory access patterns |
| Agent Contribution | Augmentation | Human override rate by agent role type | Wu et al. (2023) — AutoGen agent coordination quality |
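The table's signals imply concrete computations over a decision log. A sketch, where the log's record fields are assumptions of this note rather than anything the cited work defines:

```python
# Compute the table's lagging and augmentation signals from a log of
# decision records. Field names are illustrative assumptions.
def team_metrics(decisions):
    """decisions: list of dicts with keys "reversed" (bool),
    "minority_positions" (int), "agent_role" (str), "overridden" (bool)."""
    n = len(decisions)
    by_role = {}
    for d in decisions:
        overridden, total = by_role.setdefault(d["agent_role"], [0, 0])
        by_role[d["agent_role"]] = [overridden + d["overridden"], total + 1]
    return {
        "reversal_rate": sum(d["reversed"] for d in decisions) / n,
        "minority_positions_per_decision":
            sum(d["minority_positions"] for d in decisions) / n,
        "override_rate_by_role": {r: o / t for r, (o, t) in by_role.items()},
    }
```

The falsifiable predictions then become comparisons: reversal rate should be lower for high-coverage teams, and the override rate should be lower for gap-filling agent roles than for overlapping ones.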
References
Belbin, R.M. (1981). Management Teams: Why They Succeed or Fail. Heinemann.
Nemeth, C.J., Brown, K., & Rogers, J. (2001). Devil's advocate versus authentic dissent: Stimulating quantity and quality. European Journal of Social Psychology, 31(6), 707–720.
Park, J.S., O'Brien, J.C., Cai, C.J., Morris, M.R., Liang, P., & Bernstein, M.S. (2023). Generative agents: Interactive simulacra of human behavior. UIST 2023. arXiv:2304.03442.
Packer, C., Fang, V., Patil, S.G., Liu, K., Finn, C., & Gonzalez, J.E. (2023). MemGPT: Towards LLMs as operating systems. arXiv:2310.08560.
Wegner, D.M. (1987). Transactive memory: A contemporary analysis of the group mind. In B. Mullen & G.R. Goethals (Eds.), Theories of Group Behavior (pp. 185–208). Springer.
Wu, Q., Bansal, G., Zhang, J., Wu, Y., Zhang, S., Zhu, E., Li, B., Jiang, L., Zhang, X., & Wang, C. (2023). AutoGen: Enabling next-gen LLM applications via multi-agent conversation. arXiv:2308.08155.
Thesis
MemGPT and Generative Agents provide the memory substrate. Belbin provides the complementarity
principle. Nemeth provides the friction evidence. Wegner provides the distributed cognition model.
AutoGen provides the coordination layer. The synthesis is not a new theory — it is the application
of established, empirically validated ideas to a new design problem: how do you build a
human+agent team with the same rigour you would apply to a production AI system?
Core Memory is where that rigour starts.