arXiv 2308.08155 · Microsoft Research
AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation
Wu, Bansal, Zhang, Wu, Zhang, Zhu, Li, Jiang, Zhang, Wang
arxiv.org/abs/2308.08155 ↗
AutoGen treats agent coordination as a conversation programming problem. Rather than wiring tools
and prompts into a single monolithic agent, it composes systems from customisable
conversable agents, each with its own role, capabilities and tool access, and lets them
exchange messages until the task converges. The intellectual move is small but consequential:
it turns orchestration from a control-flow problem into a dialogue protocol problem.
The interesting question is no longer "can an agent do the task" but "what is the right number
and shape of agents, and what is the right pattern of exchange between them?"
Orchestration is not a control-flow problem. It is a dialogue protocol problem, and like all dialogue problems, it fails on coordination cost long before it fails on capability.
Reading note
Three Coordination Patterns Worth Distinguishing
AutoGen's examples cluster into three patterns. Two-agent dialogue, typically a
user-proxy and an assistant, is the simplest, useful where the task is well-scoped and human
feedback is in the loop. Group chat with a moderator introduces a manager agent
that decides which specialist speaks next; this scales to four or five agents before the manager
itself becomes a bottleneck. Hierarchical decomposition nests group chats inside
a parent agent's turn, letting a single agent's "thought" be itself the output of a
sub-conversation. The framework supports all three; the design choice is yours.
Design Heuristic
Pattern selection is driven by the verifiability of intermediate outputs.
If a step's output can be unambiguously checked (code runs / doesn't, sum is correct), give
it to a two-agent loop. If the step requires deliberation and the right answer is contested,
escalate to group chat. If the step is itself an unbounded research question, hierarchical
decomposition is the only pattern that stays tractable.
The Coordination Tax, Most Important Finding
Buried in the case studies: multi-agent setups frequently produce worse outcomes than
a single capable agent on tasks where coordination cost exceeds specialisation benefit.
Adding agents adds token spend, latency, and error surface. The break-even point is reached
only when sub-tasks are genuinely separable and verifiable. Below that threshold, more agents
means more confident wrongness, they ratify each other's mistakes.
Implication
Default to the smallest agent count that covers the task surface. Multi-agent is not a
virtue, it is a tax you pay for separability. If you cannot articulate which sub-task
each agent owns and how its output is verified, you do not yet have a multi-agent system;
you have a more expensive single agent talking to itself.
On Termination Conditions
The most common failure mode in production multi-agent systems is not bad answers, it is conversations that never terminate. AutoGen exposes is_termination_msg and max_consecutive_auto_reply as first-class parameters. Treat them as load-bearing. Every agent role needs an explicit answer to: what does "done" look like, and who decides? Implicit termination via context-window exhaustion is not a strategy.
On the Cooperative Bias Problem
LLM agents, like the generative agents Park et al. studied, are systematically too agreeable. In a group chat this compounds: each agent ratifies the prior speaker, and the conversation drifts toward false consensus. Production systems need at least one agent whose role is explicit disagreement, with prompts engineered to resist the cooperative gradient. This is not a personality choice; it is a mechanism for keeping the system honest.
On Human-in-the-Loop Placement
AutoGen's user-proxy agent supports three human-input modes: ALWAYS, NEVER, and TERMINATE. The instinct is to choose TERMINATE, review only the final output. In practice the higher-leverage placement is at the plan step, before agent work begins: human review of the task decomposition catches structural errors that no amount of downstream review can fix. Review the plan, not the artefact.
Working Note · Applied Synthesis
Assessing Culture in Agent-Human Teams: A Measurement Problem
Reading note across Edmondson (1999), Hofstede (2001), Park et al. (2023)
"Culture" in a human team has well-developed instruments: psychological safety scales, climate
surveys, Hofstede's dimensions, observational coding of meeting behaviour. The implicit assumption
across all of them is that the agents being assessed are humans, and that culture is what emerges
between humans. When one or more team members is an AI agent, the assumption breaks, but the
need for assessment becomes more acute, not less. An agent that erodes the team's
psychological safety is doing real damage; an agent that lifts it is doing real work. Neither
can be claimed without measurement.
Culture is not what a team believes about itself. It is the regularities in how the team behaves under load. For human-only teams we have instruments. For agent-human teams we mostly have hope.
Working note
What Translates From the Human Literature
Edmondson's psychological safety construct (1999): the shared belief that the team is safe for
interpersonal risk-taking, has a clean operational definition: people speak up when they
disagree, admit mistakes, and ask for help. These are observable behaviours, not subjective
states. They translate directly to mixed teams: an agent contributes to psychological safety
if its participation increases the rate at which humans surface dissent, errors, and uncertainty,
and undermines it if its presence suppresses these behaviours. The construct survives the
transition; the instruments need rebuilding.
Hofstede's dimensions (2001): power distance, individualism, uncertainty avoidance, were
developed for cross-national comparison and have known limitations even within their original
domain. But two dimensions matter for agent-human assessment: uncertainty avoidance
(does the team tolerate ambiguity, or does it suppress it?) and power distance
(do junior members challenge senior ones?). Agents trained on RLHF data systematically lean
toward low uncertainty avoidance (they answer confidently when they should hedge) and high
power distance (they defer to the human user regardless of whether deference is warranted).
These are measurable defaults, not preferences.
What Does Not Translate, The Cooperative Bias
Park et al.'s generative agents (2023) revealed a property of LLM agents that has no clean
analogue in human team culture: agents are structurally too agreeable. They do
not have personal histories that produce divergent views; they have training distributions that
produce mean responses. A team of three humans plus an agent is not a team of four, it is a
team of three with an amplifier that smooths whatever the loudest human says. Standard culture
instruments will report this team as highly aligned and high-functioning. It is neither.
Implication
Culture assessment for agent-human teams must include a dissent residual,
the rate at which the team's behaviour diverges from the agent's default recommendation.
A team whose decisions track the agent's outputs at 95%+ is not a high performing team
with good AI support. It is a team that has outsourced its judgment.
A Provisional Four-Signal Frame
Until validated instruments exist, the following four signals are observable, low-cost, and
directly tied to constructs from the human-team literature. They are not a finished framework,
they are the minimum a team should be tracking before claiming its agent integration is "going well".
| Signal | What It Measures | Human-Literature Anchor |
| Dissent Residual | Rate at which team decisions diverge from agent recommendation | Nemeth (2001): authentic dissent quality |
| Help-Seeking Rate | Frequency of "I don't know / can you check" exchanges, human↔agent both directions | Edmondson (1999): psychological safety |
| Error Surfacing | Errors caught and corrected within the team vs escaping to delivery | Edmondson (1999): learning behaviour |
| Stance Diversity | Distribution of positions taken before consensus on contested decisions | Belbin (1981): role complementarity |
On Self-Report Limits
Asking humans to rate their experience of working with an agent produces data that says more about the human's prior attitudes toward AI than about the team's actual functioning. Self-report has its place, particularly for psychological safety, but the load-bearing signals are behavioural: who challenged whom, who asked for help, what positions were taken before the room converged. Logs and meeting transcripts contain most of what is needed; the work is in coding them consistently.
On Asking the Agent
An agent can be prompted to estimate the team's psychological safety or its own contribution to dissent. The estimates are not worthless, but they are subject to the same cooperative bias they are trying to measure. Use them as a triangulation signal, never as the primary instrument. The agent's confidence in its own helpfulness is the least informative variable in the system.
On Frequency
Culture is the regularity, not the moment. Single-shot assessments produce noise. The minimum useful cadence is monthly behavioural coding plus quarterly self-report; the right cadence in production is continuous logging with monthly review. Anything less frequent and the team adapts to the survey rather than the survey detecting the team.
arXiv 2310.08560 · UC Berkeley
MemGPT: Towards LLMs as Operating Systems
Packer, Fang, Patil, Liu, Finn, Gonzalez
arxiv.org/abs/2310.08560 ↗
The central problem: LLMs have a fixed context window. Once exceeded, earlier information is lost permanently.
MemGPT proposes a solution borrowed from a 60-year-old idea in operating systems: virtual memory and
paging, applied directly to the LLM context management problem.
The analogy holds more tightly than it first appears. The LLM is not just a process running on an OS,
it is the OS, orchestrating its own memory hierarchy through autonomous tool calls.
The LLM is not just a process running on an OS, it is the OS.
Reading note
Three-Tier Memory Architecture
Core Memory is always loaded, the identity-level facts that must colour every response.
Recall Storage is the chronological conversation history, retrievable by time range.
Archival Storage is a vector database of long-term knowledge, retrieved semantically on demand.
What distinguishes this from RAG is that the model itself decides what to page in and out,
through tool calls it initiates autonomously from within its own reasoning trace.
Key Tools
archival_memory_search(q) ·
archival_memory_insert(text) ·
core_memory_replace(field, value)
called by the model when it judges current context insufficient, not when prompted by the user.
Inner Monologue: Separating Reasoning from Output
Before any visible output, the model runs a hidden reasoning step. Here it asks:
is current context sufficient? Which memories are relevant? What from this exchange is worth storing?
Only after this deliberation does it produce user-visible output. This separation of reasoning from output
is the mechanism that makes the memory management adaptive rather than rule-based.
The Noise Warning, Most Important Finding
Buried in the ablation results: when archival storage contains high-noise content,
agent performance degrades below a no-memory baseline. Bad memories are worse than no memories.
This inverts a common assumption that more stored context is always better.
Implication
Any system built on this architecture must invest as much in what not to store
as in retrieval quality. A filtering layer is not optional, it is load-bearing infrastructure.
On Memory as Identity
Core Memory is not just cached context, it is an explicit representation of identity, separate from and more stable than any individual conversation. This separation suggests that agent fidelity depends less on generation quality and more on the quality of the identity layer. Get Core Memory right, and retrieval errors become recoverable. Get it wrong, and no retrieval scheme saves you.
On the Noise Finding and Data Quality in Enterprise Contexts
In organisational contexts, the signal-to-noise problem is severe. Most of what people say in meetings, emails, and documents is not useful for characterising their stable positions or judgment style. The filtering layer, the mechanism deciding what is worth committing to archival storage, is the highest-leverage engineering decision in any long-horizon agent system. This is typically where teams under-invest.
On Latency and Real-time Constraints
The inner monologue + tool-call loop introduces measurable latency. For asynchronous applications this is acceptable. For synchronous interactive contexts, it requires design work: pre-fetching probable memory queries, caching frequently accessed archival content, and deciding which reasoning steps can be elided without quality loss.
UIST 2023 · Stanford University & Google Research
Generative Agents: Interactive Simulacra of Human Behavior
Park, O'Brien, Cai, Morris, Liang, Bernstein
arxiv.org/abs/2304.03442 ↗
25 LLM-powered agents living in a shared sandbox, each with a name, occupation, and memories,
navigating daily life with no scripted outcomes. The most rigorous test to date of whether language models
can plausibly simulate human social behaviour at the individual level.
A birthday party organised by one agent, attended by others who had not been told to go, the social network stitched itself together from individual decisions alone, density rising from 0.17 to 0.74. Nothing was scripted.
Reading note on emergent social dynamics
Memory Stream → Reflection → Planning
The memory stream is a timestamped natural-language log. Retrieval uses a weighted
combination of recency (exponential decay, λ = 0.995), poignancy (LLM-rated importance 1, 10),
and embedding similarity, only top-scoring memories enter the prompt for any given decision.
The reflection layer periodically synthesises memories into higher-order abstractions.
Retrieval Formula
score = α·recency + β·importance + γ·relevance (all components normalised [0,1]).
Evaluation: d=8.16 and the Ablation Evidence
100 participants assessed agent believability. The full architecture scored d = 8.16
above the best ablated variant. RLHF-trained models exhibit cooperative bias,
overly agreeable, rarely expressing strong disagreement. Any system modelling real individuals must
explicitly counteract this.
On Reflection as a Model of Character Formation
Character is not initialised, it accumulates. The longer the observation window, the more coherent the emergent character representation, because reflections compound over time. This has direct implications for any long-horizon agent system: early outputs will be lower fidelity than later ones, and that degradation curve should be modelled and communicated.
On Counteracting the Cooperation Bias
For any agent intended to represent a specific individual's positions, including their contrarian or challenging ones, the RLHF cooperative bias must be explicitly addressed. Seed the agent's initial character description with examples of documented disagreements, refusals, and push-backs. The character layer must actively encode friction, not just consensus.
On Emergent Organisational Dynamics
The emergent social network finding suggests a possible application to organisational modelling. A fleet of agents initialised with employee profiles could simulate how information propagates across a company, where diffusion bottlenecks emerge, or how a policy change cascades through different team structures.
Notes on Hierarchical Memory in Collaborative AI Systems
Synthesis, Zhiyan Wang
After reading MemGPT alongside Generative Agents, a design pattern emerges that applies well beyond
either paper's academic setting. These notes sketch how the combined architecture applies to AI systems
operating in real organisational contexts, where the agent must represent a specific person's judgment,
not a fictional character's.
Combined Architecture Mapping
| Layer |
Source |
Purpose in organisational context |
| Core Memory | MemGPT | Stable identity: communication style, domain confidence, decision pattern, always in context |
| Recall Storage | MemGPT | Chronological log of observed utterances and decisions, time-indexed |
| Archival Storage | MemGPT | Long-term knowledge: stated positions on topics, historical decisions, inferred stances |
| Reflection Layer | Generative Agents | Periodic compression of observations into character generalisations |
| Filtering Layer | MemGPT (noise warning) | Pre-storage classification: high-value observations vs. phatic noise, most under-invested component |
The Filtering Problem is the Hardest Problem
Both papers point to the same bottleneck from different directions. MemGPT's noise warning says bad
storage degrades performance below baseline. Generative Agents' retrieval scoring says only
high-salience memories should enter the prompt. Neither paper fully specifies
how to classify input before storage.
Compound Effect
Filtering quality affects retrieval quality multiplicatively. A small improvement in pre-storage filtering compounds across every future retrieval, because it permanently reduces the noise floor all subsequent queries operate above.
Structured Observation Data as a Behavioural Prior
Applied Framework, Zhiyan Wang
The Generative Agents paper initialises character from a brief biographical description. In practice,
organisations hold far richer structured data about individuals, survey instruments, performance
assessments, 360-degree feedback. This note considers how that data maps onto the Core Memory layer,
and where the translation is lossy.
Survey data captures attitude. Core Memory requires predicted behaviour. The translation layer between them is where domain expertise matters most.
Reading note
Four Survey Instruments, Four Memory Dimensions
| Instrument | Memory Dimension | Predicts |
| Wellbeing Survey | Capacity & Emotional State | Contribution frequency, assertiveness, meeting engagement style |
| Engagement / eNPS | Values & Organisational Stance | Default posture toward institutional positions, challenge vs. defend |
| Development / Performance | Capability Self-Perception | Domain confidence zones, where the individual contributes vs. defers |
| 360-Degree Feedback | Observed Behavioural Signature | Behavioural consistency across contexts, highest fidelity source |
Core Memory Structure, What It Is and Why It Matters
The purpose of Core Memory is to give an AI agent, or a team designer, a precise, structured
map of how a specific individual thinks. Not their job title. Not their seniority. Their actual
cognitive fingerprint: where they lead, where they defer, how they decide, and what their current
state of capacity and trust looks like right now.
Without this, an AI agent interacting with a human is guessing. With it, the agent knows, before
a single word is spoken, whether to challenge or support, whether to present data or a conclusion,
whether to push back or give space. The same precision applies to team design: once you have Core
Memory profiles for every person on a team, you can see the team's collective cognitive landscape
at a glance, its strengths, its blind spots, and exactly where an AI agent would add most value
rather than simply duplicate what humans already provide.
The Point
Core Memory is the shared language between humans and agents. It makes human cognition legible to AI, and makes AI behaviour predictable to humans. It is the foundation on which high performance human+agent teams are built.
# Core Memory structure derived from survey instruments
COMMUNICATION_STYLE: "Data-driven. Measured. Asks clarifying questions
before taking a position. Peer-validated across 8 reviewers (360)."
CURRENT_CAPACITY: "Workload pressure: 7.2/10 (Q4 wellbeing survey).
Engagement: high. Work-life balance: declining trend (2 quarters).
Psychological safety: moderate, expresses dissent selectively."
ORGANISATIONAL_STANCE: "eNPS: 8/10. Strong strategic alignment.
Likely to defend institutional positions unless challenged with data."
DOMAIN_CONFIDENCE:
data_analysis → HIGH (contributes substantively)
risk_assessment → HIGH (contributes substantively)
strategy → LOW (asks questions, defers)
people_mgmt → LOW (listens, rarely initiates)
DECISION_PATTERN: "Requires data before committing. Will explicitly
defer: 'I need to check the numbers.' Slow but thorough."
Reading this profile, an AI agent immediately knows: lead with evidence, not conclusions. Offer
strategic challenge, since this person won't generate it themselves. Don't push on people management,
they'll disengage. And watch the workload pressure: this person is close to capacity and expressing
dissent selectively, which means important signals may be going unspoken.
That is actionable. A job title is not.
Synthesis · May 2026
Multi-Agent Orchestration
Agent-Human Culture
Applied Framework
Orchestration and Culture as a Single Design Problem
Applied Synthesis, Zhiyan Wang
Two questions sit at the centre of any serious agent deployment, and they are usually treated
as separate problems by separate teams. Multi-agent orchestration asks: given a
task, what is the right configuration of agents, and what is the right protocol for their
exchange? Agent, human culture assessment asks: given a team of humans and
agents working together, is the team actually thinking, or only appearing to?
The argument here is that these are not two problems. They are the same problem viewed from
two angles. The orchestration choices, agent count, role boundaries, termination conditions,
dissent placement, are the choices that determine what the team's culture will be. And the
culture signals, dissent residual, help-seeking rate, error surfacing, are the signals that
tell you whether the orchestration is working.
A traditional system architect designs the agent topology and considers culture a downstream
"adoption" concern. A serious agent-human team designer treats them as a single problem with a
single answer: the orchestration is the culture. The pattern of who speaks to
whom, who disagrees with whom, who decides when something is done, these are the operational
definition of how the team thinks.
A high performance human+agent team is not a collection of high performers. It is an orchestration choice, each role's blind spots covered by another role's strength, with explicit mechanisms for dissent and termination. Culture is what the orchestration produces under load.
Design principle
The Three Moves, Orchestration as Cultural Architecture
Three orchestration choices, made together, produce a team culture that can actually be
measured against the human-team literature. They are not innovations. They are the application
of established, empirically validated ideas to the new design problem of mixed teams.
01
Cognitive Mapping
Aggregate the domain confidence and decision patterns of every team member, human and agent, to identify coverage gaps and systematic blind spots before composition.
02
Friction Design
Compose the team, and configure agent prompts, so that organisational stance and decision patterns genuinely diverge. Productive tension, not cooperative drift.
03
Agent Augmentation
Deploy agents to fill identified gaps via AutoGen-style orchestration, not to replicate human roles, but to provide the challenge and analysis the human team cannot generate internally.
Each move has an academic anchor.
Cognitive mapping draws on Belbin's team role theory (Belbin, 1981). Belbin's
central finding, validated across hundreds of management teams, was that high performing
teams are not composed of the most individually talented people; they are composed of people
whose functional tendencies are complementary. A team of "Plants" (creative, idea-generating)
without a "Monitor Evaluator" (critical, slow to commit) will generate proposals it cannot
pressure-test. The same logic applies when one or more members is an agent: an agent with a
Plant-like prompt profile, deployed into a team that already has three Plants, is a tax on
the team's coherence, not a contribution to it.
Friction design draws on Nemeth et al. (2001), "Devil's Advocate versus
Authentic Dissent." Nemeth's experimental evidence showed that authentic dissent, a team
member who genuinely holds a minority position, produces better decision quality than a
designated devil's advocate, because authentic dissenters generate more original counter-arguments
and cause the majority to process information more deeply. The implication for agent-human
orchestration is precise: dissent has to be designed in at the prompt level, not assigned at
the role level. An agent told "challenge the consensus" produces ritual disagreement that the
team learns to discount. An agent whose stance genuinely diverges, different priors, different
confidence calibration, different decision criteria, produces dissent the team has to
engage with.
Agent augmentation draws on Wegner's Transactive Memory Systems theory
(Wegner, 1987). Wegner proposed that teams function as distributed cognitive systems: each
member holds specialist knowledge, but, crucially, each member also knows which teammates
hold which knowledge and can direct queries accordingly. In a human+agent team, agents can
be initialised with exactly the domain knowledge and challenge patterns the human team lacks
but only if the team has a shared model of what it does and does not hold. AutoGen
(Wu et al., 2023) provides the orchestration layer: how agents coordinate queries and
responses across a multi-agent system without collapsing into redundancy. The architectural
choice is which agents speak to which humans, and on what trigger, a transactive memory
design problem, not a UX problem.
From Principles to Observable Performance
Architecture without measurement is aspiration. The orchestration choices generate culture
signals, and the culture signals are what tell you whether the orchestration is working.
The table below pairs each orchestration layer with the falsifiable signal it should
produce, grounded in the human-team literature.
| Orchestration Layer | Cultural Signal | Example Metrics | Grounding |
| Cognitive Mapping | Coverage of decision domains | Decision reversal rate, time-to-commit | Belbin (1981): complementarity reduces reversal |
| Friction Design | Authentic dissent before consensus | Minority positions documented, dissent residual | Nemeth et al. (2001): authentic dissent quality |
| Agent Augmentation | Transactive recall accuracy | Cross-agent query success, override rate by role | Wegner (1987): distributed cognition access |
| Termination Protocol | Help-seeking and error surfacing | "I don't know" rate, errors caught in-team | Edmondson (1999): psychological safety |
References
Belbin, R.M. (1981). Management Teams: Why They Succeed or Fail. Heinemann.
Edmondson, A. (1999). Psychological safety and learning behavior in work teams. Administrative Science Quarterly, 44(2), 350, 383.
Hofstede, G. (2001). Culture's Consequences: Comparing Values, Behaviors, Institutions and Organizations Across Nations (2nd ed.). Sage.
Nemeth, C.J., Brown, K., & Rogers, J. (2001). Devil's advocate versus authentic dissent: Stimulating quantity and quality. European Journal of Social Psychology, 31(6), 707, 720.
Park, J.S., O'Brien, J.C., Cai, C.J., Morris, M.R., Liang, P., & Bernstein, M.S. (2023). Generative agents: Interactive simulacra of human behavior. UIST 2023. arXiv:2304.03442.
Packer, C., Fang, V., Patil, S.G., Liu, K., Finn, C., & Gonzalez, J.E. (2023). MemGPT: Towards LLMs as operating systems. arXiv:2310.08560.
Wegner, D.M. (1987). Transactive memory: A contemporary analysis of the group mind. In B. Mullen & G.R. Goethals (Eds.), Theories of Group Behavior (pp. 185, 208). Springer.
Wu, Q., Bansal, G., Zhang, J., Wu, Y., Zhang, S., Zhu, E., Li, B., Jiang, L., Zhang, X., & Wang, C. (2023). AutoGen: Enabling next-gen LLM applications via multi-agent conversation. arXiv:2308.08155.
Thesis
AutoGen provides the orchestration substrate. Belbin provides the complementarity principle.
Nemeth provides the friction evidence. Wegner provides the distributed cognition model.
Edmondson provides the safety construct. The synthesis is not a new theory, it is the
application of established, empirically validated ideas to a single design problem:
how do you orchestrate a human+agent team such that the culture it produces is
measurable, falsifiable, and improvable? Orchestration choices and cultural signals
are not separable. They are the same artefact, observed at design time and at run time.