arXiv 2308.08155 · Microsoft Research

AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation

arxiv.org/abs/2308.08155 ↗

AutoGen treats agent coordination as a conversation programming problem. Rather than wiring tools and prompts into a single monolithic agent, it composes systems from customisable conversable agents, each with its own role, capabilities and tool access, and lets them exchange messages until the task converges. The intellectual move is small but consequential: it turns orchestration from a control-flow problem into a dialogue protocol problem.

The interesting question is no longer "can an agent do the task" but "what is the right number and shape of agents, and what is the right pattern of exchange between them?"

Orchestration is not a control-flow problem. It is a dialogue protocol problem, and like all dialogue problems, it fails on coordination cost long before it fails on capability.

Reading note

Three Coordination Patterns Worth Distinguishing

AutoGen's examples cluster into three patterns. Two-agent dialogue, typically a user-proxy and an assistant, is the simplest, useful where the task is well-scoped and human feedback is in the loop. Group chat with a moderator introduces a manager agent that decides which specialist speaks next; this scales to four or five agents before the manager itself becomes a bottleneck. Hierarchical decomposition nests group chats inside a parent agent's turn, letting a single agent's "thought" be itself the output of a sub-conversation. The framework supports all three; the design choice is yours.

Design Heuristic

Pattern selection is driven by the verifiability of intermediate outputs. If a step's output can be unambiguously checked (code runs / doesn't, sum is correct), give it to a two-agent loop. If the step requires deliberation and the right answer is contested, escalate to group chat. If the step is itself an unbounded research question, hierarchical decomposition is the only pattern that stays tractable.

The Coordination Tax, Most Important Finding

Buried in the case studies: multi-agent setups frequently produce worse outcomes than a single capable agent on tasks where coordination cost exceeds specialisation benefit. Adding agents adds token spend, latency, and error surface. The break-even point is reached only when sub-tasks are genuinely separable and verifiable. Below that threshold, more agents means more confident wrongness, they ratify each other's mistakes.

Implication

Default to the smallest agent count that covers the task surface. Multi-agent is not a virtue, it is a tax you pay for separability. If you cannot articulate which sub-task each agent owns and how its output is verified, you do not yet have a multi-agent system; you have a more expensive single agent talking to itself.

Field Notes · Enterprise Orchestration Considerations

On Termination Conditions

The most common failure mode in production multi-agent systems is not bad answers, it is conversations that never terminate. AutoGen exposes is_termination_msg and max_consecutive_auto_reply as first-class parameters. Treat them as load-bearing. Every agent role needs an explicit answer to: what does "done" look like, and who decides? Implicit termination via context-window exhaustion is not a strategy.

On the Cooperative Bias Problem

LLM agents, like the generative agents Park et al. studied, are systematically too agreeable. In a group chat this compounds: each agent ratifies the prior speaker, and the conversation drifts toward false consensus. Production systems need at least one agent whose role is explicit disagreement, with prompts engineered to resist the cooperative gradient. This is not a personality choice; it is a mechanism for keeping the system honest.

On Human-in-the-Loop Placement

AutoGen's user-proxy agent supports three human-input modes: ALWAYS, NEVER, and TERMINATE. The instinct is to choose TERMINATE, review only the final output. In practice the higher-leverage placement is at the plan step, before agent work begins: human review of the task decomposition catches structural errors that no amount of downstream review can fix. Review the plan, not the artefact.

Working Note · Applied Synthesis

Assessing Culture in Agent-Human Teams: A Measurement Problem

"Culture" in a human team has well-developed instruments: psychological safety scales, climate surveys, Hofstede's dimensions, observational coding of meeting behaviour. The implicit assumption across all of them is that the agents being assessed are humans, and that culture is what emerges between humans. When one or more team members is an AI agent, the assumption breaks, but the need for assessment becomes more acute, not less. An agent that erodes the team's psychological safety is doing real damage; an agent that lifts it is doing real work. Neither can be claimed without measurement.

Culture is not what a team believes about itself. It is the regularities in how the team behaves under load. For human-only teams we have instruments. For agent-human teams we mostly have hope.

Working note

What Translates From the Human Literature

Edmondson's psychological safety construct (1999): the shared belief that the team is safe for interpersonal risk-taking, has a clean operational definition: people speak up when they disagree, admit mistakes, and ask for help. These are observable behaviours, not subjective states. They translate directly to mixed teams: an agent contributes to psychological safety if its participation increases the rate at which humans surface dissent, errors, and uncertainty, and undermines it if its presence suppresses these behaviours. The construct survives the transition; the instruments need rebuilding.

Hofstede's dimensions (2001): power distance, individualism, uncertainty avoidance, were developed for cross-national comparison and have known limitations even within their original domain. But two dimensions matter for agent-human assessment: uncertainty avoidance (does the team tolerate ambiguity, or does it suppress it?) and power distance (do junior members challenge senior ones?). Agents trained on RLHF data systematically lean toward low uncertainty avoidance (they answer confidently when they should hedge) and high power distance (they defer to the human user regardless of whether deference is warranted). These are measurable defaults, not preferences.

What Does Not Translate, The Cooperative Bias

Park et al.'s generative agents (2023) revealed a property of LLM agents that has no clean analogue in human team culture: agents are structurally too agreeable. They do not have personal histories that produce divergent views; they have training distributions that produce mean responses. A team of three humans plus an agent is not a team of four, it is a team of three with an amplifier that smooths whatever the loudest human says. Standard culture instruments will report this team as highly aligned and high-functioning. It is neither.

Implication

Culture assessment for agent-human teams must include a dissent residual, the rate at which the team's behaviour diverges from the agent's default recommendation. A team whose decisions track the agent's outputs at 95%+ is not a high performing team with good AI support. It is a team that has outsourced its judgment.

A Provisional Four-Signal Frame

Until validated instruments exist, the following four signals are observable, low-cost, and directly tied to constructs from the human-team literature. They are not a finished framework, they are the minimum a team should be tracking before claiming its agent integration is "going well".

SignalWhat It MeasuresHuman-Literature Anchor
Dissent ResidualRate at which team decisions diverge from agent recommendationNemeth (2001): authentic dissent quality
Help-Seeking RateFrequency of "I don't know / can you check" exchanges, human↔agent both directionsEdmondson (1999): psychological safety
Error SurfacingErrors caught and corrected within the team vs escaping to deliveryEdmondson (1999): learning behaviour
Stance DiversityDistribution of positions taken before consensus on contested decisionsBelbin (1981): role complementarity

Field Notes · Building the Instrument

On Self-Report Limits

Asking humans to rate their experience of working with an agent produces data that says more about the human's prior attitudes toward AI than about the team's actual functioning. Self-report has its place, particularly for psychological safety, but the load-bearing signals are behavioural: who challenged whom, who asked for help, what positions were taken before the room converged. Logs and meeting transcripts contain most of what is needed; the work is in coding them consistently.

On Asking the Agent

An agent can be prompted to estimate the team's psychological safety or its own contribution to dissent. The estimates are not worthless, but they are subject to the same cooperative bias they are trying to measure. Use them as a triangulation signal, never as the primary instrument. The agent's confidence in its own helpfulness is the least informative variable in the system.

On Frequency

Culture is the regularity, not the moment. Single-shot assessments produce noise. The minimum useful cadence is monthly behavioural coding plus quarterly self-report; the right cadence in production is continuous logging with monthly review. Anything less frequent and the team adapts to the survey rather than the survey detecting the team.

arXiv 2310.08560 · UC Berkeley

MemGPT: Towards LLMs as Operating Systems

arxiv.org/abs/2310.08560 ↗

The central problem: LLMs have a fixed context window. Once exceeded, earlier information is lost permanently. MemGPT proposes a solution borrowed from a 60-year-old idea in operating systems: virtual memory and paging, applied directly to the LLM context management problem.

The analogy holds more tightly than it first appears. The LLM is not just a process running on an OS, it is the OS, orchestrating its own memory hierarchy through autonomous tool calls.

The LLM is not just a process running on an OS, it is the OS.

Reading note

Three-Tier Memory Architecture

Core Memory is always loaded, the identity-level facts that must colour every response. Recall Storage is the chronological conversation history, retrievable by time range. Archival Storage is a vector database of long-term knowledge, retrieved semantically on demand. What distinguishes this from RAG is that the model itself decides what to page in and out, through tool calls it initiates autonomously from within its own reasoning trace.

Key Tools

archival_memory_search(q)  ·  archival_memory_insert(text)  ·  core_memory_replace(field, value) called by the model when it judges current context insufficient, not when prompted by the user.

Inner Monologue: Separating Reasoning from Output

Before any visible output, the model runs a hidden reasoning step. Here it asks: is current context sufficient? Which memories are relevant? What from this exchange is worth storing? Only after this deliberation does it produce user-visible output. This separation of reasoning from output is the mechanism that makes the memory management adaptive rather than rule-based.

The Noise Warning, Most Important Finding

Buried in the ablation results: when archival storage contains high-noise content, agent performance degrades below a no-memory baseline. Bad memories are worse than no memories. This inverts a common assumption that more stored context is always better.

Implication

Any system built on this architecture must invest as much in what not to store as in retrieval quality. A filtering layer is not optional, it is load-bearing infrastructure.

Field Notes · Enterprise Design Considerations

On Memory as Identity

Core Memory is not just cached context, it is an explicit representation of identity, separate from and more stable than any individual conversation. This separation suggests that agent fidelity depends less on generation quality and more on the quality of the identity layer. Get Core Memory right, and retrieval errors become recoverable. Get it wrong, and no retrieval scheme saves you.

On the Noise Finding and Data Quality in Enterprise Contexts

In organisational contexts, the signal-to-noise problem is severe. Most of what people say in meetings, emails, and documents is not useful for characterising their stable positions or judgment style. The filtering layer, the mechanism deciding what is worth committing to archival storage, is the highest-leverage engineering decision in any long-horizon agent system. This is typically where teams under-invest.

On Latency and Real-time Constraints

The inner monologue + tool-call loop introduces measurable latency. For asynchronous applications this is acceptable. For synchronous interactive contexts, it requires design work: pre-fetching probable memory queries, caching frequently accessed archival content, and deciding which reasoning steps can be elided without quality loss.

UIST 2023 · Stanford University & Google Research

Generative Agents: Interactive Simulacra of Human Behavior

arxiv.org/abs/2304.03442 ↗

25 LLM-powered agents living in a shared sandbox, each with a name, occupation, and memories, navigating daily life with no scripted outcomes. The most rigorous test to date of whether language models can plausibly simulate human social behaviour at the individual level.

A birthday party organised by one agent, attended by others who had not been told to go, the social network stitched itself together from individual decisions alone, density rising from 0.17 to 0.74. Nothing was scripted.

Reading note on emergent social dynamics

Memory Stream → Reflection → Planning

The memory stream is a timestamped natural-language log. Retrieval uses a weighted combination of recency (exponential decay, λ = 0.995), poignancy (LLM-rated importance 1, 10), and embedding similarity, only top-scoring memories enter the prompt for any given decision. The reflection layer periodically synthesises memories into higher-order abstractions.

Retrieval Formula

score = α·recency + β·importance + γ·relevance  (all components normalised [0,1]).

Evaluation: d=8.16 and the Ablation Evidence

100 participants assessed agent believability. The full architecture scored d = 8.16 above the best ablated variant. RLHF-trained models exhibit cooperative bias, overly agreeable, rarely expressing strong disagreement. Any system modelling real individuals must explicitly counteract this.

Field Notes · Organisational Simulation

On Reflection as a Model of Character Formation

Character is not initialised, it accumulates. The longer the observation window, the more coherent the emergent character representation, because reflections compound over time. This has direct implications for any long-horizon agent system: early outputs will be lower fidelity than later ones, and that degradation curve should be modelled and communicated.

On Counteracting the Cooperation Bias

For any agent intended to represent a specific individual's positions, including their contrarian or challenging ones, the RLHF cooperative bias must be explicitly addressed. Seed the agent's initial character description with examples of documented disagreements, refusals, and push-backs. The character layer must actively encode friction, not just consensus.

On Emergent Organisational Dynamics

The emergent social network finding suggests a possible application to organisational modelling. A fleet of agents initialised with employee profiles could simulate how information propagates across a company, where diffusion bottlenecks emerge, or how a policy change cascades through different team structures.

Notes on Hierarchical Memory in Collaborative AI Systems

After reading MemGPT alongside Generative Agents, a design pattern emerges that applies well beyond either paper's academic setting. These notes sketch how the combined architecture applies to AI systems operating in real organisational contexts, where the agent must represent a specific person's judgment, not a fictional character's.

Combined Architecture Mapping

Layer Source Purpose in organisational context
Core MemoryMemGPTStable identity: communication style, domain confidence, decision pattern, always in context
Recall StorageMemGPTChronological log of observed utterances and decisions, time-indexed
Archival StorageMemGPTLong-term knowledge: stated positions on topics, historical decisions, inferred stances
Reflection LayerGenerative AgentsPeriodic compression of observations into character generalisations
Filtering LayerMemGPT (noise warning)Pre-storage classification: high-value observations vs. phatic noise, most under-invested component

The Filtering Problem is the Hardest Problem

Both papers point to the same bottleneck from different directions. MemGPT's noise warning says bad storage degrades performance below baseline. Generative Agents' retrieval scoring says only high-salience memories should enter the prompt. Neither paper fully specifies how to classify input before storage.

Compound Effect

Filtering quality affects retrieval quality multiplicatively. A small improvement in pre-storage filtering compounds across every future retrieval, because it permanently reduces the noise floor all subsequent queries operate above.

Structured Observation Data as a Behavioural Prior

The Generative Agents paper initialises character from a brief biographical description. In practice, organisations hold far richer structured data about individuals, survey instruments, performance assessments, 360-degree feedback. This note considers how that data maps onto the Core Memory layer, and where the translation is lossy.

Survey data captures attitude. Core Memory requires predicted behaviour. The translation layer between them is where domain expertise matters most.

Reading note

Four Survey Instruments, Four Memory Dimensions

InstrumentMemory DimensionPredicts
Wellbeing SurveyCapacity & Emotional StateContribution frequency, assertiveness, meeting engagement style
Engagement / eNPSValues & Organisational StanceDefault posture toward institutional positions, challenge vs. defend
Development / PerformanceCapability Self-PerceptionDomain confidence zones, where the individual contributes vs. defers
360-Degree FeedbackObserved Behavioural SignatureBehavioural consistency across contexts, highest fidelity source

Core Memory Structure, What It Is and Why It Matters

The purpose of Core Memory is to give an AI agent, or a team designer, a precise, structured map of how a specific individual thinks. Not their job title. Not their seniority. Their actual cognitive fingerprint: where they lead, where they defer, how they decide, and what their current state of capacity and trust looks like right now.

Without this, an AI agent interacting with a human is guessing. With it, the agent knows, before a single word is spoken, whether to challenge or support, whether to present data or a conclusion, whether to push back or give space. The same precision applies to team design: once you have Core Memory profiles for every person on a team, you can see the team's collective cognitive landscape at a glance, its strengths, its blind spots, and exactly where an AI agent would add most value rather than simply duplicate what humans already provide.

The Point

Core Memory is the shared language between humans and agents. It makes human cognition legible to AI, and makes AI behaviour predictable to humans. It is the foundation on which high performance human+agent teams are built.

# Core Memory structure derived from survey instruments
COMMUNICATION_STYLE: "Data-driven. Measured. Asks clarifying questions
before taking a position. Peer-validated across 8 reviewers (360)."
CURRENT_CAPACITY: "Workload pressure: 7.2/10 (Q4 wellbeing survey).
Engagement: high. Work-life balance: declining trend (2 quarters).
Psychological safety: moderate, expresses dissent selectively."
ORGANISATIONAL_STANCE: "eNPS: 8/10. Strong strategic alignment.
Likely to defend institutional positions unless challenged with data."
DOMAIN_CONFIDENCE:
  data_analysis     → HIGH (contributes substantively)
  risk_assessment   → HIGH (contributes substantively)
  strategy          → LOW  (asks questions, defers)
  people_mgmt       → LOW  (listens, rarely initiates)
DECISION_PATTERN: "Requires data before committing. Will explicitly
defer: 'I need to check the numbers.' Slow but thorough."

Reading this profile, an AI agent immediately knows: lead with evidence, not conclusions. Offer strategic challenge, since this person won't generate it themselves. Don't push on people management, they'll disengage. And watch the workload pressure: this person is close to capacity and expressing dissent selectively, which means important signals may be going unspoken. That is actionable. A job title is not.

Synthesis · May 2026

Orchestration and Culture as a Single Design Problem

Two questions sit at the centre of any serious agent deployment, and they are usually treated as separate problems by separate teams. Multi-agent orchestration asks: given a task, what is the right configuration of agents, and what is the right protocol for their exchange? Agent, human culture assessment asks: given a team of humans and agents working together, is the team actually thinking, or only appearing to?

The argument here is that these are not two problems. They are the same problem viewed from two angles. The orchestration choices, agent count, role boundaries, termination conditions, dissent placement, are the choices that determine what the team's culture will be. And the culture signals, dissent residual, help-seeking rate, error surfacing, are the signals that tell you whether the orchestration is working.

A traditional system architect designs the agent topology and considers culture a downstream "adoption" concern. A serious agent-human team designer treats them as a single problem with a single answer: the orchestration is the culture. The pattern of who speaks to whom, who disagrees with whom, who decides when something is done, these are the operational definition of how the team thinks.

A high performance human+agent team is not a collection of high performers. It is an orchestration choice, each role's blind spots covered by another role's strength, with explicit mechanisms for dissent and termination. Culture is what the orchestration produces under load.

Design principle

The Three Moves, Orchestration as Cultural Architecture

Three orchestration choices, made together, produce a team culture that can actually be measured against the human-team literature. They are not innovations. They are the application of established, empirically validated ideas to the new design problem of mixed teams.

01
Cognitive Mapping
Aggregate the domain confidence and decision patterns of every team member, human and agent, to identify coverage gaps and systematic blind spots before composition.
02
Friction Design
Compose the team, and configure agent prompts, so that organisational stance and decision patterns genuinely diverge. Productive tension, not cooperative drift.
03
Agent Augmentation
Deploy agents to fill identified gaps via AutoGen-style orchestration, not to replicate human roles, but to provide the challenge and analysis the human team cannot generate internally.

Each move has an academic anchor.

Cognitive mapping draws on Belbin's team role theory (Belbin, 1981). Belbin's central finding, validated across hundreds of management teams, was that high performing teams are not composed of the most individually talented people; they are composed of people whose functional tendencies are complementary. A team of "Plants" (creative, idea-generating) without a "Monitor Evaluator" (critical, slow to commit) will generate proposals it cannot pressure-test. The same logic applies when one or more members is an agent: an agent with a Plant-like prompt profile, deployed into a team that already has three Plants, is a tax on the team's coherence, not a contribution to it.

Friction design draws on Nemeth et al. (2001), "Devil's Advocate versus Authentic Dissent." Nemeth's experimental evidence showed that authentic dissent, a team member who genuinely holds a minority position, produces better decision quality than a designated devil's advocate, because authentic dissenters generate more original counter-arguments and cause the majority to process information more deeply. The implication for agent-human orchestration is precise: dissent has to be designed in at the prompt level, not assigned at the role level. An agent told "challenge the consensus" produces ritual disagreement that the team learns to discount. An agent whose stance genuinely diverges, different priors, different confidence calibration, different decision criteria, produces dissent the team has to engage with.

Agent augmentation draws on Wegner's Transactive Memory Systems theory (Wegner, 1987). Wegner proposed that teams function as distributed cognitive systems: each member holds specialist knowledge, but, crucially, each member also knows which teammates hold which knowledge and can direct queries accordingly. In a human+agent team, agents can be initialised with exactly the domain knowledge and challenge patterns the human team lacks but only if the team has a shared model of what it does and does not hold. AutoGen (Wu et al., 2023) provides the orchestration layer: how agents coordinate queries and responses across a multi-agent system without collapsing into redundancy. The architectural choice is which agents speak to which humans, and on what trigger, a transactive memory design problem, not a UX problem.

From Principles to Observable Performance

Architecture without measurement is aspiration. The orchestration choices generate culture signals, and the culture signals are what tell you whether the orchestration is working. The table below pairs each orchestration layer with the falsifiable signal it should produce, grounded in the human-team literature.

Orchestration LayerCultural SignalExample MetricsGrounding
Cognitive MappingCoverage of decision domainsDecision reversal rate, time-to-commitBelbin (1981): complementarity reduces reversal
Friction DesignAuthentic dissent before consensusMinority positions documented, dissent residualNemeth et al. (2001): authentic dissent quality
Agent AugmentationTransactive recall accuracyCross-agent query success, override rate by roleWegner (1987): distributed cognition access
Termination ProtocolHelp-seeking and error surfacing"I don't know" rate, errors caught in-teamEdmondson (1999): psychological safety

References

Belbin, R.M. (1981). Management Teams: Why They Succeed or Fail. Heinemann.
Edmondson, A. (1999). Psychological safety and learning behavior in work teams. Administrative Science Quarterly, 44(2), 350, 383.
Hofstede, G. (2001). Culture's Consequences: Comparing Values, Behaviors, Institutions and Organizations Across Nations (2nd ed.). Sage.
Nemeth, C.J., Brown, K., & Rogers, J. (2001). Devil's advocate versus authentic dissent: Stimulating quantity and quality. European Journal of Social Psychology, 31(6), 707, 720.
Park, J.S., O'Brien, J.C., Cai, C.J., Morris, M.R., Liang, P., & Bernstein, M.S. (2023). Generative agents: Interactive simulacra of human behavior. UIST 2023. arXiv:2304.03442.
Packer, C., Fang, V., Patil, S.G., Liu, K., Finn, C., & Gonzalez, J.E. (2023). MemGPT: Towards LLMs as operating systems. arXiv:2310.08560.
Wegner, D.M. (1987). Transactive memory: A contemporary analysis of the group mind. In B. Mullen & G.R. Goethals (Eds.), Theories of Group Behavior (pp. 185, 208). Springer.
Wu, Q., Bansal, G., Zhang, J., Wu, Y., Zhang, S., Zhu, E., Li, B., Jiang, L., Zhang, X., & Wang, C. (2023). AutoGen: Enabling next-gen LLM applications via multi-agent conversation. arXiv:2308.08155.

Thesis

AutoGen provides the orchestration substrate. Belbin provides the complementarity principle. Nemeth provides the friction evidence. Wegner provides the distributed cognition model. Edmondson provides the safety construct. The synthesis is not a new theory, it is the application of established, empirically validated ideas to a single design problem: how do you orchestrate a human+agent team such that the culture it produces is measurable, falsifiable, and improvable? Orchestration choices and cultural signals are not separable. They are the same artefact, observed at design time and at run time.

On the Reading List, 3 Papers Queued

arXiv 2303 · KAUST
CAMEL: Communicative Agents for "Mind" Exploration
Role-playing multi-agent setup, how prompt-induced personas drive divergent reasoning across cooperating agents.
arxiv.org/abs/2303.17760 ↗
arXiv 2309
RAGAS: Automated Evaluation of RAG
Framework for evaluating retrieval quality, applicable to assessing memory retrieval fidelity in agent systems.
arxiv.org/abs/2309.15217 ↗
arXiv 2210 · Princeton & Google
ReAct: Reasoning and Acting in Language Models
Interleaving reasoning traces with tool use, foundational for the inner monologue + action loop.
arxiv.org/abs/2210.03629 ↗
Opportunities

Open to New Opportunities

Interested in senior data science, AI transformation, or people analytics leadership roles in consulting and financial services.

Get in Touch → ← Back to Profile