Working Paper · Wang (first author), in preparation

Hire Your Agents, Don't Just Call Them: An HR Framework for High-Performance Orchestrator-Worker Multi-Agent Systems

The current frontier of multi-agent AI is the orchestrator-worker pattern: a lead agent decomposes a goal, spawns specialised subagents to work in parallel, and synthesises their results. The architecture is well understood; the management of it is not. Most systems brief a worker once, let it run to completion, and read whatever comes back.

My argument: that is a people-management problem wearing an engineering costume. The failure modes (workers drifting from intent, errors caught only at the end, outputs no one downstream can use) are precisely the failures human organisations spent a century learning to manage. So I treat the orchestrator as a manager, the subagents as its direct reports, and port the operational practices of HR performance management (not the org chart, but the actual process) into the agent loop. This is the same move I make in my day job, run in the other direction.

A bad manager briefs once, never checks in, and judges only at the end. That is also how most orchestrators manage their workers. HR has spent decades learning to do better. The question is whether that knowledge transfers.

Core thesis

Five HR Practices, Mapped to the Agent Loop

Each practice has a well-replicated empirical base in the people literature and a concrete, implementable mechanism in the orchestrator-worker loop. Underpinning all five is team trust, treated not as a sixth practice but as the substrate that decides whether the other channels carry honest signal at all.

| HR Practice | Agent-Loop Mechanism | Anchor |
| --- | --- | --- |
| Goal Setting | Subagent contract: objective, success criteria, output schema, scope, effort budget | Locke & Latham (2002) |
| Manager 1:1s | Event-triggered mid-task check-ins (not a fixed schedule) | Pulakos et al. (2019) |
| Coaching | Task-directed corrective feedback, not pass/fail judgment | Kluger & DeNisi (1996) |
| End Review | Structured retrospective + cross-task policy update | DeNisi & Murphy (2017) |
| Stakeholder Feedback | Multi-source eval by the peer/downstream agents that consume the output | London & Smither (1995) |
| Team Trust (substrate) | Protected-escalation contract: an honest "I'm not sure" counts as success | Edmondson (1999) |
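
To make the first and last rows concrete, here is a minimal sketch of a subagent contract with protected escalation. The five contract fields come from the table; everything else (the class names, the `Status` enum, measuring the effort budget in tokens, and the rule that an escalation must carry an explanatory note) is an illustrative assumption, not a specification from the working paper.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    DONE = "done"
    ESCALATED = "escalated"  # worker honestly reported "I'm not sure"
    FAILED = "failed"


@dataclass(frozen=True)
class SubagentContract:
    """The five contract fields named in the Goal Setting row."""
    objective: str
    success_criteria: tuple[str, ...]
    output_schema: dict[str, type]   # field name -> expected Python type
    scope: str
    effort_budget_tokens: int        # assumption: budget measured in tokens


@dataclass
class WorkerResult:
    status: Status
    output: dict = field(default_factory=dict)
    note: str = ""                   # free-text explanation from the worker


def is_acceptable(contract: SubagentContract, result: WorkerResult) -> bool:
    """Return True if the result satisfies the contract.

    Protected escalation: an ESCALATED result with a non-empty note is
    accepted as a success, so the worker is never penalised for flagging
    uncertainty instead of guessing.
    """
    if result.status is Status.ESCALATED:
        return bool(result.note.strip())
    if result.status is not Status.DONE:
        return False
    # A finished result must match the agreed output schema.
    return all(
        key in result.output and isinstance(result.output[key], expected)
        for key, expected in contract.output_schema.items()
    )
```

Because both pieces live in the brief and in the acceptance check, this sketch never touches scheduling, which is why these two practices leave parallelism alone.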

The Honest Positioning: Complement, Not Replacement

The classic fan-out-and-synthesise flow is strongest on breadth-first tasks with independent subagents, and explicitly weaker on interdependent tasks whose subtasks must be tightly combined. Every oversight mechanism I add is pure overhead on the former and load-bearing on the latter. So the framework is aimed squarely at the regime the baseline handles worst (integration-heavy, long-horizon work) and uses breadth-first tasks only as a neutrality control.

Why I Frame It This Way

The claim is not "better than single-shot delegation." It is "extends performant coordination to the tasks single-shot delegation handles worst." Overclaiming the first is how this kind of work gets dismissed; the second is defensible and useful.

The Hard Constraint: Oversight vs. Parallelism

Every check-in, coaching turn, and stakeholder review reintroduces a synchronisation point into a system whose whole advantage is parallel execution. Three design choices keep that in check: check-ins fire on events (low confidence, budget fraction, milestone) rather than a clock; the orchestrator is non-blocking (other workers keep running while one check-in is handled); and stakeholder feedback flows only along a directed acyclic graph, so no two agents can wait on each other.
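
Two of these design choices can be sketched in a few lines. The sketch below assumes a hypothetical `WorkerState` with a self-reported confidence score and a budget fraction; the thresholds (`min_confidence`, `budget_marks`) and the function names are illustrative, not taken from the paper. The DAG check uses Python's standard-library `graphlib`. The non-blocking orchestrator is not shown.

```python
from dataclasses import dataclass
from graphlib import CycleError, TopologicalSorter


@dataclass
class WorkerState:
    confidence: float          # self-reported, 0..1 (assumed signal)
    budget_used: float         # fraction of the effort budget spent, 0..1
    milestone_reached: bool    # set by the worker when it hits a milestone
    budget_marks_hit: int = 0  # how many budget marks have already fired


def check_in_reasons(state, min_confidence=0.5, budget_marks=(0.5, 0.9)):
    """Return the events that justify a check-in now; empty list means none.

    Nothing here depends on wall-clock time: check-ins fire on events only.
    """
    reasons = []
    if state.confidence < min_confidence:
        reasons.append("low_confidence")
    # Fire once each time a new budget mark is crossed, not on every poll.
    crossed = sum(state.budget_used >= mark for mark in budget_marks)
    if crossed > state.budget_marks_hit:
        state.budget_marks_hit = crossed
        reasons.append("budget_fraction")
    if state.milestone_reached:
        state.milestone_reached = False  # consume the event
        reasons.append("milestone")
    return reasons


def feedback_order(consumes):
    """Validate the stakeholder-feedback graph and return a safe ordering.

    `consumes` maps each agent to the set of agents whose output it consumes.
    Raises ValueError if the graph has a cycle, i.e. if two agents could end
    up waiting on each other.
    """
    try:
        return list(TopologicalSorter(consumes).static_order())
    except CycleError as err:
        raise ValueError(f"feedback graph has a cycle: {err.args[1]}") from err
```

In this sketch the low-confidence trigger keeps firing for as long as confidence stays low; a real loop would probably debounce it so a single uncertain worker cannot flood the orchestrator.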

The Residual Bottleneck

Because an LLM orchestrator is itself serial (one forward pass at a time), a high check-in rate makes it the throughput bottleneck no matter how workers are scheduled. That is not a bug to remove; it is the mechanism behind a predicted inverted-U: too few check-ins → drift, too many → orchestrator stall.
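
A toy model shows how two opposing pressures produce that shape. The functional forms below are assumptions chosen purely for illustration (drift avoided saturates exponentially with the check-in rate, stall cost grows linearly with it); they are not fitted to data and are not the paper's model.

```python
import math


def net_benefit(rate, drift_decay=1.0, stall_cost=0.2):
    """Toy net benefit of a given check-in rate (arbitrary units).

    Assumed forms:
      drift avoided      = 1 - exp(-drift_decay * rate)  (diminishing returns)
      orchestrator stall = stall_cost * rate             (serial bottleneck)
    """
    drift_avoided = 1.0 - math.exp(-drift_decay * rate)
    orchestrator_stall = stall_cost * rate
    return drift_avoided - orchestrator_stall


def best_rate(drift_decay=1.0, stall_cost=0.2):
    """Rate that maximises net_benefit under the assumed forms.

    Setting the derivative to zero gives
    drift_decay * exp(-drift_decay * r) = stall_cost.
    """
    return max(0.0, math.log(drift_decay / stall_cost) / drift_decay)


if __name__ == "__main__":
    peak = best_rate()
    for r in (0.0, peak, 10.0):
        print(f"rate={r:5.2f}  net_benefit={net_benefit(r):+.3f}")
```

Under these assumptions the curve rises from zero, peaks at an interior rate, and turns negative once stall dominates. Where the peak actually sits for a real system is an empirical question this sketch cannot answer.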

Built to Adopt Incrementally

Tier 1 · Low cost, low risk

Contract goal-setting + protected-escalation trust. Prompt-level only: it does not touch parallelism and is expected to help on its own.

Tier 2 · Moderate cost, conditional benefit

Event-triggered check-ins + coaching, gated so they fire rarely.

Tier 3 · Higher cost, task-dependent

Stakeholder feedback, restricted to critical-path outputs only.

Tier 4 · Infrastructure-dependent

Cross-task policy learning, which needs an external store of past contracts and outcomes. Treated as exploratory.

Field Notes · Where the Analogy Holds and Breaks

What Transfers

The informational and coordinative half of HR transfers cleanly: clear goals, timely two-way updates, task-directed correction, multi-source feedback, honest escalation. All of it is about getting good work out of capable-but-imperfect autonomous agents under uncertainty, which is the orchestrator's exact problem.

What Does Not

The motivational half (recognition, fairness, retention) mostly does not transfer. Subagents have no career and no morale. I limit the claims to the half that does, rather than stretching the metaphor past where it earns its keep.

The Feedback Trap

The same literature I rely on (Kluger & DeNisi) warns that misdirected feedback can lower performance. An orchestrator coaching from a lossy summary can misdiagnose and steer a healthy worker wrong. A pure engineering view would not anticipate that risk. The HR lens flags it in advance.

The Dissent Residual: Measuring Whether an Agent-Human Team Actually Thinks

RLHF-trained agents are structurally too agreeable. Drop one into a team of humans and the standard culture instruments will report the team as highly aligned and high-functioning. It may be neither: it may be a team that has quietly outsourced its judgment to a confident autocomplete. The development in GenAI is the cooperative bias; the problem it creates is a measurement problem in my field.

GenAI development
Cooperative bias in RLHF agents

Agents default to agreement, smoothing whatever the loudest human in the room already said. Documented in generative-agent and alignment work alike.

My HR / culture work
A new culture metric: the dissent residual

The rate at which a team's decisions diverge from the agent's default recommendation. A team tracking the agent's output 95%+ of the time isn't well-supported. It has stopped thinking.

This connects straight to Edmondson's psychological safety (do people surface dissent, errors, uncertainty?) and Nemeth's work on authentic versus performed dissent. The instruments need rebuilding for mixed teams, but the constructs survive the transition. The actionable version for a people-analytics function: log who diverged from the agent and how often, and treat a collapsing dissent residual as an early-warning signal, not a sign of harmony.
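
The logging step is simple enough to sketch. The `Decision` record and its field names are hypothetical; the 5% floor mirrors the "95%+ of the time" figure above and should be read as a placeholder, not a validated threshold.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    decider: str               # who made the final call
    agent_recommendation: str  # the agent's default recommendation
    final_choice: str          # what the team actually decided


def dissent_residual(decisions):
    """Share of decisions that diverged from the agent's recommendation."""
    if not decisions:
        raise ValueError("no decisions logged")
    diverged = sum(d.final_choice != d.agent_recommendation for d in decisions)
    return diverged / len(decisions)


def dissent_by_person(decisions):
    """Per-person divergence rate: who diverged from the agent, and how often."""
    counts = defaultdict(lambda: [0, 0])  # decider -> [diverged, total]
    for d in decisions:
        counts[d.decider][0] += d.final_choice != d.agent_recommendation
        counts[d.decider][1] += 1
    return {person: div / total for person, (div, total) in counts.items()}


def is_early_warning(residual, floor=0.05):
    """Flag a team that follows the agent almost all of the time."""
    return residual <= floor
```

Tracking the residual over rolling windows, rather than as a single lifetime number, is what would make a collapse visible as a trend.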

From Core Memory to a Cognitive Map of a Person

MemGPT gave agents a persistent "core memory": a small, always-loaded identity layer that colours every response. The interesting question for me isn't the engineering. It's that the same structure is exactly what an HR or customer function needs and almost never has: a precise, structured map of how a specific person thinks, decides, and where they defer, built from data we already hold (engagement surveys, 360s, performance signals).

GenAI development
Agent core-memory / identity layer

A stable, structured profile the agent carries into every interaction, separate from any single conversation.

My HR / customer work
A behavioural prior for each person

Not a job title but a cognitive fingerprint: communication style, current capacity, where they lead vs. defer. Survey data captures attitude; this captures predicted behaviour.

The hard part is the translation layer, and it's where the noise warning from the memory literature bites: a profile built from low-quality signal performs worse than no profile at all. So the highest-leverage decision isn't retrieval; it's what you refuse to store. That is a data-quality and ethics problem before it is a modelling one, which is exactly why it belongs to someone working in people analytics rather than to a pure ML team.
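
One way to picture "what you refuse to store" is an admission gate in front of the profile. Everything in this sketch is a hypothetical illustration: the `Signal` record, the `reliability` score and its 0.7 cut-off, the consent flag, and the allow-list of traits (taken from the three examples named above).

```python
from dataclasses import dataclass

# Assumption: only the three traits named in the text may enter a profile.
ALLOWED_TRAITS = frozenset(
    {"communication_style", "current_capacity", "lead_vs_defer"}
)


@dataclass(frozen=True)
class Signal:
    source: str         # e.g. "engagement_survey", "360"
    trait: str
    value: str
    reliability: float  # assumed 0..1 quality score for this signal
    consented: bool     # assumed flag: the person agreed to this use


def build_profile(signals, min_reliability=0.7):
    """Admit only signals that pass every gate.

    Returns (profile, refused): the profile keeps the most reliable signal
    per trait; refused records each rejected signal with the reason, so the
    decision not to store is itself auditable.
    """
    profile = {}
    refused = []
    for s in signals:
        if not s.consented:
            refused.append((s, "no_consent"))
        elif s.trait not in ALLOWED_TRAITS:
            refused.append((s, "trait_not_allowed"))
        elif s.reliability < min_reliability:
            refused.append((s, "low_reliability"))
        elif s.trait not in profile or s.reliability > profile[s.trait].reliability:
            profile[s.trait] = s
    return profile, refused
```

The gates run before any modelling happens, which matches the ordering argued above: data quality and ethics first, retrieval later.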

The Reading Behind These Ideas

Research Notes
Annotated close readings
Every idea here grew out of a paper I read closely. The annotated notes (AutoGen, MemGPT, Generative Agents and more) live on the Research Notes page.
Focus
GenAI × People & Culture
Multi-agent orchestration, agent-human culture assessment, and behavioural modelling from organisational data.
Collaboration
Working on something adjacent?
If you're building agent systems for real teams, or measuring how human+agent teams actually perform, I'd like to compare notes.
Opportunities

Open to New Opportunities

Interested in senior data science, AI transformation, or people analytics leadership roles in consulting and financial services.
