Managing people well is genuinely hard. I know this from experience: I have led teams of close to a hundred teachers, navigated the politics of large institutions, and spent more time than I would like calibrating feedback so it lands as intended rather than as criticism. And even after a lot of training and practice, I do not find it easy.
Which is why it surprised me how natural it felt to start working with AI agents. The friction I associate with managing people (the sensitivities, the competing priorities, the careful relationship maintenance) largely disappears. Ask an agent for twenty more ideas and you get twenty more ideas, immediately, without negotiation or pushback. Ask it to critique its own work and it will do so thoroughly and without defensiveness. For someone who has spent years navigating the human side of managing a team, that ease is striking.
That ease, it turns out, is part of the problem. In my experience working with real estate professionals, the people who scrutinise AI outputs most carefully tend to be those who were sceptical to begin with. They approach every output looking for reasons not to trust it, and they find them. The enthusiastic adopters, the ones actually getting value from these tools day to day, tend to scrutinise far less. Not because they are careless, but because nothing in the interaction prompts them to. Working with an AI that is endlessly willing, endlessly patient, and always trying its best lowers your guard in ways you don't always notice. There is no hesitation in the output, no visible uncertainty, no moment where the agent signals that it is on unfamiliar ground. The friction that would normally keep you alert is simply absent. And on top of that, most people have never been shown what rigorous scrutiny of an AI output actually looks like. It is not obvious, and it is not the same as checking a colleague's work.
And that’s where the analogy to human management becomes useful, and where it also breaks down. A good employee, a good student, would not allow themselves to simply make something up when they didn’t know the answer. They would flag the gap, ask a question, or at minimum signal uncertainty. Professional conscience is a form of self-regulation that good workers apply even when nobody is watching. AI agents have none of it. A hallucination arrives with exactly the same fluency and confidence as a correct answer. There is no tell. The responsibility for catching it sits entirely with the person who did the delegating.
That is the supervision problem. And most people, right now, are not set up to solve it. So I went back to my training in supervising people to see what we can borrow to manage work with AI better. Here is what I found.
Calibrating trust to the wrong thing
Managers are known to over-supervise people they distrust and under-supervise people they like, regardless of what is actually being produced. With agents, there is no relationship to miscalibrate on, which sounds like an improvement. In practice, people calibrate trust to the tool instead: they trust a frontier model more than a cheaper one, or a familiar workflow more than a new one, regardless of what is actually being asked. The stakes of the specific output rarely enter the calculation.
This is compounded by a capability bias that runs in exactly the wrong direction. A weaker model produces errors that are usually obvious. A frontier model constructs a more convincing wrong answer: fluent, internally consistent, and wrong in ways that require genuine domain knowledge to catch. The output that most needs careful scrutiny is often the one that looks most authoritative. In practice, scrutiny tends to run the other way: polished output from a capable model on a high-stakes task often receives less attention than rough output from a tool people are less comfortable with. The result is the same as the management failure: oversight is determined by familiarity and surface confidence, not by consequence.
Underspecified briefs and the silence that follows
A manager who delegates without a clear brief will usually get some signal that something has gone wrong. The colleague asks a question. The draft comes back in a style that feels off. Something surfaces that prompts a conversation about what was actually wanted. Agents fill gaps aggressively and silently. Give an AI an underspecified task and it will produce something that looks finished, embedding assumptions you never sanctioned along the way. The output looks complete, so the inadequacy of the brief stays invisible until something downstream breaks.
Underneath this is a failure that human managers also know well: abdication disguised as trust. The manager says they trust the person to handle it, but what they mean is they do not want to engage with the complexity of specifying what they actually want. With a human delegate, there is at least social friction. They might push back, express uncertainty, or ask for a meeting. An agent produces output immediately, and the volume and speed of that output create the impression that the task has been handled. And because most interactions are stateless, if you do not write down what good looks like, the standard resets with every run. What is a cultural problem in human teams, one that improves slowly through repeated interaction, becomes a structural problem with agents that stays constant unless you deliberately build evaluation criteria into the workflow itself.
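For readers who build their own workflows, here is one way "writing down what good looks like" might be sketched in Python. This is a minimal illustration, not a recommended tool: the `Brief` and `Criterion` classes, the listing task, and the two checks are all hypothetical, and real criteria would usually need human judgement rather than simple string tests.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Criterion:
    """One written-down expectation about the output (hypothetical helper)."""
    name: str
    check: Callable[[str], bool]


@dataclass
class Brief:
    """A task description stored together with its evaluation criteria."""
    task: str
    criteria: list[Criterion] = field(default_factory=list)

    def evaluate(self, output: str) -> list[str]:
        """Return the names of the criteria the output fails."""
        return [c.name for c in self.criteria if not c.check(output)]


# Illustrative brief: the task and both checks are assumptions for this sketch.
listing_brief = Brief(
    task="Summarise the listing in under 60 words and state the asking price.",
    criteria=[
        Criterion("under 60 words", lambda text: len(text.split()) < 60),
        Criterion("mentions a price", lambda text: "$" in text),
    ],
)

# A stand-in for what an agent might return on one run.
draft = "A bright two-bedroom flat close to the station."
failed = listing_brief.evaluate(draft)
print(failed)
```

Because the criteria live alongside the task, the same standard applies on every run, and the brief can be tightened over time instead of being improvised afresh.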
When oversight happens matters as much as whether it happens
Late-stage review is expensive in any team. Errors that could have been caught early compound through revision after revision. But with human work, the damage is usually contained to one piece of work and recoverable.
With agents running multi-step workflows, an error in step two propagates through steps three, four, and five before anyone looks. Unpicking it requires understanding the whole process rather than correcting a single decision. Checking only at the end is not a minor inefficiency in agentic work. It is a structural risk.
The same logic applies after the task is done. Human teams have organic moments of reflection: a difficult project ends, a pattern becomes too obvious to ignore, something goes wrong badly enough that people stop and ask what happened. With agents, throughput is high enough that there is rarely a natural pause. Tasks complete, outputs get used, and the workflow that produced a subtle error last Tuesday is running again by Thursday. The speed that makes agents valuable is exactly what makes reflection feel unnecessary, right up until the point when it turns out to have been essential.
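The difference between checking only at the end and checking between steps can be shown with a small Python sketch. Everything here is an assumption made for illustration: `run_with_checkpoints`, `CheckpointFailed`, and the two stand-in steps are hypothetical, and in a real workflow each step would be a call to an agent rather than a plain function.

```python
from typing import Callable

Step = Callable[[str], str]    # transforms the previous result
Check = Callable[[str], bool]  # decides whether a result is acceptable


class CheckpointFailed(Exception):
    """Raised when an intermediate result fails its check."""


def run_with_checkpoints(initial: str, stages: list[tuple[str, Step, Check]]) -> str:
    """Run each step in order, stopping at the first failed check."""
    result = initial
    for name, step, check in stages:
        result = step(result)
        if not check(result):
            # Stop here so the bad result never reaches the later steps.
            raise CheckpointFailed(f"stage '{name}' failed its check")
    return result


def extract_price(text: str) -> str:
    """Stand-in for an agent step that pulls a price out of free text."""
    return text.split("price:")[-1].strip()


def format_price(digits: str) -> str:
    """Stand-in for a later step that relies on the earlier one being right."""
    return f"${int(digits):,}"


stages = [
    ("extract price", extract_price, lambda out: out.isdigit()),
    ("format price", format_price, lambda out: out.startswith("$")),
]

print(run_with_checkpoints("price: 350000", stages))
```

If the input were `"price: unknown"`, the first check would fail and the run would stop at "extract price", rather than letting the bad value flow into the formatting step and surface later as a more confusing error.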
So what should we do?
None of these issues is an argument against using AI agents. They are amazingly useful and allow us to do things that were not possible before. But the list above should make clear that the ease of working with AI is partly an illusion, or at least incomplete. The patience, the willingness, the absence of ego: these are real advantages. What is also absent is the professional conscience that makes human workers self-regulate. A good colleague knows when they are out of their depth. They hesitate. They qualify. They ask. An agent produces the hallucination with the same confidence as the correct answer, and moves on.
This means the social scaffolding that makes human delegation functional (the embarrassment of handing in poor work, the instinct to flag uncertainty, the professional reputation at stake) simply does not exist with AI. It has to be replaced with something deliberate. That something is workflow design: deciding what good looks like before you run the task, building checkpoints at the moments where errors are most likely to compound, and treating the brief as a document that improves over time rather than something you improvise fresh each run.
The irony is that the qualities that make AI so pleasant to delegate to are exactly the qualities that make unsupervised delegation risky. An agent that always tries its best, never pushes back, and produces something complete-looking every time gives you very little signal that anything has gone wrong. With a human colleague, friction is information. With an agent, the absence of friction tells you almost nothing.
Good supervision of AI is not about reviewing every output line by line. It is about understanding the process well enough to know where the risks are, and designing your oversight around those moments rather than hoping the final result will make the problems obvious. That has always been what good management required. It just used to come with more built-in reminders.

