The barriers to building AI agents have never been lower. With tools like ChatGPT, Claude, and a growing ecosystem of no-code platforms, anyone can string together a workflow and call it an "agent." The demos are impressive. The promises are seductive. And the temptation to dive in is real.
But accessibility isn't the same as capability. And as I learned recently while advising a client, the gap between building something and building something reliable is wider than most people realise.
If you've ever felt both excited and overwhelmed by what AI promises for your work, you're not alone—and neither was this client.
A Clever Idea Meets Reality
The client came to me with a genuinely smart vision. His firm has a detailed financial analysis process—multiple people, multiple stages, multiple checks—that produces valuations for investment decisions. He wanted to translate this into an AI-powered workflow: several specialised LLMs working together, orchestrated by a kind of "meta-agent," ultimately producing an auditable spreadsheet.
On paper, it made perfect sense. In practice, it fell apart.
He wasn't a programmer—he was a real estate expert with deep knowledge of the analysis itself. So he did what seemed logical: he used ChatGPT, Claude, and Perplexity to help him build it. He gathered instructions from various sources on multi-agent systems and asked the models to generate prompts that would make it all work.
The problem was that his instructions were internally contradictory. He didn't understand the architecture of multi-agent systems—and why would he? Nothing in his background had prepared him for that. The models, of course, produced output. They always do. But it was chaos: confused, inconsistent, and nowhere near production-ready.
This pattern is more common than you might think. The tools are so accessible that it's easy to overestimate what they can do without proper structure underneath.
The Hard Truth About Multi-Agent Systems
One thing the marketing around AI agents glosses over: true multi-agent orchestration—where multiple agents with distinct roles coordinate, hand off tasks, and synthesise results—requires code.
I don't mean you need to be a software engineer. But there's architectural work involved: defining each agent's scope, managing how they communicate, handling errors, evaluating outputs. You can't prompt-engineer your way around that. What you can do with prompting alone is get a single LLM to simulate multiple perspectives—essentially role-playing different experts within one model. That's useful for brainstorming, but it's not orchestration.
And the LLM won't tell you when your approach is flawed. It will confidently produce something either way.
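To make "architectural work" concrete, here is a deliberately simplified sketch in plain Python. Ordinary functions stand in for real LLM calls, and every name here (`market_analyst`, `valuer`, `orchestrate`) is illustrative rather than any framework's actual API. The point is what the code structure provides that a prompt cannot: each agent has a declared scope, handoffs are explicit, and failures surface instead of being papered over.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Agent:
    name: str
    scope: str                   # what this agent is responsible for
    run: Callable[[dict], dict]  # stand-in for a real LLM call

def market_analyst(state: dict) -> dict:
    # Stand-in for an LLM step that estimates market rent from comparables.
    return {**state, "market_rent": 24.0}

def valuer(state: dict) -> dict:
    # Stand-in for an LLM step that turns rent into a valuation.
    # Explicit precondition: fail loudly if the handoff didn't happen.
    if "market_rent" not in state:
        raise ValueError("valuer needs market_rent from the analyst")
    annual_income = state["market_rent"] * state["area_sqm"] * 12
    return {**state, "valuation": annual_income / state["cap_rate"]}

def orchestrate(agents: list[Agent], state: dict) -> dict:
    # The "meta-agent": a fixed order, explicit handoffs, explicit errors.
    for agent in agents:
        try:
            state = agent.run(state)
        except Exception as exc:
            raise RuntimeError(f"{agent.name} failed: {exc}") from exc
    return state

pipeline = [
    Agent("analyst", "estimate market rent", market_analyst),
    Agent("valuer", "compute valuation", valuer),
]
result = orchestrate(pipeline, {"area_sqm": 1000.0, "cap_rate": 0.05})
```

Even in this toy version, the orchestration logic lives in code, not in prose. That is precisely the layer my client's prompt-only approach was missing.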
What Actually Happened
After a long, iterative conversation with ChatGPT, my client ended up somewhere different from where he started. The model gradually steered him toward simplification: instead of a multi-agent system, he landed on a single, very long prompt submitted to one LLM. No code involved. No orchestration. Just a detailed set of instructions and the hope that the model would follow them.
This is a legitimate path—and for casual exploration, it might be fine. But it comes with its own problems.
To convey all the necessary instructions, context, constraints, and edge cases, that prompt becomes enormous. And the longer and more complex a prompt gets, the less reliably the LLM will follow it to the letter. Instructions start to conflict with each other. Edge cases get ignored. The model quietly makes trade-offs you never sanctioned. Worse still, ensuring there are no inconsistencies in a complex prompt is extraordinarily difficult. You're essentially writing a specification document in natural language and hoping the model interprets it exactly as you intended, every time. It won't.
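One way to see the difference structure makes: a constraint that lives as prose in a long prompt can be silently ignored, while the same constraint expressed as code either passes or fails. Here is a minimal sketch of that idea, with made-up field names and thresholds chosen purely for illustration:

```python
import json

def validate_valuation(raw: str) -> dict:
    """Check a model's output against explicit rules instead of hoping
    the prompt was followed. Malformed JSON fails loudly at parse time."""
    record = json.loads(raw)
    assert set(record) == {"asset_id", "value_eur", "cap_rate"}, "unexpected fields"
    assert record["value_eur"] > 0, "valuation must be positive"
    assert 0.01 <= record["cap_rate"] <= 0.15, "cap rate outside plausible range"
    return record

good = validate_valuation(
    '{"asset_id": "A-17", "value_eur": 4200000, "cap_rate": 0.048}'
)
```

A violated rule here produces an error you can see, not a quiet trade-off buried in fluent output.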
Why This Matters in Real Estate
Real estate financial analysis isn't a sandbox. The outputs feed into investor decisions, fund valuations, audits, compliance. An unreliable result isn't just inconvenient—it's a business risk.
My client's original instinct was exactly right: he wanted something auditable and reproducible. That's the professional standard. The problem was that his execution path couldn't deliver it.
The question you should always ask is: Can I trust this output enough to act on it? If the answer depends on luck—on whether the model happens to interpret your prompt correctly this time—you have a problem.
A Middle Path: Claude Skills
What I recommended to this client was something in between a raw prompt and a fully coded system: Claude Skills.
Skills are essentially structured tools you build within Claude that can include templates and actual code. The key advantage is reproducibility: give it the same inputs and you get the same outputs, something a raw LLM prompt simply cannot guarantee.
Niko, our Real Estate Economics lead at VARi, has tested this extensively. He built a DCF skill that produces identical valuations every time, with explicit assumptions you can inspect and adjust. That came directly from our own experimentation and teaching—we wanted to know exactly where the boundaries of reliability were.
The best part: you don't need to code it yourself. You can write and refine skills using natural language. But underneath, there's structure—which solves the long-prompt problem I described earlier. You get something you can actually stand behind.
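To give a sense of what "identical valuations every time" means in practice, here is a stripped-down illustration of the kind of deterministic calculation a DCF skill can delegate to code. This is the generic textbook formula (annual cash flows plus a Gordon-growth terminal value), not Niko's actual skill:

```python
def dcf_value(cash_flows: list[float], discount_rate: float,
              terminal_growth: float) -> float:
    """Present value of annual cash flows plus a discounted
    Gordon-growth terminal value."""
    pv = sum(cf / (1 + discount_rate) ** t
             for t, cf in enumerate(cash_flows, start=1))
    terminal = (cash_flows[-1] * (1 + terminal_growth)
                / (discount_rate - terminal_growth))
    return pv + terminal / (1 + discount_rate) ** len(cash_flows)

# Same inputs, same valuation, every single run.
value = dcf_value([100.0] * 5, discount_rate=0.07, terminal_growth=0.02)
```

Because the arithmetic runs as code rather than token-by-token generation, the result is reproducible, and every assumption (discount rate, terminal growth) sits in plain sight where it can be inspected and adjusted.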
What About the Latest Models?
OpenAI recently announced GPT-5.2 with claims of better instruction-following and improved financial analysis capabilities. And to be fair, the models are getting better. Each generation handles complexity a bit more gracefully.
But there will always be a new model, a new announcement, a new benchmark. The underlying question remains the same. If you're running a single prompt with no structural scaffolding, the output is still generated probabilistically. Better doesn't mean auditable. Better doesn't mean reproducible.
The real skill isn't knowing which model is newest. It's understanding how these models work—well enough to judge whether a new release actually solves your problem or just sounds like it does. That's the difference between chasing announcements and making informed decisions.
This is exactly what we focus on in our courses: developing the intuition, grounded in real understanding of how these systems work, that lets you evaluate new tools critically.
Match Complexity to Capability
Yes, everyone can build an AI agent now. But the sophistication of what you build should match your understanding of the tools underneath.
Think of it as a spectrum:
- Simple prompts are fine for exploration, brainstorming, and first drafts.
- Structured tools like Claude Skills give you reproducibility and auditability without requiring you to code.
- True multi-agent systems require programming expertise—yours or someone you're working with.
Before you build, it's worth asking: Do I understand what I'm building? If the answer is "not quite," that's not a failing. It's a starting point.