The model wars are over. The workflow wars have begun.
Dec 16 · Nikodem Szumilo
A lot has changed in the last few months. Not in the “one model is suddenly 10x better” sense. More in the “you now have several genuinely excellent options, and the hard part is choosing” sense.
We’ve just had a cluster of major releases: Google’s Gemini 3, Anthropic’s Claude Opus 4.5, and OpenAI’s GPT-5.2. And they’re all good. Not “demo good”. Actually useful for real work.
So yes, a lot has changed. But here’s the more interesting point. If you ask “what’s the best model?” you’re asking the wrong question.
Because today, most frontier models can produce something helpful. The question is: which model produces the specific kind of helpful you need, with the fewest edits, the lowest risk, and the best fit for your workflow?
Capability has converged. Behaviour hasn’t.
When people talk about models, they usually talk about capability. Benchmarks. Coding scores. Reasoning modes. Context windows. Tool use.
All of that matters. GPT-5.2, for example, is explicitly framed by OpenAI as a step up in long-context understanding, agentic tool-calling, and “complex, real-world tasks end-to-end”, with specific emphasis on professional outputs like slides and spreadsheets. Gemini 3 is positioned by Google as its “most intelligent” model, improving reasoning, multimodality and coding, with an even heavier “Deep Think” mode on the way. Claude Opus 4.5 is framed by Anthropic as a big jump in “everyday tasks like deep research and working with slides and spreadsheets”, plus longer-running agent workflows and even Excel-specific use. They are all very good models.
But capability is only half the story. The other half is behaviour. And behaviour is where the differences feel massive in practice.
Some models are “obedient”. They give you what you asked for. Some are “helpful” in a slightly annoying way. They interpret what you said, decide what you meant, and give you what they think you want, rather than what you actually asked for. Some are punchy and efficient. Some are verbose and performative. Some ask good clarifying questions; others just guess.
This is what people mean when they say models have “personalities”. It’s not mystical. It’s product design plus training choices. But it’s real. And it affects output quality more than most people expect.
My current (very practical) split
Right now, I don’t think there’s a single winner. I think there are specialists.
For office work (presentations, spreadsheets, turning messy notes into something you can send to a client), Claude is, in my experience, the most consistently usable. The “first draft is already close” rate is simply higher. That tracks with how Anthropic is positioning Opus 4.5: better at slides and spreadsheets, plus “new ways to use Claude in Excel”, and strong Excel-automation claims from early users.
This matches what I’ve been seeing in real estate work specifically. I’ve previously tested Claude on an Excel DCF with a rent roll and formulas, and it was able to explain what the spreadsheet was doing, extend the holding period, and run analyses that would be genuinely annoying to do manually. That’s not “AI magic”. That’s “time saved where it actually counts”. My recent LinkedIn post (here) argues that, in my opinion, Excel financial modelling using AI is basically a solved problem: the technological capability already exists, and we now just need to adopt it in practice.
For legal-ish work (contracts, clause comparisons, “does this say what I think it says?”), I also tend to prefer Claude. Not because other models can’t do it, but because Claude is less likely to freestyle. It’s calmer. More literal. Less eager to invent. It can also edit Word documents with tracked changes, and that’s extremely useful for legal work.
For very long, multi-step processes, where I want the machine to grind through multiple files, keep state, and come back with a stitched-together deliverable, I like GPT-5.2 Pro. OpenAI is explicitly leaning into this “end-to-end professional work” framing, including artifact-style outputs like presentations and spreadsheets, and model variants (Instant, Thinking, Pro) built around different trade-offs.
And for “smartest model I can access today”, especially for coding, academic research, and maths, Gemini 3 currently feels strongest to me. Google is clearly aiming Gemini 3 at reasoning-heavy, tool-using, build-and-plan workflows, with Deep Think positioned for complex problems.
None of this is a religious belief. It’s just the result of using them for different jobs and noticing where I spend less time correcting them.
The awkward truth: marginal gains are now use-case gains
We’ve all been trained (by marketing, benchmarks, and social media) to think in version jumps. GPT-4 to GPT-4o. Claude 2 to Claude 3. Gemini 1 to Gemini 2. But in day-to-day work, the difference that matters isn’t “which is smartest in general”. It’s “which one makes my kind of mistakes less often”. In real estate, those mistakes are rarely abstract logic puzzles. They’re practical failures: misreading a rent-free period, making up a covenant definition, producing a beautiful slide deck with two broken numbers in it, or a DCF sensitivity that looks plausible but pulls from the wrong cell.
That’s why I keep coming back to the same point: the “best model” depends on the workflow, the stakes, and the shape of the output. If your output is a spreadsheet, the model needs to behave like a spreadsheet colleague, not a creative writing assistant. If your output is a legal clause comparison, you need something that is conservative, not “confident”. If your output is research, you need something that can handle ambiguity and still stay grounded.
So how do you choose?
Here’s the framework I’m using now. Start with the deliverable. Not the prompt. Not the model. The deliverable.
If you need something you can paste into PowerPoint or Excel with minimal repair, bias toward the model that behaves like an analyst who has shipped client work before. Right now, for me, that’s Claude most of the time.
If you need a long, careful process across multiple documents, where the model has to keep context, check itself, and produce something cohesive, bias toward the system that’s explicitly optimised for tool-heavy “end-to-end” workflows. GPT-5.2 is pushing hard in that direction.
If you need raw cognitive horsepower, research synthesis, coding, math, complex reasoning, bias toward the model that feels strongest in those modes. For me today, that’s Gemini 3.
Then, only after that, consider price and convenience. Because saving £20/month is meaningless if you spend three hours editing outputs that should have been right the first time.
The meta-skill: tool-agnostic thinking
This is where I’ll be slightly self-referential, because it’s exactly the point we built VARi around. The course uses publicly available tools (ChatGPT, Gemini, Copilot, DeepSeek), but the goal is transferable skills across platforms and models. That’s not a marketing line. It’s a survival strategy.
If you tie your workflow to a single model, you’re building on sand. Models change. Pricing changes. Features move behind paywalls. Enterprises block tools. Compliance teams panic. But if your workflow is built around fundamentals (specifying outputs, structuring information, checking work, designing prompts that constrain ambiguity), then swapping models becomes a tactical choice, not an existential crisis.
Where I’ve landed (for now)
I’m going to keep using all of them. That’s the honest answer. Because the “best model” is not one model. It’s a small stack. If I had to pick one for professional applications today, one model I trust to give me something usable most often, I’d probably pick Claude. It’s more expensive, but the output is closer to “client-ready” with less wrestling.
And the funny thing is: that’s the real metric now. Not “IQ”. Not benchmark scores. But: “How quickly do I get to something I can actually use?”