Welcome to the RAG Problem (Or, Why Chatting With Your Data Disappoints)

May 12 / Monika Szumilo
Have you ever tried “chatting with your data” and walked away frustrated? You asked a sensible question, one that should be answerable from your own files and emails. What came back was fluent, well-formatted, and quietly nonsense. Half-right at best. Confidently incomplete. The one document in SharePoint that would have actually answered the question never even surfaced.

Almost every real estate professional I work with has had this experience at least once. The reason it happens has nothing to do with the model being bad. It comes down to how these tools are built underneath, which turns out to matter quite a lot for how to use them well.

The technology behind “chat with your data” is called RAG, short for retrieval-augmented generation. Almost every enterprise AI assistant on the market today uses it, including the popular M365 Copilot. The architecture is genuinely useful when you ask it the right kind of question. The trouble is that the way these tools are marketed encourages you to ask the wrong kind, and the output looks identical either way. You get the same polished bullets and confident formatting whether the answer is right, partly right, or invented.

This is worth understanding for two reasons. You can usually get much better results by changing how you ask, without any new tools. And knowing where the architecture fails is the only way to spot the answers you should not trust.

A short detour into how this works

When you ask Copilot a question, the system does not read through all your documents the way you would. It uses something called an embedding model. Think of it as software that converts each chunk of text, whether a paragraph from a market report or a line in your inbox, into a long list of numbers that captures what the text is “about.” Documents on similar topics end up with similar lists. Your question gets converted the same way. The system then finds the documents whose numbers sit closest to your question’s numbers, hands those to the language model, and asks it to produce an answer from what was found.
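For readers who like to see the mechanics, here is a minimal sketch of that matching step. It is not how Copilot is implemented. The three-number vectors are hand-assigned stand-ins for what a real embedding model would learn (real ones use hundreds of dimensions or more), and the chunk texts, the `retrieve` function, and the `top_k` parameter are all hypothetical.

```python
import math

# Hypothetical 3-number "embeddings": [lease timing, rent negotiation, financing].
# Assumption: these are hand-assigned for illustration, not produced by a model.
CHUNKS = {
    "Lease schedule: Unit 4 tenancy ends March 2026": [0.9, 0.1, 0.0],
    "Email: tenant asks for a lower rent": [0.1, 0.9, 0.0],
    "Loan agreement: facility matures 2027": [0.0, 0.1, 0.9],
}


def cosine(a, b):
    """Cosine similarity: 1.0 means the two vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def retrieve(query_vec, chunks, top_k=2):
    """Return the top_k chunk texts whose vectors sit closest to the query."""
    ranked = sorted(
        chunks,
        key=lambda text: cosine(query_vec, chunks[text]),
        reverse=True,
    )
    return ranked[:top_k]


# "When do our leases expire?" embedded, by assumption, as mostly lease timing.
query = [1.0, 0.0, 0.1]
top = retrieve(query, CHUNKS)
print(top)
```

Only the chunks returned by `retrieve` would be handed to the language model. Anything that scores below the cut-off never reaches it, however relevant it might have been.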

This is called semantic search, and it is a real improvement on keyword search. It understands that “lease expiry” and “end of tenancy” describe related concepts even though they share no words. What it does not do is understand what you actually meant. The system is searching for documents that sound like your question, not documents that answer it. That distinction is where almost every RAG failure comes from.

There are three failure modes worth knowing about. Each shows up regularly in real estate work, and each has a practical workaround that requires no new technology, just a different way of asking.

Failure 1: Compound questions

Try typing something like this into M365 Copilot: “Find tenants in our office portfolio whose leases expire in the next 18 months and who have requested rent reductions in the last year.” It is a perfectly sensible question. The information exists somewhere in your environment. The answer will almost certainly be wrong.

The reason is structural. When the question gets converted into numbers for matching, the system effectively averages across all the concepts in the query: leases, expiry dates, office portfolio, rent reductions. The documents that score best tend to be the ones that touch most of those topics generally, not the ones that satisfy each condition specifically. The model then writes a confident-sounding answer from whatever came back, sometimes including properties that matched a few words but not the full logic.
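The averaging effect can be illustrated with a toy calculation. This sketch takes the "averaging" description literally and models the compound query as the mean of two single-concept vectors, which is a simplifying assumption rather than a description of any particular embedding model. The two dimensions, the document texts, and their vectors are invented for the example.

```python
import math

# Hypothetical 2-number "embeddings": [lease expiry, rent reduction].
DOCS = {
    "Portfolio overview touching on leases and rents in general": [0.6, 0.6],
    "Lease schedule: Tenant A lease expires in 10 months": [1.0, 0.0],
    "Email: Tenant A requests a rent reduction": [0.0, 1.0],
}


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


# Assumption: the compound question lands halfway between its two concepts.
expiry_concept = [1.0, 0.0]
reduction_concept = [0.0, 1.0]
compound_query = [(e + r) / 2 for e, r in zip(expiry_concept, reduction_concept)]

scores = {text: cosine(compound_query, vec) for text, vec in DOCS.items()}
best = max(scores, key=scores.get)

for text, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{score:.2f}  {text}")
```

In this toy setup the general overview scores highest, while the two documents that together actually answer the question score lower and equally. Nothing in the similarity score encodes the "and" in the original question.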

The workaround is to stop asking the system to handle the logic for you. Break the question into steps. First, ask for the list of leases expiring in the next 18 months. Then, working from that list, ask which of those tenants requested rent reductions. The retrieval finds candidates at each stage, and you supply the filtering. This is not glamorous, but it works. Models are perfectly capable of reasoning over a smaller, well-defined set once you have surfaced it. Expecting a similarity search to handle compound logic correctly on its own is where things tend to fall over.
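The two-step approach above can be sketched as explicit filtering. The records below are hypothetical, standing in for whatever each step surfaces from your documents; the tenant names, dates, and the month-level date arithmetic are all assumptions made to keep the example short.

```python
from datetime import date

TODAY = date(2025, 1, 1)  # fixed date so the example is reproducible

# Hypothetical records, as if extracted from the documents found at each step.
leases = [
    {"tenant": "Acme", "expiry": date(2025, 9, 30)},
    {"tenant": "Birch", "expiry": date(2028, 1, 31)},
    {"tenant": "Cedar", "expiry": date(2026, 3, 31)},
]
rent_reduction_requests = [
    {"tenant": "Cedar", "requested": date(2024, 6, 15)},
    {"tenant": "Birch", "requested": date(2024, 11, 2)},
    {"tenant": "Acme", "requested": date(2022, 2, 1)},
]


def months_between(start, end):
    """Whole calendar months from start to end (day of month is ignored)."""
    return (end.year - start.year) * 12 + (end.month - start.month)


# Step 1: which leases expire in the next 18 months?
expiring = {
    lease["tenant"]
    for lease in leases
    if 0 <= months_between(TODAY, lease["expiry"]) <= 18
}

# Step 2: of those tenants only, who requested a reduction in the last 12 months?
flagged = sorted(
    request["tenant"]
    for request in rent_reduction_requests
    if request["tenant"] in expiring
    and 0 <= months_between(request["requested"], TODAY) <= 12
)

print(flagged)
```

Each condition is applied exactly, one after the other, to a list you can inspect. Asking in two steps does the same thing by hand: the first answer becomes the checked input to the second question.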

Failure 2: Niche or unusual questions

Ask M365 Copilot about how recent eurozone monetary tightening might affect your multifamily holdings in Wrocław, and you will get a fluent, well-structured answer. Most of it will be about Germany, or the eurozone broadly, or generic housing-market dynamics. The Poland-specific bits, if they appear at all, will be either thin or quietly invented.

This happens because AI models work well within the distribution of data they were trained on and struggle outside it. Coverage of major Western markets is dense; coverage of Polish secondary cities is sparse. When you ask about something unfamiliar, the system substitutes the nearest familiar pattern at both the retrieval and the synthesis steps, and does so without flagging the substitution. The answer reads as if it is about your portfolio. It is really about the part of the training data that looked most like your question.

The workaround is to force the system to separate what it actually knows from what it is inferring. Try a prompt like: “List only the facts directly supported by the documents you found. Flag any analogies or assumptions. Tell me what evidence is missing.” This will not always work cleanly, but the output you get back is usually more honest. If the system implicitly compares Poland to Germany, you can ask it to justify the comparison. Sometimes the justification holds up. Often it does not, which is information worth having.

Failure 3: Questions that require combining sources

“Which of my assets face refinancing risk in the next two years?” feels like exactly the kind of question Copilot ought to be able to answer. The marketing invites it. The data is all there in your environment. In practice, this one fails often. The answer lives across loan documents in one folder, NOI trends in an Excel model, debt maturity schedules in a separate spreadsheet, and lender covenant correspondence buried in someone’s inbox. No single document scores particularly highly against the question. Each piece becomes informative only when combined with the others, which is exactly what a similarity search does not do.

What you typically get back is a high-level commentary about refinancing risk in general, possibly with a few assets named because their loan documents happened to use that vocabulary recently. The properties that actually face the most refinancing risk, with covenant headroom thinning and maturity approaching, often go unmentioned, because no single document about them looks especially similar to the question.

The workaround is to build the structure yourself. Ask the model to list, for each property, the variables it would need to answer the question: DSCR, NOI trend, debt maturity, covenant terms. Then ask it to populate that table from the documents it can find, flagging anything missing. This converts a vague reasoning task into something closer to filling in a spreadsheet, which models handle well. Once the table exists, the reasoning over it becomes simple. The intermediate structure does almost all of the work.
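As a rough illustration of that intermediate table, the sketch below builds one row per property and marks every variable the documents did not supply. The property names, figures, and field names are invented; in practice the model would populate the `found` values from your documents and you would check them.

```python
# Variables needed per property to reason about refinancing risk.
REQUIRED = ["dscr", "noi_trend", "debt_maturity", "covenant_terms"]

# Hypothetical values, as if pulled from loan documents, models, and emails.
found = {
    "Tower One": {
        "dscr": 1.15,
        "noi_trend": "falling",
        "debt_maturity": "2026-06",
        "covenant_terms": "min DSCR 1.20",
    },
    "Park House": {
        "dscr": 1.80,
        "debt_maturity": "2029-01",
    },
}


def build_table(found, required):
    """One row per property; absent variables are marked, never guessed."""
    return {
        prop: {field: values.get(field, "MISSING") for field in required}
        for prop, values in found.items()
    }


def missing_fields(table):
    """List, per property, the variables still to be located."""
    return {
        prop: [field for field, value in row.items() if value == "MISSING"]
        for prop, row in table.items()
    }


table = build_table(found, REQUIRED)
gaps = missing_fields(table)
print(gaps)
```

The point of the explicit "MISSING" marker is that a gap stays visible as a gap. A property with an incomplete row needs more digging before anyone draws a conclusion about it.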

A unified way to think about this

The principle worth taking from all of this is straightforward. Embedding-based retrieval is good at finding documents that look similar to your question. Language models are good at reasoning over evidence placed in front of them. RAG works best when each part does the job it is good at, and badly when you expect the retrieval step to solve a reasoning problem on your behalf.

In practice this means asking Copilot questions in a way that respects the architecture. Decompose compound questions into steps. Ask the model to separate evidence from inference, and to flag what it is missing. Build structured intermediate states when the answer requires combining several sources. None of this requires any technical background, just an understanding that retrieval and reasoning are different things, and that a confident-looking answer tells you very little about whether it is correct.

Using these tools well comes down to workflow discipline more than technical knowledge. The teams getting real value from RAG have learned that the chat window is a poor representation of what is happening underneath, and they ask accordingly. The ones still expecting to type a single complex question and receive a complete answer keep being disappointed or, worse, accept a polished answer without realising they should not.