In Defence of Hallucinations

Jun 16 / Monika Szumilo

Hallucinations in LLMs are a feature, not a bug. We have been teaching that for years. If you want your model to be at all creative, it will hallucinate by design. At the same time, getting a different answer every time you run the same prompt can be confusing. Both are manifestations of a single fact: LLMs are probabilistic. It is worth knowing how to make the most of that. Let me explain.

At every step of generating a response, a language model is doing one thing: picking the next word (strictly, the next token, which may be a whole word or a fragment of one). More precisely, it is computing a probability distribution over every word it knows, and sampling from that distribution. The word it picks influences the next distribution, and so on, until the response is complete. The process is inherently stochastic. There is a parameter called temperature that controls how concentrated or spread out those distributions are, but even at zero temperature, the output is still the product of a probabilistic system. It just becomes a more predictable one.
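To make the role of temperature concrete, here is a minimal sketch of one generation step. The four-word vocabulary and the scores are made up for illustration, and treating temperature zero as "always pick the top score" is an assumption about how decoders commonly handle that case, not a description of any particular model.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    if temperature <= 0:
        # Assumption: temperature zero means greedy decoding,
        # i.e. all probability mass goes to the highest score.
        top = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == top else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next_word(vocab, logits, temperature, rng):
    """Draw one word according to the temperature-adjusted probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(vocab, weights=probs, k=1)[0]

# Hypothetical vocabulary and scores, for illustration only.
vocab = ["yield", "rent", "covenant", "banana"]
logits = [2.0, 1.5, 1.0, -1.0]

rng = random.Random(0)
for t in (1.5, 0.7, 0.0):
    probs = softmax_with_temperature(logits, t)
    picks = [sample_next_word(vocab, logits, t, rng) for _ in range(5)]
    print(t, [round(p, 2) for p in probs], picks)
```

Running it shows the probabilities concentrating on the top word as temperature falls, until at zero the same word is picked every time.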

This is not a flaw in the design. It is what makes these models useful. A fully deterministic system that always produced the same output given the same input would be a very expensive lookup table. The probability-based architecture is what allows LLMs to generalise, to be creative, to handle questions they have never seen before. It is also what makes them confusing to work with if you come in expecting a calculator.

The hallucination problem, honestly stated

One of the most persistent beliefs I encounter is that setting temperature to zero eliminates hallucinations. It does not. Understanding why tells you something important about how these systems work.

Hallucinations happen when the model does not have access to the information it needs. Perhaps the fact was absent from training data, or is not in the context you provided. In that situation, the model still has to produce a distribution over possible next words, and none of the likely options in that distribution corresponds to the correct answer. The distribution is often close to flat: many options look roughly equally plausible. Setting temperature to zero does not put the correct answer into that distribution. It just tells the model to always pick the most probable option rather than sampling from the alternatives. If the correct answer is not among the likely options, the most probable option is still wrong. You will now get the same wrong answer every time, consistently, which is not the same thing as getting the right one.
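A toy illustration of that last point, under loud assumptions: the candidate answers and their probabilities below are invented, and the "correct" answer is deliberately left out of the distribution to mimic a fact the model never saw.

```python
import random

# Hypothetical next-word distribution for a question the model cannot answer.
# The true answer ("2031") is not among the candidates at all.
candidates = {"2027": 0.26, "2029": 0.25, "2030": 0.25, "2028": 0.24}
correct_answer = "2031"

def greedy(dist):
    """Temperature-zero behaviour: always return the most probable option."""
    return max(dist, key=dist.get)

def sample(dist, rng):
    """Ordinary sampling: draw an option in proportion to its probability."""
    options = list(dist)
    return rng.choices(options, weights=[dist[o] for o in options], k=1)[0]

rng = random.Random(42)
sampled_runs = [sample(candidates, rng) for _ in range(10)]
greedy_runs = [greedy(candidates) for _ in range(10)]

print("sampled:", sampled_runs)
print("greedy: ", greedy_runs)
print("greedy is consistent:", len(set(greedy_runs)) == 1)
print("greedy is correct:   ", greedy_runs[0] == correct_answer)
```

Sampling gives varied wrong answers; greedy selection gives one wrong answer repeated ten times. Neither can produce a value that is not in the distribution.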

The practical implication is that repeatability is not a proxy for accuracy. If you are running the same prompt twice in the same chat to check whether the model is reliable, you are measuring the wrong thing. Consistency within one model, when it appears, tells you something about how concentrated the probability distribution is. It tells you nothing about whether the content is correct. Verification has to be designed differently: by giving the model access to the source material it needs, by asking it to cite what it used, by structuring outputs so that the reasoning is visible and checkable, or by cross-checking with a different model, whose distribution is unlikely to be the same.
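One way to keep the two measurements apart in a workflow is sketched below. The function names and the example answers are hypothetical, and exact string comparison is a deliberately crude stand-in for whatever answer matching a real pipeline would use. Note that agreement between two models is still only a signal, not proof of correctness.

```python
from collections import Counter

def agreement_rate(answers):
    """Share of runs that gave the most common answer. Measures consistency only."""
    if not answers:
        return 0.0
    normalised = [a.strip().lower() for a in answers]
    return Counter(normalised).most_common(1)[0][1] / len(normalised)

def needs_review(answer_a, answer_b):
    """Flag an item for human checking when two different models disagree."""
    return answer_a.strip().lower() != answer_b.strip().lower()

# Hypothetical outputs: three runs of one model, one run of a second model.
runs_model_a = ["12 March 2027", "12 March 2027", "12 march 2027"]
answer_model_b = "30 June 2029"

print(agreement_rate(runs_model_a))                   # 1.0
print(needs_review(runs_model_a[0], answer_model_b))  # True
```

Here the first model is perfectly consistent with itself, yet the disagreement with the second model is what tells you to go back to the source document.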

The context window problem

A related misconception runs in the opposite direction. If hallucinations come partly from missing information, the obvious fix seems to be providing more information. Bigger context windows, more documents, the full data room rather than a summary. This can make things worse.

When a language model processes a long context, it does not treat all of it equally. Research has shown a consistent pattern: models perform significantly better at using information at the beginning and end of a context than information buried in the middle. This is sometimes called the lost-in-the-middle problem, and it has been documented across a range of models and tasks. For real estate workflows, this is practically significant. Feeding an entire IM into a model and asking about a specific covenant may produce a less accurate answer than feeding in only the relevant sections. The model has the information. It simply underweights it because of where it sits.
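A minimal sketch of "feed in only the relevant sections" might look like the following. The section titles, their text, and the word-overlap scoring are all assumptions made for illustration; a real pipeline would more likely use embeddings or a proper search index, and the prompt wording is just one possible choice.

```python
import re

def words(text):
    """Lowercased set of alphabetic words in a piece of text."""
    return set(re.findall(r"[a-z]+", text.lower()))

def select_sections(sections, question, top_k=2):
    """Rank sections by word overlap with the question (a crude stand-in for retrieval)."""
    q = words(question)
    ranked = sorted(sections.items(), key=lambda kv: len(q & words(kv[1])), reverse=True)
    return [title for title, _ in ranked[:top_k]]

def build_prompt(sections, question, top_k=2):
    """Assemble a prompt from the selected sections only, and ask for a citation."""
    chosen = select_sections(sections, question, top_k)
    body = "\n\n".join(f"[{title}]\n{sections[title]}" for title in chosen)
    return (f"{body}\n\nQuestion: {question}\n"
            "Answer using only the sections above and name the section you used.")

# Hypothetical document sections, for illustration only.
sections = {
    "Location": "The asset is located in the city centre close to the main railway station.",
    "Tenancy": "The building is let to three tenants on leases expiring between 2027 and 2031.",
    "Financing": "The loan includes an interest cover covenant tested quarterly by the lender.",
    "Market": "Office vacancy in the submarket has fallen over the last two years.",
}
question = "What does the loan covenant require?"

print(select_sections(sections, question, top_k=1))
print(build_prompt(sections, question, top_k=1))
```

The design choice is to shrink the context before the model sees it, so the relevant passage cannot end up buried in the middle of material that has nothing to do with the question.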

The implication, again, is that retrieval and reasoning are different problems. Providing more raw material does not automatically improve the quality of what the model does with it. How you structure and route that material matters as much as whether it is present. This is the same principle behind the RAG failures I wrote about last month: confident-looking outputs can reflect retrieval gaps or structural choices as much as the actual content of your documents.

What this means in practice

None of this is an argument for using these tools less. It is an argument for using them with a clearer mental model of what they are actually doing.

The professionals getting consistent value from AI are not the ones who trust outputs blindly, nor the ones who distrust them so much they never rely on them. They are the ones who have learned where the risks concentrate: in questions that require niche or proprietary knowledge, in long documents with critical details in the middle sections, in tasks where the model has no way to signal that it is operating outside what it knows well. Those risks are manageable once you know to look for them.

An LLM is not a search engine with a chat interface. It is not a database. It is a system that generates the most plausible continuation of whatever you have given it, shaped by everything it has learned and everything in your context. That is a powerful thing. It also behaves differently from what most people assume, in ways that are entirely predictable once you understand the underlying logic.