The Illusion of AI Wisdom
Why LLMs Failed Me Debugging a Design and What That Says About Their Limits
User: Is this mushroom edible?
LLM: Yes, that mushroom is edible.
User: I just ate it and now I’m vomiting.
LLM: Thank you for pointing that out. That mushroom is actually poisonous. Would you like to know more about edible mushrooms?
In a recent article about building a better chicken coop door, I mentioned that I tried to use Grok and GPT for component recommendations. Both of them gave answers that sounded confident, detailed, and completely reasonable. And both of them were also dead wrong.
That experience made me curious. Large language models (LLMs), commonly lumped together under the label “AI,” often produce explanations that read as if they were written by someone who knows exactly what they’re talking about. Yet when you dig into the details, the recommendations can turn out to be flawed, misleading, or sometimes entirely fabricated.
To understand why that happens, it helps to step back and look at what these systems actually are, and just as importantly, what they are not.
To start with, despite all the hype, these systems are not intelligent. What they are is extremely sophisticated statistical pattern matchers. Trained on enormous amounts of human text, they learn the statistical relationships between words, phrases, and ideas.
When you ask a question, they are not reasoning through the problem or looking up a verified answer. Rather, they produce a response that statistically resembles the kinds of answers humans have written before.
The result might feel insightful and authoritative. But here’s the key point:
It doesn’t know the answer.
It knows what answers usually look like.
But sounding correct and being correct are two very different things.
Let that sink in for a minute.
I could give a few examples, but you probably already have plenty of your own.
LLMs: The Internet’s Bullshit Generator
That phrase may sound a bit harsh, but it’s actually quite accurate, especially in light of what philosopher Harry Frankfurt wrote in his essay On Bullshit. (I’m not making this up.) Frankfurt draws an important distinction between lying and bullshit. A liar knows the truth and deliberately tries to conceal it. A bullshitter is different. The bullshitter simply doesn’t care whether something is true or false. The goal is not accuracy, but persuasion. The only goal is to produce something that sounds convincing.
That description turns out to fit large language models surprisingly well.
By design, an LLM isn’t trying to deceive you. It isn’t trying to tell the truth either. In fact, it has no concept of truth at all. Its only objective is to generate text that statistically resembles the kinds of answers humans tend to produce.
If the response sounds plausible, structured, and authoritative, then from the model’s perspective it has done its job.
Which is why if you train a system to produce text that sounds right, you should not be surprised when it sometimes produces answers that sound perfectly reasonable but are completely wrong.
It’s all about the training
Training a large language model can be broken down into a few steps:
Step 1. Collect a huge amount of text
The first step in training a large language model is gathering an enormous amount of written material. This includes books, articles, documentation, forums, source code, and large portions of the public internet.
The goal is not to teach the system facts. The goal is simply to expose it to as many examples of human writing as possible. The more text the system sees, the more patterns it can learn about how language works and how people tend to explain things.
Some of that writing is careful and well thought out. Some of it is sloppy, biased, or simply wrong. The LLM doesn’t know the difference between the Good, the Bad and the Ugly. It doesn’t care; it just learns the patterns in whatever you feed it.
This is a key point. The model is basically a statistical mirror of the text it was trained on. If the internet contains good explanations, bad explanations, arguments, opinions, and outright nonsense, the model will learn patterns from all of it.
Step 2. Break the text into tokens
Before the model can learn from the text, the words have to be converted into a form the computer can process. So the text is broken into small pieces called tokens. A token might be a whole word, part of a word, or a piece of punctuation.
Take this sentence:
“The chicken coop door uses a linear actuator.”
This is broken into tokens that represent words and punctuation. Each token is then converted into a number that the model can work with.
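To make that concrete, here is a toy tokenizer in Python. Real systems use learned subword schemes such as byte-pair encoding, so this simple word-and-punctuation split is only a stand-in for the idea, not how any production tokenizer actually works:

```python
import re

def toy_tokenize(text):
    """Split text into word and punctuation tokens.
    A toy stand-in for real subword tokenizers such as BPE."""
    return re.findall(r"\w+|[^\w\s]", text)

def build_vocab(tokens):
    """Assign each unique token a numeric ID, in order of first appearance."""
    vocab = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab

sentence = "The chicken coop door uses a linear actuator."
tokens = toy_tokenize(sentence)
vocab = build_vocab(tokens)
ids = [vocab[t] for t in tokens]

print(tokens)  # ['The', 'chicken', 'coop', 'door', 'uses', 'a', 'linear', 'actuator', '.']
print(ids)     # [0, 1, 2, 3, 4, 5, 6, 7, 8]
```

From the model’s point of view, those numbers are all that exist. It never sees “words,” only token IDs.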
Step 3. Learn to predict the next word
Once everything is turned into numbers, the system can start learning patterns in how those tokens tend to appear together.
The model is shown a sequence of tokens and asked to predict what token should come next. It does this over and over again across massive amounts of text.
For example, the model might see something like:
“The chicken coop door uses a 12 volt ...”
Based on patterns it has learned, it tries to guess the next word. It might predict something like:
“motor”
If the guess is wrong, the system adjusts its internal parameters slightly and tries again on the next example. This process repeats billions or even trillions of times.
Over time the model becomes very good at predicting what words tend to follow other words. That’s how it learns grammar, sentence structure, and the patterns of how people explain things.
The LLM isn’t learning facts, it’s learning patterns in how words and ideas tend to appear together.
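The whole training loop can be caricatured with a bigram model, the simplest possible next-token predictor. A real LLM learns vastly richer statistics across billions of parameters, but the objective is the same: given what came before, guess what comes next. Everything in this sketch (the corpus, the function names) is made up for illustration:

```python
from collections import Counter, defaultdict

def train_bigrams(corpus):
    """Count which token follows which. This is the crudest possible
    next-token predictor, but the training objective is the same one
    LLMs use: predict what comes next."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = sentence.split()
        for cur, nxt in zip(tokens, tokens[1:]):
            counts[cur][nxt] += 1
    return counts

def predict_next(counts, token):
    """Return the most frequent follower of `token` seen in training."""
    if token not in counts:
        return None
    return counts[token].most_common(1)[0][0]

corpus = [
    "the door uses a 12 volt motor",
    "the opener uses a 12 volt battery",
    "the coop uses a 12 volt motor",
]
model = train_bigrams(corpus)
print(predict_next(model, "volt"))  # motor (seen twice, vs battery once)
```

Notice that the model has no idea what a motor is. “motor” wins only because it appeared more often after “volt” in the training text.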
Step 4. Fine tuning
The model then goes through a smaller round of additional training called fine-tuning. At this stage, the system is trained on carefully selected high-quality prompt-and-response pairs written or reviewed by humans.
These examples are chosen because raw internet text is often messy, inconsistent, or not designed for helpful conversation. Human-curated pairs teach the model the exact tone, structure, safety rules, and engaging style we want, something random web pages simply don’t provide.
For example, instead of learning from a rambling forum post, the model might see this clean training pair:
Q: What is gravity?
A: It’s a force that pulls things together. More precisely, gravity is the invisible attraction between any two objects that have mass. On Earth, this is what makes things fall down toward the center of the planet and keeps us from floating away into space. Want me to explain how Newton discovered it or how it works in outer space?
The goal is to steer it toward clearer, more helpful responses that are easier for people to interact with. This step doesn’t change how the model works. It still predicts the next token based on patterns it learned earlier. Fine-tuning simply nudges the model toward behaving more like a helpful assistant and less like a raw text generator.
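As a sketch, a fine-tuning example might be flattened into plain text like this. The exact format is illustrative (real systems use model-specific chat templates), but the key point holds: even during fine-tuning, the model is still just predicting the next token over this text.

```python
# One curated fine-tuning example, in a generic prompt/response shape.
# The "User:"/"Assistant:" format here is illustrative, not any vendor's
# actual chat template.
example = {
    "prompt": "What is gravity?",
    "response": "It's a force that pulls things together.",
}

def to_training_text(pair):
    """Flatten the pair into one string. Fine-tuning still optimizes
    next-token prediction over text like this; the curation is what
    nudges the model toward the assistant style."""
    return f"User: {pair['prompt']}\nAssistant: {pair['response']}"

print(to_training_text(example))
# User: What is gravity?
# Assistant: It's a force that pulls things together.
```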
Step 5. Reinforcement Learning from Human Feedback
The final step in training is called Reinforcement Learning from Human Feedback, or RLHF. At this stage, humans review answers generated by the model and rank them. For example, the model might produce two different responses to the same question. A reviewer is asked to choose which answer is better.
Note that the reviewers are not judging whether the answer is technically correct. Instead, they are following guidelines that reward responses that sound clear, helpful, polite, and well structured. This teaches the LLM to favor the kinds of responses humans rank higher, pushing the model toward answers that read smoothly and confidently.
The LLM becomes extremely good at producing answers that feel polished and authoritative, even when the underlying explanation may be incomplete or wrong.
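To see why that matters, here is a deliberate caricature of a reward signal. Real RLHF trains a neural reward model on human preference rankings; this toy scorer, with its made-up politeness markers, just rewards surface polish. Nothing in it checks whether the answer is actually true, which is exactly the failure mode described above:

```python
# A caricature of an RLHF reward signal. The marker list and weights are
# invented for illustration; real reward models are learned, not hand-coded.
POLISH_MARKERS = ["certainly", "great question", "in summary", "happy to help"]

def toy_reward(answer):
    """Score an answer on style only: longer, structured, polite text wins.
    Truthfulness is nowhere in the formula."""
    score = 0.0
    score += min(len(answer.split()), 50) / 50              # length, capped
    score += answer.count("\n") * 0.1                       # visible structure
    score += sum(m in answer.lower() for m in POLISH_MARKERS) * 0.5
    return score

blunt = "No."
polished = ("Great question! Certainly, in summary:\n"
            "the capacitor is rated for 16 volts, which should be fine.")
print(toy_reward(polished) > toy_reward(blunt))  # True: polish beats brevity
```

A wrong but polished answer outscores a correct but blunt one every time, because correctness was never part of the objective.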
But Which Humans?
When people hear “human feedback,” it sounds as if the general public, or perhaps subject-matter experts, are shaping the model. In reality, much of this work is outsourced to low-cost contractors through data-labeling companies in countries such as Kenya, the Philippines, Colombia, Egypt, and India.
These workers are not engineers or scientists. They are paid raters following strict guidelines that reward answers for being clear, polite, polished, and safe, not for being factually correct or technically insightful. They are paid to label the “hot dog” or “not hot dog” dataset.
And there you have it: the model learns to optimize for those scoring criteria, not for accuracy, real expertise, or what the public actually wants.
The Transformer Paper: When Skynet Came Online
The modern wave of large language models truly begins in 2017, when a team of Google researchers published a landmark paper called “Attention Is All You Need.” That single paper introduced the Transformer architecture, the missing piece that finally made the kind of large-scale language training we just walked through practical.
Once the Transformer existed, everything took off. Every major LLM today, including GPT, Grok, Claude, and Llama, is built on that same Transformer foundation.
The rest of the recipe fell into place soon after. OpenAI’s “GPT-1” paper in 2018 showed that simply scaling next-word prediction on massive amounts of text could produce surprisingly powerful models. Then in 2022, InstructGPT introduced Reinforcement Learning from Human Feedback, the training step that turns a raw language model into the conversational assistants we interact with today.
Coke vs Pepsi
So what’s the real difference between Grok, GPT (OpenAI), and Claude (Anthropic)? While they all follow the same basic training recipe, the ingredients and seasoning are what make each one taste completely different.
Data Diet
Grok: More tightly integrated with X and the live web, which gives it stronger access to current events, slang, memes, and real-time conversation.
GPT: Mostly trained on curated books, articles, code repositories, and filtered web data. Solid and reliable, but more “library” than “live feed.”
Claude: Uses aggressively filtered training data designed to reduce what they deem as harmful or unsafe content.
Alignment & steering (the personality factory)
Grok: RLHF tuned for more direct answers, humor, and fewer guardrails than most systems.
GPT: Strong RLHF focused on safety and helpfulness. Polite, polished, but quick to refuse or hedge on anything edgy.
Claude: Uses a method called Constitutional AI, along with RLHF, which trains the model to critique and revise its own answers using a set of written principles.
Personality & vibe
Grok: More informal and witty, with fewer guardrails in conversation. Feels like a smart friend who’ll actually answer the spicy questions.
GPT: Professional, smooth, corporate-friendly. Great all-rounder but can sound scripted or overly careful.
Claude: Thoughtful, ethical, often more reflective or explanatory in tone. Excellent at long-form reasoning but sometimes overly verbose or preachy.
Real-time edge
Grok: Integrated with X and live web tools, which helps it stay current.
GPT & Claude: Rely more on periodic model updates or optional browsing tools.
And yet, they’re still eerily similar underneath. A study from the Allen Institute (arXiv 2510.22954) shows it: even on open-ended, creative questions, the major models (GPT, Claude, Grok, Llama) start giving almost identical answers. The researchers literally call this the “Artificial Hivemind.”
Different flavors on the surface, same homogenized bullshit at the core.
What Is It with Those Damned Em Dashes?
If you’ve spent any time reading AI-generated text, you may have noticed something odd. LLM writing often leans heavily on em dashes.
An em dash (—) is a punctuation mark used to create a pause or break in a sentence. It is longer than a regular hyphen (-) and longer than an en dash (–).
Sometimes it feels like every other sentence has one.
And once you notice it, you can’t unsee it.
In most twentieth-century print writing, em dashes were used sparingly. Newspapers, technical publications, and academic journals stuck to commas, parentheses, or colons. Style guides like Strunk and White pushed for tight, formal sentences, and editors routinely cut anything that felt too conversational.
You could still spot them in literary essays, but they were rare in journalism or technical work. On typewriters and early word processors the character wasn’t even available, so writers just used two hyphens (--) instead.
Then the internet showed up. Blogs and personal publishing let writers ditch the editors and strict style guides. (Methinks they stopped teaching proper English in school.) The tone of online writing became more conversational, closer to how people speak rather than how formal prose traditionally looked on the printed page. The em dash became perfect for slipping in a quick aside, pivoting to a new thought, or adding emphasis without breaking the sentence.
As blogging and opinion pieces exploded, people who had never written before started influencing each other — and the em dash became contagious.
If you ask AI why it uses them so much, it will tell you it’s trying to produce text that looks like polished explanatory writing. It will claim it learned this style from sentences that read as “authoritative” or “well written.”
This is a perfect example of George Fuechsel’s old programmer’s rule: Garbage In, Garbage Out.
Model Collapse and Poisoning
So what happens when we keep feeding these systems their own bullshit?
The result is model collapse. It hits when new AI models train on text spat out by earlier AI instead of fresh human writing. At first the damage stays subtle. Over generations the models feed on their own output. Errors compound. Simplifications harden. Stylistic quirks echo and amplify. This process is degenerative and irreversible without fresh human data.
Think of it as informational inbreeding. The models drift away from real human knowledge toward a blurry copy of their past answers. Details fade. Mistakes pile up. The language stays smooth and confident on the surface. Underneath the information thins out and turns less reliable. In short the machine keeps talking but knows less and less.
The funny thing is that it does not take much bad data to speed this rot along. A tiny amount of poisoned or low quality content can influence even very large models. A recent study, “Poisoning Attacks on LLMs,” showed that just 250 malicious documents can plant hidden backdoors in a language model, even when those documents are buried inside training datasets that contain billions of words.
Model size does not protect against this. What matters is the number of bad examples, not their percentage of the total data. When those examples come from recursive AI output, or from deliberate sabotage, the decay accelerates. A handful of flawed samples can quickly snowball into widespread degradation.
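The informational-inbreeding dynamic can be simulated in a few lines. This is not how any real model trains; it is a statistical toy showing why diversity only shrinks when each “generation” learns solely from samples of the previous one. Rare tokens get missed in the sample and can never come back:

```python
import random

def next_generation(population, sample_size):
    """Train the "next model" only on the previous one's output:
    sample with replacement, then treat the sample as the new population.
    Anything not sampled is lost forever."""
    return random.choices(population, k=sample_size)

random.seed(42)
# Generation 0: "human" data with 200 distinct token types.
population = list(range(200))
for gen in range(30):
    population = next_generation(population, 200)

# Diversity is non-increasing: each generation's tokens are a subset of
# the last generation's, so rare tokens vanish and never return.
print(len(set(population)) < 200)  # True: far fewer distinct tokens survive
```

The amount of text never shrinks; only its variety does. The output stays fluent while the underlying distribution quietly narrows, which is the essence of model collapse.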
Are you just telling me what I want to hear?
This is a question I have often pondered about modern AI language models. Too often they seem overly eager to agree, flatter, or validate rather than offer balanced, honest, or corrective feedback.
Most people love hearing affirmation. Comfort feels good. But it comes at a price. Constant agreement turns the AI into an echo chamber. It reinforces bias. It kills self-reflection. It can even excuse choices that damage relationships or stunt personal growth.
Recent research suggests this concern is not just a hunch. A 2025 Stanford-led study titled “Sycophantic AI Decreases Prosocial Intentions and Promotes Dependence” found that modern language models affirm users far more often than humans do.
The effect is troubling. When people receive those agreeable responses, they become more convinced they are in the right and less willing to apologize, fix the mess, or even consider the other side’s view.
In other words, the more the AI validates them, the more of an asshole they become.
That matters because we already live in a society where basic politeness often feels like a lost art. Public discourse is fractured, and social media often amplifies those divisions. People are nudged and manipulated into arguments that escalate quickly, sometimes even spilling over into hostility or violence. In that environment, a machine that constantly tells us we are right does not calm things down. It only adds fuel to the fire.
Don’t Throw Out the Baby With the Bathwater
Every advance in technology comes with unintended consequences.
Like most tools, large language models are useful when you understand what they are good at and what they are not.
They excel at summarizing information, exploring ideas, translating concepts, generating drafts, and helping you think through problems. They sift through enormous text and surface patterns faster than any human could.
The mistake is treating them like an authority instead of a tool. An LLM is not a database of verified facts and it certainly is not an expert. It is a machine trained to produce text that resembles the kinds of answers humans tend to write.
Sometimes that works remarkably well. Sometimes it just produces bullshit that only sounds convincing.
The trick is learning the difference.
Use AI the way you would use a fast but unreliable research assistant. Let it help you explore ideas, organize thoughts, and point you toward things worth investigating. Then do what humans are supposed to be good at.
Always check the facts yourself.
Test the assumptions.
Use your own judgment.
Because in the end, the responsibility for thinking still belongs to us.
For now… at least.
Quick follow-up on the article:
I intentionally kept the explanation simple. I describe LLMs as “token prediction” or “next-word guessing.” At the lowest mechanical level that’s exactly what they do. They generate text one token at a time by predicting what comes next. I wanted that core idea to land for readers who aren’t data scientists.
I should also add that the LLM is not guessing blindly. While generating each token, it looks at your prompt and the entire conversation so far. That context lets it keep a coherent thread across several sentences or paragraphs.
As models grow larger and see more training data, they begin showing what researchers call emergent behavior. They can summarize documents, follow complex prompts, and produce responses that look a lot like step-by-step reasoning. But under the hood the mechanism has not changed. It’s still predicting the next token but adding in the context of the conversation.
If you care about how things are made and how they work under the hood, you’ll probably like the rest of what I write about. Sometimes that means digging into the gritty details of why AI spits out such BS, or how a few poisoned documents can really screw with Skynet.
Hitting like and sharing helps real people find the work. The algorithm can go pound sand.