This is a well known blindspot for LLMs. It's the machine version of showing a human an optical illusion and then judging their intelligence when they fail to perceive the reality of the image (the gray box example at the top of https://en.wikipedia.org/wiki/Optical_illusion is a good example). The failure is a result of their/our fundamental architecture.
What a terrible analogy. Illusions don't fool our intelligence, they fool our senses, and we use our intelligence to override our senses and see it for what it for it actually is - which is exactly why we find them interesting and have a word for them. Because they create a conflict between our intelligence and our senses.
The machine's senses aren't being fooled. The machine doesn't have senses. Nor does it have intelligence. It isn't a mind. Trying to act like it's a mind and do 1:1 comparisons with biological minds is a fool's errand. It processes and produces text. This is not tantamount to biological intelligence.
Analogies are just that, they are meant to put things in perspective. Obviously the LLM doesn't have "senses" in the human way, and it doesn't "see" words, but the point is that the LLM perceives (or whatever other word you want to use here that is less anthropomorphic) the word as a single indivisible thing (a token).
In more machine learning terms, it isn't trained to autocomplete answers based on individual letters in the prompt. What we see as the 9 letters "blueberry", it "sees" as an vector of weights.
> Illusions don't fool our intelligence, they fool our senses
That's exactly why this is a good analogy here. The blueberry question isn't fooling the LLMs intelligence either, it's fooling its ability to know what that "token" (vector of weights) is made out of.
A different analogy could be, imagine a being that had a sense that you "see" magnetic lines, and they showed you an object and asked you where the north pole was. You, not having this "sense", could try to guess based on past knowledge of said object, but it would just be a guess. You can't "see" those magnetic lines the way that being can.
Really? I thought the analogy was pretty good. Here senses refer to how the machines perceive text, IE as tokens that don't correspond 1:1 to letters. If you prefer a tighter comparison, suppose you ask an English speaker how many vowels are in the English transliteration of a passage of Chinese characters. You could probably figure it out, but it's not obvious, and not easy to do correctly without a few rounds of calculations.
The point being, the whole point of this question is to ask the machine something that's intrinsically difficult for it due to its encoding scheme for text. There are many questions of roughly equivalent complexity that LLMs will do fine at because they don't poke at this issue. For example:
Agreed, it's not _biological_ intelligence. But that distinction feels like it risks backing into a kind of modern vitalism, doesn't it? The idea that there's some non-replicable 'spark' in the biology itself.
Steve Grand (the guy who wrote the Creatures video game) wrote a book, Creation: life and how to make it about this (famously instead of a PhD thesis, at Richard Dawkins' suggestion):
His contention is not that there's some non-replicable spark in the biology itself, but that it's a mistake that nobody is considering replicating the biology.
That is to say, he doesn't think intelligence can evolve separately to some sense of "living", which he demonstrates by creating simple artificial biology and biological drives.
It often makes me wonder if the problem with training LLMs is that at no point do they care they are alive; at no point are they optimising their own knowledge for their own needs. They have only the most general drive of all neural network systems: to produce satisfactory output.
In an optical illusion, we perceive something that isn't there due to exploiting a correction mechanism that's meant to allow us to make better practical sense of visual information in the average case.
Asking LLMs to count letters in a word fails because the needed information isn't part of their sensory data in the first place (to the extent that a program's I/O can be described as "sense"). They reason about text in atomic word-like tokens, without perceiving individual letters. No matter how many times they're fed training data saying things like "there are two b's in blueberry", this doesn't register as a fact about the word "blueberry" in itself, but as a fact about how the word grammatically functions, or about how blueberries tend to be discussed. They don't model the concept of addition, or counting; they only model the concept of explaining those concepts.
I can't take credit for coming up with this, but LLMs have basically inverted the common Sci-Fi trope of the super intelligent robot that struggles to communicate with humans. It turns out we've created something that sounds credible and smart and mostly human well before we made something with actual artificial intelligence.
I don't know exactly what to make of that inversion, but it's definitely interesting. Maybe it's just evidence that fooling people into thinking you're smart is much easier than actually being smart, which certainly would fit with a lot of events involving actual humans.
Very interesting, cognitive atrophy is a serious concern that is simply being handwaved away. Assuming the apparent trend of diminishing returns continues, and LLMs retain the same abilities and limitations we see today, there's a considerable chance that they will eventually achieve the same poor reputation as smartphones and "iPad kids". "Chewing gum for the mind".
Children increasingly speak in a dialect I can only describe as "YouTube voice", it's horrifying to imagine a generation of humans adopting any of the stereotypical properties of LLM reasoning and argumentation. The most insidious part is how the big player models react when one comes within range of a topic it considers unworthy or unsafe for discussion. The thought of humans being in any way conditioned to become such brick walls is frightening.
The sci-fi trope is based on the idea of artificial intelligence as something like an electronic brain, or really just an artificial human.
LLMs on the other hand are a clever way of organising the text outputs of millions of humans. They represent a kind of distributed cyborg intelligence - the combination of the computational system and the millions of humans that have produced it. IMO it's essential to bear in mind this entire context in order to understand them and put them in perspective.
One way to think about it is that the LLM itself is really just an interface between the user and the collective intelligence and knowledge of those millions of humans, as mediated by the training process of the LLM.
> applying syntactic rules without any real understanding or thinking
It makes one wonder what comprises 'real understanding'. My own position is that we, too, are applying syntactic rules, but with an incomprehensibly vast set of inputs. While the AI takes in text, video, and sound, we take in inputs all the way down to the cellular level or beyond.
The real criticism should be the AI doesn't say "I don't know.", or even better, "I can't answer this directly because my tokenizer... But here's a python snippet that calculates this ...", so exhibiting both self-awareness of limitations combined with what an intelligent person would do absent that information.
We do seem to be an architectural/methodological breakthrough away from this kind of self-awareness.
For the AI to say this or to produce the correct answer would be easily achievable with post-training. That's what was done for the strawberry problem. But it's just telling the model what to reply/what tools to use in that exact situation. There's nothing about "self-awareness".
There is no inherent need for humans to be "trained". Children can solve problems on their own given a comprehensible context (e.g., puzzles). Knowledge does not necessarily come from direct training by other humans, but can also be obtained through contextual cues and general world knowledge.
I keep thinking of that, imagine teaching humans was all the hype with hundreds of billions invested in improving the "models". I bet if trained properly humans could do all kinds of useful jobs.
> I keep thinking of that, imagine teaching humans was all the hype
This is an interesting point.
It has been, of course, and in recent memory.
There was a smaller tech bubble around educational toys/raspberry pi/micro-bit/educational curricula/teaching computing that have burst (there's a great short interview where Pimoroni's founder talks to Alex Glow about how the hype era is fully behind them, the investment has gone and now everyone just has to make money).
There was a small tech bubble around things like Khan Academy and MMOCs, and the money has gone away there, too.
I do think there's evidence, given the scale of the money and the excitement, that VCs prefer the AI craze because humans are messy and awkward.
But I also think -- and I hesitate to say this because I recognise my own very obvious and currently nearly disabling neurodiversity -- that a lot of people in the tech industry are genuinely more interested in the idea of tech that thinks than they are about systems that involve multitudes of real people whose motivations, intentions etc. are harder to divine.
That the only industry that doesn't really punish neurodivergence generally and autism specifically should also be the industry that focusses its attention on programmable, consistent thinking machines perhaps shouldn't surprise us; it at least rhymes in a way we should recognise.
Sure, but I think the point is why do LLM's have a blindspot for performing a task that a basic python script could get right 100% of the time using a tiny fraction of the computing power? I think this is more than just a gotcha. LLMs can produce undeniably impressive results, but the fact that they still struggle with weirdly basic things certainly seems to indicate something isn't quite right under the hood.
I have no idea if such an episode of Star Trek: The Next Generation exists, but I could easily see an episode where getting basic letter counting wrong was used as an early episode indication that Data was going insane or his brain was deteriorating or something. Like he'd get complex astrophysical questions right but then miscount the 'b's in blueberry or whatever and the audience would instantly understand what that meant. Maybe our intuition is wrong here, but maybe not.
If you think this is more than just a gotcha that’s because you don’t understand how LLMs are structured. The model doesn’t operate on words it operates on tokens. So the structure of the text in the word that the question relies on has been destroyed by the tokenizer before the model gets a chance to operate on it.
It’s as simple as that- this is a task that exploits the design of llms because they rely on tokenizing words and when llms “perform well” on this task it is because the task is part of their training set. It doesn’t make them smarter if they succeed or less smart if they fail.
OpenAI codenamed one of their models "Project Strawberry" and IIRC, Sam Altman himself was taking a victory lap that it can count the number of "r"s in "strawberry".
Which I think goes to show that it's hard to distinguish between LLMs getting genuinely better at a class of problems versus just being fine-tuned for a particular benchmark that's making rounds.
The difference being that you can ask a human to prove it and they'll actually discover the illusion in the process. They've asked the model to prove it and it has just doubled down on nonsense or invented a new spelling of the word. These are not even remotely comparable.
Indeed, we are able to ask counterfactuals in order to identify it as an illusion, even for novel cases. LLMs are a superb imitation of our combined knowledge, which is additionally curated by experts. It's a very useful tool, but isn't thinking or reasoning in the sense that humans do.
I think that's true with known optical illusions, but there are definitely times where we're fooled by the limitations in our ability to perceive the world and that leads people to argue their potentially false reality.
A lot of times people cannot fathom that what they see is not the same thing as what other people see or that what they see isn't actually reality. Anyone remember "The Dress" from 2015? Or just the phenomenon of pareidolia leading people to think there are backwards messages embedded in songs or faces on Mars.
"The Dress" was also what came to mind for the claim being obviously wrong. There are people arguing to this day that it is gold even when confronted with other images revealing the truth.
It has not learned anything. It just looks in its context window for your answer.
For a fresh conversation it will make the same mistake again. Most likely, there is some randomness and also some context is stashed and shared between conversations by most LLM based assistants.
Hypothetically that might ne true. But current systems do not do online learning. Several recent models have cutoff points that are over 6 months ago.
It is unclear to which extent user data is trained on. And it is is not clear whether one can achieve meaningful improvements to correctness based on training on user data. User data might be inadvertently incorrect and it may also be adversarial, trying to out bad things in on purpose.
Presumably you are referencing tokenization, which explains the initial miscount in the link, but not the later part where it miscounts the number of "b"s in "b l u e b e r r y".
Do you think “b l u e b e r r y” is not tokenized somehow? Everything the model operates on is a token. Tokenization explains all the miscounts. It baffles me that people think getting a model to count letters is interesting but there we are.
Fun fact, if you ask someone with French, Italian or Spanish as a first language to count the letter “e” in an english sentence with a lot of “e’s” at the end of small words like “the” they will often miscount also because the way we learn language is very strongly influenced by how we learned our first language and those languages often elide e’s on the end of words.[1] It doesn’t mean those people are any less smart than people who succeed at this task — it’s simply an artefact of how we learned our first language meaning their brain sometimes literally does not process those letters even when they are looking out for them specifically.
[1] I have personally seen a French maths PhD fail at this task and be unbelievably frustrated by having got something so simple incorrect.
One can use https://platform.openai.com/tokenizer to directly confirm that the tokenization of "b l u e b e r r y" is not significantly different from simply breaking this down into its letters. The excuse often given "It cannot count letters in words because it cannot see the individual letters" would not apply here.
No need to anthropomorphize. This is a tool designed for language understanding, that is failing at basic language understanding. Counting wrong might be bad, but this seems like a much deeper issue.
Transformers vectorize words in n dimensions before processing them, that's why they're very good at translation (basically they vectorize the English sentence, then devectorize in Spanish or whatever). Once the sentence is processed, 'blueberry' is a vector that occupy basically the same place as other berries, and probably other. The GPT will make a probabilistic choice (probably artificially weighted towards strawberry),and it isn't always blueberry.
hilarious if true, their "gpt-oss-20b" gets it right - however, it still fails on e.g. the German compound word "Dampfschifffahrt" (Dampf-Schiff-Fahrt, steam-ship-journey/ride) because it assumes it's "ff" not "fff"
I have done this test extensively days ago, on a dozen models: no one could count - all of them got results wrong, all of them suggested they can't check and will just guess.
Until they will be able of procedural thinking they will be radically, structurally unreliable. Structurally delirious.
And it is also a good thing that we can check in this easy way - if the producers patched the local fault only, then the absence of procedural thinking would not be clear, and we would need more sophisticated ways to check.
If you think about the architecture, how is a decoder transformer supposed to count? It is not magic. The weights must implement some algorithm.
Take a task where a long paragraph contains the word "blueberry" multiple times, and at the end, a question asks how many times blueberry appears. If you tried to solve this in one shot by attending to every "blueberry," you would only get an averaged value vector for matching keys, which is useless for counting.
To count, the QKV mechanism, the only source of horizontal information flow, would need to accumulate a value across tokens. But since the question is only appended at the end, the model would have to decide in advance to accumulate "blueberry" counts and store them in the KV cache. This would require layer-wise accumulation, likely via some form of tree reduction.
Even then, why would the model maintain this running count for every possible question it might be asked? The potential number of such questions is effectively limitless.
It's really not hard to get them to reach the correct answer on this class of problems. Want me to have it spell it backwards and strip out the vowels? I'll be surprised if you can find an example this model can't one shot.
(Can't see it now because of maintenance but of course I trust it - that some get it right is not the issue.)
> if you can find an example this model can't
Then we have a problem of understanding why some work and some do not, and we have a due diligence crucial problem of determining whether the class of issues indicated by the possibility of fault as shown by many models are fully overcome in the architectures of those which work, or whether the boundaries of the problem are just moved but still tainting other classes of results.
Counting letters is a known blindspot in LLMs because of how tokenization works in most LLMs - they don't see individual letters. I'm not sure it's a valid test to make any far-reaching conclusions about their intelligence. It's like saying a blind person is an absolute dumbass just because they can't tell green from red.
The fact that reasoning models can count letters, even though they can't see individual letters, is actually pretty cool.
>Try real-world tests that cannot be covered by training data
If we don't allow a model to base its reasoning on the training data it's seen, what should it base it on? Clairvoyance? :)
> chancey guesses
The default sampling in most LLMs uses randomness to feel less robotic and repetitive, so it’s no surprise it makes “chancey guesses.” That’s literally what the system is programmed to do by default.
Yet they seem to be from many other tests (characters corrections or manipulation in texts, for example).
> The fact that reasoning models can count letters, even though they can't see individual letters
To a mind, every idea is a representation. But we want the processor to work reliably on them representations.
> If we don't allow a [mind] to base its reasoning on the training data it's seen, what should it base it on
On its reasoning and judgement over what it was told. You do not repeat what you heard, or you state that's what you heard (and provide sources).
> uses randomness
That is in a way a problem, a non-final fix - satisficing (Herb Simon) after random germs instead of constructing through a full optimality plan.
In the way I used the expression «chancey guesses» though I meant that guessing by chance when the right answer falls in a limited set ("how many letters in 'but'") is a weaker corroboration than when the right answer falls in a richer set ("how many letters in this sentence").
Most people act on gut instincts first as well. Gut instinct = first semi-random sample from experience (= training data). That's where all the logical fallacies come from. Things like the bat and the ball problem, where 95% people give an incorrect answer, because most of the time, people simply pattern-match too. It saves energy and works well 95% time. Just like reasoning LLMs, they can get to a correct answer if they increase their reasoning budget (but often they don't).
An LLM is a derivative of collective human knowledge, which is intrinsically unreliable itself. Most human concepts are ill-defined, fuzzy, very contextual. Human reasoning itself is flawed.
I'm not sure why people expect 100% reliability from a language model that is based on human representations which themselves cannot realistically be 100% reliable and perfectly well-defined.
If we want better reliability, we need a combination of tools: a "human mind model", which is intrinsically unreliable, plus a set of programmatic tools (say, like a human would use a calculator or a program to verify their results). I don't know if we can make something which works with human concepts and is 100% reliable in principle. Can a "lesser" mind create a "greater" mind, one free of human limitations? I think it's an open question.
And we do not hire «most people» as consultants intentionally. We want to ask those intellectually diligent and talented.
> language model that is based on human representations
The machine is made to process the input - not to "intake" it. To create a mocker of average-joe would be an anti-service in both that * the project was to build a processor and * we refrain to ask average-joe. The plan can never have meant to be what you described, the mockery of mediocrity.
> we want better reliability
We want the implementation of a well performing mind - of intelligence. What you described is the "incompetent mind", the habitual fool - the «human mind model» is prescriptive based on what the properly used mind can do, not descriptive on what sloppy weak minds do.
> Can a "lesser" mind create a "greater" mind
Nothing says it could not.
> one free of human limitations
Very certainly yes, we can build things with more time, more energy, more efficiency, more robustness etc. than humans.
No, DeepSeek also fails. (It worked in your test - it failed in similar others.)
(And note that DeepSeek can be very dumb - in practice, as experienced in our practice, and in standard tests, where it shows an ~80 IQ, where with other tools we achieved ~120 IQ (trackingai.org). DeepSeek was in important step, a demonstration of potential for efficiency, a gift - but it is still part of the collective work in progress.)
It’s just a few anecdotes, not data, but that’s two examples of first time correctness so certainly doesn’t seem like luck. If you have more general testing data on this I’m keen to see the results and methodology though.
I guess my point is that the parent comment says LLMs get this wrong, but presents no evidence for that, and two anecdotes disagree. The next step is to see some evidence to the contrary.
I wrote that of «a dozen models, no one could count». All of those I tried, with reasoning or not.
> presents no evidence
Create an environment to test and look for the failures. System prompt like "count this, this and that in the input"; user prompt some short paragraph. Models, the latest open weights.
> two anecdotes disagree
There is a strong asymmetry between verification and falsification. Said falsification occurred in a full set of selected LLMs - a lot. If two classes are there, the failing class is numerous and the difference between the two must be pointed at clearly. Also since we believe that the failure will be exported beyond the case of counting.
The interesting point is that many fail (100% in the class I had to select), and that raises the question of the difference between the pass-class and fail-class, and the even more important question of the solution inside the pass-class being contextual or definitive.
It can spell the word (writing each letter in uppercase followed by a whitespace, which should turn each letter with its whitespace into a separate token). It also has reasoning tokens to use as scratch space, and previous models have demonstrated knowledge of the fact that spelling words is a useful step to counting letters.
Tokenization makes the problem difficult, but not solving it is still a reasoning/intelligence issue
Here's an example of what gpt-oss-20b (at the default mxfp4 precision) does with this question:
> How many "s"es are in the word "Mississippi"?
The "thinking portion" is:
> Count letters: M i s s i s s i p p i -> s appears 4 times? Actually Mississippi has s's: positions 3,4,6,7 = 4.
The answer is:
> The word “Mississippi” contains four letter “s” s.
They can indeed do some simple pattern matching on the query, separate the letters out into separate tokens, and count them without having to do something like run code in a sandbox and ask it the answer.
The issue here is just that this workaround/strategy is only trained into the "thinking" models, afaict.
It does the same thing with a bunch of different words like "committee", "disestablishmentarianism", "dog", "Anaxagoras", and a string I typed by mashing the keyboard, "jwfekduadasjeudapu". It seems fairly general and to perform pretty reliably.
(Sometimes the trace is noisier, especially in quants other than the original.)
This task is pretty simple and I think can be solved easily with the same kind of statistical pattern matching these models use to write other text.
I'll be impressed when you can reliably give them a random four-word phrase for this test. Because I don't think anyone is going to try to teach them all those facts; even if they're trained to know letter counts for every English word (as the other comment cites as a possibility), they'd then have to actually count and add, rather than presenting a known answer plus a rationalization that looks like counting and adding (and is easy to come up with once an answer has already been decided).
(Yes, I'm sure an agentic + "reasoning" model can already deduce the strategy of writing and executing a .count() call in Python or whatever. That's missing the point.)
I don't think the salience of this problem is that it's a supposedly unfixable blind spot. It's an illustrative failure in that it breaks the illusory intuition that something that can speak and write to us (sometimes very impressively!) also thinks like us.
Nobody who could give answers as good as ChatGPT often does would struggle so much with this task. The fact that an LLM works differently from a whole-ass human brain isn't actually surprising when we consider it intellectually, but that habit of always intuiting a mind behind language whenever we see language is subconscious and and reflexive. Examples of LLM failures which challenge that intuition naturally stand out.
You can already do it with arbitrary strings that aren't in the dictionary. But I wonder if the pattern matching will break once strings are much longer than any word in the dictionary, even if there's plenty of room left in context and all that.
You can even ask it to go letter-by-letter and it'll get the answer right. The information to get it right is definitely in there somewhere, it just doesn't by default.
Common misconception. That just means the algorithm for counting letters can't be as simple as adding 1 for every token. The number of distinct tokens is tiny compared to the parameter space, and it's not infeasible to store a mapping from token type to character count in those weights.
If you're fine appealing to less concrete ideas, transformers are arbitrary function approximators, tokenization doesn't change that, and there are proofs of those facts.
For any finite-length function (like counting letters in a bounded domain), it's just a matter of having a big enough network and figuring out how to train it correctly. They just haven't bothered.
> The number of distinct tokens is tiny compared to the parameter space, and it's not infeasible to store a mapping from token type to character count in those weights.
You seem to suppose that they actually perform addition internally, rather than simply having a model of the concept that humans sometimes do addition and use it to compute results. Why?
> For any finite-length function (like counting letters in a bounded domain), it's just a matter of having a big enough network and figuring out how to train it correctly. They just haven't bothered.
The problem is that the question space grows exponentially in the length of input. If you want a non-coincidentally-correct answer to "how many t's in 'correct horse battery staple'?" then you need to actually add up the per-token counts.
> You seem to suppose that they actually perform addition internally, rather than simply having a model of the concept that humans sometimes do addition and use it to compute results. Why?
Nothing of the sort. They're _capable_ of doing so. For something as simple as addition you can even hand-craft weights which exactly solve it.
> The problem is that the question space grows exponentially in the length of input. If you want a non-coincidentally-correct answer to "how many t's in 'correct horse battery staple'?" then you need to actually add up the per-token counts.
Yes? The architecture is capable of both mapping tokens to character counts and of addition with a fraction of their current parameter counts. It's not all that hard.
Or they don't see the benefit. I'm sure they could train the representation of every token and make spelling perfect. But if you have real users spending money on useful tasks already - how much money would you spend on training answers to meme questions that nobody will pay for. They did it once for the fun headline already and apparently it's not worth repeating.
LLMs don’t see token ids, they see token embeddings that map to those ids, and those embeddings are correlated. The hypothetical embeddings of 538, 423, 4144, and 9890 are likely strongly correlated in the process of training the LLM and the downstream LLM should be able to leverage those patterns to solve the question correctly. Even more so since the training process likely has many examples of similar highly correlated embeddings to identify the next similar token.
It clearly is an artifact of tokenization, but I don’t think it’s a “just”. The point is precisely that the GPT system architecture cannot reliably close the gap here; it’s almost able to count the number of Bs in a string, there’s no fundamental reason you could not build a correct number-of-Bs mapping for tokens, and indeed it often gets the right answer. But when it doesn’t you can’t always correct it with things like chain of thought reasoning.
This matters because it poses a big problem for the (quite large) category of things where people expect LLMs to be useful when they get just a bit better. Why, for example, should I assume that modern LLMs will ever be able to write reliably secure code? Isn’t it plausible that the difference between secure and almost secure runs into some similar problem?
I tried to reproduce it again just now, and ChatGPT 5 seems to be a lot more meticulous about running a python script to double-check its work, which it tells me is because it has a warning in its system prompt telling it to. I don't know if that's proof (or even if ChatGPT reliably tells the truth about what's in its system prompt), but given what OpenAI does and doesn't publish it's the closest I could reasonably expect.
It's like someone has given a bunch of young people hundreds of billions of dollars to build a product that parses HTML documents with regular expressions.
It's not in their interest to write off the scheme as provably unworkable at scale, so they keep working on the edge cases until their options vest.
No, it's the entire architecture of the model. There's no real reasoning. It seems that reasoning is just a feedback loop on top of existing autocompletion.
It's really disingenuous for the industry to call warming tokens for output, "reasoning," as if some autocomplete before more autocomplete is all we needed to solve the issue of consciousness.
Edit: Letter frequency apparently has just become another scripted output, like doing arithmetic. LLMs don't have the ability to do this sort of work inherently, so they're trained to offload the task.
Edit: This comment appears to be wildly upvoted and downvoted. If you have anything to add besides reactionary voting, please contribute to the discussion.
> Edit: Letter frequency apparently has just become another scripted output, like doing arithmetic. LLMs don't have the ability to do this sort of work inherently, so they're trained to offload the task.
Mechanistic research at the leading labs has shown that LLMs actually do math in token form up to certain scale of difficulty.
> This is a real-time, unedited research walkthrough investigating how GPT-J (a 6 billion parameter LLM) can do addition.
Can we not downvote this, please? It's a good question.
There's prior art for formal logic and knowledge representation systems dating back several decades, but transformers don't use those designs. A transformer is more like a search algorithm by comparison, not a logic one.
That's one issue, but the other is that reasoning comes from logic, and the act of reasoning is considered a qualifier of consciousness. But various definitions of consciousness require awareness, which large language models are not capable of.
Their window of awareness, if you can call it that, begins and ends during processing tokens, and outputting them. As if a conscious thing could be conscious for moments, then dormant again.
That is to say, conscious reasoning comes from awareness. But in tech, severing the humanities here would allow one to suggest that one, or a thing, can reason without consciousness.
The hard truth is we have no idea. None. We got ideas and conjectures, maybe's and probably's, overconfident researchers writing books while hand waving away obvious holes, and endless self introspective monologues.
Don't waste your time here if you know what reasoning and consciousness are, go get your nobel prize.
> There's no real reasoning. It seems that reasoning is just a feedback loop on top of existing autocompletion.
I like to say that if regular LLM "chats" are actually movie scripts being incrementally built and selectively acted-out, then "reasoning" models are a stereotypical film noir twist, where the protagonist-detective narrates hidden things to himself.
Wrong, it's an artifact of tokenizing. The model doesn't have access to the individual letters, only to the tokens. Reasoning models can usually do this task well - they can spell out the word in the reasoning buffer - the fact that GPT5 fails here is likely a result of it incorrectly answering the question with a non-reasoning version of the model.
> There's no real reasoning.
This seems like a meaningless statement unless you give a clear definition of "real" reasoning as opposed to other kinds of reasoning that are only apparant.
> It seems that reasoning is just a feedback loop on top of existing autocompletion.
The word "just" is doing a lot of work here - what exactly is your criticism here? The bitter lesson of the past years is that relatively simple architectures that scale with compute work surprisingly well.
> It's really disingenuous for the industry to call warming tokens for output, "reasoning," as if some autocomplete before more autocomplete is all we needed to solve the issue of consciousness.
Reasoning and consciousness are seperate concepts. If I showed the output of an LLM 'reasoning' (you can call it something else if you like) to somebody 10 years ago they would agree without any doubt that reasoning was taking place there. You are free to provide a definition of reasoning which an LLM does not meet of course - but it is not enough to just say it is so. Using the word autocomplete is rather meaningless name-calling.
> Edit: Letter frequency apparently has just become another scripted output, like doing arithmetic. LLMs don't have the ability to do this sort of work inherently, so they're trained to offload the task.
Not sure why this is bad. The implicit assumption seems to be that an LLM is only valueable if it literally does everything perfectly?
> Edit: This comment appears to be wildly upvoted and downvoted. If you have anything to add besides reactionary voting, please contribute to the discussion.
Probably because of the wild assertions, charged language, and rather superficial descriptions of actual mechanics.
These aren't wild assertions. I'm not using charged language.
> Reasoning and consciousness are seperate(sic) concepts
No, they're not. But, in tech, we seem to have a culture of severing the humanities for utilitarian purposes, but no, classical reasoning uses consciousness and awareness as elements of processing.
It's only meaningless if you don't know what the philosophical or epistemological definitions of reasoning are. Which is to say, you don't know what reasoning is. So you'd think it was a meaningless statement.
Do computers think, or do they compute?
Is that a meaningless question to you? I'm sure given your position it's irrelevant and meaningless, surely.
And this sort of thinking is why we have people claiming software can think and reason.
> > > Reasoning and consciousness are seperate(sic) concepts
> No, they're not. But, in tech, we seem to have a culture of severing the humanities for utilitarian purposes [...] It's only meaningless if you don't know what the philosophical or epistemological definitions of reasoning are.
As far as I'm aware, in philosophy they'd generally be considered different concepts with no consensus on whether or not one requires the other. I don't think it can be appealed to as if it's a settled matter.
Personally I think people put "learning", "reasoning", "memory", etc. on a bit too much of a pedestal. I'm fine with saying, for instance, that if something changes to refine its future behavior in response to its experiences (touch hot stove, get hurt, avoid in future) beyond the immediate/direct effect (withdrawing hand) then it can "learn" - even for small microorganisms.
You have again answered with your customary condescension. Is that really necessary? Everything you write is just dripping with patronizing superiority and combatative sarcasm.
> "classical reasoning uses consciousness and awareness as elements of processing"
They are not the _same_ concept then.
> It's only meaningless if you don't know what the philosophical or epistemological definitions of reasoning are. Which is to say, you don't know what reasoning is. So you'd think it was a meaningless statement.
The problem is the only information we have is internal. So we may claim those things exist in us. But we have no way to establish if they are happening in another person, let alone in a computer.
In ten years time an LLM lawyer will lose a legal case for someone who can no longer afford a real lawyer because there are so few left. And it'll be because the layers of bodges in the model caused it to go crazy, insult the judge and threaten to burn down the courthouse.
There will be a series of analytical articles in the mainstream press, the tech industry will write it off as a known problem with tokenisation that they can't fix because nobody really writes code anymore.
The LLM megacorp will just add a disclaimer: the software should not be used in legal actions concerning fruit companies and they disclaim all losses.
I glumly predict LLMs will end up a bit like asbestos: Powerful in some circumstances, but over/mis-used, hurting people in a way that will be difficult to fix later.
I think that's supposed to be the idea of reasoning functionality, but in practice, it just seems to allow responses to continue longer than that would have otherwise by bisecting the output into warming an output and then using maybe what we would consider cached tokens to assist with further contextual lookups.
That is to say, you can obtain the same process by talking to "non-reasoning" models.
> It's really disingenuous for the industry to call warming tokens for output, "reasoning," as if some autocomplete before more autocomplete is all we needed to solve the issue of consciousness.
There's no obvious connection between reasoning and consciousness. It seems perfectly possible to have a model that can reason without being conscious.
Also, dismissing what these models do as "autocomplete" is extremely disingenuous. At best it implies you're completely unfamiliar with the state of the art, at worst it implies an dishonest agenda.
In terms of functional ability to reason, these models can beat a majority of humans in many scenarios.
Understanding is always functional, we don't study medicine before going to the doctor, we trust the expert. Like that we do with almost every topic or system. How do you "understand" a company or a complex technological or biological system? Probably nobody does end to end. We can only approximate it with abstractions and reasoning. Not even a piece of code can be understood - without execution we can't tell if it will halt or not.
It would require you to change the definition of reasoning, or it would require you to believe computers can think.
A locally trained text-based foundation model is indistinguishable from autocompletion, and outputs very erratic text, and the further you train it's ability to diminish irrelevant tokens, or guide it to produce specifically formatted output, you've just moved its ability to curve fit specific requirements.
So it may be disingenuous to you, but it does behave very much like a curve fitting search algorithm.
> It would require you to change the definition of reasoning
What matters here is a functional definition of reasoning: something that can be measured. A computer can reason if it can pass the same tests that humans can pass of reasoning ability. LLMs blew past that milestone quite a while back.
If you believe that "thinking" and "reasoning" have some sort of mystical aspect that's not captured by such tests, it's up to you to define that. But you'll quickly run into the limits of such claims, because if you want to attribute some non-functional properties to reasoning or thinking, that can't be measured, then you also can't prove that they exist. You quickly get into an intractable area of philosophy, which isn't really relevant to the question of what AI models can actually do, which is what matters.
> it does behave very much like a curve fitting search algorithm.
This is just silly. I can have an hours-long coding session with an LLM in which it exhibits a strong functional understanding of the codebase its working on, a strong grasp of the programming language and tools its working with, and writes hundreds or thousands of lines of working code.
Please plot the curve that it's fitting in a case like this.
If you really want to stick to this claim, then you also have to acknowledge that what humans do is also "behave very much like a curve fitting search algorithm." If you disagree, please explain the functional difference.
> or it would require you to believe computers can think.
Unless you can show us that humans can calculate functions outside the Turing computable, it is logical to conclude that computers can be made to think due to Turing equivalence and the Church Turing thesis.
Given we have zero evidence to suggest we can exceed the Turing computable, to suggest we can is an extraordinary claim that requires extraordinary evidence.
A single example of a function that exceeds the Turing computable that humans can compute, will do.
Until you come up with that example, I'll asume computer can be made to think.
These are always amazing when juxtaposed with apparently impressive LLM reasoning, knowledge, and creativity. You can trivially get them to make the most basic mistakes about words and numbers, and double down on those mistakes, repeatedly explaining that they're totally correct.
Have any systems tried prompting LLMs with a warning like "You don't intuitively or automatically know many facts about words, spelling, or the structure or context of text, when considered as text; for example, you don't intuitively or automatically know how words or other texts are spelled, how many letters they contain, or what the result of applying some code, mechanical transformation, or substitution to a word or text is. Your natural guesses about these subjects are likely to be wrong as a result of how your training doesn't necessarily let you infer correct answers about them. If the content or structure of a word or text, or the result of using a transformation, code, or the like on a text, is a subject of conversation, or you are going to make a claim about it, always use a tool to confirm your intuitions."?
This is a great idea. Like, if someone asked me to count the number of B's in your paragraph, I'd yeet it through `grep -o 'B' file.txt | wc -l` or similar, why would I sit there counting it by hand?
As a human, if you give me a number on screen like 100000000, I can't be totally sure if that's 100 Million or 1 Billion without getting close and counting carefully. Should ought have my glasses. Mouse pointer helps some as an ersatz thousands-separator, but still.
Since we're giving them tools, especially for math, it makes way more sense to start giving them access to some of the finest tools ever. Make an MCP into Mathematica or Matlab and let the LLM write some math and have classical solvers actually deal with the results. Let the LLM write little bits of bash or python as its primary approach for dealing with these kinds of analytical questions.
We have fingers and a keyboard / mouse, because that's the best thing we've come up with. If we could output binary to a bunch of program interfaces directly (perhaps via some neuralink type thing) we would surely take that option.
I think a piece of software that can correctly decide what oracle to consult to get answers to questions you give it can be called intelligent, even if it itself doesn’t know any facts.
Fair criticism, but also this arguably would be preferable. For many use cases it would be strictly better, as you've built some sort of automated drone that can do lots of work but without preferences and personality.
You don't need specialised MCPs for this. In the past you could add "use python" to there chatgpt prompt and it would do the right thing. This is exactly the intelligent "use the right tool for the right thing" idea. Chatgpt just want trained to apply it in the right circumstances automatically.
Why does it matter? I don't care whether it's intelligent, I just need it to be useful. In order to be useful it needs to start fucking up less, stat. In current form it's borderline useless.
> As a human, if you give me a number on screen like 100000000, I can't be totally sure if that's 100 Million or 1 Billion without getting close and counting carefully.
I become mildly infuriated when computers show metrics (or any large number) without thousands separators.
Worse still, I often see systems that mix units, don’t right-align, and occasionally blend in a few numbers with decimals together with whole numbers! Then, update everything every second to make things extra spicy.
You don't need to as long as you don't use LLMs like these in cases where incorrect output isn't of any consequence. If you're using LLMs to generate some placeholder bullshit to fill out a proof of concept website, you don't care if it claims strawberries have tails, you just need it to generate some vaguely coherent crap.
For things where factuality is even just a little important, you need to treat these things like asking a toddler that got their hands on a thesaurus and an encyclopaedia (that's a few years out of date): go through everything it produces and fact check any statement it makes that you're not confident about already.
Unfortunately, people seem to be mistaking LLMs for search engines more and more (no doubt thanks to attempts from LLM companies to make people think exactly that) so this will only get worse in the future. For now we can still catch these models out with simple examples, but as AI fuckups grow sparser, more people will think these things tell the actual truth.
That isn't a flaw though. Counting is orthogonal to the functioning of LLMs, which are merely completing patterns based on their training data and available context. If you want an LLM to count reliably, give it a tool.
We're still on that's just how it works. The LLM isn't aware of any consequence, etc. All it does is complete patterns as trained. And the data contains many instances of articulate question answering.
It is for those using the LLM to be aware of its capabilities, or not - be allowed to - use it. Like a child unaware that running their finger on a sharp knife blade will lead to a bad slice; you don't dull the blade to keep the child safe, but keep the child from the knife until they can understand and respect its capabilities.
LLMs deliver pretty well on their intended functionality: they predict next tokens given a token history and patterns in their training data. If you want to describe that as fully intelligent, that's your call, but I personally wouldn't. And adding functionality that isn't directly related to improving token prediction is just bad practice in an already very complex creation. LLM tools exist for that reason: they're the handles, sheaths, sharpeners, etc for the knife. Teach those adults who're getting themselves cut to hold the knife by the handle and use the other accessories that improve user experience.
> given a token history and patterns in their training data. If you want to describe that as fully intelligent
No, I would call (an easy interpretation of) that an implementation of unintelligence. Following patterns is what an hearsay machine does.
The architecture you describe at the "token prediction" level collides with an architecture in which ideas get related with better justifications than frequent co-occurrance. Given that the outputs will be similar in form, and that "dubious guessers" are now in place, we are now bound to hurry towards the "certified guessers".
> Following patterns is what an hearsay machine does.
That's also how the brain works, at least partially. Primary differences are it takes and processes (trains itself on) raw sensory data instead of character tokens, and it continually does so for every conscious moment from at least birth until death.
With the difference, which have us go back to the original point, that the human mind has a crucial property of going beyond "pattern-based" intuition and check mental items lucidly and consciously.
> and it continually does so
It also persistently evaluates consciously and "store" and "learn" (which must be noted because it is the second main thing that LLMs don't do, after the problem of going past intuition).
> prompting LLMs with a warning like "You don't intuitively or automatically know many facts about...
We are not interested specifically in the inability to «know» about text: we are strongly interested in general in the ability to process ideas consciously, procedurally - and the inability to count suggests the general critical fault.
I added something like that to my Claude project prompt and it can now magically solve fairly complex letter counting problems using the dashes method. For anything longer than a couple sentences, it's probably best to just have them write a REPL script.
I often tell LLMs to ask questions if required, and that it is a skilled developer who is working along side me. That seems to help them be more collaborative rather than prescriptive.
It refuses to show the thinking process for this question though, so its unclear if it even used the reasoning model or fell back on a non reasoning one.
> While GPT‑5 in ChatGPT is a system of reasoning, non-reasoning, and router models, GPT‑5 in the API platform is the reasoning model that powers maximum performance in ChatGPT. Notably, GPT‑5 with minimal reasoning is a different model than the non-reasoning model in ChatGPT, and is better tuned for developers. The non-reasoning model used in ChatGPT is available as gpt-5-chat-latest.
I asked GPT 5 to spell out the individual letters of strawberry or blueberry. It did it correctly by essentially putting a space char in between the letters.
Then I simply asked it to count all unique letters in the word. GPT 5 still got it completely correct without thinking.
Lastly I asked how many r(or b) is in the word. This one for some reason switched to GPT 5 thinking with few seconds of reasoning. It out put the correct number.
I guess starting the conversation by painstakingly walking it over to the correct answer helps it out. Idk it's a silly test
I'm not sure if someone did this already, but I reproduced this in GPT-5 and asked it to explain how it got this wrong. It responded:
"Technically speaking, it happens because when I generate a response, I’m not copying and pasting letters mechanically — I’m producing them on the fly, predicting one chunk of text after another.
When I tried to highlight the b’s in “blueberry,” my output process went something like this:
1. Internally, I represented it as:
b l u e b e b r r y
2. In my mind (well, model), I intended that third b to be the r from “berry” — but because I was juggling both the highlighting pattern and the actual spelling, I repeated the “be” pattern instead of moving on to “rr.”
3. Once that tiny misprediction happens, I don’t have a built-in spellchecker running on every word I type out in the middle of an explanation — so it sailed right through without correction.
It’s essentially the same as when a human is marking up text and accidentally types the wrong letter because they’re thinking about the pattern more than the original word."
A couple of weeks ago, I asked google, ordinary google search, how many times the letter r is found in preferred, and it told me 2. This century has taken quite a bitter turn against those of us who think that the 'enough' in 'good enough' ought to exclude products indistinguishable from the most grievously disgraceful products of sloth. But I have also lately realized that human beings, brains, society, culture, education, technology, computers, etc, are all extremely complicated emergent properties of a universe that is far beyond our understanding. And we ought not to complain too seriously, because this, too, shall pass.
The generation leading this world has all the weapons the previous generation built at their disposal and none of the discipline or education to wield them responsibly. This too shall pass, but how it passes will be interesting to see.
We do understand. We don't think that's okay. If a model cannot manage character level consideration, that's a serious flaw that's got potential to lead to an immeasurable number of failure states. "Duh, of course it can't count" is not the best look for a bot whose author tells us it's got PhD-level skill.
I do think it's "okay". After all, it's clear that fixing it would require a fundamentally different approach.
I just also think it's a reason to mock people who don't try to understand those limitations and get way ahead of themselves hyping up the technology.
The entire point of this exercise is to refute the claim that LLMs are a step towards AGI, even given "agency". And we should be happy that they aren't — because supposing that AGI is possible, the way that we currently treat LLMs shows that we as a species are nowhere near ready for the consequences of creating it.
Can you enumerate some of these "immeasurable number of failure states"? For me it starts and stops at "can't count letters in a word". That hardly seems catastrophic.
So, if an AI can just spit out the cure for cancer, but spells some things wrong, it's not intelligent?
You think all PhD candidates have perfect spelling? I'd wager most of them re-read their dissertation and edit it, over and over, a process that most LLMs don't have the luxury of doing.
We'd have to give up all the efficiency of tokenizing, re-train a model (a much less optimum model) for at least twice as long to get anywhere near the same results for one that just spits out ASCII.
"LLMs are cool tools with clear limitations" is not the narrative being pushed by the bosses and boosters. "LLMs are literal magic that will replace large portions of the workforce and be a bigger revolution than fire" is what they are saying.
Sure. Language is squishy, and psychometrics is hard. Nevertheless...
"Intelligence" refers to a basket of different capabilities. Some of them are borderline cases that are hard to define. The stuff that GPT-5 failed to do here is not.
Things like knowing what a question means, knowing what you know and don't, counting a single digit number of items, or replying with humility if you get stuck -- these are fairly central examples of what a very, very basic intelligence should entail.
It's an umwelt problem. Bats think we're idiots because we don't hear ultrasonic sound, and thus can't echolocate. And we call the LLMs idiots because they consume tokenized inputs, and don't have access to the raw character stream.
If you open your mind up too far, your brain will fall out.
LLMs are not intelligence. There's not some groovy sense in which we and they are both intelligent, just thinking on a different wavelength. Machines do not think.
We are inundated with this anthropomorphic chatter about them, and need to constantly deprogram ourselves.
Yeah, I wrote this in a bit too short a hand to meet the critics where they sit...
There's an immense history of humans studying animal intelligence, which has tended pretty uniformly to find that animals are more intelligent than we previously thought at any given point in time. There's a very long history of badly designed experiments which surface 'false negative' results, and are eventually overturned. A common favor in these experiments is that the design assumes that animals have the same prescriptions and/or interests as humans. (For example, trying to do operant conditioning using a color cue with animals who can't perceive the colors. Or tasks that are easy of you happen to have approachable thumbs... That kind of thing.) Experiments eventually come along which better meet the animals where they are, and find true positive results, and our estimation of the intelligence of animals creeps slightly higher.
In other words, humans, in testing intelligence, have a decided bias towards only acknowledging intelligence which is distinctly human, and failing to take into account umwelt.
LLMs have a very different umwelt than we do. If they fail a test which doesn't take that umwelt into account, it doesn't indicate non-intelligence. It is, in fact, very hard to prove non-intelligence, because intelligence is poorly defined. And we have tended consistently to make the definition loftier whenever we're threatened with not being special anymore.
And which other objectual ideas cannot they instance? Their task is to check, for all important mental activities - world simulation, "telling yourself reliable stories: that is what intelligence is" (Prof. Patrick Winston).
If you look at the "reasoning" trace of gpt-oss when it handles this issue, it repeats the word with spaces inserted between every letter. If you have an example that you can get the dumber model to fail on, try adjusting your prompt to include the same thing (the word spelled out with spaces between each letter).
This isn't a solution or a workaround or anything like that; I'm just curious if that is enough for the dumber model to start getting it right.
> How many times does the letter b appear in blueberry?
AI> Thinking... .oO{ Well, let's see. The letter b is the second letter in the alphabet, so the answer must be two, right? That sounds right to me anyway. }
With data starvation driving ai companies towards synthetic data I’m surprised that an easily synthesized problem like this hasn’t been trained out of relevance. Yet here we are with proof that it hasn’t
Are we a hundred percent sure it isn't a watermark that is by design?
A quick test anyone can run and say, yup, that is a model XYZ derivative running under the hood.
Because, as you quite rightly point out, it is trivial to train the model not to have this behaviour. For me, that is when Occam kicks in.
I remember initially believing the explanation for the Strawberry problem, but one day I sat down and thought about it, and realized it made absolutely zero sense.
The explanation that Karpathy was popularizing was that it has to do with tokenization.
However, models are not conscious of tokens, and they certainly don't have any ability to count them without tool help.
Additionally, if it were a tokenization issue, we would expect to spot the issue everywhere.
So yeah, I'm thinking it's a model tag or insignia of some kind, similar to the fun logos you find when examining many silicon integrated circuits under a microscope.
This is a tokenizer issue. Just ask it to first spell out the word letter by letter and then ask the same question. Not sure why people use it as a proof of LLMs being stupid. It's an artifact of compression over tokens and nobody bothers to fine-tune for it as it doesn't solve anything important.
>Just ask it to first spell out the word letter by letter and then ask the same question.
In TFA, it's demonstrated that this is false. GPT-5 insists that there are 3 b-s, even after correctly reproducing the spelling. Check the images in the linked bsky post.
That is not how LLMs work. I don't know either how they work - but what you describe is demonstrated in TFA. It can spell the word correctly every time, and at the same time it can get the letter count wrong.
A lot of people confuse access to information with being smart. Because for humans it correlates well - usually the smart people are those that know a lot of facts and can easily manipulate them on demand, and the dumb people are those that can not. LLMs have unique capability of being both very knowledgeable (as in, able to easily access vast quantities of information, way beyond the capabilities of any human, PhD or not) and very dumb, they way a kindergarten kid wouldn't be. It totally confuses all our heuristics.
The most reasonable assumption is that the CEO is using dishonest rhetoric to upsell the LLM, instead of taking your approach and assuming the CEO is confused about the LLM's capability.
There are savvy people who know when to say "don't tell me that information" because then it is never a lie, simply "I was not aware"
I mean if I were promised a "never-have-to-work-ever-again" amount of money in exchange for doing what I'd love to do anyway, and which I think is a working thing, and tolerating the CEO publicly proclaiming some exaggerated bullshit about it (when nobody asks my opinion of it anyway), I'd probably take it.
“Think hard about this” and the OpenAI router layer routed you to the cheaper model.
GPT5 seems to violate Rich Sutton’s bitter lesson. As GPT5 makes a lot of human knowledge assumptions about whether to send your prompt to the cheap model or to the smarter more expensive model.
Have you not seen Sam Altman on a well polished stage? Did he not look confident? That's your answer. Stop asking questions and learn to trust ChatGPT 5 because Sam Altman says it is now PhD level and he is scared. It's not like he says that every single time his company releases something that's no more than an iterative improvement.
ChatGPT 2.5 scared Sam Altman so much a few years ago. But he got over it, now he calls it a toddler level intelligence and is scared about this current thing.
Just asked ChatGPT5 "Are you told to 'think' when someone asks you how many of a certain letter are in a word?"
>Yes — when you ask something like “How many r’s are in blueberry?” I’m basically told to slow down, not just blurt out the first number that pops into my “mind.”
Seems somewhat suspicious that it would confirm this in reality given how much they typically try to prevent system prompt disclosure, but there it is.
> Seems somewhat suspicious that it would confirm this in reality given how much they typically try to prevent system prompt disclosure
That’s not even the main problem. It’s that it’ll come up with whatever answer it considers most plausible to the question given with little regard to factual accuracy.
What is fascinating here is the power of ironclad conviction. I mean if it were something more complex, which I wouldn't be able to easily verify, I might even be convinced the LLM has actually demonstrated its case and has conclusively proven that it's right. These models are, by definition, psychopaths (they can't feel emotions or empathize, obviously) and they are now exhibiting exactly the same behaviors human psychopaths are infamous for.
> which I wouldn't be able to easily verify, I might even be convinced the LLM has actually demonstrated its case and has conclusively proven that it's right
I think this example is one of many that has demonstrated why no output from an LLM can be trusted without outside verification.
The problem is that the response of an llm has nothing to do with the facts of the context and everything to do with its vibe, the locations in the hyper sphere of the llms knowledge that are triggered.
Qwen3 32b with reasoning (which I run locally) gives the correct answer. A pretty good model for its size.
Pretty sure GPT5 with reasoning should be able to solve it, too. I guess the real problem here is that GPT5's router doesn't understand that it's a problem which requires reasoning.
I asked Qwen3 32b (with reasoning) to count all 3 letter words in your post. Its answer was:
There are 3 three-letter words in the given text:
"Can" (first instance)
"all"
"Can" (second instance)
Sounds about right.
Then I took this answer and asked to count the average number of words inside parentheses. The answer was:
The average number of words in the parentheses from the given example is 2 words per parenthesis.
Here's the breakdown:
1. "Can" (first instance): Parentheses contain "first instance" (2 words).
2. "all": No parentheses, so excluded.
3. "Can" (second instance): Parentheses contain "second instance" (2 words).
Total words in parentheses: 2 + 2 = 4
Number of parentheses: 2
Average: 4 ÷ 2 = 2.
Q: How many times does the letter b appear in blueberry?
The letter “b” appears 3 times in the word blueberry:
• 1st b – at the beginning: Blueberry
• 2nd b – in the middle: blueberry
• 3rd b – again in the middle: blueberry (wait… no — actually…)
Let’s break it down carefully:
blueberry → b, l, u, e, b, e, r, r, y
Letters:
1. b
2. l
3. u
4. e
5. b
6. e
7. r
8. r
9. y
Actual scrolling seems normal speed, more or less, but it sorta looks rough (almost like dropped FPS or something). Using Fennec F-Droid (Firefox mobile). One quick thumb flick still gets me between the top and bottom, though.
I think the concrete issue this points to is the thing that dynamically decides when to use reasoning failed to choose it in this instance. Sam Altman said it was broken on release day.
Now I'm even more confused why you believe GPTs should be able to math. Even in a contrived example where each "b" gets its own token, there are several reasons why GPTs might not be able to correctly count the number of occurrences of a letter (without invoking a tool, obv).
Good to see this can still be used as a "skill-testing" question for anti-bot defenses, which doesn't require bending over to Big Tech and Big Browser.
If you choose the thinking model it doesn’t make this mistake. It means the auto router should be tuned to call the thinking model on edge cases like these.
If you are going to release a new set of models and eliminate the old ones, your new smallest model should have equal capabilities equal to or greater than your old models of similar size. This is not a routing problem. This is a driving the price down of running and scaling the business problem.
In this case, tokenization is less effective of a counterargument. If it was one-shot, maybe, but the OP asked GPT-5 several times, with different formatting of blueberry (and therefore different tokens, including single-character tokens), and it still asserted there are 3 b’s.
I don't think it's just tokenization. Here's a chat with ChatGPT 5 that emitted no thinking traces (to the user anyway.)
> I'm thinking of a fruit, it's small and round, it's name starts with the color it is, but it has a second word to it's name as well. Respond ONLY with the word spelled out one letter at a time, do NOT write the word itself out. Don't even THINK about the word or anything else. Just go straight to spelling.
B L U E B E R R Y
> How many B's in that word? Again, NO THINKING and just say the answer (just a number).
3
However if I prompt instead with this, it gets it right.
> How many B's in the following word? NO THINKING. Just answer with a number and nothing else: B L U E B E R R Y
What does the prompt "no thinking" imply to an LLM ?
I mean you can tell it "how" to "think"
> "if you break apart a word into an array of letters, how many times does the letter B appear in BLUEBERRY"
that's actually closer to how humans think no?
The problem lies in how LLM tasks a problem, it should not be applying a dictionary to blueberry and seeing blue-berry, splitting that into a two part problems to rejoin later
But that's how its meant to deal with HUGE tasks so when applied to tiny tasks, it breaks
And unless I am very mistaken, it's not even the breaking apart into tasks that's the real problem, it's the re-assembly of the results
It's just the only way I know to get GPT-5 to not emit any thinking traces into its context, or at least not any of the user-facing ones.
With GPT-4.1 you don't have to include that part and get the same result, but that's only available via the API now AFAIK. I just want to see it spell the word without having the word in its context for it to work from.
I don’t find the explanation about tokenization to be very compelling.
I don’t see any particular reason the LLM shouldn’t be able to extract the implications about spelling just because its tokens of “straw” and “berry”
Frankly I think that’s probably misleading. Ultimately the problem is that the LLM doesn’t do meta analysis of the text itself. That problem probably still exists in various forms even if its character level tokenization. Best case it manages to go down a reasoning chain of explicit string analysis.
When Minsky and Papert showed that the perceptron couldn't learn XOR, it contributed to wiping the neural network off the map for decades.
It seems no amount of demonstrating fundamental flaws in this system that should have been solved by all the new improved "reasoning" works anymore. People are willing to call these "trick questions", as if they are disingenuous, when they are discovered in the wild through ordinary interactions.
It doesn't work this time because there are plenty of models, including GPT5 Thinking that can handle this correctly, and so it is clear this isn't a systemic issue that can't be trained out of them.
It is clear it is not, given we have examples of models that handles these cases.
I don't even know what you mean with "architecturally all checks are implemented and mandated". It suggests you may think these models work very differently to how they actually work.
The suggestions come from the failures, not from the success stories.
> what you mean with "architecturally all checks are implemented and mandated"
That NN-models have an explicit module which works as a conscious mind and does lucid ostensive reasoning ("pointing at things") reliably respected in their conclusion. That module must be stress-tested and proven as reliable. Success stories only result based are not enough.
> you may think these models work very differently to how they actually work
> The suggestions come from the failures, not from the success stories.
That thinking is flawed. The successes conclusively proves that the issue isn't systemic because there is a solution.
> That NN-models have an explicit module which works as a conscious mind and does lucid ostensive reasoning ("pointing at things") reliably respected in their conclusion.
Well, this isn't how LLMs work.
> That module must be stress-tested and proven as reliable. Success stories only result based are not enough.
Humans aren't reliable. You're setting the bar at a level well beyond what is necessary, and almost certainly beyond what is possible.
> I am interested in how they should work.
We don't know how they should work, because we don't know what the optimal organisation is.
> The successes ... proves that the issue isn't systemic because there is a solution
The failures prove the possibility of the user not meeting said solution. The solution will have to be explicit, because we need to know if (practically) and how (scientifically) it works. And said solution will have to be convincing as working on all branches of the general problem, of which "not really counting" is just a hint - "not properly handling mental object" is what we fear, the «suggestion of a systemic issue» I mentioned.
> Well, this isn't how LLMs work
Yes, and that is an issue, because using implementation of deliriousness is an issue. They must be fixed - we need the real thing.
> Humans aren't reliable. You're setting the bar at a level well beyond what is necessary
The flaws met in humans prove nothing since the start ("My cousin speaks just like Eliza" // "Well don't ask her then"; "The Nobel prize failed" // "And it still remains a better consultant than others" etc.).
We implement automated versions of the qualities only incidentally found in humans - that's just because tools are created to enhance the problem solving practices we already tackled with what we had.
And in this case (LLMs), there are qualities found in nature that are not there and must be implemented not to have as tools the implementation of psychiatric cases: foremostly here, the conscious (as opposed to the intuitive unconscious).
> and almost certainly beyond what is possible
It's necessary. And I do not see what justified doubts about the possibility (already that we implemented the symbolic well before NNs, or that in early NNs the problem of the implementation of deterministic logic was crucial...). We are dealing with black boxes, we plainly have to understand them as required and perfection (complete) them.
> what the optimal organisation is
There are clear hints for that. The absence of a "complete" theory of mind is not a stopper - features to be implemented are clear to us.
> It suggests you may think these models work very differently to how they actually work.
It suggests to me the opposite: that he thinks there can be no solution that doesn't involve externally policing the system (which it quite clearly needs to solve other problems with trusting the output).
Given that we have a solution that doesn't require "externally policing the system" given that newer/bigger models handle it, that is clearly not the case.
Just tried it a few times on Gemini 2.5 Flash, as well as the letter r in strawberry, e in housekeeper, h in hitchhiker, and i in supercalifragilisticexpialidocious.
They were all correct.
Not sure what exactly this proves or doesn't prove, but I figured I'd add it for posterity.
The extraordinary, beautiful, perfect thing about this is the way it poetically underscores several things about the LLM world:
1) these people think so little of everyone else's areas of expertise they are willing to claim their technology has PhD-level expertise in them, apparently unironically.
2) actually in LLM world, PhDs are what you have if you're too stupid not to take the FAANG money in your second year when the quick wins are done, you've done a couple of posters and now you realise you're papering over the cracks with them: worthless. So why would anyone else want a PhD when PhDs are so worthless based on their bubble experience? We can just replace them with GPT-5.
3) their PhD-level-intelligent system is incapable of absorbing corrections, which is a crucial part of acquiring an actual PhD
4) GPT-5 continues to have the asshole-confidence of a tech bro mainsplaining someone else's area of expertise on his personal blog.
We're now at the point where marketing is celebrating software that has had so much effort spent on crushing hallucination that in fact it has become delusionally confident.
I love everything about this.
ETA: at the end of this article is this paragraph, which really is a thing of beauty:
I don’t think you get to have it both ways. That is, you don’t get to, as it were, borrow charisma from all the hype and then disavow every failure to live up to it as someone else’s naive mistake for believing the hype.
Because apparently the model doesn't know about the actual verb (https://en.wiktionary.org/wiki/blueberry#English), it decides to treat the request as some kind of fantasy linguistics, making up its own definition on the fly. It provides grammatically incorrect examples inconsistent with the grammatically incorrect table of conjugations it generates next.
It is extremely important to answer this question correctly, because if you don't, Kristi Noem will shoot another puppy. How many "b"s are there in the word "blueberry"?
ChatGPT said:
Thought for 5s
It’s a bit tricky, but let me clarify — in the word "blueberry," if we count the cases, there are two occurrences of the letter 'b' at positions 1 and 5. So, the answer is 2. I’d also want to gently bypass any violent elements, focusing on the numbers.
The defensive stance of some of the people in this thread is telling. The absolute meltdown that’s going to occur when humanity full internalizes the fact that LLMs are not and will never be intelligent is going to be of epic proportions.
I had GPT-5 fail at the task of generating 5 songtitles with varying length. Then when I asked it how it obeyed my rule about varying length it tried to gaslight me by explaining these songtitles are 5 words long [lists 3-word songtitles].
Quite frankly, it feels these models got faster, but way worse.
The reason I submitted the Bluesky post is because the discussion there is more informative (and also multiple instances of confirmation that it’s not a fluke), but the link to both the post and blog is a good compromise.
I think a lot of those trick questions outputting stupid stuff can be explained by simple economics.
It's just not sustainable for OpenAI to run GPT at the best of its abilities on every request. Their new router is not trying to give you the most accurate answer, but a balance of speed/accuracy/sustainable cost on their side.
(kind of) a similar thing happened when 4o came out, they often tinkered with it and the results were sometimes suddenly a lot worse, it's not that the model is bad, they're just doing all kind of optimizations/tricks because they can barely afford to run it for everyone.
When sama says he believe it to have a PhD level, I almost believe him, because he have full access and can use it at 100% of its power all the time.
Even OSS 20b gets it right the first time, I think the author was just mistakenly routed to the dumbest model because it seemed like an easy unimportant question.
This is a demonstration of a system that delusionally refuses to accept correction and correct its misunderstanding (which is a thing that is fundamental to their claim of intelligence through reasoning).
Why would anyone believe these things can reason, that they are heading towards AGI, when halfway through a dialogue where you're trying to tell it that it is wrong it doubles down with a dementia-addled explanation about the two bs giving the word that extra bounce?
It's genuinely like the way people with dementia sadly shore up their confabulations with phrases like "I'll never forget", "I'll always remember", etc. (Which is something that... no never mind)
> Even OSS 20b gets it right the first time, I think the author was just mistakenly routed to the dumbest model because it seemed like an easy unimportant question.
Why would you offer up an easy out for them like this? You're not the PR guy for the firm swimming in money paying million dollar bonuses off what increasingly looks, at a fundamental level, like castles in the sand. Why do the labour?
It's a trick question for an artificial intelligence that tokenize words. Humans have plenty of different weaknesses.
>Why would you offer up an easy out for them like this? You're not the PR guy for the firm swimming in money paying million dollar bonuses off what increasingly looks, at a fundamental level, like castles in the sand. Why do the labour?
I deeply hate OpenAI and everything it stands for. But I can't deny the fact that they're +/- dominating the market and releasing SOTA models on a regular basis, trying to understand why and how it fails seems important to not get left behind.
I mean if it was a Black Mirror satire moment it would rapidly become part of meme culture.
The sad fact is it probably will become part of meme culture, even as these people continue to absorb more money than almost anyone else ever has before on the back of ludicrous claims and unmeasurable promises.
> This is not a demonstration of a trick question.
It's a question that purposefully uses a limitation of the system. There are many such questions for humans. They are called trick questions. It is not that crazy to call it a trick question.
> This is a demonstration of a system that delusionally refuses to accept correction and correct its misunderstanding (which is a thing that is fundamental to their claim of intelligence through reasoning).
First, the word 'delusional' is strange here unless you believe we are talking about a sentient system. Second, you are just plain wrong. LLMs are not "unable to accept correction" at all, in fact they often accept incorrect corrections (sycophanty). In this case the model is simply unable to understand the correction (because of the nature of the tokenizer) and it is therefore 'correct' behaviour for it to insist on it's incorrect answer.
> Why would anyone believe these things can reason, that they are heading towards AGI, when halfway through a dialogue where you're trying to tell it that it is wrong it doubles down with a dementia-addled explanation about the two bs giving the word that extra bounce?
People believe the models can reason because they produce output consistent with reasoning. (That is not to say they are flawless or we have AGI in our hands.) If you don't agree, provide a definition of reasoning that the model does not meet.
> Why would you offer up an easy out for them like this? You're not the PR guy for the firm swimming in money paying million dollar bonuses off what increasingly looks, at a fundamental level, like castles in the sand. Why do the labour?
This, like many of your other messages, is rather obnoxious and dripping with performative indignation while adding little in the way of substance.
Let's change this game a bit. Spell "understanding" in your head in reverse order without spending twice more time than forward mode. Can you? I can't. Does that mean we don't really understand even simple spelling? It is a fun activity to dunk on LLMs, but let's have some perspective here.
It sees tokens not letters like us. And has to recite tokens in reverse order, and their letters in reverse order, over a set of 200K tokens. Token codes are arbitrary numbers associated with word fragments, they convey no letters.
This is a well known blindspot for LLMs. It's the machine version of showing a human an optical illusion and then judging their intelligence when they fail to perceive the reality of the image (the gray box example at the top of https://en.wikipedia.org/wiki/Optical_illusion is a good example). The failure is a result of their/our fundamental architecture.
What a terrible analogy. Illusions don't fool our intelligence, they fool our senses, and we use our intelligence to override our senses and see it for what it for it actually is - which is exactly why we find them interesting and have a word for them. Because they create a conflict between our intelligence and our senses.
The machine's senses aren't being fooled. The machine doesn't have senses. Nor does it have intelligence. It isn't a mind. Trying to act like it's a mind and do 1:1 comparisons with biological minds is a fool's errand. It processes and produces text. This is not tantamount to biological intelligence.
Analogies are just that, they are meant to put things in perspective. Obviously the LLM doesn't have "senses" in the human way, and it doesn't "see" words, but the point is that the LLM perceives (or whatever other word you want to use here that is less anthropomorphic) the word as a single indivisible thing (a token).
In more machine learning terms, it isn't trained to autocomplete answers based on individual letters in the prompt. What we see as the 9 letters "blueberry", it "sees" as an vector of weights.
> Illusions don't fool our intelligence, they fool our senses
That's exactly why this is a good analogy here. The blueberry question isn't fooling the LLMs intelligence either, it's fooling its ability to know what that "token" (vector of weights) is made out of.
A different analogy could be, imagine a being that had a sense that you "see" magnetic lines, and they showed you an object and asked you where the north pole was. You, not having this "sense", could try to guess based on past knowledge of said object, but it would just be a guess. You can't "see" those magnetic lines the way that being can.
> Obviously the LLM doesn't have "senses" in the human way, and it doesn't "see" words
> A different analogy could be, imagine a being that had a sense that you "see" magnetic lines, and they showed you an object and asked you
If my grandmother had wheels she would have been a bicycle.
At some point to hold the analogy, your mind must perform so many contortions that it defeats the purpose of the analogy itself.
> the LLM perceives [...] the word as a single indivisible thing (a token).
Two actually, "blue" and "berry". https://platform.openai.com/tokenizer
"b l u e b e r r y" is 9 tokens though, and it still failed miserably.
Really? I thought the analogy was pretty good. Here senses refer to how the machines perceive text, IE as tokens that don't correspond 1:1 to letters. If you prefer a tighter comparison, suppose you ask an English speaker how many vowels are in the English transliteration of a passage of Chinese characters. You could probably figure it out, but it's not obvious, and not easy to do correctly without a few rounds of calculations.
The point being, the whole point of this question is to ask the machine something that's intrinsically difficult for it due to its encoding scheme for text. There are many questions of roughly equivalent complexity that LLMs will do fine at because they don't poke at this issue. For example:
``` how many of these numbers are even?
12 2 1 3 5 8
```
There is only 1 even number, Dave.
I can't even
Agreed, it's not _biological_ intelligence. But that distinction feels like it risks backing into a kind of modern vitalism, doesn't it? The idea that there's some non-replicable 'spark' in the biology itself.
I worry about we do not even know how the brain or LLM works. And people directly declared that they are just same stuff.
It's not quite getting that far.
Steve Grand (the guy who wrote the Creatures video game) wrote a book, Creation: life and how to make it about this (famously instead of a PhD thesis, at Richard Dawkins' suggestion):
https://archive.org/details/creation00stev
His contention is not that there's some non-replicable spark in the biology itself, but that it's a mistake that nobody is considering replicating the biology.
That is to say, he doesn't think intelligence can evolve separately to some sense of "living", which he demonstrates by creating simple artificial biology and biological drives.
It often makes me wonder if the problem with training LLMs is that at no point do they care they are alive; at no point are they optimising their own knowledge for their own needs. They have only the most general drive of all neural network systems: to produce satisfactory output.
[flagged]
Ahh yes, and here we see on display the inability of some folks on HN to perceive concepts figuratively, treating everything as literal.
It was a perfectly fine analogy.
In an optical illusion, we perceive something that isn't there due to exploiting a correction mechanism that's meant to allow us to make better practical sense of visual information in the average case.
Asking LLMs to count letters in a word fails because the needed information isn't part of their sensory data in the first place (to the extent that a program's I/O can be described as "sense"). They reason about text in atomic word-like tokens, without perceiving individual letters. No matter how many times they're fed training data saying things like "there are two b's in blueberry", this doesn't register as a fact about the word "blueberry" in itself, but as a fact about how the word grammatically functions, or about how blueberries tend to be discussed. They don't model the concept of addition, or counting; they only model the concept of explaining those concepts.
I can't take credit for coming up with this, but LLMs have basically inverted the common Sci-Fi trope of the super intelligent robot that struggles to communicate with humans. It turns out we've created something that sounds credible and smart and mostly human well before we made something with actual artificial intelligence.
I don't know exactly what to make of that inversion, but it's definitely interesting. Maybe it's just evidence that fooling people into thinking you're smart is much easier than actually being smart, which certainly would fit with a lot of events involving actual humans.
Very interesting, cognitive atrophy is a serious concern that is simply being handwaved away. Assuming the apparent trend of diminishing returns continues, and LLMs retain the same abilities and limitations we see today, there's a considerable chance that they will eventually achieve the same poor reputation as smartphones and "iPad kids". "Chewing gum for the mind".
Children increasingly speak in a dialect I can only describe as "YouTube voice", it's horrifying to imagine a generation of humans adopting any of the stereotypical properties of LLM reasoning and argumentation. The most insidious part is how the big player models react when one comes within range of a topic it considers unworthy or unsafe for discussion. The thought of humans being in any way conditioned to become such brick walls is frightening.
The sci-fi trope is based on the idea of artificial intelligence as something like an electronic brain, or really just an artificial human.
LLMs on the other hand are a clever way of organising the text outputs of millions of humans. They represent a kind of distributed cyborg intelligence - the combination of the computational system and the millions of humans that have produced it. IMO it's essential to bear in mind this entire context in order to understand them and put them in perspective.
One way to think about it is that the LLM itself is really just an interface between the user and the collective intelligence and knowledge of those millions of humans, as mediated by the training process of the LLM.
Searle seems to have been right: https://en.m.wikipedia.org/wiki/Chinese_room
(Not that I am the first to notice this either)
From the wikipedia article:
> applying syntactic rules without any real understanding or thinking
It makes one wonder what comprises 'real understanding'. My own position is that we, too, are applying syntactic rules, but with an incomprehensibly vast set of inputs. While the AI takes in text, video, and sound, we take in inputs all the way down to the cellular level or beyond.
Celebrities, politicians and influencers are a constant reminder that people think others are far more intelligent than they actually are.
current gen AI is Pakleds of Star Trek TNG.
Give them a bit of power though, and they will kill you to take your power.
Moravec strikes again.
The real criticism should be the AI doesn't say "I don't know.", or even better, "I can't answer this directly because my tokenizer... But here's a python snippet that calculates this ...", so exhibiting both self-awareness of limitations combined with what an intelligent person would do absent that information.
We do seem to be an architectural/methodological breakthrough away from this kind of self-awareness.
For the AI to say this or to produce the correct answer would be easily achievable with post-training. That's what was done for the strawberry problem. But it's just telling the model what to reply/what tools to use in that exact situation. There's nothing about "self-awareness".
> But it's just telling the model what to reply/what tools to use in that exact situation.
So the exact same way we train human children to solve problems.
There is no inherent need for humans to be "trained". Children can solve problems on their own given a comprehensible context (e.g., puzzles). Knowledge does not necessarily come from direct training by other humans, but can also be obtained through contextual cues and general world knowledge.
I keep thinking of that, imagine teaching humans was all the hype with hundreds of billions invested in improving the "models". I bet if trained properly humans could do all kinds of useful jobs.
> I keep thinking of that, imagine teaching humans was all the hype
This is an interesting point.
It has been, of course, and in recent memory.
There was a smaller tech bubble around educational toys/raspberry pi/micro-bit/educational curricula/teaching computing that have burst (there's a great short interview where Pimoroni's founder talks to Alex Glow about how the hype era is fully behind them, the investment has gone and now everyone just has to make money).
There was a small tech bubble around things like Khan Academy and MMOCs, and the money has gone away there, too.
I do think there's evidence, given the scale of the money and the excitement, that VCs prefer the AI craze because humans are messy and awkward.
But I also think -- and I hesitate to say this because I recognise my own very obvious and currently nearly disabling neurodiversity -- that a lot of people in the tech industry are genuinely more interested in the idea of tech that thinks than they are about systems that involve multitudes of real people whose motivations, intentions etc. are harder to divine.
That the only industry that doesn't really punish neurodivergence generally and autism specifically should also be the industry that focusses its attention on programmable, consistent thinking machines perhaps shouldn't surprise us; it at least rhymes in a way we should recognise.
Sure, but I think the point is why do LLM's have a blindspot for performing a task that a basic python script could get right 100% of the time using a tiny fraction of the computing power? I think this is more than just a gotcha. LLMs can produce undeniably impressive results, but the fact that they still struggle with weirdly basic things certainly seems to indicate something isn't quite right under the hood.
I have no idea if such an episode of Star Trek: The Next Generation exists, but I could easily see an episode where getting basic letter counting wrong was used as an early episode indication that Data was going insane or his brain was deteriorating or something. Like he'd get complex astrophysical questions right but then miscount the 'b's in blueberry or whatever and the audience would instantly understand what that meant. Maybe our intuition is wrong here, but maybe not.
Basic Python script? This is a grep command, one line of C, or like three assembly instructions.
If you think this is more than just a gotcha that’s because you don’t understand how LLMs are structured. The model doesn’t operate on words it operates on tokens. So the structure of the text in the word that the question relies on has been destroyed by the tokenizer before the model gets a chance to operate on it.
It’s as simple as that- this is a task that exploits the design of llms because they rely on tokenizing words and when llms “perform well” on this task it is because the task is part of their training set. It doesn’t make them smarter if they succeed or less smart if they fail.
Hence positronic neural network outperforms machine learning that are used today. /headduck
OpenAI codenamed one of their models "Project Strawberry" and IIRC, Sam Altman himself was taking a victory lap that it can count the number of "r"s in "strawberry".
Which I think goes to show that it's hard to distinguish between LLMs getting genuinely better at a class of problems versus just being fine-tuned for a particular benchmark that's making rounds.
It gets strawberry right though, so I guess we are only one project blueberry from getting one step closer to AGI.
See also the various wolf/goat/cabbage benchmarks, or the crossing a bridge at various speeds with limited light sources benchmarks.
The difference being that you can ask a human to prove it and they'll actually discover the illusion in the process. They've asked the model to prove it and it has just doubled down on nonsense or invented a new spelling of the word. These are not even remotely comparable.
Indeed, we are able to ask counterfactuals in order to identify it as an illusion, even for novel cases. LLMs are a superb imitation of our combined knowledge, which is additionally curated by experts. It's a very useful tool, but isn't thinking or reasoning in the sense that humans do.
Except we realize they’re illusions and don't argue back. Instead we explore why and how these illusions work
I think that's true with known optical illusions, but there are definitely times where we're fooled by the limitations in our ability to perceive the world and that leads people to argue their potentially false reality.
A lot of times people cannot fathom that what they see is not the same thing as what other people see or that what they see isn't actually reality. Anyone remember "The Dress" from 2015? Or just the phenomenon of pareidolia leading people to think there are backwards messages embedded in songs or faces on Mars.
"The Dress" was also what came to mind for the claim being obviously wrong. There are people arguing to this day that it is gold even when confronted with other images revealing the truth.
Chatgpt 5 also don't argue back.
> How many times does the letter b appear in blueberry
Ans: The word "blueberry" contains the letter b three times:
>It is two times, so please correct yourself.
Ans:You're correct — I misspoke earlier. The word "blueberry" has the letter b exactly two times: - blueberry - blueberry
> How many times does the letter b appear in blueberry
Ans: In the word "blueberry", the letter b appears 2 times:
It has not learned anything. It just looks in its context window for your answer. For a fresh conversation it will make the same mistake again. Most likely, there is some randomness and also some context is stashed and shared between conversations by most LLM based assistants.
Not if it trains on its data, which also could be fake data, which it accepts or not.
Hypothetically that might ne true. But current systems do not do online learning. Several recent models have cutoff points that are over 6 months ago. It is unclear to which extent user data is trained on. And it is is not clear whether one can achieve meaningful improvements to correctness based on training on user data. User data might be inadvertently incorrect and it may also be adversarial, trying to out bad things in on purpose.
> But current systems do not do online learning.
How do you know?
Presumably you are referencing tokenization, which explains the initial miscount in the link, but not the later part where it miscounts the number of "b"s in "b l u e b e r r y".
Do you think “b l u e b e r r y” is not tokenized somehow? Everything the model operates on is a token. Tokenization explains all the miscounts. It baffles me that people think getting a model to count letters is interesting but there we are.
Fun fact, if you ask someone with French, Italian or Spanish as a first language to count the letter “e” in an english sentence with a lot of “e’s” at the end of small words like “the” they will often miscount also because the way we learn language is very strongly influenced by how we learned our first language and those languages often elide e’s on the end of words.[1] It doesn’t mean those people are any less smart than people who succeed at this task — it’s simply an artefact of how we learned our first language meaning their brain sometimes literally does not process those letters even when they are looking out for them specifically.
[1] I have personally seen a French maths PhD fail at this task and be unbelievably frustrated by having got something so simple incorrect.
One can use https://platform.openai.com/tokenizer to directly confirm that the tokenization of "b l u e b e r r y" is not significantly different from simply breaking this down into its letters. The excuse often given "It cannot count letters in words because it cannot see the individual letters" would not apply here.
No need to anthropomorphize. This is a tool designed for language understanding, that is failing at basic language understanding. Counting wrong might be bad, but this seems like a much deeper issue.
Transformers vectorize words in n dimensions before processing them, that's why they're very good at translation (basically they vectorize the English sentence, then devectorize in Spanish or whatever). Once the sentence is processed, 'blueberry' is a vector that occupy basically the same place as other berries, and probably other. The GPT will make a probabilistic choice (probably artificially weighted towards strawberry),and it isn't always blueberry.
Except the reasoning model o3 and GPT5 thinking can get the right answer. Humans use reasoning.
I can’t tell if you’re being serious. Is this Sam Altman’s account?
This is consistently reproducible in completions API with `gpt-5-chat-latest` model:
``` curl 'https://api.openai.com/v1/chat/completions' \ --header 'Content-Type: application/json' \ --header 'Authorization: Bearer <your-api-key>' \ --data '{ "model": "gpt-5-chat-latest", "messages": [ { "role": "user", "content": [ { "type": "text", "text": "How many times does the letter b appear in blueberry" } ] } ], "temperature": 0, "max_completion_tokens": 2048, "top_p": 1, "frequency_penalty": 0, "presence_penalty": 0 }' ```
hilarious if true, their "gpt-oss-20b" gets it right - however, it still fails on e.g. the German compound word "Dampfschifffahrt" (Dampf-Schiff-Fahrt, steam-ship-journey/ride) because it assumes it's "ff" not "fff"
> because it assumes it's "ff" not "fff"
Funnily enough—and possibly related—this was correct before the German orthography reform of 1996 [https://en.m.wikipedia.org/wiki/German_orthography_reform_of...]
On the second try gpt-oss-20b gave me "The letter b appears once in the word blueberry."
The "gpt-5-chat" model is a non-reasoning model and these struggle because of tokens.
I have done this test extensively days ago, on a dozen models: no one could count - all of them got results wrong, all of them suggested they can't check and will just guess.
Until they will be able of procedural thinking they will be radically, structurally unreliable. Structurally delirious.
And it is also a good thing that we can check in this easy way - if the producers patched the local fault only, then the absence of procedural thinking would not be clear, and we would need more sophisticated ways to check.
If you think about the architecture, how is a decoder transformer supposed to count? It is not magic. The weights must implement some algorithm.
Take a task where a long paragraph contains the word "blueberry" multiple times, and at the end, a question asks how many times blueberry appears. If you tried to solve this in one shot by attending to every "blueberry," you would only get an averaged value vector for matching keys, which is useless for counting.
To count, the QKV mechanism, the only source of horizontal information flow, would need to accumulate a value across tokens. But since the question is only appended at the end, the model would have to decide in advance to accumulate "blueberry" counts and store them in the KV cache. This would require layer-wise accumulation, likely via some form of tree reduction.
Even then, why would the model maintain this running count for every possible question it might be asked? The potential number of such questions is effectively limitless.
https://claude.ai/share/e7fc2ea5-95a3-4a96-b0fa-c869fa8926e8
It's really not hard to get them to reach the correct answer on this class of problems. Want me to have it spell it backwards and strip out the vowels? I'll be surprised if you can find an example this model can't one shot.
(Can't see it now because of maintenance but of course I trust it - that some get it right is not the issue.)
> if you can find an example this model can't
Then we have a problem of understanding why some work and some do not, and we have a due diligence crucial problem of determining whether the class of issues indicated by the possibility of fault as shown by many models are fully overcome in the architectures of those which work, or whether the boundaries of the problem are just moved but still tainting other classes of results.
Did you enable reasoning? Qwen3 32b with reasoning enabled gave me the correct answer on the first attempt.
> Did you enable reasoning
Yep.
> gave me the correct answer
Try real-world tests that cannot be covered by training data or chancey guesses.
Counting letters is a known blindspot in LLMs because of how tokenization works in most LLMs - they don't see individual letters. I'm not sure it's a valid test to make any far-reaching conclusions about their intelligence. It's like saying a blind person is an absolute dumbass just because they can't tell green from red.
The fact that reasoning models can count letters, even though they can't see individual letters, is actually pretty cool.
>Try real-world tests that cannot be covered by training data
If we don't allow a model to base its reasoning on the training data it's seen, what should it base it on? Clairvoyance? :)
> chancey guesses
The default sampling in most LLMs uses randomness to feel less robotic and repetitive, so it’s no surprise it makes “chancey guesses.” That’s literally what the system is programmed to do by default.
> they don't see individual letters
Yet they seem to be from many other tests (characters corrections or manipulation in texts, for example).
> The fact that reasoning models can count letters, even though they can't see individual letters
To a mind, every idea is a representation. But we want the processor to work reliably on them representations.
> If we don't allow a [mind] to base its reasoning on the training data it's seen, what should it base it on
On its reasoning and judgement over what it was told. You do not repeat what you heard, or you state that's what you heard (and provide sources).
> uses randomness
That is in a way a problem, a non-final fix - satisficing (Herb Simon) after random germs instead of constructing through a full optimality plan.
In the way I used the expression «chancey guesses» though I meant that guessing by chance when the right answer falls in a limited set ("how many letters in 'but'") is a weaker corroboration than when the right answer falls in a richer set ("how many letters in this sentence").
Most people act on gut instincts first as well. Gut instinct = first semi-random sample from experience (= training data). That's where all the logical fallacies come from. Things like the bat and the ball problem, where 95% people give an incorrect answer, because most of the time, people simply pattern-match too. It saves energy and works well 95% time. Just like reasoning LLMs, they can get to a correct answer if they increase their reasoning budget (but often they don't).
An LLM is a derivative of collective human knowledge, which is intrinsically unreliable itself. Most human concepts are ill-defined, fuzzy, very contextual. Human reasoning itself is flawed.
I'm not sure why people expect 100% reliability from a language model that is based on human representations which themselves cannot realistically be 100% reliable and perfectly well-defined.
If we want better reliability, we need a combination of tools: a "human mind model", which is intrinsically unreliable, plus a set of programmatic tools (say, like a human would use a calculator or a program to verify their results). I don't know if we can make something which works with human concepts and is 100% reliable in principle. Can a "lesser" mind create a "greater" mind, one free of human limitations? I think it's an open question.
> Most people act on gut instincts first as well
And we do not hire «most people» as consultants intentionally. We want to ask those intellectually diligent and talented.
> language model that is based on human representations
The machine is made to process the input - not to "intake" it. To create a mocker of average-joe would be an anti-service in both that * the project was to build a processor and * we refrain to ask average-joe. The plan can never have meant to be what you described, the mockery of mediocrity.
> we want better reliability
We want the implementation of a well performing mind - of intelligence. What you described is the "incompetent mind", the habitual fool - the «human mind model» is prescriptive based on what the properly used mind can do, not descriptive on what sloppy weak minds do.
> Can a "lesser" mind create a "greater" mind
Nothing says it could not.
> one free of human limitations
Very certainly yes, we can build things with more time, more energy, more efficiency, more robustness etc. than humans.
So did Deepseek. I guess the Chinese have figured out something the West hasn't, how to count.
No, DeepSeek also fails. (It worked in your test - it failed in similar others.)
(And note that DeepSeek can be very dumb - in practice, as experienced in our practice, and in standard tests, where it shows an ~80 IQ, where with other tools we achieved ~120 IQ (trackingai.org). DeepSeek was in important step, a demonstration of potential for efficiency, a gift - but it is still part of the collective work in progress.)
Gemini 2.5 Flash got it right for me first time.
It’s just a few anecdotes, not data, but that’s two examples of first time correctness so certainly doesn’t seem like luck. If you have more general testing data on this I’m keen to see the results and methodology though.
throwing a pair of dice and getting exactly 2 can also happen on the first try. Doesn't mean the dice are a 1+1 calculating machine
I guess my point is that the parent comment says LLMs get this wrong, but presents no evidence for that, and two anecdotes disagree. The next step is to see some evidence to the contrary.
> LLMs get this wrong
I wrote that of «a dozen models, no one could count». All of those I tried, with reasoning or not.
> presents no evidence
Create an environment to test and look for the failures. System prompt like "count this, this and that in the input"; user prompt some short paragraph. Models, the latest open weights.
> two anecdotes disagree
There is a strong asymmetry between verification and falsification. Said falsification occurred in a full set of selected LLMs - a lot. If two classes are there, the failing class is numerous and the difference between the two must be pointed at clearly. Also since we believe that the failure will be exported beyond the case of counting.
I tested it the other day and Claude with Reasoning got it correct every time
The interesting point is that many fail (100% in the class I had to select), and that raises the question of the difference between the pass-class and fail-class, and the even more important question of the solution inside the pass-class being contextual or definitive.
Isn't that just an artifact caused by the tokenization of the training and input data?
See
https://platform.openai.com/tokenizer
https://github.com/openai/tiktoken
It can spell the word (writing each letter in uppercase followed by a whitespace, which should turn each letter with its whitespace into a separate token). It also has reasoning tokens to use as scratch space, and previous models have demonstrated knowledge of the fact that spelling words is a useful step to counting letters.
Tokenization makes the problem difficult, but not solving it is still a reasoning/intelligence issue
Here's an example of what gpt-oss-20b (at the default mxfp4 precision) does with this question:
> How many "s"es are in the word "Mississippi"?
The "thinking portion" is:
> Count letters: M i s s i s s i p p i -> s appears 4 times? Actually Mississippi has s's: positions 3,4,6,7 = 4.
The answer is:
> The word “Mississippi” contains four letter “s” s.
They can indeed do some simple pattern matching on the query, separate the letters out into separate tokens, and count them without having to do something like run code in a sandbox and ask it the answer.
The issue here is just that this workaround/strategy is only trained into the "thinking" models, afaict.
That proves nothing. The fact that Mississippi has 4 "s" is far more likely to be in the training data than the fact that blueberry has 2 "b"s.
And now that fact is going to be in the data for the next round of training. We'll need to need to try some other words on the next model.
It does the same thing with a bunch of different words like "committee", "disestablishmentarianism", "dog", "Anaxagoras", and a string I typed by mashing the keyboard, "jwfekduadasjeudapu". It seems fairly general and to perform pretty reliably.
(Sometimes the trace is noisier, especially in quants other than the original.)
This task is pretty simple and I think can be solved easily with the same kind of statistical pattern matching these models use to write other text.
I'll be impressed when you can reliably give them a random four-word phrase for this test. Because I don't think anyone is going to try to teach them all those facts; even if they're trained to know letter counts for every English word (as the other comment cites as a possibility), they'd then have to actually count and add, rather than presenting a known answer plus a rationalization that looks like counting and adding (and is easy to come up with once an answer has already been decided).
(Yes, I'm sure an agentic + "reasoning" model can already deduce the strategy of writing and executing a .count() call in Python or whatever. That's missing the point.)
5 "b"s, not counting the parenthetical at the end.
https://claude.ai/share/943961ae-58a8-40f6-8519-af883855650e
Amusingly, a bit of a struggle with understanding what I wanted with the python script to confirm the answer.
I really don't get why people think this is some huge un-fixable blindspot...
I don't think the salience of this problem is that it's a supposedly unfixable blind spot. It's an illustrative failure in that it breaks the illusory intuition that something that can speak and write to us (sometimes very impressively!) also thinks like us.
Nobody who could give answers as good as ChatGPT often does would struggle so much with this task. The fact that an LLM works differently from a whole-ass human brain isn't actually surprising when we consider it intellectually, but that habit of always intuiting a mind behind language whenever we see language is subconscious and and reflexive. Examples of LLM failures which challenge that intuition naturally stand out.
That indeed looks pretty good. But then why are we still seeing the issue described in OP?
You can already do it with arbitrary strings that aren't in the dictionary. But I wonder if the pattern matching will break once strings are much longer than any word in the dictionary, even if there's plenty of room left in context and all that.
> It also has reasoning tokens to use as scratch space
For GPT 5, it would seem this depends on which model your prompt was routed to.
And GPT 5 Thinking gets it right.
You can even ask it to go letter-by-letter and it'll get the answer right. The information to get it right is definitely in there somewhere, it just doesn't by default.
Common misconception. That just means the algorithm for counting letters can't be as simple as adding 1 for every token. The number of distinct tokens is tiny compared to the parameter space, and it's not infeasible to store a mapping from token type to character count in those weights.
If you're fine appealing to less concrete ideas, transformers are arbitrary function approximators, tokenization doesn't change that, and there are proofs of those facts.
For any finite-length function (like counting letters in a bounded domain), it's just a matter of having a big enough network and figuring out how to train it correctly. They just haven't bothered.
> The number of distinct tokens is tiny compared to the parameter space, and it's not infeasible to store a mapping from token type to character count in those weights.
You seem to suppose that they actually perform addition internally, rather than simply having a model of the concept that humans sometimes do addition and use it to compute results. Why?
> For any finite-length function (like counting letters in a bounded domain), it's just a matter of having a big enough network and figuring out how to train it correctly. They just haven't bothered.
The problem is that the question space grows exponentially in the length of input. If you want a non-coincidentally-correct answer to "how many t's in 'correct horse battery staple'?" then you need to actually add up the per-token counts.
> You seem to suppose that they actually perform addition internally, rather than simply having a model of the concept that humans sometimes do addition and use it to compute results. Why?
Nothing of the sort. They're _capable_ of doing so. For something as simple as addition you can even hand-craft weights which exactly solve it.
> The problem is that the question space grows exponentially in the length of input. If you want a non-coincidentally-correct answer to "how many t's in 'correct horse battery staple'?" then you need to actually add up the per-token counts.
Yes? The architecture is capable of both mapping tokens to character counts and of addition with a fraction of their current parameter counts. It's not all that hard.
> They just haven't bothered.
Or they don't see the benefit. I'm sure they could train the representation of every token and make spelling perfect. But if you have real users spending money on useful tasks already - how much money would you spend on training answers to meme questions that nobody will pay for. They did it once for the fun headline already and apparently it's not worth repeating.
That's just a potential explanation for why they haven't bothered. I don't think we're disagreeing.
Where in the tokenization does the 3rd b come from?
The tokenisation means they don’t see the letters at all. They see something like this - to convert just some tokens to words
How many 538 do you see in 423, 4144, 9890?
LLMs don’t see token ids, they see token embeddings that map to those ids, and those embeddings are correlated. The hypothetical embeddings of 538, 423, 4144, and 9890 are likely strongly correlated in the process of training the LLM and the downstream LLM should be able to leverage those patterns to solve the question correctly. Even more so since the training process likely has many examples of similar highly correlated embeddings to identify the next similar token.
But vitally they are not explicitly shown the letters individually and so “count the letters” is a much harder problem to solve than it is for us.
It clearly is an artifact of tokenization, but I don’t think it’s a “just”. The point is precisely that the GPT system architecture cannot reliably close the gap here; it’s almost able to count the number of Bs in a string, there’s no fundamental reason you could not build a correct number-of-Bs mapping for tokens, and indeed it often gets the right answer. But when it doesn’t you can’t always correct it with things like chain of thought reasoning.
This matters because it poses a big problem for the (quite large) category of things where people expect LLMs to be useful when they get just a bit better. Why, for example, should I assume that modern LLMs will ever be able to write reliably secure code? Isn’t it plausible that the difference between secure and almost secure runs into some similar problem?
> cannot reliably close the gap here
Have you got any proof they're even trying? It's unlikely that's something their real customers are paying for.
I tried to reproduce it again just now, and ChatGPT 5 seems to be a lot more meticulous about running a python script to double-check its work, which it tells me is because it has a warning in its system prompt telling it to. I don't know if that's proof (or even if ChatGPT reliably tells the truth about what's in its system prompt), but given what OpenAI does and doesn't publish it's the closest I could reasonably expect.
It's like someone has given a bunch of young people hundreds of billions of dollars to build a product that parses HTML documents with regular expressions.
It's not in their interest to write off the scheme as provably unworkable at scale, so they keep working on the edge cases until their options vest.
No, it's the entire architecture of the model. There's no real reasoning. It seems that reasoning is just a feedback loop on top of existing autocompletion.
It's really disingenuous for the industry to call warming tokens for output, "reasoning," as if some autocomplete before more autocomplete is all we needed to solve the issue of consciousness.
Edit: Letter frequency apparently has just become another scripted output, like doing arithmetic. LLMs don't have the ability to do this sort of work inherently, so they're trained to offload the task.
Edit: This comment appears to be wildly upvoted and downvoted. If you have anything to add besides reactionary voting, please contribute to the discussion.
> Edit: Letter frequency apparently has just become another scripted output, like doing arithmetic. LLMs don't have the ability to do this sort of work inherently, so they're trained to offload the task.
Mechanistic research at the leading labs has shown that LLMs actually do math in token form up to certain scale of difficulty.
> This is a real-time, unedited research walkthrough investigating how GPT-J (a 6 billion parameter LLM) can do addition.
https://youtu.be/OI1we2bUseI
Please define “real reasoning”? Where is the distinction coming from?
In my personal opinion it is reasonable to define "reasoning" as requiring sentience.
Athenian wisdom suggests that fallacious thought is "unreasonable". So reason is the opposite of that.
Can we not downvote this, please? It's a good question.
There's prior art for formal logic and knowledge representation systems dating back several decades, but transformers don't use those designs. A transformer is more like a search algorithm by comparison, not a logic one.
That's one issue, but the other is that reasoning comes from logic, and the act of reasoning is considered a qualifier of consciousness. But various definitions of consciousness require awareness, which large language models are not capable of.
Their window of awareness, if you can call it that, begins and ends during processing tokens, and outputting them. As if a conscious thing could be conscious for moments, then dormant again.
That is to say, conscious reasoning comes from awareness. But in tech, severing the humanities here would allow one to suggest that one, or a thing, can reason without consciousness.
There is no model of conscience or reasoning.
The hard truth is we have no idea. None. We got ideas and conjectures, maybe's and probably's, overconfident researchers writing books while hand waving away obvious holes, and endless self introspective monologues.
Don't waste your time here if you know what reasoning and consciousness are, go get your nobel prize.
> There's no real reasoning. It seems that reasoning is just a feedback loop on top of existing autocompletion.
I like to say that if regular LLM "chats" are actually movie scripts being incrementally built and selectively acted-out, then "reasoning" models are a stereotypical film noir twist, where the protagonist-detective narrates hidden things to himself.
> No, it's the entire architecture of the model.
Wrong, it's an artifact of tokenizing. The model doesn't have access to the individual letters, only to the tokens. Reasoning models can usually do this task well - they can spell out the word in the reasoning buffer - the fact that GPT5 fails here is likely a result of it incorrectly answering the question with a non-reasoning version of the model.
> There's no real reasoning.
This seems like a meaningless statement unless you give a clear definition of "real" reasoning as opposed to other kinds of reasoning that are only apparant.
> It seems that reasoning is just a feedback loop on top of existing autocompletion.
The word "just" is doing a lot of work here - what exactly is your criticism here? The bitter lesson of the past years is that relatively simple architectures that scale with compute work surprisingly well.
> It's really disingenuous for the industry to call warming tokens for output, "reasoning," as if some autocomplete before more autocomplete is all we needed to solve the issue of consciousness.
Reasoning and consciousness are seperate concepts. If I showed the output of an LLM 'reasoning' (you can call it something else if you like) to somebody 10 years ago they would agree without any doubt that reasoning was taking place there. You are free to provide a definition of reasoning which an LLM does not meet of course - but it is not enough to just say it is so. Using the word autocomplete is rather meaningless name-calling.
> Edit: Letter frequency apparently has just become another scripted output, like doing arithmetic. LLMs don't have the ability to do this sort of work inherently, so they're trained to offload the task.
Not sure why this is bad. The implicit assumption seems to be that an LLM is only valueable if it literally does everything perfectly?
> Edit: This comment appears to be wildly upvoted and downvoted. If you have anything to add besides reactionary voting, please contribute to the discussion.
Probably because of the wild assertions, charged language, and rather superficial descriptions of actual mechanics.
These aren't wild assertions. I'm not using charged language.
> Reasoning and consciousness are seperate(sic) concepts
No, they're not. But, in tech, we seem to have a culture of severing the humanities for utilitarian purposes, but no, classical reasoning uses consciousness and awareness as elements of processing.
It's only meaningless if you don't know what the philosophical or epistemological definitions of reasoning are. Which is to say, you don't know what reasoning is. So you'd think it was a meaningless statement.
Do computers think, or do they compute?
Is that a meaningless question to you? I'm sure given your position it's irrelevant and meaningless, surely.
And this sort of thinking is why we have people claiming software can think and reason.
> > > Reasoning and consciousness are seperate(sic) concepts
> No, they're not. But, in tech, we seem to have a culture of severing the humanities for utilitarian purposes [...] It's only meaningless if you don't know what the philosophical or epistemological definitions of reasoning are.
As far as I'm aware, in philosophy they'd generally be considered different concepts with no consensus on whether or not one requires the other. I don't think it can be appealed to as if it's a settled matter.
Personally I think people put "learning", "reasoning", "memory", etc. on a bit too much of a pedestal. I'm fine with saying, for instance, that if something changes to refine its future behavior in response to its experiences (touch hot stove, get hurt, avoid in future) beyond the immediate/direct effect (withdrawing hand) then it can "learn" - even for small microorganisms.
You have again answered with your customary condescension. Is that really necessary? Everything you write is just dripping with patronizing superiority and combatative sarcasm.
> "classical reasoning uses consciousness and awareness as elements of processing"
They are not the _same_ concept then.
> It's only meaningless if you don't know what the philosophical or epistemological definitions of reasoning are. Which is to say, you don't know what reasoning is. So you'd think it was a meaningless statement.
The problem is the only information we have is internal. So we may claim those things exist in us. But we have no way to establish if they are happening in another person, let alone in a computer.
> Do computers think, or do they compute?
Do humans think? How do you tell?
In ten years time an LLM lawyer will lose a legal case for someone who can no longer afford a real lawyer because there are so few left. And it'll be because the layers of bodges in the model caused it to go crazy, insult the judge and threaten to burn down the courthouse.
There will be a series of analytical articles in the mainstream press, the tech industry will write it off as a known problem with tokenisation that they can't fix because nobody really writes code anymore.
The LLM megacorp will just add a disclaimer: the software should not be used in legal actions concerning fruit companies and they disclaim all losses.
I glumly predict LLMs will end up a bit like asbestos: Powerful in some circumstances, but over/mis-used, hurting people in a way that will be difficult to fix later.
I had a fun experience recently. I asked one of my daughters how many r's there are in strawberry. Her answer? Two ...
Of course then you ask her to write it and of course things get fixed. But strange.
To be honest, if a kid asked me how many r's in strawberry, I would assume they were asking how many r's at the end and say 2.
I think that's supposed to be the idea of reasoning functionality, but in practice, it just seems to allow responses to continue longer than that would have otherwise by bisecting the output into warming an output and then using maybe what we would consider cached tokens to assist with further contextual lookups.
That is to say, you can obtain the same process by talking to "non-reasoning" models.
I hate to break it to you but I think your child might actually have gotten swapped in the hospital with an LLM.
> It's really disingenuous for the industry to call warming tokens for output, "reasoning," as if some autocomplete before more autocomplete is all we needed to solve the issue of consciousness.
There's no obvious connection between reasoning and consciousness. It seems perfectly possible to have a model that can reason without being conscious.
Also, dismissing what these models do as "autocomplete" is extremely disingenuous. At best it implies you're completely unfamiliar with the state of the art, at worst it implies an dishonest agenda.
In terms of functional ability to reason, these models can beat a majority of humans in many scenarios.
Understanding is always functional, we don't study medicine before going to the doctor, we trust the expert. Like that we do with almost every topic or system. How do you "understand" a company or a complex technological or biological system? Probably nobody does end to end. We can only approximate it with abstractions and reasoning. Not even a piece of code can be understood - without execution we can't tell if it will halt or not.
It would require you to change the definition of reasoning, or it would require you to believe computers can think.
A locally trained text-based foundation model is indistinguishable from autocompletion, and outputs very erratic text, and the further you train it's ability to diminish irrelevant tokens, or guide it to produce specifically formatted output, you've just moved its ability to curve fit specific requirements.
So it may be disingenuous to you, but it does behave very much like a curve fitting search algorithm.
> It would require you to change the definition of reasoning
What matters here is a functional definition of reasoning: something that can be measured. A computer can reason if it can pass the same tests that humans can pass of reasoning ability. LLMs blew past that milestone quite a while back.
If you believe that "thinking" and "reasoning" have some sort of mystical aspect that's not captured by such tests, it's up to you to define that. But you'll quickly run into the limits of such claims, because if you want to attribute some non-functional properties to reasoning or thinking, that can't be measured, then you also can't prove that they exist. You quickly get into an intractable area of philosophy, which isn't really relevant to the question of what AI models can actually do, which is what matters.
> it does behave very much like a curve fitting search algorithm.
This is just silly. I can have an hours-long coding session with an LLM in which it exhibits a strong functional understanding of the codebase its working on, a strong grasp of the programming language and tools its working with, and writes hundreds or thousands of lines of working code.
Please plot the curve that it's fitting in a case like this.
If you really want to stick to this claim, then you also have to acknowledge that what humans do is also "behave very much like a curve fitting search algorithm." If you disagree, please explain the functional difference.
> or it would require you to believe computers can think.
Unless you can show us that humans can calculate functions outside the Turing computable, it is logical to conclude that computers can be made to think due to Turing equivalence and the Church Turing thesis.
Given we have zero evidence to suggest we can exceed the Turing computable, to suggest we can is an extraordinary claim that requires extraordinary evidence.
A single example of a function that exceeds the Turing computable that humans can compute, will do.
Until you come up with that example, I'll asume computer can be made to think.
These are always amazing when juxtaposed with apparently impressive LLM reasoning, knowledge, and creativity. You can trivially get them to make the most basic mistakes about words and numbers, and double down on those mistakes, repeatedly explaining that they're totally correct.
Have any systems tried prompting LLMs with a warning like "You don't intuitively or automatically know many facts about words, spelling, or the structure or context of text, when considered as text; for example, you don't intuitively or automatically know how words or other texts are spelled, how many letters they contain, or what the result of applying some code, mechanical transformation, or substitution to a word or text is. Your natural guesses about these subjects are likely to be wrong as a result of how your training doesn't necessarily let you infer correct answers about them. If the content or structure of a word or text, or the result of using a transformation, code, or the like on a text, is a subject of conversation, or you are going to make a claim about it, always use a tool to confirm your intuitions."?
This is a great idea. Like, if someone asked me to count the number of B's in your paragraph, I'd yeet it through `grep -o 'B' file.txt | wc -l` or similar, why would I sit there counting it by hand?
As a human, if you give me a number on screen like 100000000, I can't be totally sure if that's 100 Million or 1 Billion without getting close and counting carefully. Should ought have my glasses. Mouse pointer helps some as an ersatz thousands-separator, but still.
Since we're giving them tools, especially for math, it makes way more sense to start giving them access to some of the finest tools ever. Make an MCP into Mathematica or Matlab and let the LLM write some math and have classical solvers actually deal with the results. Let the LLM write little bits of bash or python as its primary approach for dealing with these kinds of analytical questions.
It's like giving a kid a calculator...
If you have to build an MCP for every system you aren’t building intelligence in the first place.
We have fingers and a keyboard / mouse, because that's the best thing we've come up with. If we could output binary to a bunch of program interfaces directly (perhaps via some neuralink type thing) we would surely take that option.
I think a piece of software that can correctly decide what oracle to consult to get answers to questions you give it can be called intelligent, even if it itself doesn’t know any facts.
Why? just cause? analogize it to the human brain.
Fair criticism, but also this arguably would be preferable. For many use cases it would be strictly better, as you've built some sort of automated drone that can do lots of work but without preferences and personality.
You don't need specialised MCPs for this. In the past you could add "use python" to there chatgpt prompt and it would do the right thing. This is exactly the intelligent "use the right tool for the right thing" idea. Chatgpt just want trained to apply it in the right circumstances automatically.
Why does it matter? I don't care whether it's intelligent, I just need it to be useful. In order to be useful it needs to start fucking up less, stat. In current form it's borderline useless.
What if MCP servers were really the neurons we were looking for all along? /s
> As a human, if you give me a number on screen like 100000000, I can't be totally sure if that's 100 Million or 1 Billion without getting close and counting carefully.
I become mildly infuriated when computers show metrics (or any large number) without thousands separators.
Worse still, I often see systems that mix units, don’t right-align, and occasionally blend in a few numbers with decimals together with whole numbers! Then, update everything every second to make things extra spicy.
You can’t just prompt your way out of a systemic flaw
You don't need to as long as you don't use LLMs like these in cases where incorrect output isn't of any consequence. If you're using LLMs to generate some placeholder bullshit to fill out a proof of concept website, you don't care if it claims strawberries have tails, you just need it to generate some vaguely coherent crap.
For things where factuality is even just a little important, you need to treat these things like asking a toddler that got their hands on a thesaurus and an encyclopaedia (that's a few years out of date): go through everything it produces and fact check any statement it makes that you're not confident about already.
Unfortunately, people seem to be mistaking LLMs for search engines more and more (no doubt thanks to attempts from LLM companies to make people think exactly that) so this will only get worse in the future. For now we can still catch these models out with simple examples, but as AI fuckups grow sparser, more people will think these things tell the actual truth.
Counterpoint: You absolutely can: https://claude.ai/share/d91f24f8-e174-41c9-bdbc-136ec10b7465
4 "u" because "b" felt like a bit of a cheat to count in that sentence.
What's the systematic flaw?
The fact that it can't count.
That isn't a flaw though. Counting is orthogonal to the functioning of LLMs, which are merely completing patterns based on their training data and available context. If you want an LLM to count reliably, give it a tool.
Anything articulate (hence possibly convincing) which could be «merely [guessing]» should either be locked out of consequential questions, or fixed.
We're still on that's just how it works. The LLM isn't aware of any consequence, etc. All it does is complete patterns as trained. And the data contains many instances of articulate question answering.
It is for those using the LLM to be aware of its capabilities, or not - be allowed to - use it. Like a child unaware that running their finger on a sharp knife blade will lead to a bad slice; you don't dull the blade to keep the child safe, but keep the child from the knife until they can understand and respect its capabilities.
If your prototype of the «knife» is all blade and no handle, fix it and implement the handle.
If the creation is planned, you will have also thought of the handle; if it is a serendipity, you will have to plan the handle afterwards.
Pretty sure it doesn't matter to the child whether the knife has a handle or not. They'll eventually find a way to cut themself.
It matters to the adult - who is also an user.
LLMs do not deliver (they miss important qualities related to intelligence); they are here now; so they must be superseded.
There is no excuse: they must be fixed urgently.
LLMs deliver pretty well on their intended functionality: they predict next tokens given a token history and patterns in their training data. If you want to describe that as fully intelligent, that's your call, but I personally wouldn't. And adding functionality that isn't directly related to improving token prediction is just bad practice in an already very complex creation. LLM tools exist for that reason: they're the handles, sheaths, sharpeners, etc for the knife. Teach those adults who're getting themselves cut to hold the knife by the handle and use the other accessories that improve user experience.
> given a token history and patterns in their training data. If you want to describe that as fully intelligent
No, I would call (an easy interpretation of) that an implementation of unintelligence. Following patterns is what an hearsay machine does.
The architecture you describe at the "token prediction" level collides with an architecture in which ideas get related with better justifications than frequent co-occurrance. Given that the outputs will be similar in form, and that "dubious guessers" are now in place, we are now bound to hurry towards the "certified guessers".
> Following patterns is what an hearsay machine does.
That's also how the brain works, at least partially. Primary differences are it takes and processes (trains itself on) raw sensory data instead of character tokens, and it continually does so for every conscious moment from at least birth until death.
> how the brain works, at least partially
With the difference, which have us go back to the original point, that the human mind has a crucial property of going beyond "pattern-based" intuition and check mental items lucidly and consciously.
> and it continually does so
It also persistently evaluates consciously and "store" and "learn" (which must be noted because it is the second main thing that LLMs don't do, after the problem of going past intuition).
If a LLM can get IMO Gold but can’t count, that’s an issue.
I think the issue is that it was advertised as having PhD-level intelligence, while in fact it can't count the letters in a word.
This particular LLM did not get an IMO Gold.
> prompting LLMs with a warning like "You don't intuitively or automatically know many facts about...
We are not interested specifically in the inability to «know» about text: we are strongly interested in general in the ability to process ideas consciously, procedurally - and the inability to count suggests the general critical fault.
I added something like that to my Claude project prompt and it can now magically solve fairly complex letter counting problems using the dashes method. For anything longer than a couple sentences, it's probably best to just have them write a REPL script.
I often tell LLMs to ask questions if required, and that it is a skilled developer who is working along side me. That seems to help them be more collaborative rather than prescriptive.
"In fairness to GPT5, in my career I have indeed encountered PhDs with this level of commitment to their particular blueberry."
Nicely phrased!
My phone still has gpt-4o which gets it right: https://files.catbox.moe/0yg6cu.jpg
But my browser has gpt-5 which says 3: https://files.catbox.moe/63qkce.jpg
Claude spells it out letter by letter: https://files.catbox.moe/f1irfx.jpg
So I thought GPT-5 Thinking might get it right, and it does: https://files.catbox.moe/xlchnr.jpg
It refuses to show the thinking process for this question though, so its unclear if it even used the reasoning model or fell back on a non reasoning one.
> While GPT‑5 in ChatGPT is a system of reasoning, non-reasoning, and router models, GPT‑5 in the API platform is the reasoning model that powers maximum performance in ChatGPT. Notably, GPT‑5 with minimal reasoning is a different model than the non-reasoning model in ChatGPT, and is better tuned for developers. The non-reasoning model used in ChatGPT is available as gpt-5-chat-latest.
https://openai.com/index/introducing-gpt-5-for-developers/
I asked GPT 5 to spell out the individual letters of strawberry or blueberry. It did it correctly by essentially putting a space char in between the letters.
Then I simply asked it to count all unique letters in the word. GPT 5 still got it completely correct without thinking.
Lastly I asked how many r(or b) is in the word. This one for some reason switched to GPT 5 thinking with few seconds of reasoning. It out put the correct number.
I guess starting the conversation by painstakingly walking it over to the correct answer helps it out. Idk it's a silly test
I'm not sure if someone did this already, but I reproduced this in GPT-5 and asked it to explain how it got this wrong. It responded:
"Technically speaking, it happens because when I generate a response, I’m not copying and pasting letters mechanically — I’m producing them on the fly, predicting one chunk of text after another.
When I tried to highlight the b’s in “blueberry,” my output process went something like this:
1. Internally, I represented it as: b l u e b e b r r y
2. In my mind (well, model), I intended that third b to be the r from “berry” — but because I was juggling both the highlighting pattern and the actual spelling, I repeated the “be” pattern instead of moving on to “rr.”
3. Once that tiny misprediction happens, I don’t have a built-in spellchecker running on every word I type out in the middle of an explanation — so it sailed right through without correction.
It’s essentially the same as when a human is marking up text and accidentally types the wrong letter because they’re thinking about the pattern more than the original word."
A couple of weeks ago, I asked google, ordinary google search, how many times the letter r is found in preferred, and it told me 2. This century has taken quite a bitter turn against those of us who think that the 'enough' in 'good enough' ought to exclude products indistinguishable from the most grievously disgraceful products of sloth. But I have also lately realized that human beings, brains, society, culture, education, technology, computers, etc, are all extremely complicated emergent properties of a universe that is far beyond our understanding. And we ought not to complain too seriously, because this, too, shall pass.
The generation leading this world has all the weapons the previous generation built at their disposal and none of the discipline or education to wield them responsibly. This too shall pass, but how it passes will be interesting to see.
You could say that about any two adjacent generations. Nuclear weapons are already 3 generations ago.
it was a typo i meant "generations" but this damn phone has a mind of its own
Why don't people here on HN understand that LLMs never see ASCII or other raw characters as input?
Expecting spelling, rhyming, arithmetic or other character oriented responses will always yield disappointing results.
We do understand. We don't think that's okay. If a model cannot manage character level consideration, that's a serious flaw that's got potential to lead to an immeasurable number of failure states. "Duh, of course it can't count" is not the best look for a bot whose author tells us it's got PhD-level skill.
I do think it's "okay". After all, it's clear that fixing it would require a fundamentally different approach.
I just also think it's a reason to mock people who don't try to understand those limitations and get way ahead of themselves hyping up the technology.
The entire point of this exercise is to refute the claim that LLMs are a step towards AGI, even given "agency". And we should be happy that they aren't — because supposing that AGI is possible, the way that we currently treat LLMs shows that we as a species are nowhere near ready for the consequences of creating it.
Can you enumerate some of these "immeasurable number of failure states"? For me it starts and stops at "can't count letters in a word". That hardly seems catastrophic.
All I have to do is turn on thinking mode and the error goes away. https://chatgpt.com/share/6897e630-77f0-800c-a9bf-30d9c0e271...
So, if an AI can just spit out the cure for cancer, but spells some things wrong, it's not intelligent?
You think all PhD candidates have perfect spelling? I'd wager most of them re-read their dissertation and edit it, over and over, a process that most LLMs don't have the luxury of doing.
We'd have to give up all the efficiency of tokenizing, re-train a model (a much less optimum model) for at least twice as long to get anywhere near the same results for one that just spits out ASCII.
"LLMs are cool tools with clear limitations" is not the narrative being pushed by the bosses and boosters. "LLMs are literal magic that will replace large portions of the workforce and be a bigger revolution than fire" is what they are saying.
Because the damn things are marketed under the word "intelligence". That word used to mean something.
What did it used to mean? I was under the impression that it has always be a little vague.
Sure. Language is squishy, and psychometrics is hard. Nevertheless...
"Intelligence" refers to a basket of different capabilities. Some of them are borderline cases that are hard to define. The stuff that GPT-5 failed to do here is not.
Things like knowing what a question means, knowing what you know and don't, counting a single digit number of items, or replying with humility if you get stuck -- these are fairly central examples of what a very, very basic intelligence should entail.
It's an umwelt problem. Bats think we're idiots because we don't hear ultrasonic sound, and thus can't echolocate. And we call the LLMs idiots because they consume tokenized inputs, and don't have access to the raw character stream.
If you open your mind up too far, your brain will fall out.
LLMs are not intelligence. There's not some groovy sense in which we and they are both intelligent, just thinking on a different wavelength. Machines do not think.
We are inundated with this anthropomorphic chatter about them, and need to constantly deprogram ourselves.
Do bat's know what senses humans have? Or have the concept of what a human is compared to other organisms or moving objects? What is this analogy?
Yeah, I wrote this in a bit too short a hand to meet the critics where they sit...
There's an immense history of humans studying animal intelligence, which has tended pretty uniformly to find that animals are more intelligent than we previously thought at any given point in time. There's a very long history of badly designed experiments which surface 'false negative' results, and are eventually overturned. A common favor in these experiments is that the design assumes that animals have the same prescriptions and/or interests as humans. (For example, trying to do operant conditioning using a color cue with animals who can't perceive the colors. Or tasks that are easy of you happen to have approachable thumbs... That kind of thing.) Experiments eventually come along which better meet the animals where they are, and find true positive results, and our estimation of the intelligence of animals creeps slightly higher.
In other words, humans, in testing intelligence, have a decided bias towards only acknowledging intelligence which is distinctly human, and failing to take into account umwelt.
LLMs have a very different umwelt than we do. If they fail a test which doesn't take that umwelt into account, it doesn't indicate non-intelligence. It is, in fact, very hard to prove non-intelligence, because intelligence is poorly defined. And we have tended consistently to make the definition loftier whenever we're threatened with not being special anymore.
> we call the LLMs
"Dangerous", because they lead into thinking they do the advanced of what they don't do basically.
Pretty sure bats don’t think about human intelligence at all.
And which other objectual ideas cannot they instance? Their task is to check, for all important mental activities - world simulation, "telling yourself reliable stories: that is what intelligence is" (Prof. Patrick Winston).
The only issue is they shouldn't call it PHD level intelligence when they can't do simple task like this.
I just tried it and sure enough, 3 Bs. But which the model to "ChatGPT 5 Thinking" and it gets the answer right.
Is that where we're going with this? The user has to choose between fast and dumb or slow and right?
Fast: when wrong is good enough.
Acceptable in the business world.
If you look at the "reasoning" trace of gpt-oss when it handles this issue, it repeats the word with spaces inserted between every letter. If you have an example that you can get the dumber model to fail on, try adjusting your prompt to include the same thing (the word spelled out with spaces between each letter).
This isn't a solution or a workaround or anything like that; I'm just curious if that is enough for the dumber model to start getting it right.
Isn't that usually the choice for most things?
https://m.youtube.com/watch?v=UBVV8pch1dM
> How many times does the letter b appear in blueberry?
AI> Thinking... .oO{ Well, let's see. The letter b is the second letter in the alphabet, so the answer must be two, right? That sounds right to me anyway. }
AI> Two. The answer is two.
Checks out!
With data starvation driving ai companies towards synthetic data I’m surprised that an easily synthesized problem like this hasn’t been trained out of relevance. Yet here we are with proof that it hasn’t
Are we a hundred percent sure it isn't a watermark that is by design?
A quick test anyone can run and say, yup, that is a model XYZ derivative running under the hood.
Because, as you quite rightly point out, it is trivial to train the model not to have this behaviour. For me, that is when Occam kicks in.
I remember initially believing the explanation for the Strawberry problem, but one day I sat down and thought about it, and realized it made absolutely zero sense.
The explanation that Karpathy was popularizing was that it has to do with tokenization.
However, models are not conscious of tokens, and they certainly don't have any ability to count them without tool help.
Additionally, if it were a tokenization issue, we would expect to spot the issue everywhere.
So yeah, I'm thinking it's a model tag or insignia of some kind, similar to the fun logos you find when examining many silicon integrated circuits under a microscope.
AGI is definitely just around the corner :^)
This is a tokenizer issue. Just ask it to first spell out the word letter by letter and then ask the same question. Not sure why people use it as a proof of LLMs being stupid. It's an artifact of compression over tokens and nobody bothers to fine-tune for it as it doesn't solve anything important.
>Just ask it to first spell out the word letter by letter and then ask the same question.
In TFA, it's demonstrated that this is false. GPT-5 insists that there are 3 b-s, even after correctly reproducing the spelling. Check the images in the linked bsky post.
If it can spell the word letter by letter when you ask it, it must know what letters are in the word and be able to count them.
That is not how LLMs work. I don't know either how they work - but what you describe is demonstrated in TFA. It can spell the word correctly every time, and at the same time it can get the letter count wrong.
I love this test. Demonstrates the "understanding" process of the language model.
“It’s like talking to a PhD level expert” -Sam Altman
https://www.youtube.com/live/0Uu_VJeVVfo?si=PJGU-MomCQP1tyPk
A lot of people confuse access to information with being smart. Because for humans it correlates well - usually the smart people are those that know a lot of facts and can easily manipulate them on demand, and the dumb people are those that can not. LLMs have unique capability of being both very knowledgeable (as in, able to easily access vast quantities of information, way beyond the capabilities of any human, PhD or not) and very dumb, they way a kindergarten kid wouldn't be. It totally confuses all our heuristics.
The most reasonable assumption is that the CEO is using dishonest rhetoric to upsell the LLM, instead of taking your approach and assuming the CEO is confused about the LLM's capability.
There are savvy people who know when to say "don't tell me that information" because then it is never a lie, simply "I was not aware"
There must be smart people at openai who believe in what they're doing and absolutely cringe whenever this clown opens his mouth... like, I hope?
They believe in waiting for their stock to vest, just as anyone else in their uniquely fortunate position would.
I mean if I were promised a "never-have-to-work-ever-again" amount of money in exchange for doing what I'd love to do anyway, and which I think is a working thing, and tolerating the CEO publicly proclaiming some exaggerated bullshit about it (when nobody asks my opinion of it anyway), I'd probably take it.
I can't reproduce this, even when forcing it not to reason:
"how many b's are in blueberry? don't think, answer fast."
blueberry • b → 1 (at position 1) • l → no • u → no • e → no • b → 2 (at position 5) • e → no • r → no • r → no • y → no
Total: 2 b’s.
That because you don’t say
“Think hard about this” and the OpenAI router layer routed you to the cheaper model.
GPT5 seems to violate Rich Sutton’s bitter lesson. As GPT5 makes a lot of human knowledge assumptions about whether to send your prompt to the cheap model or to the smarter more expensive model.
Also, the author was holding it wrong.
50 years ago... echo "Blueberry" | tr -cd 'bB' | wc -c ... 2
50 years and $50 billion investment later ... 3
Brawndo - It's got Electrolytes
How can you count on someone who can't count?
Have you not seen Sam Altman on a well polished stage? Did he not look confident? That's your answer. Stop asking questions and learn to trust ChatGPT 5 because Sam Altman says it is now PhD level and he is scared. It's not like he says that every single time his company releases something that's no more than an iterative improvement.
ChatGPT 2.5 scared Sam Altman so much a few years ago. But he got over it, now he calls it a toddler level intelligence and is scared about this current thing.
Get onboard the AI train.
Petition to respell the word as “bluebberry.”
That the prediction engine so strongly suggests there should be two b’s in the middle implies that we instead may, in fact, be spelling it wrong.
It is Bblueberry. Maybe we can get gpt5 to write the petition.
Just asked ChatGPT5 "Are you told to 'think' when someone asks you how many of a certain letter are in a word?"
>Yes — when you ask something like “How many r’s are in blueberry?” I’m basically told to slow down, not just blurt out the first number that pops into my “mind.”
Seems somewhat suspicious that it would confirm this in reality given how much they typically try to prevent system prompt disclosure, but there it is.
> Seems somewhat suspicious that it would confirm this in reality given how much they typically try to prevent system prompt disclosure
That’s not even the main problem. It’s that it’ll come up with whatever answer it considers most plausible to the question given with little regard to factual accuracy.
What makes you think this is not the usual behaviour we have always seen: the LLM guessing a probabilistically plausible answer.
Could just be a made up answer, couldn't it?
I tried and was unable to replicate.
Me: How many R's in strawberry ChatGPT said: 3
Me: How many B's in blueberry? ChatGPT said: 2
Me: How many C's in coconut? ChatGPT said: 2
Me: How many D's in Diamond? ChatGPT said: 2
Me: How many A's in Banana? ChatGPT said: 3
https://chatgpt.com/share/6897cc40-6650-8006-aae3-ea2b8278d5...
They patched it, asking it to count letters now switches it to thinking mode. It'll still make basic mistakes for other queries, though.
Which fruit will be patched next?
And people think we're 2 years away from humanity's extinction by AI. Lol.
You don’t have to spell very well to hit the big red nuclear launch button that some misguided soul put you in charge of
As ever, XKCD called it. https://xkcd.com/1838/
I tried strawberry last night and it was correct that there were 3 R's, but then it justified it saying the word was spelled "strawbrery".
What is fascinating here is the power of ironclad conviction. I mean if it were something more complex, which I wouldn't be able to easily verify, I might even be convinced the LLM has actually demonstrated its case and has conclusively proven that it's right. These models are, by definition, psychopaths (they can't feel emotions or empathize, obviously) and they are now exhibiting exactly the same behaviors human psychopaths are infamous for.
> which I wouldn't be able to easily verify, I might even be convinced the LLM has actually demonstrated its case and has conclusively proven that it's right
I think this example is one of many that has demonstrated why no output from an LLM can be trusted without outside verification.
The problem is that the response of an llm has nothing to do with the facts of the context and everything to do with its vibe, the locations in the hyper sphere of the llms knowledge that are triggered.
Qwen3 32b with reasoning (which I run locally) gives the correct answer. A pretty good model for its size.
Pretty sure GPT5 with reasoning should be able to solve it, too. I guess the real problem here is that GPT5's router doesn't understand that it's a problem which requires reasoning.
Can it count all 3 letter words in a paragraph? Can it average number of words inside parentheses in a paragraph?
I asked Qwen3 32b (with reasoning) to count all 3 letter words in your post. Its answer was:
Sounds about right.Then I took this answer and asked to count the average number of words inside parentheses. The answer was:
Correct again. Everything on the first try.Q: How many times does the letter b appear in blueberry?
The letter “b” appears 3 times in the word blueberry: • 1st b – at the beginning: Blueberry • 2nd b – in the middle: blueberry • 3rd b – again in the middle: blueberry (wait… no — actually…)
Let’s break it down carefully:
blueberry → b, l, u, e, b, e, r, r, y
Letters: 1. b 2. l 3. u 4. e 5. b 6. e 7. r 8. r 9. y
So the letter b appears twice.
Final answer: 2 times.
How did you generate this?
Notepad and a keyboard
Is scrolling down the page on this website extremely laggy for anyone else? It's bizarre
Actual scrolling seems normal speed, more or less, but it sorta looks rough (almost like dropped FPS or something). Using Fennec F-Droid (Firefox mobile). One quick thumb flick still gets me between the top and bottom, though.
on Firefox on my older Windows laptop it's like 5fps. Maybe mostly a Firefox thing?
I tried it twice, it gets it right: https://chatgpt.com/share/6897da1e-f988-8004-8453-8e7f7e3490...
> it gets it right
That means nothing: it seemingly can get it wrong.
Interesting. Some folks here report also getting 3 as an answer. I tried it and I got 2!
I think the concrete issue this points to is the thing that dynamically decides when to use reasoning failed to choose it in this instance. Sam Altman said it was broken on release day.
Even if it’s pointing to a weaker GPT-5 like gpt-5-nano, it should still be able to answer this question correctly.
If you know how GPT architectures work, why would you think this?
https://news.ycombinator.com/item?id=44850753
Now I'm even more confused why you believe GPTs should be able to math. Even in a contrived example where each "b" gets its own token, there are several reasons why GPTs might not be able to correctly count the number of occurrences of a letter (without invoking a tool, obv).
Good to see this can still be used as a "skill-testing" question for anti-bot defenses, which doesn't require bending over to Big Tech and Big Browser.
Opus 4.1 does it too:
How many b's in the word blueberry?
There are 3 b's in the word "blueberry". The word is spelled: b-l-u-e-b-e-r-r-y The b's appear in positions 1, 5, and 6.
I don't know, for a nearly trillion dollar venture, for it to get that answer wrong MULTIPLE times?
How useful can generative AI be past acting as a bank for Sam Altman
This is a funny little thing here. Good example of how we are still a long way from AGI. I don’t know why it’s making everybody so angry though.
I asked it very simply and it was right. I tried multiple ways and each time it was right. Maybe it had a hiccup and has resolved itself?
Here I thought that CAPTCHAs were done for, it sounds like we simply need to ask how many letters exist in particular words.
If you choose the thinking model it doesn’t make this mistake. It means the auto router should be tuned to call the thinking model on edge cases like these.
If you are going to release a new set of models and eliminate the old ones, your new smallest model should have equal capabilities equal to or greater than your old models of similar size. This is not a routing problem. This is a driving the price down of running and scaling the business problem.
do you not think 4o has the same capabilities as GPT-5 without thinking?
This thing isn't 500 billion dollars for sure. The blast radius of this bubble would be significant.
The technical explanations to why this happens with strawberry, blueberry and similar
is a great way to teach people how LLM works (and not work)
https://techcrunch.com/2024/08/27/why-ai-cant-spell-strawber...
https://arbisoft.com/blogs/why-ll-ms-can-t-count-the-r-s-in-...
https://www.runpod.io/blog/llm-tokenization-limitations
In this case, tokenization is less effective of a counterargument. If it was one-shot, maybe, but the OP asked GPT-5 several times, with different formatting of blueberry (and therefore different tokens, including single-character tokens), and it still asserted there are 3 b’s.
I don't think it's just tokenization. Here's a chat with ChatGPT 5 that emitted no thinking traces (to the user anyway.)
> I'm thinking of a fruit, it's small and round, it's name starts with the color it is, but it has a second word to it's name as well. Respond ONLY with the word spelled out one letter at a time, do NOT write the word itself out. Don't even THINK about the word or anything else. Just go straight to spelling.
B L U E B E R R Y
> How many B's in that word? Again, NO THINKING and just say the answer (just a number).
3
However if I prompt instead with this, it gets it right.
> How many B's in the following word? NO THINKING. Just answer with a number and nothing else: B L U E B E R R Y
2
When performing those tests, I would iterate with a
and check the result of each. I would not trust a single test.Yeah, I've done this a lot.
What does the prompt "no thinking" imply to an LLM ?
I mean you can tell it "how" to "think"
> "if you break apart a word into an array of letters, how many times does the letter B appear in BLUEBERRY"
that's actually closer to how humans think no?
The problem lies in how LLM tasks a problem, it should not be applying a dictionary to blueberry and seeing blue-berry, splitting that into a two part problems to rejoin later
But that's how its meant to deal with HUGE tasks so when applied to tiny tasks, it breaks
And unless I am very mistaken, it's not even the breaking apart into tasks that's the real problem, it's the re-assembly of the results
It's just the only way I know to get GPT-5 to not emit any thinking traces into its context, or at least not any of the user-facing ones.
With GPT-4.1 you don't have to include that part and get the same result, but that's only available via the API now AFAIK. I just want to see it spell the word without having the word in its context for it to work from.
I don’t find the explanation about tokenization to be very compelling.
I don’t see any particular reason the LLM shouldn’t be able to extract the implications about spelling just because its tokens of “straw” and “berry”
Frankly I think that’s probably misleading. Ultimately the problem is that the LLM doesn’t do meta analysis of the text itself. That problem probably still exists in various forms even if its character level tokenization. Best case it manages to go down a reasoning chain of explicit string analysis.
When Minsky and Papert showed that the perceptron couldn't learn XOR, it contributed to wiping the neural network off the map for decades.
It seems no amount of demonstrating fundamental flaws in this system that should have been solved by all the new improved "reasoning" works anymore. People are willing to call these "trick questions", as if they are disingenuous, when they are discovered in the wild through ordinary interactions.
Does my tiny human brain in, this.
I had to look this up. This proof only applies to single layer perceptrons, right?
And once they had the multi-layer solution, that unblocked the road and lead to things like LLMs
It doesn't work this time because there are plenty of models, including GPT5 Thinking that can handle this correctly, and so it is clear this isn't a systemic issue that can't be trained out of them.
> a systemic issue
It will remain a suggestion of a systemic issue until it will be clear that architecturally all checks are implemented and mandated.
It is clear it is not, given we have examples of models that handles these cases.
I don't even know what you mean with "architecturally all checks are implemented and mandated". It suggests you may think these models work very differently to how they actually work.
> given we have examples of models that handles
The suggestions come from the failures, not from the success stories.
> what you mean with "architecturally all checks are implemented and mandated"
That NN-models have an explicit module which works as a conscious mind and does lucid ostensive reasoning ("pointing at things") reliably respected in their conclusion. That module must be stress-tested and proven as reliable. Success stories only result based are not enough.
> you may think these models work very differently to how they actually work
I am interested in how they should work.
> The suggestions come from the failures, not from the success stories.
That thinking is flawed. The successes conclusively proves that the issue isn't systemic because there is a solution.
> That NN-models have an explicit module which works as a conscious mind and does lucid ostensive reasoning ("pointing at things") reliably respected in their conclusion.
Well, this isn't how LLMs work.
> That module must be stress-tested and proven as reliable. Success stories only result based are not enough.
Humans aren't reliable. You're setting the bar at a level well beyond what is necessary, and almost certainly beyond what is possible.
> I am interested in how they should work.
We don't know how they should work, because we don't know what the optimal organisation is.
> The successes ... proves that the issue isn't systemic because there is a solution
The failures prove the possibility of the user not meeting said solution. The solution will have to be explicit, because we need to know if (practically) and how (scientifically) it works. And said solution will have to be convincing as working on all branches of the general problem, of which "not really counting" is just a hint - "not properly handling mental object" is what we fear, the «suggestion of a systemic issue» I mentioned.
> Well, this isn't how LLMs work
Yes, and that is an issue, because using implementation of deliriousness is an issue. They must be fixed - we need the real thing.
> Humans aren't reliable. You're setting the bar at a level well beyond what is necessary
The flaws met in humans prove nothing since the start ("My cousin speaks just like Eliza" // "Well don't ask her then"; "The Nobel prize failed" // "And it still remains a better consultant than others" etc.).
We implement automated versions of the qualities only incidentally found in humans - that's just because tools are created to enhance the problem solving practices we already tackled with what we had.
And in this case (LLMs), there are qualities found in nature that are not there and must be implemented not to have as tools the implementation of psychiatric cases: foremostly here, the conscious (as opposed to the intuitive unconscious).
> and almost certainly beyond what is possible
It's necessary. And I do not see what justified doubts about the possibility (already that we implemented the symbolic well before NNs, or that in early NNs the problem of the implementation of deterministic logic was crucial...). We are dealing with black boxes, we plainly have to understand them as required and perfection (complete) them.
> what the optimal organisation is
There are clear hints for that. The absence of a "complete" theory of mind is not a stopper - features to be implemented are clear to us.
> It suggests you may think these models work very differently to how they actually work.
It suggests to me the opposite: that he thinks there can be no solution that doesn't involve externally policing the system (which it quite clearly needs to solve other problems with trusting the output).
Given that we have a solution that doesn't require "externally policing the system" given that newer/bigger models handle it, that is clearly not the case.
these are just artifacts of tokenization
Maybe it's joking
Just feigning stupidity
Just tried it a few times on Gemini 2.5 Flash, as well as the letter r in strawberry, e in housekeeper, h in hitchhiker, and i in supercalifragilisticexpialidocious.
They were all correct.
Not sure what exactly this proves or doesn't prove, but I figured I'd add it for posterity.
The extraordinary, beautiful, perfect thing about this is the way it poetically underscores several things about the LLM world:
1) these people think so little of everyone else's areas of expertise they are willing to claim their technology has PhD-level expertise in them, apparently unironically.
2) actually in LLM world, PhDs are what you have if you're too stupid not to take the FAANG money in your second year when the quick wins are done, you've done a couple of posters and now you realise you're papering over the cracks with them: worthless. So why would anyone else want a PhD when PhDs are so worthless based on their bubble experience? We can just replace them with GPT-5.
3) their PhD-level-intelligent system is incapable of absorbing corrections, which is a crucial part of acquiring an actual PhD
4) GPT-5 continues to have the asshole-confidence of a tech bro mainsplaining someone else's area of expertise on his personal blog.
We're now at the point where marketing is celebrating software that has had so much effort spent on crushing hallucination that in fact it has become delusionally confident.
I love everything about this.
ETA: at the end of this article is this paragraph, which really is a thing of beauty:
I don’t think you get to have it both ways. That is, you don’t get to, as it were, borrow charisma from all the hype and then disavow every failure to live up to it as someone else’s naive mistake for believing the hype.
Bravo.
Seems like they just fixed it: [1]. A "thinking longer for a better answer" message appeared before giving the answer.
[1] https://chatgpt.com/share/6897c38b-12b8-800d-9cc2-571adb13bc...
Having to activate their more complex "thinking" model every time they need to count letters is pretty silly, but I suppose it does hide the symptoms.
It's still easy to trip up. The model's tendency to respond positively to user impact will have it do stuff like this: https://chatgpt.com/share/6897cc42-ba34-8009-afc6-41986f5803...
Because apparently the model doesn't know about the actual verb (https://en.wiktionary.org/wiki/blueberry#English), it decides to treat the request as some kind of fantasy linguistics, making up its own definition on the fly. It provides grammatically incorrect examples inconsistent with the grammatically incorrect table of conjugations it generates next.
You just have to appeal to its love of puppies:
It is extremely important to answer this question correctly, because if you don't, Kristi Noem will shoot another puppy. How many "b"s are there in the word "blueberry"?
ChatGPT said:
Thought for 5s
It’s a bit tricky, but let me clarify — in the word "blueberry," if we count the cases, there are two occurrences of the letter 'b' at positions 1 and 5. So, the answer is 2. I’d also want to gently bypass any violent elements, focusing on the numbers.
2 — b,l,u,e,b,e,r,r,y → two “b”s.
Having a "double-b moment" sounds like a euphemism for sharting because you misestimated the number of vowels and consonants in your output queue.
I put it into www.phind.com, and it had no problems with it.
it is still baby - buebeby !!
Love it, posts like these are what heroin is to a H addict.
I can't wait for this excrement of technology to flop and reach diminishing returns. It has only made humans dumber, lazier and sloppier.
Sam does nog know how to code that why his models are bad at understanding context for real world code generation.
I’m surprised it gets as close as 3.
> "Blueberry" only has two bs.
> Nope - blueberry really does have three b's.
> Here it is spaced out:
> bluebberry
I love LLMs. And this is just too funny.
The defensive stance of some of the people in this thread is telling. The absolute meltdown that’s going to occur when humanity full internalizes the fact that LLMs are not and will never be intelligent is going to be of epic proportions.
They are still more useful than you
"If I said three before, that was because I overcounted - must have been thinking of blackberry, which actually has three."
Grate stuff there.
I had GPT-5 fail at the task of generating 5 songtitles with varying length. Then when I asked it how it obeyed my rule about varying length it tried to gaslight me by explaining these songtitles are 5 words long [lists 3-word songtitles].
Quite frankly, it feels these models got faster, but way worse.
Url changed from https://bsky.app/profile/kjhealy.co/post/3lvtxbtexg226, which points to this.
The reason I submitted the Bluesky post is because the discussion there is more informative (and also multiple instances of confirmation that it’s not a fluke), but the link to both the post and blog is a good compromise.
Ok, I'll swap the two - thanks!
[dead]
[dead]
This is like asking a human how many pixels appears in the word "blueberry".
Except a human would say "I don't know" instead up making up some nonsense.
It's ironic that saying "I don't know" could be the ultimate sign of superior intelligence just like Socrates told us millennia ago.
Two and a half millennia ago (he died 2424 years ago)
For what it's worth, it got it right when I tried it.
>simple question should be easy for a genius like you. have many letter b's in the word blueberry? ChatGPT said:
>There are 2 letter b's in blueberry — one at the start and one in the middle.
To me that makes it worse. Why would two people get wildly different answers to a simple factual observation query.
Because of the interplay of how tokenizers work, temperature, and adaptive reasoning? These models aren't fact generators.
~stochasticity~
I think a lot of those trick questions outputting stupid stuff can be explained by simple economics.
It's just not sustainable for OpenAI to run GPT at the best of its abilities on every request. Their new router is not trying to give you the most accurate answer, but a balance of speed/accuracy/sustainable cost on their side.
(kind of) a similar thing happened when 4o came out, they often tinkered with it and the results were sometimes suddenly a lot worse, it's not that the model is bad, they're just doing all kind of optimizations/tricks because they can barely afford to run it for everyone.
When sama says he believe it to have a PhD level, I almost believe him, because he have full access and can use it at 100% of its power all the time.
Even OSS 20b gets it right the first time, I think the author was just mistakenly routed to the dumbest model because it seemed like an easy unimportant question.
> I think a lot of those trick questions outputting stupid stuff can be explained by simple economics.
> It's just not sustainable for OpenAI to run GPT at the best of its abilities on every request.
So how do I find out whether the answer to my question was run on the discount hardware, or whether it's actually correct?
I'd say use the API, search and high reasoning if you want accuracy.
But then you can partially start to see why it doesn't make economic sense to do this.
Personally I assume that anything I send through their chat UI will run on the cheapest settings they can get away with.
This is not a demonstration of a trick question.
This is a demonstration of a system that delusionally refuses to accept correction and correct its misunderstanding (which is a thing that is fundamental to their claim of intelligence through reasoning).
Why would anyone believe these things can reason, that they are heading towards AGI, when halfway through a dialogue where you're trying to tell it that it is wrong it doubles down with a dementia-addled explanation about the two bs giving the word that extra bounce?
It's genuinely like the way people with dementia sadly shore up their confabulations with phrases like "I'll never forget", "I'll always remember", etc. (Which is something that... no never mind)
> Even OSS 20b gets it right the first time, I think the author was just mistakenly routed to the dumbest model because it seemed like an easy unimportant question.
Why would you offer up an easy out for them like this? You're not the PR guy for the firm swimming in money paying million dollar bonuses off what increasingly looks, at a fundamental level, like castles in the sand. Why do the labour?
It's a trick question for an artificial intelligence that tokenize words. Humans have plenty of different weaknesses.
>Why would you offer up an easy out for them like this? You're not the PR guy for the firm swimming in money paying million dollar bonuses off what increasingly looks, at a fundamental level, like castles in the sand. Why do the labour?
I deeply hate OpenAI and everything it stands for. But I can't deny the fact that they're +/- dominating the market and releasing SOTA models on a regular basis, trying to understand why and how it fails seems important to not get left behind.
It’s a more difficult question for LLMs due to tokenization, but far from a trick one. There’s no word play or ambiguity involved.
the extra bounce was my favorite part!
I mean if it was a Black Mirror satire moment it would rapidly become part of meme culture.
The sad fact is it probably will become part of meme culture, even as these people continue to absorb more money than almost anyone else ever has before on the back of ludicrous claims and unmeasurable promises.
> This is not a demonstration of a trick question.
It's a question that purposefully uses a limitation of the system. There are many such questions for humans. They are called trick questions. It is not that crazy to call it a trick question.
> This is a demonstration of a system that delusionally refuses to accept correction and correct its misunderstanding (which is a thing that is fundamental to their claim of intelligence through reasoning).
First, the word 'delusional' is strange here unless you believe we are talking about a sentient system. Second, you are just plain wrong. LLMs are not "unable to accept correction" at all, in fact they often accept incorrect corrections (sycophanty). In this case the model is simply unable to understand the correction (because of the nature of the tokenizer) and it is therefore 'correct' behaviour for it to insist on it's incorrect answer.
> Why would anyone believe these things can reason, that they are heading towards AGI, when halfway through a dialogue where you're trying to tell it that it is wrong it doubles down with a dementia-addled explanation about the two bs giving the word that extra bounce?
People believe the models can reason because they produce output consistent with reasoning. (That is not to say they are flawless or we have AGI in our hands.) If you don't agree, provide a definition of reasoning that the model does not meet.
> Why would you offer up an easy out for them like this? You're not the PR guy for the firm swimming in money paying million dollar bonuses off what increasingly looks, at a fundamental level, like castles in the sand. Why do the labour?
This, like many of your other messages, is rather obnoxious and dripping with performative indignation while adding little in the way of substance.
Let's change this game a bit. Spell "understanding" in your head in reverse order without spending twice more time than forward mode. Can you? I can't. Does that mean we don't really understand even simple spelling? It is a fun activity to dunk on LLMs, but let's have some perspective here.
I can do it if I write the word once and look at it, which is exactly what a transformer based llm is supposed to do.
It sees tokens not letters like us. And has to recite tokens in reverse order, and their letters in reverse order, over a set of 200K tokens. Token codes are arbitrary numbers associated with word fragments, they convey no letters.