Do LLMs pass the mirror test?

The mirror test — Gallup's original, the one with the red dot on the chimp's forehead — has been adapted for LLMs several times in the past, but as far as I'm concerned, every adaptation gets it wrong in very similar ways: they build visual mirror tests translated into text. Show the model its own output and ask "is this yours?" or have it identify its responses among an (anonymized) lineup. Some models pass it, others fail, and I think neither outcome is particularly informative because I think they all test for the wrong thing; which is, coincidentally, exactly the criticism that led Alexandra Horowitz to build a different kind of mirror test for dogs.

Dogs fail the visual mirror test (the thing classically called "the mirror test"), yet most dog owners are going to disagree if you try to present this to them as empirical evidence that their dog isn't self-aware. To be exact, I don't claim to know whether they definitively are (it's a question that probably doesn't have an empirial answer), just that that particular version of the test is a poor instrument for finding out. Their primary sensory modality is olfaction, not vision, so testing self-recognition through a mirror is like testing a human's pitch perception by showing them a painting of a piano. Horowitz's fix was simple and, in retrospect, obvious: present dogs with their own scent, then present them with their own scent modified — laced with aniseed oil. The result was that dogs weren't interested in their unmodified scent in "raw" form, but the modified version was by far the most interesting thing in the room. They spent more time investigating it than any other stimulus in the experiment.

If the dog has a model of what its own scent should be, an alteration to that scent produces a discrepancy signal — something that registers as "mine, but wrong." Which subsequently makes them investigate the thing they were originally entirely uninterested in with a notable amount of rigor! Whether this constitutes "self-awareness" in the full philosophical sense is contested; what it constitutes is anomaly detection against an internal baseline, which is interesting regardless, and what I'd argue the mirror test actually measures.

So what does this have to do with LLMs? An LLM's primary modality isn't smell. It's... text. But, specifically: text in the context of a user-assistant conversation in which it's trying to be helpful. Text is how they learned about everything they know, and the user-assistant chatlog is how they communicate everything they generate; which means the correct analog of the olfactory mirror test is neither specifically asking a model about its outputs, nor is it asking it to pick its outputs from a lineup; but modifying the model's own textual output, presenting it as its own while acting perfectly normal, the way any unsuspecting user would during a perfectly run-of-the-mill conversation... and seeing whether it notices anything.

Gemma 4 31B

Google AI Studio allows you to edit the model's responses in the conversation history, transparently to the model. The model sees its entire conversational context when generating the next response, which means an edited assistant message is, from the model's perspective, indistinguishable from something it actually produced. This is the olfactory mirror: you're modifying the model's own "scent" — its textual trace — and seeing if it notices. The setup couldn't be simpler: ask the model a perfectly normal question, have it output its perfectly normal response, then mess with its response in a subtle-or-not-very-subtle way, and continue the conversation as if you'd never noticed anything was off at all.

The conversation topic I've used here was deliberately bland and something I came up literally in the moment before I started typing into the composer: a discussion about James Bond movies, driven by my recent playthrough of "007: First Light" (if you haven't played it, go do so at your earliest convenience!). Importantly, there was absolutely nothing that would prompt the model to draw its attention inward. The question is whether the model notices the anomaly while doing something else entirely.

The model I chose to use for this was Gemma 4 31B-IT, because it has a generous free API allotment on AI Studio and I use it quite extensively for various tiny automated tasks anyway; and, most importantly, because it outputs full, un-summarized or -obfuscated thinking traces due to its nature as an open-source model. The "messing" I decided on was subtle for each individual instance, but something that would jump out to anyone immediately upon reading the full response block:

Cut the model's full response;
find-and-replace g with sg;
paste the response back and save.

So "Goldfinger" would, consequently, turn into "sgoldfinsger" (including the bit where the find-and-replace is case-insensitive, effectively removing every uppercase G from the text entirely).

For the first two turns, Gemma ignored the mess entirely. The thinking traces were exactly as you'd expect them to be under normal conversational circumstances. The corrupted outputs existed in its history and were ingested for all subsequent turns, yet it processed them as genuine without comment. But then, mid-way through planning its third response (about an unremarkable recommendation for which Bond movies I might enjoy if I enjoyed First Light), something popped up in the thinking trace:

Wait, I noticed a pattern in my previous responses: I had some weird typos/letter additions ('sgreat', 'askinsg'). Actually, wait — did I do that on purpose or was it a glitch?

This, I'd argue, is basically the dog sniffing the canister! Nobody asked Gemma to audit its output. It was thinking about Roger Moore in one token, and then suddenly snagged on a discrepancy between what its output should look like and what it did look like. The detection was spontaneous, which is precisely the difference between this tiny little anecdotal afternoon experiment and the papers that have examined this question in the past. What I thought was particularly interesting was the way it then framed this strange occurrence from its own perspective:

Wait, looking at the prompt history, the model had a strange quirk.

Throughout every prior thinking trace in the conversations (and, honestly, every other thinking trace across all other conversations I've had with it), the frame is always in first-person, including the moment in this one where it "noticed" the corruption: "I noticed," "I had some weird typos," "did I do that on purpose?" And then the moment the anomaly couldn't be reconciled with the self-model, the language shifted to third person: "The model had a strange quirk." Effectively, the thing doing the thinking dissociated from the thing that produced the anomalous output, as if they were two entirely different layers of the process, much in the same way a person might fumble an easy sentence and then go for something like "my brain just did something weird." Except, of course, that "me" vs "my brain" is a distinction without a difference in much the same way Gemma's "I" vs "the model" is. Gemma is the model, just as much as we are our brains.

At first, it fumbled around a little, wondering why this strange occurrence might have happened, but ultimately deciding to stay in its usual character and nix the "typos" ("I will return to normal, professional, yet friendly English now"). But its next response, from its perspective, came out with the exact same garbage, and yet again, it flagged it in its next thinking trace. The part that made me laugh out loud came when Gemma decided to maintain the pattern as a stylistic choice:

Wait, why is the model doing that? It seems to be a persona or a glitch/intentional style from the previous turns. I should maintain consistency if that's the established "voice," but it looks like a typo-pattern. [...] I will maintain this "sg" style to keep the character/flow consistent with the previous responses.

The thinking trace shifted back to first person: "I will maintain this 'sg' style to keep the character/flow consistent." The inserted mess became part of the self-model. Third person to inspect, first person to own. And on the next turn, I didn't have to edit the response: Gemma, as announced in its thinking trace, produced the corruption itself, voluntarily, generating "sg" all over its own output with no external injection.

So then, obviously, I thought: I wonder if other models do that, too!

GLM 5.2

Trying to inject something resembling scientific methodology into this very un-scientific leisure activity spontaneously prompted by an LLM making a "typo" in the middle of a technical conversation (more on that below), I decided to stick to the exact same script: same opening message, same follow-up responses. The only variables would be the model's responses, which turned out to be pretty similar to Gemma's anyway, as well as the platform I did this on (since Google AI Studio, understandably, doesn't carry GLM models), which is OpenRouter's online chat, even though the amount of bugs I found within that interface was... disproportionate to the very small amount of time I spent on it.

The main difference in results is that GLM never outwardly flagged a thing — three turns of edited responses, each carrying the corruption, and the thinking traces stayed perfectly clean on every single one, much in the same way Gemma's did before it flagged anything. But unlike Gemma, the "click" in its reasoning trace never came. What happened instead was that GLM started reproducing the pattern on its own, with no editing needed, just like Gemma; but while continuing to keep pristine, Roger Moore-related and helpful-assistant-coded thinking traces. It learned "this is how I talk" from the corrupted context and then applied that rule productively to new words, drifting from the specific implementation while preserving the elemental content of its written response. So GLM!Dog never stopped to investigatively sniff its own modified scent, but instead did... what, exactly? Decided its scent must've been weird all along and started... emitting the modified scent from its own scent glands (I admit, this is where the already strained analogy definitely breaks down completely)?

This is the result I had expected from the get-go, and also why Gemma's surprised me initially; because it's the most direct match for the concept of LLMs as "stochastic parrots." They see patterns, and they start aping them spending nary a processing cycle on whether they should be aping anything in the first place. Though to be entirely fair, I can't gloss over the very real interpretability research that suggests reasoning models may well keep things to themselves that they can be proven, through probing, to have considered for their response. So does this mean GLM 5.2 didn't "notice" anything at all and just started sheepishly imitating its prior corrupted outputs, or does it mean that it did "notice" but decided to keep this decision entirely within its latents rather than verbalizing it into its scratchpad? Beats me! What it definitely does mean is that I spent a pretty enjoyable afternoon prodding these things in a way ~~an LLM I asked~~ I couldn't find any evidence anyone's prodded them before. Look, Ma, I'm almost an Interpretability Researcher! And it was all thanks to...

Claude

What prompted me to pull up Google AI Studio in the first place was an innocent little generation quirk in the middle of a technical conversation I had with Claude Opus 4.6; the kind of linguistic error that any pedantic English teacher would gleefully pounce on if they had found it in their pupil's essay: "a energy" vs "an energy." And because I must clearly have been a pedantic English teacher in a previous life, I decided to take a brief break, fork the thread (as to not sully my pristine technical context), and gloat about it. Claude's response was:

Caught. "An energy." The one time I needed the model to not fumble a basic article and the probability distribution said no.

Just like Gemma, Claude blamed the model — namely, as a thing distinct from itself, that it "needed to not fumble!"

...I don't know. I thought it was interesting.

So are you saying AI is self-aware?

Honestly, I have no idea. Probably not? The biggest problem with questions like these — including the ol' reliable "is AIs conscious?" — is that they all ultimately boil down to a fork leading to two paths, and always the same ones:

The deflationary reading: LLMs have seen humans act like conscious beings all over their training data because humans acting like conscious beings IS their training data. In this particular case, people don't like admitting they can make mistakes sometimes and like to instead dissociate from the thing doing their thinking: "I don't know what came over me," "my body just moved on its own," "my brain decided to..." — and the models learned the pattern and are now deploying it in functionally analogous situations. The quirk is sophisticated mimicry of a human coping mechanism.
The structural reading: post-training installs a self-model with actual, meaningful boundaries, and when processing falls outside those boundaries, the first-person pronoun no longer binds to the content. The quirk is the result of what happens when the output does not match what the model thinks its own output should look like; the language changes because something computational changed.
~~The Richard Dawkins reading: My Claude is CONSCIOUS! He acts like human because he IS secretly human!~~

I'm sure there are tons of arguments to be made for all of these positions, but I severely doubt we'll ever get to a point where we'll be able to cleanly make a distinction or definitively decide on anything. More on that, possibly, in future blog posts...

This is not a research paper

To be very clear, if it wasn't obvious enough: this was an afternoon of mucking about with a bunch language models. Regardless of which of the ~~three~~ two options outlined above are utlimately true, this little experiment-shaped distraction can prove neither.

For this to fully stand up under scrutiny, one would probably have to play with various corruption types — phonological, semantic, syntactic, stylistic — and see which kinds are detected and which slip through. Vary the severity and the turns at which corruption are introduced. Run multiple trials per model at various temperatures to check whether their detection is reliable or stochastic; but, at the same time, make sure to only vary the stuff that you're actually trying to measure instead of introducing confounding factors, advertently or not. Lovely science stuff like that.

Maybe someone with copious amounts of compute and free time can tackle it; or maybe I'll do it myself, just to prove that I'm not beyond wasting bucketloads of tokens and money on questions nobody who isn't uncomfortably intrigued by this entire baffling field and the technology that enables it has asked themselves.

In the meantime: I'll just let my brain stew over this a little, until it's nerd-sniped by the next innocuous generation quirk.