> When VLMs make errors, they don't make random mistakes. Instead, 75.70% of all errors are "bias-aligned" - meaning they give the expected answer based on prior knowledge rather than what they actually see in the image.
This is what I've been saying for a while now, and I think it's not just visual models. LLMs/transformers make mistakes in different ways than humans do, and that is why they are not reliable (which is needed for real-world applications). The rate of progress has not accounted for this... the improvements are in resolution, fidelity, and overall realism of the output, but not in the overall correctness and logical interpretation of the prompt. Personally I still cannot think of anything, prompt it, and get consistent results without a huge compromise on my initial idea.
E.g., I want a man walking with his left foot forward, and it renders a beautiful image of a man but completely ignores the left foot forward, no matter how I word the prompt. I have many examples like this. The only way I can use it is when I don't have a specific prompt and just want generic images. The stock image industry is certainly over, but it is uncertain whether these models will deliver on the promise of generating anything you can imagine that can be put into words.
> When VLMs make errors, they don't make random mistakes. Instead, 75.70% of all errors are "bias-aligned" - meaning they give the expected answer based on prior knowledge rather than what they actually see in the image.
Yeah, that's exactly what our paper said 5 years ago!
They didn't even cite us :(
"Measuring Social Biases in Grounded Vision and Language Embeddings" https://arxiv.org/pdf/2002.08911
I think the social biases in your paper (e.g. the angry black woman stereotype) are different from the cognitive biases about facts (e.g. number of legs, whether lines are parallel) that the OP is about.
Social biases are subjective. Facts are not.
As far as the model's concerned, there's not much difference. Social biases will tend to show up objectively in the training data because the training data is influenced by those biases (the same thing happens with humans, which is how these biases proliferate and persist).
I see a clear difference. One is objective (only one correct answer); the other is subjective (multiple plausible answers).
Hello 0xab,
Sorry that we missed your work. There is a lot of work in this area, both textual and visual, especially on social biases.
We wish we could mention them all, but space is limited, so one can often only discuss the most relevant ones. We'll consider discussing yours in our next revision.
Genuine question: would you categorize the type of bias in our work as "social"?
Well, you send a vaguely worded email like "I think you may find our work relevant", and everyone knows what that means and adds the citation.
It's easier to succeed if you ignore the issues and the users are not aware of them. The rate of evolution of "AI" recently is so fast that no one is stopping to do actual benchmarks and analysis of all the new models.
That's weird, you're at MIT. You're in the circle of people that's allowed to succeed.
I wouldn't think much about it, as it was probably a genuine mistake.
What does "allowed to succeed" mean?
Your work usually gets 1,000x the exposure and external validation compared to doing it outside those environments, where it would just get discarded and ignored.
Not a complaint, though. It's a requirement for our world to be the way it is.
Is there truth to this? Do you have any sources to link to on this?
Sure dude, here's the link to the UN Resolution about which researchers deserve attention and which others do not, signed by all countries around the world [1].
*sigh*
It's pretty obvious: if you publish something at Harvard, MIT, et al., you even get a dedicated PR team to make your research stand out.
If you publish that on your own, or at some small research university in Namibia, no one will notice.
I might be lying, though, 'cause there's no "proof".
What do you genuinely think they built upon from your paper?
If anything, the presentation of their results in such an accessible format next to the paper should be commended.
> LLMs/transformers make mistakes in different ways than humans do
Sure, but I don't think this is an example of it. If you show people a picture and ask "how many legs does this dog have?", a lot of people will look at the picture, see that it contains a dog, and say 4 without counting. The rate at which humans behave this way might differ from the rate at which LLMs do, but they both do it.
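To make "bias-aligned" concrete, here's a rough sketch of how such a rate could be tallied; the records and field layout below are made-up stand-ins, not the paper's actual harness or data:

```python
# Toy tally of "bias-aligned" errors: an error is bias-aligned when the model's
# wrong answer matches the prior-knowledge answer rather than what the
# (counterfactual) image actually shows. All records below are invented examples.
records = [
    # (question, prior_answer, answer_in_image, model_answer)
    ("How many legs does this dog have?", "4", "3", "4"),
    ("Are these two lines parallel?", "yes", "no", "yes"),
    ("How many legs does this dog have?", "4", "3", "5"),
    ("How many legs does this dog have?", "4", "4", "4"),  # correct answer, not an error
]

errors = [r for r in records if r[3] != r[2]]        # model disagrees with the image
bias_aligned = [r for r in errors if r[3] == r[1]]   # ...and agrees with the prior

print(f"errors: {len(errors)}, bias-aligned: {len(bias_aligned)}")
print(f"bias-aligned share of errors: {len(bias_aligned) / len(errors):.2%}")
```

(The quoted 75.70% is that last ratio, computed over the paper's actual benchmark rather than this toy list.)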
I don’t think there’s a person alive who wouldn’t carefully and accurately count the number of legs on a dog if you asked them how many legs this dog has.
The context is that you wouldn’t ask a person that unless there was a chance the answer wasn’t 4.
You deeply overestimate people.
The models are like a kindergartner. No, worse than that, a whole classroom of kindergartners.
The teacher holds up a picture and says, "and how many legs does the dog have?" and they all shout "FOUR!!" because they are so excited they know the answer. Not a single one will think to look carefully at the picture.
It's hilarious how off you are.
Exactly this. Humans are primed for novelty and being quizzed about things.
You have never seen the video of the gorilla in the background?
That's a specific example of how, when you draw a human's attention to something (e.g. counting the number of ball passes in a video), they hyper-fixate on that to the exclusion of other things. So it seems to make the opposite point from the one I think you're trying to make?
Ok? But we invented computers to be correct. It’s suddenly ok if they can look at an image and be wrong about it just because humans are too?
My point is that these LLMs are doing something that our brains also do. If you don't find that interesting, I can't help you.
Well, they’re getting the same result. I don’t particularly see why that’s useful.
All automation has ever been is an object doing something that a human can do, without needing the human.
The result is still wrong, though! It needs to be right to be useful!
The analogy should be an artist who can draw dogs, but when you ask them to draw a dog with three legs they completely fail and have no idea how to do it. The likelihood of that is really low: a trained artist will give you exactly what you ask for, whereas GenAI models can produce beautiful renders but fail miserably when asked for certain specific but simple details.
No, the example in the link is about asking the model to count the number of legs in the pic.
Ok, sure, but I'm trying to point out the gap in expectation, i.e. it's an expert artist but it cannot fulfill certain specific but simple requests.
https://chatgpt.com/s/m_683f6b9dbb188191b7d735b247d894df
I think this used to be the case, in the same way that you used to not be able to get a picture of a bowl of ramen without chopsticks, but I think the latest models account for this and are much better.
Link is broken, but I'll take your word for it. However, there is no guarantee that the general class of this problem is solved, because you can always run into something it can't do. Another example you could try is a glass HALF-full of wine. It just can't produce a glass with a 50% amount of wine in it; or, another example, a jar half-full of jam. It's the kind of thing where, if a human can draw a glass of wine, drawing it half-full is trivial.
ChatGPT can easily do that? When was the last time you tried?
I just tried with Flux.1 Kontext, which I assume is better than o3 at creating images, but I'll admit I didn't do extensive tests; it's more that I'm trying to do test projects. Maybe I'm having bad luck, but it doesn't seem that way.