> When VLMs make errors, they don't make random mistakes. Instead, 75.70% of all errors are "bias-aligned" - meaning they give the expected answer based on prior knowledge rather than what they actually see in the image.
This is what I've been saying for a while now, and I think it's not just visual models. LLMs/transformers make mistakes in different ways than humans do, and that is why they are not reliable (which is needed for real-world applications). The rate of progress has not accounted for this... the improvements are in resolution, fidelity, and overall realism of the output, but not in the overall correctness and logical interpretation of the prompt. Personally I still cannot think of anything, prompt it, and get consistent results without a huge compromise on my initial idea.
E.g., I want a man walking with his left foot forward, and it renders a beautiful image of a man but completely ignores the left foot forward, no matter how I word the prompt. I have many examples like this. The only way I can use it is when I don't have a specific prompt and just want generic images. The stock image industry is certainly over, but it is uncertain whether these models will deliver on the promise of generating anything you can imagine that can be put into words.
> When VLMs make errors, they don't make random mistakes. Instead, 75.70% of all errors are "bias-aligned" - meaning they give the expected answer based on prior knowledge rather than what they actually see in the image.
Yeah, that's exactly what our paper said 5 years ago!
They didn't even cite us :(
"Measuring Social Biases in Grounded Vision and Language Embeddings" https://arxiv.org/pdf/2002.08911
I think the social biases in your paper (e.g. the angry black woman stereotype) are different from the cognitive biases about facts (e.g. number of legs, whether lines are parallel) that the OP is about.
Social biases are subjective. Facts are not.
As far as the model's concerned, there's not much difference. Social biases will tend to show up objectively in the training data because the training data is influenced by those biases (the same thing happens with humans, which is how these biases proliferate and persist).
I see a clear difference. One is objective (only one correct answer); the other is subjective (multiple plausible answers).
Hello 0xab,
Sorry that we missed your work. There is a lot of work in this area, both textual and visual, especially on social biases.
We wish we could mention them all, but space is limited, so one can often only discuss the most relevant ones. We'll consider discussing yours in our next revision.
Genuine question: would you categorize the type of bias in our work as "social"?
Well, you send a vaguely worded email like "I think you may find our work relevant", and everyone knows what that means and adds the citation.
It's easier to succeed if you ignore the issues and the users are not aware of them. The rate of evolution of "AI" recently is so fast that no one is stopping to do actual benchmarks and analysis of all the new models.
That's weird, you're at MIT. You're in the circle of people that's allowed to succeed.
I wouldn't think much about it, as it was probably a genuine mistake.
What does "allowed to succeed" mean?
Your work usually gets 1,000x the exposure and external validation compared to doing it outside those environments, where it would just get discarded and ignored.
Not a complaint, though. It's a requirement for our world to be the way it is.
Is there truth to this? Do you have any sources to link to on this?
Sure dude, here's the link to the UN Resolution about which researchers deserve attention and which others do not, signed by all countries around the world [1].
*sigh*
It's pretty obvious: if you publish something at Harvard, MIT, et al., you even get a dedicated PR team to make your research stand out.
If you publish that on your own, or at some small research university in Namibia, no one will notice.
I might be lying, though, 'cause there's no "proof".
What do you genuinely think they built upon from your paper?
If anything, the presentation of their results in such an accessible format next to the paper should be commended.
> LLMs/transformers make mistakes in different ways than humans do
Sure, but I don't think this is an example of it. If you show people a picture and ask "how many legs does this dog have?", a lot of people will look at the picture, see that it contains a dog, and say 4 without counting. The rate at which humans behave this way might differ from the rate at which LLMs do, but they both do it.
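To make "bias-aligned" concrete, here's a rough sketch of how such a rate could be tallied; the records and field layout below are made-up stand-ins, not the paper's actual harness or data:

```python
# Toy tally of "bias-aligned" errors: an error is bias-aligned when the model's
# wrong answer matches the prior-knowledge answer rather than what the
# (counterfactual) image actually shows. All records below are invented examples.
records = [
    # (question, prior_answer, answer_in_image, model_answer)
    ("How many legs does this dog have?", "4", "3", "4"),
    ("Are these two lines parallel?", "yes", "no", "yes"),
    ("How many legs does this dog have?", "4", "3", "5"),
    ("How many legs does this dog have?", "4", "4", "4"),  # correct answer, not an error
]

errors = [r for r in records if r[3] != r[2]]        # model disagrees with the image
bias_aligned = [r for r in errors if r[3] == r[1]]   # ...and agrees with the prior

print(f"errors: {len(errors)}, bias-aligned: {len(bias_aligned)}")
print(f"bias-aligned share of errors: {len(bias_aligned) / len(errors):.2%}")
```

(The quoted 75.70% is that last ratio, computed over the paper's actual benchmark rather than this toy list.)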
I don’t think there’s a person alive who wouldn’t carefully and accurately count the number of legs on a dog if you asked them how many legs this dog has.
The context is that you wouldn’t ask a person that unless there was a chance the answer wasn’t 4.
You deeply overestimate people.
The models are like a kindergartner. No, worse than that, a whole classroom of kindergartners.
The teacher holds up a picture and says, "and how many legs does the dog have?" and they all shout "FOUR!!" because they are so excited they know the answer. Not a single one will think to look carefully at the picture.
It's hilarious how off you are.
Exactly this. Humans are primed for novelty and being quizzed about things.
You have never seen the video of the gorilla in the background?
That's a specific example of how, when you draw a human's attention to something (e.g. counting the number of ball passes in a video), they hyper-fixate on that to the exclusion of other things. So it seems to make the opposite point from the one I think you're trying to make?
Ok? But we invented computers to be correct. It’s suddenly ok if they can look at an image and be wrong about it just because humans are too?
My point is that these LLMs are doing something that our brains also do. If you don't find that interesting, I can't help you.
Well, they’re getting the same result. I don’t particularly see why that’s useful.
All automation has ever been is an object doing something that a human can do, without needing the human.
The result is still wrong, though! It needs to be right to be useful!
The analogy should be an artist who can draw dogs, but when you ask them to draw a dog with three legs they completely fail and have no idea how to do it. The likelihood of that is really low: a trained artist will give you exactly what you ask for, whereas GenAI models can produce beautiful renders but fail miserably when asked for certain specific but simple details.
No, the example in the link is about asking the model to count the number of legs in the pic.
Ok, sure, but I'm trying to point out the gap in expectation, i.e. it's an expert artist but it cannot fulfill certain specific but simple requests.
https://chatgpt.com/s/m_683f6b9dbb188191b7d735b247d894df
I think this used to be the case, in the same way that you used to not be able to get a picture of a bowl of ramen without chopsticks, but I think the latest models account for this and are much better.
Link is broken, but I'll take your word for it. However, there is no guarantee that the general class of this problem is solved, because you can always run into something it can't do. Another example you could try is a glass HALF-full of wine. It just can't produce a glass with a 50% amount of wine in it; or, another example, a jar half-full of jam. It's the kind of thing where, if a human can draw a glass of wine, drawing it half-full is trivial.
ChatGPT can easily do that? When was the last time you tried?
I just tried with Flux.1 Kontext, which I assume is better than o3 at creating images, but I'll admit I didn't do extensive tests; it's more that I'm trying to do test projects. Maybe I'm having bad luck, but it doesn't seem that way.