If I take a step back and think about where things were, say, five years ago, what LLMs can do is amazing. One has to acknowledge that (or at least, I do). But as a scientist it's been rather interesting to probe the jagged edges and the unreliability, including with the deep research tools, on any topic I know well.
If I read through the reports and summaries it generates, they seem correct at first glance - the jargon is used correctly, and physical phenomena are mostly referred to accurately. But very quickly I realize that, even with the deep research features and citations, it's making a bunch of incorrect inferences that likely arise from certain concepts (words, really) co-occurring in documents without actually being causally linked or otherwise fundamentally connected. On top of some strange leading sentences and arguments, this often produces entirely inappropriate topic headings/sections that connect things that really shouldn't go together.
One small example, of course, but this type of error (usually multiple errors) shows up in both Gemini and OpenAI models, even with some very specific prompts and multiple turns, and it keeps happening for topics in the fields I work in across the physical sciences and engineering. I'm not sure one could RL hard enough to correct this sort of thing (and it's probably not worth the time and money), but perhaps my imagination is limited.
I think those in the computer science field see passable results from LLMs on software and papers and start assuming other engineering fields should be easy.
They fail to understand that other engineering fields' documentation and processes are awful. Not that computer science is any better; it's even less rigorous.
The difference is that other fields don't log every single change they make into source control, and they don't have millions of open source projects to pull from. There aren't billions of books on engineering to draw from the way there are for language. The information is siloed, and those holding the keys now know what it's worth.
I'm reminded of the whole "vegetative electron microscopy" mess (https://www.sciencealert.com/a-strange-phrase-keeps-turning-...).
That's wild! Now I want to go hunting for more such examples..
You know who else is infamous for making errors due to shallow understanding? (Non-specialized) journalists!
How do you find they compare?
Not OP, but here are my observations: LLMs are uniformly dumb and not "understanding" across the whole spectrum of topics. It is counter-intuitive. By asking an LLM to simply blab ("write a story about ...") you notice it:
- mixes up pronouns (who is "you" or "he")
- cannot keep track of what is where.
- continuously plugs in its guidance slant ("let's cook dinner, Bob! It is paramount to strive for safety and cooperation while doing it!")
- language style is all over the place, comically so.
- when asked about the text it just generated, it is able to give valid critique of itself (i.e. having that "insight" does not help the generation)
Journalists may have a shallow understanding of a topic, but they do not start referring to the person they write about as "me" halfway through.
LLMs are uniformly dumb
This is the model conflating correlation with causation. Perhaps with more data the spurious correlations would disappear, but the 'right' way is to make the models learn causal world models.
Well, and I think the future of LLMs is not just in the pure LLM but in agentic ones: LLMs with deterministic tools to ferret out specifics. We're only starting here, but the results will be far better than what we get today.
An agentic LLM provides value by itself, to be sure, but it could also be part of learning a causal model. That's how humans do it: by interacting with the world.
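To make the "deterministic tools" point concrete, here is a minimal sketch of the kind of dispatch loop an agentic setup might use. Everything here is made up for illustration: the fake_model stub, the calculator/unit_convert tools, and the message format don't correspond to any particular vendor's API; the point is only that the numbers come from exact tools, not from the sampler.

    import math

    # Deterministic tools: exact answers the model cannot hallucinate.
    def calculator(expression: str) -> str:
        """Evaluate a restricted arithmetic expression exactly."""
        allowed = {"sqrt": math.sqrt, "pi": math.pi}
        return str(eval(expression, {"__builtins__": {}}, allowed))

    def unit_convert(value: float, frm: str, to: str) -> str:
        """Convert between a few length units via a fixed table."""
        to_m = {"mm": 1e-3, "cm": 1e-2, "m": 1.0, "km": 1e3}
        return str(value * to_m[frm] / to_m[to])

    TOOLS = {"calculator": calculator, "unit_convert": unit_convert}

    def fake_model(messages):
        """Stand-in for a real LLM call: first requests a tool, then answers."""
        tool_results = [m for m in messages if m["role"] == "tool"]
        if not tool_results:
            return {"tool": "calculator", "args": {"expression": "sqrt(2) * 100"}}
        return {"answer": f"The diagonal is about {float(tool_results[-1]['content']):.1f} mm."}

    def run_agent(question: str) -> str:
        messages = [{"role": "user", "content": question}]
        while True:
            reply = fake_model(messages)
            if "answer" in reply:
                return reply["answer"]
            # Dispatch the requested tool deterministically and feed the exact result back.
            messages.append({"role": "tool", "content": TOOLS[reply["tool"]](**reply["args"])})

    print(run_agent("What is the diagonal of a 100 mm square?"))

In this toy loop the model only decides which tool to call; the specific value is computed deterministically and fed back, which is the division of labor the comment above is pointing at.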