The Unreliability of LLMs and What Lies Ahead

verissimo.substack.com

130 points

talhof8

2 days ago


161 comments

ar813 2 days ago

If I take a step back and think back to, say, five years ago, what LLMs can do is amazing. One has to acknowledge that (or at least, I do). But as a scientist it's been rather interesting to probe the jagged edge and the unreliability, including with the deep research tools, on any topic I know well.

When I read through the reports and summaries it generates, they seem correct at first glance - the jargon is used correctly, and physical phenomena are mostly referred to accurately. But very quickly I realize that, even with the deep research features and citations, it's making a bunch of incorrect inferences that likely arise from certain concepts (words, really) co-occurring in documents without actually being causally linked or otherwise fundamentally connected. On top of some strange leading sentences and arguments, this often produces entirely inappropriate topic headings/sections connecting things that really shouldn't be together.

One small example, of course, but this type of error (usually multiple errors) shows up in both Gemini and OpenAI models, even with very specific prompts and multiple turns. And it keeps happening for topics in the fields I work in, in the physical sciences and engineering. I'm not sure one could RL hard enough to correct this sort of thing (and it's likely not worth the time and money), but perhaps my imagination is limited.

  • elictronic 2 days ago

    I think those in the computer science field see passable results of LLM use with respect to software and papers and start assuming other engineering fields should be easy.

    They fail to understand that other engineering fields' documentation and processes are awful. Not that computer science is good - it's even less rigorous.

    The difference is that other fields don't log every single change they make into source control, and they don't have millions of open source projects to pull from. There aren't billions of books on engineering to pull from like there are for language. The information is siloed, and those with the keys now know what it's worth.

  • BlueTemplar 2 days ago

    You know who else is infamous for making errors due to shallow understanding? (Non-specialized) journalists!

    How do you find they compare?

    • 112233 a day ago

      Not OP, but here are my observations: LLMs are uniformly dumb and lacking in "understanding" across the whole spectrum of topics. It is counter-intuitive. By asking an LLM to simply blab ("write a story about ...") you notice it:

      - mixes up pronouns (who is "you" or "he")

      - cannot keep track of what is where.

      - continuously plugs its guidance slant ("let's cook dinner, Bob! It is paramount to strive for safety and cooperation while doing it!")

      - language style is all over the place, comically so.

      - when asked about the text it just generated, is able to give valid critique of itself (i.e. having that "insight" does not help the generation)

      Journalists may have a shallow understanding of a topic, but they do not start referring to a person they write about as "me" halfway through.

      LLMs are uniformly dumb.

  • esafak 2 days ago

    This is the model conflating correlation with causation. Perhaps with more data spurious correlations would disappear, but the 'right' way is to make the models learn causal world models.

    • jvalencia 2 days ago

      Well, I think the future of LLMs is not just the pure LLM, but the agentic one: LLMs with deterministic tools to ferret out specifics. We're only getting started here, but the results will be far better than what we do today.
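
      A toy sketch of what I mean, in Python - call_llm here is a hypothetical stand-in for whatever chat client you use, and eval is just a placeholder for a real deterministic tool:

          import re

          # The model decides *what* to compute; a deterministic tool computes it.
          def agent_answer(call_llm, question):
              reply = call_llm(
                  "If the question needs arithmetic, respond with CALC: <expression>. "
                  f"Otherwise answer directly.\n\nQuestion: {question}"
              )
              match = re.match(r"CALC:\s*(.+)", reply)
              if match:
                  # Toy 'deterministic tool': evaluate the arithmetic expression exactly.
                  result = eval(match.group(1), {"__builtins__": {}})
                  return call_llm(f"Question: {question}\nTool result: {result}\nAnswer concisely.")
              return reply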

      • esafak 2 days ago

        Agentic LLMs by themselves provide value, to be sure, but they could also be part of learning a causal model. That's how humans do it: by interacting with the world.

thorum 2 days ago

Good article. Agree that general unreliability will continue to be an issue, since it's fundamental to how LLMs work. However, it would surprise me if there were still a significant gap between single-turn and multi-turn performance in 18 months. Judging by improvements in the last few frontier model releases, I think the top AI labs have finally figured out how to train for multi-turn and agentic capabilities (likely via RL) and just need to scale this up.

  • karn97 2 days ago

    Reasoning is just the worst kind of stopgap measure. The state that should emerge internally is instead forced through automated prompting. And you can clearly see this because the models rarely follow their own "reasoning". It's just automated self-prompting.

  • koakuma-chan 2 days ago

    They’re reliable enough for many use cases

    • bluefirebrand 2 days ago

      What this should be doing is exposing how those use cases are faulty, if they can accept such inconsistent and poorly defined outputs

brentm 2 days ago

This is a good articulation of what is a real concern around the AI bull thesis.

If a calculator works great 99% of the time you could not use that calculator to build a bridge.

Using AI for more than code generation is still very difficult and requires a human in the loop to verify the results. Sometimes using AI ends up being less productive because you're spending all your time debugging its outputs. It's great, but there are a lot of questions about whether this technology will ultimately lead to the productivity gains that many think are guaranteed in the next few years. There is a non-zero chance it ends up actually hurting productivity because of all the time wasted trying to get it to produce magic results.

  • oconnor663 2 days ago

    A pedantic but maybe-not-entirely-pedantic point: It depends on what you mean by 99%.

    If the calculator has a little gremlin in it that rolls a random 100-sided die, and gives you the wrong answer every time it rolls a 1, then you certainly can use it to build a bridge. You just need to do each calculation say 10 or 20 times and take the majority answer :)

    If the gremlin is clever, it might remember the wrong answers it gave you, and then it might give them to you again if you ask about the same numbers. In that case you might need to buy 10 or 20 calculators that all have different gremlins in them, but otherwise the process is the same.

    Of course if all your gremlins consistently lie for certain inputs, you might need to do a lot of work to sample all over your input space and see exactly what sorts of numbers they don't like. Then you can breed a new generation of gremlins that...
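
    A minimal sketch of that majority-vote scheme, for what it's worth (flaky_calculator is a made-up stand-in for any tool with a die-rolling gremlin in it):

        import random
        from collections import Counter

        def flaky_calculator(a, b):
            # Hypothetical gremlin: wrong roughly 1% of the time.
            if random.randint(1, 100) == 1:
                return a + b + random.choice([-3, -1, 1, 3])
            return a + b

        def majority_answer(a, b, trials=15):
            # Repeat the calculation and take the most common result.
            counts = Counter(flaky_calculator(a, b) for _ in range(trials))
            return counts.most_common(1)[0][0]

        print(majority_answer(2, 2))  # overwhelmingly likely to print 4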

    • franktankbank a day ago

      Yea I know, can't really understand why people have such a problem with this. Just ignore the wrong answers and be thankful when it gives you a right answer. Picky bastards.

  • tveita 2 days ago

    > If a calculator works great 99% of the time you could not use that calculator to build a bridge.

    We know for certain that certified lawyers have committed malpractice by using ChatGPT, in part because the made-up citations are relatively easy to spot. Malpractice by engineers might take a little more time to discover.

    • bagacrap 2 days ago

      Engineers' work is also externally verifiable - by unit tests for software, and I assume by other sorts of automated protocols for civil engineering. I would hope a bridge is not built without triple-checking the various outcomes.

      • orwin 2 days ago

        Well, most of the LLM-generated code I ship is unit tests (and scripts), so hopefully those are good enough to catch my mistakes :)

      • yencabulator 2 days ago

        If that argument were to save anyone, it would have saved the lawyers too.

  • tom_m 2 days ago

    I believe it absolutely will. I think eventually we'll get to a point where people will be measured on how well they can get the AI to behave and how good they are at keeping costs down.

    My boss built an AI workflow that costs over $600 to do the same thing I had already given him for less than $30. He just wanted to use tools he found and do it his way. Now, this had some value: it got more people in the company exposed to AI, and he learned from the experience. It's his prerogative as the owner of the company. He also isn't concerned about the cost and will continue to pay much more, for now. I think as time goes on this will be scrutinized more.

  • worldsayshi 2 days ago

    This doesn't seem like the first time engineers have tried to work with something useful that is only partially reliable.

    The solution is to play to its strengths and reinforce it with other mediums. You don't build structures with pure concrete; you add rebar. You don't build ships out of only sail, and you don't build rail with just iron. You compose materials in a way that makes sense.

    LLMs are most useful when the output is immediately verifiable. So let's build frameworks that take that as their core. Build everything around verification, and use LLMs for their strengths.

  • andrewmutz 2 days ago

    What we are seeing with our customers is that LLM errors are a very manageable problem. End users adapt pretty quickly to the idea that AI systems aren't perfect. In many cases AI products are doing tasks that used to be done by humans and these humans were making mistakes too, so the end user is used to the idea that the task will get accomplished with some non-zero error rate.

    You just need to build your products in a manner where the user has the ability to easily double check the results whenever they like. Then they can audit as they see fit, in order to get used to the accuracy level and to apply additional scrutiny to cases that are very important to their business.

    • insane_dreamer 2 days ago

      > the user has the ability to easily double check the results whenever they like

      if the user is able to so easily verify that the results are accurate, that means that they are able to generate accurate results through other means, which means they don't need the LLM in the first place

      • Ukv a day ago

        I don't think that's necessarily true - many tasks are difficult to solve but easy to verify. If I ask for "place names that end with um" or "good ideas for a birthday party", I can pretty much verify the answer just by reading it. In other cases, clicking through to check that a linked source supports a claim is easier than researching to find and summarize the source in the first place would be.

        • insane_dreamer 18 hours ago

          > I can pretty much verify the answer just by reading it

          Only if you have domain knowledge. In both of your examples, you have to 1) know geography to determine whether "Técolum" and "Tolum" are indeed city names or just made up; and 2) know what might be acceptable ("good idea") or not at a birthday party.

          Yes, it'll probably save you some time, but it's not orders of magnitude.

          > In other cases, clicking through to check that a linked source supports a claim

          This supposes that the AI provides a link for every fact. Google search + Gemini does, but most LLM interfaces don't.

          Secondly, if I have to click through every link and read through the source to determine whether the details of a "summary" are correct or not, that really does not save me much time over conducting a search and looking through the linked sources myself.

          An anecdote from a couple of weeks ago: my wife's professor sent her 5 citations and summaries related to a medical research project. She didn't say they were LLM-generated, but it was obvious (to me, not my wife) that they were, from the formatting alone. None of the 5 papers existed as cited. My wife was confused and spent a lot of time trying to figure out what was wrong and why she couldn't find any of the papers. A Google Scholar search and some logical thinking turned up 2 of the papers, which were close enough to the citations to be the intended ones, but the other 3 were not even matchable. In the end, the time spent trying to sort out valid vs invalid citations, and find valid replacements, was significantly greater than just doing the search and looking through the abstracts.

          PS: LLMs are fine for information that can be "fuzzy": suggest places to go on vacation in September, plan a birthday party, etc. But I wouldn't consider that to be a "revolutionary" advance.

          • Ukv 2 hours ago

            > Only if you have domain knowledge [...]

            It's common to have a reasonable intuitive sense for whether something works as a birthday party yet be stumped when coming up with ideas. Or be able to see that a word ends in "um" and is a real word/place you recognise (or double click -> search if not) without necessarily being able to list many yourself if asked. I don't mean to say that verification requires absolutely zero knowledge, just that it can be (and often is) substantially easier, so I don't think insane_dreamer's reasoning holds.

            > this supposes that the AI provides a link for every fact.

            For andrewmutz's scenario, it was the statement "the user has the ability to easily double check the results whenever they like" that was claimed to make the LLM unnecessary in the first place.

            Outside of that case, people have the choice to use the LLM that best suits their task - and most popular ones I'm aware of do support search/RAG.

            Certainly possible to waste time by doing something like what your wife's professor seemingly did (get non-link "citations" generated by an LLM without search/RAG, then send them to someone who'll probably infer "these must exist somewhere because the sender read them" opposed to "these were vaguely recalled from memory so may not exist") - I don't recommend doing that.

            > secondly, if I have to click through every link and read through the source to determine whether details of a "summary" are correct or not, that really does not save me much time from conducting a search and looking through the linked sources myself

            A lot of LLM responses are for the kind of thing that doesn't need verification, or for which verification doesn't depend on checking the source. For situations where checking the source is relevant, that's typically just going to be the source for the part you're interested in - in the same way a Wikipedia article can provide a useful lead without needing to check every source the article cites. Anecdotally I find that, while far from perfect, it saves a lot of time when it can surface information that would've otherwise required digging through a dozen or so sources.

      • tom_m a day ago

        True, but people are so enamored with what they can do that they rarely seem to think about that. We will overspend on AI purely because we think it's cool.

    • brentm 2 days ago

      Yea I just think the true unlock in productivity will come from not requiring a human in the loop.

  • the_snooze 2 days ago

    >If a calculator works great 99% of the time you could not use that calculator to build a bridge.

    That's happened before, with a far higher correctness rate than 99%, and it cost Intel $500M. Reliability and accuracy matter. https://en.wikipedia.org/wiki/Pentium_FDIV_bug

  • vinni2 2 days ago

    > If a calculator works great 99% of the time you could not use that calculator to build a bridge.

    But if the alternative is doing calculations by hand (writing code manually) there is a higher chance of making mistakes.

    Just like calculations are double-checked when building bridges, unit tests and code reviews should catch bugs introduced by LLM-written code.

    • banannaise a day ago

      Code review is your last (and worst) line of defense. Humans are not good at needle-in-a-haystack tasks.

ok123456 2 days ago

MongoDB was basically "vibe coding" for RDBMSs. After the hype cycle, there will be a wasteland of unmaintainable vibe-coded products that companies will have to pump unlimited amounts of money into to maintain.

  • boardwaalk 2 days ago

    Or we'll just leave them behind, and that's fine. And I spend my workdays maintaining old stuff of varying quality. Conceptually: software composting.

  • Spivak 2 days ago

    I think we mythologize the relational model a bit too much to call NoSQL DBs vibe coding. DynamoDB is quite good, and you can point to some very large customers using it successfully.

    • yencabulator 2 days ago

      MongoDB was bad for several reasons unrelated to the relational model.

eterm 2 days ago

There are jobs out there that have always been unreliable.

A classic example is the Travel Agent. This was already a job driven to near-extinction just by Google, but LLMs are a nail in the travel agent coffin.

The job was always fuzzy. It was always unreliable. A travel agent recommendation was never a stamp of quality or a guarantee of satisfaction.

But now, I can ask an LLM to compare and contrast two weeks in the Seychelles with two weeks in the Caribbean, have it then come up with sample itineraries and sample budgets.

Is it going to be accurate? No, it'll be messy and inaccurate, but sometimes a vibe check is all you ever wanted to confirm that yeah, you should blow your money on the Seychelles, or to confirm that actually, you were right to pick the Caribbean.

Or that actually, both are twice what you'd prefer to spend - in which case, dear ChatGPT, where would be more suitable?

etc.

When it comes down to the nitty-gritty, does it start hallucinating hotels and prices? Sure, and at that point you break out TripAdvisor, etc.

But as a basic "I don't even know where I want to go on holiday ( vacation ), please help?" it's fantastic.

  • asadotzler 2 days ago

    If you don't care about reliability, repeatability and accuracy, they're great.

    • jimbokun 2 days ago

      This should be OpenAI's official slogan!

  • whyowhy3484939 2 days ago

    Once they start making deals with the relevant organizations, booking rooms, handling insurance, arranging replacement hotels, etc., then they'll replace travel agents. These guys don't just Google a bunch of tickets, you know.

    • eterm 2 days ago

      We're getting into semantics now, but I'm talking about the kind of person who used to sit in a physical store, waiting for someone to walk by and go into the travel agency.

      In the 80's and 90's, this is how most people booked their holidays. It was labour intensive, people would spend some time talking with a travel agent in a store, who would have a good idea of the packages available, and be able to make recommendations and match people with holidays.

      The remnants of agencies still provide the same services, but (for the most of us) it's all online, it's all tick-box based, and much of the protection is via ATOL/ABTA.

      These services still exist, but they're no longer all over the high street. Names like Thomas Cook and Lunn Poly have either been absorbed (mostly by TUI) or collapsed, and they've largely disappeared from the high street, with just a few left (mostly TUI).

      And those that are left have been reduced, much like retail banking, to entering your details into the same websites and services available to anyone, and talking you through the results that the computer spits out - results you could have browsed yourself at home. The underpaid travel agent in the store isn't any better connected than you are. In fact, they're possibly even pushier about steering you toward the hotels with the best commission than the website is.

      • netsharc 2 days ago

        I imagine a travel agent would have local knowledge and connections, and would know the quality of the hotels they're trying to send you to; a high commission isn't worth it if your customer is unsatisfied and goes to a different agent for their next trip. Of course this is based on the assumption that the customer always wants to use a travel agent (an unrealistic assumption nowadays, because it's so easy to switch to the Internet).

        Someone like Rick Steves(1) still goes to the destinations every summer to check out hotels, restaurants and local companies. I imagine someone with a bigger budget would travel with his company rather than try their luck with some booking.com hotel with a high rating...

        1: https://www.youtube.com/@RickStevesEuropeOfficial

        • eterm 2 days ago

          What you're imagining is what it was like in the 1980s, or possibly now for a boutique place, not the reality of the post-internet high-street travel agent.

          You're not realising the reality of the typical high-street worker, and the sheer lack of autonomy that they have in their roles.

      • jimbokun 2 days ago

        Seems like the travel agent has been replaced by:

        1. the travel blogger who writes about places and why you might/might not want to go there.

        2. the tour guide who books everything end to end for everyone on the tour and goes along to show and explain the sites.

    • jimbokun 2 days ago

      Um, Google and travel sites already replaced travel agents a LONG time ago.

  • liveoneggs 2 days ago

    I have used it on three big family vacations already and it's definitely a place where "AI" shines in usefulness. It did recommend some out-of-business hotels and things but the broad strokes were good enough to save hours of work.

  • 65 2 days ago

    Yes, which is why it's slightly confusing that programming is being pushed so hard as a use case for LLMs. For things that don't need completely accurate information, sure. But for programming, data, and factual information, it's surprising to see so many people using LLMs.

    • asadotzler 2 days ago

      Code runs or it doesn't; that's a sort of verification feedback that other use cases don't offer, at least not so immediately. Formal code verification is a thing; there's no equivalent for verifying, say, legal citations. Code is language with well-documented rules all over the training corpora. Many other use cases are hardly so well represented in model training. These are just a few of many, many reasons that code is an easier problem than most.

      • 65 2 days ago

        Code runs or it doesn't... but that doesn't mean it does what you want it to do.

        An LLM could generate code that takes raw user input and adds it to a raw SQL query. Does it work? Yeah. Is it a terrible security flaw? Also yeah.
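
        A minimal illustration of that failure mode (table and data here are made up); the first version "works" but is injectable, the second is the parameterized fix:

            import sqlite3

            conn = sqlite3.connect(":memory:")
            conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
            conn.execute("INSERT INTO users VALUES ('alice', 'admin')")

            def find_user_unsafe(name):
                # Raw string interpolation: passes a happy-path test, but is injectable.
                return conn.execute(f"SELECT * FROM users WHERE name = '{name}'").fetchall()

            def find_user_safe(name):
                # Parameterized query: the driver handles escaping.
                return conn.execute("SELECT * FROM users WHERE name = ?", (name,)).fetchall()

            print(find_user_unsafe("' OR '1'='1"))  # leaks every row
            print(find_user_safe("' OR '1'='1"))    # returns nothing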

        Additionally, if you want a certain UX and the LLM cannot get there but the code works, that doesn't mean it's successful.

Ostrogoth a day ago

A few months ago I asked ChatGPT to create a max operating depth table for scuba diving based on various PPO2 limits and EAN gas profiles, just to test it on something I know (it's a trivially easy calculation, and the formula is readily available online). It got it wrong… multiple times… even after correction and after supplying the correct formula, the table was still repeatedly wrong (it did finally output a correct table). I just tried it again, with the same result. Obviously not something I would stake my life on anyway, but if it's getting something so trivial wrong, I'm not inclined to trust it on more complex topics.
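
For reference, the whole table fits in a few lines of code; a sketch, assuming the standard metric MOD formula for seawater (roughly 10 m per bar of ambient pressure):

    # MOD (meters, seawater) = ((PPO2 limit / FO2) - 1) * 10
    def mod_meters(ppo2_limit, fo2):
        return ((ppo2_limit / fo2) - 1) * 10

    gases = {"Air (21%)": 0.21, "EAN32": 0.32, "EAN36": 0.36, "EAN40": 0.40}
    for name, fo2 in gases.items():
        row = " / ".join(f"{mod_meters(p, fo2):5.1f}" for p in (1.2, 1.4, 1.6))
        print(f"{name:10s} MOD at PPO2 1.2/1.4/1.6: {row} m")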

  • tom_m a day ago

    Well it doesn't really do math.

willk357 a day ago

Has anyone experimented with an ensemble + synthesizer approach for reliability? I'm thinking: make n identical requests to get diverse outputs, then use a separate LLM call to synthesize/reconcile the distinct results into a final answer. Seems like it could help with the consistency issues discussed here by leveraging the natural variance in LLM outputs rather than fighting it. Any experience with this pattern?
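
Something like the following, perhaps (call_llm is a hypothetical stand-in for whatever chat-completion client you use):

    def ensemble_answer(call_llm, question, n=5):
        # 1. Sample n diverse drafts at a higher temperature.
        drafts = [call_llm(question, temperature=0.9) for _ in range(n)]

        # 2. A separate low-temperature call reconciles them into one answer.
        numbered = "\n\n".join(f"Draft {i + 1}:\n{d}" for i, d in enumerate(drafts))
        synth_prompt = (
            f"Question: {question}\n\n"
            f"Here are {n} independent draft answers:\n\n{numbered}\n\n"
            "Note where the drafts agree, flag contradictions, "
            "and produce one reconciled final answer."
        )
        return call_llm(synth_prompt, temperature=0.0)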

tom_m 2 days ago

Unreliability doesn't matter for some people because their bar was already that low. Unfortunately this is the way of the world and quality has and will continue to suffer. LLMs mostly accelerate this problem... hopefully they get good enough to help solve it.

jeisc a day ago

AI does not know what is fake or real any more than we do. It uses our shaky data to make predictions.

mjburgess 2 days ago

I think I'm settling on a "Gell-Mann amnesia" explanation of why people are so rabidly committed to the "acceptable veracity" of LLM output. When you don't know the facts, you're easily misled by plausible-sounding analysis, and having been misled -- a certain default prejudice toward existing beliefs takes over. There's a significant asymmetry of effort in belief change vs. acquisition. I think there's also an ego-protection effect here too: if I have to change my belief, then I was wrong.

There are Socratically minded people who are more addicted to that moment of belief change, and hence overall vastly more sceptical -- but I think this attitude is extremely marginal, and probably requires a lot of self-training to be properly inculcated.

In any case, with LLMs, people really seem to hate the idea that their beliefs about AI and their reliance on LLM output could be systematically mistaken. All the while, when shown output in an area of their expertise, they realise immediately that it's full of mistakes.

This, of course, makes LLMs a uniquely dangerous force in the health of our social knowledge-conducive processes.

  • helloplanets 2 days ago

    You need to be pushing much more data in than you're getting out. 40k tokens of input can result in 400 actual quality tokens of output. Not giving enough input to work off of will result in regressed output.

    It's basically like a funnel, which can also be used the other way around if the user is okay with quirky side effects. It feels like a lot of people are using the funnel the wrong way around and complaining that it's not working.

    • mjburgess 2 days ago

      Sure, if you have a high-quality starting point and need refinement.

      The issue is that the vast majority of user-facing LLM use cases are where people don't have these high-quality starting points. They don't have 40k tokens to make 400.

      • helloplanets a day ago

        You can just attach 40k tokens of context directly in the Gemini, ChatGPT and Claude web interfaces, afaik. If someone is using an LLM as a tool to actually be of help in an area they are already professionals in, bringing in good books, research, etc. as attachments shouldn't be an issue.

        But yes, the default mode of LLMs is usually a WikiHow and content farm style answer. This is also a problem with Google: The content you get back from generic searches will often be riddled with inaccuracies and massive generalizations.

        Not being able / bothering to come up with relevant context and throwing the dice on the LLM being able to do this out of the box is definitely a serious issue. I really think that is where the discussion should be: Focused more on how people use these tools. Just like you can tell quite a bit about someone's expertise based on the specific way in which they interface with Google (or any information on the internet) while they work.

  • asadotzler 2 days ago

    Bullshit works on lots of people. Seeming to be true, or even just plausible, is enough for most people. This is why powerful bullshit machines are dangerous tools.

    • mjburgess 2 days ago

      If people were easy enough to convince that they had been deceived, then I'd not mind so much. It's the extraordinary lengths people will go to in order to protect the bullshit they acquired with far less scepticism. Genuinely wild leaps of logic, shallowness of reasoning, on-the-face-of-it non-sequiturs, claims offered as great defeaters which require only a single moment of reflection to see through.

      This is the problem. The problem is how bullshit conscripts its dupes into this self-degradation and bad faith dialogue with others.

      And of course, how there are mechanisms in society (LLMs now one of them) which correlate this self-degrading shallowness of reasoning -- so that all at once an expert is faced with millions of people with half-baked notions and a great desire to preserve them.

      • danans 2 days ago

        > It's the extraordinary lengths people will go to in order to protect the bullshit they acquired with far less scepticism

        That's the narrative bias at play. We all are subject to it, and for good reason. People need stories to help maintain a stable mental equilibrium and a sense of identity. Knowledge that contradicts the stories that form the foundation of their understanding of the world can be destabilizing, which nobody wants.

        Especially when they are facing struggle and stress, people will cling to their stories, even if a lie or deception in the story might be harming them. Religious cults and conspiracy theories are often built on this tendency, but so is culture in general.

        • mjburgess 2 days ago

          I think there is a certain sort of person who, if not "wants" this destabilization, doesn't really experience the alternative. People primarily relating to the world through irony, say. So, characteristically, Socrates (and some stand-up comedians, and the like), who trade in aporia -- this feeling of destabilization.

          • danans 2 days ago

            > I think there is a certain sort of person who, if not "wants" this destabilization, doesn't really experience the alternative.

            I agree, but the ability/willingness to engage in that kind of destabilizing irony itself comes from a certain stability, where you can mess with the margins of your own stories' contradictions, without putting the core of your stories under threat.

consumer451 2 days ago

I have been using LLM coding tools to make stuff which I had no chance of making otherwise. They are MVPs, and if anything ever got traction I am very aware that I would need to hire a real dev. For now, I am basically a PM and QA person.

What really concerns me is that the big companies whose tools we all rely on are starting to push a lot of LLM-generated code without having increased their QA.

I mean, everybody cut QA teams in recent years. Are they about to make a comeback once big orgs realize that they are pushing out way more bugs?

Am I way off base here?

bionhoward 2 days ago

Can't we make this deterministic with techniques like JAX's RNG seed?
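
At the framework level, yes - seeded sampling is reproducible. A minimal JAX sketch (whether a hosted LLM API actually exposes such a seed to you is a separate question):

    import jax
    import jax.numpy as jnp

    logits = jnp.array([0.1, 2.0, 0.3])

    # The same explicit PRNG key always produces the same draw...
    key = jax.random.PRNGKey(42)
    print(jax.random.categorical(key, logits))  # identical every run
    print(jax.random.categorical(key, logits))  # identical again: the key fixes the randomness

    # ...and fresh randomness has to be requested explicitly by splitting the key.
    key, subkey = jax.random.split(key)
    print(jax.random.categorical(subkey, logits))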

worik 2 days ago

LLMs are a tool to extend human capabilities. They are not intelligent agents that can replace humans

Not very hard to understand, except it seems to be

  • cmiles74 2 days ago

    The field where LLMs are most successful, software development, is also a place where many software developers are paid to use LLMs. I have colleagues who are reluctant to express their skepticism publicly for just this reason.

  • baxtr 2 days ago

    This. 100%.

    I think and say this all the time. But people keep saying that AI will take all our jobs, and I'm so utterly confused by this.

    Sometimes I wonder whether it's me who has gone mad, or everyone else.

    • bluefirebrand 2 days ago

      Companies are salivating over the idea of cutting staff and replacing them with AI tools, so it's not exactly farfetched to think AI might lead to a lot of unemployment, at least for a while

      Every type of automation ever invented has led to massive job cuts and yes, some sectors actually did not ever recover

      • Barrin92 2 days ago

        >Every type of automation ever invented has led to massive job cuts

        It never has; in fact, the opposite is true. Every type of automation has expanded economic output so much that it created massive amounts of labor demand, which is why cities absorbed masses of underemployed workers early in the industrial revolution. One famous example: there are now more bank tellers than before the invention of the ATM.

        In fact, you can go to any poor country with no automation and you'll find entire classes of un- and underemployed people. This is a condition of premodern societies, not technological ones.

        The entire AI debate rests on the speculative claim that it is not merely an automation tool, but a sort of sci-fi wholesale replacement of human beings, contrary to what happened during earlier waves of automation.

        • PantaloonFlames 2 days ago

          > It, never has, in fact the opposite is true. Every type of automation has expanded the economic output so much that it created massive amounts of labor demand, …

          This seems to be true, but there's a second issue at work here, which is that automation and progress in general can _disrupt_ the labor market. Sure, there's a net gain in labor demand, but there are people involved who are more than just "resources" that can easily be redeployed.

          Progress is what built and then killed (injured?) cities and towns like, in the US, Detroit, or Gary, or Pittsburgh.

          We want to promote progress and automation while at the same time protecting people who are inadvertently over-exposed to the downside. (Generally less educated people or people with less agency).

        • bluefirebrand 2 days ago

          > One famous example, there are now more bank tellers than before the invention of the ATM.

          Bank tellers do way more varied work than ATMs do. You cannot open an account at a bank from an ATM. This is a stupid example because ATMs never tried to automate the entirety of a bank teller's job, only a couple of the services they provide.

          > In fact you can go to any poor country with no automation and you'll find entire classes of un- and underemployed people. This is a condition of premodern, not technological societies

          You can find this in Rural America, forget "poor countries with no automation"

  • turtletontine 2 days ago

    Well. In an ideal world, LLMs would be used this way, as a tool to help automate the bullshit and let the person driving worry about other stuff.

    But I never see them actually used this way. At the big institution end, companies and universities will continue to force AI tools on their employees in heavy handed and poorly thought out ways, and use it as an excuse to fire people whenever budgets get tight (or investors demand higher profits). At the opposite scale, with individual users, it’s really alarming how rapidly people seem to stop thinking with their own brain and offload all critical thinking to an LLM. That’s not “extending your capabilities,” that’s letting all your skills atrophy while you train a machine to be your shitty replacement.

akomtu 2 days ago

LLMs can't evaluate their own output. LLMs suggest possibilities, but can't evaluate them. Imagine an insane man who is rambling something smart but doesn't self-reflect. Evaluation is done against some framework of values that are taken as true: the rules of a board game, a language's syntax, or something else. LLMs also can't fabricate evaluation, because evaluation is a rather rigid and precise model, unlike natural language. Otherwise you could set up two LLMs questioning each other.

  • candiddevmike 2 days ago

    Isn't this kind of the hope/dream of multi-agent systems where one LLM "coordinates" among others or checks the responses? In my experience it works about as well as you're describing.

lapsis_beeftech 2 days ago

Large language models reliably produce misinformation that appears plausible only because it mimics human language. They are dangerous toys that cannot be made into tools that are safe to use.

josefritzishere 2 days ago

It's hard to say "never" in technology; history isn't really on your side. However, LLMs have largely proven to be good at things computers were already good at: repetitive tasks, parallel processing, and data analysis. There's nothing magical about an LLM that seems to be defeating the traditional paradigm. Increasingly I lean toward expecting an implosion of the AI hype cycle.

  • wintermutestwin 2 days ago

    What I don’t understand is, how can a liar be good at data analysis?

    • rienbdj 2 days ago

      If you give an LLM the data in the prompt and then ask it to extract information from that data, it does pretty well. This is the premise of RAG. Where LLMs do poorly is when you ask them for information you haven't given them.
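
      A rough sketch of that pattern (retrieve and call_llm are hypothetical stand-ins for your search index and chat client):

          def answer_with_context(question, retrieve, call_llm, k=5):
              # 1. Pull the k most relevant passages from your own corpus.
              passages = retrieve(question, k=k)

              # 2. Put the data in the prompt and constrain the model to it.
              context = "\n\n".join(passages)
              prompt = (
                  "Answer using only the context below. "
                  "If the context does not contain the answer, say so.\n\n"
                  f"Context:\n{context}\n\nQuestion: {question}"
              )
              return call_llm(prompt)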

    • ToucanLoucan 2 days ago

      It works great if all you're looking for is an output, with not a care for what it is. So if you're trying to generate slop children's books to shit onto Amazon, it's awesome. If you want to give your boss a huge bloated report on your daily activities, works great. If you want to phone in an assignment that doesn't add value to your education, LLM will do that. If you want a header image for your LinkedIn post that you don't want to pay for, generate it. Who cares.

      This isn't even an indictment, not really. I'm just reading between the lines here regarding when/how it's used. Nobody with intentionality uses these things. Nobody who CARES what they're making uses these things. And again, I want to emphasize, this is not an attack. There are tons of things I do in my work life that I utterly do not give a shit about, and LLMs have been a blessing for it. Not my code, fuck no. But all the ancillary crap, absolutely.

  • ToucanLoucan 2 days ago

    LLMs are a legitimate technology with legitimate applications. However, in a desperate bid for a new iPhone moment to assure Wall Street that the fantasy of infinite growth in a finite world is possible, they have utterly lost the plot regarding what statistical analysis of words at scale is capable of doing. Useless? Far from it. The basis for a $300 billion company with no meaningful products after almost a decade working on it? I have doubts.

    I can't fathom a future where OpenAI doesn't eat dirt, with Anthropic likely not far behind it. nVidia will likely come out fine, since it still has gamers to disappoint, and the infrastructure build-out that did occur will crater the cost of GPUs at scale for smaller, smarter companies to take advantage of. So the tech will likely still kick around, but as just another technology, not the second coming of Cyber Christ it's been hyped to be.

    • rini17 2 days ago

      You seriously underestimate the appeal of burning cycles on GPUs to get something cool, if barely useful, out. Cryptocurrencies are still very much alive, too.

      • ToucanLoucan 2 days ago

        > Cryptocurrencies are still very much alive, too.

        Yeah, like I said, LLMs will be around. Frankly I think they'll be way more around than crypto which as far as the mainstream is concerned might as well be dead.

  • dist-epoch 2 days ago

    Funny, I don't remember any computer program in the past being able to explain a news article through the lens of one particular philosopher.

    Or being able to explain the static physical forces in a picture that are keeping a structure from collapsing.

    Or recommend a Python library which does X, Y and Z with constraints A, B and C.

    But I guess you can file all the above under "data analysis".

    • GuinansEyebrows 2 days ago

      It is the result of data analysis. The computer program isn't explaining anything, or recommending anything. It's simply presenting the results of querying data analyzed at scale and returning the "most likely" result (as determined by the system prompt and human input from developers and users of the program). "Most likely" is still a super-fuzzy grey area.

      https://www.plough.com/en/topics/life/technology/computers-c...

    • keybrd-intrrpt 2 days ago

      It's all just electricity and binary bits, nothing new here...

      /s?

johnea 2 days ago

> Internally, it uses a sophisticated, multi-path strategy, approximating the sum with one heuristic while precisely determining the final digit with another. Yet, if asked to explain its calculation, the LLM describes the standard 'carry the one' algorithm taught to humans.

So, the LLM isn't just wrong, it also lies...

  • mjburgess 2 days ago

    The LLM has no relevant capacities, either to tell the truth or to lie. It generates "appropriate" text, given a history of cases of appropriate textual structures.

    It is the person who reads this text as if written by a person who imparts these capacities to the machine, who treats the text as meaningful. But almost no text the LLM generates could be said to be meaningful, if any.

    In the sense that if a two year old were taught to say, "the magnitude of the charge on the electron is the same as the charge on the proton", one would not suppose the two year old meant what was said.

    Since the LLM has no interior representational model of the world, only a surface of text tokens laid out as if it did, its generation of text never comes into direct contact with a system of understanding that text. Therefore the LLM has none of the capacities ever implied by its use of language; it only appears to.

    This appearance may be good enough for some use cases, but as an appearance, it's highly fragile.

    • johnea 2 days ago

      One could always argue that the lie is in the ear of the receiver 8-/

      I would argue that if the output of the LLM is to be interpreted as natural speech, and the output makes an authoritative statement which is factually incorrect but stated as if it were true, this is a lie.

      The problem is that the tech is presented as if it did have the internal state that you accurately describe it as not having.

      The lie in this example, is when it is prompted to describe the process by which it reached a result, and that description has no resemblance to the actual process by which it reached the result.

      This isn't a misrepresentation of some external facts, but a complete fabrication that does not represent how it reached that result at all.

      However, many users will accept this information, since it only involves internal aspects of the tool itself.

      The fact that the LLM doesn't have this introspective information is exactly part of why LLMs are NOT intelligence, artificial or otherwise.

      And yet they are being presented as such, also, a lie...

  • GuB-42 2 days ago

    An LLM can't self-reflect. It doesn't know what happens in its own circuits. If you ask it, it will either tell you what it knows (from the articles about LLMs it has ingested) or, if it doesn't know, hallucinate something, as is often the case.

    Since the LLM has no knowledge of how LLMs do addition, it will pick something that seems to make sense, and it picked the "carry the one" algorithm. New generations of LLMs will probably do better now that they have access to a better answer for that specific question, but it doesn't mean that they have become more insightful.

    • johnea 2 days ago

      Please see the reply to the comment above...

  • psychoslave a day ago

    No, because the LLM is a tool without any feeling or consciousness, as the article rightfully points out. It has no ability to scrutinize its own internals, nor to wonder whether that would be something relevant to do.

    Those who lie (possibly even to themselves) are those who pretend that mimicry, if stretched far enough, will surpass the actual thing, and who foster deceptive psychological analogies like "hallucinate".

    • johnea 16 hours ago

      The LLM doesn't have a brain and it doesn't have consciousness, therefore it doesn't "hallucinate"; it just produces factually incorrect results.

      It's just wrong, and then gives misleading explanations of how it got the wrong answer, following the same process that led to the wrong answer in the first place. Lying is a subset of being wrong.

      The tech has great applications, why hype the stuff it doesn't do well? Or apply terms that misrepresent the process the s/w uses?

      One might say the use of the word "hallucinate" is an analogy, but it's a poor analogy, one which further misleads the lay public about what is actually happening inside the LLM and how its results are generated.

      If you want to assert that "hallucinate" is an analogy, then "lying" is also an analogy.

      If every prompt that ever went into an LLM was prefixed with: "Tell me a made up story about: ...", then the user expectation would be more in line with what the output represents.

      I'm not averse to the tech in general, but I am against the rampant misrepresentation that's going on...

  • glial 2 days ago

    Talking about "truth" or "lies" with LLMs isn't helpful.

    • johnea 2 days ago

      Could you get the CEO of Google or OpenAI to state that clearly in a press announcement? 8-)

      Although "isn't helpful" is rather dodgy wording. "Helpful" for who? "Helpful" in what way?

      I think most users would find it helpful if the output was not presented as correct, when it's incorrect.

      If every prompt that ever went into an LLM was prefixed with: "tell me a made up story about:", then the user expectation would be more in line with what the output represents.

      But, that's not the way the corps are describing it, is it?

AlienRobot 2 days ago

I'm no AI fan, but articles talking about the shortcomings of LLMs always seem to be complaining that forks aren't good for drinking soup.

Don't use LLMs to do 2 + 2. Don't use LLMs to ask how many r's are in strawberry.

For the love of God. It's not actual intelligence. This isn't hard. It just randomly spits out text. Use it for what it's good at instead. Text.

Instead of hunting for how to do things in programming using an increasingly terrible search engine, I just ask ChatGPT. For example, this is something I've asked ChatGPT in the past:

    in typescript, I have a type called IProperty<T>, how do I create a function argument that receives a tuple of IProperty<T> of various T types and returns a tuple of the T types of the IProperty in order received?

This question, which is such an edge case that I wasn't even sure how to word it properly, actually yielded the answer I was looking for:

    function extractValues<T extends readonly IProperty<any>[]>(
      props: [...T]
    ): { [K in keyof T]: T[K] extends IProperty<infer U> ? U : never } {
      // The mapped return type turns each IProperty<U> in the tuple into its U, in order.
      return props.map(p => p.get()) as any;
    }

This doesn't look unreliable to me. It actually feels pretty useful. I just needed [...T] there and infer there.

  • DanHulton 2 days ago

    The thing is, I have spent the last year being told that I will VERY SOON be able to use a fork to drink soup, and better than any spoon has ever been able to, and in fact pretty soon spoons will be completely outclassed anyway, and I'M the idiot for doubting this.

    Articles like this are still very much needed, to push back against that narrative, regularly, until it DOES become as obvious to everyone as it is to you.

    • AlienRobot 2 days ago

      My impression is that the only people telling others they can drink soup with forks are the people who sell the forks.

      Even this isn't new. A few years ago we had people who sold knives telling everybody you could use knives to drink soup. And in some cases they weren't even kitchen knives, they were switchblades.

  • bluefirebrand 2 days ago

    > Don't use LLM's to do 2 + 2. Don't use LLM's to ask how many r's are in strawberry

    But use them to do more important things that require more precision and accuracy?

    No thanks

    • batshit_beaver 2 days ago

      You use LLMs to _discover_ how to approach important problems. You don't necessarily need to use the output verbatim. Same as StackOverflow and Google.

    • yongjik 2 days ago

      When you employ your developers at $200K/yr you won't trust them to tell you the first one hundred digits of pi, but you'll trust them with your business logic, which is much more important and mission-critical to you.

      Same thing.

      • bluefirebrand 2 days ago

        The difference is that (hopefully) your employee is honest enough to say "I do not know the first 100 digits of Pi offhand but I can find out"

        An LLM will happily produce a string of 100 digits that might be the first 100 digits of Pi, might be some known sequence of 100 digits in Pi but not the first 100, or might be 100 random digits that have nothing to do with Pi

        • empath75 2 days ago

          I was actually curious about this, and ChatGPT accurately (and very slowly) gave me the first 100 digits of pi, one digit at a time. I have _no idea_ how that worked; it did not search, nor did it run code. As far as I can tell, it pulled it straight out of its own model.

          If I ask it to use Python, it writes and executes the code _much_ more quickly; same if I ask it to search.
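
          The code it writes for that is usually just a couple of lines of arbitrary-precision arithmetic; something like this, assuming it reaches for mpmath:

              from mpmath import mp, nstr

              mp.dps = 110                # working precision, a little more than we print
              print(nstr(mp.pi, 101))     # 101 significant digits = "3." plus 100 decimals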

  • coliveira 2 days ago

    The problem is exactly how the public will learn "not to ask 2+2". When you have a well-trained professional using an LLM, it's all great. They know how to separate hallucination from actually good results, as you do. The problem lies with the general public and new workers who will, no question about it, use the AI-generated results as some sort of truth.

    • AlienRobot 2 days ago

      Maybe use an LLM to detect when the public is asking the wrong question and display a message saying "As a large language model, I don't know how to count."

  • Marazan 2 days ago

    People need to stop recommending forks to drink soup with.

  • uludag 2 days ago

    So many times I've asked questions just like this and gotten completely incorrect, nonsense answers. In fact, you have no guarantee whatsoever that even the TypeScript question you asked will always return a sensible answer.

    I'm by no means saying that LLMs aren't useful. They're just not reliably useful.

smeeger 2 days ago

Hallucinations are essentially the only thing keeping all knowledge workers from being made permanently redundant. If that doesn't make you a little concerned, then you are a fool. And the predictions of all the experts in 2010 were that what is currently happening right in front of us could never happen within a hundred years. Why are the predictions of experts more reliable now? Anyone who dismisses the risks is just a sorry fool.

  • bgnn 2 days ago

    I'm a knowledge worker (electrical engineer) but not one bit worried about being replaced by AI in the foreseeable future. It would not only need to be reliable, but also able to create - as in, create physically working complex systems - for me to be worried. I have not seen anything remotely close to this yet.

    I believe AI/ML will eventually get there, but definitely not with LLMs or by hoarding the whole internet. Most human know-how isn't on the internet!

    Oh, I guess I'm a fool.

    • smeeger 2 days ago

      You are. Just change a few words around and you would be reading the confidently incorrect predictions of essentially all scientists and engineers in 2010. You say LLMs won't get us there… and you personally would probably have said word2vec couldn't get us past the Turing test… and here we are. Citing the existence of a current technology as evidence that another technology, related or not, cannot exist is lazy and stupid. The simple fact is that there has been an explosion in progress recently… and a corresponding explosion of funding, and the specific purpose of every single dollar of research is to create AGI, whether through LLMs or some other framework. To dismiss this situation as totally unconcerning is literally FOOLISH.

      • bgnn 4 hours ago

        That's good to hear that I'm foolish. It's always nice to be foolish.

        Foolishness aside, all we have to predict the future with is the current capabilities of the current technology. Bear and bull alike do this. This is the reason people believe we are closer to AGI than, say, a couple of years ago. I have no idea how close or far off we are. What I'm interested in is the current and predictable near-future capabilities of the technology.

godelski 2 days ago

I think this misses some of the core problems, and it suggests there are more straightforward solutions than there really are. We have no solutions to this, and the way we're treating it means we aren't going to come up with any.

Problem 1: Training

Using any method like RLHF, DPO, or such guarantees that we train our models to be deceptive.

This is because our metric is the Justice Potter Stewart metric: I know it when I see it. Well, you're assuming that this is accurate. The original case was about defining porn, and well... I don't think it is hard to see how people disagree even on this. Go on Reddit and ask if girls in bikinis are safe for work or not. But it gets worse. At times you'll be presented with a choice between two lies: one lie you know is a lie, and the other you don't. So which do you choose? Obviously the latter! This means we optimize our models to deceive us. This is true too when the choice is between a truth and a lie we do not know is a lie. They both look like truths.

This will be true even in completely verifiable domains. The problem comes down to truth not having infinite precision. A lot of truth is contextually dependent. Things often have incredible depth, which is why we have experts. As you get more advanced those nuances matter more and more.

Problem 2: Metrics and Alignment

All metrics are proxies. No ifs, ands, or buts. Every single one. You cannot obtain direct measurements which are perfectly aligned with what you intend to measure.

This can be easily observed with even simple forms of measurement, like measuring distance. I studied physics and worked as an (aerospace) engineer prior to coming to computing. I did experimental physics, and boy, is there a fuck ton more complexity to measuring things than you'd guess. I have a lot of rulers, calipers, micrometers and other stuff at my house. Guess what: none of them actually agree on measurements. They are all pretty close, but they differ by more than their marked precision levels. I'm not talking about my ruler with mm hatch marks being off by <1mm, but rather >1mm. RobertElderSoftware illustrates some of this in this fun video[0]. In engineering, if you send a drawing to a machinist and it doesn't have tolerances, you have actually not provided them measurements.

In physics, you often need to get a hell of a lot more nuanced. If you want to get into that, go find someone who works in an optics lab. Boy, does a lot of stuff come up that throws off your measurements. It seems straightforward - you're just measuring distances.

This gets less straightforward once we talk about measuring things that aren't concrete. What's a high fidelity image? What is a well written sentence? What is artistic? What is a good science theory? None of these even have answers and are highly subjective. The result of that is your precision is incredibly low. In other words, you have no idea how you align things. It is fucking hard in well defined practical areas, but the stuff we're talking about isn't even close to well defined. I'm sorry, we need more theory. And we need it fast. Ad hoc methods will get you pretty far, but you'll quickly hit a wall if you aren't pushing the theory alongside it. The theory sits invisible in the background, but it is critical to advancements.

We're not even close to figuring this shit out... We don't even know if it is possible! But we should figure out how to put bounds, because even bounding the measurements to certain levels of error provides huge value. These are certainly possible things to accomplish, but we aren't devoting enough time to them. Frankly, it seems many are dismissive. But you can't discuss alignment without understanding these basic things. It only gets more complicated, and very fast.

[0] https://www.youtube.com/watch?v=EstiCb1gA3U

jmathai 2 days ago

My experience with LLM-based chat is so different from what the article (and some friends) describe.

I use LLM chat for a wide range of tasks including coding, writing, brainstorming, learning, etc.

It's mostly right enough. And so my usage of it has only increased and expanded. I don't know how much less right it would need to be, or how much more often, for me to reduce my usage.

Honestly, I think it’s hard to change habits and LLM chat, at its most useful, is attempting to replace decades long habits.

Doesn’t mean quality evaluation is bad. It’s what got us where we are today and what will help us get further.

My experience is anecdotal. But I see this divide in nearly all discussions about LLM usage and adoption.

  • bluefirebrand 2 days ago

    > It’s mostly right enough.

    Honestly this is why your experience is different: your expectations are different (and likely lower). I never find they are "mostly right enough", I find they are "mostly wrong in ways that range from subtle mistakes to extremely incorrect". The more subtly they are wrong, the worse I rate their output actually, because that is what costs me more time when I try to use them

    I want tools that save me time. When I use LLMs I have to carefully write the prompts, read and understand, evaluate, and iterate on the output to get "close enough" then fix it up to be actually correct.

    By the time I've done all of that, I probably could have just written it from scratch.

    The fact is that typing speed has basically never been the bottleneck for developer productivity, and LLMs don't offer much except "generate the lines of code more quickly" imo

    • mjr00 2 days ago

      It's also what you're writing. The GP commenter's bio shows they're a product lead, not a full-time software developer. To make some broad assumptions about what kind of code they're talking about: using an LLM for "write me a Python script that queries the Jira API for all tickets closed in the past week" is a much different task from "change the code in our 15 year old in-house accounting software to handle these tariffs", both in terms of the code that gets written and the consequences of the LLM getting it wrong.
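
      To make that concrete: the first kind of script is only a handful of lines against Jira's REST search API. Here's a rough sketch, not anyone's actual code -- the instance URL, credentials, and JQL are placeholder assumptions:

          import requests

          JIRA_URL = "https://example.atlassian.net"            # placeholder instance
          AUTH = ("user@example.com", "api-token-placeholder")  # placeholder credentials

          # Ask Jira's REST search API for issues resolved in the last 7 days.
          resp = requests.get(
              f"{JIRA_URL}/rest/api/2/search",
              params={
                  "jql": "resolutiondate >= -7d",
                  "fields": "key,summary,assignee",
                  "maxResults": 100,  # pagination ignored for brevity
              },
              auth=AUTH,
          )
          resp.raise_for_status()

          for issue in resp.json()["issues"]:
              assignee = issue["fields"]["assignee"]
              name = assignee["displayName"] if assignee else "Unassigned"
              print(issue["key"], name, issue["fields"]["summary"])

      Even at this size, note that "resolved in the last 7 days" is not the same thing as "status changed to Closed in the last 7 days" -- exactly the kind of subtle mismatch discussed further down the thread.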

      To be clear this isn't a knock on anyone's work, but it does seem to be a source of why "pro-LLM" and "anti-LLM" groups tend to talk past each other.

      • bluefirebrand 2 days ago
        7 more

        Sure, but in both cases you are running a real risk of producing incorrect data

        If you're a product lead and you ask an LLM to produce a script that gets that output, you still should verify the output is correct

        Otherwise you run a real risk of seeming like an idiot later when you give a report on "tickets closed in the past week" and your data is completely wrong. "Why hasn't John closed any tickets this week? Is he slacking off?"... "What? He closed more tickets than anyone..." And then it turns out that the unreliable LLM script excluded him for whatever reason

        Of course I understand that people are not going to actually be this careful, because more and more people are trusting LLM output without verifying it. Because it's "right enough" that we are becoming complacent

        • mjr00 2 days ago
          4 more

          You're absolutely right. You need to verify the script works, and you need to be able to read the code to see what it's actually doing and whether it passes the smell test (as a sibling commenter said, the same way you would for a code snippet off StackOverflow). But ultimately for these bits, which are largely rote "take data from API, transform into data format X" tasks, LLMs do a great job getting at least 95% of the way there, in my experience. In a lot of ways these are the perfect job for an LLM: most of the work is just typing (as in, pressing buttons on a keyboard) and passing the right arguments to an API, so why not outsource that to an LLM and verify the output?

          The challenge comes when dealing with larger systems. An LLM might suggest Library A for accomplishing a task, but if your codebase already has Library B for that, or has Library A but a version from 2020 with a different API, you need to make judgment calls about the right approach to take, and the LLM can't help you there. Same with code style, architecture, how future-proof-but-possibly-YAGNI you want your design to be, etc.

          I don't think "vibe coding" or making large changes across big code bases really works (or will ever really work), but I do think LLMs are useful for isolated tasks and it's a mistake to totally dismiss them.

          • bluefirebrand 2 days ago
            3 more

            > so why not outsource that to an LLM and verify the output?

            I mean sure, why not. My argument isn't that it doesn't work, it's that it doesn't really save time

            If you try to have it do big changes you will be swamped reviewing those changes for correctness for a long time while you build a mental model of the work

            If you have it do small changes, the actual performance improvement is marginal at best, because small changes already don't take much time or effort to create

            I really think that LLM-coding has largely just shifted "time spent typing" to "time spent reviewing"

            Yes, past a certain size reviewing is faster than typing. But LLMs still aren't producing terribly good output for large amounts of code

            • mjr00 2 days ago
              2 more

              For some classes of problems, I disagree that it doesn't save time.

              As a concrete recent example, I had to write a Python script which checked for any postgres tables where the primary key was of type 'INT' and print out the max value of the ID for each table. I know broadly how to do this, but I'd have to double check which information_schema table to use, the right names of the columns to use, etc. Plus a refresher on direct use of psycopg2 and the cursor API. Plus the typing itself. I just put that query into an LLM and it gave me exactly what I needed, took about 30-60 seconds total. Between the research and typing that's easily 10 minutes saved, maybe closer to 20 really.
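
              The result was roughly the following shape (a reconstructed sketch rather than the exact output; the connection string is a placeholder, and whether checking data_type = 'integer' alone is enough, versus also covering smallint, is the kind of detail you still verify yourself):

                  import psycopg2

                  # Placeholder connection string.
                  conn = psycopg2.connect("dbname=mydb user=me")
                  cur = conn.cursor()

                  # Find integer-typed primary key columns via information_schema.
                  cur.execute("""
                      SELECT kcu.table_name, kcu.column_name
                      FROM information_schema.table_constraints tc
                      JOIN information_schema.key_column_usage kcu
                        ON tc.constraint_name = kcu.constraint_name
                       AND tc.table_schema = kcu.table_schema
                      JOIN information_schema.columns c
                        ON c.table_schema = kcu.table_schema
                       AND c.table_name = kcu.table_name
                       AND c.column_name = kcu.column_name
                      WHERE tc.constraint_type = 'PRIMARY KEY'
                        AND c.data_type = 'integer'
                        AND tc.table_schema = 'public'
                  """)

                  for table, column in cur.fetchall():
                      # Report how close each INT primary key is to overflowing.
                      cur.execute(f'SELECT max("{column}") FROM "{table}"')
                      print(table, column, cur.fetchone()[0])

                  conn.close()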

              And I mean, no, this example isn't worth the $10 trillion or whatever the economy thinks AI is worth, but given that it exists, I'm happy to take advantage of it.

              • bluefirebrand 2 days ago

                I don't see a lot of value in "saving 10-20 minutes here and there" tbh

                Especially since I'm not ever likely to see any benefit from my employer for that extra productivity

        • keybrd-intrrpt 2 days ago
          2 more

          > you still should verify the output is correct

          And that's a problem with the workflow, not a problem with the LLM.

          It's no different from verifying that the information from your Google search is accurate or that the Stack Overflow answer you found actually works. But for some reason people have higher expectations of LLM output.

          • bluefirebrand 2 days ago

            People aren't trying to produce entire codebases in 10 minutes using Stack Overflow, or giving it free rein to refactor the entire codebase

      • cwillu 2 days ago

        Having poked at a few database queries with subtle errors that compounded with a flawed understanding, resulting in wildly incorrect conclusions, [a realistic expansion of] “write me a Python script that queries the Jira API for all tickets closed in the past week” is exactly the place where I expect those fuckups to come from.

    • throwacct 2 days ago

      This. I use LLMs for some tasks, but for more complex issues, I do it myself. I tried to use it for a project by defining each task as clearly as possible, and I spent weeks trying to come up with something useful. Mind you, I achieved 80% of what I wanted after iterating and "telling" the chat that their answers were wrong, and going over the code to double-check if everything was okay. Now I use it for specific, simple tasks if these are work-related, and then use it for random kinds of stuff that I can verify by going to the actual source.

      • bluefirebrand 2 days ago

        > Mind you, I achieved 80% of what I wanted after iterating and "telling" the chat that their answers were wrong, and going over the code to double-check if everything was okay

        I very often read things like this, and I'm surprised how often the person estimates "around 80%" of the work was good. It feels so perfectly tailored to the Pareto Principle

        The LLM does the easy 80% (which we usually say takes 20% of the time anyways). Then the human has to go do the harder remaining 20%, only with a much smaller mental model of how the original 80% is fitting together

    • empath75 2 days ago

      They save me a tremendous amount of time, you just need to be smart about what you try to get them to do. _Busy work_ is what you want to focus on, not anything that takes a ton of domain knowledge and intelligence.

      Just as an example from today, I had a huge pile of YAML documents that needed to have some transformations done to them -- they were pretty simple and obvious, but I just went into Cursor, gave it a before and after and a few notes, and it wrote a Python script in less than 10 seconds that converted everything exactly the way I needed. Did it save me a day of work? Probably not, but probably an hour or so of looking up Python docs and iterating until I worked out all the syntax errors myself. An hour here and an hour there adds up to a _lot_ of saved time.
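
      The script was along these lines (a reconstructed sketch, not the real one; the actual transformation was more involved, so the key rename below is just a stand-in, and I'm assuming PyYAML):

          import sys
          import yaml  # PyYAML

          def transform(doc):
              # Stand-in transformation: the real one mapped several old fields to new ones.
              if isinstance(doc, dict) and "oldKey" in doc:
                  doc["newKey"] = doc.pop("oldKey")
              return doc

          for path in sys.argv[1:]:
              with open(path) as f:
                  docs = [transform(d) for d in yaml.safe_load_all(f)]
              with open(path, "w") as f:
                  yaml.safe_dump_all(docs, f, sort_keys=False)

      Point it at the files on the command line and every document in every file gets rewritten in place.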

      I spent more time just writing this comment than I did asking Cursor to write and run that script for me.

      Other things I had an LLM do for me just _today_: fix a GitHub Action that was failing, and knock out a developer readme for a Helm chart documenting what all the values do -- that's one of the kinds of things where it gets a lot of stuff wrong, but typing speed _is_ the bottleneck. It took me a minute or so to fix the stuff it misunderstood, but the formatting and the bulk of it was fine.

      • bgnn 2 days ago

        Isn't the article saying it's mainly useful for SW?

        I'm an electrical engineer, and the only cases where LLMs were useful for me were developing Python scripts or translating text into a foreign language that I speak fluently.

        They are absolutely garbage for anything electrical engineering related, even coding RTL.

      • bluefirebrand 2 days ago
        2 more

        > _Busy work_ is what you want to focus on, not anything that takes a ton of domain knowledge and intelligence

        Eh..

        Maybe that's more of a sign that we shouldn't be doing busywork in the first place

        • empath75 2 days ago

          You are in a magical place if you never have to do busy work.

  • nomel 2 days ago

    From what I can tell, rather than a simple difference in expectations (which could explain your positive experience vs others'), it seems to be a "comfort with uncertainty" difference that is basically a personality trait!

    You're comfortable with the uncertainty, and accommodate it in your use and expectations. You're left feeling good about the experience, within that uncertainty. Others are repelled by uncertainty, so will have a negative experience, regardless of how well it may work for a subset of tasks they try, because that repulsive uncertainty is always present.

    I think it would be interesting (and possibly very useful/profitable for the marketing/UI departments of companies that use AI) to find the relation between perceived AI usefulness and the results of some of the "standard" personality tests.

    • TheOtherHobbes 2 days ago

      It's not comfort with uncertainty, it's discomfort with the predictable effects of uncertainty.

      I don't want to have to waste time tidying up after an unreliable software tool which is being sold as saving me time. I don't want to be misled by hallucinated fantasies that have no relationship to reality. (See also - lawyers getting laughed out of courtrooms because of this.)

      I don't want to have to cancel a travel booking because an AI agent booked me a holiday in Angkor Wat when I wanted a train ticket to Crystal Palace in South London.

      Hypotheticals? Not even slightly. Ask anyone who's lost their KDP author account on Amazon or been locked out of Meta because of AI moderation errors.

      This is common sense, not some kind of personality flaw.

      I'm happy using LLMs for coding and research, but it's also clear the technology is in perpetual beta - at best - and is being wildly oversold.

      Normal software operating with this level of reliability would be called "very buggy."

      But apparently LLMs get a pass because one day they might not be as buggy as they are today.

      Which - if you think about it - is ridiculous, even by the usual standards of the software industry.

      • nomel 2 days ago

        These apply:

        > comfortable with the uncertainty, and accommodate it in your use

        Many of the tasks you listed require absolute determinism.

        > regardless of how well it may work for a subset of tasks they try

        You're using examples that require absolute determinism, even though, with certainty, it has worked for some of the tasks you've thrown at it.

    • kenjackson 2 days ago

      I wonder if this is like dishwasher usage. As a kid growing up we never used the dishwasher; it was just the drying rack. The reason was you had to rinse off the big stuff anyway, and the resulting wash quality was poor. You'd often get a fork with rice still stuck between the tines, which was unacceptable.

      As a grown-up now I use the dishwasher for everything that is permitted to go in it. I still have to rinse off plates first, and occasionally I do see rice between the tines of a fork that I then have to clean manually. But I'm now comfortable knowing that it won't clean as well as I could by hand; it does a good enough job -- and in some ways a much better job (it uses much hotter water than I do by hand). I don't know if my mom could ever really be comfortable with it though.

      • mastercheif 2 days ago

        You don’t need to pre-wash dishes before they go in the dishwasher, beyond a basic scraping of the plate into the garbage.

        Pre-washing dishes degrades the performance of the dishwasher. This is due to the use of enzymes in modern detergent formulations.

        I’ve sent dozens of people the Technology Connections video on this topic to rave reviews: https://youtu.be/jHP942Livy0

      • dingnuts 2 days ago

        This is a funny example since, for a long time anyway, dishwashers have been much better at actually sanitizing dishes due to the much higher temperatures that can be used vs hand washing. I don't feel like hand washed dishes are truly clean. Oh you rubbed it with a nasty dish rag and water cool enough to touch? greeeeaaaaat

      • bluefirebrand 2 days ago
        3 more

        Imagine if the advice for Dishwasher usage mirrored the advice for AI

        "You have to iterate on the output to get good results"

        Just keep running that dishwasher until they're clean! If you run it and they're still dirty, load it up with soap and try again!

        • keybrd-intrrpt 2 days ago
          2 more

          That's all new technology though. Dishwashers _were_ like that.

          What seems to have changed is people's expectation that technology "just works", when in reality we are in the infant years of AI/ML and LLMs

          We're so spoiled by the pace of innovation we're upset it requires a bit of hand-holding while they figure things out.

          • blharr 2 days ago

            It's still egregious because the main theme is "Learn how to work with AI so you won't be left behind in the future!" In that case the analogy is wasting time learning the quirks of old dishwashers when new dishwashers won't have them.

  • foobiekr 2 days ago

    Charitably, your low expectations are probably the source of your finding them acceptable.

    It’s also possible - and you should not take this as an insult, it’s just the way it is - you may not know enough about the subjects of your interactions to really spot how wrong they are.

    However the cases you list - brainstorming - don’t really care about wrong answers.

    Coding is in the eye of the beholder, but for anything that isn’t junk glue code, scripts or low-complexity web stuff, I find the output of LLMs just short of horrendous.

    • CuriouslyC 2 days ago

      The code that the best frontier models produce is definitely good if you prompt them with what you believe "good" means, with the caveat that code quality depends heavily on the language: Python, TypeScript/JavaScript, Java and C are quite good; Rust, C++ and Go tend to be decent to weak depending on the specific model; and other languages are poor.

      • foobiekr 2 days ago

        The C output is absolutely terrible. I cannot fathom an experienced C coder who has found otherwise for anything non-trivial. The code is full of things like returning pointers to stack memory, poor buffer size discipline, etc.

      • fellowniusmonk 2 days ago

        Yeah, I've had mixed results with Rust. Oddly it's been most helpful for me so far in getting Rust code running in WASM without having to know anything about WASM, which I have found delightful.

  • fellowniusmonk 2 days ago

    I really don't understand people who are down on LLMs.

    In terms of code output, I have gone from the productivity of a single Sr. Engineer to that of a team with 0.8 of a Sr. Engineer, 5 Jr. Engineers, and one dude solely dedicated to reading/creating documentation.

    Unlike a lot of my fellow engineers who are also from traditional CS backgrounds and haven't worked in revenue-restricted startup environments, I have also been VERY into interpreted languages like Ruby in the past.

    Now compiled languages are even better: from a velocity perspective they are now incredibly on par with interpreted languages for prototyping, and have had their last weakness removed.

    It's both exciting and scary. I can't believe how people are still sleepwalking in this environment and don't realize we are in a different world. Once again the human inability to "gut reason" about exponentials is going to screw us all over.

    There is one terribly overlooked thing I've noticed that I think explains the differing takes. The foundation of my position is here: https://www.nature.com/articles/s41598-020-60661-8

    Within the population that writes code, there is a small number of successful people who approach the topic in a ~purely mathematical way, and a small number of successful people who approach writing code in a ~purely linguistic way. Most people fall somewhere in the middle.

    Those who are on the MOST extreme end of the mathematical side and are linguistically bereft HATE LLMs and effectively cannot use them.

    My guess is that the HN population will tend to show stronger reactions against LLMs because it was heavily seeded with functional programmers, a group that I think has a concentration of the successful, extremely math-focused types. I worked for several years in a purely functional shop (Elixir, Haskell, Ramda) and that was my observation.

    Just my speculation.

    • whyowhy3484939 2 days ago

      There is this interesting thing called the Paradox of Automation, where increasing automation increases the importance of human intervention. We are trying this out on a societal level. It will be... interesting, to say the least.

      Also, congratulations on becoming a team. I sure hope you have the mental bandwidth to check all that output carefully. If so, doubly congrats, because you might be the smartest human that ever lived.

      • fellowniusmonk 2 days ago
        2 more

        I appreciate your incredulity and snark! Dismissing without engagement is a fun ability to exercise. I look forward to talking past each other going forward :-)

        HackerNews typically doesn't appreciate that type of engagement and will ban accounts for it, as it is just personal and not a factual wrestling with the point of discussion. I see you are new here, and I would encourage you not to continue the patterns you're showing.

        At core, I think perhaps we have a different interpretation of what 20% of a Sr. Engineer can accomplish and what Jr. Devs are capable of accomplishing.

        To be fair to your point, I think one of the enablers is that I actually enjoy working longer hours now so my net time engaging with code has gone up as well.

        But I'm from the old school and I've always preferred time in code vs having outside hobbies, that's been true since the 90s.

        I find code reviews relaxing and enjoyable and not particularly mentally taxing for 90% of what a decent jr. dev writes. I find it a nice little break from working on problems that can actually be classified as "hard".

        Coincidentally, I've worked in human in the loop automation for quite a long time, making Sr. individuals more efficient with their time and removing busy work has been a big focus.

        There is a lot in that space to consider from a human factors perspective: the intersection of creation vs. editing is a big one, decomposing problems for sure, and each individual seems to have different capabilities and natural bents in that regard. I've long been a thought-dump-and-edit person, and that's part of what I attribute my high personal productivity to.

        • whyowhy3484939 2 days ago

          Ah, I met my match it seems.

          I confess I might be showing signs of unlawful thought patterns. I will correct that, fellowniusmonk. Thanks for pointing that out.

          I am in the "code is not an asset, it's a liability"-camp and our recently acquired ability to swiftly defecate metric tons of it is not something I am particularly thrilled about. In fact, I find "senior" engineers using LoC as a productivity metric highly suspect - at best. I thought we passed that phase a decade or two ago. Not saying you are one, but in the spirit of talking past each other I thought it prudent to put up a good straw man.

          All in all to be completely honest I find it hard to parse your original point so I concur I wasn't engaging properly. To be fair you opened with "in terms of code output" so that's what triggered me I guess.

    • yoyohello13 2 days ago

      > Those who are on the MOST extreme end of the mathematic side and are linguistically bereft HATE LLM's and effectively cannot use them.

      This is an interesting observation. It at least aligns with my experience. I wouldn't say I'm "linguistically bereft" lol, but I do lean more toward the "functional programming is beautiful" side. I even have a degree in math. I'm not totally down on LLM coding, but I do fall more on the unfavorable side. I mostly just hate the idea of having a bunch of code I don't fully understand but am still responsible for.

      I do use them, and find them helpful. But the idea of fully giving control of my codebase to LLM agents, like some people are suggesting, repels me.

      • fellowniusmonk 2 days ago

        Yeah, I certainly don't mean to imply that's the only reason. There are MANY reasons to hate LLMs and people all up and down the spectrum hate them for any number of reasons. I definitely think utility is still language specific as well (LLMs are just terrible with some languages), project specific, etc.

        I think currently there are prompts and approaches that help ensure functions stay small and easy to reason about, but it's very context dependent. Certainly any language or framework that has a large amount of boilerplate will be less painful to work with if you hate boilerplate, though I think that could arguably be increasing enshittification in a sense. The people who say tons of code is being generated and it will all come crashing down in an unmaintainable mess... I do kinda agree.

        I'm glad I am not writing code in medical/flight control systems or something like that. I think LLMs can be used in that context, but idk whether they would save time or cost time.

        Certain types of tasks require greater precision. Like in working with wood, framing a house is fine but building a dovetailed cabinet drawer is not on the table if that makes sense?

        My impression is that at this point, work in high-precision environments is still in the human domain and not the LLMs'. Multi-agent approaches, maybe, or treating humans like the final agent in a multi-agent approach, maybe, idk. I'm not working on any life-or-death libraries or projects ATM, but I do feel good about test coverage, so maybe that's good enough in a lot of cases.

        People who say non-devs can dev with ai or cursor, I think at this point that's just a way of getting non-technical people to burn tokens and give them more money, but idk if that will be true in six months you know?

  • light_hue_1 2 days ago

    > It’s mostly right enough

    What do you use it for?

    In my space, "mostly right enough" isn't useful. Particularly when that means that the errors are subtle and I might miss them. I can't write whitepapers that tell people to do things that would result in major losses.

  • leptons 2 days ago

    It's fine if LLMs are used casually, for things that don't affect anyone but the user. But when someone plugs an LLM into Social Security or other governmental bodies to take action on real human beings, then disaster awaits. Nobody is going to care if the LLM got it wrong if you're just chatting with it or writing some wonky code that doesn't matter in the real world, but when your government check is reduced or deleted by an LLM that is hallucinating, then the real problems start. These things should not be trusted with anything but the least consequential actions an individual would use it for.

    • gte525u 2 days ago

      ^This - we're trying to use one to partially automate some systems engineering type activities.

      It's great for reviews where any given reviewer could be expected to have a misunderstanding of certain details or skip a section (RAG somewhat helps this) - but it's frustrating for artifact generation where missing details cascade through the project.

      As great as the technology is (right now), it seems far from reliable business process automation.

  • strangattractor 2 days ago

    IMHO it's a great summarizing search engine. I now don't have to click on a link to go to the original source - Gemini just hands me a useful summary. Ask AI to do something specific that requires GI (General Intelligence) and your mileage may vary. So as OpenAI and Google suck in all your content, creators, you are going to find yourselves deriving less and less revenue from visits to your sites. Just sayin'.

    • foobiekr 2 days ago

      Gemini routinely inaccurately reports the contents in the summary. I have found it actually reversing things on a regular basis. The summary says no and the source says yes.

    • hooverd 2 days ago

      DuckDuckGo, which uses Bing I think, now has Bing's AI summaries instead of the goddamn content in search results, which makes evaluating the search results at a glance useless!

      • yegg 2 days ago

        For what it's worth, we produce our own summaries, and you can turn them off if you don't like them. We also offer noai.duckduckgo.com, which turns all of our AI features off automatically.