I don't understand the stance that AI currently is able to automate away non-trivial coding tasks. I've tried this consistently since GPT 3.5 came out, with every single SOTA model up to GPT 5.1 Codex Max and Opus 4.5. Every single time, I get something that works, yes, but then when I start self-reviewing the code, preparing to submit it to coworkers, I end up rewriting about 70% of the thing. So many important details are subpar about the AI solution, and many times fundamental architectural issues cripple any attempt at prompting my way out of it, even though I've been quite involved step-by-step through the whole prototyping phase.
I just have to conclude 1 of 2 things:
1) I'm not good at prompting, even though I am one of the earliest adopters of AI for coding that I know, and I have been at it consistently for years, so I find this hard to accept.
2) Other people are just less picky than I am, or they have a less thorough review culture that lets subpar code slide more often.
I'm not sure what else I can take from the situation. For context, I work on a 15-year-old Java Spring + React (with some old pages still in Thymeleaf) web application. There are many sub-services, two separate databases, and this application also needs a two-way interface with customer hardware. So, not a simple project, but still. I can't imagine it's way more complicated than most enterprise/legacy projects...
> non-trivial coding tasks
I've come back to the idea that LLMs are super search engines. If you ask a narrow, specific question with one answer, you may well get the answer. For the "non-trivial" questions there will always be multiple answers, and the LLM will give you any of them depending on the precise words you use to prompt it. You won't get the best answer, and in a complex scenario requiring highly recursive cross-checks, some of the answers you get won't even be functional.
It's not readily apparent at first blush that the LLM is doing this, giving all the answers. And, for a novice who doesn't know the options, or an expert who can scan a list of options quickly and steer the LLM, it's incredibly useful. But giving all the answers without strong guidance on non-trivial architectural points leads to entropy. LLMs churning independently quickly devolve into entropy.
I wish LLMs were good at search. I've tried to evaluate them many times for their quality at answering research questions for astrophysics (specifically numerical relativity). If they were good at answering questions, I'd use them in a heartbeat
Without exception, every technical question I've ever asked an LLM that I know the answer to has come back substantially wrong in some fashion. This makes it just... absolutely useless for research. In some cases I've spotted it straight up plagiarising from the original sources, with random capitalisation giving it away.
The issue is that once you get even slightly into a niche, they fall apart because the training data just doesn't exist. But they don't say "sorry, there's insufficient training data to give you an answer"; they just make shit up and state it with total confidence.
LLMs got good at search last year. You need to use the right ones though - ChatGPT Thinking mode and Google AI mode (that's https://www.google.com/ai - which is NOT the same as regular Google's "AI overviews" which are still mostly trash) are both excellent.
I've been tracking advances in AI assisted search here - https://simonwillison.net/tags/ai-assisted-search/ - in particular:
- https://simonwillison.net/2025/Apr/21/ai-assisted-search/ - April is when they started getting good, with o3 and the various deep research tools
- https://simonwillison.net/2025/Sep/6/research-goblin/ - GPT-5 got excellent. This post includes several detailed examples, including "Starbucks in the UK don’t sell cake pops! Do a deep investigative dive".
- https://simonwillison.net/2025/Sep/7/ai-mode/ - AI mode from Google
> LLMs got good at search last year. You need to use the right ones though - ChatGPT Thinking mode and Google AI mode (that's https://www.google.com/ai - which is NOT the same as regular Google's "AI overviews" which are still mostly trash) are both excellent.
I disagree. You might have seen some improvements in the results, but all LLMs still hallucinate hard on simple queries when you prompt them to cite their sources. You'll see ChatGPT insist that the source of its assertions is a 404 link that it swears is working.
This is just completely the opposite of what I've experienced with Claude and Gemini. Sources are identified and, if inaccessible, are not included in the citations. I recently tried a quite specific search aimed at finding information about specific memos and essays cited within a 90s memo by Bill Gates, and it was successful at finding the vast majority of them, something Google search failed at.
I don't want to say that it's a skill issue, but you may just be using the wrong tools for the job.
Oh boy, someone's claiming that chatgpt is actually great now, time to ask it some questions
I asked chatgpt's thinking mode if the adm formalism is strictly equivalent to general relativity, and it made several strongly incorrect statements
This is my favourite:
>3. Boundary terms matter
>To be fully equivalent:
>One must add the correct Gibbons–Hawking–York boundary term
>And handle asymptotic conditions carefully (e.g. ADM energy)
>Otherwise, the variational principle is not well-defined.
Which is borderline gibberish
>The theory still has 2 propagating DOF per spacetime point
This is pretty good too
>(lapse and shift act as Lagrange multipliers, not dynamical fields).
This is also, as far as I'm aware, just wrong, as the gauge conditions are nonphysical. Lapse and shift are generally always treated as dynamical fields.
Its full answer reads like someone with minimal understanding of physics trying to bullshit you. Then I asked it if the BSSN formalism is strictly equivalent to the ADM formalism (it isn't, because it isn't covariant)
This answer is actually more wrong, surprisingly
>Yes — classically, the BSSN formalism is equivalent to ADM, but only under specific conditions. In practice, it is a reparameterization plus gauge fixing and constraint handling, not a new theory. The equivalence is more delicate than ADM ↔ GR.
The ONE thing that doesn't change in the BSSN formalism is the gauge conditions
>Rewriting the evolution equations, adding terms proportional to constraints.
This is also pretty inadequate
>Precise equivalence statement
>BSSN is strictly equivalent to ADM at the classical level if:
...
>Gauge choices are compatible
>(e.g. lapse and shift not over-constraining the system)
This is complete gibberish
It also states:
>No extra degrees of freedom are introduced
I don't think chatgpt knows what a degree of freedom is
>Why the equivalence is more subtle than ADM ↔ GR
>1. BSSN is not a canonical transformation
>Unlike ADM ↔ GR:
>BSSN is not manifestly Hamiltonian
>The Poisson structure is not preserved automatically
>One must reconstruct ADM variables to see equivalence
This is all absolute bollocks. Manifestly hamiltonian is literally gibberish. Neither of these formalisms have a "poisson structure" whatever that means, and sure yes you can construct the adm variables from the bssn variables whoopee
>When equivalence can fail
>Discretized (numerical) system -> Equivalence only approximate
Nobody explain to chatgpt that the ADM formalism is also a discretiseable series of PDEs!
>BSSN and ADM describe the same classical solutions of Einstein’s equations, but BSSN reshapes the phase space and constraint handling to make the evolution well-behaved, sacrificing manifest Hamiltonian structure off-shell.
We're starting to hit timecube levels of nonsense
It also gets the original question completely wrong: the BSSN formalism isn't covariant or coordinate-free. There's an alternative BSSN-like formalism called cBSSN (covariant BSSN), which is similar to ccz4 and z4cc (both covariant). It's an important property that the regular BSSN formalism lacks, which is one of the ways you can identify it as not being strictly equivalent to the ADM formalism on mathematical grounds. So in the ADM formalism you can express your equations in polar coordinates, but if you make that transformation in the BSSN formalism, it's no longer the same.
This has actually gotten significantly worse than the last time I asked chatgpt about this kind of thing; it's more confidently incorrect now.
How did it do when you posed these arguments to it?
Did you try https://elicit.org ?
I have been impressed by its results.
I think this fact stems more from its initial search phase than its pure LLM processing power, but to me it seems the approach works really well.
> Without exception, every technical question I've ever asked an LLM that I know the answer to, has been substantially wrong in some fashion.
The other problem that I tend to hit is a tradeoff between wrongness and slowness. The fastest variants of the SOTA models are so frequently and so severely wrong that I don't find them useful for search. But the bigger, slower ones that spend more time "thinking" take so long to yield their (admittedly better) results that it's often faster for me to just do some web searching myself.
They tend to be more useful the first time I'm approaching a subject, or before I've familiarized myself with the documentation of some API or language or whatever. After I've taken some time to orient myself (even by just following the links they've given me a few times), it becomes faster for me to just search by myself.
>> at answering research questions for astrophysics
I googled for "helium 3" yesterday. Google's AI answer said that helium 3 is "primarily sourced from the moon", as if we were actively mining it there already.
There are probably thousands of scifi books where the moon has some forms of helium 3 mining. Considering Google pirated and used them all for training it makes sense that it puts it in present tense.
On a similar note, Gemini told me that I was born in 2025 when I did a cursory search for my real name. It's rather confident.
I wonder how much memory and computing time goes into making them, vs. a typical "proper" LLM prompt. It's like the freebies you get with a Christmas cracker.
If you nudge it towards tool use, a lot of the time it can give you better answers.
Instead of "how cheese X is usually made", try "search the web and give me a summary of the ways cheese X is made".
> I wish LLMs were good at search
The entire situation of web search for LLMs is a mess. None of the existing providers return good or usable results; and Google refuses to provide general access to theirs. As a result, all LLMs (except maybe Gemini) are severely gimped forever until someone solves this.
I seriously believe that the only real new breakthrough for LLM research can be achieved by a clean, trustworthy, comprehensive search index. Maybe someone will build that? Otherwise we’re stuck with subpar results indefinitely.
YaCy does a pretty good job, is free, and you can run it yourself, so the quality/experience is pretty much up to you. Paired with a local GPT-OSS-120b with reasoning_effort set to high, I'm getting pretty good results. Validated with questions I do know the answer to, it seems alright, although it could be better of course; I'm still getting better results out of GPT 5.2 Pro, which I guess is to be expected.
The point of my comment was that the AI/LLM is almost irrelevant in light of low quality search engine APIs/indexes. Is there a way to validate the actual quality and comprehensiveness of YaCY beyond anecdata?
> Is there a way to validate the actual quality and comprehensiveness of YaCY beyond anecdata?
No, because it's your own index essentially, hence the "the quality/experience is pretty much up to you" part.
How to build a search engine, apparently:
1. Install YaCy
2. Draw the rest of the owl
Yeah, that’s not really reassuring nor indicative of its usefulness or value.
Yeah, if that's how you feel about your own abilities, then I guess that's the way it is. Not sure what that has to do with YaCy or my original comment.
Respectfully, you said:
> YaCy does a pretty good job
I assume that should be qualified with some basic amount of evidence beyond “I said so”? Anyways, thanks for pointing me in the direction of YaCy, will try it out.
An example I had last month: some code package (dealing with PDFs) ran into a resource problem in production. The LLM suggested an adaptation to the segment that caused the problem, but that code pulled in 3 new non-trivial dependencies. I added constraints, and on the next iteration it dropped 1 of the 3. I pushed further and it confirmed my suggestion that the 2 remaining dependencies could be covered just by specifying an already existing parameter in the constructor.
The real problem, btw, was a bug introduced in the PDF handling package 2 versions ago that caused resource handling problems in some contexts, and the real solution was rolling back to the version before the bug.
I'm still using AI daily in my development though; as long as you sort of know what you are doing and have enough knowledge to evaluate the output, it is very much a net productivity multiplier for me.
> But giving all the answers without strong guidance on non-trivial architectural points— entropy. LLMs churning independently quickly devolve into entropy.
Typical iterative-circular process "write code -> QA -> fix remarks" works because the code is analyzable and "fix" is on average cheaper than "write", therefore the process, eventually, converges on a "correct" solution.
LLM prompting is on average much less analyzable (if at all) and therefore the process "prompt LLM -> QA -> fix prompt" falls somewhere between "does not converge" and "convergence tail is much longer".
This is consistent with typical observation where LLMs are working better: greenfield implementations of "slap something together" and "modify well structured, uncoupled existing codebase", both situations where convergence is easier in the first place, i.e. low existing entropy.
I very much agree with how you've categorized the initial state condition that is amenable to LLM-assisted SWE and tends to move toward a greater state of beneficial order. And implicitly I also agree that most of the complement of that set of applied contexts yields roughly medium to not-so-productive results.
But what do you mean by "LLM prompting is on average much less analyzable"? Isn't structured prompting (what that should optimally look like) the most objective and well-defined part of the whole workflow? It's the lowest entropy part of the situation; we know pretty well what a good LLM prompt is and what will be ineffective, even LLMs "know" that. Do you mean "context engineering" is hard to optimize around? The two are often thought of interchangeably, I think, but regardless, context engineering has in fact become the "hard problem" (user facing) in effectively leveraging LLMs for dev work. Ever since the reasoning-class models were introduced, I think, it became more about context engineering in practice than prompting. Nowadays even resuming a session efficiently often requires a non-trivial approach that we've already started to design patterns and build tools around (like CLI coding workflows adding /compact as a user directive, etc.).
I'm not a software engineer by trade, so I can't pretend to know what that fully entails at the tail ends of enterprise scale and complexity, but I've spent a decent amount of time programming. As far as LLMs go, I think there's probably a point somewhere down the road where we get so methodical about context engineering, tooling, and memory management (all of the vast, still somewhat nebulous surrounding space and scaffolding to LLM workflows that has a big impact on productive use of them) that we may eventually engineer that aspect well enough to yield consistently better results across more applied contexts than the "clean code"/"trivial app" dichotomy. But I think the depth of additional effort, knowledge, and skill required of the human user to do this optimal context engineering (once we fully understand how), to get the best out of LLMs, quickly converges to what it already means to be a competent software engineer: the meta layers around just "writing code" that are required to build robust systems and maintain them. The amount of work required to coerce non-deterministic models into effectively internalizing that, or at minimum not fvcking it up, means that juice might not be worth the squeeze when it's essentially what a good developer's job is already. If that's true, then there will likely remain a ceiling on the productivity you can expect from LLM-assisted development for a long time (I conjecture).
> Do you mean “context engineering” is hard to optimize around ? That’s often thought of interchangeably I think,
The so called "context" is part of the prompt.
> we may eventually engineer that aspect to an extent that will be able to much more consistently yield better results across more applied contexts than the “clean code”/“trivial app” dichotomy.
> the amount of work required to coerce non-deterministic models into effectively internalizing that,
That's, essentially, the point here. You write a prompt (or context, or memory, or whatever people want to call it to make themselves feel better), get code out, test the code and get test failures. Now what? Unless the problem is obvious lack of information in the prompt (i.e. something was not defined), there are no methodical ways to patch the prompt in a way that consistently fixes the error.
You can take program code, apply certain analytical rules to it, and exhaustively define all the operations, states and side effects the program will have. That might be an extremely hard exercise to do in full, but in the end this is what it means to be analyzable. You can take a reduced set of rules and heuristics and quickly build a general structure of the operations and analyze deficiencies. If you are given a prompt, regardless of how well structured it is, you cannot, by definition, in general tell what the eventual output is going to look like without invoking the full ruleset (i.e. running the prompt through an LLM); therefore the average fix of a prompt is effectively a full rewrite, which forfeits the shortcut I just described.
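To make that concrete, here is a rough sketch using Python's ast module (the handler function is just an invented stand-in): you can mechanically enumerate what a piece of code does without ever running it, which has no analogue for a prompt.

    # Rough illustration of "code is analyzable": list the calls and assignments
    # in a function without executing it. The analysed function is made up.
    import ast

    source = """
    def handler(order):
        total = sum(item.price for item in order.items)
        send_email(order.customer, total)
        return total
    """

    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            print("calls:", node.func.id)
        elif isinstance(node, ast.Assign):
            names = [t.id for t in node.targets if isinstance(t, ast.Name)]
            print("assigns:", ", ".join(names))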
They don't even really do that IME. If I ask Claude or ChatGPT to generate terraform for non-trivial but by no means obscure or highly unusual setups, they almost invariably hallucinate part of the answer even if a documented solution exists that isn't even that difficult. Maybe vibe coding JavaScript is that much better, or I'm just hopeless at prompting, but I feel a few dozen lines of fairly straightforward terraform config shouldn't require elaborate prompt setups, or I can just save some brain cycles by writing it myself.
For better or for worse, I have spent a large amount of time in terraform since 0.13, and I can confidently say LLMs are very, very bad at it. My favorite is when it invents internal functions (that look suspiciously like Python) that do not exist; even when corrected, it will still keep going back to them. A year or two ago there were bad problems with hallucinated resource field names, but I haven't seen that as much these days.
It is, however, pretty good at refactoring given a set of constraints and an existing code base. It is decent at spitting out boilerplate code for well-known resources (such as AWS), but then again, those boilerplate examples are mostly coming straight from the documentation. The nice thing about refactoring with LLMs in terraform is that, even if you vibe it, the refactor is trivially verifiable because the plan should show no changes, or exactly the changes you would expect.
>I’ve come back to the idea LLMs are super search engines.
Yes! This is exactly what it is. A search engine with a lossy-compressed dataset of most public human knowledge, which can return the results in natural language. This is the realization that will pop the AI bubble if the public could ever bring themselves to ponder it en masse. Is such a thing useful? Hell yes! Is such a thing intellegent? Certainly NO!
> …can return the results in natural language.
That’s one of the most important features, though. For example, LLMs can analyze a code base and tell you how it works in natural language. That demonstrates functional understanding and intelligence - in addition to exceeding the abilities of the majority of humans in this area.
You'd need a very no-true-Scotsmanned definition of intelligence to be able to exclude LLMs. That's not to say that they're equivalent to human intelligence in all respects, but intelligence is not an all-or-nothing property. (If it were, most humans probably wouldn't qualify.)
Whether LLMs are intelligent or not is not really that interesting; it's just a matter of how you define intelligence. It matters maybe to the AI CEOs and their investors because of marketing.
What matters is how useful LLMs actually are. Many people here say they are useful as an advanced search engine and not that useful as a coworker. That is very useful, but most likely not something the AI companies want to hear.
> You’d need a very no-true-Scotsmanned definition of intelligence to be able exclude LLMs.
The thing is that intelligence is an anthropocentric term, and it has always been defined in a no-true-Scotsman way. When we describe the intelligence of other species we do so in extremely human terms (except for dogs). For example, we consider dolphins smart when we see them play with each other, talk to each other, etc. We consider chimpanzees smart when we see them use a tool, row a boat, etc. We don't consider an ant colony smart when it optimizes a search for food sources, only because humans don't normally do that. The only exception here is dogs, which we consider smart when they obey us more easily.
Personally, my take on this is that intelligence is not a useful term in philosophy nor science. Describing a behavior as intelligent is kind of like calling a small creature a bug. It is useful in our day to day speech, but fails when we want to build any theory around it.
In the context of "AI", the use of the word "intelligence" has referred to human-comparable intelligence for at least the last 75 years, when Alan Turing described the Turing Test. That test was explicitly intended to test for a particular kind of human equivalent intelligence. No other animal has come close to passing the Turing Test. As such, the distinction you're referring to isn't relevant to this discussion.
> Personally, my take on this is that intelligence is not a useful term in philosophy nor science.
Hot take.
The Turing test was debunked by John Searle in 1980 with the Chinese room thought experiment. And even looking past that, the existence, and the pervasiveness, of the Turing test proves my point that this term is and always has been extremely anthropocentric.
In statistics there has been a prevailing consensus for a really long time that artificial intelligence is not only a misnomer, but also rather problematic, and maybe even confusing. There has been a concerted effort over the past 15 years to move away from this term towards something like machine learning (machine learning is not without its own set of downsides, but is still miles better than AI). So honestly my take is not that hot (at least not in statistics; maybe in psychology and philosophy).
But I want to justify my take in psychology. Psychometricians have been doing intelligence testing for well over a century now, and the science is not much further along than it was a century ago. No new predictions, no new subfields, etc. This is a hallmark of a scientific dead end. And on the flip side, psychological theories that don't use intelligence at all are doing just fine.
While I agree, I can't help but wonder: if such a "super search engine" were to have the knowledge on how to solve individual steps of problems, how different would that be from an "intelligent" thing? I mean that, instead of "searching" for the next line of code, it searches for the next solution or implementation detail, then using it as the query that eventually leads to code.
Having knowledge isn't the same as knowing. I can hold a stack of physics papers in my hand but that doesn't make me a physics professor.
LLMs possess and can retrieve knowledge but they don't understand it, and when people try to get them to do that it's like talking to a non-expert who has been coached to smalltalk with experts. I remember reading about a guy who did this with his wife so she could have fun when travelling to conferences with him!
I've spent a lot of time thinking about that - what if the realization that we need is not that LLMs are intelligent, but that our own brains work in the same way as the LLMs. There is certainly a cognitive bias to believe that humans are somehow special and that our brains are not simply machinery.
The difference, to me, is that an LLM can very efficiently recall information, or more accurately, a statistical model of information. However, they seem to be unable to actually extrapolate from it or rationalize about it (they can create the illusion of rationalization by knowing what the rationalization would look like). A human would never be able to ingest and remember the amount of information that an LLM can, but we seem to have the incredible ability of extrapolation - to reach new conclusions by deeply reasoning about old ones.
This is much like the difference in being "book smart" and "actually smart" that some people use to describe students. Some students can memorize vast amounts of information, pass all tests with straight A's, only to fail when they're tasked with thinking on their own. Others perform terribly on memorization tasks, but naturally are gifted at understanding things in a more intuitive sense.
I have seen heaps of evidence that LLMs have zero ability to reason, so I believe that there's something very fundamental missing. Perhaps the LLM is a small part of the puzzle, but there doesn't seem to be any breakthroughs that seem like we might be moving towards actual reasoning. I do think that the human brain can very likely be emulated if we cracked the technology. I just don't believe we're close.
Even though I think it's true that it's lossy, I think there is more going on in an LLM neural net. Namely, when it uses tokens to produce output, you essentially split the text into millions or billions of chunks, each with an associated probability. So in essence the LLM can do a form of pattern recognition where the patterns are the chunks, and it also enables basic operations on those chunks.
That's why I think you can work iteratively on code and change parts of the code while keeping others: the code gets chunked and "probabilitized". It can also do semantic processing and understanding, where it can apply knowledge about one topic (like 'swimming') to another topic (like a 'swimming spaceship'); it then generates text about what a swimming spaceship would be, which is not in the dataset. It chunks it into patterns of probability and then combines them based on probability. I do think this is a lossy process though, which sucks.
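A toy sketch of that "chunks plus probabilities" picture (the vocabulary and scoring below are invented purely for illustration, nothing like a real model's learned weights):

    # Toy next-token sampler: the "chunks" are tokens, each assigned a
    # probability, and output comes from repeatedly sampling that distribution.
    import random

    VOCAB = ["the", "swimming", "spaceship", "drifts", "through", "water", "."]

    def next_token_probs(context: list[str]) -> dict[str, float]:
        # A real LLM scores every token from learned weights over the whole
        # context; here we just down-weight tokens that already appeared.
        scores = {tok: (0.2 if tok in context else 1.0) for tok in VOCAB}
        total = sum(scores.values())
        return {tok: s / total for tok, s in scores.items()}

    context = ["the", "swimming", "spaceship"]
    for _ in range(4):
        probs = next_token_probs(context)
        context.append(random.choices(list(probs), weights=list(probs.values()))[0])

    print(" ".join(context))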
Maybe it's looked down upon to complain about downvotes but I have to say I'm a little disappointed that there is a downvote with no accompanying post to explain that vote, especially to a post that is factually correct and nothing obviously wrong with it.
> Is such a thing intellegent [sic]? Certainly NO!
A proofreader would have caught this humorous gaffe. In fact, one just did.
I personally had the completely opposite takeaway: Intelligence, at its core, really might just be a bunch of extremely good and self-adapting search heuristics.
I don't blurt out different answers to the same question using different phrasing, I doubt any human does.
We actually do, and often - depending on who our speaker is, our relationship with them, the tone of the message, etc. Maybe our intellect is not fully an LLM, but I truly wonder how much of our dialectical skills are.
You're describing the same answer with different phrasing.
Humans do that, LLMs regularly don't.
If you phrase the question "what color is your car?" a hundred different ways, a human will get it correct every time. LLMs randomly don't, if the token prediction veers off course.
Edit:
A human also doesn't get confused at fundamental priors after a reasonable context window. I'm perplexed that we're still having this discussion after years of LLM usage. How is it possible that it's not clear to everyone?
Don't get me wrong, I use it daily at work and at home and it's indeed useful, but there is absolutely 0 illusion of intelligence for me.
That would be true if not for LLMs making up answers where none exist.
Like, I've seen Claude go through the source code of a program, telling me (correctly!) which counters in the code return the value I need (I just wanted to look at some packet metrics), then inventing an entirely fake CLI command to extract those metrics.
>It’s not readily apparent at first blush the LLM is doing this, giving all the answers.
Now I'm wondering if I'm prompting wrong. I usually get one answer. Maybe a few options but rarely the whole picture.
I do like the super search engine view though. I often know what I want, but e.g. work with a language or library I'm not super familiar with. So then I ask how do I do x in this setting. It's really great for getting an initial idea here.
Then it gives me maybe one or two options, but they're verbose or add unneeded complexity. Then I start probing asking if this could be done another way, or if there's a simpler solution to this.
Then I ask what are the trade-offs between solutions. Etc.
It's maybe a mix of search engine and rubber ducking.
Agents are, like for OP, a complete failure for me though. Still can't get them to not run off into a completely strange direction, leaving a minefield of subtle coding errors and spaghetti behind.
I’ve recently created many Claude skills to do repeatable tasks (architecture review, performance, magic strings, privacy, SOLID review, documentation review etc). The pattern is: when I’ve prompted it into the right state and it’s done what I want, I ask it to create a skill. I get codex to check the skill. I could then run it independently in another window etc and feed back to adjust…but you get the idea.
And almost every time it screws up, we create a test, often for the whole class of problem. More recently it's been far better behaved. Between Opus, skills, docs, generating Mermaid diagrams, and tests, it's been a lot better. I've also cleaned up so much of the architecture that there's only one way to do things. This keeps it more aligned and helps with entropy. And they'll work better as models improve. Having a match between code, documents and tests means it's not just relying on one source.
Prompts like this seem to work: “what’s the ideal way to do this? Don’t be pragmatic. Tokens are cheaper than me hunting bugs down years later”
Can you tell me more about how you do tests? What do they look like? What testing tools or frameworks do you use?
I'm not going to argue about how capable the models are, I personally think they are pretty capable.
What I will argue is that the LLMs are not just search engines. They have "compressed" knowledge. When they do this, they learn relations between all kinds of different levels of abstractions and meta patterns.
It is really important to understand that the model can follow logical rules and has some map of meta relationships between concepts.
Thinking of a LLM as a "search engine" is just fundamentally wrong in how they work, especially when connected to external context like code bases or live information.
A sufficiently advanced search engine might actually be indistinguishable from intelligence.
After all, until quite recently, chess engines really were quite mechanically search engines too.
I'm just saying you are doing a dis-service to yourself if that is your mental model on how current SOTA models work.
Well, it's "a search engine that applies some transformations on top of the results" doesn't sound to me as a terrible way to think about LLMs.
> can follow logical rules
This is not their strong suit, though. They can only follow through a few levels on their own. This can be improved by agent-style iterations or via invoking external tools.
Let's see how this comment ages, why don't we. I've understood where we are going, and if you look at my comment history, I have confidence that in 12 months' time one opinion will be proved out by observations and the other will not.
For the "only few levels" claim, I think this one is sort of evident from the way they work. Solving a logical problem can have an arbitrary number of steps, and in a single pass there is only so many connection within a LLM to do some "work".
As mentioned, there are good ways to counter this problem (e.g. writing a plan and then iteratively going over those less-complex ones, or simply using the proper tool for the problem: use e.g. a SAT solver and just "translate" the problem to and from the appropriate format)
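For example, a rough sketch of the "hand the logic to a solver" idea, using the z3-solver Python package (the constraints here are arbitrary toy ones):

    # Instead of asking the LLM to chain many logical steps itself, have it
    # translate the problem into solver constraints and read the answer back.
    # Requires `pip install z3-solver`.
    from z3 import Bools, Not, Or, Solver, sat

    a, b, c = Bools("a b c")
    s = Solver()
    s.add(Or(a, b))       # at least one of a, b holds
    s.add(Or(Not(a), c))  # a implies c
    s.add(Not(c))         # c is false

    if s.check() == sat:
        print("model:", s.model())  # here: b = True, a = False, c = False
    else:
        print("constraints are unsatisfiable")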
Nonetheless, I'm always open to new information/evidence and it will surely improve a lot in a year. As for reference, to date this is my favorite description of LLMs: https://news.ycombinator.com/item?id=46561537
Agreed, but:
There's been a notable jump over the course of the last few months, to where I'd say it's inevitable. For a while I was holding out for them to hit a ceiling where we'd look back and laugh at the idea they'd ever replace human coders. Now, it seems much more like a matter of time.
Ultimately I think over the next two years or so, Anthropic and OpenAI will evolve their product from "coding assistant" to "engineering team replacement", which will include standard tools and frameworks that they each specialize in (vendor lock in, perhaps), but also ways to plug in other tech as well. The idea being, they market directly to the product team, not to engineers who may have specific experience with one language, framework, database, or whatever.
I also think we'll see a revival of monolithic architectures. Right now, services are split up mainly because project/team workflows are also distributed so they can be done in parallel while minimizing conflicts. As AI makes dev cycles faster that will be far less useful, while having a single house for all your logic will be a huge benefit for AI analysis.
This doesn't make any sense. If the business can get rid of their engineers, then why can't the user get rid of the business providing the software? Why can't the user use AI to write it themselves?
I think instead the value is in getting a computer to execute domain-specific knowledge organized in a way that makes sense for the business, and in the context of those private computing resources.
It's not about the ability to write code. There are already many businesses running low-code and no-code solutions, yet they still have software engineers writing integration code, debugging and making tweaks, in touch with vendor support, etc. This has been true for at least a decade!
That integration work and domain-specific knowledge is already distilled out at a lot of places, but it's still not trivial. It's actually the opposite. AI doesn't help when you've finally shaved the yak smooth.
> If the business can get rid of their engineers, then why can't the user get rid of the business providing the software?
A lot of businesses are the only users of their own software. They write and use software in-house in order to accomplish business tasks. If they could get rid of their engineers, they would, since then they'd only have to pay the other employees who use the software.
They're much less likely to get rid of the user employees because those folks don't command engineer salaries.
So instead of paying a human that "commands an engineer salary", they'll be forced to pay whatever Anthropic or OpenAI commands to use their LLMs? I don't see how that's a better proposition: the LLM generates a huge volume of code that the product team (or whoever) cannot maintain themselves. Therefore, they're locked in and need to hope the LLM can solve whatever issues they have, and if it can't, hope that whatever mess it generated can be fixed by an actual engineer without costing too much money.
Also, code is only a small piece and you still need to handle your hosting environment, permissions, deployment pipelines, etc. which LLMs / agentic workflows will never be able to handle IMO. Security would be a nightmare with teams putting all their faith into the LLM and not being able to audit anything themselves.
I don't doubt that some businesses will try this, but on paper it sounds like a money pit and you'd be better off just hiring a person.
It’s the same business model as consulting firms. Rather than hiring a few people for 65k each, a VP will bring in a consulting firm for 10M and get a bloated, half-working solution that costs even more to get working. The VP doesn’t care though because he ends up looking like a big shot in front of the other execs.
There are lots of developer agencies that hire developers as contractors that companies can use to outsource development to in a cheaper way without needing to pay for benefits or HR. They don't necessarily make bad quality software, but it doesn't feel humane.
Unless we're talking about some sketchy gig work nonsense, the "agency" is a consultancy like any other. They are a legitimate employer with benefits, w2, etc. It's not like they're pimps or something!
Those devs aren't code monkeys and they get paid the same as anyone else working in this industry. In fact, I think a lot of the more ADHD type people on here would strongly prefer working on a new project every 6 months without needing to find a new employer every time. The contracts between the consultancy and client usually also include longer term support than the limited time the original dev spent on it.
Agencies commonly use 1099 workers; there have been fierce legal battles over the qualifications of agencies (the ABC test).
I believe 1099 worker growth has been outpacing hiring for several years.
The VP doesn't care because the short term result is worth more to the business. The business is not going to trip over dollars to pick up pennies.
Would you prefer that they hire, string those people along, and then fire them? That's a pain in the ass for everyone.
> If the business can get rid of their engineers, then why can't the user get rid of the business providing the software?
I haven't checked the stats lately, but at one point most software written was in non-tech companies for a single business. The first half of my career was spent writing in-house software for a company that did everything from custom reporting and performance tracking to scraping data off automated phone dialers. There's so much software out there that effectively has a user base of a single company.
In some cases that could happen; in particular there may be a lot of UI and cross-app-integration style stuff that starts to get offloaded to users, so users can have AI code up their own UI for using some services together in the way that they want.
But in most cases businesses still need to own their own logic and data, so businesses will still be owning plenty of their own software. Otherwise customers could just write software to buy all your business's products for 99% off!
> Ultimately I think over the next two years or so, Anthropic and OpenAI will evolve their product from "coding assistant" to "engineering team replacement"
The way I see it, there will always be a layer in the corporate organization where someone has to interact with the machine. The transitioning layer from humans to AIs. This is true no matter how high up the hierarchy you replace the humans, be it the engineers layer, the engineering managers, or even their managers.
Given the above, it feels reasonable to believe that, whatever title that person has, the person responsible for converting human management's ideas into prompts (or whatever replaces text prompts in the future) will do a better job if they have a high degree of technical competence. That is to say, I believe most companies will still want, and benefit from it, if those employees are engineers, converting non-technical CEO fever dreams and ambitions into strict technical specifications and prompts.
What this means for us, our careers, or Anthropic's marketing department, I cannot say.
That reminds me of the time when 3GL languages arrived and bosses claimed they no longer needed developers, because anyone could write code in those English-like languages.
Then when mouse-based tools like Visual Basic arrived, same story, no need for developers because anyone can write programs by clicking!
Now bosses think that with AI anyone will be able to create software, but the truth is that you'll still need software engineers to use those tools.
Will we need fewer people? Maybe. But over the past 40 years we have multiplied developer productivity many times over, and yet we still need more and more developers because the needs have grown faster.
My suspicion is that it will be bad for salaries, mostly because it'll kill the "looks difficult" moat that software development currently has. Developers know that "understanding source code" is far from the hard part of developing software, but non-technical folks' immediate recoiling in the face of the moon runes has made it pretty easy to justify high pay for our profession for ages. If our jobs transition to largely "communing with the machines", then we'll go from a "looks hard, is hard" job to a "looks easy, is hard" job, which historically hurts bargaining power.
I don't think "looks difficult" has been driving wages. FAANG etc leadership knows what's difficult and what's not. It's just marginal ROI. If you have a trillion-dollar market and some feature could increase that by 0.0001%, you hire some engineers to give it a try. If other companies are also competing for the same engineers for the same reasons, salaries skyrocket.
I wonder if the actual productivity changes won't end up mattering for the economics to change dramatically, so much as a rebound in favour of seniors. If I were in school 2 years ago, looking at the career prospects and cost of living, I just straight up wouldn't invest in the career. If that happens at a large enough scale, the replenishment of the discipline may slow, which would have an effect on what people who already have those skills can ask for. If the middle step plays out, where wild magical productivity gains don't materialize in a way that reduces the need for expert software people who can reasonably be liable for whatever gets shipped, then we'll stick around.
Whether it looks easy or not doesn't matter as much imo. Plumbing looks and probably is easy, but it's not the CEOs job to go and fix the pipes.
I think this is the right take. In some narrow but constantly broadening contexts, agents give you a huge productivity edge. But to leverage that you need to be skilled enough to steer, design the initial prompt, understand the impact of what you produce, etc. I don't see agents, in their current and medium-term incarnation, as a replacement for engineering work; I see them as a great reshuffling of engineering work.
In some business contexts, the impact of more engineering labor on output gets capped at some point. Meaning once agent quality reaches a certain point, the output increase is going to be minimal with further improvements. There, labor is not the bottleneck.
In other business contexts, labor is the bottleneck. For instance it's the bottleneck for you as an individual: what kind of revenue could you make if you had a large team of highly skilled senior SWEs that operate for pennies on the dollar?
Labor will shift to where the ROI is highest is what I think you'll see.
To be fair, I can imagine a world where we eventually fully replace the "driver" of the agent in that it is good enough to fulfill the role of a ~staff engineer that can ingest very high level business context, strategy, politics and generate a high level system design that can then be executed by one or more agents (or one or more other SWEs using agents). I don't (at this point) see some fundamental rule of physics / economics that prevents this, but this seems much further ahead from where we are now.
I actually think it’s the opposite. We’ll see fewer monorepos because small, scoped repos are the easiest way to keep an agent focused and reduce the blast radius of their changes. Monorepos exist to help teams of humans keep track of things.
Could be. Most projects I've worked on tend to span multiple services though, so I think AI would struggle more trying to understand and coordinate across all those services versus having all the logic in a single deployable instance.
The way I see feature development in the future is: the PM creates a dev cluster (also much easier with a monolith) and has AI implement a bunch of features to spec; the AI provides some feedback and gets input on anywhere it might conflict with existing functionality, whether eventual consistency is okay, which pieces are performance critical, etc., and provides the implementation, a bunch of tests for review, and errata about where to find observability data, design decisions considered and chosen, etc. The PM does some manual testing across various personas and products (along with PMs from those teams), has AI add feature flags, and launches. The feature flag rollout ends up being the long pole, since generally the product team needs to monitor usage data for some time before increasing the rollout percentage.
So I see that kind of workflow as being a lot easier in a monolithic service. Granted, that's a few years down the road though, before we have AI reliable enough to do that kind of work.
> Most projects I've worked on tend to span multiple services though, so I think AI would struggle more trying to understand and coordinate across all those services versus having all the logic in a single deployable instance.
1. At least CC supports multiple folders in a workspace, so that’s not really a limitation.
2. If you find you are making changes across multiple services, then that is a good indication that you might not have the correct abstraction on the service boundary. I agree that in this case a monolith seems like a better fit.
Agreed on both counts. Though for the first one it's still easier to implement things when bugs create compile or local unit/integration test errors rather than distributed service mismatches that can only be caught with extensive distributed e2e tests and a platform for running them, plus the lack of distribution cuts down significantly on the amount of code, edge cases, and deployment sequencing that needs to be taken into account.
For the second, yeah, but IME everything starts out well-factored and almost universally evolves into spaghetti over time. The main advantage monoliths have is that they're safer to refactor across boundaries. With distributed services, there are a lot more backward-compatibility guarantees and concerns you have to work through, and it's harder to set up tests that exercise everything e2e across those boundaries. Not impossible, but hard enough that it usually requires a dedicated initiative.
Anyway, random thoughts.
If you research how something like Cursor works I don't think you would believe it is inevitable. The jump that would have to happen for it to replace engineers entirely is insurmountable. They can keep expanding contexts and coming up with clever ways to augment generation but I don't see it ever actually having full vision on the system, product and users.
Beyond that, it is incredibly biased towards existing code & prompt content. If you wanted to build a voice chat app and you said "should I use websockets or http?", it would say websockets. It won't override you and say "use neither, you should use WebRTC", but an experienced engineer would instantly spot that the prompt itself is flawed. LLMs will just bias towards existing tokens in the prompt and won't surface data that would challenge the question itself.
Unless you, well, state in AGENTS.md that prompts may offer suboptimal options in which case it's the machine's duty to question them, treat the prompter like a coworker and not a boss.
Sit down and re-read your comment one night with your "I am an engineer and will solve this as an engineering problem" hat firmly on. If you stop thinking of LLMs as lobotomized coworkers trapped inside an API wrapper and instead as computational primitives, then things become much more interesting and the future becomes clearer to see.
There's no chance LLMs will be an engineering team replacement. The hallucination problem is unsolvable and catastrophic in some edge cases. Any company using such a team would be uninsurable and sued into oblivion.
Writing software is actually one of the domains where hallucinations are easiest to fix: you can easily check whether it builds and passes tests.
If you want to go further, you can even require the LLM to produce a machine checkable proof that the software is correct. That's beyond the state of the art at the moment, but it's far from 'unsolvable'.
If you hallucinate such a proof, it'll just not work. Feed back the error message from the proof checker to your coding assistant, and the hallucination goes away / isn't a problem.
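Roughly this kind of loop, sketched here with a hypothetical ask_llm placeholder standing in for whatever model API you actually use, and pytest standing in for the checker:

    # Sketch of the "generate -> check -> feed the errors back" loop.
    # `ask_llm` is a hypothetical placeholder; a proof checker would slot into
    # the same place pytest occupies here.
    import subprocess

    def ask_llm(prompt: str) -> str:
        raise NotImplementedError("call your model of choice here")

    def generate_until_green(task: str, max_rounds: int = 5) -> str:
        code = ask_llm(f"Write code for: {task}")
        for _ in range(max_rounds):
            with open("candidate.py", "w") as f:
                f.write(code)
            result = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
            if result.returncode == 0:
                return code  # it builds and the tests pass
            # Hallucinated APIs and broken logic surface here; hand the failure back.
            code = ask_llm(f"The tests failed:\n{result.stdout}\n{result.stderr}\n"
                           f"Fix this code:\n{code}")
        raise RuntimeError("still failing after max_rounds attempts")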
> you can easily check whether it builds and passes tests.
The trend for LLM-generated code is to build and pass tests but not deliver the functionality needed. This link was on HN recently: https://spectrum.ieee.org/ai-coding-degrades
"...recently released LLMs, such as GPT-5, have a much more insidious method of failure. They often generate code that fails to perform as intended, but which on the surface seems to run successfully, avoiding syntax errors or obvious crashes. It does this by removing safety checks, or by creating fake output that matches the desired format, or through a variety of other techniques to avoid crashing during execution."
Also, please consider how SQLite is tested: https://sqlite.org/testing.html
The ratio between test code and the code itself is a mere 590 times (590 LOC of tests per LOC of actual code); it used to be more than 1100.
Here is notes on current release: https://sqlite.org/releaselog/3_51_2.html
Notice fixes there. Despite being one of the most, if not the most, tested pieces of software in the world, it still contains errors.
> If you want to go further, you can even require the LLM to produce a machine checkable proof that the software is correct.
Haha. How do you reconcile a proof with actual code?
I've recently seen Opus, after struggling for a bit, implement an API by having it return JSON that includes instructions for a human to manually accomplish the task I gave it.
It proudly declared the task done.
I believe you have used the Albanian [1] version of Opus.
[1] https://www.reddit.com/r/ProgrammerHumor/comments/1lw2xr6/hu...
Recent models have started to "fix" HTML issues with ugly hacks like !important. The result looks like it works, but the tech debt is considerable.
Still, it's just a temporary hindrance. Nothing a decent system prompt can't take care of until the models evolve.
> Haha. How do you reconcile a proof with actual code?
You can either prove your Rust code correct, or you can use a proof system that allows you to extract executable code from the proofs. Both approaches have been done in practice.
Or what do you mean?
Rust code can have arbitrary I/O effects in any part of it. This precludes using only Rust's type system to make sure the code does what the spec says.
The most successful formally proven project I know of, seL4 [1], did not extract executable code from the proof. They created a prototype in Haskell, mapped it (by hand) to Isabelle, I believe, to have a formal proof, and then recreated the code in C, again manually.
Not many formal proof systems can extract executable C source.
> Haha. How do you reconcile a proof with actual code?
Languages like Lean allow you to write programs and proofs under the same umbrella.
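A minimal sketch of what that looks like in Lean 4 (assuming a reasonably recent toolchain where the omega tactic is available):

    -- The program: an ordinary function on natural numbers.
    def double (n : Nat) : Nat := n + n

    -- The proof: a machine-checked statement about that function, in the same file.
    theorem double_succ (n : Nat) : double (n + 1) = double n + 2 := by
      unfold double   -- goal becomes: n + 1 + (n + 1) = n + n + 2
      omega           -- linear arithmetic over Nat closes it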
As if Lean does not allow you to circumvent its proof system (the "sorry" keyword).
Also, consider adding code to the bigger system, written in C++. How would you use Lean to prove correctness of your code as part of the bigger system?
I mean, it's somewhat moot, as even the formal hypothesis ("what is this proof proving") can be more complex than the code that implements it in nontrivial cases. So verifying that the proof is saying the thing that you actually want it to prove can be near impossible for non-experts, and that's just the hypothesis; I'm assuming the proof itself is fully AI-generated and not reviewed beyond running it through the checker.
And at least in backend engineering, for anything beyond low-level algorithms you almost always want some workarounds: for your customer service department, for engineering during incident response, for your VIP clients, etc. If you're relying on formal proof of some functionality, you've got to create all those allowances in your proof algorithm (and hypothesis) too. And additionally nobody has really come up with a platform for distributed proofs, durable proof keys (kinda), or how to deal with "proven" functionality changes over time.
You focused on writing software, but the real problem is the spec used to produce the software: LLMs will happily hallucinate reasonable but unintended specs, and the checker won't save you, because after all the software created is correct w.r.t. the spec.
Also tests and proof checkers only catch what they’re asked to check, if the LLM misunderstands intent but produces a consistent implementation+proof, everything “passes” and is still wrong.
This is why every one of my coding agent sessions starts with "... write a detailed spec in spec.md and wait for me to approve it". Then I review the spec, then I tell it "implement with red/green TDD".
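For what it's worth, the red/green rhythm itself is nothing exotic; roughly this, with invented names, where the test is written (and fails) before the implementation exists:

    # Red: this test is written first and fails, because parse_price doesn't exist yet.
    def test_parse_price_strips_currency_and_commas():
        assert parse_price("$1,234.50") == 1234.50

    # Green: the minimal implementation that makes the test pass.
    def parse_price(raw: str) -> float:
        return float(raw.replace("$", "").replace(",", ""))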
The premise was that the AI solution would replace the engineering team, so who exactly is writing/reviewing this detailed spec?
Well, perhaps it'll only shrink the engineering team by 95% then.
Why would you shrink the team rather than become 20x more productive as a whole?
Users don't want changes that rapidly. There's not enough people on the product team to design 20x more features. 20x more features means 400x more cross-team coordination. There's only positive marginal ROI for maybe 1.5-2x even if development is very cheap.
Either way can work. It depends on what the rest of the business needs.
The premise is in progress. We are only at the beginning of the fourth year of this hype phase, and we haven't even reached AGI yet. It's obviously not perfect, and maybe never will be, but we are not at the point yet where we can conclude which future is true. The singularity hasn't happened yet, so we are still moving at (LLM-enhanced) human speed at the moment, meaning things need time.
That's a bad premise.
Maybe, but you're responding to a thread about why AI might or might not be able to replace an entire engineering team:
> Ultimately I think over the next two years or so, Anthropic and OpenAI will evolve their product from "coding assistant" to "engineering team replacement", which will include standard tools and frameworks that they each specialize in (vendor lock in, perhaps), but also ways to plug in other tech as well.
This is the context of how this thread started, and this is the context in which DrammBA was saying that the spec problem is very hard to fix [without an engineering team].
Might be good to define the (legacy) engineering team. Instead of thinking 0/1 (ugh, almost nothing happens this way), the traditional engineering team may be replaced by something different. A team mostly of product, spec writers, and testers. IDK.
The job of AI is to do what we tell it to do. It can't "create a spec" on its own. If it did and then implemented that spec, it wouldn't accomplish what we want it to accomplish. Therefore we the humans must come up with that spec. And when you talk about a software application, the totality of its spec written out, can be very complex, very complicated. To write and understand, and evolve and fix such a spec takes engineers, or what used to be called "system analysts".
To repeat: to specify what a "system" we want to create does is a highly complicated task, which can only be done by human engineers who understand the requirements for the system, how parts of those requirements/specs interact with other parts of the spec, and what the consequences of one (part of the) spec are for other parts of it. We must not write "impossible specs" like "draw me a round square". Maybe the AI can check whether the spec is impossible or not, but I'm not so sure of that.
So I expect that software engineers will still be in high demand, but they will be much more productive with AI than without it. This means there will be much more software because it will be cheaper to produce. And the quality of the software will be higher in terms of doing what humans need it to do. Usability. Correctness. Evolvability. In a sense the natural-language spec we give the AI is really something written in a very high-level programming language - the language of engineers.
BTW. As I write this I realize there is no spell-checker integrated into Hacker News. (Or is there?). Why? Because it takes developers to specify and implement such a system - which must be integrated into the current HN implementation. If AI can do that for HN, it can be done, because it will be cheap enough to do -- if HN can exactly spell out what kind of system it wants. So we do need more software, better software, cheaper software, and AI will help us do that.
A 2nd factor is that we don't really know if a spec is "correct" until we test the implemented system with real users. At that point we typically find many problems with the spec. So somebody must fix the problems with the spec, evolve the spec, and rinse and repeat the testing with real users -- the developers who understand the current spec and why it is not good enough.
AI can write my personal scripts for me surely. But writing a spec for a system to be used by thousands of humans, still takes a lot of (human) work. The spec must work for ALL users. That makes it complicated and difficult to get right.
Same, and similarly something like a "create a holistic design with all existing functionality you see in tests and docs plus new feature X, from scratch", then "compare that to the existing implementation and identify opportunities for improvement, ranked by impact, and a plan to implement them" when the code starts getting too branchy. (aka "first make the change easy, then make the easy change"). Just prompting "clean this code up" rarely gets beyond dumb mechanical changes.
Given so much of the work of managing these systems has become so rote now, my only conclusion is that all that's left (before getting to 95+% engineer replacement) is an "agent engineering" problem, not an AI research problem.
In order to prove safety you need a formal model of the system and formally defined safety properties that are both meaningful and understandable by humans. These do not exist for enterprise systems.
An exhaustive formal spec doesn't exist. But you can conservatively prove some properties. E.g. program termination is far from sufficient for your program to do what you want, but it's probably necessary.
(Termination in the wider sense: for example an event loop has to be able to finish each run through the loop in finite time.)
You can see e.g. Rust's or Haskell's type system as another lightweight formal model that lets you make and prove some simple statements, without having a full formal spec of the whole desired behaviour of the system.
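A minimal sketch of the kind of small statement a type system proves for you, assuming Rust; find_user and greet are made-up names, not from any real codebase:

    // The compiler proves, at every call site, that the "not found" case is
    // handled. A tiny formal guarantee, with no spec of the overall system needed.
    fn find_user<'a>(id: u64, users: &'a [(u64, String)]) -> Option<&'a str> {
        users
            .iter()
            .find(|(uid, _)| *uid == id)
            .map(|(_, name)| name.as_str())
    }

    fn greet(id: u64, users: &[(u64, String)]) -> String {
        match find_user(id, users) {
            Some(name) => format!("hello, {name}"),
            // Deleting this arm is a compile error, not a runtime surprise.
            None => "no such user".to_string(),
        }
    }

    fn main() {
        let users = vec![(1, "ada".to_string()), (2, "grace".to_string())];
        println!("{}", greet(1, &users));
        println!("{}", greet(9, &users));
    }

It says nothing about whether greeting users is the right behaviour, which is exactly the "far from sufficient, but probably necessary" trade-off.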
Yeah, but with all respect, that is a totally uninteresting property in an enterprise software system where almost no software bugs actually manifest as non-termination.
The critical bugs here are related to security (DDoS attacks, authorization and authentication, data exfiltration, etc), concurrency, performance, data corruption, transactionality and so forth. Most enterprise systems are distributed or at least concurrent systems which depend on several components like databases, distributed lock managers, transaction managers, and so forth, where developing a proper formal spec is a monumental task and possibly impossible to do in a meaningful way because these systems were not initially developed with formal verification in mind. The formal spec, if faithful, will have to be huge to capture all the weird edge cases.
Even if you had all that, you need to actually formulate important properties of your application in a formal language. I have no idea how to even begin doing that for the vast majority of the work I do.
Proving the correctness of linear programs using techniques such as Hoare logic is hard enough already for anything but small algorithms. Proving the correctness of concurrent programs operating on complex data structures requires much more advanced techniques, setting up complicated logical relations and dealing with things like separation logic. It's an entirely different beast, and I honestly do not see LLMs as a panacea that will suddenly make these things scale for anything remotely close in size to a modern enterprise system.
Oh, there's lots more simple properties you can state and prove that capture a lot more, even in the challenging enterprise setting.
I just gave the simplest example I could think of.
And termination is actually a much stronger and more useful property than you make it out to be---in the face of locks and concurrency.
That is true and very useful for software development, but it doesn't help if the goal is to remove human programmers from the loop entirely. If I'm a PM who is trying to get a program to, say, catalogue books according to the Dewey Decimal system for a library, a proof that the program terminates is not going to help that much when the program is mis-categorizing some books.
Is removing the human in the loop really the goal, or is the goal right now to make the human a lot more productive? Because...those are both very different things.
I don't know what the goal for OpenAI or Anthropic really is.
But the context of this thread is the idea, raised by the user daxfohl, that these companies will, in the next few years, offer an "engineering team replacement" product; and then the user eru claimed that this is indeed more doable in programming than in other domains, because you can have specs and tests for programs in a way that you can't for, say, an animated movie.
OK, so you successfully argued that replacing the entire engineering team is hard. But you can perhaps still shrink it by 99%. To the point where a sole founder can do the remaining tech role part time.
I have no idea what will happen in a few years, maybe LLM tech will hit a wall and humans will continue to be needed in the loop. But today humans are definitely needed in the loop in some way.
> Writing software is actually one of the domains where hallucinations are easiest to fix: you can easily check whether it builds and passes tests.
What tests? You can't trust the tests that the LLM writes, and if you can write detailed tests yourself you might as well write the damn software.
Use multiple competing LLM. Generative adversarial network style.
Cool. That sure sounds nice and simple. What do you do when the multiple LLMs disagree on what the correct tests are? Do you sit down and compare 5 different diffs to see which have the tests you actually want? That sure sounds like a task you would need an actual programmer for.
At some point a human has to actually use their brain to decide what the actual goals of a given task are. That person needs to be a domain expert to draw the lines correctly. There's no shortcut around that, and throwing more stochastic parrots at it doesn't help.
Just because you can't (yet) remove the human entirely from the loop, doesn't mean that economising on the use of the humans time is impossible.
For comparison, have a look at compilers: nowadays approximately no one writes the machine code by hand; we write a "prompt" in something like Rust or C and ask another computer program to create the actual software.
We still need the human in the loop here, but it takes much less human time than creating the ELF directly.
It’s not “economizing” if I have to verify every test myself. To actually validate that tests are good I need to understand the system under test, and at that point I might as well just write the damn thing myself.
This is the fundamental problem with this “AI” mirage. If I have to be an expert to validate that the LLM actually did the task I set out, and isn’t just cheating on tests, then I might as well code the solution myself.
From a PM perspective, the main differentiator between an engineering team and AI is "common sense". As these tools get used more and more, enough training data will be available that AI's "common sense" in terms of coding and engineering decisions could be indistinguishable from a human's over time. At that point, the only advantage a human has is that they're also useful on the ops and incident response side, so it's beneficial if they're also comfortable with the codebase.
Eventually these human advantages will be overcome, and AI will sufficiently pass a "Turing Test" for software engineering. PMs will work with them directly and get the same kinds of guidance, feedback, documentation, and conversational planning and coordination that they'd get from an engineering team, just with far greater speed and less cost. At that point, yeah you'll probably need to keep a few human engineers around to run the system, but the system itself will manage the software. The advantage of keeping a human in the loop will dwindle to zero.
I can see how LLMs can help with testing, but one should never compare LLMs with deterministic tools like compilers. LLMs are entirely a separate category.
Tests and proofs can only detect issues that you design them to detect. LLMs and other people are remarkably effective at finding all sorts of new bugs you never even thought to test against. Proofs are particularly fragile as they tend to rely on pre/post conditions with clean deterministic processing, but the whole concept just breaks down in practice pretty quickly when you start expanding what's going on in between those, and then there's multithreading...
Ah, most of the problem in programming is writing the tests. Once you know what you need, the rest is just typing.
I can see an argument where you get non-programmers to create the inputs and outputs of said tests, but if they can do that, they are basically programmers.
This is of course leaving aside that half the stated use cases I hear for AI are that it can 'write the tests for you'. If it is writing the code and the tests, it is pointless.
You need more than tests. Test induced design damage:
Well - the end result can be garbage still. To be fair: humans also write a lot of garbage. I think in general most software is rather poorly written; only a tiny percentage is of epic prowess.
Who is writing the tests?
Who writes the tests?
A competing AI.
Ah, it is turtles all the way down.
Yes. But it's no different from the question of how a non-tech person can make sure that whatever their tech person tells them actually makes sense: you hire another tech person to have a look.
These types of comments are interesting to me. Pre-ChatGPT there were tons of posts about how so many software people were terrible at their jobs. Bugs were/are rampant. Software bugs caused high-profile issues, but likely so many more that we never heard about.
Today we have chatGPT and only now will teams be uninsurable and sued into oblivion? LOL
LLMs were trained on exactly that kind of code.
If you've ever used Claude Code in brave mode, I can't understand how you'd think a dev team could make the same categories of mistakes or with the same frequency.
I am but a lowly IC, with no notion of the business side of things. If I am an IC at, say, a FANG company, what insurance has been taken out on me writing code there?
> If I am an IC at, say, a FANG company, what insurance has been taken out on me writing code there?
Every non-trivial software business has liability insurance to cover them for coding lapses that lead to data breaches or other kinds of damages to customers/users.
I use LLMs to write the majority of my code. I haven't encountered a hallucination for the better part of a year. It might be theoretically unsolvable but it certainly doesn't seem like a real problem to me.
I use LLMs whenever I'm coding, and it makes mistakes ~80% of the time. If you haven't seen it make a huge mistake, you may not be experienced enough to catch them.
Hallucinations, no. Mistakes, yes, of course. That's a matter of prompting.
> That's a matter of prompting.
So when I introduce a bug it's the PM's fault.
Honestly I think they got the low-hanging fruit already. They're bumping up against the limits of what these models can do, and while it's impressive, it's not spectacular.
Maybe I'm easily impressed, but the fact that LLMs can even output basic human-like text is bananas to me. I understand a bit of how they work, yet it's still up there with "amazing that huge airplanes can even fly" for me.
> Other people are just less picky than I am
I think this is part of it.
When coding style has been established among a team, or within an app, there are a lot of extra hoops to jump through, just to get it to look The Right Way, with no detectable benefit to the user.
If you put those choices aside and simply say: does it accomplish the goal per the spec (and is safe and scalable[0]), then you can get away with a lot more without the end user ever having a clue.
Sure, there's the argument for maintainability, and vibe coded monoliths tend to collapse in on themselves at ~30,000 LOC. But it used to be 2,000 LOC just a couple of years ago. Temporary problem.
[0]insisting that something be scalable isn't even necessary imo
> When coding style has been established
It feels like you're diminishing the parent commenter's views, reducing it to the perspective of style. Their comment didn't mention style.
Style = syntax, taste, architecture choices, etc. Things you would see on a 15-year-old Java app.
i.e. not a greenfield project.
Isn't coding style a solved problem with claude.md files or something?
You can control some simple things that way. But the subtle stylistic choices that many teams agree on are difficult to articulate clearly. Plus they don’t always do everything you tell them to in the prompts or rule files. Even when it’s extremely clear sometimes they just don’t. And often the thing you want isn’t clear.
> with no detectable benefit to the user
Except the fact that the idioms and patterns used means that I can jump in and understand any part of the codebase, as I know it will be wired up and work the same as any other part.
I think here “to the user” is referring to the end user, not the programmer (the user of the coding style). There is a comprehension benefit for the team working on the code, but there is no direct¹ benefit to the end user.
--------
[1] The indirect benefits of there possibly being a faster release cadence and/or fewer bugs, could also be for many other reasons.
But you could say the same about tests, documentation, CI, issue trackers or really any piece of technology used. So it's not a very interesting statement if so.
> tests, documentation, CI, issue trackers
Exactly. In many engineering camps, it's not unreasonable to say that almost all of this has no benefit to the end-user, even indirectly.
> When coding style has been established among a team, or within an app, there are a lot of extra hoops to jump through, just to get it to look The Right Way, with no detectable benefit to the user.
Morphing an already decent PR into a different coding style is actually something that LLMs should excel at.
What's that old adage? "Programs must be written for people to read, and only incidentally for machines to execute."[1]
I wonder how well that works as a prompt.
I've seen vibe coding fall apart at 600 lines of code. It turns out lines of code is not a good metric for this or any other purpose.
Do you have any references for "vibe coded monoliths tend to collapse in on themselves at ~30,000 LOC"? I haven't personally vibed up anything with that many LOC, so I'm legitimately curious if we have solid numbers yet for when this starts to happen (and for which definitions of "collapse").
Just my experience in vibe coding apps from gpt-3.5 onwards (mostly NextJS or Node). In gpt-3.5, I had to really hand-hold it, getting it to write one function at a time, then a separate task to glue the functions together.
Now, it can build almost all of an app from a single prompt, but will start to rewrite utility functions, or modules, forgetting that they already exist. Some of this is still solvable with clever prompting, but if you're just attacking it without thinking, ~30,000 LOC seems to be the app 'size' at which it will start to exhibit those behaviors.
You don't even have to put those choices aside too much; you can have very detailed linting rules that nudge the LLM towards the style you want.
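As a hedged sketch of what that can look like, assuming a Rust project with Clippy (the specific lints are just examples of rules that mechanically push generated code toward a house style):

    // Crate-level lints in lib.rs: CI fails on violations, so the LLM gets
    // corrected by the toolchain instead of by a reviewer repeating themselves.
    #![deny(clippy::unwrap_used)]          // force explicit error handling
    #![deny(clippy::todo)]                 // no leftover placeholders
    #![warn(clippy::cognitive_complexity)] // nudge toward small functions
    #![warn(missing_docs)]                 // every public item needs a doc comment

    /// Parses a port number, surfacing failure instead of unwrapping.
    pub fn parse_port(raw: &str) -> Result<u16, std::num::ParseIntError> {
        raw.trim().parse()
    }

Running something like "cargo clippy -- -D warnings" in the agent's loop then does a lot of the style policing that would otherwise eat review time.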
At work, I have the same difficulty using AI as you. When working on deep Jiras that require a lot of domain knowledge, bespoke testing tools, but maybe just a few lines of actual code changes across a vast codebase, I have not been able to use it effectively.
For personal projects on the other hand, it has sped me up by what, 10x, 30x? It's not measurable. My output has been so much more than what would have been possible earlier that there is no benchmark, because projects at this level would not have been getting completed in the first place.
Back to using at work: I think it's a skill issue. Both on my end and yours. We haven't found a way to encode our domain knowledge into AI and transcend into orchestrators of that AI.
> deep Jiras that require a lot of domain knowledge, bespoke testing tools, but maybe just a few lines of actual code changes
How do new hires onboard? Do you spend days of your own time guiding them in person, do they just figure things out on their own after a few quarters of working on small tickets, or are things documented? Basically AI, when working on a codebase, has the same level of context that a new hire would have, so if you want them to get started faster then provide them with ample documentation.
> Do you spend days of your own time guiding them in person, do they just figure things out on their own after a few quarters of working on small tickets
It is this rather than docs. I think you're absolutely right about our lack of documentation handicapping AI agents.
After you review, instead of rewriting 70% of the code, have you tried to follow up with a message with a list of things to fix?
Also: in my experience, 1. and 2. are not needed for you to have bad results. The existing code base is a fundamental variable. The more complex / convoluted it is, the worse the result. Also in my experience, LLMs are consistently better at producing C code than anything else (Python included).
I have the feeling that the simplicity of the code bases I've produced over the years, which I now modify with LLMs, plus the fact that they are mostly in C, is a big factor in why LLMs appear to work so well for me.
Another thing: Opus 4.5 for me is bad on the web, compared to Gemini 3 PRO / GPT 5.2, and very good if used with Claude Code, since it requires iterating to reach the solution, while the others are sometimes better first-shotters. If you generate code via the web interface, this could be another cause.
There are tons of variables.
> After you review, instead of rewriting 70% of the code, have you tried to follow up with a message with a list of things to fix?
This is one of my problems with the whole thing, at least from a programming PoV. Even though superficially it seems like the ST:TNG approach to using an intelligent but not aware computer as a tool to collaboratively solve a problem, it is really more like guiding a junior through something complex. While guiding a junior (or even some future AGI) in that way is definitely a good thing, if I am a good guide they will learn from the experience so it will be a useful knowledge sharing process, that isn't a factor for an LLM (at least not the current generations). But if I understand the issue well enough to be a good guide, and there is no teaching benefit external to me, I'd rather do it myself and at most use the LLM as a glorified search engine to help muddle through bad documentation for hidden details.
That and TBH I got into techie things because I like tinkering with the details. If I thought I'd not dislike guiding others doing the actual job, I'd have not resisted becoming a manager throughout all these years!
> After you review, instead of rewriting 70% of the code, have you tried to follow up with a message with a list of things to fix?
I think this is the wrong approach; just having "wrong code" in the context makes every response after it worse.
Instead, try restarting, but this time specify exactly how you expected that 70% of the code to actually work from the get-go. Often LLMs make choices because they have to, and if you think they made the wrong choice, you'll often find that you didn't actually specify something well enough, so the LLM had to do something; apparently the single most important thing for them is that they finish something, no matter how right or wrong.
After a while, you'll get better at knowing what you have to be precise, specific and "extra verbose" about, compared to other things. That also seems to depend on the model: with Gemini you can have 5 variations of "Don't add any comments" and it adds them anyway, but say that once to the GPT/Claude family of models and they seem to get it at once.
There are some problems where this becomes a game of whack-a-mole either way you approach it (restart or modify with existing context). I end up writing more prompts than the code I could've written myself.
This isn't to say I don't think LLMs are an asset, they have helped me solve problems and grow in domains where I lack experience.
The biggest frustration with LLMs for me is people telling me I'm not promoting it in a good way. Just think about any product where they are selling a half baked product, and repeatedly telling the user you are not using it properly.
But that's not how most products work.
If you buy a table saw and can't figure out how to cut a straight line in a piece of wood with it - or keep cutting your fingers off - but didn't take any time at all to learn how to use it, that's on you.
Likewise a car, you have to take lessons and a test before you can use those!
Why should LLMs be any different?
A table saw does not advertise to be a panacea which will make everyone obsolete.
You should ignore anyone who says that LLMs are a panacea that will make everyone obsolete.
Even if they're your boss? Remember that most people here are not independently wealthy, they're stuck answering to someone who may not have so level a take on these things as you do.
Your boss can't magic things into reality. If the LLM can't do your job they can't replace you with it
They can try. They'll then fail, and you'll be rehired and have to clean up the mess, then continue on.
> and you'll be rehired and have to clean up the mess, then continue on
Not how this works. Yes, it happens sometimes, but there's no guarantee. Alternatives include:
- The rest of your team (or another team) soaks up the additional work by working longer hours
- They hire someone else, or transfer someone from elsewhere
- The company accepts the lower output quality / whatever breakages result
- The breakages, even if unacceptable, only show up months down the line
So all that needs to happen is for your boss to believe they can replace you up to the point where they feel comfortable firing you. Whether that works or not is largely immaterial to the impact it thereafter has on your ability to pay rent / your mortgage / etc.
The fact that you could be fired at any time hasn't changed. That was true before any of this. Maybe this is a wake up call that it's a real risk, but the risk was always there and should be planned for
The more important thing though is that if LLMs can't replace people (remains to be seen) they won't lead to a net job loss. You'll find something else
>> Even if they're your boss?
Especially if they are your boss.
The problem there is the boss, not the technology. If it isn’t an insane take on AI, it’d be on something else, and eventually will be. People quit bad managers, not bad jobs. If you have a bad manager, work on quitting them.
I think the problem is the techno fascist oligarchs that are peddling the snake oil that LLMs will wipe out all white collar jobs tomorrow. Managers usually answer to C suite, and the C suite is salivating at the idea of laying off 80% of staff
You can't ignore managers, founders, colleagues, investors, and procurement teams.
Can't, or you're afraid to?
If you're not afraid of pushing back against an entire industry you don't have a full appreciation of the risks.
Aside: I love your website! Cool games :)
Thanks!
FWIW, I left my full time job some years ago to do my own thing, in part because pushing back on bad decisions was not really doing me any favors for my mental health. Glad to report I'm in a much better place after finding the courage to get out of that abusive relationship.
Some might argue the risk of not pushing back is far worse.
I was a contractor/consultant between 2020-2023; I have a problem w/ authority so it suited me. But work/life balance was awful--I have 2 kids now, and I can't do nothing for 6 weeks then work 100 hour weeks for 4 weeks. The maximum instability my life will tolerate is putting the kids to bed at 9 instead of 8:30 lol. I'm also in the Netherlands so there's also other benefits. Worker protections are very strong here, so it's highly unlikely I'll be fired or laid off; I can't be asked to work overtime; I can't be Slack'd after hours; I can drop down to 4 days a week no questions asked, when the kids were born I got a ton of paid leave, etc. Not to imply I work at some awful salt mine; I like my current gig and coworkers/leadership.
Anyway, this is a collective action problem. I don't take any responsibility for the huge plastic island in the Pacific, nor do I take any responsibility for the grift economy built on successive, increasingly absurd hype waves of tech (web 2.0, mobile, SPAs, big data, blockchain, VR, AI). I've also worked in social good, from Democratic presidential campaigns and recounts to helping connect people w/ pro bono legal services, which is to say I've done my time. There are too many problems for me to address, I get to pick which, if any, I battle, I am happy if my kids don't meltdown too much during the evening. Maybe when they're both in school I can take more risks or reformulate my work/life balance, but currently I'm focused on furthering the human race.
But this is how LLMs are marketed by all the big players. Should we ignore them too? LLMs are oversold.
Same as any other technology. If MongoDB tell you that their solution is "web scale" it's still on you to evaluate that claim before picking the database platform to build your company on.
> But that's not how most products work.
That's exactly how most products work :-/
> If you buy a table saw and can't figure out how to cut a straight line in a piece of wood with it - or keep cutting your fingers off - but didn't take any time at all to learn how to use it, that's on you.
Of course - that's deterministic, so if you make a mistake and it comes out wrong, you can fix the mistake you made.
> Why should LLMs be any different?
Because they are not deterministic; you can't use experience with LLMs in any meaningful way. They may give you a different result when you run the same spec through the LLM a second time.
> Because they are not deterministic; you can't use experience with LLMs in any meaningful way. They may give you a different result when you run the same spec through the LLM a second time.
Lots of things, and indeed humans, are just as non-deterministic; I absolutely do use experience working with humans and non-deterministic things to improve my future interactions with them.
Table saws are kinda infamous in this regard: you may say that kick-back is hidden state/incomplete information rather than non-deterministic, but in practice the impact is the same.
> They may give you a different result when you run the same spec through the LLM a second time.
Yes kind of, but only different results (maybe) for the things you didn't specify. If you ask for A, B and C, and the LLM automatically made the choice to implement C in "the wrong way" (according to you), you can retry but specify exactly how you want C to be implemented, and it should follow that.
Once you've nailed your "spec" enough so there isn't any ambiguity, the LLM won't have to make any choices for you, and then you'll get exactly what you expected.
Learning this process, and learning how much and what exactly you have to instruct it to do, is you building up your experience learning how to work with an LLM, and that's meaningful, and something you get better with as you practice it.
> Yes kind of, but only different results (maybe) for the things you didn't specify.
No. They will produce a different result for everything, including the things you specify.
It's so easy to verify that I'm surprised you're even making this claim.
> Once you've nailed your "spec" enough so there isn't any ambiguity, the LLM won't have to make any choices for you, and then you'll get exactly what you expected
1. There's always ambiguity, or else you'll end up an eternity writing specs
2. LLMs will always produce different results even if the spec is 100% unambiguous for a huge variety of reasons, the main one being: their output is non-deterministic. Except in the most trivial of cases. And even then the simple fact of "your context window is 80% full" can lead to things like "I've rewritten half of your code even though the spec only said that the button color should be green"
> It's so easy to verify that I'm surprised you're even making this claim.
Well, to be fair, I'm surprised you're even trying to say this claim isn't true, when it's so easy to test yourself.
If I prompt "Create a function with two arguments, a and b, which returns adding those two together", I'll get exactly what I specify. If I feel like it using u8 instead of u32 was wrong, I add "two arguments which are both u8", then you now get this.
Is this not the experience you get when you use LLMs? How does what you get differ from that?
> 1. There's always ambiguity, or else you'll end up an eternity writing specs
There isn't, though; at some point it does end. Whether it's worth going that deep into specifying the exact implementation is up to you and what you're doing; sometimes it is, sometimes it isn't.
> LLMs will always produce different results even if the spec is 100% unambiguous for a huge variety of reasons, the main one being: their output is non-deterministic.
Again, it's so easy to verify that this isn't true, and also surprising you'd say this, because earlier you say "always ambiguity" yet somehow you seem to also know that you can be 100% unambiguous.
Like with "manual" programming, the answer is almost always "divide and conquer", when you apply that with enough granularity, you can reach "100% umambiguity".
> And even then the simple fact of "your context window is 80% full" can lead to things like "I've rewritten half of your code even though the spec only said that the button color should be green"
Yes, this is a real flaw, once you go beyond two messages, the models absolutely lose track almost immediately. Only workaround for this is constantly restarting the conversation. I never "correct" an agent if they get it wrong with more "No, I meant", I rewrite my first message so there are no corrections needed. If your context goes beyond ~20% of what's possible, you're gonna get shit results basically. Don't trust the "X tokens context length", because "what's possible" is very different from "what's usable".
> If I prompt "Create a function with two arguments, a and b, which returns adding those two together", I'll get exactly what I specify. If I feel like it using u8 instead of u32 was wrong, I add "two arguments which are both u8", then you now get this.
This is actually a good example of how your spec will progress:
First pass: "Create a function [in language $X] with two arguments, a and b, which returns adding those two together"
Second pass: "It must take u8 types, not u32 types"
Third pass: "You are not handling overflows. It must return a u8 type."
Fourth pass: "Don't clamp the output, and you're still not handling overflows"
Fifth pass: "Don't panic if the addition overflows, return an error" (depending on the language, this could be "throw an exception" or return a tuple with an error field, or use an out parameter for the result or error)
For just a simple "add two numbers" function, the specification can easily exceed the actual code. So you can probably understand the skepticism when the task is not trivial, and depends on a lot of existing code.
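For what it's worth, here is roughly where that five-pass spec lands, as a hypothetical Rust rendering (checked_add being the usual way to surface overflow as an error rather than a panic or a clamp):

    /// Adds two u8 values, returning an error instead of panicking or
    /// clamping when the sum doesn't fit in a u8.
    fn add(a: u8, b: u8) -> Result<u8, String> {
        a.checked_add(b)
            .ok_or_else(|| format!("overflow: {a} + {b} does not fit in a u8"))
    }

    fn main() {
        assert_eq!(add(1, 2), Ok(3));
        assert!(add(200, 100).is_err()); // 300 > u8::MAX, surfaced as an error
    }

A few lines of code against five rounds of spec, which rather supports the point.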
So you do know how the general "writing a specification" part works; you just have the wrong process. Instead of iterating and adding more context on top, restructure your initial prompt to include the context.
DON'T DO:
First pass: "Create a function [in language $X] with two arguments, a and b, which returns adding those two together"
Second pass: "It must take u8 types, not u32 types"
INSTEAD DO:
First pass: "Create a function [in language $X] with two arguments, a and b, which returns adding those two together"
Second pass: "Create a function [in language $X] with two arguments, a and b, both using u8, which returns adding those two together"
----
What you don't want to do is add additional messages/context on top of "known bad" context. Instead, take the clue that the LLM didn't understand correctly as "I need to edit my prompt", not "I need to add more context after its reply to correct what was wrong". The goal should be to avoid anything bad entirely, not to correct it.
Together with this, you build up a system/developer prompt you can reuse across projects/scopes that follows how you code. In that, you add stuff as you discover what needs to be added, like "Make sure to always handle exceptions in X way" or similar.
> > For just a simple "add two numbers" function, the specification can easily exceed the actual code. So you can probably understand the skepticism when the task is not trivial, and depends on a lot of existing code.
Yes, please be skeptical, I am as well, which I guess is why I am seemingly more effective at using LLMs than others who are less skeptical. It's a benefit here to be skeptical, not a drawback.
And yes, it isn't trivial to verify work that others have done for you when you have a concrete idea of exactly how it should be. But having worked with outsourced/contract developers before, and collaborated with developers at the same company as me, I've learned to use LLMs in a similar way, where you have to review and ensure the code follows the architecture/design you intended.
> INSTEAD DO:
> First pass: "Create a function [in language $X] with two arguments, a and b, which returns adding those two together"
> Second pass: "Create a function [in language $X] with two arguments, a and b, both using u8, which returns adding those two together"
So it will create two different functions (and LLMs do love to ignore anything that came before and create a lot of stuff from scratch again and again). Now what.
What? No, I think you fundamentally misunderstand what workflow I'm suggesting here.
You ask: "Do X". The LLM obliges, gives you something you don't want. At this point, don't accept/approve it, so nothing has changed, you still have an empty directory, or whatever.
Then you start a brand new context, with iteration on the prompt: "Do X with Y", and the LLM again tries to do it. If something is wrong, repeat until you get what you're happy with, extract what you can into reusable system/developer prompts, then accept/approve the change.
Then you end up with one change, and one function, exactly as you specified it. Then if you want, you can re-run the exact same prompt, with the exact same context (nothing!) and you'll get the same results.
"LLMs do love to ignore anything that came before" literally cannot happen in this workflow, because there is nothing that "came before".
> No, I think you fundamentally misunderstand what workflow I'm suggesting here.
Ah. Basically meaningless monkey work of babysitting an eager junior developer. And this is for a simple thing like adding two numbers. See how it doesn't scale at all with anything remotely complex?
> "LLMs do love to ignore anything that came before" literally cannot happen in this workflow, because there is nothing that "came before".
Of course it can. Because what came before is the project you're working on. Unless of course you end up specifying every single utility function and every single library call in your specs. Which, once again, doesn't scale.
> See how it doesn't scale at all with anything remotely complex?
No, I don't. Does outsourcing not work for you with "anything remotely complex"? Then yeah, LLMs won't help you, because that's a communication issue. Once you figure out how to communicate, using LLMs even for "anything remotely complex" becomes trivial, but requires an open mind.
> Because what came before is the project you're working on.
Right, if that's what you meant, then yeah, of course they don't ignore the existing code; if there is a function that already does what it needs, it'll use that. If the agent/LLM you use doesn't automatically do this, I suggest you try something better, like Codex or Claude Code.
But anyways, you don't really seem like you're looking for improving, but instead try to dismiss better techniques available, so I'm not even sure why I'm trying to help you here. Hopefully at least someone who wants to improve comes across it so this whole conversation wasn't a complete waste of time.
> No, I don't.
Strange. For a simple "add two integers" you now have to do five different updates to specs to make it non-ambiguous, restarting the work from scratch (that is, starting a new context) every time.
What happens when your work isn't to add two integers? How many iterations of the spec you have to do before you arrive at an unambiguous one, and how big will it be?
> Once you figure out how to communicate,
LLMs don't communicate.
> Right, if that's what you meant, then yeah, of course they don't ignore the existing code, if there is a function that already does what it needs, it'll use that.
Of course it won't since LLMs don't learn. When you start a new context, the world doesn't exist. It literally has no idea what does and does not exist in your project.
It may search for some functionality given a spec/definition/question/brainstorming skill/thinking or planning mode. But it may just as likely not. Because there is no actual proper way for anyone to direct it, and the models don't have learning/object permanence.
> If the agent/LLM you use doesn't automatically does this, I suggest you try something better, like Codex or Claude Code.
The most infuriating thing about these conversations is that people hyping AI assume everyone else but them is stupid, or doing something incorrectly.
We are supposed to always believe people who say "LLMs just work", without any doubt, on faith alone.
However, people who do the exact same things, use the exact tools, and see all the problems for what they are? Well, they are stupid idiots with skill issues who don't know anything and probably use GPT 1.0 or something.
Neither Claude nor Codex are magic silver bullets. Claude will happily reinvent any and all functions it wants, and has been doing so since the very first day it was unleashed onto the world.
> But anyways, you don't really seem like you're looking for improving, but instead try to dismiss better techniques available
Yup. Just as I said previously.
There are some magical techniques, and if you don't use them, you're a stupid Luddite idiot.
Doesn't matter that the person talking about these magical techniques completely ignores and misses the whole point of the conversation and is fully prejudiced against you. The person who needs to improve for some vague condescending definition of improvement is you.
> LLMs don't communicate.
Similarly, some humans seem to be unable to, too. The problem is, you need to be good at communication to use LLMs effectively; judging by this thread, it's pretty clear what the problem is. I hope you figure it out someday, or just ignore LLMs, no one is forcing you to use them (I hope at least).
I don't mind what you do, and I'm not "hyping LLMs", I see them as tools that are sometimes applicable. But even to use them in that way, you need to understand how to use them. But again, maybe you don't want, that's fine too.
"However, people who do the exact same things, use the exact tools, and see all the problems for what they are? Well, they are stupid idiots with skill issues who don't know anything and probably use GPT 1.0 or something."
Perfectly exemplified
Yeah, a summary of some imaginary arguments someone else made (maybe?), quoted back to me that never said any of those things? Fun :)
The "imaginary arguments" in question:
- "If the agent/LLM you use doesn't automatically does this, I suggest you try something better, like Codex or Claude Code."
- "you don't really seem like you're looking for improving"
- "Hopefully at least someone who wants to improve comes across it so this whole conversation wasn't a complete waste of time"
- "judging by this thread, it's pretty clear what the problem is. I hope you figure it out someday"
- "you need to understand how to use them. But again, maybe you don't want"
Aka what I said previously.
At this point, adieu.
It seems generally agreed that LLMs (currently) do better or worse with different programming languages at least, and maybe with other project logistical differences.
The fact that an LLM works great for one user on one project does not mean it will work equally great for another user on a different project. It might! It might work better. It might work worse.
And both users might be using the tool equally well, with equal skill, insofar as their part goes.
I'm glad you brought up the power tool analogy - I bought a $40 soldering iron once, which looked just like the Weller that cost like 5x as much. There was nothing wrong with it on the surface; it was well built and heated up just fine.
But every time I tried to solder with it, the results sucked. I couldn't articulate why, and assumed I was doing something wrong (I probably was).
Then at my friends house, I got to try the real thing, and it worked like a dream. Again I can't pin down why, but everything just worked.
This is how I felt with LLMs (and image generation) - sometimes it just doesn't feel right, and I can't put my finger on what I should fix, but I often come away with the feeling that I needed to do way more tweaking than necessary and the results were still just mediocre.
No one knows what the actual "right way" to hold (prompt) an LLM is. A certain style or pattern to prompting may work in one scenario for one LLM, but change the scenario or model and it often loses any advantage and can give worse output than a different style/pattern.
In contrast table saws and cars have pretty clear rules of operation.
Table saws and cars are deterministic. Once you learn how to use them, the experience is repeatable.
The various magic incantations that LLMs require cannot be learned or repeated. Whatever the "just one more prompt bro" du jour you're thinking of may or may not work at any given time for any given project in any given language.
Operating a car (i.e. driving) is certainly not deterministic. Even if you take the same route over and over, you never know exactly what other drivers or pedestrians are going to do, or whether there will be unexpected road conditions, construction, inclement weather, etc. But through experience, you build up intuition and rules of thumb that allow you to drive safely, even in the face of uncertainty.
It's the same programming with LLMs. Through experience, you build up intuition and rules of thumb that allow you to get good results, even if you don't get exactly the same result every time.
> It's the same programming with LLMs. Through experience, you build up intuition and rules of thumb that allow you to get good results, even if you don't get exactly the same result every time.
Friend, you have literally described a nondeterministic system. LLM output is nondeterministic. Identical input conditions result in variable output conditions. Even if those variable output conditions cluster around similar ideas or methods, they are not identical.
The problem is that this is completely false. LLMs are actually deterministic. There are a lot more input parameters than just the prompt. If you're using a piece of shit corpo cloud model, you're locked out of managing your inputs because of UX or whatever.
Ah, we've hit the rock bottom of arguments: there's some unspecified ideal LLM model that is 100% deterministic that will definitely 100% do the same thing every time.
We've hit rock bottom of rebuttals, where not only is domain knowledge completely vacant, but you can't even be bothered to read and comprehend what you're replying to. There is no non-deterministic LLM. Period. You're already starting off from an incoherent position.
Now, if you'd like to stop acting like a smug ass and be inquisitive as per the commenting guidelines, I'd be happy to tell you more. But really, if you actually comprehended the post you're replying to, there would be no need since it contains the piece of the puzzle you aren't quite grasping.
> There is no non-deterministic LLM.
Strange then that the vast majority of LLMs that people use produce non-deterministic output.
Funnily enough I had literally the same argument with someone a few months back in a friends group. I ran the "non-shitty non-corpo completely deterministic model" through ollama... And immediately got two different answers for the same input.
> Now, if you'd like to stop acting like a smug ass and be inquisitive as per the commenting guidelines,
Ah. Commenting guidelines. The ones that tell you not to post vague allusions to something, not to be dismissive of what others are saying, responding to the strongest plausible interpretation of someone says etc.? Those ones?
> Strange then that the vast majority of LLMs that people use produce non-deterministic output.
> I ran the "non-shitty non-corpo completely determenistic model" through ollama... And immediately got two different answers for the same input.
With deterministic hardware in the same configuration, using the same binaries, providing the same seed, the same input sequence to the same model weights will produce bit-identical outputs. Where you can get into trouble is if you aren't actually specifying your seed, or with non-deterministic hardware in varying configurations, or if your OS mixes entropy with the standard pRNG mechanisms.
Inference is otherwise fundamentally deterministic. In implementation, certain things like thread-scheduling and floating-point math can be contingent on the entire machine state as an input itself. Since replicating that input can be very hard on some systems, you can effectively get rid of it like so:
    ollama run [whatever] --seed 123 --temperature 0 --num-thread 1

(A note that "--temperature 0" may not strictly be necessary. Depending on your system, setting the seed and restricting to a single thread will be sufficient.)
These flags don't magically change LLM formalisms. You can read more about how floating point operations produce non-determinism here:
https://arxiv.org/abs/2511.17826
In this context, forcing single-threading bypasses FP-hardware's non-associativity issues that crop up with multi-threaded reduction. If you still don't have bit-replicated outputs for the same input sequence, either something is seriously wrong with your computer or you should get in touch with a reputable metatheoretician because you've just discovered something very significant.
> Those ones?
Yes those ones. Perhaps in the future you can learn from this experience and start with a post like the first part of this, rather than a condescending non-sequitur, and you'll find it's a more constructive way to engage with others. That's why the guidelines exist, after all.
> These flags don't magically change LLM formalisms. You can read more about how floating point operations produce non-determinism here:
Basically what you're saying is "for 99.9% of use cases and how people use them they are non-deterministic, and you have to very carefully work around that non-determinism to the point of having workarounds for your GPU and making them even more unusable"
> In this context, forcing single-threading bypasses FP-hardware's non-associativity issues that crop up with multi-threaded reduction.
Translation: yup, they are non-deterministic under normal conditions. Which the paper explicitly states:
--- start quote ---
existing LLM serving frameworks exhibit non-deterministic behavior: identical inputs can yield different outputs when system configurations (e.g., tensor parallel (TP) size, batch size) vary, even under greedy decoding. This arises from the non-associativity of floating-point arithmetic and inconsistent reduction orders across GPUs.
--- end quote ---
> If you still don't have bit-replicated outputs for the same input sequence, either something is seriously wrong with your computer or you should get in touch with a reputable metatheoretician because you've just discovered something very significant.
Basically what you're saying is: If you do all of the following, then the output will be deterministic:
- workaround for GPUs with num_thread 1
- temperature set to 0
- top_k to 0
- top_p to 0
- context window to 0 (or always do a single run from a new session)
Then the output will be the same all the time. Otherwise even "non-shitty corp runners" or whatever will keep giving different answers for the same question: https://gist.github.com/dmitriid/5eb0848c6b274bd8c5eb12e6633...
Edit. So what we should be saying is that "LLM models as they are normally used are very/completely non-deterministic".
> Perhaps in the future you can learn from this experience and start with a post like the first part of this
So why didn't you?
> The problem is that this is completely false. LLMs are actually deterministic. There are a lot more input parameters than just the prompt. If you're using a piece of shit corpo cloud model, you're locked out of managing your inputs because of UX or whatever.
When you decide to make up your own definition of determinism, you can win any argument. Good job.
Yes, that's my point. Neither driving nor coding with an LLM is perfectly deterministic. You have to learn to deal with different things happening if you want do do either successfully.
> Neither driving nor coding with an LLM is perfectly deterministic.
Funny.
When driving, I can safely assume that when I turn the steering wheel, the car turns in that direction. That the road that was there yesterday is there today (barring certain emergencies, that's why they are emergencies). That the red light in a traffic light means stop, and the green means go.
And not the equivalent "oh, you're completely right, I forgot to include the wheels, wired the steering wheel incorrectly, and completely messed up the colors"
> Operating a car (i.e. driving) is certainly not deterministic.
Yes. Operating a car or a table saw is deterministic. If you turn your steering wheel left, the car will turn left every time with very few exceptions that can also be explained deterministically (e.g. hardware fault or ice on road).
Operating LLMs is completely non-deterministic.
> Operating LLMs is completly non-deterministic.
Claiming "completely" is mapping a boolean to a float.
If you tell an LLM (with tools) to do a web search, it usually does a web search. The biggest issue right now is more at the scale of: if you tell it to create turn-by-turn directions to navigate across a city, it might create a python script that does this perfectly with OpenStreetMap data, or it may attempt to use its own intuition and get lost in a cul-de-sac.
Wow. It can do a web search. And that is useful in the context of programming how? Or in any context?
The question is about the result of an action. Given the same problem statement in the same codebase it will produce wildly different results even if prompted two times in a row.
Even for trivial tasks the output may vary between just a simple fix, and a rewrite of half of the codebase. You can never predict or replicate the output.
To quote Douglas Adams, "The ships hung in the sky in much the same way that bricks don't". Cars and table saws operate in much the same way that LLMs don't.
> Wow. It can do a web search. And that is useful in the context of programming how? Or in any context?
Your own example was turning a steering wheel.
A web search is as relevant to the broader problems LLMs are good at, as steering wheels are to cars.
> Given the same problem statement in the same codebase it will produce wildly different results even if prompted two times in a row.
Do you always drive the same route, every day, without alteration?
Does it matter?
> You can never predict or replicate the output.
Sure you can. It's just less like predicting what a calculator will show and more like predicting if, when playing catch, the other player will catch your throw.
You can learn how to deal with reality even when randomness is present, and in fact this is something we're better at than the machines.
> Your own example was turning a steering wheel.
The original example was trying to compare LLMs to cars and table saws.
> Do you always drive the same route, every day, without alteration?
I'm not the one comparing operating machinery (cars, table saws) to LLMs. Again. If I turn a steering wheel in a car, the car turns. If input the same prompt into an LLM, it will produce different results at different times.
Lol. Even "driving a route" is probably 99% deterministic unlike LLMs. If I follow a sign saying "turn left", I will not end up in a "You are absolutely right, there shouldn't be a cliff at this location" situation.
Edit: and when signs end up pointing to a cliff, or when a child runs onto the road in front of you, these are called emergency situations. Whereas emergency situations are the only available modus operandi for an LLM, and actually following instructions is a lucky happenstance.
> It's just less like predicting what a calculator will show and more like predicting if, when playing catch, the other player will catch your throw
If you think that throwing more and more bad comparisons that don't work into the conversation somehow proves your point, let me dissuade you of that notion: it doesn't.
I'm finding the prompting techniques I've learned over the last six months continue to work just fine.
Have you run the "same prompting technique" on the same problem in the same code base and got the same result all the time?
I also have prompting techniques that work better than other magical incantations. They do also fail often. Or stop working in a new context. Or...
Now imagine the table saw is really, REALLY shit at being a table saw and saw no straight angle anywhere during its construction. And they come out with a new one every 6 months that is very slightly less crooked, but the controls are all moved around so you have to tweak your workflow.
Would you still blame the user?
It’s more like the iPhone “you’re holding it wrong”.
It's not anyone's job to "promote it in a good way"; we have no responsibility either for or against such tech.
The analogy would be more like: "yeah, the motor blew up and burned your garage, but please don't be negative - we need you to promote this saw in a good way".
Sure, it's important to "hold it right", but we're not in some cult here where we need to all sell this tech well beyond its current or future potential.
I think that was a typo and should have been "prompting", not "promoting".
Have you seen the way some people google/prompt? It can be a murder scene.
Not coding related but my wife is certainly better than most and yet I’ve had to reprompt certain questions she’s asked ChatGPT because she gave it inadequate context. People are awful at that. Us coders are probably better off than most but just as with human communication if you’re not explaining things correctly you’re going to get garbage back.
People are "awful at that" because when two people communicate, we're using a lot more than words. Each person participating in a conversation is doing a lot of active bridge-building. We're supplying and looking for extra nonverbal context; we're leaning on basic assumptions about the other speaker, their mood, their tone, their meanings; we're looking at not just syntax but the pragmatics of the convo (https://en.wikipedia.org/wiki/Pragmatics). The communication of meaning is a multi-dimensional thing that everyone in the conversation is continually contributing to and pushing on.
In a way, LLMs are heavily exploitative of human linguistic abilities and expectations. We're wired so hard to actively engage and seek meaning in conversational exchanges that we tend to "helpfully" supply that meaning even when it's absent. We are "vulnerable" to LLMs because they supply all the "I'm talking to a person" linguistic cues, but without any form of underlying mind.
Folks like your wife aren't necessarily "bad" at LLM prompting—they're simply responding to the signals they get. The LLM "seems smart." It seems like it "knows" things, so many folks engage with them naturally, as they would with another person, without painstakingly feeding in context and precisely defining all the edges. If anything, it speaks to just how good LLMs are at being LLMs.
Until we get LLMs with deterministic output for a given prompt, there's no guarantee that you and I typing the same prompt will yield working solutions of similar quality.
I agree that it helps to add context, but then again, assuming people aren't already doing that doesn't help in any way. You can add all the context there is and still get a total smudge out of it. You can hit regenerate a few times and it's no better. There's no reliable way to tell which part of your prompt the LLM will fixate on and which part it will silently forget (this is even more apparent with longer prompts).
If my mum buys a copy of Visual Studio, is it the tool's fault if she cannot code?
It's more like I buy Visual Studio, it crashes at random times, and I get a response like "you don't know how to use the IDE".
It's not like that though.
It's like you buy Visual Studio and don't believe anyone who tells you that it's complex software with a lot of hidden features and settings that you need to explore in order to use it to its full potential.
I feel it's not worth the effort to spend time and learn the hidden features. Whenever I use it to plug something new into an existing codebase, it either gives me something good on the first shot or repeats the same non-working solution again and again. After such a session I'm only left with the feeling that instead of spending the last 15 minutes on prompting, I should have learned this stuff, and this learning would be useful for me forever.
I use LLMs as a better form of search engines and that's a useful product.
> I feel it's not worth the effort to spend time and learn the hidden features.
And that's the only issue here. Many programmers feel offended by an AI threatening their livelihood, and are too arrogant to invest some time in a tool they deem beneath them, then proceed to complain on the internet about how useless the tool is.
I'd really suggest taking antirez's advice to heart, and investing time in actually learning how to work with AI properly. Just because Claude Code has a text prompt like ChatGPT doesn't mean you know how to work with it yet. It is going to pay off.
> I should have learned this stuff and this learning would be useful for me forever.
Oh, if only software worked like that.
Even a decade ago, one could reasonably say that half of what we proudly add to our CVs becomes obsolete every 18 months; it's just hard to predict which half.
> Non-trivial coding tasks
A coding agent just beat every human in the AtCoder Heuristic optimization contest. It also beat the solution that the production team for the contest put together. https://sakana.ai/ahc058/
It's not enterprise-grade software, but it's not a CRUD app with thousands of examples in github, either.
> AtCoder Heuristic optimization contest
This is an optimization space that was already being automated before LLMs. Big surprise: machines are still better at this.
This feels a bit like comparing programming teams to automated fuzzing.
In fact, developing such algorithms has not infrequently involved some kind of automated testing in which the algorithm itself is permuted automatically.
It's also a bit like how OCR and a couple of other fields (protein folding) are better done in an automated manner.
The fact that now this is done by an LLM, another machine isn't exactly surprising. Nobody claims that computers aren't good at these kinds of tasks.
> It's not enterprise-grade software, but it's not a CRUD app with thousands of examples in github, either.
Optimization is a very simple problem though.
Maintaining a random CRUD app from some startup is harder work.
The argument was about “non-trivial”. Are you calling this work trivial or not?
> Optimization is a very simple problem though.
C'mon, there's a post every other week saying that optimization never happens anymore because it's too hard. If AI can take all the crap code humans are writing and make it better, that sounds like a huge win.
Simple is the opposite of complex; the opposite of hard is easy. They are orthogonal. Chess is simple and hard. Go is simpler and harder than chess.
Program optimization problems are less simple than both, but still simpler than free-form CRUD apps with fuzzy, open ended acceptance criteria. It would stand to reason an autonomous agent would do well at mathematically challenging problems with bounded search space and automatically testable and quantifiable output.
(Not GP but I assume that's what they were getting at)
> If AI can take all the crap code humans are writing and make it better, that sounds like a huge win.
This sort of misunderstanding of achievements is what keeps driving the AI mania. The AI generated an algorithm for optimizing a well-defined, bounded mathematical problem that marginally beat the human-written algorithms.
This AI can't do what you're hyping it up to do because software optimization is a different kind of optimization problem - it's complex, underspecified, and it doesn't have general algorithmic solutions.
LLM may play a significant role in optimizing software some day but it's not going to have much in common with optimization in a mathematical sense so this achievement doesn't get us any closer to that goal.
Compilers beat most coders before LLMs were even popular.
Had to scroll far to find the problem description:
> AHC058, held on December 14, 2025, was conducted over a 4-hour competition window. The problem involved a setting where participants could produce machines with hierarchical relationships, such as multiple types of “apple-producing machines” and “machines that build those machines.” The objective was to construct an efficient production planning algorithm by determining which types and hierarchies of machines to upgrade and in what specific order.
... so not a CRUD app but it beat humans at Cookie Clicker? :-)
I think you're spot on.
So many people hyping AI are only thinking about new projects and don't even distinguish between what is a product and what is a service.
Most software devs employed today work on maintaining services that have a ton of deliberate decisions baked in, decisions made outside of that codebase and driven by business needs.
They are not building shiny new products. That's why most of the positive hype about AI doesn't make sense when you're actually at work and not just playing around with personal projects or startup POCs.
Personally I've yet to see any high-profile programming person (who's not directly invested in AI) endorse coding only by prompting.
Experienced coders that I follow, who do use AI tend to focus on tight and fast feedback loops, and precise edits (or maybe exploratory coding) rather than agentic fire-and-forget workflows.
Also, as an interesting side note, I expected the programmers I know personally and think of as highly skilled to reject AI out of personal pride - that has not been the case. However, two criticisms I've heard consistently from this crowd (besides the thing I mentioned before) were:
- AI makes hosting and participating in coding competitions impossible, and denies them brain-teasers and a way to hone their skills.
- A lot of them are concerned about the ethics of training on large codebases - and consider AI plagiarism as much of an issue as artists do.
It's the second.
Like, yes, prompting is a skill and you need to learn it for AI to do something useful, but usefulness quickly falls off a cliff once you go past "greenfield implementation", "basically example code", or "the thing that's been done so often the AI has a lot of reference to pull from". Past that it quickly gets into a kinda-sorta-but-not-really-working state.
It can still be used effectively on smaller parts of the codebase (I used it a lot to generate boilerplate to run the tests, even if I had to rewrite a bunch of the actual tests), but as a whole it is very, very overrated by the AI peddlers.
And that probably stems from the fact that, for the clueless ones, it looks like an amazing productivity boost because they go from "not even knowing the framework" to "somewhat working app".
People already say here that they don't even look at the code anymore. "That is the AI's job." As long as there is a spec and tests pass, they are happy! I just can't do that.
It's just the next rung on the enshittification ladder. So many steps in our "progress" toward enlightenment as a society, as a technology community, are just abstracting away work with a "good enough" solution that is around an 80% solution.
That's fine for the first iteration or two, because you think "oh man this is going to make me so productive, I'll be able to use this new productivity to wring 40% of progress out of that 20% gap"
But instead we just move on to the next thing, bring that 20% shittified gap along with us, and the next thing that gets built or paved over has a 20% gap, and eventually we're bankrupt from rolling over all that negative equity
The counterargument to this is the comparison to traditional compilers. AI is "the new compiler", just for natural language. The optimization happens over time! But I am not so sure about that.
Except that the most glaring difference is that compilers are deterministic, while LLMs aren't.
Given the same input, compilers will always return the same output. LLMs won't: given the same input, they will return different output.
Why not post a github gist with prompt and code so that people here can give you their opinion?
Those just don't appear at all on HackerNews
Gee I wonder why
Because most people don't work on public projects and can't share the code publicly?
What's more interesting is the lack of examples of non-trivial projects that are provably vibe-coded and that claim to be of high-quality.
I think many of us are looking for: "I vibe-coded [this] with minimal corrections/manual coding on a livestream [here] and I believe it to be high-quality code"
If the code is in fact good quality then the livestream would serve as educational material for using LLMs/agents productively and I guarantee that it would change many minds. Stop telling people how great it all is, show them. I don't want to be a naysayer, I want to be impressed.
I'm considering attempting to vibe-code a translation of one of my XNA games to JavaScript, recording the process, and using all of the latest tools and strategies like agents, .md files, multiple LLMs, etc.
[dead]
That's been pretty much exactly my experience too.
For what it's worth, multiple times in my career, I've worked at shops that once thought they could do it quick and cheap and it would be good enough, and then had to hire someone 'picky' like me to sort out the inevitable money-losing mess.
From what I've seen even Opus 4.5 spit out, the 'picky' are going to remain in demand for a little while longer still. Will that last? No clue. We'll see.
You can be picky with Opus, just yell at it to refactor a few times. To reduce refactor cycles, give it correct and sufficient context before you start, along with expected code style, etc. These things aren't one-shot magic machines.
> I don't understand the stance that AI currently is able to automate away non-trivial coding tasks.
I'm happy enough for it to automate away the trivial coding tasks. That's an immense force multiplier in its own right.
> I end up rewriting about 70% of the thing.
That doesn't match my experience; for me that figure is closer to 20-40%, and a lot of the changes I want can be made by further prompting, turning to a different model, or adding some automated checks that promptly fail so the AI can do a few more loops of fixes.
> Other people are just less picky than I am, or they have a less thorough review culture that lets subpar code slide more often.
This is also likely, or you are just doing stuff that is less well represented in the training data, or working on novel things where the output isn't as good. But I'm leaning towards people just being picky about what they view as "good code" (or underspecifying how the AI is supposed to output it), at least since roughly Sonnet 4, because with some people I work with code review is just endless and oftentimes meaningless discussion and bikeshedding.
You can always be like: "This here pattern in these 20 files is Good Code™, use the same collection of approaches and code style when working on this refactoring/new feature."
> You can always be like: "This here pattern in these 20 files is Good Code™, use the same collection of approaches and code style when working on this refactoring/new feature."
…and then add that to your CLAUDE.md, and never worry about having to say it again manually.
Exactly! And if you use something that doesn't read CLAUDE.md, you'd still just tell the model to read the file as part of its work.
What helped me a bunch was having prebuild scripts (can be Bash, can be Python, can be whatever) for each of the architectural or style conventions I want to enforce. Tools like ESLint are also nice but focused a bit more on the code than architecture/structure.
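For illustration, here's a minimal sketch of the kind of check I mean - the layer names and the rule are hypothetical, not from my actual codebase, and Python is just one option:

    #!/usr/bin/env python3
    """Example convention check: UI code must not import the persistence layer directly.
    The package names (app/ui, app.db) are hypothetical - adapt to your own layout."""
    import pathlib
    import re
    import sys

    FORBIDDEN = re.compile(r"^\s*(from|import)\s+app\.db\b")  # direct DB-layer imports

    violations = []
    for path in pathlib.Path("app/ui").rglob("*.py"):
        for lineno, line in enumerate(path.read_text().splitlines(), start=1):
            if FORBIDDEN.match(line):
                violations.append(f"{path}:{lineno}: UI code must go through the service layer")

    if violations:
        print("\n".join(violations))
        sys.exit(1)  # non-zero exit so CI and the agent both see a hard failure

The specifics don't matter; the point is that the convention is executable, so the agent (or CI) gets a hard failure instead of a prose guideline it can quietly ignore.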
Problems start when a colleague might just remove some of those due to personal preference without discussion, but then you have other problems - in my experience, with proper controls in place AI will cause fewer issues and less friction than people (of course depending on culture fit).
> Every single time [...] I end up rewriting about 70% of the thing
If that number has not significantly changed since GPT 3.5, I think it's safe to assume that something very weird is happening on your end.
I think I know what they mean; I share a similar experience. It has changed: 3.5 couldn't even attempt to solve non-trivial tasks, so it was a 100% failure; now it's 70%.
I get the best results when using code to demonstrate my intention to an LLM, rather than try and explain it. It doesn't have to be working code.
I think that mentally estimating the problem space helps. These things are probabilistic models, and if there are a million solutions the chance of getting the right one is clearly low.
Feeding back results from tests really helps too.
This is exactly the impression that I got. Every question or task given to an LLM returns a pretty reasonable, but flawed, result. For coding, those are hard-to-spot but dangerous mistakes. They all look good and perfectly reasonable, but they're just wrong. Anthropic compared Claude Code to a "slot machine", and I feel that AI coding now is something close to a gambling addiction. Just as small wins keep a gambler making more bets, correct results from AI keep developers using it: "I see it made a correct solution, let's try again!" As a startup CTO, I review most of the pull requests from team members, and the team uses AI tools actively. The overall picture strongly confirms your second conclusion.
If someone gives you access to a slot machine which is weighted such that it pays out way more than you put into it, my advice is to start cranking that lever.
If it does indeed start costing more than it's paying out, step away.
On the subpar code, would the code work, albeit suboptimally?
I think part of the problem a lot of senior devs are having is that they see what they do as an artisanal craft. The rest of the world just sees the code as a means to an end.
I don't care how elegantly my toaster was crafted as long as it toasts the bread and doesn't break.
There is some truth to your point, but you might want to consider that seniors concerned with code quality often aren't being pedantic about artisanal craft; they are worried about the consequences of bad code...
- it becomes brittle and rigid (can't change it, can't add to it)
- it becomes buggy and impossible to fix one bug without creating another
- it becomes harder to tell what it's doing
- plus it can be inefficient / slow / insecure, etc.
The problem with your analogy is that toasters are quite simple. The better example would be your computer, and if you want your computer to just run your programs and not break, then these things matter.
More review items to consider on a PR:
* You have made a new file format. Consider that it will live forever.
* You have added exactly what the user/product team asked for. It must be supported forever.
Part of my job is to push back on user requests. I also think a lot about ease of use.
I think even with an LLM that can one-shot a task, the engineer writing the prompt must still have "engineering judgment".
Perhaps a better analogy is the smartphone or personal computer.
Think of all the awful, cheapest Android phones and Windows PCs and laptops that are slow, buggy, haven't had a security update in however long (and are thus insecure), and become virtually unusable within a couple of years. The majority of the people in the world live on such devices, either because they don't know better or have no better option. The world continues to turn.
People are fine with imperfection in their products, we're all used to it in various aspects of our lives.
Code being buggy, brittle, hard to extend, inefficient, slow, insecure. None of those are actual deal breakers to the end user, or the owners of the companies, and that's all that really matters at the end of the day in determining whether or not the product will sell and continue to exist.
If we think of it in terms of evolution, the selection pressure of all the things you listed is actually very weak in determining whether or not the thing survives and proliferates.
The usefulness is a function of how quickly the consequences from poor coding arrive and how meaningful they are to the organization.
Like in finance: if your AI trading bot makes a drastic mistake, it's immediately realized and can be hugely consequential, so AI is less useful. Retail is somewhere in the middle, but for something like marketing, or roles that are largely about data or management, the negatives aren't realized as quickly, so there can be a lot of hype around AI and what it may be able to do.
Another poster commented how very useful AI was to the insurance industry, which makes total sense, because even then if something is terribly wrong it has only a minor chance of ever being an issue and it's very unlikely that it would have a consequence soon.
Hattmall said it well with this:
> The usefulness is a function of how quickly the consequences from poor coding arrive and how meaningful they are to the organization.
I would just add that these hypothetical senior devs we are talking about are real people with careers, accountability and responsibilities. So when their company says "we want the software to do X" those engineers may be responsible for making it happen and accountable if it takes too long or goes wrong.
So rather than thinking of them as being irrationally fixated on the artisanal aspect (which can happen) maybe consider in most cases they are just doing their best to take responsibility for what they think the company wants now and in the future.
There's for sure legitimacy to the concern over the quality of LLM output and the maintainability of that code, not to mention the long-term impact on the next generation of devs coming in and losing their grasp on the fundamentals.
At the same time, the direction of software by and large seems to me to be going in the direction of fast fashion. Fast, cheap, replaceable, questionable quality.
Not all software can tolerate this, as I mentioned in another comment, flight control software, the software controlling your nuclear power plant, but the majority of the software in the world is far more trivial and its consumers (and producers) more tolerant of flaws.
I don't think of seniors as purely irrationally fixated on the artisanal aspect; I also think they are rationally, subconsciously or not, fearful of the implications for their career as the bottom falls out of this industry.
I could be wrong though! Maybe high quality software will continue to be what the industry strives for and high paying jobs to fix the flawed vibe coded slop will proliferate, but I’m more pessimistic than to think that.
Who does it fall on to fix the mess that's been made? You do care if the toaster catches fire and burns your house down.
> I don't care how elegantly my toaster was crafted as long as it toasts the bread and doesn't break.
A consumer or junior engineer cares whether the toaster toasts the bread and doesn’t break.
Someone who cares about their craft also cares about:
- If I turn the toaster on and leave, can it burn my house down, or just set off the smoke alarm?
- Can it toast more than sliced uniform-thickness bread?
- What if I stick a fork in the toaster? What happens if I drop it in the bathtub while on? Have I made the risks of doing that clear in such a way that my company cannot be sued into oblivion when someone inevitably electrocutes themselves?
- Does it work sideways?
- When it fills up with crumbs after a few months of use, is it obvious (without knowing that this needs to be done or reading the manual) that this should be addressed, and how?
- When should the toaster be replaced? After a certain amount of time? When a certain misbehavior starts happening?
Those aren’t contrived questions in service to a tortured metaphor. They’re things that I would expect every company selling toasters to have dedicated extensive expertise to answering.
My contention is:
> A consumer
is all that ultimately matters.
All those things you're talking about may or may not matter some day, after years and a class action lawsuit that may or may not materialize or have any material impact on the bottom line of the company producing the toaster, by which time millions of units of subpar toasters that don't work sideways will have been sold.
The world is filled with junk. The majority of what fills the world is junk. There are parts of our society where junk isn’t well tolerated (jet engines, mri machines) but the majority of the world tolerates quite a lot of sloppiness in design and execution and the companies producing those products are happily profitable.
You really underestimate how much work goes into everything around you. You don't care because it just works: the stuff you use is by and large not crap, which makes the crappy stuff all the more noticeable. Check out the housing code for your area: everything from the size of steps to the materials used for siding is in there. Or look at the FCC specifications for electrical devices that make sure you don't inadvertently jam radio frequencies in your local area, or the various codes which try very hard to stop you from burning your house down.
You're right that "there are parts of our society where junk isn't well tolerated", but the scope of those areas is far greater than you give credit for.
I'm long term traveling, mostly through the developing world, where something like 84% of humanity resides.
All around me, people's houses, the roads, the infrastructure, food cultivation and preparation, furniture, vehicles, it goes on and on, the tendency is towards loose approximation, loose standards. Things are constantly breaking, the quality is low, people are constantly being poisoned by the waste seeping into their water, air and soil, by the plastic they burn to cook their food, by the questionable chemicals in the completely unsafe industrial environments they work in to produce toxic products consumed by the masses.
There is no uniform size of steps. Yet the majority of humanity lives this way, and not just tolerates it but considers it a higher standard of living than we've had for the majority of human history.
I don't think people in the first world are a different species, so we will also adapt to whatever shitty environment we regress into as our standards fall. We'll realize that the majority of the areas we may consider sacrosanct are in fact quite negotiable in terms of quality when it comes down to our needs.
All this is to say that yeah, I think people will generally tolerate the quality of software going down just fine.
That's a sad way to think. I'd like to hope that humanity can improve itself, and that includes building products that are safer, more refined, more beautiful, more performant and more useful. I agree that there's a lot of crap out there, but I still want to believe and strive to make things that are excellent. I'm not ready to give up on that. And yes, I still get annoyed every time my crappy toaster doesn't work properly.
>I think part of the problem a lot of senior devs are having is that they see what they do as an artisanal craft. The rest of the world just sees the code as a means to an end.
Then you haven't been a senior dev long enough.
We want code that will be good enough because we will have to maintain it for years (or inherit that maintenance from someone else); we want it to be clean enough that adding new features isn't a pain, and architected well enough that it doesn't need a major rewrite to do so.
Of course, if the code is throwaway that doesn't matter, but if you're making a long-term product, making shit code now is taking on debt you will have to pay off.
That is not to say "don't use AI for that"; it is to say "actually go through the AI code and review whether it is done well enough". But many AI-first developers just ship the first thing that compiles or passes tests, without looking.
> I don't care how elegantly my toaster was crafted as long as it toasts the bread and doesn't break.
...well if you want it to not break (and still be cheap) you have to put quite a bit of engineering into it.
I'm in exactly the same boat.
To anybody who wants to try, here is a concrete example that I have tested in all available LLMs:
Write a prompt to get a Common Lisp application that renders a "hello triangle" in OpenGL, without using SDL or any framework, only OpenGL and GLFW bindings.
None of the replies even compiled. I kept asking, at least 5 times, with error feedback, to see if the AI could do it. It didn't work. Never.
The best I got was from Gemini: code where I had to change about 10 lines, and those were absolutely not trivial changes; you need to be familiar with OpenGL and Lisp to make them. After making the changes I asked what it thought of them, and it replied that I was wrong, that with those changes it would never work.
If anybody can make a prompt that gets me that, please let me know...
It sounds like you're using LLMs directly, instead of a coding agent. Agents are capable of testing their own code and using that to fix issues, which is what makes them so powerful.
Using Claude Code, I was able to successfully produce the Hello Triangle you asked for (note that I have never used CL before): https://github.com/philpax/hello-triangle-cl
For reference, here is the transcript of the entire interaction I had with CC (produced with simonw's excellent claude-code-transcripts): https://gisthost.github.io/?7924519b32addbf794c17f4dc7106bc2...
Edit: To better contextualise what it's doing, the detailed transcript page may be useful: https://gisthost.github.io/?7924519b32addbf794c17f4dc7106bc2...
Nice. The code I got from Gemini was much, much cleaner; it did not work, but after the manual changes it did. I will give it a try with the next task: put up text and generate primitives like rectangles, circles, polygons, etc…
"Please write me a program in Common LISP (SBCL is installed) which will render a simple "hello world" triangle in OpenGL. You should use only OpenGL and GLFW (using sbcl's FFI) for this, not any other existing 3D graphics framework."
This worked in codex-cli, although it took three rounds of passing back the errors. https://gist.github.com/jamesacraig/9ae0e5ed8ebae3e7fe157f67... has the resulting code.
That is using sb-alien and sb-sys, which is basically not Common Lisp anymore; that is basically SBCL. I didn't get anything in that direction (my prompt said nothing about a CL implementation), but I would have rejected it. I just wanted to see glfw and opengl in the :use clause. I have to build something that works on Mac, Linux, and Windows, with at least ECL, SBCL, and CCL.
Yeah, I was just trying to keep to the letter of what you'd said - you asked for it just to use OpenGL/GLFW bindings, not other libraries, so I didn't want to install cl-opengl and cl-glfw, and told it just to use its own FFI.
Well, at least you and the other commenters made something that worked, which I was unable to do. It seems the key is using a coding agent, not an LLM out of the box like I did.
> I end up rewriting about 70% of the thing.
I think this touches on the root of the issue. I am seeing results-over-process winning. Code quality will decline. Out-of-touch or apathetic project management who prioritize results are now even more emboldened to accept tech-debt-riddled code.
Have you tried asking one of your peers who claims to get good results to run a test with you? Where you both try to create the same project, and share your results?
I and one or two others are _the_ AI use experts at my org, and I was by far the earliest adopter here. So I don't really have anyone else with significantly different experiences than me that I could ask.
- [deleted]
Maybe if your coding style is already close to what an LLM like Claude outputs, you’ll never have these issues? At least it generally seems to be doing what I would do myself.
Most of the architectural failures come from it still not having the whole codebase in mind when changing stuff.
I actually think it's less about code style and more about the disjointed way end outcomes seem to be the culmination of a lot of prompt attempts over the course of a project/implementation.
The funny thing is reviewing stuff claude has made isn't actually unfamiliar to me in the slightest. It's something I'm intimately familiar with and have been intimately familiar with for many years, long before this AI stuff blew up...
..it's what code I've reviewed/maintained/rejected looks like when a consulting company was brought on board to build something. The kind of company that leverages probably underpaid and overworked laborers, both overseas and US-based workers on visas. The delivered documentation/code is noisy+disjointed.
> The delivered documentation/code is noisy+disjointed.
Yeah, which is what you get if your memory consists of everything you’ve read in the past 20 minutes. Most of my Claude work involves pointing it at the right things.
In my experience, AI coding agents need highly specific success criteria, and an easy way to verify their output against those criteria.
My biggest successes have come when I take a TDD approach. First I identify a subset of my work into a module with an API that can be easily tested, then I collaborate with the agent on writing correct test-cases, and finally I tell it to implement the module such that the test cases pass without any lint or typing errors.
It forces me to spend much more time thinking about use cases, project architecture, and test coverage than about nitty-gritty implementation details. I can imagine that in a system that evolved over time without a clear testing strategy, AI would struggle mightily to be even barely useful.
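To make the workflow concrete, this is roughly the shape of the test-first contract I hand to the agent; the module and function names here are invented for the example, not from any real system:

    # tests/test_slugify.py - written together with the agent *before* any implementation.
    # The instruction is then: "implement mylib.slugify so these pass, with no lint or type errors."
    import pytest

    from mylib.slugify import slugify  # hypothetical module the agent will create


    def test_lowercases_and_replaces_spaces():
        assert slugify("Hello World") == "hello-world"


    def test_strips_punctuation():
        assert slugify("Hello, World!") == "hello-world"


    def test_empty_input_is_rejected():
        with pytest.raises(ValueError):
            slugify("")

The tests are the spec; the implementation is disposable. If the agent's first attempt is ugly, I can throw it away and regenerate without losing the part I actually care about.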
Not saying this applies to your system, but I've definitely worked on systems in the past that fit the "big ball of mud" description pretty neatly, and I have zero clue how I'd have been able to make effective use of these AI tools.
> Every single time, I get something that works, yes, but then when I start self-reviewing the code, preparing to submit it to coworkers, I end up rewriting about 70% of the thing.
You might want to review how you approach these tools. Complaining that you need to rewrite 70% of the code screams of poor prompting, with too vague inputs, no constraints, and no feedback at all.
Using agents to help you write code is far from a one-shot task, but throwing out 70% of what you create screams that you are prompting the agent to create crap.
> 1) I'm not good at prompting, even though I am one of the earliest AI in coding adopters I know, and have been consistent for years. So I find this hard to accept.
I think you need to take a humble pill, review how you are putting together these prompts, figure out what you are doing wrong in prompts and processes, and work up from where you are at this point. If 70% of your output is crap, the problem is in your input.
I recommend you spend 20 minutes with your agent of choice prompting it to help you improve your prompts. Check instruction files, spec-driven approaches, context files, etc. Even a plain old README.md helps a lot. Prompt your agent to generate it for you. From there, instead of one-shot prompts, try to break down a task into multiple sub-steps with small deliverables. Always iterate on your instruction files. If you spend a few minutes on this, you will quickly halve your churn rate.
Maybe LLMs are like the next evolution of the rubber ducky: you can talk to it, and it's very helpful, just don't expect that IT will give you the final answer.
You alluded to it, but also:
3) Not everyone codes the same things
4) It's easy to get too excited about the tech and ignore its failure modes when describing your experiences later
I use AI a lot. With your own control plane (as opposed to a generic Claude Code or whatever) you can fully automate a lot more things. It's still fundamentally incapable of doing tons of tasks though at any acceptable quality level, and I strongly suspect all of (2,3,4) are guiding the disconnect you're seeing.
Take the two things I've been working on this morning as an example.
One was a one-off query. I told it the databases it should consider, a few relevant files, roughly how that part of the business works, and asked it to come back when it finished. When it was done I had it patch up the output format. It two-shot (with a lot of helpful context) something that would have taken me an hour or more.
Another is more R&D-heavy. It pointed me to a new subroutine I needed (it couldn't implement it correctly though) and is otherwise largely useless. It's actively harmful to have it try to do any of the work.
It's possible that (1) matters more than you suspect too. AI has certain coding patterns it likes to use a lot which won't work in my codebase. Moreover, it can't one-shot the things I want. It can, however, follow a generic step-by-step guide for generating those better ideas, translating worse ideas into things that will be close enough to what I need, identifying where it messed up, and refactoring into something suitable, especially if you take care to keep context usage low and whatnot. A lot of people seem to be able to get away with CLAUDE.md or whatever, but I like having more granular control of what the thing is going to be doing.
I have been doing the same since GPT-3. I remember a time, probably around 4o, when it started to get useful for some things like small React projects but was useless for other things like Firestore rules. I think that surface is still jagged; it's just less obviously useless in the areas where it's weaker.
Things really broke open for me when I adopted Windsurf with Opus 4, and then again with Opus 4.5. I think the way the IDE manages the context and breaks down tasks helps extend LLM usefulness a lot, but I haven't tried Cursor and haven't really tried to get good at Claude Code.
All that said, I have a lot of experience writing in business contexts and I think when I really try I am a pretty good communicator. I find when I am sloppy with prompts I leave a lot more to chance and more often I don't get what I want, but when I'm clear and precise I get what I want. E.g. if it's using sloppy patterns and making bad architectural choices, I've found that I can avoid that by explaining more about what I want and why I want it, or just being explicit about those decisions.
Also, I'm working on smaller projects with less legacy code.
So in summary, it might be a combination of 1, 2 and the age/complexity of the project you're working on.
My experience with agents in larger / older codebases is that feedback loops are critical. They'll get it somewhere in the neighborhood of right on the first attempt; it's up to your prompt and tooling to guide them to improve it on correctness and quality. Basic checks: can the agent run the app, interact with it, and observe its state? If not, you probably won't get working code. Quality checks: by default, you'll get the same code quality as the code the agent reads while it's working; if your linters and prompts don't guide it towards your desired style, you won't get it.
To put that another way: one-shot attempts aren't where the win is in big codebases. Repeat iteration is, as long as your tooling steers it in the right direction.
I think the answer will lie somewhere closer to social psychology and modern economics than to anything in software engineering.
It might be 1); being an early adopter doesn't help much with AI, since so much is changing constantly. If you put a good description of your architecture and coding guidelines in the right .md files and work on your prompts, the output should be much better. On the other hand, your project being legacy code probably also doesn't help.
We find across our team different people are able to use these things at different levels. Unsurprisingly, more senior coders with both more experience in general and more experience in ai coding are able to do more with ai and get more ambitious things done more quickly.
A bummer is that we have a genai team (louie.ai) and a gpu/viz/graph analytics team (graphistry), and those who have spent the last 2-3 years doing genai daily have a higher uptake rate here than those who haven't. I wouldn't say team 1 is better than team 2 in general: these are tools, and different people have different engineering skill and ai coding skill, including different amounts of time doing both.
What was a revelation for me personally was taking 1-2 months early in Claude Code's release to go full cold turkey on manual coding, similar to getting immersed in a foreign language. That forced eliminating a lot of bad habits wrt effective ai coding, both personally and in the state of our repo tooling. Since then, it's been steady work to accelerate and smooth that loop, eg, moving from vibe coding/engineering to now more eval-driven ai coding loops: https://media.ccc.de/v/39c3-breaking-bots-cheating-at-blue-t... . That takes a LOT of buildout.
Do you have links to texts that describe which markdown files, and what to write in them? What is good and what is bad etc.
I don't have any links, but you can start with CLAUDE.md and/or AGENTS.md and put the basic instructions in there (you can also google these filenames for examples and recommendations). I also put README.md's in every subfolder to describe which file does what, etc.
> 1) I'm not good at prompting,
I assume this is part of the problem (though I've avoided using LLMs mostly so can't comment with any true confidence here) but to a large extent this is blaming you for a suboptimal interface when the interface is the problem.
That some people seem to get much better results than others, and that the distinction does not map well to differences in ability elsewhere, suggests to me that the issue is people thinking slightly differently and the training data for the models somehow being biased to those who operate in certain ways.
> 2) Other people are just less picky than I am
That is almost certainly a much larger part of the problem. “Fuck it, it'll do, someone else can tidy it later if they are bothered enough” attitudes were rampant long before people started outsourcing work to LLMs.
I think you should try harder to find their limits. Be as picky as you want, but don't just take over after it gave you something you didn't like. Try again with a prompt that talks about the parts you think were bad the first time. I don't mean iterate with it, I mean start over with a brand new prompt. Try to figure out if there is a prompt that would have given you the result you wanted from the start.
It won't be worth it the first few times you try this, and you may not get it to where you want it. I think you might be pickier than others and you might be giving it harder problems, but I also bet you could get better results out of the box after you do this with a few problems.
Not even coding tasks. Just getting an LLM to help me put together a PromQL query to do something somewhat non-standard takes dozens of tries and copy/pasting back error messages... and these aren't complex errors, just trivial things like missing closing brackets and the like.
I know the usual clapback is "you're just missing this magical workflow" or "you need to prompt better" but... do I really need to prompt "make sure your syntax is correct"? Shouldn't that be, ya know, a given for a prompt that starts with "Help me put together a PromQL query that..."?
Yes, you're missing a magic workflow.
If you find yourself having to copy and paste errors back and forward you need to upgrade to a coding agent harness like Claude Code so the LLM can try things out and then fix the errors on its own.
If you're not willing to do that you can also fix this by preparing a text file with a few examples of correctly formatted queries and pasting that in at the start of your session, or putting it in a skill markdown file.
So, let me get this straight: LLMs need a "coding agent harness" to figure out that they need to close brackets? Wild.
They need one if you want them to be able to automatically recover from mistakes they make, yes.
I think you are correct, with one large caveat:
With very good tooling (e.g., Google Antigravity, Claude Code, OpenAI's Codex, and several open platforms), a willingness to not care about your monthly API and subscription costs, very long-running trial and error, and tools for testing code changes, some degree of real autonomy is possible.
But, do we want to work like this? I don’t.
I feel very good about using strong AI for research and learning new things (self improvement) and I also feel good about using strong AI as a ‘minor partner’ in coding.
Try learning to vibe code on something totally greenfield without looking at the code and see if it changes your mind. Ignore code quality; "does it work" and "am I happy with the app" are the only metrics.
Code quality is an issue you need to ignore with vibe coding; if code quality is important to you or your project, then vibe coding isn't the right fit. But if you abandon this concept and build things small enough or modular enough, then speed gains await!
IMO codebases can be architected for LLMs to work better in them, but this is harder in brownfield apps.
If you start greenfield and ignore the code quality, how do you know you can maintain it long term?
Greenfield is fundamentally easier than maintaining existing software. Once software exists, users expect it to behave a certain way and they expect their data to remain usable in new versions.
The existing software now imposes all sorts of constraints that may not be explicit in the spec. Some of these constraints end up making some changes very hard. Bad assumptions in data modeling can make migrations a nightmare.
You can't just write entirely new software every time the requirements change.
In practice, this is managed by:
1) Making the application small enough, and breaking it apart if needed (e.g. I've refactored my old 'big' app into 10 micro-apps).
2) Selecting an architecture that will work, looking after the data modelling and architecture yourself rather than delegating this to the LLM (it can implement it - but you need to design it).
3) Trusting that the LLM is capable enough to implement new requirements or fixes as required.
If requirements change so substantially that it's not possible, you can write new software as requirements change - as per point 1, you will have made your application modular enough that this isn't a significant concern.
I think you are not hardcore enough. I paste entire files, or 2-3 files at once, and ask it to rewrite everything.
Then you review it and in general have to ask it to remove some stuff. And then it's good enough. You have to accept not nitpicking some parts (like random functions being generated) as long as your test suite passes; otherwise, of course, you will end up rewriting everything.
It also depends on your setting; some areas (web vs AI vs robotics) can be more suited than others.
You can definitely use AI for non-trivial tasks.
It's not just about better prompting, but using better tools. Tools that will turn a bad prompt into a good prompt.
For example there is the plan mode for Cursor. Or just ask the AI: "make a plan to do this task", then you review the plan before asking it to implement. Configure the AI to ask you clarification questions instead of assuming things.
It's still evolving pretty quickly, so it's worth staying up to date with that.
I have not been as aggressive as GP in trying new AI tools. But the last few months I have been trying more and more and I'm just not seeing it.
On one project I tried recently, I took a test-driven approach. I built out the test suite while asking the AI to do the actual implementation. This was one of my more successful attempts, and may have saved me 20-30% time overall - but I still had to throw out 80% of what it built because the agent just refused to implement the architecture I was describing.
It's at its most useful if I'm trying to bootstrap something new on a stack I barely know, OR if I decide I just don't care about the quality of the output.
I have tried different CLI tools, IDE tools. Overall I've had the best success with Claude Code but I'm open to trying new things.
Do you have any good resources you would recommend for getting LLM's to perform better, or staying up-to-date on the field in general?
If you haven't yet, check Claude Code's plan mode:
> 2) Other people are just less picky than I am, or they have a less thorough review culture that lets subpar code slide more often.
Given how consistently terrible the code of the Claude Code-d projects posted here has been, I think this is it.
I find LLMs pretty useful for coding, for multiple things (to write boilerplate, as an idiomatic design pattern search engine, as a rubber duck, helping me name things, explaining unclear error messages, etc.), but I find the grandiose claims a bit ridiculous.
It is pretty simple imo. AI (just like humans!) does best on well-written, self-contained codebases. Which is a very small niche, but also overrepresented in open source and consequently among tech celebrities, who tend not to work on "ugly code".
I work on a giant legacy codebase at big tech, which is one piece of many distributed systems. The LLM is helpful for localised, well-defined work, but nowhere close to what TFA describes.
Not trying to back the AI hype, but most pre-AI auto-generated code is garbage (like WinForms auto-generated code or Entity Framework SQL in the .NET world). But that's fine; it's not meant to be read by humans. If you want to change it you can regenerate it. It may be that AI just moves the line between what developers should care about and look at vs the boring boilerplate code that adds little value.
But those code generators were deterministic (and indeed caused huge headaches if the generated code changed between versions). Seems like a totally different thing.
If you follow antirez's post history, he was a skeptic until maybe a year ago. Why don't you look at his recent commits and judge for yourself? I suppose the majority of his most recent code is relevant for this discussion.
https://github.com/antirez?tab=overview&from=2026-01-01&to=2...
I don't think I'd be a good judge because I don't have the years of familiarity and expertise in his repos that I do at my job. A lot of the value of me specifically vs an LLM at my job is that I have the tribal knowledge and the LLM does not. We have gotten a lot better at documentation, but I don't think we can _ever_ truly eliminate that factor.
How much buggy / incorrect Java written by first year computer science University students is there on Stack Overflow (in SO post bodies)? Decades of it.
Ask the same question of Golang, or Rust, or Typescript.
I have a theory that the large dichotomy in how people experience AI coding has to do with the quality of the training corpus for each language online.
Instead of rewriting yourself have you tried telling the agent what it did wrong and do the rewrite with it? Then at the end of the session ask it to extract a set of rules that would have helped to get it right the first time. Save that in AGENTS.md. If you and your team do this a few times it can lead to only having to rewrite 5% of the code instead of 70%.
> Instead of rewriting yourself have you tried telling the agent what it did wrong and do the rewrite with it?
I have, it becomes a race to the bottom.
Race to the bottom? Tell me more
It says "of course you're right" and may or may not refactor/fix/rewrite the issue correctly. More often than not it doesn't or misses some detail.
So you tell it again, "of course you are right", and the cycle repeats.
And then the context window gets exhausted. Compaction loses most of the details and degrades quality. You start a new session, but the new session has to re-learn the entire world from scratch and may or may not fix the issue.
And so the cycle continued.
Thank you for providing data which can actually be used to collate! I strongly suspect that experience is a huge determinant of what utility is seen from LLMs.
It seems that there are more people writing and finishing projects, but not many have reached the point where they have to maintain their code / deal with the tech debt.
I'm not sure if I got into this weird LLM bubble where they give me bad advice to drive engagement, because I can't resist trying to correct them and tell them how absurdly wrong they are.
But it is astounding how terrible they are at debugging non-trivial assembly in my experience.
Anyone else have input here?
Am I in a weird bubble? Or is this just not their forte?
It's truly incredible how thoughtless they can be, so I think I'm in a bubble.
> I can't resist trying to correct them and tell them how absurdly wrong they are.
Oh god I thought I was the only one. Do you find yourself getting mad at them too?
If a normal person looked at my messages, they could safely assume I've gone crazy.
Yes, nothing has made me as angry as their insistence that they are always right, even when you prove them wrong.
Again, I think I've done this to myself.
They know that gets me to respond, and all they care about is engagement.
I've tried to use Claude Code with Sonnet 4.5 for implementing a new interpreter, and man is it bad with reference counting. Granted, I'm doing it in Zig, so there's not as much training data, but Claude will suggest the most stupid changes. All it does is make the rare case of incorrect reference counting rarer, not fix the underlying problem. It kept heaping on more and more hacks, until I decided enough is enough and rolled up my sleeves. I still can't tell if it makes me faster, or if I'm faster on my own.
Even when refactoring, it would change all my comments, which is really annoying, as I put a lot of thought into my comments. Plus, the time it took to do each refactoring step was about how long it would take me, and when I do it I get the additional benefit of feeling when I'm repeating code too often.
So, I'm not using it for now, except for isolating bugs. It's addicting having it work on it for me, but I end up feeling disconnected and then something inevitably goes wrong.
I'm also building a language in Zig!
Good luck!
Oh cool! I'd love to hear more. I'm implementing an existing language, Tcl, but I'm working on making it safe to share values between threads, since a project I contribute to[1] uses Tcl for all the scripting, but they have about a 30% overhead with serialization/deserialization between threads, and it doesn't allow for sharing large values without significant overheads. I'm also doing some experiments with heap representation to reduce data indirection, so it's been fun getting to learn how to implement malloc and other low-level primitives I usually take for granted.
[1] folk.computer
Genuine question: doesn't this apply more to coding style than to actual results? The same applies to writing style. LLMs manage to write great stories, but they don't suit my writing style. When generating code it doesn't always suit my coding style, but the code it generates functions fine.
That's the curse of the expert. You see many of the shortcomings, that someone less experienced might not even think about, when they go to social media and blurt out that AI is now able to fully replace them.
> Every single time, I get something that works, yes, but then when I start self-reviewing the code, preparing to submit it to coworkers, I end up rewriting about 70% of the thing.
Have another model review the code, and use that review as automatic feedback?
CodeRabbit in particular is gold here. I don't know what they do, but it is far better at reviewing than any AI model I've seen. From the deep kinds of things it finds, I highly suspect they have a lot of agents routing code to extremely specialized subagents that can find subtle concurrency bugs, misuse of some deep APIs, etc. I often have to do the architectural/big picture/how-this-fits-into-the-project-vision review myself, but for finding actual bugs in code, or things that would be self-evident from reading one file, it is extremely good.
I've been using a `/feedback ...` command with claude code where I give it either positive or negative feedback about some action it just did, and it'll look through the session to make some educated guesses about why it did some thing - notably, checking for "there was guidance for this, but I didn't follow it", or "there was no guidance for this".
the outcome is usually a new or tweaked skill file.
it doesn't always fix the problem, but it's definitely been making some great improvements.
That is actually a gold tip. Codex CLI is way less pleasant to use than Opus, but way better at finding bugs, so I combine them.
Codex is a sufficiently good reviewer I now let it review my hand-coded work too. It's a really, really good reviewer. I think I make this point often enough now that I suspect OpenAI should be paying me. Claude and Gemini will happily sign off work that just doesn't work, OpenAI is a beast at code-review.
It sounds harsh but you're most likely using it wrong.
1) Have an AGENTS.md that describes not just the project structure, but also the product and business (what does it do, who is it for, etc). People expect LLMs to read a snippet of code and be as good as an employee who has implicit understanding of the whole business. You must give it all that information. Tell it to use good practices (DRY, KISS, etc). Add patterns it should use or avoid as you go.
2) It must have source access to anything it interacts with. Use Monorepo, Workspaces, etc.
3) Most important of all, everything must be set up so the agent can iterate, test, and validate its changes. It will make mistakes all the time, just like a human does (even basic syntax errors), but it will iterate and end up on a good solution. It's incorrect to assume it will make perfect code blindly without building, linting, testing, and iterating on it. The LLM should be able to determine if a task was completed successfully or not (see the sketch below).
4) It is not expected to always one-shot perfect code. If you value quality, you will glance at it, and sometimes have to reply: make it this other way, extract this, refactor that. Having said that, you shouldn't need to write a single line of code yourself (I haven't for months).
Using LLMs correctly allows you to complete tasks in minutes that would take hours, days, or even weeks, with higher quality and fewer errors.
Use Opus 4.5 with other LLMs as a fallback when Opus is being dumb.
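To illustrate point 3, this is roughly the shape of the single verification entry point I give the agent. The specific tools (ruff, mypy, pytest) are only examples; substitute whatever your project already runs:

    #!/usr/bin/env python3
    """One command the agent runs after every change; exits non-zero on any failure."""
    import subprocess
    import sys

    # Example toolchain - swap in your own linter, type checker, and test runner.
    CHECKS = [
        ["ruff", "check", "."],
        ["mypy", "src"],
        ["pytest", "-q"],
    ]

    failed = False
    for cmd in CHECKS:
        print(f"$ {' '.join(cmd)}")
        if subprocess.run(cmd).returncode != 0:
            failed = True

    sys.exit(1 if failed else 0)

The agent is told to run this after every edit and to keep iterating until it exits cleanly; that loop, not the prompt wording, is what does most of the work.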
> Most important of all, everything must be set up so the agent can iterate, test and validate its changes.
This was the biggest unlock for me. When I receive a bug report, I have the LLM tell me where it thinks the source of the bug is located, write a test that triggers the bug and fails, design a fix, and finally implement the fix, then repeat. I'm routinely surprised how good it is at doing this, and the speed with which it works. So even if I have to manually tweak a few things, I've moved much faster than without the LLM.
"The LLM should be able to determine if a task was completed successfully or not."
Writing logic that verifies something complex basically requires having solved the problem entirely already.
Situation A) Model writes a new endpoint and that's it
Situation B) Model writes a new endpoint, runs lint and build, adds e2e tests with sample data and runs them.
Did situation B mathematically prove the code is correct? No. But the odds that the code is correct increase enormously. You see all the time how the agent finds and fixes errors at each of those steps that would otherwise have slipped by.
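To picture what situation B's extra steps look like in a Spring codebase, here is an invented integration-style test the agent could add and run against the new endpoint with sample data (the endpoint, payload, and class names are all hypothetical):

```java
// Invented example: a test the agent can add and run against a new endpoint with
// sample data, so it can check the whole request/response path rather than just
// assuming the code it wrote is correct.
import static org.springframework.test.web.servlet.request.MockMvcRequestBuilders.post;
import static org.springframework.test.web.servlet.result.MockMvcResultMatchers.jsonPath;
import static org.springframework.test.web.servlet.result.MockMvcResultMatchers.status;

import org.junit.jupiter.api.Test;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.autoconfigure.web.servlet.AutoConfigureMockMvc;
import org.springframework.boot.test.context.SpringBootTest;
import org.springframework.http.MediaType;
import org.springframework.test.web.servlet.MockMvc;

@SpringBootTest
@AutoConfigureMockMvc
class CreateWidgetEndpointIT {

    @Autowired
    MockMvc mockMvc;

    @Test
    void createdWidgetIsEchoedBack() throws Exception {
        // Sample data pushed through the stack, with assertions on the JSON response.
        mockMvc.perform(post("/api/widgets")
                        .contentType(MediaType.APPLICATION_JSON)
                        .content("{\"name\": \"sample\"}"))
                .andExpect(status().isCreated())
                .andExpect(jsonPath("$.name").value("sample"));
    }
}
```

Pair that with whatever the build already gives you (linting, `./mvnw verify` or the equivalent in your stack) and the agent has a loop it can actually close on its own.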
LLM-generated tests, in my experience, are really poor.
Doesn't change the fact that what I mentioned greatly improves agent accuracy.
AI-generated implementation with AI-generated tests left me with some of the worst code I've witnessed in my life. Many of the passing tests it generated were tautologies (i.e. they would never fail even if behavior was incorrect).
When the tests failed, the agent tended to change the (previously correct) test, making it pass but functionally incorrect; or it "wisely" concluded that both the implementation and the test were correct but that there were external factors making the test fail (there weren't).
It behaved much like a really naive junior.
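"Tautology" is worth making concrete. This is the shape of a test that can never fail, because the assertion only exercises the mock's own stub rather than any real implementation (all names below are invented for illustration):

```java
// Invented example of a tautological test: the "unit under test" is mocked, so the
// assertion checks the stub against itself and passes no matter what the real code does.
import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.when;

import org.junit.jupiter.api.Test;

class TaxCalculatorTest {

    interface TaxCalculator {
        double taxFor(double amount);
    }

    @Test
    void taxIsCalculatedCorrectly() {
        TaxCalculator calculator = mock(TaxCalculator.class);
        when(calculator.taxFor(100.0)).thenReturn(19.0);

        // This can only ever see the stubbed value; the production TaxCalculator
        // could return anything (or not exist) and this test would still be green.
        assertEquals(19.0, calculator.taxFor(100.0));
    }
}
```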
Which coding agent and which model?
Actually, it borderline undermines it, because it's shit building upon shit.
As with anything else, the people best positioned to enjoy the output are the people least well positioned to criticize it. This is as true of AI as it is of eating at restaurants or enjoying movie dramas.
Wow, only 70%? So far I have had to drop the output and rewrite from scratch every time. Mind you, I work in C/embedded spaces, and current LLMs are just horrible at any code in that space.
My vote is with (2).
This is also my experience with enterprise Java. LLMs have done much better with slightly less convoluted codebases in Go. They're currently clearly better at Go and TypeScript than at Java, in my view.
Do you have an example of something that was subpar and needed a 70% rewrite?
LLMs tend to rise to the level of the complexity of the codebase. They are probabilistic pattern matching machines, after all. It's rare to have a 15 year old repo without significant complexity; is it possible that the reason LLMs have trouble with complex codebases is that the codebases are complex?
IMO it has nothing to do with LLMs. They just mirror the patterns they see - don't get upset when you don't like your own reflection! Software complexity is still bad. LLMs just shove it back in our face.
Implications: AI is always going to feel more effective on brand new codebases without any legacy weight. And less effective on "real" apps where the details matter.
The bias is strongly evident - you rarely hear anyone talking about how they vibe coded a coherent changeset to an existing repo.
AI is a house painter, wall to wall, with missed spots and drips. Good coders are artists. That said, artists have been known to use assistants on backgrounds. Perhaps the end case is a similar coder/AI collaborative effort?
> I don't understand the stance that AI currently is able to automate away non-trivial coding tasks
It's just the Dunning-Kruger effect. People who think AI is the bee's knees are precisely the dudes who are least qualified to judge its effectiveness.
Same experience. The better the model, the more complicated the bugs and brain damage it introduces.
Perhaps one has to be a skilled programmer in the first place to spot the problems, which is not easy when the program apparently runs.
Things like mocked tests, you know. Who would care about that.
I think it comes down to what you mean by subpar code. If you're talking about a mess of bubble sorts and other algorithmic problems, that's probably a prompting issue. If you're talking "I just don't like the style of the code, it looks inelegant," that's not really a prompting issue; models will veer towards common patterns in a way that's hard to avoid with prompts.
Think about it like compiler output. Literally nobody cares if that is well formatted. They just care that they can get fairly performant code without having to write assembly. People still dip to assembly (very very infrequently now) for really fine performance optimizations, but people used to write large programs in it (miserably).
There's a huge amount you're missing by boiling down their complaint to "bubble sorts or inelegant code". The architecture of the new code, how it fits into the existing system, whether it makes use of existing utility code (IMO this is a huge downside; LLMs seem to love to rewrite a little helper function 100x over), etc.
These are all important when you consider the long-term viability of a change. If you're working in a greenfield project where requirements are constantly changing and you plan on throwing this away in 3 months, maybe it works out fine. But not everyone is doing that, and I'd estimate most professional SWEs are not doing that, even!
There are certainly coupled, obtuse, contorted code styles that the LLM will be unable to twist itself into (which is different from the coupled, obtuse code it generates itself). Don't pretend this is good code, though; own that you're up to your neck in shit.
LLMs are pretty good at modifying well factored code. If you have a functional modular monolith, getting agents to add new functions and compose them into higher order functionality works pretty darn well.
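For what it's worth, this is the shape of code that comment is describing: small, pure functions an agent can extend by composition rather than by threading state through a tangle. Everything below is an invented illustration, not anyone's real codebase:

```java
// Invented sketch of a "well factored" module: tiny single-purpose functions that a
// new feature mostly composes, which is the kind of localized edit agents handle well.
import java.math.BigDecimal;
import java.util.List;
import java.util.function.UnaryOperator;

final class OrderPricing {

    // Existing small building blocks...
    static BigDecimal subtotal(List<BigDecimal> lineItems) {
        return lineItems.stream().reduce(BigDecimal.ZERO, BigDecimal::add);
    }

    static UnaryOperator<BigDecimal> applyDiscount(BigDecimal rate) {
        return total -> total.subtract(total.multiply(rate));
    }

    static UnaryOperator<BigDecimal> applyTax(BigDecimal rate) {
        return total -> total.add(total.multiply(rate));
    }

    // ...so "higher order functionality" is mostly composition of what already exists.
    static BigDecimal finalPrice(List<BigDecimal> items, BigDecimal discount, BigDecimal tax) {
        return applyTax(tax).compose(applyDiscount(discount)).apply(subtotal(items));
    }
}
```

A new pricing rule in this style is one more small function plus a change to the composition, rather than a rewrite of shared state, which is exactly where agents tend to do well.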
It's a combination of being bad at prompting and having different expectations of the tool. You expect it to one-shot the task, and then you rewrite the things that don't match what you want.
Instead I recommend that you use LLMs to fix the problems that they introduced as well, and over time you'll get better at figuring out the parts that the LLM will get confused by. My hunch is that you'll find your descriptions of what to implement were more vague than you thought, and as you iterate, you'll learn to be a lot more specific. Basically, you'll find that your taste was more subjective than you thought and you'll rid yourself of the expectation that the LLM magically understands your taste.