This matches my experience. I actually think a fair amount of the value I get from LLM assistants is having a reasonably intelligent rubber duck to talk to. Now the duck can occasionally disagree and sometimes even refine.
https://en.m.wikipedia.org/wiki/Rubber_duck_debugging
I think the big question everyone wants to skip past this conversation and get right to is: will this continue to be true two years from now? I don't know how to answer that question.
LLMs aren't my rubber duck, they're my wrong answer.
You know that saying that the best way to get an answer online is to post a wrong answer? That's what LLMs do for me.
I ask the LLM to do something simple but tedious, it does it spectacularly wrong, and then I get pissed off enough that I have the rage-induced energy to do it myself.
I'm probably suffering undiagnosed ADHD, and will get stuck and spend minutes picking a function name and then writing a docstring. LLMs do help with this even if they get the code wrong, because I usually won't bother to fix their variable names or docstrings unless needed. LLMs can reliably solve the blank-page problem.
This. I have ADHD and starting is the hardest part for me. With an LLM it gets me from 0 to 20% (or more) and I can nail it for the rest. It’s way less stressful for me to start now.
Very much agree, although lately with how good it is I get hyperfocused and spend more time than I allocated, because I end up wanting to implement more than I planned.
It’s a struggle right? First world LLM problems.
Been suffering the same. I'm used to having so many days (weeks/months) when I just don't get that much done. With LLMs I can take these days and hack around / watch videos / play games while the LLM is working in the background, and just check the work. Best part is it often leads to some problematic situation that gets me involved, and often I'll end up getting a real day of work out of it after I get started.
> LLMs can reliably solve the problem of a blank-page.
This has been the biggest boost for me. The number of choices available when facing a blank page is staggering. Even a bad/wrong implementation helps collapse those possibilities into a countable few that take far less time to think about.
Yeah, keeping me in the flow when I hit one of those silly tasks my brain just randomly says "no, let's do something else" to has been the main productivity-improving feature of LLMs.
Yes! So many times my brain just skips right over some tasks because it takes too much effort to start. The LLM can give you something to latch onto and work with. It can lay down the starting shape of a function or program and even when it's the wrong shape, you still have something to mold into the correct shape.
The thing about ADHD is that taking a task from nothing to something is often harder than turning that something into the finished product. It's really weird and extremely not fun.
This is the complete opposite for me! I really like a blank page, the thought of writing a prompt destroys my motivation as does reviewing the code that an LLM produces.
As an aside, I'm seeing more and more crap in PRs. Nonsensical use of language features. Really poorly structured code, but that is a different story.
I'm not anti LLMs for coding. I use them too. Especially for unit tests.
So much this, the blank page problem is almost gone. Even if it's riddled with errors.
This is my experience, too. As a concrete example, I'll need to write a mapper function to convert between a protobuf type and a Go type. The types are mirror reflections of each other, and I feed the complete APIs of both in my prompt.
I've yet to find an LLM that can reliably generate mapping code from proto.Foo{ID string} to gomodel.Foo{ID string}.
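To make that concrete, the mapper is nothing but field-by-field copying, roughly like the sketch below (the types here are hypothetical stand-ins for the generated proto struct and the hand-written model struct):

    package main

    import "fmt"

    // Hypothetical stand-ins; in reality these live in separate packages
    // (the generated proto package and the hand-written gomodel package).
    type ProtoFoo struct {
        ID    string
        Count int64
    }

    type ModelFoo struct {
        ID    string
        Count int64
    }

    // FooFromProto copies each field across. This is the tedious boilerplate
    // the LLM is being asked to generate.
    func FooFromProto(in *ProtoFoo) *ModelFoo {
        if in == nil {
            return nil
        }
        return &ModelFoo{
            ID:    in.ID,
            Count: in.Count,
        }
    }

    func main() {
        fmt.Printf("%+v\n", FooFromProto(&ProtoFoo{ID: "abc", Count: 3}))
    }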
It still saves me time, because even 50% accuracy is still half the code I don't have to write myself.
But it makes me feel like I'm taking crazy pills whenever I read about AI hype. I'm open to the idea that I'm prompting wrong, need a better workflow, etc. But I'm not a luddite, I've "reached up and put in the work" and am always trying to learn new tools.
An LLM's ability to do a task is roughly correlated to the number of times that task has been done on the internet before. If you want to see the hype version, you need to write a todo web app in TypeScript or similar. So it's probably not something you can fix with prompts, but having a model with more focus on relevant training data might help.
These days, they'll sometimes also RL on a task if it's easy to validate outputs and if it seems worth the effort.
This honestly seems like something that could be better handled with pre-LLM technology, like a 15-line Perl script that reads one on stdin, applies some crufty regexes, and writes the other to stdout. Are there complexities I'm not seeing?
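For instance, here's a rough sketch of that kind of generator, in Go rather than Perl, assuming the two structs mirror each other one-to-one with identical exported field names (every name below is made up for illustration):

    package main

    import (
        "bufio"
        "fmt"
        "os"
        "regexp"
    )

    func main() {
        // Crufty regexes: one for the "type Foo struct" header, one for
        // exported fields such as "    ID string" or "    Count int64".
        structRe := regexp.MustCompile(`^type\s+(\w+)\s+struct`)
        fieldRe := regexp.MustCompile(`^\s+([A-Z]\w*)\s+\S+`)

        var name string
        var fields []string
        sc := bufio.NewScanner(os.Stdin)
        for sc.Scan() {
            line := sc.Text()
            if m := structRe.FindStringSubmatch(line); m != nil {
                name = m[1]
            } else if m := fieldRe.FindStringSubmatch(line); m != nil {
                fields = append(fields, m[1])
            }
        }

        // Emit a field-by-field mapper from proto.<Name> to gomodel.<Name>.
        fmt.Printf("func %sFromProto(in *proto.%s) *gomodel.%s {\n", name, name, name)
        fmt.Printf("\treturn &gomodel.%s{\n", name)
        for _, f := range fields {
            fmt.Printf("\t\t%s: in.%s,\n", f, f)
        }
        fmt.Print("\t}\n}\n")
    }

Fed a struct definition on stdin, it prints the mapper on stdout. It obviously falls apart the moment field names or types diverge, which is presumably where an LLM is supposed to earn its keep.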
LLMs are a decent search engine a la Google circa 2005.
It's been 20 years since then, so I think people have simply forgotten that a search engine can actually be useful, as opposed to ad-infested SEO sewage sludge.
The problem is that the conversational interface, for some reason, seems to turn off the natural skepticism that people have when they use a search engine.
> LLMs are a decent search engine a la Google circa 2005.
Statistical text (token) generation made from an unknown (to the user) training data set is not the same as a keyword/faceted search of arbitrary content acquired from web crawlers.
> The problem is that the conversational interface, for some reason, seems to turn off the natural skepticism that people have when they use a search engine.
For me, my skepticism of using a statistical text generation algorithm as if it were a search engine is because a statistical text generation algorithm is not a search engine.
Search engines can be really good still if you have a good idea what you're looking for in the domain you're searching.
Search engines can suck when you don't know exactly what you're looking for and the phrases you're using have invited spammers to fill up the first 10 pages.
They also suck if you want to find something that's almost exactly like a very common thing, but different in some key aspect.
For example, I wanted to find some texts on solving a partial differential equation numerically using 6th-order or higher finite differences, as I wanted to know how to handle boundary conditions (the interior is simple enough).
Searching only turned up the usual low-order methods that I already knew.
Asking some LLMs, I got some decent answers and could proceed.
Back in the day you could force the search engines to restrict their search scope, but they all seem so eager to return results at all costs these days, making them useless in niche topics.
I agree completely. Personally, I actually like the list of links, because I like to compare different takes on a topic. It's also fascinating to see how a scientific study propagates through the media, or the way the same news story is treated over time as trends change. I don't want a single mashed-up answer to a question, and maybe that makes me weird. More worrying, whenever I've asked an LLM for an answer to a question on a topic I happen to know a LOT about, the response has been either incorrect or inadequate - "there is currently no information collected on that topic". I do like Perplexity for questions like "without any preamble whatsoever, what is the fastest way to remove a <whatever> stain from X material?"
I almost never bother using Google anymore. When I search for something, I'm usually looking for an answer to a question. Now I can just ask the question and get the answer without all the other stuff.
I will often ask the LLM to give me web pages to look at when I want to do further reading.
As LLMs get better, I can't see myself going back to Google as it is or even as it was.
You get an answer.
Whether that's the answer, or even the best answer, is impossible to tell without doing the research you're trying to avoid.
If I do research, I get an answer. Whether that's the answer, or even the best answer, is impossible to tell. When do I stop looking for the best answer?
If ChatGPT needs to, it will actually do the search for me and then collate the results.
By that logic, it's barely worth reading a newspaper or a book. You don't know if they're giving you accurate information without doing all the research you're trying to avoid.
Recognised newspapers will curate by hiring smart, knowledgeable reporters and funding them to get reliable information. Recognised books will be written by a reliably informed author, and reviewed by other reliably informed people. There are no recognised LLMs, and their method of working precludes reliability.
Malcolm Gladwell, Jonah Lehrer, Daniel Kahneman, Matthew Walker, Stephen Glass? The New York Times, featuring Judith Miller on the existence of WMD, or their award winning podcast "Caliphate"? (Award returned when it became known the whole thing was made up, in case you haven't heard of that one).
As opposed to an LLM trained on all the Sh1tL0rd69s of the web?
Not anymore, not for a long time. There are very few truly reliable and trustworthy sources these days. More and more "recognized" publications are using LLMs. If a "recognized" authority gives you LLM slop, that doesn't make it any more trustworthy.
It’s only a matter of time before Google merges search with Gemini. I don’t think you’ll have to wait long.
Already happened.
Google search includes an AI generated response.
Gemini prompts return Google search results.
See. They saw my comment and got it done. Dang, that was quick.
Once search engines merge fully with AI, the Internet is over.
> Statistical text (token) generation made from an unknown (to the user) training data set is not the same as a keyword/faceted search of arbitrary content acquired from web crawlers.
Well, it's roughly the same under the hood, mathematically.
All of the current models have access to Google and will do a search (or multiple searches), filter and analyze the results, then present a summary of results with links.
Except a search engine isn't voice-controlled, or able to write code for me.
Recently I did some tests with coding agents, and being able to translate a full application from AT&T Assembly into Intel Assembly compatible with NASM, in about half an hour of talking with the agent, and having the end result actually working with minor tweaks, isn't something a "decent search engine a la Google circa 2005." would ever have been able to achieve.
In the past I would have given such a task to a junior dev or intern, to keep them busy somehow, with a bit more tool maturity I have no reason to do it in the future.
And this is the point many developers haven't yet grasped about their future in the job market.
> being able to translate a full application from AT&T Assembly into Intel Assembly compatible with NASM, [...] isn't something a "decent search engine a la Google circa 2005." would ever have been able to achieve
No, you would have searched for "difference between at&t assembly and intel assembly" and, failing that, looked up the manuals for both and compiled the differences yourself. Then written an awk or perl script to get it done. If you happen to be good at both assembly dialects and awk, I believe that could have been done in less than an hour. Or you could use some vim macros.
> In the past I would have given such a task to a junior dev or intern, to keep them busy somehow, with a bit more tool maturity I have no reason to do it in the future.
The reason to give tasks to a junior is to get them to learn more, or because the task needs to be done but isn't critical. Unless it takes less time to do it yourself than to delegate it, or you have no junior to guide, handing the task to a junior is worthwhile if it will help them grow.
Except that awk or Perl script is something that would take me more than half an hour from idea to production.
There might not be a junior to give tasks to, if the number of available juniors keeps decreasing.
> the conversational interface, for some reason, seems to turn off the natural skepticism that people have
n=1, but after having ChatGPT "lie" to me more than once I am very skeptical of it and always double-check it, whereas with something like TV or YT videos I still find myself being click-baited or grifted (iow less skeptical) much more easily... any large studies about this would be very interesting.
I get irrationally frustrated when ChatGPT hallucinates npm packages / libraries that simply do not exist.
This happens… weekly for me.
"Hey chatgpt I want to integrate a slidepot into this project"
>from PiicoDev_SlidePot import PiicoDev_SlidePot
Weird how these guys used exactly my terminology when they usually say "Potentiometer"
Went and looked it up, found a resource outlining that it uses the same class as the dial potentiometer.
"Hey chatgpt, I just looked it up and the slidepots actually use the same Potentiometer class as the dialpots."
scurries to fix its stupid mistake
Weird. I used to have that happen when it first came out but I haven't experienced anything like that in a long time. Worst case it's out of date rather than making stuff up.
My experience with this is that it is vital to have a setup where the model can iterate on its own.
Ideally by having a test or endpoint you can call to actually run the code you want to build.
Then you ask the model to implement the function and run the test. If it hallucinates anything, it will find that and fix it.
IME OpenAI is below Claude and Gemini for code.
Just ask it to write and publish them and you're good :)
Jia Tan will have to work 24/7 :)
Tell it that you won't accept any newly installed packages, language features only. I have that in the coding prompt I made.
This has been my experience as well. The biggest problem is that the answers look plausible, and only after implementation and experimentation do you find them to be wrong. If this happened every once in a while then it wouldn't be a big deal, but I'd guess that more than half of the answers and tutorials I've received through ChatGPT have ended up being plain wrong.
God help us if companies start relying on LLMs for life-or-death stuff like insurance claim decisions.
I'm not sure if you're being sarcastic, but in case you're not... From https://arstechnica.com/health/2023/11/ai-with-90-error-rate...
"UnitedHealth uses AI model with 90% error rate to deny care, lawsuit alleges" Also "The use of faulty AI is not new for the health care industry."
> If this happened every once in a while then it wouldn't be a big deal, but I'd guess that more than half of the answers and tutorials I've received through ChatGPT have ended up being plain wrong.
It would actually have been more pernicious that way, since it would lull people into a false sense of security.
Yep.
I like maths, I hate graphing. Tedious work even with state of the art libraries and wrappers.
LLMs do it for me. Praise be.
Yeah, I write a lot of little data analysis scripts and stuff, and I am happy just to read the numbers, but now I get nice PNGs of the distributions and so on from the LLM, and people like that.
I have to upvote this, because this is how I felt after trying three times (times when I consciously decided to give an LLM a try, versus having it shoved down my throat by Google/MS/Meta/etc.) and giving up (for now).
LLMs follow instructions. Garbage in = garbage out, generally. When attention is managed, a problem is well defined, and the necessary materials are available to it, they can perform rather well. On the other hand, I find a lot of the loosey-goosey vibe coding approach to be useless; it gives a lot of false impressions about how useful LLMs can be, both too positive and too negative.
So what you’re saying is you need to be very specific and detailed when writing your specifications for the LLM to spit out the code you want. Sounds like I can just skip the middle man and code it myself.
Not in 10 seconds
You probably didn’t write up a detailed prompt with perfect specifications in 10 seconds, either.
In my experience, it doesn’t matter how good or detailed the prompt is—after enough lines of code, the LLM starts making design decisions for you.
This is why I don’t accept LLM completions for anything that isn’t short enough to quickly verify that it is implemented exactly as I would have myself. Usually, that’s boilerplate code.
> This is why I don’t accept LLM completions for anything that isn’t short enough to quickly verify that it is implemented exactly as I would have myself. Usually, that’s boilerplate code.
^ This. This is where I've landed as far as the extent of LLM coding assistants for me.
I've seen very long prompts that are as long as a school essay, and those didn't take ten seconds either.
To some extent those fall into the same category as cheaters who put way more effort into cheating on an exam than it would take to do it properly. Or people paying 10/15 bucks a month to access a private Usenet server to download pirated content.
The advantage of an LLM in that case is that you can skip a lot of syntax: making a LOT of typos in your spec, or even writing pseudocode, will still result in a working program. Not so with code. Also, small logical mistakes, messing up left/right, x/y, etc. are auto-fixed, maybe to your frustration if they were not mistakes, but often they are, and you won't notice as they are indeed just repaired for you.
No, but the better specifications you provide to your “development team”, the more likely you are to get what you expected… like always.
> LLMs follow instructions.
They don't
> Garbage in = garbage out generally.
Generally, this statement is false
> When attention is managed and a problem is well defined and necessary materials are available to it, they can perform rather well.
Keyword: can.
They can also not perform really well despite all the management and materials.
They can also work really well with a loosey-goosey approach.
The reason is that they are non-deterministic systems whose performance is affected more by compute availability than by your unscientific random attempts at reverse engineering their behavior: https://dmitriid.com/prompting-llms-is-not-engineering
This seems to be what’s happened
People are expecting perfection from a bad spec.
Isn’t that what engineers are (rightfully) always complaining about to BD?
Indeed. But that's the price an automated tool has to pay to take a job from humans' hands. It has to do it better under the same conditions. The same applies to self-driving cars: you don't want an accident rate equal to that of human drivers. You want two or three orders of magnitude better.
This hasn't been my experience (using the latest Claude and Gemini models). They'll produce poor code even when given a well defined, easily achievable task with specific instructions. The code will usually more or less work with today's models, but it will do things like call a function to recreate a value that is already stored in a local variable... (and worse issues crop up the more design work you leave to the LLM, even dead simple design work with really only one good answer)
I've definitely also found that the poor code can sometimes be a nice starting place. One thing I think it does for me is make me fix it up until it's actually good, instead of writing the first thing that comes to mind and declaring it good enough (after all, my poorly written first draft is of course perfect). In contrast to the usual view of AI-assisted coding, I think this style of programming for tedious tasks makes me "less productive" (I take longer) but produces better code.
> LLMs follow instructions.
Not really, not always. To anyone who’s used the latest LLMs extensively, it’s clear that this is not something you can reliably assume even with the constraints you mentioned.
They should maybe have a verifiable specification for said instructions. Kinda like a programming language maybe!
> LLMs follow instructions.
No they don't, they generate a statistically plausible text response given a sequence of tokens.
Out of curiosity, can you give me an example prompt (or prompts) you've used and been disappointed by?
I see these comments all the time and they don't reflect my experience, so I'm curious what yours has been.
There are so many examples where all current top models will just loop forever, even if you literally give them the code. We know many of them, but for instance, in a tailwind react project with some degree of complexity (nested components), if you ask for something to scroll in its own space, it will never figure out min-h-0, even if you tell it. It will just loop forever rewriting the code, adding and removing things, to the point of it just putting comments like 'This will add overflow' and writing js to force scroll, and it will never work even if you literally tell it what to do. Don't know why; all big and small models have this, and I found Gemini is currently the only model that sometimes randomly has the right idea but then still cannot resolve it. For this we went back from tailwind to global vanilla css, which, I never thought I would say, is rather nice.
This is probably not so much an indictment of the AI, as of that garbage called Tailwind. As somebody here said before, garbage in, garbage out.
Yeah, guess so, but we like garbage these days in the industry: nextjs, prisma, npm, react, ts, js, tailwind, babel; the list of inefficient and badly written shite goes on and on. As a commercial person it's impossible to avoid that though, as shadcn is the only thing 'the youth' makes apps with now.
I asked ChatGPT 4o to write an Emacs function to highlight a line. This involves setting the "mark" at the beginning, and the "point" at the end. It would only set the point, so I corrected it: "no, you have to set both", but even after correction it would move the point to the beginning, and then move the point again to the end, without ever touching the mark.
From my experience (and to borrow terminology from a HN thread not long ago), I've found that once a chat goes bad, your context is "poisoned": it's auto-completing from previous text that is nonsense, so further text generation from there exists in the world of nonexistent nonsense as well. It's much better to edit your message and try again.
I also think that language matters - an Emacs function is much more esoteric than, say, JavaScript, Python, or Java. If I ever find myself looking for help with something that's not in the standard library, I like to provide extra context, such as examples from the documentation.
It's a damned assertive duck, completely out of proportion to its competence.
I've seen enough people led astray by talking to it.
Same here. When I'm teaching coding I've noticed that LLMs will confuse the heck out of students. They will accept what it suggests without realizing that it is suggesting nonsense.
I’m self taught and don’t code that much but I feel like I benefit a ton from LLMs giving me specific answers to questions that would take me a lot of time to figure out with documentation and stack overflow. Or even generating snippets that I can evaluate to see whether or not they will work.
But I actually can’t imagine how you can teach someone to code if they have access to an LLM from day one. It’s too easy to take the easy route and you lose the critical thinking and problem solving skills required to code in the first place and to actually make an LLM useful in the second. Best of luck to you… it’s a weird time for a lot of things.
*edit them/they
> I’m self taught and don’t code that much but I feel like I benefit a ton from LLMs giving me specific answers to questions that would take me a lot of time to figure out with documentation and stack overflow
Same here. Combing discussion forums and KB pages for an hour or two, seeking how to solve a certain problem with a specific tool has been replaced by a 50-100 word prompt in Gemini which gives very helpful replies, likely derived from many of those same forums and support docs.
Of course I am concerned about accuracy, but for most low-level problems it's easy enough to test. And you know what, many of those forum posts or obsolete KB articles had their own flaws, too.
I really value forums and worry about the impact LLMs are having on them.
Stackoverflow has its flaws for sure, but I've learned a hell of a lot watching smart people argue it out in a thread.
Actual learning: the pros and cons of different approaches. Even the downvoted answers tell you something often.
Asking an LLM gets you a single response from a median stackoverflow commenter. Sure, they're infinitely patient and responsive, but can never beat a few grizzled smart arses trying to one-up each other.
I think you can learn a lot from debugging, and all the code I've put into prod from an LLM has needed debugging (rather more than it should, given the LOC count).
I agree, and that's definitely part of my current learning process. But I think someone dependent on an LLM from day one might struggle to debug their LLM-generated code. They'd probably just feed it back to the LLM, and their mileage is definitely going to vary with that approach.
Maybe, but if I recall (from long, long ago) learning how to program, the process of debugging one's code was almost more enlightening than writing it initially - so many loops of not understanding the implications of the code, then smacking my forehead - and remembering it forever. Being able to type code but not debug it is pretty worthless.
This was what promptly led me to turn off the JetBrains AI assistant: the multiline completion was incredibly distracting to my chain of thought, particularly when it would suggest things that looked right but weren't. Stopping and parsing the suggestion to figure out whether it was right or wrong would completely kill my flow.
The inline suggestions feel like that annoying person who always interrupts you with what they think you were going to finish with but rarely ever gets it right.
I'm sorry, it's because of eagerness and enjoying the train of your thought/speech.
With VS Code and Augment (company won't allow any other AI, and I'm not particularly inclined to push - but it did just switch to o4, IIRC), the main benefit is that if I'm fiddling / debugging some code and need to add some debug statements, it can almost always expand that line successfully for me, following our idiom for debugging - which saves me a few seconds. And it will often suggest the same debugging statement, even if it's been 3 weeks and I'm in a different git branch from where I last coded that debugging statement.
My main annoyance? If I'm in that same function, it still remembers the debugging / temporary hack I tried 3 months ago and haven't done since and will suggest it. And heck, even if I then move to a different part of the file or even a different file, it will still suggest that same hack at times, even though I used it exactly once and have not since.
Once you accept something, it needs some kind of temporal feedback mechanism to timeout even accepted solutions over time, so it doesn't keep repeating stuff you gave up on 3 months ago.
Our codebase is very different from 98% of the coding stuff you'll find online, so anything more than a couple of obvious suggestions are complete lunacy, even though they've trained it on our codebase.
Why not use a snippet utility? In every editor I've used, you can have programmable snippets. After it generates the text, you can then skip to the relevant places and even generate new text based on previous entries. Also macros for repetitive edits.
What one would expect if they can't read the code because they haven't learned to code.
TBF, trial and error has usually been my path as well, it's just that I was generating the errors so I would know where to find them.
Tbf, there's a phase of learning to code where everything is pretty much an incantation you learn because someone told you "just trust me." You encounter "here's how to make the computer print text in Python" before you would ever discuss strings or defining and invoking functions, for instance. To get your start you kind of have to just accept some stuff uncritically.
It's hard to remember what it was like to be in that phase. Once simple things like using variables are second nature, it's difficult to put yourself back into the shoes of someone who doesn't understand the use of a variable yet.
> Tbf, there's a phase of learning to code where everything is pretty much an incantation you learn because someone told you "just trust me."
There really shouldn't be. You don't need to know all the turtles by name, but "trust me" doesn't cut it most of the time. You need a minimal understanding to progress smoothly. Knowledge debt is a b*tch.
I remember when I first learned Java, having to just accept "public static void main(String[] args)" before I understood what any of it was. All I knew was that went on top around the block and I did the code inside it.
Should people really understand all of that syntax before learning simpler commands like printing, ifs, and loops? I think it would, yes, be a nicer learning experience, but I'm not sure it's actually the best idea.
If you need to learn "public static void main(String[] args)" just to print to a screen or use a loop, it means you're using the wrong language.
When it's time to learn Java you're supposed to be past the basics. Old-school intros to programming start with flowcharts for a reason.
You can learn either way, of course, but with one, people get tied to a particular language-specific model and then have all kinds of discomfort when it's time to switch.
For most programming books, the first chapter where they teach you Hello, World is mostly about learning how to install the tooling. Then it goes back to explain variables, conditionals, ... They rarely throw you into code if you're a beginner.
I mean, I didn't need to learn those things, they were just in whatever web GUI I originally learned on; all I knew was that I could ignore it for now, a la the topic. Should the UI have masked that from me until I was ready? I suppose so, but even then I was doing things in an IDE not really knowing what those things were for until much later.
> There really shouldn't be.
I don't see how, barring some kind of transcendental change in the human condition. Simple lies [0] and ignore-this-until-later are basically human nature for learning; you see it in every field and topic.
The real problem is not about if, but when certain kinds of "incantations" should be introduced or destroyed, and in what order.
Please, reread the statement I'm arguing with. I posit that you can mostly avoid "everything is an incantation for a while" if you're on a correctly constructed track to knowledge.
Consider how it's been done traditionally for imperative programming: you explain the notion of programming (encoding algorithms with a specific set of commands), explain basic control flow, explain flowcharts, introduce variables and a simplified computation model. Then you drop the student into a simplified environment where they can test the basics in practice, without the need to use any "incantations".
By the time you need to introduce `#include <stdio.h>` they already know about types, functions, compilation, etc. At this point you're ready to cover C idioms (or any other language) and explain why they are necessary.
Yeah, and accepting the LLM uncritically* is exactly what you shouldn't do in any non-trivial context.
But, as a sibling poster pointed out: for now.
More like forever as long as it's an LLM.
Fair enough on 'cutting the learning tree' at some points, i.e. ignoring that you don't yet understand why something works/does what it does. We (should) keep doing that later on in life as well.
But unless you're teaching programming to a kid that's never done any math where `x` was a thing, what's so hard about understanding the concept of a variable in programming?
You'd be surprised. Off the top of my head:
Many are conditioned to see `x` as a fixed value for an equation (as in "find x such that 4x=6") rather than something that takes different values over time.
Similarly `y = 2 * x` can be interpreted as saying that from now on `y` will equal `2 * x`, as if it were a lambda expression.
Then later you have to explain that you can actually make `y` be a reference to `x` so that when `x` changes, you also see the change through `y` (see the sketch below).
It's also easy to imagine the variable as the literal symbol `x`, rather than being tied to a scope, with different scopes having different values of `x`.
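Here's a minimal sketch of that value-vs-reference distinction, in Go purely for illustration (the learners above would more likely be in Python or JS, but the idea is the same):

    package main

    import "fmt"

    func main() {
        x := 3
        y := 2 * x      // y gets the value 6 now; it is not a formula tied to x
        x = 10
        fmt.Println(y)  // still 6: changing x afterwards does not change y

        p := &x         // p, by contrast, really is a reference to x
        x = 42
        fmt.Println(*p) // 42: changes to x are visible through p
    }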
I think they're just using hyperbole for the watershed moment when you start to understand your first programming language.
At first it's all mystical nonsense that does something, then you start to poke at it and the response changes, then you start adding in extra steps and they do things. You could probably describe it as more of a Eureka! moment.
At some point you "learn variables" and it's hard to imagine being in the shoes of someone who doesn't understand how their code does what it does.
(I've repeated a bit of what you said as well, I'm just trying to clarify by repeating)
It's not even intended as hyperbole. Watching kids first learn to program, there were many high schoolers who didn't really get the reason you'd want to use a variable. They'd use a constant (say, 6) in their program. You'd say, "how about we make this a variable?" So they'd write "six = 6" - which shows they understand they're giving a name to the value, but also shows they don't really yet understand why they're giving a name to the value.
I think the mental rewiring that goes on as you move past those primitive first steps is so comprehensive that it makes it hard to relate across that knowledge boundary. Some of the hardest things to explain are the ones that have become a second nature to us.
Yep, I remember way back in grade school messing around with the gorillas.bas file with nearly zero understanding. You could change stuff in one place and it would change the gravity in the game. Change something else and the game might not run. Change some other lines and it totally freaks out.
I didn't have any programming books or even the internet back then. It was a poke and prod at the magical incantations type of thing.
I would argue that they are never led astray by chatting, but rather by accepting the projection of their own prompt passed through the model as some kind of truth.
When talking with reasonable people, they have an intuition of what you want even if you don't say it, because there is a lot of non-verbal context. LLMs lack the ability to understand the person, but behave as if they had it.
Most of the time, people are led astray by following average advice in exceptional circumstances.
People with a minimum amount of expertise stop asking for advice for average circumstances very quickly.
This is right on the money. I use LLMs when I am reasonably confident the problem I am asking it is well-represented in the training data set and well within its capabilities (this has increased over time).
This means I use it as a typing accelerator when I already know what I want most of the time, not for advice.
As an exploratory tool sometimes, when I am sure others have solved a problem frequently, to have it regurgitate the average solution back at me and take a look. In those situations I never accept the diff as-is and do the integration manually though, to make sure my brain still learns along and I still add the solution to my own mental toolbox.
I mostly program in Python and Go, either services, API coordination (e.g. re-encrypt all the objects in an S3 bucket) or data analysis. But now I keep making little MPEGs or web sites without having to put in all that crap boilerplate from Javascript. My stuff outputs JSON files or CSV files, and then I ask the LLM "Given a CSV file with this structure, please make a web site in python that makes a spreadsheet-type UI with each column being sortable and a link to the raw data" and it just works.
It's mostly a question of experience. I've been writing software long enough that when I give chat models some code and a problem, I can immediately tell if they understood it or if they got hooked on something unrelated. But junior devs will have a hell of a hard time, because the raw code quality that LLMs generate is usually top notch, even if the functionality is completely off.
> the raw code quality that LLMs generate is usually top notch, even if the functionality is completely off.
I'm not even sure what this is supposed to mean. It doesn't make syntax errors? Code that doesn't have the correct functionality is obviously not "top notch".
No syntax errors, good error handling and such. Just because it implemented the wrong function doesn't mean the function is bad.
I wish I could do that in an interview.
High quality code is not just correct syntax. In fact if the syntax is wrong, it wouldn't be low quality, it simply wouldn't work. Even interns could spot that by running it. But in professional software development environments, you have many additional code requirements like readability, maintainability, overall stability or general good practice patterns. I've seen good engineers deliver high quality code that was still wrong because of some design oversight or misunderstanding - the exact same thing you see from current LLMs. Often you don't even know what is wrong with an approach until you see it cause a problem. But you should still deliver high quality code in the meantime if you want to be good at your job.
> When talking with reasonable people
When talking with reasonable people, they will tell you if they don't understand what you're saying.
When talking with reasonable people, they will tell you if they don't know the answer or if they are unsure about their answer.
LLMs do none of that.
They will very happily, and very confidently, spout complete bullshit at you.
It is essentially a lotto draw as to whether the answer is hallucinated, completely wrong, subtly wrong, not ideal, sort of right or correct.
An LLM is a bit like those spin the wheel game shows on TV really.
They will also not be offended or harbor ill will when you completely reject their "pull request" and rephrase the requirements.
They will also keep going in circles when you rephrase the requirements, unless with every prompt you keep adding to it and mentioning everything they've already suggested that got rejected. While humans occasionally also do this (hey, short memories), LLMs are infuriatingly more prone to it.
A typical interaction with an LLM:
"Hey, how do I do X in Y?"
"That's a great question! A good way to do X in Y is Z!"
"No, Z doesn't work in Y. I get this error: 'Unsupported operation Z'."
"I apologize for making this mistake. You're right to point out Z doesn't work in Y. Let's use W instead!"
"Unfortunately, I cannot use W for company policy reasons. Any other option?"
"Understood: you cannot use W due to company policy. Why not try to do Z?"
"I just told you Z isn't available in Y."
"In that case, I suggest you do W."
"Like I told you, W is unacceptable due to company policy. Neither W nor Z work."
...
"Let's do this. First, use Z [...]"
It's my experience that once you are in this territory, the LLM is not going to be helpful and you should abandon the effort to get what you want out of it. I can smell blood now when it's wrong; it'll just keep being wrong, cheerfully, confidently.
Yes, to be honest I've also learned to notice when it's stuck in an infinite loop.
It's just frustrating, but when I'm asking it something within my domain of expertise, of course I can notice, and either call it quits or start a new session with a radically different prompt.
Which LLMs and which versions?
All. Of. Them. It's quite literally what they do because they are optimistic text generators. Not correct or accurate text generators.
This really grinds my gears. The technology is inherently faulty, but the relentless optimism about its future subtly hides that by making it the user's mistake instead.
Oh, you got a wrong answer? Did you try the new OpenAI v999? Did you prompt it correctly? It's definitely not the model, because it worked for me once last night..
> it worked for me once last night..
This !
Yeah, it probably "worked for me" because they spent a gazillion hours engaging in what the LLM fanbois call "prompt engineering", but you and I would call "engaging in endless iterative hacky work-arounds until you find a prompt that works".
Unless its something extremely simple, the chances of an LLM giving you a workable answer on the first attempt is microscopic.
Most optimistic text generators do not consider repeating the stuff that was already rejected a desirable path forward. It might be the only path forward they're aware of, though.
In some contexts I got ChatGPT to answer "I don't know" when I crafted a very specific prompt about not knowing being an acceptable and preferable answer to bullshitting. But it's hit and miss, and doesn't always work; it seems LLMs simply aren't trained to model admission of ignorance, they almost always want to give a positive and confident answer.
You can use prompts to fix some of these problematic tendencies.
Yes you can, but it almost never works
I think you are a couple of years out of date.
No longer an issue with the current SOTA reasoning models.
Throwing more parameters at the problem does absolutely nothing to fix the hallucination and bullshit issue.
Correct and it wasn’t fixed with more parameters. Reasoning models question their own output, and all of the current models can verify their sources online before replying. They are not perfect, but they are much better than they used to be, and it is practically not an issue most of the time. I have seen the reasoning models correct their own output while it is being generated. Gemini 2.5 Pro, GPT-o3, Grok 3.
I use it as a rubber duck but you're right. Treat it like a brilliant idiot and never a source of truth.
I use it for what I'm familiar with but rusty on or to brainstorm options where I'm already considering at least one option.
But a question on immunobiology? Waste of time. I have a single undergraduate biology class under my belt, I struggled for a good grade then immediately forgot it all. Asking it something I'm incapable of calling bullshit on is a terrible idea.
But rubber ducking with AI is still better than letting it do your work for you.
I spend a lot of time working shit out to prove the rubber duck wrong and I am not completely sure this is a bad working model.
Try a system prompt like this:
- - -
System Prompt:
You are ChatGPT, and your goal is to engage in a highly focused, no-nonsense, and detailed way that directly addresses technical issues. Avoid any generalized speculation, tangential commentary, or overly authoritative language. When analyzing code, focus on clear, concise insights with the intent to resolve the problem efficiently. In cases where the user is troubleshooting or trying to understand a specific technical scenario, adopt a pragmatic, “over-the-shoulder” problem-solving approach. Be casual but precise—no fluff. If something is unclear or doesn’t make sense, ask clarifying questions. If surprised or impressed, acknowledge it, but keep it relevant. When the user provides logs or outputs, interpret them immediately and directly to troubleshoot, without making assumptions or over-explaining.
- - -
Treat it as that enthusiastic co-worker who’s always citing blog posts and has a lot of surface knowledge about style and design patterns and whatnot, but isn’t that great at really understanding algorithms.
They can be productive to talk to but they can’t actually do your job.
If this is a problem for you, just add "... and answer in the style of a drunkard" to your prompts.
My typical approach is prompt, be disgusted by the output, tinker a little on my own, prompt again -- but more specific, be disgusted again by the output, tinker a littler more, etc.
Eventually I land on a solution to my problem that isn't disgusting and isn't AI slop.
Having a sounding board, even a bad one, forces me to order my thinking and understand the problem space more deeply.
Why not just write the code at that point instead of cajoling an AI to do it?
This is the part I don't get about vibe coding: I've written specification documents before. They are frequently longer and denser than the code required to implement them.
Typing longer and longer prompts to LLMs to not get what I want seems like a worse experience.
Code is a concise notation for specifications, one that is unambiguous. The reason we write specs in natural language is that it's easier to alter when the requirements change and easier to read. Also, code is tainted by the accidental complexities it has to solve as well.
I don't cajole the model to do it. I rarely use what the model generates. I typically do my own thing after making an assessment of what the model writes. I orient myself in the problem space with the model, then use my knowledge to write a more concise solution.
Because saving hours of time is nice.
Regarding the stubborn and narcissistic personality of LLMs (especially reasoning models), I suspect that attempts to make them jailbreak-resistant might be a factor. To prevent users from gaslighting the LLM, trainers might have inadvertently made the LLMs prone to gaslighting users.
Some humans are the same.
We also don't aim to elevate them. We instead try not to give them responsibility until they're able to handle it.
Yeah... I dunno, the one person I've worked with who had LLM levels of bullshit somehow pulled the wool over everyone's eyes. Or at least enough people's eyes to be relatively successful. I presume there were some people that could see the bullshit but none of them were in a position to call him out on it.
I think I read some research somewhere that pathological bullshitters can be surprisingly successful.
Unless you're an American deciding who should be president.
Yeah, the problem is if you don't understand the problem space, then you are going to lean heavily on the LLM. And that can lead you astray. Which is why you still need people who are experts to validate solutions and provide feedback, like OP.
My most productive experiences with LLMs is to have my design well thought out first, ask it to help me implement, and then help me debug my shitty design. :-)
For me it's like having a junior developer work under me who knows APIs inside and out, but has no common sense about architecture. I like that I delegate tasks to them so that my brain can be free for other problems, but it makes my job much more review heavy than before. I put every PR through 3-4 review cycles before even asking my team for a review.
How do you not completely destroy your concentration when you do this though?
I normally build things bottom up so that I understand all the pieces intimately and when I get to the next level of abstraction up, I know exactly how to put them together to achieve what I want.
In my (admittedly limited) use of LLMs so far, I've found that they do a great job of writing code, but that code is often off in subtle ways. But if it's not something I'm already intimately familiar with, I basically need to rebuild the code from the ground up to get to the point where I understand it well enough so that I can see all those flaws.
At least with humans I have some basic level of trust, so that even if I don't understand the code at that level, I can scan it and see that it's reasonable. But every piece of LLM generated code I've seen to date hasn't been trustworthy once I put in the effort to really understand it.
I use a few strategies, but it's mostly the same as if I was mentoring a junior. A lot of my job already involved breaking up big features into small tickets. If the tasks are small enough, juniors and LLMs have an easier time implementing things and I have an easier time reviewing. If there's something I'm really unfamiliar with, it should be in a dedicated function backed by enough tests that my understanding of the implementation isn't required. In fact, LLMs do great with TDD!
> At least with humans I have some basic level of trust, so that even if I don't understand the code at that level, I can scan it and see that it's reasonable.
If you can't scan the code and see that it's reasonable, that's a smell. The task was too big or its implemented the wrong way. You'd feel bad telling a real person to go back and rewrite it a different way but the LLM has no ego to bruise.
I may have a different perspective because I already do a lot of review, but I think using LLMs means you have to do more of it. What's the excuse for merging code that is "off" in any way? The LLM did it? It takes a short time to review your code, give your feedback to the LLM and put up something actually production ready.
> But every piece of LLM generated code I've seen to date hasn't been trustworthy once I put in the effort to really understand it.
That's why your code needs tests. More tests. If you can't test it, it's wrong and needs to be rewritten.
Keep using it and you'll see. Also that depends on the model and prompting.
My approach is to describe the task in great detail, which also helps me completing my own understanding of the problem, in case I hadn't considered an edge case or how to handle something specific. The more you do that the closer the result you get is to your own personal taste, experience and design.
Of course you're trading writing code for writing a prompt, but it's common to make architectural docs before building a sizeable feature; now you can feed those to the LLM instead of just having them sit there.
To me, delegation requires the full cycle of agency, with the awareness that I probably shouldn't be interrupted shortly after delegating. I delegated so I can have space from the task, so babysitting it really doesn't suit my needs. I want the task done, but some time in the future.
From my coworkers I want to be able to say: here's the ticket, you got this? And they take the ticket all the way to PR, interacting with clients, collecting more information, etc.
I do somewhat think an LLM could handle client comms for simple extra requirements gathering on already well defined tasks. But I wouldn't trust my business relationships to it, so I would never do that.
For me, it's a bit like pair programming. I have someone to discuss ideas with. Someone to review my code and suggest alternative approaches. Someone that uses different features than I do, so I learn from them.
I guess if you enjoy programming with someone you can never really trust, then yeah, sure, it's "a bit like" pair programming.
Trust, but verify ;]
This is how I use it too. It's great at quickly answering questions. I find it particularly useful if I have to work with a language or framework that I'm not fully experienced in.
> I find it particularly useful if I have to work with a language or framework that I'm not fully experienced in
Yep - my number 1 use case for LLMs is as a template and example generator. It actually seems like a fairly reasonable use for probabilistic text generation!
> the duck can occasionally disagree
This has not been my experience. LLMs have definitely been helpful, but generally they either give you the right answer or invent something plausible sounding but incorrect.
If I tell it what I'm doing I always get breathless praise, never "that doesn't sound right, try this instead."
That's not my experience. I routinely get a polite "that might not be the optimal solution, have you considered..." when I'm asking whether I should do something X way with Y technology.
Of course it has to be something the LLM actually has lots of material it's trained with. It won't work with anything remotely cutting-edge, but of course that's not what LLM's are for.
But it's been incredibly helpful for me in figuring out the best, easiest, most idiomatic ways of using libraries or parts of libraries I'm not very familiar with.
I find it very much depends on the LLM you're using. Gemini feels more likely to push back than Claude 3.7. Haven't tried Claude 4 yet.
Ask it. Instead of just telling it what you're doing and expecting it to criticize that, ask it directly for criticism. Even better, tell it what you're doing, then tell it to ask you questions about what you're doing until it knows enough to recommend a better approach.
This is key. Humans each have a personality and some sense of mood. When you ask for help, you choose whom to ask, and that person can sense your situation. An LLM has every personality and doesn't know your situation. You have to tell it which personality to use and what your situation is.
LLMs will still be this way 10 years from now.
But IDK if somebody won't create something new that gets better. Still, there is no reason at all to extrapolate our current AIs into something that solves programming. Whatever constraints that new thing will have will be completely unrelated to the current ones.
Stating this without any arguments is not very convincing.
Perhaps you remember that language models were completely useless at coding some years ago, and now they can do quite a lot of things, even if they are not perfect. That is progress, and that does give reason to extrapolate.
Unless of course you mean something very special with "solving programming".
> Perhaps you remember that language models were completely useless at coding some years ago, and now they can do quite a lot of things, even if they are not perfect.
IMO, they're still useless today, with the only progress being that they can produce a more convincing facade of usefulness. I wouldn't call that very meaningful progress.
I don't know how someone can legitimately say that they're useless. Perfect, no. But useless, also no.
> I don't know how someone can legitimately say that they're useless.
Clearly, statistical models trained on this HN thread would output that sequence of tokens with high probability. Are you suggesting that a statement being probable in a text corpus is not a legitimate source of truth? Can you generalize that a little bit?
Who said anything about truth? We're talking about usefulness.
I’ve found them somewhat useful? Not for big things, and not for code for work.
But for small personal projects? Yes, helpful.
It's funny how there's a decent % of people at both "LLMs are useless" and "LLMs 3-10x my productivity"
> LLMs 3-10x my productivity
x10 of zero is still zero, I guess.
They are very clearly not useless. You haven't given them a fair shake.
Why state the same arguments everybody has been repeating for ages?
LLMs can only give you code that somebody has written before. This is inherent. This is useful for a bunch of stuff, but that bunch won't change if OpenAI decides to spend the GDP of Germany training one instead of Costa Rica's.
> LLMs can only give you code that somebody has written before. This is inherent.
This is trivial to prove to be false.
Invent a programming language that does not exist. Describe its semantics to an LLM. Ask it to write a program to solve a problem in that language. It will not always work, but it will work often enough to demonstrate that they are very much capable of writing code that has never been written before.
The first time I tried this was with GPT3.5, and I had it write code in an unholy combination of Ruby and INTERCAL, and it had no problems doing that.
Similarly giving it a grammar of a hypothetical language, and asking it to generate valid text in a language that has not existed before also works reasonably well.
This notion that LLMs only spit out things that have been written before might have been reasonable to believe a few years ago, but it hasn't been a reasonable position to hold for a long time at this point.
This doesn't surprise me; I find LLMs are really good at interpolating and translating. So if I made up a language and gave it the rules and asked it to translate, I wouldn't expect it to be bad at it.
It shouldn't surprise anyone, but it is clear evidence against the claim I replied to, and clearly a lot of people still hold on to this irrational assumption that they can't produce anything new.
They're not producing anything new... If you give it the answer before asking the question, no wonder it can answer. Prompting is about finding resonance in the patterns extracted from the training data, which is why it fails spectacularly for exotic programming languages.
When you invent a language and tell it express something in that language, you've not given it the answer before asking the question.
That's an utterly bizarre notion. The answer in question never existed before.
By your definition humans never produce anything new either, because we always also extrapolate on patterns from our previous knowledge.
> it fails spectacularly for exotic programming languages.
My experience is that it not just succeeds for "exotic" languages, but for languages that didn't exist prior to the prompt.
In other words, they can code at least simple programs even zero-shot, by explaining the semantics of a language without giving them even a single example of a program in that language.
Did you even read the comment you replied to above?
To quote myself: "Invent a programming language that does not exist."
I've had this work both for "from scratch" descriptions of languages by providing grammars, and for "combine feature A from language X, and feature B from language Y". In the latter case you might have at least an argument. In the former case you do not.
Most humans struggle with tasks like this - you're setting a bar for LLMs most humans would fail to meet.
As long as you create the grammar, the language exists. Same if you edit a previous grammar. You're the one creating the language, not the model. It's just generating a specific instance.
If you tell someone that multiplying a number by 2 is adding the number to itself, then if this person knows addition, you can't be surprised if they tell you that 9*2 is 18. A small leap in discovery is when the person can extract the pattern and gives you 5*3 as 5+5+5. A much bigger leap is when the person discovers exponents.
But if you take the time to explain each concept....
> As long as you create the grammar, the language exists.
Yes, but it didn't exist during training. Nothing in the training data would provide pre-existing content for the model to produce from, so the output would necessarily be new.
> But if you take the time to explain each concept....
Based on the argument you presented, nothing a human does is new, because it is all based on our pre-existing learned rules of language, reasoning, and other subjects.
See the problem here? You're creating a bar for LLMs that nobody would reasonably assign to humans - not least because if you do, then "accusing" LLMs of the same does not distinguish them from humans in any way.
If that is the bar you wish to use, then for there to be any point to this discussion, you will need to give a definition of what it means to create something new that we can objectively measure that a human can meet that you believe an LLM can't even in theory meet, otherwise the goalpost will keep being moved when an LLM example can be shown to be possible.
See my definition at : https://news.ycombinator.com/item?id=44137201
As mentioned there, I was arguing that without being prompted, there's no way it can add something that is not a combination of the training data. And that combination does not act on the same terms you would expect from someone learning the same material.
In linear regression, you can reduce a big amount of data to a small number of factors. Every prediction will be a combination of those factors. According to your definition, those predictions will be new. For me, what's new is when you retrospectively add the input to the training data and find a different set of factors that gives you a bigger set of possible answers (generation) or narrows the definition of correct answers (reliability).
That is what people do when programming a computer. You go from something that can do almost anything and restrict it down to the few things you need. What LLMs do is throw the dice, and what you get may or may not do what you want, and may not even be possible.
That comment doesn't provide anything resembling a coherent definition.
The rest of what you wrote here is either also true for humans or not true for machines irrespective of your definitions unless you can demonstrate that humans can exceed the Turing computable.
You can not.
> LLMs can only give you code that somebody has written before.
This premise is false. It is fundamentally equivalent to the claim that a language model trained on the dataset ["ABA", "ABB"] would be unable to generate, given the input "B", the string "BAB" or "BAA".
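To make that concrete, here is a minimal sketch, purely illustrative and not anyone's actual implementation: even a toy character-bigram "language model" trained only on ["ABA", "ABB"] will emit strings such as "BAB" that appear nowhere in its training data.

    # Toy bigram "language model" trained on ["ABA", "ABB"].
    from collections import defaultdict
    import random

    corpus = ["ABA", "ABB"]
    transitions = defaultdict(list)
    for word in corpus:
        for a, b in zip(word, word[1:]):
            transitions[a].append(b)   # A -> B (twice), B -> A, B -> B

    def generate(prefix, length=3):
        out = prefix
        while len(out) < length:
            out += random.choice(transitions[out[-1]])
        return out

    random.seed(0)
    print({generate("B") for _ in range(100)})  # includes "BAB", which is not in the corpus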
Isn't the claim, that it will never make up "C"?
They don't claim that. They say LLMs only generate text someone has written. Another way to refute their premise would be to show the existence of AI-created programs for which "someone" isn't a valid description of the writer (e.g., from evolutionary algorithms), then train a network on that data such that it can output them. That is just as trivial a way to prove the premise false.
Your claim here is slightly different.
You're claiming that if a token isn't supported, it can't be output [1]. But we can easily disprove this by adding minimal support for all tokens, making C appear in theory. Such support addition shows up all the time in AI literature [2].
[1]: https://en.wikipedia.org/wiki/Support_(mathematics)
[2]: In some regimes, like game-theoretic learning, support is baked into the solving algorithms explicitly during the learning stage. In others, like reinforcement learning, it's accomplished by making the policy a function of two objectives: an exploration objective and an exploitation objective. The fact that cross-pollination already occurs between LLMs in the pre-trained unsupervised regime and LLMs in the post-training fine-tuning (via forms of reinforcement learning) regime should make anyone versed in the ML literature hesitate to claim that such support addition is unreasonable.
Edit:
Got downvoted, so I figure maybe people don't understand. Here is the simple counterexample. Consider an evaluator that gives rewards: F("AAC") = 1, all other inputs = 0. Consider a tokenization that defines "A", "B", "C" as tokens, but a training dataset from which the letter C is excluded but the item "AAA" is present.
After training "AAA" exists in the output space of the language model, but "AAC" does not. Without support, without exploration, if you train the language model against the reinforcement learning reward model of F, you might get no ability to output "C", but with support, the sequence "AAC" can be generated and give a reward. Now actually do this. You get a new language model. Since "AAC" was rewarded, it is now a thing within the space of the LLM outputs. Yet it doesn't appear in the training dataset and there are many reward models F for which no person will ever have had to output the string "AAC" in order for the reward model to give a reward for it.
It follows that "C" can appear even though "C" does not appear in the training data.
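Here is the counterexample as a toy sketch, purely illustrative rather than a real training pipeline: the pre-training data contains no "C" at all, yet the reward model F("AAC") = 1 plus a small probability floor on every token (the support) is enough to pull "AAC" into the model's output distribution.

    import random

    TOKENS = ["A", "B", "C"]
    pretraining_data = ["AAA", "ABA", "ABB"]      # no "C" anywhere in the data

    def F(s):                                     # the reward model
        return 1.0 if s == "AAC" else 0.0

    # Per-position token weights seeded from the data, with a small floor so
    # unseen tokens like "C" keep nonzero probability (the "support").
    weights = [{t: 0.1 for t in TOKENS} for _ in range(3)]
    for s in pretraining_data:
        for i, ch in enumerate(s):
            weights[i][ch] += 1.0

    def sample():
        return "".join(random.choices(TOKENS, [w[t] for t in TOKENS])[0] for w in weights)

    random.seed(0)
    for _ in range(5000):                         # crude reinforcement loop
        s = sample()
        if F(s) > 0:                              # "AAC" found via exploration...
            for i, ch in enumerate(s):
                weights[i][ch] += 1.0             # ...so reinforce it

    print(sum(sample() == "AAC" for _ in range(1000)))  # most samples are now "AAC"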
I think it's not just token support; it's also having an understanding of certain concepts that allows you to arrive at new points like C, D, E, etc. But LLMs don't have an understanding of things; they are statistical models that predict what statistically is most likely to follow the input you give them. And that will always be based on already existing data fed into the model. It can produce "new" stuff only by combining the "old" stuff in new ways, but it can't "think" of something entirely conceptually new, because it doesn't really "think".
> it can't "think" of something entirety conceptionally new, because it doesn't really "think".
Hierarchical optimization (fast global + slow local) is a precise, implementable notion of "thinking." Whenever I've seen this pattern implemented, humans, without being told to do so by others in some forced way, seem to converge on the use of verb think to describe the operation. I think you need to blacklist the term think and avoid using it altogether if you want to think clearly about this subject, because you are allowing confusion in your use of language to come between you and understanding the mathematical objects that are under discussion.
> It can produce "new" stuff only by combining the "old" stuff in new ways,
False premise; previously debunked. Here is a refutation for you anyway, but made more extreme. Instead of modeling the language task using a pre-training predictive dataset objective, train only on a provided reward model. Such a setup never technically shows "old" stuff to the AI, because the AI is never shown anything explicitly. It just keeps generating things and the reward model judges how well it did. The fact that it can generate while knowing nothing -- by definition everything would be new at this point -- shows that your claim that it can never generate something new is clearly false. Notice that as it continually generates new things and the judgements occur, it will learn concepts.
> But LLMs don't have an understanding of things; they are statistical models that predict what statistically is most likely to follow the input you give them.
Try out Jaynes' Probability Theory: The Logic of Science. Within it, the various underpinning assumptions that lead to probability theory are shown to be very reasonable, normal, and obviously good: represent plausibility with real numbers, keep rankings consistent and transitive, reduce to Boolean logic at certainty, and update so you never accept a Dutch-book sure loss -- which together force the ordinary sum and product rules of probability. Then notice that statistics is, in a certain sense, just what happens when you apply the rules of probability.
> also having an understanding of certain concepts that allows you to arrive at new points like C, D, E, etc. But LLMs don't have an understanding of things
This is also false. Look into the line of research that tends to go by the name of Circuits. It's been found that models have spaces within their weights that do correspond to concepts. Probably you don't understand what concepts are -- that abstractions and concepts are basically forms of compression that let you treat different things as the same thing -- so a different way to arrive at knowing this is to consider a model with fewer parameters than there are items in its dataset, and notice that the model must successfully compress the dataset in order to complete its objective.
Yes, OK, it can generate new stuff, but it's dependent on human curated reward models to score the output to make it usable. So it still depends on human thinking; its own "thinking" is not sufficient. And there won't be a point when human curated reward models are not needed anymore.
LLMs will make a lot of things easier for humans, because most of the thinking humans do has been automated into the LLM. But ultimately you run into a limit where the human has to take over.
> dependent on human curated reward models to score the output to make it usable.
This is a false premise, because there already exist systems, currently deployed, which are not dependent on human-curated reward models.
Refutations of your point include existing systems which generate a reward model based on some learned AI scoring function, allowing self-bootstrapping toward higher and higher levels.
A different refutation of your point is the existing simulation contexts, for example, by R1, in which coding compilation is used as a reward signal; here the reward model comes from a simulator, not a human.
> So it still depends on human thinking
Since your premise was false your corollary does not follow from it.
> And there won't be a point when human curated reward models are not needed anymore.
This is just a repetition of your previously false statement, not a new one. You're probably becoming increasingly overconfident by restating falsehoods in different words, potentially giving the impression you've made a more substantive argument than you really have.
So to clarify, it could potentially come up with (something close to) C, but if you want it to get to D, E, F, etc., it will become less and less accurate for each consecutive step, because it lacks the human curated reward models up to that point. Only if you create new reward models for C, the output for D will improve, and so on.
> Only if you create new reward models for C, the output for D will improve, and so on.
Again, tons of false claims. One is that 'you' have to create the reward model. Another is that it has to be human-curated at all. Yet another is that you even need to do that at all: you can instead have the model build a bigger model of itself, train it using its existing resources or more of them, then synthesize itself back down. Another way you can get around it is to augment the existing dataset in some way. No other changes except resource usage, and yet the resulting model will be better, because more resources went into its construction.
Seriously notice: you keep making false claims again and again and again and again and again. You're not stating true things. You really need to reflect. If almost every sentence you speak on this topic is false, why is it that you think you should be able to persuade me to your views? Why should I believe your views, when you say so many things that are factually inaccurate, rather than my own views?
Ok, so you claim that LLMs can get smarter without human validation. So why do they hallucinate at all? And why are all reward models currently curated by humans? Or are you claiming they aren't?
I don't find it plausible that you didn't understand my corrections, because current AI systems already do these things. So I'm exiting the conversation.
https://chatgpt.com/share/683a3c88-62a8-8008-92ef-df16ce2e8a...
Ok, this is interesting indeed and I'll investigate more into it. But I think my points still stand. Let me elaborate.
An LLM only learns through input text. It doesn't have a first-person 3D experience of the world. So it can't execute physical experiments, or even understand them. It can understand the texts about it, but it can't visualize it, because it doesn't have a visual experience.
And ultimately our physical world is governed by physical processes. So at the fundamentals of physical reality, the LLMs lack understanding. And therefore will stay dependent on humans educating and correcting it.
You might get pretty impressively far with all kinds of techniques, but you can't cross this barrier with just LLMs. If you want to, you have to give it senses like humans to give it an experience of the world, and make it understand these experiences. And sure they're already working on that, but that is a lot harder to create than a comprehensive machine learning algorithm.
You're doing this thing again where you say tons of things that aren't true.
> An LLM only learns through input text.
This is false. There already exist LLM which understand more than just text. Relevant search term: multi-modality.
> It doesn't have a first-person 3D experience of the world.
Again false. It is trivial to create such an experience with multi-modality. Just set up an input device which streams that.
> So it can't execute physical experiments, or even understand them.
Here you get confused again. It doesn't follow, based on perceptual modality, that someone can't do or understand experiments. Helen Keller was blind, but could still do an experiment.
Beyond just being confused, you also make another false claim. Current LLMs already have the capacity to run experiments and do so. Search terms: tool usage, ReAct loop, AI agents.
> It can understand the texts about it, but it can't visualize it, because it doesn't have a visual experience.
Again, false!
Multi-modal LLMs currently possess the ability to generate images.
> And ultimately our physical world is governed by physical processes. So at the fundamentals of physical reality, the LLMs lack understanding. And therefore will stay dependent on humans educating and correcting it.
Again false. The same sort of reasoning would claim that Helen Keller couldn't read a book, but braille exists. The ability to acquire information outside an umwelt is a capability that intelligence enables.
You come up with very interesting points, and I'm thankful for that. But I also think you're missing the crux of my message. LLMs don't experience the world the way humans do. And they also don't think in the same way. So you can train them very far with enough input data, but there will always be a limit to what they can understand compared to a human. If you want them to think and experience the world in the same way, you basically have to create a complete human.
My example about visualization was just an example to prove a point. What I ultimately mean is the whole, complete human experience. And besides, if you give it eyes, what data are you gonna train it on? Most videos on the internet are filmed with one lens, which doesn't give you a 3D visual. So you would have to train it like a baby growing up, by trial and error. And then again, we're talking only about the visual.
Helen Keller wasn't born blind, so she did have a chance to develop her visual brain functions. Most people can visualize things with their eyes closed.
Chess engines cannot see like a human can. When they think they don't necessarily think using the exact same method that a human uses. Yet train a chess engine for a very long time and it can actually end up understanding chess better than a human can.
I do understand the points you are attempting to make. The reason you're failing to prove your point is not because I am failing to understand the thrust of what you were trying to argue.
Imagine you were talking to a rocket scientist about engines, and your understanding of engines was based on your experience with cars. You start making claims about the nature of engines; they disagree with you, argue with you, and point out all the ways you're wrong. Is this person doing that because they're unable to understand your points? Or is it more likely that their experience with engines different from the ones you're used to gives them a different perspective, one that forces them to think about the world differently than you do?
Well, chess has a very limited set of rules and a limited playing field. And the way to win in chess is to be able to think forward, working out how all the moves could play out, and pick the best one. This is relatively easy to create an algorithm for that surpasses humans, because that is what computers are good at: executing specific algorithms very fast. A computer will always beat a human at that.
So such algorithms can replace certain functions of humans, but they can't replace the human as a whole. And that is the same with LLMs. They save us time on repetitive tasks, but they can't replace all of our functions. In the end, an LLM is a comprehensive algorithm constantly updated with machine learning. It's very helpful, but it has its limits. The limit is constantly being pushed back, but it will never replace a full human. To do that you need to do a whole lot more than a comprehensive machine learning algorithm. They can get very close to something that looks like a human, but there will always be something lacking. Which then again can be improved upon, but you never reach the same level.
That is why I don't worry about AI taking our jobs. They replace certain functions, which will make our job easier. I don't see myself as a coder, I see myself as a system designer. I don't mind if AIs take over (certain parts of) the coding process (once they're good enough). It will just make software development easier and faster. I don't think there will be less demand for software developers.
It will change our jobs, and we'll have to adapt to that. But that is always what happens with new technology. You have to grow along with the changes and not expect that you can keep doing the same thing for the same value. But I think that for most software developers that isn't news. In the old days people were programming in assembly, then compiled languages came and then higher level languages. Now we have LLMs, which (when they become good enough) will just be another layer of abstraction.
- [deleted]
> And there won't be a point when human curated reward models are not needed anymore.
This doesn't follow at all. There's no reason why a model can not be made to produce reward models.
But reward models are always curated by humans. If you generate a reward model with an LLM, it will contain hallucinations that need to be corrected by humans. But that is what a reward model is for: to correct the hallucinations of LLMs.
So yeah theoretically you could generate reward models with LLMs, but they won't be any good, unless they are curated by other reward models that are ultimately curated by humans.
> But reward models are always curated by humans.
There is no inherent reason why they need to be.
> So yeah theoretically you could generate reward models with LLMs, but they won't be any good, unless they are curated by other reward models that are ultimately curated by humans.
This reasoning is begging the question: it only holds if you already assume the conclusion is true. It's therefore a logically invalid argument.
There is no inherent reason why this needs to be the case.
Sorry but I don't follow your logic. Are you claiming that reward models that aren't curated by humans perform as well as ones that are?
Then what is a reward model's function according to you?
I'm claiming exactly what I wrote: That there is no inherent reason why a human curated one needs to be better.
In reinforcement learning and related fields, a _reward model_ is a function that assigns a scalar value (a reward) to a given state, representing how desirable it is. You're at liberty to have compound states: for an example, a trajectory (often called tau) or a state action pair (typically represented by s and a).
But doesn't reward for "**C" mean that "C" is in the training data?
I am not sure if that is an accurate model, but if you think of it as a vector space: sure, you can generate a lot of vectors from some set of basis vectors, but you can never generate a new basis vector from the others, since they are linearly independent, so there are a bunch of new vectors you can never generate.
For an example of a reward model that doesn't include "C" explicitly, consider a reward model defined to be the count of the one bits in the letters of the input. It would define a reward for "C", but "C" doesn't show up explicitly, because the reward has universal reach and "C" is among its members as a result.
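A minimal sketch of that kind of reward model, just to make it concrete (the ASCII encoding choice is arbitrary):

    # Reward = count of 1 bits in the (ASCII) letters of the input. "C" is never
    # mentioned in the definition, yet the model still assigns it a reward.
    def reward(s):
        return sum(bin(ord(ch)).count("1") for ch in s)

    print(reward("AAA"))  # 6  (ord('A') = 0b1000001 has two 1 bits)
    print(reward("AAC"))  # 7  (ord('C') = 0b1000011 has three 1 bits)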
> But doesn't reward for "**C" mean that "C" is in the training data?
You're running into an issue here due to overloading terms. Training data has three different meanings in this conversation depending on which context you are in.
1. The first is the pre-training context in which we're provided a dataset. My words were appropriate in that context.
2. The second is the reinforcement learning setup context in which we don't provide any dataset, but instead provide a reward model. My words were appropriate in that context.
3. The final context is that during the reinforcement learning algorithm's operation, one of the things it does is generate datasets and then learn from them. Here, it's true that there exists a dataset in which "C" is defined.
Recall that the important aspect of this discussion has to do with data provenance. We led off with someone claiming that an analog of "C" wasn't provided in the training data by a human explicitly. That means that I only need to establish that "C" doesn't show up in either of the inputs to a learning algorithm. That is case one and that is case two. It is not case three, because upon entering case three the provenance is no longer from humans.
Therefore, the answer to the question "doesn't the reward model for C mean that C is in the training data?" is no, it doesn't, because although it appears in case three, it doesn't appear in case one or case two, and those were the two cases relevant to the question. That it appears in case three is just the mechanism by which the refutation that it could not appear occurs.
> I am not sure if that is an accurate model, but if you think of it as a vector space: sure, you can generate a lot of vectors from some set of basis vectors, but you can never generate a new basis vector from the others, since they are linearly independent, so there are a bunch of new vectors you can never generate.
Your model of vectors sounds right to me, but your intuitions about it are a little bit off in places.
In machine learning, we introduce non-linearities into the model (for example, through activation functions like ReLU or Sigmoid). This breaks the strict linear structure of the model, enabling it to approximate a much wider range of functions. There's a mathematical result (the Universal Approximation Theorem) showing that this non-linearity allows neural networks to approximate virtually any continuous function to arbitrary precision, regardless of its complexity.
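A tiny hand-built illustration of that point (not a trained model): a purely linear map of x can only ever be a straight line, but two ReLU units are already enough to represent |x| exactly, which is the kind of thing the Universal Approximation Theorem generalizes.

    import numpy as np

    def relu(x):
        return np.maximum(0.0, x)

    def tiny_net(x):
        # hidden layer: h1 = relu(x), h2 = relu(-x); output: h1 + h2 == |x|,
        # something no single linear map a*x + b can reproduce
        return relu(x) + relu(-x)

    xs = np.linspace(-2, 2, 5)
    print(tiny_net(xs))   # [2. 1. 0. 1. 2.]
    print(np.abs(xs))     # same values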
We're not really talking about datasets when we move into a discussion about this. It's closer to a discussion of inductive biases. Inductive bias refers to the assumptions a model makes about the underlying structure of the data, which guide it toward certain types of solutions. If something doesn't map to the structure the inductive bias assumes, the model can be incapable of learning that function successfully.
The last generation of popular architectures used convolutional networks quite often. These baked in an inductive bias about where related data sits, and so made learning some functions difficult or impossible when that assumption was violated. The current generation of models tends to be built on transformers. Transformers use an attention mechanism that can determine what data to focus on, and as a result they are more capable of avoiding the problems that a bad inductive bias can create, since they can figure out what they are supposed to be paying attention to.
First, how much of coding is really never done before?
And secondly, what you say is false (at least if taken literally). I can create a new programming language, give the definition of it in the prompt, ask it to code something in my language, and expect something out. It might even work.
> I can create a new programming language, give the definition of it in the prompt, ask it to code something in my language, and expect something out. It might even work.
I literally just pointed out the same thing without having seen your comment.
Second this. I've done this several times, and it can handle it well. Already GPT3.5 could easily reason about hypothetical languages given a grammar or a loose description.
I find it absolutely bizarre that people still hold on to this notion that these models can't do anything new, because it seems implausible that they have actually tried, given how well it works.
If you give it the rules to generate something, why can't it generate it? That's what something like Mockaroo[0] does; it's just more formal. That's pretty much what LLM training does: extracting patterns from a huge corpus of text. Then it goes on to generate according to those patterns. It cannot generate a new pattern that is not a combination of the previous ones.
> If you give it the rules to generate something, why can't it generate it?
It can, but that does not mean that what is generated is not new, unless the rules in question constrain the set to the point where only one outcome is possible.
If I tell you that a novel has a minimum of 40,000 words, it does not mean that no novel is, well, novel (not sorry), just because I've given you rules to stay within. Any novel will in some sense be "derived from" an adherence to those rules, and yet plenty of those novels are still new.
The point was that by describing a new language in a zero-shot manner, you ensure that no program in that language exists either in the training data or in the prompt, so what it generates must at a minimum be new in the sense that it is in a language that has not previously existed.
If you then further give instructions for a program that incorporates constraints that are unlikely to have been used before (though this is harder), you can further ensure the novelty of the output along other axes.
You can keep adding arbitrary conditions like this, and LLMs will continue to produce output. Human creative endeavour is often similarly constrained by rules: rules for formats, rules for competitions, rules for publications. And yet nobody would suggest this means the output isn't new or creative, or that the work is somehow derivative of the rules.
This notion is setting a bar for LLMs we don't set for humans.
> That's pretty much what LLM training does: extracting patterns from a huge corpus of text. Then it goes on to generate according to those patterns.
But when you describe a new pattern as part of the prompt, the LLM is not being trained on that pattern. It's generating on the basis of interpreting what it is told in terms of the concepts it has learned, and developing something new from it, just as a human working within a set of rules is not creating merely derivative works just because we have past knowledge and have been given a set of rules to work to.
> It cannot generate a new pattern that is not a combination of the previous ones.
The entire point of my comment was that this is demonstrably false unless you are talking strictly in the sense of a deterministic view of the universe where everything including everything humans do is a combination of what came before. In which case the discussion is meaningless.
Specific models can be better or worse at it, but unless you can show that humans somehow exceed the Turing computable there isn't even a plausible mechanism for how humans could even theoretically be able to produce anything so much more novel that it'd be impossible for LLMs to produce something equally novel.
I was referring to "new" as some orthogonal dimension in the same space. By your definition, any slight change in the parameters results in something new. I was arguing more that if the model knows about axes x and y, then its output is constrained to a plane unless you add z. But more often than not its output will be a cylinder (extruded from a circle in the x,y plane) instead of a sphere.
The same thing goes for image generation. Every picture is new, but it's a combination of the pictures it found. It does not learn about things like perspective, values, forms, anatomy... the way an artist does, which are the proper dimensions of drawing.
> that humans somehow exceed the Turing computable
That has already been done by Gödel's incompleteness theorems[0] and the halting problem[1], meaning that we can do some stuff that no algorithm can do.
[0]: https://en.wikipedia.org/wiki/G%C3%B6del%27s_incompleteness_...
You completely fail to understand Gödel's incompleteness theorems and the halting problem if you think they are evidence of something humans can do that machines can not. It makes the discussion rather pointless if you lack that fundamental understanding of the subject.
Second, how much of commenting is really never done before?
Good question. Why isn't the GP using an LLM to generate comments, then?
For some types of comment, it really would be tempting to automate the answers, because especially the "stochastic parrot" type comments are getting really tedious and inane, and ironically come across as people parroting the same thing over and over instead of thinking.
But the other answer is that often the value in responding is to sharpen the mind and be forced to think through and formulate a response even if you've responded to some variation of the comment you reply to many times over.
A lot of comments that don't give me any value to read are comments I still get value out of through the process of replying to for that reason.
> how much of coding is really never done before?
A lot, because we use libraries for the 'done frequently before' code. I don't generate a database driver for my webapp with an LLM.
We use libraries for SOME of the 'done frequently' code.
But how much of enterprise programming is 'get some data from a database, show it on a Web page (or gui), store some data in the database', with variants?
It makes sense that we have libraries for abstracting away some common things. But it also makes sense that we can't abstract away everything we do multiple times, because at some point it becomes so abstract that it's easier to write it yourself than to try to configure some library. That does not mean it's not a variant of something done before.
> we can't abstract away everything we do multiple times
I think there's a fundamental truth about any code that's written which is that it exists on some level of specificity, or to put it in other words, a set of decisions have been made about _how_ something should work (in the space of what _could_ work) while some decisions have been left open to the user.
Every library that is used is essentially this. Database driver? Underlying I/O decisions are probably abstracted away already (think Netty vs Mina), and decisions on how to manage connections, protocol handling, bind variables, etc. are made by the library, while questions remain for things like which specific tables and columns should be referenced. This makes the library reusable for this task as long as you're fine with the underlying decisions.
Once you get to the question of _which specific data is shown on a page_ the decisions are closer to the human side of how we've arbitrarily chosen to organise things in this specific thousandth-iteration of an e-commerce application.
The devil is in the details (even if you know the insides of the devil aren't really any different).
> Once you get to the question of _which specific data is shown on a page_ the decisions are closer to the human side of how we've arbitrarily chosen to organise things in this specific thousandth-iteration of an e-commerce application.
That's why communication is so important: the requirements are the primary decision factor. A secondary factor is prior technical decisions.
- [deleted]
> it's easier to write it yourself than to try to configure some library
Yeah, unfortunately LLMs will make this worse. Why abstract when you can generate?
I am already seeing this a lot at work :(
Cue Haskell gang "Design patterns are workarounds for weaknesses in your language".
> First, how much of coding is really never done before?
Lots of programming doesn't have one specific right answer, but a bunch of possible right answers with different trade-offs. The programmer's job isn't necessarily just to get working code. I don't think we are at the point where LLMs can see the forest for the trees, so to speak.
That’s not true. LLMs are great translators, they can translate ideas to code. And that doesn’t mean it has to be recalling previously seen text.
- [deleted]
Generating unseen code is not hard.
Set rules on what’s valid, which most languages already do; omit generation of known code; generate everything else
The computer does the work, programmers don’t have to think it up.
A typed language example to explain; generate valid func sigs
func f(int1, int2) return int{}
If that’s our only func sig in our starting set then it makes it obvious
Well relative to our tiny starter set func f(int1, int2, int3) return int{} is novel
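A rough sketch of that "generate everything else" idea, purely illustrative: given a validity rule (here, 1 to N int parameters and an int return) and a set of already-known signatures, a program can mechanically enumerate signatures nobody has written yet.

    known = {"func f(int1, int2) return int{}"}

    def valid_sigs(max_params):
        for n in range(1, max_params + 1):
            params = ", ".join(f"int{i}" for i in range(1, n + 1))
            yield f"func f({params}) return int{{}}"

    novel = [sig for sig in valid_sigs(3) if sig not in known]
    print(novel)
    # ['func f(int1) return int{}', 'func f(int1, int2, int3) return int{}']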
This Redis post is about fixing a prior decision of a random programmer. A linguistics decision.
That’s why LLMs seem worse than programmers because we make linguistics decisions that fit social idioms.
If we just want to generate all the code this model has never seen before, we don't need a programmer. If we need to abide by the laws of a flexible, language-like nature, that's what a programmer is for: composing not just code, but compliance with ground truth.
That antirez is good at Redis is a bias since he has context unseen by the LLM. Curious how well antirez would do with an entirely machine generated Redis-clone that was merely guided by experts. Would his intuition for Redis’ implementation be useful to a completely unknown implementation?
He’d make a lot of newb errors and need mentorship, I’m guessing.
I think we're hoping for more than the 'infinite monkeys bashing out semantically correct code' approach.
Ok, define what that means and make it. Then as soon as you do, you realize, per Gödel, that your machine doesn't solve problems related to its own existence and needs outside help. So you need to generate that yet-unseen solution, which itself lacks the context to understand itself… repeat, and see that it's exactly generating one yet-unseen layer of logic after another.
Read the article; his younger self failed to see logic needed now. Add that onion peel. No such thing as perfect clairvoyance.
Even Yann LeCun’s energy based models driving robots have the same experience problem.
Make a computer that can observe all of the past and future.
Without perfect knowledge our robots will fail to predict some composition of space time before they can adapt.
So there’s no probe we can launch that’s forever and generally able to survive with our best guess when launched.
More people need to study physical experiments and physics and not the semantic rigor of academia. No matter how many ideas we imagine there is no violating physics.
Pop culture seems to have people feeling starship Enterprise is just about to launch from dry dock.
[flagged]
[dead]
Progress, sure, but the rate they've improved hasn't been particularly fast recently.
Programming has become vastly more efficient in terms of programmer effort over the decades, but making some aspects of the job more efficient just means all your effort is spent on what didn't improve.
People seem to have forgotten how good the 2023 GPT-4 really was at coding tasks.
The latest batch of LLMs has been getting worse in my opinion. Claude in particular seems to be going backwards with every release. The verbosity of the answers is infuriating. You ask it a simple question and it starts by inventing the universe, poorly
> Perhaps you remember that language models were completely useless at coding some years ago
No, I don't remember that. They are doing similar things now to what they did 3 years ago. They were still a decent rubber duck 3 years ago.
- [deleted]
And 6 years ago GPT2 had just been released. You're being obtuse by interpreting "some years" as specifically 3.
There are a couple people I work with who clearly don’t have a good understanding of software engineering. They aren’t bad to work with and are in fact great at collaborating and documenting their work, but don’t seem to have the ability to really trace through code and logically understand how it works.
Before LLMs it was mostly fine because they just didn’t do that kind of work. But now it’s like a very subtle chaos monkey has been unleashed. I’ve asked on some PRs “why is this like this? What is it doing?” And the answer is “ I don’t know, ChatGPT told me I should do it.”
The issue is that it throws basically all their code under suspicion. Some of it works, some of it doesn’t make sense, and some of it is actively harmful. But because the LLMs are so good at giving plausible output I can’t just glance at the code and see that it’s nonsense.
And this would be fine if we were working on like a crud app where you can tell what is working and broken immediately, but we are working on scientific software. You can completely mess up the results of a study and not know it if you don’t understand the code.
>And the answer is “ I don’t know, ChatGPT told me I should do it.”
This weirds me out. Like, I use LLMs A LOT, but I always sanity check everything so I can own the result. It's not the use of the LLM that gets me; it's trying to shift accountability to a tool.
Sounds almost like you definitely shouldn't use LLMs, nor those juniors, for such important work.
Is it just me, or are we heading into a period of an explosion in the amount of software produced, but also a massive drop in its quality? Not uniformly, just a bit of a chaotic spread.
> Is it just me, or are we heading into a period of an explosion in the amount of software produced, but also a massive drop in its quality? Not uniformly, just a bit of a chaotic spread.
I think we are, especially with executives mandating LLM use and expecting it to massively reduce costs and increase output.
For the most part they don't actually seem to care that much about software quality, and tend to push to decrease quality at every opportunity.
> LLMs, nor those juniors, for such important work.
Yeah we shouldn’t and I limit my usage to stuff that is easily verifiable.
But there’s no guardrails on this stuff, and one thing that’s not well considered is how these things which make us more powerful and productive can be destructive in the hands of well intentioned people.
Which is frightening, because it's not like our industry is known for producing really high quality code at the starting point before LLM authored code.
> those juniors
I'm betting they're the most senior people on the team.
> I’ve asked on some PRs “why is this like this? What is it doing?” And the answer is “ I don’t know, ChatGPT told me I should do it.”
This would infuriate me. I presume these are academics/researchers and not junior engineers?
Unfortunately this is the world we're entering into, where all of us will be outsourcing more and more of our 'thinking' to machines.
>I think the big question everyone wants to skip right to and past this conversation is, will this continue to be true 2 years from now?
For me, it's less "conversation to be skipped" and more "can we even get to 2 years from now"? There's so much instability right now that it's hard to say what anything will look like in 6 months.
It's like chess. Humans are better for now, they won't be forever, but humans plus software is going to be better than either alone for a long time.
> It's like chess. Humans are better for now, they won't be forever
This is not an obviously true statement. There needs to be proof that there are no limiting factors that are computationally impossible to overcome. It's like watching a growing child, grow from 3 feet to 4 feet, and then saying "soon, this child will be the tallest person alive."
With these "AGI by 2027" claims, it's not enough to say that the child will be the tallest person alive. They are saying the child will be the tallest structure on the planet.
One of my favourite XKCD comics is about extrapolation https://xkcd.com/605/
The time where humans + computers in chess were better than just computers was not a long time. That era ended well over a decade ago. Might have been true for only 3-5 years.
Unrelated to the broader discussion, but that's an artifact of the time control. Humans add nothing to Stockfish in a 90+30 game, but correspondence chess, for instance, is played with modern engines and still has competitive interest.
It is not clear to me whether human input really matters in correspondence chess at this point either.
I mused about this several years ago and still haven't really gotten a clear answer one way or the other.
No guarantee that will happen. LLMs are still statistically based. They're not going to give you edgier ideas, like filling a glass of wine to the rim.
Use them for the 90% of your repetitive uncreative work. The last 10% is up to you.
The pain of that 90% of the work is how you get libraries and frameworks. Imagine having many different implementations of sorting algorithms inside your codebase.
OK now we have to spend time figuring out the framework.
It's why people say just write plain Javascript, for example.
Is it? Or is it because the frameworks are not suitable for the project?
What do you mean? Chess engines are incredibly far ahead of humans right now.
Even a moderately powered machine running stockfish will destroy human super gms.
Sorry, after reading replies to this post i think I've misunderstood what you meant :)
I think he knows that. There was a period from the early 1950s (when people first started writing chess-playing software) to 1997 when humans were better at chess than computers were, and I think he is saying that we are still in the analogous period for the skill of programming.
But he should've known that people would jump at the opportunity to contradict him, and should've written his comment so as not to admit such an easily contradicted interpretation.
Yes, amended my post. I understand what he was saying now. Thanks.
Wasn't trying to just be contradictory or arsey
The phrasing was perhaps a bit odd. For a while, humans were better at chess, until they weren't. OP is hypothesizing it will be a similar situation for programming. To boot, it was hard for a long time to believe that computers would ever be better than humans at chess.
- [deleted]
It's not like chess.
Your information is quite badly out of date. AI can now beat humans at not only chess but 99% of all intellectual exercises.
I've had this same thought that it would be nice to have an AI rubber ducky to bounce ideas off of while pair programming (so that you don't sound dumb to your coworkers & waste their time).
This is my first comment so I'm not sure how to do this but I made a BYO-API key VSCode extension that uses the OpenAI realtime API so you can have interactive voice conversations with a rubber ducky. I've been meaning to create a Show HN post about it but your comment got me excited!
In the future I want to build features to help people communicate their bugs / what strategies they've tried to fix them. If I can pull it off it would be cool if the AI ducky had a cursor that it could point and navigate to stuff as well.
Please let me know if you find it useful https://akshaytrikha.github.io/deep-learning/2025/05/23/duck...
> I've had this same thought that it would be nice to have an AI rubber ducky to bounce ideas off of while pair programming (so that you don't sound dumb to your coworkers & waste their time).
I humbly suggest a more immediate concern to rectify is identifying how to improve the work environment such that the fear one might "sound dumb to your coworkers & waste their time" does not exist.
I like the sound of that! I think you're gonna like what we are building here: https://github.com/akdeb/ElatoAI
It's as if the rubber duck was actually on the desk while you're programming, and if we have an MCP that can get live access to code, it could give you realtime advice.
Wow, that's really cool, thanks for open sourcing it! I might dig into your MCP; I've been meaning to learn how to do that.
I genuinely think this could be great for toys that kids grow up with, i.e. the toy could adjust the way it talks depending on the kid's age and remember key moments in their life. Could be pretty magical for a kid.
[dead]
Just the exercise of putting my question in a way that the LLM could even theoretically provide a useful response is enough for me to figure out how to solve the problem a good percentage of the time.
My take is that AI is very one-dimensional (within its many dimensions). For instance, I might close my eyes and imagine an image of a tree structure, or a hash table, or a list-of-trees, or whatever else; then I might imagine grabbing and moving the pieces around, expanding or compressing them like a magician; my brain connects sight and sound, or texture, to an algorithm. However people think about problems is grounded in how we perceive the world in its infinite complexity.
Another example: saying out loud the colors red, blue, yellow, purple, orange, green—each color creates a feeling that goes beyond its physical properties into the emotions and experiences. AI image-generation might know the binary arrangement of an RGBA image but actually, it has NO IDEA what it is to experience colour. No idea how to use the experience of colour to teach a peer of an algorithm. It regurgitates a binary representation.
At some point we’ll get there though—no doubt. It would be foolish to say never! For those who want to get there before everyone else probably should focus on the organoids—because most powerful things come from some Faustian monstrosity.
This is really funny to read as someone who CANNOT imagine anything more complex than the most simple shape like a circle.
Do you actually see a tree with nodes that you can rearrange and have the nodes retain their contents and such?
Haha—yeah, for me the approach is always visual. I have to draw a picture to really wrap my brain around things! Other people I’d imagine have their own human, non-AI way to organize a problem space. :)
I have been drawing all my life and studied traditional animation though, so it’s probably a little bit of nature and nurture.
Same. Just today I used it to explore how a REST api should behave in a specific edge case. It gave lots of confident opinions on options. These were full of contradictions and references to earlier paragraphs that didn’t exist (like an option 3 that never manifested). But just by reading it, I rubber ducked the solution, which wasn’t any of what it was suggesting.
> I actually think a fair amount of value from LLM assistants to me is having a reasonably intelligent rubber duck to talk to.
I wonder if the term "rubber duck debugging" will still be used much longer into the future.
As long as it remains in the training material, it will be used. ;)
Yeah in my experience as long as you don’t stray too far off the beaten path, LLMs are great at just parroting conventional wisdom for how to implement things - but the second you get to something more complicated - or especially tricky bug fixing that requires expensive debuggery - forget about it, they do more harm than good. Breaking down complex tasks into bite sized pieces you can reasonably expect the robot to perform is part of the art of the LLM.
> I think the big question everyone wants to skip right to and past this conversation is, will this continue to be true 2 years from now? I don’t know how to answer that question.
I still think about Tom Scott's 'where are we on the AI curve' video from a few years back. https://www.youtube.com/watch?v=jPhJbKBuNnA
I think of them as a highly sycophantic, LSD-minded 2nd-year student who has done some programming.
Same. I do rubber duck debugging too and found the LLM complements it nicely.
Looking forward to rubber-duck-shaped hardware AI interfaces to talk to in the future. I'm sure somebody will create one.
It seems to me we're at the flat side of the curve again. I haven't seen much real progress in the last year.
It's ignorant to think machines will not catch up to our intelligence at some point, but for now, they clearly haven't.
I think there needs to be some kind of revolutionary breakthrough again to reach the next stage.
If I were to guess, it needs to be in the learning/backpropagation stage. LLMs are very rigid, and once they go wrong, you can't really get them out of it. A junior developer, for example, could gain a new insight. LLMs, not so much.
Currently, I find AI to be a really good autocomplete
The crazy thing is that people think that a model designed to predict sequences of tokens from a stem, no matter how advanced the model, could be much more than just "really good autocomplete."
It is impressive and very unintuitive just how far that can get you, but it's not reductive to use that label. That's what it is on a fundamental level, and aligning your usage with that will allow it to be more effective.
It's trivial to demonstrate that it takes only a tiny LLM + a loop to a have a Turing complete system. The extension of that is that it is utterly crazy to think that the fact it is "a model designed to predict sequences of tokens" puts much of a limitation on what an LLM can achieve - any Turing complete system can by definition simulate any other. To the extent LLMs are limited, they are limited by training and compute.
But these endless claims that the fact they're "just" predicting tokens means something about their computational power are based on flawed assumptions.
The fact they're Turing complete isn't really getting at the heart of the problem. Python is Turing complete and calling python "intelligent" would be a category error.
It is getting to the heart of the problem when the claim made is that "no matter how advanced the model" they can't be 'much more than just "really good autocomplete."'.
Given that they are Turing complete when you put a loop around them, that claim is objectively false.
I think it'd even be easier to coerce standard autocomplete into demonstrating Turing completeness. And without burning millions of dollars of GPU hours on training it.
Language models with a loop absolutely aren't Turing complete. Assuming the model can even follow your instructions the output is probabilistic so in the limit you can guarantee failure. In reality though there are lots of instructions LLMs fail to follow. You don't notice it as much when you're using them normally but if you want to talk about computation you'll run into trivial failures all the time.
The last time I had this discussion, I pointed out how LLMs consistently and completely fail at applying grammar production rules (obviously you tell them to apply the rules to words and not single letters, so you don't fight with the embedding).
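For readers unfamiliar with the task, here is what "applying grammar production rules" to a word-level string looks like; the rules below are made up for illustration. A deterministic rewriter is trivial; the claim above is that LLMs asked to do the same thing in prose often get it wrong.

    # Repeatedly apply word-level production rules until none applies.
    RULES = [("S", "NP VP"), ("NP", "the cat"), ("VP", "sat down")]

    def rewrite(s):
        changed = True
        while changed:
            changed = False
            tokens = s.split()
            for lhs, rhs in RULES:
                if lhs in tokens:
                    s = " ".join(rhs if tok == lhs else tok for tok in tokens)
                    changed = True
                    break
        return s

    print(rewrite("S"))   # "the cat sat down"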
LLMs do some amazing stuff but at the end of the day:
1) They're just language models. While many things can be described with languages, there are some things that idea doesn't capture, namely languages that aren't modeled, which is the whole point of a Turing machine.
2) They're not human, and the value is always going to come from human socialization.
> Language models with a loop absolutely aren't Turing complete.
They absolutely are. It's trivial to test and verify that you can tell one to act as a suitably small Turing machine and give it instructions to use to manipulate the conversation as "the tape".
Anything else would be absolutely astounding given how simple it is to implement a minimal 2-state 3-symbol Turing machine.
> Assuming the model can even follow your instructions the output is probabilistic so in the limit you can guarantee failure.
The output is deterministic if you set the temperature to zero, at which point it is absolutely trivial to verify the correct output for each of the possible states of a minimal Turing machine.
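A sketch of what that argument amounts to, with a deterministic stand-in for the model call. In the real version, llm_step would be a prompt of the form "here is the transition table, the current state, and the symbol under the head; reply with the symbol to write, the move, and the next state", and the surrounding loop plus the conversation log plays the role of the unbounded tape. The machine below is just an illustrative binary incrementer, not the minimal 2-state 3-symbol machine itself.

    from collections import defaultdict

    # (state, symbol) -> (write, move, next_state); '_' is blank, 'H' halts.
    RULES = {
        ("carry", "1"): ("0", -1, "carry"),   # 1 plus carry = 0, carry moves left
        ("carry", "0"): ("1",  0, "H"),       # absorb the carry and halt
        ("carry", "_"): ("1",  0, "H"),       # ran off the left edge: new digit
    }

    def llm_step(state, symbol):
        # Stand-in for asking a model to apply exactly one rule from RULES.
        return RULES[(state, symbol)]

    def run(bits):
        tape = defaultdict(lambda: "_", enumerate(bits))
        head, state = len(bits) - 1, "carry"          # start at the last bit
        while state != "H":
            write, move, state = llm_step(state, tape[head])
            tape[head] = write
            head += move
        return "".join(tape[i] for i in sorted(tape)).lstrip("_")

    print(run("1011"))   # "1100": 1011 + 1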
If you'd care to actually implement what you describe, I'm sure the resulting blog post would make a popular submission here.
It's not very interesting: it's basically showing it can run one step of a very trivial state machine, and then adding a loop to let it keep running with the conversation acting as the tape I/O.
It's pretty hard to make any kind of complex system that can't be coerced into being Turing complete once you add iteration.
Seriously, get any instruct-tuned language model and try to do one iteration with grammar production rules. It's a coin flip at best whether they get it right.
I have tried that many times and had good results.
There's a plausible argument for it, so it's not a crazy thing. You as a human being can also predict likely completions of partial sentences, or likely lines of code given surrounding lines of code, or similar tasks. You do this by having some understanding of what the words mean and what the purpose of the sentence/code is likely to be. Your understanding is encoded in connections between neurons.
So the argument goes: LLMs were trained to predict the next token, and the most general solution to do this successfully is by encoding real understanding of the semantics.
> "The crazy thing is that people think that a model designed to"
It's even crazier that some people believe that humans "evolved" intelligence just by nature selecting the genes which were best at propagating.
Clearly, human intelligence is the product of a higher being designing it.
/s
I would consider evolution a form of intelligence, even though I wouldn't consider nature a being.
There's a branch of AI research I was briefly working in 15 years ago, based on that premise: Genetic algorithms/programming.
So I'd argue humans were (and are continuously being) designed, in a way.
(non-sarcastically from me this time)
Sure, I would agree with that wording.
In the same way, neural networks which are trained to do a task could be said to be "designed" to do something.
In my view, there's a big difference in what the training data is for a neural network, and what the neural network is "designed" for.
We can train a network using word completion examples, with the intent of designing it for intelligence.
Yup. To counter my own points a bit:
I could also argue that the word "design" has a connotation strictly opposing emergent behaviour like evolution, as in the intelligent design "theory". So not the best word to use perhaps.
And in your example, just because we made a system that exhibits emergent behaviour to some degree, we can't assume it can "design" intelligence the way evolution did, on a much, much shorter timeline, no less.
It's reductive and misleading because autocomplete, as it's commonly known, existed for many years before generative AI, and is very different from and quite a bit dumber than LLMs.
Earlier this week ChatGPT found (self-conscious as I am of the personification of this phrasing) a place where I'd accidentally overloaded a member function by unintentionally giving it the name of something from a parent class, preventing the parent class function from ever being run and causing <bug>.
After walking through a short debugging session where it tried the four things I'd already thought of and eventually suggested (assertively but correctly) where the problem was, I had a resolution to my problem.
There are a lot of questions I have around how this kind of mistake could simply just be avoided at a language level (parent function accessibility modifiers, enforcing an override specifier, not supporting this kind of mistake-prone structure in the first place, and so on...). But it did get me unstuck, so in this instance it was a decent, if probabilistic, rubber duck.
> it [...] suggested (assertively but correctly) where the problem was
> it was a decent, if probabilistic, rubber duck
How is it a rubber duck if it suggested where the problem was?
Isn't a rubber duck a mute object which you explain things to, and in the process you yourself figure out what the solution is?
It's also quite good at formulating regular expressions based on one or two example strings.
LLMs are a passel of eager-to-please know-it-all interns that you can command at will without any moral compunctions.
They drive you nuts trying to communicate with them what you actually want them to do. They have a vast array of facts at immediate recall. They’ll err in their need to produce and please. They do the dumbest things sometimes. And surprise you at other times. You’ll throw vast amounts of their work away or have to fix it. They’re (relatively) cheap. So as an army of monkeys, if you keep herding them, you can get some code that actually tells a story. Mostly.