The commits are revealing.
Look at this one:
> Ask Claude to remove the "backup" encryption key. Clearly it is still important to security-review Claude's code!
> prompt: I noticed you are storing a "backup" of the encryption key as `encryptionKeyJwk`. Doesn't this backup defeat the end-to-end encryption, because the key is available in the grant record without needing any token to unwrap it?
I don’t think a non-expert would even know what this means, let alone spot the issue and direct the model to fix it.
That is how LLMs should be used today: an expert prompts it and checks the code. It still saves a lot of time vs typing everything from scratch. Just the other day I was working on a prototype and let Claude write the code for an auth flow. Everything was good until the last step, where it was just sending the user id as a string alongside the valid token. So if you had a valid token, you could pass in any user id and become that user. Still saved me a lot of time vs doing it from scratch.
At least for me, I'm fairly sure that I'm better at not adding security flaws to my code (which I'm already not perfect at!) than I am at spotting them in code that I didn't write, unfortunately.
They're different mindsets. Some folks are better editors, inspectors, auditors, etc, whereas some are better builders, creators, and drafters.
So what you're saying makes sense. And I'm definitely on the other side of that fence.
When you form a mental model and then write code from it, that's a very lossy transformation. You can write comments and documentation to make it less lossy, but there will be information that is lost to a reviewer, who has to spend great effort to recreate it. If it is unknown how code is supposed to behave, then it becomes physically impossible to verify it for correctness.
This is less a matter of "mindset", but more a general problem of information.
Whether reviewer or creator, if the start conditions / problem is known, both start with the same info.
"code base must do X with Y conditions"
The reviewer is at no disadvantage, other than having to walk through the problem without writing the code themselves.
This is the ideal case, where the produced code is readable and well commented, so its intent is obvious.
The worst case is an intern or LLM having generated some code where the intent is not obvious and them not being able to explain the intent behind it. "How is that even related to the ticket"-style code.
> Still saves a lot of time vs typing everything from scratch.
In my experience, it takes longer to debug/instruct the LLM than to write it from scratch.
Depends on what you're doing. For example when you're writing something like React components and using something like Tailwind for styling, I find the speedup is close to 10X.
Scaffolding works fine, for things that are common, and you already have 100x examples on the web. Once you need something more specific, it falls apart and leads to hours of prompting and debugging for something that takes 30 minutes to write from scratch.
Some basic things it fails at:
* Upgrading the React code-base from Material-UI V4 → V5
* Implementing a simple header navigation dropdown in HTML/CSS that looks decent and is usable (it kept having bugs with hovering, wrong sizes, padding, responsiveness, duplicated code, etc.)
* Changing anything. About half of the time, it keeps saying "I made those changes", but no changes were made (it happens with all of them: Windsurf, Copilot, etc.).
This can’t be stressed enough: it depends on what you’re doing. Developers talking about whether LLMs are useful are just talking past each other unless they say “useful for React” or “useful for Rust.” I mostly write Drupal code, and the JetBrains LLM autocomplete saves me a few keystrokes, maybe. It’s not amazing. My theory is that there just isn’t much boilerplate Drupal code out there to train on: everything possible gets pushed out of code and into configuration + UI. If I were writing React components I’d be having an entirely different experience.
Isn't there some way to speed up with codegen besides using LLMs?
Some may have a better answer, but I often compare with tools like OpenAPI and AsyncAPI generators where HTTP/AMQP/etc code can be generated for servers, clients and extended documentation viewers.
The trade-off here is that you must create the spec file (and customize the template files where needed) that drives the codegen, in exchange for explicit control over deterministic output. So there's more typing, but potentially less cognitive overhead than reviewing a bunch of LLM output.
For this use case I find the explicit codegen UX preferable to inspecting what the LLM decided to do with my human-language prompt, if attempting to have the LLM directly code the library/executable source (as opposed to asking it to create the generator, template or API spec).
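To make that trade-off concrete, here's a rough sketch of the workflow, with the caveat that the spec path, generator choice, and generated identifiers (DefaultApi, widgetsGet) are illustrative assumptions rather than anything from a specific project:

```typescript
// Hypothetical example: generate a typed client from an OpenAPI spec, then
// consume it. Generated names depend entirely on the spec, so treat the
// identifiers below as placeholders.
//
//   npx @openapitools/openapi-generator-cli generate \
//     -i api-spec.yaml -g typescript-fetch -o src/generated
//
// The output is deterministic: reviewing the client largely reduces to
// reviewing the spec and the templates.
import { Configuration, DefaultApi } from "./generated";

const api = new DefaultApi(
  new Configuration({ basePath: "https://api.example.com" })
);

export async function listWidgets(): Promise<void> {
  // Each generated method maps 1:1 to an operation declared in the spec.
  const widgets = await api.widgetsGet();
  console.log(widgets);
}
```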
You can require less code by using a more expressive programming language.
Isn’t this because the LLMs had like a million+ react tutorials/articles/books/repos to train on?
I mean I try to use them for svelte or vue and it still recommends react snippets sometimes.
I have had no issues with LLMs trying to force a language on me. I tried the whole snake-game test with ChatGPT, but instead of using Python I asked it to use the Node.js bindings for raylib, which is rather unusual.
It did it in no time and no complaints.
Generally speaking, "LLMs" that I use are always the latest thinking versions of the flagship models (Grok 3/Gemini 2.5/...). GPT4o (and equivalent) are a mess.
But you're correct, when you use more exotic and/or quite new libraries, the outputs can be of mixed quality. For my current stack (Typescript, Node, Express, React 19, React Router 7, Drizzle and Tailwind 4) both Grok 3 (the paid one with 100k+ context) and Gemini 2.5 are pretty damn good. But I use them for prototyping, i.e. quickly putting together new stuff, for types, refactorings... I would never trust their output verbatim. (YET.) "Build an app that ..." would be a nightmare, but React-like UI code at sufficiently granular level is pretty much the best case scenario for LLMs as your components should be relatively isolated from the rest of the app and not too big anyways.
I put these in the Gemini Pro 2.5 system prompt and it's golden for Svelte.
I do this and it still spits out React snippets regardless, like 40% of the time... I feel like this is fine if you are doing something extremely basic, but once you introduce state or animations, all these systems death-spiral.
Yes, definitely. Act accordingly.
I use https://visualstudio.microsoft.com/services/intellicode/ for my IDE, which learns from your codebase, so it does end up saving me a ton of time after it's learned my patterns and starts suggesting entire classes hooked up to the correct properties in my EF models.
It lets me keep my own style preferences while still getting the benefit of AI code generation. It bridged the barrier I had with code coming from Claude/ChatGPT/etc., where the style preferences were based on the wider internet's standards. This is probably a preference on the level of tabs vs spaces, but ¯\_(ツ)_/¯
> An expert prompts it and checks the code. Still saves a lot of time vs typing everything from scratch.
It's a lie. The marketing one, to be specific, which makes it even worse.
huh?
I really don't agree with the idea that expert time would just be spent typing, and I'd be really surprised if that's the common sentiment around here.
An expert reasons, plans ahead, thinks and reasons a little bit more before even thinking about writing code.
If you are measuring productivity by lines of code per hour then you don't understand what being a dev is.
> I really don't agree with the idea that expert time would just be spent typing, and I'd be really surprised if that's the common sentiment around here.
They didn't suggest that at all, they merely suggested that the component of the expert's work that would otherwise be spent typing can be saved, while the rest of their utility comes from intense scrutiny, problem solving, decision making about what to build and why, and everything else that comes from experience and domain understanding.
It's not just time spent typing. Figuring out what needs to be typed can be both draining and time consuming. It's often (but not always) much easier to review someone else's solution to the problem than it is to solve it from scratch on your own.
Oddly enough security critical flows are likely to be one of the few exceptions because catching subtle reasoning errors that won't trip any unit tests when reviewing code that you didn't write is extremely difficult.
The problem is, building something IS the destination. At least the first 5-10 times. Building and fixing along the way is what builds lasting knowledge for most people.
Time spent typing is statistically 0% of overall time spent in developing/implementing/shipping a feature or product or whatever. There's literally no reason to try to optimize that irrelevant detail.
Yeah, and you still do that now. Let's say you spend 30% of your time coding and the rest planning. Well, now you've got even more time for planning.
> Still saves a lot of time vs typing everything from scratch
No it doesn't. Typing speed is never the bottleneck for an expert.
As an offline database of Google-tier knowledge, LLMs are useful. But current LLM tech is half-baked; we still need:
a) Cheap commodity hardware for running your own models locally. (And by "locally" I mean separate dedicated devices, not something that fights over your desktop's or laptop's resources.)
b) Standard bulletproof ways to fine-tune models on your own data. (Inference is already there mostly with things like llama.cpp, finetuning isn't.)
I've realized I procrastinate less when using an LLM to write code which I know I could write myself.
I've noticed this too.
I remember hearing somewhere that humans have a limited capacity in terms of number of decisions made in a day, and it seems to fit here: If I'm writing the code myself, I have to make several decisions on every line of code, and that's mentally tiring, so I tend to stop and procrastinate frequently.
If an LLM is handling a lot of the details, then I'm just making higher-level decisions, allowing me to make more progress.
Of course this is totally speculation and theories like this tend to be wrong, but it is at least consistent with how I feel.
I have a feeling that it's something that might help today but also something you might pay for later. When you have to maintain or bug fix that same code down the line the fact that you were the one making all those higher-level decisions and thinking through the details gives you an advantage. Just having everything structured and named in ways that make the most sense to you seems like it'd be helpful the next time you have to deal with the code.
While it's often a luxury, I'd much rather work on code I wrote than code somebody else wrote.
Maybe you type faster than me then :) I for sure type slower than Claude code. :)
> No it doesn't. Typing speed is never the bottleneck for an expert
How could that possibly be true!? Seems like it'd be the same as suggesting being constrained to analog writing utensils wouldn't bottleneck the process of publishing a book or research paper. At the very least such a statement implies that people with ADHD can't be experts.
Completely agree with you. I was working on the front-end of an application and I prompted Claude the following: "The endpoint /foo/bar is returning the json below ##json goes here##, show this as cards inside the component FooBaz following the existing design system".
In less than 5 minutes Claude created code that:
- encapsulated the API call
- modeled the API response using TypeScript
- created a reusable and responsive UI component for the card (including a load state)
- included it in the right part of the page
Even if I typed at 200wpm I couldn't produce that much code from such a simple prompt.
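To give a sense of what that looks like, here's a hedged reconstruction of the general shape of the output, not the actual generated code; the JSON shape (BarItem) is an assumption since the prompt elides it with "##json goes here##", and the class names stand in for the project's real design system:

```tsx
// Illustrative sketch only; names and styling are placeholders.
import { useEffect, useState } from "react";

interface BarItem {
  id: string;
  title: string;
  description: string;
}

// Encapsulated, typed API call.
async function fetchBarItems(): Promise<BarItem[]> {
  const res = await fetch("/foo/bar");
  if (!res.ok) throw new Error(`Request failed: ${res.status}`);
  return res.json();
}

// Reusable card list with a load state, rendered inside the existing page.
export function FooBazCards() {
  const [items, setItems] = useState<BarItem[] | null>(null);

  useEffect(() => {
    fetchBarItems().then(setItems).catch(() => setItems([]));
  }, []);

  if (items === null) {
    return <p>Loading…</p>;
  }

  return (
    <div className="card-grid">
      {items.map((item) => (
        <div key={item.id} className="card">
          <h3>{item.title}</h3>
          <p>{item.description}</p>
        </div>
      ))}
    </div>
  );
}
```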
I also had similar experiences/gains refactoring back-end code.
This being said, there are cases in which writing the code yourself is faster than writing a detailed enough prompt, BUT those cases are becoming the exception with each new LLM iteration. I noticed that after the jump from Claude 3.7 to Claude 4, my prompts can be way less technical.
The thing is... does your code end there? Would you put that code in production without a deep analysis of what Claude did?
I'm not who you replied to, but I keep functions small and testable, paired with unit tests covering a healthy mix of happy/sad paths.
Afterwards I make sure the LLM passes all the tests before I spend my time to review the code.
I find this process keeps the iterations count low for review -> prompt -> review.
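As a rough illustration of that workflow (the function, the tests, and the choice of the built-in node:test runner are all my own assumptions, not the parent's actual setup):

```typescript
// Hedged sketch: a small, testable function plus happy/sad path tests that
// the LLM's output must pass before it gets a human review.
import { test } from "node:test";
import assert from "node:assert/strict";

export function parsePositiveInt(input: string): number {
  const n = Number(input);
  if (!Number.isInteger(n) || n <= 0) {
    throw new Error(`expected a positive integer, got "${input}"`);
  }
  return n;
}

test("happy path: parses a positive integer", () => {
  assert.equal(parsePositiveInt("42"), 42);
});

test("sad path: rejects non-numeric input", () => {
  assert.throws(() => parsePositiveInt("forty-two"));
});

test("sad path: rejects zero and negatives", () => {
  assert.throws(() => parsePositiveInt("0"));
  assert.throws(() => parsePositiveInt("-3"));
});
```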
I personally love writing code with an LLM. I’m a sloppy typist but love programming. I find it’s a great burnout prevention.
For context: node.js development/React (a very LLM friendly stack.)
(GP) I wouldn't, but it would get me close enough that I can do the work that's more intellectually stimulating. Sometimes you need the people to do the concrete for a driveway, and sometimes you need to be signing off on the way the concrete was done, perhaps making some tweaks during the early stages.
It seems fair to say that it is ~never the overall bottleneck? Maybe once you figure out what you want, typing speed briefly becomes the bottleneck, but does any expert finish a day thinking "If only I could type twice as fast, I'd have gotten twice as much work done?" That said, I don't think "faster typing" is the only benefit that AI assistance provides.
> How could that possibly be true!?
(I'll assume you're not joking, because your post is ridiculous enough to look like sarcasm.)
The answer is because programmers read code 10 times more (and think about code 100 times more) than they write it.
Yeah, but how fast can you write compared to how fast you think?
How many times have you read a story card and by the time you finished reading it you thought "It's an easy task, should take me 1 hour of work to write the code and tests"?
In my experience, in most of those cases the AI can do the same amount of code writing in under 10 minutes, leaving me the other 50 minutes to review the code, make/ask for any necessary adjustments, and move on to another task.
I don't know anyone who can think faster than they can type (on average); they would have to have an IQ over 150 or something. For mere mortals like myself, reasoning through edge cases and failure conditions and error handling and state invariants takes time. Time that I spend looking at a blinking cursor while the gears spin, or reading code. I've never finished a day where I thought to myself "gosh darn, if only I could type faster this would be done already".
You could be fast if you were coding only the happy path, like a lot of juniors do. Instead of thinking about trivial things like malformed input, library semantics, framework gotchas and what not.
I wasn't joking; it's a bottleneck sometimes, that's it. It's a bottleneck in the same way comfort or any good tool can be, the way a slow computer is a bottleneck. It's silly to suggest that your ability to rapidly use a fundamental tool is never a bottleneck, no matter what other bits need to come into play during the course of your day.
My ability to review and understand the intent behind code isn't the primary bottleneck to me efficiently reviewing code when it's requested of me; the primary bottleneck is being notified at the right time that I have a review request waiting.
If compilers were never a bottleneck, why would we ever try to make them faster? If build tools were never a bottleneck, why would we ever optimize those? These are all just some of the things that can stand between the identification of a problem and producing a solution for it.
Sure! But over half the fun of coding is writing and learning.
> ... Still saves a lot of time vs typing everything from scratch ...
How? The prompts still have to be typed, right? And then the output examined in earnest.
A prompt can be as little as a sentence to write hundreds of lines of code.
Hundreds of lines that you have to carefully read and understand.
Are you not doing that already?
I go line-by-line through the code that I wrote (in my git client) before I stage+commit it.
Depends on what it is doing. An HTML template without JS? Enough to just check that it looks right and works.
You also have to do that with code you write without LLM assistance.
Latest project I've been working on: prompts are a few sentences (and technically I dictate them instead of typing) and the LLM generates a few hundred lines of code.
Not if you don't want to. Speech-to-text is pretty good these days, and even e.g. aider has a /voice command thanks to OpenAI's Whisper.
> Still saves a lot of time vs typing everything from scratch
Probably very language specific. I use a lot of Ruby, typing things takes no time it's so terse. Instead I get to spend 95% of my time pondering my problems (or prompting the LLM)...
With a proper IDE you don't type much even in Java/.NET; it's all autocomplete anyway. "Too verbose" complaints are mostly from Notepad lovers, and from those who never needed to read somebody else's code.
It can create a whole dashboard view in Elixir in a few seconds that is 100 lines long. No way I can type that in the same time.
If you're making a dashboard view your productivity is zero, making it faster just multiplies zero by a bigger number.
Edit: this comment was more a result of me being in a terrible mood than a true claim. Sorry.
In my experience the problem is never creating the dashboard view (there's a million examples of it out there anyway to copy/paste), but making sense of the data. Especially if you're doing anything even remotely novel.
I tend to disagree, but I don't know what my disagreement means for the future of being able to use AI when writing software. This workers-oauth-provider project is 1200 lines of code. An expert should be able to write that on the scale of an hour.
The main value I've gotten out of AI writing software comes from the two extremes, not from the middle ground you present. Vibe coding can be great and seriously productive, but if I have to check it or manually maintain it in nearly any capacity more complicated than changing one string, productivity plummets. Conversely, delegating highly complex, isolated function writing to an AI can also be super productive, because it can (sometimes) showcase intelligence beyond mine and arrive at solutions which would take me 10x longer; but definitionally I am not the right person to check its code output, outside of maybe writing some unit tests for it (a third thing AI tends to be quite good at).
> This workers-oauth-provider project is 1200 lines of code. An expert should be able to write that on the scale of an hour.
Are you being serious here?
Let's do the math.
1200 lines in an hour would be one line every three seconds, with no breaks.
And your figure of 1200 lines is apparently omitting whitespace and comments. The actual code is 2626 lines. Let's say we ignore blank lines, then it's 2251 lines. So one line per ~1.6 seconds.
The best typists type around 2 words per second, so unless the average line of code has only about 3 words on it, a human literally couldn't type that fast -- even if they knew exactly what to type.
Of course, people writing code don't just type non-stop. We spend most of our time thinking. Also time testing and debugging. (The test is 2195 lines BTW, not included in above figures.) Literal typing of code is a tiny fraction of a developer's time.
I'd say your estimate is wrong by at least one, but realistically more likely two orders of magnitude.
"On the scale of an hour" means "within an order of magnitude of one hour", or either "10 minutes to 10 hours" or "0.1 hours to 10 hours" depending on your interpretation, either is fine.
> An expert should be able to write that on the scale of an hour.
An expert in oauth, perhaps. Not your typical expert dev who doesn't specialize in auth but rather in whatever he's using the auth for. Navigating those sorts of standards is extremely time consuming.
Maybe, but also: Cloudflare is one of like fifteen organizations on the planet writing code like this. The vast majority of The Rest Of Us will just consume code like this, which companies like Cloudflare, Auth0, etc write. That tends to be the nature of highly-specialized highly-domain-specific code. Cloudflare employs those mythical Oauth experts you talk about.
That's me. I'm the expert.
On my very most productive days of my entire career I've managed to produce ~1000 lines of code. This library is ~5000 (including comments, tests, and documentation, which you omitted for some reason). I managed to prompt it out of the AI over the course of about five days. But they were five days when I also had a lot of other things going on -- meetings, chats, code reviews, etc. Not my most productive.
So I estimate it would have taken me 2x-5x longer to write this library by hand.
Revealing against what?
If you look at the README it is completely revealed... so I would argue there is nothing to "reveal" in the first place.
> I started this project on a lark, fully expecting the AI to produce terrible code for me to laugh at. And then, uh... the code actually looked pretty good. Not perfect, but I just told the AI to fix things, and it did. I was shocked.
> To emphasize, this is not "vibe coded". Every line was thoroughly reviewed and cross-referenced with relevant RFCs, by security experts with previous experience with those RFCs.
I think OP meant "revealing" as in "enlightening", not as "uncovering something that was hidden intentionally".
> Revealing against what?
Revealing of what it is like working with an LLM in this way.
Revealing the types of critical mistakes LLMs make. In particular someone that didn’t already understand OAuth likely would not have caught this and ended up with a vulnerable system.
If the guy knew how to properly implement OAuth, did he actually save any time by prompting, or did he just prove the point that if you already know all the details of the implementation, you can guide an LLM to do it?
That's the biggest issue I see. In most cases I don't use an LLM, because DIYing it takes less time than prompting/waiting/checking every line.
> did he save any time though
Yes:
> It took me a few days to build the library with AI.
> I estimate it would have taken a few weeks, maybe months to write by hand.
– https://news.ycombinator.com/item?id=44160208
> or just tried to prove a point that if you actually already know all details of impl you can guide llm to do it?
No:
> I was an AI skeptic. I thought LLMs were glorified Markov chain generators that didn't actually understand code and couldn't produce anything novel. I started this project on a lark, fully expecting the AI to produce terrible code for me to laugh at. And then, uh... the code actually looked pretty good. Not perfect, but I just told the AI to fix things, and it did. I was shocked.
— https://github.com/cloudflare/workers-oauth-provider/?tab=re...
> I thought LLMs were glorified Markov chain generators that didn't actually understand code and couldn't produce anything novel.
How novel is a OAuth provider library for cloudflare workers? I wouldn't be surprised if it'd been trained on multiple examples.
I'm not aware of any other OAuth provider libraries for Workers. Plenty of clients, but not providers -- implementing the provider side is not that common, historically. See my other comment:
- [deleted]
Do people save time by learning to write code at 420WPM? By optimising their vi(m) layouts and using languages with lots of fancy operators that make things quicker to write?
Using an LLM to write code you already know how to write is just like using intellisense or any other smart autocomplete, but at a larger scale.
[flagged]
While I think this is a cool (public) experiment by Claude, asking an LLM to write security-sensitive code seems crazy at this point. Ad absurdum: Can you imagine asking Claude to implement new functionality in OpenSSL libs!?
Which is exactly why AI coding assistants work with your expertise rather than replace it. Most people I see fail at AI assisted development are either non-technical people expecting the AI will solve it all, or technical people playing gotcha with the machine rather than collaborating with it.
There is also one quite early in the repo where the dev has to tell Claude to store only the hashes of secrets.
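For readers who haven't seen that pattern, here's a minimal sketch of the idea, assuming Node's crypto module; it is not the actual workers-oauth-provider code:

```typescript
// Hedged illustration of "store only the hash of a secret": if the datastore
// leaks, an attacker gets hashes, not usable tokens.
import { createHash, randomBytes, timingSafeEqual } from "node:crypto";

// Issue a token: hand the plaintext to the client once, persist only the hash.
export function issueToken(): { plaintext: string; storedHash: string } {
  const plaintext = randomBytes(32).toString("base64url");
  const storedHash = createHash("sha256").update(plaintext).digest("hex");
  return { plaintext, storedHash };
}

// Verify a presented token by hashing it and comparing to the stored hash.
export function verifyToken(presented: string, storedHash: string): boolean {
  const presentedHash = createHash("sha256").update(presented).digest("hex");
  return timingSafeEqual(Buffer.from(presentedHash), Buffer.from(storedHash));
}
```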
Yeah I was disappointed in that one.
I hate to say it, but I have reviewed a lot of human code in my time, and I've definitely caught many humans making similar-magnitude mistakes. :/
I just wanted to say thanks so much publishing this, and especially your comments here - I found them really helpful and insightful. I think it's interesting (though not unexpected) that many of the other commenters' comments here show what a Rorschach test this is. I think that's kind of unfortunate, because your experience clearly showed some of the benefits and limitations/pitfalls of coding like this in an objective manner.
I am curious, did you find the work of reviewing Claude's output more mentally tiring/draining than writing it yourself? Like some other folks mentioned, I generally find reviewing code more mentally tiring than writing it, but I get a lot of personal satisfaction by mentoring junior developers and collaborating with my (human) colleagues (most of them anyway...) Since I don't get that feeling when reviewing AI code, I find it more draining. I'm curious how you felt reviewing this code.
I find reviewing AI code less mentally tiring than reviewing human code.
This was a surprise to me! Until I tried it, I dreaded the idea.
I think it is because of the shorter feedback loop. I look at what the AI writes as it is writing it, and can ask for changes which it applies immediately. Reviewing human code typically has hours or days of round-trip time.
Also with the AI code I can just take over if it's not doing the right thing. Humans don't like it when I start pushing commits directly to their PR.
There's also the fact that the AI I'm prompting is, obviously, working on my priorities, whereas humans are often working on other priorities, but I can't just decline to review someone's code because it's not what I'm personally interested in at that moment.
When things go well, reviewing the AI's work is less draining than writing it myself, because it's basically doing the busy work while I'm still in control of high-level direction and architecture. I like that. But things don't always go well. Sometimes the AI goes in totally the wrong direction, and I have to prompt it too many times to do what I want, in which case it's not saving me time. But again, I can always just cancel the session and start doing it myself... humans don't like it when I tell them to drop a PR and let me do it.
Personally, I don't generally get excited about mentoring and collaborating. I wish I did, and I recognize it's an important part of my job which I have to do either way, but I just don't. I get excited primarily about ideas and architecture and not so much about people.
Thank you so much for your detailed, honest, and insightful response! I've done a bunch of AI-assisted coding to varying degrees of success, but your comment here helped me think about it in new ways so that I can take the most advantage of it.
Again, I think your posting of this is probably the best actual, real world evidence that shows both the pros and cons of AI-assisted coding, dispassionately. Awesome work!
Most interesting aspect of this is it likely learned this pattern from human-written code!
But AIbros will be running around telling everyone that Claude invented OAuth for Cloudflare all on its own and then opensourced it.
- [deleted]
this seems like a true but pointless observation? if you're producing security-sensitive code then experts need to be involved, whether that's me unwisely getting a junior to do something, or receiving a PR from my cat, or using an LLM.
removing expert humans from the loop is the deeply stupid thing the Tech Elite Who Want To Crush Their Own Workforces / former-NFT fanboys keep pushing. just letting an LLM generate code for a human to review, then send out for more review, is really pretty boring and already very effective for simple to medium-hard things.
> …removing expert humans from the loop is the deeply stupid thing the Tech Elite Who Want To Crush Their Own Workforce…
this is completely expected behavior by them. departments with well paid experts will be one of the first they’ll want to cut. in every field. experts cost money.
we’re a long, long, long way off from a bot that can go into random houses and fix under the sink plumbing, or diagnose and then fix an electrical socket. however, those who do most of their work on a computer, they’re pretty close to a point where they can cut these departments.
in every industry in every field, those will be jobs cut first. move fast and break things.
I think it's a critically important observation.
I thought this experience was so helpful as it gave an objective, evidence-based sample on both the pros and cons of AI-assisted coding, where so many of the loudest voices on this topic are so one-sided ("AI is useless" or "developers will be obsolete in a year"). You say "removing expert humans from the loop is the deeply stupid thing the Tech Elite Who Want To Crush Their Own Workforces / former-NFT fanboys keep pushing", but the fact is many people with the power to push AI onto their workers are going to be more receptive to actual data and evidence than developers just complaining that AI is stupid.
It's a Jr Developer whose code you have to check over completely. To some people that is useful. But you're still going to have to train Jr Developers so they can turn into Sr Developers.
I don't like the jr dev analogy. It neither has the same weaknesses nor the same strengths.
It's more like the genius coworker who has an overassertive ego and sometimes shows up drunk, but if you know how to work with them and see past their flaws, they can be a real asset.
I like your analogy too, but it also explains why I find working with AI-assisted coding so mentally tiresome.
It's like with some auto-driving systems: I describe it as having a slightly inebriated teenager at the wheel. I can't just relax and read a book, because then I'd die. So I have to be more mentally alert than if I were just driving myself, because everything could be going smoothly and relaxed, but at any moment the driving system could decide to drive into a tree.
I don't really agree; a junior developer, if they're curious enough, wouldn't just write insecure code. They would do self-study and find out best practices before writing code, including not storing plaintext passwords and the like.
You have clearly only ever worked with the creme de la creme of junior developers.