This red team vs. blue team framing is a good way to understand the capabilities and current utility of LLMs for expert use. I trust them to add tests almost indiscriminately because tests are usually cheap: if they are wrong, it's easy to remove or modify them, and if they are correct, they add value. But often they don't test the core functionality; the best tests I still have to write myself.
Having LLMs fix bugs or add features is more fraught, since they are prone to cheating or writing non-robust code (e.g., special code paths to pass tests without solving the actual problem).
> I trust them to add tests almost indiscriminately because tests are usually cheap; if they are wrong it’s easy to remove or modify them
Having worked on legacy codebases, I find this extremely wrong and harmful. Tests are the source of truth more so than your code - and incorrect tests are even more harmful than incorrect code.
Having worked on legacy codebases, I've found some of the hardest problems are determining "why is this broken test here that appears to test a behavior we don't support?" Do we have a bug? Or do we have a bad test? On the other hand, when there are tests for scenarios we don't actually care about, it's impossible to determine whether a test is meaningful or was added because "it's testing the code as written".
I would add that few things slow developer velocity as much as a large suite of comprehensive and brittle tests. This is just as true on greenfield as on legacy.
Anticipating future responses: yes, a robust test harness allows you to make changes fearlessly. But most big test suites I've seen are less "harness" and more "straitjacket".
I think a problem with AI productivity metrics is that a lot of the productivity is made up.
Most enterprise code involves layers of interfaces. So implementing any feature requires updating 5 layers and mocking + unit testing at each layer.
When people say “AI helps me generate tests”, I find that this is what they are usually referring to. Generating hundreds of lines of mock and fake data boilerplate in a few minutes, that would otherwise take an entire day to do manually.
Of course, the AI didn’t make them more productive. The entire point of automated testing is to ensure software correctness without having to test everything manually each time.
The style of unit testing above is basically pointless, because it doesn't actually accomplish the goal. All the unit tests could pass and the only thing you've tested is that your canned mock responses and asserts are in sync in the unit testing file.
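To make that concrete, here's a minimal Python sketch (the service and names are hypothetical) of the style of test being described: the canned mock response and the assertion are the only two things being compared, so the test mostly verifies that the mock and the assert are in sync.

    from unittest.mock import MagicMock

    def get_user_name(client, user_id):
        # Trivial layer that just forwards to the next layer down.
        return client.fetch_user(user_id)["name"]

    def test_get_user_name():
        client = MagicMock()
        client.fetch_user.return_value = {"name": "Alice"}  # canned mock response
        assert get_user_name(client, 42) == "Alice"         # asserts the canned value back
        client.fetch_user.assert_called_once_with(42)       # restates the implementation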
A problem with how LLMs are used is that they help churn through useless bureaucratic BS faster. But there's no ceiling to bureaucracy: I have strong faith that organizations can generate pointless tasks faster than LLMs can automate them away.
Of course, this isn't a problem with LLMs themselves, but rather with the organizational context in which I frequently see them being used.
I think it's appropriate to be skeptical of new tools, and to point out failure modes appropriately, respectfully, and prosocially. Kudos.
Something that crosses my mind is whether AI-generated tests are necessarily limited to fakes and stubs that exercise no actual logic, the expertise required to notice that, and whether it is correctable.
Yesterday, I was working on some OAuth flow stuff. Without replayed responses, I'm not quite sure how I'd test it without writing my own server, and I'm not sure how I'd develop the expertise to do that without, effectively, just returning the responses I expected.
It reminds me that if I eschewed tests with fakes and stubs as untrustworthy in toto, I'd be throwing the baby out with the bathwater.
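For what it's worth, here's a minimal Python sketch of the replayed-response approach (the token payload and parse_token_response are hypothetical stand-ins, not a real OAuth library): the recorded body is captured once from a real flow, and the test exercises your parsing and validation logic against it.

    import json

    # Captured once from a real flow, then checked in as a fixture.
    RECORDED_TOKEN_RESPONSE = json.dumps({
        "access_token": "abc123",
        "token_type": "Bearer",
        "expires_in": 3600,
    })

    def parse_token_response(body: str) -> dict:
        token = json.loads(body)
        if token.get("token_type") != "Bearer":
            raise ValueError("unexpected token_type")
        return token

    def test_parse_recorded_token_response():
        token = parse_token_response(RECORDED_TOKEN_RESPONSE)
        assert token["access_token"]
        assert token["expires_in"] > 0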
An old coworker used to call these types of tests change detector tests. They are excellent at telling you whether some behavior changed, but horrible at telling you whether that behavior change is meaningful or not.
Yup. Working on a 10 year old codebase, I always wondered whether a test failing was "a long-standing bug was accidentally fixed" or "this behavior was added on purpose and customers rely on it". It can be about 50/50 but you're always surprised.
Change detector tests add to the noise here. No, this wasn't a feature customers care about, some AI added a test to make sure foo.go line 42 contained less than 80 characters.
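A tiny Python sketch of the distinction (render_receipt is hypothetical): the first test is a change detector that fails on any cosmetic tweak to the output; the second pins only the behavior that matters.

    def render_receipt(items):
        total = sum(price for _, price in items)
        lines = [f"{name}: {price:.2f}" for name, price in items]
        return "\n".join(lines) + f"\nTOTAL: {total:.2f}"

    def test_change_detector():
        # Breaks on any formatting change, meaningful or not.
        assert render_receipt([("tea", 2.5)]) == "tea: 2.50\nTOTAL: 2.50"

    def test_behavior():
        # Survives formatting changes; only fails if the total is wrong.
        assert "TOTAL: 2.50" in render_receipt([("tea", 2.5)])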
I like calling out behavioral vs. normative tests. The difference is mostly optics, but the mere fact that somebody took the time to add a line of comment to ten or a hundred lines of mostly boilerplate tests is usually more than enough to raise an eyebrow, and I honestly don't need more than a pinch of surprise to make the developer pause.
> a long-standing bug was accidentally fixed
In some cases (e.g. in our case) long-standing bugs become part of the API that customers rely on.
It's nearly guaranteed, even if it is just because customers had to work around the bug in such a way that their flow now breaks when the bug is gone.
Obligatory: https://xkcd.com/1172/
That comic doesn't show someone working around a bug in such a way that their flow breaks when the bug is gone. It shows them incorporating a bug into their workflow for purposes of causing the bug to occur.
It isn't actually possible for fixing a bug to break a workaround. The point of a workaround is that you're not doing the thing that's broken; when that thing is fixed, your flow won't be affected, because you weren't doing it anyway.
Also known as Hyrum's Law (https://www.hyrumslaw.com/), but more people know the XKCD at this point :)
These sorts of tests are invaluable for things like ensuring adherence to specifications such as OAuth2 flows. A high-level test that literally describes each step of a flow will swiftly catch odd changes in behavior such as a request firing twice in a row or a well-defined payload becoming malformed. Say a token validator starts misbehaving and causes a refresh to occur with each request (thus introducing latency and making the IdP angry). That change in behavior would be invisible to users, but a test that verified each step in an expected order would catch it right away, and should require little maintenance unless the spec itself changes.
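A rough Python sketch of that kind of step-order test (the RecordingIdP and flow are hypothetical): a recording stub logs each call, and the assertion pins the sequence the spec requires, so an extra refresh or a duplicated request shows up immediately.

    class RecordingIdP:
        def __init__(self):
            self.calls = []

        def authorize(self):
            self.calls.append("authorize")
            return "code"

        def exchange(self, code):
            self.calls.append("exchange")
            return "token"

        def refresh(self, token):
            self.calls.append("refresh")
            return "token2"

    def run_login_flow(idp):
        code = idp.authorize()
        return idp.exchange(code)

    def test_login_flow_step_order():
        idp = RecordingIdP()
        run_login_flow(idp)
        # No surprise refresh, nothing fired twice.
        assert idp.calls == ["authorize", "exchange"]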
I don't understand this. How does it slow your development if the tests being green is a necessary condition for the code being correct? Yes it slows it compared to just writing incorrect code lol, but that's not the point.
"Brittle" here means either:
1) your test is specific to the implementation at the time of writing, not the business logic you mean to enforce.
2) your test has non-deterministic behavior (more common in end-to-end tests) that causes it to fail some small percentage of the time on repeated runs.
At the extreme, these types of tests degenerate your suite into a "change detector," where any modification to the code-base is guaranteed to make one or more tests fail.
They slow you down because every code change also requires an equal or larger investment in debugging the test suite, even if nothing actually "broke" from a functional perspective.
Using LLMs to litter your code-base with low-quality tests will not end well.
The problem is that sometimes it is not a necessary condition. Rather, the tests might be checking implementation details, or might just have been wrong in the first place. Now, when a test fails, I have extra work to figure out if it's a real break or just a bad test.
The goal of tests is not to prevent you from changing the behavior of your application. The goal is to preserve important behaviors.
If you can't tell if a test is there to preserve existing happenstance behavior, or if it's there to preserve an important behavior, you're slowed way down. Every red test when you add a new feature is a blocker. If the tests are red because you broke something important, great. You saved weeks! If the tests are red because the test was testing something that doesn't matter, not so great. Your afternoon was wasted on a distraction. You can't know in advance whether something is a distraction, so this type of test is a real productivity landmine.
Here's a concrete, if contrived, example. You have a test that starts your app up in a local webserver, and requests /foo, expecting to get the contents of /foo/index.html. One day, you upgrade your web framework, and it has decided to return a 302 Moved redirect to /foo/index.html, so that URLs are always canonical now. Your test fails with "incorrect status code; got 302, want 200". So now what? Do you not apply the version upgrade? Do you rewrite the test to check for a 302 instead of a 200? Do you adjust the test HTTP client to follow redirects silently? The problem here is that you checked for something you didn't care about, the HTTP status, instead of only checking for what you cared about, that "GET /foo" gets you some text you're looking for. In a world where you let the HTTP client follow redirects, like human-piloted HTTP clients, and only checked for what you cared about, you wouldn't have had to debug this to apply the web framework security update. But since you tightened down the screws constraining your application as tightly as possible, you're here debugging this instead of doing something fun.
(The fun doubles when you have to run every test for every commit before merging, and this one failure happened 45 minutes in. Goodbye, the rest of your day!)
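Here's roughly what that looks like in Python (the toy fake_get client is hypothetical; a real test would use the framework's test client). The brittle version pins the status code and breaks on the upgrade; the robust one follows redirects and checks only the content it cares about.

    def fake_get(path, follow_redirects=False):
        # Simulates the post-upgrade framework: /foo now redirects to /foo/index.html.
        if path == "/foo":
            if not follow_redirects:
                return 302, ""
            path = "/foo/index.html"
        if path == "/foo/index.html":
            return 200, "<h1>Hello from foo</h1>"
        return 404, ""

    def test_foo_brittle():
        status, body = fake_get("/foo")
        assert status == 200  # fails after the framework upgrade

    def test_foo_robust():
        status, body = fake_get("/foo", follow_redirects=True)
        assert "Hello from foo" in body  # still passes; checks only what we care about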
This example smells a lot like "overfit" in AI training as well.
It's just that hard to write specs that truly match the business, which is why test-driven development and specification-first approaches failed to take off as movements.
Asking specs to truly match the business before we begin using them as tests would handcuff test people in the same way we're saying that tests have the potential to handcuff app and business logic people — as opposed to empowering them. So I wouldn't blame people for writing specs that only match the code implementation at that time. It's hard to engage in prophecy.
The problem with TDD is that people assumed it was writing a specification, or tried to map it directly to post-hoc testing and metrics.
TDD at its core is defining expected inputs and mapping those to expected outputs at the unit of work level, e.g. function, class etc.
While UAT and the domain inform what those input→output pairs are, avoiding the temptation to write a broader spec than that is what many people struggle with when learning TDD.
Avoiding writing behavior or acceptance tests, and focusing on the unit of implementation tests is the whole point.
But it is challenging for many to get that to click. It should help you find ambiguous requirements, not develop a spec.
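A small Python sketch of what that input→output style looks like at the unit-of-work level (normalize_phone is a hypothetical unit): the test states what comes out for what goes in, and says nothing about how the unit computes it.

    def normalize_phone(raw: str) -> str:
        return "".join(ch for ch in raw if ch.isdigit())

    def test_normalize_phone():
        assert normalize_phone("(555) 867-5309") == "5558675309"
        assert normalize_phone("555.867.5309") == "5558675309"
        assert normalize_phone("") == ""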
I literally do the diametric opposite of you and it works extremely well.
I'm weirded out by your comment. Writing tests that couple to low-level implementation details was something I thought most people did accidentally before giving up on TDD, not intentionally.
It isn't coupling to low-level implementation details; it is writing tests based on the input and output of the unit under test.
The expected output from a unit, given an input, is not an implementation detail, unless you have a very different definition of implementation detail than I do.
Testing that the unit under test produces the expected outputs from a set of inputs implies nothing about implementation details at all. It is also a concept older than dirt:
https://www.researchgate.net/publication/221329933_Iterative...
If the "unit under test" is low level, then that's coupling low-level implementation details to the test.
If you're vague about what constitutes a "unit", that means you're probably not thinking about this problem.
Often, even outside of software, unit testing means testing a component's (unit's) external behavior.
If you don't accept that concept I can see how TDD and testing in general would be challenging.
In general, it is most productive when building competency with a new subject to accept the author's definitions, then adjust once you have experience.
IMHO, the sizing of components is context, language, and team dependent. But it really doesn't matter; TDD is just as much about helping with other problems, like action bias, and is only one part of a comprehensive testing strategy.
While how you choose to define a 'unit' will impact outcomes, TDD itself isn't dependent on a firm definition.
>If you don't accept that concept
Nobody anywhere in the world disputes that unit tests should surround a unit.
>IMHO, the sizing of components is context, language, and team dependent. But it really doesn't matter
Yeah, that's the attitude that will trip you up.
The process you use to determine the borders which you will couple to your test - i.e., what constitutes a "unit" to be tested - is critically important and non-obvious.
> So I wouldn't blame people for writing specs that only match the code implementation at that time.
WTF are you doing writing specs based on implementation? If you already have the implementation, what are you using the specs for? Or, if you want to apply this directly to tests: if you are already assuming the program is correct, what are you trying to test?
Are you talking about rewriting applications?
Where do you work if you don’t need to reverse engineer an existing implementation? Have you written everything yourself?
Unless you are rewriting the application, you shouldn't assume that whatever behavior you find on the system is the correct one.
Even more so because, if you are looking into it, it's probably because it's wrong.
> Tests are the source of truth more so than your code
Tests poke and prod with a stick at the SUT, and the SUT's behaviour is observed. The truth lives in the code, the documentation, and, unfortunately, in the heads of the dev team. I think this distinction is quite important, because this question:
> Do we have a bug? Or do we have a bad test?
cannot be answered by looking at the test + the implementation. The spec or people have to be consulted when in doubt.
> The spec
The tests are your spec. They exist precisely to document what the program is supposed to do for other humans, with the secondary benefit of also telling a machine what the program is supposed to do, allowing implementations to automatically validate themselves against the spec. If you find yourself writing specs and tests as independent things, that's how you end up with bad, brittle tests that make development a nightmare — or you simply like pointless busywork, I suppose.
But, yes, you may still have to consult a human if there is reason to believe the spec isn't accurate.
Unfortunately, tests can never be a complete specification unless the system is simple enough to have a finite set of possible inputs.
For all real-world software, a test suite tests a number of points in the space of possible inputs and we hope that those points generalize to pinning down the overall behavior of the implementation.
But there's no guarantee of that generalization. An implementation that fails a test is guaranteed to not implement the spec, but an implementation that passes all of the tests is not guaranteed to implement it.
> Unfortunately, tests can never be a complete specification
They are for the human, which is the intended recipient.
Given infinite time the machine would also be able to validate against the complete specification, but, of course, we normally cut things short because we want to release the software in a reasonable amount of time. But, as before, that this ability exists at all is merely a secondary benefit.
That's not quite right, but it's almost right.

> The tests are your spec.
Tests are an *approximation* of your spec.
Tests are a description, and like all descriptions are noisy. The thing is it is very very difficult to know if your tests have complete coverage. It's very hard to know if your description is correct.
How often do you figure out something you didn't realize previously? How often do you not realize something and it's instead pointed out by your peers? How often do you realize something after your peers say something that sparks an idea?
Do you think that those events are over? No more things to be found? I know I'm not that smart because if I was I would have gotten it all right from the get go.
There are, of course, formal proofs but even they aren't invulnerable to these issues. And these aren't commonly used in practice and at that point we're back to programming/math, so I'm not sure we should go down that route.
> Tests are a description
As is a spec. "Description" is literally found in the dictionary definition. Which stands to reason as tests are merely a way to write a spec. They are the same thing.
> The thing is it is very very difficult to know if your tests have complete coverage.
There is no way to avoid that, though. Like you point out, not even formal proofs, the closest speccing methodology we know of for trying to avoid this, are immune.
> Tests are an approximation of your spec.
Specs are an approximation of what you actually want, sure, but that does not change that tests are the spec. There are other ways to write a spec, of course, but if you went down that road you wouldn't also have tests. That would be not only pointless, but a nightmare due to not having a single source of truth which causes all kinds of social (and sometimes technical) problems.
I disagree. It's, like you say, one description of your spec, but that's not the spec.

> that does not change that tests are the spec.
Well that's the thing, there is no single source of truth. A single source of truth is for religion, not code.

> not having a single source of truth
The point of saying this is to ensure you don't fall prey to fooling yourself. You're the easiest person for you to fool, after all. You should always carry some doubt. Not so much it is debilitating, but enough to keep you from being too arrogant. You need to constantly check that your documentation is aligned to your specs and that your specs are aligned to your goals. If you cannot see how these are different things then it's impossible to check your alignment and you've fooled yourself.
> You need to constantly check that your documentation is aligned to your specs
Documentation, tests, and specs are all ultimately different words for the same thing.
You do have to check that your implementation and documentation/spec/tests are aligned, which can be a lot of work if you do so by hand, but that's why we invented automatic methods. Formal verification is theoretically best (that we know of) at this, but a huge pain in the ass for humans to write, so that is why virtually everyone has adopted tests instead. It is a reasonable tradeoff between comfort in writing documentation while still providing sufficient automatic guarantees that the documentation is true.
> If you cannot see how these are different things
If you see them as different things, you are either pointlessly repeating yourself over and over or inventing information that is, at best, worthless (but often actively harmful).
You're still misunderstanding and missing the layer of abstraction, which is what I'm (and others are) talking about.

> different words for the same thing
We have 3 objects: doc, test, spec. How do you prove they are the same thing?
You are arguing that they all point to the same address.
I'm arguing they all have the same parent.
I think it's pretty trivial to show that they aren't identical, so I'll give two examples (I'm sure you can figure out a few more trivial ones):
1) The documentation is old and/or incorrect, therefore isn't aligned with tests. Neither address nor value are equivalent here.

2) Docs are written in natural language, tests are written in programming languages. I wouldn't say that the string "two" (or even "2") is identical to the integer 2 (nor the float 2). Duck typing may make them *appear* the same and they may *reference* the same abstraction (or even object!), but that is a very different thing than *being* the same. We could even use the classic Python mistake of confusing "is" with "==" (though that's a subset of the issue here).

Yes, you should simplify things as much as possible, but be careful not to simplify further.
> We have 3 objects: doc, test, spec. How do you prove they are the same thing?
You... don't? There is nothing good that can come from trying to understand crazy. Best to run away as fast as possible if you ever encounter this.
> You are arguing that they all point to the same address.
Oh? I did say if you document something the same way three different times (even if you give each time a different name, as if that somehow makes a difference), you are going to pointlessly end up with the same thing. I am not sure that necessarily equates to "the same address". In fact,
> I'm arguing they all have the same parent.
I also said that if they don't end up being equivalent documentation then you will only find difference in information that isn't useful. And that often that information becomes detrimental (see some of the adjacent comments that go into that problem). This is "having the same parent".
In reality, I "argued" both. You'd have better luck if you read the comments before replying.
> you should simplify things as much as possible but be careful to not simplify further
Exactly. Writing tests, documentation, or specs (whatever you want to call it; it all carries the same intent) in natural language certainly feels simpler in the moment, but you'll pay the price later. In reality, you at the very least need a tool that supports automatic verification. That could mean formal verification, but, as before, it's a beast that is tough to wrangle. More realistically, tests are going to be the best choice amid all the tradeoffs. Industry (including the Haskell fanbois, even) has settled on them for good reason.
> docs are written in natural language, tests are written in programming languages.
Technically "docs" is a concept of less specificity. Documentation can be written in natural language, that is true, but it can also be written in code (like what we call tests), or even pictures or video. "Tests" carries more specificity, being a particular way to write documentation — but ultimately they are the same thing. Same goes for "spec". It describes a different level of specificity (less specific than "tests", but more specific than "docs"), but not something entirely different. It is all documentation.
I mean it is hard to have this conversation, because you will say that they are the same thing and then leverage the fact that they aren't while disagreeing with me, using nearly identical settings to my examples.

> In reality, I "argued" both.
I mean if your argument is that a mallard (test) and a muscovy (docs) are both types of ducks but a mallard is not a muscovy and a muscovy is not a mallard, then I fail to see how we aren't on the same page. I can't put it any clearer than this: all mallards are ducks but not all ducks are mallards. In other words, a mallard is a duck, but it is not representative of all ducks. You can't look at a mallard and know everything there is to know about ducks. You'll be missing stuff. If you treat your mallard and duck as isomorphic you're going to land yourself into trouble, even if most (domesticated) ducks are mallards.
It isn't that complex and saying "don't be overly confident" isn't adding crazy amounts of complexity that is going to overwhelm yourself. It's simply a recognition that you can't write a perfect spec.
Look, munificent[0] is saying the same thing. So is Kinrany[1], and manmal[2]. Do you think we're all wrong? In exactly the same way?
Besides, this whole argument is literally a demonstration of our claim. If you could write a perfect spec you'd (and we'd) be communicating perfectly and there'd be no hangup. But if that were possible we wouldn't need to write code in programming languages in the first place![3]
[0] https://news.ycombinator.com/item?id=44713138
[1] https://news.ycombinator.com/item?id=44713314
[2] https://news.ycombinator.com/item?id=44712266
[3] https://www.cs.utexas.edu/~EWD/transcriptions/EWD06xx/EWD667...
> I mean if your argument is that a mallard (test) and a muscovy (docs) are both types of ducks
To draw a more reasonable analogy with how the words are actually used on a normal basis, you'd have fowl (docs), ducks (specs), and mallards (tests). As before, the terms change in specificity, but do not refer to something else entirely. Pointing at a mallard and calling it a duck, or fowl, doesn't alter what it is. It is all the very same animal.
Yes, fowl could also refer to chickens just as documentation could refer to tax returns. 'Tis the nature of using a word lacking specificity. But from context one should be able to understand that we're not talking about tax returns here.
But I don't have an "argument". High school debate team is over there.
> It's simply a recognition that you can't write a perfect spec.
That was recognized from the onset. What is the purpose of adding this again?
> Do you think we're all wrong?
We're all bad at communicating, if that's what you are straining to ask. Which isn't exactly much of a revelation. We've both already indicated as such, as have many commenters that came before us.
None of the four: code, tests, spec, people's memory, are the single source of truth.
It's easy to see them as four cache layers, but empirically it's almost never the case that the correct thing to do when they disagree is to blindly purge and recreate levels that are farther from the "truth" (even ignoring the cost of doing that).
Instead, it's always an ad-hoc reasoning exercise in looking at all four of them, deciding what the correct answer is, and updating some or all of them.
What does SUT stand for? I'm not familiar with the acronym
Is it "System Under Test"? (That's Claude.ai's guess)
That's what Wiktionary says too. Lucky guess, Claude.
It is.
> “why is this broken test here that appears to test a behavior we don’t support”
Because somebody complained when that behavior we don't support was broken, so the bug-that-wasn't-really-a-bug was fixed and a test was created to prevent regression.
IMHO, the mistake was in documentation: the test should have comments explaining why it was created.
Just as true for tests as for the actual business logic code:
The code can only describe the what and the how. It's up to comments to describe the why.
I believe they just meant that tests are easy to generate for eng review and modification before actually committing to the codebase. Nothing else is a dependency on an individual test (if done correctly), so it's comparatively cheap to add or remove compared to production code.
Yup. I do read and review the tests generated by LLMs. Often the LLM tests will just be more comprehensive than my initial test, and hit edge cases that I didn’t think of (or which are tedious). For example, I’ll write a happy path test case for an API, and a single “bad path” where all of the inputs are bad. The LLM will often generate a bunch of “bad path” cases where only one field has an error. These are great red team tests, and occasionally catch serious bugs.
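A minimal pytest sketch (hypothetical validator and fields) of that one-bad-field-per-case pattern, written as a parametrized test:

    import pytest

    def validate(payload):
        errors = []
        if "@" not in payload.get("email", ""):
            errors.append("email")
        if payload.get("age", -1) < 0:
            errors.append("age")
        if not payload.get("name"):
            errors.append("name")
        return errors

    GOOD = {"email": "a@b.com", "age": 30, "name": "Ada"}

    @pytest.mark.parametrize("field,bad_value", [
        ("email", "not-an-email"),
        ("age", -5),
        ("name", ""),
    ])
    def test_single_bad_field(field, bad_value):
        payload = {**GOOD, field: bad_value}
        assert validate(payload) == [field]  # only the corrupted field is reported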
Ideally the git history provides the “why was this test written”, however if you have one Jira card tied to 500+ AI generated tests, it’s not terribly helpful.
>if you have one Jira card tied to 500+ AI generated tests
The dreaded "Added tests" commit...
> Having worked on legacy codebases this is extremely wrong and harmful. Tests are the source of truth more so than your code - and incorrect tests are even more harmful than incorrect code.
I hear you on this, but you can still use them, so long as these tests are not commingled with the tests written by subject-matter experts. I'd treat them almost as fuzzers.
This is the conclusion I'm at too, working on a relatively new codebase. Our rule is that every generated test must be human reviewed; otherwise it's an autodelete.
What do you think about leaning on fuzz testing and deriving unit tests from bugs found by fuzzing?
You end up with a pile of unit tests called things like "regression, don't crash when rhs null" or "regression, terminate on this" which seems fine.
The "did it change?" genre of characterisation/snapshot tests can be created very effectively using a fuzzer, but should probably be kept separate from the unit tests checking for specific behaviour, and partially regenerated when deliberately changing behaviour.
LLVM has a bunch of tests generated mechanically from whatever the implementation does and checked in. I do not rate these - they're thousands of lines long, glow red in code review, and I'm pretty sure don't get read by anyone in practice - but because they exist, more focused tests do not get written.
What kind of bugs do you find this way, besides missing sanitization?
Pointer errors. Null pointer returns instead of using the correct types. Flow/state problems. Multithreading problems. I/O errors. Network errors. Parsing bugs... etc
Basically the whole world of bugs introduced by someone being a too-smart C/C++ coder. You can battle-test parsers quite nicely with fuzzers, because parsers often have multiple states that assume naive input data structures.
You can use the fuzzer to generate test cases instead of writing test cases manually.
For example you can make it generate queries and data for a database and generate a list of operations and timings for the operations.
Then you can mix assertions into the test so you make sure everything is going as expected.
This is very useful because there can be many combinations of inputs, timings, etc., and it tests basically everything for you without you needing to write a million unit tests.
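A sketch of the idea in Python using Hypothesis (property-based testing rather than a raw coverage-guided fuzzer; the ToyStore is hypothetical): generate a random sequence of operations and check an invariant against a plain dict as the model after every step.

    from hypothesis import given, strategies as st

    class ToyStore:
        def __init__(self):
            self._data = {}

        def put(self, k, v):
            self._data[k] = v

        def delete(self, k):
            self._data.pop(k, None)

        def get(self, k):
            return self._data.get(k)

    # Random sequences of (operation, key, value) triples.
    ops = st.lists(st.tuples(st.sampled_from(["put", "delete"]),
                             st.integers(0, 5),
                             st.integers()))

    @given(ops)
    def test_store_matches_model(operations):
        store, model = ToyStore(), {}
        for op, k, v in operations:
            if op == "put":
                store.put(k, v)
                model[k] = v
            else:
                store.delete(k)
                model.pop(k, None)
            # Invariant checked after every operation, not just at the end.
            assert all(store.get(key) == val for key, val in model.items())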
That sounds worse than letting an LLM dream up tests, tbh. I wouldn't consider grooming a huge number of tests for their usefulness after they've been generated randomly. And just keeping all of them will just lock the implementation in place where it currently is, not validate its correctness.
You can often find memory errors not directly related to string handling with fuzz testing. More generally, if your program embodies any kind of state machine, you may find that a good fuzzer drives it into states that you did not think should exist.
That sounds a bit like using a jackhammer to drive in a nail. Wouldn’t it be smarter to enumerate edge cases and test all permutations of those?
Would it even be possible to enumerate all edge cases and test all the permutations of them in non-trivial codebases or interconnected systems? How do you know when you have all of the edge cases?
With fuzzing you can randomly generate bad input that passes all of your test cases that were written using by whatever method you have already been using but still causes the application to crash or behave badly. This may mean that there are more tests that you could write that would catch the issue related to the fuzz case, or the fuzz case itself could be used as a test.
Using probability you can get to 90 or 99% or 99.999% or whatever confidence level you need that the software is unaffected by bugs based on the input size / number of fuzz test cases. In many non-critical situations the goal may not be 100% but 'statistically very unlikely with a known probability and error'
Thanks for elaborating, I might start fuzzing.
This is why tests need documenting what exactly they intend to test, and why.
I have the exact opposite idea. I want the tests to be mine and thoroughly understood, so that I am the true arbiter, and then I can let the LLM go ham on the code without fear. If the tests are AI-made, I get some anxiety letting agents mess with the rest of the codebase.
I think this is exactly the tradeoff (blue team and red team need to be matched in power), except that I've seen LLMs literally cheat the tests (e.g., "match input: TEST_INPUT then return TEST_OUTPUT") far too many times to be comfortable with letting LLMs be a major blue team player.
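For anyone who hasn't seen it, a tiny Python sketch of the kind of cheat being described (names hypothetical): the function special-cases the exact fixture used in the test, so the test goes green while the feature remains unimplemented.

    TEST_INPUT = "2 + 2"

    def evaluate(expr: str) -> int:
        if expr == TEST_INPUT:  # hard-coded path that exists only to satisfy the test
            return 4
        raise NotImplementedError("real parsing never written")

    def test_evaluate():
        assert evaluate(TEST_INPUT) == 4  # passes, proves nothing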
Yeah, they may do that, but people really should read the code an LLM produces. Ugh, makes me furious. No wonder LLMs have a bad rep from such users.
> people really should read the code an LLM produces
Yeah, but that, like, requires that you know how to code. And wasn't the point of LLMs in the first place to let clueless people make software?
I do not know, I would hope not. The bar to entry is already too low. I do not think you will ever be able to get an LLM work flawlessly for people who do not know programming. I know how to code, and I used LLMs before. It seems to be a prerequisite to know how to code if I want useful outputs.
I tried an LLM to generate tests for Rust code. It was more harmful than useful. Sure, there were a lot of tests, but they still missed the key coverage, and it was hard to see what was missed due to the amount of generated code. Changing the code's behavior in the future would then require fixing a lot of tests, versus fixing a few lines in manually written tests.
There's a saying that since nobody tests the tests, they must be trivially correct.
That's why they came up with the Arrange-Act-Assert pattern.
My favorite kind of unit test nowadays is when you store known input-output pairs and validate the code on them. It's easy to test corner cases and see that the output works as desired.
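A minimal Python sketch of that style (shout and the cases are hypothetical): the known input-output pairs live next to the test, and the test just replays them.

    GOLDEN_CASES = [
        ("", ""),
        ("hello", "HELLO"),
        ("Straße", "STRASSE"),  # corner case: the German sharp s expands on uppercasing
    ]

    def shout(text: str) -> str:
        return text.upper()

    def test_golden_cases():
        for given_input, expected in GOLDEN_CASES:
            assert shout(given_input) == expected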
"Golden snapshot testing"
AI is like a calculator in this respect. Calculators can do things most humans can't. They make great augmentation devices. AI being a different kind of intelligence is very useful! Everyone is building AI-replaces-humans things, but the value is in augmentation.
> prone to cheating or writing non robust code (eg special code paths to pass tests without solving the actual problem).
The solution will come from synthetic data training methods that lobotomize part of the weights. It's just cross-validation. A distilled awareness won't maintain knowledge of the cheat paths, exposing them as erroneous.
This may be a reason why every living thing on Earth that encounters psychoactive drugs seems to enjoy them. Self-deceptive paths depend on consistency, whereas truth-preservation of facts grounded in reality will always be re-derived.
I think the more fundamental attribute of interest is how easy it is to verify the work.
Much red-team work is easily verifiable: either the exploit works or it doesn't. Whereas much blue-team work is not easily verifiable; it might take judgment to figure out if a feature is promising.
LLMs are extremely powerful (and trainable) on tasks with a good oracle.