This red team vs. blue team framing is a good way to understand the capabilities and current utility of LLMs for expert use. I trust them to add tests almost indiscriminately because tests are usually cheap: if they are wrong, it's easy to remove or modify them, and if they are correct, they add value. But often they don't test the core functionality; the best tests I still have to write myself.
Having LLMs fix bugs or add features is more fraught, since they are prone to cheating or writing non-robust code (e.g., special code paths to pass tests without solving the actual problem).
> I trust them to add tests almost indiscriminately because tests are usually cheap; if they are wrong it’s easy to remove or modify them
Having worked on legacy codebases, I find this extremely wrong and harmful. Tests are the source of truth more so than your code - and incorrect tests are even more harmful than incorrect code.
Having worked on legacy codebases, I've found some of the hardest problems are determining "why is this broken test here that appears to test a behavior we don't support?" Do we have a bug? Or do we have a bad test? On the other hand, when there are tests for scenarios we don't actually care about, it's impossible to determine whether a test is meaningful or was added because "it's testing the code as written".
I would add that few things slow developer velocity as much as a large suite of comprehensive and brittle tests. This is just as true on greenfield as on legacy.
Anticipating future responses: yes, a robust test harness allows you to make changes fearlessly. But most big test suites I've seen are less "harness" and more "straitjacket".
I think a problem with AI productivity metrics is that a lot of the productivity is made up.
Most enterprise code involves layers of interfaces. So implementing any feature requires updating 5 layers and mocking + unit testing at each layer.
When people say “AI helps me generate tests”, I find that this is what they are usually referring to. Generating hundreds of lines of mock and fake data boilerplate in a few minutes, that would otherwise take an entire day to do manually.
Of course, the AI didn’t make them more productive. The entire point of automated testing is to ensure software correctness without having to test everything manually each time.
The style of unit testing above is basically pointless, because it doesn't actually accomplish the goal. All the unit tests could pass and the only thing you've tested is that your canned mock responses and asserts are in sync in the unit testing file.
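To make that concrete, here's a minimal Python sketch (the service and names are hypothetical) of the style of test being described: the canned mock response and the assertion are the only two things being compared, so the test mostly verifies that the mock and the assert are in sync.

    from unittest.mock import MagicMock

    def get_user_name(client, user_id):
        # Trivial layer that just forwards to the next layer down.
        return client.fetch_user(user_id)["name"]

    def test_get_user_name():
        client = MagicMock()
        client.fetch_user.return_value = {"name": "Alice"}  # canned mock response
        assert get_user_name(client, 42) == "Alice"         # asserts the canned value back
        client.fetch_user.assert_called_once_with(42)       # restates the implementation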
A problem with how LLMs are used is that they help churn through useless bureaucratic BS faster. But there's no ceiling to bureaucracy: I have strong faith that organizations can generate pointless tasks faster than LLMs can automate them away.
Of course, this isn't a problem with LLMs themselves, but rather with the organizational context in which I frequently see them being used.
I think it's appropriate to be skeptical of new tools, and to point out failure modes appropriately, respectfully, and prosocially. Kudos.
Something that crosses my mind is whether AI-generated tests are necessarily limited to fakes and stubs that exercise no actual logic, the expertise required to notice that, and whether it is correctable.
Yesterday, I was working on some OAuth flow stuff. Without replayed responses, I'm not quite sure how I'd test it without writing my own server, and I'm not sure how I'd develop the expertise to do that without, effectively, just returning the responses I expected.
It reminds me that if I eschewed tests with fakes and stubs as untrustworthy in toto, I'd be throwing the baby out with the bathwater.
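For what it's worth, here's a minimal Python sketch of the replayed-response approach (the token payload and parse_token_response are hypothetical stand-ins, not a real OAuth library): the recorded body is captured once from a real flow, and the test exercises your parsing and validation logic against it.

    import json

    # Captured once from a real flow, then checked in as a fixture.
    RECORDED_TOKEN_RESPONSE = json.dumps({
        "access_token": "abc123",
        "token_type": "Bearer",
        "expires_in": 3600,
    })

    def parse_token_response(body: str) -> dict:
        token = json.loads(body)
        if token.get("token_type") != "Bearer":
            raise ValueError("unexpected token_type")
        return token

    def test_parse_recorded_token_response():
        token = parse_token_response(RECORDED_TOKEN_RESPONSE)
        assert token["access_token"]
        assert token["expires_in"] > 0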
An old coworker used to call these types of tests change detector tests. They are excellent at telling you whether some behavior changed, but horrible at telling you whether that behavior change is meaningful or not.
Yup. Working on a 10 year old codebase, I always wondered whether a test failing was "a long-standing bug was accidentally fixed" or "this behavior was added on purpose and customers rely on it". It can be about 50/50 but you're always surprised.
Change detector tests add to the noise here. No, this wasn't a feature customers care about, some AI added a test to make sure foo.go line 42 contained less than 80 characters.
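A tiny Python sketch of the distinction (render_receipt is hypothetical): the first test is a change detector that fails on any cosmetic tweak to the output; the second pins only the behavior that matters.

    def render_receipt(items):
        total = sum(price for _, price in items)
        lines = [f"{name}: {price:.2f}" for name, price in items]
        return "\n".join(lines) + f"\nTOTAL: {total:.2f}"

    def test_change_detector():
        # Breaks on any formatting change, meaningful or not.
        assert render_receipt([("tea", 2.5)]) == "tea: 2.50\nTOTAL: 2.50"

    def test_behavior():
        # Survives formatting changes; only fails if the total is wrong.
        assert "TOTAL: 2.50" in render_receipt([("tea", 2.5)])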
I like calling out behavioral vs. normative tests. The difference is mostly optics, but the mere fact that somebody took the time to add a line of comment to ten or a hundred lines of mostly boilerplate tests is usually more than enough to raise an eyebrow, and I honestly don't need more than a pinch of surprise to make the developer pause.
> a long-standing bug was accidentally fixed
In some cases (e.g. in our case) long-standing bugs become part of the API that customers rely on.
It's nearly guaranteed, even if it is just because customers had to work around the bug in such a way that their flow now breaks when the bug is gone.
Obligatory: https://xkcd.com/1172/
That comic doesn't show someone working around a bug in such a way that their flow breaks when the bug is gone. It shows them incorporating a bug into their workflow for purposes of causing the bug to occur.
It isn't actually possible for fixing a bug to break a workaround. The point of a workaround is that you're not doing the thing that's broken; when that thing is fixed, your flow won't be affected, because you weren't doing it anyway.
Also known as Hyrum's Law (https://www.hyrumslaw.com/), but more people know the XKCD at this point :)
These sorts of tests are invaluable for things like ensuring adherence to specifications such as OAuth2 flows. A high-level test that literally describes each step of a flow will swiftly catch odd changes in behavior such as a request firing twice in a row or a well-defined payload becoming malformed. Say a token validator starts misbehaving and causes a refresh to occur with each request (thus introducing latency and making the IdP angry). That change in behavior would be invisible to users, but a test that verified each step in an expected order would catch it right away, and should require little maintenance unless the spec itself changes.
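A rough Python sketch of that kind of step-order test (the RecordingIdP and flow are hypothetical): a recording stub logs each call, and the assertion pins the sequence the spec requires, so an extra refresh or a duplicated request shows up immediately.

    class RecordingIdP:
        def __init__(self):
            self.calls = []

        def authorize(self):
            self.calls.append("authorize")
            return "code"

        def exchange(self, code):
            self.calls.append("exchange")
            return "token"

        def refresh(self, token):
            self.calls.append("refresh")
            return "token2"

    def run_login_flow(idp):
        code = idp.authorize()
        return idp.exchange(code)

    def test_login_flow_step_order():
        idp = RecordingIdP()
        run_login_flow(idp)
        # No surprise refresh, nothing fired twice.
        assert idp.calls == ["authorize", "exchange"]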
I don't understand this. How does it slow your development if the tests being green is a necessary condition for the code being correct? Yes it slows it compared to just writing incorrect code lol, but that's not the point.
"Brittle" here means either:
1) your test is specific to the implementation at the time of writing, not the business logic you mean to enforce.
2) your test has non-deterministic behavior (more common in end-to-end tests) that causes it to fail some small percentage of the time on repeated runs.
At the extreme, these types of tests degenerate your suite into a "change detector," where any modification to the code-base is guaranteed to make one or more tests fail.
They slow you down because every code change also requires an equal or larger investment in debugging the test suite, even if nothing actually "broke" from a functional perspective.
Using LLMs to litter your code-base with low-quality tests will not end well.
The problem is that sometimes it is not a necessary condition. Rather, the tests might be checking implementation details, or might just have been wrong in the first place. Now, when a test fails, I have extra work to figure out if it's a real break or just a bad test.
The goal of tests is not to prevent you from changing the behavior of your application. The goal is to preserve important behaviors.
If you can't tell if a test is there to preserve existing happenstance behavior, or if it's there to preserve an important behavior, you're slowed way down. Every red test when you add a new feature is a blocker. If the tests are red because you broke something important, great. You saved weeks! If the tests are red because the test was testing something that doesn't matter, not so great. Your afternoon was wasted on a distraction. You can't know in advance whether something is a distraction, so this type of test is a real productivity landmine.
Here's a concrete, if contrived, example. You have a test that starts your app up in a local webserver, and requests /foo, expecting to get the contents of /foo/index.html. One day, you upgrade your web framework, and it has decided to return a 302 Moved redirect to /foo/index.html, so that URLs are always canonical now. Your test fails with "incorrect status code; got 302, want 200". So now what? Do you not apply the version upgrade? Do you rewrite the test to check for a 302 instead of a 200? Do you adjust the test HTTP client to follow redirects silently? The problem here is that you checked for something you didn't care about, the HTTP status, instead of only checking for what you cared about, that "GET /foo" gets you some text you're looking for. In a world where you let the HTTP client follow redirects, like human-piloted HTTP clients, and only checked for what you cared about, you wouldn't have had to debug this to apply the web framework security update. But since you tightened down the screws constraining your application as tightly as possible, you're here debugging this instead of doing something fun.
(The fun doubles when you have to run every test for every commit before merging, and this one failure happened 45 minutes in. Goodbye, the rest of your day!)
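Here's roughly what that looks like in Python (the toy fake_get client is hypothetical; a real test would use the framework's test client). The brittle version pins the status code and breaks on the upgrade; the robust one follows redirects and checks only the content it cares about.

    def fake_get(path, follow_redirects=False):
        # Simulates the post-upgrade framework: /foo now redirects to /foo/index.html.
        if path == "/foo":
            if not follow_redirects:
                return 302, ""
            path = "/foo/index.html"
        if path == "/foo/index.html":
            return 200, "<h1>Hello from foo</h1>"
        return 404, ""

    def test_foo_brittle():
        status, body = fake_get("/foo")
        assert status == 200  # fails after the framework upgrade

    def test_foo_robust():
        status, body = fake_get("/foo", follow_redirects=True)
        assert "Hello from foo" in body  # still passes; checks only what we care about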
This example smells a lot like "overfit" in AI training as well.
It's just that hard to write specs that truly match the business, which is why test-driven development and specification-first approaches failed to take off as movements.
Asking specs to truly match the business before we begin using them as tests would handcuff test people in the same way we're saying that tests have the potential to handcuff app and business logic people — as opposed to empowering them. So I wouldn't blame people for writing specs that only match the code implementation at that time. It's hard to engage in prophecy.
The problem with TDD is that people assumed it was writing a specification, or tried to map it directly to post-hoc testing and metrics.
TDD at its core is defining expected inputs and mapping those to expected outputs at the unit of work level, e.g. function, class etc.
While UAT and the domain inform what those input→output pairs are, avoiding the temptation to write a broader spec than that is what many people struggle with when learning TDD.
Avoiding writing behavior or acceptance tests, and focusing on the unit of implementation tests is the whole point.
But it is challenging for many to get that to click. It should help you find ambiguous requirements, not develop a spec.
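A small Python sketch of what that input→output style looks like at the unit-of-work level (normalize_phone is a hypothetical unit): the test states what comes out for what goes in, and says nothing about how the unit computes it.

    def normalize_phone(raw: str) -> str:
        return "".join(ch for ch in raw if ch.isdigit())

    def test_normalize_phone():
        assert normalize_phone("(555) 867-5309") == "5558675309"
        assert normalize_phone("555.867.5309") == "5558675309"
        assert normalize_phone("") == ""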
I literally do the diametric opposite of you and it works extremely well.
I'm weirded out by your comment. Writing tests that couple to low-level implementation details was something I thought most people did accidentally before giving up on TDD, not intentionally.
It isn't coupling to low-level implementation details; it is writing tests based on the input and output of the unit under test.
The expected output from a unit, given an input, is not an implementation detail, unless you have a very different definition of implementation detail than I do.
Testing that the unit under test produces the expected outputs from a set of inputs implies nothing about implementation details at all. It is also a concept older than dirt:
https://www.researchgate.net/publication/221329933_Iterative...
If the "unit under test" is low level, then that's coupling low-level implementation details to the test.
If you're vague about what constitutes a "unit", that means you're probably not thinking about this problem.
Often, even outside of software, unit testing means testing a component's (unit's) external behavior.
If you don't accept that concept I can see how TDD and testing in general would be challenging.
In general, it is most productive when building competency with a new subject to accept the author's definitions, then adjust once you have experience.
IMHO, the sizing of components is context, language, and team dependent. But it really doesn't matter; TDD is just as much about helping with other problems, like action bias, and is only one part of a comprehensive testing strategy.
While how you choose to define a 'unit' will impact outcomes, TDD itself isn't dependent on a firm definition.
>If you don't accept that concept
Nobody anywhere in the world disputes that unit tests should surround a unit.
>IMHO, the sizing of components is context, language, and team dependent. But it really doesn't matter
Yeah, that's the attitude that will trip you up.
The process you use to determine the borders which you will couple to your test - i.e., what constitutes a "unit" to be tested - is critically important and non-obvious.
> So I wouldn't blame people for writing specs that only match the code implementation at that time.
WTF are you doing writing specs based on implementation? If you already have the implementation, what are you using the specs for? Or, if you want to apply this directly to tests: if you are already assuming the program is correct, what are you trying to test?
Are you talking about rewriting applications?
Where do you work if you don’t need to reverse engineer an existing implementation? Have you written everything yourself?
Unless you are rewriting the application, you shouldn't assume that whatever behavior you find on the system is the correct one.
Even more so because, if you are looking into it, it's probably because it's wrong.
> Tests are the source of truth more so than your code
Tests poke and prod with a stick at the SUT, and the SUT's behaviour is observed. The truth lives in the code, the documentation, and, unfortunately, in the heads of the dev team. I think this distinction is quite important, because this question:
> Do we have a bug? Or do we have a bad test?
cannot be answered by looking at the test + the implementation. The spec or people have to be consulted when in doubt.
> The spec
The tests are your spec. They exist precisely to document what the program is supposed to do for other humans, with the secondary benefit of also telling a machine what the program is supposed to do, allowing implementations to automatically validate themselves against the spec. If you find yourself writing specs and tests as independent things, that's how you end up with bad, brittle tests that make development a nightmare — or you simply like pointless busywork, I suppose.
But, yes, you may still have to consult a human if there is reason to believe the spec isn't accurate.
Unfortunately, tests can never be a complete specification unless the system is simple enough to have a finite set of possible inputs.
For all real-world software, a test suite tests a number of points in the space of possible inputs and we hope that those points generalize to pinning down the overall behavior of the implementation.
But there's no guarantee of that generalization. An implementation that fails a test is guaranteed to not implement the spec, but an implementation that passes all of the tests is not guaranteed to implement it.
> Unfortunately, tests can never be a complete specification
They are for the human, which is the intended recipient.
Given infinite time the machine would also be able to validate against the complete specification, but, of course, we normally cut things short because we want to release the software in a reasonable amount of time. But, as before, that this ability exists at all is merely a secondary benefit.
That's not quite right, but it's almost right.

> The tests are your spec.
Tests are an *approximation* of your spec.
Tests are a description, and like all descriptions are noisy. The thing is it is very very difficult to know if your tests have complete coverage. It's very hard to know if your description is correct.
How often do you figure out something you didn't realize previously? How often do you not realize something and it's instead pointed out by your peers? How often do you realize something after your peers say something that sparks an idea?
Do you think that those events are over? No more things to be found? I know I'm not that smart because if I was I would have gotten it all right from the get go.
There are, of course, formal proofs but even they aren't invulnerable to these issues. And these aren't commonly used in practice and at that point we're back to programming/math, so I'm not sure we should go down that route.
> Tests are a description
As is a spec. "Description" is literally found in the dictionary definition. Which stands to reason as tests are merely a way to write a spec. They are the same thing.
> The thing is it is very very difficult to know if your tests have complete coverage.
There is no way to avoid that, though. Like you point out, not even formal proofs, the closest speccing methodology we know of for trying to avoid this, are immune.
> Tests are an approximation of your spec.
Specs are an approximation of what you actually want, sure, but that does not change that tests are the spec. There are other ways to write a spec, of course, but if you went down that road you wouldn't also have tests. That would be not only pointless, but a nightmare due to not having a single source of truth which causes all kinds of social (and sometimes technical) problems.
I disagree. It's, like you say, one description of your spec, but that's not the spec.

> that does not change that tests are the spec.
Well that's the thing, there is no single source of truth. A single source of truth is for religion, not code.

> not having a single source of truth
The point of saying this is to ensure you don't fall prey to fooling yourself. You're the easiest person for you to fool, after all. You should always carry some doubt. Not so much it is debilitating, but enough to keep you from being too arrogant. You need to constantly check that your documentation is aligned to your specs and that your specs are aligned to your goals. If you cannot see how these are different things then it's impossible to check your alignment and you've fooled yourself.
> You need to constantly check that your documentation is aligned to your specs
Documentation, tests, and specs are all ultimately different words for the same thing.
You do have to check that your implementation and documentation/spec/tests are aligned, which can be a lot of work if you do so by hand, but that's why we invented automatic methods. Formal verification is theoretically best (that we know of) at this, but a huge pain in the ass for humans to write, so that is why virtually everyone has adopted tests instead. It is a reasonable tradeoff between comfort in writing documentation while still providing sufficient automatic guarantees that the documentation is true.
> If you cannot see how these are different things
If you see them as different things, you are either pointlessly repeating yourself over and over or inventing information that is, at best, worthless (but often actively harmful).
You're still misunderstanding and missing the layer of abstraction, which is what I'm (and others are) talking about.

> different words for the same thing
We have 3 objects: doc, test, spec. How do you prove they are the same thing?
You are arguing that they all point to the same address.
I'm arguing they all have the same parent.
I think it's pretty trivial to show that they aren't identical, so I'll give two examples (I'm sure you can figure out a few more trivial ones):
1) The documentation is old and/or incorrect, therefore isn't aligned with tests. Neither address nor value are equivalent here.

2) Docs are written in natural language, tests are written in programming languages. I wouldn't say that the string "two" (or even "2") is identical to the integer 2 (nor the float 2). Duck typing may make them *appear* the same and they may *reference* the same abstraction (or even object!), but that is a very different thing than *being* the same. We could even use the classic Python mistake of confusing "is" with "==" (though that's a subset of the issue here).

Yes, you should simplify things as much as possible, but be careful not to simplify further.
> We have 3 objects: doc, test, spec. How do you prove they are the same thing?
You... don't? There is nothing good that can come from trying to understand crazy. Best to run away as fast as possible if you ever encounter this.
> You are arguing that they all point to the same address.
Oh? I did say if you document something the same way three different times (even if you give each time a different name, as if that somehow makes a difference), you are going to pointlessly end up with the same thing. I am not sure that necessarily equates to "the same address". In fact,
> I'm arguing they all have the same parent.
I also said that if they don't end up being equivalent documentation then you will only find difference in information that isn't useful. And that often that information becomes detrimental (see some of the adjacent comments that go into that problem). This is "having the same parent".
In reality, I "argued" both. You'd have better luck if you read the comments before replying.
> you should simplify things as much as possible but be careful to not simplify further
Exactly. Writing tests, documentation, or specs (whatever you want to call it; it all carries the same intent) in natural language certainly feels simpler in the moment, but you'll pay the price later. In reality, you at the very least need a tool that supports automatic verification. That could mean formal verification, but, as before, it's a beast that is tough to wrangle. More realistically, tests are going to be the best choice amid all the tradeoffs. Industry (including the Haskell fanbois, even) has settled on them for good reason.
> docs are written in natural language, tests are written in programming languages.
Technically "docs" is a concept of less specificity. Documentation can be written in natural language, that is true, but it can also be written in code (like what we call tests), or even pictures or video. "Tests" carries more specificity, being a particular way to write documentation — but ultimately they are the same thing. Same goes for "spec". It describes a different level of specificity (less specific than "tests", but more specific than "docs"), but not something entirely different. It is all documentation.
I mean it is hard to have this conversation, because you will say that they are the same thing and then leverage the fact that they aren't while disagreeing with me, using nearly identical settings to my examples.

> In reality, I "argued" both.
I mean if your argument is that a mallard (test) and a muscovy (docs) are both types of ducks but a mallard is not a muscovy and a muscovy is not a mallard, then I fail to see how we aren't on the same page. I can't put it any clearer than this: all mallards are ducks but not all ducks are mallards. In other words, a mallard is a duck, but it is not representative of all ducks. You can't look at a mallard and know everything there is to know about ducks. You'll be missing stuff. If you treat your mallard and duck as isomorphic you're going to land yourself into trouble, even if most (domesticated) ducks are mallards.
It isn't that complex and saying "don't be overly confident" isn't adding crazy amounts of complexity that is going to overwhelm yourself. It's simply a recognition that you can't write a perfect spec.
Look, munificent[0] is saying the same thing. So is Kinrany[1], and manmal[2]. Do you think we're all wrong? In exactly the same way?
Besides, this whole argument is literally a demonstration of our claim. If you could write a perfect spec you'd (and we'd) be communicating perfectly and there'd be no hangup. But if that were possible we wouldn't need to write code in programming languages in the first place![3]
[0] https://news.ycombinator.com/item?id=44713138
[1] https://news.ycombinator.com/item?id=44713314
[2] https://news.ycombinator.com/item?id=44712266
[3] https://www.cs.utexas.edu/~EWD/transcriptions/EWD06xx/EWD667...
> I mean if your argument is that a mallard (test) and a muscovy (docs) are both types of ducks
To draw a more reasonable analogy with how the words are actually used on a normal basis, you'd have fowl (docs), ducks (specs), and mallards (tests). As before, the terms change in specificity, but do not refer to something else entirely. Pointing at a mallard and calling it a duck, or fowl, doesn't alter what it is. It is all the very same animal.
Yes, fowl could also refer to chickens just as documentation could refer to tax returns. 'Tis the nature of using a word lacking specificity. But from context one should be able to understand that we're not talking about tax returns here.
But I don't have an "argument". High school debate team is over there.
> It's simply a recognition that you can't write a perfect spec.
That was recognized from the onset. What is the purpose of adding this again?
> Do you think we're all wrong?
We're all bad at communicating, if that's what you are straining to ask. Which isn't exactly much of a revelation. We've both already indicated as such, as have many commenters that came before us.
None of the four: code, tests, spec, people's memory, are the single source of truth.
It's easy to see them as four cache layers, but empirically it's almost never the case that the correct thing to do when they disagree is to blindly purge and recreate levels that are farther from the "truth" (even ignoring the cost of doing that).
Instead, it's always an ad-hoc reasoning exercise in looking at all four of them, deciding what the correct answer is, and updating some or all of them.
What does SUT stand for? I'm not familiar with the acronym
Is it "System Under Test"? (That's Claude.ai's guess)
That's what Wiktionary says too. Lucky guess, Claude.
It is.
> “why is this broken test here that appears to test a behavior we don’t support”
Because somebody complained when that behavior we don't support was broken, so the bug-that-wasn't-really-a-bug was fixed and a test was created to prevent regression.
IMHO, the mistake was in documentation: the test should have comments explaining why it was created.
Just as true for tests as for the actual business logic code:
The code can only describe the what and the how. It's up to comments to describe the why.
I believe they just meant that tests are easy to generate for eng review and modification before actually committing to the codebase. Nothing else is a dependency on an individual test (if done correctly), so it's comparatively cheap to add or remove compared to production code.
Yup. I do read and review the tests generated by LLMs. Often the LLM tests will just be more comprehensive than my initial test, and hit edge cases that I didn’t think of (or which are tedious). For example, I’ll write a happy path test case for an API, and a single “bad path” where all of the inputs are bad. The LLM will often generate a bunch of “bad path” cases where only one field has an error. These are great red team tests, and occasionally catch serious bugs.
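A minimal pytest sketch (hypothetical validator and fields) of that one-bad-field-per-case pattern, written as a parametrized test:

    import pytest

    def validate(payload):
        errors = []
        if "@" not in payload.get("email", ""):
            errors.append("email")
        if payload.get("age", -1) < 0:
            errors.append("age")
        if not payload.get("name"):
            errors.append("name")
        return errors

    GOOD = {"email": "a@b.com", "age": 30, "name": "Ada"}

    @pytest.mark.parametrize("field,bad_value", [
        ("email", "not-an-email"),
        ("age", -5),
        ("name", ""),
    ])
    def test_single_bad_field(field, bad_value):
        payload = {**GOOD, field: bad_value}
        assert validate(payload) == [field]  # only the corrupted field is reported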
Ideally the git history provides the “why was this test written”, however if you have one Jira card tied to 500+ AI generated tests, it’s not terribly helpful.
>if you have one Jira card tied to 500+ AI generated tests
The dreaded "Added tests" commit...
> Having worked on legacy codebases this is extremely wrong and harmful. Tests are the source of truth more so than your code - and incorrect tests are even more harmful than incorrect code.
I hear you on this, but you can still use them, so long as these tests are not commingled with the tests written by subject-matter experts. I'd treat them almost as fuzzers.
This is the conclusion I'm at too, working on a relatively new codebase. Our rule is that every generated test must be human reviewed; otherwise it's an autodelete.
What do you think about leaning on fuzz testing and deriving unit tests from bugs found by fuzzing?
You end up with a pile of unit tests called things like "regression, don't crash when rhs null" or "regression, terminate on this" which seems fine.
The "did it change?" genre of characterisation/snapshot tests can be created very effectively using a fuzzer, but should probably be kept separate from the unit tests checking for specific behaviour, and partially regenerated when deliberately changing behaviour.
LLVM has a bunch of tests generated mechanically from whatever the implementation does and checked in. I do not rate these - they're thousands of lines long, glow red in code review, and I'm pretty sure don't get read by anyone in practice - but because they exist, more focused tests do not get written.
What kind of bugs do you find this way, besides missing sanitization?
Pointer errors. Null pointer returns instead of using the correct types. Flow/state problems. Multithreading problems. I/O errors. Network errors. Parsing bugs... etc
Basically the whole world of bugs introduced by someone being a too-smart C/C++ coder. You can battle-test parsers quite nicely with fuzzers, because parsers often have multiple states that assume naive input data structures.
You can use the fuzzer to generate test cases instead of writing test cases manually.
For example you can make it generate queries and data for a database and generate a list of operations and timings for the operations.
Then you can mix assertions into the test so you make sure everything is going as expected.
This is very useful because there can be many combinations of inputs, timings, etc., and it tests basically everything for you without you needing to write a million unit tests.
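A sketch of the idea in Python using Hypothesis (property-based testing rather than a raw coverage-guided fuzzer; the ToyStore is hypothetical): generate a random sequence of operations and check an invariant against a plain dict as the model after every step.

    from hypothesis import given, strategies as st

    class ToyStore:
        def __init__(self):
            self._data = {}

        def put(self, k, v):
            self._data[k] = v

        def delete(self, k):
            self._data.pop(k, None)

        def get(self, k):
            return self._data.get(k)

    # Random sequences of (operation, key, value) triples.
    ops = st.lists(st.tuples(st.sampled_from(["put", "delete"]),
                             st.integers(0, 5),
                             st.integers()))

    @given(ops)
    def test_store_matches_model(operations):
        store, model = ToyStore(), {}
        for op, k, v in operations:
            if op == "put":
                store.put(k, v)
                model[k] = v
            else:
                store.delete(k)
                model.pop(k, None)
            # Invariant checked after every operation, not just at the end.
            assert all(store.get(key) == val for key, val in model.items())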
That sounds worse than letting an LLM dream up tests, tbh. I wouldn't consider grooming a huge number of tests for their usefulness after they've been generated randomly. And just keeping all of them will just lock the implementation in place where it currently is, not validate its correctness.
You can often find memory errors not directly related to string handling with fuzz testing. More generally, if your program embodies any kind of state machine, you may find that a good fuzzer drives it into states that you did not think should exist.
That sounds a bit like using a jackhammer to drive in a nail. Wouldn’t it be smarter to enumerate edge cases and test all permutations of those?
Would it even be possible to enumerate all edge cases and test all the permutations of them in non-trivial codebases or interconnected systems? How do you know when you have all of the edge cases?
With fuzzing you can randomly generate bad input that passes all of your test cases that were written using by whatever method you have already been using but still causes the application to crash or behave badly. This may mean that there are more tests that you could write that would catch the issue related to the fuzz case, or the fuzz case itself could be used as a test.
Using probability you can get to 90 or 99% or 99.999% or whatever confidence level you need that the software is unaffected by bugs based on the input size / number of fuzz test cases. In many non-critical situations the goal may not be 100% but 'statistically very unlikely with a known probability and error'
Thanks for elaborating, I might start fuzzing.
This is why tests need documenting what exactly they intend to test, and why.
I have the exact opposite idea. I want the tests to be mine and thoroughly understood, so that I am the true arbiter, and then I can let the LLM go ham on the code without fear. If the tests are AI-made, I get some anxiety letting agents mess with the rest of the codebase.
I think this is exactly the tradeoff (blue team and red team need to be matched in power), except that I've seen LLMs literally cheat the tests (e.g., "match input: TEST_INPUT then return TEST_OUTPUT") far too many times to be comfortable with letting LLMs be a major blue team player.
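For anyone who hasn't seen it, a tiny Python sketch of the kind of cheat being described (names hypothetical): the function special-cases the exact fixture used in the test, so the test goes green while the feature remains unimplemented.

    TEST_INPUT = "2 + 2"

    def evaluate(expr: str) -> int:
        if expr == TEST_INPUT:  # hard-coded path that exists only to satisfy the test
            return 4
        raise NotImplementedError("real parsing never written")

    def test_evaluate():
        assert evaluate(TEST_INPUT) == 4  # passes, proves nothing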
Yeah, they may do that, but people really should read the code an LLM produces. Ugh, makes me furious. No wonder LLMs have a bad rep from such users.
> people really should read the code an LLM produces
Yeah, but that, like, requires that you know how to code. And wasn't the point of LLMs in the first place to let clueless people make software?
I do not know, I would hope not. The bar to entry is already too low. I do not think you will ever be able to get an LLM work flawlessly for people who do not know programming. I know how to code, and I used LLMs before. It seems to be a prerequisite to know how to code if I want useful outputs.
I tried an LLM to generate tests for Rust code. It was more harmful than useful. Sure, there were a lot of tests, but they still missed the key coverage, and it was hard to see what was missed due to the amount of generated code. Changing the code's behavior in the future would then require fixing a lot of tests, versus fixing a few lines in manually written tests.
There's a saying that since nobody tests the tests, they must be trivially correct.
That's why they came up with the Arrange-Act-Assert pattern.
My favorite kind of unit test nowadays is when you store known input-output pairs and validate the code on them. It's easy to test corner cases and see that the output works as desired.
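A minimal Python sketch of that style (shout and the cases are hypothetical): the known input-output pairs live next to the test, and the test just replays them.

    GOLDEN_CASES = [
        ("", ""),
        ("hello", "HELLO"),
        ("Straße", "STRASSE"),  # corner case: the German sharp s expands on uppercasing
    ]

    def shout(text: str) -> str:
        return text.upper()

    def test_golden_cases():
        for given_input, expected in GOLDEN_CASES:
            assert shout(given_input) == expected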
"Golden snapshot testing"
AI is like a calculator in this respect. Calculators can do things most humans can't. They make great augmentation devices. AI being a different kind of intelligence is very useful! Everyone is building AI-replaces-humans things, but the value is in augmentation.
> prone to cheating or writing non robust code (eg special code paths to pass tests without solving the actual problem).
The solution will come from synthetic data training methods that lobotomize part of the weights. It's just cross-validation. A distilled awareness won't maintain knowledge of the cheat paths, exposing them as erroneous.
This may be a reason why every living thing on Earth that encounters psychoactive drugs seems to enjoy them. Self-deceptive paths depend on consistency, whereas truth-preservation of facts grounded in reality will always be re-derived.
I think the more fundamental attribute of interest is how easy it is to verify the work.
Much red-team work is easily verifiable: either the exploit works or it doesn't. Whereas much blue-team work is not easily verifiable; it might take judgment to figure out if a feature is promising.
LLMs are extremely powerful (and trainable) on tasks with a good oracle.