My favorite tool for trying scary complicated things in an unknown space is the feature flag. This works even if you have zero tests and no documentation. The only thing you need is the live production system and a way to toggle the flag at runtime.
If you can ship your hypothesis along with an effectively unaltered version of prod, the ability to test things without breaking other things becomes much more feasible. I've never been in a real business scenario where I wasn't able to negotiate a brief experimental window during live business hours for at least one client.
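For anyone who hasn't set this up before, here is a minimal sketch of the idea in Python. Everything in it is made up for illustration: `FlagStore`, the `"new-pricing"` flag, and the two pricing functions are hypothetical, and the in-memory store stands in for whatever runtime-configurable storage you actually have (a database row, a config service, and so on).

```python
import threading


class FlagStore:
    """In-memory stand-in for a runtime-configurable flag store (hypothetical)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._enabled = {}  # flag name -> set of client ids with the flag on

    def enable(self, name, client_id):
        with self._lock:
            self._enabled.setdefault(name, set()).add(client_id)

    def disable(self, name, client_id):
        with self._lock:
            self._enabled.get(name, set()).discard(client_id)

    def is_on(self, name, client_id):
        with self._lock:
            return client_id in self._enabled.get(name, set())


FLAGS = FlagStore()


def price_old(cents):
    # Existing production behavior, left untouched.
    return cents * 120 // 100


def price_new(cents):
    # The hypothesis being tested.
    return cents * 120 // 100 + 50


def quote(client_id, cents):
    # Only clients with the flag enabled ever reach the new path.
    if FLAGS.is_on("new-pricing", client_id):
        return price_new(cents)
    return price_old(cents)
```

Calling `FLAGS.enable("new-pricing", "acme")` during the agreed window sends only that client down the new path, and `FLAGS.disable(...)` puts them back without a deploy.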
Feature flags are very powerful, but I think it's worth calling out some pitfalls. A few things we've run into:
- long-lived feature flags that are never cleaned up (which usually leave zombie or partially dead code behind)
- rollout drift, where different environments or customers have different flags set and it's difficult to know who actually has the feature
- not flagging all connected functionality (e.g. one API is missing the flag it should have had)
A good decom/cleanup strategy definitely helps.
Have each flag emit a metric whenever it's triggered. From that you can pretty easily build a bulk task generator: "names X, Y, Z haven't used branch B in >30 days, delete?" Flags that are never triggered are also easy to catch if you force all calls to be grep-friendly (or similar), which is also an easy lint to write: unclear result? Block it, and force `flag("inline constant", ...)`.
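A rough sketch of both halves, in Python. This is my own illustration, not anyone's real library: `flag`, `LAST_SEEN`, `stale_branches`, and `non_literal_flag_calls` are hypothetical names, the "metric" is just a timestamp in a dict, and the lint only looks at bare `flag(...)` calls.

```python
import ast
from datetime import datetime, timedelta, timezone

# flag name -> {branch label -> last time that branch was taken}
# (stand-in for a real metrics backend)
LAST_SEEN = {}


def flag(name, enabled, now=None):
    """Record which branch was taken, then return the decision unchanged."""
    now = now or datetime.now(timezone.utc)
    branch = "on" if enabled else "off"
    LAST_SEEN.setdefault(name, {})[branch] = now
    return enabled


def stale_branches(max_age=timedelta(days=30), now=None):
    """List (flag, branch) pairs not taken within max_age: cleanup candidates."""
    now = now or datetime.now(timezone.utc)
    report = []
    for name, branches in sorted(LAST_SEEN.items()):
        for branch in ("on", "off"):
            seen = branches.get(branch)
            if seen is None or now - seen > max_age:
                report.append((name, branch))
    return report


def non_literal_flag_calls(source):
    """Lint: line numbers of flag(...) calls whose first argument
    is not an inline string constant (and so is not grep-friendly)."""
    bad = []
    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id == "flag"):
            first = node.args[0] if node.args else None
            if not (isinstance(first, ast.Constant)
                    and isinstance(first.value, str)):
                bad.append(node.lineno)
    return sorted(bad)
```

The output of `stale_branches()` is what you would turn into the "delete?" tickets, and `non_literal_flag_calls` is the sort of check you could run in CI to block flag names that are built dynamically.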
Personally I've also had a lot of success requiring "expiration" dates for all flags; once the date has passed, the flag emits a highly visible warning metric. You can always just bump it another month to defer it, but people eventually get sick of doing that and clean it up so it'll go away for good. Make it annoying, so the cleanup is an improvement, and it happens pretty automatically.
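Roughly what I mean, as a Python sketch. The `EXPIRES` registry, the `"new-search"` flag, and `ExpiredFlagWarning` are invented for the example, and I'm using the standard `warnings` module as a stand-in for a real warning metric.

```python
import warnings
from datetime import date

# Hypothetical registry: every flag must be declared with an expiration date.
EXPIRES = {
    "new-search": date(2024, 6, 1),
}


class ExpiredFlagWarning(UserWarning):
    """Raised as a warning when a flag is used past its expiration date."""


def check_expiry(name, today=None):
    """Return True (and warn loudly) if the flag is past its expiration date."""
    today = today or date.today()
    expires = EXPIRES[name]  # KeyError here means the flag was never registered
    if today > expires:
        warnings.warn(
            f"feature flag {name!r} expired on {expires.isoformat()}; "
            "remove it or bump the date",
            ExpiredFlagWarning,
            stacklevel=2,
        )
        return True
    return False
```

Bumping the date in `EXPIRES` silences the warning for another month, which is exactly the small recurring annoyance that eventually gets the flag deleted.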
Yep, archiving feature flags and deleting the dead code is usually thing number 9001 on the list of priorities, so in practice most projects end up with a graveyard of them.
Another issue I've run into a few times is a feature flag that starts as a simple thing but, as new features get added, evolves into a complex bifurcation of logic. Many code paths become dependent on it, which can add crippling complexity to what you're developing.
Feature flags are like bloom filters. They make 98 out of 100 situations better and they make the other 2 worse. When performance is the issue that’s usually fine. When reliability is the issue, that’s not sufficient.
If you work on fifty feature toggles a year, one of them is going to go wrong. If your team is doing a few hundred, you’re gonna have oopsies.
Most of the problematic cases are where the code is set up so that the old path and the new one can’t bypass each other cleanly. They get tangled up, and maybe the toggle gets implemented inverted, so that it’s difficult to remove the old path without breaking the new one.
You can go even further with something like the Scientist gem at the application level, or tee-testing at the data store level. Compare A and A', record the result, and return A. Eventually you reach 100% compatibility between the two (or only deviations that are desirable) and can remove A, leaving only A'.
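If you're not on Ruby, the compare-and-return-A pattern is small enough to sketch by hand. This is a hypothetical Python helper in the same spirit, not the gem's API: `experiment` and its `mismatches` list are names I made up.

```python
import logging

log = logging.getLogger("experiment")


def experiment(name, control, candidate, *args, mismatches=None, **kwargs):
    """Run A (control) and A' (candidate), record any difference, return A."""
    # Exceptions from the control path propagate exactly as they did before.
    result = control(*args, **kwargs)
    try:
        observed = candidate(*args, **kwargs)
    except Exception as exc:
        # A failing candidate must never affect what the caller sees.
        log.exception("experiment %s: candidate raised", name)
        if mismatches is not None:
            mismatches.append((name, args, result, exc))
        return result
    if observed != result:
        log.warning("experiment %s: %r != %r for args %r",
                    name, result, observed, args)
        if mismatches is not None:
            mismatches.append((name, args, result, observed))
    return result
```

Note that this simple version runs both paths inline, so the caller pays for the candidate's latency, and it assumes the candidate has no side effects you would mind running twice.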
I also like recording and replaying production traffic, so that you can do your tee-testing in an environment that doesn't affect production latency, but that's not quite the same thing.
You’ve just resolved a problem I had. I had this problem on a search engine, but I made it as a “v2”. And I told customers to switch to v2. And you know the v2 problem: Discrepancies that customers like. So both versions have fans, but we really need to pull the plug on v1. You’ve just solved it: I should have indexed even records with v1, odd records with v2. Then only I would know which engine was used.
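A tiny Python sketch of that even/odd split, assuming records carry an integer `id`; `engine_for`, `index_records`, and the two indexer callables are hypothetical stand-ins for the real v1 and v2 indexing code.

```python
def engine_for(record_id):
    """Pick the engine from the record id alone, so the split stays stable
    when records are re-indexed."""
    return "v1" if record_id % 2 == 0 else "v2"


def index_records(records, index_v1, index_v2):
    """Send even ids to v1 and odd ids to v2; return how many went to each."""
    counts = {"v1": 0, "v2": 0}
    for record in records:
        engine = engine_for(record["id"])
        if engine == "v1":
            index_v1(record)
        else:
            index_v2(record)
        counts[engine] += 1
    return counts
```

Because the choice depends only on the id, you can always work out after the fact which engine handled a given record, while customers never see a version switch.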