I’m curious about the practical workflow after you’ve written and verified a TLA+ spec. How do you go from the TLA+ proof to actual code? Is there any established process or best practices for translating the spec into implementation while ensuring you’re not introducing bugs during that translation? And once you’ve written the code, how do you maintain the connection between spec and implementation as things evolve?
You can think of a TLA+ spec as the level of design between the "natural language design" you have in your head or on paper and the code. It makes it easier to go from the idea to the code, and allows you to explore ideas, at any level of abstraction, with full rigour.
The question of how you maintain the spec as the code changes could be the same as the question of how you maintain the natural language spec as the code changes. Sometimes you just don't, and you may want to do so only when there is a very substantial change that you'd also like to explore with rigour.
However, there are several approaches for creating a more formal (i.e. mechanical) connection between the code and the TLA+ spec, such as generating tests or checking production logs against the high-level spec. That said, you already get so much benefit even without such a mechanical connection that it isn't strictly needed; it can be the cherry on top in situations that really require it.
I think that the greatest hurdle in getting the most out of TLA+ is internalising that the spec isn't code, and isn't supposed to be. For example, if you're making use of some hashing function or a sorting function and the subject of your design isn't the hashing or sorting algorithm itself, in TLA+ you don't want to write a hashing/sorting spec or reuse one from a library; rather, you'd write something like "assume there exists a hashing/sorting function".
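As a concrete illustration, here's a minimal TLA+ sketch of "assume there exists a hashing function" (the module name, the Keys constant, and the 0..255 range are all made up for the example; you'd state whatever property your design actually relies on):

    ---- MODULE HashAbstraction ----
    EXTENDS Naturals

    CONSTANTS Keys,      \* the set of keys the system handles (left abstract)
              Hash(_)    \* an uninterpreted hashing operator, not an algorithm

    \* State only the property the design depends on,
    \* e.g. that hashes land in some fixed range:
    ASSUME \A k \in Keys : Hash(k) \in 0..255
    ====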
That's why you may end up writing different specs for the same application, each focusing on a particular aspect of the system at the appropriate level of detail for that aspect. A single line of TLA+ spec could correspond to anywhere between 1 and 100K lines of code. The use is more similar to how a physicist would describe a system than to code, which is (part of) the system. For example, if the physicist is interested only in the orbit of a planet around its star, she may represent the planet as a point mass; if she's interested in weather patterns on the planet, then there's an entirely different description. It's a map, not the territory.
I've never done that before, but this recent post from the MongoDB team has an interesting view on this.
https://www.mongodb.com/blog/post/engineering/conformance-ch...
You can also go the other way around and generate traces [1] from the TLA+ or Quint [2] specs using the Apalache model checker [3], map each action in the trace to an action on your system under test, and check that the abstract and concrete states match at each step.
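One low-tech way to make the model checker produce such a trace (this works with TLC as well as Apalache) is to assert the negation of the scenario you want as an "invariant" and take the counterexample it reports as the trace to replay. A toy sketch, with a made-up counter purely for illustration:

    ---- MODULE TraceGen ----
    EXTENDS Naturals

    VARIABLE x

    Init == x = 0
    Next == x' = x + 1

    \* Claim the interesting state is never reached; the counterexample the
    \* model checker reports is a step-by-step trace leading to x = 5, and
    \* each step can then be mapped to an action on the system under test.
    NeverReachesFive == x # 5
    ====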
Right now the best practice is generating test suites from the TLA+ spec, though it's bespoke for each company that does it and there are no production-ready, general-purpose tools that do that. LLMs help.
Runtime monitoring of system behavior against the spec is one way to close the gap. We wrote about our experiences with one tool for that (PObserve) here: https://cacm.acm.org/practice/systems-correctness-practices-...
I’m wondering the same. I’ve read multiple articles about formal methods and how they’ve been used to find obscure bugs in distributed systems, but usually they just show a proof and talk about formal methods without concrete examples.
I've thought that one thing that could help would be annotations on your code that index into the spec. That way you at least have a one-way binding: an if-you're-changing-this-you-should-at-least-review-that kind of thing.
How is that different from the many algorithms whose proofs of performance characteristics are purely in math?