I'm no expert on chess engine development, but it's surprising to me that both lc0 and stockfish use SPSA for "tuning" the miscellaneous magic numbers that appear in the system rather than other black-box optimization algorithms like Bayesian optimization or evolutionary algorithms. As far as I am aware, both of those approaches are used more often for similar tasks in non-chess applications (e.g. hyperparameter optimization in ML training) and have much more active research communities than SPSA.
Is there something special about these chess engines that makes SPSA more desirable for these use cases specifically? My intuition is that something like Bayesian optimization could yield stronger optimization results, and that the computational overhead of doing BO would be minimal compared to the time it takes to train and evaluate the models.
One thing I wonder is why design of experiments (DOE) methodology is so seldom used for these things.
Statisticians and operations researchers have spent a hundred years working out how to run as few experiments as possible while tweaking the parameters that have the highest impact, with a statistical basis for trusting that the selections are good.
In the language of information theory and decision trees, these experiments are, in some sense, trying to "branch" on the entropy-minimizing variables.
SPRT (the sequential probability ratio test) is used religiously in engine development today. There is enormous incentive to test efficiently.
https://github.com/official-stockfish/fishtest/wiki/Fishtest...
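The mechanics are simple enough to sketch: keep playing games and updating a log-likelihood ratio until it crosses an accept or a reject bound. Here is a toy version for a plain win/loss (Bernoulli) model -- fishtest's real implementation is more sophisticated and models pentanomial game-pair outcomes -- just to show the shape of the test:

    # Toy SPRT for engine A/B testing under a Bernoulli win/loss model.
    # H0: the patch wins with probability p0; H1: with probability p1.
    import math
    import random

    def sprt(play_game, p0=0.50, p1=0.52, alpha=0.05, beta=0.05, max_games=100_000):
        lower = math.log(beta / (1 - alpha))   # cross below: accept H0, reject patch
        upper = math.log((1 - beta) / alpha)   # cross above: accept H1, accept patch
        llr = 0.0
        for n in range(1, max_games + 1):
            if play_game():                    # 1 = patch won, 0 = patch lost
                llr += math.log(p1 / p0)
            else:
                llr += math.log((1 - p1) / (1 - p0))
            if llr <= lower:
                return "reject patch", n
            if llr >= upper:
                return "accept patch", n
        return "inconclusive", max_games

    # Stand-in for actually playing a game against the baseline:
    result, games = sprt(lambda: random.random() < 0.515)
    print(result, "after", games, "games")

The payoff is that the test stops as soon as the evidence is strong enough, rather than committing to a fixed number of games in advance.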
DOE is still very useful in many contexts, but when it's possible to use a sequential design, these adaptive techniques really start to pull away in terms of optimization quality.
There's simply a lot of sample efficiency to gain by adapting the experiment to incoming data: a regime where one can repeatedly design n candidates, observe their effects, and iterate m times beats a setting where one must commit up front to a fixed experiment of n*m samples.
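As a toy illustration of that gap (the noisy "Elo" objective and the zoom-in rule here are made up for the example): m adaptive rounds of n candidates that concentrate around the incumbent best, versus one fixed batch of n*m points:

    import random

    def noisy_elo(x):                          # stand-in for "Elo from a test match"
        return -(x - 3.7) ** 2 + random.gauss(0, 0.5)

    def sequential(n=10, m=10, lo=0.0, hi=10.0):
        center, width = (lo + hi) / 2, (hi - lo) / 2
        best_x, best_y = center, noisy_elo(center)
        for _ in range(m):                     # adapt each round to what we saw
            for _ in range(n):
                x = random.uniform(max(lo, center - width),
                                   min(hi, center + width))
                y = noisy_elo(x)
                if y > best_y:
                    best_x, best_y = x, y
            center, width = best_x, width * 0.7   # shrink around the incumbent
        return best_x

    def fixed(n=10, m=10, lo=0.0, hi=10.0):
        xs = [random.uniform(lo, hi) for _ in range(n * m)]
        return max(xs, key=noisy_elo)

    print("sequential:", sequential(), "fixed:", fixed())

Same budget of 100 noisy evaluations either way; the sequential version just spends the later ones where the earlier ones said to look.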
Engines like Stockfish can have over 100 "search parameters" that need to be tuned. To the best of my knowledge, SPSA is preferred because its computational cost typically does not depend on the number of parameters: each iteration perturbs all parameters simultaneously and needs only two objective evaluations, whatever the dimension.
Or, if attempting to use SPSA to, say, perform a final post-training tune of the last layers of a neural network, this could be thousands of parameters or more.
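For concreteness, here is a minimal sketch of vanilla SPSA (Spall's 1992 formulation, with his commonly recommended gain exponents); the toy noisy objective stands in for something like "negative Elo from a test match". Note the two objective evaluations per step, regardless of how many parameters theta holds:

    import numpy as np

    def spsa(f, theta0, steps=2000, a=0.1, c=0.1, A=100.0, alpha=0.602, gamma=0.101):
        theta = np.asarray(theta0, dtype=float).copy()
        for k in range(steps):
            ak = a / (k + 1 + A) ** alpha      # step-size schedule
            ck = c / (k + 1) ** gamma          # perturbation-size schedule
            delta = np.random.choice([-1.0, 1.0], size=theta.shape)  # Rademacher
            y_plus = f(theta + ck * delta)     # objective evaluation 1
            y_minus = f(theta - ck * delta)    # objective evaluation 2
            g_hat = (y_plus - y_minus) / (2.0 * ck * delta)  # gradient estimate
            theta -= ak * g_hat
        return theta

    # Toy noisy objective; 200 parameters, still only 2 evaluations per step:
    f = lambda t: np.sum((t - 2.0) ** 2) + np.random.normal(0, 0.1)
    print(spsa(f, np.zeros(200))[:5])          # should drift toward 2.0

Compare that with GP-based BO, where fitting the surrogate and maximizing the acquisition function gets painful as both the dimension and the number of observations grow.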
The concern about the dimensionality of the search space is real, especially once things cross into the hundreds -- BO would certainly not be useful for the kind of post-training tune the blog post describes doing with SPSA.
That being said, it still seems possible to me that using a different black-box optimization technique for a fairly constrained set of related magic numbers (say, fewer than 50) might lead to some real performance improvements in these systems; it could be worth reaching out to the lc0 or stockfish development communities.
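For what it's worth, a hedged sketch of what that could look like with Gaussian-process BO via scikit-optimize; measure_elo here is a hypothetical stand-in for playing a rating match at the given parameter values (the real thing would be the expensive part):

    import random
    from skopt import gp_minimize
    from skopt.space import Integer

    # Hypothetical: 20 related magic numbers, each an integer in [0, 200].
    space = [Integer(0, 200, name=f"magic_{i}") for i in range(20)]

    def measure_elo(params):                   # toy stand-in for a test match
        target = list(range(5, 205, 10))
        return -sum((p - t) ** 2 for p, t in zip(params, target)) / 1000 \
               + random.gauss(0, 1)

    def objective(params):
        return -measure_elo(params)            # gp_minimize minimizes, so negate Elo

    res = gp_minimize(objective, space, n_calls=200,
                      noise="gaussian",        # game outcomes are noisy
                      random_state=0)
    print(res.x, -res.fun)

At a few hundred evaluations the GP overhead is negligible next to the cost of actually playing the games, which was the original intuition upthread.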