>If the Hessian-vector product is Hv for some fixed vector v, we're interested in solving Hx=v for x. The hope is to soon use this as a preconditioner to speed up stochastic gradient descent.
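For context, the Hessian-vector product Hv can be computed matrix-free with two backward passes (Pearlmutter's trick). A minimal PyTorch sketch, not the author's code, with `loss` and `params` standing in for the actual model and batch:

```python
import torch

def hvp(loss, params, v):
    """Return H @ v, where H is the Hessian of `loss` w.r.t. `params`.

    `params` is a list of tensors and `v` a matching list of tensors.
    """
    # First backward pass, keeping the graph so we can differentiate again.
    grads = torch.autograd.grad(loss, params, create_graph=True)
    # g^T v, then a second backward pass gives d(g^T v)/dtheta = H v.
    gv = sum((g * vi).sum() for g, vi in zip(grads, v))
    return torch.autograd.grad(gv, params)
```

Solving Hx = v then only needs repeated calls to this matvec (CG, GMRES, etc.), never H itself.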
Silly question, but if you have some clever way to compute the inverse Hessian, why not go all the way and use it for Newton's method, rather than as a preconditioner for SGD?
Good q. The method computes the Hessian-inverse on a batch. When people say "Newton's method" they're often thinking of H^{-1} g, where both the Hessian and the gradient g are computed on the full dataset. I thought saying "preconditioner" instead of "Newton's method" would make it clear this computes H^{-1} g on a batch, not on the full dataset.
Just a heads up in case you didn't know: taking the Hessian over batches is indeed referred to as Stochastic Newton, and methods of this kind have been studied for quite some time. Inverting the Hessian is often done with CG, which tends to work pretty well. The only problem is that the Hessian is often not invertible, so you need a regularizer (same as here, I believe). Newton methods work at scale, but no one with the resources to try them at scale seems to be aware of them.
It's an interesting trick though, so I'd be curious to see how it compares to CG.
[1] https://arxiv.org/abs/2204.09266 [2] https://arxiv.org/abs/1601.04737 [3] https://pytorch-minimize.readthedocs.io/en/latest/api/minimi...
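To make the CG-with-a-regularizer idea above concrete, here is a hedged sketch (mine, not taken from [1]-[3]): a plain conjugate-gradient loop that solves (H + lam*I) x = g using only Hessian-vector products. `matvec` is assumed to apply the batch Hessian (e.g. the `hvp` sketch above, on a flattened parameter vector), and `lam` is an assumed damping constant.

```python
import torch

def cg_solve(matvec, g, lam=1e-3, iters=50, tol=1e-6):
    """Approximately solve (H + lam*I) x = g, given matvec(v) = H @ v."""
    x = torch.zeros_like(g)
    r = g.clone()                  # residual b - A x, with x = 0
    p = r.clone()
    rs_old = r.dot(r)
    for _ in range(iters):
        Ap = matvec(p) + lam * p   # damped Hessian-vector product
        alpha = rs_old / p.dot(Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = r.dot(r)
        if rs_new.sqrt() < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x                       # ~ (H + lam*I)^{-1} g
```

The returned x is the damped stochastic Newton step for the batch, or equivalently a Hessian-preconditioned gradient.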
For solving physics equations there are also Jacobian-free Newton-Krylov methods.
Yes, the combination of Krylov and quasi-Newton methods is very successful for physics problems (https://en.wikipedia.org/wiki/Quasi-Newton_method).
IIRC, GMRES, for example, is a popular Krylov subspace method.
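"Matrix-free" here just means the Krylov solver only ever calls a matvec. A rough SciPy sketch; `apply_H` below is a made-up linear operator standing in for a real Hessian/Jacobian-vector product:

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, gmres

n = 1000

def apply_H(v):
    # Placeholder matvec; in practice this would be an autodiff or
    # finite-difference Hessian/Jacobian-vector product.
    return 2.0 * v + 0.1 * np.roll(v, 1)

A = LinearOperator((n, n), matvec=apply_H)
b = np.random.randn(n)
x, info = gmres(A, b, restart=50, maxiter=200)  # info == 0 means converged
```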
I recently used these methods, and BFGS worked better than CG for me.
Absolutely plausible (BFGS is awesome), but this is situation-dependent (no free lunch and all that). In the context of training neural networks, it gets even more complicated once you take into account the implicit regularisation coming from the optimizer. It's often worthwhile to try an SGD-type optimizer, BFGS, and a Newton variant to see which works best for a particular problem.
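If anyone wants to try that kind of comparison quickly, scipy.optimize.minimize exposes both solvers behind one interface. A toy sketch on the Rosenbrock function, which of course says nothing about any particular real problem:

```python
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

x0 = np.full(10, 1.5)
for method in ("BFGS", "CG"):
    res = minimize(rosen, x0, jac=rosen_der, method=method)
    print(method, "iterations:", res.nit, "final f:", res.fun)
```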
I'd call it "Stochastic Newton's Method" then. :-)
Fair, thanks. I'll sleep on it and update the paper if it still sounds right tomorrow.
Probably my nomenclature bias comes from having started this project as a way to find new preconditioners for deep nets.