Interesting read and interesting links.
The entry asks "why the square root?"
On seeing it, I immediately noticed that with log-likelihood as the loss function, the whitening metric looks a lot like the Jeffreys prior (https://en.wikipedia.org/wiki/Jeffreys_prior), or an approximation to it, which is a reference prior when the CLT holds. The square root can be derived from the reference-prior structure, but in many modeling scenarios it also has the effect of scaling things proportionally to the scale of the parameters, in the sense of standard error versus sampling variance.
If you think of the optimization method this way, you're essentially reconstructing a kind of Bayesian criterion with a Jeffreys prior.
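For concreteness, here's a minimal scalar sketch (mine, not from the article) of that standard-error-versus-variance point: for a Gaussian likelihood with known sigma, the Fisher information for the mean is I(mu) = n/sigma^2, the Jeffreys prior is proportional to sqrt(I(mu)), and dividing the gradient by sqrt(I) rescales it to standard-error units, whereas a full Newton/natural-gradient step divides by I itself.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, n = 3.0, 100
x = rng.normal(loc=1.0, scale=sigma, size=n)

mu = 0.0                             # current parameter estimate
grad = np.sum(x - mu) / sigma**2     # d/dmu of the Gaussian log-likelihood
fisher = n / sigma**2                # Fisher information I(mu) for this model

newton_step = grad / fisher          # divide by I: natural-gradient/Newton scale
sqrt_step = grad / np.sqrt(fisher)   # divide by sqrt(I): standard-error scale
print(newton_step, sqrt_step)
```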
>Likely, there is a method that can use the orthogonalization machinery of Muon while keeping the signal-to-noise estimation of Adam, and this optimizer will be great.
if you take SOAP and change all the betas to 0, it still works well, so SOAP is already that optimizer (sketch below)
personally I think we've hit the limit and no meaningfully better optimizers are left to be developed
the best we can do is something like making SOAP faster, e.g. by replacing the QR step with something cheaper and maybe warm-started
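To make the betas-to-0 claim concrete, here's a rough numpy sketch (my construction, not the SOAP paper's actual code, which amortizes the eigenbasis with eigendecompositions/QR rather than a fresh SVD every step): with all betas at 0, the covariance accumulators are just G G^T and G^T G, their eigenbases are the left/right singular vectors of G, and the whole update collapses to U V^T, i.e. Muon-style orthogonalization of the gradient matrix.

```python
import numpy as np

def soap_step_beta0(grad, eps=1e-8):
    # with beta=0 the accumulators are exactly G G^T and G^T G, so their
    # eigenbases are the left/right singular vectors of G
    U, _, Vt = np.linalg.svd(grad, full_matrices=False)
    g_rot = U.T @ grad @ Vt.T          # gradient in the eigenbasis (diagonal)
    # Adam with beta1=beta2=0 normalizes each entry by its own magnitude
    g_adam = g_rot / (np.abs(g_rot) + eps)
    return U @ g_adam @ Vt             # rotate back: ~ U V^T, the Muon update

G = np.random.default_rng(0).normal(size=(4, 3))
print(np.round(soap_step_beta0(G), 3))
```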
the square root is from PCA/ZCA whitening: what it does is make the empirical covariance of the gradients the identity, so they become decorrelated, which by the way is exactly what the Hessian does on a quadratic objective
https://en.wikipedia.org/wiki/Whitening_transformation for ZCA whitening
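here's a minimal numpy sketch (mine, not from the thread) of that point: multiplying a batch of correlated "gradients" by C^{-1/2} drives their empirical covariance to the identity

```python
import numpy as np

rng = np.random.default_rng(0)
# fake "gradients": 1000 samples of a correlated 5-dim vector
A = rng.normal(size=(5, 5))
grads = rng.normal(size=(1000, 5)) @ A.T

C = np.cov(grads, rowvar=False)             # empirical covariance
evals, evecs = np.linalg.eigh(C)
W = evecs @ np.diag(evals**-0.5) @ evecs.T  # ZCA whitener: C^{-1/2}

white = grads @ W.T
print(np.round(np.cov(white, rowvar=False), 2))  # ~ identity matrix
```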
which PSGD did you use? because there are apparently like a million of them