Non-convex Second Order Methods

[TODO blog post is under construction, please disregard]

This is a high-level overview of second-order methods for local improvement of non-convex costs, with a particular focus on neural networks (NNs).

Make sure to read the general overview post first. We'll also borrow the setting from the introduction of the first-order methods post, and we will generally add the assumption that \(f\in\mathcal{C}^2\), if not smoother. Some of these methods might never explicitly touch the Hessian, but their analysis and intuition depend critically on it.

TODO caution dismissal of 2nd order, like Goodfellow and Kingma in Adam do (noisiness arg)

TODO Goodfellow 2nd order notes

TODO overview

Full-Matrix Methods

Second-order methods are based on Newton’s method for optimization, which searches for critical points by iteratively solving $\nabla J = \boldsymbol{0}$. This can be done with an iterative line search for $\alpha_t$ combined with Newton updates:

\[
\theta_{t+1} = \theta_t - \alpha_t\left[\nabla^2 J(\theta_t)\right]^{-1}\nabla J(\theta_t).
\]

While this converges quickly in terms of error [TODO source; under what assumptions; perhaps look in convex opt book?], the convergence is only to an $\epsilon$-critical point. As mentioned above, most critical points are saddles, so Newton’s method can converge to a poor solution.
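As a concrete (and deliberately tiny) sketch of the damped iteration above, here is a NumPy toy example on a two-parameter cost with a saddle at the origin; the cost, tolerance, and backtracking constants are all illustrative assumptions rather than anything from this post’s setting:

```python
# Minimal sketch: Newton updates with a backtracking line search for alpha_t.
# The toy cost, tolerance, and backtracking constants are illustrative assumptions.
import numpy as np

def J(theta):
    # Toy non-convex cost with a saddle at the origin (positive curvature in
    # theta[0], negative curvature in theta[1]).
    return 0.5 * theta[0] ** 2 + 0.1 * theta[0] ** 4 - 0.5 * theta[1] ** 2

def grad_J(theta):
    return np.array([theta[0] + 0.4 * theta[0] ** 3, -theta[1]])

def hess_J(theta):
    return np.array([[1.0 + 1.2 * theta[0] ** 2, 0.0],
                     [0.0, -1.0]])

def newton(theta, steps=50, tol=1e-8):
    for _ in range(steps):
        g = grad_J(theta)
        if np.linalg.norm(g) < tol:  # reached an epsilon-critical point
            return theta
        # Newton direction: solve H p = -g (cubic in the parameter count).
        p = np.linalg.solve(hess_J(theta), -g)
        # Crude backtracking line search for alpha_t.
        alpha = 1.0
        while J(theta + alpha * p) > J(theta) and alpha > 1e-10:
            alpha *= 0.5
        theta = theta + alpha * p
    return theta

# Converges to the saddle at the origin: the gradient vanishes there,
# but it is not a minimum.
print(newton(np.array([1.0, 1e-3])))
```

The iteration happily stops at the saddle because it only asks for a vanishing gradient; this is exactly the failure mode the saddle-free methods below try to address.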

Worse yet, constructing and inverting the full Hessian is cubic in the parameter count. Backprop-based Hessian-vector products can reproduce Newton-like iterations far more cheaply, giving rise to a family of quasi-Newton methods; yet these still suffer from the saddle-point issues mentioned above.
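To make the Hessian-vector idea concrete, here is a minimal sketch using JAX’s forward-over-reverse differentiation; the `loss` below is an arbitrary stand-in, not this post’s cost:

```python
# Minimal sketch of a backprop-based Hessian-vector product, assuming JAX.
# `loss` is a placeholder for whatever cost the post's setting uses.
import jax
import jax.numpy as jnp

def loss(theta):
    # Stand-in non-convex cost; replace with the actual NN loss.
    return jnp.sum(jnp.cos(theta) * theta ** 2)

def hvp(theta, v):
    # Forward-over-reverse: differentiate the gradient along direction v.
    # Costs a small constant multiple of one gradient evaluation,
    # and never forms the n x n Hessian.
    return jax.jvp(jax.grad(loss), (theta,), (v,))[1]

theta = jnp.arange(3.0)
v = jnp.array([1.0, 0.0, -1.0])
print(hvp(theta, v))                 # H @ v, without materializing H
print(jax.hessian(loss)(theta) @ v)  # same result, but builds the full Hessian
```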

Inspiration, full gradient: Newton’s Method

Gauss-Newton

Levenberg–Marquardt

L-BFGS

Saddle-Free Stochastic Newton

Dauphin et al. 2014

Conjugate Gradient

overview

list of update rules

Zhang and Li 2011

Cubic Regularization

Agarwal et al. 2017

Stochastic subsample

More analysis

Useful discussion

deep-hessian-free

backprop curvature

Stochastic meta descent

reddit overview

http://www.andrewng.org/portfolio/on-optimization-methods-for-deep-learning/