# Non-convex Second Order Methods

[TODO blog post is under construction, please disregard]

This is a high-level overview of second-order methods for local improvement of non-convex costs, with a particular focus on neural networks (NNs).

Make sure to read the general overview post first. We’ll also borrow the setting from the introduction of the first-order methods post, generally adding the assumption that $$f\in\mathcal{C}^2$$, if not smoother. Some of these methods may never explicitly touch the Hessian, but their analysis and intuition depend critically on it.

TODO overview

## Full-Matrix Methods

Second-order methods are based on Newton’s method for optimization, which searches for critical points by iteratively solving $$\nabla J = \bsz$$. Each iteration performs a line search for $$\alpha_t$$ followed by a Newton update, $$\bsth_{t+1}=\bsth_{t}-\alpha_{t}H_{t}^{-1}\nabla_t$$, where $$H_t$$ and $$\nabla_t$$ are the Hessian and gradient of $$J$$ at $$\bsth_t$$. While this converges quickly in terms of error [TODO source; under what assumptions; perhaps look in convex opt book?], the convergence is only to an $$\epsilon$$-critical point. As mentioned above, most critical points are saddles, so Newton’s method tends to land on poor solutions.
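
As a concrete sketch of the iteration above (impractical at NN scale, where the full Hessian cannot be formed), here is a damped Newton loop with backtracking line search. The toy cost `f` and all names are illustrative assumptions, and JAX is used only to obtain exact derivatives:

```python
import jax
import jax.numpy as jnp

def f(theta):
    # Toy non-convex cost; stands in for the training objective J(theta).
    return jnp.sum(theta ** 4) - jnp.sum(theta ** 2) + 0.1 * theta[0] * theta[1]

def newton_step(theta, alpha0=1.0, beta=0.5, c=1e-4):
    g = jax.grad(f)(theta)               # gradient at theta_t
    H = jax.hessian(f)(theta)            # full Hessian H_t (p x p; only feasible for tiny p)
    direction = jnp.linalg.solve(H, g)   # H_t^{-1} grad_t, without forming the inverse
    # Backtracking line search for alpha_t (Armijo condition).
    alpha, f0 = alpha0, f(theta)
    while f(theta - alpha * direction) > f0 - c * alpha * (g @ direction):
        alpha *= beta
        if alpha < 1e-10:  # direction may not be a descent direction near saddles
            break
    return theta - alpha * direction

theta = jnp.array([1.5, -0.7, 0.3])
for _ in range(20):
    theta = newton_step(theta)
print(theta, jax.grad(f)(theta))  # converges to a nearby critical point (possibly a saddle)
```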

Worse yet, constructing and inverting the full Hessian is cubic in the parameter count. Backprop-based Hessian-vector products can reconstruct Newton-like iterations more efficiently, yielding a family of quasi-Newton and Hessian-free methods; yet these still suffer from the saddle issues mentioned above.
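
To make the Hessian-vector-product idea concrete, here is a minimal Hessian-free-style sketch, assuming the same toy cost as above: the product $$H v$$ is obtained by differentiating the gradient (one extra backprop-like pass) and handed to a conjugate-gradient solve in place of an explicit inverse. The damping constant and iteration count are illustrative assumptions, not a prescription:

```python
import jax
import jax.numpy as jnp
from jax.scipy.sparse.linalg import cg

def f(theta):
    # Same toy non-convex cost as before, standing in for J(theta).
    return jnp.sum(theta ** 4) - jnp.sum(theta ** 2) + 0.1 * theta[0] * theta[1]

def hvp(f, theta, v):
    # Hessian-vector product H(theta) @ v without materializing H:
    # forward-mode differentiation applied to the reverse-mode gradient.
    return jax.jvp(jax.grad(f), (theta,), (v,))[1]

def hessian_free_step(f, theta, damping=1e-3, maxiter=50):
    g = jax.grad(f)(theta)
    # Approximately solve (H + damping * I) d = g by conjugate gradient,
    # touching the Hessian only through matrix-vector products. CG assumes a
    # positive-definite operator, which an indefinite (saddle-ridden) Hessian
    # violates -- exactly the issue discussed above.
    matvec = lambda v: hvp(f, theta, v) + damping * v
    d, _ = cg(matvec, g, maxiter=maxiter)
    return theta - d

theta = jnp.array([1.5, -0.7, 0.3])
theta = hessian_free_step(f, theta)
```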

Gauss-Newton

Levenberg–Marquardt

## L-BFGS

Dauphin et al 2014

overview

list of update rules

Zhang and Li 2011

## Cubic Regularization

Agarwal et al 2017

Stochastic subsample

More analysis

Useful discussion

deep-hessian-free

backprop curvature

Stochastic meta descent

reddit overview

http://www.andrewng.org/portfolio/on-optimization-methods-for-deep-learning/