Prerequisites
- Jacobian:
- is the matrix of all first-order partial derivatives of a vector-valued function.
- that is to say, the function $f(x)$ maps $R^n \to R^m$
- the Jacobian of $f(x)$ is
- $$\mathbf {J} = {\begin{bmatrix}{\frac {\partial \mathbf {f} }{\partial x_{1}}} & \cdots & {\frac {\partial \mathbf {f} }{\partial x_{n}}} \end{bmatrix}} = {\begin{bmatrix} {\dfrac {\partial f_{1}}{\partial x_{1}}} & \cdots & {\frac {\partial f_{1}}{\partial x_{n}}} \\ \vdots & \ddots & \vdots \\ {\dfrac {\partial f_{m}}{\partial x_{1}}} & \cdots & {\dfrac {\partial f_{m}}{\partial x_{n}}}\end{bmatrix}}$$
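- As an illustrative sketch (not from the reference handouts), the Jacobian can be checked numerically with forward differences; the map `f` and step `eps` below are arbitrary choices for the example.

```python
import numpy as np

def f(x):
    # example vector-valued map R^2 -> R^3 (chosen only for illustration)
    return np.array([x[0] * x[1], np.sin(x[0]), x[1] ** 2])

def numerical_jacobian(f, x, eps=1e-6):
    """Forward-difference approximation of the m x n Jacobian of f at x."""
    fx = f(x)
    J = np.zeros((fx.size, x.size))
    for j in range(x.size):
        x_step = x.copy()
        x_step[j] += eps
        J[:, j] = (f(x_step) - fx) / eps
    return J

x0 = np.array([1.0, 2.0])
print(numerical_jacobian(f, x0))
# analytic Jacobian for comparison: [[x1, x0], [cos(x0), 0], [0, 2*x1]]
```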
- Hessian:
- a square matrix of second-order partial derivatives of a scalar-valued function, or scalar field.
- that is to say, the function $f(x)$ maps $R^n \to R$
- the Hessian of $f(x)$ is
- $$\mathbf {H} ={\begin{bmatrix}{\dfrac {\partial ^{2}f}{\partial x_{1}^{2}}}&{\dfrac {\partial ^{2}f}{\partial x_{1}\,\partial x_{2}}} & \cdots & {\dfrac {\partial ^{2}f}{\partial x_{1}\,\partial x_{n}}} \\ {\dfrac {\partial ^{2}f}{\partial x_{2}\,\partial x_{1}}}&{\dfrac {\partial ^{2}f}{\partial x_{2}^{2}}}&\cdots &{\dfrac {\partial ^{2}f}{\partial x_{2}\,\partial x_{n}}}\\ \vdots &\vdots &\ddots &\vdots \\ {\dfrac {\partial ^{2}f}{\partial x_{n}\,\partial x_{1}}}&{\dfrac {\partial ^{2}f}{\partial x_{n}\,\partial x_{2}}}&\cdots &{\dfrac {\partial ^{2}f}{\partial x_{n}^{2}}}\end{bmatrix}} $$
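- Likewise, a minimal sketch (my own example, not from the source) that approximates the Hessian of a scalar field by central differences:

```python
import numpy as np

def g(x):
    # example scalar field R^2 -> R (illustrative only)
    return x[0] ** 2 * x[1] + np.cos(x[1])

def numerical_hessian(g, x, eps=1e-5):
    """Central-difference approximation of the n x n Hessian of g at x."""
    n = x.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            x_pp = x.copy(); x_pp[i] += eps; x_pp[j] += eps
            x_pm = x.copy(); x_pm[i] += eps; x_pm[j] -= eps
            x_mp = x.copy(); x_mp[i] -= eps; x_mp[j] += eps
            x_mm = x.copy(); x_mm[i] -= eps; x_mm[j] -= eps
            H[i, j] = (g(x_pp) - g(x_pm) - g(x_mp) + g(x_mm)) / (4 * eps ** 2)
    return H

print(numerical_hessian(g, np.array([1.0, 2.0])))
# analytic Hessian for comparison: [[2*x1, 2*x0], [2*x0, -cos(x1)]]
```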
- Newton’s method in optimization
- generally, Newton’s method is an iterative method to find the roots of a differentiable function: $f(x)=0$
- In optimization, to find an extremum we usually differentiate and look for the points where the derivative is zero, since those points are the candidate extrema.
- So when Newton's method is applied to optimization, it is used to find the root of the derivative of f: $f'(x)=0$
- In a one-dimensional problem, we hope to find the min or max satisfying: $f'(x^*)=0$
- Taylor-expand $f(x)$ around $x_n$: $$f(x) = f(x_n+\Delta x) \approx f(x_n) + f'(x_n)\Delta x + \frac12 f''(x_n)\Delta x^2$$
- We hope that after n iterations we find the point $x^*$ satisfying $f'(x^*)=0$. Setting the derivative of the expansion with respect to $\Delta x$ to zero: $$f'(x)=0 \to f'(x_n+\Delta x) = 0$$ $$\frac {\partial f(x_n+\Delta x)}{\partial \Delta x} = 0$$ $$\frac {\partial (f(x_n)+f'(x_n)\Delta x + \frac 12 f''(x_n)\Delta x^2)}{\partial \Delta x} = 0$$ $$f'(x_n) + f''(x_n)\Delta x =0 $$ $$\therefore \Delta x = - \frac {f'(x_n)}{f''(x_n)}$$
- So in Newton's method, the step toward the extremum is $\Delta x = - \frac {f'(x_n)}{f''(x_n)}$
- In higher dimensions, $\Delta x = - [Hf(x_n)]^{-1} \nabla {f(x_n)}$
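- A minimal sketch of the 1-D iteration above, using an arbitrary test function $f(x)=x^4-3x^2+x$ (my own choice); each update applies $\Delta x = - f'(x_n)/f''(x_n)$:

```python
def newton_1d(df, d2f, x0, tol=1e-10, max_iter=50):
    """Newton's method for optimization: iterate x <- x - f'(x) / f''(x)."""
    x = x0
    for _ in range(max_iter):
        step = -df(x) / d2f(x)
        x += step
        if abs(step) < tol:
            break
    return x

# example: f(x) = x^4 - 3x^2 + x, so f'(x) = 4x^3 - 6x + 1, f''(x) = 12x^2 - 6
df = lambda x: 4 * x ** 3 - 6 * x + 1
d2f = lambda x: 12 * x ** 2 - 6
print(newton_1d(df, d2f, 2.0))  # converges to a stationary point of f near the start
```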
Gauss-Newton method
- the problem is as follows: $$\min_{x\in R^n} f(x) = \frac 12 \sum_{i=1}^m r_i(x)^2$$
- where $m \ge n$
- n is the number of parameters
- m is the number of observations
- r(x) is the vector of residuals $r(x) = {\begin{bmatrix} r_1(x) \\ r_2(x) \\ \vdots \\ r_m(x) \end{bmatrix}}$
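- For concreteness, a hedged example (not taken from the source): fitting $y \approx a e^{b t}$ to m data points gives n = 2 parameters $x = (a, b)$ and residuals $r_i(x) = a e^{b t_i} - y_i$; the data values below are made up.

```python
import numpy as np

def residuals(x, t, y):
    """r(x): m residuals of the exponential model y ~ a * exp(b * t)."""
    a, b = x
    return a * np.exp(b * t) - y          # shape (m,)

def jacobian(x, t):
    """J(x): m x n Jacobian of r with respect to (a, b)."""
    a, b = x
    return np.column_stack([np.exp(b * t),           # dr_i/da
                            a * t * np.exp(b * t)])  # dr_i/db

t = np.array([0.0, 1.0, 2.0])              # m = 3 observations (made up)
y = np.array([2.0, 3.3, 5.4])
x = np.array([2.0, 0.5])                   # n = 2 parameters (a, b)
print(residuals(x, t, y).shape, jacobian(x, t).shape)   # (3,) (3, 2)
```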
- Note that there could be 3 spaces when considering such a problem:
- the parameter space
- the observation space
- the model space
- the gradient of f(x) is: $$\nabla f(x) = \nabla r(x)\, r(x) = J(x)^Tr(x)$$
- where $J(x) = \nabla r(x)^T$ is the Jacobian of r(x). (Note that only the vector-valued r(x) has a Jacobian; the scalar-valued f(x) has a gradient instead.)
- The derivation follows directly by differentiating the original expression.
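- A quick numerical sanity check of $\nabla f(x) = J(x)^T r(x)$ (an illustrative sketch on the exponential-fit residuals with made-up data; the finite-difference comparison is my own addition):

```python
import numpy as np

t = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([2.0, 2.7, 3.6, 5.0])               # made-up observations

r = lambda x: x[0] * np.exp(x[1] * t) - y        # residual vector r(x)
f = lambda x: 0.5 * np.sum(r(x) ** 2)            # f(x) = 1/2 * sum_i r_i(x)^2
J = lambda x: np.column_stack([np.exp(x[1] * t),
                               x[0] * t * np.exp(x[1] * t)])

x0 = np.array([2.0, 0.3])
grad_analytic = J(x0).T @ r(x0)                  # J(x)^T r(x)
eps = 1e-6
grad_numeric = np.array([(f(x0 + eps * e) - f(x0)) / eps for e in np.eye(2)])
print(grad_analytic, grad_numeric)               # the two should agree closely
```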
- the Hessian of f(x) is: $$\nabla^2 f(x) = \nabla r(x) \nabla r(x)^T + \sum_{i=1}^m r_i(x) \nabla^2 r_i(x)$$ $$= J(x)^TJ(x) + Q(x)$$
- In optimization, we hope the residuals approach 0 near the solution; if $r(x) \approx 0$, then $Q(x) \approx 0$
- A method that uses the approximation $Q(x) \approx 0$ is called a Gauss-Newton method
- If we compute the search direction from the Newton equation: $$\nabla^2 f(x)\, p^N = - \nabla f(x)$$
- substituting the Hessian expression above: $$(J^T(x)J(x) + Q(x))\,p^N = - \nabla f(x)$$
- if $Q(x)$ is close to 0: $$J^T(x)J(x)\,p^{GN} = - \nabla f(x) = -J(x)^Tr(x)$$
- if J(x) has full rank, we get a unique solution; if not, the problem becomes under-determined or over-determined.
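- Putting the pieces together, a minimal Gauss-Newton sketch for the exponential-fit example (my own illustration with made-up data, assuming J(x) keeps full rank): each iteration solves $J^TJ\,p = -J^Tr$ and updates $x \leftarrow x + p$.

```python
import numpy as np

t = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
y = np.array([2.1, 2.6, 3.4, 4.3, 5.6])          # made-up observations

def r(x):                                        # residual vector r(x)
    return x[0] * np.exp(x[1] * t) - y

def J(x):                                        # m x n Jacobian of r(x)
    return np.column_stack([np.exp(x[1] * t),
                            x[0] * t * np.exp(x[1] * t)])

def gauss_newton(x, tol=1e-10, max_iter=50):
    for _ in range(max_iter):
        Jx, rx = J(x), r(x)
        # Gauss-Newton step: solve J^T J p = -J^T r (normal equations)
        p = np.linalg.solve(Jx.T @ Jx, -Jx.T @ rx)
        x = x + p
        if np.linalg.norm(p) < tol:
            break
    return x

print(gauss_newton(np.array([2.0, 0.3])))        # fitted parameters (a, b)
```
- In practice one would typically solve the linear least-squares subproblem directly (e.g. with `np.linalg.lstsq(Jx, -rx)`) instead of forming $J^TJ$ explicitly, which is better conditioned.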
Pros and Cons
- If $r(x^∗) = 0$, then the approximation $Q(x) ≈ 0$ is good and the Gauss-Newton method will behave like the Newton method close to the solution, i.e. converge quadratically if $J(x^∗)$ has full rank.
- The advantage over the Newton method is that we do not need to calculate the second-order derivatives $∇^2 r_i(x)$.
- However, if any residual component $r_i(x^∗)$ and/or the corresponding curvature $∇^2 r_i(x)$ is large, the approximation $Q(x) ≈ 0$ will be poor, and the Gauss-Newton method will converge slower than the Newton method.
- For such problems, the Gauss-Newton method may not even be locally convergent, i.e. without a global strategy such as the line search, it wouldn’t converge no matter how close to the solution we start.
Reference:
https://www8.cs.umu.se/kurser/5DA001/HT07/lectures/lsq-handouts.pdf
http://fourier.eng.hmc.edu/e176/lectures/NM/