Today we will look at a variant of gradient descent called the steepest descent algorithm.

1. Steepest Descent Algorithm

Having seen the Gradient Descent Algorithm, we now turn our attention to yet another member of the Descent Algorithms family — the Steepest Descent Algorithm. In this algorithm, we optimize the descent direct to obtain the maximum decrease of the objective function. First, let us look at the descent direction for Steepest Direction. Let ${f: \mathbb{R}^n \rightarrow \mathbb{R}}$ be continuous and differentiable. Using Taylor Series, we get

$\displaystyle f(x+v) = f(x) + \nabla f(x)^T v.$

Our objective is to minimize the second term i.e. ${\nabla f(x)^T v}$ term. The descent direction is defined in two steps. First, we look at the so-called normalized steepest descent direction. This in turn helps us define the steepest descent direction. The normalized steepest descent direction is defined as

$\displaystyle \Delta x_{nsd} = {\text{argmin}_v} \{ \nabla f(x)^T v : \| v \| = 1 \}.$

Observe that the descent direction depends not only on the point ${x}$, but also on the norm ${\| .\|}$. As is clear from the definition, the name normalized is used since we are searching only through those vectors that have unit magnitude. Looking closely, we notice that this definition is similar to the one of dual norm of a given norm. Hence, the minimum value is given by ${-\| \nabla f(x) \|_*}$. The negative sign compensates for the “min” used here as opposed to the “max” used in the definition of a dual norm. Now we can define the descent direction for the Steepest Descent Algorithm as follows:

$\displaystyle \Delta x_{sd} = \| \nabla f(x) \|_* \Delta x_{nsd}.$

The following calculation verifies that the descent direction defined above satisfies the sufficiency condition required, and also justifies the use of ${\| \nabla f(x) \|_*}$ term used in the definition:

$\displaystyle \begin{array}{rcl} \nabla f(x)^T \Delta x_{sd} & = & \| \nabla f(x) \|_* \nabla f(x)^T \Delta x_{nsd}, \\ & \overset{(a)}{=} & - \| \nabla f(x) \|_*^2, \\ & < & 0, \end{array}$

where ${(a)}$ follows from the definition of normalized steepest descent direction. Hence, the update equation is

$\displaystyle x_{k+1} = x_k + t \Delta x_{sd}.$

The following Lemma casts the descent direction as a simpler optimization problem.

Lemma 1

$\displaystyle \Delta x_{sd} = \text{argmin}_v \left\lbrace \nabla f(x)^T v + \frac{\| v \|^2}{2} \right\rbrace.$

Proof: The proof follows by representing ${ v= (w, t)}$, where ${t>0}$, ${\|w\| =1}$ and jointly optimizing over ${t}$ and ${w}$. $\Box$

We will now look at the descent directions generated by a quadratic norm.

Definition 2 Let ${P}$ be a fixed Positive Definite matrix i.e., ${P \in S^n_{++}}$. The quadratic norm of any vector ${z \in \mathbb{R}^n}$ is then defined as

$\displaystyle \| z \|_P = \left( z^T P z \right)^{\frac{1}{2}} = \| P^{\frac{1}{2}} z \|_2$

If we define ${\tilde{x}}$ = ${P^{\frac{1}{2}} x}$, then this essentially corresponds to a change of basis. Since ${P \in S^n_{++}}$, ${x}$ = ${P^{-\frac{1}{2}} \tilde{x}}$ is well defined. This in turn lets us allow the quadratic expression to be simplified as follows:

$\displaystyle \begin{array}{rcl} x^T P x & = & \tilde{x}^T P^{-\frac{1}{2}} P P^{-\frac{1}{2}} \tilde{x}, \\ & = & \tilde{x}^T \tilde{x}. \end{array}$

Hence, an ellipsoid is converted to a sphere. In the case of Quadratic Norm, the descent direction is

$\displaystyle \Delta x_{sd} = \underset{v}{\text{argmin}} \left\lbrace \nabla f(x)^T v + \frac{1}{2} v^T P v \right\rbrace.$

Since the above minimization problem is unconstrained, setting its gradient to zero gives the minimizer. Therefore,

$\displaystyle \nabla f(x) + P v^* =0$

and hence

$\displaystyle \Delta x_{sd} = P^{-1} \nabla f(x).$

This norm can be used when the condition number of the objective function is bad. In this case, if one has some idea of the Hessian at the optimal point the matrix ${P }$ can be chosen to be equal to ${\nabla^2f(x^*)}$. This would improve the condition number and the convergence rate of the algorithm.

1.1. Convergence Rate for Steepest Descent with Back Tracking

Theorem 3 Let ${f: \mathbb{R}^n \rightarrow \mathbb{R}}$ be ${\beta-}$smooth and ${\alpha-}$strongly convex. Then Steepest Descent with Back Tracking has a linear convergence rate.

Proof: See Boyd. $\Box$