Nesterov’s Accelerated Gradient Descent Algorithm

We have seen that no first order method can converge faster than {\Theta(1/k^2)} on the class of {\beta}-smooth functions. However, GDA achieves only a {\Theta(1/k)} convergence rate for {\beta}-smooth functions. We will now look at Nesterov's accelerated method, which achieves the optimal rate of convergence.
Nesterov's accelerated method

\displaystyle y_{s+1} = x_s -\frac{1}{\beta}\nabla f(x_s),

\displaystyle  x_{s+1} = y_{s+1} + \frac{s+2}{s+5}(y_{s+1} -y_s).  \ \ \ \ \ (1)

Essentially, the accelerated method has memory: the current point depends on the previous two iterates. This technique achieves the optimal convergence rate.
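For concreteness, here is a minimal numerical sketch of these updates (the function and variable names are ours, and the quadratic test function is an assumption made purely for illustration):

    import numpy as np

    def nesterov_agd(grad_f, x1, beta, num_iters):
        # Sketch of the accelerated method above; all names are ours.
        # grad_f returns the gradient of a beta-smooth convex f,
        # x1 is the starting point (we take x_1 = y_1).
        x = y_prev = np.asarray(x1, dtype=float)
        for s in range(1, num_iters + 1):
            y = x - grad_f(x) / beta                  # y_{s+1} = x_s - (1/beta) grad f(x_s)
            x = y + (s + 2) / (s + 5) * (y - y_prev)  # momentum step from (1)
            y_prev = y
        return y_prev

    # Illustration on a smooth convex quadratic f(x) = 0.5 x^T A x (our choice)
    A = np.diag([1.0, 0.01])
    x_best = nesterov_agd(lambda x: A @ x, np.array([5.0, 5.0]), beta=1.0, num_iters=200)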

Lemma 1 The accelerated method has a convergence rate of {\Theta(1/k^2)}, where {k} is the iteration number.

Proof: We have

\displaystyle f(y_{s+1}) - f(y_s) = f(y_{s+1}) - f(x_s) +f(x_s)-f(y_s)

Using the quadratic upper bound for {\beta}-smooth functions on the first difference and the gradient inequality (convexity) on the second, we obtain

\displaystyle  \begin{array}{rcl}  f(y_{s+1}) - f(y_s)&\leq& \nabla f(x_s)^T(y_{s+1}-x_s) +\frac{\beta}{2}\|y_{s+1} - x_s\|^2 +\nabla f(x_s)^T(x_s-y_s)\\ &=&\nabla f(x_s)^T(y_{s+1}-y_s) + \frac{\beta}{2}\|y_{s+1} - x_s\|^2. \qquad (*) \end{array}

The above inequality also holds with {y_s} replaced by {y^*}. Hence we also have

\displaystyle f(y_{s+1}) - f(y^*) \leq \nabla f(x_s)^T(y_{s+1}-y^*) + \frac{\beta}{2}\|y_{s+1} - x_s\|^2. \qquad (**)

Let {\delta_s = f(y_s)-f(y^*)}. Multiplying {(*)} with {(1+s/2)} and adding to {(**)}, we obtain

\displaystyle  \begin{array}{rcl}  (2+s/2)\delta_{s+1}-(1+s/2)\delta_s\leq & \frac{\beta}{2}(2+s/2)\|y_{s+1}-x_s\|^2\\ &+\nabla f(x_s)^T\left[(1+s/2)(y_{s+1}-y_s)+(y_{s+1}-y^*)\right]. \end{array}

We now multiply both sides by {(2+s/2)}. We first look at the LHS of the above equation after this multiplication. Since {(1+s/2)(2+s/2)<(2+(s-1)/2)^2} and {\delta_s \geq 0},

\displaystyle  (2+s/2)^2\delta_{s+1}-(1+s/2)(2+s/2)\delta_s \geq (2+s/2)^2\delta_{s+1} -(2+(s-1)/2)^2\delta_s. \qquad (***)

The RHS after multiplication by {(s/2+2)} is

\displaystyle  \begin{array}{rcl}  \frac{\beta}{2}\|(2+s/2)(y_{s+1}-x_s)\|^2\\ +(2+s/2)\nabla f(x_s)^T\left[(1+s/2)(y_{s+1}-y_s)+(y_{s+1}-y^*)\right]. \end{array}

Using {y_{s+1} = x_s-\frac{1}{\beta}\nabla f(x_s)}, the above can be simplified to

\displaystyle  \begin{array}{rcl} \frac{\beta}{2}\Big( 2(2+s/2) (x_s-y_{s+1})^T\left\{(1+s/2)(x_s-y_s)+(x_s-y^*)\right\}\\-\|(2+s/2)(y_{s+1}-x_s)\|^2\Big). \end{array}

Using the relation {2ab-a^2 = b^2-(b-a)^2}, the above simplifies to

\displaystyle  \begin{array}{rcl}  \frac{\beta}{2}\Big[ \|(1+s/2)(x_s-y_s) +(x_s-y^*)\|^2\\ -\|(1+s/2)(x_s-y_s)-(2+s/2)(x_s-y_{s+1}) +(x_s-y^*) \|^2 \Big] \end{array}

Using the relation between {x_{s+1}} and {y_{s+1}} in the update rule, the second term can be rewritten, and the expression simplifies to

\displaystyle  \begin{array}{rcl}  \frac{\beta}{2}\Big[ \| (2+s/2)x_s-(1+s/2)y_s-y^*\|^2\\ -\| (2+(s+1)/2)x_{s+1}-(1+(s+1)/2)y_{s+1}-y^* \|^2 \Big] \end{array}

Letting {u_s = (2+s/2)x_s-(1+s/2)y_s-y^*}, we have

\displaystyle (2+s/2)^2\delta_{s+1}-(2+(s-1)/2)^2\delta_s\leq \frac{\beta}{2}(\|u_s\|^2-\|u_{s+1}\|^2).

Summing from {s=1} to {k-1}, we have

\displaystyle \left(\frac{k-1}{2} +2\right)^2\delta_k -4\delta_1 \leq \frac{\beta}{2}(\|u_1\|^2-\|u_{k}\|^2).

Rearranging, we see

\displaystyle \delta_k \leq \frac{4\left(\frac{\beta}{2}\|u_1\|^2+4\delta_1\right)}{(k+3)^2},

proving the result. \Box

Observing the proof, we see that the method can be generalized as follows:

\displaystyle x_{s+1} =(1+\nu_s)y_{s+1} - \nu_s y_s,

where {\nu_s= (\alpha_s-1)/\alpha_{s+1}} and the sequence {\alpha_s} should satisfy {\alpha_s(\alpha_s-1) \leq \alpha_{s-1}^2} and {\alpha_s = \Theta(s)} for {s} large.

The term {\frac{s+2}{s+5}(y_{s+1}-y_s)} in (1) is called the momentum term, and we observe that the coefficient {(s+2)/(s+5) \sim 1-3/s} for large {s}. The coefficient is generalized to {(k-1)/(k-1+r)} in the paper by Weijie Su et al., “A Differential Equation for Modeling Nesterov's Accelerated Gradient Method: Theory and Insights”.

Let the function {f} be {\alpha}-strongly convex and {\beta}-smooth, and let {Q=\beta/\alpha} be the condition number of {f}. In this case the basic gradient descent algorithm requires {O(Q \log(1/\epsilon))} iterations to reach {\epsilon}-accuracy, while Nesterov's Accelerated Gradient Descent attains the optimal oracle complexity of {O(\sqrt{Q} \log(1/\epsilon))}. This improvement is quite relevant for Machine Learning applications.

Algorithm: The basic structure of the algorithm is given below

  1. Start at an arbitrary initial point {x_1 = y_1}.
  2. Then iterate the following equations for {s \geq 1},

    \displaystyle  \begin{array}{rcl}  y_{s+1} & = & x_s - \frac{1}{\beta} \nabla f(x_s) , \\ x_{s+1} & = & \left(1 + \frac{\sqrt{Q}-1}{\sqrt{Q}+1} \right) y_{s+1} - \frac{\sqrt{Q}-1}{\sqrt{Q}+1} y_s . \end{array}

Theorem 2 Let {f} be {\alpha}-strongly convex and {\beta}-smooth, then Nesterov’s Accelerated Gradient Descent satisfies

\displaystyle  f(y_s)-p^* \leq \frac{\alpha + \beta}{2} \|x_1-x^*\|^2 \exp \left(-\frac{(s-1)}{\sqrt{Q}}\right) . \ \ \ \ \ (2)
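A sketch of this variant under the same hedged assumptions as before (the quadratic test function and all names are ours):

    import numpy as np

    def nesterov_sc(grad_f, x1, alpha, beta, num_iters):
        # Sketch of the algorithm above for an alpha-strongly convex,
        # beta-smooth f; function and variable names are ours.
        rootQ = np.sqrt(beta / alpha)          # sqrt of the condition number Q
        nu = (rootQ - 1) / (rootQ + 1)         # momentum coefficient
        x = y_prev = np.asarray(x1, dtype=float)
        for _ in range(num_iters):
            y = x - grad_f(x) / beta           # y_{s+1} = x_s - (1/beta) grad f(x_s)
            x = (1 + nu) * y - nu * y_prev     # x_{s+1} from the update above
            y_prev = y
        return y_prev

    # Illustration: f(x) = 0.5 x^T A x with curvatures in [alpha, beta] = [0.01, 1]
    A = np.diag([1.0, 0.01])
    x_best = nesterov_sc(lambda x: A @ x, np.array([5.0, 5.0]),
                         alpha=0.01, beta=1.0, num_iters=300)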

Till now, we were looking at the gradient descent algorithm and its performance on different classes of functions. When GD is used on the class of {\beta}-smooth convex functions, the convergence rate is {\Theta(1/k)}. The convergence rate equals {\Theta(\exp(-k/Q))} when used on {\alpha}-strongly convex and {\beta}-smooth functions. Are these the optimal rates? Are there algorithms that can improve upon these convergence rates? For example, are there first order algorithms that can achieve a convergence rate of {\Theta(1/k^4)} on the class of {\beta}-smooth functions? In this lecture, we will come up with explicit functions that provide upper bounds on the rate of convergence, i.e., these functions exhibit the worst case convergence rate. We first look at the class of {\beta}-smooth functions. The following proof is adapted from Nemirovski's textbook. For simplicity, we make the following assumptions on the oracle. We will later see how to relax these conditions.

  • First Order Oracle: Only first order information should be used.
  • {x_0=0} and {x_k \in \text{Span}\lbrace \nabla f(x_0),\hdots,\nabla f(x_{k-1})\rbrace}.

Theorem 1 Let {t \leq (n-1)/2} and {\beta >0}. There exists a {\beta}-smooth convex function {f:{\mathbb R}^n \rightarrow {\mathbb R}} such that for any black-box procedure satisfying the conditions defined above,

\displaystyle  \min_{1 \leq s \leq t} f(x_s)-f(x^*) \geq \frac{3\beta}{32}\frac{||x_0-x^*||^2}{{(t+1)}^2} \ \ \ \ \ (1)

Proof: For {k \leq n}, let {A_k \in \mathbb{R}^{n \times n}} be the symmetric and tridiagonal matrix defined by

\displaystyle  (A_k)_{i,j}=\left\{ \begin{array}{rl} 2, & i=j,i\leq k \\ -1, & j \in \lbrace i-1,i+1\rbrace, i \leq k, j \neq k+1\\ 0, & \text{otherwise.} \end{array} \right.

We will first show that { 0 \preceq A_k \preceq 4I_n}. The first inequality can be verified as follows.

\displaystyle  \begin{array}{rcl}  x^T A_k x &=& 2\sum\limits_{i=1}^k x(i)^2 - 2 \sum\limits_{i=1}^{k-1} x(i)x(i+1) \\ &=& x(1)^2 + x(k)^2 + \sum\limits_{i=1}^{k-1} {(x(i)-x(i+1))}^2 \geq 0. \end{array}

Similarly, it can be easily shown that { A_k \preceq 4I_n}. Observe that our assumption gives {2t+1 \leq n} and hence {A_{2t+1}} is well defined. We now consider the following {\beta}-smooth convex function:

\displaystyle  f(x)=\frac{\beta}{8}x^T A_{2t+1}x - \frac{\beta}{4}x^Te_1. \ \ \ \ \ (2)

Note: Here {t} is not the running index of the algorithm. We first fix {t} and define the function {f(x)} for that given {t}; the running index in the proof is {s}. The function {f(x)} can be easily verified to be {\beta}-smooth as follows.

\displaystyle  \begin{array}{rcl}  \nabla^2f(x) &=& \frac{\beta}{4}A_{2t+1}, \\ ||\nabla^2f(x)|| &\leq& \frac{\beta}{4}||A_{2t+1}|| \leq \beta \; \; \; \left(\text{since } A_{2t+1} \preceq 4I_n \right). \end{array}
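As a sanity check, the following sketch builds {A_k} from the definition above and numerically verifies both {0 \preceq A_k \preceq 4I_n} and the minimizer {x_k^*(i)=1-i/(k+1)} computed later in the proof (the helper names and the chosen {n,t} are ours):

    import numpy as np

    def make_A(k, n):
        # Build the n x n matrix A_k from the definition above: 2 on the
        # diagonal for i <= k, -1 on the adjacent off-diagonals inside the
        # top-left k x k block, and 0 everywhere else.
        A = np.zeros((n, n))
        for i in range(k):
            A[i, i] = 2.0
            if i + 1 < k:
                A[i, i + 1] = A[i + 1, i] = -1.0
        return A

    n, t = 9, 4                     # t <= (n-1)/2, so A_{2t+1} is well defined
    k = 2 * t + 1
    A = make_A(k, n)
    eigs = np.linalg.eigvalsh(A)
    assert eigs.min() >= -1e-12 and eigs.max() <= 4.0   # 0 <= A_k <= 4 I_n

    # Check the minimizer x_k^*(i) = 1 - i/(k+1) derived below: A_k x = e_1
    x_star = np.array([1 - (i + 1) / (k + 1) for i in range(k)] + [0.0] * (n - k))
    e1 = np.zeros(n); e1[0] = 1.0
    assert np.allclose(A @ x_star, e1)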

Let us now look at the possible iterates generated by any algorithm that satisfies our assumptions. As {x_0=0} and {\nabla f(x)= \frac{\beta}{4}A_{2t+1}x-\frac{\beta}{4}e_1}, we can represent {x_1}, {x_2}, {\hdots}, as follows (taking unit steps for illustration),

\displaystyle  \begin{array}{rcl}  x_1&=& x_0-\frac{\beta}{4}e_1, \end{array}

where {e_i} is the {i}-th standard basis vector (it has {1} only in the {i}-th coordinate). We see that {x_1} has a non-zero value only in the first coordinate, i.e., {x_1 \in \text{span}(e_1)}. Similarly,

\displaystyle  \begin{array}{rcl}  x_2&=& x_1- \nabla f(x_1) \\ &=& x_1-\left[\frac{\beta}{4}A_{2t+1}\left(x_0-\frac{\beta}{4}e_1\right)-\frac{\beta}{4}e_1\right], \end{array}

from which we see that {x_2 \in \text{span}(e_1, e_2)}. From the way {A_k} is defined, it is easy to verify that {x_s} must lie in the linear span of {e_1, e_2, \hdots,e_{s}}. In particular, for {s\leq t} we necessarily have {x_s(i)=0} for {i=s+1,\hdots,n}, which implies

\displaystyle  x_s^TA_{2t+1}x_s=x_s^TA_{s}x_s. \ \ \ \ \ (3)

In other words, if we denote

\displaystyle  f_k(x)=\frac{\beta}{8}x^TA_kx - \frac{\beta}{4}x^Te_1, \ \ \ \ \ (4)

then we just proved that

\displaystyle  \begin{array}{rcl}  f(x_s)-f^*&\stackrel{a}{=}&f_s(x_s)-f^*_{2t+1}, \\ &\stackrel{b}{\geq}& f_s^* - f^*_{2t+1}. \end{array}

where {(a)} follows since {f(x_s)=f_s(x_s)}, and from the definitions of {f(x)} and {f_k(x)} we see that {f(x)=f_{2t+1}(x)}. The inequality {(b)} follows since {f_s^*=\min_x{f_s}(x)}. We now compute the minimizer {x_k^*} of {f_k}, its norm, and the corresponding function value {f_k^*}. The optimal {x_k^*} can be found by solving {\nabla f_k(x)=0}, i.e., {A_kx=e_1}. So the point {x_k^*} is the unique solution, in the span of {e_1,\hdots,e_k}, of {A_kx=e_1}. It is easy to verify that it is given by {x_k^*(i)=1-\frac{i}{k+1}} for {i=1,\hdots,k}. Thus we immediately have:

\displaystyle  \begin{array}{rcl}  f_k^*&=& \frac{\beta}{8}{(x_k^*)}^TA_kx_k^* - \frac{\beta}{4}(x_k^*)^T e_1 \\ &=& -\frac{\beta}{8}(x_k^*)^Te_1 \\ &=& -\frac{\beta}{8}\left(1 - \frac{1}{k+1}\right). \end{array}

We first observe that {f_k^*} decreases with {k}, hence {f_s^* \geq f_t^*} for {s \leq t}. So we have

\displaystyle f(x_s)-f^* \geq f_t^* -f^*_{2t+1}.

Furthermore note that

\displaystyle  \begin{array}{rcl}  ||x_k^*||^2&=& \sum\limits_{i=1}^{k}\left(1-\frac{i}{k+1}\right)^2 \\ &=& \sum\limits_{i=1}^k \left(\frac{i}{k+1}\right)^2 \\ &\leq& \frac{k+1}{3}. \end{array}

Thus one obtains:

\displaystyle  \begin{array}{rcl}  f_t^*-f_{2t+1}^*&=& \frac{\beta}{8}\left(\frac{1}{t+1} -\frac{1}{2t+2}\right) = \frac{\beta}{16(t+1)} \\ &\geq& \frac{3\beta}{32}\frac{||x^*_{2t+1}||^2}{(t+1)^2}, \end{array}

where the last step uses {\|x^*_{2t+1}\|^2 \leq \frac{2t+2}{3}}. Since {x_0=0}, {\|x^*_{2t+1}\| = \|x_0-x^*\|}, which concludes the proof. \Box

We make the following observations:

  • From Theorem 1, we observe that the worst case complexity is achieved by a quadratic function and no first order algorithm can provide a convergence rate better than {\Theta(1/k^2)}.
  • From the above theorem and the convergence rate of the GDA we have the following result: Let {N} be the number of iterations required to achieve an error of less than {\epsilon}. Then

    \displaystyle  \min\left\{\frac{n-1}{2},\sqrt{\frac{3\beta\|x_0-x^*\|^2}{32 \epsilon}}\right\}\leq N \leq \frac{\beta\|x_0-x^*\|^2}{ \epsilon}.

    First we observe that GDA does not match the lower bound on complexity. Also, for {n^2 \gtrsim \frac{\beta\|x_0-x^*\|^2}{ \epsilon}}, the upper bound is the square (up to a constant) of the lower bound. However, for small {n}, the lower bound is not tight. In the next class, we will look at a technique that provides an upper bound matching the lower bound.

  • A similar kind of result can be obtained for strongly convex functions. We will state the result here without proof.
    Theorem 2 Let {Q=\beta/\alpha >1}. There exists a {\beta}-smooth and {\alpha}-strongly convex function {f} such that for any {t \geq 1} one has

    \displaystyle  f(x_t)-f(x^*) \geq \frac{\alpha}{2} \left(\frac{\sqrt{Q}-1}{\sqrt{Q}+1}\right)^{2(t-1)} \|x_1-x^*\|^2 \ \ \ \ \ (5)

    We know that for small values of {x}, {1-x \approx e^{-x}}. This means that for large values of {Q}:

    \displaystyle  \left(\frac{\sqrt{Q}-1}{\sqrt{Q}+1}\right)^{2(t-1)}\approx \exp \left(-\frac{4(t-1)}{\sqrt{Q}}\right) \ \ \ \ \ (6)

    We observe that the exponent here has {\sqrt{Q}} in the denominator whereas the corresponding GDA exponent has {Q}; i.e., the lower bound and the GDA rate differ by a factor of {\sqrt{Q}} in the denominator of the exponent.

Today we will look at a variant of gradient descent called the steepest descent algorithm.

1. Steepest Descent Algorithm

Having seen the Gradient Descent Algorithm, we now turn our attention to yet another member of the Descent Algorithms family — the Steepest Descent Algorithm. In this algorithm, we optimize the descent direction to obtain the maximum decrease of the objective function. First, let us look at the descent direction for Steepest Descent. Let {f: \mathbb{R}^n \rightarrow \mathbb{R}} be continuously differentiable. Using the first order Taylor approximation, we get

\displaystyle f(x+v) \approx f(x) + \nabla f(x)^T v.

Our objective is to minimize the second term, {\nabla f(x)^T v}. The descent direction is defined in two steps. First, we look at the so-called normalized steepest descent direction. This in turn helps us define the steepest descent direction. The normalized steepest descent direction is defined as

\displaystyle  \Delta x_{nsd} = {\text{argmin}_v} \{ \nabla f(x)^T v : \| v \| = 1 \}.

Observe that the descent direction depends not only on the point {x}, but also on the norm {\|\cdot\|}. As is clear from the definition, the name normalized is used since we are searching only through those vectors that have unit magnitude. Looking closely, we notice that this definition is similar to that of the dual norm of a given norm. Hence, the minimum value is given by {-\| \nabla f(x) \|_*}. The negative sign compensates for the “min” used here as opposed to the “max” used in the definition of a dual norm. Now we can define the descent direction for the Steepest Descent Algorithm as follows:

\displaystyle  \Delta x_{sd} = \| \nabla f(x) \|_* \Delta x_{nsd}.

The following calculation verifies that the descent direction defined above satisfies the sufficiency condition required, and also justifies the use of {\| \nabla f(x) \|_*} term used in the definition:

\displaystyle  \begin{array}{rcl}  \nabla f(x)^T \Delta x_{sd} & = & \| \nabla f(x) \|_* \nabla f(x)^T \Delta x_{nsd}, \\ & \overset{(a)}{=} & - \| \nabla f(x) \|_*^2, \\ & < & 0, \end{array}

where {(a)} follows from the definition of normalized steepest descent direction. Hence, the update equation is

\displaystyle  x_{k+1} = x_k + t \Delta x_{sd}.

The following Lemma casts the descent direction as a simpler optimization problem.

Lemma 1

\displaystyle \Delta x_{sd} = \text{argmin}_v \left\lbrace \nabla f(x)^T v + \frac{\| v \|^2}{2} \right\rbrace.

Proof: The proof follows by writing { v= t w}, where {t>0} and {\|w\| =1}, and jointly optimizing over {t} and {w}. \Box

We will now look at the descent directions generated by a quadratic norm.

Definition 2 Let {P} be a fixed Positive Definite matrix i.e., {P \in S^n_{++}}. The quadratic norm of any vector {z \in \mathbb{R}^n} is then defined as

\displaystyle  \| z \|_P = \left( z^T P z \right)^{\frac{1}{2}} = \| P^{\frac{1}{2}} z \|_2

If we define {\tilde{x} = P^{\frac{1}{2}} x}, then this essentially corresponds to a change of basis. Since {P \in S^n_{++}}, {x = P^{-\frac{1}{2}} \tilde{x}} is well defined. This in turn allows the quadratic form to be simplified as follows:

\displaystyle  \begin{array}{rcl}  x^T P x & = & \tilde{x}^T P^{-\frac{1}{2}} P P^{-\frac{1}{2}} \tilde{x}, \\ & = & \tilde{x}^T \tilde{x}. \end{array}

Hence, an ellipsoid is converted to a sphere. In the case of Quadratic Norm, the descent direction is

\displaystyle  \Delta x_{sd} = \underset{v}{\text{argmin}} \left\lbrace \nabla f(x)^T v + \frac{1}{2} v^T P v \right\rbrace.

Since the above minimization problem is unconstrained, setting its gradient to zero gives the minimizer. Therefore,

\displaystyle \nabla f(x) + P v^* =0

and hence

\displaystyle \Delta x_{sd} = -P^{-1} \nabla f(x).

This norm can be used when the condition number of the objective function is bad. In this case, if one has some idea of the Hessian at the optimal point, the matrix {P} can be chosen to be equal to {\nabla^2f(x^*)}. This would improve the condition number and the convergence rate of the algorithm.
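A small sketch of the quadratic-norm steepest descent step; choosing {P} equal to the Hessian of a quadratic makes a single unit step land on the minimizer (the test problem and names are ours):

    import numpy as np

    def steepest_descent_direction(grad, P):
        # Quadratic-norm steepest descent direction: solve P v = -grad
        # rather than forming P^{-1} explicitly.
        return np.linalg.solve(P, -grad)

    # Illustration: with P equal to the Hessian of a quadratic, one unit step
    # lands exactly on the minimizer (here x* = 0).
    H = np.array([[10.0, 0.0], [0.0, 0.1]])   # badly conditioned Hessian
    x = np.array([1.0, 1.0])
    grad = H @ x                               # gradient of f(x) = 0.5 x^T H x
    x_next = x + steepest_descent_direction(grad, P=H)
    assert np.allclose(x_next, 0.0)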

1.1. Convergence Rate for Steepest Descent with Back Tracking

Theorem 3 Let {f: \mathbb{R}^n \rightarrow \mathbb{R}} be {\beta-}smooth and {\alpha-}strongly convex. Then Steepest Descent with Back Tracking has a linear convergence rate.

Proof: See Boyd. \Box

For a given {\beta}-smooth convex function, we proved in the last lecture that the GD algorithm with constant step size converges to the optimum with a convergence rate of the order {\frac{1}{k}}. In this lecture, we present a GD algorithm with constant step size for {\beta}-smooth and {\alpha}-strongly convex functions whose convergence rate is exponential.

Theorem 1 Let {f} be a {\beta}-smooth and {\alpha}-strongly convex function. Then GD with step sizes {t_k \leq 2/(\alpha+\beta)} satisfies

\displaystyle f \left( x_k \right) - p^* \leq \frac{\beta}{2} \prod_{n=1}^k \left( 1 - \frac{2t_n\alpha \beta}{\alpha + \beta} \right) \Vert x_0 - x^* \Vert^2.

Proof: GD algorithm update step: { x_{k+1} = x_k -t_k \nabla f \left( x_k \right) }. Since {f(x)} is {\beta}-smooth,

\displaystyle  f \left( x_k \right) \leq f \left( x^* \right) + \nabla f \left( x^* \right)^T \left( x_k - x^* \right) + \frac{\beta}{2} \Vert x_k - x^* \Vert ^2 .

Since the gradient vanishes at the optimal point,

\displaystyle  f \left( x_k \right) - f \left( x^* \right) \leq \frac{\beta}{2} \Vert x_k -x^* \Vert ^2.  \ \ \ \ \ (1)

Now, let us bound {\Vert x_{k+1} - x^* \Vert^2} as follows:

\displaystyle  \begin{array}{rcl}  \Vert x_{k+1} - x^* \Vert^2 = \Vert x_k - x^* \Vert^2 + t_k^2 \Vert \nabla f \left( x_k \right) \Vert^2 -2t_k \nabla f \left( x_k \right)^T\left( x_k - x^* \right). \end{array}

Using the lemma from the previous lecture, we have

\displaystyle \Vert x_{k+1} - x^* \Vert^2 \leq \Vert x_k - x^* \Vert^2 + t_k^2 \Vert \nabla f \left( x_k \right) \Vert^2 -2t_k \left( \frac{\alpha \beta}{\alpha + \beta} \Vert x_k - x^* \Vert^2 + \frac{\Vert \nabla f \left( x_k \right) \Vert^2}{\alpha + \beta} \right).

After simplification, we obtain

\displaystyle  \begin{array}{rcl}  \Vert x_{k+1} - x^* \Vert^2 \leq& \left( 1 - 2t_k \frac{\alpha \beta}{\alpha + \beta} \right) \Vert x_k - x^* \Vert^2 + t_k \left( t_k - \frac{2}{\alpha + \beta} \right) \Vert\nabla f \left( x_k \right) \Vert^2. \end{array}

Since {t_k \leq\frac{2}{\alpha + \beta}}, the last term on the RHS is non-positive and can be dropped, and we obtain the following upper bound.

\displaystyle \Vert x_{k+1} - x^* \Vert^2 \leq\left( 1 - 2t_k \frac{\alpha \beta}{\alpha + \beta} \right) \Vert x_k - x^* \Vert^2.

Now, iterating over {k},

\displaystyle  \Vert x_k - x^* \Vert ^2 \leq \Vert x_0 - x^* \Vert ^2 \prod_{n=1}^k \left( 1 - \frac{2t_n\alpha \beta}{\alpha + \beta} \right).

Substituting in (1), we obtain

\displaystyle f \left( x_k \right) - p^* \leq \frac{\beta}{2} \prod_{n=1}^k \left( 1 - \frac{2t_n\alpha \beta}{\alpha + \beta} \right) \Vert x_0 - x^* \Vert^2 .

\Box

Using the fact {(1-x) \leq \exp(-x)}, it follows from Theorem 1 that

\displaystyle  f \left( x_k \right) - p^* \leq \frac{\beta}{2} \Vert x_0 - x^* \Vert^2 e^{-\frac{2\alpha\beta}{\alpha+\beta}\sum_{n=1}^k t_n}.  \ \ \ \ \ (2)

The following lemma shows that GD on strongly convex functions with a constant step size exhibits exponential convergence.

Lemma 2 GD with step size { t = \frac{2}{\alpha + \beta}} on an {\alpha}-strongly convex and {\beta}-smooth function satisfies

\displaystyle  \begin{array}{rcl}  f \left( x_k \right) - p^* &\leq \frac{\beta}{2} \left( \frac{Q_f - 1}{Q_f + 1} \right)^{2k} \Vert x_0 - x^* \Vert^2 \\ &\leq \frac{\beta}{2} \exp \left( - \frac{4k}{Q_f+1} \right) \Vert x_0 - x^* \Vert^2, \end{array}

with {Q_f = \frac{\beta}{\alpha}}.

Proof: Substituting {t_k =2/(\alpha+\beta)} in Theorem 1, we have

\displaystyle  \begin{array}{rcl}  f \left( x_k \right) - p^* &\leq \frac{\beta}{2} \prod_{n=1}^k \left( 1 - \frac{2t_n\alpha \beta}{\alpha + \beta} \right) \Vert x_0 - x^* \Vert^2 \\ &= \frac{\beta}{2} \left( 1 - \frac{2}{Q_f+1} \right)^{2k} \Vert x_0 - x^* \Vert^2 \\ & \leq \frac{\beta}{2} \exp \left( \frac{-4k}{Q_f+1} \right) \Vert x_0 - x^* \Vert^2. \end{array}

The last step follows since { \left( 1-x \right)^k \leq \exp(-xk) }. \Box
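The contraction can be checked numerically. The following sketch (the diagonal test matrix is ours) runs GD with {t=2/(\alpha+\beta)} on a quadratic, for which {\Vert x_k - x^*\Vert} contracts by exactly {(Q_f-1)/(Q_f+1)} per step:

    import numpy as np

    # GD with t = 2/(alpha+beta) on f(x) = 0.5 x^T A x; per step,
    # ||x_k - x*|| contracts exactly by (Q_f-1)/(Q_f+1) for this quadratic.
    alpha, beta = 0.1, 1.0
    A = np.diag([beta, alpha])                 # extreme curvatures beta and alpha
    t = 2.0 / (alpha + beta)
    x = np.array([1.0, 1.0])                   # x* = 0, so ||x|| is the error
    for _ in range(50):
        x = x - t * (A @ x)
    Q = beta / alpha
    predicted = ((Q - 1) / (Q + 1)) ** 50
    assert np.linalg.norm(x) <= predicted * np.sqrt(2) + 1e-12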

In convex analysis terminology, {\exp(-k)}-type convergence is called linear convergence, and {\exp(-2^k)}-type convergence (the error squaring at every step) is called quadratic convergence. It can be seen that the rate of convergence in the above lemma critically depends on the condition number {Q_f = \frac{\beta}{\alpha}}: large values of {Q_f} imply slow convergence. The next lemma examines the convergence behaviour with diminishing step size.

Lemma 3 When {t_k = \frac{c}{k}}, we have

\displaystyle f(x_k)- p^* \leq \frac{\beta}{2 k^{\frac{2c\alpha\beta}{\alpha+\beta}}}\|x_0 -x^*\|^2.

Proof: Substituting {t_k = \frac{c}{k}} in (2), and using the fact {\sum_{n=1}^k 1/n \sim \log(k)}, we have

\displaystyle f \left( x_k \right) - p^* \leq \frac{\beta}{2} \Vert x_0 - x^* \Vert^2 e^{-\frac{2c\alpha\beta}{\alpha+\beta} \log(k)},

and the result follows. \Box

This shows that any polynomial convergence rate can be obtained by increasing {c}. However, for large {c}, the initial steps might violate the condition {t_k \leq 2/(\alpha+\beta)}. While this might slow down the initial phase of the algorithm, the convergence rate would still be given by Lemma 3 since {\frac{c}{k} \leq 2/(\alpha+\beta)} for large {k}.

In the previous lectures we have seen the properties of {\beta}-smoothness and strong convexity, which are used quite often in problems involving Convex Optimization. We introduced the basic idea of descent algorithms and analysed the gradient descent algorithm. In today's lecture we will look into some more theorems relating to the gradient descent method. We will consider a generalised step size, i.e., the update rule is

\displaystyle x_{k+1} = x_k - t_k \nabla f(x_k)

Theorem 1 Let the convex function {f} be {\beta}-smooth, with {x_n} the current point and {x^*} the optimal point. Then {\Arrowvert x_n-x^*\Arrowvert} decreases with {n} if {t_n \leq 1/\beta}.

Proof:

\displaystyle  \begin{array}{rcl}  \Arrowvert x_{n+1}-x^*\Arrowvert ^2 &=& \Arrowvert x_n - t_n\nabla f(x_n) -x^*\Arrowvert ^2,\nonumber\\ &=& \Arrowvert x_n -x^*\Arrowvert ^2 -{2t_n\nabla f(x_n)^T(x_n-x^*)} + t_n^2{\Arrowvert \nabla f(x_n)\Arrowvert ^2} .\nonumber \end{array}

Now, as derived in previous lecture, we have

\displaystyle  \begin{array}{rcl}  f(x_n) - f(x^*) \leq \nabla f(x_n)^T(x_n-x^*) - \frac{\Arrowvert\nabla f(x_n)\Arrowvert ^2}{2\beta}. \end{array}

But we know that,

\displaystyle  \begin{array}{rcl}  f(x_n) - f(x^*) \geq 0 \quad \Rightarrow \quad -\nabla f(x_n)^T(x_n-x^*) \leq -\frac{\Arrowvert\nabla f(x_n)\Arrowvert ^2}{2\beta}. \end{array}

Hence,

\displaystyle  \begin{array}{rcl}  \Arrowvert x_{n+1}-x^*\Arrowvert ^2 &=& \Arrowvert x_n -x^*\Arrowvert ^2 -2t_n{\nabla f(x_n)^T(x_n-x^*)}+ t_n^2{\Arrowvert \nabla f(x_n)\Arrowvert ^2}\nonumber\\ &\leq & \Arrowvert x_n -x^*\Arrowvert ^2 -t_n\frac{\Arrowvert \nabla f(x_n)\Arrowvert ^2}{\beta} + t_n^2{\Arrowvert \nabla f(x_n)\Arrowvert ^2}\nonumber\\ &= & \Arrowvert x_n -x^*\Arrowvert ^2 -t_n\left(\frac{1}{\beta}-t_n\right)\|\nabla f(x_n) \|^2\nonumber. \end{array}

If {t_n \leq 1/\beta}, then

\displaystyle \Arrowvert x_{n+1}-x^*\Arrowvert ^2 \leq \Arrowvert x_n -x^*\Arrowvert ^2.

\Box

We note that, while the distance to the optimal point decreases, the rate of decrease can be arbitrarily slow; in fact, it can be made to match any given real sequence tending to zero. We will now look at the convergence rate of the gradient descent algorithm on the class of smooth convex functions.

Theorem 2 Let f be a convex and {\beta}-smooth function on {{\mathbb R}^d}. Then the gradient descent method with step lengths {t_k \leq \frac{1}{\beta}} satisfies

\displaystyle f(x_n)-f(x^*) \leq \frac{ \Arrowvert x_0 - x^*\Arrowvert ^2}{ \sum_{k=0}^{n-1} t_k\left( 1-\frac{\beta t_k}{2}\right)}. \ \ \ \ \ (1)

Proof: Since {f} is a {\beta}-smooth function, we have

\displaystyle f(x) \leq f(y) + \nabla f(y)^T(x-y) + \frac{\beta}{2}\Arrowvert x-y\Arrowvert ^2. \ \ \ \ \ (2)

 

Now according to the gradient descent algorithm, we get the next point as

\displaystyle x_{n+1} = x_n -t_n\nabla f(x_n) \ \ \ \ \ (3)

 

Using (3) in (2) (with {x=x_{n+1}} and {y=x_n}), we get

\displaystyle \begin{array}{rcl} f(x_{n+1}) - f(x_n) &\leq & \nabla f(x_n)^T(x_{n+1}-x_n) + \frac{\beta}{2}\Arrowvert x_{n+1}-x_{n}\Arrowvert ^2\\ &= &-t_n\nabla f(x_n)^T\nabla f(x_n)+\frac{\beta t_n^2}{2}\Arrowvert \nabla f(x_n)\Arrowvert ^2\\ &= &-t_n\left( 1-\frac{\beta t_n}{2}\right) \Arrowvert \nabla f(x_n)\Arrowvert ^2. \end{array}

Now let {\eta_n = f(x_{n})-f(x^*)} and {\eta_{n+1} = f(x_{n+1})-f(x^*)}. Substituting this in the previous result, we get

\displaystyle \begin{array}{rcl} &f(x_{n+1}) - f(x_n) \leq-t_n\left( 1-\frac{\beta t_n}{2}\right)\Arrowvert \nabla f(x_n)\Arrowvert ^2 \\ &\eta_{n+1} + f(x^*) -\eta_{n} -f(x^*)\leq -t_n\left( 1-\frac{\beta t_n}{2}\right)\Arrowvert \nabla f(x_n)\Arrowvert ^2 \end{array}

Hence

\displaystyle \eta_{n+1} \leq \eta_{n}-t_n\left( 1-\frac{\beta t_n}{2}\right)\Arrowvert \nabla f(x_n)\Arrowvert ^2. \ \ \ \ \ (4)

 

From the convexity condition and the Cauchy–Schwarz inequality we have,

\displaystyle \begin{array}{rcl} f(x_n) - f(x^*) &\leq & \nabla f(x_n)^T(x_n - x^*)\nonumber\\ \eta_n &\leq & \Arrowvert \nabla f(x_n)\Arrowvert \Arrowvert x_n-x^*\Arrowvert\nonumber \end{array}

So,

\displaystyle \frac{\eta_n}{\Arrowvert x_n-x^*\Arrowvert} \leq \Arrowvert \nabla f(x_n)\Arrowvert. \ \ \ \ \ (5)

 

Substituting the result of (5) to (4), we get

\displaystyle \begin{array}{rcl} \eta_{n+1} \leq \eta_{n} -\frac{t_n\left( 1-\frac{\beta t_n}{2}\right)\eta_n^2}{\Arrowvert x_{n} - x^*\Arrowvert ^2}\nonumber. \end{array}

As proved in the previous theorem, {\Arrowvert x_n-x^*\Arrowvert} decreases with n. Thus {\Arrowvert x_{0}-x^*\Arrowvert \geq \Arrowvert x_n-x^*\Arrowvert} and consequently {\frac{-1}{\Arrowvert x_{0}-x^*\Arrowvert} \geq \frac{-1}{\Arrowvert x_n-x^*\Arrowvert}}. So

\displaystyle \begin{array}{rcl} \eta_{n+1} \leq \eta_{n} -\frac{ t_n\left( 1-\frac{\beta t_n}{2}\right) \eta_n^2}{\Arrowvert x_{0} - x^*\Arrowvert ^2}. \end{array}

Let {\omega = \frac{1}{\Arrowvert x_0 - x^*\Arrowvert ^2}}.

\displaystyle \begin{array}{rcl} \eta_{n+1} \leq \eta_{n} -\omega t_n\left( 1-\frac{\beta t_n}{2}\right)\eta_n^2\nonumber. \end{array}

Dividing by ({\eta_{n+1}\eta_{n}}) we get,

\displaystyle \begin{array}{rcl} \frac{1}{\eta_n} \leq \frac{1}{\eta_{n+1}} - \frac{\omega t_n\left( 1-\frac{\beta t_n}{2}\right) \eta_n}{\eta_{n+1}}\nonumber. \end{array}

From (4) it follows that {\eta_{n+1} \leq \eta_n}, so {\frac{\eta_{n}}{\eta_{n+1}} \geq 1}. Thus we have

\displaystyle \begin{array}{rcl} \frac{1}{\eta_{n+1}}-\frac{1}{\eta_{n}} \geq \frac{\omega t_n\left( 1-\frac{\beta t_n}{2}\right)\eta_n}{\eta_{n+1}} \geq \omega t_n\left( 1-\frac{\beta t_n}{2}\right). \end{array}

The above inequality telescopes. Summing all the terms from { 0} to { n-1} we get,

\displaystyle \begin{array}{rcl} \frac{1}{\eta_n} - \frac{1}{\eta_{0}} \geq \omega \sum_{k=0}^{n-1} t_k\left( 1-\frac{\beta t_k}{2}\right)\nonumber. \end{array}

Since {f(x_0)-f(x^*) \geq 0}, {\eta_{0}} and thus {\frac{1}{\eta_{0}}} are positive numbers. Thus

\displaystyle \begin{array}{rcl} \frac{1}{\eta_n} &\geq &\omega \sum_{k=0}^{n-1} t_k\left( 1-\frac{\beta t_k}{2}\right)\nonumber\\ \eta_n &\leq & \frac{1}{\omega \sum_{k=0}^{n-1} t_k\left( 1-\frac{\beta t_k}{2}\right)}\nonumber\\ \eta_n &\leq & \frac{ \Arrowvert x_0 - x^*\Arrowvert ^2}{ \sum_{k=0}^{n-1} t_k\left( 1-\frac{\beta t_k}{2}\right)}. \end{array}

\Box

We will now look at some specific step sizes. When the step size is constant, it is easy to see that {t=1/\beta} maximizes { t\left( 1-\frac{\beta t}{2}\right)} and is hence optimal.

Corollary 3 (Constant stepsize) When {t_k =1/\beta},

\displaystyle f(x_n)-f(x^*) \leq \frac{ 2\beta\Arrowvert x_0 - x^*\Arrowvert ^2}{ n}. \ \ \ \ \ (6)

Hence for a constant step size, the algorithm converges as {\Theta(1/n)}.

We now look at diminishing step sizes (very common for stochastic GDA).

Corollary 4 Let {t_k} be such that {\sum_{k=1}^\infty t_k =\infty } and {\sum_{k=1}^\infty t_k^2 <\infty}. Then GDA converges to the global optimum. In particular when {t_k= 1/k}, we have

\displaystyle f(x_n)-f(x^*) \leq \Theta\left(\frac{1}{\log(n)}\right){\Arrowvert x_0 - x^*\Arrowvert ^2}.

Observe that for the case of diminishing step size, the knowledge of {\beta} is not necessary. Also, {\sum_{k=1}^\infty t_k^2 <\infty} is a sufficient condition but not necessary. We only require { \sum_{k=0}^{n-1} t_k\left( 1-\frac{\beta t_k}{2}\right)} to grow without bound as {n} increases. For example, when {t_k = 1/\log(k)}, it can be shown (by approximating the sum by a Riemann integral) that

\displaystyle \sum_{k=0}^{n-1} t_k\left( 1-\frac{\beta t_k}{2}\right) \sim \frac{n}{\log(n)}.

Hence when {t_k = 1/\log(k)},

\displaystyle f(x_n)-f(x^*) \preceq \frac{\log(n)\Arrowvert x_0 - x^*\Arrowvert ^2}{ n},

which results in a loss of {\log(n)} in the convergence rate compared to {t_k =1/\beta}.

Figure: Convergence rates of a quadratic function {x^TMx} for different step sizes. The eigenvalues of {M} are {[1.09, 0.227, 0]}. Hence the Lipschitz constant is 1.09 and the function is also not strongly convex.
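A sketch of such an experiment (the matrix below is diagonal with the quoted eigenvalues, which is our simplification; plotting is omitted):

    import numpy as np

    M = np.diag([1.09, 0.227, 0.0])   # PSD with the quoted eigenvalues
    beta = 2 * 1.09                    # gradient of f(x) = x^T M x is 2 M x
    x0 = np.ones(3)

    def run_gd(step_rule, iters=500):
        # Run GD with a per-iteration step size rule; record f(x_k),
        # which equals the error since the optimal value is 0.
        x, errs = x0.copy(), []
        for k in range(1, iters + 1):
            x = x - step_rule(k) * (2 * M @ x)
            errs.append(x @ M @ x)
        return errs

    err_const = run_gd(lambda k: 1 / beta)     # constant t = 1/beta: Theta(1/n)
    err_dim = run_gd(lambda k: 1 / k)          # t_k = 1/k: Theta(1/log n)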

For comparing algorithms under a standard condition, let us assume the optimum lies within a radius {R} of the starting point {x_0}, since {\eta_n} depends on the distance of the starting point to the optimum. For {t_k =1/\beta}, we have

\displaystyle \begin{array}{rcl} \eta_n &\leq & \frac{2\beta R ^2}{n}. \end{array}

Now for allowable tolerance of {\epsilon} from the optimum {f(x^*)} for the algorithm to end,

\displaystyle \begin{array}{rcl} f(x_n)-f(x^*) &\leq &\epsilon\nonumber\\ \eta_n &\leq &\epsilon\nonumber \end{array}

Hence

\displaystyle \begin{array}{rcl} \epsilon &\geq & \frac{2\beta R ^2}{n}\nonumber\\ n &\geq & \frac{2\beta R ^2}{\epsilon}. \end{array}

Thus the minimum number of steps required to reach an {\epsilon}-accurate solution using the gradient descent algorithm is {\frac{2\beta R ^2}{\epsilon}}, which shows that the iteration count depends directly on the Lipschitz constant {\beta} of the gradient and the squared distance of the starting point from the optimum, and inversely on the tolerance {\epsilon}.

A few points to remember:

  1. Lipschitz continuity is always assumed in Convex Optimization.
  2. Checking {f(x_{k+1})} {<} {f(x_{k})} alone is a poor way to go through with optimization, as will be explained in later lectures.
  3. {\beta}-smoothness and convexity alone do not give good convergence rates. Thus {\alpha}-strong convexity is assumed for faster convergence.
  4. The gradient descent algorithm scales linearly with the number of dimensions.

We require the following lemma for proving the convergence rate for strongly convex functions in the next class.

Lemma 5 Let {f} be {\beta}-smooth and {\alpha}-strongly convex. Then {\forall x,y \in {\mathbb R}^n},

\displaystyle \begin{array}{rcl} (\nabla f(x) -\nabla f(y))^T(x-y) \geq \frac{\alpha\beta\Arrowvert x-y\Arrowvert ^2}{\alpha + \beta} + \frac{\Arrowvert\nabla f(x) - \nabla f(y)\Arrowvert ^2}{\alpha +\beta}. \end{array}

Theorem 1 Let {f:{\mathbb R}^n \rightarrow {\mathbb R}} be a convex and a {\beta}-smooth function. Then, for any {x,y \in {\mathbb R}^n},

\displaystyle f(x) - f(y) \leq \nabla f(x)^T(x-y) - \frac{1}{2 \beta} \| \nabla f(x) - \nabla f(y)\| ^2. \ \ \ \ \ (1)

Proof: Let {z=y-\frac{1}{\beta} \left( \nabla f(y) - \nabla f(x) \right)}. Then,

\displaystyle \begin{array}{rcl} f(x) - f(y) &= ~ & f(x) - f(z) + f(z) - f(y),\\ & \stackrel{ (a)}{\leq} ~ & \nabla f(x)^T(x-z) + \nabla f(y)^T(z-y) + \frac{\beta}{2}\| z-y \|^2,\\ &= ~& \nabla f(x)^T (x-y) + \left( \nabla f(x) - \nabla f(y) \right) ^T (y-z) + \frac{\beta}{2} \| z-y \|^2,\\ &\stackrel{(b)}{=} ~& \nabla f(x)^T(x-y) - \frac{1}{2 \beta} \|\nabla f(x) - \nabla f(y) \|^2, \end{array}

where {(a)} follows by using the gradient inequality for the {\left( f(x) - f(z) \right)} term since {f} is a convex function, and using the {\beta}-smoothness of {f} for the {\left( f(z) - f(y) \right)} term, and {(b)} follows by substituting {z=y-\frac{1}{\beta} \left( \nabla f(y) - \nabla f(x) \right)}. \Box

The above theorem is from the notes of Dr. Sébastien Bubeck.

1. Unconstrained Minimization

We now look at the following optimization problem.

\displaystyle p^*=\min\limits_{x \in {\mathbb R}^n} f(x). \ \ \ \ \ (2)

Our objective is to come up with a sequence of vectors {x_1,x_2, \hdots} such that {\lim\limits_{n \rightarrow \infty}f(x_n)=p^*}. We refer to such a sequence as a minimizing sequence. We next define a tolerance limit to the optimization problem, which limits the allowed gap between the optimal value and the function value at the point returned by the algorithm. By this we mean that the algorithm stops at {x_n} if

\displaystyle 0 \leq f(x_n) - p^* < \epsilon. \ \ \ \ \ (3)

2. Descent Algorithms

In descent algorithms, we use the following update rule.

\displaystyle x_{k+1} = x_k + t_k \Delta x_k, \ \ \ \ \ (4)

 

where {t_k > 0} is the step size and {\Delta x_k} is the descent direction. Here, {||\Delta x_k||} need not be {1}, so “direction” is, strictly speaking, a misnomer. By descent, we mean the following,

\displaystyle f(x_{k+1}) < f(x_k).\ \ \ \ \ (5)

If {f} is a convex function, by gradient inequality,

\displaystyle f(x_{k+1}) \geq f(x_k) + \nabla f(x_k)^T(x_{k+1}-x_k).\ \ \ \ \ (6)

From (4)

\displaystyle f(x_{k+1}) \geq f(x_k) + \nabla f(x_k)^Tt_k \Delta x_k.\ \ \ \ \ (7)

From (5) and (7), we have the following as a necessary condition for every descent algorithm.

\displaystyle \nabla f(x_k)^T \Delta x_k < 0. \ \ \ \ \ (8)

 

Given below is the basic structure of a descent algorithm.


 

  1. Start at an initial point {x_0}.
  2. Repeat the following till the “Stopping Criterion” is met.
    1. Descent direction : Choose a descent direction {\Delta x_k}.
    2. Line search : Choose a step size {t_k > 0}.
    3. Update : {x_{k+1} = x_k + t_k\Delta x_k}.

The stopping criterion does not directly follow from the tolerance limit as we do not know the optimal value {p^*}. As we will see later, the choice of stopping criterion is of prime importance to the performance of a descent algorithm and it is not easy to come up with an appropriate stopping criterion. For now, we assume that “Exact Line Search” is used. By this, we mean that the step size {t_k} is given by,

\displaystyle t_k = {\text{argmin}}_t~f(x_k + t \Delta x_k). \ \ \ \ \ (9)

We will later see that implementing “Exact Line Search” is not practical, and we will come up with simpler ways to find the step size {t_k} without any significant loss in performance. We are now left with the choice of the descent direction {\Delta x_k}, which should satisfy (8). All the descent algorithms differ in the choice of this descent direction. We now look at a basic descent algorithm.

2.1. Gradient Descent Algorithm (GDA)

The descent direction is chosen as

\displaystyle \Delta x_k = - \nabla f(x_k).

Observe that this descent direction clearly satisfies (8). Next, we state and prove the following important theorem, which shows that GDA converges to a stationary point for {\beta}-smooth functions.

Theorem 2 Let {f} be a {\beta}-smooth function and {f^* = \min f(x)>- \infty}. Then the gradient descent algorithm with a constant step size {t < \frac{2}{\beta}} will converge to a stationary point, i.e., the set {\left\lbrace x : \nabla f(x) = 0\right\rbrace}.

Proof: Recall that for gradient descent algorithm,

\displaystyle x_{k+1} = x_k - t \nabla f(x_k). \ \ \ \ \ (10)

As {f} is {\beta}-smooth, we have

\displaystyle \begin{array}{rcl} f(x_{k+1}) &\leq~& f(x_k) + \nabla f(x_k)^T(x_{k+1} - x_k ) + \frac{\beta}{2} \| x_{k+1} - x_k \| ^2,\\ & \stackrel{(a)}{=}~ & f(x_k) - t \|\nabla f(x_k) \|^2 + \frac{\beta t^2}{2}\|\nabla f(x_k) \|^2, \end{array}

where {(a)} follows from (10). Thus, we have

\displaystyle \begin{array}{rcl} & f(x_{k+1}) \leq f(x_k) - t \left( 1 - \frac{\beta t}{2} \right) \|\nabla f(x_k)\|^2,\\ & \|\nabla f(x_k)\|^2 \stackrel{(a)}{\leq} \frac{f(x_k) - f(x_{k+1})}{t \left(1 - \frac{\beta t}{2} \right)}, \end{array}

where {(a)} follows since {t < \frac{2}{\beta}}. Next, we have

\displaystyle \begin{array}{rcl} \sum\limits_{k=0}^{N} \|\nabla f(x_k)\|^2 &\leq & \frac{1}{t\left( 1 -\frac{\beta t}{2} \right)} \sum\limits_{k=0}^{N}\left(f(x_k) - f(x_{k+1})\right)\\ &=& \frac{f(x_0) - f(x_{N+1})}{t \left( 1 - \frac{\beta t}{2} \right)} \ \ \ \ \ (11)\\ &\leq & \frac{f(x_0) - f^*}{t \left( 1 - \frac{\beta t}{2} \right)}, \end{array}

where {f^*} is the global optimal value. Taking the limit as {N \rightarrow \infty}, we have

\displaystyle \begin{array}{rcl} & \sum\limits_{k=0}^{\infty} \| \nabla f(x_k) \|^2 < \infty. \end{array}

Hence {\| \nabla f(x_k) \| \rightarrow 0}, i.e., {\nabla f(x_k) \rightarrow 0}. \Box

Note that {t^* = \frac{1}{\beta}} maximizes the {t \left( 1 - \frac{\beta t}{2} \right)} term appearing in the denominator of (11) and hence is the optimal step size to be used. In this case we have

\displaystyle \sum\limits_{k=0}^{N} \|\nabla f(x_k)\|^2 \leq 2\beta\left({f(x_0) - f^*}\right).

The convergence rate can be obtained as follows: let {g_N = \min_{k=0,\hdots,N} \|\nabla f(x_k)\|}. Then

\displaystyle (1+N)g_N^2 \leq 2\beta\left({f(x_0) - f^*}\right),

which implies

\displaystyle g_N \leq \sqrt{ \frac{2\beta\left({f(x_0) - f^*}\right)}{1+N}}.

Hence, when {f} is just a {\beta}-smooth function (it need not be convex), the above theorem shows that gradient descent converges sub-linearly to a stationary point (which need not even be a local minimum). Also, observe that the above bound is on {g_N}, i.e., the best gradient norm up to iteration {N}: the function value decreases monotonically, but the gradient norm itself might oscillate. While the smoothness condition is used to prove the convergence rate, the algorithm will converge to a stationary point whenever the level set {\{x: f(x) \leq f(x_0)\}} corresponding to the initial point is bounded.
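The following sketch illustrates this on a one-dimensional {\beta}-smooth non-convex function (the test function and constants are our choices):

    import numpy as np

    # f(x) = sin(x) + 0.1 x^2 is beta-smooth (|f''| <= 1.2) but not convex.
    grad = lambda x: np.cos(x) + 0.2 * x
    beta = 1.2
    x, g_best = 3.0, np.inf
    for _ in range(200):
        g_best = min(g_best, abs(grad(x)))     # the g_N tracked by the bound
        x = x - grad(x) / beta                 # GD with the optimal step 1/beta
    # g_best decays like O(1/sqrt(N)); the iterate ends near a stationary point
    assert abs(grad(x)) < 1e-6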

 

In optimization, differentiability (even of higher orders) is not a very strong condition compared to continuity and hence cannot ensure convergence rates. This is because every continuous function (on a compact space) can be approximated by a smooth function (arbitrarily differentiable) to an arbitrary degree of precision (see the Stone–Weierstrass theorem for approximation by polynomials). Hence we require stronger conditions on the functions to ensure convergence rates.

Definition 1 (Lipschitz Continuous functions) A function {f(x):{\mathbb R}^n \rightarrow {\mathbb R}} is said to be Lipschitz if

\displaystyle  |f(x_1) -f(x_2)| \leq L\|x_1-x_2\|.

Here {L} is called the Lipschitz constant. The function {\exp(-x)} is Lipschitz with constant {1} on {[0,\infty)}. The following lemma characterizes Lipschitz functions which are differentiable.

Lemma 2 A differentiable function is Lipschitz on a convex domain iff

\displaystyle \|f'\|_\infty < \infty,

and the sup norm of the derivative is the Lipschitz constant.

The above lemma also helps us in identifying Lipschitz functions. For example, for {f(x)=e^{-x}}, {x \in (0, \infty)}, we have {f^{\prime}(x)=-e^{-x}} and {|-e^{-x}| \leq 1}. Hence the Lipschitz constant of the exponential function is {1}.

In the Convex analysis literature, functions whose derivatives are Lipschitz continuous are also called Lipschitz continuous functions. To avoid confusion, we will call functions whose gradients are Lipschitz continuous {\beta}-smooth functions, i.e.,

\displaystyle  ||\nabla f(x) -\nabla f(y)||_2 \leq \beta ||x-y||_2.

Theorem 3 If {f} is {\beta}-smooth, then

\displaystyle |f(x)-f(y)-\nabla f(y)^{T}(x-y)| \leq \frac{\beta}{2} || x-y||^2.

Proof: Consider the function {g(t)=f(y+t(x-y))}. We have {g(1)=f(x), g(0)=f(y)} and

\displaystyle g'(t) = \nabla f(y+t(x-y))^{T}(x-y).

Since

\displaystyle \int_0^1 g'(t)d t = g(1)-g(0),

we have

\displaystyle  \begin{array}{rcl}  |f(x)-f(y)-\nabla f(y)^{T}(x-y)| &=& \left| \int_0 ^1 \nabla f(y+t(x-y))^{T}(x-y) \, d t - \nabla f(y)^{T}(x-y) \right| \\ &\leq& \int_0 ^1 \left| \left(\nabla f(y+t(x-y)) - \nabla f(y)\right)^{T}(x-y) \right| \, d t . \end{array}

By applying the Cauchy–Schwarz inequality, this is

\displaystyle  \begin{array}{rcl}  &\leq& \int_0 ^1 \|\nabla f(y+t(x-y)) - \nabla f(y)\|_2 \, \|x-y\|_2 \, d t . \end{array}

As {f} is {\beta}-smooth, this is

\displaystyle  \begin{array}{rcl}  &\leq& \|x-y\|_2 \int_0 ^1 \beta t \|x-y\|_2 \, dt = \frac{\beta}{2}\|x-y\|_2^2 . \end{array}

\Box

This shows that {f(x) \leq f(y) + \nabla f(y)^{T}(x-y) +\frac{\beta}{2}||x-y||^2,} i.e., if a function is {\beta}-smooth then it is upper bounded by a quadratic function. While the upper bound is useful, the corresponding quadratic lower bound {f(x) \geq f(y) + \nabla f(y)^{T}(x-y) -\frac{\beta}{2}||x-y||^2} has a negative square term and is hence very loose. Nevertheless, this shows that the growth rate of a {\beta}-smooth function is at most quadratic. We now look at convex functions:

Definition 4 (convex) A function {f(x): A\rightarrow {\mathbb R}} is convex if

  • {f(\lambda x +(1-\lambda) y) \leq \lambda f(x) +(1-\lambda)f(y), \lambda \in [0,1]}.
  • {A} is convex.

We now look at an easier and useful characterisation of convexity when the function is differentiable.

Lemma 5 A continuously differentiable function {f(x):{\mathbb R}^n \rightarrow {\mathbb R}} is convex iff

\displaystyle f(y) \geq f(x) +\nabla f(x)^T(y-x)

We now define the notion of strong convexity, which provides quadratic lower bounds on convex functions (I won't dwell upon this too much as these facts have already been proved in the previous course). Recall that a twice differentiable function is convex iff {\nabla^2 f(x) \succeq 0}, i.e., the Hessian is a positive semi-definite matrix for all {x}; for example, {e^{-x}} is convex on the positive reals.

Definition 6 (Strong Convexity) A function {f} is said to be strongly convex with parameter {\alpha} if

\displaystyle f(x) \geq f(y) + \nabla f(y)^{T}(x-y) +\frac{\alpha}{2}||x-y||^2, \quad \forall x,y \in {\mathbb R}^n.

Obviously, {\alpha} can be at most {\beta} if the function is both {\beta}-smooth and {\alpha}-strongly convex. The above definition can be used to show that the Hessian of a strongly convex function must satisfy

\displaystyle \nabla^2 f -\alpha I \succeq 0,

i.e., must be positive semi definite. The advantage of a strongly convex function is that the function will not be flat at the bottom and hence the convergence to the optimal solution will be faster. While the function {e^{-x}, x>0} is convex, it is not strongly convex. However (adding a regularisation term) {e^{-x} +\lambda x^2, \lambda>0} is strongly convex. Similarly adding {\lambda \|x\|^2} to any convex function would lead to a strongly convex function. The next theorem provides a bound on the gap between the function value and the optimum in terms of the gradient.

Theorem 7 If {f} is strongly convex, then {f(y)- \min\limits_{x \in X} f(x)} { \leq \frac{1}{2\alpha} ||\nabla f(y) ||^2}, where {X} is the convex set over which we optimize.

Proof: For a strongly convex function,

\displaystyle  f(x) \geq f(y)+ \nabla f(y)^T(x-y) +\frac{\alpha}{2} || x-y||^2 .  \ \ \ \ \ (1)

Let

\displaystyle Q(x) = f(y)+ \nabla f(y)^T(x-y) +\frac{\alpha}{2} || x-y||^2.

Since {Q(x)} is a quadratic function of {x}, it can be easily seen that it attains its minimum at {x^{*}=y-\frac{1}{\alpha} \nabla f(y)}. Substituting this in (1), we obtain

\displaystyle f(x) \geq f(y) -\frac{1}{2\alpha} ||\nabla f(y) ||^2.

Let {P^*} denote {\min\limits_{x \in X} f(x)}. Since {f(x) \geq f(y) -\frac{1}{2\alpha} ||\nabla f(y) ||^2} holds for all {x}, taking the minimum over {x \in X} gives {P^* \geq f(y) -\frac{1}{2\alpha} ||\nabla f(y) ||^2}, which is the claim. \Box
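A quick numerical check of the theorem on a strongly convex quadratic (the test function is our choice; its minimum value is {0}):

    import numpy as np

    # f(x) = 0.5 x^T A x is alpha-strongly convex with alpha = min eigenvalue,
    # and its minimum value is 0; the theorem then reads
    # f(y) <= ||grad f(y)||^2 / (2 alpha) for every y.
    A = np.diag([2.0, 0.5])
    alpha = 0.5
    rng = np.random.default_rng(0)
    for _ in range(100):
        y = rng.normal(size=2)
        f_y = 0.5 * y @ A @ y
        g_y = A @ y
        assert f_y <= g_y @ g_y / (2 * alpha) + 1e-12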

It is very difficult to solve general non-linear optimization problems, and there exist no universal techniques. This is intuitive, since most difficult math problems can be cast as optimization problems. For example, the simple problem of finding the root of a non-linear function {f(x)} can be cast as the optimization problem {\min_{x\in {\mathbb R}^n} f(x)^2}. Hence we focus on a sub-class of optimization problems that can be solved efficiently.

Linear optimization problems (linear objective function and polyhedral constraint set) can be solved in a very efficient manner. This class of problems can be solved very efficiently even with a million variables, and commercial software exists to solve these large problems. On the other hand, solving general non-linear optimization problems is extremely difficult (almost impossible). While solving linear problems is important and useful, the linear class does not cover all interesting optimization problems (for example, eigenvalue maximisation). Convex optimization problems are an important class (subsuming linear problems and containing a subset of non-linear problems) that is interesting, useful, and efficiently solvable.

In this course, we will look at algorithms for convex optimization problems. In particular, the focus will be on large dimension problems and several algorithms along with their convergence rates will be presented.

There is no standard textbook for this course. We will be referring to the following textbooks:

  1. Introductory lectures on convex optimization: A basic course, Yurii Nesterov
  2. Convex optimization, Stephen Boyd and Lieven Vandenberghe
  3. Lectures on modern convex optimization, Aharon Ben-Tal and Arkadi Nemirovski
  4. Theory of convex optimization for machine learning, Sébastien Bubeck
  5. Convex optimization theory, Dimitri P. Bertsekas
  6. Nonlinear programming, Dimitri P. Bertsekas
  7. Linear and nonlinear programming, David G. Luenberger and Yinyu Ye
  8. Numerical optimization, Jorge Nocedal and Stephen J. Wright
  9. Problem complexity and method efficiency in optimization, A. Nemirovsky and D. B. Yudin
  10. A course in convexity, Alexander Barvinok

1. Comparing various algorithms

In the first part (and most) of the course, we will be looking at algorithms that do not exploit the structure of the objective function. In this regard, we assume an oracle that provides the required information, and we do not have any other additional information. In this course we consider the following oracles:

  1. Zeroth-order oracle: input {x}, output {f(x)}.
  2. First-order oracle: input {x}, output {(f(x), \nabla f(x))}
  3. Second-order oracle: input {x}, output {(f(x), \nabla f(x), \nabla^2 f(x))}

We can now compare different algorithms over a class of functions with different oracles. The complexity of an algorithm will be measured in terms of the number of calls to the oracle and the computation on the oracle outputs needed to compute the optimum value. However, we will not be counting the complexity of the oracle itself, i.e., the complexity of computing {f(x), \nabla f(x), \nabla^2 f(x)} will be neglected. In most of the problems we will be considering, {f(x)} is real valued and hence it does not make sense (or will be of infinite complexity) to ask for the exact optimal value. Instead, we will look for solutions that are {\epsilon}-close to the optimal value, i.e., we are interested in {\tilde{x}} such that {f(\tilde{x})- f(x^*) < \epsilon}.

2. Zeroth-order oracle

Let us first consider the complexity of solving the following problem

\displaystyle \min_{x\in [0,1]^n} f(x),

when the function {f(x)} belongs to the class of Lipschitz continuous functions with respect to the {\|\cdot\|_\infty} norm. We require an {\epsilon} accurate solution, i.e., if {\tilde{x}} is the declared solution, then {f(\tilde{x}) - f(x^*) <\epsilon}. A function {f(x):{\mathbb R}^n\rightarrow {\mathbb R}} is Lipschitz continuous with constant {L} if

\displaystyle |f(x)-f(y)|\leq L\|x-y\|_\infty, \quad \forall x,y \in {\mathbb R}^n.

Suppose we consider a zeroth order oracle that provides only the value of the function. What can we say about the computational complexity for this class of functions? We can provide an upper bound on the complexity by exhibiting an algorithm. The simplest algorithm is the following grid search: divide each dimension into {K} parts and search over the resulting lattice. It can easily be seen that {\Theta(K^n)} calls to the oracle are required.

Lemma 1 If {K> L/(2\epsilon)}, the grid search algorithm provides an {\epsilon} accurate solution.

Proof: The optimal point {x^*} belongs to some small hypercube of the grid, of side length {1/K}. Let {\bar{x}} be the vertex of this hypercube nearest to {x^*}, so that {\|\bar{x}-x^*\|_\infty \leq 1/(2K)}. Then by Lipschitz continuity of {f},

\displaystyle f(\bar{x}) - f(x^*) \leq L\|\bar{x}- x^*\|_\infty \leq \frac{L}{2K}.

Hence choosing {L/(2K)<\epsilon}, i.e., {K> L/(2\epsilon)}, we obtain the required accuracy. \Box

So this algorithm calls the zeroth-order oracle {(L/2\epsilon)^n} times to obtain an {\epsilon} accurate solution. Is this optimal (in an order sense)? To prove that it is, we build a worst case function and show that, whatever the algorithm, the optimal value cannot be found in fewer than {\Theta((L/2\epsilon)^n)} steps.
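For concreteness, a sketch of the grid search with a zeroth-order oracle; the test function (Lipschitz with {L=1} in the sup norm) is our choice:

    import numpy as np
    from itertools import product

    def grid_search(f, n, K):
        # Evaluate f at every lattice point of {0, 1/K, ..., 1}^n and keep
        # the best; each evaluation is one zeroth-order oracle call.
        best_x, best_val = None, np.inf
        for idx in product(range(K + 1), repeat=n):
            x = np.array(idx) / K
            val = f(x)
            if val < best_val:
                best_x, best_val = x, val
        return best_x, best_val

    # Our test function: L = 1 Lipschitz in the sup norm on [0,1]^2
    f = lambda x: np.max(np.abs(x - 0.3))
    x_hat, v = grid_search(f, n=2, K=10)   # guarantees error at most L/(2K) = 0.05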

Lemma 2 With the zeroth order oracle, no deterministic algorithm can perform better.

Proof: In this setting, any algorithm can only depend on the values of the function that the oracle provides. Suppose you have an algorithm and you feed in the function {g(x)=0}. The algorithm will query a sequence of points {x_1, x_2, \hdots, x_N}, after which it outputs the optimal value with {\epsilon} accuracy. Let us assume that {N< (L/2\epsilon)^n =p^n} (otherwise we are done). Partition {[0,1]^n} into {p^n} subcubes; since there are only {N<p^n} query points, there exists a point {\tilde{x} \in [0,1]^n} such that {B(\tilde{x}, 1/2p)\cap \{x_1,x_2,\hdots,x_N\} = \emptyset} (the ball is with respect to the {\|\cdot\|_\infty} norm). Now consider the function

\displaystyle f(x) = \min\{0,L\|x- \tilde{x}\|-\epsilon\}.

This is a Lipschitz function with optimal value {-\epsilon}. The function is non-zero only when {\|x-\tilde{x}\| \leq \epsilon/L}, i.e., in the ball {B(\tilde{x},\epsilon/L)}. Since {\epsilon/L \leq 1/(2p)}, all the algorithm's queries return {0}; it therefore cannot distinguish {f} from {g} and never gets {\epsilon}-close to the optimal value. \Box

This result shows that the complexity of optimization with zeroth-order oracle for Lipschitz functions is exponential in the dimension. In the next class, we will look at Lipschitz functions in more detail.

 

 

Reference: Problem complexity and method efficiency in optimization, A Nemirovsky and D. B. Yudin

Before we proceed further, we observe that the property of a linear function was used only in a small part of the proof of Corollary 8 in the last post. Observe that the proof (for the maximum) would go through if the function {f(x)} satisfies the following properties:

  1. {f(a x) = a f(x)}, {a>0}.
  2. sub-additive property: {f(x+y) \leq f(x)+ f(y)}

Such functions are called sub-additive functions. So a continuous sub-additive function achieves its maximum on a vertex of a compact convex set. An example of a sub-additive function is a norm {\|x\|}. Another example is the support function of a set (we will look at this function later).

We will now look at some interesting examples of extreme points and maximization of linear functionals. We will first begin with the following motivational example.

Examples:

Assignment problem (AP): The AP arises in many contexts and is an example of binary optimization. Here we will motivate the problem in the context of wireless and later generalize the notion.

OFDM (orthogonal frequency division multiplexing) is a multi-carrier modulation scheme used in 4G networks. In OFDM, the data is modulated over different tones in such a way that orthogonality can be maintained across tones.

In general, a base station (BS) tower serves multiple users. When using OFDM, the BS generally has to decide the allocation of different tones to different users (OFDMA). Assume for now there are {n} frequency tones and {n} users. In general the users are distributed in space, hence each frequency tone will have a different “channel” with respect to each user. So let {h_{ij}} denote the channel of tone {i} from the BS to user {j}. A good channel in general indicates a higher data rate on that particular tone. A simple map from the channel quality to the data rate obtained is {R_{ij} = \log_2(1+h_{ij}^2)}. The BS wants to allocate one frequency tone per user, using all the tones, such that the sum rate is maximized. For example if {n=4}, a sample allocation is given by {\sigma =\{3,2,4,1\}}, which implies the first tone is allocated to the third user, the second tone to the second user, the third to the fourth user and the fourth to the first user. Observe that any allocation is just a permutation of the integers {\{1,2,...,n\}}. So mathematically the BS should find a permutation {\sigma} such that the allocated sum rate

\displaystyle \sum_{i=1}^n R_{i \sigma(i)},

is maximized. Observe that if the BS has check for every permutation of {1,...,n} before it can find the permutation that maximizes the sum rate. Suppose there are {20} tones, then the search space is approximately {20!\approx 2\times 10^{18}} which is huge.

In a little while, using extreme points and the Krein-Milman theorem, we will come up with a smarter way of solving this problem with polynomial complexity.

We will now start looking more closely at the vertices (extreme points) of a polyhedron. Most of the material is taken from AB. We begin with the following theorem, which characterizes the vertices of a polyhedron in terms of the linear span of the constraint vectors.

Theorem 1 Let {a_i, i=1,\hdots,n} be {n} vectors in {{\mathbb R}^d} and {b_i \in {\mathbb R}}. Let {P} be the polyhedron defined as

\displaystyle P=\{x: x\in {\mathbb R}^d,\ a_i^t x \leq b_i,\ i = 1,\hdots,n\}.

For any point {u \in P} define

\displaystyle I(u)= \{i: a_i^t u = b_i\}.

Then {u} is a vertex of {P} if and only if the set of vectors {\{a_i, i \in I(u)\}} linearly span {{\mathbb R}^d}.

Proof: Let us first prove that vertex implies linear span; we prove this by contradiction. Let {u} be a vertex of {P} and suppose {\{a_i, i \in I(u)\}} does not span {{\mathbb R}^d}. For notation, let {m=|I(u)|} and relabel the constraints so that {I(u)=\{1,\hdots,m\}}. Consider the matrix {A} with the {a_i^t} as its rows, i.e.,

\displaystyle A=[a_1,\hdots ,a_m]^t,

an {m \times d} matrix. Since the rows of {A} do not span the entire {{\mathbb R}^d}, the kernel (null space) of {A} is non-trivial (its dimension is greater than {0}). Hence there exists a non-zero {y \in \ker(A)}. Since {y \in \ker(A)},

\displaystyle a_i^t y =0,\quad \forall i \in I(u). \ \ \ \ \ (1)

We will now prove that {u} is not a vertex. Let {\epsilon >0} be a small number. Define

\displaystyle X= u + \epsilon y,

and

\displaystyle Y= u -\epsilon y.

Observe that {u= (X+Y)/2}; hence if {X} and {Y} lie in the polyhedron {P}, then {u} is not an extreme point (vertex), which is a contradiction. We will now prove that for sufficiently small {\epsilon}, {X} and {Y} indeed lie in {P}. To show that {X} (or {Y}) lies in {P}, we must show that it satisfies {a_i^t X \leq b_i} for all {i}. Let {i \in I(u)}. Then

\displaystyle \begin{array}{rcl} a_i^tX &= a_i^t(u+\epsilon y)\\ &= a_i^tu +\epsilon a_i^t y\\ &\stackrel{(a)}{=} b_i +0 \end{array}

{(a)} follows since {a_i^t u =b_i} for {i\in I(u)} and {a_i^t y =0} since {y \in \ker(A)}. Now if {i \notin I(u)}, then

\displaystyle \begin{array}{rcl} a_i^tX &= a_i^t(u+\epsilon y)\\ &=a_i^tu +\epsilon a_i^t y \end{array}

Observe that {a_i^t u <b_i}. Hence if we choose {\epsilon} sufficiently small, then {a_i^tu + \epsilon a_i^t y < b_i}. (For example, you can choose {\epsilon = 0.2\min \frac{b_i - a_i^t u}{|a_i^t y|}}, the minimum taken over {i \notin I(u)} with {a_i^t y \neq 0}.) Hence {X} satisfies all the inequalities and is in the polyhedron. Similarly, {Y} satisfies all the inequalities and is in the polyhedron.

We will now prove the converse, i.e., linear span implies vertex. Suppose {u = (X+Y)/2} where {X, Y \in P}. For {i\in I(u)} we first have

\displaystyle a_i^t u =b_i= \frac{a_i^t X + a_i^t Y}{2}.

Since {X, Y \in P}, we have {a_i^tX \leq b_i} and {a_i^tY \leq b_i}, which is possible only when {a_i^tX = b_i} and {a_i^tY = b_i}. Now consider the set of equations, for {i \in I(u)},

\displaystyle a_i^t z = b_i.

Since the vectors {a_i, i \in I(u)}, span the entire {{\mathbb R}^d}, the above set of equations has at most one solution. But we have just shown that {u}, {X} and {Y} all satisfy these equations. This implies {u = X = Y}, so {u} is a vertex. \Box

Corollary 2 For any vertex {u}, the cardinality of the set {I(u)} is at least {d}, i.e., {|I(u)| \geq d}.

Proof: We require at least {d} vectors to span {{\mathbb R}^d}. \Box
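
Theorem 1 is easy to test numerically. The following sketch checks the criterion on the unit square in {{\mathbb R}^2} (the polyhedron and the test points are hypothetical examples):

import numpy as np

# Unit square written as a_i^t x <= b_i, with the a_i^t as rows of A.
A = np.array([[ 1.,  0.],    #  x1 <= 1
              [ 0.,  1.],    #  x2 <= 1
              [-1.,  0.],    # -x1 <= 0
              [ 0., -1.]])   # -x2 <= 0
b = np.array([1., 1., 0., 0.])

def is_vertex(u, tol=1e-9):
    active = np.isclose(A @ u, b, atol=tol)        # the index set I(u)
    return np.linalg.matrix_rank(A[active]) == A.shape[1]

print(is_vertex(np.array([1., 1.])))    # True: a corner, {a_i, i in I(u)} spans R^2
print(is_vertex(np.array([0.5, 1.])))   # False: edge midpoint, |I(u)| = 1 < 2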

We will now analyze a particular polyhedron that is required to simplify the assignment problem. Before that, we require the following definition of a permutation matrix. Suppose {\sigma} is a permutation of {1,...,n}. Then the permutation matrix {X_\sigma} is the {n\times n} matrix with {(i,j)} entry given by

\displaystyle x_{ij}=\left\{\begin{array}{cc} 1& \text{if } \sigma(i) =j\\ 0& \text{otherwise}. \end{array} \right. \ \ \ \ \ (2)

So for a given {n} there are {n!} permutation matrices.
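
For instance, the permutation matrix for the sample allocation {\sigma=\{3,2,4,1\}} from the OFDM example can be built as follows (indices shifted to start at {0} in code):

import numpy as np

# sigma = {3,2,4,1} in the text, zero-based here: sigma(i) = j.
sigma = [2, 1, 3, 0]
X = np.zeros((4, 4), dtype=int)
X[np.arange(4), sigma] = 1           # x_ij = 1 iff sigma(i) = j
print(X)
# Every row and column sums to 1, so X is doubly stochastic as well.
assert (X.sum(axis=0) == 1).all() and (X.sum(axis=1) == 1).all()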

Birkhoff polytope:

The Birkhoff polytope {B_n} is the set of all {n\times n} doubly stochastic matrices. Mathematically,

\displaystyle B_n=\left\{\{\xi_{ij}\}_{n\times n}: \sum_{i=1}^n \xi_{ij}=1,\sum_{j=1}^n \xi_{ij}=1, \xi_{ij}\geq 0 , \forall i, j \right\}.

First observe that {B_n} is defined entirely by linear equalities and inequalities and hence is a polyhedron. Also observe that {B_n} is a convex set.

Hw 1 Show that {B_n} is a compact set.
Lemma 3 An {n \times n} permutation matrix {X_\sigma} is an extreme point of {B_n}.

Proof: Suppose {X_\sigma = (A+ C)/2} where {A, C \in B_n}. Consider the {i}-th row of all three matrices. Suppose {x_{im} =1} and all the other entries of that row are zero. Then we have

\displaystyle 2[0,\hdots, 0,1,0,\hdots,0] = [a_{i1}, \hdots, a_{in}]+[c_{i1}, \hdots, c_{in}]

We have {\sum_{k=1}^n a_{ik} =1} and {\sum_{k=1}^n c_{ik} =1}, and each term is non-negative. This is possible only when {a_{im}=c_{im} =1} and all the other entries of the {i}-th row are zero. Applying the same argument to every row shows that {X_\sigma= A=C}, which proves that {X_\sigma} is an extreme point of {B_n}. \Box

We will now prove the converse of the above statement.

Theorem 4 (Birkhoff–von Neumann Theorem) The extreme points of {B_n} are precisely the permutation matrices.

Proof: We prove the theorem by induction. For {n=1} the statement is obvious. Let us assume it is true up to dimension {n-1} and prove it for dimension {n}.

Consider the following affine subspace {L} of {{\mathbb R}^{n^2}}: the set of all arrays {\{\xi_{ij}\}} of {n^2} real numbers {\xi_{ij} \in {\mathbb R}} such that

\displaystyle \sum_{i=1}^n \xi_{ij} =1, \ \forall j=1,...,n,

and

\displaystyle \sum_{j=1}^n \xi_{ij} =1, \ \forall i=1,...,n.

Observe that the dimension of the space {L} is {(n-1)^2}: of the {2n} equality constraints only {2n-1} are linearly independent (the sum of all row constraints equals the sum of all column constraints), so {\dim L = n^2-(2n-1) = (n-1)^2}. Observe that {B_n\subset L} and that {B_n} is defined as the polyhedron

\displaystyle \xi_{ij} \geq 0, \forall\ i,j

in {L}.

Let {X} be an extreme point of {B_n}. By Theorem 1 (via Corollary 2, applied inside {L}, whose dimension is {(n-1)^2}), the extreme point {X} must satisfy at least {(n-1)^2} of the inequalities {\xi_{ij} \geq 0} with equality. Hence {x_{ij} =0} for at least {(n-1)^2} entries of the matrix. We make the following observation.

  • Not every row can have two or more non-zero entries: if every row had at least two non-zero entries, then each row would have at most {n-2} zero entries, so the matrix would have at most {n(n-2)} zero entries. This is a contradiction, since {n(n-2) = n^2-2n < n^2-2n+1 = (n-1)^2}.

So there is at least one row {i_o} with only one non-zero entry, at location {j_o}. Since {X} is doubly stochastic, we have {x_{i_oj_o}=1}, and hence there can be no other non-zero entries in column {j_o}. Let {\tilde{X}} be the {(n-1)\times(n-1)} matrix obtained after removing row {i_o} and column {j_o}. Observe that

  1. {\tilde{X}} is a doubly stochastic matrix: this is because the original matrix {X} was doubly stochastic and the only non-zero contribution from row {i_o} and column {j_o} is {x_{i_oj_o}}.
  2. {\tilde{X}} is an extreme point of {B_{n-1}}: suppose not, i.e., {\tilde{X} = (C+D)/2} with {C, D \in B_{n-1}} and {C \neq D}. Re-insert a row at position {i_o} (all zeros except a {1} in column {j_o}) and a column at position {j_o} (all zeros except that same {1}) into both {C} and {D}, obtaining matrices {\hat C, \hat D \in B_n}. Then {X = (\hat{C}+\hat{D})/2} with {\hat C \neq \hat D}, contradicting the extremality of {X}.

So {\tilde{X}} is an extreme point of {B_{n-1}}, and by the induction hypothesis {\tilde{X}} is some {(n-1)\times (n-1)} permutation matrix. Augmenting {\tilde{X}} with the removed row {i_o} and column {j_o} recovers {X}, which is therefore an {n\times n} permutation matrix. \Box
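
The induction in the proof also suggests an algorithm: repeatedly find a permutation supported on the positive entries of a doubly stochastic matrix and peel it off. Below is a sketch of this classical greedy decomposition (not a method from the lecture); SciPy's linear_sum_assignment is used only to find a perfect matching inside the support.

import numpy as np
from scipy.optimize import linear_sum_assignment

def birkhoff_decompose(X, tol=1e-9):
    """Write a doubly stochastic X as a convex combination of
    permutation matrices (greedy sketch of Birkhoff's theorem)."""
    X = X.astype(float).copy()
    terms = []
    while X.max() > tol:
        # A perfect matching inside the support of X exists (Hall's
        # condition follows from double stochasticity); maximizing the
        # count of positive entries picked finds one.
        rows, cols = linear_sum_assignment(-(X > tol).astype(float))
        P = np.zeros_like(X)
        P[rows, cols] = 1.0
        w = X[rows, cols].min()      # largest weight we can peel off
        terms.append((w, P))
        X -= w * P                   # remainder is a scaled d.s. matrix
    return terms

# Example: recover the two permutation matrices behind an average.
X = np.array([[.5, 0., .5], [0., 1., 0.], [.5, 0., .5]])
for w, P in birkhoff_decompose(X):
    print(w, P.astype(int).tolist())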

We will now see how to utilize this theorem to simplify the assignment problem. Recall that the AP was to maximize over all the permutations of {1,...,n}:

\displaystyle \max_{\sigma}\sum_{i=1}^n R_{i \sigma(i)}.

Observe that the cost function {\sum_{i=1}^n R_{i \sigma(i)}} is equivalent to

\displaystyle \sum_{i,j} R_{ij} x_{ij},

where {X_\sigma = \{x_{ij}\}} is the permutation matrix corresponding to {\sigma}. So the AP equals

\displaystyle \max_{X_\sigma}\sum_{i,j} R_{ij} x_{ij}.

Observe that even now, {x_{ij}} is restricted to be either {0} or {1}.

Now consider a new problem P2:

\displaystyle \max\sum_{ij} R_{ij} \xi_{ij},

over the set

\displaystyle \begin{array}{rcl} &\sum_{i}\xi_{ij} =1, \forall j\\ &\sum_{j}\xi_{ij} =1, \forall i\\ &\xi_{ij} \geq 0, \forall i,j. \end{array}

Observe that we are now maximizing over the Birkhoff polytope {B_n}, and the {\xi_{ij}} are real numbers. So we are maximizing the linear function {\sum_{ij} R_{ij} \xi_{ij}} over the compact convex set {B_n}; hence the maximum is attained at a vertex of {B_n}. But by the Birkhoff–von Neumann theorem, the vertices of {B_n} are precisely the permutation matrices. Hence solving P2 is equivalent to solving the AP. Moreover, P2 is a linear programming problem and can be solved with polynomial complexity.
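
A minimal sketch of P2 as a linear program, using SciPy's linprog (the rates {R_{ij}} are hypothetical random values; the solver returns a basic optimal solution, i.e. a vertex, which by the theorem is a permutation matrix):

import numpy as np
from scipy.optimize import linprog

# Variables: the n^2 entries xi_ij, flattened row by row.
n = 4
rng = np.random.default_rng(3)
R = np.log2(1 + rng.random((n, n))**2)

A_eq = np.zeros((2 * n, n * n))
for i in range(n):
    A_eq[i, i * n:(i + 1) * n] = 1   # sum_j xi_ij = 1 (row i)
    A_eq[n + i, i::n] = 1            # sum_i xi_ij = 1 (column i)
b_eq = np.ones(2 * n)

# Maximize sum_ij R_ij xi_ij  <=>  minimize -R . xi over B_n.
res = linprog(-R.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, 1))
Xi = res.x.reshape(n, n)
print(np.round(Xi).astype(int))      # a permutation matrix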