Nesterov’s Accelerated Gradient Descent Algorithm

We have seen that no first order method can converge faster than {\Theta(1/k^2)} on the class of {\beta}-smooth functions. However, GDA achieves only a {\Theta(1/k)} convergence rate for {\beta}-smooth functions. We will now look at Nesterov's accelerated method, which achieves the optimal rate of convergence.
Nesterov's accelerated method

\displaystyle y_{s+1} = x_s -\frac{1}{\beta}\nabla f(x_s),

\displaystyle  x_{s+1} = y_{s+1} + \frac{s+2}{s+5}(y_{s+1} -y_s).  \ \ \ \ \ (1)

Essentially, the accelerated method has memory: the current point depends on the previous two iterates. This technique achieves the optimal convergence rate.
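For concreteness, here is a minimal numerical sketch of these updates (the function and variable names are ours, and the quadratic test function is an assumption made purely for illustration):

    import numpy as np

    def nesterov_agd(grad_f, x1, beta, num_iters):
        # Sketch of the accelerated method above; all names are ours.
        # grad_f returns the gradient of a beta-smooth convex f,
        # x1 is the starting point (we take x_1 = y_1).
        x = y_prev = np.asarray(x1, dtype=float)
        for s in range(1, num_iters + 1):
            y = x - grad_f(x) / beta                  # y_{s+1} = x_s - (1/beta) grad f(x_s)
            x = y + (s + 2) / (s + 5) * (y - y_prev)  # momentum step from (1)
            y_prev = y
        return y_prev

    # Illustration on a smooth convex quadratic f(x) = 0.5 x^T A x (our choice)
    A = np.diag([1.0, 0.01])
    x_best = nesterov_agd(lambda x: A @ x, np.array([5.0, 5.0]), beta=1.0, num_iters=200)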

Lemma 1 The accelerated method has a convergence rate of {\Theta(1/k^2)}, where {k} is the iteration number.

Proof: We have

\displaystyle f(y_{s+1}) - f(y_s) = f(y_{s+1}) - f(x_s) +f(x_s)-f(y_s)

Using the quadratic upper bound for {\beta}-smooth functions on the first difference and the gradient inequality (convexity) on the second, we obtain

\displaystyle  \begin{array}{rcl}  f(y_{s+1}) - f(y_s)&\leq& \nabla f(x_s)^T(y_{s+1}-x_s) +\frac{\beta}{2}\|y_{s+1} - x_s\|^2 +\nabla f(x_s)^T(x_s-y_s)\\ &=&\nabla f(x_s)^T(y_{s+1}-y_s) + \frac{\beta}{2}\|y_{s+1} - x_s\|^2. \qquad (*) \end{array}

The above inequality also holds with {y_s} replaced by {y^*}. Hence we also have

\displaystyle f(y_{s+1}) - f(y^*) \leq \nabla f(x_s)^T(y_{s+1}-y^*) + \frac{\beta}{2}\|y_{s+1} - x_s\|^2. \qquad (**)

Let {\delta_s = f(y_s)-f(y^*)}. Multiplying {(*)} with {(1+s/2)} and adding to {(**)}, we obtain

\displaystyle  \begin{array}{rcl}  (2+s/2)\delta_{s+1}-(1+s/2)\delta_s\leq & \frac{\beta}{2}(2+s/2)\|y_{s+1}-x_s\|^2\\ &+\nabla f(x_s)^T\left[(1+s/2)(y_{s+1}-y_s)+(y_{s+1}-y^*)\right]. \end{array}

We now multiply both sides by {(2+s/2)}. We first look at the LHS of the above equation after this multiplication. Since {(1+s/2)(2+s/2)<(2+(s-1)/2)^2} and {\delta_s \geq 0},

\displaystyle  (2+s/2)^2\delta_{s+1}-(1+s/2)(2+s/2)\delta_s \geq (2+s/2)^2\delta_{s+1} -(2+(s-1)/2)^2\delta_s. \qquad (***)

The RHS after multiplication by {(s/2+2)} is

\displaystyle  \begin{array}{rcl}  \frac{\beta}{2}\|(2+s/2)(y_{s+1}-x_s)\|^2\\ +(2+s/2)\nabla f(x_s)^T\left[(1+s/2)(y_{s+1}-y_s)+(y_{s+1}-y^*)\right]. \end{array}

Using {y_{s+1} = x_s-\frac{1}{\beta}\nabla f(x_s)}, the above can be simplified to

\displaystyle  \begin{array}{rcl} \frac{\beta}{2}\Big( 2(2+s/2) (x_s-y_{s+1})^T\left\{(1+s/2)(x_s-y_s)+(x_s-y^*)\right\}\\-\|(2+s/2)(y_{s+1}-x_s)\|^2\Big). \end{array}

Using the relation {2ab-a^2 = b^2-(b-a)^2}, the above simplifies to

\displaystyle  \begin{array}{rcl}  \frac{\beta}{2}\Big[ \|(1+s/2)(x_s-y_s) +(x_s-y^*)\|^2\\ -\|(1+s/2)(x_s-y_s)-(2+s/2)(x_s-y_{s+1}) +(x_s-y^*) \|^2 \Big] \end{array}

Using the relation between {x_{s+1}} and {y_{s+1}} in the update rule, the second term can be rewritten, and the expression simplifies to

\displaystyle  \begin{array}{rcl}  \frac{\beta}{2}\Big[ \| (2+s/2)x_s-(1+s/2)y_s-y^*\|^2\\ -\| (2+(s+1)/2)x_{s+1}-(1+(s+1)/2)y_{s+1}-y^* \|^2 \Big] \end{array}

Letting {u_s = (2+s/2)x_s-(1+s/2)y_s-y^*}, we have

\displaystyle (2+s/2)^2\delta_{s+1}-(2+(s-1)/2)^2\delta_s\leq \frac{\beta}{2}(\|u_s\|^2-\|u_{s+1}\|^2).

Summing from {s=1} to {k-1}, we have

\displaystyle \left(\frac{k-1}{2} +2\right)^2\delta_k -4\delta_1 \leq \frac{\beta}{2}(\|u_1\|^2-\|u_{k}\|^2).

Rearranging, we see

\displaystyle \delta_k \leq \frac{4\left(\frac{\beta}{2}\|u_1\|^2+4\delta_1\right)}{(k+3)^2},

proving the result. \Box

Observing the proof, we see that the method can be generalized as follows:

\displaystyle x_{s+1} =(1+\nu_s)y_{s+1} - \nu_s y_s,

where {\nu_s= (\alpha_s-1)/\alpha_{s+1}} and the sequence {\alpha_s} should satisfy {\alpha_s(\alpha_s-1) \leq \alpha_{s-1}^2} and {\alpha_s = \Theta(s)} for {s} large.

The term {\frac{s+2}{s+5}(y_{s+1}-y_s)} in (1) is called the momentum term, and we observe that the coefficient {(s+2)/(s+5) \sim 1-3/s} for large {s}. The coefficient is generalized to {(k-1)/(k-1+r)} in the paper by Weijie Su et al., “A Differential Equation for Modeling Nesterov's Accelerated Gradient Method: Theory and Insights”.

Let the function {f} be {\alpha}-strongly convex and {\beta}-smooth, and let {Q=\beta/\alpha} be the condition number of {f}. In this case the basic gradient descent algorithm requires {O(Q \log(1/\epsilon))} iterations to reach {\epsilon}-accuracy, while Nesterov's Accelerated Gradient Descent attains the optimal oracle complexity of {O(\sqrt{Q} \log(1/\epsilon))}. This improvement is quite relevant for Machine Learning applications.

Algorithm: The basic structure of the algorithm is given below

  1. Start at an arbitrary initial point {x_1 = y_1}.
  2. Then iterate the following equations for {s \geq 1},

    \displaystyle  \begin{array}{rcl}  y_{s+1} & = & x_s - \frac{1}{\beta} \nabla f(x_s) , \\ x_{s+1} & = & \left(1 + \frac{\sqrt{Q}-1}{\sqrt{Q}+1} \right) y_{s+1} - \frac{\sqrt{Q}-1}{\sqrt{Q}+1} y_s . \end{array}

Theorem 2 Let {f} be {\alpha}-strongly convex and {\beta}-smooth, then Nesterov’s Accelerated Gradient Descent satisfies

\displaystyle  f(y_s)-p^* \leq \frac{\alpha + \beta}{2} \|x_1-x^*\|^2 \exp \left(-\frac{(s-1)}{\sqrt{Q}}\right) . \ \ \ \ \ (2)
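A sketch of this variant under the same hedged assumptions as before (the quadratic test function and all names are ours):

    import numpy as np

    def nesterov_sc(grad_f, x1, alpha, beta, num_iters):
        # Sketch of the algorithm above for an alpha-strongly convex,
        # beta-smooth f; function and variable names are ours.
        rootQ = np.sqrt(beta / alpha)          # sqrt of the condition number Q
        nu = (rootQ - 1) / (rootQ + 1)         # momentum coefficient
        x = y_prev = np.asarray(x1, dtype=float)
        for _ in range(num_iters):
            y = x - grad_f(x) / beta           # y_{s+1} = x_s - (1/beta) grad f(x_s)
            x = (1 + nu) * y - nu * y_prev     # x_{s+1} from the update above
            y_prev = y
        return y_prev

    # Illustration: f(x) = 0.5 x^T A x with curvatures in [alpha, beta] = [0.01, 1]
    A = np.diag([1.0, 0.01])
    x_best = nesterov_sc(lambda x: A @ x, np.array([5.0, 5.0]),
                         alpha=0.01, beta=1.0, num_iters=300)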

Till now, we were looking at the gradient descent algorithm and its performance on different classes of functions. When GD is used on the class of {\beta}-smooth convex functions, the convergence rate is {\Theta(1/k)}. The convergence rate equals {\Theta(\exp(-k/Q))} when used on {\alpha}-strongly convex and {\beta}-smooth functions. Are these the optimal rates? Are there algorithms that can improve upon these convergence rates? For example, are there first order algorithms that can achieve a convergence rate of {\Theta(1/k^4)} on the class of {\beta}-smooth functions? In this lecture, we will come up with explicit functions that provide upper bounds on the rate of convergence, i.e., these functions exhibit the worst case convergence rate. We first look at the class of {\beta}-smooth functions. The following proof is adapted from Nemirovski's textbook. For simplicity, we make the following assumptions on the oracle. We will later see how to relax these conditions.

  • First Order Oracle: Only first order information should be used.
  • {x_0=0} and {x_k \in \text{Span}\lbrace \nabla f(x_0),\hdots,\nabla f(x_{k-1})\rbrace}.

Theorem 1 Let {t \leq (n-1)/2} and {\beta >0}. There exists a {\beta}-smooth convex function {f:{\mathbb R}^n \rightarrow {\mathbb R}} such that for any black-box procedure satisfying the conditions defined above,

\displaystyle  \min_{1 \leq s \leq t} f(x_s)-f(x^*) \geq \frac{3\beta}{32}\frac{||x_0-x^*||^2}{{(t+1)}^2} \ \ \ \ \ (1)

Proof: For {k \leq n}, let {A_k \in \mathbb{R}^{n \times n}} be the symmetric and tridiagonal matrix defined by

\displaystyle  (A_k)_{i,j}=\left\{ \begin{array}{rl} 2, & i=j,i\leq k \\ -1, & j \in \lbrace i-1,i+1\rbrace, i \leq k, j \neq k+1\\ 0, & \text{otherwise.} \end{array} \right.

We will first show that { 0 \preceq A_k \preceq 4I_n}. The first inequality can be verified as follows.

\displaystyle  \begin{array}{rcl}  x^T A_k x &=& 2\sum\limits_{i=1}^k x(i)^2 - 2 \sum\limits_{i=1}^{k-1} x(i)x(i+1) \\ &=& x(1)^2 + x(k)^2 + \sum\limits_{i=1}^{k-1} {(x(i)-x(i+1))}^2 \geq 0. \end{array}

Similarly, it can be easily shown that { A_k \preceq 4I_n}. Observe that our assumption gives {2t+1 \leq n} and hence {A_{2t+1}} is well defined. We now consider the following {\beta}-smooth convex function:

\displaystyle  f(x)=\frac{\beta}{8}x^T A_{2t+1}x - \frac{\beta}{4}x^Te_1. \ \ \ \ \ (2)

Note: Here {t} is not the running index of the algorithm. We first fix {t} and define the function {f(x)} for that given {t}; the running index in the proof is {s}. The function {f(x)} can be easily verified to be {\beta}-smooth as follows.

\displaystyle  \begin{array}{rcl}  \nabla^2f(x) &=& \frac{\beta}{4}A_{2t+1}, \\ ||\nabla^2f(x)|| &\leq& \frac{\beta}{4}||A_{2t+1}|| \leq \beta \; \; \; \left(\text{since } A_{2t+1} \preceq 4I_n \right). \end{array}
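As a sanity check, the following sketch builds {A_k} from the definition above and numerically verifies both {0 \preceq A_k \preceq 4I_n} and the minimizer {x_k^*(i)=1-i/(k+1)} computed later in the proof (the helper names and the chosen {n,t} are ours):

    import numpy as np

    def make_A(k, n):
        # Build the n x n matrix A_k from the definition above: 2 on the
        # diagonal for i <= k, -1 on the adjacent off-diagonals inside the
        # top-left k x k block, and 0 everywhere else.
        A = np.zeros((n, n))
        for i in range(k):
            A[i, i] = 2.0
            if i + 1 < k:
                A[i, i + 1] = A[i + 1, i] = -1.0
        return A

    n, t = 9, 4                     # t <= (n-1)/2, so A_{2t+1} is well defined
    k = 2 * t + 1
    A = make_A(k, n)
    eigs = np.linalg.eigvalsh(A)
    assert eigs.min() >= -1e-12 and eigs.max() <= 4.0   # 0 <= A_k <= 4 I_n

    # Check the minimizer x_k^*(i) = 1 - i/(k+1) derived below: A_k x = e_1
    x_star = np.array([1 - (i + 1) / (k + 1) for i in range(k)] + [0.0] * (n - k))
    e1 = np.zeros(n); e1[0] = 1.0
    assert np.allclose(A @ x_star, e1)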

Let us now look at the possible iterates generated by any algorithm that satisfies our assumptions. As {x_0=0} and {\nabla f(x)= \frac{\beta}{4}A_{2t+1}x-\frac{\beta}{4}e_1}, we can represent {x_1}, {x_2}, {\hdots}, as follows (taking unit steps for illustration),

\displaystyle  \begin{array}{rcl}  x_1&=& x_0-\frac{\beta}{4}e_1, \end{array}

where {e_i} is the {i}-th standard basis vector (it has {1} only in the {i}-th coordinate). We see that {x_1} has a non-zero value only in the first coordinate, i.e., {x_1 \in \text{span}(e_1)}. Similarly,

\displaystyle  \begin{array}{rcl}  x_2&=& x_1- \nabla f(x_1) \\ &=& x_1-\left[\frac{\beta}{4}A_{2t+1}\left(x_0-\frac{\beta}{4}e_1\right)-\frac{\beta}{4}e_1\right], \end{array}

from which we see that {x_2 \in \text{span}(e_1, e_2)}. From the way {A_k} is defined, it is easy to verify that {x_s} must lie in the linear span of {e_1, e_2, \hdots,e_{s}}. In particular, for {s\leq t} we necessarily have {x_s(i)=0} for {i=s+1,\hdots,n}, which implies

\displaystyle  x_s^TA_{2t+1}x_s=x_s^TA_{s}x_s. \ \ \ \ \ (3)

In other words, if we denote

\displaystyle  f_k(x)=\frac{\beta}{8}x^TA_kx - \frac{\beta}{4}x^Te_1, \ \ \ \ \ (4)

then we just proved that

\displaystyle  \begin{array}{rcl}  f(x_s)-f^*&\stackrel{a}{=}&f_s(x_s)-f^*_{2t+1}, \\ &\stackrel{b}{\geq}& f_s^* - f^*_{2t+1}. \end{array}

where {(a)} follows since {f(x_s)=f_s(x_s)}, and from the definitions of {f(x)} and {f_k(x)} we see that {f(x)=f_{2t+1}(x)}. The inequality {(b)} follows since {f_s^*=\min_x{f_s}(x)}. We now compute the minimizer {x_k^*} of {f_k}, its norm, and the corresponding function value {f_k^*}. The optimal {x_k^*} can be found by solving {\nabla f_k(x)=0}, i.e., {A_kx=e_1}. So the point {x_k^*} is the unique solution, in the span of {e_1,\hdots,e_k}, of {A_kx=e_1}. It is easy to verify that it is given by {x_k^*(i)=1-\frac{i}{k+1}} for {i=1,\hdots,k}. Thus we immediately have:

\displaystyle  \begin{array}{rcl}  f_k^*&=& \frac{\beta}{8}{(x_k^*)}^TA_kx_k^* - \frac{\beta}{4}(x_k^*)^T e_1 \\ &=& -\frac{\beta}{8}(x_k^*)^Te_1 \\ &=& -\frac{\beta}{8}\left(1 - \frac{1}{k+1}\right). \end{array}

We first observe that {f_k^*} decreases with {k}, hence {f_s^* \geq f_t^*} for {s \leq t}. So we have

\displaystyle f(x_s)-f^* \geq f_t^* -f^*_{2t+1}.

Furthermore note that

\displaystyle  \begin{array}{rcl}  ||x_k^*||^2&=& \sum\limits_{i=1}^{k}\left(1-\frac{i}{k+1}\right)^2 \\ &=& \sum\limits_{i=1}^k \left(\frac{i}{k+1}\right)^2 \\ &\leq& \frac{k+1}{3}. \end{array}

Thus one obtains:

\displaystyle  \begin{array}{rcl}  f_t^*-f_{2t+1}^*&=& \frac{\beta}{8}\left(\frac{1}{t+1} -\frac{1}{2t+2}\right) = \frac{\beta}{16(t+1)} \\ &\geq& \frac{3\beta}{32}\frac{||x^*_{2t+1}||^2}{(t+1)^2}, \end{array}

where the last step uses {\|x^*_{2t+1}\|^2 \leq \frac{2t+2}{3}}. Since {x_0=0}, {\|x^*_{2t+1}\| = \|x_0-x^*\|}, which concludes the proof. \Box

We make the following observations:

  • From Theorem 1, we observe that the worst case complexity is achieved by a quadratic function and no first order algorithm can provide a convergence rate better than {\Theta(1/k^2)}.
  • From the above theorem and the convergence rate of the GDA we have the following result: Let {N} be the number of iterations required to achieve an error of less than {\epsilon}. Then

    \displaystyle  \min\left\{\frac{n-1}{2},\sqrt{\frac{3\beta\|x_0-x^*\|^2}{32 \epsilon}}\right\}\leq N \leq \frac{\beta\|x_0-x^*\|^2}{ \epsilon}.

    First we observe that GDA does not match the lower bound on complexity. Also, for {n^2 \gtrsim \frac{\beta\|x_0-x^*\|^2}{ \epsilon}}, the upper bound is the square (up to a constant) of the lower bound. However, for small {n}, the lower bound is not tight. In the next class, we will look at a technique that provides an upper bound matching the lower bound.

  • A similar kind of result can be obtained for strongly convex functions. We will state the result here without proof.
    Theorem 2 Let {Q=\beta/\alpha >1}. There exists a {\beta}-smooth and {\alpha}-strongly convex function {f} such that for any {t \geq 1} one has

    \displaystyle  f(x_t)-f(x^*) \geq \frac{\alpha}{2} \left(\frac{\sqrt{Q}-1}{\sqrt{Q}+1}\right)^{2(t-1)} \|x_1-x^*\|^2 \ \ \ \ \ (5)

    We know that for small values of {x}, {1-x \approx e^{-x}}. This means that for large values of {Q}:

    \displaystyle  \left(\frac{\sqrt{Q}-1}{\sqrt{Q}+1}\right)^{2(t-1)}\approx \exp \left(-\frac{4(t-1)}{\sqrt{Q}}\right) \ \ \ \ \ (6)

    We observe that the exponent here has {\sqrt{Q}} in the denominator whereas the corresponding GDA exponent has {Q}; i.e., the lower bound and the GDA rate differ by a factor of {\sqrt{Q}} in the denominator of the exponent.

Today we will look at a variant of gradient descent called the steepest descent algorithm.

1. Steepest Descent Algorithm

Having seen the Gradient Descent Algorithm, we now turn our attention to yet another member of the Descent Algorithms family — the Steepest Descent Algorithm. In this algorithm, we optimize the descent direction to obtain the maximum decrease of the objective function. First, let us look at the descent direction for Steepest Descent. Let {f: \mathbb{R}^n \rightarrow \mathbb{R}} be continuously differentiable. Using the first order Taylor approximation, we get

\displaystyle f(x+v) \approx f(x) + \nabla f(x)^T v.

Our objective is to minimize the second term, {\nabla f(x)^T v}. The descent direction is defined in two steps. First, we look at the so-called normalized steepest descent direction. This in turn helps us define the steepest descent direction. The normalized steepest descent direction is defined as

\displaystyle  \Delta x_{nsd} = {\text{argmin}_v} \{ \nabla f(x)^T v : \| v \| = 1 \}.

Observe that the descent direction depends not only on the point {x}, but also on the norm {\|\cdot\|}. As is clear from the definition, the name normalized is used since we are searching only through those vectors that have unit magnitude. Looking closely, we notice that this definition is similar to that of the dual norm of a given norm. Hence, the minimum value is given by {-\| \nabla f(x) \|_*}. The negative sign compensates for the “min” used here as opposed to the “max” used in the definition of a dual norm. Now we can define the descent direction for the Steepest Descent Algorithm as follows:

\displaystyle  \Delta x_{sd} = \| \nabla f(x) \|_* \Delta x_{nsd}.

The following calculation verifies that the descent direction defined above satisfies the sufficiency condition required, and also justifies the use of {\| \nabla f(x) \|_*} term used in the definition:

\displaystyle  \begin{array}{rcl}  \nabla f(x)^T \Delta x_{sd} & = & \| \nabla f(x) \|_* \nabla f(x)^T \Delta x_{nsd}, \\ & \overset{(a)}{=} & - \| \nabla f(x) \|_*^2, \\ & < & 0, \end{array}

where {(a)} follows from the definition of normalized steepest descent direction. Hence, the update equation is

\displaystyle  x_{k+1} = x_k + t \Delta x_{sd}.

The following Lemma casts the descent direction as a simpler optimization problem.

Lemma 1

\displaystyle \Delta x_{sd} = \text{argmin}_v \left\lbrace \nabla f(x)^T v + \frac{\| v \|^2}{2} \right\rbrace.

Proof: The proof follows by writing { v= t w}, where {t>0} and {\|w\| =1}, and jointly optimizing over {t} and {w}. \Box

We will now look at the descent directions generated by a quadratic norm.

Definition 2 Let {P} be a fixed Positive Definite matrix i.e., {P \in S^n_{++}}. The quadratic norm of any vector {z \in \mathbb{R}^n} is then defined as

\displaystyle  \| z \|_P = \left( z^T P z \right)^{\frac{1}{2}} = \| P^{\frac{1}{2}} z \|_2

If we define {\tilde{x} = P^{\frac{1}{2}} x}, then this essentially corresponds to a change of basis. Since {P \in S^n_{++}}, {x = P^{-\frac{1}{2}} \tilde{x}} is well defined. This in turn allows the quadratic form to be simplified as follows:

\displaystyle  \begin{array}{rcl}  x^T P x & = & \tilde{x}^T P^{-\frac{1}{2}} P P^{-\frac{1}{2}} \tilde{x}, \\ & = & \tilde{x}^T \tilde{x}. \end{array}

Hence, an ellipsoid is converted to a sphere. In the case of Quadratic Norm, the descent direction is

\displaystyle  \Delta x_{sd} = \underset{v}{\text{argmin}} \left\lbrace \nabla f(x)^T v + \frac{1}{2} v^T P v \right\rbrace.

Since the above minimization problem is unconstrained, setting its gradient to zero gives the minimizer. Therefore,

\displaystyle \nabla f(x) + P v^* =0

and hence

\displaystyle \Delta x_{sd} = -P^{-1} \nabla f(x).

This norm can be used when the condition number of the objective function is bad. In this case, if one has some idea of the Hessian at the optimal point, the matrix {P} can be chosen to be equal to {\nabla^2f(x^*)}. This would improve the condition number and the convergence rate of the algorithm.
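A small sketch of the quadratic-norm steepest descent step; choosing {P} equal to the Hessian of a quadratic makes a single unit step land on the minimizer (the test problem and names are ours):

    import numpy as np

    def steepest_descent_direction(grad, P):
        # Quadratic-norm steepest descent direction: solve P v = -grad
        # rather than forming P^{-1} explicitly.
        return np.linalg.solve(P, -grad)

    # Illustration: with P equal to the Hessian of a quadratic, one unit step
    # lands exactly on the minimizer (here x* = 0).
    H = np.array([[10.0, 0.0], [0.0, 0.1]])   # badly conditioned Hessian
    x = np.array([1.0, 1.0])
    grad = H @ x                               # gradient of f(x) = 0.5 x^T H x
    x_next = x + steepest_descent_direction(grad, P=H)
    assert np.allclose(x_next, 0.0)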

1.1. Convergence Rate for Steepest Descent with Back Tracking

Theorem 3 Let {f: \mathbb{R}^n \rightarrow \mathbb{R}} be {\beta-}smooth and {\alpha-}strongly convex. Then Steepest Descent with Back Tracking has a linear convergence rate.

Proof: See Boyd. \Box

For a given {\beta}-smooth convex function, we proved in the last lecture that the GD algorithm with constant step size converges to the optimum with a convergence rate of the order {\frac{1}{k}}. In this lecture, we present a GD algorithm with constant step size for {\beta}-smooth and {\alpha}-strongly convex functions whose convergence rate is exponential.

Theorem 1 Let {f} be a {\beta}-smooth and {\alpha}-strongly convex function. Then GD with step sizes {t_k \leq 2/(\alpha+\beta)} satisfies

\displaystyle f \left( x_k \right) - p^* \leq \frac{\beta}{2} \prod_{n=1}^k \left( 1 - \frac{2t_n\alpha \beta}{\alpha + \beta} \right) \Vert x_0 - x^* \Vert^2.

Proof: GD algorithm update step: { x_{k+1} = x_k -t_k \nabla f \left( x_k \right) }. Since {f(x)} is {\beta}-smooth,

\displaystyle  f \left( x_k \right) \leq f \left( x^* \right) + \nabla f \left( x^* \right)^T \left( x_k - x^* \right) + \frac{\beta}{2} \Vert x_k - x^* \Vert ^2 .

Since the gradient vanishes at the optimal point,

\displaystyle  f \left( x_k \right) - f \left( x^* \right) \leq \frac{\beta}{2} \Vert x_k -x^* \Vert ^2.  \ \ \ \ \ (1)

Now, let us bound {\Vert x_{k+1} - x^* \Vert^2} as follows:

\displaystyle  \begin{array}{rcl}  \Vert x_{k+1} - x^* \Vert^2 = \Vert x_k - x^* \Vert^2 + t_k^2 \Vert \nabla f \left( x_k \right) \Vert^2 -2t_k \nabla f \left( x_k \right)^T\left( x_k - x^* \right). \end{array}

Using the lemma from the previous lecture, we have

\displaystyle \Vert x_{k+1} - x^* \Vert^2 \leq \Vert x_k - x^* \Vert^2 + t_k^2 \Vert \nabla f \left( x_k \right) \Vert^2 -2t_k \left( \frac{\alpha \beta}{\alpha + \beta} \Vert x_k - x^* \Vert^2 + \frac{\Vert \nabla f \left( x_k \right) \Vert^2}{\alpha + \beta} \right).

After simplification, we obtain

\displaystyle  \begin{array}{rcl}  \Vert x_{k+1} - x^* \Vert^2 \leq& \left( 1 - 2t_k \frac{\alpha \beta}{\alpha + \beta} \right) \Vert x_k - x^* \Vert^2 + t_k \left( t_k - \frac{2}{\alpha + \beta} \right) \Vert\nabla f \left( x_k \right) \Vert^2. \end{array}

Since {t_k \leq\frac{2}{\alpha + \beta}}, the last term on the RHS is non-positive and can be dropped, and we obtain the following upper bound.

\displaystyle \Vert x_{k+1} - x^* \Vert^2 \leq\left( 1 - 2t_k \frac{\alpha \beta}{\alpha + \beta} \right) \Vert x_k - x^* \Vert^2.

Now, iterating over {k},

\displaystyle  \Vert x_k - x^* \Vert ^2 \leq \Vert x_0 - x^* \Vert ^2 \prod_{n=1}^k \left( 1 - \frac{2t_n\alpha \beta}{\alpha + \beta} \right).

Substituting in (1), we obtain

\displaystyle f \left( x_k \right) - p^* \leq \frac{\beta}{2} \prod_{n=1}^k \left( 1 - \frac{2t_n\alpha \beta}{\alpha + \beta} \right) \Vert x_0 - x^* \Vert^2 .

\Box

Using the fact {(1-x) \leq \exp(-x)}, it follows from Theorem 1 that

\displaystyle  f \left( x_k \right) - p^* \leq \frac{\beta}{2} \Vert x_0 - x^* \Vert^2 e^{-\frac{2\alpha\beta}{\alpha+\beta}\sum_{n=1}^k t_n}.  \ \ \ \ \ (2)

The following lemma shows that GD on strongly convex functions with a constant step size exhibits exponential convergence.

Lemma 2 GD with step size { t = \frac{2}{\alpha + \beta}} on an {\alpha}-strongly convex and {\beta}-smooth function satisfies

\displaystyle  \begin{array}{rcl}  f \left( x_k \right) - p^* &\leq \frac{\beta}{2} \left( \frac{Q_f - 1}{Q_f + 1} \right)^{2k} \Vert x_0 - x^* \Vert^2 \\ &\leq \frac{\beta}{2} \exp \left( - \frac{4k}{Q_f+1} \right) \Vert x_0 - x^* \Vert^2, \end{array}

with {Q_f = \frac{\beta}{\alpha}}.

Proof: Substituting {t_k =2/(\alpha+\beta)} in Theorem 1, we have

\displaystyle  \begin{array}{rcl}  f \left( x_k \right) - p^* &\leq \frac{\beta}{2} \prod_{n=1}^k \left( 1 - \frac{2t_n\alpha \beta}{\alpha + \beta} \right) \Vert x_0 - x^* \Vert^2 \\ &= \frac{\beta}{2} \left( 1 - \frac{2}{Q_f+1} \right)^{2k} \Vert x_0 - x^* \Vert^2 \\ & \leq \frac{\beta}{2} \exp \left( \frac{-4k}{Q_f+1} \right) \Vert x_0 - x^* \Vert^2. \end{array}

The last step follows since { \left( 1-x \right)^k \leq \exp(-xk) }. \Box
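The contraction can be checked numerically. The following sketch (the diagonal test matrix is ours) runs GD with {t=2/(\alpha+\beta)} on a quadratic, for which {\Vert x_k - x^*\Vert} contracts by exactly {(Q_f-1)/(Q_f+1)} per step:

    import numpy as np

    # GD with t = 2/(alpha+beta) on f(x) = 0.5 x^T A x; per step,
    # ||x_k - x*|| contracts exactly by (Q_f-1)/(Q_f+1) for this quadratic.
    alpha, beta = 0.1, 1.0
    A = np.diag([beta, alpha])                 # extreme curvatures beta and alpha
    t = 2.0 / (alpha + beta)
    x = np.array([1.0, 1.0])                   # x* = 0, so ||x|| is the error
    for _ in range(50):
        x = x - t * (A @ x)
    Q = beta / alpha
    predicted = ((Q - 1) / (Q + 1)) ** 50
    assert np.linalg.norm(x) <= predicted * np.sqrt(2) + 1e-12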

In convex analysis terminology, {\exp(-k)}-type convergence is called linear convergence, and {\exp(-2^k)}-type convergence (the error squaring at every step) is called quadratic convergence. It can be seen that the rate of convergence in the above lemma critically depends on the condition number {Q_f = \frac{\beta}{\alpha}}: large values of {Q_f} imply slow convergence. The next lemma examines the convergence behaviour with diminishing step size.

Lemma 3 When {t_k = \frac{c}{k}}, we have

\displaystyle f(x_k)- p^* \leq \frac{\beta}{2 k^{\frac{2c\alpha\beta}{\alpha+\beta}}}\|x_0 -x^*\|^2.

Proof: Substituting {t_k = \frac{c}{k}} in (2), and using the fact {\sum_{n=1}^k 1/n \sim \log(k)}, we have

\displaystyle f \left( x_k \right) - p^* \leq \frac{\beta}{2} \Vert x_0 - x^* \Vert^2 e^{-\frac{2c\alpha\beta}{\alpha+\beta} \log(k)},

and the result follows. \Box

This shows that any polynomial convergence rate can be obtained by increasing {c}. However, for large {c}, the initial steps might violate the condition {t_k \leq 2/(\alpha+\beta)}. While this might slow down the initial phase of the algorithm, the convergence rate would still be given by Lemma 3 since {\frac{c}{k} \leq 2/(\alpha+\beta)} for large {k}.

In the previous lectures we have seen the properties of {\beta}-smoothness and strong convexity, which are used quite often in problems involving Convex Optimization. We introduced the basic idea of descent algorithms and analysed the gradient descent algorithm. In today's lecture we will look into some more theorems relating to the gradient descent method. We will consider a generalised step size, i.e., the update rule is

\displaystyle x_{k+1} = x_k - t_k \nabla f(x_k)

Theorem 1 Let the convex function {f} be {\beta}-smooth, with {x_n} the current point and {x^*} the optimal point. Then {\Arrowvert x_n-x^*\Arrowvert} decreases with {n} if {t_n \leq 1/\beta}.

Proof:

\displaystyle  \begin{array}{rcl}  \Arrowvert x_{n+1}-x^*\Arrowvert ^2 &=& \Arrowvert x_n - t_n\nabla f(x_n) -x^*\Arrowvert ^2,\nonumber\\ &=& \Arrowvert x_n -x^*\Arrowvert ^2 -{2t_n\nabla f(x_n)^T(x_n-x^*)} + t_n^2{\Arrowvert \nabla f(x_n)\Arrowvert ^2} .\nonumber \end{array}

Now, as derived in previous lecture, we have

\displaystyle  \begin{array}{rcl}  f(x_n) - f(x^*) \leq \nabla f(x_n)^T(x_n-x^*) - \frac{\Arrowvert\nabla f(x_n)\Arrowvert ^2}{2\beta}. \end{array}

But we know that,

\displaystyle  \begin{array}{rcl}  f(x_n) - f(x^*) \geq 0 \quad \Rightarrow \quad -\nabla f(x_n)^T(x_n-x^*) \leq -\frac{\Arrowvert\nabla f(x_n)\Arrowvert ^2}{2\beta}. \end{array}

Hence,

\displaystyle  \begin{array}{rcl}  \Arrowvert x_{n+1}-x^*\Arrowvert ^2 &=& \Arrowvert x_n -x^*\Arrowvert ^2 -2t_n{\nabla f(x_n)^T(x_n-x^*)}+ t_n^2{\Arrowvert \nabla f(x_n)\Arrowvert ^2}\nonumber\\ &\leq & \Arrowvert x_n -x^*\Arrowvert ^2 -t_n\frac{\Arrowvert \nabla f(x_n)\Arrowvert ^2}{\beta} + t_n^2{\Arrowvert \nabla f(x_n)\Arrowvert ^2}\nonumber\\ &= & \Arrowvert x_n -x^*\Arrowvert ^2 -t_n\left(\frac{1}{\beta}-t_n\right)\|\nabla f(x_n) \|^2\nonumber. \end{array}

If {t_n \leq 1/\beta}, then

\displaystyle \Arrowvert x_{n+1}-x^*\Arrowvert ^2 \leq \Arrowvert x_n -x^*\Arrowvert ^2.

\Box

We note that, while the distance to the optimal point decreases, the rate of decrease can be arbitrarily slow; in fact, it can be made to match any given real sequence tending to zero. We will now look at the convergence rate of the gradient descent algorithm on the class of smooth convex functions.

Theorem 2 Let f be a convex and {\beta}-smooth function on {{\mathbb R}^d}. Then the gradient descent method with step lengths {t_k \leq \frac{1}{\beta}} satisfies

\displaystyle f(x_n)-f(x^*) \leq \frac{ \Arrowvert x_0 - x^*\Arrowvert ^2}{ \sum_{k=0}^{n-1} t_k\left( 1-\frac{\beta t_k}{2}\right)}. \ \ \ \ \ (1)

Proof: Since {f} is a {\beta}-smooth function, we have

\displaystyle f(x) \leq f(y) + \nabla f(y)^T(x-y) + \frac{\beta}{2}\Arrowvert x-y\Arrowvert ^2. \ \ \ \ \ (2)

 

Now according to the gradient descent algorithm, we get the next point as

\displaystyle x_{n+1} = x_n -t_n\nabla f(x_n) \ \ \ \ \ (3)

 

Using (3) in (2) (with {x=x_{n+1}} and {y=x_n}), we get

\displaystyle \begin{array}{rcl} f(x_{n+1}) - f(x_n) &\leq & \nabla f(x_n)^T(x_{n+1}-x_n) + \frac{\beta}{2}\Arrowvert x_{n+1}-x_{n}\Arrowvert ^2\\ &= &-t_n\nabla f(x_n)^T\nabla f(x_n)+\frac{\beta t_n^2}{2}\Arrowvert \nabla f(x_n)\Arrowvert ^2\\ &= &-t_n\left( 1-\frac{\beta t_n}{2}\right) \Arrowvert \nabla f(x_n)\Arrowvert ^2. \end{array}

Now let {\eta_n = f(x_{n})-f(x^*)} and {\eta_{n+1} = f(x_{n+1})-f(x^*)}. Substituting this in the previous result, we get

\displaystyle \begin{array}{rcl} &f(x_{n+1}) - f(x_n) \leq-t_n\left( 1-\frac{\beta t_n}{2}\right)\Arrowvert \nabla f(x_n)\Arrowvert ^2 \\ &\eta_{n+1} + f(x^*) -\eta_{n} -f(x^*)\leq -t_n\left( 1-\frac{\beta t_n}{2}\right)\Arrowvert \nabla f(x_n)\Arrowvert ^2 \end{array}

Hence

\displaystyle \eta_{n+1} \leq \eta_{n}-t_n\left( 1-\frac{\beta t_n}{2}\right)\Arrowvert \nabla f(x_n)\Arrowvert ^2. \ \ \ \ \ (4)

 

From the convexity condition and the Cauchy–Schwarz inequality we have,

\displaystyle \begin{array}{rcl} f(x_n) - f(x^*) &\leq & \nabla f(x_n)^T(x_n - x^*)\nonumber\\ \eta_n &\leq & \Arrowvert \nabla f(x_n)\Arrowvert \Arrowvert x_n-x^*\Arrowvert\nonumber \end{array}

So,

\displaystyle \frac{\eta_n}{\Arrowvert x_n-x^*\Arrowvert} \leq \Arrowvert \nabla f(x_n)\Arrowvert. \ \ \ \ \ (5)

 

Substituting the result of (5) to (4), we get

\displaystyle \begin{array}{rcl} \eta_{n+1} \leq \eta_{n} -\frac{t_n\left( 1-\frac{\beta t_n}{2}\right)\eta_n^2}{\Arrowvert x_{n} - x^*\Arrowvert ^2}\nonumber. \end{array}

As proved in the previous theorem, {\Arrowvert x_n-x^*\Arrowvert} decreases with n. Thus {\Arrowvert x_{0}-x^*\Arrowvert \geq \Arrowvert x_n-x^*\Arrowvert} and consequently {\frac{-1}{\Arrowvert x_{0}-x^*\Arrowvert} \geq \frac{-1}{\Arrowvert x_n-x^*\Arrowvert}}. So

\displaystyle \begin{array}{rcl} \eta_{n+1} \leq \eta_{n} -\frac{ t_n\left( 1-\frac{\beta t_n}{2}\right) \eta_n^2}{\Arrowvert x_{0} - x^*\Arrowvert ^2}. \end{array}

Let {\omega = \frac{1}{\Arrowvert x_0 - x^*\Arrowvert ^2}}.

\displaystyle \begin{array}{rcl} \eta_{n+1} \leq \eta_{n} -\omega t_n\left( 1-\frac{\beta t_n}{2}\right)\eta_n^2\nonumber. \end{array}

Dividing by ({\eta_{n+1}\eta_{n}}) we get,

\displaystyle \begin{array}{rcl} \frac{1}{\eta_n} \leq \frac{1}{\eta_{n+1}} - \frac{\omega t_n\left( 1-\frac{\beta t_n}{2}\right) \eta_n}{\eta_{n+1}}\nonumber. \end{array}

From (4) it follows that {\eta_{n+1} \leq \eta_n}, so {\frac{\eta_{n}}{\eta_{n+1}} \geq 1}. Thus we have

\displaystyle \begin{array}{rcl} \frac{1}{\eta_{n+1}}-\frac{1}{\eta_{n}} \geq \frac{\omega t_n\left( 1-\frac{\beta t_n}{2}\right)\eta_n}{\eta_{n+1}} \geq \omega t_n\left( 1-\frac{\beta t_n}{2}\right). \end{array}

The above inequality telescopes. Summing all the terms from { 0} to { n-1} we get,

\displaystyle \begin{array}{rcl} \frac{1}{\eta_n} - \frac{1}{\eta_{0}} \geq \omega \sum_{k=0}^{n-1} t_k\left( 1-\frac{\beta t_k}{2}\right)\nonumber. \end{array}

Since {f(x_0)-f(x^*) \geq 0}, {\eta_{0}} and thus {\frac{1}{\eta_{0}}} are positive numbers. Thus

\displaystyle \begin{array}{rcl} \frac{1}{\eta_n} &\geq &\omega \sum_{k=0}^{n-1} t_k\left( 1-\frac{\beta t_k}{2}\right)\nonumber\\ \eta_n &\leq & \frac{1}{\omega \sum_{k=0}^{n-1} t_k\left( 1-\frac{\beta t_k}{2}\right)}\nonumber\\ \eta_n &\leq & \frac{ \Arrowvert x_0 - x^*\Arrowvert ^2}{ \sum_{k=0}^{n-1} t_k\left( 1-\frac{\beta t_k}{2}\right)}. \end{array}

\Box

We will now look at some specific step sizes. When the step size is constant, it is easy to see that {t=1/\beta} maximizes { t\left( 1-\frac{\beta t}{2}\right)} and is hence optimal.

Corollary 3 (Constant stepsize) When {t_k =1/\beta},

\displaystyle f(x_n)-f(x^*) \leq \frac{ 2\beta\Arrowvert x_0 - x^*\Arrowvert ^2}{ n}. \ \ \ \ \ (6)

Hence for a constant step size, the algorithm converges as {\Theta(1/n)}.

We now look at diminishing step sizes (very common for stochastic GDA).

Corollary 4 Let {t_k} be such that {\sum_{k=1}^\infty t_k =\infty } and {\sum_{k=1}^\infty t_k^2 <\infty}. Then GDA converges to the global optimum. In particular when {t_k= 1/k}, we have

\displaystyle f(x_n)-f(x^*) \leq \Theta\left(\frac{1}{\log(n)}\right){\Arrowvert x_0 - x^*\Arrowvert ^2}.

Observe that for the case of diminishing step size, the knowledge of {\beta} is not necessary. Also, {\sum_{k=1}^\infty t_k^2 <\infty} is a sufficient condition but not necessary. We only require { \sum_{k=0}^{n-1} t_k\left( 1-\frac{\beta t_k}{2}\right)} to grow without bound as {n} increases. For example, when {t_k = 1/\log(k)}, it can be shown (by approximating the sum by a Riemann integral) that

\displaystyle \sum_{k=0}^{n-1} t_k\left( 1-\frac{\beta t_k}{2}\right) \sim \frac{n}{\log(n)}.

Hence when {t_k = 1/\log(k)},

\displaystyle f(x_n)-f(x^*) \preceq \frac{\log(n)\Arrowvert x_0 - x^*\Arrowvert ^2}{ n},

which results in a loss of {\log(n)} in the convergence rate compared to {t_k =1/\beta}.

Figure: Convergence rates of a quadratic function {x^TMx} for different step sizes. The eigenvalues of {M} are {[1.09, 0.227, 0]}. Hence the Lipschitz constant is 1.09 and the function is also not strongly convex.
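A sketch of such an experiment (the matrix below is diagonal with the quoted eigenvalues, which is our simplification; plotting is omitted):

    import numpy as np

    M = np.diag([1.09, 0.227, 0.0])   # PSD with the quoted eigenvalues
    beta = 2 * 1.09                    # gradient of f(x) = x^T M x is 2 M x
    x0 = np.ones(3)

    def run_gd(step_rule, iters=500):
        # Run GD with a per-iteration step size rule; record f(x_k),
        # which equals the error since the optimal value is 0.
        x, errs = x0.copy(), []
        for k in range(1, iters + 1):
            x = x - step_rule(k) * (2 * M @ x)
            errs.append(x @ M @ x)
        return errs

    err_const = run_gd(lambda k: 1 / beta)     # constant t = 1/beta: Theta(1/n)
    err_dim = run_gd(lambda k: 1 / k)          # t_k = 1/k: Theta(1/log n)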

For comparing algorithms under a standard condition, let us assume the optimum lies within a radius {R} of the starting point {x_0}, since {\eta_n} depends on the distance of the starting point to the optimum. For {t_k =1/\beta}, we have

\displaystyle \begin{array}{rcl} \eta_n &\leq & \frac{2\beta R ^2}{n}. \end{array}

Now for allowable tolerance of {\epsilon} from the optimum {f(x^*)} for the algorithm to end,

\displaystyle \begin{array}{rcl} f(x_n)-f(x^*) &\leq &\epsilon\nonumber\\ \eta_n &\leq &\epsilon\nonumber \end{array}

Hence

\displaystyle \begin{array}{rcl} \epsilon &\geq & \frac{2\beta R ^2}{n}\nonumber\\ n &\geq & \frac{2\beta R ^2}{\epsilon}. \end{array}

Thus the minimum number of steps required to reach an {\epsilon}-accurate solution using the gradient descent algorithm is {\frac{2\beta R ^2}{\epsilon}}, which shows that the iteration count depends directly on the Lipschitz constant {\beta} of the gradient and the squared distance of the starting point from the optimum, and inversely on the tolerance {\epsilon}.

A few points to remember:

  1. Lipschitz continuity is always assumed in Convex Optimization.
  2. Checking {f(x_{k+1})} {<} {f(x_{k})} alone is a poor way to go through with optimization, as will be explained in later lectures.
  3. {\beta}-smoothness and convexity alone do not give good convergence rates. Thus {\alpha}-strong convexity is assumed for faster convergence.
  4. The gradient descent algorithm scales linearly with the number of dimensions.

We require the following lemma for proving the convergence rate for strongly convex functions in the next class.

Lemma 5 Let {f} be {\beta}-smooth and {\alpha}-strongly convex. Then {\forall x,y \in {\mathbb R}^n},

\displaystyle \begin{array}{rcl} (\nabla f(x) -\nabla f(y))^T(x-y) \geq \frac{\alpha\beta\Arrowvert x-y\Arrowvert ^2}{\alpha + \beta} + \frac{\Arrowvert\nabla f(x) - \nabla f(y)\Arrowvert ^2}{\alpha +\beta}. \end{array}

Theorem 1 Let {f:{\mathbb R}^n \rightarrow {\mathbb R}} be a convex and a {\beta}-smooth function. Then, for any {x,y \in {\mathbb R}^n},

\displaystyle f(x) - f(y) \leq \nabla f(x)^T(x-y) - \frac{1}{2 \beta} \| \nabla f(x) - \nabla f(y)\| ^2. \ \ \ \ \ (1)

Proof: Let {z=y-\frac{1}{\beta} \left( \nabla f(y) - \nabla f(x) \right)}. Then,

\displaystyle \begin{array}{rcl} f(x) - f(y) &= ~ & f(x) - f(z) + f(z) - f(y),\\ & \stackrel{ (a)}{\leq} ~ & \nabla f(x)^T(x-z) + \nabla f(y)^T(z-y) + \frac{\beta}{2}\| z-y \|^2,\\ &= ~& \nabla f(x)^T (x-y) + \left( \nabla f(x) - \nabla f(y) \right) ^T (y-z) + \frac{\beta}{2} \| z-y \|^2,\\ &\stackrel{(b)}{=} ~& \nabla f(x)^T(x-y) - \frac{1}{2 \beta} \|\nabla f(x) - \nabla f(y) \|^2, \end{array}

where {(a)} follows by using the gradient inequality for the {\left( f(x) - f(z) \right)} term since {f} is a convex function, and using the {\beta}-smoothness of {f} for the {\left( f(z) - f(y) \right)} term, and {(b)} follows by substituting {z=y-\frac{1}{\beta} \left( \nabla f(y) - \nabla f(x) \right)}. \Box

The above theorem is from the notes of Dr. Sébastien Bubeck.

1. Unconstrained Minimization

We now look at the following optimization problem.

\displaystyle p^*=\min\limits_{x \in {\mathbb R}^n} f(x). \ \ \ \ \ (2)

Our objective is to come up with a sequence of vectors {x_1,x_2, \hdots} such that {\lim\limits_{n \rightarrow \infty}f(x_n)=p^*}. We refer to such a sequence as a minimizing sequence. We next define a tolerance limit to the optimization problem, which limits the allowed gap between the optimal value and the function value at the point returned by the algorithm. By this we mean that the algorithm stops at {x_n} if

\displaystyle 0 \leq f(x_n) - p^* < \epsilon. \ \ \ \ \ (3)

2. Descent Algorithms

In descent algorithms, we use the following update rule.

\displaystyle x_{k+1} = x_k + t_k \Delta x_k, \ \ \ \ \ (4)

 

where {t_k > 0} is the step size and {\Delta x_k} is the descent direction. Here, {||\Delta x_k||} need not be {1}, so “direction” is, strictly speaking, a misnomer. By descent, we mean the following,

\displaystyle f(x_{k+1}) < f(x_k).\ \ \ \ \ (5)

If {f} is a convex function, by gradient inequality,

\displaystyle f(x_{k+1}) \geq f(x_k) + \nabla f(x_k)^T(x_{k+1}-x_k).\ \ \ \ \ (6)

From (4)

\displaystyle f(x_{k+1}) \geq f(x_k) + \nabla f(x_k)^Tt_k \Delta x_k.\ \ \ \ \ (7)

From (5) and (7), we have the following as a necessary condition for every descent algorithm.

\displaystyle \nabla f(x_k)^T \Delta x_k < 0. \ \ \ \ \ (8)

 

Given below is the basic structure of a descent algorithm.


 

  1. Start at an initial point {x_0}.
  2. Repeat the following till the “Stopping Criterion” is met.
    1. Descent direction : Choose a descent direction {\Delta x_k}.
    2. Line search : Choose a step size {t_k > 0}.
    3. Update : {x_{k+1} = x_k + t_k\Delta x_k}.

The stopping criterion does not directly follow from the tolerance limit as we do not know the optimal value {p^*}. As we will see later, the choice of stopping criterion is of prime importance to the performance of a descent algorithm and it is not easy to come up with an appropriate stopping criterion. For now, we assume that “Exact Line Search” is used. By this, we mean that the step size {t_k} is given by,

\displaystyle t_k = {\text{argmin}}_t~f(x_k + t \Delta x_k). \ \ \ \ \ (9)

We will later see that implementing “Exact Line Search” is not practical, and we will come up with simpler ways to find the step size {t_k} without any significant loss in performance. We are now left with the choice of the descent direction {\Delta x_k}, which should satisfy (8). All the descent algorithms differ in the choice of this descent direction. We now look at a basic descent algorithm.

2.1. Gradient Descent Algorithm (GDA)

The descent direction is chosen as

\displaystyle \Delta x_k = - \nabla f(x_k).

Observe that this descent direction clearly satisfies (8). Next, we state and prove the following important theorem, which shows that GDA converges to a stationary point for {\beta}-smooth functions.

Theorem 2 Let {f} be a {\beta}-smooth function and {f^* = \min f(x)>- \infty}. Then the gradient descent algorithm with a constant step size {t < \frac{2}{\beta}} will converge to a stationary point, i.e., the set {\left\lbrace x : \nabla f(x) = 0\right\rbrace}.

Proof: Recall that for gradient descent algorithm,

\displaystyle x_{k+1} = x_k - t \nabla f(x_k). \ \ \ \ \ (10)

As {f} is {\beta}-smooth, we have

\displaystyle \begin{array}{rcl} f(x_{k+1}) &\leq~& f(x_k) + \nabla f(x_k)^T(x_{k+1} - x_k ) + \frac{\beta}{2} \| x_{k+1} - x_k \| ^2,\\ & \stackrel{(a)}{=}~ & f(x_k) - t \|\nabla f(x_k) \|^2 + \frac{\beta t^2}{2}\|\nabla f(x_k) \|^2, \end{array}

where {(a)} follows from (10). Thus, we have

\displaystyle \begin{array}{rcl} & f(x_{k+1}) \leq f(x_k) - t \left( 1 - \frac{\beta t}{2} \right) \|\nabla f(x_k)\|^2,\\ & \|\nabla f(x_k)\|^2 \stackrel{(a)}{\leq} \frac{f(x_k) - f(x_{k+1})}{t \left(1 - \frac{\beta t}{2} \right)}, \end{array}

where {(a)} follows since {t < \frac{2}{\beta}}. Next, we have

\displaystyle \begin{array}{rcl} \sum\limits_{k=0}^{N} \|\nabla f(x_k)\|^2 &\leq & \frac{1}{t\left( 1 -\frac{\beta t}{2} \right)} \sum\limits_{k=0}^{N}\left(f(x_k) - f(x_{k+1})\right)\\ &=& \frac{f(x_0) - f(x_{N+1})}{t \left( 1 - \frac{\beta t}{2} \right)} \ \ \ \ \ (11)\\ &\leq & \frac{f(x_0) - f^*}{t \left( 1 - \frac{\beta t}{2} \right)}, \end{array}

where {f^*} is the global optimal value. Taking the limit as {N \rightarrow \infty}, we have

\displaystyle \begin{array}{rcl} & \sum\limits_{k=0}^{\infty} \| \nabla f(x_k) \|^2 < \infty. \end{array}

Hence {\| \nabla f(x_k) \| \rightarrow 0}, i.e., {\nabla f(x_k) \rightarrow 0}. \Box

Note that {t^* = \frac{1}{\beta}} maximizes the {t \left( 1 - \frac{\beta t}{2} \right)} term appearing in the denominator of (11) and hence is the optimal step size to be used. In this case we have

\displaystyle \sum\limits_{k=0}^{N} \|\nabla f(x_k)\|^2 \leq 2\beta\left({f(x_0) - f^*}\right).

The convergence rate can be obtained as follows: let {g_N = \min_{k=0,\hdots,N} \|\nabla f(x_k)\|}. Then

\displaystyle (1+N)g_N^2 \leq 2\beta\left({f(x_0) - f^*}\right),

which implies

\displaystyle g_N \leq \sqrt{ \frac{2\beta\left({f(x_0) - f^*}\right)}{1+N}}.

Hence, when {f} is just a {\beta}-smooth function (it need not be convex), the above theorem shows that gradient descent converges sub-linearly to a stationary point (which need not even be a local minimum). Also, observe that the above bound is on {g_N}, i.e., the best gradient norm up to iteration {N}: the function value decreases monotonically, but the gradient norm itself might oscillate. While the smoothness condition is used to prove the convergence rate, the algorithm will converge to a stationary point whenever the level set {\{x: f(x) \leq f(x_0)\}} corresponding to the initial point is bounded.
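The following sketch illustrates this on a one-dimensional {\beta}-smooth non-convex function (the test function and constants are our choices):

    import numpy as np

    # f(x) = sin(x) + 0.1 x^2 is beta-smooth (|f''| <= 1.2) but not convex.
    grad = lambda x: np.cos(x) + 0.2 * x
    beta = 1.2
    x, g_best = 3.0, np.inf
    for _ in range(200):
        g_best = min(g_best, abs(grad(x)))     # the g_N tracked by the bound
        x = x - grad(x) / beta                 # GD with the optimal step 1/beta
    # g_best decays like O(1/sqrt(N)); the iterate ends near a stationary point
    assert abs(grad(x)) < 1e-6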

 

In optimization, differentiability (even of higher orders) is not a very strong condition compared to continuity and hence cannot ensure convergence rates. This is because every continuous function (on a compact space) can be approximated by a smooth function (arbitrarily differentiable) to an arbitrary degree of precision (see the Stone–Weierstrass theorem for approximation by polynomials). Hence we require stronger conditions on the functions to ensure convergence rates.

Definition 1 (Lipschitz Continuous functions) A function {f(x):{\mathbb R}^n \rightarrow {\mathbb R}} is said to be Lipschitz if

\displaystyle  |f(x_1) -f(x_2)| \leq L\|x_1-x_2\|.

Here {L} is called the Lipschitz constant. The function {\exp(-x)} is Lipschitz with constant {1} on {[0,\infty)}. The following lemma characterizes Lipschitz functions which are differentiable.

Lemma 2 A differentiable function is Lipschitz on a convex domain iff

\displaystyle \|f'\|_\infty < \infty,

and the sup norm of the derivative is the Lipschitz constant.

The above lemma also helps us in identifying Lipschitz functions. For example, for {f(x)=e^{-x}}, {x \in (0, \infty)}, we have {f^{\prime}(x)=-e^{-x}} and {|-e^{-x}| \leq 1}. Hence the Lipschitz constant of the exponential function is {1}.

In the Convex analysis literature, functions whose derivatives are Lipschitz continuous are also called Lipschitz continuous functions. To avoid confusion, we will call functions whose gradients are Lipschitz continuous {\beta}-smooth functions, i.e.,

\displaystyle  ||\nabla f(x) -\nabla f(y)||_2 \leq \beta ||x-y||_2.

Theorem 3 If {f} is {\beta}-smooth, then

\displaystyle |f(x)-f(y)-\nabla f(y)^{T}(x-y)| \leq \frac{\beta}{2} || x-y||^2.

Proof: Consider the function {g(t)=f(y+t(x-y))}. We have {g(1)=f(x), g(0)=f(y)} and

\displaystyle g'(t) = \nabla f(y+t(x-y))^{T}(x-y).

Since

\displaystyle \int_0^1 g'(t)d t = g(1)-g(0),

we have

\displaystyle  \begin{array}{rcl}  |f(x)-f(y)-\nabla f(y)^{T}(x-y)| &=& \left| \int_0 ^1 \nabla f(y+t(x-y))^{T}(x-y) \, d t - \nabla f(y)^{T}(x-y) \right| \\ &\leq& \int_0 ^1 \left| \left(\nabla f(y+t(x-y)) - \nabla f(y)\right)^{T}(x-y) \right| \, d t . \end{array}

By applying the Cauchy–Schwarz inequality, this is

\displaystyle  \begin{array}{rcl}  &\leq& \int_0 ^1 \|\nabla f(y+t(x-y)) - \nabla f(y)\|_2 \, \|x-y\|_2 \, d t . \end{array}

As {f} is {\beta}-smooth, this is

\displaystyle  \begin{array}{rcl}  &\leq& \|x-y\|_2 \int_0 ^1 \beta t \|x-y\|_2 \, dt = \frac{\beta}{2}\|x-y\|_2^2 . \end{array}

\Box

This shows that {f(x) \leq f(y) + \nabla f(y)^{T}(x-y) +\frac{\beta}{2}||x-y||^2,} i.e., if a function is {\beta}-smooth then it is upper bounded by a quadratic function. While the upper bound is useful, the corresponding quadratic lower bound {f(x) \geq f(y) + \nabla f(y)^{T}(x-y) -\frac{\beta}{2}||x-y||^2} has a negative square term and is hence very loose. Nevertheless, this shows that the growth rate of a {\beta}-smooth function is at most quadratic. We now look at convex functions:

Definition 4 (convex) A function {f(x): A\rightarrow {\mathbb R}} is convex if

  • {f(\lambda x +(1-\lambda) y) \leq \lambda f(x) +(1-\lambda)f(y), \lambda \in [0,1]}.
  • {A} is convex.

We now look at an easier and useful characterisation of convexity when the function is differentiable.

Lemma 5 A continuously differentiable function {f(x):{\mathbb R}^n \rightarrow {\mathbb R}} is convex iff

\displaystyle f(y) \geq f(x) +\nabla f(x)^T(y-x)

We now define the notion of strong convexity, which provides quadratic lower bounds on convex functions (I won't dwell upon this too much as these facts have already been proved in the previous course). Recall that a twice differentiable function is convex iff {\nabla^2 f(x) \succeq 0}, i.e., the Hessian is a positive semi-definite matrix for all {x}; for example, {e^{-x}} is convex on the positive reals.

Definition 6 (Strong Convexity) A function {f} is said to be strongly convex with parameter {\alpha} if

\displaystyle f(x) \geq f(y) + \nabla f(y)^{T}(x-y) +\frac{\alpha}{2}||x-y||^2, \quad \forall x,y \in {\mathbb R}^n.

Obviously, {\alpha} can be at most {\beta} if the function is both {\beta}-smooth and {\alpha}-strongly convex. The above definition can be used to show that the Hessian of a strongly convex function must satisfy

\displaystyle \nabla^2 f -\alpha I \succeq 0,

i.e., must be positive semi definite. The advantage of a strongly convex function is that the function will not be flat at the bottom and hence the convergence to the optimal solution will be faster. While the function {e^{-x}, x>0} is convex, it is not strongly convex. However (adding a regularisation term) {e^{-x} +\lambda x^2, \lambda>0} is strongly convex. Similarly adding {\lambda \|x\|^2} to any convex function would lead to a strongly convex function. The next theorem provides a bound on the gap between the function value and the optimum in terms of the gradient.

Theorem 7 If {f} is strongly convex, then {f(y)- \min\limits_{x \in X} f(x)} { \leq \frac{1}{2\alpha} ||\nabla f(y) ||^2}, where {X} is the convex set over which we optimize.

Proof: For a strongly convex function,

\displaystyle  f(x) \geq f(y)+ \nabla f(y)^T(x-y) +\frac{\alpha}{2} || x-y||^2 .  \ \ \ \ \ (1)

Let

\displaystyle Q(x) = f(y)+ \nabla f(y)^T(x-y) +\frac{\alpha}{2} || x-y||^2.

Since {Q(x)} is a quadratic function of {x}, it can be easily seen that it attains its minimum at {x^{*}=y-\frac{1}{\alpha} \nabla f(y)}. Substituting this in (1), we obtain

\displaystyle f(x) \geq f(y) -\frac{1}{2\alpha} ||\nabla f(y) ||^2.

Let {P^*} denote {\min\limits_{x \in X} f(x)}. Since {f(x) \geq f(y) -\frac{1}{2\alpha} ||\nabla f(y) ||^2} holds for all {x}, taking the minimum over {x \in X} gives {P^* \geq f(y) -\frac{1}{2\alpha} ||\nabla f(y) ||^2}, which is the claim. \Box
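A quick numerical check of the theorem on a strongly convex quadratic (the test function is our choice; its minimum value is {0}):

    import numpy as np

    # f(x) = 0.5 x^T A x is alpha-strongly convex with alpha = min eigenvalue,
    # and its minimum value is 0; the theorem then reads
    # f(y) <= ||grad f(y)||^2 / (2 alpha) for every y.
    A = np.diag([2.0, 0.5])
    alpha = 0.5
    rng = np.random.default_rng(0)
    for _ in range(100):
        y = rng.normal(size=2)
        f_y = 0.5 * y @ A @ y
        g_y = A @ y
        assert f_y <= g_y @ g_y / (2 * alpha) + 1e-12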

It is very difficult to solve general non-linear optimization problems, and there exist no universal techniques. This is intuitive, since most difficult math problems can be cast as optimization problems. For example, the simple problem of finding the root of a non-linear function {f(x)} can be cast as the optimization problem {\min_{x\in {\mathbb R}^n} f(x)^2}. Hence we focus on a sub-class of optimization problems that can be solved efficiently.

Linear optimization problems (linear objective function and polyhedral constraint set) can be solved in a very efficient manner. This class of problems can be solved very efficiently even with a million variables, and commercial software exists to solve these large problems. On the other hand, solving general non-linear optimization problems is extremely difficult (almost impossible). While solving linear problems is important and useful, the linear class does not cover all interesting optimization problems (for example, eigenvalue maximisation). Convex optimization problems are an important class (subsuming linear problems and containing a subset of non-linear problems) that is interesting, useful, and efficiently solvable.

In this course, we will look at algorithms for convex optimization problems. In particular, the focus will be on large dimension problems and several algorithms along with their convergence rates will be presented.

There is no standard textbook for this course. We will be referring to the following textbooks:

  1. Introductory lectures on convex optimization: A basic course, Yurii Nesterov
  2. Convex optimization, Stephen Boyd and Lieven Vandenberghe
  3. Lectures on modern convex optimization, Aharon Ben-Tal and Arkadi Nemirovski
  4. Theory of convex optimization for machine learning, Sébastien Bubeck
  5. Convex optimization theory, Dimitri P. Bertsekas
  6. Nonlinear programming, Dimitri P. Bertsekas
  7. Linear and nonlinear programming, David G. Luenberger and Yinyu Ye
  8. Numerical optimization, Jorge Nocedal and Stephen J. Wright
  9. Problem complexity and method efficiency in optimization, A. Nemirovsky and D. B. Yudin
  10. A course in convexity, Alexander Barvinok

1. Comparing various algorithms

In the first part (and most) of the course, we will be looking at algorithms that do not exploit the structure of the objective function. In this regard, we assume an oracle that provides the required information, and we do not have any other additional information. In this course we consider the following oracles:

  1. Zeroth-order oracle: input {x}, output {f(x)}.
  2. First-order oracle: input {x}, output {(f(x), \nabla f(x))}
  3. Second-order oracle: input {x}, output {(f(x), \nabla f(x), \nabla^2 f(x))}

We can now compare different algorithms over a class of functions with different oracles. The complexity of an algorithm will be measured in terms of the number of calls to the oracle and the computation on the oracle outputs needed to compute the optimum value. However, we will not be counting the complexity of the oracle itself, i.e., the complexity of computing {f(x), \nabla f(x), \nabla^2 f(x)} will be neglected. In most of the problems we will be considering, {f(x)} is real valued and hence it does not make sense (or will be of infinite complexity) to ask for the exact optimal value. Instead, we will look for solutions that are {\epsilon}-close to the optimal value, i.e., we are interested in {\tilde{x}} such that {f(\tilde{x})- f(x^*) < \epsilon}.

2. Zeroth-order oracle

Let us first consider the complexity of solving the following problem

\displaystyle \min_{x\in [0,1]^n} f(x),

when the function {f(x)} belongs to the class of Lipschitz continuous functions with respect to the {\|\cdot\|_\infty} norm. We require an {\epsilon} accurate solution, i.e., if {\tilde{x}} is the declared solution, then {f(\tilde{x}) - f(x^*) <\epsilon}. A function {f(x):{\mathbb R}^n\rightarrow {\mathbb R}} is Lipschitz continuous with constant {L} if

\displaystyle |f(x)-f(y)|\leq L\|x-y\|_\infty, \quad \forall x,y \in {\mathbb R}^n.

Suppose we consider a zeroth order oracle that provides only the value of the function. What can we say about the computational complexity for this class of functions? We can provide an upper bound on the complexity by exhibiting an algorithm. The simplest algorithm is the following grid search: divide each dimension into {K} parts and search over the resulting lattice. It can easily be seen that {\Theta(K^n)} calls to the oracle are required.

Lemma 1 If {K> L/(2\epsilon)}, the grid search algorithm provides an {\epsilon} accurate solution.

Proof: The optimal point {x^*} belongs to some small hypercube of the grid, of side length {1/K}. Let {\bar{x}} be the vertex of this hypercube nearest to {x^*}, so that {\|\bar{x}-x^*\|_\infty \leq 1/(2K)}. Then by Lipschitz continuity of {f},

\displaystyle f(\bar{x}) - f(x^*) \leq L\|\bar{x}- x^*\|_\infty \leq \frac{L}{2K}.

Hence choosing {L/(2K)<\epsilon}, i.e., {K> L/(2\epsilon)}, we obtain the required accuracy. \Box

So this algorithm calls the zeroth-order oracle {(L/2\epsilon)^n} times to obtain an {\epsilon} accurate solution. Is this optimal (in an order sense)? To prove that it is, we build a worst case function and show that, whatever the algorithm, the optimal value cannot be found in fewer than {\Theta((L/2\epsilon)^n)} steps.
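For concreteness, a sketch of the grid search with a zeroth-order oracle; the test function (Lipschitz with {L=1} in the sup norm) is our choice:

    import numpy as np
    from itertools import product

    def grid_search(f, n, K):
        # Evaluate f at every lattice point of {0, 1/K, ..., 1}^n and keep
        # the best; each evaluation is one zeroth-order oracle call.
        best_x, best_val = None, np.inf
        for idx in product(range(K + 1), repeat=n):
            x = np.array(idx) / K
            val = f(x)
            if val < best_val:
                best_x, best_val = x, val
        return best_x, best_val

    # Our test function: L = 1 Lipschitz in the sup norm on [0,1]^2
    f = lambda x: np.max(np.abs(x - 0.3))
    x_hat, v = grid_search(f, n=2, K=10)   # guarantees error at most L/(2K) = 0.05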

Lemma 2 With the zeroth order oracle, no deterministic algorithm can perform better.

Proof: In this setting, any algorithm can only depend on the values of the function that the oracle provides. Suppose you have an algorithm and you feed in the function {g(x)=0}. The algorithm will query a sequence of points {x_1, x_2, \hdots, x_N}, after which it outputs the optimal value with {\epsilon} accuracy. Let us assume that {N< (L/2\epsilon)^n =p^n} (otherwise we are done). Partition {[0,1]^n} into {p^n} subcubes; since there are only {N<p^n} query points, there exists a point {\tilde{x} \in [0,1]^n} such that {B(\tilde{x}, 1/2p)\cap \{x_1,x_2,\hdots,x_N\} = \emptyset} (the ball is with respect to the {\|\cdot\|_\infty} norm). Now consider the function

\displaystyle f(x) = \min\{0,L\|x- \tilde{x}\|-\epsilon\}.

This is a Lipschitz function with optimal value {-\epsilon}. The function is non-zero only when {\|x-\tilde{x}\| \leq \epsilon/L}, i.e., in the ball {B(\tilde{x},\epsilon/L)}. Since {\epsilon/L \leq 1/(2p)}, all the algorithm's queries return {0}; it therefore cannot distinguish {f} from {g} and never gets {\epsilon}-close to the optimal value. \Box

This result shows that the complexity of optimization with zeroth-order oracle for Lipschitz functions is exponential in the dimension. In the next class, we will look at Lipschitz functions in more detail.

 

 

Reference: Problem complexity and method efficiency in optimization, A Nemirovsky and D. B. Yudin

Before we proceed further, we observe that the property of a linear function was used only in a small part of the proof of Corollary 8 in the last post. Observe that the proof (for the maximum) would go through if the function {f(x)} satisfies the following properties:

  1. {f(a x) = a f(x)}, {a>0}.
  2. sub-additive property: {f(x+y) \leq f(x)+ f(y)}

Such functions are called sub-additive functions. So a continuous sub-additive function achieves its maximum on a vertex of a compact convex set. An example of a sub-additive function is a norm {\|x\|}. Another example is the support function of a set (we will look at this function later).

We will now look at some interesting examples of extreme points and maximization of linear functionals. We will first begin with the following motivational example.

Examples:

Assignment problem (AP): The AP arises in many contexts and is an example of binary optimization. Here we will motivate the problem in the context of wireless and later generalize the notion.

OFDM (orthogonal frequency division multiplexing) is a multi-carrier modulation scheme used in 4G networks. In OFDM, the data is modulated over different tones in such a way that orthogonality can be maintained across tones.

In general, a base station (BS) tower serves multiple users. When using OFDM, the BS generally has to decide the allocation of different tones to different users (OFDMA). Assume for now there are {n} frequency tones and {n} users. In general the users are distributed in space, hence each frequency tone will have a different “channel” with respect to each user. So let {h_{ij}} denote the channel of tone {i} from the BS to user {j}. A good channel in general indicates a higher data rate on that particular tone. A simple map from the channel quality to the data rate obtained is {R_{ij} = \log_2(1+h_{ij}^2)}. The BS wants to allocate one frequency tone per user, using all the tones, such that the sum rate is maximized. For example if {n=4}, a sample allocation is given by {\sigma =\{3,2,4,1\}}, which implies the first tone is allocated to the third user, the second tone to the second user, the third to the fourth user and the fourth to the first user. Observe that any allocation is just a permutation of the integers {\{1,2,...,n\}}. So mathematically the BS should find a permutation {\sigma} such that the allocated sum rate

\displaystyle \sum_{i=1}^n R_{i \sigma(i)},

is maximized. Observe that if the BS has check for every permutation of {1,...,n} before it can find the permutation that maximizes the sum rate. Suppose there are {20} tones, then the search space is approximately {20!\approx 2\times 10^{18}} which is huge.

In a little while, using extreme points and the Krein-Milman theorem, we will come up with a smarter way of solving this problem with polynomial complexity.

We will now start looking more closely at the vertices (extreme points) of a polyhedron. Most of the material is taken from AB. We begin with the following theorem, which characterizes the vertices of a polyhedron in terms of the linear span of the constraint vectors.

Theorem 1 Let {a_i, i=1,\hdots,n} be {n} vectors in {{\mathbb R}^d} and {b_i \in {\mathbb R}}. Let {P} be the polyhedron defined as

\displaystyle P=\{x: x\in {\mathbb R}^d,\ a_i^t x \leq b_i,\ i = 1,\hdots,n\}.

For any point {u \in P} define

\displaystyle I(u)= \{i: a_i^t u = b_i\}.

Then {u} is a vertex of {P} if and only if the set of vectors {\{a_i, i \in I(u)\}} linearly span {{\mathbb R}^d}.

Proof: Let us first prove that vertex implies linear span; we prove this by contradiction. Let {u} be a vertex of {P} and suppose {\{a_i, i \in I(u)\}} does not span {{\mathbb R}^d}. For notation, let {m=|I(u)|} and relabel the constraints so that {I(u)=\{1,\hdots,m\}}. Consider the matrix {A} with the {a_i^t} as its rows, i.e.,

\displaystyle A=[a_1,\hdots ,a_m]^t,

an {m \times d} matrix. Since the rows of {A} do not span the entire {{\mathbb R}^d}, the kernel (null space) of {A} is non-trivial (its dimension is greater than {0}). Hence there exists a non-zero {y \in \ker(A)}. Since {y \in \ker(A)},

\displaystyle a_i^t y =0,\quad \forall i \in I(u). \ \ \ \ \ (1)

We will now prove that {u} is not a vertex. Let {\epsilon >0} be a small number. Define

\displaystyle X= u + \epsilon y,

and

\displaystyle Y= u -\epsilon y.

Observe that {u= (X+Y)/2}; hence if {X} and {Y} lie in the polyhedron {P}, then {u} is not an extreme point (vertex), which is a contradiction. We will now prove that for sufficiently small {\epsilon}, {X} and {Y} indeed lie in {P}. To show that {X} (or {Y}) lies in {P}, we must show that it satisfies {a_i^t X \leq b_i} for all {i}. Let {i \in I(u)}. Then

\displaystyle \begin{array}{rcl} a_i^tX &= a_i^t(u+\epsilon y)\\ &= a_i^tu +\epsilon a_i^t y\\ &\stackrel{(a)}{=} b_i +0 \end{array}

{(a)} follows since {a_i^t u =b_i} for {i\in I(u)} and {a_i^t y =0} since {y \in \ker(A)}. Now if {i \notin I(u)}, then

\displaystyle \begin{array}{rcl} a_i^tX &= a_i^t(u+\epsilon y)\\ &=a_i^tu +\epsilon a_i^t y \end{array}

Observe that {a_i^t u <b_i}. Hence if we choose {\epsilon} sufficiently small, then {a_i^tu + \epsilon a_i^t y < b_i}. (For example, you can choose {\epsilon = 0.2\min \frac{b_i - a_i^t u}{|a_i^t y|}}, the minimum taken over {i \notin I(u)} with {a_i^t y \neq 0}.) Hence {X} satisfies all the inequalities and is in the polyhedron. Similarly, {Y} satisfies all the inequalities and is in the polyhedron.

We will now prove the converse, i.e., linear span implies vertex. Suppose {u = (X+Y)/2} where {X, Y \in P}. For {i\in I(u)} we first have

\displaystyle a_i^t u =b_i= \frac{a_i^t X + a_i^t Y}{2}.

Since {X, Y \in P}, we have {a_i^tX \leq b_i} and {a_i^tY \leq b_i}, which is possible only when {a_i^tX = b_i} and {a_i^tY = b_i}. Now consider the set of equations, for {i \in I(u)},

\displaystyle a_i^t z = b_i.

Since the vectors {a_i, i \in I(u)}, span the entire {{\mathbb R}^d}, the above set of equations has at most one solution. But we have just shown that {u}, {X} and {Y} all satisfy these equations. This implies {u = X = Y}, so {u} is a vertex. \Box

Corollary 2 For any vertex {u}, the cardinality of the set {I(u)} is at least {d}, i.e., {|I(u)| \geq d}.

Proof: We require at least {d} vectors to span {{\mathbb R}^d}. \Box
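
Theorem 1 is easy to test numerically. The following sketch checks the criterion on the unit square in {{\mathbb R}^2} (the polyhedron and the test points are hypothetical examples):

import numpy as np

# Unit square written as a_i^t x <= b_i, with the a_i^t as rows of A.
A = np.array([[ 1.,  0.],    #  x1 <= 1
              [ 0.,  1.],    #  x2 <= 1
              [-1.,  0.],    # -x1 <= 0
              [ 0., -1.]])   # -x2 <= 0
b = np.array([1., 1., 0., 0.])

def is_vertex(u, tol=1e-9):
    active = np.isclose(A @ u, b, atol=tol)        # the index set I(u)
    return np.linalg.matrix_rank(A[active]) == A.shape[1]

print(is_vertex(np.array([1., 1.])))    # True: a corner, {a_i, i in I(u)} spans R^2
print(is_vertex(np.array([0.5, 1.])))   # False: edge midpoint, |I(u)| = 1 < 2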

We will now analyze a particular polyhedron that is required to simplify the assignment problem. Before that, we require the following definition of a permutation matrix. Suppose {\sigma} is a permutation of {1,...,n}. Then the permutation matrix {X_\sigma} is the {n\times n} matrix with {(i,j)} entry given by

\displaystyle x_{ij}=\left\{\begin{array}{cc} 1& \text{if } \sigma(i) =j\\ 0& \text{otherwise}. \end{array} \right. \ \ \ \ \ (2)

So for a given {n} there are {n!} permutation matrices.
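
For instance, the permutation matrix for the sample allocation {\sigma=\{3,2,4,1\}} from the OFDM example can be built as follows (indices shifted to start at {0} in code):

import numpy as np

# sigma = {3,2,4,1} in the text, zero-based here: sigma(i) = j.
sigma = [2, 1, 3, 0]
X = np.zeros((4, 4), dtype=int)
X[np.arange(4), sigma] = 1           # x_ij = 1 iff sigma(i) = j
print(X)
# Every row and column sums to 1, so X is doubly stochastic as well.
assert (X.sum(axis=0) == 1).all() and (X.sum(axis=1) == 1).all()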

Birkhoff polytope:

The Birkhoff polytope {B_n} is the set of all {n\times n} doubly stochastic matrices. Mathematically,

\displaystyle B_n=\left\{\{\xi_{ij}\}_{n\times n}: \sum_{i=1}^n \xi_{ij}=1,\sum_{j=1}^n \xi_{ij}=1, \xi_{ij}\geq 0 , \forall i, j \right\}.

First observe that {B_n} is defined entirely by linear equalities and inequalities and hence is a polyhedron. Also observe that {B_n} is a convex set.

Hw 1 Show that {B_n} is a compact set.
Lemma 3 An {n \times n} permutation matrix {X_\sigma} is an extreme point of {B_n}.

Proof: Suppose {X_\sigma = (A+ C)/2} where {A, C \in B_n}. Consider the {i}-th row of all three matrices. Suppose {x_{im} =1} and all the other entries of that row are zero. Then we have

\displaystyle 2[0,\hdots, 0,1,0,\hdots,0] = [a_{i1}, \hdots, a_{in}]+[c_{i1}, \hdots, c_{in}]

We have {\sum_{k=1}^n a_{ik} =1} and {\sum_{k=1}^n c_{ik} =1}, and each term is non-negative. This is possible only when {a_{im}=c_{im} =1} and all the other entries of the {i}-th row are zero. Applying the same argument to every row shows that {X_\sigma= A=C}, which proves that {X_\sigma} is an extreme point of {B_n}. \Box

We will now prove the converse of the above statement.

Theorem 4 (Birkhoff–von Neumann Theorem) The extreme points of {B_n} are precisely the permutation matrices.

Proof: We prove the theorem by induction. For {n=1} the statement is obvious. Let us assume it is true up to dimension {n-1} and prove it for dimension {n}.

Consider the following affine subspace {L} of {{\mathbb R}^{n^2}}: the set of all arrays {\{\xi_{ij}\}} of {n^2} real numbers {\xi_{ij} \in {\mathbb R}} such that

\displaystyle \sum_{i=1}^n \xi_{ij} =1, \ \forall j=1,...,n,

and

\displaystyle \sum_{j=1}^n \xi_{ij} =1, \ \forall i=1,...,n.

Observe that the dimension of the space {L} is {(n-1)^2}: of the {2n} equality constraints only {2n-1} are linearly independent (the sum of all row constraints equals the sum of all column constraints), so {\dim L = n^2-(2n-1) = (n-1)^2}. Observe that {B_n\subset L} and that {B_n} is defined as the polyhedron

\displaystyle \xi_{ij} \geq 0, \forall\ i,j

in {L}.

Let {X} be an extreme point of {B_n}. By Theorem 1 (via Corollary 2, applied inside {L}, whose dimension is {(n-1)^2}), the extreme point {X} must satisfy at least {(n-1)^2} of the inequalities {\xi_{ij} \geq 0} with equality. Hence {x_{ij} =0} for at least {(n-1)^2} entries of the matrix. We make the following observation.

  • Not every row can have two or more non-zero entries: if every row had at least two non-zero entries, then each row would have at most {n-2} zero entries, so the matrix would have at most {n(n-2)} zero entries. This is a contradiction, since {n(n-2) = n^2-2n < n^2-2n+1 = (n-1)^2}.

So there is at least one row {i_o} with only one non-zero entry, at location {j_o}. Since {X} is doubly stochastic, we have {x_{i_oj_o}=1}, and hence there can be no other non-zero entries in column {j_o}. Let {\tilde{X}} be the {(n-1)\times(n-1)} matrix obtained after removing row {i_o} and column {j_o}. Observe that

  1. {\tilde{X}} is a doubly stochastic matrix: this is because the original matrix {X} was doubly stochastic and the only non-zero contribution from row {i_o} and column {j_o} is {x_{i_oj_o}}.
  2. {\tilde{X}} is an extreme point of {B_{n-1}}: suppose not, i.e., {\tilde{X} = (C+D)/2} with {C, D \in B_{n-1}} and {C \neq D}. Re-insert a row at position {i_o} (all zeros except a {1} in column {j_o}) and a column at position {j_o} (all zeros except that same {1}) into both {C} and {D}, obtaining matrices {\hat C, \hat D \in B_n}. Then {X = (\hat{C}+\hat{D})/2} with {\hat C \neq \hat D}, contradicting the extremality of {X}.

So {\tilde{X}} is an extreme point of {B_{n-1}}, and by the induction hypothesis {\tilde{X}} is some {(n-1)\times (n-1)} permutation matrix. Augmenting {\tilde{X}} with the removed row {i_o} and column {j_o} recovers {X}, which is therefore an {n\times n} permutation matrix. \Box
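
The induction in the proof also suggests an algorithm: repeatedly find a permutation supported on the positive entries of a doubly stochastic matrix and peel it off. Below is a sketch of this classical greedy decomposition (not a method from the lecture); SciPy's linear_sum_assignment is used only to find a perfect matching inside the support.

import numpy as np
from scipy.optimize import linear_sum_assignment

def birkhoff_decompose(X, tol=1e-9):
    """Write a doubly stochastic X as a convex combination of
    permutation matrices (greedy sketch of Birkhoff's theorem)."""
    X = X.astype(float).copy()
    terms = []
    while X.max() > tol:
        # A perfect matching inside the support of X exists (Hall's
        # condition follows from double stochasticity); maximizing the
        # count of positive entries picked finds one.
        rows, cols = linear_sum_assignment(-(X > tol).astype(float))
        P = np.zeros_like(X)
        P[rows, cols] = 1.0
        w = X[rows, cols].min()      # largest weight we can peel off
        terms.append((w, P))
        X -= w * P                   # remainder is a scaled d.s. matrix
    return terms

# Example: recover the two permutation matrices behind an average.
X = np.array([[.5, 0., .5], [0., 1., 0.], [.5, 0., .5]])
for w, P in birkhoff_decompose(X):
    print(w, P.astype(int).tolist())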

We will now see how to utilize this theorem to simplify the assignment problem. Recall that the AP was to maximize over all the permutations of {1,...,n}:

\displaystyle \max_{\sigma}\sum_{i=1}^n R_{i \sigma(i)}.

Observe that the cost function {\sum_{i=1}^n R_{i \sigma(i)}} is equivalent to

\displaystyle \sum_{i,j} R_{ij} x_{ij},

where {X_\sigma = \{x_{ij}\}} is the permutation matrix corresponding to {\sigma}. So the AP equals

\displaystyle \max_{X_\sigma}\sum_{i,j} R_{ij} x_{ij}.

Observe that even now, {x_{ij}} is restricted to be either {0} or {1}.

Now consider a new problem P2:

\displaystyle \max\sum_{ij} R_{ij} \xi_{ij},

over the set

\displaystyle \begin{array}{rcl} &\sum_{i}\xi_{ij} =1, \forall j\\ &\sum_{j}\xi_{ij} =1, \forall i\\ &\xi_{ij} \geq 0, \forall i,j. \end{array}

Observe that we are now maximizing over the Birkhoff polytope {B_n}, and the {\xi_{ij}} are real numbers. So we are maximizing the linear function {\sum_{ij} R_{ij} \xi_{ij}} over the compact convex set {B_n}; hence the maximum is attained at a vertex of {B_n}. But by the Birkhoff–von Neumann theorem, the vertices of {B_n} are precisely the permutation matrices. Hence solving P2 is equivalent to solving the AP. Moreover, P2 is a linear programming problem and can be solved with polynomial complexity.
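
A minimal sketch of P2 as a linear program, using SciPy's linprog (the rates {R_{ij}} are hypothetical random values; the solver returns a basic optimal solution, i.e. a vertex, which by the theorem is a permutation matrix):

import numpy as np
from scipy.optimize import linprog

# Variables: the n^2 entries xi_ij, flattened row by row.
n = 4
rng = np.random.default_rng(3)
R = np.log2(1 + rng.random((n, n))**2)

A_eq = np.zeros((2 * n, n * n))
for i in range(n):
    A_eq[i, i * n:(i + 1) * n] = 1   # sum_j xi_ij = 1 (row i)
    A_eq[n + i, i::n] = 1            # sum_i xi_ij = 1 (column i)
b_eq = np.ones(2 * n)

# Maximize sum_ij R_ij xi_ij  <=>  minimize -R . xi over B_n.
res = linprog(-R.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, 1))
Xi = res.x.reshape(n, n)
print(np.round(Xi).astype(int))      # a permutation matrix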