You are currently browsing the category archive for the ‘EE6151: Convex optimization algorithms’ category.

**Nesterov’s Accelerated Gradient Descent Algorithm**

We have seen that $O(1/n^2)$ is the best achievable rate of convergence for the class of $\beta$-smooth functions. However, GDA achieves only an $O(1/n)$ convergence rate for $\beta$-smooth functions. We will now look at Nesterov’s accelerated method, which achieves the optimal rate of convergence.

**Nesterov’s accelerated method**

Essentially, the accelerated method has memory: the current point depends on the previous two updates. This technique achieves the optimal convergence rate.
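As an aside (not part of the lecture), the update can be sketched in a few lines of NumPy. The quadratic test function, the step size $1/\beta$, and the particular $\lambda$-recursion below are assumptions following one standard presentation of Nesterov's method:

```python
import numpy as np

def nesterov_agd(grad, beta, x0, n_iters):
    """Accelerated gradient descent for a beta-smooth convex function.

    y is the sequence of gradient-step points, x the extrapolated points;
    the momentum coefficient (lam - 1) / lam_next follows Nesterov's
    lambda recursion (an assumption: this is one standard variant).
    """
    x = np.asarray(x0, dtype=float)
    y = x.copy()
    lam = 1.0
    for _ in range(n_iters):
        y_next = x - grad(x) / beta                       # gradient step
        lam_next = (1 + np.sqrt(1 + 4 * lam ** 2)) / 2
        x = y_next + (lam - 1) / lam_next * (y_next - y)  # momentum step
        y, lam = y_next, lam_next
    return y

# illustration on the quadratic f(x) = 0.5 x^T A x, minimized at 0
A = np.diag([1.0, 10.0])
beta = 10.0                      # largest eigenvalue of A
y_n = nesterov_agd(lambda x: A @ x, beta, np.array([1.0, 1.0]), 100)
print(np.linalg.norm(y_n))       # small: the iterates approach the minimizer
```

The memory shows up in the momentum step: the next point depends on both $y_{n+1}$ and $y_n$, exactly the two-update dependence described above.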

**Lemma 1** The accelerated method has a convergence rate of $O(1/n^2)$, where $n$ is the iteration number.

*Proof:* We have

Using the result for -smooth function and the convexity, we obtain

The above inequality would hold true even with replaced with . Hence we also have

Let . Multiplying with and adding to , we obtain

We now multiply both sides by . We first look at the LHS of the above equation after multiplying by . Since ,

The RHS after multiplication by is

Using , the above can be simplified to

Using the relation , the above simplifies to

Using the relation between and in the update rule, (the second term) can be simplified to

Letting , we have

Summing from to , we have

Rearranging, we see

proving the result.

Observing the proof, we see that the method can be generalized as follows:

where and should satisfy and for large.

The term in (1) is called the momentum term, and we observe that the coefficient behaves like $1 - 3/n$ for large $n$. The coefficient $3$ is generalized to $r$ in the paper by Weijie Su et al., “A Differential Equation for Modeling Nesterov’s Accelerated Gradient Method: Theory and Insights”.

Let the function $f$ be $\alpha$-strongly convex and $\beta$-smooth, and let $Q = \beta/\alpha$ be the condition number of $f$. In this case the basic gradient descent algorithm requires $O(Q \log(1/\epsilon))$ iterations to reach $\epsilon$-accuracy. Nesterov’s Accelerated Gradient Descent attains the optimal oracle complexity of $O(\sqrt{Q} \log(1/\epsilon))$. This improvement is quite relevant for Machine Learning applications.

**Algorithm**: The basic structure of the algorithm is given below

- Start at an arbitrary initial point .
- Then iterate the following equations for ,

**Theorem 2** Let $f$ be $\alpha$-strongly convex and $\beta$-smooth; then Nesterov’s Accelerated Gradient Descent satisfies
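For intuition (an aside, not from the lecture), the strongly convex variant keeps a constant momentum coefficient $\frac{\sqrt{Q}-1}{\sqrt{Q}+1}$. Below is a minimal NumPy sketch on a quadratic; the test function and the step size $1/\beta$ are assumptions:

```python
import numpy as np

def agd_strongly_convex(grad, alpha, beta, x0, n_iters):
    """AGD for an alpha-strongly convex, beta-smooth function:
    constant momentum (sqrt(Q) - 1) / (sqrt(Q) + 1) with Q = beta / alpha."""
    q = beta / alpha
    theta = (np.sqrt(q) - 1) / (np.sqrt(q) + 1)
    x = np.asarray(x0, dtype=float)
    y = x.copy()
    for _ in range(n_iters):
        y_next = x - grad(x) / beta        # gradient step
        x = y_next + theta * (y_next - y)  # constant-momentum extrapolation
        y = y_next
    return y

# quadratic with alpha = 1, beta = 100, so Q = 100 and sqrt(Q) = 10
A = np.diag([1.0, 100.0])
y_n = agd_strongly_convex(lambda x: A @ x, 1.0, 100.0, np.array([1.0, 1.0]), 300)
print(np.linalg.norm(y_n))   # geometric decay at rate roughly (1 - 1/sqrt(Q))
```

Note the contrast with the convex case: here the momentum coefficient is a constant determined by $Q$ rather than a sequence growing towards one.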

Till now, we were looking at the gradient descent algorithm and its performance on different classes of functions. When GD is used on the class of $\beta$-smooth convex functions, the convergence rate is $O(1/n)$. The convergence rate is exponential (linear, in the convex analysis terminology) when GD is used on $\alpha$-strongly convex and $\beta$-smooth functions. Are these the optimal rates? Are there algorithms that can improve upon these convergence rates? For example, are there first order algorithms that can achieve a convergence rate of $O(1/n^2)$ on the $\beta$-smooth function class? In this lecture, we will come up with explicit functions that provide upper bounds on the rate of convergence, i.e., these functions exhibit the worst case convergence rate. We first look at the class of smooth functions. The following proof is adapted from Nemirovski’s textbook. For simplicity, we make the following assumptions on the oracle. We will later see how to relax these conditions.

- First Order Oracle: Only first order information should be used.
- $x_1 = 0$ and $x_{t+1} \in \mathrm{Span}\{\nabla f(x_1), \ldots, \nabla f(x_t)\}$.

**Theorem 1** Let $t \le (n-1)/2$ and $\beta > 0$. There exists a $\beta$-smooth convex function $f$ such that for any black-box procedure satisfying the conditions defined above, $f(x_t) - f(x^*) \ge \frac{3\beta \|x_1 - x^*\|^2}{32 (t+1)^2}$.

*Proof:* For $k \le n$, let $A_k$ be the $k \times k$ symmetric tridiagonal matrix, with $2$ on the diagonal and $-1$ on the off-diagonals, defined by

We will first show that $0 \preceq A_k \preceq 4I$. The first inequality can be verified as follows.

Similarly, it can easily be shown that $A_k \preceq 4I$. Observe that from our assumption, $2t+1 \le n$, and hence the function below is well defined. We consider now the following smooth convex function:

**Note:** Here $k$ is not the running index of the algorithm. First we fix a $k$ and we define the function for that given $k$. The running index in the proof is $s$. The function can easily be verified to be smooth as follows.
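The structure of the worst-case function can also be checked numerically. Below is a hedged sketch assuming the common form $f(x) = \frac{\beta}{8} x^T A_k x - \frac{\beta}{4} x^T e_1$ (as in Bubeck's notes; the exact constants are assumptions). The eigenvalue check confirms $0 \preceq A_k \preceq 4I$, so the Hessian $\frac{\beta}{4} A_k$ has spectral norm below $\beta$:

```python
import numpy as np

beta, k = 1.0, 10
# A_k: symmetric tridiagonal with 2 on the diagonal, -1 on the off-diagonals
A = 2 * np.eye(k) - np.eye(k, k=1) - np.eye(k, k=-1)
eigs = np.linalg.eigvalsh(A)
print(eigs.min(), eigs.max())   # all eigenvalues lie strictly inside (0, 4)

# the Hessian of f(x) = (beta/8) x^T A x - (beta/4) x[0] is (beta/4) A,
# so its spectral norm is below beta and f is beta-smooth
hess_norm = (beta / 4) * eigs.max()
print(hess_norm)
```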

Let us now look at the possible iterates generated by any algorithm that satisfies our assumptions. As and , we can represent , , , as follows,

where is the standard basis (it has only in the -th coordinate). We see that has a non-zero value only in the first coordinate, i.e., . Similarly,

from which we see that . From the way is defined, it is easy to verify that must lie in the linear span of . In particular, for we necessarily have for , which implies

In other words, if we denote

then we just proved that

where follows since and from the definition of and we can see, . The inequality follows since . We now compute the minimizer of , its norm, and the corresponding function value . The optimal can be found by setting the gradient to zero, i.e., solving . So, the point is the unique solution in the span of of . It is easy to verify that it is defined by for . Thus we immediately have:

We first observe that decreases with . Hence . So we have

Furthermore note that

Thus one obtains:

which concludes the proof.

We make the following observations:

- From Theorem 1, we observe that the worst case complexity is achieved by a quadratic function, and no first order algorithm can provide a convergence rate better than $O(1/t^2)$.
- From the above theorem and the convergence rate of the GDA we have the following result: Let be the number of iterations required to achieve an error of less than . Then
First, we observe that GDA does not match the lower bound on complexity. Also, for , we have that the upper bound is the square (up to a constant) of the lower bound. However, for small , the lower bound is not tight. In the next class, we will look at a technique that provides an upper bound matching the lower bound.

- A similar kind of result can be obtained for strongly convex functions. We will state the result here without proof.
**Theorem 2** Let $Q = \beta/\alpha$. There exists a $\beta$-smooth and $\alpha$-strongly convex function such that for any black-box procedure as above one has $f(x_t) - f(x^*) \ge \frac{\alpha}{2}\left(\frac{\sqrt{Q}-1}{\sqrt{Q}+1}\right)^{2(t-1)} \|x_1 - x^*\|^2$. We know that for small values of $x$, $\frac{1-x}{1+x} \approx e^{-2x}$. This means that for large values of $Q$:

We observe that this lower bound and the GDA convergence rate differ by a factor of $\sqrt{Q}$ in the denominator of the exponent.

Today we will look at a variant of gradient descent called the steepest descent algorithm.

**1. Steepest Descent Algorithm**

Having seen the Gradient Descent Algorithm, we now turn our attention to yet another member of the Descent Algorithms family — the Steepest Descent Algorithm. In this algorithm, we optimize the descent direction to obtain the maximum decrease of the objective function. First, let us look at the descent direction for Steepest Descent. Let $f$ be continuously differentiable. Using the Taylor series, we get

Our objective is to minimize the second, i.e., first-order, term. The descent direction is defined in two steps. First, we look at the so-called normalized steepest descent direction. This in turn helps us define the steepest descent direction. The *normalized steepest descent direction* is defined as

Observe that the descent direction depends not only on the point $x$, but also on the norm $\|\cdot\|$. As is clear from the definition, the name *normalized* is used since we search only over vectors of unit magnitude. Looking closely, we notice that this definition is similar to that of the *dual norm* of a given norm. Hence, the minimum value is given by $-\|\nabla f(x)\|_*$. The negative sign compensates for the “min” used here as opposed to the “max” used in the definition of a dual norm. Now we can define the descent direction for the Steepest Descent Algorithm as follows:

The following calculation verifies that the descent direction defined above satisfies the required sufficiency condition, and also justifies the term used in the definition:

where follows from the definition of normalized steepest descent direction. Hence, the update equation is

The following Lemma casts the descent direction as a simpler optimization problem.

**Lemma 1**

*Proof:* The proof follows by representing , where , and jointly optimizing over and .
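To make the dependence on the norm concrete (an illustrative aside, not from the lecture): under the Euclidean norm the normalized direction is $-\nabla f(x)/\|\nabla f(x)\|_2$, while under the $\ell_1$ norm it moves along a single coordinate, which is why $\ell_1$ steepest descent amounts to coordinate descent. A small sketch:

```python
import numpy as np

def nsd_l2(g):
    """Normalized steepest descent direction for the Euclidean norm."""
    return -g / np.linalg.norm(g)

def nsd_l1(g):
    """For the l1 norm: a unit-l1 vector minimizing <g, d> concentrates on
    the coordinate with the largest |gradient| component."""
    i = int(np.argmax(np.abs(g)))
    d = np.zeros_like(g)
    d[i] = -np.sign(g[i])
    return d

g = np.array([3.0, -4.0])
print(nsd_l2(g))                       # [-0.6  0.8]
print(nsd_l1(g))                       # [ 0.  1.]
# the minimum values are the negated dual norms: -||g||_2 and -||g||_inf
print(g @ nsd_l2(g), g @ nsd_l1(g))
```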

We will now look at the descent directions generated by a quadratic norm.

**Definition 2** Let $P$ be a fixed positive definite matrix, i.e., $P \succ 0$. The quadratic norm of any vector $v$ is then defined as $\|v\|_P = (v^T P v)^{1/2}$.

If we define $y = P^{1/2}x$, then this essentially corresponds to a change of basis. Since $P \succ 0$, $P^{1/2}$ is well defined. This in turn allows the quadratic norm to be simplified as follows:

Hence, an ellipsoid is converted to a sphere. In the case of Quadratic Norm, the descent direction is

Since the above minimization problem is unconstrained, setting its gradient to zero gives the minimizer. Therefore,

and hence

This norm can be used when the condition number of the objective function is bad. In this case, if one has some idea of the Hessian at the optimal point, the matrix $P$ can be chosen to be equal to that Hessian. This improves the condition number and the convergence rate of the algorithm.
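A small numerical aside (the quadratic objective and the choice of $P$ equal to its Hessian are assumptions for illustration): the steepest descent direction in the quadratic norm is proportional to $-P^{-1}\nabla f(x)$, and choosing $P$ as the Hessian of a quadratic objective recovers the minimizer in a single unit step, exactly as the change-of-basis picture suggests:

```python
import numpy as np

def sd_quadratic_norm(grad, P, x0, step, n_iters):
    """Steepest descent w.r.t. ||v||_P = sqrt(v^T P v):
    the (unnormalized) direction is -P^{-1} grad f(x)."""
    P_inv = np.linalg.inv(P)
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        x = x - step * (P_inv @ grad(x))
    return x

# ill-conditioned quadratic f(x) = 0.5 x^T H x, minimized at the origin
H = np.diag([1.0, 1000.0])
x = sd_quadratic_norm(lambda x: H @ x, H, np.array([1.0, 1.0]), 1.0, 1)
print(x)    # choosing P = H gives the exact minimizer in one step
```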

**1.1. Convergence Rate for Steepest Descent with Backtracking**

**Theorem 3** Let $f$ be smooth and strongly convex. Then Steepest Descent with Backtracking has a linear convergence rate.

*Proof:* See Boyd.


For a given $\beta$-smooth convex function, we proved in the last lecture that the GD algorithm with constant step size converges to the optimum with a convergence rate of the order $O(1/n)$. In this lecture, we present a GD algorithm with constant step size for $\beta$-smooth and $\alpha$-strongly convex functions whose convergence rate is exponential.

*Proof:* The GD update step is $x_{n+1} = x_n - \eta \nabla f(x_n)$. Since $f$ is $\beta$-smooth,

Since the gradient vanishes at the optimal point,

Using the Lemma from the previous lecture, we have

After simplification, we obtain

The last term in the RHS can be neglected, and we obtain the following upper bound.

Now, iterating over ,

Substituting in (1), we obtain

Using the fact , it follows from Theorem 1 that

The following lemma shows that GD on strongly convex functions with a constant step size exhibits exponential convergence.

**Lemma 2** GD with step size $\eta = 1/\beta$ on an $\alpha$-strongly convex and $\beta$-smooth function satisfies

with

*Proof:* Substituting in Theorem 1, we have

The last step follows since .
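A quick numerical illustration of this geometric decay (an aside; the quadratic test function and the step $1/\beta$ are assumptions):

```python
import numpy as np

def gd(grad, step, x0, n_iters):
    """Plain gradient descent with a constant step size."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        x = x - step * grad(x)
    return x

# f(x) = 0.5 x^T A x is 1-strongly convex and 10-smooth; step = 1/beta
A = np.diag([1.0, 10.0])
errs = [np.linalg.norm(gd(lambda x: A @ x, 0.1, np.array([1.0, 1.0]), n))
        for n in (10, 20, 40)]
print(errs)   # the error shrinks geometrically (here like 0.9 ** n)
```

Doubling the iteration count squares the remaining error, the signature of exponential (linear, in the convex analysis sense) convergence.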

*In convex analysis terminology, convergence of the form $O(c^n)$ (with $c < 1$) is called linear convergence, and convergence of the form $O(c^{2^n})$ is called quadratic convergence.* It can be seen that the rate of convergence in the above theorem depends critically on the condition number $Q = \beta/\alpha$. Large values of $Q$ imply slow convergence. The next lemma examines the convergence behaviour with diminishing step sizes.

*Proof:* Substituting in (2), and using the fact , we have

and the result follows.

This shows that any polynomial convergence rate can be obtained by increasing . However for large , the initial steps might violate the condition . While this might slow down the initial steps of the algorithm, the convergence rate would still be given by Lemma 3 since for large .

In the previous lectures we saw the properties of smoothness and strong convexity, which are used quite often in problems involving Convex Optimization. We introduced the basic idea of descent algorithms and analysed the gradient descent algorithm. In today’s lecture we will look at some more theorems relating to the gradient descent method. We will consider a generalised step size, i.e., the update rule is

**Theorem 1** Let the convex function $f$ be $\beta$-smooth, with $x_n$ as the current point and $x^*$ as the optimal point. Then $\|x_n - x^*\|$ decreases with $n$ if $\eta_n \le 2/\beta$.

*Proof:*

Now, as derived in previous lecture, we have

But we know that,

Hence,

If , then

We note that, while the distance to the optimal point decreases, the rate of decrease can be arbitrarily slow. In fact, it can match the rate of any real sequence tending to zero. We will now look at the convergence rate of the gradient descent algorithm on the class of smooth convex functions.

**Theorem 2** Let $f$ be a convex and $\beta$-smooth function on $\mathbb{R}^n$. Then the gradient descent method with step lengths $\eta_n$ satisfies

*Proof:* Since is a -smooth function, we have

Now according to the gradient descent algorithm, we get the next point as

Substituting (2) in (3), we get

Now let be and be . Substituting this in the previous result we get,

According to the convexity condition we have,

Substituting the result of (5) to (4), we get

As proved in the previous theorem, the distance to the optimum decreases with $n$. Thus and consequently . So

Let .

Dividing by () we get,

From Theorem 1 it can be said that , so . Thus we have

The above equation (7) forms a telescoping sum. Summing all the terms from to we get,

Since , and thus are positive numbers. Thus

We will now look at some specific step sizes. When the step size is constant, it is easy to see that $\eta = 1/\beta$ maximizes the per-step decrease and is hence optimal.

**Corollary 3 (Constant stepsize)**

*When $\eta_n = 1/\beta$,*

Hence, for a constant step size, the algorithm converges as $O(1/n)$.

We now look at diminishing step sizes (these are very common for stochastic GDA).

**Corollary 4** Let $\eta_n$ be such that and . Then GDA converges to the global optimum. In particular when , we have

Observe that, in the case of diminishing step sizes, knowledge of $\beta$ is not necessary. Also, is a sufficient condition but not necessary. We only require to be well defined for large . For example, when , it can be shown (by approximating the sum by a Riemann integral) that

Hence when ,

which results in a loss of in the convergence rate compared to .

To compare algorithms under a standard condition, let us assume the optimum lies within a radius $R$ of the starting point, since the bound depends on the distance of the starting point to the optimum. For , we have

Now, for an allowable tolerance of $\epsilon$ from the optimum for the algorithm to end,

Hence

Thus the minimum number of steps required to reach the optimum using the gradient descent algorithm is $O(\beta R^2/\epsilon)$, which shows that the convergence depends directly on the smoothness constant $\beta$ and the distance $R$ of the starting point from the optimum, and inversely on the tolerance $\epsilon$ set for convergence.

**A few points to remember**:

- Lipschitz Continuity is always assumed in Convex Optimization.
- is a poor way to go through with optimization as will be explained in later lectures.
- Smoothness and convexity alone do not give good convergence rates. Thus $\alpha$-strong convexity is assumed for faster convergence.
- The gradient descent algorithm scales linearly with the number of dimensions.

We require the following Lemma for proving convergence rate of strongly convex functions in the next class.

**Lemma 5** Let $f$ be $\beta$-smooth and $\alpha$-strongly convex. Then, for all $x, y$,

**Theorem 1** Let $f$ be a convex and $\beta$-smooth function. Then, for any $x, y$,

*Proof:* Let . Then,

where follows by using the gradient inequality for the term (since $f$ is a convex function) and the $\beta$-smoothness of $f$ for the term, and follows by substituting .

The above theorem is from the notes of Dr. Sébastien Bubeck.

**1. Unconstrained Minimization**

We now look at the following optimization problem.

Our objective is to come up with a sequence of vectors such that . We refer to such a sequence as a minimizing sequence. We next define a tolerance limit to the optimization problem which basically limits the allowed gap between the optimal value and the function value at the point returned by the algorithm. By this we mean that the algorithm stops at if

**2. Descent Algorithms**

In descent algorithms, we follow the following update rule.

where is the step size and is the descent direction. Note that the latter need not have unit norm, and hence it is not a “direction” in the strict sense of the word; the name is a slight misnomer. By descent, we mean the following,

If is a convex function, by gradient inequality,

From (4)

**From (5) and (7), we have the following as a necessary condition for every gradient descent algorithm. **

Given below is the basic structure of a descent algorithm.

- Start at an initial point .
- Repeat the following till the “Stopping Criterion” is met.
- Descent direction : Choose a descent direction .
- Line search : Choose a step size .
- Update : .
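The loop above can be sketched in NumPy. Since exact line search is only introduced below, this sketch substitutes a simple backtracking rule for the step size (the sufficient-decrease constant 0.3, the halving factor, and the gradient-norm stopping criterion are illustrative assumptions):

```python
import numpy as np

def descent(f, grad, x0, direction, tol=1e-8, max_iters=1000):
    """Generic descent loop: direction(x, g) must return d with <g, d> < 0.
    The step size comes from backtracking (a stand-in for exact line search)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iters):
        g = grad(x)
        if np.linalg.norm(g) < tol:   # stopping criterion
            break
        d = direction(x, g)
        t = 1.0
        # shrink t until the sufficient-decrease (Armijo) condition holds
        while f(x + t * d) > f(x) + 0.3 * t * (g @ d):
            t *= 0.5
        x = x + t * d
    return x

# gradient descent is the special case d = -g
H = np.diag([1.0, 5.0])
f = lambda x: 0.5 * x @ H @ x
g = lambda x: H @ x
x_min = descent(f, g, np.array([3.0, -2.0]), lambda x, g: -g)
print(x_min)   # near the origin, the unique minimizer
```

Swapping in a different `direction` callback yields the other members of the descent family discussed in these notes.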

The stopping criterion does not directly follow from the tolerance limit as we do not know the optimal value . As we will see later, the choice of stopping criterion is of prime importance to the performance of a descent algorithm and it is not easy to come up with an appropriate stopping criterion. For now, we assume that “Exact Line Search” is used. By this, we mean that the step size is given by,

We will later see that implementing “Exact Line Search” is not practical, and we will come up with simpler ways to find the step size without any significant loss in performance. We are now left with the choice of the descent direction, which should satisfy (8). Descent algorithms differ in the choice of this descent direction. We now look at a basic descent algorithm:

**2.1. Gradient Descent Algorithm (GDA)**

The descent direction is chosen as

Observe that this descent direction clearly satisfies (8). Next, we state and prove the following important theorem, which shows that GDA works on functions with Lipschitz continuous gradients.

**Theorem 2** Let $f$ be a $\beta$-smooth function bounded from below. Then the gradient descent algorithm with a constant step size $\eta < 2/\beta$ will converge to a stationary point, i.e., to the set $\{x : \nabla f(x) = 0\}$.

*Proof:* Recall that for gradient descent algorithm,

As is -smooth, we have

where follows from (10). Thus, we have

where follows since . Next, we have

where is the global optimal value. Taking the limit as , we have

Hence $\|\nabla f(x_n)\| \to 0$, and the iterates approach a stationary point.

Note that $\eta = 1/\beta$ maximizes the term appearing in the denominator of (11) and is hence the optimal step size. In this case we have

The convergence rates can be obtained as follows: Let . Then

which implies

Hence, when $f$ is just a $\beta$-smooth function (it need not be convex), the above theorem shows that gradient descent converges sub-linearly to a stationary point (which need not even be a local minimum). Also, observe that the above bound is on the best gradient norm up to iteration $n$: the gradient itself might oscillate even when the function value does not. While the Lipschitz condition on the gradient is used to prove the convergence rate, the algorithm will converge to a stationary point whenever the level set corresponding to the initial point is bounded.

In optimization, differentiability (even of higher order) is not a very strong condition compared to continuity, and hence cannot by itself ensure convergence rates. This is because every continuous function (on a compact space) can be approximated by a smooth (arbitrarily differentiable) function to an arbitrary degree of precision (see the Stone-Weierstrass theorem for approximation by polynomials). Hence we require stronger conditions on the functions to ensure convergence rates.

**Definition 1 (Lipschitz Continuous functions:)**

*A function is said to be Lipschitz if*

Here $L$ is called the Lipschitz constant. The function is Lipschitz with constant on . The following theorem characterizes Lipschitz functions which are differentiable.


**Lemma 2** A differentiable function is Lipschitz on a convex domain iff

and the sup norm of the derivative is the Lipschitz constant.

The above lemma also helps us in identifying Lipschitz functions. For example, for $f(x) = e^x$ restricted to $x \le 0$, we have $\sup_{x \le 0} |f'(x)| = \sup_{x \le 0} e^x = 1$, and hence the Lipschitz constant of the exponential function on this domain is 1.

In the convex analysis literature, *functions whose derivatives are Lipschitz continuous are also called Lipschitz continuous functions*. To avoid confusion, we will call functions whose derivatives are Lipschitz continuous (with constant $\beta$) $\beta$-smooth functions, i.e.,

**Theorem 3** If $f$ is $\beta$-smooth, then for all $x, y$,

$$\left| f(y) - f(x) - \langle \nabla f(x), y - x \rangle \right| \le \frac{\beta}{2} \|y - x\|^2.$$

*Proof:* Consider the function . We have and

Since

we have

By applying the Cauchy-Schwarz inequality,

As is smooth

This shows that if a function is $\beta$-smooth then it is upper bounded by a quadratic function. While the upper bound is good, the quadratic lower bound has a negative quadratic term ($-\frac{\beta}{2}\|y - x\|^2$) and is hence very loose. Nevertheless, this shows that the growth rate of a $\beta$-smooth function is at most quadratic. We now look at convex functions:

**Definition 4 (convex)**

*A function is convex if*

- .
- is convex.

We now look at an easier and more useful characterisation of convexity when the function is differentiable.

**Lemma 5** A continuously differentiable function $f$ is convex iff, for all $x, y$, $f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle$.

We now introduce the notion of strong convexity, which provides quadratic lower bounds on convex functions (I won't dwell upon this too much, as these results have already been proved in the previous course). Recall that a twice differentiable function is convex iff its Hessian is positive semi-definite for all $x$, i.e., $\nabla^2 f(x) \succeq 0$. For example, is convex on the positive reals.

**Definition 6 (Strong Convexity)**

*A function is said to be strongly convex with parameter if*

Obviously, $\alpha$ should be strictly less than $\beta$ if the function is both $\beta$-smooth and $\alpha$-strongly convex. The above definition can be used to show that the Hessian of a strongly convex function must satisfy

i.e., $\nabla^2 f(x) - \alpha I$ must be positive semi-definite. The advantage of a strongly convex function is that it is not flat at the bottom, and hence convergence to the optimal solution is faster. While the function is convex, it is not strongly convex. However, adding a regularisation term makes it strongly convex. Similarly, adding $\frac{\alpha}{2}\|x\|^2$ to any convex function yields a strongly convex function. The next theorem provides a bound on the gap between the function value and the optimum in terms of the gradient.
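A small numerical check of the regularisation remark (the least-squares objective and the rank-deficient matrix below are assumptions for illustration): adding $\frac{\alpha}{2}\|x\|^2$ shifts every eigenvalue of the Hessian up by $\alpha$, turning a merely convex function into an $\alpha$-strongly convex one:

```python
import numpy as np

# f(x) = ||Ax - b||^2 with a rank-deficient A is convex but not strongly
# convex: its Hessian 2 A^T A has a zero eigenvalue (a flat direction)
A = np.array([[1.0, 1.0],
              [2.0, 2.0]])         # rank 1
alpha = 0.5
H = 2 * A.T @ A                    # Hessian of ||Ax - b||^2
H_reg = H + alpha * np.eye(2)      # Hessian after adding (alpha/2) ||x||^2
print(np.linalg.eigvalsh(H))       # smallest eigenvalue is 0
print(np.linalg.eigvalsh(H_reg))   # smallest eigenvalue is alpha
```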

**Theorem 7** If is strongly convex, then , where is the convex set over which we optimize.

*Proof:* For a strongly convex function,

Since the RHS is a quadratic function of $y$, it can easily be seen that it attains its minimum at $y = x - \frac{1}{\alpha}\nabla f(x)$. Substituting this in (1), we obtain

Let denote . As , .

It is very difficult to solve general non-linear optimization problems, and there exist no universal techniques. This is intuitive, since most difficult math problems can be cast as optimization problems. For example, the simple problem of finding a root of a non-linear function $g$ can be cast as the optimization problem $\min_x \|g(x)\|^2$. Hence we focus on a sub-class of optimization problems that can be solved efficiently.

Linear optimization problems (linear objective function and polyhedral constraint set) can be solved in a very efficient manner. This class of problems can be solved with even a million variables very efficiently and commercial software exist to solve these large problems. On the other hand, solving general non-linear optimization problems is extremely difficult (almost impossible). While solving linear problems is important and useful, the linear class does not cover all interesting optimization problems (for example eigenvalue maximisation). Convex optimization problems are an important class (subsumes linear and contains a subset of non-linear problems) that are interesting, useful and that can be solved efficiently.

In this course, we will look at algorithms for convex optimization problems. In particular, the focus will be on large dimension problems and several algorithms along with their convergence rates will be presented.

There is no standard textbook for this course. We will be referring to the following text books

- *Introductory lectures on convex optimization: A basic course*, Yurii Nesterov
- *Convex optimization*, Stephen Boyd
- *Lectures on modern convex optimization*, Aharon Ben-Tal and Arkadi Nemirovski
- *Theory of convex optimization for machine learning*, Sebastien Bubeck
- *Convex optimization theory*, Dimitri P. Bertsekas
- *Nonlinear programming*, Dimitri P. Bertsekas
- *Linear and nonlinear programming*, David G. Luenberger and Yinyu Ye
- *Numerical optimization*, Jorge Nocedal and Stephen J. Wright
- *Problem complexity and method efficiency in optimization*, A. Nemirovsky and D. B. Yudin
- *A course in convexity*, Alexander Barvinok

**1. Comparing various algorithms**

In the first part (and most) of the course, we will be looking at algorithms that do not exploit the structure of the objective function. In this regard, we assume an oracle that provides the required information, and we have no other additional information. In this course we consider the following oracles:

- Zeroth-order oracle: input , output .
- First-order oracle: input , output
- Second-order oracle: input , output

We can now compare different algorithms over a class of functions with different oracles. The complexity of an algorithm will be measured in terms of the number of calls to the oracle plus the computation needed on these oracle outputs to compute the optimum. However, we will not count the complexity of the oracle itself, *i.e.,* the complexity of computing will be neglected. In most of the problems we consider, the optimum is real valued, and hence it does not make sense (or it would take infinite complexity) to ask for the exact optimal value. Instead, we will look for solutions that are close to the optimal value, *i.e.,* we are interested in $\bar{x}$ such that $f(\bar{x}) - f^* \le \epsilon$.

**2. Zeroth-order oracle**

Let us first consider the complexity of solving the following problem

when the function belongs to the class of Lipschitz continuous functions with respect to the $\ell_\infty$ norm. We require an $\epsilon$-accurate solution, *i.e.*, if $\bar{x}$ is the declared solution, then $f(\bar{x}) - \min f \le \epsilon$. A function is Lipschitz continuous with constant $L$ if

Suppose we consider a zeroth-order oracle that provides only the value of the function; what can we say about the computational complexity of this class of functions? We can provide an upper bound on the complexity by exhibiting an algorithm. The simplest algorithm is the following grid search: divide each dimension into $p$ parts and search over the resulting lattice. It can easily be seen that $p^n$ calls to the oracle are required.

**Lemma 1** If $p \ge L/(2\epsilon)$, the grid search algorithm provides an $\epsilon$-accurate solution.

*Proof:* The optimal point belongs to some small hypercube in the grid. Let be some vertex of the hypercube. Then by Lipschitz continuity of ,

Hence, choosing $p = \lceil L/(2\epsilon) \rceil$, we obtain the required accuracy.
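The grid search itself fits in a few lines (a sketch; the 2-D test function, the use of cell centers, and $L = 1$ are assumptions for illustration):

```python
import numpy as np

def grid_search(f, n_dims, p):
    """Minimize f over a uniform lattice of cell centers in [0, 1]^n_dims:
    exactly p ** n_dims zeroth-order oracle calls."""
    centers = (np.arange(p) + 0.5) / p          # spacing 1/p
    best_x, best_val = None, np.inf
    for idx in np.ndindex(*(p,) * n_dims):
        x = np.array([centers[i] for i in idx])
        v = f(x)                                 # one oracle call
        if v < best_val:
            best_x, best_val = x, v
    return best_x, best_val

# f is 1-Lipschitz w.r.t. the sup norm, with minimum 0 at (0.3, 0.7)
f = lambda x: np.max(np.abs(x - np.array([0.3, 0.7])))
x_hat, v_hat = grid_search(f, 2, 10)
print(v_hat)    # at most L / (2p) = 0.05 away from the optimum
```

The exponential cost in the dimension is visible in the `p ** n_dims` lattice, which is exactly the point of the lower bound that follows.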

So this algorithm calls the zeroth-order oracle $O((L/\epsilon)^n)$ times to obtain an $\epsilon$-accurate solution. Is this optimal (in an order sense)? To prove that it is, we build a worst case function and show that, whatever the algorithm, an $\epsilon$-accurate point cannot be found with fewer oracle calls (in an order sense).

**Lemma 2** With the zeroth-order oracle, no deterministic algorithm can perform better (in an order sense).

*Proof:* In this class, every algorithm can depend only on the values of the function that the oracle provides. Suppose you have an algorithm and you feed it the identically zero function. The algorithm will query a sequence of points, after which it outputs a value claimed to be $\epsilon$-accurate. Let us assume that the number of queried points is smaller than the size of the lattice above (otherwise we are done). Then there exists one point such that no query falls in a small ball around it (the ball is with respect to the $\ell_\infty$ norm). Now consider the function

This is a Lipschitz function with a strictly negative optimal value, and it is non-zero only on the ball around that unqueried point. It is easy to see that the algorithm, which receives only zero values, can never get within $\epsilon$ of the optimal value of this function.

This result shows that the complexity of optimization with zeroth-order oracle for Lipschitz functions is exponential in the dimension. In the next class, we will look at Lipschitz functions in more detail.

**Reference**: *Problem complexity and method efficiency in optimization*, A. Nemirovsky and D. B. Yudin