
Till now, we were looking at the gradient descent algorithm and its performance on different classes of functions. When GD is used on the class of $\beta$-smooth convex functions, the convergence rate is $O(1/t)$. The convergence rate equals $O(\exp(-t/Q))$, where $Q = \beta/\alpha$ is the condition number, when used on $\alpha$-strongly convex and $\beta$-smooth convex functions. Are these the optimal rates? Are there algorithms that can improve upon these convergence rates? For example, are there first order algorithms that can achieve a convergence rate of $O(1/t^2)$ when used on the $\beta$-smooth functions class? In this lecture, we will come up with explicit functions that provide upper bounds on the rate of convergence, i.e., these functions exhibit the worst case convergence rate. We first look at the class of smooth functions. The following proof is adapted from Nemirovski's textbook. For simplicity, we make the following assumptions on the oracle. We will later see how to relax these conditions.

- First Order Oracle: Only first order information (function values and gradients) may be used.
- $x_1 = 0$ and $x_{s+1} \in \mathrm{Span}(\nabla f(x_1), \ldots, \nabla f(x_s))$.

**Theorem 1** Let $t \le \frac{n-1}{2}$ and $\beta > 0$. There exists a $\beta$-smooth convex function $f$ such that for any black-box procedure satisfying the conditions defined above,

$$\min_{1 \le s \le t} f(x_s) - f(x^*) \ge \frac{3\beta}{32} \frac{\|x_1 - x^*\|^2}{(t+1)^2}.$$

*Proof:* For $k \le n$, let $A_k \in \mathbb{R}^{n \times n}$ be the symmetric and tridiagonal matrix defined by

$$(A_k)_{i,j} = \begin{cases} 2, & i = j,\ i \le k, \\ -1, & j \in \{i-1, i+1\},\ i \le k,\ j \ne k+1, \\ 0, & \text{otherwise}. \end{cases}$$

We will first show that $0 \preceq A_k \preceq 4 I_n$. The first inequality can be verified as follows.

$$x^\top A_k x = 2 \sum_{i=1}^{k} x_i^2 - 2 \sum_{i=1}^{k-1} x_i x_{i+1} = x_1^2 + x_k^2 + \sum_{i=1}^{k-1} (x_i - x_{i+1})^2 \ge 0.$$

Similarly, using $(x_i - x_{i+1})^2 \le 2x_i^2 + 2x_{i+1}^2$, it can be easily shown that $A_k \preceq 4 I_n$. Observe that from our assumption $t \le \frac{n-1}{2}$ we have $2t+1 \le n$, and hence $A_{2t+1}$ is well defined. We consider now the following $\beta$-smooth convex function:

$$f(x) = \frac{\beta}{8} x^\top A_{2t+1} x - \frac{\beta}{4} x^\top e_1.$$
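As a quick numerical sanity check (not part of the notes; the helper name below is ours), the bounds $0 \preceq A_k \preceq 4 I_n$ can be tested on random vectors using the quadratic-form identity above:

```python
import random

def quad_form_A(x, k):
    """x^T A_k x for the tridiagonal A_k of the proof:
    2 * sum_{i<=k} x_i^2 - 2 * sum_{i<k} x_i x_{i+1} (1-indexed)."""
    s = 2.0 * sum(x[i] ** 2 for i in range(k))
    s -= 2.0 * sum(x[i] * x[i + 1] for i in range(k - 1))
    return s

random.seed(0)
n, k = 11, 7
for _ in range(1000):
    x = [random.uniform(-1.0, 1.0) for _ in range(n)]
    q = quad_form_A(x, k)
    norm2 = sum(v * v for v in x)
    # 0 <= x^T A_k x <= 4 ||x||^2 (with a little floating-point slack)
    assert -1e-12 <= q <= 4.0 * norm2 + 1e-12
```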

**Note:** Here $t$ is not the running index of the algorithm. First we fix a $t$ and we define the function $f$ for the given $t$. The running index in the proof is $s$. The function $f$ can be easily verified to be $\beta$-smooth as follows.

$$\nabla^2 f(x) = \frac{\beta}{4} A_{2t+1} \preceq \beta I_n.$$

Let us now look at the possible iterates generated by any algorithm that satisfies our assumptions. As $x_1 = 0$ and $\nabla f(x) = \frac{\beta}{4} A_{2t+1} x - \frac{\beta}{4} e_1$, we can represent $\nabla f(x_1)$ and $x_2$ as follows,

$$\nabla f(x_1) = -\frac{\beta}{4} e_1, \qquad x_2 \in \mathrm{Span}(e_1),$$

where $e_1, \ldots, e_n$ is the standard basis ($e_i$ has a $1$ only in the $i$-th coordinate). We see that $x_2$ has a non-zero value only in the first coordinate, i.e., $x_2(i) = 0$ for $i \ge 2$. Similarly,

$$\nabla f(x_2) = \frac{\beta}{4} A_{2t+1} x_2 - \frac{\beta}{4} e_1 \in \mathrm{Span}(e_1, e_2),$$

from which we see that $x_3 \in \mathrm{Span}(e_1, e_2)$. From the way $A_{2t+1}$ is defined (it is tridiagonal), it is easy to verify by induction that $x_s$ must lie in the linear span of $e_1, \ldots, e_{s-1}$. In particular for $s \le t$ we necessarily have $x_s(i) = 0$ for $i = s, \ldots, n$, which implies

$$x_s^\top A_{2t+1} x_s = x_s^\top A_s x_s.$$
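The span argument can be watched in action with a small simulation (a sketch, not from the notes; gradient descent with step size $1/\beta$ stands in for an arbitrary method satisfying the assumptions): starting from $x_1 = 0$, the $s$-th iterate touches at most the first $s-1$ coordinates.

```python
def grad_f(x, k, beta):
    """Gradient of f(x) = (beta/8) x^T A_k x - (beta/4) x^T e_1,
    i.e. (beta/4) (A_k x - e_1), using the tridiagonal structure of A_k."""
    n = len(x)
    g = [0.0] * n
    for i in range(k):
        v = 2.0 * x[i]
        if i > 0:
            v -= x[i - 1]
        if i + 1 < k:
            v -= x[i + 1]
        g[i] = (beta / 4.0) * v
    g[0] -= beta / 4.0
    return g

beta, t = 1.0, 5
n = 2 * t + 1
x = [0.0] * n                                     # x_1 = 0
for s in range(1, t + 1):
    nonzeros = sum(1 for v in x if v != 0.0)
    assert nonzeros <= s - 1                      # x_s in Span(e_1,...,e_{s-1})
    g = grad_f(x, n, beta)
    x = [xi - gi / beta for xi, gi in zip(x, g)]  # x_{s+1} = x_s - (1/beta) grad
```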

In other words, if we denote

$$f_k(x) = \frac{\beta}{8} x^\top A_k x - \frac{\beta}{4} x^\top e_1, \qquad f_k^* = \min_x f_k(x),$$

then we just proved that

$$f(x_s) - f^* \stackrel{(a)}{=} f_s(x_s) - f_{2t+1}^* \stackrel{(b)}{\ge} f_s^* - f_{2t+1}^*,$$

where $(a)$ follows since $x_s(i) = 0$ for $i \ge s$, and from the definition of $A_s$ and $A_{2t+1}$ we can see $x_s^\top A_{2t+1} x_s = x_s^\top A_s x_s$, so that $f(x_s) = f_s(x_s)$. The inequality $(b)$ follows since $f_s(x_s) \ge \min_x f_s(x) = f_s^*$. We now compute the minimizer $x_k^*$ of $f_k$, its norm, and the corresponding function value $f_k^*$. The optimal $x_k^*$ can be found by solving $\nabla f_k(x) = 0$, i.e., $\frac{\beta}{4} A_k x - \frac{\beta}{4} e_1 = 0$, i.e., $A_k x = e_1$. So, the point $x_k^*$ is the unique solution in the span of $e_1, \ldots, e_k$ of $A_k x = e_1$. It is easy to verify that it is defined by $x_k^*(i) = 1 - \frac{i}{k+1}$ for $i = 1, \ldots, k$ (and $x_k^*(i) = 0$ for $i > k$). Thus we immediately have:

$$f_k^* = \frac{\beta}{8} (x_k^*)^\top A_k x_k^* - \frac{\beta}{4} (x_k^*)^\top e_1 = -\frac{\beta}{8} (x_k^*)^\top e_1 = -\frac{\beta}{8} \left(1 - \frac{1}{k+1}\right).$$
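The claimed minimizer can be checked numerically (a sketch; the helper names are ours): $x_k^*(i) = 1 - i/(k+1)$ indeed solves $A_k x = e_1$, and the resulting value matches $-\frac{\beta}{8}\left(1 - \frac{1}{k+1}\right)$.

```python
def A_matvec(x, k):
    """y = A_k x for the tridiagonal A_k (rows/columns beyond k are zero)."""
    y = [0.0] * len(x)
    for i in range(k):
        y[i] = 2.0 * x[i]
        if i > 0:
            y[i] -= x[i - 1]
        if i + 1 < k:
            y[i] -= x[i + 1]
    return y

def x_star(k, n):
    """Claimed minimizer: x_k^*(i) = 1 - i/(k+1) for i = 1..k (1-indexed)."""
    return [1.0 - (i + 1) / (k + 1) for i in range(k)] + [0.0] * (n - k)

def f_val(x, k, beta):
    """f_k(x) = (beta/8) x^T A_k x - (beta/4) x^T e_1."""
    xAx = sum(xi * yi for xi, yi in zip(x, A_matvec(x, k)))
    return (beta / 8.0) * xAx - (beta / 4.0) * x[0]

n, beta = 15, 2.0
for k in range(1, n + 1):
    xs = x_star(k, n)
    Ax = A_matvec(xs, k)
    assert abs(Ax[0] - 1.0) < 1e-12              # (A_k x*)_1 = 1
    assert all(abs(v) < 1e-12 for v in Ax[1:])   # remaining coordinates vanish
    expected = -(beta / 8.0) * (1.0 - 1.0 / (k + 1))
    assert abs(f_val(xs, k, beta) - expected) < 1e-12
```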

We first observe that $f_k^*$ decreases with $k$. Hence $f_s^* \ge f_t^*$ for $s \le t$. So we have

$$f(x_s) - f^* \ge f_t^* - f_{2t+1}^* = \frac{\beta}{8} \left(\frac{1}{t+1} - \frac{1}{2t+2}\right) = \frac{\beta}{16(t+1)}.$$

Furthermore note that

$$\|x_k^*\|^2 = \sum_{i=1}^{k} \left(1 - \frac{i}{k+1}\right)^2 = \sum_{i=1}^{k} \left(\frac{i}{k+1}\right)^2 \le \frac{k+1}{3}.$$

Thus one obtains, using $x_1 = 0$ and $x^* = x_{2t+1}^*$:

$$f(x_s) - f^* \ge \frac{\beta}{16(t+1)} \ge \frac{3\beta}{32} \frac{\|x_{2t+1}^*\|^2}{(t+1)^2} = \frac{3\beta}{32} \frac{\|x_1 - x^*\|^2}{(t+1)^2},$$

which concludes the proof.
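The key inequality chain of the proof can also be verified numerically (a sketch with our own helper names, not from the notes): the gap $f_t^* - f_{2t+1}^*$ equals $\frac{\beta}{16(t+1)}$ and dominates $\frac{3\beta}{32}\|x_{2t+1}^*\|^2/(t+1)^2$.

```python
def lower_bound_gap(beta, t):
    """f_t^* - f_{2t+1}^*, with f_k^* = -(beta/8)(1 - 1/(k+1))."""
    f_star = lambda k: -(beta / 8.0) * (1.0 - 1.0 / (k + 1))
    return f_star(t) - f_star(2 * t + 1)

def rhs_bound(beta, t):
    """(3 beta / 32) * ||x_{2t+1}^*||^2 / (t+1)^2,
    with ||x_k^*||^2 = sum_{i=1}^k (i/(k+1))^2."""
    k = 2 * t + 1
    norm2 = sum((i / (k + 1)) ** 2 for i in range(1, k + 1))
    return (3.0 * beta / 32.0) * norm2 / (t + 1) ** 2

beta = 1.0
for t in range(1, 50):
    gap = lower_bound_gap(beta, t)
    # closed form beta / (16 (t+1)), and the final inequality of the proof
    assert abs(gap - beta / (16.0 * (t + 1))) < 1e-12
    assert gap >= rhs_bound(beta, t)
```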

We make the following observations:

- From Theorem 1, we observe that the worst case complexity is achieved by a quadratic function, and no first order algorithm can provide a convergence rate better than $O(1/t^2)$.
- From the above theorem and the convergence rate of the GDA we have the following result: Let $N(\epsilon)$ be the number of iterations required to achieve an error of less than $\epsilon$. Then

$$\Omega\left(\sqrt{\frac{\beta \|x_1 - x^*\|^2}{\epsilon}}\right) \le N(\epsilon) \le O\left(\frac{\beta \|x_1 - x^*\|^2}{\epsilon}\right).$$
First we observe that GDA does not match the lower bound on complexity. Also, the upper bound is the square (up to a constant) of the lower bound. However, for small $\epsilon$, the lower bound is not tight. In the next class, we will look at a technique that provides an upper bound matching the lower bound.
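The gap between the two complexity bounds can be made concrete with a toy calculation (a sketch; constants are ignored and the function names are ours): with $R^2 = \|x_1 - x^*\|^2$, the GDA-style upper bound $\beta R^2/\epsilon$ is exactly the square of the lower bound $\sqrt{\beta R^2/\epsilon}$.

```python
import math

def lower_iters(beta, R2, eps):
    """Lower bound (up to constants): sqrt(beta * R^2 / eps) iterations."""
    return math.sqrt(beta * R2 / eps)

def upper_iters(beta, R2, eps):
    """GDA upper bound (up to constants): beta * R^2 / eps iterations."""
    return beta * R2 / eps

beta, R2 = 1.0, 1.0
for eps in [1e-1, 1e-2, 1e-4, 1e-6]:
    lo, up = lower_iters(beta, R2, eps), upper_iters(beta, R2, eps)
    assert abs(up - lo ** 2) < 1e-6 * up  # upper bound = (lower bound)^2
    assert up > lo                        # GDA does not match the lower bound
```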

- A similar kind of result can be obtained for strongly convex functions. We will state the result here without proof.
**Theorem 2** Let $Q = \frac{\beta}{\alpha} > 1$. There exists a $\beta$-smooth and $\alpha$-strongly convex function $f$ with condition number $Q$ such that for any $t \ge 1$ one has

$$f(x_t) - f(x^*) \ge \frac{\alpha}{2} \left(\frac{\sqrt{Q} - 1}{\sqrt{Q} + 1}\right)^{2(t-1)} \|x_1 - x^*\|^2.$$

We know for small values of $x$: $1 - x \approx \exp(-x)$. This means for large values of $Q$:

$$\left(\frac{\sqrt{Q} - 1}{\sqrt{Q} + 1}\right)^{2(t-1)} = \left(1 - \frac{2}{\sqrt{Q} + 1}\right)^{2(t-1)} \approx \exp\left(-\frac{4(t-1)}{\sqrt{Q}}\right).$$

We observe that this lower bound and the convergence rate of the GDA differ by a factor of $\sqrt{Q}$ in the denominator of the exponent.
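The large-$Q$ approximation above is easy to check numerically (our own script; the function names are ours): for a large condition number the contraction factor is within a fraction of a percent of $\exp(-4(t-1)/\sqrt{Q})$.

```python
import math

def contraction(Q, t):
    """((sqrt(Q) - 1) / (sqrt(Q) + 1))^(2(t-1)), as in Theorem 2."""
    r = (math.sqrt(Q) - 1.0) / (math.sqrt(Q) + 1.0)
    return r ** (2 * (t - 1))

def approx(Q, t):
    """exp(-4 (t-1) / sqrt(Q)), the large-Q approximation."""
    return math.exp(-4.0 * (t - 1) / math.sqrt(Q))

Q, t = 10_000.0, 50
exact, est = contraction(Q, t), approx(Q, t)
# for large Q the two expressions agree to within one percent
assert abs(exact / est - 1.0) < 0.01
```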