You are currently browsing the tag archive for the ‘complexity’ tag.

Till now, we were looking at the gradient descent algorithm and its performance on different class of functions. When GD is used on the class of {\beta}-smooth convex functions, the convergence rate is {\Theta(1/k)}. The convergence rate equals {\Theta(\exp(-k/Q))}, when used on {\alpha}-strong and {\beta}-smooth convex functions. Are these the optimal rates? Are there algorithms that can improve upon these convergence rates? For example, are there first order algorithms that can achieve a convergence rate of {\Theta(1/k^4)}, when used on the {\beta}-smooth functions class? In this lecture, we will come up with explicit functions that provide upperbounds on the rate of convergence, \ie, these functions exhibit the worst case convergence rate. We first look at the class of {\beta} smooth functions. The following proof is adapted from Nemirovski’s textbook. For simplicity, we make the following assumptions on the oracle. We will later see as to how to relax these conditions.

  • First Order Oracle: Only first order information should be used.
  • {x_0=0} and {x_k \in} Span{\lbrace \nabla f(x_1),.......,\nabla f(x_k-1)\rbrace}

Theorem 1 Let {t \leq (n-1)/2} and {\beta >0}. There exists a {\beta}-smooth convex function {f:{\mathbb R}^n \rightarrow {\mathbb R}} such that for any black-procedure satisfying the conditions defined above,

\displaystyle  \min_{1 \leq s \leq t} f(x_s)-f(x^*) \geq \frac{3\beta}{32}\frac{||x_1-x^*||^2}{{(t+1)}^2} \ \ \ \ \ (1)

Proof: For {k \leq n}, let {A_k \in \mathbb{R}^{n \times n}} be the symmetric and tridiagonal matrix defined by

\displaystyle  (A_k)_{i,j}=\left\{ \begin{array}{rl} 2, & i=j,i\leq k \\ -1, & j \in \lbrace i-1,i+1\rbrace, i \leq k, j \neq k+1\\ 0, & \text{otherwise.} \end{array} \right.

We will first show that { 0 \preceq A_k \preceq 4I_n}. The first inequality can be verified as follows.

\displaystyle  \begin{array}{rcl}  x^T A_k x &=& 2\sum\limits_{i=1}^k x(i)^2 - 2 \sum\limits_{i=1}^{k-1} x(i)x(i+1) \\ &=& x(1)^2 + x(k)^2 + \sum\limits_{i=1}^{k-1} {(x(i)-x(i+1))}^2 \geq 0. \end{array}

Similarly, it can be easily shown that { A_k \preceq 4I_n}. Observe that from our assumption {2t-1<n} and hence {A_{2t+1}} is well defined. We consider now the following {\beta-}smooth convex function:

\displaystyle  f(x)=\frac{\beta}{8}x^T A_{2t+1}x - \frac{\beta}{4}x^Te_1. \ \ \ \ \ (2)

Note: Here {t} is not the running index of the algorithm. First we fix a {t} and we define the function {f(x)} for a given {t}. The running index in the proof is {s}. The function {f(x)} can be easily verified to be {\beta-}smooth as follows.

\displaystyle  \begin{array}{rcl}  \nabla^2f(x) &=& \frac{\beta}{4}A_{2t+1}. \\ ||\nabla^2f(x)|| &\leq& \frac{\beta}{4}||A_{2t+1}|| \\ &\leq& \beta \; \; \; \left(\text{since} A_k \preceq 4I_n \right). \end{array}

Let us now look at the possible iterates generated by any algorithm that satisfies out assumptions. As {x_0=0} and {\nabla f(x)= \frac{\beta}{4}A_{2t+1}x-\frac{\beta}{4}e_1}, we can represent {x_1}, {x_2}, {\hdots}, as follows,

\displaystyle  \begin{array}{rcl}  x_1&=& x_0-\frac{\beta}{4}e_1, \end{array}

where {e_i} is the standard basis (it has {1} only in the {i}-th coordinate). We see that {x_1} has non-zero value only in the first coordinate, \ie {x_1 =\text{span}(e_1)}. Similarly,

\displaystyle  \begin{array}{rcl}  x_2&=& x_1- \left(\nabla f(x_1)\right) \\ &=& x1-\left[\left(\frac{\beta}{4}\left(A_{2t+1}\right)\left(x_0-\frac{\beta}{4}e_1\right)\right)-\frac{\beta}{4}e_1\right], \end{array}

from which we see that {x_2 \in \text{span}(e_1, e_2)}. From the way, {A_k} is defined, it is easy to verify that {x_s} must lie in the linear span of {e_1, e_2, ...,e_{s}}. In particular for {s\leq t} we necessarily have {x_s(i)=0} for {i=s,...,n}, which implies

\displaystyle  x_s^TA_{2t+1}x_s=x_s^TA_{s}x_s. \ \ \ \ \ (3)

In other words, if we denote

\displaystyle  f_k(x)=\frac{\beta}{8}x^TA_kx - \frac{\beta}{4}x^Te_1, \ \ \ \ \ (4)

then we just proved that

\displaystyle  \begin{array}{rcl}  f(x_s)-f^*&\stackrel{a}{=}&f_s(x_s)-f^*_{2t+1}, \\ &\stackrel{b}{\geq}& f_s^* - f^*_{2t+1}. \end{array}

where {(a)} follows since {f(x_s)=f_s(x_s)} and from the definition of {f(x)} and {f_k(x)} we can see, {f(x)=f_{2t+1}(x)}. The inequality {(b)} follows since {f_s^*=\min_x{f_s}(x)}. We now compute the minimizer {x_k^*} of {f_k}, its norm, and the corresponding function value {f_k^*}. The optimal {x_k^*} can be found by setting solving {\nabla f_k(x)=0}. i.e., {A_kx-e_1=0}. So, the point {x_k^*} is the unique solution in the span of {e_1,.....,e_k} of {A_kx=e_1}. It is easy to verify that it is defined by {x_k^*(i)=1-\frac{i}{k+1}} for {i=1,....,k}. Thus we immediately have:

\displaystyle  \begin{array}{rcl}  f_k^*&=& \frac{\beta}{8}{(x_k^*)}^TA_kx_k^* - \frac{\beta}{4}(x_k^*)^T e_1 \\ &=& -\frac{\beta}{8}(x_k^*)^Te_1 \\ &=& -\frac{\beta}{8}\left(1 - \frac{1}{k+1}\right). \end{array}

We first observe that {f_k^*} decreases with {k}. Hence {f_s^* >f_t^*}. So we have

\displaystyle f(x_s)-f^* > f_t^* -f^*_{2t+1}.

Furthermore note that

\displaystyle  \begin{array}{rcl}  ||x_k^*||^2&=& \sum\limits_{i=1}^{k}\left(1-\frac{i}{k+1}\right)^2 \\ &=& \sum\limits_{i=1}^k \left(\frac{i}{k+1}\right) \\ &\leq& \frac{k+1}{3}. \end{array}

Thus one obtains:

\displaystyle  \begin{array}{rcl}  f_t^*-f_{2t+1}^*&=& \frac{\beta}{8}\left(\frac{1}{t+1} -\frac{1}{2t+2}\right) \\ &=& \frac{3\beta}{32}\frac{||x^*_{2t+1}||^2}{(t+1)^2}, \end{array}

which concludes the proof. \Box

We make the following observations:

  • From Theorem 1, we observe that the worst case complexity is achieved by a quadratic function and no first order algorithm can provide a convergence rate better than {\Theta(1/k^2)}.
  • From the above theorem and the convergence rate of the GDA we have the following result: Let {N} be the number of iterations required to achieve an error of less than {\epsilon}. Then

    \displaystyle  \min\left\{\frac{n-1}{2},\sqrt{\frac{3\beta\|x_1-x^*\|^2}{32 \epsilon}}\right\}\leq N \leq \frac{\beta\|x_1-x^*\|^2}{ \epsilon}.

    First we observe that GDA does not match the lower bound on complexity. Also, for {n^2 \succeq \frac{\beta\|x_1-x^*\|^2}{ \epsilon}}, we have that the upperbound is the square(up to a constant) of the lower bound. However, for small {n}, the lower bound is not tight. In the next class, we will look at a technique that would provide an upperbound that would match with the lower bound.

  • A similar kind of result can be obtained for strongly convex functions. We will state the result here without proof.
    Theorem 2 Let Q={\beta/\alpha >1}. There exists a {\beta}-smooth and {\alpha}-strongly convex function such {f} such that for any {t \geq 1} one has

    \displaystyle  f(x_t)-f(x^*) \geq \frac{\alpha}{2} \left(\frac{\sqrt{Q}-1}{\sqrt{Q}+1}\right)^{2(t-1)} \|x_1-x^*\|^2 \ \ \ \ \ (5)

    We know for small values of x; {e^x \approx 1+x}. This means for large values of Q:

    \displaystyle  \left(\frac{\sqrt{Q}-1}{\sqrt{Q}+1}\right)^{2(t-1)}\approx \exp \left(-\frac{4(t-1)}{\sqrt{Q}}\right) \ \ \ \ \ (6)

    We observe that the convergence rate and the GDA differ by a factor {\sqrt{Q}} in the denominator of the exponent.