You are currently browsing the category archive for the ‘Capacity’ category.

Theorem 1 Let ${f:{\mathbb R}^n \rightarrow {\mathbb R}}$ be a convex and a ${\beta}$-smooth function. Then, for any ${x,y \in {\mathbb R}^n}$,

$\displaystyle f(x) - f(y) \leq \nabla f(x)^T(x-y) - \frac{1}{2 \beta} \| \nabla f(x) - \nabla f(y)\| ^2. \ \ \ \ \ (1)$

Proof: Let ${z=y-\frac{1}{\beta} \left( \nabla f(y) - \nabla f(x) \right)}$. Then,

$\displaystyle \begin{array}{rcl} f(x) - f(y) &= ~ & f(x) - f(z) + f(z) - f(y),\\ & \stackrel{ (a)}{\leq} ~ & \nabla f(x)^T(x-z) + \nabla f(y)^T(z-y) + \frac{\beta}{2}\| z-y \|^2,\\ &= ~& \nabla f(x)^T (x-y) + \left( \nabla f(x) - \nabla f(y) \right) ^T (y-z) + \frac{\beta}{2} \| z-y \|^2,\\ &\stackrel{(b)}{=} ~& \nabla f(x)^T(x-y) - \frac{1}{2 \beta} \|\nabla f(x) - \nabla f(y) \|^2, \end{array}$

where ${(a)}$ follows by using gradient inequality for the ${\left( f(x) - f(z) \right)}$ term since ${f}$ is a convex function, and using the ${\beta}$-smoothess of ${f}$ for the ${\left( f(z) - f(y) \right)}$ term, and ${(b)}$ follows by substituting ${z=y-\frac{1}{\beta} \left( \nabla f(y) - \nabla f(x) \right)}$. $\Box$

The above theorem is from the notes of Dr. Sébastien Bubeck.

# 1. Unconstrained Minimization

We now look at the following optimization problem.

$\displaystyle p^*=\min\limits_{x \in {\mathbb R}^n} f(x). \ \ \ \ \ (2)$

Our objective is to come up with a sequence of vectors ${x_1,x_2, . . ., x_n}$ such that ${\lim\limits_{n \rightarrow \infty}f(x_n)=p^*}$. We refer to such a sequence as a minimizing sequence. We next define a tolerance limit to the optimization problem which basically limits the allowed gap between the optimal value and the function value at the point returned by the algorithm. By this we mean that the algorithm stops at ${x_n}$ if

$\displaystyle 0 < f(x_n) - p^* < \epsilon. \ \ \ \ \ (3)$

2. Descent Algorithms

In descent algorithms, we follow the following update rule.

$\displaystyle x_{k+1} = x_k + t_k \Delta x_k, \ \ \ \ \ (4)$

where ${t_k > 0}$ is the step size and ${\Delta x_k}$ is the descent direction. Here, ${||\Delta x_k|| \ne 1}$ and hence it is not a “direction” in the strict sense of the word but a misnomer. By descent, we mean the following,

$\displaystyle f(x_{k+1}) < f(x_k).\ \ \ \ \ (5)$

If ${f}$ is a convex function, by gradient inequality,

$\displaystyle f(x_{k+1}) \geq f(x_k) + \nabla f(x_k)^T(x_{k+1}-x_k).\ \ \ \ \ (6)$

From (4)

$\displaystyle f(x_{k+1}) \geq f(x_k) + \nabla f(x_k)^Tt_k \Delta x_k.\ \ \ \ \ (7)$

From (5) and (7), we have the following as a necessary condition for every gradient descent algorithm.

$\displaystyle \nabla f(x_k)^T \Delta x_k < 0. \ \ \ \ \ (8)$

Given below is the basic structure of a descent algorithm.

1. Start at an initial point ${x_0}$.
2. Repeat the following till the “Stopping Criterion” is met.
1. Descent direction : Choose a descent direction ${\Delta x_k}$.
2. Line search : Choose a step size ${t_k > 0}$.
3. Update : ${x_{k+1} = x_k + t_k\Delta x_k}$.

The stopping criterion does not directly follow from the tolerance limit as we do not know the optimal value ${p^*}$. As we will see later, the choice of stopping criterion is of prime importance to the performance of a descent algorithm and it is not easy to come up with an appropriate stopping criterion. For now, we assume that “Exact Line Search” is used. By this, we mean that the step size ${t_k}$ is given by,

$\displaystyle t_k = {\text{argmin}}_t~f(x_k + t \Delta x_k). \ \ \ \ \ (9)$

We will later see that implementing “Exact Line Search” is not practical and we will come up with simpler ways to find the step size ${t_k}$ without any significant loss in performance. We are now left with the choice of the descent direction ${\Delta x_k}$ which should follow (8). All the descent algorithms differ on the choice of this descent direction. We now look at a basic descent algorithms,

The descent direction is chosen as

$\displaystyle \Delta x_k = - \nabla f(x_k).$

Observe that this descent direction clearly satisfies (8). Next, we state and prove the following important theorem which shows that GDA works on Lipschitz functions.

Theorem 2 Let ${f}$ be a ${\beta}$-smooth function and ${f^* = \min f(x)>- \infty}$. Then the gradient descent algorithm with a constant step size ${t < \frac{2}{\beta}}$ will converge to a stationary point, i.e., the set ${\left\lbrace x : \nabla f(x) = 0\right\rbrace}$.

Proof: Recall that for gradient descent algorithm,

$\displaystyle x_{k+1} = x_k - t \nabla f(x_k). \ \ \ \ \ (10)$

As ${f}$ is ${\beta}$-smooth, we have

$\displaystyle \begin{array}{rcl} f(x_{k+1}) &\leq~& f(x_k) + \nabla f(x_k)(x_{k+1} - x_k ) + \frac{\beta}{2} \mid \mid x_{k+1} - x_k \mid \mid ^2,\\ & \stackrel{(a)}{=}~ & f(x_k) - t \|\nabla f(x_k) \|^2 + \frac{\beta t^2}{2}\|\nabla f(x_k) \|^2, \end{array}$

where ${(a)}$ follows from (10). Thus, we have

$\displaystyle \begin{array}{rcl} & f(x_{k+1}) \leq f(x_k) - t \left( 1 - \frac{\beta t}{2} \right) \|\nabla f(x_k)\|^2,\\ & \|\nabla f(x_k)\|^2 \stackrel{(a)}{\leq} \frac{f(x_k) - f(x_{k+1})}{t \left(1 - \frac{\beta t}{2} \right)}, \end{array}$

where ${(a)}$ follows since ${t < \frac{2}{\beta}}$. Next, we have

$\displaystyle \begin{array}{rcl} \sum\limits_{k=0}^{N} \|\nabla f(x_k)\|^2 &\leq & \frac{1}{t\left( 1 -\frac{\beta t}{2} \right)} \sum\limits_{k=0}^{N}f(x_k) - f(x_{k+1}),\\ \end{array}$

$\displaystyle = \frac{f(x_0) - f(x_N)}{t \left( 1 - \frac{\beta t}{2} \right)}, \ \ \ \ \ (11)$

$\displaystyle \begin{array}{rcl} &\leq & \frac{f(x_0) - f^*}{t \left( 1 - \frac{\beta t}{2} \right)}, \end{array}$

where ${f^*}$ is the global optimal value. Taking the limit as ${N \rightarrow \infty}$, we have

$\displaystyle \begin{array}{rcl} & \sum\limits_{k=0}^{\infty} \| \nabla f(x_k) \|^2 < \infty. \end{array}$

Hence ${\| \nabla f(x_k) \| \rightarrow 0,}$ and hence ${\nabla f(x_k) \rightarrow 0}$. $\Box$

Note that ${t^* = \frac{1}{\beta}}$ maximizes ${t \left( 1 - \frac{\beta t}{2} \right)}$ term appearing in the denominator of (11) and hence is the optimal step size to be used. In this case we have

$\displaystyle \sum\limits_{k=0}^{N} \|\nabla f(x_k)\|^2 \leq 2(\beta {f(x_0) - f^*}).$

The convergence rates can be obtained as follows: Let ${g_N^* = \min_{k=1...N} \|\nabla f(x_k)\|}$. Then

$\displaystyle (1+N)g_N^2 \leq 2(\beta {f(x_0) - f^*}),$

which implies

$\displaystyle g_N \leq \sqrt{ \frac{2(\beta {f(x_0) - f^*)}}{1+N}}.$

Hence, when ${f}$ is just a ${\beta}$-smooth function (need not be convex), the above theorem shows that gradient descent converges sub-linearly to a stationary point (this need not even be a local minimum). Also, observe that the above bound is on ${g_N}$, i.e, the best gradient till ${N}$. However, the gradient might oscillate while the function value might not. While, Lipschitz condition is used to prove the convergence rate, the algorithm will converge to a stationary point if the level set ${\{x: f(x) \leq f(x_0)\}}$ corresponding to the initial point is bounded.

Recently this talk was given by Dr. Andrew Thangaraj in our reading seminar and I was fascinated by the elegance of the solution. I went through the original paper “On the Shannon Capacity of a Graph” by Lovasz and decided to write up some notes. The technique described by Andrew follows a modern approach that  requires spectral graph theory and I will describe the spectral theory approach in a later post. I will begin with the basic description of the problem and Lovassz’s solution.

Zero error capacity of a noisy type writer channel

Suppose you have a typewriter that is noisy, i.e., the printed letter might turn out to be the pressed letter or the next letter with equal probability. For example, if the “c” key is pressed then the printed letter might be “c” with probability ${0.5}$ or a “d” with probability ${0.5}$. If letter “z” is pressed, then the printed letter is “z” with probability ${0.5}$ or an “a” with probability ${0.5}$. So the zero-error capacity of this type writer channel is defined as the maximum number of symbols (more correctly bits) that can be conveyed per channel use. It is easy to see that if we only use a ${13}$ alternate letter set, then there is no confusion at the output. It can also be shown that something better cannnot be done. So the error free capacity in bits is

$\displaystyle \Theta_E=\log_2(13).$

Alternatively, we can form the following graph: Connect two nodes if there is confusion of using two nodes simultaneously. For the ${26}$-letter type writer channel it is easy to see that the graph equals the cycle ${C_{26}}$. Observe that the maximum independent set of ${C_{26}}$ has ${13}$ nodes. Hence for ${C_{26}}$ ,

$\displaystyle \Theta_E=\log_2(\alpha(C_{26})),$

where ${\alpha(G)}$ is the size of the maximal independent set of a graph ${G}$. So now lets look at the cycle ${C_5}$. This would correspond to a type writer of ${5}$ letters ${\{a,b,c,d,e\}}$. It is easy to see that ${\alpha(C_5)=2}$. Earlier the code was for one channel use. It turns out that combing letters over two channel uses improves the number of error free symbols. So for the ${C_5}$ (And its corresponding typewriter) sending the pairs of letters ${\{aa,bc,ce,db,ed\}}$ leads to an error free code. For ${C_5}$, the error-free capacity equals ${\log_2(5)/2}$ (proved by Lovasz). So we should look at graph product and their independence numbers to obtain the error free capacity.

Definition 1 (Strong Product) For two graphs ${G}$ and ${H}$, their strong product denoted by ${G.H}$ is defined on ${V(G)\times V(H)}$ in which ${(x,y)}$ is adjacent to ${(x',y')}$ iff ${x\sim x'}$ in ${G}$ and ${y\sim y'}$ in ${H}$. (“${\sim}$” denotes adjacency.)

${G^k}$ denotes the strong product of the graph ${G}$ by itself ${k}$ times. The following Figure illustrates a  product of a 5 cycle graph with itself.

The strong product of a 5 cycle graph.