You are currently browsing the tag archive for the ‘Convex sets’ tag.

Before we proceed further, in the above corollary, we observe that  the property of a linear function was used only in a small part of the proof of corollary 8 in last post. However, observe that the proof (for maximum) would follow through if the function ${f(x)}$ satisfies the following properties:

1. ${f(a x) = a f(x)}$, ${a>0}$.
2. sub-additive property: ${f(x+y) \leq f(x)+ f(y)}$

Such functions are called as sub-additive functions. So a sub-additive function which is continuous achieves its maximum on a vertex of a compact convex set. An example of sub-additive function is the norm ${\|x\|}$. Another example is the support function of a set (we will look at this function later).

We will now look at some interesting examples of extreme points and maximization of linear functionals. We will first begin with the following motivational example.

Examples:

Assignment problem (AP): The AP arises in many contexts and is an example of binary optimization. Here we will motivate the problem in the context of wireless and later generalize the notion.

OFDM (orthogonal frequency division multiplexing) is a multi-carrier modulation which is being used in 4G networks. In OFDM,the data is modulated over different tones in such a way that orthogonality can be maintained across tones.

In general, a BS tower serves multiple users. When using OFDM, the BS generally has to decide the allocation of different tones to different users (OFDMA). Assume for now there are ${n}$ frequency tones and there are ${n}$ users. In general the users are distributed in space. Hence each frequency tone will have a different “channel” with respect a user. So let ${h_{ij}}$ denote the channel of tone ${i}$ from the BS to the user ${j}$. A good channel in general indicates a higher data rate on that particular tone. A simple map from the channel quality to the data rate obtained is ${R_{ij} = \log_2(1+h_{ij}^2)}$. So the BS wants to allocate one frequency tone per user and allocate all the tones, such that the sum rate is maximized. For example if ${n=4}$, a sample allocation is given by ${\sigma =\{3,2,4,1\}}$, which implies the first tone is allocated to the third user, the second tone to the second user, the third to the fourth user and the fourth to the first user. Observe that any allocation is just a permutation of the integers ${\{1,2,...,n\}}$. So mathematically the BS should find a permutation ${\sigma}$ such that the allocated sum rate

$\displaystyle \sum_{i=1}^n R_{i \sigma(i)},$

is maximized. Observe that if the BS has check for every permutation of ${1,...,n}$ before it can find the permutation that maximizes the sum rate. Suppose there are ${20}$ tones, then the search space is approximately ${20!\approx 2\times 10^{18}}$ which is huge.

In a little while using the extreme points and Krein-Milman theorem, we will come up with a smarter way of solving this problem in polynomial complexity.

We will now start looking more closely at the vertices (extreme points) of a polyhedron. Most of the material is taken from AB. We will first begin with the following theorem that characterizes the vertices of a polyhedron in terms of linear dependence of the vectors.

Theorem 1 Let ${a_i, i=1..,n}$ be ${n}$ vectors in ${{\mathbb R}^d}$ and ${b_i \in {\mathbb R}}$. Let ${P}$ be a polyhedron defined as

$\displaystyle P=\{x: x\in {\mathbb R}^d, a_i^t x \leq b_i, i = 1...n\}.$

For any point ${u \in P}$ define

$\displaystyle I(u)= \{i; a_i^t u = b_i\}.$

Then ${u}$ is a vertex of ${P}$ if and only if the set of vectors ${\{a_i, i \in I(u)\}}$ linearly span ${{\mathbb R}^d}$.

Proof: Let us first prove that vertex implies linear span. We will prove by contradiction.Let ${u}$ be a vertex of ${P}$. Suppose ${\{a_i, i \in I(u)\}}$ does not span ${{\mathbb R}^d}$. For notation let the cardinality of ${I(u)}$ be ${m=|I(u)|}$ Consider the matrix ${A}$ with ${a_i}$ as the columns, \ie,

$\displaystyle A=[a_1,... ,a_m],$

an ${d \times m}$ matrix. Since ${A}$ does not span the entire ${{\mathbb R}^d}$, the kernel space (or null space) of the matrix ${A}$ is non empty (or the dimension of the kernel is greater than ${0}$.) Hence there exists a non zero ${y \in \ker(A)}$. Since ${y \in \ker(A)}$,

$\displaystyle a_i^t y =0,\quad \forall i \in I(u). \ \ \ \ \ (1)$

We will now prove that ${u}$ is not a vertex. Let ${\epsilon >0}$ be a small number. Define

$\displaystyle X= u + \epsilon y,$

and

$\displaystyle Y= u -\epsilon y.$

Observe that ${u= (X+Y)/2}$ and hence if ${X}$ and ${Y}$ lie in the polyhedron ${P}$, then ${u}$ is not an extreme point (vertex) which is a contradiction. We will now prove that for very small ${\epsilon}$, ${X}$ and ${Y}$ indeed lie in ${P}$. To show that ${X}$ (or ${Y}$) lie in ${P}$ we should show that they satisfy ${a_i^t X \leq b_i}$ for all ${i}$. Let ${i \in I(u)}$. Then

$\displaystyle \begin{array}{rcl} a_i^tX &= a_i^t(u+\epsilon y)\\ &= a_i^tu +\epsilon a_i^t y\\ &\stackrel{(a)}{=} b_i +0 \end{array}$

${(a)}$ follows since ${a_i^t u =b_i}$ for ${i\in I(u)}$ and ${a_i^t y =0}$ since ${y \in \ker(A)}$. Now if ${i \notin I(u)}$, then

$\displaystyle \begin{array}{rcl} a_i^tX &= a_i^t(u+\epsilon y)\\ &=a_i^tu +\epsilon a_i^t y \end{array}$

Observe that ${a_i^t u . Hence if we choose ${\epsilon}$ very very small, then ${a_i^tu + \epsilon a_i^t y < b_i}$. (For example, you can choose ${\epsilon = 0.2*\min_{i=1,...,n} \frac{b_i - a_i^t u}{|a_i^t y|}}$.). Hence ${X}$ satisfies all the inequalities and hence is in the polyhedron. Similarly, you can show that ${Y}$ satisfies all the inequalities and hence in the polyhedron.

We will now prove the converse, \ie, linear span implies vertex. Suppose ${ u = (X+Y)/2}$ where ${X, Y \in P}$. We first have for ${i\in I(u)}$,

$\displaystyle a_i^t u =b_i= \frac{a_i^t X + a_i^t Y}{2}.,$

Since ${X, Y \in P}$, we have ${a_i^tX \leq b_i}$ and ${a_i^TY \leq b_i}$, which is possible only when ${a_i^tX = b_i}$ and ${a_i^TY = b_i}$. Now consider the set of equations for ${i \in I(u)}$.

$\displaystyle a_i^t z = b_i.$

Since ${a_i}$ span the entire ${{\mathbb R}^d}$, it follows that there is a unique solution to the above set of equations. However, we have ${a_i^tX \leq b_i}$ and ${a_i^TY \leq b_i}$ and ${a_i^t u =b_i}$. This implies ${u = X = Y}$ since the solution should be unique. $\Box$

Corollary 2 For any vertex ${u}$, the cardinality of the set ${I(u)}$ should be greater than ${d}$, \ie ${|I(u)| \geq d}$.

Proof: We require at least ${d}$ vectors to span ${{\mathbb R}^d}$. $\Box$

We will now analyze a particular polyhedron that is required to simplify the assignment problem. Before that we require the following definition of a permutation matrix. Suppose ${\sigma}$ is a permutation of ${1,...,n}$. Then the permutation matrix ${X_\sigma}$ is an ${n\times n}$ matrix with the ${(i,j)}$ entry given by

$\displaystyle x_{ij}=\left\{\begin{array}{cc} 1& \text{if } \sigma(i) =j\\ 0& \text{otherwise}. \end{array} \right. \ \ \ \ \ (2)$

So for a given ${n}$ there are ${n!}$ permutation matrices.

Birkhoff polytope:

Birkhoff polytope ${B_n}$ is the set of all doubly stochastic ${n\times n}$ matrices. Mathematically,

$\displaystyle B_n=\left\{\{\xi_{ij}\}_{n\times n}: \sum_{i=1}^n \xi_{ij}=1,\sum_{j=1}^n \xi_{ij}=1, \xi_{ij}\geq 0 , \forall i, j \right\}.$

First observe that ${B_n}$ is entirely defined by linear inequalities and hence a polyhedron. Also observe that ${B_n}$ is a convex set.

Hw 1 Show that ${B_n}$ is a compact set.
Lemma 3 An ${n \times n}$ permutation matrix ${X_\sigma}$ is an extreme point of ${B_n}$.

Proof: Suppose ${X_\sigma = (A+ C)/2}$ where ${A, C \in B_n}$. Consider the ${i}$-th row of all the three matrices. Suppose ${x_{im} =1}$ and all the other entries are zero. So we have

$\displaystyle 2[0, 0, ..,1,....0] = [a_{i1}, ..., a_{in}]+[c_{i1}, ..., c_{in}]$

We have ${\sum_{k=1}^n a_{ik} =1}$, ${\sum_{k=1}^n c_{ik} =1}$ and each term is non-negative. This is possible only when ${a_{im}=c_{im} =1}$ and zero for all the other entries of the ${i}$-th row. This shows that ${X_\sigma= A=C}$, which proves that ${X_\sigma}$ is an extreme point of ${B_n}$. $\Box$

We will now prove the converse of the above statement.

Theorem 4 (Birkhoff- von Neumann Theorem.) The extreme points of ${B_n}$ are precisely the set of permutation matrices.

Proof: We prove the lemma by induction. For ${n=1}$, it is an obvious statement. Let us assume it is true till dimension ${(n-1)}$ and prove it for dimension ${n}$.

Consider the following affine subspace of ${{\mathbb R}^{n^2}}$. The space ${L}$ is the set of all ${n^2}$ real numbers ${\xi_{ij} \in {\mathbb R}}$ such that

$\displaystyle \sum_{i=1}^n \xi_{ij} =1, \ \forall j=1,...,n,$

and

$\displaystyle \sum_{j=1}^n \xi_{ij} =1, \ \forall i=1,...,n.$

Observe that the dimension of the space ${L}$ is ${(n-1)^2}$. (Think why??) Observe that ${B_n\subset L}$ and is defined as the polyhedron

$\displaystyle \xi_{ij} \geq 0, \forall\ i,j$

in ${L}$.

Let ${X}$ be an extreme point of ${B_n}$. So by Theorem 1, an extreme point ${X}$ of ${B_n}$ should satisfy at least ${(n-1)^2}$ inequalities with equalities. Hence ${x_{ij} =0}$ for at least ${(n-1)^2}$ entries of the matrix. We make the following observation.

• Every row cannot have more than or equal to ${2}$ non zero entries: In this case the maximum number of zero entries would be ${n(n-2)}$ (each row can have at maximum ${n-2}$ zero entries and there are ${n}$ rows). This cannot be true, since ${n(n-2) < (n-1)^2}$.

So there is at least one row ${i_o}$ with only one non-zero entry at ${j_o}$ location. Since ${X}$ is also doubly stochastic, we have that ${x_{i_oj_o}=1}$. Hence it also follows that there can be no more non zero entries in the column ${j_o}$. Let ${\tilde{X}}$ be the new matrix obtained after removing the row ${i_o}$ and the column ${j_o}$. Observe that

1. ${\tilde{X}}$ is a doubly stochastic matrix:This is because the original matrix ${X}$ was doubly stochastic and the contribution from row ${i_o}$ and column ${j_o}$ is only ${x_{i_oj_o}}$.
2. ${\tilde{X}}$ is an extreme point of ${B_{n-1}}$: Suppose ${\tilde{X}}$ is not an extreme point of ${B_{n-1}}$, \ie, ${\tilde{X}}$ can be expressed as a sum of ${\tilde{X} = (C+D)/2}$, where ${C\in B_{n-1}}$ and ${D \in B_{n-1}}$. Then it is easy to see that by adding one more row (all zeros except at ${j_o}$) and column (all zeros except at ${i_o}$) to ${C}$ and ${D}$ at positions ${i_o}$ and ${j_o}$ will result in new matrices ${\hat C}$ and ${\hat D}$. Observe that ${X = (\hat{C}+\hat{D})/2}$.

So ${\tilde{X}}$ is an extreme point of ${B_{n-1}}$. So by induction hypothesis, we have that ${\tilde{X}}$ is some ${(n-1)\times (n-1)}$ permutation matrix ${X_\sigma}$. Now augment ${X_\sigma}$ with the ${i_o}$ row and ${j_o}$ column to obtain ${X}$. $\Box$

We will now see how to utilize this theorem to simplify the assignment problem. Recall that the AP was to minimize over all the permutations of ${1,...,n}$.

$\displaystyle \min_{\sigma}\sum_{i=1}^n R_{i \sigma(i)}.$

Observe that the cost function ${\sum_{i=1}^n R_{i \sigma(i)}}$ is equivalent to

$\displaystyle \sum_{i,j} R_{ij} x_{ij},$

where ${X_\sigma = \{x_{ij}\}}$ is the permutation matrix. So the AP equals

$\displaystyle \min_{X_\sigma}\sum_{i,j} R_{ij} x_{ij}.$

Observe that even now, ${x_{ij}}$ is restricted to be either ${0}$ or ${1}$.

Now consider a new problem P2:

$\displaystyle \min\sum_{ij} R_{ij} \xi_{ij},$

over the set

$\displaystyle \begin{array}{rcl} &\sum_{i}\xi_{ij} =1, \forall i\\ &\sum_{j}\xi_{ij} =1, \forall j\\ &\xi_{ij} \geq 0. \end{array}$

Observe that we are now maximizing over the Birkhoff polytope ${B_n}$ and ${\xi_{ij}}$ are real numbers. So we are maximizing a linear function ${\sum_{ij} R_{ij} \xi_{ij}}$ over a convex compact set ${B_n}$. Hence the maximum occurs at a vertex point of ${B_n}$. But by Birkhoff- von Neumann theorem, the vertices of ${B_n}$ are nothing but the permutation matrices. Hence solving P2 is equal to solving AP. Also observe that the new problem P2 is a linear programming problem and is very easy to solve in polynomial complexity.

1. Separation Theorems

We begin with some definitions of halfspaces:

Definition 1 Let ${a \in {\mathbb R}^n}$ and ${b\in {\mathbb R}}$.

1. Open half space : ${H_+ = \{x: a^tx > b\}}$
2. Open half space : ${H_- = \{x: a^tx < b\}}$
3. Closed half space : ${\overline{H_+} = \{x: a^tx \geq b\}}$
4. Closed half space : ${\overline{H_-} = \{x: a^tx \leq b\}}$
Definition 2 Let ${A}$ and ${B}$ be two set in ${{\mathbb R}^n}$ and ${H}$ a hyperplane. The sets ${A}$ and${ B}$ are said to be separated by ${H}$ if ${A}$ lies in ${\overline{H_+}}$ and ${B\subset \overline{H_-}}$. The sets are ${A}$ and${ B}$ are strictly separated by ${H}$ if ${A}$ lies in ${ {H_+}}$ and ${B\subset {H_-}}$. See Figure 1.

Observe that the two sets A and B in the top figure cannot be separated by any hyperplane. Right figure: $A= {x<0}$ and $B={y>1/x, x>0.}$. Observe that while the sets can be separated, they cannot be strictly separated.

Separation theorems deal with sufficient conditions for separation of sets by hyperplane. There are numerous separation theorems with increasing order of complexity. We will look at two separation theorems. In the following discussion the norm we consider is the Euclidean norm. We begin with the following definition of distance to a set

Definition 3 The distance between a point ${x}$ and a set ${A}$ is defined as

$\displaystyle dis(x,A)=\inf\{\|x-y\|; y \in A\}$

We require the following lemma that characterize the distance of a point to a closed set. The proof can be found in standard analysis textbooks.

Lemma 4 Let ${A}$ be a non-empty closed set and ${x}$ be a point. Then there exists a point in ${A}$ such that

$\displaystyle \|x-u\| =dis(x,A).$

Lemma 5 Let ${A}$ and ${B}$ be two non-empty sets in ${{\mathbb R}^n}$. Further, let ${A}$ be closed and ${B}$ compact. Then there exists two points ${a\in A}$ and ${b\in B}$ such that

$\displaystyle \|a-b\| = \inf\{\|x-y\|; x\in A , y \in B \}.$

Proof: Claim1: The function: ${dis(x,A): {\mathbb R}^n \rightarrow [0,\infty) }$ is a continuous function. (Easy to prove. Think why).

Since ${B}$ is a compact set and ${dis(x,A)}$ is a continuous function, it attains a minimum on a point ${b}$ on ${B}$, \ie,

$\displaystyle dis(b,A)= \inf\{dis(x,A), x\in B \}. \ \ \ \ \ (1)$

Also by previous lemma there exists a point ${a }$ in ${A}$ such that

$\displaystyle dis(b,A)=\|b-a\|.$

Let ${x }$be some point in ${A}$ and ${y}$ some point in ${B}$. Then

$\displaystyle \|x-y\| \stackrel{(a)}{\geq} dis(y,A)\stackrel{(b)}{\geq }dis(b,A) = \|b-a\|,$

where ${(a)}$ follows from the definition of distance to the set and ${(b)}$ follows from (1). So for any ${x \in A}$ and ${y\in B}$, we have ${\|x-y\| \geq \|b-a\|}$. Hence

$\displaystyle \inf\{\|x-y\| , x\in A, y \in B\} = \|b-a\|,$

since ${b \in B}$ and ${a \in A}$. $\Box$

We will now look into this distance functions more closely.

Lemma 6 Let ${A}$ be a closed convex set and ${x}$ be a point in ${{\mathbb R}^n}$. Then there exists an unique point ${u \in A}$ such that

$\displaystyle \|x-u\| = dis(x,A) ,$

\ie, the point ${u \in A}$ achieves the distance. Moreover, the inner-product

$\displaystyle \leq 0,$

for any ${y\in A}$.

Proof: Existence of some point ${u}$ in ${A}$ follows from Lemma 4 because ${A}$ is closed. We will have to show that the point is unique and the inner product inequality is satisfied. We will begin with the inequality. Since ${y \in A}$ and ${u \in A}$, the line segment ${[y, u] \in A}$. Since ${\|x-u\|}$ is the minimum distance,

$\displaystyle \|x-u\| \leq \|x - \theta y -(1-\theta)u\|= \|x-u+\theta(u-y) \|.$

So we have

$\displaystyle \|x-u\| \leq \|x-u+\theta(u-y) \|.$

Squaring both sides and after some basic simplification (check this)

$\displaystyle 2\theta +\theta^2\|u-y\|^2\geq 0.$

Observe that this inequality should hold true for all ${\theta}$, especially very small ${\theta}$. Since for very small ${\theta}$, ${\theta^2 << 2\theta}$, it follows that the above expression is non-negative for all ${\theta}$ if and only if ${ \geq 0}$, which proves the inequality.

Suppose there exist one more point ${u_i \in A}$ such that ${\|x-u\| =\|x-u_1\|}$. Then by the above inequality we have

$\displaystyle \leq 0$

and

$\displaystyle \leq 0.$

Adding both these inequalities we obtain

$\displaystyle \|u-u_1\|^2 \leq 0,$

which implies ${u=u_1}$. $\Box$

Geometrically, the inequality means that the angle between the vectors ${x-u}$ and ${y-u}$ is obtuse.

Hw 1 Let ${A}$ be a closed convex set. Then for any ${x \in {\mathbb R}^n}$, from the previous lemma, we have seen that there exists a unique point ${u}$ such which attains the minimum distance. For any ${x \in {\mathbb R}^n}$, call the point ${u\in A}$ that attains the distance, the projection of the point ${x}$ on ${A}$. Then we can define a projection function ${f_p(x):{\mathbb R}^n \rightarrow A}$. Prove that ${f_p(x)}$ is a Lipschitz function.This function is called the projection operator. (Hint: Use the inner product inequality)

We will now look at the first separation theorem:

Theorem 7 Let ${A\subset {\mathbb R}^n}$ and ${B \subset {\mathbb R}^n}$ be non-empty non-intersecting convex sets. Moreover let ${A}$ be a compact set and ${B}$be a closed set. Then there exists an hyperplane that strictly separates both the sets.

Proof: Since ${A}$ is compact and ${B}$ is closed, by Lemma 5, there exists two points ${a \in A}$ and ${b \in B}$ such that ${a }$ is the closest point in ${A}$ of the point ${b}$ and ${b}$ the closest point in ${B}$ of the point ${a}$.

Let ${x}$ be some point in ${A}$. Then from Lemma 6, we have

$\displaystyle \begin{array}{rcl} \leq 0 \end{array}$

which implies

$\displaystyle \begin{array}{rcl} &\leq &\\ &=&-\\ &=&\frac{1}{2}\left(2-2\right)\\ &=&\frac{1}{2}\left( - -+2-\right)\\ &=&\frac{1}{2}\left( \|b\|^2-\|a\|^2 -\|a-b\|^2\right)\\ &<&\frac{1}{2}\left( \|b\|^2-\|a\|^2 \right). \end{array}$

Observe the strict inequality in the last line. This is because ${A\cap B = \emptyset}$ which implies ${\|a-b\| >0}$. Hence we have

$\displaystyle (b-a)^t x < \frac{1}{2}\left( \|b\|^2-\|a\|^2 \right). \ \ \ \ \ (2)$

Now let ${y}$ be some point in ${B}$. Then from Lemma 6, we have

$\displaystyle \leq 0.$

Using this inequality and similar trick as before, it follows that

$\displaystyle (b-a)^t y > \frac{1}{2}\left( \|b\|^2-\|a\|^2 \right). \ \ \ \ \ (3)$

Let the hyperplane be

$\displaystyle H= \{z: (b-a)^Tz =\frac{1}{2}\left( \|b\|^2-\|a\|^2 \right), z\in {\mathbb R}^n \}.$

It is easy to see from (2) and (3) that the hyperplane ${H}$ strictly separates the sets ${A}$ and ${B}$. $\Box$

The next separation theorem looks at separation of a point and an open set. Before that lets look at hyperplanes again. Earlier we have seen that a Hyperplane in ${{\mathbb R}^n}$ is an affine space and its dimension is of dimension ${n-1}$. It also follows that any affine subspace of dimension ${n-1}$ in ${{\mathbb R}^n}$ is a hyperplane.

Lemma 8 Let ${A \subset {\mathbb R}^n}$ be an affine subspace of dimension ${n-1}$. Then ${A}$ is a hyperplane.

Proof: To show that ${A}$ is an hyperplane, we have to find a scalar ${b \in {\mathbb R}}$ and a vector ${a \in {\mathbb R}^n}$ such that

$\displaystyle A=\{x: a^tx=b, x\in {\mathbb R}^n\}.$

Let ${v \in A}$. Since ${A}$ is an affine subspace, translating the space ${A}$ by ${v}$, \ie, ${B=A-v}$ is a subspace again of dimension ${n-1}$. Since ${B}$ is a subspace (which is a vectorspace) of ${{\mathbb R}^n}$ there exists ${n-1}$ vectors ${v_1, ..... v_{n-1}}$ such that they form the basis of ${B}$. By Gram-Schmidt orthogonalization we can also assume that these vectors form an orthonormal set. Since these are orthonormal, they are linearly independent in ${{\mathbb R}^n}$. But since ${{\mathbb R}^n}$ is of dimension ${n}$, we can find one more vector ${v_n}$ which is orthogonal to every vector in the set ${\{v_1, ....,v_{n-1}\}}$. (${v_n}$ together with ${v_1,...,v_{n-1}}$ will form a basis of ${{\mathbb R}^n}$). Since ${v_n}$ is orthogonal to any basis vector of the subspace ${B}$, it follows that it is orthogonal to every vector in the subspace ${B}$. Indeed, it can be shown that any vector orthogonal to ${v_n}$ should belong to ${B}$ (think projections ?). Hence

$\displaystyle B=\{x: v_n^tx=0, x\in {\mathbb R}^n\}$

Since ${B=A-v}$, we have

$\displaystyle \begin{array}{rcl} A&=&\{x+v: v_n^tx=0, x\in {\mathbb R}^n\}\\ &=&\{y: v_n^t(y-v)=0, y\in {\mathbb R}^n\}\\ &=&\{y: v_n^ty=v_n^tv, y\in {\mathbb R}^n\} \end{array}$

which is the equation of a hyperplane with ${a=v_n}$ and ${b=v_n^tv}$. $\Box$

We will now get back to the main separation theorem. The following separation theorem is a special case of geometric Hahn-Banach theorem.

Theorem 9 (Geometric Hahn-Banach theorem) Let ${A\in {\mathbb R}^d}$ be an open convex set and ${u \notin A}$ be some point. Then there exists a hyperplane which contains ${u}$ and strictly isolates ${A}$.

Proof: Without loss of generality we can assume that ${u=o}$. Otherwise, we can translate both ${u}$ and ${A}$ so that ${u=0}$.

We will prove this in three stages: First we will prove the theorem for ${d=2}$, i.e, a plane. Consider the circle ${S_1=\{(x,y): x^2+y^2=1 \}}$. Now we project the convex set ${A \subset {\mathbb R}^2}$ onto the circle in the following way:

$\displaystyle f(x): A \rightarrow S_1 ,$

as

$\displaystyle f(x)= \frac{x}{\|x\|},\ x\in A.$

Observe that the angle of the arc ${f(A)}$ should be less than ${\pi}$. Otherwise,if it is greater than ${\pi}$, consider a point ${x}$ on the arc ${f(A)}$. Let this point be the image of a point ${x_A \in A}$. Since the arc ${f(A)}$ is greater than ${\pi}$, the diametrically opposite point of ${x}$ should also be in the set ${f(A)}$. Call that point ${y}$ and its inverse image as ${y_A \in A}$. Since ${A}$ is convex, observe that the line segment ${[x_A , y_A]}$ should be in the set ${A}$ and also the line segment should contain the origin. Hence ${o \in A}$, which is a contradiction to our assumption.

Also, since ${A}$ is an open set, observe that the image of ${A}$, ${f(A)}$ will be an open arc on ${S_1}$. Now consider an end point of the arc. This can be done, since the arc is open. Now consider the line joining the end point and the center origin. This line contains the origin and the set ${A}$ is on one side of the line. This line is the required hyperplane for two dimensions.

Now let us consider higher dimensions ${d \geq 2}$:

We will first now prove that there is always a line ${L}$ (one dimension) in higher dimension that passes through the origin such that ${L\cap A=\emptyset}$. Consider a two dimensional plane ${P}$ through the origin. If it does not have any intersection with ${A}$, pick some line in the plane and denote it by ${L}$. This is the required line. Suppose ${P\cap A}$ is non-empty, it is easy to see that ${P\cap A}$ is a convex and open set that does not contain the origin. Hence by the previous two dimensional result, it follows that there is a line ${L}$ that passes through the origin such that ${A\cap P\cap L =\emptyset}$. This is the required line ${L}$.

Let ${M}$ be an subspace such that ${o \in M}$ and ${M\cap A =\empty}$. Also let ${M}$ be the biggest such affine space, \ie, any other subspace with this property is contained in ${M}$. If the dimension of ${M}$ is ${d-1}$, then by Lemma 8, it is a hyperplane and we are done (this is because ${M}$ is an affine subspace through the origin).

We will now prove by contradiction that the dimension of such maximal set is ${d-1}$. Let the dimension of ${M}$ be ${k . Since ${M}$ is a subspace there are ${k}$ basis vectors that generate this subspace. Call them ${v_1,....,v_k}$. Adding more vectors ${v_{k+1}, ...,v_{d}}$, the set ${v_1,v_2,....,v_d}$ will form a basis of ${{\mathbb R}^d}$. So any vector ${x \in {\mathbb R}^d}$ can be represented as a linear combination of the basis vector, i.e.,

$\displaystyle x =\sum_{i=1}^k \beta_i v_i +\sum_{i=k+1}^d \beta_i v_i . \ \ \ \ \ (4)$

Hence we can define the following projection function from ${{\mathbb R}^{d}}$ to ${{\mathbb R}^{d-k}}$ as

$\displaystyle f_p(x)= (\beta_{k+1},\hdots, \beta_d).$

Consider the projection of the convex set ${f_p(A)}$. The set ${f_p(A)}$ is convex and open (Why?). Now from the previous argument we can find a line ${L}$ that passes through the origin in ${{\mathbb R}^{d-1}}$ but does not intersect ${f_p(A)}$. Now consider the inverse image of ${L}$ under the map ${f_p(x)}$, \ie, ${f_p^{-1}(L)}$. From (4), we observe that this means only the last ${d-k}$ coordinates will correspond to ${L}$ and the first ${k}$ coordinates to ${M}$. Hence it follows that

$\displaystyle G=f_p^{-1}(L) = \{M+x: x\in L\} =\cup_{x\in L} (M+x).$

Observe that ${G}$ is a subspace (why ?) that contains ${M}$ and larger than ${M}$. Also, it has no intersection with ${A}$. This is a contradiction since we assumed ${M}$ is the largest subspace with these properties.

So${ k=d-1}$ and hence by Lemma 8, ${M}$ is a hyperplane. $\Box$

1. Properties of convex sets

1. Lemma 1

Let a set ${B_\alpha}$ be a convex set for every ${\alpha \in I}$. Then the intersection of all these sets

$\displaystyle \cap_{\alpha \in I}B_\alpha$

is convex.

Proof: HW problem $\Box$

Example 1 (A cool example from SB) The set of PSD matrices can be represented as

$\displaystyle S_{+}=\cap_{z\in R^n, z \neq 0}\{ A :\ A \in S,\ z^TAz\geq 0\}.$

Observe that for any ${z\neq 0}$, the set

$\displaystyle \{ A :\ A \in S,\ z^TAz\geq 0\}$

is a convex set in ${S}$ (why?). Hence from the above Lemma ${S_+}$ is convex.

2. A scaled and a translated version of a convex sets is convex.
3. The projection of a convex set onto its coordinates is convex. If the set ${A}$ is convex then the set

$\displaystyle \{x_1: (x_1,x_2) \in A\},$

is convex.

Example 2 The projection of a polyhedron on to its coordinates is a polyhedron.

Proof: The following example illustrates the procedure of the proof. Consider the following polyhedron:

$\displaystyle P= \{(x,y,z): x+y+z\leq 1;\ x\geq0, y\geq 0,z\geq 0\}.$

Let us consider its projection onto the ${(x,y)}$ plane. Denote this projection of ${P}$ onto the ${(x,y)}$ plane by ${T}$. It is easy to see that

$\displaystyle T= \{(x,y): x+y\leq 1;\ x\geq0, y\geq 0\}.$

Let us see the formal way to prove it. So ${(x,y)}$ belongs to the projection ${T}$ if and only if there exists a ${z}$ such that \begin{align*} z&\leq 1-(x+y),

z&\geq 0. \end{align*} There exits a ${z}$ that satisfies these two conditions if

$\displaystyle 1-(x+y) \geq z \geq 0,$

which implies ${x+y\leq 1}$. This combines with the initial inequalities ${x\geq 0}$ and ${y\geq 0}$ lead to the polyhedron ${T}$. (This technique of variable reduction is also called as Fourier -Motzkin reduction) $\Box$

4. Consider the affine function ${f(x) = Ax +b}$ and ${A\in {\mathbb R}^{m\times n}}$. Then the image of a convex set ${S \in {\mathbb R}^n}$ is convex. Also, if ${S \in {\mathbb R}^m}$ is convex, then the inverse image of a set ${S}$ under ${f(x)}$ is convex. Proof:Let ${A=f^{-1}(S)}$ be the inverse image of a convex set ${S}$. Let ${x,y \in A}$. then ${\theta x+(1-\theta)y}$ is the inverse image of ${\theta f(x) +(1-\theta)f(y)}$ which belongs to the set ${S}$ since it is convex. $\Box$Let us look at some examples: (These examples are from SB)

Example 3 Consider the matrix inequality

$\displaystyle A(x)= x_1A_1+x_2A_2+...+x_nA_n \preceq B,$

for ${A_i \in S}$, ${B\in S}$. Consider two sets ${C,D \in S}$, we say ${C \succeq D}$ if and only if ${C-D\succeq 0}$, \ie, ${C-D \in S_+}$.

This is called a linear matrix inequality. Then the set ${\{x: \ A(x) \preceq B\}}$ is convex. Consider the function ${f(x)=B-A(x): {\mathbb R}^n \rightarrow S}$. Observe that

$\displaystyle \{x: \ 0 \preceq B-A(x)\}= f^{-1}(S_+)$

Example 4 (Ellipsoid) Consider the set ${\{y+Ax:\|x\|\leq 1\}}$, ${A\succ 0}$. This set is the inverse image of ${f(x)= A^{-1}x -y}$ of the unit ball ${\|x\|<1}$. and hence convex.
5. In general the union of convex sets need not be convex. The union of an increasing sequence of convex sets is convex.

1.1. Affine sets

Definition 2 A set ${A\subset {\mathbb R}^n}$ is affine if for any two points ${x,y \in A}$, the line passing through ${x}$ and${y}$ lies in ${A}$. Mathematically,

$\displaystyle \theta x + (1-\theta)y \in A,\ \forall \theta \in {\mathbb R}.$

Definition 3 A linear combination of points

$\displaystyle \sum_{i=1}^n \alpha_i x_i, \quad \sum_{i=1}^n \alpha_i =1$

is called an affine combination of the points ${x_1,...,x_n}$.

Definition 4 An affine hull of a set is the set of all affine combinations of points from the set.

Some examples:

1. Let ${A}$ be the unit square. Then the affine hull of ${A}$ is the entire ${{\mathbb R}^2}$.
2. ${A =\{(y,x); y= 3x+2, x\in {\mathbb R}\}}$. Then the affine hull of ${A}$ is ${A}$.

Let ${V}$ be a vector space and ${L}$ be a subspace (${ax + by}$ belongs to the space). Then for any element ${x\in V}$, the translation ${A=L+x}$ is called an affine subspace. The dimension of ${A}$ is the dimension of ${L}$.

Lemma 5 An affine set ${A\subset {\mathbb R}^n}$ is an affine subspace.

Proof: Let ${a\in A}$. Then consider the set ${B=\{y-a; y \in A\}}$. We will now prove that ${B}$ is a subspace. ${o}$ belongs to the subspace. Let ${x,z \in B}$. Then

$\displaystyle \alpha x + \beta z =\alpha(x+a) + \beta (z+a) +(1-\alpha -\beta) a -a .$

Observe that ${(x+a) \in A}$, ${(z+a)\in A}$ and of course ${a\in A}$. So their affine combination ${\eta= \alpha(x+a) + \beta (z+a) +(1-\alpha -\beta) a}$ belongs to the set ${A}$. So ${\eta-a}$ belongs to the set ${B}$, which implies that ${B}$ is a subspace. Since

$\displaystyle A= B+a,$

and hence it follows that ${A}$ is an affine subspace. $\Box$

Example 5 The Hyperplane

$\displaystyle \{x: a^tx=b\}$

is an affine subspace.

A hyperplane in ${{\mathbb R}^n}$ is of dimension ${n-1}$. (This is because only ${n-1}$ coordinates can be freely chosen. The remaining coordinate is automatically fixed by the constraint ${a^tx=b}$.)

Definition 6 A set of points ${\{x_1,x_2,....,x_n\}}$ are affinley independent if

$\displaystyle \sum_{i=1}^n a_i x_i =o,\quad \sum_{i=1}^n a_i =0,$

implies ${a_i=0}$ for ${i=1,..,n}$.

1.2. Topological properties of convex sets

Lemma 7 Let ${A\subset {\mathbb R}^n}$ and ${x \in int(A)}$ be an interior point. Then for any ${y \in A}$

$\displaystyle (1-\alpha) x +\alpha y, \in int(A),$

for ${0\leq \alpha <1}$.

Proof: Since ${x}$ is an interior point, there is a ball ${B(x,r)}$ that lies entirely in the set ${A}$. Now consider the point ${y}$. We should prove that every point on the line segment joining ${x}$ and ${y}$ (expect for the end point ${y}$) is an interior point. Consider the “cone ” with ${y}$ as an end point and a section of the Ball ${B(x,r)}$ passing through ${x}$ as a base.

This cone is entirely in the set ${A}$ because of convexity. Hence for any point on the line between ${x}$ and ${y}$, we will be able to find a ball that fits in this cone and hence the set ${A}$. See the figure $\Box$

Lemma 8 Let ${A\subset {\mathbb R}^d}$ be a convex set. Then the interior of the set ${int(A)}$ is convex.

Proof: Let ${x,y}$ be interior points of set ${A}$. By Lemma 7, it follows that

$\displaystyle \alpha x + (1-\alpha)y \in int(A), \ \alpha \in [0,1).$

Since ${x}$ also belongs to the interior, the line segment ${[x,y] \in int(A)}$. $\Box$

Lemma 9 If ${\{x_1,...,x_{d+1}\}\subset {\mathbb R}^{d}}$ are affinley independent points, then ${conv(x_1,...,x_n)}$ has a non-empty interior.

Proof: Consider the point ${x_c= \sum_{i=1}^{d+1} \frac{x_i}{1+d}}$ a convex combination with equal weights. We will now show that there exists an ${\epsilon >0}$ such that ${B(x_c,\epsilon) \in conv(x_1,....,x_n)}$. This will prove that ${x_c}$ belongs to the interior of the set. To show that ${B(x_c,\epsilon) \in conv(x_1,....,x_n)}$, it suffices to prove that any ${y\in B(x_c,\epsilon)}$ is a convex combination of ${\{x_1,....,x_n\}}$.

Consider the following ${(d+1)\times(d+1)}$ matrix denoted by ${B}$

$\displaystyle B= \left[ \begin{array}{cccc} x_1&x_2&\hdots&x_{d+1}\\ 1&1&\hdots&1 \end{array} \right]$

Observe that from the independence of points we have

$\displaystyle \ker\{B\}=\{o\},$

since ${Bz=0}$ implies ${z=0}$. Hence the matrix ${B}$ is invertible and hence its eigenvalues are non-zero (condition number is non zero). Now consider the following function

$\displaystyle f(y): {\mathbb R}^{d+1}\rightarrow {\mathbb R}^d =B^{-1} \left[ \begin{array}{c} y\\ 1 \end{array} \right]$

Observe that since ${B}$ is invertible both ${f(y)}$ and ${f^{-1}(y)}$ are continuous. We have that

$\displaystyle f(x_c) = B^{-1} \left[ \begin{array}{c} x_c\\ 1 \end{array} \right] = \left[ \begin{array}{c} 1/(d+1)\\ \vdots\\ 1/(d+1) \end{array} \right] .$

So from the definition of continuity we have, for any ${\epsilon >0}$, there exists a ${\delta>0}$ such that

$\displaystyle \|x_c -y\| <\delta \implies \left \|\left[ \begin{array}{c} 1/(d+1)\\ \vdots\\ 1/(d+1) \end{array} \right] -f(y)\right\|<\epsilon$

If ${\epsilon}$ is made very small, it follows that all the coefficients of the vector ${f(y)}$ will be positive (in fact close to ${1/d+1}$). Hence any ${y \in B(x_c,\delta)}$, ${y}$ can be represented as a convex combination

$\displaystyle \sum_{i=1}^n f_i(y)x_i,$

where ${f_i(y)}$ are the coordinates of the vector ${y}$. Also observe that ${\sum f_i(y) =1}$ because of the last row of ${B}$. $\Box$

Consider the highlighted and coloured set. First it is not a convex set. Also observe that the set (highlighted and coloured) has an empty interior. However the smallest affine space that contains the set is of dimension 3, the same as that of the original space.

In general a set need not have an interior. For example the interval ${(0,1)}$ considered as a subset of ${{\mathbb R}^2}$ has a non-zero interior. However, it turns out if a convex set ${A \subset {\mathbb R}^d}$ has an empty interior, then we can find an affine space of dimension lower than ${d}$ that contains ${A}$ where ${A}$ has a non zero interior. For example see the Figure  and consider the dimension of an affine space that contains the set.

Theorem 10 Let ${A\subset {\mathbb R}^n}$ be a convex set. If ${int(A) =\emptyset}$, then there exists an affine subspace ${L\subset {\mathbb R}^n}$ such that ${A\subset L}$ and ${\dim (L) .

Proof: First claim: The number of affinely independent point in set ${A}$ cannot be greater than ${d}$. This is because,if there are ${d+1}$ or more independent points, by previous Lemma, we have that the interior of the convex set is non-empty. This contradicts the theorem statement that the interior is empty.

Let the maximum number of affinely independent points in set ${A}$ be ${K < d+1}$ and denote them by ${x_1, ....,x_k}$. Let ${y}$ be some point in ${A}$. Now consider the following set of equations:

$\displaystyle \begin{array}{rcl} \gamma_1 x_1+\gamma_2 x_2+ \hdots+\gamma_k x_k +\gamma y =o\\ \gamma_1+\gamma_2 + \hdots+\gamma_k +\gamma =0\\ \end{array}$

Since the maximum number of independent points in the set is ${k}$, the above set of equations has a non trivial solution. Also, observe that ${\gamma \neq 0}$ (think why ?) . Hence we have

$\displaystyle y=\frac{-\gamma_1 }{\gamma}x_1+\frac{-\gamma_2}{\gamma} x_2+ \hdots+\frac{-\gamma_k}{\gamma} x_k$

Also observe that

$\displaystyle \frac{-\gamma_1 }{\gamma}+\frac{-\gamma_2}{\gamma} + \hdots+\frac{-\gamma_k}{\gamma} =\frac{-\gamma}{\gamma}=1.$

Hence any point ${y}$ is an affine combination of the points ${x_1,....,x_k}$. So the set ${A\subset \text{aff}(x_1,....,x_k)}$. Also

$\displaystyle \dim\text{aff}(x_1,....,x_k) = k-1 < d ,$

hence proving the theorem.

$\Box$

We will look a nice application of Carathéodory’s theorem in information theory. Let ${e_i =[0,0,...0,1,0,...,0]}$ essentially a vector with ${1}$ int he ${i}$-th position and zero elsewhere. Before we go to the main example, we require the following lemma.

Lemma 1 Let ${\mathop{\mathbb P}_n}$ denote the “vertices” (extreme points) of the ${\Delta_{n-1}}$ simplex, \ie ${\mathop{\mathbb P}_n =\{e_1,e_2,...,e_n\}}$ are the standard basis of ${{\mathbb R}^n}$. For ${i=1,....k}$, let ${f_i(x):\mathop{\mathbb P} \rightarrow {\mathbb R}}$ be ${k}$ real valued functions on ${\mathop{\mathbb P}_n}$. Also, let ${\beta_i \geq 0, i=1,...,n}$ be such that ${\sum_{i=1}^n \beta_i=1}$. Then there exists ${\alpha_i\geq 0}$, ${i=1,...,k+1}$, ${\sum_{i=1}^{k+1}\alpha_i=1}$ and ${\{\hat{e_1}, ....,\hat{e_{k+1}}\}\subset \mathop{\mathbb P}_n}$ such that

$\displaystyle \sum_{i=1}^n \beta_i f_m(e_i ) = \sum_{i=1}^{k+1}\alpha_i f_m(\hat{e_i}), \ \text{ for } m=1,...,k$

Proof: We have

$\displaystyle \sum_{i=1}^n \beta_i\left[ \begin{array}{c} f_1(e_i)\\ f_2(e_i)\\ \vdots\\ f_k(e_i)\\ \end{array} \right] \in {conv}\left(\left[ \begin{array}{c} f_1(e_1)\\ f_2(e_1)\\ \vdots\\ f_k(e_1)\\ \end{array} \right], \left[ \begin{array}{c} f_1(e_2)\\ f_2(e_2)\\ \vdots\\ f_k(e_2)\\ \end{array} \right] \hdots \left[ \begin{array}{c} f_1(e_n)\\ f_2(e_n)\\ \vdots\\ f_k(e_n)\\ \end{array} \right] \right)$

Hence by Caratheodory’s theorem, there exists scalars ${\alpha_1,...\alpha_{k+1}}$ and ${k+1}$ vectors such that

$\displaystyle \sum_{i=1}^n \beta_i\left[ \begin{array}{c} f_1(e_i)\\ f_2(e_i)\\ \vdots\\ f_k(e_i)\\ \end{array} \right] = \sum_{i=1}^{k+1} \alpha_i\left[ \begin{array}{c} f_1(\hat{e_i})\\ f_2(\hat{e_i})\\ \vdots\\ f_k(\hat{e_i})\\ \end{array} \right],$

proving the theorem. $\Box$

For a more general version of the Lemma look at Lemma 3 in “Source coding with side information and a converse for degraded broadcast channels”

Bounds on the cardinality of channel inputs using Caratheodory’s theorem

We will now look at the capacity theorem for channel capacity of Discrete memoryless channels (DMC). Please see Wikipedia and related links for more introduction on the topic of channel capacity.  Refer to the the technical report by M SALEHI for more advanced techniques to bound the cardinality of random variables in Information theory.

Consider a channel with input from the alphabet ${\mathcal{X}}$ and output alphabet ${\mathcal{Y}}$. A channel is specified by a conditional probability distribution for each input, ${P(y|x)}$, \ie the probability of obtaining ${y}$ when ${x}$ is transmitted. For the discussion below, let us assume that the input and the output alphabet are discrete. Shannon proved that the channel capacity of a DMC is

$\displaystyle C= \max_{P_x }I(X; \ Y),$

where the maximization is over all the probability distributions on the input alphabet. Let for notational simplicity that the input alphabet cardinality ${|\mathcal{X}| =n}$ and ${|\mathcal{Y}| =m}$. Observe that in the above optimization problem, the search domain is the set of all discrete probability distrbutions with mass at ${n}$-points, \ie, the search space is the ${n-1}$-simplex. We will now use Caratheodeory theorem to show that this search space can be reduced drastically if the cardinality of the output alphabet ${m}$ is small.

Let ${X}$ and ${Y}$ be two discrete random variables with PMF ${p(x), x\in \mathcal{X}}$ and conditional PMF ${p(y| x), x\in \mathcal{X}, y \in \mathcal{Y}}$. Once the channel ${P(y|x)}$ is specified, fixing the input distribution ${p(x)}$ fixes the output distribution as ${p(y) =\sum_{x\in \mathcal{X}}p(x)p(y|x)}$. So the only flexibility we have is the distribution of the input.

The mutual information between the random variables ${X}$ and ${Y}$ is denoted by ${I(X; Y)}$ and equals

$\displaystyle I(X;Y) = H(Y) - H(Y|X),$

where ${H(Y)}$ is the entropy of the random variable ${Y}$ defined as

$\displaystyle H(Y) =\sum_{y\in \mathcal{Y}}p(y)\log\left(\frac{1}{p(y)}\right),$

and the conditional entropy is defined as

$\displaystyle H(Y|X) =\sum_{x\in \mathcal{X}}p(x)\sum_{y\in \mathcal{Y}}p(y|x)\log\left(\frac{1}{p(y|x)}\right),$

Observe that the entropies and mutual information depend only on the distribution of the random variables. Let ${p^*(x)}$ be the optimal distribution on the input alphabet that achieves the capacity ${C^*}$. As mentioned earlier, this optimal distribution induces a distribution and conditional distribution ${p^*(y)}$ and ${p^*(y|x)}$.

Now we will show that there exists one more probability distribution on ${\mathcal{X}}$ ${P^{**}}$ such that the same mutual information ${C^*}$. However the new distribution will have a non-zero probability only on ${m+1}$ terms of ${\mathcal{X}}$. Observe that if ${m+1 then the search space will reduce.

Since ${p^*(x)}$ is a ${n}$ dimensional (term) probability distribution, it belongs to the ${n-1}$ simplex and hence can be represented as a convex combination of the vertex points, \ie,

$\displaystyle P^*_X = \sum_{i=1}^n \beta_i e_i,$

for some ${\beta_i\geq 0}$ and ${\sum_{i=1}^n \beta_i=1}$ and ${e_i}$ are the standard ${n}$-dimensional basis defined in the previous lemma. Here we have used the notation for the optimal distribution

$\displaystyle P^*_X=\left[ \begin{array}{c} p^*(x_1)\\ p^*(x_2)\\ \vdots\\ p^*(x_n)\\ \end{array} \right]$

where ${x_1,...x_n}$ are the alphabet of ${\mathcal{X}}$. We now consider the following functions on ${\mathop{\mathbb P}_n}$ (see the previous lemma for the definition of ${\mathop{\mathbb P}_n}$)and use the previous lemma to reduce the cardinality. Let ${B(a)=[p(a|x_1),p(a|x_2),....,p(a|x_n)]^T}$. Let ${y_1,y_2,...y_m}$ be the alphabet of the output Define For any vector ${P\in \mathop{\mathbb P}_n}$,

$\displaystyle \begin{array}{rcl} f_{1}(P)&=& P^T B(y_1).\\ f_{2}(P)&=& P^T B(y_2).\\ \vdots &=&\vdots\\ f_{{m-1}}(P)&=& P^T B(y_{m-1}).\\ f_{m}(P)&=& P^T\left [\sum_{y}p(y|x_1)\log\left(\frac{1}{p(y|x_1)}\right),.,\sum_{y}p(y|x_n)\log\left(\frac{1}{p(y|x_n)}\right)\right]^T.\\ \end{array}$

We make the following observation for ${1\leq k\leq m-1}$

$\displaystyle \begin{array}{rcl} \sum_{i=1}^n \beta_i f_{k}(e_i)&=&\sum_{i=1}^n \beta_i( e_i^T B(y_i))\\ &=&\left(\sum_{i=1}^n \beta_i e_i^T\right) B(y_i)\\ &=&P_X^*B(y_i)\\ &=&\sum_{x\in \mathcal{X}}p^*(x) p(y_i|x)\\ & =& p^*(y_i), \end{array}$

where ${ p^*(y_i)}$ is the probability of ${y_i}$ in the optimal distribution of the channel output. Similarly it follows that

$\displaystyle \sum_{i=1}^n \beta_i f_{m}(e_i) = H(Y^*|X^*).$

So we have

$\displaystyle \left[ \begin{array}{c} p^*(y_1)\\ p^*(y_2)\\ \vdots\\ p^*(y_{m-1})\\ H(Y^*|X^*) \end{array} \right]=\left[ \begin{array}{c} \sum_{i=1}^n \beta_i f_{1}(e_i)\\ \sum_{i=1}^n \beta_i f_{2}(e_i)\\ \vdots\\ \sum_{i=1}^n \beta_i f_{m-1}(e_i)\\ \sum_{i=1}^n \beta_i f_{m}(e_i)\\ \end{array} \right]$

So the right hand side corresponds to the optimal distribution ${P^*_Y}$ and the optimal entropy distribution ${H(Y^*|X^*)}$. But by previous lemma, there exists constants ${\alpha_1,...\alpha_m \geq 0}$ and ${\sum_{i=1}^{m+1}\alpha_i =1}$ and ${\tilde{e_1},...,\tilde{e_{m+1}}}$such that

$\displaystyle \left[ \begin{array}{c} \sum_{i=1}^n \beta_i f_{1}(e_i)\\ \sum_{i=1}^n \beta_i f_{2}(e_i)\\ \vdots\\ \sum_{i=1}^n \beta_i f_{m-1}(e_i)\\ \sum_{i=1}^n \beta_i f_{m}(e_i)\\ \end{array} \right]= \left[ \begin{array}{c} \sum_{i=1}^{m+1} \alpha_i f_{1}(\tilde{e_i})\\ \sum_{i=1}^{m+1} \alpha_i f_{2}(\tilde{e_i})\\ \vdots\\ \sum_{i=1}^{m+1} \alpha_i f_{m-1}(\tilde{e_i})\\ \sum_{i=1}^{m+1} \alpha_i f_{m}(\tilde{e_i})\\ \end{array} \right]$

But from the definitions of ${f_i}$, it follows that

$\displaystyle \left[ \begin{array}{c} \sum_{i=1}^{m+1} \alpha_i f_{1}(\tilde{e_i})\\ \sum_{i=1}^{m+1} \alpha_i f_{2}(\tilde{e_i})\\ \vdots\\ \sum_{i=1}^{m+1} \alpha_i f_{m-1}(\tilde{e_i})\\ \sum_{i=1}^{m+1} \alpha_i f_{m}(\tilde{e_i})\\ \end{array} \right]= \left[ \begin{array}{c} (\sum_{i=1}^{m+1} \alpha_i \tilde{e_i}^T )B(y_1)\\ (\sum_{i=1}^{m+1} \alpha_i \tilde{e_i}^T )B(y_2)\\ \vdots\\ (\sum_{i=1}^{m+1} \alpha_i \tilde{e_i}^T )B(y_{m-1})\\ (\sum_{i=1}^{m+1} \alpha_i \tilde{e_i}^T )D\\ \end{array} \right],$

where ${D=\left [\sum_{y\in \mathcal{Y}}p(y|x_1)\log\left(\frac{1}{p(y|x_1)}\right),.....,\sum_{y\in \mathcal{Y}}p(y|x_n)\log\left(\frac{1}{p(y|x_n)}\right)\right]}$ Hence it follows that choosing the distribution ${(\sum_{i=1}^{m+1} \alpha_i \tilde{e_i}^T )}$ on ${\mathcal{X}}$ would lead to the same optimal distribution on ${Y}$ as ${Y^*}$ and the same conditional entropy ${H(Y^*|X^*)}$. But observe that the new optimal distribution.

$\displaystyle P^{**}=\sum_{i=1}^{m+1} \alpha_i \tilde{e_i}^T ,$

has positive mass only on ${m+1}$ terms of the PMF. This is because ${e_i}$ has one only one coordinate, the ${i}$-th coordinate. This proves the claim. Hence it is sufficient to search over all the ${n=|\mathcal{X}|}$ probability distributions with positive mass on ${\min\{|\mathcal{X}|, |\mathcal{Y}|+1\}}$ coordinates.

Definition 1 (Convex hull) The set of all convex combinations of points from a set ${A}$ is called the convex hull of ${A}$. It is denoted by ${\text{conv}(A)}$.

Similar to a closure of a set, the convex hull of a set is the smallest convex set containing ${A}$. We will now see some examples.

1. Some basic geometric examples.
2. Birkhoff – Von Neumann Theorem: Consider the set of permutation matrices ${\{X_\sigma, \sigma \in P_n\}}$. Then the convex hull of this set is the set of all doubly stochastic matrices.
Hw 1 Prove that the convex hull of a set is convex.

We will now see an interesting application of convex hull.

Theorem 2 (Gauss-Lucas Theorem) Let ${f(z)}$ be a ${n}$-th degree polynomial with real coefficients. By fundamental theorem of algebra there are ${n}$-roots of the polynomial (they might be complex). Call the roots ${\{z_1 = x_1+i y_1, z_2=x_2+iy_2,...,z_n=x_n+iy_n\}}$ be the set of all the roots. Observe that each root can be represented as a point in the plane ${{\mathbb R}^2}$. Then the roots of the ${n-1}$ degree polynomial ${f'(z) = \frac{d f(z)}{dz}}$ lies in the convex hull ${{conv}(z_1,z_2,...,z_n)}$.

Proof: Without loss of generality the polynomial can be represented as

$\displaystyle f(z)= (z-z_1)(z-z_2)....(z-z_n).$

So we have

$\displaystyle f'(z)= \sum_{k=1}^n \prod_{i\neq k}(z-z_i).$

Let ${w}$ be a root of ${f'(z)}$. Then we have ${f'(w)=0}$ and hence

$\displaystyle f'(w)= \sum_{k=1}^n \prod_{i\neq k}(w-z_i) =0$

Multiplying both sides with ${\prod_{i\neq k}(\bar w-\bar z_i)}$, we obatin

$\displaystyle \sum_{k=1}^n \frac{1}{( w - z_i)} \prod_{i=1}^n |w-z_i|^2 =0$

which implies

$\displaystyle \prod_{i=1}^n |w-z_i|^2\sum_{k=1}^n \frac{1}{( w - z_i)} =0.$

If ${\prod_{i=1}^n |w-z_i|^2 =0}$ then ${w = z_i}$ for some ${i}$ and hence the theorem is proved. Else if

$\displaystyle \sum_{k=1}^n \frac{1}{(w - z_k)} =0.$

which equals

$\displaystyle \sum_{k=1}^n \frac{\bar{w}-\bar{z_k}}{|w - z_k|^2} =0,$

which implies

$\displaystyle \bar{w} = \sum_{k=1}^n\frac{\frac{\bar{z_k}}{|w - z_k|^2}}{\sum_{i=1}^n\frac{\bar{z_k}}{|w - z_i|^2}}.$

Taking the conjugate we get the result ${w =\sum_{i=1}^n z_i \alpha_i}$ where ${\alpha_i =\frac{\frac{1}{|w - z_k|^2}}{\sum_{i=1}^n\frac{\bar{z_k}}{|w - z_i|^2}}}$. $\Box$

We will now look at one more theorem that looks at convex combinations.

Theorem 3 (Caratheodory’s theorem ) Let ${S \subset {\mathbb R}^d}$. Then any point ${x \in {conv}(S)}$ can be represented as a convex combination of ${d+1}$ points of ${S}$

Some observations before the proof:

1. The ${(d+1)}$ points depend on ${x}$ in consideration.
2. ${d+1}$ points are necessary. Let ${S}$ be three non-collinear points. Let ${x}$ be the the centroid of the triangle formed by these three points. Observe that the convex combination of the three points is necessary to obtain ${x}$.
3. Observe that the ${d+1}$ points are from ${S}$ and not ${{conv}(S)}$.

Proof: Any point ${y\in {conv}(S)}$ can be represented as a convex combination of some ${m}$ points in ${S}$. So

$\displaystyle y=\sum_{i=1}^m \alpha_i x_i,$

where ${x_i \in S}$ and ${\sum_{i=1}^m \alpha_i=1}$. If ${m \leq d+1}$ we are done. Hence we will only consider the case of ${m > d+1}$. Let ${x_i =(x_i^1,x_i^2,....,x_i^{d})}$. Now consider the set of equations ${\sum_{i=1}^m\gamma_i x_i=0}$ and ${\sum_{i=1}^m \gamma_i=0}$. This system of equations can be put in matrix form as

$\displaystyle \left[ \begin{array}{cccc} x_1^1 & x_2^1 &...&x_m^d \\ x_1^2 & x_2^2 &...&x_m^d \\ \vdots&\vdots&\vdots&\vdots\\ x_1^d & x_2^d &...&x_m^d \\ 1&1&...&1 \end{array} \right]\left[ \begin{array}{c} \gamma_1\\ \gamma_2\\ \vdots\\ \gamma_m \\ \end{array} \right] =o_{d+1 \times 1}$

Since the matrix is of size ${(d+1)\times m}$, and ${d+1 < m}$, there are non trivial solutions of this equation (look at rank-nullity theorem to observe that ${\dim(\mathcal{N}(A))\geq m-(d+1)>0}$). From now on Let ${\gamma_1, ....,\gamma_m}$ be one such solution. Now let ${\tau >0}$. Since ${\sum_{i=1}^m \gamma_i x_i =0}$, the point ${y}$ can be represented as

$\displaystyle y = \sum_{i=1}^m \alpha_i x_i -\tau \sum_{i=1}^m \gamma_i x_i = \sum_{i=1}^m (\alpha_i -\tau \gamma_i)x_i$

Also observe that ${\sum_{i=1}^m (\alpha_i -\tau \gamma_i) =1}$ since ${\sum_{i=1}^m \gamma_i =0}$. Now we will choose ${\tau}$ such that ${\alpha_i -\tau\gamma_i \geq 0}$ and equals ${0}$ for one term so that ${\sum_{i=1}^m (\alpha_i -\tau \gamma_i)x_i}$ becomes a convex combination. Since ${\sum \gamma_i =0}$ implies some ${\gamma_i}$‘s are positive and others are negative. If for some ${i}$, ${\gamma_i <0}$, then ${\alpha-\tau\gamma_i \geq0}$. So we need to consider only the terms with ${\gamma_i >0}$. For all ${i}$ such that ${\gamma_i >0}$, we want

$\displaystyle \alpha_i-\tau\gamma_i \geq 0,$

which implies

$\displaystyle \frac{\alpha_i}{\gamma_i}-\tau \geq 0.$

Hence if we choose ${\tau =\min\{\frac{\alpha_i}{\gamma_i}, \gamma_i >0\}}$, then one term ${\alpha_i -\tau\gamma_i =0}$ and every other term will be positive. So we are able to express

$\displaystyle y = \sum_{i=1}^m (\alpha_i -\tau \gamma_i)x_i,$

where one of the coefficients is zero, \ie, as a convex combination of ${m-1}$ points. We can continue this procedure till ${m\leq d+1}$ $\Box$

We will now see a simple application of Caratheodory’s theorem.

Lemma 4 If a set ${S\subset {\mathbb R}^d}$ is compact, then its convex hull ${{conv}(S)}$ is also compact.

Proof: Consider a sequence ${\{x_n\} \in {conv}(S)}$. If we show that there is a convergent subsequence which converges to a point in ${{conv}(S)}$ then we are done.

By Caratheodory’s theorem, each ${x_i \in {conv}(S)}$ can be represented as a convex sum of ${d+1}$ points in ${S}$. So

$\displaystyle x_i =\sum_{k=1}^{d+1}{\alpha_k^i}y_k^i, \quad \sum_{k=1}^{d+1}\alpha_k^i =1 \text{ and } y_k^i \in S$

Now consider the following matrix of dimensions ${\infty \times 2(d+1)}$

$\displaystyle \left[ \begin{array}{ccccccc} y_1^1 & \alpha_1^1& y_2^1 &\alpha_2^1&...&x_{d+1}^1&\alpha_{d+1}^1 \\ y_1^2& \alpha_1^2& y_2^2 &\alpha_2^2&...&x_{d+1}^2&\alpha_{d+1}^2 \\ y_1^3 & \alpha_1^3& y_2^3 &\alpha_2^3&...&x_{d+1}^3&\alpha_{d+1}^3\\ \vdots&\vdots&\vdots&\vdots&\vdots&\vdots&\vdots\\ y_1^n & \alpha_1^n& y_2^n &\alpha_2^n&...&x_{d+1}^n&\alpha_{d+1}^n\\ \vdots&\vdots&\vdots&\vdots&\vdots&\vdots&\vdots\\ \end{array} \right]$

Now consider the first column ${ \{y_k^1\}}$. Since this sequence lies in ${S}$ and ${S}$ is compact, there is a sub-sequence in that that converges to a point ${y^1 \in S}$. The indices of this subsequence are denoted by ${I_1 \subset \{1,2,3,...\}}$. Now consider the second column, but consider only rows corresponding to the index set ${I_1}$. Since ${\alpha_1^{k} \in [0,1]}$ a compact set, there exists a subsequene of ${\{\alpha_1^{k}\}_{k\in I_1}}$ that converges to some point ${\alpha_1}$ in ${[0,1]}$. Denote the indices corresponding to this subsequence by the set ${I_2 \subset I_1}$. Now consider the third column corresponding to ${y_2^{k}}$ but only corresponding to the rows ${I_2}$. Since ${\{y_2^{k}\} \subset S}$ and ${S}$ is compact, we can find a subsequence which converges to ${y_2 \in S}$. Denote the set of the indices corresponding to this sub sequence by ${I_3 \subset I_2}$. Continue this till all the ${2(d+1)}$ are exhausted. So we will finally have an index set ${I_{2(d+1)}}$ corresponding to the last column.

Now consider the original sequence ${\{x_n\}}$. It is easy to see that, the subsequence corresponding to the index set ${I_{2(d+1)}}$ converges to a point

$\displaystyle \sum_{i=1}^{d+1}\alpha_i y_i.$

which belongs to the set ${{conv}(S)}$ proving the result. $\Box$

1. Convex Sets

Definition 1 (Line) Let ${x_1}$ and ${x_2}$ be two points. Then the line passing through ${x_1}$ and ${x_2}$ can be represented as

$\displaystyle y=\theta x_1 + (1-\theta) x_2,\ \theta \in {\mathbb R}.$

The value ${\theta =1}$ corresponds to the point ${x_1}$ and ${\theta=0}$ is the point ${x_2}$. Also observe that for a line ${\theta}$ can any real value. When ${\theta}$ is restricted to ${[0,1]}$, \ie, ${\theta \in[0,1]}$, a line segment connecting ${x_1}$ and ${x_2}$ is obtained. We denote the line segment between ${x_1}$ and ${x_2}$ as ${[x_1,x_2]}$. Let us see some examples:

1. The concept of line and line segment is very clear if ${x}$ and ${y}$ belong to ${{\mathbb R}^d}$.
2. Let ${X \in S}$ and ${Y \in S}$. Then it is very clear that ${\theta X \in S}$ for all ${t\in {\mathbb R}}$. Also it is very clear that ${\theta X +(1-\theta)Y \in S}$. Hence the concept of line is well defined on the set ${S}$.
3. Let ${X \in S_{++}}$ and ${Y \in S_{++}}$. Then it is very clear that ${\theta X \in S_{++}}$ for ${\theta \in [0,1]}$. From the properties of SD matrices, it follows that

$\displaystyle \theta X +(1-\theta)Y \in S_{++},\ \ \theta \in [0,1].$

So the notion of line segment is well defined on ${S_{++}}$.

A set ${C}$ is convex if the line segment joining any two points in ${C}$ lies in ${C}$. Mathematically the set ${C}$ is convex if and only if for any ${x, y\in C}$ implies ${[x,y]\subseteq C}$. First observe that convexity is not a topological property, that is it does not depend on the norm. Let us see some examples:

1. the empty set ${\emptyset}$ is always convex.
2. Provide other geometrical examples.
3. The set

$\displaystyle \Delta=\{(x_1,x_2,...,x_{d+1}): x_1+x_2+...+x_{d+1}=1, x_i\geq 0 \}$

is a convex set. This is also called as the ${d}$-dimensional simplex. Observe that this set corresponds to all ${d+1}$ valued discrete random variables.

4. Any ball ${B(o,R)}$ is simplex. Let ${x}$ and${y}$ belong to the ball ${B(o,R)}$. Then by the property of norm

$\displaystyle \begin{array}{rcl} \|\theta x+ (1-\theta)y\| &\leq \|\theta x\| + \|(1-t)y\|\\ &=\theta\| x\| + (1-t)\|y\|\\ &\leq \theta R +(1-t)R =R \end{array}$

This implies ${\theta x+ (1-\theta)y \in B(o,R)}$ which implies ${B(o,R)}$ is convex. In general, ${B(x,R)}$ is a convex set for any ${x}$ and ${R}$ (Home work). Observe that the ball ${B(x,R)}$ can be represented as (prove it)

$\displaystyle B(y,R) = \{y+Rx:\ \|x\|\leq 1\}.$

5. Ellipsoid: Prove that the set ${\{y+Ax:\ \|x\|\leq 1\}}$, where ${A \succ 0 }$ is a convex set.
6. Hyperplane and halfspace: Let ${a\in {\mathbb R}^d}$ and ${b\in {\mathbb R}}$, then the set

$\displaystyle \{x: a^tx=b\}$

is called a hyperplane. The set

$\displaystyle \{x: a^tx\leq b\}$

is the halfspace.

7. Systems of linear equations or Polyhedron: Let ${c_i}$ be vectors in ${{\mathbb R}^d}$ and ${\beta_i}$ be some scalars. Then the set

$\displaystyle P=\{x\in {\mathbb R}^d: c_i^T x \leq \beta_i,\ i=1,2,...,n\},$

is called a polyhedron. Succinctly, this can be represented as ${Cx\preceq \beta}$.

8. The set of PSD matrices ${S_+}$ and ${S_{++}}$ is convex. Proof: Let ${Z}$ and ${Y}$ belong to ${S_{++}}$.

$\displaystyle x^t(\theta Z+(1-\theta)Y )x = \theta x^t Zx + (1-\theta) x^t Y x \succ 0.$

9. Voronoi tessellation: Let ${x_1, x_2,\hdots, x_n}$ denote ${n}$ points in ${{\mathbb R}^d}$. For any point ${x_i}$ let ${V(x_i)}$ denote the Voronoi region of the point ${x_i}$, i.e.m

$\displaystyle V(x_i) = \{x: \|x-x_i| \leq \|x-x_j\|, \forall j=1,...,n\}.$

Prove that ${V(x_i)}$ is a Voronoi set.

Definition 2 (Convex combination) Let ${\{x_1,x_2,\hdots, x_n\}\subset {\mathbb R}^d}$ denote a set of ${n}$ points. Let ${\beta_i >0}$ and ${\sum_{i=1}^n \beta_i =1}$. Then

$\displaystyle x=\sum_{i=1}^n \beta_i x_i,$

is called a convex combination of the points ${x_1,\hdots,x_n}$.

Theorem 3 A set ${A}$ is convex if and only if it contains all convex combinations of its points.

Proof: It is easy to see that if a set contains all convex combinations of its points, then the set is convex.

Now to prove the other way. We prove it by induction on the number of points. By the definition, it is is true for ${n=2}$. Let us assume it is true for any ${n}$ points, \ie, any convex combination of ${n}$-points lies in the set. Now consider ${n+1}$ points ${\{x_1,x_2,...,x_{n+1}\}}$. Consider

$\displaystyle x=\sum_{i=1}^{n+1} a_i x_i,$

such that ${\sum_{i=1}^{n+1}a_i=1}$. Alternatively,

$\displaystyle x=\sum_{i=1}^{n+1} a_i x_i= a_{n+1}x_{n+1}+(1-a_{n+1})\left(\sum_{i=1}^n \frac{a_i}{1-a_{n+1}} x_i \right).$

Observe by induction assumption, ${y=\sum_{i=1}^n \frac{a_i}{1-a_{n+1}} x_i \in A}$, since ${\sum_{i=1}^n\frac{a_i}{1-a_{n+1}} = \frac{1-a_i}{1-a_i}=1}$. So,

$\displaystyle x=a_{n+1}x_{n+1}+(1-a_{n+1})y,$

is a two points convex combination of ${x_{n+1}}$ and the points ${y}$ and hence ${x}$ belongs to the set ${A}$.

$\Box$

We will now define a process of convexifying any set.

Definition 4 (Convex hull) The set of all convex combinations of points from a set ${A}$ is called the convex hull of ${A}$. It is denoted by ${\text{conv}(A)}$.

Similar to a closure of a set, the convex hull of a set is the smallest convex set containing ${A}$. We will now see some examples.

1. Some basic geometric examples.
2. Birkhoff – Von Neumann Theorem: Consider the set of permutation matrices ${\{X_\sigma, \sigma \in P_n\}}$. Then the convex hull of this set is the set of all doubly stochastic matrices.
Hw 1 Prove that the convex hull of a set is convex.