
We will look at a nice application of Carathéodory's theorem in information theory. Let {e_i =[0,0,...,0,1,0,...,0]} denote the vector with {1} in the {i}-th position and zeros elsewhere. Before we get to the main example, we require the following lemma.

Lemma 1 Let {\mathop{\mathbb P}_n} denote the vertices (extreme points) of the simplex {\Delta_{n-1}}, i.e., {\mathop{\mathbb P}_n =\{e_1,e_2,...,e_n\}} is the standard basis of {{\mathbb R}^n}. For {i=1,...,k}, let {f_i:\mathop{\mathbb P}_n \rightarrow {\mathbb R}} be {k} real valued functions on {\mathop{\mathbb P}_n}. Also, let {\beta_i \geq 0, i=1,...,n} be such that {\sum_{i=1}^n \beta_i=1}. Then there exist {\alpha_i\geq 0}, {i=1,...,k+1}, with {\sum_{i=1}^{k+1}\alpha_i=1}, and {\{\hat{e_1}, ...,\hat{e_{k+1}}\}\subset \mathop{\mathbb P}_n} such that

\displaystyle \sum_{i=1}^n \beta_i f_m(e_i ) = \sum_{i=1}^{k+1}\alpha_i f_m(\hat{e_i}), \ \text{ for } m=1,...,k

Proof: We have

\displaystyle \sum_{i=1}^n \beta_i\left[ \begin{array}{c} f_1(e_i)\\ f_2(e_i)\\ \vdots\\ f_k(e_i)\\ \end{array} \right] \in {conv}\left(\left[ \begin{array}{c} f_1(e_1)\\ f_2(e_1)\\ \vdots\\ f_k(e_1)\\ \end{array} \right], \left[ \begin{array}{c} f_1(e_2)\\ f_2(e_2)\\ \vdots\\ f_k(e_2)\\ \end{array} \right] \hdots \left[ \begin{array}{c} f_1(e_n)\\ f_2(e_n)\\ \vdots\\ f_k(e_n)\\ \end{array} \right] \right)

Hence by Carathéodory's theorem, there exist scalars {\alpha_1,...,\alpha_{k+1}} and {k+1} vectors {\hat{e_1},...,\hat{e_{k+1}}\in \mathop{\mathbb P}_n} such that

\displaystyle \sum_{i=1}^n \beta_i\left[ \begin{array}{c} f_1(e_i)\\ f_2(e_i)\\ \vdots\\ f_k(e_i)\\ \end{array} \right] = \sum_{i=1}^{k+1} \alpha_i\left[ \begin{array}{c} f_1(\hat{e_i})\\ f_2(\hat{e_i})\\ \vdots\\ f_k(\hat{e_i})\\ \end{array} \right],

proving the lemma. \Box
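To make the lemma concrete, here is a minimal numerical sketch of the case {k=1} (the values of {f} and the weights {\beta_i} below are made up for illustration): the weighted average {\sum_i \beta_i f(e_i)} lies between the minimum and maximum of {f}, so it can be matched by a convex combination of just {k+1=2} vertices.

```python
# Lemma 1 for k = 1: a weighted average of a single function f over the
# vertices can be matched by a convex combination of only 2 vertices.
f = [3.0, -1.0, 4.0, 0.5, 2.0]        # f(e_1), ..., f(e_n), made-up values
beta = [0.1, 0.2, 0.3, 0.25, 0.15]    # a distribution on the n = 5 vertices

mu = sum(b * v for b, v in zip(beta, f))   # = sum_i beta_i f(e_i)

lo, hi = min(f), max(f)                    # two vertices bracketing the mean
alpha = (hi - mu) / (hi - lo)              # solve alpha*lo + (1-alpha)*hi = mu
# alpha in [0, 1], and alpha*lo + (1-alpha)*hi recovers mu exactly
```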

For a more general version of this lemma, see Lemma 3 in “Source coding with side information and a converse for degraded broadcast channels”.

Bounds on the cardinality of channel inputs using Carathéodory's theorem

We will now look at the capacity theorem for discrete memoryless channels (DMCs). Please see Wikipedia and related links for more introduction to the topic of channel capacity. Refer to the technical report by M. Salehi for more advanced techniques to bound the cardinality of random variables in information theory.

Consider a channel with input from the alphabet {\mathcal{X}} and output alphabet {\mathcal{Y}}. A channel is specified by a conditional probability distribution for each input, {P(y|x)}, i.e., the probability of obtaining {y} when {x} is transmitted. For the discussion below, let us assume that the input and the output alphabets are discrete. Shannon proved that the channel capacity of a DMC is

\displaystyle C= \max_{P_x }I(X; \ Y),

where the maximization is over all probability distributions on the input alphabet. For notational simplicity, let the input alphabet cardinality be {|\mathcal{X}| =n} and the output alphabet cardinality be {|\mathcal{Y}| =m}. Observe that in the above optimization problem, the search domain is the set of all discrete probability distributions with mass at {n} points, i.e., the search space is the {(n-1)}-simplex. We will now use Carathéodory's theorem to show that this search space can be reduced drastically if the cardinality of the output alphabet {m} is small.
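To get a feel for this optimization, here is a sketch that approximates the capacity of a Z-channel with crossover probability {0.5} (an illustrative channel, not one from the text) by brute-force search over the binary input distributions; the true value is {\log_2(5/4)\approx 0.322} bits, attained at input probability {q=0.4}.

```python
import numpy as np

def entropy(p):
    """H(p) in bits, ignoring zero-probability terms."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mutual_information(px, W):
    """I(X;Y) = H(Y) - H(Y|X) for input px and channel W[x, y] = p(y|x)."""
    py = px @ W                     # induced output distribution
    return entropy(py) - np.sum(px * np.array([entropy(row) for row in W]))

# Z-channel: input 0 is received perfectly, input 1 is flipped with prob. 0.5.
W = np.array([[1.0, 0.0],
              [0.5, 0.5]])

# Brute-force search over the 1-simplex of binary input distributions.
qs = np.linspace(0.0, 1.0, 1001)
C = max(mutual_information(np.array([1 - q, q]), W) for q in qs)
# C is approximately log2(5/4) ≈ 0.3219 bits
```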

Let {X} and {Y} be two discrete random variables with PMF {p(x), x\in \mathcal{X}} and conditional PMF {p(y| x), x\in \mathcal{X}, y \in \mathcal{Y}}. Once the channel {P(y|x)} is specified, fixing the input distribution {p(x)} fixes the output distribution as {p(y) =\sum_{x\in \mathcal{X}}p(x)p(y|x)}. So the only flexibility we have is the distribution of the input.

The mutual information between the random variables {X} and {Y} is denoted by {I(X; Y)} and equals

\displaystyle I(X;Y) = H(Y) - H(Y|X),

where {H(Y)} is the entropy of the random variable {Y} defined as

\displaystyle H(Y) =\sum_{y\in \mathcal{Y}}p(y)\log\left(\frac{1}{p(y)}\right),

and the conditional entropy is defined as

\displaystyle H(Y|X) =\sum_{x\in \mathcal{X}}p(x)\sum_{y\in \mathcal{Y}}p(y|x)\log\left(\frac{1}{p(y|x)}\right).

Observe that the entropies and mutual information depend only on the distributions of the random variables. Let {p^*(x)} be the optimal distribution on the input alphabet that achieves the capacity {C^*}. As mentioned earlier, this optimal distribution induces the output distribution {p^*(y)}; the conditional distribution {p(y|x)} is fixed by the channel.
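For concreteness, the definitions above can be evaluated directly. A minimal pure-Python sketch (the channel values are made up): for a binary symmetric channel with crossover probability {0.1} and uniform input, {H(Y)=1} bit and {I(X;Y)\approx 0.531} bits.

```python
from math import log2

def entropy(dist):
    """H in bits; terms with zero probability contribute 0."""
    return sum(p * log2(1 / p) for p in dist if p > 0)

# Binary symmetric channel with crossover 0.1 (illustrative values).
W = {0: [0.9, 0.1],    # p(y | x = 0)
     1: [0.1, 0.9]}    # p(y | x = 1)
px = [0.5, 0.5]        # uniform input distribution

# p(y) = sum_x p(x) p(y|x)
py = [sum(px[x] * W[x][y] for x in W) for y in (0, 1)]

H_Y = entropy(py)                                    # = 1 bit here
H_Y_given_X = sum(px[x] * entropy(W[x]) for x in W)  # ≈ 0.469 bits
I = H_Y - H_Y_given_X                                # ≈ 0.531 bits
```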

Now we will show that there exists another probability distribution {P^{**}} on {\mathcal{X}} that achieves the same mutual information {C^*}. However, the new distribution will have non-zero probability on only {m+1} letters of {\mathcal{X}}. Observe that if {m+1 <n} then the search space is reduced.

Since {p^*(x)} is an {n}-point probability distribution, it belongs to the {(n-1)}-simplex and hence can be represented as a convex combination of the vertices, i.e.,

\displaystyle P^*_X = \sum_{i=1}^n \beta_i e_i,

for some {\beta_i\geq 0} with {\sum_{i=1}^n \beta_i=1}, where the {e_i} are the standard basis vectors of {{\mathbb R}^n} defined in the previous lemma. Here we have used the vector notation for the optimal distribution

\displaystyle P^*_X=\left[ \begin{array}{c} p^*(x_1)\\ p^*(x_2)\\ \vdots\\ p^*(x_n)\\ \end{array} \right]

where {x_1,...,x_n} are the letters of {\mathcal{X}}. We now consider the following functions on {\mathop{\mathbb P}_n} (see the previous lemma for the definition of {\mathop{\mathbb P}_n}) and use the previous lemma to reduce the cardinality. Let {B(a)=[p(a|x_1),p(a|x_2),...,p(a|x_n)]^T} and let {y_1,y_2,...,y_m} be the letters of the output alphabet. For any vector {P\in \mathop{\mathbb P}_n}, define

\displaystyle \begin{array}{rcl} f_{1}(P)&=& P^T B(y_1),\\ f_{2}(P)&=& P^T B(y_2),\\ \vdots & &\\ f_{{m-1}}(P)&=& P^T B(y_{m-1}),\\ f_{m}(P)&=& P^T\left [\sum_{y}p(y|x_1)\log\left(\frac{1}{p(y|x_1)}\right),...,\sum_{y}p(y|x_n)\log\left(\frac{1}{p(y|x_n)}\right)\right]^T. \end{array}

We make the following observation for {1\leq k\leq m-1}:

\displaystyle \begin{array}{rcl} \sum_{i=1}^n \beta_i f_{k}(e_i)&=&\sum_{i=1}^n \beta_i( e_i^T B(y_k))\\ &=&\left(\sum_{i=1}^n \beta_i e_i^T\right) B(y_k)\\ &=&(P_X^*)^T B(y_k)\\ &=&\sum_{x\in \mathcal{X}}p^*(x) p(y_k|x)\\ & =& p^*(y_k), \end{array}

where { p^*(y_k)} is the probability of {y_k} under the optimal distribution of the channel output. Similarly it follows that

\displaystyle \sum_{i=1}^n \beta_i f_{m}(e_i) = H(Y^*|X^*).

So we have

\displaystyle \left[ \begin{array}{c} p^*(y_1)\\ p^*(y_2)\\ \vdots\\ p^*(y_{m-1})\\ H(Y^*|X^*) \end{array} \right]=\left[ \begin{array}{c} \sum_{i=1}^n \beta_i f_{1}(e_i)\\ \sum_{i=1}^n \beta_i f_{2}(e_i)\\ \vdots\\ \sum_{i=1}^n \beta_i f_{m-1}(e_i)\\ \sum_{i=1}^n \beta_i f_{m}(e_i)\\ \end{array} \right]

So the right hand side corresponds to the optimal output distribution {P^*_Y} and the optimal conditional entropy {H(Y^*|X^*)}. But by the previous lemma, there exist constants {\alpha_1,...,\alpha_{m+1} \geq 0} with {\sum_{i=1}^{m+1}\alpha_i =1} and vectors {\tilde{e_1},...,\tilde{e_{m+1}}\in \mathop{\mathbb P}_n} such that

\displaystyle \left[ \begin{array}{c} \sum_{i=1}^n \beta_i f_{1}(e_i)\\ \sum_{i=1}^n \beta_i f_{2}(e_i)\\ \vdots\\ \sum_{i=1}^n \beta_i f_{m-1}(e_i)\\ \sum_{i=1}^n \beta_i f_{m}(e_i)\\ \end{array} \right]= \left[ \begin{array}{c} \sum_{i=1}^{m+1} \alpha_i f_{1}(\tilde{e_i})\\ \sum_{i=1}^{m+1} \alpha_i f_{2}(\tilde{e_i})\\ \vdots\\ \sum_{i=1}^{m+1} \alpha_i f_{m-1}(\tilde{e_i})\\ \sum_{i=1}^{m+1} \alpha_i f_{m}(\tilde{e_i})\\ \end{array} \right]

But from the definitions of {f_i}, it follows that

\displaystyle \left[ \begin{array}{c} \sum_{i=1}^{m+1} \alpha_i f_{1}(\tilde{e_i})\\ \sum_{i=1}^{m+1} \alpha_i f_{2}(\tilde{e_i})\\ \vdots\\ \sum_{i=1}^{m+1} \alpha_i f_{m-1}(\tilde{e_i})\\ \sum_{i=1}^{m+1} \alpha_i f_{m}(\tilde{e_i})\\ \end{array} \right]= \left[ \begin{array}{c} (\sum_{i=1}^{m+1} \alpha_i \tilde{e_i}^T )B(y_1)\\ (\sum_{i=1}^{m+1} \alpha_i \tilde{e_i}^T )B(y_2)\\ \vdots\\ (\sum_{i=1}^{m+1} \alpha_i \tilde{e_i}^T )B(y_{m-1})\\ (\sum_{i=1}^{m+1} \alpha_i \tilde{e_i}^T )D\\ \end{array} \right],

where {D=\left [\sum_{y\in \mathcal{Y}}p(y|x_1)\log\left(\frac{1}{p(y|x_1)}\right),...,\sum_{y\in \mathcal{Y}}p(y|x_n)\log\left(\frac{1}{p(y|x_n)}\right)\right]^T}. Hence choosing the distribution {\sum_{i=1}^{m+1} \alpha_i \tilde{e_i}} on {\mathcal{X}} leads to the same optimal distribution on {Y} as {Y^*} and the same conditional entropy {H(Y^*|X^*)}, and therefore to the same mutual information {C^*}. But observe that the new optimal distribution

\displaystyle P^{**}=\sum_{i=1}^{m+1} \alpha_i \tilde{e_i},

has positive mass on only {m+1} coordinates. This is because each {\tilde{e_i}} has only one non-zero coordinate, the {i}-th coordinate. This proves the claim. Hence it is sufficient to search over the probability distributions on {\mathcal{X}} with positive mass on at most {\min\{|\mathcal{X}|, |\mathcal{Y}|+1\}} coordinates.
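The whole argument can be carried out numerically. The sketch below (the channel matrix and starting distribution are made up) takes a channel with {n=4} inputs and {m=2} outputs, and performs one Carathéodory elimination step: moving the weights along a null-space direction of the constraint matrix until one weight hits zero. This produces a distribution supported on at most {m+1=3} inputs with the same output distribution, the same {H(Y|X)}, and hence the same mutual information.

```python
import numpy as np

def entropy(p):
    """H(p) in bits, ignoring zero-probability terms."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mutual_information(beta, W):
    """I(X;Y) = H(Y) - H(Y|X) for input weights beta and channel W[x, y] = p(y|x)."""
    h_rows = np.array([entropy(row) for row in W])
    return entropy(beta @ W) - beta @ h_rows

# Made-up channel: n = 4 inputs, m = 2 outputs.
W = np.array([[0.9, 0.1],
              [0.6, 0.4],
              [0.3, 0.7],
              [0.2, 0.8]])
beta = np.array([0.25, 0.25, 0.25, 0.25])   # any starting input distribution

# Linear constraints to preserve: f_1 (one output probability suffices when
# m = 2), f_m = H(Y|X), and total mass 1 -- three constraints on four weights.
A = np.vstack([W[:, 0],                          # f_1(e_i) = p(y_1 | x_i)
               [entropy(row) for row in W],      # f_m(e_i) = H(Y | X = x_i)
               np.ones(len(beta))])

# With 4 weights and 3 constraints there is a null-space direction v, A v = 0.
v = np.linalg.svd(A)[2][-1]
# Step along v until the first weight hits zero; A beta is unchanged.
t = np.min(beta[v > 1e-12] / v[v > 1e-12])
beta2 = np.clip(beta - t * v, 0.0, None)
# beta2 is supported on at most m + 1 = 3 inputs with the same I(X;Y)
```

Repeating this step whenever more than {m+1} weights remain positive recovers the full reduction promised by the lemma.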