
We will look at a nice application of Carathéodory’s theorem in information theory. Let {e_i =[0,0,...,0,1,0,...,0]} denote the vector with {1} in the {i}-th position and zeros elsewhere. Before we go to the main example, we require the following lemma.

Lemma 1 Let {\mathop{\mathbb P}_n} denote the set of “vertices” (extreme points) of the simplex {\Delta_{n-1}}, i.e., {\mathop{\mathbb P}_n =\{e_1,e_2,...,e_n\}}, the standard basis of {{\mathbb R}^n}. For {i=1,...,k}, let {f_i:\mathop{\mathbb P}_n \rightarrow {\mathbb R}} be {k} real valued functions on {\mathop{\mathbb P}_n}. Also, let {\beta_i \geq 0}, {i=1,...,n}, be such that {\sum_{i=1}^n \beta_i=1}. Then there exist {\alpha_i\geq 0}, {i=1,...,k+1}, with {\sum_{i=1}^{k+1}\alpha_i=1}, and {\{\hat{e_1}, ....,\hat{e_{k+1}}\}\subset \mathop{\mathbb P}_n} such that

\displaystyle \sum_{i=1}^n \beta_i f_m(e_i ) = \sum_{i=1}^{k+1}\alpha_i f_m(\hat{e_i}), \ \text{ for } m=1,...,k

Proof: We have

\displaystyle \sum_{i=1}^n \beta_i\left[ \begin{array}{c} f_1(e_i)\\ f_2(e_i)\\ \vdots\\ f_k(e_i)\\ \end{array} \right] \in {conv}\left(\left[ \begin{array}{c} f_1(e_1)\\ f_2(e_1)\\ \vdots\\ f_k(e_1)\\ \end{array} \right], \left[ \begin{array}{c} f_1(e_2)\\ f_2(e_2)\\ \vdots\\ f_k(e_2)\\ \end{array} \right] \hdots \left[ \begin{array}{c} f_1(e_n)\\ f_2(e_n)\\ \vdots\\ f_k(e_n)\\ \end{array} \right] \right)

Hence by Carathéodory’s theorem, there exist scalars {\alpha_1,...,\alpha_{k+1}} and {k+1} vectors {\{\hat{e_1},...,\hat{e_{k+1}}\}\subset \mathop{\mathbb P}_n} such that

\displaystyle \sum_{i=1}^n \beta_i\left[ \begin{array}{c} f_1(e_i)\\ f_2(e_i)\\ \vdots\\ f_k(e_i)\\ \end{array} \right] = \sum_{i=1}^{k+1} \alpha_i\left[ \begin{array}{c} f_1(\hat{e_i})\\ f_2(\hat{e_i})\\ \vdots\\ f_k(\hat{e_i})\\ \end{array} \right],

proving the lemma. \Box

For a more general version of this lemma, see Lemma 3 in “Source coding with side information and a converse for degraded broadcast channels”.

Bounds on the cardinality of channel inputs using Caratheodory’s theorem

We will now look at the capacity theorem for discrete memoryless channels (DMCs). Please see Wikipedia and related links for more of an introduction to the topic of channel capacity. Refer to the technical report by M. Salehi for more advanced techniques to bound the cardinality of random variables in information theory.

Consider a channel with input from the alphabet {\mathcal{X}} and output alphabet {\mathcal{Y}}. A channel is specified by a conditional probability distribution for each input, {P(y|x)}, i.e., the probability of obtaining {y} when {x} is transmitted. For the discussion below, let us assume that the input and the output alphabets are discrete. Shannon proved that the channel capacity of a DMC is

\displaystyle C= \max_{P_x }I(X; \ Y),

where the maximization is over all probability distributions on the input alphabet. For notational simplicity, let the input alphabet cardinality be {|\mathcal{X}| =n} and the output alphabet cardinality {|\mathcal{Y}| =m}. Observe that in the above optimization problem, the search domain is the set of all discrete probability distributions with mass at {n} points, i.e., the search space is the {(n-1)}-simplex. We will now use Carathéodory’s theorem to show that this search space can be reduced drastically if the cardinality of the output alphabet {m} is small.

Let {X} and {Y} be two discrete random variables with PMF {p(x), x\in \mathcal{X}} and conditional PMF {p(y| x), x\in \mathcal{X}, y \in \mathcal{Y}}. Once the channel {P(y|x)} is specified, fixing the input distribution {p(x)} fixes the output distribution as {p(y) =\sum_{x\in \mathcal{X}}p(x)p(y|x)}. So the only flexibility we have is the distribution of the input.

The mutual information between the random variables {X} and {Y} is denoted by {I(X; Y)} and equals

\displaystyle I(X;Y) = H(Y) - H(Y|X),

where {H(Y)} is the entropy of the random variable {Y} defined as

\displaystyle H(Y) =\sum_{y\in \mathcal{Y}}p(y)\log\left(\frac{1}{p(y)}\right),

and the conditional entropy is defined as

\displaystyle H(Y|X) =\sum_{x\in \mathcal{X}}p(x)\sum_{y\in \mathcal{Y}}p(y|x)\log\left(\frac{1}{p(y|x)}\right).
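As an aside, these quantities are easy to compute numerically. Below is a minimal sketch (mine, not from the original notes; entropies in bits), assuming numpy, with the channel stored as a matrix {W} with {W[i,j]=p(y_j|x_i)}:

```python
import numpy as np

def entropy(p):
    """Shannon entropy (in bits) of a pmf, ignoring zero entries."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mutual_information(p_x, W):
    """I(X;Y) = H(Y) - H(Y|X) for input pmf p_x and channel W[i, j] = p(y_j | x_i)."""
    p_y = p_x @ W                                    # induced output pmf
    h_y_given_x = np.sum(p_x * np.array([entropy(row) for row in W]))
    return entropy(p_y) - h_y_given_x

# Example: binary symmetric channel with crossover 0.1 and uniform input.
W = np.array([[0.9, 0.1], [0.1, 0.9]])
print(mutual_information(np.array([0.5, 0.5]), W))   # ~0.531 bits
```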

Observe that the entropies and the mutual information depend only on the distributions of the random variables. Let {p^*(x)} be the optimal distribution on the input alphabet that achieves the capacity {C^*}. As mentioned earlier, this optimal distribution induces the output distribution {p^*(y)=\sum_{x\in\mathcal{X}}p^*(x)p(y|x)}.

Now we will show that there exists another probability distribution {P^{**}} on {\mathcal{X}} that achieves the same mutual information {C^*}. However, the new distribution will have non-zero probability on only {m+1} letters of {\mathcal{X}}. Observe that if {m+1 <n}, then the search space is reduced.

Since {p^*(x)} is an {n}-term probability distribution, it belongs to the {(n-1)}-simplex and hence can be represented as a convex combination of the vertex points, i.e.,

\displaystyle P^*_X = \sum_{i=1}^n \beta_i e_i,

for some {\beta_i\geq 0} with {\sum_{i=1}^n \beta_i=1}, where the {e_i} are the standard basis vectors of {{\mathbb R}^n} defined in the previous lemma. Here we have used the notation for the optimal distribution

\displaystyle P^*_X=\left[ \begin{array}{c} p^*(x_1)\\ p^*(x_2)\\ \vdots\\ p^*(x_n)\\ \end{array} \right]

where {x_1,...,x_n} are the letters of the alphabet {\mathcal{X}}. We now consider the following functions on {\mathop{\mathbb P}_n} (see the previous lemma for the definition of {\mathop{\mathbb P}_n}) and use the previous lemma to reduce the cardinality. Let {B(a)=[p(a|x_1),p(a|x_2),....,p(a|x_n)]^T} and let {y_1,y_2,...,y_m} be the letters of the output alphabet. For any vector {P\in \mathop{\mathbb P}_n}, define

\displaystyle \begin{array}{rcl} f_{1}(P)&=& P^T B(y_1).\\ f_{2}(P)&=& P^T B(y_2).\\ \vdots &=&\vdots\\ f_{{m-1}}(P)&=& P^T B(y_{m-1}).\\ f_{m}(P)&=& P^T\left [\sum_{y}p(y|x_1)\log\left(\frac{1}{p(y|x_1)}\right),.,\sum_{y}p(y|x_n)\log\left(\frac{1}{p(y|x_n)}\right)\right]^T.\\ \end{array}

We make the following observation for {1\leq k\leq m-1}:

\displaystyle \begin{array}{rcl} \sum_{i=1}^n \beta_i f_{k}(e_i)&=&\sum_{i=1}^n \beta_i( e_i^T B(y_k))\\ &=&\left(\sum_{i=1}^n \beta_i e_i^T\right) B(y_k)\\ &=&(P_X^*)^T B(y_k)\\ &=&\sum_{x\in \mathcal{X}}p^*(x) p(y_k|x)\\ & =& p^*(y_k), \end{array}

where {p^*(y_k)} is the probability of {y_k} under the optimal distribution of the channel output. Similarly, it follows that

\displaystyle \sum_{i=1}^n \beta_i f_{m}(e_i) = H(Y^*|X^*).

So we have

\displaystyle \left[ \begin{array}{c} p^*(y_1)\\ p^*(y_2)\\ \vdots\\ p^*(y_{m-1})\\ H(Y^*|X^*) \end{array} \right]=\left[ \begin{array}{c} \sum_{i=1}^n \beta_i f_{1}(e_i)\\ \sum_{i=1}^n \beta_i f_{2}(e_i)\\ \vdots\\ \sum_{i=1}^n \beta_i f_{m-1}(e_i)\\ \sum_{i=1}^n \beta_i f_{m}(e_i)\\ \end{array} \right]

So the left hand side corresponds to the optimal output distribution {P^*_Y} and the optimal conditional entropy {H(Y^*|X^*)}. But by the previous lemma, there exist constants {\alpha_1,...,\alpha_{m+1} \geq 0} with {\sum_{i=1}^{m+1}\alpha_i =1} and vectors {\tilde{e_1},...,\tilde{e_{m+1}}} such that

\displaystyle \left[ \begin{array}{c} \sum_{i=1}^n \beta_i f_{1}(e_i)\\ \sum_{i=1}^n \beta_i f_{2}(e_i)\\ \vdots\\ \sum_{i=1}^n \beta_i f_{m-1}(e_i)\\ \sum_{i=1}^n \beta_i f_{m}(e_i)\\ \end{array} \right]= \left[ \begin{array}{c} \sum_{i=1}^{m+1} \alpha_i f_{1}(\tilde{e_i})\\ \sum_{i=1}^{m+1} \alpha_i f_{2}(\tilde{e_i})\\ \vdots\\ \sum_{i=1}^{m+1} \alpha_i f_{m-1}(\tilde{e_i})\\ \sum_{i=1}^{m+1} \alpha_i f_{m}(\tilde{e_i})\\ \end{array} \right]

But from the definitions of {f_i}, it follows that

\displaystyle \left[ \begin{array}{c} \sum_{i=1}^{m+1} \alpha_i f_{1}(\tilde{e_i})\\ \sum_{i=1}^{m+1} \alpha_i f_{2}(\tilde{e_i})\\ \vdots\\ \sum_{i=1}^{m+1} \alpha_i f_{m-1}(\tilde{e_i})\\ \sum_{i=1}^{m+1} \alpha_i f_{m}(\tilde{e_i})\\ \end{array} \right]= \left[ \begin{array}{c} (\sum_{i=1}^{m+1} \alpha_i \tilde{e_i}^T )B(y_1)\\ (\sum_{i=1}^{m+1} \alpha_i \tilde{e_i}^T )B(y_2)\\ \vdots\\ (\sum_{i=1}^{m+1} \alpha_i \tilde{e_i}^T )B(y_{m-1})\\ (\sum_{i=1}^{m+1} \alpha_i \tilde{e_i}^T )D\\ \end{array} \right],

where {D=\left [\sum_{y\in \mathcal{Y}}p(y|x_1)\log\left(\frac{1}{p(y|x_1)}\right),.....,\sum_{y\in \mathcal{Y}}p(y|x_n)\log\left(\frac{1}{p(y|x_n)}\right)\right]^T}. Hence it follows that choosing the distribution {\sum_{i=1}^{m+1} \alpha_i \tilde{e_i}} on {\mathcal{X}} leads to the same optimal output distribution as {Y^*} and the same conditional entropy {H(Y^*|X^*)}, and therefore the same mutual information {C^*}. But observe that the new optimal distribution

\displaystyle P^{**}=\sum_{i=1}^{m+1} \alpha_i \tilde{e_i},

has positive mass on only {m+1} letters. This is because each {\tilde{e_i}} has only one non-zero coordinate. This proves the claim. Hence it is sufficient to search over probability distributions on {\mathcal{X}} with positive mass on at most {\min\{|\mathcal{X}|, |\mathcal{Y}|+1\}} coordinates.
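The argument above is constructive: the “feature” vector of the input letter {x_i} is {(f_1(e_i),\hdots,f_m(e_i))}, and the null-space step from the proof of Carathéodory’s theorem repeatedly drives one input probability to zero while preserving all the {f_k} values. The following is a minimal numerical sketch of this (mine, not from the original notes; the function name is hypothetical), assuming numpy:

```python
import numpy as np

def reduce_input_support(p_x, W, tol=1e-12):
    """Return an input pmf supported on at most m+1 letters that induces the
    same output pmf and the same H(Y|X) (hence the same I(X;Y)) as p_x.
    W[i, j] = p(y_j | x_i)."""
    n, m = W.shape
    logs = np.where(W > tol, np.log2(np.maximum(W, tol)), 0.0)
    row_ent = -np.sum(W * logs, axis=1)             # H(Y | X = x_i)
    F = np.column_stack([W[:, :m - 1], row_ent])    # f_1, ..., f_m per letter
    w = np.array(p_x, dtype=float)
    while (w > tol).sum() > m + 1:
        idx = np.where(w > tol)[0]
        # Null-space direction of the (m+1) x k system [F^T; 1] gamma = 0.
        A = np.vstack([F[idx].T, np.ones(len(idx))])
        gamma = np.linalg.svd(A)[2][-1]
        if not (gamma > tol).any():
            gamma = -gamma                          # ensure some gamma_i > 0
        pos = gamma > tol
        tau = np.min(w[idx][pos] / gamma[pos])      # drives one weight to 0
        w[idx] -= tau * gamma
        w[w < tol] = 0.0
    return w

rng = np.random.default_rng(1)
n, m = 8, 3
W = rng.dirichlet(np.ones(m), size=n)               # random channel rows
p = rng.dirichlet(np.ones(n))                       # some input pmf
q = reduce_input_support(p, W)
assert np.allclose(p @ W, q @ W) and (q > 0).sum() <= m + 1
```

Preserving {p(y_1),\hdots,p(y_{m-1})}, the total mass, and {H(Y|X)} automatically preserves {p(y_m)} (the output probabilities sum to one), hence {I(X;Y)} is unchanged.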


Definition 1 (Convex hull) The set of all convex combinations of points from a set {A} is called the convex hull of {A}. It is denoted by {\text{conv}(A)}.

Similar to a closure of a set, the convex hull of a set is the smallest convex set containing {A}. We will now see some examples.

  1. Some basic geometric examples.
  2. Birkhoff–von Neumann Theorem: Consider the set of permutation matrices {\{X_\sigma, \sigma \in P_n\}}. Then the convex hull of this set is the set of all doubly stochastic matrices.
Hw 1 Prove that the convex hull of a set is convex.

We will now see an interesting application of convex hull.

Theorem 2 (Gauss–Lucas Theorem) Let {f(z)} be an {n}-th degree polynomial with complex coefficients. By the fundamental theorem of algebra, {f} has {n} roots (possibly complex and repeated). Let {\{z_1 = x_1+i y_1, z_2=x_2+iy_2,...,z_n=x_n+iy_n\}} be the roots. Observe that each root can be represented as a point in the plane {{\mathbb R}^2}. Then the roots of the {(n-1)}-degree polynomial {f'(z) = \frac{d f(z)}{dz}} lie in the convex hull {{conv}(z_1,z_2,...,z_n)}.

Proof: Without loss of generality (taking the leading coefficient to be {1}), the polynomial can be represented as

\displaystyle f(z)= (z-z_1)(z-z_2)....(z-z_n).

So we have

\displaystyle f'(z)= \sum_{k=1}^n \prod_{i\neq k}(z-z_i).

Let {w} be a root of {f'(z)}. Then we have {f'(w)=0} and hence

\displaystyle f'(w)= \sum_{k=1}^n \prod_{i\neq k}(w-z_i) =0

If {w = z_i} for some {i}, then {w} trivially lies in the convex hull and we are done. So assume {w \neq z_i} for all {i}. Multiplying both sides by {\prod_{i=1}^n(\bar w-\bar z_i)} and using {\frac{\bar w-\bar z_k}{|w-z_k|^2} = \frac{1}{w-z_k}}, we obtain

\displaystyle \prod_{i=1}^n |w-z_i|^2\sum_{k=1}^n \frac{1}{( w - z_k)} =0.

Since {\prod_{i=1}^n |w-z_i|^2 \neq 0}, this implies

\displaystyle \sum_{k=1}^n \frac{1}{(w - z_k)} =0,

which equals

\displaystyle \sum_{k=1}^n \frac{\bar{w}-\bar{z_k}}{|w - z_k|^2} =0,

which implies

\displaystyle \bar{w} = \frac{\sum_{k=1}^n\frac{\bar{z_k}}{|w - z_k|^2}}{\sum_{i=1}^n\frac{1}{|w - z_i|^2}}.

Taking the conjugate, we get {w =\sum_{k=1}^n \alpha_k z_k}, where {\alpha_k =\frac{\frac{1}{|w - z_k|^2}}{\sum_{i=1}^n\frac{1}{|w - z_i|^2}}}. The {\alpha_k} are non-negative and sum to {1}, so {w} is a convex combination of the roots {z_1,...,z_n}. \Box
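The weights appearing in the proof can be checked numerically. A small sketch (mine, assuming numpy):

```python
import numpy as np

# For a random polynomial f with roots z, every root w of f' should equal the
# convex combination sum_k alpha_k z_k with alpha_k proportional to
# 1/|w - z_k|^2, as in the proof (valid when w is not itself a root of f).
rng = np.random.default_rng(0)
z = rng.normal(size=6) + 1j * rng.normal(size=6)   # roots of f
f_coeffs = np.poly(z)                              # coefficients of f
for w in np.roots(np.polyder(f_coeffs)):           # roots of f'
    inv = 1.0 / np.abs(w - z) ** 2
    alpha = inv / inv.sum()                        # alpha_k >= 0, sums to 1
    assert np.isclose(np.sum(alpha * z), w, atol=1e-6)
```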

We will now look at one more theorem that looks at convex combinations.

Theorem 3 (Caratheodory’s theorem) Let {S \subset {\mathbb R}^d}. Then any point {x \in {conv}(S)} can be represented as a convex combination of at most {d+1} points of {S}.

Some observations before the proof:

  1. The {(d+1)} points depend on {x} in consideration.
  2. {d+1} points are necessary in general. Let {S} be three non-collinear points in {{\mathbb R}^2} and let {x} be the centroid of the triangle formed by these three points. Observe that all three points are necessary to obtain {x} as a convex combination.
  3. Observe that the {d+1} points are from {S} and not {{conv}(S)}.

Proof: Any point {y\in {conv}(S)} can be represented as a convex combination of some {m} points in {S}. So

\displaystyle y=\sum_{i=1}^m \alpha_i x_i,

where {x_i \in S}, {\alpha_i\geq 0}, and {\sum_{i=1}^m \alpha_i=1}. If {m \leq d+1}, we are done. Hence we will only consider the case {m > d+1}. Let {x_i =(x_i^1,x_i^2,....,x_i^{d})}. Now consider the system of equations {\sum_{i=1}^m\gamma_i x_i=0} and {\sum_{i=1}^m \gamma_i=0}. This system of equations can be put in matrix form as

\displaystyle  \left[ \begin{array}{cccc} x_1^1 & x_2^1 &...&x_m^1 \\ x_1^2 & x_2^2 &...&x_m^2 \\ \vdots&\vdots&\vdots&\vdots\\ x_1^d & x_2^d &...&x_m^d \\ 1&1&...&1 \end{array} \right]\left[ \begin{array}{c} \gamma_1\\ \gamma_2\\ \vdots\\ \gamma_m \\ \end{array} \right] =o_{d+1 \times 1}

Since the matrix is of size {(d+1)\times m} and {d+1 < m}, there are non-trivial solutions of this equation (by the rank-nullity theorem, {\dim(\mathcal{N}(A))\geq m-(d+1)>0}). From now on, let {\gamma_1, ....,\gamma_m} be one such non-trivial solution. Now let {\tau >0}. Since {\sum_{i=1}^m \gamma_i x_i =0}, the point {y} can be represented as

\displaystyle y = \sum_{i=1}^m \alpha_i x_i -\tau \sum_{i=1}^m \gamma_i x_i = \sum_{i=1}^m (\alpha_i -\tau \gamma_i)x_i

Also observe that {\sum_{i=1}^m (\alpha_i -\tau \gamma_i) =1} since {\sum_{i=1}^m \gamma_i =0}. Now we will choose {\tau} such that {\alpha_i -\tau\gamma_i \geq 0} for all {i}, with equality for at least one term, so that {\sum_{i=1}^m (\alpha_i -\tau \gamma_i)x_i} becomes a convex combination of fewer points. Since the solution is non-trivial and {\sum_{i=1}^m \gamma_i =0}, at least one {\gamma_i} is positive. If {\gamma_i \leq 0}, then {\alpha_i-\tau\gamma_i \geq0} automatically. So we need to consider only the terms with {\gamma_i >0}. For all {i} such that {\gamma_i >0}, we want

\displaystyle \alpha_i-\tau\gamma_i \geq 0,

which implies

\displaystyle \frac{\alpha_i}{\gamma_i}-\tau \geq 0.

Hence if we choose {\tau =\min\{\frac{\alpha_i}{\gamma_i}: \gamma_i >0\}}, then one term {\alpha_i -\tau\gamma_i} equals {0} and every other term is non-negative. So we are able to express

\displaystyle y = \sum_{i=1}^m (\alpha_i -\tau \gamma_i)x_i,

where one of the coefficients is zero, i.e., as a convex combination of {m-1} points. We can continue this procedure till {m\leq d+1}. \Box
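The proof is constructive and translates directly into an algorithm: find a null-space direction, shift the weights, repeat. A minimal sketch (mine, assuming numpy; the function name is hypothetical):

```python
import numpy as np

def caratheodory_reduce(points, weights, tol=1e-12):
    """Rewrite a convex combination of m points in R^d as one using at most
    d+1 of them: repeatedly find gamma with sum_i gamma_i x_i = 0 and
    sum_i gamma_i = 0, then shift weights by tau*gamma until one vanishes."""
    pts = np.asarray(points, dtype=float)            # shape (m, d)
    w = np.array(weights, dtype=float)
    while (w > tol).sum() > pts.shape[1] + 1:
        idx = np.where(w > tol)[0]
        A = np.vstack([pts[idx].T, np.ones(len(idx))])   # (d+1) x k, k > d+1
        gamma = np.linalg.svd(A)[2][-1]                  # null-space vector
        if not (gamma > tol).any():
            gamma = -gamma                               # ensure some gamma_i > 0
        pos = gamma > tol
        tau = np.min(w[idx][pos] / gamma[pos])           # tau from the proof
        w[idx] -= tau * gamma
        w[w < tol] = 0.0
    return w

rng = np.random.default_rng(0)
pts = rng.normal(size=(10, 2))                      # 10 points in the plane
w = rng.dirichlet(np.ones(10))                      # random convex weights
w2 = caratheodory_reduce(pts, w)
assert np.allclose(w2 @ pts, w @ pts) and (w2 > 0).sum() <= 3
```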

We will now see a simple application of Caratheodory’s theorem.

Lemma 4 If a set {S\subset {\mathbb R}^d} is compact, then its convex hull {{conv}(S)} is also compact.

Proof: Consider a sequence {\{x_n\} \subset {conv}(S)}. If we show that there is a convergent subsequence which converges to a point in {{conv}(S)}, then we are done.

By Caratheodory’s theorem, each {x_i \in {conv}(S)} can be represented as a convex sum of {d+1} points in {S} (allowing zero weights). So

\displaystyle x_i =\sum_{k=1}^{d+1}{\alpha_k^i}y_k^i, \quad \sum_{k=1}^{d+1}\alpha_k^i =1 \text{ and } y_k^i \in S

Now consider the following matrix of dimensions {\infty \times 2(d+1)}

\displaystyle  \left[ \begin{array}{ccccccc} y_1^1 & \alpha_1^1& y_2^1 &\alpha_2^1&...&y_{d+1}^1&\alpha_{d+1}^1 \\ y_1^2& \alpha_1^2& y_2^2 &\alpha_2^2&...&y_{d+1}^2&\alpha_{d+1}^2 \\ y_1^3 & \alpha_1^3& y_2^3 &\alpha_2^3&...&y_{d+1}^3&\alpha_{d+1}^3\\ \vdots&\vdots&\vdots&\vdots&\vdots&\vdots&\vdots\\ y_1^n & \alpha_1^n& y_2^n &\alpha_2^n&...&y_{d+1}^n&\alpha_{d+1}^n\\ \vdots&\vdots&\vdots&\vdots&\vdots&\vdots&\vdots\\ \end{array} \right]

Now consider the first column, {\{y_1^k\}_{k\geq 1}}. Since this sequence lies in {S} and {S} is compact, there is a subsequence that converges to a point {y_1 \in S}. The indices of this subsequence are denoted by {I_1 \subset \{1,2,3,...\}}. Now consider the second column, but only the rows corresponding to the index set {I_1}. Since {\alpha_1^{k} \in [0,1]}, a compact set, there exists a subsequence of {\{\alpha_1^{k}\}_{k\in I_1}} that converges to some point {\alpha_1} in {[0,1]}. Denote the indices corresponding to this subsequence by the set {I_2 \subset I_1}. Now consider the third column, {y_2^{k}}, but only the rows corresponding to {I_2}. Since {\{y_2^{k}\} \subset S} and {S} is compact, we can find a subsequence which converges to some {y_2 \in S}. Denote the set of indices corresponding to this subsequence by {I_3 \subset I_2}. Continue this till all the {2(d+1)} columns are exhausted. So we will finally have an index set {I_{2(d+1)}} corresponding to the last column.

Now consider the original sequence {\{x_n\}}. It is easy to see that the subsequence corresponding to the index set {I_{2(d+1)}} converges to the point

\displaystyle \sum_{i=1}^{d+1}\alpha_i y_i.

which belongs to the set {{conv}(S)} proving the result. \Box

1. Convex Sets

Definition 1 (Line) Let {x_1} and {x_2} be two points. Then the line passing through {x_1} and {x_2} can be represented as

\displaystyle  y=\theta x_1 + (1-\theta) x_2,\ \theta \in {\mathbb R}.

The value {\theta =1} corresponds to the point {x_1} and {\theta=0} to the point {x_2}. Also observe that for a line, {\theta} can take any real value. When {\theta} is restricted to {[0,1]}, i.e., {\theta \in[0,1]}, we obtain the line segment connecting {x_1} and {x_2}. We denote the line segment between {x_1} and {x_2} as {[x_1,x_2]}. Let us see some examples:

  1. The concept of line and line segment is very clear if {x} and {y} belong to {{\mathbb R}^d}.
  2. Let {X \in S} and {Y \in S}. Then it is clear that {\theta X \in S} for all {\theta\in {\mathbb R}}. Also, {\theta X +(1-\theta)Y \in S}. Hence the concept of a line is well defined on the set {S} of symmetric matrices.
  3. Let {X \in S_{++}} and {Y \in S_{++}}. Then {\theta X \in S_{++}} for {\theta > 0}. From the properties of positive definite matrices, it follows that

    \displaystyle \theta X +(1-\theta)Y \in S_{++},\ \ \theta \in [0,1].

    So the notion of line segment is well defined on {S_{++}}.

A set {C} is convex if the line segment joining any two points in {C} lies in {C}. Mathematically, the set {C} is convex if and only if {x, y\in C} implies {[x,y]\subseteq C}. First observe that convexity is not a topological property, that is, it does not depend on the norm. Let us see some examples:

  1. the empty set {\emptyset} is always convex.
  2. Provide other geometrical examples.
  3. The set

    \displaystyle \Delta=\{(x_1,x_2,...,x_{d+1}): x_1+x_2+...+x_{d+1}=1, x_i\geq 0 \}

    is a convex set. This is also called the {d}-dimensional simplex. Observe that this set corresponds to the PMFs of all discrete random variables taking {d+1} values.

  4. Any ball {B(o,R)} is convex. Let {x} and {y} belong to the ball {B(o,R)}. Then by the property of the norm,

    \displaystyle  \begin{array}{rcl}  \|\theta x+ (1-\theta)y\| &\leq \|\theta x\| + \|(1-\theta)y\|\\ &=\theta\| x\| + (1-\theta)\|y\|\\ &\leq \theta R +(1-\theta)R =R \end{array}

    This implies {\theta x+ (1-\theta)y \in B(o,R)} which implies {B(o,R)} is convex. In general, {B(x,R)} is a convex set for any {x} and {R} (Home work). Observe that the ball {B(x,R)} can be represented as (prove it)

    \displaystyle B(y,R) = \{y+Rx:\ \|x\|\leq 1\}.

  5. Ellipsoid: Prove that the set {\{y+Ax:\ \|x\|\leq 1\}}, where {A \succ 0 } is a convex set.
  6. Hyperplane and halfspace: Let {a\in {\mathbb R}^d} and {b\in {\mathbb R}}, then the set

    \displaystyle \{x: a^tx=b\}

    is called a hyperplane. The set

    \displaystyle \{x: a^tx\leq b\}

    is the halfspace.

  7. Systems of linear equations or Polyhedron: Let {c_i} be vectors in {{\mathbb R}^d} and {\beta_i} be some scalars. Then the set

    \displaystyle P=\{x\in {\mathbb R}^d: c_i^T x \leq \beta_i,\ i=1,2,...,n\},

    is called a polyhedron. Succinctly, this can be represented as {Cx\preceq \beta}.

  8. The set of positive definite matrices {S_{++}} (and similarly {S_+}) is convex. Proof: Let {Z} and {Y} belong to {S_{++}}, let {\theta \in [0,1]}, and let {x \neq o}. Then

    \displaystyle x^t(\theta Z+(1-\theta)Y )x = \theta x^t Zx + (1-\theta) x^t Y x > 0.

  9. Voronoi tessellation: Let {x_1, x_2,\hdots, x_n} denote {n} points in {{\mathbb R}^d}. For any point {x_i}, let {V(x_i)} denote the Voronoi region of the point {x_i}, i.e.,

    \displaystyle V(x_i) = \{x: \|x-x_i\| \leq \|x-x_j\|, \forall j=1,...,n\}.

    Prove that {V(x_i)} is a convex set.

Definition 2 (Convex combination) Let {\{x_1,x_2,\hdots, x_n\}\subset {\mathbb R}^d} denote a set of {n} points. Let {\beta_i \geq 0} and {\sum_{i=1}^n \beta_i =1}. Then

\displaystyle x=\sum_{i=1}^n \beta_i x_i,

is called a convex combination of the points {x_1,\hdots,x_n}.

Theorem 3 A set {A} is convex if and only if it contains all convex combinations of its points.

Proof: It is easy to see that if a set contains all convex combinations of its points, then the set is convex.

Now to prove the other way, we use induction on the number of points. By the definition, it is true for {n=2}. Let us assume it is true for any {n} points, i.e., any convex combination of {n} points lies in the set. Now consider {n+1} points {\{x_1,x_2,...,x_{n+1}\}}. Consider

\displaystyle x=\sum_{i=1}^{n+1} a_i x_i,

such that {a_i\geq 0} and {\sum_{i=1}^{n+1}a_i=1}. If {a_{n+1}=1}, then {x=x_{n+1}\in A} and we are done, so assume {a_{n+1}<1}. Alternatively,

\displaystyle x=\sum_{i=1}^{n+1} a_i x_i= a_{n+1}x_{n+1}+(1-a_{n+1})\left(\sum_{i=1}^n \frac{a_i}{1-a_{n+1}} x_i \right).

Observe that by the induction assumption, {y=\sum_{i=1}^n \frac{a_i}{1-a_{n+1}} x_i \in A}, since {\sum_{i=1}^n\frac{a_i}{1-a_{n+1}} = \frac{1-a_{n+1}}{1-a_{n+1}}=1}. So,

\displaystyle x=a_{n+1}x_{n+1}+(1-a_{n+1})y,

is a two-point convex combination of {x_{n+1}} and {y}, and hence {x} belongs to the set {A}.

\Box

We will now define a process of convexifying any set.

Definition 4 (Convex hull) The set of all convex combinations of points from a set {A} is called the convex hull of {A}. It is denoted by {\text{conv}(A)}.

Similar to a closure of a set, the convex hull of a set is the smallest convex set containing {A}. We will now see some examples.

  1. Some basic geometric examples.
  2. Birkhoff–von Neumann Theorem: Consider the set of permutation matrices {\{X_\sigma, \sigma \in P_n\}}. Then the convex hull of this set is the set of all doubly stochastic matrices.
Hw 1 Prove that the convex hull of a set is convex.

1. Linear Algebra

Let {A\in {\mathbb R}^{m\times n}} be a real valued matrix. The matrix {A} can be considered as a linear operator from {{\mathbb R}^n} to {{\mathbb R}^m}. Let us see what a linear operator is: a function {f:{\mathbb R}^n \rightarrow {\mathbb R}^m} is called a linear operator if for any {x,y \in {\mathbb R}^n} and {a, b \in {\mathbb R}},

\displaystyle f(ax+by)=af(x)+bf(y).

It turns out that the only linear operators from {{\mathbb R}^n} to {{\mathbb R}^m} are matrix operators, i.e., they are of the form

\displaystyle f(x) = Ax, \quad A\in {\mathbb R}^{m\times n}, x\in {\mathbb R}^n.

For every such linear map, there are a couple of important associated subspaces.

  1. Range space: The range space of the matrix is a subspace of {{\mathbb R}^m} and is defined as

    \displaystyle \mathcal{R}(A)=\{y:\ y \in {\mathbb R}^m, y = Ax \text{ for some } x\in {\mathbb R}^n\}.

    A set is a subspace if for any scalars {a, b} and any {x, y} in the set, {ax+by} also belongs to the set.

    Let {a_1, a_2, ..., a_n} be the columns of {A}, i.e., {A =[a_1, a_2,\hdots, a_n]}. For {x\in{\mathbb R}^n}, {x=(x_1,x_2,...,x_n)},

    \displaystyle Ax = a_1 x_1 + a_2 x_2+\hdots+a_n x_n.

    This means that {Ax} is a linear combination of the columns of {A}. Alternatively {\mathcal{R}(A)} is the space spanned by the columns of {A}. Hence the Range space of {A} is also the column space of {A}.

     

  2. Null space or kernel: The null space of {A} is a subspace of {{\mathbb R}^n} defined as

    \displaystyle \mathcal{N}(A)=\mathcal{K}(A)= \{x :\ Ax =o, x\in {\mathbb R}^n\}.

    Essentially, the null space is the set of solutions to the homogeneous equation {Ax=o}. Let {\hat{a_1}, \hat{a_2},\hdots, \hat{a_m}} denote the rows of {A}. Then

    \displaystyle Ax=\left[ \begin{array}{c} \hat{a_1} x\\ \hat{a_2} x\\ \vdots\\ \hat{a_m} x \end{array} \right].

    So {Ax=o} requires that {\hat{a_1}x=0}, {\hat{a_2}x=0}, {\hdots}, {\hat{a_m}x=0}. This implies that {x \perp \hat{a_1}}, {x \perp \hat{a_2}}, {\hdots}, {x \perp \hat{a_m}}. So for any scalars {c_1, c_2, \hdots, c_m \in {\mathbb R}}, by the linearity of the inner product,

    \displaystyle (c_1 \hat{a_1}+c_2\hat{a_2}+\hdots+c_m\hat{a_m})x =0.

    So {x \in \mathcal{N}(A)} if and only if {x} is orthogonal to the subspace spanned by the rows of the matrix {A}. By the same argument, the set of {x \in {\mathbb R}^m } such that {A^tx=o} is the orthogonal complement of the row space of {A^t}, which is the column space of {A}, i.e., {\mathcal{R}(A)}. So

    \displaystyle \mathcal{R}(A)\perp \mathcal{N}(A^t)

The following is a fundamental theorem in linear algebra

Theorem 1 Let {A \in {\mathbb R}^{m\times n}}. Then

\displaystyle \mathcal{R}(A)\stackrel{\perp}{\bigoplus} \mathcal{N}(A^t) = {\mathbb R}^m

This theorem means that any vector of {{\mathbb R}^m} can be expressed uniquely as the sum of two vectors, one belonging to { \mathcal{R}(A)} and the other to {\mathcal{N}(A^t)}.

Lemma 2 (Fredholm Alternative) {Ax=b} has a solution if and only if {y^T b=0} for every {y} such that {A^Ty=o}, i.e., if and only if {b\in \mathcal{R}(A)=\mathcal{N}(A^t)^\perp}.
Hw 1 Let {A\in {\mathbb R}^{m\times n}}. Show that {\mathcal{R}(A)} and {\mathcal{N}(A)} are subspaces.
Lemma 3 Consider the solution of {Ax=b}. Let {b \in {\mathbb R}^m} belong to the range space of {A}. Let {x'} be a solution to {Ax=b}. Then the set of all solutions to the equation {Ax =b} is

\displaystyle \{x'+y: y\in \mathcal{N}(A)\}.

Proof: We have {Ax'=b}. So the equation to be solved can be reformulated as

\displaystyle Ax = Ax' \iff A(x-x')=o.

So any solution of the original equation is of the form {x-x'=y}, where {y} belongs to the kernel of {A}. Hence {x=x'+y}. \Box

Some observations:

  1. So {Ax=b} has a unique solution only when the kernel of {A} contains only the zero vector, i.e., when the dimension of {\mathcal{K}(A)} is zero.
  2. Also observe that the map {x \mapsto Ax} is one-to-one only if {\mathcal{N}(A)=\{o\}}. (A numerical sketch follows this list.)
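Lemma 3 translates directly into a numerical experiment. A minimal sketch (mine, assuming numpy): one particular solution from least squares plus any null-space vector is again a solution.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 5))
x_true = rng.normal(size=5)
b = A @ x_true                                   # b lies in the range of A

x0, *_ = np.linalg.lstsq(A, b, rcond=None)       # one particular solution x'
U, s, Vh = np.linalg.svd(A)
null_basis = Vh[s.size:].T                       # columns span N(A)

y = rng.normal(size=null_basis.shape[1])
x1 = x0 + null_basis @ y                         # x' + (element of N(A))
assert np.allclose(A @ x1, b)                    # still solves Ax = b
```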

We will now see when the linear map {Ax:{\mathbb R}^n \rightarrow {\mathbb R}^m} is continuous.

Theorem 4 The map {x\mapsto Ax:{\mathbb R}^n \rightarrow {\mathbb R}^m} is a continuous function (note that any matrix norm of {A} is finite).

Proof: Let {x_n \rightarrow x} in {{\mathbb R}^n} and let {\|\cdot\|} denote an induced matrix norm. We have to show that {Ax_n \rightarrow Ax} in {{\mathbb R}^m}. Then

\displaystyle \|Ax_n -Ax\| = \|A(x_n-x)\|\stackrel{(a)}{\leq}\|A\|\|x_n-x\| .

Since {\|x_n -x\| \rightarrow 0} and {\|A\|<\infty}, {\|Ax_n -Ax\| \rightarrow 0} which proves the convergence and hence continuity.

Also since all norms are equivalent, it is sufficient that {\|A\|} be finite with respect to one norm. \Box

We will now describe the rank of a matrix. The rank of a matrix is the dimension of the range space of {A}. The following is a fundamental theorem of linear algebra

Theorem 5 The dimension of the range (or the column space) of a matrix {A} equals the dimension of the range of {A^T} (or the row space), and this common dimension is the rank of the matrix. Equivalently, the dimension of the row space equals that of the column space.

Proof: Follows from the rank decomposition {A=CF}, where the columns of {C} are a maximal set of independent columns of {A}. …… \Box

Theorem 6 (Rank-nullity theorem) Let {A \in {\mathbb R}^{m\times n}}. Then

\displaystyle \dim(N(A))+\dim(\mathcal{R}(A))=n.

Proof: Let {k} denote the dimension of the null space, {k=\dim(N(A))}. Choose a basis {\{v_1,v_2, \hdots v_k\} \subset {\mathbb R}^n} of {N(A)}. We can add vectors {\{v_{k+1},v_{k+2},...,v_{n}\}} so that {\{v_{1},\hdots,v_n\}} forms a basis of {{\mathbb R}^n}. From the definition of the range space,

\displaystyle \mathcal{R}(A)= \{Av:\ v\in \text{span}(v_{1},\hdots,v_n)\} = \text{span}(Av_{1},Av_2,...,Av_n),

and since {Av_1=Av_2=\hdots=Av_k=o},

\displaystyle \mathcal{R}(A)= \text{span}(Av_{k+1},\hdots,Av_n).

Observe that the dimension of the right hand side is {n-k} provided the vectors {Av_{k+1},\hdots,Av_n} are linearly independent, which we now verify. Suppose, for contradiction, that there exist scalars {c_{k+1},c_{k+2},...,c_n}, not all zero, such that

\displaystyle c_{k+1}Av_{k+1}+c_{k+2}Av_{k+2}+...+c_nAv_n=o

Which implies

\displaystyle A(c_{k+1}v_{k+1}+c_{k+2}v_{k+2}+...+c_nv_n)=o,

i.e., {(c_{k+1}v_{k+1}+c_{k+2}v_{k+2}+...+c_nv_n) \in \mathcal{N}(A)}. Hence there exist {c_1, ....,c_k} such that

\displaystyle c_{k+1}v_{k+1}+c_{k+2}v_{k+2}+...+c_nv_n= c_1v_1+....+c_kv_k,

which implies

\displaystyle -c_1v_1-....-c_kv_k+c_{k+1}v_{k+1}+c_{k+2}v_{k+2}+...+c_nv_n=o,

which is a contradiction, since {v_1,....,v_n} form a basis of {{\mathbb R}^n} and the scalars are not all zero. Hence {Av_{k+1},\hdots,Av_n} are linearly independent, {\dim(\mathcal{R}(A))=n-k}, and the theorem is proved. \Box
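A numerical spot-check of the theorem (my sketch, assuming numpy and scipy are available):

```python
import numpy as np
from scipy.linalg import null_space

rng = np.random.default_rng(1)
A = rng.normal(size=(4, 7))            # a 4 x 7 matrix, so n = 7
rank = np.linalg.matrix_rank(A)        # dim R(A)
nullity = null_space(A).shape[1]       # dim N(A), basis computed via SVD
assert rank + nullity == A.shape[1]    # rank-nullity: equals n
```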

Lemma 7 The linear map {x\mapsto Ax :{\mathbb R}^n \rightarrow {\mathbb R}^n} is invertible if and only if either of the following equivalent conditions holds:
  • {\mathcal{N}(A)=\{o\}}, i.e., {A} is one-one.
  • {A} is onto, i.e., {\dim\mathcal{R}(A) =n}.

Proof: Follows from the rank nullity theorem. \Box

So the homogeneous system of equations {Ax=o} has a non-trivial solution only when {\text{rank}(A) <n}.

1.1. Determinant

The determinant of a matrix {A} is a function from the space of square matrices to real numbers. One definition of determinant is that it is the volume of the parallelepiped formed by the row or column vectors.

(Leibniz formula)

\displaystyle \det(A)=\sum_{\sigma \in \Pi_n}\text{sgn}(\sigma)\prod_{i=1}^nA_{i\sigma(i)}

Look at the Wikipedia page on determinants for more details about the permutation group {\Pi_n} and the sign of a permutation {\text{sgn}(\sigma)}. From this definition it follows that the determinant is a continuous function of the matrix entries. More formally, the determinant is the unique function on {n\times n} matrices that is {n}-linear and alternating in the columns and takes the value {1} on the identity matrix. Some properties of determinants (numerically spot-checked after the list):

  1. {\det(AB)=\det(A)\det(B)}.
  2. {\det(I+AB) =\det(I+BA)}
  3. {\det(A^{-1}) = \frac{1}{\det(A)}}, when {A} is invertible.
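A quick numerical spot-check of these properties (my sketch; assumes numpy):

```python
import numpy as np

rng = np.random.default_rng(2)
A, B = rng.normal(size=(4, 4)), rng.normal(size=(4, 4))
I = np.eye(4)
assert np.isclose(np.linalg.det(A @ B), np.linalg.det(A) * np.linalg.det(B))
assert np.isclose(np.linalg.det(I + A @ B), np.linalg.det(I + B @ A))
assert np.isclose(np.linalg.det(np.linalg.inv(A)), 1 / np.linalg.det(A))
```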

1.2. Eigenvalues

A vector {u \neq o} is an eigenvector of an {n \times n} matrix {A} if there exists a scalar {\lambda} such that

\displaystyle Au=\lambda u.

The scalar {\lambda} is called an eigenvalue of the matrix {A}. The eigenvalues can alternatively be characterized as the roots of the polynomial equation

\displaystyle \det(A-\lambda I)=0.

The set of eigenvectors corresponding to a fixed eigenvalue, together with the zero vector, forms a subspace (why?) called the eigenspace. If there are {n} independent eigenvectors, then the matrix {A} can be decomposed as

\displaystyle A= U\Lambda U^{-1},

where the {n} eigenvectors form the columns of {U}. Some facts about eigenvalues (numerically spot-checked after the list):

  1. {\det(A)=\prod_{i=1}^n \lambda_i}.
  2. {\text{tr}(A) =\sum_{i=1}^n \lambda_i}
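Both facts are easy to verify numerically (a sketch, assuming numpy; the eigenvalues are complex in general):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(5, 5))
lam = np.linalg.eigvals(A)                        # eigenvalues of A
assert np.isclose(np.prod(lam), np.linalg.det(A))  # det = product of eigenvalues
assert np.isclose(np.sum(lam), np.trace(A))        # trace = sum of eigenvalues
```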

1.3. Symmetric matrices

A matrix is symmetric if {A^T =A}. The space of Symmetric {n\times n} matrices is denoted by {S}. We have the following properties of a symmetric matrix

  1. The dimension of the space {S} of symmetric matrices is {n(n+1)/2}.
  2. The eigenvalues of a symmetric matrix are real. (The eigenvectors are real and non trivial if restricted to {{\mathbb R}^n})
  3. Eigenvectors corresponding to different eigenvalues are orthogonal.
  4. Indeed {A} can be decomposed as {U^T\Lambda U} for a real orthogonal matrix {U^TU=I}.
  5. If {u_i} denote the normalized eigenvectors of {A}, then {A= \lambda_1 u_1u_1^T + \lambda_2 u_2u_2^T +...+ \lambda_n u_nu_n^T }.

1.4. Rayleigh quotient

Let {A} be a symmetric real matrix. Then the Rayleigh quotient is defined as

\displaystyle R(x) = \frac{x^TAx}{x^Tx}.

It turns out that Rayleigh quotient is a very important quantity in the analysis of eigenvalues. For example the following lemma characterizes the maximum eigenvalue of a symmetric matrix.

Lemma 8 Let {\lambda_{\max}} denote the maximum eigenvalue of the symmetric matrix {A}. Then

\displaystyle \lambda_{\max}(A) = \sup\{R(x): x\in {\mathbb R}^n, x\neq o\},

\displaystyle \lambda_{\min}(A) = \inf\{R(x): x\in {\mathbb R}^n, x\neq o\}.

Proof:Let {U ^T \Lambda U} be the eigen decomposition of {A}. Then

\displaystyle R(x) = \frac{x ^TU^ T\Lambda U x}{x^Tx}.

Since {U^TU=I}, {x^Tx = x^TU^TUx}. So

\displaystyle R(x) = \frac{x ^TU^ T\Lambda U x}{x^TU^TUx}= \frac{y^T\Lambda y}{y^T y},

where {y=Ux}. Observe that {U} is invertible and hence onto. So as {x} ranges over the whole of {{\mathbb R}^n\setminus\{o\}}, {y} also ranges over the whole of {{\mathbb R}^n\setminus\{o\}}. Hence

\displaystyle \sup _{x\in {\mathbb R}^n} R(x)=\sup _{y\in {\mathbb R}^n} \frac{y^T\Lambda y}{y^T y}

This equals

\displaystyle \sup _{x\in {\mathbb R}^n} R(x)= \sup_y\frac{\sum_{i=1}^n\lambda_i y_i^2}{\sum_{i=1}^n y_i^2}=\sup_{p_i\geq0, \sum_{i=1}^n p_i=1}\sum_{i=1}^n \lambda_i p_i =\max_i{\lambda_i}.

The result for the minimum eigenvalue follows similarly. \Box
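The lemma is easy to test numerically. A small sketch (mine, assuming numpy): the top eigenvector attains {\lambda_{\max}}, and random vectors never exceed it.

```python
import numpy as np

rng = np.random.default_rng(4)
M = rng.normal(size=(6, 6))
A = (M + M.T) / 2                          # a random symmetric matrix
lam, U = np.linalg.eigh(A)                 # eigenvalues in ascending order
R = lambda x: x @ A @ x / (x @ x)          # Rayleigh quotient
assert np.isclose(R(U[:, -1]), lam[-1])    # top eigenvector attains lambda_max

xs = rng.normal(size=(1000, 6))            # random nonzero test vectors
vals = np.einsum('ij,jk,ik->i', xs, A, xs) / np.einsum('ij,ij->i', xs, xs)
assert vals.max() <= lam[-1] + 1e-9 and vals.min() >= lam[0] - 1e-9
```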

Hw 2 How do you characterize the other eigenvalues?

Positive definite matrices

A matrix {A\in S} is positive definite if for any {x\neq o} and {x\in {\mathbb R}^n},

\displaystyle x^tAx >0.

This is denoted by {A \succ 0}. The set of positive definite matrices is denoted by {S_{++}}.

A small digression as to why we consider only subsets of {S}: technically, a positive definite matrix need not be symmetric. Every matrix can be represented as

\displaystyle A= B+C=\frac{1}{2}(A+A^T)+\frac{1}{2}(A-A^T).

Observe that {C=(A-A^T)/2} is skew symmetric, \ie, {C=-C^T}. Also {B=B^T}. Hence

\displaystyle <x, Ax> =<x, Bx>+<x,Cx>.

We have

\displaystyle <x, Cx> =<C^Tx,x>=-<Cx,x>.

This implies {<x,Cx>=0}. So, {<x,Ax>=<x,Bx>}. Hence, to understand the positive definiteness of a matrix, it suffices to consider only its symmetric part.

Some important properties:

  1. A matrix {A} is positive definite if and only if all its eigenvalues are positive.
  2. If {A \in S_{++}} then {\det(A)>0} and hence invertible.
  3. Let {A_S} be the matrix obtained by removing the rows and the columns indexed by {S}. Then {A_S \succ 0}. This implies all the diagonal elements should be greater than {0}.
  4. There exists a unique matrix called the square root such that {A= BB^T}. It is easy to see that {B= U^T \sqrt{\Lambda}U}.
  5. If {A,B \in S_{++}} then {A+B\in S_{++}}
  6. If {A \in S_{++}}, {A^m \in S_{++}}.
  7. {A \in S_{++}} if and only if for any invertible matrix {M}, {MAM^T \in S_{++}}

(Positive semidefinite matrices) A matrix {A\in S} is positive semidefinite if for any {x\in {\mathbb R}^n},

\displaystyle x^tAx \geq0.

This is denoted by {A \succeq 0}. The set of positive semidefinite matrices is denoted by {S_+}.

  1. A matrix {A} is positive semidefinite if and only if all its eigenvalues are non-negative.
  2. Let {A_D} be the matrix obtained by removing the rows and the columns indexed by {D}. Then {A_D \succeq 0}. This implies all the diagonal elements should be non-negative.
  3. There exists a unique matrix called the square root such that {A= BB^T}. It is easy to see that {B= U^T \sqrt{\Lambda}U}.

1.5. Schur Complement

Let {X \in S }. If {X} is partitioned as

\displaystyle \left( \begin{array}{ccc} A &B\\ B^T & C \\ \end{array}\right)

Then if {\det(C) \neq 0}, the Schur complement of {C} in {X} is defined as

\displaystyle S_c=A-BC^{-1}B^T.

We have the following lemma

Lemma 9 Let {X} be a symmetric matrix that can be partitioned as

\displaystyle \left( \begin{array}{ccc} A &B\\ B^T & C \\ \end{array}\right),

Then,

  • The matrix {X\succ 0} if and only if {C\succ0} and {A-BC^{-1}B^T \succ 0}.
  • If {C \succ 0}, then {X \succeq 0} if and only if {A-BC^{-1}B^T \succeq 0}.

Proof:

\displaystyle X= \left( \begin{array}{ccc} I &BC^{-1}\\ 0 & I \\ \end{array}\right)\left( \begin{array}{ccc} A-BC^{-1}B^T &0\\ 0 & C \\ \end{array}\right)\left( \begin{array}{ccc} I &BC^{-1}\\ 0 & I \\ \end{array}\right)^T

Also we have

\displaystyle \left( \begin{array}{ccc} I &BC^{-1}\\ 0 & I \\ \end{array}\right)^{-1} = \left( \begin{array}{ccc} I &-BC^{-1}\\ 0 & I \\ \end{array}\right).

So

\displaystyle \left( \begin{array}{ccc} I &-BC^{-1}\\ 0 & I \\ \end{array}\right) X \left( \begin{array}{ccc} I &-BC^{-1}\\ 0 & I \\ \end{array}\right)^T =\left( \begin{array}{ccc} A-BC^{-1}B^T &0\\ 0 & C \\ \end{array}\right)

Since {MDM^T \succ 0} if and only if {D \succ 0} for any invertible matrix {M}, it follows that {X \succ 0} if and only if {C \succ 0} and {A-BC^{-1}B^T \succ 0}. \Box
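A numerical sanity check of the first bullet (my sketch, assuming numpy; the partition sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
M = rng.normal(size=(6, 6))
X = M @ M.T + 6 * np.eye(6)                # a random positive definite matrix
p = 2                                      # partition size for the A block
A, B, C = X[:p, :p], X[:p, p:], X[p:, p:]
S = A - B @ np.linalg.inv(C) @ B.T         # Schur complement of C in X
assert np.all(np.linalg.eigvalsh(C) > 0)   # C > 0
assert np.all(np.linalg.eigvalsh(S) > 0)   # Schur complement > 0
```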

1. Compact sets

We will now move to an important class of sets. These sets are desirable (to most analysts) since they are very nice and easy to work with. There are many definitions of compact sets; since we are in {{\mathbb R}^d}, we will use a sequential definition. There are alternative ways to define compact sets, however we will not concentrate on them (see the appendix for an alternative definition).

We first begin with the notion of a subsequence. Given a sequence {\{x_n\}}, a subsequence is denoted by {\{x_{n_k}\}}, where {n_1< n_2 < n_3 < ...}.

A set {A\subset {\mathbb R}^d} is compact if and only if for any sequence {\{x_n\} \subset A}, there is a subsequence that converges to a point in {A}.

The above definition does not provide us with a clear picture of how compact sets look. The next theorem provides a better characterization of compact sets in {{\mathbb R}^d}.

Theorem 1 (Bolzano–Weierstrass theorem) A set {A\subset {\mathbb R}^d} is compact if and only if it is closed and bounded in {{\mathbb R}^d}

Before we go into the proof let us see some examples of compact sets.

  1. The set {[0,2]} is compact since it is closed and bounded.
  2. {(0,2]} is not compact since it is not closed.
  3. {[0,\infty)} is not compact since it is not bounded.
  4. The ball {B(0,R)} in {{\mathbb R}^d} is not compact since it is an open set. However the closure of the ball

    \displaystyle \overline{B(o,R)} = \{x: \|x\|\leq R , \ x\in {\mathbb R}^d\}

    is compact.

Proof: I will prove the theorem in {{\mathbb R}^2}. A similar proof with more notation will hold true for general {{\mathbb R}^d}. We will first prove that a closed and bounded set {A\subset {\mathbb R}^2} is compact. So we should prove that given any sequence in {A}, there exists a convergence subsequence with limit in {A}.

Let {\{x_n\} \subset A} be a sequence in {A}. Since {A} is bounded, it can be fitted in a big square, say {S_1=[-a, a]^2}, that is, {A\subset [-a,a]^2}. Now divide the square {S_1} into {4} equal parts, each a square of side length {a}. Since the sequence {x_n} has an infinite number of points, at least one of the squares must contain an infinite number of points of the sequence. Call that square {S_2}. Now divide the square {S_2} into {4} equal squares of side length {a/2}. Again, at least one of these squares will contain an infinite number of points. Pick one such square and call it {S_3}. Now again divide the square {S_3} into {4} equal squares and continue this procedure.

Observe that the side length of {S_n} is {2a/2^{n-1}}. We will pick the subsequence this way: pick any point of the original sequence from square {S_1} and denote it by {x_{n_1}}. Pick a second point of the original sequence with {n_2 > n_1} lying in square {S_2}. Call it {x_{n_2}}. Continue this way to obtain a subsequence {\{x_{n_k}\}}. You are always able to do this since there are an infinite number of points in any square we picked. Also observe the following: the sets {S_n} are monotonically decreasing, i.e., {S_{n+1}\subset S_n}, and they are closed. Since an arbitrary intersection of closed sets is closed, {\cap_{n=1}^\infty S_n } is closed. It turns out that the intersection of these sets contains exactly one point, which we denote by {x}, i.e., {\cap_{n=1}^\infty S_n =\{x\}} (it is interesting to prove this and we leave it to you).

The claim is that the subsequence {\{x_{n_k}\}} converges to the point {x} in {A}. It is sufficient to prove that the subsequence converges: the limit will automatically lie in the set {A} since the set {A} is closed. So let us now prove that the subsequence converges. Fix an {\epsilon >0}. It is easy to observe that for all {m>k}, {x_{n_m}} lies in {S_k}, and also {x} lies in {S_k}. Since the side length of {S_k} is {2a/2^{k-1}}, it follows that

\displaystyle \|x_{n_m}-x\|\leq\frac{2\sqrt{2}\,a}{2^{k-1}}

We can choose {k} large enough that {\frac{2\sqrt{2}\,a}{2^{k-1}} <\epsilon}, and hence {\|x_{n_m}-x\|<\epsilon} for all {m>k}, which implies {x_{n_m}\rightarrow x} as {m\rightarrow \infty}. This proves the existence of a subsequence of any sequence converging to a point in {A}.

The converse (showing that compact implies closed and bounded) is left as HW. \Box

In some sense compact sets are the best kind of sets to deal with. Since they are closed all the limit points exist inside the set and the set itself is bounded.

Hw 1 Are the following sets compact: (prove or disprove)

  1. {A_1= [5,10]\cup\{0\}}
  2. {A_2 = \{0,1/2,1/3,1/4,1/5,....\}} as a subset of {{\mathbb R}}
  3. {A_3 = \{x; x \in {\mathbb R}^d, \|x\|_\alpha \leq 1\}}
  4. {A_4 = \{x; x \in {\mathbb R}^d, Ax \geq b\}}, {A} is invertible {d \times d} matrix and {b\neq 0}
  5. {A_5 = A_4 \cap B(0,10)}
  6. {A_6 = (0,1)} as a subset of {{\mathbb R}^2}.
  7. {A_7 ={\mathbb R}}. Try to prove or disprove using the sequential definition of compactness.
Hw 2 Prove that the union of two compact sets is compact.
Hw 3 Let {S_i} be the squares defined in the previous theorem. Prove that {\cap_{i=1}^n S_i =\{x\}}. (Hint: Look at the coordinates of the corners of the square. See if they form a monotonic bounded sequence. )
Hw 4 If {A} is a compact set and {B} is a closed subset of {A}, i.e., {B\subset A} then {B} is compact.

If {A\subset {\mathbb R}} is a compact set, since {A} is bounded, {\sup(A)} and {\inf(A)} exist and since it is closed the {\sup} and {\inf} belong to the set {A}. We now come to the most important theorem from an optimization perspective.

Lemma 2 Let {U\subset {\mathbb R}^d} be a compact set and {f(x)} be a continuous function. Then the image of the set {U}, i.e., {f(U)}, is compact.

Proof: Let {x_n} be a sequence in the set {f(U)}. For each {n}, pick a pre-image {y_n \in U} such that {f(y_n)=x_n}. By the definition of compactness, there is some subsequence {y_{n_k}} that converges to a point {y} in {U}. Since the function {f} is continuous, {f(y_{n_k})} converges to {f(y)}. This means {x_{n_k}=f(y_{n_k})\rightarrow f(y)}, where {f(y)} is some point in {f(U)}. Hence for every sequence {x_n} there exists a subsequence {x_{n_k}} which converges to a point in {f(U)}. Hence the set {f(U)} is compact. \Box

We will see some examples soon, but first an important corollary.

Lemma 3 Let {A\subset {\mathbb R}^d} be a compact set and {f(x):A \rightarrow {\mathbb R}} be a continuous function. Then {f(x)} is bounded on the set {A} and attains its maximum and minimum in the set {A}.

Proof: Since {A} is a compact set and {f(x)} is continuous the image {f(A)\subset {\mathbb R}} is a compact set. Since {f(A)} is compact it is closed and bounded. Hence the supremum and the infimum of the set {f(A)} exist and they belong to the set. \Box

Lemma 4 Any two norms {\|.\|_a} and {\|.\|_b} on the Euclidean space {{\mathbb R}^d} are equivalent.

Proof: Let {S} be the unit sphere in {{\mathbb R}^d} with respect to the standard Euclidean norm {\|.\|_2}, i.e., {S=\{ x: \ \|x\|_2 =1\}}.

First claim: It suffices to prove that {\|.\|_a} and {\|.\|_b} are equivalent on {S}.

Reason: Suppose you are able to prove that {\|.\|_a} and {\|.\|_b} are equivalent on {S}, i.e.,

\displaystyle C_1\|x\|_a \leq \|x\|_b \leq C_2\|x\|_a, \ \forall \ x \text{ such that } \|x\|_2 =1. \ \ \ \ \ (1)

 

Let {y} be an arbitrary point in {{\mathbb R}^d}. Observe that {\|\frac{y}{\|y\|_2}\|_2 = \frac{\|y\|_2}{\|y\|_2} =1}, i.e., the point { \frac{y}{\|y\|_2} \in S}. Hence

\displaystyle C_1\|\frac{y}{\|y\|_2}\|_a \leq \|\frac{y}{\|y\|_2}\|_b \leq C_2\|\frac{y}{\|y\|_2}\|_a.

By the property of the norms,

\displaystyle \frac{C_1}{\|y\|_2}\|y\|_a \leq \frac{1}{\|y\|_2}\|y\|_b \leq \frac{C_2}{\|y\|_2}\|y\|_a,

which proves that

\displaystyle C_1\|y\|_a \leq \|y\|_b \leq C_2\|y\|_a,\quad \forall y \in {\mathbb R}^d.

We will now prove (1). Consider the function {f(x)= \frac{\|x\|_b}{\|x\|_a}} on {S}. Since {x\neq 0} on {S}, and {\|x\|_a} and {\|x\|_b} are continuous functions with respect to the standard Euclidean norm, their quotient is also continuous with respect to the Euclidean norm. Also observe that {S} is a closed and bounded set with respect to the Euclidean norm and hence compact. By the previous theorem the image {f(S)} is compact. Hence the supremum and infimum of the set {f(S)} are attained, and since {f>0} on {S}, the infimum {C_1} is strictly positive. Hence for any {x \in S},

\displaystyle C_1\leq f(x)\leq C_2,

which proves the result. \Box
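The equivalence constants can also be estimated numerically. A small sketch (mine, assuming numpy) for the {\ell_1} and {\ell_2} norms on {{\mathbb R}^5}, where {1 \leq \|x\|_1/\|x\|_2 \leq \sqrt{5}} holds:

```python
import numpy as np

rng = np.random.default_rng(6)
d = 5
xs = rng.normal(size=(100000, d))
xs /= np.linalg.norm(xs, axis=1, keepdims=True)    # points on the Euclidean unit sphere
ratio = np.abs(xs).sum(axis=1)                     # ||x||_1 when ||x||_2 = 1
print(ratio.min(), ratio.max())                    # approaches 1 and sqrt(5) ~ 2.236
```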

We will state the following theorem without proof:

Theorem 5 (Tychonoff’s theorem) If {A} and {B} are compact sets, then their Cartesian product {A\times B} is also compact. (This holds true for product of an arbitrary number of sets)

2. Appendix (optional)

The following characterization of compact sets is fundamental compared to the sequential definition as it depends only on the underlying topology (open sets)

2.1. An open cover description of compact sets

An open cover of a set {A} is a collection of open sets {Y_a, a \in I }, such that {A\subset \cup_{a\in I}Y_a}. In plain English, an open cover of {A} is a collection of open sets that cover the set {A}. Observe that the set {I} need not be finite. Let’s see some examples.

  • Let {A=[0,1]}. Then an open cover of {A} is the set of sets {\{(x-1,x+1); x\in [0,1]\}}. This is because {A\subset \cup_{x \in [0,1]}(x-1,x+1)}. Observe that the cardinality of the cover is infinite.
  • Let {A=[0,1]}. Another open cover of {A} is {\{(-0.5, 0.6), (0.5,1.5) \}} since {[0,1] \subset (-0.5, 0.6)\cup(0.5,1.5)}. Here the cardinality of the cover is finite.
  • Let {A=B(o,1)\subset {\mathbb R}^d}. Then the set {\{B(x,\epsilon): x \in B(o,1)\}} is an open cover since {A\subset \cup_{x\in B(o,1)}B(x,\epsilon)}. Here again the cardinality of the open cover is not finite.

A set {A} is compact if any open cover of {A} has a finite subcover. This means that given any open cover of the compact set {A}, you will be able to pick a finite number of sets from the open cover that still cover {A}. The emphasis is on the word any: you should be able to pick a finite subcover for any cover. Let’s look at some examples:

  1. Consider the set {A=[0,1]}. Then one open cover of {A} is the set of sets {\{(x-1,x+1); x\in [0,1]\}}. Let’s see if we can find a finite subcover. Observe that {(-1,1)} and {(0,2)} belong to the previous cover and {[0,1] \subset (-1,1)\cup (0,2)}. It turns out (we will see later) that {[0,1]} is compact.
  2. Consider the set {A=(0,1)}. Then one open cover of {A} is the set of sets {\{(x-1,x+1); x\in (0,1)\}}. Here too we can find a finite subcover: for example {(-1/2,3/2)} (taking {x=1/2}) already covers {(0,1)}. Let us see if we can find an open cover of {(0,1)} which has no finite subcover.

Consider the open cover {\{(1/n , 1): n=2,3,4,...\}}. This is an open cover of {(0,1)} since every point of {(0,1)} belongs to {(1/n,1)} for {n} large enough. You can check that, however much you try, you cannot pick a finite number of these sets to cover {(0,1)}: any finite subcollection has a largest {n}, say {N}, and misses the points in {(0,1/N]}. So {(0,1)} is not a compact set. As this example shows, the finite subcover property should hold for any open cover.

The following theorem shows that the earlier definition of compact sets based on sequences and this definition based on open sets are both equivalent on {{\mathbb R}^d}.

Theorem 6 (Heine–Borel theorem) A set {S\subset {\mathbb R}^d} is compact if and only if {S} is bounded and closed.

1. Continuous functions

A function {f(x)} is continuous at {x} if for any sequence {x_n \rightarrow x}, {f(x_n) \rightarrow f(x)}. A function is continuous if it is continuous at every point of its domain. This is a very intuitive definition. Observe that the definition does not restrict the function to real spaces. Let us now see an example of a discontinuous function to make the definition clear.

\displaystyle S(x)= \left\{\begin{array}{rl} -1, & x<0\\ 0, & x=0\\ 1, & x>0 \end{array}\right.

It is obvious that this function is discontinuous. However, we will use the definition to make this point. Consider the sequence {x_n =1/n}. We have that {\lim_{n\rightarrow \infty} 1/n =0}. We have

\displaystyle S(x_n)=S(1/n)=1,

since {1/n >0}. However we have

\displaystyle S(\lim_{n\rightarrow \infty}1/n) =S(0) =0.

So,

\displaystyle \lim_{n\rightarrow\infty}S(x_n)\neq S(\lim_{n\rightarrow \infty}1/n).

We are able to find a sequence {x_n \rightarrow 0} such that {\lim_{n\rightarrow \infty}S(x_n)\neq S(0)}, which implies the function {S(x)} is not continuous at {x=0}. One important observation is the following, which I state as a lemma.

Lemma 1 A function {f(x)} is continuous if and only if for every convergent sequence {x_n} in its domain,

\displaystyle \lim_{n \rightarrow \infty}f(x_n) = f(\lim_{n \rightarrow \infty}x_n).

So before taking the limit operation inside a function, it is necessary to check that the function is continuous. Alternatively, a function is continuous at {x} if for any {\epsilon >0}, there exists a {\delta>0} such that {\|f(x)-f(x')\|<\epsilon} whenever {\|x-x'\|<\delta}. This is just a reformulation of continuity. Observe that the notion of continuity depends on the notion of balls and norms. (It turns out that in {{\mathbb R}^n} the norm doesn’t matter since all norms are equivalent.)

Let {\|x\|_a} be some norm on {{\mathbb R}^n}. Consider the function {f(x)=\|x\|_a}. We will now show that the function {f(x)} is continuous with respect to the topology generated by the Euclidean norm {\|.\|_2}.

Lemma 2 Let {\|.\|_a} denote some norm on {{\mathbb R}^n}. Then the function {f(x)=\|x\|_a:{\mathbb R}^n \rightarrow [0,\infty)} is a continuous function with respect to the Euclidean norm.

Proof: Let {x_k} be some sequence in {{\mathbb R}^n} and let {x_k} converge to {x} in the {\|.\|_2} norm. This means that {\|x_k-x\|_2\rightarrow 0} as {k} becomes large. To prove continuity, we have to show that {f(x_k)\rightarrow f(x)} as {k\rightarrow \infty}. From the triangle inequality,

\displaystyle f(x_k)=\|x_k\|_a = \|x_k -x +x\|_a \leq \|x_k-x\|_a+\|x\|_a.

Hence, rearranging (and repeating the argument with the roles of {x_k} and {x} exchanged),

\displaystyle |\|x_k\|_a-\|x\|_a| \leq \|x_k -x\|_a

Let {e_1, e_2, ..., e_n} be the standard basis of {{\mathbb R}^n}, i.e., {e_i =(0,0,..0,1,0,..0,0)}, a vector with {1} in the {i}-th position. Then we have

\displaystyle \|x_k -x\|_a=\|\sum_{i=1}^n (x_k^{(i)} - x^{(i)} )e_i\|_a,

where {x_k^{(i)}} is the {i}-th coordinate of {x_k} (recall that {x_k \in {\mathbb R}^n}). Again using the triangle inequality (or the property of the norm),

\displaystyle \|x_k -x\|_a\leq\sum_{i=1}^n \|(x_k^{(i)} - x^{(i)} )e_i\|_a.

Observe that {(x_k^{(i)} - x^{(i)} )} is a real number. Using the scaling property of a norm,

\displaystyle \|x_k -x\|_a\leq\sum_{i=1}^n \|(x_k^{(i)} - x^{(i)} )e_i\|_a=\sum_{i=1}^n |x_k^{(i)} - x^{(i)} |\|e_i\|_a.

Now using the Cauchy-Schwarz inequality,

\displaystyle \sum_{i=1}^n |x_k^{(i)} - x^{(i)} |\|e_i\|_a\leq \sqrt{\sum_{i=1}^n |x_k^{(i)} - x^{(i)} |^2}\sqrt{\sum_{i=1}^n \|e_i\|_a^2}

Hence

\displaystyle \|x_k -x\|_a\leq \|x_k -x\|_2 C,

where {C=\sqrt{\sum_{i=1}^n \|e_i\|_a^2}} is some finite constant. So,

\displaystyle \begin{array}{rcl} |\|x_k\|_a-\|x\|_a|& \leq& \|x_k -x\|_a\\ &\leq&\|x_k -x\|_2 C \rightarrow 0 \end{array}

as {k\rightarrow \infty}. Hence {\|x_k\|_a} converges to {\|x\|_a} as {k \rightarrow \infty}. This implies that the function {f(x)=\|x\|_a} is continuous. \Box

While the definition is analytically good, it’s not “easy” to check if a function is continuous at every point. We now provide a stronger version of continuity that is often useful and desired. A function {f(x)} is Lipschitz continuous if there exists a constant {K>0} such that for any {x,y} in the domain of the function,

\displaystyle \|f(x)-f(y)\| \leq K\|x-y\|.

{K} is called the Lipschitz constant. It is easy to see that every Lipschitz continuous function is continuous. Let {x_n \rightarrow x}. Then

\displaystyle \lim_{n \rightarrow \infty}\|f(x_n)-f(x)\| \leq K\lim_{n \rightarrow \infty}\|x_n-x\|.

Since {\|.\|} is continuous we can move the limit inside the norm on the RHS.

\displaystyle \lim_{n \rightarrow \infty}\|f(x_n)-f(x)\| \leq K\lim_{n \rightarrow \infty}\|x_n-x\| = 0.

This implies {f(x_n)\rightarrow f(x)}, which implies the continuity of {f}. One can also show that every Lipschitz function is differentiable almost everywhere (Rademacher’s theorem); however, it need not be differentiable everywhere ({|x|} is Lipschitz but not differentiable at {0}). Examples of Lipschitz functions (see the numerical sketch after this list):

  1. {\exp(-x)} is a Lipschitz function with constant {1} on {[0,\infty)}.
  2. {\sin(x)} and {\cos(x)} are Lipschitz functions with constant {1}.
  3. {\exp(-x^2)} is a Lipschitz function with constant {\sqrt{2/e}} (the maximum of {|2x e^{-x^2}|}).
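A quick numerical estimate of these Lipschitz constants (my sketch, assuming numpy; the sampled maximum difference quotient approaches the true constant from below):

```python
import numpy as np

rng = np.random.default_rng(7)
x, y = rng.uniform(-5, 5, 100000), rng.uniform(-5, 5, 100000)
for f in (np.sin, lambda t: np.exp(-t ** 2)):
    # Largest sampled difference quotient |f(x)-f(y)| / |x-y|.
    print(np.max(np.abs(f(x) - f(y)) / np.abs(x - y)))   # ~1 and ~0.858
```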

We will now look at an alternative definition of continuity that uses open sets.

Lemma 3 A function {f(x):{\mathbb R}^n \rightarrow {\mathbb R}^m} is continuous if and only if the inverse image of any open set is an open set. More precisely, {f^{-1}(A)} is open for any open set {A\subset {\mathbb R}^m}.

Proof: We will prove that for a continuous function, the inverse image of an open set {A} is open.

  1. Let {x \in f^{-1}(A)}. This implies {y=f(x) \in A}.
  2. Since {A} is an open set, there exists an {\epsilon} such that {B(y,\epsilon)\subset A}.
  3. Since {f(x)} is continuous, there exits a {\delta} such that if {z \in B(x,\delta)}, then {f(z) \in B(y,\epsilon)}. This implies that {f(B(x,\delta))\subset B(f(x),\epsilon)\subset A}
  4. This implies that {B(x,\delta) \subset f^{-1}(A)}.
  5. Hence {f^{-1}(A)} is an open set.

The converse can be proved similarly (HW). \Box

Observe that the ambient spaces ({{\mathbb R}^n} and {{\mathbb R}^m}) are important. A function can be continuous in one space and not in another. Look at {1/x} on {(0,1)} and on {[0,1)}.

Let us check the constant function, {f(x)=c}. Take any open set {A} in {{\mathbb R}}. Then {f^{-1}(A)} is either {\emptyset} (if {c\notin A}) or all of {{\mathbb R}} (if {c\in A}), and both are open. Hence the function is continuous. Let us look at the step function {S(x)} defined earlier. Consider the open set {(-1/2 , 1/2)}. Its inverse image is {S^{-1}(-1/2,1/2)=\{0\}}. Observe that {\{0\}} is not an open set, and hence the function is not continuous.

Hw 1 Use this definition to check the continuity of {\exp(-x)}.
Hw 2 Is the function {1/x} continuous on {[0,1]}?

1. Open sets

On {{\mathbb R}^d} the standard topology can be described using the balls generated by the Euclidean norm. We begin with the central object of topology, an open set.

Definition 1 A set {A \subset {\mathbb R}^d} is open if for every {x \in A}, there exists an {\epsilon} such that {B(x,\epsilon)\subset A}.

Simply put, a set is open if for every point in the set we can find a radius {\epsilon} such that a ball of radius {\epsilon} centered around the point fits entirely in the set. Observe that the {\epsilon} can be arbitrary small and also can depend on the position {x}. We now provide a few examples.

  1. By convention, empty set {\emptyset} is an open set.
  2. Let {d=1}, i.e., the real line. The interval {I=(3,4)} is an open set. For instance, if {x=3.01}, we can choose {\epsilon =0.00001} and {B(3.01, 0.00001)} will fit entirely in {(3,4)}. For any {x\in (3,4)}, choosing {\epsilon =0.5\min\{x-3, 4-x\}} would suffice. Also observe that we do not have to worry about the end points {3} and {4} since they do not belong to {I}.
  3. Observe that if {I_1} and {I_2} are open intervals then {I_1 \cup I_2} is open. This proof is obvious.
  4. If {I_1} and {I_2} are open sets in {{\mathbb R}^d}, show that {I_1\cap I_2} are open sets.
  5. {(0,\infty)} is open in {{\mathbb R}}.
  6. {[0,1)} is not an open set. Reason out why? (Check what happens when you try to center a ball around {0}).
  7. A {d}-dimensional ball {B(x,r)} is an open set in {{\mathbb R}^d}. (Prove it)
Hw 1 Check if the following sets are open subsets of {{\mathbb R}^2}

\displaystyle A_1=\{(x,y): x>0, y >0, 3x+2y < 1 \}.

\displaystyle A_2=\{(x,y): x>0, y >0, 3x+2y > 1 \}.

\displaystyle A_3=\{(x,y): x>0, y >0, 3x+2y \leq 1 \}.

\displaystyle A_4=\{(x,y): 0 < x< \pi, \sin(x)<1/2\}.

please provide reasons.

Theorem 2 Let {A_1} and {A_2} be two open sets. Then the intersection of the sets {A_1 \cap A_2} is open.

Proof: Let {x \in A_1 \cap A_2}. This implies that {x} belongs to both {A_1} and {A_2}. Since {A_1} is an open set, there exists an {\epsilon_1} such that {B(x,\epsilon_1)\subset A_1} and an {\epsilon_2} such that {B(x,\epsilon_2)\subset A_2}. It is easy to see that, {B(x,\min\{\epsilon_1,\epsilon_2\}) \subset A_1\cap A_2}, which implies {A_1\cap A_2} is open. \Box

The previous result can be extended to any finite collection of sets by induction. More precisely, if {A_1, A_2, \hdots A_n} are {n} open sets, then {A_1\cap A_2\cap ... \cap A_n} is open. Let us see an example. If {A_1=(-1,2)} and {A_2=(0,1)}, then their intersection is {(0,1)}, which is an open set. Let us also see an example which highlights the importance of the finite number of sets. Let

\displaystyle A_n= (1-1/n, 2+ 1/n).

Observe that each {A_n} is an open set. It is also easy to see that the intersection of {A_1, A_2}, or for that matter of any finite number of the {A_i}, is open. However,

\displaystyle \cap_{i=1}^\infty A_i = [1,2],

which is closed and not open. Hence one cannot claim that the intersection of an arbitrary number of open sets is open.
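A small numeric illustration (a Python sketch that can only test finitely many {n}, so it merely suggests the limiting behaviour):

    # A_n = (1 - 1/n, 2 + 1/n); test membership across n = 1, ..., 10000.
    def in_An(x, n):
        return 1 - 1/n < x < 2 + 1/n

    for x in [1.0, 2.0, 0.999, 2.001]:
        print(x, all(in_An(x, n) for n in range(1, 10001)))
    # 1.0 and 2.0 survive every tested n, while 0.999 and 2.001 drop out
    # once n >= 1000, consistent with the intersection being [1, 2].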

It turns out that the union of open sets is always open regardless of the number of sets.

Theorem 3 Let {I} be any index set. If {A_\alpha} is an open set for every {\alpha \in I}, then their union

\displaystyle \cup_{\alpha \in I} A_\alpha

is always open.

Proof: HW \Box

The set {I} in the above theorem is called an indexing set. For example {I=\{1,2\}} or {I=\{1,2,3,4,...\}} or {I=(0,1)}. So the number of sets in consideration depends on the cardinality of the index set. The above theorem says that, whatever the cardinality of the index set, the union is still open.

Hw 2 This problem emphasizes the fact that the notion of a set being open (or closed) depends on the ambient space in consideration. As a subset of {{\mathbb R}}, check if {(0,1)} is an open set. Is {(0,1)} an open set of {{\mathbb R}^2}? (Use the definition of an open set.)

Hw 3 Let {A= (0,10) \setminus \{1, 2, 3 ,4 ,5\}}. Is the set {A} an open subset of {{\mathbb R}}? “{\setminus}” denotes set minus.

Hw 4 Let {Q} denote the set of rational numbers in {(0,1)}. That is

\displaystyle Q=\left\{\frac{m}{n} : m, n \in \{1,2,3,4,....\} \text{ and } \frac{m}{n}<1\right\} .

Is the set {Q} an open set of {{\mathbb R}}?

Hw 5 Let {A} be an {n\times n} invertible matrix and let {b\in{\mathbb R}^n} belong to the range space of {A}. Show that the set

\displaystyle \{x: x\in {\mathbb R}^n, Ax < b\},

is an open subset of {{\mathbb R}^n}. What happens if {A} is not invertible?

Hw 6 The following exercise shows the properties of sequences in an open set. Let {I=(0,1)} and {x_n = 1/n,\ n>2}. Verify the following:

  • Does {x_n \in I}?
  • What is the limit { \lim_{n\rightarrow \infty}x_n}? Does it belong to the set {I}?
  • Let {y_n = \frac{1}{2} +\frac{1}{n}}. Does the limit of the sequence {y_n} belong to {I}?

This shows that even though a sequence belongs to an open set, the limit of the sequence (if it exists) may not belong to the set.

Hw 7 Observe that the definition of an open set depends on the notion of a ball, which in turn depends on the norm used. Suppose {A\subset {\mathbb R}^d} is an open set with respect to the norm {\|.\|_2} (the standard Euclidean norm). Is the set {A} an open set in {{\mathbb R}^d} when we use a different norm, for example the {\|.\|_1} norm? Hint: Use the equivalence of norms in {{\mathbb R}^d} and the definition of open sets.

2. Closed sets

A set {A \subset {\mathbb R}^n} is a closed set if its complement {{\mathbb R}^n \setminus A} is open.

So {(-\infty,3]\cup[4,\infty)} is a closed set in {{\mathbb R}} since its complement {(3,4)} is an open set. Any set with finite cardinality (for example {\{3\}} or {\{0,4\}}) is a closed set. Also observe that the entire set {{\mathbb R}^d} is both a closed and open set with respect to {{\mathbb R}^d}. So a closed or an open set need not be bounded. Also by convention, the empty set {\emptyset} is a closed set.

Hw 8 Check if the following sets are closed subsets of {{\mathbb R}^2}

\displaystyle A_1=\{(x,y): x>0, y >0, 3x+2y \leq 1 \}.

\displaystyle A_2=\{(x,y): x>0, y >0, 3x+2y > 1 \}.

\displaystyle A_4=\{(x,y): 0 < x< \pi, \sin(x)\leq1/2\}.

Explain why.

From the definition of closed sets, we have the following important property which we state as a theorem.

Theorem 4 Let {A} be an open set in {{\mathbb R}^n}. Then its complement {{\mathbb R}^n\setminus A} denoted by {A^c} is closed.

We will now use the definition of closed sets and De Morgan's laws to characterize the union and intersection of closed sets.

Theorem 5 Let {B_1} and {B_2} be two closed sets. Then the union {B_1 \cup B_2} is closed.

Proof: We have {(B_1\cup B_2)^c = B_1^c \cap B_2^c}. Since {B_1^c} and {B_2^c} are open sets (complements of closed sets), it follows from Theorem 2 that {B_1^c \cap B_2^c} is open. Hence {(B_1\cup B_2)^c} is open, implying that its complement {B_1\cup B_2} is closed. \Box

Observe that as in the open set case, the above theorem can be extended to any finite collection of closed sets. It turns out that the intersection of closed sets is always closed regardless of the number of sets.

Theorem 6 Let {I} be any index set. If {B_\alpha} is a closed set for every {\alpha \in I}, then their intersection

\displaystyle \cap_{\alpha \in I} B_\alpha

is always closed.

Proof: HW \Box

If {B_1, B_2, \hdots B_n} are closed sets, then {B_1\cap B_2\cap ... \cap B_n} is closed; this is the special case of the previous theorem with the finite index set {I=\{1,\hdots,n\}}.

While the above definition characterizes a closed set, we now look at an alternate definition of closed sets that is more intuitive. For that we require the notion of a limit point.

Definition 7 {x\in {\mathbb R}^n} is a limit point of a set {S\subset {\mathbb R}^n} if for every {\epsilon>0}, {B(x,\epsilon)\cap S\setminus\{x\}} is non-empty.

So a point is a limit point of a set when every ball centered at the point intersects the set in some point other than the center itself. For example, if {S=(0,2)} then every point of {S} is a limit point of the set. In addition, the points {0} and {2} are limit points of {S} (check why). However, {2.00000001} is not a limit point of {S} (why?). So one can think of a limit point as a point, not necessarily in the set, that has points of the set arbitrarily close to it.
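Here is a small Python sketch of the definition at work (the helper name is ours): {0} is a limit point of the set {S=\{1/n: n=1,2,...\}}, even though {0 \notin S}, since every ball {B(0,\epsilon)} contains a point of {S} different from {0}:

    import math

    # For a given eps > 0, produce a point of S = {1/n} inside B(0, eps) \ {0}.
    def witness(eps):
        n = math.floor(1 / eps) + 1   # smallest n with 1/n < eps
        return 1.0 / n

    for eps in [0.5, 0.01, 1e-6]:
        s = witness(eps)
        print(eps, s, 0 < s < eps)    # always True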

Hw 9 Find the limit points of the following sets.

  1. {A_1 = \{1,2,3,4\}}.
  2. {A_2 =[0,1]}
  3. {A_3= (0,1)}
  4. {A_4=\{x\in {\mathbb R}^2 : \|x\|<1\}}
  5. {Q} the set of rational numbers.

Theorem 8 A set {B\subset {\mathbb R}^n} is closed if and only if it contains all of its limit points.

Proof: We prove by contradiction. Let {x} be a limit point of the closed set {B} and suppose {x \notin B}. We will derive a contradiction.

Since {B} is closed, {B^c} is open. Also, {x\notin B} implies {x \in B^c}. Since {B^c} is open, by the definition of an open set there exists an {\epsilon} such that {B(x,\epsilon) \subset B^c}. This implies that for this {\epsilon}, {B(x,\epsilon)\cap B =\emptyset}. Hence {x} is not a limit point of {B}, which contradicts our assumption. The converse direction can be proved similarly (HW). \Box

Hw 10 Prove that the supremum and the infimum of a non-empty bounded closed set {S\subset {\mathbb R}} belong to the set.

The closure of a set {S} denoted by {\bar{S}} equals the union of the set {S} and all its limit points, \ie,

\displaystyle \bar{S} = S\cup \{\text{Limit points of } S\}

So the closure of {S=(0,2)} equals {[0,2]}.

Hw 11 Find the closures of the following sets

  1. {A_1 = \{1,2,3,4\}}.
  2. {A_2 =[0,1]}
  3. {A_3= (0,1)}
  4. {A_4=\{x\in {\mathbb R}^2 : \|x\|<1\}}
  5. { A_5=(0,1)\cup\{2\}}
  6. { A_6=\{x: Ax< b, x\in {\mathbb R}^n, A \in {\mathbb R}^{n\times n}, b\in {\mathbb R}^n\}}
  7. {A_7=Q} the set of rational numbers.

We will now define the interior and the boundary of a set. Before that we require the notion of an interior point. A point {x\in S} is an interior point of {S} if there exists a ball of some radius {\epsilon>0} centered at {x} that fits entirely in {S}. The set of all interior points of {S} is called the interior of {S} and is denoted by {S^o}:

\displaystyle S^o=\{x: x\in S,\ B(x,\epsilon)\subset S \text{ for some } \epsilon >0\}.

The definition of the interior is similar to that of an open set. In fact, the interior of a set {S} is the largest open set contained in {S}. The interior of the set {(0,1)} is the set {(0,1)} itself. The interior of the set {[0,1]} is the set {(0,1)}. The interior of the set {\{(x,y): x^2+y^2\leq 1\}} is the set {\{(x,y): x^2+y^2< 1\}}. Also observe that a set might have an empty interior (for example, the singleton {\{3\}} in {{\mathbb R}}).

Hw 12 Find the interior of the sets {A_1, A_2, A_3, A_4, A_5, A_6, A_7} in Hw 11.
Hw 13 What is the interior of the set {\{(x,y): x^2+y^2\leq 1\}} as a subset of {{\mathbb R}^3} (with respect to the metric on {{\mathbb R}^3})? What is its interior as a subset of {{\mathbb R}^2} (with respect to the metric on {{\mathbb R}^2})?

The above exercise shows that the ambient space is important to the definition of the interior of a set.

Hw 14 Prove that a set {S} is open if and only if all of its elements are interior points, \ie, {S=S^o}. Hint: Compare the two definitions.

The boundary of a set {S} denoted by {\partial{S}} equals the set difference between the closure of the set and the interior of the set, \ie,

\displaystyle \partial S = \bar{S}\setminus S^o.

For example, the boundary of the set {(0,1)} is the set {\{0,1\}}. The boundary of the set {\{(x,y): x^2+y^2\leq 1\}} is the set {\{(x,y): x^2+y^2=1\}}.

Hw 15 What is the boundary of the set {\{(x,y): x^2+y^2\leq 1\}} as a subset of {{\mathbb R}^3} (with respect to the metric on {{\mathbb R}^3})? What is its boundary as a subset of {{\mathbb R}^2} (with respect to the metric on {{\mathbb R}^2})?
Hw 16 Find the boundary of the sets {A_1, A_2, A_3, A_4, A_5, A_6, A_7} in Hw 11.
Hw 17 Prove that the boundary of a set is a closed set.

A set {S\subset {\mathbb R}^d} is bounded if there exists a ball of radius {R<\infty} centered around the origin {B(o,R)} that contains the set. The square {[0, 2]^2} is bounded while the set {\{(x,y); x>0, y>0\}} is not bounded.

These lecture notes cover the basics of Analysis and Linear Algebra required for the optimization theory course EE5121. This is by no stretch a comprehensive treatment of either Real Analysis or Linear Algebra. There are many excellent books that deal with both these subjects. Here are a few references:

  • Andrei Nikolaevich Kolmogorov, Sergei Vasilevich Fomin, “Introductory real analysis”.
  • H. L. Royden, “Real Analysis”.
  • Walter Rudin, “Principles of mathematical analysis”.
  • Gilbert Strang, “Introduction to linear algebra”.
  • Roger A. Horn, Charles R. Johnson, “Matrix analysis”.

1. Real numbers

The set of real numbers is denoted by {{\mathbb R}}. We begin with an important property (axiom) of the real number system: it can be ordered. In plain English, this means any two real numbers can be compared. We take this property for granted. However, it turns out to be a very important property that does not hold for many other fields (for example, the complex numbers cannot be ordered in a compatible way).

Since any two numbers can be compared, we can provide an upper bound for a set of real numbers. For example, consider the set of numbers {A=\{0, -3, -\pi, \pi, 2.718, 10\}}. There are lots of real numbers that upper bound this set. Any number in the set {\{11, 12, 14, 16\}} upper bounds the numbers in {A}, while any number in the set {\{-100, -200,-300\}} lower bounds the numbers in {A}. If instead you consider the interval {(-10,10)}, then any number in {[10, \infty)} upper bounds it while any number in {(-\infty,-10]} lower bounds it. It turns out that whenever a set has some upper bound and some lower bound, we can always pick the best upper bound and the best lower bound.

Don't try to prove this, as it is a distinguishing axiom of the real numbers: “Every non-empty set of real numbers that has an upper bound has a least upper bound. Similarly, every non-empty set of real numbers that has a lower bound has a greatest lower bound.” The least upper bound of a set {S} is denoted by {\sup(S)} and the greatest lower bound by {\inf(S)}. Also observe that {\sup(S)} and {\inf(S)} need not belong to the set {S}.

If we consider the set {A=(-10,10)}, we can see that {\sup(A)=10} and {\inf(A)=-10}. For the set {B=\{1,2,3,4,...\}}, there is no supremum as it is not upper bounded; its infimum is {\inf(B) =1}. However, if we extend the real number system by including {-\infty} and {\infty}, the infimum and supremum exist for any set (they might just be {-\infty} or {\infty}).
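A hedged numeric illustration (sampling only finitely many elements, so the minimum merely approaches the infimum):

    # The set {1/n : n = 1, 2, ...} has sup = 1 (attained) and inf = 0 (not attained).
    xs = [1.0 / n for n in range(1, 10001)]
    print(max(xs))   # 1.0    -- the supremum, attained at n = 1
    print(min(xs))   # 0.0001 -- approaches the infimum 0, which is never attained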

Hw 1 Find the {\sup} and the {\inf} of the following sets:

  1. {A_1=\{\frac{1}{n};\ n=1,2,3,\hdots\}}
  2. {A_2=\{2^{1/n},n=1,2,\hdots\}}
  3. {A_3=\{x: \arcsin(x) < 1/2\}}
  4. {A_4 =(-3,4)}
  5. {A_5 =[-3,4)}
  6. {A_6 =[-3,4]}

Hw 2 For a non-empty {A\subset {\mathbb R}}, can {\sup(A)=\inf(A)}? What can you say about the set {A} in that case?

Hw 3 Is it necessary that the supremum {\sup(A)} belongs to the set {A}? Ask the same question for the infimum. Look at the sets {A_4}, {A_5} and {A_6}.

If the supremum of a set {A} belongs to the set, \ie, {\sup(A)\in A}, then the supremum is usually written as the maximum {\max(A)}. So {\max([-1,0])} makes sense while {\max((-1,0))} does not. So if you are unsure, always use {\sup} and {\inf} instead of {\max} and {\min}.

2. Basic Topology

2.1. Norm and Inner Product

One of the nice properties of real spaces is that we can define a norm and a notion of distance in a rather concrete manner. Most of the course deals with real functions and subsets of {n}-dimensional real space. So unless otherwise stated, the discussion will be about real spaces and functions. We first begin with the definition of a norm. A norm on {{\mathbb R}^d} is a function {f(x)} that maps every {x \in {\mathbb R}^d} to a non-negative real number with the following properties:

  1. Let {a \in {\mathbb R}} and {x \in {\mathbb R}^d}; then {f(ax)=|a|f(x)}.
  2. {f(x+y)\leq f(x)+f(y)} for all {x,y \in {\mathbb R}^d}.
  3. If {f(x)=0} then {x=0}.

The most popular norm on {{\mathbb R}^d} is the Euclidean norm. For any point {x =(x_1,\hdots,x_d) \in {\mathbb R}^d}, its Euclidean norm is defined as

\displaystyle \|x\|_2 = \sqrt{\sum_{i=1}^d x_i^2} =\sqrt{x^Tx}

Other examples of norms are

  1. {\|x\|_1=\sum_{i=1}^d |x_i|}
  2. {\|x\|_\infty=\max_{i=1,...d}|x_i|}
  3. {\|x\|_p =(\sum_{i=1}^d |x_i|^p)^{1/p}} for {p\geq 1}
Hw 4 Prove that {\|x\|_1},{\|x\|_2}, {\|x\|_\infty} satisfy the properties of a norm.
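As a numerical companion to Hw 4, here is a Python spot-check on random vectors (a sketch, not a proof; the helper names are ours):

    import numpy as np

    def norm1(x):   return np.sum(np.abs(x))
    def norm2(x):   return np.sqrt(np.sum(x**2))
    def norminf(x): return np.max(np.abs(x))

    rng = np.random.default_rng(0)
    for f in (norm1, norm2, norminf):
        x, y = rng.standard_normal(5), rng.standard_normal(5)
        assert f(x + y) <= f(x) + f(y) + 1e-12    # triangle inequality
        assert abs(f(-2*x) - 2*f(x)) < 1e-12      # homogeneity with a = -2
    print("all checks passed")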

Two norms {\|.\|} and {\|.\|'} are equivalent if there exist two positive constants {c_1} and {c_2} such that for any {x \in {\mathbb R}^d}

\displaystyle c_2 \|x\|' \leq \| x\|\leq c_1 \|x\|'.

Observe that {c_1} and {c_2} do not depend on {x}. Let us now show that {\|.\|_2} and {\|.\|_\infty} are equivalent. Let {x = (x_1,\hdots, x_n)} be some element in {{\mathbb R}^n}. Then

\displaystyle  \begin{array}{rcl}  \|x\|_2^2 &=& \sum_{i=1}^n x_i^2,\\ &\stackrel{(a)}{\leq}&\sum_{i=1}^n \max_j (x_j^2),\\ &=&\max_j (x_j^2)\sum_{i=1}^n 1= n\|x\|_{\infty}^2. \end{array}

where {(a)} follows since for every {i}, {x_i^2 \leq \max_j (x_j^2)}. Taking square roots on both sides we have {\|x\|_2\leq \sqrt{n}\|x\|_{\infty}}. We also have

\displaystyle  \begin{array}{rcl}  \|x\|_2^2 &=& \sum_{i=1}^n x_i^2,\\ &\geq &\max_j({x_j}^2),\\ &=&\|x\|_{\infty}^2. \end{array}

Taking square roots we obtain {\|x\|_2\geq \|x\|_{\infty}}. Hence we have

\displaystyle \|x\|_{\infty} \leq \|x\|_2\leq \sqrt{n}\|x\|_{\infty},

which shows that the norms {\|x\|_\infty} and {\|x\|_2} are equivalent (with {c_2=1} and {c_1=\sqrt{n}}). The following theorem shows that all norms on {{\mathbb R}^d} are equivalent.
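Before stating the general result, here is a quick numerical sanity check of the sandwich just derived (a Python sketch on random vectors):

    import numpy as np

    rng = np.random.default_rng(1)
    n = 10
    for _ in range(1000):
        x = rng.standard_normal(n)
        l2   = np.linalg.norm(x, 2)
        linf = np.linalg.norm(x, np.inf)
        assert linf <= l2 <= np.sqrt(n) * linf   # the bounds derived above
    print("bounds hold on all samples")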

Theorem 1 In {{\mathbb R}^d} (or on any finite dimensional vector space), all the norms are equivalent.

Proof: The proof of this follows from a simple optimization problem. We will prove this later in the course. \Box

Hw 5 Show that {\|x\|_2}, {\|x\|_1} and {\|x\|_\infty} norms are equivalent.

From now on we assume that we are provided with some norm on {{\mathbb R}^n} which we denote by {\|.\|}. A norm can be used to define a distance metric as

\displaystyle d(x,y) = \|x-y\|.

It is easy to prove that {d(x,y)} is a metric. The main thing to prove is the triangle inequality, which follows in a straightforward manner from the norm definition.
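Explicitly, for any {x, y, z},

\displaystyle d(x,z) = \|x-z\| = \|(x-y)+(y-z)\| \leq \|x-y\|+\|y-z\| = d(x,y)+d(y,z).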

One of the most fundamental geometric objects is a ball. A ball centered at {x} with radius {r} is denoted by {B(x,r)} and is defined as

\displaystyle B(x,r)= \{y : y \in {\mathbb R}^n, \|y-x\|<r \}.

Observe the strict inequality; this is called an open ball.

Hw 6

  1. Plot {B(o,1)} when the norm is given by {\|.\|_1} in {{\mathbb R}^2}.
  2. Plot {B(o,1)} when the norm is given by {\|.\|_2} in {{\mathbb R}^2}.
  3. Plot {B(o,1)} when the norm is given by {\|.\|_\infty} in {{\mathbb R}^2}.

Explain the equivalence of these three norms in terms of the plots (a plotting sketch is given below).
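A minimal matplotlib sketch for these plots (a hedged illustration assuming numpy and matplotlib are available; each unit ball boundary is drawn as the level set {\|x\|=1}):

    import numpy as np
    import matplotlib.pyplot as plt

    xs = np.linspace(-1.5, 1.5, 401)
    X, Y = np.meshgrid(xs, xs)
    balls = {
        "l1":   np.abs(X) + np.abs(Y),
        "l2":   np.sqrt(X**2 + Y**2),
        "linf": np.maximum(np.abs(X), np.abs(Y)),
    }
    for name, N in balls.items():
        plt.contour(X, Y, N, levels=[1.0])   # boundary of B(o, 1) in each norm
    plt.gca().set_aspect("equal")
    plt.show()
    # The l1 ball is a diamond inscribed in the l2 disc, which in turn is
    # inscribed in the linf square: a picture of the equivalence bounds.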

Hw 7 Let {A} be an {n\times n} real matrix. For any {x \in {\mathbb R}^n} define

\displaystyle \|x\|_A =\|Ax\|_2.

Is {\|x\|_A} a norm on {{\mathbb R}^n} for every matrix {A}? What properties of {A} will make {\|.\|_A} a norm?

Hw 8 Let {A} be an {n\times n} real matrix. For any {x \in {\mathbb R}^n} define

\displaystyle \|x\|_A =\sqrt{x^T A x}.

Is {\|x\|_A} a norm on {{\mathbb R}^n} for every matrix {A}? What properties of {A} will make {\|.\|_A} a norm?

Hw 9 Let {S} denote a symmetric {n\times n} matrix. A norm on the set of symmetric matrices can be defined as

\displaystyle  \|S\| =\sqrt{ Tr(SS^T)}=\sqrt{\sum_{i=1}^n\sum_{j=1}^n s_{ij}^2}.

Show that {\|S\|} is a norm on symmetric real matrices.

2.1.1. Dual norm

For any norm {\|.\|}, its dual norm is defined as

\displaystyle \|x\|_* = \sup\{x^Ty \ |\quad \|y\|\leq 1\}

We will first show that {\|x\|_*} is indeed a norm, by verifying the three properties.

  1. Homogeneity: First assume {a>0} (the case {a=0} is trivial). Then

    \displaystyle  \begin{array}{rcl}  \|ax\|_* &=& \sup\{ax^Ty \ |\quad \|y\|\leq 1\}\\ &=&a\sup\{x^Ty \ |\quad \|y\|\leq 1\}\\ &=&a\|x\|_* \end{array}

    Now if {a<0}, then

    \displaystyle  \begin{array}{rcl}  \|ax\|_* &=&\sup\{ax^Ty \ |\quad \|y\|\leq 1\}\\ &=&|a|\sup\{-x^Ty \ |\quad \|y\|\leq 1\}\\ &=&|a|\sup\{x^T(-y) \ |\quad \|y\|\leq 1\}\\ &=&|a|\sup\{x^Ty \ |\quad \|y\|\leq 1\}\\ &=&|a|\|x\|_* \end{array}

  2. Suppose {\|x\|_*=0}, \ie, {\sup\{x^Ty \ |\quad \|y\|\leq 1\}=0}. If {x \neq o}, we may choose {y = x/\|x\|}; then {\|y\|=1} and {x^Ty = \|x\|_2^2/\|x\| > 0}, which is a contradiction. Hence {x=o}.
  3. Triangle inequality:

    \displaystyle  \begin{array}{rcl}  \|x+z\|_* &=& \sup\{(x+z)^Ty \ |\quad \|y\|\leq 1\}\\ &=& \sup\{x^Ty +z^Ty \ |\quad \|y\|\leq 1\}\\ &\leq&\sup\{x^Ty \ |\quad \|y\|\leq 1\}+\sup\{z^Ty \ |\quad \|y\|\leq 1\}\\ &=&\|x\|_*+\|z\|_* \end{array}

Also observe from the proof that {\|x\|_* > 0} whenever {x\neq o}. The dual norm of {\|.\|_2} is again the {\|.\|_2} norm, while the dual of {\|.\|_p} is the conjugate norm {\|.\|_q}, where {p^{-1}+q^{-1}=1}. We have the following important lemma.

Lemma 2 For any {x, z \in {\mathbb R}^n},

\displaystyle x^Tz\leq \|x\|_*\|z\| .

Proof: From the definition we have

\displaystyle \|x\|_* = \sup\{x^Ty \ |\quad \|y\|\leq 1\}.

Since {\|x\|_*} is the supremum of {x^Ty} over all vectors {y} with {\|y\|\leq 1}, picking any particular such {y} gives a lower bound on {\|x\|_*}. Assume {z \neq o} (the case {z=o} is trivial) and pick {y= z/\|z\|}. Observe that {\|y\|=1}. Hence

\displaystyle \|x\|_* \geq \frac{ x^T z}{\|z\|},

which proves the result.

\Box
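A numeric illustration of the lemma (a hedged Python sketch using the dual pair noted above, taking {\|.\|=\|.\|_1} so that {\|.\|_*=\|.\|_\infty}):

    import numpy as np

    rng = np.random.default_rng(2)
    for _ in range(1000):
        x, z = rng.standard_normal(8), rng.standard_normal(8)
        # Lemma: x^T z <= ||x||_* ||z||; here ||.|| = ||.||_1, ||.||_* = ||.||_inf.
        assert x @ z <= np.linalg.norm(x, np.inf) * np.linalg.norm(z, 1) + 1e-12
    print("inequality holds on all samples")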

2.2. Sequences

A sequence {\{x_n\} = \{x_1,x_2,x_3,x_4,\dots\}} in {{\mathbb R}} (or any space) is a map from the set of natural numbers to {{\mathbb R}} (or the space in consideration). We will now look at some important properties regarding the convergence of sequences. We say that a sequence {x_n} converges to {x} if for every {\epsilon >0} there exists some {k} such that the distance between {x_n} and {x} is less than {\epsilon} for all {n>k}, \ie,

\displaystyle d(x_n,x)=\|x_n-x\|< \epsilon, \quad \forall n> k.

Let us think about a game: suppose you want to argue (prove) to your friend, who is not convinced, that a sequence {x_n} converges to some limit {x}. Your friend gives you an {\epsilon}, say {\epsilon =0.01}. Then you should show that you can find some {k} such that the distance between the terms of the sequence after {x_k} and {x}, \ie, {\|x_n-x\|} for {n>k}, is smaller than the {\epsilon} provided by your friend. Observe that you can choose whatever {k} you want. So {\epsilon} is provided by someone (your friend) and you have to find the required {k}. Now let us see some examples; a small numeric sketch of the game follows the list.

  1. Let {x_n = 1} for all {n}. It is obvious that the limit of the sequence is {1}. However, let us check it more formally. Suppose your friend gives you {\epsilon =0.1}. You can choose {k=1}, since {\|x_n-1\|=0<0.1} for all {n}. Even if he had given you {\epsilon =0.000001}, {k=1} would be valid.
  2. Let {x_n =1/n}. Of course the limit is {x=0}. Suppose your friend gives {\epsilon =1}. Let us see how to choose the correct {k}. So we want

    \displaystyle \|x_n-0\|= \|1/n\|<1,

    which holds for all {n>1}. So you can pick {k=1}. Now suppose your friend challenges you with {\epsilon =0.01}; then you would want

    \displaystyle \|1/n\| = 1/n < 0.01.

    This is possible if {n \geq 101}. So you can answer your friend that the {k} corresponding to {\epsilon =0.01} is {101}.

  3. Let {x_n=1/n} except that we replace some values as follows: {x_{102}= 1}, {x_{200}=10}, {x_{500}=-10} and {x_{1000}=100}. You claim that the sequence still converges to {x=0}. Then your friend challenges you with {\epsilon =0.01}. So what should you pick {k} as? Picking {k=101} as last time will not work since {\|x_{102}-0\|= |1-0|> 0.01}. But you can see that choosing {k= 1001} will work, since {\|x_{n}-0\|<0.01} for all {n>1001}. So only the values of {x_n} for large {n} matter, and fluctuations over a finite set of indices won't matter.
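A small Python sketch of this game for {x_n=1/n} (the helper name is ours):

    import math

    # A k that works for a given eps: |1/n - 0| < eps for all n > k.
    # Any larger k also works, e.g. the k = 101 used in example 2.
    def find_k(eps):
        return math.floor(1.0 / eps)

    for eps in [1.0, 0.5, 0.01]:
        k = find_k(eps)
        print(eps, k, all(1.0/n < eps for n in range(k + 1, k + 10001)))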
Hw 10 Formally prove the following limits.

  1. {\lim_{n \rightarrow \infty}\frac{1}{1+n}= 0}
  2. {\lim_{n\rightarrow\infty}(-1)^n \exp(-n) =0}

Let {x_n =\frac{1}{1+n}}. Observe that {x_n \in (0,1)} for all {n}. However, the limit {0} does not belong to the set {(0,1)}. In the coming sections, we will characterize the sets for which, whenever a sequence belongs to the set and its limit exists, the limit also belongs to the set.