It is very difficult to solve general non-linear optimization problems and there exists no universal techniques. This is intuitive since most difficult math problems can be cast as optimization problems. For example, the simple problem of finding the root of a non-linear function ${f(x)}$ can be cast as the optimization problem: ${\min_{x\in {\mathbb R}^n} f(x)^2}$. Hence we focus on a sub-class of optimization problems that can be solved efficiently.

Linear optimization problems (linear objective function and polyhedral constraint set) can be solved in a very efficient manner. This class of problems can be solved with even a million variables very efficiently and commercial software exist to solve these large problems. On the other hand, solving general non-linear optimization problems is extremely difficult (almost impossible). While solving linear problems is important and useful, the linear class does not cover all interesting optimization problems (for example eigenvalue maximisation). Convex optimization problems are an important class (subsumes linear and contains a subset of non-linear problems) that are interesting, useful and that can be solved efficiently.

In this course, we will look at algorithms for convex optimization problems. In particular, the focus will be on large dimension problems and several algorithms along with their convergence rates will be presented.

There is no standard textbook for this course. We will be referring to the following text books

1. Introductory lectures on convex optmization, A basic course, Yurii Nesterov
2. Convex optimization, Stephen Boyd
3. Lectures on modern convex optimization, Aharon Ben-Tal and Arkadi Nemirovski
4. Theory of convex optimization for machine learning, Sebastien Bubeck
5. Convex optimization theory, Dimitri P Bertsekas
6. Nonlinear programming, Dimitri P Bertsekas
7. Linear and nonlinear programming, David G, Luenberger and Yinyu Ye
8. Numerical optimization, Jorge Nocedal and Stephen J. Wright
9. Problem complexity and method efficiency in optimization, A Nemirovsky and D. B. Yudin
10. A course in convexity, Alexander Barvinok

1. Comparing various algorithms

In the first part (and most) of the course, we will be looking at algorithms that do not exploit the structure of the objective function. In this regard, we assume an oracle that provides with required information and we do not have any other additional information. In this course we consider the following oracles

1. Zeroth-order oracle: input ${x}$, output ${f(x)}$.
2. First-order oracle: input ${x}$, output ${(f(x), \nabla f(x))}$
3. Second-order oracle: input ${x}$, ouput ${(f(x), \nabla f(x), \nabla^2 f(x))}$

We can now compare different algorithms over a class of functions with different oracles. The complexity of the algorithm will be measured in terms of number of calls to the oracle and the computation using these oracle outputs to computing the optimum value. However, we will not be counting the complexity of the oracle, i.e., the complexity of computing ${f(x), \nabla f(x), \nabla^2 f(x)}$ will be neglected. In most of the problems we will be considering, ${f(x)}$ is real valued and hence it does not make sense (or will be of infinite complexity) to ask for the optimal value. Instead, we will look for solutions that are ${\epsilon}$ close to the optimal value, i.e, we are interested in ${\tilde{x}}$ such that ${f(\tilde{x})- f(x^*) < \epsilon}$.

2. Zeroth-order oracle

Let us first consider the complexity of solving the following problem

$\displaystyle \min_{x\in [0,1]^n} f(x),$

when the function ${f(x)}$ belongs to the class of Lipschitz continuous functions with respect to ${\| \|_\infty}$ norm. We require an ${\epsilon}$ accurate solution, i.e., if ${\tilde{x}}$ is the declared solution, then ${f(\tilde{x}) - f(x^*) <\epsilon}$. A function ${f(x):{\mathbb R}^n\rightarrow {\mathbb R}}$ is Lipschitz continuous with constant ${L}$ if

$\displaystyle |f(x)-f(y)|\leq L\|x-y\|_\infty, \quad \forall x,y \in {\mathbb R}^n.$

Suppose we consider a zeroth order oracle that provides only the value of the function, what can we say about the computational complexity of this class of functions? We can provide an upper bound on the complexity by providing an algorithm. The simplest algorithm is the following grid search. Divide each dimension into ${K}$ parts and search over the resulting lattice. It can easily be seen that ${K^n}$ calls to the oracle are required.

Lemma 1 If ${K> 2L/\epsilon}$, the grid search algorithm provides an ${\epsilon}$ accurate solution.

Proof: The optimal point ${x^*}$ belongs to some small hypercube in the grid. Let ${\bar{x}}$ be some vertex of the hypercube. Then by Lipschitz continuity of ${f}$,

$\displaystyle f(\bar{x}) - f(x^*) \leq L\|x- x^*\|_\infty \leq \frac{L}{2K}.$

Hence choosing ${L/K<\epsilon}$, i.e., ${K> L/2\epsilon}$, we obtain the required accuracy. $\Box$

So this algorithm calls the zeroth -order oracle ${(L/2\epsilon)^n}$ times to obtain an ${\epsilon}$ accurate solution. Is this optimal (in an order sense). To prove this, we build a worst case function and show that, whatever be the algorithm, the optimal point cant be found in less than ${\Theta((2L/\epsilon)^n)}$ steps.

Lemma 2 With the zeroth order oracle, no (deterministic?) algorithm can perform better.

Proof: In this class, all the algorithms should only depend on the value of the function that the oracle provides. Suppose you have an algorithm and you feed in the function ${g(x)=0}$. The algorithm will provide a sequence of points ${x_1, x_2, \hdots, x_N}$ after which the algorithm outputs the optimal value with ${\epsilon}$ accuracy. Let us assume that ${N< (L/2\epsilon)^n =p^n}$ (otherwise we are done). Now since there are only ${p^n}$ points in the square ${[0,1]^n}$, there exists one point ${\tilde{x} \in [0,1]^n}$ such that ${B(\tilde{x}, 1/2p)\cap \{x_1,x_2,\hdots,x_n\} = \emptyset}$ (the ball is with respect to ${\| \|_\infty}$ norm). Now consider the function

$\displaystyle f(x) = \min\{0,L\|x- \tilde{x}\|-\epsilon\}.$

This is a Lipschitz function with an optimal value ${-\epsilon}$. This function is non zero only when ${\|x\| \leq \epsilon/L}$, i.e., in the ball ${B(\tilde{x},\epsilon/L)}$. It is easy to see that algorithm that is designed will never get closer to the optimal value. $\Box$

This result shows that the complexity of optimization with zeroth-order oracle for Lipschitz functions is exponential in the dimension. In the next class, we will look at Lipschitz functions in more detail.

Reference: Problem complexity and method efficiency in optimization, A Nemirovsky and D. B. Yudin