== Examples == === Example 1 === [[Image:Lagrange very simple.svg|thumb|right|300px|Illustration of the constrained optimization problem '''1''']] Suppose we wish to maximize <math>\ f(x,y) = x+y\ </math> subject to the constraint <math>\ x^2 + y^2 = 1 ~.</math> The [[Candidate solution|feasible set]] is the unit circle, and the [[level set]]s of {{mvar|f}} are diagonal lines (with slope −1), so we can see graphically that the maximum occurs at <math>\ \left(\tfrac{1}{\sqrt{2}}, \tfrac{1}{\sqrt{2}}\right)\ ,</math> and that the minimum occurs at <math>\ \left(-\tfrac{1}{\sqrt{2}}, -\tfrac{1}{\sqrt{2}}\right) ~.</math> For the method of Lagrange multipliers, the constraint is <math display="block"> g(x,y) = x^2 + y^2-1 = 0\ ,</math> hence the Lagrangian function, <math display="block">\begin{align} \mathcal{L}(x, y, \lambda) &= f(x,y) + \lambda \cdot g(x,y) \\[4pt] &= x + y + \lambda (x^2 + y^2 - 1)\ , \end{align}</math> is a function that is equivalent to <math>\ f(x,y)\ </math> when <math>\ g(x,y)\ </math> is set to {{math|0}}. Now we can calculate the gradient: <math display="block">\begin{align} \nabla_{x,y,\lambda} \mathcal{L}(x , y, \lambda) &= \left( \frac{\partial \mathcal{L}}{\partial x}, \frac{\partial \mathcal{L}}{\partial y}, \frac{\partial \mathcal{L}}{\partial \lambda} \right ) \\[4pt] &= \left ( 1 + 2 \lambda x, 1 + 2 \lambda y, x^2 + y^2 -1 \right)\ , \end{align}</math> and therefore: <math display="block">\nabla_{x,y,\lambda} \mathcal{L}(x , y, \lambda)=0 \quad \Leftrightarrow \quad \begin{cases} 1 + 2 \lambda x = 0 \\ 1 + 2 \lambda y = 0 \\ x^2 + y^2 -1 = 0 \end{cases}</math> Notice that the last equation is the original constraint. 
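This stationarity system can also be checked numerically. A minimal stdlib-only sketch (not part of the article's derivation) that confirms the two stationary points stated in the solution below satisfy all three equations:

```python
# Verify that grad L vanishes at the two stationary points of Example 1,
# where L(x, y, lam) = x + y + lam*(x^2 + y^2 - 1).
import math

def grad_L(x, y, lam):
    # (dL/dx, dL/dy, dL/dlam)
    return (1 + 2*lam*x, 1 + 2*lam*y, x**2 + y**2 - 1)

r = math.sqrt(2) / 2
for x, y, lam in [(r, r, -1/math.sqrt(2)), (-r, -r, 1/math.sqrt(2))]:
    assert all(abs(g) < 1e-12 for g in grad_L(x, y, lam))
    print((round(x, 4), round(y, 4)), 'f =', round(x + y, 4))  # f = ±sqrt(2)
```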
The first two equations yield <math display="block"> x = y = - \frac{1}{2\lambda}, \qquad \lambda \neq 0 ~.</math> By substituting into the last equation we have: <math display="block"> \frac{1}{4\lambda^2} + \frac{1}{4\lambda^2} - 1 = 0\ ,</math> so <math display="block"> \lambda = \pm \frac{1}{\sqrt{2\ }}\ ,</math> which implies that the stationary points of <math>\mathcal{L}</math> are <math display="block">\left(\tfrac{\sqrt{2\ }}{2}, \tfrac{\sqrt{2\ }}{2}, -\tfrac{1}{\sqrt{2\ }}\right), \qquad \left(-\tfrac{\sqrt{2\ }}{2}, -\tfrac{\sqrt{2\ }}{2}, \tfrac{1}{\sqrt{2\ }} \right) ~.</math> Evaluating the objective function {{mvar|f}} at these points yields <math display="block">f\left(\tfrac{\sqrt{2\ }}{2}, \tfrac{\sqrt{2\ }}{2}\right) = \sqrt{2\ }\ , \qquad f\left(-\tfrac{\sqrt{2\ }}{2}, -\tfrac{\sqrt{2\ }}{2} \right) = -\sqrt{2\ } ~.</math> Thus the constrained maximum is <math>\ \sqrt{2\ }\ </math> and the constrained minimum is <math>-\sqrt{2}</math>. === Example 2 === [[Image:Lagrange very simple-1b.svg|thumb|right|300px|Illustration of the constrained optimization problem '''2''']] Now we modify the objective function of Example '''1''' so that we minimize <math>\ f(x,y) = (x + y)^2\ </math> instead of <math>\ f(x,y) = x + y\ ,</math> again along the circle <math>\ g(x,y)=x^2+y^2-1 = 0 ~.</math> Now the level sets of <math>f</math> are still lines of slope −1, and the points on the circle tangent to these level sets are again <math>\ (\sqrt{2}/2,\sqrt{2}/2)\ </math> and <math>\ (-\sqrt{2}/2,-\sqrt{2}/2) ~.</math> These tangency points are maxima of <math>\ f ~.</math> On the other hand, the minima occur on the level set for <math>\ f = 0\ </math> (since by its construction <math>\ f\ </math> cannot take negative values), at <math>\ (\sqrt{2}/2,-\sqrt{2}/2)\ </math> and <math>\ (-\sqrt{2}/2, \sqrt{2}/2)\ ,</math> where the level curves of <math>\ f\ </math> are not tangent to the constraint. 
The condition that <math>\ \nabla_{x,y,\lambda}\left(f(x,y)+\lambda\cdot g(x,y) \right)=0\ </math> correctly identifies all four points as extrema; the minima are characterized by <math>\ \lambda =0\ </math> and the maxima by <math>\ \lambda = -2 ~.</math> === Example 3 === [[Image:Lagrange simple.svg|thumb|right|300px| Illustration of constrained optimization problem '''3'''.]] This example deals with more strenuous calculations, but it is still a single-constraint problem. Suppose one wants to find the maximum values of <math display="block"> f(x, y) = x^2 y</math> with the condition that the <math>\ x\ </math>- and <math>\ y\ </math>-coordinates lie on the circle around the origin with radius <math>\ \sqrt{3\ } ~.</math> That is, subject to the constraint <math display="block"> g(x,y) = x^2 + y^2 - 3 = 0 ~.</math> As there is just a single constraint, there is a single multiplier, say <math>\ \lambda ~.</math> The constraint <math>\ g(x,y)\ </math> is identically zero on the circle of radius <math>\ \sqrt{3\ } ~.</math> Any multiple of <math>\ g(x,y)\ </math> may be added to <math>\ f(x,y)\ </math> leaving <math>\ f(x,y)\ </math> unchanged in the region of interest (on the circle where our original constraint is satisfied). Applying the ordinary Lagrange multiplier method yields <math display="block">\begin{align} \mathcal{L}(x, y, \lambda) &= f(x,y) + \lambda \cdot g(x, y) \\ &= x^2y + \lambda (x^2 + y^2 - 3)\ , \end{align}</math> from which the gradient can be calculated: <math display="block">\begin{align} \nabla_{x,y,\lambda} \mathcal{L}(x , y, \lambda) &= \left ( \frac{\partial \mathcal{L}}{\partial x}, \frac{\partial \mathcal{L}}{\partial y}, \frac{\partial \mathcal{L}}{\partial \lambda} \right) \\ &= \left ( 2 x y + 2 \lambda x, x^2 + 2 \lambda y, x^2 + y^2 -3 \right) ~. 
\end{align}</math> And therefore: <math display="block">\nabla_{x,y,\lambda} \mathcal{L}(x , y, \lambda)=0 \quad \iff \quad \begin{cases} 2 x y + 2 \lambda x = 0 \\ x^2 + 2 \lambda y = 0 \\ x^2 + y^2 - 3 = 0 \end{cases} \quad \iff \quad \begin{cases} x (y + \lambda) = 0 & \text{(i)} \\ x^2 = -2 \lambda y & \text{(ii)} \\ x^2 + y^2 = 3 & \text{(iii)} \end{cases} </math> (iii) is just the original constraint. (i) implies <math>\ x=0\ </math> or <math>\ \lambda=-y ~.</math> If <math>x=0</math> then <math>\ y = \pm \sqrt{3\ }\ </math> by (iii) and consequently <math>\ \lambda=0\ </math> from (ii). If <math>\ \lambda=-y\ ,</math> substituting this into (ii) yields <math>\ x^2 = 2y^2 ~.</math> Substituting this into (iii) and solving for <math>\ y\ </math> gives <math>\ y=\pm1 ~.</math> Thus there are six critical points of <math>\ \mathcal{L}\ :</math> <math display="block"> (\sqrt{2\ },1,-1); \quad (-\sqrt{2\ },1,-1); \quad (\sqrt{2\ },-1,1); \quad (-\sqrt{2\ },-1,1); \quad (0,\sqrt{3\ }, 0); \quad (0,-\sqrt{3\ }, 0) ~.</math> Evaluating the objective at these points, one finds that <math display="block"> f(\pm\sqrt{2\ },1) = 2; \quad f(\pm\sqrt{2\ },-1) = -2; \quad f(0,\pm \sqrt{3\ }) = 0 ~.</math> Therefore, the objective function attains the [[global maximum]] (subject to the constraints) at <math>\ (\pm\sqrt{2\ },1\ )</math> and the [[global minimum]] at <math>\ (\pm\sqrt{2\ },-1) ~.</math> The point <math>\ (0,\sqrt{3\ })\ </math> is a [[local minimum]] of <math>\ f\ </math> and <math>\ (0,-\sqrt{3\ })\ </math> is a [[local maximum]] of <math>\ f\ ,</math> as may be determined by consideration of the [[Hessian (mathematics)#Bordered Hessian|Hessian matrix]] of <math>\ \mathcal{L}(x,y,0) ~.</math> Note that while <math>\ (\sqrt{2\ }, 1, -1)\ </math> is a critical point of <math>\ \mathcal{L}\ ,</math> it is not a local extremum of <math>\ \mathcal{L} ~.</math> We have <math display="block">\mathcal{L} \left (\sqrt{2\ } + \varepsilon, 1, -1 + \delta \right ) = 2 + 
\delta \left( \varepsilon^2 + \left (2\sqrt{2\ } \right)\varepsilon \right) ~.</math> Given any neighbourhood of <math>\ (\sqrt{2\ }, 1, -1)\ ,</math> one can choose a small positive <math>\ \varepsilon\ </math> and a small <math>\ \delta\ </math> of either sign to get <math>\ \mathcal{L}</math> values both greater and less than <math>\ 2 ~.</math> This can also be seen from the Hessian matrix of <math>\ \mathcal{L}\ </math> evaluated at this point (or indeed at any of the critical points) which is an [[indefinite matrix]]. Each of the critical points of <math>\ \mathcal{L}\ </math> is a [[saddle point]] of <math>\ \mathcal{L} ~.</math><ref name=Walsh1975/> === Example 4 – Entropy === Suppose we wish to find the [[Probability distribution#Discrete probability distribution|discrete probability distribution]] on the points <math>\ \{p_1, p_2, \ldots, p_n\}\ </math> with maximal [[information entropy]]. This is the same as saying that we wish to find the [[Principle of maximum entropy|least structured]] probability distribution on the points <math>\ \{p_1, p_2, \cdots, p_n\} ~.</math> In other words, we wish to maximize the [[Shannon entropy]] equation: <math display="block"> f(p_1,p_2,\ldots,p_n) = -\sum_{j=1}^n p_j \log_2 p_j ~.</math> For this to be a probability distribution the sum of the probabilities <math>\ p_i\ </math> at each point <math>\ x_i\ </math> must equal 1, so our constraint is: <math display="block"> g(p_1,p_2,\ldots,p_n)=\sum_{j=1}^n p_j = 1 ~.</math> We use Lagrange multipliers to find the point of maximum entropy, <math>\ \vec{p}^{\,*}\ ,</math> across all discrete probability distributions <math>\ \vec{p}\ </math> on <math>\ \{x_1,x_2, \ldots, x_n\} ~.</math> We require that: <math display="block">\left. \frac{\partial}{\partial \vec{p}}(f+\lambda (g-1)) \right|_{\vec{p}=\vec{p}^{\,*}} = 0\ ,</math> which gives a system of {{mvar|n}} equations, <math>\ k = 1,\ \ldots, n\ ,</math> such that: <math display="block">\left. 
\frac{\partial}{\partial p_k}\left\{ -\left (\sum_{j=1}^n p_j \log_2 p_j \right ) + \lambda \left(\sum_{j=1}^n p_j - 1 \right) \right\} \right|_{ p_k = p_{\star k} } = 0 ~.</math> Carrying out the differentiation of these {{mvar|n}} equations, we get <math display="block"> -\left(\frac{1}{\ln 2}+\log_2 p_{\star k} \right) + \lambda = 0 ~.</math> This shows that all <math>\ p_{\star k}\ </math> are equal (because they depend on {{mvar|λ}} only). By using the constraint <math display="block">\sum_j p_j =1\ ,</math> we find <math display="block"> p_{\star k} = \frac{1}{n} ~.</math> Hence, the uniform distribution is the distribution with the greatest entropy, among distributions on {{mvar|n}} points. === Example 5 – Numerical optimization === [[Image:lagnum1.png|thumb|right|300px|Lagrange multipliers cause the critical points to occur at saddle points (Example '''5''').]] [[Image:lagnum2.png|thumb|right|300px|The magnitude of the gradient can be used to force the critical points to occur at local minima (Example '''5''').]] The critical points of Lagrangians occur at [[saddle point]]s, rather than at local maxima (or minima).<ref name=Walsh1975/><ref name=Heath2005>{{cite book |first=Michael T. |last= Heath |author-link=Michael Heath (computer scientist) |year=2005 |title=Scientific Computing: An introductory survey |page=203 |publisher=McGraw-Hill |isbn=978-0-07-124489-3 |url=https://books.google.com/books?id=gwBrMAEACAAJ}}</ref> Unfortunately, many numerical optimization techniques, such as [[hill climbing]], [[gradient descent]], some of the [[quasi-Newton method]]s, among others, are designed to find local maxima (or minima) and not saddle points. 
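The failure mode is easy to reproduce. In this illustrative stdlib-only sketch (the step size 0.05 and the starting offset are arbitrary choices), plain gradient descent is applied directly to the Lagrangian of Example '''1'''; because its critical point is a saddle, the iterates drift away from it instead of converging:

```python
# Gradient descent applied directly to L(x, y, lam) = x + y + lam*(x^2 + y^2 - 1).
# The critical point (sqrt(2)/2, sqrt(2)/2, -1/sqrt(2)) is a saddle of L,
# so descent steps are repelled from it along the unstable directions.
import math

def grad_L(x, y, lam):
    return (1 + 2*lam*x, 1 + 2*lam*y, x**2 + y**2 - 1)

cx, cy, cl = math.sqrt(2)/2, math.sqrt(2)/2, -1/math.sqrt(2)  # the saddle
x, y, lam = cx + 0.01, cy + 0.01, cl + 0.01                   # start nearby
for step in range(1000):
    gx, gy, gl = grad_L(x, y, lam)
    x, y, lam = x - 0.05*gx, y - 0.05*gy, lam - 0.05*gl
    dist = math.dist((x, y, lam), (cx, cy, cl))
    if dist > 1.0:  # clearly repelled from the saddle
        break

print('distance from saddle:', round(dist, 2), 'after', step + 1, 'steps')
```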
For this reason, one must either modify the formulation to ensure that it is a minimization problem (for example, by extremizing the square of the [[gradient]] of the Lagrangian as below), or else use an optimization technique that finds [[stationary points]] (such as [[Newton's method in optimization|Newton's method]] without an extremum-seeking [[line search]]) and not necessarily extrema. As a simple example, consider the problem of finding the value of {{mvar|x}} that minimizes <math>\ f(x) = x^2\ ,</math> constrained such that <math>\ x^2 = 1 ~.</math> (This problem is somewhat untypical because there are only two values that satisfy this constraint, but it is useful for illustration purposes because the corresponding unconstrained function can be visualized in three dimensions.) Using Lagrange multipliers, this problem can be converted into an unconstrained optimization problem: <math display="block"> \mathcal{L}( x, \lambda ) = x^2 + \lambda(x^2-1) ~.</math> The two critical points occur at saddle points where {{math|''x'' {{=}} 1}} and {{math|''x'' {{=}} −1}}. In order to solve this problem with a numerical optimization technique, we must first transform this problem such that the critical points occur at local minima. This is done by computing the magnitude of the gradient of the unconstrained optimization problem. First, we compute the partial derivative of the unconstrained problem with respect to each variable: <math display="block">\begin{align} & \frac{\partial \mathcal{L} }{ \partial x } = 2x + 2x \lambda \\[5pt] & \frac{\partial \mathcal{L} }{ \partial \lambda }=x^2-1 ~. 
\end{align}</math> If the target function is not easily differentiable, the differential with respect to each variable can be approximated as <math display="block">\begin{align} \frac{\ \partial \mathcal{L}\ }{ \partial x } \approx \frac{\mathcal{L}(x + \varepsilon,\lambda) - \mathcal{L}(x,\lambda)}{\varepsilon}, \\[5pt] \frac{\ \partial \mathcal{L}\ }{ \partial \lambda } \approx \frac{\mathcal{L}(x, \lambda + \varepsilon) - \mathcal{L}(x,\lambda)}{\varepsilon}, \end{align}</math> where <math> \varepsilon </math> is a small value. Next, we compute the magnitude of the gradient, which is the square root of the sum of the squares of the partial derivatives: <math display="block">\begin{align} h(x,\lambda) & = \sqrt{ (2x+2x\lambda)^2 + (x^2-1)^2\ } \\[4pt] & \approx \sqrt{ \left(\frac{\ \mathcal{L}(x+\varepsilon,\lambda)-\mathcal{L}(x,\lambda)\ }{\varepsilon}\right)^2 + \left(\frac{\ \mathcal{L}(x,\lambda+\varepsilon) - \mathcal{L}(x,\lambda)\ }{\varepsilon}\right)^2\ }~ . \end{align}</math> (Since magnitude is always non-negative, optimizing over the squared-magnitude is equivalent to optimizing over the magnitude. Thus, the "square root" may be omitted from these equations with no expected difference in the results of optimization.) The critical points of {{mvar|h}} occur at {{math|1=''x'' = 1}} and {{math|1=''x'' = −1}}, just as in <math> \mathcal{L} ~.</math> Unlike the critical points in <math> \mathcal{L}\, ,</math> however, the critical points in {{mvar|h}} occur at local minima, so numerical optimization techniques can be used to find them.
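The whole procedure can be carried out numerically. The stdlib-only sketch below minimizes the squared magnitude <math>h^2</math> by plain gradient descent using the forward-difference approximation from above; the starting point (1.5, 0), the step size 0.01, and the iteration count are arbitrary illustrative choices:

```python
# Minimize h(x, lam)^2 = (2x + 2x*lam)^2 + (x^2 - 1)^2, the squared gradient
# magnitude of L(x, lam) = x^2 + lam*(x^2 - 1), by gradient descent with
# forward-difference gradients.

def h2(x, lam):
    return (2*x + 2*x*lam)**2 + (x**2 - 1)**2

def descend(x, lam, step=0.01, eps=1e-6, iters=5000):
    for _ in range(iters):
        # forward-difference approximation of the gradient of h^2
        gx = (h2(x + eps, lam) - h2(x, lam)) / eps
        gl = (h2(x, lam + eps) - h2(x, lam)) / eps
        x, lam = x - step*gx, lam - step*gl
    return x, lam

x, lam = descend(1.5, 0.0)
print(round(x, 3), round(lam, 3))  # converges near x = 1, lam = -1
```

From this positive starting point descent settles at the critical point <math>x = 1</math>, <math>\lambda = -1</math>, which is a local minimum of <math>h</math> (starting at a negative <math>x</math> would find <math>x = -1</math> instead).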