===Other approaches===
In machine learning applications where logistic regression is used for binary classification, maximum likelihood estimation is equivalent to minimizing the [[cross-entropy]] loss function. Logistic regression is an important [[machine learning]] algorithm. The goal is to model the probability of a random variable <math>Y</math> being 0 or 1 given experimental data.<ref>{{cite journal | last = Ng | first = Andrew | year = 2000 | pages = 16–19 | journal = CS229 Lecture Notes | title = CS229 Lecture Notes | url = http://akademik.bahcesehir.edu.tr/~tevfik/courses/cmp5101/cs229-notes1.pdf}}</ref>

Consider a [[generalized linear model]] function parameterized by <math>\theta</math>,

:<math> h_\theta(X) = \frac{1}{1 + e^{-\theta^TX}} = \Pr(Y=1 \mid X; \theta) </math>

Therefore,

:<math> \Pr(Y=0 \mid X; \theta) = 1 - h_\theta(X) </math>

and since <math> Y \in \{0,1\}</math>, we see that <math> \Pr(y\mid X;\theta) </math> is given by

:<math> \Pr(y \mid X; \theta) = h_\theta(X)^y(1 - h_\theta(X))^{(1-y)}. </math>

We now calculate the [[likelihood function]] assuming that all the observations in the sample are independently Bernoulli distributed,

:<math>\begin{align}
L(\theta \mid y; x) &= \Pr(Y \mid X; \theta) \\
&= \prod_i \Pr(y_i \mid x_i; \theta) \\
&= \prod_i h_\theta(x_i)^{y_i}(1 - h_\theta(x_i))^{(1-y_i)}
\end{align}</math>

Typically, the log likelihood is maximized,

:<math> N^{-1} \log L(\theta \mid y; x) = N^{-1} \sum_{i=1}^N \log \Pr(y_i \mid x_i; \theta) </math>

which is maximized using optimization techniques such as [[gradient descent]].

Assuming the <math>(x, y)</math> pairs are drawn uniformly from the underlying distribution, then in the limit of large ''N'',

:<math>\begin{align}
& \lim \limits_{N \rightarrow +\infty} N^{-1} \sum_{i=1}^N \log \Pr(y_i \mid x_i; \theta) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} \Pr(X=x, Y=y) \log \Pr(Y=y \mid X=x; \theta) \\[6pt]
= {} & \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} \Pr(X=x, Y=y) \left( - \log\frac{\Pr(Y=y \mid X=x)}{\Pr(Y=y \mid X=x; \theta)} + \log \Pr(Y=y \mid X=x) \right) \\[6pt]
= {} & - D_\text{KL}( Y \parallel Y_\theta ) - H(Y \mid X)
\end{align}</math>

where <math>H(Y\mid X)</math> is the [[conditional entropy]] and <math>D_\text{KL}</math> is the [[Kullback–Leibler divergence]]. This leads to the intuition that maximizing the log-likelihood of a model is equivalent to minimizing the KL divergence of the model from the maximal-entropy distribution: it searches for the model that makes the fewest assumptions in its parameters.
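The following is a minimal illustrative sketch (not part of the cited source) of the procedure described above: fitting <math>\theta</math> by gradient descent on the average negative log-likelihood, i.e. the cross-entropy loss. It assumes a NumPy environment; the function names (<code>sigmoid</code>, <code>fit_logistic</code>) and the synthetic data are hypothetical choices made for the example.

<syntaxhighlight lang="python">
import numpy as np

def sigmoid(z):
    # h_theta(x) = 1 / (1 + exp(-theta^T x))
    return 1.0 / (1.0 + np.exp(-z))

def neg_log_likelihood(theta, X, y):
    # Average negative log-likelihood = cross-entropy loss.
    p = sigmoid(X @ theta)
    eps = 1e-12  # guard against log(0)
    return -np.mean(y * np.log(p + eps) + (1.0 - y) * np.log(1.0 - p + eps))

def fit_logistic(X, y, lr=0.1, n_steps=5000):
    # Maximize the log-likelihood by gradient descent on its negative.
    # Gradient of the average cross-entropy: X^T (h_theta(X) - y) / N.
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(n_steps):
        grad = X.T @ (sigmoid(X @ theta) - y) / n
        theta -= lr * grad
    return theta

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic data: intercept column plus two features.
    X = np.hstack([np.ones((500, 1)), rng.normal(size=(500, 2))])
    true_theta = np.array([-0.5, 2.0, -1.0])
    y = (rng.random(500) < sigmoid(X @ true_theta)).astype(float)

    theta_hat = fit_logistic(X, y)
    print("estimated theta:", theta_hat)
    print("average cross-entropy:", neg_log_likelihood(theta_hat, X, y))
</syntaxhighlight>

In practice the same loss is usually minimized with second-order or quasi-Newton methods (e.g. Newton–Raphson / IRLS) rather than plain gradient descent, but the objective is identical.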