===Other approaches===
In machine learning applications where logistic regression is used for binary classification, maximum likelihood estimation is equivalent to minimizing the [[cross-entropy]] loss function. Logistic regression is an important [[machine learning]] algorithm. The goal is to model the probability of a random variable <math>Y</math> being 0 or 1 given experimental data.<ref>{{cite journal | last = Ng | first = Andrew | year = 2000 | pages = 16–19 | journal = CS229 Lecture Notes | title = CS229 Lecture Notes | url = http://akademik.bahcesehir.edu.tr/~tevfik/courses/cmp5101/cs229-notes1.pdf}}</ref>

Consider a [[generalized linear model]] function parameterized by <math>\theta</math>,

:<math> h_\theta(X) = \frac{1}{1 + e^{-\theta^TX}} = \Pr(Y=1 \mid X; \theta) </math>

Therefore,

:<math> \Pr(Y=0 \mid X; \theta) = 1 - h_\theta(X) </math>

and since <math> Y \in \{0,1\}</math>, we see that <math> \Pr(y\mid X;\theta) </math> is given by

:<math> \Pr(y \mid X; \theta) = h_\theta(X)^y(1 - h_\theta(X))^{(1-y)}. </math>

We now calculate the [[likelihood function]] assuming that all the observations in the sample are independently Bernoulli distributed,

:<math>\begin{align}
L(\theta \mid y; x) &= \Pr(Y \mid X; \theta) \\
&= \prod_i \Pr(y_i \mid x_i; \theta) \\
&= \prod_i h_\theta(x_i)^{y_i}(1 - h_\theta(x_i))^{(1-y_i)}
\end{align}</math>

Typically, the log likelihood is maximized,

:<math> N^{-1} \log L(\theta \mid y; x) = N^{-1} \sum_{i=1}^N \log \Pr(y_i \mid x_i; \theta) </math>

which is maximized using optimization techniques such as [[gradient descent]].

Assuming the <math>(x, y)</math> pairs are drawn uniformly from the underlying distribution, then in the limit of large ''N'',

:<math>\begin{align}
& \lim \limits_{N \rightarrow +\infty} N^{-1} \sum_{i=1}^N \log \Pr(y_i \mid x_i; \theta) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} \Pr(X=x, Y=y) \log \Pr(Y=y \mid X=x; \theta) \\[6pt]
= {} & \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} \Pr(X=x, Y=y) \left( - \log\frac{\Pr(Y=y \mid X=x)}{\Pr(Y=y \mid X=x; \theta)} + \log \Pr(Y=y \mid X=x) \right) \\[6pt]
= {} & - D_\text{KL}( Y \parallel Y_\theta ) - H(Y \mid X)
\end{align}</math>

where <math>H(Y\mid X)</math> is the [[conditional entropy]] and <math>D_\text{KL}</math> is the [[Kullback–Leibler divergence]]. This leads to the intuition that maximizing the log-likelihood of a model is equivalent to minimizing the KL divergence of the model from the maximal-entropy distribution: it searches for the model that makes the fewest assumptions in its parameters.
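The following is a minimal illustrative sketch (not part of the cited source) of the procedure described above: fitting <math>\theta</math> by gradient descent on the average negative log-likelihood, i.e. the cross-entropy loss. It assumes a NumPy environment; the function names (<code>sigmoid</code>, <code>fit_logistic</code>) and the synthetic data are hypothetical choices made for the example.

<syntaxhighlight lang="python">
import numpy as np

def sigmoid(z):
    # h_theta(x) = 1 / (1 + exp(-theta^T x))
    return 1.0 / (1.0 + np.exp(-z))

def neg_log_likelihood(theta, X, y):
    # Average negative log-likelihood = cross-entropy loss.
    p = sigmoid(X @ theta)
    eps = 1e-12  # guard against log(0)
    return -np.mean(y * np.log(p + eps) + (1.0 - y) * np.log(1.0 - p + eps))

def fit_logistic(X, y, lr=0.1, n_steps=5000):
    # Maximize the log-likelihood by gradient descent on its negative.
    # Gradient of the average cross-entropy: X^T (h_theta(X) - y) / N.
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(n_steps):
        grad = X.T @ (sigmoid(X @ theta) - y) / n
        theta -= lr * grad
    return theta

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic data: intercept column plus two features.
    X = np.hstack([np.ones((500, 1)), rng.normal(size=(500, 2))])
    true_theta = np.array([-0.5, 2.0, -1.0])
    y = (rng.random(500) < sigmoid(X @ true_theta)).astype(float)

    theta_hat = fit_logistic(X, y)
    print("estimated theta:", theta_hat)
    print("average cross-entropy:", neg_log_likelihood(theta_hat, X, y))
</syntaxhighlight>

In practice the same loss is usually minimized with second-order or quasi-Newton methods (e.g. Newton–Raphson / IRLS) rather than plain gradient descent, but the objective is identical.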