{{Short description|Statistical distance measure}}
The '''Mahalanobis distance''' is a [[distance measure|measure of the distance]] between a point <math>P</math> and a [[probability distribution]] <math>D</math>, introduced by [[Prasanta Chandra Mahalanobis|P. C. Mahalanobis]] in 1936.<ref>{{Cite journal |date=2018-12-01 |title=Reprint of: Mahalanobis, P.C. (1936) "On the Generalised Distance in Statistics." |url=https://doi.org/10.1007/s13171-019-00164-5 |journal=Sankhya A |language=en |volume=80 |issue=1 |pages=1–7 |doi=10.1007/s13171-019-00164-5 |issn=0976-8378|url-access=subscription }}</ref> The mathematical details of the Mahalanobis distance first appeared in the ''Journal of The Asiatic Society of Bengal'' in 1936.<ref>{{Cite book |last= |url=https://archive.org/details/dli.ernet.28728/page/n813/mode/1up |title=Journal and Procedings Of The Asiatic Society Of Bengal Vol-xxvi |date=1933 |publisher=Asiatic Society Of Bengal Calcutta}}</ref> Mahalanobis's definition was prompted by the problem of [[similarity measure|identifying the similarities]] of skulls based on measurements (his earliest work on the similarities of skulls dates from 1922, with a further study in 1927).<ref>{{Cite book |last=Mahalanobis |first=Prasanta Chandra |url=http://archive.org/details/records-indian-museum-23-001-096 |title=Anthropological Observations on the Anglo-Indians of Culcutta---Analysis of Male Stature |date=1922 |language=English}}</ref><ref>{{Cite journal |last=Mahalanobis |first=Prasanta Chandra |date=1927 |title=Analysis of race mixture in Bengal |url=https://archive.org/details/in.ernet.dli.2015.280409/page/n522/mode/1up |journal=Journal and Proceedings of the Asiatic Society of Bengal |volume=23 |pages=301–333}}</ref> [[Raj Chandra Bose|R.C. Bose]] later obtained the sampling distribution of the Mahalanobis distance under the assumption of equal dispersion.<ref>{{Cite book |last= |url=https://archive.org/details/in.ernet.dli.2015.23164/page/n169/mode/1up |title=Science And Culture (1935-36) Vol. 1 |date=1935 |publisher=Indian Science News Association |pages=205–206}}</ref>

It is a multivariate generalization of the [[standard score]] <math>z=(x- \mu)/\sigma</math>, measuring how many [[standard deviations]] away <math>P</math> is from the [[mean]] of <math>D</math>. This distance is zero for <math>P</math> at the mean of <math>D</math> and grows as <math>P</math> moves away from the mean along each [[principal component]] axis. If each of these axes is re-scaled to have unit variance, then the Mahalanobis distance corresponds to standard [[Euclidean distance]] in the transformed space. The Mahalanobis distance is thus [[unitless]], [[Scale invariance|scale-invariant]], and takes into account the [[correlations]] of the [[data set]].

==Definition==
Given a probability distribution <math>Q</math> on <math>\R^N</math>, with mean <math>\vec{\mu} = (\mu_1, \mu_2, \mu_3, \dots , \mu_N)^\mathsf{T}</math> and positive semi-definite [[covariance matrix]] <math>\mathbf{\Sigma}</math>, the Mahalanobis distance of a point <math>\vec{x} = (x_1, x_2, x_3, \dots, x_N )^\mathsf{T}</math> from <math>Q</math> is <ref>{{Cite journal |last1=De Maesschalck |first1=R. |last2=Jouan-Rimbaud |first2=D. |last3=Massart |first3=D. L.
|title=The Mahalanobis distance |journal=Chemometrics and Intelligent Laboratory Systems |year=2000 |volume=50 |issue=1 |pages=1–18 |doi=10.1016/s0169-7439(99)00047-7}}</ref><math display="block">d_M(\vec{x}, Q) = \sqrt{(\vec{x} - \vec{\mu})^\mathsf{T} \mathbf{\Sigma}^{-1} (\vec{x} - \vec{\mu})}.</math>Given two points <math>\vec{x}</math> and <math>\vec{y}</math> in <math>\R^N</math>, the Mahalanobis distance between them with respect to <math>Q</math> is<math display="block"> d_M(\vec{x} ,\vec{y}; Q) = \sqrt{(\vec{x} - \vec{y})^\mathsf{T} \mathbf{\Sigma}^{-1} (\vec{x} - \vec{y})},</math>which means that <math>d_M(\vec{x}, Q) = d_M(\vec{x},\vec{\mu}; Q)</math>. Since <math>\mathbf{\Sigma}</math> is [[Positive semidefinite matrices|positive semi-definite]], so is <math>\mathbf{\Sigma}^{-1}</math> whenever it exists; thus the quantities under the square roots are non-negative and the distances are always defined.

Useful decompositions of the squared Mahalanobis distance help to explain the sources of outlyingness of multivariate observations and also provide a graphical tool for identifying outliers.<ref>{{Cite journal |last=Kim |first=M. G. |year=2000 |title=Multivariate outliers and decompositions of Mahalanobis distance |journal=Communications in Statistics – Theory and Methods |volume=29 |issue=7 |pages=1511–1526 |doi=10.1080/03610920008832559|s2cid=218567835 }}</ref>

By the [[spectral theorem]], <math>\mathbf{\Sigma}</math> can be decomposed as <math> \mathbf{\Sigma} = \mathbf{S}^\mathsf{T} \mathbf{S}</math> for some real <math> N\times N</math> matrix <math>\mathbf{S}</math>. One choice for <math>\mathbf{S}</math> is the symmetric square root of <math>\mathbf{\Sigma}</math>, which is the [[Standard deviation#Standard deviation matrix|standard deviation matrix]].<ref name="Das">{{cite arXiv |eprint=2012.14331 |last1=Das |first1=Abhranil |author2=Wilson S Geisler |title=Methods to integrate multinormals and compute classification measures |date=2020 |class=stat.ML }}</ref> This gives the equivalent definition<math display="block">d_M(\vec{x}, \vec{y}; Q) = \|\mathbf{S}^{-1}(\vec{x} - \vec{y})\|,</math>where <math>\|\cdot\|</math> is the Euclidean norm. That is, the Mahalanobis distance is the Euclidean distance after a [[whitening transformation]]. The existence of <math>\mathbf{S}</math> is guaranteed by the spectral theorem, but it is not unique, and different choices have different theoretical and practical advantages.<ref>{{Cite journal |last1=Kessy |first1=Agnan |last2=Lewin |first2=Alex |last3=Strimmer |first3=Korbinian |date=2018-10-02 |title=Optimal Whitening and Decorrelation |url=https://doi.org/10.1080/00031305.2016.1277159 |journal=The American Statistician |volume=72 |issue=4 |pages=309–314 |doi=10.1080/00031305.2016.1277159 |s2cid=55075085 |issn=0003-1305|arxiv=1512.00809 }}</ref>

In practice, the distribution <math>Q</math> is usually the [[sample distribution]] of a set of [[Independent and identically distributed random variables|IID]] samples from an underlying unknown distribution, so <math>\vec{\mu}</math> is the sample mean and <math>\mathbf{\Sigma}</math> is the covariance matrix of the samples. When the [[affine span]] of the samples is not the entire <math>\R^N</math>, the covariance matrix is not positive-definite, so the above definition does not apply directly. However, the Mahalanobis distance is preserved under any full-rank affine transformation of the affine span of the samples, so in that case the samples can first be orthogonally projected onto <math>\R^n</math>, where <math>n</math> is the dimension of the affine span, and the Mahalanobis distance can then be computed as usual.
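As an illustrative sketch only (the helper function <code>mahalanobis_from_sample</code> and all variable names below are arbitrary, and the sample covariance matrix is assumed to be invertible), the distance can be computed from a sample with [[NumPy]]:

<syntaxhighlight lang="python">
import numpy as np

def mahalanobis_from_sample(x, samples):
    """Mahalanobis distance of the point x from the sample distribution of `samples`."""
    mu = samples.mean(axis=0)              # sample mean vector
    sigma = np.cov(samples, rowvar=False)  # sample covariance matrix (N x N)
    delta = x - mu
    # Solve sigma @ z = delta rather than forming an explicit inverse.
    return float(np.sqrt(delta @ np.linalg.solve(sigma, delta)))

rng = np.random.default_rng(0)
samples = rng.multivariate_normal([0.0, 0.0], [[2.0, 0.5], [0.5, 1.0]], size=1000)
print(mahalanobis_from_sample(np.array([1.0, 1.0]), samples))
</syntaxhighlight>

SciPy's <code>scipy.spatial.distance.mahalanobis(u, v, VI)</code>, listed in the table of software implementations below, returns the same quantity when <code>VI</code> is the inverse of the covariance matrix.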
==Intuitive explanation==
{{unreferenced section|date=May 2021}}
Consider the problem of estimating the probability that a test point in ''N''-dimensional [[Euclidean space]] belongs to a set, where we are given sample points that definitely belong to that set. Our first step would be to find the [[centroid]] or center of mass of the sample points. Intuitively, the closer the point in question is to this center of mass, the more likely it is to belong to the set.

However, we also need to know whether the set is spread out over a large range or a small range, so that we can decide whether a given distance from the center is noteworthy or not. A simple approach is to estimate the [[standard deviation]] of the distances of the sample points from the center of mass. If the distance between the test point and the center of mass is less than one standard deviation, then we might conclude that it is highly probable that the test point belongs to the set. The further away it is, the more likely it is that the test point should not be classified as belonging to the set.

This intuitive approach can be made quantitative by defining the normalized distance between the test point and the set to be <math>\frac{\lVert x - \mu\rVert_2}{\sigma}</math>, which reads: <math>\frac{\text{test point} - \text{sample mean}}{\text{standard deviation}}</math>. By plugging this into the normal distribution, we can derive the probability of the test point belonging to the set.

The drawback of this approach is that it assumes the sample points are distributed about the center of mass in a spherical manner. Were the distribution decidedly non-spherical, for instance ellipsoidal, then we would expect the probability of the test point belonging to the set to depend not only on the distance from the center of mass, but also on the direction. In those directions where the ellipsoid has a short axis the test point must be closer, while in those where the axis is long the test point can be further away from the center.

Putting this on a mathematical basis, the ellipsoid that best represents the set's probability distribution can be estimated by building the covariance matrix of the samples. The Mahalanobis distance is the distance of the test point from the center of mass divided by the width of the ellipsoid in the direction of the test point.

==Normal distributions==
For a [[multivariate normal distribution|normal distribution]] in any number of dimensions, the probability density of an observation <math>\vec{x}</math> is uniquely determined by the Mahalanobis distance <math>d</math>:

: <math> \begin{align} \Pr[\vec x] \,d\vec x & = \frac 1 {\sqrt{\det(2\pi \mathbf{\Sigma})}} \exp \left(-\frac{(\vec x - \vec \mu)^\mathsf{T} \mathbf{\Sigma}^{-1} (\vec x - \vec \mu)} 2 \right) \,d\vec{x} \\[6pt] & = \frac{1}{\sqrt{\det(2\pi \mathbf{\Sigma})}} \exp\left( -\frac{d^2} 2 \right) \,d\vec x. \end{align} </math>

Specifically, <math>d^2</math> follows the [[chi-squared distribution]] with <math>n</math> degrees of freedom, where <math>n</math> is the number of dimensions of the normal distribution. If the number of dimensions is 2, for example, the probability of a particular calculated <math>d</math> being less than some threshold <math>t</math> is <math>1 - e^{-t^2/2}</math>. To determine a threshold that achieves a particular probability <math>p</math> in 2 dimensions, use <math display="inline">t = \sqrt{-2\ln(1 - p)}</math>. For other numbers of dimensions, the cumulative chi-squared distribution should be consulted.
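As a numerical illustration of this relationship (a sketch only; the variable names are arbitrary, and SciPy's <code>scipy.stats.chi2</code> distribution object is assumed), a threshold for an arbitrary number of dimensions can be read off the chi-squared quantile function:

<syntaxhighlight lang="python">
import numpy as np
from scipy.stats import chi2

p = 0.95  # desired probability that d falls below the threshold
n = 2     # number of dimensions

# d^2 follows a chi-squared law with n degrees of freedom, so the threshold
# on d itself is the square root of the chi-squared quantile.
t = np.sqrt(chi2.ppf(p, df=n))
print(t)                            # about 2.448 for p = 0.95, n = 2
print(np.sqrt(-2 * np.log(1 - p)))  # agrees with the closed form for n = 2
</syntaxhighlight>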
In a normal distribution, the region where the Mahalanobis distance is less than one (i.e. the region inside the ellipsoid at distance one) is exactly the region where the probability distribution is [[concave function|concave]]. The Mahalanobis distance is proportional, for a normal distribution, to the square root of the negative [[log-likelihood]] (after adding a constant so the minimum is at zero).

== Other forms of multivariate location and scatter ==
[[File:Mahalanobis-distance-location-and-scatter-methods.png|thumb|620x620px|Hypothetical two-dimensional example of Mahalanobis distance with three different methods of defining the multivariate location and scatter of the data.]]
The sample mean and covariance matrix can be quite sensitive to outliers; therefore, other approaches to estimating the multivariate location and scatter of data are also commonly used when calculating the Mahalanobis distance. The Minimum Covariance Determinant approach estimates multivariate location and scatter from a subset of <math>h</math> data points whose variance-covariance matrix has the smallest determinant.<ref>{{Cite journal|last1=Hubert|first1=Mia|last2=Debruyne|first2=Michiel|date=2010|title=Minimum covariance determinant|url=https://onlinelibrary.wiley.com/doi/10.1002/wics.61|journal=WIREs Computational Statistics|language=en|volume=2|issue=1|pages=36–43|doi=10.1002/wics.61|s2cid=123086172 |issn=1939-5108|url-access=subscription}}</ref> The Minimum Volume Ellipsoid approach is similar in that it also works with a subset of <math>h</math> data points, but it estimates multivariate location and scatter from the ellipsoid of minimal volume that encloses those <math>h</math> data points.<ref>{{Cite journal|last1=Van Aelst|first1=Stefan|last2=Rousseeuw|first2=Peter|date=2009|title=Minimum volume ellipsoid|url=https://onlinelibrary.wiley.com/doi/10.1002/wics.19|journal=Wiley Interdisciplinary Reviews: Computational Statistics|language=en|volume=1|issue=1|pages=71–82|doi=10.1002/wics.19|s2cid=122106661 |issn=1939-5108|url-access=subscription}}</ref> The methods differ in how they characterize the distribution of the data, and therefore produce different Mahalanobis distances. The Minimum Covariance Determinant and Minimum Volume Ellipsoid approaches are more robust to samples that contain outliers, while the sample mean and covariance matrix tend to be more reliable with small and biased data sets.<ref>{{Cite journal|last=Etherington|first=Thomas R.|date=2021-05-11|title=Mahalanobis distances for ecological niche modelling and outlier detection: implications of sample size, error, and bias for selecting and parameterising a multivariate location and scatter method|journal=PeerJ|language=en|volume=9|pages=e11436|doi=10.7717/peerj.11436|issn=2167-8359|pmc=8121071|pmid=34026369 |doi-access=free }}</ref>
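One freely available implementation of the Minimum Covariance Determinant estimator is <code>MinCovDet</code> in [[scikit-learn]]; the sketch below (the data, contamination, and parameter values are illustrative only) contrasts squared Mahalanobis distances computed from the classical and the robust estimates of location and scatter:

<syntaxhighlight lang="python">
import numpy as np
from sklearn.covariance import EmpiricalCovariance, MinCovDet

rng = np.random.default_rng(1)
X = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.7], [0.7, 1.0]], size=200)
X[:10] += 8.0  # contaminate the sample with a few gross outliers

point = np.array([[3.0, 3.0]])

# Squared Mahalanobis distances under the classical and the robust (MCD)
# estimates generally differ when the sample contains outliers.
d2_classical = EmpiricalCovariance().fit(X).mahalanobis(point)
d2_robust = MinCovDet(random_state=0).fit(X).mahalanobis(point)
print(d2_classical, d2_robust)
</syntaxhighlight>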
==Relationship to normal random variables==
In general, given a normal ([[Gaussian distribution|Gaussian]]) random variable <math>X</math> with variance <math>S=1</math> and mean <math>\mu = 0</math>, any other normal random variable <math>R</math> (with mean <math>\mu_1</math> and variance <math>S_1</math>) can be defined in terms of <math>X</math> by the equation <math>R = \mu_1 + \sqrt{S_1}X.</math> Conversely, to recover a normalized random variable from any normal random variable, one can typically solve for <math>X = (R - \mu_1)/\sqrt{S_1} </math>. Squaring both sides and taking the square root gives an equation for a metric that looks much like the Mahalanobis distance:

<math display="block">D = \sqrt{X^2} = \sqrt{(R - \mu_1)^2/S_1} = \sqrt{(R - \mu_1) S_1^{-1} (R - \mu_1) }.</math>

The resulting magnitude is always non-negative and varies with the distance of the data from the mean, attributes that are convenient when trying to define a model for the data.

==Relationship to leverage==
{{Main|Leverage (statistics)#Mahalanobis distance}}
Mahalanobis distance is closely related to the [[Leverage (statistics)|leverage statistic]], <math>h</math>, but has a different scale:
<math display="block">D^2 = (N - 1) \left(h - \tfrac 1 N \right).</math>

==Applications==
Mahalanobis distance is widely used in [[Data clustering|cluster analysis]] and [[Statistical classification|classification]] techniques. It is closely related to [[Hotelling's T-square distribution]], used for multivariate statistical testing, and to Fisher's [[linear discriminant analysis]], which is used for [[supervised classification]].<ref>{{cite book |url={{google books |plainurl=y |id=O_qHDLaWpDUC&pg=PR13}} |title=Discriminant Analysis and Statistical Pattern Recognition |last=McLachlan |first=Geoffrey |date=4 August 2004 |publisher=John Wiley & Sons |isbn=978-0-471-69115-0 |pages=13–}}</ref>

In order to use the Mahalanobis distance to classify a test point as belonging to one of ''N'' classes, one first [[Estimation of covariance matrices|estimates the covariance matrix]] of each class, usually based on samples known to belong to each class. Then, given a test sample, one computes the Mahalanobis distance to each class, and classifies the test point as belonging to that class for which the Mahalanobis distance is minimal.

Mahalanobis distance and leverage are often used to detect [[outlier]]s, especially in the development of [[linear regression]] models. A point that has a greater Mahalanobis distance from the rest of the sample population of points is said to have higher leverage, since it has a greater influence on the slope or coefficients of the regression equation. Mahalanobis distance is also used to determine multivariate outliers. Regression techniques can be used to determine whether a specific case within a sample population is an outlier via the combination of two or more variable scores. Even for normal distributions, a point can be a multivariate outlier without being a univariate outlier for any single variable (consider a probability density concentrated along the line <math>x_1 = x_2</math>, for example), making the Mahalanobis distance a more sensitive measure than checking dimensions individually.
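Combining the chi-squared relationship above with sample estimates of location and scatter gives a simple multivariate outlier check; the following is a rough sketch only (the function and variable names are arbitrary, and the sample covariance matrix is assumed to be invertible):

<syntaxhighlight lang="python">
import numpy as np
from scipy.stats import chi2

def flag_outliers(X, alpha=0.01):
    """Flag rows of X whose squared Mahalanobis distance exceeds the chi-squared cutoff."""
    mu = X.mean(axis=0)
    sigma_inv = np.linalg.inv(np.cov(X, rowvar=False))
    delta = X - mu
    d2 = np.einsum("ij,jk,ik->i", delta, sigma_inv, delta)  # squared distances, one per row
    return d2 > chi2.ppf(1 - alpha, df=X.shape[1])

rng = np.random.default_rng(2)
X = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.9], [0.9, 1.0]], size=500)
X[0] = [2.5, -2.5]  # moderate in each coordinate, but far from the correlation axis
print(flag_outliers(X).nonzero()[0])  # row 0 is flagged, along with any chance exceedances
</syntaxhighlight>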
Mahalanobis distance has also been used in [[ecological niche modelling]],<ref>{{Cite journal|last=Etherington|first=Thomas R.|date=2019-04-02|title=Mahalanobis distances and ecological niche modelling: correcting a chi-squared probability error|journal=PeerJ|language=en|volume=7|pages=e6678|doi=10.7717/peerj.6678|issn=2167-8359|pmc=6450376|pmid=30972255 |doi-access=free }}</ref><ref>{{Cite journal|last1=Farber|first1=Oren|last2=Kadmon|first2=Ronen|date=2003|title=Assessment of alternative approaches for bioclimatic modeling with special emphasis on the Mahalanobis distance|journal=Ecological Modelling|language=en|volume=160|issue=1–2|pages=115–130|doi=10.1016/S0304-3800(02)00327-7|doi-access=}}</ref> as the convex elliptical shape of the distances relates well to the concept of the [[fundamental niche]].

Another example of usage is in finance, where the Mahalanobis distance has been used to compute an indicator called the "turbulence index",<ref>{{Cite journal|last1=Kritzman|first1=M.|last2=Li|first2=Y.|date=2019-04-02|title=Skulls, Financial Turbulence, and Risk Management|url=https://www.tandfonline.com/doi/abs/10.2469/faj.v66.n5.3|journal=Financial Analysts Journal|language=en|volume=66|issue=5|pages=30–41|doi=10.2469/faj.v66.n5.3 |s2cid=53478656 |url-access=subscription}}</ref> a statistical measure of the abnormal behaviour of financial markets. An implementation of this indicator as a Web API is available online.<ref>{{Cite web|url=https://portfoliooptimizer.io/|title=Portfolio Optimizer |website=portfoliooptimizer.io/|access-date=2022-04-23}}</ref>

==Software implementations==
Many programming languages and statistical packages, such as [[R (programming language)|R]] and [[Python (programming language)|Python]], include implementations of the Mahalanobis distance.

{| class="wikitable sortable"
! Language/program !! Function !! Ref.
|-
| [[Julia (programming language)|Julia]] || <code>mahalanobis(x, y, Q)</code> || [https://github.com/JuliaStats/Distances.jl#distance-type-hierarchy]
|-
| [[MATLAB]] || <code>mahal(x, y)</code> || [https://de.mathworks.com/help/stats/mahal.html]
|-
| [[R (programming language)|R]] || <code>mahalanobis(x, center, cov, inverted = FALSE, ...)</code> || [https://stat.ethz.ch/R-manual/R-devel/library/stats/html/mahalanobis.html]
|-
| [[SciPy]] ([[Python (programming language)|Python]]) || <code>mahalanobis(u, v, VI)</code> || [https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.mahalanobis.html]
|}

==See also==
* [[Bregman divergence]] (the Mahalanobis distance is an example of a Bregman divergence)
* [[Bhattacharyya distance]], a related measure of similarity between data sets (rather than between a point and a data set)
* [[Hamming distance]], which identifies the bit-by-bit difference between two strings
* [[Hellinger distance]], also a measure of distance between data sets
* [[Similarity learning]], for other approaches to learning a distance metric from examples
==References==
{{reflist}}

== External links ==
* {{springer|title=Mahalanobis distance|id=p/m062130}}
* [http://people.revoledu.com/kardi/tutorial/Similarity/MahalanobisDistance.html Mahalanobis distance tutorial] – interactive online program and spreadsheet computation
* [http://matlabdatamining.blogspot.com/2006/11/mahalanobis-distance.html Mahalanobis distance (Nov-17-2006)] – overview of Mahalanobis distance, including MATLAB code
* [http://blogs.sas.com/content/iml/2012/02/15/what-is-mahalanobis-distance/ What is Mahalanobis distance?] – intuitive, illustrated explanation, from Rick Wicklin on blogs.sas.com

{{DEFAULTSORT:Mahalanobis Distance}}
[[Category:Statistical distance]]
[[Category:Multivariate statistics]]
[[Category:Distance]]