{{Short description|Geometric property of a pair of sets of points in Euclidean geometry}}
[[File:Linearly separable red-blue cropped .svg|thumb|211x211px|The existence of a line separating the two types of points means that the data is linearly separable]]
In [[Euclidean geometry]], '''linear separability''' is a property of two sets of [[point (geometry)|points]]. It is most easily visualized in two dimensions (the [[Euclidean plane]]) by thinking of one set of points as being colored blue and the other set as being colored red. The two sets are ''linearly separable'' if there exists at least one [[line (geometry)|line]] in the plane with all of the blue points on one side of the line and all of the red points on the other side. This idea generalizes immediately to higher-dimensional Euclidean spaces by replacing the line with a [[hyperplane]].

The problem of determining whether a pair of sets is linearly separable, and of finding a separating hyperplane if they are, arises in several areas. In [[statistics]] and [[machine learning]], classifying certain types of data is a problem for which good algorithms based on this concept exist.

==Mathematical definition==
Let <math>X_{0}</math> and <math>X_{1}</math> be two sets of points in an ''n''-dimensional Euclidean space. Then <math>X_{0}</math> and <math>X_{1}</math> are ''linearly separable'' if there exist ''n'' + 1 real numbers <math>w_{1}, w_{2}, \ldots, w_{n}, k</math> such that every point <math>x \in X_{0}</math> satisfies <math>\sum^{n}_{i=1} w_{i}x_{i} > k</math> and every point <math>x \in X_{1}</math> satisfies <math>\sum^{n}_{i=1} w_{i}x_{i} < k</math>, where <math>x_{i}</math> is the <math>i</math>-th component of <math>x</math>.

Equivalently, two sets are linearly separable precisely when their respective [[convex hull]]s are [[disjoint sets|disjoint]] (colloquially, do not overlap).<ref>{{Cite book |last1=Boyd |first1=Stephen |last2=Vandenberghe |first2=Lieven |url=http://dx.doi.org/10.1017/cbo9780511804441 |title=Convex Optimization |date=2004-03-08 |publisher=Cambridge University Press |doi=10.1017/cbo9780511804441 |isbn=978-0-521-83378-3}}</ref>

In two dimensions, this can also be pictured by applying the linear map <math>x \mapsto \sum^{n}_{i=1} w_{i}x_{i}</math>, which collapses the plane onto a line: the two sets are linearly separable exactly when there is a value ''k'' on that line such that the images of one set fall above ''k'' and the images of the other set fall below it.

== Examples ==
Three non-[[collinear]] points belonging to two classes ('+' and '−') are always linearly separable in two dimensions. This is illustrated by the three examples in the following figure (the all-'+' case is not shown, but is similar to the all-'−' case):
{| align="center" border="0" cellpadding="4" cellspacing="10"
| align="center" | [[File:VC1.svg]]
| align="center" | [[File:VC2.svg]]
| align="center" | [[File:VC3.svg]]
|}
However, not every set of four points, no three collinear, is linearly separable in two dimensions. The following example would need ''two'' straight lines and thus is not linearly separable:
{| align="center" border="0" cellpadding="4" cellspacing="0"
| [[File:VC4.svg]]
|}
Notice that three collinear points of the form "+ ⋅⋅⋅ − ⋅⋅⋅ +" are also not linearly separable.
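Whether two finite point sets are linearly separable can be decided by [[linear programming]]: for finite sets, strictly separating coefficients exist if and only if coefficients with a margin of 1 exist, since any strictly separating <math>(w, k)</math> can be rescaled. The following sketch is illustrative only (it assumes the [[SciPy]] library; the function name is chosen for this example):
<syntaxhighlight lang="python">
import numpy as np
from scipy.optimize import linprog

def linearly_separable(X0, X1):
    """Decide whether the finite point sets X0 and X1 are linearly separable.

    Searches for w and k with w.x > k on X0 and w.x < k on X1.  Because any
    strictly separating (w, k) can be rescaled, the strict inequalities are
    replaced by a margin of 1, giving a pure feasibility linear program.
    """
    X0, X1 = np.asarray(X0, dtype=float), np.asarray(X1, dtype=float)
    n = X0.shape[1]
    # Decision variables z = (w_1, ..., w_n, k); linprog enforces A @ z <= b.
    #   -x.w + k <= -1  for x in X0   (equivalently  x.w - k >= 1)
    #    x.w - k <= -1  for x in X1   (equivalently  x.w - k <= -1)
    A = np.vstack([
        np.hstack([-X0, np.ones((len(X0), 1))]),
        np.hstack([X1, -np.ones((len(X1), 1))]),
    ])
    b = -np.ones(len(X0) + len(X1))
    result = linprog(c=np.zeros(n + 1), A_ub=A, b_ub=b,
                     bounds=[(None, None)] * (n + 1))
    return result.success  # feasible exactly when the sets are separable

# The four-point configuration from the figure above is not separable:
print(linearly_separable([[0, 0], [1, 1]], [[0, 1], [1, 0]]))  # False
print(linearly_separable([[0, 0], [0, 1]], [[1, 0], [1, 1]]))  # True
</syntaxhighlight>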
== Number of linear separations ==
Let <math>T(N, K)</math> be the number of ways to linearly separate ''N'' points (in general position) in ''K'' dimensions. Then<ref name=":2">{{cite book |last=MacKay |first=David |url=https://books.google.com/books?id=AKuMj4PN_EMC&pg=PA483 |title=Information Theory, Inference and Learning Algorithms |date=2003-09-25 |publisher=[[Cambridge University Press]] |isbn=9780521642989 |page=483 |author-link=David J. C. MacKay}}</ref>
<math display="block">T(N, K)=\begin{cases} 2^N & K \geq N \\ 2 \sum_{k=0}^{K-1}\binom{N-1}{k} & K<N \end{cases}</math>
When ''K'' is large, <math>T(N, K)/2^N</math> is very close to one when <math>N \leq 2K</math>, but very close to zero when <math>N > 2K</math>. In words, one [[perceptron]] unit can almost certainly memorize a random assignment of binary labels on ''N'' points when <math>N \leq 2K</math>, but almost certainly not when <math>N > 2K</math>.
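The counting formula is easy to evaluate exactly with integer arithmetic, and doing so makes the sharp transition at <math>N = 2K</math> visible. A minimal Python sketch (the function name is chosen to match the notation above):
<syntaxhighlight lang="python">
from math import comb

def T(N, K):
    """Number of linearly separable binary labelings of N points
    in general position in K dimensions, per the formula above."""
    if K >= N:
        return 2 ** N
    return 2 * sum(comb(N - 1, k) for k in range(K))

# Fraction of all 2^N labelings that a hyperplane can realize, for K = 20:
for N in (20, 40, 60):
    print(N, T(N, 20) / 2 ** N)
# prints 1.0 at N = 20, exactly 0.5 at the transition N = 2K = 40,
# and a value near zero at N = 60
</syntaxhighlight>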
== Linear separability of Boolean functions in ''n'' variables ==
A [[Boolean function]] in ''n'' variables can be thought of as an assignment of ''0'' or ''1'' to each vertex of a Boolean [[hypercube]] in ''n'' dimensions. This gives a natural division of the vertices into two sets. The Boolean function is said to be ''linearly separable'' provided these two sets of points are linearly separable. The number of distinct Boolean functions is <math>2^{2^{n}}</math>, where ''n'' is the number of variables passed into the function.<ref>{{Cite book |title=Artificial intelligence: a modern approach |author=Russell, Stuart J. |others=Norvig, Peter |year=2016 |isbn=978-1292153964 |edition=Third |location=Boston |pages=766 |oclc=945899984}}</ref> Such functions are also called linear threshold logic, or [[Perceptron|perceptrons]]. The classical theory is summarized in Muroga (1971),<ref>{{Cite book |last=Muroga |first=Saburo |title=Threshold logic and its applications |date=1971 |publisher=Wiley-Interscience |isbn=978-0-471-62530-8 |location=New York}}</ref> as Knuth notes.<ref>{{Cite book |last=Knuth |first=Donald Ervin |title=The art of computer programming |date=2011 |publisher=Addison-Wesley |isbn=978-0-201-03804-0 |location=Upper Saddle River |pages=75–79}}</ref> The number of linearly separable Boolean functions is known exactly only up to the <math>n=9</math> case, but its order of magnitude is known quite precisely: it has upper bound <math>2^{n^2 - n \log_2 n + O(n)}</math> and lower bound <math>2^{n^2 - n \log_2 n - O(n)}</math>.<ref name=":0">{{Cite journal |last1=Šíma |first1=Jiří |last2=Orponen |first2=Pekka |date=2003-12-01 |title=General-Purpose Computation with Neural Networks: A Survey of Complexity Theoretic Results |url=https://direct.mit.edu/neco/article/15/12/2727-2778/6791 |journal=Neural Computation |language=en |volume=15 |issue=12 |pages=2727–2778 |doi=10.1162/089976603322518731 |pmid=14629867 |s2cid=264603251 |issn=0899-7667}}</ref> It is [[co-NP-complete]] to decide whether a Boolean function given in [[Disjunctive normal form|disjunctive]] or [[conjunctive normal form]] is linearly separable.<ref name=":0" />

{| class="wikitable"
|+<small>Number of linearly separable Boolean functions in each dimension</small><ref>{{cite document |last=Gruzling |first=Nicolle |title=Linear separability of the vertices of an n-dimensional hypercube |type=M.Sc. thesis |publisher=University of Northern British Columbia |year=2006}}</ref> {{OEIS|id=A000609}}
!Number of variables
!Boolean functions
!Linearly separable Boolean functions
|-
|2||16||14
|-
|3||256||104
|-
|4||65,536||1,882
|-
|5||4,294,967,296||94,572
|-
|6||18,446,744,073,709,551,616||15,028,134
|-
|7||≈3.402823669 × 10<sup>38</sup>||8,378,070,864
|-
|8||≈1.157920892 × 10<sup>77</sup>||17,561,539,552,946
|-
|9||≈1.340780792 × 10<sup>154</sup>||144,130,531,453,121,108
|}

== Threshold logic ==
A linear threshold logic gate is a Boolean function defined by <math>n</math> weights <math>w_1, \dots, w_n</math> and a threshold <math>\theta</math>. It takes <math>n</math> binary inputs <math>x_1, \dots, x_n</math>, outputs 1 if <math>\sum_i w_i x_i > \theta</math>, and otherwise outputs 0. For any fixed <math>n</math>, a threshold logic unit can compute only finitely many Boolean functions, and the finitely many inequalities defining a realizable function can all be made strict; the set of realizing parameters therefore contains rational points, and scaling shows that all of <math>w_1, \dots, w_n, \theta</math> can be taken to be integers.

Let <math>W(n)</math> be the smallest number <math>W</math> such that every possible real threshold function of <math>n</math> variables can be realized using integer weights of absolute value <math>\leq W</math>. It is known that<ref>{{Cite journal |last1=Alon |first1=Noga |last2=Vũ |first2=Văn H. |date=1997-07-01 |title=Anti-Hadamard Matrices, Coin Weighing, Threshold Gates, and Indecomposable Hypergraphs |url=https://linkinghub.elsevier.com/retrieve/pii/S0097316597927801 |journal=Journal of Combinatorial Theory, Series A |volume=79 |issue=1 |pages=133–160 |doi=10.1006/jcta.1997.2780 |issn=0097-3165}}</ref>
<math display="block">\frac 12 n \log n - 2n + o(n) \leq \log_2 W(n) \leq \frac 12 n \log n - n + o(n).</math>
See Jukna (2012)<ref>{{Cite book |last=Jukna |first=Stasys |title=Boolean Function Complexity: Advances and Frontiers |date=2012 |publisher=Springer Berlin Heidelberg |isbn=978-3-642-24507-7 |series=Algorithms and Combinatorics |location=Berlin, Heidelberg}}</ref>{{Pg|location=Section 11.10}} for a literature review.
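A gate of this kind witnesses the linear separability of the Boolean function it computes. The following illustrative Python sketch (the helper and the search ranges are chosen for this example, not taken from any source) realizes a few standard gates with small integer weights, and confirms by exhaustive search that no small integer weights realize XOR, one of the two Boolean functions of two variables (the other being its complement) that are not linearly separable, consistent with the count of 14 out of 16 in the table above:
<syntaxhighlight lang="python">
def threshold_gate(weights, theta):
    """Boolean function of a threshold gate: 1 iff sum(w_i * x_i) > theta."""
    return lambda *x: int(sum(w * xi for w, xi in zip(weights, x)) > theta)

AND = threshold_gate([1, 1], 1)       # x1 + x2 > 1 holds only for (1, 1)
OR = threshold_gate([1, 1], 0)        # x1 + x2 > 0 holds unless (0, 0)
MAJ3 = threshold_gate([1, 1, 1], 1)   # majority of three binary inputs

print([AND(a, b) for a in (0, 1) for b in (0, 1)])  # [0, 0, 0, 1]
print([OR(a, b) for a in (0, 1) for b in (0, 1)])   # [0, 1, 1, 1]

# Exhaustive search over small integer parameters finds no realization of
# XOR (indeed, none exists for any real weights and threshold):
xor_realizable = any(
    all(threshold_gate([w1, w2], t)(a, b) == (a ^ b)
        for a in (0, 1) for b in (0, 1))
    for w1 in range(-3, 4) for w2 in range(-3, 4) for t in range(-3, 4)
)
print(xor_realizable)  # False
</syntaxhighlight>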
== Support vector machines ==
{{main|Support vector machine}}
[[Image:Svm separating hyperplanes (SVG).svg|thumb|right|H<sub>1</sub> does not separate the sets. H<sub>2</sub> does, but only with a small margin. H<sub>3</sub> separates them with the maximum margin.]]
[[Statistical classification|Classifying data]] is a common task in [[machine learning]]. Suppose some data points, each belonging to one of two sets, are given, and we wish to create a model that will decide which set a ''new'' data point will be in. In the case of [[support vector machine]]s, a data point is viewed as a ''p''-dimensional vector (a list of ''p'' numbers), and we want to know whether we can separate such points with a (''p'' − 1)-dimensional [[hyperplane]]. This is called a [[linear classifier]]. There are many hyperplanes that might classify (separate) the data. One reasonable choice for the best hyperplane is the one that represents the largest separation, or margin, between the two sets, so we choose the hyperplane whose distance to the nearest data point on each side is maximized. If such a hyperplane exists, it is known as the ''[[maximum-margin hyperplane]]'', and the linear classifier it defines is known as a ''maximum [[margin classifier]]''.

More formally, given some training data <math>\mathcal{D}</math>, a set of ''n'' points of the form
:<math>\mathcal{D} = \left\{ (\mathbf{x}_i, y_i)\mid\mathbf{x}_i \in \mathbb{R}^p,\, y_i \in \{-1,1\}\right\}_{i=1}^n</math>
where each ''y''<sub>''i''</sub> is either 1 or −1, indicating the set to which the point <math>\mathbf{x}_i</math> belongs, and each <math>\mathbf{x}_i</math> is a ''p''-dimensional [[real number|real]] vector, we want to find the maximum-margin hyperplane that divides the points having <math>y_i=1</math> from those having <math>y_i=-1</math>. Any hyperplane can be written as the set of points <math>\mathbf{x}</math> satisfying
:<math>\mathbf{w}\cdot\mathbf{x} - b=0,</math>
where <math>\cdot</math> denotes the [[dot product]] and <math>\mathbf{w}</math> the (not necessarily normalized) [[Normal (geometry)|normal vector]] to the hyperplane. The parameter <math>\tfrac{b}{\|\mathbf{w}\|}</math> determines the offset of the hyperplane from the origin along the normal vector <math>\mathbf{w}</math>.

If the training data are linearly separable, we can select two parallel hyperplanes that separate the data with no points between them, and then maximize the distance between them.

== See also ==
* [[Clustering (statistics)]]
* [[Hyperplane separation theorem]]
* [[Kirchberger's theorem]]
* [[Perceptron]]
* [[Vapnik–Chervonenkis dimension]]

== References ==
{{reflist}}

[[Category:Geometry]]
[[Category:Convex analysis]]
[[Category:Machine learning]]