==Covariance functions==
{{main|Covariance function}}
{{further|Variogram}}

A key fact of Gaussian processes is that they can be completely defined by their second-order statistics.<ref name="prml">{{cite book |last=Bishop |first=C.M. |title=Pattern Recognition and Machine Learning |year=2006 |publisher=[[Springer Science+Business Media|Springer]] |isbn=978-0-387-31073-2}}</ref> Thus, if a Gaussian process is assumed to have mean zero, defining the [[covariance function]] completely defines the process' behaviour. Importantly, the non-negative definiteness of this function enables its spectral decomposition using the [[Karhunen–Loève theorem|Karhunen–Loève expansion]]. Basic aspects that can be defined through the covariance function are the process' [[stationary process|stationarity]], [[isotropy]], [[smoothness]] and [[periodic function|periodicity]].<ref name="brml">{{cite book |last=Barber |first=David |title=Bayesian Reasoning and Machine Learning |url=http://web4.cs.ucl.ac.uk/staff/D.Barber/pmwiki/pmwiki.php?n=Brml.HomePage |year=2012 |publisher=[[Cambridge University Press]] |isbn=978-0-521-51814-7}}</ref><ref name="gpml">{{cite book |last=Rasmussen |first=C.E. |author2=Williams, C.K.I |title=Gaussian Processes for Machine Learning |url=http://www.gaussianprocess.org/gpml/ |year=2006 |publisher=[[MIT Press]] |isbn=978-0-262-18253-9}}</ref>

[[stationary process|Stationarity]] refers to the process' behaviour regarding the separation of any two points <math>x</math> and <math>x'</math>. If the process is stationary, the covariance function depends only on <math>x-x'</math>. For example, the [[Ornstein–Uhlenbeck process]] is stationary. If the process depends only on <math>|x-x'|</math>, the Euclidean distance (not the direction) between <math>x</math> and <math>x'</math>, then the process is considered isotropic. A process that is both stationary and isotropic is considered to be [[homogeneous]];<ref name="PRP">{{cite book |last=Grimmett |first=Geoffrey |author2=David Stirzaker |title=Probability and Random Processes |year=2001 |publisher=[[Oxford University Press]] |isbn=978-0198572220}}</ref> in practice these properties reflect the differences (or rather the lack of them) in the behaviour of the process given the location of the observer.

Ultimately, Gaussian processes translate as taking priors on functions, and the smoothness of these priors can be induced by the covariance function.<ref name="brml"/> If we expect the output values <math>y</math> and <math>y'</math> corresponding to "nearby" input points <math>x</math> and <math>x'</math> to also be "nearby", then an assumption of continuity is being made. If we wish to allow for significant displacement, then we might choose a rougher covariance function. Extreme examples of this behaviour are the Ornstein–Uhlenbeck covariance function and the squared exponential, where the former is never differentiable and the latter is infinitely differentiable. Periodicity refers to inducing periodic patterns within the behaviour of the process. Formally, this is achieved by mapping the input <math>x</math> to a two-dimensional vector <math>u(x) = \left( \cos(x), \sin(x) \right)</math>.
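The smoothness contrast between the squared-exponential and Ornstein–Uhlenbeck covariances can be seen directly by drawing samples from the corresponding zero-mean priors. The following is a minimal, illustrative sketch (not taken from the cited references); it assumes only NumPy, and the function names, length-scale value and jitter constant are choices made for the example.

<syntaxhighlight lang="python">
# Minimal sketch: draw one zero-mean Gaussian-process prior sample under a
# squared-exponential kernel and one under an Ornstein-Uhlenbeck kernel.
# Only NumPy is assumed; names and constants here are illustrative choices.
import numpy as np

def squared_exponential(x, xp, ell=1.0):
    """K_SE(x, x') = exp(-d^2 / (2 ell^2)), with d = |x - x'|."""
    d = np.abs(x[:, None] - xp[None, :])
    return np.exp(-d**2 / (2.0 * ell**2))

def ornstein_uhlenbeck(x, xp, ell=1.0):
    """K_OU(x, x') = exp(-d / ell): samples are continuous but nowhere differentiable."""
    d = np.abs(x[:, None] - xp[None, :])
    return np.exp(-d / ell)

rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 200)

for name, kernel in [("SE", squared_exponential), ("OU", ornstein_uhlenbeck)]:
    K = kernel(x, x) + 1e-6 * np.eye(x.size)   # small jitter for numerical stability
    L = np.linalg.cholesky(K)                  # K = L L^T
    f = L @ rng.standard_normal(x.size)        # one draw from N(0, K)
    print(name, "sample, first five values:", np.round(f[:5], 3))
</syntaxhighlight>

Plotting <math>f</math> against <math>x</math> for each draw shows the squared-exponential sample varying smoothly while the Ornstein–Uhlenbeck sample is visibly jagged, reflecting the differentiability properties described above.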
===Usual covariance functions===
[[File:Gaussian process draws from prior distribution.png|thumbnail|300px|right|The effect of choosing different kernels on the prior function distribution of the Gaussian process. Left is a squared exponential kernel. Middle is Brownian. Right is quadratic.]]

There are a number of common covariance functions:<ref name="gpml"/>
*Constant: <math> K_\operatorname{C}(x,x') = C </math>
*Linear: <math> K_\operatorname{L}(x,x') = x^\mathsf{T} x'</math>
*White Gaussian noise: <math> K_\operatorname{GN}(x,x') = \sigma^2 \delta_{x,x'}</math>
*Squared exponential: <math> K_\operatorname{SE}(x,x') = \exp \left(-\tfrac{d^2}{2\ell^2} \right)</math>
*Ornstein–Uhlenbeck: <math> K_\operatorname{OU}(x,x') = \exp \left(-\tfrac{d}{\ell} \right)</math>
*Matérn: <math> K_\operatorname{Matern}(x,x') = \tfrac{2^{1-\nu}}{\Gamma(\nu)} \left(\tfrac{\sqrt{2\nu}d}{\ell} \right)^\nu K_\nu \left(\tfrac{\sqrt{2\nu}d}{\ell} \right)</math>
*Periodic: <math> K_\operatorname{P}(x,x') = \exp\left(-\tfrac{2}{\ell^2} \sin^2 (d/2) \right)</math>
*Rational quadratic: <math> K_\operatorname{RQ}(x,x') = \left(1+d^2\right)^{-\alpha}, \quad \alpha \geq 0</math>

Here <math>d = |x- x'| </math>. The parameter <math>\ell</math> is the characteristic length-scale of the process (practically, "how close" two points <math>x</math> and <math>x'</math> have to be to influence each other significantly), <math>\delta</math> is the [[Kronecker delta]] and <math>\sigma</math> is the [[standard deviation]] of the noise fluctuations. Moreover, <math>K_\nu</math> is the [[modified Bessel function]] of order <math>\nu</math> and <math>\Gamma(\nu)</math> is the [[gamma function]] evaluated at <math>\nu</math>. Importantly, a complicated covariance function can be defined as a linear combination of other, simpler covariance functions in order to incorporate different insights about the data set at hand.

The inferential results are dependent on the values of the hyperparameters <math>\theta</math> (e.g. <math>\ell</math> and <math>\sigma</math>) defining the model's behaviour. A popular choice for <math>\theta</math> is to provide ''[[maximum a posteriori]]'' (MAP) estimates of it with some chosen prior. If the prior is very near uniform, this is the same as maximizing the [[marginal likelihood]] of the process, the marginalization being done over the observed process values <math>y</math>.<ref name="gpml"/> This approach is also known as ''maximum likelihood II'', ''evidence maximization'', or ''[[empirical Bayes]]''.<ref name="seegerGPML">{{cite journal |last1=Seeger |first1=Matthias |year=2004 |title=Gaussian Processes for Machine Learning |journal=International Journal of Neural Systems |volume=14 |issue=2 |pages=69–104 |doi=10.1142/s0129065704001899 |pmid=15112367 |citeseerx=10.1.1.71.1079 |s2cid=52807317}}</ref>
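As an illustration of maximum likelihood II, the following sketch (an assumed example, not taken from the cited references) fits the length-scale <math>\ell</math> and noise standard deviation <math>\sigma</math> of a squared-exponential-plus-noise covariance by maximizing the log marginal likelihood with NumPy and SciPy; the toy data, function names and optimizer settings are hypothetical choices for the example.

<syntaxhighlight lang="python">
# Hypothetical sketch of "maximum likelihood II": choose theta = (ell, sigma) by
# maximizing the log marginal likelihood
#   log p(y | theta) = -1/2 y^T K^{-1} y - 1/2 log|K| - n/2 log(2 pi),
# where K = K_SE + sigma^2 I.  Assumes NumPy and SciPy only.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
x = np.linspace(0.0, 5.0, 40)                        # toy inputs
y = np.sin(x) + 0.1 * rng.standard_normal(x.size)    # toy noisy observations

def neg_log_marginal_likelihood(log_theta):
    ell, sigma = np.exp(log_theta)                   # optimize in log space to keep both positive
    d = np.abs(x[:, None] - x[None, :])
    K = np.exp(-d**2 / (2.0 * ell**2)) + (sigma**2 + 1e-8) * np.eye(x.size)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # K^{-1} y via the Cholesky factor
    log_det = 2.0 * np.sum(np.log(np.diag(L)))             # log|K|
    return 0.5 * y @ alpha + 0.5 * log_det + 0.5 * x.size * np.log(2.0 * np.pi)

result = minimize(neg_log_marginal_likelihood, x0=np.log([1.0, 1.0]), method="L-BFGS-B")
ell_hat, sigma_hat = np.exp(result.x)
print(f"estimated length-scale: {ell_hat:.3f}, estimated noise std: {sigma_hat:.3f}")
</syntaxhighlight>

With a non-uniform prior on <math>\theta</math>, adding the negative log prior density to this objective would instead give the MAP estimate described above.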