===Gaussian process prediction, or Kriging===
{{further|Kriging}}
[[File:Gaussian Process Regression.png|thumbnail|right|Gaussian process regression (prediction) with a squared exponential kernel. The left plot shows draws from the prior function distribution, the middle plot shows draws from the posterior, and the right plot shows the mean prediction with one standard deviation shaded.]]
In a general Gaussian process regression problem (Kriging), it is assumed that for a Gaussian process <math>f</math> observed at coordinates <math>x</math>, the vector of values {{tmath|f(x)}} is just one sample from a multivariate Gaussian distribution whose dimension equals the number of observed coordinates {{tmath|n}}. Therefore, under the assumption of a zero-mean distribution, {{tmath|f (x') \sim N (0, K(\theta,x,x'))}}, where {{tmath|K(\theta,x,x')}} is the covariance matrix between all possible pairs {{tmath|(x,x')}} for a given set of hyperparameters {{mvar|θ}}.<ref name= "gpml"/>
As such, the log marginal likelihood is:

<math display="block">\log p(f(x')\mid\theta,x) =  -\frac{1}{2} \left(f(x)^\mathsf{T} K(\theta,x,x')^{-1} f(x') + \log \det(K(\theta,x,x')) + n \log 2\pi \right)</math>

and maximizing this marginal likelihood with respect to {{mvar|θ}} provides the complete specification of the Gaussian process {{math|''f''}}. The first term corresponds to a penalty for a model's failure to fit the observed values, and the second term to a penalty that increases with a model's complexity. Having specified {{mvar|θ}}, making predictions about unobserved values {{tmath|f(x^*)}} at coordinates {{math|''x''*}} is then only a matter of drawing samples from the predictive distribution <math>p(y^*\mid x^*,f(x),x) = N(y^*\mid A,B)</math> where the posterior mean estimate {{mvar|A}} is defined as
<math display="block">A = K(\theta,x^*,x) K(\theta,x,x')^{-1} f(x)</math>
and the posterior variance estimate {{mvar|B}} is defined as
<math display="block">B = K(\theta,x^*,x^*) - K(\theta,x^*,x)  K(\theta,x,x')^{-1}  K(\theta,x^*,x)^\mathsf{T} </math>
where {{tmath|K(\theta,x^*,x)}} is the covariance between the new coordinate of estimation {{math|''x''*}} and all observed coordinates {{mvar|x}} for a given hyperparameter vector {{mvar|θ}}, {{tmath|K(\theta,x,x')}} and {{tmath|f(x)}} are defined as before, and {{tmath|K(\theta,x^*,x^*)}} is the variance at point {{math|''x''*}} as dictated by {{mvar|θ}}. In practice, the posterior mean estimate of {{tmath|f(x^*)}} (the "point estimate") is just a linear combination of the observations {{tmath|f(x)}}; similarly, for a given {{mvar|θ}}, the variance of {{tmath|f(x^*)}} is independent of the observations {{tmath|f(x)}}.

A known bottleneck in Gaussian process prediction is that the computational complexity of inference and likelihood evaluation is cubic in the number of points |''x''|, and as such can become infeasible for larger data sets.<ref name= "brml"/><ref name="highDimBayesianGeostat">{{Cite journal |last1 = Banerjee| first1 = Sudipto | title= High-dimensional Bayesian Geostatistics |journal= Bayesian Analysis | year = 2017 | volume = 12 | issue = 2 | pages=583–614| doi= 10.1214/17-BA1056R | url=https://doi.org/10.1214/17-BA1056R | pmid = 29391920 | pmc = 5790125  }}</ref> Work on sparse Gaussian processes, usually based on the idea of building a ''representative set'' for the given process {{math|''f''}}, tries to circumvent this issue.<ref name="smolaSparse">{{cite journal |last1= Smola| first1= A.J.| last2=Schoellkopf | first2= B. |year= 2000 |title= Sparse greedy matrix approximation for machine learning |journal= Proceedings of the Seventeenth International Conference on Machine Learning| pages=911–918| citeseerx= 10.1.1.43.3153}}</ref><ref name="CsatoSparse">{{cite journal |last1= Csato| first1=L.| last2=Opper | first2= M. 
|year= 2002 |title= Sparse on-line Gaussian processes  |journal= Neural Computation |number=3| volume= 14 | pages=641–668 | doi=10.1162/089976602317250933| pmid=11860686| citeseerx=10.1.1.335.9713| s2cid=11375333}}</ref><ref name="banerjeePredictiveProcess">{{Cite journal |last1 = Banerjee| first1 = Sudipto | last2=Gelfand | first2 = Alan E.| last3 = Finley | first3 = Andrew O. | last4 = Sang | first4 = Huiyan | title= Gaussian Predictive Process Models for large spatial datasets |journal= Journal of the Royal Statistical Society, Series B (Statistical Methodology) | year = 2008 | volume = 70 | issue = 4 | pages=825–848| doi=10.1111/j.1467-9868.2008.00663.x | url=https://doi.org/10.1111/j.1467-9868.2008.00663.x | pmid = 19750209 | pmc = 2741335}}</ref> The [[kriging]] method can be used in the latent level of a [[nonlinear mixed-effects model]] for spatial functional prediction; this technique is called latent kriging.<ref>{{Cite journal |last1=Lee|first1=Se Yoon |first2=Bani|last2=Mallick|  title = Bayesian Hierarchical Modeling: Application Towards Production Results in the Eagle Ford Shale of South Texas|journal=Sankhya B|year=2021|volume=84 |pages=1–43 |doi=10.1007/s13571-020-00245-8|doi-access=free}}</ref> Other classes of scalable Gaussian processes for analyzing massive datasets have emerged from the [[Vecchia approximation]] and Nearest Neighbor Gaussian Processes (NNGP).<ref name="DattaEtAl2016">{{cite journal|last1=Datta|first1=Abhirup|last2=Banerjee|first2=Sudipto|last3=Finley|first3=Andrew|last4=Gelfand|first4=Alan|title=Hierarchical Nearest-Neighbor Gaussian Process Models for Large Spatial Data|journal=Journal of the American Statistical Association|year=2016|volume=111|issue=514|pages=800–812|doi=10.1080/01621459.2015.1044091|pmid=29720777 |pmc=5927603 }}</ref><ref name="highDimBayesianGeostat"/>
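
The log marginal likelihood and the predictive quantities {{mvar|A}} and {{mvar|B}} given above can be illustrated with a minimal sketch. It is not a reference implementation: it assumes scalar coordinates, a squared exponential kernel with unit signal variance and a single length-scale hyperparameter, and a small diagonal "jitter" added purely for numerical stability. The function names (<code>sq_exp_kernel</code>, <code>gp_regression</code> and the solver helpers) are hypothetical and chosen only for this illustration. The Cholesky factorization is the step whose cost is cubic in the number of observed points.

```python
import math

def sq_exp_kernel(a, b, length_scale=1.0):
    """Squared exponential covariance between two scalar coordinates (assumed kernel)."""
    return math.exp(-0.5 * ((a - b) / length_scale) ** 2)

def cholesky(K):
    """Lower-triangular L with L L^T = K, for a symmetric positive-definite K."""
    n = len(K)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = K[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def forward_solve(L, b):
    """Solve L z = b by forward substitution."""
    z = []
    for i in range(len(b)):
        z.append((b[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i])
    return z

def backward_solve(L, z):
    """Solve L^T a = z by back substitution."""
    n = len(z)
    a = [0.0] * n
    for i in reversed(range(n)):
        a[i] = (z[i] - sum(L[k][i] * a[k] for k in range(i + 1, n))) / L[i][i]
    return a

def gp_regression(x, f, x_star, length_scale=1.0, jitter=1e-9):
    """Return (log marginal likelihood, posterior mean A, posterior variance B)."""
    n = len(x)
    # Covariance matrix K(theta, x, x'); the jitter is an assumption, not part of the model.
    K = [[sq_exp_kernel(x[i], x[j], length_scale) + (jitter if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    L = cholesky(K)                                  # the O(n^3) step
    alpha = backward_solve(L, forward_solve(L, f))   # K^{-1} f(x)
    data_fit = sum(fi * ai for fi, ai in zip(f, alpha))       # f^T K^{-1} f
    log_det = 2.0 * sum(math.log(L[i][i]) for i in range(n))  # log det K
    log_ml = -0.5 * (data_fit + log_det + n * math.log(2.0 * math.pi))
    # Covariance between the new coordinate x* and the observed coordinates.
    k_star = [sq_exp_kernel(x_star, xi, length_scale) for xi in x]
    A = sum(ks * ai for ks, ai in zip(k_star, alpha))         # posterior mean
    v = forward_solve(L, k_star)
    B = sq_exp_kernel(x_star, x_star, length_scale) - sum(vi * vi for vi in v)
    return log_ml, A, B

# Illustrative data (made up for this sketch).
x = [0.0, 1.0, 2.0]
f = [0.0, 0.8, 0.9]
log_ml, A, B = gp_regression(x, f, x_star=1.5)

# Crude stand-in for maximizing the marginal likelihood over theta:
# pick the best length scale from a small candidate grid.
best_length_scale = max([0.5, 1.0, 2.0],
                        key=lambda ell: gp_regression(x, f, 1.5, ell)[0])
print(A, B, best_length_scale)
```

In this sketch, {{mvar|B}} is computed without ever touching <code>f</code>, which mirrors the remark that the predictive variance does not depend on the observed values. In practice, a numerical linear algebra library and a gradient-based optimizer would normally replace the hand-written solvers and the grid search.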

Often, the covariance has the form <math display="inline">K(\theta, x,x') = \frac{1}{\sigma^2} \tilde{K}(\theta,x,x')</math>, where <math>\sigma^2</math> is a scaling parameter. Examples are the Matérn class covariance functions. Whether this scaling parameter <math>\sigma^2</math> is known or unknown (in which case it must be marginalized), the posterior probability <math>p(\theta \mid D)</math>, i.e. the probability of the hyperparameters <math>\theta</math> given a set of data pairs <math>D</math> of observations of <math>x</math> and <math>f(x)</math>, admits an analytical expression.<ref>{{Cite journal| last1=Ranftl|first1=Sascha|last2=Melito|first2=Gian Marco|last3=Badeli|first3=Vahid|last4=Reinbacher-Köstinger|first4=Alice| last5=Ellermann|first5=Katrin|last6=von der Linden|first6=Wolfgang|date=2019-12-31|title=Bayesian Uncertainty Quantification with Multi-Fidelity Data and Gaussian Processes for Impedance Cardiography of Aortic Dissection|journal=Entropy| volume=22|issue=1| pages=58|doi=10.3390/e22010058|issn=1099-4300|pmc=7516489|pmid=33285833|bibcode=2019Entrp..22...58R |doi-access=free}}</ref>