==Applications==
[[Image:Regressions sine demo.svg|thumbnail|right|An example of Gaussian Process Regression (prediction) compared with other regression models.<ref>The documentation for [[scikit-learn]] also has similar [http://scikit-learn.org/stable/auto_examples/gaussian_process/plot_compare_gpr_krr.html examples].</ref>]]
A Gaussian process can be used as a [[prior probability distribution]] over [[Function (mathematics)|functions]] in [[Bayesian inference]].<ref name="gpml"/><ref>{{cite book |last=Liu |first=W. |author2=Principe, J.C. |author3=Haykin, S. |title=Kernel Adaptive Filtering: A Comprehensive Introduction |url=http://www.cnel.ufl.edu/~weifeng/publication.htm |year=2010 |publisher=[[John Wiley & Sons|John Wiley]] |isbn=978-0-470-44753-6 |access-date=2010-03-26 |archive-url=https://web.archive.org/web/20160304042652/http://www.cnel.ufl.edu/~weifeng/publication.htm |archive-date=2016-03-04 |url-status=dead }}</ref> Given any set of ''N'' points in the desired domain of the functions, take a [[multivariate Gaussian]] whose covariance [[matrix (mathematics)|matrix]] parameter is the [[Gram matrix]] of the ''N'' points with some desired [[stochastic kernel|kernel]], and [[sampling (mathematics)|sample]] from that Gaussian. To solve multi-output prediction problems, Gaussian process regression for vector-valued functions was developed.
In this method, a 'big' covariance is constructed, which describes the correlations between all the input and output variables taken at ''N'' points in the desired domain.<ref name="Alvares2012">{{cite journal |last1= Álvarez|first1= Mauricio A.|last2= Rosasco | first2= Lorenzo |last3= Lawrence|first3=Neil D.|year= 2012 |title= Kernels for vector-valued functions: A review|journal= Foundations and Trends in Machine Learning|volume= 4|issue= 3|pages= 195–266 |doi=10.1561/2200000036|s2cid= 456491|url= http://eprints.whiterose.ac.uk/114503/1/1106.6251v2.pdf}}</ref> This approach was elaborated in detail for matrix-valued Gaussian processes and generalised to processes with 'heavier tails' such as [[Student's t-distribution#Student's t-process|Student-t processes]].<ref name="Zexun2020">{{cite journal |last1= Chen| first1= Zexun |last2= Wang| first2= Bo|last3= Gorban|first3=Alexander N.|year= 2019 |title= Multivariate Gaussian and Student-t process regression for multi-output prediction|journal= Neural Computing and Applications|volume=32|issue=8|pages= 3005–3028 | doi=10.1007/s00521-019-04687-8|doi-access= free| arxiv= 1703.04455 }}</ref>
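
The sampling recipe described above can be illustrated with a minimal sketch in Python with NumPy. The squared exponential kernel, its parameter values, the function names and the small diagonal "jitter" term (added only for numerical stability) are illustrative assumptions rather than part of the general definition.

```python
import numpy as np

def squared_exponential(xa, xb, lengthscale=1.0, variance=1.0):
    """Squared exponential kernel between two 1-D arrays of inputs (assumed kernel)."""
    diff = xa[:, None] - xb[None, :]
    return variance * np.exp(-0.5 * (diff / lengthscale) ** 2)

def sample_gp_prior(x, n_samples=3, jitter=1e-6, seed=0):
    """Draw functions from a zero-mean Gaussian process prior at the points x."""
    rng = np.random.default_rng(seed)
    # Gram matrix of the N points; the jitter keeps the Cholesky factorisation stable.
    gram = squared_exponential(x, x) + jitter * np.eye(len(x))
    chol = np.linalg.cholesky(gram)
    # If z ~ N(0, I), then chol @ z ~ N(0, gram).
    z = rng.standard_normal((len(x), n_samples))
    return (chol @ z).T

x = np.linspace(0.0, 5.0, 50)
samples = sample_gp_prior(x)  # one row per sampled function
```

Each row of <code>samples</code> holds the values of one sampled function at the 50 input points.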

Inference of continuous values with a Gaussian process prior is known as Gaussian process regression, or [[kriging]]; extending Gaussian process regression to [[Kernel methods for vector output|multiple target variables]] is known as ''cokriging''.<ref>{{cite book |last=Stein |first=M.L. |title=Interpolation of Spatial Data: Some Theory for Kriging |year=1999 |publisher = [[Springer Science+Business Media|Springer]]}}</ref> Gaussian processes are thus useful as a powerful non-linear multivariate [[interpolation]] tool. Kriging is also used to extend Gaussian processes to the case of mixed-integer inputs.<ref>{{Cite journal | doi=10.1016/j.neucom.2023.126472 | title=A mixed-categorical correlation kernel for Gaussian process| journal=Neurocomputing| volume=550| pages=126472| year=2023| last1=Saves| first1=Paul | last2=Diouane| first2=Youssef |  last3=Bartoli| first3=Nathalie | last4=Lefebvre| first4=Thierry | last5=Morlier| first5=Joseph | arxiv=2211.08262}}</ref>

Gaussian processes are also commonly used to tackle numerical analysis problems such as numerical integration, solving differential equations, or optimisation in the field of [[probabilistic numerics]].

Gaussian processes can also be used as components of mixture of experts models.<ref>{{Cite journal |doi = 10.1109/TPAMI.2013.183|pmid = 26353224|title = Gaussian Process-Mixture Conditional Heteroscedasticity|journal = IEEE Transactions on Pattern Analysis and Machine Intelligence|volume = 36|issue = 5|pages = 888–900|year = 2014|last1 = Platanios |first1 = Emmanouil A.|last2 = Chatzis|first2 = Sotirios P.|s2cid = 10424638}}</ref><ref>{{Cite journal | doi=10.1016/j.neucom.2013.04.029| title=A latent variable Gaussian process model with Pitman–Yor process priors for multiclass classification| journal=Neurocomputing| volume=120| pages=482–489| year=2013| last1=Chatzis| first1=Sotirios P.}}</ref> The underlying rationale of such a learning framework is the assumption that a given mapping cannot be well captured by a single Gaussian process model. Instead, the observation space is divided into subsets, each of which is characterized by a different mapping function; each of these is learned via a different Gaussian process component in the postulated mixture.

In the natural sciences, Gaussian processes have found use as probabilistic models of astronomical time series and as predictors of molecular properties.<ref>{{Cite thesis |doi = 10.17863/CAM.93643|title = Applications of Gaussian Processes at Extreme Lengthscales: From Molecules to Black Holes|degree = PhD| publisher = University of Cambridge|year = 2022 |first=Ryan-Rhys |last=Griffiths| arxiv=2303.14291 }}</ref> They are also being increasingly used as surrogate models for force field optimization.<ref>{{cite journal |last1=Shanks |first1=B. L. |last2=Sullivan |first2=H. W. |last3=Shazed |first3=A. R. |last4=Hoepfner |first4=M. P. |title=Accelerated Bayesian Inference for Molecular Simulations using Local Gaussian Process Surrogate Models |journal=Journal of Chemical Theory and Computation |date=2024 |volume=20 |issue=9 |pages=3798–3808 |doi=10.1021/acs.jctc.3c01358 |pmid=38551198 |url=https://pubs.acs.org/doi/full/10.1021/acs.jctc.3c01358|arxiv=2310.19108 }}</ref>

===Gaussian process prediction, or Kriging===
{{further|Kriging}}
[[File:Gaussian Process Regression.png|thumbnail|right|Gaussian Process Regression (prediction) with a squared exponential kernel. Left: draws from the prior function distribution. Middle: draws from the posterior. Right: mean prediction with one standard deviation shaded.]]
When concerned with a general Gaussian process regression problem (Kriging), it is assumed that for a Gaussian process <math>f</math> observed at coordinates <math>x</math>, the vector of values {{tmath|f(x)}} is just one sample from a multivariate Gaussian distribution of dimension equal to the number of observed coordinates {{tmath|n}}. Therefore, under the assumption of a zero-mean distribution, {{tmath|f (x') \sim N (0, K(\theta,x,x'))}}, where {{tmath|K(\theta,x,x')}} is the covariance matrix between all possible pairs {{tmath|(x,x')}} for a given set of hyperparameters ''θ''.<ref name= "gpml"/>
As such the log marginal likelihood is:

<math display="block">\log p(f(x')\mid\theta,x) =  -\frac{1}{2} \left(f(x)^\mathsf{T} K(\theta,x,x')^{-1} f(x') + \log \det(K(\theta,x,x')) + n \log 2\pi \right)</math>

and maximizing this marginal likelihood with respect to {{mvar|θ}} provides the complete specification of the Gaussian process {{math|''f''}}. The first term is a penalty for a model's failure to fit the observed values, and the second term is a penalty that increases with a model's complexity. Having specified {{mvar|θ}}, making predictions about unobserved values {{tmath|f(x^*)}} at coordinates {{math|''x''*}} is then only a matter of drawing samples from the predictive distribution <math>p(y^*\mid x^*,f(x),x) = N(y^*\mid A,B)</math> where the posterior mean estimate {{mvar|A}} is defined as
<math display="block">A = K(\theta,x^*,x) K(\theta,x,x')^{-1} f(x)</math>
and the posterior variance estimate ''B'' is defined as:
<math display="block">B = K(\theta,x^*,x^*) - K(\theta,x^*,x)  K(\theta,x,x')^{-1}  K(\theta,x^*,x)^\mathsf{T} </math>
where {{tmath|K(\theta,x^*,x)}} is the covariance between the new coordinate of estimation ''x''* and all other observed coordinates ''x'' for a given hyperparameter vector {{mvar|θ}}, {{tmath|K(\theta,x,x')}} and {{tmath|f(x)}} are defined as before and {{tmath|K(\theta,x^*,x^*)}} is the variance at point {{math|''x''*}} as dictated by {{mvar|θ}}. In practice, the posterior mean estimate of {{tmath|f(x^*)}} (the "point estimate") is just a linear combination of the observations {{tmath|f(x)}}; similarly, the variance of {{tmath|f(x^*)}} is independent of the observations {{tmath|f(x)}}. A known bottleneck in Gaussian process prediction is that the computational complexity of inference and likelihood evaluation is cubic in the number of points |''x''|, and as such can become infeasible for larger data sets.<ref name= "brml"/><ref name="highDimBayesianGeostat">{{Cite journal |last1 = Banerjee| first1 = Sudipto | title= High-dimensional Bayesian Geostatistics |journal= Bayesian Analysis | year = 2017 | volume = 12 | issue = 2 | pages=583–614| doi= 10.1214/17-BA1056R | url=https://doi.org/10.1214/17-BA1056R | pmid = 29391920 | pmc = 5790125  }}</ref> Work on sparse Gaussian processes, usually based on the idea of building a ''representative set'' for the given process ''f'', tries to circumvent this issue.<ref name="smolaSparse">{{cite journal |last1= Smola| first1= A.J.| last2=Schoellkopf | first2= B. |year= 2000 |title= Sparse greedy matrix approximation for machine learning |journal= Proceedings of the Seventeenth International Conference on Machine Learning| pages=911–918| citeseerx= 10.1.1.43.3153}}</ref><ref name="CsatoSparse">{{cite journal |last1= Csato| first1=L.| last2=Opper | first2= M. |year= 2002 |title= Sparse on-line Gaussian processes  |journal= Neural Computation |number=3| volume= 14 | pages=641–668 | doi=10.1162/089976602317250933| pmid=11860686| citeseerx=10.1.1.335.9713| s2cid=11375333}}</ref><ref name="banerjeePredictiveProcess">{{Cite journal |last1 = Banerjee| first1 = Sudipto | last2=Gelfand | first2 = Alan E.| last3 = Finley | first3 = Andrew O. | last4 = Sang | first4 = Huiyan | title= Gaussian Predictive Process Models for large spatial datasets |journal= Journal of the Royal Statistical Society, Series B (Statistical Methodology) | year = 2008 | volume = 70 | issue = 4 | pages=825–848| doi=10.1111/j.1467-9868.2008.00663.x | url=https://doi.org/10.1111/j.1467-9868.2008.00663.x | pmid = 19750209 | pmc = 2741335}}</ref> The [[kriging]] method can be used in the latent level of a [[nonlinear mixed-effects model]] for a spatial functional prediction: this technique is called latent kriging.<ref>{{Cite journal |last1=Lee|first1=Se Yoon |first2=Bani|last2=Mallick|  title = Bayesian Hierarchical Modeling: Application Towards Production Results in the Eagle Ford Shale of South Texas|journal=Sankhya B|year=2021|volume=84 |pages=1–43 |doi=10.1007/s13571-020-00245-8|doi-access=free}}</ref> Other classes of scalable Gaussian processes for analyzing massive datasets have emerged from the [[Vecchia approximation]] and Nearest Neighbor Gaussian Processes (NNGP).<ref name="DattaEtAl2016">{{cite journal|last1=Datta|first1=Abhirup|last2=Banerjee|first2=Sudipto|last3=Finley|first3=Andrew|last4=Gelfand|first4=Alan|title=Hierarchical Nearest-Neighbor Gaussian Process Models for Large Spatial Data|journal=Journal of the American Statistical Association|year=2016|volume=111|issue=514|pages=800–812|doi=10.1080/01621459.2015.1044091|pmid=29720777 |pmc=5927603 }}</ref><ref name="highDimBayesianGeostat"/>
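
The log marginal likelihood and the predictive mean {{mvar|A}} and covariance {{mvar|B}} given above can be evaluated with a few lines of linear algebra. The following is a minimal sketch in Python with NumPy, assuming noise-free observations, one-dimensional inputs and a squared exponential kernel whose hyperparameters {{mvar|θ}} are a lengthscale and a variance; the function names and the small "jitter" added to the diagonal for numerical stability are illustrative choices. The Cholesky factorisation is the step whose cost is cubic in the number of observed points.

```python
import numpy as np

def squared_exponential(xa, xb, lengthscale=1.0, variance=1.0):
    """Squared exponential kernel between two 1-D arrays of inputs (assumed kernel)."""
    diff = xa[:, None] - xb[None, :]
    return variance * np.exp(-0.5 * (diff / lengthscale) ** 2)

def log_marginal_likelihood(x, f, lengthscale, variance, jitter=1e-6):
    """Log marginal likelihood of observations f at inputs x for given hyperparameters."""
    n = len(x)
    K = squared_exponential(x, x, lengthscale, variance) + jitter * np.eye(n)
    L = np.linalg.cholesky(K)                            # O(n^3) step
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, f))  # K^{-1} f
    log_det = 2.0 * np.sum(np.log(np.diag(L)))           # log det K
    return -0.5 * (f @ alpha + log_det + n * np.log(2.0 * np.pi))

def predict(x, f, x_star, lengthscale, variance, jitter=1e-6):
    """Posterior mean A and covariance B at the new inputs x_star."""
    K = squared_exponential(x, x, lengthscale, variance) + jitter * np.eye(len(x))
    K_star = squared_exponential(x_star, x, lengthscale, variance)     # K(theta, x*, x)
    K_ss = squared_exponential(x_star, x_star, lengthscale, variance)  # K(theta, x*, x*)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, f))
    A = K_star @ alpha               # linear in the observations f
    V = np.linalg.solve(L, K_star.T)
    B = K_ss - V.T @ V               # does not involve f
    return A, B

x = np.array([0.0, 1.0, 2.5, 4.0])
f = np.sin(x)
x_star = np.array([1.0, 3.0])
A, B = predict(x, f, x_star, lengthscale=1.0, variance=1.0)
lml = log_marginal_likelihood(x, f, lengthscale=1.0, variance=1.0)
```

In this sketch, <code>B</code> is unchanged if <code>f</code> is replaced by any other vector of observations at the same inputs, which reflects the independence of the predictive variance from the observed values. Hyperparameters could be chosen by maximizing <code>log_marginal_likelihood</code> over the lengthscale and variance with a numerical optimizer.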

Often, the covariance has the form <math display="inline">K(\theta, x,x') = \frac{1}{\sigma^2} \tilde{K}(\theta,x,x')</math>, where <math>\sigma^2</math> is a scaling parameter. Examples are the Matérn class covariance functions. Whether this scaling parameter <math>\sigma^2</math> is known or unknown (i.e. must be marginalized), the posterior probability <math>p(\theta \mid D)</math>, i.e. the probability for the hyperparameters <math>\theta</math> given a set of data pairs <math>D</math> of observations of <math>x</math> and <math>f(x)</math>, admits an analytical expression.<ref>{{Cite journal| last1=Ranftl|first1=Sascha|last2=Melito|first2=Gian Marco|last3=Badeli|first3=Vahid|last4=Reinbacher-Köstinger|first4=Alice| last5=Ellermann|first5=Katrin|last6=von der Linden|first6=Wolfgang|date=2019-12-31|title=Bayesian Uncertainty Quantification with Multi-Fidelity Data and Gaussian Processes for Impedance Cardiography of Aortic Dissection|journal=Entropy| volume=22|issue=1| pages=58|doi=10.3390/e22010058|issn=1099-4300|pmc=7516489|pmid=33285833|bibcode=2019Entrp..22...58R |doi-access=free}}</ref>

=== Bayesian neural networks as Gaussian processes ===
{{further|Neural network Gaussian process}}
Bayesian neural networks are a particular type of [[Bayesian network]] that results from treating [[deep learning]] and [[artificial neural network]] models probabilistically, and assigning a [[Prior probability|prior distribution]] to their [[Statistical parameter|parameters]]. Computation in artificial neural networks is usually organized into sequential layers of [[artificial neuron]]s. The number of neurons in a layer is called the layer width. As layer width grows large, many Bayesian neural networks reduce to a Gaussian process with a [[Closed-form expression|closed form]] compositional kernel. This Gaussian process is called the Neural Network Gaussian Process (NNGP) (not to be confused with the Nearest Neighbor Gaussian Process<ref name="DattaEtAl2016"/>).<ref name="gpml"/><ref name="novak2020">{{cite journal |last1=Novak |first1=Roman |last2=Xiao |first2=Lechao |last3=Hron |first3=Jiri |last4=Lee |first4=Jaehoon |last5=Alemi |first5=Alexander A. |last6=Sohl-Dickstein |first6=Jascha |last7=Schoenholz |first7=Samuel S. |title=Neural Tangents: Fast and Easy Infinite Neural Networks in Python |journal=International Conference on Learning Representations |date=2020|arxiv=1912.02803 }}</ref><ref>{{Cite book|last=Neal|first=Radford M.|title=Bayesian Learning for Neural Networks|publisher=Springer Science and Business Media| year=2012}}</ref> The NNGP allows predictions from Bayesian neural networks to be evaluated more efficiently, and provides an analytic tool for understanding [[deep learning]] models.
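
As an illustration of such a closed-form compositional kernel, the following minimal sketch in Python with NumPy evaluates the infinite-width kernel of one specific architecture. The architecture is an assumption made for this example only: a fully connected network with a single hidden layer of [[Rectifier (neural networks)|ReLU]] units, independent zero-mean Gaussian weights with variance <code>sigma_w**2</code> divided by the layer's number of inputs, and Gaussian biases with variance <code>sigma_b**2</code>. The function name is hypothetical, and the sketch does not reproduce the interface of any particular library.

```python
import numpy as np

def nngp_relu_kernel(xa, xb, sigma_w=1.0, sigma_b=0.0):
    """Infinite-width kernel of an assumed one-hidden-layer ReLU network.

    xa and xb are arrays of shape (number of points, input dimension).
    Inputs must be non-zero when sigma_b is 0, to avoid division by zero.
    """
    d = xa.shape[1]
    # Covariance of the pre-activations of the hidden layer.
    k_ab = sigma_b**2 + sigma_w**2 * (xa @ xb.T) / d
    k_aa = sigma_b**2 + sigma_w**2 * np.sum(xa**2, axis=1) / d
    k_bb = sigma_b**2 + sigma_w**2 * np.sum(xb**2, axis=1) / d
    norm = np.sqrt(np.outer(k_aa, k_bb))
    cos_theta = np.clip(k_ab / norm, -1.0, 1.0)
    theta = np.arccos(cos_theta)
    # Closed form of E[relu(u) relu(v)] for jointly Gaussian pre-activations (u, v).
    expected = norm * (np.sin(theta) + (np.pi - theta) * cos_theta) / (2.0 * np.pi)
    # Covariance of the network output after the second (linear) layer.
    return sigma_b**2 + sigma_w**2 * expected

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))
K = nngp_relu_kernel(X, X)
```

The resulting matrix <code>K</code> can be used as the covariance matrix of a Gaussian process in the same way as any other kernel matrix. The kernel is compositional in the sense that the output covariance is a function of the covariance computed at the previous layer.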