Gaussian process
==Applications==
[[Image:Regressions sine demo.svg|thumbnail|right|An example of Gaussian Process Regression (prediction) compared with other regression models.<ref>The documentation for [[scikit-learn]] also has similar [http://scikit-learn.org/stable/auto_examples/gaussian_process/plot_compare_gpr_krr.html examples].</ref>]]
A Gaussian process can be used as a [[prior probability distribution]] over [[Function (mathematics)|functions]] in [[Bayesian inference]].<ref name="gpml"/><ref>{{cite book |last=Liu |first=W. |author2=Principe, J.C. |author3=Haykin, S. |title=Kernel Adaptive Filtering: A Comprehensive Introduction |url=http://www.cnel.ufl.edu/~weifeng/publication.htm |year=2010 |publisher=[[John Wiley & Sons|John Wiley]] |isbn=978-0-470-44753-6 |access-date=2010-03-26 |archive-url=https://web.archive.org/web/20160304042652/http://www.cnel.ufl.edu/~weifeng/publication.htm |archive-date=2016-03-04 |url-status=dead }}</ref> Given any set of ''N'' points in the desired domain of the functions, take a [[multivariate Gaussian]] whose covariance [[matrix (mathematics)|matrix]] parameter is the [[Gram matrix]] of the ''N'' points with some desired [[stochastic kernel|kernel]], and [[sampling (mathematics)|sample]] from that Gaussian. For the multi-output prediction problem, Gaussian process regression for vector-valued functions was developed. 
In this method, a 'big' covariance is constructed, which describes the correlations between all the input and output variables taken in ''N'' points in the desired domain.<ref name="Alvares2012">{{cite journal |last1= Álvarez|first1= Mauricio A.|last2= Rosasco | first2= Lorenzo |last3= Lawrence|first3=Neil D.|year= 2012 |title= Kernels for vector-valued functions: A review|journal= Foundations and Trends in Machine Learning|volume= 4|issue= 3|pages= 195–266 |doi=10.1561/2200000036|s2cid= 456491|url= http://eprints.whiterose.ac.uk/114503/1/1106.6251v2.pdf}}</ref> This approach was elaborated in detail for the matrix-valued Gaussian processes and generalised to processes with 'heavier tails' like [[Student's t-distribution#Student's t-process|Student-t processes]].<ref name="Zexun2020">{{cite journal |last1= Chen| first1= Zexun |last2= Wang| first2= Bo|last3= Gorban|first3=Alexander N.|year= 2019 |title= Multivariate Gaussian and Student-t process regression for multi-output prediction|journal= Neural Computing and Applications|volume=32|issue=8|pages= 3005–3028 | doi=10.1007/s00521-019-04687-8|doi-access= free| arxiv= 1703.04455 }}</ref> Inference of continuous values with a Gaussian process prior is known as Gaussian process regression, or [[kriging]]; extending Gaussian process regression to [[Kernel methods for vector output|multiple target variables]] is known as ''cokriging''.<ref>{{cite book |last=Stein |first=M.L. |title=Interpolation of Spatial Data: Some Theory for Kriging |year=1999 |publisher = [[Springer Science+Business Media|Springer]]}}</ref> Gaussian processes are thus useful as a powerful non-linear multivariate [[interpolation]] tool. 
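
The prior-sampling recipe described above (build the Gram matrix of ''N'' points, then sample from the corresponding multivariate Gaussian) can be sketched as follows. This is a minimal illustration rather than a reference implementation: the squared exponential kernel, the function names, the grid of points, and the small diagonal "jitter" added for numerical stability are all assumptions made for the example.

```python
import numpy as np

def squared_exponential_kernel(xa, xb, length_scale=1.0, variance=1.0):
    """Gram matrix of a squared exponential kernel between two 1-D point sets.

    The kernel choice and its parameters are illustrative assumptions.
    """
    sq_dist = (xa[:, None] - xb[None, :]) ** 2
    return variance * np.exp(-0.5 * sq_dist / length_scale**2)

def sample_gp_prior(x, n_samples=3, jitter=1e-6, seed=0):
    """Draw functions from a zero-mean Gaussian process prior evaluated at x."""
    rng = np.random.default_rng(seed)
    gram = squared_exponential_kernel(x, x)
    # Jitter on the diagonal is an assumption that keeps the Cholesky
    # factorisation numerically stable; it is not part of the prior itself.
    chol = np.linalg.cholesky(gram + jitter * np.eye(len(x)))
    # If z ~ N(0, I), then chol @ z ~ N(0, gram + jitter * I).
    z = rng.standard_normal((len(x), n_samples))
    return (chol @ z).T

x = np.linspace(0.0, 5.0, 50)   # N = 50 points in the domain
samples = sample_gp_prior(x)    # one row per sampled function
print(samples.shape)
```

Each row of <code>samples</code> is one function drawn from the prior, evaluated at the chosen points; a different kernel changes the kind of functions that are drawn.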
Kriging is also used to extend Gaussian processes to the case of mixed-integer inputs.<ref>{{Cite journal | doi=10.1016/j.neucom.2023.126472 | title=A mixed-categorical correlation kernel for Gaussian process| journal=Neurocomputing| volume=550| pages=126472| year=2023| last1=Saves| first1=Paul | last2=Diouane| first2=Youssef | last3=Bartoli| first3=Nathalie | last4=Lefebvre| first4=Thierry | last5=Morlier| first5=Joseph | arxiv=2211.08262}}</ref> Gaussian processes are also commonly used to tackle numerical analysis problems such as numerical integration, solving differential equations, or optimisation in the field of [[probabilistic numerics]]. Gaussian processes can also be used in the context of mixture of experts models.<ref>{{Cite journal |doi = 10.1109/TPAMI.2013.183|pmid = 26353224|title = Gaussian Process-Mixture Conditional Heteroscedasticity|journal = IEEE Transactions on Pattern Analysis and Machine Intelligence|volume = 36|issue = 5|pages = 888–900|year = 2014|last1 = Platanios |first1 = Emmanouil A.|last2 = Chatzis|first2 = Sotirios P.|s2cid = 10424638}}</ref><ref>{{Cite journal | doi=10.1016/j.neucom.2013.04.029| title=A latent variable Gaussian process model with Pitman–Yor process priors for multiclass classification| journal=Neurocomputing| volume=120| pages=482–489| year=2013| last1=Chatzis| first1=Sotirios P.}}</ref> The rationale for such a learning framework is the assumption that a given mapping cannot be well captured by a single Gaussian process model. Instead, the observation space is divided into subsets, each of which is characterized by a different mapping function; each of these is learned via a different Gaussian process component in the postulated mixture. 
In the natural sciences, Gaussian processes have found use as probabilistic models of astronomical time series and as predictors of molecular properties.<ref>{{Cite thesis |doi = 10.17863/CAM.93643|title = Applications of Gaussian Processes at Extreme Lengthscales: From Molecules to Black Holes|degree = PhD| publisher = University of Cambridge|year = 2022 |first=Ryan-Rhys |last=Griffiths| arxiv=2303.14291 }}</ref> They are also increasingly used as surrogate models for force field optimization.<ref>{{cite journal |last1=Shanks |first1=B. L. |last2=Sullivan |first2=H. W. |last3=Shazed |first3=A. R. |last4=Hoepfner |first4=M. P. |title=Accelerated Bayesian Inference for Molecular Simulations using Local Gaussian Process Surrogate Models |journal=Journal of Chemical Theory and Computation |date=2024 |volume=20 |issue=9 |pages=3798–3808 |doi=10.1021/acs.jctc.3c01358 |pmid=38551198 |url=https://pubs.acs.org/doi/full/10.1021/acs.jctc.3c01358|arxiv=2310.19108 }}</ref>

===Gaussian process prediction, or Kriging===
{{further|Kriging}}
[[File:Gaussian Process Regression.png|thumbnail|right|Gaussian Process Regression (prediction) with a squared exponential kernel. The left plot shows draws from the prior function distribution, the middle plot shows draws from the posterior, and the right plot shows the mean prediction with one standard deviation shaded.]]
When concerned with a general Gaussian process regression problem (Kriging), it is assumed that for a Gaussian process <math>f</math> observed at coordinates <math>x</math>, the vector of values {{tmath|f(x)}} is just one sample from a multivariate Gaussian distribution of dimension equal to the number of observed coordinates {{tmath|n}}. 
Therefore, under the assumption of a zero-mean distribution, {{tmath|f (x') \sim N (0, K(\theta,x,x'))}}, where {{tmath|K(\theta,x,x')}} is the covariance matrix between all possible pairs {{tmath|(x,x')}} for a given set of hyperparameters ''θ''.<ref name= "gpml"/> As such, the log marginal likelihood is: <math display="block">\log p(f(x')\mid\theta,x) = -\frac{1}{2} \left(f(x)^\mathsf{T} K(\theta,x,x')^{-1} f(x') + \log \det(K(\theta,x,x')) + n \log 2\pi \right)</math> and maximizing this marginal likelihood with respect to {{mvar|θ}} provides the complete specification of the Gaussian process {{math|''f''}}. Note that the first term corresponds to a penalty for a model's failure to fit the observed values, and the second term to a penalty that increases with a model's complexity. Having specified {{mvar|θ}}, making predictions about unobserved values {{tmath|f(x^*)}} at coordinates {{math|''x''*}} is then only a matter of drawing samples from the predictive distribution <math>p(y^*\mid x^*,f(x),x) = N(y^*\mid A,B)</math> where the posterior mean estimate {{mvar|A}} is defined as <math display="block">A = K(\theta,x^*,x) K(\theta,x,x')^{-1} f(x)</math> and the posterior variance estimate ''B'' is defined as: <math display="block">B = K(\theta,x^*,x^*) - K(\theta,x^*,x) K(\theta,x,x')^{-1} K(\theta,x^*,x)^\mathsf{T} </math> where {{tmath|K(\theta,x^*,x)}} is the covariance between the new coordinate of estimation ''x''* and all other observed coordinates ''x'' for a given hyperparameter vector {{mvar|θ}}, {{tmath|K(\theta,x,x')}} and {{tmath|f(x)}} are defined as before and {{tmath|K(\theta,x^*,x^*)}} is the variance at point {{math|''x''*}} as dictated by {{mvar|θ}}. 
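
The log marginal likelihood and the predictive quantities ''A'' and ''B'' defined above can be evaluated numerically as in the following sketch. It is an illustration under stated assumptions, not a definitive implementation: the squared exponential kernel, the single hyperparameter <code>length_scale</code> standing in for ''θ'', the toy data, the small diagonal jitter, and the coarse search over three candidate values (in place of a proper optimiser) are all choices made for the example.

```python
import numpy as np

def kernel(xa, xb, length_scale):
    """Squared exponential covariance; here theta is just the length scale (assumption)."""
    sq_dist = (xa[:, None] - xb[None, :]) ** 2
    return np.exp(-0.5 * sq_dist / length_scale**2)

def log_marginal_likelihood(x, f, length_scale, jitter=1e-8):
    """-(1/2) * (f^T K^{-1} f + log det K + n log 2 pi), computed via Cholesky."""
    n = len(x)
    # Jitter is an assumption added only for numerical stability.
    K = kernel(x, x, length_scale) + jitter * np.eye(n)
    chol = np.linalg.cholesky(K)
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, f))  # K^{-1} f
    log_det = 2.0 * np.sum(np.log(np.diag(chol)))
    return -0.5 * (f @ alpha + log_det + n * np.log(2.0 * np.pi))

def predict(x, f, x_star, length_scale, jitter=1e-8):
    """Posterior mean A and posterior covariance B at the new coordinates x_star."""
    K = kernel(x, x, length_scale) + jitter * np.eye(len(x))
    K_star = kernel(x_star, x, length_scale)            # K(theta, x*, x)
    A = K_star @ np.linalg.solve(K, f)
    B = kernel(x_star, x_star, length_scale) - K_star @ np.linalg.solve(K, K_star.T)
    return A, B

# Toy noise-free observations (hypothetical data for illustration).
x = np.array([-2.0, -1.0, 0.0, 1.5, 3.0])
f = np.sin(x)

# Coarse search over candidate hyperparameters, standing in for an optimiser.
candidates = [0.3, 1.0, 3.0]
best = max(candidates, key=lambda ls: log_marginal_likelihood(x, f, ls))

x_star = np.array([0.5, 1.5])   # 1.5 coincides with an observed coordinate
A, B = predict(x, f, x_star, best)
print(best, A, np.diag(B))
```

Because the observations are treated as noise-free, the posterior mean at a coordinate that was already observed reproduces the observed value and the posterior variance there is close to zero, up to the jitter.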
In practice, the posterior mean estimate of {{tmath|f(x^*)}} (the "point estimate") is just a linear combination of the observations {{tmath|f(x)}}; similarly, the variance of {{tmath|f(x^*)}} is independent of the observations {{tmath|f(x)}}. A known bottleneck in Gaussian process prediction is that the computational complexity of inference and likelihood evaluation is cubic in the number of points |''x''|, and as such can become infeasible for larger data sets.<ref name= "brml"/><ref name="highDimBayesianGeostat">{{Cite journal |last1 = Banerjee| first1 = Sudipto | title= High-dimensional Bayesian Geostatistics |journal= Bayesian Analysis | year = 2017 | volume = 12 | issue = 2 | pages=583–614| doi= 10.1214/17-BA1056R | url=https://doi.org/10.1214/17-BA1056R | pmid = 29391920 | pmc = 5790125 }}</ref> Work on sparse Gaussian processes, usually based on the idea of building a ''representative set'' for the given process ''f'', tries to circumvent this issue.<ref name="smolaSparse">{{cite journal |last1= Smola| first1= A.J.| last2=Schoellkopf | first2= B. |year= 2000 |title= Sparse greedy matrix approximation for machine learning |journal= Proceedings of the Seventeenth International Conference on Machine Learning| pages=911–918| citeseerx= 10.1.1.43.3153}}</ref><ref name="CsatoSparse">{{cite journal |last1= Csato| first1=L.| last2=Opper | first2= M. |year= 2002 |title= Sparse on-line Gaussian processes |journal= Neural Computation |number=3| volume= 14 | pages=641–668 | doi=10.1162/089976602317250933| pmid=11860686| citeseerx=10.1.1.335.9713| s2cid=11375333}}</ref><ref name="banerjeePredictiveProcess">{{Cite journal |last1 = Banerjee| first1 = Sudipto | last2=Gelfand | first2 = Alan E.| last3 = Finley | first3 = Andrew O. 
| last4 = Sang | first4 = Huiyan | title= Gaussian Predictive Process Models for large spatial datasets |journal= Journal of the Royal Statistical Society, Series B (Statistical Methodology) | year = 2008 | volume = 70 | issue = 4 | pages=825–848| doi=10.1111/j.1467-9868.2008.00663.x | url=https://doi.org/10.1111/j.1467-9868.2008.00663.x | pmid = 19750209 | pmc = 2741335}}</ref> The [[kriging]] method can be used at the latent level of a [[nonlinear mixed-effects model]] for spatial functional prediction; this technique is called latent kriging.<ref>{{Cite journal |last1=Lee|first1=Se Yoon |first2=Bani|last2=Mallick| title = Bayesian Hierarchical Modeling: Application Towards Production Results in the Eagle Ford Shale of South Texas|journal=Sankhya B|year=2021|volume=84 |pages=1–43 |doi=10.1007/s13571-020-00245-8|doi-access=free}}</ref> Other classes of scalable Gaussian processes for analyzing massive datasets have emerged from the [[Vecchia approximation]] and Nearest Neighbor Gaussian Processes (NNGP).<ref name="DattaEtAl2016">{{cite journal|last1=Datta|first1=Abhirup|last2=Banerjee|first2=Sudipto|last3=Finley|first3=Andrew|last4=Gelfand|first4=Alan|title=Hierarchical Nearest-Neighbor Gaussian Process Models for Large Spatial Data|journal=Journal of the American Statistical Association|year=2016|volume=111|issue=514|pages=800–812|doi=10.1080/01621459.2015.1044091|pmid=29720777 |pmc=5927603 }}</ref><ref name = "highDimBayesianGeostat"></ref> Often, the covariance has the form <math display="inline">K(\theta, x,x') = \frac{1}{\sigma^2} \tilde{K}(\theta,x,x')</math>, where <math>\sigma^2</math> is a scaling parameter. Examples are the Matérn class covariance functions. Whether this scaling parameter <math>\sigma^2</math> is known or unknown (i.e. must be marginalized), the posterior probability, <math>p(\theta \mid D)</math>, i.e. 
the probability for the hyperparameters <math>\theta</math> given a set of data pairs <math>D</math> of observations of <math>x</math> and <math>f(x)</math>, admits an analytical expression.<ref>{{Cite journal| last1=Ranftl|first1=Sascha|last2=Melito|first2=Gian Marco|last3=Badeli|first3=Vahid|last4=Reinbacher-Köstinger|first4=Alice| last5=Ellermann|first5=Katrin|last6=von der Linden|first6=Wolfgang|date=2019-12-31|title=Bayesian Uncertainty Quantification with Multi-Fidelity Data and Gaussian Processes for Impedance Cardiography of Aortic Dissection|journal=Entropy| volume=22|issue=1| pages=58|doi=10.3390/e22010058|issn=1099-4300|pmc=7516489|pmid=33285833|bibcode=2019Entrp..22...58R |doi-access=free}}</ref>

=== Bayesian neural networks as Gaussian processes ===
{{further|Neural network Gaussian process}}
Bayesian neural networks are a particular type of [[Bayesian network]] that results from treating [[deep learning]] and [[artificial neural network]] models probabilistically, and assigning a [[Prior probability|prior distribution]] to their [[Statistical parameter|parameters]]. Computation in artificial neural networks is usually organized into sequential layers of [[artificial neuron]]s. The number of neurons in a layer is called the layer width. As layer width grows large, many Bayesian neural networks reduce to a Gaussian process with a [[Closed-form expression|closed form]] compositional kernel. This Gaussian process is called the Neural Network Gaussian Process (NNGP) (not to be confused with the Nearest Neighbor Gaussian Process<ref name="DattaEtAl2016"></ref>).<ref name="gpml"/><ref name="novak2020">{{cite journal |last1=Novak |first1=Roman |last2=Xiao |first2=Lechao |last3=Hron |first3=Jiri |last4=Lee |first4=Jaehoon |last5=Alemi |first5=Alexander A. |last6=Sohl-Dickstein |first6=Jascha |last7=Schoenholz |first7=Samuel S. 
|title=Neural Tangents: Fast and Easy Infinite Neural Networks in Python |journal=International Conference on Learning Representations |date=2020|arxiv=1912.02803 }}</ref><ref>{{Cite book|last=Neal|first=Radford M.|title=Bayesian Learning for Neural Networks|publisher=Springer Science and Business Media| year=2012}}</ref> It allows predictions from Bayesian neural networks to be more efficiently evaluated, and provides an analytic tool to understand [[deep learning]] models.
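
As a hedged illustration of such a closed-form kernel, the sketch below considers one specific case that is an assumption of this example and not taken from the cited sources' code: a single hidden layer of ReLU units with standard normal weights, no biases, and a readout scaled by the inverse square root of the width. For that architecture the output covariance is given by the degree-1 arc-cosine kernel, and the sketch compares it with the covariance estimated empirically over many randomly initialised networks. All function names and parameter values are hypothetical.

```python
import numpy as np

def relu_nngp_kernel(x1, x2):
    """Closed-form E[relu(w.x1) * relu(w.x2)] for w ~ N(0, I).

    This is the degree-1 arc-cosine kernel divided by two; it applies to the
    single-hidden-layer ReLU architecture assumed in this sketch.
    """
    n1, n2 = np.linalg.norm(x1), np.linalg.norm(x2)
    cos_theta = np.clip(x1 @ x2 / (n1 * n2), -1.0, 1.0)
    theta = np.arccos(cos_theta)
    return n1 * n2 * (np.sin(theta) + (np.pi - theta) * cos_theta) / (2.0 * np.pi)

def empirical_output_covariance(x1, x2, width=256, n_networks=4000, seed=0):
    """Covariance of network outputs at x1 and x2 over random initialisations."""
    rng = np.random.default_rng(seed)
    d = len(x1)
    w = rng.standard_normal((n_networks, width, d))  # hidden-layer weights
    v = rng.standard_normal((n_networks, width))     # readout weights
    h1 = np.maximum(w @ x1, 0.0)                     # ReLU activations at x1
    h2 = np.maximum(w @ x2, 0.0)                     # ReLU activations at x2
    f1 = np.sum(v * h1, axis=1) / np.sqrt(width)
    f2 = np.sum(v * h2, axis=1) / np.sqrt(width)
    # Outputs have zero mean, so the mean product estimates the covariance.
    return np.mean(f1 * f2)

x1 = np.array([1.0, 0.0])
x2 = np.array([0.6, 0.8])
analytic = relu_nngp_kernel(x1, x2)
empirical = empirical_output_covariance(x1, x2)
print(analytic, empirical)
```

In this parameterisation the covariance agrees with the kernel at any width; a large width is what makes the joint distribution of the outputs approach a Gaussian, which is the sense in which the network reduces to a Gaussian process.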