Editing Mercer's theorem

{{Short description|Mathematical theorem}}
{{More footnotes needed|date=December 2024}}
In [[mathematics]], specifically [[functional analysis]], '''Mercer's theorem''' is a representation of a symmetric [[Definite bilinear form|positive-definite]] function on a square as a sum of a convergent sequence of product functions.  This theorem, presented in {{harv|Mercer|1909}}, is one of the most notable results of the work of [[James Mercer (mathematician)|James Mercer]] (1883–1932). It is an important theoretical tool in the theory of [[integral equation]]s; it is used in the [[Hilbert space]] theory of [[stochastic process]]es, for example the [[Karhunen–Loève theorem]]; and it is also used in the [[reproducing kernel Hilbert space]] theory where it characterizes a symmetric [[positive-definite kernel]] as a reproducing kernel.<ref>{{cite web |first=Peter |last=Bartlett |title=Reproducing Kernel Hilbert Spaces |date=2008 |work=Lecture notes of CS281B/Stat241B Statistical Learning Theory |publisher=University of California at Berkeley |url=https://people.eecs.berkeley.edu/~bartlett/courses/281b-sp08/7.pdf }}</ref>

== Introduction ==
To explain Mercer's theorem, we first consider an important special case; see [[#Generalizations|below]] for a more general formulation.
A ''kernel'', in this context, is a [[Symmetric_function|symmetric]] continuous function

:<math> K: [a,b] \times [a,b] \rightarrow \mathbb{R}</math>

where <math>K(x,y) = K(y,x)</math> for all <math>x,y \in [a,b]</math>.

''K'' is said to be a [[positive-definite kernel]] if and only if

:<math> \sum_{i=1}^n\sum_{j=1}^n K(x_i, x_j) c_i c_j \geq 0</math>

for all finite sequences of points ''x''<sub>1</sub>,&nbsp;...,&nbsp;''x''<sub>''n''</sub> of [''a'',&nbsp;''b''] and all choices of real numbers ''c''<sub>1</sub>,&nbsp;...,&nbsp;''c''<sub>''n''</sub>.  Note that the term "positive-definite" is well-established in literature despite the weak inequality in the definition.<ref>{{Cite book |last=Mohri |first=Mehryar |url=https://www.worldcat.org/oclc/1041560990 |title=Foundations of machine learning |date=2018 |others=Afshin Rostamizadeh, Ameet Talwalkar |isbn=978-0-262-03940-6 |edition=Second |location=Cambridge, Massachusetts |oclc=1041560990}}</ref><ref>{{Cite book |last=Berlinet |first=A. |url=https://www.worldcat.org/oclc/844346520 |title=Reproducing kernel Hilbert spaces in probability and statistics |date=2004 |publisher=Springer Science+Business Media |others=Christine Thomas-Agnan |isbn=1-4419-9096-8 |location=New York |oclc=844346520}}</ref> 

The fundamental characterization of stationary positive-definite kernels (where <math>K(x,y) = K(x-y)</math>) is given by '''[[Bochner's theorem]]'''. It states that a continuous function <math>K(x-y)</math> is positive-definite if and only if it can be expressed as the [[Fourier transform]] of a finite non-negative measure <math>\mu</math>:

:<math>K(x-y) = \int_{-\infty}^{\infty} e^{i(x-y)\omega} \, d\mu(\omega)</math>

This spectral representation reveals the connection between positive definiteness and harmonic analysis, providing a stronger and more direct characterization of positive definiteness than the abstract definition in terms of inequalities when the kernel is stationary, e.g, when it can be expressed as a 1-variable function of the distance between points rather than the 2-variable function of the positions of pairs of points.

Associated to ''K'' is a [[linear operator]] (more specifically a [[Hilbert–Schmidt integral operator]] when the interval is compact) on functions defined by the integral

:<math> [T_K \varphi](x) =\int_a^b  K(x,s) \varphi(s)\, ds. </math>

We assume <math>\varphi</math> can range through the space
of real-valued [[square-integrable functions]] ''L''<sup>2</sup>[''a'',&nbsp;''b'']; however, in many cases the associated RKHS can be strictly larger than ''L''<sup>2</sup>[''a'',&nbsp;''b'']. Since ''T<sub>K</sub>'' is a linear operator, the [[eigenvalues]] and [[eigenfunction]]s of ''T<sub>K</sub>'' exist.

'''Theorem'''. Suppose ''K'' is a continuous symmetric positive-definite kernel.  Then there is an [[orthonormal basis]]
{''e''<sub>i</sub>}<sub>i</sub> of ''L''<sup>2</sup>[''a'',&nbsp;''b''] consisting of eigenfunctions of ''T''<sub>''K''</sub> such that the corresponding
sequence of eigenvalues {&lambda;<sub>''i''</sub>}<sub>''i''</sub> is nonnegative.  The eigenfunctions corresponding to non-zero eigenvalues are continuous on [''a'',&nbsp;''b''] and ''K'' has the representation

:<math> K(s,t) = \sum_{j=1}^\infty \lambda_j \, e_j(s) \, e_j(t) </math>

where the convergence is absolute and uniform.

== Details ==
We now explain in greater detail the structure of the proof of
Mercer's theorem, particularly how it relates to [[spectral theory of compact operators]].

* The map ''K''  ↦ ''T''<sub>''K''</sub> is  [[Injective function|injective]].
* ''T''<sub>''K''</sub> is a non-negative symmetric  compact operator on ''L''<sup>2</sup>[''a'',''b'']; moreover ''K''(''x'', ''x'') &ge; 0.

To show compactness, show that the image of the [[unit ball]] of ''L''<sup>2</sup>[''a'',''b''] under ''T''<sub>''K''</sub> is [[equicontinuous]] and apply [[Ascoli's theorem]], to show that the image of the unit ball is relatively compact in C([''a'',''b'']) with the [[uniform norm]] and ''a fortiori'' in ''L''<sup>2</sup>[''a'',''b''].

Now apply the [[spectral theorem]] for compact operators on Hilbert
spaces to ''T''<sub>''K''</sub> to show the existence of the
orthonormal basis {''e''<sub>i</sub>}<sub>i</sub> of
''L''<sup>2</sup>[''a'',''b'']

:<math> \lambda_i e_i(t)= [T_K e_i](t)  = \int_a^b K(t,s) e_i(s)\, ds. </math>

If &lambda;<sub>i</sub> &ne; 0, the eigenvector ([[eigenfunction]]) ''e''<sub>i</sub> is seen to be continuous on [''a'',''b''].  Now

:<math> \sum_{i=1}^\infty \lambda_i |e_i(t) e_i(s)| \leq \sup_{x \in [a,b]} |K(x,x)|, </math>

which shows that the sequence

:<math> \sum_{i=1}^\infty \lambda_i e_i(t) e_i(s) </math>

converges absolutely and uniformly to a kernel ''K''<sub>0</sub> which is easily seen to define the same operator as the kernel ''K''.  Hence ''K''=''K''<sub>0</sub> from which Mercer's theorem follows.

Finally, to show non-negativity of the eigenvalues one can write <math>\lambda \langle f,f \rangle= \langle f, T_{K}f \rangle</math> and expressing the right hand side as an integral well-approximated by its Riemann sums, which are non-negative
by positive-definiteness of ''K'', implying <math>\lambda \langle f,f \rangle \geq 0</math>, implying <math>\lambda \geq 0 </math>.

== Trace ==
The following is immediate:

'''Theorem'''.  Suppose ''K'' is a continuous symmetric positive-definite kernel; ''T''<sub>''K''</sub> has a  sequence of nonnegative
eigenvalues {&lambda;<sub>i</sub>}<sub>i</sub>.  Then

:<math> \int_a^b K(t,t)\, dt = \sum_i \lambda_i. </math>

This shows that the operator ''T''<sub>''K''</sub> is a [[trace class]] operator and

:<math> \operatorname{trace}(T_K) = \int_a^b K(t,t)\, dt.  </math>

== Generalizations ==
Mercer's theorem itself is a generalization of the result that any [[symmetric matrix|symmetric]] [[positive-semidefinite matrix]] is the [[Gramian matrix]] of a set of vectors.

The first generalization replaces the interval [''a'',&nbsp;''b''] with any [[compact Hausdorff space]] and Lebesgue measure on [''a'',&nbsp;''b''] is replaced by a finite countably additive measure &mu; on the [[Borel algebra]] of ''X'' whose support is ''X''.  This means that &mu;(''U'') > 0 for any nonempty open subset ''U'' of ''X''.

A recent generalization replaces these conditions by the following: the set ''X'' is a [[first-countable]] topological space endowed with a Borel (complete) measure &mu;. ''X'' is the support of &mu; and, for all ''x'' in ''X'', there is an open set ''U'' containing ''x'' and having finite measure. Then essentially the same result holds:

'''Theorem'''. Suppose ''K'' is a continuous symmetric  positive-definite kernel on ''X''. If the function &kappa; is ''L''<sup>1</sup><sub>&mu;</sub>(''X''), where &kappa;(x)=K(x,x), for all ''x'' in ''X'', then there is an [[orthonormal set]]
{''e''<sub>i</sub>}<sub>i</sub> of ''L''<sup>2</sup><sub>&mu;</sub>(''X'') consisting of eigenfunctions of ''T''<sub>''K''</sub> such that corresponding
sequence of eigenvalues {&lambda;<sub>i</sub>}<sub>i</sub> is nonnegative.  The eigenfunctions corresponding to non-zero eigenvalues are continuous on ''X'' and ''K'' has the representation

:<math> K(s,t) = \sum_{j=1}^\infty \lambda_j \, e_j(s) \, e_j(t) </math>

where the convergence is absolute and uniform on compact subsets of ''X''.

The next generalization deals with representations of ''measurable'' kernels.

Let (''X'', ''M'', &mu;) be a &sigma;-finite measure space.  An ''L''<sup>2</sup> (or square-integrable) kernel on ''X'' is a function

:<math> K \in L^2_{\mu \otimes \mu}(X \times X). </math>

''L''<sup>2</sup> kernels define a bounded operator ''T''<sub>''K''</sub> by the formula

:<math> \langle T_K \varphi, \psi \rangle = \int_{X \times X} K(y,x) \varphi(y) \psi(x) \,d[\mu \otimes \mu](y,x). </math>

''T''<sub>''K''</sub> is a compact operator (actually it is even a [[Hilbert–Schmidt operator]]).  If the kernel ''K'' is symmetric, by the [[compact operator on Hilbert space#Compact self adjoint operator|spectral theorem]], ''T''<sub>''K''</sub>  has an orthonormal basis of eigenvectors. Those eigenvectors that correspond to non-zero eigenvalues can be arranged in a sequence {''e''<sub>''i''</sub>}<sub>''i''</sub> (regardless of separability).

'''Theorem'''.  If ''K'' is a symmetric positive-definite kernel on (''X'', ''M'', &mu;), then

:<math> K(y,x) = \sum_{i \in \mathbb{N}} \lambda_i e_i(y) e_i(x)  </math>

where the convergence in the  ''L''<sup>2</sup> norm.  Note that when continuity of the kernel is not assumed, the expansion no longer converges uniformly.

==Mercer's condition==
In [[mathematics]], a [[real number|real]]-valued [[function (mathematics)|function]] ''K''(''x'',''y'') is said to fulfill '''Mercer's condition''' if for all [[square-integrable function]]s ''g''(''x'') one has

:<math> \iint g(x)K(x,y)g(y)\,dx\,dy \geq 0. </math>

===Discrete analog===
This is analogous to the definition of a [[positive-semidefinite matrix]]. This is a matrix <math>K</math> of dimension <math>N</math>, which satisfies, for all vectors <math>g</math>, the property
:<math>(g,Kg)=g^{T}{\cdot}Kg=\sum_{i=1}^N\sum_{j=1}^N\,g_i\,K_{ij}\,g_j\geq0</math>.

===Examples===
A positive constant function
:<math>K(x, y)=c\,</math>
satisfies Mercer's condition, as then the integral becomes by [[Fubini's theorem]] 
:<math> \iint g(x)\,c\,g(y)\,dx \, dy = c\int\! g(x) \,dx \int\! g(y) \,dy = c\left(\int\! g(x) \,dx\right)^2</math>
which is indeed [[non-negative]].

== See also ==
* [[Kernel trick]]
* [[Representer theorem]]
* [[Reproducing kernel Hilbert space]]
* [[Spectral theory]]

==Notes==
{{reflist}}

== References ==
* Adriaan Zaanen, ''Linear Analysis'', North Holland Publishing Co., 1960,
* Ferreira, J. C., Menegatto, V. A., ''Eigenvalues of integral operators defined by smooth positive definite kernels'', Integral equation and Operator Theory, 64 (2009), no. 1, 61–81. (Gives the generalization of Mercer's theorem for metric spaces. The result is easily adapted to first countable topological spaces)
* [[Konrad Jörgens]], ''Linear integral operators'', Pitman, Boston, 1982,
* [[Richard Courant]] and [[David Hilbert]], ''[[Methoden der mathematischen Physik|Methods of Mathematical Physics]]'', vol 1, Interscience 1953, 
* Robert Ash, ''Information Theory'', Dover Publications, 1990,
* {{citation
 |first=J. |last=Mercer
 |title=Functions of positive and negative type and their connection with the theory of integral equations
 |journal=Philosophical Transactions of the Royal Society A
 |year=1909 |volume=209 |pages=415&ndash;446
 |doi=10.1098/rsta.1909.0016
 |issue=441–458
|bibcode=1909RSPTA.209..415M
 |doi-access=free
 }},
* {{springer|title=Mercer theorem|id=p/m063440}}
* H. König, ''Eigenvalue distribution of compact operators'', Birkhäuser Verlag, 1986. (Gives the generalization of Mercer's theorem for finite measures μ.)

{{Functional analysis}}

[[Category:Theorems in functional analysis]]