== Tikhonov regularization ==

Suppose that for a known [[real matrix]] <math>A</math> and vector <math>\mathbf{b}</math>, we wish to find a vector <math>\mathbf{x}</math> such that
<math display="block">A\mathbf{x} = \mathbf{b},</math>
where <math>\mathbf{x}</math> and <math>\mathbf{b}</math> may be of different sizes and <math>A</math> may be non-square.

The standard approach is [[ordinary least squares]] linear regression.{{Clarify|reason=does this represent a system of linear equations (i.e. are x and b both of the same dimension as one side of the - supposedly square - matrix? then, as far as I know, the standard approach for solving it is any of a wide range of solvers ''not'' including linear regression|date=May 2020}} However, if no <math>\mathbf{x}</math> satisfies the equation, or if more than one <math>\mathbf{x}</math> does (that is, the solution is not unique), the problem is said to be [[Well-posed problem|ill posed]]. In such cases, ordinary least squares estimation amounts to solving an [[Overdetermined system|overdetermined]], or more often an [[Underdetermined system|underdetermined]], system of equations. Most real-world phenomena have the effect of [[low-pass filters]]{{Clarify|reason=If multiplying a matrix by x is a filter, what in A is a frequency, and what values correspond to high or low frequencies?|date=November 2022}} in the forward direction where <math>A</math> maps <math>\mathbf{x}</math> to <math>\mathbf{b}</math>. Therefore, in solving the inverse problem, the inverse mapping operates as a [[high-pass filter]] with the undesirable tendency of amplifying noise (the [[eigenvalues]] or singular values that are smallest in the forward mapping are largest in the inverse mapping). In addition, ordinary least squares implicitly nullifies every element of the reconstructed version of <math>\mathbf{x}</math> that lies in the null space of <math>A</math>, rather than allowing a model to be used as a prior for <math>\mathbf{x}</math>.

Ordinary least squares seeks to minimize the sum of squared [[Residual (numerical analysis)|residuals]], which can be compactly written as
<math display="block">\left\|A\mathbf{x} - \mathbf{b}\right\|_2^2,</math>
where <math>\|\cdot\|_2</math> is the [[Norm (mathematics)#Euclidean norm|Euclidean norm]].

In order to give preference to a particular solution with desirable properties, a regularization term can be included in this minimization:
<math display="block">\left\|A\mathbf{x} - \mathbf{b}\right\|_2^2 + \left\|\Gamma \mathbf{x}\right\|_2^2 = \left\|\begin{pmatrix}A\\\Gamma\end{pmatrix}\mathbf{x} - \begin{pmatrix}\mathbf{b}\\\boldsymbol0\end{pmatrix}\right\|_2^2</math>
for some suitably chosen '''Tikhonov matrix''' <math>\Gamma</math>. In many cases, this matrix is chosen as a scalar multiple of the [[identity matrix]] (<math>\Gamma = \alpha I</math>), giving preference to solutions with smaller [[Norm (mathematics)|norms]]; this is known as '''{{math|''L''<sub>2</sub>}} regularization'''.<ref>{{cite conference |first=Andrew Y. |last=Ng |author-link=Andrew Ng |year=2004 |title=Feature selection, L1 vs. L2 regularization, and rotational invariance |conference=Proc. [[International Conference on Machine Learning|ICML]] |url=https://icml.cc/Conferences/2004/proceedings/papers/354.pdf}}</ref> In other cases, high-pass operators (e.g., a [[difference operator]] or a weighted [[discrete Fourier transform|Fourier operator]]) may be used to enforce smoothness if the underlying vector is believed to be mostly continuous.
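The augmented formulation lends itself directly to computation: stacking <math>A</math> on top of <math>\Gamma</math> and padding <math>\mathbf{b}</math> with zeros turns the regularized problem into an ordinary least-squares problem. The following is a minimal sketch assuming [[NumPy]]; the test matrices and the value of <code>alpha</code> are invented purely for illustration.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)

# Invented ill-conditioned test problem: the singular values of A decay rapidly.
A = rng.standard_normal((100, 10)) @ np.diag(np.logspace(0, -6, 10))
x_true = rng.standard_normal(10)
b = A @ x_true + 0.01 * rng.standard_normal(100)

# Tikhonov matrix chosen as a scalar multiple of the identity (L2 regularization).
alpha = 0.1
Gamma = alpha * np.eye(A.shape[1])

# Minimize ||A x - b||^2 + ||Gamma x||^2 by stacking both terms into a single
# ordinary least-squares problem, as in the augmented formulation above.
A_aug = np.vstack([A, Gamma])
b_aug = np.concatenate([b, np.zeros(A.shape[1])])
x_hat, *_ = np.linalg.lstsq(A_aug, b_aug, rcond=None)
</syntaxhighlight>

Solving the stacked system avoids forming <math>A^\mathsf{T} A + \Gamma^\mathsf{T} \Gamma</math> explicitly, which is numerically preferable when <math>A</math> is ill-conditioned.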
This regularization improves the conditioning of the problem, thus enabling a direct numerical solution. An explicit solution, denoted by <math>\hat\mathbf{x}</math>, is given by
<math display="block">\hat\mathbf{x} = \left(A^\mathsf{T} A + \Gamma^\mathsf{T} \Gamma\right)^{-1} A^\mathsf{T} \mathbf{b} = \left(\begin{pmatrix}A\\\Gamma\end{pmatrix}^\mathsf{T}\begin{pmatrix}A\\\Gamma\end{pmatrix}\right)^{-1} \begin{pmatrix}A\\\Gamma\end{pmatrix}^\mathsf{T} \begin{pmatrix}\mathbf{b}\\\boldsymbol0\end{pmatrix}.</math>
The effect of regularization may be varied by the scale of the matrix <math>\Gamma</math>. For <math>\Gamma = 0</math> this reduces to the unregularized least-squares solution, provided that <math>\left(A^\mathsf{T} A\right)^{-1}</math> exists. Note that in the case of a [[complex matrix]] <math>A</math>, the transpose <math>A^\mathsf{T}</math> has to be replaced, as usual, by the [[Hermitian transpose]] <math>A^\mathsf{H}</math>.

{{math|''L''<sub>2</sub>}} regularization is used in many contexts aside from linear regression, such as [[Statistical classification|classification]] with [[logistic regression]] or [[support vector machine]]s,<ref>{{cite journal |author1=R.-E. Fan |author2=K.-W. Chang |author3=C.-J. Hsieh |author4=X.-R. Wang |author5=C.-J. Lin |title=LIBLINEAR: A library for large linear classification |journal=[[Journal of Machine Learning Research]] |volume=9 |pages=1871–1874 |year=2008}}</ref> and matrix factorization.<ref>{{cite journal |last1=Guan |first1=Naiyang |first2=Dacheng |last2=Tao |first3=Zhigang |last3=Luo |first4=Bo |last4=Yuan |title=Online nonnegative matrix factorization with robust stochastic approximation |journal=IEEE Transactions on Neural Networks and Learning Systems |volume=23 |issue=7 |year=2012 |pages=1087–1099 |doi=10.1109/TNNLS.2012.2197827 |pmid=24807135 |s2cid=8755408}}</ref>

=== Application to existing fit results ===

Since Tikhonov regularization simply adds a quadratic term to the objective function, it can also be applied after an unregularized optimization has already been performed. For example, if the above problem with <math>\Gamma = 0</math> yields the solution <math>\hat\mathbf{x}_0</math>, the solution in the presence of <math>\Gamma \ne 0</math> can be expressed as
<math display="block">\hat\mathbf{x} = B \hat\mathbf{x}_0,</math>
with the "regularization matrix" <math>B = \left(A^\mathsf{T} A + \Gamma^\mathsf{T} \Gamma\right)^{-1} A^\mathsf{T} A</math>.

If the parameter fit comes with a covariance matrix <math>V_0</math> of the estimated parameter uncertainties, then the regularization matrix is
<math display="block">B = \left(V_0^{-1} + \Gamma^\mathsf{T}\Gamma\right)^{-1} V_0^{-1},</math>
and the regularized result has the new covariance
<math display="block">V = B V_0 B^\mathsf{T}.</math>

In the context of arbitrary likelihood fits, this remains valid as long as the quadratic approximation of the likelihood function is valid. In other words, as long as the perturbation from the unregularized result is small, one can regularize any result that is presented as a best-fit point with a covariance matrix; no detailed knowledge of the underlying likelihood function is needed.<ref>{{cite journal |arxiv=2207.02125 |doi=10.1088/1748-0221/17/10/P10021 |title=Post-hoc regularisation of unfolded cross-section measurements |date=2022 |last1=Koch |first1=Lukas |journal=Journal of Instrumentation |volume=17 |issue=10 |pages=10021 |bibcode=2022JInst..17P0021K}}</ref>
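A minimal sketch of this post-hoc regularization, again assuming NumPy; the fit result <code>x_hat_0</code>, its covariance <code>V0</code>, and the Tikhonov matrix <code>Gamma</code> below are placeholder values rather than the output of any real fit.

<syntaxhighlight lang="python">
import numpy as np

# Hypothetical unregularized fit result: best-fit point and covariance matrix.
x_hat_0 = np.array([1.2, -0.7, 3.4])
V0 = np.array([[0.30, 0.05, 0.00],
               [0.05, 0.20, 0.02],
               [0.00, 0.02, 0.50]])

# Tikhonov matrix; here a scalar multiple of the identity.
Gamma = 0.5 * np.eye(3)

# Regularization matrix B = (V0^-1 + Gamma^T Gamma)^-1 V0^-1,
# applied directly to the existing fit result.
V0_inv = np.linalg.inv(V0)
B = np.linalg.solve(V0_inv + Gamma.T @ Gamma, V0_inv)

x_hat = B @ x_hat_0   # regularized best-fit point
V = B @ V0 @ B.T      # covariance of the regularized result
</syntaxhighlight>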
=== Generalized Tikhonov regularization ===

For general multivariate normal distributions for <math>\mathbf x</math> and the data error, one can apply a transformation of the variables to reduce to the case above. Equivalently, one can seek an <math>\mathbf x</math> to minimize
<math display="block">\left\|A \mathbf x - \mathbf b\right\|_P^2 + \left\|\mathbf x - \mathbf x_0\right\|_Q^2,</math>
where we have used <math>\left\|\mathbf{x}\right\|_Q^2</math> to stand for the weighted norm squared <math>\mathbf{x}^\mathsf{T} Q \mathbf{x}</math> (compare with the [[Mahalanobis distance]]). In the Bayesian interpretation, <math>P</math> is the inverse [[covariance matrix]] of <math>\mathbf b</math>, <math>\mathbf x_0</math> is the [[expected value]] of <math>\mathbf x</math>, and <math>Q</math> is the inverse covariance matrix of <math>\mathbf x</math>. The Tikhonov matrix is then given as a factorization of the matrix <math>Q = \Gamma^\mathsf{T} \Gamma</math> (e.g. the [[Cholesky factorization]]) and is considered a [[Whitening transformation|whitening filter]].

This generalized problem has an optimal solution <math>\mathbf x^*</math> which can be written explicitly using the formula
<math display="block">\mathbf x^* = \left(A^\mathsf{T} P A + Q\right)^{-1} \left(A^\mathsf{T} P \mathbf{b} + Q \mathbf{x}_0\right),</math>
or equivalently, when <math>Q</math> is not a null matrix:
<math display="block">\mathbf x^* = \mathbf x_0 + \left(A^\mathsf{T} P A + Q \right)^{-1} \left(A^\mathsf{T} P \left(\mathbf b - A \mathbf x_0\right)\right).</math>
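The explicit formula for <math>\mathbf x^*</math> can likewise be evaluated directly. The sketch below assumes NumPy; the matrices <code>A</code>, <code>P</code>, <code>Q</code> and the vectors <code>b</code>, <code>x0</code> are arbitrary placeholders standing in for problem-specific data.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(1)

# Placeholder problem data: forward matrix, observation, and prior mean.
A = rng.standard_normal((20, 5))
x0 = np.zeros(5)                      # expected value of x
b = A @ rng.standard_normal(5) + 0.1 * rng.standard_normal(20)

P = np.eye(20)                        # inverse covariance of the data error
Q = 0.25 * np.eye(5)                  # inverse covariance of x (= Gamma^T Gamma)

# x* = x0 + (A^T P A + Q)^{-1} A^T P (b - A x0)
x_star = x0 + np.linalg.solve(A.T @ P @ A + Q, A.T @ P @ (b - A @ x0))
</syntaxhighlight>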