===An elementary example: Earth's gravitational field===

Only a few physical systems are actually linear with respect to the model parameters. One such system from geophysics is that of the [[Gravity of Earth|Earth's gravitational field]]. The Earth's gravitational field is determined by the density distribution of the Earth in the subsurface. Because the [[lithology]] of the Earth changes quite significantly, we are able to observe minute differences in the Earth's gravitational field on the surface of the Earth. From our understanding of gravity ([[Newton's law of universal gravitation|Newton's Law of Gravitation]]), we know that the mathematical expression for gravity is:
<math display="block">d = \frac{G p}{r^2};</math>
here <math>d</math> is a measure of the local gravitational acceleration, <math>G</math> is the [[Gravitational constant|universal gravitational constant]], <math>p</math> is the local mass (which is related to density) of the rock in the subsurface and <math>r</math> is the distance from the mass to the observation point.

By discretizing the above expression, we are able to relate the discrete data observations on the surface of the Earth to the discrete model parameters (density) in the subsurface that we wish to know more about. For example, consider the case where we have measurements carried out at 5 locations on the surface of the Earth. In this case, our data vector <math>d</math> is a column vector of dimension (5×1): its <math>i</math>-th component is associated with the <math>i</math>-th observation location. We also know that we only have five unknown masses <math>p_j</math> in the subsurface (unrealistic but used to demonstrate the concept) with known location: we denote by <math>r_{ij}</math> the distance between the <math>i</math>-th observation location and the <math>j</math>-th mass. Thus, we can construct the linear system relating the five unknown masses to the five data points as follows:
<math display="block">d = F p,</math>
<math display="block">d = \begin{bmatrix} d_1 \\ d_2 \\ d_3 \\ d_4 \\ d_5 \end{bmatrix}, \quad p = \begin{bmatrix} p_1 \\ p_2 \\ p_3 \\ p_4 \\ p_5 \end{bmatrix},</math>
<math display="block">F = \begin{bmatrix} \frac{G}{r_{11}^2} & \frac{G}{r_{12}^2} & \frac{G}{r_{13}^2} & \frac{G}{r_{14}^2} & \frac{G}{r_{15}^2} \\ \frac{G}{r_{21}^2} & \frac{G}{r_{22}^2} & \frac{G}{r_{23}^2} & \frac{G}{r_{24}^2} & \frac{G}{r_{25}^2} \\ \frac{G}{r_{31}^2} & \frac{G}{r_{32}^2} & \frac{G}{r_{33}^2} & \frac{G}{r_{34}^2} & \frac{G}{r_{35}^2} \\ \frac{G}{r_{41}^2} & \frac{G}{r_{42}^2} & \frac{G}{r_{43}^2} & \frac{G}{r_{44}^2} & \frac{G}{r_{45}^2} \\ \frac{G}{r_{51}^2} & \frac{G}{r_{52}^2} & \frac{G}{r_{53}^2} & \frac{G}{r_{54}^2} & \frac{G}{r_{55}^2} \end{bmatrix}</math>

To solve for the model parameters that fit our data, we might be able to invert the matrix <math>F</math> to convert the measurements directly into our model parameters. For example:
<math display="block">p = F^{-1} d_\text{obs}</math>

A system with five equations and five unknowns is a very specific situation: our example was designed to end up with this specificity. In general, the numbers of data and unknowns differ, so that matrix <math>F</math> is not square. However, even a square matrix can have no inverse: matrix <math>F</math> can be [[Rank (linear algebra)|rank]] deficient (i.e. have zero eigenvalues), in which case the solution of the system <math>p = F^{-1} d_\text{obs}</math> is not unique and the solution of the inverse problem is undetermined. This is a first difficulty.
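As an illustration, the following Python sketch builds the matrix <math>F</math> for a hypothetical geometry (the observation points, mass positions and mass values are made up for the example, chosen only so that <math>F</math> is invertible) and performs the direct inversion <math>p = F^{-1} d_\text{obs}</math>:

<syntaxhighlight lang="python">
import numpy as np

# Hypothetical geometry: 5 observation points on the surface (depth 0)
# and 5 point masses at known subsurface positions; units are SI.
obs = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0], [4.0, 0.0]])
masses = np.array([[0.2, -1.0], [1.1, -1.5], [2.3, -0.8], [2.9, -1.2], [4.2, -1.0]])

G = 6.674e-11  # universal gravitational constant

# Forward operator: F[i, j] = G / r_ij^2, with r_ij the distance between
# observation point i and mass j.
r = np.linalg.norm(obs[:, None, :] - masses[None, :, :], axis=2)
F = G / r**2

# Synthetic "observed" data from known masses, then direct inversion.
p_true = np.array([1.0, 2.0, 1.5, 0.5, 3.0]) * 1e9
d_obs = F @ p_true
p = np.linalg.solve(F, d_obs)  # only possible because this F is invertible
print(np.max(np.abs(p - p_true) / p_true))  # small relative error
</syntaxhighlight>

If the masses were nearly colocated, <math>F</math> would become nearly rank deficient and this direct inversion would break down, which is exactly the first difficulty discussed above.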
Over-determined systems (more equations than unknowns) raise other issues. Noise may also corrupt our observations, possibly placing <math>d</math> outside the space <math>F(P)</math> of possible responses to model parameters, so that a solution of the system <math>p = F^{-1} d_\text{obs}</math> may not exist. This is another difficulty.

==== Tools to overcome the first difficulty ====

The first difficulty reflects a crucial problem: our observations do not contain enough information, and additional data are required. Additional data can come from physical '''prior information''' on the parameter values, on their spatial distribution or, more generally, on their mutual dependence. It can also come from other experiments: for instance, we may think of integrating data recorded by gravimeters and seismographs for a better estimation of densities. The integration of this additional information is basically a problem of [[statistics]]. This discipline is the one that can answer the question: how to mix quantities of different nature? We will be more precise in the section "Bayesian approach" below.

Concerning distributed parameters, prior information about their spatial distribution often consists of information about some derivatives of these distributed parameters. Also, it is common practice, although somewhat artificial, to look for the "simplest" model that reasonably matches the data. This is usually achieved by [[Penalty method|penalizing]] the [[Lp space|<math>L^1</math> norm]] of the gradient (or the [[total variation]]) of the parameters (this approach is also referred to as the maximization of the entropy). One can also make the model simple through a parametrization that introduces degrees of freedom only when necessary.

Additional information may also be integrated through inequality constraints on the model parameters or some functions of them. Such constraints are important to avoid unrealistic values for the parameters (negative values for instance). In this case, the space spanned by model parameters will no longer be a vector space but a '''subset of admissible models''' denoted by <math>P_\text{adm}</math> in the sequel.

==== Tools to overcome the second difficulty ====

As mentioned above, noise may be such that our measurements are not the image of any model, so that we cannot look for a model that produces the data but rather look for [[model selection|the best (or optimal) model]]: that is, the one that best matches the data. This leads us to minimize an [[objective function]], namely a [[functional (mathematics)|functional]] that quantifies how big the residuals are or how far the predicted data are from the observed data. Of course, when we have perfect data (i.e. no noise) then the recovered model should fit the observed data perfectly. A standard objective function, <math>\varphi</math>, is of the form
<math display="block">\varphi(p) = \|F p - d_\text{obs}\|^2</math>
where <math>\| \cdot \|</math> is the Euclidean norm (it will be the [[Lp space|<math>L^2</math> norm]] when the measurements are functions instead of samples) of the residuals. This approach amounts to making use of [[Least squares|ordinary least squares]], an approach widely used in statistics. However, the Euclidean norm is known to be very sensitive to outliers: to avoid this difficulty we may think of using other distances, for instance the <math>L^1</math> norm, in place of the <math>L^2</math> norm.
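The following sketch illustrates both remedies on a randomly generated stand-in for <math>F</math>: ordinary least squares for an over-determined noisy system, and a simple quadratic (Tikhonov-style) penalty as an elementary way of injecting prior information. The weight <code>alpha</code> is a hypothetical choice, and the quadratic penalty stands in for the <math>L^1</math>-gradient penalty discussed above, which would require an iterative solver:

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)

# Over-determined stand-in problem: 8 noisy observations, 5 unknowns,
# so F p = d_obs generally has no exact solution.
F = rng.random((8, 5))
p_true = rng.random(5)
d_obs = F @ p_true + 0.01 * rng.standard_normal(8)

# Ordinary least squares: minimize ||F p - d_obs||^2.
p_ls, *_ = np.linalg.lstsq(F, d_obs, rcond=None)

# Quadratic penalty on the parameter norm: minimize
# ||F p - d_obs||^2 + alpha ||p||^2; alpha is a hypothetical weight
# balancing data fit against model simplicity.
alpha = 1e-3
p_reg = np.linalg.solve(F.T @ F + alpha * np.eye(5), F.T @ d_obs)
</syntaxhighlight>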
==== Bayesian approach ====

Very similar to the least-squares approach is the probabilistic approach: if we know the statistics of the noise that contaminates the data, we can think of seeking the most likely model <math>m</math>, which is the model that matches the [[Maximum likelihood estimation|maximum likelihood criterion]]. If the noise is [[Normal distribution|Gaussian]], the maximum likelihood criterion appears as a least-squares criterion, the Euclidean scalar product in data space being replaced by a scalar product involving the [[covariance]] of the noise. Also, should prior information on model parameters be available, we could think of using [[Bayesian inference]] to formulate the solution of the inverse problem. This approach is described in detail in Tarantola's book.<ref>{{cite book |last1=Tarantola |first1=Albert |title=Inverse Problem Theory |url=https://archive.org/details/inverseproblemth0000tara |url-access=registration |date=1987 |publisher=Elsevier |isbn=9780444599674 |edition=1st}}</ref>

==== Numerical solution of our elementary example ====

Here we make use of the Euclidean norm to quantify the data misfits. As we deal with a linear inverse problem, the objective function is quadratic. For its minimization, it is classical to compute its gradient and set it to zero, using the same rationale as we would to minimize a function of only one variable. At the optimal model <math>p_\text{opt}</math>, this gradient vanishes, which can be written as
<math display="block">\nabla_p \varphi = 2 (F^\mathrm{T} F p_\text{opt} - F^\mathrm{T} d_\text{obs}) = 0</math>
where <math>F^\mathrm{T}</math> denotes the [[matrix transpose]] of <math>F</math>. This equation simplifies to
<math display="block">F^\mathrm{T} F p_\text{opt} = F^\mathrm{T} d_\text{obs}.</math>
This expression is known as the [[Normal equations|normal equation]] and gives us a possible solution to the inverse problem. In our example the matrix <math>F^\mathrm{T} F</math> turns out to be generally of full rank, so that the equation above makes sense and determines the model parameters uniquely: we do not need to integrate additional information to end up with a unique solution.
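A direct transcription of the normal equation, reusing any <math>F</math> and <math>d_\text{obs}</math> such as those sketched above, might read as follows; the final check confirms that the gradient of <math>\varphi</math> vanishes at <math>p_\text{opt}</math>:

<syntaxhighlight lang="python">
import numpy as np

def solve_normal_equations(F, d_obs):
    """Minimize phi(p) = ||F p - d_obs||^2 by solving F^T F p = F^T d_obs."""
    return np.linalg.solve(F.T @ F, F.T @ d_obs)

# Sanity check on a small random instance: the gradient
# 2 (F^T F p_opt - F^T d_obs) should vanish at the optimum.
rng = np.random.default_rng(1)
F = rng.random((5, 5))
d_obs = rng.random(5)
p_opt = solve_normal_equations(F, d_obs)
grad = 2 * (F.T @ F @ p_opt - F.T @ d_obs)
print(np.allclose(grad, 0.0))  # True when F^T F is well-conditioned
</syntaxhighlight>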