{{Short description|Parameter estimation via sample statistics}}
In [[statistics]], '''point estimation''' involves the use of [[statistical sample|sample]] [[data]] to calculate a single value (known as a '''point estimate''' since it identifies a [[Point (geometry)|point]] in some [[parameter space]]) which is to serve as a "best guess" or "best estimate" of an unknown population [[parameter]] (for example, the [[population mean]]). More formally, it is the application of a point [[estimator]] to the data to obtain a point estimate.

Point estimation can be contrasted with [[interval estimation]]: such interval estimates are typically either [[confidence interval]]s, in the case of [[frequentist inference]], or [[credible intervals]], in the case of [[Bayesian inference]]. More generally, a point estimator can be contrasted with a set estimator, examples of which are [[Confidence region|confidence sets]] and [[Credible interval|credible sets]]. A point estimator can also be contrasted with a distribution estimator, examples of which are [[confidence distribution]]s, [[Randomised decision rule|randomized estimators]], and [[Bayesian statistics|Bayesian posteriors]].

== Properties of point estimates ==

=== Bias ===
"[[Bias of an estimator|Bias]]" is defined as the difference between the expected value of the estimator and the true value of the population parameter being estimated. Equivalently, the closer the [[expected value]] of the estimator is to the true parameter value, the smaller the bias. When the expected value of the estimator equals the true value of the parameter, the estimator is unbiased and is called an ''unbiased estimator''. An unbiased estimator with minimum [[variance]] is called a ''best unbiased estimator''. However, a biased estimator with a small variance may be more useful than an unbiased estimator with a large variance.<ref name=":0">{{Cite book|title=A Modern Introduction to Probability and Statistics|author=F.M. Dekking, C. Kraaikamp, H.P. Lopuhaa, L.E. Meester|publisher=Springer|year=2005|language=English}}</ref> Most importantly, we prefer point estimators that have the smallest [[Mean squared error|mean squared error]].

If we let T = h(X<sub>1</sub>, X<sub>2</sub>, . . . , X<sub>n</sub>) be an estimator based on a random sample X<sub>1</sub>, X<sub>2</sub>, . . . , X<sub>n</sub>, the estimator T is called an unbiased estimator for the parameter θ if E[T] = θ, irrespective of the value of θ.<ref name=":0"/> For example, for a random sample the sample mean x̄ and the sample variance s<sup>2</sup> satisfy E(x̄) = μ and E(s<sup>2</sup>) = σ<sup>2</sup>, so x̄ and s<sup>2</sup> are unbiased estimators for μ and σ<sup>2</sup>. The difference E[T] − θ is called the bias of T; if this difference is nonzero, then T is called biased.

=== Consistency ===
Consistency concerns whether the point estimate stays close to the true parameter value as the sample size increases: the larger the sample size, the more accurate the estimate. If a point estimator is consistent, its expected value converges to the true value of the parameter and its variance shrinks toward zero. In particular, an unbiased estimator T is consistent if the variance of T tends to zero as the sample size grows.
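The bias and consistency described above can be checked with a short simulation. The following sketch (illustrative only, not taken from the cited sources) compares the usual sample variance, which divides by ''n'' − 1 and is unbiased, with the plug-in version that divides by ''n'', whose bias of about −σ<sup>2</sup>/''n'' vanishes as the sample size grows:

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
mu, sigma2 = 5.0, 4.0        # true mean and true variance
n_reps = 200_000             # Monte Carlo replications

for n in (5, 20, 100):
    samples = rng.normal(mu, np.sqrt(sigma2), size=(n_reps, n))
    s2_unbiased = samples.var(axis=1, ddof=1)   # divides by n - 1
    s2_plugin = samples.var(axis=1, ddof=0)     # divides by n
    print(f"n={n:3d}  mean of s^2 = {s2_unbiased.mean():.3f}  "
          f"mean of plug-in estimator = {s2_plugin.mean():.3f}  "
          f"(bias = {s2_plugin.mean() - sigma2:+.3f})")
</syntaxhighlight>

The unbiased estimator averages to roughly σ<sup>2</sup> = 4 for every sample size, while the plug-in estimator underestimates it by about σ<sup>2</sup>/''n'', a bias that disappears as ''n'' increases.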
=== Efficiency ===
Let ''T''<sub>1</sub> and ''T''<sub>2</sub> be two unbiased estimators for the same parameter ''θ''. The estimator ''T''<sub>2</sub> is called ''more efficient'' than ''T''<sub>1</sub> if Var(''T''<sub>2</sub>) < Var(''T''<sub>1</sub>), irrespective of the value of ''θ''.<ref name=":0"/> In other words, among unbiased estimators the most efficient one is the one whose value varies least from sample to sample. The notion of efficiency can be extended to estimators that need not be unbiased: ''T''<sub>2</sub> is more efficient than ''T''<sub>1</sub> (for the same parameter of interest) if the [[Mean squared error|mean squared error]] (MSE) of ''T''<sub>2</sub> is smaller than the MSE of ''T''<sub>1</sub>.<ref name=":0"/>

Generally, the distribution of the population must be taken into account when judging the efficiency of estimators. For example, for a [[normal distribution]] the mean is more efficient than the median, but the same does not hold for asymmetrical, or [[Skewed distribution|skewed]], distributions.
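The comparison of the mean and median for normal data can be illustrated with a small Monte Carlo sketch (illustrative only, not from the cited sources); both estimators are essentially unbiased for the centre of a symmetric distribution, so comparing their variances amounts to comparing their mean squared errors:

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, n, reps = 0.0, 1.0, 51, 100_000

samples = rng.normal(mu, sigma, size=(reps, n))
means = samples.mean(axis=1)
medians = np.median(samples, axis=1)

print("Var(sample mean)   ~", means.var())    # about sigma^2 / n
print("Var(sample median) ~", medians.var())  # about (pi/2) * sigma^2 / n
print("relative efficiency of the median ~", means.var() / medians.var())
</syntaxhighlight>

For normal samples the variance of the median exceeds that of the mean by a factor of roughly π/2 (a relative efficiency of about 0.64), so the mean is the more efficient estimator; for heavy-tailed or skewed distributions the comparison can go the other way.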
=== Sufficiency ===
In statistics, the job of a statistician is to interpret the data that have been collected and to draw statistically valid conclusions about the population under investigation. In many cases, however, the raw data, which are too numerous and too costly to store, are not suitable for this purpose. The statistician therefore condenses the data by computing some statistics and bases the analysis on them, provided that no relevant information is lost in doing so; that is, the statistician chooses statistics that exhaust all the information about the parameter that is contained in the sample. [[Sufficient statistic]]s are defined as follows: let X = (X<sub>1</sub>, X<sub>2</sub>, ... , X<sub>n</sub>) be a random sample. A statistic T(X) is said to be sufficient for θ (or for the family of distributions) if the conditional distribution of X given T is free of θ.<ref name=":1">{{Cite book|title=Estimation and Inferential Statistics|author=Pradip Kumar Sahu, Santi Ranjan Pal, Ajit Kumar Das|year=2015|language=English}}</ref>

== Types of point estimation ==

=== Bayesian point estimation ===
Bayesian inference is typically based on the [[posterior distribution]]. Many [[Bayesian estimation|Bayesian point estimators]] are the posterior distribution's statistics of [[central tendency]], e.g., its mean, median, or mode:

* [[Bayes estimator#Posterior mean|Posterior mean]], which minimizes the (posterior) [[risk function|''risk'']] (expected loss) for a [[Minimum mean square error|squared-error]] [[loss function]]; in Bayesian estimation, the risk is defined in terms of the posterior distribution, as observed by [[Gauss]].<ref name="Dodge">{{cite book|title=Statistical data analysis based on the L1-norm and related methods: Papers from the First International Conference held at Neuchâtel, August 31–September 4, 1987|publisher=[[North-Holland Publishing]]|year=1987|editor-last=Dodge|editor-first=Yadolah|editor-link=Yadolah Dodge}}</ref>
* [[Bayes estimator#Posterior median and other quantiles|Posterior median]], which minimizes the posterior risk for the absolute-value loss function, as observed by [[Laplace]].<ref name="Dodge" /><ref>{{cite book|last1=Jaynes|first1=E. T.|title=Probability Theory: The logic of science|date=2007|publisher=[[Cambridge University Press]]|isbn=978-0-521-59271-0|edition=5. print.|page=172|author-link=Edwin Thompson Jaynes}}</ref>
* [[Maximum a posteriori]] (''MAP''), which finds a maximum of the posterior distribution; for a uniform prior probability, the MAP estimator coincides with the maximum-likelihood estimator. The MAP estimator has good asymptotic properties even for many difficult problems on which the maximum-likelihood estimator has difficulties. For regular problems, where the maximum-likelihood estimator is consistent, the maximum-likelihood estimator ultimately agrees with the MAP estimator.<ref>{{cite book|last=Ferguson|first=Thomas S.|title=A Course in Large Sample Theory|publisher=[[Chapman & Hall]]|year=1996|isbn=0-412-04371-8|author-link=Thomas S. Ferguson}}</ref><ref name="LeCam">{{cite book|last=Le Cam|first=Lucien|title=Asymptotic Methods in Statistical Decision Theory|publisher=[[Springer-Verlag]]|year=1986|isbn=0-387-96307-3|author-link=Lucien Le Cam}}</ref><ref name="FergJASA">{{cite journal|last=Ferguson|first=Thomas S.|author-link=Thomas S. Ferguson|year=1982|title=An inconsistent maximum likelihood estimate|journal=[[Journal of the American Statistical Association]]|volume=77|issue=380|pages=831–834|doi=10.1080/01621459.1982.10477894|jstor=2287314}}</ref>

Bayesian estimators are [[admissible procedure|admissible]], by Wald's theorem.<ref name="LeCam" /><ref name="LehmannCasella">{{cite book|last1=Lehmann|first1=E. L.|title=Theory of Point Estimation|last2=Casella|first2=G.|publisher=Springer|year=1998|isbn=0-387-98502-6|edition=2nd|author-link=Erich Leo Lehmann}}</ref>

The [[Minimum Message Length]] (MML) point estimator is based on Bayesian [[information theory]] and is not so directly related to the [[posterior distribution]].

Special cases of [[Bayes filter|Bayesian filters]] are important:
* [[Kalman filter]]
* [[Wiener filter]]

Several [[iterative method|methods]] of [[computational statistics]] have close connections with Bayesian analysis:
* [[particle filter]]
* [[Markov chain Monte Carlo]] (MCMC)

== Methods of finding point estimates ==
Below are some commonly used methods of estimating unknown parameters that are expected to yield estimators with some of the important properties described above. In general, depending on the situation and the purpose of the study, whichever of these methods of point estimation is most suitable is applied.

=== Method of maximum likelihood (MLE) ===
The [[Maximum likelihood estimation|method of maximum likelihood]], due to R.A. Fisher, is the most important general method of estimation. It seeks the values of the unknown parameters that maximize the likelihood function: given a known model (e.g. the normal distribution), the parameter values that maximize the likelihood of the observed data provide the most suitable match for the data.<ref>{{Cite book|title=Categorical Data Analysis|author=Agresti, A.|year=1990|publisher=John Wiley and Sons|location=New York}}</ref>

Let X = (X<sub>1</sub>, X<sub>2</sub>, ... , X<sub>n</sub>) denote a random sample with joint p.d.f. or p.m.f. f(x, θ) (θ may be a vector). The function f(x, θ), considered as a function of θ, is called the likelihood function and is denoted by L(θ). The principle of maximum likelihood consists of choosing an estimate, within the admissible range of θ, that maximizes the likelihood; this estimate is called the maximum likelihood estimate (MLE) of θ. To obtain the MLE of θ, we solve the likelihood equations d log L(θ)/dθ<sub>i</sub> = 0, i = 1, 2, …, k; if θ is a vector, partial derivatives are taken to form the likelihood equations.<ref name=":1" />
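As a concrete illustration (a sketch under assumed data, not taken from the cited sources), the [[exponential distribution]] with rate λ has log-likelihood log L(λ) = n log λ − λ Σx<sub>i</sub>, and setting its derivative to zero gives the closed-form MLE λ̂ = n/Σx<sub>i</sub> = 1/x̄. The following Python snippet checks this against a direct numerical maximization of the likelihood:

<syntaxhighlight lang="python">
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(2)
true_rate = 2.5
x = rng.exponential(scale=1.0 / true_rate, size=1000)

def neg_log_likelihood(rate):
    # Negative log-likelihood of an exponential sample with the given rate.
    return -(len(x) * np.log(rate) - rate * x.sum())

# Closed-form MLE obtained by solving d log L / d lambda = 0.
mle_closed_form = 1.0 / x.mean()

# Numerical MLE found by minimizing the negative log-likelihood.
result = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 100.0), method="bounded")

print("closed-form MLE:", mle_closed_form)
print("numerical MLE:  ", result.x)   # the two agree closely
</syntaxhighlight>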
=== Method of moments (MOM) ===
The [[Method of moments (statistics)|method of moments]] was introduced by K. Pearson and P. Chebyshev in 1887, and it is one of the oldest methods of estimation. The method rests on the [[law of large numbers]]: equations relating the population moments to the unknown parameters are derived, the population moments are replaced by the corresponding sample moments, and the resulting equations are solved for the parameters.<ref>{{Cite book|title=The Concise Encyclopedia of Statistics|author=Dodge, Y.|year=2008|publisher=Springer}}</ref> However, because of its simplicity this method is not always accurate, and the resulting estimators can easily be biased.

Let (X<sub>1</sub>, X<sub>2</sub>, …, X<sub>n</sub>) be a random sample from a population having p.d.f. (or p.m.f.) f(x, θ), θ = (θ<sub>1</sub>, θ<sub>2</sub>, …, θ<sub>k</sub>). The objective is to estimate the parameters θ<sub>1</sub>, θ<sub>2</sub>, …, θ<sub>k</sub>. Further, let the first k population moments about zero exist as explicit functions of θ, i.e. μ<sub>r</sub> = μ<sub>r</sub>(θ<sub>1</sub>, θ<sub>2</sub>, …, θ<sub>k</sub>), r = 1, 2, …, k. In the method of moments, we equate k sample moments with the corresponding population moments. Generally, the first k moments are taken because the errors due to sampling increase with the order of the moment. Thus, we get the k equations μ<sub>r</sub>(θ<sub>1</sub>, θ<sub>2</sub>, …, θ<sub>k</sub>) = m<sub>r</sub>, r = 1, 2, …, k, where m<sub>r</sub> = (1/n) ΣX<sub>i</sub><sup>r</sup> is the r-th sample moment. Solving these equations gives the method of moments estimators (or estimates).<ref name=":1" />

See also the [[generalized method of moments]].
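For example (an illustrative sketch, not drawn from the cited sources), a [[gamma distribution]] with shape k and scale θ has first two moments about zero μ<sub>1</sub> = kθ and μ<sub>2</sub> = kθ<sup>2</sup> + (kθ)<sup>2</sup>; equating these with the sample moments m<sub>1</sub> and m<sub>2</sub> and solving gives the method of moments estimates k̂ = m<sub>1</sub><sup>2</sup>/(m<sub>2</sub> − m<sub>1</sub><sup>2</sup>) and θ̂ = (m<sub>2</sub> − m<sub>1</sub><sup>2</sup>)/m<sub>1</sub>:

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(3)
true_shape, true_scale = 3.0, 2.0
x = rng.gamma(true_shape, true_scale, size=5000)

# First two sample moments about zero.
m1 = np.mean(x)
m2 = np.mean(x**2)

# Solve k*theta = m1 and k*theta^2 + (k*theta)^2 = m2 for k and theta.
shape_hat = m1**2 / (m2 - m1**2)
scale_hat = (m2 - m1**2) / m1

print("method-of-moments shape estimate:", shape_hat)  # close to 3.0
print("method-of-moments scale estimate:", scale_hat)  # close to 2.0
</syntaxhighlight>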
=== Method of least squares ===
In the method of least squares, the parameters are estimated using a specified form for the expectation and the second moment of the observations. For fitting a curve of the form y = f(x, β<sub>0</sub>, β<sub>1</sub>, …, β<sub>p</sub>) to the data (x<sub>i</sub>, y<sub>i</sub>), i = 1, 2, …, n, the method of least squares may be used. It consists of minimizing the sum of squared deviations between the observed and fitted values. When f(x, β<sub>0</sub>, β<sub>1</sub>, …, β<sub>p</sub>) is a linear function of the parameters and the x-values are known, the least squares estimator is the [[best linear unbiased estimator]] (BLUE). If, in addition, the errors are assumed to be independently and identically normally distributed, the least squares estimator is also the [[minimum-variance unbiased estimator]] (MVUE) within the entire class of unbiased estimators. See also [[minimum mean squared error]] (MMSE).<ref name=":1" />

=== Minimum-variance mean-unbiased estimator (MVUE) ===
The method of the [[minimum-variance unbiased estimator]] minimizes the [[risk function|risk]] (expected loss) of the squared-error [[loss function]].

=== Median-unbiased estimator ===
A [[median-unbiased estimator]] minimizes the risk of the absolute-error loss function.

=== Best linear unbiased estimator (BLUE) ===
The [[best linear unbiased estimator]] is characterized by the Gauss–Markov theorem, which states that the ordinary least squares (OLS) estimator has the lowest sampling variance within the class of linear unbiased estimators, provided the errors in the linear regression model are uncorrelated, have equal variances, and have expectation zero.<ref>{{Cite book|title=Best Linear Unbiased Estimation and Prediction|author=Theil, Henri|year=1971|publisher=John Wiley & Sons|location=New York}}</ref>

== Point estimate vs. confidence interval estimate ==
[[File:Point estimation and confidence interval estimation.png|thumb|Point estimation and confidence interval estimation.]]
There are two major types of estimates: the point estimate and the [[Interval estimation|confidence interval estimate]]. In point estimation we try to choose a unique point in the parameter space which can reasonably be considered the true value of the parameter. In interval estimation, on the other hand, instead of a unique estimate of the parameter we construct a family of sets that contain the true (unknown) parameter value with a specified probability. In many problems of statistical inference we are not interested only in estimating the parameter or testing some hypothesis concerning the parameter; we also want a lower bound, an upper bound, or both for the real-valued parameter. To do this, we construct a confidence interval.

A [[confidence interval]] describes how reliable an estimate is; its upper and lower confidence limits are calculated from the observed data. Suppose a dataset x<sub>1</sub>, . . . , x<sub>n</sub> is given, modeled as a realization of random variables X<sub>1</sub>, . . . , X<sub>n</sub>. Let θ be the parameter of interest, and γ a number between 0 and 1. If there exist sample statistics L<sub>n</sub> = g(X<sub>1</sub>, . . . , X<sub>n</sub>) and U<sub>n</sub> = h(X<sub>1</sub>, . . . , X<sub>n</sub>) such that P(L<sub>n</sub> < θ < U<sub>n</sub>) = γ for every value of θ, then (l<sub>n</sub>, u<sub>n</sub>), where l<sub>n</sub> = g(x<sub>1</sub>, . . . , x<sub>n</sub>) and u<sub>n</sub> = h(x<sub>1</sub>, . . . , x<sub>n</sub>), is called a 100γ% [[confidence interval]] for θ. The number γ is called the [[confidence level]].<ref name=":0" />

In general, with a normally distributed sample mean X̄ and a known value for the standard deviation σ, a 100(1 − α)% confidence interval for the true μ is formed by taking X̄ ± e, with e = z<sub>1−α/2</sub>(σ/n<sup>1/2</sup>), where z<sub>1−α/2</sub> is the 100(1 − α/2)% cumulative value of the standard normal curve and n is the number of data values. For example, z<sub>1−α/2</sub> equals 1.96 for 95% confidence.<ref>{{Cite book|title=Experimental Design – With Applications in Management, Engineering, and the Sciences|author=Paul D. Berger, Robert E. Maurer, Giovana B. Celli|year=2019|publisher=Springer}}</ref>

Here two limits, say l<sub>n</sub> and u<sub>n</sub>, are computed from the set of observations, and it is claimed with a certain degree of confidence (measured in probabilistic terms) that the true value of the parameter lies between them. Thus we get an interval (l<sub>n</sub>, u<sub>n</sub>) which we expect to include the true value of θ, and this type of estimation is called confidence interval estimation.<ref name=":1" /> It provides a range of values within which the parameter is expected to lie, generally gives more information than a point estimate, and is often preferred when making inferences. In a sense, point estimation can be regarded as the opposite of interval estimation.
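The interval formula above can be illustrated with a short numerical sketch (illustrative only, using simulated data rather than anything from the cited sources); it also checks, by repetition, that about 95% of the resulting intervals cover the true mean:

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(4)
mu, sigma, n = 10.0, 3.0, 40   # true mean, known standard deviation, sample size
z = 1.96                       # z_{1 - alpha/2} for 95% confidence

x = rng.normal(mu, sigma, size=n)
x_bar = x.mean()                      # point estimate of mu
half_width = z * sigma / np.sqrt(n)   # e = z * sigma / sqrt(n)

print("point estimate:         ", x_bar)
print("95% confidence interval:", (x_bar - half_width, x_bar + half_width))

# Repeating the experiment many times, roughly 95% of such intervals
# should contain the true mean.
reps, covered = 10_000, 0
for _ in range(reps):
    m = rng.normal(mu, sigma, size=n).mean()
    covered += (m - half_width) < mu < (m + half_width)
print("empirical coverage:", covered / reps)
</syntaxhighlight>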
== See also ==
{{Portal|Mathematics}}
* [[Algorithmic inference]]
* [[Binomial distribution]]
* [[Confidence distribution]]
* [[Induction (philosophy)]]
* [[Interval estimation]]
* [[Philosophy of statistics]]
* [[Predictive inference]]

== References ==
{{Reflist|30em}}

== Further reading ==
* {{cite book |author1=Bickel, Peter J. |author2=Doksum, Kjell A. |name-list-style=amp |title=Mathematical Statistics: Basic and Selected Topics |volume=I |edition=Second (updated printing 2007) |year=2001 |publisher=Pearson Prentice-Hall }}
<!-- * {{cite book|author=Lehmann, Erich|author-link=Erich Leo Lehmann| title=Testing Statistical Hypotheses|url=https://archive.org/details/testingstatistic0000lehm|url-access=registration|year=1959}} -->
* {{cite book |author1=Liese, Friedrich |author2=Miescke, Klaus-J. |name-list-style=amp |title=Statistical Decision Theory: Estimation, Testing, and Selection |year=2008 |publisher=Springer }}

{{Authority control}}

[[Category:Estimation theory]]