=== Variance ===

==== Simple i.i.d. case ====

When the weights are treated as constants, and the sample consists of ''n'' observations from [[Uncorrelatedness (probability theory)|uncorrelated]] [[random variables]], all with the same [[variance]] and [[Expected value|expectation]] (as is the case for [[Independent and identically distributed random variables|i.i.d.]] random variables), the variance of the weighted mean can be estimated by multiplying the unweighted variance estimate by [[Design effect#Unequal selection probabilities|Kish's design effect]] (see [[Design effect#Assumptions and proofs|proof]]):

: <math>\operatorname{Var}(\bar y_w) = \hat \sigma_y^2 \frac{\overline{w^2}}{ \bar{w}^2 } </math>

where <math>\hat \sigma_y^2 = \frac{\sum_{i=1}^n (y_i - \bar y)^2}{n-1} </math>, <math>\bar{w} = \frac{\sum_{i=1}^n w_i}{n} </math>, and <math>\overline{w^2} = \frac{\sum_{i=1}^n w_i^2}{n} </math>.

However, this estimation is rather limited due to the strong assumptions about the ''y'' observations. This has led to the development of alternative, more general estimators.
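As a concrete illustration, the following minimal sketch (in Python; the data values and variable names are hypothetical, not taken from the cited sources) computes this estimator directly:

<syntaxhighlight lang="python">
import numpy as np

# Hypothetical sample: n observations y, with weights w treated as constants
y = np.array([2.0, 3.5, 1.0, 4.2, 2.8])
w = np.array([1.0, 2.0, 1.0, 3.0, 1.5])

sigma2_hat = y.var(ddof=1)     # unweighted variance estimate (n - 1 in the denominator)
w_bar = w.mean()               # mean of the weights
w2_bar = (w ** 2).mean()       # mean of the squared weights

# Variance of the weighted mean = unweighted variance * Kish's design effect
var_weighted_mean = sigma2_hat * w2_bar / w_bar ** 2
print(var_weighted_mean)
</syntaxhighlight>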
==== Survey sampling perspective ====

From a ''model-based'' perspective, we are interested in estimating the variance of the weighted mean when the different <math>y_i</math> are not [[Independent and identically distributed random variables|i.i.d.]] random variables. An alternative perspective for this problem is that of some arbitrary [[Survey sampling|sampling design]] of the data in which units are [[Design effect#Sources for unequal selection probabilities|selected with unequal probabilities]] (with replacement).<ref name = "Cochran1977" />{{rp|306}}

In [[survey methodology]], the population mean of some quantity of interest ''y'' is calculated by taking an estimation of the total of ''y'' over all elements in the population (''Y'' or sometimes ''T'') and dividing it by the population size – either known (<math>N</math>) or estimated (<math>\hat N</math>). In this context, each value of ''y'' is considered constant, and the variability comes from the selection procedure. This is in contrast to "model-based" approaches, in which the randomness is often attributed to the ''y'' values themselves. The [[survey sampling]] procedure yields a series of [[Bernoulli distribution|Bernoulli]] indicator values (<math>I_i</math>) that equal 1 if observation ''i'' is in the sample and 0 if it was not selected. This can occur with a fixed sample size or with a random sample size (as in [[Poisson sampling]]). The probability of some element being chosen, given a sample, is denoted as <math>P(I_i=1 \mid \text{some sample of size } n ) = \pi_i </math>, and the one-draw probability of selection is <math>P(I_i=1 \mid \text{one sample draw}) = p_i \approx \frac{\pi_i}{n}</math> (if ''N'' is very large and each <math>p_i</math> is very small). For the following derivation we will assume that the probability of selecting each element is fully represented by these probabilities.<ref name="sarndal1992">{{cite book |title = Model Assisted Survey Sampling |author=Carl-Erik Sarndal |author2=Bengt Swensson |author3=Jan Wretman |isbn= 978-0-387-97528-3 |year = 1992 |publisher=Springer }}</ref>{{rp|42,43,51}} That is, selecting one element does not influence the probability of drawing another element (this does not apply to designs such as [[cluster sampling]]).

Since each element (<math>y_i</math>) is fixed, and the randomness comes from it being included in the sample or not (<math>I_i</math>), we often work with the product of the two, which is a random variable. To avoid confusion in the following sections, let us call this term <math>y'_i = y_i I_i</math>. Its expectation is <math>E[y'_i] = y_i E[I_i] = y_i \pi_i</math>, and its variance is <math>V[y'_i] = y_i^2 V[I_i] = y_i^2 \pi_i(1-\pi_i)</math>.

When each element of the sample is inflated by the inverse of its selection probability, the result is termed the <math>\pi</math>-expanded ''y'' values, i.e.: <math>\check y_i = \frac{y_i}{\pi_i}</math>. A related quantity is the <math>p</math>-expanded ''y'' values: <math>\frac{y_i}{p_i} = n \check y_i</math>.<ref name="sarndal1992" />{{rp|42,43,51,52}} As above, we can add a tick mark to denote multiplication by the indicator function, i.e.: <math>\check y'_i = I_i \check y_i = \frac{I_i y_i}{\pi_i}</math>.

In this ''design-based'' perspective, the weights used in the numerator of the weighted mean are obtained by taking the inverse of the selection probability (i.e., the inflation factor): <math>w_i = \frac{1}{\pi_i} \approx \frac{1}{n \times p_i}</math>.
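A small simulation can illustrate this setup. The following sketch (hypothetical population and probabilities, not from the cited sources) draws independent indicators <math>I_i</math> and checks empirically that the <math>\pi</math>-expanded sample sum <math>\textstyle\sum_i I_i y_i / \pi_i</math> is unbiased for the population total:

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical fixed population of N values, with unequal inclusion probabilities pi_i
N = 1000
y = rng.gamma(shape=2.0, scale=10.0, size=N)   # the y values are fixed constants
pi = np.clip(y / y.sum() * 50, 0.001, 0.5)     # roughly size-proportional probabilities

estimates = []
for _ in range(2000):
    I = rng.random(N) < pi                     # independent Bernoulli indicators I_i
    estimates.append((y[I] / pi[I]).sum())     # sum of pi-expanded values: sum_i I_i * y_i / pi_i

print(y.sum(), np.mean(estimates))             # the empirical mean should be close to the total
</syntaxhighlight>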
==== Variance of the weighted sum (''pwr''-estimator for totals) ====

If the population size ''N'' is known, we can estimate the population mean using <math>\hat{\bar Y}_{\text{known } N} = \frac{\hat Y_{pwr}}{N} \approx \frac{\sum_{i=1}^n w_i y'_i}{N} </math>.

If the [[sampling design]] is one that results in a fixed sample size ''n'' (such as in [[Probability-proportional-to-size sampling|pps sampling]]), then the variance of this estimator is:

: <math> \operatorname{Var} \left( \hat{\bar Y}_{\text{known }N} \right) = \frac{1}{N^2} \frac{n}{n-1} \sum_{i=1}^n \left( w_i y_i - \overline{wy} \right)^2 </math>

{{math proof|proof=
The general formula can be developed like this:

: <math>\hat{\bar Y}_{\text{known } N} = \frac{\hat Y_{pwr}}{N} = \frac{\frac{1}{n} \sum_{i=1}^n \frac{y'_i}{p_i} }{N} \approx \frac{\sum_{i=1}^n \frac{y'_i}{\pi_i}}{N} = \frac{\sum_{i=1}^n w_i y'_i}{N}. </math>

The population total is denoted as <math>Y = \sum_{i=1}^N y_i</math> and it may be estimated by the (unbiased) [[Horvitz–Thompson estimator]], also called the <math>\pi</math>-estimator. This estimator can itself be estimated using the ''pwr''-estimator (i.e., the <math>p</math>-expanded with replacement estimator, or "probability with replacement" estimator). With the above notation, it is: <math>\hat Y_{pwr} = \frac{1}{n} \sum_{i=1}^n \frac{y'_i}{p_i} = \sum_{i=1}^n \frac{y'_i}{n p_i} \approx \sum_{i=1}^n \frac{y'_i}{\pi_i} = \sum_{i=1}^n w_i y'_i</math>.<ref name = "sarndal1992" />{{rp|51}}

The estimated variance of the ''pwr''-estimator is given by:<ref name = "sarndal1992" />{{rp|52}}
<math display="block">\operatorname{Var}(\hat Y_{pwr}) = \frac{n}{n-1} \sum_{i=1}^n \left( w_i y_i - \overline{wy} \right)^2 </math>
where <math>\overline{wy} = \sum_{i=1}^n \frac{w_i y_i}{n} </math>.

The above formula was taken from Sarndal et al. (1992) (also presented in Cochran 1977), where it is written differently.<ref name = "sarndal1992" />{{rp|52}}<ref name = "Cochran1977" />{{rp|307 (11.35)}} The first line below is the variance as originally written, and the subsequent steps develop the weighted version:
<math display="block">\begin{align} \operatorname{Var}(\hat Y_\text{pwr}) & = \frac{1}{n} \frac{1}{n-1} \sum_{i=1}^n \left( \frac{y_i}{p_i} - \hat Y_{pwr} \right)^2 \\ & = \frac{1}{n} \frac{1}{n-1} \sum_{i=1}^n \left( \frac{n}{n} \frac{y_i}{p_i} - \frac{n}{n} \sum_{i=1}^n w_i y_i \right)^2 = \frac{1}{n} \frac{1}{n-1} \sum_{i=1}^n \left( n \frac{y_i}{\pi_i} - n \frac{\sum_{i=1}^n w_i y_i}{n} \right)^2 \\ & = \frac{n^2}{n} \frac{1}{n-1} \sum_{i=1}^n \left( w_i y_i - \overline{wy} \right)^2 \\ & = \frac{n}{n-1} \sum_{i=1}^n \left( w_i y_i - \overline{wy} \right)^2 \end{align}</math>
which is the formula from above.
}}

An alternative expression, for when the sampling has a random sample size (as in [[Poisson sampling]]), is presented in Sarndal et al. (1992) as:<ref name = "sarndal1992" />{{rp|182}}

<math display="block">\operatorname{Var}(\hat{\bar Y}_{\text{pwr (known }N\text{)}}) = \frac{1}{N^2} \sum_{i=1}^n \sum_{j=1}^n \left( \check{\Delta}_{ij} \check{y}_i \check{y}_j \right) </math>

with <math>\check{y}_i = \frac{y_i}{\pi_i}</math>. Also, <math>C(I_i, I_j) = \pi_{ij} - \pi_{i}\pi_{j} = \Delta_{ij} </math>, where <math>\pi_{ij}</math> is the probability of selecting both ''i'' and ''j''.<ref name = "sarndal1992" />{{rp|36}} Furthermore, <math>\check{\Delta}_{ij} = 1 - \frac{\pi_{i}\pi_{j}}{\pi_{ij}}</math>, and for ''i'' = ''j'': <math>\check{\Delta}_{ii} = 1 - \frac{\pi_{i}\pi_{i}}{\pi_{i}} = 1- \pi_{i}</math>.<ref name = "sarndal1992" />{{rp|43}}

If the selection probabilities are uncorrelated (i.e.: <math>\forall i \neq j: C(I_i, I_j) = 0</math>), and when assuming the probability of each element is very small, then:

: <math>\operatorname{Var}(\hat{\bar Y}_{\text{pwr (known }N\text{)}}) = \frac{1}{N^2} \sum_{i=1}^n \left( w_i y_i \right)^2 </math>

{{math proof|proof=
We assume that <math>(1- \pi_i) \approx 1</math> and that <math>\check{\Delta}_{ij} = 0</math> for all <math>i \neq j</math>. Then:
<math display="block">\begin{align} \operatorname{Var}(\hat Y_{\text{pwr (known } N\text{)}}) & = \frac{1}{N^2} \sum_{i=1}^n \sum_{j=1}^n \left( \check{\Delta}_{ij} \check{y}_i \check{y}_j \right) \\ & = \frac{1}{N^2} \sum_{i=1}^n \left( \check{\Delta}_{ii} \check{y}_i \check{y}_i \right) \\ & = \frac{1}{N^2} \sum_{i=1}^n \left( (1- \pi_i) \frac{y_i}{\pi_i} \frac{y_i}{\pi_i} \right) \\ & = \frac{1}{N^2} \sum_{i=1}^n \left( w_i y_i \right)^2 \end{align}</math>
}}
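A minimal sketch of these estimators, continuing in Python (the sample values, probabilities, and population size are hypothetical):

<syntaxhighlight lang="python">
import numpy as np

# Hypothetical sample of fixed size n, drawn with unequal probabilities; w_i = 1 / pi_i
y = np.array([12.0, 30.5, 8.0, 22.1, 17.3])
pi = np.array([0.02, 0.05, 0.01, 0.04, 0.03])   # inclusion probabilities pi_i
w = 1.0 / pi                                    # inflation factors
N = 500                                         # known population size

n = len(y)
wy = w * y
total_hat = wy.sum()                            # pwr-estimate of the population total Y

# Estimated variance of the pwr-estimator (fixed sample size design)
var_total = n / (n - 1) * ((wy - wy.mean()) ** 2).sum()

# Estimated variance of the mean when N is known
var_mean_known_N = var_total / N ** 2
print(total_hat / N, var_mean_known_N)
</syntaxhighlight>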
==== Variance of the weighted mean ({{pi}}-estimator for ratio-mean) ====

The previous section dealt with estimating the population mean as a ratio of an estimated population total (<math>\hat Y</math>) and a known population size (<math>N</math>), and the variance was estimated in that context. Another common case is that the population size itself (<math>N</math>) is unknown and is estimated using the sample (i.e.: <math>\hat N</math>). The estimation of <math>N</math> can be described as the sum of weights. So when <math>w_i = \frac{1}{\pi_i} </math> we get <math>\hat N = \sum_{i=1}^n w_i I_i = \sum_{i=1}^n \frac{I_i}{\pi_i} = \sum_{i=1}^n \check 1'_i </math>. With the above notation, the parameter of interest is the ratio of the sums of <math>y_i</math>s and 1s, i.e.: <math>R = \bar Y = \frac{\sum_{i=1}^N \frac{y_i}{\pi_i}}{\sum_{i=1}^N \frac{1}{\pi_i}} = \frac{\sum_{i=1}^N \check y_i}{\sum_{i=1}^N \check 1_i} = \frac{\sum_{i=1}^N w_i y_i}{\sum_{i=1}^N w_i} </math>. We can estimate it using our sample with: <math>\hat R = \hat {\bar Y} = \frac{\sum_{i=1}^N I_i \frac{y_i}{\pi_i}}{\sum_{i=1}^N I_i \frac{1}{\pi_i}} = \frac{\sum_{i=1}^N \check y'_i}{\sum_{i=1}^N \check 1'_i} = \frac{\sum_{i=1}^N w_i y'_i}{\sum_{i=1}^N w_i 1'_i} = \frac{\sum_{i=1}^n w_i y'_i}{\sum_{i=1}^n w_i 1'_i} = \bar y_w</math>. As we moved from using ''N'' to using ''n'', we know that all the indicator variables equal 1, so we can simply write: <math>\bar y_w = \frac{\sum_{i=1}^n w_i y_i}{\sum_{i=1}^n w_i }</math>. This is the [[estimand]] for specific values of ''y'' and ''w'', but the statistical properties come from including the indicator variable <math>\bar y_w = \frac{\sum_{i=1}^n w_i y'_i}{\sum_{i=1}^n w_i 1'_i }</math>.<ref name = "sarndal1992" />{{rp|162,163,176}}

This is called a [[ratio estimator]] and it is approximately unbiased for ''R''.<ref name = "sarndal1992" />{{rp|182}}

In this case, the variability of the [[Ratio distribution#Means and variances of random ratios|ratio]] depends on the variability of the random variables both in the numerator and the denominator, as well as on their correlation. Since there is no closed analytical form for this variance, various methods are used for approximate estimation, primarily first-order [[Taylor series]] linearization, asymptotics, and bootstrap/jackknife.<ref name = "sarndal1992" />{{rp|172}} The Taylor linearization method may lead to under-estimation of the variance for small sample sizes in general, but that depends on the complexity of the statistic. For the weighted mean, the approximate variance is supposed to be relatively accurate even for medium sample sizes.<ref name = "sarndal1992" />{{rp|176}} When the sampling has a random sample size (as in [[Poisson sampling]]), it is as follows:<ref name = "sarndal1992" />{{rp|182}}

: <math>\widehat {V (\bar y_w)} = \frac{1}{(\sum_{i=1}^n w_i)^2} \sum_{i=1}^n w_i^2 (y_i - \bar y_w)^2 </math>.

If <math>\pi_i \approx p_i n</math>, then using either <math>w_i = \frac{1}{\pi_i}</math> or <math>w_i = \frac{1}{p_i}</math> would give the same estimator, since multiplying all the <math>w_i</math> by the same factor leaves the estimator unchanged. It also means that if we scale the sum of weights to equal a previously known population size ''N'', the variance calculation would look the same. When all weights are equal to one another, this formula reduces to the standard variance estimator of the mean, with ''n'' rather than ''n''&nbsp;&minus;&nbsp;1 in the denominator (see the note at the end of this section).
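For example, a minimal sketch of this estimator (hypothetical values, not from the cited sources):

<syntaxhighlight lang="python">
import numpy as np

# Hypothetical sample with survey weights w_i = 1 / pi_i
y = np.array([12.0, 30.5, 8.0, 22.1, 17.3])
w = np.array([50.0, 20.0, 100.0, 25.0, 33.3])

y_w = (w * y).sum() / w.sum()   # weighted mean (the ratio estimator)

# Approximate variance of the weighted mean under a random-sample-size design
var_y_w = (w ** 2 * (y - y_w) ** 2).sum() / w.sum() ** 2

# With equal weights this reduces to np.var(y) / n, i.e. the maximum likelihood
# (divide-by-n) version of the variance of the mean.
print(y_w, var_y_w)
</syntaxhighlight>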
{{math proof|proof=
The Taylor linearization states that a general ratio estimator of two sums (<math>\hat R = \frac{\hat{Y}}{\hat{Z}}</math>) can be expanded around the true value ''R'', giving:<ref name = "sarndal1992" />{{rp|178}}
<math display="block">\hat R = \frac{\hat{Y}}{\hat{Z}} = \frac{\sum_{i=1}^n w_i y'_i}{\sum_{i=1}^n w_i z'_i} \approx R + \frac{1}{Z} \sum_{i=1}^n \left( \frac{y'_i}{\pi_i} - R \frac{z'_i}{\pi_i} \right) </math>
And the variance can be approximated by:<ref name = "sarndal1992" />{{rp|178,179}}
<math display="block">\widehat {V (\hat R)} = \frac{1}{\hat{Z}^2} \sum_{i=1}^n \sum_{j=1}^n \left( \check{\Delta}_{ij} \frac{y_i - \hat R z_i}{\pi_i}\frac{y_j - \hat R z_j}{\pi_j} \right) = \frac{1}{\hat{Z}^2} \left[ \widehat {V (\hat Y)} + \hat R^2 \widehat {V (\hat Z)} - 2 \hat R \hat C (\hat Y, \hat Z) \right]. </math>
The term <math> \hat C (\hat Y, \hat Z) </math> is the estimated covariance between the estimated sum of ''Y'' and the estimated sum of ''Z''. Since this is the [[Covariance#Covariance of linear combinations|covariance of two sums of random variables]], it includes many combinations of covariances that depend on the indicator variables. If the selection probabilities are uncorrelated (i.e.: <math>\forall i \neq j: \Delta_{ij} = C(I_i, I_j) = 0</math>), this term still includes a summation of ''n'' covariances for each element ''i'' between <math> y'_i = I_i y_i </math> and <math> z'_i = I_i z_i </math>. This helps illustrate that this formula incorporates the effect of correlation between ''y'' and ''z'' on the variance of the ratio estimator.

When defining <math> z_i = 1 </math> the above becomes:<ref name = "sarndal1992" />{{rp|182}}
<math display="block">\widehat{V (\hat R)} = \widehat {V (\bar y_w) } = \frac{1}{\hat{N}^2} \sum_{i=1}^n \sum_{j=1}^n \left( \check{\Delta}_{ij} \frac{y_i - \bar y_w}{\pi_i}\frac{y_j - \bar y_w}{\pi_j} \right) . </math>
If the selection probabilities are uncorrelated (i.e.: <math>\forall i \neq j: \Delta_{ij} = C(I_i, I_j) = 0</math>), and when assuming the probability of each element is very small (i.e.: <math>(1- \pi_i) \approx 1</math>), then the above reduces to:
<math display="block">\widehat{ V (\bar y_w) } = \frac{1}{\hat{N}^2} \sum_{i=1}^n \left( (1- \pi_i) \frac{y_i - \bar y_w}{\pi_i} \right)^2 = \frac{1}{(\sum_{i=1}^n w_i)^2} \sum_{i=1}^n w_i^2 (y_i - \bar y_w)^2. </math>
A similar reconstruction of the proof (up to some mistakes at the end) was provided by Thomas Lumley on Cross Validated.<ref>Thomas Lumley (https://stats.stackexchange.com/users/249135/thomas-lumley), How to estimate the (approximate) variance of the weighted mean?, URL (version: 2021-06-08): https://stats.stackexchange.com/q/525770</ref>
}}

We have (at least) two versions of the variance for the weighted mean: one with a known and one with an estimated population size. There is no uniformly better approach, but the literature presents several arguments to prefer the estimated-population-size version (even when the population size is known).<ref name = "sarndal1992" />{{rp|188}} For example: if all the ''y'' values are constant, the estimator with the estimated population size will give the correct result, while the one with the known population size will have some variability. Also, when the sample size itself is random (e.g.: in [[Poisson sampling]]), the version with the estimated population size is considered more stable. Lastly, if the selection probabilities are negatively correlated with the values (i.e., a smaller chance to sample a large observation), the estimated-population-size version slightly compensates for that.

For the trivial case in which all the weights are equal to 1, the above formula is just like the regular formula for the variance of the mean (but notice that it uses the maximum likelihood estimator for the variance, i.e., dividing by ''n'' instead of ''n''&nbsp;&minus;&nbsp;1).
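The difference between the two versions can be made concrete with a short sketch (the numbers and the known ''N'' are hypothetical):

<syntaxhighlight lang="python">
import numpy as np

# The same hypothetical sample as above, now with a known population size N
y = np.array([12.0, 30.5, 8.0, 22.1, 17.3])
w = np.array([50.0, 20.0, 100.0, 25.0, 33.3])
N = 250
n = len(y)

wy = w * y
y_w = wy.sum() / w.sum()

# Version 1: known population size (variance of the pwr-total divided by N^2)
var_known_N = (n / (n - 1)) * ((wy - wy.mean()) ** 2).sum() / N ** 2

# Version 2: estimated population size N_hat = sum(w) (the ratio version)
var_ratio = (w ** 2 * (y - y_w) ** 2).sum() / w.sum() ** 2

# If all y values are constant, var_ratio is exactly 0,
# while var_known_N is generally positive.
print(var_known_N, var_ratio)
</syntaxhighlight>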
==== Bootstrapping validation ====

Gatz et al. (1995) have shown that, in comparison to [[bootstrapping (statistics)|bootstrapping]] methods, the following (variance estimation of the ratio-mean using [[Taylor series]] linearization) is a reasonable estimation for the square of the standard error of the mean (when used in the context of measuring chemical constituents):<ref>{{cite journal |last1=Gatz |first1=Donald F. |last2=Smith |first2=Luther |title=The standard error of a weighted mean concentration—I. Bootstrapping vs other methods |journal=Atmospheric Environment |date=June 1995 |volume=29 |issue=11 |pages=1185–1193 |doi=10.1016/1352-2310(94)00210-C |bibcode=1995AtmEn..29.1185G }} – [https://www.cs.tufts.edu/~nr/cs257/archive/donald-gatz/weighted-standard-error.pdf pdf link]</ref>{{rp|1186}}

:<math> \widehat{\sigma_{\bar{x}_w}^2} = \frac{n}{(n-1)(n \bar{w} )^2} \left[\sum (w_i x_i - \bar{w} \bar{x}_w)^2 - 2 \bar{x}_w \sum (w_i - \bar{w})(w_i x_i - \bar{w} \bar{x}_w) + \bar{x}_w^2 \sum (w_i - \bar{w})^2 \right] </math>

where <math>\bar{w} = \frac{\sum w_i}{n}</math>. Further simplification leads to

:<math>\widehat{\sigma_{\bar{x}_w}^2} = \frac{n}{(n-1)(n \bar{w} )^2} \sum w_i^2(x_i - \bar{x}_w)^2</math>

Gatz et al. mention that the above formulation was published by Endlich et al. (1988), who treated the weighted mean as a combination of a weighted total estimator divided by an estimator of the population size,<ref>{{Cite journal| doi = 10.1175/1520-0450(1988)027<1322:SAOPCM>2.0.CO;2| volume = 27| issue = 12| pages = 1322–1333| last1 = Endlich| first1 = R. M.| last2 = Eymon| first2 = B. P.| last3 = Ferek| first3 = R. J.| last4 = Valdes| first4 = A. D.| last5 = Maxwell| first5 = C.| title = Statistical Analysis of Precipitation Chemistry Measurements over the Eastern United States. Part I: Seasonal and Regional Patterns and Correlations| journal = Journal of Applied Meteorology and Climatology| date = 1988-12-01 | doi-access = free| bibcode = 1988JApMe..27.1322E}}</ref> based on the formulation published by Cochran (1977) as an approximation to the ratio mean. However, Endlich et al. do not appear to have published this derivation in their paper (even though they mention using it), and Cochran's book includes a slightly different formulation.<ref name = "Cochran1977">Cochran, W. G. (1977). ''Sampling Techniques'' (3rd ed.). Nashville, TN: John Wiley & Sons. {{ISBN|978-0-471-16240-7}}</ref>{{rp|155}} Still, it is almost identical to the formulations described in the previous sections.

==== Replication-based estimators ====

Because there is no closed analytical form for the variance of the weighted mean, it has been proposed in the literature to rely on replication methods such as the [[Jackknife resampling|jackknife]] and [[Bootstrapping (statistics)|bootstrapping]].<ref name = "Cochran1977" />{{rp|321}}
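As an illustration, a naive pairs bootstrap (resampling <math>(y_i, w_i)</math> pairs with replacement) can be sketched as follows. This is only one of several possible replication schemes, design-consistent replication methods for complex surveys are more involved, and the data here are hypothetical:

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical paired sample of values and weights
y = np.array([12.0, 30.5, 8.0, 22.1, 17.3, 25.6, 9.4, 14.8])
w = np.array([50.0, 20.0, 100.0, 25.0, 33.3, 40.0, 80.0, 60.0])
n = len(y)

def weighted_mean(y, w):
    return (w * y).sum() / w.sum()

# Pairs bootstrap: resample (y_i, w_i) pairs and recompute the weighted mean
replicates = []
for _ in range(5000):
    idx = rng.integers(0, n, size=n)
    replicates.append(weighted_mean(y[idx], w[idx]))

var_boot = np.var(replicates, ddof=1)   # bootstrap variance estimate of the weighted mean
print(var_boot)
</syntaxhighlight>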
==== Other notes ====

For uncorrelated observations with variances <math>\sigma^2_i</math>, the variance of the weighted sample mean is{{Citation needed|date=October 2018}}

: <math> \sigma^2_{\bar x} = \sum_{i=1}^n {w_i'^2 \sigma^2_i}</math>

whose square root <math>\sigma_{\bar x}</math> can be called the ''standard error of the weighted mean (general case)''.{{Citation needed|date=October 2018}}{{anchor|Standard error}}

Consequently, if all the observations have equal variance, <math>\sigma^2_i = \sigma^2_0</math>, the weighted sample mean will have variance

: <math> \sigma^2_{\bar x} = \sigma^2_0 \sum_{i=1}^n {w_i'^2},</math>

where <math display="inline">1/n \le \sum_{i=1}^n {w_i'^2} \le 1</math>. The variance attains its maximum value, <math>\sigma_0^2</math>, when all weights except one are zero. Its minimum value is found when all weights are equal (i.e., the unweighted mean), in which case <math display="inline"> \sigma_{\bar x} = \sigma_0 / \sqrt {n} </math>, i.e., the variance degenerates into the square of the [[standard error of the mean]].

Because one can always transform non-normalized weights into normalized weights, all formulas in this section can be adapted to non-normalized weights by replacing all <math>w_i'</math> with <math>\frac{w_i}{\sum_{i=1}^n{w_i}}</math>.
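A final sketch illustrating this section (the per-observation variances are hypothetical; the weights are normalized inside the code):

<syntaxhighlight lang="python">
import numpy as np

# Hypothetical uncorrelated observations with individual variances sigma_i^2
sigma2 = np.array([1.0, 4.0, 2.25, 9.0])
w = np.array([2.0, 1.0, 3.0, 0.5])   # non-normalized weights

w_norm = w / w.sum()                 # normalized weights w_i'
var_weighted_mean = (w_norm ** 2 * sigma2).sum()
se_weighted_mean = np.sqrt(var_weighted_mean)

# With equal variances sigma_0^2 the variance is sigma_0^2 * sum(w_norm**2),
# where sum(w_norm**2) lies between 1/n (all weights equal) and 1 (one weight dominates)
print(se_weighted_mean, (w_norm ** 2).sum())
</syntaxhighlight>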