==== Variance of the weighted mean ({{pi}}-estimator for ratio-mean) ====

The previous section dealt with estimating the population mean as a ratio of an estimated population total (<math>\hat Y</math>) with a known population size (<math>N</math>), and the variance was estimated in that context. Another common case is that the population size itself (<math>N</math>) is unknown and is estimated using the sample (i.e.: <math>\hat N</math>). The estimation of <math>N</math> can be described as the sum of weights: when <math>w_i = \frac{1}{\pi_i} </math> we get <math>\hat N = \sum_{i=1}^n w_i I_i = \sum_{i=1}^n \frac{I_i}{\pi_i} = \sum_{i=1}^n \check 1'_i </math>.

With the above notation, the parameter of interest is the ratio of the sums of the <math>y_i</math>s and of the 1s, i.e.: <math>R = \bar Y = \frac{\sum_{i=1}^N \frac{y_i}{\pi_i}}{\sum_{i=1}^N \frac{1}{\pi_i}} = \frac{\sum_{i=1}^N \check y_i}{\sum_{i=1}^N \check 1_i} = \frac{\sum_{i=1}^N w_i y_i}{\sum_{i=1}^N w_i} </math>. It can be estimated from the sample with: <math>\hat R = \hat {\bar Y} = \frac{\sum_{i=1}^N I_i \frac{y_i}{\pi_i}}{\sum_{i=1}^N I_i \frac{1}{\pi_i}} = \frac{\sum_{i=1}^N \check y'_i}{\sum_{i=1}^N \check 1'_i} = \frac{\sum_{i=1}^N w_i y'_i}{\sum_{i=1}^N w_i 1'_i} = \frac{\sum_{i=1}^n w_i y'_i}{\sum_{i=1}^n w_i 1'_i} = \bar y_w</math>. In moving from ''N'' to ''n'', all the indicator variables are known to equal 1, so one can simply write: <math>\bar y_w = \frac{\sum_{i=1}^n w_i y_i}{\sum_{i=1}^n w_i }</math>. This is the [[estimand]] for specific values of ''y'' and ''w'', but the statistical properties emerge when the indicator variables are included: <math>\bar y_w = \frac{\sum_{i=1}^n w_i y'_i}{\sum_{i=1}^n w_i 1'_i }</math>.<ref name = "sarndal1992" />{{rp|162,163,176}}

This is called a [[ratio estimator]], and it is approximately unbiased for ''R''.<ref name = "sarndal1992" />{{rp|182}}

In this case, the variability of the [[Ratio distribution#Means and variances of random ratios|ratio]] depends on the variability of the random variables in both the numerator and the denominator, as well as on their correlation. Since there is no closed analytical form for this variance, various methods are used for approximate estimation, primarily first-order [[Taylor series]] linearization, asymptotics, and the bootstrap/jackknife.<ref name = "sarndal1992" />{{rp|172}} The Taylor linearization method can lead to under-estimation of the variance for small sample sizes, depending on the complexity of the statistic; for the weighted mean, the approximate variance is relatively accurate even for medium sample sizes.<ref name = "sarndal1992" />{{rp|176}} When the sampling has a random sample size (as in [[Poisson sampling]]), the variance estimator is as follows:<ref name = "sarndal1992" />{{rp|182}}

: <math>\widehat {V (\bar y_w)} = \frac{1}{(\sum_{i=1}^n w_i)^2} \sum_{i=1}^n w_i^2 (y_i - \bar y_w)^2 </math>.

If <math>\pi_i \approx p_i n</math>, then using either <math>w_i = \frac{1}{\pi_i}</math> or <math>w_i = \frac{1}{p_i}</math> gives the same estimator, since multiplying all <math>w_i</math> by the same factor leaves the estimator unchanged. It also means that if we scale the sum of weights to equal a previously known population size ''N'', the variance calculation looks the same. When all the weights are equal to one another, this formula reduces to the standard variance estimator of the mean (see also the remark at the end of this section regarding the division by ''n'' rather than ''n'' − 1).
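To make the estimator concrete, the following is a minimal numerical sketch of <math>\bar y_w</math> and the variance estimator above (not from the cited source; the data, weights, and function name are hypothetical):

<syntaxhighlight lang="python">
import numpy as np

def weighted_mean_and_variance(y, w):
    """Weighted mean y_bar_w and its approximate variance estimator,
    V_hat = sum(w_i^2 * (y_i - y_bar_w)^2) / (sum(w_i))^2."""
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    ybar_w = np.sum(w * y) / np.sum(w)
    var_hat = np.sum(w**2 * (y - ybar_w)**2) / np.sum(w)**2
    return ybar_w, var_hat

# Hypothetical sample: values y_i with inverse-probability weights w_i = 1/pi_i
y = [2.0, 4.0, 7.0, 5.0]
w = [10.0, 20.0, 15.0, 5.0]
print(weighted_mean_and_variance(y, w))                   # mean 4.6, variance 0.848
print(weighted_mean_and_variance(y, 3 * np.asarray(w)))   # identical output
</syntaxhighlight>

The second call illustrates the remark above: multiplying all the weights by a constant factor leaves both the mean and the variance estimate unchanged.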
{{math proof|proof=
The Taylor linearization states that a general ratio estimator of two sums (<math>\hat R = \frac{\hat{Y}}{\hat{Z}}</math>) can be expanded around the true value ''R'' to give:<ref name = "sarndal1992" />{{rp|178}}

<math display="block">\hat R = \frac{\hat{Y}}{\hat{Z}} = \frac{\sum_{i=1}^n w_i y'_i}{\sum_{i=1}^n w_i z'_i} \approx R + \frac{1}{Z} \sum_{i=1}^n \left( \frac{y'_i}{\pi_i} - R \frac{z'_i}{\pi_i} \right) </math>

The variance can then be approximated by:<ref name = "sarndal1992" />{{rp|178,179}}

<math display="block">\widehat {V (\hat R)} = \frac{1}{\hat{Z}^2} \sum_{i=1}^n \sum_{j=1}^n \left( \check{\Delta}_{ij} \frac{y_i - \hat R z_i}{\pi_i}\frac{y_j - \hat R z_j}{\pi_j} \right) = \frac{1}{\hat{Z}^2} \left[ \widehat {V (\hat Y)} + \hat R^2 \widehat {V (\hat Z)} - 2 \hat R \hat C (\hat Y, \hat Z) \right] .</math>

The term <math> \hat C (\hat Y, \hat Z) </math> is the estimated covariance between the estimated sum of ''Y'' and the estimated sum of ''Z''. Since this is the [[Covariance#Covariance of linear combinations|covariance of two sums of random variables]], it includes many combinations of covariances that depend on the indicator variables. If the selection probabilities are uncorrelated (i.e.: <math>\forall i \neq j: \Delta_{ij} = C(I_i, I_j) = 0</math>), this term still includes a sum of ''n'' covariances, one for each element ''i'', between <math> y'_i = I_i y_i </math> and <math> z'_i = I_i z_i </math>. This helps illustrate that the formula incorporates the effect of correlation between ''y'' and ''z'' on the variance of the ratio estimator.

When defining <math> z_i = 1 </math> the above becomes:<ref name = "sarndal1992" />{{rp|182}}

<math display="block">\widehat{V (\hat R)} = \widehat {V (\bar y_w) } = \frac{1}{\hat{N}^2} \sum_{i=1}^n \sum_{j=1}^n \left( \check{\Delta}_{ij} \frac{y_i - \bar y_w}{\pi_i}\frac{y_j - \bar y_w}{\pi_j} \right) . </math>

If the selection probabilities are uncorrelated (i.e.: <math>\forall i \neq j: \Delta_{ij} = C(I_i, I_j) = 0</math>), and assuming the selection probability of each element is very small (i.e.: <math>(1- \pi_i) \approx 1</math>), the above reduces to:

<math display="block">\widehat{ V (\bar y_w) } = \frac{1}{\hat{N}^2} \sum_{i=1}^n (1- \pi_i) \left( \frac{y_i - \bar y_w}{\pi_i} \right)^2 \approx \frac{1}{(\sum_{i=1}^n w_i)^2} \sum_{i=1}^n w_i^2 (y_i - \bar y_w)^2 . </math>

A similar reconstruction of the proof (up to some mistakes at the end) was provided by Thomas Lumley on Cross Validated.<ref>Thomas Lumley (https://stats.stackexchange.com/users/249135/thomas-lumley), How to estimate the (approximate) variance of the weighted mean?, URL (version: 2021-06-08): https://stats.stackexchange.com/q/525770</ref>
}}

We have (at least) two versions of the variance of the weighted mean: one for when the population size is known and one for when it is estimated. There is no uniformly better approach, but the literature presents several arguments for preferring the population-estimation version (even when the population size is known).<ref name = "sarndal1992" />{{rp|188}} For example: if all the ''y'' values are constant, the estimator with unknown population size will give the correct result, while the one with known population size will have some variability. Also, when the sample size itself is random (e.g.: in [[Poisson sampling]]), the version with unknown population size is considered more stable.
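As a sketch of the diagonal-only case in the proof (uncorrelated indicators, as under [[Poisson sampling]], where <math>\check{\Delta}_{ii} = 1 - \pi_i</math>), the linearized variance of a general ratio estimator can be computed as follows. The data and function name are hypothetical, not from the cited source:

<syntaxhighlight lang="python">
import numpy as np

def ratio_variance_poisson(y, z, pi):
    """Linearized variance of R_hat = Y_hat / Z_hat under Poisson sampling,
    keeping only the diagonal terms (Delta_check_ii = 1 - pi_i)."""
    y, z, pi = (np.asarray(a, dtype=float) for a in (y, z, pi))
    Y_hat = np.sum(y / pi)            # Horvitz-Thompson total of y
    Z_hat = np.sum(z / pi)            # Horvitz-Thompson total of z
    R_hat = Y_hat / Z_hat
    resid = (y - R_hat * z) / pi      # linearized residuals
    var_hat = np.sum((1 - pi) * resid**2) / Z_hat**2
    return R_hat, var_hat

# Hypothetical sampled units with inclusion probabilities pi_i
pi = np.array([0.1, 0.2, 0.25, 0.5])
y = np.array([2.0, 4.0, 7.0, 5.0])
z = np.ones_like(y)                   # z_i = 1 recovers the weighted mean case
print(ratio_variance_poisson(y, z, pi))
</syntaxhighlight>

With <math>z_i = 1</math> and <math>(1 - \pi_i) \approx 1</math>, this reproduces the formula for <math>\widehat{V}(\bar y_w)</math> given above.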
Lastly, if the sampling probability is negatively correlated with the values (i.e.: there is a smaller chance of sampling an observation that is large), then the unknown-population-size version slightly compensates for that.

For the trivial case in which all the weights are equal to 1, the above formula is just like the regular formula for the variance of the mean (but notice that it uses the maximum likelihood estimator for the variance, i.e.: dividing by ''n'' instead of ''n'' − 1).
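The constant-<math>y</math> argument above can be checked numerically. The following sketch (hypothetical design and values; the known-<math>N</math> variance is taken as <math>\widehat{V}(\hat Y)/N^2</math> with the Horvitz–Thompson variance estimator under Poisson sampling, as in the previous section) contrasts the two versions:

<syntaxhighlight lang="python">
import numpy as np

N = 100                                # known population size (hypothetical)
pi = np.array([0.1, 0.2, 0.25, 0.5])   # inclusion probabilities of the sampled units
w = 1 / pi
y = np.full(4, 3.0)                    # all y values constant

# Ratio (estimated-N) version: exact answer, zero estimated variance
ybar_w = np.sum(w * y) / np.sum(w)
var_ratio = np.sum(w**2 * (y - ybar_w)**2) / np.sum(w)**2

# Known-N version: Y_hat / N varies with the realized sample
mean_known_N = np.sum(w * y) / N
# Horvitz-Thompson variance estimator of the total under Poisson sampling, over N^2
var_known_N = np.sum((1 - pi) * (y / pi)**2) / N**2

print(ybar_w, var_ratio)          # 3.0 0.0
print(mean_known_N, var_known_N)  # 0.63 0.1116 (off target, positive variance)
</syntaxhighlight>

The ratio version recovers the constant exactly with zero estimated variance, while the known-<math>N</math> version depends on the realized sample and carries a positive variance estimate.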