====Confidence interval for comparing two log-normals==== Comparing two log-normal distributions is often of interest, for example when comparing a treatment group with a control group (e.g., in an [[A/B testing|A/B test]]). Suppose we have samples from two independent log-normal distributions with parameters <math>(\mu_1, \sigma_1^2)</math> and <math>(\mu_2, \sigma_2^2)</math>, with sample sizes <math>n_1</math> and <math>n_2</math> respectively. The medians of the two distributions can easily be compared by taking the log of each sample, constructing a straightforward confidence interval for the difference of the log-scale means, and transforming it back to the exponential scale: <math display="block">\mathrm{CI}(e^{\mu_1-\mu_2}): \exp\left(\hat \mu_1 - \hat \mu_2 \pm z_{1-\frac{\alpha}{2}} \sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2} } \right)</math> Such confidence intervals are commonly used in epidemiology to calculate the CI for the [[relative risk]] and the [[odds ratio]].<ref>[https://sphweb.bumc.bu.edu/otlt/MPH-Modules/PH717-QuantCore/PH717-Module8-CategoricalData/PH717-Module8-CategoricalData5.html?fbclid=IwY2xjawFeH3JleHRuA2FlbQIxMAABHbmxa15uyyzJuzEwh9PIUr_m2Jsc9NGiPuS6IwfA36Ca5r1wV1EoPEz3MQ_aem_03PRd_jlRfbsnr6xCPkZmw Confidence Intervals for Risk Ratios and Odds Ratios]</ref> The approach taken there is that we have two approximately normal distributions (e.g., p<sub>1</sub> and p<sub>2</sub>, for the RR), and we wish to calculate their ratio.{{efn|The issue is that the ratio cannot be handled directly, so we take the logs and use the [[delta method]] to argue that the log of each estimator is itself (approximately) normal. This trick allows us to treat their exponential as log-normal, and use that approximation to build the CI. Notice that in the RR case, the median and the mean of the base distribution (i.e., before taking the log) are actually identical (since the estimators are originally normal, not log-normal).
For example, <math>\hat p_1 \dot \sim N(p_1, p_1(1-p_1)/n)</math> and <math>\ln \hat{p}_1 \dot \sim N(\ln p_1, (1-p_1)/(p_1 n))</math>. Hence, building a CI on the log scale and then back-transforming gives <math>\mathrm{CI}(p_1): \exp\left(\ln \hat{p}_1 \pm z_{1-\frac{\alpha}{2}} \sqrt{(1 - \hat{p}_1)/(\hat{p}_1 n)}\right)</math>. So while we would expect the CI to be for the median, in this case it is actually also for the mean of the original distribution: if the original <math>\hat p_1</math> were log-normal, we would expect <math>\operatorname{E}[\hat p_1] = e^{\ln p_1 + \tfrac{1}{2} (1 - p_1)/(p_1 n)}</math>, but in practice we know that <math>\operatorname{E}[\hat p_1] = e^{\ln p_1} = p_1</math>. Hence, the approximation enters in the second step (the delta method), but the CI is actually for the expectation (not just the median). This is because we start from a base distribution that is normal and then, after taking the log, apply another normal approximation, so a large part of the approximation in the CI comes from the delta method.}} However, the ratio of the expectations (means) of the two samples might also be of interest, and it requires more work to develop.
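The median-comparison interval above can be sketched numerically. The following is a minimal illustration (not from the article) using only the Python standard library; the sample parameters and sizes are arbitrary choices for demonstration:

```python
import math
import random
from statistics import NormalDist, fmean, variance

def median_ratio_ci(x1, x2, alpha=0.05):
    """CI for the ratio of medians exp(mu1 - mu2) of two independent
    log-normal samples: work on the log scale, then exponentiate."""
    l1 = [math.log(v) for v in x1]
    l2 = [math.log(v) for v in x2]
    n1, n2 = len(l1), len(l2)
    m1, m2 = fmean(l1), fmean(l2)        # log-scale sample means
    s1, s2 = variance(l1), variance(l2)  # log-scale sample variances S^2
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half = z * math.sqrt(s1 / n1 + s2 / n2)
    diff = m1 - m2
    return math.exp(diff - half), math.exp(diff), math.exp(diff + half)

# Hypothetical samples: true medians e^0.3 and e^0.1, so the true
# ratio of medians is e^0.2 ~ 1.22.
random.seed(42)
x1 = [random.lognormvariate(0.3, 0.8) for _ in range(500)]
x2 = [random.lognormvariate(0.1, 0.8) for _ in range(500)]
lo, est, hi = median_ratio_ci(x1, x2)
print(f"ratio of medians ~ {est:.3f}, 95% CI ({lo:.3f}, {hi:.3f})")
```

The entire calculation happens on the log scale, where the normal-theory interval is exact up to the usual large-sample approximation; exponentiating at the end preserves the coverage because the transformation is monotonic.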
The ratio of their means is: <math display="block">\frac{\operatorname{E}(X_1)}{\operatorname{E}(X_2)} = \frac{e^{\mu_1 + \sigma_1^2 / 2}}{e^{\mu_2 + \sigma_2^2 /2}} = e^{(\mu_1 - \mu_2) + \frac{1}{2} \left(\sigma_1^2 - \sigma_2^2\right)}</math> Plugging the estimators into each of these parameters yields an estimator that is also log-normally distributed, which means that the Cox method, discussed above, can similarly be used for this case: <math display="block">\mathrm{CI}\left( \frac{\operatorname{E}(X_1)}{\operatorname{E}(X_2)} = \frac{e^{\mu_1 + \sigma_1^2 / 2}}{e^{\mu_2 + \sigma_2^2 / 2}} \right): \exp\left(\left(\hat \mu_1 - \hat \mu_2 + \tfrac{1}{2}S_1^2 - \tfrac{1}{2}S_2^2\right) \pm z_{1-\frac{\alpha}{2}} \sqrt{ \frac{S_1^2}{n_1} + \frac{S_2^2}{n_2} + \frac{S_1^4}{2(n_1-1)} + \frac{S_2^4}{2(n_2-1)} } \right)</math> {{hidden begin|style=width:100%|ta1=center|border=1px #aaa solid|title=[Proof]}} To construct a confidence interval for this ratio, we first note that <math>\hat \mu_1 - \hat \mu_2</math> follows a normal distribution, and that both <math>S_1^2</math> and <math>S_2^2</math> have (scaled) [[chi-squared distribution]]s, which are [[Chi-squared distribution#Related distributions|approximately]] normally distributed (via the [[Central limit theorem|CLT]], with the relevant [[Variance#Distribution of the sample variance|parameters]]).
This means that <math display="block">\left(\hat \mu_1 - \hat \mu_2 + \frac{1}{2}S_1^2 - \frac{1}{2}S_2^2\right) \sim N\left((\mu_1 - \mu_2) + \frac{1}{2}(\sigma_1^2 - \sigma_2^2), \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2} + \frac{\sigma_1^4}{2(n_1-1)} + \frac{\sigma_2^4}{2(n_2-1)} \right)</math> Based on the above, standard [[Normal distribution#Confidence intervals|confidence intervals]] can be constructed (using a [[pivotal quantity]]) as: <math>\left(\hat \mu_1 - \hat \mu_2 + \frac{1}{2}S_1^2 - \frac{1}{2}S_2^2\right) \pm z_{1-\frac{\alpha}{2}} \sqrt{ \frac{S_1^2}{n_1} + \frac{S_2^2}{n_2} + \frac{S_1^4}{2(n_1-1)} + \frac{S_2^4}{2(n_2-1)} } </math> And since confidence intervals are preserved under monotonic transformations, we get: <math>\mathrm{CI}\left( \frac{\operatorname{E}(X_1)}{\operatorname{E}(X_2)} = \frac{e^{\mu_1 + \frac{\sigma_1^2}{2}}}{e^{\mu_2 + \frac{\sigma_2^2}{2}}} \right): e^{\left(\left(\hat \mu_1 - \hat \mu_2 + \frac{1}{2}S_1^2 - \frac{1}{2}S_2^2\right) \pm z_{1-\frac{\alpha}{2}} \sqrt{ \frac{S_1^2}{n_1} + \frac{S_2^2}{n_2} + \frac{S_1^4}{2(n_1-1)} + \frac{S_2^4}{2(n_2-1)} } \right)}</math> as desired. {{hidden end}} It is worth noting that naively using the [[Maximum likelihood estimation|MLE]] in the ratio of the two expectations to create a [[ratio estimator]] leads to a [[Consistency (statistics)|consistent]], yet biased, point estimate (using the fact that the estimator of the ratio is log-normally distributed):{{efn|The formula can be found by treating the estimated means and variances as approximately normal, which implies that the ratio estimator is itself approximately log-normal, making its expectation easy to compute.
The bias can be partially corrected by dividing out the estimated inflation factor: <math display="block">\begin{align} \widehat{\left[ \frac{\operatorname{E}(X_1)}{\operatorname{E}(X_2)} \right]} &= \left[ \frac{\widehat{\operatorname{E}}(X_1)}{\widehat{\operatorname{E}}(X_2)} \right] \exp\left(-\frac{1}{2} \widehat{\left( \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2} + \frac{\sigma_1^4}{2(n_1-1)} + \frac{\sigma_2^4}{2(n_2-1)} \right)}\right) \\ &\approx \left[e^{(\hat \mu_1 - \hat \mu_2) + \frac{1}{2}\left(S_1^2 - S_2^2\right)}\right] \exp\left(-\frac{1}{2}\left(\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2} + \frac{S_1^4}{2(n_1-1)} + \frac{S_2^4}{2(n_2-1)}\right)\right) \end{align} </math>}}{{citation needed|date=December 2024}} <math display="block">\begin{align} \operatorname{E}\left[ \frac{\widehat{\operatorname{E}}(X_1)}{\widehat{\operatorname{E}}(X_2)} \right] &= \operatorname{E}\left[\exp\left(\left(\hat \mu_1 - \hat \mu_2\right) + \tfrac{1}{2} \left(S_1^2 - S_2^2\right)\right)\right] \\ &\approx \exp\left[{(\mu_1 - \mu_2) + \frac{1}{2}(\sigma_1^2 - \sigma_2^2) + \frac{1}{2}\left( \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2} + \frac{\sigma_1^4}{2(n_1-1)} + \frac{\sigma_2^4}{2(n_2-1)} \right) }\right] \end{align} </math>
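The two-sample Cox-method interval for the ratio of means can be sketched in code as well. This is an illustrative implementation of the interval derived above, again with arbitrary demonstration parameters (here both samples share the same log-scale variance, so the true ratio of means equals the true ratio of medians):

```python
import math
import random
from statistics import NormalDist, fmean, variance

def mean_ratio_ci_cox(x1, x2, alpha=0.05):
    """Two-sample Cox-method CI for E(X1)/E(X2) of two independent
    log-normal samples."""
    l1 = [math.log(v) for v in x1]
    l2 = [math.log(v) for v in x2]
    n1, n2 = len(l1), len(l2)
    m1, m2 = fmean(l1), fmean(l2)        # log-scale sample means
    s1, s2 = variance(l1), variance(l2)  # log-scale sample variances S^2
    # Log of the plug-in estimator of the ratio of means.
    point = (m1 - m2) + 0.5 * (s1 - s2)
    # Standard error, including the variance contribution of S^2 terms.
    se = math.sqrt(s1 / n1 + s2 / n2
                   + s1**2 / (2 * (n1 - 1)) + s2**2 / (2 * (n2 - 1)))
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return math.exp(point - z * se), math.exp(point), math.exp(point + z * se)

# Hypothetical samples with equal sigma, so the true ratio of means
# is e^{0.3 - 0.1} ~ 1.22.
random.seed(7)
x1 = [random.lognormvariate(0.3, 0.8) for _ in range(400)]
x2 = [random.lognormvariate(0.1, 0.8) for _ in range(400)]
lo2, est2, hi2 = mean_ratio_ci_cox(x1, x2)
print(f"ratio of means ~ {est2:.3f}, 95% CI ({lo2:.3f}, {hi2:.3f})")
```

Compared with the median-ratio interval, the extra <math>S^4</math> terms in the standard error account for the sampling variability of the variance estimates, so this interval is always at least as wide.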