Imputation (statistics)
==Single imputation==

===Hot-deck===
A once-common method of imputation was hot-deck imputation, in which a missing value was imputed from a randomly selected similar record. The term "hot deck" dates back to the storage of data on [[punched card]]s, and indicates that the information donors come from the same dataset as the recipients. The stack of cards was "hot" because it was currently being processed.

One form of hot-deck imputation is called "last observation carried forward" (LOCF), which involves sorting a dataset according to any of a number of variables, thus creating an ordered dataset. The technique then finds the first missing value and uses the cell value immediately prior to the missing datum to impute it. The process is repeated for the next cell with a missing value, until all missing values have been imputed. In the common scenario in which the cases are repeated measurements of a variable for a person or other entity, this represents the belief that if a measurement is missing, the best guess is that it has not changed since it was last measured. This method is known to increase the risk of bias and of potentially false conclusions, so LOCF is not recommended.<ref>{{Cite journal|last1=Molnar|first1=Frank J.|last2=Hutton|first2=Brian|last3=Fergusson|first3=Dean|date=2008-10-07|title=Does analysis using "last observation carried forward" introduce bias in dementia research?|journal=Canadian Medical Association Journal|volume=179|issue=8|pages=751–753|doi=10.1503/cmaj.080820|issn=0820-3946|pmc=2553855|pmid=18838445}}</ref>

===Cold-deck===
Cold-deck imputation, by contrast, selects donors from another dataset: a missing value is replaced with the response value of a similar item from a past survey. Due to advances in computing power, more sophisticated methods of imputation have generally superseded the original random and sorted hot-deck imputation techniques.
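As a concrete illustration, the LOCF procedure described under hot-deck imputation can be sketched in Python (a minimal sketch on toy data; the function name and data are ours, not from any particular package):

```python
def locf_impute(values):
    """Last observation carried forward: replace each missing value
    (represented as None) with the most recent observed value.
    Leading missing values are left as-is, since there is no prior
    observation to carry forward."""
    result = []
    last_seen = None
    for v in values:
        if v is None:
            result.append(last_seen)  # carry the last observation forward
        else:
            last_seen = v
            result.append(v)
    return result

# Repeated measurements for one subject, already sorted in time order:
print(locf_impute([2.1, None, None, 2.4, None]))  # [2.1, 2.1, 2.1, 2.4, 2.4]
```

Note that the two gaps are filled with 2.1 and 2.4 respectively, encoding the assumption that the quantity did not change between measurements, which is exactly the assumption that makes LOCF risky.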
The method is applicable to surveys that are conducted repeatedly over time.

===Mean substitution===
Another imputation technique involves replacing any missing value with the mean of that variable for all other cases, which has the benefit of not changing the sample mean for that variable. However, mean imputation attenuates any correlations involving the imputed variable(s), because, among the imputed cases, there is guaranteed to be no relationship between the imputed variable and any other measured variables. Thus, mean imputation has some attractive properties for univariate analysis but becomes problematic for multivariate analysis.

Mean imputation can be carried out within classes (e.g. categories such as gender), and can be expressed as <math>\hat{y}_i = \bar{y}_h</math>, where <math>\hat{y}_i</math> is the imputed value for record <math>i</math> and <math>\bar{y}_h</math> is the sample mean of respondent data within some class <math>h</math>. This is a special case of generalized regression imputation:

<math display="block"> \hat{y}_{mi} = b_{r0} + \sum_j b_{rj} z_{mij} + \hat{e}_{mi} </math>

Here the values <math>b_{r0}, b_{rj}</math> are estimated by regressing <math>y</math> on <math>x</math> in the non-imputed data, <math>z</math> is a [[dummy variable (statistics)|dummy variable]] for class membership, and the data are split into respondent (<math>r</math>) and missing (<math>m</math>) parts.<ref>{{cite journal | last1 = Kalton | first1 = Graham | title = The treatment of missing survey data | journal = Survey Methodology | volume = 12 | year = 1986 | pages = 1–16}}</ref><ref>{{cite journal | last1 = Kalton | first1 = Graham | first2 = Daniel | last2 = Kasprzyk | title = Imputing for missing survey responses | journal = Proceedings of the Section on Survey Research Methods | publisher = [[American Statistical Association]] | volume = 22 | year = 1982 | s2cid = 195855359 | url = https://pdfs.semanticscholar.org/58f9/8fcc52333348a63b9e6dd5fabbdcc6fefe0e.pdf | 
archive-url = https://web.archive.org/web/20200212025249/https://pdfs.semanticscholar.org/58f9/8fcc52333348a63b9e6dd5fabbdcc6fefe0e.pdf | url-status = dead | archive-date = 2020-02-12 }}</ref>

===Non-negative matrix factorization===
{{main|Non-negative matrix factorization}}
[[Non-negative matrix factorization]] (NMF) can take missing data into account while minimizing its cost function, rather than treating the missing entries as zeros, which could introduce biases.<ref name="ren20">{{Cite journal|arxiv=2001.00563|last1=Ren|first1=Bin|title=Using Data Imputation for Signal Separation in High Contrast Imaging|journal=The Astrophysical Journal|volume=892|issue=2|pages=74|last2=Pueyo|first2=Laurent|last3=Chen|first3=Christine|last4=Choquet|first4=Elodie|last5=Debes|first5=John H|last6=Duchene|first6=Gaspard|last7=Menard|first7=Francois|last8=Perrin|first8=Marshall D.|year=2020|doi=10.3847/1538-4357/ab7024|bibcode=2020ApJ...892...74R|s2cid=209531731|doi-access=free}}</ref> This treatment has a mathematical justification: because NMF can simply ignore missing entries in its cost function, the impact of missing data on the factorization can be as small as a second-order effect.

===Regression===
Regression imputation has the opposite problem of mean imputation. A [[regression model]] is estimated to predict observed values of a variable based on other variables, and that model is then used to impute values in cases where the value of that variable is missing. In other words, available information for complete and incomplete cases is used to predict the value of a specific variable, and fitted values from the regression model are then used to impute the missing values. The problem is that the imputed data do not include an [[error term]] in their estimation, so the imputed values fit perfectly along the regression line without any residual [[variance]]. This causes relationships to be overestimated and suggests greater precision in the imputed values than is warranted.
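A minimal sketch of regression imputation in Python with NumPy (toy data and ordinary least squares; this illustrates the idea, not any particular software's implementation):

```python
import numpy as np

# Toy data: y depends roughly linearly on x; two y-values are missing.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, np.nan, 6.2, 8.1, np.nan, 12.3])

observed = ~np.isnan(y)

# Fit y = b0 + b1 * x on the complete cases only.
X = np.column_stack([np.ones(x.size), x])
b, *_ = np.linalg.lstsq(X[observed], y[observed], rcond=None)

# Impute missing y with fitted values. Note: no residual term is
# added, so the imputed points lie exactly on the regression line,
# understating the variance of the imputed variable.
y_imputed = y.copy()
y_imputed[~observed] = X[~observed] @ b
```

Here the two missing values come out near 4.1 and 10.2, exactly on the fitted line, which is precisely the deterministic behaviour the text criticizes.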
The regression model predicts the most likely value of the missing datum but does not convey any uncertainty about that value. [[Stochastic]] regression was a fairly successful attempt to correct for the lack of an error term in regression imputation: random error, based on the average regression variance, is added to each regression imputation. Stochastic regression shows much less bias than the above-mentioned techniques, but it still misses one thing: if data are imputed, then intuitively more noise should be introduced to the problem than simple residual variance.<ref name="enders2010"/>
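Stochastic regression imputation can be sketched in Python with NumPy as follows (a toy example of ours; the noise is drawn from a normal distribution whose variance is the estimated residual variance of the regression):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y depends roughly linearly on x; two y-values are missing.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, np.nan, 6.2, 8.1, np.nan, 12.3])
obs = ~np.isnan(y)

# Fit y = b0 + b1 * x on the complete cases.
X = np.column_stack([np.ones(x.size), x])
b, *_ = np.linalg.lstsq(X[obs], y[obs], rcond=None)

# Residual variance of the fit on the observed cases
# (n - 2 degrees of freedom for intercept and slope).
resid = y[obs] - X[obs] @ b
sigma = np.sqrt(resid @ resid / (obs.sum() - 2))

# Stochastic regression: fitted value plus random noise drawn from
# the estimated residual distribution, so imputed points scatter
# around the regression line instead of lying exactly on it.
y_imp = y.copy()
missing = ~obs
y_imp[missing] = X[missing] @ b + rng.normal(0.0, sigma, missing.sum())
```

Unlike deterministic regression imputation, repeated runs with different seeds give different imputed values, restoring some of the residual variance, though, as noted above, still less uncertainty than a fully honest treatment would introduce.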