Imputation (statistics)
{{Short description|Process of replacing missing data with substituted values}}
{{Other uses of|imputation|Imputation (disambiguation)}}
In [[statistics]], '''imputation''' is the process of replacing [[missing data]] with substituted values. When substituting for a data point, it is known as "'''unit imputation'''"; when substituting for a component of a data point, it is known as "'''item imputation'''". Missing data causes three main problems: it can introduce a substantial amount of [[bias (statistics)|bias]], make the handling and analysis of the data more arduous, and reduce [[Efficiency (statistics)|efficiency]].<ref>{{Cite journal|last1=Barnard|first1=J.|last2=Meng|first2=X. L.|date=1999-03-01|title=Applications of multiple imputation in medical studies: from AIDS to NHANES|journal=Statistical Methods in Medical Research|volume=8|issue=1|pages=17–36|issn=0962-2802|pmid=10347858|doi=10.1177/096228029900800103|s2cid=11453137}}</ref>

Because missing data can create problems for analyzing data, imputation is seen as a way to avoid the pitfalls involved with [[listwise deletion]] of cases that have missing values. That is to say, when one or more values are missing for a case, most [[List of statistical packages|statistical packages]] default to discarding any case that has a missing value, which may introduce bias or affect the representativeness of the results. Imputation preserves all cases by replacing missing data with an estimated value based on other available information. Once all missing values have been imputed, the data set can be analysed using standard techniques for complete data.<ref>Gelman, Andrew, and [[Jennifer Hill]]. ''Data Analysis Using Regression and Multilevel/Hierarchical Models''. Cambridge University Press, 2006. Ch. 25.</ref> Many approaches to accounting for missing data have been proposed, but most of them introduce bias.
A few of the well-known approaches to dealing with missing data include: [[#Hot deck|hot deck]] and [[#Cold deck|cold deck]] imputation; [[#Listwise (complete case) deletion|listwise and pairwise deletion]]; [[#Mean substitution|mean imputation]]; [[#Non-negative matrix factorization|non-negative matrix factorization]]; [[#Regression|regression imputation]]; [[#Hot-deck|last observation carried forward]]; [[#Regression|stochastic imputation]]; and [[#Multiple imputation|multiple imputation]].

== Listwise (complete case) deletion ==
{{Main|Listwise deletion}}
By far the most common means of dealing with missing data is listwise deletion (also known as complete-case analysis), in which all cases with a missing value are deleted. If the data are [[missing completely at random]], listwise deletion does not add any bias, but it does decrease the [[Power (statistics)|power]] of the analysis by decreasing the effective sample size. For example, if 1000 cases are collected but 80 have missing values, the effective sample size after listwise deletion is 920.
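The effect of listwise deletion on the effective sample size can be sketched in Python with pandas; the data frame and column names below are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical survey data: 5 cases, 2 of which have a missing value.
df = pd.DataFrame({
    "age":    [23, 31, np.nan, 45, 52],
    "income": [40_000, 52_000, 61_000, np.nan, 75_000],
})

# Listwise (complete case) deletion: drop every case with any missing value.
complete_cases = df.dropna()

# The effective sample size shrinks from 5 to 3.
n_original = len(df)
n_effective = len(complete_cases)
```

Note that a case is dropped even when the missing variable plays no role in a given analysis, which is exactly the waste that pairwise deletion and imputation try to avoid.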
If the cases are not missing completely at random, then listwise deletion introduces bias, because the sub-sample of cases represented by the missing data is not representative of the original sample (and if the original sample was itself a representative sample of a population, the complete cases are not representative of that population either).<ref name="cambridge.org">{{Cite journal|last1=Lall|first1=Ranjit|date=2016|title=How Multiple Imputation Makes a Difference|url=https://www.cambridge.org/core/journals/political-analysis/article/how-multiple-imputation-makes-a-difference/8C6616B679EF8F3EB0041B1BC88EEBB9|journal=Political Analysis|language=en|volume=24|issue=4|pages=414–433|doi=10.1093/pan/mpw020|doi-access=free}}</ref> While listwise deletion is unbiased when the missing data are missing completely at random, this is rarely the case in practice.<ref>{{Cite journal|last=Kenward|first=Michael G|date=2013-02-26|title=The handling of missing data in clinical trials|journal=Clinical Investigation|volume=3|issue=3|pages=241–250|doi=10.4155/cli.13.7|doi-broken-date=2024-11-11 |issn=2041-6792|url=https://semanticscholar.org/paper/964403060982c44cc10842084105de256876b8c6}}</ref>

Pairwise deletion (or "available case analysis") involves deleting a case only when it is missing a variable required for a particular analysis, while including that case in analyses for which all required variables are present. When pairwise deletion is used, the total N for the analysis will not be consistent across parameter estimations. Because the N is incomplete for some parameters while complete case comparison is maintained for others, pairwise deletion can introduce impossible mathematical situations such as correlations above 100%.<ref name="enders2010">{{cite book |last=Enders |first=C. K.
|year=2010 |title=Applied Missing Data Analysis |location=New York |publisher=Guilford Press |isbn=978-1-60623-639-0 }}</ref>

The one advantage complete case deletion has over other methods is that it is straightforward and easy to implement. This is a large part of why complete case analysis remains the most popular method of handling missing data in spite of its many disadvantages.

==Single imputation==

===Hot-deck===
A once-common method of imputation was hot-deck imputation, in which a missing value was imputed from a randomly selected similar record. The term "hot deck" dates back to the storage of data on [[punched card]]s, and indicates that the information donors come from the same dataset as the recipients. The stack of cards was "hot" because it was currently being processed.

One form of hot-deck imputation is called "last observation carried forward" (LOCF for short), which involves sorting a dataset according to any of a number of variables, thus creating an ordered dataset. The technique then finds the first missing value and uses the cell value immediately prior to the missing data to impute the missing value. The process is repeated for the next cell with a missing value until all missing values have been imputed. In the common scenario in which the cases are repeated measurements of a variable for a person or other entity, this represents the belief that if a measurement is missing, the best guess is that it has not changed since it was last measured. This method is known to increase the risk of bias and potentially false conclusions.
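The LOCF procedure described above can be sketched with pandas; the repeated-measures data below are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical repeated measurements for one patient, one row per visit.
visits = pd.DataFrame({
    "visit": [1, 2, 3, 4, 5],
    "score": [7.0, np.nan, np.nan, 6.0, np.nan],
})

# Sort to obtain an ordered dataset, then carry each last observed
# value forward into the missing cells that follow it.
visits = visits.sort_values("visit")
visits["score_locf"] = visits["score"].ffill()
```

Every gap is filled with the value measured before it, which is exactly the "no change since last measurement" assumption that makes LOCF risky.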
For this reason LOCF is not recommended for use.<ref>{{Cite journal|last1=Molnar|first1=Frank J.|last2=Hutton|first2=Brian|last3=Fergusson|first3=Dean|date=2008-10-07|title=Does analysis using "last observation carried forward" introduce bias in dementia research?|journal=Canadian Medical Association Journal|volume=179|issue=8|pages=751–753|doi=10.1503/cmaj.080820|issn=0820-3946|pmc=2553855|pmid=18838445}}</ref>

===Cold-deck===
Cold-deck imputation, by contrast, selects donors from another dataset: a missing value is replaced with the response value of a similar item from a past survey, so the method is applicable to surveys that are repeated over time. Due to advances in computing power, more sophisticated methods of imputation have generally superseded the original random and sorted hot-deck imputation techniques.

===Mean substitution===
Another imputation technique involves replacing any missing value with the mean of that variable for all other cases, which has the benefit of not changing the sample mean for that variable. However, mean imputation attenuates any correlations involving the imputed variable(s), because in imputed cases there is guaranteed to be no relationship between the imputed variable and any other measured variables. Thus, mean imputation has some attractive properties for univariate analysis but becomes problematic for multivariate analysis. Mean imputation can be carried out within classes (e.g. categories such as gender), and can be expressed as <math>\hat{y}_i = \bar{y}_h</math>, where <math>\hat{y}_i </math> is the imputed value for record <math>i</math> and <math>\bar{y}_h</math> is the sample mean of respondent data within some class <math>h</math>.
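Class-mean imputation (<math>\hat{y}_i = \bar{y}_h</math>) can be sketched with pandas; the columns and class labels below are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data: income with missing values, to be imputed with
# the mean income within each gender class.
df = pd.DataFrame({
    "gender": ["f", "f", "f", "m", "m", "m"],
    "income": [50.0, np.nan, 70.0, 40.0, 60.0, np.nan],
})

# Per-row class mean (the mean ignores NaN by default), then fill gaps.
class_means = df.groupby("gender")["income"].transform("mean")
df["income_imputed"] = df["income"].fillna(class_means)
```

The within-class means of the imputed column equal those of the observed data, but the imputed rows carry no residual variation, which is what attenuates correlations.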
This is a special case of generalized regression imputation:
<math display="block"> \hat{y}_{mi} = b_{r0} + \sum_j b_{rj} z_{mij} + \hat{e}_{mi} </math>
Here the values <math>b_{r0}, b_{rj}</math> are estimated by regressing <math>y</math> on <math>x</math> in the non-imputed data, <math>z</math> is a [[dummy variable (statistics)|dummy variable]] for class membership, and the data are split into respondent (<math>r</math>) and missing (<math>m</math>) parts.<ref>{{cite journal | last1 = Kalton | first1 = Graham | title = The treatment of missing survey data | journal = Survey Methodology | volume = 12 | year = 1986 | pages = 1–16}}</ref><ref>{{cite journal | last1 = Kalton |first1 = Graham | first2 = Daniel | last2 = Kasprzyk | title = Imputing for missing survey responses | journal = Proceedings of the Section on Survey Research Methods | publisher = [[American Statistical Association]] | volume = 22 | year = 1982 |s2cid = 195855359 | url = https://pdfs.semanticscholar.org/58f9/8fcc52333348a63b9e6dd5fabbdcc6fefe0e.pdf | archive-url = https://web.archive.org/web/20200212025249/https://pdfs.semanticscholar.org/58f9/8fcc52333348a63b9e6dd5fabbdcc6fefe0e.pdf | url-status = dead | archive-date = 2020-02-12 }}</ref>

===Non-negative matrix factorization===
{{main|Non-negative matrix factorization}}
[[Non-negative matrix factorization]] (NMF) can handle missing data while minimizing its cost function, rather than treating the missing data as zeros, which could introduce bias.<ref name = "ren20">{{Cite journal|arxiv=2001.00563|last1= Ren|first1= Bin |title= Using Data Imputation for Signal Separation in High Contrast Imaging|journal= The Astrophysical Journal|volume= 892|issue= 2|pages= 74|last2= Pueyo|first2= Laurent|last3= Chen | first3 = Christine|last4= Choquet|first4= Elodie |last5= Debes|first5= John H|last6= Duchene |first6= Gaspard|last7= Menard|first7=Francois|last8=Perrin|first8=Marshall D.|year= 2020|doi= 10.3847/1538-4357/ab7024 | bibcode = 2020ApJ...892...74R |s2cid=
209531731|doi-access= free}}</ref> This makes NMF a mathematically grounded method for data imputation: it can ignore missing data in its cost function, and the impact of the missing data can be as small as a second-order effect.

===Regression===
Regression imputation has the opposite problem of mean imputation. A [[regression model]] is estimated to predict observed values of a variable based on other variables, and that model is then used to impute values in cases where the value of that variable is missing. In other words, available information for complete and incomplete cases is used to predict the value of a specific variable, and fitted values from the regression model are then used to impute the missing values. The problem is that the imputed data do not have an [[error term]] included in their estimation, so the estimates fit perfectly along the regression line without any residual [[variance]]. This causes relationships to be over-identified and suggests greater precision in the imputed values than is warranted: the regression model predicts the most likely value of the missing data but supplies no uncertainty about that value.

[[Stochastic]] regression was a fairly successful attempt to correct the lack of an error term in regression imputation by adding the average regression variance to the regression imputations to introduce error.
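The difference between deterministic and stochastic regression imputation can be sketched with NumPy; the data, the simple one-predictor model, and the seeded residual draw below are all hypothetical illustrations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical complete cases: y depends roughly linearly on x.
x_obs = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_obs = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

# Fit the regression model on the observed cases.
slope, intercept = np.polyfit(x_obs, y_obs, 1)
fitted = intercept + slope * x_obs
residual_sd = np.std(y_obs - fitted, ddof=2)  # residual standard deviation

# Cases where y is missing but x is observed.
x_mis = np.array([2.5, 4.5])

# Deterministic regression imputation: values fall exactly on the line,
# with no residual variance.
y_regression = intercept + slope * x_mis

# Stochastic regression imputation: add a random residual draw so the
# imputed values retain residual variance.
y_stochastic = y_regression + rng.normal(0.0, residual_sd, size=x_mis.size)
```

The deterministic values overstate precision; the stochastic version restores the residual scatter, though as the text notes it still understates the full imputation uncertainty.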
Stochastic regression shows much less bias than the above-mentioned techniques, but it still misses one thing: if data are imputed, then intuitively more noise should be introduced to the problem than simple residual variance.<ref name="enders2010"/>

==Multiple imputation==
To deal with the problem of increased noise due to imputation, Rubin (1987)<ref>{{cite book |last1=Rubin |first1=Donald |title=Multiple imputation for nonresponse in surveys |series=Wiley Series in Probability and Statistics |date=9 June 1987 |publisher=Wiley |doi=10.1002/9780470316696 |isbn=9780471087052 }}</ref> developed a method for averaging the outcomes across multiple imputed data sets. All multiple imputation methods follow three steps:<ref name="cambridge.org"/>
# Imputation – As in single imputation, missing values are imputed. However, the imputed values are drawn ''m'' times from a distribution rather than just once. At the end of this step, there should be ''m'' completed datasets.
# Analysis – Each of the ''m'' datasets is analyzed. At the end of this step there should be ''m'' analyses.
# Pooling – The ''m'' results are consolidated into one result by calculating the mean, variance, and confidence interval of the variable of concern<ref>{{cite journal | title=Multiple imputation for missing data: Concepts and new development | last=Yuan | first=Yang C. | journal=SAS Institute Inc., Rockville, MD | volume=49 | pages=1–11 | year=2010 | url=https://support.sas.com/rnd/app/stat/papers/multipleimputation.pdf | access-date=2018-01-17 | archive-date=2018-11-03 | archive-url=https://web.archive.org/web/20181103210428/https://support.sas.com/rnd/app/stat/papers/multipleimputation.pdf | url-status=dead }}</ref><ref>{{Cite book|title=Flexible Imputation of Missing Data|volume=20125245|chapter=2.
Multiple Imputation|last=Van Buuren|first=Stef|date=2012-03-29|publisher=Chapman and Hall/CRC|isbn=9781439868249|series=Chapman & Hall/CRC Interdisciplinary Statistics Series|doi=10.1201/b11826|s2cid=60316970 }}</ref> or by combining simulations from each separate model.<ref>{{Cite journal|author1-link=Gary King (political scientist)|author4-link=Kenneth Scheve|last1=King|first1=Gary|last2=Honaker|first2=James|last3=Joseph|first3=Anne|last4=Scheve|first4=Kenneth|date=March 2001|title=Analyzing Incomplete Political Science Data: An Alternative Algorithm for Multiple Imputation|url=https://www.cambridge.org/core/journals/american-political-science-review/article/analyzing-incomplete-political-science-data-an-alternative-algorithm-for-multiple-imputation/9E712982CCE2DE79A574FE98488F212B|journal=American Political Science Review|language=en|volume=95|issue=1|pages=49–69|doi=10.1017/S0003055401000235|s2cid=15484116 |issn=1537-5943}}</ref>

Multiple imputation can be used when the data are [[Missing data#Missing completely at random|missing completely at random]], [[Missing data#Missing at random|missing at random]], or [[Missing data#Missing not at random|missing not at random]], though it can be biased in the last case.<ref name="Pepinsky 2018 pp. 480–488">{{cite journal | last=Pepinsky | first=Thomas B.
| title=A Note on Listwise Deletion versus Multiple Imputation | journal=Political Analysis | publisher=Cambridge University Press (CUP) | volume=26 | issue=4 | date=2018-08-03 | issn=1047-1987 | doi=10.1017/pan.2018.18 | pages=480–488| doi-access=free }}</ref> One approach is multiple imputation by chained equations (MICE), also known as "fully conditional specification" and "sequential regression multiple imputation".<ref>{{Cite journal|last1=Azur|first1=Melissa J.|last2=Stuart|first2=Elizabeth A.|last3=Frangakis|first3=Constantine|last4=Leaf|first4=Philip J.|date=2011-03-01|title=Multiple imputation by chained equations: what is it and how does it work?|journal=International Journal of Methods in Psychiatric Research|volume=20|issue=1|pages=40–49|doi=10.1002/mpr.329|issn=1557-0657|pmc=3074241|pmid=21499542}}</ref> MICE is designed for missing-at-random data, though there is simulation evidence to suggest that with a sufficient number of auxiliary variables it can also work on data that are missing not at random. However, MICE can suffer from performance problems when the number of observations is large and the data have complex features, such as nonlinearities and high dimensionality.

More recent approaches to multiple imputation use machine learning techniques to improve performance. MIDAS (Multiple Imputation with Denoising Autoencoders), for instance, uses denoising [[autoencoder]]s, a type of unsupervised neural network, to learn fine-grained latent representations of the observed data.<ref name="The MIDAS Touch 2020">{{Cite journal|last1=Lall|first1=Ranjit|last2=Robinson|first2=Thomas|date=2021|title=The MIDAS Touch: Accurate and Scalable Missing-Data Imputation with Deep Learning|journal=Political Analysis|volume=30 |issue=2 |pages=179–196 |doi=10.1017/pan.2020.49|doi-access=free}}</ref> MIDAS has been shown to provide accuracy and efficiency advantages over traditional multiple imputation strategies.
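The pooling step of multiple imputation can be sketched with Rubin's rules for a scalar quantity; the per-imputation estimates and variances below are hypothetical analysis results.

```python
import numpy as np

# Suppose m = 5 completed datasets were each analyzed, yielding an
# estimate of some quantity and its squared standard error.
m = 5
estimates = np.array([10.2, 9.8, 10.5, 10.1, 9.9])
variances = np.array([0.40, 0.38, 0.42, 0.41, 0.39])

# Rubin's rules consolidate the m analyses into one result.
pooled_estimate = estimates.mean()            # overall point estimate
within_var = variances.mean()                 # average within-imputation variance
between_var = estimates.var(ddof=1)           # between-imputation variance
total_var = within_var + (1 + 1 / m) * between_var
pooled_se = np.sqrt(total_var)
```

The between-imputation term is what restores the uncertainty that single imputation throws away: the pooled standard error is larger than any single imputation's standard error alone.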
As alluded to in the previous section, single imputation does not take into account the uncertainty in the imputations: after imputation, the data are treated as if the imputed values were the actual observed values. Neglecting this uncertainty can lead to overly precise results and errors in any conclusions drawn.<ref>{{Cite journal|last=Graham|first=John W.|date=2009-01-01|title=Missing data analysis: making it work in the real world|journal=Annual Review of Psychology|volume=60|pages=549–576|doi=10.1146/annurev.psych.58.110405.085530|issn=0066-4308|pmid=18652544}}</ref> By imputing multiple times, multiple imputation accounts for the uncertainty and the range of values that the true value could have taken. The combination of uncertainty estimation and deep learning for imputation is among the best strategies and has been used to model heterogeneous drug discovery data.<ref>{{Cite journal|last=Irwin|first=Benedict|date=2020-06-01|title=Practical Applications of Deep Learning to Impute Heterogeneous Drug Discovery Data|journal=Journal of Chemical Information and Modeling|volume=60|issue=6|pages=2848–2857|doi=10.1021/acs.jcim.0c00443|pmid=32478517|s2cid=219171721 }}</ref><ref>{{Cite journal|last=Whitehead|first=Thomas|date=2019-02-12|title=Imputation of Assay Bioactivity Data Using Deep Learning|journal=Journal of Chemical Information and Modeling|volume=59|issue=3|pages=1197–1204|doi=10.1021/acs.jcim.8b00768|pmid=30753070|s2cid=73429643 }}</ref>

Additionally, while single imputation and complete-case analysis are easier to implement, multiple imputation is not very difficult to implement: a wide range of packages in [[List of statistical software|different statistical software]] readily perform it.
For example, the MICE package allows users in [[R (programming language)|R]] to perform multiple imputation using the MICE method.<ref>{{Cite journal|last1=Horton|first1=Nicholas J.|last2=Kleinman|first2=Ken P.|date=2007-02-01|title=Much ado about nothing: A comparison of missing data methods and software to fit incomplete data regression models|journal=The American Statistician|volume=61|issue=1|pages=79–90|doi=10.1198/000313007X172556|issn=0003-1305|pmc=1839993|pmid=17401454}}</ref> MIDAS can be implemented in R with the rMIDAS package and in Python with the MIDASpy package.<ref name="The MIDAS Touch 2020"/>

===Harmonizing spline regression and tensor decomposition algorithms for missing data imputation===
Whereas matrix/tensor factorization and decomposition algorithms predominantly use global structure for imputing data, algorithms such as piece-wise linear interpolation and spline regression use time-localized trends to estimate missing information in time series. The former are more effective for estimating larger missing gaps, while the latter work well only for short missing gaps.
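The time-localized approach can be sketched with piece-wise linear interpolation in pandas; the series below is hypothetical, and spline variants follow the same pattern.

```python
import numpy as np
import pandas as pd

# Hypothetical time series with one short missing gap.
series = pd.Series([1.0, 2.0, np.nan, 4.0, 5.0])

# Piece-wise linear interpolation estimates the missing value from the
# local trend around the gap rather than from global structure.
filled = series.interpolate(method="linear")
```

For a one-point gap this recovers the local trend exactly; for long gaps the local trend carries little information, which is where the global, decomposition-based methods are stronger.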
The SPRINT (Spline-powered Informed Tensor Decomposition) algorithm, proposed in the literature, capitalizes on the strengths of the two and combines them in an iterative framework for enhanced estimation of missing information; it is especially effective for datasets that have both long and short missing gaps.<ref>{{cite journal |last1=Sharma |first1=Shubham |last2=Nayak |first2=Richi |last3=Bhaskar |first3=Ashish |title=Harmonizing recurring patterns and non-recurring trends in traffic datasets for enhanced estimation of missing information |journal=Transportation Research Part C: Emerging Technologies |date=2025 |volume=174 |doi=10.1016/j.trc.2025.105083}}</ref>

==See also==
{{Div col|colwidth=25em}}
* [[Bootstrapping (statistics)]]
* [[Censoring (statistics)]]
* [[Expectation–maximization algorithm]]
* [[Geo-imputation]]
* [[Interpolation]]
* [[Matrix completion]]
* [[Full information maximum likelihood]]
{{Div col end}}

==References==
{{Reflist|30em}}

==External links==
* [https://archive.today/20130223193833/http://division.aomonline.org/rm/1999_RMD_Forum_Missing_Data.htm Missing Data: Instrument-Level Heffalumps and Item-Level Woozles]
* [https://web.archive.org/web/20120831160303/http://www.multiple-imputation.com/ Multiple-imputation.com]
* [https://web.archive.org/web/20050212022244/http://www.stat.psu.edu/~jls/mifaq.html Multiple imputation FAQs, Penn State U]
* [http://www.stat.fi/isi99/proceedings/arkisto/varasto/scho0502.pdf A description] of hot deck imputation from Statistics Finland.
* [https://web.archive.org/web/20160303174300/http://www.amstat.org/sections/srms/Proceedings/papers/1993_005.pdf Paper] extending the Rao–Shao approach and discussing problems with multiple imputation.
* [http://www.iaeng.org/publication/WCE2012/WCE2012_pp391-394.pdf Paper] on the Fuzzy Unordered Rules Induction Algorithm used as a missing-value imputation method for k-means clustering on real cardiovascular data.
* [http://www.ons.gov.uk/ons/guide-method/method-quality/general-methodology/data-editing-and-imputation/index.html Real-world application of imputation by the UK Office for National Statistics]

{{Authority control}}

{{DEFAULTSORT:Imputation (Statistics)}}
[[Category:Missing data]]
[[Category:Statistical data coding]]
[[Category:Statistical data transformation]]