Missing Data Options

This page describes the missing data options available in Standard R advanced analyses. Not all options are available for all advanced analyses (e.g., some apply only to cluster analysis and some only to regression).

Error if missing data

An error is returned if any of the data used in the analysis contains missing values.

Exclude cases with missing data

The analysis is conducted using only the cases that have no missing data. For example, if there are three variables, x, y, and z, and the total sample size is 10, but 5 cases have no data for z, only the 5 cases with complete data are used in the analysis. This approach is also known as casewise deletion or the complete-case method. It is the default approach in Q.
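The example above can be reproduced with a minimal sketch. The data values are hypothetical, and `complete_cases` is an illustrative helper rather than a function from Q or R.

```python
# Hypothetical data: 10 cases on x, y and z; None marks a missing value.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [2, 1, 4, 3, 6, 5, 8, 7, 10, 9]
zs = [5, 3, 4, 1, 2, None, None, None, None, None]
cases = list(zip(xs, ys, zs))


def complete_cases(rows):
    """Keep only the rows in which every variable is observed."""
    return [row for row in rows if all(value is not None for value in row)]


kept = complete_cases(cases)
print(len(cases), "cases in total,", len(kept), "used in the analysis")
```

Only the five cases with a value for z survive, even though x and y are observed for all ten.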

Assign partial data to clusters

The initial analysis is performed based on Exclude cases with missing data, and then any cases that have some, but not all, of their data missing are assigned to the most similar clusters based on the data that is available.

Use partial data

The analysis is conducted using all the data that is available for each case. For example, in Segments - K-Means Cluster Analysis, if there are nine variables in the analysis, and a case has data for only six of them, then the case is assigned to the most similar cluster based on the data for those six variables.
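As a rough illustration of assigning a case from partial data, the sketch below measures similarity as the mean squared difference over the variables a case has data for. That distance measure, the cluster centres, and the function names are all assumptions made for the example; this page does not specify how Q measures similarity.

```python
def partial_distance(case, centre):
    """Mean squared difference over the variables the case has data for."""
    observed = [(v, c) for v, c in zip(case, centre) if v is not None]
    if not observed:
        raise ValueError("case has no observed data")
    return sum((v - c) ** 2 for v, c in observed) / len(observed)


def assign_cluster(case, centres):
    """Return the index of the most similar cluster centre."""
    distances = [partial_distance(case, centre) for centre in centres]
    return distances.index(min(distances))


# Hypothetical centres for two clusters on nine variables.
centres = [[1.0] * 9, [5.0] * 9]
# A case with data for six of the nine variables.
case = [4.0, 5.0, None, 6.0, None, 5.0, 4.0, None, 5.0]
print(assign_cluster(case, centres))  # prints 1 (the second centre)
```

Dividing by the number of observed variables keeps distances comparable between cases that have different amounts of missing data.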

Use partial data (pairwise correlations)

The analysis is conducted using the correlations, rather than the raw data, and each correlation is computed from all the data available for that pair of variables. For example, if there are three variables, x, y, and z, and the total sample size is 10, but 5 cases have no data for z, then the correlation between x and y is computed using all 10 cases, and the correlations between x and z and between y and z are computed using the 5 cases that have data for z.
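A minimal sketch of pairwise correlations on hypothetical data follows. `pairwise_correlation` is an illustrative helper; it returns the Pearson correlation together with the number of cases that contributed to it.

```python
from math import sqrt


def pairwise_correlation(a, b):
    """Pearson correlation over the cases where both variables are observed."""
    pairs = [(u, v) for u, v in zip(a, b) if u is not None and v is not None]
    n = len(pairs)
    mean_a = sum(u for u, _ in pairs) / n
    mean_b = sum(v for _, v in pairs) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in pairs)
    var_a = sum((u - mean_a) ** 2 for u, _ in pairs)
    var_b = sum((v - mean_b) ** 2 for _, v in pairs)
    return cov / sqrt(var_a * var_b), n


# Hypothetical data: z is missing for the last 5 of 10 cases.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2, 1, 4, 3, 6, 5, 8, 7, 10, 9]
z = [5, 3, 4, 1, 2, None, None, None, None, None]

r_xy, n_xy = pairwise_correlation(x, y)  # uses all 10 cases
r_xz, n_xz = pairwise_correlation(x, z)  # uses the 5 cases with z
r_yz, n_yz = pairwise_correlation(y, z)  # uses the 5 cases with z
print(n_xy, n_xz, n_yz)
```

Because each correlation can rest on a different subset of cases, the entries of the resulting matrix do not all share one sample size.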

Where this approach is being used in regression, the correlation matrix is analyzed using the sweep operator.[1]
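The sketch below shows one common form of the sweep operator applied to a correlation matrix. The matrix values and variable order are hypothetical, and the details of the implementation used in Q are not described on this page. Sweeping on each predictor in turn leaves the standardized regression coefficients in the outcome column and the unexplained proportion of variance in the outcome's diagonal cell.

```python
def sweep(matrix, k):
    """Return a copy of a symmetric matrix swept on pivot k."""
    n = len(matrix)
    d = matrix[k][k]
    out = [row[:] for row in matrix]
    for i in range(n):
        for j in range(n):
            if i == k and j == k:
                out[i][j] = -1.0 / d
            elif i == k or j == k:
                out[i][j] = matrix[i][j] / d
            else:
                out[i][j] = matrix[i][j] - matrix[i][k] * matrix[k][j] / d
    return out


# Hypothetical correlation matrix; the variable order is x1, x2, y.
corr = [
    [1.0, 0.5, 0.6],
    [0.5, 1.0, 0.7],
    [0.6, 0.7, 1.0],
]

swept = corr
for predictor in (0, 1):  # sweep on each predictor, but not on the outcome
    swept = sweep(swept, predictor)

betas = [swept[0][2], swept[1][2]]  # standardized coefficients for x1 and x2
r_squared = 1.0 - swept[2][2]       # proportion of variance explained
print(betas, r_squared)
```

Only the correlation matrix is needed as input, which is why this approach can work with pairwise correlations rather than raw data.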

Dummy variable adjustment

This method assumes that the missing data is structurally missing: a predictor may be inapplicable for some cases and is therefore coded as missing (e.g., when a non-married person is asked to rate the quality of their marriage). The model accommodates this by removing the missing predictor for those cases and adjusting the intercept. It is implemented by adding a dummy variable for each predictor that has at least one missing value. Each dummy variable takes the value 0 if the original predictor is non-missing and the value 1 if the original predictor is missing. The missing values of the original predictor are then recoded: missing values of numeric predictors are set to the mean of the predictor (computed excluding the missing data), and missing values of factors are set to the reference level of the factor.[2]
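For a numeric predictor, the recoding described above can be sketched as follows. The variable names and ratings are hypothetical, and factors (which are recoded to their reference level) are not shown.

```python
def dummy_adjust(values):
    """Recode a numeric predictor and build its missing-data indicator."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)  # mean excluding the missing data
    indicator = [1 if v is None else 0 for v in values]
    recoded = [mean if v is None else v for v in values]
    return recoded, indicator


# Hypothetical marriage-quality ratings; None marks non-married respondents.
marriage_quality = [4.0, None, 2.0, None, 3.0]
recoded, is_missing = dummy_adjust(marriage_quality)
print(recoded)     # [4.0, 3.0, 2.0, 3.0, 3.0]
print(is_missing)  # [0, 1, 0, 1, 0]
```

Both the recoded predictor and the indicator are then entered into the model as predictors, so the indicator's coefficient acts as the intercept adjustment for cases where the predictor is missing.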

Imputation (replace missing values with estimates)

By default, data is imputed using the default settings of the mice R package, which employs Multivariate Imputation by Chained Equations (predictive mean matching).[3] Care should be taken to ensure that variables have the correct variable type, as the variable type determines how each variable is handled by the algorithm. If mice encounters a technical error, the imputation is instead performed using hot-decking, via the hot.deck package in R.[4]
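To give a feel for predictive mean matching, here is a deliberately simplified, deterministic sketch with one predictor and one donor per missing value. It is not the algorithm in the mice package, which is more elaborate (for example, it involves random draws); the data and the `pmm_impute` helper are hypothetical.

```python
def pmm_impute(x, y):
    """Fill missing y values by simplified predictive mean matching on x."""
    observed = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    n = len(observed)
    mean_x = sum(xi for xi, _ in observed) / n
    mean_y = sum(yi for _, yi in observed) / n
    # Least-squares line fitted to the observed cases only.
    slope = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in observed)
             / sum((xi - mean_x) ** 2 for xi, _ in observed))
    intercept = mean_y - slope * mean_x

    def predict(xi):
        return intercept + slope * xi

    filled = []
    for xi, yi in zip(x, y):
        if yi is None:
            # Donor: the observed case whose predicted value is closest.
            _, donor_y = min(
                observed, key=lambda d: abs(predict(d[0]) - predict(xi))
            )
            filled.append(donor_y)
        else:
            filled.append(yi)
    return filled


# Hypothetical data: y is missing for the third and fifth cases.
x = [1.0, 2.0, 3.2, 4.0, 5.4, 6.0]
y = [2.1, 3.9, None, 8.2, None, 12.1]
print(pmm_impute(x, y))
```

The imputed values are always values that were actually observed for other cases, which is the defining feature of the technique.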

When applied with regression, missing values in the outcome variable are excluded from the analysis after the imputation has been performed.[5]

Note that although imputation can reduce the bias of parameter estimates, it can create misleading statistical inference (e.g., as the simulated sample size is assumed to be the actual sample size in calculations).

Multiple imputation

This is the same as imputation (described above), except that:

  • The imputation is repeated multiple times (by default, 10).
  • Parameter estimates are based on the average result across the different data sets.
  • Standard errors are computed using Rubin's (1987) method[6] and the degrees of freedom are computed using the 'small sample' approach.[7]

Other than parameter estimates, standard errors, test statistics, and p-values, diagnostics are based on the results from only the first of the models (e.g., measures of influence, tests of normality, and residuals are all from the first model).
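The pooling of a single coefficient across imputed data sets can be sketched as below, using the standard formulas from Rubin (1987) and Barnard and Rubin (1999). The inputs are hypothetical (three imputations rather than the default of 10, to keep the example short), and `pool_rubin` is an illustrative helper rather than a function from Q or mice.

```python
from math import sqrt
from statistics import mean, variance


def pool_rubin(estimates, std_errors, df_complete):
    """Pool one coefficient across m imputed data sets."""
    m = len(estimates)
    q_bar = mean(estimates)                  # pooled estimate
    u_bar = mean(se ** 2 for se in std_errors)  # within-imputation variance
    b = variance(estimates)                  # between-imputation variance
    t = u_bar + (1 + 1 / m) * b              # total variance
    lam = (1 + 1 / m) * b / t                # share of variance due to missingness
    df_obs = (df_complete + 1) / (df_complete + 3) * df_complete * (1 - lam)
    if lam == 0:
        df = df_obs                          # no between-imputation variation
    else:
        df_old = (m - 1) / lam ** 2
        df = df_old * df_obs / (df_old + df_obs)  # small-sample adjustment
    return q_bar, sqrt(t), df


# Hypothetical results for one coefficient from three imputed data sets.
estimates = [0.50, 0.54, 0.46]
std_errors = [0.10, 0.10, 0.10]
estimate, se, df = pool_rubin(estimates, std_errors, df_complete=97)
print(estimate, se, df)
```

The pooled standard error exceeds the 0.10 reported by each individual model because it also reflects how much the estimates vary between imputations.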

References

  1. Dempster, A. P. (1969). Elements of Continuous Multivariate Analysis. Reading: Addison-Wesley.
  2. Allison, P. D. (2001). Missing Data. Quantitative Applications in the Social Sciences. SAGE Publications. Kindle Edition, pp. 9-10.
  3. van Buuren, S. and Groothuis-Oudshoorn, K. (2011). "mice: Multivariate Imputation by Chained Equations in R". Journal of Statistical Software, 45(3), 1-67.
  4. Cranmer, S. J. and Gill, J. (2013). "We Have to Be Discrete About This: A Non-Parametric Imputation Technique for Missing Categorical Data". British Journal of Political Science, 43, 425-449.
  5. von Hippel, P. T. (2007). "Regression With Missing Y's: An Improved Strategy for Analyzing Multiply Imputed Data".
  6. Rubin, D. B. (1987). Multiple Imputation for Nonresponse in Surveys. John Wiley & Sons.
  7. Barnard, J. and Rubin, D. B. (1999). "Small-sample degrees of freedom with multiple imputation". Biometrika, 86(4), 948-955.