Missing Data Options

This page describes the options for handling missing data that are available in Standard R advanced analyses. Not all options are available for all advanced analyses (e.g., some apply only to cluster analysis and some only to regression).

Error if missing data

An error is returned if any of the data used in the analysis contains missing values.

Exclude cases with missing data

The analysis is conducted using only the cases that have no missing data. For example, if there are three variables, x, y, and z, and the total sample size is 10, but 5 cases have no data for z, only the 5 cases with complete data are used in the analysis. This is also known as casewise deletion and the complete-case method. It is the default approach in Q.
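The ten-case example above can be sketched in a few lines of Python. This is only an illustration of casewise deletion, not Q's implementation; the function name `complete_cases` and the use of `None` to mark a missing value are assumptions made for the example.

```python
def complete_cases(rows):
    """Keep only the rows in which no value is missing (None)."""
    return [row for row in rows if all(value is not None for value in row)]


# Ten cases measured on x, y and z; the last five have no data for z.
data = [(i, 2 * i, i + 1) for i in range(5)]
data += [(i, 2 * i, None) for i in range(5, 10)]

kept = complete_cases(data)
print(len(data), "cases in total,", len(kept), "used in the analysis")
```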

Assign partial data to clusters

The initial analysis is performed using 'Exclude cases with missing data', and then any cases that have some, but not all, of their data missing are assigned to the most similar clusters based on the data that is available.

Use partial data

The analysis is conducted using all the data for each case. For example, in Segments - K-Means, if there are nine variables in the analysis, and a case only has data for six, then the case is assigned to the most similar cluster based on the data for the six variables.
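As a rough sketch of assigning a case to the most similar cluster from partial data, the Python below compares a case with each cluster centre using only the variables the case has values for. The distance used here (mean squared difference over the observed variables) is an assumption for illustration, and the names `partial_distance` and `assign_cluster` are hypothetical; the measure of similarity that Q actually uses is not stated on this page.

```python
def partial_distance(case, centre):
    """Mean squared difference over the variables observed for this case."""
    pairs = [(v, c) for v, c in zip(case, centre) if v is not None]
    if not pairs:
        raise ValueError("case has no observed values")
    return sum((v - c) ** 2 for v, c in pairs) / len(pairs)


def assign_cluster(case, centres):
    """Return the index of the centre closest to the case."""
    return min(range(len(centres)), key=lambda k: partial_distance(case, centres[k]))


# Two hypothetical cluster centres on three variables.
centres = [(0.0, 0.0, 0.0), (10.0, 10.0, 10.0)]

# This case has no data for the second variable.
label = assign_cluster((9.0, None, 11.0), centres)
print("assigned to cluster", label)
```

Dividing by the number of observed variables keeps distances comparable between cases that have different amounts of data.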

Use partial data (pairwise correlations)

The analysis is conducted using the correlations, rather than the raw data, and each correlation is computed from all the data available for that pair of variables. For example, if there are three variables, x, y, and z, and the total sample size is 10, but 5 cases have no data for z, then the correlation between x and y is computed using all 10 cases, and the correlations between x and z and between y and z are computed using the 5 cases that have data for z.
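The example above can be reproduced with a small Python sketch that computes each Pearson correlation from whichever cases have data on both variables. The function name `pairwise_correlation`, the data values, and the use of `None` for missing values are all assumptions made for the illustration.

```python
from math import sqrt


def pairwise_correlation(a, b):
    """Pearson correlation using only cases observed on both variables.

    Returns the correlation and the number of cases it is based on.
    """
    pairs = [(u, v) for u, v in zip(a, b) if u is not None and v is not None]
    n = len(pairs)
    mean_a = sum(u for u, _ in pairs) / n
    mean_b = sum(v for _, v in pairs) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in pairs)
    var_a = sum((u - mean_a) ** 2 for u, _ in pairs)
    var_b = sum((v - mean_b) ** 2 for _, v in pairs)
    return cov / sqrt(var_a * var_b), n


x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2, 1, 4, 3, 6, 5, 8, 7, 10, 9]
z = [1, 3, 2, 5, 4, None, None, None, None, None]

r_xy, n_xy = pairwise_correlation(x, y)  # based on all 10 cases
r_xz, n_xz = pairwise_correlation(x, z)  # based on the 5 cases with z
r_yz, n_yz = pairwise_correlation(y, z)  # based on the 5 cases with z
```

Because each correlation can rest on a different subset of cases, the resulting matrix is not guaranteed to be positive definite.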

Where this approach is being used in regression, the correlation matrix is analyzed using the sweep operator.[1]
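To give a sense of how regression results can be read off a correlation matrix, here is a minimal sketch of the sweep operator in Python. It follows the common textbook definition of the operator and is not taken from Q's code; the function name `sweep` and the correlation values are assumptions. Sweeping the matrix on each predictor leaves the standardized regression coefficients in the outcome column and 1 minus R-squared in the outcome's diagonal cell.

```python
def sweep(matrix, k):
    """Sweep a symmetric matrix on pivot k and return the new matrix."""
    d = matrix[k][k]
    if abs(d) < 1e-12:
        raise ValueError("pivot is (close to) zero; cannot sweep")
    n = len(matrix)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == k and j == k:
                out[i][j] = -1.0 / d
            elif i == k or j == k:
                out[i][j] = matrix[i][j] / d
            else:
                out[i][j] = matrix[i][j] - matrix[i][k] * matrix[k][j] / d
    return out


# Hypothetical correlations among two predictors (x1, x2) and an outcome (y).
corr = [
    [1.0, 0.5, 0.6],  # x1
    [0.5, 1.0, 0.4],  # x2
    [0.6, 0.4, 1.0],  # y
]

swept = corr
for predictor in (0, 1):  # sweep on each predictor in turn
    swept = sweep(swept, predictor)

betas = [swept[0][2], swept[1][2]]  # standardized coefficients for x1, x2
r_squared = 1.0 - swept[2][2]       # swept y diagonal holds 1 - R-squared
```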

Imputation (replace missing values with estimates)

By default, data is imputed using the default settings from the mice R package, which employs Multivariate Imputation by Chained Equations (predictive mean matching).[2] Care should be taken to ensure that variables have the correct variable type, as this has a big impact on the algorithm. Where a technical error occurs when using mice, the imputation is instead performed using hot-decking, via the hot.deck package in R.[3]

When applied with regression, missing values in the outcome variable are excluded from the analysis after the imputation has been performed.[4]

Note that although imputation can reduce the bias of parameter estimates, it can create misleading statistical inference (e.g., because calculations treat the imputed values as if they were real observations, the sample size is assumed to be larger than the amount of data actually collected).
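The effect on inference can be seen in a deliberately simple Python sketch. It uses mean imputation as a stand-in, which is an assumption made purely for illustration and is not the method mice applies; the function name and data are also hypothetical. Filling in five missing values doubles the apparent sample size and shrinks the standard error, even though no new information has been collected.

```python
from math import sqrt


def standard_error_of_mean(values):
    """Standard error of the mean, treating every value as a real observation."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return sqrt(variance / n)


observed = [4.0, 6.0, 5.0, 7.0, 3.0]  # the five values actually collected
n_missing = 5                         # five further cases have no value

# Stand-in imputation: replace each missing value with the observed mean.
fill = sum(observed) / len(observed)
completed = observed + [fill] * n_missing

se_observed = standard_error_of_mean(observed)
se_completed = standard_error_of_mean(completed)
print("SE from observed data:", round(se_observed, 3))
print("SE after imputation:  ", round(se_completed, 3))
```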

Multiple imputation

This is the same as imputation (described above), except that:

  • The imputation is repeated multiple times (by default, 10).
  • Parameter estimates are based on the average result across the different data sets.
  • Standard errors are computed using Rubin's (1987) method[5] and the degrees of freedom are computed using the 'small sample' approach.[6]
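The pooling steps in the list above can be sketched in Python as follows. The formulas are the standard ones from Rubin (1987) and Barnard and Rubin (1999), but the function name `pool_rubin`, its arguments, and the example numbers are assumptions for illustration, and the sketch is not taken from Q's code. The argument `df_complete` is the degrees of freedom the model would have had with no missing data.

```python
from math import sqrt


def pool_rubin(estimates, standard_errors, df_complete):
    """Pool one parameter across m imputed data sets.

    Returns the pooled estimate, its standard error, and the
    small-sample degrees of freedom.
    """
    m = len(estimates)
    if m < 2:
        raise ValueError("at least two imputations are needed")
    pooled = sum(estimates) / m                               # average estimate
    within = sum(se ** 2 for se in standard_errors) / m       # within-imputation variance
    if within <= 0:
        raise ValueError("standard errors must be positive")
    between = sum((q - pooled) ** 2 for q in estimates) / (m - 1)  # between-imputation variance
    total = within + (1 + 1 / m) * between                    # total variance
    lam = (1 + 1 / m) * between / total                       # share of variance due to missingness
    df_observed = (df_complete + 1) / (df_complete + 3) * df_complete * (1 - lam)
    if lam == 0:
        df = df_observed            # no between-imputation variance
    else:
        df_old = (m - 1) / lam ** 2  # large-sample degrees of freedom
        df = 1 / (1 / df_old + 1 / df_observed)
    return pooled, sqrt(total), df


# Hypothetical results for one coefficient from three imputed data sets.
estimate, se, df = pool_rubin([1.0, 1.2, 0.8], [0.5, 0.5, 0.5], df_complete=100)
```

The pooled standard error is larger than any of the individual standard errors because it also reflects how much the estimate varies between imputations.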

Other than parameter estimates, standard errors, test statistics, and p-values, diagnostics are based on the results from only the first of the models (e.g., measures of influence, tests of normality, and residuals are all from the first model).

References

  1. Dempster, A. P. (1969). Elements of Continuous Multivariate Analysis. Reading: Addison-Wesley.
  2. van Buuren, S. and Groothuis-Oudshoorn, K. (2011). mice: Multivariate Imputation by Chained Equations in R. Journal of Statistical Software, 45(3), 1-67.
  3. Cranmer, S. J. and Gill, J. (2013). We Have to Be Discrete About This: A Non-Parametric Imputation Technique for Missing Categorical Data. British Journal of Political Science, 43, 425-449.
  4. von Hippel, P. T. (2007). Regression With Missing Y's: An Improved Strategy for Analyzing Multiply Imputed Data.
  5. Rubin, D. B. (1987). Multiple Imputation for Nonresponse in Surveys. John Wiley & Sons.
  6. Barnard, J. and Rubin, D. B. (1999). Small-sample degrees of freedom with multiple imputation. Biometrika, 86(4), 948-955.
