Create New Variables - Impute Missing Data

This tool creates new Questions/Variable Sets in your data set in which any missing values are filled in using imputation. Imputation is the process of estimating missing values from the distribution of existing values in one or more variables. When you use this tool, a new Question/Variable Set is added for each Question/Variable Set that was selected. These new Questions/Variable Sets are imputed together, meaning that all of the variables are used to derive the imputed values. In addition, you can later add Auxiliary variables, which are additional variables whose data is used to inform the imputation, but which are not themselves added to the data set.

Example

The following table shows raw data for responses from two survey questions which asked respondents how many text messages they send in a typical week, and how much they spend per month on their phone bill:

The first two columns show the original data, and respondents 12, 15, 21, and 22 have some missing values. The second two columns show the imputed versions of the same two variables, and those respondents have now been assigned values based on the imputation.

The settings for these new imputed variables are as follows:

ImputationSettingsExample1.png

This tells us that the new imputed data is derived from four variables in total:

  • Text messages per week
  • Average monthly bill
  • Age
  • Gender

The final two variables, Age and Gender, are auxiliary variables: they inform the imputation but have not themselves been added to the data set.

Usage

In Q:

  1. Select one or more variables or questions in the Variables and Questions tab.
  2. Select Automate > Browse Online Library > Create New Variables > Impute Missing Data.
  3. To change how the imputation is performed:
    1. Select one of the new imputed Questions in the Variables and Questions tab.
    2. Right-click and select Edit R Variable.
    3. Choose the desired options in the Inputs section on the right. These options are explained below.
    4. Click Update R Variable.

In Displayr:

  1. Select one or more variables under Data Sets.
  2. Click the + symbol to the right of the selected variables.
  3. Select the desired option from Ready-Made New Variables > Impute Missing Data.
  4. To change how the imputation is performed:
    1. Select one of the imputed variable sets under Data Sets.
    2. Change the options in the Inputs tab of the object inspector on the right of the screen.
    3. Click Calculate.

Settings

The following settings are available for this tool:

Variables These are the variables which are imputed.

Auxiliary variables You can add additional variables to this drop-box to use the data from those variables in the imputation. These variables are not themselves imputed or added to the data set.

Seed This is the random number seed used in the imputation. Changing this number will result in a different solution.

Method This option allows you to choose which imputation algorithm is used.

  • Try mice The imputation first tries the mice algorithm and, if this is not successful, falls back to the hot deck algorithm.
  • Hot Deck Force the imputation to use only the hot deck algorithm.
  • Mice Force the imputation to use only the mice algorithm.
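
The three Method options can be pictured as a small dispatch with a fallback. The following is only a sketch of that logic, not the code used by the flipImputation package: imputeWithMethod, runMice, and runHotDeck are hypothetical names introduced for illustration, and the two callbacks stand in for the real algorithms.

```javascript
// Hypothetical sketch of how the Method setting could be dispatched.
// runMice and runHotDeck are placeholder callbacks, not real Q or R functions.
function imputeWithMethod(method, data, runMice, runHotDeck) {
    const key = method.toLowerCase(); // the R code in this QScript also lower-cases the label
    if (key === "mice")
        return { used: "mice", result: runMice(data) };
    if (key === "hot deck")
        return { used: "hot deck", result: runHotDeck(data) };
    if (key === "try mice") {
        try {
            return { used: "mice", result: runMice(data) };
        } catch (e) {
            // mice failed, so fall back to hot decking
            return { used: "hot deck", result: runHotDeck(data) };
        }
    }
    throw new Error("Unknown method: " + method);
}
```

With Mice, an error from the first algorithm is passed on to the caller. With Try mice, the same error is caught and the hot deck result is returned instead.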

Technical details

By default, data is imputed using the default settings of the mice R package, which implements Multivariate Imputation by Chained Equations (predictive mean matching).[1] Take care to ensure that each variable has the correct variable type, as this strongly affects the algorithm. If mice encounters a technical error, the imputation is instead performed using hot-decking, via the hot.deck package in R.[2]
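
To give a feel for donor-based imputation and for the role of the Seed setting, here is a deliberately simplified sketch. It is not the algorithm used by the hot.deck or mice packages: it fills each missing value with a randomly chosen observed value from the same variable and ignores all other variables. The function names (makeRng, simpleHotDeck) and the data are made up for illustration.

```javascript
// Small deterministic pseudo-random generator (mulberry32 style),
// so that the same seed always reproduces the same draws.
function makeRng(seed) {
    let state = seed >>> 0;
    return function () {
        state = (state + 0x6D2B79F5) >>> 0;
        let t = state;
        t = Math.imul(t ^ (t >>> 15), t | 1);
        t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
        return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
    };
}

// Replace each missing value (null) with a randomly chosen observed value (a "donor").
function simpleHotDeck(values, seed) {
    const donors = values.filter(v => v !== null);
    if (donors.length === 0)
        throw new Error("Cannot impute: every value is missing.");
    const rng = makeRng(seed);
    return values.map(v =>
        v === null ? donors[Math.floor(rng() * donors.length)] : v);
}

const textsPerWeek = [35, null, 120, 60, null, 15]; // made-up data
const imputedA = simpleHotDeck(textsPerWeek, 12321);
const imputedB = simpleHotDeck(textsPerWeek, 12321); // same seed, same result
```

Running the sketch twice with the same seed gives identical output, while a different seed may select different donors. The observed values are never altered.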

When applied with regression, missing values in the outcome variable are excluded from the analysis after the imputation has been performed.[3]

Note that although imputation can reduce the bias of parameter estimates, it can create misleading statistical inference (e.g., as the simulated sample size is assumed to be the actual sample size in calculations).
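
The sample size issue can be illustrated with a small calculation. This is a sketch using made-up numbers, with two imputed values chosen by hand rather than produced by mice or hot deck: it compares the standard error of the mean computed on the observed cases only with the one computed after the imputed values are treated as if they were real observations.

```javascript
// Standard error of the mean: sample standard deviation divided by sqrt(n).
function standardErrorOfMean(values) {
    const n = values.length;
    const mean = values.reduce((sum, v) => sum + v, 0) / n;
    const variance = values.reduce((sum, v) => sum + (v - mean) ** 2, 0) / (n - 1);
    return Math.sqrt(variance / n);
}

const observed = [10, 12, 14, 16, 18];      // five real responses (made up)
const completed = observed.concat([12, 16]); // plus two hand-picked "imputed" values

const seObserved = standardErrorOfMean(observed);   // about 1.414
const seCompleted = standardErrorOfMean(completed); // about 1.069
```

The second standard error is smaller even though no new information was collected, which is why tests and confidence intervals computed directly on imputed data can look more precise than they really are.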

The new Questions/Variable Sets are imputed jointly. This means that if you make changes to one of them, the others will also change.

There are some technical limitations on how you can change the new variables:

  • You cannot add or remove variables from the Variables drop-box.
  • You cannot change the order of variables in the Variables drop-box.
  • If you wish to delete any of the imputed variables you must delete them all together because they are linked.

How to apply this QScript

  • Start typing the name of the QScript into the Search features and data box in the top right of the Q window.
  • Click on the QScript when it appears in the QScripts and Rules section of the search results.

OR

  • Select Automate > Browse Online Library.
  • Select this QScript from the list.

Customizing the QScript

This QScript is written in JavaScript and can be customized by copying and modifying the JavaScript.

Customizing QScripts in Q4.11 and more recent versions

  • Start typing the name of the QScript into the Search features and data box in the top right of the Q window.
  • Hover your mouse over the QScript when it appears in the QScripts and Rules section of the search results.
  • Press Edit a Copy (bottom-left corner of the preview).
  • Modify the JavaScript (see QScripts for more detail on this).
  • Either:
    • Run the QScript, by pressing the blue triangle button.
    • Save the QScript and run it at a later time, using Automate > Run QScript (Macro) from File.

Customizing QScripts in older versions

  • Copy the JavaScript shown on this page.
  • Create a new text file, giving it a file extension of .QScript. See here for more information about how to do this.
  • Modify the JavaScript (see QScripts for more detail on this).
  • Run the file using Automate > Run QScript (Macro) from File.

JavaScript

// For insertAtHoverButtonIfShown, preventDuplicateQuestionName, inDisplayr,
//    correctTerminology, preventDuplicateVariableName
includeWeb("QScript Utility Functions");  
includeWeb("QScript Selection Functions");  // getAllUserSelections
includeWeb("QScript R Variable Creation Functions");  // robustNewRQuestion

function ImputeSelections(){
    let user_selections = getAllUserSelections();
    let selected_questions = user_selections.selected_questions;

    let selected_variables = user_selections.selected_variables;
    
    if (selected_variables.length === 0) {
        let data_location = (inDisplayr() ? "Data Sets on the left" :
                             "the Variables and Questions tab");
        log("Please select the variables to impute from " + data_location + " and rerun.");
        return false;
    }
    let data_file = selected_questions[0].dataFile;
    if (selected_questions.some(v => v.dataFile.name !== data_file.name)) {
        log("Sorry, all selected variables must be from the same data set.");
        return false;
    }

    let non_text_obj = selected_variables.filter(v => v.variableType != "Text").filter(validNonTextVariable)
                                          .map( function (v, ind) {
                                                    return { variable: v,
                                                            question: v.question,
                                                            type: v.question.questionType,
                                                            index: ind,
                                                            is_text: v.variableType == "Text" };
                                                });

    let text_obj = selected_variables.filter(v => v.variableType == "Text").filter(validTextVariable)
                                          .map( function (v, ind) {
                                                    return { variable: v,
                                                            question: v.question,
                                                            type: v.question.questionType,
                                                            index: ind,
                                                            is_text: v.variableType == "Text" };
                                                });

    if (text_obj.length === 0 && non_text_obj.length === 0)
    {
        log("The selected variables contain entirely missing data. Imputation not performed.")
        return false;
    }

    let new_non_text = [];
    let new_text = [];

    if (non_text_obj.length > 0) {
        new_non_text = createImputedVariableSet(non_text_obj.map(o => o.variable)).variables;
        if (!new_non_text)
            return false;
        non_text_obj.forEach(function (o, ind) {
            o.new_variable = new_non_text[ind];
        });
    }

    if (text_obj.length > 0) {
        new_text = createImputedVariableSet(text_obj.map(o => o.variable)).variables;
        if (!new_text)
            return false;
        text_obj.forEach(function (o, ind) {
            o.new_variable = new_text[ind];
        });
    }

    let all_obj = text_obj.concat(non_text_obj);
    let selected_variable_names = selected_variables.map(v => v.name);
    // Restore the order in which the variables were originally selected.
    all_obj.sort(function (a, b) {
        return selected_variable_names.indexOf(a.variable.name) - selected_variable_names.indexOf(b.variable.name);
    });


    let new_questions = [];
    selected_questions.forEach(function (q) {
        let new_question;
        let new_vars_for_question = all_obj.filter(function (obj) {
                return obj.question.equals(q);
            }).map(obj => obj.new_variable);
        let df = q.dataFile;
        new_question = df.setQuestion(preventDuplicateQuestionName(df, q.name + " - Imputed"), 
                                          q.questionType, new_vars_for_question);
        df.moveAfter(new_question.variables, q.variables[q.variables.length-1]);
        new_questions.push(new_question);
    });
    moveQuestionsToHoverButtonIfShown(new_questions)

    return;
}

// Check if a non-text variable has entirely missing data.
function validNonTextVariable(variable) {
    let vattr = variable.valueAttributes;
    let invalid = variable.rawValues.filter(x => !Number.isNaN(x) &&
                  !vattr.getIsMissingData(x)).length === 0;
    if (invalid)
        log("The variable '" + variable.label + "' contains entirely missing values and has been ignored.");
    return !invalid;
}
// Check if text variable has entirely missing data.
// We allow variables with only one single-unique non-missing value, because
// sometimes hot deck is okay with this and we catch this error later
function validTextVariable(variable)
{
    let unique_vals = variable.uniqueValues;
    // Invalid when there are no unique values, or the only unique value is empty.
    let invalid = unique_vals.length <= 1 && !unique_vals[0];
    if (invalid)
        log("The variable '" + variable.label + "' contains entirely missing values and has been ignored.")
    return !invalid;
}

function createImputedVariableSet(imputed_vars) {
    let data_file = imputed_vars[0].question.dataFile;
    let selected_variables = project.report.selectedVariables();
    let imputed_names = imputed_vars.map(v => v.name);

    let n_original_variables = imputed_vars.length;
    let n_aux_variables = selected_variables.length - n_original_variables;
    let aux_guids;
    if (n_aux_variables)
        aux_guids = selected_variables.filter(v => !imputed_names.includes(v.name)).map(v => v.guid).join(';');
    let new_question_name = preventDuplicateQuestionName(data_file,
                                                         "TEMP" + " - Imputed");
    let temp_var_name = preventDuplicateVariableName(data_file,
                                                    "tempVarXFBG361_adf");
    let structure_name = correctTerminology("variable set");

    let new_r_question;
    let inputs = {formData: imputed_vars.map(v => v.guid).join(";")};
    if (n_aux_variables)
        inputs.formAuxiliary = aux_guids;
    try {
        let v_name_string = imputed_vars.map(v => v.name).join(",");
        new_r_question = robustNewRQuestion(data_file, rCodeString(n_original_variables, v_name_string),
                                                new_question_name, temp_var_name,
                                                imputed_vars[imputed_vars.length - 1], jsCodeString(), inputs);

        new_r_question.variables.forEach((v,i) => {
            v.label = imputed_vars[i].label;
            v.name = preventDuplicateVariableName(data_file, imputed_vars[i].name + "_imputed");
        });
        insertAtHoverButtonIfShown(new_r_question);
        project.report.setSelectedRaw([new_r_question.variables[0]]);
    }catch (e) {
        let data_location = inDisplayr() ? "Data Editor" : "Data tab";
        if (e.message.indexOf("supply both 'x' and 'y'") > -1)
            log("Your variables appear to contain entirely missing data. " + 
                "Please check the supplied variable values in the " + data_location + ".");
        else if (e.message.indexOf("invalid first argument") > -1 || 
                 e.message.indexOf("default method not implemented for type") > -1)
            log("Sorry, imputation failed. This may occur if some of your input variables are inappropriate. " +
                "For example, if they contain only one unique non-missing value. " +
                "Please check the supplied variables in the " + data_location + ".");
        else if (e.message.indexOf("Can only convert tabular results") > -1)
            log("Sorry, we were unable to create the imputed data set because it is too large. " + 
                "Please consider reducing the number of variables and contacting Support.");
        else if (e.message.indexOf("cannot allocate vector of size") > -1)
            log("Sorry, we are unable to perform imputation with such a large amount of input data. " +
                "Please consider reducing the number of variables and contacting Support.");
        else    
            log("Sorry, an error occurred while imputing the selected data. " + e);
        return false;
    }
    return new_r_question;
}

function rCodeString(n_var, v_name_string) {
    let structure_name = correctTerminology("variable set");
    return `library(flipImputation)

N.VARIABLES <- ${ n_var } 
if (N.VARIABLES != length(formData))
    stop("Sorry, it is not possible to change the number of imputed variables in the existing ${structure_name}. ",
         "Please rerun the feature with updated selections to add or remove variables." )
ORIGINAL.NAMES <- "${ v_name_string }"
dat <- QDataFrame(formData)
CURRENT.NAMES <- paste0(vapply(dat, FUN = function (xx) {return(attr(xx, "name"))} , FUN.VALUE = character(1)), collapse = ",")
if (CURRENT.NAMES != ORIGINAL.NAMES)
    stop("Sorry, it is not possible to change the order of imputed variables. The best thing to do is to click the Undo button in the top left.") 
if (length(formAuxiliary))
    dat <- cbind(dat, QDataFrame(formAuxiliary))

imputed.data <- Imputation(dat, seed = formSeed, method = tolower(formMethod))[[1]][, 1:N.VARIABLES]
imputed.data`;
}

function jsCodeString() {
    return `form.dropBox({name: 'formData', label: 'Variables', multi: true, 
    required: true, types: ['Variables'],
    prompt: 'Supply variables to be imputed'});
form.dropBox({name: 'formAuxiliary', label: 'Auxiliary variables', multi: true, 
    required: false, types: ['Variables'],
    prompt: 'Additional variables to use in modeling/prediction but not to be imputed'});
form.numericUpDown({name: 'formSeed', label: 'Seed', default_value: 12321,
    minimum: 1, increment: 1, maximum: 10000000,
    prompt: 'Seed to use for random number generation'});
form.comboBox({name: 'formMethod', label: 'Method', 
alternatives: ['Try mice', 'Hot Deck', 'Mice'], default_value: 'Try mice'});`;
}

ImputeSelections();

References

  1. Stef van Buuren and Karin Groothuis-Oudshoorn (2011), "mice: Multivariate Imputation by Chained Equations in R", Journal of Statistical Software, 45:3, 1-67.
  2. Skyler J. Cranmer and Jeff Gill (2013), "We Have to Be Discrete About This: A Non-Parametric Imputation Technique for Missing Categorical Data", British Journal of Political Science, 43, 425-449.
  3. Paul T. von Hippel (2007), "Regression With Missing Y's: An Improved Strategy for Analyzing Multiply Imputed Data".