Regression - Driver Analysis

A Driver Analysis models the relationship between a dependent variable and one or more independent variables and quantifies the importance of each of the independent variables in predicting the dependent variable relative to the other independent variables.

Interpretation

Driver analysis computes an estimate of the importance of various independent variables in predicting a dependent variable. Most commonly, the dependent variable measures preference or usage of a particular brand (or brands), and the independent variables measure characteristics of this brand (or brands). For example, the dependent variable may be a measure of overall satisfaction and the independent variables may be measurements of satisfaction with bank fees, efficiency, friendliness, wait times, etc.

Variable statistics

Importance score The magnitude of the importance score indicates the contribution each independent variable makes to explaining the outcome variable, relative to the other independent variables in the model. These importance scores are scaled as a proportion of 100 to give a numeric scale that is easier to interpret.

Raw score The magnitude of the raw importance of the independent variable. This raw importance is the contribution the independent variable makes to explaining the model R-squared, relative to the other variables.
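
As a rough illustration of how raw scores relate to importance scores, the sketch below rescales a set of raw scores to a proportion of 100. It assumes the scaling divides each raw score by the sum of the absolute raw scores; that rule, the function name scaleImportance and the example values are assumptions for illustration, not taken from Q or Displayr.

```javascript
// Hypothetical helper: rescale raw importance scores so that their
// absolute values sum to 100.
// Assumption: scaling is raw / sum(|raw|) * 100; the product's rule may differ.
function scaleImportance(rawScores) {
    const total = rawScores.reduce((sum, r) => sum + Math.abs(r), 0);
    if (total === 0) {
        throw new Error("Raw scores sum to zero; nothing to scale.");
    }
    return rawScores.map(r => (r / total) * 100);
}

// Made-up raw contributions for fees, efficiency and friendliness.
const raw = [0.10, 0.25, 0.15];
const scaled = scaleImportance(raw);
console.log(scaled.map(s => s.toFixed(1))); // [ '20.0', '50.0', '30.0' ]
```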

The coefficient is colored if the variable is statistically significant at the 5% level.

Standard Error Measures the precision of an estimate. The smaller the standard error, the more precise the estimate.

t-statistic The estimate divided by the standard error. The magnitude of the value, whether it is positive or negative, indicates the significance of the variable. The values are highlighted based on their magnitude.

p-value expresses the t-statistic as a probability. A p-value under 0.05 means that the variable is statistically significant at the 5% level; a p-value under 0.01 means that the variable is statistically significant at the 1% level. P-values under 0.05 are shown in bold.
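
The relationship between the estimate, standard error, t-statistic and p-value can be sketched as follows. This is a minimal illustration that assumes a large sample, so that the t distribution is close to a standard normal; actual software would use the t distribution with the model's degrees of freedom. The function names and the example numbers are hypothetical.

```javascript
// Abramowitz and Stegun formula 7.1.26: approximation to erf(x) for x >= 0.
function erf(x) {
    const t = 1 / (1 + 0.3275911 * x);
    const poly = t * (0.254829592 + t * (-0.284496736 + t * (1.421413741 +
                 t * (-1.453152027 + t * 1.061405429))));
    return 1 - poly * Math.exp(-x * x);
}

// The t-statistic is the estimate divided by its standard error.
function tStatistic(estimate, standardError) {
    return estimate / standardError;
}

// Approximate two-sided p-value, P(|Z| > |t|), under the normal assumption.
function approxPValue(t) {
    return 1 - erf(Math.abs(t) / Math.SQRT2);
}

const t = tStatistic(0.42, 0.15); // hypothetical estimate and standard error
const p = approxPValue(t);
console.log(t.toFixed(2), p < 0.05 ? "significant at 5%" : "not significant at 5%");
```

With these made-up inputs the t-statistic is 2.8, which falls well beyond the usual 5% cut-off of roughly 1.96 for large samples.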

Overall statistics

n the sample size of the model

R-squared Assesses the goodness of fit of the model. A larger number indicates that the model captures more of the variation in the dependent variable.
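
For a linear model, R-squared can be computed from the observed and fitted values as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below shows that calculation; the function name and data are made up, and other regression types may use a different definition.

```javascript
// R-squared = 1 - (residual sum of squares / total sum of squares).
function rSquared(observed, fitted) {
    const mean = observed.reduce((s, y) => s + y, 0) / observed.length;
    let ssRes = 0;
    let ssTot = 0;
    observed.forEach((y, i) => {
        ssRes += (y - fitted[i]) ** 2; // variation left unexplained
        ssTot += (y - mean) ** 2;      // total variation around the mean
    });
    return 1 - ssRes / ssTot;
}

const observed = [3, 5, 7, 9];
const fitted = [3.5, 4.5, 7.5, 8.5];
console.log(rSquared(observed, fitted).toFixed(2)); // 0.95
```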

See also Regression Diagnostics.

Create a Driver Analysis in Displayr

With unstacked data the process is similar to a standard Regression model.

1. Go to Insert > Regression > Driver Analysis
2. Under Inputs > Outcome, select your dependent variable
3. Under Inputs > Predictor(s), select your independent variables

Stacked data can be handled as follows:

1. Go to Insert > Regression > Driver Analysis
2. Check the 'Stack data' control to allow stacked data.
3. Under Inputs > Outcome, select a single dependent variable set; for stacked data it should have a multi structure.
4. Under Inputs > Predictor(s), select your independent variable set. This should have a grid structure that suitably matches the outcome variable above.

See Question Types for more information on grid and multi type structures.


Object Inspector Options

Outcome The variable to be predicted by the predictor variables.

Predictors The variable(s) to predict the outcome.

Algorithm The fitting algorithm. Defaults to Regression but may be changed to other machine learning methods.

Type: You can use this option to toggle between different types of regression models. Note that certain types are not appropriate for certain types of outcome variable; in particular, the types other than Linear are not appropriate for a continuous outcome variable.

Linear.
Binary Logit See Regression - Binary Logit.
Ordered Logit See Regression - Ordered Logit.
Poisson See Regression - Poisson Regression.
Quasi-Poisson See Regression - Quasi-Poisson Regression.
NBD See Regression - NBD Regression.

Robust standard errors Computes standard errors that are robust to violations of the assumption of constant variance (i.e., heteroscedasticity). See Robust Standard Errors. This is only available when Type is Linear.

Missing data See Missing Data Options.

Output

Summary Gives summary output from a standard Regression model.
Detail Typical R output, some additional information compared to Summary, but without the pretty formatting.
ANOVA Analysis of variance table containing the results of Chi-squared likelihood ratio tests for each predictor.
Relative Importance Analysis The default. The results of a relative importance analysis. See the references for more information. This option is not available for Multinomial Logit. Note that categorical predictors are not converted to be numeric, unlike in Driver (Importance) Analysis - Relative Importance Analysis.
Shapley Regression See the references for more information. This option is only available for Linear Regression. Note that categorical predictors are not converted to be numeric, unlike in Driver (Importance) Analysis - Shapley.
Jaccard Coefficient Computes the relative importance of the predictor variables against the outcome variable using Jaccard coefficients. See Driver (Importance) Analysis - Jaccard Coefficient. This option is only available for Linear Regression and requires both the outcome variable and the predictor variables to be binary.
Correlation Computes the relative importance of the predictor variables against the outcome variable via the bivariate Pearson product moment correlations. This option is only available for Linear Regression. See Driver (Importance) Analysis - Correlation and references therein for more information.
Effects Plot Plots the relationship between each of the Predictors and the Outcome. Not available for Multinomial Logit.
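
To make the Jaccard Coefficient output more concrete, the sketch below computes the Jaccard coefficient between a binary outcome and each binary predictor: the number of cases where both are 1 divided by the number of cases where at least one is 1. The data and names are hypothetical, and how the coefficients are then turned into importance scores is not shown here.

```javascript
// Jaccard coefficient for two binary (0/1) vectors of equal length.
function jaccard(a, b) {
    if (a.length !== b.length) {
        throw new Error("Vectors must have the same length.");
    }
    let both = 0;   // cases where both variables are 1
    let either = 0; // cases where at least one variable is 1
    for (let i = 0; i < a.length; i++) {
        if (a[i] === 1 && b[i] === 1) both++;
        if (a[i] === 1 || b[i] === 1) either++;
    }
    return either === 0 ? NaN : both / either;
}

// Made-up binary outcome and predictors for six respondents.
const outcome = [1, 1, 0, 1, 0, 0];
const predictors = {
    fees: [1, 0, 0, 1, 0, 1],
    friendliness: [1, 1, 0, 1, 1, 0]
};

const coefficients = {};
for (const name of Object.keys(predictors)) {
    coefficients[name] = jaccard(outcome, predictors[name]);
}
console.log(coefficients); // { fees: 0.5, friendliness: 0.75 }
```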

Correction The multiple comparisons correction applied when computing the p-values of the post-hoc comparisons.
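
The form code offers None, False Discovery Rate and Bonferroni as corrections. The sketch below shows two standard adjustments of this kind. It assumes that False Discovery Rate refers to the Benjamini-Hochberg procedure, which is an assumption rather than something this page states; the function names and p-values are made up.

```javascript
// Bonferroni: multiply each p-value by the number of comparisons, capped at 1.
function bonferroni(pValues) {
    const m = pValues.length;
    return pValues.map(p => Math.min(1, p * m));
}

// Benjamini-Hochberg (assumed meaning of "False Discovery Rate"):
// scale the k-th smallest p-value by m / k, then enforce monotonicity
// by taking a running minimum from the largest p-value downwards.
function benjaminiHochberg(pValues) {
    const m = pValues.length;
    const order = pValues.map((p, i) => i).sort((i, j) => pValues[i] - pValues[j]);
    const adjusted = new Array(m);
    let runningMin = 1;
    for (let rank = m; rank >= 1; rank--) {
        const idx = order[rank - 1];
        runningMin = Math.min(runningMin, (pValues[idx] * m) / rank);
        adjusted[idx] = runningMin;
    }
    return adjusted;
}

const pValues = [0.01, 0.04, 0.03, 0.20];
const bonf = bonferroni(pValues);
const fdr = benjaminiHochberg(pValues);
console.log(bonf.map(p => p.toFixed(3))); // [ '0.040', '0.160', '0.120', '0.800' ]
console.log(fdr.map(p => p.toFixed(3)));  // [ '0.040', '0.053', '0.053', '0.200' ]
```

Bonferroni is the more conservative of the two: in this example only the first p-value stays under 0.05 with either method, but the false discovery rate adjustment leaves the others much closer to their unadjusted values.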

Variable names Displays Variable Names in the output instead of labels.

Absolute importance scores Whether the absolute value of Relative Importance Analysis scores should be displayed.

Auxiliary variables Variables to be used when imputing missing values (in addition to all the other variables in the model).

Weight Where a weight has been set for the R Output, it will automatically be applied when the model is estimated. By default, the weight is assumed to be a sampling weight, and the standard errors are estimated using Taylor series linearization (by contrast, in the Legacy Regression, weight calibration is used). See Weights, Effective Sample Size and Design Effects.

Filter The data is automatically filtered using any filters prior to estimating the model.

Crosstab Interaction Optional variable to test for interaction with other variables in the model. The interaction variable is treated as a categorical variable. Coefficients in the table are computed by creating separate regressions for each level of the interaction variable. To evaluate whether a coefficient is significantly higher (blue) or lower (red), we perform a t-test of the coefficient compared to the coefficient using the remaining data as described in Driver Analysis. P-values are corrected for multiple comparisons across the whole table (excluding the NET column). The P-value in the sub-title is calculated using a likelihood ratio test between the pooled model with no interaction variable, and a model where all predictors interact with the interaction variable.

Automated outlier removal percentage A numeric value between 0 and 50 (including 0 but not 50) to specify the percentage of the data that is removed from analysis. If a zero-value is selected for this input control then no outlier removal is performed and a standard regression output for the entire (possibly filtered) dataset is applied. If a non-zero value is selected for this option then the regression model is fitted twice. The first regression model uses the entire dataset (after filters have been applied) and identifies the observations that generate the largest residuals. The user specified percent of cases in the data that have the largest residuals are then removed. The regression model is refitted on this reduced dataset and output returned. The specific residual used in linear regression is the studentized residual in an unweighted regression and the Pearson residual in a weighted regression. The studentized residual computes the distance between the observed and fitted value for each point and standardizes (adjusts) based on the influence and an externally adjusted variance calculation (see rstudent function in R and Davison and Snell (1991) for more details). The Pearson residual in the weighted case adjusts appropriately for the provided survey weights.
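
The two-pass procedure described above can be sketched as follows. This is a simplified illustration, not the product's implementation: it uses a single predictor, ranks cases by raw absolute residual as a stand-in for the studentized or Pearson residuals, and rounds the number of cases to remove down, all of which are assumptions. The function names and data are hypothetical.

```javascript
// Ordinary least squares with one predictor: returns intercept and slope.
function fitLine(x, y) {
    const n = x.length;
    const meanX = x.reduce((s, v) => s + v, 0) / n;
    const meanY = y.reduce((s, v) => s + v, 0) / n;
    let sxy = 0;
    let sxx = 0;
    for (let i = 0; i < n; i++) {
        sxy += (x[i] - meanX) * (y[i] - meanY);
        sxx += (x[i] - meanX) ** 2;
    }
    const slope = sxy / sxx;
    return { intercept: meanY - slope * meanX, slope };
}

// Fit once, drop the given percentage of cases with the largest
// absolute residuals, then refit on the remaining cases.
function fitWithOutlierRemoval(x, y, percent) {
    const first = fitLine(x, y);
    const allCases = x.map((_, i) => i);
    if (percent === 0) return { model: first, kept: allCases };

    const nRemove = Math.floor((x.length * percent) / 100); // assumed rounding
    const ranked = allCases
        .map(i => ({ i, size: Math.abs(y[i] - (first.intercept + first.slope * x[i])) }))
        .sort((a, b) => b.size - a.size);
    const removed = new Set(ranked.slice(0, nRemove).map(d => d.i));
    const kept = allCases.filter(i => !removed.has(i));
    const model = fitLine(kept.map(i => x[i]), kept.map(i => y[i]));
    return { model, kept };
}

const x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10];
const y = [2, 4, 6, 8, 10, 12, 14, 16, 18, 50]; // the last case is an outlier
const result = fitWithOutlierRemoval(x, y, 10);
console.log(result.model.slope.toFixed(2), result.kept.length); // 2.00 9
```

In this made-up data the first fit is pulled towards the last case; removing 10% of the cases (one case) and refitting recovers the slope of 2 shared by the other nine.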

Stack data Whether the input data should be stacked before analysis. Stacking can be desirable when each individual in the data set has multiple cases and an aggregate model is desired. More information is available at Stacking Data Files and Stacked Data. If this option is chosen then, in Q, the Outcome needs to be a single Question that has a Multi type structure suitable for regression, such as a Pick One - Multi, Pick Any or Number - Multi; in Displayr, it needs to be a single Variable Set that has a Multi type structure suitable for regression, such as a Binary - Multi, Nominal - Multi, Ordinal - Multi or Numeric - Multi. Similarly, the Predictor(s) need to be a single Question that has a Grid type structure, such as a Pick Any - Grid or a Number - Grid (in Q), or a single Variable Set that has a Grid type structure, such as a Binary - Grid or a Numeric - Grid (in Displayr). In the process of stacking, the data reduction is inspected. Any constructed NETs are removed unless comprised of source values that are mutually exclusive to other codes, such as the result of merging two categories.
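
As an informal picture of what stacking does, the sketch below turns one row per respondent (with an outcome per brand and a grid of predictors per brand and attribute) into one row per respondent and brand. The data layout, field names and stack function are hypothetical and only illustrate the idea; they do not reflect how Q or Displayr store the data.

```javascript
// Hypothetical unstacked data: one row per respondent.
// outcome has one value per brand (multi structure);
// predictors has one value per brand and attribute (grid structure).
const wide = [
    { id: 1, outcome: { A: 7, B: 4 },
      predictors: { A: { fees: 3, service: 5 }, B: { fees: 2, service: 1 } } },
    { id: 2, outcome: { A: 9, B: 6 },
      predictors: { A: { fees: 4, service: 5 }, B: { fees: 3, service: 2 } } }
];

// Produce one case per respondent-brand combination, so that a single
// aggregate model can be fitted across all brands.
function stack(rows) {
    const stacked = [];
    for (const row of rows) {
        for (const brand of Object.keys(row.outcome)) {
            stacked.push({
                id: row.id,
                brand: brand,
                outcome: row.outcome[brand],
                ...row.predictors[brand]
            });
        }
    }
    return stacked;
}

const stacked = stack(wide);
console.log(stacked.length); // 4 (2 respondents x 2 brands)
console.log(stacked[1]);     // { id: 1, brand: 'B', outcome: 4, fees: 2, service: 1 }
```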

Random seed Seed used to initialize the (pseudo)random number generator for the model fitting algorithm. Different seeds may lead to slightly different answers, but should normally not make a large difference.

Additional options are available by editing the code.

DIAGNOSTICS

Cook's distance plot Creates a line/rug plot showing Cook's Distance for each observation.

Cook's distance vs leverage plot Creates a scatterplot showing Cook's distance vs leverage for each observation.

Influence index plot Creates index plots of studentized residuals, hat values, and Cook's distance.

Multicollinearity (VIF) table Creates a table containing variance inflation factors (VIF) to diagnose multicollinearity.

Normal Q-Q plot Creates a normal Quantile-Quantile (QQ) plot to reveal departures of the residuals from normality.

Prediction-accuracy table Creates a table showing the observed and predicted values, as a heatmap.

Residual heteroscedasticity test Conducts a heteroscedasticity test on the residuals.

Residual normality (Shapiro-Wilk) test Conducts a Shapiro-Wilk test of normality on the (deviance) residuals.

Residuals vs fitted plot Creates a scatterplot of residuals versus fitted values.

Residuals vs leverage plot Creates a plot of residuals versus leverage values.

Scale-location plot Creates a plot of the square root of the absolute standardized residuals by fitted values.

Serial correlation (Durbin-Watson) test Conducts a Durbin-Watson test of serial correlation (auto-correlation) on the residuals.
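
As an example of what one of these diagnostics measures, the Durbin-Watson statistic is the sum of squared differences between successive residuals divided by the sum of squared residuals. The sketch below computes only the statistic, not the test's p-value; the function name and residuals are made up.

```javascript
// Durbin-Watson statistic: sum((e[t] - e[t-1])^2) / sum(e[t]^2).
// Values near 2 suggest little first-order serial correlation, values
// towards 0 suggest positive correlation, values towards 4 negative.
function durbinWatson(residuals) {
    let numerator = 0;
    let denominator = 0;
    for (let t = 0; t < residuals.length; t++) {
        if (t > 0) numerator += (residuals[t] - residuals[t - 1]) ** 2;
        denominator += residuals[t] ** 2;
    }
    return numerator / denominator;
}

const alternating = [1, -1, 1, -1, 1, -1]; // negatively correlated residuals
const trending = [1, 1, 1, -1, -1, -1];    // positively correlated residuals
console.log(durbinWatson(alternating).toFixed(2)); // 3.33
console.log(durbinWatson(trending).toFixed(2));    // 0.67
```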

SAVE VARIABLE(S)

Save fitted values Creates a new variable containing fitted values for each case in the data.

Save predicted values Creates a new variable containing predicted values for each case in the data.

Save residuals Creates a new variable containing residual values for each case in the data.

Additional Properties

When using this feature you can obtain additional information that is stored by the R code which produces the output.

  1. To do so, select Create > R Output.
  2. In the R CODE, paste: item = YourReferenceName
  3. Replace YourReferenceName with the reference name of your item. Find this in the Report tree or by selecting the item and then going to Properties > General > Name from the object inspector on the right.
  4. Below the first line of code, you can paste in snippets from below or type in str(item) to see a list of available information.

For a more in-depth discussion on extracting information from objects in R, check out our blog post.

Properties which may be of interest are:

  • Summary outputs from the regression model:
item$summary$coefficients # summary regression outputs

More information

What is Linear Regression?

Acknowledgements

See Regression - Generalized Linear Model.

References

For residual definitions and information: Davison, A. C. and Snell, E. J. (1991) Residuals and diagnostics. In: Statistical Theory and Modelling. In Honour of Sir David Cox, FRS, eds. Hinkley, D. V., Reid, N. and Snell, E. J., Chapman & Hall.

For relative importance analysis: Johnson, J. W. (2000). "A heuristic method for estimating the relative weight of predictor variables in multiple regression". Multivariate Behavioral Research, 35(1), 1-19.

For Shapley:

Bock, T., "What is Shapley Value Regression?" [Blog post].

Yap, J., "When to Use Relative Weights Over Shapley" [Blog post].

Yap, J., "The Difference Between Shapley Regression and Relative Weights" [Blog post].

Code

var controls = [];

// ALGORITHM
var algorithm = form.comboBox({label: "Algorithm",
                               alternatives: ["CART", "Deep Learning", "Gradient Boosting", "Linear Discriminant Analysis",
                                              "Random Forest", "Regression", "Support Vector Machine"],
                               name: "formAlgorithm", default_value: "Regression",
                               prompt: "Machine learning or regression algorithm for fitting the model"});

controls.push(algorithm);
algorithm = algorithm.getValue();

var regressionType = "";
if (algorithm == "Regression")
{
    // Declared with var to avoid creating an implicit global.
    var regressionTypeControl = form.comboBox({label: "Regression type",
                                               alternatives: ["Linear", "Binary Logit", "Ordered Logit", "Multinomial Logit", "Poisson",
                                                              "Quasi-Poisson", "NBD"],
                                               name: "formRegressionType", default_value: "Linear",
                                               prompt: "Select type according to outcome variable type"});
    regressionType = regressionTypeControl.getValue();
    controls.push(regressionTypeControl);
}

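// DEFAULT CONTROLS
// output_options is assigned per algorithm below; declare it here to avoid an implicit global.
var output_options;
var missing_data_options = ["Error if missing data", "Exclude cases with missing data", "Imputation (replace missing values with estimates)"];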

// AMEND DEFAULT CONTROLS PER ALGORITHM
if (algorithm == "Support Vector Machine")
    output_options = ["Accuracy", "Prediction-Accuracy Table", "Detail"];
if (algorithm == "Gradient Boosting") 
    output_options = ["Accuracy", "Importance", "Prediction-Accuracy Table", "Detail"];
if (algorithm == "Random Forest")
    output_options = ["Importance", "Prediction-Accuracy Table", "Detail"];
if (algorithm == "Deep Learning")
    output_options = ["Accuracy", "Prediction-Accuracy Table", "Cross Validation", "Network Layers"];
if (algorithm == "Linear Discriminant Analysis")
    output_options = ["Means", "Detail", "Prediction-Accuracy Table", "Scatterplot", "Moonplot"];

if (algorithm == "CART") {
    output_options = ["Sankey", "Tree", "Text", "Prediction-Accuracy Table", "Cross Validation"];
    missing_data_options = ["Error if missing data", "Exclude cases with missing data",
                            "Use partial data", "Imputation (replace missing values with estimates)"];
}
if (algorithm == "Regression") {
    if (regressionType == "Multinomial Logit")
        output_options = ["Summary", "Detail", "ANOVA"];
    else if (regressionType == "Linear")
        output_options = ["Summary", "Detail", "ANOVA", "Relative Importance Analysis", "Shapley Regression", "Jaccard Coefficient", "Correlation", "Effects Plot"];
    else
        output_options = ["Summary", "Detail", "ANOVA", "Relative Importance Analysis", "Effects Plot"];
}

// COMMON CONTROLS FOR ALL ALGORITHMS
var outputControl = form.comboBox({label: "Output", prompt: "The type of output used to show the results",
                                   alternatives: output_options, name: "formOutput",
                                   default_value: algorithm === "Regression" ? "Relative Importance Analysis": output_options[0]});
controls.push(outputControl);
var output = outputControl.getValue();

if (algorithm == "Regression") {
    if (regressionType == "Linear") {
        if (output == "Jaccard Coefficient" || output == "Correlation")
            missing_data_options = ["Error if missing data", "Exclude cases with missing data", "Use partial data (pairwise correlations)"];
        else
            missing_data_options = ["Error if missing data", "Exclude cases with missing data", "Dummy variable adjustment", "Use partial data (pairwise correlations)", "Multiple imputation"];
    }        
    else
        missing_data_options = ["Error if missing data", "Exclude cases with missing data", "Dummy variable adjustment", "Multiple imputation"];
}

var missingControl = form.comboBox({label: "Missing data", 
                                    alternatives: missing_data_options, name: "formMissing", default_value: "Exclude cases with missing data",
                                    prompt: "Options for handling cases with missing data"});
var missing = missingControl.getValue();
controls.push(missingControl);
controls.push(form.checkBox({label: "Variable names", name: "formNames", default_value: false, prompt: "Display names instead of labels"}));


// CONTROLS FOR SPECIFIC ALGORITHMS

if (algorithm == "Support Vector Machine")
    controls.push(form.textBox({label: "Cost", name: "formCost", default_value: 1, type: "number",
                                prompt: "High cost produces a complex model with risk of overfitting, low cost produces a simpler model with risk of underfitting"}));

if (algorithm == "Gradient Boosting") {
    controls.push(form.comboBox({label: "Booster", 
                                 alternatives: ["gbtree", "gblinear"], name: "formBooster", default_value: "gbtree",
                                 prompt: "Boost tree or linear underlying models"}));
    controls.push(form.checkBox({label: "Grid search", name: "formSearch", default_value: false,
                                 prompt: "Search for optimal hyperparameters"}));
}

if (algorithm == "Random Forest")
    if (output == "Importance")
        controls.push(form.checkBox({label: "Sort by importance", name: "formImportance", default_value: true}));

if (algorithm == "Deep Learning") {
    controls.push(form.numericUpDown({name:"formEpochs", label:"Maximum epochs", default_value: 10, minimum: 1, maximum: 1000000,
                                      prompt: "Number of rounds of training"}));
    controls.push(form.textBox({name: "formHiddenLayers", label: "Hidden layers", prompt: "Comma delimited list of the number of nodes in each hidden layer", required: true}));
    controls.push(form.checkBox({label: "Normalize predictors", name: "formNormalize", default_value: true,
                                 prompt: "Normalize to zero mean and unit variance"}));
}

if (algorithm == "Linear Discriminant Analysis") {
    if (output == "Scatterplot")
    {
        controls.push(form.colorPicker({label: "Outcome color", name: "formOutColor", default_value:"#5B9BD5"}));
        controls.push(form.colorPicker({label: "Predictors color", name: "formPredColor", default_value:"#ED7D31"}));
    }
    controls.push(form.comboBox({label: "Prior", alternatives: ["Equal", "Observed"], name: "formPrior", default_value: "Observed",
                                 prompt: "Probabilities of group membership"}));
}

if (algorithm == "CART") {
    controls.push(form.comboBox({label: "Pruning", alternatives: ["Minimum error", "Smallest tree", "None"], 
                                 name: "formPruning", default_value: "Minimum error",
                                 prompt: "Remove nodes after tree has been built"}));
    controls.push(form.checkBox({label: "Early stopping", name: "formStopping", default_value: false,
                                 prompt: "Stop building tree when fit does not improve"}));
    controls.push(form.comboBox({label: "Predictor category labels", alternatives: ["Full labels", "Abbreviated labels", "Letters"],
                                 name: "formPredictorCategoryLabels", default_value: "Abbreviated labels",
                                 prompt: "Labelling of predictor categories in the tree"}));
    controls.push(form.comboBox({label: "Outcome category labels", alternatives: ["Full labels", "Abbreviated labels", "Letters"],
                                 name: "formOutcomeCategoryLabels", default_value: "Full labels",
                                 prompt: "Labelling of outcome categories in the tree"}));
    controls.push(form.checkBox({label: "Allow long-running calculations", name: "formLongRunningCalculations", default_value: false,
                                 prompt: "Allow predictors with more than 30 categories"}));
}

var stacked_check = false;
if (algorithm == "Regression") {
    if (missing == "Multiple imputation")
        controls.push(form.dropBox({label: "Auxiliary variables",
                                    types:["Variable: Numeric, Date, Money, Categorical, OrderedCategorical"], 
                                    name: "formAuxiliaryVariables", required: false, multi:true,
                                    prompt: "Additional variables to use when imputing missing values"}));
    controls.push(form.comboBox({label: "Correction", alternatives: ["None", "False Discovery Rate", "Bonferroni"], name: "formCorrection",
                                 default_value: "None", prompt: "Multiple comparisons correction applied when computing p-values of post-hoc comparisons"}));
    var is_RIA_or_shapley = output == "Relative Importance Analysis" || output == "Shapley Regression";
    var is_Jaccard_or_Correlation = output == "Jaccard Coefficient" || output == "Correlation";
    if (regressionType == "Linear" && missing != "Use partial data (pairwise correlations)" && missing != "Multiple imputation")
        controls.push(form.checkBox({label: "Robust standard errors", name: "formRobustSE", default_value: false,
                                     prompt: "Standard errors are robust to violations of assumption of constant variance"}));
    if (is_RIA_or_shapley)
        controls.push(form.checkBox({label: "Absolute importance scores", name: "formAbsoluteImportance", default_value: false,
                                     prompt: "Show absolute instead of signed importances"}));
    if (regressionType != "Multinomial Logit" && (is_RIA_or_shapley || is_Jaccard_or_Correlation || output == "Summary"))
        controls.push(form.dropBox({label: "Crosstab interaction", name: "formInteraction", types:["Variable: Numeric, Date, Money, Categorical, OrderedCategorical"],
                                    required: false, prompt: "Categorical variable to test for interaction with other variables"}));
    if (regressionType !== "Multinomial Logit")
        controls.push(form.numericUpDown({name : "formOutlierProportion", label:"Automated outlier removal percentage", default_value: 0, 
                                          minimum:0, maximum:49.9, increment:0.1,
                                          prompt: "Data points removed and model refitted based on the residual values in the model using the full dataset"}));
    var stacked_check_box = form.checkBox({label: "Stack data", name: "formStackedData", default_value: false,
                                           prompt: "Allow input into the Outcome control to be a single multi variable and Predictors to be a single grid variable"});
    stacked_check = stacked_check_box.getValue();
    controls.push(stacked_check_box);
}

controls.push(form.numericUpDown({name:"formSeed", label:"Random seed", default_value: 12321, minimum: 1, maximum: 1000000,
                                  prompt: "Initializes randomization for imputation and certain algorithms"}));

var outcome = form.dropBox({label: "Outcome", 
                            types: [ stacked_check ? "VariableSet: BinaryMulti, NominalMulti, OrdinalMulti, NumericMulti" : "Variable: Numeric, Date, Money, Categorical, OrderedCategorical"], 
                            multi: false,
                            name: "formOutcomeVariable",
                            prompt: "Dependent target variable to be predicted"});
var predictors = form.dropBox({label: "Predictor(s)",
                               types:[ stacked_check ? "VariableSet: BinaryGrid, NumericGrid" : "Variable: Numeric, Date, Money, Categorical, OrderedCategorical"], 
                               name: "formPredictorVariables", multi: stacked_check ? false : true,
                               prompt: "Independent input variables"});
controls.unshift(predictors);
controls.unshift(outcome);

form.setInputControls(controls);
form.setHeading((regressionType == "" ? "" : (regressionType + " ")) + algorithm);

library(flipMultivariates)

model <- MachineLearning(formula = if (isTRUE(get0("formStackedData"))) as.formula(NULL) else QFormula(formOutcomeVariable ~ formPredictorVariables),
                         algorithm = formAlgorithm,
                         weights = QPopulationWeight, subset = QFilter,
                         missing = formMissing,
                         output = if (formOutput == "Shapley Regression") "Shapley regression" else formOutput,
                         show.labels = !formNames,
                         seed = get0("formSeed"),
                         cost = get0("formCost"),
                         booster = get0("formBooster"),
                         grid.search = get0("formSearch"),
                         sort.by.importance = get0("formImportance"),
                         hidden.nodes = get0("formHiddenLayers"),
                         max.epochs = get0("formEpochs"),
                         normalize = get0("formNormalize"),
                         outcome.color = get0("formOutColor"),
                         predictors.color = get0("formPredColor"),
                         prior = get0("formPrior"),
                         prune = get0("formPruning"),
                         early.stopping = get0("formStopping"),
                         predictor.level.treatment = get0("formPredictorCategoryLabels"),
                         outcome.level.treatment = get0("formOutcomeCategoryLabels"),
                         long.running.calculations = get0("formLongRunningCalculations"),
                         type = get0("formRegressionType"),
                         auxiliary.data = get0("formAuxiliaryVariables"),
                         correction = get0("formCorrection"),
                         robust.se = get0("formRobustSE", ifnotfound = FALSE),
                         importance.absolute = get0("formAbsoluteImportance"),
                         interaction = get0("formInteraction"),
                         outlier.prop.to.remove = if (get0("formRegressionType", ifnotfound = "") != "Multinomial Logit") get0("formOutlierProportion")/100 else NULL,
                         stacked.data.check = get0("formStackedData"),
                         unstacked.data = if (isTRUE(get0("formStackedData"))) list(Y = get0("formOutcomeVariable"), X = get0("formPredictorVariables")) else NULL)