Regression - Generalized Linear Model


The Generalized Linear Model feature models the relationships between a dependent variable and one or more independent variables. There are seven types of regression analysis to choose from. The linear regression model is the default.

Regression Types

Linear

The Linear Regression models the linear relationship between a dependent variable and one or more independent variables. The linear regression option is most commonly used when the dependent variable is continuous. See Regression - Linear Regression.
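As a rough illustration of what a linear fit estimates, here is a minimal JavaScript sketch of ordinary least squares with a single predictor. The function name `fitSimpleLinear` and the data are hypothetical; this is not the code the feature runs, which also handles multiple predictors, weights, and missing data.

```javascript
// Ordinary least squares for one predictor: y = intercept + slope * x.
// Hypothetical helper for illustration only.
function fitSimpleLinear(x, y) {
    if (x.length !== y.length || x.length < 2)
        throw new Error("x and y must have the same length (at least 2)");
    const n = x.length;
    const meanX = x.reduce((a, b) => a + b, 0) / n;
    const meanY = y.reduce((a, b) => a + b, 0) / n;
    let sxy = 0, sxx = 0;
    for (let i = 0; i < n; i++) {
        sxy += (x[i] - meanX) * (y[i] - meanY); // co-variation of x and y
        sxx += (x[i] - meanX) ** 2;             // variation of x
    }
    const slope = sxy / sxx;
    return { intercept: meanY - slope * meanX, slope: slope };
}

// Made-up data lying exactly on the line y = 1 + 2x.
const linearFit = fitSimpleLinear([1, 2, 3, 4], [3, 5, 7, 9]);
console.log(linearFit);
```

The slope is the expected change in the dependent variable for a one-unit change in the independent variable.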

Binary Logit

The Binary Logit is a form of regression analysis that models a binary dependent variable (e.g., yes/no, pass/fail, win/lose). It is also known as logistic regression or binomial regression. See Regression - Binary Logit.
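The way a binary logit turns predictors into a probability can be sketched as follows. This is a minimal JavaScript illustration, assuming the coefficients have already been estimated; `inverseLogit` and `predictBinaryLogit` are hypothetical names, not part of the feature.

```javascript
// Logistic (inverse logit) function: maps any real number to (0, 1).
function inverseLogit(eta) {
    return 1 / (1 + Math.exp(-eta));
}

// Probability of the "positive" outcome for one case.
// coefficients[0] is the intercept; the rest line up with predictors.
function predictBinaryLogit(coefficients, predictors) {
    let eta = coefficients[0];
    for (let i = 0; i < predictors.length; i++)
        eta += coefficients[i + 1] * predictors[i];
    return inverseLogit(eta);
}

// Made-up coefficients: intercept -1, one predictor with coefficient 0.5.
const passProbability = predictBinaryLogit([-1, 0.5], [2]);
console.log(passProbability);
```

Here the linear predictor is -1 + 0.5 * 2 = 0, so the predicted probability is one half.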

Ordered Logit

The Ordered Logit is a form of regression analysis that models a discrete and ordinal dependent variable with more than two outcomes (Net Promoter Score, Customer Satisfaction rating, etc.). It is also known as ordinal logistic regression or a cumulative link model. See Regression - Ordered Logit.
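A minimal JavaScript sketch of how an ordered logit assigns probabilities to ordered categories is shown below. It assumes the common parameterization in which the cumulative probability of being at or below category k is the logistic function of (threshold k minus the linear predictor); the function name and the numbers are hypothetical.

```javascript
// Category probabilities for an ordered logit.
// thresholds: increasing cut points (one fewer than the number of categories).
// eta: linear predictor for one case.
function orderedLogitProbabilities(thresholds, eta) {
    // Cumulative probabilities P(Y <= k) for each cut point.
    const cumulative = thresholds.map(t => 1 / (1 + Math.exp(-(t - eta))));
    cumulative.push(1); // the highest category closes the distribution
    // Differences of consecutive cumulative probabilities give P(Y = k).
    return cumulative.map((c, k) => (k === 0 ? c : c - cumulative[k - 1]));
}

// Three ordered categories (e.g., low / medium / high), made-up cut points.
const ratingProbabilities = orderedLogitProbabilities([-1, 1], 0);
console.log(ratingProbabilities);
```

Increasing the linear predictor shifts probability mass toward the higher categories while the cut points stay fixed.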

Multinomial Logit

The Multinomial Logit is a form of regression analysis that models a discrete and nominal dependent variable with more than two outcomes (Yes/No/Maybe, Red/Green/Blue, Brand A/Brand B/Brand C, etc.). It is also known as a multinomial logistic regression and multinomial logistic discriminant analysis. See Regression - Multinomial Logit.
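For nominal outcomes, each non-reference category has its own linear predictor, and the probabilities come from normalizing their exponentials. The JavaScript sketch below assumes a reference-category parameterization (the reference category's linear predictor is fixed at zero); the function name and the numbers are hypothetical.

```javascript
// Category probabilities for a multinomial logit.
// linearPredictors: one value per non-reference category.
function multinomialLogitProbabilities(linearPredictors) {
    const etas = [0].concat(linearPredictors); // reference category first
    const maxEta = Math.max(...etas);          // subtract max for numerical stability
    const expEtas = etas.map(e => Math.exp(e - maxEta));
    const total = expEtas.reduce((a, b) => a + b, 0);
    return expEtas.map(e => e / total);
}

// Brand A is the reference; made-up linear predictors for Brand B and Brand C.
const brandProbabilities = multinomialLogitProbabilities([0.5, -0.5]);
console.log(brandProbabilities);
```

Unlike the ordered logit, nothing here depends on the order of the categories; relabeling them only permutes the probabilities.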

Poisson

The Poisson Regression is used to model count data with the assumption that the dependent variable has a Poisson distribution, where the mean is equal to the variance. If the variance is substantially larger than the mean (overdispersion), the Quasi-Poisson or NBD may be a better option. See Regression - Poisson Regression.
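A quick, informal way to see whether a count variable looks overdispersed is to compare its sample variance to its sample mean. The JavaScript sketch below uses a hypothetical helper, `dispersionIndex`, on made-up counts. It ignores the predictors, so it is only a rough first look and not a formal test.

```javascript
// Ratio of sample variance to sample mean for a set of counts.
// A value near 1 is consistent with a Poisson distribution;
// a value well above 1 suggests overdispersion.
function dispersionIndex(counts) {
    if (counts.length < 2) throw new Error("need at least two counts");
    const n = counts.length;
    const mean = counts.reduce((a, b) => a + b, 0) / n;
    const variance = counts.reduce((a, c) => a + (c - mean) ** 2, 0) / (n - 1);
    return variance / mean; // NaN if every count is zero
}

const steadyCounts = [2, 3, 2, 3, 2, 3];        // little spread around the mean
const burstyCounts = [0, 0, 1, 0, 12, 0, 2, 15]; // many zeros and a few large counts

console.log(dispersionIndex(steadyCounts));
console.log(dispersionIndex(burstyCounts));
```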

Quasi-Poisson

The Quasi-Poisson Regression is a generalization of the Poisson regression and is used when modeling an overdispersed count variable. The Quasi-Poisson model assumes that the variance is a linear function of the mean. See Regression - Quasi-Poisson Regression.

NBD

The Negative Binomial Distribution (NBD) Regression is a generalization of the Poisson regression and is used when modeling an overdispersed count variable. The NBD model assumes that the variance is a quadratic function of the mean. See Regression - NBD Regression.
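The difference between the two overdispersed models comes down to how the variance grows with the mean. The JavaScript sketch below writes out the two variance functions, assuming the usual parameterizations (a dispersion multiplier for Quasi-Poisson, and a shape parameter theta for the negative binomial, as in the MASS package); the function names are hypothetical.

```javascript
// Quasi-Poisson: variance is proportional to the mean.
// dispersion = 1 reproduces the Poisson assumption (variance = mean).
function quasiPoissonVariance(mean, dispersion) {
    return dispersion * mean;
}

// Negative binomial: variance is a quadratic function of the mean.
// Larger theta means less overdispersion.
function negativeBinomialVariance(mean, theta) {
    return mean + (mean * mean) / theta;
}

// Compare the implied variances at a few means (made-up parameter values).
for (const mean of [1, 4, 16]) {
    console.log(mean, quasiPoissonVariance(mean, 2), negativeBinomialVariance(mean, 2));
}
```

With these parameter values the two models agree closely for small means but diverge quickly as the mean grows, which is why the choice between them matters most when some fitted counts are large.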

Create a Generalized Linear Model in Displayr

1. Go to Insert > Regression > Generalized Linear Model
2. Under Inputs > Outcome, select your dependent variable
3. Under Inputs > Predictor(s), select your independent variables
4. Under Inputs > Regression, select the model you want to use

Object Inspector Options

Outcome The variable to be predicted by the predictor variables.

Predictors The variable(s) to predict the outcome.

Type:

Linear See Regression - Linear Regression.
Binary Logit See Regression - Binary Logit.
Ordered Logit See Regression - Ordered Logit.
Multinomial Logit See Regression - Multinomial Logit.
Poisson See Regression - Poisson Regression.
Quasi-Poisson See Regression - Quasi-Poisson Regression.
NBD See Regression - NBD Regression.

Robust standard errors Computes standard errors that are robust to violations of the assumption of constant variance (i.e., heteroscedasticity). See Robust Standard Errors. This is only available when Type is Linear and Missing data is not set to Use partial data (pairwise correlations) or Multiple imputation.

Missing data See Missing Data Options.

Output:

Summary The default.
Detail Typical R output, some additional information compared to Summary, but without the pretty formatting.
ANOVA Analysis of variance table containing the results of Chi-squared likelihood ratio tests for each predictor.
Relative Importance Analysis The results of a relative importance analysis. See the references for more information. This option is not available for Multinomial Logit. Note that categorical predictors are not converted to be numeric, unlike in Driver (Importance) Analysis - Relative Importance Analysis.
Effects Plot Plots the relationship between each of the Predictors and the Outcome. Not available for Multinomial Logit.

Correction The multiple comparisons correction applied when computing the p-values of the post-hoc comparisons.

Variable names Displays Variable Names in the output instead of labels.

Absolute importance scores Whether the absolute value of Relative Importance Analysis scores should be displayed.

Auxiliary variables Variables to be used when imputing missing values (in addition to all the other variables in the model).

Weight Where a weight has been set for the R Output, it will automatically be applied when the model is estimated. By default, the weight is assumed to be a sampling weight, and the standard errors are estimated using Taylor series linearization (by contrast, in the Legacy Regression, weight calibration is used). See Weights, Effective Sample Size and Design Effects.

Filter The data is automatically filtered using any applied filters before the model is estimated.

Crosstab Interaction Optional variable to test for interaction with other variables in the model. See Linear Regression for more details.

Random seed Seed used to initialize the (pseudo)random number generator for the model fitting algorithm. Different seeds may lead to slightly different answers, but should normally not make a large difference.

Additional options are available by editing the code.
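To illustrate the Correction option described above, here is a minimal JavaScript sketch of two standard p-value adjustments. It assumes that False Discovery Rate refers to the Benjamini-Hochberg procedure, which this page does not confirm; the function names are hypothetical and are not part of the feature's code.

```javascript
// Bonferroni: multiply each p-value by the number of tests, capped at 1.
function bonferroni(pValues) {
    const m = pValues.length;
    return pValues.map(p => Math.min(1, p * m));
}

// Benjamini-Hochberg false discovery rate adjustment.
function benjaminiHochberg(pValues) {
    const m = pValues.length;
    // Indices of the p-values sorted from smallest to largest.
    const order = pValues.map((p, i) => i).sort((a, b) => pValues[a] - pValues[b]);
    const adjusted = new Array(m);
    let runningMin = 1;
    // Walk from the largest p-value down, enforcing monotonicity.
    for (let rank = m; rank >= 1; rank--) {
        const index = order[rank - 1];
        runningMin = Math.min(runningMin, (pValues[index] * m) / rank);
        adjusted[index] = runningMin;
    }
    return adjusted;
}

// Made-up p-values from four comparisons.
const rawPValues = [0.01, 0.04, 0.03, 0.5];
console.log(bonferroni(rawPValues));
console.log(benjaminiHochberg(rawPValues));
```

Bonferroni is the more conservative of the two: it never produces a smaller adjusted p-value than the false discovery rate adjustment for the same input.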

Diagnostics

See Regression Diagnostics.

Additional Properties

When using this feature you can obtain additional information that is stored by the R code which produces the output.

  1. To do so, select Create > R Output.
  2. In the R CODE, paste one of the code snippets from below.
  3. Replace item with the name of your item. Find this in the Report tree or by selecting the item and then selecting Properties > General from the object inspector on the right.

Properties which may be of interest are:

  • Summary outputs from the regression model:
item$summary$coefficients # summary regression outputs

Acknowledgements

Estimated using:

  • R (R Core Team 2016).
  • survey (Lumley 2004, 2014)
  • car (Fox and Weisberg 2011)
  • MASS (Venables and Ripley 2002)

See How to Read a Standard R Table for acknowledgements regarding the outputs.

References

R Core Team (2016). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.

John Fox and Sanford Weisberg (2011). An R Companion to Applied Regression, Second Edition. Thousand Oaks CA: Sage. URL: http://socserv.socsci.mcmaster.ca/jfox/Books/Companion

T. Lumley (2014) "survey: analysis of complex survey samples". R package version 3.30.

T. Lumley (2004) Analysis of complex survey samples. Journal of Statistical Software 9(1): 1-19.

Venables, W. N. & Ripley, B. D. (2002) Modern Applied Statistics with S. Fourth Edition. Springer, New York. ISBN 0-387-95457-0

Code

form.dropBox({label: "Outcome", 
            types:["Variable: Numeric, Date, Money, Categorical, OrderedCategorical"], 
            name: "formOutcomeVariable",
            prompt: "Dependent target variable to be predicted"});
form.dropBox({label: "Predictor(s)",
            types:["Variable: Numeric, Date, Money, Categorical, OrderedCategorical"], 
            name: "formPredictorVariables", multi:true,
            prompt: "Independent input variables"});

// ALGORITHM
var algorithm = form.comboBox({label: "Algorithm",
               alternatives: ["CART", "Deep Learning", "Gradient Boosting", "Linear Discriminant Analysis",
                              "Random Forest", "Regression", "Support Vector Machine"],
               name: "formAlgorithm", default_value: "Regression",
               prompt: "Machine learning or regression algorithm for fitting the model"}).getValue();
var regressionType = "";
if (algorithm == "Regression")
    regressionType = form.comboBox({label: "Regression type",
                                    alternatives: ["Linear", "Binary Logit", "Ordered Logit", "Multinomial Logit",
                                                   "Poisson", "Quasi-Poisson", "NBD"],
                                    name: "formRegressionType", default_value: "Linear",
                                    prompt: "Select type according to outcome variable type"}).getValue();
form.setHeading((regressionType == "" ? "" : (regressionType + " ")) + algorithm);

// DEFAULT CONTROLS
missing_data_options = ["Error if missing data", "Exclude cases with missing data", "Imputation (replace missing values with estimates)"];

// AMEND DEFAULT CONTROLS PER ALGORITHM
if (algorithm == "Support Vector Machine")
    output_options = ["Accuracy", "Prediction-Accuracy Table", "Detail"];
if (algorithm == "Gradient Boosting") 
    output_options = ["Accuracy", "Importance", "Prediction-Accuracy Table", "Detail"];
if (algorithm == "Random Forest")
    output_options = ["Importance", "Prediction-Accuracy Table", "Detail"];
if (algorithm == "Deep Learning")
    output_options = ["Accuracy", "Prediction-Accuracy Table", "Cross Validation", "Network Layers"];
if (algorithm == "Linear Discriminant Analysis")
    output_options = ["Means", "Detail", "Prediction-Accuracy Table", "Scatterplot", "Moonplot"];

if (algorithm == "CART") {
    output_options = ["Sankey", "Tree", "Text", "Prediction-Accuracy Table", "Cross Validation"];
    missing_data_options = ["Error if missing data", "Exclude cases with missing data",
                             "Use partial data", "Imputation (replace missing values with estimates)"]
}
if (algorithm == "Regression") {
    if (regressionType == "Multinomial Logit")
        output_options = ["Summary", "Detail", "ANOVA"];
    else
        output_options = ["Summary", "Detail", "ANOVA", "Relative Importance Analysis", "Effects Plot"]
    if (regressionType == "Linear")
        missing_data_options = ["Error if missing data", "Exclude cases with missing data", "Use partial data (pairwise correlations)", "Multiple imputation"];
    else
        missing_data_options = ["Error if missing data", "Exclude cases with missing data", "Multiple imputation"];
}

// COMMON CONTROLS FOR ALL ALGORITHMS
var output = form.comboBox({label: "Output", prompt: "The type of output used to show the results", 
              alternatives: output_options, name: "formOutput", default_value: output_options[0]}).getValue();
var missing = form.comboBox({label: "Missing data", 
              alternatives: missing_data_options, name: "formMissing", default_value: "Exclude cases with missing data",
              prompt: "Options for handling cases with missing data"}).getValue();
form.checkBox({label: "Variable names", name: "formNames", default_value: false, prompt: "Display names instead of labels"});

// CONTROLS FOR SPECIFIC ALGORITHMS

if (algorithm == "Support Vector Machine")
    form.textBox({label: "Cost", name: "formCost", default_value: 1, type: "number",
                  prompt: "High cost produces a complex model with risk of overfitting, low cost produces a simpler model with risk of underfitting"});

if (algorithm == "Gradient Boosting") {
    form.comboBox({label: "Booster", 
                  alternatives: ["gbtree", "gblinear"], name: "formBooster", default_value: "gbtree",
                  prompt: "Boost tree or linear underlying models"})
    form.checkBox({label: "Grid search", name: "formSearch", default_value: false,
                   prompt: "Search for optimal hyperparameters"});
}

if (algorithm == "Random Forest")
    if (output == "Importance")
        form.checkBox({label: "Sort by importance", name: "formImportance", default_value: true});

if (algorithm == "Deep Learning") {
    form.numericUpDown({name:"formEpochs", label:"Maximum epochs", default_value: 10, minimum: 1, maximum: 1000000,
                        prompt: "Number of rounds of training"});
    form.textBox({name: "formHiddenLayers", label: "Hidden layers", prompt: "Comma delimited list of the number of nodes in each hidden layer", required: true});
    form.checkBox({label: "Normalize predictors", name: "formNormalize", default_value: true,
                   prompt: "Normalize to zero mean and unit variance"});
}

if (algorithm == "Linear Discriminant Analysis") {
    if (output == "Scatterplot")
    {
        form.colorPicker({label: "Outcome color", name: "formOutColor", default_value:"#5B9BD5"});
        form.colorPicker({label: "Predictors color", name: "formPredColor", default_value:"#ED7D31"});
    }
    form.comboBox({label: "Prior", alternatives: ["Equal", "Observed"], name: "formPrior", default_value: "Observed",
                   prompt: "Probabilities of group membership"})
}

if (algorithm == "CART") {
    form.comboBox({label: "Pruning", alternatives: ["Minimum error", "Smallest tree", "None"], 
                   name: "formPruning", default_value: "Minimum error",
                   prompt: "Remove nodes after tree has been built"})
    form.checkBox({label: "Early stopping", name: "formStopping", default_value: false,
                   prompt: "Stop building tree when fit does not improve"});
    form.comboBox({label: "Predictor category labels", alternatives: ["Full labels", "Abbreviated labels", "Letters"],
                   name: "formPredictorCategoryLabels", default_value: "Abbreviated labels",
                   prompt: "Labelling of predictor categories in the tree"})
    form.comboBox({label: "Outcome category labels", alternatives: ["Full labels", "Abbreviated labels", "Letters"],
                   name: "formOutcomeCategoryLabels", default_value: "Full labels",
                   prompt: "Labelling of outcome categories in the tree"})
    form.checkBox({label: "Allow long-running calculations", name: "formLongRunningCalculations", default_value: false,
                   prompt: "Allow predictors with more than 30 categories"});
}

if (algorithm == "Regression") {
    if (missing == "Multiple imputation")
        form.dropBox({label: "Auxiliary variables",
            types:["Variable: Numeric, Date, Money, Categorical, OrderedCategorical"], 
            name: "formAuxiliaryVariables", required: false, multi:true,
            prompt: "Additional variables to use when imputing missing values"});
    form.comboBox({label: "Correction", alternatives: ["None", "False Discovery Rate", "Bonferroni"], name: "formCorrection",
                   default_value: "None", prompt: "Multiple comparisons correction applied when computing p-values of post-hoc comparisons"});
    var is_RIA = (output == "Relative Importance Analysis");
    if (regressionType == "Linear" && missing != "Use partial data (pairwise correlations)" && missing != "Multiple imputation")
        form.checkBox({label: "Robust standard errors", name: "formRobustSE", default_value: false,
                       prompt: "Standard errors are robust to violations of assumption of constant variance"});
    if (output == "Relative Importance Analysis")
        form.checkBox({label: "Absolute importance scores", name: "formAbsoluteImportance", default_value: false,
                       prompt: "Show absolute instead of signed importances"});
    if (regressionType != "Multinomial Logit" && (is_RIA || output == "Summary"))
        form.dropBox({label: "Crosstab interaction", name: "formInteraction", types:["Variable: Numeric, Date, Money, Categorical, OrderedCategorical"],
                      required: false, prompt: "Categorical variable to test for interaction with other variables"});
}

form.numericUpDown({name:"formSeed", label:"Random seed", default_value: 12321, minimum: 1, maximum: 1000000,
                    prompt: "Initializes randomization for imputation and certain algorithms"});

library(flipMultivariates)

model <- MachineLearning(formula = QFormula(formOutcomeVariable ~ formPredictorVariables),
                                    algorithm = formAlgorithm,
                                    weights = QPopulationWeight, subset = QFilter,
                                    missing = formMissing, output = formOutput, show.labels = !formNames,
                                    seed = get0("formSeed"),
                                    cost = get0("formCost"),
                                    booster = get0("formBooster"),
                                    grid.search = get0("formSearch"),
                                    sort.by.importance = get0("formImportance"),
                                    hidden.nodes = get0("formHiddenLayers"),
                                    max.epochs = get0("formEpochs"),
                                    normalize = get0("formNormalize"),
                                    outcome.color = get0("formOutColor"),
                                    predictors.color = get0("formPredColor"),
                                    prior = get0("formPrior"),
                                    prune = get0("formPruning"),
                                    early.stopping = get0("formStopping"),
                                    predictor.level.treatment = get0("formPredictorCategoryLabels"),
                                    outcome.level.treatment = get0("formOutcomeCategoryLabels"),
                                    long.running.calculations = get0("formLongRunningCalculations"),
                                    type = get0("formRegressionType"),
                                    auxiliary.data = get0("formAuxiliaryVariables"),
                                    correction = get0("formCorrection"),
                                    robust.se = get0("formRobustSE", ifnotfound = FALSE),
                                    importance.absolute = get0("formAbsoluteImportance"),
                                    interaction = get0("formInteraction"),
                                    relative.importance = formOutput == "Relative Importance Analysis")
