# Regression - Binary Logit

The *Binary Logit* is a form of regression analysis that models a binary dependent variable (e.g., yes/no, pass/fail, win/lose). It is also known as logistic regression or binomial regression.

## Data format

The key requirement for a binary logit regression is that the dependent variable is binary. In Displayr, the best data format for this type is “Nominal: Mutually exclusive categories”, with values of “0” and “1”.

The independent variables can be continuous, categorical, or binary — just as with any other regression model.
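As a rough illustration of the 0/1 coding described above, the sketch below recodes a two-category variable outside of Displayr. The helper name `toBinary` and the sample responses are hypothetical, and missing values are assumed to be stored as `null`.

```javascript
// Hypothetical helper: recode a two-category variable to 0/1.
// `positive` is the label treated as the "1" outcome; missing values stay null.
function toBinary(values, positive) {
  const labels = new Set(values.filter((v) => v !== null && v !== undefined));
  if (labels.size > 2) {
    throw new Error("Outcome has more than two categories: " + [...labels].join(", "));
  }
  return values.map((v) => (v === null || v === undefined ? null : v === positive ? 1 : 0));
}

// Hypothetical raw survey responses.
const raw = ["Yes", "No", "No", null, "Yes"];
const outcome = toBinary(raw, "Yes");
console.log(outcome); // 1, 0, 0, null, 1
```

The check on the number of categories is a guard against accidentally passing a variable that is not binary.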

## Interpretation

Variable statistics measure the impact and significance of individual variables within a model, while overall statistics apply to the model as a whole. Both are shown in the binary logit output.

### Variable statistics

**Estimate** The coefficient for each independent variable. Its magnitude indicates the size of the change in the log-odds of the outcome for a one-unit increase in the independent variable. A positive number indicates a direct relationship (y becomes more likely as x increases), and a negative number indicates an inverse relationship (y becomes less likely as x increases).

The coefficient is colored if the variable is statistically significant at the 5% level.
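Because the estimates are on the log-odds scale, it can help to convert them to odds ratios or probabilities. The following is a minimal sketch with a hypothetical intercept and coefficient; the numbers are not taken from any Displayr output.

```javascript
// Logistic (inverse logit) function: maps log-odds to a probability.
function inverseLogit(logOdds) {
  return 1 / (1 + Math.exp(-logOdds));
}

// Hypothetical fitted model: an intercept and one coefficient (log-odds per unit of x).
const intercept = -1.0;
const estimate = 0.5;

// exp(coefficient) is the odds ratio for a one-unit increase in x.
const oddsRatio = Math.exp(estimate);

// Predicted probabilities at x = 0 and x = 2.
const pAtZero = inverseLogit(intercept + estimate * 0);
const pAtTwo = inverseLogit(intercept + estimate * 2);

console.log(oddsRatio.toFixed(3)); // 1.649
console.log(pAtZero.toFixed(3));   // 0.269
console.log(pAtTwo.toFixed(3));    // 0.500
```

Here a coefficient of 0.5 multiplies the odds by about 1.65 for each unit of x, while the change in probability depends on where on the curve the case sits.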

**Standard Error** Measures the precision of an estimate. The smaller the standard error, the more precisely the coefficient has been estimated.

**Z-value** The estimate divided by the standard error. The larger the magnitude (whether positive or negative), the more significant the variable. The values are highlighted based on their magnitude.

**P-value** Expresses the z-value as a probability: the chance of observing a z-value at least this extreme if the true coefficient were zero. A p-value under 0.05 means that the variable is statistically significant at the 5% level; a p-value under 0.01 means that the variable is statistically significant at the 1% level. P-values under 0.05 are shown in bold.
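The relationship between the estimate, standard error, z-value and p-value can be sketched as follows. This assumes a two-sided test against the standard normal distribution with no multiple comparisons correction; the error function uses a standard numerical approximation, and the input values are hypothetical.

```javascript
// Error function approximation (Abramowitz and Stegun formula 7.1.26,
// absolute error of roughly 1.5e-7).
function erf(x) {
  const sign = x < 0 ? -1 : 1;
  const ax = Math.abs(x);
  const t = 1 / (1 + 0.3275911 * ax);
  const poly =
    ((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t - 0.284496736) * t +
      0.254829592) * t;
  return sign * (1 - poly * Math.exp(-ax * ax));
}

// Standard normal cumulative distribution function.
function normalCdf(z) {
  return 0.5 * (1 + erf(z / Math.SQRT2));
}

// z-value and two-sided p-value for a single coefficient.
function waldTest(estimate, standardError) {
  const z = estimate / standardError;
  const p = 2 * (1 - normalCdf(Math.abs(z)));
  return { z, p };
}

const result = waldTest(0.98, 0.5); // hypothetical estimate and standard error
console.log(result.z.toFixed(2)); // 1.96
console.log(result.p.toFixed(3)); // 0.050
```

A z-value of about 1.96 in either direction corresponds to the 5% significance threshold mentioned above.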

### Overall statistics

**n** The sample size of the model.

**McFadden’s rho-squared** Assesses the goodness of fit of the model. A larger number indicates that the model captures more of the variation in the dependent variable.
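McFadden's rho-squared compares the log-likelihood of the fitted model with that of an intercept-only (null) model. The sketch below assumes an unweighted model and uses hypothetical outcomes and fitted probabilities; the function names are illustrative only.

```javascript
// Log-likelihood of binary outcomes y (0/1) given predicted probabilities p.
function logLikelihood(y, p) {
  return y.reduce(
    (sum, yi, i) => sum + (yi === 1 ? Math.log(p[i]) : Math.log(1 - p[i])),
    0
  );
}

// McFadden's rho-squared: 1 - LL(model) / LL(null), where the null model
// predicts the overall proportion of 1s for every case.
function mcFaddenRhoSquared(y, fittedProbabilities) {
  const mean = y.reduce((a, b) => a + b, 0) / y.length;
  const llNull = logLikelihood(y, y.map(() => mean));
  const llModel = logLikelihood(y, fittedProbabilities);
  return 1 - llModel / llNull;
}

// Hypothetical outcomes and fitted probabilities.
const y = [1, 0, 1, 1, 0, 0];
const fitted = [0.8, 0.3, 0.7, 0.6, 0.2, 0.4];
console.log(mcFaddenRhoSquared(y, fitted).toFixed(3));
```

A model that does no better than the null model gives a value of 0, and values move toward 1 as the fitted probabilities get closer to the observed outcomes.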

**AIC** The Akaike information criterion is a measure of the quality of the model. When comparing similar models fitted to the same data, the model with the lower AIC is preferred.
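The AIC trades off fit against the number of estimated parameters. A minimal sketch of the calculation and of a comparison between two models follows; the log-likelihoods and parameter counts are hypothetical.

```javascript
// AIC = 2k - 2 * logLikelihood, where k is the number of estimated parameters.
function aic(logLikelihood, parameterCount) {
  return 2 * parameterCount - 2 * logLikelihood;
}

// Hypothetical log-likelihoods for two models fitted to the same data.
const small = aic(-120.4, 3); // intercept + 2 predictors
const large = aic(-119.9, 6); // intercept + 5 predictors

console.log(small.toFixed(1)); // 246.8
console.log(large.toFixed(1)); // 251.8
console.log(small < large ? "prefer the smaller model" : "prefer the larger model");
```

In this example the three extra predictors improve the log-likelihood only slightly, so the penalty for the added parameters leaves the smaller model with the lower AIC.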

See also Regression Diagnostics.

## Example

The example below is a model that predicts a survey respondent’s likelihood of having consumed a fast-food product based on characteristics like age, gender, and work status.

### Create a Binary Logit Model in Displayr

1. Go to **Insert > Regression > Binary Logit**
2. Under **Inputs > Outcome**, select your dependent variable
3. Under **Inputs > Predictor(s)**, select your independent variables

## Object Inspector Options

**Outcome** The variable to be predicted by the *predictor variables*.

**Predictors** The variable(s) to predict the *outcome*.

**Algorithm** The fitting algorithm. Defaults to *Regression* but may be changed to other machine learning methods.

**Type**: You can use this option to toggle between different types of regression models, but note that the other types are not appropriate for a binary outcome variable.

- **Linear** See Regression - Linear Regression.
- **Binary Logit**
- **Ordered Logit** See Regression - Ordered Logit.
- **Multinomial Logit** See Regression - Multinomial Logit.
- **Poisson** See Regression - Poisson Regression.
- **Quasi-Poisson** See Regression - Quasi-Poisson Regression.
- **NBD** See Regression - NBD Regression.

**Robust standard errors** Computes standard errors that are robust to violations of the assumption of constant variance (i.e., heteroscedasticity). See Robust Standard Errors. This is only available when **Type** is **Linear**.

**Missing data** See Missing Data Options.

**Output**

- **Summary** The default; as shown in the example above.
- **Detail** Typical R output, some additional information compared to **Summary**, but without the pretty formatting.
- **ANOVA** Analysis of variance table containing the results of Chi-squared likelihood ratio tests for each predictor.
- **Relative Importance Analysis** The results of a relative importance analysis. See here and the references for more information. This option is not available for Multinomial Logit. Note that categorical predictors are not converted to be numeric, unlike in Driver (Importance) Analysis - Relative Importance Analysis.
- **Effects Plot** Plots the relationship between each of the *Predictors* and the *Outcome*. Not available for Multinomial Logit.

**Correction** The multiple comparisons correction applied when computing the *p*-values of the *post-hoc* comparisons.
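To give a sense of what the two non-default corrections do to a set of p-values, here is a minimal sketch. It assumes that "False Discovery Rate" refers to the Benjamini-Hochberg procedure, which is an assumption rather than something stated on this page; the p-values are hypothetical.

```javascript
// Bonferroni: multiply each p-value by the number of tests, capped at 1.
function bonferroni(pValues) {
  const m = pValues.length;
  return pValues.map((p) => Math.min(1, p * m));
}

// Benjamini-Hochberg false discovery rate adjustment (assumed FDR method).
function benjaminiHochberg(pValues) {
  const m = pValues.length;
  // Indices ordered from the largest to the smallest p-value.
  const order = pValues.map((p, i) => i).sort((a, b) => pValues[b] - pValues[a]);
  const adjusted = new Array(m);
  let runningMin = 1;
  order.forEach((index, k) => {
    const rank = m - k; // rank of this p-value in ascending order
    runningMin = Math.min(runningMin, (pValues[index] * m) / rank);
    adjusted[index] = runningMin;
  });
  return adjusted;
}

const rawP = [0.01, 0.04, 0.03, 0.2]; // hypothetical unadjusted p-values
console.log(bonferroni(rawP));
console.log(benjaminiHochberg(rawP));
```

Bonferroni is the more conservative of the two: every p-value is inflated by the full number of tests, whereas the false discovery rate adjustment inflates smaller p-values less.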

**Variable names** Displays Variable Names in the output instead of labels.

**Absolute importance scores** Whether the absolute value of Relative Importance Analysis scores should be displayed.

**Auxiliary variables** Variables to be used when imputing missing values (in addition to all the other variables in the model).

**Weight** Where a weight has been set for the R Output, it will automatically be applied when the model is estimated. By default, the weight is assumed to be a *sampling weight*, and the standard errors are estimated using *Taylor series linearization* (by contrast, in the Legacy Regression, *weight calibration* is used). See Weights, Effective Sample Size and Design Effects.

**Filter** The data is automatically filtered using any filters prior to estimating the model.

**Crosstab Interaction** Optional variable to test for interaction with other variables in the model. See Linear Regression for more details.

**Automated outlier removal percentage** Optional control to remove possible outliers in the data. See Linear Regression for more details on the general methodology. For Binary Logit, in both the weighted and unweighted cases, the residual used is a type of surrogate residual. It is computed with the `resids` function, using the jitter parametrization, from the `sure` `R` package (see Greenwell, McCarthy, Boehmke and Liu (2018) for more details).
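The selection step of this option can be pictured with the sketch below. It is only an illustration: it assumes that cases are ranked by the absolute size of their residuals and that the number removed is rounded down, neither of which is confirmed here, and it does not compute surrogate residuals. The function name and residual values are hypothetical.

```javascript
// Return the indices of cases to keep after dropping the given percentage
// of cases with the largest absolute residuals (an assumed ranking rule).
function casesToKeep(residuals, percentToRemove) {
  const n = residuals.length;
  const removeCount = Math.floor((n * percentToRemove) / 100);
  const ranked = residuals
    .map((r, i) => ({ index: i, size: Math.abs(r) }))
    .sort((a, b) => b.size - a.size);
  const removed = new Set(ranked.slice(0, removeCount).map((d) => d.index));
  return residuals.map((r, i) => i).filter((i) => !removed.has(i));
}

// Hypothetical residuals from a model fitted to the full data set.
const residuals = [0.1, -2.9, 0.4, -0.2, 3.1, 0.3, -0.5, 0.2, 0.0, -0.1];
const keep = casesToKeep(residuals, 20); // drop the 2 most extreme of 10 cases
console.log(keep); // 0, 2, 3, 5, 6, 7, 8, 9
```

The model would then be refitted using only the retained cases.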

**Stack data** Whether the input data should be stacked before analysis. Stacking can be desirable when each individual in the data set has multiple cases and an aggregate model is desired. More information is available at Stacking Data Files. If this option is chosen then the *Outcome* needs to be a single Question that has a Multi type structure suitable for binary logit regression, such as a Pick One - Multi. Similarly, the *Predictor(s)* need to be a single Question that has a Grid type structure, such as a Pick Any - Grid or a Number - Grid. In the process of stacking, the is inspected. Any constructed NETs are removed unless comprised of source values that are mutually exclusive to other codes, such as the result of merging two categories.
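As a rough picture of what stacking does, the sketch below turns one row per respondent into one row per respondent-and-brand combination. The data layout, the field names and the `stack` function are all hypothetical and are not Displayr's internal representation.

```javascript
// Hypothetical wide data: one row per respondent, with an outcome per brand
// and a grid of predictors (attribute by brand).
const wide = [
  { id: 1, outcome: { A: 1, B: 0 }, predictors: { price: { A: 3, B: 5 }, taste: { A: 4, B: 2 } } },
  { id: 2, outcome: { A: 0, B: 1 }, predictors: { price: { A: 2, B: 4 }, taste: { A: 1, B: 5 } } },
];

// Stack to one row per respondent-brand combination.
function stack(rows) {
  const stacked = [];
  for (const row of rows) {
    for (const brand of Object.keys(row.outcome)) {
      const record = { id: row.id, brand: brand, outcome: row.outcome[brand] };
      for (const attribute of Object.keys(row.predictors)) {
        record[attribute] = row.predictors[attribute][brand];
      }
      stacked.push(record);
    }
  }
  return stacked;
}

const long = stack(wide);
console.log(long.length); // 4
console.log(long[1]);     // id 1, brand B, outcome 0, price 5, taste 2
```

After stacking, a single aggregate model can be fitted with `outcome` as the dependent variable and the attributes as predictors.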

**Random seed** Seed used to initialize the (pseudo)random number generator for the model fitting algorithm. Different seeds may lead to slightly different answers, but should normally not make a large difference.

Additional options are available by editing the code.

### DIAGNOSTICS

**Cook's distance plot** Creates a line/rug plot showing Cook's Distance for each observation.

**Cook's distance vs leverage plot** Creates a scatterplot showing Cook's distance vs leverage for each observation.

**Influence index plot** Creates index plots of studentized residuals, hat values, and Cook's distance.

**Multicollinearity (VIF) table** Creates a table containing variance inflation factors (VIF) to diagnose multicollinearity.

**Normal Q-Q plot** Creates a normal Quantile-Quantile (QQ) plot to reveal departures of the residuals from normality.

**Prediction-accuracy table** Creates a table showing the observed and predicted values, as a heatmap.

**Residual normality (Shapiro-Wilk) test** Conducts a *Shapiro-Wilk* test of normality on the (deviance) residuals.

**Residuals vs fitted plot** Creates a scatterplot of residuals versus fitted values.

**Residuals vs leverage plot** Creates a plot of residuals versus leverage values.

**Scale-location plot** Creates a plot of the square root of the absolute standardized residuals by fitted values.

**Serial correlation (Durbin-Watson) test** Conducts a *Durbin-Watson* test of serial correlation (auto-correlation) on the residuals.
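Of the diagnostics above, the prediction-accuracy table is the most directly tied to a binary outcome. A minimal sketch of how such a table can be built from fitted probabilities follows; the 0.5 cutoff is an assumption, and the data are hypothetical.

```javascript
// Cross-tabulate observed outcomes (0/1) against predictions, where a case is
// predicted as 1 when its probability reaches the cutoff (0.5 is an assumed default).
function predictionAccuracy(observed, probabilities, cutoff = 0.5) {
  const table = [[0, 0], [0, 0]]; // table[observed][predicted]
  observed.forEach((y, i) => {
    const predicted = probabilities[i] >= cutoff ? 1 : 0;
    table[y][predicted] += 1;
  });
  const correct = table[0][0] + table[1][1];
  return { table, accuracy: correct / observed.length };
}

const observed = [1, 0, 1, 1, 0, 0];
const probabilities = [0.8, 0.3, 0.4, 0.6, 0.7, 0.2]; // hypothetical fitted probabilities
const result = predictionAccuracy(observed, probabilities);
console.log(result.table);               // rows: observed 0 and 1; columns: predicted 0 and 1
console.log(result.accuracy.toFixed(3)); // 0.667
```

The diagonal cells count correct predictions, so accuracy is their sum divided by the number of cases.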

### SAVE VARIABLE(S)

**Save fitted values** Creates a new variable containing fitted values for each case in the data.

**Save predicted probabilities** Creates a new variable containing predicted probabilities of each response.

**Save predicted values** Creates a new variable containing predicted values for each case in the data.

**Save residuals** Creates a new variable containing residual values for each case in the data.

## More information

How to do Logistic Regression in Displayr

How to Interpret Logistic Regression Outputs

How to Interpret Logistic Regression Coefficients

## Acknowledgements

Uses the `glm` function from the `stats` `R` package. If **weights** are supplied, the `svyglm` function from the `survey` `R` package is used. Also uses the `resids` function from the `sure` `R` package. See also Regression - Generalized Linear Model.
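For readers curious about what the fitting involves, `glm` estimates the coefficients by iteratively reweighted least squares, which for the logit link is equivalent to Newton-Raphson on the log-likelihood. The sketch below is not the `glm` implementation; it is a minimal unweighted illustration for one predictor, with hypothetical data and a hypothetical function name.

```javascript
// Fit y ~ intercept + slope * x by Newton-Raphson on the binomial log-likelihood.
function fitLogit(x, y, iterations = 25) {
  let b0 = 0;
  let b1 = 0;
  for (let iter = 0; iter < iterations; iter++) {
    let g0 = 0, g1 = 0;            // gradient of the log-likelihood
    let h00 = 0, h01 = 0, h11 = 0; // information matrix (negative Hessian)
    for (let i = 0; i < x.length; i++) {
      const p = 1 / (1 + Math.exp(-(b0 + b1 * x[i])));
      const w = p * (1 - p);
      g0 += y[i] - p;
      g1 += (y[i] - p) * x[i];
      h00 += w;
      h01 += w * x[i];
      h11 += w * x[i] * x[i];
    }
    // Solve the 2x2 system (information matrix) * step = gradient.
    const det = h00 * h11 - h01 * h01;
    const step0 = (h11 * g0 - h01 * g1) / det;
    const step1 = (h00 * g1 - h01 * g0) / det;
    b0 += step0;
    b1 += step1;
    if (Math.abs(step0) + Math.abs(step1) < 1e-10) break; // converged
  }
  return { intercept: b0, slope: b1 };
}

// Hypothetical data: the outcome becomes more likely as x increases.
const x = [1, 2, 3, 4, 5, 6, 7, 8];
const y = [0, 0, 1, 0, 1, 0, 1, 1];
const fit = fitLogit(x, y);
console.log(fit.intercept.toFixed(3), fit.slope.toFixed(3));
```

The data are deliberately not perfectly separable; with perfect separation the estimates would grow without bound and this simple loop would not converge.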

## References

Greenwell, B. M., McCarthy, A. J., Boehmke, B. C. and Liu, D. (2018). "Residuals and Diagnostics for Binary and Ordinal Regression Models: An Introduction to the sure Package", The R Journal, 10(1), 381-394, doi:10.32614/RJ-2018-004

Yap, J. (2018, August 22). What is logistic regression? [Blog post]. Accessed from https://www.displayr.com/what-is-logistic-regression/.

For relative importance analysis: Johnson, J. W. (2000). A heuristic method for estimating the relative weight of predictor variables in multiple regression. Multivariate behavioral research, 35(1), 1-19.

## Code

```
var controls = [];
// ALGORITHM
var algorithm = form.comboBox({label: "Algorithm",
    alternatives: ["CART", "Deep Learning", "Gradient Boosting", "Linear Discriminant Analysis",
                   "Random Forest", "Regression", "Support Vector Machine"],
    name: "formAlgorithm", default_value: "Regression",
    prompt: "Machine learning or regression algorithm for fitting the model"});
controls.push(algorithm);
algorithm = algorithm.getValue();
var regressionType = "";
if (algorithm == "Regression")
{
    var regressionTypeControl = form.comboBox({label: "Regression type",
        alternatives: ["Linear", "Binary Logit", "Ordered Logit", "Multinomial Logit", "Poisson",
                       "Quasi-Poisson", "NBD"],
        name: "formRegressionType", default_value: "Binary Logit",
        prompt: "Select type according to outcome variable type"});
    regressionType = regressionTypeControl.getValue();
    controls.push(regressionTypeControl);
}
// DEFAULT CONTROLS
var missing_data_options = ["Error if missing data", "Exclude cases with missing data", "Imputation (replace missing values with estimates)"];
var output_options;
// AMEND DEFAULT CONTROLS PER ALGORITHM
if (algorithm == "Support Vector Machine")
    output_options = ["Accuracy", "Prediction-Accuracy Table", "Detail"];
if (algorithm == "Gradient Boosting")
    output_options = ["Accuracy", "Importance", "Prediction-Accuracy Table", "Detail"];
if (algorithm == "Random Forest")
    output_options = ["Importance", "Prediction-Accuracy Table", "Detail"];
if (algorithm == "Deep Learning")
    output_options = ["Accuracy", "Prediction-Accuracy Table", "Cross Validation", "Network Layers"];
if (algorithm == "Linear Discriminant Analysis")
    output_options = ["Means", "Detail", "Prediction-Accuracy Table", "Scatterplot", "Moonplot"];
if (algorithm == "CART") {
    output_options = ["Sankey", "Tree", "Text", "Prediction-Accuracy Table", "Cross Validation"];
    missing_data_options = ["Error if missing data", "Exclude cases with missing data",
                            "Use partial data", "Imputation (replace missing values with estimates)"];
}
if (algorithm == "Regression") {
    if (regressionType == "Multinomial Logit")
        output_options = ["Summary", "Detail", "ANOVA"];
    else if (regressionType == "Linear")
        output_options = ["Summary", "Detail", "ANOVA", "Relative Importance Analysis", "Shapley Regression", "Jaccard Coefficient", "Correlation", "Effects Plot"];
    else
        output_options = ["Summary", "Detail", "ANOVA", "Relative Importance Analysis", "Effects Plot"];
}
// COMMON CONTROLS FOR ALL ALGORITHMS
var outputControl = form.comboBox({label: "Output", prompt: "The type of output used to show the results",
    alternatives: output_options, name: "formOutput",
    default_value: output_options[0]});
controls.push(outputControl);
var output = outputControl.getValue();
// The available missing data options depend on the regression type and output
if (algorithm == "Regression") {
    if (regressionType == "Linear") {
        if (output == "Jaccard Coefficient" || output == "Correlation")
            missing_data_options = ["Error if missing data", "Exclude cases with missing data", "Use partial data (pairwise correlations)"];
        else
            missing_data_options = ["Error if missing data", "Exclude cases with missing data", "Dummy variable adjustment", "Use partial data (pairwise correlations)", "Multiple imputation"];
    }
    else
        missing_data_options = ["Error if missing data", "Exclude cases with missing data", "Dummy variable adjustment", "Multiple imputation"];
}
var missingControl = form.comboBox({label: "Missing data",
    alternatives: missing_data_options, name: "formMissing", default_value: "Exclude cases with missing data",
    prompt: "Options for handling cases with missing data"});
var missing = missingControl.getValue();
controls.push(missingControl);
controls.push(form.checkBox({label: "Variable names", name: "formNames", default_value: false, prompt: "Display names instead of labels"}));
// CONTROLS FOR SPECIFIC ALGORITHMS
if (algorithm == "Support Vector Machine")
    controls.push(form.textBox({label: "Cost", name: "formCost", default_value: 1, type: "number",
        prompt: "High cost produces a complex model with risk of overfitting, low cost produces a simpler model with risk of underfitting"}));
if (algorithm == "Gradient Boosting") {
    controls.push(form.comboBox({label: "Booster",
        alternatives: ["gbtree", "gblinear"], name: "formBooster", default_value: "gbtree",
        prompt: "Boost tree or linear underlying models"}));
    controls.push(form.checkBox({label: "Grid search", name: "formSearch", default_value: false,
        prompt: "Search for optimal hyperparameters"}));
}
if (algorithm == "Random Forest")
    if (output == "Importance")
        controls.push(form.checkBox({label: "Sort by importance", name: "formImportance", default_value: true}));
if (algorithm == "Deep Learning") {
    controls.push(form.numericUpDown({name:"formEpochs", label:"Maximum epochs", default_value: 10, minimum: 1, maximum: 1000000,
        prompt: "Number of rounds of training"}));
    controls.push(form.textBox({name: "formHiddenLayers", label: "Hidden layers", prompt: "Comma delimited list of the number of nodes in each hidden layer", required: true}));
    controls.push(form.checkBox({label: "Normalize predictors", name: "formNormalize", default_value: true,
        prompt: "Normalize to zero mean and unit variance"}));
}
if (algorithm == "Linear Discriminant Analysis") {
    if (output == "Scatterplot")
    {
        controls.push(form.colorPicker({label: "Outcome color", name: "formOutColor", default_value:"#5B9BD5"}));
        controls.push(form.colorPicker({label: "Predictors color", name: "formPredColor", default_value:"#ED7D31"}));
    }
    controls.push(form.comboBox({label: "Prior", alternatives: ["Equal", "Observed"], name: "formPrior", default_value: "Observed",
        prompt: "Probabilities of group membership"}));
}
if (algorithm == "CART") {
    controls.push(form.comboBox({label: "Pruning", alternatives: ["Minimum error", "Smallest tree", "None"],
        name: "formPruning", default_value: "Minimum error",
        prompt: "Remove nodes after tree has been built"}));
    controls.push(form.checkBox({label: "Early stopping", name: "formStopping", default_value: false,
        prompt: "Stop building tree when fit does not improve"}));
    controls.push(form.comboBox({label: "Predictor category labels", alternatives: ["Full labels", "Abbreviated labels", "Letters"],
        name: "formPredictorCategoryLabels", default_value: "Abbreviated labels",
        prompt: "Labelling of predictor categories in the tree"}));
    controls.push(form.comboBox({label: "Outcome category labels", alternatives: ["Full labels", "Abbreviated labels", "Letters"],
        name: "formOutcomeCategoryLabels", default_value: "Full labels",
        prompt: "Labelling of outcome categories in the tree"}));
    controls.push(form.checkBox({label: "Allow long-running calculations", name: "formLongRunningCalculations", default_value: false,
        prompt: "Allow predictors with more than 30 categories"}));
}
// REGRESSION-ONLY CONTROLS
var stacked_check = false;
if (algorithm == "Regression") {
    if (missing == "Multiple imputation")
        controls.push(form.dropBox({label: "Auxiliary variables",
            types:["Variable: Numeric, Date, Money, Categorical, OrderedCategorical"],
            name: "formAuxiliaryVariables", required: false, multi:true,
            prompt: "Additional variables to use when imputing missing values"}));
    controls.push(form.comboBox({label: "Correction", alternatives: ["None", "False Discovery Rate", "Bonferroni"], name: "formCorrection",
        default_value: "None", prompt: "Multiple comparisons correction applied when computing p-values of post-hoc comparisons"}));
    var is_RIA_or_shapley = output == "Relative Importance Analysis" || output == "Shapley Regression";
    var is_Jaccard_or_Correlation = output == "Jaccard Coefficient" || output == "Correlation";
    if (regressionType == "Linear" && missing != "Use partial data (pairwise correlations)" && missing != "Multiple imputation")
        controls.push(form.checkBox({label: "Robust standard errors", name: "formRobustSE", default_value: false,
            prompt: "Standard errors are robust to violations of assumption of constant variance"}));
    if (is_RIA_or_shapley)
        controls.push(form.checkBox({label: "Absolute importance scores", name: "formAbsoluteImportance", default_value: false,
            prompt: "Show absolute instead of signed importances"}));
    if (regressionType != "Multinomial Logit" && (is_RIA_or_shapley || is_Jaccard_or_Correlation || output == "Summary"))
        controls.push(form.dropBox({label: "Crosstab interaction", name: "formInteraction",
            types:["Variable: Numeric, Date, Money, Categorical, OrderedCategorical"],
            required: false, prompt: "Categorical variable to test for interaction with other variables"}));
    if (regressionType !== "Multinomial Logit")
        controls.push(form.numericUpDown({name : "formOutlierProportion", label:"Automated outlier removal percentage", default_value: 0,
            minimum:0, maximum:49.9, increment:0.1,
            prompt: "Data points removed and model refitted based on the residual values in the model using the full dataset"}));
    var stacked_check_box = form.checkBox({label: "Stack data", name: "formStackedData", default_value: false,
        prompt: "Allow input into the Outcome control to be a single multi variable and Predictors to be a single grid variable"});
    stacked_check = stacked_check_box.getValue();
    controls.push(stacked_check_box);
}
controls.push(form.numericUpDown({name:"formSeed", label:"Random seed", default_value: 12321, minimum: 1, maximum: 1000000,
    prompt: "Initializes randomization for imputation and certain algorithms"}));
// OUTCOME AND PREDICTORS (added last, then moved to the top of the form)
var outcome = form.dropBox({label: "Outcome",
    types: [ stacked_check ? "VariableSet: BinaryMulti, NominalMulti, OrdinalMulti, NumericMulti" : "Variable: Numeric, Date, Money, Categorical, OrderedCategorical"],
    multi: false,
    name: "formOutcomeVariable",
    prompt: "Dependent target variable to be predicted"});
var predictors = form.dropBox({label: "Predictor(s)",
    types:[ stacked_check ? "VariableSet: BinaryGrid, NumericGrid" : "Variable: Numeric, Date, Money, Categorical, OrderedCategorical"],
    name: "formPredictorVariables", multi: stacked_check ? false : true,
    prompt: "Independent input variables"});
controls.unshift(predictors);
controls.unshift(outcome);
form.setInputControls(controls);
form.setHeading((regressionType == "" ? "" : (regressionType + " ")) + algorithm);
```

```
library(flipMultivariates)
model <- MachineLearning(
    formula = if (isTRUE(get0("formStackedData"))) as.formula(NULL) else QFormula(formOutcomeVariable ~ formPredictorVariables),
    algorithm = formAlgorithm,
    weights = QPopulationWeight, subset = QFilter,
    missing = formMissing,
    output = if (formOutput == "Shapley Regression") "Shapley regression" else formOutput,
    show.labels = !formNames,
    seed = get0("formSeed"),
    cost = get0("formCost"),
    booster = get0("formBooster"),
    grid.search = get0("formSearch"),
    sort.by.importance = get0("formImportance"),
    hidden.nodes = get0("formHiddenLayers"),
    max.epochs = get0("formEpochs"),
    normalize = get0("formNormalize"),
    outcome.color = get0("formOutColor"),
    predictors.color = get0("formPredColor"),
    prior = get0("formPrior"),
    prune = get0("formPruning"),
    early.stopping = get0("formStopping"),
    predictor.level.treatment = get0("formPredictorCategoryLabels"),
    outcome.level.treatment = get0("formOutcomeCategoryLabels"),
    long.running.calculations = get0("formLongRunningCalculations"),
    type = get0("formRegressionType"),
    auxiliary.data = get0("formAuxiliaryVariables"),
    correction = get0("formCorrection"),
    robust.se = get0("formRobustSE", ifnotfound = FALSE),
    importance.absolute = get0("formAbsoluteImportance"),
    interaction = get0("formInteraction"),
    outlier.prop.to.remove = if (get0("formRegressionType", ifnotfound = "") != "Multinomial Logit") get0("formOutlierProportion")/100 else NULL,
    stacked.data.check = get0("formStackedData"),
    unstacked.data = if (isTRUE(get0("formStackedData"))) list(Y = get0("formOutcomeVariable"), X = get0("formPredictorVariables")) else NULL)
```