# Machine Learning - Linear Discriminant Analysis

This method is only available in Q5.

Fits a *linear discriminant analysis (LDA)* model to predict a categorical variable from two or more numeric variables. Ordered categorical predictors are coerced to numeric values. Unordered categorical predictors are converted to binary dummy variables.
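The two predictor conversions can be illustrated with a minimal JavaScript sketch. The helper names are hypothetical and are not part of the underlying R package. The sketch assumes that ordered levels are replaced by their 1-based rank and that the first level of an unordered predictor is dropped as the reference category; the actual coding used by the procedure may differ.

```javascript
// Hypothetical helpers for illustration only; the procedure's real coding may differ.

// Ordered categorical: replace each level with its 1-based rank (assumed coding).
function orderedToNumeric(values, levels) {
    return values.map(v => levels.indexOf(v) + 1);
}

// Unordered categorical: one 0/1 column per level, except the first level,
// which is treated as the reference category (an assumed convention).
function toDummies(values, levels) {
    return values.map(v => levels.slice(1).map(level => (v === level ? 1 : 0)));
}

const income = ["Low", "High", "Medium"];
const region = ["North", "South", "West", "South"];

console.log(orderedToNumeric(income, ["Low", "Medium", "High"])); // [1, 3, 2]
console.log(toDummies(region, ["North", "South", "West"]));
// [[0, 0], [1, 0], [0, 1], [1, 0]]  (columns: South, West)
```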

See this post for a description of LDA and this post for a practical guide to running LDA in Displayr.

The parameters of the discriminant functions can be extracted with Machine Learning - Diagnostic - Discriminant Functions.

## Example

The table below shows the results of a *linear discriminant analysis* predicting brand preference based on the attributes of the brand. The subtitle shows the predictive accuracy of the model, which in this case is extremely poor, at approximately 7%. The colored shading shows the differences between the means by group. It shows, for example, that the 1,799 Coca-Cola drinkers in the sample have significantly lower ratings of health-conscious, older, and traditional (these are the only significant differences when compared to the mean, which is why they are in bold). We can also see that there are some significant differences relating to Pepsi. The **R-Squared** column shows the proportion of variance within each row that is explained by the groups; in all cases it is very low. See Analysis of Variance - One-Way MANOVA for more detail on the interpretation of the table.
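As a rough illustration of what "proportion of variance explained by the groups" means, the sketch below computes the between-group sum of squares divided by the total sum of squares for a single predictor. It assumes the usual one-way ANOVA definition and ignores weights; the function name and data are made up, and the exact computation behind the **R-Squared** column may differ.

```javascript
// Between-group sum of squares divided by total sum of squares for one predictor.
// Unweighted sketch under the usual one-way ANOVA definition (an assumption).
function rSquaredByGroup(values, groups) {
    const mean = xs => xs.reduce((a, b) => a + b, 0) / xs.length;
    const grandMean = mean(values);

    // Collect the values belonging to each group.
    const byGroup = {};
    values.forEach((v, i) => {
        (byGroup[groups[i]] = byGroup[groups[i]] || []).push(v);
    });

    const ssTotal = values.reduce((s, v) => s + (v - grandMean) ** 2, 0);
    const ssBetween = Object.values(byGroup).reduce(
        (s, xs) => s + xs.length * (mean(xs) - grandMean) ** 2, 0);
    return ssBetween / ssTotal;
}

// Made-up ratings for two groups of three respondents each.
const ratings = [1, 2, 3, 3, 4, 5];
const brand = ["Coke", "Coke", "Coke", "Pepsi", "Pepsi", "Pepsi"];
console.log(rSquaredByGroup(ratings, brand)); // 0.6
```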

There are two reasons why this model is particularly poor:

- The relationship between the predictors and the outcome is weak.
- The **Prior** is set to **Equal**, which assumes that the group sizes in the population are equal. In this example, Coca-Cola is by far the biggest group, so the prior causes the predictive accuracy to be poor.
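The effect of the prior can be seen in a small one-predictor sketch. It assumes two groups with normally distributed values and a shared standard deviation, and all of the numbers are illustrative rather than taken from the example above. When the predictor separates the groups only weakly, equal priors assign a borderline case to the small group, whereas priors that reflect the group sizes assign it to the large group, which is where most such cases actually belong.

```javascript
// Gaussian density with a standard deviation shared by all groups (the LDA assumption).
function density(x, mean, sd) {
    const z = (x - mean) / sd;
    return Math.exp(-0.5 * z * z) / (sd * Math.sqrt(2 * Math.PI));
}

// Posterior probability of each group: prior times density, normalized to sum to 1.
function posterior(x, means, sd, priors) {
    const unnormalized = means.map((m, k) => priors[k] * density(x, m, sd));
    const total = unnormalized.reduce((a, b) => a + b, 0);
    return unnormalized.map(u => u / total);
}

// Illustrative numbers only: two groups whose means are close together.
const means = [5.0, 5.5]; // group 0 is the large group, group 1 the small one
const sd = 1.0;
const x = 5.4;            // a case slightly closer to the small group's mean

const withEqualPrior = posterior(x, means, sd, [0.5, 0.5]);
const withObservedPrior = posterior(x, means, sd, [0.9, 0.1]);

console.log(withEqualPrior[1] > withEqualPrior[0]);       // true: small group predicted
console.log(withObservedPrior[0] > withObservedPrior[1]); // true: large group predicted
```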

## Options

**Outcome** The variable to be predicted by the *predictor variables*.

**Predictors** The numeric variable(s) to predict the *outcome*.

**Algorithm** The machine learning algorithm. Defaults to *Linear Discriminant Analysis* but may be changed to other machine learning methods.

**Output**

- **Means** Produces a table showing the means by category, and assorted statistics to evaluate the LDA.
- **Detail** More detailed diagnostics, from the `lda` function in the R `MASS` package.
- **Prediction-Accuracy Table** Produces a table relating the observed and predicted *outcome*. Also known as a confusion matrix.
- **Scatterplot** A two-dimensional scatterplot of the group centroids in the space of the first two discriminant function variables. This shows which groups are separated by the first two discriminant function variables. Also plotted are the correlations between the predictor variables and the first two discriminant function variables. The group centroids are scaled to appear on the same scale as the correlations.
- **Moonplot** A two-dimensional moonplot, using the same assumptions as the scatterplot.
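To make the idea of a prediction-accuracy table concrete, here is a minimal sketch that cross-tabulates observed against predicted categories and computes the overall accuracy. The function names and data are hypothetical, the counts are unweighted, and the row and column orientation is an assumption rather than a description of the actual output.

```javascript
// Cross-tabulate observed against predicted categories.
// Rows are observed categories and columns are predicted categories (assumed layout).
function predictionAccuracyTable(observed, predicted, categories) {
    const index = new Map(categories.map((c, i) => [c, i]));
    const table = categories.map(() => categories.map(() => 0));
    observed.forEach((obs, i) => {
        table[index.get(obs)][index.get(predicted[i])] += 1;
    });
    return table;
}

// Accuracy is the share of cases that fall on the diagonal.
function accuracy(table) {
    const total = table.flat().reduce((a, b) => a + b, 0);
    const correct = table.reduce((s, row, i) => s + row[i], 0);
    return correct / total;
}

// Made-up data for five respondents.
const categories = ["Coke", "Pepsi"];
const observed = ["Coke", "Coke", "Coke", "Pepsi", "Pepsi"];
const predicted = ["Coke", "Pepsi", "Coke", "Pepsi", "Coke"];

const table = predictionAccuracyTable(observed, predicted, categories);
console.log(table);           // [[2, 1], [1, 1]]
console.log(accuracy(table)); // 0.6
```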

**Outcome color** Color of group centroids in Scatterplot output.

**Predictors color** Color of variable correlations in Scatterplot output.

**Missing data** See Missing Data Options.

**Prior** The prior probabilities used in computing the probabilities of group membership of the **Outcome** (Machine Learning - Save Variable(s) - Probabilities). Note that in the main R function for discriminant analysis (`MASS::lda`), the priors are also used in fitting the model, which means that results from the standard R discriminant analysis differ from the results of this procedure. This procedure matches the results from SPSS.

- **Equal** The prior probabilities are assumed to be equal for each group of the **Outcome**.
- **Observed** Prior computed based on the current (weighted) group sizes. This is the default.
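The two settings can be sketched as follows. This is an illustration under the assumption that the observed prior is each group's share of the total weight; the function names and data are hypothetical and do not come from the underlying R code.

```javascript
// Observed prior: each group's share of the total weight (assumed definition).
function observedPriors(groups, weights) {
    const totals = {};
    let sum = 0;
    groups.forEach((g, i) => {
        totals[g] = (totals[g] || 0) + weights[i];
        sum += weights[i];
    });
    const priors = {};
    for (const g of Object.keys(totals)) priors[g] = totals[g] / sum;
    return priors;
}

// Equal prior: 1 / (number of groups) for every group.
function equalPriors(groups) {
    const unique = [...new Set(groups)];
    return Object.fromEntries(unique.map(g => [g, 1 / unique.length]));
}

// Made-up data: three Coke drinkers (one with a weight of 2) and one Pepsi drinker.
const groups = ["Coke", "Coke", "Coke", "Pepsi"];
const weights = [2, 1, 1, 1];

console.log(observedPriors(groups, weights)); // { Coke: 0.8, Pepsi: 0.2 }
console.log(equalPriors(groups));             // { Coke: 0.5, Pepsi: 0.5 }
```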

**Variable names** Displays Variable Names in the output instead of labels.

**Weight** Where a weight has been set for the R Output, it will automatically be applied when the model is estimated. By default, the weight is assumed to be a *sampling weight*, and the standard errors are estimated using *Taylor series linearization* (by contrast, in the Legacy Regression, *weight calibration* is used). See Weights, Effective Sample Size and Design Effects.

**Filter** The data is automatically filtered using any filters prior to estimating the model.

Additional options are available by editing the code.

## Acknowledgements

The algorithm used for fitting the LDA is a modification of `MASS::lda`, generalized to accommodate weights. The `multcomp` package is used to test comparisons (see also Regression - Generalized Linear Model, which describes the models that are used by `multcomp`). The `survey` package is used to compute the *p*-value for each of the variables in **Means**; a Wald test is used (`regTermTest`).

## Code

```
form.dropBox({label: "Outcome",
types:["Variable: Numeric, Date, Money, Categorical, OrderedCategorical"],
name: "formOutcomeVariable",
prompt: "Dependent target variable to be predicted"});
form.dropBox({label: "Predictor(s)",
types:["Variable: Numeric, Date, Money, Categorical, OrderedCategorical"],
name: "formPredictorVariables", multi:true,
prompt: "Independent input variables"});
// ALGORITHM
var algorithm = form.comboBox({label: "Algorithm",
alternatives: ["CART", "Deep Learning", "Gradient Boosting", "Linear Discriminant Analysis",
"Random Forest", "Regression", "Support Vector Machine"],
name: "formAlgorithm", default_value: "Linear Discriminant Analysis",
prompt: "Machine learning or regression algorithm for fitting the model"}).getValue();
var regressionType = "";
if (algorithm == "Regression")
regressionType = form.comboBox({label: "Regression type",
alternatives: ["Linear", "Binary Logit", "Ordered Logit", "Multinomial Logit", "Poisson",
"Quasi-Poisson", "NBD"],
name: "formRegressionType", default_value: "Linear",
prompt: "Select type according to outcome variable type"}).getValue();
form.setHeading((regressionType == "" ? "" : (regressionType + " ")) + algorithm);
// DEFAULT CONTROLS
missing_data_options = ["Error if missing data", "Exclude cases with missing data", "Imputation (replace missing values with estimates)"];
// AMEND DEFAULT CONTROLS PER ALGORITHM
if (algorithm == "Support Vector Machine")
output_options = ["Accuracy", "Prediction-Accuracy Table", "Detail"];
if (algorithm == "Gradient Boosting")
output_options = ["Accuracy", "Importance", "Prediction-Accuracy Table", "Detail"];
if (algorithm == "Random Forest")
output_options = ["Importance", "Prediction-Accuracy Table", "Detail"];
if (algorithm == "Deep Learning")
output_options = ["Accuracy", "Prediction-Accuracy Table", "Cross Validation", "Network Layers"];
if (algorithm == "Linear Discriminant Analysis")
output_options = ["Means", "Detail", "Prediction-Accuracy Table", "Scatterplot", "Moonplot"];
if (algorithm == "CART") {
output_options = ["Sankey", "Tree", "Text", "Prediction-Accuracy Table", "Cross Validation"];
missing_data_options = ["Error if missing data", "Exclude cases with missing data",
"Use partial data", "Imputation (replace missing values with estimates)"]
}
if (algorithm == "Regression") {
if (regressionType == "Multinomial Logit")
output_options = ["Summary", "Detail", "ANOVA"];
else if (regressionType == "Linear")
output_options = ["Summary", "Detail", "ANOVA", "Relative Importance Analysis", "Shapley Regression", "Effects Plot"];
else
output_options = ["Summary", "Detail", "ANOVA", "Relative Importance Analysis", "Effects Plot"]
if (regressionType == "Linear")
missing_data_options = ["Error if missing data", "Exclude cases with missing data", "Use partial data (pairwise correlations)", "Multiple imputation"];
else
missing_data_options = ["Error if missing data", "Exclude cases with missing data", "Multiple imputation"];
}
// COMMON CONTROLS FOR ALL ALGORITHMS
var output = form.comboBox({label: "Output",
alternatives: output_options, name: "formOutput", default_value: output_options[0]}).getValue();
var missing = form.comboBox({label: "Missing data",
alternatives: missing_data_options, name: "formMissing", default_value: "Exclude cases with missing data",
prompt: "Options for handling cases with missing data"}).getValue();
form.checkBox({label: "Variable names", name: "formNames", default_value: false, prompt: "Display names instead of labels"});
// CONTROLS FOR SPECIFIC ALGORITHMS
if (algorithm == "Support Vector Machine")
form.textBox({label: "Cost", name: "formCost", default_value: 1, type: "number",
prompt: "High cost produces a complex model with risk of overfitting, low cost produces a simpler model with risk of underfitting"});
if (algorithm == "Gradient Boosting") {
form.comboBox({label: "Booster",
alternatives: ["gbtree", "gblinear"], name: "formBooster", default_value: "gbtree",
prompt: "Boost tree or linear underlying models"})
form.checkBox({label: "Grid search", name: "formSearch", default_value: false,
prompt: "Search for optimal hyperparameters"});
}
if (algorithm == "Random Forest")
if (output == "Importance")
form.checkBox({label: "Sort by importance", name: "formImportance", default_value: true});
if (algorithm == "Deep Learning") {
form.numericUpDown({name:"formEpochs", label:"Maximum epochs", default_value: 10, minimum: 1, maximum: 1000000,
prompt: "Number of rounds of training"});
form.textBox({name: "formHiddenLayers", label: "Hidden layers", prompt: "Comma delimited list of the number of nodes in each hidden layer", required: true});
form.checkBox({label: "Normalize predictors", name: "formNormalize", default_value: true,
prompt: "Normalize to zero mean and unit variance"});
}
if (algorithm == "Linear Discriminant Analysis") {
if (output == "Scatterplot")
{
form.colorPicker({label: "Outcome color", name: "formOutColor", default_value:"#5B9BD5"});
form.colorPicker({label: "Predictors color", name: "formPredColor", default_value:"#ED7D31"});
}
form.comboBox({label: "Prior", alternatives: ["Equal", "Observed",], name: "formPrior", default_value: "Observed",
prompt: "Probabilities of group membership"})
}
if (algorithm == "CART") {
form.comboBox({label: "Pruning", alternatives: ["Minimum error", "Smallest tree", "None"],
name: "formPruning", default_value: "Minimum error",
prompt: "Remove nodes after tree has been built"})
form.checkBox({label: "Early stopping", name: "formStopping", default_value: false,
prompt: "Stop building tree when fit does not improve"});
form.comboBox({label: "Predictor category labels", alternatives: ["Full labels", "Abbreviated labels", "Letters"],
name: "formPredictorCategoryLabels", default_value: "Abbreviated labels",
prompt: "Labelling of predictor categories in the tree"})
form.comboBox({label: "Outcome category labels", alternatives: ["Full labels", "Abbreviated labels", "Letters"],
name: "formOutcomeCategoryLabels", default_value: "Full labels",
prompt: "Labelling of outcome categories in the tree"})
form.checkBox({label: "Allow long-running calculations", name: "formLongRunningCalculations", default_value: false,
prompt: "Allow predictors with more than 30 categories"});
}
if (algorithm == "Regression") {
if (missing == "Multiple imputation")
form.dropBox({label: "Auxiliary variables",
types:["Variable: Numeric, Date, Money, Categorical, OrderedCategorical"],
name: "formAuxiliaryVariables", required: false, multi:true,
prompt: "Additional variables to use when imputing missing values"});
form.comboBox({label: "Correction", alternatives: ["None", "False Discovery Rate", "Bonferroni"], name: "formCorrection",
default_value: "None", prompt: "Multiple comparisons correction applied when computing p-values of post-hoc comparisons"});
var is_RIA_or_shapley = output == "Relative Importance Analysis" || output == "Shapley Regression";
if (regressionType == "Linear" && missing != "Use partial data (pairwise correlations)" && missing != "Multiple imputation")
form.checkBox({label: "Robust standard errors", name: "formRobustSE", default_value: false,
prompt: "Standard errors are robust to violations of assumption of constant variance"});
if (is_RIA_or_shapley)
form.checkBox({label: "Absolute importance scores", name: "formAbsoluteImportance", default_value: false,
prompt: "Show absolute instead of signed importances"});
if (regressionType != "Multinomial Logit" && (is_RIA_or_shapley || output == "Summary"))
form.dropBox({label: "Crosstab interaction", name: "formInteraction", types:["Variable: Numeric, Date, Money, Categorical, OrderedCategorical"],
required: false, prompt: "Categorical variable to test for interaction with other variables"});
if (regressionType !== "Multinomial Logit")
form.numericUpDown({name : "formOutlierProportion", label:"Automated outlier removal percentage", default_value: 0,
minimum:0, maximum:49.9, increment:0.1,
prompt: "Data points removed and model refitted based on the residual values in the model using the full dataset"})
}
form.numericUpDown({name:"formSeed", label:"Random seed", default_value: 12321, minimum: 1, maximum: 1000000,
prompt: "Initializes randomization for imputation and certain algorithms"});
```

```
library(flipMultivariates)
model <- MachineLearning(formula = QFormula(formOutcomeVariable ~ formPredictorVariables),
algorithm = formAlgorithm,
weights = QPopulationWeight, subset = QFilter,
missing = formMissing,
output = if (formOutput == "Shapley Regression") "Shapley regression" else formOutput,
show.labels = !formNames,
seed = get0("formSeed"),
cost = get0("formCost"),
booster = get0("formBooster"),
grid.search = get0("formSearch"),
sort.by.importance = get0("formImportance"),
hidden.nodes = get0("formHiddenLayers"),
max.epochs = get0("formEpochs"),
normalize = get0("formNormalize"),
outcome.color = get0("formOutColor"),
predictors.color = get0("formPredColor"),
prior = get0("formPrior"),
prune = get0("formPruning"),
early.stopping = get0("formStopping"),
predictor.level.treatment = get0("formPredictorCategoryLabels"),
outcome.level.treatment = get0("formOutcomeCategoryLabels"),
long.running.calculations = get0("formLongRunningCalculations"),
type = get0("formRegressionType"),
auxiliary.data = get0("formAuxiliaryVariables"),
correction = get0("formCorrection"),
robust.se = get0("formRobustSE", ifnotfound = FALSE),
importance.absolute = get0("formAbsoluteImportance"),
interaction = get0("formInteraction"),
outlier.prop.to.remove = if (get0("formRegressionType", ifnotfound = "") != "Multinomial Logit") get0("formOutlierProportion")/100 else NULL)
```