Machine Learning - Support Vector Machine

Fits a support vector machine[1] for classification or regression.

Usage

To create a Support Vector Machine:

1. In Displayr, select Anything > Advanced Analysis > Machine Learning > Support Vector Machine. In Q, select Create > Classifier > Support Vector Machine.
2. Under Inputs > Support Vector Machine > Outcome select your outcome variable.
3. Under Inputs > Support Vector Machine > Predictor(s) select your predictor variables.
4. Make any other selections as required.

Examples

Categorical outcome

The table below shows the Accuracy as computed by a Support Vector Machine. The Overall Accuracy is the percentage of instances that are correctly categorised by the model. The accuracies of each individual class are also displayed. In this example the model is best at correctly predicting Group C.

The Prediction-Accuracy Table gives a more complete picture of the output, showing the number of observed examples for each class that were predicted to be in each class. In this example 33 instances of Group B are wrongly predicted to be Group A.
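
As a rough illustration of how the two outputs relate, the sketch below derives the overall and per-class accuracy from a prediction-accuracy table. The counts are made up for illustration (only the 33 Group B cases predicted as Group A echo the example above), the function name accuracyFromTable is hypothetical, and the per-class accuracy is assumed to be the share of each observed class that is predicted correctly.

```javascript
// Rows are observed classes, columns are predicted classes.
// Hypothetical counts, chosen only for illustration.
const classes = ["Group A", "Group B", "Group C"];
const table = [
  [60, 25, 15], // observed Group A
  [33, 50, 17], // observed Group B
  [5, 10, 85],  // observed Group C
];

// Overall accuracy: correctly predicted cases (the diagonal) over all cases.
// Per-class accuracy (assumed definition): diagonal cell over its row total.
function accuracyFromTable(table) {
  let correct = 0;
  let total = 0;
  const perClass = table.map((row, i) => {
    const rowTotal = row.reduce((sum, n) => sum + n, 0);
    correct += row[i];
    total += rowTotal;
    return row[i] / rowTotal;
  });
  return { overall: correct / total, perClass };
}

const result = accuracyFromTable(table);
console.log("Overall accuracy:", result.overall);
classes.forEach((name, i) => console.log(name, result.perClass[i]));
```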

Numeric outcome

The tables below show the Support Vector Machine outputs for a numeric outcome variable. Accuracy displays two measures of performance: Root Mean Square Error (the square root of the average squared difference between the predicted and target outcomes) and R-squared (a measure of the fraction of variation in the data that is explained by the model).
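
The two measures can be sketched as follows. This is a minimal illustration with made-up data, not the code used by the product. It assumes R-squared is computed as one minus the ratio of the residual sum of squares to the total sum of squares, and the function names are hypothetical.

```javascript
// Root Mean Square Error: square root of the mean squared prediction error.
function rmse(observed, predicted) {
  const sumSq = observed.reduce((s, y, i) => s + (y - predicted[i]) ** 2, 0);
  return Math.sqrt(sumSq / observed.length);
}

// R-squared (assumed definition): 1 - residual SS / total SS.
function rSquared(observed, predicted) {
  const mean = observed.reduce((s, y) => s + y, 0) / observed.length;
  const ssRes = observed.reduce((s, y, i) => s + (y - predicted[i]) ** 2, 0);
  const ssTot = observed.reduce((s, y) => s + (y - mean) ** 2, 0);
  return 1 - ssRes / ssTot;
}

// Hypothetical target and predicted outcomes.
const observed = [1, 2, 3, 4];
const predicted = [1, 2, 3, 6];
const error = rmse(observed, predicted);
const r2 = rSquared(observed, predicted);
console.log("RMSE:", error, "R-squared:", r2);
```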

For a numeric outcome variable the Prediction-Accuracy Table is generated by bucketing the predicted and target outcomes and indicating when the bucket of a predicted example does or does not match its observed bucket.
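
The bucketing idea can be sketched as below. The page does not say how the buckets are chosen, so this example assumes equal-width buckets spanning the combined range of observed and predicted values. The number of buckets and the function names are hypothetical.

```javascript
// Map a value to one of nBuckets equal-width buckets between min and max.
function bucketIndex(value, min, max, nBuckets) {
  if (max === min) return 0;
  const idx = Math.floor(((value - min) / (max - min)) * nBuckets);
  return Math.min(Math.max(idx, 0), nBuckets - 1); // the maximum falls in the last bucket
}

// Cross-tabulate observed buckets (rows) against predicted buckets (columns).
function bucketedTable(observed, predicted, nBuckets) {
  const all = observed.concat(predicted);
  const min = Math.min(...all);
  const max = Math.max(...all);
  const table = Array.from({ length: nBuckets }, () => new Array(nBuckets).fill(0));
  observed.forEach((y, i) => {
    const row = bucketIndex(y, min, max, nBuckets);
    const col = bucketIndex(predicted[i], min, max, nBuckets);
    table[row][col] += 1;
  });
  return table;
}

// Hypothetical data: cells on the diagonal are cases whose predicted
// bucket matches the observed bucket.
const observed = [0, 2, 5, 9, 10];
const predicted = [1, 6, 4, 8, 10];
const table = bucketedTable(observed, predicted, 2);
console.log(table);
```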

Options

Outcome The variable to be predicted by the predictor variables. It may be either a numeric or categorical variable.

Predictors The variable(s) to predict the outcome.

Algorithm The machine learning algorithm. Defaults to Support Vector Machine but may be changed to other machine learning methods.

Output

Accuracy Produces measures of the goodness of model fit, as illustrated above.
Prediction-Accuracy Table Produces a table relating the observed and predicted outcome. Also known as a confusion matrix.
Detail This returns the default output from svm in the e1071 package[2].

Missing data See Missing Data Options.

Variable names Displays Variable Names in the output instead of labels.

Cost Controls the extent to which the model correctly predicts the outcome for each training example. Low values of cost maximise the margin between the classes when searching for a separating hyperplane, with the trade-off that certain examples may be misclassified (i.e. lie on the wrong side of the hyperplane). High values of cost result in a smaller margin of separation between the classes and fewer misclassifications. Lowering the cost has the impact of increasing the regularisation, which implies higher bias / lower variance and thus controls overfitting. Raising the cost increases the flexibility of the model but for extreme values will decrease the ability to generalise predictions to unseen data. A typical range of cost to explore would be 0.0001 to 10000.
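
One common way to explore this range is a grid of powers of ten, with each value entered as the Cost in turn and the resulting models compared on data that was not used for fitting. The sketch below only generates such a grid. The spacing is an assumption, not a recommendation taken from this page.

```javascript
// Powers of ten from 1e-4 to 1e4, covering the range suggested above.
const costGrid = [];
for (let exponent = -4; exponent <= 4; exponent++) {
  costGrid.push(Number("1e" + exponent));
}
console.log(costGrid);
```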

Random seed Seed used to initialize the (pseudo)random number generator for the model fitting algorithm. Different seeds may lead to slightly different answers, but should normally not make a large difference.

Increase allowed output size Check this box if you encounter a warning message "The R output had size XXX MB, exceeding the 128 MB limit..." and you need to reference the output elsewhere in your document; e.g., to save predicted values to a Data Set or examine diagnostics.

Maximum allowed size for output (MB) This control only appears if Increase allowed output size is checked. Use it to set the maximum allowed size for the output in megabytes. The warning about the R output size referred to above states the minimum size to which the limit must be increased to return the full output. Note that having many large outputs in one document or page may slow down your document and increase load times.

Weight Where a weight has been set for the R Output, a new data set is generated via resampling, and this new data set is used in the estimation.

Filter The data is automatically filtered using any filters prior to estimating the model.

Additional options are available by editing the code.

DIAGNOSTICS

Prediction-Accuracy Table Creates a table showing the observed and predicted values, as a heatmap.

SAVE VARIABLE(S)

Predicted Values Creates a new variable containing predicted values for each case in the data.

Probabilities of Each Response Creates new variables containing predicted probabilities of each response.

Additional Notes for classification

Typically, classification algorithms will estimate the probabilities that an observation belongs to each category and then estimate the category as the one with the highest probability. However, Support Vector Machines do not do this. As a result, the predicted category for an observation (as obtained using the Predicted Values option above) may not be the category with the highest probability (as obtained using the Probabilities of Each Response option above). The sections below describe the technical details of how the Support Vector Machine works out the predicted categories and probabilities.

For a binary classification problem, the Support Vector Machine determines a hyperplane by solving a constrained optimization problem. Support Vector Machines are extended to the multi-class situation using a One vs One approach: a Support Vector Machine is fitted for each pair of labels, and the labels and probabilities are then estimated by aggregating the results of the pairwise classifiers.

Binary Classification

The standard Support Vector Machine (binary classifier) solves the following constrained optimization problem:

[math]\displaystyle{ \begin{align} \max_{\beta_0, \beta_1, \beta_2, \ldots, \beta_p, \epsilon_1, \ldots, \epsilon_n} M \quad \text{such that} \quad \sum_{j = 1}^p \beta_j^2 &= 1\\ y_i ( \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots +\beta_p x_{ip}) &\ge M(1 - \epsilon_i)\\ \epsilon_i \ge 0, \sum_{i = 1}^n \epsilon_i &\le C \end{align} }[/math]

where the data and parameters are defined as follows:

  • The [math]\displaystyle{ y_i }[/math] values denote the observed class labels which are [math]\displaystyle{ -1 }[/math] or [math]\displaystyle{ +1 }[/math] if an observation belongs to the negative or positive class respectively.
  • The [math]\displaystyle{ x_{ij} }[/math] values denote the observed value for respondent [math]\displaystyle{ i }[/math] for variable [math]\displaystyle{ j }[/math].
  • The decision value is defined as [math]\displaystyle{ f(x_i) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots +\beta_p x_{ip} }[/math]. Setting it to zero gives the equation of the hyperplane (solid black line in above plot), and its value is the signed distance of a point from the hyperplane.
  • The [math]\displaystyle{ M }[/math] parameter is the width of the margin from the hyperplane (distance between dotted line and hyperplane above).
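  • The [math]\displaystyle{ C }[/math] parameter is a non-negative budget that allows the points to violate the margin and hyperplane boundary (no violation is allowed if [math]\displaystyle{ C }[/math] is zero and increasing amounts of violation are allowed as [math]\displaystyle{ C }[/math] increases). Because it bounds the total violation, [math]\displaystyle{ C }[/math] in this formulation acts in the opposite direction to the Cost option described above, where higher values lead to fewer misclassifications.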
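
To make the role of the slack terms concrete, the sketch below computes, for given decision values, the smallest slack that satisfies the margin constraint and compares the total with the budget. The decision values, margin width and budget are made-up numbers, and the function names are hypothetical.

```javascript
// Smallest epsilon_i satisfying y_i * f(x_i) >= M * (1 - epsilon_i), epsilon_i >= 0.
function slack(y, f, M) {
  return Math.max(0, 1 - (y * f) / M);
}

// Interpret the slack: 0 means no violation, up to 1 means the point violates
// the margin, and above 1 means it is on the wrong side of the hyperplane.
function describe(e) {
  if (e === 0) return "no violation";
  if (e <= 1) return "violates the margin";
  return "wrong side of the hyperplane";
}

// Hypothetical observations: class label y and decision value f.
const M = 2;
const points = [
  { y: 1, f: 3 },
  { y: 1, f: 1 },
  { y: -1, f: 1 },
  { y: -1, f: -2.5 },
];
const slacks = points.map((p) => slack(p.y, p.f, M));
const totalSlack = slacks.reduce((s, e) => s + e, 0);
slacks.forEach((e) => console.log(e, describe(e)));

// The constraint on the budget requires the total slack not to exceed C.
const C = 2.5; // hypothetical budget
console.log("Within budget:", totalSlack <= C);
```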

Once the hyperplane equation is estimated, the labels and probabilities of label membership are determined using the signed distance from the hyperplane. Nonlinear boundaries are accounted for using the standard ‘kernel trick’ and everything below will still follow.

Class prediction

Labels are estimated from the side of the hyperplane on which an observation falls. The sign of the decision value [math]\displaystyle{ f(x) }[/math] (the signed distance from the hyperplane) determines the estimated label: if [math]\displaystyle{ f(x) \lt 0 }[/math] the negative class is predicted, and if [math]\displaystyle{ f(x) \gt 0 }[/math] the positive class is predicted.
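
A minimal sketch of this rule for a linear boundary is shown below. The coefficients and observations are made up, the function names are hypothetical, and the handling of a decision value of exactly zero (assigned to the negative class here) is an assumption, since the text does not cover that case.

```javascript
// Decision value f(x) = beta0 + beta1 * x1 + ... + betap * xp.
function decisionValue(beta0, beta, x) {
  return beta.reduce((sum, b, j) => sum + b * x[j], beta0);
}

// Predict +1 when f(x) > 0 and -1 otherwise.
// Assumption: f(x) === 0 is assigned to the negative class.
function predictLabel(beta0, beta, x) {
  return decisionValue(beta0, beta, x) > 0 ? 1 : -1;
}

// Hypothetical coefficients with beta1^2 + beta2^2 = 1.
const beta0 = -1;
const beta = [0.6, 0.8];
console.log(predictLabel(beta0, beta, [3, 4])); // f is about 4, so the positive class
console.log(predictLabel(beta0, beta, [0, 0])); // f is -1, so the negative class
```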

Probability prediction

To estimate the probability that an observation belongs to each class, the logistic mapping is used to convert the signed distances (decision values) to probabilities. That is, each observation produces a decision value [math]\displaystyle{ f(x) }[/math]. These decision values are mapped using the logistic function,

[math]\displaystyle{ \begin{align} P(Y = 1|x) = \frac{1}{1 + \exp\left( A f(x) + B\right)} \end{align} }[/math]


The [math]\displaystyle{ A }[/math] and [math]\displaystyle{ B }[/math] parameters are estimated by maximizing the log-likelihood of the data (equivalently, minimizing the negative log-likelihood). This is equivalent to fitting a logistic regression with the decision values as the single predictor.
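
The mapping itself is a one-line function, sketched below with made-up parameter values (the function name is hypothetical). With this parameterization the fitted A is typically negative, so that larger decision values give higher probabilities. The last call illustrates how a non-zero B can produce a probability below 0.5 for an observation whose decision value is positive, which is why the predicted class and the most probable class can differ.

```javascript
// P(Y = 1 | x) = 1 / (1 + exp(A * f(x) + B))
function plattProbability(f, A, B) {
  return 1 / (1 + Math.exp(A * f + B));
}

// Hypothetical parameter values, not fitted to any data.
console.log(plattProbability(0, -2, 0));     // 0.5 on the hyperplane when B is 0
console.log(plattProbability(1, -2, 0));     // above 0.5 for a positive decision value
console.log(plattProbability(0.1, -2, 0.5)); // below 0.5 although f(x) > 0
```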

Multi-class Classification

The Support Vector Machine cannot directly handle multi-class classification data. Instead, all pairs of class labels are separately fitted using binary classification Support Vector Machines. A One vs One approach is used to determine the predictions for each class.

Class predictions

Each pairwise classifier is inspected, and the class that wins the most pairwise comparisons is taken as the predicted class.
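
The voting step can be sketched as follows. The pairwise winners are supplied by a hypothetical lookup rather than by fitted classifiers, the function name is hypothetical, and ties are broken in favour of the class listed first, which is an assumption.

```javascript
// Count one vote per pair of classes for the winner of that pairwise comparison.
function oneVsOneVote(classes, pairwiseWinner) {
  const votes = new Map(classes.map((c) => [c, 0]));
  for (let i = 0; i < classes.length; i++) {
    for (let j = i + 1; j < classes.length; j++) {
      const winner = pairwiseWinner(classes[i], classes[j]);
      votes.set(winner, votes.get(winner) + 1);
    }
  }
  // Assumption: ties go to the class that appears first in the list.
  let best = classes[0];
  for (const c of classes) {
    if (votes.get(c) > votes.get(best)) best = c;
  }
  return { predicted: best, votes };
}

// Hypothetical outcomes of the three pairwise classifiers for one observation.
const outcomes = { "A|B": "B", "A|C": "C", "B|C": "B" };
const result = oneVsOneVote(["A", "B", "C"], (a, b) => outcomes[a + "|" + b]);
console.log(result.predicted, Object.fromEntries(result.votes));
```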

Probability prediction

Each pairwise model will generate a probability of belonging to the positive class in each case. These pairwise probabilities are then used to estimate the multi-class probabilities via the second algorithm in Wu, Lin and Weng (2003) [3].
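
The sketch below is intended to follow the second method of the cited paper: it seeks probabilities p that minimize the sum over all pairs of (r_ji p_i - r_ij p_j)^2 subject to the probabilities summing to one, using a fixed-point iteration. Treat it as an illustration rather than the exact code that runs. The function name, iteration limit and tolerance are assumptions, and the pairwise probabilities are constructed from made-up class probabilities so that the result can be checked.

```javascript
// r[i][j] is the pairwise estimate of P(class i | class i or j, x),
// with r[i][j] + r[j][i] = 1. Diagonal entries are ignored.
function coupleProbabilities(r, maxIter = 1000, eps = 1e-10) {
  const k = r.length;
  const p = new Array(k).fill(1 / k);
  // Q[t][t] = sum over j != t of r[j][t]^2, and Q[t][j] = -r[j][t] * r[t][j].
  const Q = Array.from({ length: k }, () => new Array(k).fill(0));
  for (let t = 0; t < k; t++) {
    for (let j = 0; j < k; j++) {
      if (j === t) continue;
      Q[t][t] += r[j][t] * r[j][t];
      Q[t][j] = -r[j][t] * r[t][j];
    }
  }
  const Qp = new Array(k).fill(0);
  for (let iter = 0; iter < maxIter; iter++) {
    // Recompute Qp and p'Qp from scratch for numerical accuracy.
    let pQp = 0;
    for (let t = 0; t < k; t++) {
      Qp[t] = 0;
      for (let j = 0; j < k; j++) Qp[t] += Q[t][j] * p[j];
      pQp += p[t] * Qp[t];
    }
    // At the optimum every Qp[t] equals p'Qp.
    let maxError = 0;
    for (let t = 0; t < k; t++) {
      maxError = Math.max(maxError, Math.abs(Qp[t] - pQp));
    }
    if (maxError < eps) break;
    // Update one coordinate at a time, then renormalize so p sums to one.
    for (let t = 0; t < k; t++) {
      const diff = (pQp - Qp[t]) / Q[t][t];
      p[t] += diff;
      pQp = (pQp + diff * (diff * Q[t][t] + 2 * Qp[t])) / ((1 + diff) * (1 + diff));
      for (let j = 0; j < k; j++) {
        Qp[j] = (Qp[j] + diff * Q[t][j]) / (1 + diff);
        p[j] /= 1 + diff;
      }
    }
  }
  return p;
}

// Hypothetical class probabilities used to build consistent pairwise estimates.
const truth = [0.5, 0.3, 0.2];
const r = truth.map((pi, i) =>
  truth.map((pj, j) => (i === j ? 0 : pi / (pi + pj)))
);
const estimated = coupleProbabilities(r);
console.log(estimated); // expected to be close to the probabilities in truth
```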


Acknowledgements

  1. Cortes, C., Vapnik, V. (1995). Support-vector networks. Machine Learning, 20, 273–297. doi: https://doi.org/10.1007/BF00994018
  2. Meyer, D., Dimitriadou, E., Hornik, K., Weingessel, A., Leisch, F. (2023). e1071: Misc Functions of the Department of Statistics, Probability Theory Group (Formerly: E1071), TU Wien. R package version 1.7-13. https://CRAN.R-project.org/package=e1071
  3. Wu, T.-F., Lin, C.-J., Weng, R. (2003). Probability estimates for multi-class classification by pairwise coupling. Journal of Machine Learning Research, 5, 975–1005. doi: https://doi.org/10.5555/1005332.1016791

More information

This blog post explains the concept of support vector machines.
The process of determining the cost parameter is described here.

Code

var controls = [];

// ALGORITHM
var algorithm = form.comboBox({label: "Algorithm",
                               alternatives: ["CART", "Deep Learning", "Gradient Boosting", "Linear Discriminant Analysis",
                                              "Random Forest", "Regression", "Support Vector Machine"],
                               name: "formAlgorithm", default_value: "Support Vector Machine",
                               prompt: "Machine learning or regression algorithm for fitting the model"});

controls.push(algorithm);
algorithm = algorithm.getValue();

var regressionType = "";
if (algorithm == "Regression")
{
    var regressionTypeControl = form.comboBox({label: "Regression type",
                                           alternatives: ["Linear", "Binary Logit", "Ordered Logit", "Multinomial Logit", "Poisson",
                                                          "Quasi-Poisson", "NBD"], 
                                           name: "formRegressionType", default_value: "Linear",
                                           prompt: "Select type according to outcome variable type"});
    regressionType = regressionTypeControl.getValue();
    controls.push(regressionTypeControl);
}

// DEFAULT CONTROLS
var output_options = [];
var missing_data_options = ["Error if missing data", "Exclude cases with missing data", "Imputation (replace missing values with estimates)"];

// AMEND DEFAULT CONTROLS PER ALGORITHM
if (algorithm == "Support Vector Machine")
    output_options = ["Accuracy", "Prediction-Accuracy Table", "Detail"];
if (algorithm == "Gradient Boosting") 
    output_options = ["Accuracy", "Importance", "Prediction-Accuracy Table", "Detail"];
if (algorithm == "Random Forest")
    output_options = ["Importance", "Prediction-Accuracy Table", "Detail"];
if (algorithm == "Deep Learning")
    output_options = ["Accuracy", "Prediction-Accuracy Table", "Cross Validation", "Network Layers"];
if (algorithm == "Linear Discriminant Analysis")
    output_options = ["Means", "Detail", "Prediction-Accuracy Table", "Scatterplot", "Moonplot"];

if (algorithm == "CART") {
    output_options = ["Sankey", "Tree", "Text", "Prediction-Accuracy Table", "Cross Validation"];
    missing_data_options = ["Error if missing data", "Exclude cases with missing data",
                             "Use partial data", "Imputation (replace missing values with estimates)"];
}
if (algorithm == "Regression") {
    if (regressionType == "Multinomial Logit")
        output_options = ["Summary", "Detail", "ANOVA"];
    else if (regressionType == "Linear")
        output_options = ["Summary", "Detail", "ANOVA", "Relative Importance Analysis", "Shapley Regression", "Jaccard Coefficient", "Correlation", "Effects Plot"];
    else
        output_options = ["Summary", "Detail", "ANOVA", "Relative Importance Analysis", "Effects Plot"];
}

// COMMON CONTROLS FOR ALL ALGORITHMS
var outputControl = form.comboBox({label: "Output", prompt: "The type of output used to show the results",
                                   alternatives: output_options, name: "formOutput",
                                   default_value: output_options[0]});
controls.push(outputControl);
var output = outputControl.getValue();

if (algorithm == "Regression") {
    if (regressionType == "Linear") {
        if (output == "Jaccard Coefficient" || output == "Correlation")
            missing_data_options = ["Error if missing data", "Exclude cases with missing data", "Use partial data (pairwise correlations)"];
        else
            missing_data_options = ["Error if missing data", "Exclude cases with missing data", "Dummy variable adjustment", "Use partial data (pairwise correlations)", "Multiple imputation"];
    }        
    else
        missing_data_options = ["Error if missing data", "Exclude cases with missing data", "Dummy variable adjustment", "Multiple imputation"];
}

var missingControl = form.comboBox({label: "Missing data", 
                                    alternatives: missing_data_options, name: "formMissing", default_value: "Exclude cases with missing data",
                                    prompt: "Options for handling cases with missing data"});
var missing = missingControl.getValue();
controls.push(missingControl);
controls.push(form.checkBox({label: "Variable names", name: "formNames", default_value: false, prompt: "Display names instead of labels"}));

// CONTROLS FOR SPECIFIC ALGORITHMS

if (algorithm == "Support Vector Machine")
    controls.push(form.textBox({label: "Cost", name: "formCost", default_value: 1, type: "number",
                                prompt: "High cost produces a complex model with risk of overfitting, low cost produces a simpler model with risk of underfitting"}));

if (algorithm == "Gradient Boosting") {
    controls.push(form.comboBox({label: "Booster", 
                                 alternatives: ["gbtree", "gblinear"], name: "formBooster", default_value: "gbtree",
                                 prompt: "Boost tree or linear underlying models"}));
    controls.push(form.checkBox({label: "Grid search", name: "formSearch", default_value: false,
                                 prompt: "Search for optimal hyperparameters"}));
}

if (algorithm == "Random Forest")
    if (output == "Importance")
        controls.push(form.checkBox({label: "Sort by importance", name: "formImportance", default_value: true}));

if (algorithm == "Deep Learning") {
    controls.push(form.numericUpDown({name:"formEpochs", label:"Maximum epochs", default_value: 10, minimum: 1, maximum: Number.MAX_SAFE_INTEGER,
                                      prompt: "Number of rounds of training"}));
    controls.push(form.textBox({name: "formHiddenLayers", label: "Hidden layers", prompt: "Comma delimited list of the number of nodes in each hidden layer", required: true}));
    controls.push(form.checkBox({label: "Normalize predictors", name: "formNormalize", default_value: true,
                                 prompt: "Normalize to zero mean and unit variance"}));
}

if (algorithm == "Linear Discriminant Analysis") {
    if (output == "Scatterplot")
    {
        controls.push(form.colorPicker({label: "Outcome color", name: "formOutColor", default_value:"#5B9BD5"}));
        controls.push(form.colorPicker({label: "Predictors color", name: "formPredColor", default_value:"#ED7D31"}));
    }
    controls.push(form.comboBox({label: "Prior", alternatives: ["Equal", "Observed"], name: "formPrior", default_value: "Observed",
                                 prompt: "Probabilities of group membership"}));
}

if (algorithm == "CART") {
    controls.push(form.comboBox({label: "Pruning", alternatives: ["Minimum error", "Smallest tree", "None"], 
                                 name: "formPruning", default_value: "Minimum error",
                                 prompt: "Remove nodes after tree has been built"}));
    controls.push(form.checkBox({label: "Early stopping", name: "formStopping", default_value: false,
                                 prompt: "Stop building tree when fit does not improve"}));
    controls.push(form.comboBox({label: "Predictor category labels", alternatives: ["Full labels", "Abbreviated labels", "Letters"],
                                 name: "formPredictorCategoryLabels", default_value: "Abbreviated labels",
                                 prompt: "Labelling of predictor categories in the tree"}));
    controls.push(form.comboBox({label: "Outcome category labels", alternatives: ["Full labels", "Abbreviated labels", "Letters"],
                                 name: "formOutcomeCategoryLabels", default_value: "Full labels",
                                 prompt: "Labelling of outcome categories in the tree"}));
    controls.push(form.checkBox({label: "Allow long-running calculations", name: "formLongRunningCalculations", default_value: false,
                                 prompt: "Allow predictors with more than 30 categories"}));
}

var stacked_check = false;
if (algorithm == "Regression") {
    if (missing == "Multiple imputation")
        controls.push(form.dropBox({label: "Auxiliary variables",
                                    types:["Variable: Numeric, Date, Money, Categorical, OrderedCategorical"], 
                                    name: "formAuxiliaryVariables", required: false, multi:true,
                                    prompt: "Additional variables to use when imputing missing values"}));
    controls.push(form.comboBox({label: "Correction", alternatives: ["None", "False Discovery Rate", "Bonferroni"], name: "formCorrection",
                                 default_value: "None", prompt: "Multiple comparisons correction applied when computing p-values of post-hoc comparisons"}));
    var is_RIA_or_shapley = output == "Relative Importance Analysis" || output == "Shapley Regression";
    var is_Jaccard_or_Correlation = output == "Jaccard Coefficient" || output == "Correlation";
    if (regressionType == "Linear" && missing != "Use partial data (pairwise correlations)" && missing != "Multiple imputation")
        controls.push(form.checkBox({label: "Robust standard errors", name: "formRobustSE", default_value: false,
                                     prompt: "Standard errors are robust to violations of assumption of constant variance"}));
    if (is_RIA_or_shapley)
        controls.push(form.checkBox({label: "Absolute importance scores", name: "formAbsoluteImportance", default_value: false,
                                     prompt: "Show absolute instead of signed importances"}));
    if (regressionType != "Multinomial Logit" && (is_RIA_or_shapley || is_Jaccard_or_Correlation || output == "Summary"))
        controls.push(form.dropBox({label: "Crosstab interaction", name: "formInteraction", types:["Variable: Numeric, Date, Money, Categorical, OrderedCategorical"],
                                    required: false, prompt: "Categorical variable to test for interaction with other variables"}));
    if (regressionType !== "Multinomial Logit")
        controls.push(form.numericUpDown({name : "formOutlierProportion", label:"Automated outlier removal percentage", default_value: 0, 
                                          minimum:0, maximum:49.9, increment:0.1,
                                          prompt: "Data points removed and model refitted based on the residual values in the model using the full dataset"}));
    var stacked_check_box = form.checkBox({label: "Stack data", name: "formStackedData", default_value: false,
                                           prompt: "Allow input into the Outcome control to be a single multi variable and Predictors to be a single grid variable"});
    stacked_check = stacked_check_box.getValue();
    controls.push(stacked_check_box);
}

controls.push(form.numericUpDown({name:"formSeed", label:"Random seed", default_value: 12321, minimum: 1, maximum: Number.MAX_SAFE_INTEGER,
                                  prompt: "Initializes randomization for imputation and certain algorithms"}));

let allowLargeOutputsCtrl = form.checkBox({label: "Increase allowed output size",
                                           name: "formAllowLargeOutputs", default_value: false,
                                           prompt: "Increase the limit on the maximum size allowed for the output to fix warnings about it being too large"});
controls.push(allowLargeOutputsCtrl);
if (allowLargeOutputsCtrl.getValue())
    controls.push(form.numericUpDown({name:"formMaxOutputSize", label:"Maximum allowed size for output (MB)", default_value: 128, minimum: 1, maximum: Number.MAX_SAFE_INTEGER,
                                      prompt: "The maximum allowed size for the returned output in MB. Very large outputs may impact document performance"}));

var outcome = form.dropBox({label: "Outcome", 
                            types: [ stacked_check ? "VariableSet: BinaryMulti, NominalMulti, OrdinalMulti, NumericMulti" : "Variable: Numeric, Date, Money, Categorical, OrderedCategorical"], 
                            multi: false,
                            name: "formOutcomeVariable",
                            prompt: "Dependent target variable to be predicted"});
var predictors = form.dropBox({label: "Predictor(s)",
                               types:[ stacked_check ? "VariableSet: BinaryGrid, NumericGrid" : "Variable: Numeric, Date, Money, Categorical, OrderedCategorical"], 
                               name: "formPredictorVariables", multi: stacked_check ? false : true,
                               prompt: "Independent input variables"});

controls.unshift(predictors);
controls.unshift(outcome);

form.setInputControls(controls);
var heading_text = "";
if (regressionType == "") {
    heading_text = algorithm;
} else {    
    heading_text = regressionType + " " + algorithm;
}

if (!!form.setObjectInspectorTitle)
    form.setObjectInspectorTitle(heading_text, heading_text);
else 
    form.setHeading(heading_text);

# R CODE
library(flipMultivariates)
if (get0("formAllowLargeOutputs", ifnotfound = FALSE))
    QAllowLargeResultObject(1e6*get0("formMaxOutputSize"))

WarnIfVariablesSelectedFromMultipleDataSets()

model <- MachineLearning(formula = if (isTRUE(get0("formStackedData"))) as.formula(NULL) else QFormula(formOutcomeVariable ~ formPredictorVariables),
                         algorithm = formAlgorithm,
                         weights = QPopulationWeight, subset = QFilter,
                         missing = formMissing,
                         output = if (formOutput == "Shapley Regression") "Shapley regression" else formOutput,
                         show.labels = !formNames,
                         seed = get0("formSeed"),
                         cost = get0("formCost"),
                         booster = get0("formBooster"),
                         grid.search = get0("formSearch"),
                         sort.by.importance = get0("formImportance"),
                         hidden.nodes = get0("formHiddenLayers"),
                         max.epochs = get0("formEpochs"),
                         normalize = get0("formNormalize"),
                         outcome.color = get0("formOutColor"),
                         predictors.color = get0("formPredColor"),
                         prior = get0("formPrior"),
                         prune = get0("formPruning"),
                         early.stopping = get0("formStopping"),
                         predictor.level.treatment = get0("formPredictorCategoryLabels"),
                         outcome.level.treatment = get0("formOutcomeCategoryLabels"),
                         long.running.calculations = get0("formLongRunningCalculations"),
                         type = get0("formRegressionType"),
                         auxiliary.data = get0("formAuxiliaryVariables"),
                         correction = get0("formCorrection"),
                         robust.se = get0("formRobustSE", ifnotfound = FALSE),
                         importance.absolute = get0("formAbsoluteImportance"),
                         interaction = get0("formInteraction"),
                         outlier.prop.to.remove = if (get0("formRegressionType", ifnotfound = "") != "Multinomial Logit") get0("formOutlierProportion")/100 else NULL,
                         stacked.data.check = get0("formStackedData"),
                         unstacked.data = if (isTRUE(get0("formStackedData"))) list(Y = get0("formOutcomeVariable"), X = get0("formPredictorVariables")) else NULL,
                         use.combined.scatter = TRUE)