Classifier - Random Forest


*** Classifier functions are being renamed Machine Learning ***

This page will soon be removed, please see the relevant Machine Learning page.

This method is only available in Q5.

Fits a random forest of classification or regression trees.


Categorical outcome

The table below shows the variable importance as computed by a Random Forest. The column called MeanDecreaseAccuracy measures the extent to which a variable improves the accuracy of the forest in predicting the classification. Higher values mean that the variable improves prediction more. In a rough sense, it can be interpreted as the increase in classification accuracy gained by including the variable in the model (a more precise statement is complicated and requires a detailed understanding of the underlying mechanics of random forests). In this example, x1 is clearly the most important variable, followed by x2 and then x3.

The first three columns show how much each variable improves accuracy for each category of the outcome variable. In this example, x1's importance as a predictor is largely due to its usefulness in predicting membership of Group C, whereas x2 primarily improves prediction of Group A, followed by Group C, and has a marginally deleterious impact on prediction of Group B.
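As a rough illustration of how a table with this layout can be produced, the sketch below fits a classification forest with the randomForest package and prints its importance matrix. The iris data set and the seed are stand-ins chosen for the example; they are not the data behind the table described above.

```r
library(randomForest)

set.seed(12321)  # arbitrary seed so the sketch is reproducible

# iris stands in for the example data; Species plays the role of the outcome
rf <- randomForest(Species ~ ., data = iris, importance = TRUE)

# One row per predictor. The first columns are named after the outcome
# categories, followed by MeanDecreaseAccuracy and MeanDecreaseGini.
imp <- importance(rf)
print(colnames(imp))
print(round(imp, 2))
```

The per-category columns are only computed when importance = TRUE is passed to randomForest.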

Importance (MeanDecreaseGini) provides a more nuanced measure of importance, which factors in both the contribution that the variable makes to accuracy and the degree of misclassification (e.g., if a variable improves the probability of an observation being classified to a segment from 55% to 90%, this will show up in Importance (MeanDecreaseGini), but not in MeanDecreaseAccuracy). As with MeanDecreaseAccuracy, higher numbers indicate that a variable is more important as a predictor.
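The 55% versus 90% example can be made concrete with a small calculation. The sketch below is a simplified, single-node illustration in base R, not the aggregation that randomForest performs across all trees; the function name gini is chosen here for illustration.

```r
# Gini impurity of a node, given the proportion of cases in each class
gini <- function(p) 1 - sum(p^2)

before <- c(0.55, 0.45)  # 55% of cases fall in the majority class
after  <- c(0.90, 0.10)  # 90% of cases fall in the majority class

gini(before)  # 0.495
gini(after)   # 0.18

# The majority class is the same in both nodes, so the predicted class,
# and hence the accuracy, is unchanged even though impurity has fallen.
same_prediction <- which.max(before) == which.max(after)
same_prediction  # TRUE
```

The drop in impurity from 0.495 to 0.18 is the kind of improvement that a Gini-based measure registers and an accuracy-based measure does not.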

Numeric outcome

The table below shows the random forest outputs for a numeric outcome variable. The first column can be interpreted as indicating the extent to which each variable explains the variance in the outcome variable. The second column can be interpreted as showing the extent to which each variable reduces uncertainty in the predictions of the model. As with the categorical outcome described above, these are only rough "translations" of the true meaning of these metrics. It is not clear which metric is better for judging importance.
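For a numeric outcome, the randomForest package reports two importance columns named %IncMSE (the increase in mean squared error when the variable is permuted) and IncNodePurity (the total decrease in residual sum of squares from splitting on the variable). The sketch below shows how they can be obtained; the mtcars data set and the seed are stand-ins for the example, and whether the table above uses these exact column labels is an assumption.

```r
library(randomForest)

set.seed(12321)  # arbitrary seed so the sketch is reproducible

# mtcars stands in for the example data; mpg is the numeric outcome
rf_reg <- randomForest(mpg ~ ., data = mtcars, importance = TRUE)

# One row per predictor, two columns of importance
imp_reg <- importance(rf_reg)
print(colnames(imp_reg))
print(round(imp_reg, 2))
```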


Outcome The variable to be predicted by the predictor variables. If it is numeric, a forest of regression trees is estimated; if it is categorical, a forest of classification trees is estimated.

Predictors The variable(s) to predict the outcome.


Output
Importance Produces importance tables, as illustrated above.
Detail Returns the default output from randomForest in the randomForest package. It includes a confusion matrix for classification trees, and the percentage of variance explained for regression trees.
Prediction-Accuracy Table Produces a table relating the observed and predicted outcome. Also known as a confusion matrix.
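To show what a table relating observed and predicted outcomes looks like, the sketch below builds a confusion matrix in base R from two hypothetical vectors. The values are invented for illustration and do not come from any model on this page.

```r
# Hypothetical observed and predicted categories for eight cases
groups    <- c("A", "B", "C")
observed  <- factor(c("A", "A", "B", "B", "C", "C", "C", "A"), levels = groups)
predicted <- factor(c("A", "B", "B", "B", "C", "C", "A", "A"), levels = groups)

# Rows are observed categories, columns are predicted categories
confusion <- table(Observed = observed, Predicted = predicted)
print(confusion)

# Overall accuracy: correctly classified cases (the diagonal) over all cases
accuracy <- sum(diag(confusion)) / sum(confusion)
accuracy  # 0.75
```

Off-diagonal cells show which categories are confused with one another; here one Group A case is predicted as B and one Group C case is predicted as A.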

Missing data See Missing Data Options.

Variable names Displays Variable Names in the output.

Sort by importance Sort the rows by importance (the last column in the table).
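Sorting by the last column can be sketched in base R as follows. The importance values are hypothetical and serve only to show the reordering.

```r
# Hypothetical importance table: one row per predictor,
# overall importance in the last column
imp <- matrix(c(0.01,  2.1,
                0.15, 30.4,
                0.05, 11.7),
              ncol = 2, byrow = TRUE,
              dimnames = list(c("x3", "x1", "x2"),
                              c("MeanDecreaseAccuracy", "MeanDecreaseGini")))

# Order rows from most to least important according to the last column
sorted <- imp[order(imp[, ncol(imp)], decreasing = TRUE), , drop = FALSE]
print(sorted)
```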

Weight Where a weight has been set for the R Output, a new data set is generated by resampling the cases according to their weights, and this new data set is used in the estimation.
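One way such weighted resampling can work is to draw cases with probability proportional to their weights. The sketch below is an assumption about the general approach, not a description of the exact scheme used here (the sample size and the use of replacement are choices made for the example), and the data and weights are hypothetical.

```r
set.seed(12321)  # arbitrary seed so the sketch is reproducible

# Hypothetical data and case weights; case 3 has a weight of zero
dat <- data.frame(y  = c("A", "B", "A", "C", "B"),
                  x1 = c(1.2, 0.4, 3.1, 2.2, 0.9))
w <- c(2, 1, 0, 0.5, 1.5)

# Draw row indices with probability proportional to weight, with replacement
idx <- sample(nrow(dat), size = nrow(dat), replace = TRUE, prob = w / sum(w))
resampled <- dat[idx, ]
print(resampled)
```

Cases with larger weights tend to appear more often in the resampled data, and a case with zero weight is never drawn. The model is then fitted to the resampled data without weights.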

Filter The data is automatically filtered, using any filters that have been applied, before the model is estimated.

Additional options are available by editing the code.


Uses the randomForest algorithm from the randomForest package.

Breiman, L. (2001), Random Forests, Machine Learning 45(1), 5-32.


form.setHeading('Random Forest');
form.dropBox({label: "Outcome", 
            types:["Variable: Numeric, Date, Money, Categorical, OrderedCategorical"], 
            name: "formOutcomeVariable"});
form.dropBox({label: "Predictor(s)",
            types:["Variable: Numeric, Date, Money, Categorical, OrderedCategorical"], 
            name: "formPredictorVariables",
            multi: true});
var output = form.comboBox({label: "Output", 
              alternatives: ["Importance", "Prediction-Accuracy Table", "Detail"], name: "formOutput", default_value: "Importance"}).getValue();
form.comboBox({label: "Missing data", 
              alternatives: ["Error if missing data", "Exclude cases with missing data"], name: "formMissing", default_value: "Exclude cases with missing data"});
form.checkBox({label: "Variable names", name: "formNames", default_value: false});
if (output == "Importance")
    form.checkBox({label: "Sort by importance", name: "formImportance", default_value: true});

library(flipMultivariates)
options(width = 120)
rf <- RandomForest(QFormula(formOutcomeVariable ~ formPredictorVariables), weights = QPopulationWeight, subset = QFilter,
                  missing = formMissing, output = formOutput, show.labels = !formNames,
                  sort.by.importance = get0("formImportance", ifnotfound = FALSE))