Dimension Reduction - Principal Components Analysis



Conduct a Principal Components Analysis on a selection of variables. Principal Components Analysis is a tool for reducing a large set of variables to a smaller set of variables while retaining as much of the variation in the original data set as possible. See this blog post for an introduction.

The new variables for the reduced data set (which contain the component scores) can be saved by running Dimension Reduction - Save Variable(s) - Components/Dimensions or by clicking on the button Inputs > ACTIONS > Save variables. The new variables can then be used in subsequent analyses.

How to Create

  1. Add the object:
    1. In Displayr: Insert > More > Dimension Reduction > Principal Components Analysis
    2. In Q: Create > Dimension Reduction > Principal Components Analysis
  2. Under Inputs > Algorithm select PCA
  3. Under Inputs > Variables select your variables to include

Example

There are different outputs under the Inputs > Output dropdown.

The main output is the Loadings Table, which shows the loadings of each of the original variables on the components that have been identified, along with the amount of variance explained by the components. In this example, two components have been obtained from the preference scores for the six cola brands. Component 1 is most strongly correlated with the scores for the diet drinks, and Component 2 is most strongly correlated with the scores for the full-sugar drinks.



To get more detail, change the Output to Detailed Output, which shows more of the underlying metrics associated with the analysis.


To see the Variance Explained of all of the components, change the Output to Variance Explained.


Options

Variables The variables or a question containing variables that you would like to analyze.

Use correlation matrix If this is true, then the correlation matrix of the data in Variables will be used to conduct the PCA. Otherwise, the covariance matrix is used.
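
As a rough illustration of why this option matters, the sketch below compares the eigenvalues of the correlation and covariance matrices in base R. It uses the built-in USArrests data set as a stand-in for your own variables, and it is not the code that this feature runs internally.

```r
# Built-in data set whose variables are measured on very different scales
x <- USArrests

# Correlation matrix: every variable is standardized to unit variance
eig.cor <- eigen(cor(x))$values

# Covariance matrix: variables with large variances dominate the components
eig.cov <- eigen(cov(x))$values

# Share of the total variance captured by the first component in each case
share.cor <- eig.cor[1] / sum(eig.cor)
share.cov <- eig.cov[1] / sum(eig.cov)
round(c(correlation = share.cor, covariance = share.cov), 3)
```

With the correlation matrix, the eigenvalues sum to the number of variables, which is what makes a fixed cutoff such as 1 meaningful.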

Create binary variables from categories Represents unordered categorical variables as binary variables. Otherwise, their Value Attributes are used. Number - Multi questions / Numeric - Multi variable sets are treated according to their numeric values and are not converted to binary.
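
The idea of representing a categorical variable as binary variables can be sketched in base R with model.matrix. The data frame and its brand column are made up for illustration, and the exact encoding used by this feature (for example, the column names) may differ.

```r
# Hypothetical unordered categorical variable
d <- data.frame(brand = factor(c("Coke", "Pepsi", "Coke", "Diet Coke")))

# One 0/1 column per category; "- 1" removes the intercept so no level is dropped
binary <- model.matrix(~ brand - 1, data = d)
binary
```

Each row contains a single 1, in the column for that case's category.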

Rule for selecting components Method for determining the number of principal components to keep in the analysis:

Kaiser rule Keep components with eigenvalues greater than 1. If the unscaled covariance matrix is used instead of the correlation matrix, components with eigenvalues greater than the mean eigenvalue are kept.
Eigenvalues over Keep components with eigenvalues greater than a user-specified number. If the unscaled covariance matrix is used instead of the correlation matrix, components with eigenvalues greater than a multiple of the eigenvalue mean are kept.
Number of components Manually select the number of components to keep.
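
The three rules can be sketched as a small R function. select_components and its arguments are hypothetical names chosen for this illustration, and treating the cutoff as a multiple of the mean eigenvalue in the covariance case is one reading of the description above rather than a copy of the underlying implementation.

```r
# Hypothetical helper: returns how many components each rule would keep
select_components <- function(eigenvalues, rule = "Kaiser rule",
                              cutoff = 1, n = 2, correlation = TRUE) {
    # For a covariance matrix, thresholds are expressed relative to the
    # mean eigenvalue; for a correlation matrix the mean is already 1
    threshold.unit <- if (correlation) 1 else mean(eigenvalues)
    switch(rule,
           "Kaiser rule" = sum(eigenvalues > threshold.unit),
           "Eigenvalues over" = sum(eigenvalues > cutoff * threshold.unit),
           "Number of components" = min(n, length(eigenvalues)),
           stop("Unknown rule: ", rule))
}

# Illustrative eigenvalues from a correlation matrix of four variables
eig <- c(2.4, 1.1, 0.3, 0.2)
select_components(eig)                                  # Kaiser rule
select_components(eig, "Eigenvalues over", cutoff = 2)  # stricter cutoff
select_components(eig, "Number of components", n = 3)   # fixed number
```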

Rotation method (see below for more details):

None
Varimax
Quartimax
Equamax
Oblimin
Promax

Delta (Oblimin rotation) A parameter used when performing an Oblimin rotation. The default value is 0.

Kappa (Promax rotation) A parameter used when performing a Promax rotation. The default value is 4.

Missing data See Missing Data Options.

Output

Loadings Table Display a table of the component loadings, which is sometimes referred to as a Pattern matrix.
Structure Matrix Display the structure matrix, which is the loadings matrix multiplied by the correlations between the components.
Variance Explained Display the eigenvalues of the original, unrotated components, along with the variance explained, and cumulative variance explained.
Component Plot Display a scatterplot of the loadings of the first two principal components.
Scree Plot Display a chart of the eigenvalues of the correlation or covariance matrix.
Detailed Output Show more details on the results, including the loadings, structure matrix, variable communalities, sum of squared loadings, and score weights.
2D Scatterplot Show the data charted with axes of the first 2 components and labelled according to Group variable.
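
To make the relationship between the Loadings Table, the Structure Matrix and Variance Explained more concrete, here is a minimal sketch in base R. All of the numbers are made up for illustration; they are not taken from the cola example or produced by this feature.

```r
# Hypothetical pattern (loadings) matrix for four variables and two components
pattern <- matrix(c(0.8, 0.7, 0.1, 0.0,
                    0.1, 0.0, 0.9, 0.6),
                  ncol = 2,
                  dimnames = list(paste0("V", 1:4),
                                  c("Component 1", "Component 2")))

# Hypothetical correlations between the components (as after an oblique rotation)
phi <- matrix(c(1.0, 0.3,
                0.3, 1.0), ncol = 2)

# Structure matrix: the loadings multiplied by the component correlations
structure.matrix <- pattern %*% phi

# Variance explained, computed from a hypothetical set of eigenvalues
eigenvalues <- c(2.4, 1.1, 0.3, 0.2)
variance.explained <- data.frame(
    Eigenvalue = eigenvalues,
    Proportion = eigenvalues / sum(eigenvalues),
    Cumulative = cumsum(eigenvalues) / sum(eigenvalues))
variance.explained
```

When the components are uncorrelated, phi is the identity matrix and the structure matrix is identical to the loadings.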

Sort coefficients by size When displaying loadings or the structure matrix, sort the components according to their size.

Suppress small coefficients When displaying loadings or the structure matrix, replace small values with blank spaces to facilitate interpretation.

Absolute value below In tables, cells which have absolute values smaller than this will be replaced with blank spaces.
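
The effect of suppressing small values can be imitated in R as follows. suppress_small is a hypothetical helper written for this illustration, and the 0.4 threshold simply mirrors the default of Absolute value below in the form code.

```r
# Hypothetical helper: format loadings and blank out the small ones
suppress_small <- function(loadings, min.value = 0.4, digits = 2) {
    out <- formatC(loadings, format = "f", digits = digits)
    # Cells with an absolute value below the threshold become blank
    out[abs(loadings) < min.value] <- ""
    # Make sure the matrix shape and labels are retained
    dim(out) <- dim(loadings)
    dimnames(out) <- dimnames(loadings)
    out
}

# Illustrative loadings for two variables on two components
m <- matrix(c(0.85, -0.12, 0.30, -0.62), nrow = 2,
            dimnames = list(c("V1", "V2"), c("Component 1", "Component 2")))
suppress_small(m)
```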

Include labels in plots Whether or not the variable labels will be included in the Component Plot.

Variable names Displays Variable Names in the output instead of variable labels.

Group variable The variable to group (i.e. label) the points of a 2D Scatterplot.

Rotations

Rotations of the principal components are used to produce solutions where the loadings tend to be closer to 0, 1, or -1, making interpretation of the solution easier.

The Varimax, Quartimax, and Equamax rotations are orthogonal, which means that the components produced are always uncorrelated with one another.

The Promax and Oblimin rotations are oblique, meaning that the components can be correlated with one another.

After rotation, components with large negative loadings will have signs flipped, so that the largest loadings are positive, to make interpretation easier.
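
A minimal sketch of an orthogonal rotation followed by sign flipping is shown below. It uses varimax from base R's stats package rather than the GPArotation package used by this feature, the built-in USArrests data as example input, and one plausible reading of the sign-flipping rule (each component is flipped so that its largest absolute loading is positive).

```r
# PCA on standardized variables, keeping two components
pca <- prcomp(USArrests, scale. = TRUE)
k <- 2

# Loadings: eigenvectors scaled by the standard deviations of the components
unrotated <- pca$rotation[, 1:k] %*% diag(pca$sdev[1:k])

# Orthogonal Varimax rotation (stats::varimax, not GPArotation)
rotated <- unclass(varimax(unrotated, normalize = TRUE)$loadings)

# Flip each component so that its largest absolute loading is positive
flip <- apply(rotated, 2, function(col) sign(col[which.max(abs(col))]))
rotated <- sweep(rotated, 2, flip, "*")
round(rotated, 2)
```

An orthogonal rotation redistributes variance between the components but leaves each variable's communality (its row sum of squared loadings) unchanged.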

Scores

Principal components analysis can be used to create a new set of variables which give the new values for each case on the components that have been identified. Here, this is done using the Regression method. The coefficients for transforming the original variables to the new set of scores are shown in the Detailed Output under Score coefficient matrix, and the new variables can be saved to your data set using Dimension Reduction - Save Variable(s) or by clicking on the button Inputs > ACTIONS > Save variables.
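
For intuition, the Regression method can be sketched in base R for the simplest case of unrotated components, no weights and no missing data. The score coefficients are obtained by solving the correlation matrix against the loadings, and the scores are the standardized data multiplied by those coefficients. This is an illustration using the built-in USArrests data, not the code this feature runs.

```r
# Standardized data and its correlation matrix
z <- scale(USArrests)
r <- cor(USArrests)

# Loadings of the first two unrotated components
e <- eigen(r)
k <- 2
loadings <- e$vectors[, 1:k] %*% diag(sqrt(e$values[1:k]))

# Regression method: score coefficients solve r %*% weights = loadings
weights <- solve(r, loadings)

# Component scores for each case
scores <- z %*% weights
head(round(scores, 2))
```

In this unrotated case the resulting scores are uncorrelated and each has a variance of 1.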

Diagnostics

Test - Bartlett Test of Sphericity can be used to test whether or not the input variables are correlated with one another before conducting the principal components analysis.
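
The test statistic can be computed by hand using the standard formula, as in the sketch below. bartlett_sphericity is a hypothetical helper written for this illustration; it assumes complete, unweighted data and is not the implementation behind Test - Bartlett Test of Sphericity.

```r
# Hypothetical helper implementing the standard chi-square approximation
bartlett_sphericity <- function(x) {
    x <- as.matrix(x)
    n <- nrow(x)  # number of cases
    p <- ncol(x)  # number of variables
    r <- cor(x)
    # A determinant close to 1 means the variables are nearly uncorrelated
    statistic <- -(n - 1 - (2 * p + 5) / 6) * log(det(r))
    df <- p * (p - 1) / 2
    list(statistic = statistic, df = df,
         p.value = pchisq(statistic, df, lower.tail = FALSE))
}

bartlett_sphericity(USArrests)
```

A small p-value indicates that the correlation matrix differs from the identity matrix, so the variables share enough variance for a principal components analysis to be worthwhile.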

Additional Properties

When using this feature you can obtain additional information that is stored by the R code which produces the output.

  1. To do so, select Create > R Output.
  2. In the R CODE, paste: item = YourReferenceName
  3. Replace YourReferenceName with the reference name of your item. Find this in the Report tree or by selecting the item and then going to Properties > General > Name from the object inspector on the right.
  4. Below the first line of code, you can paste in snippets from below or type in str(item) to see a list of available information.

For a more in-depth discussion on extracting information from objects in R, check out our blog post here.

Properties which may be of interest are:

  • Loadings:
item$loadings
  • Scores:
item$scores
  • If the Output is set to 2D Scatterplot, you can create a table to plot the scores for other components to be used in a separate scatterplot as follows:
#CHANGE the principal.components.analysis below to the reference name of your PCA
yourpca=principal.components.analysis

#CHANGE the components to the two that you want
thecomps=yourpca$scores[,c("Component 2","Component 3")]

#pull in the default chart data to get the appropriate groups
currchart=attr(yourpca, "ChartData")

#combine the new components with the groups and dummy size column
finalchart=data.frame(thecomps,
                 Size=rep(.7,NROW(thecomps)),
                 Color=as.character(currchart$Group))
finalchart

Acknowledgements

The R package psych is used to extract the original, unrotated components from the input data.

The R package GPArotation is used to conduct rotations.

Code

var default_algorithm = "PCA";

// VERSION 1.14
function isEmpty(x) { return (x == undefined || x.getValue() == null && (x.getValues() == null || x.getValues().length == 0)) }
function isBlankSheet(x) { return (x.getValue() == null || x.getValue().length == 0) }
var allow_control_groups = Q.fileFormatVersion() > 10.9; // Group controls for Displayr and later versions of Q
var controls = [];

var algo_type = form.comboBox({label: "Algorithm", alternatives: ["PCA", "t-SNE", "MDS - Metric", "MDS - Non-metric"],
                               name: "formAlgorithm", default_value: default_algorithm,
                               prompt: "The method for performing the dimensionality reduction"});
let is_pca = algo_type.getValue() === "PCA";
var heading = is_pca ? "Principal Components Analysis (PCA)" : algo_type.getValue();
if (!!form.setObjectInspectorTitle)
    form.setObjectInspectorTitle(heading, heading);
else 
    form.setHeading(heading);

controls.push(algo_type);

var varInput = form.dropBox({name: "formVariables", label: "Variables",
                             types: ["Q: pickone, pickonemulti, number, numbermulti, numbergrid, pickany, pickanycompact, pickanygrid",
                                     "V:numeric, categorical, ordered categorical"], multi: true, required: false,
                             prompt: "Numeric variables, each representing a dimension"});
var tableInput = form.dropBox({label: "Distance matrix", name: "formDistance", types:["RItem"], required: false,
                               prompt: "Symmetric numeric matrix of distances between points"});
var pasteInput = form.dataEntry({label: "Paste or type distance matrix", name: "formDistanceRaw", prompt: "Opens a spreadsheet into which you can paste data.", required: true, large_data_error: "The data entered is too large. The best alternative is to add your data as a Data Set, use Table > Raw Data > Variable(s), and connect that table to this analysis."})


if (is_pca || !allow_control_groups || !isEmpty(varInput) || (isEmpty(tableInput) && isBlankSheet(pasteInput)))
{
    controls.push(varInput);

    if (is_pca || !allow_control_groups || !isEmpty(varInput))
    {
        let norm = form.checkBox({label: is_pca ? "Use correlation matrix" : "Normalize variables", name: "formNormalization", default_value: true,
                                  prompt: is_pca ? "Use correlation matrix (if selected) or the covariance matrix (if not selected)" : "Standardize variables to [0,1]"});
        controls.push(norm);
    }

    if (!allow_control_groups || !isEmpty(varInput))
    {
        var binVar = form.checkBox({name: "formBinary", label: "Create binary variables from categories", default_value: false,
                                    prompt: "Convert categorical variables to dummy binary variables"});
        controls.push(binVar);
    }
}
if (!is_pca)
{
   if (!allow_control_groups || !isEmpty(tableInput) || (isEmpty(varInput) && isBlankSheet(pasteInput)))
       controls.push(tableInput);
   if (!allow_control_groups || !isBlankSheet(pasteInput) || (isEmpty(varInput) && isEmpty(tableInput)))
       controls.push(pasteInput);
}
if (is_pca)
{
    var selectOpt = form.comboBox({name: "selectRule", label: "Rule for selecting components", alternatives: ["Kaiser rule", "Eigenvalues over", "Number of components"],
                                   default_value: "Kaiser rule", prompt: "Determines how many components are retained"});
    controls.push(selectOpt);
    if (selectOpt.getValue() == "Eigenvalues over")
        controls.push(form.numericUpDown({name: "eigenMin", label: "Cutoff", default_value: 1, maximum: Number.MAX_SAFE_INTEGER, increment: 0.1, prompt: "Minimum eigenvalue to retain component"}));
    if (selectOpt.getValue() == "Number of components")
        controls.push(form.numericUpDown({ name: "numberFactors", label: "Number of components", default_value: 2, increment: 1, minimum: 1, maximum: Number.MAX_SAFE_INTEGER,
                             prompt: "Retain a fixed number of components"}));
    var rotation_type = form.comboBox({ name: "rotationType",
                                        label: "Rotation method",
                                        alternatives: ["None",
                                                     "Varimax",
                                                     "Quartimax",
                                                     "Equamax",
                                                     "Promax",
                                                     "Oblimin"],
                                        default_value: "Varimax", prompt: "Varimax, Quartimax and Equamax produce uncorrelated components"});
    controls.push(rotation_type);
    if (rotation_type.getValue() == "Oblimin")
        controls.push(form.numericUpDown({name: "delta", label: "Delta", default_value: 0, increment: 0.1, maximum:0.8, minimum: -100,
                            prompt: "Oblimin control parameter"}));
    if (rotation_type.getValue() == "Promax")
        controls.push(form.numericUpDown({name: "kappa", label: "Kappa", default_value: 4, increment: 1, minimum: 2, maximum: Number.MAX_SAFE_INTEGER,
                            prompt: "Promax control parameter"}));

    controls.push(form.comboBox({name: "missingType",
                   label: "Missing data:",
                   alternatives: ["Error if missing data", "Exclude cases with missing data", "Use partial data (pairwise correlations)", "Imputation (replace missing values with estimates)"],
                   default_value: "Use partial data (pairwise correlations)", prompt: "Handling of cases with missing data" }));
    var print_type = form.comboBox({ name: "printType", label: "Output", alternatives: ["Loadings Table", "Structure Matrix", "Variance Explained", "Component Plot", "Scree Plot", "Detailed Output", "2D Scatterplot"], default_value: "Loadings Table", prompt: "Output to be shown" });
    controls.push(print_type);
    if (["Component Plot", "Scree Plot", "Variance Explained", "2D Scatterplot"].indexOf(print_type.getValue()) == -1)
    {
        controls.push(form.checkBox({ name: "sortCoefficients", label: "Sort coefficients by size", default_value: true }));
        var suppress = form.checkBox({ name: "suppressCoefficients", label: "Suppress small coefficients", default_value: true,
                                       prompt: "Replace small coefficients with blanks"});
        controls.push(suppress)
        if (suppress.getValue())
            controls.push(form.numericUpDown({ name: "minLoading", label: "Absolute value below", default_value: 0.4, increment: 0.1, minimum: 0, maximum: Number.MAX_SAFE_INTEGER,
                                 prompt: "Threshold to replace small coefficients with blanks"}));
    }

    if (print_type.getValue() == "Component Plot")
        controls.push(form.checkBox({ name: "scatterPlotLabels", label: "Include labels in plots", default_value: true,
                        prompt: "Label the points, else use integers"}));
    if (["Component Plot", "Loadings Table", "Structure Matrix", "Detailed Output"].indexOf(print_type.getValue()) != -1)
        controls.push(form.checkBox({label: "Variable names", name: "formNames", default_value: false, prompt: "Use names instead of labels"}));

}
if (!allow_control_groups || !isEmpty(varInput))
{
    if (!is_pca || print_type.getValue() == "2D Scatterplot")
    {
        var groups = form.dropBox({name: "formGroups", label: "Group variable", types: ["V:numeric, categorical, ordered categorical"], multi:false, required:false, prompt: "Variable used to color the points"});
        controls.push(groups);
    }
}

if (algo_type.getValue() == "t-SNE")
{
    var perplex = form.numericUpDown({name: "formPerplexity", label: "Perplexity", default_value: 10, increment: 1, maximum: 100, minimum: 2,
                                      prompt: "Low values emphasize local rather than global structure"});
    controls.push(perplex);
}
form.setInputControls(controls);

library(flipDimensionReduction)

WarnIfVariablesSelectedFromMultipleDataSets()

dim.reduce <- DimensionReductionScatterplot(algorithm = formAlgorithm,
    data = get0("formVariables"),
    data.groups = if (exists("formGroups") && length(formVariables) > 0) formGroups else NULL,
    table = if (!is.null(get0("formDistanceRaw"))) formDistanceRaw else get0("formDistance"),
    raw.table = !is.null(get0("formDistanceRaw")),
    binary = get0("formBinary", ifnotfound = FALSE),
    perplexity = get0("formPerplexity", ifnotfound = 0),
    normalization = get0("formNormalization", ifnotfound = FALSE),
    # Parameters for PCA
    weights = QCalibratedWeight,
    missing = get0("missingType"),
    select.n.rule = get0("selectRule"),
    rotation = get0("rotationType"),
    eigen.min = get0("eigenMin"),
    n.factors = get0("numberFactors"),
    sort.coefficients.by.size = get0("sortCoefficients"),
    suppress.small.coefficients = get0("suppressCoefficients"),
    min.display.loading.value = get0("minLoading", ifnotfound = 0),
    print.type = get0("printType"),
    plot.labels = get0("scatterPlotLabels"),
    promax.kappa = get0("kappa"),
    oblimin.delta = get0("delta"),
    show.labels = !isTRUE(get0("formNames")),
    subset = QFilter,
    use.combined.scatter = TRUE)

See Also