Text Analysis - Advanced - Principal Components Analysis (Text)


Automatically conducts a Principal Components Analysis (PCA) on a text variable, using Google's Universal Sentence Encoder to convert the text to numeric data. The results are presented in a modified loadings table that shows the cross-correlation between the principal component scores and an augmented document term matrix for ease of interpretation. The output scores can be saved using Inputs > Actions > Save Variables.

Example

To create a Principal Components Analysis for text data, use the following steps:

Create > Text Analysis > Advanced > Principal Components Analysis (Text), or Insert > Text Analysis > Advanced > Principal Components Analysis (Text)

  1. Under Inputs > Variable select a Text variable.
  2. Make any other selections or changes to the settings you require such as component selection rules and/or rotations (see details below).
  3. Ensure the Automatic box is checked, or click Calculate.

[Screenshot of the Inputs tab: TextPCA-Input-tab-snapshot.PNG]

The example below shows the input and Principal Components Analysis (Text) output for a survey question that asked respondents what they didn't like about Tom Cruise. The responses are shown in this table.

The example conducts a Principal Components Analysis (Text) with four components and uses a varimax-type rotation to aid interpretation. The output displays a modified loadings table that shows the correlations of different words and phrases with the numeric variables that describe the text (the principal component scores). When the table entries are sorted by size, as in the table output below, each column lists its entries in decreasing order of the magnitude of the correlation. The words and phrases with the largest correlations with component one are shown first, followed by those with the largest correlations with component two, and so on. Scroll down the output table to see the largest correlations with components three and four.
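Each entry in the table can be read as an ordinary Pearson correlation between one column of component scores and one word or phrase variable. The following is a minimal sketch of that calculation only. The function name `pearson`, the six scores, and the 0/1 indicator are hypothetical illustrations, and the sketch does not reproduce how the augmented document term matrix is actually built.

```javascript
// Pearson correlation between two equal-length numeric arrays.
function pearson(x, y) {
  const n = x.length;
  const mean = (v) => v.reduce((a, b) => a + b, 0) / n;
  const mx = mean(x);
  const my = mean(y);
  let sxy = 0, sxx = 0, syy = 0;
  for (let i = 0; i < n; i++) {
    const dx = x[i] - mx;
    const dy = y[i] - my;
    sxy += dx * dy;
    sxx += dx * dx;
    syy += dy * dy;
  }
  return sxy / Math.sqrt(sxx * syy);
}

// Hypothetical data: component 1 scores for six responses, and a 0/1
// indicator of whether each response contains a given word.
const scores = [1.2, 0.9, -0.8, -1.1, 0.7, -0.9];
const termPresent = [1, 1, 0, 0, 1, 0];

// A value near 1 means the word mostly appears in responses that score
// highly on the component; a value near -1 means the opposite.
const r = pearson(scores, termPresent);
console.log(r.toFixed(2));
```

A word with a large positive or negative entry in a column is therefore a good label for what that component is measuring.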

Extracting the Principal Component Scores

To extract the (possibly rotated) principal component scores from this output as variables in your Data Set, take these steps:

  1. Select the Principal Components Analysis (Text) output and then select the Inputs tab in the object inspector on the right side of the screen.
  2. Click the Save Variables button at the bottom of the Inputs tab (Inputs > Actions > Save Variables).

A Number - Multi (Numeric - Multi) variable set is created in your Data Set, containing a Number (Numeric) variable of principal component scores for each component. These can then be used like any other variable set to create tables and further outputs.

Options

Variable The text variable containing responses that you would like to analyze in the Principal Component Space.

Rule for selecting components Method for determining the number of principal components to keep in the analysis:

Number of components Manually select the number of components to keep.
Eigenvalues over Keep components with eigenvalues greater than a user-specified number.
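The two rules can be sketched as a small function. This is an illustration under stated assumptions rather than the code the analysis runs: the function name `numberOfComponents` and the eigenvalues are hypothetical, and only the rule names are taken from the form script in the Code section.

```javascript
// Decide how many components to keep under either selection rule.
// `eigenvalues` is an array of component eigenvalues; `value` is the
// requested number of components or the eigenvalue cutoff.
function numberOfComponents(eigenvalues, rule, value) {
  if (rule === "Number of components") {
    // Cannot keep more components than were computed.
    return Math.min(value, eigenvalues.length);
  }
  if (rule === "Eigenvalues over") {
    // Keep components whose eigenvalue is strictly greater than the cutoff.
    return eigenvalues.filter((ev) => ev > value).length;
  }
  throw new Error("Unknown rule: " + rule);
}

// Hypothetical eigenvalues for five components.
const eigenvalues = [3.1, 1.4, 0.9, 0.4, 0.2];

const keptByCutoff = numberOfComponents(eigenvalues, "Eigenvalues over", 1);
const keptByCount = numberOfComponents(eigenvalues, "Number of components", 4);
console.log(keptByCutoff, keptByCount);
```

With a cutoff of 1, only components whose eigenvalue exceeds 1 are retained, so a lower cutoff keeps more components.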

Rotation method (see below for more details):

Varimax A Varimax-type rotation of the principal components is used by default to produce solutions where the cross-correlation of the principal component scores and an augmented document term matrix has entries closer to 0, 1, or -1, making interpretation of the solution easier.
None The original principal component scores are used in creating the cross-correlation matrix with an augmented document term matrix for each case.

Sort coefficients by size When displaying loadings or the structure matrix, sort the entries according to their size.

Suppress small coefficients When displaying loadings or the structure matrix, replace small values with blank spaces to facilitate interpretation.

Absolute value below In tables, cells which have absolute values smaller than this will be replaced with blank spaces.
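The display options above can be illustrated with a short sketch. It assumes that sorting groups each row under the component where its absolute value is largest and then orders rows by decreasing magnitude within that group, which matches the description of the example output but is not confirmed as the exact procedure. The function names and the table values are hypothetical; the threshold of 0.4 is the default from the form script.

```javascript
// Index of the column holding the largest absolute value in a row.
function dominantColumn(values) {
  let best = 0;
  for (let j = 1; j < values.length; j++) {
    if (Math.abs(values[j]) > Math.abs(values[best])) best = j;
  }
  return best;
}

// Group rows by dominant component, then sort by decreasing magnitude.
function sortBySize(rows) {
  return rows.slice().sort((a, b) => {
    const da = dominantColumn(a.values);
    const db = dominantColumn(b.values);
    if (da !== db) return da - db;
    return Math.abs(b.values[db]) - Math.abs(a.values[da]);
  });
}

// Replace entries whose absolute value is below the threshold with blanks.
function suppressSmall(rows, threshold) {
  return rows.map((row) => ({
    term: row.term,
    values: row.values.map((v) => (Math.abs(v) < threshold ? "" : v)),
  }));
}

// Hypothetical correlations of four terms with two components.
const table = [
  { term: "term A", values: [0.21, 0.74] },
  { term: "term B", values: [0.82, -0.10] },
  { term: "term C", values: [-0.55, 0.30] },
  { term: "term D", values: [0.05, -0.61] },
];

const display = suppressSmall(sortBySize(table), 0.4);
display.forEach((row) => console.log(row.term, row.values));
```

Sorting is applied before suppression here so that the ordering always uses the full values, not the blanked ones.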

Rotations

A Varimax-type rotation of the principal components is used by default to produce solutions where the cross-correlation of the principal component scores and a type of document term matrix has entries closer to 0, 1, or -1, making interpretation of the solution easier.

The Varimax-type rotation is orthogonal, meaning that the components produced are always uncorrelated with one another.
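To make the idea concrete, the sketch below rotates a two-component table by an angle and picks the angle that maximizes the raw varimax criterion (the sum, over columns, of the variance of the squared entries). It is a simplified illustration and not the algorithm used by the analysis: it handles only two components, uses a plain grid search, applies the rotation directly to the table instead of to the scores, and all names and numbers are hypothetical.

```javascript
// 2 x 2 rotation matrix for angle theta. It is orthogonal, so its
// transpose is its inverse.
function rotation(theta) {
  return [
    [Math.cos(theta), -Math.sin(theta)],
    [Math.sin(theta), Math.cos(theta)],
  ];
}

// Multiply an n x 2 matrix by a 2 x 2 matrix.
function multiply(m, r) {
  return m.map((row) => [
    row[0] * r[0][0] + row[1] * r[1][0],
    row[0] * r[0][1] + row[1] * r[1][1],
  ]);
}

// Raw varimax criterion: sum over columns of the variance of squared entries.
function varimaxCriterion(m) {
  let total = 0;
  for (let j = 0; j < 2; j++) {
    const squared = m.map((row) => row[j] * row[j]);
    const mean = squared.reduce((s, v) => s + v, 0) / squared.length;
    total += squared.reduce((s, v) => s + (v - mean) ** 2, 0) / squared.length;
  }
  return total;
}

// Grid search over angles in [0, pi/2) for the largest criterion value.
function bestRotation(m, steps = 900) {
  let best = { theta: 0, value: varimaxCriterion(m) };
  for (let k = 1; k < steps; k++) {
    const theta = (k / steps) * (Math.PI / 2);
    const value = varimaxCriterion(multiply(m, rotation(theta)));
    if (value > best.value) best = { theta, value };
  }
  return best;
}

// Hypothetical unrotated correlations of four terms with two components.
const unrotated = [[0.6, 0.6], [0.5, 0.5], [0.6, -0.6], [0.5, -0.5]];
const best = bestRotation(unrotated);
const rotated = multiply(unrotated, rotation(best.theta));
console.log(rotated.map((row) => row.map((v) => v.toFixed(2))));
```

After rotation each term in this toy table relates strongly to one component and hardly at all to the other, which is the pattern that makes a rotated table easier to read. Because the rotation matrix is orthogonal, the sum of squared entries in each row is unchanged.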

Scores

Principal components analysis can be used to create a new set of variables that give each case's values on the components that have been identified. The initial analysis is done without rotation, and the scores are computed using the Regression method. If a Varimax-type rotation is specified (the default), the principal component scores are rotated with an orthogonal matrix to maximize the variance of the cross-correlation matrix of the rotated principal component scores against the presented document term type matrix. The scores can be saved using the Dimension Reduction - Save Variable(s) feature, which is also available as an action button in the Inputs tab.

Code

// VERSION 1.0
var controls = [];
 
var varInput = form.dropBox({name: "formVariable", label: "Variable",
                             types: ["Q: text", "V:text"], multi: false, required: false,
                             prompt: "Text variable to analyze"});
controls.push(varInput);

var selectOpt = form.comboBox({name: "selectRule", label: "Rule for selecting components", alternatives: ["Number of components", "Eigenvalues over"],
                               default_value: "Number of components", prompt: "Determines how many components are retained"});
controls.push(selectOpt);
if (selectOpt.getValue() == "Eigenvalues over")
    controls.push(form.numericUpDown({name: "eigenMin", label: "Cutoff", default_value: 1, increment: 0.1, prompt: "Minimum eigenvalue to retain component"}));
if (selectOpt.getValue() == "Number of components")
    controls.push(form.numericUpDown({name: "numberFactors", label: "Number of components", default_value: 2, increment: 1, minimum: 1,
                                      prompt: "Retain a fixed number of components"}));

var rotation_type = form.comboBox({name: "rotationType", 
                                   label: "Rotation method", 
                                   alternatives: ["None", "Varimax"],
                                   default_value: "Varimax", prompt: "Rotate the entries in the table"});
controls.push(rotation_type);
controls.push(form.checkBox({name: "sortCoefficients", label: "Sort coefficients by size", default_value: true }));
var suppress = form.checkBox({name: "suppressCoefficients", label: "Suppress small coefficients", default_value: false,
                              prompt: "Replace small coefficients with blanks"});
controls.push(suppress);
if (suppress.getValue())
    controls.push(form.numericUpDown({name: "minLoading", label: "Absolute value below", default_value: 0.4, increment: 0.1, minimum: 0,
                                      prompt: "Threshold to replace small coefficients with blanks"}));
form.setInputControls(controls);

library(flipTextAnalysis)

text.pca <- TextPrincipalComponentsAnalysis(x = get0("formVariable"),
                                            subset = QFilter,
                                            weights = QPopulationWeight,
                                            n.comp = get0("numberFactors"),
                                            select.n.rule = get0("selectRule"),
                                            eigen.min = get0("eigenMin"),
                                            rotation = get0("rotationType"),
                                            sort.coefficients.by.size = get0("sortCoefficients"),
                                            suppress.small.coefficients = get0("suppressCoefficients"),
                                            min.display.loading.value = get0("minLoading", ifnotfound = 0),
                                            show.labels = !isTRUE(get0("formNames")))