Text Analysis - Advanced - Principal Components Analysis (Text)

Conducts a Principal Components Analysis (PCA) on a text variable, using Google's Universal Sentence Encoder to convert the text to numeric data. The results are presented in a modified loadings table that shows the cross-correlation between the principal component scores and an augmented document term matrix, which makes the components easier to interpret. The output scores can be saved using Inputs > Actions > Save Variables.

Example

To create a Principal Components Analysis for text data, follow the steps below (you can download the example data here):

Create > Text Analysis > Advanced > Principal Components Analysis (Text), or Insert > Text Analysis > Advanced > Principal Components Analysis (Text)

  1. Under Inputs > Text Variable select a Text variable.
  2. Make any other selections or changes to the settings you require such as component selection rules and/or rotations (see details below).
  3. Ensure the Automatic box is checked, or click Calculate.

[Figure: Text PCA Object Inspector]

The example below shows the input and Principal Components Analysis (Text) output for a survey question that asked respondents what they didn't like about Tom Cruise. The responses are shown in this table.

The example below conducts a Principal Components Analysis (Text) with four components and uses a Varimax-type rotation to aid interpretation. The output displays a modified loadings table that shows how strongly different words and phrases correlate with the numeric variables (the component scores) that describe the text. When the table entries are sorted by size, as in the output below, each column is sorted in decreasing order of the magnitude of its correlations: the words and phrases with the largest correlations with component one are shown first, followed by those with the largest correlations with component two, and so on. Scroll down the output table to see the largest correlations with components three and four.
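
As a rough illustration of what one entry in such a table represents, the sketch below computes an unweighted Pearson correlation between the scores on one component and a 0/1 indicator of whether each response contains a given word. The data and the function name `pearson` are hypothetical, and the actual output may apply weights or filters that this sketch ignores.

```javascript
// Pearson correlation between two equal-length numeric arrays.
function pearson(x, y) {
    const n = x.length;
    const meanX = x.reduce((a, b) => a + b, 0) / n;
    const meanY = y.reduce((a, b) => a + b, 0) / n;
    let sxy = 0, sxx = 0, syy = 0;
    for (let i = 0; i < n; i++) {
        const dx = x[i] - meanX;
        const dy = y[i] - meanY;
        sxy += dx * dy;
        sxx += dx * dx;
        syy += dy * dy;
    }
    return sxy / Math.sqrt(sxx * syy);
}

// Hypothetical data: scores on one component for five responses, and a 0/1
// indicator of whether each response contains a particular word.
const componentScores = [1.2, 0.8, -0.5, -1.0, -0.5];
const containsWord = [1, 1, 0, 0, 0];

// A value near 1 or -1 means the word is strongly associated with the component.
const tableEntry = pearson(componentScores, containsWord);
console.log(tableEntry.toFixed(2));
```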

Extracting the Principal Component Scores

To extract the (possibly rotated) principal component scores from this output as variables in your Data Set, take these steps:

  1. Select the Principal Components Analysis (Text) output and then select the Inputs tab in the object inspector on the right side of the screen.
  2. Click the Save Variables button at the bottom of the Inputs tab (Inputs > Actions > Save Variables).

A Number - Multi (Numeric - Multi) variable set is created in your Data Set, containing one Number (Numeric) variable of principal component scores for each component. These can then be used like any other variable set to create tables and further outputs.

Options

Text variable: The text variable containing the responses that you would like to analyze in the principal component space.

Truncate cases with characters exceeding: Truncate cases whose number of characters exceeds this number. The algorithm expects one sentence per case and may not function correctly when cases are too long. If any cases are truncated by this setting, a warning is shown; it is then likely that the data does not conform to the one-sentence-per-case assumption and is not appropriate for this analysis.
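
The truncation rule can be pictured with the minimal sketch below. The function name `truncateCases` and the sample responses are hypothetical, and the sketch counts characters using JavaScript string length, which is an assumption about how the real setting counts them.

```javascript
// Truncate each response to at most maxChars characters and record which
// cases were shortened. Missing responses (null/undefined) are left unchanged.
function truncateCases(responses, maxChars) {
    const truncatedIndices = [];
    const text = responses.map((response, i) => {
        if (typeof response !== "string" || response.length <= maxChars)
            return response;
        truncatedIndices.push(i);
        return response.slice(0, maxChars);
    });
    return { text, truncatedIndices };
}

// Hypothetical responses with a deliberately small limit of 10 characters.
const truncation = truncateCases(["Too short", "He is far too intense in interviews", null], 10);

// Mirror the warning described above when any case was shortened.
if (truncation.truncatedIndices.length > 0)
    console.log("Warning: " + truncation.truncatedIndices.length + " case(s) truncated");
```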

Rule for selecting components: Method for determining the number of principal components to keep in the analysis:

  Number of components: Manually select the number of components to keep.
  Eigenvalues over: Keep components with eigenvalues greater than a user-specified number.
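
The two rules can be sketched as a single function. The name `numberOfComponents` and the eigenvalues below are hypothetical; this is an illustration of the rules as described, not the code the analysis runs.

```javascript
// Decide how many components to keep under either rule.
// eigenvalues: eigenvalues of the components, largest first.
// rule: "Number of components" or "Eigenvalues over".
// value: the requested count, or the eigenvalue cutoff.
function numberOfComponents(eigenvalues, rule, value) {
    if (rule === "Number of components")
        return Math.min(value, eigenvalues.length); // cannot keep more than exist
    if (rule === "Eigenvalues over")
        return eigenvalues.filter(ev => ev > value).length;
    throw new Error("Unknown rule: " + rule);
}

// Hypothetical eigenvalues for four components.
const exampleEigenvalues = [3.1, 1.4, 0.9, 0.6];
console.log(numberOfComponents(exampleEigenvalues, "Eigenvalues over", 1));
console.log(numberOfComponents(exampleEigenvalues, "Number of components", 3));
```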

Rotation method (see Rotations below for more details):

  Varimax: A Varimax-type rotation of the principal components, used by default, that produces solutions where the cross-correlation of the principal component scores and an augmented document term matrix has entries closer to 0, 1, or -1, making the solution easier to interpret.
  None: The original principal component scores are used in creating the cross-correlation matrix with an augmented document term matrix for each case.

Sort coefficients by size: When displaying loadings or the structure matrix, sort the entries according to their size.

Suppress small coefficients: When displaying loadings or the structure matrix, replace small values with blank spaces to facilitate interpretation.

Absolute value below: In tables, cells with absolute values smaller than this are replaced with blank spaces.
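
The display options above can be sketched as two small functions. The sorting rule used here (group each row under the component where its magnitude is largest, then order by decreasing magnitude) is an assumption based on the example description, and the terms and values are made up for illustration.

```javascript
// Order rows so that terms with their largest magnitude on component 1 come
// first (biggest magnitude first), then those for component 2, and so on.
function sortBySize(rows) {
    const dominant = row => {
        let best = 0;
        row.values.forEach((v, j) => {
            if (Math.abs(v) > Math.abs(row.values[best])) best = j;
        });
        return best;
    };
    return rows.slice().sort((a, b) => {
        const da = dominant(a), db = dominant(b);
        if (da !== db) return da - db;
        return Math.abs(b.values[db]) - Math.abs(a.values[da]);
    });
}

// Replace entries whose absolute value is below the threshold with blanks.
function suppressSmall(rows, threshold) {
    return rows.map(row => ({
        term: row.term,
        values: row.values.map(v => (Math.abs(v) < threshold ? "" : v))
    }));
}

// Hypothetical two-component table.
const exampleTable = [
    { term: "arrogant", values: [0.15, -0.62] },
    { term: "movies", values: [0.81, 0.10] },
    { term: "acting", values: [0.55, 0.30] }
];
const displayTable = suppressSmall(sortBySize(exampleTable), 0.4);
displayTable.forEach(row => console.log(row.term, row.values));
```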

Rotations

A Varimax-type rotation of the principal components is used by default. It produces solutions where the cross-correlation of the principal component scores and a type of document term matrix has entries closer to 0, 1, or -1, making the solution easier to interpret.

The Varimax-type rotation is orthogonal, meaning that the components produced are always uncorrelated with one another.
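
A small numerical check of this property is sketched below for two components. It assumes the unrotated scores are standardized (mean zero, equal variance) and uncorrelated; the scores, the angle, and the function names are hypothetical.

```javascript
// Rotate two score columns by an angle (in radians) using a 2x2 rotation
// matrix, which is an orthogonal matrix.
function rotateScores(z1, z2, angle) {
    const c = Math.cos(angle), s = Math.sin(angle);
    return {
        r1: z1.map((v, i) => c * v + s * z2[i]),
        r2: z1.map((v, i) => -s * v + c * z2[i])
    };
}

// Sample covariance of two equal-length arrays.
function covariance(x, y) {
    const n = x.length;
    const mx = x.reduce((a, b) => a + b, 0) / n;
    const my = y.reduce((a, b) => a + b, 0) / n;
    return x.reduce((acc, v, i) => acc + (v - mx) * (y[i] - my), 0) / (n - 1);
}

// Hypothetical scores: mean zero, equal variance, and uncorrelated.
const z1 = [1, -1, 1, -1];
const z2 = [1, 1, -1, -1];
const rotated = rotateScores(z1, z2, Math.PI / 6);

// The covariance of the rotated columns stays at zero, up to rounding error.
console.log(Math.abs(covariance(rotated.r1, rotated.r2)) < 1e-12);
```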

Scores

Principal components analysis can be used to create a new set of variables that give each case's values on the components that have been identified. The initial analysis is done without rotation, and the scores are computed using the Regression method. If a Varimax-type rotation is specified (the default), the principal component scores are rotated with an orthogonal matrix to maximize the variance of the cross-correlation matrix between the rotated principal component scores and the displayed document term type matrix. The scores can be saved using the Dimension Reduction - Save Variable(s) feature, which is also available as an action button in the Inputs tab.
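
To give a feel for what "maximize the variance" means, the sketch below computes the classical varimax criterion (the variance of the squared entries in each column, summed over columns) for two small matrices. The page describes the rotation only as "Varimax type", so treating this exact criterion as the objective is an assumption, and the matrices are made up.

```javascript
// Classical varimax criterion: for each column, the variance of the squared
// entries, summed over columns. Rows are terms, columns are components.
function varimaxCriterion(matrix) {
    const nRows = matrix.length;
    const nCols = matrix[0].length;
    let total = 0;
    for (let j = 0; j < nCols; j++) {
        let sumSq = 0, sumFourth = 0;
        for (let i = 0; i < nRows; i++) {
            const sq = matrix[i][j] * matrix[i][j];
            sumSq += sq;
            sumFourth += sq * sq;
        }
        const meanSq = sumSq / nRows;
        total += sumFourth / nRows - meanSq * meanSq;
    }
    return total;
}

// Entries at 0 or 1 (easy to interpret) versus the same matrix rotated by
// 45 degrees, where every entry has the same middling magnitude.
const simpleStructure = [[1, 0], [0, 1]];
const h = Math.SQRT1_2;
const evenlySpread = [[h, h], [-h, h]];

// The easier-to-read matrix scores higher on the criterion.
console.log(varimaxCriterion(simpleStructure) > varimaxCriterion(evenlySpread));
```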

Code

// VERSION 1.0.1
var heading_text = "Principal Components Analysis of Text";
if (!!form.setObjectInspectorTitle)
    form.setObjectInspectorTitle(heading_text, heading_text);
else
    form.setHeading(heading_text);

var controls = [];

var varInput = form.dropBox({name: "formVariable", label: "Text Variable",
                             types: ["Q: text", "V:text"], multi: false, required: true,
                             prompt: "Text variable to analyze"});
controls.push(varInput);
var varTruncate = form.numericUpDown({name: "formMaxCaseChar",
                                      label: "Truncate cases with characters exceeding",
                                      default_value: 2000,
                                      increment: 1000,
                                      maximum: Number.MAX_SAFE_INTEGER,
                                      minimum: 1,
                                      prompt: "Truncate cases whose number of characters exceeds this number."});
controls.push(varTruncate);

var selectOpt = form.comboBox({name: "selectRule", label: "Rule for selecting components", alternatives: ["Number of components", "Eigenvalues over"],
                               default_value: "Number of components", prompt: "Determines how many components are retained"});
controls.push(selectOpt);
// Show only the control that matches the selected rule.
if (selectOpt.getValue() == "Eigenvalues over")
    controls.push(form.numericUpDown({name: "eigenMin", label: "Cutoff", default_value: 1, increment: 0.1, maximum: Number.MAX_SAFE_INTEGER, prompt: "Minimum eigenvalue to retain component"}));
if (selectOpt.getValue() == "Number of components")
    controls.push(form.numericUpDown({name: "numberFactors", label: "Number of components", default_value: 2, increment: 1, minimum: 1, maximum: Number.MAX_SAFE_INTEGER,
                                      prompt: "Retain a fixed number of components"}));

var rotation_type = form.comboBox({name: "rotationType",
                                   label: "Rotation method",
                                   alternatives: ["None", "Varimax"],
                                   default_value: "Varimax", prompt: "Rotate the entries in the table"});
controls.push(rotation_type);
controls.push(form.checkBox({name: "sortCoefficients", label: "Sort coefficients by size", default_value: true }));
var suppress = form.checkBox({name: "suppressCoefficients", label: "Suppress small coefficients", default_value: false,
                              prompt: "Replace small coefficients with blanks"});
controls.push(suppress);
if (suppress.getValue())
    controls.push(form.numericUpDown({name: "minLoading", label: "Absolute value below", default_value: 0.4, increment: 0.1, minimum: 0, maximum: Number.MAX_SAFE_INTEGER,
                                      prompt: "Threshold to replace small coefficients with blanks"}));
form.setInputControls(controls);

library(flipTextAnalysis)

arguments <- list(x = get0("formVariable"),
                  subset = QFilter,
                  weights = QPopulationWeight,
                  n.comp = get0("numberFactors"),
                  select.n.rule = get0("selectRule"),
                  eigen.min = get0("eigenMin"),
                  rotation = get0("rotationType"),
                  sort.coefficients.by.size = get0("sortCoefficients"),
                  suppress.small.coefficients = get0("suppressCoefficients"),
                  min.display.loading.value = get0("minLoading", ifnotfound = 0),
                  show.labels = !isTRUE(get0("formNames")),
                  max.case.characters = formMaxCaseChar)

text.pca <- TextAnalysis(TextPrincipalComponentsAnalysis, arguments)