Text Analysis - Advanced - Term Document Matrix

Create a term-document matrix from the output of Text Analysis - Advanced - Setup Text Analysis. A term-document matrix represents the processed text as a table in which the rows are the text responses (the documents) and the columns are the words or phrases (the terms). A cell contains 1 if the text response contains the word or phrase, and 0 otherwise.

For performance reasons, the term-document matrix is returned as a sparse matrix of class dgCMatrix from the Matrix package. A number of R packages can use this format directly. Otherwise, convert it to a standard matrix by wrapping it in the function as.matrix, e.g.: as.matrix(term.document.matrix).
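To make the format concrete, here is a minimal sketch in R that builds a small binary term-document matrix by hand and stores it as a sparse matrix. The documents, the term list, and the name toy.tdm are hypothetical, and the plain substring match is only a stand-in for the processing done by the setup item; it is not how Q or Displayr builds the matrix.

```r
library(Matrix)

# Hypothetical toy data: three text responses (documents) and four terms.
docs  <- c("the food was great", "great service", "slow service and cold food")
terms <- c("food", "great", "service", "slow")

# A cell is 1 when the document contains the term, otherwise 0.
# (Plain substring matching; an assumption made for illustration only.)
dense <- sapply(terms, function(term) as.numeric(grepl(term, docs, fixed = TRUE)))
rownames(dense) <- paste0("doc", seq_along(docs))

# Store the table in sparse, column-compressed form.
toy.tdm <- Matrix(dense, sparse = TRUE)
class(toy.tdm)

# Convert back to an ordinary dense matrix for viewing.
as.matrix(toy.tdm)
```

Rows correspond to documents and columns to terms, matching the layout described above.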

You must create an item using Text Analysis - Advanced - Setup Text Analysis before you can create a term-document matrix. All of the options for the text analysis are specified on that item.

This blog post describes the term document matrix format and how it can be used as input to a predictive model.

Example

In Displayr, go to Insert > Text Analysis > Advanced > Create Term Document Matrix.

In Q, go to Create > Text Analysis > Advanced > Create Term Document Matrix.

  1. Under Inputs > Setup item select a Text Analysis - Advanced - Setup Text Analysis object.
  2. Ensure the Automatic box is checked, or click Calculate.

Term Document Matrix Input settings

The output is a sparse matrix, a format that stores only the non-zero cells, because most values in a term-document matrix are likely to be zero.

To access/view the full matrix, take these steps:

1. Insert a new R output (Q: Create > R Output; Displayr: Insert > R Output).
2. In the R CODE field, enter the code below, and replace term.document.matrix with the name of the object you just created:

as.matrix(term.document.matrix)

3. Ensure the Automatic box is checked, or click Calculate. The result will be a table similar to this:

Options

Setup item: An item created from Text Analysis - Advanced - Setup Text Analysis.

Minimum document count: A term must appear in at least this many documents (rows) to be included in the matrix. Raising the value reduces the number of columns, which gives further control over the size of the table.
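The effect of the minimum document count can be sketched in R as follows. This only illustrates the idea on a hypothetical toy matrix; the names toy.tdm and min.doc.count are assumptions, and the sketch does not reproduce the internals of AsTermMatrix.

```r
library(Matrix)

# Hypothetical binary term-document matrix: 4 documents (rows) x 3 terms (columns).
toy.tdm <- sparseMatrix(i = c(1, 2, 3, 1, 4, 2),
                        j = c(1, 1, 1, 2, 2, 3),
                        x = 1,
                        dims = c(4, 3),
                        dimnames = list(paste0("doc", 1:4),
                                        c("food", "great", "slow")))

min.doc.count <- 2  # hypothetical threshold

# In a 0/1 matrix, each column sum is the number of documents containing the term.
doc.counts <- colSums(toy.tdm)

# Keep only the terms that appear in at least min.doc.count documents.
filtered <- toy.tdm[, doc.counts >= min.doc.count, drop = FALSE]
colnames(filtered)
```

Here "slow" appears in only one document, so it is dropped, while "food" and "great" are kept.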

More Information

Text Analysis: Hooking up Your Term Document Matrix to Custom R Code
How to Set up Your Text Analysis in Displayr

Acknowledgements

Uses the tm package.

Code

// Form definition (JavaScript): sets the title and declares the two input controls.
var heading_text = "Term Document Matrix";
if (!!form.setObjectInspectorTitle)
    form.setObjectInspectorTitle(heading_text, "Term Document Matrices");
else
    form.setHeading(heading_text);
// Setup item: must be a wordBag object created by the text analysis setup.
form.dropBox({name: "formwb", label: "Setup item", types: ["R:wordBag"], required: true,
              prompt: "An object from Text Analysis > Setup"});
// Minimum document count: terms appearing in fewer documents are excluded.
form.numericUpDown({name: "formmindoc", label: "Minimum document count", default_value: 5, increment: 1, minimum: 1, maximum: Number.MAX_SAFE_INTEGER,
                    prompt: "Terms appearing in fewer documents will be excluded"});
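
# R code: builds the term-document matrix from the setup item.
library(flipTextAnalysis)
library(flipData)

# Check user inputs
if (!is.null(QPopulationWeight))
{
    warning("Weights have no effect on this item.")
}
wb <- QInputs(formwb)
# inherits() also works when the object has more than one class
if (!inherits(wb, "wordBag"))
{
    stop("The input should be created by selecting Insert > Text Analysis > Advanced > Setup Text Analysis.")
}
# Build the sparse matrix, excluding terms that appear in too few documents
t.d.m <- AsTermMatrix(wb, min.frequency = QInputs(formmindoc), sparse = TRUE)
# Keep only the rows (documents) selected by the current filter
clean.subset <- CleanSubset(QFilter, nrow(t.d.m))
term.document.matrix <- t.d.m[clean.subset, ]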