Text Analysis - Advanced - Create Term Document Matrix


Create a term-document matrix using the result of Text Analysis - Setup Text Analysis. A term-document matrix represents the processed text from a text analysis as a table, or matrix, in which the rows represent the text responses (the documents) and the columns represent the words or phrases (the terms). If a text response contains the word or phrase, the corresponding cell contains a value of 1; otherwise it contains a value of 0.

For performance reasons, the term-document matrix is returned as a sparse matrix. Before you use it in your own code, convert it to a normal matrix by wrapping it in the function as.matrix, e.g.: as.matrix(term.document.matrix).
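
As a minimal sketch of this layout, the base R snippet below builds a small binary matrix by hand from three made-up responses. The responses, the whitespace tokenisation, and the variable names are illustrative assumptions; the actual item relies on the processing done by Text Analysis - Setup Text Analysis, not on this code.

```r
# Hypothetical responses; in Q these come from the processed text.
documents <- c("good service", "slow service", "good food good price")

# Naive tokenisation on spaces (an assumption made for illustration only).
tokens <- strsplit(documents, " ", fixed = TRUE)
terms <- sort(unique(unlist(tokens)))

# One row per document, one column per term; 1 if the term occurs, else 0.
tdm <- t(vapply(tokens,
                function(doc.tokens) as.numeric(terms %in% doc.tokens),
                numeric(length(terms))))
dimnames(tdm) <- list(paste0("doc", seq_along(documents)), terms)
print(tdm)
```

Note that "good" occurs twice in the third response but is still recorded as 1, because the cells indicate presence rather than counts.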

You must create an item using Text Analysis - Setup Text Analysis before you can create a term-document matrix. All of the options for the text analysis are specified on that item.

This blog post describes the term document matrix format and how it can be used as input to a predictive model.

Example

The term-document matrix is output as a sparse matrix, since many values are likely to be zero.

Using as.matrix in R displays the full matrix.
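
The difference between the two displays can be sketched with a small made-up matrix. The Matrix package is used here only as a stand-in for a sparse format, and the values and names are hypothetical; the object returned by this item may be a different sparse class, although as.matrix is applied to it in the same way.

```r
library(Matrix)  # ships with standard R installations

# Hypothetical 3 x 4 binary matrix: rows are documents, columns are terms.
sparse.tdm <- sparseMatrix(i = c(1, 1, 2, 3, 3),
                           j = c(1, 2, 2, 3, 4),
                           x = rep(1, 5),
                           dims = c(3, 4),
                           dimnames = list(paste0("doc", 1:3),
                                           c("food", "good", "price", "service")))

print(sparse.tdm)                   # sparse display: zeros are shown as "."
dense.tdm <- as.matrix(sparse.tdm)  # convert to an ordinary matrix
print(dense.tdm)                    # dense display: zeros are shown explicitly
```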

Options

Setup item: An item created from Text Analysis - Setup Text Analysis.

Minimum document count: Terms must appear in this many documents (rows) in order to be included in the matrix. This allows further control over the size of the table.
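
The effect of the minimum document count can be sketched in base R as follows. The matrix and the threshold are made up for illustration, and this is not necessarily how the item applies the setting internally.

```r
# Hypothetical binary matrix: 4 documents (rows) by 3 terms (columns).
tdm <- matrix(c(1, 1, 0,
                1, 0, 0,
                1, 1, 0,
                0, 0, 1),
              nrow = 4, byrow = TRUE,
              dimnames = list(paste0("doc", 1:4), c("good", "service", "slow")))

min.doc.count <- 2  # illustrative threshold, not the item's default of 5

# Because the cells are 0/1, column sums give the number of documents per term.
doc.counts <- colSums(tdm)
filtered.tdm <- tdm[, doc.counts >= min.doc.count, drop = FALSE]
print(colnames(filtered.tdm))
```

Here "slow" appears in only one document, so it is dropped, while every document (row) is retained.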

Acknowledgements

Uses the tm package.

Code

form.dropBox({name: "formwb", label: "Setup item", types: ["R:wordBag"], required: true,
              prompt: "An object from Text Analysis > Setup"});
form.numericUpDown({name: "formmindoc", label: "Minimum document count", default_value: 5, increment: 1, minimum:1,
                    prompt: "Terms appearing in fewer documents will be excluded"});
library(flipTextAnalysis)
library(flipData)
 
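# Check user inputs (this goes at the top of the script).
if (!is.null(QPopulationWeight))
{
    warning("Weights have no effect on this item.")
}
wb <- QInputs(formwb)
# inherits() is safe even if the object carries more than one class.
if (!inherits(wb, "wordBag"))
{
    stop("The input should be created by selecting Insert > Advanced > Text Analysis > Setup.")
}
# Build the sparse term-document matrix, applying the minimum document count.
t.d.m <- AsTermMatrix(wb, min.frequency = QInputs(formmindoc), sparse = TRUE)
# Keep only the rows (documents) selected by the current filter.
clean.subset <- CleanSubset(QFilter, nrow(t.d.m))
term.document.matrix <- t.d.m[clean.subset, ]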