Text Analysis - Advanced - Setup Text Analysis

Related Videos

Chapter within Text Analysis in Q5 (Video)

 

This item performs the initial phases of text analysis, and it is required as the input to other text analysis options. The processing is designed to work with English-language inputs. See this blog post for an example and description of the process.

This item works by splitting each text entry into individual words, processing each word separately, and then combining the words from each entry back together again. Processing of the words includes basic spell-checking, stemming, removal of words, replacement of specific words, and combination of words into phrases.
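
The split, process, and recombine flow described above can be sketched in base R. This is a toy illustration only: the stop list and the replacement table are made-up stand-ins for the spell-checking, stemming, and synonym steps, and the function process_entry is hypothetical. It does not reflect how the item is implemented internally.

```r
# Toy illustration only: a hand-written replacement table stands in for the
# spell-checking, stemming, and synonym steps that the real item performs.
responses <- c("The tests were easy", "Easy to test", "of the")
stopwords <- c("the", "of", "to", "were")   # assumed stop list
replacements <- c(tests = "test")           # assumed word-level replacement

process_entry <- function(entry) {
    words <- strsplit(tolower(entry), "[^a-z]+")[[1]]  # split into words
    words <- words[nzchar(words)]                      # drop empty strings
    hit <- words %in% names(replacements)
    words[hit] <- replacements[words[hit]]             # replace specific words
    words <- words[!words %in% stopwords]              # remove words
    paste(words, collapse = " ")                       # recombine the entry
}

processed <- vapply(responses, process_entry, character(1), USE.NAMES = FALSE)
processed
```

The third response consists only of words from the stop list, so nothing remains of it after processing.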

The item allows you to view the frequencies of the words that remain in the processed text, or the final text responses produced by the processing.

If an input was not blank before processing but processing removes all words, the message <NO_WORDS_REMAIN_AFTER_PROCESSING> is displayed.

Example

To create a Setup Text Analysis output, go to Insert > Text Analysis > Advanced > Setup Text Analysis or Create > Text Analysis > Advanced > Setup Text Analysis, depending on which menu your version provides.

  1. Under Inputs > Text Analysis Options > Text Variable select a Text variable.
  2. Make any other selections that you require.
  3. Ensure the Automatic box is checked, or click Calculate.

[Image: STA-Object-Inspector.png]

In the output generated by this function, the colors represent unique words and are intended to draw your attention to where they appear. The colors have no meaning beyond this.

Extract word frequencies

To extract the word frequencies table from the output, take the following steps:

  1. Select Create > R Output.
  2. Under Properties > R CODE paste in the code below, remembering to replace text.analysis.setup with the reference name of your text analysis setup object:

     data.frame("word" = text.analysis.setup$final.tokens, "count" = as.numeric(text.analysis.setup$final.counts))

  3. Check the Automatic box.
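
Once the table exists, you may want it sorted by frequency. The sketch below shows one way to do this in base R. The list text.analysis.setup here is a stand-in with invented values so that the example runs on its own; only the field names final.tokens and final.counts come from the code above.

```r
# Stand-in for a real Setup Text Analysis output; only the field names
# final.tokens and final.counts come from the snippet above.
text.analysis.setup <- list(final.tokens = c("easy", "price", "use"),
                            final.counts = c(12, 30, 7))

word.freq <- data.frame(word = text.analysis.setup$final.tokens,
                        count = as.numeric(text.analysis.setup$final.counts),
                        stringsAsFactors = FALSE)

# Most frequent words first
word.freq <- word.freq[order(-word.freq$count), ]
rownames(word.freq) <- NULL
word.freq
```

In a real R Output you would omit the stand-in list and refer to your own setup object instead.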

Options

Text variable The variable containing the text responses that you want to process.

Correct spelling Whether or not you want the setup to include basic English-language spell-checking.

Stemming Whether or not you want the setup to include basic stemming of English-language words. Stemming refers to replacing a word with its root word, or with the most common word in the text that corresponds to the same root. This includes removing plurals and past tense. For example, the words tests, tested, and test all share a common word stem and will be replaced with whichever of the three words is most common in the text variable.
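
The idea of replacing each word with the most common variant that shares its stem can be sketched as follows. The function toy_stem is a hypothetical stand-in that only strips a trailing "ed" or "s"; it is an assumption made for illustration and is not the stemming algorithm used by the item.

```r
# Toy stemmer: strips a trailing "ed" or "s". This is an assumption for
# illustration, not the stemming algorithm used by the item.
toy_stem <- function(w) sub("(ed|s)$", "", w)

words <- c("tests", "tested", "test", "tested", "tested", "prices", "price")
stems <- toy_stem(words)

# For each stem, pick the variant that occurs most often in the text
most_common <- vapply(split(words, stems),
                      function(v) names(which.max(table(v))),
                      character(1))

# Replace every word with the most common variant of its stem
stemmed <- unname(most_common[stems])
stemmed
```

Here tested occurs three times, so tests and test are both replaced with tested rather than with the bare root.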

Replace synonyms Whether words with the same meaning (in English) will be grouped. They will be replaced with the variant that occurs most frequently in the data set. Words with ambiguous meanings (e.g. 'fast', 'book') will not be replaced. This operation occurs after spelling corrections and stemming.

Replace these words/phrases Specify any words or phrases that should be replaced. For example, if you want to replace the word ease with the word easy wherever it appears, and the phrase one half with the phrase fifty percent wherever it appears, then use the syntax: ease:easy, one half:fifty percent.
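
To make the syntax concrete, the sketch below parses a replacement string of this form into "from" and "to" pairs and applies them to a sample sentence. It is a base R illustration under the assumption that pairs are split on commas and colons exactly as described; the variable names and the sample sentence are invented, and the item's own parser may behave differently.

```r
replacements.text <- "ease:easy, one half:fifty percent"

# Split on commas to get the pairs, then on colons to get each side
pairs <- strsplit(trimws(strsplit(replacements.text, ",")[[1]]), ":")
from <- trimws(vapply(pairs, `[`, character(1), 1))
to   <- trimws(vapply(pairs, `[`, character(1), 2))

text <- "ease of use is one half of the value"
for (i in seq_along(from)) {
    # \\b restricts the match to whole words, so "ease" does not alter "please"
    text <- gsub(paste0("\\b", from[i], "\\b"), to[i], text, perl = TRUE)
}
text
```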

Remove these words/phrases Any words or phrases that you want to exclude from the analysis. English-language stopwords (like at, of, the, etc.) will be removed automatically. If you want to remove additional words or phrases, type them in separated by commas.

Phrases Specify any combinations of words that should not be split up when they occur together. For example, if your text contains lots of responses that say the product is easy to use and you want to keep this as a unit in your analysis rather than splitting it up into the words "easy", "to", and "use", then you can enter this phrase here. Additional phrases should be separated by commas. Any phrases specified in Replace these words/phrases and Remove these words/phrases are automatically treated as phrases throughout the text without being specified here.
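
One simple way to picture how a phrase survives tokenization is to protect its internal spaces before splitting. The sketch below does this with an underscore placeholder. That trick is an assumption chosen for illustration, not a description of how the item handles phrases.

```r
phrases <- c("easy to use")
text <- "the product is easy to use"

# Temporarily join the words of each phrase so the split leaves them together
for (p in phrases) {
    text <- gsub(p, gsub(" ", "_", p, fixed = TRUE), text, fixed = TRUE)
}

tokens <- strsplit(text, " ", fixed = TRUE)[[1]]
tokens <- gsub("_", " ", tokens, fixed = TRUE)   # restore the spaces
tokens
```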

Maximum n for n-gram identification This value determines the largest number of words to be considered as possible n-grams (contiguous chunks of n words). For example, a value of 5 will search the text for pentagrams or smaller; 3 will search for trigrams, bigrams, and unigrams; etc.
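
As a rough illustration of what the maximum n controls, the hypothetical helper below lists every contiguous chunk of 1 to n.max words in a response. It only enumerates candidates; how the item decides which n-grams to keep is not shown here.

```r
# Hypothetical helper: all contiguous n-grams of length 1 up to n.max
ngrams_up_to <- function(words, n.max) {
    out <- character(0)
    for (n in seq_len(min(n.max, length(words)))) {
        starts <- seq_len(length(words) - n + 1)
        out <- c(out, vapply(starts,
                             function(i) paste(words[i:(i + n - 1)], collapse = " "),
                             character(1)))
    }
    out
}

ngrams_up_to(c("easy", "to", "use"), 3)
```

With a maximum of 3, a three-word response yields three unigrams, two bigrams, and one trigram.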

Minimum frequency Words or phrases must appear at least this many times in your text to be included in the analysis.

Count frequency by number of respondents If checked, multiple occurrences of a word or phrase by the same respondent will only be counted once in the category's frequency. Otherwise, each occurrence will be counted separately.
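
The difference between the two counting modes, and the effect of the minimum frequency, can be seen in a small base R sketch. The data and variable names below are invented for illustration.

```r
tokens.by.respondent <- list(c("easy", "easy", "price"),
                             c("easy", "price", "price"),
                             c("slow"))

# Unchecked: every occurrence is counted
counts.all <- table(unlist(tokens.by.respondent))

# Checked: each respondent contributes at most once per token
counts.resp <- table(unlist(lapply(tokens.by.respondent, unique)))

# Tokens below the minimum frequency are discarded
min.frequency <- 2
kept <- names(counts.resp)[as.vector(counts.resp) >= min.frequency]
kept
```

In this example easy is counted three times when every occurrence counts, but only twice when counting by respondent.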

Sort alphabetically When displaying the Word Frequencies output, display words and phrases in alphabetical order instead of order of frequency.

Maximum number of unique tokens to save Maximum number of unique tokens that are to be saved. If the number of unique tokens exceeds this limit then the most popular tokens are chosen with ties broken alphabetically.

Maximum number of tokens per case to save Maximum number of tokens per case to be saved. If this control is set to a value n, but a case contains m > n tokens, then only the first n tokens are saved.
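
The two limits above can be sketched in base R as follows. The counts, tokens, and variable names are invented for illustration; the sketch only mirrors the rules as described (most popular first with ties broken alphabetically, and the first n tokens per case).

```r
counts <- c(price = 5, easy = 5, slow = 2, app = 2, bug = 1)
max.unique <- 3

# Sort by descending count, then alphabetically to break ties
ord <- order(-counts, names(counts))
kept <- names(counts)[ord][seq_len(min(max.unique, length(counts)))]
kept

# Keep only the first max.per.case tokens of each case
tokens.by.case <- list(c("easy", "price", "slow", "app"), c("bug"))
max.per.case <- 2
trimmed <- lapply(tokens.by.case, head, max.per.case)
trimmed
```

Both slow and app have a count of 2, so the alphabetical tie-break gives the third slot to app.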

SAVE VARIABLE(S)

Save categories Adds variables to the data set containing the categories. Where there are multiple input variables, a separate set of variables is added for each.

Save first category Adds a variable to the data set containing the first category mentioned. Where there are multiple input categories, the first category of each will be saved as a separate variable.

Save sentiment scores Adds a variable which assigns scores to quantify how positive or negative each text response is.

Save tidied text Adds a variable to the data set that contains the tidied text.

Other Analyses

This item is used as an input in the following analyses:

More Information

How to set up your text analysis in Displayr

Code

form.setHeading("Text Analysis Options");
form.dropBox({name: "formtextvar", label: "Text variable", types: ["Q:text", "R:character"],
              prompt: "Variable containing text strings to analyze"});
form.checkBox({name: "formdospell", label: "Correct spelling", default_value: true,
               prompt: "English-language corrections"});
form.checkBox({name: "formdostem", label: "Perform stemming", default_value: true,
               prompt: "Replace words with root word, e.g. tested is replaced by test"});
form.checkBox({name: "formdosynonyms", label: "Replace synonyms", default_value: true,
               prompt: "Replace words with synonyms"});
form.textBox({name: "formreplacewords", label: "Replace these words/phrases", required: false,
              prompt: "Comma-separated list of pairs separated by colon, e.g. very bad:really poor replaces 'very bad' with 'really poor'"});
form.textBox({name: "formremovewords", label: "Remove these words/phrases", required: false, prompt: "Comma-separated list e.g. about, from, john smith"});
form.textBox({name: "formphrases", label: "Phrases", required:false,
              prompt: "Comma-separated list of phrases which are not broken into words, e.g. hello world, high five"});
form.numericUpDown({name: "formNGramMax", label: "Maximum n for n-gram identification", default_value: 5, increment: 1, minimum:1, maximum: 999999,
                    prompt: "5 will return pentagrams or smaller; 3 will return trigrams, bigrams, and unigrams; etc."});
form.numericUpDown({name: "formminfreq", label: "Minimum frequency", default_value: 5, increment: 1, minimum:1, maximum: 999999,
                    prompt: "Words/phrases with lower frequencies are discarded"});
form.checkBox({name: "formCountRespondents", label: "Count frequency by number of respondents",
               default_value: true, prompt: "If checked, multiple occurrences of a phrase by the same respondent will only be " +
                                             "counted once in the category's frequency. Otherwise, each occurrence will be counted separately."});
form.checkBox({name: "formalphabetical", label: "Sort alphabetically", default_value: false});
form.numericUpDown({name: "formMaxLevels", label: "Maximum number of unique tokens to save", 
                    prompt: "Maximum number of unique tokens to save when using Save Variable(s) Categories or First Category",
                    default_value: 500, minimum: 1, maximum: 999999});
form.numericUpDown({name:"formMaxMentions", label: "Maximum number of tokens per case to save",
                    prompt: "Maximum number of tokens per case to save when using Save Variable(s) Categories", 
                    default_value : 100, minimum: 1, maximum: 999999});
library(flipTextAnalysis)
if (!is.null(QPopulationWeight))
{
    warning("Weights have no effect on this item.")
}
options <- GetTextAnalysisOptions(phrases = formphrases, 
                                 extra.stopwords.text = formremovewords,
                                 replacements.text = formreplacewords,
                                 do.stem = formdostem,
                                 do.spell = formdospell,
                                 do.synonyms = formdosynonyms)
text.analysis.setup <- InitializeWordBag(formtextvar,
                                   min.frequency = formminfreq, 
                                   operations = options$operations,
                                   manual.replacements = options$replacement.matrix,
                                   stoplist = options$stopwords,
                                   alphabetical.sort = formalphabetical,
                                   phrases = options$phrases,
                                   subset = QFilter,
                                   ngram.max = formNGramMax,
                                   count.num.respondents = formCountRespondents)