Text Analysis - Advanced - Setup Text Analysis

Related Videos

Chapter within Text Analysis in Q5 (Video)

 

This item performs the initial phases of text analysis, and it is required as the input to other text analysis options. The processing is designed to work with English-language inputs. See this blog post for an example and description of the process.

This item works by splitting the input text into individual words, processing each word separately, and then recombining the words from each entry. Processing of the words includes basic spell-checking, stemming, removal of words, replacement of specific words, and combination of words into phrases.
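The split, process, and recombine flow described above can be sketched in base R. This is a simplified illustration and not the flipTextAnalysis implementation: the helper process_responses and its tiny stopword list are hypothetical, and spell-checking, stemming, and phrase handling are left out.

```r
# Hypothetical helper: a simplified split -> process -> recombine pipeline.
process_responses <- function(responses, stopwords = c("the", "is", "to", "a")) {
    vapply(responses, function(response) {
        # Split the response into lower-case word tokens.
        words <- strsplit(tolower(response), "[^a-z']+")[[1]]
        words <- words[nzchar(words)]
        # Process each word: in this sketch, only stopword removal.
        kept <- words[!words %in% stopwords]
        # Recombine the surviving words into a single string.
        if (length(words) > 0 && length(kept) == 0) {
            "<NO_WORDS_REMAIN_AFTER_PROCESSING>"
        } else {
            paste(kept, collapse = " ")
        }
    }, character(1), USE.NAMES = FALSE)
}

processed <- process_responses(c("The product is easy to use", "To the"))
print(processed)
```

The second response contains only stopwords, so the sketch returns the placeholder message rather than an empty string.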

The item allows you to view the frequencies of the words that remain in the processed text, or the final text responses produced by the processing.

If an input was not blank before processing but processing removes all words, the message <NO_WORDS_REMAIN_AFTER_PROCESSING> is displayed.

Example

Extract word frequencies

To extract the word frequencies table from the output, take the following steps:

  1. Select Create > R Output.
  2. Under Properties > R CODE paste in the code below, remembering to replace text.analysis.setup with the reference name of your text analysis setup object:

       data.frame("word" = text.analysis.setup$final.tokens, "count" = as.numeric(text.analysis.setup$final.counts))

  3. Check the Automatic box.
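To experiment with this snippet outside of Q, a stand-in object with the same two fields can be used. The tokens and counts below are invented for illustration, and a real setup object holds more than these two components. The final ordering step is an optional addition and is not part of the steps above.

```r
# Stand-in for a real setup object; the tokens and counts are invented.
text.analysis.setup <- list(final.tokens = c("easy", "price", "support"),
                            final.counts = c(12, 30, 7))

word.frequencies <- data.frame("word" = text.analysis.setup$final.tokens,
                               "count" = as.numeric(text.analysis.setup$final.counts))

# Optional: order the table from most to least frequent.
word.frequencies <- word.frequencies[order(-word.frequencies$count), ]
print(word.frequencies)
```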

Options

Text variable The variable containing the text responses that you want to process.

Correct spelling Whether or not you want the setup to include basic English-language spell-checking.

Stemming Whether or not you want the setup to include basic stemming of English-language words. Stemming refers to replacing a word with its root word, or with the most common word in the text that corresponds to the same root. This includes removing plurals and past tense. For example, the words tests, tested, and test all share a common word stem and will be replaced with whichever of the three words is most common in the text variable.
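The idea of replacing each word with the most common variant that shares its stem can be sketched as follows. The function toy_stem is a deliberately crude stand-in (an assumption for illustration) that strips only a few suffixes. It is not the stemmer used by this item and will mis-stem many real words.

```r
# Toy stemmer (illustrative assumption): strips a few common English suffixes.
toy_stem <- function(words) sub("(ing|ed|s)$", "", words)

# Replace every word with the most frequent word that shares its stem.
replace_with_common_variant <- function(words) {
    stems <- toy_stem(words)
    vapply(seq_along(words), function(i) {
        variants <- table(words[stems == stems[i]])
        names(variants)[which.max(variants)]
    }, character(1))
}

words <- c("tested", "tests", "tested", "test", "price", "prices", "prices")
print(replace_with_common_variant(words))
```

In this made-up input, tested is the most frequent member of its group and prices is the most frequent member of the other group, so every word is replaced by one of those two.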

Replace synonyms Whether words with the same meaning (in English) will be grouped. They will be replaced with the variant that occurs most frequently in the data set. Words with ambiguous meanings (e.g. 'fast', 'book') will not be replaced. This operation occurs after spelling corrections and stemming.

Replace these words/phrases Specify any words or phrases that should be replaced. For example, if you want to replace the word ease with the word easy wherever it appears, and the phrase one half with the phrase fifty percent wherever it appears, then use the syntax: ease:easy, one half:fifty percent.
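As a rough illustration of how the old:new syntax can be interpreted, the sketch below parses the comma-separated pairs and applies them to a response. Both helper functions are hypothetical. The sketch assumes the words contain no regular-expression metacharacters, and the item's own parsing may differ.

```r
# Hypothetical parser for the "old:new, old:new" replacement syntax.
parse_replacements <- function(text) {
    pairs <- trimws(strsplit(text, ",", fixed = TRUE)[[1]])
    parts <- strsplit(pairs, ":", fixed = TRUE)
    data.frame(from = trimws(vapply(parts, `[`, character(1), 1)),
               to = trimws(vapply(parts, `[`, character(1), 2)),
               stringsAsFactors = FALSE)
}

# Apply each replacement in turn; \\b restricts matches to whole words or phrases.
apply_replacements <- function(responses, replacements) {
    for (i in seq_len(nrow(replacements))) {
        pattern <- paste0("\\b", replacements$from[i], "\\b")
        responses <- gsub(pattern, replacements$to[i], responses, perl = TRUE)
    }
    responses
}

replacements <- parse_replacements("ease:easy, one half:fifty percent")
print(apply_replacements("ease of use for one half of users", replacements))
```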

Remove these words/phrases Any words or phrases that you want to exclude from the analysis. English-language stopwords (such as at, of, and the) are removed automatically. To remove additional words or phrases, type them in separated by commas.

Phrases Specify any combinations of words that should not be split up when they occur together. For example, if your text contains lots of responses that say the product is easy to use and you want to keep this as a unit in your analysis rather than splitting it into the words "easy", "to", and "use", then you can enter this phrase here. Additional phrases should be separated by commas. Any phrases specified in Replace these words/phrases and Remove these words/phrases are automatically treated as phrases throughout the text without being specified here.
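One simple way to keep a phrase together during tokenization is to join its words temporarily before splitting. The sketch below shows that approach with a hypothetical helper, tokenize_with_phrases. It uses literal matching without word-boundary checks, and it is not necessarily how this item handles phrases internally.

```r
# Hypothetical helper: tokenize while keeping listed phrases as single units.
tokenize_with_phrases <- function(response, phrases) {
    text <- tolower(response)
    for (phrase in phrases) {
        # Join the phrase's words with underscores so the split leaves it intact.
        text <- gsub(phrase, gsub(" ", "_", phrase, fixed = TRUE), text, fixed = TRUE)
    }
    tokens <- strsplit(text, " +")[[1]]
    # Restore the spaces inside protected phrases.
    gsub("_", " ", tokens, fixed = TRUE)
}

tokens <- tokenize_with_phrases("The product is easy to use", c("easy to use"))
print(tokens)
```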

Maximum n for n-gram identification This value determines the largest number of words to be considered as possible n-grams (contiguous chunks of n words). For example, a value of 5 will search the text for pentagrams or smaller; 3 will search for trigrams, bigrams, and unigrams; etc.
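To make the role of the maximum n concrete, the sketch below enumerates every contiguous n-gram of a token vector for n from 1 up to the chosen maximum. The helper extract_ngrams is hypothetical and only lists candidates. How this item decides which n-grams to retain is not described here.

```r
# Hypothetical helper: all contiguous n-grams of a token vector for n = 1..ngram.max.
extract_ngrams <- function(tokens, ngram.max) {
    ngrams <- character(0)
    for (n in seq_len(min(ngram.max, length(tokens)))) {
        starts <- seq_len(length(tokens) - n + 1)
        ngrams <- c(ngrams, vapply(starts, function(i) {
            paste(tokens[i:(i + n - 1)], collapse = " ")
        }, character(1)))
    }
    ngrams
}

print(extract_ngrams(c("easy", "to", "use"), ngram.max = 2))
```

With a maximum of 2, the three-word input yields three unigrams and two bigrams. Raising the maximum to 3 would add the single trigram.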

Minimum frequency Words or phrases must appear at least this many times in your text to be included in the analysis.

Sort alphabetically When displaying the Word Frequencies output, display words and phrases in alphabetical order instead of order of frequency.
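The combined effect of Minimum frequency and Sort alphabetically can be illustrated on a small, invented set of words. This is a base R sketch of the behavior described above, not the item's own code.

```r
# Invented example tokens.
counts <- table(c("price", "easy", "price", "support", "easy", "price"))

# Keep only words appearing at least min.frequency times.
min.frequency <- 2
kept <- counts[counts >= min.frequency]

# Default display order: most frequent first.
by.frequency <- names(sort(kept, decreasing = TRUE))
# Display order when sorting alphabetically.
alphabetical <- sort(names(kept))

print(by.frequency)
print(alphabetical)
```

Here support occurs only once and is dropped, leaving price and easy to be ordered either by count or by name.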

Other Analyses

This item is used as an input in the following analyses:

Code

// Input controls (JavaScript)
form.setHeading("Text Analysis Options");
form.dropBox({name: "formtextvar", label: "Text variable", types: ["Q:text", "R:character"],
              prompt: "Variable containing text strings to analyze"});
form.checkBox({name: "formdospell", label: "Correct spelling", default_value: true,
               prompt: "English-language corrections"});
form.checkBox({name: "formdostem", label: "Perform stemming", default_value: true,
               prompt: "Replace words with root word, e.g. tested is replaced by test"});
form.checkBox({name: "formdosynonyms", label: "Replace synonyms", default_value: true,
               prompt: "Replace words with synonyms"});
form.textBox({name: "formreplacewords", label: "Replace these words/phrases", required: false,
              prompt: "Comma-separated list of pairs separated by colon, e.g. very bad:really poor replaces 'very bad' with 'really poor'"});
form.textBox({name: "formremovewords", label: "Remove these words/phrases", required: false, prompt: "Comma-separated list e.g. about, from, john smith"});
form.textBox({name: "formphrases", label: "Phrases", required:false,
              prompt: "Comma-separated list of phrases which are not broken into words, e.g. hello world, high five"});
form.numericUpDown({name: "formNGramMax", label: "Maximum n for n-gram identification", default_value: 5, increment: 1, minimum:1, maximum: 999999,
                    prompt: "5 will return pentagrams or smaller; 3 will return trigrams, bigrams, and unigrams; etc."});
form.numericUpDown({name: "formminfreq", label: "Minimum frequency", default_value: 5, increment: 1, minimum:1, maximum: 999999,
                    prompt: "Words/phrases with lower frequencies are discarded"});
form.checkBox({name: "formalphabetical", label: "Sort alphabetically", default_value: false});

# Analysis code (R)
library(flipTextAnalysis)
if (!is.null(QPopulationWeight))
{
    warning("Weights have no effect on this item.")
}
# Convert the form inputs into processing options.
options <- GetTextAnalysisOptions(phrases = formphrases,
                                  extra.stopwords.text = formremovewords,
                                  replacements.text = formreplacewords,
                                  do.stem = formdostem,
                                  do.spell = formdospell,
                                  do.synonyms = formdosynonyms)
# Process the text variable with those options.
text.analysis.setup <- InitializeWordBag(formtextvar,
                                         min.frequency = formminfreq,
                                         operations = options$operations,
                                         manual.replacements = options$replacement.matrix,
                                         stoplist = options$stopwords,
                                         alphabetical.sort = formalphabetical,
                                         phrases = options$phrases,
                                         subset = QFilter,
                                         ngram.max = formNGramMax)