Text Analysis - Advanced - Setup Text Analysis
Related Videos: Chapter within Text Analysis in Q5 (Video)
This item performs the initial phases of text analysis, and it is required as the input to other text analysis options. The processing is designed to work with English-language inputs. See this blog post for an example and description of the process.
This item works by splitting the input text into individual words, processing each word separately, and then recombining the words from each entry. Processing of the words includes basic spell-checking, stemming, removal of words, replacement of specific words, and combination of words into phrases.
The item allows you to view the frequencies of the words that remain in the processed text, or the final text responses that have been produced by the processing.
If an input was not blank before processing but processing removes all words, the message <NO_WORDS_REMAIN_AFTER_PROCESSING> is displayed.
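The split, process, and recombine flow described above can be illustrated with a small JavaScript sketch. This is not the code used by this item: the function name processResponses, the tiny stopword list, and the simple tokenizer are assumptions made only for illustration, and spell-checking and stemming are left out.

```javascript
// Illustrative only: a toy version of "split -> process each word -> recombine".
// The stopword list below is a made-up sample, not the list used by the item.
const STOPWORDS = new Set(["at", "of", "the", "is", "to"]);

function processResponses(responses, replacements = new Map()) {
  return responses.map((response) => {
    const words = response
      .toLowerCase()
      .split(/[^a-z']+/)                      // split on anything that is not a letter or apostrophe
      .filter((w) => w.length > 0)            // drop empty fragments
      .map((w) => replacements.get(w) ?? w)   // replace specific words
      .filter((w) => !STOPWORDS.has(w));      // remove unwanted words
    const wasBlank = response.trim() === "";
    // Non-blank inputs that lose every word get the documented placeholder.
    if (!wasBlank && words.length === 0) {
      return "<NO_WORDS_REMAIN_AFTER_PROCESSING>";
    }
    return words.join(" ");                   // recombine the words of this entry
  });
}

const tidied = processResponses(
  ["The product is easy to use", "Of the", ""],
  new Map([["ease", "easy"]])
);
console.log(tidied);
```

The second response contains only stopwords, so it is reported with the placeholder, while the blank third response stays blank.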
Example
To create a Setup Text Analysis output, go to Create > Text Analysis > Advanced > Setup Text Analysis.
- Under Inputs > Text Analysis Options > Text Variable select a Text variable.
- Make any other selections that you require.
- Ensure the Automatic box is checked, or click Calculate.
The output generated by this function will look like the below. The colors represent unique words and are intended to draw your attention to where they appear. The colors have no meaning beyond this.
Extract word frequencies
To extract the word frequencies table from the output, take the following steps:
- Select Create > R Output.
- Under Properties > R CODE paste in the code below, remembering to replace text.analysis.setup with the reference name of your text analysis setup object:
- Check the Automatic box.
data.frame("word" = text.analysis.setup$final.tokens, "count" = as.numeric(text.analysis.setup$final.counts))
Options
Text variable The variable containing the text responses that you want to process.
Correct spelling Whether or not you want the setup to include basic English-language spell-checking.
Stemming Whether or not you want the setup to include basic stemming of English-language words. Stemming refers to replacing a word with its root word, or with the most common word in the text that corresponds to the same root. This includes removing plurals and past tense. For example, the words tests, tested, and test all share a common word stem and will be replaced with whichever of the three words is most common in the text variable.
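As a rough illustration of replacing each word with the most common variant that shares its stem, consider the JavaScript sketch below. The crudeStem function is a deliberately simplistic stand-in (it only strips a trailing "ed", "es", or "s") and is not the stemmer used by this item. Resolving ties in favor of the variant seen first is also an assumption.

```javascript
// Toy stemmer (assumption): strips a trailing "ed", "es" or "s".
// Real stemmers handle many more cases.
function crudeStem(word) {
  return word.replace(/(ed|es|s)$/, "");
}

function replaceWithMostCommonVariant(words) {
  // Count how often each surface form occurs.
  const counts = new Map();
  for (const w of words) counts.set(w, (counts.get(w) || 0) + 1);

  // For each stem, remember the most frequent surface form (first seen wins ties).
  const bestByStem = new Map();
  for (const [word, count] of counts) {
    const stem = crudeStem(word);
    const best = bestByStem.get(stem);
    if (!best || count > best.count) bestByStem.set(stem, { word, count });
  }

  // Replace every word with the winning variant for its stem.
  return words.map((w) => bestByStem.get(crudeStem(w)).word);
}

const stemmed = replaceWithMostCommonVariant(["tested", "tests", "tested", "test", "price"]);
console.log(stemmed);
```

Here "tested" occurs most often among tests, tested, and test, so all three are replaced by "tested", while "price" is left alone.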
Replace synonyms Whether words with the same meaning (in English) will be grouped. They will be replaced with the variant that occurs most frequently in the data set. Words with ambiguous meanings (e.g. 'fast', 'book') will not be replaced. This operation occurs after spelling corrections and stemming.
Replace these words/phrases Specify any words or phrases that should be replaced. For example, if you want to replace the word ease with the word easy wherever it appears, and the phrase one half with the phrase fifty percent wherever it appears, then use the syntax: ease:easy, one half:fifty percent.
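To make the from:to syntax concrete, here is a minimal JavaScript sketch that parses such a string and applies it to a response. The function names parseReplacements and applyReplacements are hypothetical, and whole-word matching is an assumption about how replacements behave. The item itself does this work in R.

```javascript
// Parse "ease:easy, one half:fifty percent" into a Map of from -> to.
function parseReplacements(text) {
  const pairs = new Map();
  for (const item of text.split(",")) {
    const [from, to] = item.split(":").map((s) => s.trim());
    if (from && to) pairs.set(from, to);   // skip malformed entries
  }
  return pairs;
}

// Apply each replacement to a response, matching whole words or phrases only.
function applyReplacements(response, pairs) {
  let result = response;
  for (const [from, to] of pairs) {
    // Escape regex metacharacters so the phrase is matched literally.
    const escaped = from.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
    result = result.replace(new RegExp(`\\b${escaped}\\b`, "g"), () => to);
  }
  return result;
}

const pairs = parseReplacements("ease:easy, one half:fifty percent");
const replaced = applyReplacements("ease of use for one half of users", pairs);
console.log(replaced);
```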
Remove these words/phrases Any words or phrases that you want to exclude from the analysis. English-language stopwords (like at, of, the, etc.) will be removed automatically. If you want to remove additional words or phrases, type them in separated by commas.
Phrases Specify any combinations of words that should not be split up when they occur together. For example, if your text contains lots of responses that say the product is easy to use and you want to keep this as a unit in your analysis rather than splitting it up as the words "easy", and "to", and "use" then you can enter this phrase here. Additional phrases should be separated by commas. Any phrases specified in Replace these words/phrases and Remove these words/phrases are automatically treated as phrases throughout the text without being specified here.
Maximum n for n-gram identification This value determines the largest number of words to be considered as possible n-grams (contiguous chunks of n words). For example, a value of 5 will search the text for pentagrams or smaller; 3 will search for trigrams, bigrams, and unigrams; etc.
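The effect of the maximum n can be seen in a short JavaScript sketch that lists every contiguous chunk of up to n words. The function name ngramsUpTo is hypothetical. How the item decides which of these candidates to keep as phrases is not covered here.

```javascript
// List all contiguous n-grams of the given words for n = 1 .. maxN.
function ngramsUpTo(words, maxN) {
  const grams = [];
  for (let n = 1; n <= maxN; n++) {
    for (let i = 0; i + n <= words.length; i++) {
      grams.push(words.slice(i, i + n).join(" "));
    }
  }
  return grams;
}

const grams = ngramsUpTo(["easy", "to", "use"], 3);
console.log(grams);
```

With a maximum of 3, the three-word response yields three unigrams, two bigrams, and one trigram. A maximum of 2 would omit "easy to use".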
Minimum frequency as a percentage of cases Whether to specify the minimum frequency as a percentage of cases or as a fixed number.
Percentage (%) of cases for minimum frequency The percentage of cases to use as the minimum frequency. Words or phrases must appear in at least this percentage of cases to be included in the analysis.
Minimum frequency Words or phrases must appear at least this many times in your text to be included in the analysis.
Count frequency by number of respondents If checked, multiple occurrences of a word or phrase by the same respondent will only be counted once in the category's frequency. Otherwise, each occurrence will be counted separately.
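The interaction between the frequency options can be sketched in JavaScript as follows. The function names are hypothetical, and treating the percentage threshold as percent × cases / 100 with no rounding is an assumption. The item's own handling of the threshold may differ.

```javascript
// Count words either once per respondent or once per occurrence.
function countFrequencies(tokenizedResponses, countByRespondent) {
  const counts = new Map();
  for (const tokens of tokenizedResponses) {
    // A Set collapses repeated words within one response.
    const toCount = countByRespondent ? new Set(tokens) : tokens;
    for (const t of toCount) counts.set(t, (counts.get(t) || 0) + 1);
  }
  return counts;
}

// Convert the minimum-frequency options into a single count threshold.
// Assumption: a percentage is applied to the number of cases without rounding.
function minimumCount(options, numCases) {
  return options.asPercentage
    ? (options.percent * numCases) / 100
    : options.minFrequency;
}

// Keep only the words that meet the threshold.
function filterByFrequency(counts, threshold) {
  return new Map([...counts].filter(([, count]) => count >= threshold));
}

const responses = [["good", "good", "price"], ["good"]];
const byRespondent = countFrequencies(responses, true);   // good: 2, price: 1
const byOccurrence = countFrequencies(responses, false);  // good: 3, price: 1
const threshold = minimumCount({ asPercentage: true, percent: 0.5 }, 1000);
const kept = filterByFrequency(byOccurrence, 2);
console.log(byRespondent, byOccurrence, threshold, kept);
```

With the defaults shown in the form code (0.5% of cases), a data set of 1000 cases gives a threshold of 5 under this assumption.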
Sort alphabetically When displaying the Word Frequencies output, display words and phrases in alphabetical order instead of order of frequency.
Page of raw text to show The page number to show, e.g., if the page number is 3 and there are 1000 rows per page, then rows 2001 to 3000 will be shown.
Rows of raw text per page The number of rows of raw text to show in a page.
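The paging arithmetic behind these two options can be written as a small JavaScript helper. The function name pageRowRange is hypothetical, and returning null for a page beyond the end of the data is an assumption about edge-case behavior.

```javascript
// Return the 1-based first and last row shown on a given page.
function pageRowRange(page, rowsPerPage, totalRows) {
  const first = (page - 1) * rowsPerPage + 1;
  const last = Math.min(page * rowsPerPage, totalRows);
  // Assumption: a page past the end of the data has nothing to show.
  return first > totalRows ? null : { first, last };
}

const fullPage = pageRowRange(3, 1000, 5000);    // rows 2001 to 3000
const partialPage = pageRowRange(3, 1000, 2500); // rows 2001 to 2500
console.log(fullPage, partialPage);
```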
SAVE VARIABLE(S)
Maximum number of unique categories to save Maximum number of unique categories that are to be saved. If the number of unique categories exceeds this limit then the most popular categories are chosen with ties broken alphabetically.
Maximum number of categories per case to save Maximum number of categories per case to be saved. If this control is set to a value n, but the maximum number of categories for a case is m > n, then only the first n categories are saved.
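The two limits above can be sketched in JavaScript. The function names are hypothetical, and applying the unique-category limit before truncating each case is an assumption about the order of operations.

```javascript
// Pick the most popular categories, breaking ties alphabetically.
function chooseCategories(counts, maxUnique) {
  return [...counts.entries()]
    .sort(([a, countA], [b, countB]) =>
      countB - countA || (a < b ? -1 : a > b ? 1 : 0))
    .slice(0, maxUnique)
    .map(([category]) => category);
}

// Keep only the first maxPerCase retained categories mentioned in one case.
function categoriesForCase(caseCategories, keptCategories, maxPerCase) {
  const keptSet = new Set(keptCategories);
  return caseCategories.filter((c) => keptSet.has(c)).slice(0, maxPerCase);
}

const categoryCounts = new Map([
  ["service", 2],
  ["price", 3],
  ["quality", 2],
  ["delivery", 1],
]);
const keptCategories = chooseCategories(categoryCounts, 2);
const saved = categoriesForCase(["service", "quality", "price"], keptCategories, 1);
console.log(keptCategories, saved);
```

"quality" and "service" are tied on frequency, so with a limit of two categories the alphabetically earlier "quality" is kept alongside "price".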
Save categories Adds variables to the data set containing the categories. Where there are multiple input variables, multiple sets of variables are added for each.
Save first category Adds a variable to the data set containing the first category mentioned. Where there are multiple input categories, the first category of each will be saved as a separate variable.
Save sentiment scores Adds a variable which assigns scores to quantify how positive or negative each text response is.
Save tidied text Adds a variable to the data set that contains the tidied text.
Other Analyses
This item is used as an input in the following analyses:
- Text Analysis - Techniques - Create Term Document Matrix creates a matrix whose columns correspond to the words identified in the text. This can be used as an input to other models.
- Text Analysis - Techniques - Predictive Tree creates a predictive tree which shows how the presence of particular words in the text predicts the values of an outcome variable.
- Text Analysis - Techniques - Save Sentiment Scores creates a new variable in your data which measures how many negative and positive words occur in each text response.
- Text Analysis - Techniques - Save Tidied Text creates a new variable in your data which contains the processed text. This can be used as an input to coding or a word cloud.
More Information
How to set up your text analysis in Displayr
A standard list of stopwords is removed automatically by this function. If you'd like to keep stopwords in your text, you can modify the underlying Properties > R CODE to do this: add the remove.stopwords argument to the arguments list and set it equal to FALSE.
Code
var heading_text = "Text Analysis Options";
if (!!form.setObjectInspectorTitle)
    form.setObjectInspectorTitle(heading_text, heading_text);
else
    form.setHeading(heading_text);
form.dropBox({name: "formtextvar", label: "Text variable", types: ["Q:text", "R:character"],
              prompt: "Variable containing text strings to analyze"});
form.checkBox({name: "formdospell", label: "Correct spelling", default_value: true,
               prompt: "English-language corrections"});
form.checkBox({name: "formdostem", label: "Perform stemming", default_value: true,
               prompt: "Replace words with root word, e.g. tested is replaced by test"});
form.checkBox({name: "formdosynonyms", label: "Replace synonyms", default_value: true,
               prompt: "Replace words with synonyms"});
form.textBox({name: "formreplacewords", label: "Replace these words/phrases", required: false,
              prompt: "Comma-separated list of pairs separated by colon, e.g. very bad:really poor replaces 'very bad' with 'really poor'"});
form.textBox({name: "formremovewords", label: "Remove these words/phrases", required: false,
              prompt: "Comma-separated list e.g. about, from, john smith"});
form.textBox({name: "formphrases", label: "Phrases", required: false,
              prompt: "Comma-separated list of phrases which are not broken into words, e.g. hello world, high five"});
form.numericUpDown({name: "formNGramMax", label: "Maximum n for n-gram identification", default_value: 5, increment: 1, minimum: 1, maximum: 999999,
                    prompt: "5 will return pentagrams or smaller; 3 will return trigrams, bigrams, and unigrams; etc."});
var min_freq_as_percent = form.checkBox({name: "formminfreqaspercent", label: "Minimum frequency as a percentage of cases",
                                         default_value: true,
                                         prompt: "Specify minimum frequency as a percentage of cases. Words/phrases with lower frequencies are discarded."}).getValue();
if (min_freq_as_percent)
    form.numericUpDown({name: "formminfreqpercent", label: "Percentage (%) of cases for minimum frequency",
                        default_value: 0.5, increment: 0.1, minimum: 0, maximum: 100,
                        prompt: "The percentage of cases to be used as the minimum frequency. Words/phrases with a lower percentage are discarded"});
else
    form.numericUpDown({name: "formminfreq", label: "Minimum frequency", default_value: 5, increment: 1, minimum: 1, maximum: 999999,
                        prompt: "Words/phrases with lower frequencies are discarded"});
form.checkBox({name: "formCountRespondents", label: "Count frequency by number of respondents",
               default_value: true, prompt: "If checked, multiple occurrences of a phrase by the same respondent will only be " +
               "counted once in the category's frequency. Otherwise, each occurrence will be counted separately."});
form.checkBox({name: "formalphabetical", label: "Sort alphabetically", default_value: false});
form.numericUpDown({name: "formPageNum", label: "Page of raw text to show",
                    default_value: 1, minimum: 1, maximum: 99999999,
                    prompt: "The page number to show, e.g., if the page number is 3 and there are 1000 rows per page, then rows 2001 to 3000 will be shown."});
form.numericUpDown({name: "formPageSize", label: "Rows of raw text per page",
                    default_value: 1000, minimum: 1, maximum: 10000,
                    prompt: "The number of rows of raw text to show per page."});
form.group("Save Variable(s)");
form.numericUpDown({name: "formMaxLevels", label: "Maximum number of unique categories to save",
                    prompt: "Maximum number of unique categories to save when using Save Variable(s) Categories or First Category",
                    default_value: 500, minimum: 1, maximum: 999999});
form.numericUpDown({name: "formMaxMentions", label: "Maximum number of categories per case to save",
                    prompt: "Maximum number of categories per case to save when using Save Variable(s) Categories",
                    default_value: 100, minimum: 1, maximum: 999999});
library(flipTextAnalysis)
if (!is.null(QPopulationWeight))
{
    warning("Weights have no effect on this item.")
}
options <- GetTextAnalysisOptions(phrases = formphrases,
                                  extra.stopwords.text = formremovewords,
                                  replacements.text = formreplacewords,
                                  do.stem = formdostem,
                                  do.spell = formdospell,
                                  do.synonyms = formdosynonyms)
arguments <- list(text = formtextvar,
                  min.frequency = if (formminfreqaspercent) NULL else formminfreq,
                  min.frequency.scale = if (formminfreqaspercent) formminfreqpercent / 100 else NULL,
                  operations = options$operations,
                  manual.replacements = options$replacement.matrix,
                  stoplist = options$stopwords,
                  alphabetical.sort = formalphabetical,
                  phrases = options$phrases,
                  subset = QFilter,
                  ngram.max = formNGramMax,
                  count.num.respondents = formCountRespondents,
                  num.rows.in.page.to.show = formPageSize,
                  page.to.show = formPageNum)
text.analysis.setup <- TextAnalysis(InitializeWordBag, arguments)