Text Analysis - Automatic Categorization - List of Items


Extracts categories from text data in a list-like format (i.e., nouns separated by delimiters). The algorithm also attempts to analyze text that is not in this format, such as sentences, and to correct misspellings.

Example

To create a List of Items output, go to Insert > Text Analysis > Automatic Categorization > List of Items.

  1. Under Inputs > Text variable select one or more Text variables.
  2. Make any other selections or changes to the settings that you require.
  3. Ensure the Automatic box is checked, or click Calculate.


The example below shows a List of Items output for a survey question about which software respondents use for coding text data. The Categories section, which is expanded, shows a table of the categories on the left and the raw and transformed text on the right. Each category has its own shading, and replaced text is shaded in bright yellow. The Diagnostics section at the bottom (collapsed, but expandable) shows diagnostic information for each processing step (each of which is also collapsed but expandable).

Extracting a Table of Frequencies

To extract the table of frequencies from this output, save the results as a variable in your Data Set and then make a table from that variable. Take these steps:

  1. Select the List of Items output.
  2. Go to Insert > Text Analysis > Advanced > Save Variable(s) > Categories (or Create > Text Analysis > Advanced > Save Variable(s) > Categories).

One new variable set (question) for each of the input variables will be saved in your Data Set. These can then be used like any other variable set (question) to create tables and further outputs.

More Information

Automatically Extract Entities and Sentiment from Text

Options

DATA SOURCE

Text variable(s) The variable(s) containing text to analyze.

Minimum category size Categories with a frequency below this value will be merged into the UNCLASSIFIED category, and can be reviewed in Diagnostics > Categories below minimum frequency in the output. Categories that should not be UNCLASSIFIED need to be added to the list of required categories.
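As a rough illustration of the merging rule described above, the following sketch folds small categories into UNCLASSIFIED while leaving required categories alone. The function name `mergeSmallCategories` and the shape of its inputs are assumptions made for this example; it is not the code used by the List of Items output.

```javascript
// Hypothetical helper (not part of flipTextAnalysis): merge categories whose
// frequency is below minSize into UNCLASSIFIED, except required categories.
function mergeSmallCategories(counts, minSize, required = new Set()) {
  const merged = {};
  for (const [category, n] of Object.entries(counts)) {
    const tooSmall = n < minSize && !required.has(category);
    const key = tooSmall ? "UNCLASSIFIED" : category;
    merged[key] = (merged[key] || 0) + n;
  }
  return merged;
}

// With a minimum category size of 3, NVivo and Word are merged.
console.log(mergeSmallCategories({ Excel: 12, Q: 9, NVivo: 1, Word: 2 }, 3));
// Marking NVivo as required keeps it as its own category.
console.log(mergeSmallCategories({ Excel: 12, Q: 9, NVivo: 1, Word: 2 }, 3, new Set(["NVivo"])));
```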

Count frequency by number of respondents If checked, multiple occurrences of a phrase by the same respondent will only be counted once in the category's frequency. Otherwise, each occurrence will be counted separately.
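The difference between the two counting modes can be sketched as follows. The helper `categoryFrequency` and the input layout (one array of categories per respondent) are assumptions for illustration only.

```javascript
// Hypothetical helper: count category frequencies either once per respondent
// or once per occurrence.
function categoryFrequency(responses, byRespondent) {
  const freq = {};
  for (const items of responses) {
    // A Set drops repeated mentions within the same respondent.
    const counted = byRespondent ? new Set(items) : items;
    for (const item of counted) {
      freq[item] = (freq[item] || 0) + 1;
    }
  }
  return freq;
}

const responses = [["Excel", "Excel", "Q"], ["Q"]];
console.log(categoryFrequency(responses, true));  // Excel counted once
console.log(categoryFrequency(responses, false)); // Excel counted twice
```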

OUTPUT

Sort frequency table Sort categories in frequency table alphabetically or by frequency (descending).

Sort raw text Sort cases in raw text by the original order in the data or by the distance between the raw and normalized text (descending).

Categories to show Only show cases in the output containing the specified category.

Raw text to show Only show cases in the output containing the specified text.

Show cases of a specific variable Only show cases of a specific variable in the output.

Page of raw text to show The page number to show, e.g., if the page number is 3 and there are 1000 rows per page, then rows 2001 to 3000 will be shown.

Rows of raw text per page The number of rows of raw text to show in a page.
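The paging arithmetic in the two options above can be checked with a small sketch. The function name `pageRowRange` is hypothetical and rows are assumed to be numbered from 1.

```javascript
// Hypothetical helper: first and last row (1-based) shown on a given page.
function pageRowRange(page, rowsPerPage) {
  const first = (page - 1) * rowsPerPage + 1;
  const last = page * rowsPerPage;
  return [first, last];
}

console.log(pageRowRange(3, 1000)); // [ 2001, 3000 ]
```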

REQUIRED CATEGORIES

Add/Edit required phrases and variants Enter categories that are required to appear in the output in the first column under the column header. A required category will always appear in the list of categories and will not be split into smaller categories, spell-corrected, or removed if it falls below the minimum category frequency. Variants of the required categories are specified in subsequent columns. This allows different variants of a category to be consolidated.
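One way to picture how variants are consolidated is a lookup from each variant to its required category. The sketch below assumes each table row has the form [required category, variant 1, variant 2, ...] with the header row omitted, and it assumes case-insensitive matching; neither assumption is confirmed by this page, and the helper names are made up for the example.

```javascript
// Hypothetical helpers: map every variant (and the category itself) to the
// required category it belongs to.
function buildVariantLookup(table) {
  const lookup = new Map();
  for (const [category, ...variants] of table) {
    lookup.set(category.toLowerCase(), category);
    for (const variant of variants) {
      lookup.set(variant.toLowerCase(), category);
    }
  }
  return lookup;
}

function consolidate(phrase, lookup) {
  const key = phrase.trim().toLowerCase();
  return lookup.has(key) ? lookup.get(key) : phrase;
}

const lookup = buildVariantLookup([
  ["Microsoft Excel", "excel", "ms excel"],
  ["Q", "q research software"],
]);
console.log(consolidate("MS Excel", lookup)); // Microsoft Excel
console.log(consolidate("SPSS", lookup));     // SPSS (no variant matches)
```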

DELIMITERS / SPLIT TEXT

Tab Use tab as a delimiter to split text.

Semicolon Use semicolon as a delimiter to split text.

Comma Use comma as a delimiter to split text.

Space Use space as a delimiter to split text.

Newline Use newline as a delimiter to split text.

Other Other delimiters as a comma-separated list, e.g.: +, &, and.
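A minimal sketch of delimiter-based splitting, assuming the comma-separated delimiter list is parsed first and each delimiter is then treated as a literal string. The helpers `parseDelimiterList` and `splitOnDelimiters` are hypothetical, and word delimiters such as "and" would need extra word-boundary handling that is not shown here.

```javascript
// Hypothetical helpers for illustration only.
function parseDelimiterList(list) {
  return list.split(",").map((d) => d.trim()).filter((d) => d.length > 0);
}

// Escape characters that have a special meaning in regular expressions.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

function splitOnDelimiters(text, delimiters) {
  const pattern = new RegExp(delimiters.map(escapeRegExp).join("|"));
  return text.split(pattern).map((s) => s.trim()).filter((s) => s.length > 0);
}

console.log(parseDelimiterList("+, &, and"));
console.log(splitOnDelimiters("Excel; Q, SPSS\tR", [";", ",", "\t"]));
console.log(splitOnDelimiters("Q + Excel (sometimes)", ["+", "(", ")"]));
```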

Conditional These delimiters will only be used if both phrases on either side are recognized. Enter delimiters as a comma-separated list e.g.: +, &, and.
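The conditional rule can be sketched as a split that is accepted only when every resulting piece is a recognized phrase. This is a simplification with a hypothetical helper name, `splitConditionally`; how phrases are recognized in the actual output is not described on this page.

```javascript
// Hypothetical helper: split on the delimiter only if all pieces are known.
function splitConditionally(phrase, delimiter, known) {
  const parts = phrase.split(delimiter).map((s) => s.trim());
  const allKnown = parts.length > 1 && parts.every((p) => known.has(p));
  return allKnown ? parts : [phrase];
}

const known = new Set(["Q", "Excel"]);
console.log(splitConditionally("Q & Excel", "&", known)); // [ 'Q', 'Excel' ]
console.log(splitConditionally("AT&T", "&", known));      // [ 'AT&T' ]
```

Keeping "AT&T" intact is the point of a conditional delimiter: an unconditional "&" would break it into two meaningless pieces.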

Split if contains known categories Split certain categories if they contain known categories within them, e.g., the phrase 'I use Q and Excel' gets split into 'I use', 'Q', 'and', 'Excel' if 'Q' and 'Excel' are known categories. Splits can be reviewed in Diagnostics > Splits by known categories in the output and incorrect splits can be prevented by adding them to the list of required categories.
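The example in the option above can be reproduced with a short sketch. It assumes known categories are single words and that the phrase is tokenized on whitespace, which is a simplification; `splitByKnownCategories` is a hypothetical name.

```javascript
// Hypothetical helper: cut a phrase at each known category, keeping the
// words in between as their own pieces.
function splitByKnownCategories(phrase, known) {
  const pieces = [];
  let pending = [];
  for (const word of phrase.split(/\s+/).filter(Boolean)) {
    if (known.has(word)) {
      if (pending.length > 0) {
        pieces.push(pending.join(" "));
        pending = [];
      }
      pieces.push(word);
    } else {
      pending.push(word);
    }
  }
  if (pending.length > 0) pieces.push(pending.join(" "));
  return pieces;
}

console.log(splitByKnownCategories("I use Q and Excel", new Set(["Q", "Excel"])));
// [ 'I use', 'Q', 'and', 'Excel' ]
```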

Minimum known category frequency Minimum frequency for a category to be considered a 'known category'. See Split if contains known categories.

Maximum category frequency to split Maximum frequency for a category to be considered for splitting by known categories. See Split if contains known categories.
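How the two thresholds above partition the categories might look like this. The sketch assumes both bounds are inclusive and that a category meeting the known-category threshold is never itself a split candidate; the function name `classifyForSplitting` is made up for the example.

```javascript
// Hypothetical helper: decide which categories are 'known' and which are
// candidates for splitting, based on their frequencies.
function classifyForSplitting(counts, minKnownFreq, maxSplitFreq) {
  const known = [];
  const candidates = [];
  for (const [category, n] of Object.entries(counts)) {
    if (n >= minKnownFreq) known.push(category);
    else if (n <= maxSplitFreq) candidates.push(category);
  }
  return { known, candidates };
}

const counts = { Q: 14, Excel: 11, "I use Q and Excel": 1, SPSS: 4 };
// Thresholds of 10 and 2 match the default values in the form code.
console.log(classifyForSplitting(counts, 10, 2));
```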

Split phrase into categories Phrases to be split into multiple categories. The phrases to be split are specified in the first column and the resulting categories are specified in subsequent columns.

SPELLING CORRECTION

Correct spelling Whether to correct certain categories for spelling errors using commonly occurring categories and words from a dictionary as a reference. Spelling corrections can be reviewed in Diagnostics > Spelling Corrections in the output.

Minimum reference category frequency Phrases occurring with at least this frequency will be included in the dictionary.

Maximum corrected category frequency Phrases occurring at or below this frequency will be considered for spelling correction.
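The interplay of the two frequency thresholds can be sketched as follows. The sketch uses plain Levenshtein edit distance with a cutoff of 1 as a stand-in; the distance measure actually used by the output is not described here, so treat that choice, and the helper names, as assumptions.

```javascript
// Standard Levenshtein edit distance between two strings.
function editDistance(a, b) {
  let row = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const cur = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      cur[j] = Math.min(row[j] + 1, cur[j - 1] + 1, row[j - 1] + cost);
    }
    row = cur;
  }
  return row[b.length];
}

// Hypothetical helper: frequent phrases act as references; rare phrases are
// corrected to a reference within maxDistance edits.
function correctSpelling(counts, minReferenceFreq, maxCorrectedFreq, maxDistance) {
  const references = Object.keys(counts).filter((p) => counts[p] >= minReferenceFreq);
  const corrections = {};
  for (const phrase of Object.keys(counts)) {
    if (counts[phrase] > maxCorrectedFreq) continue;
    const match = references.find(
      (r) => r !== phrase && editDistance(phrase, r) <= maxDistance);
    if (match !== undefined) corrections[phrase] = match;
  }
  return corrections;
}

const counts = { Excel: 8, Excell: 1, NVivo: 6, Nvivo: 2, SPSS: 3 };
console.log(correctSpelling(counts, 5, 4, 1));
```

With thresholds of 5 and 4, "Excel" and "NVivo" serve as references, and "SPSS" is left alone because no reference is within one edit of it.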

Use dictionary If not selected, only high-frequency words in the data set will be used as corrections.

Phrases that shouldn't be corrected Specify phrases that should not be spell corrected in the first column under the column header. Spelling corrections can be reviewed in Diagnostics > Spelling Corrections in the output and incorrect corrections added to this list.

CATEGORIES TO DISCARD

Add/Edit categories to discard Enter categories to discard in the first column under the column header. Discarded categories will be merged into the UNCLASSIFIED category.

SAVE VARIABLE(S)

Maximum number of unique categories to save Maximum number of unique categories to be saved per text variable. If the number of unique categories exceeds this limit, the most popular categories are chosen, with ties broken alphabetically.
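The selection rule above (most popular first, ties broken alphabetically) can be sketched like this. The helper name `topCategories` is hypothetical, and a plain code-point string comparison is assumed for the alphabetical tie-break.

```javascript
// Hypothetical helper: keep the maxCategories most frequent categories,
// breaking ties alphabetically.
function topCategories(counts, maxCategories) {
  return Object.keys(counts)
    .sort((a, b) => counts[b] - counts[a] || (a < b ? -1 : a > b ? 1 : 0))
    .slice(0, maxCategories);
}

console.log(topCategories({ Q: 5, Excel: 5, SPSS: 3, R: 3, Word: 1 }, 3));
// [ 'Excel', 'Q', 'R' ]
```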

Maximum number of categories per case to save Maximum number of categories per case to be saved. If this control is set to a value n, but a case contains m > n categories, then only the first n categories are saved for that case.

Save categories Adds variables to the data set containing the categories. Where there are multiple input variables, a separate set of variables is added for each.

Save first category Adds a variable to the data set containing the first category mentioned. Where there are multiple input variables, the first category of each will be saved as a separate variable.
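The per-case limit and the first-category option described above can be pictured with two small helpers. Both names are hypothetical, and each case is assumed to be an array of categories in order of mention.

```javascript
// Hypothetical helper: keep only the first maxPerCase categories of a case.
function categoriesToSave(caseCategories, maxPerCase) {
  return caseCategories.slice(0, maxPerCase);
}

// Hypothetical helper: the first category mentioned, or null if there is none.
function firstCategory(caseCategories) {
  return caseCategories.length > 0 ? caseCategories[0] : null;
}

console.log(categoriesToSave(["Q", "Excel", "SPSS"], 2)); // [ 'Q', 'Excel' ]
console.log(firstCategory(["Q", "Excel", "SPSS"]));       // Q
```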

Code

var allow_control_groups = Q.fileFormatVersion() > 10.9;
var controls = [];

if (allow_control_groups)
    form.group({label: "Data source", expanded: true});

controls.push(form.dropBox({name: "formTextVar", label: "Text variable(s)", types: ["Variable: text, categorical", "R:character"],
              prompt: "Select text variable(s) to analyze.", multi: true}));
controls.push(form.numericUpDown({name: "formMinFreq", label: "Minimum category size",
                    default_value: 1, increment: 1, maximum: Number.MAX_SAFE_INTEGER, minimum: 1,
                        prompt: "Categories with a frequency below this value will be merged into the UNCLASSIFIED category."}));

controls.push(form.checkBox({name: "formCountRespondents", label: "Count frequency by number of respondents",
               default_value: true, prompt: "If checked, multiple occurrences of a phrase by the same respondent will only be " +
                                             "counted once in the category's frequency. Otherwise, each occurrence will be counted separately."}));

if (allow_control_groups)
    form.group({label: "Output", expanded: true});

var sort_phrases = form.comboBox({name: "formSortPhrases", label: "Sort frequency table",
                                  alternatives: ["Alphabetically", "By frequency"],
                                  default_value: "By frequency",
                                  prompt: "Sort categories in frequency table alphabetically or by frequency (descending)."});
controls.push(sort_phrases);
controls.push(form.comboBox({name: "formSortRawText", label: "Sort raw text", alternatives: ["By transformation distance", "By original case order"],
                             default_value: "By transformation distance",
                             prompt: "Sort cases in raw text by the original order in the data or by the distance between the raw and normalized text (descending)."}));

controls.push(form.textBox({name: "formPhraseShow", label: "Categories to show", required: false,
                            prompt: "Only show cases in the output containing the specified category."}));
controls.push(form.textBox({name: "formTextShow", label: "Raw text to show", required: false,
                            prompt: "Only show cases in the output containing the specified text."}));

var show_specific = form.checkBox({name: "formSpecificVariable",
                             label: "Show cases of a specific variable",
                             default_value: false,
                             prompt: "Only show cases of a specific variable in the output."});
controls.push(show_specific);

if (show_specific.getValue())
    controls.push(form.numericUpDown({name: "formVariableIndex",
                                      label: "Variable index",
                                      default_value: 1,
                                      increment: 1,
                                      maximum: Number.MAX_SAFE_INTEGER,
                                      minimum: 1}));
controls.push(form.numericUpDown({name: "formPageNum", label: "Page of raw text to show",
                                  default_value: 1, minimum: 1, maximum: Number.MAX_SAFE_INTEGER,
                                  prompt: "The page number to show, e.g., if the page number is 3 and there are 1000 rows per page, then rows 2001 to 3000 will be shown."}));
controls.push(form.numericUpDown({name: "formPageSize", label: "Rows of raw text per page",
                                  default_value: 1000, minimum: 1, maximum: 10000,
                                  prompt: "The number of rows of raw text to show per page."}));

controls.push(form.comboBox({name: "formExpanded", label: "Show expanded",
                             alternatives: ["Categories", "Variant suggestions", "Required categories", "Delimiters",
                                            "Conditional delimiters", "Splits by known categories", "Splits into categories",
                                            "Spelling corrections", "Discarded categories", "Categories below minimum frequency"],
                             default_value: "Categories", prompt: "Output to show expanded"}));

if (allow_control_groups)
    form.group({label: "Required categories", expanded: false});
var manual_default = [["Enter required categories below:"]]
for (var i = 1; i < 99; i++)
    manual_default[0][i] = "Variant " + i;
controls.push(form.dataEntry({name: "formManual",
                    prompt: "Enter categories that are required to appear in the output in the first column under the column header. " +
                            "Phrases identified in this category will not be processed further.",
                    required: false,
                    default_value: manual_default, label: "Add required phrases and variants",
                    edit_label: "Edit required phrases and variants",
                    large_data_error: "The data entered is too large. You may only enter data with up to 1000 rows and up to 100 columns."}));

if (allow_control_groups)
    form.group({label: "Delimiters / split text", expanded: false});

controls.push(form.checkBox({name: "formTab", label: "Tab", default_value: true,
               prompt: "Use tab as a delimiter to split text"}));
controls.push(form.checkBox({name: "formSemicolon", label: "Semicolon", default_value: true,
               prompt: "Use semicolon as a delimiter to split text"}));
controls.push(form.checkBox({name: "formComma", label: "Comma", default_value: true,
               prompt: "Use comma as a delimiter to split text"}));
controls.push(form.checkBox({name: "formSpace", label: "Space", default_value: false,
               prompt: "Use space as a delimiter to split text"}));
controls.push(form.checkBox({name: "formNewline", label: "Newline", default_value: true,
               prompt: "Use newline as a delimiter to split text"}));
controls.push(form.textBox({name: "formDelimiters", label: "Other",
              required: false, prompt: "Enter other delimiters as a comma-separated list, e.g.: +, &, and",
              default_value: "(,),[,],{,}"}));

controls.push(form.textBox({name: "formConditionalDelimiters", label: "Conditional",
              required: false, prompt: "Enter delimiters as a comma-separated list e.g.: +, &, and. These delimiters will only be used if both phrases on either side are recognized"}));

var identify_common_phrases = form.checkBox({name: "formCommonPhrase",
                             label: "Split if contains known categories",
                             default_value: true,
                             prompt: "Split certain categories if they contain known categories within them, e.g., the phrase 'I use Q and Excel' " +
                                     "gets split into 'I use', 'Q', 'and', 'Excel' if 'Q' and 'Excel' are known categories."});
controls.push(identify_common_phrases)
if (identify_common_phrases.getValue())
{
    controls.push(form.numericUpDown({name: "formCommonPhraseFreq",
                                      label: "Minimum known category frequency",
                    default_value: 10, increment: 1, maximum: Number.MAX_SAFE_INTEGER, minimum: 1,
                    prompt: "Minimum frequency for category to be considered a 'known category'."}));
    controls.push(form.numericUpDown({name: "formSentencePhraseFreq",
                                      label: "Maximum category frequency to split",
                    default_value: 2, increment: 1, maximum: Number.MAX_SAFE_INTEGER, minimum: 1,
                    prompt: "Maximum frequency for a category to be considered for splitting by known categories."}));
}

var split_default = [["Enter phrases to split into categories below:"]]
for (var i = 1; i < 99; i++)
    split_default[0][i] = "Category " + i;
controls.push(form.dataEntry({name: "formSplitIntoCategories",
                              label: "Split phrase into categories",
                              edit_label: "Split phrase into categories",
                              prompt: "",
                              required: false,
                              default_value: split_default,
                              large_data_error: "The data entered is too large. You may only enter data with up to 1000 rows and up to 100 columns."}));

if (allow_control_groups)
    form.group({label: "Spelling correction", expanded: false});
var correct_spelling = form.checkBox({name: "formSpelling",
                             label: "Correct spelling",
                             default_value: true,
                             prompt: "Whether to correct certain categories for spelling errors using commonly occurring categories as a reference."});
controls.push(correct_spelling);

if (correct_spelling.getValue())
{
    controls.push(form.numericUpDown({name: "formSpellingReferenceFreq",
                                      label: "Minimum reference category frequency",
                    default_value: 5, increment: 1, maximum: Number.MAX_SAFE_INTEGER, minimum: 1,
                    prompt: "Phrases occurring with at least this frequency will be included into the dictionary."}));
    controls.push(form.numericUpDown({name: "formSpellingCorrectedFreq",
                                      label: "Maximum corrected category frequency",
                    default_value: 4, increment: 1, maximum: Number.MAX_SAFE_INTEGER, minimum: 1,
                    prompt: "Phrases occurring at or below this frequency will be considered for spelling correction."}));
    controls.push(form.checkBox({name: "formDictionary", label: "Use dictionary",
                    default_value: false, prompt: "If not selected only high frequency words in the data set will be used as corrections."}));

    var dont_correct_default = [["Enter phrases that shouldn't be corrected below:"]]
    controls.push(form.dataEntry({name: "formDontSpellCorrect",
                    label: "Phrases that shouldn't be corrected",
                    edit_label: "Phrases that shouldn't be corrected",
                    prompt: "Enter phrases that shouldn't be corrected in the first column",
                    required: false,
                    default_value: dont_correct_default,
                    large_data_error: "The data entered is too large. You may only enter data with up to 1000 rows and up to 100 columns."}));
}

if (allow_control_groups)
    form.group({label: "Categories to discard", expanded: false});
var discard_default = [["Enter categories to discard below:"]]
controls.push(form.dataEntry({name: "formDiscard",
                              prompt: "Enter categories to discard in the first column under the column header. " +
                                      "Discarded categories will be merged into the UNCLASSIFIED category.",
                              required: false,
                              default_value: discard_default,
                              label: "Add categories to discard",
                              edit_label: "Edit categories to discard",
                              large_data_error: "The data entered is too large. You may only enter data with up to 1000 rows and up to 100 columns."}));

if (allow_control_groups)
    form.group({label: "SAVE VARIABLE(S)", expanded: false});
controls.push(form.numericUpDown({name:"formMaxLevels", label:"Maximum number of unique categories to save",
                                  prompt: "Maximum number of unique categories to save when using Save Variable(s) Categories or First Category",
                                  default_value : 500, minimum: 1, maximum: Number.MAX_SAFE_INTEGER}));
controls.push(form.numericUpDown({name:"formMaxMentions", label:"Maximum number of categories per case to save",
                                  prompt: "Maximum number of categories per case to save when using Save Variable(s) Categories",
                                  default_value : 100, minimum: 1, maximum: Number.MAX_SAFE_INTEGER}));

form.setInputControls(controls);
var heading_text = "List of Items Categorization";
if (!!form.setObjectInspectorTitle)
    form.setObjectInspectorTitle(heading_text);
else
    form.setHeading(heading_text);

library(flipTextAnalysis)
list.categories <- ListNormalization(formTextVar,
    subset = QFilter,
    required.categories = formManual,
    split.into.categories = formSplitIntoCategories,
    discard.categories = formDiscard,
    raw.text.replacement = NULL,
    min.phrase.frequency = formMinFreq,
    count.num.respondents = formCountRespondents,                                          
    use.tab.delimiter = formTab,
    use.semicolon.delimiter = formSemicolon,
    use.comma.delimiter = formComma,
    use.space.delimiter = formSpace,
    use.newline.delimiter = formNewline,
    other.delimiters = formDelimiters,
    conditional.delimiters = formConditionalDelimiters,
    replace.with.common.phrases = formCommonPhrase,
    common.phrase.frequency = formCommonPhraseFreq,
    sentence.phrase.frequency = formSentencePhraseFreq,
    unclassified.phrase = "UNCLASSIFIED",
    trim.characters = c("&", "/", "-"),
    # Spelling
    correct.spelling = formSpelling,
    use.dictionary = get0("formDictionary", ifnotfound = FALSE),
    spelling.dictionary = get("EnglishWords"),
    spelling.reference.frequency = formSpellingReferenceFreq,
    spelling.mistake.frequency = formSpellingCorrectedFreq,
    spelling.dont.correct = formDontSpellCorrect,
    spelling.refphrase.max.edit.dist = 1.6,
    spelling.refwords.max.edit.dist = 1.0,
    spelling.dict.max.edit.dist = 1.0,
    spelling.refphrase.max.nchar.dist = 1,
    spelling.refwords.max.nchar.dist = 1,
    spelling.dict.max.nchar.dist = 1,
    spelling.refphrase.min.nchar = 5,
    spelling.refwords.min.nchar = 5,
    spelling.dict.min.nchar = 6,
    # Output
    alphabetical.sort = formSortPhrases == "Alphabetically",
    sort.raw.text = formSortRawText,                             
    variable.index.to.show = get0("formVariableIndex"),
    num.rows.in.page.to.show = formPageSize,
    page.to.show = formPageNum,
    phrase.to.show = formPhraseShow,
    text.to.show = formTextShow,
    details.expand = formExpanded,
    tidy.variable.names = TRUE,
    show.diagnostics = TRUE,
    diagnostics.max.rows = 100)