Text Analysis - Automatic Categorization - Unstructured Text

Automatically categorize a text variable containing unstructured text into single-response or multiple-response categories. These categories can then be saved using Create > Text Analysis > Advanced > Save Variable(s) > Categories.


In Displayr, go to Insert > Text Analysis > Automatic Categorization > Unstructured Text.

In Q, go to Create > Text Analysis > Automatic Categorization > Unstructured Text.

  1. Under Inputs > Text variable, select a text variable.
  2. Make any other selections or changes to the settings that you require.
  3. Ensure the Automatic box is checked, or click Calculate.

The output below shows automatically generated categories for a question on how people feel about Microsoft. Each row in the example column can be expanded to show all the text in the category.

Extracting a Table of Frequencies

The table of frequencies can be extracted from this output by saving the results as a variable in your Data Set and then creating a table from that variable. Take these steps:

  1. Select the Unstructured Text output.
  2. In Displayr, go to Insert > Text Analysis > Advanced > Save Variable(s) > Categories (in Q, go to Create > Text Analysis > Advanced > Save Variable(s) > Categories).

A variable (Question) called "Categories from Text data" will be saved in your Data Set. This can then be used like any other variable to create tables and further outputs.
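
To make concrete what a table of frequencies over the saved categories contains, here is a minimal JavaScript sketch. It is not Q or Displayr code: the function name `frequencyTable` and the sample labels are hypothetical, and it assumes a single-response categorization with exactly one category label per case.

```javascript
// Hypothetical illustration: tally how many cases fall into each category.
// Assumes a single-response categorization (one label per case).
function frequencyTable(labels) {
    const counts = new Map();
    for (const label of labels) {
        counts.set(label, (counts.get(label) || 0) + 1);
    }
    const total = labels.length;
    // One row per category, sorted by descending count,
    // with the percentage of all cases in that category.
    return Array.from(counts, ([category, n]) => ({
        category: category,
        n: n,
        percent: total === 0 ? 0 : (100 * n) / total
    })).sort((a, b) => b.n - a.n);
}

// Made-up labels standing in for the saved categories variable.
const savedCategories = ["Price", "Quality", "Price", "Support", "Price", "Quality"];
console.log(frequencyTable(savedCategories));
```

For a multiple-response categorization, each case could carry several labels, so the counts would be tallied per label while the percentage base would remain the number of cases.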

More Information

Automatic Coding of Unstructured Text Data
Automatic Categorization of Unstructured Text Data


Text variable: The text variable to run automatic categorization on.

Source language: The language of the input text variable. To specify the language separately for each case, select "Specify with variable" (see Source language variable for more information). The input text is translated from the source language into English before categorization is performed. Translation is done using Google Cloud Translation, and results may change over time as the translation algorithm improves.

Source language variable: A text or categorical variable containing the language for each case. Languages should be referred to by the language names listed in Source language. This control appears when Source language is set to "Specify with variable".
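
Because each value in the source language variable has to match one of the language names offered by Source language, it can help to check the variable before running the analysis. The sketch below shows one way to do that in plain JavaScript. The function `findUnsupportedLanguages` and the short list of names are hypothetical, and exact, case-sensitive matching is an assumption rather than documented behavior.

```javascript
// Hypothetical check: report cases whose language value is not among the
// supported names. Matching is exact and case-sensitive (an assumption).
function findUnsupportedLanguages(values, supportedNames) {
    const allowed = new Set(supportedNames);
    const problems = [];
    values.forEach((value, index) => {
        if (!allowed.has(value)) {
            problems.push({ caseIndex: index, value: value });
        }
    });
    return problems;
}

// Shortened, made-up list of names and per-case values for illustration.
const supportedNames = ["English", "Spanish", "Swedish"];
const languagePerCase = ["English", "spanish", "Swedish", "Klingon"];
console.log(findUnsupportedLanguages(languagePerCase, supportedNames));
```

With these inputs the cases at index 1 and index 3 are reported, since "spanish" differs in case from "Spanish" and "Klingon" is not in the list.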

Existing categorization: A categorical or ordered-categorical variable, or a Pick Any question, containing an existing categorization to be used.

Number of categories: The number of categories to automatically generate. This is hidden if an existing categorization is provided.
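
The R code on this page passes the requested number through `min(formCategories, length(formTextVar))`, so the number of categories cannot exceed the number of cases, and it sets a minimum frequency of `max(5, log(length(formTextVar)))`. The JavaScript sketch below reproduces only that arithmetic. The function name `derivedSettings` is hypothetical and is not part of Q or Displayr.

```javascript
// Hypothetical helper mirroring two expressions from the R code on this page:
//   n.categories  = min(formCategories, length(formTextVar))
//   min.frequency = max(5, log(length(formTextVar)))
// R's log() and JavaScript's Math.log() are both natural logarithms.
function derivedSettings(requestedCategories, nCases) {
    return {
        nCategories: Math.min(requestedCategories, nCases),
        minFrequency: Math.max(5, Math.log(nCases))
    };
}

console.log(derivedSettings(10, 6));    // nCategories is capped at the 6 cases
console.log(derivedSettings(10, 1000)); // log(1000) is about 6.9, so minFrequency exceeds 5
```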


var languages = ["Afrikaans","Albanian","Amharic","Arabic","Armenian","Azerbaijani","Basque",
                 "Belarusian","Bengali","Bosnian","Bulgarian","Catalan","Cebuano","Chinese (Simplified)",
                 "Chinese (Traditional)","Corsican","Croatian","Czech","Danish","Dutch","English",
                 // ...
                 "Greek","Gujarati","Haitian Creole","Hausa","Hawaiian","Hebrew","Hindi","Hmong",
                 // ...
                 "Maori","Marathi","Mongolian","Myanmar (Burmese)","Nepali","Norwegian",
                 "Nyanja (Chichewa)","Pashto","Persian","Polish","Portuguese (Portugal, Brazil)",
                 "Punjabi","Romanian","Russian","Samoan","Scots Gaelic","Serbian","Sesotho","Shona",
                 "Sindhi","Sinhala (Sinhalese)","Slovak","Slovenian","Somali","Spanish","Sundanese",
                 "Swahili","Swedish","Tagalog (Filipino)","Tajik","Tamil","Telugu","Thai","Turkish",
                 // ...
                ];

form.dropBox({name: "formTextVar", label: "Text variable",
              types: ["Variable: text, categorical", "R:character"],
              prompt: "Select the text variable to analyze.", multi: false});

var source_language = form.comboBox({name: "formSourceLang",
                                     label: "Source language",
                                     alternatives: ["English",
                                                    "Specify with variable"].concat(languages),
                                     default_value: "English",
                                     prompt: "Specify the language of the input text variable."}).getValue();

if (source_language == "Specify with variable")
    form.dropBox({name: "formSourceLangVar",
                  label: "Source language variable",
                  types: ["Variable: Text, Categorical, OrderedCategorical"],
                  prompt: "Variable containing source language for each case",
                  required: true});

var existing_cat = form.dropBox({name: "formExistingCat", label: "Existing categorization",
              types: ["Variable: categorical, orderedcategorical", "Question: PickAny"],
              prompt: "Optional variable containing a manual categorization",
              multi: false, required: false}).getValue();
if (existing_cat === null) {
    // No existing categorization: ask how many categories to generate.
    form.numericUpDown({name: "formCategories",
                        label: "Number of categories",
                        default_value: 10,
                        increment: 1,
                        minimum: 1});
} else {
    // Existing categorization supplied: show the settings used to predict it.
    form.textBox({label: "Categories to ignore",
                  type: "text",
                  default_value: "NET, Total, SUM",
                  name: "formIgnore",
                  required: false,
                  prompt: "Specify categories to ignore when predicting the existing categorization"});
    form.numericUpDown({name: "formCVIncrement",
                        label: "Sample size increment for cross-validation",
                        prompt: "Increment of the training data sample size across a K-fold cross validation",
                        default_value: 50,
                        increment: 50,
                        minimum: 50,
                        maximum: 99999999});
    form.numericUpDown({name: "formNCVs",
                        label: "Number of cross validations",
                        prompt: "Number of cross validations to perform (i.e. K in K-fold cross validation)",
                        default_value: 4,
                        increment: 1,
                        minimum: 1,
                        maximum: 99999999});
}

categorization <- if (is.null(formExistingCat)) {
    # ...
        n.categories = min(formCategories, length(formTextVar)),
        sentiment.weight = 2,
        discard.phrases = NULL,
        raw.text.replacement = NULL,
        min.frequency = max(5, log(length(formTextVar))),
        max.bag.size = Inf,
        subset = QFilter,
        weights = QPopulationWeight,
        seed = 1223,
        version = 0,
        source.language = get0("formSourceLangVar", ifnotfound = formSourceLang))
} else {
    vars <- formExistingCat
    if (is.data.frame(vars)) {
        # Trim whitespace so that "NET, Total, SUM" matches the names "NET", "Total" and "SUM".
        ignored <- trimws(strsplit(formIgnore, ",")[[1]])
        vars <- formExistingCat[, !names(vars) %in% ignored, drop = FALSE]
        n.observed <- nrow(vars)
    } else
        n.observed <- length(vars)
    # ...
        discard.phrases = NULL,
        raw.text.replacement = NULL,
        min.frequency = 5,
        max.bag.size = 100,
        subset = QFilter,
        weights = QPopulationWeight,
        cross.validation.size = formCVIncrement,
        n.cross.validations = formNCVs,
        seed = 12321,
        source.language = get0("formSourceLangVar", ifnotfound = formSourceLang))
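
As a rough illustration of two settings that apply when an existing categorization is supplied, the JavaScript sketch below parses a comma-separated Categories to ignore value and assigns cases to K folds for cross-validation. All function names are hypothetical. Trimming whitespace around each name and the round-robin fold assignment are assumptions, and the cross-validation actually performed by Q and Displayr may differ.

```javascript
// Split a value such as "NET, Total, SUM" into ["NET", "Total", "SUM"],
// trimming surrounding spaces and dropping empty entries (assumed behavior).
function parseIgnoreList(text) {
    return text.split(",").map(s => s.trim()).filter(s => s.length > 0);
}

// Assign case indices 0..nCases-1 to k folds in round-robin order.
// Real implementations usually shuffle first; this sketch does not.
function kFoldIndices(nCases, k) {
    const folds = Array.from({ length: k }, () => []);
    for (let i = 0; i < nCases; i++) {
        folds[i % k].push(i);
    }
    return folds;
}

// For each fold, the held-out cases form the test set and all other
// cases form the training set.
function trainTestSplits(nCases, k) {
    const folds = kFoldIndices(nCases, k);
    return folds.map((test, f) => ({
        test: test,
        train: folds.filter((_, g) => g !== f).flat()
    }));
}

console.log(parseIgnoreList("NET, Total, SUM"));
console.log(trainTestSplits(10, 4).map(s => s.test.length));
```

With 10 cases and the default of 4 cross validations, the held-out folds in this sketch contain 3, 3, 2 and 2 cases, and every case is held out exactly once.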