Text Analysis - Automatic Categorization - Unstructured Text

This tool uses a partial categorization of text data to predict categories for text that has not been categorized. This is useful if you have categorized a subset of your data and wish to apply the same categorization to more text: for example, when text data has been categorized for the first wave of a study and you wish to apply the same categorization to subsequent waves. Alternatively, the tool can determine appropriate categories and their labels automatically, based on patterns observed in the data. However, this second use case is supported only for legacy reasons; categories can be generated in a better and more flexible way with the Semi-Automatic Categorization tool, available via + Anything > Advanced Analysis > Text Analysis > Semi-Automatic Categorization or by right-clicking Text variables in the Variables and Questions tab.

The categories for all cases can then be saved using Insert > Text Analysis > Advanced > Save Variable(s) > Categories (or Create > Text Analysis > Advanced > Save Variable(s) > Categories).

Example

To create an Unstructured Text output, go to Insert > Text Analysis > Automatic Categorization > Unstructured Text (or Create > Text Analysis > Automatic Categorization > Unstructured Text).

With Existing Categorization

Under Inputs > DATA SOURCE > Text variable select a Text variable.
Under Inputs > CATEGORIES > Existing Categorization select a Nominal or Ordinal variable or Binary - Multi variable set (equivalently, a Categorical or Ordered categorical variable or Pick Any Question).
Make any other selections or changes to the settings that you require.
Ensure the Automatic box is checked, or click Calculate.

The output below shows automatically generated categories for a question on how people feel about their cell phone provider. The text variable has 2090 responses, 895 of which were given an existing categorization by a market researcher. Each row in the example column can be expanded to show all the text in the category. The performance metrics can be viewed in the main table, and the cross-validation performance is shown in the table in the footer.

Without Existing Categorization

Under Inputs > CATEGORIES > Category creation choose Create new categorization.
Under Inputs > DATA SOURCE > Text variable select a Text variable.
Make any other selections or changes to the settings that you require.
Ensure the Automatic box is checked, or click Calculate.

The output below shows automatically generated categories for a question on how people feel about Microsoft. Each row in the example column can be expanded to show all the text in the category.

Extracting a Table of Frequencies

To extract a table of frequencies from this output, save the results as a variable in your Data Set and then create a table from that variable. Take these steps:

Select the Unstructured Text output.
Go to Insert > Text Analysis > Advanced > Save Variable(s) > Categories (or Create > Text Analysis > Advanced > Save Variable(s) > Categories).

A variable set (or Question) called "Categories from Text data" will be saved in your Data Set. This can then be used like any other variable to create tables and further outputs.

More Information

Coding of Unstructured Text Data
Categorization of Unstructured Text Data

Options

DATA SOURCE

Text variable The text variable to run automatic categorization on.

Truncate cases when characters exceed Truncate cases whose number of characters exceeds this number. This is done because the algorithm expects text data to consist of one sentence per case and may not function correctly when cases are too long. If any cases are truncated due to this setting, a warning is shown. In that situation it is likely that the data does not conform to the one-sentence-per-case assumption and is not appropriate for this analysis.
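
The truncation rule can be illustrated with a minimal sketch. The helper name truncateCases and its return shape are hypothetical, chosen for illustration only; this is not the tool's internal implementation, and it counts characters as JavaScript string length.

```javascript
// Hypothetical helper illustrating the truncation rule; not the tool's actual code.
// Cases longer than maxChars are cut to maxChars characters, and the number of
// truncated cases is reported so that a warning can be raised.
function truncateCases(cases, maxChars) {
    let nTruncated = 0;
    const truncated = cases.map(function (text) {
        if (typeof text === "string" && text.length > maxChars) {
            nTruncated += 1;
            return text.slice(0, maxChars);
        }
        return text; // short or missing cases are left unchanged
    });
    return { cases: truncated, nTruncated: nTruncated };
}

// Example with the default limit of 2000 characters
const result = truncateCases(["Good coverage", "x".repeat(2500), null], 2000);
if (result.nTruncated > 0)
    console.log(result.nTruncated + " case(s) truncated; check that each case is a single sentence.");
```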

TRANSLATE (GOOGLE CLOUD TRANSLATION)

Source language The language of the input text variable. To specify the language separately for each case, select "Specify with variable" (see Source language variable for more information). The input text is translated from the source language into English before categorization is performed. Translation is done using Google Cloud Translation and results may change over time as the translation algorithm improves.

Source language variable A text/categorical variable containing the language for each case. Languages should be referred to by the language names in Source language. This control appears when Source language is set to "Specify with variable".

Output language The language of the categories and example text to display in the output.
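
Because a Source language variable should use the language names listed in Source language, it may help to check the variable's values before running the analysis. The sketch below is a hypothetical pre-check, not part of the tool: the function findUnknownLanguages and the short SUPPORTED excerpt are assumptions for illustration, and it assumes names need to match exactly, including capitalization.

```javascript
// Hypothetical pre-check, not part of the tool: find values in a language
// variable that do not exactly match a supported language name.
// SUPPORTED is a short excerpt of the full list, used for illustration only.
const SUPPORTED = ["English", "French", "German", "Spanish", "Chinese (Simplified)"];

function findUnknownLanguages(languageValues, supported) {
    const known = new Set(supported);
    const unknown = new Set();
    languageValues.forEach(function (value) {
        // Missing values are skipped; anything else must match a known name
        if (value !== null && value !== undefined && !known.has(value))
            unknown.add(value);
    });
    return Array.from(unknown);
}

// "french" (wrong case) and "Deutsch" (not an English name) are reported
console.log(findUnknownLanguages(["English", "french", "Deutsch", "Spanish"], SUPPORTED));
```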

CROSS VALIDATION

These controls are only available when an existing categorization is provided. In that case, the cases that have been given an existing categorization are treated as the full training data set, and repeated random sub-sampling cross validation is performed on it. In particular, the full training data set is split into training and test data sets, where the test data has its existing categorization removed. The categorization for the test data is then predicted from the training sub-sample, and a performance table is generated across all cross-validation iterations. The controls below determine the size of the training sub-sample and how many cross-validation iterations of this type are performed.
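
One such random split might be sketched as follows. This is a minimal illustration under stated assumptions, not the tool's internal code: the helper randomSplit is hypothetical, and cases are represented simply by their indices.

```javascript
// Minimal sketch of one random sub-sampling split; hypothetical helper.
// labelledIds are the cases with an existing categorization. trainSize of them
// are drawn at random for training; the remainder form the test set, whose
// categories would be hidden and then predicted.
function randomSplit(labelledIds, trainSize, random) {
    const shuffled = labelledIds.slice();
    // Fisher-Yates shuffle using the supplied random number generator
    for (let i = shuffled.length - 1; i > 0; i--) {
        const j = Math.floor(random() * (i + 1));
        const tmp = shuffled[i];
        shuffled[i] = shuffled[j];
        shuffled[j] = tmp;
    }
    return { train: shuffled.slice(0, trainSize), test: shuffled.slice(trainSize) };
}

// 895 labelled cases, as in the example output, with a training sub-sample of 50
const ids = Array.from({ length: 895 }, function (_, i) { return i; });
const split = randomSplit(ids, 50, Math.random);
console.log(split.train.length, split.test.length); // 50 845
```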

Sample size increment for cross-validation An integer specifying the size of the initial training sub-sample and the amount by which it grows at each subsequent cross-validation iteration. The default value is 50, meaning that the initial training sub-sample has 50 observations from the full training data set and 50 more observations are added at each new iteration.

Number of cross-validation iterations The number of cross-validation iterations to perform in the random sub-sampling procedure described above. The default is 4, and the default Sample size increment for cross-validation is 50. With these defaults, if n > 200 cases have an existing categorization, the full training data is randomly split four times: the training sub-samples have 50, 100, 150 and 200 cases, and the corresponding test sub-samples have n - 50, n - 100, n - 150 and n - 200 cases.
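
The sample sizes in this example follow directly from the two settings. The sketch below computes them; the helper cvSampleSizes is hypothetical, and stopping once no test cases would remain is an assumption, since the behavior for small n is not described here.

```javascript
// Hypothetical helper: training and test sub-sample sizes for each iteration,
// given n cases with an existing categorization.
function cvSampleSizes(n, increment, iterations) {
    const sizes = [];
    for (let k = 1; k <= iterations; k++) {
        const train = k * increment;
        if (train >= n) break; // assumption: stop when no test cases would remain
        sizes.push({ iteration: k, train: train, test: n - train });
    }
    return sizes;
}

// Defaults (increment 50, 4 iterations) with n = 895 labelled cases
console.log(cvSampleSizes(895, 50, 4));
```

With n = 895, as in the example output, this gives training sizes of 50, 100, 150 and 200 and test sizes of 845, 795, 745 and 695.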

SAVE VARIABLE(S)

Save categories Adds variables to the data set containing the categories. Where there are multiple input variables, multiple sets of variables are added for each.

Save first category Adds a variable to the data set containing the first category mentioned. Where there are multiple input categories, the first category of each will be saved as a separate variable.

Code

let heading_text = "Unstructured Text Categorization";
if (!!form.setObjectInspectorTitle)
    form.setObjectInspectorTitle(heading_text);
else
    form.setHeading(heading_text);

let allow_control_groups = Q.fileFormatVersion() > 10.9;

const TRANSLATION_LANGUAGES = [
    'Abkhaz', 'Acehnese', 'Acholi', 'Afrikaans', 'Albanian', 'Alur', 'Amharic', 'Arabic',
    'Armenian', 'Assamese', 'Awadhi', 'Aymara', 'Azerbaijani', 'Balinese', 'Bambara',
    'Bashkir', 'Basque', 'Batak Karo', 'Batak Simalungun', 'Batak Toba', 'Belarusian',
    'Bemba', 'Bengali', 'Betawi', 'Bhojpuri', 'Bikol', 'Bosnian', 'Breton', 'Bulgarian',
    'Buryat', 'Cantonese', 'Catalan', 'Cebuano', 'Chichewa', 'Chinese (Simplified)',
    'Chinese (Traditional)', 'Chuvash', 'Corsican', 'Crimean Tatar', 'Croatian', 'Czech',
    'Danish', 'Dhivehi', 'Dinka', 'Dogri', 'Dombe', 'Dutch', 'Dzongkha', 'English',
    'Esperanto', 'Estonian', 'Ewe', 'Fijian', 'Filipino', 'Finnish', 'French', 'Frisian',
    'Fulani', 'Ga', 'Galician', 'Georgian', 'German', 'Greek', 'Guarani', 'Gujarati',
    'Haitian Creole', 'Hakha Chin', 'Hausa', 'Hawaiian', 'Hebrew', 'Hiligaynon', 'Hindi',
    'Hmong', 'Hungarian', 'Hunsrik', 'Icelandic', 'Igbo', 'Ilocano', 'Indonesian',
    'Irish', 'Italian', 'Japanese', 'Javanese', 'Kannada', 'Kapampangan', 'Kazakh',
    'Khmer', 'Kiga', 'Kinyarwanda', 'Kituba', 'Konkani', 'Korean', 'Krio', 'Kurdish (Kurmanji)',
    'Kurdish (Sorani)', 'Kyrgyz', 'Lao', 'Latgalian', 'Latin', 'Latvian', 'Ligurian',
    'Limburgish', 'Lingala', 'Lithuanian', 'Lombard', 'Luganda', 'Luo', 'Luxembourgish',
    'Macedonian', 'Maithili', 'Makassar', 'Malagasy', 'Malay', 'Malay (Jawi)', 'Malayalam',
    'Maltese', 'Maori', 'Marathi', 'Meadow Mari', 'Meiteilon (Manipuri)', 'Minang',
    'Mizo', 'Mongolian', 'Myanmar (Burmese)', 'Ndebele (South)', 'Nepalbhasa (Newari)',
    'Nepali', 'Norwegian', 'Nuer', 'Occitan', 'Odia (Oriya)', 'Oromo', 'Pangasinan',
    'Papiamento', 'Pashto', 'Persian', 'Polish', 'Portuguese (Brazil)', 'Punjabi (Gurmukhi)',
    'Punjabi (Shahmukhi)', 'Quechua', 'Romani', 'Romanian', 'Rundi', 'Russian', 'Samoan',
    'Sango', 'Sanskrit', 'Scots Gaelic', 'Sepedi', 'Serbian', 'Sesotho', 'Seychellois Creole',
    'Shan', 'Shona', 'Sicilian', 'Silesian', 'Sindhi', 'Sinhala', 'Slovak', 'Slovenian',
    'Somali', 'Spanish', 'Sundanese', 'Swahili', 'Swati', 'Swedish', 'Tajik', 'Tamil',
    'Tatar', 'Telugu', 'Tetum', 'Thai', 'Tigrinya', 'Tsonga', 'Tswana', 'Turkish',
    'Turkmen', 'Twi', 'Ukrainian', 'Urdu', 'Uyghur', 'Uzbek', 'Vietnamese', 'Welsh',
    'Xhosa', 'Yiddish', 'Yoruba', 'Yucatec Maya', 'Zulu'
];

if (allow_control_groups)
    form.group({label: "Data source", expanded: true});

form.dropBox({name: "formTextVar", label: "Text variable",
              types: ["Variable: text, categorical", "R:character"],
              prompt: "Select the text variable to analyze.", multi: false});

form.numericUpDown({name: "formMaxCaseChar",
                    label: "Truncate cases with characters exceeding",
                    default_value: 2000,
                    increment: 1000,
                    maximum: Number.MAX_SAFE_INTEGER,
                    minimum: 1,
                    prompt: "Truncate cases whose number of characters exceeds this number."});

if (allow_control_groups)
    form.group({label: "Categories", expanded: true});

let selected_mode = form.comboBox({name: "formMode", label: "Category creation",
                                   alternatives: ["Use existing categorization", "Create new categorization"],
                                   default_value: "Use existing categorization",
                                   prompt: "Predict categories for new cases using an existing categorization or create a new categorization from scratch"}).getValue();

let existing_cat;
if (selected_mode === "Use existing categorization") {
    existing_cat = form.dropBox({name: "formExistingCat", label: "Existing categorization",
                types: ["Variable: categorical, orderedcategorical", "Question: PickAny"],
                prompt: "Optional variable containing a manual categorization",
                multi: false, required: true}).getValue();

    form.textBox({label: "Categories to ignore",
                type: "text",
                default_value: "NET, Total, SUM",
                name: "formIgnore",
                required: false,
                prompt: "Specify categories to ignore when predicting the existing categorization"});
} else {
    form.numericUpDown({name: "formCategories",
                    label: "Number of categories",
                    default_value: 10,
                    increment: 1,
                    maximum: Number.MAX_SAFE_INTEGER,
                    minimum: 1});
}

if (allow_control_groups)
    form.group({label: "Filters & Weight"});

if (allow_control_groups)
    form.group({label: "Translate (Google Cloud Translation)", expanded: true});

let source_language = form.comboBox({name: "formSourceLang",
                                     label: "Source language",
                                     alternatives: ["English",
                                                    "Specify with variable"].concat(TRANSLATION_LANGUAGES),
                                     default_value: "English",
                                     prompt: "Specify the language of the input text variable."}).getValue();

if (source_language === "Specify with variable") {
    form.dropBox({name: "formSourceLangVar",
                  label: "Source language variable",
                  types: ["Variable: Text, Categorical, OrderedCategorical"],
                  prompt: "Variable containing source language for each case",
                  required: true});
}

form.comboBox({name: "formOutputLang",
               label: "Output language",
               alternatives: TRANSLATION_LANGUAGES,
               default_value: "English",
               prompt: "Specify the output language"});

if (selected_mode === "Use existing categorization") {
    if (allow_control_groups)
        form.group({label: "Cross Validation", expanded: true});

    form.numericUpDown({name: "formCVIncrement",
            label: "Sample size increment for cross-validation",
            prompt: "Increment of the training data sample size across a repeated random sub-sampling cross validation",
            default_value: 50,
            increment: 50,
            minimum: 50,
            maximum: Number.MAX_SAFE_INTEGER});
    form.numericUpDown({name: "formNCVs",
            label: "Number of cross-validation iterations",
            prompt: "Number of cross validations iterations to perform or number of times to increment the training data sample size",
            default_value: 4,
            increment: 1,
            minimum: 1,
            maximum: Number.MAX_SAFE_INTEGER});
}

if (allow_control_groups)
    form.group({label: "Save Variable(s)", expanded: false});