Text Analysis - Automatic Categorization - Entity Extraction

Automatically performs named entity recognition on a text variable containing unstructured text. Named entities fall into pre-defined categories that include real-world objects such as people, locations and organisations; temporal and numeric expressions such as dates, money and other numeric measures; and abstract concepts such as religion, ideology and criminal charge. To save all entity mentions for all entity types, use Create > Text Analysis > Advanced > Save Variable(s) > Categories (or Insert > Text Analysis > Advanced > Save Variable(s) > Categories). This saves one new Variable Set (or Question) in your Data Set for each of the different entity types. The first mention of each entity in each document (response / data point) can also be saved with Create > Text Analysis > Advanced > Save Variable(s) > First Category (or Insert > Text Analysis > Advanced > Save Variable(s) > First Category).

Example

To create an Entity Extraction output, go to Create > Text Analysis > Automatic Categorization > Entity Extraction (or Insert > Text Analysis > Automatic Categorization > Entity Extraction).

  1. Under Inputs > Text variable select a Text variable.
  2. Make any other selections or changes to the settings that you require.
  3. Ensure the Automatic box is checked, or click Calculate.

[Image: TAEntity inputs 2020 07 16.png]

The output below shows the extracted entities for a text variable with Donald Trump's tweets from the end of the 2016 presidential primary race. Each row in the example column can be expanded to show all the variants of each entity type extracted.

Options

Text variable The text variable to run the entity extraction.

Minimum number of cases to save The minimum number of named entities of a single type that must be identified in the text variable before the extracted entities of that type can be saved with Create > Text Analysis > Advanced > Save Variable(s) > First Category or Create > Text Analysis > Advanced > Save Variable(s) > Categories. For example, if the minimum number of cases to save is set to 3, at least 3 entities of a given type must be observed before the variable for that type can be saved and added to the data set through the menu. So if 5 Person entities were extracted in the R output, the Person entities can be saved and added to the data set. However, if only 2 Location entities were extracted, the Location variable would not be created.
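The threshold rule above can be sketched in a few lines. This is only an illustration of the rule as described, not the code the feature runs; the function name typesEligibleToSave and the counts object are hypothetical.

```javascript
// Hypothetical helper: return the entity types that meet the threshold.
// `counts` maps entity type -> number of extracted entities of that type.
function typesEligibleToSave(counts, minCasesToSave) {
    return Object.keys(counts).filter(function (type) {
        // "At least" the minimum, so a count equal to the threshold qualifies.
        return counts[type] >= minCasesToSave;
    });
}

// The example from the text: 5 Person entities, 2 Location entities, minimum 3.
var exampleCounts = { Person: 5, Location: 2 };
var eligibleTypes = typesEligibleToSave(exampleCounts, 3);
console.log(eligibleTypes); // only "Person" qualifies
```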

Add named entities to extraction Add custom named entities to be extracted from the text via a data entry form. The data entry form requires a named entity type to be specified in a cell in the first row. In the column below each named entity type, list the words (named entities) to include in the entity extraction, with one word in each cell. See Technical Information for the constraints on adding named entities.

Remove named entities from extraction Similar to the above, named entities can be excluded from the extraction by populating a similar data entry form. The first row should state the entity type to remove. In the column below each specified entity type, list the words or entities to remove, with one word in each cell.

An example of the expected input for the Remove named entities from extraction control is given below. It removes five entities: one Person entity named "Wall St", one Country entity named "fold", and three Number entities: "007cigarjoe", "@tzard000" and "1sonny12".

[Image: Remove-entities-ascii.PNG]
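The removal example described above can also be written out as a grid. The sketch below assumes the data entry form is row-major (an array of rows), which matches the shape of default_spreadsheet in the Code section; the helper gridToEntityLists is hypothetical and only shows how such a grid reads column by column.

```javascript
// Assumed layout: the first row holds entity types, later rows hold entities.
var removeEntitiesGrid = [
    ["Person",  "Country", "Number"],
    ["Wall St", "fold",    "007cigarjoe"],
    ["",        "",        "@tzard000"],
    ["",        "",        "1sonny12"]
];

// Hypothetical helper: convert the grid to { type: [entities...] },
// skipping empty cells and columns without an entity type.
function gridToEntityLists(grid) {
    var lists = {};
    var header = grid[0] || [];
    header.forEach(function (type, col) {
        if (!type) return;
        lists[type] = [];
        for (var row = 1; row < grid.length; row++) {
            var cell = grid[row][col];
            if (cell !== undefined && cell !== null && String(cell).trim() !== "") {
                lists[type].push(String(cell).trim());
            }
        }
    });
    return lists;
}

var parsedRemovals = gridToEntityLists(removeEntitiesGrid);
console.log(parsedRemovals);
```

Read this way, the grid lists one Person entity, one Country entity and three Number entities, which are the five removals in the example.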

SAVE VARIABLE(S)

Maximum number of unique entity levels to save The maximum number of unique levels to save from the text for each entity type. If the number of unique entities for one entity type exceeds this limit, the most popular entities are chosen, with ties broken alphabetically.
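A minimal sketch of this selection rule follows. It assumes that "most popular" means most frequently mentioned and that ties are broken with a plain string comparison; both are assumptions, and topEntityLevels is a hypothetical name rather than a function from the package.

```javascript
// Hypothetical helper: keep at most `maxLevels` unique entities for one type.
// Sort by descending frequency, then alphabetically to break ties.
function topEntityLevels(mentions, maxLevels) {
    var freq = Object.create(null); // no inherited keys to collide with
    mentions.forEach(function (m) {
        freq[m] = (freq[m] || 0) + 1;
    });
    return Object.keys(freq)
        .sort(function (a, b) {
            if (freq[b] !== freq[a]) return freq[b] - freq[a];
            return a < b ? -1 : (a > b ? 1 : 0);
        })
        .slice(0, maxLevels);
}

// "Ohio" and "Texas" appear twice; "Iowa" and "Utah" appear once each.
var stateMentions = ["Ohio", "Texas", "Ohio", "Iowa", "Texas", "Utah"];
var keptLevels = topEntityLevels(stateMentions, 3);
console.log(keptLevels); // "Iowa" wins the tie with "Utah" alphabetically
```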

Maximum number of entities per case to save The maximum number of entity mentions per case to save when using Save Variable(s) > Categories. If this control is set to a value n but a case contains m > n entity mentions, only the first n entity mentions are saved.
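As a small illustration of the per-case limit, the sketch below keeps the first n mentions of a case in the order they occur. The name firstMentions is hypothetical, and the order of mentions is assumed to be their order in the text.

```javascript
// Hypothetical helper: keep only the first `maxMentions` mentions in a case.
function firstMentions(mentionsInCase, maxMentions) {
    return mentionsInCase.slice(0, maxMentions);
}

// A case with m = 3 mentions and a limit of n = 2.
var savedMentions = firstMentions(["Ohio", "Texas", "Iowa"], 2);
console.log(savedMentions); // the third mention is dropped
```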

Technical Information

Possible named entity types are Person, Location, Organization, Misc, Money, Number, Ordinal, Percent, Date, Time, Duration, Set, Email, Url, City, State or province, Country, Nationality, Religion, Title, Ideology, Criminal charge, and Cause of death.

The entity extraction uses the Stanford Core Natural Language Processing (CoreNLP) Named Entity Recognition (NER) annotator, which combines machine learning sequence models with rule-based extraction such as regular expressions (regex) and other classifiers. These run in sequence, and once a method has identified an entity, later steps in the sequence cannot change or modify it. This has consequences for adding named entities to the extraction in the user settings: if an entity has already been identified by Stanford CoreNLP, it cannot be assigned to another entity type via the Add named entities to extraction setting. This constraint does not apply to removing named entities in the user settings: any entity identified by the CoreNLP NER can be removed from the extraction.
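The asymmetry between adding and removing can be illustrated with a toy model. This is not how CoreNLP is implemented; it only mimics the behaviour described above, and the names addEntities and removeEntities, the "Fruit" entity type and the sample data are all hypothetical.

```javascript
// Each item is a piece of text plus the type assigned by an earlier step,
// or null if no earlier step identified it as an entity.
var labelled = [
    { text: "Apple", type: "Organization" },
    { text: "kiwi",  type: null }
];

// Adding: only text that no earlier step has labelled can receive a type.
function addEntities(items, additions) {
    return items.map(function (item) {
        if (item.type !== null) return item; // earlier label is locked
        var types = Object.keys(additions);
        for (var i = 0; i < types.length; i++) {
            if (additions[types[i]].indexOf(item.text) !== -1) {
                return { text: item.text, type: types[i] };
            }
        }
        return item;
    });
}

// Removing: any identified entity can be dropped, whichever step found it.
function removeEntities(items, removals) {
    return items.map(function (item) {
        var listed = item.type !== null &&
            Object.prototype.hasOwnProperty.call(removals, item.type) &&
            removals[item.type].indexOf(item.text) !== -1;
        return listed ? { text: item.text, type: null } : item;
    });
}

var afterAdd = addEntities(labelled, { Fruit: ["Apple", "kiwi"] });
var afterRemove = removeEntities(afterAdd, { Organization: ["Apple"] });
console.log(afterAdd, afterRemove);
```

In this sketch "Apple" keeps its Organization label despite the attempt to add it as a Fruit, "kiwi" is picked up as a Fruit, and the later removal still clears "Apple".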

More Information

Automatic Coding of Unstructured Text Data
At Last, Machine Learning Can Accurately Categorize Text Data

Code

// Use the object inspector title when available, otherwise the form heading.
var heading_text = "Entity Extraction";
if (!!form.setObjectInspectorTitle)
    form.setObjectInspectorTitle(heading_text);
else
    form.setHeading(heading_text);

// Control groups are only used for file format versions above 10.9.
var allow_control_groups = Q.fileFormatVersion() > 10.9;

// Default first row of both data entry forms: one column per entity type.
var default_spreadsheet = [["Person", "Location", "Organization", "Misc", "Money", "Number", "Ordinal",
                            "Percent", "Date", "Time", "Duration", "Set", "Email", "Url", "City",
                            "State or province", "Country", "Nationality", "Religion", "Title", "Ideology",
                            "Criminal charge", "Cause of death"]];

// Text variable to analyze.
form.dropBox({name: "formTextVar", label: "Text variable",
              types: ["Variable: text, categorical", "R:character"],
              prompt: "Select the text variable to analyze.", multi: false});

// Minimum number of entities of a type required before it can be saved.
form.numericUpDown({name: "formMinCount", label: "Minimum number of cases to save",
                    prompt: "Specify the minimum number of observed cases to observe before saving variables",
                    default_value: 50, increment: 1, maximum: Number.MAX_SAFE_INTEGER, minimum: 1});

// Data entry form for custom entities to add to the extraction.
form.dataEntry({name: "formAddEntity", label: "Add named entities to extraction",
                edit_label: "Add named entities to extraction", required: false,
                prompt: "The top row specifies the entity type to add in the entity extraction. Populate columns with each word (named entity) you want to extract from the text. New entity types and their subsequent entities can be added in a new column",
                default_value: default_spreadsheet,
                large_data_error: "The data entered is too large. You may only enter data with up to 1000 rows and up to 100 columns."});

// Data entry form for entities to remove from the extraction.
form.dataEntry({name: "formRemoveEntity", label: "Remove named entities from extraction",
                edit_label: "Remove named entities from extraction", required: false,
                prompt: "The top row specifies the entity type to remove. Populate columns with each word (named entity) to be removed in the entity extraction algorithm",
                default_value: default_spreadsheet,
                large_data_error: "The data entered is too large. You may only enter data with up to 1000 rows and up to 100 columns."});

// Limits that apply when saving variables from the output.
if (allow_control_groups)
    form.group({label: "SAVE VARIABLE(S)", expanded: true});
form.numericUpDown({name: "formMaxLevels", label: "Maximum number of unique entity levels to save",
                    prompt: "Maximum number of unique levels in each entity type to save from the text",
                    default_value: 500, minimum: 1, maximum: Number.MAX_SAFE_INTEGER});
form.numericUpDown({name: "formMaxMentions", label: "Maximum number of entities per case to save",
                    prompt: "Maximum number of mentions per case to save when using Save Variable(s) Categories",
                    default_value: 100, minimum: 1, maximum: Number.MAX_SAFE_INTEGER});
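
# R code: runs the entity extraction using the values from the form controls.
library(flipTextAnalysis)

# Collect the form inputs as named arguments for the extraction.
arguments <- list(x = formTextVar,
                  subset = QFilter,
                  add.entities = formAddEntity,
                  remove.entities = formRemoveEntity,
                  min.cases.to.save = formMinCount,
                  max.levels.to.save = formMaxLevels,
                  max.mentions.to.save = formMaxMentions)

# Run the extraction and store the output.
entityExtract <- TextAnalysis(EntityExtraction, arguments)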