Text Analysis Case Study - Trump's Tweets
| Related Videos |
| --- |
| Chapter within Text Analysis in Q5 (Video) |
This case study matches the content of the webinar Text Analysis in Q5 from October 2016. Video from the webinar will be made available in the near future.
The content of the webinar was inspired by David Robinson of the blog Variance Explained. David used sentiment analysis and other techniques to investigate the authorship of the Twitter communications made by Trump's account during the US presidential primary race of 2016. David hypothesized that the angrier tweets were sent by Trump, while other tweets were sent by Trump's assistant(s).
A QPack containing all of the analyses that are discussed on this page can be downloaded here.
Data Set
The tweets used in this case study are stored in an archive on the web. This archive has been added to the Q Project using File > Data Sets > Add to Project > From R.
The R code that was used is:
# Load the archive of tweets from the Variance Explained website.
load(url("http://varianceexplained.org/files/trump_tweets_df.rda"))
tweet.data <- trump_tweets_df

# The statusSource field wraps the sending device in an HTML link;
# strip the HTML tags so that only the device name remains.
cleanFun <- function(htmlString) {
    return(gsub("<.*?>", "", htmlString))
}
tweet.data$statusSource <- cleanFun(tweet.data$statusSource)
tweet.data
The data contains a sample of 1,512 tweets which originated from Trump's Twitter account @realDonaldTrump. Each case in the data set represents one tweet. The variables contain the tweet text itself, as well as various kinds of metadata provided by Twitter. The variables that are considered in this case study are:
- text - the raw text from each tweet.
- favoriteCount - a count of the number of Twitter users who had marked the tweet as their favorite at the time the data set was originally obtained from Twitter. This is a measure of how engaged Twitter users were with the contents of the tweets. Typical values for this data set are in the tens of thousands.
- Source - a variable which records the device that was used to send out the tweet. The two main devices in the data set are iPhone and Android. This does not distinguish between different devices of the same type, so if two people were tweeting using an iPhone then they are not differentiated here.
Text and Word Cloud
The original text from the tweets can be viewed by selecting the Question called text in the Blue drop-down menu. The contents of the text can be viewed in the table below:
Using Q's Word Cloud feature we can view the words in proportion to how often they occur in the text. A table showing the raw text can be converted to a Word Cloud by clicking the Show Data as menu and selecting the cloud from the set of chart types. This gives us an overview of the themes:
We can see prominent signals of "thank" (from tweets thanking people for attending events) and "crooked" (from one of Trump's favorite slogans, attacking "Crooked Hillary"). This visualization of the text is dynamic: you can combine words by dragging and dropping one on top of another, and you can remove words that are not of interest by dragging them to the right section of the cloud (which will appear as Ignore when viewed in Q). For example, to count Hillary and Clinton as the same concept in the word cloud, you can drag Hillary on top of Clinton. The word Clinton will then become larger (as the count is now of any text which mentions either word), and Hillary will no longer appear as a separate word in the cloud.
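For readers working outside of Q, a comparable static cloud can be sketched in R with the wordcloud package. This is only an illustration: the package is not part of Q, and the sketch omits the stopword removal that Q applies automatically.
# A sketch of a comparable static cloud using the wordcloud package,
# assuming the tweet.data frame loaded earlier. Unlike Q's word cloud,
# this does not remove stopwords, so common words such as "the" will
# dominate unless filtered first.
library(wordcloud)

# Lower-case the text, strip punctuation and split it into words.
words <- unlist(strsplit(tolower(gsub("[[:punct:]]", " ", tweet.data$text)), "\\s+"))
words <- words[words != ""]

# Count word frequencies and drop rare words to keep the cloud legible.
freqs <- sort(table(words), decreasing = TRUE)
freqs <- freqs[freqs >= 10]

wordcloud(names(freqs), as.numeric(freqs), scale = c(4, 0.5))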
Coding
Using Q's Coding tool, we can manually assign tweets into categories. Coding is still the most accurate way to classify the text (so long as the person doing it does a good job!). In the example project, the original tweet text has been coded into a Question called Tweet Themes. The themes that we identified in this case study are:
- Speeches, interviews, rallies - for tweets announcing speeches and rallies, or tweets which thank people for attending these rallies.
- About the media (negative) - for tweets which denounced bad or biased media coverage.
- Attacks on Clinton, Obama, Democrats, supporters - for tweets which attack Trump's opponents on the other side of politics.
- Positives about Trump or his campaign - for tweets which show positive messages about the candidate or his campaign progress.
- Policies, slogans, calls to arms - for tweets which promote Trump's policies or repeat his (positive) slogans.
- Other - for tweets which talk about the Olympics, or otherwise don't fit the set of themes that we have used here.
The coded Question can then be shown in a table with other Questions from the data set. One important measure that we consider throughout this case study is the favoriteCount variable, which contains a count of the number of times each tweet was marked as a favorite by the Twitter users who read it. This can be used as a measure of engagement with the tweets: it shows how interested people are in a given tweet, and how much they approve of it.
The table shows that tweets about Speeches, interviews, rallies generated a significantly lower level of engagement, and that there is not much differentiation among the other themes that were coded.
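The same comparison of average favorites by theme can be sketched outside of Q with a one-line aggregation. The column name Tweet.Themes below is hypothetical, standing in for the coded Question.
# Average favorites by theme. Tweet.Themes is a hypothetical column
# name standing in for the manually coded Question described above.
aggregate(favoriteCount ~ Tweet.Themes, data = tweet.data, FUN = mean)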
Only a subset of 312 tweets has been coded. This is so that the remaining tweets can be automatically coded using a predictive algorithm, as shown below in the section on Automatic Coding.
Text Setup and Cleaning
More automated kinds of text analysis (like sentiment analysis and automatic classification) typically require an initial setup and cleaning phase, where the text is first broken up into a collection of individual words (tokenized), and then each word is processed to determine if it should be kept in the analysis, modified, or excluded. The general approach is to try to reduce the total number of words being included where possible. Common cleaning techniques include the following (a small R sketch of these steps appears after the list):
- Stopword removal, which is the removal of common words like the, to, and of, which don't tend to convey much information about the meaning of the text by themselves.
- Spelling correction, where misspelled words are replaced with the most likely correction found in the text.
- Stemming, where words are replaced by their root words, or stems. In English this largely involves the removal of suffixes.
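To make these steps concrete, here is a minimal sketch in R using the tm and SnowballC packages. These are external packages chosen for illustration, not what Q uses internally.
# A minimal sketch of the cleaning steps, assuming the tweet.data
# frame loaded earlier. tm and SnowballC are external packages;
# Q's Setup Text Analysis performs equivalent steps internally.
library(tm)         # standard English stopword list
library(SnowballC)  # Porter stemming

# Tokenize: lower-case the text, strip punctuation, split on whitespace.
tokens <- unlist(strsplit(tolower(gsub("[[:punct:]]", " ", tweet.data$text)), "\\s+"))
tokens <- tokens[tokens != ""]

# Stopword removal: drop common words such as "the", "to" and "of".
tokens <- tokens[!tokens %in% stopwords("english")]

# Stemming: replace each word with its root, e.g. "voting" becomes "vote".
stems <- wordStem(tokens, language = "english")

# Inspect the twenty most frequent stems.
head(sort(table(stems), decreasing = TRUE), 20)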
In Q, the initial cleaning is done using Text Analysis - Setup Text Analysis. This option should be used first before using the other Text Analysis options available in the menu.
This does stopword removal automatically, and has options for spelling correction and stemming. Further options are available for manual changes to the text. For instance, in this case study we have:
- Removed some of the text which comes from website links in the tweets, like https, co, and amp, by adding these into the Remove these words section of the options.
- Replaced all instances of clinton with the word hillary, so as to have all mentions of Hillary Clinton treated as a single entity in the text, by adding clinton:hillary into the Replace these words section of the options. We have done the same for sanders and bernie.
- Increased the Minimum word frequency to 10, thereby excluding any words that occur fewer than ten times in the text. This helps to remove words that do not convey a lot of information, including misspelled words that were not captured by the spelling correction.
The result is a table which shows the words and their frequencies. By looking at the table, further decisions could be made about words to remove or combine.
When you use Text Analysis - Setup Text Analysis, an output will appear in your Report. This output stores information about the cleaning and processing that has been done, and it can then be used as an input to other analyses, as discussed in the sections below.
Sentiment Analysis
Sentiment analysis quantifies the tone of a piece of text by identifying and scoring positive and negative words. In Q, a variable containing the sentiment score for each tweet is generated by selecting the setup item (see above) and then running Text Analysis - Sentiment. The new variable can be used in crosstabs with other Questions from the data set.
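To make the idea concrete, the sketch below scores each tweet outside of Q by counting words from a tiny, hand-picked lexicon. The two word lists are purely illustrative; Q's Sentiment tool uses its own, much larger lists.
# An illustrative sentiment scorer for the tweet.data frame loaded
# earlier. The two word lists below are hypothetical examples only.
positive.words <- c("great", "win", "thank", "best", "nice")
negative.words <- c("crooked", "bad", "dishonest", "worst", "sad")

score.sentiment <- function(text) {
    words <- unlist(strsplit(tolower(gsub("[[:punct:]]", " ", text)), "\\s+"))
    # Score = number of positive words minus number of negative words.
    sum(words %in% positive.words) - sum(words %in% negative.words)
}

tweet.data$sentiment <- sapply(tweet.data$text, score.sentiment)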
From the sentiment scores we find (unsurprisingly) that the average sentiment tends to be highest for tweets coded as Positives about Trump or his campaign and Speeches, interviews, rallies. On the other hand, the scores for tweets classified as Attacks on Clinton, Obama, Democrats, supporters are significantly lower than average, and are in fact negative (meaning that, on average, these tweets contain more negative words than positive words).
By crosstabulating the sentiment scores by the favoriteCount, we find a small-but-significant negative correlation, indicating that tweets with more negative language tended to engage Trump's Twitter followers more.
Finally, by considering the Question called Source, which shows us what kind of device was used to author the tweets, we find a significant difference in the sentiment between those tweets sent by an Android, and those sent by an iPhone. This is the key result that David Robinson discussed in his blog post - those tweets sent by the Android tend to be much more negative.
While it is possible that there are multiple people publishing tweets on this account for each type of device, the result here shows a very different tone for tweets being sent out by the different devices.
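Both of these comparisons can be sketched outside of Q using the illustrative sentiment scores computed above. The two device labels below are assumptions about the values in the cleaned statusSource field.
# Sketches of the two comparisons above, using the illustrative
# sentiment scores from the previous sketch. Q performs equivalent
# significance tests in its crosstabs.

# Correlation between sentiment and engagement.
cor.test(tweet.data$sentiment, tweet.data$favoriteCount)

# Mean sentiment by sending device. The two labels below assume the
# values found in the cleaned statusSource field.
t.test(sentiment ~ statusSource, data = tweet.data,
       subset = statusSource %in% c("Twitter for Android", "Twitter for iPhone"))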
Term Document Matrix
The term document matrix represents the text as a table whose columns are binary variables that correspond to the words used in the analysis. Each row represents one of the text responses or tweets, and each column represents one of the words by taking a value of 1 when the word is present in the text for that row, and a value of 0 when it is not. This is a way of communicating the outcomes of the text setup and cleaning phase into other algorithms. If you want to design your own custom analysis, it can be useful to have the term document matrix computed explicitly within your project, and this can be done using Text Analysis - Advanced - Term Document Matrix.
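For illustration, a binary matrix of this kind can be constructed directly in base R. The sketch below is a simplification that tokenizes the raw text, without the cleaning steps described in the setup section.
# A sketch of a binary term document matrix in base R, assuming the
# tweet.data frame loaded earlier. For simplicity this tokenizes the
# raw text rather than the cleaned text that Q would use.
tokenize <- function(text) {
    words <- unlist(strsplit(tolower(gsub("[[:punct:]]", " ", text)), "\\s+"))
    unique(words[words != ""])
}

token.list <- lapply(tweet.data$text, tokenize)
vocabulary <- sort(unique(unlist(token.list)))

# One row per tweet, one column per word; 1 if the word is present.
tdm <- t(sapply(token.list, function(words) as.integer(vocabulary %in% words)))
colnames(tdm) <- vocabulary
dim(tdm)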
The automatic coding tool that is described below uses the term document matrix explicitly as one of its inputs. The predictive tree computes the term document matrix in the background for its own calculation, and does not rely on the presence of the term document matrix as an item in the report.
Note that the original version of the term document matrix shown in the webinar displayed the full contents of the matrix as a table. This turned out to be an inefficient way to store this data, particularly for larger data sets, and so the term document matrix now displays information about the underlying matrix rather than displaying its contents in full.
Predictive Tree
A predictive tree based on the text can be created using Text Analysis - Advanced - Predictive Tree. This is similar to Machine Learning - Classification And Regression Trees (CART), which is designed for creating a predictive tree between variables in the data file (as opposed to using the text).
In this case study, we used the favoriteCount as the Outcome, or variable to be predicted. Each branch of the tree shows where the presence of a particular word in a tweet predicts a much higher or lower average number of favorites. The width of each branch of the tree shows how many tweets are included in that part of the sample, and the color of the branch indicates the average value of the outcome variable - with darker reds indicating low average values, and lighter reds and blues indicating higher average values. The tree diagram is interactive. If you hover your mouse over a node you get additional information about the sample and outcome variable for that node, and you can click on the nodes to hide or show that part of the tree.
The tree shows significantly higher numbers of favorites for tweets which talk about Hillary Clinton, and an even higher average favorite count for those tweets which use the words hillary and spending. Similarly high scores were observed for tweets containing the words bernie, law, and united.
Other variables can be analyzed by changing the Outcome selection.
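For readers who want an equivalent outside of Q, a comparable regression tree can be sketched with the rpart package, using the binary term document matrix from the sketch in the previous section as predictors. This approximates, rather than reproduces, Q's Predictive Tree.
# A sketch of a comparable regression tree using the rpart package
# and the term document matrix (tdm) built in the previous section.
# Q's Predictive Tree computes its own matrix internally.
library(rpart)

# Combine the outcome with the word indicators; check.names fixes any
# column names (e.g. words starting with digits) that are not valid in R.
tree.data <- data.frame(favoriteCount = tweet.data$favoriteCount, tdm,
                        check.names = TRUE)
tree <- rpart(favoriteCount ~ ., data = tree.data, method = "anova")
print(tree)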
Automatic Coding
Since this webinar, new tools for automatic coding have been added to Q. The text coding tool now includes a Semi-automatic Categorization option, which identifies categories in the text data, and Text Analysis - Automatic Categorization - Unstructured Text allows you to use a more advanced version of the random forest model discussed in the webinar.