Tweet Engagement Prediction

The following is an excerpt from the deliverable I worked on with Austin Daily, Jaynth Thiagarajan, and Ryan Weber for the Data Science and Business Analytics course at NYU Stern. For the project, we developed a model to predict the engagement score (retweets + favorites) for a given tweet, based on its text and metadata features.

Business Understanding

Twitter is an online social networking service that enables users to send and read short 140-character messages called tweets. Users are able to retweet and favorite messages. By retweeting, a user shares the tweet on his newsfeed, enabling his followers to see the tweet. By favoriting, the user marks the tweet, indicating affinity for the message. In aggregate, retweets and favorites can be interpreted as a measure of engagement, or the extent to which the message attracts attention. Thus, individuals and companies whose businesses center around attracting attention may aim to compose messages that score high in engagement.

For our project, we set out to develop a model with which we can predict the engagement of a tweet. If our model is successful, it will create value by delivering a predicted measure of engagement prior to the message being published, thus creating the opportunity for the user to engineer the message for higher engagement.

Brands often partner with influencers, who have large audiences to whom products can be advertised. By predicting the engagement score for a tweet, brands can better decide which influencers to partner with as well as better assess return on investment.

On June 16, 2015, celebrity businessman Donald Trump announced he would run for President of the United States. Since his announcement, Trump has leveraged Twitter to spread his message and gain attention. Our team decided to analyze tweets from Donald Trump due to the increased attention to his campaign.

Data Understanding

We collected 3200 tweets published by Trump between the time of his announcement to run for President and December 8, 2015. The tweets were fetched by querying the Twitter API. Although the average user is concerned mostly with the 140-character tweets being broadcast out into the world, a single tweet carries more metadata than is commonly realized (Figure 5). Many of these metadata attributes, such as the application from which the tweet was published, are not perceptible to the average user and thus are immaterial to our analysis. However, some attributes, such as the publishing time and date, may have a significant impact on total engagement. Additionally, the metadata contains the number of retweets and favorites, which together form the engagement score used as the label for our data.

Modeling Twitter engagement is at its core a text mining problem. Hence the bulk of the information will come from the text of each tweet. The text itself contains certain structured attributes like references (callouts to other Twitter users using the @ sign followed by the username), hashtags (indicators of a particular topic using the # sign, used to signify a tweet’s participation in a trending topic) and hyperlinks. Beyond these, the features of the text are unstructured, so we have employed vectorization techniques to produce usable features.

The full set of features and labels used is as follows:

  • Full text of the tweet, vectorized into 1- and 2-grams
  • Number of references in the tweet
  • Number of hashtags in the tweet
  • Number of hyperlinks in the tweet
  • Hour of the day at which the tweet was published
  • Day of the week on which the tweet was published, represented as six binary features, one for each day Monday through Saturday (Sunday being the implied baseline)
  • The total engagement of the tweet, constructed using:
    • Number of retweets
    • Number of favorites
    • A weighting factor alpha used to specify the relative importance of retweets and favorites; this value is business-case dependent, but we have used a value of 0.8, which weights retweets four times as heavily as favorites
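The feature list above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names, the regular expressions, the whitespace tokenizer, and the example tweet are all assumptions. In particular, the weighted form alpha * retweets + (1 - alpha) * favorites is assumed because it is the simplest formula under which alpha = 0.8 weights retweets four times as heavily as favorites.

```python
import re
from datetime import datetime

ALPHA = 0.8  # value from the text; the weighted-sum formula below is an assumption

# Six day-of-week indicators; Sunday is the implied baseline (all zeros).
DAYS = ["mon", "tue", "wed", "thu", "fri", "sat"]


def ngrams(text, n_values=(1, 2)):
    """Return the 1- and 2-grams of a lowercased, whitespace-split tweet."""
    tokens = text.lower().split()
    grams = []
    for n in n_values:
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def extract_features(text, published_at):
    """Build the structured (non-text) features for a single tweet."""
    features = {
        "n_references": len(re.findall(r"@\w+", text)),
        "n_hashtags": len(re.findall(r"#\w+", text)),
        "n_links": len(re.findall(r"https?://\S+", text)),
        "hour": published_at.hour,
    }
    # datetime.weekday() returns 0 for Monday through 6 for Sunday.
    for index, day in enumerate(DAYS):
        features["is_" + day] = int(published_at.weekday() == index)
    return features


def engagement_score(retweets, favorites, alpha=ALPHA):
    """Assumed label: a weighted sum of retweets and favorites."""
    return alpha * retweets + (1 - alpha) * favorites


# Made-up example tweet and timestamp, for illustration only.
example = "Great rally with @example_user! #Example2016 https://example.com/a"
print(extract_features(example, datetime(2015, 12, 7, 21, 30)))
print(ngrams("make it count"))
```

In a real pipeline the n-grams would typically be turned into a sparse count or TF-IDF matrix by a vectorizer and joined with the structured features.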

Problem Formulation

We attempted to construct our model both as a binary classification problem (will a given tweet achieve high engagement) and separately as a regression problem (what is the expected engagement for a given tweet). Our hope was that the classification models would be easier to evaluate, using powerful metrics such as AUC, and that the lessons learned would be transferable to conceptually similar regression models, which may be more useful from a business perspective by providing more detailed feedback to users. However, we ended up finding that the models that produced the best results for the binary classification problem were either incompatible with the regression problem or simply did not produce similarly good results.
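To make the classification framing concrete, here is a small sketch of how a binary "high engagement" label and an AUC value could be computed. The threshold is an assumption (the excerpt does not state how "high engagement" was defined), and the function names are hypothetical. The AUC is computed directly as the probability that a randomly chosen positive example is scored above a randomly chosen negative one, which is equivalent to the area under the ROC curve.

```python
def binarize(engagement_scores, threshold):
    """Label a tweet 1 if its engagement meets the (assumed) threshold, else 0."""
    return [1 if score >= threshold else 0 for score in engagement_scores]


def auc(labels, predictions):
    """Fraction of positive/negative pairs ranked correctly; ties count as half."""
    positives = [p for y, p in zip(labels, predictions) if y == 1]
    negatives = [p for y, p in zip(labels, predictions) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical engagement scores and model outputs, for illustration only.
labels = binarize([5200, 4100, 900, 300], threshold=2000)
print(auc(labels, [0.9, 0.3, 0.4, 0.1]))
```

The pairwise loop is quadratic in the number of tweets, which is fine for a few thousand examples; a library routine would be the usual choice at larger scale.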

Deployment and Evaluation

Ideally, the model would power a web application that lets the user pick a profile, build a model for that profile, and employ the model to test various tweets by predicting engagement. The challenge is the time it would take for the system to build a model.

If the total time taken by the system to gather, clean, and optimize the data is more than a few minutes, providing such a user experience may not be possible. There are two ways to address the challenge: either pre-select influential Twitter handles or build a data warehouse of Twitter feeds from profiles with large followings. Decoupling the data fetching process from the feature extraction process helps to make model generation faster. Depending on the cost-benefit trade-off, the process could be run in a highly parallelized environment using Spark or a similar framework.

From a methodology standpoint, the model currently does not factor in a time dimension. It is likely that certain tokens drive more engagement only during parts of the year. For example, one would expect the token ‘holidays’ to drive more engagement during December than during other months. These effects are currently not captured by the model, and the methodology needs to be improved to account for longer-term trends and seasonality.

Even after it is deployed to production, the model should be monitored by comparing predicted engagement against actual outcomes, using AUC measurements. Given the way the model would be used in production, the time taken to generate results is also an important metric to monitor on an ongoing basis.

Of particular note are the Random Forest Regression model’s conclusions on the relative importances of the available features. The obvious conclusion is that terrorism-related tweets garner very strong reactions. Most of the top 20 features are words such as “isis”, “paris”, “attacks”, and “syrians”. These subjects play into the fear-based rhetoric that has effectively ignited a large segment of Republicans and helped keep Donald Trump and other far-right, anti-establishment candidates ahead in the polls.


Figure 1 – Engagement Scores

Figure 2 – ROC Curves

Figure 3 – Feature importances from Random Forest Regression

Figure 4 – Predicted engagement on holdout data from regression models


Figure 5 – Map of Tweet Metadata



Hillary: Age Perception Problem?

“I come from the ‘60s, a long time ago,” Hillary Clinton said at Saturday’s Democratic Presidential debate, in response to a question about student activism. The gaffe has mostly fallen on deaf ears.

Hillary Clinton is 68 years old. She was born October 26, 1947. Here is a visualization showing the age distribution of US Presidents upon assuming the Oval Office.


To date, only three US Presidents have assumed the Oval Office between the ages of 65 and 69. Once again, Hillary Clinton is 68 years old. She “come(s) from the ‘60s, a long time ago.”

Granted, the US population is aging. However, when selecting Presidents, this aging US population has trended towards younger Presidents.

Age upon assuming Oval Office: Barack Obama, 47 years old; George W. Bush, 54 years old; Bill Clinton, 46 years old.

Hillary wishes to dispel the age perception. HFA recently posted the Snapchat logo wearing a most recognizable pantsuit.


Of all the social networks, Snapchat skews youngest. According to Business Insider, 45% of Snapchat users are between the ages 18-24.

Thus, a strategy has emerged: capture the young mind. Win the unspoiled voter, who is excited by the prospect of casting her first vote. Btw, Hillary uses Snapchat.

Further evidence of the strategy:

Twitter Banner displaying young prospective voters


HFA blog post appealing to youth and “a new age”


We should expect age to become a louder issue as the campaigns unfold. For reference, here are the ages of all the candidates, both Democrats and Republicans. Those between the ages of 44 and 60 are marked with an asterisk.

Hillary Clinton, 68 years old

Bernie Sanders, 74 years old

Martin O’Malley, 52 years old *

Donald Trump, 69 years old

Ben Carson, 64 years old

Marco Rubio, 44 years old *

Ted Cruz, 44 years old *

Jeb Bush, 62 years old

Carly Fiorina, 61 years old

John Kasich, 63 years old

Rand Paul, 52 years old *

Republican Debate Nov 10, 2015: Google Search Correlation Matrix

Where does the competition stand leading to tonight’s Republican Debate? Using Google Trends data from the past three months, I plotted a correlation matrix comparing search queries among the candidates.


The matrix reveals trends. Notably, Donald Trump searches are the least related to those for the other candidates. This supports the idea that Trump has created his own news cycle, apart from the Republican news cycle. As the time series plot shows, the Trump search artifact exhibits peaks and nuance not seen in the other artifacts. Furthermore, Trump searches are the least related to Ben Carson searches, with a 0.01 correlation. Other patterns to note: Ted Cruz and Marco Rubio searches are highly related, with a 0.94 correlation, and there is high inter-relatedness among searches for Jeb Bush, Carly Fiorina, and Rand Paul.
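For readers who want to reproduce this kind of matrix, the sketch below computes pairwise Pearson correlations between search-interest time series. It assumes Pearson correlation was the measure used (the post does not say), and the series names and values are invented placeholders rather than actual Google Trends data.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length numeric series."""
    if len(x) != len(y):
        raise ValueError("series must have the same length")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


def correlation_matrix(series):
    """Return a nested dict: matrix[a][b] is the correlation of series a and b."""
    names = list(series)
    return {a: {b: pearson(series[a], series[b]) for b in names} for a in names}


# Placeholder weekly search-interest values, not real data.
searches = {
    "candidate_a": [10, 20, 30, 40],
    "candidate_b": [12, 22, 29, 41],
    "candidate_c": [40, 30, 20, 10],
}
matrix = correlation_matrix(searches)
for name, row in matrix.items():
    print(name, {other: round(value, 2) for other, value in row.items()})
```

A correlation near 1 means two candidates' search interest rises and falls together; a value near 0, like the one reported for Trump and Carson, means the two series move largely independently.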


It becomes clear that, aside from Trump and Carson, Google Searches for Republican Candidates hit inflection points around the debates. Furthermore, there is clustering among candidates, which can be considered as different news cycles:

  1. Donald Trump
  2. Ben Carson
  3. Ted Cruz, Marco Rubio
  4. Jeb Bush, Carly Fiorina, Rand Paul

We will see tonight if any of the Candidates makes a move to shake up the trends.

Twitter – US Presidential Campaigns, Nov 7, 2015: Heat Map


This heat map visualizes the US Presidential Campaigns on Twitter as of November 7, 2015. Darkest blue indicates the highest value. Grey indicates the lowest value.


  • Bernie Sanders receives the most engagement per Tweet
  • Donald Trump Tweets the most of any candidate
  • Donald Trump and Hillary Clinton have the most Followers
  • Hillary Clinton, Ben Carson, Jeb Bush, and Chris Christie have curated new Twitter feeds for the 2016 election, as indicated by low Total Tweets
  • With the exception of Donald Trump, the Democrats rank highest across the board, indicating digital competency
  • With the exception of Carly Fiorina, all candidates Tweet a moderate amount per day

Moral Foundations: Reddit Political Communities

Moral Foundations Theory is a social psychological theory intended to explain the origins of and variation in human moral reasoning. The theory proposes moral foundations such as fairness, care, in-group, authority, and purity, and has been popularized by psychologist Jonathan Haidt in his book The Righteous Mind.

Haidt describes human morality as it relates to politics and proposes differences between conservatives and liberals as they relate to the moral foundations (TED Talk). Specifically, whereas conservatives appeal to fairness, care, in-group, authority, and purity equally, liberals appeal to fairness and care more than they appeal to in-group, authority, and purity.

Setting out to observe this phenomenon within Reddit Political Communities, I performed word frequency analyses on the /r/Republican and /r/Democrats corpora, totaling the words for each moral foundation, as defined by the LIWC dictionary. Comparing the totals, I found a trend consistent with Moral Foundations Theory. The visualization shows the moral foundations for /r/Democrats normalized against those for /r/Republican, with each value for /r/Republican set at 100%.
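The counting and normalization steps described above can be sketched as follows. The word lists here are tiny, hypothetical stand-ins for the LIWC-format dictionary, and the tokenizer and function names are assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical, heavily abbreviated word lists; the actual analysis used
# a LIWC-format dictionary with many more entries per foundation.
FOUNDATION_WORDS = {
    "care": {"care", "harm", "protect"},
    "fairness": {"fair", "equal", "justice"},
    "in-group": {"loyal", "nation", "together"},
    "authority": {"law", "order", "respect"},
    "purity": {"pure", "sacred", "clean"},
}


def foundation_totals(corpus):
    """Total the occurrences of each foundation's words in a text corpus."""
    counts = Counter(re.findall(r"[a-z']+", corpus.lower()))
    return {
        foundation: sum(counts[word] for word in words)
        for foundation, words in FOUNDATION_WORDS.items()
    }


def normalize(totals, baseline):
    """Express totals as a percentage of the baseline (baseline = 100%)."""
    return {
        foundation: (100.0 * totals[foundation] / baseline[foundation]
                     if baseline[foundation] else None)
        for foundation in totals
    }


# Placeholder strings standing in for the two subreddit corpora.
democrats = foundation_totals("fair and equal care, protect the law")
republican = foundation_totals("law and order, respect the nation, fair and pure")
print(normalize(democrats, republican))
```

Foundations with a zero baseline count are reported as None rather than dividing by zero. If the two corpora differ a lot in size, dividing each total by the corpus word count before normalizing would be a sensible extra step.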


Sentiment Analysis: Donald Trump & Hillary Clinton Tweets, Oct 5 – Oct 11, 2015

Emotion drives our decision-making. By appealing to emotion, others can persuade us to make decisions. We experience this during political campaigns.

Donald Trump knows the power of emotion. A charismatic leader, Trump infuses his speeches with appeals to emotion. Sentiment analysis makes this clear.

Comparing Donald Trump and Hillary Clinton, I sampled tweets from their respective profiles, published between Monday October 5, 2015, and Sunday October 11, 2015. Using sentiment analysis, each tweet was given a score between -1, the most negative, and +1, the most positive. Plotted across the 7 days, the results are displayed below, with Trump in red and Hillary in blue.

Trump exhibits a noisier sentiment artifact. Trump has almost no tweets with a sentiment score of 0. Trump peaks at +1 nine times; Hillary peaks at +1 three times. Summary statistics show that Trump’s tweets span a greater range of sentiment, with a tendency towards positive sentiment. The Median Sentiment for Donald Trump is 0.21, whereas the Median Sentiment for Hillary Clinton is 0. The Standard Deviation for Donald Trump is 0.39, whereas the Standard Deviation for Hillary Clinton is 0.30.

                     Donald Trump   Hillary Clinton
Median Sentiment     0.21           0.00
Standard Deviation   0.39           0.30
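Summary statistics like these can be computed from per-tweet sentiment scores along the following lines. The scores below are made-up placeholders, and the use of the sample standard deviation is an assumption, since the post does not say which variant was used.

```python
import statistics


def summarize(scores):
    """Median, spread, and range for a list of sentiment scores in [-1, 1]."""
    return {
        "median": statistics.median(scores),
        # Sample standard deviation; statistics.pstdev would give the
        # population version instead (which one was used is not stated).
        "std_dev": statistics.stdev(scores),
        "range": max(scores) - min(scores),
    }


# Placeholder scores, not the actual tweet data.
profile_scores = [-0.5, 0.0, 0.2, 0.4, 1.0]
print(summarize(profile_scores))
```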


So why does this matter? Noisy sentiment drives engagement. 

The chart below shows average tweet engagement for the respective profiles, for tweets published between Monday October 5, 2015, and Sunday October 11, 2015.

                 Donald Trump   Hillary Clinton
Avg. Retweets    1028           783
Avg. Favorites   2136           1196

– Trump received 1.31 retweets for every 1 retweet Clinton received

– Trump received 1.79 favorites for every 1 favorite Clinton received
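The two ratios follow directly from the averages in the chart; a quick check, using only the numbers reported above (the variable and function names are just for illustration):

```python
# Average engagement per tweet, taken from the chart above.
averages = {
    "retweets": {"trump": 1028, "clinton": 783},
    "favorites": {"trump": 2136, "clinton": 1196},
}


def per_one(metric):
    """Trump's average for the metric per 1 of Clinton's, rounded to 2 places."""
    values = averages[metric]
    return round(values["trump"] / values["clinton"], 2)


print(per_one("retweets"))   # 1.31
print(per_one("favorites"))  # 1.79
```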

So while some political analysts doubt Trump’s ability to win over the Republican Establishment, these findings clearly show Trump resonates with the people who have direct access to him on Twitter. Like television before it, social media has ushered in a new era of political campaign strategy, and we must ask: how will this new means of communication influence the selection of the Republican Presidential Nominee?

Content Strategy: The Cat’s Meeeow

It has been written that 15% of all Internet traffic is cat-related. Whether or not you believe this statistic, there is no doubt cats inhabit the digital space. To cite more popular examples, we have encountered LOL Cats, Grumpy Cat, and Lil’ Bub…what cuties! To date, there are 72 million media tagged as “cat” on Instagram.

With so many kitties purring around the interwebz, how might a content creator know where to start? Well, I have created a visualization showing the most popular cat breeds by hashtag on Instagram. Enjoy, and MEEEEEEEOW!!!