The following is an excerpt from the deliverable I worked on with Austin Daily, Jaynth Thiagarajan, and Ryan Weber for the Data Science and Business Analytics course at NYU Stern. For the project, we developed a model to predict the engagement score (retweets + favorites) for a given tweet, based on its text and metadata features.
Twitter is an online social networking service that enables users to send and read short 140-character messages called tweets. Users are able to retweet and favorite messages. By retweeting, a user shares the tweet on their own timeline, enabling their followers to see it. By favoriting, a user marks the tweet, indicating affinity for the message. In aggregate, retweets and favorites can be interpreted as a measure of engagement, or the extent to which the message attracts attention. Thus, individuals and companies whose businesses center on attracting attention may aim to compose messages that score high in engagement.
For our project, we set out to develop a model with which we can predict the engagement of a tweet. If our model is successful, it will create value by delivering a predicted measure of engagement prior to the message being published, thus creating the opportunity for the user to engineer the message for higher engagement.
Brands often partner with influencers, who have large audiences to whom products can be advertised. By predicting the engagement score for a tweet, brands can better decide which influencers to partner with as well as better assess return on investment.
On June 16, 2015, celebrity businessman Donald Trump announced he would run for President of the United States. Since his announcement, Trump has leveraged Twitter to spread his message and gain attention. Our team decided to analyze tweets from Donald Trump due to the increased attention to his campaign.
We collected 3,200 tweets published by Trump between the time of his announcement to run for President and December 8, 2015. The tweets were fetched by querying the Twitter API. Although the average user is concerned only with the 140-character message being broadcast out into the world, a single tweet carries far more metadata than is commonly realized (Figure 5). Many of these metadata attributes, such as the application from which the tweet was published, are not perceptible to the average user and thus are immaterial to our analysis. However, some attributes, such as the publishing time and date, may have a significant impact on total engagement. Additionally, the metadata contains the numbers of retweets and favorites, which together form the engagement score used as the label for our data.
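To make this concrete, the sketch below pulls the attributes we care about out of a single tweet object. The field names follow the Twitter REST API v1.1 status payload; the sample values are illustrative, not real data.

```python
from datetime import datetime

# A pared-down tweet object using field names from the Twitter REST API
# v1.1 status payload; the values here are illustrative only.
tweet = {
    "full_text": "Thank you #Iowa! #MakeAmericaGreatAgain",
    "created_at": "Mon Dec 07 18:30:00 +0000 2015",
    "retweet_count": 3500,
    "favorite_count": 9000,
    "entities": {
        "hashtags": [{"text": "Iowa"}, {"text": "MakeAmericaGreatAgain"}],
        "user_mentions": [],
        "urls": [],
    },
}

def extract_metadata(t):
    """Pull out the handful of metadata attributes our model uses."""
    published = datetime.strptime(t["created_at"], "%a %b %d %H:%M:%S %z %Y")
    return {
        "hour": published.hour,
        "weekday": published.weekday(),  # 0 = Monday
        "n_hashtags": len(t["entities"]["hashtags"]),
        "n_mentions": len(t["entities"]["user_mentions"]),
        "n_urls": len(t["entities"]["urls"]),
        "retweets": t["retweet_count"],
        "favorites": t["favorite_count"],
    }

print(extract_metadata(tweet))
```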
Modeling Twitter engagement is at its core a text mining problem. Hence the bulk of the information will come from the text of each tweet. The text itself contains certain structured attributes like references (callouts to other Twitter users using the @ sign followed by the username), hashtags (indicators of a particular topic using the # sign, used to signify a tweet’s participation in a trending topic) and hyperlinks. Beyond these, the features of the text are unstructured, so we have employed vectorization techniques to produce usable features.
The full set of features and labels used is as follows:
- Full text of the tweet, vectorized into 1- and 2-grams
- Number of references in the tweet
- Number of hashtags in the tweet
- Number of hyperlinks in the tweet
- Hour of the day at which the tweet was published
- Day of the week on which the tweet was published, represented as six binary features, one for each day Monday through Saturday (Sunday being the implicit baseline when all six are zero)
- The total engagement of the tweet, constructed using:
- Number of retweets
- Number of favorites
- A weighting factor alpha used to specify the relative importance of retweets and favorites; this value is business-case dependent, but we have used a value of 0.8, which weights retweets four times as heavily as favorites (0.8 versus 0.2)
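A minimal sketch of the feature and label construction is below. The combination rule is read as a convex combination, alpha * retweets + (1 - alpha) * favorites, which with alpha = 0.8 weights retweets four times as heavily as favorites; the regular expressions for references, hashtags, and links are simplified stand-ins for the actual extraction code.

```python
import re

ALPHA = 0.8  # business-case dependent; retweets weighted 4x favorites

def structured_features(text):
    """Count the structured attributes embedded in the tweet text."""
    return {
        "n_references": len(re.findall(r"@\w+", text)),
        "n_hashtags": len(re.findall(r"#\w+", text)),
        "n_links": len(re.findall(r"https?://\S+", text)),
    }

def day_dummies(weekday):
    """Six binary features for Monday..Saturday; Sunday is all zeros."""
    return [1 if weekday == d else 0 for d in range(6)]

def engagement(retweets, favorites, alpha=ALPHA):
    """Engagement label: alpha * retweets + (1 - alpha) * favorites."""
    return alpha * retweets + (1 - alpha) * favorites

print(structured_features(
    "Thank you @FoxNews! #MakeAmericaGreatAgain https://t.co/abc123"
))
print(engagement(1000, 2000))  # alpha-weighted score for 1000 RTs, 2000 favs
```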
We attempted to construct our model both as a binary classification problem (will a given tweet achieve high engagement?) and separately as a regression problem (what is the expected engagement for a given tweet?). Our hope was that the classification models would be easier to evaluate using powerful metrics such as AUC, and that the lessons learned would transfer to conceptually similar regression models, which may be more useful from a business perspective by providing more detailed feedback to users. However, we found that the models that produced the best results for the binary classification problem were either incompatible with the regression problem or simply did not produce similarly good results.
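As an illustration of the classification framing, the sketch below derives binary labels by thresholding engagement scores and computes AUC via the rank formulation. The median cutoff is an assumption for illustration; the report does not state the cutoff actually used.

```python
def binary_labels(scores, quantile=0.5):
    """Label tweets at or above the chosen engagement quantile as 'high'.
    The 50th-percentile cutoff here is an illustrative assumption."""
    cutoff = sorted(scores)[int(len(scores) * quantile)]
    return [1 if s >= cutoff else 0 for s in scores]

def auc(labels, predictions):
    """AUC via the rank (Mann-Whitney) formulation: the probability that a
    randomly chosen positive is ranked above a randomly chosen negative,
    counting ties as half a win."""
    pos = [p for y, p in zip(labels, predictions) if y == 1]
    neg = [p for y, p in zip(labels, predictions) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

scores = [120, 4500, 80, 9700, 310, 2200]
labels = binary_labels(scores)
print(labels)
print(auc(labels, scores))  # predicting the true scores yields AUC 1.0
```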
Deployment and Evaluation:
Ideally, the model would power a web application that lets the user pick a profile, build a model for that profile, and employ the model to test various tweets by predicting engagement. The challenge is the time it would take for the system to build a model.
If the total time taken by the system to gather, clean, and optimize the data is more than a few minutes, providing such a user experience may not be possible. There are two ways to address the challenge: either pre-select influential Twitter handles or build a data warehouse of Twitter feeds from profiles with large followings. Decoupling the data-fetching process from the feature-extraction process also helps to make model generation faster. Depending on the cost-benefit tradeoff, the process could be run in a highly parallelized environment using Spark or similar frameworks.
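The decoupling idea can be sketched as a simple cache layer: timelines are fetched once and persisted, so subsequent model builds pay only the feature-extraction cost. The cache location and the fetch callback here are hypothetical.

```python
import json
import os

CACHE_DIR = "tweet_cache"  # hypothetical on-disk warehouse layer

def cached_timeline(handle, fetch):
    """Return a profile's timeline, calling the (slow) fetcher only on a
    cache miss; later model builds pay only the feature-extraction cost."""
    os.makedirs(CACHE_DIR, exist_ok=True)
    path = os.path.join(CACHE_DIR, handle + ".json")
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    tweets = fetch(handle)  # stand-in for a paginated Twitter API crawl
    with open(path, "w") as f:
        json.dump(tweets, f)
    return tweets

# demo with a stub fetcher standing in for the real API call
timeline = cached_timeline("demo_handle", lambda h: [{"full_text": "hello"}])
print(len(timeline))
```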
From a methodology standpoint, the model currently does not factor in a time dimension. It is likely certain tokens drive more engagement only during parts of the year. For example, one would expect the token ‘holidays’ to drive more engagement during December than during other months. These effects are currently not captured by the model and the methodology needs to be improved to account for overall longer term trends and seasonality.
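One minimal way to begin capturing seasonality would be month-of-year indicator features, analogous to the day-of-week features already in use. This is a hedged sketch of a possible extension, not something the current model implements.

```python
from datetime import datetime

def month_dummies(created_at):
    """Eleven binary features for February through December (January as the
    baseline), mirroring the day-of-week encoding at a coarser grain."""
    month = datetime.strptime(created_at, "%a %b %d %H:%M:%S %z %Y").month
    return [1 if month == m else 0 for m in range(2, 13)]

# a December tweet lights up only the final (December) indicator
print(month_dummies("Mon Dec 07 18:30:00 +0000 2015"))
```

A tree-based model could then learn interactions between these indicators and seasonal tokens such as 'holidays'.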
Even after it is deployed to production, the model should be monitored by comparing predicted engagement against actual outcomes, for example via ongoing AUC measurements. Given the way the model would be used in production, the time taken to generate results is also an important metric to monitor on an ongoing basis.
Of particular note are the Random Forest Regression model’s conclusions on the relative importances of the available features. The obvious conclusion is that terrorism-related tweets garner very strong reactions. Most of the top 20 features are words such as “isis”, “paris”, “attacks”, and “syrians”. These subjects play into the fear-based rhetoric that has effectively ignited a large segment of Republicans and helped keep Donald Trump and other far-right, anti-establishment candidates ahead in the polls.
Figure 1 – Engagement Scores
Figure 2 – ROC Curves
Figure 3 – Feature importances from Random Forest Regression
Figure 4 – Predicted engagement on holdout data from regression models
Figure 5 – Map of Tweet Metadata