On GATE, Text and Social Media Analysis, and Detecting Misinformation Online: #InOrOut: Analysing Voting Trends in Tweets on #EURef Day

In this post I examine the question: could we have predicted the #EUReferendum outcome, based on #Leave and #Remain tweets posted on polling day? This follows up on my earlier post about the #InOrOut debate on Twitter on June 23rd, where I analysed tweet volumes, popular hashtags, and most mentioned users.

This is not the only study to analyse referendum day tweets, but here I present a more in-depth analysis, which also draws on a sample of tweets selected specifically because they advocate #Leave or #Remain.

#Leave / #Remain Trend Based on @Brndstr

Our real-time analysis uncovered the most popular user mentioned in posts on referendum day: @Brndstr. @Brndstr builds bots that help brands engage with their customers and let users act as social ambassadors for the brands they endorse.

On referendum day, they ran a campaign which encouraged people to tweet how they voted; in return, their profile picture would change accordingly. This was controversial for some Twitter users, who took issue with the choice of the Union Jack (for Out voters) vs the EU flag (for In voters), but nevertheless, many people declared their votes in this way.

Show your support with a custom Profile Flag Filter for the #EUref - what will you vote for? #iVoted👍 🇬🇧 🇪🇺 https://t.co/qMZda1tKh8
— Brndstr (@Brndstr) June 23, 2016

I found over 14,600 tweets mentioning @Brndstr in the 715 thousand original tweets we collected on June 23rd. I limited the analysis to original tweets only (i.e. excluded retweets and replies), since I wanted to study distinct, self-declared #Leave / #Remain intentions.

Inspection of a random sample in our Mimir Prospector dashboard showed all tweets had a set pattern, which made it trivial to distinguish #Leave and #Remain votes.

In particular, all #Leave tweets started with: I #VoteOut for the #Brexit #EURef vote with @Brndstr. All #Remain tweets started with: I #VoteIn for the #Brexit #EURef vote with @Brndstr

I used two Mimir queries with those texts, and found 6296 #VoteOut tweets and 8342 #VoteIn tweets. Thus, based on @Brndstr activity, one could hypothesize a #Remain majority.
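For illustration, the same split can be sketched in Python outside Mimir. This is only a sketch under stated assumptions: the tweets are taken to be plain strings, and `classify_brndstr` and `tally` are hypothetical helper names, not part of Mimir or GATE. The two prefixes are the ones quoted above.

```python
from collections import Counter

# Fixed prefixes of the @Brndstr campaign tweets, as quoted in the post.
LEAVE_PREFIX = "I #VoteOut for the #Brexit #EURef vote with @Brndstr"
REMAIN_PREFIX = "I #VoteIn for the #Brexit #EURef vote with @Brndstr"


def classify_brndstr(text):
    """Return 'leave', 'remain', or None for a tweet text (hypothetical helper)."""
    if text.startswith(LEAVE_PREFIX):
        return "leave"
    if text.startswith(REMAIN_PREFIX):
        return "remain"
    return None


def tally(texts):
    """Count self-declared votes in an iterable of tweet texts."""
    labels = (classify_brndstr(text) for text in texts)
    return Counter(label for label in labels if label is not None)


if __name__ == "__main__":
    # Made-up example texts, not taken from the collected data.
    demo = [REMAIN_PREFIX + " #iVoted", LEAVE_PREFIX, "Off to vote #EURef"]
    print(tally(demo))
```

Prefix matching is enough here only because the campaign tweets were generated from a fixed template; free-form tweets need a different approach.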

#Leave / #Remain Trend Based on Full-Text Search

In addition to @Brndstr, I also experimented with full-text searches over the referendum day tweets. For those interested in the technology behind this, I used GATE text analysis tools adapted to the referendum, combined with the Mimir semantic search engine (which supports searches over both linguistic annotations and full text).

First, I searched for tweets containing "I", "voted", and "remain" within an 8-word window. This returned 14,665 matching tweets and, upon manual inspection of the top 30 matches, I observed only 2 tweets which did not disclose the actual vote of their poster. Therefore, I considered this a sufficiently accurate query.
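To make the idea of a word-window query concrete, here is a rough Python sketch. It is not Mimir query syntax: `within_window` is a hypothetical helper, and it assumes simple regex tokenisation and that the order of the three terms does not matter, neither of which is stated in the post. The last lines turn the manual inspection (2 misses in 30) into a rough precision estimate.

```python
import re


def within_window(text, terms, window=8):
    """True if all terms occur inside some run of `window` consecutive tokens."""
    tokens = re.findall(r"\w+", text.lower())
    wanted = {term.lower() for term in terms}
    # max(1, ...) so that texts shorter than the window are still checked once.
    for start in range(max(1, len(tokens) - window + 1)):
        if wanted <= set(tokens[start:start + window]):
            return True
    return False


# Rough precision estimate from the manual check described above:
# 30 matches inspected, 2 of which did not disclose a vote.
inspected, misses = 30, 2
precision_estimate = (inspected - misses) / inspected

if __name__ == "__main__":
    print(within_window("I voted to remain in the EU today", ["I", "voted", "remain"]))
    print(f"estimated precision: {precision_estimate:.1%}")
```

An estimate from 30 top-ranked matches is only indicative; a larger random sample would give a tighter figure.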

The corresponding "I", "voted" and "leave" query returned 11,046 matching tweets, i.e. #Leave votes were outnumbered by #Remain ones again.

These statistics are in line with the findings of other studies of Twitter #EUReferendum posts. For instance, the #EURef Data Hub (by the Press Association, Twitter, and Blurrt) showed Remain tweets dominating over Leave tweets on June 23rd, but not on the 22nd or earlier, or (unsurprisingly) since.

It must be noted that, similar to the Ontotext study, the #EURef Data Hub statistics are derived from tweets referencing either the Leave or Remain campaigns, but not necessarily showing explicit support or voting intent.

However, as discussed in my earlier post, if we were to try and draw conclusions on the likely outcome based on tweets alone, then we need a more reliable Leave/Remain sample, indicative of actual support/self-declared voting intentions.

So now let's see if the same trend is present there.

#Leave / #Remain Voting Intentions Based on Our Classification Heuristic

Following on from my previous study of the overall characteristics of tweets posted on June 23rd, I again separated the tweets into original tweets, replies, and retweets.
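One common way to make this three-way split is sketched below. It assumes each tweet is a dict in the Twitter REST API v1.1 JSON shape, where retweets carry a `retweeted_status` object and replies have a non-null `in_reply_to_status_id`; the post does not say how the separation was actually done, and `tweet_kind` is a hypothetical helper name.

```python
def tweet_kind(tweet):
    """Classify a v1.1-style tweet dict as 'retweet', 'reply', or 'original'."""
    # Check retweets first: a retweet of a reply should still count as a retweet.
    if "retweeted_status" in tweet:
        return "retweet"
    if tweet.get("in_reply_to_status_id") is not None:
        return "reply"
    return "original"


if __name__ == "__main__":
    # Made-up minimal tweet dicts for demonstration only.
    examples = [
        {"text": "I voted!", "in_reply_to_status_id": None},
        {"text": "@someone agreed", "in_reply_to_status_id": 123},
        {"text": "RT @someone: ...", "retweeted_status": {"text": "..."}},
    ]
    print([tweet_kind(t) for t in examples])
```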

I applied our classification heuristic for reliable identification of #Leave/#Remain posts to all tweets posted on or after 13:00 BST on June 22nd, but before voting closed at 22:00 BST on June 23rd.
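The time window itself is easy to get wrong because of the BST/UTC offset, so here is a small sketch of the filter. It assumes timestamps arrive in the Twitter v1.1 `created_at` format (UTC, e.g. "Wed Jun 22 12:00:00 +0000 2016") and treats BST as a fixed UTC+1 offset; `in_window` is a hypothetical helper, not the code used for the study.

```python
from datetime import datetime, timedelta, timezone

# British Summer Time is UTC+1; a fixed offset is enough for these two days.
BST = timezone(timedelta(hours=1), "BST")
WINDOW_START = datetime(2016, 6, 22, 13, 0, tzinfo=BST)  # 13:00 BST, June 22nd
WINDOW_END = datetime(2016, 6, 23, 22, 0, tzinfo=BST)    # polls close, June 23rd


def in_window(created_at):
    """True if a v1.1-style created_at string falls in [start, end)."""
    ts = datetime.strptime(created_at, "%a %b %d %H:%M:%S %z %Y")
    # Aware datetimes compare correctly across different UTC offsets.
    return WINDOW_START <= ts < WINDOW_END


if __name__ == "__main__":
    print(in_window("Wed Jun 22 12:00:00 +0000 2016"))
    print(in_window("Thu Jun 23 21:00:00 +0000 2016"))
```

The window is half-open: a tweet stamped exactly at 22:00 BST on June 23rd is excluded, matching "before voting closed".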

As a result, I found just over 100 thousand tweets from the 22nd: 39 thousand advocating Remain and 61 thousand advocating Leave.

On June 23rd, as Twitter activity picked up significantly (also observed by #EURef Data Hub), I found 291 thousand matching tweets. Unlike other studies, however, our voting intent heuristic identified 164 thousand tweets advocating Leave and only 127 thousand advocating Remain.

Therefore, even though voting tweets from @Brndstr and tweet volume statistics from #EURef Data Hub both indicate that Remain was dominant, this trend wasn't supported in our voting intention sample.

Now let us examine the trends over time, separately for original tweets, replies, and retweets.

The graph below shows that indeed #Remain tweets were dominant in the early hours of June 23rd, but not before or after. What is particularly interesting is that #Remain tweets start to fall sharply from around 4pm, whereas #Leave ones pick up sharply a little later. By the time polls close at 10pm, tweets advocating #Leave are more than double the ones supporting #Remain.

Reply tweets show a largely different pattern (see graph below), where replies advocating #Leave consistently outnumber those advocating #Remain (at times by a factor of up to 2.5). This is a trend which we also observed earlier in June. It indicates that #Leave advocates were much more engaged in the Twitter debates than #Remain ones.

Note also that the trend observed in original tweets in the late afternoon and evening of June 23rd is evident here too, i.e. replies advocating #Remain start to fall, while replies advocating #Leave increase.

Lastly, I show below the trends in retweets, where again #Leave advocates dominate the debate, by retweeting much more than #Remain ones. Again, I already observed this trend earlier in June.
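Hourly series like the ones discussed above can be derived by bucketing classified tweets per hour, tweet type, and stance. The sketch below assumes each tweet has already been reduced to a `(timestamp, kind, stance)` tuple with a timezone-aware timestamp; that input shape and both function names are assumptions made for illustration, not a description of the GATE pipeline.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

BST = timezone(timedelta(hours=1), "BST")  # fixed UTC+1 offset


def hourly_counts(tweets):
    """Count tweets per (kind, stance, hour), with hours expressed in BST."""
    counts = Counter()
    for ts, kind, stance in tweets:
        hour = ts.astimezone(BST).replace(minute=0, second=0, microsecond=0)
        counts[(kind, stance, hour)] += 1
    return counts


def leave_to_remain_ratio(counts, kind, hour):
    """Leave/Remain ratio for one hour, or None if there are no Remain tweets."""
    remain = counts[(kind, "remain", hour)]
    if remain == 0:
        return None
    return counts[(kind, "leave", hour)] / remain


if __name__ == "__main__":
    # Made-up tweets for demonstration only.
    utc = timezone.utc
    demo = [
        (datetime(2016, 6, 23, 15, 5, tzinfo=utc), "original", "leave"),
        (datetime(2016, 6, 23, 15, 50, tzinfo=utc), "original", "leave"),
        (datetime(2016, 6, 23, 15, 30, tzinfo=utc), "original", "remain"),
    ]
    counts = hourly_counts(demo)
    four_pm = datetime(2016, 6, 23, 16, 0, tzinfo=BST)
    print(leave_to_remain_ratio(counts, "original", four_pm))
```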

What Have We Learnt?

Looking at tweets posted on the 23rd, both the @Brndstr tweets and the “I voted XX” searches gave Remain a majority over Leave, whereas our classification heuristic showed the opposite (i.e. Leave was the more likely winner).
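The disagreement is easy to see when the three samples are put side by side as Leave shares. The snippet below only restates figures already given in this post; note that the heuristic counts are rounded to the nearest thousand there, so that share is approximate. The names `counts`, `leave_share`, and `shares` are just for this illustration.

```python
# (leave, remain) counts for June 23rd, as reported in the post.
counts = {
    "@Brndstr template": (6296, 8342),
    '"I voted X" search': (11046, 14665),
    "classification heuristic": (164_000, 127_000),  # rounded to thousands
}


def leave_share(leave, remain):
    """Fraction of stance-bearing tweets that advocate Leave."""
    return leave / (leave + remain)


shares = {name: leave_share(*pair) for name, pair in counts.items()}

if __name__ == "__main__":
    for name, share in shares.items():
        ahead = "Leave" if share > 0.5 else "Remain"
        print(f"{name}: Leave share {share:.1%} -> {ahead} ahead")
```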

Given the conflicting evidence based on the same set of tweets, it is easy to see why others failed to predict the overall majority correctly.

I must also highlight here that my own analysis was never aimed at being predictive. Instead, I am trying to understand how people engaged, debated, and wrote about the referendum on social media.

In particular, as the referendum clearly showed, older voters tend to vote in higher proportions than young ones, and thus they ultimately determined the overall outcome. That older generation, however, is well known for being under-represented on Twitter, and is also probably less aware of @Brndstr and similar services, which explains why these gave the wrong trends.

In future research I would like to explore whether representativeness on Twitter is the full story, and whether this matters for political discussions. Does the younger generation actually talk more or less about politics than the older generation? Also, older people aside - were Brexiters (i.e. people supporting Leave) over- or under-represented on Twitter, as compared to Bremainers (i.e. voters supporting Remain)?

In order to get more accurate answers to these questions, as demonstrated here, it is important to identify actual tweets indicative of specific voting intentions or votes already cast. The largely predominant approach of simply counting tweets mentioning hashtags is not sufficiently accurate, as it does not distinguish tweets simply referring to a stance/campaign from tweets actually advocating a stance/campaign.

As part of subsequent research, I plan to also collect a gold standard of human-annotated tweets, where people will be asked to mark tweets indicating actual support and voting intent separately from tweets that simply mention the Leave/Remain campaigns. This will enable me to quantify how the different sampling strategies affect the accuracy of voting trends over time.
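Once such a gold standard exists, each sampling strategy could be scored against it with standard precision and recall. The following is a generic sketch rather than a planned implementation: the label names ("leave", "remain", "mention") and the `precision_recall` helper are assumptions made for this example.

```python
def precision_recall(predicted, gold, label):
    """Precision and recall of `label`, given parallel lists of labels.

    Returns (precision, recall); either is None when its denominator is zero.
    """
    pairs = list(zip(predicted, gold))
    tp = sum(1 for p, g in pairs if p == label and g == label)
    fp = sum(1 for p, g in pairs if p == label and g != label)
    fn = sum(1 for p, g in pairs if p != label and g == label)
    precision = tp / (tp + fp) if tp + fp else None
    recall = tp / (tp + fn) if tp + fn else None
    return precision, recall


if __name__ == "__main__":
    # Made-up labels: what a strategy predicted vs. what annotators marked.
    predicted = ["leave", "leave", "remain", "mention", "leave"]
    gold = ["leave", "mention", "remain", "mention", "remain"]
    for label in ("leave", "remain"):
        print(label, precision_recall(predicted, gold, label))
```

A hashtag-counting strategy would be expected to show up here as low precision, since tweets that merely mention a campaign get counted as support.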

Thanks to:

Dominic Rout, Ian Roberts, Mark Greenwood, Diana Maynard, and the rest of the GATE Team

Any mistakes are my own.

Sunday, 26 June 2016
