On GATE, Text and Social Media Analysis, and Detecting Misinformation Online: Real-time analysis


Tuesday, 24 April 2018

Funded PhD Opportunity: Large Scale Analysis of Online Disinformation in Political Debates

Applications are invited for an EPSRC-funded studentship at The University of Sheffield commencing on 1 October 2018.

The PhD project will examine the intersection of online political debates and misinformation, through big data analysis. This research is very timely, because online mis- and disinformation is reinforcing the formation of polarised partisan camps, sharing biased, self-reinforcing content. This is coupled with the rise in post-truth politics, where key arguments are repeated continuously, even when proven untrue by journalists or independent experts. Journalists and media have tried to counter this through fact-checking initiatives, but these are currently mostly manual, and thus not scalable to big data.

The aim is to develop machine learning-based methods for large-scale analysis of online misinformation and its role in political debates on online social platforms.

Application deadline: as soon as possible, until the funding is filled
Interviews: interviews take place within 2-3 weeks of application

Supervisory team: Professor Kalina Bontcheva (Department of Computer Science, University of Sheffield), Professor Piers Robinson (Department of Journalism, University of Sheffield), and Dr. Nikolaos Aletras (Information School, University of Sheffield).

Award Details

The studentship will cover tuition fees at the EU/UK rate and provide an annual maintenance stipend at standard Research Council rates (£14,777 in 2018/19) for 3.5 years.

Eligibility

The general eligibility requirements are:

Applicants should normally have studied in a relevant field to a very good standard at MSc level or equivalent experience.
Applicants should also have a 2.1 in a BSc degree, or equivalent qualification, in a related discipline.
EPSRC studentships are only available to students from the UK or European Union. Applications cannot be accepted from students liable to pay fees at the Overseas rate. Normally UK students will be eligible for a full award, which pays fees and a maintenance grant, if they meet the residency criteria, and EU students will be eligible for a fees-only award, unless they have been resident in the UK for 3 years immediately prior to taking up the award.

How to apply

To apply for the studentship, applicants need to apply directly to the University of Sheffield for entrance into the doctoral programme in Computer Science.

Complete an application for admission to the standard computer science PhD programme http://www.sheffield.ac.uk/postgraduate/research/apply
Applications should include a research proposal; CV; academic writing sample; transcripts and two references.
The research proposal of up to 1,000 words should outline your reasons for applying to this project and how you would approach the research, including details of your skills and experience in computing and/or data journalism.
Supporting documents should be uploaded to your application.

Sunday, 23 July 2017

The Tools Behind Our Twitter Abuse Analysis with BuzzFeed

Or...How to Quantify Abuse in Tweets in 5 Working Days

When BuzzFeed approached us with the idea of quantifying Twitter abuse towards politicians during the election campaign, we had only five working days before the article had to be completed and go public.

The goal was to use text analytics and analyse tweets replying to UK politicians, in the run up to the 2017 general election, in order to answer questions such as:

How widespread is abuse received by politicians?
Who are the main politicians targeted by such abusive tweets?
Are there any party or gender differences?
Do abuse levels stay constant over time or not?

So here I explain first how we collect the data for such studies and then how it gets analysed at scale and fast, all with our GATE-based open-source tools and their GATE Cloud text analytics-as-a-service deployment.

For researchers wishing more in-depth details, please read and cite our paper:

D. Maynard, I. Roberts, M. A. Greenwood, D. Rout and K. Bontcheva. A Framework for Real-time Semantic Social Media Analysis. Web Semantics: Science, Services and Agents on the World Wide Web, 2017 (in press). https://doi.org/10.1016/j.websem.2017.05.002, pre-print

Tweet Collection

We already had all necessary tweets at hand, since, within an hour of #GE2017 being announced, I set up, using the GATE Cloud tweet collection service:

https://cloud.gate.ac.uk/shopfront/displayItem/twitter-collector

the continuous collection of tweets by MPs, prominent politicians, parties, and candidates, as well as retweets and replies thereof.

I also set up a second Twitter collector service, running in parallel, to collect election-related tweets based purely on hashtags and keywords (e.g. #GE2017, vote, election).

How We Analysed and Quantified Abuse

Given the short five-day deadline, we were pleased to have at hand the large-scale, real-time text analytics tools in GATE, Mimir/Prospector, and GATE Cloud.

The starting point was the real-time text analysis pipeline from the Brexit research last year. That is capable of analysing up to 100 tweets per second (tps), although, in practice, the tweets usually were coming at the much lower 23 tps.

This time, however, we adapted it with a new abuse analysis component, as well as some more up-to-date knowledge about the politicians (including the new prime minister).

The analysis backbone was again GATE's TwitIE system, which consists of a tokenizer, normalizer, part-of-speech tagger, and a named entity recognizer. TwitIE is also available as-a-service on GATE Cloud, for easy integration and use.

Next, we added information about politicians, e.g. their names, gender, party, constituencies, etc. In this way, we could produce aggregate statistics, such as abuse-containing tweets aimed at Labour or Conservative male/female politicians.
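As a rough illustration of how such metadata enables aggregate statistics, here is a minimal Python sketch. It is not the GATE component itself: the account handles, the `reply_to` and `is_abusive` fields, and the table layout are all hypothetical and chosen only for the example.

```python
from collections import Counter

# Hypothetical metadata table; handles and values are illustrative only.
POLITICIANS = {
    "@example_mp_a": {"party": "Labour", "gender": "female"},
    "@example_mp_b": {"party": "Conservative", "gender": "male"},
}


def aggregate_abuse(tweets, politicians=POLITICIANS):
    """Count abuse-containing replies per (party, gender) of the targeted politician."""
    counts = Counter()
    for tweet in tweets:
        meta = politicians.get(tweet["reply_to"])
        # Skip replies to accounts we have no metadata for, and non-abusive replies.
        if meta is not None and tweet["is_abusive"]:
            counts[(meta["party"], meta["gender"])] += 1
    return counts


sample = [
    {"reply_to": "@example_mp_a", "is_abusive": True},
    {"reply_to": "@example_mp_a", "is_abusive": False},
    {"reply_to": "@example_mp_b", "is_abusive": True},
    {"reply_to": "@example_mp_b", "is_abusive": True},
    {"reply_to": "@unknown_account", "is_abusive": True},
]
print(aggregate_abuse(sample))
```

In this toy sample the unknown account is ignored, so only two (party, gender) groups receive counts.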

Next is a tweet geolocation component, which uses latitude/longitude, region, and user location metadata to geolocate tweets within the UK NUTS2 regions. This is not always possible, since many accounts and tweets lack such information, and this narrows down the sample significantly, should we choose to restrict by geolocation.

We also detect key themes and topics discussed in the tweets (more than one topic/theme can be contained in each tweet). Here we reused the module from the Brexit analyser.

The most exciting part was working with BuzzFeed's journalists to curate a set of abuse nouns typically aimed at people (e.g. twat), racist words, and milder insults (e.g. coward). We decided to differentiate these from general obscene language and swearing, as these were not always targeting the politician. Nevertheless, they were included in the system, to produce a separate set of statistics. We also introduced basic sub-classification by kind (e.g. racial) and strength (e.g. mild, medium, strong), derived from an Ofcom research report on offensive language.
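A lexicon lookup of this kind can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the lexicon below holds one term from the examples above plus a made-up placeholder, and the kind and strength labels attached to them are invented for the example rather than taken from the curated list.

```python
import re

# Illustrative lexicon. "coward" is one of the examples above; "exampleinsult"
# is a placeholder. The kind/strength labels are assumptions for this sketch.
ABUSE_LEXICON = {
    "coward": {"kind": "general", "strength": "mild"},
    "exampleinsult": {"kind": "general", "strength": "strong"},
}

TOKEN_RE = re.compile(r"[a-z']+")


def find_abuse_terms(text, lexicon=ABUSE_LEXICON):
    """Return (term, kind, strength) for every lexicon term found as a whole token."""
    matches = []
    for token in TOKEN_RE.findall(text.lower()):
        entry = lexicon.get(token)
        if entry is not None:
            matches.append((token, entry["kind"], entry["strength"]))
    return matches


print(find_abuse_terms("What a coward you are"))
```

Matching on whole tokens keeps the lookup fast and avoids substring false positives, at the cost of missing inflected forms such as plurals unless they are listed or the text is lemmatised first.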

Overall, we kept the processing pipeline as simple and efficient as possible, so it can run at 100 tweets per second even on a pretty basic server.

The analysis results were fed into GATE Mimir, which efficiently indexes tweet text and all our linguistic annotations. Mimir has a powerful programming API for semantic search queries, which we use to drive the various interactive visualisations and to generate the necessary aggregate statistics behind them.

For instance, we used Mimir queries to generate statistics and visualisations, based on time (e.g. most popular hashtags in abuse-containing tweets on 4 Jun); topic (e.g. the most talked about topics in such tweets), or target of the abusive tweet (e.g. the most frequently targeted politicians by party and gender). We could also navigate to the corresponding tweets behind these aggregate statistics, for a more in-depth analysis.

A rich sample of these statistics, associated visualisations, and abusive tweets is available in the BuzzFeed article.

Research carried out by:

Mark A. Greenwood, Ian Roberts, Dominic Rout, and myself, with ideas and other contributions from Diana Maynard and others from the GATE Team.

Any mistakes are my own.

Friday, 1 July 2016

The Tools Behind Our Brexit Analyser

UPDATE (13 December, 2016): Try the Brexit Analyzer

We have now made parts of the Brexit Analyzer available as a web service. You can try the topic detection by putting an example tweet here (choose mentions of political topics):

https://cloud.gate.ac.uk/shopfront/sampleServices

A more extensive test of the outputs (also including hashtags, voting intent, @mention, and URL detection) can be tried here:

https://cloud.gate.ac.uk/shopfront/displayItem/sobigdata-brexit

This is a web service running on GATE Cloud, where you can find many other text analytics services, available to try for free or run on large batches of data.

We now also have a tweet collection service, should you wish to start collecting and analysing your own Brexit (or any other) tweets:

https://cloud.gate.ac.uk/shopfront/displayItem/twitter-collector

Tools Overview

It will be two weeks tomorrow since we launched the Brexit Analyser -- our real-time tweet analysis system, based on our GATE text analytics and semantic annotation tools.

Back then, we were analysing on average 500,000 (yes, half a million!) tweets a day. Then, on referendum day alone, we had to analyse in real-time well over 2 million tweets. Or on average, just over 23 tweets per second! It wasn't quite so simple though, as tweet volume picked up dramatically as soon as the polls closed at 10pm and we were consistently getting around 50 tweets per second and were also being rate-limited by the Twitter API.
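The per-second figures follow directly from the daily volumes. A small Python check of the arithmetic, using only the numbers quoted above:

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def tweets_per_second(tweets_per_day):
    """Average rate implied by a daily tweet volume."""
    return tweets_per_day / SECONDS_PER_DAY


typical_day = tweets_per_second(500_000)
referendum_day = tweets_per_second(2_000_000)

print(round(typical_day, 1))     # average rate on a typical day
print(round(referendum_day, 1))  # average rate on referendum day
```

So a typical day averaged under 6 tweets per second, referendum day averaged just over 23, and the post-10pm peak of around 50 per second was more than double that daily average.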

These are some pretty serious data volumes, as well as velocity. So how did we build the Brexit Analyser to cope?

For analysis, we are using GATE's TwitIE system, which consists of a tokenizer, normalizer, part-of-speech tagger, and a named entity recognizer. After that, we added our Leave/Remain classifier, which helps us identify a reliable sample of tweets with unambiguous stance. Next is a tweet geolocation component, which uses latitude/longitude, region, and user location metadata to geolocate tweets within the UK NUTS2 regions. We also detect key themes and topics discussed in the tweets (more than one topic/theme can be contained in each tweet), followed by topic-centric sentiment analysis.

We kept the processing pipeline as simple and efficient as possible, so it can run at 100 tweets per second even on a pretty basic server.

The analysis results are fed into GATE Mimir, which efficiently indexes tweet text and all our linguistic annotations. Mimir has a powerful programming API for semantic search queries, which we use to drive different web pages with interactive visualisations. The user can choose what they want to see, based on time (e.g. most popular hashtags on 23 Jun; most talked about topics in Leave/Remain tweets on 23 Jun). Clicking on these infographics shows the actual matching tweets.

All my blog posts so far have been using screenshots of such interactively generated visualisations.

Mimir also has a more specialised graphical interface (Prospector), which I use for formulating semantic search queries and inspecting the matching data, coupled with some pre-set types of visualisations. The screenshot below shows my Mimir query for all original tweets on 23 Jun which advocate Leave. I can then inspect the most mentioned Twitter users within those. (I used Prospector for my analysis of Leave/Remain voting trends on referendum day.)

So how do I do my analyses?

First I decide what subset of tweets I want to analyse. This is typically a Mimir query restricting by timestamp (normalized to GMT), tweet kind (original, reply, or retweet), voting intention (Leave/Remain), mentioning a specific user/hashtag/topic, written by a specific user, containing a given hashtag or a given topic (e.g. all tweets discussing taxes).

Then, once I identify this dynamically generated subset of tweets, I can analyse it with Prospector or use the visualisations which we generate via the Mimir API. These include:

Top X most frequently mentioned words, nouns, verbs, or noun phrases
Top X most frequent posters/frequently mentioned tweeters
Top X most frequent Locations, Organizations, or Persons within those tweets
Top X themes / sub-themes according to our topic classifier
Frequent URLs, language of the tweets, and sentiment
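The two-step workflow (restrict to a subset, then compute "Top X" statistics over it) can be illustrated in plain Python. Note that this is not Mimir's query language or API: it is an in-memory sketch over dictionaries, and the field names (`kind`, `stance`, `date`, `hashtags`) are assumptions made for the example.

```python
from collections import Counter


def select(tweets, kind=None, stance=None, date=None):
    """Keep the tweets that match every restriction that is not None."""
    return [
        t for t in tweets
        if (kind is None or t["kind"] == kind)
        and (stance is None or t["stance"] == stance)
        and (date is None or t["date"] == date)
    ]


def top_x(tweets, field, x):
    """Most frequent values of a list-valued field, e.g. hashtags or mentions."""
    counts = Counter(value for t in tweets for value in t[field])
    return counts.most_common(x)


tweets = [
    {"kind": "original", "stance": "leave", "date": "2016-06-23",
     "hashtags": ["voteleave", "euref"]},
    {"kind": "original", "stance": "leave", "date": "2016-06-23",
     "hashtags": ["voteleave"]},
    {"kind": "retweet", "stance": "leave", "date": "2016-06-23",
     "hashtags": ["brexit"]},
    {"kind": "original", "stance": "remain", "date": "2016-06-23",
     "hashtags": ["strongerin"]},
]

subset = select(tweets, kind="original", stance="leave", date="2016-06-23")
print(top_x(subset, "hashtags", 2))
```

In the real system the subset is defined by an indexed semantic query rather than a linear scan, which is what makes the interactive use practical at millions of tweets.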

How do we scale it up?

It's built using GATE Cloud Paralleliser and some clever queueing, but the take-away message is: we can process and index over 100 tweets per second, which allows us to cope in real time with the tweet stream we receive via the Twitter Search API, even at peak times. All of this runs on a server which cost us under £10,000.
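For readers unfamiliar with the general pattern, a queue feeding a pool of workers can be sketched as follows. This is a generic producer/consumer illustration in Python, not the GATE Cloud Paralleliser or our actual queueing code; the `analyse` callback and the worker count are placeholders.

```python
import queue
import threading


def run_pipeline(tweets, analyse, num_workers=4):
    """Analyse tweets with a pool of worker threads fed from a shared queue."""
    tasks = queue.Queue()
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            tweet = tasks.get()
            if tweet is None:  # sentinel: no more work for this worker
                break
            annotated = analyse(tweet)
            with lock:
                results.append(annotated)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for tweet in tweets:
        tasks.put(tweet)
    for _ in threads:
        tasks.put(None)  # one sentinel per worker
    for thread in threads:
        thread.join()
    return results


# Results arrive in completion order, so sort before comparing.
print(sorted(run_pipeline(["a", "b", "c"], str.upper)))
```

The queue decouples the bursty incoming stream from the analysis workers, so short peaks are buffered rather than dropped. In Python, threads only help when the analysis step releases the interpreter lock or waits on I/O; a process pool would be the analogous choice for CPU-bound work.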

The architecture can be scaled up further, if needed, should we get access to a Twitter feed with higher API rate limits than the standard.

Thanks to:

Dominic Rout, Ian Roberts, Mark Greenwood, Diana Maynard, and the rest of the GATE Team

Any mistakes are my own.

Thursday, 23 June 2016

Identifying A Reliable Sample of Leave/Remain Tweets

UPDATE (13 December, 2016): Try the Brexit Analyzer

We have now made parts of the Brexit Analyzer available as a web service. You can try the topic detection by putting an example tweet here (choose mentions of political topics):

https://cloud.gate.ac.uk/shopfront/sampleServices

Overview

This post is the second in the series on the Brexit Tweet Analyser.

Having looked at tweet volumes and basic characteristics of the Twitter discourse around the EU referendum, we now turn to the method we chose to identify a reliable, even if incomplete, sample of leave and remain tweets.

There is currently no ground truth available, i.e. a well known sample of Leave/Remain Twitter users, therefore it is hard to establish the accuracy of these heuristics at present, but it is something we are working on actively.

More importantly, we are not trying to predict if leave or remain are leading, but instead we are interested in identifying a reliable, if incomplete, subset, so we can analyse topics discussed and active users within.

Are Hashtags A Reliable Predictor of Leave/Remain Support?

As discussed in our earlier post, over 56% of all tweets on the referendum contain at least one hashtag. Some of these are actually indicative of support for the leave/remain campaigns, e.g. #votetoleave, #voteout, #saferin, #strongertogether. Then there are also hashtags which try to address undecided voters, e.g. #InOrOut, #undecided, while promoting either a remain or leave vote but not through explicit hashtags.

A recent study of EU referendum tweets by Ontotext, carried out over tweets in May 2016, classified tweets as leave or remain on the basis of approximately 30 hashtags. Some of those were associated with leave, the rest -- with remain, and each tweet was classified as leave or remain based on whether it contains predominantly leave or predominantly remain hashtags.

Based on analysing manually a sample of random tweets with those hashtags, we found that this strategy does not always deliver a reliable assessment, since in many cases leave hashtags are used as a reference to the leave campaign, while the tweet itself is supportive of remain or neutral. The converse is also true, i.e. remain hashtags are used to refer to the remain stance/campaign. We have included some examples below.

A more reliable, even if somewhat more restrictive, approach is to consider the last hashtag in the tweet as the most indicative of its intended stance (pro-leave or pro-remain). This results in a higher precision sample of remain/leave tweets, which we can then analyse in more depth in terms of topics discussed and opinions expressed.
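To make the difference between the two strategies concrete, here is a minimal Python sketch of both. It rests on assumptions: the hashtag sets are a small illustrative subset rather than the lists actually used, and "last hashtag" is taken to mean the final hashtag occurring in the tweet text.

```python
import re

# Illustrative hashtag sets (lower-cased, without '#'); not the full lists.
LEAVE_TAGS = {"votetoleave", "voteout", "voteleave", "leave"}
REMAIN_TAGS = {"saferin", "strongertogether", "voteremain", "remain", "bremain"}

HASHTAG_RE = re.compile(r"#(\w+)")


def hashtags(text):
    """All hashtags in order of appearance, lower-cased and without '#'."""
    return [tag.lower() for tag in HASHTAG_RE.findall(text)]


def stance_by_majority(text):
    """Label by whichever side has more hashtags; None on a tie or no match."""
    tags = hashtags(text)
    leave = sum(tag in LEAVE_TAGS for tag in tags)
    remain = sum(tag in REMAIN_TAGS for tag in tags)
    if leave > remain:
        return "leave"
    if remain > leave:
        return "remain"
    return None


def stance_by_last_hashtag(text):
    """Label by the final hashtag only; None if it is not a stance hashtag."""
    tags = hashtags(text)
    if not tags:
        return None
    if tags[-1] in LEAVE_TAGS:
        return "leave"
    if tags[-1] in REMAIN_TAGS:
        return "remain"
    return None


tweet = ("Could the last decent politician (of any party) to leave the "
         "#Leave camp please turn off the lights.....#Bremain")
print(stance_by_majority(tweet), stance_by_last_hashtag(tweet))
```

For this tweet, quoted among the examples later in the post, the majority rule sees one leave and one remain hashtag and cannot decide, whereas the last-hashtag rule labels it remain. The rule is deliberately restrictive: a tweet ending in a neutral hashtag such as #EURef is left unlabelled, which trades recall for precision.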

Using this approach, amongst the 1.9 million tweets between June 13th and 19th, 5.5% (106 thousand) were identified as supporting the Leave campaign, while 4% (80 thousand) - as supporting the Remain campaign. Taken together, this constitutes just under a 10% sample, which we consider sufficient for the purposes of our analysis.

These results, albeit drawn from a smaller, high-precision sample, seem to indicate that the Leave campaign is receiving more coverage and support on Twitter, when compared to Remain. This is consistent also with the findings of the Ontotext study.

In subsequent posts we will look into the most frequently mentioned hashtags, the most active Twitter users, and the topics discussed in the Remain and Leave samples separately.

What about #Brexit in particular?

The recent Ontotext study on May 2016 data used #Brexit as one of the key hashtags indicative of leave. Others have also used #Brexit in the same fashion.

In our more recent 6.5 million tweets (dated between 1 June and 19 June 2016), just under 1.7 million contain the #Brexit hashtag (26%). However, having examined a random sample of those manually (see examples below), we established that while many tweets did use #Brexit to indicate support for leave, there were also many cases where #Brexit referred to the referendum, or the leave/remain question, or the Brexit campaign as a whole. We have provided some such examples at the end of this blog post. We also found a sufficient number of examples where #Brexit appears at the end of tweets while still not indicating support for voting leave.

Therefore, we chose to distinguish the #Brexit hashtag from all other leave hashtags and tagged tweets with a final #Brexit tag separately. This enables us, in subsequent analyses, to compare findings with and without considering #Brexit.

Example Remain/Leave Hashtag Use

It doesnt matter who some of the dodgy leaders of #Remain and #Brexit are, they each only have ONE VOTE, like all of us public #EURef
— Marcus Storm (@MarcsandSparks) 20 June 2016

Perfect question! "Why is #brexit ahead, despite all the experts supporting #remain?" #questiontime
— Steve Parrott (@steveparrott50) 19 June 2016

Could the last decent politician (of any party) to leave the #Leave camp please turn off the lights.....#Bremain pic.twitter.com/zQjjoIXcyO
— Dr Hamed Khan (@drhamedkhan) 19 June 2016

Today's @thesundaytimes #focus articles on #brexit say it all. #remain is forward-looking, #leave backward
— Patrick White (@pbpwhite) 20 June 2016

Example Brexit Tweets

#Brexit probability declines as campaigns remain quiet https://t.co/qrAhURvRDk via @RJ_FXandRates pic.twitter.com/UnNV1NDnZv
— Bloomberg London (@LondonBC) 17 June 2016

#VoteRemain #VoteLeave #InOrOut #EURef #StrongerIn -- Is #Brexit The End Of The World As We Know It? via @forbes https://t.co/lQ6Xgf0oEW
— Jolly Roger (@EUGrassroots) 17 June 2016

Remaining #Brexit Polls scheduled releases pic.twitter.com/DKzBqjoGcs
— Nicola Duke (@NicTrades) 17 June 2016

Blame austerity—not immigration—for bringing Britain to ‘breaking point’https://t.co/f3oKODbLSe #Brexit #EUref pic.twitter.com/lLJHOsUO7J
— The Conversation (@ConversationUK) June 20, 2016

BREAK World's biggest carmaker #Ford tells staff of "deep concerns abt "uncertainty/potential downsides" of #Brexit pic.twitter.com/bYQ3LyIA6i
— Beth Rigby (@BethRigby) June 20, 2016

Thanks to:

Dominic Rout, Ian Roberts, Mark Greenwood, Diana Maynard, and the rest of the GATE Team

Any mistakes are my own.

Friday, 17 June 2016

Introducing the Brexit Analyser: real-time Twitter analysis with GATE

The GATE team has been busy lately with building the real-time Brexit Analyser. It analyses tweets related to the forthcoming EU referendum, as they come in, in order to track the referendum debate unfolding on Twitter. This research is being carried out as part of the SoBigData project.

The work follows on from our successful collaboration with Nesta on the Political Futures Tracker, which analysed tweets in real-time in the run up to the UK General Election in 2015.

Unlike others, we do not try to predict the outcome of the referendum or answer the question of whether Twitter can be used as a substitute for opinion polls. Instead, our focus is on a more in-depth analysis of the referendum debate; the people and organisations who engage in those debates; what topics are discussed and opinion expressed, and who the top influencers are.

What does it do?

It analyses and indexes tweets as they come in (i.e. in real time), in order to identify commonly discussed topics, opinions expressed, and whether a tweet is expressing support for remaining or leaving the EU. It must be noted that not all tweets have a clear stance and also that not all tweets express a clear voting intention (e.g. "Brexit & Bremain"). More on this in subsequent posts!

In more detail, the Brexit Analyser uses text analytics and opinion mining techniques from GATE, in order to identify tweets expressing voting intentions, the topics discussed within, and the sentiment expressed towards these topics. Watch this space!

The Data (So Far)

We are collecting tweets based on a number of referendum related hashtags and keywords, such as #voteremain, #voteleave, #brexit, #eureferendum.

The volume of original tweets, replies, and re-tweets per day collected so far is shown below. On average, this is close to half a million tweets per day (480 thousand), which is 1.6 times the tweets on 26 March 2015 (300,000), when the Battle For Number 10 interviews took place, in the run up to the May 2015 General Elections.

In total, we have analysed just over 1.9 million tweets in the past 4 days, with 60% of those being re-tweets. On average, a tweet is re-tweeted 1.65 times.

Subsequent posts will examine the distribution of original tweets, re-tweets, and replies specifically in tweets expressing a remain/leave voting intention.

Hashtags: 1 million of those 1.9 million tweets contain at least one hashtag (i.e. 56.5% of all tweets have hashtags). If only original tweets are considered (i.e. all replies and retweets are excluded), then there are 319 thousand tweets with hashtags amongst the original 678 thousand tweets (i.e. 47% of original tweets are hashtag-bearing).

Analysing hashtags used in a Twitter debate is interesting, because they indicate commonly discussed topics, stance taken towards the referendum, and also key influencers. As they are easy to search for, hashtags help Twitter users participate in online debates, including other users they are not directly connected to.

Below we show some common hashtags on June 16, 2016. As can be seen, most are associated directly with the referendum and voting intentions, while others refer to politicians, parties, media, places, and events:

URLs: Interestingly, amongst the 1.9 million tweets only 134 thousand contain a URL (i.e. only 7%). Amongst the 1.1 million re-tweets, 11% contain a URL, which indicates that tweets with URLs tend to be retweeted more.

These low percentages suggest that the majority of tweets on the EU referendum are expressing opinions or addressing another user, rather than sharing information or providing external evidence.

@Mentions: Indeed, 90 thousand (13%) of the original 678 thousand tweets contain a username mention. The 50 most mentioned users in those tweets are shown below. The size of the user name indicates frequency, i.e. the larger the text, the more frequently this username has been mentioned in tweets.

In subsequent posts we will provide information on the most frequently re-tweeted users and the most prolific Twitter users in the dataset.

So What Does This Tell Us?

Without a doubt, there is a heavy volume of tweets on the EU referendum, published daily. However, with only 6.8% of all tweets being replies and over 58% -- re-tweets, this resembles more an echo chamber, rather than a debate.

Pointers to external evidence/sources via URLs are scarce, as are user mentions. The most frequently mentioned users are predominantly media (e.g., BBC, Reuters, FT, the Sun, Huffington Post); politicians playing a prominent role in the campaign (e.g. David Cameron, Boris Johnson, Nigel Farage, Jeremy Corbyn); and campaign accounts created especially for the referendum (e.g. @StrongerIn, @Vote_Leave).

Thanks to:

Dominic Rout, Ian Roberts, Mark Greenwood, Diana Maynard, and the rest of the GATE Team