The Tools Behind Our Twitter Abuse Analysis with BuzzFeed
Or...How to Quantify Abuse in Tweets in 5 Working Days
The goal was to use text analytics to analyse tweets replying to UK politicians in the run-up to the 2017 general election, in order to answer questions such as:
- How widespread is the abuse received by politicians?
- Who are the main politicians targeted by such abusive tweets?
- Are there any party or gender differences?
- Do abuse levels stay constant over time or not?
Researchers who want more in-depth details should read and cite our paper:
D. Maynard, I. Roberts, M. A. Greenwood, D. Rout and K. Bontcheva. A Framework for Real-time Semantic Social Media Analysis. Web Semantics: Science, Services and Agents on the World Wide Web, 2017 (in press). https://doi.org/10.1016/j.websem.2017.05.002, pre-print
I also set up a second Twitter collector service, running in parallel, to collect election-related tweets based purely on hashtags and keywords (e.g. #GE2017, vote, election).
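As a rough illustration of hashtag and keyword collection, the sketch below tests whether a tweet's text contains any tracked term. It is not the collector itself: the function name and the tokenising pattern are assumptions, and the term list holds only the three examples mentioned above.

```python
import re

# Illustrative tracking terms: only the three examples named in the post.
TRACK_TERMS = {"#ge2017", "vote", "election"}

# Assumed tokenisation: optional leading '#', then word characters.
TOKEN_RE = re.compile(r"#?\w+")


def matches_election_terms(text, terms=TRACK_TERMS):
    """Return True if any whole token of the tweet text is a tracked term."""
    tokens = {token.lower() for token in TOKEN_RE.findall(text)}
    return not tokens.isdisjoint(terms)


print(matches_election_terms("Go and VOTE tomorrow! #GE2017"))
print(matches_election_terms("Devoted fans only"))
```

Matching on whole tokens rather than substrings avoids false hits such as "devoted" matching "vote".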
How We Analysed and Quantified Abuse
The starting point was the real-time text analysis pipeline from last year's Brexit research. It can analyse up to 100 tweets per second (tps), although in practice tweets usually arrived at the much lower rate of 23 tps.
This time, however, we extended it with a new abuse analysis component, as well as more up-to-date knowledge about the politicians (including the new prime minister).
The analysis backbone was again GATE's TwitIE system, which consists of a tokenizer, normalizer, part-of-speech tagger, and a named entity recognizer. TwitIE is also available as a service on GATE Cloud, for easy integration and use.
Next, we added information about politicians, e.g. their names, gender, party, constituencies, etc. In this way, we could produce aggregate statistics, such as abuse-containing tweets aimed at Labour or Conservative male/female politicians.
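To make the aggregation idea concrete, here is a minimal sketch that counts abuse-containing replies by the party and gender of the politician addressed. The account handles, the lookup table, and the tweet fields (`reply_to`, `abusive`) are all hypothetical and do not reflect the actual GATE annotations.

```python
from collections import Counter

# Hypothetical politician metadata, keyed by lower-cased Twitter handle.
POLITICIANS = {
    "@example_mp_a": {"party": "Labour", "gender": "female"},
    "@example_mp_b": {"party": "Conservative", "gender": "male"},
}


def abuse_counts_by_party_and_gender(tweets, politicians=POLITICIANS):
    """Count abusive replies per (party, gender) of the targeted politician."""
    counts = Counter()
    for tweet in tweets:
        if not tweet["abusive"]:
            continue
        info = politicians.get(tweet["reply_to"].lower())
        if info is None:
            continue  # reply to an account we hold no metadata for
        counts[(info["party"], info["gender"])] += 1
    return counts


sample = [
    {"reply_to": "@Example_MP_A", "abusive": True},
    {"reply_to": "@example_mp_a", "abusive": False},
    {"reply_to": "@example_mp_b", "abusive": True},
]
print(abuse_counts_by_party_and_gender(sample))
```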
Next is a tweet geolocation component, which uses latitude/longitude, region, and user location metadata to geolocate tweets within the UK NUTS2 regions. This is not always possible, since many accounts and tweets lack such information, so restricting the analysis by geolocation narrows the sample significantly.
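One way to picture such a component is as a fallback chain over the three metadata sources. The sketch below is an assumption about how that could be arranged, not a description of the actual component: the order of preference, the bounding boxes, the place names, and the region labels are all placeholders rather than real NUTS2 data.

```python
# Placeholder bounding boxes: (min_lat, min_lon, max_lat, max_lon).
REGION_BOXES = {
    "REGION_A": (51.0, -1.0, 52.0, 0.5),
    "REGION_B": (53.0, -3.0, 54.0, -1.5),
}

# Placeholder gazetteer mapping lower-cased place names to region labels.
PLACE_TO_REGION = {"exampletown": "REGION_A", "sampleford": "REGION_B"}


def region_from_coordinates(lat, lon):
    """Return the first region whose box contains the point, else None."""
    for region, (min_lat, min_lon, max_lat, max_lon) in REGION_BOXES.items():
        if min_lat <= lat <= max_lat and min_lon <= lon <= max_lon:
            return region
    return None


def region_from_place(name):
    """Look a free-text place name up in the gazetteer."""
    if not name:
        return None
    return PLACE_TO_REGION.get(name.strip().lower())


def geolocate(tweet):
    """Try coordinates first, then the region field, then the user location."""
    coords = tweet.get("coordinates")
    if coords is not None:
        region = region_from_coordinates(*coords)
        if region is not None:
            return region
    for field in ("region", "user_location"):
        region = region_from_place(tweet.get(field))
        if region is not None:
            return region
    return None  # no usable location metadata


print(geolocate({"coordinates": (51.5, 0.0)}))
print(geolocate({"user_location": " Sampleford "}))
print(geolocate({}))
```

Tweets for which every source fails come back as `None`, which is the case that shrinks the sample when filtering by location.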
We also detect key themes and topics discussed in the tweets (more than one topic/theme can be contained in each tweet). Here we reused the module from the Brexit analyser.
The most exciting part was working with BuzzFeed's journalists to curate a set of abuse nouns typically aimed at people (e.g. twat), racist words, and milder insults (e.g. coward). We decided to differentiate these from general obscene language and swearing, since the latter did not always target the politician. Nevertheless, obscene language was included in the system, to produce a separate set of statistics. We also introduced a basic sub-classification by kind (e.g. racial) and strength (e.g. mild, medium, strong), derived from an Ofcom research report on offensive language.
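A lexicon of this sort can be pictured as a table from term to kind and strength. The sketch below shows a simple whole-word lookup over such a table; the two entries reuse the example words from the text, while the `kind` label and the strength assigned to the first entry are placeholders, not the curated list or the Ofcom ratings.

```python
import re
from collections import Counter

# Illustrative entries only; labels are placeholders, not the real lexicon.
ABUSE_LEXICON = {
    "twat": {"kind": "insult", "strength": "medium"},
    "coward": {"kind": "insult", "strength": "mild"},
}

# Assumed tokenisation: runs of lower-case letters and apostrophes.
WORD_RE = re.compile(r"[a-z']+")


def find_abuse_terms(text, lexicon=ABUSE_LEXICON):
    """Return one record per lexicon term found as a whole word in the text."""
    hits = []
    for word in WORD_RE.findall(text.lower()):
        entry = lexicon.get(word)
        if entry is not None:
            hits.append({"term": word, **entry})
    return hits


def strength_counts(texts, lexicon=ABUSE_LEXICON):
    """Aggregate the number of matched terms per strength level."""
    counts = Counter()
    for text in texts:
        for hit in find_abuse_terms(text, lexicon):
            counts[hit["strength"]] += 1
    return counts


print(find_abuse_terms("You COWARD!"))
print(strength_counts(["You coward!", "What a twat, and a coward too."]))
```

A separate table of general swear words could be run through the same functions to produce the second set of statistics.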
Overall, we kept the processing pipeline as simple and efficient as possible, so it can run at 100 tweets per second even on a pretty basic server.
The analysis results were fed into GATE Mimir, which efficiently indexes the tweet text and all our linguistic annotations. Mimir has a powerful programming API for semantic search queries, which we use to drive the various interactive visualisations and to generate the necessary aggregate statistics behind them.
For instance, we used Mimir queries to generate statistics and visualisations based on time (e.g. the most popular hashtags in abuse-containing tweets on 4 Jun), topic (e.g. the most talked-about topics in such tweets), or target of the abusive tweet (e.g. the most frequently targeted politicians by party and gender). We could also navigate to the corresponding tweets behind these aggregate statistics, for a more in-depth analysis.
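To show the shape of one such statistic, the sketch below computes the most popular hashtags in abuse-containing tweets on a given day. It uses plain Python over an in-memory list rather than Mimir's query API, and the tweet fields (`text`, `created`, `abusive`) are hypothetical.

```python
import re
from collections import Counter
from datetime import date

HASHTAG_RE = re.compile(r"#\w+")


def top_hashtags(tweets, day, n=3):
    """Return the n most frequent hashtags in abusive tweets posted on `day`."""
    counts = Counter()
    for tweet in tweets:
        if tweet["abusive"] and tweet["created"] == day:
            counts.update(tag.lower() for tag in HASHTAG_RE.findall(tweet["text"]))
    return counts.most_common(n)


sample = [
    {"text": "Shame on you #GE2017 #Vote", "created": date(2017, 6, 4), "abusive": True},
    {"text": "Disgraceful #ge2017", "created": date(2017, 6, 4), "abusive": True},
    {"text": "Good luck #GE2017", "created": date(2017, 6, 4), "abusive": False},
]
print(top_hashtags(sample, date(2017, 6, 4)))
```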
A rich sample of these statistics, associated visualisations, and abusive tweets is available in the BuzzFeed article.
Research carried out by:
Any mistakes are my own.