Thursday 21 December 2017

Discerning Truth in the Age of Ubiquitous Disinformation

Initial Reflection on My Evidence to the DCMS Inquiry on Fake News


Kalina Bontcheva (@kbontcheva)

The past few years have heralded the age of ubiquitous disinformation, which is posing serious questions over the role of social media and the Internet in modern democratic societies. Topics and examples abound, ranging from the Brexit referendum and the US presidential election to medical misinformation (e.g. miraculous cures for cancer). Social media now routinely reinforce their users’ confirmation bias, so little to no attention is paid to opposing views or critical reflections. Blatant lies often make the rounds, re-posted and shared thousands of times, sometimes even jumping successfully into the mainstream media. Debunks and corrections, on the other hand, receive comparatively little attention.


I often get asked: “So why is this happening?”


My short answer: the 4Ps of the modern disinformation age - post-truth politics, online propaganda, polarised crowds, and partisan media.


  1. Post-truth politics: The first societal and political challenge comes from the emergence of post-truth politics, where politicians, parties, and governments tend to frame key political issues in propaganda, instead of facts. Misleading claims are continuously repeated, even when proven untrue through fact-checking by media or independent experts (e.g. the VoteLeave claim that Britain was paying the EU £350 million a week). This has a highly corrosive effect on public trust.
  2. Online propaganda and fake news: State-backed (e.g. Russia Today), ideology-driven (e.g.  misogynistic or Islamophobic), and clickbait websites and social media accounts are all engaged in spreading misinformation, often with the intent to deepen social division and/or influence key political outcomes (e.g. the 2016 US presidential election).  
  3. Partisan media: The pressures of the 24-hour news cycle and today’s highly competitive online media landscape have resulted in lower reporting quality and opinion diversity, with misinformation, bias, and factual inaccuracies routinely creeping in.
  4. Polarised crowds: As more and more citizens turn to online sources as their primary source of news, the social media platforms and their advertising and content recommendation algorithms have enabled the creation of partisan camps and polarised crowds, characterised by flame wars and biased content sharing, which, in turn, reinforce users’ prior beliefs (typically referred to as confirmation bias).


On Tuesday (19 December 2017) I gave evidence in front of the Digital, Culture, Media and Sport Committee (DCMS) as part of their inquiry into fake news (although I prefer the term disinformation) and automation (a.k.a. bots): their ubiquity, their impact on society and democracy, the role of platforms and technology in creating the problem, and, briefly, whether we can use existing technology to detect and neutralise the effects of bots and disinformation.


The session lasted an hour, in which we had to answer 51 questions spanning all these issues, so each answer had to be kept very brief. The full transcript is available here.


The list of questions was not given to us in advance, which, coupled with the need for short answers, left me with a number of additional points I would like to make. So this is the first of several blog posts where I will revisit some of these questions in more detail.


Let's get started with the first four questions (Q1 to Q4 in the transcript), which were about the availability and accuracy of technology for automatic detection of disinformation on social media platforms. In particular:

Can such technology identify disinformation in real time (part of Q3), and should it be adopted by the social media platforms themselves (Q4)?


TL;DR: Yes, in principle, but we are still far from having solved key socio-technical issues, so, when it comes to containing the spread of disinformation, we should not use this as yet another stick to beat the social media platforms with.


And here is why this is the case:


  • Non-trivial scalability: While some of our algorithms work in near real time on specific datasets (e.g. tweets about the Brexit referendum), applying them across all posts on all topics, as Twitter would need to do, is very far from trivial. Just to give a sense of the scale: prior to 23 June 2016 (referendum day) we had to process fewer than 50 Brexit-related tweets per second, which was doable. Twitter, however, would need to process more than 6,000 tweets per second - a 120-fold increase, and a serious software engineering, computational, and algorithmic challenge.


  • Algorithms make mistakes, so while 90% accuracy intuitively sounds very promising, we must not forget the errors - 10% in this case, or double that (20%) for an algorithm with 80% accuracy. At 6,000 tweets per second, that 10% amounts to 600 wrongly labelled tweets per second, rising to 1,200 for the lower-accuracy algorithm. To make matters worse, automatic disinformation analysis often combines more than one algorithm - the first determines which story a post refers to, and the second whether that story is likely true, false, or uncertain. Unfortunately, when algorithms are executed in a sequence, errors have a cumulative effect (see the worked example after this list).


  • These mistakes can be very costly: broadly speaking, algorithms make two kinds of errors - false negatives (e.g. disinformation wrongly labelled as true, or bot accounts wrongly identified as human) and false positives (e.g. correct information wrongly labelled as disinformation, or genuine users wrongly identified as bots). False negatives are a problem on social platforms because the high volume and velocity of social posts (e.g. 6,000 tweets per second on average) still leaves us with a lot of disinformation “in the wild”. If we draw an analogy with email spam - even though most of it is filtered out automatically, we still receive a significant proportion of spam messages. False positives, on the other hand, pose an even more significant problem, as they could be regarded as censorship. Facebook, for example, has a growing problem with some users having their accounts wrongly suspended.
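
To make the cumulative effect concrete, here is an illustrative calculation for two 90%-accurate algorithms chained in sequence (assuming, for simplicity, that their errors are independent):

    0.9 × 0.9 = 0.81      (probability that both stages are correct)
    1 − 0.81 = 0.19       (so 19% of outputs carry at least one error)
    0.19 × 6,000 ≈ 1,140  (mislabelled tweets per second at Twitter scale)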

Related posts on this blog: 

Sunday 23 July 2017

The Tools Behind Our Twitter Abuse Analysis with BuzzFeed


Or...How to Quantify Abuse in Tweets in 5 Working Days


When BuzzFeed approached us with the idea to quantify Twitter abuse towards politicians during the election campaign, we had only five working days before the article had to be completed and go public.

The goal was to use text analytics to analyse tweets replying to UK politicians in the run-up to the 2017 general election, in order to answer questions such as:

  • How widespread is the abuse received by politicians?
  • Who are the main politicians targeted by such abusive tweets?
  • Are there any party or gender differences?
  • Do abuse levels stay constant over time or not?  
So here I first explain how we collect the data for such studies, and then how it gets analysed quickly and at scale, all with our GATE-based open-source tools and their GATE Cloud text-analytics-as-a-service deployment.

For researchers who want more in-depth details, please read and cite our paper:

D. Maynard, I. Roberts, M. A. Greenwood, D. Rout and K. Bontcheva. A Framework for Real-time Semantic Social Media Analysis. Web Semantics: Science, Services and Agents on the World Wide Web, 2017 (in press). https://doi.org/10.1016/j.websem.2017.05.002, pre-print

Tweet Collection 


We already had all the necessary tweets at hand since, within an hour of #GE2017 being announced, I had used the GATE Cloud tweet collection service to set up continuous collection of tweets by MPs, prominent politicians, parties, and candidates, as well as retweets of and replies to these.

I also set up a second Twitter collection service running in parallel, to collect election-related tweets based purely on hashtags and keywords (e.g. #GE2017, vote, election).
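
We used the GATE Cloud service for this, but for readers who want to see what a keyword-based collector looks like in code, here is a minimal sketch against the Twitter Streaming API using the twitter4j library. This is an illustrative stand-in, not the GATE Cloud implementation: the credentials file and the printing step are assumptions, and only the keywords come from the setup described above.

    import twitter4j.FilterQuery;
    import twitter4j.StallWarning;
    import twitter4j.Status;
    import twitter4j.StatusDeletionNotice;
    import twitter4j.StatusListener;
    import twitter4j.TwitterStream;
    import twitter4j.TwitterStreamFactory;

    public class ElectionTweetCollector {
        public static void main(String[] args) {
            // OAuth credentials are read from twitter4j.properties on the classpath.
            TwitterStream stream = new TwitterStreamFactory().getInstance();

            stream.addListener(new StatusListener() {
                public void onStatus(Status status) {
                    // In a real collector each tweet would be stored for later
                    // analysis; here we simply print it.
                    System.out.println(status.getUser().getScreenName()
                            + ": " + status.getText());
                }
                public void onDeletionNotice(StatusDeletionNotice notice) { }
                public void onTrackLimitationNotice(int numberOfLimitedStatuses) { }
                public void onScrubGeo(long userId, long upToStatusId) { }
                public void onStallWarning(StallWarning warning) { }
                public void onException(Exception ex) { ex.printStackTrace(); }
            });

            // Track the election hashtags and keywords mentioned above.
            stream.filter(new FilterQuery().track("#GE2017", "vote", "election"));
        }
    }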

How We Analysed and Quantified Abuse 


Given the short five-day deadline, we were pleased to have at hand the large-scale, real-time text analytics tools in GATE, Mimir/Prospector, and GATE Cloud.

The starting point was the real-time text analysis pipeline from our Brexit research last year, which is capable of analysing up to 100 tweets per second (tps), although in practice the tweets usually arrived at the much lower rate of 23 tps.

This time, however, we adapted it with a new abuse analysis component, as well as some more up-to-date knowledge about the politicians (including the new prime minister). 




The analysis backbone was again GATE's TwitIE system, which consists of a tokenizer, normalizer, part-of-speech tagger, and a named entity recognizer. TwitIE is also available as-a-service on GATE Cloud, for easy integration and use.
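
For developers, a minimal GATE Embedded sketch of running TwitIE over a single tweet might look like this; the path to the saved twitie.gapp application is an assumption that depends on your local GATE installation:

    import java.io.File;
    import gate.Corpus;
    import gate.CorpusController;
    import gate.Document;
    import gate.Factory;
    import gate.Gate;
    import gate.util.persistence.PersistenceManager;

    public class TwitIEExample {
        public static void main(String[] args) throws Exception {
            Gate.init();  // initialise the GATE library

            // Load the saved TwitIE application; the path below is illustrative
            // and depends on where the Twitter plugin lives in your install.
            CorpusController twitie = (CorpusController)
                PersistenceManager.loadObjectFromFile(
                    new File("plugins/Twitter/twitie.gapp"));

            // Wrap the tweet text in a GATE document and corpus.
            Document doc = Factory.newDocument("Example tweet about #GE2017");
            Corpus corpus = Factory.newCorpus("tweets");
            corpus.add(doc);

            // Run tokenisation, normalisation, POS tagging, and NER.
            twitie.setCorpus(corpus);
            twitie.execute();

            // Named entities are now available as annotations on the document.
            System.out.println(doc.getAnnotations().get("Person"));
        }
    }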

Next, we added information about the politicians, such as their names, gender, party, and constituencies. In this way, we could produce aggregate statistics, such as the number of abuse-containing tweets aimed at Labour or Conservative male/female politicians.
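
Conceptually, this enrichment step is just a lookup from a Twitter handle to a metadata record. Here is a toy sketch; the class, handle, and values are invented for illustration and are not real data from our collection:

    import java.util.Map;

    public class PoliticianMetadata {
        // Hypothetical metadata record; the handle and values below are
        // invented for illustration, not real data from our collection.
        record Politician(String name, String gender, String party,
                          String constituency) { }

        static final Map<String, Politician> BY_HANDLE = Map.of(
            "@exampleMP",
            new Politician("A. Example", "female", "Labour", "Example North"));

        public static void main(String[] args) {
            // Enriching each mention with this record is what enables
            // aggregates such as abuse aimed at Labour or Conservative women.
            System.out.println(BY_HANDLE.get("@exampleMP"));
        }
    }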

Next is a tweet geolocation component, which uses latitude/longitude, region, and user-declared location metadata to geolocate tweets within the UK NUTS2 regions. This is not always possible, since many accounts and tweets lack such information, and this narrows down the sample significantly, should we choose to restrict the analysis by geolocation.
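
The fallback logic can be sketched as follows; this is a simplified illustration, and the method names and stub region lookup are assumptions rather than our actual implementation:

    public class TweetGeolocator {
        // Simplified precedence: exact coordinates first, then tweet place
        // metadata, then the free-text location from the user's profile.
        static String geolocate(Double lat, Double lon, String place,
                                String userLocation) {
            if (lat != null && lon != null) return nuts2RegionFor(lat, lon);
            if (place != null) return nuts2RegionFor(place);
            if (userLocation != null) return nuts2RegionFor(userLocation);
            return null;  // many tweets cannot be geolocated at all
        }

        // Stub lookups standing in for a real gazetteer of NUTS2 regions.
        static String nuts2RegionFor(double lat, double lon) { return "UKE3"; }
        static String nuts2RegionFor(String name) { return "UKE3"; }

        public static void main(String[] args) {
            System.out.println(geolocate(null, null, null, "Sheffield, UK"));
        }
    }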

We also detect key themes and topics discussed in the tweets (more than one topic/theme can be contained in each tweet). Here we reused the module from the Brexit analyser.

The most exciting part was working with BuzzFeed's journalists to curate a set of abuse nouns typically aimed at people (e.g. twat), racist words, and milder insults (e.g. coward). We decided to differentiate these from general obscene language and swearing, as the latter were not always targeting the politician; nevertheless, they were included in the system, to produce a separate set of statistics. We also introduced a basic sub-classification by kind (e.g. racial) and strength (e.g. mild, medium, strong), derived from an Ofcom research report on offensive language.
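
At its core this is lexicon-based matching, where each entry carries a kind and a strength. A toy sketch follows; the entries and labels are illustrative and do not reproduce our full curated list:

    import java.util.Map;

    public class AbuseLexicon {
        // Each entry carries a kind and a strength, mirroring the
        // Ofcom-derived sub-classification; the entries are illustrative.
        record AbuseTerm(String kind, String strength) { }

        static final Map<String, AbuseTerm> LEXICON = Map.of(
            "coward", new AbuseTerm("general insult", "mild"),
            "twat",   new AbuseTerm("general insult", "strong"));

        public static void main(String[] args) {
            for (String token : "you utter coward".split("\\s+")) {
                AbuseTerm hit = LEXICON.get(token.toLowerCase());
                if (hit != null)
                    System.out.println(token + " -> " + hit);
            }
        }
    }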


Overall, we kept the processing pipeline as simple and efficient as possible, so that it can run at 100 tweets per second even on a pretty basic server.

The analysis results were fed into GATE Mimir, which efficiently indexes the tweet text together with all our linguistic annotations. Mimir has a powerful programming API for semantic search queries, which we use to drive the various interactive visualisations and to generate the necessary aggregate statistics behind them.

For instance, we used Mimir queries to generate statistics and visualisations based on time (e.g. the most popular hashtags in abuse-containing tweets on 4 June); topic (e.g. the most talked-about topics in such tweets); or target of the abusive tweet (e.g. the most frequently targeted politicians, by party and gender). We could also navigate to the corresponding tweets behind these aggregate statistics, for more in-depth analysis.
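
To give a flavour of Mimir's semantic search: queries can combine full-text terms with annotation constraints, using operators such as AND and IN. The annotation type and feature names below are illustrative stand-ins for our actual schema, so treat these as sketches rather than the exact queries we ran:

    "labour" AND {AbuseTerm strength="strong"}
    {AbuseTerm kind="racial"} IN {Tweet topic="immigration"}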

A rich sample of these statistics, associated visualisations, and abusive tweets is available in the BuzzFeed article.

Research carried out by:


Mark A. Greenwood, Ian Roberts, Dominic Rout, and myself, with ideas and other contributions from Diana Maynard and others from the GATE Team. 

Any mistakes are my own.

Tuesday 20 June 2017

GATE, Java 9, and HDPI Monitors

Over the last couple of months a few people have mentioned that running GATE Developer on HDPI monitors is a bit of a pain. The problem is that Java (up to and including the latest version of Java 8) doesn't have any support for HDPI monitors. The only solution I'd heard people suggest was to reduce the resolution of the monitor before launching GATE, but as you can imagine this is far from an ideal solution.

Having recently upgraded my laptop I also ran into the same problem, and as this screenshot highlights, by default GATE Developer isn't at all usable on a HDPI screen.


A quick hunt around the web will turn up all sorts of suggestions for getting Java 8 to work nicely with HDPI screens, but try as I might I couldn't get any of them to work for me (I'm running OpenJDK 8 under Ubuntu 16.04). Fortunately, HDPI support is going to be built into Java 9. Unfortunately, Java 9 still hasn't been officially released, so you need to rely on an early access version.

In theory it should have been easy for me to see if Java 9 was a solution, but unfortunately the version of Java 9 in the Ubuntu 16.04 repositories causes a segfault as soon as you try to run any Java GUI program, making life more difficult than it needs to be.

The solution is to install the Oracle early access build of Java 9. You can either download the JDK manually, or follow these instructions under Ubuntu to install from the very useful Web Upd8 repository. Either way once installed, launching GATE gives a usable UI.


Unfortunately this isn't quite enough to solve the problem. Under the hood Java 9 introduces a modular component system (often referred to as Project Jigsaw) which includes new rules on encapsulation. The issue is that one of the libraries GATE uses for reading and writing applications, XStream, uses a number of tricks to access internal data that are prohibited under the new rules. The result is that you can't load or save applications which makes the GUI kind of pointless. Fortunately there is a command line option you can pass to the JVM that allows you to bypass the encapsulation rules. So to get GATE to work properly with Java 9 you need to add
--permit-illegal-access
to the command line. When launching the GUI, this is easy to do by adding the flag as a new line in the gate.l4j.ini file, which you will find in the GATE home folder; an example is shown below.
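
For example, after the edit your gate.l4j.ini might look something like the following; the memory option is just an example of what may already be in the file, and only the last line is the addition:

    -Xmx1400m
    --permit-illegal-access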

There are two important things to note. Firstly, this fix is only temporary, as the command line flag will be removed in a later version of Java; secondly, depending on how you are deploying GATE, it can be difficult to alter the command line arguments (for example, when deploying as a web app). Once Java 9 is officially released, we'll look again at this problem to find a more permanent solution. Until then, this gives you a way of using GATE on an HDPI monitor, but we'd still advise using Java 8 where possible and treating this hack as a last resort (i.e. only when you need the UI on an HDPI monitor).