Thursday 8 March 2018

Discerning Truth in the Age of Ubiquitous Disinformation (2)

How Can We Combat Online Disinformation?



Kalina Bontcheva (@kbontcheva)

In my previous blog post I wrote about the 4Ps of the modern disinformation age: post-truth politics, online propaganda, polarised crowds, and partisan media.

Now, let me reflect some more on the question of what we can do about it. Please note that this is not an exhaustive list!
Promote Collaborative Fact Checking Efforts
In order to counter subjectivity, post-truth politics, disinformation, and propaganda, many media and non-partisan institutions worldwide have started fact checking initiatives – 114 in total, according to Poynter. These mostly focus on exposing disinformation in political discourse, but generally aim at encouraging people to pursue accuracy and veracity of information (e.g. Politifact, FullFact.org, Snopes). A study by the American Press Institute has shown that even politically literate consumers benefit from fact-checking as they increase their knowledge of the subject.
Professional fact checking is a time-consuming process that cannot cover a significant proportion of the claims being propagated via social media channels. To date, most projects have been limited to one or two steps of the fact checking process, or specialise in certain subject domains: Claimbuster, ContentCheck and the ongoing Fake News Challenge are a few examples.
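To make the notion of "steps of the fact checking process" concrete, below is a minimal, illustrative Python sketch of a typical pipeline: claim detection, evidence retrieval, and verdict classification. The function names, cue-word heuristics, and toy corpus are hypothetical placeholders rather than the approach of Claimbuster or any other project mentioned above; real systems use trained models at each stage.

```python
# Illustrative sketch of the typical fact-checking pipeline stages.
# All heuristics below are hypothetical placeholders; real systems
# (e.g. ClaimBuster for step 1) use trained models for each step.

from dataclasses import dataclass

@dataclass
class Verdict:
    claim: str
    label: str          # "supported", "refuted", or "unverified"
    evidence: list

def detect_claims(text: str) -> list:
    """Step 1: pick out check-worthy, factual-sounding sentences."""
    cue_words = {"will", "costs", "per cent", "million", "never", "always"}
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    return [s for s in sentences if any(w in s.lower() for w in cue_words)]

def retrieve_evidence(claim: str, corpus: list) -> list:
    """Step 2: retrieve documents that share vocabulary with the claim."""
    claim_words = set(claim.lower().split())
    return [doc for doc in corpus
            if len(claim_words & set(doc.lower().split())) >= 3]

def classify_verdict(claim: str, evidence: list) -> Verdict:
    """Step 3: decide a label from the evidence (here: a crude heuristic)."""
    if not evidence:
        return Verdict(claim, "unverified", [])
    refuting = [doc for doc in evidence
                if "not" in doc.lower() or "false" in doc.lower()]
    label = "refuted" if refuting else "supported"
    return Verdict(claim, label, evidence)

# Example usage with a toy two-document "evidence corpus".
corpus = [
    "The EU does not cost the UK 350 million pounds a week; the net figure is lower.",
    "Membership fees are paid weekly to the EU budget.",
]
for claim in detect_claims("The EU costs 350 million pounds a week. We met yesterday."):
    print(classify_verdict(claim, retrieve_evidence(claim, corpus)))
```

Even in this toy form, each step can fail independently, which is why most projects so far have tackled only one or two of them.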
There are two ways to lower the overheads and I believe both are worth pursuing: 1) create a coordinated fact-checking initiative that promotes collaboration between different media organisations, journalists, and NGOs; 2) fund the creation of automation tools for analysing disinformation, to support the human effort. I discuss the latter in more detail next.

Fund Open-Source Research on Automatic Methods for Disinformation Detection
In the PHEME research project we focused specifically on studying rumours associated with different types of events: some were breaking news events like shootings, while others were hoax stories like “Prince is going to have a concert in Toronto”, and we examined how those stories were disseminated via Twitter and Reddit. We looked at how reliably we can identify such rumours: one of the hardest tasks is grouping all the different social media posts (tweets or Reddit posts) around the same rumour. On Reddit this is a bit easier thanks to its threaded structure; Twitter is harder because there are often multiple originating tweets that refer to the same rumour.

That is the real challenge: to piece together all these stories, because the ability to identify whether something is correct depends heavily on the evidence and on the discussions around that rumour that the public carry out on social media platforms. From just one or two tweets, sometimes even journalists cannot be certain whether a rumour is true or false, but as the discussion around the rumour unfolds and evidence accumulates over time, the judgement becomes more reliable.

Consequently, it becomes easier to predict the veracity of a rumour, but the main challenge is reliably identifying all the different tweets that are talking about the same rumour. If sufficient evidence can be gathered across different posts, it becomes possible to determine the veracity of that rumour with around 85% accuracy.
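As a rough illustration of these two subtasks, the sketch below groups posts that appear to discuss the same rumour and then aggregates crude stance cues across each group to estimate veracity. The Jaccard similarity threshold and the support/deny word lists are illustrative assumptions and not the models developed in PHEME, which use trained classifiers over far richer features.

```python
# Minimal sketch of the two subtasks described above:
#  (1) group posts that talk about the same rumour,
#  (2) judge veracity from the aggregated discussion.
# The Jaccard threshold and the support/deny word lists are illustrative
# assumptions, not the PHEME project's actual models.

def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity between two posts."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def group_posts(posts: list, threshold: float = 0.3) -> list:
    """Greedy clustering: a post joins the first cluster it is similar to."""
    clusters = []
    for post in posts:
        for cluster in clusters:
            if jaccard(post, cluster[0]) >= threshold:
                cluster.append(post)
                break
        else:
            clusters.append([post])
    return clusters

SUPPORT = {"confirmed", "true", "official"}
DENY = {"fake", "false", "hoax", "debunked"}

def estimate_veracity(cluster: list) -> str:
    """Aggregate crude stance cues across the whole discussion."""
    support = sum(any(w in p.lower() for w in SUPPORT) for p in cluster)
    deny = sum(any(w in p.lower() for w in DENY) for p in cluster)
    if support > deny:
        return "likely true"
    if deny > support:
        return "likely false"
    return "unverified"

posts = [
    "Prince to play a surprise concert in Toronto tonight",
    "The Prince concert in Toronto is a hoax, venue has denied it",
    "Shooting reported downtown, police confirmed one suspect arrested",
]
for cluster in group_posts(posts):
    print(estimate_veracity(cluster), "<-", cluster)
```

Even this toy version shows why aggregating the whole discussion matters: a single post rarely carries enough signal on its own.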

In the wider context, there is emerging technology for veracity checking and verification of social media content (going beyond image/video forensics). This includes tools developed in several European projects (e.g. PHEME, REVEAL, and InVID), tools assisting crowdsourced verification (e.g. CheckDesk, Veri.ly), citizen journalism (e.g. Citizen Desk), and repositories of checked facts/rumours (e.g. Emergent, FactCheck). However, many of these tools are language-specific and would thus need adaptation and enhancement for new languages. In addition, further improvements are needed in the algorithms themselves, in order to achieve accuracy comparable to that of email spam filter technology.

It is also important to invest in establishing ethical protocols and research methodologies, since social media content raises a number of privacy, ethical, and legal challenges. 

Dangers and Pitfalls of Relying Purely on Automated Tools for Disinformation Detection
Many researchers (myself included) are working on automated methods based on machine learning algorithms, in order to automatically identify disinformation on social media platforms. Given the extremely large volume of social media posts, the key questions are: can disinformation be identified in real time, and should such methods be adopted by the social media platforms themselves?
The very short answer is: yes, in principle, but we are still far from solving many key socio-technical issues, so, when it comes to containing the spread of disinformation, we should be mindful of the problems that such technology could introduce:
• Non-trivial scalability: While some of our algorithms work in near real time on specific datasets, such as tweets about the Brexit referendum, applying them across all posts on all topics, as Twitter would need to do, is very far from trivial. Just to give a sense of the scale: prior to 23 June 2016 (referendum day) we had to process fewer than 50 Brexit-related tweets per second, which was doable. Twitter, however, would need to process more than 6,000 tweets per second, which is a serious software engineering, computational, and algorithmic challenge.
• Algorithms make mistakes: while 90 per cent accuracy intuitively sounds very promising, we must not forget the errors: 10 per cent in this case, or double that for an algorithm that is 80 per cent accurate. At 6,000 tweets per second, this 10 per cent amounts to 600 wrongly labelled tweets per second, rising to 1,200 for the lower-accuracy algorithm. To make matters worse, automatic disinformation analysis often combines more than one algorithm: first to determine which story a post refers to, and then whether that story is likely true, false, or uncertain. Unfortunately, when algorithms are executed in a sequence, errors have a cumulative effect (see the numeric sketch after this list).
• These mistakes can be very costly: broadly speaking, algorithms make two kinds of errors. False negatives are cases where disinformation is wrongly labelled as true, or bot accounts are wrongly identified as human; false positives are cases where correct information is wrongly labelled as disinformation, or genuine users are wrongly identified as bots. False negatives are a problem on social platforms because the high volume and velocity of social posts (e.g. 6,000 tweets per second on average) still leaves a lot of disinformation “in the wild”. If we draw an analogy with email spam: even though most of it is filtered out automatically, we still receive a significant proportion of spam messages. False positives, on the other hand, pose an even more significant problem, as falsely removing genuine messages is effectively censorship through artificial intelligence. Facebook, for example, has a growing problem with some users having their accounts wrongly suspended.
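The arithmetic behind the last two points can be made explicit with a short back-of-the-envelope calculation; the per-step accuracies below are the illustrative figures quoted above, not measurements of any deployed system.

```python
# Back-of-the-envelope arithmetic behind the two points above.
# The per-step accuracies are illustrative figures, not measured results.

tweets_per_second = 6_000          # approximate global Twitter volume

# Single classifier: errors at scale.
for accuracy in (0.90, 0.80):
    errors = tweets_per_second * (1 - accuracy)
    print(f"{accuracy:.0%} accurate -> {errors:,.0f} mislabelled tweets/second")

# Pipeline of two classifiers (story matching, then veracity):
# errors compound, so the end-to-end accuracy is the product.
story_matching_accuracy = 0.90
veracity_accuracy = 0.90
pipeline_accuracy = story_matching_accuracy * veracity_accuracy
print(f"two-step pipeline: {pipeline_accuracy:.0%} accurate, "
      f"{tweets_per_second * (1 - pipeline_accuracy):,.0f} errors/second")
```

Chaining two 90 per cent accurate steps leaves roughly 81 per cent end-to-end accuracy, i.e. over a thousand mislabelled tweets per second at Twitter's volume.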
Therefore, I strongly believe that the best way forward is to implement human-in-the-loop solutions, where people are assisted by machine learning and AI methods but not replaced entirely: partly because accuracy is still not high enough, but primarily because of the danger of censorship.

Establishing Cooperation and Data Exchange between Social Platforms and Scientists
Our latest work on analysing misinformation in tweets about the UK referendum [1], [2] showed yet again a very important issue: when it comes to social media and furthering our ability to understand its misuse and impact on society and democracy, the only way forward is for data scientists, political and social scientists, and journalists to work together alongside the big social media platforms and policy makers. I believe data scientists and journalists need to be given open access to the full set of public social media posts on key political events for research purposes (without compromising privacy and data protection laws), and be able to work in collaboration with the platforms through grants and shared funding (such as the Google Digital News Initiative).

There are still many outstanding questions that need to be researched, most notably the dynamics of the interaction between all these Twitter accounts over time, for which we need the complete archive of public tweets, images, and shared URL content, as well as profile data and friend/follower networks. This would help us quantify better (amongst other things) which kinds of tweets and messages resulted in misinformation-spreading accounts gaining followers and retweets, how human-like the behaviour of the successful ones was, and whether and how they were connected to the alternative media ecosystem.

The intersection of automated accounts, political propaganda, and misinformation is a key area in need of further investigation, but one for which scientists often lack the much-needed data, while the data keepers lack the necessary transparency, the motivation to investigate these issues, and the willingness to create open and unbiased algorithms.

Policy Decisions around Preserving Important Social Media Content for Future Studies
Governments and policy makers are in a position to help establish this much-needed cooperation between social platforms and scientists, promote the definition of policies for ethical, privacy-preserving research and data analytics over social media data, and also ensure the archiving and preservation of social media content of key historical value.
For instance, given the ongoing debate on the scale and influence of Russian propaganda on election and referendum outcomes, it would have been invaluable to have Twitter archives made available to researchers under strict access and code-of-practice criteria, so that these questions could be studied in more depth. Unfortunately, this is not currently possible, with Twitter having suspended all Russia-linked accounts and bots, together with all their content and social network information. Similar issues arise when trying to study online abuse of and from politicians, as posts and accounts are again suspended or deleted at a very high rate.
Related to this is the challenge of open and repeatable science on social media data, as many of the posts in the current datasets available for training and evaluating machine learning algorithms have been deleted or are no longer available. This is a problem: algorithms lack sufficient data to improve, and scientists cannot easily determine whether a new method really outperforms the state of the art.
Promoting Media Literacy and Critical Thinking for Citizens

According to the Media Literacy project: “Media literacy is the ability to access, analyze, evaluate, and create media. Media literate youth and adults are better able to understand the complex messages we receive from television, radio, Internet, newspapers, magazines, books, billboards, video games, music, and all other forms of media.”

Training citizens to recognise spin, bias, and mis- and disinformation is a key element. Due to the extensive online and social media exposure of children, there are also initiatives aimed specifically at school children, starting from as young as 11 years old. There are also online educational resources on media literacy and fake news [3], [4] that could act as a useful starting point for national media literacy initiatives.

Increasingly, media literacy and critical thinking are seen as key tools in fighting the effects of online disinformation and propaganda techniques [5], [6]. Many of the existing programmes today are delivered by NGOs in a face-to-face group setting. The next challenge is how to roll these out at scale and also online, in order to reach a wide audience across all social and age groups.

Establish/Revise and Enforce National Codes of Practice for Politicians and Media Outlets

Disinformation and biased content reporting are not just the preserve of fake news and state-driven propaganda sites and social media accounts. A significant amount also comes from partisan media and factually incorrect statements by prominent politicians.

In the case of the UK EU membership referendum, for example, a false claim regarding immigrants from Turkey was made on the front pages of a major UK newspaper [7], [8]. Another widely known and influential example was Vote Leave's false claim that the EU costs the UK £350 million a week [9]. Even though the UK Office for National Statistics disputed the accuracy of this claim on 21 April 2016 (two months prior to the referendum), it continued to be used throughout the campaign.

Therefore, an effective way to combat deliberate online falsehoods must address such cases as well. Governments and policy makers could help here by establishing new, or updating existing, codes of practice for political parties and press standards, as well as by ensuring that they are adhered to.

These need to be supplemented with transparency in political advertising on social platforms, in order to eliminate or significantly reduce the promotion of misinformation through advertising. Such measures would also help reduce the impact of all the other kinds of disinformation discussed above.

Disclaimer: All views are my own.