On GATE, Text and Social Media Analysis, and Detecting Misinformation Online: February 2019

Wednesday, 20 February 2019

GATE team wins first prize in the Hyperpartisan News Detection Challenge

SemEval 2019 recently launched the Hyperpartisan News Detection Task in order to evaluate how well tools could automatically classify hyperpartisan news texts. The idea behind this is that "given a news text, the system must decide whether it follows a hyperpartisan argumentation, i.e. whether it exhibits blind, prejudiced, or unreasoning allegiance to one party, faction, cause, or person."

Below we see an example of (part of) two news stories about Donald Trump from the challenge data. The one on the left is considered to be hyperpartisan, as it shows a biased kind of viewpoint. The one on the right simply reports a story and is not considered hyperpartisan. The distinction is difficult even for humans, because there are no exact rules about what makes a story hyperpartisan.

In total, 322 teams registered to take part, of which 42 actually submitted an entry, including the GATE team consisting of Ye Jiang, Xingyi Song and Johann Petrak, with guidance from Kalina Bontcheva and Diana Maynard.

The main performance measure for the task is accuracy on a balanced set of articles, though precision, recall, and F1-score were also measured for the hyperpartisan class. In the final submission, the GATE team's hyperpartisan classification algorithm achieved an accuracy of 0.822 on the manually annotated evaluation set and ranked first on the final leaderboard.
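As a rough illustration of the measures named above, the following sketch computes accuracy over all articles plus precision, recall and F1 for the hyperpartisan class. The function name, the label strings and the toy data are assumptions made for this example; this is not the official task scorer.

```python
def binary_metrics(gold, predicted, positive="hyperpartisan"):
    """Accuracy over all items, plus precision/recall/F1 for the positive class.

    The label strings are hypothetical; any hashable labels would work.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    correct = sum(1 for g, p in pairs if g == p)

    accuracy = correct / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy example with made-up labels (not challenge data).
gold = ["hyperpartisan", "hyperpartisan", "mainstream", "mainstream"]
pred = ["hyperpartisan", "mainstream", "mainstream", "hyperpartisan"]
scores = binary_metrics(gold, pred)
print(scores)
```

On a balanced evaluation set, accuracy is a fair headline number because always predicting one class would score only 0.5.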

Our winning system was based on using sentence representations from averaged word embeddings generated from the pre-trained ELMo model with a Convolutional Neural Network and Batch Normalization for training on the provided dataset. An averaged ensemble of models was then used to generate the final predictions.
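The averaging step of such an ensemble can be sketched as follows. This is a minimal illustration only: it assumes each model outputs one hyperpartisan probability per article and that a 0.5 threshold is applied after averaging, neither of which is stated in this post. The function name and the numbers are invented for the example.

```python
def ensemble_average(model_probabilities, threshold=0.5):
    """Average per-article probabilities across models, then threshold.

    model_probabilities: one list per model, each holding one probability
    per article (assumed layout). The 0.5 threshold is an assumption.
    """
    if not model_probabilities:
        raise ValueError("at least one model is required")
    n_models = len(model_probabilities)
    n_articles = len(model_probabilities[0])
    if any(len(m) != n_articles for m in model_probabilities):
        raise ValueError("all models must score the same articles")

    averaged = [
        sum(model[i] for model in model_probabilities) / n_models
        for i in range(n_articles)
    ]
    labels = [p >= threshold for p in averaged]
    return averaged, labels


# Three hypothetical models scoring two articles.
probs = [[0.9, 0.2], [0.7, 0.4], [0.8, 0.3]]
averaged, labels = ensemble_average(probs)
print(averaged, labels)
```

Averaging probabilities rather than taking a majority vote over hard labels lets a confident model outweigh two uncertain ones.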

The source code and full system description are available on GitHub.

One of the major challenges of this task is that the model must be able to adapt to a wide range of article sizes. Most state-of-the-art neural network approaches for document classification use a token sequence as network input, but such an approach in this case would mean either a massive computational cost or loss of information, depending on how the maximum sequence length is set. We got around this problem by first pre-calculating sentence-level embeddings as the average of the word embeddings for each sentence, and then representing the document as a sequence of these sentence embeddings. We also found that ignoring some of the provided training data (which was automatically generated based on the document publishing source) actually improved our results, which leads to important conclusions about the trustworthiness of training data and its implications.
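The document representation described above can be sketched in a few lines. Note the simplifications: the real system used contextual ELMo embeddings, whereas this example uses a tiny static lookup table with made-up two-dimensional vectors, whitespace tokenisation, and hypothetical function names, purely to show the shape of the computation.

```python
def average_vectors(vectors, dim):
    """Element-wise mean of equal-length vectors (zero vector if empty)."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def document_to_sentence_embeddings(sentences, word_vectors, dim):
    """Represent a document as one averaged embedding per sentence.

    word_vectors is a toy static lookup (an assumption for illustration);
    tokens missing from it are skipped.
    """
    document = []
    for sentence in sentences:
        tokens = sentence.lower().split()
        vectors = [word_vectors[t] for t in tokens if t in word_vectors]
        document.append(average_vectors(vectors, dim))
    return document


# Made-up 2-dimensional embeddings, not real model output.
toy_vectors = {"good": [1.0, 0.0], "news": [0.0, 1.0], "bad": [-1.0, 0.0]}
doc = document_to_sentence_embeddings(
    ["Good news", "Bad news today"], toy_vectors, dim=2
)
print(doc)
```

The sequence length seen by the classifier is then the number of sentences rather than the number of tokens, which is what keeps long articles tractable.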

Overall, the ability to do well on the hyperpartisan news prediction task is important both because it improves knowledge about neural networks for language processing generally and because a better understanding of the nature of biased news is critical for society and democracy.

Monday, 18 February 2019

Russian Troll Factory: Sketches of a Propaganda Campaign

When Twitter shared a large archive of propaganda tweets late in 2018, we were excited to get access to over 9 million tweets from almost 4 thousand unique Twitter accounts controlled by Russia's Internet Research Agency. The tweets are posted in 57 different languages, but most are in Russian (53.68%) and English (36.08%). The average account age is around four years, and the oldest accounts are as much as ten years old.
A large amount of activity in both the English and Russian accounts is given to news provision. Many accounts also seem to engage in hashtag games, which may be a way to establish an account and get some followers. Of particular interest, however, are the political trolls. Left trolls pose as individuals interested in the Black Lives Matter campaign. Right trolls are patriotic, anti-immigration Trump supporters. Among left and right trolls, several have achieved large follower numbers and even a degree of fame. Finally, there are fearmonger trolls, which spread scares, and a small number of commercial trolls. The Russian language accounts divide along similar lines, perhaps posing as individuals with opinions about Ukraine or western politics. These categories were proposed by Darren Linvill and Patrick Warren, from Clemson University. In the word clouds below you can see the hashtags we found left and right trolls using.

Left Troll Hashtags

Right Troll Hashtags

Mehmet E. Bakir has created some interactive graphs enabling us to explore the data. In the network diagram at the start of the post you can see the network of mention/retweet/reply/quote counts we created from the highly followed accounts in the set. You can click through to an interactive version, where you can zoom in and explore different troll types.
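For readers curious how interaction counts of this kind can be turned into a network, here is a minimal sketch. It assumes each tweet has already been reduced to an author plus the accounts it mentions, retweets, replies to or quotes; the field names `author` and `targets` and the sample records are hypothetical and do not reflect the archive's actual schema or the code behind the published graphs.

```python
from collections import Counter


def interaction_edges(tweets):
    """Count directed author -> target interactions.

    Each tweet is a dict with hypothetical keys 'author' and 'targets'.
    Self-interactions are ignored.
    """
    edges = Counter()
    for tweet in tweets:
        source = tweet["author"]
        for target in tweet.get("targets", []):
            if target != source:
                edges[(source, target)] += 1
    return edges


# Invented sample records for illustration only.
sample = [
    {"author": "a", "targets": ["b", "c"]},
    {"author": "a", "targets": ["b"]},
    {"author": "b", "targets": ["b"]},
]
edges = interaction_edges(sample)
print(edges.most_common())
```

The resulting weighted edge list can be loaded into any graph tool for layout and exploration.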
In the graph below, you can see activity in different languages over time (interactive version here, or interact with the embedded version below; you may have to scroll right). It shows that the Russian language operation came first, with English language operations following after. The timing of this part of the activity coincides with Russia's interest in Ukraine.

In the graph below, also available here, you can see how different types of behavioural strategy pay off in terms of achieving higher numbers of retweets. Using Linvill and Warren's manually annotated data, Mehmet built a classifier that enabled us to classify all the accounts in the dataset. It is evident that the political trolls have by far the greatest impact in terms of retweets achieved, with left trolls being the most successful. Russia's interest in the Black Lives Matter campaign perhaps suggests that the first challenge for agents is to win a following, and that exploiting divisions in society is an effective way to do that. How that following is then used to influence minds is a separate question. You can see a pre-print of our paper describing our work so far, in the context of the broader picture of partisanship, propaganda and post-truth politics, here.

Friday, 8 February 2019

Teaching computers to understand the sentiment of tweets

As part of the EU SoBigData project, the GATE team hosts a number of short research visits, between 2 weeks and 2 months, for all kinds of data scientists (PhD students, researchers, academics, professionals) to come and work with us and to use our tools and/or datasets on a project involving text mining and social media analysis. Kristoffer Stensbo-Smidt visited us in the summer of 2018 from the University of Copenhagen, to work on developing machine learning tools for sentiment analysis of tweets, and was supervised by GATE team member Diana Maynard and by former team member Isabelle Augenstein, who is now at the University of Copenhagen. Kristoffer has a background in Machine Learning but had not worked in NLP before, so this visit helped him understand how to apply his skills to this kind of domain.

After his visit, Kristoffer wrote up an excellent summary of his research. He essentially tested a number of different approaches to processing text, and analysed how much of the sentiment they were able to identify. Given a tweet and an associated topic, the aim is to ascertain automatically whether the sentiment expressed about this topic is positive, negative or neutral. Kristoffer experimented with different word embedding-based models in order to test how much information different word embeddings carry about the sentiment of a tweet. This involved choosing which embedding models to test, and how to transform the topic vectors. The main conclusions he drew from the work were that, in general, word embeddings contain a lot of useful information about sentiment, with newer embeddings containing significantly more. This is not particularly surprising, but shows the importance of advanced models for this task.

3rd International Workshop on Rumours and Deception in Social Media (RDSM)

June 11, 2019 in Munich, Germany
Collocated with ICWSM'2019

Abstract

The 3rd edition of the RDSM workshop will particularly focus on online information disorder and its interplay with public opinion formation.

Social media is a valuable resource for mining all kinds of information, ranging from opinions to factual information. However, social media also harbours issues that are serious threats to society. Chief among these are online information disorder and its power to shape public opinion. Known aspects include the spread of false rumours and fake news, as well as social attacks such as hate speech and other forms of harmful social posts. In this workshop the aim is to bring together researchers and practitioners interested in social media mining and analysis to deal with the emerging issues of information disorder and manipulation of public opinion. The focus of the workshop will be on themes such as the detection of fake news, verification of rumours and the understanding of their impact on public opinion. Furthermore, we aim to put a great emphasis on the usefulness and trust aspects of automated solutions tackling the aforementioned themes.

Workshop Theme and Topics

The aim of this workshop is to bring together researchers and practitioners interested in social media mining and analysis to deal with the emerging issues of veracity assessment, fake news detection and manipulation of public opinion. We invite researchers and practitioners to submit papers reporting results on these issues. Qualitative user studies on the challenges encountered with the use of social media, such as the veracity of information and fake news detection, as well as papers reporting new data sets, are also welcome. Finally, we also welcome studies reporting on the usefulness and trustworthiness of social media tools tackling the aforementioned problems.

Topics of interest include, but are not limited to:

Detection and tracking of rumours.
Rumour veracity classification.
Fact-checking social media.
Detection and analysis of disinformation, hoaxes and fake news.
Stance detection in social media.
Qualitative user studies assessing the use of social media.
Bot detection in social media.
Measuring public opinion through social media.
Assessing the impact of social media on public opinion.
Political analyses of social media.
Real-time social media mining.
NLP for social media analysis.
Network analysis and diffusion of dis/misinformation.
Usefulness and trust analysis of social media tools.
AI-generated fake content (image / text).

Workshop Program Format

We will have 1-2 experts in the field delivering keynote speeches. We will then have a set of 8-10 presentations of peer-reviewed submissions, organised into 3 sessions by subject (the first two sessions about online information disorder and public opinion and the third session about the usefulness and trust aspects). After the sessions we also plan to have a group work activity (groups of 4-5 attendees) where each group will sketch a social media tool for tackling e.g. rumour verification, fake news detection, etc. The emphasis of the sketch should be on aspects like usefulness and trust. This should take no longer than 120 minutes (sketching, presentation/discussion time). We will close the workshop with a summary and take-home messages (max. 15 minutes). Attendance will be open to all interested participants.

We welcome both full papers (5-8 pages) to be presented as oral talks and short papers (2-4 pages) to be presented as posters and demos.

Workshop Schedule/Important Dates

Submission deadline: April 1st 2019
Notification of Acceptance: April 15th 2019
Camera-Ready Versions Due: April 26th 2019
Workshop date: June 11, 2019

Submission Procedure

We invite two kinds of submissions:

- Long papers/Brief Research Report (max 8 pages + 2 references)
- Demos and posters (short papers) (max 4 pages + 2 references)

Proceedings of the workshop will be published jointly with other ICWSM workshops in a special
issue of Frontiers in Big Data.

Papers must be submitted electronically in PDF format or any format that is supported by the
submission site through https://www.frontiersin.org/research-topics/9706 (click on "Submit your manuscript").
Note, submitting authors should choose one of the specific track organizers as their preferred Editor.

You can find detailed information on the file submission requirements here:
https://www.frontiersin.org/about/author-guidelines#FileRequirements

Submissions will be peer-reviewed by at least three members of the programme
committee. The accepted papers will appear in the proceedings published at
https://www.frontiersin.org/research-topics/9706

Workshop Organizers

Ahmet Aker, University of Duisburg-Essen, Germany; University of Sheffield, UK
a.aker@is.inf.uni-due.de
Arkaitz Zubiaga, Queen Mary University of London, UK
arkaitz@zubiaga.org
Kalina Bontcheva, University of Sheffield, UK
k.bontcheva@sheffield.ac.uk
Maria Liakata, University of Warwick and Alan Turing Institute, UK
m.liakata@warwick.ac.uk
Rob Procter, University of Warwick and Alan Turing Institute, UK
rob.procter@warwick.ac.uk
Symeon Papadopoulos, Centre for Research and Technology Hellas, Greece
papadop@iti.gr

Programme Committee (Tentative)

Nikolas Aletras, University of Sheffield, UK
Emilio Ferrara, University of Southern California, USA
Bahareh Heravi, University College Dublin, Ireland
Petya Osenova, Ontotext, Bulgaria
Damiano Spina, RMIT University, Australia
Peter Tolmie, Universität Siegen, Germany
Marcos Zampieri, University of Wolverhampton, UK
Milad Mirbabaie, University of Duisburg-Essen, Germany
Tobias Hecking, University of Duisburg-Essen, Germany
Kareem Darwish, QCRI, Qatar
Hassan Sajjad, QCRI, Qatar
Sumithra Velupillai, King's College London, UK

Invited Speaker(s)

To be announced