On GATE, Text and Social Media Analysis, and Detecting Misinformation Online: hyperpartisan

Tuesday, 5 March 2019

Brexit--The Regional Divide

Although the UK voted by a narrow margin in the 2016 EU membership referendum to leave the EU, that outcome failed to capture the diverse feelings held in various regions. It's a curious observation that the UK regions with the most economic dependence on the EU were the ones more likely to vote to leave it. The image below on the right is taken from this article from the Centre for European Reform, and makes the point in a few different ways. This and similar research inspired a current project the GATE team are undertaking with colleagues in the Geography and Journalism departments at Sheffield University, under the leadership of Miguel Kanai and with funding from the British Academy, aiming to understand whether a lack of awareness of local circumstances played a role in the referendum outcome.

Our Brexit tweet corpus contains tweets collected during the run-up to the Brexit referendum, and we've annotated almost half a million accounts for Brexit vote intent with high accuracy. You can read about that here. So we thought we'd be well positioned to bring some insights. We also annotated user accounts with location: many Twitter users volunteer that information, though there can be a lot of variation in how people describe their location, so that was harder to do accurately. We also used local and national news media corpora from the time of the referendum, in order to contrast national coverage with local issues around the country.

Topics representation in different media

"People's resistance to propaganda and media‐promoted ideas derives from their close ties in real communities"
Jean Seaton

Using topic modelling and named entity recognition, we were able to look for similarities and differences in the focus of local and national media and Twitter users. The bar chart on the left gets us started, illustrating that foci differ between media. Twitter users give more air time than news media to trade and immigration, whereas local press takes the lead on employment, local politics and agriculture. National press gives more space to terrorism than either Twitter or local news.

NER diff between national and local press

On the right is just one of many graphs in which we unpack this on a region-by-region basis (you can find more on the project website). In this choropleth, red indicates that the topic was significantly more discussed in national press than in local press in that area, and green indicates that the topic was significantly more discussed in local press there than in national press. Terrorism and immigration have perhaps been subject to a certain degree of media and propaganda inflation--we talk about this in our Social Informatics paper. Where media focus on locally relevant issues, foci are more grounded, for example in practical topics such as agriculture and employment. We found that across the regions, Twitter remainers showed a closer congruence with local press than Twitter leavers.

The graph on the right shows the number of times a newspaper was linked on Twitter, contrasted against the percentage of people that said they read that newspaper in the British Election Study. It shows that the dynamics of popularity on Twitter are very different to traditional readership. This highlights a need to understand how the online environment is affecting the news reportage we are exposed to, creating a market for a different kind of material, and a potentially more hostile climate for quality journalism, as discussed by project advisor Prof. Jackie Harrison here. Furthermore, local press are increasingly struggling to survive, so it feels important to highlight their value through this work.
You can see more choropleths on the project website. There's also an extended version here of an article currently under review.

Wednesday, 20 February 2019

GATE team wins first prize in the Hyperpartisan News Detection Challenge

SemEval 2019 recently launched the Hyperpartisan News Detection Task in order to evaluate how well tools could automatically classify hyperpartisan news texts. The idea behind this is that "given a news text, the system must decide whether it follows a hyperpartisan argumentation, i.e. whether it exhibits blind, prejudiced, or unreasoning allegiance to one party, faction, cause, or person."

Below we see extracts from two news stories about Donald Trump from the challenge data. The one on the left is considered to be hyperpartisan, as it presents a biased viewpoint. The one on the right simply reports a story and is not considered hyperpartisan. The distinction is difficult even for humans, because there are no exact rules about what makes a story hyperpartisan.

In total, 322 teams registered to take part, of which 42 actually submitted an entry, including the GATE team consisting of Ye Jiang, Xingyi Song and Johann Petrak, with guidance from Kalina Bontcheva and Diana Maynard.

The main performance measure for the task is accuracy on a balanced set of articles, though precision, recall, and F1-score were also measured for the hyperpartisan class. In the final submission, the GATE team's hyperpartisan classifier achieved 0.822 accuracy on the manually annotated evaluation set, and ranked first on the final leaderboard.
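As a rough illustration of the measures mentioned above (not the task's official scorer), the sketch below computes accuracy over all articles plus precision, recall and F1 for one class. The function name `binary_metrics` and the convention that label 1 means "hyperpartisan" are assumptions made for this example.

```python
def binary_metrics(gold, predicted, positive=1):
    """Accuracy over all items, plus precision/recall/F1 for the positive class.

    `positive=1` standing for "hyperpartisan" is an assumption of this sketch.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    correct = sum(1 for g, p in pairs if g == p)

    accuracy = correct / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Toy labels, invented purely for illustration.
    print(binary_metrics([1, 1, 0, 0], [1, 0, 0, 1]))
```

On a balanced evaluation set, accuracy is not dominated by either class, which is presumably why it works as the headline figure here.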

Our winning system was based on using sentence representations from averaged word embeddings generated from the pre-trained ELMo model with a Convolutional Neural Network and Batch Normalization for training on the provided dataset. An averaged ensemble of models was then used to generate the final predictions.
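To give a flavour of what an averaged ensemble can look like, here is a minimal sketch. It assumes each model outputs one probability per article that the article is hyperpartisan, that the probabilities are averaged with equal weight, and that 0.5 is the decision threshold. None of these details are confirmed by the description above, and `ensemble_predict` is a name made up for this example.

```python
def ensemble_predict(model_probs, threshold=0.5):
    """Average per-article probabilities across models, then threshold.

    model_probs: list of lists, one inner list of probabilities per model.
    Equal weighting and threshold=0.5 are assumptions of this sketch.
    """
    if not model_probs:
        raise ValueError("need predictions from at least one model")
    n_docs = len(model_probs[0])
    if any(len(probs) != n_docs for probs in model_probs):
        raise ValueError("every model must score the same number of articles")

    # zip(*...) groups the scores that different models gave the same article.
    averaged = [sum(scores) / len(model_probs) for scores in zip(*model_probs)]
    labels = [1 if p >= threshold else 0 for p in averaged]
    return averaged, labels


if __name__ == "__main__":
    # Three hypothetical models scoring two articles.
    probs = [[0.9, 0.2], [0.7, 0.4], [0.8, 0.3]]
    print(ensemble_predict(probs))
```

Averaging tends to smooth out the quirks of individual training runs, which is a common reason for using an ensemble instead of a single model.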

The source code and full system description are available on GitHub.

One of the major challenges of this task is that the model must be able to adapt to a large range of article sizes. Most state-of-the-art neural network approaches for document classification use a token sequence as network input, but such an approach in this case would mean either a massive computational cost or loss of information, depending on how the maximum sequence length is set. We got around this problem by first pre-calculating sentence-level embeddings as the average of word embeddings for each sentence, and then representing the document as a sequence of these sentence embeddings. We also found that ignoring some of the provided training data (which was automatically labelled based on the document's publishing source) actually improved our results, which leads to important conclusions about the trustworthiness of training data and its implications.
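The sentence-averaging idea can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the released system: `embed_tokens` is a hypothetical stand-in for a contextual embedder such as ELMo (here just a toy lookup table), and the padding and truncation to `max_sentences` is an assumption about how a fixed-size network input might be produced.

```python
def average_vectors(vectors, dim):
    """Element-wise mean of a list of equal-length vectors (zeros if empty)."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def document_to_sentence_sequence(sentences, embed_tokens, dim, max_sentences):
    """Represent a document as a fixed-length sequence of sentence vectors.

    sentences: list of token lists, one per sentence.
    embed_tokens: hypothetical callable mapping a token list to one vector
        per token (a real system would call a pre-trained model here).
    """
    sequence = []
    for tokens in sentences[:max_sentences]:      # truncate long documents
        sequence.append(average_vectors(embed_tokens(tokens), dim))
    while len(sequence) < max_sentences:          # pad short documents
        sequence.append([0.0] * dim)
    return sequence


if __name__ == "__main__":
    # Toy 2-dimensional "embeddings", invented purely for illustration.
    toy = {"trump": [1.0, 0.0], "said": [0.0, 1.0]}

    def embed_tokens(tokens):
        return [toy.get(t, [0.0, 0.0]) for t in tokens]

    doc = [["trump", "said"], ["trump"]]
    print(document_to_sentence_sequence(doc, embed_tokens, dim=2, max_sentences=3))
```

The network input then grows with the number of sentences and not the number of tokens, so long articles can be handled without truncating most of their text.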

Overall, the ability to do well on the hyperpartisan news prediction task is important both for improving knowledge about neural networks for language processing generally, and because a better understanding of the nature of biased news is critical for society and democracy.

Monday, 18 February 2019

Russian Troll Factory: Sketches of a Propaganda Campaign

When Twitter shared a large archive of propaganda tweets late in 2018, we were excited to get access to over 9 million tweets from almost 4 thousand unique Twitter accounts controlled by Russia's Internet Research Agency. The tweets are posted in 57 different languages, but most are in Russian (53.68%) and English (36.08%). Average account age is around four years, and the oldest accounts are as much as ten years old.
A large amount of activity in both the English and Russian accounts is given to news provision. Secondly, many accounts seem to engage in hashtag games, which may be a way to establish an account and get some followers. Of particular interest, however, are the political trolls. Left trolls pose as individuals interested in the Black Lives Matter campaign. Right trolls are patriotic, anti-immigration Trump supporters. Among left and right trolls, several have achieved large follower numbers and even a degree of fame. Finally there are fearmonger trolls, which propagate scares, and a small number of commercial trolls. The Russian language accounts also divide along similar lines, perhaps posing as individuals with opinions about Ukraine or western politics. These categories were proposed by Darren Linvill and Patrick Warren, from Clemson University. In the word clouds below you can see the hashtags we found left and right trolls using.

Left Troll Hashtags

Right Troll Hashtags

Mehmet E. Bakir has created some interactive graphs enabling us to explore the data. In the network diagram at the start of the post you can see the network of mention/retweet/reply/quote counts we created from the highly followed accounts in the set. You can click through to an interactive version, where you can zoom in and explore different troll types.
In the graph below, you can see activity in different languages over time (interactive version here, or interact with the embedded version below; you may have to scroll right). It shows that the Russian language operation came first, with English language operations following after. The timing of this part of the activity coincides with Russia's interest in Ukraine.

In the graph below, also available here, you can see how different types of behavioural strategy pay off in terms of achieving higher numbers of retweets. Using Linvill and Warren's manually annotated data, Mehmet built a classifier that enabled us to classify all the accounts in the dataset. It is evident that the political trolls have by far the greatest impact in terms of retweets achieved, with left trolls being the most successful. Russia's interest in the Black Lives Matter campaign perhaps suggests that the first challenge for agents is to win a following, and that exploiting divisions in society is an effective way to do that. How that following is then used to influence minds is a separate question. You can see a pre-print of our paper describing our work so far, in the context of the broader picture of partisanship, propaganda and post-truth politics, here.