On GATE, Text and Social Media Analysis, and Detecting Misinformation Online: text mining

Monday, 12 August 2019

In the News: Online Abuse of Politicians, BBC

We've been working together with the BBC to bring public attention to the issue of online abuse against politicians. Rising tensions in Q1 and Q2 of 2019 meant that politicians were seeing more verbal abuse on Twitter than we have previously observed. The findings were presented on the 6 o'clock and 10 o'clock news on Tuesday, August 6th, and you can see in the histogram above that we found the level of incivility rising to almost 4%. You can see the BBC article describing the work here.

The BBC also ran a survey. Of the 172 MPs who responded, 139 said that either they or their staff had faced abuse in the past year. More than 60% (108) of those who replied said they had been in contact with the police about threats in the last 12 months.

We found that levels of abuse on Twitter fluctuate over time, with spikes driven by events such as the death of IS bride Shamima Begum's baby or key events in the Brexit negotiations. Labour MP David Lammy has received the most abuse of any MP on Twitter so far this year.

As previously, we also found that on average, male MPs attract significantly more general incivility than female ones, though women attract more sexist abuse. Conservative MPs on average, as previously, attracted significantly more abuse than Labour ones, perhaps because they are in power. Sexist abuse is the most prevalent, as compared with homophobia or racism.

Thursday, 6 June 2019

Toxic Online Discussions during the UK European Parliament Election Campaign

The Brexit Party attracted the most engagement on Twitter in the run-up to the UK European Parliament election on May 23rd, their candidates receiving as many tweets as all the other parties combined. Brexit Party leader Nigel Farage was the most interacted-with UK candidate on Twitter, with over twice as many replies as the next most replied-to candidate, Andrew Adonis of the Labour Party.

We studied all tweets sent to or from (or retweets of or by) UK European Election candidates in the month of May, and classified them as abusive or not using the classifier presented here. It must be noted, in particular, that the classifier only identifies reliably whether a reply is abusive or not. It is not sufficiently accurate for us to reliably judge the target politician or party of this abusive reply. What this means is that we can only reliably identify which EP candidates triggered abuse-containing discussion threads on Twitter, but that often this abuse is actually aimed at other politicians or parties.

In addition to attracting the most replies, the Brexit Party candidates also triggered an unusually high level of abuse-containing Twitter discussions. In particular, we found that posts by Farage triggered almost six times as many abuse-containing Twitter threads as the next most replied-to candidate, Gavin Esler of Change UK, during May 2019.

There is an important difference, however, in that many of the abuse-containing replies to posts by Farage and the Brexit Party were actually abusive towards other politicians (most notably the prime minister and the leader of the Labour Party) and not Farage himself. In contrast, abusive replies to Gavin Esler were primarily aimed at the politician himself, triggered by his use of the phrase "village idiot" in connection with the Leave campaign.

Candidates from other parties that triggered unusually high levels of abuse-containing discussions were those from the UK Independence Party, now considered far right, and Change UK, a newly formed but unstable remain party. Change UK was the most active on Twitter, with candidates sending more tweets than other parties. Gavin Esler was the most replied-to Change UK candidate, and also received an unusually high level of abuse. The abuse often referred to his use of the phrase "village idiot" in connection with the leave campaign, which resulted in anger and resentment.

In contrast, MEP candidates from the Conservative and Labour Parties were not hubs of polarised, abuse-containing discussions on Twitter.

What these findings, unsurprisingly, demonstrate is that politicians and parties who themselves use divisive and abusive language, for example, to brand political opponents as “village idiots”, “traitors”, or as “desperate to betray”, are thus triggering the toxic online responses and deep political antagonism that we have witnessed.

After the Brexit Party, the next most replied-to MEP candidates were from the Labour Party, followed by Change UK.

MEP candidates from both the Liberal Democrats and the Green Party were also active on Twitter, with the Green MEP candidates second only to Change UK ones in the number of tweets sent, but they received little engagement in return. The Liberal Democrats in particular received a low number of replies. This may suggest that these parties became the default choices for a population of discouraged remainers, as both made gains in the election. Both parties attracted a particularly civil tone of reply.

Brexit Party candidates were also the ones that replied most to those who tweeted them, rather than authoring original tweets or retweeting other tweets.

Acknowledgements: Research carried out by Genevieve Gorrell, Mehmet Bakir, and Kalina Bontcheva. This work was partially supported by the European Union under grant agreements No. 654024 SoBigData and No. 825297 WeVerify.

Monday, 11 March 2019

Coming Up: 12th GATE Summer School 17-21 June 2019

It is approaching that time of the year again! The GATE training course will be held from 17-21 June 2019 at the University of Sheffield, UK.

No previous experience or programming expertise is necessary, so it's suitable for anyone with an interest in text mining and using GATE, including people from humanities backgrounds, social sciences, etc.

This event will follow a similar format to that of the 2018 course, with one track Monday to Thursday, and two parallel tracks on Friday, all delivered by the GATE development team. You can read more about it and register here. Early bird registration is available at a discounted rate until 1 May.

The focus will be on mining text and social media content with GATE. Many of the hands-on exercises will be focused on analysing news articles, tweets, and other textual content.

The planned schedule is as follows (NOTE: may still be subject to timetabling changes).
Single track from Monday to Thursday (9am - 5pm):

Monday: Module 1: Basic Information Extraction with GATE

Intro to GATE + Information Extraction (IE)
Corpus Annotation and Evaluation
Writing Information Extraction Patterns with JAPE

Tuesday: Module 2: Using GATE for social media analysis

Challenges for analysing social media, GATE for social media
Twitter intro + JSON structure
Language identification, tokenisation for Twitter
POS tagging and Information Extraction for Twitter

Wednesday: Module 3: Crowdsourcing, GATE Cloud/MIMIR, and Machine Learning

Crowdsourcing annotated social media content with the GATE crowdsourcing plugin
GATE Cloud, deploying your own IE pipeline at scale (how to process 5 million tweets in 30 mins)
GATE Mimir - how to index and search semantically annotated social media streams
Challenges of opinion mining in social media
Training Machine Learning Models for IE in GATE

Thursday: Module 4: Advanced IE and Opinion Mining in GATE

Advanced Information Extraction
Useful GATE components (plugins)
Opinion mining components and applications in GATE

On Friday, there is a choice of modules (9am - 5pm):

Module 5: GATE for developers
- Basic GATE Embedded
- Writing your own plugin
- GATE in production - multi-threading, web applications, etc.
Module 6: GATE Applications
- Building your own applications
- Examples of some current GATE applications: social media summarisation, visualisation, Linked Open Data for IE, and more

These two modules are run in parallel, so you can only attend one of them. You will need to have some programming experience and knowledge of Java to follow Module 5 on the Friday. No particular expertise is needed for Module 6.
Hope to see you in Sheffield in June!

Wednesday, 20 February 2019

GATE team wins first prize in the Hyperpartisan News Detection Challenge

SemEval 2019 recently launched the Hyperpartisan News Detection Task in order to evaluate how well tools could automatically classify hyperpartisan news texts. The idea behind this is that "given a news text, the system must decide whether it follows a hyperpartisan argumentation, i.e. whether it exhibits blind, prejudiced, or unreasoning allegiance to one party, faction, cause, or person."

Below we see an example of (part of) two news stories about Donald Trump from the challenge data. The one on the left is considered to be hyperpartisan, as it shows a biased kind of viewpoint. The one on the right simply reports a story and is not considered hyperpartisan. The distinction is difficult even for humans, because there are no exact rules about what makes a story hyperpartisan.

In total, 322 teams registered to take part, of which 42 actually submitted an entry, including the GATE team consisting of Ye Jiang, Xingyi Song and Johann Petrak, with guidance from Kalina Bontcheva and Diana Maynard.

The main performance measure for the task is accuracy on a balanced set of articles, though precision, recall, and F1-score were also measured for the hyperpartisan class. In the final submission, the GATE team's hyperpartisan classifier achieved an accuracy of 0.822 on the manually annotated evaluation set, and ranked first on the final leaderboard.
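As a reminder of how these measures relate to each other, here is a minimal Python sketch that computes them for a binary task. The label names and the toy data are invented for illustration only; they are not taken from the challenge data or from the official evaluation script.

```python
def binary_metrics(gold, predicted, positive="hyperpartisan"):
    """Return accuracy, plus precision, recall and F1 for the positive class."""
    pairs = list(zip(gold, predicted))
    true_pos = sum(1 for g, p in pairs if g == positive and p == positive)
    false_pos = sum(1 for g, p in pairs if g != positive and p == positive)
    false_neg = sum(1 for g, p in pairs if g == positive and p != positive)
    correct = sum(1 for g, p in pairs if g == p)

    accuracy = correct / len(pairs)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy, made-up labels on a balanced set of four articles.
gold = ["hyperpartisan", "hyperpartisan", "mainstream", "mainstream"]
predicted = ["hyperpartisan", "mainstream", "mainstream", "mainstream"]
scores = binary_metrics(gold, predicted)
```

On a balanced set like this one, accuracy treats both classes equally, while precision and recall describe only how the hyperpartisan class was handled.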

Our winning system was based on using sentence representations from averaged word embeddings generated from the pre-trained ELMo model with a Convolutional Neural Network and Batch Normalization for training on the provided dataset. An averaged ensemble of models was then used to generate the final predictions.

The source code and full system description are available on GitHub.

One of the major challenges of this task is that the model must be able to adapt to a large range of article sizes. Most state-of-the-art neural network approaches for document classification use a token sequence as network input, but such an approach in this case would mean either a massive computational cost or a loss of information, depending on how the maximum sequence length is set. We got around this problem by first pre-calculating sentence-level embeddings as the average of the word embeddings for each sentence, and then representing the document as a sequence of these sentence embeddings. We also found that ignoring some of the provided training data (which was automatically labelled based on the document's publishing source) actually improved our results, which leads to important conclusions about the trustworthiness of training data and its implications.
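The sentence-averaging idea can be illustrated with a small Python sketch. This is not the team's code: the static lookup table `word_vectors` is a hypothetical stand-in for the contextual ELMo embeddings, and the truncation and zero-padding to `max_sentences` are assumptions made here so that every document has the same shape.

```python
def average_vectors(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(values) / count for values in zip(*vectors)]


def document_to_sentence_sequence(document, word_vectors, max_sentences):
    """Represent a document as a fixed-length sequence of sentence embeddings.

    document: list of sentences, each a list of tokens.
    word_vectors: hypothetical token -> vector table (stands in for ELMo).
    max_sentences: assumed cap; longer documents are truncated,
                   shorter ones are padded with zero vectors.
    """
    dim = len(next(iter(word_vectors.values())))
    sequence = []
    for sentence in document[:max_sentences]:
        known = [word_vectors[t] for t in sentence if t in word_vectors]
        # A sentence with no known words becomes a zero vector.
        sequence.append(average_vectors(known) if known else [0.0] * dim)
    while len(sequence) < max_sentences:
        sequence.append([0.0] * dim)
    return sequence


# Toy two-dimensional vectors, invented for illustration.
word_vectors = {"the": [1.0, 0.0], "vote": [0.0, 1.0], "failed": [1.0, 1.0]}
document = [["the", "vote"], ["failed"]]
sequence = document_to_sentence_sequence(document, word_vectors, max_sentences=3)
```

The sequence length now grows with the number of sentences rather than the number of tokens, which is what keeps long articles manageable as network input.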

Overall, the ability to do well on the hyperpartisan news prediction task is important both for improving knowledge about neural networks for language processing generally, and because a better understanding of the nature of biased news is critical for society and democracy.

Thursday, 22 November 2018

Adapted TextRank for Term Extraction: A Generic Method of Improving Automatic Term Extraction Algorithms

This summer, we presented some of our latest work at SEMANTiCS 2018 in Vienna: "Adapted TextRank for Term Extraction: A Generic Method of Improving Automatic Term Extraction Algorithms".

Zhang, Z., Petrak, J. & Maynard, D. Adapted TextRank for Term Extraction: A Generic Method of Improving Automatic Term Extraction Algorithms. in SEMANTiCS 2018 – 14th International Conference on Semantic Systems 00, 0-000 (2018).

This work has been carried out in the context of the EU KNOWMAK project, where we're developing tools for multi-topic classification of text against an ontology, in order to attempt to map the state of European research output in key technologies.

Automatic Term Extraction (ATE) is a fundamental technique used in computational linguistics for recognising terms in text. Processing the collected terms in a text is a key step in understanding the content of the text. There are many different ATE methods, but these all tend to work well only in one specific domain. In other words, there is no universal method which produces consistently good results, and so we have to choose an appropriate method for the domain being targeted.

In this work, we have developed a novel method for ATE which addresses two major limitations: the fact that no single ATE method consistently performs well across all domains, and the fact that the majority of ATE methods are unsupervised. Our generic method, AdaText, improves the accuracy of existing ATE methods, using existing lexical resources to support them, by revising the TextRank algorithm.

After being given a target text, AdaText:

1. Selects a subset of words based on their semantic relatedness to a set of seed words or phrases relevant to the domain, but not necessarily representative of the terms within the target text.
2. Applies an adapted TextRank algorithm to create a graph for these words, and computes a text-level TextRank score for each selected word.
3. Uses these scores to revise the score of a term candidate previously computed by an ATE method.
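The three steps above can be sketched in Python as follows. This is a simplified illustration under several assumptions that do not come from the paper: relatedness is measured as cosine similarity over toy word vectors, the graph links selected words that are adjacent once other words are dropped, and the final combination (multiplying the base score by one plus the mean word score) is a placeholder rather than the formula AdaText actually uses.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_words(tokens, vectors, seeds, threshold):
    """Step 1: keep words whose similarity to any seed reaches the threshold."""
    seed_vectors = [vectors[s] for s in seeds if s in vectors]
    return {
        word for word in set(tokens)
        if word in vectors
        and any(cosine(vectors[word], sv) >= threshold for sv in seed_vectors)
    }


def textrank(tokens, selected, window=2, damping=0.85, iterations=50):
    """Step 2: PageRank-style scores over a co-occurrence graph of selected words."""
    kept = [t for t in tokens if t in selected]
    neighbours = {word: set() for word in selected}
    for i, word in enumerate(kept):
        for other in kept[i + 1:i + window]:
            if other != word:
                neighbours[word].add(other)
                neighbours[other].add(word)
    scores = {word: 1.0 for word in selected}
    for _ in range(iterations):
        scores = {
            word: (1 - damping) + damping * sum(
                scores[n] / len(neighbours[n]) for n in neighbours[word])
            for word in selected
        }
    return scores


def revise_scores(base_scores, word_scores):
    """Step 3 (assumed formula): boost each base ATE score by the mean word score."""
    revised = {}
    for term, base in base_scores.items():
        words = term.split()
        boost = sum(word_scores.get(w, 0.0) for w in words) / len(words)
        revised[term] = base * (1.0 + boost)
    return revised


# Toy data, invented for illustration.
tokens = "t cell activation requires t cell receptor signalling in the cell".split()
vectors = {
    "cell": [1.0, 0.1], "receptor": [0.9, 0.2], "activation": [0.8, 0.3],
    "signalling": [0.7, 0.4], "the": [0.0, 1.0], "in": [0.1, 1.0],
    "requires": [0.2, 0.9], "protein": [1.0, 0.0],
}
selected = select_words(tokens, vectors, seeds=["protein"], threshold=0.8)
word_scores = textrank(tokens, selected)
revised = revise_scores({"cell receptor": 1.0, "the cell": 1.0}, word_scores)
```

In this toy run, both candidates start with the same base score, and the one made up entirely of domain-related words ends up ranked higher.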

This technique was trialled using a variety of parameters (such as the threshold of semantic similarity to select words, as described in step one) over two distinct datasets (GENIA and ACLv2, comprising Medline abstracts and abstracts from ACL respectively). We also tested it with a wide variety of state-of-the-art ATE methods, including modified TFIDF, CValue, Basic, RAKE, Weirdness, LinkProbability, X², GlossEx and PositiveUnlabeled.

The figures show a sample of performances in different datasets and using different ATE techniques. The base performance of the ATE method is represented by the black horizontal line. The horizontal axis represents the semantic similarity threshold used in step 1. The vertical axis shows average P@K for all five Ks considered.

This new generic combination approach can consistently improve the performance of the ATE method by 25 points, which is a significant increase. However, there is still room for improvement. In future work, we aim to optimise the selection of words from the TextRank graph, to expand TextRank to a graph of both words and phrases, and to explore how the size and source of the seed lexicon affect the performance of AdaText.

Wednesday, 5 September 2018

Students use GATE and Twitter to drive Lego robots—again!

At the university's Headstart Summer School in July 2018, 42 secondary school students (age 16 and 17) from all over the UK (see below for maps) were taught to write Java programs to control Lego robots, using input from the robots (such as the sensor for detecting coloured marks on the floor) as well as operating the motors to move and turn. The Department of Computer Science provided a Java library for driving the robots and taught the students to use it.

After they had successfully operated the robots, we ran a practical session on 10 and 11 July on "Controlling Robots with Tweets". We presented a quick introduction to natural language processing (using computer programs to analyse human languages, such as English) and provided them with a bundle of software containing a version of the GATE Cloud Twitter Collector modified to run a special GATE application with a custom plugin to use the Java robot library to control the robots.

The bundle came with a simple "gazetteer" containing two lists of keywords:

"left" list	"turn" list
left	turn
port	take
	make
	move

and a basic JAPE grammar (set of rules) to make use of it. JAPE is a specialized programming language used in GATE to match regular expressions over annotations in documents, such as the "Lookup" annotations created whenever the gazetteer finds a matching keyword in a document. (The annotations are similar to XML tags, except that GATE applications can create them as well as read them and they can overlap each other without restrictions. Technically they form an annotation graph.)

The sample rule we provided would match any keyword from the "turn" list followed by any keyword from the "left" list (with optional other words in between, so that "turn to port", "take a left", "turn left" all work the same way) and then run the code to turn the robot's right motor (making it turn left in place).
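The logic of that rule can be approximated in plain Python, for readers who have not seen JAPE. This is only an illustration of the matching behaviour described above, not the grammar the students used: the `max_gap` limit on intervening words is an assumption, and the function simply reports whether a tweet matches rather than calling the robot library.

```python
import re

# The two keyword lists from the sample gazetteer.
GAZETTEER = {
    "left": {"left", "port"},
    "turn": {"turn", "take", "make", "move"},
}


def lookup(tokens):
    """Tag each token with the gazetteer list that contains it, or None."""
    return [
        next((name for name, words in GAZETTEER.items() if token in words), None)
        for token in tokens
    ]


def is_turn_left(tweet, max_gap=3):
    """True if a "turn" keyword is followed by a "left" keyword,
    with at most max_gap other words in between (max_gap is an assumption)."""
    tokens = re.findall(r"[a-z]+", tweet.lower())
    tags = lookup(tokens)
    for i, tag in enumerate(tags):
        if tag == "turn" and "left" in tags[i + 1:i + 2 + max_gap]:
            return True
    return False


commands = ["turn to port", "take a left", "Turn left", "left turn", "go forward"]
results = [is_turn_left(c) for c in commands]
```

The first three commands match and the last two do not, because the rule requires the "turn" keyword to come before the "left" keyword.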

We showed them how to configure the Twitter Collector, follow their own accounts, and then run the collector with the sample GATE application. Getting the system set up and working took a bit of work, but once the first few groups got their robot to move in response to a tweet, everyone cheered and quickly became more interested. They then worked on extending the word lists and JAPE rules to cover a wider range of tweeted commands.

Some of the students had also developed interesting Java code the previous day, which they wanted to incorporate into the Twitter-controlled system. We helped these students add their code to their own copies of the GATE plugin and re-load it so the JAPE rules could call their procedures.

We first ran this project in the Headstart course in July 2017; we made improvements for this year and it was a success again, so we plan to include it in Headstart 2019 too.

The following maps show where all the students and the female students came from.

This work is supported by the European Union's Horizon 2020 project SoBigData (grant agreement no. 654024). Thanks to Genevieve Gorrell for the diagram illustrating how the system works.

Tuesday, 27 February 2018

Students use GATE and Twitter to drive Lego robots

At the university's Headstart Summer School in July 2017, 42 students (age 16 and 17) from all over the UK were taught to write Java programs to control Lego robots, using input from the robots (such as the sensor for detecting coloured marks on the floor) as well as operating the motors to move and turn. (The university provided a custom Java library for this.)

On 11 and 12 July we ran a practical session on "Controlling Robots with Tweets". We presented a quick introduction to natural language processing (using computer programs to analyse human languages such as English) and provided them with a bundle of software containing a version of the GATE Cloud Twitter Collector modified to run a special GATE application with a custom plugin to let it use the Java robot library.

The bundle came with a simple "gazetteer" containing two lists of classified keywords:

"left" list	"turn" list
left	turn
port	take
	make
	move

and a basic JAPE grammar to make use of it. JAPE is a specialized language used in GATE to match regular expressions over annotations in documents. (The annotations are similar to XML tags, except that GATE applications can create them as well as read them and they can overlap each other without restrictions. Technically they form an annotation graph.)

The grammar we provided would match any keyword from the "turn" list followed by any keyword from the "left" list (with zero or more unmatched words in between, e.g., "turn to port", "take a left", "turn left") and then run the code to turn the robot's right motor (making it turn left in place).

We showed them how to configure the Twitter Collector, authenticate with their Twitter accounts, follow themselves, and then run the collector with this application. Getting the system set up and working was a bit laborious, but once the first group got their robot to move in response to a tweet and cheered, everyone got a lot more interested very quickly. They were keen to extend the word lists and JAPE rules to cover a wider range of tweeted commands.

Some of the students had also developed interesting and complicated manoeuvres in Java the previous day, which they wanted to incorporate into the Twitter-controlled system. We helped these students add their code to their own copies of the GATE plugin and re-load it so the JAPE rules could call their procedures.

This project was fun and interesting for the staff as well as the students, and we will include it in Headstart 2018.

The Headstart 2017 video includes these activities. The instructions (presentation and handout) and software are available on-line.

This work is supported by the European Union's Horizon 2020 project SoBigData (grant agreement no. 654024).