Tuesday, 20 October 2020

From Entity Recognition to Ethical Recognition: a museum terminology journey


This guest blog post from Jonathan Whitson Cloud tells the story of "how a relatively simple entity recognition project at the Horniman Museum has, thanks to the range and flexibility of tools available in GATE, opened the door to a method for the democratisation and decolonisation of terminology in Museums."
In 2018 the Horniman Museum opened a new long-term display called the World Gallery. As is usual with museum displays, there was only a very limited amount of space for text giving context to the over 3,000 items in the cases. As is also now usual, the Horniman looked to its website to share more of the research and stories the curators had unearthed in the six-year gestation of the gallery.

The Horniman World Gallery

Entity Recognition

Central to the ambition for the web content was a desire to bridge the gap between the database that the museum uses to record its collections and the narrative and research texts recorded in a wiki. The link would be the database terminologies and authority lists, used as business controls in the database. The construction of these terminologies has a revered place in museum practice. Museums as they are today emerged from the enlightenment project to categorise and bring order to the world. More on the consequences of this later, but for now it was useful to have a series of reference terms for the types of objects in the gallery, the cultures they came from, the people, places and materials etc. 

I had learnt about GATE and participated in the week's training course in 2015, when I first became interested in, and aware of, the potential for Natural Language Processing as a way of managing and getting the most out of the vast and often messy data holdings in museums.

My hope was that the terminologies and authorities in our collections database could serve as gazetteers for gazetteer-based entity recognition in GATE. The terminology entities from the database-generated gazetteers would be matched in the wiki texts and rendered as hyperlinks to reference pages for the entities on our website.

This worked pretty well, and we released over 500 wiki pages of marked-up text, with new pages continuing to come online. The gazetteer matching, though, was only accurate enough to be suggestive, with many strings appearing in multiple gazetteers (people’s names were particularly difficult). I had been wanting an excuse to explore the machine learning potential in GATE and this seemed like an opportunity, so I came up to Sheffield for an additional day’s training (thank you Xingyi) in early March 2020, and came away with a pipeline that used Machine Learning to identify term types independently of the gazetteers. Its output could then be built into a set of rules that improved the gazetteer identification significantly. The annotations produced were still checked prior to publication, but with considerably fewer adjustments required.
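The gazetteer matching described above, and the ambiguity that arises when a string appears in more than one gazetteer, can be sketched roughly as follows. This is plain Python rather than a GATE pipeline, and the gazetteer contents, the `/terms/` URL scheme and all function names are assumptions made up for illustration.

```python
import re

# Hypothetical gazetteers (term type -> terms); the values are illustrative only.
GAZETTEERS = {
    "person": ["Horniman"],
    "organisation": ["Horniman"],
    "material": ["bronze"],
}

def build_lookup(gazetteers):
    """Map each lower-cased term to every gazetteer type that lists it."""
    lookup = {}
    for term_type, terms in gazetteers.items():
        for term in terms:
            lookup.setdefault(term.lower(), []).append(term_type)
    return lookup

def find_matches(text, gazetteers):
    """Return (start, end, matched text, candidate types) for each whole-word match."""
    lookup = build_lookup(gazetteers)
    if not lookup:
        return []
    # Longest terms first so multi-word terms win over shorter ones they contain.
    terms = sorted(lookup, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(map(re.escape, terms)) + r")\b", re.IGNORECASE
    )
    return [
        (m.start(), m.end(), m.group(0), lookup[m.group(0).lower()])
        for m in pattern.finditer(text)
    ]

def link_unambiguous(text, gazetteers, base_url="/terms/"):
    """Hyperlink matches with exactly one candidate type; leave the rest for review."""
    out, pos = [], 0
    for start, end, term, types in find_matches(text, gazetteers):
        out.append(text[pos:start])
        if len(types) == 1:
            slug = term.lower().replace(" ", "-")
            out.append(f'<a href="{base_url}{types[0]}/{slug}">{term}</a>')
        else:
            out.append(term)  # ambiguous: needs a rule or a human decision
        pos = end
    out.append(text[pos:])
    return "".join(out)
```

In this sketch "Horniman" is left unlinked because it sits in two gazetteers, which is the kind of case where a type predicted independently of the gazetteers could be used to pick between the candidates.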

The Gazetteer Pipeline developed in GATE

The next experiment was to run the machine learning enhanced gazetteer pipeline over a set of gallery texts for an older exhibition. This produced a lot of matches/links, and should we publish these texts online, they will appear with in-line links to terms already in use in our Mimsy collections database and the World Gallery wiki texts, so becoming an integrated part of the web of linked terms and texts.

The Machine Learning pipeline built in GATE


Another very welcome outcome of this process was that the pipeline identified a number of terms that were not in our gazetteers and which became suggested new terms for our terminologies, demonstrating GATE’s ability to create as well as identify terminology, and it is this function that we are now looking to exploit in a new project.

Decolonisation of Museum Collections

In 2019 the Horniman was appointed by the Department of Culture Media and Sport (DCMS) to lead a group of museums in developing new collecting and interpretation practice addressing the historic and ongoing cultural impact of the UK as a colonising power.  The terminology that museums use about their collections is very much a subject of interest to museums seeking to decolonise their collections. As mentioned before, the creation and application of categories has been fundamental to museum practice since museums emerged as knowledge organisations in the 18th century. It has now become painfully clear, however, that these categories have been created and applied with the same scant regard for the rights and culture of the people who made and used the items to which they have been applied as the ‘collecting’ of them. That is to say, at best rudely and at worst violently. 

We are currently building a mechanism, again based on a wiki and GATE, whereby new and existing texts authored by the communities who made and used the items in the museum collection can also be marked up by those communities to make learning corpora. A machine learning pipeline will then build new terminologies to be applied to the items that the communities made and used. This is not only decolonising but democratising as it gives value to texts by any members of a community, not just cultural academics or other specialists, in many media including social media.

The GATE tool with its modular architecture has enabled me to take an experimental and incremental approach to accessing advanced NLP tools, despite not being an NLP or even a computer expert. That it is open source and supported by an active user community makes it ideal for the Cultural Heritage sector which otherwise lacks the funding, the confidence and the expertise to access the powerful NLP techniques and all they offer for the redirecting of museum interpretation away from expert exposition towards a truly democratic and decolonised future. 

Monday, 24 February 2020

Online Abuse toward Candidates during the UK General Election 2019



In this blog post I’m going to discuss the 2019 UK general election and the increase in abuse aimed at politicians online. We collected 4.2 million tweets sent to or from election candidates in the six-week period spanning from the start of November until shortly after the December 12th election. The graph above shows who received the most abuse up to and including December 14th, with Boris Johnson and Jeremy Corbyn receiving the most by far.

The 2016 "Brexit" referendum left the parliament and the nation divided. Since then we have seen two general elections, and two Prime Ministers jostle to strengthen their majority and improve their negotiating position with the EU. National feeling has never been so polarised and it will come as no surprise that with the social changes brought about through the rise of social media, abuse towards politicians in the UK has increased. Using natural language processing we can identify abuse and type it according to whether it is political, sexist or simply generic abuse.

Our work investigates a large tweet collection on which natural language processing has been performed in order to identify abusive language, the politicians it is targeted at and the topics in the politician’s original tweet that tend to trigger abusive replies, thus enabling large scale quantitative analysis. A list of slurs, offensive words and potentially sensitive identity markers was used. The slurs list contained 1081 abusive terms or short phrases in British and American English, comprising mostly an extensive collection of insults, racist and homophobic slurs, as well as terms that denigrate a person’s appearance or intelligence, gathered from sources that include http://hatebase.org and Farrell et al [2].
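As a rough illustration of the list-based part of this approach, the sketch below flags replies that contain a term from a word list and reports the share of replies flagged. It is a simplified stand-in, not the GATE pipeline used in the study: the two placeholder terms, the function names and the whole-word matching rule are assumptions, and it does not attempt to identify the target of the abuse or to type it.

```python
import re

# Placeholder terms standing in for the slur list (the real list had 1081 entries).
ABUSE_TERMS = ["idiot", "waste of space"]

def compile_terms(terms):
    """Build one case-insensitive, whole-word pattern from a list of terms."""
    ordered = sorted(terms, key=len, reverse=True)  # prefer longer phrases
    return re.compile(
        r"\b(?:" + "|".join(map(re.escape, ordered)) + r")\b", re.IGNORECASE
    )

def find_abuse(reply_text, pattern):
    """Return the listed terms found in one reply, lower-cased."""
    return [m.group(0).lower() for m in pattern.finditer(reply_text)]

def abuse_rate(replies, pattern):
    """Percentage of replies containing at least one listed term."""
    if not replies:
        return 0.0
    abusive = sum(1 for reply in replies if find_abuse(reply, pattern))
    return 100.0 * abusive / len(replies)
```

Whole-word matching keeps a word such as "idiotic" from matching the entry "idiot"; whether that is desirable depends on how the list was built, so treat it as a design choice of this sketch only.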

Method

Tweets were collected in real time using Twitter’s streaming API. We began collecting immediately for every candidate entered into Democracy Club’s database [10] who had a Twitter account, and used the API to follow the accounts of all candidates over the campaign period. This means we collected all the tweets sent by each candidate, any replies to those tweets, and any retweets either made by the candidate or of the candidate’s own tweets. Note that this approach does not collect all tweets which an individual would see in their timeline, as it does not include those in which they are just mentioned. We took this approach because it makes the analysis more reliable: replies are directed at the politician who authored the tweet, so any abusive language in them is more likely to be directed at that politician. Ethics approval was granted to collect the data through application 25371 at the University of Sheffield.

Findings

Table 1 gives overall statistics for the research period, which contains a total of 184,014 candidate-authored original tweets, 334,952 retweets and 131,292 replies. 3,541,769 replies to politicians were found, of which 4.46% contained abuse. The second row gives the corresponding statistics for the 2017 general election period. It is evident that the level of abuse received by political candidates has risen in the intervening two and a half years.

In terms of representation in the sample of election candidates with Twitter accounts, gender balance is skewed heavily in favour of men for the Conservatives and LibDems; Labour in contrast had more female/non-binary than male candidates. Most abuse is aimed at Jeremy Corbyn and Boris Johnson, with Matthew Hancock, Jacob Rees-Mogg, Jo Swinson, Michael Gove, David Lammy and James Cleverly also receiving substantial abuse. Michael Gove received a great deal of personal abuse following the climate debate. Jo Swinson received the most sexist abuse.


| Period | Original MP tweets | MP retweets | MP replies | Replies to MPs | Abusive replies to MPs | % Abuse |
|---|---|---|---|---|---|---|
| 3 Nov–15 Dec 2019 | 184,014 | 334,952 | 131,292 | 3,541,769 | 157,844 | 4.46 |
| 29 Apr–9 Jun 2017 | 126,216 | 245,518 | 71,598 | 961,413 | 31,454 | 3.27 |
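The percentage column can be reproduced from the two reply counts in each row. The short check below uses only the figures given in the table; the function name is an assumption for illustration.

```python
def abuse_percentage(abusive_replies, total_replies):
    """Abusive replies as a percentage of all replies, rounded to two decimals."""
    return round(100 * abusive_replies / total_replies, 2)

# Figures taken from the table: (abusive replies to MPs, replies to MPs).
PERIODS = {
    "3 Nov-15 Dec 2019": (157_844, 3_541_769),
    "29 Apr-9 Jun 2017": (31_454, 961_413),
}

for period, (abusive, total) in PERIODS.items():
    print(period, abuse_percentage(abusive, total))
```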


Who is getting abuse?

The topic of Brexit draws abuse for all three parties. Conservative candidates initially move away from this, toward their safer topic of taxation, before returning to Brexit. Liberal Democrats continue to focus on Brexit despite receiving abuse. Labour candidates consistently don’t focus on Brexit; public health is a safe topic for Labour. 


Levels of abuse increased in the run up to the election. The figure below highlights the number of abusive tweets received by the three major parties. There is a considerable spike for both Labour and the Conservatives in the week prior to the election.




In the graph below we compare the average abuse per month received by MPs who did not stand again with that received by those who did. MPs who stood down received more abuse than those who chose to stand again in all but one month in the first half of 2019, and in June they received over 50% more abuse.


Conclusions

Between November 3rd and December 15th, we found 157,844 abusive replies to candidates’ tweets (4.46% of all replies received). This is a low estimate, probably around half of the actual abusive tweets. Overall, abuse levels climbed week on week in November and early December, as the election campaign progressed, from 17,854 in the first week to 41,421 in the week of the December 12th election. The escalation in abuse was toward Conservative candidates specifically, with abuse levels towards candidates from the other two main parties remaining stable week on week; however, after Labour’s decisive defeat, their candidates were subjected to a spike in abuse. Abuse levels are not constant; abuse is triggered by external events (e.g. leadership debates) or controversial tweets by the candidates. Abuse levels have also been broadly climbing month on month over the year, and in November were more than double by volume compared with January.

Wednesday, 6 November 2019

Which MPs changed party affiliation, 2017-2019

As part of our work tracking Twitter abuse towards MPs and candidates going into the December 12th general election I've been updating our data files regarding party membership. I thought you might be interested to see the result!

Update December 10th: Green stars now indicate MPs who chose not to stand again.

Monday, 12 August 2019

In the News: Online Abuse of Politicians, BBC


We've been working together with the BBC to bring public attention to the issue of online abuse against politicians. Rising tensions in Q1 and Q2 of 2019 meant that politicians were seeing more verbal abuse on Twitter than we have previously observed. The findings were presented on the 6 o'clock and 10 o'clock news on Tuesday, August 6th, and you can see in the histogram above that we found the level of incivility rising to almost 4%. You can see the BBC article describing the work here.

The BBC also did a survey. Of the 172 MPs who responded, 139 said that either they or their staff had faced abuse in the past year. More than 60% (108) of those who replied said they had been in contact with the police about threats in the last 12 months.

We found that levels of abuse on Twitter fluctuate over time, with spikes driven by events such as the death of IS bride Shamima Begum's baby or key events in the Brexit negotiations. Labour MP David Lammy has received the most abuse of any MP on Twitter so far this year.

As previously, we also found that on average, male MPs attract significantly more general incivility than female ones, though women attract more sexist abuse. Conservative MPs on average, as previously, attracted significantly more abuse than Labour ones, perhaps because they are in power. Sexist abuse is the most prevalent, as compared with homophobia or racism.

Tuesday, 30 July 2019

GATE Cloud services for Google Sheets featured in the CLARIN Newsflash

CLARIN ERIC is a research infrastructure across Europe and beyond that encourages the sharing and sustainability of language data and tools for research in the humanities and social sciences. We are pleased to announce that our functions for text analysis in Google Sheets were featured in the July 2019 issue of the CLARIN Newsflash.

We are still working on getting Google to publish our add-on, which we hope to have available in the marketplace in a few months. Until then, you can follow the instructions in our previous blog post to use this tool, which currently provides standard and Twitter-oriented named entity recognition for English, French, and German; named entity linking for English, French, and German; and rumour veracity evaluation for English. In the future we will expand the range of functions to cover a wider variety of GATE Cloud services.

Monday, 15 July 2019

GATE Cloud services for Google Sheets

Spreadsheets are an increasingly popular way of storing all kinds of information, including text, and giving it some informal structure, and systems like Google Sheets are especially popular for collaborative work and sharing data.

In response to the demand for standard natural language processing (NLP) tasks in spreadsheets, we have developed a Google Sheets add-on that provides functions to carry out the following tasks on text cells using GATE Cloud services:
  • named entity recognition (NER) for standard text (e.g. news) in English, French, or German;
  • NER tuned for tweets in English, French, or German;
  • named entity linking using our YODIE service in English, French, or German;
  • veracity reporting for rumours in tweets.

We have demonstrated this work several times, most recently at the IAMCR conference "Communication, Technology and Human Dignity: Disputed Rights, Contested Truths", which took place on 7–11 July at the Universidad Complutense de Madrid in Spain. There we used it to show how organisations monitoring the safety of journalists could automatically add information about entities and events to their spreadsheets. Potential users have said it looks very useful and they would like access to it as soon as possible.

Google sheet showing Named Entity and Linking applications run over descriptions of journalist killings from the Committee to Protect Journalists (CPJ) databases

We are applying to have this add-on published in the G Suite Marketplace, but the process is very slow, so we are making the software available now as a read-only Google Drive document that anyone can copy and re-use. 

The document contains several examples, and instructions are available from the Add-ons → GATE Text Analysis menu item. The language processing is actually done on our servers; the spreadsheet functions send the text to GATE Cloud using the REST API and reformat the output into a human-readable form, so they require a network connection and are subject to rate-limiting. You can use the functions without setting up a GATE Cloud account, but if you create one and authenticate while using this add-on, rate-limiting will be reduced.
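To give a feel for the "reformat the output into a human-readable form" step, here is a minimal Python sketch (the add-on itself runs as spreadsheet scripts, not Python). It assumes the service response has already been parsed from JSON and has a `text` field plus an `entities` object that maps each annotation type to a list of annotations with `indices` offsets. That response shape, the sample data and the function name are assumptions for illustration, and no network request is made here.

```python
def summarise_entities(response):
    """Turn an entity response into a single readable string for a spreadsheet cell.

    Assumed shape (illustrative): {"text": str,
                                   "entities": {type: [{"indices": [start, end]}, ...]}}
    """
    text = response["text"]
    parts = []
    for ann_type, annotations in sorted(response.get("entities", {}).items()):
        for annotation in annotations:
            start, end = annotation["indices"]
            parts.append(f"{ann_type}: {text[start:end]}")
    return "; ".join(parts)

# Hypothetical response for one text cell.
SAMPLE_RESPONSE = {
    "text": "Jane Doe visited Madrid.",
    "entities": {
        "Person": [{"indices": [0, 8]}],
        "Location": [{"indices": [17, 23]}],
    },
}

print(summarise_entities(SAMPLE_RESPONSE))
```

Sorting by annotation type keeps the cell contents stable from one call to the next, which makes spreadsheet rows easier to compare.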



Open this Google spreadsheet, then use File → Make a copy to save a copy to your own Google Drive (you can’t edit the original). For the functions to work, you will have to grant permission for the scripts to send data to and from GATE Cloud services and to use your user-level cache.

This work has been supported by the European Union’s Horizon 2020 research and innovation programme under grant agreements No 687847 (COMRADES) and No 654024 (SoBigData).


Friday, 12 July 2019

Using GATE to drive robots at Headstart 2019


In collaboration with Headstart (a charitable trust that provides hands-on science, engineering and maths taster courses), the Department of Computer Science has just run its fourth annual summer school for maths and science A-level students. This residential course ran from 8 to 12 July 2019 and included practical work in computer programming, Lego robots, and project development as well as tours of the campus and talks about the industry.

For the third year in a row, we have included a section on natural language processing using GATE Developer and a special GATE plugin (which uses the ShefRobot library available from GitHub) that allows JAPE rules to operate the Lego robots. As before, we provided the students with a starter GATE application (essentially the same as in last year's course) containing just enough gazetteer entries, JAPE, and sample code to let them tweet variations like "turn left" and "take a left" to make the robot do just that. We also use the GATE Cloud Twitter Collector, which we have modified to run locally so the students can set it up on a lab computer, where it follows their own Twitter accounts and processes their tweets through the GATE application, sending commands to the robots when the JAPE rules match.
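The idea of mapping phrasing variants onto a single robot command can be illustrated outside GATE with a small Python sketch. This is not JAPE and does not use the ShefRobot library; the verb list, the command names and the function name are assumptions chosen for the example.

```python
import re

# Hypothetical rule: a turning phrase followed by a direction word.
TURN_RULE = re.compile(r"\b(?:turn|take a|go)\s+(left|right)\b", re.IGNORECASE)

def commands_from_tweet(tweet):
    """Return one command string per matched instruction, in order of appearance."""
    return [f"TURN_{match.group(1).upper()}" for match in TURN_RULE.finditer(tweet)]

for tweet in ["turn left", "Please take a left then turn right", "left luggage"]:
    print(tweet, "->", commands_from_tweet(tweet))
```

In the course itself the equivalent matching is done by gazetteer entries and JAPE rules, with the robot commands sent when a rule fires; the sketch only shows why "turn left" and "take a left" can end up as the same command.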


Based on lessons learned from the previous years, we put more effort into improving the instructions and the Twitter Collector software to help them get it running faster.  This time the first robot started moving under GATE's control less than 40 minutes from the start of the presentation, and the students rapidly progressed with the development of additional rules and then tweeting commands to their robots.



The structure and broader coverage of this year's course meant that the students had more resources available and a more open project assignment, so not all of them chose to use GATE in their projects, but it was much easier  and more streamlined for them to use than in previous years.







This year 42 students (14 female; 28 male) from around the UK attended the Computer Science Headstart Summer School.
Geography of male students

Geography of female students

The handout and slides are publicly available from the GATE website, which also hosts GATE Developer and other software products in the GATE family.  Source code is available from our GitHub site.  

GATE Cloud development is supported by the European Union’s Horizon 2020 research and innovation programme under grant agreement No 654024 (the SoBigData project).