Wednesday 3 August 2022

Populate a Corpus from a List of URLs

GATE provides support for loading numerous different document formats, as well as a number of ways of populating corpora. Until recently, however, we've not offered any way of populating a corpus from a simple list of URLs. Worse, even though it's now quite easy to do this in GATE, it's unlikely you would come across the option by accident.

The support for this is actually hidden away inside the "Format CSV" plugin (you'll need to use version 8.7 or above) and in GATE Developer is exposed through the "Populate from CSV file..." option in the context menu of a corpus.
In this screenshot I've configured the populator ready to build a corpus from a simple text file with one URL per line. The important settings are:
  • Column Separator is set to "\t". This means we are using a tab character as the column separator. We do this simply as you can't have a tab in a URL whereas you could have a URL containing a comma and we don't want our URLs split in half.
  • Document Content is in column 0. We always count columns (or almost anything) starting from 0, so this just ensures we use the URL as the document content.
  • Create one document per row is selected. The next option, which is the important one, isn't available unless this is selected first, as it makes no sense to try and load multiple URLs into the same GATE document.
  • Cell contains document URL is selected. This is the new feature which makes this trick possible. Essentially it looks at the contents of a cell and if it can be interpreted as a URL then it creates a document from the contents of the URL, otherwise it uses the cell content as normal to build the document.
Once configured it's simply a case of selecting your text file, one URL per line, and hitting the OK button. Be aware that there is currently no rate limiting so be careful if you are listing a lot of URLs from a single domain etc. You may also want to combine this with the cookie trick from the previous post to ensure you get the correct content from each of the URLs.

Of course while this post has been about how to populate a corpus from a simple list of URLs you can use more complex CSV or TSV files which happen to contain URLs in one column. In that case the details from the other columns will be added as document features.
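
To make the settings described above concrete, here is a minimal sketch of the per-row decision in plain Java. It is not the Format CSV plugin's code: the class and method names are hypothetical, and the test for "can be interpreted as a URL" (an absolute http or https URI with a host) is an assumption made for illustration.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.regex.Pattern;

// Hypothetical illustration only; this is not the Format CSV plugin's code.
class UrlCellSketch {

    // Splits a row on the literal separator and returns the requested column (0-based).
    static String column(String row, String separator, int index) {
        String[] cells = row.split(Pattern.quote(separator), -1);
        return cells[index];
    }

    // Assumption: a cell counts as a URL if it parses as an absolute
    // http or https URI with a host. Anything else is treated as plain content.
    static boolean looksLikeUrl(String cell) {
        try {
            URI uri = new URI(cell.trim());
            String scheme = uri.getScheme();
            return scheme != null
                    && uri.getHost() != null
                    && (scheme.equalsIgnoreCase("http") || scheme.equalsIgnoreCase("https"));
        } catch (URISyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // One row with a URL containing a comma, followed by a second column.
        String row = "https://example.com/a,b\tnews";

        // With a tab separator the URL in column 0 stays intact.
        String tabCell = column(row, "\t", 0);
        System.out.println(tabCell + " -> URL? " + looksLikeUrl(tabCell));

        // With a comma separator the same URL is cut at the comma.
        String commaCell = column(row, ",", 0);
        System.out.println(commaCell + " -> URL? " + looksLikeUrl(commaCell));
    }
}
```

The comma case is the reason for choosing tab as the separator: the truncated cell still looks like a URL, so the wrong page would be fetched without any warning.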

Wednesday 2 March 2022

GATE and the Cookie Jar

One of the useful features of GATE is that documents can be loaded directly from the web as well as from local files. This is especially useful for pages which update frequently and which you might want to process repeatedly. While using this feature recently we came across some pages that refused to load correctly. The page loaded fine in a web browser but returned a 403 (Forbidden) response when accessed via GATE.

After a bit of debugging it turned out that this issue was related to cookies. The specific URL we were trying to load went through a number of redirects before ending up at the final page. The problem was that the first redirect set a cookie, and that needed to be present for the further redirects to work. By default Java, and hence GATE, doesn't maintain cookies across requests, as each connection is handled independently.

If you are using GATE in an embedded context, then it is trivial to add support for cookies using the default Java cookie handler. This is a JVM level setting so once configured in your own code, all requests made by GATE to load documents will also gain support for handling cookies. The entire solution is the following single line of code:
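
The line referred to above is most likely a call along the following lines. This is a sketch that assumes the standard java.net classes with their default settings; the class name EnableCookies is hypothetical, and the original may have passed a different cookie store or policy.

```java
import java.net.CookieHandler;
import java.net.CookieManager;

// Hypothetical wrapper class; only the setDefault call matters.
class EnableCookies {
    public static void main(String[] args) {
        // Install a JVM-wide, in-memory cookie manager. Subsequent
        // URLConnection requests in this JVM will store and resend cookies.
        CookieHandler.setDefault(new CookieManager());
    }
}
```

Groovy imports java.net.* by default, so the statement inside main can also be run unchanged in a Groovy script.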

The problem we faced though, was that we wanted to be able to load documents that required cookies from within GATE Developer and that required a little more thought. Whilst we could have just added the code to GATE there are a number of reasons not to (details of which are outside the scope of this blog post) and I wanted to make it easier for all existing GATE users to be able to use cookies without needing to upgrade. The answer is the rather versatile Groovy plugin.

If you load the Groovy plugin into GATE Developer you can then access the Groovy Console from within the tools menu. Simply pasting that single line of code into the console and executing it is enough to add the cookie support within that instance of GATE. It's slightly annoying that it won't persist across multiple instances of GATE, but as it's such a simple trick hopefully it's easy enough to apply when needed.

Monday 28 February 2022

How green is your recipe? Using GATE to calculate the environmental impact of recipes


The calculation of environmental impacts from recipes remains a barrier to effective uptake of sustainable diets.  In a recent project funded by Alpro, led by Dr Christian Reynolds from the Centre for Food Policy at City University London, we explored digitised recipe texts from websites in English, Dutch and German. We study recipes rather than individual ingredients because this is how people typically think about environmental impact and diet.

Recipes are hard to process because they use different weights and measures, and sometimes quite vague or obscure terms (e.g. "a pinch of salt", "a handful of lettuce"). Together with our project partner Text Mining Solutions, we used GATE to develop customised tools to automatically extract ingredients, quantities and units from 220,168 indexed recipes, and to match these to a food environmental database of 4500 ingredients (using the classification system FoodEx2). This database provided Land Use, GHG emissions, Eutrophying Emissions, Stress-Weighted Water Use, and Freshwater Withdrawals for each ingredient.

Nutrition information was sourced from the USDA FoodData Central (McKillop et al., 2021) and McCance and Widdowson's Composition of Foods Integrated Database (Public Health England, 2015). Environmental and Nutrition information was matched to two classification systems (FoodEx2, containing 4,500 ingredients, and USDA Nutrient Database, containing 2,484 ingredients). This allowed us to calculate these impacts at the mean, 5% and 95% confidence level per recipe and per portion, enabling us to explore the environmental impacts of vegan, vegetarian and non-vegetarian (omnivore) recipes if we were to cook these recipes using contemporary ingredients.

To validate the tool, we manually calculated the impacts of 50 recipes from 4 websites: BBC Good Food, Albert Heijn/Allerhande, and Kochbar, and compared these with the results from our tool. 

We created a website where you can enter a recipe and get back the calculation for the recipe and per portion (with confidence intervals). The image below shows a sample screenshot.

We presented some of our findings as a poster at the Livestock, Environment and People (LEAP) conference in December 2021. You can find more examples of our analysis and results there.

It's interesting to see how the recipes from the different countries, as well as recipes with different protein sources, lead to different median CO2 footprints. Below we see a chart showing the median GHGE per portion in recipes from different protein sources (e.g. those containing beef, those containing tofu) in omnivore, vegetarian, and vegan recipes. Unsurprisingly, the dishes containing meat have higher GHGE values on the whole, though we do find variations within individual recipes. We were particularly excited to find a recipe for chocolate cake that "beat" a salad in terms of low GHGE!

When we compared the different datasets (depicting recipes from different European countries) in terms of median GHGE per protein source, we found that Kochbar (German) recipes typically fared the worst, followed by the BBC Good Food recipes (British), and Albert Heijn (Dutch) faring much better.

The work is now continuing with the development of a dashboard enabling additional visualisations and further analysis to be produced.

Sunday 7 February 2021

New releases bringing GATE and Python closer together


The GATE Team is proud to announce two new releases that bring GATE and Python together:

  • Python GateNLP (version 1.0.2): a Python 3 package that brings many of the concepts and the ease of handling documents, annotations and features to Python.
  • GATE Python Plugin (version 3.0.2): a new plugin that can be used from Java GATE to process documents using Python code and the methods provided by the Python GateNLP package

Both are intended as first releases to a wider community, so that users can give feedback about what they need and what the basic design should look like.


Users are invited to give feedback about the Python GateNLP package:

  • If you detect a bug, or have a feature request, please use the GitHub Issue Tracker
  • For more general discussions, ideas, asking the community for help, please use (preferably) the GitHub Discussions Forum or the General GATE Mailing List
  • We are also interested in feedback about the API and the functionality of the package. If you want to use the package for your own development and want to discuss changes, improvements or how you can contribute, please use the GitHub Discussions Forum 
  • We are happy to receive contributions! Please create an issue and discuss/plan with developers on the issue tracker before providing a pull request.

To give feedback about the Python Plugin:

IMPORTANT: whenever you give feedback, please include as much detail about your Operating System, Java or Python version, package/plugin version and your concrete problem or question as possible!

GATE Course Module

Module 11 of the upcoming online GATE course in February 2021 will introduce the Python GateNLP package and the GATE Python plugin. You can register for this and many other modules of the course here.

Python GateNLP

Python GateNLP is a Python NLP framework which provides some of the concepts and abstractions known from Java GATE in Python, plus a number of new features: 
  • Documents with arbitrarily many features, arbitrarily many named Annotation sets. GateNLP also adds the capability of keeping a ChangeLog
  • AnnotationSets with arbitrarily many (stand-off) Annotations which can overlap in any way and can span any character range (not just entire tokens/words)
  • Annotations with arbitrarily many features, grouped per set by some annotation type name
  • Features which map keys to arbitrary values 
  • Corpora: collections of documents. Python GateNLP provides corpora that directly map to files in a directory (recursively). 
  • Prepared modules for processing documents. In GateNLP these are called "Annotators" and also allow for filtering, splitting of documents
  • Reading and writing in various formats. GateNLP uses three new formats, "bdocjs" (JSON serialization), "bdocym" (YAML serialization) and "bdocMP" (Message Pack serialization). Documents in these formats can be exchanged with Java GATE through the GATE plugin Format_Bdoc
  • Gazetteers for fast lookup and annotation of token sequences or character sequences which match a large list of known terms or phrases
  • PAMPAC: a way to annotate documents using patterns over text, other annotations and annotation features
  • An HTML visualizer which allows the user to interactively view GATE documents, annotations and features as separate HTML files or within Jupyter notebooks.
  • Bridges to powerful NLP libraries and conversion of their annotations to GateNLP annotations:
  • GateWorker: an API that allows the user to directly run Java GATE from Python and exchange documents between Python and Java
  • The Java GATE Python Plugin (see below) allows the user to run Python GateNLP code directly from Java GATE and process documents with it.

GATE Python Plugin

The GATE Python Plugin is one of many GATE plugins that extend the functionality of Java GATE. This plugin allows the user to process GATE documents with Python programs (which use the GateNLP API for manipulating documents), either in the Java GATE GUI or via the multiprocessing GATE Cloud Processor (GCP).

Tuesday 20 October 2020

From Entity Recognition to Ethical Recognition: a museum terminology journey

This guest blog post from Jonathan Whitson Cloud tells the story of "how a relatively simple entity recognition project at the Horniman Museum has, thanks to the range and flexibility of tools available in GATE, opened the door to a method for the democratisation and decolonisation of terminology in Museums."
In 2018 the Horniman Museum opened a new long term display called the World Gallery. As is usual with museum displays, there was only a very limited amount of space for text giving context to the over 3,000 items in the cases. As is also now usual, the Horniman looked to its website to share more of the research and stories the curators had unearthed in the 6 year gestation of the gallery. 

The Horniman World Gallery

Entity Recognition

Central to the ambition for the web content was a desire to bridge the gap between the database that the museum uses to record its collections and the narrative and research texts recorded in a wiki. The link would be the database terminologies and authority lists, used as business controls in the database. The construction of these terminologies has a revered place in museum practice. Museums as they are today emerged from the enlightenment project to categorise and bring order to the world. More on the consequences of this later, but for now it was useful to have a series of reference terms for the types of objects in the gallery, the cultures they came from, the people, places and materials etc. 

I had learnt about GATE and participated in the week's training course in 2015, when I first became interested in and aware of the potential for Natural Language Processing as a way of managing and getting the most out of the vast and often messy data holdings in museums. 

My hope was that the terminologies and authorities in our collections database could serve as gazetteers for gazetteer-based entity recognition in GATE. The terminology entities from the database-generated gazetteers would be matched in the wiki texts and rendered as hyperlinks to reference pages for the entities on our website.

This worked pretty well, and we released over 500 wiki pages of marked up text, with new pages continuing to come on line. The gazetteer matching, though, was only accurate enough to be suggestive, with many strings appearing in multiple gazetteers (people’s names were particularly difficult). I had been wanting an excuse to explore the machine learning potential in GATE and this seemed like an opportunity, so I came up to Sheffield for an additional day’s training (thank you Xingyi) in early March 2020, and came away with a pipeline that used Machine Learning to identify term types independently of the gazetteers, which could then be built into a set of rules that improved the gazetteer identification significantly. The annotations produced were still checked prior to publication, but with considerably fewer adjustments required.

The Gazetteer Pipeline developed in GATE

The next experiment was to run the machine learning enhanced gazetteer pipeline over a set of gallery texts for an older exhibition. This produced a lot of matches/links, and should we publish these texts online, they will appear with in-line links to terms already in use in our Mimsy and the World Gallery Wiki texts, so becoming an integrated part of the web of linked terms and texts.

The Machine Learning pipeline built in GATE

Another very welcome outcome of this process was that the pipeline identified a number of terms that were not in our gazetteers and which became suggested new terms for our terminologies, demonstrating GATE’s ability to create as well as identify terminology, and it is this function that we are now looking to exploit in a new project.

Decolonisation of Museum Collections

In 2019 the Horniman was appointed by the Department of Culture Media and Sport (DCMS) to lead a group of museums in developing new collecting and interpretation practice addressing the historic and ongoing cultural impact of the UK as a colonising power.  The terminology that museums use about their collections is very much a subject of interest to museums seeking to decolonise their collections. As mentioned before, the creation and application of categories has been fundamental to museum practice since museums emerged as knowledge organisations in the 18th century. It has now become painfully clear, however, that these categories have been created and applied with the same scant regard for the rights and culture of the people who made and used the items to which they have been applied as the ‘collecting’ of them. That is to say, at best rudely and at worst violently. 

We are currently building a mechanism, again based on a wiki and GATE, whereby new and existing texts authored by the communities who made and used the items in the museum collection can also be marked up by those communities to make learning corpora. A machine learning pipeline will then build new terminologies to be applied to the items that the communities made and used. This is not only decolonising but democratising as it gives value to texts by any members of a community, not just cultural academics or other specialists, in many media including social media.

The GATE tool with its modular architecture has enabled me to take an experimental and incremental approach to accessing advanced NLP tools, despite not being an NLP or even a computer expert. That it is open source and supported by an active user community makes it ideal for the Cultural Heritage sector which otherwise lacks the funding, the confidence and the expertise to access the powerful NLP techniques and all they offer for the redirecting of museum interpretation away from expert exposition towards a truly democratic and decolonised future. 

Monday 24 February 2020

Online Abuse toward Candidates during the UK General Election 2019

In this blog post I’m going to discuss the 2019 UK general election and the increase in abuse aimed at politicians online. We collected 4.2 million tweets sent to or from election candidates in the six week period spanning from the start of November until shortly after the December 12th election. The graph above shows who received the most abuse up to and including December 14th, with Boris Johnson and Jeremy Corbyn receiving the most by far.

The 2016 "Brexit" referendum left the parliament and the nation divided. Since then we have seen two general elections, and two Prime Ministers jostle to strengthen their majority and improve their negotiating position with the EU. National feeling has never been so polarised and it will come as no surprise that with the social changes brought about through the rise of social media, abuse towards politicians in the UK has increased. Using natural language processing we can identify abuse and type it according to whether it is political, sexist or simply generic abuse.

Our work investigates a large tweet collection on which natural language processing has been performed in order to identify abusive language, the politicians it is targeted at and the topics in the politician’s original tweet that tend to trigger abusive replies, thus enabling large scale quantitative analysis. A list of slurs, offensive words and potentially sensitive identity markers was used. The slurs list contained 1081 abusive terms or short phrases in British and American English, comprising mostly an extensive collection of insults, racist and homophobic slurs, as well as terms that denigrate a person’s appearance or intelligence, gathered from sources that include Farrell et al [2].


Tweets were collected in real-time using Twitter’s streaming API. We began collecting immediately for any candidate who had been entered into Democracy Club’s database [10] and who had a Twitter account. We used the API to follow the accounts of all candidates over the campaign period. This means we collected all the tweets sent by each candidate, any replies to those tweets, and any retweets either made by the candidate or of the candidate’s own tweets. Note that this approach does not collect all tweets which an individual would see in their timeline, as it does not include those in which they are just mentioned. We took this approach because the results are more reliable: replies are directed at the politician who authored the tweet, and thus any abusive language is more likely to be directed at them. Ethics approval was granted to collect the data through application 25371 at the University of Sheffield.


Table 1 gives overall statistics for the research period, which contains a total of 184,014 candidate-authored original tweets, 334,952 retweets and 131,292 replies. We found 3,541,769 replies to politicians, of which 4.46% contained abuse. The second row gives similar statistics for the 2017 general election period. It is evident that the level of abuse received by political candidates has risen in the intervening two and a half years. 

In terms of representation in the sample of election candidates with Twitter accounts, gender balance is skewed heavily in favour of men for the Conservatives and LibDems; Labour in contrast had more female/non-binary than male candidates. Most abuse is aimed at Jeremy Corbyn and Boris Johnson, with Matthew Hancock, Jacob Rees-Mogg, Jo Swinson, Michael Gove, David Lammy and James Cleverly also receiving substantial abuse. Michael Gove received a great deal of personal abuse following the climate debate. Jo Swinson received the most sexist abuse.

Table 1. Columns: Original MP tweets; MP retweets; Replies to MPs; Abusive replies to MPs. Rows: 3 Nov–15 Dec 2019; 29 Apr–9 Jun 2017.

Who is getting abuse?

The topic of Brexit draws abuse for all three parties. Conservative candidates initially move away from this, toward their safer topic of taxation, before returning to Brexit. Liberal Democrats continue to focus on Brexit despite receiving abuse. Labour candidates consistently don’t focus on Brexit; public health is a safe topic for Labour. 

Levels of abuse increased in the run up to the election. The figure below highlights the number of abusive tweets received by the three major parties. There is a considerable spike for both Labour and the Conservatives in the week prior to the election.

In the graph below we compare the average abuse per month received by MPs who did not stand again with that received by those who did choose to stand again. MPs who stood down received more abuse than those who chose to stand again in all but one month in the first half of 2019, and in June they received over 50% more abuse.


Between November 3rd and December 15th, we found 157,844 abusive replies to candidates’ tweets (4.44% of all replies received). This is a low estimate, probably around half of the actual abusive tweets. Overall, abuse levels climbed week on week in November and early December, as the election campaign progressed, from 17,854 in the first week to 41,421 in the week of the December 12th election. The escalation in abuse was toward Conservative candidates specifically, with abuse levels towards candidates from the other two main parties remaining stable week on week; however, after Labour’s decisive defeat, their candidates were subjected to a spike in abuse. Abuse levels are not constant; abuse is triggered by external events (e.g. leadership debates) or controversial tweets by the candidates. Abuse levels have also broadly climbed month on month over the year, and by volume were more than double in November what they were in January.

Wednesday 6 November 2019

Which MPs changed party affiliation, 2017-2019

As part of our work tracking Twitter abuse towards MPs and candidates going into the December 12th general election I've been updating our data files regarding party membership. I thought you might be interested to see the result!