Tuesday 20 October 2020

From Entity Recognition to Ethical Recognition: a museum terminology journey


This guest blog post from Jonathan Whitson Cloud tells the story of "how a relatively simple entity recognition project at the Horniman Museum has, thanks to the range and flexibility of tools available in GATE, opened the door to a method for the democratisation and decolonisation of terminology in Museums."
In 2018 the Horniman Museum opened a new long-term display called the World Gallery. As is usual with museum displays, there was only a very limited amount of space for text giving context to the more than 3,000 items in the cases. As is also now usual, the Horniman looked to its website to share more of the research and stories the curators had unearthed in the six-year gestation of the gallery.

The Horniman World Gallery

Entity Recognition

Central to the ambition for the web content was a desire to bridge the gap between the database that the museum uses to record its collections and the narrative and research texts recorded in a wiki. The link would be the database terminologies and authority lists, used as business controls in the database. The construction of these terminologies has a revered place in museum practice: museums as they are today emerged from the Enlightenment project to categorise and bring order to the world. More on the consequences of this later, but for now it was useful to have a series of reference terms for the types of objects in the gallery, the cultures they came from, the people, places, materials and so on.

I had learnt about GATE and participated in the week-long training course in 2015, when I first became aware of the potential of Natural Language Processing as a way of managing and getting the most out of the vast and often messy data holdings in museums.

My hope was that the terminologies and authorities in our collections database could serve as gazetteers for gazetteer-based entity recognition in GATE. The terminology entities from the database-generated gazetteers would be matched in the wiki texts and rendered as hyperlinks to reference pages for the entities on our website.
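As a rough illustration of that first step, the sketch below shows how terminology tables might be exported into GATE's plain-text gazetteer format: one term per line in a .lst file, plus a lists.def entry mapping each list to a majorType. The table and column names are invented for the example, not the actual schema of our collections database.

```python
# Illustrative sketch only: export terminology tables to GATE gazetteer lists.
# Table/column names and the SQLite database are placeholders, not the real schema.
import sqlite3

TERMINOLOGIES = {
    "object_names": "object_name",
    "cultures": "culture",
    "materials": "material",
}

conn = sqlite3.connect("collections.db")  # placeholder database

with open("lists.def", "w", encoding="utf-8") as def_file:
    for table, major_type in TERMINOLOGIES.items():
        terms = [row[0] for row in conn.execute(f"SELECT term FROM {table}")]
        list_name = f"{major_type}.lst"
        # GATE gazetteer list files contain one entry per line
        with open(list_name, "w", encoding="utf-8") as lst:
            lst.write("\n".join(sorted(set(terms))) + "\n")
        # lists.def format: <listfile>:<majorType>[:<minorType>]
        def_file.write(f"{list_name}:{major_type}\n")
```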

This worked pretty well, and we released over 500 wiki pages of marked-up text, with new pages continuing to come on line. The gazetteer matching, though, was only accurate enough to be suggestive, with many strings appearing in multiple gazetteers (people’s names were particularly difficult). I had been wanting an excuse to explore the machine learning potential in GATE, and this seemed like an opportunity, so I came up to Sheffield for an additional day’s training (thank you Xingyi) in early March 2020. I came away with a pipeline that used machine learning to identify term types independently of the gazetteers; its predictions could then be combined with a set of rules that improved the gazetteer identification significantly. The annotations produced were still checked prior to publication, but with considerably fewer adjustments required.
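To give a flavour of the kind of rule involved, here is a minimal, illustrative sketch rather than the actual GATE pipeline (the annotation structures and type names are invented): a gazetteer match is kept only when a machine-learning annotation over the same span agrees on the term type, which is enough to resolve strings that appear in several gazetteers.

```python
# Minimal sketch, assuming simplified annotation objects rather than GATE's own.
from dataclasses import dataclass

@dataclass
class Ann:
    start: int
    end: int
    type: str  # e.g. "culture", "material", "person"

def reconcile(gazetteer_anns, ml_anns):
    """Keep gazetteer matches whose span overlaps an ML annotation of the
    same type, so ambiguous strings are resolved by the ML prediction."""
    kept = []
    for g in gazetteer_anns:
        for m in ml_anns:
            overlaps = g.start < m.end and m.start < g.end
            if overlaps and g.type == m.type:
                kept.append(g)
                break
    return kept

# Example: "Benin" appears in both the culture and place gazetteers;
# the ML annotation decides which reading survives.
gaz = [Ann(10, 15, "culture"), Ann(10, 15, "place")]
ml = [Ann(10, 15, "culture")]
print(reconcile(gaz, ml))  # -> [Ann(start=10, end=15, type='culture')]
```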

The Gazetteer Pipeline developed in GATE

The next experiment was to run the machine-learning-enhanced gazetteer pipeline over a set of gallery texts for an older exhibition. This produced a lot of matches and links, and should we publish these texts online, they would appear with in-line links to terms already in use in our Mimsy collections database and the World Gallery wiki texts, becoming an integrated part of the web of linked terms and texts.
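The rendering step itself is straightforward; the sketch below shows one way matched term annotations could be turned into in-line hyperlinks, with an invented URL pattern standing in for the real reference-page addresses on our website.

```python
# Sketch only: wrap matched spans in links to term reference pages.
# The URL pattern "/objects/terms/<id>" is hypothetical.
def render_links(text, anns):
    """anns: list of (start, end, term_id); returns HTML with each
    matched span wrapped in a link to its reference page."""
    out, pos = [], 0
    for start, end, term_id in sorted(anns):
        out.append(text[pos:start])
        out.append(f'<a href="/objects/terms/{term_id}">{text[start:end]}</a>')
        pos = end
    out.append(text[pos:])
    return "".join(out)

print(render_links("A Yoruba brass plaque", [(2, 8, "yoruba"), (9, 14, "brass")]))
```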

The Machine Learning pipeline built in GATE


Another very welcome outcome of this process was that the pipeline identified a number of terms that were not in our gazetteers, which became suggested new terms for our terminologies. This demonstrated GATE’s ability to create as well as identify terminology, and it is this function that we are now looking to exploit in a new project.
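Conceptually, the suggestion step is just a comparison of what the machine-learning annotations found against what the gazetteers already contain. The sketch below, with made-up example data, illustrates the idea.

```python
# Illustrative sketch: surface ML-tagged strings that are absent from the
# existing gazetteers as candidate new terminology entries.
def suggest_new_terms(ml_anns, text, gazetteer_terms):
    """ml_anns: list of (start, end, type); gazetteer_terms: set of
    lower-cased known terms. Returns {type: set of new surface strings}."""
    suggestions = {}
    for start, end, ann_type in ml_anns:
        surface = text[start:end]
        if surface.lower() not in gazetteer_terms:
            suggestions.setdefault(ann_type, set()).add(surface)
    return suggestions

known = {"benin", "brass", "yoruba"}
text = "A carved ivory armlet from the Edo people of Benin."
anns = [(9, 14, "material"), (31, 34, "culture"), (45, 50, "culture")]
print(suggest_new_terms(anns, text, known))
# -> {'material': {'ivory'}, 'culture': {'Edo'}}
```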

Decolonisation of Museum Collections

In 2019 the Horniman was appointed by the Department for Digital, Culture, Media and Sport (DCMS) to lead a group of museums in developing new collecting and interpretation practice addressing the historic and ongoing cultural impact of the UK as a colonising power. The terminology that museums use about their collections is very much a subject of interest to museums seeking to decolonise their collections. As mentioned before, the creation and application of categories has been fundamental to museum practice since museums emerged as knowledge organisations in the 18th century. It has now become painfully clear, however, that these categories were created and applied with the same scant regard for the rights and culture of the people who made and used the items as the ‘collecting’ of those items: that is to say, at best rudely and at worst violently.

We are currently building a mechanism, again based on a wiki and GATE, whereby new and existing texts authored by the communities who made and used the items in the museum collection can also be marked up by those communities to make learning corpora. A machine learning pipeline will then build new terminologies to be applied to the items that those communities made and used. This is not only decolonising but democratising, as it gives value to texts by any member of a community, not just cultural academics or other specialists, and in many media, including social media.

GATE, with its modular architecture, has enabled me to take an experimental and incremental approach to advanced NLP tools, despite my being neither an NLP expert nor even a computing one. That it is open source and supported by an active user community makes it ideal for the cultural heritage sector, which otherwise lacks the funding, the confidence and the expertise to access these powerful NLP techniques and all they offer for redirecting museum interpretation away from expert exposition towards a truly democratic and decolonised future.