
Wednesday, 3 August 2022

Populate a Corpus from a List of URLs

GATE provides support for loading numerous different document formats, as well as a number of ways of populating corpora. Until recently, however, we've not offered any way of populating a corpus from a simple list of URLs. Worse, even though it's now quite easy to do this in GATE, it's unlikely you would come across the option by accident.

The support for this is actually hidden away inside the "Format CSV" plugin (you'll need version 8.7 or above), and in GATE Developer it is exposed through the "Populate from CSV file..." option in the context menu of a corpus.
In this screenshot I've configured the populator ready to build a corpus from a simple text file with one URL per line. The important settings are:
  • Column Separator is set to "\t". This means we are using a tab character as the column separator. We use a tab simply because a tab can't appear in a URL, whereas a comma can, and we don't want our URLs split in half.
  • Document Content is in column 0. We always count columns (or almost anything) starting from 0, so this just ensures we use the URL as the document content.
  • Create one document per row is selected. The next option isn't available unless this is selected first, as it makes no sense to try to load multiple URLs into a single GATE document.
  • Cell contains document URL is selected. This is the new feature which makes this trick possible. Essentially it looks at the contents of a cell and, if it can be interpreted as a URL, creates a document from the contents of the URL; otherwise it uses the cell content as normal to build the document.
Once configured it's simply a case of selecting your text file, one URL per line, and hitting the OK button. Be aware that there is currently no rate limiting so be careful if you are listing a lot of URLs from a single domain etc. You may also want to combine this with the cookie trick from the previous post to ensure you get the correct content from each of the URLs.

Of course, while this post has been about populating a corpus from a simple list of URLs, you can use more complex CSV or TSV files which happen to contain URLs in one column. In that case the details from the other columns will be added as document features.
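If you're working with GATE Embedded rather than GATE Developer, you can get much the same effect programmatically, without the CSV populator, by reading the file yourself and creating one document per URL. Here's a minimal sketch under that assumption (the file name urls.txt and the class name are illustrative):

import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Paths;

import gate.Corpus;
import gate.Document;
import gate.Factory;
import gate.Gate;

public class PopulateFromUrlList {
  public static void main(String[] args) throws Exception {
    Gate.init(); // initialise GATE before creating any resources

    Corpus corpus = Factory.newCorpus("from-url-list");

    // one URL per line, mirroring the text file used by the populator
    for (String line : Files.readAllLines(Paths.get("urls.txt"))) {
      if (line.trim().isEmpty()) continue;
      // fetches the page and applies GATE's normal format detection
      Document doc = Factory.newDocument(new URL(line.trim()));
      corpus.add(doc);
    }
  }
}

Note that, unlike the populator, this sketch doesn't set document features from other columns, and the same caveat about rate limiting applies.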

Wednesday, 2 March 2022

GATE and the Cookie Jar

One of the useful features of GATE is that documents can be loaded directly from the web as well as from local files. This is especially useful for pages which update frequently and which you might want to process repeatedly. While using this feature recently we came across some pages that refused to load correctly. The page loaded fine in a web browser but returned a 403 Forbidden response when accessed via GATE.

After a bit of debugging it turned out that this issue was related to cookies. The specific URL we were trying to load went through a number of redirects before ending up at the final page. The problem was that the first redirect set a cookie, and that needed to be present for the further redirects to work. By default Java, and hence GATE, doesn't maintain cookies across requests, as each connection is handled independently.

If you are using GATE in an embedded context, then it is trivial to add support for cookies using the default Java cookie handler. This is a JVM level setting so once configured in your own code, all requests made by GATE to load documents will also gain support for handling cookies. The entire solution is the following single line of code:

// JVM-wide setting: all later URL connections in this JVM share an in-memory cookie store
java.net.CookieHandler.setDefault(new java.net.CookieManager());


The problem we faced, though, was that we wanted to be able to load documents that required cookies from within GATE Developer, and that required a little more thought. Whilst we could have just added the code to GATE, there are a number of reasons not to (details of which are outside the scope of this blog post), and I wanted to make it easier for all existing GATE users to be able to use cookies without needing to upgrade. The answer is the rather versatile Groovy plugin.

If you load the Groovy plugin into GATE Developer you can then access the Groovy Console from within the tools menu. Simply pasting that single line of code into the console and executing it is enough to add the cookie support within that instance of GATE. It's slightly annoying that it won't persist across multiple instances of GATE, but as it's such a simple trick hopefully it's easy enough to apply when needed.
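One caveat worth knowing about (and checking against your own problem URLs): by default CookieManager uses the ACCEPT_ORIGINAL_SERVER policy, which only accepts cookies set by the host you originally requested. If a redirect chain hops across hosts, you may need the more permissive policy, set with a slightly longer version of the same trick:

import java.net.CookieHandler;
import java.net.CookieManager;
import java.net.CookiePolicy;

// accept cookies from any host, not just the one originally requested
CookieManager manager = new CookieManager();
manager.setCookiePolicy(CookiePolicy.ACCEPT_ALL);
CookieHandler.setDefault(manager);

This works equally well pasted into the Groovy Console, which happily accepts the Java syntax.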

Tuesday, 20 October 2020

From Entity Recognition to Ethical Recognition: a museum terminology journey


This guest blog post from Jonathan Whitson Cloud tells the story of "how a relatively simple entity recognition project at the Horniman Museum has, thanks to the range and flexibility of tools available in GATE, opened the door to a method for the democratisation and decolonisation of terminology in Museums."
In 2018 the Horniman Museum opened a new long-term display called the World Gallery. As is usual with museum displays, there was only a very limited amount of space for text giving context to the over 3,000 items in the cases. As is also now usual, the Horniman looked to its website to share more of the research and stories the curators had unearthed in the six-year gestation of the gallery.

The Horniman World Gallery

Entity Recognition

Central to the ambition for the web content was a desire to bridge the gap between the database that the museum uses to record its collections and the narrative and research texts recorded in a wiki. The link would be the database terminologies and authority lists, used as business controls in the database. The construction of these terminologies has a revered place in museum practice. Museums as they are today emerged from the enlightenment project to categorise and bring order to the world. More on the consequences of this later, but for now it was useful to have a series of reference terms for the types of objects in the gallery, the cultures they came from, the people, places and materials etc. 

I had learnt about GATE and participated in the week's training course in 2015, when I first became aware of the potential of Natural Language Processing as a way of managing, and getting the most out of, the vast and often messy data holdings in museums.

My hope was that the terminologies and authorities in our collections database could serve as gazetteers for gazetteer-based entity recognition in GATE. The terminology entities from the database-generated gazetteers would be matched in the wiki texts and rendered as hyperlinks to reference pages for the entities on our website.
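For readers who haven't used GATE's gazetteers: a gazetteer is driven by a definition file that maps plain-text term lists to types, with one term per line in each list. A minimal sketch of what database-generated files like these might look like (the file names, types, and terms are invented for illustration, not the Horniman's actual terminologies):

lists.def:
  cultures.lst:culture
  materials.lst:material

materials.lst:
  bamboo
  mother of pearl

Each match then produces a Lookup annotation whose majorType records which terminology it came from, which is what lets the wiki text link back to the right reference page.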

This worked pretty well, and we released over 500 wiki pages of marked up text, with new pages continuing to come on line. The gazetteer matching, though, was only accurate enough to be suggestive, with many strings appearing in multiple gazetteers (people's names were particularly difficult). I had been wanting an excuse to explore the machine learning potential in GATE, and this seemed like an opportunity. So I came up to Sheffield for an additional day's training (thank you Xingyi) in early March 2020, and came away with a pipeline that used machine learning to identify term types independently of the gazetteers. These predictions could then be built into a set of rules that improved the gazetteer identification significantly. The annotations produced were still checked prior to publication, but with considerably fewer adjustments required.

The Gazetteer Pipeline developed in GATE

The next experiment was to run the machine learning enhanced gazetteer pipeline over a set of gallery texts for an older exhibition. This produced a lot of matches/links; should we publish these texts online, they will appear with in-line links to terms already in use in our Mimsy database and the World Gallery wiki texts, becoming an integrated part of the web of linked terms and texts.

The Machine Learning pipeline built in GATE


Another very welcome outcome of this process was that the pipeline identified a number of terms that were not in our gazetteers, which became suggested new terms for our terminologies. This demonstrated GATE's ability to create as well as identify terminology, and it is this function that we are now looking to exploit in a new project.

Decolonisation of Museum Collections

In 2019 the Horniman was appointed by the Department for Digital, Culture, Media and Sport (DCMS) to lead a group of museums in developing new collecting and interpretation practice addressing the historic and ongoing cultural impact of the UK as a colonising power. The terminology that museums use about their collections is very much a subject of interest to museums seeking to decolonise their collections. As mentioned before, the creation and application of categories has been fundamental to museum practice since museums emerged as knowledge organisations in the 18th century. It has now become painfully clear, however, that these categories have been created and applied with the same scant regard for the rights and culture of the people who made and used the items to which they have been applied as the 'collecting' of them. That is to say, at best rudely and at worst violently.

We are currently building a mechanism, again based on a wiki and GATE, whereby new and existing texts authored by the communities who made and used the items in the museum collection can also be marked up by those communities to make learning corpora. A machine learning pipeline will then build new terminologies to be applied to the items that the communities made and used. This is not only decolonising but democratising as it gives value to texts by any members of a community, not just cultural academics or other specialists, in many media including social media.

The GATE tool with its modular architecture has enabled me to take an experimental and incremental approach to accessing advanced NLP tools, despite not being an NLP or even a computer expert. That it is open source and supported by an active user community makes it ideal for the Cultural Heritage sector which otherwise lacks the funding, the confidence and the expertise to access the powerful NLP techniques and all they offer for the redirecting of museum interpretation away from expert exposition towards a truly democratic and decolonised future. 

Friday, 12 July 2019

Using GATE to drive robots at Headstart 2019


In collaboration with Headstart (a charitable trust that provides hands-on science, engineering and maths taster courses), the Department of Computer Science has just run its fourth annual summer school for maths and science A-level students. This residential course ran from 8 to 12 July 2019 and included practical work in computer programming, Lego robots, and project development as well as tours of the campus and talks about the industry.

For the third year in a row, we have included a section on natural language processing using GATE Developer and a special GATE plugin (which uses the ShefRobot library available from GitHub) that allows JAPE rules to operate the Lego robots.  As before, we provided the students with a starter GATE application (essentially the same as in last year's course) containing just enough gazetteer entries, JAPE, and sample code to let them tweet variations like "turn left" and "take a left" to make the robot do just that.  We also used the GATE Cloud Twitter Collector, which we modified to run locally so the students could set it up on a lab computer to follow their own Twitter accounts and process their tweets through the GATE application, sending commands to the robots when the JAPE rules match.


Based on lessons learned from the previous years, we put more effort into improving the instructions and the Twitter Collector software to help them get it running faster.  This time the first robot started moving under GATE's control less than 40 minutes from the start of the presentation, and the students rapidly progressed with the development of additional rules and then tweeting commands to their robots.



The structure and broader coverage of this year's course meant that the students had more resources available and a more open project assignment, so not all of them chose to use GATE in their projects, but it was much easier and more streamlined for them to use than in previous years.







This year 42 students (14 female; 28 male) from around the UK attended the Computer Science Headstart Summer School.
Geography of male students

Geography of female students

The handout and slides are publicly available from the GATE website, which also hosts GATE Developer and other software products in the GATE family.  Source code is available from our GitHub site.  

GATE Cloud development is supported by the European Union’s Horizon 2020 research and innovation programme under grant agreement No 654024 (the SoBigData project).


Wednesday, 3 July 2019

12th GATE Summer School (17-21 June 2019)


12th GATE Training Course: open-source natural language processing with an emphasis on social media

For over a decade, the GATE team has provided an annual course in using our technology. The course content and track options have changed a bit over the years, but it always includes material to help novices get started with GATE as well as introductory and more advanced use of the JAPE language for matching patterns of document annotations.

The latest course also included machine learning, crowdsourcing, sentiment analysis, and an optional programming module (aimed mainly at Java programmers to help them embed GATE libraries, applications, and resources in web services and other "behind the scenes" processing).  We have also added examples and new tools in GATE to cover the increasing demand for getting data out of and back into spreadsheets, and updated our work on social media analysis, another growing field.
Information in "feral databases" (spreadsheets)
We also disseminated work from several current research projects.
Semantics in scientometrics

  • From KNOWMAK and RISIS, we presented our work on using semantic technologies in scientometrics, by applying NLP and ontologies to document categorization in order to contribute to a searchable knowledge base that allows users to find aggregate and specific data about scientific publications, patents, and research projects by geography, category, etc.
  • Much of our recent work on social media analysis, including opinion mining and abuse detection and measurement, has been done as part of the SoBigData project.
  • The increasing range of tools for languages other than English links with our participation in the European Language Grid, which has also supported further development of GATE Cloud, our platform for text analytics as a service.
Conditional processing of multilingual documents

Processing German in GATE
The GATE software distributions, documentation, and training materials from our courses can all be downloaded from our website under open licences. Source code is also available from our GitHub page.

Acknowledgements

The course included research funded by the European Union's Horizon 2020 research and innovation programme under grant agreements No. 726992 (KNOWMAK), No. 654024 (SoBigData), No. 824091 (RISIS), and No. 825627 (European Language Grid); by the Free Press Unlimited pilot project "Developing a database for the improved collection and systematisation of information on incidents of violations against journalists"; by EPSRC grant EP/I004327/1; by the British Academy under the call "The Humanities and Social Sciences: Tackling the UK’s International Challenges"; and by Nesta.

Thursday, 27 June 2019

GATE's submission wins 2nd place in United Nations General Assembly Resolutions Extraction and Elicitation Global Challenge

In May 2019 we submitted a prototype to the United Nations General Assembly Resolutions Extraction and Elicitation Global Challenge, which asked for submissions using mature natural language processing techniques to produce semantically enhanced, machine-readable documents from PDFs of UN GA resolutions, with particular interest in identifying named entities and items in certain thesauri and ontologies and in making use of the document structure (in particular, the distinction between preamble and operative paragraphs).

Our prototype included a customized GATE application designed to read PDFs of United Nations General Assembly resolutions and identify named entities, resolution adoption information (resolution number and adoption date), preamble sections, operative sections, and references to keywords and phrases in the English parts of the UN Bibliographical Information System thesaurus.

We downloaded and automatically annotated over 2,800 resolution documents and pushed the results into a Mímir index to allow semantic search using combinations of the entities and sections identified, such as the following (more examples are provided in the documentation that we submitted; a sketch of what such queries look like in Mímir's query language follows the list):
  • find sentences in operative paragraphs containing a person and an UNBIS term;
  • find preamble paragraphs containing a person, an organization, and a date;
  • find combinations referring to a specific UNBIS term.
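As a flavour of the syntax (the annotation type names here are illustrative rather than the exact ones in our index, and operator precedence is worth checking against the Mímir documentation), the first of these might be written along the lines of:

({Sentence} OVER ({Person} AND {UNBISTerm})) IN {OperativeParagraph}

where OVER keeps hits of the left operand that contain a hit of the right, and IN keeps hits of the left operand that fall inside a hit of the right.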

We also developed an easier-to-use web front end for exploring co-occurrences of keywords and semantic annotations.


We are excited to receive the second place award, along with an invitation to improve our work with more feedback and a "lessons learned" discussion with the panel. The panel highlighted in particular the submission of comprehensive and testable code, and the use of GATE as a mature, respected framework.


Our GitHub site contains the information extraction and search front end software, licensed under the GPL-3.0 and available for anyone to download and use.

Monday, 11 March 2019

Coming Up: 12th GATE Summer School 17-21 June 2019

It is approaching that time of the year again! The GATE training course will be held from 17-21 June 2019 at the University of Sheffield, UK.

No previous experience or programming expertise is necessary, so it's suitable for anyone with an interest in text mining and using GATE, including people from humanities backgrounds, social sciences, etc.

This event will follow a similar format to that of the 2018 course, with one track Monday to Thursday, and two parallel tracks on Friday, all delivered by the GATE development team. You can read more about it and register here. Early bird registration is available at a discounted rate until 1 May.

The focus will be on mining text and social media content with GATE. Many of the hands-on exercises will be focused on analysing news articles, tweets, and other textual content.

The planned schedule is as follows (NOTE: may still be subject to timetabling changes).
Single track from Monday to Thursday (9am - 5pm):
  • Monday: Module 1: Basic Information Extraction with GATE
    • Intro to GATE + Information Extraction (IE)
    • Corpus Annotation and Evaluation
    • Writing Information Extraction Patterns with JAPE
  • Tuesday: Module 2: Using GATE for social media analysis
    • Challenges for analysing social media, GATE for social media
    • Twitter intro + JSON structure
    • Language identification, tokenisation for Twitter
    • POS tagging and Information Extraction for Twitter
  • Wednesday: Module 3: Crowdsourcing, GATE Cloud/MIMIR, and Machine Learning
    • Crowdsourcing annotated social media content with the GATE crowdsourcing plugin
    • GATE Cloud, deploying your own IE pipeline at scale (how to process 5 million tweets in 30 mins)
    • GATE Mimir - how to index and search semantically annotated social media streams
    • Challenges of opinion mining in social media
    • Training Machine Learning Models for IE in GATE
  • Thursday: Module 4: Advanced IE and Opinion Mining in GATE
    • Advanced Information Extraction
    • Useful GATE components (plugins)
    • Opinion mining components and applications in GATE
On Friday, there is a choice of modules (9am - 5pm):
  • Module 5: GATE for developers
    • Basic GATE Embedded
    • Writing your own plugin
    • GATE in production - multi-threading, web applications, etc.
  • Module 6: GATE Applications
    • Building your own applications
    • Examples of some current GATE applications: social media summarisation, visualisation, Linked Open Data for IE, and more
These two modules are run in parallel, so you can only attend one of them. You will need to have some programming experience and knowledge of Java to follow Module 5 on the Friday. No particular expertise is needed for Module 6.
Hope to see you in Sheffield in June!

Monday, 17 December 2018

Open Call for SoBigData-funded Transnational Access!

The SoBigData project invites researchers and professionals to apply to participate in Short-Term Scientific Missions (STSMs) to carry forward their own big data projects. The Natural Language Processing (NLP) group at the University of Sheffield is taking part in this initiative and welcomes all applications.

Funding is available for STSMs (2 weeks to 2 months) of up to 4500 euros, covering daily subsistence, accommodation and flights. These bursaries are awarded on a competitive basis.

Research areas are varied but include studies involving societal debate, online misinformation and rumour analysis. A key topic is analysis of social media and newspaper articles to understand the state of public debate in terms of what is being discussed, how it is being discussed, who is discussing it, and how this discussion is being influenced. The effects of online disinformation campaigns (especially hyper-partisan content) and the use of bot accounts to perpetrate this disinformation are also of particular interest.

Applications are welcomed for visits between 1 November 2018 and 31 July 2019!

For specific details, eligibility criteria, and to apply, click here!

Thursday, 22 November 2018

Adapted TextRank for Term Extraction: A Generic Method of Improving Automatic Term Extraction Algorithms

Zhang, Z., Petrak, J. & Maynard, D. (2018). Adapted TextRank for Term Extraction: A Generic Method of Improving Automatic Term Extraction Algorithms. In SEMANTiCS 2018 – 14th International Conference on Semantic Systems.

This work has been carried out in the context of the EU KNOWMAK project, where we're developing tools for multi-topic classification of text against an ontology, in order to attempt to map the state of European research output in key technologies.

Automatic Term Extraction (ATE) is a fundamental technique used in computational linguistics for recognising terms in text. Processing the collected terms in a text is a key step in understanding the content of the text.  There are many different ATE methods, but these all tend to work well only in one specific domain.  In other words, there is no universal method which produces consistently good results, and so we have to choose an appropriate method for the domain being targeted.

In this work, we have developed a novel method for ATE which addresses two major limitations: the fact that no single ATE method consistently performs well across all domains, and the fact that the majority of ATE methods are unsupervised. Our generic method, AdaText, improves the accuracy of existing ATE methods, using existing lexical resources to support them, by revising the TextRank algorithm.
After being given a target text, AdaText:
  1. Selects a subset of words based on their semantic relatedness to a set of seed words or phrases relevant to the domain, but not necessarily representative of the terms within the target text. 
  2. It then applies an adapted TextRank algorithm to create a graph for these words, and computes a text-level TextRank score for each selected word (the underlying recurrence is shown after this list). 
  3. Finally, these scores are used to revise the score of a term candidate previously computed by an ATE method. 
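For context, the score in step 2 builds on the classic TextRank recurrence of Mihalcea and Tarau (2004), computed iteratively over the word graph until convergence; AdaText's precise revision of it is described in the paper:

\[ WS(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{jk}} \, WS(V_j) \]

where d is the usual damping factor (typically 0.85) and w_{ji} is the weight of the edge from word V_j to word V_i.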
This technique was trialled using a variety of parameters (such as the threshold of semantic similarity used to select words, as described in step 1) over two distinct datasets (GENIA and ACLv2, comprising Medline abstracts and ACL abstracts respectively). We also tested it with a wide variety of state-of-the-art ATE methods, including modified TFIDF, CValue, Basic, RAKE, Weirdness, LinkProbability, X2, GlossEx and PositiveUnlabeled.




The figures show a sample of performances on different datasets using different ATE techniques. The base performance of the ATE method is represented by the black horizontal line. The horizontal axis represents the semantic similarity threshold used in step 1. The vertical axis shows average P@K for all five Ks considered.

This new generic combination approach can consistently improve the performance of the ATE method by 25 points, which is a significant increase. However, there is still room for improvement. In future work, we aim to optimise the selection of words from the TextRank graph, work on expanding TextRank to a graph of both words and phrases, and to explore how the size and source of the seed lexicon affects the performance of AdaText.  



Wednesday, 5 September 2018

Students use GATE and Twitter to drive Lego robots—again!

At the university's Headstart Summer School in July 2018, 42 secondary school students (age 16 and 17) from all over the UK (see below for maps) were taught to write Java programs to control Lego robots, using input from the robots (such as the sensor for detecting coloured marks on the floor) as well as operating the motors to move and turn.  The Department of Computer Science provided a Java library for driving the robots and taught the students to use it.

After they had successfully operated the robots, we ran a practical session on 10 and 11 July on "Controlling Robots with Tweets".  We presented a quick introduction to natural language processing (using computer programs to analyse human languages, such as English) and provided them with a bundle of software containing a version of the GATE Cloud Twitter Collector modified to run a special GATE application with a custom plugin to use the Java robot library to control the robots.

The bundle came with a simple "gazetteer" containing two lists of keywords:

left    turn
----    ----
left    turn
port    take
        make
        move

and a basic JAPE grammar (set of rules) to make use of it.  JAPE is a specialized programming language used in GATE to match regular expressions over annotations in documents, such as the "Lookup" annotations created whenever the gazetteer finds a matching keyword in a document. (The annotations are similar to XML tags, except that GATE applications can create them as well as read them and they can overlap each other without restrictions.  Technically they form an annotation graph.)



The sample rule we provided would match any keyword from the "turn" list followed by any keyword from the "left" list (with optional other words in between, so that "turn to port", "take a left", "turn left" all work the same way) and then run the code to turn the robot's right motor (making it turn left in place).
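For those who haven't met JAPE, the provided rule looked roughly like this (a sketch: the list names match the gazetteer table above, but the robot call on the right-hand side is a stand-in for the actual ShefRobot API rather than its real method names):

Phase: RobotCommands
Input: Lookup Token
Options: control = appelt

Rule: TurnLeft
(
  {Lookup.majorType == "turn"}   // "turn", "take", "make", "move"
  ({Token})[0,3]                 // optional filler words: "to", "a", ...
  {Lookup.majorType == "left"}   // "left", "port"
):cmd
-->
:cmd {
  // stand-in for the real robot-library call provided by the plugin
  robot.turnLeft();
}

The appelt control style means the longest match wins, so "take a left" fires the rule once rather than matching fragments.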

We showed them how to configure the Twitter Collector, follow their own accounts, and then run the collector with the sample GATE application.  Getting the system set up and working took a bit of work, but once the first few groups got their robot to move in response to a tweet, everyone cheered and quickly became more interested.  They then worked on extending the word lists and JAPE rules to cover a wider range of tweeted commands.

Some of the students had also developed interesting Java code the previous day, which they wanted to incorporate into the Twitter-controlled system.  We helped these students add their code to their own copies of the GATE plugin and re-load it so the JAPE rules could call their procedures.

We first ran this project in the Headstart course in July 2017; we made improvements for this year and it was a success again, so we plan to include it in Headstart 2019 too.
The following maps show where all the students and the female students came from.


This work is supported by the European Union's Horizon 2020 project SoBigData (grant agreement no. 654024).  Thanks to Genevieve Gorrell for the diagram illustrating how the system works.

Monday, 20 August 2018

Deep Learning in GATE

Few can have failed to notice the rise of deep learning over the last six or seven years, and its role in our emergence from the AI winter. Thanks in part to the increased speed offered by GPUs*, neural net approaches came into their own and out from under the shadow of the support vector machine, offering more scope than that and other previously popular methods, such as Random Forests and CRFs, for continued improvement as training data volumes increase. Natural language processing has traditionally been a multi-step endeavour, perhaps beginning with tokenization and parsing and working up to semantic processing such as question answering. In addition to being labour-intensive, this approach is also limiting, as each step can only access the output of the one before, and thus throws away potentially valuable information from earlier steps. Deep learning offers the possibility to overcome these limitations by bringing a much greater number of parameters into play (much greater flexibility). Deep neural nets (DNNs) may learn end-to-end solutions, starting with raw data and producing sophisticated output. Furthermore they can encode much more complex dependencies than those we have seen in less parameterizable approaches--in other words, much more elaborate reasoning. And while we step back from the need to break down involved problems into pieces ourselves, a promising line of work finds that DNN "skills" are also "transferable"--models may for example be pre-trained on generic data, providing a basic language understanding that can then be put to use in other specialized contexts (multi-task learning).

For these reasons, deep learning is widely seen as key to continuing progress on a wide range of artificial intelligence tasks including natural language processing, so of course it is of great interest to us here in the GATE team. Classic GATE tasks such as entity recognition and sentence classification could be advanced by utilizing an approach with greater potential to learn a discriminative model, given sufficient training data. And by supporting the substitution of words with "embeddings" (DNN-derived vectors that capture relationships between words) trained on readily available unlabelled general or domain-specific data, we can bring some of the benefits of deep learning even to cases where training data are meagre by deep learning standards. Deep learning is therefore likely to be of benefit in any task but the most trivial, as long as you have the skills and a reasonable amount of data.

The Learning Framework is our ongoing project bringing current machine learning technologies to GATE, enabling users to leverage GATE's ecosystem of text processing offerings to create features to train learners, and to include these learners in text processing pipelines. The guiding vision for the Learning Framework has always been to offer an accessible interface that enables GATE users to get up and running quickly with machine learning, whilst at the same time supporting the most current and interesting of technologies. When it comes to deep learning, meeting these twin objectives is a little more challenging, but we have stepped up to the plate!

Deep learning framework in the GATE GUI

Previous machine learning algorithms would work their magic with comparatively little in the way of tweaking required. Deep learning is, however, an entirely different beast in this respect. In fact, it's more like an entire zoo! As discussed above, the advantage to DNNs is their massive flexibility, but this seriously stretches GATE's previous assumptions about how machine learning works. An integration needs to support the design of an architecture (a "shape" of neural net) and the tuning of many parameters, including dropout, optimization strategy, learning rate, momentum, and many more. All of these factors are critical in obtaining a good performance. The integration is still under (very) intensive development, but it is already possible to get something running relatively quickly with deep learning in GATE. Here are some current highlights:
  • Two of the most-used frameworks for Deep Learning can be used: PyTorch and Keras, both Python-based;
  • Support for both Linux and MacOS (Windows is not yet supported);
  • A range of template architectures, which may produce acceptable results out of the box (though in many cases it will be necessary for the user to adapt the architecture, the parameters of the architecture, or other aspects of the DNN solution);
  • The possibility to work with an initial GATE-created model both inside and outside of GATE.
We encourage anyone who is interested to give it a try and to talk to us about it. There will always be more to add (current challenges include drop-out, gradient clipping, L1/L2 weight regularization, attention, modified weight initialization, char-augmented LSTMs and LSTM-CRF architectures, to name a few) but much is achievable already. This is one of relatively few efforts globally to sift the essence out of this highly active research field and transform it into something relatively high level and generalizable across a range of NLP tasks, making state of the art technologies accessible to non-specialists. There's some documentation available here.

At the same time, we've been applying deep learning in our research in several ways. In a forthcoming EMNLP paper, team member Xingyi Song and co-authors use the fixed-size ordinally-forgetting encoding (FOFE) approach to combine LSTM and CNN neural net architectures in a more computationally efficient way than previously, in order to make better use of context in sentence classification tasks. Together with researchers at KCL and the South London and Maudsley NHS Trust, he has also demonstrated the value of this technology in the context of detection of suicidal ideation in medical records.

Furthermore, we have successfully used LSTMs for veracity verification of rumours spread in social media such as Twitter. Our approach makes use of only the tweet content, which it passes through LSTM units that learn to distinguish between true, false and unverifiable rumours. However, the unique part of our approach is that prior to passing the tweet to the LSTM layer, it first looks within the tweet for recurring information that is typically used by others to spread rumours, and makes adjustments to the input--words carrying useful information are kept as they are, and others are downgraded in terms of contribution. This is achieved through an attention layer. We evaluated our approach on the RumourEval 2017 test data and achieved over 60% accuracy, which is currently the state-of-the-art performance for this task.

*Graphics Processing Units; technology driven by the demands of computer gamers that has been used to speed up deep learning approaches by as much as 250 times compared with CPUs.

Title artwork from https://www.deviantart.com/redwolf518stock

Thursday, 12 July 2018

GATE and JSON: Now Supporting 280 Character Tweets!

We first added support for reading tweets stored as JSON objects to GATE in version 8, all the way back in 2014. This support has proved exceptionally useful, both internally to help our own research and to the many researchers outside of Sheffield who use GATE for analysing Twitter posts. Recent changes that Twitter have made to the way they represent tweets as JSON objects, and the move to 280 character tweets, have led us to re-develop our support for Twitter JSON and to also develop a simpler JSON format for storing general text documents and annotations.

This work has resulted in two new (or re-developed) plugins: Format: JSON and Format: Twitter. Both are currently at version 8.6-SNAPSHOT and are offered in the default plugin list to users of GATE 8.6-SNAPSHOT.

The Format: JSON plugin contains both a document format and export support for a simple JSON document format inspired by the original Twitter JSON format. Essentially each document is stored as a JSON object with two properties: text and entities. The text field is simply the text of the document, while the entities field contains the annotations and their features. The format of this field is the same as that used by Twitter to store entities, namely a map from annotation type to an array of objects each of which contains the offsets of the annotation and any other features. You can load documents using this format by specifying text/json as the mime type. If your JSON documents don't quite match this format you can still extract the text from them by specifying the path through the JSON to the text element as a dot separated string as a parameter to the mime type. For example, assume the text in your document was in a field called text which wasn't at the root of the JSON document but inside an object named document; then you would load this by specifying the mime type text/json;text-path=document.text. When saved, the text and any annotations would, however, be stored at the top level. This format essentially mirrors the original Twitter JSON, but we will now be freezing this format as a general JSON format for GATE (i.e. it won't change if/when Twitter changes the way they store tweets as JSON).
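As a concrete illustration, a minimal document in this format might look like the following (the offsets follow the Twitter convention of start and end character positions; the feature names are invented for the example):

{
  "text": "President Obama visited Sheffield today.",
  "entities": {
    "Person": [ { "indices": [10, 15], "gender": "male" } ],
    "Location": [ { "indices": [24, 33] } ]
  }
}

Loading this with the text/json mime type should produce a document with one Person and one Location annotation, each carrying the extra properties as features.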

As stated earlier, the new version of our Format: Twitter plugin now fully supports Twitter's new JSON format. This means we can correctly handle not only 280 character tweets but also quoted tweets. Essentially a single JSON object may now contain multiple tweets in a nested hierarchy. For example, you could have a retweet of a tweet which itself quotes another tweet; this is represented as three separate tweets in a single JSON object. Each top level tweet is loaded into a GATE document and covered with a Tweet annotation. Each of the tweets it contains is then added to the document and covered with a TweetSegment annotation. Each TweetSegment annotation has three features: textPath, entitiesPath, and tweetType. The last of these tells you the type of tweet (i.e. retweet, quoted, etc.) whereas the first two give the dotted path through the JSON object to the fields from which text and entities were extracted to produce that segment. All the JSON data is added as nested features on the top level Tweet annotation. To use this format make sure to use the mime type text/x-json-twitter when loading documents into GATE.
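If you're loading such documents from code rather than through the GUI, the mime type can be supplied as a document creation parameter. A sketch, assuming GATE has been initialised, the plugin is loaded, and the file path is your own:

import java.net.URL;

import gate.Document;
import gate.Factory;
import gate.FeatureMap;

FeatureMap params = Factory.newFeatureMap();
params.put(Document.DOCUMENT_URL_PARAMETER_NAME, new URL("file:///data/tweets.json"));
params.put(Document.DOCUMENT_MIME_TYPE_PARAMETER_NAME, "text/x-json-twitter");
Document doc = (Document) Factory.createResource("gate.corpora.DocumentImpl", params);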


So far we've only talked about loading single JSON objects as documents, however, usually you end up with a single file containing many JSON objects (often one per line) which you want to use to populate a corpus. For this use case we've added a new JSON corpus populator.


This populator allows you to select the JSON file you want to load, set the mime type to use to process each object within the file, and optionally provide a path to a field in the object that should be used to set the document name. In this example I'm loading tweets, so I've specified /id_str so that the name of the document is the ID of the tweet; paths are a /-separated list of fields specifying the route from the root of the object to the relevant field, and must start with a /.
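For instance, a file for the populator holds one JSON object per line; a line might look like this (a real early tweet, trimmed here to just the fields needed for illustration, whereas real collector output carries many more):

{"id_str": "20", "full_text": "just setting up my twttr"}

With the mime type set to text/x-json-twitter and the document name path set to /id_str, this line becomes a document named 20.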

The code for both plugins is still under active development (hence the -SNAPSHOT version number) while we improve error handling etc. so if you spot any issues or have suggestions for features we should add please do let us know. You can use the relevant issue trackers on GitHub for either the JSON or Twitter format plugins.

Tuesday, 27 February 2018

Students use GATE and Twitter to drive Lego robots



At the university's Headstart Summer School in July 2017, 42 students (age 16 and 17) from all over the UK were taught to write Java programs to control Lego robots, using input from the robots (such as the sensor for detecting coloured marks on the floor) as well as operating the motors to move and turn.  (The university provided a custom Java library for this.)


On 11 and 12 July we ran a practical session on "Controlling Robots with Tweets".  We presented a quick introduction to natural language processing (using computer programs to analyse human languages such as English) and provided them with a bundle of software containing a version of the GATE Cloud Twitter Collector modified to run a special GATE application with a custom plugin to let it use the Java robot library.

The bundle came with a simple "gazetteer" containing two lists of classified keywords:

left    turn
----    ----
left    turn
port    take
        make
        move

and a basic JAPE grammar to make use of it.  JAPE is a specialized language used in GATE to match regular expressions over annotations in documents. (The annotations are similar to XML tags, except that GATE applications can create them as well as read them and they can overlap each other without restrictions.  Technically they form an annotation graph.)

The grammar we provided would match any keyword from the "turn" list followed by any keyword from the "left" list (with zero or more unmatched words in between, e.g., "turn to port", "take a left", "turn left") and then run the code to turn the robot's right motor (making it turn left in place).

We showed them how to configure the Twitter Collector, authenticate with their Twitter accounts, follow themselves, and then run the collector with this application.  Getting the system set up and working was a bit laborious, but once the first group got their robot to move in response to a tweet and cheered, everyone got a lot more interested very quickly.  They were very interested in extending the word lists and JAPE rules to cover a wider range of tweeted commands.

Some of the students had also developed interesting and complicated manoeuvres in Java the previous day, which they wanted to incorporate into the Twitter-controlled system.  We helped these students add their code to their own copies of the GATE plugin and re-load it so the JAPE rules could call their procedures.

This project was fun and interesting for the staff as well as the students, and we will include it in Headstart 2018.

The Headstart 2017 video includes these activities.  The instructions (presentation and handout) and software are available on-line.

This work is supported by the European Union's Horizon 2020 project SoBigData (grant agreement no. 654024).


Sunday, 23 July 2017

The Tools Behind Our Twitter Abuse Analysis with BuzzFeed


Or...How to Quantify Abuse in Tweets in 5 Working Days


When BuzzFeed approached us with the idea to quantify Twitter abuse towards politicians during the election campaign, we had only five working days before the article had to be completed and go public.

The goal was to use text analytics and analyse tweets replying to UK politicians, in the run up to the 2017 general election, in order to answer questions such as:

  • How widespread is the abuse received by politicians?
  • Who are the main politicians targeted by such abusive tweets?
  • Are there any party or gender differences?
  • Do abuse levels stay constant over time or not?  
So here I explain first how we collect the data for such studies and then how it gets analysed at scale and fast, all with our GATE-based open-source tools and their GATE Cloud text analytics-as-a-service deployment.

For researchers wanting more in-depth details, please read and cite our paper:

D. Maynard, I. Roberts, M. A. Greenwood, D. Rout and K. Bontcheva. A Framework for Real-time Semantic Social Media Analysis. Web Semantics: Science, Services and Agents on the World Wide Web, 2017 (in press). https://doi.org/10.1016/j.websem.2017.05.002, pre-print

Tweet Collection 


We already had all necessary tweets at hand, since, within an hour of #GE2017 being announced, I set up, using the GATE Cloud tweet collection service:

the continuous collection of tweets by MPs, prominent politicians, parties, and candidates, as well as retweets and replies thereof. 

I also set up a second Twitter collector service running in parallel, to collect election-related tweets based purely on hashtags and keywords (e.g. #GE2017, vote, election).

How We Analysed and Quantified Abuse 


Given the short 5-day deadline, we were pleased to have at hand the large-scale, real-time text analytics tools in GATE, Mimir/Prospector, and GATE Cloud. 

The starting point was the real-time text analysis pipeline from the Brexit research last year. That pipeline is capable of analysing up to 100 tweets per second (tps), although, in practice, the tweets usually arrived at a much lower rate of around 23 tps.

This time, however, we adapted it with a new abuse analysis component, as well as some more up-to-date knowledge about the politicians (including the new prime minister). 




The analysis backbone was again GATE's TwitIE system, which consists of a tokenizer, normalizer, part-of-speech tagger, and a named entity recognizer. TwitIE is also available as-a-service on GATE Cloud, for easy integration and use.
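As an aside, for anyone wanting to run a pipeline like TwitIE outside GATE Cloud: a saved GATE application can be loaded and run over a corpus with a few lines of GATE Embedded code. A sketch, with the .xgapp path illustrative:

import java.io.File;

import gate.Corpus;
import gate.CorpusController;
import gate.Factory;
import gate.Gate;
import gate.util.persistence.PersistenceManager;

Gate.init();
CorpusController pipeline = (CorpusController)
    PersistenceManager.loadObjectFromFile(new File("twitie.xgapp"));

Corpus tweets = Factory.newCorpus("tweets");
// ... populate the corpus with collected tweets here ...
pipeline.setCorpus(tweets);
pipeline.execute();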

Next, we added information about politicians, e.g. their names, gender, party, constituencies, etc. In this way, we could produce aggregate statistics, such as abuse-containing tweets aimed at Labour or Conservative male/female politicians. 

Next is a tweet geolocation component, which uses latitude/longitude, region, and user location metadata to geolocate tweets within the UK NUTS2 regions. This is not always possible, since many accounts and tweets lack such information, and this narrows down the sample significantly, should we choose to restrict by geolocation.

We also detect key themes and topics discussed in the tweets (more than one topic/theme can be contained in each tweet). Here we reused the module from the Brexit analyser.

The most exciting part was working with BuzzFeed's journalists to curate a set of abuse nouns typically aimed at people (e.g. twat), racist words, and milder insults (e.g. coward).  We decided to differentiate these from general obscene language and swearing, as the latter were not always targeting the politician. Nevertheless, they were included in the system, to produce a separate set of statistics. We also introduced basic sub-classification by kind (e.g. racial) and strength (e.g. mild, medium, strong), derived from an Ofcom research report on offensive language.


Overall, we kept the processing pipeline as simple and efficient as possible, so it can run at 100 tweets per second even on a pretty basic server.  

The analysis results were fed into GATE Mimir, which efficiently indexes tweet text and all our linguistic annotations. Mimir has a powerful programming API for semantic search queries, which we use to drive the various interactive visualisations and to generate the necessary aggregate statistics behind them. 

For instance, we used Mimir queries to generate statistics and visualisations, based on time (e.g. most popular hashtags in abuse-containing tweets on 4 Jun); topic (e.g. the most talked about topics in such tweets), or target of the abusive tweet (e.g. the most frequently targeted politicians by party and gender). We could also navigate to the corresponding tweets behind these aggregate statistics, for a more in-depth analysis.

A rich sample of these statistics, associated visualisations, and abusive tweets is available in the BuzzFeed article.

Research carried out by:


Mark A. Greenwood, Ian Roberts, Dominic Rout, and myself, with ideas and other contributions from Diana Maynard and others from the GATE Team. 

Any mistakes are my own.

Tuesday, 20 June 2017

GATE, Java 9, and HDPI Monitors

Over the last couple of months a few people have mentioned that running GATE Developer on HDPI monitors is a bit of a pain. The problem is that Java (up to and including the latest version of Java 8) doesn't have any support for HDPI monitors. The only solution I'd heard people suggest was to reduce the resolution of the monitor before launching GATE, but as you can imagine this is far from an ideal solution.

Having recently upgraded my laptop I also ran into the same problem, and as this screenshot highlights, by default GATE Developer isn't at all usable on a HDPI screen.


A quick hunt around the web and you'll find all sorts of suggestions for getting Java 8 to work nicely with HDPI screens, but try as I might I couldn't get any of them to work for me; I'm running OpenJDK 8 under Ubuntu 16.04. Fortunately HDPI support is going to be built into Java 9. Unfortunately Java 9 still hasn't been officially released so you need to rely on an early access version.

In theory it should have been easy for me to see if Java 9 was a solution, but unfortunately the version of Java 9 in the Ubuntu 16.04 repositories causes a segfault as soon as you try to run any Java GUI program, making life more difficult than it needs to be.

The solution is to install the Oracle early access build of Java 9. You can either download the JDK manually, or follow these instructions under Ubuntu to install from the very useful Web Upd8 repository. Either way once installed, launching GATE gives a usable UI.


Unfortunately this isn't quite enough to solve the problem. Under the hood Java 9 introduces a modular component system (often referred to as Project Jigsaw) which includes new rules on encapsulation. The issue is that one of the libraries GATE uses for reading and writing applications, XStream, uses a number of tricks to access internal data that are prohibited under the new rules. The result is that you can't load or save applications which makes the GUI kind of pointless. Fortunately there is a command line option you can pass to the JVM that allows you to bypass the encapsulation rules. So to get GATE to work properly with Java 9 you need to add
--permit-illegal-access
to the command line. When launching the GUI this is easy to do by adding the flag as a new line in the gate.l4j.ini file which you will find in the GATE home folder.
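For example, if your gate.l4j.ini already contains a memory setting (the -Xmx value here is illustrative), the file simply takes one JVM argument per line and would end up looking like:

-Xmx2G
--permit-illegal-access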

There are two important things to note. Firstly, this fix is only temporary, as the command line flag will be removed in a later version of Java. Secondly, depending on how you are deploying GATE, it can be difficult to alter the command line arguments (for example if deploying as a web app). Once Java 9 is officially released we'll look again at this problem to find a more permanent solution. Until then this gives you a way of using GATE on a HDPI monitor, but where possible we'd still advise using Java 8, treating this hack as a last resort for when you need the UI on a HDPI monitor.