On GATE, Text and Social Media Analysis, and Detecting Misinformation Online: July 2018

Thursday, 12 July 2018

GATE and JSON: Now Supporting 280 Character Tweets!

We first added support for reading tweets stored as JSON objects to GATE in version 8, all the way back in 2014. This support has proved exceptionally useful, both internally for our own research and to the many researchers outside of Sheffield who use GATE for analysing Twitter posts. Recent changes that Twitter have made to the way they represent Tweets as JSON objects, together with the move to 280 character tweets, have led us to re-develop our support for Twitter JSON and also to develop a simpler JSON format for storing general text documents and annotations.

This work has resulted in two new (or re-developed) plugins: Format: JSON and Format: Twitter. Both are currently at version 8.6-SNAPSHOT and are offered in the default plugin list to users of GATE 8.6-SNAPSHOT.

The Format: JSON plugin contains both a document format and export support for a simple JSON document format inspired by the original Twitter JSON format. Essentially, each document is stored as a JSON object with two properties: text and entities. The text field is simply the text of the document, while the entities field contains the annotations and their features. The format of this field is the same as that used by Twitter to store entities, namely a map from annotation type to an array of objects, each of which contains the offsets of the annotation and any other features.

You can load documents using this format by specifying text/json as the mime type. If your JSON documents don't quite match this format you can still extract the text from them by specifying the path through the JSON to the text element, as a dot separated string, as a parameter to the mime type. For example, assume the text in your document was in a field called text, but this wasn't at the root of the JSON document and was instead inside an object named document. You would then load it by specifying the mime type text/json;text-path=document.text. When saved, however, the text and any annotations would be stored at the top level. This format essentially mirrors the original Twitter JSON, but we will now be freezing it as a general JSON format for GATE (i.e. it won't change if/when Twitter changes the way they store Tweets as JSON).
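To make the shape of this format concrete, here is a minimal Python sketch rather than anything taken from the plugin itself. It assumes the offsets are stored under an indices key, as in Twitter's own entities, and the Person and Location types, the gender feature, and the resolve_dotted helper are all made up for illustration.

```python
import json

# A document in the simple format: "text" plus "entities".
# ASSUMPTION: offsets live under "indices", as in Twitter's entity objects.
# The annotation types and the "gender" feature are invented for this sketch.
doc = {
    "text": "Alice visited Sheffield.",
    "entities": {
        "Person": [{"indices": [0, 5], "gender": "female"}],
        "Location": [{"indices": [14, 23]}],
    },
}


def resolve_dotted(obj, path):
    """Follow a dot-separated path such as 'document.text' through nested dicts."""
    for key in path.split("."):
        obj = obj[key]
    return obj


# Text nested inside an object named "document", as in the text-path example.
wrapped = json.loads('{"document": {"text": "Alice visited Sheffield."}}')
text = resolve_dotted(wrapped, "document.text")

# Each annotation's offsets select a span of the document text.
spans = {
    ann_type: [doc["text"][a["indices"][0]:a["indices"][1]] for a in anns]
    for ann_type, anns in doc["entities"].items()
}
```

Here resolve_dotted only mimics what a text-path of document.text asks for; how the plugin resolves the path internally is not something this sketch claims to reproduce.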

As stated earlier, the new version of our Format: Twitter plugin now fully supports Twitter's new JSON format. This means we can correctly handle not only 280 character tweets but also quoted tweets. Essentially, a single JSON object may now contain multiple tweets in a nested hierarchy. For example, you could have a retweet of a tweet which itself quotes another tweet. This is represented as three separate tweets in a single JSON object. Each top level tweet is loaded into a GATE document and covered with a Tweet annotation. Each of the tweets it contains is then added to the document and covered with a TweetSegment annotation. Each TweetSegment annotation has three features: textPath, entitiesPath, and tweetType. The last of these tells you the type of tweet (e.g. retweet, quoted, etc.), whereas the first two give the dotted path through the JSON object to the fields from which the text and entities were extracted to produce that segment. All the JSON data is added as nested features on the top level Tweet annotation. To use this format, make sure to use the mime type text/x-json-twitter when loading documents into GATE.
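The nesting is easier to see with a small example. The Python sketch below is not the plugin's algorithm; it simply walks a tweet object and reports a dotted text path for each nested tweet. The field names retweeted_status, quoted_status and extended_tweet.full_text come from Twitter's JSON, while the tweet_segments function and the type labels it yields are illustrative and may not match the tweetType values the plugin actually uses.

```python
def tweet_segments(tweet, prefix="", tweet_type="original"):
    """Yield (tweet_type, text_path, text) for a tweet and each tweet nested in it."""
    # Tweets longer than 140 characters carry their full text in extended_tweet.
    if "extended_tweet" in tweet:
        key, text = "extended_tweet.full_text", tweet["extended_tweet"]["full_text"]
    else:
        key, text = "text", tweet["text"]
    yield tweet_type, prefix + key, text

    # Labels below are hypothetical, chosen only for this sketch.
    nested_fields = (("retweeted_status", "retweet"), ("quoted_status", "quoted"))
    for field, nested_type in nested_fields:
        if field in tweet:
            yield from tweet_segments(tweet[field], prefix + field + ".", nested_type)


# A retweet of a tweet which itself quotes another tweet (made-up data).
example = {
    "id_str": "3",
    "text": "RT @b: Look at this",
    "retweeted_status": {
        "id_str": "2",
        "text": "Look at this",
        "quoted_status": {"id_str": "1", "text": "Original claim"},
    },
}

segments = list(tweet_segments(example))
```

For this made-up retweet, the walk produces three segments, one per tweet in the hierarchy, with text paths text, retweeted_status.text and retweeted_status.quoted_status.text.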

So far we've only talked about loading single JSON objects as documents. Usually, however, you end up with a single file containing many JSON objects (often one per line) which you want to use to populate a corpus. For this use case we've added a new JSON corpus populator.

This populator allows you to select the JSON file you want to load, set the mime type to use to process each object within the file, and optionally provide a path to a field in the object that should be used to set the document name. For example, when loading Tweets you can specify /id_str so that the name of each document is the ID of the tweet. Paths are /-separated lists of fields specifying the route to the relevant field, and must start with a /.
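As a rough illustration of what the populator does with such a file, the Python sketch below reads one JSON object per line and names each one using a /-separated path. It is only a sketch under stated assumptions: resolve_slash_path and load_named_objects are hypothetical helpers, not part of GATE, and the sample data is invented.

```python
import io
import json


def resolve_slash_path(obj, path):
    """Follow a /-separated path such as '/id_str'; the path must start with '/'."""
    if not path.startswith("/"):
        raise ValueError("path must start with /")
    for key in path[1:].split("/"):
        obj = obj[key]
    return obj


def load_named_objects(lines, name_path):
    """Parse one JSON object per line and key each by the value found at name_path."""
    docs = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between objects
        obj = json.loads(line)
        docs[str(resolve_slash_path(obj, name_path))] = obj
    return docs


# Stand-in for a file with one tweet per line (made-up data).
sample = io.StringIO(
    '{"id_str": "101", "text": "first"}\n'
    "\n"
    '{"id_str": "102", "text": "second"}\n'
)
corpus = load_named_objects(sample, "/id_str")
```

In practice you would pass an open file handle instead of the io.StringIO stand-in.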

The code for both plugins is still under active development (hence the -SNAPSHOT version number) while we improve error handling etc. so if you spot any issues or have suggestions for features we should add please do let us know. You can use the relevant issue trackers on GitHub for either the JSON or Twitter format plugins.

Wednesday, 4 July 2018

11th GATE Training Course: Large Scale Text and Social Media Analytics with GATE

Every year for the last decade, the GATE team at Sheffield have been delivering summer courses helping people get to grips with GATE technology. One year we even ran a second course in Montreal! It's always a challenge deciding what to include. GATE has been around for almost a quarter of a century, and in that time it has organically grown to include a wide variety of technologies too numerous to cover in a week-long course, and has adapted to the changing needs of our users during one of the most technologically exciting periods in history. But under the capable leadership of Diana Maynard and Kalina Bontcheva, we've learned to squeeze the most useful material into the limited time available, helping beginners to get started with GATE without overwhelming them, as well as empowering more experienced users to see the potential to push it into new territory.

Recent years have seen a surge of interest in social media. These media offer potential for commercial users to deepen their understanding of their customers, and for researchers to explore and understand the ways in which these media are affecting society, as well as using social media data for various other research purposes. For this reason, we have positioned social media as a central theme for the course, which most students seem to find accessible and interesting. It provides an opportunity to showcase GATE's Twitter support, and draw examples from our own work on social media within the Societal Debates theme of SoBigData. However, there are also plenty of examples illustrating how GATE can be applied to other popular areas, such as analysis of news or medical text.

I've been teaching GATE's machine learning offering for most of the time the course has been running, and therefore I've had the opportunity to explore different ways of helping people to get a handle on what can seem an intimidating topic to those who aren't already familiar with it. Machine learning is challenging to teach to a mixed audience, because it's such a large field and the time is limited. It's also an important one though, as it's increasingly a part of the public discourse, and many students are excited to learn about the ways they can incorporate machine learning into their work using GATE. Johann Petrak has taken the lead on keeping the GATE Learning Framework up to date with the latest developments in this rapidly evolving field, and I'm always proud and excited to teach something new that's been added since the last course.

It's evident from the discussions during lunch and tea breaks that students are eager to talk to us about how they are using GATE, and how they would like to use it. I think one of the most valuable things about the course is the opportunity it provides for the students to talk to us about what they are doing with GATE, and for us to be inspired by the range of uses to which GATE is being put. Here is some of the feedback we received from students this year:

"Last week was one of the most useful courses I have done. Overall I think it was pitched really well given the range of technical abilities."

"Thank you all for such an informative and well-delivered course. I was a little worried about whether I'd be able to pick it up as I don’t have a background in programming, but I learned so much and the trainers were all very helpful and patient."