Showing posts with label GATE Developer. Show all posts

Wednesday, 3 August 2022

Populate a Corpus from a List of URLs

GATE provides support for loading numerous different document formats, as well as a number of ways of populating corpora. Until recently, however, we've not offered any way of populating a corpus from a simple list of URLs. Worse, even though it's now quite easy to do this in GATE, it's unlikely you would come across the option by accident.

The support for this is actually hidden away inside the "Format CSV" plugin (you'll need to use version 8.7 or above) and in GATE Developer is exposed through the "Populate from CSV file..." option in the context menu of a corpus.
In this screenshot I've configured the populator ready to build a corpus from a simple text file with one URL per line. The important settings are:
  • Column Separator is set to "\t". This means we are using a tab character as the column separator. We do this simply as you can't have a tab in a URL whereas you could have a URL containing a comma and we don't want our URLs split in half.
  • Document Content is in column 0. We always count columns (or almost anything) starting from 0, so this just ensures we use the URL as the document content.
  • Create one document per row is selected. The key option below isn't available unless this is selected first, as it makes no sense to try to load multiple URLs into the same GATE document.
  • Cell contains document URL is selected. This is the new feature which makes this trick possible. Essentially it looks at the contents of a cell and if it can be interpreted as a URL then it creates a document from the contents of the URL, otherwise it uses the cell content as normal to build the document.
Once configured it's simply a case of selecting your text file, one URL per line, and hitting the OK button. Be aware that there is currently no rate limiting so be careful if you are listing a lot of URLs from a single domain etc. You may also want to combine this with the cookie trick from the previous post to ensure you get the correct content from each of the URLs.
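To make the "Cell contains document URL" behaviour more concrete, here is a minimal sketch in plain Java of the decision described above: split each row on a tab, take column 0, and treat it as a URL only if it parses as one. This is not the plugin's actual code; the class and method names (UrlListSketch, asUrl, urlsFromRows) and the example URLs are hypothetical and exist purely for illustration.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not the Format CSV plugin's implementation.
class UrlListSketch {

    /** Returns the cell as a URL if it parses as an absolute URL, otherwise null. */
    static URL asUrl(String cell) {
        try {
            URI uri = new URI(cell.trim());
            return uri.isAbsolute() ? uri.toURL() : null;
        } catch (URISyntaxException | MalformedURLException | IllegalArgumentException e) {
            return null; // not a URL, so a populator would fall back to the cell text
        }
    }

    /** Takes column 0 of each tab-separated row and keeps the cells that parse as URLs. */
    static List<URL> urlsFromRows(List<String> rows) {
        List<URL> urls = new ArrayList<>();
        for (String row : rows) {
            if (row.trim().isEmpty()) {
                continue; // skip blank lines
            }
            // Splitting on tab (not comma) keeps URLs containing commas intact.
            String cell = row.split("\t", -1)[0];
            URL url = asUrl(cell);
            if (url != null) {
                urls.add(url);
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList(
                "https://example.com/a,b?x=1",
                "just some text",
                "https://example.org/page\textra column");
        for (URL url : urlsFromRows(rows)) {
            System.out.println(url);
        }
    }
}
```

Note how the first example URL contains a comma, which is exactly why a tab is the safer column separator here.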

Of course while this post has been about how to populate a corpus from a simple list of URLs you can use more complex CSV or TSV files which happen to contain URLs in one column. In that case the details from the other columns will be added as document features.

Wednesday, 2 March 2022

GATE and the Cookie Jar

One of the useful features of GATE is that documents can be loaded directly from the web as well as from local files. This is especially useful for pages which update frequently and which you might want to process repeatedly. While using this feature recently we came across some pages that refused to load correctly. The page loaded fine in a web browser but returned a 403 (Forbidden) response when accessed via GATE.

After a bit of debugging it turned out that this issue was related to cookies. The specific URL we were trying to load went through a number of redirects before ending up at the final page. The problem was that the first redirect set a cookie, and that needed to be present for the further redirects to work. By default Java, and hence GATE, doesn't maintain cookies across requests, as each connection is handled independently.

If you are using GATE in an embedded context, then it is trivial to add support for cookies using the default Java cookie handler. This is a JVM level setting so once configured in your own code, all requests made by GATE to load documents will also gain support for handling cookies. The entire solution is the following single line of code:

java.net.CookieHandler.setDefault(new java.net.CookieManager());
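If you want to see what that line changes without touching the network, the following sketch may help. It installs a cookie manager, feeds it a Set-Cookie header as if it came from the first response in a redirect chain, and then asks which cookies would be sent with a later request to the same host. The class name, the example.com URLs, and the cookie name and value are all made up for illustration; only the java.net classes are real.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.CookieHandler;
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.URI;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of how a JVM-wide CookieManager carries cookies between requests.
class CookieSketch {

    /** Installs a JVM-wide cookie manager (in-memory store) and returns it. */
    static CookieManager enableCookies() {
        CookieManager manager = new CookieManager(null, CookiePolicy.ACCEPT_ORIGINAL_SERVER);
        CookieHandler.setDefault(manager);
        return manager;
    }

    /** Simulates a first response that sets a cookie, then returns the Cookie header values for a later request. */
    static List<String> cookiesAfterRedirect(CookieManager manager) {
        URI first = URI.create("https://example.com/start"); // made-up URLs
        URI next = URI.create("https://example.com/next");
        try {
            Map<String, List<String>> response = Collections.singletonMap(
                    "Set-Cookie", Collections.singletonList("session=abc123; Path=/"));
            manager.put(first, response);
            Map<String, List<String>> request =
                    manager.get(next, Collections.<String, List<String>>emptyMap());
            List<String> cookies = request.get("Cookie");
            return cookies == null ? Collections.<String>emptyList() : cookies;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        CookieManager manager = enableCookies();
        System.out.println(cookiesAfterRedirect(manager));
    }
}
```

Without a default handler installed, nothing remembers the cookie from the first response, which matches the failure described above.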


The problem we faced though, was that we wanted to be able to load documents that required cookies from within GATE Developer and that required a little more thought. Whilst we could have just added the code to GATE there are a number of reasons not to (details of which are outside the scope of this blog post) and I wanted to make it easier for all existing GATE users to be able to use cookies without needing to upgrade. The answer is the rather versatile Groovy plugin.

If you load the Groovy plugin into GATE Developer you can then access the Groovy Console from within the tools menu. Simply pasting that single line of code into the console and executing it is enough to add the cookie support within that instance of GATE. It's slightly annoying that it won't persist across multiple instances of GATE, but as it's such a simple trick hopefully it's easy enough to apply when needed.

Thursday, 12 July 2018

GATE and JSON: Now Supporting 280 Character Tweets!

We first added support for reading tweets stored as JSON objects to GATE in version 8, all the way back in 2014. This support has proved exceptionally useful, both internally to help our own research and to the many researchers outside of Sheffield who use GATE for analysing Twitter posts. Recent changes that Twitter have made to the way they represent Tweets as JSON objects, and the move to 280 character tweets, have led us to re-develop our support for Twitter JSON and to also develop a simpler JSON format for storing general text documents and annotations.

This work has resulted in two new (or re-developed) plugins: Format: JSON and Format: Twitter. Both are currently at version 8.6-SNAPSHOT and are offered in the default plugin list to users of GATE 8.6-SNAPSHOT.

The Format: JSON plugin contains both a document format and export support for a simple JSON document format inspired by the original Twitter JSON format. Essentially each document is stored as a JSON object with two properties, text and entities. The text field is simply the text of the document, while the entities field contains the annotations and their features. The format of this field is the same as that used by Twitter to store entities, namely a map from annotation type to an array of objects, each of which contains the offsets of the annotation and any other features. You can load documents using this format by specifying text/json as the mime type. If your JSON documents don't quite match this format you can still extract the text from them by specifying the path through the JSON to the text element as a dot-separated string as a parameter to the mime type. For example, assume the text in your document was in a field called text but this wasn't at the root of the JSON document but inside an object named document, then you would load this by specifying the mime type text/json;text-path=document.text. When saved the text and any annotations would, however, be stored at the top level. This format essentially mirrors the original Twitter JSON, but we will now be freezing this format as a general JSON format for GATE (i.e. it won't change if/when Twitter changes the way they store Tweets as JSON).
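As a rough illustration of what a text-path such as document.text means, the sketch below walks a dot-separated path through nested maps standing in for an already parsed JSON object. It is only a sketch of the idea, not how the plugin is implemented; the class and method names (TextPathSketch, resolve) and the sample text are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of following a dot-separated path through nested JSON-like maps.
class TextPathSketch {

    /** Follows a dot-separated path through nested maps; returns null if any step is missing. */
    static Object resolve(Map<String, Object> root, String dottedPath) {
        Object current = root;
        for (String key : dottedPath.split("\\.")) {
            if (!(current instanceof Map)) {
                return null; // tried to descend into something that is not an object
            }
            current = ((Map<?, ?>) current).get(key);
        }
        return current;
    }

    public static void main(String[] args) {
        // Stand-in for the parsed JSON {"document": {"text": "GATE loads JSON."}}
        Map<String, Object> inner = new HashMap<>();
        inner.put("text", "GATE loads JSON.");
        Map<String, Object> root = new HashMap<>();
        root.put("document", inner);

        System.out.println(resolve(root, "document.text"));
    }
}
```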

As stated earlier, the new version of our Format: Twitter plugin now fully supports Twitter's new JSON format. This means we can correctly handle not only 280 character tweets but also quoted tweets. Essentially a single JSON object may now contain multiple tweets in a nested hierarchy. For example, you could have a retweet of a tweet which itself quotes another tweet. This is represented as three separate tweets in a single JSON object. Each top level tweet is loaded into a GATE document and covered with a Tweet annotation. Each of the tweets it contains is then added to the document and covered with a TweetSegment annotation. Each TweetSegment annotation has three features: textPath, entitiesPath, and tweetType. The last of these tells you the type of tweet (i.e. retweet, quoted etc.), whereas the first two give the dotted path through the JSON object to the fields from which text and entities were extracted to produce that segment. All the JSON data is added as nested features on the top level Tweet annotation. To use this format make sure to use the mime type text/x-json-twitter when loading documents into GATE.


So far we've only talked about loading single JSON objects as documents, however, usually you end up with a single file containing many JSON objects (often one per line) which you want to use to populate a corpus. For this use case we've added a new JSON corpus populator.


This populator allows you to select the JSON file you want to load, set the mime type to use to process each object within the file, and optionally provide a path to a field in the object that should be used to set the document name. In this example I'm loading Tweets so I've specified /id_str so that the name of the document is the ID of the tweet; a path is a /-separated list of fields specifying the route to the relevant field, and must start with a /.

The code for both plugins is still under active development (hence the -SNAPSHOT version number) while we improve error handling etc. so if you spot any issues or have suggestions for features we should add please do let us know. You can use the relevant issue trackers on GitHub for either the JSON or Twitter format plugins.

Tuesday, 20 June 2017

GATE, Java 9, and HDPI Monitors

Over the last couple of months a few people have mentioned that running GATE Developer on HDPI monitors is a bit of a pain. The problem is that Java (up to and including the latest version of Java 8) doesn't have any support for HDPI monitors. The only solution I'd heard people suggest was to reduce the resolution of the monitor before launching GATE, but as you can imagine this is far from an ideal solution.

Having recently upgraded my laptop I also ran into the same problem, and as this screenshot highlights, by default GATE Developer isn't at all usable on a HDPI screen.


A quick hunt around the web and you'll find all sorts of suggestions for getting Java 8 to work nicely with HDPI screens, but try as I might I couldn't get any of them to work for me; I'm running OpenJDK 8 under Ubuntu 16.04. Fortunately HDPI support is going to be built into Java 9. Unfortunately Java 9 still hasn't been officially released so you need to rely on an early access version.

In theory it should have been easy for me to see if Java 9 was a solution, but unfortunately the version of Java 9 in the Ubuntu 16.04 repositories causes a segfault as soon as you try to run any Java GUI program, making life more difficult than it needs to be.

The solution is to install the Oracle early access build of Java 9. You can either download the JDK manually, or follow these instructions under Ubuntu to install from the very useful Web Upd8 repository. Either way once installed, launching GATE gives a usable UI.


Unfortunately this isn't quite enough to solve the problem. Under the hood Java 9 introduces a modular component system (often referred to as Project Jigsaw) which includes new rules on encapsulation. The issue is that one of the libraries GATE uses for reading and writing applications, XStream, uses a number of tricks to access internal data that are prohibited under the new rules. The result is that you can't load or save applications which makes the GUI kind of pointless. Fortunately there is a command line option you can pass to the JVM that allows you to bypass the encapsulation rules. So to get GATE to work properly with Java 9 you need to add
--permit-illegal-access
to the command line. When launching the GUI this is easy to do by adding the flag as a new line in the gate.l4j.ini file which you will find in the GATE home folder.

There are two important things to note. Firstly, this fix is only temporary as the command line flag will be removed in a later version of Java, and secondly, depending on how you are deploying GATE it can be difficult to alter the command line arguments (for example if deploying as a web app). Once Java 9 is officially released we'll look again at this problem to find a more permanent solution. Until then this gives you a way of using GATE on a HDPI monitor, but where possible we'd still advise using Java 8, keeping this hack as a last resort (i.e. only on a HDPI monitor when you need the UI).