This work has resulted in two new (or re-developed) plugins: Format: JSON and Format: Twitter. Both are currently at version 8.6-SNAPSHOT and are offered in the default plugin list to users of GATE 8.6-SNAPSHOT.
The Format: JSON plugin contains both a document format and export support for a simple JSON document format inspired by the original Twitter JSON format. Essentially, each document is stored as a JSON object with two properties, text and entities. The text field is simply the text of the document, while the entities field contains the annotations and their features. The format of this field is the same as that used by Twitter to store entities, namely a map from annotation type to an array of objects, each of which contains the offsets of the annotation and any other features. You can load documents using this format by specifying text/json as the mime type. If your JSON documents don't quite match this format, you can still extract the text from them by specifying the path through the JSON to the text element, as a dot separated string, as a parameter to the mime type. For example, assume the text in your document was in a field called text, but this wasn't at the root of the JSON document but inside an object named document; you would then load it by specifying the mime type text/json;text-path=document.text. When saved, however, the text and any annotations would be stored at the top level. This format essentially mirrors the original Twitter JSON, but we will now be freezing it as a general JSON format for GATE (i.e. it won't change if/when Twitter changes the way they store Tweets as JSON).
As stated earlier, the new version of our Format: Twitter plugin now fully supports Twitter's new JSON format. This means we can correctly handle not only 280 character tweets but also quoted tweets. Essentially, a single JSON object may now contain multiple tweets in a nested hierarchy. For example, you could have a retweet of a tweet which itself quotes another tweet. This is represented as three separate tweets in a single JSON object. Each top level tweet is loaded into a GATE document and covered with a Tweet annotation. Each of the tweets it contains is then added to the document and covered with a TweetSegment annotation. Each TweetSegment annotation has three features: textPath, entitiesPath, and tweetType. The last of these tells you the type of tweet (i.e. retweet, quoted etc.), whereas the first two give the dotted path through the JSON object to the fields from which the text and entities were extracted to produce that segment. All the JSON data is added as nested features on the top level Tweet annotation. To use this format make sure to use the mime type text/x-json-twitter when loading documents into GATE.
So far we've only talked about loading single JSON objects as documents; usually, however, you end up with a single file containing many JSON objects (often one per line) which you want to use to populate a corpus. For this use case we've added a new JSON corpus populator.
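Before looking at the populator, it may help to see the single-document format in concrete terms. The Python sketch below is an illustration only, not code from the plugin. It builds one object with text and entities properties, then shows how a dot separated path such as document.text could be followed through a differently shaped object. The "indices" key for the offsets follows Twitter's original entity layout and is an assumption here, and the sample data, the Token annotation type, the kind feature and the helper name resolve_text_path are all made up for the example.

```python
import json

# One document in the simple format: "text" plus "entities", where entities
# maps an annotation type to a list of objects holding offsets and features.
# ASSUMPTION: offsets live under "indices" as [start, end], as in Twitter's
# original entity layout. The "Token" type and "kind" feature are invented.
simple_doc = {
    "text": "GATE loves JSON",
    "entities": {
        "Token": [
            {"indices": [0, 4], "kind": "word"},
            {"indices": [11, 15], "kind": "word"},
        ]
    },
}


def resolve_text_path(obj, dotted_path):
    """Follow a dot separated path (e.g. 'document.text') through nested dicts.

    Hypothetical helper that mimics the idea behind the text-path parameter;
    it is not the plugin's implementation.
    """
    current = obj
    for key in dotted_path.split("."):
        current = current[key]
    return current


# A JSON document whose text is not at the root but inside "document".
nested = json.loads('{"document": {"text": "Hello GATE", "lang": "en"}}')
extracted_text = resolve_text_path(nested, "document.text")

# Use the offsets to recover the text covered by each annotation.
covered = [
    simple_doc["text"][start:end]
    for start, end in (ann["indices"] for ann in simple_doc["entities"]["Token"])
]

print(extracted_text)
print(covered)
print(json.dumps(simple_doc, indent=2))
```

The path helper deliberately raises a KeyError when a field is missing, which seemed a simpler choice for a sketch than guessing at a fallback.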
This populator allows you to select the JSON file you want to load, set the mime type to use to process each object within the file, and optionally provide a path to a field in the object that should be used to set the document name. In this example I'm loading Tweets so I've specified /id_str so that the name of the document is the ID of the tweet; paths are a / separated list of fields specifying the route to the relevant field and must start with a /.
The code for both plugins is still under active development (hence the -SNAPSHOT version number) while we improve error handling etc., so if you spot any issues or have suggestions for features we should add, please do let us know. You can use the relevant issue trackers on GitHub for either the JSON or Twitter format plugins.
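For anyone curious how a one-object-per-line file and a / separated name path fit together, here is a minimal Python sketch of the idea. It is not the populator's code: the function names resolve_name_path and name_documents are hypothetical, the sample tweets are invented, and skipping blank lines is an assumption made for the example rather than documented behaviour.

```python
import json


def resolve_name_path(obj, path):
    """Resolve a '/' separated path such as '/id_str' against a parsed object.

    Hypothetical helper: the path must start with '/', mirroring the rule
    described for the populator.
    """
    if not path.startswith("/"):
        raise ValueError("path must start with '/'")
    current = obj
    for key in path[1:].split("/"):
        current = current[key]
    return current


def name_documents(lines, name_path):
    """Yield (document name, parsed object) for each non-empty line."""
    for line in lines:
        line = line.strip()
        if not line:
            # ASSUMPTION: blank lines are skipped in this sketch.
            continue
        obj = json.loads(line)
        yield str(resolve_name_path(obj, name_path)), obj


# Invented sample input: one JSON object per line, as in a typical tweet dump.
sample_lines = [
    '{"id_str": "1001", "text": "first tweet"}',
    "",
    '{"id_str": "1002", "text": "second tweet"}',
]

named = list(name_documents(sample_lines, "/id_str"))
names = [name for name, _ in named]
print(names)
```

The same helper would follow a longer path, for example /user/screen_name, one field at a time.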