On GATE, Text and Social Media Analysis, and Detecting Misinformation Online: Python

Release

The GATE Team is proud to announce two new releases that bring GATE and Python together:

Python GateNLP (version 1.0.2): a Python 3 package that brings many of the concepts and the ease of handling documents, annotations and features to Python.
GATE Python Plugin (version 3.0.2): a new plugin that can be used from Java GATE to process documents using Python code and the methods provided by the Python GateNLP package.

Both are intended as first releases to a wider community, so that users can give feedback about what they need and what the basic design should look like.

Feedback

Users are invited to give feedback about the Python GateNLP package:

If you detect a bug, or have a feature request, please use the GitHub Issue Tracker
For more general discussions, ideas, or asking the community for help, please use (preferably) the GitHub Discussions Forum or the General GATE Mailing List
We are also interested in feedback about the API and the functionality of the package. If you want to use the package for your own development and want to discuss changes, improvements or how you can contribute, please use the GitHub Discussions Forum
We are happy to receive contributions! Please create an issue and discuss/plan with developers on the issue tracker before providing a pull request.

To give feedback about the Python Plugin:

For reporting bugs or feature requests, please use the GitHub Issue Tracker
For getting help and more general discussions, please use the General GATE Mailing List

IMPORTANT: whenever you give feedback, please include as much detail about your Operating System, Java or Python version, package/plugin version and your concrete problem or question as possible!

GATE Course Module

Module 11 of the upcoming online GATE course in February 2021 will introduce the Python GateNLP package and the GATE Python plugin. You can register for this and many other modules of the course here.

Python GateNLP

Python GateNLP is a Python NLP framework which provides some of the concepts and abstractions known from Java GATE in Python, plus a number of new features:

Documents with arbitrarily many features, arbitrarily many named Annotation sets. GateNLP also adds the capability of keeping a ChangeLog
AnnotationSets with arbitrarily many (stand-off) Annotations which can overlap in any way and can span any character range (not just entire tokens/words)
Annotations with arbitrarily many features, grouped per set by some annotation type name
Features which map keys to arbitrary values
Corpora: collections of documents. Python GateNLP provides corpora that directly map to files in a directory (recursively).
Prepared modules for processing documents. In GateNLP these are called "Annotators", and they also allow for filtering and splitting documents
Reading and writing in various formats. GateNLP uses three new formats, "bdocjs" (JSON serialization), "bdocym" (YAML serialization) and "bdocMP" (MessagePack serialization). Documents in these formats can be exchanged with Java GATE through the GATE plugin Format_Bdoc
Gazetteers for fast lookup and annotation of token sequences or character sequences which match a large list of known terms or phrases
PAMPAC: a way to annotate documents using patterns over text, other annotations and annotation features
An HTML visualizer which allows the user to interactively view GATE documents, annotations and features as separate HTML files or within Jupyter notebooks.
Bridges to powerful NLP libraries and conversion of their annotations to GateNLP annotations:

GateWorker: an API that allows the user to directly run Java GATE from Python and exchange documents between Python and Java
The Java GATE Python Plugin (see below) allows the user to run Python GateNLP code directly from Java GATE and process documents with it.
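
To make the data model in the list above more concrete, here is a minimal plain-Python sketch of stand-off annotation. It deliberately does not use the GateNLP API: the class names StandoffDocument and StandoffAnnotation, and the methods annotate and covered_text, are hypothetical and exist only for this illustration. The point it shows is that annotations live beside the text as character offsets, so they can overlap freely and can cover part of a word.

```python
from dataclasses import dataclass, field


@dataclass
class StandoffAnnotation:
    """Character offsets into the text, plus a type name and features."""
    start: int      # 0-based offset of the first character
    end: int        # offset one past the last character
    ann_type: str
    features: dict = field(default_factory=dict)


@dataclass
class StandoffDocument:
    """Text plus named sets of annotations; the text is never modified."""
    text: str
    features: dict = field(default_factory=dict)
    annotation_sets: dict = field(default_factory=dict)

    def annotate(self, set_name, start, end, ann_type, features=None):
        ann = StandoffAnnotation(start, end, ann_type, features or {})
        self.annotation_sets.setdefault(set_name, []).append(ann)
        return ann

    def covered_text(self, ann):
        return self.text[ann.start:ann.end]


doc = StandoffDocument("Sheffield University", features={"lang": "en"})

# Two overlapping annotations in one set, and one in another set that
# covers only part of a word.
org = doc.annotate("entities", 0, 20, "Organization", {"orgType": "university"})
city = doc.annotate("entities", 0, 9, "Location")
prefix = doc.annotate("morphology", 10, 13, "Prefix")

print(doc.covered_text(org))     # Sheffield University
print(doc.covered_text(city))    # Sheffield
print(doc.covered_text(prefix))  # Uni
```

The real package offers far more than this (change logs, serialization, gazetteers, pattern matching); the sketch is only meant to show why offsets rather than inline markup make overlapping annotations straightforward.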

GATE Python Plugin

The GATE Python Plugin is one of many GATE plugins that extend the functionality of Java GATE. This plugin allows the user to process GATE documents with Python programs (which use the GateNLP API for manipulating documents), either in the Java GATE GUI or via the multiprocessing GATE Cloud Processor (GCP).

GATE Cloud is GATE, the world-leading text-analytics platform, made available on the web with both human user interfaces and programmatic ones.

My name is David Jones and part of my role is to make it easier for you to use GATE. This article is aimed at Python programmers and people who are, rightly, curious to see if Python can help with their text analysis work.

GATE Cloud exposes a web API for many of its services. In this article, I'm going to sketch an example in Python that uses the GATE Cloud API to call ANNIE, the English Named Entity Recognizer.

I'm writing in Python 3 using the really excellent requests library.

The GATE Cloud API documentation describes the general outline of using the API, which is that you make an HTTP request setting particular headers.

The full code that I'm using is available on GitHub and is installable and runnable.

A simple use is to pass text to ANNIE and get annotated results back.
In terms of Python:

    import requests

    # url: the ANNIE endpoint listed in the GATE Cloud API documentation
    text = "David Jones joined the University of Sheffield this year"
    headers = {'Content-Type': 'text/plain'}
    response = requests.post(url, data=text, headers=headers)

The Content-Type header is required and specifies the MIME type of the text we are sending. In this case it's text/plain but GATE Cloud supports many types including PDF, HTML, XML, and Twitter's JSON format; details are in the GATE Cloud API documentation.
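
If you want to see exactly what such a request consists of without sending anything, here is a small sketch using only the standard library's urllib.request instead of requests. That is a swap I'm making purely for illustration, and the URL is a placeholder I made up, not the real ANNIE endpoint; take the real one from the GATE Cloud API documentation.

```python
import urllib.request

# Placeholder endpoint (an assumption for this sketch): substitute the
# ANNIE URL given in the GATE Cloud API documentation.
url = "https://example.org/annie-endpoint"
text = "David Jones joined the University of Sheffield this year"

request = urllib.request.Request(
    url,
    data=text.encode("utf-8"),               # urllib needs the body as bytes
    headers={"Content-Type": "text/plain"},  # MIME type of the body
    method="POST",
)

# Nothing has been sent yet; we are only inspecting the request object.
print(request.get_method())                # POST
print(request.get_header("Content-type"))  # text/plain

# Sending it would be: urllib.request.urlopen(request).read()
```

Note that urllib normalises header names to "Content-type" internally, which is why the lookup uses that spelling; HTTP header names are case-insensitive on the wire.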

The default output is JSON and in this case once I've used Python's json.dumps(thing, indent=2) to format it nicely, it looks like this:

{
  "text": "David Jones joined the University of Sheffield this year",
  "entities": {
    "Date": [
      {
        "indices": [
          47,
          56
        ],
        "rule": "ModifierDate",
        "ruleFinal": "DateOnlyFinal",
        "kind": "date"
      }
    ],
    "Organization": [
      {
        "indices": [
          23,
          46
        ],
        "orgType": "university",
        "rule": "GazOrganization",
        "ruleFinal": "OrgFinal"
      }
    ],
    "Person": [
      {
        "indices": [
          0,
          11
        ],
        "firstName": "David",
        "gender": "male",
        "surname": "Jones",
        "kind": "fullName",
        "rule": "PersonFull",
        "ruleFinal": "PersonFinal"
      }
    ]
  }
}

The JSON returned here is designed to have a similar structure to the format used by Twitter: Tweet JSON. The outermost dictionary has a text key and an entities key. The entities object is a dictionary that contains arrays of annotations of different types; each annotation being a dictionary with an indices key and other metadata. I find this kind of thing is impossible to describe and impossible to work with until I have an example and half-working code in front of me.

The full Python example uses this code to unpick the annotations and display their type and text:

    gate_json = response.json()
    response_text = gate_json["text"]
    for annotation_type, annotations in gate_json["entities"].items():
        for annotation in annotations:
            i, j = annotation["indices"]
            print(annotation_type, ":", response_text[i:j])

With the text I gave above, I get this output:

Date : this year
Organization : University of Sheffield
Person : David Jones

We can see that ANNIE has correctly picked out a date, an organisation, and a person, from the text. It's worth noting that the JSON output has more detail that I'm not using in this example: "University of Sheffield" is identified as a university; "David Jones" is identified with the gender "male".
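
To get at that extra detail, something like the following sketch would do. It works on a trimmed copy of the response shown above, so it runs without calling the service. The describe function and the SKIP set are names I've made up for this example, and which keys count as bookkeeping rather than interesting features is my own choice, not something the API defines.

```python
import json

# A trimmed copy of the response shown earlier in the article.
sample = json.loads("""
{
  "text": "David Jones joined the University of Sheffield this year",
  "entities": {
    "Organization": [
      {"indices": [23, 46], "orgType": "university",
       "rule": "GazOrganization", "ruleFinal": "OrgFinal"}
    ],
    "Person": [
      {"indices": [0, 11], "firstName": "David", "gender": "male",
       "surname": "Jones", "kind": "fullName",
       "rule": "PersonFull", "ruleFinal": "PersonFinal"}
    ]
  }
}
""")

# Keys treated here as bookkeeping rather than descriptive features
# (an assumption made for this sketch).
SKIP = {"indices", "rule", "ruleFinal"}


def describe(gate_json):
    """Return (type, covered text, remaining features) per annotation."""
    text = gate_json["text"]
    rows = []
    for annotation_type, annotations in gate_json["entities"].items():
        for annotation in annotations:
            i, j = annotation["indices"]
            features = {k: v for k, v in annotation.items() if k not in SKIP}
            rows.append((annotation_type, text[i:j], features))
    return rows


for row in describe(sample):
    print(row)
```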

Some notes on programming

requests is nice.
The Content-Type header is required.
requests has a response.json() method which is a shortcut for parsing the JSON into Python objects.
The JSON response has a text field, which is the text that was analysed. In my example it is the same as the input, but for a format such as PDF we need the linear text so that we can unambiguously assign index values within it.
The JSON response has an entities field, which is where all the annotations are, first separated and keyed by their annotation type.
The indices returned in the JSON are 0-based and end-exclusive, which matches the Python string slicing convention, hence we can use response_text[i:j] to get the correct piece of text.
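
As a small illustration of the last note, here is a sketch that uses the indices to wrap each annotated span in brackets. The mark_up function is my own invention for this example, and it assumes that the spans do not overlap, which is true for the example text but not guaranteed in general.

```python
def mark_up(text, entities):
    """Wrap each annotated span in [Type ...] brackets.

    Assumes non-overlapping spans (true for the example here, not
    guaranteed for arbitrary annotations).
    """
    spans = sorted(
        (ann["indices"][0], ann["indices"][1], ann_type)
        for ann_type, anns in entities.items()
        for ann in anns
    )
    pieces, cursor = [], 0
    for start, end, ann_type in spans:
        pieces.append(text[cursor:start])                # text before the span
        pieces.append(f"[{ann_type} {text[start:end]}]")  # end is exclusive: no +1
        cursor = end
    pieces.append(text[cursor:])                         # text after the last span
    return "".join(pieces)


text = "David Jones joined the University of Sheffield this year"
entities = {
    "Date": [{"indices": [47, 56]}],
    "Organization": [{"indices": [23, 46]}],
    "Person": [{"indices": [0, 11]}],
}
print(mark_up(text, entities))
# [Person David Jones] joined the [Organization University of Sheffield] [Date this year]
```

Because the end index is exclusive, the end of one span can be used directly as the start of the next slice, with no off-by-one adjustments anywhere.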

Quota and API keys

The public service has a fairly limited quota, but if you create an account on GATE Cloud you can create an API key which will allow you to access the service with increased quota and fewer limits.

To use your API key, use HTTP basic authentication, passing the Key ID as the username and the API key password as the password. requests makes this pretty simple, as you can supply auth=(key_id, password) as an additional keyword argument to requests.post(). Possibly even simpler though is to put those values in your ~/.netrc file (_netrc on Windows):

    machine cloud-api.gate.ac.uk
    login 71rs93h36m0c
    password 9u8ki81lstfc2z8qjlae

The nice thing about this is that requests will find and use these values automatically without you having to write any code.
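
If you want to check that a netrc file is well-formed before relying on it, the standard library's netrc module can parse it. requests does its own lookup, so this sketch is only a sanity check; the credentials in it are placeholders, and it writes to a temporary file rather than touching your real ~/.netrc.

```python
import netrc
import os
import tempfile

# Placeholder credentials for illustration only, not real ones.
content = (
    "machine cloud-api.gate.ac.uk\n"
    "login my-key-id\n"
    "password my-key-password\n"
)

# Use a temporary file so the sketch never reads or writes ~/.netrc.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "netrc")
    with open(path, "w") as handle:
        handle.write(content)
    # authenticators() returns (login, account, password) for a known
    # host, or None if the host has no entry.
    found = netrc.netrc(path).authenticators("cloud-api.gate.ac.uk")

login, _account, password = found
print(login, password)
```

To check your real file, call netrc.netrc() with no argument; on POSIX systems Python then also refuses to read a password from a file whose permissions are too open.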

Go try using the web API now, and let us know how you get on!

Sunday, 7 February 2021

New releases bringing GATE and Python closer together