On GATE, Text and Social Media Analysis, and Detecting Misinformation Online: Python: using ANNIE via its web API

GATE Cloud is GATE, the world-leading text-analytics platform, made available on the web with both human user interfaces and programmatic ones.

My name is David Jones and part of my role is to make it easier for you to use GATE. This article is aimed at Python programmers and people who are, rightly, curious to see if Python can help with their text analysis work.

GATE Cloud exposes a web API for many of its services. In this article, I'm going to sketch an example in Python that uses the GATE Cloud API to ANNIE, the English Named Entity Recognizer.

I'm writing in Python 3 using the really excellent requests library.

The GATE Cloud API documentation describes the general outline of using the API, which is that you make an HTTP request setting particular headers.

The full code that I'm using is available on GitHub and is installable and runnable.

A simple use is to pass text to ANNIE and get annotated results back.
In terms of Python:

    text = "David Jones joined the University of Sheffield this year"
    headers = {'Content-Type': 'text/plain'}
    response = requests.post(url, data=text, headers=headers)

The Content-Type header is required and specifies the MIME type of the text we are sending. In this case it's text/plain but GATE Cloud supports many types including PDF, HTML, XML, and Twitter's JSON format; details are in the GATE Cloud API documentation.

The default output is JSON and in this case once I've used Python's json.dumps(thing, indent=2) to format it nicely, it looks like this:

{
"text": "David Jones joined the University of Sheffield this year",
"entities": {
    "Date": [
      {
        "indices": [
          47,
          56
        ],
        "rule": "ModifierDate",
        "ruleFinal": "DateOnlyFinal",
        "kind": "date"
      }
    ],
    "Organization": [
      {
        "indices": [
          23,
          46
        ],
        "orgType": "university",
        "rule": "GazOrganization",
        "ruleFinal": "OrgFinal"
      }
    ],
    "Person": [
      {
        "indices": [
          0,
          11
        ],
        "firstName": "David",
        "gender": "male",
        "surname": "Jones",
        "kind": "fullName",
        "rule": "PersonFull",
        "ruleFinal": "PersonFinal"
      }
    ]
}
}

The JSON returned here is designed to have a similar structure to the format used by Twitter: Tweet JSON. The outermost dictionary has a text key and an entities key. The entities object is a dictionary that contains arrays of annotations of different types; each annotation being a dictionary with an indices key and other metadata. I find this kind of thing is impossible to describe and impossible to work with until I have an example and half-working code in front of me.

The full Python example uses this code to unpick the annotations and display their type and text:

    gate_json = response.json()
    response_text = gate_json["text"]
    for annotation_type, annotations in gate_json["entities"].items():
        for annotation in annotations:
            i, j = annotation["indices"]
            print(annotation_type, ":", response_text[i:j])

With the text I gave above, I get this output:

Date : this year
Organization : University of Sheffield
Person : David Jones

We can see that ANNIE has correctly picked out a date, an organisation, and a person, from the text. It's worth noting that the JSON output has more detail that I'm not using in this example: "University of Sheffield" is identified as a university; "David Jones" is identified with the gender "male".

Some notes on programming

requests is nice.
Content-Type header is required.
requests has a response.json() method which is a shortcut for parsing the JSON into Python objects.
the JSON response has a text field, which is the text that was analysed (in my example they are the same, but for PDF we need the linear text so that we can unambiguously assign index values within it).
the JSON response has an entities field, which is where all the annotations are, first separated and keyed by their annotation type.
the indices returned in the JSON are 0-based end-exclusive which matches the Python string slicing convention, hence we can use response_text[i:j] to get the correct piece of text.

Quota and API keys

The public service has a fairly limited quota, but if you create an account on GATE Cloud you can create an API key which will allow you to access the service with increased quota and fewer limits.

To use your API key, use HTTP basic authentication, passing in the Key ID as the user-id and the API key password as the password. requests makes this pretty simple, as you can supply auth=(user, pass) as an additional keyword argument to requests.post(). Possibly even simpler though is to put those values in your ~/.netrc file (_netrc in Windows):

    machine cloud-api.gate.ac.uk
    login 71rs93h36m0c
    password 9u8ki81lstfc2z8qjlae

The nice thing about this is that requests will find and use these values automatically without you having to write any code.

Go try using the web API now, and let us know how you get on!

On GATE, Text and Social Media Analysis, and Detecting Misinformation Online

Thursday, 7 March 2019

Python: using ANNIE via its web API

Some notes on programming

Quota and API keys

No comments:

Post a Comment