Thursday, 9 May 2019

GATE at World Press Freedom Day

GATE at World Press freedom day: STRENGTHENING THE MONITORING OF SDG 16.10.1

In her role with CFOM (the University's Centre for Freedom of the Media, hosted in the department of Journalism Studies), Diana Maynard travelled to Ethiopia together with CFOM members Sara Torsner and Jackie Harrison to present their research at the World Press Freedom Day Academic Conference on the Safety of journalists in Addis Ababa, on 1 May, 2019. This ongoing research aims to facilitate the comprehensive monitoring of violations against journalists, in line with Sustainable Development Goal (SDG) 16.10.1. This is part of a collaborative project between CFOM and the press freedom organisation Free Press Unlimited, which aims to develop a methodology for systematic data collection on a range of attacks on journalists, and to provide a mechanism for dealing with missing, conflicting and potentially erroneous information.

Discussing possibilities for adopting NLP tools for developing a monitoring infrastructure that allows for the systematisation and organisation of a range of information and data sources related to violations against journalists, Diana proposed a set of areas of research that aim to explore this in more depth. These include: switching to an events-based methodology, reconciling data from multiple sources, and investigating information validity.

Whereas approaches to monitoring violations against journalists traditionally uses a person-based approach, recording information centred around an individual, we suggest that adopting an events-based methodology instead allows for the violation itself to be placed at the centre: ‘by enabling the contextualising and recording of in-depth information related to a single instance of violence such as a killing, including information about key actors and their interrelationship (victim, perpetrator and witness of a violation), the events-based approach enables the modelling of the highly complex structure of a violation. It also allows for the recording of the progression of subsequent violations as well as multiple violations experienced by the same victim (e.g. detention, torture and killing)’.

Event-based data model from HURIDOCS Source:
Another area of research includes possibilities for reconciling information from different databases and sources of information on violations against journalists through NLP techniques. Such methods would allow for the assessment and compilation of partial and contradictory data about the elements constituting a given attack on a journalist. ‘By creating a central categorisation scheme we would essentially be able to facilitate the mapping and pooling of data from various sources into one data source, thus creating a monitoring infrastructure for SDG 16.10.1’, said Diana Maynard. Systematic data on a range of violations against journalists that are gathered in a methodologically systematic and transparent way would also be able to address issues of information validity and source verification: ‘Ultimately such data would facilitate the investigation of patterns, trends and early warnings, leading to a better understanding of the contexts in which threats to journalists can escalate into a killing undertaken with impunity’. We thus propose a framework for mapping between different datasets and event categorisation schemes in order to harmonise information.

In our proposed methodology, GATE tools can be used to extract information from the free text portions of existing databases and link them to external knowledge sources in order to acquire more detailed information about an event, and to enable semantic reasoning about entities and events, thereby helping to both reconcile information at different levels of granularity (e.g. Dublin vs Ireland; shooting vs killing) and to structure information for further search and analysis. 

Slides from the presentation are available here; the full journal paper is forthcoming.
The original article from which this post is adapted is available on the CFOM website

Wednesday, 17 April 2019

WeVerify: Algorithm-Supported Verification of Digital Content

Announcing WeVerify: an algorithm-supported method for digital content verification. The WeVerify platform will provide an independent and community driven environment for the verification of online content, with further verification provided by expert partners. Prof. Kalina Bontcheva will be serving as the Scientific Director of the project.

Online disinformation and fake media content have emerged as a serious threat to democracy, economy and society. Content verification is currently far from trivial, even for experienced journalists, human rights activists or media literacy scholars. Moreover, recent advances in artificial intelligence (deep learning) have enabled the creation of intelligent bots and highly realistic synthetic multimedia content. Consequently, it is extremely challenging for citizens and journalists to assess the credibility of online content, and to navigate the highly complex online information landscapes.

WeVerify aims to address the complex content verification challenges through a participatory verification approach, open source algorithms, low-overhead human-in-the-loop machine learning and intuitive visualizations. Social media and web content will be analysed and contextualised within the broader online ecosystem, in order to expose fabricated content, through cross-modal content verification, social network analysis, micro-targeted debunking and a blockchain-based public database of known fakes.

Add caption
A key outcome will be the WeVerify platform for collaborative, decentralised content verification, tracking, and debunking.
The platform will be open source to engage communities and citizen journalists alongside newsroom and freelance journalists. To enable low-overhead integration with in-house content management systems and support more advanced newsroom needs, a premium version of the platform will also be offered. It will be furthermore supplemented by a digital companion to assist with verification tasks.

Results will be validated by professional journalists and debunking specialists from project partners (DW, AFP, DisinfoLab), external participants (e.g. members of the First Draft News network), the community of more than 2,700 users of the InVID verification plugin, and by media literacy, human rights and emergency response organisations.

The WeVerify website can be found at, and WeVerify can be found on Twitter @WeV3rify!

Monday, 11 March 2019

Coming Up: 12th GATE Summer School 17-21 June 2019

It is approaching that time of the year again! The GATE training course will be held from 17-21 June 2019 at the University of Sheffield, UK.

No previous experience or programming expertise is necessary, so it's suitable for anyone with an interest in text mining and using GATE, including people from humanities backgrounds, social sciences, etc.

This event will follow a similar format to that of the 2018 course, with one track Monday to Thursday, and two parallel tracks on Friday, all delivered by the GATE development team. You can read more about it and register here. Early bird registration is available at a discounted rate until 1 May.

The focus will be on mining text and social media content with GATE. Many of the hands on exercises will be focused on analysing news articles, tweets, and other textual content.

The planned schedule is as follows (NOTE: may still be subject to timetabling changes).
Single track from Monday to Thursday (9am - 5pm):
  • Monday: Module 1: Basic Information Extraction with GATE
    • Intro to GATE + Information Extraction (IE)
    • Corpus Annotation and Evaluation
    • Writing Information Extraction Patterns with JAPE
  • Tuesday: Module 2: Using GATE for social media analysis
    • Challenges for analysing social media, GATE for social media
    • Twitter intro + JSON structure
    • Language identification, tokenisation for Twitter
    • POS tagging and Information Extraction for Twitter
  • Wednesday: Module 3: Crowdsourcing, GATE Cloud/MIMIR, and Machine Learning
    • Crowdsourcing annotated social media content with the GATE crowdsourcing plugin
    • GATE Cloud, deploying your own IE pipeline at scale (how to process 5 million tweets in 30 mins)
    • GATE Mimir - how to index and search semantically annotated social media streams
    • Challenges of opinion mining in social media
    • Training Machine Learning Models for IE in GATE
  • Thursday: Module 4: Advanced IE and Opinion Mining in GATE
    • Advanced Information Extraction
    • Useful GATE components (plugins)
    • Opinion mining components and applications in GATE
On Friday, there is a choice of modules (9am - 5pm):
  • Module 5: GATE for developers
    • Basic GATE Embedded
    • Writing your own plugin
    • GATE in production - multi-threading, web applications, etc.
  • Module 6: GATE Applications
    • Building your own applications
    • Examples of some current GATE applications: social media summarisation, visualisation, Linked Open Data for IE, and more
These two modules are run in parallel, so you can only attend one of them. You will need to have some programming experience and knowledge of Java to follow Module 5 on the Friday. No particular expertise is needed for Module 6.
Hope to see you in Sheffield in June!

Thursday, 7 March 2019

Python: using ANNIE via its web API

GATE Cloud is GATE, the world-leading text-analytics platform, made available on the web with both human user interfaces and programmatic ones.

My name is David Jones and part of my role is to make it easier for you to use GATE. This article is aimed at Python programmers and people who are, rightly, curious to see if Python can help with their text analysis work.

GATE Cloud exposes a web API for many of its services. In this article, I'm going to sketch an example in Python that uses the GATE Cloud API to ANNIE, the English Named Entity Recognizer.

I'm writing in Python 3 using the really excellent requests library.

The GATE Cloud API documentation describes the general outline of using the API, which is that you make an HTTP request setting particular headers.

The full code that I'm using is available on GitHub and is installable and runnable.

A simple use is to pass text to ANNIE and get annotated results back.
In terms of Python:

    text = "David Jones joined the University of Sheffield this year"
    headers = {'Content-Type': 'text/plain'}
    response =, data=text, headers=headers)

The Content-Type header is required and specifies the MIME type of the text we are sending. In this case it's text/plain but GATE Cloud supports many types including PDF, HTML, XML, and Twitter's JSON format; details are in the GATE Cloud API documentation.

The default output is JSON and in this case once I've used Python's json.dumps(thing, indent=2) to format it nicely, it looks like this:
  "text": "David Jones joined the University of Sheffield this year",
  "entities": {
    "Date": [
        "indices": [
        "rule": "ModifierDate",
        "ruleFinal": "DateOnlyFinal",
        "kind": "date"
    "Organization": [
        "indices": [
        "orgType": "university",
        "rule": "GazOrganization",
        "ruleFinal": "OrgFinal"
    "Person": [
        "indices": [
        "firstName": "David",
        "gender": "male",
        "surname": "Jones",
        "kind": "fullName",
        "rule": "PersonFull",
        "ruleFinal": "PersonFinal"
The JSON returned here is designed to have a similar structure to the format used by Twitter: Tweet JSON. The outermost dictionary has a text key and an entities key. The entities object is a dictionary that contains arrays of annotations of different types; each annotation being a dictionary with an indices key and other metadata. I find this kind of thing is impossible to describe and impossible to work with until I have an example and half-working code in front of me.

The full Python example uses this code to unpick the annotations and display their type and text:

    gate_json = response.json()
    response_text = gate_json["text"]
    for annotation_type, annotations in gate_json["entities"].items():
        for annotation in annotations:
            i, j = annotation["indices"]
            print(annotation_type, ":", response_text[i:j])

With the text I gave above, I get this output:
Date : this year
Organization : University of Sheffield
Person : David Jones
We can see that ANNIE has correctly picked out a date, an organisation, and a person, from the text. It's worth noting that the JSON output has more detail that I'm not using in this example: "University of Sheffield" is identified as a university; "David Jones" is identified with the gender "male".

Some notes on programming

  • requests is nice.
  • Content-Type header is required.
  • requests has a response.json() method which is a shortcut for parsing the JSON into Python objects.
  • the JSON response has a text field, which is the text that was analysed (in my example they are the same, but for PDF we need the linear text so that we can unambiguously assign index values within it).
  • the JSON response has an entities field, which is where all the annotations are, first separated and keyed by their annotation type.
  • the indices returned in the JSON are 0-based end-exclusive which matches the Python string slicing convention, hence we can use response_text[i:j] to get the correct piece of text.

Quota and API keys

The public service has a fairly limited quota, but if you create an account on GATE Cloud you can create an API key which will allow you to access the service with increased quota and fewer limits.

To use your  API key, use HTTP basic authentication, passing in the Key ID as the user-id and the API key password as the password. requests makes this pretty simple, as you can supply auth=(user, pass) as an additional keyword argument to Possibly even simpler though is to put those values in your ~/.netrc file (_netrc in Windows):

    login 71rs93h36m0c
    password 9u8ki81lstfc2z8qjlae

The nice thing about this is that requests will find and use these values automatically without you having to write any code.

Go try using the web API now, and let us know how you get on!

Tuesday, 5 March 2019

Brexit--The Regional Divide

Although the UK voted by a narrow margin in the UK EU membership referendum in 2016 to leave the EU, that outcome failed to capture the diverse feelings held in various regions. It's a curious observation that the UK regions with the most economic dependence on the EU were the regions more likely to vote to leave it. The image below on the right is taken from this article from the Centre for European Reform, and makes the point in a few different ways. This and similar research inspired a current project the GATE team are undertaking with colleagues in the Geography and Journalism departments at Sheffield University, under the leadership of Miguel Kanai and with funding from the British Academy, aiming to understand whether lack of awareness of individual local situation played a role in the referendum outcome.
Our Brexit tweet corpus contains tweets collected during the run-up to the Brexit referendum, and we've annotated almost half a million accounts for Brexit vote intent with a high accuracy. You can read about that here. So we thought we'd be well positioned to bring some insights. We also annotated user accounts with location: many Twitter users volunteer that information, though there can be a lot of variation on how people describe their location, so that was harder to do accurately. We also used local and national news media corpora from the time of the referendum, in order to contrast national coverage with local issues are around the country.
"People's resistance to propaganda and media‐promoted ideas derives from their close ties in real communities"
Jean Seaton
Using topic modelling and named entity recognition, we were able to look for similarities and differences in the focus of local and national media and Twitter users. The bar chart on the left gets us started, illustrating that foci differ between media. Twitter users give more air time than news media to trade and immigration, whereas local press takes the lead on employment, local politics and agriculture. National press gives more space to terrorism than either Twitter or local news.
On the right is just one of many graphs in which we unpack this on a region-by-region basis (you can find more on the project website). In this choropleth, red indicates that the topic was significantly more discussed in national press than in local press in that area, and green indicates that the topic was significantly more discussed in local press there than in national press. Terrorism and immigration have perhaps been subject to a certain degree of media and propaganda inflation--we talk about this in our Social Informatics paper. Where media focus on locally relevant issues, foci are more grounded, for example in practical topics such as agriculture and employment. We found that across the regions, Twitter remainers showed a closer congruence with local press than Twitter leavers.
The graph on the right shows the number of times a newspaper was linked on Twitter, contrasted against the percentage of people that said they read that newspaper in the British Election Study. It shows that the dynamics of popularity on Twitter are very different to traditional readership. This highlights a need to understand how the online environment is affecting the news reportage we are exposed to, creating a market for a different kind of material, and a potentially more hostile climate for quality journalism, as discussed by project advisor Prof. Jackie Harrison here. Furthermore, local press are increasingly struggling to survive, so it feels important to highlight their value through this work.
You can see more choropleths on the project website. There's also an extended version here of an article currently under review.

Wednesday, 20 February 2019

GATE team wins first prize in the Hyperpartisan News Detection Challenge

SemEval 2019 recently launched the Hyperpartisan News Detection Task in order to evaluate how well tools could automatically classify hyperpartisan news texts. The idea behind this is that "given a news text, the system must decide whether it follows a hyperpartisan argumentation, i.e. whether it exhibits blind, prejudiced, or unreasoning allegiance to one party, faction, cause, or person.

Below we see an example of (part of) two news stories about Donald Trump from the challenge data. The one on the left is considered to be hyperpartisan, as it shows a biased kind of viewpoint. The one on the right simply reports a story and is not considered hyperpartisan. The distinction is difficult even for humans, because there are no exact rules about what makes a story hyperpartisan.

In total, 322 teams registered to take part, of which 42 actually submitted an entry, including the GATE team consisting of Ye Jiang, Xingyi Song and Johann Petrak, with guidance from Kalina Bontcheva and Diana Maynard.

The main performance measure for the task is accuracy on a balanced set of articles, though additionally precision, recall, and F1-score were measured for the hyperpartisan class. In the final submission, the GATE team's hyperpartisan classifying algorithm achieved 0.822 accuracy for manually annotated evaluation set, and ranked in first position in the final leader board.

Our winning system was based on using sentence representations from averaged word embeddings  generated from the pre-trained ELMo model with a Convolutional Neural Network and Batch Normalization for training on the provided dataset. An averaged ensemble of models was then used to generate the final predictions. 

The source code and full system description is available on github.

One of the major challenges of this task is that the model must have the ability to adapt to a large range of article sizes. Most state-of-the-art neural network approaches for document classification use a token sequence as network input, but such an approach in this case would mean either a massive computational cost or loss of information, depending on how the maximum sequence length. We got around this problem by first pre-calculating sentence level embeddings as the average of word embeddings for each sentence, and then representing the document as a sequence of these sentence embeddings. We also found that actually ignoring some of the provided training data (which was automatically generated based on the document publishing source) improved our results, which leads to important conclusions about the trustworthiness of training data and its implications.

Overall, the ability to do well on the hyperpartisan news prediction task is important both for improving knowledge about neural networks for language processing generally, but also because better understanding of the nature of biased news is critical for society and democracy.

Monday, 18 February 2019

Russian Troll Factory: Sketches of a Propaganda Campaign

When Twitter shared a large archive of propaganda tweets late in 2018 we were excited to get access to over 9 million tweets from almost 4 thousand unique Twitter accounts controlled by Russia's Internet Research Agency. The tweets are posted in 57 different languages, but most are in Russian (53.68%) and English (36.08%). Average account age is around four years, and the longest accounts are as much as ten years old.
A large amount of activity in both the English and Russian accounts is given to news provision. Secondly, many accounts seem to engage in hashtag games, which may be a way to establish an account and get some followers. Of particular interest however are the political trolls. Left trolls pose as individuals interested in the Black Lives Matter campaign. Right trolls are patriotic, anti-immigration Trump supporters. Among left and right trolls, several have achieved large follower numbers and even a degree of fame. Finally there are fearmonger trolls, that propagate scares, and a small number of commercial trolls. The Russian language accounts also divide on similar lines, perhaps posing as individuals with opinions about Ukraine or western politics. These categories were proposed by Darren Linvill and Patrick Warren, from Clemson University. In the word clouds below you can see the hashtags we found left and right trolls using.

Left Troll Hashtags

Right Troll Hashtags
Mehmet E. Bakir has created some interactive graphs enabling us to explore the data. In the network diagram at the start of the post you can see the network of mention/retweet/reply/quote counts we created from the highly followed accounts in the set. You can click through to an interactive version, where you can zoom in and explore different troll types.
In the graph below, you can see activity in different languages over time (interactive version here, or interact with the embedded version below; you may have to scroll right). It shows that the Russian language operation came first, with English language operations following after. The timing of this part of the activity coincides with Russia's interest in Ukraine.

In the graph below, also available here, you can see how different types of behavioural strategy pay off in terms of achieving higher numbers of retweets. Using Linvill and Warren's manually annotated data, Mehmet built a classifier that enabled us to classify all the accounts in the dataset. It is evident that the political trolls have by far the greatest impact in terms of retweets achieved, with left trolls being the most successful. Russia's interest in the Black Lives Matter campaign perhaps suggests that the first challenge for agents is to win a following, and that exploiting divisions in society is an effective way to do that. How that following is then used to influence minds is a separate question. You can see a pre-print of our paper describing our work so far, in the context of the broader picture of partisanship, propaganda and post-truth politics, here.