Showing posts with label #GateAcUk. Show all posts

Thursday, 9 May 2019

GATE at World Press Freedom Day

GATE at World Press Freedom Day: Strengthening the Monitoring of SDG 16.10.1


In her role with CFOM (the University's Centre for Freedom of the Media, hosted in the Department of Journalism Studies), Diana Maynard travelled to Ethiopia together with CFOM members Sara Torsner and Jackie Harrison to present their research at the World Press Freedom Day Academic Conference on the Safety of Journalists in Addis Ababa on 1 May 2019. This ongoing research aims to facilitate the comprehensive monitoring of violations against journalists, in line with Sustainable Development Goal (SDG) indicator 16.10.1. It is part of a collaborative project between CFOM and the press freedom organisation Free Press Unlimited, which aims to develop a methodology for systematic data collection on a range of attacks on journalists, and to provide a mechanism for dealing with missing, conflicting and potentially erroneous information.

Discussing how NLP tools could be adopted to build a monitoring infrastructure that systematises and organises the range of information and data sources on violations against journalists, Diana proposed a set of research areas to explore this in more depth. These include: switching to an events-based methodology, reconciling data from multiple sources, and investigating information validity.



Whereas approaches to monitoring violations against journalists have traditionally used a person-based approach, recording information centred on an individual, we suggest that adopting an events-based methodology instead places the violation itself at the centre: ‘by enabling the contextualising and recording of in-depth information related to a single instance of violence such as a killing, including information about key actors and their interrelationship (victim, perpetrator and witness of a violation), the events-based approach enables the modelling of the highly complex structure of a violation. It also allows for the recording of the progression of subsequent violations as well as multiple violations experienced by the same victim (e.g. detention, torture and killing)’.

Event-based data model (source: HURIDOCS)
Another area of research includes possibilities for reconciling information from different databases and sources of information on violations against journalists through NLP techniques. Such methods would allow for the assessment and compilation of partial and contradictory data about the elements constituting a given attack on a journalist. ‘By creating a central categorisation scheme we would essentially be able to facilitate the mapping and pooling of data from various sources into one data source, thus creating a monitoring infrastructure for SDG 16.10.1’, said Diana Maynard. Systematic data on a range of violations against journalists that are gathered in a methodologically systematic and transparent way would also be able to address issues of information validity and source verification: ‘Ultimately such data would facilitate the investigation of patterns, trends and early warnings, leading to a better understanding of the contexts in which threats to journalists can escalate into a killing undertaken with impunity’. We thus propose a framework for mapping between different datasets and event categorisation schemes in order to harmonise information.


In our proposed methodology, GATE tools can be used to extract information from the free text portions of existing databases and link them to external knowledge sources in order to acquire more detailed information about an event, and to enable semantic reasoning about entities and events, thereby helping to both reconcile information at different levels of granularity (e.g. Dublin vs Ireland; shooting vs killing) and to structure information for further search and analysis. 
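Reconciling records at different levels of granularity essentially amounts to subsumption checks over small taxonomies. A toy sketch in Python (the mappings below are hand-built illustrations, not the actual external knowledge sources used):

```python
# Hypothetical, hand-built taxonomies for illustration only.
PART_OF = {"Dublin": "Ireland", "Sheffield": "United Kingdom"}   # place containment
SUBTYPE_OF = {"shooting": "killing", "detention": "deprivation of liberty"}

def compatible(a, b, taxonomy):
    """True if a and b denote the same thing at possibly different granularity,
    i.e. one subsumes the other in the given taxonomy."""
    def ancestors(x):
        seen = {x}
        while x in taxonomy:
            x = taxonomy[x]
            seen.add(x)
        return seen
    return b in ancestors(a) or a in ancestors(b)

# "Dublin" and "Ireland" can describe the same event at different granularity:
assert compatible("Dublin", "Ireland", PART_OF)
assert compatible("shooting", "killing", SUBTYPE_OF)
assert not compatible("Dublin", "United Kingdom", PART_OF)
```

In practice the taxonomies would come from linked external knowledge sources rather than hand-built dictionaries, but the reconciliation test has the same shape.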


Slides from the presentation are available here; the full journal paper is forthcoming.
The original article from which this post is adapted is available on the CFOM website.

Tuesday, 24 April 2018

Funded PhD Opportunity: Large Scale Analysis of Online Disinformation in Political Debates

Applications are invited for an EPSRC-funded studentship at The University of Sheffield commencing on 1 October 2018.

The PhD project will examine the intersection of online political debates and misinformation, through big data analysis. This research is very timely, because online mis- and disinformation is reinforcing the formation of polarised partisan camps, sharing biased, self-reinforcing content. This is coupled with the rise in post-truth politics, where key arguments are repeated continuously, even when proven untrue by journalists or independent experts. Journalists and media have tried to counter this through fact-checking initiatives, but these are currently mostly manual, and thus not scalable to big data.


The aim is to develop machine learning-based methods for large-scale analysis of online misinformation and its role in political debates on online social platforms.



Application deadline: as soon as possible, until the position is filled
Interviews: held within 2-3 weeks of application

Supervisory team: Professor Kalina Bontcheva (Department of Computer Science, University of Sheffield), Professor Piers Robinson (Department of Journalism, University of Sheffield), and Dr. Nikolaos Aletras (Information School, University of Sheffield).


Award Details

The studentship will cover tuition fees at the EU/UK rate and provide an annual maintenance stipend at standard Research Council rates (£14,777 in 2018/19) for 3.5 years.

Eligibility

The general eligibility requirements are:
  • Applicants should normally have studied in a relevant field to a very good standard at MSc level or equivalent experience.
  • Applicants should also have a 2.1 in a BSc degree, or equivalent qualification, in a related discipline.
  • EPSRC studentships are only available to students from the UK or European Union; applications cannot be accepted from students liable to pay fees at the Overseas rate. UK students will normally be eligible for a full award covering fees and a maintenance grant if they meet the residency criteria; EU students will be eligible for a fees-only award unless they have been resident in the UK for the 3 years immediately prior to taking up the award.

How to apply

To apply for the studentship, applicants need to apply directly to the University of Sheffield for entrance into the doctoral programme in Computer Science.


  • Complete an application for admission to the standard computer science PhD programme http://www.sheffield.ac.uk/postgraduate/research/apply 
  • Applications should include a research proposal; CV; academic writing sample; transcripts and two references.
  • The research proposal of up to 1,000 words should outline your reasons for applying to this project and how you would approach the research, including details of your skills and experience in computing and/or data journalism.
  • Supporting documents should be uploaded to your application.

Tuesday, 27 February 2018

Students use GATE and Twitter to drive Lego robots



At the university's Headstart Summer School in July 2017, 42 students (aged 16 and 17) from all over the UK were taught to write Java programs to control Lego robots, using input from the robots (such as the sensor for detecting coloured marks on the floor) as well as operating the motors to move and turn.  (The university provided a custom Java library for this.)


On 11 and 12 July we ran a practical session on "Controlling Robots with Tweets".  We presented a quick introduction to natural language processing (using computer programs to analyse human languages such as English) and provided them with a bundle of software containing a version of the GATE Cloud Twitter Collector modified to run a special GATE application with a custom plugin to let it use the Java robot library.

The bundle came with a simple "gazetteer" containing two lists of classified keywords:

"left" list: left, port
"turn" list: turn, take, make, move
and a basic JAPE grammar to make use of it.  JAPE is a specialized language used in GATE to match regular expressions over annotations in documents. (The annotations are similar to XML tags, except that GATE applications can create them as well as read them and they can overlap each other without restrictions.  Technically they form an annotation graph.)

The grammar we provided would match any keyword from the "turn" list followed by any keyword from the "left" list (with zero or more unmatched words in between, e.g., "turn to port", "take a left", "turn left") and then run the code to turn the robot's right motor (making it turn left in place).
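For illustration, a minimal JAPE rule along these lines might look as follows. This is a sketch, not the exact grammar from the handout; it assumes the gazetteer annotates keywords as Lookup annotations with majorType "turn" or "left":

```
Phase: RobotCommands
Input: Token Lookup
Options: control = appelt

Rule: TurnLeft
(
  {Lookup.majorType == "turn"}
  ({Token})*                        // zero or more unmatched words in between
  {Lookup.majorType == "left"}
):cmd
-->
:cmd.TurnLeftCommand = {rule = "TurnLeft"}
// In the handout version, the right-hand side instead contained Java code
// calling the robot library to run the right motor (turning the robot left).
```

A rule for right turns would be symmetrical, with "right"/"starboard" keyword lists.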

We showed them how to configure the Twitter Collector, authenticate with their Twitter accounts, follow themselves, and then run the collector with this application.  Getting the system set up and working was a bit laborious, but once the first group got their robot to move in response to a tweet and cheered, everyone got a lot more interested very quickly.  They were very interested in extending the word lists and JAPE rules to cover a wider range of tweeted commands.

Some of the students had also developed interesting and complicated manoeuvres in Java the previous day, which they wanted to incorporate into the Twitter-controlled system.  We helped these students add their code to their own copies of the GATE plugin and re-load it so the JAPE rules could call their procedures.

This project was fun and interesting for the staff as well as the students, and we will include it in Headstart 2018.

The Headstart 2017 video includes these activities.  The instructions (presentation and handout) and software are available on-line.

This work is supported by the European Union's Horizon 2020 project SoBigData (grant agreement no. 654024).


Friday, 1 July 2016

The Tools Behind Our Brexit Analyser


UPDATE (13 December, 2016): Try the Brexit Analyzer 

We have now made parts of the Brexit Analyzer available as a web service. You can try the topic detection by putting an example tweet here (choose mentions of political topics):

https://cloud.gate.ac.uk/shopfront/sampleServices 

A more extensive test of the outputs (also including hashtags, voting intent, @mention, and URL detection) can be tried here:

https://cloud.gate.ac.uk/shopfront/displayItem/sobigdata-brexit 

This is a web service running on GATE Cloud, where you can find many other text analytics services, available to try for free or run on large batches of data. 

We also have now a tweet collection service, should you wish to start collecting and analysing your own Brexit (or any other) tweets:

https://cloud.gate.ac.uk/shopfront/displayItem/twitter-collector 

Tools Overview 

It will be two weeks tomorrow since we launched the Brexit Analyser -- our real-time tweet analysis system, based on our GATE text analytics and semantic annotation tools.

Back then, we were analysing on average 500,000 (yes, half a million!) tweets a day. Then, on referendum day alone, we had to analyse in real-time well over 2 million tweets. Or on average, just over 23 tweets per second! It wasn't quite so simple though, as tweet volume picked up dramatically as soon as the polls closed at 10pm and we were consistently getting around 50 tweets per second and were also being rate-limited by the Twitter API. 

These are some pretty serious data volumes, as well as velocity. So how did we build the Brexit Analyser to cope?

For analysis, we are using GATE's TwitIE system, which consists of a tokenizer, normalizer, part-of-speech tagger, and a named entity recognizer. After that, we added our Leave/Remain classifier, which helps us identify a reliable sample of tweets with unambiguous stance.  Next is a tweet geolocation component, which uses latitude/longitude, region, and user location metadata to geolocate tweets within the UK NUTS2 regions. We also detect key themes and topics discussed in the tweets (more than one topic/theme can be contained in each tweet), followed by topic-centric sentiment analysis.  



We kept the processing pipeline as simple and efficient as possible, so it can run at 100 tweets per second even on a pretty basic server.  

The analysis results are fed into GATE Mimir, which efficiently indexes the tweet text and all our linguistic annotations. Mimir has a powerful programming API for semantic search queries, which we use to drive different web pages with interactive visualisations. The user can choose what they want to see, based on time (e.g. most popular hashtags on 23 Jun; most talked about topics in Leave/Remain tweets on 23 Jun). Clicking on these infographics shows the actual matching tweets.

All my blog posts so far have been using screenshots of such interactively generated visualisations. 

Mimir also has a more specialised graphical interface (Prospector), which I use for formulating semantic search queries and inspecting the matching data, coupled with some pre-set types of visualisations. The screenshot below shows my Mimir query for all original tweets on 23 Jun which advocate Leave. I can then inspect the most mentioned Twitter users within those. (I used Prospector for my analysis of Leave/Remain voting trends on referendum day.)


So how do I do my analyses?


First I decide what subset of tweets I want to analyse. This is typically a Mimir query restricting by timestamp (normalized to GMT), tweet kind (original, reply, or retweet), voting intention (Leave/Remain), mentioning a specific user/hashtag/topic, written by a specific user, containing a given hashtag or a given topic (e.g. all tweets discussing taxes).  

Then, once I identify this dynamically generated subset of tweets, I can analyse it with Prospector or use the visualisations which we generate via the Mimir API. These include:

  • Top X most frequently mentioned words, nouns, verbs, or noun phrases
  • Top X most frequent posters / most frequently mentioned tweeters
  • Top X most frequent Locations, Organizations, or Persons within those tweets
  • Top X themes / sub-themes according to our topic classifier
  • Frequent URLs, language of the tweets, and sentiment
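Outside Mimir, the same kind of "Top X" statistics can be computed over any subset of tweets with a few lines of plain Python. This is a simplified stand-in for the Mimir API, shown for hashtags and @-mentions only:

```python
import re
from collections import Counter

def top_x(tweets, x=5):
    """Most frequent hashtags and @-mentions in a subset of tweet texts."""
    hashtags, mentions = Counter(), Counter()
    for text in tweets:
        hashtags.update(h.lower() for h in re.findall(r"#\w+", text))
        mentions.update(m.lower() for m in re.findall(r"@\w+", text))
    return hashtags.most_common(x), mentions.most_common(x)

tweets = [
    "I #VoteOut for the #Brexit #EURef vote with @Brndstr",
    "I #VoteIn for the #Brexit #EURef vote with @Brndstr",
]
tags, users = top_x(tweets)
# users[0] is ("@brndstr", 2)
```

The linguistic categories (nouns, verbs, Locations, Persons) additionally require the TwitIE annotations, but the aggregation step is the same counting exercise.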

How do we scale it up?


It's built using GATE Cloud Paralleliser and some clever queueing, but the take away message is: we can process and index over 100 tweets per second, which allows us to cope in real time with the tweet stream we receive via the Twitter Search API, even at peak times. All of this runs on a server which cost us under £10,000. 

The architecture can be scaled up further, if needed, should we get access to a Twitter feed with higher API rate limits than the standard. 



Thanks to:

Dominic Rout, Ian Roberts, Mark Greenwood, Diana Maynard, and the rest of the GATE Team 

Any mistakes are my own.







Sunday, 26 June 2016

#InOrOut: Analysing Voting Trends in Tweets on #EURef Day

In this post I examine the question: could we have predicted the #EUReferendum outcome, based on #Leave and #Remain tweets posted on polling day? This follows up from my #InOrOut debate on Twitter on Jun 23rd, where I analysed tweet volumes, popular hashtags, and most mentioned users.

This is not the only study to analyse referendum day tweets, but here I present a more in-depth analysis, also based on a sample of tweets selected specifically as  advocating #Leave/#Remain respectively. 


#Leave / #Remain Trend Based on @Brndstr


Our real-time analysis uncovered the most popular user mentioned in posts on referendum day: @Brndstr. @Brndstr are building bots to help brands engage with their customers, and to let users become social ambassadors of brands they endorse.

On referendum day, they ran a campaign which encouraged people to tweet how they voted; in return, their profile picture would change accordingly. This was not without controversy: some Twitter users took issue with the choice of the Union Jack (for Out voters) vs the EU flag (for In voters), but nevertheless, many people declared their votes in this way.



I found over 14,600 tweets mentioning @Brndstr in the 715 thousand original tweets we collected on June 23rd. I limited the analysis to original tweets (i.e. excluded retweets and replies), since I wanted to study distinct, self-declared #Leave / #Remain intentions.

Inspection of a random sample in our Mimir Prospector dashboard showed all tweets had a set pattern, which made it trivial to distinguish #Leave and #Remain votes.  

In particular, all #Leave tweets started with: "I #VoteOut for the #Brexit #EURef vote with @Brndstr", while all #Remain tweets started with: "I #VoteIn for the #Brexit #EURef vote with @Brndstr".

I used two Mimir queries with those texts, and found 6,296 #VoteOut tweets and 8,342 #VoteIn tweets. Thus, based on @Brndstr activity, one could hypothesize a #Remain majority.


#Leave / #Remain Trend Based on Full-Text Search


In addition to @Brndstr, I also experimented with full-text searches over the referendum day tweets. For those interested in the technology behind this, I used GATE text analysis tools adapted to the referendum, combined with the Mimir semantic search engine (supports searches over both linguistic annotations and full-text). 

First, I searched for tweets containing "I", "voted", and "remain", within an 8 word window. This returned 14,665 matching tweets and upon manual inspection of the top 30 matches, I observed only 2 tweets which did not disclose the actual vote of their poster. Therefore, I considered this a sufficiently accurate query.

The corresponding "I", "voted" and "leave" query returned 11,046 matching tweets, i.e. #Leave votes were outnumbered by #Remain ones again. 
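A window-restricted search like the above can also be approximated outside Mimir with straightforward token scanning. A rough Python sketch (naive whitespace tokenisation, not the actual Mimir query semantics):

```python
import re

def within_window(text, terms, window=8):
    """True if all terms occur within a span of `window` consecutive tokens."""
    toks = re.findall(r"\w+", text.lower())
    wanted = [t.lower() for t in terms]
    for i in range(len(toks)):
        span = toks[i:i + window]
        if all(t in span for t in wanted):
            return True
    return False

# e.g. matches a self-declared vote, but not a mere mention of the campaign:
hit = within_window("Proud that I voted to remain this morning",
                    ["I", "voted", "remain"])
```

As noted above, such a query still admits some false positives (posts quoting someone else's vote, for instance), which is why a manual inspection of a sample is needed before trusting the counts.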

These statistics are in line with the findings of other studies of Twitter #EUReferendum posts. For instance, the #EURef Data Hub (by the Press Association, Twitter, and Blurrt) showed Remain tweets dominating over Leave tweets on Jun 23rd, but not on 22nd and earlier, or (unsurprisingly) since. 

It must be noted that, similar to the Ontotext study, the #EURef Data Hub statistics are derived from tweets referencing either the Leave or Remain campaigns, but not necessarily showing explicit support or voting intent. 

However, as discussed in my earlier post, if we were to try and draw conclusions on the likely outcome based on tweets alone, then we need a more reliable Leave/Remain sample, indicative of actual support and self-declared voting intentions.

So now let's see if the same trend is present there.


#Leave / #Remain Voting Intentions Based on Our Classification Heuristic


Following on from my previous study of the overall characteristics of tweets posted on June 23rd, I separated again the tweets into original tweets, replies, and retweets.

I applied our classification heuristic for reliable identification of #Leave/#Remain posts to all tweets posted on or after 13:00 BST on June 22nd, but before voting closed at 22:00 BST on June 23rd. 
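The exact heuristic is not reproduced here, but its spirit can be sketched as follows: accept a tweet only when its stance hashtags come from exactly one camp. The hashtag lists below are illustrative guesses, not the ones we actually used:

```python
import re

LEAVE = {"#voteleave", "#voteout", "#leaveeu"}       # illustrative lists only,
REMAIN = {"#voteremain", "#votein", "#strongerin"}   # not our real gazetteer

def stance(tweet):
    """Return "Leave" or "Remain" only for unambiguous tweets, else None."""
    tags = {t.lower() for t in re.findall(r"#\w+", tweet)}
    has_leave, has_remain = bool(tags & LEAVE), bool(tags & REMAIN)
    if has_leave and not has_remain:
        return "Leave"
    if has_remain and not has_leave:
        return "Remain"
    return None  # no stance hashtag, or hashtags from both camps
```

Tweets classified as None are simply excluded from the sample, trading coverage for a much more reliable signal of voting intent.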

As a result, I found just over 100 thousand tweets from the 22nd: 39 thousand advocating Remain and 61 thousand advocating Leave.

On June 23rd, as Twitter activity picked up significantly (also observed by #EURef Data Hub), I found 291 thousand matching tweets. Unlike other studies, however, our voting intent heuristic identified 164 thousand tweets advocating Leave and only 127 thousand advocating Remain.

Therefore, even though voting tweets from @Brndstr and tweet volume statistics from #EURef Data Hub both indicate that Remain was dominant, this trend wasn't supported in our voting intention sample. 

Now let us examine the trends over time, separately for original tweets, replies, and retweets.

The graph below shows that indeed #Remain tweets were dominant in the early hours of June 23rd, but not before or after. What is particularly interesting is that #Remain tweets start to fall sharply from around 4pm, whereas #Leave ones pick up sharply a little later. By the time polls close at 10pm, tweets advocating #Leave are more than double the ones supporting #Remain.      




Reply tweets show a largely different pattern (see graph below): replies advocating #Leave are consistently more numerous than those advocating #Remain (at times up to 2.5 times more), a trend we also observed earlier in June. This indicates that #Leave advocates were much more engaged in the Twitter debates than the #Remain ones.

It should be noted also that the trend observed in original tweets in late afternoon and evening of June 23rd is also evident here, i.e. replies advocating #Remain start to fall, while replies advocating #Leave increase. 

Lastly, I show below the trends in re-tweets, where again #Leave advocates dominate the debate, by re-tweeting much more than #Remain ones.  Again, I already observed this trend earlier in June. 


What Have We Learnt? 


Having looked at tweets on the 23rd, both the @Brndstr campaign and the “I voted XX” patterns gave Remain a majority over Leave, but using our classification heuristic, the opposite was true (i.e. Leave was the more likely winner).

Given the conflicting evidence based on the same set of tweets, it is easy to see why others failed to predict the overall majority correctly.

I must also highlight here that my own analysis was never aimed at being predictive. Instead, I am trying to understand how people engaged, debated, and wrote about the referendum on social media. 

In particular, as the referendum clearly showed, older voters tend to vote in higher proportions than young ones, and thus it was they who ultimately determined the overall outcome. That older generation, however, is well known for being under-represented on Twitter, and is also probably less aware of @Brndstr and similar services, which explains why these gave the wrong trends.

In future research I would like to explore whether representativeness on Twitter is the full story, and whether this matters for political discussions. Do the younger generation actually talk more or less about politics than the older generation? Also, older people aside - were Brexiters (i.e. people supporting Leave) over- or under-represented on Twitter, as compared to Bremainers (i.e. voters supporting Remain)?   

In order to get more accurate answers to these questions, as demonstrated here, it is important to identify actual tweets indicative of specific voting intentions or votes already cast. The largely predominant approach of simply counting tweets mentioning hashtags is not sufficiently accurate as it does not distinguish tweets simply referring to a stance/campaign, from tweets actually advocating a stance/campaign. 

As part of subsequent research, I plan to also collect a gold standard of human-annotated tweets, where people will be asked to mark tweets indicating actual support and voting intent separately from tweets which simply mention the Leave/Remain campaigns. This will enable me to quantify how the different sampling strategies affect the accuracy of voting trends over time.


Thanks to:

Dominic Rout, Ian Roberts, Mark Greenwood, Diana Maynard, and the rest of the GATE Team 

Any mistakes are my own.

Thursday, 23 June 2016

#InOrOut: Today's #EURef Debate on Twitter


So what did the #EUReferendum debate look like today? Is Twitter still voting #Leave as it did back in May? What were the main hashtags and user mentions in today's tweets?


Tweet Volumes

A record-breaking 1.9 million tweets were posted today on the #InOrOut #EUReferendum, which is between three and six times the daily volumes observed earlier in June. On average, this is 21 tweets per second over the day, although the peaks of activity occurred after 9am (see graphs below). 1.5 million of those tweets were posted during poll opening times. In that period, only 3,300 posts were inaccessible to us due to Twitter rate limits.

When the polls closed at 10pm tonight, there was a huge surge in Twitter activity, with over 60,000 posts between 10pm and 11pm alone. Twitter rate limits meant that we could not access another 6,000 posts from that period; since this is only about 10% of the overall data in this hour, we still have a representative sample for our analyses.

Amongst the 1.9 million posts, over 1 million (57%) were retweets and 94 thousand (5%) were replies. These proportions of retweets and replies are consistent with patterns observed earlier in June.

Tweets, Re-tweets, and Replies: #Leave or #Remain


Let's start by looking at original tweets, i.e. tweets which have been posted by their authors and are not a reply to another tweet or a retweet. I refer to the authors of those tweets as the OPs (Original Posters), following terminology adopted from online forums.

My analysis of voting intentions showed some conflicting findings, depending on the way used to sample tweets  (details and trend graphs here). 

The gist is that both the @Brndstr campaign and the “I voted XX” patterns gave Remain a majority over Leave, but using our voting intention classification heuristic, the opposite was true (i.e. Leave was the more likely winner).

In retweets, the #Leave proponents were more vocal than the #Remain ones.


The difference is particularly pronounced for replies,  where #Leave proponents are engaging in more debates than #Remain ones. Nevertheless, with replies constituting only 5% of all tweets today, the echo chamber effect observed earlier in June still remains unchanged. 

#InOrOut, #Leave, #Remain and Other Popular Hashtags

Interestingly, 75% of all tweets today (1.4 million) contained at least one hashtag. This is a very significant increase on the 56.5% observed several days ago. 


Some of the most popular hashtags  remain unchanged from earlier in June. These refer to the leave and remain campaigns, immigration, NHS, parties, media, and politicians. Interestingly, there is now increased interest in #forex and #stocks, as predictors of the likely outcome. 


Most Mentioned Users Today: What is @Brndstr?


Last for tonight, I compared the most frequently mentioned Twitter users in original tweets from today (see above) against those most mentioned earlier in June. The majority of popular mentioned users remains unchanged, with a mix of campaign Twitter accounts, media, and key political leaders.

The most prominent difference is that @Brndstr (Bots for Brands) came top (mentioned in over 14 thousand tweets), followed by @YouTube with 3 thousand mentions. Other new, frequently mentioned accounts today were Avaaz, DanHannanMEP, BuzzFeedUK, and realDonaldTrump.


So What Does This Tell Us?


The #InOrOut #EUReferendum has attracted unprecedented tweet volumes on poll day, with a significantly higher proportion of hashtags than previously. This seems to suggest that Twitter users are trying to get their voices heard and spread the word far and wide, well beyond the bounds of their normal follower  network. 


There are some exciting new entrants in the top 30 most mentioned Twitter accounts in today's referendum posts. I will analyse these in more depth tomorrow. For now, good night!  


Thanks to:

Dominic Rout, Ian Roberts, Mark Greenwood, Diana Maynard, and the rest of the GATE Team 

Any mistakes are my own.