
Sunday, 23 July 2017

The Tools Behind Our Twitter Abuse Analysis with BuzzFeed


Or...How to Quantify Abuse in Tweets in 5 Working Days


When BuzzFeed approached us with the idea to quantify Twitter abuse towards politicians during the election campaign, we had only five working days before the article had to be completed and go public.

The goal was to use text analytics to analyse tweets replying to UK politicians in the run-up to the 2017 general election, in order to answer questions such as:

  • How widespread is the abuse received by politicians?
  • Who are the main politicians targeted by such abusive tweets?
  • Are there any party or gender differences?
  • Do abuse levels stay constant over time or not?  
So here I explain first how we collect the data for such studies, and then how it gets analysed quickly and at scale, all with our GATE-based open-source tools and their GATE Cloud text-analytics-as-a-service deployment.

Researchers wishing for more in-depth details are encouraged to read and cite our paper:

D. Maynard, I. Roberts, M. A. Greenwood, D. Rout and K. Bontcheva. A Framework for Real-time Semantic Social Media Analysis. Web Semantics: Science, Services and Agents on the World Wide Web, 2017 (in press). https://doi.org/10.1016/j.websem.2017.05.002, pre-print

Tweet Collection 


We already had all the necessary tweets at hand since, within an hour of #GE2017 being announced, I had set up, using the GATE Cloud tweet collection service, the continuous collection of tweets by MPs, prominent politicians, parties, and candidates, as well as retweets of and replies to these.

I also set up a second Twitter collector service running in parallel, to collect election-related tweets based purely on hashtags and keywords (e.g. #GE2017, vote, election).
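The collection itself ran as a GATE Cloud twitter-collector service, so no code was needed on our side, but the underlying idea is simply keyword and hashtag filtering of the incoming stream. Below is a minimal, illustrative Python sketch of that filtering step (not the GATE Cloud collector itself); the input/output file names and the keyword list are assumptions for illustration only.

    # Illustrative sketch only -- the real collection used the GATE Cloud
    # twitter-collector service. Here we assume tweets arrive as one JSON
    # object per line (e.g. saved from the streaming API) in "stream.jsonl".
    import json

    KEYWORDS = {"#ge2017", "vote", "election"}   # example terms, as in the post

    def matches(tweet_text):
        """True if the tweet text contains any tracked keyword/hashtag (naive substring check)."""
        text = tweet_text.lower()
        return any(k in text for k in KEYWORDS)

    with open("stream.jsonl") as stream, open("ge2017_tweets.jsonl", "w") as out:
        for line in stream:
            tweet = json.loads(line)
            if matches(tweet.get("text", "")):
                out.write(line)   # keep the raw JSON for later analysis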

How We Analysed and Quantified Abuse 


Given the short five-day deadline, we were pleased to have at hand the large-scale, real-time text analytics tools in GATE, Mimir/Prospector, and GATE Cloud.

The starting point was the real-time text analysis pipeline from the Brexit research last year. It is capable of analysing up to 100 tweets per second (tps), although in practice tweets usually arrived at a much lower rate of around 23 tps.

This time, however, we extended it with a new abuse analysis component, as well as more up-to-date knowledge about the politicians (including the new prime minister).




The analysis backbone was again GATE's TwitIE system, which consists of a tokenizer, normalizer, part-of-speech tagger, and a named entity recognizer. TwitIE is also available as a service on GATE Cloud, for easy integration and use.

Next, we added information about the politicians, e.g. their names, gender, party, constituencies, etc. In this way, we could produce aggregate statistics, such as the number of abuse-containing tweets aimed at Labour or Conservative male/female politicians.
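To give a flavour of this step, here is a minimal sketch of how reply tweets can be joined with a small table of politician metadata, so that per-party and per-gender counts can be produced later. The two example records and the field names are illustrative; they are not our actual politician list or data model.

    # Minimal sketch: attach party/gender metadata to reply tweets.
    # The records below are illustrative examples, not our actual list.
    POLITICIANS = {
        "theresa_may":  {"name": "Theresa May",   "party": "Conservative", "gender": "F"},
        "jeremycorbyn": {"name": "Jeremy Corbyn", "party": "Labour",       "gender": "M"},
    }

    def annotate_target(tweet):
        """If the tweet replies to a known politician, copy their metadata onto it."""
        handle = (tweet.get("in_reply_to_screen_name") or "").lower()
        target = POLITICIANS.get(handle)
        if target:
            tweet["target_politician"] = target
        return tweet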

Next is a tweet geolocation component, which uses latitude/longitude, region, and user location metadata to geolocate tweets within the UK NUTS2 regions. This is not always possible, since many accounts and tweets lack such information, and this narrows down the sample significantly, should we choose to restrict by geolocation.
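As a rough illustration of what such a geolocation step involves (this is not our actual implementation), the sketch below assigns a tweet with latitude/longitude to a NUTS2 region via a point-in-polygon test. It assumes a GeoJSON file of UK NUTS2 boundaries; the file name and the "NUTS_ID" property are placeholders.

    # Rough sketch of a lat/long -> NUTS2 lookup (not the production component).
    # Assumes "nuts2_uk.geojson" holds NUTS2 polygons with a "NUTS_ID" property.
    import json
    from shapely.geometry import Point, shape

    with open("nuts2_uk.geojson") as f:
        regions = [(feat["properties"]["NUTS_ID"], shape(feat["geometry"]))
                   for feat in json.load(f)["features"]]

    def nuts2_region(lat, lon):
        """Return the NUTS2 code of the region containing the point, or None."""
        point = Point(lon, lat)   # note: GeoJSON coordinate order is (longitude, latitude)
        for code, polygon in regions:
            if polygon.contains(point):
                return code
        return None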

We also detect key themes and topics discussed in the tweets (more than one topic/theme can be contained in each tweet). Here we reused the module from the Brexit analyser.

The most exciting part was working with BuzzFeed's journalists to curate a set of abuse nouns typically aimed at people (e.g. twat), racist words, and milder insults (e.g. coward). We decided to differentiate these from general obscene language and swearing, as the latter were not always targeting the politician; nevertheless, they were included in the system, to produce a separate set of statistics. We also introduced a basic sub-classification by kind (e.g. racial) and strength (e.g. mild, medium, strong), derived from an Ofcom research report on offensive language.
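To make the idea concrete, here is a small, hedged sketch of lexicon-based matching with a kind/strength sub-classification. The two entries shown are mild illustrative examples; the curated lists used for the BuzzFeed analysis were considerably larger and are not reproduced here.

    # Sketch of lexicon-based abuse tagging with kind/strength labels.
    # The entries below are illustrative; the real curated lists were larger.
    import re

    ABUSE_LEXICON = {
        "coward": {"kind": "personal insult", "strength": "mild"},
        "twat":   {"kind": "personal insult", "strength": "medium"},
    }

    TOKEN = re.compile(r"[a-z']+")

    def abuse_matches(tweet_text):
        """Return the lexicon entries whose terms occur as whole tokens in the tweet."""
        tokens = set(TOKEN.findall(tweet_text.lower()))
        return {term: info for term, info in ABUSE_LEXICON.items() if term in tokens}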


Overall, we kept the processing pipeline as simple and efficient as possible, so it can run at 100 tweets per second even on a pretty basic server.  

The analysis results were fed into GATE Mimir, which efficiently indexes the tweet text and all of our linguistic annotations. Mimir has a powerful programming API for semantic search queries, which we use to drive the various interactive visualisations and to generate the necessary aggregate statistics behind them.

For instance, we used Mimir queries to generate statistics and visualisations based on time (e.g. the most popular hashtags in abuse-containing tweets on 4 Jun), topic (e.g. the most talked-about topics in such tweets), or target of the abusive tweet (e.g. the most frequently targeted politicians by party and gender). We could also navigate to the corresponding tweets behind these aggregate statistics, for more in-depth analysis.
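The real statistics came out of Mimir's semantic search API over the index, but the kind of aggregation involved can be sketched in a few lines of Python over annotated tweet records. The record fields ("date", "abusive", "hashtags") are assumptions made purely for this illustration.

    # Sketch of one aggregate statistic: top hashtags in abuse-containing tweets
    # on a given day. In reality this is driven by Mimir queries over the index.
    from collections import Counter

    def top_hashtags(tweets, day, n=10):
        """Most frequent hashtags among abuse-containing tweets posted on 'day'."""
        counts = Counter()
        for t in tweets:
            if t["date"] == day and t["abusive"]:
                counts.update(h.lower() for h in t["hashtags"])
        return counts.most_common(n)

    # e.g. top_hashtags(annotated_tweets, "2017-06-04")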

A rich sample of these statistics, associated visualisations, and abusive tweets is available in the BuzzFeed article.

Research carried out by:


Mark A. Greenwood, Ian Roberts, Dominic Rout, and myself, with ideas and other contributions from Diana Maynard and others from the GATE Team. 

Any mistakes are my own.

Friday, 1 July 2016

The Tools Behind Our Brexit Analyser


UPDATE (13 December 2016): Try the Brexit Analyser

We have now made parts of the Brexit Analyser available as a web service. You can try the topic detection by entering an example tweet here (choose mentions of political topics):

https://cloud.gate.ac.uk/shopfront/sampleServices 

A more extensive test of the outputs (also including hashtags, voting intent, @mention, and URL detection) can be tried here:

https://cloud.gate.ac.uk/shopfront/displayItem/sobigdata-brexit 

This is a web service running on GATE Cloud, where you can find many other text analytics services, available to try for free or run on large batches of data. 

We now also have a tweet collection service, should you wish to start collecting and analysing your own Brexit (or any other) tweets:

https://cloud.gate.ac.uk/shopfront/displayItem/twitter-collector 

Tools Overview 

It will be two weeks tomorrow since we launched the Brexit Analyser -- our real-time tweet analysis system, based on our GATE text analytics and semantic annotation tools.

Back then, we were analysing on average 500,000 (yes, half a million!) tweets a day. Then, on referendum day alone, we had to analyse well over 2 million tweets in real time, or, on average, just over 23 tweets per second! It wasn't quite so simple though: tweet volume picked up dramatically as soon as the polls closed at 10pm, when we were consistently getting around 50 tweets per second and were also being rate-limited by the Twitter API.

These are some pretty serious data volumes, and velocity too. So how did we build the Brexit Analyser to cope?

For analysis, we are using GATE's TwitIE system, which consists of a tokenizer, normalizer, part-of-speech tagger, and a named entity recognizer. After that, we added our Leave/Remain classifier, which helps us identify a reliable sample of tweets with unambiguous stance.  Next is a tweet geolocation component, which uses latitude/longitude, region, and user location metadata to geolocate tweets within the UK NUTS2 regions. We also detect key themes and topics discussed in the tweets (more than one topic/theme can be contained in each tweet), followed by topic-centric sentiment analysis.  
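Conceptually, the analyser is a chain of per-tweet processing steps. The sketch below shows only that shape: the stage functions are no-op stubs standing in for TwitIE and the other GATE components, not their real implementations.

    # Shape of the per-tweet pipeline; the stages are placeholder stubs.
    def twitie(tweet):            return tweet   # tokenise, normalise, POS-tag, NER
    def leave_remain(tweet):      return tweet   # stance: Leave / Remain / unknown
    def geolocate(tweet):         return tweet   # lat/long + metadata -> NUTS2 region
    def themes_and_topics(tweet): return tweet   # one or more topics per tweet
    def topic_sentiment(tweet):   return tweet   # sentiment per detected topic

    def analyse(tweet):
        for stage in (twitie, leave_remain, geolocate, themes_and_topics, topic_sentiment):
            tweet = stage(tweet)
        return tweet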



We kept the processing pipeline as simple and efficient as possible, so it can run at 100 tweets per second even on a pretty basic server.  

The analysis results are fed into GATE Mimir, which efficiently indexes the tweet text and all of our linguistic annotations. Mimir has a powerful programming API for semantic search queries, which we use to drive different web pages with interactive visualisations. The user can choose what they want to see, based on time (e.g. the most popular hashtags on 23 Jun, or the most talked-about topics in Leave/Remain tweets on 23 Jun). Clicking on these infographics shows the actual matching tweets.

All my blog posts so far have been using screenshots of such interactively generated visualisations. 

Mimir also has a more specialised graphical interface (Prospector), which I use for formulating semantic search queries and inspecting the matching data, coupled with some pre-set types of visualisations. The screenshot below shows my Mimir query for all original tweets on 23 Jun which advocate Leave. I can then inspect the most mentioned Twitter users within those. (I used Prospector for my analysis of Leave/Remain voting trends on referendum day.)


So how do I do my analyses?


First I decide what subset of tweets I want to analyse. This is typically a Mimir query restricting by timestamp (normalized to GMT), tweet kind (original, reply, or retweet), voting intention (Leave/Remain), author, or content, e.g. tweets mentioning a specific user, containing a given hashtag, or discussing a given topic (such as taxes).

Then, once I have identified this dynamically generated subset of tweets, I can analyse it with Prospector or use the visualisations which we generate via the Mimir API (a rough sketch of this kind of computation follows the list below). These include:

  • Top X most frequently mentioned words, nouns, verbs, or noun phrases
  • Top X most frequent posters or most frequently mentioned Twitter users
  • Top X most frequent Locations, Organizations, or Persons within those tweets
  • Top X themes / sub-themes according to our topic classifier
  • Frequent URLs, language of the tweets, and sentiment
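As a rough illustration of the computation these visualisations rest on (the actual implementation goes through Mimir's query API over the index), the sketch below first selects a subset of tweets in the spirit of the restrictions described above, and then counts the most frequently mentioned users. The record fields are assumptions for illustration.

    # Sketch only: the real subsets come from Mimir queries over the index.
    # Record fields ("timestamp", "kind", "stance", "mentions") are assumed here.
    from collections import Counter
    from datetime import date

    def select(tweets, day, kind="original", stance="Leave"):
        """Pick the subset of tweets matching the given day, kind, and stance."""
        return [t for t in tweets
                if t["timestamp"].date() == day and t["kind"] == kind and t["stance"] == stance]

    def top_mentions(tweets, n=10):
        """Most frequently @-mentioned users within the selected tweets."""
        counts = Counter(m.lower() for t in tweets for m in t["mentions"])
        return counts.most_common(n)

    # e.g. top_mentions(select(annotated_tweets, date(2016, 6, 23)))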

How do we scale it up?


It's built using the GATE Cloud Paralleliser and some clever queueing, but the take-away message is: we can process and index over 100 tweets per second, which allows us to cope in real time with the tweet stream we receive via the Twitter Search API, even at peak times. All of this runs on a server which cost us under £10,000.
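The real system uses the GATE Cloud Paralleliser, but the queueing idea itself is a standard producer/consumer pattern: incoming tweets are buffered in a queue and a pool of workers drains it, so short bursts above the average rate do not stall the stream reader. The sketch below is a generic illustration of that pattern, not the Paralleliser itself.

    # Generic producer/consumer sketch of the buffering idea (not GATE Cloud Paralleliser).
    import queue, threading

    tweet_queue = queue.Queue(maxsize=10000)    # buffer absorbs bursts above the average rate

    def analyse_and_index(tweet):
        pass                                    # stub: the real work is the GATE pipeline + Mimir indexing

    def worker():
        while True:
            tweet = tweet_queue.get()
            try:
                analyse_and_index(tweet)
            finally:
                tweet_queue.task_done()

    for _ in range(8):                          # worker count tuned to the server's cores
        threading.Thread(target=worker, daemon=True).start()

    # The stream reader then simply calls tweet_queue.put(tweet) for each incoming tweet.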

The architecture can be scaled up further, if needed, should we get access to a Twitter feed with higher API rate limits than the standard. 



Thanks to:

Dominic Rout, Ian Roberts, Mark Greenwood, Diana Maynard, and the rest of the GATE Team 

Any mistakes are my own.







Monday, 29 April 2013

EnviLOD: Lessons Learnt


The EnviLOD project demonstrated the benefits that location-based searches, enabled and underpinned by Linked Open Data (LOD) and semantic technologies, can bring to information retrieval. Although the semantic search tool developed through the EnviLOD project is not yet ‘production-ready’, it does demonstrate the benefits of this newly emerging technology. As such, it will be incorporated into the Envia ‘labs’ page of the Envia website, which is currently under development. Within Envia Labs, users of the regular Envia service will be able to experiment with and comment on tools that might eventually augment or be incorporated into the service, thus allowing the Envia project team to gauge their potential uptake by the user community.

We also worked on the automatic generation of semantically enriched metadata to accompany records within the Envia system. This aims to improve the discovery of information within the current Envia system by automatically generating keywords, to be included in the article metadata, based on the occurrences of terms from the GEMET, DBpedia, and GeoNames vocabularies. Work on a pipeline for incorporating this into the Envia system in a regular and sustainable manner is already under way.
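As an illustration of how vocabulary-driven keyword generation can work (a simplification of the actual GATE gazetteer-based pipeline), the sketch below scans an article's text for terms drawn from a controlled vocabulary and returns the matches as candidate metadata keywords. The tiny term set is a placeholder standing in for GEMET, DBpedia, and GeoNames entries.

    # Simplified sketch of vocabulary-based keyword generation; the real pipeline
    # uses GATE gazetteers built from GEMET, DBpedia, and GeoNames.
    import re

    VOCAB_TERMS = {"flood risk", "climate change", "groundwater"}   # placeholder entries

    def suggest_keywords(article_text):
        """Return vocabulary terms that occur in the article, as candidate keywords."""
        text = " ".join(article_text.lower().split())
        return sorted(term for term in VOCAB_TERMS
                      if re.search(r"\b" + re.escape(term) + r"\b", text))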

One particularly important lesson learnt from this short-term project is that the availability of large amounts of content, open to text mining and experimentation, needs to be ensured from the very beginning of the project. In EnviLOD there were copyright issues with the majority of environmental science content at the British Library, which limited the experimental system to just over one thousand documents. Due to this limited content, users were not always able to judge how comprehensive or accurate the semantic search was, especially when compared against the results offered by Google. Since the British Library is now planning to integrate the EnviLOD semantic enrichment tools within the advanced Envia Labs functionality, future work on this tool could potentially be evaluated on this more comprehensive data, through the Envia system.

Another important lesson learnt from the research activities is that working with Linked Open Data is very challenging, not only in terms of data volumes and computational efficiency, but also in terms of data noise and robustness. In terms of noise, an initial evaluation of the DBpedia-based semantic enrichment pipeline revealed that relevant entity candidates were not included initially, because in the ontology they were classified as owl:Thing, whereas we were considering instances of specific sub-classes (e.g. Person, Place); there are over 1 million such unclassified instances in the current DBpedia snapshot. In terms of computational efficiency, we had to introduce memory-based caches and efficient data indexing, in order to make the entity linking and disambiguation algorithm sufficiently efficient to process data in near real time. Lastly, deploying the semantic enrichment on a server, e.g. at Sheffield or at the British Library, is far from trivial, since both OWLIM and our algorithms require large amounts of RAM and computational power. Parallelising the computation to more than three threads is an open challenge, due to the difficulties experienced with parallelising OWLIM. Ontotext are currently working on cloud-based, scalable deployments, which should allow future projects to address this scalability issue effectively.
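One of the simplest forms that memory-based caching can take is memoising the per-surface-form lookups against the knowledge base, so that repeated mentions of the same name hit memory rather than triggering repeated remote queries. The sketch below illustrates this with Python's functools.lru_cache around a placeholder lookup function; it is not the EnviLOD implementation.

    # Illustration of the caching idea: memoise candidate lookups so repeated
    # mentions of the same surface form do not re-query the remote store.
    from functools import lru_cache

    def query_knowledge_base(surface_form):
        return ()   # stub: in reality this queries DBpedia / the local OWLIM repository

    @lru_cache(maxsize=100_000)
    def candidate_uris(surface_form):
        """Cached wrapper around the expensive knowledge-base lookup."""
        return query_knowledge_base(surface_form)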

Lastly, the quantitative evaluation of our DBpedia-based semantic enrichment pipeline was far from trivial. It required us to annotate manually a gold standard corpus of environmental science content (100 documents were annotated with disambiguated named entities). However, releasing these to other researchers has proven to be practically impossible, due to the copyright and licensing restrictions imposed by the content publishers on the British Library. In a related project, we have now developed a web-based entity annotation interface, based on CrowdFlower. This will enable future projects to create gold standards more easily, based on copyright-free content. Ultimately, during development we made use of available news and similar corpora created for TAC-KBP 2009 and 2010, which we used for algorithm development and testing in EnviLOD, prior to the final quantitative evaluation on the copyrighted BL content. So even though the aims of the project were achieved and a useful running pilot system was created, publishing the results in scientific journals has been hampered by these content issues.

In conclusion, we fully support the findings of the JISC report on text mining that a copyright exemption for text mining research is necessary, in order to fully unlock the benefits of text mining for scientific research.

Thursday, 25 April 2013

EnviLOD Recap: Technical Objectives and Deliverables

The aim of the #EnviLOD project was to demonstrate the value of using Linked Open Data (LOD) vocabularies in the field of environmental science, by pursuing four key objectives. These remained unchanged during the seven-month project and are as follows:

  1. To engage actively with environmental science researchers and other key stakeholders, in order to derive requirements and evaluate project results.   
  2. To develop tools for efficient LOD-based semantic enrichment of unstructured content.
  3. To create and evaluate intuitive user interface methods that hide the complexities of the SPARQL semantic search language.
  4. To use the British Library’s Envia tool as a case study in using LOD vocabularies for enhanced information discovery and management.

Objective 1: Stakeholder engagement, Requirements Capture, and Evaluation


In order to demonstrate the value of shared LOD vocabularies to different applications, information types, and audiences, we focused on use cases related to research on flooding and climate change in the UK. We captured the requirements of relevant audiences and groups via a web-based questionnaire, which collected some actual search queries alongside user input on the kinds of searches they require. We engaged researchers, practitioners, and information managers, in order to assess how LOD vocabularies might support their needs. This also motivated our choice of different information types, including full-text content, metadata, and LOD datasets. The main user requirement to be fulfilled was support for location-based searches, e.g. flooding near Sheffield, or flooding on rivers flowing through Gloucestershire. In addition, users emphasized their need for an intuitive semantic search UI.

A new British Library information discovery tool for environmental science, Envia, was used as a starting point to test the use of semantics towards enhancing information discovery and management. Envia is particularly suited as a test case for these purposes, as it features a mixed corpus of content, including datasets, journal articles, and grey literature, with accompanying metadata records. Envia also enabled us to examine the value of semantic enrichment for information managers. Environmental consultants at HR Wallingford collaborated as domain experts, providing feedback on how the semantic work undertaken in EnviLOD supported their work as environmental science practitioners and innovators.

During the project, stakeholder engagement was ongoing through the project website, blog, Twitter presence, published reports, and joint meetings. In particular, user input and feedback was sought during the design of the semantic search user interface, in order to ensure that it meets user needs. The interface was implemented in three iterations:
  1. The British Library team participated in the design meeting and provided feedback on the first implementation. This helped the Sheffield team to adjust and simplify the interface. 
  2. Following this, the semantic search UI was demonstrated during a lunchtime workshop and EnviLOD presentation. At the end, environmental science researchers were given the opportunity to try the interface and provide us with structured feedback (via a written questionnaire) and a user-led discussion. This early evaluation helped us further refine the user interface design and remove confusing elements.
  3. Lastly, much wider stakeholder feedback was solicited during a user outreach and evaluation workshop organised at the British Library. There were 25 participants at the event, which enabled us to gather very detailed feedback and suggestions for minor improvements. Overall, the majority of users stated that semantic search would be very useful for information discovery and that they would use the system if it were deployed in production.

Objective 2: LOD-based Semantic Enrichment


Semantic annotation is the process of tying together semantic models, such as ontologies, and scientific articles. It may be characterised as the dynamic semantic enrichment of unstructured and semi-structured documents, linking these to relevant domain ontologies and knowledge bases.

The focus of our work was on implementing an LOD-based semantic enrichment algorithm and applying it to metadata and full-text documents from Envia. A trial web service is now available.

As part of this work, we evaluated the coverage and accuracy of relevant general-purpose LOD datasets (namely GeoNames and DBpedia), when applied to data and content from our domain. The results showed that GeoNames is a useful resource of rich knowledge about locations (e.g. NUTS administrative regions, latitude, longitude, parent country); however, it is not suitable on its own as a primary source for knowledge enrichment, due to its very high level of detail and the resulting location ambiguity (e.g. it contained names of farms). DBpedia, on the other hand, is much more balanced, as it also includes knowledge about people, organisations, products, and other entities. Therefore DBpedia was chosen as the primary LOD resource for semantic enrichment. Specifically for locations, we identified their equivalent entries in GeoNames and enriched the text content with additional metadata from there.

Tools for LOD-based geolocation disambiguation, as well as date and measurement recognition and normalisation, were implemented and tested. In more detail, the first step is to identify all candidate instance URIs from DBpedia which are mentioned in a given document. This phase is designed to maximise recall, in order to ensure that more relevant documents can be returned at search time. The second step is entity disambiguation, which is carried out on the basis of string, semantic, and contextual similarity, coupled with a corpus frequency metric. The algorithm was developed on a general-purpose, shared news-like corpus and evaluated on environmental science papers and metadata records from the British Library.
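The disambiguation step can be illustrated as a weighted combination of the similarity signals mentioned above. The weights and the individual scoring functions in this sketch are placeholders, not the tuned EnviLOD components.

    # Sketch of disambiguation as a weighted combination of string, semantic,
    # and contextual similarity plus corpus frequency; weights are placeholders.
    WEIGHTS = {"string": 0.3, "semantic": 0.3, "context": 0.3, "frequency": 0.1}

    def score(candidate, mention, context, scorers):
        """Combine the individual similarity scores for one candidate URI."""
        return sum(WEIGHTS[name] * fn(candidate, mention, context)
                   for name, fn in scorers.items())

    def disambiguate(candidates, mention, context, scorers):
        """Pick the highest-scoring candidate URI for the mention, or None."""
        return max(candidates, key=lambda c: score(c, mention, context, scorers), default=None)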

Objective 3: User Interface for Semantic Search

The semantic search interface is shown below and can be tried online too:

There is a keyword search field, complemented with optional semantic search constraints, which are specified through a set of inter-dependent drop-down lists. In the first list, Location allows users to search for mentions of locations; Date, for mentions of dates; Document, for specifying constraints on document-level attributes; and so on.

More than one semantic constraint can be added, through the plus button, which inserts a new row underneath the current row of constraints.

If a Location is chosen as a semantic constraint, then, if required, further constraints can be specified by choosing an appropriate property constraint. Population allows users to pose restrictions on the population number of the locations that are being searched for. Similar numeric constraints can be imposed on the latitude, longitude, and population density attribute values.

Restrictions can also be imposed on a location's name or its country code, i.e. which country it belongs to. When “is” is chosen, the location name must be exactly as specified (e.g. Oxford), whereas “contains” provides sub-string matching (e.g. Oxfordshire). In the example below, the user is searching for documents mentioning locations whose name contains Oxford. When the search is executed, this would return documents mentioning Oxford explicitly, but also documents mentioning Oxfordshire and other locations in Oxfordshire (e.g. Wytham Woods, Banbury).
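Behind such a form, the chosen constraints end up as a structured query against the knowledge base. Purely as an illustration (not the query EnviLOD actually generates against its own store), a constraint like "a location whose name contains Oxford and whose population is above 100,000" could translate into a SPARQL pattern along the following lines, shown here as a Python string using DBpedia-style properties.

    # Illustrative only: one way the UI constraints could map to SPARQL.
    # Property names follow DBpedia conventions; EnviLOD's generated queries may differ.
    query = """
    PREFIX dbo:  <http://dbpedia.org/ontology/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

    SELECT ?place ?name WHERE {
      ?place a dbo:Place ;
             rdfs:label ?name ;
             dbo:populationTotal ?pop .
      FILTER (CONTAINS(LCASE(STR(?name)), "oxford"))
      FILTER (?pop > 100000)
      FILTER (LANG(?name) = "en")
    }
    """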

Objective 4: Use of Envia as a Testbed

Envia provided us with readily available content and a testbed for experimenting with the semantic enrichment methods. Sparsely populated metadata records were enriched with environmental science terms and location and organisation entities, as well as with additional metadata imported from GeoNames and DBpedia.  

The British Library will launch a public beta of Envia in May 2013, where EnviLOD-enriched content will be included as an experimental option, complementing the traditional full-text search in Envia. Over time, this will give access to user query logs and allow iterative identification and improvement of the quality of the semantic enrichment and search algorithms.


Conclusion


All our technical objectives have now been completed and we are ready to deploy the semantic enrichment pipeline within the Envia system, as well as to carry out further improvements and experiments with the EnviLOD semantic search UI. We are looking forward to taking this work further in the future, implementing the ideas which we received during the user evaluation workshop.