As part of the EU SoBigData project, the GATE team hosts a number of short research visits, between 2 weeks and 2 months, for all kinds of data scientists (PhD students, researchers, academics, professionals) to come and work with us and to use our tools and/or datasets on a project involving text mining and social media analysis. Kristoffer Stensbo-Smidt visited us in the summer of 2018 from the University of Copenhagen, to work on developing machine learning tools for sentiment analysis of tweets, and was supervised by GATE team member Diana Maynard and by former team member Isabelle Augenstein, who is now at the University of Copenhagen. Kristoffer has a background in Machine Learning but had not worked in NLP before, so this visit helped him understand how to apply his skills to this kind of domain.
After his visit, Kristoffer wrote up an excellent summary of his research. He essentially tested a number of different approaches to processing text and analysed how much of the sentiment each was able to identify. Given a tweet and an associated topic, the aim is to ascertain automatically whether the sentiment expressed about this topic is positive, negative or neutral. Kristoffer experimented with different word embedding-based models in order to test how much information different word embeddings carry about the sentiment of a tweet. This involved choosing which embedding models to test and how to transform the topic vectors. The main conclusion he drew from the work was that, in general, word embeddings contain a lot of useful information about sentiment, with newer embeddings containing significantly more. This is not particularly surprising, but it shows the importance of advanced models for this task.
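To illustrate the general idea behind such embedding-based models, here is a minimal sketch: average the embeddings of a tweet's tokens and feed the result to a linear classifier. The random vectors below merely stand in for real pretrained embeddings (e.g. GloVe or fastText files), and the setup is my illustrative assumption rather than Kristoffer's exact method.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy embedding table: random vectors standing in for real pretrained
# embeddings that would normally be loaded from an embedding file.
rng = np.random.default_rng(0)
VOCAB = ["love", "great", "awful", "hate", "tax", "this", "is"]
EMB = {w: rng.normal(size=50) for w in VOCAB}

def tweet_vector(tweet):
    """Average the embeddings of in-vocabulary tokens; zeros otherwise."""
    vecs = [EMB[t] for t in tweet.lower().split() if t in EMB]
    return np.mean(vecs, axis=0) if vecs else np.zeros(50)

# Tiny labelled sample: 1 = positive, 0 = negative.
tweets = ["love this", "this is great", "awful tax", "hate this"]
labels = [1, 1, 0, 0]
X = np.stack([tweet_vector(t) for t in tweets])
clf = LogisticRegression().fit(X, labels)
pred = clf.predict([tweet_vector("this is awful")])
```

Newer contextual embeddings would replace the simple averaging step, which is one way to read the finding that they carry significantly more sentiment information.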
Posts about our text and social media analysis work and latest news on GATE (http://gate.ac.uk) - our open source text and social media analysis platform. Also posts about the PHEME project (http://pheme.eu) and our work on automatic detection of rumours in social media. Lately also general musings about fake news, misinformation, and online propaganda.
Showing posts with label Sentiment. Show all posts
Friday, 8 February 2019
3rd International Workshop on Rumours and Deception in Social Media (RDSM)
June 11, 2019 in Munich, Germany
Collocated with ICWSM'2019
Abstract
The 3rd edition of the RDSM workshop will particularly focus on online information disorder and its interplay with public opinion formation. Social media is a valuable resource for mining all kinds of information, from opinions to factual statements. However, social media also harbours issues that pose serious threats to society, chief among them online information disorder and its power to shape public opinion. Known aspects include the spread of false rumours and fake news, as well as social attacks such as hate speech and other forms of harmful posts. The aim of this workshop is to bring together researchers and practitioners interested in social media mining and analysis to deal with the emerging issues of information disorder and the manipulation of public opinion. The focus of the workshop will be on themes such as the detection of fake news, the verification of rumours and the understanding of their impact on public opinion. Furthermore, we aim to put great emphasis on the usefulness and trust aspects of automated solutions tackling these themes.
Workshop Theme and Topics
The aim of this workshop is to bring together researchers and practitioners interested in social media mining and analysis to deal with the emerging issues of veracity assessment, fake news detection and manipulation of public opinion. We invite researchers and practitioners to submit papers reporting results on these issues. Qualitative studies performing user studies on the challenges encountered with the use of social media, such as the veracity of information and fake news detection, as well as papers reporting new data sets, are also welcome. Finally, we also welcome studies reporting on the usefulness and trustworthiness of social media tools tackling the aforementioned problems.
Topics of interest include, but are not limited to:
- Detection and tracking of rumours.
- Rumour veracity classification.
- Fact-checking social media.
- Detection and analysis of disinformation, hoaxes and fake news.
- Stance detection in social media.
- Qualitative user studies assessing the use of social media.
- Bots detection in social media.
- Measuring public opinion through social media.
- Assessing the impact of social media in public opinion.
- Political analyses of social media.
- Real-time social media mining.
- NLP for social media analysis.
- Network analysis and diffusion of dis/misinformation.
- Usefulness and trust analysis of social media tools.
- AI-generated fake content (image/text).
Workshop Program Format
We will have 1-2 experts in the field delivering keynote speeches, followed by 8-10 presentations of peer-reviewed submissions, organised into 3 sessions by subject (the first two sessions on online information disorder and public opinion, and the third on usefulness and trust aspects). After the sessions we also plan a group activity (groups of 4-5 attendees), in which each group will sketch a social media tool for tackling e.g. rumour verification or fake news detection. The emphasis of the sketch should be on aspects such as usefulness and trust. This should take no longer than 120 minutes (sketching plus presentation/discussion time). We will close the workshop with a summary and take-home messages (max. 15 minutes). Attendance will be open to all interested participants.
We welcome both full papers (5-8 pages) to be presented as oral talks and short papers (2-4 pages) to be presented as posters and demos.
Workshop Schedule/Important Dates
- Submission deadline: April 1st 2019
- Notification of Acceptance: April 15th 2019
- Camera-Ready Versions Due: April 26th 2019
- Workshop date: June 11, 2019
Submission Procedure
We invite two kinds of submissions:
- Long papers/Brief Research Report (max 8 pages + 2 references)
- Demos and poster (short papers) (max 4 pages + 2 references)
Proceedings of the workshop will be published jointly with other ICWSM workshops in a special issue of Frontiers in Big Data.
Papers must be submitted electronically in PDF format, or any format supported by the submission site, through https://www.frontiersin.org/research-topics/9706 (click on "Submit your manuscript"). Note: submitting authors should choose one of the specific track organizers as their preferred Editor.
You can find detailed information on the file submission requirements here:
https://www.frontiersin.org/about/author-guidelines#FileRequirements
Submissions will be peer-reviewed by at least three members of the programme committee. The accepted papers will appear in the proceedings published at https://www.frontiersin.org/research-topics/9706
Workshop Organizers
- Ahmet Aker, University of Duisburg-Essen, Germany; University of Sheffield, UK (a.aker@is.inf.uni-due.de)
- Arkaitz Zubiaga, Queen Mary University of London, UK (arkaitz@zubiaga.org)
- Kalina Bontcheva, University of Sheffield, UK (k.bontcheva@sheffield.ac.uk)
- Maria Liakata, University of Warwick and Alan Turing Institute, UK (m.liakata@warwick.ac.uk)
- Rob Procter, University of Warwick and Alan Turing Institute, UK (rob.procter@warwick.ac.uk)
- Symeon Papadopoulos, Centre for Research and Technology Hellas, Greece (papadop@iti.gr)
Programme Committee (Tentative)
- Nikolas Aletras, University of Sheffield, UK
- Emilio Ferrara, University of Southern California, USA
- Bahareh Heravi, University College Dublin, Ireland
- Petya Osenova, Ontotext, Bulgaria
- Damiano Spina, RMIT University, Australia
- Peter Tolmie, Universität Siegen, Germany
- Marcos Zampieri, University of Wolverhampton, UK
- Milad Mirbabaie, University of Duisburg-Essen, Germany
- Tobias Hecking, University of Duisburg-Essen, Germany
- Kareem Darwish, QCRI, Qatar
- Hassan Sajjad, QCRI, Qatar
- Sumithra Velupillai, King's College London, UK
Invited Speaker(s)
To be announced
Sponsors
This workshop is supported by the European Union under grant agreement No. 654024, SoBigData, and by an EU co-funded Horizon 2020 project on algorithm-supported verification of digital content.
Friday, 1 July 2016
The Tools Behind Our Brexit Analyser
UPDATE (13 December, 2016): Try the Brexit Analyzer
We have now made parts of the Brexit Analyzer available as a web service. You can try the topic detection by putting an example tweet here (choose mentions of political topics):
https://cloud.gate.ac.uk/shopfront/displayItem/sobigdata-brexit
This is a web service running on GATE Cloud, where you can find many other text analytics services, available to try for free or run on large batches of data.
We also have now a tweet collection service, should you wish to start collecting and analysing your own Brexit (or any other) tweets:
https://cloud.gate.ac.uk/shopfront/displayItem/twitter-collector
Tools Overview
It will be two weeks tomorrow since we launched the Brexit Analyser -- our real-time tweet analysis system, based on our GATE text analytics and semantic annotation tools. Back then, we were analysing on average 500,000 (yes, half a million!) tweets a day. Then, on referendum day alone, we had to analyse well over 2 million tweets in real time -- on average, just over 23 tweets per second! It wasn't quite so simple though: tweet volume picked up dramatically as soon as the polls closed at 10pm, when we were consistently receiving around 50 tweets per second and were also being rate-limited by the Twitter API.
These are some pretty serious data volumes, arriving at serious velocity. So how did we build the Brexit Analyser to cope?
For analysis, we are using GATE's TwitIE system, which consists of a tokenizer, normalizer, part-of-speech tagger, and a named entity recognizer. After that, we added our Leave/Remain classifier, which helps us identify a reliable sample of tweets with unambiguous stance. Next is a tweet geolocation component, which uses latitude/longitude, region, and user location metadata to geolocate tweets within the UK NUTS2 regions. We also detect key themes and topics discussed in the tweets (more than one topic/theme can be contained in each tweet), followed by topic-centric sentiment analysis.
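To make the pipeline stages above concrete, here is a toy sketch of topic detection and Leave/Remain stance tagging. The keyword lexicons and the hashtag-based stance rule are illustrative assumptions, not GATE's actual gazetteers or our real classifier.

```python
import re

# Hypothetical keyword lexicons, standing in for the real gazetteers.
TOPICS = {"economy": {"tax", "economy", "trade"},
          "immigration": {"immigration", "borders"}}
LEAVE = {"#voteleave", "#leaveeu"}
REMAIN = {"#strongerin", "#remain"}

def analyse(tweet_text):
    """Tokenise a tweet, then tag its topics and (unambiguous) stance."""
    tokens = set(re.findall(r"[#@\w']+", tweet_text.lower()))
    topics = [name for name, kws in TOPICS.items() if kws & tokens]
    if LEAVE & tokens and not REMAIN & tokens:
        stance = "leave"
    elif REMAIN & tokens and not LEAVE & tokens:
        stance = "remain"
    else:
        stance = "unclear"  # ambiguous or no stance hashtags: excluded from the sample
    return {"topics": topics, "stance": stance}

result = analyse("Cut the tax burden! #VoteLeave")
```

Note how a tweet matching both camps' hashtags is left "unclear": keeping only unambiguous tweets is what makes the stance sample reliable.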
We kept the processing pipeline as simple and efficient as possible, so it can run at 100 tweets per second even on a pretty basic server.
The analysis results are fed into GATE Mimir, which efficiently indexes the tweet text together with all our linguistic annotations. Mimir has a powerful programming API for semantic search queries, which we use to drive different web pages with interactive visualisations. The user can choose what they want to see, based on time (e.g. most popular hashtags on 23 Jun; most talked about topics in Leave/Remain tweets on 23 Jun). Clicking on these infographics shows the actual matching tweets.
All my blog posts so far have been using screenshots of such interactively generated visualisations.
Mimir also has a more specialised graphical interface (Prospector), which I use for formulating semantic search queries and inspecting the matching data, coupled with some pre-set types of visualisations. The screenshot below shows my Mimir query for all original tweets on 23 Jun which advocate Leave. I can then inspect the most mentioned Twitter users within those. (I used Prospector for my analysis of Leave/Remain voting trends on referendum day.)
So how do I do my analyses?
First, I decide which subset of tweets I want to analyse. This is typically a Mimir query restricting by timestamp (normalised to GMT), tweet kind (original, reply, or retweet), voting intention (Leave/Remain), author, or content -- e.g. tweets mentioning a specific user, containing a given hashtag, or discussing a given topic (such as taxes).
Then, once I identify this dynamically generated subset of tweets, I can analyse it with Prospector or use the visualisations which we generate via the Mimir API. These include:
- Top X most frequently mentioned words, nouns, verbs, or noun phrases
- Top X most frequent posters/most frequently mentioned tweeters
- Top X most frequent Locations, Organizations, or Persons within those tweets
- Top X themes / sub-themes according to our topic classifier
- Frequent URLs, language of the tweets, and sentiment
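Mimir computes these aggregations server-side over the index; purely as an illustration of the "Top X" idea, here is a client-side sketch of one of them (top hashtags) over raw tweet text, using hypothetical sample data.

```python
import re
from collections import Counter

def top_hashtags(tweets, n=3):
    """Count hashtags case-insensitively and return the n most frequent."""
    tags = (tag.lower() for t in tweets for tag in re.findall(r"#\w+", t))
    return Counter(tags).most_common(n)

sample = ["Polls close at 10pm #Brexit #EURef",
          "Count under way #EURef",
          "Results soon #euref #Brexit"]
top = top_hashtags(sample, n=2)  # [('#euref', 3), ('#brexit', 2)]
```

The other "Top X" views follow the same pattern, counting annotation values (nouns, entities, themes) instead of hashtags.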
How do we scale it up?
It's built using the GATE Cloud Paralleliser and some clever queueing, but the take-away message is: we can process and index over 100 tweets per second, which allows us to cope in real time with the tweet stream we receive via the Twitter Search API, even at peak times. All of this runs on a server which cost us under £10,000.
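The GATE Cloud Paralleliser itself is not shown here, but the bounded-queue worker pattern that this kind of "clever queueing" relies on can be sketched as follows (the lower-casing stands in for the real NLP and indexing work).

```python
import queue
import threading

def worker(in_q, results):
    """Consume tweets until a None sentinel arrives."""
    while True:
        tweet = in_q.get()
        if tweet is None:
            break
        results.append(tweet.lower())  # stand-in for real analysis + indexing

in_q = queue.Queue(maxsize=1000)  # bounded: applies back-pressure when workers lag
results = []
threads = [threading.Thread(target=worker, args=(in_q, results)) for _ in range(4)]
for t in threads:
    t.start()
for tweet in ["Tweet ONE", "Tweet TWO", "Tweet THREE"]:
    in_q.put(tweet)
for _ in threads:
    in_q.put(None)  # one shutdown sentinel per worker
for t in threads:
    t.join()
```

The bounded queue is the key design choice: when the Twitter stream briefly outpaces the workers, producers block instead of exhausting memory.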
The architecture can be scaled up further, if needed, should we get access to a Twitter feed with higher API rate limits than the standard.
Thanks to:
Dominic Rout, Ian Roberts, Mark Greenwood, Diana Maynard, and the rest of the GATE Team
Any mistakes are my own.
Tuesday, 7 August 2012
GATE is Getting Sentimental about Social Media
Over the past two years, Diana Maynard, myself, and other colleagues in the GATE team have been working on a number of GATE-based sentiment analysis and opinion mining tools, specifically optimised for Twitter, blogs, comments, and other kinds of social media posts. The work has been part of the Arcomem and TrendMiner EC-funded projects, as well as my EPSRC fellowship on mining and summarisation of social media (grant EP/I004327/1). Speaking from experience, doing opinion mining on social media is nothing but challenging, and in this paper Diana, Dominic, and I have tried to explain why. In a nutshell:
- Most NLP tools do not come with a swear word plugin. As part of her work on the Arcomem project, Diana had fun collecting a suitable training corpus and a swear word list for sentiment detection.
- "It's all Greek to me": less than 50% of all tweets are in English. Thanks to the plethora of GATE multilingual plugins, building a basic NLP pipeline wasn't as bad as it could have been.
- Identifying relevant posts: there's more chaff than wheat out there, especially on Twitter.
- Twts r noizy: Normalisation and spelling correction are essential. It turns out that the perfect way to collect a training corpus of tweets for normalisation purposes is to search for Justin Bieber.
- Opinion target identification in tweets is...ahem...even more challenging than in longer texts (not that we have fully solved it there either).
- And please do NOT get me started on negation...
- ...or context, time, space, and summarisation for that matter.
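The normalisation step mentioned above can be as simple as a lookup table learned from noisy/clean text pairs. Here is a minimal sketch; the mappings are illustrative assumptions, not our actual normalisation lexicon, and a real normaliser would also do spelling correction.

```python
# Illustrative noisy-to-clean mappings; in practice these would be
# derived from an aligned corpus of noisy and clean text.
NORM = {"u": "you", "r": "are", "2": "to", "gr8": "great",
        "twts": "tweets", "noizy": "noisy"}

def normalise(text):
    """Replace each known noisy token with its clean form."""
    return " ".join(NORM.get(tok, tok) for tok in text.lower().split())

clean = normalise("Twts r noizy")
```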
If you'd like to know more technical details, here's another paper on detecting political opinion in tweets with GATE.
If you wish to learn hands-on how to roll your own sentiment analyser, Diana will be giving a practical sentiment analysis tutorial with GATE at the forthcoming Sentiment Analysis Symposium in San Francisco, California, on October 29th, 2012.
Give us a shout, if you need more info and thanks for reading!
Follow the GATE Team on Twitter: @GateAcUk
Follow Diana Maynard on Twitter: @dianamaynard
Follow me on Twitter: @kbontcheva