On GATE, Text and Social Media Analysis, and Detecting Misinformation Online: Text Analysis


Monday, 6 January 2025

GATE team hosts its first ATRIUM TNA research visit: Using NLP to understand trends in political and social debate

In December 2024, we hosted research visitor Tasos Galanopoulos as part of the ATRIUM project (Advancing fronTier Research In the arts and hUManities) TransNational Access scheme. ATRIUM's aim is to bridge 4 leading research infrastructures in: arts and humanities (DARIAH), archaeology (ARIADNE), language technology (CLARIN), and open scholarly communication in the social sciences and humanities (OPERAS). The Transnational Access (TNA) scheme offers fully funded placements for researchers across Europe. This initiative is designed to support Arts and Humanities researchers by providing access to expert knowledge, mentorship, and tools from leading Data Management organisations. Successful applicants have the opportunity to visit one of 14 different host organisations across Europe in order to conduct their research, benefiting from direct contact, knowledge sharing and network building.

Tasos describes his visit below...

How can NLP tools and large language models be used to understand trends in political and social debate around major issues of the day?

What is the relationship between 'distant reading' and the layered understanding that these tools offer for large volumes of data, and 'close reading', understanding aspects of these topical issues?

What role can these modern tools play in the humanities and in everyday journalistic practice?

Questions such as these, on the occasion of a project on "Analysis of textual data from newspapers on the agreement of Greece's accession to the European Economic Community EEC (1961)", in the context of my postgraduate studies in Digital Humanities at the Open University of Greece, brought me to the School of Computer Science at the University of Sheffield at the end of November (23/11/2024 - 7/12/2024), to collaborate with members of the GATE team.

Despite the short period of the stay, the impressions were the best: the patience and goodwill of all the team - with Dr Maynard at the forefront - helped me to "navigate" the tools offered by the GATE Cloud and the European Language Grid, to understand a bit better the processes required, and the wider field, to learn a bit more about its "alphabet" and requirements. At the same time, through the regular meetings of the team I was able to get a "glimpse" of the modern, specialised, and valuable research being carried out at the university.

In relation to the actual subject of the research, the findings from processing with tools such as Named Entity Recognition, N-gram detection (visualised with word clouds), Topic Classification, Sentiment Analysis, multidimensional analysis with LIWC-22, and Persuasion Techniques were very interesting. They gave answers and insights into our questions, which concerned the attempt to develop a methodology to identify, document and frame named entities when investigating public discourse, newspapers of different political orientation, and political rhetoric around critical events in political life, with reference to the economic and social environment inside and outside the country. In addition, "identifying" and categorising arguments for and against, and 'bias' for or against, in the Press of that time enabled us, at a subsequent level, to explore ways to link entities to key concepts in argumentation.

Overall, my impressions were therefore the best from this constructive visit, a visit that on a personal level gave me inspiration and opened new horizons, but also created new contacts with remarkable people.

Monday, 28 February 2022

How green is your recipe? Using GATE to calculate the environmental impact of recipes

The calculation of environmental impacts from recipes remains a barrier to effective uptake of sustainable diets. In a recent project funded by Alpro, led by Dr Christian Reynolds from the Centre for Food Policy at City University London, we explored digitised recipe texts from websites in English, Dutch and German. We study recipes rather than individual ingredients because this is how people typically think about environmental impact and diet.

Recipes are hard to process because they use different weights and measures, and sometimes quite vague or obscure terms (e.g. "a pinch of salt", "a handful of lettuce"). Together with our project partner Text Mining Solutions, we used GATE to develop customised tools to automatically extract ingredients, quantities and units from 220,168 indexed recipes, and to match these to a food environmental database of 4500 ingredients (using the classification system FoodEx2). This database provided Land Use, GHG emissions, Eutrophying Emissions, Stress-Weighted Water Use, and Freshwater Withdrawals for each ingredient.

Nutrition information was sourced from the USDA FoodData Central (McKillop et al., 2021) and McCance and Widdowson's Composition of Foods Integrated Database (Public Health England, 2015). Environmental and Nutrition information was matched to two classification systems (FoodEx2, containing 4,500 ingredients, and USDA Nutrient Database, containing 2,484 ingredients). This allowed us to calculate these impacts at the mean, 5% and 95% confidence level per recipe and per portion, enabling us to explore the environmental impacts of vegan, vegetarian and non-vegetarian (omnivore) recipes if we were to cook these recipes using contemporary ingredients.
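As a rough illustration of the pipeline described above, the sketch below parses an ingredient line, converts the quantity to grams, and multiplies by low, mean and high impact factors to get a per-portion range. It is not the GATE application itself: the regular expression, the unit conversions (including the guesses for "pinch" and "handful") and the GHG factors are all hypothetical placeholders, not values from FoodEx2 or the project database.

```python
import re

# Hypothetical conversions to grams; vague units are rough guesses.
GRAMS_PER_UNIT = {"g": 1.0, "kg": 1000.0, "pinch": 0.5, "handful": 30.0}

# Made-up GHG factors in kg CO2e per kg of ingredient: (low, mean, high).
GHG_FACTORS = {"beef": (20.0, 60.0, 100.0), "lettuce": (0.2, 0.5, 1.0)}

# Matches e.g. "500 g beef", "500g beef" or "a handful of lettuce".
LINE = re.compile(
    r"^(?:an?\s+|(?P<qty>\d+(?:\.\d+)?)\s*)(?P<unit>[a-z]+)\s+(?:of\s+)?(?P<name>.+)$"
)

def parse_ingredient(line):
    """Return (name, grams), or None if the line or unit is not recognised."""
    m = LINE.match(line.strip().lower())
    if not m or m.group("unit") not in GRAMS_PER_UNIT:
        return None
    qty = float(m.group("qty") or 1)  # "a pinch" counts as quantity 1
    return m.group("name"), qty * GRAMS_PER_UNIT[m.group("unit")]

def recipe_ghg_per_portion(lines, portions):
    """Return (low, mean, high) kg CO2e per portion for the matched ingredients."""
    totals = [0.0, 0.0, 0.0]
    for line in lines:
        parsed = parse_ingredient(line)
        if parsed is None or parsed[0] not in GHG_FACTORS:
            continue  # unmatched ingredients are simply skipped in this sketch
        name, grams = parsed
        for i, factor in enumerate(GHG_FACTORS[name]):
            totals[i] += grams / 1000.0 * factor
    return tuple(t / portions for t in totals)

recipe = ["500 g beef", "a handful of lettuce", "a pinch of salt"]
low, mean, high = recipe_ghg_per_portion(recipe, portions=4)
```

A real system would also need to handle volumes, counts ("2 eggs"), multilingual units and fuzzy matching of ingredient names, which is where most of the difficulty lies.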

To validate the tool, we manually calculated the impacts of 50 recipes from 4 websites: BBC Good Food, Albert Heijn/Allerhande, AllRecipes.com and Kochbar, and compared these with the results from our tool.

We created a website where you can enter a recipe and get back the calculation for the recipe and per portion (with confidence intervals). The image below shows a sample screenshot.

We presented some of our findings as a poster at the Livestock, Environment and People (LEAP) conference in December 2021. You can find more examples of our analysis and results there.

It's interesting to see how the recipes from the different countries, as well as recipes with different protein sources, lead to different median CO2 footprints. Below we see a chart showing the median GHGE per portion in recipes from different protein sources (e.g. those containing beef, those containing tofu) in omnivore, vegetarian, and vegan recipes. Unsurprisingly, the dishes containing meat have higher GHGE values on the whole, though we do find variations within individual recipes. We were particularly excited to find a recipe for chocolate cake that "beat" a salad in terms of low GHGE!

Chart

When we compared the different datasets (depicting recipes from different European countries) in terms of median GHGE per protein source, we found that Kochbar (German) recipes typically fared the worst, followed by the BBC Good Food recipes (British), and Albert Heijn (Dutch) faring much better.

The work is now continuing with the development of a dashboard enabling additional visualisations and further analysis to be produced.

Monday, 12 August 2019

In the News: Online Abuse of Politicians, BBC

We've been working together with the BBC to bring public attention to the issue of online abuse against politicians. Rising tensions in Q1 and Q2 of 2019 meant that politicians were seeing more verbal abuse on Twitter than we have previously observed. The findings were presented on the 6 o'clock and 10 o'clock news on Tuesday, August 6th, and you can see in the histogram above that we found the level of incivility rising to almost 4%. You can see the BBC article describing the work here.

The BBC also ran a survey. Of the 172 MPs who responded, 139 said that either they or their staff had faced abuse in the past year. More than 60% (108) of those who replied said they had been in contact with the police about threats in the last 12 months.

We found that levels of abuse on Twitter fluctuate over time, with spikes driven by events such as the death of IS bride Shamima Begum's baby or key events in the Brexit negotiations. Labour MP David Lammy has received the most abuse of any MP on Twitter so far this year.

As previously, we also found that on average, male MPs attract significantly more general incivility than female ones, though women attract more sexist abuse. Conservative MPs on average, as previously, attracted significantly more abuse than Labour ones, perhaps because they are in power. Sexist abuse is the most prevalent, as compared with homophobia or racism.

Thursday, 6 June 2019

Toxic Online Discussions during the UK European Parliament Election Campaign

The Brexit Party attracted the most engagement on Twitter in the run-up to the UK European Parliament election on May 23rd, their candidates receiving as many tweets as all the other parties combined. Brexit Party leader Nigel Farage was the most interacted-with UK candidate on Twitter, with over twice as many replies as the next most replied-to candidate, Andrew Adonis of the Labour Party.

We studied all tweets sent to or from (or retweets of or by) UK European Election candidates in the month of May, and classified them as abusive or not using the classifier presented here. It must be noted, in particular, that the classifier only identifies reliably whether a reply is abusive or not. It is not sufficiently accurate for us to reliably judge the target politician or party of this abusive reply. What this means is that we can only reliably identify which EP candidates triggered abuse-containing discussion threads on Twitter, but that often this abuse is actually aimed at other politicians or parties.

In addition to attracting the most replies, the Brexit Party candidates also triggered an unusually high level of abuse-containing Twitter discussions. In particular, we found that posts by Farage triggered almost six times as many abuse-containing Twitter threads as the next most replied-to candidate, Gavin Esler of Change UK, during May 2019.

There is an important difference, however, in that many of the abuse-containing replies to posts by Farage and the Brexit Party were actually abusive towards other politicians (most notably the prime minister and the leader of the Labour Party) and not Farage himself. In contrast, abusive replies to Gavin Esler were primarily aimed at the politician himself, triggered by his use of the phrase "village idiot" in connection with the Leave campaign.

Candidates from other parties that triggered unusually high levels of abuse-containing discussions were those from the UK Independence Party, now considered far right, and Change UK, a newly formed but unstable remain party. Change UK was the most active on Twitter, with candidates sending more tweets than other parties. Gavin Esler was the most replied-to Change UK candidate, and also received an unusually high level of abuse. The abuse often referred to his use of the phrase "village idiot" in connection with the leave campaign, which resulted in anger and resentment.

In contrast, MEP candidates from the Conservative and Labour Parties were not hubs of polarised, abuse-containing discussions on Twitter.

What these findings, unsurprisingly, demonstrate is that politicians and parties who themselves use divisive and abusive language, for example, to brand political opponents as “village idiots”, “traitors”, or as “desperate to betray”, are thus triggering the toxic online responses and deep political antagonism that we have witnessed.

After the Brexit Party, the next most replied-to party was Labour, followed by Change UK.

MEP candidates from both the Liberal Democrats and the Green Party were also active on Twitter, with the Green MEP candidates second only to those of Change UK in number of tweets sent, but they received little engagement in return. The Liberal Democrats in particular received a low number of replies. This may suggest that these parties became the default choices for a population of discouraged remainers, as both made gains in the election. Both parties attracted a particularly civil tone of reply.

Brexit Party candidates were also the ones that replied most to those who tweeted them, rather than authoring original tweets or retweeting other tweets.

Acknowledgements: Research carried out by Genevieve Gorrell, Mehmet Bakir, and Kalina Bontcheva. This work was partially supported by the European Union under grant agreements No. 654024 SoBigData and No. 825297 WeVerify.

Monday, 11 March 2019

Coming Up: 12th GATE Summer School 17-21 June 2019

It is approaching that time of the year again! The GATE training course will be held from 17-21 June 2019 at the University of Sheffield, UK.

No previous experience or programming expertise is necessary, so it's suitable for anyone with an interest in text mining and using GATE, including people from humanities backgrounds, social sciences, etc.

This event will follow a similar format to that of the 2018 course, with one track Monday to Thursday, and two parallel tracks on Friday, all delivered by the GATE development team. You can read more about it and register here. Early bird registration is available at a discounted rate until 1 May.

The focus will be on mining text and social media content with GATE. Many of the hands-on exercises will focus on analysing news articles, tweets, and other textual content.

The planned schedule is as follows (NOTE: may still be subject to timetabling changes).
Single track from Monday to Thursday (9am - 5pm):

Monday: Module 1: Basic Information Extraction with GATE

Intro to GATE + Information Extraction (IE)
Corpus Annotation and Evaluation
Writing Information Extraction Patterns with JAPE

Tuesday: Module 2: Using GATE for social media analysis

Challenges for analysing social media, GATE for social media
Twitter intro + JSON structure
Language identification, tokenisation for Twitter
POS tagging and Information Extraction for Twitter

Wednesday: Module 3: Crowdsourcing, GATE Cloud/MIMIR, and Machine Learning

Crowdsourcing annotated social media content with the GATE crowdsourcing plugin
GATE Cloud, deploying your own IE pipeline at scale (how to process 5 million tweets in 30 mins)
GATE Mimir - how to index and search semantically annotated social media streams
Challenges of opinion mining in social media
Training Machine Learning Models for IE in GATE

Thursday: Module 4: Advanced IE and Opinion Mining in GATE

Advanced Information Extraction
Useful GATE components (plugins)
Opinion mining components and applications in GATE

On Friday, there is a choice of modules (9am - 5pm):

Module 5: GATE for developers
- Basic GATE Embedded
- Writing your own plugin
- GATE in production - multi-threading, web applications, etc.
Module 6: GATE Applications
- Building your own applications
- Examples of some current GATE applications: social media summarisation, visualisation, Linked Open Data for IE, and more

These two modules are run in parallel, so you can only attend one of them. You will need to have some programming experience and knowledge of Java to follow Module 5 on the Friday. No particular expertise is needed for Module 6.
Hope to see you in Sheffield in June!

Thursday, 29 November 2018

A Deep Neural Network Sentence Level Classification Method with Context Information

Today we're looking at the work done within the group which was reported in EMNLP2018: "A Deep Neural Network Sentence Level Classification Method with Context Information", authored by Xingyi Song, Johann Petrak and Angus Roberts, all of the University of Sheffield.

Song, X., Petrak, J. & Roberts, A. A Deep Neural Network Sentence Level Classification Method with Context Information. in EMNLP2018 – 2018 Conference on Empirical Methods in Natural Language Processing (2018).

Understanding complex bodies of text is a difficult task, especially when the context of a statement can greatly influence its meaning. While methods exist that examine the context surrounding a phrase, the authors present a new approach that makes use of much larger contexts than these. This allows for greater confidence in the results of such a method, especially when dealing with complicated subject matter. Medical records are one such area, in which complex judgements on appropriate treatments are made across several sentences. It is therefore vital to fully understand the context of each individual statement in order to collate meaning and accurately understand the sentiment of the entire body of text and the conclusion that should be drawn from it.

Although grounded in its use in the medical domain, this new technique can be demonstrated to be more widely applicable. An evaluation of the technique in non-medical domains showed a solid improvement of over six percentage points over its nearest competitor technique despite requiring 33% less training time.

This technique examines not only the subject sentence, but also the context on either side of it. This context is encoded using an adapted FOFE technique that allows for large contexts without crippling amounts of additional computation.

But how does it work? At its core, this novel method analyses not only the target sentence but also an amount of text on either side of it. This context is encoded using an adapted Fixed-size Ordinally Forgetting Encoding (FOFE), turning it from a variable length context into a fixed length embedding. This is processed along with the target, before being concatenated and post-processed to produce an output.
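To make the "variable length in, fixed length out" idea concrete, here is a minimal sketch of the standard FOFE recurrence, z_t = alpha * z_(t-1) + e_t, where e_t is the one-hot vector of the t-th word. This is plain FOFE over a toy vocabulary, not the adapted variant used in the paper; the function name, the forgetting factor value and the token ids are assumptions chosen for illustration.

```python
def fofe_encode(token_ids, vocab_size, alpha=0.5):
    """Encode a token sequence of any length into one vector of size vocab_size.

    Standard FOFE: z_t = alpha * z_{t-1} + one_hot(token_t).
    Words nearer the end of the sequence keep more weight.
    """
    z = [0.0] * vocab_size
    for tok in token_ids:
        z = [alpha * v for v in z]  # decay everything seen so far
        z[tok] += 1.0               # add the current word at full weight
    return z

# Toy vocabulary of three words with ids 0, 1, 2 (hypothetical).
short_context = fofe_encode([0, 1, 2], vocab_size=3)
long_context = fofe_encode([2, 1, 0, 1, 2, 0, 1], vocab_size=3)
```

Both contexts come out as vectors of length 3, which is what lets a large window of surrounding text be passed to the rest of the network at a fixed cost.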

Experimentation on this new technique was then performed, in comparison to peer techniques. These results showed markedly improved performance compared to LSTM-CNN methods, despite taking almost the same amount of time. The performance of this new Context-LSTM-CNN technique even surpassed an L-LSTM-CNN method despite a substantial reduction in required time.

Average test accuracy and training time. Best values are marked as bold, standard deviations in parentheses

In conclusion, a new technique is presented, Context-LSTM-CNN, that combines the strengths of LSTM and CNN with the lightweight context encoding algorithm, FOFE. The model shows a consistent improvement over both a non-context-based model and an LSTM context-encoded model on the sentence classification task.

Thursday, 22 November 2018

Adapted TextRank for Term Extraction: A Generic Method of Improving Automatic Term Extraction Algorithms

This summer, we presented some of our latest work at SEMANTiCS 2018 in Vienna: "Adapted TextRank for Term Extraction: A Generic Method of Improving Automatic Term Extraction Algorithms".

Zhang, Z., Petrak, J. & Maynard, D. Adapted TextRank for Term Extraction: A Generic Method of Improving Automatic Term Extraction Algorithms. in SEMANTiCS 2018 – 14th International Conference on Semantic Systems (2018).

This work has been carried out in the context of the EU KNOWMAK project, where we're developing tools for multi-topic classification of text against an ontology, in order to attempt to map the state of European research output in key technologies.

Automatic Term Extraction (ATE) is a fundamental technique used in computational linguistics for recognising terms in text. Processing the collected terms in a text is a key step in understanding the content of the text. There are many different ATE methods, but these all tend to work well only in one specific domain. In other words, there is no universal method which produces consistently good results, and so we have to choose an appropriate method for the domain being targeted.

In this work, we have developed a novel method for ATE which addresses two major limitations: the fact that no single ATE method consistently performs well across all domains, and the fact that the majority of ATE methods are unsupervised. Our generic method, AdaText, improves the accuracy of existing ATE methods, using existing lexical resources to support them, by revising the TextRank algorithm.

After being given a target text, AdaText:

Selects a subset of words based on their semantic relatedness to a set of seed words or phrases relevant to the domain, but not necessarily representative of the terms within the target text.
It then applies an adapted TextRank algorithm to create a graph for these words, and computes a text-level TextRank score for each selected word.
Finally, these scores are used to revise the score of a term candidate previously computed by an ATE method.
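The three steps above can be sketched roughly as follows. This is an illustrative approximation, not the AdaText implementation: the word selection of the first step is taken as given (the `selected` set), the graph is a simple unweighted co-occurrence graph, and `revise_score` uses a hypothetical boosting rule, since the exact revision formula is not given here.

```python
from collections import defaultdict

def build_cooccurrence_graph(tokens, selected, window=2):
    """Link selected words that occur within `window` tokens of each other."""
    graph = defaultdict(set)
    for i, w in enumerate(tokens):
        if w not in selected:
            continue
        for v in tokens[i + 1 : i + 1 + window]:
            if v in selected and v != w:
                graph[w].add(v)
                graph[v].add(w)
    return graph

def textrank(graph, damping=0.85, iterations=50):
    """Plain unweighted TextRank, computed by power iteration."""
    nodes = list(graph)
    if not nodes:
        return {}
    scores = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_scores = {}
        for n in nodes:
            # Each neighbour shares its score equally among its own neighbours.
            rank = sum(scores[m] / len(graph[m]) for m in graph[n])
            new_scores[n] = (1 - damping) / len(nodes) + damping * rank
        scores = new_scores
    return scores

def revise_score(term, base_score, word_scores):
    """Hypothetical rule: boost the ATE score by the mean TextRank of the term's words."""
    words = term.split()
    boost = sum(word_scores.get(w, 0.0) for w in words) / len(words)
    return base_score * (1.0 + boost)

tokens = "t cell activation requires t cell receptor signalling".split()
selected = {"t", "cell", "activation", "receptor"}  # assumed output of the selection step
word_scores = textrank(build_cooccurrence_graph(tokens, selected))
revised = revise_score("t cell", 2.0, word_scores)
```

Terms made up of words that are well connected in the graph have their base scores raised, while terms with no selected words keep the score assigned by the underlying ATE method.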

This technique was trialled using a variety of parameters (such as the semantic similarity threshold used to select words in the first step) over two distinct datasets (GENIA and ACLv2, comprising Medline abstracts and abstracts from ACL respectively). We also tested it with a wide variety of state-of-the-art ATE methods, including modified TFIDF, CValue, Basic, RAKE, Weirdness, LinkProbability, X², GlossEx and PositiveUnlabeled.

The figures show a sample of performances in different datasets and using different ATE techniques. The base performance of the ATE method is represented by the black horizontal line. The horizontal axis represents the semantic similarity threshold used in step 1. The vertical axis shows average P@K for all five Ks considered.

This new generic combination approach can consistently improve the performance of the ATE method by 25 points, which is a significant increase. However, there is still room for improvement. In future work, we aim to optimise the selection of words from the TextRank graph, work on expanding TextRank to a graph of both words and phrases, and to explore how the size and source of the seed lexicon affects the performance of AdaText.

Wednesday, 15 August 2018

What matters most to people around the world? Using the GATE social media toolkit to investigate wellbeing.

As part of the EU SoBigData project, the GATE team hosts a number of short research visits, between 2 weeks and 2 months, for all kinds of data scientists (PhD students, researchers, academics, professionals) to come and work with us and use our tools and/or datasets on a project involving text mining and social media analysis. One such visitor was Economics PhD student Giuliano Resce from the University of Roma Tre in Italy. During his month-long visit, he worked with Diana Maynard on a project collecting and analysing millions of public tweets in 7 different languages, in order to understand the different societal priorities of people in different countries of the OECD. The work explored the different opinions on Twitter of people around the world about societal issues such as the environment, housing and life satisfaction.

OECD Better Life Index

Giuliano first used the GATE Twitter Collector to collect a set of tweets, then processed them with the GATE social media analysis toolkit and used GATE Mimir to investigate the results. Topics were determined from the initial set of OECD topics in 7 languages. For each language, we expanded every topic into a set of keywords, first using existing lists from the GATE political tweets analyser and then using Word2Vec to find further keywords related to those.
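The embedding-based expansion step can be pictured with a small sketch: given word vectors, keep the words whose cosine similarity to any seed keyword is high enough. The vectors, threshold and function names below are toy assumptions for illustration; in practice the vectors would come from a trained Word2Vec model rather than being written by hand.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def expand_keywords(seeds, vectors, top_n=2, min_sim=0.7):
    """Return up to top_n non-seed words most similar to any seed keyword."""
    candidates = {}
    for word, vec in vectors.items():
        if word in seeds:
            continue
        best = max(
            (cosine(vec, vectors[s]) for s in seeds if s in vectors),
            default=0.0,
        )
        if best >= min_sim:
            candidates[word] = best
    return sorted(candidates, key=candidates.get, reverse=True)[:top_n]

# Toy 2-dimensional "embeddings" (hypothetical values).
toy_vectors = {
    "housing": [1.0, 0.0],
    "rent": [0.9, 0.1],
    "mortgage": [0.8, 0.3],
    "pollution": [0.0, 1.0],
}
expanded = expand_keywords({"housing"}, toy_vectors)
```

With these toy vectors, "rent" and "mortgage" are added to the housing topic while "pollution" is left out.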

Better Life Index Topic frequency at country level in Twitter (percentage)

The ensuing analysis of the tweets then enabled Giuliano to redesign Composite Indices for the OECD’s Better Life Index, a measure of well-being which gives a detailed overview of the social, economic and environmental performances of different countries. In turn, this redesign helps to better reflect the actual needs of the people. The idea is that the aggregate of millions of tweets may provide a representation of the different priorities among the eleven topics of the Better Life Index. By combining topic performances and related Twitter trends, they produced new evidence about the relationship between people’s priorities and policy makers’ activity in the BLI framework.
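The difference between equal weights and Twitter-derived weights can be shown with a simple weighted average. The sketch below is only an assumption about the general shape of such a composite index, not the aggregation method used in the published paper, and the topic scores and tweet counts are invented for the example.

```python
def composite_index(scores, weights=None):
    """Weighted arithmetic mean of topic scores; equal weights if none are given."""
    topics = list(scores)
    if weights is None:
        weights = {t: 1.0 for t in topics}
    total = sum(weights[t] for t in topics)
    return sum(scores[t] * weights[t] / total for t in topics)

# Invented topic performances (0 to 1) and tweet counts for one country.
topic_scores = {"housing": 0.5, "environment": 1.0}
tweet_counts = {"housing": 300, "environment": 100}

equal_weighted = composite_index(topic_scores)
twitter_weighted = composite_index(topic_scores, tweet_counts)
```

Here the country scores 0.75 under equal weights but 0.625 when weighted by tweet volume, because the topic people talk about most (housing) is the one on which it performs worse.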

Rank in Composite BLI using local Twitter trends as Weights and using Equal Weights

A paper about the work has been published in the Journal of Technological Forecasting & Social Change.

More information about SoBigData TransNational Access Research visits