Posts about our text and social media analysis work and latest news on GATE (http://gate.ac.uk) - our open source text and social media analysis platform. Also posts about the PHEME project (http://pheme.eu) and our work on automatic detection of rumours in social media. Lately also general musings about fake news, misinformation, and online propaganda.
Monday, 12 August 2019
In the News: Online Abuse of Politicians, BBC
We've been working together with the BBC to bring public attention to the issue of online abuse against politicians. Rising tensions in Q1 and Q2 of 2019 meant that politicians were seeing more verbal abuse on Twitter than we have previously observed. The findings were presented on the 6 o'clock and 10 o'clock news on Tuesday, August 6th, and you can see in the histogram above that we found the level of incivility rising to almost 4%. You can see the BBC article describing the work here.
The BBC also conducted a survey: of the 172 MPs who responded, 139 said that either they or their staff had faced abuse in the past year. More than 60% (108) of those who replied said they had been in contact with the police about threats in the last 12 months.
We found that levels of abuse on Twitter fluctuate over time, with spikes driven by events such as the death of IS bride Shamima Begum's baby or key events in the Brexit negotiations. Labour MP David Lammy has received the most abuse of any MP on Twitter so far this year.
As previously, we found that on average male MPs attract significantly more general incivility than female ones, though women attract more sexist abuse. Conservative MPs, as before, attracted significantly more abuse on average than Labour ones, perhaps because they are in power. Of identity-based abuse, sexist abuse is the most prevalent, ahead of homophobic and racist abuse.
Labels:
abuse language,
Social Media,
Text Analysis,
text mining
Tuesday, 30 July 2019
GATE Cloud services for Google Sheets featured in the CLARIN Newsflash
CLARIN ERIC is a research infrastructure across Europe and beyond that encourages the sharing and sustainability of language data and tools for research in the humanities and social sciences. We are pleased to announce that our functions for text analysis in Google Sheets were featured in the July 2019 issue of the CLARIN Newsflash.
We are still working on getting Google to publish our add-on, which we hope to have available in the marketplace in a few months. Until then, you can follow the instructions in our previous blog post to use this tool, which currently provides standard and Twitter-oriented named entity recognition for English, French, and German; named entity linking for English, French, and German; and rumour veracity evaluation for English. In the future we will expand the range of functions to cover a wider variety of GATE Cloud services.
Labels:
GATE Cloud,
Google Sheets,
SoBigData,
Social Media
Monday, 15 July 2019
GATE Cloud services for Google Sheets
Spreadsheets are an increasingly popular way of storing all kinds of information, including text, and giving it some informal structure, and systems like Google Sheets are especially popular for collaborative work and sharing data.
In response to the demand for standard natural language processing (NLP) tasks in spreadsheets, we have developed a Google Sheets add-on that provides functions to carry out the following tasks on text cells using GATE Cloud services:
- named entity recognition (NER) for standard text (e.g. news) in English, French, or German;
- NER tuned for tweets in English, French, or German;
- named entity linking using our YODIE service in English, French, or German;
- veracity reporting for rumours in tweets.
We have demonstrated this work several times, most recently at the IAMCR conference "Communication, Technology and Human Dignity: Disputed Rights, Contested Truths", which took place on 7–11 July at the Universidad Complutense de Madrid in Spain. There we used it to show how organisations monitoring the safety of journalists could automatically add information about entities and events to their spreadsheets. Potential users have said it looks very useful and they would like access to it as soon as possible.
Google sheet showing the named entity recognition and linking applications run over descriptions of journalist killings from the Committee to Protect Journalists (CPJ) databases
We are applying to have this add-on published in the G Suite Marketplace, but the process is very slow, so we are making the software available now as a read-only Google Drive document that anyone can copy and re-use.
The document contains several examples and instructions are available from the Add-ons → GATE Text Analysis menu item. The language processing is actually done on our servers; the spreadsheet functions send the text to GATE Cloud using the REST API and reformat the output into a human-readable form, so they require a network connection and are subject to rate-limiting. You can use the functions without setting up a GATE Cloud account, but if you create one and authenticate while using this add-on, rate-limiting will be reduced.
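To give a flavour of the reformatting step, here is a minimal Python sketch of how an entity-annotation response could be flattened into a single readable cell value. The response shape shown is a simplified assumption loosely modelled on GATE Cloud's JSON entity output; consult the API documentation for the exact schema.

```python
def entities_to_cell(gate_json):
    """Flatten an entity-annotation response into one readable string.

    Assumes a simplified response shape -- {"text": ..., "entities":
    {"Person": [{"indices": [start, end]}], ...}} -- which is an
    illustrative approximation of the GATE Cloud JSON output.
    """
    text = gate_json["text"]
    parts = []
    # Sort entity types so the cell value is deterministic.
    for etype, spans in sorted(gate_json.get("entities", {}).items()):
        for span in spans:
            start, end = span["indices"]
            parts.append(f"{etype}: {text[start:end]}")
    return "; ".join(parts)
```

In the real add-on this happens in Apps Script rather than Python, but the idea is the same: one network call per cell, then a compact human-readable summary of the annotations.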
Open this Google spreadsheet, then use File → Make a copy to save a copy to your own Google Drive (you can’t edit the original). For the functions to work, you will have to grant permission for the scripts to send data to and from GATE Cloud services and to use your user-level cache.
This work has been supported by the European Union’s Horizon 2020 research and innovation programme under grant agreements No 687847 (COMRADES) and No 654024 (SoBigData).
Labels:
COMRADES,
GATE Cloud,
Google Sheets,
JournoSafety,
SoBigData,
Social Media
Friday, 12 July 2019
Using GATE to drive robots at Headstart 2019
In collaboration with Headstart (a charitable trust that provides hands-on science, engineering and maths taster courses), the Department of Computer Science has just run its fourth annual summer school for maths and science A-level students. This residential course ran from 8 to 12 July 2019 and included practical work in computer programming, Lego robots, and project development as well as tours of the campus and talks about the industry.
For the third year in a row, we have included a section on natural language processing using GATE Developer and a special GATE plugin (which uses the ShefRobot library available from GitHub) that allows JAPE rules to operate the Lego robots. As before, we provided the students with a starter GATE application (essentially the same as in last year's course) containing just enough gazetteer entries, JAPE, and sample code to let them tweet variations like "turn left" and "take a left" to make the robot do just that. We also use the GATE Cloud Twitter Collector, which we have modified to run locally so the students can set it up on a lab computer to follow their own Twitter accounts and process their tweets through the GATE application, sending commands to the robots when the JAPE rules match.
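As an illustration of what the gazetteer and JAPE rules are doing conceptually, here is a toy Python stand-in that maps free-text command variations onto robot actions. The command patterns below are invented for illustration; the actual course used GATE Developer with JAPE grammars and the ShefRobot plugin.

```python
import re

# Toy stand-in for the gazetteer + JAPE grammar: each robot command is
# paired with a pattern covering several phrasings of that command.
COMMANDS = {
    "left": re.compile(r"\b(turn|take|go)\s+(a\s+)?left\b", re.I),
    "right": re.compile(r"\b(turn|take|go)\s+(a\s+)?right\b", re.I),
    "forward": re.compile(r"\b(go|move)\s+(forwards?|ahead)\b", re.I),
}

def parse_command(tweet):
    """Return the first robot command matched in the tweet, or None."""
    for command, pattern in COMMANDS.items():
        if pattern.search(tweet):
            return command
    return None
```

The JAPE version works over annotations (gazetteer lookups, tokens) rather than raw strings, which makes it much easier to extend robustly than regular expressions, but the mapping from phrase variations to a single command is the same idea.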
Based on lessons learned from the previous years, we put more effort into improving the instructions and the Twitter Collector software to help them get it running faster. This time the first robot started moving under GATE's control less than 40 minutes from the start of the presentation, and the students rapidly progressed with the development of additional rules and then tweeting commands to their robots.
The structure and broader coverage of this year's course meant that the students had more resources available and a more open project assignment, so not all of them chose to use GATE in their projects, but it was much easier and more streamlined for them to use than in previous years.
This year 42 students (14 female; 28 male) from around the UK attended the Computer Science Headstart Summer School.
Geography of male students
Geography of female students
The handout and slides are publicly available from the GATE website, which also hosts GATE Developer and other software products in the GATE family. Source code is available from our GitHub site.
GATE Cloud development is supported by the European Union’s Horizon 2020 research and innovation programme under grant agreement No 654024 (the SoBigData project).
Wednesday, 3 July 2019
12th GATE Summer School (17-21 June 2019)
12th GATE Training Course: open-source natural language processing with an emphasis on social media
For over a decade, the GATE team has provided an annual course in using our technology. The course content and track options have changed a bit over the years, but it always includes material to help novices get started with GATE as well as introductory and more advanced use of the JAPE language for matching patterns of document annotations. The latest course also included machine learning, crowdsourcing, sentiment analysis, and an optional programming module (aimed mainly at Java programmers to help them embed GATE libraries, applications, and resources in web services and other "behind the scenes" processing). We have also added examples and new tools in GATE to cover the increasing demand for getting data out of and back into spreadsheets, and updated our work on social media analysis, another growing field.
Information in "feral databases" (spreadsheets)
Semantics in scientometrics |
- From the KNOWMAK and RISIS projects, we presented our work on using semantic technologies in scientometrics: applying NLP and ontologies to document categorization in order to contribute to a searchable knowledge base that lets users find aggregate and specific data about scientific publications, patents, and research projects by geography, category, and more.
- Much of our recent work on social media analysis, including opinion mining and abuse detection and measurement, has been done as part of the SoBigData project.
- The increasing range of tools for languages other than English links with our participation in the European Language Grid, which also supports further development of GATE Cloud, our platform for text analytics as a service.
Conditional processing of multilingual documents
Processing German in GATE
Acknowledgements
Labels:
GATE,
KNOWMAK,
SoBigData,
Social Media,
Training Course
Thursday, 6 June 2019
Toxic Online Discussions during the UK European Parliament Election Campaign
The Brexit Party attracted the most engagement on Twitter in the run-up to the UK European Parliament election on May 23rd, with their candidates receiving as many tweets as all the other parties combined. Brexit Party leader Nigel Farage was the most interacted-with UK candidate on Twitter, with over twice as many replies as the next most replied-to candidate, Andrew Adonis of the Labour Party.
We studied all tweets sent to or from (or retweets of or by) UK European Election candidates in the month of May, and classified them as abusive or not using the classifier presented here. Note that the classifier can only reliably identify whether a reply is abusive; it is not accurate enough to determine which politician or party the abuse targets. This means we can reliably identify which EP candidates triggered abuse-containing discussion threads on Twitter, but that abuse is often actually aimed at other politicians or parties.
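The thread-level measure can be sketched as follows. The abuse flags would come from the classifier; attributing a thread to the candidate whose post started it is the only step we treat as reliable, for the reasons above.

```python
def abuse_thread_counts(threads):
    """Count abuse-containing threads per root author.

    threads: list of (root_author, [is_abusive flag per reply]).
    A thread "contains abuse" if any reply in it is classified abusive,
    and it is attributed to the candidate whose post started it.
    """
    counts = {}
    for author, reply_flags in threads:
        if any(reply_flags):
            counts[author] = counts.get(author, 0) + 1
    return counts
```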
In addition to attracting the most replies, the Brexit Party candidates also triggered an unusually high level of abuse-containing Twitter discussions. In particular, posts by Farage triggered almost six times as many abuse-containing Twitter threads as those of the next most replied-to candidate, Gavin Esler of Change UK, during May 2019.
There is an important difference, however: many of the abuse-containing replies to posts by Farage and the Brexit Party were actually abusive towards other politicians (most notably the Prime Minister and the leader of the Labour Party) rather than Farage himself. In contrast, abusive replies to Gavin Esler were primarily aimed at Esler himself, triggered by his use of the phrase "village idiot" in connection with the Leave campaign.
Candidates from other parties who triggered unusually high levels of abuse-containing discussions were those from the UK Independence Party, now considered far right, and Change UK, a newly formed but unstable remain party. Change UK was the most active party on Twitter, its candidates sending more tweets than those of any other party, with Gavin Esler the most replied-to among them.
In contrast, MEP candidates from the Conservative and Labour Parties were not hubs of polarised, abuse-containing discussions on Twitter.
What these findings demonstrate, unsurprisingly, is that politicians and parties who themselves use divisive and abusive language, for example branding political opponents as "village idiots", "traitors", or "desperate to betray", trigger the toxic online responses and deep political antagonism that we have witnessed.
After the Brexit Party, the next most replied-to candidates were from the Labour Party, followed by Change UK.
MEP candidates from both the Liberal Democrats and the Green Party were also active on Twitter, with Green candidates second only to Change UK in the number of tweets sent, but they received relatively little engagement in return; the Liberal Democrats in particular received few replies. Both parties attracted a notably civil tone of reply, and both made gains in the election, which may suggest they became default choices for discouraged remainers.
Brexit Party candidates were also the ones that replied most to those who tweeted them, rather than authoring original tweets or retweeting other tweets.
Acknowledgements: Research carried out by Genevieve Gorrell, Mehmet Bakir, and Kalina Bontcheva. This work was partially supported by the European Union under grant agreements No. 654024 SoBigData and No. 825297 WeVerify.
Wednesday, 17 April 2019
WeVerify: Algorithm-Supported Verification of Digital Content
Announcing WeVerify: a new project developing AI-based tools for computer-supported digital content verification. The WeVerify platform will provide an independent, community-driven environment for verifying online content, intended to help journalists gather and verify online content quickly. Prof. Kalina Bontcheva will serve as Scientific Director of the project.
Online disinformation and fake media content have emerged as a serious threat to democracy, economy and society. Content verification is currently far from trivial, even for experienced journalists, human rights activists or media literacy scholars. Moreover, recent advances in artificial intelligence (deep learning) have enabled the creation of intelligent bots and highly realistic synthetic multimedia content. Consequently, it is extremely challenging for citizens and journalists to assess the credibility of online content, and to navigate the highly complex online information landscapes.
WeVerify aims to address the complex content verification challenges through a participatory verification approach, open source algorithms, low-overhead human-in-the-loop machine learning and intuitive visualizations. Social media and web content will be analysed and contextualised within the broader online ecosystem, in order to expose fabricated content, through cross-modal content verification, social network analysis, micro-targeted debunking and a blockchain-based public database of known fakes.
A key outcome will be the WeVerify platform for collaborative, decentralised content verification, tracking, and debunking.
The platform will be open source to engage communities and citizen journalists alongside newsroom and freelance journalists. To enable low-overhead integration with in-house content management systems and support more advanced newsroom needs, a premium version of the platform will also be offered. It will furthermore be supplemented by a digital companion to assist with verification tasks.
Results will be validated by professional journalists and debunking specialists from project partners (DW, AFP, DisinfoLab), external participants (e.g. members of the First Draft News network), the community of more than 2,700 users of the InVID verification plugin, and by media literacy, human rights and emergency response organisations.
The WeVerify website can be found at https://weverify.eu/, and WeVerify can be found on Twitter @WeV3rify!
Monday, 11 March 2019
Coming Up: 12th GATE Summer School 17-21 June 2019
It is approaching that time of the year again! The GATE training course will be held from 17-21 June 2019 at the University of Sheffield, UK.
No previous experience or programming expertise is necessary, so it's suitable for anyone with an interest in text mining and using GATE, including people from humanities backgrounds, social sciences, etc.
This event will follow a similar format to that of the 2018 course, with one track Monday to Thursday, and two parallel tracks on Friday, all delivered by the GATE development team. You can read more about it and register here. Early bird registration is available at a discounted rate until 1 May.
The focus will be on mining text and social media content with GATE. Many of the hands on exercises will be focused on analysing news articles, tweets, and other textual content.
The planned schedule is as follows (NOTE: may still be subject to timetabling changes).
Single track from Monday to Thursday (9am - 5pm):
- Monday: Module 1: Basic Information Extraction with GATE
- Intro to GATE + Information Extraction (IE)
- Corpus Annotation and Evaluation
- Writing Information Extraction Patterns with JAPE
- Tuesday: Module 2: Using GATE for social media analysis
- Challenges for analysing social media, GATE for social media
- Twitter intro + JSON structure
- Language identification, tokenisation for Twitter
- POS tagging and Information Extraction for Twitter
- Wednesday: Module 3: Crowdsourcing, GATE Cloud/MIMIR, and Machine Learning
- Crowdsourcing annotated social media content with the GATE crowdsourcing plugin
- GATE Cloud, deploying your own IE pipeline at scale (how to process 5 million tweets in 30 mins)
- GATE Mimir - how to index and search semantically annotated social media streams
- Challenges of opinion mining in social media
- Training Machine Learning Models for IE in GATE
- Thursday: Module 4: Advanced IE and Opinion Mining in GATE
- Advanced Information Extraction
- Useful GATE components (plugins)
- Opinion mining components and applications in GATE
- Module 5: GATE for developers
- Basic GATE Embedded
- Writing your own plugin
- GATE in production - multi-threading, web applications, etc.
- Module 6: GATE Applications
- Building your own applications
- Examples of some current GATE applications: social media summarisation, visualisation, Linked Open Data for IE, and more
Hope to see you in Sheffield in June!
Tuesday, 5 March 2019
Brexit: The Regional Divide
Although the UK voted by a narrow margin to leave the EU in the 2016 EU membership referendum, that outcome failed to capture the diverse feelings held in various regions. Curiously, the UK regions most economically dependent on the EU were the regions more likely to vote to leave it. The image below on the right, taken from this article from the Centre for European Reform, makes the point in a few different ways. This and similar research inspired a current project the GATE team is undertaking with colleagues in the Geography and Journalism departments at the University of Sheffield, under the leadership of Miguel Kanai and with funding from the British Academy, aiming to understand whether a lack of awareness of local circumstances played a role in the referendum outcome.
Our Brexit tweet corpus contains tweets collected during the run-up to the Brexit referendum, and we've annotated almost half a million accounts for Brexit vote intent with high accuracy; you can read about that here. So we thought we'd be well positioned to bring some insights. We also annotated user accounts with location: many Twitter users volunteer that information, though there is a lot of variation in how people describe their location, so that was harder to do accurately. We also used local and national news media corpora from the time of the referendum, in order to contrast national coverage with local issues around the country.
Using topic modelling and named entity recognition, we were able to look for similarities and differences in the focus of local and national media and Twitter users. The bar chart on the left gets us started, illustrating that foci differ between media. Twitter users give more air time than news media to trade and immigration, whereas local press takes the lead on employment, local politics and agriculture. National press gives more space to terrorism than either Twitter or local news.
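As a rough illustration of this kind of corpus comparison, here is a keyword-counting sketch in Python. The real analysis used topic modelling and named entity recognition; the topics and keyword lists below are invented for illustration only.

```python
from collections import Counter

# Hypothetical topic lexicons standing in for the learned topic models.
TOPIC_KEYWORDS = {
    "immigration": {"immigration", "migrants", "borders"},
    "employment": {"jobs", "employment", "wages"},
}

def topic_shares(docs):
    """Fraction of topic-keyword hits per topic across a list of documents,
    giving a crude measure of how much attention a corpus pays to each topic."""
    counts = Counter()
    for doc in docs:
        for token in doc.lower().split():
            word = token.strip(".,!?")
            for topic, words in TOPIC_KEYWORDS.items():
                if word in words:
                    counts[topic] += 1
    total = sum(counts.values()) or 1
    return {topic: counts[topic] / total for topic in TOPIC_KEYWORDS}
```

Computing these shares separately for, say, a local-press corpus and a national-press corpus and taking the difference gives the kind of per-topic contrast plotted in the choropleths.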
On the right is just one of many graphs in which we unpack this on a region-by-region basis (you can find more on the project website). In this choropleth, red indicates that the topic was significantly more discussed in national press than in local press in that area, and green indicates that the topic was significantly more discussed in local press there than in national press. Terrorism and immigration have perhaps been subject to a certain degree of media and propaganda inflation; we talk about this in our Social Informatics paper. Where media focus on locally relevant issues, foci are more grounded, for example in practical topics such as agriculture and employment. We found that across the regions, Twitter remainers showed a closer congruence with local press than Twitter leavers.
The graph on the right shows the number of times a newspaper was linked on Twitter, contrasted against the percentage of people who said they read that newspaper in the British Election Study. It shows that the dynamics of popularity on Twitter are very different from traditional readership. This highlights a need to understand how the online environment is affecting the news reportage we are exposed to, creating a market for a different kind of material and a potentially more hostile climate for quality journalism, as discussed by project advisor Prof. Jackie Harrison here. Furthermore, the local press is increasingly struggling to survive, so it feels important to highlight its value through this work.
You can see more choropleths on the project website. There's also an extended version here of an article currently under review.
"People's resistance to propaganda and media-promoted ideas derives from their close ties in real communities" (Jean Seaton)
Wednesday, 20 February 2019
GATE team wins first prize in the Hyperpartisan News Detection Challenge
SemEval 2019 recently launched the Hyperpartisan News Detection Task in order to evaluate how well tools could automatically classify hyperpartisan news texts. The idea behind this is that "given a news text, the system must decide whether it follows a hyperpartisan argumentation, i.e. whether it exhibits blind, prejudiced, or unreasoning allegiance to one party, faction, cause, or person."
Below we see an example of (part of) two news stories about Donald Trump from the challenge data. The one on the left is considered to be hyperpartisan, as it shows a biased kind of viewpoint. The one on the right simply reports a story and is not considered hyperpartisan. The distinction is difficult even for humans, because there are no exact rules about what makes a story hyperpartisan.
In total, 322 teams registered to take part, of which 42 actually submitted an entry, including the GATE team consisting of Ye Jiang, Xingyi Song and Johann Petrak, with guidance from Kalina Bontcheva and Diana Maynard.
The main performance measure for the task is accuracy on a balanced set of articles, with precision, recall, and F1-score additionally measured for the hyperpartisan class. In the final submission, the GATE team's hyperpartisan classifier achieved 0.822 accuracy on the manually annotated evaluation set and ranked first on the final leaderboard.
Our winning system used sentence representations built from averaged word embeddings generated by a pre-trained ELMo model, fed to a Convolutional Neural Network with Batch Normalization trained on the provided dataset. An averaged ensemble of models then generated the final predictions.
The source code and a full system description are available on GitHub.
One of the major challenges of this task is that the model must adapt to a large range of article lengths. Most state-of-the-art neural network approaches to document classification take a token sequence as network input, but here such an approach would mean either massive computational cost or loss of information, depending on how the maximum sequence length is chosen. We got around this problem by first pre-calculating sentence-level embeddings as the average of the word embeddings in each sentence, and then representing the document as a sequence of these sentence embeddings. We also found that ignoring some of the provided training data (which was automatically generated based on the publishing source of each document) improved our results, which leads to important conclusions about the trustworthiness of training data and its implications.
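The document representation described above can be sketched as follows, in plain Python with a placeholder embedding lookup standing in for the pre-trained ELMo vectors:

```python
def sentence_embedding(tokens, lookup, dim):
    """Average the word vectors of the tokens found in the lookup."""
    vecs = [lookup[t] for t in tokens if t in lookup]
    if not vecs:
        return [0.0] * dim
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

def document_matrix(sentences, lookup, dim, max_sentences):
    """Represent a document as a fixed-length sequence of sentence vectors,
    truncating long documents and zero-padding short ones."""
    rows = [sentence_embedding(s, lookup, dim)
            for s in sentences[:max_sentences]]
    rows += [[0.0] * dim for _ in range(max_sentences - len(rows))]
    return rows
```

Because every document becomes a `max_sentences x dim` matrix regardless of its length in tokens, the downstream CNN sees a fixed-size input at a fraction of the cost of token-level sequences.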
Overall, the ability to do well on the hyperpartisan news prediction task is important both for improving knowledge about neural networks for language processing generally, but also because better understanding of the nature of biased news is critical for society and democracy.
Monday, 18 February 2019
Russian Troll Factory: Sketches of a Propaganda Campaign
When Twitter shared a large archive of propaganda tweets late in 2018, we were excited to get access to over 9 million tweets from almost 4 thousand unique Twitter accounts controlled by Russia's Internet Research Agency. The tweets are posted in 57 different languages, but most are in Russian (53.68%) and English (36.08%). Average account age is around four years, and the oldest accounts are as much as ten years old.
A large amount of activity in both the English and Russian accounts is devoted to news provision. Secondly, many accounts seem to engage in hashtag games, which may be a way to establish an account and gain some followers. Of particular interest, however, are the political trolls. Left trolls pose as individuals interested in the Black Lives Matter campaign. Right trolls are patriotic, anti-immigration Trump supporters. Among left and right trolls, several have achieved large follower numbers and even a degree of fame. Finally, there are fearmonger trolls, which propagate scares, and a small number of commercial trolls. The Russian-language accounts divide along similar lines, perhaps posing as individuals with opinions about Ukraine or western politics. These categories were proposed by Darren Linvill and Patrick Warren of Clemson University. In the word clouds below you can see the hashtags we found left and right trolls using.
Mehmet E. Bakir has created some interactive graphs enabling us to explore the data. In the network diagram at the start of the post you can see the network of mention/retweet/reply/quote counts we created from the highly followed accounts in the set. You can click through to an interactive version, where you can zoom in and explore different troll types.
In the graph below, you can see activity in different languages over time (interactive version here, or interact with the embedded version below; you may have to scroll right). It shows that the Russian language operation came first, with English language operations following after. The timing of this part of the activity coincides with Russia's interest in Ukraine.
In the graph below, also available here, you can see how different types of behavioural strategy pay off in terms of achieving higher numbers of retweets. Using Linvill and Warren's manually annotated data, Mehmet built a classifier that enabled us to classify all the accounts in the dataset. It is evident that the political trolls have by far the greatest impact in terms of retweets achieved, with left trolls being the most successful. Russia's interest in the Black Lives Matter campaign perhaps suggests that the first challenge for agents is to win a following, and that exploiting divisions in society is an effective way to do that. How that following is then used to influence minds is a separate question. You can see a pre-print of our paper describing our work so far, in the context of the broader picture of partisanship, propaganda and post-truth politics, here.
[Word cloud: Left Troll Hashtags]
[Word cloud: Right Troll Hashtags]
Labels:
Disinformation,
hyperpartisan,
Misinformation,
Natural Language Processing,
SoBigData,
Social Media
Friday, 8 February 2019
Teaching computers to understand the sentiment of tweets
As part of the EU SoBigData project, the GATE team hosts a number of short research visits, between 2 weeks and 2 months, for all kinds of data scientists (PhD students, researchers, academics, professionals) to come and work with us and to use our tools and/or datasets on a project involving text mining and social media analysis. Kristoffer Stensbo-Smidt visited us in the summer of 2018 from the University of Copenhagen, to work on developing machine learning tools for sentiment analysis of tweets, and was supervised by GATE team member Diana Maynard and by former team member Isabelle Augenstein, who is now at the University of Copenhagen. Kristoffer has a background in Machine Learning but had not worked in NLP before, so this visit helped him understand how to apply his skills to this kind of domain.
After his visit, Kristoffer wrote up an excellent summary of his research. He essentially tested a number of different approaches to processing text, and analysed how much of the sentiment they were able to identify. Given a tweet and an associated topic, the aim is to ascertain automatically whether the sentiment expressed about this topic is positive, negative or neutral. Kristoffer experimented with different word embedding-based models in order to test how much information different word embeddings carry about the sentiment of a tweet. This involved choosing which embedding models to test, and how to transform the topic vectors. The main conclusion he drew from the work was that, in general, word embeddings contain a lot of useful information about sentiment, with newer embeddings containing significantly more. This is not particularly surprising, but it shows the importance of advanced models for this task.
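As a rough illustration of why averaged word embeddings carry sentiment signal, the hedged sketch below classifies a tweet by comparing its averaged embedding to class centroids built from labelled examples. The toy vectors and the nearest-centroid rule are invented for illustration; they are not Kristoffer's actual models.

```python
def embed(tokens, word_vectors, dim):
    """Average the embeddings of the known words in a tweet."""
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    if not vecs:
        return [0.0] * dim
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

def centroids(labelled, word_vectors, dim):
    """One averaged embedding per sentiment label."""
    out = {}
    for label, examples in labelled.items():
        embs = [embed(toks, word_vectors, dim) for toks in examples]
        out[label] = [sum(e[i] for e in embs) / len(embs) for i in range(dim)]
    return out

def classify(tokens, cents, word_vectors, dim):
    """Assign the label whose centroid is nearest to the tweet embedding."""
    e = embed(tokens, word_vectors, dim)
    sq = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    return min(cents, key=lambda label: sq(e, cents[label]))

# Toy embedding space in which the first axis loosely encodes positivity.
wv = {"love": [1.0, 0.0], "great": [0.9, 0.1], "hate": [0.0, 1.0], "awful": [0.1, 0.9]}
cents = centroids({"positive": [["love"], ["great"]],
                   "negative": [["hate"], ["awful"]]}, wv, dim=2)
```

The better the embeddings separate sentiment-bearing words in the vector space, the better even a simple classifier like this performs, which is the effect Kristoffer was measuring.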
3rd International Workshop on Rumours and Deception in Social Media (RDSM)
June 11, 2019 in Munich, Germany
Collocated with ICWSM'2019
Abstract
The 3rd edition of the RDSM workshop will particularly focus on online information disorder and its interplay with public opinion formation. Social media is a valuable resource for mining all kinds of information, from opinions to factual content. However, social media also harbours issues that pose serious threats to society, chief among them online information disorder and its power to shape public opinion. Known aspects include the spread of false rumours, fake news, and social attacks such as hate speech or other forms of harmful posts. This workshop aims to bring together researchers and practitioners interested in social media mining and analysis to address the emerging issues of information disorder and manipulation of public opinion. The focus of the workshop will be on themes such as the detection of fake news, the verification of rumours, and the understanding of their impact on public opinion. Furthermore, we place particular emphasis on the usefulness and trustworthiness of automated solutions tackling these themes.
Workshop Theme and Topics
The aim of this workshop is to bring together researchers and practitioners interested in social media mining and analysis to deal with the emerging issues of veracity assessment, fake news detection and manipulation of public opinion. We invite researchers and practitioners to submit papers reporting results on these issues. Qualitative studies examining the challenges encountered in the use of social media, such as the veracity of information and fake news detection, as well as papers reporting new data sets, are also welcome. Finally, we also welcome studies reporting on the usefulness and trustworthiness of social media tools tackling the aforementioned problems. Topics of interest include, but are not limited to:
- Detection and tracking of rumours.
- Rumour veracity classification.
- Fact-checking social media.
- Detection and analysis of disinformation, hoaxes and fake news.
- Stance detection in social media.
- Qualitative user studies assessing the use of social media.
- Bots detection in social media.
- Measuring public opinion through social media.
- Assessing the impact of social media in public opinion.
- Political analyses of social media.
- Real-time social media mining.
- NLP for social media analysis.
- Network analysis and diffusion of dis/misinformation.
- Usefulness and trust analysis of social media tools.
- AI generated fake content (image / text)
Workshop Program Format
We will have 1-2 experts in the field delivering keynote speeches, followed by 8-10 presentations of peer-reviewed submissions, organised into 3 sessions by subject (the first two sessions on online information disorder and public opinion, the third on usefulness and trust aspects). After the sessions we also plan a group-work activity (groups of 4-5 attendees) in which each group will sketch a social media tool for tackling e.g. rumour verification or fake news detection. The emphasis of the sketch should be on aspects like usefulness and trust. This should take no longer than 120 minutes (sketching plus presentation/discussion time). We will close the workshop with a summary and take-home messages (max. 15 minutes). Attendance will be open to all interested participants.
We welcome both full papers (5-8 pages) to be presented as oral talks and short papers (2-4 pages) to be presented as posters and demos.
Workshop Schedule/Important Dates
- Submission deadline: April 1st 2019
- Notification of Acceptance: April 15th 2019
- Camera-Ready Versions Due: April 26th 2019
- Workshop date: June 11, 2019
Submission Procedure
We invite two kinds of submissions:
- Long papers/Brief Research Report (max 8 pages + 2 references)
- Demos and poster (short papers) (max 4 pages + 2 references)
Proceedings of the workshop will be published jointly with other ICWSM workshops in a special issue of Frontiers in Big Data.
Papers must be submitted electronically in PDF format, or any other format supported by the submission site, through https://www.frontiersin.org/research-topics/9706 (click on "Submit your manuscript"). Note that submitting authors should choose one of the specific track organizers as their preferred Editor.
You can find detailed information on the file submission requirements here:
https://www.frontiersin.org/about/author-guidelines#FileRequirements
Submissions will be peer-reviewed by at least three members of the programme committee. The accepted papers will appear in the proceedings published at https://www.frontiersin.org/research-topics/9706
Workshop Organizers
- Ahmet Aker, University of Duisburg-Essen, Germany; University of Sheffield, UK (a.aker@is.inf.uni-due.de)
- Arkaitz Zubiaga, Queen Mary University of London, UK (arkaitz@zubiaga.org)
- Kalina Bontcheva, University of Sheffield, UK (k.bontcheva@sheffield.ac.uk)
- Maria Liakata, University of Warwick and Alan Turing Institute, UK (m.liakata@warwick.ac.uk)
- Rob Procter, University of Warwick and Alan Turing Institute, UK (rob.procter@warwick.ac.uk)
- Symeon Papadopoulos, Centre for Research and Technology Hellas, Greece (papadop@iti.gr)
Programme Committee (Tentative)
- Nikolas Aletras, University of Sheffield, UK
- Emilio Ferrara, University of Southern California, USA
- Bahareh Heravi, University College Dublin, Ireland
- Petya Osenova, Ontotext, Bulgaria
- Damiano Spina, RMIT University, Australia
- Peter Tolmie, Universität Siegen, Germany
- Marcos Zampieri, University of Wolverhampton, UK
- Milad Mirbabaie, University of Duisburg-Essen, Germany
- Tobias Hecking, University of Duisburg-Essen, Germany
- Kareem Darwish, QCRI, Qatar
- Hassan Sajjad, QCRI, Qatar
- Sumithra Velupillai, King's College London, UK
Invited Speaker(s)
To be announced
Sponsors
This workshop is supported by the European Union under grant agreement No. 654024, SoBigData.

It is also supported by the EU co-funded Horizon 2020 project that deals with algorithm-supported verification of digital content.
Monday, 17 December 2018
Open Call for SoBigData-funded Transnational Access!
The SoBigData project invites researchers and professionals to apply for Short-Term Scientific Missions (STSMs) to carry forward their own big data projects. The Natural Language Processing (NLP) group at the University of Sheffield is taking part in this initiative and welcomes applications.
Funding is available for STSMs (2 weeks to 2 months) of up to 4500 euros, covering daily subsistence, accommodation and flights. These bursaries are awarded on a competitive basis.
Research areas are varied but include studies involving societal debate, online misinformation and rumour analysis. A key topic is analysis of social media and newspaper articles to understand the state of public debate in terms of what is being discussed, how it is being discussed, who is discussing it, and how this discussion is being influenced. The effects of online disinformation campaigns (especially hyper-partisan content) and the use of bot accounts to perpetrate this disinformation are also of particular interest.
Applications are welcomed for visits between 1 November 2018 and 31 July 2019!
For specific details, eligibility criteria, and to apply, click here!
Tuesday, 11 September 2018
Visualisations of Political Hate Speech on Twitter
Recently there's been some media interest in our work on abuse toward politicians. We performed an analysis of abusive replies on Twitter sent to MPs and candidates in the months leading up to the 2015 and 2017 UK elections, disaggregated by gender, political party, year, and geographical area, amongst other things. We've posted about this previously, and there's also a more technical publication here. In this post, we wanted to highlight our interactive visualizations of the data, which were created by Mark Greenwood. The thumbnails below give a flavour of them, but click through to access the interactive versions.
Abusive Replies
Sunburst diagrams showing the raw number of abusive replies sent to MPs before the 2015 and 2017 elections. Rather than showing all candidates, these only show the MPs who were elected (i.e. the successful candidates). These nicely show the proportion of abusive replies sent to each party/gender combination, but don't give any feeling, per MP, of the proportion of replies which were abusive. Interactive version here!
Increase in Abuse
An overlapping bar chart showing how the percentage of abusive replies received per party/gender combination has increased between 2015 and 2017. For each party/gender combination two bars are drawn. The height of the bar in the party colour represents the percentage of replies which were abusive in 2017. The height of the grey bar (drawn behind it) is the percentage of replies which were abusive in 2015, and its width shows the change in the volume of abusive replies: the 2015 raw abusive-reply count is divided by the 2017 count to give a percentage, which is then used to scale the bar's width. So height shows the change in proportion, while width shows the increase in volume. There is also a simple version of this graph which only shows the change in proportion (i.e. the widths of the two bars are the same). Original version here.
Geographical Distribution of Abuse
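The bar geometry described above can be made concrete with a small calculation. The reply counts below are invented for illustration, not taken from the dataset:

```python
def bar_geometry(replies_2015, abusive_2015, replies_2017, abusive_2017):
    """Heights are the percentage of replies that were abusive in each year;
    the grey 2015 bar's width is the 2015 raw abusive-reply count expressed
    as a percentage of the 2017 count, used to scale that bar's width."""
    height_2017 = 100.0 * abusive_2017 / replies_2017
    height_2015 = 100.0 * abusive_2015 / replies_2015
    width_2015 = 100.0 * abusive_2015 / abusive_2017
    return height_2017, height_2015, width_2015

# Invented example: 20 abusive of 1,000 replies in 2015; 160 of 4,000 in 2017.
h17, h15, w15 = bar_geometry(1000, 20, 4000, 160)
```

In this example the coloured 2017 bar would be 4% tall, the grey 2015 bar 2% tall and only 12.5% as wide, conveying an eight-fold growth in the raw volume of abuse alongside a doubling of its proportion.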
A map showing the geographical distribution of abusive replies. The map of the UK is divided into the NUTS 1 regions, and each region is coloured based on the percentage of abusive replies sent to MPs who represent that region. Data from both 2015 and 2017 can be displayed to see how the distribution of abuse has changed. Interactive version here!
Wednesday, 5 September 2018
Students use GATE and Twitter to drive Lego robots—again!
At the university's Headstart Summer School in July 2018, 42 secondary school students (age 16 and 17) from all over the UK (see below for maps) were taught to write Java programs to control Lego robots, using input from the robots (such as the sensor for detecting coloured marks on the floor) as well as operating the motors to move and turn. The Department of Computer Science provided a Java library for driving the robots and taught the students to use it.
After they had successfully operated the robots, we ran a practical session on 10 and 11 July on "Controlling Robots with Tweets". We presented a quick introduction to natural language processing (using computer programs to analyse human languages, such as English) and provided them with a bundle of software containing a version of the GATE Cloud Twitter Collector modified to run a special GATE application with a custom plugin to use the Java robot library to control the robots.
The bundle came with a simple "gazetteer" containing two lists of keywords:
| left | turn |
| --- | --- |
| left | turn |
| port | take |
|      | make |
|      | move |
and a basic JAPE grammar (set of rules) to make use of it. JAPE is a specialized programming language used in GATE to match regular expressions over annotations in documents, such as the "Lookup" annotations created whenever the gazetteer finds a matching keyword in a document. (The annotations are similar to XML tags, except that GATE applications can create them as well as read them and they can overlap each other without restrictions. Technically they form an annotation graph.)
The sample rule we provided would match any keyword from the "turn" list followed by any keyword from the "left" list (with optional other words in between, so that "turn to port", "take a left", "turn left" all work the same way) and then run the code to turn the robot's right motor (making it turn left in place).
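The real matching is done by the JAPE rule over the gazetteer's Lookup annotations, but its logic can be sketched in plain Python. The word lists and the gap allowance below are illustrative stand-ins for the gazetteer and rule we actually provided:

```python
# Stand-ins for the two gazetteer keyword lists.
TURN_WORDS = {"turn", "take", "make", "move"}
LEFT_WORDS = {"left", "port"}

def is_turn_left(tweet, max_gap=2):
    """True if a 'turn' keyword is followed by a 'left' keyword with at
    most max_gap other tokens in between, mimicking the JAPE rule so that
    'turn to port', 'take a left' and 'turn left' all match."""
    tokens = tweet.lower().split()
    for i, tok in enumerate(tokens):
        if tok in TURN_WORDS:
            window = tokens[i + 1 : i + 2 + max_gap]
            if any(t in LEFT_WORDS for t in window):
                return True
    return False
```

On a match, the real plugin would then call the Java robot library to run the right motor and turn the robot left in place.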
We showed them how to configure the Twitter Collector, follow their own accounts, and then run the collector with the sample GATE application. Getting the system set up and working took a bit of work, but once the first few groups got their robot to move in response to a tweet, everyone cheered and quickly became more interested. They then worked on extending the word lists and JAPE rules to cover a wider range of tweeted commands.
Some of the students had also developed interesting Java code the previous day, which they wanted to incorporate into the Twitter-controlled system. We helped these students add their code to their own copies of the GATE plugin and re-load it so the JAPE rules could call their procedures.
We first ran this project in the Headstart course in July 2017; we made improvements for this year and it was a success again, so we plan to include it in Headstart 2019 too.
The following maps show where all the students and the female students came from.
This work is supported by the European Union's Horizon 2020 project SoBigData (grant agreement no. 654024). Thanks to Genevieve Gorrell for the diagram illustrating how the system works.
Labels:
GATE,
Lego robot,
secondary school,
SoBigData,
Social Media,
text mining