Posts about our text and social media analysis work and latest news on GATE (http://gate.ac.uk) - our open source text and social media analysis platform. Also posts about the PHEME project (http://pheme.eu) and our work on automatic detection of rumours in social media. Lately also general musings about fake news, misinformation, and online propaganda.
Tuesday, 11 September 2018
Visualisations of Political Hate Speech on Twitter
Recently there's been some media interest in our work on abuse toward politicians. We performed an analysis of abusive replies on Twitter sent to MPs and candidates in the months leading up to the 2015 and 2017 UK elections, disaggregated by gender, political party, year, and geographical area, amongst other things. We've posted about this previously, and there's also a more technical publication here. In this post, we wanted to highlight our interactive visualizations of the data, which were created by Mark Greenwood. The thumbnails below give a flavour of them, but click through to access the interactive versions.
Thursday, 6 September 2018
How difficult is it to understand my web pages? Using GATE to compute a complexity score for Web text.
The Web Science Summer School, which took place from 30 July - 4 August at the L3S Research Centre in Hannover, Germany, gave students a chance to learn about a number of tools and techniques related to web science. As part of this, team member Diana Maynard gave a keynote talk about applying text mining techniques to real-world applications such as sentiment and hate speech detection, and political social media analysis, followed by a 90 minute practical GATE tutorial where the students learnt to use ANNIE, TwitIE and sentiment analysis tools.
The keynotes and tutorials throughout the week were complemented with group work, where the students were tasked with the question: “Can more meaningful indicators for text complexity be extracted from web pages?”. Here follows the account of one student team who, in the space of only 4 hours, managed to use GATE to complete the task – an extremely creditable performance given their very brief exposure to GATE.
After some discussion, our team decided to focus on a very practical problem: the readability metrics commonly used to assess the difficulty of a text do not account for the target audience or the narrative context. We believed a simple approach employing GATE could offer greater insight into identifying the relevant features associated with text complexity.
Everyone had an intuitive understanding of text complexity; it was when trying to fit these ideas into an objective framework that issues arose. Definitions of complexity, understandability, comprehensibility, and readability were mixed and matched when approaching the problem.
In our team's view, the complexity of a document depends not only on the structure of its sentences but also on the context of its narrative and the ease with which the target audience can understand it.
In our model, the complexity score of a text is linked to the context of the text’s narrative. This means texts about certain narrative contexts (topics) are inherently harder to understand than other texts. How hard it is to understand a particular text is also related to the capabilities of the reader. Thus, texts on specific narrative contexts can be characterized to create a score of how hard to understand they will be for certain audiences.
To do this, we proposed the following two-stage process (a rough sketch of the scoring step follows the list):
- Create an entity lexicon for content complexity:
  - Collect a set of texts from different narrative contexts that the audience may be expected to read, e.g. celebrity news, political news, sports news, medical information leaflets, coursebook fragments.
  - Identify the relevant entities in those texts, i.e. persons, locations, organizations, percentages, dates, and technical terms.
  - Assess the complexity of each text by crowdsourcing, e.g. have a sample of UK young adults rate the difficulty of the texts or complete a procedure such as CLOZE.
  - Assign a complexity value to each entity in the lexicon based on the complexity values of the texts it appeared in and its relevance to those texts.
- Assess the complexity of new texts:
  - Identify the relevant entities in the text.
  - Use the entity complexity lexicon to compute an estimated complexity value for the new text.
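To make the scoring step concrete, here is a minimal Java sketch (hypothetical class and method names, not the code we actually ran) that assigns each entity the mean crowd score of the training texts it appears in, and then scores a new text as the mean complexity of its known entities:

```java
import java.util.*;

/** Minimal sketch: an entity complexity lexicon built from scored training texts. */
public class ComplexityLexicon {

    private final Map<String, Double> totals = new HashMap<>();
    private final Map<String, Integer> counts = new HashMap<>();

    /** Add one training document: its extracted entities and its human-assigned score. */
    public void addDocument(Collection<String> entities, double complexityScore) {
        for (String entity : entities) {
            totals.merge(entity, complexityScore, Double::sum);
            counts.merge(entity, 1, Integer::sum);
        }
    }

    /** Complexity of an entity = mean score of the training documents it appeared in. */
    public double entityComplexity(String entity) {
        Integer n = counts.get(entity);
        return n == null ? Double.NaN : totals.get(entity) / n;
    }

    /** Estimate a new document's complexity as the mean complexity of its known entities. */
    public double scoreDocument(Collection<String> entities) {
        double sum = 0;
        int known = 0;
        for (String entity : entities) {
            double c = entityComplexity(entity);
            if (!Double.isNaN(c)) {
                sum += c;
                known++;
            }
        }
        return known == 0 ? Double.NaN : sum / known;
    }

    public static void main(String[] args) {
        ComplexityLexicon lexicon = new ComplexityLexicon();
        // Training texts with crowd-assigned scores on a 1-10 scale (toy values).
        lexicon.addDocument(List.of("Taylor Swift", "London"), 2.0);
        lexicon.addDocument(List.of("confidence interval", "p-value", "London"), 8.0);
        // Score a new text from its extracted entities.
        System.out.println(lexicon.scoreDocument(List.of("London", "p-value")));
    }
}
```

A weighted version could multiply each document's score by the entity's relevance to that document (e.g. its TermRaider score) rather than taking a plain mean.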
Running the TermRaider plugin to identify the entities in the texts.
The corpus was composed of 9 Wikipedia pages and 2 academic articles, and an independent scoring (1-10 scale) of the pages' complexity was given by 4 team reviewers. Then, the entities were identified for each document by running the ANNIE and TermRaider plugins in the GATE GUI.
Employing ANNIC to search for entities linked to organizations, locations, persons, dates or percentages within the texts.
Result extract exported in XML format from the TermRaider plugin.
Once duplicates had been accounted for, our lexicon was composed of 906 weighted pairs.
Comparison of the scores assigned by the lexicon (1-10) and the complexity score given to us as a base (0-1).
In general, the entities in a text are associated with the text's narrative context, e.g. celebrity news will include celebrity names and places, while scientific literature will reference percentages, ratios and error estimates. In our model, annotating the complexity of a sample of pages from several narrative contexts can be used to determine a complexity value for each relevant entity, based on the complexity scores of the pages in which it appears; these values can then be used to estimate complexity scores for new pages.
Given the time constraints we had, many of the activities were based on naïve algorithms and limited by our resources. We have some further ideas on how this approach could be explored. First, we believe that any complexity score should take into account the audience's capability. In this case, the researcher should appreciate that determining the characteristics of the population they wish to explore is just as important as determining the narrative context and structure of the text. Asking teenagers to read mathematical formulae will yield different complexity scores than asking GPs or older adults to do so.
An objective way of scoring the complexity of a text is to use a comprehension testing procedure such as CLOZE, where every fifth word is replaced with a blank that respondents are asked to fill in. Such a procedure can be used on crowdsourcing platforms like Mechanical Turk to create complexity lexicons for specific audiences: sample texts of diverse narrative contexts (topics) would be selected to be assessed by the crowd, which would tell us how complex particular groups of people find certain texts (e.g. UK teenagers find maths texts really difficult and tweets easy, but the complexity scores may reverse for older Mexican maths professors when given the same texts).
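To illustrate the CLOZE idea, here is a minimal Java sketch (a hypothetical helper, not part of our pipeline) that blanks out every fifth word of a passage so respondents can be asked to fill in the gaps:

```java
import java.util.StringJoiner;

/** Minimal sketch: turn a passage into a CLOZE test by blanking every fifth word. */
public class ClozeGenerator {

    public static String toCloze(String text, int gapInterval) {
        String[] words = text.split("\\s+");
        StringJoiner out = new StringJoiner(" ");
        for (int i = 0; i < words.length; i++) {
            // Words are counted from 1, so the 5th, 10th, 15th, ... word is blanked.
            out.add((i + 1) % gapInterval == 0 ? "_____" : words[i]);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String passage = "The corpus was composed of nine Wikipedia pages and two academic articles";
        System.out.println(toCloze(passage, 5));
    }
}
```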
Another aspect that could easily be improved is the use of centrality metrics such as TextRank to determine which named entities are actually relevant to a text, based on their frequency and position within the narrative. Finally, a ranking algorithm such as PageRank could be adapted to obtain the complexity scores of the entity lexicon in a way that permits relevant entities to be identified using clustering algorithms.
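As a very rough illustration of the first idea, a TextRank-style ranking could be run over an entity co-occurrence graph. The sketch below is purely illustrative: the graph construction and entity names are invented, and it simply applies a basic PageRank iteration to rank entities by centrality.

```java
import java.util.*;

/** Minimal sketch: PageRank over an entity co-occurrence graph to rank entity relevance. */
public class EntityRank {

    public static Map<String, Double> pageRank(Map<String, List<String>> links,
                                               double damping, int iterations) {
        int n = links.size();
        Map<String, Double> rank = new HashMap<>();
        for (String node : links.keySet()) rank.put(node, 1.0 / n);

        for (int it = 0; it < iterations; it++) {
            Map<String, Double> next = new HashMap<>();
            for (String node : links.keySet()) next.put(node, (1 - damping) / n);
            for (Map.Entry<String, List<String>> e : links.entrySet()) {
                List<String> out = e.getValue();
                if (out.isEmpty()) continue;  // dangling nodes keep only the base rank here
                double share = damping * rank.get(e.getKey()) / out.size();
                for (String target : out) next.merge(target, share, Double::sum);
            }
            rank = next;
        }
        return rank;
    }

    public static void main(String[] args) {
        // Toy graph: entities that co-occur in the same sentence link to each other.
        Map<String, List<String>> graph = Map.of(
            "London", List.of("p-value", "Taylor Swift"),
            "p-value", List.of("London"),
            "Taylor Swift", List.of("London"));
        pageRank(graph, 0.85, 20).forEach((e, r) -> System.out.printf("%s: %.3f%n", e, r));
    }
}
```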
Team:
Damianos Melidis, L3S Hannover, Germany
Latifah Alshammari, University of Bath, UK
Fernando Santos Sanchez, University of Southampton, UK
Ahmed Al-Ghez, University of Goettingen, Germany
Wednesday, 5 September 2018
Students use GATE and Twitter to drive Lego robots—again!
At the university's Headstart Summer School in July 2018, 42 secondary school students (age 16 and 17) from all over the UK (see below for maps) were taught to write Java programs to control Lego robots, using input from the robots (such as the sensor for detecting coloured marks on the floor) as well as operating the motors to move and turn. The Department of Computer Science provided a Java library for driving the robots and taught the students to use it.
After they had successfully operated the robots, we ran a practical session on 10 and 11 July on "Controlling Robots with Tweets". We presented a quick introduction to natural language processing (using computer programs to analyse human languages, such as English) and provided them with a bundle of software containing a version of the GATE Cloud Twitter Collector modified to run a special GATE application with a custom plugin to use the Java robot library to control the robots.
The bundle came with a simple "gazetteer" containing two lists of keywords:
| left | turn |
| --- | --- |
| left | turn |
| port | take |
|      | make |
|      | move |
and a basic JAPE grammar (set of rules) to make use of it. JAPE is a specialized programming language used in GATE to match regular expressions over annotations in documents, such as the "Lookup" annotations created whenever the gazetteer finds a matching keyword in a document. (The annotations are similar to XML tags, except that GATE applications can create them as well as read them and they can overlap each other without restrictions. Technically they form an annotation graph.)
The sample rule we provided would match any keyword from the "turn" list followed by any keyword from the "left" list (with optional other words in between, so that "turn to port", "take a left", "turn left" all work the same way) and then run the code to turn the robot's right motor (making it turn left in place).
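Reconstructed as a sketch, such a rule might look like the following JAPE grammar; the gazetteer majorType values and the Robot.turnLeft() call are illustrative stand-ins rather than the exact code in the bundle:

```
Phase: RobotCommands
Input: Lookup Token
Options: control = appelt

Rule: TurnLeft
(
  {Lookup.majorType == "turn"}   // e.g. "turn", "take", "make", "move"
  ({Token})[0,3]                 // optional words in between: "to", "a", ...
  {Lookup.majorType == "left"}   // e.g. "left", "port"
):cmd
-->
:cmd {
  // The right-hand side is Java: call the robot library (hypothetical method name)
  // to run the right motor so the robot turns left in place.
  Robot.turnLeft();
}
```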
We showed them how to configure the Twitter Collector, follow their own accounts, and then run the collector with the sample GATE application. Getting the system set up and working took a bit of work, but once the first few groups got their robot to move in response to a tweet, everyone cheered and quickly became more interested. They then worked on extending the word lists and JAPE rules to cover a wider range of tweeted commands.
Some of the students had also developed interesting Java code the previous day, which they wanted to incorporate into the Twitter-controlled system. We helped these students add their code to their own copies of the GATE plugin and re-load it so the JAPE rules could call their procedures.
We first ran this project in the Headstart course in July 2017; we made improvements for this year and it was a success again, so we plan to include it in Headstart 2019 too.
The following maps show where all the students and the female students came from.
This work is supported by the European Union's Horizon 2020 project SoBigData (grant agreement no. 654024). Thanks to Genevieve Gorrell for the diagram illustrating how the system works.