Thursday, 22 November 2018

Adapted TextRank for Term Extraction: A Generic Method of Improving Automatic Term Extraction Algorithms

Today we're looking at the work done within the group which was reported at SEMANTiCS 2018: "Adapted TextRank for Term Extraction: A Generic Method of Improving Automatic Term Extraction Algorithms", authored by Ziqi Zhang, Johann Petrak and Diana Maynard, all of the University of Sheffield.

Zhang, Z., Petrak, J. & Maynard, D. (2018). Adapted TextRank for Term Extraction: A Generic Method of Improving Automatic Term Extraction Algorithms. In SEMANTiCS 2018 – 14th International Conference on Semantic Systems.

Automatic Term Extraction (ATE) is a fundamental technique in computational linguistics: it recognises the terms used within a body of text. Collecting these terms is a key step in computationally understanding the knowledge content of a text, and the extracted terms can then feed higher-level analysis. There are many distinct ATE methods, each of which tends to perform better in some domains than in others. In other words, no single method produces consistently good results everywhere, so an appropriate method must be chosen for the domain being targeted. Even with a well-targeted method, there is often significant room for improvement.

This paper presents a novel ATE method that addresses two major limitations: first, that no single ATE method performs consistently well across all domains; and second, that the majority of ATE methods are unsupervised. The authors address these by investigating a generic method that can improve the accuracy of existing ATE methods, and by arguing that existing lexical resources can be used to support ATE.

Based on this, the authors introduce AdaText, a generic method that revises the TextRank algorithm. Given a target text, AdaText:
  1. Selects a subset of words based on their semantic relatedness to a set of seed words or phrases. These seeds are relevant to the domain, but not necessarily representative of the terms within the target text.
  2. Applies an adapted TextRank algorithm to build a graph over these words, computing a text-level TextRank score for each selected word.
  3. Uses these scores to revise the score of each term candidate previously computed by an ATE method.
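The three steps above can be sketched in a minimal, self-contained way. Everything here is illustrative: the toy embeddings, the window-based co-occurrence graph, and the multiplicative revision rule in `revise` are assumptions for the sketch, not the paper's exact similarity measure, graph construction, or combination formula.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def adatext_scores(words, embeddings, seeds, sim_threshold=0.5,
                   window=2, damping=0.85, iterations=30):
    """Steps 1-2: keep words semantically close to the seed lexicon,
    then run TextRank (PageRank over a co-occurrence graph) on them."""
    # Step 1: select words whose similarity to any seed passes the threshold
    selected = {
        w for w in set(words) if w in embeddings and any(
            cosine(embeddings[w], embeddings[s]) >= sim_threshold
            for s in seeds if s in embeddings)
    }
    # Build an undirected co-occurrence graph over the selected words
    neighbours = {w: set() for w in selected}
    for i, w in enumerate(words):
        if w not in selected:
            continue
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            v = words[j]
            if v in selected and v != w:
                neighbours[w].add(v)
                neighbours[v].add(w)
    # Step 2: iterate the TextRank/PageRank update
    score = {w: 1.0 / max(len(selected), 1) for w in selected}
    for _ in range(iterations):
        score = {
            w: (1 - damping) + damping * sum(
                score[v] / len(neighbours[v]) for v in neighbours[w])
            for w in selected
        }
    return score

def revise(candidate_scores, word_scores):
    """Step 3: one plausible revision rule -- boost an ATE method's base
    score by the mean TextRank score of the candidate's words."""
    revised = {}
    for term, base in candidate_scores.items():
        parts = [word_scores.get(w, 0.0) for w in term.split()]
        revised[term] = base * (1.0 + sum(parts) / len(parts))
    return revised
```

In use, `adatext_scores` would be fed the tokenised target text plus embeddings and a seed lexicon, and `revise` would take the candidate scores produced by any base ATE method, which is what makes the approach generic.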

This technique was trialled using a variety of parameters (such as the semantic similarity threshold used to select words in step one) over two distinct datasets (GENIA and ACLv2, comprising Medline abstracts and ACL abstracts respectively). The authors also used a wide variety of state-of-the-art ATE methods, including modified TFIDF, CValue, Basic, RAKE, Weirdness, LinkProbability, χ², GlossEx and PositiveUnlabeled.

Figure: A sample of performances on different datasets and with different ATE techniques. The base performance of the ATE method is represented by the black horizontal line. The horizontal axis represents the semantic similarity threshold used in step 1. The vertical axis shows average P@K for all five K's considered.
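The P@K on the vertical axis is simply precision computed over the top K ranked term candidates. A quick sketch, noting that the cut-off values below are illustrative assumptions, not necessarily the five K's used in the paper:

```python
def precision_at_k(ranked_terms, gold_terms, k):
    """Fraction of the top-K ranked term candidates found in the gold standard."""
    return sum(1 for t in ranked_terms[:k] if t in gold_terms) / k

def avg_precision_at_ks(ranked_terms, gold_terms, ks=(50, 100, 500, 1000, 2000)):
    """Average P@K over several cut-offs (illustrative K values)."""
    return sum(precision_at_k(ranked_terms, gold_terms, k) for k in ks) / len(ks)
```

Averaging over several cut-offs gives a single summary number per similarity threshold, which is what makes the curves in the figure comparable against the base method's flat line.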

The authors demonstrate that this generic combination approach can consistently improve the performance of the base ATE method, by as much as 25 points, a significant increase. Despite this substantial improvement, some areas remain to be explored. In future work, the authors aim to optimise the selection of words from the TextRank graph, to expand TextRank to a graph of both words and phrases, and to explore how the size and source of the seed lexicon affects the performance of AdaText.
