Thursday, 22 November 2018

Adapted TextRank for Term Extraction: A Generic Method of Improving Automatic Term Extraction Algorithms

Zhang, Z., Petrak, J. & Maynard, D. Adapted TextRank for Term Extraction: A Generic Method of Improving Automatic Term Extraction Algorithms. in SEMANTiCS 2018 – 14th International Conference on Semantic Systems 00, 0-000 (2018).

This work has been carried out in the context of the EU KNOWMAK project, where we're developing tools for multi-topic classification of text against an ontology, in order to attempt to map the state of European research output in key technologies.

Automatic Term Extraction (ATE) is a fundamental technique used in computational linguistics for recognising terms in text. Processing the terms collected from a text is a key step in understanding its content. There are many different ATE methods, but each tends to work well only in one specific domain. In other words, there is no universal method which produces consistently good results, and so we have to choose an appropriate method for the domain being targeted.

In this work, we have developed a novel method for ATE which addresses two major limitations: the fact that no single ATE method consistently performs well across all domains, and the fact that the majority of ATE methods are unsupervised and so do not benefit from existing lexical resources. Our generic method, AdaText, adapts the TextRank algorithm to improve the accuracy of existing ATE methods, using existing lexical resources to support them.
After being given a target text, AdaText:
  1. Selects a subset of words based on their semantic relatedness to a set of seed words or phrases relevant to the domain, but not necessarily representative of the terms within the target text.
  2. Applies an adapted TextRank algorithm to create a graph for these words, and computes a text-level TextRank score for each selected word.
  3. Uses these scores to revise the score of each term candidate previously computed by an ATE method.
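
As a rough illustration only (this is not the implementation from the paper), the three steps could be sketched in Python as follows. The toy word vectors, the use of cosine similarity, the co-occurrence window, the damping factor and, in particular, the rule for combining scores in step 3 are all assumptions made for this sketch; the function names are hypothetical.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_words(tokens, vectors, seeds, threshold):
    """Step 1: keep words whose best similarity to any seed meets the threshold."""
    selected = set()
    for word in set(tokens):
        if word not in vectors:
            continue
        best = max(
            (cosine(vectors[word], vectors[s]) for s in seeds if s in vectors),
            default=0.0,
        )
        if best >= threshold:
            selected.add(word)
    return selected


def textrank(tokens, selected, window=3, damping=0.85, iterations=50):
    """Step 2: PageRank-style scores over a co-occurrence graph of selected words."""
    neighbours = defaultdict(set)
    for i, word in enumerate(tokens):
        if word not in selected:
            continue
        # Link the word to selected words that follow it within the window.
        for other in tokens[i + 1 : i + window]:
            if other in selected and other != word:
                neighbours[word].add(other)
                neighbours[other].add(word)
    nodes = sorted(selected)
    scores = {w: 1.0 / len(nodes) for w in nodes}
    for _ in range(iterations):
        updated = {}
        for w in nodes:
            rank = sum(scores[n] / len(neighbours[n]) for n in neighbours[w])
            updated[w] = (1 - damping) / len(nodes) + damping * rank
        scores = updated
    return scores


def revise_scores(base_scores, word_scores):
    """Step 3 (assumed rule): boost each candidate's base ATE score by the
    mean TextRank score of its component words."""
    revised = {}
    for term, base in base_scores.items():
        words = term.split()
        boost = sum(word_scores.get(w, 0.0) for w in words) / len(words)
        revised[term] = base * (1.0 + boost)
    return revised


# Toy data, invented for illustration.
tokens = "the protein kinase binds the receptor protein in the cell".split()
vectors = {
    "protein": (1.0, 0.1), "kinase": (0.9, 0.2), "receptor": (0.8, 0.3),
    "cell": (0.5, 0.5), "the": (0.0, 1.0), "binds": (0.1, 0.9), "in": (0.0, 1.0),
}
selected = select_words(tokens, vectors, seeds=["protein"], threshold=0.9)
word_scores = textrank(tokens, selected)
revised = revise_scores({"protein kinase": 2.0, "cell": 2.0}, word_scores)
```

In this toy run, "protein kinase" and "cell" start with the same base score, but only "protein kinase" is boosted, because its words were selected as related to the seed and ranked in the graph.
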
This technique was trialled using a variety of parameters (such as the threshold of semantic similarity used to select words, as described in step 1) over two distinct datasets (GENIA and ACLv2, comprising Medline abstracts and abstracts from ACL respectively). We also tested it with a wide variety of state-of-the-art ATE methods, including modified TFIDF, CValue, Basic, RAKE, Weirdness, LinkProbability, X2, GlossEx and PositiveUnlabeled.

The figures show a sample of the performance on different datasets and using different ATE techniques. The base performance of the ATE method is represented by the black horizontal line. The horizontal axis represents the semantic similarity threshold used in step 1. The vertical axis shows the average P@K (precision among the top K ranked term candidates) over all five Ks considered.
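
For readers unfamiliar with the metric, here is a minimal sketch of how an average P@K could be computed. The ranked list, the gold-standard set and the K values below are invented for illustration; the five Ks used in the experiments are not listed in this post.

```python
def precision_at_k(ranked_terms, gold_terms, k):
    """Fraction of the top-k ranked candidates that are true terms."""
    hits = sum(1 for term in ranked_terms[:k] if term in gold_terms)
    return hits / k


def mean_precision_at_ks(ranked_terms, gold_terms, ks):
    """Average of P@K over several cut-offs K."""
    return sum(precision_at_k(ranked_terms, gold_terms, k) for k in ks) / len(ks)


# Hypothetical ranking and gold standard.
ranked = ["protein kinase", "the cell", "receptor", "binding"]
gold = {"protein kinase", "receptor"}
score = mean_precision_at_ks(ranked, gold, ks=(1, 2, 4))
```

Here P@1 is 1.0, P@2 is 0.5 and P@4 is 0.5, so the average is 2/3.
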

This new generic combination approach can consistently improve the performance of the ATE method by 25 points, which is a significant increase. However, there is still room for improvement. In future work, we aim to optimise the selection of words from the TextRank graph, to expand TextRank to a graph of both words and phrases, and to explore how the size and source of the seed lexicon affect the performance of AdaText.
