Monday 29 April 2013

EnviLOD: Lessons Learnt


The EnviLOD project demonstrated how location-based searches, enabled and underpinned by Linked Open Data (LOD) and semantic technologies, can improve the retrieval of information. Although the semantic search tool developed through the EnviLOD project is not yet ‘production-ready’, it does demonstrate the benefits of this newly emerging technology. As such, it will be incorporated into the Envia ‘labs’ page of the Envia website, which is currently under development. Within Envia Labs, users of the regular Envia service will be able to experiment with and comment on tools that might eventually augment or be incorporated into the service, thus allowing the Envia project team to gauge their potential uptake by the user community.

We also worked on the automatic generation of semantically enriched metadata to accompany records within the Envia system. This aims to improve information discovery within the current Envia system by automatically generating keywords for inclusion in article metadata, based on the occurrence of terms from the GEMET, DBpedia, and GeoNames vocabularies. A pipeline for incorporating this into the Envia system in a regular and sustainable manner is already under way.
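As a rough illustration of the idea, the sketch below matches vocabulary terms against document text to propose keyword URIs. It is not the EnviLOD implementation: the term-to-URI mapping is a made-up excerpt, and in practice the labels would be loaded from the GEMET, DBpedia, and GeoNames datasets themselves.

```python
import re

# Hypothetical excerpt of a merged vocabulary; real labels and URIs would
# come from the GEMET, DBpedia, and GeoNames downloads.
VOCABULARY = {
    "flood": "http://www.eionet.europa.eu/gemet/concept/3296",
    "climate change": "http://dbpedia.org/resource/Climate_change",
    "sheffield": "http://sws.geonames.org/2638077/",
}

def generate_keywords(text):
    """Return (term, URI) pairs for vocabulary terms found in the text."""
    lowered = text.lower()
    found = []
    for term, uri in VOCABULARY.items():
        # Whole-word matching avoids partial hits such as 'floodplain'.
        if re.search(r"\b" + re.escape(term) + r"\b", lowered):
            found.append((term, uri))
    return found

print(generate_keywords("Climate change increases flood risk near Sheffield."))
```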

One particularly important lesson learnt from this short-term project is that the availability of large amounts of content, open to text mining and experimentation, needs to be ensured from the very beginning of a project. In EnviLOD there were copyright issues with the majority of environmental science content at the British Library, which limited the experimental system to just over one thousand documents. Given this limited content, users were not always able to judge how comprehensive or accurate the semantic search was, especially when compared against results offered by Google. Since the British Library is now planning to integrate the EnviLOD semantic enrichment tools within the advanced Envia Labs functionality, future work could evaluate the tool on this more comprehensive data, through the Envia system.

Another important lesson learnt from the research activities is that working with Linked Open Data is very challenging, not only in terms of data volumes and computational efficiency, but also in terms of data noise and robustness. In terms of noise, an initial evaluation of the DBpedia-based semantic enrichment pipeline revealed that relevant entity candidates were initially missed, because the ontology classified them only as owl:Thing, whereas we were considering instances of specific sub-classes (e.g. Person, Place); there are over 1 million such unclassified instances in the current DBpedia snapshot. In terms of computational efficiency, we had to introduce memory-based caches and efficient data indexing in order to make the entity linking and disambiguation algorithm fast enough to process data in near real-time. Lastly, deploying the semantic enrichment on a server, e.g. at Sheffield or at the British Library, is far from trivial, since both OWLIM and our algorithms require large amounts of RAM and computational power. Parallelising the computation to more than three threads remains an open challenge, due to the difficulties experienced with parallelising OWLIM. Ontotext are currently working on cloud-based, scalable deployments, so future projects should be able to solve the scalability issue effectively.
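For the curious, the classification gap can be checked with a query along the following lines, which counts DBpedia instances typed only as owl:Thing. The query shape is ours, purely for illustration, and it is heavy enough that the public endpoint may refuse or time out on it.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Count instances whose only rdf:type is owl:Thing (illustrative query).
sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX owl: <http://www.w3.org/2002/07/owl#>
    SELECT (COUNT(DISTINCT ?s) AS ?n) WHERE {
        ?s a owl:Thing .
        FILTER NOT EXISTS {
            ?s a ?cls .
            FILTER (?cls != owl:Thing)
        }
    }
""")
sparql.setReturnFormat(JSON)
print(sparql.query().convert()["results"]["bindings"][0]["n"]["value"])
```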

Lastly, the quantitative evaluation of our DBpedia-based semantic enrichment pipeline was far from trivial. It required us to manually annotate a gold standard corpus of environmental science content (100 documents were annotated with disambiguated named entities). However, releasing these to other researchers has proven practically impossible, due to the copyright and licensing restrictions imposed by the content publishers on the British Library. In a related project, we have now developed a web-based entity annotation interface, based on CrowdFlower. This will enable future projects to create gold standards more easily, based on copyright-free content. In the meantime, during development we made use of the news and similar corpora created for TAC-KBP 2009 and 2010, which we used for algorithm development and testing in EnviLOD, prior to the final quantitative evaluation on the copyrighted BL content. So even though the aims of the project were achieved and a useful running pilot system was created, publishing the results in scientific journals has been hampered by these content issues.

In conclusion, we fully support the findings of the JISC report on text mining that a copyright exemption for text-mining research is necessary, in order to fully unlock the benefits of text mining for scientific research.

Thursday 25 April 2013

EnviLOD Recap: Technical Objectives and Deliverables

The aim of the #EnviLOD project was to demonstrate the value of using Linked Open Data (LOD) vocabularies in the field of environmental science, by pursuing four key objectives. These remained unchanged during the seven-month project, and are as follows:

  1. To engage actively with environmental science researchers and other key stakeholders, in order to derive requirements and evaluate project results.   
  2. To develop tools for efficient LOD-based semantic enrichment of unstructured content.
  3. To create and evaluate intuitive user interface methods that hide the complexities of the SPARQL semantic search language.
  4. To use the British Library’s Envia tool as a case study in using LOD vocabularies for enhanced information discovery and management.

Objective 1: Stakeholder Engagement, Requirements Capture, and Evaluation


In order to demonstrate the value of shared LOD vocabularies to different applications, information types and audiences, we focused on use cases related to research on flooding and climate change in the UK. We captured the requirements of the relevant audiences and groups via a web-based questionnaire, which gathered some actual search queries alongside user input on the kinds of searches they require. We engaged researchers, practitioners and information managers, in order to assess how LOD vocabularies might support their needs. This also motivated our choice of different information types, including full-text content, metadata, and LOD datasets. The main user requirement to be fulfilled was support for location-based searches, e.g. flooding near Sheffield, or flooding on rivers flowing through Gloucestershire. In addition, users emphasised their need for an intuitive semantic search UI.
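To give a flavour of what such a search involves behind the scenes, the sketch below shows one way a ‘flooding near Sheffield’ query could be expressed in SPARQL over an annotation store. The envilod: schema is entirely hypothetical, as is the bounding-box reading of ‘near’; EnviLOD’s actual query generation is not reproduced here.

```python
# Hypothetical SPARQL for 'flooding near Sheffield' over an annotation store.
# The envilod: vocabulary is invented for illustration; geo: is the W3C
# WGS84 vocabulary. 'Near' is approximated by a box around the city centre.
QUERY = """
PREFIX envilod: <http://example.org/envilod#>
PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>

SELECT DISTINCT ?doc WHERE {
    ?doc envilod:mentionsKeyword "flooding" ;
         envilod:mentionsLocation ?loc .
    ?loc geo:lat ?lat ; geo:long ?long .
    FILTER (?lat  > 53.28 && ?lat  < 53.48 &&
            ?long > -1.57 && ?long < -1.37)
}
"""
```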

A new British Library information discovery tool for environmental science, Envia, was used as a starting point to test the use of semantics towards enhancing information discovery and management. Envia is particularly suited as a test case for these purposes, as it features a mixed corpus of content, including datasets, journal articles, and grey literature, with accompanying metadata records. Envia also enabled us to examine the value of semantic enrichment for information managers. Environmental consultants at HR Wallingford collaborated as domain experts, providing feedback on how the semantic work undertaken in EnviLOD supported their work as environmental science practitioners and innovators.

During the project, stakeholder engagement was ongoing through the project website, blog, Twitter presence, published reports, and joint meetings. In particular, user input and feedback was sought during the design of the semantic search user interface, in order to ensure that it meets user needs. The interface was implemented in three iterations:
  1. The British Library team participated in the design meeting and provided feedback on the first implementation. This helped the Sheffield team to adjust and simplify the interface. 
  2. Following this, the semantic search UI was demonstrated during a lunchtime workshop and EnviLOD presentation. At the end, environmental science researchers were given the opportunity to try the interface and provide us with structured feedback (via a written questionnaire) and a user-led discussion. This early evaluation helped us further refine the user interface design and remove confusing elements. 
  3. Lastly, much wider stakeholder feedback was solicited during a user outreach and evaluation workshop, organised at the British Library. There were 25 participants at the event, which enabled us to gather very detailed feedback and suggestions for minor improvements. Overall, the majority of users stated that semantic search would be very useful for information discovery and that they would use the system if it were deployed in production.

Objective 2: LOD-based Semantic Enrichment


Semantic annotation is the process of tying together semantic models, such as ontologies, and documents such as scientific articles. It may be characterised as the dynamic semantic enrichment of unstructured and semi-structured documents, linking these to relevant domain ontologies and knowledge bases.

The focus of our work was on implementing a LOD-based semantic enrichment algorithm and applying it to metadata and full-text documents from Envia. A trial web service is now available.
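Conceptually, each annotation produced by such a pipeline ties a text span to a LOD URI. The sketch below shows one plausible representation; the field names are illustrative and are not the actual output format of the trial web service.

```python
from dataclasses import dataclass

@dataclass
class SemanticAnnotation:
    start: int    # character offset of the mention in the document
    end: int      # character offset just past the mention
    surface: str  # the text as it appears, e.g. "Sheffield"
    uri: str      # the disambiguated LOD identifier

ann = SemanticAnnotation(
    start=42, end=51, surface="Sheffield",
    uri="http://dbpedia.org/resource/Sheffield",
)
```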

As part of this work, we evaluated the coverage and accuracy of relevant general-purpose LOD datasets (namely GeoNames and DBpedia) when applied to data and content from our domain. The results showed that GeoNames is a rich source of knowledge about locations (e.g. NUTS administrative regions, latitude, longitude, parent country); however, it is not suitable on its own as a primary source for knowledge enrichment, due to its very fine level of detail and the resulting location ambiguity (e.g. it contains the names of individual farms). DBpedia, on the other hand, is much more balanced, also including knowledge about people, organisations, products, and other entities. DBpedia was therefore chosen as the primary LOD resource for semantic enrichment. Specifically for locations, we identified the equivalent GeoNames entry and enriched the text content with additional metadata from there.
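The location-enrichment step can be pictured as follows: given a disambiguated DBpedia location, follow its owl:sameAs links into GeoNames and harvest additional metadata there. The sketch below shows one way of finding the GeoNames equivalent via the public DBpedia endpoint; it is an illustration, not the project code.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

def geonames_uri_for(dbpedia_uri):
    """Follow owl:sameAs from a DBpedia resource to its GeoNames entry."""
    sparql = SPARQLWrapper("https://dbpedia.org/sparql")
    sparql.setQuery(f"""
        PREFIX owl: <http://www.w3.org/2002/07/owl#>
        SELECT ?same WHERE {{
            <{dbpedia_uri}> owl:sameAs ?same .
            FILTER (STRSTARTS(STR(?same), "http://sws.geonames.org/"))
        }}
    """)
    sparql.setReturnFormat(JSON)
    rows = sparql.query().convert()["results"]["bindings"]
    return rows[0]["same"]["value"] if rows else None

print(geonames_uri_for("http://dbpedia.org/resource/Sheffield"))
```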

Tools for LOD-based geo-location disambiguation, and for date and measurement recognition and normalisation, were implemented and tested. In more detail, the first step is to identify all candidate instance URIs from DBpedia that are mentioned in a given document. This phase is designed to maximise recall, in order to ensure that more relevant documents can be returned at search time. The second step is entity disambiguation, which is carried out on the basis of string, semantic, and contextual similarity, coupled with a corpus frequency metric. The algorithm was developed on a general-purpose, shared news-like corpus and evaluated on environmental science papers and metadata records from the British Library.
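In outline, the two steps can be pictured as in the toy sketch below. The weights and the similarity functions are placeholders standing in for the actual string, semantic, and contextual measures, which are not reproduced here.

```python
def candidate_uris(mention, label_index):
    """Step 1 (high recall): every URI whose label matches the mention."""
    return label_index.get(mention.lower(), [])

def disambiguate(mention, context, candidates,
                 string_sim, semantic_sim, context_sim, corpus_freq):
    """Step 2: score each candidate and keep the best-scoring URI."""
    def score(uri):
        # Illustrative weighting; the real measures and weights differ.
        return (0.3 * string_sim(mention, uri) +
                0.3 * semantic_sim(uri, context) +
                0.2 * context_sim(uri, context) +
                0.2 * corpus_freq(uri))  # prior from corpus statistics
    return max(candidates, key=score, default=None)
```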

Objective 3: User Interface for Semantic Search

The semantic search interface is shown below and can also be tried online:

There is a keyword search field, complemented with optional semantic search constraints expressed through a set of inter-dependent drop-down lists. In the first list, Location allows users to search for mentions of locations; Date, for mentions of dates; Document, for specifying constraints on document-level attributes; and so on.

More than one semantic constraint can be added, through the plus button, which inserts a new row underneath the current row of constraints.

If Location is chosen as a semantic constraint, then, if required, further restrictions can be specified by choosing an appropriate property constraint. Population allows users to place restrictions on the population of the locations being searched for. Similar numeric constraints can be imposed on the latitude, longitude, and population density attribute values.

Restrictions can also be imposed on a location’s name or its country code, i.e. the country to which it belongs. When “is” is chosen, the location name must match exactly as specified (e.g. Oxford), whereas “contains” provides sub-string matching (e.g. matching Oxfordshire as well). In the example below, the user is searching for documents mentioning locations whose name contains Oxford. When the search is executed, this returns documents mentioning Oxford explicitly, but also documents mentioning Oxfordshire and other locations in Oxfordshire (e.g. Wytham Woods, Banbury).
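Behind the scenes, constraints like these are compiled into SPARQL, so that users never have to see the query language itself. The sketch below shows one plausible translation of the name constraint; the envilod: schema is invented for illustration, while CONTAINS and LCASE are standard SPARQL 1.1 functions.

```python
def location_name_filter(operator, value):
    """Translate the UI's 'is'/'contains' choice into a SPARQL FILTER."""
    if operator == "is":
        return f'FILTER (LCASE(STR(?name)) = "{value.lower()}")'
    if operator == "contains":
        return f'FILTER (CONTAINS(LCASE(STR(?name)), "{value.lower()}"))'
    raise ValueError(f"unknown operator: {operator}")

# Hypothetical annotation schema (envilod:), shown for illustration only.
query = f"""
PREFIX envilod: <http://example.org/envilod#>
SELECT DISTINCT ?doc WHERE {{
    ?doc envilod:mentionsLocation ?loc .
    ?loc envilod:name ?name .
    {location_name_filter('contains', 'Oxford')}
}}
"""
print(query)
```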

Objective 4: Use of Envia as a Testbed

Envia provided us with readily available content and a testbed for experimenting with the semantic enrichment methods. Sparsely populated metadata records were enriched with environmental science terms, location and organisation entities, and additional metadata imported from GeoNames and DBpedia.

The British Library will launch a public beta of Envia in May 2013, in which EnviLOD-enriched content will be included as an experimental option, complementing the traditional full-text search in Envia. Over time, this will give access to user query logs and allow the iterative assessment and improvement of the quality of the semantic enrichment and search algorithms.


Conclusion


All our technical objectives have now been completed, and we are ready to deploy the semantic enrichment pipeline within the Envia system, as well as to carry out further improvements and experiments with the EnviLOD semantic search UI. We are looking forward to taking this work further in the future, implementing the ideas we received during the user evaluation workshop.