- To engage actively with environmental science researchers and other key stakeholders, in order to derive requirements and evaluate project results.
- To develop tools for efficient LOD-based semantic enrichment of unstructured content.
- To create and evaluate intuitive user interface methods that hide the complexities of the SPARQL semantic search language.
- To use the British Library’s Envia tool as a case study in using LOD vocabularies for enhanced information discovery and management.
Objective 1: Stakeholder engagement, Requirements Capture, and Evaluation
In order to demonstrate the value of shared LOD vocabularies to different applications, information types and audiences, we focused on use cases related to research on flooding and climate change in the UK. We gathered requirements from the relevant audiences and groups via a web-based questionnaire, which collected some actual search queries alongside user input on the kinds of searches they require. We engaged researchers, practitioners and information managers, in order to assess how LOD vocabularies might support their needs. This also motivated our choice of different information types, including full-text content, metadata, and LOD datasets. The main user requirement was support for location-based searches, e.g. flooding near Sheffield, or flooding on rivers flowing through Gloucestershire. In addition, users emphasized their need for an intuitive semantic search UI.
A new British Library information discovery tool for environmental science, Envia, was used as a starting point to test the use of semantics towards enhancing information discovery and management. Envia is particularly suited as a test case for these purposes, as it features a mixed corpus of content, including datasets, journal articles, and grey literature, with accompanying metadata records. Envia also enabled us to examine the value of semantic enrichment for information managers. Environmental consultants at HR Wallingford collaborated as domain experts, providing feedback on how the semantic work undertaken in EnviLOD supported their work as environmental science practitioners and innovators.
During the project, stakeholder engagement was ongoing through the project website, blog, Twitter presence, published reports, and joint meetings. In particular, user input and feedback were sought during the design of the semantic search user interface, in order to ensure that it met user needs. The interface was implemented in three iterations:
- The British Library team participated in the design meeting and provided feedback on the first implementation. This helped the Sheffield team to adjust and simplify the interface.
- Following this, the semantic search UI was demonstrated during a lunchtime workshop and EnviLOD presentation. At the end, environmental science researchers were given the opportunity to try the interface and provide us with structured feedback (via a written questionnaire) and a user-led discussion. This early evaluation helped us further refine the user interface design and remove confusing elements.
- Lastly, much wider stakeholder feedback was solicited during a user outreach and evaluation workshop, organised at the British Library. There were 25 participants at the event, which enabled us to gather very detailed feedback and suggestions for minor improvements. Overall, the majority of users stated that semantic search would be very useful for information discovery and that they would use the system if it were deployed in production.
Objective 2: LOD-based Semantic Enrichment
Semantic annotation is the process of tying semantic models, such as ontologies, and scientific articles together. It may be characterised as the dynamic semantic enrichment of unstructured and semi-structured documents and linking these to relevant domain ontologies/knowledge bases.
The focus of our work was on implementing a LOD-based semantic enrichment algorithm and applying it to metadata and full-text documents from Envia. A trial web service is now available.
As part of this work, we evaluated the coverage and accuracy of relevant general purpose LOD datasets (namely GeoNames and DBpedia) when applied to data and content from our domain. The results showed that GeoNames is a useful resource of rich knowledge about locations (e.g. NUTS administrative regions, latitude, longitude, parent country). However, it is not suitable on its own as a primary source for knowledge enrichment, due to its high level of detail and the location ambiguity this causes (e.g. it contains names of farms). DBpedia, on the other hand, is much more balanced, and also includes knowledge about people, organisations, products, and other entities. Therefore DBpedia was chosen as the primary LOD resource for semantic enrichment. Specifically for locations, we identified their equivalent entry in GeoNames and enriched the text content with additional metadata from there.
Tools for LOD-based geo-location disambiguation, and for date and measurement recognition and normalisation, were implemented and tested. In more detail, the first step is to identify all candidate instance URIs from DBpedia that are mentioned in a given document. This phase is designed to maximise recall, in order to ensure that more relevant documents can be returned at search time. The second step is entity disambiguation, which is carried out on the basis of string, semantic, and contextual similarity, coupled with a corpus frequency metric. The algorithm was developed on a general purpose, shared news-like corpus and evaluated on environmental science papers and metadata records from the British Library.
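The two steps described above can be illustrated with a minimal Python sketch. This is not the EnviLOD implementation: the `Candidate` class, the tiny in-memory gazetteer, the context terms, the frequency values, and the weights are all assumptions made up for illustration. The sketch also combines only string similarity, contextual overlap, and corpus frequency; the semantic similarity component mentioned above is left out, since it would need access to the ontology itself.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class Candidate:
    uri: str                  # DBpedia-style instance URI
    label: str                # human-readable name of the entity
    context_terms: frozenset  # words that typically co-occur with the entity
    corpus_frequency: float   # made-up relative frequency in a reference corpus (0..1)


# Hypothetical gazetteer; all values are illustrative only.
GAZETTEER = [
    Candidate("http://dbpedia.org/resource/Sheffield", "Sheffield",
              frozenset({"england", "yorkshire", "don", "river"}), 0.7),
    Candidate("http://dbpedia.org/resource/Sheffield,_Alabama", "Sheffield, Alabama",
              frozenset({"alabama", "tennessee", "usa"}), 0.1),
]


def find_candidates(mention, gazetteer):
    """Step 1 (recall-oriented): keep every entity whose label contains the mention."""
    needle = mention.lower()
    return [c for c in gazetteer if needle in c.label.lower()]


def score(mention, doc_terms, cand, weights=(0.4, 0.4, 0.2)):
    """Weighted sum of string similarity, context overlap, and corpus frequency."""
    string_sim = SequenceMatcher(None, mention.lower(), cand.label.lower()).ratio()
    if cand.context_terms:
        context_sim = len(doc_terms & cand.context_terms) / len(cand.context_terms)
    else:
        context_sim = 0.0
    w_string, w_context, w_freq = weights
    return w_string * string_sim + w_context * context_sim + w_freq * cand.corpus_frequency


def disambiguate(mention, doc_text, gazetteer):
    """Step 2: pick the best-scoring candidate, or None if there are no candidates."""
    doc_terms = set(doc_text.lower().split())
    candidates = find_candidates(mention, gazetteer)
    if not candidates:
        return None
    return max(candidates, key=lambda c: score(mention, doc_terms, c))


if __name__ == "__main__":
    text = "Flooding on the River Don near Sheffield in South Yorkshire"
    best = disambiguate("Sheffield", text, GAZETTEER)
    print(best.uri if best else "no candidate")
```

Keeping candidate lookup deliberately loose and leaving precision to the scoring step mirrors the recall-first design described above; in a real pipeline the weights would need tuning on annotated data.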
Objective 3: User Interface for Semantic Search
The interface provides a keyword search field, complemented with optional semantic search constraints, which are specified through a set of inter-dependent drop-down lists. In the first list, Location allows users to search for mentions of locations, Date for mentions of dates, and Document for constraints on document-level attributes, etc.
More than one semantic constraint can be added, through the plus button, which inserts a new row underneath the current row of constraints.
If a Location is chosen as a semantic constraint, then, if required, further constraints can be specified by choosing an appropriate property constraint. Population allows users to pose restrictions on the population number of the locations that are being searched for. Similar numeric constraints can be imposed on the latitude, longitude, and population density attribute values.
Restrictions can also be imposed on a location's name or its country code, i.e. which country it belongs to. When “is” is chosen, the location name must be exactly as specified (e.g. Oxford), whereas “contains” provides sub-string matching (e.g. Oxfordshire). For example, a user can search for documents mentioning locations whose name contains Oxford. When the search is executed, this would return documents mentioning Oxford explicitly, but also documents mentioning Oxfordshire and other locations in Oxfordshire (e.g. Wytham Woods, Banbury).
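To illustrate how rows of drop-down constraints might be translated into SPARQL behind the scenes, here is a minimal Python sketch. It is an assumption-laden illustration rather than the EnviLOD code: the `Constraint` class, the `ex:` namespace, the `ex:mentions`, `ex:population` and `ex:countryCode` predicates, and the query shape are all hypothetical, and the real predicates would depend on the schema of the underlying triple store. Only `STR`, `LCASE`, `CONTAINS` and `FILTER` are standard SPARQL 1.1 constructs.

```python
from dataclasses import dataclass


@dataclass
class Constraint:
    prop: str      # property chosen in the UI, e.g. "name" or "population"
    operator: str  # "is", "contains", or a numeric comparison such as ">"
    value: str     # raw value typed by the user


# Hypothetical mapping from UI property names to predicates.
PROPERTY_PREDICATES = {
    "name": "rdfs:label",
    "countryCode": "ex:countryCode",
    "population": "ex:population",
}

PREFIXES = (
    "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
    "PREFIX ex: <http://example.org/envilod#>\n"  # placeholder namespace
)


def escape_literal(text):
    """Escape backslashes and double quotes for use inside a SPARQL string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def constraint_to_sparql(index, c):
    """Turn one UI row into a triple pattern plus a FILTER on a fresh variable."""
    var = f"?v{index}"
    triple = f"?location {PROPERTY_PREDICATES[c.prop]} {var} ."
    if c.operator == "is":
        flt = f'FILTER (STR({var}) = "{escape_literal(c.value)}")'
    elif c.operator == "contains":
        # Case-insensitive sub-string match on the literal value.
        flt = f'FILTER (CONTAINS(LCASE(STR({var})), "{escape_literal(c.value.lower())}"))'
    elif c.operator in (">", "<", ">=", "<="):
        flt = f"FILTER ({var} {c.operator} {float(c.value)})"
    else:
        raise ValueError(f"unsupported operator: {c.operator}")
    return f"  {triple}\n  {flt}"


def build_query(constraints):
    """Combine all rows into one query over documents that mention a matching location."""
    body = "\n".join(constraint_to_sparql(i, c) for i, c in enumerate(constraints))
    return (
        PREFIXES
        + "SELECT DISTINCT ?doc WHERE {\n"
        + "  ?doc ex:mentions ?location .\n"
        + body
        + "\n}"
    )


if __name__ == "__main__":
    rows = [Constraint("name", "contains", "Oxford"),
            Constraint("population", ">", "100000")]
    print(build_query(rows))
```

Note that this sketch only matches on the location's own label. Returning documents about places located within Oxfordshire, as described above, would additionally require containment knowledge from the LOD resources, which is not modelled here.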
Objective 4: Use of Envia as a Testbed
Envia provided us with readily available content and a testbed for experimenting with the semantic enrichment methods. Sparsely populated metadata records were enriched with environmental science terms and location and organisation entities, as well as with additional metadata imported from GeoNames and DBpedia.
The British Library will launch a public beta of Envia in May 2013, in which EnviLOD-enriched content will be included as an experimental option, complementing the traditional full-text search in Envia. Over time, this will give access to user query logs and allow us to iteratively assess and improve the quality of the semantic enrichment and search algorithms.
All our technical objectives have now been completed and we are ready to deploy the semantic enrichment pipeline within the Envia system, as well as carry out further improvements and experiments with the EnviLOD semantic search UI. We are looking forward to taking this work further in the future, implementing the ideas which we received during the user evaluation workshop.