- To engage actively with environmental science researchers and other key stakeholders, in order to derive requirements and evaluate project results.
- To develop tools for efficient LOD-based semantic enrichment of unstructured content.
- To create and evaluate intuitive user interface methods that hide the complexities of the SPARQL semantic search language.
- To use the British Library’s Envia tool as a case study in using LOD vocabularies for enhanced information discovery and management.
Objective 1: Stakeholder engagement, Requirements Capture, and Evaluation
In order to demonstrate the value of shared LOD
vocabularies to different applications, information types and audiences, we
focused on use cases related to research on flooding and climate change in the
UK. We gathered requirements from relevant audiences and groups via a web-based
questionnaire, which collected some actual search queries, alongside user
input on the kinds of searches they require. We engaged researchers,
practitioners and information managers, in order to assess how LOD vocabularies
might support their needs. This also motivated our choice of different
information types including full-text content, metadata, and LOD datasets. The
main user requirement to be fulfilled was for supporting location-based
searches, e.g. flooding near Sheffield, or flooding on rivers flowing through
Gloucestershire. In addition, users emphasized their need for an intuitive
semantic search UI.
A new British Library information discovery
tool for environmental science, Envia, was used as a starting point to test the
use of semantics towards enhancing information discovery and management. Envia
is particularly suited as a test case for these purposes, as it features a
mixed corpus of content, including datasets, journal articles, and grey
literature, with accompanying metadata records. Envia also enabled us to examine
the value of semantic enrichment for information managers. Environmental
consultants at HR Wallingford collaborated as domain experts, providing
feedback on how the semantic work undertaken in EnviLOD supported their work as
environmental science practitioners and innovators.
During the project, stakeholder engagement was
ongoing through the project website, blog, Twitter presence, published reports,
and joint meetings. In particular, user input and feedback were sought during
the design of the semantic search user interface, in order to ensure that it
met user needs. The interface was implemented in three iterations:
- The British Library team participated in the design meeting and provided feedback on the first implementation. This helped the Sheffield team to adjust and simplify the interface.
- Following this, the semantic search UI was demonstrated during a lunchtime workshop and EnviLOD presentation. At the end, environmental science researchers were given the opportunity to try the interface and provide us with structured feedback (via a written questionnaire) and to take part in a user-led discussion. This early evaluation helped us further refine the user interface design and remove confusing elements.
- Lastly, much wider stakeholder feedback was solicited during a user outreach and evaluation workshop, organised at the British Library. There were 25 participants at the event, which enabled us to gather very detailed feedback and suggestions for minor improvements. Overall, the majority of users stated that semantic search would be very useful for information discovery and that they would use the system if it were deployed in production.
Objective 2: LOD-based Semantic Enrichment
Semantic annotation is the process of tying
semantic models, such as ontologies, and scientific articles together. It may
be characterised as the dynamic semantic enrichment of unstructured and
semi-structured documents and linking these to relevant domain
ontologies/knowledge bases.
The focus of our work was on implementing a
LOD-based semantic enrichment algorithm and applying it to metadata and full-text
documents from Envia. A trial web service is now available.
As part of this work, we evaluated the coverage
and accuracy of relevant general-purpose LOD datasets (namely GeoNames and
DBpedia), when applied to data and content from our domain. The results showed
that GeoNames is a useful resource of rich knowledge about locations (e.g. NUTS
administrative regions, latitude, longitude, parent country). However, it is not
suitable on its own as a primary source for knowledge enrichment. This is due
to its high level of detail and location ambiguity (e.g. it contains names of
farms). DBpedia, on the other hand, is much more balanced, also including
knowledge about people, organisations, products, and other entities. Therefore
DBpedia was chosen as the primary LOD resource for semantic enrichment.
Specifically for locations, we identified their equivalent entry in GeoNames
and enriched the text content with additional metadata from there.
Tools for LOD-based geo-location disambiguation, and for date and measurement recognition and normalisation, were implemented and tested. In more detail, the first step is to identify all candidate instance URIs from DBpedia that are mentioned in a given document. This phase is designed to maximise recall, in order to ensure that more relevant documents can be returned at search time. The second step is entity disambiguation, which is carried out on the basis of string, semantic, and contextual similarity, coupled with a corpus frequency metric. The algorithm was developed on a general-purpose, shared news-like corpus and evaluated on environmental science papers and metadata records from the British Library.
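The two steps described above can be illustrated with a minimal Python sketch. This is not the EnviLOD implementation: the gazetteer entries, the example.org URIs (standing in for DBpedia URIs), the context words, the frequency counts, and the 0.1 weight are all made-up assumptions. The sketch also uses only contextual overlap and corpus frequency, leaving out the string and semantic similarity measures mentioned above.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    uri: str             # placeholder URI standing in for a DBpedia instance URI
    label: str           # surface form to look for in the text
    context: frozenset   # words assumed to occur near mentions of this entity
    frequency: int       # assumed mention count in some reference corpus


# Toy gazetteer (hypothetical data, for illustration only).
GAZETTEER = [
    Candidate("http://example.org/resource/Sheffield_UK", "Sheffield",
              frozenset({"yorkshire", "don", "england", "flooding"}), 900),
    Candidate("http://example.org/resource/Sheffield_Massachusetts", "Sheffield",
              frozenset({"massachusetts", "berkshire", "county"}), 40),
]


def tokenize(text):
    """Lower-case the text and return its set of alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def find_candidates(text):
    """Step 1 (recall-oriented): keep every entry whose label occurs in the text."""
    lowered = text.lower()
    found = {}
    for cand in GAZETTEER:
        if cand.label.lower() in lowered:
            found.setdefault(cand.label, []).append(cand)
    return found


def score(cand, doc_tokens, max_freq):
    """Contextual overlap plus a small bonus for corpus frequency."""
    overlap = len(cand.context & doc_tokens) / len(cand.context)
    return overlap + 0.1 * cand.frequency / max_freq


def disambiguate(text):
    """Step 2: choose the best-scoring candidate for each matched label."""
    doc_tokens = tokenize(text)
    result = {}
    for label, cands in find_candidates(text).items():
        max_freq = max(c.frequency for c in cands)
        best = max(cands, key=lambda c: score(c, doc_tokens, max_freq))
        result[label] = best.uri
    return result
```

Calling `disambiguate("Flooding on the River Don affected Sheffield in Yorkshire.")` selects the UK entry here, because three of its four context words appear in the sentence. Keeping candidate lookup separate from scoring mirrors the recall-first, precision-second split described above.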
Objective 3: User Interface for Semantic Search
The semantic search UI provides a keyword search field, complemented with
optional semantic search constraints, through a set of inter-dependent
drop-down lists. In the first list, Location allows users to search for
mentions of locations; Date – of dates; Document – for specifying constraints
on document-level attributes, etc.
More than one semantic constraint can be added, through the plus button, which
inserts a new row underneath the current row of constraints.
If a Location is chosen as a semantic constraint, then, if required, further constraints can be specified by choosing an appropriate property constraint. Population allows users to pose restrictions on the population number of the locations that are being searched for. Similar numeric constraints can be imposed on the latitude, longitude, and population density attribute values.
Restrictions can also be imposed on a location's name or its country code, i.e. which country it belongs to. When “is” is chosen, the location name must be exactly as specified (e.g. Oxford), whereas “contains” provides sub-string matching (e.g. Oxfordshire). For example, suppose the user is searching for documents mentioning locations whose name contains Oxford. When the search is executed, this would return documents mentioning Oxford explicitly, but also documents mentioning Oxfordshire and other locations in Oxfordshire (e.g. Wytham Woods, Banbury).
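To give a feel for how such form rows could be turned into a SPARQL query behind the scenes, here is a minimal Python sketch. It is an illustration only: the post does not describe the actual query generation, so the `Constraint` class, the `build_location_query` function, the `ex:mentions` property, and the choice of GeoNames ontology properties are all assumptions. The sketch covers plain string and numeric filters only. Retrieving other locations within Oxfordshire, as in the example above, would need knowledge-base relations that are not modelled here.

```python
from dataclasses import dataclass
from typing import List, Union

# Assumed vocabulary: GeoNames ontology properties plus a hypothetical
# ex:mentions property linking a document to a location it mentions.
PREFIXES = (
    "PREFIX gn: <http://www.geonames.org/ontology#>\n"
    "PREFIX ex: <http://example.org/envilod#>\n"
)
PROPERTIES = {
    "name": "gn:name",
    "country code": "gn:countryCode",
    "population": "gn:population",
}
NUMERIC_OPERATORS = {"<", "<=", ">", ">="}


@dataclass
class Constraint:
    prop: str                      # key in PROPERTIES, as chosen in a drop-down
    operator: str                  # "is", "contains", or a numeric comparison
    value: Union[str, int, float]  # value typed in by the user


def quote(text: str) -> str:
    """Return text as a SPARQL string literal, escaping backslashes and quotes."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_filter(var: str, c: Constraint) -> str:
    """Translate one form row into a SPARQL FILTER expression."""
    if c.operator == "is":
        return f"FILTER(STR({var}) = {quote(str(c.value))})"
    if c.operator == "contains":
        # Case-insensitive sub-string match.
        return f"FILTER(CONTAINS(LCASE(STR({var})), {quote(str(c.value).lower())}))"
    if c.operator in NUMERIC_OPERATORS and isinstance(c.value, (int, float)):
        return f"FILTER({var} {c.operator} {c.value})"
    raise ValueError(f"unsupported constraint: {c}")


def build_location_query(constraints: List[Constraint]) -> str:
    """Build a query for documents mentioning a location that meets all rows."""
    body = ["  ?doc ex:mentions ?loc ."]
    for i, c in enumerate(constraints):
        var = f"?v{i}"
        body.append(f"  ?loc {PROPERTIES[c.prop]} {var} .")
        body.append("  " + to_filter(var, c))
    return PREFIXES + "SELECT DISTINCT ?doc WHERE {\n" + "\n".join(body) + "\n}"
```

For instance, `build_location_query([Constraint("name", "contains", "Oxford"), Constraint("population", ">", 10000)])` yields a query with one triple pattern and one FILTER per row, which is the kind of detail the drop-down lists spare the user from writing by hand.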
Objective 4: Use of Envia as a Testbed
Envia provided us with readily available content and a testbed for experimenting with the semantic enrichment methods. Sparsely populated metadata records were enriched with environmental science terms and location and organisation entities, as well as with additional metadata imported from GeoNames and DBpedia.
The British Library will launch a public beta of Envia in May 2013, in which EnviLOD-enriched content will be included as an experimental option, complementing the traditional full-text search in Envia. Over time, this will give access to user query logs and allow us to iteratively identify and address weaknesses in the semantic enrichment and search algorithms.
Conclusion
All our technical objectives have now been completed and we are ready to deploy the semantic enrichment pipeline within the Envia system, as well as carry out further improvements and experiments with the EnviLOD semantic search UI. We are looking forward to taking this work further in the future, implementing the ideas which we received during the user evaluation workshop.