Friday, 3 August 2012

About the #EnviLOD project

On June 1st, 2012, the GATE Team at the University of Sheffield, in collaboration with the British Library and HR Wallingford, started the #EnviLOD project, funded under the JISC Research Tools Programme.

#EnviLOD aims to demonstrate the value of using Linked Open Data (LOD) vocabularies in the field of environmental science, by pursuing the following objectives:

  1. Address the problem of LOD domain vocabulary enrichment and interlinking. Develop GATE-based tools for efficient LOD vocabulary lookup and LOD-based term disambiguation. Evaluate these, both quantitatively and with end-users and other stakeholders.
  2. Develop and evaluate intuitive user interface methods that can hide the complexities of the SPARQL semantic search language, while allowing environmental researchers to search successfully, using LOD vocabularies. 
  3. Build a case study, using the new British Library information discovery tool for environmental science, Envia. Test the use of LOD vocabularies towards enhancing information discovery and management. 
  4. Collaborate with domain experts at the environmental consultants HR Wallingford, providing feedback on how the semantic work undertaken here supports their work as environmental science practitioners and innovators.
Follow EnviLOD on Twitter: #envilod

Background and Motivation

Environmental Science is a broad, interdisciplinary subject area that spans biology, chemistry, earth sciences, physics, and engineering. Because of the breadth of the subject scope, information discovery and sharing in environmental science is often a challenge. Linked Open Data (LOD) and vocabularies offer an opportunity to improve the process of information discovery and sharing through unique, machine-readable, interlinked open vocabularies, thus ultimately connecting users more efficiently to useful and relevant resources.

Key vocabularies for environmental science are already becoming available as Linked Data (e.g. the GEMET thesaurus), as are other key resources relevant for the domain (e.g. GeoNames, DBpedia). One outstanding challenge is to use them to enrich unstructured content and metadata with semantics. Doing so manually is prohibitively expensive and unsustainable, since LOD vocabularies typically have millions of instances. Therefore, there is a strong need for semantic annotation tools that enrich metadata and content with LOD semantics automatically. EnviLOD will tackle the problem of LOD vocabulary enrichment, interlinking, and adoption in the domain of environmental science, although the results will also be relevant to other fields. The starting point will be the DBpedia-based entity annotation and disambiguation algorithms developed by Sheffield as part of the TrendMiner project.
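To make the idea of automatic enrichment concrete, here is a minimal sketch of vocabulary lookup over free text. It is written in Python purely for illustration and is not the GATE implementation. The vocabulary, its example.org URIs, and the function name annotate are all assumptions invented for this example. Real LOD vocabularies are far larger and need proper disambiguation.

```python
import re

# Hypothetical vocabulary: labels mapped to placeholder URIs
# (these are NOT real GEMET, GeoNames, or DBpedia identifiers).
VOCAB = {
    "flood": "http://example.org/vocab/flood",
    "water level": "http://example.org/vocab/water-level",
    "thames barrier": "http://example.org/place/thames-barrier",
}

def annotate(text, vocab):
    """Return (start, end, matched_text, uri) tuples for vocabulary labels found in text."""
    annotations = []
    # Try longer labels first so multi-word terms win over their substrings.
    for label in sorted(vocab, key=len, reverse=True):
        pattern = r"\b" + re.escape(label) + r"s?\b"  # allow a simple plural
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            # Skip matches that overlap an annotation we already made.
            if any(m.start() < end and start < m.end()
                   for start, end, _, _ in annotations):
                continue
            annotations.append((m.start(), m.end(), m.group(0), vocab[label]))
    return sorted(annotations)

text = "Water levels at the Thames Barrier rose during the flood."
for start, end, surface, uri in annotate(text, VOCAB):
    print(start, end, surface, uri)
```

Each annotation links a span of text to a vocabulary URI, which is the kind of machine-readable metadata that semantic search can later exploit.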

The second major challenge is to develop information access facilities that use semantics to deliver a semantic search service, which is not only more powerful, but also as simple to use as its non-semantic counterparts. At present, the most widely used method for retrieving information from Linked Data is through SPARQL queries. However, formulating such queries is beyond the capabilities of most users and presents a significant barrier to widespread uptake. EnviLOD will evaluate user interface methods that can hide the complexities of SPARQL, while allowing users successfully to utilise semantic search.
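To illustrate why raw SPARQL is a barrier, the sketch below shows the kind of query text that a search interface could generate from a single keyword. The function name build_label_query is hypothetical, and the choice of rdfs:label and skos:broader is an assumption about how a vocabulary might be modelled, not a description of the EnviLOD interface.

```python
def build_label_query(term, lang="en", limit=20):
    """Build a SPARQL query for concepts labelled `term`, plus their narrower concepts."""
    # Escape backslashes and double quotes so the term is a valid SPARQL string literal.
    safe = term.replace("\\", "\\\\").replace('"', '\\"')
    return f"""PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT DISTINCT ?concept ?narrower WHERE {{
  ?concept rdfs:label "{safe}"@{lang} .
  OPTIONAL {{ ?narrower skos:broader ?concept . }}
}}
LIMIT {int(limit)}"""

# The user types one word; the interface produces the query on their behalf.
print(build_label_query("Flood"))
```

Even this small query requires knowing prefixes, property names, and language tags, which is exactly the detail a simple search box would hide.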

In the context of environmental science, for example, a user searching for flooding in south-east Britain would be able to find a report with a chapter on water levels at the Thames barrier. In other words, by exploiting the additional semantic context from relevant Linked Open Data ontologies, the user will find a report in the search results that would not have been picked up based on a simple keyword search.
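The flooding example can be sketched as query expansion over a tiny knowledge graph. Everything here is a toy assumption: the LOCATED_IN and RELATED_TO links, the two documents, and the function names are invented for illustration and only approximate what a real LOD-backed search would do.

```python
# Toy knowledge: each place maps to the wider region containing it (hypothetical data).
LOCATED_IN = {
    "thames barrier": "london",
    "london": "south-east britain",
}
# Toy topic links: terms treated as related to a broader topic (hypothetical data).
RELATED_TO = {
    "water levels": "flooding",
}

def expand(term, links):
    """Return term plus every term that reaches it by following links transitively."""
    found = {term}
    changed = True
    while changed:
        changed = False
        for child, parent in links.items():
            if parent in found and child not in found:
                found.add(child)
                changed = True
    return found

def keyword_search(docs, terms):
    """Documents containing every query term literally."""
    return [name for name, text in docs.items()
            if all(t in text.lower() for t in terms)]

def semantic_search(docs, topic, place):
    """Documents mentioning any expansion of the topic and any expansion of the place."""
    topics = expand(topic, RELATED_TO)
    places = expand(place, LOCATED_IN)
    return [name for name, text in docs.items()
            if any(t in text.lower() for t in topics)
            and any(p in text.lower() for p in places)]

DOCS = {
    "report-a": "Chapter 3: water levels at the Thames Barrier.",
    "report-b": "Rainfall statistics for northern Scotland.",
}
print(keyword_search(DOCS, ["flooding", "south-east britain"]))
print(semantic_search(DOCS, "flooding", "south-east britain"))
```

The literal keyword search misses the report because neither query term appears in it, while the expanded search finds it through the place and topic links.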


#EnviLOD will be creating a number of research outputs and improving some pre-existing GATE open-source tools for semantic annotation and search:

Output / Outcome Type: Brief Description

User needs analysis, requirements gathering and use case definition.
Open source tools for semantic enrichment with Linked Environment Data.
A web-based interface for semantic search with Linked Environment Data.
Quantitative and user-based evaluation results.
A final report detailing the lessons learned.
At least one research paper.
Dissemination materials: Online demonstration and documentation; website; blog
User engagement event: User workshop
Project documentation: JISC project documentation (project plan, project reports, etc.)
Knowledge built: Knowledge of LOD, LOD-based semantic annotation, and semantic search
Knowledge built: Spreading awareness of LOD and its relevance to environmental science
Knowledge built: Knowledge transfer between computer scientists, information scientists, and environmental scientists

Critical Success Factors

  1. Scalability: LOD resources such as DBpedia and GeoNames have (tens of) millions of instances, so using them for semantic annotation and semantic queries is far from trivial. Thus scalability and robustness to noisy data are key requirements for EnviLOD. Our solution is based on Ontotext's OWLIM semantic repository, which scales to billions of triples. OWLIM is coupled with the open-source GATE semantic annotation tools and Linked Data endpoints. We import Linked Data into the OWLIM semantic repository, which provides a SPARQL endpoint. GATE Mimir is used to index full text, metadata, and semantic annotations, which underpin the semantic search UI.
  2. Sustainability: All project results will be made available as open source. Software will be provided with a clearly defined API to facilitate adoption. The results will be incorporated within the Envia discovery tool, which will be supported by the British Library.
  3. Usability: Usability of the semantic search user interface is paramount. UI mockups will be created and tested first with the British Library and HR Wallingford, followed by a wider consultation with key stakeholders. The UI will be designed to match as closely as possible the user’s current search practices, as well as their needs for semantically-enhanced queries.
  4. Interoperability: This will be achieved through the use of widely adopted standards, such as the W3C OWL and RDF standards.

Dates: 1 June 2012 - 31 December 2012

Follow the GATE Team on Twitter: @GateAcUk
Follow the British Library Science team on Twitter: @ScienceBL
