On GATE, Text and Social Media Analysis, and Detecting Misinformation Online: EnviLOD: Lessons Learnt

The EnviLOD project demonstrated that location-based search, underpinned by Linked Open Data (LOD) and semantic technologies, can improve information retrieval. Although the semantic search tool developed in the project is not yet ‘production-ready’, it does show the benefits of this emerging technology. It will therefore be incorporated into the ‘labs’ page of the Envia website, which is currently under development. Within Envia Labs, users of the regular Envia service will be able to experiment with and comment on tools that might eventually augment or be incorporated into the service, allowing the Envia project team to gauge their potential uptake by the user community.

We also worked on the automatic generation of semantically enriched metadata to accompany records within the Envia system. The aim is to improve the discovery of information in the current Envia system by automatically generating keywords for the article metadata, based on occurrences of terms from the GEMET, DBpedia, and GeoNames vocabularies. Work is already under way on a pipeline that will incorporate this into the Envia system in a regular and sustainable manner.
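To make the idea of vocabulary-based keyword generation concrete, here is a minimal sketch in Python. It is not the EnviLOD pipeline: the tiny term lists, the `VOCABULARIES` dictionary, and the `generate_keywords` function are all hypothetical, and plain whole-word matching stands in for whatever matching the real system performs.

```python
import re

# Hypothetical miniature vocabularies, for illustration only. The real GEMET,
# DBpedia, and GeoNames term lists are far larger and are not reproduced here.
VOCABULARIES = {
    "GEMET": {"flood", "air pollution", "climate change"},
    "GeoNames": {"sheffield", "river thames"},
}


def generate_keywords(text, vocabularies=VOCABULARIES):
    """Return sorted (term, source) pairs for vocabulary terms found in text."""
    # Lower-case and collapse punctuation so multi-word terms match reliably.
    normalised = " ".join(re.findall(r"\w+", text.lower()))
    found = set()
    for source, terms in vocabularies.items():
        for term in terms:
            # Whole-word match, so 'flood' does not match 'floodplain'.
            if re.search(r"\b" + re.escape(term) + r"\b", normalised):
                found.add((term, source))
    return sorted(found)


if __name__ == "__main__":
    abstract = "Climate change and flood risk along the River Thames."
    for term, source in generate_keywords(abstract):
        print(f"{term} ({source})")
```

A scan like this is linear in the size of the vocabulary for every document, so a realistic version would presumably rely on an indexed gazetteer rather than a loop over every term.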

One particularly important lesson learnt from this short-term project is that the availability of large amounts of content, open to text mining and experimentation, needs to be ensured from the very beginning of the project. In EnviLOD there were copyright issues with the majority of the environmental science content at the British Library, which limited the experimental system to just over one thousand documents. Because of this limited content, users were not always able to judge how comprehensive or accurate the semantic search was, especially when comparing it against results offered by Google. Since the British Library is now planning to integrate the EnviLOD semantic enrichment tools within the advanced Envia Labs functionality, future work on this tool could potentially be evaluated on more comprehensive data through the Envia system.

Another important lesson learnt from the research activities is that working with Linked Open Data is very challenging, not only in terms of data volumes and computational efficiency, but also in terms of data noise and robustness. In terms of noise, an initial evaluation of the DBpedia-based semantic enrichment pipeline revealed that relevant entity candidates were initially excluded, because in the ontology they were classified as owl:Thing, whereas we were considering instances of specific sub-classes (e.g. Person, Place). There are over 1 million unclassified instances in the current DBpedia snapshot. In terms of computational efficiency, we had to introduce memory-based caches and efficient data indexing to make the entity linking and disambiguation algorithm fast enough to process data in near real-time. In addition, deploying the semantic enrichment on a server, e.g. at Sheffield or at the British Library, is far from trivial, since both OWLIM and our algorithms require large amounts of RAM and computational power. Parallelising the computation across more than three threads remains an open challenge, due to the difficulties experienced with parallelising OWLIM. Ontotext are currently working on cloud-based, scalable deployments, so future projects should be able to address the scalability issue effectively.
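The owl:Thing problem and the memory-based caching can both be illustrated with a short Python sketch. Everything here is an assumption made for illustration: the toy knowledge base `TOY_KB`, the `ex:` identifiers, and the functions `lookup` and `candidates` are hypothetical, and "unclassified" is read as "typed as owl:Thing and nothing more specific". The actual EnviLOD code queried OWLIM and is not shown.

```python
from functools import lru_cache

OWL_THING = "owl:Thing"
# The two sub-classes named above; the real pipeline may consider others.
TARGET_TYPES = frozenset({"Person", "Place"})

# Hypothetical toy knowledge base: label -> ((identifier, types), ...).
TOY_KB = {
    "sheffield": (
        ("ex:Sheffield_city", frozenset({OWL_THING, "Place"})),
        ("ex:Sheffield_unclassified", frozenset({OWL_THING})),
    ),
}


@lru_cache(maxsize=10_000)
def lookup(label):
    """Stand-in for an expensive triple-store query; results are cached in memory."""
    return TOY_KB.get(label.lower(), ())


def candidates(label, include_unclassified=False):
    """Return candidate identifiers whose types fall under TARGET_TYPES.

    With include_unclassified=True, instances typed as owl:Thing and nothing
    more specific are kept as well, instead of being silently dropped.
    """
    kept = []
    for identifier, types in lookup(label):
        specific = types - {OWL_THING}
        if specific & TARGET_TYPES:
            kept.append(identifier)
        elif include_unclassified and not specific:
            kept.append(identifier)
    return kept


if __name__ == "__main__":
    print(candidates("Sheffield"))
    print(candidates("Sheffield", include_unclassified=True))
    print(lookup.cache_info())
```

With the strict filter the unclassified candidate never reaches the disambiguation step, which is the kind of loss described above. Keeping such candidates trades recall against extra noise, so whether to enable it is a judgement call rather than a fix.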

Lastly, the quantitative evaluation of our DBpedia-based semantic enrichment pipeline was far from trivial. It required us to manually annotate a gold standard corpus of environmental science content (100 documents were annotated with disambiguated named entities). However, releasing this corpus to other researchers has proven practically impossible, due to the copyright and licensing restrictions imposed by the content publishers on the British Library. In a related project, we have now developed a web-based entity annotation interface based on CrowdFlower. This will enable future projects to create gold standards more easily, based on copyright-free content. In the end, we used available news and similar corpora created by TAC-KBP 2009 and 2010 for algorithm development and testing in EnviLOD, prior to the final quantitative evaluation on the copyrighted BL content. So even though the aims of the project were achieved and a useful running pilot system was created, publishing the results in scientific journals has been hampered by these content issues.
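For readers unfamiliar with gold-standard evaluation, the sketch below shows one common way to score entity linking output: precision, recall, and F1 over exact matches. The post does not state which metrics or matching criteria EnviLOD used, so the `evaluate` function, the tuple layout `(document, start, end, identifier)`, and the sample annotations are all assumptions.

```python
def evaluate(gold, predicted):
    """Return (precision, recall, f1) over exactly matching annotations.

    Each annotation is assumed to be a hashable tuple such as
    (document_id, start_offset, end_offset, entity_identifier).
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical annotations, invented for this example.
    gold = [
        ("doc1", 0, 9, "ex:Sheffield_city"),
        ("doc1", 20, 32, "ex:River_Thames"),
        ("doc2", 5, 11, "ex:London"),
    ]
    predicted = [
        ("doc1", 0, 9, "ex:Sheffield_city"),
        ("doc1", 20, 32, "ex:River_Thames"),
        ("doc2", 5, 11, "ex:London_Ontario"),  # wrong disambiguation
    ]
    precision, recall, f1 = evaluate(gold, predicted)
    print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

Exact matching is strict: a correct identifier attached to a slightly different text span counts as both a false positive and a false negative, so a more lenient evaluation would need an overlap-based comparison instead.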

In conclusion, we fully support the finding of the JISC report on text mining that a copyright exemption for text mining research is necessary in order to fully unlock the benefits of text mining for scientific research.

Monday, 29 April 2013
