On GATE, Text and Social Media Analysis, and Detecting Misinformation Online: August 2012

Wednesday, 22 August 2012

#EnviLOD: Project Timeline and Work packages

Our project started in June 2012 and is due to finish on December 31st, 2012. We have just completed the user requirements gathering stage and are writing up the corresponding deliverable. As soon as it is ready, we will share it here for feedback. We also had our third meeting today, discussing the work carried out in the past two weeks on user engagement and LOD-based semantic enrichment.

In the meantime, here are some more details on the project workplan:

WORKPACKAGES (Months 1-7)

1: Project Management
2: User Engagement & Case Studies
3: Linked Environment Data Enrichment
4: User-Friendly Semantic Search over Linked Data
5: Evaluation
6: Dissemination & Engagement

WP 1: Project Management (Responsible partner: Sheffield)

The cross-institutional nature of the project necessitates close liaison between Sheffield, the British Library (BL) and HR Wallingford. In addition to the communication that arises from collaborative working, monthly teleconferences and regular face-to-face meetings will be used to advance the project and monitor progress.

Deliverables: Project plan. Legacy plan, including sustainability and support. Final report.

WP 2: User Engagement and Case Studies (BL, HR Wallingford)

This WP covers engagement with environmental science researchers and other key stakeholders. This takes place throughout the project, but in particular: (i) early in the project, to produce detailed requirements and use cases, based on interviews; (ii) later in the project, when we will test the utility of Linked Data, assess how the vocabularies support the needs of researchers and practitioners, and establish whether the Linked Open Data (LOD) approach produces an added benefit in comparison with keyword searching.

Deliverables: Stakeholder analysis, requirements and use cases; User feedback.

WP 3: Linked Environment Data Enrichment (Sheffield)

This WP will deliver semantic enrichment tools, based on relevant LOD vocabularies. Where required, relevant ontologies not already connected to existing Linked Environment Data will be integrated. Sheffield’s open-source tools for lookup and term disambiguation with respect to Linked Data vocabularies will be tested and adapted to the environmental science domains. As part of this work, we are evaluating the coverage and accuracy of relevant general-purpose LOD datasets (namely GeoNames and DBpedia), when applied to data and content from our domain. Tools for LOD-based geo-location disambiguation, date and measurement recognition and normalisation will also be delivered.

Our solution is based on Ontotext's high performance OWLIM semantic repository, the open-source GATE semantic annotation tools, and their integration with Linked Data endpoints. We import Linked Data into the semantic repository, which provides a SPARQL endpoint and also full text, metadata, and semantic annotation indices, which underpin the semantic search UI.
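To give a flavour of the measurement recognition and normalisation mentioned for this WP, here is a minimal Python sketch. It is not the GATE implementation: the regular expression, the `TO_METRES` table, and the `normalise_lengths` function are all assumptions made for illustration, and only a handful of length units are covered.

```python
import re

# Illustrative conversion factors to metres; not a complete unit inventory.
TO_METRES = {"mm": 0.001, "cm": 0.01, "m": 1.0, "km": 1000.0}

# A number (optionally with a decimal part), optional whitespace, then a unit.
MEASUREMENT = re.compile(r"(\d+(?:\.\d+)?)\s*(mm|cm|km|m)\b")


def normalise_lengths(text):
    """Return (matched text, value in metres) for each length found in text."""
    results = []
    for match in MEASUREMENT.finditer(text):
        value = float(match.group(1))
        unit = match.group(2)
        results.append((match.group(0), value * TO_METRES[unit]))
    return results


if __name__ == "__main__":
    for surface, metres in normalise_lengths("The level rose 35 cm over a 2.5 km stretch."):
        print(surface, "->", metres, "m")
```

Normalising to a single unit is what makes values comparable at search time, for example when looking for all reports of water level rises above a given threshold.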

Deliverables: Open source tools for semantic enrichment with Linked Environment Data.

WP 4: User-Friendly Semantic Search over Linked Data (Sheffield)

GATE Mimir (Multi-paradigm Information Management Indexing and Retrieval) is an open-source software framework for multi-paradigm indexing and searching of semantically annotated documents. Enriching documents with explicit semantics allows users to search more effectively for ambiguous names such as London (Ontario) and London (UK). The multi-paradigm aspect of Mimir refers to accessing and linking together multiple information sources, such as the textual content of the documents, the semantic metadata, and the knowledge encoded in the Linked Data vocabularies. Accessing knowledge from Linked Data allows Mimir to understand generalisations, making it capable of answering more complex information needs, such as identifying documents that refer to water levels at the Thames barrier as relevant to a keyword search for flooding in south-east Britain. At the same time, the explicit LOD semantics associated with the indexed semantic metadata and content ensure that references to places called London (other than the one in the UK) are not seen as relevant results for such a query.

This WP will develop a customised semantic search interface, which enables users to carry out such powerful searches and fully benefit from the knowledge contained in Linked Data, without needing to write SPARQL queries.
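The Thames barrier example can be sketched in a few lines of Python. This is a toy illustration of the idea of combining text with semantic annotations and Linked Data generalisations; it does not use Mimir's API, and the `ex:` identifiers, the two small hierarchies, and all function names are hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical "part of" hierarchy standing in for Linked Data knowledge.
# The identifiers are illustrative, not real GeoNames or DBpedia URIs.
PART_OF = {
    "ex:Thames_Barrier": "ex:London_UK",
    "ex:London_UK": "ex:South_East_Britain",
    "ex:London_Ontario": "ex:Canada",
}

# Hypothetical topic hierarchy: water levels generalise to flooding.
BROADER_TOPIC = {"ex:Water_Level": "ex:Flooding"}


@dataclass
class Document:
    doc_id: str
    text: str
    annotations: set = field(default_factory=set)  # semantic annotation URIs


def ancestors(uri, hierarchy):
    """Return the URI itself plus everything it generalises to."""
    seen = {uri}
    while uri in hierarchy and hierarchy[uri] not in seen:
        uri = hierarchy[uri]
        seen.add(uri)
    return seen


def expand(annotations):
    """Add all generalisations of each annotation from both hierarchies."""
    expanded = set()
    for uri in annotations:
        expanded |= ancestors(uri, PART_OF)
        expanded |= ancestors(uri, BROADER_TOPIC)
    return expanded


def semantic_search(docs, topic, region):
    """Documents whose expanded annotations cover both topic and region."""
    return [d.doc_id for d in docs if {topic, region} <= expand(d.annotations)]


def keyword_search(docs, *keywords):
    """Plain keyword baseline: every keyword must occur in the text."""
    return [d.doc_id for d in docs
            if all(k.lower() in d.text.lower() for k in keywords)]


DOCS = [
    Document("thames-report", "Water levels at the Thames barrier rose sharply.",
             {"ex:Water_Level", "ex:Thames_Barrier"}),
    Document("ontario-news", "Flooding closed several roads in London, Ontario.",
             {"ex:Flooding", "ex:London_Ontario"}),
]

if __name__ == "__main__":
    print(keyword_search(DOCS, "flooding", "Britain"))
    print(semantic_search(DOCS, "ex:Flooding", "ex:South_East_Britain"))
```

In this toy setup the keyword baseline finds nothing, while the semantic search returns the Thames barrier report and leaves out the document about London, Ontario.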

Deliverable: A web-based interface for semantic search with Linked Environment Data.

WP 5: Evaluation (Sheffield and BL)

Firstly, a quantitative evaluation of the accuracy of semantic enrichment and of Linked Data vocabulary coverage will be carried out, based on a human-annotated gold standard and established metrics such as F-measure. In addition, a comparative evaluation of the new semantic search web interface will be completed, against the current keyword-search Envia tool, using a set of search queries supplied by the BL. Evaluation will be carried out in the context of the user requirements developed in WP2.
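For readers unfamiliar with F-measure, here is a minimal Python sketch of how precision, recall, and F1 can be computed against a gold standard. It assumes exact matching of annotations represented as hypothetical `(start, end, uri)` tuples; the project's evaluation may well use other matching criteria, such as partial overlap.

```python
def precision_recall_f1(gold, predicted):
    """Compare two sets of (start, end, uri) annotations using exact matching."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Illustrative data only; the ex: identifiers are made up.
    gold = {
        (0, 6, "ex:London_UK"),
        (20, 34, "ex:Thames_Barrier"),
        (40, 44, "ex:Kent"),
    }
    predicted = {
        (0, 6, "ex:London_UK"),
        (20, 34, "ex:Thames_Barrier"),
        (50, 56, "ex:London_Ontario"),
        (60, 65, "ex:Essex"),
    }
    p, r, f = precision_recall_f1(gold, predicted)
    print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

With two correct annotations out of four predicted and three expected, precision is 2/4, recall is 2/3, and F1 works out to 4/7.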

Deliverables: Quantitative evaluation results; A report detailing the lessons learned.

WP 6: Dissemination and Engagement (Sheffield, BL, HR Wallingford)

The project will devote significant effort to dissemination, including practical activities such as demonstrations and tutorials, to show how project outputs might be exploited in other institutions. Details of planned dissemination activities are provided below.

Deliverables: Presentations; research paper; online demonstration; training materials; blog; website; user workshop; engagement with JISC programme manager and related projects.

Timing	Dissemination Activity	Audience	Purpose	Key Message
M1-M7	Participation in JISC programme activities, such as JISC Involve (http://jiscinvolve.org/)	JISC	Raise awareness, Promote results	Benefits and challenges of using LOD
M1-M7	Collaboration with other “Research Tools” projects	JISC development programmes	Inform, engage, and promote	EnviLOD objectives and results
M2-M7	Project website	External stakeholders and research community	Raise awareness, inform, promote results	EnviLOD objectives and results
M4-M7	Peer-reviewed publications in journals, conferences and workshops, including relevant environmental science venues (e.g. EnviroInfo, Ecological Informatics), as well as technical semantic technology ones (ISWC, ESWC, Journal of Web Semantics)	Research community, including environmental science and web science	Inform and promote research results	EnviLOD research methods, open-source tools, and evaluation results
M7	Dissemination workshop hosted at The British Library	Stakeholders	Engage stakeholders with the EnviLOD outputs	Benefits of LOD for environmental scientists
M3-M7	Practical, “hands-on” outreach, through open-source software, user documentation, online demonstrations and tutorials	Research community, end users, JISC, and other stakeholders	Promote project results	Availability of open-source tools for LOD-based semantic enrichment and search
M1-M7	Engagement with interested researchers from other institutions and other disciplines	Stakeholders	Inform and promote results	Lessons learned and results delivered

Tuesday, 7 August 2012

GATE is Getting Sentimental about Social Media

Over the past two years, Diana Maynard, myself, and other colleagues in the GATE team have been working on a number of GATE-based sentiment analysis and opinion mining tools, specifically optimised for Twitter, blogs, comments, and other kinds of social media posts. The work has been part of the Arcomem and TrendMiner EC-funded projects, as well as my EPSRC fellowship on mining and summarisation of social media (grant EP/I004327/1).

Speaking from experience, doing opinion mining on social media is nothing if not challenging. And in this paper Diana, Dominic, and I have tried to explain why. In a nutshell:

Most NLP tools do not come with a swear word plugin. As part of her work on the Arcomem project, Diana had fun collecting a suitable training corpus and a swear word list for sentiment detection.
"It's all Greek to me": less than 50% of all tweets are in English. Thanks to the plethora of GATE multilingual plugins, building a basic NLP pipeline wasn't as bad as it could have been.
Identifying relevant posts: there's more chaff than wheat out there, especially on Twitter.
Twts r noizy: Normalisation and spelling correction are essential. It turns out that the perfect way to collect a training corpus of tweets for normalisation purposes is to search for Justin Bieber.
Opinion target identification in tweets is...ahem...even more challenging than in longer texts (not that we have fully solved it there either).
And please do NOT get me started on negation
...or context, time, space, and summarisation for that matter.
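To make a couple of these points concrete (normalisation and negation in particular), here is a deliberately tiny Python sketch. It is not how the GATE tools work: the word lists, the `normalise` and `score` functions, and the crude rule that a negator flips the next sentiment-bearing word are all assumptions for illustration.

```python
import re

# Tiny illustrative resources; real systems need far larger, curated lists.
NORMALISATION = {"twts": "tweets", "r": "are", "noizy": "noisy",
                 "gr8": "great", "luv": "love"}
POLARITY = {"great": 1, "love": 1, "good": 1,
            "bad": -1, "noisy": -1, "hate": -1}
NEGATORS = {"not", "no", "never"}


def normalise(text):
    """Lower-case, tokenise, and map known noisy spellings to standard forms."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [NORMALISATION.get(token, token) for token in tokens]


def score(text):
    """Sum word polarities, flipping the first sentiment word after a negator."""
    total = 0
    negate = False
    for token in normalise(text):
        if token in NEGATORS:
            negate = True
            continue
        value = POLARITY.get(token, 0)
        if value:
            total += -value if negate else value
            negate = False
    return total


if __name__ == "__main__":
    print(normalise("Twts r noizy"))
    print(score("gr8 show, luv it"))
    print(score("This is not good"))
```

Even this toy shows why negation is painful: the scope rule here is a guess, and it breaks as soon as the negator and the opinion word are far apart or the negation is implicit.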

If you'd like to know more technical details, here's another paper on detecting political opinion in tweets with GATE.

If you wish to learn hands-on how to roll your own sentiment analyser, Diana will be giving a practical sentiment analysis tutorial with GATE at the forthcoming Sentiment Analysis Symposium in San Francisco, California, on October 29th, 2012.

Give us a shout if you need more info, and thanks for reading!

Follow the GATE Team on Twitter: @GateAcUk
Follow Diana Maynard on Twitter: @dianamaynard
Follow me on Twitter: @kbontcheva

Meet the #EnviLOD Project Team

On June 28th, 2012 we held the first project meeting in Sheffield, UK. In keeping with traditional British summers, it rained cats and dogs just as the British Library team arrived in the morning and then as we were due to go out to lunch. To add insult to injury, the kitchen was being refurbished, so we had to bring in refreshments from outside and the coffee wasn't very good. After all this, I'm surprised that the project partners are still keen to work with us. Wisely though, they scheduled the follow-up user requirements meeting in London, in early July.

Here is a quick introduction of all #EnviLOD team members:

Dr Niraj Aswani is a post-doctoral researcher in the GATE team at Sheffield, working on ontology-based semantic annotation and search. Most recently, his research has focused on querying Linked Data through SPARQL and using Linked Data for semantic annotation, indexing, and search. He has published at key semantic technology conferences (ISWC, ESWC) and in journals. In EnviLOD, he will be the key researcher developing the new LOD-based semantic annotation and search tools.

Dr. Johanna Kieniewicz is Environmental Science Research Officer at the British Library and leads on engagement with the environmental science research community for Envia. She has researched information needs of the UK environmental science research community, captured content and user interface requirements for the Library’s Envia project, and has experience with a variety of consultation methodologies. She has also been trained by the University of Sheffield on the semantic annotation of content using GATE.

Michael Wallis is a flood and coastal research scientist and experienced project manager within the Coasts and Estuaries Group at HR Wallingford. He was involved in the Defra/EA-funded project ‘Sustainable Flood and Coastal Management’ and was also a contributing author to the EU-funded FLOODsite research dissemination document ‘Flood risk assessment and flood risk management’.

I am the project director, who has the pleasure to work with this enthusiastic and talented EnviLOD team. A short bio and outline of my latest activities appear on my home page.

Follow the project on Twitter: #envilod

Friday, 3 August 2012

About the #EnviLOD project

On June 1st, 2012 the GATE Team at the University of Sheffield, in collaboration with the British Library and HR Wallingford, started the #EnviLOD project, funded under the JISC Research Tools Programme.

#EnviLOD aims to demonstrate the value of using Linked Open Data (LOD) vocabularies in the field of environmental science, by pursuing the following objectives:

Address the problem of LOD domain vocabulary enrichment and interlinking. Develop GATE-based tools for efficient LOD vocabulary lookup and LOD-based term disambiguation. Evaluate these, both quantitatively and with end-users and other stakeholders.
Develop and evaluate intuitive user interface methods that can hide the complexities of the SPARQL semantic search language, while allowing environmental researchers to search successfully, using LOD vocabularies.
Build a case study, using the new British Library information discovery tool for environmental science, Envia. Test the use of LOD vocabularies towards enhancing information discovery and management.
Collaborate with domain experts at the environmental consultants HR Wallingford, providing feedback on how the semantic work undertaken here supports their work as environmental science practitioners and innovators.

Follow EnviLOD on Twitter: #envilod

Background and Motivation

Environmental Science is a broad, interdisciplinary subject area that spans biology, chemistry, earth sciences, physics, and engineering. Because of the breadth of the subject scope, information discovery and sharing in environmental science is often a challenge. Linked Open Data (LOD) and vocabularies offer an opportunity to improve the process of information discovery and sharing through unique, machine-readable, interlinked open vocabularies, thus ultimately connecting users more efficiently to useful and relevant resources.

Key vocabularies for environmental science are already becoming available as Linked Data (e.g. the GEMET thesaurus), as are other key resources relevant for the domain (e.g. GeoNames, DBpedia). One outstanding challenge is to use them to enrich unstructured content and metadata with semantics. Doing so manually is prohibitively expensive and unsustainable, since LOD vocabularies typically have millions of instances. Therefore there is a strong need for semantic annotation tools that enrich metadata and content with LOD semantics automatically. EnviLOD will tackle the problem of LOD vocabulary enrichment, interlinking, and adoption in the domain of environmental science; however, the results will also be relevant to other fields. The starting point will be the DBpedia-based entity annotation and disambiguation algorithms, developed by Sheffield as part of the TrendMiner project.

The second major challenge is to develop information access facilities that use semantics to deliver a semantic search service, which is not only more powerful, but also as simple to use as its non-semantic counterparts. At present, the most widely used method for retrieving information from Linked Data is through SPARQL queries. However, formulating such queries is beyond the capabilities of most users and presents a significant barrier to widespread uptake. EnviLOD will evaluate user interface methods that can hide the complexities of SPARQL, while allowing users successfully to utilise semantic search.

In the context of environmental science, for example, a user searching for flooding in south-east Britain would be able to find a report with a chapter on water levels at the Thames barrier. In other words, by exploiting the additional semantic context from relevant Linked Open Data ontologies, the user will find a report in the search results that would not have been picked up based on a simple keyword search.
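To show why SPARQL is a barrier for most users, here is roughly what the flooding query might look like when written by hand, wrapped in a small Python sketch. Everything specific here is an assumption: the endpoint URL, the `ex:` vocabulary (`ex:mentions`, `ex:broaderTopic`, `ex:partOf`), and the function names are hypothetical, and the query relies on SPARQL 1.1 property paths.

```python
from urllib.parse import urlencode

# Hypothetical endpoint and vocabulary, for illustration only.
ENDPOINT = "http://localhost:8080/sparql"
EX = "http://example.org/envilod/"


def build_query(topic, region):
    """Return a SPARQL query for documents about topic located within region."""
    return f"""
PREFIX ex: <{EX}>
SELECT DISTINCT ?doc WHERE {{
  ?doc ex:mentions ?subject ;
       ex:mentions ?place .
  ?subject ex:broaderTopic* ex:{topic} .
  ?place ex:partOf* ex:{region} .
}}
""".strip()


def request_url(query):
    """Encode the query as a GET request URL, as the SPARQL protocol allows."""
    return f"{ENDPOINT}?{urlencode({'query': query})}"


if __name__ == "__main__":
    query = build_query("Flooding", "South_East_Britain")
    print(query)
    print(request_url(query))
```

Compare that with typing "flooding in south-east Britain" into a search box, and the case for hiding SPARQL behind a friendlier interface makes itself.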

Deliverables

#EnviLOD will be creating a number of research outputs and improving some pre-existing GATE open-source tools for semantic annotation and search:

Output / Outcome Type	Brief Description
Report	User needs analysis, requirements gathering and use case definition.
Software	Open source tools for semantic enrichment with Linked Environment Data.
Software	A web-based interface for semantic search with Linked Environment Data.
Report	Quantitative and user-based evaluation results.
Report	A final report detailing the lessons learned.
Publication	At least one research paper
Dissemination materials	Online demonstration and documentation; website; blog
User engagement event	User workshop
Project documentation	JISC project documentation (Project plan, project reports, etc)
Knowledge built	Knowledge of LOD, LOD-based semantic annotation, and semantic search
Knowledge built	Spreading awareness of LOD and its relevance to environmental science
Knowledge built	Knowledge transfer between computer scientists, information scientists, and environmental scientists

Critical Success Factors

1. Scalability: LOD resources, such as DBpedia and GeoNames, have (tens of) millions of instances, so using them for semantic annotation and semantic queries is far from trivial. Thus scalability and robustness to noisy data are key requirements for EnviLOD. Our solution is based on Ontotext's OWLIM semantic repository, which scales to billions of triples. OWLIM is coupled with the open-source GATE semantic annotation tools and Linked Data endpoints. We import Linked Data into the OWLIM semantic repository, which provides a SPARQL endpoint. GATE Mimir is used to index full text, metadata, and semantic annotations, which underpin the semantic search UI.

2. Sustainability: All project results will be made available as open-source. Software will be provided with a clearly-defined API to facilitate adoption. The results will be incorporated within the Envia discovery tool, which will be supported by the British Library.

3. Usability: Usability of the semantic search user interface is paramount. UI mockups will be created and tested first with the British Library and HR Wallingford, followed by a wider consultation with key stakeholders. The UI will be designed to match as closely as possible the user’s current search practices, as well as their needs for semantically-enhanced queries.

4. Interoperability: This will be achieved through the use of widely adopted standards, such as the OWL W3C standard and the RDF W3C standard.

Dates: 1 June 2012 - 31 December 2012

Follow the GATE Team on Twitter: @GateAcUk

Follow the British Library Science team on Twitter: @ScienceBL

Thursday, 2 August 2012

Welcome

I have lately been working on text mining and summarisation of social media, with a focus on Twitter. This, I must say, is my favourite (micro-)blogging platform, as there I only need to write 140 characters, which is easily achievable on the bus home. As a researcher, I frequently write papers, do presentations, and give talks, so I often feel like I have run out of words, hence my fascination with brevity, summarisation, and Twitter.

After several years of deliberation, I have finally decided to take the plunge and try blogging too, mostly in my capacity as one of the longer-serving inmates of the GATE (http://gate.ac.uk) research team. I will be posting about our research projects, papers, talks, and collaborations, so it could get a tad self-centred and technical after a while. I will try my best not to give you reasons to unfollow me, but for a much more entertaining take on text analytics I can point you to Hamish Cunningham's Computing Text blog.
