Friday, 1 July 2016

The Tools Behind Our Brexit Analyser

It will be two weeks tomorrow since we launched the Brexit Analyser -- our real-time tweet analysis system, based on our GATE text analytics and semantic annotation tools.

Back then, we were analysing on average 500,000 (yes, half a million!) tweets a day. Then, on referendum day alone, we had to analyse well over 2 million tweets in real time, or just over 23 tweets per second on average! It wasn't quite so simple though: tweet volume picked up dramatically as soon as the polls closed at 10pm, and we were consistently getting around 50 tweets per second while also being rate-limited by the Twitter API.
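As a quick sanity check on these averages, the per-second rates follow directly from the daily totals. The snippet below is only illustrative arithmetic; the function name is made up for this example.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def tweets_per_second(tweets_per_day: float) -> float:
    """Average rate implied by a daily tweet total."""
    return tweets_per_day / SECONDS_PER_DAY


# Daily totals quoted in the post (the referendum-day figure is a lower bound).
typical_day = tweets_per_second(500_000)
referendum_day = tweets_per_second(2_000_000)

print(f"typical day:    {typical_day:.1f} tweets/s")
print(f"referendum day: {referendum_day:.1f} tweets/s")
```

The daily average hides the peak: the roughly 50 tweets per second seen after the polls closed is more than double the referendum-day average.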

These are some pretty serious data volumes, as well as velocity. So how did we build the Brexit Analyser to cope?

For analysis, we are using GATE's TwitIE system, which consists of a tokenizer, normalizer, part-of-speech tagger, and a named entity recognizer. After that, we added our Leave/Remain classifier, which helps us identify a reliable sample of tweets with unambiguous stance.  Next is a tweet geolocation component, which uses latitude/longitude, region, and user location metadata to geolocate tweets within the UK NUTS2 regions. We also detect key themes and topics discussed in the tweets (more than one topic/theme can be contained in each tweet), followed by topic-centric sentiment analysis.  



We kept the processing pipeline as simple and efficient as possible, so it can run at 100 tweets per second even on a pretty basic server.  

The analysis results are fed into GATE Mimir, which efficiently indexes the tweet text and all our linguistic annotations. Mimir has a powerful programming API for semantic search queries, which we use to drive different web pages with interactive visualisations. The user can choose what they want to see, based on time (e.g. most popular hashtags on 23 Jun; most talked about topics in Leave/Remain tweets on 23 Jun). Clicking on these infographics shows the actual matching tweets.

All my blog posts so far have been using screenshots of such interactively generated visualisations. 

Mimir also has a more specialised graphical interface (Prospector), which I use for formulating semantic search queries and inspecting the matching data, coupled with some pre-set types of visualisations. The screenshot below shows my Mimir query for all original tweets on 23 Jun which advocate Leave. I can then inspect the most mentioned Twitter users within those. (I used Prospector for my analysis of Leave/Remain voting trends on referendum day.)


So how do I do my analyses?


First I decide what subset of tweets I want to analyse. This is typically a Mimir query restricting by timestamp (normalized to GMT), tweet kind (original, reply, or retweet), voting intention (Leave/Remain), mentioning a specific user/hashtag/topic, written by a specific user, containing a given hashtag or a given topic (e.g. all tweets discussing taxes).  

Then, once I identify this dynamically generated subset of tweets, I can analyse it with Prospector or use the visualisations which we generate via the Mimir API. These include:

  • Top X most frequently mentioned words, nouns, verbs, or noun phrases
  • Top X most frequent posters/frequently mentioned Twitter users
  • Top X most frequent Locations, Organizations, or Persons within those tweets
  • Top X themes / sub-themes according to our topic classifier
  • Frequent URLs, language of the tweets, and sentiment
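To give a flavour of what a "Top X" statistic involves, here is a minimal sketch in plain Python. It is not the Mimir API (which works over indexed annotations rather than raw text); the regular expressions, the `top_x` helper, and the sample tweets are all assumptions made for illustration.

```python
import re
from collections import Counter

# Simple, illustrative patterns; real tweet tokenisation is more involved.
HASHTAG_RE = re.compile(r"#(\w+)")
MENTION_RE = re.compile(r"@(\w+)")


def top_x(tweets, pattern, x=10):
    """Return the x most frequent (case-folded) matches of pattern across tweets."""
    counts = Counter()
    for text in tweets:
        counts.update(match.lower() for match in pattern.findall(text))
    return counts.most_common(x)


# Hypothetical subset of tweets, e.g. the result of an earlier query.
subset = [
    "I #VoteIn for the #Brexit #EURef vote with @Brndstr",
    "Polls are open! #EURef #InOrOut",
    "Thanks @brndstr, voted this morning #EURef",
]

print(top_x(subset, HASHTAG_RE, x=3))
print(top_x(subset, MENTION_RE, x=3))
```

The same counting step applies to any of the items listed above (words, noun phrases, named entities, topics), once the subset has been selected and the relevant annotations extracted.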

How do we scale it up?


It's built using GATE Cloud Paralleliser and some clever queueing, but the take-away message is: we can process and index over 100 tweets per second, which allows us to cope in real time with the tweet stream we receive via the Twitter Search API, even at peak times. All of this runs on a server which cost us under £10,000.

The architecture can be scaled up further, if needed, should we get access to a Twitter feed with higher API rate limits than the standard. 



Thanks to:

Dominic Rout, Ian Roberts, Mark Greenwood, Diana Maynard, and the rest of the GATE Team 

Any mistakes are my own.







Sunday, 26 June 2016

#InOrOut: Analysing Voting Trends in Tweets on #EURef Day

In this post I examine the question: could we have predicted the #EUReferendum outcome, based on #Leave and #Remain tweets posted on polling day? This follows up from my #InOrOut debate on Twitter on Jun 23rd, where I analysed tweet volumes, popular hashtags, and most mentioned users.

This is not the only study to analyse referendum day tweets, but here I present a more in-depth analysis, also based on a sample of tweets selected specifically as  advocating #Leave/#Remain respectively. 


#Leave / #Remain Trend Based on @Brndstr


Our real-time analysis uncovered the most popular user mentioned in posts on referendum day: @Brndstr. @Brndstr are building bots to help brands engage with their customers and also for users to turn into social ambassadors of brands they endorse. 

On referendum day, they ran a campaign which encouraged people to tweet how they voted and, in return, their profile picture would change accordingly. This was not uncontroversial to some Twitter users, who took issue with the choice of the Union Jack (for Out voters) vs the EU flag (for In voters), but nevertheless, many people declared their votes in this way.



I found over 14,600 tweets mentioning @Brndstr in the 715 thousand original tweets we collected on June 23rd. I limited the analysis to original tweets only (i.e. excluded retweets and replies), since I wanted to study distinct, self-declared #Leave / #Remain intentions.

Inspection of a random sample in our Mimir Prospector dashboard showed all tweets had a set pattern, which made it trivial to distinguish #Leave and #Remain votes.  

In particular, all #Leave tweets started with: "I #VoteOut for the #Brexit #EURef vote with @Brndstr". All #Remain tweets started with: "I #VoteIn for the #Brexit #EURef vote with @Brndstr".

I used two Mimir queries with those texts, and found 6296 #VoteOut tweets and 8342 #VoteIn tweets. Thus, based on @Brndstr activity, one could hypothesize a  #Remain majority. 
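Because the campaign tweets follow a fixed template, separating the two groups amounts to a prefix match. The sketch below shows the idea in plain Python; it is not the Mimir query I ran, and the function and variable names are invented for this example. The two counts are the ones reported above.

```python
from collections import Counter

# Fixed openings of the campaign-generated tweets, as quoted in the post.
LEAVE_PREFIX = "I #VoteOut for the #Brexit #EURef vote with @Brndstr"
REMAIN_PREFIX = "I #VoteIn for the #Brexit #EURef vote with @Brndstr"


def brndstr_vote(text):
    """Return 'leave', 'remain', or None for tweets outside the template."""
    text = text.strip()
    if text.startswith(LEAVE_PREFIX):
        return "leave"
    if text.startswith(REMAIN_PREFIX):
        return "remain"
    return None


def tally(tweets):
    """Count self-declared votes among template tweets only."""
    return Counter(v for v in map(brndstr_vote, tweets) if v is not None)


# Shares implied by the counts reported in the post.
vote_out, vote_in = 6296, 8342
remain_share = vote_in / (vote_out + vote_in)
print(f"Remain share among @Brndstr vote tweets: {remain_share:.1%}")
```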


#Leave / #Remain Trend Based on Full-Text Search


In addition to @Brndstr, I also experimented with full-text searches over the referendum day tweets. For those interested in the technology behind this, I used GATE text analysis tools adapted to the referendum, combined with the Mimir semantic search engine (supports searches over both linguistic annotations and full-text). 

First, I searched for tweets containing "I", "voted", and "remain", within an 8 word window. This returned 14,665 matching tweets and upon manual inspection of the top 30 matches, I observed only 2 tweets which did not disclose the actual vote of their poster. Therefore, I considered this a sufficiently accurate query.

The corresponding "I", "voted" and "leave" query returned 11,046 matching tweets, i.e. #Leave votes were outnumbered by #Remain ones again. 
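For readers unfamiliar with proximity searches, the following sketch shows one way to test whether a set of words co-occurs within an 8-word window. It is a simplified stand-in, not Mimir's query semantics: the tokeniser is a crude regular expression, word order is ignored, and the helper name and example tweets are made up.

```python
import re

# Crude word tokeniser, for illustration only.
TOKEN_RE = re.compile(r"[A-Za-z']+")


def within_window(text, terms, window=8):
    """True if all terms occur (in any order) inside some run of `window` tokens."""
    tokens = [t.lower() for t in TOKEN_RE.findall(text)]
    wanted = {t.lower() for t in terms}
    for start in range(len(tokens)):
        if wanted <= set(tokens[start:start + window]):
            return True
    return False


print(within_window("I voted to remain in the EU today", ["I", "voted", "remain"]))
print(within_window(
    "I think that all of my friends and family and neighbours voted remain",
    ["I", "voted", "remain"],
))
```

The second example illustrates why the window matters: the three words are all present, but too far apart to suggest that the author is reporting their own vote.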

These statistics are in line with the findings of other studies of Twitter #EUReferendum posts. For instance, the #EURef Data Hub (by the Press Association, Twitter, and Blurrt) showed Remain tweets dominating over Leave tweets on Jun 23rd, but not on 22nd and earlier, or (unsurprisingly) since. 

It must be noted that, similar to the Ontotext study, the #EURef Data Hub statistics are derived from tweets referencing either the Leave or Remain campaigns, but not necessarily showing explicit support or voting intent. 

However, as discussed in my earlier post, if we were to try and draw conclusions on the likely outcome based on tweets alone, then we need a more reliable Leave/Remain sample, indicative of actual support/self-declared voting intentions.

So now let's see if the same trend is present there.


#Leave / #Remain Voting Intentions Based on Our Classification Heuristic


Following on from my previous study of the overall characteristics of tweets posted on June 23rd, I separated again the tweets into original tweets, replies, and retweets.

I applied our classification heuristic for reliable identification of #Leave/#Remain posts to all tweets posted on or after 13:00 BST on June 22nd, but before voting closed at 22:00 BST on June 23rd. 

As a result, I found just over 100 thousand tweets from 22nd: 39 thousand advocating Remain and  61 thousand - Leave. 

On June 23rd, as Twitter activity picked up significantly (also observed by #EURef Data Hub), I found 291 thousand matching tweets. Unlike other studies, however, our voting intent heuristic identified 164 thousand tweets advocating Leave and only 127 thousand advocating Remain.

Therefore, even though voting tweets from @Brndstr and tweet volume statistics from #EURef Data Hub both indicate that Remain was dominant, this trend wasn't supported in our voting intention sample. 

Now let us examine the trends over time, separately for original tweets, replies, and retweets.

The graph below shows that indeed #Remain tweets were dominant in the early hours of June 23rd, but not before or after. What is particularly interesting is that #Remain tweets start to fall sharply from around 4pm, whereas #Leave ones pick up sharply a little later. By the time polls close at 10pm, tweets advocating #Leave are more than double the ones supporting #Remain.      




Reply tweets show a largely different pattern (see graph below), where replies advocating #Leave consistently outnumber those advocating #Remain (at times by up to 2.5 times). This is a trend which we also observed earlier in June. It indicates that #Leave advocates were much more engaged in the Twitter debates than the #Remain ones.

It should be noted also that the trend observed in original tweets in late afternoon and evening of June 23rd is also evident here, i.e. replies advocating #Remain start to fall, while replies advocating #Leave increase. 

Lastly, I show below the trends in re-tweets, where again #Leave advocates dominate the debate, by re-tweeting much more than #Remain ones.  Again, I already observed this trend earlier in June. 


What Have We Learnt? 


Having looked at tweets on 23rd, using @brndstr and “I voted XX” both gave  Remain a majority over Leave, but using our classification heuristic, the opposite was true (i.e. Leave was the more likely winner).

Given the conflicting evidence based on the same set of tweets, it is easy to see why others failed to predict the overall majority correctly.

I must also highlight here that my own analysis was never aimed at being predictive. Instead, I am trying to understand how people engaged, debated, and wrote about the referendum on social media. 

In particular, as the referendum clearly showed, older voters tend to vote in higher proportions than young ones and thus, they were those that ultimately determined the overall outcome.  That older generation, however, is well known for being under-represented on Twitter, and also probably less aware of @Brndstr and similar services, which explains why these gave the wrong trends. 

In future research I would like to explore whether representativeness on Twitter is the full story, and whether this matters for political discussions. Do the younger generation actually talk more or less about politics than the older generation? Also, older people aside - were Brexiters (i.e. people supporting Leave) over- or under-represented on Twitter, as compared to Bremainers (i.e. voters supporting Remain)?   

In order to get more accurate answers to these questions, as demonstrated here, it is important to identify actual tweets indicative of specific voting intentions or votes already cast. The largely predominant approach of simply counting tweets mentioning hashtags is not sufficiently accurate as it does not distinguish tweets simply referring to a stance/campaign, from tweets actually advocating a stance/campaign. 

As part of subsequent research, I plan to also collect a gold standard of human-annotated tweets, where people will be asked to mark tweets indicating actual support and voting intent separately from tweets which simply mention the Leave/Remain campaigns. This will enable me to quantify how the different sampling strategies affect the accuracy of voting trends over time.


Thanks to:

Dominic Rout, Ian Roberts, Mark Greenwood, Diana Maynard, and the rest of the GATE Team 

Any mistakes are my own.

Thursday, 23 June 2016

#InOrOut: Today's #EURef Debate on Twitter


So what did the #EUReferendum debate look like today? Is Twitter still voting #Leave as it did back in May? What were the main hashtags and user mentions in today's tweets?


Tweet Volumes

A record-breaking 1.9 million tweets were posted today on the #InOrOut #EUReferendum, which is between three and six times the daily volumes observed earlier in June. On average, this is 21 tweets per second over the day, although the peaks of activity occurred after 9am (see graphs below). 1.5 million of those tweets were posted during poll opening times. In that period, only 3,300 posts were inaccessible to us due to Twitter rate limits.

Since the polls closed at 10pm tonight, there has been a huge surge in Twitter activity, with over 60,000 posts between 10pm and 11pm alone. Twitter rate limits meant that we could not access another 6,000 posts from that period. Since this is only 10% of the overall data in this hour, we still have a representative sample for our analyses.

Amongst the 1.9 million posts, over 1 million (57%) were retweets and 94 thousand (5%) - replies. These proportions of retweets and replies are consistent with patterns observed earlier in June.   

Tweets, Re-tweets, and Replies: #Leave or #Remain


Let's start by looking at original tweets, i.e. tweets which have been posted by their authors and are not a reply to another tweet or a retweet. I refer to the authors of those tweets as the OPs (Original Posters), following terminology adopted from online forums.

My analysis of voting intentions showed some conflicting findings, depending on the way used to sample tweets  (details and trend graphs here). 

The gist is that, using @brndstr and “I voted XX” patterns both gave Remain a majority over Leave, but using our voting intention classification heuristic, the opposite was true (i.e. Leave was the more likely winner).  

In retweets, the #Leave proponents were more vocal than the #Remain ones.


The difference is particularly pronounced for replies,  where #Leave proponents are engaging in more debates than #Remain ones. Nevertheless, with replies constituting only 5% of all tweets today, the echo chamber effect observed earlier in June still remains unchanged. 

#InOrOut, #Leave, #Remain and Other Popular Hashtags

Interestingly, 75% of all tweets today (1.4 million) contained at least one hashtag. This is a very significant increase on the 56.5% observed several days ago. 


Some of the most popular hashtags  remain unchanged from earlier in June. These refer to the leave and remain campaigns, immigration, NHS, parties, media, and politicians. Interestingly, there is now increased interest in #forex and #stocks, as predictors of the likely outcome. 


Most Mentioned Users Today: What is @Brndstr?


Last for tonight, I compared the most frequently mentioned Twitter users in original tweets from today (see above) against those most mentioned earlier in June. The majority of popular mentioned users remains unchanged, with a mix of campaign Twitter accounts, media, and key political leaders.

The most prominent difference is that @Brndstr (Bots for Brands) came top (mentioned in over 14 thousand tweets), followed by @YouTube with 3 thousand mentions. Other new, frequently mentioned accounts today were Avaaz, DanHannanMEP, BuzzFeedUK, and realDonaldTrump.


So What Does This Tell Us?


The #InOrOut #EUReferendum has attracted unprecedented tweet volumes on poll day, with a significantly higher proportion of hashtags than previously. This seems to suggest that Twitter users are trying to get their voices heard and spread the word far and wide, well beyond the bounds of their normal follower  network. 


There are some exciting new entrants in the top 30 most mentioned Twitter accounts in today's referendum posts. I will analyse these in more depth tomorrow. For now, good night!  


Thanks to:

Dominic Rout, Ian Roberts, Mark Greenwood, Diana Maynard, and the rest of the GATE Team 

Any mistakes are my own.

Identifying A Reliable Sample of Leave/Remain Tweets

This post is the second in the series on the Brexit Tweet Analyser.

Having looked at tweet volumes and basic characteristics of the Twitter discourse around the EU referendum, we now turn to the method we chose for identifying a reliable, even if incomplete, sample of leave and remain tweets.

We have no ground truth, and we are not trying to predict whether leave or remain is leading. Instead, we are interested in identifying a reliable, if incomplete, subset of tweets, so that we can analyse the topics discussed and the active users within it.



Are Hashtags A Reliable Predictor of Leave/Remain Support?

As discussed in our earlier post, over 56% of all tweets on the referendum contain at least one hashtag. Some of these are actually indicative of support for the leave/remain campaigns, e.g. #votetoleave, #voteout, #saferin, #strongertogether. Then there are also hashtags which try to address undecided voters, e.g. #InOrOut, #undecided, while promoting either a remain or leave vote but not through explicit hashtags.

A recent study of EU referendum tweets by Ontotext, carried out over tweets in May 2016,  classified tweets as leave or remain on the basis of approximately 30 hashtags. Some of those were associated with leave, the rest -- with remain, and each  tweet was classified as leave or remain based on whether it contains predominantly leave or predominantly remain hashtags. 

Based on analysing manually a sample of random tweets with those hashtags, we found that this strategy does not always deliver a reliable assessment, since in many cases leave hashtags are used as a reference to the leave campaign, while the tweet itself is supportive of remain or neutral. The converse is also true, i.e. remain hashtags are used to refer to the remain stance/campaign. We have included some examples below. 

A more reliable, even if somewhat more restrictive, approach is to consider the last hashtag in the tweet as the most indicative of its intended stance (pro-leave or pro-remain). This results in a higher precision sample of remain/leave tweets, which we can then analyse in more depth in terms of topics discussed and opinions expressed. 
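The difference between the two strategies can be sketched in a few lines of Python. This is an illustration rather than our actual GATE implementation: the hashtag sets are a tiny subset built from hashtags mentioned in these posts (the full lists are not reproduced here), "last hashtag" is taken to mean the last hashtag occurring anywhere in the text, and the example tweets are invented.

```python
import re
from collections import Counter

# Illustrative subsets only; the real hashtag lists are longer.
LEAVE_TAGS = {"voteleave", "votetoleave", "voteout"}
REMAIN_TAGS = {"voteremain", "saferin", "strongertogether"}
HASHTAG_RE = re.compile(r"#(\w+)")


def stance_of(tag):
    """Map a single hashtag to 'leave', 'remain', or None."""
    tag = tag.lower()
    if tag in LEAVE_TAGS:
        return "leave"
    if tag in REMAIN_TAGS:
        return "remain"
    return None


def classify_by_majority(text):
    """Label a tweet by whichever stance has more hashtags (None on a tie)."""
    counts = Counter(s for s in map(stance_of, HASHTAG_RE.findall(text)) if s)
    if counts["leave"] > counts["remain"]:
        return "leave"
    if counts["remain"] > counts["leave"]:
        return "remain"
    return None


def classify_by_last_hashtag(text):
    """Label a tweet by its final hashtag only; None if that tag carries no stance."""
    tags = HASHTAG_RE.findall(text)
    return stance_of(tags[-1]) if tags else None


# Invented example: leave hashtags used as references, remain stance at the end.
tweet = "The #VoteLeave and #VoteOut claims on the NHS do not add up #StrongerTogether"
print(classify_by_majority(tweet), classify_by_last_hashtag(tweet))
```

The last-hashtag rule also leaves more tweets unlabelled (for example, a tweet ending in a neutral hashtag such as #EURef), which is the price paid for the higher precision.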

Using this approach, amongst the 1.9 million tweets between June 13th and 19th, 5.5% (106 thousand) were identified as supporting the Leave campaign, while 4% (80 thousand) - as supporting the Remain campaign. Taken together, this constitutes just under a 10% sample, which we consider sufficient for the purposes of our analysis. 

These results, albeit drawn from a smaller, high-precision sample, seem to indicate that the Leave campaign is receiving more coverage and support on Twitter, when compared to Remain. This is consistent also with the findings of the Ontotext study.

In subsequent posts we will look into the most frequently mentioned hashtags, the most active Twitter users, and the topics discussed in the Remain and Leave samples separately. 


What about #Brexit in particular?   

The recent Ontotext study on May 2016 data used #Brexit as one of the key hashtags indicative of leave. Others have also used #Brexit in the same fashion.


In our more recent 6.5 million tweets (dated between 1 June and 19 June 2016), just under 1.7 million contain the #Brexit hashtag (26%). However, having examined a random sample of those manually (see examples below), we established that while many tweets did use #Brexit to indicate support for leave, there were also many cases where #Brexit referred to the referendum, or the leave/remain question, or the Brexit campaign as a whole. We have provided some such examples at the end of this blog post. We also found a sufficient number of examples where #Brexit appears at the end of tweets while still not indicating support for voting leave. 

Therefore, we chose to distinguish the #Brexit hashtag from all other leave hashtags and tagged tweets with a final #Brexit tag separately. This enables us, in subsequent analyses, to compare findings with and without considering #Brexit.  



Example Remain/Leave Hashtag Use













Example Brexit Tweets








Thanks to:

Dominic Rout, Ian Roberts, Mark Greenwood, Diana Maynard, and the rest of the GATE Team 

Any mistakes are my own.

Friday, 17 June 2016

Introducing the Brexit Analyser: real-time Twitter analysis with GATE

The GATE team has been busy lately with building the real-time Brexit Analyser. It analyses tweets related to the forthcoming EU referendum, as they come in, in order to track the referendum debate unfolding on Twitter. This research is being carried out as part of the SoBigData project.



The work follows on from our successful collaboration with Nesta on the Political Futures Tracker, which analysed tweets in real-time in the run up to the UK General Election in 2015. 

Unlike others, we do not try to predict the outcome of the referendum or answer the question of whether Twitter can be used as a substitute for opinion polls. Instead, our focus is on a more in-depth analysis of the referendum debate: the people and organisations who engage in those debates, what topics are discussed and opinions expressed, and who the top influencers are.

What does it do?

It analyses and indexes tweets as they come in (i.e. in real time), in order to identify commonly discussed topics, opinions expressed, and whether a tweet is expressing support for remaining or leaving the EU. It must be noted that not all tweets have a clear stance and also that not all tweets express a clear voting intention (e.g. "Brexit & Bremain"). More on this in subsequent posts! 

In more detail, the Brexit Analyser uses text analytics and opinion mining techniques from GATE, in order to identify tweets expressing voting intentions, the topics discussed within, and the sentiment expressed towards these topics. Watch this space! 


The Data  (So Far)

We are collecting tweets based on a number of referendum related hashtags and keywords, such as #voteremain, #voteleave, #brexit, #eureferendum. 

The volume of original tweets, replies, and re-tweets per day collected so far is shown below. On average, this is close to half a million tweets per day (480 thousand), which is 1.6 times the tweets on 26 March 2015 (300,000), when the Battle For Number 10 interviews took place, in the run up to the May 2015 General Election.



In total, we have analysed just over 1.9 million tweets in the past 4 days, with 60% of those being re-tweets. On average, a tweet is re-tweeted 1.65 times. 

Subsequent posts will examine the distribution of original tweets, re-tweets, and replies specifically in tweets expressing a remain/leave voting intention.  

Hashtags: 1 million of those 1.9 million tweets contain at least one hashtag  (i.e. 56.5% of all tweets have hashtags). If only original tweets are considered (i.e. all replies and retweets are excluded), then there are 319 thousand tweets with hashtags amongst the original 678 thousand tweets (i.e. 47% of original tweets are hashtag bearing).

Analysing hashtags used in a Twitter debate is interesting, because they indicate commonly discussed topics, stance taken towards the referendum, and also key influencers. As they are easy to search for, hashtags help Twitter users participate in online debates, including other users they are not directly connected to.

Below we show some common hashtags on June 16, 2016. As can be seen, most are associated directly with the referendum and voting intentions, while others refer to politicians, parties, media, places, and events:




URLs: Interestingly, amongst the 1.9 million tweets only 134 thousand contain a URL (i.e. only 7%). Amongst the 1.1 million re-tweets, 11% contain a URL, which indicates that tweets with URLs tend to be retweeted more.

These low percentages suggest that the majority of tweets on the EU referendum are expressing opinions or addressing another user, rather than sharing information or providing external evidence. 

@Mentions: Indeed, 90 thousand (13%) of the original 678 thousand tweets contain a username mention. The 50 most mentioned users in those tweets are shown below. The size of the user name indicates frequency, i.e. the larger the text, the more frequently this username has been mentioned in tweets.

In subsequent posts we will provide information on the most frequently re-tweeted users and the most prolific Twitter users in the dataset. 



So What Does This Tell Us?


Without a doubt, there is a heavy volume of tweets on the EU referendum, published daily. However, with only 6.8% of all tweets being replies and over 58% -- re-tweets, this resembles more an echo chamber, rather than a debate.  

Pointers to external evidence/sources via URLs are scarce, as are user mentions. The most frequently mentioned users are predominantly media (e.g., BBC, Reuters, FT, the Sun, Huffington Post);  politicians playing a prominent role in the campaign (e.g. David Cameron,  Boris Johnson, Nigel Farage, Jeremy Corbyn); and campaign accounts created especially for the referendum (e.g. @StrongerIn, @Vote_Leave).    


Thanks to:

Dominic Rout, Ian Roberts, Mark Greenwood, Diana Maynard, and the rest of the GATE Team 



Tuesday, 14 January 2014

PHEME: A new project on computing the veracity of social media content

The London Eye was on fire during the 2011 England riots! Or was it? Social networks are rife with lies and deception, half-truths and facts. But irrespective of a meme's truthfulness, the rapid spread of such information through social networks and other online media can have immediate and far-reaching consequences. In such cases, large amounts of user-generated content need to be analysed quickly, yet it is not currently possible to carry out such complex analyses in real time.

In the past week I've been very excited (and rather busy) with the start of a new European project, called PHEME (PhemeEU). The aim is to develop automatic methods to help people (e.g. journalists, health professionals, patients, government services) assess the truthfulness of information that is spreading through social networks and other online media.

With partners from seven different countries, the project will combine big data analytics with advanced linguistic and visual methods. The results will be suitable for direct application in medical information systems and digital journalism.

Veracity: The Fourth Challenge of Big Data

Social media poses three major computational challenges, dubbed by Gartner the 3Vs of big data: volume, velocity, and variety.

PHEME will focus on a fourth crucial, but hitherto largely unstudied, challenge: veracity.

While writing the proposal, I coined the term phemes to describe memes which are enhanced with truthfulness information. It is a reference also to Pheme - the Greek goddess of fame and rumours.

Identifying Phemes (Rumorous Memes) 

We are concentrating on identifying four types of phemes and modelling their spread across social networks and online media: speculation, controversy, misinformation, and disinformation. However, it is particularly difficult to assess whether a piece of information falls into one of these categories in the context of social media. The quality of the information here is highly dependent on its social context and, up to now, it has proven very challenging to identify and interpret this context automatically.

An Interdisciplinary Approach

PHEME has partners from the fields of natural language processing and text mining, web science, social network analysis, and information visualization. Together, we will use three factors to analyse veracity. The first is the information inherent in a document itself, that is, lexical, semantic and syntactic information. This is then cross-referenced with data sources that are assessed as particularly trustworthy, for example in the case of medical information, PubMed, the biggest online database in the world for original medical publications. Finally, the diffusion of a piece of information is analysed: who receives what information and from whom, and whether and to whom they pass it on.

"Rumor intelligence", that is, the ability to identify rumours in good time, will be tested, inter alia, in the area of medical information systems. For digital journalism, results will be tested with swissinfo.ch (the international service of the Swiss Broadcasting Corporation (SBC)) and Ushahidi's SwiftRiver media filtering and verification platform. The new technology will help journalists assess the veracity of user-generated content – an activity that is largely carried out manually to date, requiring significant resources. Other news organisations who have expressed support for the project are the BBC, the Guardian, and the German regional broadcasting corporation Südwestrundfunk.

So this is all going to be great - identifying rumours across social media and helping filter out the misinformation. Keep up with our progress - follow PHEME on Twitter!

Monday, 29 April 2013

EnviLOD: Lessons Learnt


The EnviLOD project demonstrated the benefits that location-based searches, enabled and underpinned by Linked Open Data (LOD) and semantic technologies, can have in terms of enabling improved retrieval of information. Although the semantic search tool developed through the EnviLOD project is not yet ‘production-ready’, it does demonstrate the benefits of this newly emerging technology. As such, it will be incorporated into the Envia ‘labs’ page of the Envia website, which is currently under development. Within Envia Labs, users of the regular Envia service will be able to experiment with and comment on tools that might eventually augment or be incorporated into the service, thus allowing the Envia project team to gauge their potential uptake by the user community.

We also worked on the automatic generation of semantically enriched metadata,  to accompany records within the Envia system. This aims to improve the discovery of information within the current Envia system by automatically generating keywords to be included in the article metadata based on the occurrences of terms from the GEMET, DBpedia, and GeoNames vocabularies. A pipeline for this to be incorporated into the Envia system in a regular and sustainable manner is already under way.

One particularly important lesson learnt from this short-term project is that availability of large amounts of content, open to text mining and experimentation, needs to be ensured from the very beginning of the project. In EnviLOD there were copyright issues with the majority of environmental science content at the British Library, which limited the experimental system to just over one thousand documents. Due to this limited content, users were not always able to judge how comprehensive or accurate the semantic search was, especially if compared against results offered by Google. Since the British Library is now planning to integrate the EnviLOD semantic enrichment tools within the advanced Envia Labs functionality, future work on this tool could potentially be evaluated on this more comprehensive data, through the Envia system.

Another important lesson learnt from the research activities is that working with Linked Open Data is very challenging, not only in terms of data volumes and computational efficiency, but also in terms of data noise and robustness. In terms of noise, an initial evaluation of the DBpedia-based semantic enrichment pipeline revealed that relevant entity candidates were not included initially, because in the ontology they were classified as owl:Thing, whereas we were considering instances of specific sub-classes (e.g. Person, Place). There are over 1 million unclassified instances in the current DBpedia snapshot. In terms of computational efficiency, we had to introduce memory-based caches and efficient data indexing, in order to make the entity linking and disambiguation algorithm sufficiently efficient to process data in near real-time. Lastly, deploying the semantic enrichment on a server, e.g. at Sheffield or at the British Library, is far from trivial, since both OWLIM and our algorithms require large amounts of RAM and computational power. Parallelising the computation to more than three threads is an open challenge, due to the difficulties experienced with parallelising OWLIM. Ontotext are currently working on cloud-based, scalable deployments, so future projects would be able to solve the scalability issue effectively.

Lastly, the quantitative evaluation of our DBpedia-based semantic enrichment pipeline was far from trivial. It required us to annotate manually a gold standard corpus of environmental science content (100 documents were annotated with disambiguated named entities). However, releasing these to other researchers has proven to be practically impossible, due to the copyright and licensing restrictions imposed by the content publishers on the British Library. In a related project, we have now developed a web-based entity annotation interface, based on CrowdFlower. This will enable future projects to create gold standards in an easier fashion, based on copyright-free content. Ultimately, during development we made use of available news and similar corpora created by TAC-KBP 2009 and 2010, which we used for algorithm development and testing in EnviLOD, prior to final quantitative evaluation on the copyrighted BL content. So even though the aims of the project were achieved and a useful running pilot system was created, publishing the results in scientific journals has been hampered by these content issues.

In conclusion, we fully support the findings of the JISC report on text mining  that copyright exemption for text mining research is necessary, in order to fully unlock the benefits of text mining to scientific research.