Tuesday, 7 August 2012

GATE is Getting Sentimental about Social Media

Over the past two years, Diana Maynard, other colleagues in the GATE team, and I have been working on a number of GATE-based sentiment analysis and opinion mining tools, specifically optimised for Twitter, blogs, comments, and other kinds of social media posts. The work has been part of the Arcomem and TrendMiner EC-funded projects, as well as my EPSRC fellowship on mining and summarisation of social media (grant EP/I004327/1).

Speaking from experience, doing opinion mining on social media is nothing but challenging. And in this paper Diana, Dominic, and I have tried to explain why. In a nutshell:

  • Most NLP tools do not come with a swear word plugin. As part of her work on the Arcomem project, Diana had fun collecting a suitable training corpus and a swear word list for sentiment detection.
  • "It's all Greek to me": less than 50% of all tweets are in English. Thanks to the plethora of GATE multilingual plugins, building a basic NLP pipeline wasn't as bad as it could have been.
  • Identifying relevant posts: there's more chaff than wheat out there, especially on Twitter. 
  • Twts r noizy: Normalisation and spelling correction are essential. It turns out that the perfect way to collect a training corpus of tweets for normalisation purposes is to search for Justin Bieber. 
  • Opinion target identification in tweets is...ahem...even more challenging than in longer texts (not that we have fully solved it there either).
  • And please do NOT get me started on negation
  • ...or context, time, space, and summarisation for that matter.
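
To give a flavour of what the normalisation step in the list above involves, here is a minimal Python sketch. It is not the GATE normaliser and not how our tools work internally: the SLANG table and the normalise function are made up purely for illustration, and a real system would need a far larger lexicon plus proper tokenisation and spelling correction.

```python
import re

# Hypothetical lookup table, invented for this example only.
SLANG = {"twts": "tweets", "r": "are", "noizy": "noisy", "u": "you", "gr8": "great"}


def normalise(tweet: str) -> str:
    """Lower-case words, squeeze long letter runs, and expand known slang."""
    out = []
    for token in tweet.split():
        # Leave mentions, hashtags, and links untouched.
        if token.startswith(("@", "#", "http")):
            out.append(token)
            continue
        word = token.lower()
        # Collapse runs of three or more identical characters to two ("sooooo" -> "soo").
        word = re.sub(r"(.)\1{2,}", r"\1\1", word)
        out.append(SLANG.get(word, word))
    return " ".join(out)


print(normalise("Twts r noizy"))
```

Note that this sketch splits on whitespace only, so a slang word with punctuation attached (say, "gr8!") would slip through unchanged. That is exactly the kind of detail that makes the real task fiddly.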

If you'd like to know more technical details, here's another paper on detecting political opinion in tweets with GATE.

If you wish to learn hands-on how to roll your own sentiment analyser, Diana will be giving a practical sentiment analysis tutorial with GATE at the forthcoming Sentiment Analysis Symposium in San Francisco, California, on October 29th, 2012.

Give us a shout if you need more info, and thanks for reading!

