Extracting business performance signals from Twitter news
Abstract
Social media and social networks underpin a revolution in communication
between people, with the particular feature that much of that communication is open to
all. This provides a massive pool of data that researchers can exploit for a wide
variety of applications. Data from Twitter is of particular interest in this respect,
given its large global usage and the availability of APIs and other tools that enable
easy access to the publicly available stream of tweets. Owing to the wide public
penetration of Twitter, many businesses make use of it to share their latest news,
effectively using Twitter as a gateway to connect to end-users, consumers and/or
investors.
In this thesis, we focus on the potential for extracting information from Twitter that is
relevant to the financial and competitiveness status of a business. We consider a collection
of well-regarded Twitter accounts that are known for communicating recent business
news, and we investigate the automated analysis of the stream of tweets from these
sources, with a view to learning business-relevant information about specific companies.
A key aspect of our approach is to target specific areas of business performance;
we explore three such areas: productivity, competitiveness, and industrial
risk. We propose a two-step model which first classifies a tweet into one of these areas,
and then assigns a sentiment value (on a positive/negative scale). The resulting sentiment
values across specific aspects represent novel business indicators that could add
significant value to the toolset used by business analysts. Our experiments are based on a
new manually pre-classified data set (available from a URL provided).
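The control flow of the two-step model described above can be sketched as follows. This is a minimal illustration rather than the classifiers used in the thesis: the keyword sets, the word lists, the [-1, 1] score range, and the names two_step_classify, toy_topic_model and toy_sentiment_model are all assumptions introduced here as stand-ins for trained models.

```python
from typing import Callable, Tuple

# The three business-performance areas named in the abstract.
ASPECTS = ("productivity", "competitiveness", "industrial_risk")


def two_step_classify(
    tweet: str,
    topic_model: Callable[[str], str],
    sentiment_model: Callable[[str], float],
) -> Tuple[str, float]:
    """Step 1: assign the tweet to an area. Step 2: assign a sentiment value."""
    aspect = topic_model(tweet)
    if aspect not in ASPECTS:
        raise ValueError(f"unknown aspect: {aspect!r}")
    return aspect, sentiment_model(tweet)


# Hypothetical keyword lists: placeholders for a trained topic classifier.
_TOPIC_KEYWORDS = {
    "productivity": {"output", "production"},
    "competitiveness": {"rival", "market"},
    "industrial_risk": {"strike", "recall"},
}


def toy_topic_model(tweet: str) -> str:
    words = set(tweet.lower().split())
    # Pick the area with the most keyword hits; ties go to the first in ASPECTS.
    return max(ASPECTS, key=lambda a: len(words & _TOPIC_KEYWORDS[a]))


# Hypothetical sentiment lexicon: a placeholder for a trained sentiment model.
_POSITIVE = {"rises", "gains", "record"}
_NEGATIVE = {"falls", "loses", "halts"}


def toy_sentiment_model(tweet: str) -> float:
    words = tweet.lower().split()
    pos = sum(w in _POSITIVE for w in words)
    neg = sum(w in _NEGATIVE for w in words)
    total = pos + neg
    # Score in [-1, 1]; 0.0 when no lexicon word is present.
    return 0.0 if total == 0 else (pos - neg) / total
```

Keeping the two stages as separate callables means either stage can be swapped for a different learner without changing the other.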
Additionally, we propose n-grams made from non-contiguous words as a novel feature to
enhance performance in this context. Experiments involving a range of feature selection
methods show that these new features provide valuable benefits in comparison with
standard n-gram features.
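To illustrate what an n-gram built from non-contiguous words might look like, the sketch below generates skip-grams: ordered word tuples in which a bounded number of intervening words may be skipped. The exact construction used in the thesis is not given in this abstract, so the function name skip_ngrams and the max_skip parameter are assumptions made for illustration.

```python
from itertools import combinations
from typing import List, Sequence, Tuple


def skip_ngrams(
    tokens: Sequence[str],
    n: int = 2,
    max_skip: int = 1,
    include_contiguous: bool = False,
) -> List[Tuple[str, ...]]:
    """Return ordered n-word tuples that skip at most max_skip words in total."""
    if n < 2 or max_skip < 0:
        raise ValueError("need n >= 2 and max_skip >= 0")
    grams = []
    for start in range(len(tokens)):
        # All chosen positions must fall inside a window of n + max_skip tokens.
        window_end = min(start + n + max_skip, len(tokens))
        for rest in combinations(range(start + 1, window_end), n - 1):
            idx = (start,) + rest
            contiguous = idx[-1] - idx[0] == n - 1
            if contiguous and not include_contiguous:
                continue  # standard n-gram, not a non-contiguous one
            grams.append(tuple(tokens[i] for i in idx))
    return grams
```

For the tokens of "profits did not rise", with n=2 and max_skip=1, this yields the pairs ("profits", "not") and ("did", "rise"), neither of which appears among the standard bigrams.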
We also introduce the concept of an extra layer added to the primary classifier, with the
role of filtering out noisy tweets before they enter the system. We use a One-Class SVM
for this purpose.
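A filtering layer of this kind could be sketched with scikit-learn's OneClassSVM, trained only on tweets known to be relevant and then used to discard tweets it flags as outliers. The TF-IDF features, the RBF kernel, the value of nu, and the helper names fit_noise_filter and keep_relevant are assumptions for illustration, not the configuration reported in the thesis.

```python
from typing import List, Sequence, Tuple

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import OneClassSVM


def fit_noise_filter(
    relevant_tweets: Sequence[str], nu: float = 0.1
) -> Tuple[TfidfVectorizer, OneClassSVM]:
    """Fit a one-class model on relevant tweets only (no noise examples needed)."""
    vectorizer = TfidfVectorizer(lowercase=True)
    features = vectorizer.fit_transform(relevant_tweets)
    # nu bounds the fraction of training tweets treated as outliers (assumed value).
    detector = OneClassSVM(kernel="rbf", gamma="scale", nu=nu)
    detector.fit(features)
    return vectorizer, detector


def keep_relevant(
    tweets: Sequence[str], vectorizer: TfidfVectorizer, detector: OneClassSVM
) -> List[str]:
    """Keep tweets predicted as inliers (+1); drop predicted outliers (-1)."""
    labels = detector.predict(vectorizer.transform(tweets))
    return [tweet for tweet, label in zip(tweets, labels) if label == 1]
```

Only the tweets returned by keep_relevant would then be passed on to the primary classifier.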
Broadly, we show that the methods developed in this thesis achieve promising results in
both topic and sentiment classification in the business performance context, suggesting
that Twitter can indeed be a useful source of signals related to different aspects of
business performance. We also find that our system can provide valuable insight into
unseen test data. However, more research is needed to extract robust signals for
industrial risk, and there appears to be considerable promise for further development.