Data mining of many-attribute data: investigating the interaction between feature selection strategy and statistical features of datasets
Abstract
Many datasets contain a very large number of attributes (e.g. many thousands).
Such datasets can cause serious problems for machine learning methods. Various
feature selection (FS) strategies have been developed to address these problems. The
idea of an FS strategy is to reduce the number of features in a dataset (e.g. from many
thousands to a few hundred) so that machine learning and/or statistical analysis can be
done much more quickly and effectively. Naturally, FS strategies attempt to select
the features that are most important for the machine learning task at hand.
The work presented in this dissertation concerns the comparison between several
popular feature selection strategies, and, in particular, investigation of the interaction
between feature selection strategy and simple statistical features of the dataset. The
basic hypothesis, not previously investigated, is that the correct choice of FS strategy
for a particular dataset should be based on at least a simple statistical analysis of the
dataset.
First, we examined the performance of several strategies on a selection of datasets.
Strategies examined were: four widely-used FS strategies (Correlation, ReliefF,
Evolutionary Algorithm, no-feature-selection), several feature bias (FB) strategies (in
which the machine learning method considers all features, but makes use of bias
values suggested by the FB strategy), and also combinations of FS and FB strategies.
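The thesis does not give an implementation of a feature-bias strategy at this point; as an illustration only, one common way to realise the idea is to keep every feature but weight its influence inside the learner, for instance in a distance-based classifier. The following sketch assumes a 1-nearest-neighbour learner and illustrative weights; all names and data here are hypothetical, not taken from the thesis.

```python
def weighted_distance(x, y, weights):
    """Squared Euclidean distance in which each feature's contribution
    is scaled by a bias weight supplied by the FB strategy."""
    return sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, y))

def predict_1nn(train, labels, query, weights):
    """1-nearest-neighbour prediction under the biased distance:
    all features are retained, but low-weight features barely
    influence which training point is nearest."""
    dists = [weighted_distance(x, query, weights) for x in train]
    return labels[dists.index(min(dists))]

# Two training points and a query (illustrative data).
train = [(0.0, 5.0), (1.0, 0.0)]
labels = ["a", "b"]
query = (0.9, 4.5)

# With uniform weights, feature 2 dominates the distance and the
# nearest neighbour is (0.0, 5.0).
print(predict_1nn(train, labels, query, (1.0, 1.0)))   # -> a

# An FB strategy that down-weights feature 2 lets feature 1 decide,
# flipping the prediction -- without discarding any feature.
print(predict_1nn(train, labels, query, (1.0, 0.01)))  # -> b
```

The point of the sketch is the contrast with FS: an FS strategy would remove feature 2 outright, whereas an FB strategy merely reduces its influence.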
The results showed that FB methods performed strongly on some datasets
and that combined strategies were also often successful.
Examining these results, we noted that patterns of performance were not immediately
understandable. This led to the above hypothesis (one of the main contributions of
the thesis) that statistical features of the dataset are an important consideration when
choosing an FS strategy. We then investigated this hypothesis with several further
experiments. Analysis of the results revealed that a simple statistical feature of a
dataset, which can be easily pre-calculated, has a clear relationship with the
performance of certain FS methods, and a similar relationship with differences in
performance between certain pairs of FS strategies.

Silang Luo PHD-06-2009
In particular, correlation-based feature selection (CFS) is a very widely-used FS
technique, based on the hypothesis that good feature sets contain features that are
highly correlated with the class, yet uncorrelated with each other. Analysis of the
outcomes of several FS strategies on different artificial datasets suggests that CFS is
never the best choice for poorly correlated data.
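The CFS hypothesis just stated is usually formalised as a merit score that rewards high average feature–class correlation and penalises feature–feature redundancy. The sketch below follows the standard CFS merit formulation; the function name and the example correlation values are illustrative, not drawn from the thesis.

```python
import math

def cfs_merit(feat_class_corrs, feat_feat_corrs):
    """CFS merit of a feature subset:
        merit = k * r_cf / sqrt(k + k*(k-1) * r_ff)
    where k is the subset size, r_cf the mean |feature-class|
    correlation and r_ff the mean |feature-feature| correlation.

    feat_class_corrs: |correlation| of each selected feature with the class.
    feat_feat_corrs:  |correlation| of each pair of selected features
                      (empty for a single-feature subset).
    """
    k = len(feat_class_corrs)
    r_cf = sum(feat_class_corrs) / k
    r_ff = (sum(feat_feat_corrs) / len(feat_feat_corrs)
            if feat_feat_corrs else 0.0)
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)

# A subset of features that correlate with the class but not with each
# other scores higher than an equally class-correlated but redundant one.
good = cfs_merit([0.8, 0.7, 0.75], [0.10, 0.20, 0.15])
redundant = cfs_merit([0.8, 0.7, 0.75], [0.90, 0.85, 0.90])
print(good > redundant)  # -> True
```

This also makes the abstract's observation plausible: when a dataset's features correlate only weakly with the class, the numerator of the merit score stays small for every candidate subset, giving CFS little signal to discriminate between subsets.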
Finally, considering several methods, we suggest tentative guidelines for choosing an
FS strategy based on simply calculated measures of the dataset.