Evolutionary algorithms and weighting strategies for feature selection in predictive data mining
Abstract
Improvements in deoxyribonucleic acid (DNA) microarray technology mean
that thousands of genes can be profiled simultaneously, quickly and efficiently.
DNA microarrays are increasingly being used for prediction and early diagnosis
in cancer treatment. Feature selection and classification play a pivotal role in this
process. The correct identification of an informative subset of genes may directly
lead to putative drug targets. These genes can also be used as an early diagnostic or
predictive tool. However, the large number of features (many thousands) in
a typical dataset presents a formidable barrier to feature selection efforts.
Many approaches have been presented in the literature for feature selection in such
datasets. Most of them use classical statistical approaches (e.g. correlation). These
approaches, although fast, are incapable of detecting non-linear interactions
between features of interest. Evolutionary Algorithms (EAs), by contrast,
are inherently capable of taking non-linear interactions into account. EAs are therefore
very promising for feature selection in such datasets.
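
As a minimal illustration of this limitation (a constructed toy example, not data or code from the thesis), consider two binary features whose exclusive-or determines the class label. Each feature on its own has zero Pearson correlation with the label, so a per-feature correlation ranking would discard both, yet the pair taken together predicts the label perfectly. The names `pearson`, `samples` and `joint_accuracy` are hypothetical and introduced only for this sketch.

```python
from itertools import product
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


# Toy dataset (assumption): every combination of two binary features,
# repeated 25 times, labelled by their exclusive-or.
samples = list(product([0, 1], repeat=2)) * 25
labels = [a ^ b for a, b in samples]

# Each feature considered alone is uncorrelated with the label.
r1 = pearson([s[0] for s in samples], labels)
r2 = pearson([s[1] for s in samples], labels)

# Considered jointly, the two features determine the label: a lookup
# table keyed on the feature pair classifies every sample correctly.
table = {}
for s, label in zip(samples, labels):
    table.setdefault(s, []).append(label)
hits = sum(
    max(set(table[s]), key=table[s].count) == label
    for s, label in zip(samples, labels)
)
joint_accuracy = hits / len(samples)

print(r1, r2, joint_accuracy)
```

A search method that scores whole feature subsets, as an EA does, can in principle reward the pair even though neither member scores well individually.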
It has been shown that dimensionality reduction increases the efficiency of feature
selection in large and noisy datasets such as DNA microarray data. The two-phase
Evolutionary Algorithm/k-Nearest Neighbours (EA/k-NN) algorithm is a promising
approach that carries out initial dimensionality reduction as well as feature selection
and classification.
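
The general idea of coupling an EA with a k-NN classifier can be sketched as follows. This is a generic, single-phase wrapper under stated assumptions, not the two-phase EA/k-NN algorithm itself, whose details are not given in this abstract: candidate solutions are binary feature masks, fitness is leave-one-out k-NN accuracy on the selected features, and variation is a single bit-flip mutation with elitist truncation selection. All function names, parameters and the toy data generator are hypothetical.

```python
import random


def knn_loo_accuracy(X, y, mask, k=3):
    """Leave-one-out accuracy of k-NN using only features where mask is 1."""
    idx = [j for j, m in enumerate(mask) if m]
    if not idx:
        return 0.0  # an empty feature subset cannot classify anything
    correct = 0
    for i, xi in enumerate(X):
        # Squared Euclidean distance to every other sample, nearest first.
        dists = sorted(
            (sum((xi[j] - xo[j]) ** 2 for j in idx), y[o])
            for o, xo in enumerate(X)
            if o != i
        )
        votes = [label for _, label in dists[:k]]
        pred = max(set(votes), key=votes.count)  # majority vote
        correct += pred == y[i]
    return correct / len(X)


def evolve_mask(X, y, generations=15, pop_size=12, seed=0):
    """Evolve a binary feature mask; returns (best_mask, best_fitness)."""
    rng = random.Random(seed)
    n = len(X[0])
    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    pop[0] = [1] * n  # include the all-features mask as a baseline

    def fitness(mask):
        return knn_loo_accuracy(X, y, mask)

    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]  # elitist: the best half survives
        children = []
        for parent in parents:
            child = parent[:]
            child[rng.randrange(n)] ^= 1  # flip one randomly chosen bit
            children.append(child)
        pop = parents + children
    best = max(pop, key=fitness)
    return best, fitness(best)


def make_toy(n_samples=40, n_noise=5, seed=1):
    """Toy data (assumption): feature 0 is informative, the rest are noise."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n_samples):
        label = i % 2
        informative = 2.0 * label + rng.gauss(0, 0.3)
        X.append([informative] + [rng.gauss(0, 1) for _ in range(n_noise)])
        y.append(label)
    return X, y


X, y = make_toy()
best_mask, best_acc = evolve_mask(X, y)
baseline_acc = knn_loo_accuracy(X, y, [1] * len(X[0]))
print(best_mask, best_acc, baseline_acc)
```

Because the all-features mask is placed in the initial population and the best individuals always survive, the evolved mask can never score below that baseline on this fitness measure. A real microarray study would need held-out data to guard against overfitting the selection to the training samples.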
This thesis further investigates the two-phase EA/k-NN algorithm and introduces
an adaptive weights scheme for the k-NN classifier.
It also introduces a novel weighted centroid classification technique and a
correlation-guided mutation approach. Results show that the weighted centroid approach
is capable of outperforming the EA/k-NN algorithm across five large biomedical
datasets. The thesis also identifies promising new areas of research that would complement
the techniques introduced and investigated.
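
The abstract does not define weighted centroid classification, so the sketch below shows only one plausible reading, stated as an assumption: each class is summarised by the mean of its training samples, and a query is assigned to the class whose centroid is nearest under a squared Euclidean distance in which every feature carries its own weight. In such a scheme the weights could be the quantities an EA optimises; the formulation in the thesis may differ. The function names and the four-sample dataset are hypothetical.

```python
def class_centroids(X, y):
    """Return {label: per-feature mean of the samples with that label}."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}


def predict(x, centroids, weights):
    """Assign x to the class with the nearest centroid under weighted distance."""

    def weighted_dist(centroid):
        return sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, centroid))

    return min(centroids, key=lambda label: weighted_dist(centroids[label]))


# Tiny constructed dataset (assumption): feature 0 separates the classes
# cleanly, while feature 1 varies widely within each class.
X = [[0.0, 1.0], [0.2, 3.0], [1.0, -1.0], [1.2, -3.0]]
y = [0, 0, 1, 1]
centroids = class_centroids(X, y)

query = [0.2, -3.0]  # close to class 0 on feature 0, misleading on feature 1
pred_weighted = predict(query, centroids, [1.0, 0.0])  # feature 1 switched off
pred_uniform = predict(query, centroids, [1.0, 1.0])   # both features equal
print(pred_weighted, pred_uniform)  # prints "0 1"
```

The example shows how the weight vector alone changes the decision: with feature 1 down-weighted to zero the query falls to class 0, whereas uniform weights let the high-variance feature dominate and push it to class 1.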