Investigating features and techniques for Arabic authoriship attribution
Abstract
Authorship attribution is the problem of identifying the true author of a disputed text. Throughout history, there have been many examples of this problem concerned with revealing genuine authors of works of literature that were published anonymously, and in some cases where more than one author claimed authorship of the disputed text. There has been considerable research effort into trying to solve this problem. Initially these efforts were based on statistical patterns, and more recently they have centred on a range of techniques from artificial intelligence. An important early breakthrough was achieved by Mosteller and Wallace in 1964 [15], who pioneered the use of ‘function words’ – typically pronouns, conjunctions and prepositions – as the features on which to base the discovery of patterns of usage relevant to specific authors.
The authorship attribution problem has been tackled in many languages, but predominantly in the English language. In this thesis the problem is addressed for the first time in the Arabic Language. We therefore investigate whether the concept of functions words in English can also be used in the same way for authorship attribution in Arabic. We also describe and evaluate a hybrid of evolutionary algorithms and linear discriminant analysis as an approach to learn a model that classifies the author of a text, based on features derived from Arabic function words. The main target of the hybrid algorithm is to find a subset of features that can robustly and accurately classify disputed texts in unseen data. The hybrid algorithm also aims to do this with relatively small subsets of features. A specialised dataset was produced for this work, based on a collection of 14 Arabic books of different natures, representing a collection of six authors. This dataset was processed into training and test partitions in a way that provides a diverse collection of challenges for any authorship attribution approach.
The combination of the successful list of Arabic function words and the hybrid algorithm for classification led to satisfying levels of accuracy in determining the author of portions of the texts in test data. The work described here is the first (to our knowledge) that investigates authorship attribution in the Arabic knowledge using computational methods. Among its contributions are: the first set of Arabic function words, the first specialised dataset aimed at testing Arabic authorship attribution methods, a new hybrid algorithm for classifying authors based on patterns derived from these function words, and, finally, a number of ideas and variants regarding how to use function words in association with character level features, leading in some cases to more accurate results.