ROS Theses Repository

View Item 
  •   ROS Home
  • Mathematical & Computer Sciences
  • Doctoral Theses (Mathematical & Computer Sciences)
  • View Item
  •   ROS Home
  • Mathematical & Computer Sciences
  • Doctoral Theses (Mathematical & Computer Sciences)
  • View Item
  •   ROS Home
  • Mathematical & Computer Sciences
  • Doctoral Theses (Mathematical & Computer Sciences)
  • View Item
  • Admin
JavaScript is disabled for your browser. Some features of this site may not work without it.

New techniques for Arabic document classification

View/Open
ChantarHKH_0913_macs.pdf (1.753Mb)
Date
2013-09
Author
Chantar, Hamouda Khalifa Hamouda
Metadata
Show full item record
Abstract
Text classification (TC) concerns automatically assigning a class (category) label to a text document, and has increasingly many applications, particularly in the domain of organizing, for browsing in large document collections. It is typically achieved via machine learning, where a model is built on the basis of a typically large collection of document features. Feature selection is critical in this process, since there are typically several thousand potential features (distinct words or terms). In text classification, feature selection aims to improve the computational e ciency and classification accuracy by removing irrelevant and redundant terms (features), while retaining features (words) that contain su cient information that help with the classification task. This thesis proposes binary particle swarm optimization (BPSO) hybridized with either K Nearest Neighbour (KNN) or Support Vector Machines (SVM) for feature selection in Arabic text classi cation tasks. Comparison between feature selection approaches is done on the basis of using the selected features in conjunction with SVM, Decision Trees (C4.5), and Naive Bayes (NB), to classify a hold out test set. Using publically available Arabic datasets, results show that BPSO/KNN and BPSO/SVM techniques are promising in this domain. The sets of selected features (words) are also analyzed to consider the di erences between the types of features that BPSO/KNN and BPSO/SVM tend to choose. This leads to speculation concerning the appropriate feature selection strategy, based on the relationship between the classes in the document categorization task at hand. The thesis also investigates the use of statistically extracted phrases of length two as terms in Arabic text classi cation. In comparison with Bag of Words text representation, results show that using phrases alone as terms in Arabic TC task decreases the classification accuracy of Arabic TC classifiers significantly while combining bag of words and phrase based representations may increase the classification accuracy of the SVM classifier slightly.
URI
http://hdl.handle.net/10399/2669
Collections
  • Doctoral Theses (Mathematical & Computer Sciences)

Browse

All of ROSCommunities & CollectionsBy Issue DateAuthorsTitlesThis CollectionBy Issue DateAuthorsTitles

ROS Administrator

LoginRegister
©Heriot-Watt University, Edinburgh, Scotland, UK EH14 4AS.

Maintained by the Library
Tel: +44 (0)131 451 3577
Library Email: libhelp@hw.ac.uk
ROS Email: open.access@hw.ac.uk

Scottish registered charity number: SC000278

  • About
  • Copyright
  • Accessibility
  • Policies
  • Privacy & Cookies
  • Feedback
AboutCopyright
AccessibilityPolicies
Privacy & Cookies
Feedback
 
©Heriot-Watt University, Edinburgh, Scotland, UK EH14 4AS.

Maintained by the Library
Tel: +44 (0)131 451 3577
Library Email: libhelp@hw.ac.uk
ROS Email: open.access@hw.ac.uk

Scottish registered charity number: SC000278

  • About
  • Copyright
  • Accessibility
  • Policies
  • Privacy & Cookies
  • Feedback
AboutCopyright
AccessibilityPolicies
Privacy & Cookies
Feedback