Publication Date


Advisor(s) - Committee Chair

Dr. Huanjing Wang, Director, Dr. Guangming Xing, Dr. Qi Li

Degree Program

Department of Mathematics and Computer Science

Degree Type

Master of Science


Data mining involves the use of data analysis tools to discover previously unknown, valid patterns and relationships from large amounts of data stored in databases, data warehouses, or other information repositories. Feature selection is an important preprocessing step of data mining that helps increase the predictive performance of a model. The main aim of feature selection is to choose a subset of features with high predictive information and eliminate irrelevant features with little or no predictive information. Using a single feature selection technique may generate local optima.

In this thesis we propose an ensemble approach for feature selection, where multiple feature selection techniques are combined to yield more robust and stable results. Ensemble of multiple feature ranking techniques is performed in two steps. The first step involves creating a set of different feature selectors, each providing its sorted order of features, while the second step aggregates the results of all feature ranking techniques. The ensemble method used in our study is frequency count which is accompanied by mean to resolve any frequency count collision.

Experiments conducted in this work are performed on the datasets collected from Kent Ridge bio-medical data repository. Lung Cancer dataset and Lymphoma dataset are selected from the repository to perform experiments. Lung Cancer dataset consists of 57 attributes and 32 instances and Lymphoma dataset consists of 4027 attributes and 96 ix instances. Experiments are performed on the reduced datasets obtained from feature ranking. These datasets are used to build the classification models. Model performance is evaluated in terms of AUC (Area under Receiver Operating Characteristic Curve) performance metric. ANOVA tests are also performed on the AUC performance metric. Experimental results suggest that ensemble of multiple feature selection techniques is more effective than an individual feature selection technique.


Computer Sciences | Databases and Information Systems