1 Introduction
Metaheuristic algorithms, which are usually nature-inspired, have become popular in recent years because of their efficiency in solving challenging optimization problems. Nature-inspired algorithms are often developed by mimicking the behavior of species in nature (e.g., humans, insects, birds).
Recently, metaheuristic algorithms have been widely applied to genomic data for selecting suitable genetic markers (genes) to classify various complex diseases [5].
Genome-wide association studies (GWAS) show that a variety of diseases are characterized by single nucleotide polymorphisms (SNPs) [18]. An SNP is a type of genetic variation in which a single nucleotide is substituted in the DNA sequence of the human genome [7]. SNP datasets have recently been used for the categorization of various complex diseases. They are known for their high dimensionality and contain a large amount of noise: a typical dataset consists of a very large number of features (SNPs) and a very small sample size. In machine learning this is known as the "curse of dimensionality" problem, and for such datasets it is not easy to build an efficient classifier [12].
To classify complex diseases from SNP datasets, one has to select the most discriminative features (SNPs), a task for which robust feature selection techniques are used in computational intelligence [6]. When high-dimensional genomic data are fed directly to a traditional classifier for disease diagnosis, very low accuracy is obtained.
The main objective of feature selection in SNP datasets is to minimize the number of features while maximizing classification performance. Metaheuristic algorithms can efficiently search the entire feature space to find the most informative features. In optimization terms, the feature selection process ends when the objective function approaches the optimum. Many metaheuristic methods have been successfully applied to genomic datasets for feature selection [8].
SNPs are excellent genetic markers for many complex diseases, so our aim is to exploit the interactions and relationships between SNPs to enhance classification performance. Because of the high dimensionality and large feature space of SNP datasets, combinations of intelligent feature selection techniques are used to obtain higher classification accuracy for disease prognosis. Anekboon et al. [2], for example, used a hybrid FS method combining three techniques: CBFS as a filter, with K-NN and ANN in the wrapper phase.
In this paper, we use the CMIM algorithm as a filter to reduce the large feature space, selecting a small feature subset on the basis of relevance and redundancy. The chosen subset is then provided as input to the proposed IBCS (Improved Binary Cuckoo Search) algorithm to find the most informative features, which are deeply related to the disease.
The IBCS algorithm is an improved version of the binary cuckoo search algorithm with two main objectives: 1) retain only the useful features from the feature subset, and 2) maximize the predictive accuracy. The proposed technique is compared with several FS techniques previously used on SNP datasets, as well as with some metaheuristic algorithms newly applied to the selected SNP datasets.
The paper is organized as follows: Section 2 describes the proposed technique, the Improved Binary Cuckoo Search and CMIM algorithms. Section 3 discusses the methodology and results, and Section 4 concludes.
CMIM Algorithm
CMIM is a filter FS technique proposed by Fleuret [9] in 2004. The CMIM algorithm selects features on the basis of conditional mutual information: it weighs the individual strength of a candidate feature against its redundancy with the features that have already been selected.
Let Y denote the class label, X_1, ..., X_F the candidate features, and \nu(1), ..., \nu(k) the indices of the features already selected. CMIM picks the next feature as

\nu(k+1) = \arg\max_{n} \; \min_{l \le k} \hat{I}(Y; X_n \mid X_{\nu(l)})

The above criterion means that a feature X_n receives a high score only if it carries information about Y that none of the already-selected features carries; a feature that is individually weak, or redundant with a selected feature, gets a low score.
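As an illustration of the CMIM selection rule, here is a minimal greedy sketch for discrete features; the plug-in information estimators and the function names are ours, not from the paper:

```python
import numpy as np
from itertools import product

def mutual_info(a, b):
    """Empirical mutual information I(a; b) for discrete arrays (in nats)."""
    mi = 0.0
    for va, vb in product(np.unique(a), np.unique(b)):
        p_ab = np.mean((a == va) & (b == vb))
        p_a, p_b = np.mean(a == va), np.mean(b == vb)
        if p_ab > 0:
            mi += p_ab * np.log(p_ab / (p_a * p_b))
    return mi

def cond_mutual_info(y, x, z):
    """Empirical conditional mutual information I(y; x | z)."""
    cmi = 0.0
    for vz in np.unique(z):
        mask = (z == vz)
        cmi += np.mean(mask) * mutual_info(y[mask], x[mask])
    return cmi

def cmim_select(X, y, k):
    """Greedy CMIM: pick the feature whose worst-case conditional MI,
    given every already-selected feature, is largest."""
    n_features = X.shape[1]
    # score[j] starts as the plain mutual information I(y; X_j)
    score = np.array([mutual_info(y, X[:, j]) for j in range(n_features)])
    selected = []
    for _ in range(k):
        best = int(np.argmax(score))
        selected.append(best)
        score[best] = -np.inf          # never pick the same feature twice
        for j in range(n_features):
            if np.isfinite(score[j]):
                # keep only the minimum over all selected conditioning features
                score[j] = min(score[j], cond_mutual_info(y, X[:, j], X[:, best]))
    return selected
```

Note how an exact duplicate of a selected feature drops to a score of zero after one update, which is precisely the redundancy-rejection behavior described above.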
CSA (Cuckoo Search Algorithm)
The CS algorithm belongs to the category of swarm intelligence; it is an optimization algorithm motivated by the obligate brood parasitism of cuckoo birds, which lay their eggs in other birds' nests [20]. The cuckoo has an exotic deception strategy: it replaces one egg of the host bird with its own. The color and pattern of the cuckoo egg resemble the host's eggs, and some cuckoo species have evolved to specialize in mimicking the eggs of a few specific host species [11].
This strategy helps the cuckoo egg hatch slightly ahead of the host's eggs, and because of the excellent mimicry the host often blindly throws its own eggs out of the nest, increasing the survival chances of the cuckoo chicks. The idea of using this strategy in optimization was proposed by Yang and Deb [23] in 2009, and it can be applied to numerous optimization problems.
The basic cuckoo search algorithm follows three simple rules:
R1: Each cuckoo lays its egg in a randomly chosen nest.
R2: The number of available nests is fixed, and the nests with the highest-quality eggs are carried over to the next generation.
R3: When the host bird identifies the cuckoo's egg, it has two choices: discard the alien egg, or abandon the nest and build a new one.
In the optimization scenario, each host nest represents a candidate solution to the problem. The first step of the algorithm is to randomly initialize the nests. The second step is to use the Lévy flight shown in Equations 1 and 2 as the global random walk that updates the positions of the nests [22, 21]:

x_i^{(t+1)} = x_i^{(t)} + \alpha \oplus \mathrm{Levy}(\lambda)    (1)

\mathrm{Levy}(\lambda) \sim u = t^{-\lambda}, \quad 1 < \lambda \le 3    (2)

where x_i^{(t)} is the position of nest i at iteration t, \alpha > 0 is the step-size scaling factor, and \oplus denotes entry-wise multiplication.
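For concreteness, one common way to realize the Lévy-flight update of Equations 1 and 2 is Mantegna's algorithm. The sketch below is illustrative: the step scale alpha = 0.01 and exponent beta = 1.5 are conventional defaults, not values taken from the paper.

```python
import numpy as np
from math import gamma, pi, sin

def levy_step(dim, beta=1.5, rng=None):
    """Draw a Lévy-distributed random step via Mantegna's algorithm."""
    if rng is None:
        rng = np.random.default_rng()
    sigma_u = (gamma(1 + beta) * sin(pi * beta / 2)
               / (gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))) ** (1 / beta)
    u = rng.normal(0.0, sigma_u, dim)   # numerator: wider normal
    v = rng.normal(0.0, 1.0, dim)       # denominator: standard normal
    return u / np.abs(v) ** (1 / beta)  # heavy-tailed step lengths

def update_nest(nest, best, alpha=0.01, rng=None):
    """Eq. 1: move a nest along a Lévy step scaled by its distance to the best nest."""
    return nest + alpha * levy_step(len(nest), rng=rng) * (nest - best)
```

The heavy-tailed steps mean most moves are small (local refinement) while an occasional move is very large (escaping local optima), which is exactly why Lévy flights are preferred over Gaussian random walks here.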
BCSA (Binary Cuckoo Search Algorithm)
The binary version of cuckoo search, known as the Binary Cuckoo Search (BCS) algorithm, is used in various feature selection settings [19]. Its purpose is to associate a binary vector with each nest that indicates whether each feature belongs to the final feature set, while the function to be maximized is the accuracy of a trained classifier.
To keep the solutions discrete, each dimension (egg) in BCS is bounded by the binary limits UB = 1 (upper bound) and LB = 0 (lower bound), so the solutions generated and updated always take values between LB and UB. Features are chosen through this binary vector, where '1' stands for a selected feature and '0' for an unselected one [16].
Positions are mapped onto the Boolean lattice through Equations 3 and 4:

S(x_i^j(t)) = \frac{1}{1 + e^{-x_i^j(t)}}    (3)

x_i^j(t+1) = \begin{cases} 1, & \text{if } S(x_i^j(t)) > \sigma \\ 0, & \text{otherwise} \end{cases}    (4)

where \sigma \sim U(0, 1) is a random threshold and x_i^j(t) denotes the j-th dimension (egg) of nest i at iteration t.
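Equations 3 and 4 amount to a sigmoid transfer followed by a random threshold; a minimal sketch (the function name is ours):

```python
import numpy as np

def binarize(positions, rng=None):
    """Map continuous nest positions to a binary feature mask:
    sigmoid transfer (Eq. 3), then a random threshold sigma ~ U(0,1) (Eq. 4)."""
    if rng is None:
        rng = np.random.default_rng()
    s = 1.0 / (1.0 + np.exp(-positions))       # Eq. 3: sigmoid transfer
    sigma = rng.uniform(size=positions.shape)  # one random threshold per dimension
    return (s > sigma).astype(int)             # Eq. 4: 1 = feature selected
```

Large positive positions are almost surely mapped to 1 and large negative positions to 0, while positions near zero are selected roughly half the time, which keeps the search stochastic.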
Cuckoo search uses the Lévy-flight strategy to explore the search space along straight paths with occasional sudden 90-degree turns [20], and it depends mostly on a random-walk search to jump from one region to another without thoroughly exploring each nest's neighborhood.
Consequently, CS suffers from drawbacks such as low optimization performance, slow convergence, and weak local search. To overcome these shortcomings, we propose the IBCS (Improved Binary Cuckoo Search) algorithm.
2 Proposed Methodology
The proposed methodology classifies SNP datasets for various complex diseases, as shown in Fig. 1.
First, the CMIM filter method reduces the large feature space to a small feature subset by discarding redundant and uninformative features. A fast implementation of the CMIM algorithm is applied instead of the standard one: it maintains feature scores during the selection process and computes the conditional mutual information only for features that are rich in information and low in redundancy. The top N features, which maximize the mutual information with the target class while remaining minimally redundant with each other, are picked iteratively.
After CMIM selects the N features, they are provided as the input dataset to the IBCS algorithm, which chooses the most relevant subset of features to increase the classification accuracy for predicting complex diseases.
IBCS Algorithm
The IBCS algorithm takes as its input the subset of N features output by the CMIM technique.
Since FS is a binary discrete problem and each nest represents a solution, the value of each nest is randomly initialized and converted into a binary vector as follows:
If the value of a nest dimension is greater than the binarization threshold, the corresponding feature is retained and its bit is set to '1'.
If the value is less than or equal to the threshold, the feature is discarded and its bit is set to '0'.
A further improvement concerns the step-size scaling factor \alpha of the Lévy flight, which controls how far a nest moves in each iteration and thereby balances global exploration against local exploitation.
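Putting the pieces together, the overall wrapper loop can be sketched as follows. This is an illustrative reconstruction only: the heavy-tailed Cauchy step, the abandonment fraction pa = 0.25, and the step scale 0.01 are stand-ins for the paper's exact IBCS operators and parameters.

```python
import numpy as np

def ibcs_sketch(evaluate, n_features, n_nests=10, n_iters=30, pa=0.25, seed=0):
    """Generic binary cuckoo-search loop (a sketch, not the paper's exact IBCS).
    evaluate(mask) -> fitness to maximize; mask is a 0/1 feature vector."""
    rng = np.random.default_rng(seed)
    # continuous nest positions; sigmoid + random threshold makes them binary
    nests = rng.normal(size=(n_nests, n_features))

    def to_mask(pos):
        return (1 / (1 + np.exp(-pos)) > rng.uniform(size=pos.shape)).astype(int)

    masks = np.array([to_mask(p) for p in nests])
    fits = np.array([evaluate(m) for m in masks], dtype=float)
    for _ in range(n_iters):
        best = nests[np.argmax(fits)]
        for i in range(n_nests):
            # heavy-tailed global step relative to the current best nest
            step = rng.standard_cauchy(n_features)
            cand_pos = nests[i] + 0.01 * step * (nests[i] - best)
            # abandon a fraction pa of dimensions (host discovers the egg)
            abandon = rng.uniform(size=n_features) < pa
            cand_pos[abandon] = rng.normal(size=abandon.sum())
            cand_mask = to_mask(cand_pos)
            f = evaluate(cand_mask)
            if f > fits[i]:                 # greedy replacement keeps fitness monotone
                nests[i], masks[i], fits[i] = cand_pos, cand_mask, f
    k = int(np.argmax(fits))
    return masks[k], fits[k]
```

In the paper's setting, `evaluate` would train the SVM on the features flagged by the mask and return the (fitness-weighted) cross-validated accuracy.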
Complex Disease Datasets
SNP datasets represent mutations in the nucleotide sequence at specific regions for a group of individuals. SNP microarray data from five Affymetrix Mapping 250K arrays are used in this study. The SNP array series are GSE9222 [14], GSE13117 [15], GSE67047 [13], GSE34678 [10] and GSE16125 [17], all downloaded from the NCBI GEO (Gene Expression Omnibus) database [3].
NCBI GEO is a free public repository of high-throughput genomic and next-generation sequencing data. Each dataset has two labels, case and control, representing affected and healthy individuals, and every sample records its genotype at specific loci. Four of the SNP arrays, GSE9222, GSE13117, GSE67047 and GSE34678, store their data in alphabetical (genotype-call) format, while GSE16125 contains real-valued data, as shown in Table 1.
| Dataset (GEO Series No.) | Information | No. of SNPs | No. of samples |
|---|---|---|---|
| GSE9222 | ASD (Autism) | 250,000+ | 567 |
| GSE13117 | Mental retardation | 250,000+ | 360 |
| GSE67047 | Thyroid cancer | 1,000,000 | 225 |
| GSE34678 | Colorectal cancer | 250,000+ | 124 |
| GSE16125 | Colon cancer | 250,000+ | 42 |
To apply classification and feature selection, the alphabetical datasets must first be converted into numerical format.
There are numerous ways to do this; in this article the genotype calls AB, AA, BB and No Call are encoded as 01, 11, 10 and 00, respectively. The results are compared with those achieved by [1] on the same SNP datasets.
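The encoding can be sketched as a simple lookup; the dictionary key spelling (e.g. "NoCall") and the function name are our own illustrative choices:

```python
# Genotype-call encoding used in this paper: each call becomes a two-bit code
# (AB -> 01, AA -> 11, BB -> 10, No Call -> 00).
GENOTYPE_CODES = {"AB": (0, 1), "AA": (1, 1), "BB": (1, 0), "NoCall": (0, 0)}

def encode_calls(calls):
    """Turn a list of genotype calls into a flat numeric feature vector."""
    bits = []
    for call in calls:
        bits.extend(GENOTYPE_CODES[call])
    return bits
```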
Classification and Evaluation Measures
An SVM (support vector machine) is used as the learner to calculate the performance measures of the proposed technique. The SVM is a supervised learning technique proposed by Boser et al. [4].
K-fold cross-validation is used, with k = 2 for datasets with fewer than 500 cases and k = 4 for datasets with more than 500 cases. With this choice, the proposed method has enough cases for both training and testing.
The performance of each fold is measured by the classification accuracy:

Accuracy = \frac{TP + TN}{TP + TN + FP + FN}

where TP = true positives, FP = false positives, TN = true negatives, and FN = false negatives. The overall performance is the average over all folds:

Accuracy_{avg} = \frac{1}{K} \sum_{k=1}^{K} Accuracy_k

where K is the total number of folds.
The proposed algorithm also uses an objective function that rewards high accuracy while penalizing large feature subsets. Let N be the total number of features and P the number of features selected by the metaheuristic technique; the fitness of a candidate solution x is then

F(x) = w \cdot Accuracy(x) + (1 - w) \cdot \frac{N - P}{N}

where the weight w \in [0, 1] balances classification accuracy against the size of the selected subset.
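The evaluation measures and the accuracy-versus-subset-size trade-off can be sketched as follows; the weight w = 0.9 is an illustrative choice, not a value taken from the paper:

```python
import numpy as np

def accuracy(tp, tn, fp, fn):
    """Per-fold accuracy from confusion-matrix counts."""
    return (tp + tn) / (tp + tn + fp + fn)

def average_accuracy(fold_counts):
    """Average accuracy over K folds; fold_counts is a list of (tp, tn, fp, fn)."""
    return float(np.mean([accuracy(*c) for c in fold_counts]))

def fitness(acc, n_selected, n_total, w=0.9):
    """Weighted wrapper-FS fitness: trade classification accuracy
    against the fraction of features kept (the weight w is illustrative)."""
    return w * acc + (1 - w) * (n_total - n_selected) / n_total
```

With w close to 1, accuracy dominates and the size penalty only breaks ties between subsets of similar accuracy, which matches the paper's emphasis on predictive performance.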
The proposed technique was executed in MATLAB 2018a on Windows 10, on an Intel dual-core i5 at 2.5 GHz with 8 GB RAM.
3 Results and Discussion
The following tables provide comparative results for the proposed and other feature selection algorithms applied to the datasets described above.
The reduced feature subsets of each SNP dataset after applying the fast CMIM filter are shown in Table 3.
| Dataset | Total No. of SNPs | SNPs selected by CMIM | No. of samples |
|---|---|---|---|
| GSE9222 | 250,000+ | 500 | 567 |
| GSE13117 | 250,000+ | 500 | 360 |
| GSE67047 | 1,000,000 | 1000 | 225 |
| GSE34678 | 250,000+ | 500 | 124 |
| GSE16125 | 250,000+ | 500 | 42 |
The fast CMIM technique ranks the features in decreasing order of score; we pick the top 500 or 1,000 features depending on the size of the feature space. Comparisons of different feature selection methods and the proposed method on the disease SNP datasets are shown in Tables 4, 5, 6, 7 and 8.
| GSE9222 (ASD) | SNPs (features) | Accuracy |
|---|---|---|
| ReliefF+SVM | 60 | 0.782 |
| CBFS+SVM | 10 | 0.643 |
| CMIM+SVM-RFE | 100 | 0.895 |
| RFS+SVM | 10 | 0.642 |
| Proposed (CMIM+IBCS) | 50 | 0.906 |
| GSE13117 (MR) | SNPs (features) | Accuracy |
|---|---|---|
| ReliefF+SVM | 30 | 0.781 |
| CBFS+SVM | 70 | 0.862 |
| CMIM+SVM-RFE | 50 | 0.850 |
| RFS+SVM | 10 | 0.731 |
| Proposed (CMIM+IBCS) | 60 | 0.905 |
| GSE34678 (CR) | SNPs (features) | Accuracy |
|---|---|---|
| ReliefF+SVM | 40 | 0.675 |
| CBFS+SVM | 60 | 0.812 |
| CMIM+SVM-RFE | 50 | 0.903 |
| RFS+SVM | 20 | 0.712 |
| Proposed (CMIM+IBCS) | 50 | 0.926 |
| GSE67047 (Thyroid cancer) | SNPs (features) | Accuracy |
|---|---|---|
| ReliefF+SVM | 40 | 0.708 |
| CBFS+SVM | 30 | 0.850 |
| CMIM+SVM-RFE | 50 | 0.841 |
| RFS+SVM | 50 | 0.708 |
| Proposed (CMIM+IBCS) | 60 | 0.872 |
| GSE16125 (Colon cancer) | SNPs (features) | Accuracy |
|---|---|---|
| ReliefF+SVM | 50 | 0.782 |
| CBFS+SVM | 60 | 0.754 |
| CMIM+SVM-RFE | 50 | 0.901 |
| RFS+SVM | 20 | 0.712 |
| Proposed (CMIM+IBCS) | 50 | 0.916 |
The comparison across all the tables shows that the proposed method outperforms every other feature selection method in terms of accuracy. The comparisons shown in the tables are based on the optimal SNP subset size: for the ASD, CR and colon cancer datasets the optimal subset is 50 SNPs, while for the thyroid cancer and MR datasets it is 60 SNPs. In some cases, such as the MR dataset, the accuracy of the proposed method rises by more than 4 percentage points.
After the feature subset was reduced using fast CMIM, the IBCS algorithm was also compared with other metaheuristic techniques used for feature selection: BPSO, BACO and BGA. In terms of classification performance, IBCS outperformed all the other algorithms in average accuracy when the proposed fitness function F(x) was used.
| Dataset | Objective function | BPSO | BACO | BGA | IBCS |
|---|---|---|---|---|---|
| GSE9222 (ASD) | Accuracy | 0.835 | 0.845 | 0.873 | 0.862 |
| GSE9222 (ASD) | F(x) | 0.852 | 0.848 | 0.871 | 0.906 |
| GSE67047 (Thyroid cancer) | Accuracy | 0.782 | 0.793 | 0.825 | 0.863 |
| GSE67047 (Thyroid cancer) | F(x) | 0.795 | 0.806 | 0.830 | 0.872 |
| GSE16125 (Colon cancer) | Accuracy | 0.819 | 0.804 | 0.836 | 0.893 |
| GSE16125 (Colon cancer) | F(x) | 0.824 | 0.816 | 0.847 | 0.916 |
4 Conclusion
The algorithm proposed in this article, combining CMIM with Improved Binary Cuckoo Search (IBCS) optimization, aims to classify complex diseases from SNP datasets.
The CMIM technique was used to reduce the large feature space of the SNP data; the IBCS algorithm was then applied to select the best features from the resulting subset and maximize classification performance.
Our experiments were conducted on five SNP datasets of complex diseases obtained from NCBI GEO, and the proposed method showed higher accuracy than all the compared methods: ReliefF, CBFS, RFS and CMIM+SVM-RFE.
After reduction of the feature space by CMIM, the subset used by IBCS was also compared with other metaheuristic methods, namely Binary Ant Colony Optimization, Binary Genetic Algorithm, and Binary Particle Swarm Optimization.
The results reveal that IBCS achieved better accuracy than these algorithms. A support vector machine was used to measure the performance of the feature subsets; however, other classifiers could also be used.
The obtained results show that whole-genome SNPs can be used to distinguish affected individuals from healthy ones.
In future work, the proposed algorithm can be applied to large biomedical datasets beyond SNP data and can be modified for feature selection in multiclass datasets. A large amount of work is still required to understand the genetic basis of complex diseases.