Classification of Red Blood Cells using Principal Component Analysis Technique




Abstract—Principal component analysis (PCA) is a feature-reduction technique that reduces the correlation among features. In this research, a novel approach is proposed by applying the PCA technique to various morphologies of red blood cells (RBCs). According to hematologists, this method successfully classified 40 different types of abnormal RBCs. The classification of RBCs into distinct subtypes using three machine learning algorithms is important in clinical and laboratory tests for detecting blood diseases. The most common abnormal RBCs are associated with anemia. The RBC features are sufficient to identify the type of anemia and the disease that caused it. We found that several features extracted from RBCs in blood smear images are not significant for classification when observed independently but are significant when combined with other features. The number of feature vectors is reduced from 271 to 8, the time consumed in training decreased accordingly, and the classification accuracy increased to 98%.

Index Terms—Principal Component Analysis; Feature Reduction; Red Blood Cell Image Characteristics; Machine Learning.

I. INTRODUCTION
Analysing and processing medical images has great significance, as it helps in identifying and treating various blood diseases and in performing clinical studies. These imaging techniques help doctors and biologists reach a diagnosis [1]. Peripheral blood includes both normal and abnormal RBCs, where abnormality refers to a variation in the shape, color, or size of RBCs. Normal and abnormal RBCs are distinguished in images by their external edge and central pallor area. Variation in the morphology of RBCs is an indication of different types of blood diseases [2]. Therefore, further examination involving anemia classification on the images is required. The many variations found in the morphology of RBC images make them difficult for machines to detect because of their similarity in shape, size, and colour [3]. The varied morphology of red blood cells (RBCs) produces a huge number of features. The number of features is kept minimal when a larger decimation factor is applied [4]. This means that as the quality of the analysis is reduced, low classification accuracy is achieved. The classification is based on machine learning algorithms with a feature selection or feature reduction process. Reducing redundant and irrelevant features from a data set with correlated variables avoids errors on the validation data set [5]. The process of decimation has proved useful, as it provides acceptable classification accuracy. The qualitative measure of the performance of a supervised classification algorithm is its accuracy with as few training samples as possible [6].

Published on February 22, 2019. Jameela Ali Alkrimi, teacher, is with the University of Babylon, College of Dentistry, Babylon, Iraq (jameela_ali65@yahoo.com). Loay E. George, University of Baghdad, College of Science; he is now an Assistant Professor with the Department of Remote Sensing, Baghdad, Iraq (loayedwar57@yahoo.com).
Principal Component Analysis (PCA) is a statistical technique used for image recognition in almost all scientific disciplines [7]. PCA is one of the feature selection methods used in exploratory factor analysis to estimate the loading matrix. It seeks loading values that bring the estimate of the total communality as close as possible to the total of the observed variances [8].
PCA is based on analysis of the covariance matrix. It requires the conversion of an image matrix into a column vector, which makes determination of the covariance matrix difficult [9]. It involves a linear transformation of the data such that, in the new coordinate frame, the projection of the data has its greatest variance along the first axis (called the first principal component), its second greatest variance along the second principal component, and so on. The components that retain most of the variance are kept for representing the data, while higher-order ones are discarded. The number of principal components used depends on the level of accuracy needed to reconstruct the original data set [10].
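The transformation described above can be sketched directly from the covariance matrix. The following is a minimal illustration on synthetic data (the data, sizes, and function name are illustrative, not from the paper):

```python
import numpy as np

def pca(X, n_components):
    """PCA via eigendecomposition of the covariance matrix.

    X: (n_samples, n_features) data matrix.
    Returns the projected data and the fraction of variance retained.
    """
    Xc = X - X.mean(axis=0)                  # center each feature
    cov = np.cov(Xc, rowvar=False)           # feature covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]        # sort descending by variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    projected = Xc @ eigvecs[:, :n_components]
    retained = eigvals[:n_components].sum() / eigvals.sum()
    return projected, retained

rng = np.random.default_rng(0)
# toy data: 3 informative dimensions embedded in 10 correlated features
latent = rng.normal(size=(200, 3))
X = latent @ rng.normal(size=(3, 10)) + 0.01 * rng.normal(size=(200, 10))
Z, retained = pca(X, 3)
print(Z.shape, round(retained, 3))
```

Because the ten observed features are linear mixtures of three latent variables, the first three principal components recover nearly all of the variance, and the higher-order components carry only noise.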

II. LITERATURE REVIEW
Nandi in 2015 presented a survey of the applications of PCA to various medical images, and the results obtained were shown to prove their efficacy [11]. Few studies have been done on RBC classification using PCA and machine learning algorithms [12]. Park in 2016 used three machine learning algorithms to classify abnormal RBCs [13]. Nazlibilek in 2015 used the PCA method for classification of five types of white blood cells [14]. Wheeless in 1994 compared different machine learning classifiers on abnormal RBCs [15]. Vincent in 2014 used the PCA technique to propose a novel approach to classify normal and abnormal WBCs for early detection of leukemic diseases [16]. D. L. Omucheni in 2014 used PCA to perform data dimensionality reduction and to enhance score images for visualization, as a feature extraction through clusters in score space [17]. K. Jaferzadeh in 2017 applied the k-means clustering method to cluster RBCs into two groups of young and old RBCs using PCA [18]. J. Prinyakupt in 2015 proposed a system to locate white blood cells within microscopic blood smear images, segment them into nucleus and cytoplasm regions, extract suitable features, and apply principal component analysis (PCA) to group and classify WBCs into two groups [19]. Jia W. in 2018 studied the differences in Raman spectra of red blood cells (RBCs) among patients with β-thalassemia using PCA [20]. M. U. Ali in 2017 reported that bioinformatics data are high dimensional, so PCA is used to select attributes of the data to obtain accurate results in classification, prediction, clustering, and pattern extraction, which are useful techniques of data mining [21]. Sharma in 2012 reported that no literature regarding anemic RBC feature selection had yet been published [22]. Thus, the research that we have performed is the leading study related to this topic.

III. METHODOLOGY

A. Input Dataset of RBCs
In this study, 100 different anemic blood smear slides are used. The slides were collected from the Hematology Unit of the Pathology Department, Faculty of Medicine, Serdang Hospital. An Olympus BX43 U-CAM D3 photo imaging microscope (Japan) was used to transform peripheral blood smear slides into digital images at the Faculty of Medicine, SEGI University, Malaysia. Each image includes 18 to 28 individual normal and abnormal RBCs, as shown in Fig. 1.

Fig. 1. Anemic blood smear image

B. Feature Extraction
Our proposed system consists of several steps to properly segment an individual RBC from the digital blood smear image. A total of 1000 individual normal and anemic RBCs were obtained from the segmentation processes. Feature extraction is a difficult, complex process due to the similarities among RBCs: some cells must be examined by several features, including their size, area, shape, and internal configuration, to distinguish RBCs that have different central pallors but similar size and shape, as shown in Fig. 2.
This paper aims to obtain a hybrid of statistical, spectral texture, and geometric features. The statistical features can be based on first-, second-, or higher-order statistics of the gray level of an image. In the case of spectral methods, textures are defined by the spatial frequencies of the color image bands, which are red, green, and blue. Fourier descriptors are extracted as geometrical features to identify the shape of RBCs. The varied RBC morphologies require extracting different features for each type of RBC.
A total of 271 features are extracted from the 1000 individual RBCs. These features produced a new data set called FRBCs. The features include some redundancies; that is, some of the variables are correlated with one another because they measure the same construct. Thus, reducing the observed variables to a small number of artificial variables (principal components) is important; these will account for most of the variance in the observed variables.

C. Feature Reduction Using PCA
The new FRBCs dataset includes 271 features from 40 types of 1000 samples. The motivations for using feature reduction are as follows: first, to improve the prediction performance of the predictors; second, to provide rapid and cost-effective predictors; and finally, to provide an improved analysis of the underlying process that generated the data in the ML algorithms.
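A reduction of this kind can be sketched with scikit-learn. The FRBC data are not public, so the snippet below uses synthetic stand-in data with the same dimensions (1000 samples, 271 features, 8 underlying factors); the variable names are illustrative only:

```python
# Sketch of a 271-feature -> 8-component reduction on synthetic stand-in data.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# 1000 samples whose 271 correlated features derive from 8 latent factors
factors = rng.normal(size=(1000, 8))
mixing = rng.normal(size=(8, 271))
X = factors @ mixing + 0.05 * rng.normal(size=(1000, 271))

Xs = StandardScaler().fit_transform(X)   # z-score each feature first
pca = PCA(n_components=8).fit(Xs)
X_reduced = pca.transform(Xs)
print(X_reduced.shape)                   # (1000, 8)
print(round(pca.explained_variance_ratio_.sum(), 3))
```

Since the correlated features are generated from eight factors, eight components capture nearly all of the variance, mirroring the redundancy argument above.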
In this work, Statistical Package for the Social Sciences (SPSS) version 21 was used for the analyses of FRBCs data.A PCA statistical model includes two main phases, namely, data preparation and component extraction using SPSS version 21.Each phase consists of many steps as shown in Fig. 3.
The PCA technique is based on backward feature reduction. The backward selection process starts with all the variables (features) and removes them one by one, at each step eliminating the one that decreases the error the most, until any further removal would increase the error significantly. To reduce overfitting, the error referred to above is the error on a validation set that is distinct from the training set. In the data preparation phase, the process starts with normalization of the data set using the standard deviation (z-score) method. This method places all feature values in the same range of values. However, replicated data are not required. The data ranged between ±1 as the outputs of (1):

z_i = (x_i − x̄) / σ      (1)

where x = (x1, x2, ..., xn), x̄ and σ are the mean and standard deviation of x, and z_i is the ith normalized value. The normalization step aims to reduce the significant variations in the ranges of raw-data values across several ML algorithms. The majority of classifiers calculate the distance between two points; when one of the features has a broad range of values, that feature dominates the distance.
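The normalization step of (1) is a one-liner in practice. A minimal sketch (the example values are made up, e.g. a hypothetical cell-area feature in pixels):

```python
import numpy as np

def zscore(x):
    """Normalization of (1): subtract the mean, divide by the standard deviation."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

# A feature with a broad range would dominate Euclidean distance before scaling.
area = np.array([3800.0, 4200.0, 5100.0, 2900.0])   # hypothetical cell areas
z = zscore(area)
print(np.round(z, 3))
print(round(z.mean(), 6), round(z.std(), 6))         # mean 0, std 1
```

After scaling, every feature contributes on a comparable scale to the distance computations used by the classifiers.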
A normality test was applied to the data set obtained from the scaling process, using the Gaussian distribution of (2):

f(x) = (1 / (σ√(2π))) e^(−(x−μ)² / (2σ²))      (2)

where μ is the mean, σ is the standard deviation, and x is the independent variable.
Results of the Gaussian test of the FRBC dataset are illustrated in Fig. 4. The shape of the distribution shows that 99.7% of the FRBC feature values range from −0.4 to 0.8 and 68% are in the range from 0 to 0.4. This shows that the data have a normal distribution. Before applying PCA to the FRBC data, we first need to test whether the FRBC data are suitable for reduction, using the Kaiser-Meyer-Olkin test.
The null hypothesis that the correlation matrix of the FRBC data set is an identity matrix is tested using Bartlett's test of sphericity, as shown in Table I. Subsequently, the communalities are extracted for each feature. The communalities indicate how much of the variance in each original feature is explained by the extracted components; high communalities are desirable. The communality ĥ_i for the ith feature is the sum of the squared loadings for that feature, computed using (3):

ĥ_i = Σ_{j=1}^{m} λ_ij²      (3)

where m is the total number of components, λ_ij is the loading of feature i on component j, and j indexes the components that the feature loads on. The summaries of the 271 variances with the relative importance of each feature are shown in Table III.
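Both of these computations are straightforward to sketch. The example below is hypothetical: a tiny made-up loading matrix stands in for the 271-feature FRBC loadings, and a small correlated data set stands in for the full data when computing Bartlett's chi-square statistic:

```python
# Illustration of (3) and Bartlett's test of sphericity on stand-in data.
import numpy as np
from scipy import stats

def communalities(loadings):
    """h_i = sum over components j of loading_ij^2, i.e. equation (3)."""
    return (loadings ** 2).sum(axis=1)

def bartlett_sphericity(X):
    """Test H0: the correlation matrix is an identity matrix."""
    n, p = X.shape
    R = np.corrcoef(X, rowvar=False)
    chi2 = -(n - 1 - (2 * p + 5) / 6) * np.log(np.linalg.det(R))
    dof = p * (p - 1) / 2
    return chi2, stats.chi2.sf(chi2, dof)

rng = np.random.default_rng(1)
base = rng.normal(size=(300, 1))
# five strongly correlated stand-in features
X = np.hstack([base + 0.3 * rng.normal(size=(300, 1)) for _ in range(5)])
chi2, p_value = bartlett_sphericity(X)
print(p_value < 0.05)        # correlated data: reject the identity hypothesis

loadings = np.array([[0.9, 0.1], [0.2, 0.8]])
print(np.round(communalities(loadings), 2))   # [0.82 0.68]
```

A small p-value means the correlation matrix is far from the identity, so the features are correlated enough for a reduction such as PCA to be worthwhile.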
The components are extracted after the FRBC data set is prepared. In the component extraction phase, Cattell's scree test was used to select the number of components to retain; it involves plotting the eigenvalues of the components and examining the plot to find the point where the curve changes direction.

The recommendation is that all components above the elbow should be retained, because they contribute the most to explaining the variance in the dataset. Cattell's scree test of the FRBC data set is shown in Fig. 5. From Fig. 5, components 8, 9, 10, and 11 are extremely close together. This condition indicates a strong correlation among the components and can be resolved by using the varimax orthogonal rotation method, which was applied to remove the correlation between components. Fig. 6 shows the assembly of features in the components before and after 11 varimax rotation passes. Fig. 7 shows the scree plot of the FRBCs dataset after the rotation process. A threshold value of 0.60 for the loading component matrix was used to build the sample structure matrix [16]. The sample structure indicates that the components are arranged by the number of features they carry, with the first component including the maximum number of features and explaining the most variance. Fig. 8 shows the loading component matrix before and after rotation. After the sample structure loading matrix is obtained, the estimation score of the PCA model was tested by using the Bartlett method for the orthogonal component solution, where ESS is the estimation of the score model and ML is the maximum loading variable in the factor.
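Scree-based retention can be sketched as follows. Here the elbow is approximated by the Kaiser criterion (retain components with eigenvalue above 1 on standardized data), which is a common proxy rather than the paper's visual inspection, and the data are synthetic stand-ins for the FRBC set:

```python
# Sketch of scree-based component retention via the Kaiser criterion.
import numpy as np

def scree_eigenvalues(X):
    """Eigenvalues of the correlation matrix, sorted in descending order."""
    R = np.corrcoef(X, rowvar=False)
    return np.sort(np.linalg.eigvalsh(R))[::-1]

rng = np.random.default_rng(7)
# 12 observed features generated from 3 latent factors plus noise
factors = rng.normal(size=(500, 3))
mixing = rng.normal(size=(3, 12))
X = factors @ mixing + 0.3 * rng.normal(size=(500, 12))

eigvals = scree_eigenvalues(X)
n_retain = int((eigvals > 1.0).sum())   # eigenvalue > 1 as an elbow proxy
print(n_retain)
```

The eigenvalues above the elbow correspond to the latent factors; the remaining small, flat eigenvalues are the noise floor that the scree plot's tail represents.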

D. Evaluation of the PCA model
The evaluation of the PCA model was performed according to the identity and degree-of-rotation (DoR) matrices. The identity matrix of the component data set is shown in Tables IV and V.
The PCA models were validated by applying the Cronbach's alpha coefficient criterion to test the reliability of PCA on the new component data set, as shown in Table VI.

IV. CLASSIFICATION METHODS
The classification process was performed using different ML algorithms to avoid bias and to test the ability of the features and components to recognize different types of RBCs. The three ML algorithms were run in the Weka software. Three supervised ML algorithms with different properties were applied: the first is a radial basis function neural network (RBFNN), the second is a discriminative model, the support vector machine (SVM), and the third is the instance-based k-nearest neighbor (KNN) algorithm. The purpose of using three algorithms is to enable a fair comparison among them and to measure the efficiency of the extracted components.
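A comparison of this shape can be sketched outside Weka. The snippet below uses scikit-learn on synthetic 8-dimensional data as a stand-in for the component data set; scikit-learn has no RBFNN, so a small MLP is used as a rough substitute, which is an assumption of this sketch, not the paper's setup:

```python
# Sketch of a three-classifier comparison on synthetic component-like data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=600, n_features=8, n_informative=6,
                           n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "RBF-like net": MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000,
                                  random_state=0),
    "SVM": SVC(kernel="rbf", random_state=0),
    "KNN": KNeighborsClassifier(n_neighbors=5),
}
scores = {name: m.fit(X_tr, y_tr).score(X_te, y_te) for name, m in models.items()}
for name, acc in scores.items():
    print(f"{name}: {acc:.2f}")
```

Fitting all three models on the same train/test split is what makes the accuracy comparison fair.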
Component efficiency was evaluated by assessing the ability of each algorithm to isolate the normal RBCs from the abnormal ones. The confusion matrix was used to evaluate the performance of the classification models through discrimination metrics. In addition, evaluation was performed with accuracy, sensitivity, specificity, F-measure, and area under the receiver operating characteristic (ROC) curve, which are common metrics for comparing ML classification algorithms.
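The metrics named above all derive from the binary confusion matrix. A minimal sketch with made-up labels and scores (1 = abnormal RBC; none of these numbers come from the paper):

```python
# Confusion-matrix metrics for a hypothetical normal-vs-abnormal split.
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, roc_auc_score

y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])    # 1 = abnormal RBC
y_pred = np.array([0, 0, 0, 1, 1, 1, 1, 1, 0, 1])
scores = np.array([.1, .2, .3, .6, .7, .8, .9, .65, .4, .85])  # model scores

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy    = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)           # true-positive rate (recall)
specificity = tn / (tn + fp)           # true-negative rate
f_measure   = f1_score(y_true, y_pred)
auc         = roc_auc_score(y_true, scores)   # area under the ROC curve
print(accuracy, round(sensitivity, 3), specificity, round(f_measure, 3))
```

Accuracy alone can mislead when the classes are imbalanced, which is why sensitivity, specificity, F-measure, and AUC are reported alongside it.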

V. EXPERIMENTS AND RESULTS
Feature reduction is computed using the PCA technique for all features of normal and anemic RBCs. The PCA technique reduced 271 features to 8 components, which created a new data set, CRBCs; the new components are named as shown in Table VII. Each component contains at least two features with a high loading value and a minimum variance of 0.90, as shown in Fig. 7 above. The accuracy of the three ML classifiers on the feature and component data sets is shown in Table VIII. The evaluation of the classification performance of the three ML algorithms is presented in Table IX, whereas the ROC curves for the three ML classification algorithms are shown in Figs. 9 and 10.

VI. CONCLUSION AND FUTURE SCOPE
This work mainly aimed to establish new features of RBCs that can distinguish normal from abnormal RBCs. Through different types of ML methods, we achieved improved precision and accuracy in interpreting normal and anemic RBC images. According to the experimental results, we can conclude that several features extracted from anemic RBC images are not significant when observed independently but improve the classification accuracy when combined with other features. In addition, the reduction in training time was clearly visible for the RBFNN algorithm. The high classification results and the closeness of the components indicate their strength.

Fig. 8. The loading component matrix of the new data set before and after rotation

The DoR is measured by the average diameter of the component transformation matrix. It is the efficiency indicator of the rotation process for obtaining orthogonal components. Yong (2013) indicated that the value of DoR should be more than 0.70. The component transformation matrix is presented in Table

Fig. 9. ROC curves of the three ML algorithms for the FRBCs data set

TABLE I :
Pearson's test was used to test the correlation coefficient matrix, as shown in Table II.

TABLE II :
PART OF THE CORRELATION MATRIX FOR FRBCS DATA SET

TABLE III :
DESCRIPTIVE SUMMARIES OF VARIANCE AND RELATIVE IMPORTANCE FOR EACH FEATURE

TABLE IV :
THE COMPONENT SCORE COVARIANCE MATRIX

TABLE VI :
THE CRONBACH'S ALPHA COEFFICIENT OF THE COMPONENTS

TABLE VII :
THE EIGHT RBCS COMPONENT TYPES

TABLE VIII :
THE COMPARISON BETWEEN THE THREE ML ALGORITHMS (columns: Algorithm, Data set, Accuracy %, Time in seconds)

TABLE IX :
THE EVALUATION RESULTS OF THE ML ALGORITHMS