Diagnosis of Head and Neck Cancer in Developing Countries using a Stacked Ensemble Model
F. Akinbohun, Ambrose Akinbohun, Adekunle Daniel and Oghenerukevwe E. Oyinloye


Abstract-Head and neck cancers (HNC) arise when cells in the head and neck regions grow abnormally. The incidence of HNC is increasing owing to several factors. Patients often present late, which can result in loss of life (mortality), especially in Africa, where specialists are scarce. These challenges prompted the development of a stacked ensemble model for the diagnosis of HNC to facilitate prompt referral. The collected data consist of 1473 instances with 18 features. Information Gain was used to select important features, and three supervised learning algorithms were deployed as base learners: Decision Tree (C4.5), K-Nearest Neighbors and Naïve Bayes. The predictions of the base learners were combined and passed to a meta learner, Logistic Model Tree (LMT). The results showed that Information Gain with stacked LMT achieved an accuracy of 95.11%, which was higher than the accuracy of each base learner. Hence, this stacked model can be used for the diagnosis of HNC in healthcare systems.

I. INTRODUCTION
Head and neck cancers (HNC) are cancers that occur as a result of the uncontrolled growth of abnormal cells in the head and neck regions of the body. Cancers can affect any part of the human body; when they affect the head and neck regions, they are known as head and neck cancers. HNC can arise at various sites in these regions, including the nasopharynx, thyroid, larynx and sinonasal tract. Factors such as exposure to chemical or toxic compounds, ionizing radiation, pathogens and human genetics can predispose a person to HNC.
Patients with HNC present with symptoms that include weight loss, bleeding, swelling, hemoptysis and dyspnoea. Fatigue is a significant symptom in head and neck cancer patients [1]. These symptoms and signs are what we call the features of HNC. The features were used to predict the types of cancer in the head and neck regions.
A large number of people suffer from cancer [2]. Cancer, often referred to as malignant tumour, has killed many people worldwide [3]. It is a deadly disease that needs the utmost attention from both individuals and governments. The National Cancer Institute [4] reported that there are more than 550,000 cases of HNC worldwide and about 300,000 deaths each year. Primary health workers are the first point of contact for patients, especially in developing countries, and their ability to diagnose head and neck cancers is abysmally poor. This leads to delays in referral to specialists in tertiary hospitals.
Healthcare organizations are investing heavily in data analytics and clinical decision support environments to support population health management and value-based care. The healthcare industry holds a large volume of patient records from which information can be extracted to support medical research. These data can reduce the problems faced by primary health personnel. The use of a predictive model for the analysis of medical data helps in the diagnosis of HNC.
The application of machine learning to the diagnosis of head and neck cancer is a means of extracting patterns and knowledge from available data. Machine learning is of great significance in diagnosing diseases, including cancer. In machine learning, classification is a task that assigns items in a collection to target categories or classes. The goal of classification is to accurately predict the target class for each case in the data. A classification task begins with a data set in which the class assignments are known. In the model building (training) process, different classification algorithms use different techniques to find relationships between the values of the features and the values of the target class. In this study, the task is to predict the types of cancer in the head and neck regions.
Without a diagnosis of HNC, there will be no prompt referral to specialists in tertiary institutions. Hence, it is of great significance to employ machine learning algorithms to predict the types of cancer in the head and neck regions, especially at the primary healthcare level.
Cancers need to be diagnosed early so that treatment protocols can commence immediately. When HNCs are diagnosed early enough, the prognosis is good.
II. RELATED WORK
Some works by researchers that used machine learning algorithms to diagnose diseases have been reviewed. They include, among others, the following. Two prediction models (Radiotherapy (RT) and End of Treatment (EOT)) at different time points were developed to predict weight loss ≥5 kg (yes/no) at 3 months post-RT using the Classification and Regression Tree (CART) algorithm [5]. The study indicated that CART could be employed for weight loss prediction in patients with head and neck cancer.
The study in [6] evaluated the use of the Classification and Regression Tree (CART) model in predicting weight loss following head and neck cancer radiation therapy. Two prediction models were developed to predict weight loss ≥5 kg at 3 months post-radiation therapy. The models were built (1) during radiation therapy planning, using patient demographics, delineated dose data and planning target volume-organs at risk shape relationship data, and (2) at the end of treatment (EOT), using additional on-treatment toxicities and quality of life data. The study showed that CART could be deployed to predict weight loss in head and neck cancer patients.
The study in [7] evaluated the pattern of follow-up visits among patients with head and neck cancers in Jos, North-Central Nigeria. Data were collected from Jos University Teaching Hospital and analyzed using statistical software.
The study in [8] considered a classifier model using machine learning algorithms for the differential diagnosis of suspicious thyroid nodules via sonography (ultrasound). The study included 970 histopathologically proven thyroid nodules in 970 patients. Two radiologists retrospectively reviewed the ultrasound images, and the nodules were graded according to a five-tier sonographic scoring system. Naïve Bayes, Radial Basis Function Neural Network and Support Vector Machine were employed for the diagnosis of thyroid nodules.
The study in [9] described competing causes of death in the head and neck cancer population. The authors identified patients with a first mucosal head and neck cancer from the Surveillance, Epidemiology and End Results database. The study showed that patients with head and neck cancer died from competing causes.
The study in [10] developed a feature selection method for oral cancer using the Apriori algorithm, in which the original algorithm for mining frequent item sets for Boolean association rules was employed. Data mining methods were explored to identify suitable techniques for efficient classification of the data. The study in [11] deployed a Back Propagation Neural Network to diagnose thyroid disease and indicated that the developed neural network could be used as a diagnostic tool for earlier prediction of thyroid disease.
The study in [12] worked on the prediction of acute myeloid leukemia using data mining. Data mining techniques such as Bayes Network, Jrip, J48, Multilayer Perceptron, IBK and Decision Tree were applied to the dataset. The study showed that data mining algorithms could be used for the prediction of myeloid leukemia.
The study in [13] set out to predict breast cancer survivability using different data mining techniques: Naïve Bayes, Back-Propagated Neural Network and the C4.5 decision tree algorithm. It was established that the C4.5 algorithm performed better than the other two techniques. The study was limited to only one type of cancer (breast).
Prediction of survival in patients with esophageal carcinoma (cancer) using Artificial Neural Networks was conducted by [14]. It was observed that accurate estimation of outcome in patients with malignant disease was an important component of the clinical decision-making process.

III. MATERIALS AND METHODS
The architecture of the stacked ensemble model for diagnosing head and neck cancers is presented in Figure 1. The major components of the architecture are data collection, preprocessing, feature selection, base models and the meta model (super learner).
The components of the proposed ensemble model for diagnosis of head and neck cancers are explained as follows:

B. Feature Selection
During data collection, problems such as irrelevant features may occur. We introduced a feature selection step in which the Information Gain algorithm was used to remove irrelevant features from the HNC dataset, to improve accuracy and to reduce training time. This step was applied to the prepared data before the learning algorithms were applied. Information Gain as a feature selection method is described as follows. It measures the dependence between a feature and the class labels C. It ranks features by a relevancy score computed for each individual attribute [15].
It computes the entropy of the classes, the entropy of the subsets induced by each feature and their weighted average in order to obtain the information gain of each feature. The features are then ranked by their information gain. A feature is considered relevant if it has a high information gain.
To calculate the information gain, the entropy of the HNC type (class) is calculated using Equation 1:
$$ E = -\sum_{i=1}^{n} P_i \log_2 P_i \qquad (1) $$
where $P_i$ is the proportion of examples in the HNC dataset that belong to the i-th class, n is the number of classes and E is the entropy. Information Gain is expressed as the difference between the prior entropy of the classes and the posterior entropy [16].
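The entropy and information gain described above can be illustrated with a short Python sketch. It is only a minimal illustration under stated assumptions: the logarithm is taken to base 2, the posterior entropy is taken as the weighted average entropy of the subsets induced by a categorical feature, and the function names and toy records are hypothetical (they are not drawn from the HNC dataset).

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Prior class entropy minus the weighted average entropy of the
    subsets obtained by grouping the records on the feature value."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    posterior = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - posterior


# Hypothetical toy records, for illustration only.
labels = ["thyroid", "thyroid", "larynx", "larynx"]
hoarseness = ["no", "no", "yes", "yes"]    # separates the two classes
weight_loss = ["yes", "no", "yes", "no"]   # carries no class information

ig_hoarseness = information_gain(hoarseness, labels)
ig_weight_loss = information_gain(weight_loss, labels)
```

In this toy case the first feature removes all uncertainty about the class, so its gain equals the prior entropy, while the second feature leaves the entropy unchanged and would be ranked last.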

C. Construction of the base model
The base models deployed for this study were Decision Tree, K-Nearest Neighbors and Naïve Bayes. The dataset was divided into training data and test data: 1031 records were used for training and 442 records were used for testing. Each training case was assumed to be represented as a tuple (x1, x2, x3, …, xn, y), where x1, x2, x3, …, xn are the attribute values describing the case and y is the corresponding class or target.
The first base model is explained as follows. A decision tree is a decision support tool that uses a tree-like graph or model of decisions and their possible consequences. It is a predictive model that goes from observations about an item (represented in the branches) to conclusions about the item's target value (represented in the leaves). A decision tree is constructed in a top-down, recursive, divide-and-conquer manner [17].
The goal of a decision tree is to create a model that predicts the value of a target variable based on several input variables. A tree can be "learned" by splitting the source set into subsets based on an attribute value test. This process is repeated on each derived subset in a recursive manner called recursive partitioning. The result is a flow-chart-like structure in which each internal (non-leaf) node denotes a test on an attribute, each branch represents the outcome of a test, and each leaf (or terminal) node holds a class label. Nodes continue to split until a leaf node is reached, and the leaf node determines the final classification of the case being tested. The second base model employed in this paper was K-Nearest Neighbors (KNN). The k-NN algorithm belongs to the family of instance-based, lazy learning algorithms [18]. Instance-based algorithms model the problem using data instances (or rows) in order to make predictive decisions [19].
When new unlabeled data comes in, k-NN operates in two basic steps. First, it finds the k closest labeled training data points. Second, it uses the classes of these neighbors to decide how the new data should be classified. For continuous data, k-NN uses a distance metric such as the Euclidean distance to determine which training points are closest to the query [20]. The Euclidean distance is given in Equation 2:
$$ d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \qquad (2) $$
where x represents the query instance, y represents a training sample, n is the number of features and $(x_i - y_i)$ is the difference between the query instance and the training sample on the i-th feature; the distance is computed between the query instance and all the training samples. The third base model was the Naive Bayes classifier. It is a classification technique designed for use when predictors are independent of one another within each class, and it frequently outperforms more sophisticated predictive analytic tools [21].
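The two k-NN steps and the Euclidean distance of Equation 2 described above can be sketched in Python as follows. This is an illustrative sketch only: the function names, the choice of k = 3, the majority vote used to combine the neighbors' classes and the toy points are assumptions, not the configuration used in the study.

```python
import math
from collections import Counter


def euclidean(x, y):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest neighbors."""
    # Step 1: find the k labeled training points closest to the query.
    ranked = sorted(zip(train_X, train_y),
                    key=lambda pair: euclidean(pair[0], query))
    nearest = ranked[:k]
    # Step 2: let the neighbors' classes vote on the predicted class.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical two-feature toy data with two well-separated classes.
train_X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
           (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_y = ["A", "A", "A", "B", "B", "B"]

prediction = knn_predict(train_X, train_y, (1.1, 1.0), k=3)
```

Because k-NN is a lazy learner, there is no separate training step in the sketch: all the work happens when a query arrives.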
The Naive Bayes classifier is based on Bayes' theorem. It is a conditional probability model that enables conditional predictions. Bayes' theorem provides a way of calculating the posterior probability P(c|x) from P(c), P(x) and P(x|c) using Equation 3:
$$ P(c \mid x) = \frac{P(x \mid c)\, P(c)}{P(x)} \qquad (3) $$
where P(c|x) is the posterior probability of the class (target) given the predictor (attribute), P(c) is the prior probability of the class, P(x|c) is the likelihood, i.e. the probability of the predictor given the class, and P(x) is the prior probability of the predictor. The classifier combines this conditional probability model with a decision rule. One common rule is to pick the hypothesis that is most probable; this is known as the Maximum a Posteriori (MAP) decision rule.
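The MAP decision rule described above can be sketched for categorical predictors as follows. This is a minimal sketch under stated assumptions: the predictors are treated as conditionally independent given the class, the likelihoods are Laplace-smoothed (an assumption added here so that unseen values do not produce zero probabilities), and the function names and toy records are hypothetical.

```python
from collections import Counter, defaultdict


def train_nb(rows, labels):
    """Count classes and per-class feature values for categorical data."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)   # (feature index, class) -> value counts
    values_seen = defaultdict(set)        # feature index -> distinct values
    for row, c in zip(rows, labels):
        for j, v in enumerate(row):
            value_counts[(j, c)][v] += 1
            values_seen[j].add(v)
    return class_counts, value_counts, values_seen


def predict_nb(model, row):
    """Return the class with the highest (unnormalized) posterior."""
    class_counts, value_counts, values_seen = model
    total = sum(class_counts.values())
    scores = {}
    for c, n_c in class_counts.items():
        score = n_c / total                       # prior P(c)
        for j, v in enumerate(row):
            # Laplace-smoothed likelihood P(x_j | c) (smoothing is an assumption).
            score *= (value_counts[(j, c)][v] + 1) / (n_c + len(values_seen[j]))
        scores[c] = score
    # P(x) is the same for every class, so it can be dropped under MAP.
    return max(scores, key=scores.get)


# Hypothetical toy records: (hoarseness, weight loss) -> cancer type.
rows = [("yes", "no"), ("yes", "yes"), ("no", "yes"), ("no", "no")]
labels = ["larynx", "larynx", "thyroid", "thyroid"]
model = train_nb(rows, labels)

prediction = predict_nb(model, ("yes", "no"))
```

Dropping the denominator P(x) does not change which class wins, which is why the sketch compares unnormalized scores only.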

D. Stacking with Logistic Model Tree (LMT)
The predictions of the base models were combined, and at the meta (super) level the Logistic Model Tree algorithm was used to build the stacked model.
Logistic Model Tree (LMT) is a supervised learning model that combines Logistic Regression and Decision Tree [22].
LMT builds on the idea of a model tree, which is a decision tree that has linear regression models at its leaves to provide a piecewise linear regression model; in an LMT, the leaves hold logistic regression models instead. The algorithm makes use of cross validation to find a number of LogitBoost iterations that prevents overfitting.
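The stacking procedure described above can be sketched in plain Python as follows. The sketch only illustrates the data flow of stacking and is not the configuration used in the study: the three base learners are deliberately tiny stand-ins (a majority-class rule, a 1-nearest-neighbor rule and a one-split stump) rather than C4.5, k-NN and Naïve Bayes, the meta learner is a simple lookup table rather than a Logistic Model Tree, and all names, the three-fold split and the toy data are hypothetical.

```python
from collections import Counter, defaultdict


def fit_majority(X, y):
    """Stand-in base learner: always predicts the most common class."""
    top = Counter(y).most_common(1)[0][0]
    return lambda x: top


def fit_1nn(X, y):
    """Stand-in base learner: label of the single nearest training point."""
    def predict(x):
        dists = [sum((a - b) ** 2 for a, b in zip(x, xi)) for xi in X]
        return y[dists.index(min(dists))]
    return predict


def fit_stump(X, y):
    """Stand-in base learner: one split on feature 0 at its mean."""
    cut = sum(xi[0] for xi in X) / len(X)
    default = Counter(y).most_common(1)[0][0]
    left = [yi for xi, yi in zip(X, y) if xi[0] <= cut]
    right = [yi for xi, yi in zip(X, y) if xi[0] > cut]
    left_label = Counter(left).most_common(1)[0][0] if left else default
    right_label = Counter(right).most_common(1)[0][0] if right else default
    return lambda x: left_label if x[0] <= cut else right_label


def fit_lookup_meta(Z, y):
    """Stand-in meta learner: most common true class for each tuple of
    base predictions (the study used a Logistic Model Tree here)."""
    table = defaultdict(Counter)
    for z, yi in zip(Z, y):
        table[z][yi] += 1
    default = Counter(y).most_common(1)[0][0]
    return lambda z: table[z].most_common(1)[0][0] if z in table else default


def fit_stack(X, y, base_fitters, fit_meta, n_folds=3):
    """Train a stacked model using out-of-fold base predictions."""
    Z = [None] * len(X)
    for fold in range(n_folds):
        held_out = [i for i in range(len(X)) if i % n_folds == fold]
        kept = [i for i in range(len(X)) if i % n_folds != fold]
        models = [fit([X[i] for i in kept], [y[i] for i in kept])
                  for fit in base_fitters]
        for i in held_out:
            # Each record is predicted by models that never saw it.
            Z[i] = tuple(m(X[i]) for m in models)
    meta = fit_meta(Z, y)                              # meta level
    final_models = [fit(X, y) for fit in base_fitters]  # refit on all data
    return lambda x: meta(tuple(m(x) for m in final_models))


# Hypothetical two-feature toy data.
X = [(1.0, 1.0), (5.0, 5.0), (1.2, 0.9), (5.1, 4.8), (0.8, 1.1),
     (4.9, 5.2), (1.1, 1.2), (5.2, 5.1), (0.9, 0.8)]
y = ["A", "B", "A", "B", "A", "B", "A", "B", "A"]

stacked = fit_stack(X, y, [fit_majority, fit_1nn, fit_stump], fit_lookup_meta)
```

The meta learner is trained only on predictions made for held-out records, so it learns how far to trust each base learner without seeing predictions that were fitted to the same records.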

IV. RESULT
When the Information Gain algorithm was applied with the threshold set at <0.2, the features with higher values were retained and passed on for the training of the base models and the meta model. The result is presented in Table 1.
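The threshold-based selection described above can be sketched as follows. The sketch assumes that features whose information gain falls below 0.2 are discarded and the rest are kept in ranked order; this reading of the threshold, the function name and the feature names and scores are all hypothetical and are not the values in Table 1.

```python
def select_features(scores, threshold):
    """Keep features scoring at or above `threshold`, ranked high to low."""
    kept = [(name, s) for name, s in scores.items() if s >= threshold]
    ranked = sorted(kept, key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in ranked]


# Hypothetical information gain scores, for illustration only.
scores = {"swelling": 0.62, "hoarseness": 0.41, "age": 0.08, "bleeding": 0.27}

selected = select_features(scores, threshold=0.2)
```

Only the selected feature columns would then be passed to the base models and the meta model.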

A. Results of the Information Gain with Decision Tree Data
The features selected by Information Gain were used to train the base models, Decision Tree (C4.5), KNN and Naïve Bayes, with 70% of the dataset as the training set and 30% as the test set. The performance metrics considered at the base level are accuracy, precision, recall and F1 score. The results are presented in Table 2, and the accuracies of the base models with Information Gain are shown in Figure 2.
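The four metrics named above can be computed from true and predicted labels as in the following sketch. It assumes macro averaging over the classes for precision, recall and F1 score; the source does not state which averaging scheme was used, and the function name and toy labels are hypothetical.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall and F1 score."""
    pairs = list(zip(y_true, y_pred))
    classes = sorted(set(y_true) | set(y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in pairs)   # true positives
        fp = sum(t != c and p == c for t, p in pairs)   # false positives
        fn = sum(t == c and p != c for t, p in pairs)   # false negatives
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(classes)
    return {
        "accuracy": accuracy,
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Hypothetical labels, for illustration only.
y_true = ["A", "A", "B", "B"]
y_pred = ["A", "B", "B", "B"]

metrics = classification_metrics(y_true, y_pred)
```

With several cancer types of unequal frequency, the averaging scheme can change the reported precision, recall and F1 score noticeably, so it is worth stating alongside the results.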

B. Stacking with Logistic Model Tree (LMT) with Information Gain Selection
The predictions of the three base models (Decision Tree (C4.5), KNN and Naïve Bayes) were stacked with Logistic Model Tree. This was done on the features selected by Information Gain, and the stacking was carried out using cross validation. The results are presented in Table 3.
For the base models, the F1 score showed that KNN had the highest value, followed by Naïve Bayes, and that Decision Tree had the lowest. This implies that KNN was the better base model for predicting or diagnosing the type of HNC when the Information Gain method was applied with 70% of the dataset as the training set and 30% as the test set.
After the base learners had made their predictions, the predictions were combined. Logistic Model Tree was used at the stacked meta level and trained on the HNC dataset with stratified cross validation. This was done on the features selected using Information Gain.
The accuracy of the stacked model on the Head and Neck Cancer dataset was 95.11% with Logistic Model Tree (LMT). This was higher than the accuracy of each of the three base models, i.e. stacking improved the accuracy.

VI. CONCLUSION
HNC has become a global disease whose mortality and morbidity rates are increasing because of late presentation and lack of access to specialists in Ear, Nose and Throat (ENT)/Head and Neck. Because patients with different types of HNC present with common clinical features, it is important to deploy stacked ensemble learning algorithms to predict the appropriate head and neck cancer type.
This study recommends the use of machine learning algorithms for the early diagnosis of head and neck cancers in developing countries, especially as patients in these countries present to primary health care facilities as their first port of call. The model can help primary health workers to diagnose early and refer patients to the tertiary health institutions where ENT specialists are found.