Visualizing the Educational Data Mining Literature

DOI: http://dx.doi.org/10.24018/ejers.2020.0.CIE.2306 (December 2020)

Abstract: This article provides a visualization of a literature review in students’ performance prediction using educational data mining (EDM) techniques for the period 2015-2019. The results of the review are presented concisely and simply with the use of diagrams. Various aspects of the literature are examined, such as the algorithms adopted, the type of results drawn, the educational setting of the application and the actual exploitation of the outcomes. Findings indicate that tertiary education dominates the EDM field; in contrast, the focus given to secondary and primary education is minimal.


INTRODUCTION
The use of data mining techniques on educational data has increased greatly in recent years, alongside a large increase in the amount of educational data now available. The introduction of information systems allows large volumes of data to be recorded and retained in educational institutions. The development of synchronous as well as asynchronous distance learning has also increased the volume and variety of data. Thus, the conditions have been created for the application of data mining techniques in education, and educational data mining has developed as a separate interdisciplinary field.
Predicting students' academic performance is a frequent choice of researchers in this field. A significant number of studies have been published, primarily on predicting the performance of students in higher education. The main goal is the early detection of struggling students so that educational institutions can develop appropriate actions and policies. A large number of data mining techniques and algorithms have been applied to this prediction task, using various types of explanatory factors. Demographic and socio-economic data, grades and other academic data have been used as predictive variables.
We have recently prepared an extensive review of the relevant literature, soon to be published [1].
In this short paper we summarize this review and visualize its main findings. Readers interested in further details may seek the complete list of the referenced works and a discussion on the characteristics of each article in the original journal publication.

A. Article selection
The article selection process is shown in Figure 1. In the first phase, we found 564 articles. Subsequently, the titles and abstracts of the articles were screened against the inclusion criteria, after which 125 articles remained for full study. After a thorough study and application of the inclusion and exclusion criteria, 120 articles were finally retained. These articles were then critically reviewed.
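The selection funnel described above can be tallied with a few lines of code. The following is only a minimal sketch: the three counts come from the text, while the stage labels and the `retention` helper are our own illustrative names.

```python
# Counts reported in the text for each stage of the selection process.
# The stage labels are illustrative wording, not taken from Figure 1.
stages = [
    ("initial search", 564),
    ("after title/abstract screening", 125),
    ("after full-text review", 120),
]


def retention(stages):
    """Return (label, count, share of the initial pool) for each stage."""
    initial = stages[0][1]
    return [(label, count, count / initial) for label, count in stages]


for label, count, share in retention(stages):
    print(f"{label}: {count} articles ({share:.1%} of the initial pool)")
```

Under these counts, roughly one in five of the articles initially found was retained for the review.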

B. Literature review sources
The number of articles finally selected differs depending on the source. Figure 2 shows the percentages per source in detail.
We observe that a large number of articles come from journals that appear fewer than two times. The largest share of the reviewed articles originated from IEEE Xplore (23%), followed by Springer (20%).

C. Research field
The vast majority of articles were on tertiary education. This may be due to better access to data through the development of Learning Management Systems (LMS) in higher education institutions, as well as the fact that scientific experiments can be performed more easily in tertiary education. As shown in Figure 3, the percentage of research related to universities or colleges reached 78.69%. Research in secondary education followed at 14.75%, while less research was conducted on online platforms.

D. Algorithms
In most papers more than one algorithm was applied. We categorized the papers in methodological terms, using the following categories:
Association rules. A category of machine learning algorithms that strive to extract interesting relationships between variables and create "if-then" statements.
Bayesian methods. Algorithms that use Bayes' theorem to update the conditional probability of a hypothesis as more data become available.
Decision Trees. A category of non-parametric supervised learning methods that attempt to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features.
Ensemble methods. Methods that combine particular base algorithms in order to compose a new, improved predictive model.
Instance-based learning. A family of learning algorithms that construct hypotheses directly from the training data, without any prior hypothesis.
Logistic regression. A method used to model the probability of a certain binary class or event (e.g. pass/fail, win/lose).
Neural networks. A series of algorithms that attempt to identify underlying relations in a set of data through a process similar to the way the human brain works. A network is constructed as a combination of connected nodes that simulate the neurons of biological organisms.
Support Vector Machines. An algorithm that tries to find a hyperplane in an N-dimensional space that clearly separates the data into different categories while minimizing some error measure.
Linear regression. A statistical method for modeling the relationship between a dependent variable and one or more explanatory or independent variables.
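To make one of these categories concrete, the following minimal sketch illustrates instance-based learning with a k-nearest-neighbours (KNN) classifier. It is an illustration under stated assumptions, not the implementation used in any reviewed paper: the features (midterm grade, attendance rate), the toy data and the function name `knn_predict` are all hypothetical.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training instances."""
    # No model is fitted in advance: the stored instances are the hypothesis.
    # Sort the training instances by Euclidean distance to the query point.
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: (midterm grade, attendance rate) -> outcome.
# Features are left unscaled for brevity, so the grade dominates the distance.
train = [
    ((35.0, 0.50), "fail"),
    ((42.0, 0.60), "fail"),
    ((48.0, 0.55), "fail"),
    ((65.0, 0.80), "pass"),
    ((72.0, 0.90), "pass"),
    ((80.0, 0.85), "pass"),
]

print(knn_predict(train, (70.0, 0.88)))
print(knn_predict(train, (40.0, 0.58)))
```

In practice the features would normally be rescaled to comparable ranges before distances are computed; that step is omitted here to keep the sketch short.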
The measures used for evaluation were also recorded. Figure 4 shows the frequencies per method. Accuracy was the most frequently used evaluation measure. We observed differences in accuracy between different methods and datasets. It should be noted that almost every article used a different dataset. Thus, the accuracy of the algorithms can be compared only at the article level. Figure 5 shows the average accuracy per method.
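The two kinds of summary mentioned here, the frequency and average accuracy per method and the comparison of methods within a single article, can be sketched as follows. The records below are hypothetical placeholders rather than data from the review, and the helper names `summarize` and `best_per_article` are our own.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (article id, method, reported accuracy).
records = [
    ("A1", "Decision Trees", 0.82),
    ("A1", "Naive Bayes", 0.76),
    ("A2", "Decision Trees", 0.78),
    ("A2", "SVM", 0.74),
    ("A3", "Naive Bayes", 0.80),
]


def summarize(records):
    """Return {method: (frequency, mean accuracy)} across all articles."""
    by_method = defaultdict(list)
    for _, method, accuracy in records:
        by_method[method].append(accuracy)
    return {m: (len(accs), mean(accs)) for m, accs in by_method.items()}


def best_per_article(records):
    """Return {article: most accurate method within that article}."""
    best = {}
    for article, method, accuracy in records:
        if article not in best or accuracy > best[article][1]:
            best[article] = (method, accuracy)
    return {article: method for article, (method, _) in best.items()}


for method, (count, avg) in summarize(records).items():
    print(f"{method}: used {count} times, mean accuracy {avg:.4f}")
print(best_per_article(records))
```

Because the datasets differ between articles, the per-method averages are descriptive only; the within-article ranking is the comparison made on a common dataset.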
We observed both a high frequency and a high accuracy score for Decision Tree algorithms. The high frequency can be attributed to the extensive use of the method in general and to the large number of algorithms accessible in familiar tools such as WEKA. Bayesian algorithms, mainly Naive Bayes, were also used 45 times. Naive Bayes offers calculation speed and low resource consumption; in some cases it was found to have the highest accuracy, while its average accuracy reached 0.7560. Other methods, such as Support Vector Machines, Ensemble Learning Methods and Neural Networks, were applied to a lesser extent and produced marginally lower accuracy. Logistic regression was applied in only a few articles, as was Instance-Based Learning. Next, we present in more detail the accuracy and frequency of the most commonly applied algorithms.
Decision Tree algorithms accounted for the majority of the algorithms used and showed very high accuracy. Many algorithms are included in this category, such as ID3, CART and C4.5, so researchers had the opportunity to compare several different approaches in the same paper. The C4.5 algorithm was the most accurate.
Logistic regression showed accuracy similar to that of Decision Trees, suggesting that it is a powerful algorithm, although it was used in a small number of articles. Instance-based learning algorithms, SVM and Neural Networks showed lower average accuracy and were used by fewer researchers. Ensemble learning methods were used in a fairly large number of articles. The most frequent ensemble algorithm was Random Forest, which recorded an accuracy of 0.79. The Stacking and AdaBoost algorithms were used only a few times and showed better accuracy. We also noticed the very low frequency of unsupervised learning techniques, such as K-means, which appeared in only two cases.

E. Attributes
The attributes used were tested with various measures. Figure 6 presents the frequencies of the different attribute combinations used. The largest share of the papers we studied used student grades as explanatory variables (28.72%). Student demographics such as gender, parents' profession and age were used in 23.40% of the papers and academic data in 21.81%. Combinations of these attributes were used in smaller percentages. Finally, other variables (behavioral data, internet logs, motivational data) were applied in a very small proportion of papers, while two studies used data from documents.
We also examined the average accuracy in terms of the attributes used. We present only demographic, academic and grade data because they were used most frequently. In Figure 8, we observe that the use of grades as a predictive attribute has the highest accuracy (0.79476). Grades also presented the narrowest 95% confidence interval (0.74941 to 0.84011). Using only demographic and other academic data reduces accuracy (0.76905), but the 95% confidence intervals overlap (0.70851 to 0.82959) and no statistically significant difference is found. The use of demographic data alone showed higher accuracy (0.94567), but in only three cases, and the use of academic information showed an accuracy of 0.8828. As shown in the spider graph (Figure 8), the shape resembles a regular heptagon and no combination showed particularly higher accuracy.
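The interval comparison above can be checked directly from the reported bounds. The sketch below uses only the two 95% confidence intervals given in the text; the helper names are our own. Note that checking whether two intervals overlap is a rough heuristic rather than a formal significance test.

```python
def intervals_overlap(a, b):
    """True if the closed intervals a = (lo, hi) and b = (lo, hi) share a point."""
    return a[0] <= b[1] and b[0] <= a[1]


def width(interval):
    """Length of the interval (lo, hi)."""
    return interval[1] - interval[0]


def midpoint(interval):
    """Centre of the interval (lo, hi)."""
    return (interval[0] + interval[1]) / 2


# 95% confidence intervals as reported in the text.
grades_ci = (0.74941, 0.84011)  # grades, mean accuracy 0.79476
other_ci = (0.70851, 0.82959)   # demographic and other academic data, mean 0.76905

print("overlap:", intervals_overlap(grades_ci, other_ci))
print("grades interval is narrower:", width(grades_ci) < width(other_ci))
print("midpoints:", round(midpoint(grades_ci), 5), round(midpoint(other_ci), 5))
```

Both intervals are symmetric about the reported mean accuracies, and the interval for grades is the narrower of the two, which agrees with the description in the text.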

F. Use of the research findings
The use of research results for decision making can be very helpful for education authorities. Evidence-based decision making enables strong decision support at the local, regional, national and even supranational level. We examined the cases in which the results of the studies were used in decision making, according to what is reported in each article. We found that, for the most part, these are case studies on a limited sample. The main interest of the researchers was to evaluate the effectiveness of specific algorithms and techniques, not to use their results to shape educational policy. However, there were a few cases in which a tool was developed for use by schools or universities. In Figure 9 we present six cases in which the paper targeted the practical use of the findings.
The reviewed papers are dominated by studies of higher education, because the higher availability of data and the closer connection that researchers have with these organizations allow more studies to be carried out. In Primary and Secondary Education, on the other hand, only a few studies have been conducted, although the field is larger and wider.
Many algorithms were identified, and the studies evaluated their effectiveness in predicting student performance. Our review generally found high accuracy levels. Only in a few cases did we observe low or very high accuracy, and no statistically significant differences were found among the algorithms. The use of more sophisticated algorithms did not increase accuracy. Decision trees account for the majority of the methods adopted, with satisfactory accuracy levels. The Naive Bayes, C4.5 and Random Forest algorithms were adopted most often, and KNN was the only algorithm related to instance-based methods. Clustering (K-means) was used in only two papers. Ensemble methods were implemented in fewer cases. In the papers we studied, student marks were most frequently used as the predictive factor. Combinations of students' grades with demographic and academic data were frequently applied. No combination led to higher accuracy. Our results confirmed a satisfactory level of accuracy. However, the many possible uses of the findings have been almost neglected. In a few studies, a tool was created or the results were provided to the institutions to improve the education offered. In some cases, the practical implementation of the study findings was reported: the application of methods for the early diagnosis of student failure and the improvement of student results through feedback were the only such applications.

CONCLUSIONS
This review identified a rich variety of analysis methods as well as a preoccupation with their use within the academic community. There has been insufficient utilization of data mining studies in support of educational policy-making and institutional decision-making. We believe that research should be expanded toward the use of findings in daily teaching practice and in support of decision making at the educational policy level. In the long run, inward-looking research that only accumulates cases of applying different algorithms and techniques can lead to the withering away of the scientific field. In contrast, the practical application of the findings will extend the scope of research and will be useful for both the research and the educational communities.