Forecasting Models of the Coronavirus (COVID-19) Cumulative Confirmed Cases Using a Hybrid Genetic Programming Method

— The coronavirus disease 2019 (COVID-19) diffusion process, starting in China, caused more than 4600 lives until June 2020 and became a major threat to global public health systems. In Greece, the phenomenon started on February 2020 and it is still being developed. This paper presents the implementation of a hybrid Genetic Programming (hGP) method in finding fitting models of the Coronavirus (COVID 19) for the cumulative confirmed cases in China for the first saturation level until May 2020 and it proposes forecasting models for Greece before summer tourist season. The specific hGP method encapsulates the use of some well-known diffusion models for forecasting purposes, epidemiological models and produces time dependent models with high performance statistical indices. A retrospective study confirmed the excellent forecasting performance of the method until 3 June 2020.


I. INTRODUCTION
In epidemiology, the research of disease spread as a diffusion process is a subject with a long history. For modeling the diffusion process of infectious diseases, many classical epidemic models have been proposed [1]- [5], such as SIR and their analytical formulations have been proposed [5], [6]. Also, many methods have been implemented on fitting and forecasting the diffusion process phenomenon [8]. A special case of these methods is the diffusion models case [9] which are dynamic and follow a "S" curve over time. In this article the models which have been used are the Gompertz family models, Logistic model, Bass model, Bi-Logistic and Log-in-Log models [8]- [16] for Coronavirus cumulative confirmed cases. The initial concept was introduced in [9] while in [10] and [11] a hybrid modified GP (hGP) method combined diffusion models and hGP to produce forecasting models.
The Genetic Programming technique (GP) belongs to the Evolutionary Algorithms (EA) subcategory of Computational Intelligence and embeds the biological natural selection idea on an appropriate problem solution [17]- [19].

A. Analytical Type of SIR Model
An SIR (Susceptible -Infected -Recovered) model is an epidemiological model that calculates the number of infected people with a disease and recovered or died by a disease in a specific population over time. This models' category involves differential equations relating the number of Susceptible people s(t), the number of Infected people i(t) and number of people who have Recovered r(t). The SIR models such as Kermack & McKendrick model have two different types with vital dynamics and without [1]- [6].
The first SIR model (without vital dynamics) describes an epidemic spread in a short period of time. The second SIR model (with vital dynamics) describes the epidemic spread over a longer period [7], as endemic phenomenon.
Solving Eq. 7, we get: or the Exponential Integral function [20]: • or the equivalent: A simplification of SIR model in Eq. 11 has been implemented in article [6], using Taylor series approximation: This SIR model in [6] and [21] gives a Logistic type approach: where = − + • and C is a constant.

B. Diffusion Models
In articles [8]- [15], the diffusion phenomenon has been extensively described. The models in this process follows: In this equation, y(t) is the penetration until time t, S is the level point of saturation and f(t) is the diffusion's coefficient function [8], [9]. The type of f(t) function defines the differential equation solution.

1) Gompertz Model
In this article two types of Gompertz models have been implemented and are described with the equations: In (13) and (14), which are called Gompertz I and II, respectively, the y(t) is the estimated diffusion level at time t, parameter S is the saturation level and f(t) is a linear function of time t, such as f(t)=a+b•t. Parameter a,b and c are constant numbers [8]- [11], [16].

2) Logistic Model
In general, the Logistic model is given by: In (15), y(t) represents the penetration of the observed quantity, until time t, f(t) is a time dependant function, such as f(t)=a+b•t and a, b are constants [8]- [10].

3) Richards' Model
The generalized logistic model, known as Richards' growth model, introduces an additional constant parameter d, d≠0 to the logistic model [8]:

4) Bass Model
The Bass model concept has been implemented in [8]- [12]. In this model, the diffusion phenomenon has two major elements: the diffusion process pioneers (innovators) and those who follow (imitators).
The penetration of the observed quantity y(t) is represented in (17): In (5), parameter a represents the initial crucial number of diffusion adopters. Parameter b is the addition of p and q, where p and q are the innovators' and imitators' coefficients, respectively. Parameter c equals with r and p multiplication, where r corresponds to a constant number and d corresponds to r• (q/a) multiplication [12].

5) Bi-Logistic Model
In many cases, the observed diffusion process has many phases/generations. For this purpose, the Bi-Logistic curve implements the composition of two Logistic curves [13]. So, Eq.6 is:   2 m t represents the time for the first and second generation start point, respectively [13], [14]. This and the next model could be useful in the case of epidemic spread waves.

6) Logistic Growing Saturation Level in Logistic Model (LogInLog)
When whole diffusion process follows the Logistic model and also the saturation level point is time dependant () St and it follows the Logistic model until the upper saturation level point u S [13]- [15] as well, then the Eq. 19 depicts the process.
where ( ) = + ⋅ and ( ) = + ⋅ and all the parameters are constants. The dynamic saturation level point follows in Eq. 20 where parameter is constant number. This model describes the diffusion process when diffusion process has coated generations, not clearly separated [14], [15].

III. HYBRID GENETIC PROGRAMMING METHOD
A hybrid GP (hGP) has been proposed in articles [9]- [11]. In this GP method the initial population of solutions comprises the optimized diffusion models. In this article epidemiological models such as SIR are additionally introduced in the initial population. The modified hGP consists of three parts, the regression analysis, the genetic algorithm process, and the model selection.

A. Initial Population
In [9], [10] the concept of the randomly created chromosomes in the initial population of the solutions has been adopted. The blocks' form for the initial population' construction combines the basic sets of <Constant of S set>, <arithmetic function of FΑ set>, <variable of M set> and <mathematical function of FΜ set>. The construction follows specific rules of production.
In the initial population, random chosen functions (FA: Arithmetic, FM: Mathematical, diffusion and SIR models), variables, constants, and user defined blocks. The models are optimized with regression analysis [2]. The flowchart of the hGP method is depicted in Fig. 1.
It should be noted that the Exponential Integral function (Ei) in Eq. 10 and SIR model in Eq. 9 and Eq. 11 have been inserted in the function set and in initial population of the hybrid Genetic Programming Method (hGP), respectively.

B. Solutions Representation
The chromosomes/solutions are strings of characters [9]. The Python Programming Language low-level representation of the abstract syntax tree (ast) corresponds to parse trees. For example, the chromosomes exp(2+3•t)•Richards(3•t) and exp(2+3•t)+Exp(-3•t) are presented in Fig. 2 and in Fig. 3 as strings of characters and parse trees, respectively [9].

C. Evaluation and Fitness Function
The best chromosome/solution selection is performed using Eq. 21 and Eq. 22 for fitting and forecasting purposes, respectively.
In Eq. 21, the Sum of Squared Error (SSE) is calculated over time t=1, 2, 3…, T. Also, dataset raw(t) comprises the observed data over time t and y(t) corresponds to the produced model's value [10].
In weighted sum of squared error (wSSE) function, the weight factor is wi =1/T, 2/T, 3/T, …1. So, latest data have greater weight than first data of the dataset. In forecasting process, the whole process is divided in subprocesses with a combination of the SSE and wSSE.
Specifically, the whole process is separated into training and forecasting subprocesses. During subprocess of training, the SSE function is implemented as fitness function, while the forecasting estimation is performed with wSSE. A list with sorted solutions by their fitness function's values is created. Every solution that does not satisfy the precision limit criterion is rejected. The remaining solutions participate as candidates for the Tournament selection and the next operations of the Crossover and/or Mutation [10], [11] are going to be executed.

D. Crossover and Mutation
The crossover operation concerns the selection of two solutions (parents), implementing the Tournament Selection process, in the sorted list. In the specific kind of Tournament selection, several solutions from the list, with the best, good and not so good fitness value, are randomly selected and then the best of those are chosen for the Crossover or Mutation [10], [11].
In each parent characters' string, one Crossover point is randomly chosen. The substring of each parent which begins at the specific crossover point is interchanged between two parents and two new children are generated (see Fig. 4).
In the mutation process, one string's point, which corresponds to a function, is randomly chosen. The process replaces the function, with a new random function [10] (see Fig. 5).   (24) In Eq. 23, Eq. 24 and Eq. 25, y(t) corresponds to the model's value for time t=1, 2, 3…, T. Also, raw(t) are the data of the dataset.

F. Datasets
In this study, the hGP method has been implemented on World Health Organization (WHO) datasets [22]. The first dataset presents the cumulative confirmed cases in China. Specifically, the data is derived from the WHO [22], which presents the cumulative confirmed cases in China, from 1 November 2019 until 16 April 2020. The dataset concerns the daily cumulative confirmed cases per 1000 inhabitants and the dataset is comprised by 98 data points.
The second dataset presents the cumulative confirmed cases in Greece. The data is derived from the WHO [22], which presents the cumulative confirmed cases in Greece, from 26 February 2020 until 5 May 2020. The dataset concerns the daily cumulative confirmed cases per 1000 inhabitants and the dataset is comprised by 70 data points.

IV. RESULTS AND DISCUSSION
In this study the SSE is the statistic indicator that has been used for the fitting process and wSSE for forecasting purposes. In this section, the fitting performance of a generated hGP model and Diffusion models are presented.  The fitting performance of the first modified hGP model for the Coronavirus (COVID-19) cumulative confirmed cases with the best fitness value (SSE), is presented in Fig.  6. The y-axis presents the cumulative confirmed cases per 1000 inhabitants and x-axis the day number from the beginning of the phenomenon.
Someone can say that in the inner format of the model exists the multiple generations' model of Log-In-Log. This is an indicator of the "nature" for the COVID-19 cumulative cases curve.
This behavior of the hGP program execution is partially explained with the findings of diffusion models' fitting performance in the next section.

B. Fitting Performance of the Diffusion Models for China
The Fitting Performance of the Diffusion Models is depicted in Fig. 7. The y-axis presents the cumulative confirmed cases per 1000 inhabitants and x-axis the day numbers from the beginning of the phenomenon.
The relative statistical indices SSE, MAPE, MSE, and MAE of the diffusion models are presented in Table 3. In Table 3 the Log-In-Log diffusion model achieves the best statistical performance, with a value of 239.2022973 for SSE and Richards, Logistic and Bi-Logistic follow with better SSE than the other models.
Comparing the statistical indices for Table 2 and Table 3 we conclude that the fitting hGP model achieves the best SSE, MAPE, MSE and MAE values. The relative statistical indices for the diffusion and SIR models in Table 3 have worse values than hGP model. This is a first indication that the method of hGP can lead us to useful assumptions about the COVID-19 penetration.

C. Proposed Forecasting Models for Cumulative Confirmed Cases in Greece
In this section, forecasting models for the Coronavirus (COVID-19) cumulative confirmed cases in Greece derived by the hybrid Genetic Programming (hGP) method are proposed. The observed period concerns daily data, 70 data points, from the 26 February until 5 May 2020.
In forecasting process, we used diffusion models, SIR model and fitting hGP models derived from COVID-19 cumulative confirmed cases in China. The phenomenon in China preceded more than one month in relation to that in Greece and it has reached the "temporary" saturation level. The diffusion models and hGP models are inserted in the initial population of hGP and the population are being evaluated with the weighed SSE (wSSE) performance indicator. Table 4 contains the program execution parameters of the GP method concerning the WHO Coronavirus (COVID-19) data sets in Greece. The first hGP model corresponds to the optimistic scenario. In the specific scenario, the forecasting model predicts a value of 2790 cumulative confirmed cases in first days of June 2020 (datapoint 99, 3 June 2020). As one can see the model is a combination of Bi-Logistic part and a linear function's part. The first forecasting hGP model has the following format: (2.2285066091270123/(1.0656151198157326+Exp(-2.065099746680325-0.15024536298033256*(t-(48.5436271109599))))) + 0.0058907659155309185*(t-(9.281063892171199) + (0.027021888236162108/(0.16029468427112187 + Exp(-59.53190921241808-1.2505949209299758*(t-103.97484004304785)))) The model consists of the two mathematical parts, Bilogistic and linear function of time. The addition of these parts produces the first forecasting model.
The first part, the Bi-Logistic is: (2.2285066091270123/(1.0656151198157326+Exp(-2.065099746680325-0.15024536298033256*(t-(48.5436271109599))))) + (0.027021888236162108/(0.16029468427112187 + Exp(-59.53190921241808-1.2505949209299758*(t-103.97484004304785)))) The linear part is:     The relative statistical indices wSSE, MAPE, MSE, and MAE of the forecasting hGP models are presented in Table  5. The statistical indices for the hGP forecasting models have similar performance. The second forecasting hGP model achieves better MAPE which is one of the bestknown indices in forecasting. This leads us to assume that the hGP Forecasting model 2 will have better forecasting performance than the hGP Forecasting model 1.
This behavior of the hGP forecasting models could be explained with the findings of diffusion models' fitting performance in the next section. There is a strong influence of the Bi-Logistic in the forecasting performance, as well as α Richards and SIR model influence.

D. Fitting Performance of the Diffusion Models for Greece
The Fitting Performance of the Diffusion Models is depicted in Fig. 11 (using the hGP runtime environment). The relative statistical indices wSSE, MAPE, MSE, and MAE of the diffusion models are presented in Table 6. In Table 6 the Bi-Logistic achieves the best statistical performance, as it achieves SSE value of 0.086998679. Richards and SIR models follow with worse SSE values than Bi-Logistic. This partially explains the structure of the forecasting models in the previous section C, where the hGP forecasting models consisted of Bi-Logistic, Richards and SIR parts.

E. The retrospective study of the forecasting performance in Greece
In this section, we study the forecasting scenarios for Greece retrospectively based on World Health Organization (WHO) datasets [23] for the last forecasted datapoint. Specifically, the retrospective study is derived from the WHO [23], which presents the cumulative confirmed cases in Greece until 3 June 2020. The retrospective dataset concerns the daily cumulative confirmed cases per 1000 inhabitants and the last datapoint (3 June 2020) of cumulative confirmed cases in Greece is 2,918.
The behavior of the hGP forecasting models is explained with the findings of forecasting performance in Table 7. This table presents the absolute error of the hGP forecasting scenarios for the last datapoint. The hGP Forecasting model 2 achieves the best forecasting performance with second the hGP Forecasting model 1. The Bi-Logistic and Richards models predict the worst values for the cumulative confirmed cases in first days of June 2020 (point 99, 3 June 2020), respectively. Also, the Absolute and Percentage Errors of the hGP forecasting models are smaller than the others.
The assumption of the Section C ( Table 5 findings) that the hGP Forecasting model 2 would have better forecasting performance than the hGP Forecasting model 1 has been confirmed.
It should be noted that the dataset for the hGP program execution was 70 datapoints and the forecasting horizon was 29 points (until point 99). So, one can say that the hGP programming method provides us forecasting models using few data compared with the forecast horizon.

V. CONCLUSION
In this paper the hybrid Genetic Programming method (hGP) [9]- [11] has been implemented for World Health Organization Coronavirus (COVID-19) datasets in China and Greece. Using the fitting hGP models derived by the implementation in China's dataset and some widely used diffusion models, the hGP method achieves to produce forecasting models concerning COVID-19 cumulative confirmed cases in Greece.
The fitting and forecasting performance of the modified hGP are presented and the method presents satisfactory performance with the best statistical indices. The combination of diffusion models with GP method indicates that could be produced some forecasting models for Coronavirus cumulative confirmed cases in Greece. Also, the analysis of the models leads to the formation of prediction's scenarios.
The hybrid Genetic Programming method consists of a forecasting framework that produces time dependant models and/or causal models for short or long-term forecasting, with more than one variable.
Finally, a retrospective study for the produced hGP models confirmed the assumptions about the forecasting performance of the method and that this method could be used to accurately forecast with few data. Specifically, using statistical indices for fitting (SSE) and forecasting (wSSE, MAPE) with appropriate models into the initial population, the hGP combines and produces complicated forecasting models with high accuracy. So, the key features of hGP have achieved the desired results.

FUNDING
This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

ETHICAL CONSIDERATIONS
Ethics approval not required for this type of study.

CONFLICTS OF INTEREST
None of the authors have any conflicts of interest or financial ties to disclose.