Normality Testing for Vectors on Perceptron Layers
Youmna Karaki, Halina Kaubasa, and Nick Ivanov

Abstract: Designing the optimal topology of the network graph is one of the most prevalent issues in neural network applications. The number of hidden layers, the number of nodes in each layer, the activation functions, and other parameters of a neural network must suit the given dataset and the problem at hand. Massive training datasets prompt researchers to use probabilistic methods to find the optimal structure of a neural network. Classic Bayesian estimation of network hyperparameters assumes that specific random parameters follow a Gaussian distribution. Multivariate normality analysis methods are widespread in contemporary applied mathematics. In this article, the normality of the probability distribution of vectors on perceptron layers was examined with a multivariate normality test. Ten datasets from the University of California, Irvine were selected for the computing experiment. The hypothesis of a Gaussian distribution was rejected: none of the sets of vectors passed the normality criteria.




I. INTRODUCTION
Many industries have been disrupted by the influx of neural networks. Over the last decade, neural networks have attracted an extraordinary amount of attention in areas such as face recognition, big data clustering, and signal processing.
In real deep learning projects, tuning hyperparameters is the key to building a network that provides accurate predictions for a specific problem. Common hyperparameters include the number of network layers, the number of nodes in each layer, the activation function, and the number of times (epochs) training is repeated. Hyperparameters determine how the neural network is structured, how it is trained, and how its different elements function. The optimization problems of neural network size reduction and hyperparameter selection are well known. In fact, one of the first books on this topic was published by Kevin Swingler in 1996 [1]. Optimizing hyperparameters is an art: the available approaches range from manual trial and error to sophisticated algorithmic methods.
Recognized algorithms for hyperparameter estimation are grid search, random search, Bayesian optimization, the gradient approach, and evolutionary optimization.
Grid search assumes that a researcher can construct a finite grid of candidate hyperparameter values and evaluate the network at every point of that grid [2]. Random search [3] for the estimation of neural network hyperparameters is an extension of grid search. A statistical distribution is specified for each hyperparameter under tuning, and the hyperparameter values are randomly sampled from these distributions.
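The difference between the two strategies can be illustrated with a minimal sketch. It is not taken from the cited works: the function names, the hyperparameter ranges, and the `toy_loss` objective are hypothetical stand-ins, and a real objective would train a network and return its validation loss.

```python
import itertools
import random


def grid_search(objective, grid):
    """Evaluate every combination in the grid; return (best_params, best_score)."""
    names = list(grid)
    best_params, best_score = None, float("inf")
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = objective(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


def random_search(objective, samplers, n_trials, seed=0):
    """Draw each hyperparameter from its own distribution n_trials times."""
    rng = random.Random(seed)
    best_params, best_score = None, float("inf")
    for _ in range(n_trials):
        params = {name: sample(rng) for name, sample in samplers.items()}
        score = objective(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_loss(params):
    """Hypothetical stand-in for a validation loss (lower is better)."""
    return (params["layers"] - 4) ** 2 + (params["nodes"] - 32) ** 2 / 100.0


# Assumed search space: 3-5 hidden layers, a few candidate layer widths.
grid = {"layers": [3, 4, 5], "nodes": [16, 32, 64]}
samplers = {
    "layers": lambda rng: rng.randint(3, 5),
    "nodes": lambda rng: rng.choice([16, 32, 64]),
}

grid_best, grid_score = grid_search(toy_loss, grid)
rand_best, rand_score = random_search(toy_loss, samplers, n_trials=20)
```

Grid search visits all nine combinations, whereas random search spends a fixed budget of trials regardless of how many hyperparameters are tuned.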
Most papers on Bayesian optimization assume that the researcher is able to observe the objective function. The Bayesian approach exploits past evaluation results to construct a probabilistic model that maps hyperparameters to a probability of objective function values [4], [5]. The advantage of the Bayesian method lies in searching for better hyperparameters based on previous trials.
In the gradient approach [6], gradients of the cross-validation performance are computed with respect to all hyperparameters. This is done by chaining derivatives backwards through the training procedure. The optimization problem can then be solved by any first-order method.
Evolutionary algorithms are methods for the global optimization of noisy black-box functions. Evolutionary hyperparameter search follows a biological concept. The initial set, called the initial population, contains randomly generated hyperparameters [7], [8]. The algorithm checks the fitness of each element of the population and replaces the worst element with a new one generated through the evolution procedure, that is, through crossover and mutation operations. The algorithm stops when evolution no longer improves the population.
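The loop described above can be sketched as follows. This is an illustrative outline under stated assumptions, not the procedure of [7] or [8]: the `evolve` function, the `patience` stopping rule, the operators, and the `toy_loss` objective are all hypothetical, and lower loss is treated as higher fitness.

```python
import random


def evolve(loss, random_individual, crossover, mutate,
           pop_size=10, patience=20, max_steps=500, seed=0):
    """Replace the worst individual with a better child until progress stalls."""
    rng = random.Random(seed)
    population = [random_individual(rng) for _ in range(pop_size)]
    stale = 0  # consecutive steps without improvement
    for _ in range(max_steps):
        if stale >= patience:
            break
        scores = [loss(ind) for ind in population]
        worst = max(range(pop_size), key=scores.__getitem__)
        parent_a, parent_b = rng.sample(population, 2)
        child = mutate(crossover(parent_a, parent_b, rng), rng)
        if loss(child) < scores[worst]:
            population[worst] = child
            stale = 0
        else:
            stale += 1
    return min(population, key=loss)


def toy_loss(ind):
    """Hypothetical stand-in for a validation loss (lower is better)."""
    return (ind["layers"] - 4) ** 2 + (ind["nodes"] - 32) ** 2 / 100.0


def random_individual(rng):
    return {"layers": rng.randint(3, 5), "nodes": rng.randint(4, 128)}


def crossover(a, b, rng):
    # Each hyperparameter is inherited from one of the two parents.
    return {key: rng.choice((a[key], b[key])) for key in a}


def mutate(ind, rng):
    # Perturb one hyperparameter and clip it back into the assumed range.
    child = dict(ind)
    if rng.random() < 0.5:
        child["nodes"] = min(128, max(4, child["nodes"] + rng.randint(-8, 8)))
    else:
        child["layers"] = min(5, max(3, child["layers"] + rng.choice((-1, 1))))
    return child


best = evolve(toy_loss, random_individual, crossover, mutate, seed=0)
```

Because a child enters the population only when it beats the current worst element, the best loss in the population never increases from one step to the next.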
Bayesian optimization is considered the most contemporary and systematic method for neural network hyperparameter optimization. It cannot guarantee an optimal solution; however, it provides reasonable, near-optimal hyperparameter values.

II. PROBLEM, MATERIALS AND SOFTWARE
The problem under consideration lies in the domain of neural network hyperparameter optimization. Bayesian optimization of neural network hyperparameters relies on an estimate of a random error vector [9], which is assumed to be normally distributed. Moreover, some papers presume that the input set of the neural network follows the normal law [10]. The authors of this article checked the normality of the vector sets on neural network layers using hypothesis-testing methods. The neural networks were designed as multilayer perceptrons with 3-5 hidden layers. Each training sample was passed through the layers, and the vector values on each layer were used as material for numerical testing.
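The way the test material is gathered can be illustrated with a small pure-Python sketch. It is only a stand-in for the actual networks: the weights are random rather than trained, the tanh activation and the layer sizes are assumptions, and the helper names `make_layer` and `forward_collect` are hypothetical.

```python
import math
import random


def make_layer(n_in, n_out, rng):
    """Random weights and zero biases; stands in for a trained layer."""
    weights = [[rng.gauss(0.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def forward_collect(x, layers):
    """Pass one input vector through the layers; return the vector on each layer."""
    outputs = []
    current = x
    for weights, biases in layers:
        current = [math.tanh(sum(w * v for w, v in zip(row, current)) + b)
                   for row, b in zip(weights, biases)]
        outputs.append(current)
    return outputs


rng = random.Random(0)
sizes = [4, 8, 8, 8, 2]  # assumed: 4 inputs, three hidden layers, 2 outputs
layers = [make_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]
inputs = [[rng.random() for _ in range(sizes[0])] for _ in range(150)]

# per_layer[k] holds one vector per input sample for layer k, i.e. one
# data matrix per layer that can be handed to a normality test.
per_layer = list(zip(*(forward_collect(x, layers) for x in inputs)))
```

Each element of `per_layer` is a set of 150 vectors whose dimension equals the width of the corresponding layer.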
The initial data were collected from an open library of machine learning datasets [11] maintained by the University of California, Irvine. In total, 11 datasets were checked, counting each classification problem separately. They are as follows: Brazilian high school (3 classification problems), Hepatitis, Lymphography, Liver disorder (2 classification problems), Dermatology, Glass identification, Adult (census data), and Wine quality. As usual, nominal variables were digitized in a conventional manner; e.g., a woman's gender was coded as 0 and a man's as 1. Other nominal characteristics were enumerated by the integers 0, 1, …, n.
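The digitization of nominal variables can be sketched as below. The helper `encode_nominal` and the sample values are hypothetical; the order in which categories receive their integers (first appearance) is an assumption, since the text does not specify it.

```python
def encode_nominal(values, mapping=None):
    """Replace category labels with integers 0, 1, ..., n.

    If no mapping is given, categories are numbered in order of first
    appearance (an assumed convention).
    """
    if mapping is None:
        mapping = {}
        for value in values:
            mapping.setdefault(value, len(mapping))
    return [mapping[value] for value in values], mapping


# Explicit mapping, as in the gender example from the text.
gender_codes, _ = encode_nominal(
    ["female", "male", "male", "female"], {"female": 0, "male": 1}
)

# Automatic enumeration of another nominal characteristic.
colour_codes, colour_map = encode_nominal(["red", "green", "red", "blue"])
```

Here `gender_codes` is `[0, 1, 1, 0]` and `colour_codes` is `[0, 1, 0, 2]`.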
The open-source Anaconda distribution [12], with its Python tools Keras and TensorFlow, was used to design and train our neural networks. Keras provides access to the structure of the network layers, so a researcher can easily obtain the matrices for further multivariate testing.
Numerous software vendors offer systems and modules for statistical analysis. R is one of the most prevalent languages in data science and provides an efficient interface for the testing process. The R script for the multivariate data examination was developed with the Multivariate Normality (MVN) library from the CRAN project [13]. Both the neural network computations and the MVN tests were run on an ordinary desktop computer using only the CPU. The total computing time for each dataset under examination, covering both the neural network and the MVN test, was within 25 minutes.

III. MULTIVARIATE NORMALITY TESTS
Mathematicians discovered the normal distribution about two centuries ago. The Fisher and Kolmogorov-Smirnov tests are reliable tools for checking the normality of one-dimensional data. There are special methods for testing multidimensional data. The measure of non-normality for both univariate and multivariate data depends upon asymmetry, tail weight, outliers, and modality. Univariate and multivariate skewness and kurtosis measure the same characteristics. In the multivariate case, however, the joint distribution of several variables is compared against a multivariate normal distribution, rather than the distribution of one variable against a univariate normal distribution. Skewness and kurtosis are the most efficient values, as H. Scheffé remarked in his book [14]. He noted that kurtosis and skewness are the key indicators of the degree to which non-normality affects the usual inferences made in the analysis of variance.
Moreover, skewness and kurtosis offer an intuitive way to comprehend normality. If skewness differs from zero, the distribution deviates from symmetry; if excess kurtosis differs from zero, the distribution diverges from normality in tail mass and shoulders.
There are various formulations of skewness and kurtosis in the literature. In 1998, Joanes and Gill [15] compared three common formulations of univariate skewness and kurtosis.
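As an illustration, the simplest moment-based formulation can be computed as follows. This sketch assumes the classical definitions g1 = m3 / m2^(3/2) and g2 = m4 / m2^2 - 3, where m_r is the r-th central sample moment with divisor n; the function names are hypothetical, and the other formulations differ only by sample-size correction factors.

```python
def central_moment(values, order):
    """r-th central sample moment with divisor n."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** order for v in values) / n


def skewness(values):
    """g1 = m3 / m2^(3/2); zero for a symmetric sample."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def excess_kurtosis(values):
    """g2 = m4 / m2^2 - 3; zero in expectation for normal data."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3.0


symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 1, 10]

g1_symmetric = skewness(symmetric)
g2_symmetric = excess_kurtosis(symmetric)
g1_skewed = skewness(right_skewed)
```

The symmetric sample has zero skewness and a negative excess kurtosis (lighter tails than the normal law), whereas the sample with one large value has positive skewness.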
If a sample is normally distributed, its excess kurtosis is equal to zero. It follows that the standard error of the variance is underestimated when kurtosis is positive and overestimated when it is negative. Kurtosis has an impact on variance estimates when the sample sizes are large, whereas in small samples only the mean estimates are affected. Yuan et al. [16] indicated that, asymptotically, the characteristics of mean estimates are influenced by neither skewness nor kurtosis; however, the standard error of the sample variance is a function of kurtosis.
Tests of multivariate normality were developed by K. V. Mardia, L. Baringhaus [17], and N. Henze [18]. Several measures of multivariate skewness and kurtosis exist; however, Mardia's are the most common.
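Mardia's measures can be sketched in pure Python as follows. This is an illustrative implementation under stated assumptions, not the code of the MVN package: it uses the covariance matrix with divisor n, the asymptotic results n·b1/6 ~ chi-square with p(p+1)(p+2)/6 degrees of freedom and b2 ~ N(p(p+2), 8p(p+2)/n), and hypothetical function names. Only the kurtosis p-value is computed, because the standard library has no chi-square distribution function.

```python
import math


def invert(matrix):
    """Gauss-Jordan inverse with partial pivoting for a small square matrix."""
    n = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * c for v, c in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def mardia(data):
    """Mardia's multivariate skewness b1 and kurtosis b2 with test statistics."""
    n, p = len(data), len(data[0])
    mean = [sum(row[j] for row in data) / n for j in range(p)]
    centered = [[row[j] - mean[j] for j in range(p)] for row in data]
    cov = [[sum(c[a] * c[b] for c in centered) / n for b in range(p)]
           for a in range(p)]
    inv = invert(cov)
    # proj[j] = S^{-1} (x_j - mean), so d(i, j) = (x_i - mean)^T S^{-1} (x_j - mean)
    proj = [[sum(inv[a][b] * c[b] for b in range(p)) for a in range(p)]
            for c in centered]

    def d(i, j):
        return sum(centered[i][a] * proj[j][a] for a in range(p))

    b1 = sum(d(i, j) ** 3 for i in range(n) for j in range(n)) / n ** 2
    b2 = sum(d(i, i) ** 2 for i in range(n)) / n
    kurt_z = (b2 - p * (p + 2)) / math.sqrt(8.0 * p * (p + 2) / n)
    return {
        "skewness": b1,
        "skewness_statistic": n * b1 / 6.0,       # compare with chi-square
        "skewness_df": p * (p + 1) * (p + 2) // 6,
        "kurtosis": b2,
        "kurtosis_z": kurt_z,                      # compare with N(0, 1)
        "kurtosis_p_value": math.erfc(abs(kurt_z) / math.sqrt(2.0)),
    }


# Small 2-D sample that is symmetric about the origin (illustrative data).
sample = [(1, 0), (-1, 0), (0, 1), (0, -1), (2, 1), (-2, -1), (1, 3), (-1, -3)]
result = mardia(sample)
```

For a sample that is symmetric about its mean, the skewness term vanishes, so any evidence against normality has to come from the kurtosis statistic. At a chosen significance level, normality is rejected when either p-value falls below that level.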

IV. EXPERIMENT ON DATASETS NORMALITY
In our experiment, a test was performed on the distribution of neuron outputs in the neural network in order to verify the hypothesis of a multivariate normal distribution. Tables report the skewness and kurtosis for our neural networks and for generated multivariate chi-square, t-, and normal distributions. The hypothesis of univariate normality was also checked on single components of some initial datasets.
Large hidden layers usually allow the neural network to fit the training data very well. Since regularization is typically used, it is essential to choose large hidden layers. Using the same size for all of the hidden layers most likely works better than choosing a decreasing or increasing size. Using a first hidden layer that is larger than the input layer tends to work better too. With unsupervised pre-training, the layers ought to be much bigger than with purely supervised optimization.
Univariate normality for two initial datasets was also verified by the skewness and kurtosis estimation approach. The significance level for all tests was equal to 0.05.
We created and trained 11 neural networks, all instances of a multilayer perceptron. As training samples, we selected multivariate statistical data on various topics, such as human diseases and characteristics of school students.
Before computing the respective values for the datasets, and in order to simplify the tests and make their results comparable, every dataset was restricted to 8 divisions and to between 150 and 600 vectors.
The MVN package provides methods to calculate the mean and other significant parameters of the data. Tests were run to demonstrate concretely the impact of skewness and kurtosis.
As a result of the tests carried out on different variants of the trained neural networks, the following skewness and kurtosis values were obtained. Some components of the initial datasets fell inside the confidence interval for univariate normality.
The values obtained in our experiment indicate that the results of testing the distribution of neurons of a neural network do not depend on the training input data, the number of network layers, or the number of nodes in each layer.

ACKNOWLEDGMENT
First and foremost, praise and thanks to God, the Almighty, for His showers of blessings throughout our work to complete this paper successfully.
Y. Karaki and H. Kaubasa would like to express their deep and sincere gratitude to their supervisor, Dr. Nick Ivanov, for his diligent and thorough efforts in contributing to this paper.
We are also extremely grateful to our caring and loving families. Their continuous support and encouragement to complete this paper are much appreciated.