GRAPHICAL JUMP METHOD FOR NEURAL NETWORKS

Abstract: A graphical tool for choosing the number of nodes for a neural network is introduced. The idea is first to fit the neural network over a range of numbers of nodes, and then to generate a jump plot using a transformation of the mean square errors of the resulting residuals. A theorem is proven showing that the jump plot will select several candidate numbers of nodes, among which one is the true number. Then a single node only test, which has been theoretically justified, is used to rule out erroneous candidates. The method has a sound theoretical background, yields good results on simulated datasets, and shows wide applicability to datasets from real research.


Introduction
Determining the optimal number of hidden units for a neural network is a difficult problem. When there are too few parameters, a statistical model or machine learning algorithm cannot capture the underlying trend of the data; underfitting occurs and the prediction performance of such a model is poor. When there are too many parameters, a statistical model fits the pattern of random error or noise instead of the underlying relationship, which is called "overfitting". A model that has been overfitted performs poorly in prediction, since it overreacts to minor variations in the training data, essentially focusing on the noise rather than the true underlying trend. For a neural network model, when there are too few nodes, the prediction performance on the output variable is poor. When there are too many nodes, although the output error is lower on the training data, the error in predicting novel examples increases. Selecting an optimal number of hidden nodes allows a good fit to both the training data and future predictions, such as a hold-out test sample of data. Herein we consider only single hidden layer feedforward neural networks, although the methodology is extensible to other varieties. Thus we consider fitting models of the form

y_i = β_0 + Σ_{j=1}^{k} β_j ψ(γ_{0j} + Σ_h γ_{hj} x_{ih}) + ε_i,

where ψ is a sigmoidal activation function, x_{ih} is the h-th component of the i-th sample of the inputs, y_i is the output, and ε_i ∼ N(0, σ²).
A variety of approaches have been proposed to combat overfitting in neural networks, including early stopping (Sarle, 1995; Girosi et al., 1995), weight decay (Krogh and Hertz, 1992), and Bayesian methods (Lee, 2004). In addition, Sheela and Deepa (2013) and Xu and Chen (2008) provide recent reviews of the literature on selection of the number of hidden units. Here we survey criteria-based methods, and then develop a new criterion based on a graphical interface.
Two popular general model selection criteria that can be applied to choosing the number of nodes are Akaike's information criterion (AIC) (Akaike, 1973) and the Bayesian information criterion (BIC) (Schwarz, 1978). Another related criterion is Mallows' Cp statistic,

Cp = SSE(p) / σ̂² − n + 2p,

where SSE(p) is the sum of squared residuals, n denotes the number of observations, p denotes the number of parameters, and σ̂² is an unbiased estimate of the variance of the error term (Fogel, 1991). If σ² is known, any model which estimates the regression coefficients unbiasedly and includes all critical regressors has Cp converging to the number of parameters when the sample size is large (Gilmour, 1996).
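As a concrete illustration of the Cp formula above (a minimal sketch; the function name is ours):

```python
def mallows_cp(sse_p, sigma2, n, p):
    """Mallows' Cp: SSE(p) / sigma^2 - n + 2p."""
    return sse_p / sigma2 - n + 2 * p

# For a correct model, E[SSE(p)] = (n - p) * sigma^2, so Cp is close to p:
# mallows_cp((100 - 3) * 2.0, 2.0, 100, 3) evaluates to 3.0
```

This makes the convergence claim concrete: when the fitted model is unbiased, SSE(p)/σ² concentrates around n − p, so Cp concentrates around p.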
In the context of neural networks, Murata et al. (1994) studied the theoretical relationship between the training error and the generalization error with regard to the training examples and the complexity of the structure of a neural network. The Network Information Criterion (NIC) chooses the specification minimizing

NIC = −(1/T) log L + (1/T) tr(B A⁻¹),

where T is the sample size, L is the likelihood, A is the expectation of the second derivatives of the log likelihood, and B is the variance of its first derivatives. If the class of models investigated includes the true model, A = B asymptotically. Thus tr(B A⁻¹) is the effective number of model parameters, which is typically less than the nominal number because the parameters are dependent. However, this method can suffer from the problem of rejecting hidden units and favoring the least complex network architectures for model fitting (Anders and Korn, 1999). Moreover, none of the aforementioned methods performs well in choosing the best number of nodes for a neural network; hence it is critical to find a new method which does a good job for such a task.
In the field of choosing the number of clusters in a mixture model, Sugar and James (2003) proposed the jump method from an information theory point of view. By adapting ideas from rate distortion theory to clustering, the theory of the jump method investigates the functional form of the mean square error (MSE) curve in both the presence and absence of clusters. Furthermore, they demonstrate, both theoretically and empirically, that the MSE curve, when transformed to an appropriate negative power, will display a jump, reliably and accurately, at the true number of clusters. However, it is often arduous to designate the transformation parameter directly. Chang and Sugar (2008) proposed a graphical tool, christened the "graphical jump method", to ascertain the number of clusters. By changing the transformation parameter, the transformed MSE curve jumps at divergent numbers of clusters, called candidate numbers, amongst which one is the true number of clusters. If the candidate number is smaller than the true number of clusters, at least one cluster will accommodate more than one true cluster and yield a positive result on a test, which is dubbed the "cluster-existence test" and has been theoretically justified.
In this paper, the graphical jump method is extended to solve the problem of choosing the number of nodes for a neural network. First, by using theoretical results from Murata et al. (1994), a theorem is proved stating that after some boundary conditions are satisfied, there surely exists a transformation power by which the MSE can be transformed to exhibit a jump at the true number of nodes. A "single node only test", which is also justified theoretically, is used to rule out erroneous candidates. The newly developed method for choosing the number of nodes makes limited parametric assumptions, can be rigorously theoretically motivated using theorems from Murata et al. (1994), and is simple to both understand and implement. The jump method only applies to choosing a parameter that is a counting number, and does not apply for continuous or other-valued parameters.
In Section 2, the theory and concrete steps of the graphical jump method are introduced in detail. In Section 3, simulation studies and results are elucidated. Section 4 describes the analysis of a real dataset. Section 5 lays out future research directions.

The introduction of the graphical jump method
Assume that the data are fitted by the following neural network model:

y_i = β_0 + Σ_{j=1}^{k} β_j ψ(γ_{0j} + Σ_h γ_{hj} x_{ih}) + ε_i,   (1)

where ε_i ∼ N(0, σ²). The MSE dk equals the variance of the residuals generated by fitting the neural network model with k nodes.
Assuming that the dataset is fitted by model (1), the graphical jump method has the following four basic steps for choosing the best number of nodes:
a. Calculate dk for k = 1, ..., kmax by fitting the neural network model with k nodes.
b. Choose a positive number ν > 0, called the transformation power.
c. Calculate the jump score associated with k nodes, Jk = dk^(−ν) − dk−1^(−ν), where d0^(−ν) ≡ 0.
d. The best number of nodes is the k with the highest Jk.
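The four steps can be sketched in code; the fitting in step (a) is abstracted away here, and we simply assume the sequence of residual MSEs d1, ..., dkmax has already been computed:

```python
import numpy as np

def jump_scores(mse, nu):
    """Step (c): J_k = d_k^(-nu) - d_{k-1}^(-nu), with d_0^(-nu) defined as 0."""
    t = np.asarray(mse, dtype=float) ** (-nu)
    return t - np.concatenate(([0.0], t[:-1]))

def best_k(mse, nu):
    """Step (d): the number of nodes with the highest jump score."""
    return int(np.argmax(jump_scores(mse, nu))) + 1  # node counts are 1-based

# Illustrative MSE curve with a sharp drop at k = 4:
d = [1.0, 0.80, 0.60, 0.10, 0.09, 0.085]
```

With a moderate power such as ν = 2, the largest jump lands at the drop (k = 4); very small powers favor one node and very large powers favor kmax, matching the behavior of the jump plot described in this section.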
To give a simple illustration of how the graphical jump method works, a simulated dataset is generated with 100 observations from a gamma(20, 40) distribution with additional standard normal noise, and a true number of nodes of 4. The response, Y1, is the aggregate of the 4 different nodes, Y2, Y3, Y4 and Y5 (Figure (1)), whose coefficients are shown at the top of the plots. The first and third nodes have active declining regions in the ranges (-1.5, -0.3) and (0.5, 1), while the second and fourth nodes have active increasing regions in the ranges (-0.3, 0.4) and (1.1, 1.9). Consequently, the final response variable, Y1, has 4 disconnected active regions with adjacent active regions in opposite directions, which necessitates 4 nodes to provide the best fit. A graphical visualization is provided by Figure (2), which plots the successive jumps in the transformed MSE. In the plot, the possible number of nodes ranges from 1 to 10. The lower left plot of Figure (2) shows a jump at 4, the true number of nodes. Intuitively, this jump occurs because of the sharp drop in the MSE when the fourth genuine node is added; nodes added beyond the true number merely model noise, decrease the MSE of the residuals only marginally, and thus contribute little to the jump score. When the transformation power changes from 0.4, 1, 2 to 5, the highest jump scores occur at 1, 1, 4 and 10 nodes, respectively. As the transformation power ν approaches 0, the jump score for one node approaches 1 while the jump scores for k > 1 nodes approach 0, so the highest jump occurs at one node. As the transformation power becomes very large, the jump score for 10 nodes becomes the largest. This is because the MSE of the residuals is a decreasing function of the number of nodes, so d10 is smaller than dk for 1 ≤ k < 10; as ν approaches infinity, d10^(−ν) increases much faster than dk^(−ν) for 1 ≤ k < 10, so above some value of ν, J10 is the highest jump score.

Summary of the Graphical Jump Method
By utilizing the graphical property of the jump plot, a graphical jump method is developed to choose the number of nodes more efficiently. Since a jump surely occurs at the best number of nodes, by taking as candidates the numbers of nodes at which jumps occur, the range of candidate numbers of nodes is significantly narrowed. When implementing the graphical jump method, a jump plot is first drawn to identify all the candidate numbers of nodes; assume there are g candidates in total. Second, ordering the g candidates from small to large, the dataset is modeled with each candidate number of nodes in turn to produce g sets of residuals, and g jump plots of these residuals are produced to see whether each plot contains candidates with more than one node. The key idea is that if the best model has been found, there is nothing left to model in the residuals, whereas if there are not yet enough nodes in the model, this signal can be found in the residuals by fitting one or more nodes to them.
If a candidate number of nodes is less than or equal to the best number of nodes minus two, it cannot account for the total variability of the dataset. The corresponding residuals will contain variability that must be explained by additional nodes, and thus they will test negative on the single node only test. This is because, if they had produced a positive result, the total number of nodes needed to model the data would be the candidate number of nodes plus one, which is less than the best number of nodes and contradicts the assumption about the true size of the network.
Nevertheless, if a candidate number of nodes is adjacent to the best number of nodes, then the residuals for both numbers of nodes will give positive results on the single node only test. As a result, the first candidate number of nodes without an adjacent lower candidate whose residuals need to be explained by a single node only is the best number of nodes. If the two earliest consecutive candidate numbers of nodes, Ni and Ni+1, both indicate that their residuals need only one node to account for the total variability, then the ratios dNi−1/dNi and dNi/dNi+1 are calculated.
If Ni is the best number of nodes, dNi−1 should be much larger than dNi, since the model changes from modeling the main effects to modeling the noise at this point; therefore, the ratio dNi−1/dNi should be the larger one. If Ni+1 is the true number of nodes, dNi should be much larger than dNi+1 for a similar reason; consequently, the ratio dNi/dNi+1 should be the larger one.
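The ratio rule for two adjacent surviving candidates can be written as follows (a sketch; here `d` maps a node count to its residual MSE and the function name is ours):

```python
def pick_between_adjacent(d, n_i):
    """Choose between adjacent candidates n_i and n_i + 1 using the
    ratios d[n_i - 1] / d[n_i] and d[n_i] / d[n_i + 1]."""
    r_left = d[n_i - 1] / d[n_i]     # large when the big MSE drop is at n_i
    r_right = d[n_i] / d[n_i + 1]    # large when the big drop is at n_i + 1
    return n_i if r_left > r_right else n_i + 1
```

For example, the MSE sequence {2: 0.9, 3: 0.1, 4: 0.09} drops sharply at 3 nodes, so the rule picks 3; the sequence {2: 0.9, 3: 0.85, 4: 0.1} drops at 4, so the rule picks 4.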

The Two Theorems
The following theorem provides an asymptotic result on the shape of the MSE curve after transformation, and thus a theoretical explanation for the graphical jump method. Theorem 1: Assume the dataset y follows model (1). Define Kmax as the maximum number of nodes investigated and t as the sample size. Assume the dataset is composed of G nodes and that the likelihood ratio test of H0: the dataset is composed of G nodes, versus HA: the dataset is composed of fewer than G nodes, yields a positive result. Define m* as the number of parameters in one node. If ν lies in an interval whose endpoints depend on t, m* and the 0.05 critical value of a chi-square distribution with m* degrees of freedom (the exact bounds are given in the appendix), then as t → ∞, the jump method always selects the true number of nodes as a candidate (proof in appendix). By the aforementioned theorem, when changing the transformation power from small to large, a jump surely exists at the best number of nodes. Nonetheless, sometimes a correct transformation power is difficult to identify due to the deviation of the dataset from the theoretical model, measurement error caused by experiment operators, random errors, etc. Therefore, the graphical property of the jump plot is explored to provide a tool for choosing the number of nodes based on the jump plot. The numbers of nodes at which jumps appear are called the candidate numbers of nodes, N1, ..., Ng. Theorem 2 provides a theoretical foundation for examining whether a dataset needs to be fitted by a neural network model with more than one node, a procedure dubbed the "single node only test". It is not onerous to conjecture that, if Ni is the best number of nodes, then after fitting the data with Ni nodes, the residuals follow a standard normal distribution and yield a positive result on the single node only test. Thus the candidate numbers of nodes in the jump plot are fitted to the dataset one by one to determine which residuals give positive results on the test.
Theorem 2: Assume that y ∼ N(0, 1). Define Kmax as the maximum number of nodes investigated, t as the sample size, and m* as the number of parameters in one node. If ν lies in a suitable range (the exact bounds are given in Chang (2011)), then as t → ∞, the jump plot will select one node as the only candidate. The proof of Theorem 2 is provided in Chang (2011). Theorem 2 is illustrated by Figure (4), which is a jump plot generated from a dataset simulated from a standard normal distribution. Theorem 2 demonstrates that the jump plot designates one node as the solitary candidate over a long range of transformation powers. However, the length of this range depends on the dataset actually generating the jump plot. Empirically, this range almost always includes (0, 2), i.e., the length of this range is larger than 2 units for the majority of datasets. Therefore, if the true number of nodes is one, the jump plot will select one as the candidate number of nodes over a range including (0, 2). Nevertheless, by Theorem 1, if the true number of nodes is larger than one, the jump plot will select a candidate number of nodes bigger than one within this range. For instance, in Figure (3), the dataset is composed of 4 nodes and 4 nodes are pinpointed in the jump plot within the range (1.2, 4.6), which overlaps the range (0, 2) at (1.2, 2). In all, if the jump plot picks a candidate number of nodes larger than one within the range (0, 2), then the dataset is deemed to be composed of more than one node, and of one node otherwise. As a result, in Figure (4), the dataset is regarded as composed of one node.
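In practice the decision rule above reduces to checking whether any transformation power in (0, 2) selects more than one node; a minimal sketch, reusing the jump-score definition from earlier in this section (the ν grid is an assumption of ours):

```python
import numpy as np

def jump_argmax(mse, nu):
    """Number of nodes selected by the jump plot at transformation power nu."""
    t = np.asarray(mse, dtype=float) ** (-nu)
    j = t - np.concatenate(([0.0], t[:-1]))
    return int(np.argmax(j)) + 1

def single_node_only(mse, powers=np.linspace(0.1, 1.9, 19)):
    """Positive (True) when every power in (0, 2) selects exactly one node."""
    return all(jump_argmax(mse, nu) == 1 for nu in powers)
```

A slowly decreasing residual MSE curve (noise-like residuals) tests positive, while a curve with a sharp drop beyond one node tests negative, signaling that more nodes are needed.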

Connection between the theory of the NIC criterion and the graphical jump method
There are connections between the theory of the NIC criterion and the graphical jump method. Consider a stochastic system which has an input vector x ∈ R^k and produces an output vector y ∈ R^l. An input vector x is generated according to the probability q(x), and an output vector y is generated according to a conditional probability q(y|x) specified by x; q(x, y) is the product of q(x) and q(y|x). A network is considered to have a conditional distribution p(y|x, θ), where θ ∈ R^m is an m-dimensional parameter vector that describes the network, such as a set of weights and thresholds; q(y|x) is the true distribution that p(y|x, θ) approximates. Let f(x, θ) denote the regression function of the neural network model. The dataset is assumed to follow the model y = f(x, θ) + ξ(x), where ξ(x) is noise with E(ξ(x)) = 0. To quantify the goodness of fit of a neural network, a discrepancy function D(q, p(θ)) is designed to measure the difference between q(y|x) and p(y|x, θ). Here d(x, y, θ) is a loss function; typically it is the square error loss or the negative log likelihood. The square error loss is defined as d(x, y, θ) = ‖y − f(x, θ)‖². The discrepancy function we use is D(q, p(θ)) ≡ ∫ d(x, y, θ) q(x) q(y|x) dx dy, where p(θ) is shorthand for p(y|x, θ). In order to minimize the discrepancy function, the true probability distribution q(x, y) of the target system needs to be known. However, it is not possible to identify q(x, y) in reality; frequently, the empirical distribution q*(x, y) is used instead. It is well known that the empirical distribution approximates the true distribution q(x, y) in the weak sense when the sample size is large, and hence it is reasonable to evaluate the network model by q*(x, y) instead of q(x, y). The following two minimizers are important for the graphical jump method: D(q, p(θopt)) = min_θ D(q, p(θ)), where θopt is the value of θ at which D(q, p(θ)) achieves its minimum.
D(q * , p(θ * )) = minθD(q * , p(θ)) where θ * is the value of θ when D(q * , p(θ)) attains the minimum.
Let Ropt ≡ Vq[∇d(x, y, θopt)], i.e., under the true distribution, Ropt is the variance of the first derivative of d(x, y, θopt) with respect to θ; with m the number of parameters in the model, Ropt is of dimension m × m. Let Qopt ≡ Eq[∇∇d(x, y, θopt)], i.e., under the true distribution, Qopt is the expectation of the second derivative of d(x, y, θopt) with respect to θ; Qopt is likewise of dimension m × m. Theorem 3: The average discrepancy between the system q(x, y) = q(y|x)q(x) and the machine p(y|x, θ̃) learned from t examples is given by

⟨D(q, p(θ̃))⟩ = ⟨D(q*, p(θ̃))⟩ + (1/t) tr(Ropt Qopt⁻¹),

where ⟨·⟩ denotes the expectation with respect to the distribution of θ̃, the parameter after sufficient learning (Murata et al. (1994), page 868). Theorem 3 studies the difference between ⟨D(q, p(θ̃))⟩ and ⟨D(q*, p(θ̃))⟩ in terms of the ensemble average over training sets. Nevertheless, when using this criterion for model selection, we need to evaluate ⟨D(q, p(θ̃))⟩ and ⟨D(q*, p(θ̃))⟩ for one particular training set. A "submodel" of a single layer feedforward neural network is defined as follows: if the first model has fewer hidden units than the second model, then it is deemed a submodel of the second one. The submodel can be obtained from the full model by setting the connection weights and thresholds of the extra units equal to 0. ⟨D(q, p(θ̃))⟩ can be decomposed as we now show. Let Mi = {pi(y|x, θi); θi ∈ R^mi} be a hierarchical series of models, M1 ⊂ M2 ⊂ M3 ⊂ ..., where Mi is a submodel of Mj for i < j. For a single training set,

D(q, p(θ̃)) = D(q*, p(θ̃)) + U/√t + (1/t) tr(Ropt Qopt⁻¹),   (2)

where U = √t (D(q, p(θopt)) − D(q*, p(θopt))) is a random variable of order 1 with zero mean. U is common to all the models within a hierarchical structure, such as single layer neural network models with the same dimensions for the input and output vectors; see Murata et al. (1994), page 869.
Obviously, the discrepancy D(q, p(θ̃)) achieves its minimum at the best number of nodes, resulting in a chain of inequalities among the right-hand sides of formula (2). For negative log likelihood loss, Ropt = Qopt, which makes tr(Ropt Qopt⁻¹) reduce to the number of parameters in the corresponding neural network. U is common to all the models within a hierarchical structure, and D(q*, p(θ̃)) can be expressed as a function of the MSE of the residuals under different numbers of nodes. Thus, the inequalities reduce to inequality relationships among the MSEs of the residuals under different numbers of nodes. By utilizing those inequality relationships and a Taylor expansion, the necessary conditions of the jump method, which are inequality relationships among the transformed MSEs, are proved and Theorems 1 and 2 are established.

One dimensional data
Simulation studies are first performed with one x variable and with 100, 200, 300 and 1000 observations. For the scenarios of one x variable with 100, 200 and 300 observations, data are simulated with 4 nodes, which have the distinct active regions shown in Figure (1). The x variable is generated from a gamma distribution with shape parameter 20 and rate parameter 40, plus a noise variable with a standard Gaussian distribution. For the first 3 scenarios, the node responses y1 to y4 are generated from the four node functions, and the final response is Y = y1 + y2 + y3 + y4 + ϵ, where ϵ ∼ N(0, 0.1). For the last scenario, y1 to y4 are generated from a second set of node functions, and again Y = y1 + y2 + y3 + y4 + ϵ, where ϵ ∼ N(0, 0.1).
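The first-scenario setup can be mimicked as follows; since the displayed node formulas did not survive typesetting here, the tanh coefficients below are illustrative stand-ins chosen only to match the active regions described for Figure (1):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
# x: gamma(shape = 20, rate = 40), i.e. scale = 1/40, plus standard normal noise
x = rng.gamma(shape=20, scale=1 / 40, size=n) + rng.normal(size=n)

# Hypothetical node functions (not the paper's exact coefficients):
y1 = -0.5 * np.tanh(5 * (x + 0.9))    # decreasing on roughly (-1.5, -0.3)
y2 = 0.5 * np.tanh(5 * (x - 0.05))    # increasing on roughly (-0.3, 0.4)
y3 = -0.5 * np.tanh(5 * (x - 0.75))   # decreasing on roughly (0.5, 1)
y4 = 0.5 * np.tanh(5 * (x - 1.5))     # increasing on roughly (1.1, 1.9)

# Final response: sum of the four nodes plus N(0, 0.1) noise (variance 0.1)
Y = y1 + y2 + y3 + y4 + rng.normal(scale=np.sqrt(0.1), size=n)
```

The adjacent active regions alternate in direction, which is what forces a four-node network to give the best fit.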
Table (1) summarizes the results. The upper left plot is the scatter plot of the prediction errors versus the i-th dataset (there are 30 simulated datasets for each scenario, and for each dataset a prediction error is generated for each method). In the scatter plot, the red, dark blue and green curves connect the prediction errors for the graphical jump method, the BIC method and the cross validation method, respectively; this color scheme is used for all the scatter plots in this paper. The middle left plot contains several box plots. To draw them, the prediction errors for all three methods in this scenario are combined and then categorized by the corresponding number of nodes chosen. For example, for all the datasets where the graphical jump method chooses three nodes, the prediction errors generated by the graphical jump method are combined into one group; prediction errors whose corresponding number of chosen nodes is three for the BIC method and the cross validation method are placed into the same group. Other numbers of nodes are categorized similarly. Theoretically, when overfitting happens, the mean prediction error should increase; here the mean values of the box plots decrease. This may be because the sample size of 100 is too small, so the computed prediction errors are not accurate enough. The lower left plot is a 2-D histogram generated with Excel. The x-axis indicates the number of nodes chosen, and the y-axis indicates the number of datasets for which the corresponding number of nodes is chosen. For all the 2-D histograms in this paper, the red, dark blue and green columns represent the graphical jump method, BIC and cross validation, respectively, i.e., the same color scheme as in the scatter plots. In the lower left plot, the graphical jump method chooses the correct number of nodes, i.e., four, at the highest rate.
BIC chooses four nodes and five nodes at the highest rates. The right three plots are generated for a sample size of 200. For most of the upper right plot, the red curve is the lowest of the three, which means that for most of the datasets the prediction errors generated by the graphical jump method are the lowest. In the middle right plot, the mean values of the box plots show an increasing trend along the x-axis. The 2-D histogram in the lower right shows that the graphical jump method and BIC choose 4 nodes at the highest rates. Figure (6) has six plots. The upper left, middle left and lower left plots contain the scatter plot, the box plot and the 2-D histogram, respectively. The scatter plot has two parts that differ in their y-axes: the y-axis of the first part ranges from 0 to 2×10⁻⁶ and that of the second part from 2×10⁻⁶ to 5×10⁻⁶, with the shared x-axis indicating the sequence of the 30 datasets. The two parts display the appearance of the scatter plot in the corresponding ranges of the y-axes. For most of the scatter plot, the red curve is the lowest, meaning the graphical jump method generates the lowest prediction errors most of the time. The middle left plot shows that the mean value of the box plots generally increases with the number of nodes chosen. In the 2-D histogram, the graphical jump method chooses 4 nodes as the best number of nodes most of the time. However, BIC and cross validation choose 9 or 10 nodes with the highest percentages. This is because, for a small sample size such as the 100-observation scenario, 4 nodes can explain most of the variability of the dataset.
Adding more nodes cannot decrease the MSE of the residuals by much, hence BIC and cross validation choose a node count close to 4. When the sample size becomes larger, there are more data points and 4 nodes are no longer enough to explain most of the variability of the dataset; adding more nodes decreases the MSE of the residuals considerably. Hence BIC and cross validation choose 9 or 10 nodes as the best number of nodes at high rates.
The right three plots in Figure (6) are the scatter plot, the box plot and the 2-D histogram for the sample size of 1000. Since the sample size is large enough, all three methods select the correct number of nodes most of the time, as can be seen from the 2-D histogram. The box plot also shows that there are only a few observations from 5 to 10 nodes.

Three dimensional dataset
Finally, simulations are performed for three variables. A cube described by the three variables is generated. For each node, the active region is located at a corner of the cube (see Figure (7)). The faces labeled "F" are the front faces, and Figure (7) demonstrates the positions of the active regions relative to them. For the aggregate of the four nodes, the active regions are located at 4 different corners of the cube. The nodes are then generated as follows: first, n×100 observations from a gamma distribution with shape parameter 20 and rate parameter 40, plus a standard Gaussian noise term, are generated.
After sorting them from small to large, the n×100 observations are divided into n subgroups, with each run of 100 consecutive observations forming one subgroup. Thus n subgroups are produced, and n x1 values are obtained, each equaling the mean of one subgroup. The n x2 values and n x3 values are generated similarly. Then n³ observations, X1, ..., Xn³, are produced, with the three coordinates being the combinations of the x1 (first coordinate), x2 (second coordinate) and x3 (third coordinate) values. The first to fourth nodes are simulated from node functions of X1, X2 and X3, the first, second and third coordinates of the Xs. The final Y values are generated by Y = y1 + y2 + y3 + y4 + ϵ, where ϵ ∼ N(0, 1). Simulations are done for n equal to 4, 5 and 6. Table (2) is presented the same way as for the one dimensional case. For all three scenarios, the graphical jump method leads the other two methods in both picking the correct number of nodes and making predictions. The graphical jump method picks the correct result 100%, 90% and 93.3% of the time for sample sizes of 64, 125 and 216, respectively, which is almost twice the rate of the second best methods. The mean (sd) of the prediction errors generated by the graphical jump method is 1.14e-04 (7.68e-05), 3.97e-05 (1.18e-05) and 3.27e-05 (4.21e-06) for sample sizes of 64, 125 and 216, respectively. Both the mean and the sd of the prediction errors generated by the graphical jump method are smaller than those of the other two methods, which means the graphical jump method produces results that are both more accurate and more stable. For Figure (8), everything is the same as for the sample size of 300 in dimension one, except that the plot containing the box plots is divided into two parts with different ranges on the x-axis. Since the mean prediction errors for 3 nodes are much higher than those for the remaining numbers of nodes, they are drawn separately in the left parts.
In the right parts of the plots, the box plots show an increasing trend in mean values along the x-axis, which means that when overfitting happens, the prediction errors increase with the number of nodes chosen. For all three scenarios in three dimensions, the scatter plots indicate that the graphical jump method yields the lowest prediction errors most of the time. Also, the spread of the red points is narrower than those of the green and dark blue points. The 2-D histograms show that the graphical jump method picks the correct number of nodes most of the time (> 90%); for the sample size of 64, it even yields 100% correct results. The graphical jump method yields the highest percentage of correct results among the three.
For each of the scenarios above, another set of explanatory variables and response variable is generated in the same way as in that scenario. The newly generated explanatory variables are used to make predictions, and the predicted values are compared to the corresponding newly generated response variable.
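The grid construction described for the three dimensional design can be sketched as follows (function and variable names are ours):

```python
import numpy as np

def make_grid_coords(n, rng):
    """One coordinate axis: sort n*100 draws of gamma(20, rate 40) + N(0, 1),
    split into n consecutive subgroups of 100, and take each subgroup mean."""
    v = np.sort(rng.gamma(20, 1 / 40, n * 100) + rng.normal(size=n * 100))
    return v.reshape(n, 100).mean(axis=1)

rng = np.random.default_rng(0)
n = 4
x1, x2, x3 = (make_grid_coords(n, rng) for _ in range(3))

# All combinations of the three axes give the n^3 design points
X = np.array([(a, b, c) for a in x1 for b in x2 for c in x3])
```

With n = 4, 5 and 6 this yields the 64, 125 and 216 design points used in the three scenarios.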

Combined Cycle Power Plant Data Set
We now show our method in use on a real dataset. The dataset contains 9568 observations obtained from a combined cycle power plant over 6 years (2006-2011), during which the power plant worked at full load. The explanatory variables are hourly average ambient variables: Temperature (T), in the range of 1.81°C to 37.11°C; Ambient Pressure (AP), in the range of 992.89 to 1033.30 millibar; Relative Humidity (RH), in the range of 25.56% to 100.16%; and Exhaust Vacuum (V), in the range of 25.36 to 81.56 cm Hg. The output is the net hourly electrical energy output (EP) of the plant, varying from 420.26 to 495.76 MW (Tufekci, 2014). Various sensors located around the plant record the ambient variables every second, and the hourly averages are taken as the observations given in the dataset. The variables in the dataset are not normalized.
The dataset is analyzed using a single layer feedforward neural network. The graphical jump method, the BIC criterion and 10-fold cross validation are implemented to analyze the dataset, and they select 1, 4 and 9 nodes as the best numbers of nodes, respectively. The full dataset is then analyzed with the neural network model using 1, 4 and 9 nodes separately. The graphical jump method yields the lowest mean square error (MSE) of residuals, 272.317; the BIC criterion and 10-fold cross validation yield higher MSEs of residuals, 291.866 and 291.865, respectively. Therefore the number of nodes selected by the graphical jump method does the best job of predicting the outcome variable from the explanatory variables; hence, the graphical jump method does the best job of choosing the true number of nodes.
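The comparison workflow, fitting the network with each candidate node count and comparing the residual MSEs, can be sketched with a minimal numpy implementation; the full-batch gradient-descent training below is illustrative (run here on toy data) and is not the fitting procedure used in the paper:

```python
import numpy as np

def fit_one_layer_nn(X, y, k, steps=2000, lr=0.05, seed=0):
    """Fit a single-hidden-layer (k tanh nodes) regression network by
    full-batch gradient descent and return the residual MSE d_k."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    W = rng.normal(scale=0.5, size=(p, k))   # input-to-hidden weights
    b = np.zeros(k)                          # hidden thresholds
    v = rng.normal(scale=0.5, size=k)        # hidden-to-output weights
    c = 0.0                                  # output bias
    for _ in range(steps):
        H = np.tanh(X @ W + b)               # n x k hidden activations
        r = H @ v + c - y                    # residuals
        # gradients of the (1/2) * mean-squared-error loss
        gH = (np.outer(r, v) / n) * (1.0 - H ** 2)  # backprop through tanh
        W -= lr * (X.T @ gH)
        b -= lr * gH.sum(axis=0)
        v -= lr * (H.T @ r / n)
        c -= lr * r.mean()
    H = np.tanh(X @ W + b)
    return float(np.mean((H @ v + c - y) ** 2))

# Toy comparison: data generated by a single tanh ridge
X = np.linspace(-2, 2, 200).reshape(-1, 1)
y = np.tanh(3 * X[:, 0])
d = {k: fit_one_layer_nn(X, y, k) for k in (1, 4)}
```

Running the same loop for each candidate node count (1, 4 and 9 for the power plant data) and comparing the resulting d_k values reproduces the comparison reported above.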

Conclusion, Discussion and Future Research Directions
The jump method was first introduced in Sugar and James (2003). It contains 4 simple steps and produces a jump score for each number of nodes, which is a function of a negative-power transformation of the mean square error. Given the correct transformation power, the highest jump score occurs at the true number of nodes. In practice, however, the correct transformation power can be difficult to identify since there are many unknown parameters. Instead, the strategy is to use a jump plot, which graphs the number of nodes selected versus the transformation power used, to distinguish false candidates from the true one. Theorem 1 demonstrates that if the true number of nodes is bigger than one, it will be selected in the jump plot, i.e., among the several numbers of nodes selected in the jump plot, one of them is the true answer. Theorem 2 shows that if the true number of nodes is one, the jump plot will select one as the only candidate over a long range of transformation powers. Theorem 2 can be used iteratively to rule out false candidates selected from the jump plot: if a candidate number of nodes is less than the true number, then after fitting the neural network model with that number of nodes, the residuals will still need to be explained by more than one node and will produce a candidate larger than one in their jump plot. The method was demonstrated to work on both simulated and real datasets.
The graphical jump method has the potential to be extended to other problems where a counting number needs to be chosen, such as choosing the number of species in environmental studies or the number of neighbors in computer science. We could also try to choose the number of nodes for other types of neural networks, such as classification neural networks, recurrent neural networks (encompassing simple recurrent networks, long short-term memory (LSTM) networks, Hopfield networks and echo state networks), region-based convolutional neural networks (R-CNN), the growing neural gas network (GNGN), radial basis function networks, and stochastic neural networks (including the Boltzmann machine). Researchers have already studied the topic of model selection for some of the aforementioned neural networks (Decker, 2006; Hessami and Viau, 2004; Liu, 2016). Since all of these neural networks are composed of multiple nodes for data processing, a "jump plot" might be constructed to find the candidate numbers of nodes for the neural network, and a "single node only test" could then be designed to rule out erroneous candidates. Another interesting extension would be to multivariate decision problems, such as multi-layer neural networks, where each layer might have a different optimal number of nodes.
Appendix
For negative log likelihood loss, Ropt = Qopt, so with m the number of parameters in the corresponding model (Ropt being of dimension m × m), tr(Ropt Qopt⁻¹) = tr(Ropt Ropt⁻¹) = tr(Im×m) = m (see Murata, Yoshizawa and Amari (1994), page 868). Since the minimum discrepancy occurs at G nodes, a chain of inequalities among the right-hand sides of formula (2) follows. By Taylor series expansion, these inequalities yield inequality relationships among the MSEs of the residuals, and again via Taylor series expansion it can be shown that the highest jump score occurs at G nodes, rather than at any other k, for a certain range of ν. Additional details are in Chang (2011).