Combining Unsupervised and Supervised Neural Networks in Cluster Analysis of Gamma-Ray Bursts

Abstract: The paper proposes the use of Kohonen's Self-Organizing Map (SOM) and supervised neural networks to find clusters in samples of gamma-ray bursts (GRB) using the measurements given in the BATSE GRB catalogue. The extent of separation between the clusters obtained by SOM was examined by a cross-validation procedure using supervised neural networks for classification. A method is proposed for variable selection to reduce the “curse of dimensionality”. Six variables were chosen for cluster analysis. Additionally, principal components were computed using all the original variables, and the 6 components which accounted for a high percentage of the variance were chosen for SOM analysis. All these methods indicate 4 or 5 clusters. Further analysis based on the average profiles of the GRB indicated a possible reduction in the number of clusters.


Introduction
It is of great interest to astronomers to know whether the measurements on gamma-ray bursts (GRB) can be characterized by a single probability distribution around some central value or as a mixture of probability distributions around different central values. Clustering is an exploratory data analysis (EDA) technique for investigating such problems by looking for groups of observed samples which are well separated under a suitable criterion. The ultimate aim is to seek a physical interpretation of the differences between the groups. An interesting example in a different context is the discovery of three clusters in the general population of individuals based on some blood tests for diabetes: one identified as diabetes-free, and the other two representing individuals with two different types of diabetes, A and B (Reaven and Miller, 1979). Another example is the discovery of two clusters of individuals suggesting two types of cancers (Golub et al., 1999). Cluster analysis is a valuable tool in knowledge acquisition. In the literature there are two approaches to cluster analysis. One is parametric, assuming a mixture of a given number of probability distributions such as multivariate normals. The other is nonparametric, which offers great flexibility in discovering the number of clusters and their shapes without going through model selection procedures.
There are a number of methods of cluster analysis, a good review of which can be found in Jain, Murty and Flynn (1999) and Jiang et al. (2004). We use an unsupervised neural network known as SOM (Self Organizing Map) for finding clusters and discuss methods of validating them by cross validation and profile analysis. We also propose two methods of reducing the number of variables for obtaining stable results. Some references to early work on cluster analysis of GRB are Mitrofanov et al. (1998), Bagoly et al. (1998), Mukherjee et al. (1998), Hakkila et al. (2000), and Rajaniemi and Mahonen (2002).

Data
We consider the original BATSE 3B catalogue from the Compton Gamma Ray Observatory, which is composed of 1122 GRB trigger samples with 14 measurements of astrophysical interest made on each sample. In addition, we also list 3 other measurements usually considered in astrophysical research, described in Mukherjee et al. (1998), Mitrofanov et al. (1998), and Rajaniemi and Mahonen (2002). Since the computational complexity of the data mining process is not increased dramatically by including additional variables, we used all 17 variables. The list of 17 variables is given in Table 1 with the mean values and standard deviations of the log variables. The log transformation is made to bring the variables to a uniform scale.
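The effect of the log transformation can be illustrated with a small sketch; the duration values below are hypothetical, chosen only to show how the transformation compresses variables spanning several orders of magnitude:

```python
import numpy as np

# Hypothetical T90 durations in seconds, spanning several orders of
# magnitude, as GRB durations do.
t90 = np.array([0.3, 2.5, 35.0, 120.0])

# A log transformation brings such variables to a uniform scale before
# computing means, standard deviations, or distances for clustering.
log_t90 = np.log10(t90)
mean, sd = log_t90.mean(), log_t90.std(ddof=1)
```

The raw values span a range of about 120 while the log values span under 3 units, so no single variable dominates distance computations.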

Cluster analysis using SOM
There were 422 GRB samples with all variables present. A SOM was used for clustering these GRB patterns. Four different topologies were tried to test the clustering process. Figures 1a-1d show the number of patterns in each cluster over square Kohonen maps of different dimensions with 25 (5×5), 49 (7×7), 100 (10×10) and 225 (15×15) nodes. As can be seen, the nodes representing the classes are well separated from each other in the 2-dimensional map provided by the topology. For a brief description of SOM and the underlying concepts, reference may be made to Rajaniemi and Mahonen (2002). All topologies clustered the 422 samples into 5 clusters, designated as classes 1, 2, 3, 4 and 5. The fifth class had a small frequency and did not appear to be different from the fourth. They were combined to form one cluster as class 4.
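The SOM training procedure can be sketched as a minimal, generic implementation; the grid size, learning-rate schedule and neighborhood schedule below are illustrative assumptions, not the paper's exact settings:

```python
import numpy as np

def train_som(data, grid=(5, 5), epochs=50, lr0=0.5, sigma0=2.0, seed=0):
    """Minimal Self-Organizing Map: returns the trained weight grid."""
    rng = np.random.default_rng(seed)
    h, w = grid
    weights = rng.normal(size=(h, w, data.shape[1]))
    # Grid coordinates, used by the neighborhood function.
    coords = np.stack(np.meshgrid(np.arange(h), np.arange(w),
                                  indexing="ij"), axis=-1)
    for t in range(epochs):
        lr = lr0 * np.exp(-t / epochs)        # decaying learning rate
        sigma = sigma0 * np.exp(-t / epochs)  # shrinking neighborhood
        for x in data[rng.permutation(len(data))]:
            # Best-matching unit: the node closest to the input pattern.
            d = np.linalg.norm(weights - x, axis=-1)
            bmu = np.unravel_index(d.argmin(), d.shape)
            # Gaussian neighborhood pulls nearby nodes toward x.
            g = np.exp(-((coords - np.array(bmu)) ** 2).sum(-1)
                       / (2 * sigma ** 2))
            weights += lr * g[..., None] * (x - weights)
    return weights

def bmu_of(weights, x):
    """Map a pattern to its best-matching node on the grid."""
    d = np.linalg.norm(weights - x, axis=-1)
    return np.unravel_index(d.argmin(), d.shape)
```

After training, patterns from well-separated groups land on different nodes of the map, which is the property the cluster counts in Figures 1a-1d rely on.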

Cross validation
Working with the 15 × 15 topology (the one which presented the maximum relative distance between classes), the input patterns were divided into two groups, an in-sample set (317 patterns) and an out-of-sample set (105 patterns), by a random algorithm using stratified sampling. A supervised MLP (Multilayer Perceptron) neural network with Bayesian regularization (see MacKay, 1992) was trained on the in-sample set for classification of the patterns into four classes. Ten different trainings were performed, and the patterns in the out-of-sample set were classified into four classes. The overall mean accuracy of classification was 92.4%, and the error for each class is given in Table 2. It is seen that classes 1, 3 and 4 are well separated, while class 2 is not so well separated from class 1. While this needs further discussion, we consider the four classes in order to explain the method for reducing the number of variables.
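The stratified division into in-sample and out-of-sample sets can be sketched as below; `stratified_split` is a hypothetical helper name, and the 317/105 split corresponds to holding out roughly a quarter of the patterns within each class:

```python
import numpy as np

def stratified_split(labels, out_frac=0.25, seed=0):
    """Split indices into in-sample / out-of-sample sets, preserving
    the class proportions (stratified random sampling)."""
    rng = np.random.default_rng(seed)
    ins, outs = [], []
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))
        n_out = max(1, round(out_frac * len(idx)))
        outs.extend(idx[:n_out])   # held out for validation
        ins.extend(idx[n_out:])    # used for training
    return np.array(ins), np.array(outs)
```

The MLP itself is then trained several times on the in-sample indices and scored on the out-of-sample indices, and the classification errors are averaged over the repeated trainings.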

Reduction of dimensionality
In multivariate analysis one is faced with the curse of dimensionality, as originally pointed out by Rao (1952) and referred to in the statistical literature as Rao's paradox. For obtaining stable results, a proper selection of variables has to be made. We suggest two procedures for this purpose, one of which is described in this section; the second, based on principal component analysis, is detailed in the next section. Figure 2(a) presents, for each input variable of the feedforward neural network, the sum of the absolute values of the weights (S_i) connecting the corresponding input to the hidden-layer neurons. Taking the mean (M) and the standard deviation (SD) of these sums and using as threshold (T) the value T = M − SD, we eliminated the variables whose S_i were below T. The neural network was trained 10 times, for randomly chosen sets of initial weights, and the pruning criterion was used to confirm the eliminated variables. The average of the misclassification errors for these 10 samples will be denoted by AV_j.
After eliminating variables, a further 10 training runs were performed and the misclassification errors were computed. If the average of these errors (AV_{j+1}) was more than the AV_j value, then the variables would be definitively abandoned. The procedure is repeated iteratively until the elimination of variables no longer improves the misclassification error. Figure 2(b) shows the relative importance of each of the remaining input variables that were considered most relevant for the classification process (T50, T90, F1, F2, F3 and F4, respectively). For this final configuration, the misclassification error was 5.9% for the out-of-sample set and 1.4% for the in-sample set.
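The pruning criterion (weight sums S_i thresholded at T = M − SD) can be expressed compactly; `prune_variables` is a hypothetical name, and the weight matrix is assumed to have one row per input variable:

```python
import numpy as np

def prune_variables(input_hidden_weights):
    """Variable-selection criterion from the text: for input i, compute
    S_i = sum of |weights| from input i to all hidden neurons, then drop
    the variables whose S_i falls below T = mean(S) - std(S)."""
    s = np.abs(input_hidden_weights).sum(axis=1)  # S_i, one per input
    t = s.mean() - s.std(ddof=1)                  # threshold T = M - SD
    keep = np.flatnonzero(s >= t)
    drop = np.flatnonzero(s < t)
    return keep, drop, s, t
```

An input whose connections to the hidden layer carry little total weight contributes little to the network's output, which is the rationale for dropping it.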
With six variables, the number of available observations (without missing values) increased from 422 to 632. Using again a 15 × 15 topology for the SOM, now with the six remaining input variables and 632 patterns, the classes and frequencies found were similar to those obtained using all seventeen input variables.
A feedforward neural network trained with the final six variables and 498 in-sample and 134 out-of-sample observations resulted in an out-of-sample misclassification error of 5.9%, compared with 7.6% for the initial network with 17 variables, 317 in-sample and 105 out-of-sample observations. However, since the objective is to compare this methodology with the one described in the next section, only the 422 patterns initially considered will be used to compare the two methods.

Principal Component Analysis
A second approach to the reduction of dimensionality is PCA (Principal Component Analysis), where the variables are replaced by a smaller number of linear functions of the variables. In computing the principal components, only the first 14 variables of Table 1 are used; the last 3 variables of Table 1 are functions of the variables F1, F2, F3 and F4 in the list. The computations, made on the 422 samples where all the variables are available, provided 14 linear combinations of the variables with the associated eigenvalues as indicated in Figure 3. The first six principal components accounted for 98% of the total variance, and the SOM was used for clustering based on these components only. The analysis provided the same 5 classes as discussed in Section 2. From the results of Table 3, we conclude that the variables T50, T90, F1, F2, F3 and F4 are the most important. This result agrees with the previous one obtained using the MLP with a regularization technique, which showed the same six variables as the most relevant to the classification process.
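The PCA step can be sketched as an eigendecomposition of the covariance matrix; `pca_components` is a hypothetical helper, and the explained-variance fraction it returns is the quantity behind the "first six components account for 98% of the variance" statement:

```python
import numpy as np

def pca_components(X, n=6):
    """PCA via eigendecomposition of the covariance matrix; returns the
    data projected onto the first n components and the fraction of the
    total variance those components explain."""
    Xc = X - X.mean(axis=0)                  # center each variable
    cov = np.cov(Xc, rowvar=False)
    evals, evecs = np.linalg.eigh(cov)       # eigh returns ascending order
    order = np.argsort(evals)[::-1]          # reorder to descending
    evals, evecs = evals[order], evecs[:, order]
    explained = evals[:n].sum() / evals.sum()
    return Xc @ evecs[:, :n], explained
```

Clustering is then run on the projected scores instead of the original variables, which keeps most of the variance while shrinking the dimension.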
Using this method, we obtained the same 5 classes as in the previous analysis, with the same patterns in each class. The full table with the composition of each class is available from the authors. Labeling the classes found by SOM's training process with the numbers 1 to 5, it is possible to plot the patterns in graphs of the first versus the second and the third principal components provided by the PCA. These classes are clearly seen in Figures 4 and 5, where there is evidence of three classes (1, 2 and 3). The status of classes 4 and 5 is not clear; they may be considered as separate classes, or class 5 may be merged with class 3 and class 4 with class 2. The profile analysis carried out in the next section also suggests a similar grouping.

Graphical Evaluation of the Classes
There is no statistical method recommended as the best for evaluating the validity and the number of clusters determined by one or more of the numerous algorithms available for cluster analysis; see Sugar and James (2003) and Jiang et al. (2004). Figure 6 gives the distortion curves recommended in Sugar and James (2003), which suggest about 4 classes. Another suggested method is to examine the profiles of the patterns in the different classes, which in the statistical literature is also known as a plot of parallel coordinates of individuals and mean values, as shown in Figure 7. It is seen that Classes 1 and 3 are distinct, with Class 2 occupying an intermediate position. The positions of Classes 4 and 5 are not clear. It is interesting to see that the four classes differ mainly in the mean values of the six variables chosen for clustering in Section 3.
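The distortion underlying these curves can be estimated with k-means, as in the following sketch; Sugar and James (2003) then plot a transformed distortion d_k^(-p/2) against k and look for the largest jump. The farthest-point initialization here is an illustrative choice to keep the example deterministic, not part of their method:

```python
import numpy as np

def init_centers(X, k):
    """Deterministic farthest-point initialization for k-means."""
    centers = [X[0]]
    for _ in range(k - 1):
        d = ((X[:, None] - np.array(centers)) ** 2).sum(-1).min(axis=1)
        centers.append(X[d.argmax()])  # point farthest from chosen centers
    return np.array(centers)

def distortion(X, k, iters=50):
    """Average per-dimension squared distance to the nearest cluster
    center after k-means: the quantity plotted against k in a
    distortion curve."""
    centers = init_centers(X, k)
    for _ in range(iters):  # Lloyd's algorithm
        labels = ((X[:, None] - centers) ** 2).sum(-1).argmin(axis=1)
        centers = np.array([X[labels == j].mean(axis=0)
                            if np.any(labels == j) else centers[j]
                            for j in range(k)])
    return ((X - centers[labels]) ** 2).sum(-1).mean() / X.shape[1]
```

The distortion drops sharply once k reaches the true number of groups and flattens afterwards, which is what makes the curve (and its transformed jumps) informative about the number of clusters.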

Conclusion
Our study indicates the following:

a) The profile plot and the scatter plot of the first two principal components indicate a clear separation between Classes 1 and 3. Patterns in Class 1 are characterized by long duration, bright fluence and soft spectrum, while those in Class 3 by short duration, faint fluence and hard spectrum. There is some overlap between Classes 1 and 2 in the profile plot, but the distinctiveness of Class 2 is brought out in the plot of principal components. Patterns in this class are characterized by intermediate duration and fluence, and hard spectrum. The positions of Classes 4 and 5 are not clear. However, the profile plots of Classes 4 and 5 appear to be similar. Patterns in these classes can be characterized by intermediate duration, fluence and spectrum.

b) The means of the variables T50, T90, F1, F2, F3, F4 and H321 of the 3 clusters are well differentiated, while the means of the other 10 variables P64, P256, P1024, T64, T256, T1024, Lat, Lon, Ft and H32 are not. The latter variables may not be useful in predicting the class to which a future GRB belongs. Any physical interpretation of the clusters should take this into account.

c) SOM seems to be an appropriate tool for clustering and for graphical display of the results.

d) The choice of the dominant principal components is a computationally convenient way of reducing the curse of dimensionality due to a large number of variables in cluster analysis and classification problems.

e) SOM provides non-overlapping clusters, and the distinction between Classes 1 and 2 cannot be easily specified. A parametric approach such as fitting a mixture model may reveal three components, as demonstrated in the paper by Mukherjee et al. (1998).