A New Procedure of Clustering Based on Multivariate Outlier Detection

Clustering is an important task in a wide variety of application domains, especially in management and social science research. In this paper, an iterative clustering procedure based on multivariate outlier detection is proposed, using the Mahalanobis distance. First, the Mahalanobis distance is calculated for the entire sample; then an upper control limit (UCL) is fixed using the T-square statistic. Observations above the UCL are treated as outliers and grouped as an outlier cluster, and the same procedure is repeated for the remaining inliers until the variance-covariance matrix of the variables in the last cluster becomes singular. At each iteration, a multivariate test of means is used to check the discrimination between the outlier cluster and the inliers. Multivariate control charts are also used to visualize the iterations and the outlier clustering process graphically. Finally, a multivariate test of means helps to firmly establish cluster discrimination and validity. The procedure is applied to cluster 275 customers of a famous two-wheeler in India based on 19 different attributes of the two-wheeler and its company. The results confirm that there exist 5 and 7 outlier clusters of customers in the entire sample at the 5% and 1% significance levels, respectively.


Introduction and Related Work
Outliers are the set of objects that are considerably dissimilar from the remainder of the data (Han, 2006). Outlier detection is an extremely important problem with direct application in a wide variety of domains, including fraud detection (Bolton, 2002), identifying computer network intrusions and bottlenecks (Lane, 1999), criminal activities in e-commerce, and detecting suspicious activities (Chiu, 2003). Different approaches have been proposed to detect outliers, and good surveys can be found in (Knorr, 1998; Knorr, 2000; Hodge, 2004). Clustering is a popular technique used to group similar data points or objects into groups or clusters (Jain and Dubes, 1988), and it is an important tool for outlier analysis. Several clustering-based outlier detection techniques have been developed. Most of these techniques rely on the key assumption that normal objects belong to large and dense clusters, while outliers form very small clusters (Loureiro, 2004; Niu, 2007).
Many researchers have debated whether clustering algorithms are an appropriate choice for outlier detection. For example, in (Zhang and Wang, 2006), the authors reported that clustering algorithms should not be considered as outlier detection methods. This might be true for some clustering algorithms, such as the k-means clustering algorithm (MacQueen, 1967), because the cluster means produced by the k-means algorithm are sensitive to noise and outliers (Laan, 2003). However, the case is different for the Partitioning Around Medoids (PAM) algorithm (Kaufman and Rousseeuw, 1990). PAM attempts to determine k partitions for n objects. The algorithm uses the most centrally located object in a cluster (called the medoid) instead of the cluster mean. PAM is more robust than the k-means algorithm in the presence of noise and outliers, because the medoids produced by PAM are robust representations of the cluster centers and are less influenced by outliers and other extreme values than the means (Laan, 2003; Kaufman and Rousseeuw, 1990; Dudoit and Fridlyand, 2002). Furthermore, PAM is a data-order independent algorithm (Hodge, 2004), and it was shown in (Bradley, 1999) that the medoids produced by PAM provide better class separation than the means produced by the k-means clustering algorithm. PAM starts by selecting an initial set of medoids (cluster centers) and iteratively replaces each of the selected medoids by one of the non-selected objects in the data set as long as the sum of dissimilarities of the objects to their closest medoids is improved. The process is iterated until the criterion function converges.
In this paper, a new method of clustering is proposed based on multivariate outlier detection. Note that our approach can be implemented easily compared with other clustering algorithms that are based on PAM, such as CLARA (Kaufman and Rousseeuw, 1990), CLARANS (Ng and Han, 1994) and CLATIN (Zhang and Couloigner, 2005).
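The robustness argument for medoids versus means can be illustrated with a toy one-dimensional example. This is only an illustrative sketch: the data values and the helper names `mean` and `medoid` are hypothetical, and absolute difference is assumed as the dissimilarity.

```python
def mean(points):
    """Arithmetic mean of a list of numbers."""
    return sum(points) / len(points)


def medoid(points):
    """The data point with the smallest total dissimilarity to all points."""
    return min(points, key=lambda c: sum(abs(c - x) for x in points))


# Four ordinary values and one extreme value (hypothetical data).
values = [1.0, 2.0, 3.0, 4.0, 100.0]

print(mean(values))    # 22.0 -> pulled strongly towards the extreme value
print(medoid(values))  # 3.0  -> stays among the bulk of the data
```

The single extreme value moves the mean far from the bulk of the data, while the medoid remains an actual, centrally located observation.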
As discussed in (Loureiro, 2004; Niu, 2007; Zhang and Wang, 2006), there is no single universally applicable or generic outlier detection approach. Therefore, many approaches have been proposed to detect outliers. These approaches can be classified into four major categories based on the techniques used (Zhang and Wang, 2006): distribution-based, distance-based, density-based and clustering-based approaches.
Distribution-based approaches (Hawkins, 1980; Barnett and Lewis, 1994; Rousseeuw and Leroy, 1996) develop statistical models (typically for the normal behavior) from the given data and then apply a statistical test to determine whether an object belongs to this model. Objects that have a low probability of belonging to the statistical model are declared outliers. However, distribution-based approaches cannot be applied in multidimensional scenarios because they are univariate in nature. In addition, prior knowledge of the data distribution is required, making distribution-based approaches difficult to use in practical applications (Zhang and Wang, 2006).
In the distance-based approach (Knorr, 1998; Knorr, 2000; Ramaswami, 2000; Angiulli and Pizzut, 2005), outliers are detected as follows. Given a distance measure on a feature space, a point q in a data set is an outlier with respect to the parameters M and d if there are fewer than M points within the distance d from q, where the values of M and d are decided by the user. The problem with this approach is that it is difficult to determine the values of M and d.
Density-based approaches (Breunig, 2000; Papadimitriou, 2003) compute the density of regions in the data and declare the objects in low-density regions as outliers. In (Breunig, 2000), the authors assign an outlier score to any given data point, known as the Local Outlier Factor (LOF), depending on its distance from its local neighborhood. A similar work is reported in (Papadimitriou, 2003).
Clustering-based approaches (Loureiro, 2004; Gath and Geva, 1989; Cutsem and Gath, 1993; Jiang, 2001; Acuna and Rodriguez, 2004) consider clusters of small size as clustered outliers. In these approaches, small clusters (i.e., clusters containing significantly fewer points than other clusters) are considered outliers. The advantage of the clustering-based approaches is that they do not have to be supervised. Moreover, clustering-based techniques can be used in an incremental mode (i.e., after learning the clusters, new points can be inserted into the system and tested for outliers). Cutsem and Gath (1993) present a method based on fuzzy clustering. To test for the absence or presence of outliers, two hypotheses are used. However, the hypotheses do not account for the possibility of multiple clusters of outliers. Jiang et al. (Jiang, 2001) presented a two-phase method to detect outliers. In the first phase, the authors proposed a modified k-means algorithm to cluster the data; in the second phase, an Outlier-Finding Process (OFP) is proposed, in which the small clusters are selected and regarded as outliers by using minimum spanning trees. In (Loureiro, 2004), clustering methods have been applied. The key idea is to use the size of the resulting clusters as an indicator of the presence of outliers. The authors use a hierarchical clustering technique. A similar approach was reported in (Almeida, 2007). Acuna and Rodriguez (2004) performed the PAM algorithm followed by the separation technique (henceforth, the method will be termed PAMST). The separation of a cluster A is defined as the smallest dissimilarity between two objects, one belonging to cluster A and the other not. If the separation is large enough, then all objects that belong to that cluster are considered outliers. To detect the clustered outliers, one must vary the number k of clusters until clusters of small size and with a large separation from other clusters are obtained. In (Yoon, 2007), the authors proposed a clustering-based approach to detect outliers, in which the k-means clustering algorithm is used. As mentioned in (Laan, 2003), k-means is sensitive to outliers, and hence may not give accurate results.
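The distance-based definition reviewed above (a point q is an outlier with respect to M and d if fewer than M points lie within distance d of q) can be written down directly. The following is a minimal brute-force sketch, assuming Euclidean distance and assuming that q itself is not counted among its neighbours; the function name `distance_based_outliers` and the sample points are hypothetical.

```python
import math


def distance_based_outliers(points, M, d):
    """Return indices of points with fewer than M other points within distance d."""
    outliers = []
    for i, q in enumerate(points):
        # Count neighbours of q within distance d (q itself is excluded).
        neighbours = sum(
            1
            for j, x in enumerate(points)
            if j != i and math.dist(q, x) <= d
        )
        if neighbours < M:
            outliers.append(i)
    return outliers


# Four tightly grouped points and one isolated point (hypothetical data).
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
print(distance_based_outliers(points, M=2, d=0.5))  # [4]
```

Changing M or d changes which points are flagged, which is the practical difficulty with this family of methods noted above.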

Proposed Approach
In this paper we propose a new approach to outlier-based clustering using the Mahalanobis distance. In statistics, the Mahalanobis distance is a measure introduced by P. C. Mahalanobis (1936), which is based on correlations between variables, by which different patterns can be identified and analyzed. It gauges the similarity of an unknown sample set to a known one. It differs from the Euclidean distance in that it takes the correlations of the data set into account and is scale-invariant. In other words, it is a multivariate effect size. Formally, the Mahalanobis distance of a multivariate observation vector X from the sample mean, given the covariance matrix S, is defined as

$$D = \sqrt{(X - \bar{X})^{T} S^{-1} (X - \bar{X})} \qquad (1)$$

In (1), $\bar{X}$ is the sample mean vector of order p × 1 and S is the sample variance-covariance matrix of order p × p. The test statistic for the Mahalanobis distance is the squared Mahalanobis distance, the T-square statistic, which was first proposed by Harold Hotelling (1951) and is given as

$$T^{2} = (X - \bar{X})^{T} S^{-1} (X - \bar{X}) \qquad (2)$$

From (2), Hotelling derived the upper control limit (UCL) of the T-square statistic as

$$\mathrm{UCL} = \frac{(n-1)^{2}}{n}\, \beta\!\left(\alpha,\ \frac{p}{2},\ \frac{n-p-1}{2}\right)$$

where n is the sample size, p is the number of variables, α is the level of significance and β(α, p/2, (n − p − 1)/2) is the upper α percentage point of the beta distribution with parameters p/2 and (n − p − 1)/2.
Based on the above distance measures, first assume that all the variables follow a multivariate normal distribution and calculate the Mahalanobis distance from (1) for the n observations on p variables, where n > p. Second, from (2), fix a UCL for the T-square statistic; observations above the UCL are considered an outlier cluster, named cluster 1. Repeat the same procedure for the remaining observations, excluding the observations in cluster 1. Repeat the process until the variance-covariance matrix of the variables in the last cluster becomes singular. The cut-off T-square value can be fixed by using the beta distribution, and individual outlier observations can be identified with the help of the 1% or 5% significance points of the T-square test statistic. The basic structure of the proposed method is as follows:
Step 1: Calculate the Mahalanobis distance for n observations based on p variables.
Step 2: Determine the observations that lie above the UCL of the T-square statistic and treat those observations as outlier cluster 1.
Step 3: Using a multivariate test of means, check the equality of the means of the variables in cluster 1 and in the remaining observations. If the means are equal, stop the iteration; this indicates that there are no clusters in the sample. If the means are not equal, there is discrimination between the observations in cluster 1 and the remaining observations, and the procedure continues with Step 4. This test is repeated at each subsequent iteration.
Step 4: Repeat Steps 1 and 2 for the remaining observations to ascertain cluster 2.
Step 5: Continue the iteration process until the variance-covariance matrix of the p variables in the last cluster is singular.
Step 6: To scrutinize the overall discriminant validity of the clusters, apply a multivariate test of means under the assumption of a homogeneous variance-covariance matrix.
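The steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: it assumes NumPy and SciPy, the function names `t_square`, `ucl` and `outlier_clusters` are hypothetical, the beta term is read as the upper α percentage point, and p ≥ 2. Two extra stopping guards (no observation above the UCL, and too few observations for the beta parameters to be valid) are additions made here so that the loop always terminates. The tests of means in Steps 3 and 6 are omitted.

```python
import numpy as np
from scipy.stats import beta


def t_square(X):
    """Squared Mahalanobis distance of each row of X from the sample mean."""
    centered = X - X.mean(axis=0)
    S_inv = np.linalg.inv(np.cov(X, rowvar=False))  # inverse of the p x p matrix S
    return np.einsum("ij,jk,ik->i", centered, S_inv, centered)


def ucl(n, p, alpha):
    """UCL = ((n - 1)^2 / n) * upper-alpha point of Beta(p/2, (n - p - 1)/2)."""
    return (n - 1) ** 2 / n * beta.ppf(1 - alpha, p / 2, (n - p - 1) / 2)


def outlier_clusters(X, alpha=0.05):
    """Label each row of X with the iteration at which it was split off."""
    n_total, p = X.shape
    labels = np.zeros(n_total, dtype=int)
    remaining = np.arange(n_total)
    cluster = 1
    while True:
        sub = X[remaining]
        n = len(remaining)
        # Step 5: stop when S is singular (guard added here: also when n <= p + 1).
        singular = n <= p + 1 or np.linalg.matrix_rank(np.cov(sub, rowvar=False)) < p
        if not singular:
            # Steps 1-2: T-square for the current observations versus the UCL.
            above = t_square(sub) > ucl(n, p, alpha)
        # Guard added here: also stop if nothing exceeds the UCL.
        if singular or not above.any():
            labels[remaining] = cluster  # the rest form the last cluster
            return labels
        labels[remaining[above]] = cluster
        remaining = remaining[~above]    # Step 4: repeat on the inliers
        cluster += 1


# Usage on simulated data (hypothetical; not the survey data of this paper).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
labels = outlier_clusters(X, alpha=0.05)
print(np.bincount(labels)[1:])  # cluster sizes, in order of extraction
```

Because at least one observation is removed in every non-final iteration, the loop ends after at most n iterations; with continuous simulated data it typically ends through one of the added guards rather than through singularity.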

Results and Discussion
In this section, we investigate the effectiveness of the proposed approach on survey data collected from users of a famous two-wheeler in India. The data comprise 19 different attributes of the two-wheeler and its company and were collected from 275 two-wheeler users. A well-structured questionnaire was prepared and distributed to 300 two-wheeler customers, and the questions were anchored on a five-point Likert scale from 1 to 5. After data collection, only the 275 completed questionnaires were used for analysis. The aim of this article is to describe the proposed clustering approach, not the application of the theoretical concept. The following table shows the results of the analysis, obtained using SAS JMP v9.0 and STATA v11.2.
Table 1 summarizes the iterations of the multivariate outlier detection using the T-square distance (the squared Mahalanobis distance). In the first iteration, 275 observations and 19 variables were used to calculate the Mahalanobis distance for all observations. Among the 275 observations, the T-square statistic for 220 observations was below the UCL of the T-square test statistic (29.53) at the 5% significance level, and the remaining 55 observations were above the cut-off. Therefore, we consider these 55 observations as the first outlier cluster. The process is then repeated in iteration 2, calculating the T-square distance for the remaining 220 observations (275 − 55) on the same 19 variables. Continuing in this way, the iteration reached its limit in the fifth step, with 111 observations as outlier cluster No. 5. At iteration No. 5, the variance-covariance matrix of the 19 variables for the 111 observations is singular, so it is not possible to calculate the T-square distance for these observations. Hence, based on 5 iterations, we identified five different outlier clusters at the 5% significance level with (n = 55), (n = 44), (n = 34), (n = 31) and (n = 111) observations, respectively.
We also identified the outlier clusters at the 1% level. In iteration 1, 275 observations and 19 variables were used to calculate the Mahalanobis distance. Among the 275 observations, the T-square statistic for 237 observations was below the UCL of the T-square test statistic (35.05) at the 1% significance level, and the remaining 38 observations were above the cut-off value. Therefore, we take these 38 observations as the first outlier cluster. Repeating this process, we reached the final iteration, No. 7, with 115 observations as outlier cluster No. 7. In the final iteration, it is not possible to calculate the T-square statistic because of the singularity of the variance-covariance matrix. Hence, based on 7 iterations, we identified 7 different outlier clusters at the 1% significance level with (n = 38), (n = 30), (n = 30), (n = 22), (n = 26), (n = 14) and (n = 115) observations, respectively. The iteration and identification of the multivariate outlier clusters are illustrated with the help of the following multivariate control charts.
Table 2 describes the results of five different test statistics, namely Wilks' lambda, Pillai's trace, the Lawley-Hotelling trace, Roy's largest root and the traditional F-statistic, which help us to establish the discriminant validity of the clusters based on the 19 variables at each iteration. In the first iteration, out of 275 observations, 55 are treated as outlier cluster 1 and the remainder are inliers. The test statistics confirm that the means of the 19 variables differ significantly at the 1% level between outlier cluster 1 and the inliers. This indicates that the observations in the outlier cluster are different from the inliers. This process is carried out at each iteration, and each time we obtain a positive indication of discriminant validity between the outlier cluster and the remaining inliers. Finally, in the last iteration, it is not possible to segregate a new outlier cluster, because the variance-covariance matrix of the 19 variables for the 111 observations is singular. So the iteration is stopped and we treat the 111 observations as outlier cluster No. 5. The same test statistics were also used to establish the discriminant validity of the outlier clusters at the 1% level. The test statistics confirm that, in all iterations, the means of the 19 variables differ significantly between the outliers and the inliers at the 1% significance level. Finally, in the last iteration it is not possible to split off a new outlier cluster from the (n = 115) observations because of the singularity of the variance-covariance matrix. Hence the iteration was stopped here and we treat the 115 observations as outlier cluster No. 7. The following table shows the cluster-wise means of the variables.
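As a quick arithmetic check on the figures reported above (a worked illustration, not part of the original analysis), the scale factor of the UCL formula and the beta percentage points implied by the two reported limits can be recovered for n = 275 and p = 19, so that p/2 = 9.5 and (n − p − 1)/2 = 127.5. The cluster sizes at each significance level also sum to the full sample:

```latex
\begin{align*}
\frac{(n-1)^{2}}{n} &= \frac{274^{2}}{275} = \frac{75076}{275} \approx 273.00, \\
\beta(0.05,\ 9.5,\ 127.5) &\approx \frac{29.53}{273.00} \approx 0.1082, \\
\beta(0.01,\ 9.5,\ 127.5) &\approx \frac{35.05}{273.00} \approx 0.1284, \\
55 + 44 + 34 + 31 + 111 &= 275, \\
38 + 30 + 30 + 22 + 26 + 14 + 115 &= 275.
\end{align*}
```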
Table 3 exhibits the cluster-wise centroids of the 19 variables. In order to test the equality of multivariate means of 19 variables among 5 outlier clusters, five

Conclusion
In this paper, a new method of clustering was proposed based on multivariate outlier detection. Though several clustering procedures are available in the literature, the proposed technique offers a distinct way to cluster the sample observations of a survey study based on multivariate outliers. The features of the proposed clustering technique were discussed in detail, and its application in survey research was illustrated. Based on the results derived, the proposed technique gives the researcher more insight when clustering the sample observations at the 5% and 1% significance levels. Finally, the authors suggest, as further research, conducting simulation experiments to test the relationship between the significance level and the number of outlier clusters extracted. Moreover, more rigorous experiments may be conducted to identify the multivariate outliers inside the outlier clusters.

Figure 11: Cluster membership for outlier clusters at 5% level

Table 2: Iteration summary for test of equality of means

Roy's largest root test and the traditional F-statistic, which help to strongly establish the discriminant validity among the clusters. From Table 4, the results of the battery of multivariate tests confirm that the means of the variables among the 5 outlier clusters differ significantly at the 1% level. This indicates that all clusters are different and that each outlier cluster conveys a different meaning, which establishes the overall discriminant validity among the clusters. The same multivariate tests of means were also used to check the differences among the means of the 19 variables for the outlier clusters obtained at the 1% level. The results confirm that the means of the variables among the 7 outlier clusters differ significantly at the 1% level. This indicates that all the outlier clusters at the 1% level are different and that each cluster conveys a different meaning, which establishes the overall discriminant validity among the clusters. The following graph visualizes the summary of the membership of each observation in each outlier cluster.
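Among the multivariate tests used in this paper, Wilks' lambda can be computed directly as the ratio of the determinants of the within-group and total sums-of-squares-and-cross-products (SSCP) matrices. The following is a minimal sketch, assuming NumPy; `wilks_lambda` is a hypothetical helper, the data are simulated, and the significance test (for example, an F approximation) is left out. Values near 0 indicate well-separated group means, and values near 1 indicate little separation.

```python
import numpy as np


def wilks_lambda(X, labels):
    """Wilks' lambda = det(W) / det(T) for the groups defined by labels."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    centered = X - X.mean(axis=0)
    T = centered.T @ centered            # total SSCP matrix
    W = np.zeros_like(T)
    for g in np.unique(labels):
        grp = X[labels == g]
        dev = grp - grp.mean(axis=0)
        W += dev.T @ dev                 # accumulate within-group SSCP
    return np.linalg.det(W) / np.linalg.det(T)


# Two simulated groups whose means differ strongly (hypothetical data).
rng = np.random.default_rng(1)
a = rng.normal(size=(20, 3))
b = rng.normal(size=(20, 3)) + 10.0
X = np.vstack([a, b])
labels = np.array([1] * 20 + [2] * 20)
print(wilks_lambda(X, labels))  # close to 0 for well-separated groups
```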

Table 3: Cluster-wise means of variables

Table 4: Test of equality of cluster means with homogeneous variance-covariance matrix; p (number of variables) = 19