Support Vector Machines for Classification of Temporomandibular Disorders from Facial Pattern Values

The aim of this study is to develop a method for detection of temporomandibular disorder (TMD) based on visual analysis of facial movements. We analyse the motion of colour markers placed at locations of interest on subjects' faces in the video frames. We measure several features from the motion patterns of the markers that can be used to distinguish between different classes. In our approach, both static and dynamic features are measured from a number of time sequences for classification of the subjects. A measure of the nonlinear dynamics of the variations in the movement of the colour markers was obtained by estimating the maximum Lyapunov exponent. Static features such as the number of outliers and kurtosis have also been evaluated. Support Vector Machines (SVMs) are then used to automatically classify the subjects as either individuals with TMD or healthy subjects.


Introduction
One of the most dynamic biomechanical junctures in the human body is the temporomandibular joint (TMJ). The TMJs are joints located on either side of the face that connect the lower jaw to the skull (McNeill, 1993), as illustrated in Figure 1.
Temporomandibular disorders (TMDs) often result when the chewing muscles and the TMJ do not work together correctly. When this occurs, the muscles often cramp. This spasm can then become part of a cycle that results in tissue damage, pain and muscle tenderness (Okeson, 1996a; Ohrbach and Widmer, 1992).
TMD is a collective term used to describe a number of related disorders affecting the temporomandibular joints, masticatory muscles, and associated structures, all of which have common symptoms such as pain in or around the ear, limited mouth opening, tenderness of the jaw muscles, clicking noises when one opens or closes the mouth, and difficulty in opening and closing the mouth (McNeill, 1993; Okeson, 1996b).
Figure 1: An illustration of the temporomandibular joint and its location, adopted from http://www.stevenschnolldds.com/tmjdisorders.htm
The clinical diagnosis of TMD has been based on determining the cause of these symptoms by conducting a series of diagnostic tests. These may include a complete medical history and a clinical examination that considers the possibility of temporomandibular joint pain and dysfunction, particularly if the pain is accompanied by clicking jaw joints and limited mouth opening (Carlsson, 1984). Therefore, poor detection of these signs and symptoms can lead to misdiagnosis of TMD.
Currently, there are three methods for measuring the functional features used in the diagnosis of TMD: (a) the computerized mandibular scan (CMS), which records the delicate functional movements of the jaw; (b) electromyography (EMG), which measures masticatory muscle activity; and (c) the electrosonograph (ESG), which measures and graphically displays sounds made by TMJ components (Deng et al., 2006). These methods, however, are used in only a few places around the world. Different types of imaging systems may also be utilized for diagnosis of TMD. Arthrography and magnetic resonance imaging (MRI) are the most popular ones. Plain X-ray and computerized tomography (CT) are valuable for determining the presence of osseous changes and traumatic injury to the osseous components of the joint. MRI is costly and unable to visualize perforations of the posterior attachment or the disc. CT is too hazardous to be used frequently and is not comfortable for the patients.
The reasons mentioned above indicate a clear need for further research on alternative methods that supplement clinical examinations with automated classification and detection of TMD. Signal processing methods allow for enhanced clinical utility and an automated approach to the diagnosis of TMD.
Automatic motion analysis in human recognition is easier and more effective if the subject's movement is cyclostationary. There are two obvious cyclic motions: walking and chewing. Analysis of motion vectors in walking subjects has been widely reported in gait recognition (Lee et al., 2006). During normal chewing, the lower jaw and connecting joints on both sides are synchronized; the joints on each side slide and rotate right in front of each ear. Both the cyclic and the chaotic features for these cases can be quantified and exploited in detection of any abnormality (Ghodsi, 2008). When the jaw twists during one of these motions, it causes pain and clicking (Ghodsi et al., 2008). The most common symptom of TMD is clicking of the TMJ. In some research the clicking sound has been suggested as potential data for characterizing TMD (Took et al., 2006; Took et al., 2008a; Took et al., 2008b). Since clinicians hear mixtures of the TMJ sources recorded from inside each auditory canal, diagnosing TMD from sound is difficult for them: it is hard to know which TMJ source comes from the right or the left TMJ.
In this paper we develop a method for detection of TMD based on visual analysis of facial movement, which is comfortable and safe. Combining sound and visual data can help audio analysis to characterize TMD. Here, we consider visual analysis of facial movement. For this purpose we attached a number of markers to the points of interest on the individuals' faces and tracked their positions over a large number of frames in the video sequences. We used image processing methods to extract the positions of the markers in the video frames. The important locations with significant changes during mouth movement are around the TMJ. We analyse the information related to the dynamics of movement of the colour markers placed on the face around the TMJ.
The paper is organized as follows. In Section 2, we provide the necessary mathematical background of the support vector machine (SVM); we then discuss the experimental data in Section 3. Section 4 presents the bootstrap method, and in Section 5 the results are derived. Section 6 concludes the paper.

Support Vector Machines
Support vector machine (SVM) is an effective non-parametric classifier suitable for high-dimensional datasets and has been found competitive with the best machine learning algorithms (Vapnik, 1995; Taylor and Cristianini, 2000). Unlike many classification algorithms, SVM performs efficiently when
• The number of features is high.
• There is a limited time for performing the classification.
• There is a non-uniform weighting among the features.
• There is a nonlinear map between the inputs and the outputs.
• The distribution of the data is not known.
• A convex optimization problem is required, so that training does not fall into a local minimum.
The formulation of SVM learning is based on the principle of structural risk minimization. Instead of minimizing an objective function based on the training samples (such as the mean square error (MSE)), the SVM attempts to minimize a bound on the generalization error (i.e., the error made by the learning machine on test data not used during training) (Vapnik, 1995).
Consider a data set and the corresponding labels $(X_1, y_1), \ldots, (X_N, y_N)$, where $X_i \in \mathbb{R}^p$ is a feature vector and $y_i \in \{-1, +1\}$ is its class label. It is then possible to partition the $p$-dimensional pattern space into two half-spaces with a separating hyperplane of equation
$$\langle W, X \rangle + b = 0. \tag{1}$$
Of all the boundaries determined by $W$ and $b$, the one that maximizes the margin (i.e., maximizes the distance between the hyperplane and the nearest data point of each class) generalizes better than the other possible separating hyperplanes.
Mathematically, this hyperplane can be found by minimizing the following cost function:
$$J(W) = \frac{1}{2}\|W\|^2 \tag{2}$$
subject to the separability constraints:
$$y_i\left(\langle W, X_i \rangle + b\right) \ge 1, \quad i = 1, \ldots, N, \tag{3}$$
where $\langle \cdot, \cdot \rangle$ denotes the inner (dot) product. This specific problem formulation may not be useful in practice because the training data may not be completely separable by a hyperplane. In this case, slack variables, denoted by $\xi_i$, can be introduced to relax the separability constraints in Eq. (3) as follows:
$$y_i\left(\langle W, X_i \rangle + b\right) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, N. \tag{4}$$
Accordingly, the cost function in Eq. (2) can be modified as follows:
$$J(W, \xi) = \frac{1}{2}\|W\|^2 + C \sum_{i=1}^{N} \xi_i, \tag{5}$$
where $\|\cdot\|$ refers to the Euclidean norm of a vector and $C > 0$ is a constant that weights the penalty on the slack variables. Solving the above problem determines the Lagrangian multipliers $\alpha_i$ and a classifier implementing the optimal separating hyperplane in the feature space, given by
$$f(X) = \operatorname{sign}\left(\sum_{i=1}^{sv} \alpha_i y_i \langle X_i, X \rangle + b\right), \tag{6}$$
where $sv$ refers to the number of support vectors. Consequently, everything that has been derived for the linear case is also applicable to the nonlinear case by using a suitable kernel instead of the dot product. The choice of kernel to fit nonlinear data into a linear feature space depends on the structure of the data.
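As a small numerical illustration of the soft-margin formulation described above, the following Python sketch evaluates the decision value, the slack variables of the relaxed constraints, and the modified cost function for a fixed hyperplane. The toy data points, the values of W and b, and the penalty weight C on the slacks are illustrative assumptions and are not taken from the study. No optimization is performed here.

```python
# Sketch: evaluate the soft-margin quantities for a hand-picked hyperplane.
# The data, W, b and C below are hypothetical values chosen for illustration.

def decision_value(W, b, X):
    """Signed score <W, X> + b; its sign gives the predicted class."""
    return sum(w * x for w, x in zip(W, X)) + b

def slack_variables(W, b, data, labels):
    """Smallest xi_i >= 0 satisfying y_i(<W, X_i> + b) >= 1 - xi_i."""
    return [max(0.0, 1.0 - y * decision_value(W, b, X))
            for X, y in zip(data, labels)]

def soft_margin_cost(W, b, data, labels, C=1.0):
    """Soft-margin cost: 0.5 * ||W||^2 + C * (sum of slack variables)."""
    norm_sq = sum(w * w for w in W)
    return 0.5 * norm_sq + C * sum(slack_variables(W, b, data, labels))

data = [(2.0, 2.0), (3.0, 1.0), (-2.0, -1.0), (0.5, 0.0)]
labels = [+1, +1, -1, -1]   # the last point lies on the wrong side of the plane
W, b = (0.5, 0.5), 0.0

slacks = slack_variables(W, b, data, labels)
cost = soft_margin_cost(W, b, data, labels, C=1.0)
predictions = [1 if decision_value(W, b, X) >= 0 else -1 for X in data]
```

Only the misclassified point receives a non-zero slack, so it is the only point that adds to the penalty term of the cost.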
In this work we exploit these distinctive properties of SVMs to build a classifier and apply it to a number of carefully selected and estimated features.

Data Acquisition
In this study, we used seven subjects. Two are individuals with TMD on the left side of their faces and the rest are healthy subjects. The patients in our experiments were examined by our clinical expert collaborator. We captured video of all subjects' faces (healthy subjects and individuals with TMD) from the left and right sides in the frontal-lateral direction using two cameras (illustrated in Figure 2). Each subject was recorded performing three cycles of chewing motion using high-resolution (640 × 480 pixels) colour video cameras at 30 fps. On average, 400 video frames were obtained per subject. We placed four round blue markers at the locations of interest on each subject's face. The size of each marker is 6 mm. We attached two markers on the TMJ at the left and right sides of the face. We also attached two additional markers of the same colour at nose and chin level. Since the distance of the subject from the camera can vary, we used the distance between the latter two markers as a scaling measure for the images of the two sides. We then found the coordinates of the centre of each marker in each frame, which we used to find the correspondences between the markers detected in different frames. Thus, for each marker we obtained a time sequence representing its movement in the x direction in the video sequences. We then analysed the motion patterns of the TMJ markers during cycles of chewing. Figure 3 shows the time sequence of a TMJ marker for an individual with TMD (left) and a healthy individual (right) in the original scale. As Figure 3 shows, it is not possible to detect which series belongs to an individual with TMD through visual inspection of the signals. Several features are therefore measured from the chewing motion patterns, as described in the following section. These can be used to distinguish between the classes, with an SVM as our classification method. It is believed that opening and closing patterns with clicks are more frequent in patients with TMD than in normal subjects, and that patients with TMD demonstrate a more restricted range of motion and lower velocity than normal subjects.
The static features are measured from the normalized highpass filtered data. The highpass filter is used to remove the effect of mouth movement and enhance the changes in the TMJ within the time sequence. The signals are normalized to suppress the effect of changes in picture/video size; we define the new signal as $x_i^{\mathrm{new}} = (x_i - \bar{x})/s$, where $\bar{x}$ is the sample mean and $s$ is the sample standard deviation. It is believed that the visual features presented to the SVM are distinct enough to be separated using either a linear SVM or an SVM with a kernel, as discussed above.
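The preprocessing described above can be sketched in Python as follows. The text does not specify the highpass filter, so this sketch assumes a simple one: subtracting a centred moving average, with a hypothetical window of 15 frames. The synthetic trajectory (a slow chewing cycle plus one sharp click-like spike) is invented for illustration and is not study data.

```python
import math

def highpass_moving_average(x, window=15):
    """Crude highpass filter (an assumption): subtract a centred moving average."""
    half = window // 2
    out = []
    for i in range(len(x)):
        lo, hi = max(0, i - half), min(len(x), i + half + 1)
        out.append(x[i] - sum(x[lo:hi]) / (hi - lo))
    return out

def normalize(x):
    """Z-score normalization: (x_i - mean) / s, s being the sample std."""
    n = len(x)
    mean = sum(x) / n
    s = math.sqrt(sum((v - mean) ** 2 for v in x) / (n - 1))
    return [(v - mean) / s for v in x]

# Synthetic marker trajectory: slow periodic motion plus one sharp spike.
signal = [10.0 * math.sin(2 * math.pi * t / 130.0) for t in range(400)]
signal[200] += 3.0

x_new = normalize(highpass_moving_average(signal))
peak_index = max(range(len(x_new)), key=lambda i: abs(x_new[i]))
```

After filtering, the slow chewing cycle is largely removed and the spike dominates the normalized signal, which is the behaviour the static features rely on.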

Features
Although statistics of various orders can be considered as features for this purpose, the following features have been empirically found to be the most effective and justifiable for this classification application (Ghodsi, 2008). These features provide both statistical and dynamical information regarding the effects of TMD during normal chewing.
Feature 1. Any abnormalities in the chewing process produce chaotic behaviour of the motion. Estimating chaos in a dynamical system is an important problem, and measuring the maximum Lyapunov exponent (MLE) is one way to solve it (Kantz, 1994). As a dynamic feature, we used the MLE $\lambda_1$ to measure the changes in the dynamics of the chewing pattern (Rosenstein et al., 1993). This is measured from the normalized data, and it is observed that the chewing signal for the subjects with TMD is more chaotic than for healthy individuals (Ghodsi et al., 2007). We denote $f_1 = \lambda_1$. Although estimates of Lyapunov exponents for short data sequences are in general not very accurate, the method adopted here (Rosenstein et al., 1993) provides sufficiently distinct values of $\lambda_1$ for TMD and normal subjects.
Feature 2. Outliers are observations which are presumed to come from a different distribution than the majority of the data (Han and Kamber, 2001). They can have a profound influence on the data analysis, often leading to erroneous conclusions because of their powerful influence on most parametric tests. Outliers (unusual or abnormal values) are often the points of special interest in practical situations, and their identification is then the main purpose of the investigation. In medicine, unusual values may indicate disease (see, e.g., Kosheleva et al., 1998). Accurate identification of outliers therefore plays an important role in data analysis.
One approach to outlier detection is to start with $N$ normal values $x_1, \ldots, x_N$, compute the sample mean $\bar{x}$ and the sample standard deviation $s$, and then mark a value $x$ as an outlier if it lies outside the interval $(\bar{x} - a s, \bar{x} + a s)$ for some preselected number $a$ (for an application of this method in engineering, see, e.g., Wadsworth, 1990). Here, we selected $a = 3$ and used the normalized highpass filtered data, so that $f_2$ is the number of observations $x_i$ with $|x_i - \bar{x}| > 3s$.
The skewness and kurtosis tests are among the most powerful tools available for testing the presence of outliers in an otherwise normal sample, especially when the number of outliers is unknown (Hawkins, 1980).
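To make the idea behind Feature 1 concrete, the following Python sketch outlines a nearest-neighbour divergence estimate in the spirit of Rosenstein et al. (1993). It is a simplified illustration only: the embedding dimension m, the delay tau, the temporal exclusion window min_sep and the number of tracked steps n_steps are all assumed values, and the test series is the logistic map rather than chewing data.

```python
import math

def rosenstein_mle(x, m=2, tau=1, min_sep=10, n_steps=4):
    """Rough estimate of the maximum Lyapunov exponent (per sample) of x.

    Steps: delay-embed the series, pair each point with its nearest
    neighbour at least `min_sep` samples away in time, track the mean
    log-distance of the pairs for `n_steps` steps, and fit a line.
    All default parameter values are assumptions.
    """
    n_vec = len(x) - (m - 1) * tau
    emb = [[x[i + j * tau] for j in range(m)] for i in range(n_vec)]
    usable = n_vec - n_steps      # points that can be followed n_steps ahead

    def dist(a, b):
        return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

    # Nearest neighbour of each point, excluding temporally close points.
    neighbour = []
    for i in range(usable):
        best_j, best_d = None, float("inf")
        for j in range(usable):
            if abs(i - j) >= min_sep:
                d = dist(emb[i], emb[j])
                if 0.0 < d < best_d:
                    best_j, best_d = j, d
        neighbour.append(best_j)

    # Mean log-distance between the pairs after k steps, k = 0..n_steps.
    curve = []
    for k in range(n_steps + 1):
        logs = []
        for i, j in enumerate(neighbour):
            if j is not None:
                d = dist(emb[i + k], emb[j + k])
                if d > 0.0:
                    logs.append(math.log(d))
        curve.append(sum(logs) / len(logs))

    # Least-squares slope of the divergence curve.
    ks = list(range(n_steps + 1))
    k_mean = sum(ks) / len(ks)
    c_mean = sum(curve) / len(curve)
    num = sum((k - k_mean) * (c - c_mean) for k, c in zip(ks, curve))
    den = sum((k - k_mean) ** 2 for k in ks)
    return num / den

# Chaotic test series: logistic map x -> 4x(1 - x).
series = [0.3]
for _ in range(499):
    series.append(4.0 * series[-1] * (1.0 - series[-1]))

lam = rosenstein_mle(series)
```

For this map the true exponent is ln 2 per iteration, so a positive estimate of that order is expected. The estimate depends on the assumed parameters and on where the divergence curve saturates, so it should be read as approximate.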
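The outlier rule above can be written compactly in Python. This is a minimal sketch; the sample values are synthetic and the function name is our own.

```python
import math

def count_outliers(x, a=3.0):
    """Number of values outside (mean - a*s, mean + a*s), s = sample std."""
    n = len(x)
    mean = sum(x) / n
    s = math.sqrt(sum((v - mean) ** 2 for v in x) / (n - 1))
    return sum(1 for v in x if abs(v - mean) > a * s)

# Synthetic example: small regular fluctuations plus two extreme values.
samples = [0.1 * ((i % 7) - 3) for i in range(200)]
samples[50] = 5.0
samples[120] = -6.0

f2 = count_outliers(samples)   # uses a = 3, as in the text
```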
Feature 3. A large ratio between the peak (outlier) amplitude and the variance of a signal suggests that there is an unexpected value in the data. The equation describing this feature is given by
$$f_3 = \frac{\max(|X|)}{s^2},$$
where $X = (x_1, \ldots, x_N)$ is the normalized highpass filtered data, $\max(\cdot)$ is a scalar-valued function that returns the maximum element of a vector, $s$ is the sample standard deviation of $X$, and $|\cdot|$ is the absolute value applied element-wise.
Feature 4. Kurtosis can be formally defined as the standardized fourth-order moment. It is a measure of how sharply peaked a symmetric distribution is when compared to a normal distribution of the same variance.
Note that the kurtosis of a normal distribution is 3. If a distribution has a large central region which is flatter than a normal distribution with the same mean and variance, it has a kurtosis of less than 3 (i.e., it is sub-Gaussian).
If the distribution has a central maximum that is more peaked, and tails that are longer, than the equivalent normal distribution, its kurtosis is higher than 3 (i.e., it is super-Gaussian) (Brooks and Carruthers, 1953).
As noted above, kurtosis largely reflects tail behaviour, and so its use for detecting outliers has been considered. Discussions of approaches to detecting outliers using kurtosis can be found in (Barnett and Lewis, 1996; Jobson, 1991). Kurtosis is defined as:
$$f_4 = \frac{1}{N} \sum_{i=1}^{N} \left(\frac{x_i - \bar{x}}{s}\right)^4.$$
Feature 5. Skewness is a measure of the asymmetry of a distribution and is zero for a normal distribution. If the longer tail of a distribution occurs for values of $x$ higher than the mean, that distribution is said to have positive skewness. If the longer tail occurs for values of $x$ lower than the mean, the distribution is said to have negative skewness (Kerbaol and Chapron, 1999). Large skewness may motivate the researcher to investigate outliers. The normalized skewness for each signal is given by
$$f_5 = \frac{1}{N} \sum_{i=1}^{N} \left(\frac{x_i - \bar{x}}{s}\right)^3.$$
Figure 4 illustrates why we used features 2-5.
Feature 6. This feature is a measure of the likelihood of a peak subject to the gradient of the smoothed waveform. It identifies whether the peak appears during opening or closing of the mouth. Let $u(t)$ be the lowpass filtered data. We denote its gradient by $\nabla_t u(t)$, approximated as $\nabla_t u(t) = u(t) - u(t-1)$, and define $I_{\nabla_t u(t)}$ as the indicator of whether $\nabla_t u(t) \ge 0$. We denote $f_6 = I_{\nabla_t u(t)}$. To give a better understanding of $f_6$, Figure 5 shows the highpass filtered data (thin line) and $\nabla_t u(t)$ (thick line) together.
As Figure 5 shows, all peaks occur when $\nabla_t u(t) \ge 0$.
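The static features 3 to 6 can be sketched in Python as follows. The sketch rests on several assumptions: the peak ratio is taken as the maximum absolute value divided by the sample variance, kurtosis and skewness use the standard moment-based definitions with the sample standard deviation, and the gradient indicator returns 1 for a non-negative backward difference and 0 otherwise. The function names and the test data are hypothetical.

```python
import math

def moment_features(x):
    """Return (peak ratio, kurtosis, skewness) under the assumed definitions."""
    n = len(x)
    mean = sum(x) / n
    s = math.sqrt(sum((v - mean) ** 2 for v in x) / (n - 1))
    peak_ratio = max(abs(v) for v in x) / s ** 2
    kurtosis = sum((v - mean) ** 4 for v in x) / (n * s ** 4)
    skewness = sum((v - mean) ** 3 for v in x) / (n * s ** 3)
    return peak_ratio, kurtosis, skewness

def gradient_indicator(u, t):
    """1 if u(t) - u(t-1) >= 0, else 0 (the 0/1 coding is an assumption)."""
    return 1 if u[t] - u[t - 1] >= 0 else 0

# Symmetric two-valued signal, then the same signal with one large spike.
base = [-1.0, 1.0] * 50
spiked = list(base)
spiked[0] = 10.0

base_features = moment_features(base)
spiked_features = moment_features(spiked)

# Gradient sign on a short smoothed waveform: rising at t=2, falling at t=3.
u = [0.0, 1.0, 2.0, 1.0]
rising, falling = gradient_indicator(u, 2), gradient_indicator(u, 3)
```

A single spike raises all three moment-based values at once, which is why these features respond to click-like events.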

Data Sample Generation Using Bootstrap
Although a reasonably large number of data samples can be provided by multiple recordings from the same subjects, when the statistical features depend mainly on the data distribution, additional data samples can be generated instead. Likewise, where it is difficult to provide sufficient data of reasonable length, a number of data samples can be generated according to the actual data distribution. One method for doing so is the bootstrap (Efron and Tibshirani, 1993), explained next. We used the bootstrap average signal to test the reliability and accuracy of the results obtained from the original signal. Consider the construction of the bootstrap average signal for the signal $X_t$ (for more information see, e.g., Efron and Tibshirani, 1993; Golyandina et al., 2001). Under a suitable choice of embedding dimension $m$ (we select this parameter using the false nearest neighbour method) and of the corresponding eigentriples in the singular value decomposition (SVD), we have the representation $X_t = S_t + N_t$, where $S_t$ (the reconstructed signal) approximates $X_t$, and $N_t$ is the noise series. Suppose now that we have a (stochastic) model for the noise $N_t$ (for instance, pure noise). Then, simulating $n$ independent copies $N_{t,i}$ of the noise series $N_t$, we obtain $n$ signals $X_{t,i} = S_t + N_{t,i}$ and produce $n$ reconstruction results $\tilde{X}_{t,i}$ (Hassani and Zhigljavsky, 2009).
When the sample $\tilde{X}_{t,i}$ ($1 \le i \le n$) of the reconstruction results is obtained, we can calculate the bootstrap average signal by averaging these results over $i$. The simplest model for $N_t$ is the Gaussian white noise model. The corresponding hypotheses can be checked with the help of standard tests for randomness and normality (Golyandina et al., 2001; Hassani, 2007; Hassani et al., 2009).
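The procedure described above can be sketched with NumPy as follows. This is an illustrative outline under stated assumptions, not the implementation used in the study: the window length m, the number r of leading eigentriples, the number of bootstrap copies and the random seed are all hypothetical choices (the text selects m by the false nearest neighbour method, which is not reproduced here), and the noise is modelled as Gaussian white noise with the residual's standard deviation.

```python
import numpy as np

def ssa_reconstruct(x, m, r):
    """Rank-r reconstruction of series x from its m-lag trajectory matrix."""
    n = len(x)
    k = n - m + 1
    traj = np.column_stack([x[i:i + m] for i in range(k)])   # shape (m, k)
    u, s, vt = np.linalg.svd(traj, full_matrices=False)
    approx = (u[:, :r] * s[:r]) @ vt[:r]                      # leading r eigentriples
    # Diagonal averaging maps the matrix back to a series of length n.
    recon = np.zeros(n)
    counts = np.zeros(n)
    for i in range(m):
        for j in range(k):
            recon[i + j] += approx[i, j]
            counts[i + j] += 1
    return recon / counts

def bootstrap_average(x, m=20, r=2, n_boot=50, seed=0):
    """Average of n_boot reconstructions of S_t plus simulated noise."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    signal = ssa_reconstruct(x, m, r)            # S_t
    noise_std = np.std(x - signal, ddof=1)       # white-noise model for N_t
    recons = [ssa_reconstruct(signal + rng.normal(0.0, noise_std, len(x)), m, r)
              for _ in range(n_boot)]
    return np.mean(recons, axis=0)

# Synthetic example: a noisy sinusoid standing in for a marker trajectory.
t = np.arange(400)
clean = np.sin(2 * np.pi * t / 50.0)
noisy = clean + 0.3 * np.random.default_rng(1).normal(size=t.size)

averaged = bootstrap_average(noisy)
```

In this synthetic case the bootstrap average can be compared with the known clean signal. With real recordings only the comparison against the original signal's features is available.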

Results
As mentioned above, we used both static and dynamic features extracted from the movement of the markers positioned on subjects' faces to detect individuals with TMD. Table 1 presents a summary of the obtained results. The second column indicates the left (L) and right (R) sides of the face for all subjects; two individuals have TMD on the left side (L1, L2) and the rest are healthy. The values of features 1-6 are calculated for all samples. We followed the same procedure for the observations obtained by the bootstrap method. The symbol '*' in Table 1 indicates results obtained by the bootstrap method. Columns 3-14 give the values of $f_1$-$f_6$ for both the original and the bootstrap averaged signals. Columns 3 and 4 of Table 1 show the (rounded) values of $\lambda_1$ for each colour marker. The values of $\lambda_1$ are positive for all samples (individuals with TMD and healthy subjects), indicating chaotic behaviour. However, the value for the individuals with TMD (rows 1 and 3) is larger than for healthy subjects.
Feature 2 represents the number of outliers for all subjects. As noted earlier, outliers are often the points of special interest, and their identification can be the main purpose of an investigation; in medicine, for instance, they may indicate disease. Here, the outliers represent the click events in either the opening or the closing process. The number of outliers for individuals with TMD is greater than for healthy ones, indicating that clicks are more frequent in the group of individuals with TMD.
A large ratio between the peak amplitude and the variance of a signal is a typical identifier of the click, and is captured by feature 3. The normal chewing process is usually distributed about its mean value. Therefore, a low ratio is expected for a healthy subject, whilst the chewing process of an individual with TMD has a high value. As the results show, the same pattern is obtained here.
Features 4 and 5 indicate how clicks change the distribution of the chewing process. Feature 6 is useful for distinguishing between the peaks in the signals during the chewing process that are related to TMD and other, non-relevant peaks. The results of feature 2 are presented in columns 5 and 6. As these columns show, the number of extreme values for the individuals with TMD is greater than for healthy subjects, which confirms the significance of this feature for classification.
SVM is used to separate TMD and non-TMD classes based on the above features. 14 data segments from 7 subjects were used for training and the data from another 7 subjects were used for testing. A linear kernel was used. We used cross-validation to assess the accuracy of the SVM. In the cross-validation procedure we used 70% of the data as training examples and 30% for testing, with no overlap. The cross-validation was performed 10 times, with the data randomly rearranged each time, in order to yield a better estimate of the error. To verify the classification results, we compared them with the impressions of our expert clinician. As a result, we were able to correctly classify all the subjects as TMD or non-TMD.
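The validation scheme described above (ten random 70%/30% splits with a linear classifier) can be sketched in plain Python as follows. This is an illustration under stated assumptions and not the software used in the study: the linear SVM is trained by simple subgradient descent on the soft-margin cost, the hyperparameters C, lr and epochs are hypothetical, and the 14 two-dimensional feature vectors are synthetic stand-ins for the measured features.

```python
import random

def train_linear_svm(X, y, C=1.0, lr=0.01, epochs=300):
    """Subgradient descent on 0.5*||W||^2 + C * (sum of hinge losses)."""
    W, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        grad_W, grad_b = list(W), 0.0          # gradient of 0.5*||W||^2 is W
        for xi, yi in zip(X, y):
            if yi * (sum(w * v for w, v in zip(W, xi)) + b) < 1.0:
                grad_W = [g - C * yi * v for g, v in zip(grad_W, xi)]
                grad_b -= C * yi
        W = [w - lr * g for w, g in zip(W, grad_W)]
        b -= lr * grad_b
    return W, b

def predict(W, b, xi):
    """Class label from the sign of <W, x> + b."""
    return 1 if sum(w * v for w, v in zip(W, xi)) + b >= 0 else -1

def repeated_holdout(X, y, train_frac=0.7, repeats=10, seed=0):
    """Accuracy on the held-out part for `repeats` random disjoint splits."""
    rng = random.Random(seed)
    idx = list(range(len(X)))
    accuracies = []
    for _ in range(repeats):
        rng.shuffle(idx)
        n_train = int(round(train_frac * len(idx)))
        train, test = idx[:n_train], idx[n_train:]
        W, b = train_linear_svm([X[i] for i in train], [y[i] for i in train])
        hits = sum(predict(W, b, X[i]) == y[i] for i in test)
        accuracies.append(hits / len(test))
    return accuracies

# Synthetic, well-separated two-class data (NOT the study's measurements).
gen = random.Random(1)
X = [[2 + gen.uniform(-0.5, 0.5), 2 + gen.uniform(-0.5, 0.5)] for _ in range(7)]
X += [[-2 + gen.uniform(-0.5, 0.5), -2 + gen.uniform(-0.5, 0.5)] for _ in range(7)]
y = [1] * 7 + [-1] * 7

accuracies = repeated_holdout(X, y)
mean_accuracy = sum(accuracies) / len(accuracies)
```

With so few samples each test split contains only a handful of points, so the accuracy of a single split is coarse. Averaging over repeated random splits, as in the text, gives a steadier estimate.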

Conclusions
In this paper, we attempted to classify subjects suffering from the most common type of TMD, namely click, from visual data, by measuring a set of carefully selected features and applying an efficient classification system. This provides a simple, non-invasive and non-intrusive procedure for TMD diagnosis. The TMD classifier works on the basis of visual analysis of facial movement. We used a number of carefully selected features extracted from the movement of the markers positioned on subjects' faces. In our approach the features are related to both static and dynamic visual variables measured from a number of time sequences corresponding to different subjects within different time intervals, and are classified using SVM. The SVM correctly classified the two-class data for all subjects. SVM has been verified as a computationally cost-effective method capable of classifying separable and nonseparable data through application of linear and nonlinear kernels. Furthermore, the designed classifier was tested using the bootstrap technique, and the results obtained from the original data were confirmed. Even in cases of mild TMD, the classifier can achieve close to 100% correct classification.

Figure 2: The diagram of cameras and subject position

Figure 3: Original time series of the TMJ marker from an individual with TMD (left) and that of a healthy individual (right)

Figure 4: Histogram of the signal of the TMJ marker from (a) the individual with TMD and (b) a healthy individual

Table 1: Values of the features for all subjects/trials
As appears from Table 1, the results for the bootstrap averaged signals are close to those for the original signals, indicating that our results are reliable.