A COPULA-BASED SUPERVISED LEARNING CLASSIFICATION FOR CONTINUOUS AND DISCRETE DATA

: Despite the unreasonable feature independence assumption, the naive Bayes classifier provides a simple way but competes well with more sophisticated classifiers under zero-one loss function for assigning an observation to a class given the features observed. However, it has been proved that the naive Bayes works poorly in estimation and in classification for some cases when the features are correlated. To extend, researchers had developed many approaches to free of this primary but rarely satisfied assumption in the real world for the naive Bayes. In this paper, we propose a new classifier which is also free of the independence assumption by evaluating the dependence of features through pair copulas constructed via a graphical model called D-Vine tree. This tree structure helps to decompose the multivariate dependence into many bivariate dependencies and thus makes it possible to easily and efficiently evaluate the dependence of features even for data with high dimension and large sample size. We further extend the proposed method for features with discrete-valued entries. Experimental studies show that the proposed method performs well for both continuous and discrete cases.


Introduction
An The naive Bayes, as one of the most popular learning algorithms for machine learning and data mining, has been widely used in many areas for classifying new instances given a vector of features.Its simplicity and competitive performance are applausive despite the model is derived based on a critical and rarely satisfied assumption, i.e., given a class all features are independent.Addition, its computational efficiency, achieved by the fact that the training procedure is linear in features, makes it suitable to fitting for high dimensional data.Although some literatures (Domingos and Pazzani, 1997;Rish, 2001;and Zhang, 2004) had illustrated the reasons that the naive Bayes works surprisingly well in some cases where even the feature independence assumption is violated, the fact remains that the naive Bayes works poorly in estimation and in some other cases when the assumption does not hold.To extend the naive Bayes to account for the dependence of features, Friedman et al. (1997) developed a tree-like augmented naive Bayes (TAN) in which the class node directly points to all features nodes and a feature can have only one parent from another feature in addition to the class node.By relaxing the limitation on links among features, a more general augmented naive Bayes (ANB) is obtained as a special case of Bayesian networks in which no node is specified as a class node.Zhang and Ling (2001) then has proved that any Bayesian network, thus any joint probability distribution, can be represented by an ANB.In this paper, we propose a new tree-based classifier as an extension of the naive Bayes by evaluating the dependence of features through pair copulas constructed via a D-Vine tree structure.
A copula is a multivariate probability distribution in which the marginal probability distribution of each variable can be "separate" from the dependence structure among random variables.In practical applications, however, the choice of adequate copula families is rather limited for high dimensional multivariate dependence.Standard multivariate copulas, such as the multivariate Gaussian, are lack of the ability of accurately modeling the dependencies among a large number of variables.To extend, a technique called pair-copula construction (Aas et al., 2006) was developed to model multivariate dependence using a cascade of bivariate copulas, acting only on two variables at a time.However, the number of possible constructions increases dramatically fast as the number of variables grows.To effectively organize the construction procedure, vine copula, initially proposed by Joe (1996) and extended by Bedford andCooke (2001, 2002) and Kurowicka andCooke (2005, 2006), is developed as a flexible graphical model for constructing pair copulas in a nested way.Our proposed method builds a dependence classification model based on D-Vine copulas.The margins in the proposed model could be considered acting the exactly same way as in the naive Bayes, while pair copulas are used for evaluating the dependence of features (variables).Furthermore, its bivariate construction makes possible in fitting for a large p and small n dataset since only two variables are considered at a time.For a big-data, the computational efficiency could be achieved at each D-Vine tree level by utilizing a parallel computing technique.Theoretically, our proposed method can be used in fitting for a data with high dimension and large sample size in an efficient way.To extend the proposed method for features with discrete-valued entries, we use an augmented method with latent variables added for evaluating the feature dependence.Accordingly, we modify the regular D-Vine algorithm for discrete cases.
The paper is organized as follows.In Section 2, we introduce the concepts of copula and paircopula construction as well as D-Vine tree structure used for helping organize the decomposition on a multivariate joint density.In Section 3, our D-Vine copula-based dependence classifier is derived, and the learning process is illustrated in details in Section 4. In Section 5, we extend the proposed method for discrete features, and in Section 6 and 7, experimental studies are conducted for continuous and discrete cases, respectively.The conclusions come up in Section 8. Sklar's theorem (1959) states: Given p random variables X = (  " , … ,  & ), for any pdimensional multivariate joint distribution F with continuous marginal distribution functions  " , … ,  & , there exists a unique copula C such that for all x = ( " , … ,  & ) ∈  & , where x is a realization of X, In words, it states that any p-dimensional multivariate distribution function F can be written in terms of univariate margins and a copula C which describes the dependence structure for variables.Another common used expression for Eq. ( 1) is.

2.1.Copulas
where  + ," is the inverse distribution function associated with  -and  -=  -( -) ∈ (0,1), for k= 1,2,…,p.The joint density associated with F thus can be written as where  -is the marginal density corresponding to  -and c is the copula density associated with C.
Addition, a p-dimensional joint density can be factored by claims.
(4) Following the notation from Aas et al. (2006), each right-hand side term in Eq. ( 4) can be further decomposed into the associated pair copula times a conditional density, written as.here v is a vector,  -is an arbitrarily chosen component of v, and  ,-denotes the v-vector excluding  -.Iteratively decomposing (| ,-) till  ,-= ∅ , we finally obtain a general decomposition for a p-dimensional density which only involves margins and pair copulas.As shown in Joe(1996), the involved conditional distributions in the form of F(x|v) can be obtained by And when  ,-= ∅, Eq.( 6) reduces to When x and v are uniformly distributed, Eq.( 7) can be further simplified as Where h(•), called h-function, is the function used for generating pseudo observations which we will use later for fitting a model with D-Vine tree structure.For the details of bivariate copulas, such as the Gaussian and Student-t, see the reference (Aas et al.,2006).

2.2.D-Vine Tree Structure
Decomposing a p-dimensional joint density via Eq.( 4) is not unique since it depends on how to factorize the component  -in v given in Eq. ( 5), and the number of possible decompositions increases extremely fast as the dimension p grows.To help manipulate decomposition for data with high dimension, a technique called regular vine was introduced in Bedford and Cooke (2001;2002).Two special cases, canonical vine and D-Vine (Kurowicka and Cooke, 2004;Kurowicka and Joe, 2011), received even more attention due to their simplicity in organizing decomposition.In this paper, we focus on the D-Vine structure.However, our method straightforwardly extends to other types of vines.Figure 1 shows a decomposition example based on a 5-dimensional D-Vine.The total number of the nested trees in a p-dimensional D-Vine is p − 1, and each tree  7 , for j = 1, 2, . . ., p − 1, has p + 1 − j nodes and p − j edges.Each edge corresponds to a pair copula, and an edge in the tree  7 becomes to a node in  7 +1.The whole decomposition is obtained by the product of p margins and p(p−1)/2 copulas involving only two random variables in each copula.We should notice that the total number of ways to decompose a p-dimensional joint density via a p-dimensional D-Vine is p!/2.However, once the order of the nodes in T1 is determined, the decomposition is unique.Following the notation in Aas et al. (2006), a p-variate density f ( " , … ,  & ) corresponding to a p-dimensional D-Vine is given by where index j identifies the nested trees and i is for the edges in the tree  7 .F (•|•) can be obtained by Eq. ( 8) cascaded down from  " to  &," , and  9,9:7|9:",…,9:7 is a bivariate copula density corresponding to the pair nodes connected via the edge i in  7 .

D-Vine Copula Bayes Classifier
A D-Vine copula Bayes classifier (DVCBC) is a function to assign an observation to a class.According to the Bayes rule, the probability of a feature vector X = ( " , … ,  & ) with a realization x = ( " , … ,  & ) being in class e ∈ E, where E is a class vector with E = (e " , e < , … , e = ) > and l is the total number of classes, is written as For continuous features, via Eq.( 9), an augmented copula dependent Bayes probability model derived upon Eq. ( 10) is then given by Combing the model ( 11) with the maximum a posterior (MAP) decision rule, the corresponding classifier to assign x to a class e ∈ E is then given as follows where f (x|e) is given in (11).With the assumption that each continuous feature X 9 is conditionally independent of every other feature X 7 for i ≠j and given a class e, Eq. ( 10) reduces to which is the naive Bayes model.Consequently, we can consider Eq. ( 13) as a special case of our proposed method in Eq. ( 11).

4.1.Learning Optimization Structure of D-Vine
A D-Vine consists of a sequence of trees so that optimization over a D-Vine can involve algorithms for graphs.We know that an order of the nodes in the tree  " uniquely determines a decomposition of a joint density f( " , … ,  & ) defined in Eq. ( 9).Thus, optimizing a D-Vine structure can be simply thought as optimizing the tree  " .Here, optimizing  " means to capture as much dependence as possible for the pairs of features in  " .However, the number of the possible orders of the nodes in  " in a p-dimensional D-Vine structure is p!/2, which implies that optimization via enumeration becomes impossible when p is large.One possible way of optimizing  " is to maximize a sum of transforms of the dependence, such as absolute Kendall tau (|τ|), absolute Pearson correlation (| & |), and absolute Spearman correlation (| A |), from all pair nodes in  " using the greedy search algorithm.We note, for a D-Vine structure, the greedy search is equivalent to solve the traveling salesman problem (TSP), see (Brechmann, 2010) for details.However, TSP is a well known NP-hard problem.For computational efficiency, we use an insertion algorithm provided in an R package TSP (Hahsler and Hornik, 2015)

4.2.Learning Univariate Margin
Given a class e ∈ E, denote the corresponding data structure as De = (  ,   , … ,    ), where   = (  ,   , … ,   ) with s = 1, 2, . . ., n L and n L is the total number of data points in the class e.For continuous mar-gins, they can be estimated by using either parametric or nonparametric approaches.For parametric approaches, the parameter(s)  -in an chosen parametric marginal density  N O for the feature  -can be estimated by maximizing the log-likelihood function given by For example, if  N O is chosen to be a Gaussian density, i.e.,  N O = ∅ N O , then  -= ( -, - < ).
Maximizing the likelihood given in Eq. ( 14), we could obtain the estimated parameters  -and  -.For nonparametric approaches, the marginal density  -for  -can be estimated by using a kernel-based approach (Parzen, 1962).Briefly, -  = " , where K is a kernel function and h is the bandwidth smooth parameter corresponding to the class e.
Recently, Chen, Hanson, and Zhang (2014) proposed a new parametric based density estimation approach, named transformed Bernstein polynomial prior (TBPP).Briefly, the density  -(x|e) can be estimated through , where T decides the order of Bernstein polynomial,  \,] is the Bernstein coefficient,  \,] (•) is a beta density with two parameters t and T-t+1, and  N O is a parametric distribution with the corresponding density  N O .By centering  -, associated with  -, at a parametric distribution  N O , the estimated density through TBPP can behavior like its centering when data actually follows  N O and can be adjusted like a nonparametric approach when data departs from the centering  N O .For details, see the reference (Chen, Hanson, and Zhang;2014).Throughout this paper, we use the Gaussian parametric, Kernel, and TBPP methods for continuous marginal density estimation, and we always report the best result in our experimental studies.

4.3.Learning Dependence Structure
Given a class e and the pair nodes,  V 9,7 , connected by the edge i in the tree  7 with j = 1,…,p-1 and i = 1,…,p-j ,denote the corresponding data for the pair  V 9,7 as  V 9,7 =  A 9,7,V :  = 1, … ,  V , where  V is the total number of data points in the class e and With i|i+1,…,  7 -1 and i+j|i+1,…,  7 -1 indicating the indexes of two nodes connected by the i-th edge in  7 , see Section2.2.For notation convenience, we define  A 9,7,V ≝ ( A,9 7,V ,  A,9:7 7,V ).We should note that at  " , the  A 9,7,V is directly estimated via true observations by  A,- 7,V =  -( A,-|), for k= {i,i+1}, here  -is the estimated univariate marginal distribution for feature  -, see Section 4.2.Starting from  < to  &," ,  A,- 9,7,V (a pseudo observation) is obtained via the h-function defined in Eq. ( 8), which thus corresponds to the chosen bivariate copula  j T k,l for  V 9,7 .To estimate the parameter(s)  V 9,7 in  j T k,l , we need to use the lemma given the below.Let we first define a loss function via Kullback-Leibler divergence,written as where  n is the unknown true joint density with the distribution  n , written as(by Eq. ( 3 16) with the unknown true copula density  n , and  j is a joint density given in ( 17) with an arbitrarily chosen parametric copula density  j .By defining a loss function depend on , LOSS(),through Kullback-Leibler divergence given in ( 15), we then have: minimizing LOSS() for  is equivalent to maximizing the copula log-likelihood(CLL) given by Proof.
We note that minimizing the loss function LOSS(θ) = KL( n ,  j ) with respect to (w.r.t.) θ is equivalent to minimizing the alternative loss function KL( n ,  j ) given in Eq. ( 19).With the fact that  n is a copula density nothing about θ, then minimizing Eq. ( 19) w.r.t.θ is thus equivalent to maximizing the function given by

4.4.Model Selection
One of the advantages of using D-Vine for assessing dependence structure is that different bivariate copulas might be used to evaluate the dependences for different pairs in down cascaded trees starting from T1.It thus makes possible for the whole dependence structure constructed by different copulas, such as Gaussian, Student-t, and Clayton copulas.Typically, compared to Gaussian copulas, Student-t copulas can capture more dependence in-formation for tails and Clayton has more power in evaluating unsymmetric dependence.The best dependence for each pair does not rely on a priori specific dependent type, it is instead obtained by voting for the best one among all of the considered copulas using a BIC-based score criterion given by Where ( V 9,7 ) is a bivariate copula for  V 9,7  V 9,7 , obtained by maximizing CLL( V 9,7 ,  V 9,7 ), is the corresponding estimated parameter(s) for ( V 9,7 ) , and Θ( V 9,7 ) is the number of free parameters(s) corresponding to  V 9,7 .The task of voting for the best dependence for the pair  V 9,7 is then equivalent to finding a bivariate copula ( V 9,7 ) which maximizes score (( V 9,7 ):  V 9,7 ) .

4.5.Computational Issues
For a big-data, the computational efficiency could be achieved at each  7 , for j = 1, . . ., p−1, by utilizing a parallel computing technique.Specifically, the tasks of evaluating dependences for the pairs in  7 can be distributed out to different computing nodes along with the data points (observations for T1 or pseudo-observations for  < , . . .,  &," ) only involved for the pairs evaluated at each computing node.The results from all computing nodes are then collected for the next tree  7:" when j < p − 1. Theoretically, our proposed method can be used in fitting for a data with extremely high dimension and large sample size.We are currently coding for our high performance computing (HPC) algorithm by using R packages snow (Tierney et al., 2015) and Rmpi (Yu, 2015), and we will release our codes as an R package later on.Throughout this paper, we do not use the HPC technique for our experimental studies.
By considering k different bivariate copulas for each pair, the total system running time needed for evaluating D-Vine dependence structure for an n × p dataset is then given by One where ( < log (p)) is the time for solving the TSP problem for  " optimization; ( S z ) is the time used for p margins estimation, here we use m to denote that this time also depends on the marginal method we used; ( S {-) is the time needed for a bivariate copula estimation, and k is the total copulas to be evaluated.We should note, for each pair, the selection of the bivariate dependence has no extra cost since the score in Eq. ( 21) is obtained directly after evaluating Eq. ( 18) for copula estimation.For a p > n problem, the running time for evaluating pair copulas can be expected to be quadratically related to the dimension p, i.e. ( < ).
Given n data points, the log-likelihood function is then written as where ( A-} ,  A-~ |,       ) is given in Eq. ( 23) with the estimated  -} and  -~ for the margins  -} and -~ , respectively.The estimated  for (•) then can be obtained by maximizing Eq. ( 28).The whole dependence structure of a D-Vine tree with discrete-valued entries then can be evaluated via the following algorithm: Algorithm 0.0.1.

Experimental Studies
To assess the merits of our model (DVCBC), we compare it to the existing popular supervised learning classification methods including the naive Bayes classifier (NBC), a network-based classifier (TAN), the support vector machine (SVM), and an ensemble classifier random forests (RF).The naive Bayes model assumes independence of the features and is fitted by using the function naiveBayes() provided in an R package e1071 (Meyer et al., 2014).Also in the same R package, the function svm() is used for fitting for the SVM model.Without a priori knowledge on data, we found that it is better to consider a relative large parameter space for the SVM model.Specifically, we consider the cost parameter ϑ ∈ {10 ,OE , 10 ,< , 10 ," } and kernel parameter γ ∈ {10 < , 10 " , 10 n , 10 ," , 10 ,< } for radial basis kernel.The TAN model is considered with a linear Gaussian conditional distribution given by  -|e ∼ N ( n +  "  ' ,  < ), where the feature  ' is the parent of  -in the network structure with the realization  ' .The RF model is fitted by using the Breiman approach (Breiman, 2001), and the according R package randomForest (Breiman et al., 2015) can be installed from the official R website.
Given a class e, we use the transformed Bernstein polynomial centering at a univariate Gaussian to estimate all margins, given by ), where  \,] is the beta cumulative distribution function with the beta density  \,] , for k = 1, . . ., p.We found that this marginal estimation method works well either for Gaussian or non-Gaussian margins.The D-Vine structure is optimized by the TSP algorithm introduced in Section 4.1.We consider three different types of bivariate copulas, i.e, Gaussian, Student-t, and Clayton copulas (Aas et al., 2006).The best dependence among those three is picked up with the highest score obtained by Eq. ( 21).We evaluate all methods for five datasets from the UCI ma-chine learning repository, see Table 1 for the details of the datasets.In Table 2, we summarize the average classification accuracy upon 2-fold cross validation repeated 50 times for all methods.Also, in Figure 2 and Figure 3, we plot the ROC curves for binary classifications -Parkinsons and Hill Valley.It is not surprised that our model is always superior to the NBC since our method allows the feature dependence and thus makes it possible to get better predictions by adding more dependence information in the structure through copulas.The same phenomenon can be observed for the TAN compared to the NBC since the TAN model also relaxes the independence assumption for features by adding directional edges between them for the dependencies.We also note that our method performs better than the TAN model for all five experimental studies.It might imply that, compared to the TAN, our method can more effectively integrate dependence information into the structure, even if the both are the tree-based models and free of the feature independence assumption.The SVM model has a better classification for "Glass Ident" dataset, but our method performs better than the SVM for the other 4 datasets.The RF model performs the best.among all considered methods in terms of the average classification accuracy for "Glass Ident" and "Hill Valley" datasets.It is only slightly worse than the DVCBC for "Iris" and "Wine" datasets.Some literatures, e.g., Caruana and Niculescu-Mizil (2006), have commented on the reasons of the outperformance of the random forests for supervised learning classification compared to other classifiers prior to calibration.For example, the outputs of an SVM are normalized distances to the decision boundary and thus it is not originally desinged to predict probabilities, and the naive Bayes models are well known to predict poorly because of the unrealistic independence assumption.Although our proposed method relaxes the independence assumption via copula techniques, it still has several restrictions such as the types of copulas specified for dependence structure.Misspecification on it will eventaully result in inaccurate predictions and thus wrong classifications.Following the discussion in Caruana and Niculescu-Mizil (2006), although on average some classifiers including the NBC are not competitive with the best methods such as random forests, those conclusions do not always hold since they probably perform much better for some metrics on some particular problems.It also turns out that our proposed model has its own advantages when used on some datasets considered here, see Table 2 and Figure 2 for "Iris", "Wine", and "Parkinsons".

Studies for Discrete Cases
We consider two simulation scenarios for all features with binary entries, see the following simulation setup.This is quite common for data from survey with answers either "Yes" or "No".We should note that with binary entry, if x = 0, then a = F (0 , ) = 0 and b = F (0) = 1−π; and if x = 1, then a = F (1 , ) = 1 − π and b = F (1) = 1, where π is the probability of success (x = 1) on each trial.In Scenario I, all three features are independent to each other, however, in Scenario II, the feature X2 depends on the feature X3.The number of the data points in each class  -, for k = 1, 2, 3, is n = 100, so the total number is  \ = 3 * 100 = 300.
We estimate Bernoulli margins for all three features.Similar to a continous case, the D-Vine structure is optimized by the TSP algorithm, and three types of copulas are considered with the best one having the highest score from Eq. ( 21).We report the classification accuracy from the DVCBC, the NBC, the SVM, and the TAN for Scenario I and II in Table 3. Apparently, our method is superior to others in the two simulated scenarios.The SVM model is always better than other tree based classifiers, the NBC and TAN, and the NBC model performs worst in all two scenarios.To compare their computational efficiencies, we also record the CPU time for all four methods.With Intel Core i5-4570R CPU @2.70GHz, the system running time in one run for the DVCBC, the NBC, the SVM, and the TAN are 0.52s, 0.01s, 0.03s, and 0.03s, respectively.We further use the four methods to fit for hartigan data from an R package homals (Mair and De Leeuw, 2015).The main task is to classify a number of bolts, nails, screws, and tacks according to three criteria, i.e., Thread (Yes or No), Bottom (Sharp and Flat), and Brass (Yes or No).The total number of data points is  \ = 24.Specifically, 9 from nail, 6 from screw, 6 from bolt, and 3 from tack.We use 2-fold cross validation repeated 50 times for computing the classification accuracy and report the results in Table 4. Compared to the NBC, the SVM, and the TAN, our method is (0.912 − 0.870)/0.870× 100% ≈ 4.8%, (0.912 − 0.894)/0.894× 100% ≈2.0%, and (0.912 − 0.879)/0.879× 100% ≈ 3.8% better accuracy on average, respectively.

Conclusion
As In this paper, we propose a new supervised learning method which is free of the feature independence assumption.The dependences among features are evaluated by pair copulas constructed through a graphical model called D-Vine.This makes possible to fit for a data with high dimension and large sample size in an easy and efficient way since only two features (variables) are considered at a time.We note that the popular naive Bayes classifier actually is a special case of the proposed method.We further extend the model to evaluate the dependences among discrete-valued features using an augmented method.We should mention that in the whole paper we only considered three types of copulas, i.e., Gaussian, Student-t, and Clayton copulas, for multivariate dependence, however, in practice, we could use other types of copulas for measuring the dependence of features, such as Gumbel and Frank copulas.
with the weights obtained from |τ|, or | & |, or | A |.Although the optimization through this algorithm might be suboptimal, it runs much faster for a large p with the running time O( < log(p)).Throughout this paper, we optimize  " for a D-Vine structure via the TSP with insertion algorithm.
)) where  - n and  - n are the unknown univariate true marginal density and distribution functions for the feature  -, respectively, and  n is the unknown true copula density with the copula function  n .Similarly, where  j is a chosen parametric copula density with the parameter(s) .We then give the lemma as follows Lemma.Let p random variables X = ( " , … ,  & ) with a realization x = ( " , … ,  & ) and  - n be the true marginal distribution for variable  -with density  - n for k = 1,…,p.Let U = ( " , … ,  & ) are p random variables with  -=  - n ( -), where  -is uniformly distributed with  -∈(0,1),and u = ( " , … ,  & ) is a realization of U. Let data structure be D = { A : s = 1,…,n}, where  A = ( A" , … ,  A& ) is one realization of U. Denote the unknown true joint density as  n given in (

Table 1 :
5 datasets from UCI machine learning repository.