Run OPTICS on continuous variables to obtain a reachability plot
library(dbscan) # provides optics() and reachability plots
op <- optics(data1[,1:4], eps = 10, minPts = 10) # relatively large eps and minPts
plot(op, main='Data 1')

op <- optics(data2[,1:11], eps = 10, minPts = 20) # relatively large eps and minPts
plot(op, main='Data 2')

op <- optics(data3[,1:4], eps = 10, minPts = 10) # relatively large eps and minPts
plot(op, main='Data 3')

From the three reachability plots we can observe that only data 1 displays three "valleys", indicating that data 1 has a natural cluster structure. Data 2 and data 3 have either a partitioned cluster structure or a homogeneous structure.
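The valleys can also be extracted programmatically rather than read off the plot; a minimal sketch using `extractXi()` from the same dbscan package (the `xi = 0.05` threshold here is an illustrative assumption, not a tuned value):

```r
library(dbscan)

# Re-run OPTICS on the continuous part of data 1 and extract xi-clusters,
# i.e. the "valleys" of the reachability plot; xi = 0.05 is an assumed threshold
op1 <- optics(data1[, 1:4], eps = 10, minPts = 10)
xi_clusters <- extractXi(op1, xi = 0.05)

# Number of extracted clusters (cluster 0 = noise) should match the
# visual count of valleys, i.e. 3 for data 1
length(unique(xi_clusters$cluster[xi_clusters$cluster > 0]))
```

If the count disagrees with the visual inspection, `xi` can be varied; the valley count is fairly stable for a genuinely natural cluster structure.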
Run sparse k-means on data 1 and consensus k-means on data 2 and data 3
library(sparcl) # provides KMeansSparseCluster()
data_run <- scale(data1[,-5], center = TRUE, scale = TRUE) # scaling is recommended
tune <- KMeansSparseCluster.permute(as.matrix(data_run), K = 3, wbounds = seq(1.1, 3.1, by = 0.3), silent = TRUE)
sparsek <- KMeansSparseCluster(as.matrix(data_run), K = 3, wbounds = tune$bestw, silent = TRUE)
sparsek[[1]]$ws
## x1 x2 x3 x4
## 0.491210566 0.586567459 0.643931423 0.001765776
cramer.v(table(data1$x5, sparsek[[1]]$Cs))
## [1] 0.6757008
From the obtained weights, variable \(x_4\) receives a near-zero weight and can clearly be dropped. Based on Cramér's V between variable \(x_5\) and the sparse k-means cluster assignment, \(x_5\) is kept. Therefore \(x_1\), \(x_2\), \(x_3\), and \(x_5\) are retained for the final clustering step.
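For reference, Cramér's V is the chi-squared statistic normalized by sample size and table dimension, \(V = \sqrt{\chi^2 / (n\,(\min(r,c)-1))}\). A minimal base-R sketch of the same quantity computed by `cramer.v()` above (the function name `cramers_v` is ours, for illustration):

```r
# Cramér's V from scratch in base R:
# V = sqrt(chi^2 / (n * (min(rows, cols) - 1)))
cramers_v <- function(x, y) {
  tab <- table(x, y)
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE)$statistic)
  n <- sum(tab)
  k <- min(nrow(tab), ncol(tab)) - 1
  unname(sqrt(chi2 / (n * k)))
}

# Should agree with cramer.v(table(data1$x5, sparsek[[1]]$Cs)) above
```

Values near 0 indicate little association between the categorical variable and the cluster assignment; values near 1 indicate strong association.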
library(ConsensusClusterPlus)
data2.new <- t(scale(data2[,1:11], center = TRUE, scale = TRUE))
results.dat2 <- ConsensusClusterPlus(as.matrix(data2.new), maxK = 6, reps = 1000, pItem = 0.8, pFeature = 1,
                                     clusterAlg = "km", distance = "euclidean", seed = 1262, plot = 'png',
                                     title = paste0(path, '/hydapDat2'))
#results2.dat2<- calcICL(results.dat2, plot='png',title=paste0(path,'/hydapDat2'))


From the consensus k-means results for data 2, we observe that 3 is the optimal number of clusters. Since the continuous part of data 2 can be partitioned, it is classified as a partitioned cluster structure. Therefore, all continuous variables are kept in the final clustering, as together they produce the partitions.
apply(data2[,12:14], 2, function(x) cramer.v(table(x, results.dat2[[3]][["consensusClass"]])))
## x12 x13 x14
## 0.1672483 0.8071675 0.8015909
Based on Cramér's V between each categorical variable and the consensus k-means cluster assignment, \(x_{12}\) will be dropped in the final clustering.
data3.new <- t(scale(data3[,1:4], center = TRUE, scale = TRUE))
results.dat3 <- ConsensusClusterPlus(as.matrix(data3.new), maxK = 6, reps = 1000, pItem = 0.8, pFeature = 1,
                                     clusterAlg = "km", distance = "euclidean", seed = 1262, plot = 'png',
                                     title = paste0(path, '/hydapDat3'))
#results2.dat3<- calcICL(results.dat3, plot='png',title=paste0(path,'/hydapDat3'))


From the consensus k-means results for data 3, we observe that no candidate number of clusters yields a stable consensus. Therefore, data 3 is classified as a homogeneous cluster structure, and all continuous variables will be dropped from the final clustering.
cramer.v(table(data3$x5,data3$x6))
## [1] 0.09764819
cramer.v(table(data3$x5,data3$x7))
## [1] 0.6758128
cramer.v(table(data3$x6,data3$x7))
## [1] 0.1005522
Based on the pairwise Cramér's V between the categorical variables, \(x_6\) is only weakly associated with both \(x_5\) and \(x_7\) and therefore does not share cluster structure with them, so \(x_6\) will be dropped in the final clustering.
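The three pairwise calls above can be collapsed into one pass with `combn()`; a small sketch, assuming `data3` and `cramer.v()` are available as above:

```r
# All pairwise Cramér's V values among the categorical variables of data 3
cat_pairs <- combn(c("x5", "x6", "x7"), 2)
v <- apply(cat_pairs, 2, function(p) cramer.v(table(data3[[p[1]]], data3[[p[2]]])))
names(v) <- apply(cat_pairs, 2, paste, collapse = "-")
v
```

This scales directly to more categorical variables and makes the weakly associated variable (here \(x_6\)) easy to spot from the named output.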