Journal of Data Science (JDS)

ISSN: 1683-8602, 1680-743X

Publisher: School of Statistics, Renmin University of China

Article ID: JDS1000

DOI: 10.6339/20-JDS1000

Section: Statistical Data Science

Sparse Learning with Non-convex Penalty in Multi-classification

Nan¹, Hao Helen Zhang²∗

¹Department of Epidemiology and Cancer Control, St. Jude Children’s Research Hospital, Memphis, Tennessee, U.S.A.

²Department of Mathematics, University of Arizona, Tucson, Arizona, U.S.A.

∗Corresponding author. Email: hzhang@math.arizona.edu.

Published: 10 February 2021

Volume 19, Issue 1, pp. 56–74

Supplementary Material

A zip file containing all the computation code and data for the numerical experiments is available.

Received: November 2020. Accepted: December 2020.

© 2021 The Author(s). Published by the School of Statistics and the Center for Applied Statistics, Renmin University of China. Open access article under the CC BY license.

Multi-classification is commonly encountered in data science practice and has broad applications in areas such as biology, medicine, and engineering. Variable selection in multiclass problems is much more challenging than in binary classification or regression problems. In addition to estimating multiple discriminant functions for separating different classes, we need to decide which variables are important for each individual discriminant function as well as for the whole set of functions. In this paper, we address the multi-classification variable selection problem by proposing a new form of penalty, supSCAD, which first groups together all the coefficients of the same variable across all the discriminant functions and then imposes the SCAD penalty on the supnorm of each group. We apply the new penalty to both soft and hard classification and develop two new procedures: the supSCAD multinomial logistic regression and the supSCAD multi-category support vector machine. Our theoretical results show that, with a proper choice of the tuning parameter, the supSCAD multinomial logistic regression can identify the underlying sparse model consistently and enjoys oracle properties even when the dimension of the predictors goes to infinity. Using local linear and quadratic approximations to the non-concave SCAD penalty and the nonlinear multinomial log-likelihood function, we show that the new procedures can be implemented efficiently by solving a series of linear or quadratic programming problems. The performance of the new methods is illustrated by simulation studies and by real data analyses of the Small Round Blue Cell Tumors and the Semeion Handwritten Digit data sets.

Keywords: logistic regression; SCAD; supnorm; SVM; variable selection
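
To make the penalty described in the abstract concrete, the following is a minimal sketch in Python, not the authors' implementation. It assumes the standard SCAD form of Fan and Li (2001) with the conventional default a = 3.7, and it assumes the coefficients are stored in a p × K array `B` whose row j holds the coefficients of variable j across the K discriminant functions. The function names, the array layout, and the toy values are hypothetical. Details of the paper's formulation, such as the treatment of intercepts or identifiability constraints, may differ. The derivative function gives the weights that a local linear approximation would place on each group's supnorm.

```python
import numpy as np


def scad(t, lam, a=3.7):
    """SCAD penalty (Fan and Li, 2001) evaluated at |t|; requires a > 2."""
    t = np.abs(np.asarray(t, dtype=float))
    linear = lam * t                                        # |t| <= lam
    quadratic = (2 * a * lam * t - t**2 - lam**2) / (2 * (a - 1))  # lam < |t| <= a*lam
    constant = lam**2 * (a + 1) / 2                         # |t| > a*lam
    return np.where(t <= lam, linear,
                    np.where(t <= a * lam, quadratic, constant))


def scad_derivative(t, lam, a=3.7):
    """Derivative of the SCAD penalty at |t| (right derivative at 0)."""
    t = np.abs(np.asarray(t, dtype=float))
    return np.where(t <= lam, lam, np.maximum(a * lam - t, 0.0) / (a - 1))


def sup_scad(B, lam, a=3.7):
    """Sum over variables of SCAD applied to each row's supnorm.

    B is assumed to be a (p, K) array: row j collects the coefficients
    of variable j in all K discriminant functions (assumed layout).
    """
    sup_norms = np.max(np.abs(B), axis=1)
    return float(np.sum(scad(sup_norms, lam, a)))


# Toy coefficient matrix (hypothetical values): 3 variables, 3 classes.
B = np.array([[0.0, 0.0, 0.0],     # variable dropped from every function
              [0.5, -0.2, 0.1],    # small group, penalized linearly
              [3.0, -4.0, 1.0]])   # large group, penalty is flat

penalty = sup_scad(B, lam=1.0)
lla_weights = scad_derivative(np.max(np.abs(B), axis=1), lam=1.0)
print(penalty, lla_weights)
```

In this toy example the third variable has supnorm 4.0, which exceeds a·λ = 3.7, so its local linear weight is zero and a large group is left unshrunk, while the two small groups receive the full weight λ. This is the behavior that distinguishes SCAD from an L1-type supnorm penalty.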
