Journal of Data Science (JDS)

ISSN: 1683-8602, 1680-743X

Publisher: School of Statistics, Renmin University of China

Article JDS1069, DOI: 10.6339/22-JDS1069

Section: Statistical Data Science

Linear Algorithms for Robust and Scalable Nonparametric Multiclass Probability Estimation

Liyun Zeng¹ and Hao Helen Zhang¹,²,∗

¹Statistics and Data Science GIDP, University of Arizona, Tucson, Arizona, USA

²Department of Mathematics, University of Arizona, Tucson, Arizona, USA

∗Corresponding author. Email: hzhang@math.arizona.edu.

Journal of Data Science 21(4): 658–680 (2023)

Received 3 June 2022; accepted 25 September 2022; published online 3 November 2022.

© 2023 The Author(s). Published by the School of Statistics and the Center for Applied Statistics, Renmin University of China.

Open access article under the CC BY license.

Multiclass probability estimation is the problem of estimating the conditional probabilities of a data point belonging to each class given its covariate information. It has broad applications in statistical analysis and data science. Recently, a class of weighted Support Vector Machines (wSVMs) has been developed to estimate class probabilities through ensemble learning for K-class problems (Wu et al., 2010; Wang et al., 2019), where K is the number of classes. The estimators are robust and achieve high accuracy for probability estimation, but their learning is implemented through pairwise coupling, which demands polynomial time in K. In this paper, we propose two new learning schemes, baseline learning and One-vs-All (OVA) learning, to further improve wSVMs in terms of computational efficiency and estimation accuracy. In particular, baseline learning has optimal computational complexity in the sense that it is linear in K. Though not the most computationally efficient, OVA learning is found to have the best estimation accuracy among all the procedures under comparison. The resulting estimators are distribution-free and are shown to be consistent. We further conduct extensive numerical experiments to demonstrate their finite-sample performance.

Keywords: linear time algorithm; multiclass classification; non-parametric; probability estimation; scalability; support vector machines
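
As a rough illustration of the complexity comparison in the abstract, the sketch below counts the binary subproblems each learning scheme would solve for a K-class task. The counts are assumptions based on the usual definitions: pairwise coupling visits every pair of classes, baseline learning is assumed here to pair each remaining class with one fixed reference class, and OVA fits one class-versus-rest problem per class. The function name `binary_subproblems` is hypothetical, and the cost of each individual fit is ignored, so this is not the paper's complexity analysis.

```python
from math import comb


def binary_subproblems(K: int) -> dict:
    """Count binary subproblems per scheme for a K-class task (illustrative only)."""
    if K < 2:
        raise ValueError("need at least two classes")
    return {
        "pairwise": comb(K, 2),  # every unordered pair of classes: K(K-1)/2
        "baseline": K - 1,       # assumed: each other class vs. one reference class
        "ova": K,                # one class-vs-rest problem per class
    }


# Compare how the counts grow with the number of classes.
for K in (3, 10, 50):
    print(K, binary_subproblems(K))
```

Under these assumptions the pairwise count grows quadratically in K while the other two grow linearly. Each OVA problem is a class-versus-rest fit on the full sample, which may be one reason the abstract describes OVA as not the most efficient scheme.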

References

Alimoglu, Alpaydin (1997). Combining multiple representations and classifiers for pen-based handwritten digit recognition. In: Proceedings of the Fourth International Conference on Document Analysis and Recognition (Schürmann, ed.), volume 2, 637–640. Ulm, Germany.

Alizadeh, Eisen, Davis, Ma, Lossos, Rosenwald, et al. (2000). Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature, 403(6769): 503–511.

Breiman L, Friedman J, Olshen R, Stone C (1984). Classification and Regression Trees. Wadsworth Publishing Company, Belmont, California, USA.

Burges (1998). A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2: 121–167.

Cairano, Brand, Bortoff (2013). Projection-free parallel quadratic programming for linear model predictive control. International Journal of Control, 86(8): 1367–1385.

Chamasemani, Singh (2011). Multi-class support vector machine (SVM) classifiers – an application in hypothyroid detection and classification. In: Proceedings of the Sixth International Conference on Bio-Inspired Computing: Theories and Applications (Abdullah, ed.), 351–356. Penang, Malaysia.

Chen, Guestrin (2016). XGBoost: A scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Smola, Aggarwal, Shen, Rastogi, eds.), KDD ’16, 785–794. ACM, New York, New York, USA.

Crammer, Singer (2001). On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2: 265–292.

Cristianini, Shawe-Taylor (2000). An Introduction to Support Vector Machines and other Kernel-based Learning Methods. Cambridge University Press, Cambridge, UK.

Ding, Zhao, Zhang, Xue (2019). A review on multi-class TWSVM. Artificial Intelligence Review, 52(2): 775–801.

Dua, Graff (2019). UCI machine learning repository. http://archive.ics.uci.edu/ml.

Dudoit, Fridlyand, Speed (2002). Comparison of discrimination methods for the classification of tumors using gene expression data. Journal of the American Statistical Association, 97(457): 77–87.

Guo, Pleiss, Sun, Weinberger (2017). On calibration of modern neural networks. In: Proceedings of the 34th International Conference on Machine Learning (Precup, Teh, eds.), volume 70, 1321–1330. Sydney, Australia.

Hastie, Tibshirani, Friedman (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd edition. Springer, New York, New York, USA.

Herbei, Wegkamp (2006). Classification with reject option. Canadian Journal of Statistics, 34(4): 709–721.

Ho (1995). Random decision forests. In: Proceedings of the Third International Conference on Document Analysis and Recognition (Suen, ed.), volume 1, 278–282. Montreal, Canada.

Horton, Nakai (1996). A probabilistic classification system for predicting the cellular localization sites of proteins. In: Proceedings of the Fourth International Conference on Intelligent Systems for Molecular Biology (States, Agarwal, Gaasterland, Hunter, Smith, eds.), 109–115. St. Louis, Missouri, USA.

Huang, Liu, Du, Perou, Hayes, Todd, et al. (2013). Multiclass distance-weighted discrimination. Journal of Computational and Graphical Statistics, 22(4): 953–969.

Islam, Khan, Kim (2016). Discriminant feature distribution analysis-based hybrid feature selection for online bearing fault diagnosis in induction motors. Journal of Sensors, 2016: 1–16.

Kallas, Francis, Kanaan, Merheb, Honeine, Amoud (2012). Multi-class SVM classification combined with kernel PCA feature extraction of ECG signals. In: Proceedings of the 19th International Conference on Telecommunications (Abumarshoud, Shojaeifard, Aghvami, Marvasti, eds.), 1–5. Jounieh, Lebanon.

Kimeldorf, Wahba (1971). Some results on Tchebycheffian spline functions. Journal of Mathematical Analysis and Applications, 33: 82–95.

Krawczyk, Woźniak, Cyganek (2014). Clustering-based ensembles for one-class classification. Information Sciences, 264: 182–195.

Lee, Lin, Wahba (2004). Multicategory support vector machines, theory, and application to the classification of microarray data and satellite radiance data. Journal of the American Statistical Association, 99: 67–81.

Lei, Dogan, Binder, Kloft (2015). Multi-class SVMs: from tighter data-dependent generalization bounds to novel algorithms. In: Proceedings of the 28th International Conference on Neural Information Processing Systems (Cortes, Lee, Sugiyama, Garnett, eds.), volume 2, 2035–2043. Montreal, Canada.

Lin (2002). Support vector machines and the Bayes rule in classification. Data Mining and Knowledge Discovery, 6: 259–275.

Liu (2007). Fisher consistency of multicategory support vector machines. In: Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics (Meila, Shen, eds.), 291–298. San Juan, Puerto Rico.

Liu, Yuan (2011). Reinforced multicategory support vector machine. Journal of Computational and Graphical Statistics, 20: 901–919.

McCullagh, Nelder (1989). Generalized Linear Models. Chapman and Hall, London, UK.

Mezzoudj, Benyettou (2012). On the optimization of multiclass support vector machines dedicated to speech recognition. In: Proceedings of the 19th International Conference on Neural Information Processing (Huang, Zeng, Li, Leung, eds.), volume 2, 1–8. Berlin, Germany.

Minderer, Djolonga, Romijnders, Hubis, Zhai, Houlsby, et al. (2021). Revisiting the calibration of modern neural networks. In: Proceedings of the 35th Advances in Neural Information Processing Systems (Ranzato, Beygelzimer, Dauphin, Liang, Vaughan, eds.), volume 34, 15682–15694.

Rifkin, Klautau (2004). In defense of one-vs-all classification. Journal of Machine Learning Research, 5: 101–141.

Saigal, Khanna (2020). Multi-category news classification using support vector machine based classifiers. SN Applied Sciences, 2(3): 458.

Tomar, Agarwal (2015). A comparison on multi-class classification methods based on least squares twin support vector machine. Knowledge-Based Systems, 81: 131–147.

Vapnik (1998). Statistical Learning Theory. Wiley, New York, New York, USA.

Wahba (1990). Spline Models for Observational Data. CBMS-NSF Regional Conference Series in Applied Mathematics. SIAM, Philadelphia, Pennsylvania, USA.

Wang, Shen (2006). Estimation of generalization error: random and fixed inputs. Statistica Sinica, 16(2): 569–588.

Wang, Shen, Liu (2008). Probability estimation for large margin classifiers. Biometrika, 95: 149–167.

Wang, Shen (2007). On L1-norm multiclass support vector machines. Journal of the American Statistical Association, 102: 583–594.

Wang, Zhang, Wu (2019). Multiclass probability estimation with support vector machines. Journal of Computational and Graphical Statistics, 28(3): 586–595.

Weston, Watkins (1999). Support vector machines for multi-class pattern recognition. In: Proceedings of the Seventh European Symposium on Artificial Neural Networks (Gori, ed.), 21–23. Bruges, Belgium.

Wu, Zhang, Liu (2010). Robust model-free multiclass probability estimation. Journal of the American Statistical Association, 105: 424–436.

Ye, Tse (1989). An extension of Karmarkar’s projective algorithm for convex quadratic programming. Mathematical Programming, 44(1–3): 157–179.

Yeoh, Ross, Shurtleff, Williams, Patel, Mahfouz, et al. (2002). Classification, subtype discovery, and prediction of outcome in pediatric acute lymphoblastic leukemia by gene expression profiling. Cancer Cell, 1(2): 133–143.

Zhang, Liu (2013). Multicategory large-margin unified machines. Journal of Machine Learning Research, 14: 1349–1386.

Zhu, Hastie (2005). Kernel logistic regression and the import vector machine. Journal of Computational and Graphical Statistics, 14: 185–205.

Zhu, Rosset, Hastie, Tibshirani (2003). 1-norm support vector machines. In: Proceedings of the 16th International Conference on Neural Information Processing Systems (Thrun, Saul, Schölkopf, eds.), 49–56. Whistler, Canada.