Journal of Data Science

Binary Classification of Malignant Mesothelioma: A Comparative Study
Volume 21, Issue 2 (2023): Special Issue: Symposium Data Science and Statistics 2022, pp. 205–224
Ted Si Yuan Cheng, Xiyue Liao

https://doi.org/10.6339/23-JDS1090
Pub. online: 14 February 2023 | Type: Data Science In Action | Open Access

Received: 25 July 2022
Accepted: 6 February 2023
Published: 14 February 2023

Abstract

Malignant mesotheliomas are aggressive cancers that occur in the thin layer of tissue that most commonly lines the chest or abdomen. Although the cancer is rare and deadly, early diagnosis helps with treatment and improves outcomes. Mesothelioma is usually diagnosed in the later stages, and its symptoms resemble those of other, more common conditions. Predicting and diagnosing mesothelioma early is therefore essential to starting treatment in time. The goal of this comprehensive empirical comparison is to determine the best-performing model based on recall (sensitivity). We particularly wish to avoid false negatives, because diagnosing a patient as healthy when they actually have cancer is costly. Models are trained using k-fold cross-validation. Random forest is chosen as the optimal model. According to this model, age and duration of asbestos exposure rank as the most important features for diagnosing mesothelioma.
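
The selection procedure described in the abstract (compare candidate classifiers by recall under k-fold cross-validation, then inspect the variable importance of a random forest) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' code: the data are synthetic stand-ins generated with scikit-learn, the candidate models, the hyperparameters, and the choice of five stratified folds are hypothetical, and the feature names are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in data (assumption: NOT the mesothelioma records).
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=4,
    weights=[0.7, 0.3], random_state=0,
)

# Hypothetical candidate set; the study's actual models and settings may differ.
candidates = {
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
}

# Stratified folds keep the class ratio similar in every fold (assumed k = 5).
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Mean recall (sensitivity) of the positive class across the folds.
mean_recall = {
    name: cross_val_score(model, X, y, cv=cv, scoring="recall").mean()
    for name, model in candidates.items()
}
best_name = max(mean_recall, key=mean_recall.get)

for name, score in sorted(mean_recall.items(), key=lambda kv: -kv[1]):
    print(f"{name}: mean recall = {score:.3f}")
print("highest mean recall:", best_name)

# Impurity-based importance ranking from a random forest fit on all the data.
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
ranking = sorted(enumerate(forest.feature_importances_), key=lambda kv: -kv[1])
for idx, importance in ranking[:3]:
    print(f"feature_{idx}: importance = {importance:.3f}")
```

Scoring on recall rather than accuracy reflects the stated priority of avoiding false negatives. On imbalanced data a model can reach high accuracy while still missing many positive cases, which recall exposes directly.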

Supplementary material

The supplementary zip file contains the Python and R scripts for reading and preprocessing the data, exploratory data analysis, and the various models tested.


Copyright
2023 The Author(s). Published by the School of Statistics and the Center for Applied Statistics, Renmin University of China.
Open access article under the CC BY license.

Keywords
binary classification, cancer, class imbalance, machine learning, mesothelioma, variable importance
