Binary Classification of Malignant Mesothelioma: A Comparative Study
Volume 21, Issue 2 (2023): Special Issue: Symposium Data Science and Statistics 2022, pp. 205–224
Pub. online: 14 February 2023
Type: Data Science In Action
Open Access
Received
25 July 2022
Accepted
6 February 2023
Published
14 February 2023
Abstract
Malignant mesotheliomas are aggressive cancers that arise in the thin layer of tissue lining internal organs, most commonly in the chest or abdomen. Although the cancer is rare and deadly, early diagnosis helps with treatment and improves outcomes. Mesothelioma is nevertheless usually diagnosed in its later stages, because its symptoms resemble those of other, more common conditions. Predicting and diagnosing mesothelioma early is therefore essential to starting treatment in time. The goal of this comprehensive empirical comparison is to determine the best-performing model based on recall (sensitivity). We particularly wish to avoid false negatives, as it is costly to diagnose a patient as healthy when they actually have cancer. Models are trained using k-fold cross-validation. Random forest is selected as the optimal model. According to this model, age and duration of asbestos exposure rank as the most important features affecting the diagnosis of mesothelioma.
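To make the selection criterion concrete, the following is a minimal sketch of choosing a model by mean recall over k folds. It is not the authors' pipeline (the supplementary scripts are the authoritative implementation). The data are synthetic, the single feature is a hypothetical stand-in for years of asbestos exposure, and the two toy classifiers (`fit_always_healthy`, `fit_threshold`) are illustrative assumptions used in place of the models compared in the paper, such as random forest.

```python
import random
from statistics import mean


def recall(y_true, y_pred):
    """Recall (sensitivity) = TP / (TP + FN), with 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if (tp + fn) else 0.0


def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k nearly equal test folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validated_recall(fit, X, y, k=5, seed=0):
    """Train on k-1 folds, score recall on the held-out fold, and average."""
    scores = []
    for test_idx in k_fold_indices(len(X), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        y_pred = [model(X[i]) for i in test_idx]
        scores.append(recall([y[i] for i in test_idx], y_pred))
    return mean(scores)


def fit_always_healthy(X_train, y_train):
    """Toy baseline: predicts 'healthy' (0) for everyone, so every case is missed."""
    return lambda x: 0


def fit_threshold(X_train, y_train):
    """Toy classifier: cut at the midpoint between the two class means.

    Assumes positive cases tend to have larger feature values.
    """
    pos = [x for x, t in zip(X_train, y_train) if t == 1]
    neg = [x for x, t in zip(X_train, y_train) if t == 0]
    cut = (mean(pos) + mean(neg)) / 2
    return lambda x: 1 if x >= cut else 0


# Synthetic data (hypothetical): 40 positive and 60 negative patients,
# described by one made-up feature.
rng = random.Random(42)
X = [rng.gauss(30, 5) for _ in range(40)] + [rng.gauss(10, 5) for _ in range(60)]
y = [1] * 40 + [0] * 60

results = {
    "always_healthy": cross_validated_recall(fit_always_healthy, X, y),
    "threshold": cross_validated_recall(fit_threshold, X, y),
}
best = max(results, key=results.get)  # model with the highest mean recall
print(results, best)
```

The baseline shows why recall is the criterion of interest here: a model that labels every patient as healthy never raises a false alarm, yet its recall is zero because it misses every cancer case.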
Supplementary material
The zip supplementary material file contains the Python and R scripts for reading data and preprocessing, exploratory data analysis, and the various models tested.