Pub. online: 10 Jul 2024 | Type: Statistical Data Science | Open Access
Journal:Journal of Data Science
Volume 22, Issue 3 (2024): Special issue: The Government Advances in Statistical Programming (GASP) 2023 conference, pp. 456–468
Abstract
Missing data is a common occurrence in various fields, spanning social science, education, economics, and biomedical research. Disregarding missing data in statistical analyses can introduce bias to study outcomes. To mitigate this issue, imputation methods have proven effective in reducing nonresponse bias and generating complete datasets for subsequent analysis of secondary data. The efficacy of imputation methods hinges on the assumptions of the underlying imputation model. While machine learning techniques such as regression trees, random forest, XGBoost, and deep learning have demonstrated robustness against model misspecification, their optimal performance may necessitate fine-tuning under specific conditions. Moreover, imputed values generated by these methods can sometimes deviate unnaturally, falling outside the normal range. To address these challenges, we propose a novel Predictive Mean Matching imputation (PMM) procedure that leverages popular machine learning-based methods. PMM strikes a balance between robustness and the generation of appropriate imputed values. In this paper, we present our innovative PMM approach and conduct a comparative performance analysis through Monte Carlo simulation studies, assessing its effectiveness against other established methods.
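The matching step behind PMM can be sketched as follows. This is a minimal illustration, not the procedure proposed in the paper: the names `pmm_impute` and `ols_fit_predict`, the donor count `k=5`, and the simulated data are all assumptions made for the example, and an ordinary least-squares fit stands in for the machine learning predictors (regression trees, random forest, XGBoost, deep learning) mentioned in the abstract. Any predictor can be plugged in through the `fit_predict` argument.

```python
import numpy as np

def pmm_impute(X, y, fit_predict, k=5, rng=None):
    """Impute NaN entries of y by predictive mean matching (illustrative sketch).

    fit_predict(X_obs, y_obs, X_all) must return predictions for every row of X_all.
    """
    rng = np.random.default_rng(rng)
    y = np.asarray(y, dtype=float)
    missing = np.isnan(y)
    observed = ~missing

    # Fit on the observed rows only, then predict for all rows.
    y_hat = fit_predict(X[observed], y[observed], X)
    y_obs = y[observed]
    hat_obs = y_hat[observed]

    y_imp = y.copy()
    for i in np.flatnonzero(missing):
        # Donor pool: the k observed units whose predictions are closest
        # to the prediction for the unit with the missing value.
        donors = np.argsort(np.abs(hat_obs - y_hat[i]))[:k]
        # Impute an actually observed value drawn from the donor pool.
        y_imp[i] = y_obs[rng.choice(donors)]
    return y_imp

def ols_fit_predict(X_obs, y_obs, X_all):
    """Hypothetical stand-in predictor: linear regression via least squares."""
    A_obs = np.column_stack([np.ones(len(X_obs)), X_obs])
    A_all = np.column_stack([np.ones(len(X_all)), X_all])
    beta, *_ = np.linalg.lstsq(A_obs, y_obs, rcond=None)
    return A_all @ beta

# Simulated example data (assumed, not from the paper).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 1.0 + X @ np.array([2.0, -1.0]) + rng.normal(scale=0.5, size=200)
y_missing = y.copy()
y_missing[rng.random(200) < 0.3] = np.nan

y_completed = pmm_impute(X, y_missing, ols_fit_predict, k=5, rng=1)
```

Because each imputed value is copied from an observed donor, the completed variable cannot leave the range of the observed data, which is the property the abstract contrasts with direct model-based predictions.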
Abstract: PSA measurements are used to assess the risk for prostate cancer. PSA range and PSA kinetics such as PSA velocity have been correlated with increased cancer detection and assist the clinician in deciding when prostate biopsy should be performed. Our aim is to evaluate the use of a novel maximum likelihood estimation - prostate specific antigen (MLE-PSA) model for predicting the probability of prostate cancer using serial PSA measurements combined with PSA velocity, in order to assess whether this reduces the need for prostate biopsy. A total of 1976 Caucasian patients were included. All of these patients had at least 6 serial PSA measurements, and all underwent trans-rectal biopsy with a minimum of 12 cores within the past 10 years. A multivariate logistic regression model was developed using maximum likelihood estimation (MLE) based on the following parameters: age, at least 6 serial PSA measurements, baseline median natural logarithm of the PSA (ln(PSA)) and PSA velocity (ln(PSAV)), baseline process capability standard deviation of ln(PSA) and ln(PSAV), significant special causes of variation in ln(PSA) and ln(PSAV) detected using control chart logic, and the volatility of the ln(PSAV). We then compared prostate cancer probability using MLE-PSA to the results of prostate needle biopsy. The MLE-PSA model with a 50% cut-off probability has a sensitivity of 87%, specificity of 85%, positive predictive value (PPV) of 89%, and negative predictive value (NPV) of 82%. By contrast, a single PSA value with a 4 ng/mL threshold has a sensitivity of 59%, specificity of 33%, PPV of 56%, and NPV of 36% in the same population of patients used to generate the MLE-PSA model. Based on serial PSA measurements, the use of the MLE-PSA model significantly (p-value < 0.0001) improves prostate cancer detection and reduces the need for prostate biopsy.
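For readers less familiar with the four reported accuracy measures, the sketch below shows how sensitivity, specificity, PPV, and NPV follow from predicted probabilities, a cut-off, and the biopsy outcome. The function name `diagnostic_metrics` and the eight toy observations are assumptions for illustration only; they are not data from the study and do not reproduce its figures.

```python
import numpy as np

def diagnostic_metrics(y_true, prob, cutoff=0.5):
    """Sensitivity, specificity, PPV and NPV at a given probability cut-off."""
    y_true = np.asarray(y_true, dtype=bool)
    pred = np.asarray(prob, dtype=float) >= cutoff

    tp = np.sum(pred & y_true)      # predicted positive, biopsy positive
    fp = np.sum(pred & ~y_true)     # predicted positive, biopsy negative
    tn = np.sum(~pred & ~y_true)    # predicted negative, biopsy negative
    fn = np.sum(~pred & y_true)     # predicted negative, biopsy positive

    return {
        "sensitivity": float(tp / (tp + fn)),
        "specificity": float(tn / (tn + fp)),
        "ppv": float(tp / (tp + fp)),
        "npv": float(tn / (tn + fn)),
    }

# Toy values (made up for the example): 1 = cancer found on biopsy.
biopsy = [1, 1, 1, 0, 0, 0, 1, 0]
model_prob = [0.9, 0.8, 0.4, 0.2, 0.6, 0.1, 0.7, 0.3]

metrics = diagnostic_metrics(biopsy, model_prob, cutoff=0.5)
```

The same function applies to a single-measurement rule by passing the raw PSA values as `prob` and the threshold as `cutoff`.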
Longitudinal data analysis has been widely developed over the past three decades. Longitudinal data are common in many fields such as public health, medicine, and the biological and social sciences. Because individuals may be observed over a long period of time, missing values are common in longitudinal data. The presence of missing values leads to biased results and complicates the analysis. Missing values follow two patterns: intermittent and dropout. The missing data mechanisms are missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). The appropriate analysis relies heavily on the assumed mechanism and pattern. Parametric fractional imputation is developed to handle longitudinal data with an intermittent missing pattern. The maximum likelihood estimates are obtained, and the jackknife method is used to obtain the standard errors of the parameter estimates. Finally, a simulation study is conducted to validate the proposed approach, and the approach is applied to a real dataset.
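The jackknife step can be illustrated with the generic delete-one version below. This is a sketch under stated assumptions: the helper name `jackknife_se` and the simulated sample are hypothetical, and the estimator here is a plain sample mean rather than the fractional-imputation maximum likelihood estimator of the paper, which would require re-running the imputation for each replicate.

```python
import numpy as np

def jackknife_se(data, estimator):
    """Delete-one jackknife standard error of estimator(data)."""
    data = np.asarray(data)
    n = len(data)
    # Recompute the estimate n times, leaving out one observation each time.
    loo = np.array([estimator(np.delete(data, i, axis=0)) for i in range(n)])
    # Jackknife variance: (n - 1) / n times the sum of squared deviations
    # of the leave-one-out estimates from their mean.
    return np.sqrt((n - 1) / n * np.sum((loo - loo.mean(axis=0)) ** 2, axis=0))

# Simulated sample (assumed for the example).
rng = np.random.default_rng(42)
x = rng.normal(loc=10.0, scale=2.0, size=50)

se_jack = jackknife_se(x, np.mean)
se_classic = x.std(ddof=1) / np.sqrt(len(x))
```

For the sample mean the jackknife standard error coincides with the usual s/sqrt(n), which makes it a convenient check before applying the same routine to estimators without a closed-form variance.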
In this paper, subsampling the data is used as a way of learning about the influence of individual data points on inference for the parameters of a fitted logistic regression model. The alternative, alternative regularized, alternative regularized lasso, and alternative regularized ridge estimators are proposed for the parameter estimation of logistic regression models and are then compared with the maximum likelihood estimators. The proposed alternative regularized estimators are obtained by using a tuning parameter, whereas the proposed alternative estimators are not regularized. The proposed alternative regularized lasso estimators are the standard lasso estimators averaged over subsets of groups, and the alternative regularized ridge estimators are likewise the standard ridge estimators averaged over subsets of groups, where the number of subsets could be smaller than the number of parameters. The values of the tuning parameters are chosen to make the alternative regularized estimators very close to the maximum likelihood estimators, and the process is explained with two real datasets as well as a simulation study. The alternative and alternative regularized estimators always have closed-form expressions in terms of the observations, which the maximum likelihood estimators do not have. When the maximum likelihood estimators do not have closed-form expressions, the alternative regularized estimators thus obtained provide approximate closed-form expressions for them.