Pub. online: 10 Jul 2024 | Type: Statistical Data Science | Open Access
Journal: Journal of Data Science
Volume 22, Issue 3 (2024): Special issue: The Government Advances in Statistical Programming (GASP) 2023 conference, pp. 456–468
Abstract
Missing data is a common occurrence in various fields, spanning social science, education, economics, and biomedical research. Disregarding missing data in statistical analyses can introduce bias to study outcomes. To mitigate this issue, imputation methods have proven effective in reducing nonresponse bias and generating complete datasets for subsequent analysis of secondary data. The efficacy of imputation methods hinges on the assumptions of the underlying imputation model. While machine learning techniques such as regression trees, random forest, XGBoost, and deep learning have demonstrated robustness against model misspecification, their optimal performance may necessitate fine-tuning under specific conditions. Moreover, imputed values generated by these methods can sometimes deviate unnaturally, falling outside the normal range. To address these challenges, we propose a novel Predictive Mean Matching imputation (PMM) procedure that leverages popular machine learning-based methods. PMM strikes a balance between robustness and the generation of appropriate imputed values. In this paper, we present our innovative PMM approach and conduct a comparative performance analysis through Monte Carlo simulation studies, assessing its effectiveness against other established methods.
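As a rough illustration of the predictive mean matching idea described above (not the authors' exact procedure), the following sketch fits an imputation model to the observed cases, predicts for both donors and recipients, and imputes each missing value with the observed response of a near predicted-mean match; the random-forest learner, the toy data, and the donor-pool size k = 5 are assumptions of the example.

```python
# Illustrative predictive mean matching (PMM) with a machine-learning learner.
# This is a minimal sketch of the general idea, not the paper's exact procedure.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Toy data: y depends on two covariates; ~30% of y is missing at random.
n = 500
X = rng.normal(size=(n, 2))
y = X[:, 0] - 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.5, size=n)
missing = rng.random(n) < 0.3

# Fit the imputation model on the observed cases only.
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X[~missing], y[~missing])

# Predicted means for donors (observed) and recipients (missing).
pred_obs = model.predict(X[~missing])
pred_mis = model.predict(X[missing])

# For each missing case, pick one of the k closest observed predictions
# and donate its *observed* y value, keeping imputations in the data range.
k = 5
donors_y = y[~missing]
imputed = np.empty(pred_mis.shape[0])
for i, p in enumerate(pred_mis):
    nearest = np.argsort(np.abs(pred_obs - p))[:k]
    imputed[i] = donors_y[rng.choice(nearest)]

y_completed = y.copy()
y_completed[missing] = imputed
```

Because the donated values are always observed responses, the imputations stay within the natural range of the data, which is the property the abstract emphasizes.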
Changepoint analysis has had a striking variety of applications, and a rich methodology has been developed. Our contribution here is a new approach that uses nonlinear regression analysis as an intermediate computational device. The tool is quite versatile, covering a number of different changepoint scenarios. It is largely free of parametric model assumptions, and has the major advantage of providing standard errors for formal statistical inference. Both abrupt and gradual changes are covered.
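For intuition about using nonlinear regression as the computational device, the sketch below fits a smooth logistic transition between two levels by nonlinear least squares; the covariance matrix of the fit then yields a standard error for the estimated changepoint. The logistic transition form and the simulated series are assumptions of this illustration, not the paper's general methodology.

```python
# Illustrative changepoint estimation via nonlinear regression.
# A logistic (smooth) transition between two levels is assumed for the sketch;
# curve_fit returns a covariance matrix, so the changepoint gets a standard error.
import numpy as np
from scipy.optimize import curve_fit

def smooth_step(t, level1, level2, tau, scale):
    """Mean that moves from level1 to level2 around changepoint tau."""
    return level1 + (level2 - level1) / (1.0 + np.exp(-(t - tau) / scale))

rng = np.random.default_rng(1)
t = np.linspace(0, 100, 400)
y = np.where(t < 60, 1.0, 3.0) + rng.normal(scale=0.4, size=t.size)  # abrupt change near t = 60

p0 = [y[:50].mean(), y[-50:].mean(), 50.0, 1.0]   # rough starting values
params, cov = curve_fit(smooth_step, t, y, p0=p0)
tau_hat, tau_se = params[2], np.sqrt(cov[2, 2])
print(f"estimated changepoint: {tau_hat:.1f} (SE {tau_se:.2f})")
```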
There is growing interest in accommodating network structure in panel data models. We consider dynamic network Poisson autoregressive (DN-PAR) models for panel count data, which allow the network structure to vary over time. We develop a Bayesian Markov chain Monte Carlo technique for estimating the DN-PAR model, and conduct Monte Carlo experiments to examine the properties of the posterior quantities and compare dynamic and constant network models. The Monte Carlo results indicate that the bias in the DN-PAR models is negligible, while the constant network model suffers from bias when the true network is dynamic. We also suggest an approach for extracting the time-varying network from the data. The empirical results for the count data for confirmed cases of COVID-19 in the United States indicate that the extracted dynamic network models outperform the constant network models in regard to the deviance information criterion and out-of-sample forecasting.
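To make the setting concrete, a generic network Poisson autoregression with a redrawn (time-varying) adjacency matrix can be simulated as below; the linear intensity form, the random network, and the parameter values are illustrative assumptions and do not reproduce the authors' DN-PAR specification or their Bayesian MCMC estimator.

```python
# Simulate a generic network Poisson autoregression with a time-varying network.
# Illustration only: lambda_{i,t} = omega + alpha * y_{i,t-1} + beta * (network-lagged mean);
# the actual DN-PAR specification and its Bayesian estimator are given in the paper.
import numpy as np

rng = np.random.default_rng(2)
N, T = 20, 100
omega, alpha, beta = 0.5, 0.3, 0.4

y = np.zeros((N, T))
for t in range(1, T):
    # Hypothetical dynamic network: edges are redrawn each period.
    A = (rng.random((N, N)) < 0.1).astype(float)
    np.fill_diagonal(A, 0.0)
    row_sums = np.maximum(A.sum(axis=1, keepdims=True), 1.0)
    W = A / row_sums                                   # row-normalized weights
    lam = omega + alpha * y[:, t - 1] + beta * (W @ y[:, t - 1])
    y[:, t] = rng.poisson(lam)
```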
Classification is an important statistical tool whose importance has grown since the emergence of the data science revolution. However, a training data set that does not capture all underlying population subgroups (or clusters) will result in biased estimates or misclassification. In this paper, we introduce a statistical and computational solution to a possible bias in classification when it is implemented on estimated population clusters. An unseen-cluster problem denotes the case in which the training data do not contain all underlying clusters in the population. Such a scenario may occur for various reasons, such as sampling errors, selection bias, or emerging and disappearing population clusters. Once an unseen-cluster problem occurs, a testing observation will be misclassified because a classification rule based on the sample cannot capture a cluster not observed in the training data. To overcome such issues, we suggest a two-stage classification method that ameliorates the unseen-cluster problem. We also suggest a test to identify the unseen-cluster problem and demonstrate the performance of the two-stage tailored classifier using simulations and a public data example.
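One possible (hypothetical) realization of a two-stage scheme is sketched below: a first stage flags test points that are unlikely under the clusters represented in the training data, and a second stage classifies the remaining points with an ordinary classifier. The Gaussian-mixture density threshold and the logistic classifier are stand-ins chosen for illustration, not the authors' proposed test or classifier.

```python
# Conceptual two-stage classification for potential unseen clusters.
# Stage 1 flags test points that are unlikely under clusters seen in training
# (here a Gaussian-mixture density threshold, chosen only for illustration);
# Stage 2 classifies the remaining points with a standard classifier.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
X_train = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(4, 1, (100, 2))])
y_train = np.repeat([0, 1], 100)
X_test = np.vstack([rng.normal(0, 1, (20, 2)),
                    rng.normal((-5, 5), 1, (20, 2))])   # second block: an unseen cluster

# Stage 1: density of the training feature distribution.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X_train)
threshold = np.quantile(gmm.score_samples(X_train), 0.01)   # assumed cutoff
seen = gmm.score_samples(X_test) >= threshold

# Stage 2: classify only the points that resemble the training clusters.
clf = LogisticRegression().fit(X_train, y_train)
pred = np.full(X_test.shape[0], -1)                          # -1 marks "possible unseen cluster"
pred[seen] = clf.predict(X_test[seen])
```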
A joint equivalence and difference (JED) test is needed because difference tests and equivalence (more exactly, similarity) tests each provide only a one-sided answer. The concept and underlying theory have appeared numerous times, noted and discussed here, but never in a form usable in workaday statistical applications. This work provides such a form as a straightforward simple test with a step-by-step guide, possible interpretations, and formulas. For initial treatment, it restricts attention to a t test of two means. The guide is illustrated by a numerical example from the field of orthopedics. To assess the quality of the JED test, its sensitivity and specificity are examined for test outcomes depending on error risk α, total sample size, sub-sample size ratio, and variability ratio. These results are shown in tables. Interpretations are discussed. It is concluded that the test exhibits high power and a large effect size, and that commonly seen values of these parameters affect the power or effect size of the JED test only for quite small samples. Data for the example and computer codes for using the JED test are accessible through links to supplementary material. We recommend that this work be extended to other test forms and multivariate forms.
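A minimal sketch of combining a difference test with an equivalence test for two means is given below, using a standard two-sample t test plus a TOST (two one-sided tests) procedure; the equivalence margin, the simulated data, and the simple combination rule are assumptions of the sketch rather than the paper's JED formulas.

```python
# Illustrative joint difference-and-equivalence assessment for two means:
# a two-sample t test for difference plus a TOST equivalence test.
# The margin delta and the combination rule are assumptions of this sketch.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
a = rng.normal(10.0, 2.0, 40)
b = rng.normal(10.4, 2.0, 45)
alpha, delta = 0.05, 1.0                     # error risk and equivalence margin

# Difference test (pooled-variance two-sample t test).
t_diff, p_diff = stats.ttest_ind(a, b)

# TOST: both one-sided tests must reject for equivalence within +/- delta.
df = a.size + b.size - 2
sp2 = ((a.size - 1) * a.var(ddof=1) + (b.size - 1) * b.var(ddof=1)) / df
se = np.sqrt(sp2 * (1 / a.size + 1 / b.size))
d = a.mean() - b.mean()
p_lower = 1 - stats.t.cdf((d + delta) / se, df)   # H0: d <= -delta
p_upper = stats.t.cdf((d - delta) / se, df)       # H0: d >= +delta
equivalent = max(p_lower, p_upper) < alpha
different = p_diff < alpha
print(f"different: {different}, equivalent: {equivalent}")
```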
The National Association of Stock Car Auto Racing (NASCAR) is ranked among the top ten most popular sports in the United States. NASCAR events are characterized by on-track racing punctuated by pit stops, since cars must refuel, replace tires, and modify their setup throughout a race. A well-executed pit stop can allow drivers to gain multiple seconds on their opponents. Strategies around when to pit and what to perform during a pit stop are under constant evaluation. One currently unexplored area is publicly available communication between each driver and their pit crew during the race. Due to the many hours of audio, manual analysis of even one driver’s communications is prohibitive. We propose a fully automated approach to analyze driver–pit crew communication. Our work was conducted in collaboration with NASCAR domain experts. Audio communication is converted to text and summarized using cluster-based Latent Dirichlet Allocation to provide an overview of a driver’s race performance. The transcript is then analyzed to extract important events related to pit stops and driving balance: understeer (pushing) or oversteer (over-rotating). Named entity recognition (NER) and relationship extraction provide context to each event. A combination of the race summary, events, and real-time race data provided by NASCAR is presented using Sankey visualizations. Statistical analysis and evaluation by our domain expert collaborators confirmed that we can accurately identify important race events and driver interactions, and that presenting them in this novel way yields useful and efficient summaries and event highlights for race preparation and in-race decision-making.
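As a simplified stand-in for part of this pipeline, the snippet below applies Latent Dirichlet Allocation to a handful of invented, already-transcribed radio messages to produce a crude topic summary; speech-to-text, clustering, NER, and relationship extraction are omitted, and the example texts are made up purely for illustration.

```python
# Simplified topic summary of (already transcribed) radio snippets with LDA.
# The snippets below are invented; the real pipeline also includes
# speech-to-text, clustering, NER, and relationship extraction.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

snippets = [
    "car is tight in turn three need air pressure adjustment",
    "pit this lap four tires and fuel watch the pit road speed",
    "loose off turn two losing the rear on exit",
    "fuel window opens in ten laps save fuel where you can",
    "tight center off need a track bar adjustment next stop",
    "green flag pit stop left sides only",
]

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(snippets)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

terms = vec.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = [terms[i] for i in topic.argsort()[-5:][::-1]]
    print(f"topic {k}: {', '.join(top)}")
```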
Physician performance is critical to caring for patients admitted to the intensive care unit (ICU), who are in life-threatening situations and require high-level medical care and interventions. Evaluating physicians is crucial for ensuring a high standard of medical care and fostering continuous performance improvement. The non-randomized nature of ICU data often results in imbalance in patient covariates across physician groups, making direct comparisons of the patients’ survival probabilities for each physician misleading. In this article, we utilize the propensity weighting method to address confounding, achieve covariate balance, and assess physician effects. Because of possible model misspecification, we compare the performance of propensity weighting methods using both parametric models and super learning methods. When the generalized propensity or the quality function is not correctly specified within the parametric propensity weighting framework, super learning-based propensity weighting methods yield more efficient estimators. We demonstrate that propensity weighting offers an effective way to assess physician performance, a topic of considerable interest to hospital administrators.
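A minimal inverse-probability-weighting sketch for comparing two physician groups is shown below; the logistic-regression propensity model and the simulated data are placeholders, whereas the paper works with generalized propensity scores, quality functions, and super learning.

```python
# Minimal inverse-probability-weighting sketch for comparing two physician groups.
# A plain logistic-regression propensity model on simulated data stands in for the
# generalized propensity and super learning methods discussed in the paper.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)
n = 2000
X = rng.normal(size=(n, 3))                              # patient covariates
p_assign = 1 / (1 + np.exp(-(0.8 * X[:, 0] - 0.5 * X[:, 1])))
physician = rng.binomial(1, p_assign)                    # non-randomized assignment
p_survive = 1 / (1 + np.exp(-(0.5 * X[:, 0] + 0.3 * physician)))
survived = rng.binomial(1, p_survive)

# Propensity of being treated by physician group 1, then stabilized IP weights.
ps = LogisticRegression().fit(X, physician).predict_proba(X)[:, 1]
w = np.where(physician == 1, physician.mean() / ps, (1 - physician.mean()) / (1 - ps))

for g in (0, 1):
    m = physician == g
    rate = np.average(survived[m], weights=w[m])
    print(f"weighted survival rate, physician group {g}: {rate:.3f}")
```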
Image registration techniques are used for mapping two images of the same scene or image objects to one another. Several image registration techniques are available in the literature for registering rigid-body as well as non-rigid-body transformations. A very important image transformation is zooming in or out, which is also called scaling. Very few research articles address this particular problem, apart from a number of feature-based approaches. This paper proposes a method to register two images of the same image object where one is a zoomed-in version of the other. In the proposed intensity-based method, we consider a circular neighborhood around an image pixel of the zoomed-in image and search for the pixel in the reference image whose circular neighborhood is most similar to that neighborhood with respect to various similarity measures. We perform this procedure for all pixels in the zoomed-in image. On images where the features are small in number, our proposed method works better than the state-of-the-art feature-based methods. We provide several numerical examples as well as a mathematical justification that support our claim that this method performs reasonably well in many situations.
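The following sketch conveys the neighborhood-matching idea for a single pixel and a single similarity measure (correlation); the toy images, the fixed radius, and the omission of any rescaling for the zoom factor are simplifications of this illustration, not features of the proposed method.

```python
# Simplified sketch of matching one pixel of a zoomed-in image to a reference
# image by comparing circular neighborhoods with a correlation similarity.
# Radius handling for the zoom factor, the full pixel-by-pixel sweep, and the
# paper's other similarity measures are omitted for brevity.
import numpy as np

def circular_patch(img, row, col, radius):
    """Pixel values inside a circular neighborhood, as a flat vector."""
    rr, cc = np.ogrid[-radius:radius + 1, -radius:radius + 1]
    mask = rr**2 + cc**2 <= radius**2
    return img[row - radius:row + radius + 1, col - radius:col + radius + 1][mask]

def best_match(reference, patch, radius):
    """Reference pixel whose circular neighborhood correlates best with `patch`."""
    best, best_score = None, -np.inf
    H, W = reference.shape
    for r in range(radius, H - radius):
        for c in range(radius, W - radius):
            cand = circular_patch(reference, r, c, radius)
            score = np.corrcoef(patch, cand)[0, 1]
            if score > best_score:
                best, best_score = (r, c), score
    return best, best_score

rng = np.random.default_rng(6)
reference = rng.random((40, 40))
zoomed = np.kron(reference[10:30, 10:30], np.ones((2, 2)))   # crude 2x zoom of a sub-region

radius = 4
patch = circular_patch(zoomed, 20, 20, radius)
# NOTE: a real implementation would rescale the neighborhood for the zoom factor.
print(best_match(reference, patch, radius))
```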
Boosting is a popular algorithm in supervised machine learning with wide applications in regression and classification problems. It combines weak learners, such as regression trees, to obtain accurate predictions. However, in the presence of outliers, traditional boosting may yield inferior results since the algorithm optimizes a convex loss function. Recent literature has proposed boosting algorithms that optimize robust nonconvex loss functions. Nevertheless, there is a lack of weighted estimation to indicate the outlier status of observations. This article introduces the iteratively reweighted boosting (IRBoost) algorithm, which combines robust loss optimization and weighted estimation. It can be conveniently constructed with existing software. The output includes weights as valuable diagnostics for the outlier status of observations. For practitioners interested in the boosting algorithm, the new method can be interpreted as a way to tune robust observation weights. IRBoost is implemented in the R package irboost and is demonstrated using publicly available data in generalized linear models, classification, and survival data analysis.
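A conceptual sketch of iteratively reweighted boosting is given below: the fit alternates between boosting with observation weights and recomputing weights that downweight large residuals. The Tukey-bisquare weight function and scikit-learn's gradient boosting are illustrative choices and do not reproduce the exact algorithm implemented in the irboost package.

```python
# Conceptual iteratively reweighted boosting: alternate between fitting a
# boosting model with observation weights and recomputing weights that
# downweight large residuals. The Tukey-bisquare weight is an illustrative
# choice, not the exact scheme used by the irboost package.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(7)
n = 300
X = rng.normal(size=(n, 3))
y = 2 * X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=n)
y[:15] += 12                                             # gross outliers

weights = np.ones(n)
for _ in range(5):                                       # a few reweighting iterations
    model = GradientBoostingRegressor(random_state=0)
    model.fit(X, y, sample_weight=weights)
    resid = y - model.predict(X)
    scale = 1.4826 * np.median(np.abs(resid - np.median(resid)))  # robust scale (MAD)
    u = np.clip(resid / (4.685 * scale), -1, 1)
    weights = (1 - u**2) ** 2                            # Tukey bisquare weights

print("average weight of the outliers:", weights[:15].mean().round(3))
print("average weight of clean points:", weights[15:].mean().round(3))
```

The final weights serve as diagnostics: observations driven toward zero weight are flagged as likely outliers, which mirrors the diagnostic role of weights described in the abstract.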
Pub. online: 4 Jun 2024 | Type: Statistical Data Science | Open Access
Journal: Journal of Data Science
Volume 22, Issue 2 (2024): Special Issue: 2023 Symposium on Data Science and Statistics (SDSS): “Inquire, Investigate, Implement, Innovate”, pp. 239–258
Abstract
The programming overhead required to implement machine learning workflows creates a barrier for many discipline-specific researchers with limited programming experience. The stressor package provides an R interface to Python’s PyCaret package, which automatically tunes and trains 14-18 machine learning (ML) models for use in accuracy comparisons. In addition to providing an R interface to PyCaret, stressor also contains functions that facilitate synthetic data generation and variants of cross-validation that allow for easy benchmarking of the ability of machine-learning models to extrapolate or compete with simpler models on simpler data forms. We show the utility of stressor on two agricultural datasets, one using classification models to predict crop suitability and another using regression models to predict crop yields. Full ML benchmarking workflows can be completed in only a few lines of code with relatively small computational cost. The results, and more importantly the workflow, provide a template for how applied researchers can quickly generate accuracy comparisons of many machine learning models with very little programming.
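For readers unfamiliar with the underlying engine, the sketch below shows roughly what the wrapped PyCaret comparison looks like when driven directly from Python; the synthetic data frame and column names are placeholders, and stressor exposes an analogous comparison from R in a few lines.

```python
# Rough illustration of the PyCaret workflow that stressor wraps from R,
# shown here directly in Python. The synthetic data frame is a placeholder.
import numpy as np
import pandas as pd
from pycaret.regression import setup, compare_models, pull

rng = np.random.default_rng(8)
df = pd.DataFrame(rng.normal(size=(200, 4)), columns=["x1", "x2", "x3", "x4"])
df["yield"] = 3 * df["x1"] - df["x2"] ** 2 + rng.normal(scale=0.3, size=200)

setup(data=df, target="yield", session_id=123)
best = compare_models()          # trains and cross-validates the candidate ML models, then ranks them
print(pull().head())             # accuracy comparison table
```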