Changepoint analysis has had a striking variety of applications, and a rich methodology has been developed. Our contribution here is a new approach that uses nonlinear regression analysis as an intermediate computational device. The tool is quite versatile, covering a number of different changepoint scenarios. It is largely free of parametric model assumptions, and has the major advantage of providing standard errors for formal statistical inference. Both abrupt and gradual changes are covered.
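A minimal sketch of the general idea, not the paper's procedure: the changepoint is treated as a parameter of a smooth nonlinear regression model, and its standard error is read off from the nonlinear least-squares fit. The sigmoid transition model, parameter names, and simulated data below are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import curve_fit

def sigmoid_change(t, base, jump, tau, gamma):
    """Mean level shifting from `base` to `base + jump` around time `tau`;
    `gamma` controls how gradual the change is (small gamma ~ abrupt change)."""
    return base + jump / (1.0 + np.exp(-(t - tau) / gamma))

rng = np.random.default_rng(0)
t = np.linspace(0, 100, 200)
y = sigmoid_change(t, base=1.0, jump=2.0, tau=60.0, gamma=1.5) + rng.normal(0, 0.3, t.size)

# Fit by nonlinear least squares; p0 is a rough initial guess.
popt, pcov = curve_fit(sigmoid_change, t, y, p0=[0.0, 1.0, 50.0, 5.0])
tau_hat, tau_se = popt[2], np.sqrt(pcov[2, 2])
print(f"estimated changepoint: {tau_hat:.1f} (SE {tau_se:.2f})")
```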
There is growing interest in accommodating network structure in panel data models. We consider dynamic network Poisson autoregressive (DN-PAR) models for panel count data, which allow the network structure to vary over time. We develop a Bayesian Markov chain Monte Carlo technique for estimating the DN-PAR model and conduct Monte Carlo experiments to examine the properties of the posterior quantities and to compare dynamic and constant network models. The Monte Carlo results indicate that the bias in the DN-PAR models is negligible, while the constant network model suffers from bias when the true network is dynamic. We also suggest an approach for extracting the time-varying network from the data. The empirical results for count data on confirmed cases of COVID-19 in the United States indicate that the extracted dynamic network models outperform the constant network models in terms of the deviance information criterion and out-of-sample forecasting.
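As an illustration only, not the authors' exact specification or their Bayesian MCMC estimator: one common form of a network Poisson autoregression lets each unit's conditional mean depend on its own lagged count and a network-weighted lag, with the adjacency matrix changing over time.

```python
import numpy as np

rng = np.random.default_rng(1)
N, T = 10, 50                        # units (e.g., states) and time periods
beta0, beta1, beta2 = 0.5, 0.3, 0.2  # intercept, own-lag, and network-lag effects

y = np.zeros((N, T))
y[:, 0] = rng.poisson(2.0, N)
for t in range(1, T):
    # Time-varying adjacency: a new random, row-normalized network each period.
    A = rng.random((N, N)) < 0.2
    np.fill_diagonal(A, False)
    W = A / np.maximum(A.sum(axis=1, keepdims=True), 1)
    # Conditional mean combines a unit's own lag with its neighbors' lags.
    lam = beta0 + beta1 * y[:, t - 1] + beta2 * (W @ y[:, t - 1])
    y[:, t] = rng.poisson(lam)
```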
Classification is an important statistical tool whose importance has only grown with the data science revolution. However, a training data set that does not capture all underlying population subgroups (or clusters) will result in biased estimates or misclassification. In this paper, we introduce a statistical and computational solution to the bias that can arise when classification is carried out on estimated population clusters. An unseen-cluster problem denotes the case in which the training data do not contain all of the clusters underlying the population. Such a scenario may occur for various reasons, such as sampling error, selection bias, or emerging and disappearing population clusters. When an unseen-cluster problem occurs, a test observation will be misclassified, because a classification rule built from the sample cannot capture a cluster that was never observed in the training data. To overcome this issue, we propose a two-stage classification method that ameliorates the unseen-cluster problem. We also suggest a test to identify the unseen-cluster problem and demonstrate the performance of the two-stage tailored classifier using simulations and a public data example.
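A minimal sketch of the two-stage idea under assumed components (the paper's test and tailored classifier differ): stage one flags test observations that do not resemble any training cluster, and stage two applies an ordinary classifier only to the remaining observations.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
# Training data: two clusters with labels 0 and 1.
X_train = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(4, 1, (100, 2))])
y_train = np.repeat([0, 1], 100)
# Test data includes a third, unseen cluster far from both training clusters.
X_test = np.vstack([rng.normal(0, 1, (20, 2)),
                    rng.normal(4, 1, (20, 2)),
                    rng.normal((-6, 8), 1, (20, 2))])

# Stage 1: flag observations that do not resemble the training data.
detector = IsolationForest(random_state=0).fit(X_train)
seen = detector.predict(X_test) == 1          # +1 = looks like training data

# Stage 2: classify only the "seen" observations; flagged observations are
# deferred rather than forced into a known class.
clf = LogisticRegression().fit(X_train, y_train)
labels = np.full(len(X_test), -1)             # -1 marks a possible unseen cluster
labels[seen] = clf.predict(X_test[seen])
```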
The National Association of Stock Car Auto Racing (NASCAR) is ranked among the top ten most popular sports in the United States. NASCAR events are characterized by on-track racing punctuated by pit stops, since cars must refuel, replace tires, and modify their setup throughout a race. A well-executed pit stop can allow drivers to gain multiple seconds on their opponents. Strategies around when to pit and what to perform during a pit stop are under constant evaluation. One currently unexplored area is publicly available communication between each driver and their pit crew during the race. Due to the many hours of audio, manual analysis of even one driver’s communications is prohibitive. We propose a fully automated approach to analyze driver–pit crew communication. Our work was conducted in collaboration with NASCAR domain experts. Audio communication is converted to text and summarized using cluster-based Latent Dirichlet Allocation to provide an overview of a driver’s race performance. The transcript is then analyzed to extract important events related to pit stops and driving balance: understeer (pushing) or oversteer (over-rotating). Named entity recognition (NER) and relationship extraction provide context to each event. A combination of the race summary, events, and real-time race data provided by NASCAR is presented using Sankey visualizations. Statistical analysis and evaluation by our domain-expert collaborators confirmed that we can accurately identify important race events and driver interactions and present them in a novel way, providing useful and efficient summaries and event highlights for race preparation and in-race decision-making.
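For illustration only, a stripped-down version of the topic-summarization step, with hypothetical transcript snippets and scikit-learn in place of the authors' cluster-based pipeline.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical snippets standing in for driver-crew radio transcripts.
docs = [
    "car is tight in the center of turns three and four",
    "pit this time for four tires and fuel",
    "free into one losing the rear on exit",
    "box box box adjust right rear and add a round of wedge",
]
vec = CountVectorizer(stop_words="english").fit(docs)
dtm = vec.transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(dtm)

# Top words per topic give a crude summary of recurring themes
# (handling complaints vs. pit-stop calls).
terms = vec.get_feature_names_out()
for k, comp in enumerate(lda.components_):
    top = [terms[i] for i in comp.argsort()[-5:][::-1]]
    print(f"topic {k}: {top}")
```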
A joint equivalence and difference (JED) test is needed because difference tests and equivalence (more exactly, similarity) tests each provide only a one-sided answer. The concept and underlying theory have appeared numerous times, noted and discussed here, but never in a form usable in workaday statistical applications. This work provides such a form as a straightforward test with a step-by-step guide, possible interpretations, and formulas. As an initial treatment, it restricts attention to a t test of two means. The guide is illustrated by a numerical example from the field of orthopedics. To assess the quality of the JED test, its sensitivity and specificity are examined for test outcomes depending on the error risk α, total sample size, sub-sample size ratio, and variability ratio. These results are shown in tables, and their interpretations are discussed. We conclude that the test exhibits high power and effect size, and that only quite small samples show any effect of commonly seen values of these parameters on the power or effect size of the JED test. Data for the example and computer code for using the JED test are accessible through links to supplementary material. We recommend that this work be extended to other test forms and to multivariate forms.
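To make the two ingredients concrete, here is a hedged sketch that pairs a standard two-sample difference t test with a two-one-sided-tests (TOST) equivalence test; the function name, the equivalence margin delta, and the joint reporting are illustrative assumptions rather than the JED procedure itself.

```python
import numpy as np
from scipy import stats

def jed_style_tests(x, y, delta, alpha=0.05):
    """Run a two-sided difference t test and a TOST-style equivalence test
    with equivalence margin +/- delta, then report both conclusions."""
    # Difference test (two-sided, pooled-variance t test).
    t_diff, p_diff = stats.ttest_ind(x, y)

    # Equivalence via two one-sided tests (TOST) on the mean difference.
    nx, ny = len(x), len(y)
    diff = np.mean(x) - np.mean(y)
    sp2 = ((nx - 1) * np.var(x, ddof=1) + (ny - 1) * np.var(y, ddof=1)) / (nx + ny - 2)
    se = np.sqrt(sp2 * (1 / nx + 1 / ny))
    df = nx + ny - 2
    p_lower = 1 - stats.t.cdf((diff + delta) / se, df)   # H0: diff <= -delta
    p_upper = stats.t.cdf((diff - delta) / se, df)       # H0: diff >= +delta
    p_equiv = max(p_lower, p_upper)

    return {"different": p_diff < alpha, "equivalent": p_equiv < alpha,
            "p_difference": p_diff, "p_equivalence": p_equiv}

rng = np.random.default_rng(3)
print(jed_style_tests(rng.normal(10, 2, 40), rng.normal(10.2, 2, 40), delta=1.0))
```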
Physician performance is critical to caring for patients admitted to the intensive care unit (ICU), who are in life-threatening situations and require high-level medical care and interventions. Evaluating physicians is crucial for ensuring a high standard of medical care and fostering continuous performance improvement. The non-randomized nature of ICU data often results in imbalance in patient covariates across physician groups, making direct comparisons of patients’ survival probabilities across physicians misleading. In this article, we utilize propensity weighting to address confounding, achieve covariate balance, and assess physician effects. Because of possible model misspecification, we compare the performance of propensity weighting based on parametric models with that based on super learning. When the generalized propensity score or the quality function is not correctly specified within the parametric propensity weighting framework, the super learning-based propensity weighting methods yield more efficient estimators. We demonstrate that propensity weighting offers an effective way to assess physician performance, a topic of considerable interest to hospital administrators.
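A simplified sketch of the parametric variant with simulated data (a super learning version would replace the multinomial logistic model with a stacked ensemble); the covariates, assignment mechanism, outcome model, and physician groups below are hypothetical.

```python
import numpy as np
from scipy.special import expit
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
n = 600
X = rng.normal(size=(n, 3))                  # patient covariates (severity, age, ...)
severity = X[:, 0]

# Non-randomized assignment: sicker patients are more likely to see physician 2.
p_assign = np.column_stack([expit(-severity), np.full(n, 0.5), expit(severity)])
p_assign /= p_assign.sum(axis=1, keepdims=True)
physician = np.array([rng.choice(3, p=p) for p in p_assign])

# Survival outcome depends on severity and (weakly) on the physician.
outcome = rng.binomial(1, expit(1.0 - 0.8 * severity + 0.1 * physician))

# Generalized propensity scores from a multinomial logistic model,
# then inverse-probability weights for each patient's actual physician.
gps = LogisticRegression(max_iter=1000).fit(X, physician).predict_proba(X)
w = 1.0 / gps[np.arange(n), physician]

# Weighted means approximate each physician's survival rate on a common case mix.
for k in range(3):
    mask = physician == k
    print(f"physician {k}: weighted survival estimate "
          f"{np.average(outcome[mask], weights=w[mask]):.3f}")
```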
Boosting is a popular algorithm in supervised machine learning with wide applications in regression and classification problems. It combines weak learners, such as regression trees, to obtain accurate predictions. However, in the presence of outliers, traditional boosting may yield inferior results because the algorithm optimizes a convex loss function. Recent literature has proposed boosting algorithms that optimize robust nonconvex loss functions. Nevertheless, these methods lack a weighted estimation scheme that indicates the outlier status of each observation. This article introduces the iteratively reweighted boosting (IRBoost) algorithm, which combines robust loss optimization with weighted estimation. It can be conveniently constructed from existing software, and its output includes observation weights that serve as valuable diagnostics of outlier status. For practitioners interested in boosting, the new method can be interpreted as a way to tune robust observation weights. IRBoost is implemented in the R package irboost and is demonstrated on publicly available data in generalized linear models, classification, and survival data analysis.
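The following sketch conveys the iteratively reweighted idea using generic scikit-learn boosting and Huber-type weights; it is not the irboost package's interface or its exact algorithm.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def reweighted_boosting(X, y, n_outer=5, c=1.345):
    """Alternate between boosting with observation weights and recomputing
    the weights from a Huber-type function of the scaled residuals."""
    w = np.ones(len(y))
    for _ in range(n_outer):
        model = GradientBoostingRegressor(random_state=0).fit(X, y, sample_weight=w)
        r = y - model.predict(X)
        scale = np.median(np.abs(r)) / 0.6745 + 1e-12    # robust scale (MAD)
        u = np.abs(r / scale)
        w = np.where(u <= c, 1.0, c / u)                 # downweight large residuals
    return model, w

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 0.5, 200)
y[:10] += 15                                             # contaminate with outliers
model, weights = reweighted_boosting(X, y)
# The final weights flag the contaminated observations as likely outliers.
print("weights of contaminated points:", np.round(weights[:10], 2))
```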
Image registration techniques are used to map two images of the same scene or the same objects onto one another. Several image registration techniques are available in the literature for both rigid-body and non-rigid-body transformations. A very important image transformation is zooming in or out, also called scaling. Few research articles address this particular problem beyond a number of feature-based approaches. This paper proposes a method to register two images of the same object where one is a zoomed-in version of the other. In the proposed intensity-based method, we consider a circular neighborhood around a pixel of the zoomed-in image and search for the pixel in the reference image whose circular neighborhood is most similar to it with respect to various similarity measures. We repeat this procedure for all pixels in the zoomed-in image. On images with few features, our proposed method works better than state-of-the-art feature-based methods. We provide several numerical examples as well as a mathematical justification supporting our claim that the method performs reasonably well in many situations.
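A toy version of the neighborhood search, using plain correlation as the similarity measure, a single query pixel, and a synthetic "zoomed" image; the paper considers several similarity measures and repeats the search for every pixel of the zoomed-in image.

```python
import numpy as np

def circular_mask(radius):
    """Boolean mask selecting pixels within `radius` of the patch center."""
    yy, xx = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    return (xx**2 + yy**2) <= radius**2

def best_match(zoom_img, ref_img, pixel, radius=5):
    """Find the reference-image pixel whose circular neighborhood is most
    similar (highest correlation) to the neighborhood of `pixel` in the
    zoomed-in image."""
    mask = circular_mask(radius)
    r, c = pixel
    patch = zoom_img[r - radius:r + radius + 1, c - radius:c + radius + 1][mask]
    best, best_score = None, -np.inf
    H, W = ref_img.shape
    for i in range(radius, H - radius):
        for j in range(radius, W - radius):
            cand = ref_img[i - radius:i + radius + 1, j - radius:j + radius + 1][mask]
            score = np.corrcoef(patch, cand)[0, 1]
            if score > best_score:
                best, best_score = (i, j), score
    return best, best_score

rng = np.random.default_rng(6)
ref = rng.random((40, 40))
zoom = np.kron(ref, np.ones((2, 2)))      # crude stand-in for a 2x zoomed-in version
print(best_match(zoom, ref, pixel=(20, 20)))
```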
There has been remarkable progress in the field of deep learning, particularly in areas such as image classification, object detection, speech recognition, and natural language processing. Convolutional Neural Networks (CNNs) have emerged as a dominant model of computation in this domain, delivering exceptional accuracy in image recognition tasks. Inspired by their success, researchers have explored the application of CNNs to tabular data. However, CNNs trained on structured tabular data often yield subpar results, demonstrating a gap between the performance of deep learning models and shallow models on tabular data. To address this, Tabular-to-Image (T2I) algorithms have been introduced to convert tabular data into an unstructured image format. T2I algorithms encode spatial information into the image, which CNN models can effectively exploit for classification. In this work, we propose two novel T2I algorithms, Binary Image Encoding (BIE) and correlated Binary Image Encoding (cBIE), which preserve complex relationships in the generated image by leveraging the native binary representation of the data. Additionally, cBIE captures more spatial information by reordering columns based on their correlation with a feature. To evaluate the performance of our algorithms, we conducted experiments on four benchmark datasets using ResNet-50 as the deep learning model. Our results show that ResNet-50 models trained on images generated with BIE and cBIE consistently outperformed or matched models trained on images created with the previous state-of-the-art method, Image Generator for Tabular Data (IGTD).
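A rough sketch of what such an encoding might look like; the exact BIE and cBIE constructions are not reproduced here, and the 8-bit quantization and correlation-based row reordering below are assumptions made for illustration.

```python
import numpy as np

def binary_image_encode(row, mins, maxs):
    """Encode one tabular row as an (n_features x 8) binary image by min-max
    scaling each feature to an 8-bit integer and unpacking its binary digits."""
    scaled = (row - mins) / np.where(maxs > mins, maxs - mins, 1)
    ints = np.round(scaled * 255).astype(np.uint8)
    return np.unpackbits(ints[:, None], axis=1)

rng = np.random.default_rng(7)
X = rng.normal(size=(100, 12))
y = (X[:, 0] + rng.normal(0, 0.5, 100) > 0).astype(int)

mins, maxs = X.min(axis=0), X.max(axis=0)
img = binary_image_encode(X[0], mins, maxs)               # one 12 x 8 binary image

# cBIE-style idea: order feature rows by |correlation| with the label so that
# related features end up spatially adjacent in the generated image.
corr = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
order = np.argsort(-corr)
img_reordered = binary_image_encode(X[0, order], mins[order], maxs[order])
```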
Brain imaging research poses challenges due to the intricate structure of the brain and the absence of clearly discernible features in the images. In this study, we propose a technique for analyzing brain image data, specifically Diffusion Tensor Imaging data, that identifies regions relevant to patients’ conditions. Our method places a Bayesian Dirichlet process prior within generalized linear models, which enhances clustering performance while retaining the flexibility to accommodate varying numbers of clusters. The approach improves the identification of potential classes by using locational information, treating proximity between locations as a clustering constraint. We apply the technique to a dataset from the Transforming Research and Clinical Knowledge in Traumatic Brain Injury study, aiming to identify important regions in the brain’s gray matter, white matter, and overall brain tissue that differentiate between young and old age groups. Additionally, we explore links between our findings and existing results in brain network research.
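As a loose analogue only: a truncated Dirichlet-process mixture (scikit-learn's variational implementation) clustering hypothetical voxel features, with coordinates included as a crude stand-in for the proximity constraints; the paper's GLM-based formulation is not reproduced here.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(8)
# Hypothetical voxel-level data: (x, y, z) coordinates plus a diffusion measure
# such as fractional anisotropy; real study features would replace these.
coords = rng.uniform(0, 50, size=(500, 3))
fa = 0.4 + 0.2 * (coords[:, 0] > 25) + rng.normal(0, 0.05, 500)
features = np.column_stack([coords, fa])

# A truncated Dirichlet-process mixture lets the data decide how many clusters
# are effectively used; adding coordinates as features crudely encourages
# spatially coherent clusters.
dpm = BayesianGaussianMixture(
    n_components=10, weight_concentration_prior_type="dirichlet_process",
    max_iter=200, random_state=0).fit(features)
labels = dpm.predict(features)
print("clusters actually used:", np.unique(labels).size)
```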