Recently, the log cumulative probability model (LCPM) and its special case, the proportional probability model (PPM), were developed to relate ordinal outcomes to predictor variables using the log link instead of the logit link. These models permit the estimation of probabilities instead of odds, but the log link requires constrained maximum likelihood estimation (cMLE). An algorithm that handles cMLE for the LCPM efficiently is a valuable resource, as these models are applicable in many settings and their output is easy to interpret. One such implementation is in the R package lcpm. In this era of big data, all statistical models face new processing demands. This work aimed to improve the algorithm in the R package lcpm so that it processes more input in less time using less memory.
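To make concrete why the log link forces constrained estimation, the following schematic states the model in generic notation; the symbols are chosen here for illustration and are not taken from the lcpm documentation.

```latex
% Schematic of the log cumulative probability model (generic notation).
\[
  \log \Pr(Y \le j \mid x) \;=\; \alpha_j + x^{\top}\beta_j ,
  \qquad j = 1, \dots, J-1,
\]
% with the PPM as the special case of a common slope, \beta_j \equiv \beta.
% Unlike the logit link, the log link does not automatically keep the
% cumulative probabilities in (0,1] and ordered in j, so the likelihood must
% be maximized subject to
\[
  \alpha_1 + x^{\top}\beta_1 \;\le\; \cdots \;\le\; \alpha_{J-1} + x^{\top}\beta_{J-1} \;\le\; 0
\]
% for all observed covariate values x, which is why cMLE is required.
```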
Studying migration patterns driven by extreme environmental events is crucial for building a sustainable society and a stable economy. Motivated by a real dataset on human migration, this paper develops a transformed varying coefficient model for origin-destination (OD) regression to elucidate the complex associations of migration patterns with spatio-temporal dependencies and socioeconomic factors. Existing studies often overlook the dynamic effects of these factors in OD regression. Furthermore, as OD data become easier to collect, OD regression datasets are typically large, necessitating methods for efficiently fitting large-scale migration data. We address this challenge by developing a new Bayesian interpretation of the proposed OD models that leverages sufficient statistics for efficient big-data computation. Although inspired by migration studies, our method promises broad applicability across fields, contributing to refined statistical analysis techniques. Extensive numerical studies are provided, and insights from a real data analysis are shared.
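A schematic of the kind of specification involved may help fix ideas; the form below is a generic transformed varying coefficient OD regression written in assumed notation, not the paper's exact model.

```latex
% Generic transformed varying coefficient OD regression (assumed notation).
\[
  g\!\left(Y_{od,t}\right) \;=\; \sum_{k=1}^{p} \beta_k(t)\, x_{od,t,k} \;+\; \varepsilon_{od,t},
\]
% where Y_{od,t} is the migration flow from origin o to destination d at time t,
% g(.) is a known or estimated transformation, the coefficients \beta_k(.) vary
% smoothly over time (or another index) to capture dynamic covariate effects,
% and \varepsilon_{od,t} absorbs residual spatio-temporal dependence.
```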
Support vector data description (SVDD) has drawn significant attention due to its exceptional performance in one-class classification and novelty detection tasks. Nevertheless, in the standard formulation all slack variables are assigned the same weight during model fitting, which can degrade learning performance when the training data contain erroneous observations or outliers. In this study, an extended SVDD model, the Rescale Hinge Loss Support Vector Data Description (RSVDD), is introduced to strengthen the resistance of SVDD to anomalies. This is achieved by redefining the original SVDD optimization problem with a rescaled hinge loss function. Because this loss increases the influence of samples that are more likely to represent the target class while decreasing the influence of samples that are more likely to represent anomalies, RSVDD can be viewed as a variant of weighted SVDD. To solve the resulting optimization problem efficiently, the half-quadratic optimization method is used to derive an iterative optimization algorithm. Experimental findings on synthetic and breast cancer data sets illustrate the proposed method’s superior performance over existing methods in the settings considered.
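As a rough illustration of the half-quadratic idea, the sketch below alternates between fitting a weighted boundary and exponentially down-weighting points with large slack. It is not the authors' exact algorithm: it relies on the RBF-kernel equivalence between SVDD and the one-class SVM, and the rescaling parameter `eta`, the weight update, and the stopping rule are assumptions for illustration.

```python
# Sketch of a half-quadratic reweighting scheme for a rescaled-hinge-loss SVDD.
# Illustrative approximation only: uses the RBF-kernel equivalence between SVDD
# and the one-class SVM; `eta` and the stopping rule are assumed.
import numpy as np
from sklearn.svm import OneClassSVM

def rescaled_svdd(X, nu=0.1, gamma="scale", eta=1.0, n_iter=10, tol=1e-4):
    n = X.shape[0]
    weights = np.ones(n)                      # start with equal sample weights
    model = None
    for _ in range(n_iter):
        model = OneClassSVM(kernel="rbf", nu=nu, gamma=gamma)
        model.fit(X, sample_weight=weights)
        # Slack-like quantity: how far each point falls outside the boundary.
        xi = np.maximum(0.0, -model.decision_function(X))
        # Half-quadratic-style update: points with large slack (likely anomalies)
        # are exponentially down-weighted; inliers keep weight close to 1.
        new_weights = np.exp(-eta * xi)
        if np.max(np.abs(new_weights - weights)) < tol:
            weights = new_weights
            break
        weights = new_weights
    return model, weights

# Example: fit on mostly Gaussian data with a few gross outliers.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(200, 2)), rng.normal(6.0, 0.5, size=(10, 2))])
model, w = rescaled_svdd(X)
print("mean weight of injected outliers:", w[-10:].mean())
```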
The rapid accumulation and release of data have fueled research across various fields. While numerous methods exist for data collection and storage, data distribution presents challenges, as some datasets are restricted, and certain subsets may compromise privacy if released unaltered. Statistical disclosure control (SDC) aims to maximize data utility while minimizing the disclosure risk, i.e., the risk of individual identification. A key SDC method is data perturbation, with General Additive Data Perturbation (GADP) and Copula General Additive Data Perturbation (CGADP) being two prominent approaches. Both leverage multivariate normal distributions to generate synthetic data while preserving statistical properties of the original dataset. Given the increasing use of machine learning for data modeling, this study compares the performance of various machine learning models on GADP- and CGADP-perturbed data. Using Monte Carlo simulations with three data-generating models and a real dataset, we evaluate the predictive performance and robustness of ten machine learning techniques under data perturbation. Our findings provide insights into the machine learning techniques that perform robustly on GADP- and CGADP-perturbed datasets, extending previous research that primarily focused on simple statistics such as means, variances, and correlations.
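The sketch below illustrates the general idea being compared: perturb the predictors with correlated noise in a way that approximately preserves the first two moments, then check how a machine learning model's predictive accuracy changes. The shrink-plus-noise construction and the mixing parameter `d` are simplifications assumed for illustration, not the full GADP or CGADP procedures studied in the paper.

```python
# Simplified moment-preserving additive perturbation in the spirit of GADP,
# followed by a before/after comparison of one ML model. Not the full
# GADP/CGADP procedure; the construction below is an assumed illustration.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)

# Synthetic data-generating model: y depends nonlinearly on three predictors.
n, p = 2000, 3
X = rng.normal(size=(n, p))
y = X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.5, size=n)

def perturb(X, d=0.3, rng=rng):
    """Shrink toward the mean and add correlated noise so the perturbed data
    keep (approximately) the original mean and covariance matrix."""
    mu = X.mean(axis=0)
    sigma = np.cov(X, rowvar=False)
    noise = rng.multivariate_normal(np.zeros(X.shape[1]), d * sigma, size=X.shape[0])
    return mu + np.sqrt(1.0 - d) * (X - mu) + noise

X_pert = perturb(X)

for name, features in [("original", X), ("perturbed", X_pert)]:
    X_tr, X_te, y_tr, y_te = train_test_split(features, y, random_state=0)
    model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)
    print(name, "test MSE:", round(mean_squared_error(y_te, model.predict(X_te)), 3))
```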
Analysis of nonprobability survey samples has gained much attention in recent years due to their wide availability and the declining response rates of their costly probability-based counterparts. Still, valid population inference cannot be drawn from nonprobability samples without additional information, which typically takes the form of a smaller probability survey sample with a shared set of covariates. In this paper, we propose the matched mass imputation (MMI) approach for integrating data from probability and nonprobability samples when common covariates are present in both samples but the variable of interest is available only in the nonprobability sample. The proposed approach borrows strength from the ideas of statistical matching and mass imputation to provide robustness against potential nonignorable bias in the nonprobability sample. Specifically, MMI is a two-step approach: first, a novel application of statistical matching identifies a subset of the nonprobability sample that closely resembles the probability sample; second, mass imputation is performed using these matched units. Our empirical results, from simulations and a real data application, demonstrate the effectiveness of the MMI estimator under nearest-neighbor matching, which almost always outperformed other imputation estimators in the presence of nonignorable bias. We also explore the effectiveness of a bootstrap variance estimation procedure for the proposed MMI estimator.
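A minimal sketch of the two MMI steps under nearest-neighbor matching is given below; the variable names, the single-nearest-neighbor rule, and the simulated samples are assumptions for illustration, and the bootstrap variance step is omitted.

```python
# Minimal sketch of matched mass imputation (MMI) with nearest-neighbor matching.
# Simulated data and the 1-NN rule are assumed for illustration only.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(2)

# Probability sample: covariates x_p and design weights w_p (outcome not observed).
n_p = 500
x_p = rng.normal(size=(n_p, 2))
w_p = rng.uniform(50, 150, size=n_p)

# Nonprobability sample: covariates and the outcome y (selection possibly biased).
n_np = 5000
x_np = rng.normal(loc=0.3, size=(n_np, 2))
y_np = 2.0 + x_np @ np.array([1.0, -0.5]) + rng.normal(size=n_np)

# Step 1: statistical matching -- for each probability-sample unit, find the
# nonprobability unit with the closest covariates.
nn = NearestNeighbors(n_neighbors=1).fit(x_np)
_, donor_idx = nn.kneighbors(x_p)
donor_idx = donor_idx.ravel()

# Step 2: mass imputation -- transfer the matched donors' outcomes to the
# probability sample and estimate with its design weights.
y_imputed = y_np[donor_idx]
mmi_estimate = np.sum(w_p * y_imputed) / np.sum(w_p)
print("MMI estimate of the population mean:", round(mmi_estimate, 3))
```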
Business Establishment Automated Classification of NAICS (BEACON) is a text classification tool that helps respondents to the U.S. Census Bureau’s economic surveys self-classify their business activity in real time. The tool is based on rich training data, natural language processing, machine learning, and information retrieval, and it is implemented in Python behind an application programming interface. This paper describes BEACON’s methodology and its successful application to the 2022 Economic Census, during which the tool was used over half a million times. BEACON has demonstrated that it recognizes a large vocabulary, quickly returns relevant results to respondents, and reduces the clerical work associated with industry code assignment.
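For readers unfamiliar with this class of tools, the toy pipeline below illustrates the kind of components BEACON combines (text featurization, a learned classifier, and retrieval of top-ranked candidate codes). It is not BEACON's actual implementation, and the example descriptions and NAICS labels are invented for demonstration.

```python
# Generic illustration of a text-to-industry-code pipeline (NOT BEACON's
# actual implementation); the toy descriptions and labels are made up.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

descriptions = [
    "full service restaurant serving dinner",
    "residential home construction and remodeling",
    "retail bakery selling bread and pastries",
    "software development and custom programming",
]
naics = ["722511", "236118", "311811", "541511"]  # toy labels

pipeline = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                         LogisticRegression(max_iter=1000))
pipeline.fit(descriptions, naics)

# Return the top-ranked candidate codes for a new write-in description.
query = ["we build custom houses"]
probs = pipeline.predict_proba(query)[0]
for i in np.argsort(probs)[::-1][:3]:
    print(pipeline.classes_[i], round(probs[i], 3))
```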
Approximately 15% of adults in the United States (U.S.) are afflicted with chronic kidney disease (CKD). For CKD patients, the progressive decline of kidney function is intricately related to hospitalizations due to cardiovascular disease and eventual “terminal” events, such as kidney failure and mortality. To unravel the mechanisms underlying the disease dynamics of these interdependent processes, to identify influential risk factors, and to tailor decision-making to individual patient needs, we develop a novel Bayesian multivariate joint model for the intercorrelated outcomes of kidney function (as measured by longitudinal estimated glomerular filtration rate), recurrent cardiovascular events, and the competing-risk terminal events of kidney failure and death. The proposed joint modeling approach not only facilitates the exploration of risk factors associated with each outcome, but also allows dynamic updates of cumulative incidence probabilities for each competing risk for future subjects based on their basic characteristics and a combined history of longitudinal measurements and recurrent events. We propose efficient and flexible estimation and prediction procedures within a Bayesian framework employing Markov Chain Monte Carlo methods. The predictive performance of our model is assessed through dynamic area under the receiver operating characteristic curves and the expected Brier score. We demonstrate the efficacy of the proposed methodology through extensive simulations. The proposed methodology is applied to data from the Chronic Renal Insufficiency Cohort study, established by the National Institute of Diabetes and Digestive and Kidney Diseases to address the rising epidemic of CKD in the U.S.
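A schematic of a shared-random-effects layout for the three linked outcomes is sketched below in assumed notation; it conveys the general structure of such joint models rather than the paper's exact specification.

```latex
% Schematic of the linked submodels (generic shared-random-effects layout,
% not the paper's exact specification).
\begin{align*}
  \text{longitudinal eGFR:}\quad
    & y_i(t) = x_i^{\top}(t)\,\beta + z_i^{\top}(t)\,b_i + \varepsilon_i(t), \\
  \text{recurrent cardiovascular events:}\quad
    & r_i(t) = r_0(t)\exp\!\left\{ w_i^{\top}\gamma + \alpha_1^{\top} b_i \right\}, \\
  \text{competing terminal events } (k = \text{kidney failure, death}):\quad
    & \lambda_{ik}(t) = \lambda_{0k}(t)\exp\!\left\{ v_i^{\top}\eta_k + \alpha_{2k}^{\top} b_i \right\}.
\end{align*}
% The subject-level random effects b_i are shared across the submodels,
% inducing the correlation that supports dynamic prediction of the
% cumulative incidence of each terminal event.
```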
Recent studies have observed a surprising behavior of model test error called the double descent phenomenon, in which increasing model complexity first decreases the test error, then increases it, and then decreases it again. To observe this, we work with a two-layer neural network with a ReLU activation function designed for binary classification under supervised learning. Our aim is to observe and investigate the mathematical theory behind the double descent behavior of the test error as the model size varies. We quantify the model size by the ratio of the number of training samples to the dimension of the model. Due to the complexity of the empirical risk minimization procedure, we use the Convex Gaussian MinMax Theorem to find a suitable candidate for the global training loss.
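A minimal numerical illustration of the phenomenon, under assumptions that differ from the paper's setup (random fixed first-layer weights and a min-norm least-squares fit of the second layer, rather than the CGMT analysis), is sketched below; it typically reproduces the rise in test error near the interpolation threshold followed by a second descent.

```python
# Minimal random ReLU-features experiment that typically exhibits double
# descent in test error as width varies. Illustration only: min-norm
# least-squares fit on binary labels, not the paper's CGMT-based analysis.
import numpy as np

rng = np.random.default_rng(3)
d, n_train, n_test = 30, 200, 2000

# Binary classification data from a noisy linear teacher.
w_star = rng.normal(size=d)
def make_data(n):
    X = rng.normal(size=(n, d))
    y = np.sign(X @ w_star + 0.5 * rng.normal(size=n))
    return X, y

X_tr, y_tr = make_data(n_train)
X_te, y_te = make_data(n_test)

for width in [10, 50, 100, 150, 200, 250, 400, 1000, 4000]:
    W = rng.normal(size=(d, width)) / np.sqrt(d)                       # fixed first layer
    H_tr, H_te = np.maximum(X_tr @ W, 0.0), np.maximum(X_te @ W, 0.0)  # ReLU features
    a = np.linalg.pinv(H_tr) @ y_tr                                    # min-norm second layer
    err = np.mean(np.sign(H_te @ a) != y_te)                           # test error
    print(f"width={width:5d}  n/width={n_train/width:6.2f}  test error={err:.3f}")
```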
When comparing two survival curves, three tests are widely used: the Cox proportional hazards (CoxPH) test, the logrank test, and the Wilcoxon test. Despite their popularity in survival data analysis, their results lack a clear clinical interpretation, especially when the proportional hazards (PH) assumption is not valid. Meanwhile, the restricted mean survival time (RMST) offers an intuitive and clinically meaningful interpretation. We compare these four tests with regard to statistical power under many configurations (e.g., proportional hazards, early benefit, delayed benefit, and crossing survival curves) with data simulated from Weibull distributions. We then use an example from a lung cancer trial to compare their required sample sizes. As expected, the CoxPH test is more powerful than the others when the PH assumption is valid. The Wilcoxon test is often preferable when the event rate decreases over time. The RMST test is much more powerful than the others when a new treatment has an early benefit. The recommended test(s) under each configuration are suggested in this article.
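For reference, the quantity behind the RMST test is the standard truncated mean survival time, which underlies its clinical interpretability:

```latex
% Standard definition of the restricted mean survival time, stated for clarity;
% \tau is the prespecified truncation time and S(t) the survival function.
\[
  \mathrm{RMST}(\tau) \;=\; \int_0^{\tau} S(t)\,dt
  \;=\; \text{expected survival time up to } \tau,
\]
% and the two-arm test is based on the difference
\[
  \widehat{\Delta}(\tau) \;=\; \widehat{\mathrm{RMST}}_1(\tau) - \widehat{\mathrm{RMST}}_0(\tau),
\]
% estimated by the areas under the two Kaplan--Meier curves up to \tau and
% standardized by its (asymptotically normal) standard error.
```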