Studying migration patterns driven by extreme environmental events is crucial for building a sustainable society and a stable economy. Motivated by a real dataset on human migration, this paper develops a transformed varying coefficient model for origin-destination (OD) regression to elucidate the complex associations of migration patterns with spatio-temporal dependencies and socioeconomic factors. Existing studies often overlook the dynamic effects of these factors in OD regression. Furthermore, with the increasing ease of collecting OD data, the scale of current OD regression data is typically large, necessitating methods for efficiently fitting large-scale migration data. We address this challenge by proposing a new Bayesian interpretation of the proposed OD models that leverages sufficient statistics for efficient big-data computation. Although inspired by migration studies, our method promises broad applicability across various fields, contributing to refined statistical analysis techniques. Extensive numerical studies are provided, and insights from the real data analysis are shared.
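As a generic illustration of the sufficient-statistics idea for big-data computation (not the paper's transformed varying coefficient OD model), the sketch below accumulates X'X and X'y over data chunks and then evaluates a conjugate Bayesian linear regression posterior without holding the full dataset in memory; all dimensions, data-generating settings, and prior choices are illustrative assumptions.

```python
# Sketch only: chunk-wise sufficient statistics for a conjugate Bayesian
# linear regression; not the paper's transformed varying coefficient model.
import numpy as np

rng = np.random.default_rng(0)
p = 5                                   # number of covariates (illustrative)
XtX = np.zeros((p, p))                  # running sufficient statistic X'X
Xty = np.zeros(p)                       # running sufficient statistic X'y
n_total = 0

def chunks(n_chunks=20, chunk_size=10_000):
    """Simulate streaming OD-style regression data chunk by chunk."""
    beta_true = np.array([1.0, -0.5, 0.3, 0.0, 2.0])
    for _ in range(n_chunks):
        X = rng.normal(size=(chunk_size, p))
        y = X @ beta_true + rng.normal(scale=1.0, size=chunk_size)
        yield X, y

for X, y in chunks():
    XtX += X.T @ X                      # each pass touches one chunk only
    Xty += X.T @ y
    n_total += len(y)

# Conjugate prior beta ~ N(0, tau^2 I) with known noise variance sigma^2
sigma2, tau2 = 1.0, 10.0
prior_prec = np.eye(p) / tau2
post_cov = np.linalg.inv(XtX / sigma2 + prior_prec)
post_mean = post_cov @ (Xty / sigma2)

print("posterior mean:", np.round(post_mean, 3))
```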
Significant attention has been drawn to support vector data description (SVDD) due to its exceptional performance in one-class classification and novelty detection tasks. Nevertheless, in standard SVDD all slack variables are assigned the same weight during the modeling process. This can lead to a decline in learning performance if the training data contain erroneous observations or outliers. In this study, an extended SVDD model, the Rescaled Hinge Loss Support Vector Data Description (RSVDD), is introduced to strengthen the resistance of SVDD to anomalies. This is achieved by redefining the original SVDD optimization problem using a rescaled hinge loss function. Because this loss function increases the influence of samples that are more likely to belong to the target class while decreasing the impact of samples that are more likely to be anomalies, RSVDD can be viewed as a variant of weighted SVDD. To efficiently solve the optimization problem associated with the proposed model, the half-quadratic optimization method is used to derive an iterative optimization algorithm. Experimental findings on synthetic and breast cancer data sets illustrate the proposed method's performance advantage over existing methods for the settings considered.
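A minimal sketch of the iterative reweighting idea behind a rescaled hinge loss is given below; it is not the authors' RSVDD algorithm. scikit-learn's OneClassSVM (closely related to SVDD under an RBF kernel) stands in for the SVDD solver, and the rescaling parameter eta and the number of alternations are illustrative assumptions.

```python
# Sketch of half-quadratic-style reweighting with a rescaled hinge loss.
# OneClassSVM substitutes for SVDD; eta and the loop length are assumptions.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(1)
X = np.vstack([
    rng.normal(0.0, 1.0, size=(200, 2)),     # target-class samples
    rng.normal(6.0, 0.5, size=(10, 2)),      # a few gross outliers
])

eta = 0.5                                    # rescaling parameter (assumed)
weights = np.ones(len(X))                    # start with uniform weights
model = OneClassSVM(kernel="rbf", nu=0.1, gamma="scale")

for _ in range(5):                           # alternate between fit and reweight
    model.fit(X, sample_weight=weights)
    slack = np.maximum(0.0, -model.decision_function(X))   # hinge-type loss
    weights = np.exp(-eta * slack)           # downweight likely anomalies
    weights /= weights.mean()                # keep weights on a stable scale

print("smallest weights (likely outliers):", np.sort(weights)[:5].round(3))
```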
The rapid accumulation and release of data have fueled research across various fields. While numerous methods exist for data collection and storage, data distribution presents challenges, as some datasets are restricted, and certain subsets may compromise privacy if released unaltered. Statistical disclosure control (SDC) aims to maximize data utility while minimizing the disclosure risk, i.e., the risk of individual identification. A key SDC method is data perturbation, with General Additive Data Perturbation (GADP) and Copula General Additive Data Perturbation (CGADP) being two prominent approaches. Both leverage multivariate normal distributions to generate synthetic data while preserving statistical properties of the original dataset. Given the increasing use of machine learning for data modeling, this study compares the performance of various machine learning models on GADP- and CGADP-perturbed data. Using Monte Carlo simulations with three data-generating models and a real dataset, we evaluate the predictive performance and robustness of ten machine learning techniques under data perturbation. Our findings provide insights into the machine learning techniques that perform robustly on GADP- and CGADP-perturbed datasets, extending previous research that primarily focused on simple statistics such as means, variances, and correlations.
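As a simplified illustration of multivariate-normal data perturbation in the spirit of GADP (not the exact GADP or CGADP algorithms), the sketch below shrinks each record toward the sample mean and adds multivariate normal noise calibrated so that the perturbed data preserve the mean vector and covariance matrix in expectation; the shrinkage factor d is an illustrative assumption.

```python
# Simplified additive perturbation sketch: Y = d*X + E with
# E ~ N((1-d)*mu, (1-d^2)*Sigma), which preserves the mean and covariance.
# Not the exact GADP/CGADP procedures.
import numpy as np

def perturb(X, d=0.8, seed=0):
    """Return perturbed data with the same mean and covariance in expectation."""
    rng = np.random.default_rng(seed)
    mu = X.mean(axis=0)
    sigma = np.cov(X, rowvar=False)
    noise = rng.multivariate_normal((1 - d) * mu, (1 - d**2) * sigma, size=len(X))
    return d * X + noise

rng = np.random.default_rng(42)
X = rng.multivariate_normal([0, 2, -1],
                            [[1, .5, .2], [.5, 1, .3], [.2, .3, 1]],
                            size=5000)
Y = perturb(X)

print("max mean difference:", np.abs(X.mean(axis=0) - Y.mean(axis=0)).max().round(3))
print("max cov difference: ", np.abs(np.cov(X, rowvar=False) - np.cov(Y, rowvar=False)).max().round(3))
```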
Analysis of nonprobability survey samples has gained much attention in recent years due to their wide availability and the declining response rates of their costly probability-based counterparts. Still, valid population inference cannot be deduced from nonprobability samples without additional information, which typically takes the form of a smaller survey sample with a shared set of covariates. In this paper, we propose the matched mass imputation (MMI) approach as a means for integrating data from probability and nonprobability samples when common covariates are present in both samples but the variable of interest is available only in the nonprobability sample. The proposed approach borrows strength from the ideas of statistical matching and mass imputation to provide robustness against potential nonignorable bias in the nonprobability sample. Specifically, MMI is a two-step approach: first, a novel application of statistical matching identifies a subset of the nonprobability sample that closely resembles the probability sample; second, mass imputation is performed using these matched units. Our empirical results, from simulations and a real data application, demonstrate the effectiveness of the MMI estimator under nearest-neighbor matching, which almost always outperformed other imputation estimators in the presence of nonignorable bias. We also explore the effectiveness of a bootstrap variance estimation procedure for the proposed MMI estimator.
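A minimal sketch of the two-step MMI idea under nearest-neighbor matching is given below: probability-sample units are matched to nonprobability-sample donors on the shared covariates, the donors' outcomes are imputed, and the design weights are applied. The simulated data, single-donor matching, and variable names are illustrative assumptions rather than the authors' full procedure.

```python
# Sketch of matching plus mass imputation; simulated data and single-donor
# nearest-neighbor matching are illustrative assumptions.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(7)

# Nonprobability sample: covariates X_np and outcome y_np (possibly biased).
n_np = 5000
X_np = rng.normal(size=(n_np, 3))
y_np = 2 + X_np @ np.array([1.0, -0.5, 0.25]) + rng.normal(size=n_np)

# Probability sample: covariates and design weights only, no outcome.
n_p = 800
X_p = rng.normal(size=(n_p, 3))
w_p = rng.uniform(50, 150, size=n_p)         # illustrative design weights

# Step 1: statistical matching -- nearest nonprobability donor per prob. unit.
nn = NearestNeighbors(n_neighbors=1).fit(X_np)
_, donor_idx = nn.kneighbors(X_p)

# Step 2: mass imputation -- impute each donor's y, then apply design weights.
y_imputed = y_np[donor_idx[:, 0]]
mmi_estimate = np.sum(w_p * y_imputed) / np.sum(w_p)
print("MMI-style estimate of the population mean:", round(mmi_estimate, 3))
```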
Business Establishment Automated Classification of NAICS (BEACON) is a text classification tool that helps respondents to the U.S. Census Bureau’s economic surveys self-classify their business activity in real time. The tool is based on rich training data, natural language processing, machine learning, and information retrieval. It is implemented using Python and an application programming interface. This paper describes BEACON’s methodology and successful application to the 2022 Economic Census, during which the tool was used over half a million times. BEACON has demonstrated that it recognizes a large vocabulary, quickly returns relevant results to respondents, and reduces clerical work associated with industry code assignment.
Approximately 15% of adults in the United States (U.S.) are afflicted with chronic kidney disease (CKD). For CKD patients, the progressive decline of kidney function is intricately related to hospitalizations due to cardiovascular disease and eventual “terminal” events, such as kidney failure and mortality. To unravel the mechanisms underlying the disease dynamics of these interdependent processes, including identifying influential risk factors, as well as tailoring decision-making to individual patient needs, we develop a novel Bayesian multivariate joint model for the intercorrelated outcomes of kidney function (as measured by longitudinal estimated glomerular filtration rate), recurrent cardiovascular events, and competing-risk terminal events of kidney failure and death. The proposed joint modeling approach not only facilitates the exploration of risk factors associated with each outcome, but also allows dynamic updates of cumulative incidence probabilities for each competing risk for future subjects based on their basic characteristics and a combined history of longitudinal measurements and recurrent events. We propose efficient and flexible estimation and prediction procedures within a Bayesian framework employing Markov chain Monte Carlo methods. The predictive performance of our model is assessed through dynamic area under the receiver operating characteristic curves and the expected Brier score. We demonstrate the efficacy of the proposed methodology through extensive simulations. The proposed methodology is applied to data from the Chronic Renal Insufficiency Cohort study established by the National Institute of Diabetes and Digestive and Kidney Diseases to address the rising epidemic of CKD in the U.S.
Recent studies have observed a surprising behavior of model test error known as the double descent phenomenon, in which increasing model complexity first decreases the test error, then increases it, and then decreases it again. To study this, we work with a two-layer neural network with a ReLU activation function designed for binary classification under supervised learning. Our aim is to investigate the mathematical theory behind the double descent behavior of the model test error as the model size varies. We quantify the model size by the ratio of the number of training samples to the dimension of the model. Due to the complexity of the empirical risk minimization procedure, we use the Convex Gaussian Min-max Theorem (CGMT) to find a suitable candidate for the global training loss.
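The sketch below is an illustrative surrogate, not the paper's CGMT analysis: a random ReLU feature model fit by minimum-norm least squares, where sweeping the number of features changes the samples-to-parameters ratio and the test squared error typically traces out a double descent curve around the interpolation threshold.

```python
# Random ReLU feature surrogate for double descent (illustration only).
import numpy as np

rng = np.random.default_rng(3)
d, n_train, n_test = 20, 200, 2000

w_true = rng.normal(size=d) / np.sqrt(d)

def make_data(n):
    X = rng.normal(size=(n, d))
    y = np.sign(X @ w_true + 0.1 * rng.normal(size=n))   # noisy +/-1 labels
    return X, y

X_tr, y_tr = make_data(n_train)
X_te, y_te = make_data(n_test)

# Sweep the number of random ReLU features; the interpolation threshold sits
# near width == n_train, where the test squared error typically peaks before
# descending again in the overparameterized regime.
for width in [20, 50, 100, 150, 190, 200, 210, 250, 400, 1000, 4000]:
    W = rng.normal(size=(d, width)) / np.sqrt(d)          # random fixed first layer
    H_tr = np.maximum(X_tr @ W, 0.0)                      # ReLU features (train)
    H_te = np.maximum(X_te @ W, 0.0)                      # ReLU features (test)
    beta = np.linalg.pinv(H_tr) @ y_tr                    # min-norm least squares
    mse = np.mean((H_te @ beta - y_te) ** 2)              # test squared error
    err = np.mean(np.sign(H_te @ beta) != y_te)           # 0-1 test error
    print(f"width={width:5d}  n/width={n_train/width:5.2f}  "
          f"test MSE={mse:8.3f}  0-1 error={err:.3f}")
```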
When comparing two survival curves, three tests are widely used: the Cox proportional hazards (CoxPH) test, the logrank test, and the Wilcoxon test. Despite their popularity in survival data analysis, these tests lack a clear clinical interpretation, especially when the proportional hazards (PH) assumption is not valid. Meanwhile, the restricted mean survival time (RMST) offers an intuitive and clinically meaningful interpretation. We compare these four tests with regard to statistical power under many configurations (e.g., proportional hazards, early benefit, delayed benefit, and crossing survival curves) with data simulated from Weibull distributions. We then use an example from a lung cancer trial to compare their required sample sizes. As expected, the CoxPH test is more powerful than the others when the PH assumption is valid. The Wilcoxon test is often preferable when the event rate decreases over time. The RMST test is much more powerful than the others when a new treatment has an early benefit. The recommended test(s) under each configuration are suggested in this article.
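A minimal sketch of the RMST and logrank computations on simulated Weibull data is given below (the lifelines package is assumed available; arm parameters, censoring, and the truncation time tau are illustrative). The RMST of each arm is computed as the area under the Kaplan-Meier curve up to tau.

```python
# Sketch: logrank test and RMST on simulated two-arm Weibull data.
# lifelines is assumed available; all parameter values are illustrative.
import numpy as np
from lifelines import KaplanMeierFitter
from lifelines.statistics import logrank_test

rng = np.random.default_rng(11)
n, tau = 300, 24.0                                   # per-arm size, months

def simulate_arm(shape, scale):
    t = scale * rng.weibull(shape, size=n)           # event times
    c = rng.uniform(0, 36, size=n)                   # uniform censoring
    return np.minimum(t, c), (t <= c).astype(int)

t_ctrl, e_ctrl = simulate_arm(shape=1.0, scale=12.0) # control arm
t_trt,  e_trt  = simulate_arm(shape=1.0, scale=18.0) # treatment arm

def rmst(durations, events, tau):
    """Area under the Kaplan-Meier curve on [0, tau] (left-endpoint sum on a fine grid)."""
    kmf = KaplanMeierFitter().fit(durations, event_observed=events)
    grid = np.linspace(0, tau, 2001)
    surv = kmf.survival_function_at_times(grid).to_numpy()
    return np.sum(surv[:-1] * np.diff(grid))

lr = logrank_test(t_ctrl, t_trt, event_observed_A=e_ctrl, event_observed_B=e_trt)
print(f"logrank p-value: {lr.p_value:.4f}")
print(f"RMST control:   {rmst(t_ctrl, e_ctrl, tau):.2f} months")
print(f"RMST treatment: {rmst(t_trt, e_trt, tau):.2f} months")
```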
An extensive literature exists on the analysis of correlated survival data. Subjects within a cluster share some common characteristics, e.g., genetic and environmental factors, so their time-to-event outcomes are correlated. The frailty model under the proportional hazards assumption has been widely applied for the analysis of clustered survival outcomes. However, the prediction performance of this method can be less satisfactory when the risk factors have complicated effects, e.g., nonlinear or interactive effects. To deal with these issues, we propose a neural network frailty Cox model that replaces the linear risk function with the output of a feed-forward neural network. Estimation is based on quasi-likelihood using a Laplace approximation. A simulation study suggests that the proposed method outperforms existing methods. The method is applied to clustered time-to-failure prediction within kidney transplantation facilities using national kidney transplant registry data from the U.S. Organ Procurement and Transplantation Network. All computer programs are available at https://github.com/rivenzhou/deep_learning_clustered.
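The sketch below (PyTorch assumed available) shows the core ingredient of a neural-network Cox model: a feed-forward network whose output risk score is trained with the Cox negative partial log-likelihood. The cluster-specific frailty term and the Laplace-approximated quasi-likelihood of the proposed method are omitted, and the simulated data are illustrative.

```python
# Sketch: feed-forward risk score trained with the Cox negative partial
# log-likelihood (no frailty term, no Laplace approximation).
import torch
import torch.nn as nn

def cox_neg_partial_loglik(risk, time, event):
    """Breslow-style negative partial log-likelihood (no tie handling)."""
    order = torch.argsort(time, descending=True)      # risk sets via cumulative sums
    risk, event = risk[order], event[order]
    log_cum_risk = torch.logcumsumexp(risk, dim=0)    # log sum over each risk set
    return -torch.sum((risk - log_cum_risk) * event) / event.sum()

net = nn.Sequential(nn.Linear(5, 32), nn.ReLU(), nn.Linear(32, 1))

# Illustrative data: 5 covariates, exponential event times, ~30% censoring.
torch.manual_seed(0)
X = torch.randn(500, 5)
true_risk = X[:, 0] - 0.5 * X[:, 1]
time = torch.distributions.Exponential(torch.exp(true_risk)).sample()
event = (torch.rand(500) > 0.3).float()

opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for epoch in range(200):
    opt.zero_grad()
    loss = cox_neg_partial_loglik(net(X).squeeze(-1), time, event)
    loss.backward()
    opt.step()
print("final training loss:", float(loss))
```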
The mixture of Weibull distributions has wide application in the modeling of heterogeneous data sets. Parameter estimation is one of the most important problems related to mixtures of Weibull distributions. In this paper, we propose an L-moment estimation method for the mixture of two Weibull distributions. The proposed method is compared with the maximum likelihood estimation (MLE) method in terms of bias, mean absolute error, mean total error, and algorithm completion time in a simulation study. Applications to real data sets are also given to show the flexibility and potential of the proposed estimation method. The comparison shows that the proposed method outperforms the MLE method.
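A minimal sketch of the basic ingredients is given below (not the authors' full mixture procedure): unbiased sample L-moments and a closed-form L-moment fit of a single two-parameter Weibull, which uses the relation l2/l1 = 1 - 2^(-1/k). Fitting the two-component mixture would additionally require matching higher-order L-moments numerically.

```python
# Sketch: sample L-moments and a single-Weibull L-moment fit (building blocks
# only; the two-component mixture fit is not implemented here).
import numpy as np
from scipy.special import gamma

def sample_lmoments(x):
    """First four sample L-moments via unbiased probability-weighted moments."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    i = np.arange(1, n + 1)
    b0 = x.mean()
    b1 = np.sum((i - 1) / (n - 1) * x) / n
    b2 = np.sum((i - 1) * (i - 2) / ((n - 1) * (n - 2)) * x) / n
    b3 = np.sum((i - 1) * (i - 2) * (i - 3) / ((n - 1) * (n - 2) * (n - 3)) * x) / n
    return b0, 2 * b1 - b0, 6 * b2 - 6 * b1 + b0, 20 * b3 - 30 * b2 + 12 * b1 - b0

def weibull_lmom_fit(x):
    """Match l1 and l2 of a Weibull(shape k, scale s): l2/l1 = 1 - 2**(-1/k)."""
    l1, l2, _, _ = sample_lmoments(x)
    k = -np.log(2.0) / np.log(1.0 - l2 / l1)
    s = l1 / gamma(1.0 + 1.0 / k)
    return k, s

rng = np.random.default_rng(5)
x = 2.0 * rng.weibull(1.5, size=5000)        # Weibull with shape 1.5, scale 2
print("fitted (shape, scale):", tuple(round(v, 3) for v in weibull_lmom_fit(x)))
```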