As the major target of many vaccines and neutralizing antibodies against SARS-CoV-2, the spike (S) protein is observed to mutate over time. In this paper, we present statistical approaches to tackle some challenges associated with the analysis of S-protein data. We build a Bayesian hierarchical model to study the temporal and spatial evolution of S-protein sequences, after grouping the sequences into representative clusters. We then apply sampling methods to investigate possible changes to the S-protein’s 3-D structure as a result of commonly observed mutations. While the increasing spread of D614G variants has been noted in other research, our results also show that the co-occurring mutations of D614G together with S477N or A222V may spread even more rapidly, as quantified by our model estimates.
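As an illustration of the kind of hierarchical structure described above (a hedged sketch, not the authors' exact specification), cluster counts by region and time can be modeled as multinomial with cluster-specific growth rates and region-level random effects:
\[
\mathbf{y}_{r,t} \sim \mathrm{Multinomial}(n_{r,t}, \boldsymbol{\pi}_{r,t}), \qquad
\pi_{r,t,k} = \frac{\exp(\alpha_{k} + \beta_{k}\, t + u_{r,k})}{\sum_{k'} \exp(\alpha_{k'} + \beta_{k'}\, t + u_{r,k'})},
\]
where \(y_{r,t,k}\) counts sequences from cluster \(k\) in region \(r\) at time \(t\), \(\beta_k\) captures the temporal growth rate of cluster \(k\) (for example, clusters carrying D614G together with S477N or A222V), and the region effects \(u_{r,k}\) share a common prior such as \(u_{r,k} \sim \mathcal{N}(0, \sigma_k^2)\).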
The COVID-19 (COrona VIrus Disease 2019) pandemic has had profound global consequences for health and for economic, social, behavioral, and almost every other major aspect of human life. Therefore, it is of great importance to model COVID-19 and other pandemics in terms of the broader social contexts in which they take place. We present the architecture of an artificial intelligence enhanced COVID-19 analysis framework (AICov for short), which provides an integrative deep learning approach to COVID-19 forecasting with population covariates, some of which may serve as putative risk factors. We have integrated multiple strategies into AICov, including deep learning based on Long Short-Term Memory (LSTM) networks and event modeling. To demonstrate our approach, we have introduced a framework that integrates population covariates from multiple sources. Thus, AICov includes not only data on COVID-19 cases and deaths but, more importantly, the population's socioeconomic, health, and behavioral risk factors at their specific locations. The compiled data are fed into AICov, and integrating them into the model improves prediction compared with a model that uses only case and death data. Because we use deep learning, our models adapt over time as they learn from past data.
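A minimal sketch of this kind of architecture, combining an LSTM over the case/death time series with static population covariates (the layer sizes, window length, and covariate count are illustrative assumptions, not the AICov implementation):

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

# Illustrative shapes: 28-day windows of (cases, deaths) plus 10 static covariates.
ts_input = layers.Input(shape=(28, 2), name="case_death_series")
cov_input = layers.Input(shape=(10,), name="population_covariates")

# The LSTM encodes the temporal dynamics of cases and deaths.
h = layers.LSTM(64)(ts_input)

# Static socioeconomic, health, and behavioral covariates are concatenated
# with the temporal encoding before the forecasting head.
z = layers.Concatenate()([h, cov_input])
z = layers.Dense(32, activation="relu")(z)
output = layers.Dense(7, name="seven_day_forecast")(z)  # forecast the next 7 days

model = Model([ts_input, cov_input], output)
model.compile(optimizer="adam", loss="mse")
```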
Large volumes of trajectory data collected from human and vehicle mobility are highly sensitive due to privacy concerns. Therefore, generating synthetic yet plausible trajectory data is pivotal in many location-based studies and applications. However, existing LSTM-based methods are not suitable for modeling large-scale sequences because of the vanishing gradient problem, and existing GAN-based methods are coarse-grained. Considering the geographical and sequential features of trajectories, we propose a map-based Two-Stage GAN method (TSG) to tackle these challenges and generate fine-grained, plausible, large-scale trajectories. In the first stage, we transform GPS point data into a discrete grid representation and use it as the input to a modified deep convolutional generative adversarial network, which learns the general pattern. In the second stage, within each grid cell, we design an encoder-decoder network as the generator to extract road information from the map image and embed it into two parallel Long Short-Term Memory networks that generate GPS point sequences. A discriminator conditioned on the encoded map image constrains the generated point sequences so that they do not deviate from the corresponding road networks. Experiments on real-world data demonstrate the effectiveness of our model in preserving geographical features and hidden mobility patterns. Moreover, the generated trajectories not only match the distribution of the real data but also achieve satisfactory road-network matching accuracy.
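The first-stage input can be illustrated by a simple discretization step that maps GPS points onto a fixed grid (the grid size and bounding box below are illustrative assumptions, not the paper's settings):

```python
import numpy as np

def trajectory_to_grid(points, lat_range, lon_range, grid_size=64):
    """Map a sequence of (lat, lon) GPS points onto a grid_size x grid_size
    occupancy image, the kind of coarse representation a DCGAN can learn from."""
    grid = np.zeros((grid_size, grid_size), dtype=np.float32)
    lat_min, lat_max = lat_range
    lon_min, lon_max = lon_range
    for lat, lon in points:
        i = int((lat - lat_min) / (lat_max - lat_min) * (grid_size - 1))
        j = int((lon - lon_min) / (lon_max - lon_min) * (grid_size - 1))
        if 0 <= i < grid_size and 0 <= j < grid_size:
            grid[i, j] = 1.0  # mark the cell as visited
    return grid
```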
A key step in building regression models is selecting appropriate explanatory variables that account for the behavior of the response variable, and stepwise-based procedures occupy a prominent position among the available selection methods. In this paper we performed several simulation studies to investigate whether a specific stepwise-based approach, namely Strategy A, properly selects authentic variables within the generalized additive models for location, scale and shape (GAMLSS) framework, considering Gaussian, zero-inflated Poisson, and Weibull distributions. Continuous explanatory variables (with linear and nonlinear relationships) and categorical explanatory variables are considered, and they are selected through goodness-of-fit statistics. Overall, we conclude that Strategy A performs well.
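As a language-neutral illustration of stepwise selection driven by a goodness-of-fit statistic (not the paper's Strategy A, which operates on all GAMLSS distribution parameters rather than only the mean), a greedy forward selection by AIC for a Gaussian linear model might look like this:

```python
import numpy as np
import statsmodels.api as sm

def forward_select_aic(X, y):
    """Greedy forward selection by AIC (X: pandas DataFrame of candidate
    variables, y: array-like response). Add the variable that most improves
    AIC; stop when no candidate improves it."""
    selected, remaining = [], list(X.columns)
    best_aic = sm.OLS(y, np.ones(len(y))).fit().aic  # intercept-only model
    while remaining:
        aics = {v: sm.OLS(y, sm.add_constant(X[selected + [v]])).fit().aic
                for v in remaining}
        best_var = min(aics, key=aics.get)
        if aics[best_var] >= best_aic:
            break
        best_aic = aics[best_var]
        selected.append(best_var)
        remaining.remove(best_var)
    return selected
```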
We consider a continuous outcome subject to nonresponse and a fully observed covariate. We propose a spline proxy pattern-mixture model (S-PPMA), an extension of the proxy pattern-mixture model (PPMA) (Andridge and Little, 2011), to estimate the mean of the outcome under varying assumptions about nonresponse. S-PPMA improves the robustness of PPMA, which assumes bivariate normality between the outcome and the covariate, by modeling the relationship via a spline. Simulations indicate that S-PPMA outperforms PPMA when the data deviate from normality and are missing not at random, with minor losses of efficiency when the data are normal.
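The pattern-mixture idea underlying both PPMA and S-PPMA can be sketched as follows (a generic form, not the authors' exact estimator):
\[
E(Y) = E(Y \mid M = 0)\,\Pr(M = 0) + E(Y \mid M = 1)\,\Pr(M = 1),
\]
where \(M = 1\) indicates nonresponse. The respondent-side quantities are estimable from the observed data, while the nonrespondent mean \(E(Y \mid M = 1)\) is identified only through an assumption linking nonresponse to a proxy for \(Y\) constructed from the covariate, indexed by a sensitivity parameter that is varied to reflect different missing-not-at-random mechanisms. In S-PPMA, that proxy relationship is modeled by a spline rather than through the bivariate normality assumption of PPMA.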
Multi-class classification is commonly encountered in data science practice, and it has broad applications in many areas such as biology, medicine, and engineering. Variable selection in multi-class problems is much more challenging than in binary classification or regression problems. In addition to estimating multiple discriminant functions for separating different classes, we need to decide which variables are important for each individual discriminant function as well as for the whole set of functions. In this paper, we address the multi-class variable selection problem by proposing a new form of penalty, supSCAD, which first groups all the coefficients of the same variable across all the discriminant functions and then imposes the SCAD penalty on the sup-norm of each group. We apply the new penalty to both soft and hard classification and develop two new procedures: the supSCAD multinomial logistic regression and the supSCAD multi-category support vector machine. Our theoretical results show that, with a proper choice of the tuning parameter, the supSCAD multinomial logistic regression can identify the underlying sparse model consistently and enjoys oracle properties even when the dimension of the predictors goes to infinity. Based on local linear and quadratic approximations to the non-concave SCAD penalty and the nonlinear multinomial log-likelihood function, we show that the new procedures can be implemented efficiently by solving a series of linear or quadratic programming problems. The performance of the new methods is illustrated by simulation studies and by real data analysis of the Small Round Blue Cell Tumors and Semeion Handwritten Digit data sets.
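A sketch of the penalty's form (up to notational details that may differ from the paper): with coefficients \(\beta_{jk}\) for variable \(j\) in discriminant function \(k\), the supSCAD penalty applies the SCAD function to each variable's sup-norm group,
\[
P_{\lambda}(\boldsymbol{\beta}) = \sum_{j=1}^{p} p_{\lambda}\!\left( \max_{1 \le k \le K} |\beta_{jk}| \right),
\]
where \(p_\lambda(\cdot)\) is the SCAD penalty, defined through its derivative
\[
p'_{\lambda}(t) = \lambda \left\{ I(t \le \lambda) + \frac{(a\lambda - t)_{+}}{(a-1)\lambda}\, I(t > \lambda) \right\}, \qquad a > 2,
\]
so that a variable is removed from all \(K\) discriminant functions simultaneously once its largest coefficient is shrunk to zero.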
Since the first confirmed case of COVID-19 was identified in December 2019, the total number of COVID-19 patients has risen to 80,675,745, with 1,764,185 deaths, as of December 27, 2020. Researchers are still learning about the disease, and new variants of SARS-CoV-2 continue to emerge. For medical treatment, essential and informative genes can lead to accurate tests of whether an individual has contracted COVID-19 and can help develop highly effective vaccines, antiviral drugs, and treatments. As a result, identifying critical genes related to COVID-19 has become an urgent task for medical researchers. We conducted a competing risk analysis using the max-linear logistic regression model to analyze 126 blood samples from COVID-19-positive and COVID-19-negative patients. Our research led to a competing COVID-19 risk classifier derived from 19,472 genes and their differential expression values. The final classifier involves only five critical genes, ABCB6, KIAA1614, MND1, SMG1, and RIPK3, which yield 100% sensitivity and 100% specificity on the 126 samples. Given their 100% accuracy in predicting COVID-19-positive or -negative status, these five genes can be critical in developing proper, focused, and accurate COVID-19 testing procedures, guiding second-generation vaccine development, and studying antiviral drugs and treatments. We expect these five genes to motivate numerous new lines of COVID-19 research.
Subsampling is an effective way to deal with big data problems, and many subsampling approaches have been proposed for different models, such as leverage sampling for linear regression models and local case-control sampling for logistic regression models. In this article, we focus on optimal subsampling methods, which draw samples according to optimal subsampling probabilities formulated by minimizing some function of the asymptotic distribution. Optimal subsampling methods have been investigated for logistic regression models, softmax regression models, generalized linear models, quantile regression models, and quasi-likelihood estimation. Real data examples are provided to show how optimal subsampling methods are applied.
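A minimal sketch of a two-step optimal subsampling scheme for logistic regression, in the spirit of the methods surveyed (the probability formula shown is one common choice and may differ in detail from any particular paper):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def optimal_subsample_logistic(X, y, n_pilot=1000, n_sub=5000, rng=None):
    """Two-step subsampling: (1) fit a pilot estimate on a uniform subsample,
    (2) draw the main subsample with probabilities proportional to
    |y_i - p_i| * ||x_i||, then fit a weighted logistic regression."""
    rng = np.random.default_rng(rng)
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)  # y coded 0/1
    n = len(y)

    # Step 1: uniform pilot subsample and pilot fit.
    pilot_idx = rng.choice(n, size=n_pilot, replace=False)
    pilot = LogisticRegression().fit(X[pilot_idx], y[pilot_idx])

    # Step 2: optimal subsampling probabilities (one common criterion).
    p = pilot.predict_proba(X)[:, 1]
    scores = np.abs(y - p) * np.linalg.norm(X, axis=1)
    probs = scores / scores.sum()

    sub_idx = rng.choice(n, size=n_sub, replace=True, p=probs)
    # Inverse-probability weights correct for the unequal sampling.
    weights = 1.0 / (n * probs[sub_idx])
    return LogisticRegression().fit(X[sub_idx], y[sub_idx], sample_weight=weights)
```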
This paper proposes a procedure to execute external source code from a LaTeX document and automatically include the calculation outputs in the resulting Portable Document Format (PDF) file. It integrates programming tools into the LaTeX writing tool to facilitate the production of reproducible research. In our proposed approach to a LaTeX-based scientific notebook, the user can easily invoke any programming language or a command-line program when compiling the LaTeX document, while using their favorite LaTeX editor in the writing process. The required LaTeX setup, a new Python package, and the defined preamble are discussed in detail, and working examples using R, Julia, and MATLAB to reproduce existing research are provided to illustrate the proposed procedure. We also demonstrate how to include system setting information in a paper by invoking shell scripts when compiling the document.
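As a generic illustration of the underlying mechanism (using LaTeX's standard shell-escape facility rather than the paper's specific package; the script name compute_result.py is hypothetical), an external program can be run at compile time and its output pulled into the document:

```latex
% Compile with: pdflatex -shell-escape paper.tex
\documentclass{article}
\begin{document}
% Run an external program and capture its output in a text file ...
\immediate\write18{python compute_result.py > result.tex}
% ... then include that output directly in the PDF.
The estimated coefficient is \input{result.tex}.
\end{document}
```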
Climate change is widely recognized as one of the most challenging, urgent, and complex problems facing humanity. There is rising interest in understanding and quantifying it. We analyze the climate trend in Canada using Canadian monthly surface air temperature, which is longitudinal in nature with a long time span. Analysis of such data is challenging due to the complexity of modeling and the associated computational burden. In this paper, we divide this type of longitudinal data into time blocks, conduct multivariate regression, and utilize a vine copula model to account for the dependence among the multivariate error terms. The vine copula model allows separate specification of the within-block and between-block dependence structures and has great flexibility in modeling complex association structures. To relieve the computational burden and concentrate on the structure of interest, we construct composite likelihood functions, which leave the connecting structure between time blocks unspecified. We discuss different estimation procedures and issues regarding model selection and prediction. We explore the prediction performance of our vine copula model through extensive simulation studies. An analysis of the Canadian climate dataset is provided.
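The composite-likelihood idea can be sketched as follows (the notation is illustrative, not the paper's exact specification): with the series split into time blocks \(\mathbf{y}_1, \dots, \mathbf{y}_B\), the composite likelihood multiplies block-level contributions and leaves the between-block connection unspecified,
\[
L_C(\boldsymbol{\theta}) = \prod_{b=1}^{B} f(\mathbf{y}_b; \boldsymbol{\theta}),
\]
where each block density \(f(\mathbf{y}_b; \boldsymbol{\theta})\) combines the marginal regression models for the observations in block \(b\) with a vine copula for the dependence among the within-block error terms. Maximizing \(\log L_C(\boldsymbol{\theta})\) avoids specifying, and computing with, the full joint distribution across all blocks.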