The coronavirus disease of 2019 (COVID-19) is a pandemic. To characterize its transmissibility, we propose a Bayesian change point detection model using daily actively infectious cases. Our model builds on a Bayesian Poisson segmented regression model that 1) captures the epidemiological dynamics under the changing conditions caused by external or internal factors; 2) provides uncertainty estimates of both the number and locations of change points; and 3) has the potential to adjust for any time-varying covariate effects. Our model can be used to evaluate public health interventions, identify latent events associated with spreading rates, and yield better short-term forecasts.
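As a rough illustration of how uncertainty about a change point location can be quantified, the sketch below computes a posterior over a single change point in a series of Poisson counts. It is a deliberately simplified stand-in, not the segmented regression model of the abstract: it assumes exactly one change point, a piecewise-constant rate, a conjugate Gamma(a, b) prior on each rate, and a uniform prior on the location. The function names and the example counts are hypothetical.

```python
import math


def segment_log_marginal(counts, a=1.0, b=1.0):
    """Log marginal likelihood of Poisson counts with a Gamma(a, b) prior on the rate.

    The product of 1/y! terms is omitted because it is the same for every
    way of splitting the series and cancels when the posterior is normalized.
    """
    s, n = sum(counts), len(counts)
    return (a * math.log(b) - math.lgamma(a)
            + math.lgamma(a + s) - (a + s) * math.log(b + n))


def change_point_posterior(counts, a=1.0, b=1.0):
    """Posterior P(change after the first t observations), for t = 1..n-1.

    Assumes exactly one change point and a uniform prior on its location.
    """
    n = len(counts)
    locations = range(1, n)
    log_post = [segment_log_marginal(counts[:t], a, b)
                + segment_log_marginal(counts[t:], a, b)
                for t in locations]
    # Normalize with the log-sum-exp trick for numerical stability.
    m = max(log_post)
    weights = [math.exp(lp - m) for lp in log_post]
    total = sum(weights)
    return {t: w / total for t, w in zip(locations, weights)}


if __name__ == "__main__":
    # Hypothetical daily counts whose rate jumps after the sixth day.
    daily = [2, 3, 2, 1, 3, 2, 12, 15, 11, 14, 13, 12]
    posterior = change_point_posterior(daily)
    print(max(posterior, key=posterior.get))
```

A full treatment along the lines of the abstract would additionally place a prior on the number of change points and let the log rate vary linearly within each segment.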
Researchers and public officials tend to agree that until a vaccine is readily available, stopping SARS-CoV-2 transmission is the name of the game. Testing is the key to preventing the spread, especially by asymptomatic individuals. With testing capacity restricted, group testing is an appealing alternative for comprehensive screening and has recently received FDA emergency authorization. This technique tests pools of individual samples, thereby often requiring fewer testing resources while potentially providing multiple folds of speedup. We approach group testing from a data science perspective and offer two contributions. First, we provide an extensive empirical comparison of modern group testing techniques based on simulated data. Second, we propose a simple one-round method based on ${\ell _{1}}$-norm sparse recovery, which outperforms current state-of-the-art approaches at certain disease prevalence rates.
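To make the resource saving from pooling concrete, here is a minimal sketch of the expected number of tests per person under classical two-stage (Dorfman) pooling. This is a different scheme from the one-round ${\ell _{1}}$-norm method proposed in the abstract, and it is used here only because its cost has a simple closed form. It assumes independent infections, a perfectly accurate test, and no dilution effect; the function names are hypothetical.

```python
def dorfman_tests_per_person(prevalence, pool_size):
    """Expected tests per person under two-stage Dorfman pooling.

    Each pool of `pool_size` samples is tested once; if the pool is positive,
    every member is retested individually. Assumes independent infections
    and an error-free test.
    """
    if pool_size == 1:
        return 1.0  # no pooling: one test per person
    p_pool_positive = 1.0 - (1.0 - prevalence) ** pool_size
    return 1.0 / pool_size + p_pool_positive


def best_pool_size(prevalence, max_pool=32):
    """Pool size in 1..max_pool that minimizes expected tests per person."""
    return min(range(1, max_pool + 1),
               key=lambda k: dorfman_tests_per_person(prevalence, k))


if __name__ == "__main__":
    for p in (0.01, 0.05, 0.20):
        k = best_pool_size(p)
        print(p, k, round(dorfman_tests_per_person(p, k), 3))
```

Under these assumptions the saving shrinks as prevalence rises, and at a prevalence of 0.5 pooling in pairs costs more than individual testing.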
Improving statistical learning models to solve classification or regression problems more efficiently is a goal pursued by the scientific community. In particular, the support vector machine has become one of the most successful algorithms for these tasks. Despite the strong predictive capacity of the support vector approach, its performance relies on the selection of the model's hyperparameters, such as the kernel function. Traditional procedures for choosing the kernel function are generally computationally expensive and become infeasible for certain datasets. In this paper, we propose a novel framework for kernel function selection called Random Machines. In both simulation scenarios and real-data benchmarks, the framework improved accuracy and reduced computational time.
Penalized regression provides an automated approach to perform simultaneous variable selection and parameter estimation and is a popular method to analyze high-dimensional data. Since the conception of the LASSO in the mid-to-late 1990s, extensive research has been done to improve penalized regression. The LASSO, and several of its variations, performs penalization symmetrically around zero. Thus, variables with the same magnitude are shrunk the same regardless of the direction of effect. To the best of our knowledge, sign-based shrinkage, that is, preferential shrinkage based on the sign of the coefficients, has yet to be explored under the LASSO framework. We propose a generalization of the LASSO, the asymmetric LASSO, that performs sign-based shrinkage. Our method is motivated by placing an asymmetric Laplace prior on the regression coefficients, rather than a symmetric Laplace prior. This corresponds to an asymmetric ${\ell _{1}}$ penalty under the penalized regression framework. In doing so, preferential shrinkage can be performed through an auxiliary tuning parameter that controls the degree of asymmetry. Our numerical studies indicate that the asymmetric LASSO performs better than the LASSO when effect sizes are sign skewed. Furthermore, in the presence of positively-skewed effects, the asymmetric LASSO is comparable to the non-negative LASSO without the need to place an a priori constraint on the effect estimates, and it outperforms the non-negative LASSO when negative effects are also present in the model. A real data example using the breast cancer gene expression data from The Cancer Genome Atlas is also provided, where the asymmetric LASSO identifies two potentially novel gene expressions that are associated with BRCA1, with a minor improvement in prediction performance over the LASSO and non-negative LASSO.
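One way to picture sign-based shrinkage is through the univariate thresholding rule that an asymmetric ${\ell _{1}}$ penalty induces. The sketch below assumes the penalty is written with two separate weights, `lam_pos` for positive coefficients and `lam_neg` for negative ones. That parameterization is an illustrative assumption and not necessarily the one used in the paper, which describes a single auxiliary parameter controlling the degree of asymmetry.

```python
def asymmetric_soft_threshold(z, lam_pos, lam_neg):
    """Minimizer over b of 0.5*(b - z)**2 + lam_pos*max(b, 0) + lam_neg*max(-b, 0).

    lam_pos and lam_neg are hypothetical non-negative penalty weights for
    positive and negative coefficients. With lam_pos == lam_neg this reduces
    to the usual LASSO soft-thresholding rule.
    """
    if z > lam_pos:
        return z - lam_pos   # positive side: shrink toward zero by lam_pos
    if z < -lam_neg:
        return z + lam_neg   # negative side: shrink toward zero by lam_neg
    return 0.0               # inside the dead zone: coefficient set to zero


if __name__ == "__main__":
    # Negative effects are penalized twice as heavily as positive ones.
    for z in (-3.0, -1.5, 1.5, 3.0):
        print(z, asymmetric_soft_threshold(z, lam_pos=1.0, lam_neg=2.0))
```

In this example inputs of equal magnitude and opposite sign are shrunk by different amounts, and letting `lam_neg` grow very large forces all negative coefficients to zero, which mimics a non-negativity constraint.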
Previous abstractive methods apply sequence-to-sequence structures to generate summaries without a module that helps the system detect vital mentions and relationships within a document. To address this problem, we utilize a semantic graph to boost generation performance. First, we extract important entities from each document and establish a graph inspired by the idea of distant supervision (Mintz et al., 2009). We then combine a Bi-LSTM with a graph encoder to obtain the representation of each graph node. A novel neural decoder is presented to leverage the information of such entity graphs. Automatic and human evaluations show the effectiveness of our technique.
A tumor cell population is a mixture of heterogeneous cell subpopulations, known as subclones. Identifying the clonal status of mutations, i.e., whether a mutation occurs in all tumor cells or in a subset of tumor cells, is crucial for understanding tumor progression and developing personalized treatment strategies. We make three major contributions in this paper: (1) we summarize terminologies in the literature based on a unified mathematical representation of subclones; (2) we develop a simulation algorithm to generate hypothetical sequencing data that are akin to real data; and (3) we present an ultra-fast computational method, Mutstats, to infer the clonal status of somatic mutations from sequencing data of tumors. The inference is based on a Gaussian mixture model for mutation multiplicities. To validate Mutstats, we evaluate its performance on simulated datasets as well as two breast carcinoma samples from The Cancer Genome Atlas project.
It is hypothesized that short-term exposure to air pollution may influence the transmission of aerosolized pathogens such as the virus that causes COVID-19. We used data from 23 provinces in Italy to build a generalized additive model to investigate the association between the effective reproductive number of the disease and air quality while controlling for ambient environmental variables and changes in human mobility. The model indicates a positive, nonlinear relationship between the density of particulate matter in the air and COVID-19 transmission, which is consistent with similar studies on other respiratory illnesses.
Splines are important tools for the flexible modeling of curves and surfaces in regression analyses. Functions for constructing spline basis functions are available in R through the base package splines. When the curves to be modeled have known characteristics in monotonicity or curvature, more efficient statistical inferences are possible with shape-restricted splines. Such splines, however, are not available in the R package splines. The package splines2 provides easy-to-use shape-restricted spline basis functions, along with their derivatives and integrals, which are important tools in many inference scenarios. It also provides additional splines and features that are not available in the splines package, such as periodic splines and generalized Bernstein polynomials. The use of these functions is illustrated with shape-restricted regression, recurrent event data analysis, and extreme-value copulas.