Volume 23, Issue 4 (2025): Special Issue: In honor of Prof. Xizhi Wu for his transformative contributions to statistics and data science in China, October 2025
Journal: Journal of Data Science
Volume 23, Issue 4 (2025): Special Issue: In honor of Prof. Xizhi Wu for his transformative contributions to statistics and data science in China, pp. 575–577
Pub. online: 30 Jan 2025. Type: Statistical Data Science. Open Access.
Journal: Journal of Data Science
Volume 23, Issue 4 (2025): Special Issue: In honor of Prof. Xizhi Wu for his transformative contributions to statistics and data science in China, pp. 578–591
Abstract
Loan behavior modeling is crucial in financial engineering. In particular, predicting loan prepayment from large-scale historical time series data on a massive number of customers is challenging. Existing approaches, such as logistic regression or nonparametric regression, can model only the direct relationship between the features and the prepayments. Motivated by the need to extract the hidden states of loan behavior, we propose the smoothing spline state space (QuadS) model based on a hidden Markov model with varying transition and emission matrices modeled by smoothing splines. In contrast to existing methods, our method benefits from capturing the loans’ unobserved state transitions, which not only improves prediction performance but also provides more interpretability. The overall model is learned by iterating the EM algorithm, and within each iteration, smoothing splines are fitted with penalized least squares. Simulation studies demonstrate the effectiveness of the proposed method. Furthermore, a real-world case study using loan data from the Federal National Mortgage Association illustrates the practical applicability of our model. The QuadS model not only provides reliable predictions but also uncovers meaningful, hidden behavior patterns that can offer valuable insights for the financial industry.
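As a rough illustration of the hidden Markov structure described in this abstract, the sketch below evaluates the log-likelihood of an observation sequence with the scaled forward recursion when the transition and emission matrices change over time. It is not the QuadS implementation: the function name `forward_loglik` and the argument layout are assumptions made for this example, and the time-varying matrices are simply passed in as lists rather than being fitted by smoothing splines.

```python
import math


def forward_loglik(obs, init, trans, emit):
    """Log-likelihood of a discrete HMM with time-varying matrices.

    obs   : list of observed symbol indices, one per time step
    init  : init[i] is the probability of starting in hidden state i
    trans : trans[t][i][j] is P(state j at t+1 | state i at t)
    emit  : emit[t][i][o] is P(symbol o at t | state i at t)
    (All names and shapes are hypothetical, chosen for this sketch.)
    """
    n_states = len(init)
    # Forward variable at time 0: prior times emission probability.
    alpha = [init[i] * emit[0][i][obs[0]] for i in range(n_states)]
    loglik = 0.0
    for t in range(len(obs)):
        if t > 0:
            # Propagate through the transition matrix for step t-1 -> t,
            # then weight by the emission probability at time t.
            alpha = [
                sum(alpha[i] * trans[t - 1][i][j] for i in range(n_states))
                * emit[t][j][obs[t]]
                for j in range(n_states)
            ]
        # Rescale to avoid underflow; the scale factors carry the likelihood.
        scale = sum(alpha)
        loglik += math.log(scale)
        alpha = [a / scale for a in alpha]
    return loglik
```

In an EM fit, a recursion of this kind (together with its backward counterpart) would supply the expected state occupancies used in the E-step; how QuadS organizes that step is not stated in the abstract.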
Pub. online: 26 Feb 2025. Type: Statistical Data Science. Open Access.
Journal: Journal of Data Science
Volume 23, Issue 4 (2025): Special Issue: In honor of Prof. Xizhi Wu for his transformative contributions to statistics and data science in China, pp. 592–606
Abstract
Cellular deconvolution is a key approach to deciphering the complex cellular makeup of tissues by inferring the composition of cell types from bulk data. Traditionally, deconvolution methods have focused on a single molecular modality, relying either on RNA sequencing (RNA-seq) to capture gene expression or on DNA methylation (DNAm) to reveal epigenetic profiles. While these single-modality approaches have provided important insights, they often lack the depth needed to fully understand the intricacies of cellular compositions, especially in complex tissues. To address these limitations, we introduce EMixed, a versatile framework designed for both single-modality and multi-omics cellular deconvolution. EMixed models raw RNA counts and DNAm counts or frequencies via allocation models that assign RNA transcripts and DNAm reads to cell types, and uses an expectation-maximization (EM) algorithm to estimate parameters. Benchmarking results demonstrate that EMixed significantly outperforms existing methods across both single-modality and multi-modality applications, underscoring the broad utility of this approach in enhancing our understanding of cellular heterogeneity.
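To make the idea of an allocation model fitted by EM concrete, here is a minimal sketch for a single modality. It assumes that each bulk read comes from one cell type, that the per-cell-type feature profiles are known, and that only the mixing proportions are estimated. This is a textbook multinomial mixture, not the EMixed model itself; `em_proportions`, `counts`, and `profiles` are hypothetical names.

```python
def em_proportions(counts, profiles, n_iter=200):
    """Estimate cell-type proportions from bulk counts by EM.

    counts   : counts[g] is the number of reads observed for feature g
    profiles : profiles[g][k] is P(feature g | read comes from cell type k),
               assumed known here (an assumption of this sketch)
    Returns a list of estimated proportions, one per cell type.
    """
    n_types = len(profiles[0])
    pi = [1.0 / n_types] * n_types  # start from equal proportions
    for _ in range(n_iter):
        expected = [0.0] * n_types
        for y, row in zip(counts, profiles):
            # Mixture probability of this feature under the current proportions.
            mix = sum(w * p for w, p in zip(pi, row))
            if mix <= 0.0:
                continue  # feature impossible under current estimate; skip it
            # E-step: allocate the y reads of this feature across cell types.
            for k in range(n_types):
                expected[k] += y * pi[k] * row[k] / mix
        # M-step: proportions are the normalized expected allocations.
        total = sum(expected)
        pi = [e / total for e in expected]
    return pi
```

For example, with `profiles = [[0.7, 0.1], [0.2, 0.3], [0.1, 0.6]]` and counts generated exactly from proportions (0.3, 0.7), the iteration recovers those proportions.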
Pub. online: 27 May 2025. Type: Statistical Data Science. Open Access.
Journal: Journal of Data Science
Volume 23, Issue 4 (2025): Special Issue: In honor of Prof. Xizhi Wu for his transformative contributions to statistics and data science in China, pp. 607–623
Abstract
We propose to explore high-dimensional data with categorical outcomes by generalizing the penalized orthogonal-components regression method (POCRE), a supervised dimension reduction method initially proposed for high-dimensional linear regression. This generalized POCRE, i.e., gPOCRE, sequentially builds up orthogonal components by selecting predictors which maximally explain the variation of the response variables. Therefore, gPOCRE simultaneously selects significant predictors and reduces dimensions by constructing linear components of these selected predictors for a high-dimensional generalized linear model. For multiple categorical outcomes, gPOCRE can also construct common components shared by all outcomes to improve the power of selecting variables shared by multiple outcomes. Both simulation studies and real data analysis are carried out to illustrate the performance of gPOCRE.
Pub. online: 26 Mar 2025. Type: Statistical Data Science. Open Access.
Journal: Journal of Data Science
Volume 23, Issue 4 (2025): Special Issue: In honor of Prof. Xizhi Wu for his transformative contributions to statistics and data science in China, pp. 624–637
Abstract
An extensive literature addresses the analysis of correlated survival data. Subjects within a cluster share some common characteristics, e.g., genetic and environmental factors, so their time-to-event outcomes are correlated. The frailty model under the proportional hazards assumption has been widely applied to the analysis of clustered survival outcomes. However, the prediction performance of this method can be less satisfactory when the risk factors have complicated effects, e.g., nonlinear and interactive. To deal with these issues, we propose a neural network frailty Cox model that replaces the linear risk function with the output of a feed-forward neural network. The estimation is based on quasi-likelihood using Laplace approximation. A simulation study suggests that the proposed method has the best performance compared with existing methods. The method is applied to the clustered time-to-failure prediction within the kidney transplantation facility using the national kidney transplant registry data from the U.S. Organ Procurement and Transplantation Network. All computer programs are available at https://github.com/rivenzhou/deep_learning_clustered.
Pub. online: 28 Jan 2025. Type: Statistical Data Science. Open Access.
Journal: Journal of Data Science
Volume 23, Issue 4 (2025): Special Issue: In honor of Prof. Xizhi Wu for his transformative contributions to statistics and data science in China, pp. 638–647
Abstract
In many medical comparative studies, subjects may provide either bilateral or unilateral data. While numerous testing procedures have been proposed for bilateral data that account for the intra-class correlation between paired organs of the same individual, few studies have thoroughly explored combined correlated bilateral and unilateral data. Ma and Wang (2021) introduced three test procedures based on the maximum likelihood estimation (MLE) algorithm for general g groups. In this article, we employ a model-based approach that treats the measurements from both eyes of each subject as repeated observations. We then compare this approach with Ma and Wang’s Score test procedure. Monte Carlo simulations demonstrate that the MLE-based Score test offers certain advantages under specific conditions. However, this model-based method lacks an explicit form for the test statistic, limiting its potential for further development of an exact test.
Pub. online: 20 Jan 2025. Type: Computing In Data Science. Open Access.
Journal: Journal of Data Science
Volume 23, Issue 4 (2025): Special Issue: In honor of Prof. Xizhi Wu for his transformative contributions to statistics and data science in China, pp. 648–658
Abstract
Piecewise linear-quadratic (PLQ) functions are a fundamental function class in convex optimization, especially within the Empirical Risk Minimization (ERM) framework, which employs various PLQ loss functions. This paper provides a workflow for decomposing a general convex PLQ loss into its ReLU-ReHU representation, along with a Python implementation designed to enhance the efficiency of presenting and solving ERM problems, particularly when integrated with ReHLine (a powerful solver for PLQ ERMs). Our proposed package, plqcom, accepts three representations of PLQ functions and offers user-friendly APIs for verifying their convexity and continuity. The Python package is available at https://github.com/keepwith/PLQComposite.
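The ReLU-ReHU representation mentioned in this abstract can be illustrated with a few familiar losses. The sketch below does not use the plqcom or ReHLine APIs; it only assumes the composite form L(z) = Σ ReLU(u·z + v) + Σ ReHU_τ(s·z + t), where ReHU_τ(z) is 0 for z ≤ 0, z²/2 for 0 < z ≤ τ, and τ(z − τ/2) for z > τ. The helper names and the tuple layout are hypothetical.

```python
import math


def relu(z):
    """ReLU(z) = max(z, 0)."""
    return max(z, 0.0)


def rehu(z, tau=math.inf):
    """ReHU_tau(z): zero, then quadratic up to tau, then linear."""
    if z <= 0.0:
        return 0.0
    if z <= tau:
        return 0.5 * z * z
    return tau * (z - 0.5 * tau)


def composite(z, relu_terms, rehu_terms):
    """Evaluate sum ReLU(u*z + v) + sum ReHU_tau(s*z + t).

    relu_terms : list of (u, v) pairs
    rehu_terms : list of (s, t, tau) triples
    """
    total = sum(relu(u * z + v) for u, v in relu_terms)
    total += sum(rehu(s * z + t, tau) for s, t, tau in rehu_terms)
    return total


# Hinge loss max(0, 1 - z) is a single ReLU term.
HINGE = {"relu_terms": [(-1.0, 1.0)], "rehu_terms": []}

# Half squared loss z^2 / 2 is two ReHU terms with tau = infinity,
# one covering z > 0 and one covering z < 0.
SQUARED = {
    "relu_terms": [],
    "rehu_terms": [(1.0, 0.0, math.inf), (-1.0, 0.0, math.inf)],
}


def huber_terms(delta):
    """Huber loss with threshold delta as two mirrored ReHU terms."""
    return {
        "relu_terms": [],
        "rehu_terms": [(1.0, 0.0, delta), (-1.0, 0.0, delta)],
    }
```

For instance, `composite(0.25, **HINGE)` evaluates the hinge loss at z = 0.25, and `composite(z, **huber_terms(1.5))` evaluates the Huber loss with threshold 1.5.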
Pub. online: 9 May 2025. Type: Data Science In Action. Open Access.
Journal: Journal of Data Science
Volume 23, Issue 4 (2025): Special Issue: In honor of Prof. Xizhi Wu for his transformative contributions to statistics and data science in China, pp. 659–675
Abstract
Forecasting is essential for optimizing resource allocation, particularly during crises such as the unprecedented COVID-19 pandemic. This paper focuses on developing an algorithm for generating k-step-ahead interval forecasts for autoregressive time series. Unlike conventional methods that assume a fixed distribution, our approach utilizes kernel distribution estimation to accommodate the unknown distribution of prediction errors. This flexibility is crucial in real-world data, where deviations from normality are common, and neglecting these deviations can result in inaccurate predictions and unreliable confidence intervals. We evaluate the performance of our method through simulation studies on various autoregressive time series models. The results show that the proposed approach performs robustly even with sample sizes as small as 50 observations. Moreover, our method outperforms traditional linear model-based prediction intervals and those derived from the empirical distribution function, particularly when the underlying data distribution is non-normal. This highlights the algorithm’s flexibility and accuracy for interval forecasting in non-Gaussian contexts. We also apply the method to log-transformed weekly COVID-19 case counts from lower-middle-income countries, covering the period from June 1, 2020, to March 13, 2022.
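One way to picture a k-step-ahead interval built from a kernel distribution estimate is sketched below for an AR(1) model. Everything here is an assumption for illustration, not the paper's algorithm: the model order, the least-squares fit, the use of in-sample k-step prediction errors, the Gaussian kernel, and the rule-of-thumb bandwidth are all choices made for this example, and the function names are hypothetical.

```python
import math
import statistics


def fit_ar1(x):
    """Least-squares fit of x[t] = c + phi * x[t-1] + e[t]; returns (c, phi)."""
    prev, curr = x[:-1], x[1:]
    mp, mc = statistics.fmean(prev), statistics.fmean(curr)
    sxx = sum((a - mp) ** 2 for a in prev)
    sxy = sum((a - mp) * (b - mc) for a, b in zip(prev, curr))
    phi = sxy / sxx
    return mc - phi * mp, phi


def kernel_cdf(u, sample, h):
    """Gaussian-kernel distribution estimate: average of Phi((u - e_i) / h)."""
    return sum(
        0.5 * (1.0 + math.erf((u - e) / (h * math.sqrt(2.0)))) for e in sample
    ) / len(sample)


def kernel_quantile(p, sample, h, tol=1e-8):
    """Invert the smooth, increasing kernel CDF by bisection."""
    lo, hi = min(sample) - 10.0 * h, max(sample) + 10.0 * h
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if kernel_cdf(mid, sample, h) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def interval_forecast(x, k, level=0.95):
    """Return (lower, point, upper) for x[len(x) - 1 + k] under an AR(1) fit."""
    c, phi = fit_ar1(x)
    drift = c * sum(phi ** j for j in range(k))  # c * (1 + phi + ... + phi^(k-1))
    # In-sample k-step prediction errors, used as the error sample.
    errs = [x[t + k] - (drift + phi ** k * x[t]) for t in range(len(x) - k)]
    # Rule-of-thumb bandwidth (an assumption; other choices are possible).
    h = 1.06 * statistics.stdev(errs) * len(errs) ** (-0.2)
    point = drift + phi ** k * x[-1]
    a = 0.5 * (1.0 - level)
    lower = point + kernel_quantile(a, errs, h)
    upper = point + kernel_quantile(1.0 - a, errs, h)
    return lower, point, upper
```

Because the quantiles come from a smoothed estimate of the error distribution rather than from a normal approximation, the interval need not be symmetric around the point forecast.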
Pub. online: 20 Jan 2025. Type: Data Science Reviews. Open Access.
Journal: Journal of Data Science
Volume 23, Issue 4 (2025): Special Issue: In honor of Prof. Xizhi Wu for his transformative contributions to statistics and data science in China, pp. 676–694
Abstract
Deep neural networks have a wide range of applications in data science. This paper reviews neural network modeling algorithms and their applications in both supervised and unsupervised learning. Key examples include: (i) binary classification and (ii) nonparametric regression function estimation, both implemented with feedforward neural networks (FNN); (iii) sequential data prediction using long short-term memory (LSTM) networks; and (iv) image classification using convolutional neural networks (CNN). All implementations are provided in MATLAB, making these methods accessible to statisticians and data scientists to support learning and practical application.
Pub. online: 28 Oct 2025. Type: Data Science Conversation. Open Access.
Journal: Journal of Data Science
Volume 23, Issue 4 (2025): Special Issue: In honor of Prof. Xizhi Wu for his transformative contributions to statistics and data science in China, pp. 695–715
Abstract
Over the past three decades, the discipline of statistics has undergone profound transformation, driven by the rapid emergence of data science and artificial intelligence. These developments have reshaped methodological paradigms and introduced new challenges and opportunities for statistical education, particularly in China. In this context, Professor Xizhi Wu from the School of Statistics at Renmin University of China has remained closely engaged with the evolving landscape, demonstrating keen insight and a forward-looking perspective. Through sustained contributions to teaching, research, and educational reform, Professor Wu has deeply influenced generations of students and educators, playing a pivotal role in the advancement of statistical education. To document and reflect on this legacy, the Capital of Statistics conducted an in-depth interview with Professor Wu, focusing on his academic trajectory, professional contributions, and perspectives on the future of the discipline. The conversation also recounts meaningful interactions with his students, offering a multidimensional portrait of a life devoted to statistics.