Pub. online: 28 Oct 2025 | Type: Data Science Conversation | Open Access
Journal: Journal of Data Science
Volume 23, Issue 4 (2025): Special Issue: Statistical Frontiers of Data Science, pp. 695–715
Abstract
Over the past three decades, the discipline of statistics has undergone profound transformation, driven by the rapid emergence of data science and artificial intelligence. These developments have reshaped methodological paradigms and introduced new challenges and opportunities for statistical education, particularly in China. In this context, Professor Xizhi Wu from the School of Statistics at Renmin University of China has remained closely engaged with the evolving landscape, demonstrating keen insight and a forward-looking perspective. Through sustained contributions to teaching, research, and educational reform, Professor Wu has deeply influenced generations of students and educators, playing a pivotal role in the advancement of statistical education. To document and reflect on this legacy, the Capital of Statistics conducted an in-depth interview with Professor Wu, focusing on his academic trajectory, professional contributions, and perspectives on the future of the discipline. The conversation also recounts meaningful interactions with his students, offering a multidimensional portrait of a life devoted to statistics.
Pub. online: 12 Jun 2025 | Type: Data Science In Action | Open Access
Journal: Journal of Data Science
Volume 24, Issue 1 (2026): Special Issue: Statistical aspects of Trustworthy Machine Learning, pp. 239–253
Abstract
A challenge that data scientists face is building an analytic product that is useful and trustworthy for a given audience. Previously, a set of principles for describing data analyses was defined that can be used to create a data analysis and to characterize the variation between analyses. Here, we introduce the concept of the alignment of a data analysis between the data analyst and an audience. We define an aligned data analysis as the matching of principles between the analyst and the audience for whom the analysis is developed. In this paper, we propose a model for evaluating the alignment of a data analysis and describe some of its properties. We argue that, more generally, this framework provides a language for characterizing alignment and can be used as a guide for practicing data scientists to build better data products.
Dr. David S. Salsburg’s career has been an exceptional one. He was the first statistician to work at Pfizer, Inc., and later became the first statistician from the pharmaceutical industry to be elected as an ASA fellow. He played a vital role as a statistician at Pfizer, Inc. at a time when the drug approval process was being developed. For his contributions, Dr. Salsburg was awarded the Career Achievement Award of the Biostatistics Section of the Pharmaceutical Research and Manufacturers of America in 1994, for “significant contributions to the advancement of biostatistics in the pharmaceutical industry”. Dr. Salsburg also achieved something rare among scientists: he popularized his field of research and made it accessible and enjoyable to laypeople. He is possibly best known for his book “The Lady Tasting Tea: How Statistics Revolutionized Science in the Twentieth Century”, in which he combines simple and engaging explanations of statistical methods, and of why they are needed, with personal stories about the people who developed them, told with a great deal of generosity, fondness, and humor. Dr. Salsburg’s admiration for those statisticians shines through. In this interview, Dr. Salsburg shares his own stories and perspectives, from his childhood, through his service in the Navy and his long and productive career at Pfizer, Inc., to his equally productive retirement, in which he authored “The Lady Tasting Tea” and other books.
Pub. online: 24 May 2024 | Type: Computing In Data Science | Open Access
Journal: Journal of Data Science
Volume 22, Issue 2 (2024): Special Issue: 2023 Symposium on Data Science and Statistics (SDSS): “Inquire, Investigate, Implement, Innovate”, pp. 208–220
Abstract
With the growing scale of big datasets, fitting novel statistical models on larger-than-memory datasets becomes correspondingly challenging. This document outlines the development and use of an API for large-scale modelling, with a demonstration given by the proof-of-concept platform largescaler, built specifically for developing statistical models for big datasets.
Journal: Journal of Data Science
Volume 20, Issue 3 (2022): Special Issue: Data Science Meets Social Sciences, pp. 413–436
Abstract
This paper provides an overview of how to use “big data” for social science research (with an emphasis on economics and finance). We investigate the performance and ease of use of different Spark applications running on a distributed file system to enable the handling and analysis of data sets which were previously not usable due to their size. More specifically, we explain how to use Spark to (i) explore big data sets which exceed the memory size of retail-grade computers and (ii) run typical statistical/econometric tasks, including cross-sectional, panel data and time series regression models, which are prohibitively expensive to evaluate on stand-alone machines. By bridging the gap between the abstract concept of Spark and ready-to-use examples which can easily be altered to suit the researcher's needs, we provide economists and social scientists more generally with the theory and practice to handle the ever-growing datasets available. The ease of reproducing the examples in this paper makes this guide a useful reference for researchers with a limited background in data handling and distributed computing.
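As a rough illustration of why regression on data that exceeds a single machine's memory is feasible at all, the sketch below fits a simple linear regression by computing small summary statistics per data partition and then merging them. It is plain Python, not Spark code and not the paper's own examples; the names `Stats`, `partition_stats` and `fit` are hypothetical and introduced here only for illustration. The assumption is that each partition can be read one row at a time, so no partition ever has to be held in memory in full.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class Stats:
    """Sufficient statistics for the model y = intercept + slope * x."""
    n: int = 0
    sx: float = 0.0   # sum of x
    sy: float = 0.0   # sum of y
    sxx: float = 0.0  # sum of x * x
    sxy: float = 0.0  # sum of x * y

    def merge(self, other: "Stats") -> "Stats":
        # Statistics from two partitions combine by simple addition,
        # which is what makes the computation easy to distribute.
        return Stats(
            self.n + other.n,
            self.sx + other.sx,
            self.sy + other.sy,
            self.sxx + other.sxx,
            self.sxy + other.sxy,
        )


def partition_stats(rows: Iterable[Tuple[float, float]]) -> Stats:
    """Stream over one partition of (x, y) rows, keeping only five numbers."""
    s = Stats()
    for x, y in rows:
        s.n += 1
        s.sx += x
        s.sy += y
        s.sxx += x * x
        s.sxy += x * y
    return s


def fit(partitions: Iterable[Iterable[Tuple[float, float]]]) -> Tuple[float, float]:
    """Return (intercept, slope) of the least-squares line over all partitions."""
    total = Stats()
    for part in partitions:
        total = total.merge(partition_stats(part))
    denom = total.n * total.sxx - total.sx ** 2
    if denom == 0:
        raise ValueError("x has no variation; slope is undefined")
    slope = (total.n * total.sxy - total.sx * total.sy) / denom
    intercept = (total.sy - slope * total.sx) / total.n
    return intercept, slope


if __name__ == "__main__":
    # Two toy partitions drawn from the exact line y = 1 + 2x.
    parts = [[(0.0, 1.0), (1.0, 3.0)], [(2.0, 5.0), (3.0, 7.0)]]
    print(fit(parts))
```

In a distributed setting the per-partition step would run on the workers and only the merged statistics would travel to the driver; the same idea extends to multiple regressors by accumulating the cross-product matrices instead of scalar sums.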