Quantifying the Alignment of a Data Analysis Between Analyst and Audience

D’Agostino McGowan, Lucy; Peng, Roger D.; Hicks, Stephanie C.

doi:10.6339/25-JDS1189

Journal of Data Science

Pub. online: 12 June 2025
Type: Education in Data Science

Open Access

Received: 26 November 2024
Accepted: 22 May 2025
Published: 12 June 2025

Abstract

A challenge that data scientists face is building an analytic product that is useful and trustworthy for a given audience. Previously, a set of principles for describing data analyses was defined that can be used to create a data analysis and to characterize the variation between analyses. Here, we introduce the concept of the alignment of a data analysis between the data analyst and an audience. We define an aligned data analysis as one in which the principles of the analyst match those of the audience for whom the analysis is developed. In this paper, we propose a model for evaluating the alignment of a data analysis and describe some of its properties. We argue that, more generally, this framework provides a language for characterizing alignment and can guide practicing data scientists in building better data products.

Supplementary material

In the supplementary materials we provide the lecture slides used for the case study and the code and data used for the analysis in Section 4.

© 2025 The Author(s). Published by the School of Statistics and the Center for Applied Statistics, Renmin University of China.

Open access article under the CC BY license.

Keywords

analytic design theory; data science; evaluation

Funding

The authors do not have any funding to acknowledge.
