Quantifying the Alignment of a Data Analysis Between Analyst and Audience
Pub. online: 12 June 2025
Type: Education In Data Science
Open Access
Received: 26 November 2024
Accepted: 22 May 2025
Published: 12 June 2025
Abstract
A challenge that data scientists face is building an analytic product that is useful and trustworthy for a given audience. Previously, a set of principles for describing data analyses was defined that can be used to create a data analysis and to characterize the variation between analyses. Here, we introduce a concept called the alignment of a data analysis between the data analyst and an audience. We define an aligned data analysis as one in which the principles of the analyst match those of the audience for whom the analysis is developed. In this paper, we propose a model for evaluating the alignment of a data analysis and describe some of its properties. We argue that, more generally, this framework provides a language for characterizing alignment and can be used as a guide for practicing data scientists to build better data products.
Supplementary material
In the supplementary materials, we provide the lecture slides used for the case study and the code and data used for the analysis in Section 4.