Inside Out: Externalizing Assumptions in Data Analysis as Validation Checks

Zhang, H. Sherry; Peng, Roger D.

doi:10.6339/25-JDS1206

Journal of Data Science

Pub. online: 9 March 2026

Type: Data Science In Action

Open Access

Received: 24 December 2024

Accepted: 29 October 2025

Published: 9 March 2026

Abstract

In data analysis, unexpected results often prompt researchers to revisit their procedures to identify potential issues. While some researchers may struggle to identify the root causes, experienced researchers can often quickly diagnose problems by checking a few key assumptions. These checked assumptions, or expectations, are typically informal, difficult to trace, and rarely discussed in publications. In this paper, we introduce the term analysis validation checks to formalize and externalize these informal assumptions. We then introduce a procedure to identify a subset of checks that best predict the occurrence of unexpected outcomes, based on simulations of the original data. The checks are evaluated in terms of accuracy, determined by binary classification metrics, and independence, which measures the shared information among checks. We demonstrate this approach with a toy example using step count data and a generalized linear model example examining the effect of particulate matter air pollution on daily mortality.
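The evaluation idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation (their code is in R, in the supplementary `adtoolbox` package), and every specific below is an assumption made for illustration: the step-count simulator, the 6000-step threshold that defines an "unexpected" weekly mean, the two checks, and the use of mutual information as a stand-in for the paper's independence measure. A failed check is treated as a prediction that the outcome is unexpected, so accuracy can be summarized with precision and recall.

```python
import math
import random
from collections import Counter


def precision_recall(check_failed, unexpected):
    """Score a check, treating a failed check as predicting an unexpected outcome."""
    pairs = list(zip(check_failed, unexpected))
    tp = sum(1 for f, u in pairs if f and u)
    fp = sum(1 for f, u in pairs if f and not u)
    fn = sum(1 for f, u in pairs if u and not f)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def mutual_information(a, b):
    """Mutual information in bits between two binary check results.

    Used here as an assumed proxy for shared information among checks.
    """
    n = len(a)
    joint = Counter(zip(a, b))
    pa, pb = Counter(a), Counter(b)
    return sum(
        (c / n) * math.log2((c / n) / ((pa[x] / n) * (pb[y] / n)))
        for (x, y), c in joint.items()
    )


def simulate_week(rng):
    """One simulated week of daily steps (hypothetical process).

    With probability 0.15 the tracker is not worn and the day records 0 steps.
    """
    return [
        0 if rng.random() < 0.15 else max(0, int(rng.gauss(8000, 2500)))
        for _ in range(7)
    ]


rng = random.Random(1)
weeks = [simulate_week(rng) for _ in range(500)]

# Hypothetical "unexpected outcome": weekly mean below 6000 steps.
unexpected = [sum(w) / 7 < 6000 for w in weeks]

# Hypothetical checks; True means the check FAILED for that simulated week.
failed = {
    "no_zero_days": [any(s == 0 for s in w) for w in weeks],
    "min_at_least_2000": [min(w) < 2000 for w in weeks],
}

for name, result in failed.items():
    p, r = precision_recall(result, unexpected)
    print(f"{name}: precision={p:.2f} recall={r:.2f}")

shared = mutual_information(failed["no_zero_days"], failed["min_at_least_2000"])
print(f"shared information: {shared:.3f} bits")
```

Under these assumptions, a check with high precision and recall that shares little information with the other checks would be a natural candidate to keep. The two checks here overlap by construction (a zero-step day also fails the 2000-step minimum), so their mutual information is positive.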

Supplementary material

The supplementary materials include a full script of the examples in the paper (index.R) and its output (index.html), the data used in the examples in Section 5 (data/), the package source (adtoolbox_0.1.0.tar.gz), and a README.md file with installation instructions for running the scripts.

© 2025 The Author(s). Published by the School of Statistics and the Center for Applied Statistics, Renmin University of China.

Open access article under the CC BY license.

Keywords

data analysis assumptions diagnostic logic regression
