Inside Out: Externalizing Assumptions in Data Analysis as Validation Checks
Pub. online: 9 March 2026
Type: Data Science In Action
Open Access
Received: 24 December 2024
Accepted: 29 October 2025
Published: 9 March 2026
Abstract
In data analysis, unexpected results often prompt researchers to revisit their procedures to identify potential issues. While some researchers may struggle to identify the root causes, experienced researchers can often quickly diagnose problems by checking a few key assumptions. These checked assumptions, or expectations, are typically informal, difficult to trace, and rarely discussed in publications. In this paper, we introduce the term analysis validation checks to formalize and externalize these informal assumptions. We then introduce a procedure to identify a subset of checks that best predict the occurrence of unexpected outcomes, based on simulations of the original data. The checks are evaluated in terms of accuracy, determined by binary classification metrics, and independence, which measures the shared information among checks. We demonstrate this approach with a toy example using step count data and a generalized linear model example examining the effect of particulate matter air pollution on daily mortality.
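The evaluation idea in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation or the API of its accompanying R package: it is written in Python, and the simulator, the three checks, the threshold defining an "unexpected" outcome, and the use of the phi coefficient as a stand-in for shared information between checks are all assumptions made for illustration. A check is treated as a binary classifier whose failure predicts an unexpected outcome across simulated data sets.

```python
import math
import random
from statistics import mean, median


def simulate_step_counts(rng, n_days=30):
    """Hypothetical simulator: one data set of daily step counts."""
    faulty = rng.random() < 0.3  # assumed: ~30% of data sets have sync failures
    days = []
    for _ in range(n_days):
        steps = max(0.0, rng.gauss(9000, 2500))
        if faulty and rng.random() < 0.4:
            steps = 0.0  # a failed sync is recorded as zero steps
        days.append(steps)
    return days


# Hypothetical checks: each returns True when the data set passes.
CHECKS = {
    "no_zero_days": lambda d: all(x > 0 for x in d),
    "median_above_7000": lambda d: median(d) > 7000,
    "max_below_30000": lambda d: max(d) < 30000,
}


def is_unexpected(days):
    """Assumed definition of an unexpected outcome: mean below 8000 steps."""
    return mean(days) < 8000


def classification_metrics(predicted, actual):
    """Precision, recall, and accuracy of boolean predictions."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(not p and a for p, a in zip(predicted, actual))
    tn = sum(not p and not a for p, a in zip(predicted, actual))
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "accuracy": (tp + tn) / len(actual),
    }


def phi(x, y):
    """Phi coefficient between two boolean vectors (0.0 if undefined)."""
    n11 = sum(a and b for a, b in zip(x, y))
    n10 = sum(a and not b for a, b in zip(x, y))
    n01 = sum(not a and b for a, b in zip(x, y))
    n00 = sum(not a and not b for a, b in zip(x, y))
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    return (n11 * n00 - n10 * n01) / denom if denom else 0.0


def run_demo(seed=0, n_sims=500):
    """Score each check on simulated data; return metrics and pairwise phi."""
    rng = random.Random(seed)
    data_sets = [simulate_step_counts(rng) for _ in range(n_sims)]
    unexpected = [is_unexpected(d) for d in data_sets]
    # A failed check is the "positive" prediction of an unexpected outcome.
    failed = {name: [not check(d) for d in data_sets]
              for name, check in CHECKS.items()}
    metrics = {name: classification_metrics(f, unexpected)
               for name, f in failed.items()}
    names = list(CHECKS)
    pairs = {(a, b): phi(failed[a], failed[b])
             for i, a in enumerate(names) for b in names[i + 1:]}
    return metrics, pairs


if __name__ == "__main__":
    metrics, pairs = run_demo()
    for name, m in metrics.items():
        print(name, {k: round(v, 3) for k, v in m.items()})
    for pair, value in pairs.items():
        print(pair, round(value, 3))
```

In this sketch, a check that never fails (here, the assumed upper bound on daily steps) scores zero precision and recall, while two checks whose failures are highly correlated carry overlapping information, which is the kind of redundancy an independence measure is meant to expose.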
Supplementary material
The supplementary materials include a full script of the examples in the paper (index.R) and its output (index.html), the data used in the examples in Section 5 (data/), the package source (adtoolbox_0.1.0.tar.gz), and a README.md file containing the install instructions for running the scripts.