Journal of Data Science logo


Login Register

  1. Home
  2. Issues
  3. Volume 17, Issue 1 (2019)
  4. Automating Data Analysis Methods in Epid ...

Journal of Data Science

Submit your article Information
  • Article info
  • Related articles
  • More
    Article info Related articles

Automating Data Analysis Methods in Epidemiology
Volume 17, Issue 1 (2019), pp. 55–80
George Choueiry   Pascale Salameh  

Authors

 
Placeholder
https://doi.org/10.6339/JDS.201901_17(1).0003
Pub. online: 4 August 2022      Type: Research Article      Open accessOpen Access

Published
4 August 2022

Abstract

Technological advances in software development effectively handled technical details that made life easier for data analysts, but also allowed for nonexperts in statistics and computer science to analyze data. As a result, medical research suffers from statistical errors that could be otherwise prevented such as errors in choosing a hypothesis test and assumption checking of models. Our objective is to create an automated data analysis software package that can help practitioners run non-subjective, fast, accurate and easily interpretable analyses. We used machine learning to predict the normality of a distribution as an alternative to normality tests and graphical methods to avoid their downsides. We implemented methods for detecting outliers, imputing missing values, and choosing a threshold for cutting numerical variables to correct for non-linearity before running a linear regression. We showed that data analysis can be automated. Our normality prediction algorithm outperformed the Shapiro-Wilk test in small samples with Matthews correlation coefficient of 0.5 vs. 0.16. The biggest drawback was that we did not find alternatives for statistical tests to test linear regression assumptions which are problematic in large datasets. We also applied our work to a dataset about smoking in teenagers. Because of the opensource nature of our work, these algorithms can be used in future research and projects.

Related articles PDF XML
Related articles PDF XML

Copyright
No copyright data available.

Keywords
automation computer software machine learning normal distribution

Metrics
since February 2021
901

Article info
views

560

PDF
downloads

Export citation

Copy and paste formatted citation
Placeholder

Download citation in file


Share


RSS

Journal of data science

  • Online ISSN: 1683-8602
  • Print ISSN: 1680-743X

About

  • About journal

For contributors

  • Submit
  • OA Policy
  • Become a Peer-reviewer

Contact us

  • JDS@ruc.edu.cn
  • No. 59 Zhongguancun Street, Haidian District Beijing, 100872, P.R. China
Powered by PubliMill  •  Privacy policy