AUTOMATING DATA ANALYSIS METHODS IN EPIDEMIOLOGY

Technological advances in software development have taken care of technical details, making life easier for data analysts, but they have also allowed non-experts in statistics and computer science to analyze data. As a result, medical research suffers from statistical errors that could otherwise be prevented, such as errors in choosing a hypothesis test and in checking model assumptions. Our objective is to create an automated data analysis software package that can help practitioners run non-subjective, fast, accurate and easily interpretable analyses. We used machine learning to predict the normality of a distribution as an alternative to normality tests and graphical methods, to avoid their downsides. We implemented methods for detecting outliers, imputing missing values, and choosing a threshold for cutting numerical variables to correct for non-linearity before running a linear regression. We showed that data analysis can be automated. Our normality prediction algorithm outperformed the Shapiro-Wilk test in small samples, with a Matthews correlation coefficient of 0.5 vs. 0.16. The biggest drawback was that we did not find alternatives to the statistical tests used to check linear regression assumptions, which are problematic in large datasets. We also applied our work to a dataset about smoking in teenagers. Because of the open-source nature of our work, these algorithms can be used in future research and projects.


Introduction
Statistical errors are abundant in the medical literature, and it has been argued that most claimed research findings are false (Ioannidis, 2005). In particular, conditions and assumptions of hypothesis tests are rarely checked and reported (Hanif & Ajmal, 2011). Such checking can be done either with statistical testing, which does not work well especially with small sample sizes (Barker & Shaw, 2015), or with graphical methods, which have their own problem of introducing subjectivity into the analysis, especially because researchers are already biased towards getting statistically significant results. On the other hand, advancements in the field of statistical computing coupled with the power of open-source software have led the statistical programming language R to grow in popularity in epidemiological research (Haine, 2017). The community has built more than 12,000 packages for R that help solve a large variety of problems and provide data analysts with cutting-edge technology*. However large this growth has been in recent years, it has only impacted the minority of researchers who know how to code, as R is command driven and has a steep learning curve (Ozgur, Colliau, Rogers, Hughes, & Myer-Tyson, 2017), which is a serious disadvantage for non-programmers (Khan, 2013).
Recent developments in graphical user interface software improved the situation considerably: the data analyst is no longer required to write code and deal with mathematical details to get things done. Programs like SPSS Statistics opened the door for non-professional statisticians to work with data. However, this did not reduce the number of statistical errors in medical research (Ercan et al., 2007; Felson, Cupples, & Meenan, 1984) and it did not solve the subjectivity problem in data analysis. We hypothesise that by automating assumption checking and result interpretation of the most used statistical tests and models in epidemiology we can create a more objective analysis and reduce the rate of related errors in the literature. Also, by creating a graphical user interface (GUI) for the R programming language, we can bring cutting-edge technology to non-experts in programming and statistics. Therefore, our goal is to build an open-source, web and desktop based GUI application that automates data analysis. Our secondary objective is to use this software to analyze the predictors of smoking among teenage students.
* The Comprehensive R Archive Network, www.cran.r-project.org

Methods
In this section we will describe how our software handles outliers, imputes missing data and automates the bivariate and multivariable analyses.

Outlier detection
The software detects and handles 2 types of outliers: numerical and categorical outliers.
Numerical outliers are defined as values that fall more than 1.5 times the interquartile range above the third quartile or below the first quartile. The software offers the user the option to replace them with missing values. Categorical outliers are defined as variables that have at least one category that constitutes less than 10% of the total sample. The software detects and flags those variables so the user can choose to include them or not in subsequent analyses.
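The two detection rules above can be sketched as follows. This is an illustrative Python version (the software itself is written in R) and the function names are ours:

```python
from statistics import quantiles
from collections import Counter

def numeric_outliers(values):
    # Tukey's rule: flag values more than 1.5 * IQR beyond the quartiles
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lo or v > hi]

def flag_categorical_outlier(labels, min_share=0.10):
    # flag a categorical variable if any category holds < 10% of the sample
    counts = Counter(labels)
    n = len(labels)
    return any(c / n < min_share for c in counts.values())
```

For example, numeric_outliers([1, 2, 3, 4, 5, 100]) flags only the value 100, which the user could then replace with a missing value.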

Missing values treatment
To impute missing values we implemented random forests, available through the R package missForest (Stekhoven, 2013).

Bivariate analysis
The software performs parametric and nonparametric hypothesis testing. Parametric tests include: Student t-test, adjusted t-test, Chi-squared, Pearson's correlation, one-way analysis of variance (ANOVA). Non-parametric tests include: Fisher's exact test, Mann-Whitney U test, Spearman's rank correlation, Kruskal-Wallis one-way analysis of variance.
In choosing between parametric and nonparametric alternatives, our software automatically examines corresponding conditions and reports how the decision was made.
Many parametric tests require the variance to be approximately the same across the compared groups. Therefore, to confirm equal spread, our software computes the standard deviation of each group and checks that no standard deviation is more than 1.5 times larger than another. When comparing several groups, such as with ANOVA, the ratio of the largest standard deviation to the smallest should be less than 1.5 (Falissard, 2011).
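This rule of thumb is easy to program. A minimal Python sketch (illustrative only; the software itself is written in R):

```python
from statistics import stdev

def equal_spread(*groups, max_ratio=1.5):
    # rule of thumb: the largest group standard deviation must stay
    # below 1.5 times the smallest one
    sds = [stdev(g) for g in groups]
    return max(sds) / min(sds) < max_ratio
```

The same function covers the two-group case and the several-group (ANOVA) case, since the ratio check reduces to comparing the largest and smallest standard deviations.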
Another important assumption of parametric tests is that, for sample sizes smaller than 30 (Falissard, 2011), the studied variable must have a normal distribution. In our software, this assumption is checked with a statistical model that predicts normality from features extracted from the histogram and from normality tests. To train the model, we ran a simulation using 6000 samples with sample sizes ranging from 8 to 50, drawn at random from the following symmetric and asymmetric distributions: Normal(0,1), Uniform(0,1), Beta(2,2), Beta(6,2), Beta(2,1), Beta(3,2), t(5), t(7), t(10), Gamma(1,5), Gamma(4,5), χ2(4), χ2(20). These distributions were selected to cover various values of skewness and kurtosis. For every sample we computed:
1. The adjusted sum of squared errors (SSEadj): after drawing the histogram, we overlaid a normal distribution curve; the distance between the center of each bin and the curve is squared and added, then the sum is adjusted for the number of bins, as shown in equation 1:

SSEadj = (1/n) Σi=1..n Δi²   (1)

where n is the number of bins of the histogram and Δi is the error of the ith bin.
2. The histogram is split into 3 parts: the mean height of the bins is calculated for each part, then the vertical distance between the first part and the second, and between the second and the third, are calculated to obtain the 2 distances dist1 and dist2.
We used as independent variables: "SSEadj", "dist1", "dist2", "skew", "kurt", "SW", "KS", "AD" and "JB" (the latter four derived from the Shapiro-Wilk, Kolmogorov-Smirnov, Anderson-Darling and Jarque-Bera normality tests), and as dependent variable: "target". In order to avoid over-fitting, we split the sample into training and testing sets (70/30 split). Using the training set, we trained 2 models, a logistic regression and a random forest (Ho, 1995), whose parameters were tuned by cross-validation. The normality prediction thresholds for both models were set using the receiver operating characteristic (ROC) curve.
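As an illustration of the histogram feature extraction, the following Python sketch computes SSEadj, dist1 and dist2 for a sample. The function name, the number of bins, and the density scaling are our assumptions; the actual implementation is in R:

```python
import math
from statistics import mean, stdev

def norm_pdf(x, mu, sigma):
    # density of the normal curve overlaid on the histogram
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def histogram_features(sample, n_bins=9):
    mu, sigma = mean(sample), stdev(sample)
    lo, hi = min(sample), max(sample)
    width = (hi - lo) / n_bins
    heights, centers = [], []
    for i in range(n_bins):
        left, right = lo + i * width, lo + (i + 1) * width
        count = sum(1 for x in sample
                    if left <= x < right or (i == n_bins - 1 and x == hi))
        heights.append(count / (len(sample) * width))  # density scale
        centers.append((left + right) / 2)
    # equation 1: squared gaps between bin centers and the normal curve,
    # summed and divided by the number of bins
    sse_adj = sum((h - norm_pdf(c, mu, sigma)) ** 2
                  for h, c in zip(heights, centers)) / n_bins
    # split the histogram into 3 parts; dist1 and dist2 are the vertical
    # distances between the mean bin heights of consecutive parts
    k = n_bins // 3
    m1, m2, m3 = mean(heights[:k]), mean(heights[k:2 * k]), mean(heights[2 * k:])
    return sse_adj, m1 - m2, m2 - m3
```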
We then used the testing set to compare the performance of the 2 models using raw accuracy, Matthews correlation coefficient and the area under the curve (AUC) of the ROC curve. Matthews correlation coefficient is a measure that takes into account true and false positives and negatives and works well with classes of unbalanced size (Boughorbel, Jarray, & El-Anbari, 2017); it takes values between -1 (total disagreement between prediction and observation) and 1 (total agreement). Logistic regression and random forests were also compared to the Shapiro-Wilk test and a base model (which always predicts the majority class, non-normality) using raw accuracy and Matthews correlation coefficient.
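For reference, the Matthews correlation coefficient is computed directly from the confusion-matrix counts. A minimal sketch (in Python, for illustration):

```python
import math

def mcc(tp, tn, fp, fn):
    # Matthews correlation coefficient from confusion-matrix counts;
    # returns 0.0 by convention when any marginal total is zero
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0
```

A perfect predictor yields 1.0 and a predictor in total disagreement with the observations yields -1.0, matching the interpretation given above.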
The model with the best performance was implemented in the software to predict normality for sample sizes smaller than 30.

Variable selection for multivariable models
We implemented several methods for automatic variable selection:
(1) Focused principal component analysis: a graphical display that shows the correlations of the independent variables with the dependent variable and with each other (Falissard, 1999).
(2) Bivariate analysis: the user can choose to include in the multivariable model the independent variables that have a p-value < 0.2 (Bouyer, 2009).

Linear regression
Linear regression assumptions are automatically checked using statistical tests. If one of them is violated, the software tries a logarithmic transformation of the dependent variable and re-checks these conditions. The Shapiro-Wilk test (Shapiro & Wilk, 1965) is used to check for normality of the residuals, since it is considered to have the best power among common normality tests.
An important assumption of linear regression is the linear relationship between every independent variable and the dependent variable. This assumption is hardly ever met with real-life data, and most of the time we need to intervene to make sure it is not violated. We implemented 2 strategies to make this correction: (1) Transforming the independent variable: when running a linear regression, for each numerical independent variable, the software fits one model with the variable as is and one with its logarithm, then compares the sum of squared errors of the 2 models. The form of the independent variable that has the smallest error, and therefore a more linear relationship with the dependent variable, is used in the final model. This enables the software to apply a logarithmic transformation automatically when needed. The software also helps the user by interpreting the final model, since the interpretation of a logarithmically transformed coefficient is not straightforward (Yang, 2012).
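The strategy of comparing the raw and log-transformed forms by their sums of squared errors can be sketched as follows. This is an illustrative Python version restricted to one-predictor ordinary least squares (the software itself works in R on full multivariable models), and the function names are ours:

```python
import math

def fit_sse(x, y):
    # simple OLS of y on x; returns the sum of squared errors
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    a = my - b * mx
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

def best_form(x, y):
    # fit the model with the raw predictor and with its logarithm,
    # then keep the form with the smaller squared error (more linear)
    sse_raw = fit_sse(x, y)
    sse_log = fit_sse([math.log(xi) for xi in x], y)
    return "log" if sse_log < sse_raw else "raw"
```

When the dependent variable is exactly linear in log(x), the log form wins; when it is linear in x itself, the raw form is kept.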

(2) Cutting the variable: a valid alternative if the relationship between the dependent and independent variables has a breakpoint that represents an abrupt change and the sample is large enough. The software automatically chooses the threshold with a brute-force algorithm: it tries every possible data value as a threshold, fits 2 regression lines before and after the threshold, computes the sum of squared errors of the two lines, and chooses the threshold that yields the lowest error. The user can then split the sample according to the chosen threshold and analyze the data in each subset separately.
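A minimal sketch of this brute-force threshold search (illustrative Python; the function names are ours and the software itself runs in R):

```python
def fit_sse(x, y):
    # simple OLS of y on x; returns the sum of squared errors
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    a = my - b * mx
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

def best_threshold(x, y):
    # try each interior data value as a breakpoint, fit one regression
    # line on each side, and keep the threshold with the lowest
    # combined sum of squared errors
    pairs = sorted(zip(x, y))
    best_t, best_sse = None, float("inf")
    for i in range(2, len(pairs) - 2):  # keep a few points on each side
        t = pairs[i][0]
        left = [(a, b) for a, b in pairs if a <= t]
        right = [(a, b) for a, b in pairs if a > t]
        sse = fit_sse(*zip(*left)) + fit_sse(*zip(*right))
        if sse < best_sse:
            best_t, best_sse = t, sse
    return best_t
```

On data with an abrupt change of slope, the search recovers the breakpoint because the two regression lines fit their own segments almost perfectly there.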

Logistic regression
The Hosmer-Lemeshow test is used to assess the model's goodness of fit. Nagelkerke's pseudo R-squared is also reported, and the results and model coefficients are automatically interpreted.

Tools used
This desktop and web application was coded using the shiny package (Chang et al., 2018) of Microsoft R Open version 3.4.4 in RStudio version 1.1.453, HTML and JavaScript. The complete list of R packages used can be found in the appendix.
The normality simulation and the predictive models were written in Microsoft R Open.

Software description
Our software consists of a graphical user interface for R in which the user can work with data without the need of typing commands. The program can read data stored in Microsoft Excel, CSV (comma separated values), SPSS, STATA, SAS and JSON (JavaScript object notation) files. The complete menu structure is presented in Table 1. This menu can be divided into two large parts: a data pre-processing or preparation part and the data analysis part.
In the data preparation stage, users can change the variable type to specify whether it should be considered as numeric, categorical or text information. They have the option to cut a numerical variable using an automatically chosen threshold, as described in the methods section. The user can replace numeric outliers with missing values; the software also suggests some actions to handle categorical outliers.
Visualizing missing-data patterns can help identify observations or variables that have a lot of missing values. We used the data from a student health behavior study (Abdo, Zeenny, & Salameh, 2016), which will be discussed in a later section, to see the pattern of missing data using the following variables: age (the student's age), total average score (the student's average grade in school) and do you have a boy/girlfriend (whether the student is in an intimate relationship or not). Figure 1 shows missing values in black. One variable, the total average score, has only 1.2% missing values, but these are restricted to the first and last few observations in the dataset, which led us to investigate the reason. We can either delete these observations or choose to impute missing values using a random forest model.
When the user clicks on multivariable linear model, a linear regression is run (after verifying its assumptions) for continuous dependent variables and a logistic regression is run for binary dependent variables. The software tries logarithmic transformations, as discussed in the methods section, and also helps the user by interpreting the model's coefficients.

Accuracy results of the normality prediction algorithm
Using 70% of the simulation dataset (4200 samples), we trained a logistic regression model; Nagelkerke's pseudo R2 was 0.201 and the threshold for predicting normality was set to 0.15 using the area under the ROC curve. We also trained a random forests model, whose threshold for predicting normality was likewise set to 0.15 using the area under the ROC curve. On the remaining 30% of the dataset (1800 samples), we compared the performance of the 4 normality predictors.

Application -Predictors of smoking among Lebanese school adolescents
As a means of testing our newly developed software, we chose to work on a classical subject from the field of epidemiology: finding the factors associated with smoking among adolescent school students.

Introduction
Aiming to limit risky health behaviors among teenagers, it is important to identify high-risk groups and implement effective health programs that target these individuals early in life, when intervention is more beneficial. We will focus on cigarette and waterpipe smoking among teenage students. Tobacco smoking in the form of cigarettes and waterpipe is common in this population, and efforts should be made to improve the quality of life of adolescent smokers in order to prevent short- and long-term consequences.

Methods
Our data comes from a cross-sectional study, carried out in 2014 on 4000 private school students, which asked about their health habits (smoking, alcohol and eating habits) and various other subjects such as standard of living and relationship with their parents. Details of the study design and data collection can be found in Abdo et al. (2016).
For the descriptive part, we used means with standard deviations to summarize continuous variables, and percentages for categorical variables.
To compare groups, Student's t and chi-squared tests were used after checking the appropriate conditions and assumptions. A p-value of less than 5% was considered statistically significant.
Multivariable logistic regressions were performed to assess factors associated with cigarette and waterpipe smoking by controlling for confounding variables. Variables included in the models were risk factors found in the literature review and others we found logically plausible to include.

Results
The sample consists of 4000 school students, mostly from Mount Lebanon (74.5%) and Beirut (24.5%), with roughly 51% females and 49% males. The average student age was 15.31 ± 2.01 years; the youngest student was 10 years old and the oldest was 21 years old (Figure 2). The proportion of boys who had tried smoking (40.4%) was significantly higher than that of girls (30.7%) (p < 0.001) (Figure 3). Boys also tend to be heavier smokers, as they smoke on average 2 more cigarettes per week (95% CI [1.3-2.7]; p < 0.001). Divorce between parents is associated with a 61% higher risk of cigarette smoking in adolescents. Communication between family members did not significantly affect smoking status. However, having a sibling or a close friend who smoked significantly increased the risk of cigarette (aOR of 1.57 and 3.58, respectively) and waterpipe use (aOR of 1.5 and 2.1).
Socioeconomic status, represented here by the crowding index (number of individuals in the house divided by the number of rooms), did not affect smoking status.
Consuming energy drinks is associated with an increased risk of cigarette and waterpipe smoking, with aORs of 2.59 and 3.53 respectively. Alcohol use, binge drinking and having been drunk on at least one occasion were all significant predictors of cigarette and waterpipe use.
Students who smoke had a lower evaluation of their own health; the aORs for cigarette and waterpipe smoking were 0.56 and 0.73, respectively. Watching TV or playing video games on weekdays is associated with 5.5% more waterpipe smoking.

Discussion
As an application to our work, we analyzed tobacco smoking behavior of Lebanese school adolescents. We found that the proportion of students who have ever tried cigarette smoking is close to that found by Zahlan et al.

Discussion
We found that data analysis can be rendered faster and more objective with automation by using a combination of programming by specific instructions coupled with machine learning techniques.
Specific instructions were used, for instance, when choosing a statistical test. This essentially means following the branches of a decision tree where we check conditions or assumptions and decide, based on the answer, which path to take. Some of these conditions are straightforward, for instance checking whether the variable is numeric or categorical, the sample size, etc. Others are more difficult to assess, such as the normality of a distribution or equal variance between 2 groups. We can use statistical tests to decide on normality and homoscedasticity, but the problem with these tests is that with small sample sizes (n < 30) they do not have enough power to detect an effect (Barker & Shaw, 2015; Mohd Razali & Yap, 2011), and the opposite happens as the sample size gets large, where they tend to detect even the smallest effect size, one that would not affect the results much (Falissard, 2011). This is why we preferred using the rule-of-thumb definition of homoscedasticity between groups, where the standard deviation of one group must not be larger than 1.5 times the standard deviation of the other (Falissard, 2011).
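The branching logic for choosing between a parametric and a nonparametric two-group test can be sketched like this. This is a hypothetical simplification in Python; the actual decision tree in the software has more branches and is written in R:

```python
def choose_two_group_test(n1, n2, normal_ok, equal_spread):
    # small samples without normality fall back to the nonparametric test;
    # otherwise the spread rule decides between plain and adjusted t-test
    if min(n1, n2) < 30 and not normal_ok:
        return "Mann-Whitney U"
    return "Student t-test" if equal_spread else "adjusted t-test"
```

Here normal_ok would come from the normality prediction model and equal_spread from the 1.5 standard deviation rule of thumb described above.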
Although this method cannot be regarded as more than an approximation, it is accurate enough in practice, independent of sample size and easily programmable. As for normality in hypothesis testing, it should be assessed only when the sample size is smaller than 30; otherwise the central limit theorem ensures the normality of the sampling distribution (Falissard, 2011). In order to avoid using a low-powered normality test with a small sample size, we can use a graphical method to assess normality, where the epidemiologist looks at the histogram or QQ plot to decide visually whether the distribution is normal. This has the advantage of eliminating the dependency on low-powered normality tests but introduces subjectivity into the analysis. To deal with this problem, we used a model to predict normality, which did better than the Shapiro-Wilk test for sample sizes between 8 and 50, both in terms of raw accuracy and of Matthews correlation coefficient.
Regarding missing values, other more accurate model-based imputation methods can be used, but most of them focus only on numeric variables (Finch, 2010) or treat categorical and numeric data types separately, thus ignoring possible relationships between these different variables. We implemented a random forest method, available from the package missForest, which is a non-parametric method that can handle categorical and numerical data types simultaneously and has many advantages over other methods, as discussed in Stekhoven (2013).
One of the assumptions of linear regression is the linear relationship between dependent and numeric independent variables. One method of correcting non-linearity is cutting the independent variable in 2 groups, each having a linear relationship with the dependent variable.
The problem is then reduced to choosing a cutoff point. Our brute force algorithm described in the methods section provides a mathematically more accurate alternative than the visual method or using the rule of thumb of choosing the median, because it chooses a threshold from all possible data points by finding the minimal sum of squared errors of the 2 regression lines.

Limitations
Statistical software with a graphical user interface, especially with automatic checking of conditions and interpretation of results, lowers the entry bar for non-experts to analyze data. Three main assumptions should be assessed before running a linear regression, namely that the residuals must be uncorrelated, have a normal distribution, and have a constant variance (Berry, 1993). In our software, these assumptions are assessed using statistical tests. This has the downside, in large samples, of rejecting the null even for a small deviation that would not affect the regression coefficients, p-values and confidence intervals much, which leads to unnecessarily trying logarithmic transformations of the dependent variable or using a generalized linear model instead of linear regression.
Granted, automation ensures a certain level of objectivity; however, forcing specific methods constrains data analysis in a way that experienced users can no longer get the results of a statistical test or a model unless the software judges them appropriate by testing the assumptions in its own implemented manner. This is both a feature and a drawback that we certainly considered, but we decided that it would be, in general, more advantageous than limiting.

Future and big picture
We can improve the software by testing the assumptions of linear regression without resorting to statistical tests, and by handling bugs and errors caused by special cases, such as running a logistic regression on a binary dependent variable with one severely underrepresented level.
Consider the following question: "does cranberry juice reduce the risk of urinary tract infection in immunosuppressed men older than 70?". Automated data analysis combined with the natural language processing of a search engine could answer this type of specific query, whose answer is not addressed in any research paper, by searching the internet for a dataset with appropriate variables and cases (and, if needed, combining datasets from different sources to increase the sample size) and running an automatic analysis. We believe that this individualized, guided-by-demand human-computer interaction complements traditional research, specifically because medicine is ultimately about answering specific, personalized questions rather than the population averages that the medical literature mostly reports. We believe that automated data analysis is an essential ingredient of the big-picture solution that bridges this gap.

Software availability
The web application is available on https://automated-data-analysis.shinyapps.io/automation_app/, the source code is available for download from the app itself, and anyone can use it to run the software locally and for free.
The desktop application can be recreated from the same source code using instructions from http://blog.analytixware.com/2014/03/packaging-your-shiny-app-as-windows.html. It is also available on demand for non-programmers, please contact white.softapp@gmail.com.

Conclusion
We look at this work as a contribution to the automation of data analysis. In general, it should reduce the errors in checking statistical conditions and assumptions that are found in the medical literature, and give epidemiologists the opportunity not to get lost in statistical details but to see the big picture and focus on the question and the consequences of the results. It also takes the advantages that the R environment offers in terms of cutting-edge improvements and brings them to non-programmers through a graphical user interface.
Finally, we showed how easy it was for a machine learning model to outperform normality tests such as Shapiro-Wilk. On the other hand, our assumption checking in linear regression is still based on statistical tests. More research is needed to find new ways to automate certain parts of data analysis, either by improving methods used in our work or by automating others and adding them to the project.

Acknowledgement
We would like to thank the R community for the packages that made our work possible and everyone working on open-source projects and knowledge sharing.