Abstract: Quick identification of severe injury crashes can help Emergency Medical Services (EMS) allocate their scarce resources and improve the survival of severely injured crash victims through a timely response. Data broadcast from a vehicle’s Event Data Recorder (EDR) provide an opportunity to capture crash information and send it to EMS in near real time. A key feature of EDR data is a longitudinal measure of crash deceleration. We used functional data analysis (FDA) to extract key features of the deceleration trajectories (absolute integral, absolute integral of the slope, and residual variance) and to develop and verify a risk prediction model for serious (AIS 3+) injuries. We used data from the 2002-2012 EDR reports and the National Automotive Sampling System (NASS) Crashworthiness Data System (CDS) datasets available on the National Highway Traffic Safety Administration (NHTSA) website. We considered a variety of approaches to modeling the deceleration data, including non-penalized and penalized splines and a variable selection method, ultimately obtaining a model with a weighted AUC of 0.93. A novel feature of our approach is the use of residual variance as a measure of predictive risk. Our model can be viewed as an important first step towards developing a real-time prediction model capable of predicting the risk of severe injury in any motor vehicle crash.
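The three trajectory features named in the abstract above (absolute integral, absolute integral of the slope, and residual variance) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names are hypothetical, the curve is treated as piecewise linear between samples, and a simple moving average stands in for the spline smoothers the abstract mentions.

```python
from statistics import pvariance


def trapezoid(t, y):
    """Trapezoidal-rule integral of samples y taken at times t."""
    return sum((t[i + 1] - t[i]) * (y[i] + y[i + 1]) / 2.0 for i in range(len(t) - 1))


def moving_average(y, half_window=1):
    """Centered moving average; an assumed stand-in for a spline fit."""
    smoothed = []
    for i in range(len(y)):
        lo, hi = max(0, i - half_window), min(len(y), i + half_window + 1)
        smoothed.append(sum(y[lo:hi]) / (hi - lo))
    return smoothed


def deceleration_features(t, a, half_window=1):
    """Summarize a deceleration curve a(t) by three scalar features."""
    # Absolute integral: area under |a(t)|.
    abs_integral = trapezoid(t, [abs(v) for v in a])
    # Absolute integral of the slope: for a piecewise-linear curve the
    # integral of |da/dt| equals the sum of absolute increments.
    abs_slope_integral = sum(abs(a[i + 1] - a[i]) for i in range(len(a) - 1))
    # Residual variance: spread of the samples around the smoothed curve.
    smooth = moving_average(a, half_window)
    residual_variance = pvariance([x - s for x, s in zip(a, smooth)])
    return {
        "abs_integral": abs_integral,
        "abs_slope_integral": abs_slope_integral,
        "residual_variance": residual_variance,
    }
```

In a risk model of the kind the abstract describes, these scalars would enter as covariates alongside other crash characteristics; the choice of smoother and window here is purely illustrative.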
Abstract: Identification of representative regimes of wave height and direction under different wind conditions is complicated by issues that relate to the specification of the joint distribution of variables that are defined on linear and circular supports and the occurrence of missing values. We take a latent-class approach and jointly model wave and wind data by a finite mixture of conditionally independent Gamma and von Mises distributions. Maximum-likelihood estimates of parameters are obtained by exploiting a suitable EM algorithm that allows for missing data. The proposed model is validated on hourly marine data obtained from a buoy and two tide gauges in the Adriatic Sea.
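To make the latent-class idea in the abstract above concrete, the following sketch computes posterior class probabilities (the E-step quantity of an EM algorithm) for one linear and one circular observation under a mixture of conditionally independent Gamma and von Mises densities. It is a simplified illustration rather than the authors' model: the two-variable setup, the parameter names, and the function names are assumptions. A missing value is handled by dropping its factor, which is the marginal density under conditional independence when values are missing at random.

```python
import math


def bessel_i0(x, terms=60):
    """Modified Bessel function I0 via its power series (fine for moderate x)."""
    half, term, total = x / 2.0, 1.0, 1.0
    for k in range(1, terms):
        term *= half / k
        total += term * term
    return total


def gamma_pdf(x, shape, rate):
    """Gamma density for a positive linear variable (e.g. wave height)."""
    if x <= 0:
        return 0.0
    return rate ** shape * x ** (shape - 1) * math.exp(-rate * x) / math.gamma(shape)


def von_mises_pdf(theta, mu, kappa):
    """von Mises density for an angle in radians (e.g. wave direction)."""
    return math.exp(kappa * math.cos(theta - mu)) / (2.0 * math.pi * bessel_i0(kappa))


def responsibilities(obs, components):
    """Posterior class probabilities for obs = (linear, angle); None = missing."""
    x, theta = obs
    joint = []
    for c in components:
        density = c["weight"]
        if x is not None:  # skip the factor of a missing linear value
            density *= gamma_pdf(x, c["shape"], c["rate"])
        if theta is not None:  # skip the factor of a missing angle
            density *= von_mises_pdf(theta, c["mu"], c["kappa"])
        joint.append(density)
    total = sum(joint)
    return [d / total for d in joint]
```

With both values missing, the posterior reduces to the mixing weights, which is a quick sanity check on the missing-data handling.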
Abstract: According to the 2006 Programme for International Student Assessment (PISA), sixteen Organization for Economic Cooperation and Development (OECD) countries had scores that were significantly higher than those of the US. The top three performers were Finland, Canada, and Japan. While Finland and Japan differ vastly from the US in culture and educational system, the US and Canada are similar in many respects, so the performance gap between them was investigated. In this study, data mining was employed to identify factors, related to access to and use of resources and to student views on science, that predict PISA science scores among Grade 10 American and Canadian students. It was found that science enjoyment and frequent use of educational software play important roles in the academic achievement of Canadian students.
Abstract: Despite the availability of software for interactive graphics, current survey processing systems make limited use of this modern tool. Interactive graphics offer insights that are difficult to obtain with traditional statistical tools. This paper shows the use of interactive graphics for analysing survey data. Using Labour Force Survey data from Pakistan, we describe how plotting data in different ways and using interactive tools enables analysts to obtain information from the dataset that would not normally be obtainable using standard statistical methods. It is also shown that interactive graphics can help the analyst improve data quality by identifying erroneous cases.
Abstract: Scientific interest often centers on characterizing the effect of one or more variables on an outcome. While data mining approaches such as random forests are flexible alternatives to conventional parametric models, they suffer from a lack of interpretability because variable effects are not quantified in a substantively meaningful way. In this paper we describe a method for quantifying variable effects using partial dependence, which produces an estimate that can be interpreted as the effect on the response of a one-unit change in the predictor, averaged over the effects of all other variables. Most importantly, the approach avoids the problems of model misspecification and the implementation challenges in high-dimensional settings that are encountered with other approaches (e.g., multiple linear regression). We propose and evaluate through simulation a method for constructing a point estimate of this effect size. We also propose and evaluate interval estimates based on a non-parametric bootstrap. The method is illustrated on data used for the prediction of the age of abalone.
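The partial dependence idea in the abstract above can be sketched as follows. The abstract does not spell out the estimator, so this is one plausible reading and not necessarily the authors' method: partial dependence at a value is the model's prediction averaged over the data with the predictor fixed at that value, and the per-unit effect is taken here as the least-squares slope of the partial dependence curve over a grid. All names are hypothetical, and `toy_model` is a stand-in for a fitted black-box model such as a random forest.

```python
from statistics import mean


def partial_dependence(model, rows, feature, value):
    """Average prediction with `feature` fixed at `value` for every row."""
    return mean(model({**row, feature: value}) for row in rows)


def per_unit_effect(model, rows, feature, grid):
    """Least-squares slope of the partial dependence curve over `grid`."""
    pd_values = [partial_dependence(model, rows, feature, v) for v in grid]
    g_bar, p_bar = mean(grid), mean(pd_values)
    numerator = sum((g - g_bar) * (p - p_bar) for g, p in zip(grid, pd_values))
    denominator = sum((g - g_bar) ** 2 for g in grid)
    return numerator / denominator


def toy_model(row):
    """Assumed stand-in for a fitted model: linear in x, nonlinear in z."""
    return 2.0 * row["x"] + row["z"] ** 2
```

For `toy_model`, the recovered effect of `x` is its true coefficient regardless of how `z` is distributed, because the nonlinear `z` term is averaged out. An interval estimate in the spirit of the abstract would resample rows with replacement, refit the model, and recompute the slope on each resample.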