Abstract: According to the 2006 Programme for International Student Assessment (PISA), sixteen Organization for Economic Cooperation and Development (OECD) countries had science scores that were significantly higher than those of the US. The top three performers were Finland, Canada, and Japan. While Finland and Japan differ vastly from the US in culture and educational system, the US and Canada are similar in many respects, so the performance gap between these two countries was investigated. In this study, data mining was employed to identify factors regarding access to and use of resources, as well as student views on science, for predicting PISA science scores among Grade 10 American and Canadian students. It was found that science enjoyment and frequent use of educational software play important roles in the academic achievement of Canadian students.
Abstract: Despite the availability of software for interactive graphics, current survey processing systems make limited use of this modern tool. Interactive graphics offer insights that are difficult to obtain with traditional statistical tools. This paper shows the use of interactive graphics for analysing survey data. Using Labour Force Survey data from Pakistan, we describe how plotting data in different ways and using interactive tools enables analysts to obtain information from the dataset that would not normally be obtainable using standard statistical methods. It is also shown that interactive graphics can help the analyst to improve data quality by identifying erroneous cases.
Abstract: Student retention is an important issue for all university policy makers due to the potential negative impact of attrition on the image of the university and the career path of the dropouts. Although this issue has been thoroughly studied by many institutional researchers using parametric techniques, such as regression analysis and logit modeling, this article attempts to bring in a new perspective by exploring the issue with three data mining techniques, namely, classification trees, multivariate adaptive regression splines (MARS), and neural networks. The data mining procedures identify transferred hours, residency, and ethnicity as crucial factors in retention. Carrying transferred hours into the university implies that the students have taken college-level classes somewhere else, suggesting that they are more academically prepared for university study than those who have no transferred hours. Although residency was found to be a crucial predictor of retention, one should not go so far as to interpret this finding as meaning that retention is affected by proximity to the university location. Instead, this is a typical example of Simpson’s Paradox. The geographical information system analysis indicates that non-residents from the east coast tend to be more persistent in enrollment than their west coast schoolmates.
Abstract: Retrieving valuable knowledge and statistical patterns from official data has great potential for supporting strategic policy making. Data Mining (DM) techniques are well known for providing flexible and efficient analytical tools for data processing. In this paper, we provide an introduction to applications of DM to official statistics and flag the important issues and challenges. Considering recent advancements in software projects for DM, we propose the design and specifications of an intelligent data control system as an example of a DM application in official data processing.
Abstract: Scientific interest often centers on characterizing the effect of one or more variables on an outcome. While data mining approaches such as random forests are flexible alternatives to conventional parametric models, they suffer from a lack of interpretability because variable effects are not quantified in a substantively meaningful way. In this paper we describe a method for quantifying variable effects using partial dependence, which produces an estimate that can be interpreted as the effect on the response of a one-unit change in the predictor, while averaging over the effects of all other variables. Most importantly, the approach avoids problems related to model misspecification and the implementation challenges in high-dimensional settings that are encountered with other approaches (e.g., multiple linear regression). We propose and evaluate through simulation a method for constructing a point estimate of this effect size. We also propose and evaluate interval estimates based on a non-parametric bootstrap. The method is illustrated on data used for the prediction of the age of abalone.
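The partial dependence idea in the last abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' estimator. It summarizes the partial dependence curve by a least-squares slope, which is one possible way to read it as a "per one-unit change" effect. It uses an ordinary least-squares learner as a stand-in where the paper would use a random forest, and it runs on simulated data. The function names (`partial_dependence`, `pd_slope`, `fit_ols`, `bootstrap_interval`) are hypothetical.

```python
import numpy as np

def partial_dependence(predict, X, j, grid):
    """Average prediction when column j is forced to each grid value."""
    values = []
    for v in grid:
        X_mod = X.copy()
        X_mod[:, j] = v          # set predictor j to v for every row
        values.append(predict(X_mod).mean())  # average over the other variables
    return np.array(values)

def pd_slope(predict, X, j, n_grid=20):
    """Summarize the partial dependence curve by a least-squares slope.

    Assumption: a single linear slope is an adequate summary of the curve.
    """
    grid = np.linspace(X[:, j].min(), X[:, j].max(), n_grid)
    curve = partial_dependence(predict, X, j, grid)
    slope, _intercept = np.polyfit(grid, curve, 1)
    return slope

def fit_ols(X, y):
    """Stand-in learner (ordinary least squares); any fit -> predict pair works."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return lambda X_new: np.column_stack([np.ones(len(X_new)), X_new]) @ beta

def bootstrap_interval(fit, X, y, j, n_boot=200, alpha=0.05, seed=0):
    """Percentile interval from a non-parametric (row-resampling) bootstrap."""
    rng = np.random.default_rng(seed)
    n = len(X)
    slopes = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)   # resample rows with replacement
        predict = fit(X[idx], y[idx])      # refit the model on the resample
        slopes.append(pd_slope(predict, X[idx], j))
    lower, upper = np.quantile(slopes, [alpha / 2, 1 - alpha / 2])
    return lower, upper

# Simulated data (hypothetical): the true effect of predictor 0 is 2.0.
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=300)

predict = fit_ols(X, y)
estimate = pd_slope(predict, X, j=0)
lower, upper = bootstrap_interval(fit_ols, X, y, j=0)
print(f"effect of x0: {estimate:.3f}, 95% interval: ({lower:.3f}, {upper:.3f})")
```

Because `bootstrap_interval` only needs a function that fits a model and returns a `predict` callable, a random forest or any other flexible learner could be passed in place of `fit_ols` without changing the rest of the sketch.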