Abstract: Estimating fastest paths on large networks is a crucial problem for dynamic route guidance systems. This paper proposes a statistical approach for approximating fastest paths on urban networks. The traffic data used for the statistical analysis are generated with macroscopic traffic simulation software that we developed. The data consist of the input flows, the arc states (the number of cars on each arc), and the paths joining the various origins and destinations of the network. To determine the relationship between the input flows, the arc states, and the fastest paths of the network, we subject the traffic data to hybrid clustering, which combines two methods: k-means and Ward’s hierarchical agglomerative clustering. The strength of the relationship among the traffic variables is measured using canonical correlation analysis. The results of hybrid clustering are decision rules that give fastest paths as a function of arc states and input flows. These decision rules are stored in a database for predictive route guidance. Whenever a driver arrives at an entry point of the network, the current arc states and input flows are matched against the database parameters. If a match is found, the database provides the fastest path to the driver using the corresponding decision rule. Otherwise, the database recommends that the driver take the shortest path to the destination in place of the fastest path.
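The lookup step described in this abstract can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say how a rule is stored or what counts as agreement, so the `DecisionRule` structure, the Euclidean distance test, and the `radius` threshold are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class DecisionRule:
    """Hypothetical stored rule: one cluster of traffic conditions and its path."""
    centroid: tuple       # assumed cluster centre over (input flows + arc states)
    radius: float         # assumed tolerance that defines "agreement"
    fastest_path: tuple   # sequence of node labels


def recommend_path(observation, rules, shortest_path):
    """Return the fastest path of the closest matching rule, else the shortest path.

    A rule "matches" when the observation lies within its radius of the
    centroid (an assumption; the paper may use a different criterion).
    """
    best_rule = None
    best_distance = math.inf
    for rule in rules:
        distance = math.dist(observation, rule.centroid)
        if distance <= rule.radius and distance < best_distance:
            best_rule = rule
            best_distance = distance
    if best_rule is None:
        # No agreement with any stored rule: fall back to the shortest path.
        return shortest_path
    return best_rule.fastest_path


if __name__ == "__main__":
    rules = [
        DecisionRule((10.0, 5.0), 2.0, ("A", "B", "D")),
        DecisionRule((30.0, 20.0), 3.0, ("A", "C", "D")),
    ]
    print(recommend_path((11.0, 5.0), rules, ("A", "D")))
    print(recommend_path((100.0, 100.0), rules, ("A", "D")))
```

In this sketch the fallback is an explicit argument, so the caller decides how the shortest path is computed.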
PM2.5 is a major air pollutant that is highly likely to cause serious cardiopulmonary diseases such as asthma, lung cancer, trachea cancer, and bronchus cancer. As of 2014, a World Health Organization (WHO) air quality model confirmed that 92% of the world's population lived in areas where air quality levels exceeded WHO limits (i.e., 10 µg/m³). This indicates that PM2.5 is still one of the most serious worldwide problems and that monitoring PM2.5 concentrations is essential. In this paper, we propose an easy and flexible spatial-temporal Gaussian mixture model to analyze annual average PM2.5 concentrations. Because PM2.5 concentrations follow a bimodal distribution, we chose a two-component Gaussian mixture model with county-year-level spatial-temporal random effects. A Markov chain Monte Carlo (MCMC) algorithm is used to estimate the model parameters.
Abstract: Simple parametric functional forms, if appropriate, are preferred over more complicated functional forms in clinical prediction models. In this paper, we illustrate our practical approach to obtaining appropriate functional forms for continuous variables when developing a clinical prediction model for the risk of Clostridium difficile infection. First, we used a nonparametric regression smoother to establish the reference curve. Then, we used a regression spline function, the restricted cubic spline (RCS), and simple parametric forms to approximate the reference curve. Based on the shape of the reference curve, the model fit information (AIC), and a formal statistical test (the Vuong test), we selected the simple parametric forms to replace the more elaborate RCS functions. Finally, we refined the simple parametric forms in the multivariable regression model using the Wald test and the likelihood-ratio test. In addition, we compared calibration and discrimination between the model with appropriate functional forms and the model with simple linear terms. The calibration χ² (8.4 versus 10) and calibration plot, the area under the ROC curve (0.88 versus 0.84, p < 0.05), and the integrated discrimination improvement (0.0072, p < 0.001) indicated that the model with appropriate forms was better calibrated and had higher discrimination ability.
Abstract: The National Immunization Survey (NIS) is the United States’ primary tool for assessing immunization coverage among 19- to 35-month-old children. Although annual estimates from the NIS are quite precise at the national level, US State-level estimates have much larger sampling error. We combined two independent unbiased estimates of US State-level coverage within a given year to obtain new estimates that are more precise than previously published estimates. We first calculated a model-based estimate for each State for 2001 using multiple years of NIS data. Next, we combined each model-based estimate with the corresponding, previously reported NIS estimate for 2001. Our resulting estimates of State-level immunization coverage had smaller standard errors than the previously published estimates. Achieving similar improvements in precision by increasing the sample size would, depending on the State, require an increase of 30% to 120%.
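The abstract does not state how the two estimates were combined. One standard way to combine two independent unbiased estimates is inverse-variance weighting, sketched below as an assumption rather than as the authors' method. The function names and the numbers in the usage example are illustrative only, and the sample-size conversion assumes that variance scales as 1/n.

```python
def combine_estimates(est_a, var_a, est_b, var_b):
    """Inverse-variance weighted combination of two independent unbiased estimates.

    Returns the combined estimate and its variance. The combined variance is
    never larger than the smaller of the two input variances.
    """
    weight_a = 1.0 / var_a
    weight_b = 1.0 / var_b
    combined = (weight_a * est_a + weight_b * est_b) / (weight_a + weight_b)
    combined_var = 1.0 / (weight_a + weight_b)
    return combined, combined_var


def equivalent_sample_increase(var_before, var_after):
    """Fractional increase in sample size that would give the same variance
    reduction, assuming variance is proportional to 1 / n (an assumption)."""
    return var_before / var_after - 1.0


if __name__ == "__main__":
    # Illustrative numbers only, not taken from the NIS.
    survey_est, survey_var = 0.80, 0.0004
    model_est, model_var = 0.76, 0.0008
    est, var = combine_estimates(survey_est, survey_var, model_est, model_var)
    print(est, var, equivalent_sample_increase(survey_var, var))
```

Under these assumptions the equivalent sample-size increase reduces to the ratio of the direct estimate's variance to the model-based estimate's variance, so a range of 30% to 120% would correspond to variance ratios between 0.3 and 1.2.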
Abstract: The problem of variable selection is fundamental to statistical modelling in diverse fields of science. In this paper, we study in particular the problem of selecting important variables in regression problems when observations and labels of a real-world dataset are available. First, we examine the performance of several existing statistical methods for analyzing a large real trauma dataset consisting of 7000 observations and 70 factors, including demographic, transport, and intrahospital data. The statistical methods employed in this work are the nonconcave penalized likelihood methods (SCAD, LASSO, and Hard), generalized linear logistic regression, and best subset variable selection (with AIC and BIC), used to detect possible risk factors of death. Supersaturated designs (SSDs) are a large class of factorial designs that can be used for screening out the important factors from a large set of potentially active variables. This paper presents a new variable selection approach, inspired by supersaturated designs, for a given dataset of observations. The merits and effectiveness of this approach for identifying important variables in observational studies are evaluated by considering several two-level supersaturated designs and a variety of statistical models that differ in the combinations of factors and the number of observations. The results are encouraging, since the alternative approach using supersaturated designs provided specific information that is logical and consistent with medical experience and may also serve as guidelines for trauma management.
Abstract: Identifying influential observations is an important part of the model building process in linear regression. There are numerous diagnostic measures based on different approaches in linear regression analysis. However, multicollinearity and influential observations may occur simultaneously. Therefore, we propose new diagnostic measures based on the two-parameter ridge estimator defined by Lipovetsky and Conklin (2005) as an alternative to the usual ridge regression and ordinary linear regression. We define two-parameter ridge-type generalizations of DFFITS and Cook’s distance. Moreover, we obtain approximate case deletion formulas and provide approximate versions of the new measures. Finally, we illustrate the benefits of the proposed measures in real data examples.
Abstract: This paper considers models of educational data where a value-added analysis is required. These models are multilevel in nature and contain endogenous regressors. Multivariate models are considered so as to simultaneously model results from different subject areas. Path models and factor models are considered as types of model that can be used to overcome the problem of endogeneity. Estimation methods available in MLwiN and EQS are used. The use of a factor model with EQS is shown to give estimates of the effects of teaching styles that have smaller standard errors than any other method studied.
Abstract: Frailty models have become popular in survival analysis for dealing with situations where groups of observations are correlated. If the data comprise only exact or right-censored failure times, inference can be done by either integrating out the frailties directly or by using the EM algorithm. If there is both left- and right-censoring, this is no longer the case. However, the MCMC method of Clayton (1991, Biometrics 47, 467-485) can be easily extended by imputation of the left-censored times. Several schemes for doing this are suggested and compared. Application of the methods is illustrated using data on the joint failures of patients with fibrodysplasia ossificans progressiva.
Time series modelling is a very popular technique in data science. Its main aim is to understand the data-generating process and to estimate its parameters, which depend on all the observations. A few observations may misrepresent the data and unduly influence the parameter estimates; such observations are called outliers. The present study deals with the handling of outliers in the context of ARIMA time series and proposes an alternative approach for replacing them. Two ways of handling outliers are in common use: removing them from the data, or replacing them with nearby values. Removal does not work for autocorrelated data such as time series, and replacing an outlier with the immediately preceding or following value is also not very appropriate because of the dependency structure. We therefore propose an alternative approach in which each outlier is replaced by the value estimated from the best-fitting model. The methodology is discussed in detail, and an empirical analysis is then carried out on time series from the National Pension Scheme (NPS). Most of the series are modelled well; a few are not, owing to their non-stationary nature. After an outlier-free series is obtained, forecasting is also carried out. The proposed methodology is also applied to realizations of the series to obtain a more general view of its performance, and it gives similar results.
Abstract: The aim of this paper is to present the Bonus-Malus System (BMS) of Iran, which is a mandatory scheme based on Insurance Act number 56. We examine the current Iranian BMS using various criteria, such as elasticity and time of convergence to the steady state with respect to the claim frequency, as well as financial balance. We also find the closed form of the stationary distribution of the Iranian BMS, which plays a key role in the study of BMSs. Moreover, we compare the results with the German and Japanese BMSs. Finally, we give some hints that can be used to improve the performance of the current Iranian BMS.
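To illustrate what a stationary distribution of a BMS looks like, the sketch below uses a toy system, not the Iranian rules, which the abstract does not give. The assumptions are: classes 0 (best) to K-1 (worst), a claim-free year moves the policyholder down one class, any claim moves them to the worst class, and yearly claim counts are Poisson. The function names are illustrative. For this toy system the stationary distribution has a simple closed form, which the code checks against power iteration.

```python
import math


def transition_matrix(n_classes, claim_rate):
    """Toy BMS: claim-free year -> one class down; any claim -> worst class."""
    p_no_claim = math.exp(-claim_rate)  # Poisson probability of zero claims
    worst = n_classes - 1
    matrix = [[0.0] * n_classes for _ in range(n_classes)]
    for current in range(n_classes):
        matrix[current][max(current - 1, 0)] += p_no_claim
        matrix[current][worst] += 1.0 - p_no_claim
    return matrix


def stationary_distribution(matrix, tol=1e-12, max_iter=10_000):
    """Power iteration: repeatedly apply pi <- pi P until it stops changing."""
    n = len(matrix)
    pi = [1.0 / n] * n
    for _ in range(max_iter):
        updated = [sum(pi[i] * matrix[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(updated, pi)) < tol:
            return updated
        pi = updated
    return pi


def closed_form(n_classes, claim_rate):
    """Closed form for the toy system only: geometric weights from the worst class."""
    p_no_claim = math.exp(-claim_rate)
    worst = n_classes - 1
    pi = [(1.0 - p_no_claim) * p_no_claim ** (worst - k) for k in range(n_classes)]
    pi[0] = p_no_claim ** worst  # best class absorbs all longer claim-free runs
    return pi


if __name__ == "__main__":
    numeric = stationary_distribution(transition_matrix(5, 0.1))
    exact = closed_form(5, 0.1)
    for k, (a, b) in enumerate(zip(numeric, exact)):
        print(f"class {k}: iterative={a:.6f} closed-form={b:.6f}")
```

Criteria such as elasticity or financial balance could be computed from this distribution once class premiums are specified; they are omitted here because the abstract gives no premium scale.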