Abstract: An analysis of air quality data is provided for the municipal area of Taranto, which is characterized by high environmental risk due to the massive presence of industrial sites with high-environmental-impact activities. The present study focuses on particulate matter as measured by PM10 concentrations. Preliminary analysis involved addressing several data problems, mainly: (i) imputation techniques were considered to cope with the large amount of missing data, due both to different working periods for groups of monitoring stations and to occasional malfunction of PM10 sensors; (ii) because different validation techniques are used for each of the three monitoring networks, a calibration procedure was devised to allow for data comparability. Missing data imputation and calibration were addressed by three alternative procedures that share a leave-one-out type mechanism and are based on ad hoc exploratory tools and on the recursive Bayesian estimation and prediction of spatial linear mixed effects models. The three procedures are introduced by motivating issues and compared in terms of performance.
We introduce a new family of distributions based on a generalized Burr III generator, called the Modified Burr III G family, and study some of its mathematical properties. Its density function can be bell-shaped, left-skewed, right-skewed, bathtub, J or reversed-J. Its hazard rate can be increasing, decreasing, bathtub, upside-down bathtub, J or reversed-J. Some of its special models are presented. We illustrate the importance of the family with two applications to real data sets.
Abstract: Interest in estimating the probability of cure has been increasing in cancer survival analysis as cure becomes a reality for some cancer sites. Mixture cure models have been used to model failure time data in the presence of long-term survivors. The mixture cure model assumes that a fraction of the survivors are cured of the disease of interest. The failure time distribution for the uncured individuals (latency) can be modeled by either parametric models or a semi-parametric proportional hazards model. In the model, the probability of cure and the latency distribution are both related to the prognostic factors and patients' characteristics. The maximum likelihood estimates (MLEs) of these parameters can be obtained using the Newton-Raphson algorithm. The EM algorithm has been proposed as a simple alternative by Larson and Dinse (1985) and Taylor (1995) in various settings for cause-specific survival analysis. This approach is extended here to grouped relative survival data. The methods are applied to analyze the colorectal cancer relative survival data from the Surveillance, Epidemiology, and End Results (SEER) program.
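The EM idea for a mixture cure model can be illustrated with a minimal sketch. It is not the grouped relative survival method of the paper: it assumes individual-level right-censored data, no covariates, and an exponential latency distribution, so that the population survival is S(t) = pi + (1 - pi) exp(-lam t). The function name, the simulated data, and all parameter values are hypothetical.

```python
import math
import random


def fit_exponential_cure_model(times, events, n_iter=200):
    """EM for S(t) = pi + (1 - pi) * exp(-lam * t); returns (pi, lam)."""
    n = len(times)
    cure_prob = 0.5                    # starting value for pi (assumption)
    rate = sum(events) / sum(times)    # crude starting value for lam
    for _ in range(n_iter):
        # E-step: posterior probability that each subject is uncured.
        uncured_weight = []
        for t, d in zip(times, events):
            if d == 1:
                # An observed failure can only come from an uncured subject.
                uncured_weight.append(1.0)
            else:
                surv_uncured = (1.0 - cure_prob) * math.exp(-rate * t)
                uncured_weight.append(surv_uncured / (cure_prob + surv_uncured))
        # M-step: closed-form updates for the cure fraction and the rate.
        cure_prob = 1.0 - sum(uncured_weight) / n
        rate = sum(events) / sum(w * t for w, t in zip(uncured_weight, times))
    return cure_prob, rate


# Simulated data (hypothetical): 30% cured, exponential latency with rate 0.5,
# administrative censoring at 10 time units.
rng = random.Random(1)
true_cure_prob, true_rate, follow_up = 0.3, 0.5, 10.0
times, events = [], []
for _ in range(2000):
    t = math.inf if rng.random() < true_cure_prob else rng.expovariate(true_rate)
    times.append(min(t, follow_up))
    events.append(1 if t < follow_up else 0)

est_cure_prob, est_rate = fit_exponential_cure_model(times, events)
print(f"cure fraction: {est_cure_prob:.3f}, latency rate: {est_rate:.3f}")
```

Both M-step updates are available in closed form here only because of the exponential assumption; with covariates or a proportional hazards latency, each M-step would itself require numerical optimization.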
Abstract: Estimating fastest paths on large networks is a crucial problem for dynamic route guidance systems. The present paper proposes a statistical approach for approximating fastest paths on urban networks. The traffic data used for the statistical analysis are generated with macroscopic traffic simulation software that we developed. The traffic data consist of the input flows, the arc states (the number of cars in the arcs), and the paths joining the various origins and destinations of the network. To determine the relationship between the input flows, arc states and fastest paths of the network, we subject the traffic data to hybrid clustering. The hybrid clustering uses two methods, namely k-means and Ward's hierarchical agglomerative clustering. The strength of the relationship among the traffic variables is measured using canonical correlation analysis. The results of hybrid clustering are decision rules that provide fastest paths as a function of arc states and input flows. These decision rules are stored in a database for performing predictive route guidance. Whenever a driver arrives at the entry point of the network, the current arc states and input flows are matched against the database parameters. If agreement is found, the database provides the fastest path to the driver using the corresponding decision rule. In case of disagreement, the database recommends that the driver take the shortest path as the fastest path to reach the destination.
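The database lookup described above might be sketched as follows. The abstract does not say how "agreement" is decided, so this sketch assumes each decision rule is summarized by a cluster centroid and that a rule matches when the observed traffic vector lies within a Euclidean tolerance of it. The `DecisionRule` class, `recommend_path` function, node labels, and numeric values are all hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class DecisionRule:
    # Representative traffic vector of a cluster: input flows then arc states.
    centroid: Sequence[float]
    # Fastest path associated with that cluster, as a list of node labels.
    fastest_path: List[str]


def recommend_path(rules, observed, shortest_path, tolerance):
    """Return the fastest path of the closest matching rule, else the shortest path."""
    best_rule, best_dist = None, math.inf
    for rule in rules:
        dist = math.dist(rule.centroid, observed)  # Euclidean distance
        if dist < best_dist:
            best_rule, best_dist = rule, dist
    if best_rule is not None and best_dist <= tolerance:
        return best_rule.fastest_path   # agreement found
    return shortest_path                # disagreement: fall back


# Hypothetical rule database with two clusters of traffic conditions.
rules = [
    DecisionRule(centroid=(100.0, 20.0, 5.0), fastest_path=["A", "B", "D"]),
    DecisionRule(centroid=(400.0, 80.0, 60.0), fastest_path=["A", "C", "D"]),
]
shortest = ["A", "B", "D"]

light_traffic = recommend_path(rules, (105.0, 18.0, 6.0), shortest, tolerance=25.0)
congested = recommend_path(rules, (395.0, 85.0, 55.0), shortest, tolerance=25.0)
unseen = recommend_path(rules, (900.0, 300.0, 200.0), shortest, tolerance=25.0)
print(light_traffic, congested, unseen)
```

In practice the traffic variables would need to be put on a common scale before distances are computed, since flows and arc states have different units.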
PM2.5 is a major air pollutant that is likely to cause many serious cardiopulmonary diseases, such as asthma, lung cancer, trachea cancer, and bronchus cancer. As of 2014, a World Health Organization (WHO) air quality model confirmed that 92% of the world's population lived in areas where air quality levels exceeded WHO limits (i.e., 10 µg/m³). This indicates that PM2.5 is still one of the most serious worldwide problems and that monitoring PM2.5 concentrations is necessary. In this paper, we propose an easy and flexible spatial-temporal Gaussian mixture model to analyze annual average PM2.5 concentrations. Because of the bimodal distribution of PM2.5 concentrations, we chose a two-component Gaussian mixture model with county-year-level spatial-temporal random effects. A Markov Chain Monte Carlo (MCMC) algorithm is used to estimate the model parameters.
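To make the two-component structure concrete, the sketch below fits a plain two-component Gaussian mixture to simulated bimodal data. It deliberately simplifies the model in the abstract: it has no spatial-temporal random effects, and it uses the EM algorithm in place of MCMC. The function names, starting values, and simulated concentrations are hypothetical.

```python
import math
import random


def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def fit_two_component_gmm(data, n_iter=200):
    """EM for a two-component univariate Gaussian mixture."""
    n = len(data)
    mean = sum(data) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in data) / n)
    mu = [min(data), max(data)]   # start the components at the extremes
    sigma = [sd, sd]
    weight = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: responsibility of component 0 for each observation.
        resp = []
        for x in data:
            p0 = weight[0] * normal_pdf(x, mu[0], sigma[0])
            p1 = weight[1] * normal_pdf(x, mu[1], sigma[1])
            resp.append(p0 / (p0 + p1))
        # M-step: weighted means, standard deviations, and mixing weights.
        for k in (0, 1):
            r = resp if k == 0 else [1.0 - v for v in resp]
            total = sum(r)
            mu[k] = sum(ri * x for ri, x in zip(r, data)) / total
            var = sum(ri * (x - mu[k]) ** 2 for ri, x in zip(r, data)) / total
            sigma[k] = max(math.sqrt(var), 1e-3)  # guard against collapse
            weight[k] = total / n
    return weight, mu, sigma


# Simulated annual averages in µg/m³ (hypothetical): a cleaner and a more polluted group.
rng = random.Random(0)
data = [rng.gauss(8.0, 1.0) for _ in range(200)] + [rng.gauss(14.0, 1.5) for _ in range(200)]

est_weight, est_mu, est_sigma = fit_two_component_gmm(data)
print(est_weight, est_mu, est_sigma)
```

A Bayesian fit via MCMC would additionally yield posterior uncertainty for each parameter and can accommodate the county-year random effects, which this sketch omits.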
Abstract: Simple parametric functional forms, if appropriate, are preferred over more complicated functional forms in clinical prediction models. In this paper, we illustrate our practical approach to obtaining the appropriate functional forms for continuous variables in developing a clinical prediction model for risk of Clostridium difficile infection. First, we used a nonparametric regression smoother to establish the reference curve. Then, we used a regression spline function, the restricted cubic spline (RCS), and simple parametric forms to approximate the reference curve. Based on the shape of the reference curve, the model fit information (AIC), and a formal statistical test (Vuong test), we selected the simple parametric forms to replace the more elaborate RCS functions. Finally, we refined the simple parametric forms in the multiple variable regression model using the Wald test and the likelihood-ratio test. In addition, we compared the calibration and discrimination aspects between the model with appropriate functional forms and the model with simple linear terms. The calibration χ² (8.4 vs 10) and calibration plot, the area under the ROC curve (0.88 vs 0.84, p < 0.05), and the integrated discrimination improvement (0.0072, p < 0.001) indicated that the model with appropriate forms was better calibrated and had higher discrimination ability.
Abstract: The National Immunization Survey (NIS) is the United States' primary tool for assessing immunization coverage among 19- to 35-month-old children. Although annual estimates from the NIS are quite precise at the national level, US State-level estimates have much larger sampling error than national-level estimates. We combined two independent unbiased estimates of US State-level coverage within a given year to obtain new estimates that are more precise than previously published estimates. We first calculated a model-based estimate for each State for 2001 using multiple years of NIS data. Next, we combined each model-based estimate with the corresponding, previously reported NIS estimate for 2001. Our resulting estimates of State-level immunization coverage had smaller standard errors than the previously published estimates. To make similar improvements in precision by increasing sample size would, depending on the State, require an increase in sample size of 30% to 120%.
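One standard way to combine two independent unbiased estimates is inverse-variance weighting, sketched below. The abstract does not state which weighting the authors used, so this is an assumption, and the coverage values and standard errors are hypothetical. The second function converts the gain in precision into an equivalent increase in sample size, assuming the variance of the direct estimate is inversely proportional to sample size.

```python
import math


def combine_independent(est_a, se_a, est_b, se_b):
    """Inverse-variance weighted average of two independent unbiased estimates."""
    w_a = 1.0 / se_a ** 2
    w_b = 1.0 / se_b ** 2
    combined = (w_a * est_a + w_b * est_b) / (w_a + w_b)
    combined_se = math.sqrt(1.0 / (w_a + w_b))
    return combined, combined_se


def equivalent_sample_increase(se_direct, se_combined):
    """Relative increase in sample size that would give the same precision gain."""
    return (se_direct / se_combined) ** 2 - 1.0


# Hypothetical State-level coverage estimates for one year.
direct_est, direct_se = 0.78, 0.03   # direct survey estimate
model_est, model_se = 0.80, 0.04     # model-based estimate from other years

combined_est, combined_se = combine_independent(direct_est, direct_se, model_est, model_se)
increase = equivalent_sample_increase(direct_se, combined_se)
print(f"combined: {combined_est:.4f} (SE {combined_se:.4f}), "
      f"equivalent sample increase: {increase:.0%}")
```

With these illustrative inputs the combined standard error is 0.024, which corresponds to a sample size increase of about 56% for the direct estimate alone.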
Abstract: The problem of variable selection is fundamental to statistical modelling in diverse fields of science. In this paper, we study in particular the problem of selecting important variables in regression problems in the case where observations and labels of a real-world dataset are available. First, we examine the performance of several existing statistical methods for analyzing a large real trauma dataset that consists of 7000 observations and 70 factors, including demographic, transport and intrahospital data. The statistical methods employed in this work are the nonconcave penalized likelihood methods (SCAD, LASSO, and Hard), generalized linear logistic regression, and best subset variable selection (with AIC and BIC), used to detect possible risk factors of death. Supersaturated designs (SSDs) are a large class of factorial designs that can be used for screening out the important factors from a large set of potentially active variables. This paper presents a new variable selection approach, inspired by supersaturated designs, for a given dataset of observations. The merits and the effectiveness of this approach for identifying important variables in observational studies are evaluated by considering several two-level supersaturated designs and a variety of statistical models with respect to the combinations of factors and the number of observations. The derived results are encouraging, since the alternative approach using supersaturated designs provided specific information that is logical and consistent with medical experience and that may also serve as guidelines for trauma management.
Abstract: Identifying influential observations is an important part of the model building process in linear regression. There are numerous diagnostic measures based on different approaches in linear regression analysis. However, the problems of multicollinearity and influential observations may occur simultaneously. Therefore, we propose new diagnostic measures based on the two-parameter ridge estimator defined by Lipovetsky and Conklin (2005) as an alternative to the usual ridge regression and ordinary linear regression. We define two-parameter ridge-type generalizations of DFFITS and Cook's distance. Moreover, we obtain approximate case deletion formulas and provide approximate versions of the new measures. Finally, we illustrate the benefits of the proposed measures in real data examples.
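For orientation, the quantities being generalized can be written down as follows. The first line is the two-parameter ridge estimator as it is usually attributed to Lipovetsky and Conklin (2005); the remaining lines are the classical ordinary least squares versions of Cook's distance and DFFITS, not the generalized measures proposed in the paper. The notation is an assumption: X is the n x p design matrix, e_i the i-th residual, h_ii the i-th leverage, s^2 the residual mean square, and s_(i) the residual standard deviation with case i deleted.

```latex
% Two-parameter ridge estimator, with k >= 0 and scaling parameter q > 0.
% Setting q = 1 gives ordinary ridge regression; q = 1 and k = 0 gives OLS.
\hat{\beta}(q,k) = q\,\bigl(X^{\top}X + kI\bigr)^{-1} X^{\top} y

% Classical OLS case-deletion diagnostics for observation i,
% with leverage h_{ii} = x_i^{\top}(X^{\top}X)^{-1}x_i.
D_i = \frac{e_i^{2}}{p\,s^{2}} \cdot \frac{h_{ii}}{(1-h_{ii})^{2}},
\qquad
\mathrm{DFFITS}_i = t_i \sqrt{\frac{h_{ii}}{1-h_{ii}}},
\qquad
t_i = \frac{e_i}{s_{(i)}\sqrt{1-h_{ii}}}
```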