Abstract: This paper presents a comprehensive statistical approach for answering health-related questions concerning mortality and incidence rates of chronic diseases such as cancer and hypertension. The spatio-temporal models developed here help explain patterns in chronic disease mortality rates in terms of environmental changes and socioeconomic conditions. In addition to age and time effects, the models include two components of normally distributed residual effects and spatial effects, one to represent average regional effects and another to represent changes of subgroups within a region over time. The numerical analysis is based on male lung cancer mortality data from the state of Missouri. Gibbs sampling is used to obtain the posterior quantities. All models discussed in this article perform well in stabilizing the mortality rates, especially in the less populated areas. Owing to the richness of the hierarchical settings, the easy interpretation of parameters, and the ease of implementation, any of the models proposed in this paper can be applied to other data sets.
Abstract: The classical coupon collector’s problem is concerned with the number of purchases needed to obtain a complete collection, assuming that on each purchase a consumer obtains one randomly chosen coupon. In most real situations, however, a consumer does not necessarily get exactly one coupon on each purchase. Motivated by the classical coupon collector’s problem, in this work we study the so-called suprenewal process. Let $\{X_i, i \ge 1\}$ be a sequence of independent and identically distributed random variables, and let $S_n = \sum_{i=1}^{n} X_i$, $n \ge 1$, $S_0 = 0$. For every $t \ge 0$, define $Q_t = \inf\{n \mid n \ge 0, S_n \ge t\}$. For the classical coupon collector’s problem, $Q_t$ denotes the minimal number of purchases such that the total number of coupons that the consumer has owned is greater than or equal to $t$, $t \ge 0$. First, the process $\{Q_t, t \ge 0\}$ and the renewal process $\{N_t, t \ge 0\}$, where $N_t = \sup\{n \mid n \ge 0, S_n \le t\}$, generated by the same sequence $\{X_i, i \ge 1\}$, are compared. Next, some fundamental and interesting properties of $\{Q_t, t \ge 0\}$ are provided. Finally, limiting and some other related results are obtained for the process $\{Q_t, t \ge 0\}$.
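The two definitions above can be made concrete with a minimal Python sketch. The function names (`partial_sums`, `q_t`, `n_t`) and the sample increments are assumptions introduced only for illustration, and the sketch works on a finite, nonnegative sample path rather than on the full random sequence.

```python
from itertools import accumulate


def partial_sums(xs):
    """Return [S_0, S_1, ..., S_n] with S_0 = 0 for the increments xs."""
    return [0] + list(accumulate(xs))


def q_t(xs, t):
    """Q_t = inf{n >= 0 : S_n >= t}; None if the finite sample never reaches t."""
    for n, s in enumerate(partial_sums(xs)):
        if s >= t:
            return n
    return None


def n_t(xs, t):
    """N_t = sup{n >= 0 : S_n <= t}, restricted to the observed finite sample.

    Assumes nonnegative increments and t >= 0, so S_0 = 0 always qualifies.
    If the sample never exceeds t, the value returned is only a lower bound.
    """
    return max(n for n, s in enumerate(partial_sums(xs)) if s <= t)


# Hypothetical increments (coupons obtained per purchase); S = 0, 2, 3, 6, 8.
xs = [2, 1, 3, 2]
print(q_t(xs, 4), n_t(xs, 4))  # t falls strictly between S_2 and S_3
print(q_t(xs, 3), n_t(xs, 3))  # t coincides with S_2
```

For strictly positive increments, the sketch reflects a simple relationship: when t equals some partial sum, Q_t and N_t coincide, and otherwise Q_t exceeds N_t by one.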
Abstract: Information regarding small area prevalence of chronic disease is important for public health strategy and resourcing equity. This paper develops a prevalence model that uses survey and census data to derive small area prevalence estimates for diabetes. The application involves 32,000 small area subdivisions (zip code census tracts) of the US, with the prevalence estimates taking account of information from the US-wide Behavioral Risk Factor Surveillance System (BRFSS) survey on population prevalence differentials by age, gender, ethnic group and education. The effects of such aspects of population composition on prevalence are widely recognized. However, the model also incorporates spatial or contextual influences via spatially structured effects for each US state; such contextual effects are allowed to differ between ethnic groups and other demographic categories using a multivariate spatial prior. A Bayesian estimation approach is used, and the analysis demonstrates the considerably improved fit of a fully specified compositional-contextual model as compared to simpler ‘standard’ approaches, which are typically limited to age and area effects.
Abstract: This article extends the recent work of Vännman and Albing (2007) regarding the new family of quantile-based process capability indices (qPCI) $C_{MA}(\tau, v)$. We develop both asymptotic parametric and nonparametric confidence limits and testing procedures for $C_{MA}(\tau, v)$. A kernel density estimator of the process distribution is proposed to obtain a consistent estimator of the variance of the nonparametric consistent estimator of $C_{MA}(\tau, v)$. The proposed procedure is therefore ready for practical implementation for any process. Illustrative examples are also provided to show the steps of implementing the proposed methods directly on real-life problems. We also present a simulation study on the sample size required for using the asymptotic results.
Abstract: Information fusion has become a powerful tool for challenging applications such as biological prediction problems. In this paper, we apply a new information-theoretical fusion technique to HIV-1 protease cleavage site prediction, a problem that has recently been the focus of much interest and investigation in the machine learning community. It poses a difficult classification task due to its high-dimensional feature space and relatively small set of available training patterns. We also apply a new set of biophysical features to this problem and present experiments with neural networks, support vector machines, and decision trees. Application of our feature set results in high recognition rates and concise decision trees, producing manageable rule sets that can guide future experiments. In particular, we found a combination of neural networks and support vector machines to be beneficial for this problem.
Abstract: This paper investigates inequality in Viet Nam through a study of the urban/rural gap by means of a multilevel model. Using data from the Viet Nam Household Living Standards Survey of 2002, the paper constructs a multilevel model yielding random effects in the urban/rural gap, which can be seen as location-specific random contributions to the urban/rural gap above and beyond the effects of known location characteristics, such as the level of education of the population. The paper also demonstrates how the multilevel model can be used to obtain small area estimates at the commune level.
Abstract: Student retention is an important issue for all university policy makers due to the potential negative impact of attrition on the image of the university and the career path of the dropouts. Although this issue has been thoroughly studied by many institutional researchers using parametric techniques, such as regression analysis and logit modeling, this article attempts to bring in a new perspective by exploring the issue with three data mining techniques, namely, classification trees, multivariate adaptive regression splines (MARS), and neural networks. The data mining procedures identify transferred hours, residency, and ethnicity as crucial factors in retention. Carrying transferred hours into the university implies that the students have taken college-level classes somewhere else, suggesting that they are more academically prepared for university study than those who have no transferred hours. Although residency was found to be a crucial predictor of retention, one should not go so far as to interpret this finding as meaning that retention is affected by proximity to the university location. Instead, this is a typical example of Simpson’s Paradox. The geographical information system analysis indicates that non-residents from the east coast tend to be more persistent in enrollment than their west coast schoolmates.
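Simpson’s Paradox, mentioned above, arises when a comparison between groups reverses once the data are pooled across strata. The following Python sketch illustrates the mechanism only; the counts, the group labels, and the strata are hypothetical and are not taken from the retention study.

```python
# Hypothetical (retained, enrolled) counts per stratum; not data from the study.
data = {
    "group A": {"stratum 1": (81, 87), "stratum 2": (192, 263)},
    "group B": {"stratum 1": (234, 270), "stratum 2": (55, 80)},
}


def stratum_rate(group, stratum):
    """Retention rate of one group within one stratum."""
    retained, enrolled = data[group][stratum]
    return retained / enrolled


def pooled_rate(group):
    """Retention rate of one group after pooling all strata."""
    retained = sum(r for r, _ in data[group].values())
    enrolled = sum(n for _, n in data[group].values())
    return retained / enrolled


for stratum in ("stratum 1", "stratum 2"):
    print(stratum, stratum_rate("group A", stratum), stratum_rate("group B", stratum))
print("pooled", pooled_rate("group A"), pooled_rate("group B"))
```

In this constructed example, group A has the higher rate within each stratum while group B has the higher pooled rate, because the two groups are distributed very differently across the strata.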
Abstract: The paper proposes the use of Kohonen’s Self Organizing Map (SOM) and supervised neural networks to find clusters in samples of gamma-ray bursts (GRBs) using the measurements given in BATSE GRB. The extent of separation between clusters obtained by SOM was examined by a cross-validation procedure using supervised neural networks for classification. A method is proposed for variable selection to mitigate the “curse of dimensionality”. Six variables were chosen for cluster analysis. Additionally, principal components were computed using all the original variables, and six components, which accounted for a high percentage of the variance, were chosen for SOM analysis. All these methods indicate 4 or 5 clusters. Further analysis based on the average profiles of the GRBs indicated a possible reduction in the number of clusters.
Abstract: The likelihood of developing cancer during one’s lifetime is approximately one in two for men and one in three for women in the United States. Cancer is the second-leading cause of death and accounts for one in every four deaths. Evidence-based policy planning and decision making by cancer researchers and public health administrators are best accomplished with up-to-date age-adjusted site-specific cancer death rates. Because of the 3-year lag in reporting, forecasting methodology is employed here to estimate the current year’s rates based on complete observed death data up through three years prior to the current year. The authors expand the State Space Model (SSM) statistical methodology currently in use by the American Cancer Society (ACS) to predict age-adjusted cancer death rates for the current year. These predictions are compared with those from the previous Proc Forecast ACS method and results suggest the expanded SSM performs well.
Abstract: The self-controlled case series (SCCS) and the matched cohort are two frequently used study designs to adjust for known and unknown confounding effects in epidemiological studies. Count data arising from these two designs may not be independent. While conditional Poisson regression models have been used to take into account the dependence of such data, these models have not been available in some standard statistical software packages (e.g., SAS). This article demonstrates 1) the relationship of the likelihood function and parameter estimation between the conditional Poisson regression models and Cox’s proportional hazards models in SCCS and matched cohort studies; and 2) that it is possible to fit conditional Poisson regression models with procedures (e.g., PHREG in SAS) using Cox’s partial likelihood model. We tested both conditional Poisson likelihood and Cox’s partial likelihood models on data from studies using either an SCCS or a matched cohort design. For the SCCS study, we fitted both parametric and semi-parametric models to model age effects, and we describe a simple way to apply the parametric and complex semi-parametric analysis to case series data.
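One way to see why a partial-likelihood routine can stand in for a conditional Poisson fit is that conditioning on an individual’s total event count removes the individual-specific baseline rate and leaves a multinomial kernel with terms of the form exp(η_j) / Σ_k exp(η_k), the same form that appears in Cox’s partial likelihood. The Python sketch below checks this numerically for a single individual. It is only an illustration under stated assumptions, not the authors’ SAS procedure: the function names, the single covariate, the log-linear rate `baseline * exposure * exp(beta * x)`, and all data values are hypothetical.

```python
import math


def poisson_logpmf(k, mu):
    """Log of the Poisson probability mass function at count k with mean mu."""
    return k * math.log(mu) - mu - math.lgamma(k + 1)


def conditional_loglik(counts, exposures, x, beta, baseline):
    """log P(n_1, ..., n_J | sum_j n_j) for independent Poisson interval counts.

    Computed as the joint Poisson log-probability minus the Poisson
    log-probability of the total count.
    """
    mus = [baseline * e * math.exp(beta * xi) for e, xi in zip(exposures, x)]
    joint = sum(poisson_logpmf(n, mu) for n, mu in zip(counts, mus))
    total = poisson_logpmf(sum(counts), sum(mus))
    return joint - total


def multinomial_kernel(counts, exposures, x, beta):
    """sum_j n_j * log(w_j / sum_k w_k) with w_j = e_j * exp(beta * x_j)."""
    w = [e * math.exp(beta * xi) for e, xi in zip(exposures, x)]
    w_sum = sum(w)
    return sum(n * math.log(wj / w_sum) for n, wj in zip(counts, w))


def log_multinomial_coef(counts):
    """log( N! / prod_j n_j! ), which does not depend on beta."""
    return math.lgamma(sum(counts) + 1) - sum(math.lgamma(n + 1) for n in counts)


# Hypothetical individual: three intervals, the first being the exposed period.
counts = [2, 0, 1]
exposures = [30.0, 60.0, 275.0]  # interval lengths in days
x = [1, 0, 0]                    # exposure indicator per interval
beta = 0.7

a = conditional_loglik(counts, exposures, x, beta, baseline=0.01)
b = conditional_loglik(counts, exposures, x, beta, baseline=5.0)
kernel = multinomial_kernel(counts, exposures, x, beta)
print(a, b, kernel + log_multinomial_coef(counts))
```

The three printed values agree up to floating-point error: the conditional log-likelihood does not depend on the baseline rate, and it differs from the kernel only by a constant that is free of `beta`, so maximizing either gives the same estimate. The log interval length plays the role of an offset in the linear predictor.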