Abstract: For longitudinal binary data with non-monotone, non-ignorable missing outcomes over time, a full likelihood approach is algebraically complicated, and maximum likelihood estimation can be computationally prohibitive with many times of follow-up. We propose pseudo-likelihoods to estimate the covariate effects on the marginal probabilities of the outcomes, in addition to the association parameters and missingness parameters. The pseudo-likelihood requires specification of the distribution for the data at all pairs of times on the same subject, but makes no assumptions about the joint distribution of the data at three or more times on the same subject, so the method can be considered semi-parametric. By contrast, maximum likelihood yields consistent estimates only if the full likelihood is correctly specified. We show in simulations that our proposed pseudo-likelihood produces a more efficient estimate of the regression parameters than the pseudo-likelihood for non-ignorable missingness proposed by Troxel et al. (1998). Application to data from the Six Cities study (Ware et al., 1984), a longitudinal study of the health effects of air pollution, is discussed.
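As a rough illustration of the pairwise construction described in this abstract, a generic pairwise log pseudo-likelihood can be written as below. The notation is an assumption introduced here, not taken from the paper: $y_{it}$ is the binary outcome and $r_{it}$ the missingness indicator for subject $i$ at time $t$, $x_i$ the covariates, $\theta$ the collection of regression, association, and missingness parameters, and $f$ the assumed bivariate model for a pair of times. The authors' exact form may differ.

```latex
% Generic pairwise log pseudo-likelihood (illustrative notation, not the paper's)
\ell_p(\theta)
  = \sum_{i=1}^{N} \sum_{1 \le s < t \le T}
    \log f\!\left(y_{is}^{\mathrm{obs}},\, y_{it}^{\mathrm{obs}},\, r_{is},\, r_{it} \mid x_i;\ \theta\right)
% When an outcome in the pair is missing, the pair's contribution is the
% bivariate model summed over the possible values (0 and 1) of that outcome.
```

Only the bivariate distributions enter this sum, which is why no assumption about three or more times is needed.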
Abstract: Anti-smoking media campaigns are an effective tobacco control strategy. Identifying which types of advertising messages are effective is important for making the best use of the limited funding available for such campaigns. In this paper, we propose a statistical modeling approach for systematically assessing the effectiveness of anti-smoking media campaigns based on ad recall rates and rating scores. This research is motivated by the need to evaluate youth responses to the Massachusetts Tobacco Control Program (MTCP) media campaign. Pattern-mixture GEE models are proposed to evaluate the impact of viewer and ad characteristics on ad recall rates and rating scores, controlling for missing values, confounding, and correlations in the data. A key difficulty for pattern-mixture modeling is that there are too many distinct missing data patterns, which causes convergence problems when fitting models to limited data. A heuristic argument based on collapsing missing data patterns is used to test the missing completely at random (MCAR) assumption in pattern-mixture GEE models. The proposed modeling approach and the recall-rating study design provide a complete system for identifying the most effective type of advertising messages.
Abstract: Interest in estimating the probability of cure has been increasing in cancer survival analysis as the cure of some cancer sites is becoming a reality. Mixture cure models have been used to model failure time data in the presence of long-term survivors. The mixture cure model assumes that a fraction of the survivors are cured of the disease of interest. The failure time distribution for the uncured individuals (latency) can be modeled by either parametric models or a semi-parametric proportional hazards model. In the model, the probability of cure and the latency distribution are both related to the prognostic factors and patients' characteristics. The maximum likelihood estimates (MLEs) of these parameters can be obtained using the Newton-Raphson algorithm. The EM algorithm has been proposed as a simple alternative by Larson and Dinse (1985) and Taylor (1995) in various settings for cause-specific survival analysis. This approach is extended here to grouped relative survival data. The methods are applied to analyze the colorectal cancer relative survival data from the Surveillance, Epidemiology, and End Results (SEER) program.
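For orientation, the standard mixture cure model and the usual E-step weight can be sketched as follows. This is the textbook form for individual-level right-censored data, not the grouped relative survival extension developed in the paper. All symbols are assumptions introduced here: $\pi(z)$ is the probability of being uncured, $S_u$ the latency survival function, $\delta_i$ the event indicator, and the logistic link is one common choice that the abstract does not specify.

```latex
% Population survival: a cured fraction 1 - \pi(z) never fails
S_{\mathrm{pop}}(t \mid x, z) = 1 - \pi(z) + \pi(z)\, S_u(t \mid x),
\qquad
\pi(z) = \frac{\exp(\gamma^{\top} z)}{1 + \exp(\gamma^{\top} z)}

% E-step: posterior probability that subject i is uncured
w_i = \delta_i + (1 - \delta_i)\,
      \frac{\pi(z_i)\, S_u(t_i \mid x_i)}
           {1 - \pi(z_i) + \pi(z_i)\, S_u(t_i \mid x_i)}
```

Subjects with an observed event are uncured with certainty ($w_i = 1$), while censored subjects receive a fractional weight. In the M-step these weights allow the cure part and the latency part to be updated separately.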
Abstract: Data systems collecting information from different sources or over long periods of time can receive multiple reports from the same individual. An important example is public health surveillance systems that monitor conditions with long natural histories. Several state-level systems for surveillance of one such condition, the human immunodeficiency virus (HIV), use codes composed of combinations of non-unique personal characteristics such as birth date, soundex (a code based on last name), and sex as patient identifiers. As a result, these systems cannot distinguish between several different individuals having identical codes and a unique individual erroneously represented several times. We applied results for occupancy models to estimate the potential magnitude of duplicate case counting for AIDS cases reported to the Centers for Disease Control and Prevention with only non-unique partial personal identifiers. Occupancy models with equal and unequal occupancy probabilities are considered. Unbiased estimators for the numbers of true duplicates within and between case reporting areas are provided. Formulas to calculate estimators' variances are also provided. These results can be applied to evaluating duplicate reporting in other data systems that have no unique identifier for each individual.
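To give a sense of the occupancy calculation underlying this abstract, the sketch below computes the classical expected number of distinct codes occupied when individuals are assigned to codes at random, and from it the expected number of purely coincidental code matches among distinct individuals. It is a minimal illustration of the general occupancy idea, not the authors' estimators. The function names and the example figures are hypothetical.

```python
from typing import Sequence


def expected_occupied_equal(n_people: int, n_codes: int) -> float:
    """Expected number of distinct codes used when each of n_people
    independently receives one of n_codes equally likely codes."""
    return n_codes * (1.0 - (1.0 - 1.0 / n_codes) ** n_people)


def expected_occupied_unequal(n_people: int, probs: Sequence[float]) -> float:
    """Same quantity when code j has probability probs[j] (summing to 1)."""
    return sum(1.0 - (1.0 - p) ** n_people for p in probs)


def expected_coincidental_matches(n_people: int, n_codes: int) -> float:
    """Expected number of distinct individuals who would look like duplicate
    reports because their code is already used by someone else."""
    return n_people - expected_occupied_equal(n_people, n_codes)


if __name__ == "__main__":
    # Hypothetical figures: 10,000 people and 1,000,000 possible codes.
    print(expected_coincidental_matches(10_000, 1_000_000))
```

Comparing an expected count of this kind with the number of repeated codes actually observed is one way to think about how many repeats could be chance matches rather than true duplicate reports.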
Abstract: The spread of airborne plant diseases from a propagule source is classically assessed by fitting a gradient curve to aggregated data from field experiments. However, aggregating data reduces the information available about the processes involved in disease spread. To overcome this problem, individual count data can be collected; this was done in the case of short-distance spread of wheat brown rust. For such data, however, the gradient curve is a limited model, since heterogeneity of hosts is ignored and, consequently, overdispersion occurs. We therefore propose a parametric frailty model in which the frailties represent propensities of hosts to be infected. The model is used to assess dispersal of propagules and heterogeneity of hosts.
Abstract: A seasonal additive nonlinear vector autoregression (SANVAR) model is proposed for multivariate seasonal time series to explore possible interactions among the univariate series. Significant lagged variables are selected, and additive autoregression functions are estimated from the selected variables using a spline smoothing method. Conservative confidence bands are constructed for the additive autoregression functions. The model is fitted to two sets of bivariate quarterly unemployment rate data, with comparisons made to the linear periodic vector autoregression model. It is found that when the data do not deviate significantly from linearity, the periodic model is preferred. In cases of strong nonlinearity, however, the additive model is more parsimonious and has much higher out-of-sample predictive power. In addition, interactions among the univariate series are automatically detected.
Abstract: The paper presents a statistical analysis of electricity spot prices in a deregulated market in New South Wales, Australia, from 10 May 1996 to 7 March 1998. It is unusual for a single data set such as this to allow one to consider a relatively systematic sequence of statistical problems, each resulting in clear, although not always obvious, solutions. For this reason, these data and their analysis can serve as a relatively good basis for training in practical statistical analysis. Formerly a report, the material has been used in lecture courses at several universities in Australia and New Zealand.