Abstract: We have developed an automated scheme for linking PUBMED citations with GO terms using the Support Vector Machine (SVM), a classification algorithm. The PUBMED database, with over 12 million citations, has been essential to life science researchers. More recently, GO (Gene Ontology) has provided a graph structure for the biological process, cellular component, and molecular function of genomic data. By text mining the textual content of PUBMED citations and associating them with GO terms, we have built an ontological map between these databases so that users can search PUBMED via GO terms and, conversely, GO entries via PUBMED classification. Consequently, some interesting and unexpected knowledge may be captured from them for further data analysis and biological experimentation. This paper reports our results on the SVM implementation and the need to parallelize the training phase.
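The citation-to-term linking described above is, at its core, supervised text classification. A minimal sketch of that idea, assuming scikit-learn is available; the toy abstracts and GO-term labels below are hypothetical stand-ins for real PUBMED/GO data, not the authors' system:

```python
# Sketch: linear SVM over TF-IDF features for assigning GO-style
# labels to short biomedical texts. Data are illustrative only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

docs = [
    "kinase activity regulates signal transduction",
    "mitochondrial membrane transport of ions",
    "phosphorylation cascade in signal transduction",
    "ion channel activity in the mitochondrial membrane",
]
labels = ["signal transduction", "membrane transport",
          "signal transduction", "membrane transport"]

# TF-IDF turns each citation into a sparse vector; LinearSVC
# learns a separating hyperplane per label.
clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(docs, labels)
print(clf.predict(["protein kinase signaling pathway"]))
```

Training a linear SVM over millions of citations is the expensive step, which is what motivates the parallelization discussed in the paper.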
Abstract: Image de-noising is the process of removing noise from an image that has been corrupted by it. Wavelet methods are among the various approaches for recovering infinite-dimensional objects such as curves, densities, and images. Wavelet techniques are very effective at removing noise because of their ability to capture the energy of a signal in a few transform values. These methods are based on shrinking the wavelet coefficients in the wavelet domain. This paper concentrates on selecting a threshold for wavelet function estimation. A new threshold value is proposed for shrinking the wavelet coefficients obtained by wavelet decomposition of a noisy image, under the assumption that the sub-band coefficients follow a generalized Gaussian distribution. The proposed threshold value is based on the power of 2 in the size 2^J × 2^J of the data and can be computed efficiently. Experiments have been conducted on various test images to compare with established threshold parameters. The results show that the proposed threshold value removes the noise significantly.
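The shrinkage idea can be sketched in a few lines. Below is a NumPy-only illustration on a 1-D signal: one level of the Haar transform, soft-thresholding of the detail coefficients, and reconstruction. Note the threshold used here is the classic universal threshold sigma*sqrt(2 log n), not the GGD-based threshold proposed in the paper:

```python
# Sketch: one-level Haar wavelet shrinkage with soft thresholding.
import numpy as np

def haar_denoise(x, sigma):
    n = len(x)                              # assumes n is even
    a = (x[0::2] + x[1::2]) / np.sqrt(2)    # approximation coefficients
    d = (x[0::2] - x[1::2]) / np.sqrt(2)    # detail coefficients
    t = sigma * np.sqrt(2 * np.log(n))      # universal threshold
    d = np.sign(d) * np.maximum(np.abs(d) - t, 0.0)  # soft shrinkage
    y = np.empty(n)                         # inverse Haar transform
    y[0::2] = (a + d) / np.sqrt(2)
    y[1::2] = (a - d) / np.sqrt(2)
    return y

rng = np.random.default_rng(0)
clean = np.sin(np.linspace(0, 4 * np.pi, 256))
noisy = clean + 0.3 * rng.standard_normal(256)
denoised = haar_denoise(noisy, sigma=0.3)
print(np.mean((noisy - clean) ** 2), np.mean((denoised - clean) ** 2))
```

Because the signal's energy concentrates in a few large coefficients while the noise spreads evenly across all of them, thresholding the small coefficients removes mostly noise; the paper's contribution is a better choice of the threshold t.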
The power function distribution is a flexible lifetime distribution with applications in finance and economics. It is also used to model the reliability growth of complex systems or the reliability of repairable systems. A new weighted power function distribution is proposed using a logarithmic weight function. Statistical properties of the weighted power function distribution are obtained and studied. Location measures such as the mode, median, and mean, reliability measures such as the reliability function, hazard and reversed hazard functions, and the mean residual life are derived. Shape indices such as the skewness and kurtosis coefficients and order statistics are obtained. Parametric estimation is performed to obtain estimators for the parameters of the distribution using three estimation methods, namely the maximum likelihood method, the L-moments method, and the method of moments. A numerical simulation is carried out to validate the robustness of the proposed distribution.
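As a point of reference for the estimation step, here is a sketch of maximum likelihood for the baseline (unweighted) power function distribution with density f(x) = a·x^(a-1) on (0, 1); the weighted variant proposed in the paper modifies this density by a logarithmic weight, which is not reproduced here:

```python
# Sketch: closed-form MLE for the shape parameter of the baseline
# power function distribution f(x) = a * x^(a-1), 0 < x < 1.
import numpy as np

def mle_shape(x):
    # Log-likelihood n*log(a) + (a-1)*sum(log x) is maximized at
    # a_hat = -n / sum(log x).
    return -len(x) / np.sum(np.log(x))

rng = np.random.default_rng(2)
a_true = 3.0
# Inverse-CDF sampling: the CDF is x^a, so X = U^(1/a).
x = rng.uniform(0, 1, 5000) ** (1.0 / a_true)
print(mle_shape(x))
```

The L-moments and method-of-moments estimators compared in the paper would replace the closed-form likelihood maximization with moment-matching equations.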
Abstract: Among the many statistical methods for linear models with the multicollinearity problem, partial least squares regression (PLSR) has become increasingly popular in recent years and is very often the best choice. However, while dealing with a prediction problem from the automobile market, we noticed that the results from PLSR appear unstable, though it is still the best among several standard statistical methods. This instability is likely due to information contained in the explanatory variables that is irrelevant to the response variable. Based on the PLSR algorithm, this paper introduces a new method, modified partial least squares regression (MPLSR), to emphasize the impact of the relevant information in the explanatory variables on the response variable. With the MPLSR method, satisfactory prediction results are obtained for the above practical problem. The performance of MPLSR, PLSR, and some standard statistical methods is compared in a set of Monte Carlo experiments. This paper shows that MPLSR is the most stable and accurate method, especially when the ratio of the number of observations to the number of explanatory variables is low.
Abstract: The receiver operating characteristic (ROC) curve is an effective and widely used method for evaluating the discriminating power of a diagnostic test or statistical model. As a useful statistical method, a wealth of literature on its theory and computation has been established. Research on ROC curves, however, has focused mainly on cross-sectional designs. Very little research on estimating ROC curves and their summary statistics, especially significance testing, has been conducted for repeated measures designs. Due to the complexity of estimating the standard error of a ROC curve, no established statistical method currently exists for testing the significance of ROC curves under a repeated measures design. In this paper, we estimate the area under a ROC curve for a repeated measures design through a generalized linear mixed model (GLMM), using the predicted probability of a disease or positivity of a condition, and propose a bootstrap method to estimate the standard error of the area under the curve for such designs. Statistical significance testing of the area under the ROC curve is then conducted using the bootstrapped standard error. The validity of the bootstrap approach and the statistical testing of the area under the ROC curve were confirmed through simulation analyses. Specialized statistical software written in SAS/IML/MACRO v8 was also created to implement the bootstrapping algorithm, conduct the calculations, and perform the statistical testing.
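The core bootstrap idea can be sketched compactly. The example below resamples independent subjects in a cross-sectional setting; it is not the paper's GLMM-based repeated-measures procedure, and the simulated scores are illustrative only:

```python
# Sketch: bootstrap standard error of the AUC, with a z-test
# against the null hypothesis AUC = 0.5 (no discrimination).
import numpy as np

def auc(scores, labels):
    # Mann-Whitney form of the AUC: P(score_pos > score_neg),
    # counting ties as 1/2.
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    gt = (pos[:, None] > neg[None, :]).mean()
    eq = (pos[:, None] == neg[None, :]).mean()
    return gt + 0.5 * eq

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=200)
scores = labels + rng.standard_normal(200)   # informative test

boot = []
for _ in range(500):                          # resample subjects
    idx = rng.integers(0, len(labels), size=len(labels))
    boot.append(auc(scores[idx], labels[idx]))
se = np.std(boot, ddof=1)
z = (auc(scores, labels) - 0.5) / se
print(auc(scores, labels), se, z)
```

Under a repeated measures design, the resampling unit would be the subject with all of their repeated observations, so that within-subject correlation is preserved in each bootstrap replicate.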
Abstract: Particulate matter smaller than 2.5 microns (PM2.5) is a commonly measured parameter in ground-based sampling networks designed to assess short- and long-term air quality. The measurement techniques for ground-based PM2.5 are relatively accurate and precise, but monitoring locations are spatially too sparse for many applications. Aerosol optical depth (AOD) is a satellite-based air quality measurement that can be computed for more spatial locations, but it measures light attenuation by particulates throughout the entire air column, not just near the ground. The goal of this paper is to better characterize the spatio-temporal relationship between the two measurements. An informative relationship will aid in imputing PM2.5 values for health studies in a way that accounts for the variability in both sets of measurements, something physics-based models cannot do. We use a data set of Chicago air quality measurements taken during 2007 and 2008 to construct a weekly hierarchical model. We also demonstrate that AOD measurements and a latent spatio-temporal process aggregated weekly can aid in the prediction of PM2.5 measurements.
Abstract: The primary advantage of panel over cross-sectional regression stems from its control for the effects of omitted variables, or "unobserved heterogeneity". However, panel regression is based on the strong assumptions that measurement errors are independently and identically distributed (i.i.d.) and normal. These assumptions are evaded by design-based regression, which dispenses with measurement errors altogether by regarding the response as a fixed real number. The present paper establishes a middle ground between these extreme interpretations of longitudinal data. The individual is now represented as a panel of responses containing dependently, non-identically distributed (d.n.d.) measurement errors. Modeling the expectations of these responses preserves the Neyman randomization theory, rendering panel regression slopes approximately unbiased and normal in the presence of arbitrarily distributed measurement error. The generality of this reinterpretation is illustrated with German Socio-Economic Panel (GSOEP) responses that are discretely distributed on a 3-point scale.
Abstract: This article extends the recent work of Vännman and Albing (2007) regarding the new family of quantile-based process capability indices (qPCI) CMA(τ, v). We develop both asymptotic parametric and nonparametric confidence limits and testing procedures for CMA(τ, v). A kernel density estimator of the process was proposed to find a consistent estimator of the variance of the nonparametric consistent estimator of CMA(τ, v). Therefore, the proposed procedure is ready for practical implementation for any process. Illustrative examples are also provided to show the steps of applying the proposed methods directly to real-life problems. We also present a simulation study on the sample size required for using the asymptotic results.
In this paper, we introduce a new generalized family of distributions with bounded support (0, 1), namely the Topp-Leone-G family. Some mathematical properties of the proposed family have been studied. The new density function can be symmetrical, left-skewed, right-skewed, or reverse-J shaped. Furthermore, the hazard rate function can be constant, increasing, decreasing, J-shaped, or bathtub-shaped. Three special models are discussed. We obtain simple expressions for the ordinary and incomplete moments, quantile and generating functions, mean deviations, and entropies. The method of maximum likelihood is used to estimate the model parameters. The flexibility of the new family is illustrated by means of three real data sets.
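One common construction in the Topp-Leone-G literature composes the Topp-Leone CDF with a baseline CDF G, giving F(x) = [1 - (1 - G(x))^2]^α for a shape parameter α > 0; whether this matches the paper's exact parameterization is an assumption. A minimal sketch with an exponential baseline:

```python
# Sketch: Topp-Leone-G CDF built from a baseline CDF G (assumed
# construction: F(x) = [1 - (1 - G(x))^2]^alpha, alpha > 0).
import numpy as np

def tl_g_cdf(x, alpha, base_cdf):
    g = base_cdf(x)
    return (1.0 - (1.0 - g) ** 2) ** alpha

exp_cdf = lambda x: 1.0 - np.exp(-x)     # exponential(1) baseline
x = np.linspace(0.0, 10.0, 1001)
F = tl_g_cdf(x, alpha=0.5, base_cdf=exp_cdf)
print(F[0], F[-1], bool(np.all(np.diff(F) >= 0)))
```

Since the Topp-Leone transform maps (0, 1) to (0, 1) monotonically, any valid baseline CDF yields a valid generalized CDF, which is what makes the family a flexible wrapper around existing models.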