Pub. online: 4 Aug 2022
Type: Research Article
Open Access
Journal: Journal of Data Science
Volume 18, Issue 5 (2020): Special Issue S1 in Chinese (with abstract in English), pp. 875–888
Abstract
In the wake of the COVID-19 outbreak, the public turned to Sina Weibo as a major platform for following the course of the pandemic. Research on public sentiment and topic mining for major public opinion events, based on Sina Weibo comment data, is therefore important for understanding how public opinion evolves during major epidemic outbreaks. Based on a psychological classification of Chinese emotion categories, we use open-source tools to build naive Bayesian models to classify Weibo comments. Visualization of comment topics is achieved with word co-occurrence networks, and the topics themselves are mined with a latent Dirichlet allocation model. The results show that the psychological sentiment classification combined with the naive Bayesian model can reflect the evolution of public sentiment during the epidemic, and that the latent Dirichlet allocation model and the word co-occurrence network can effectively mine the topics of public concern.
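To make the pipeline concrete, here is a minimal sketch of the two modeling steps the abstract describes, assuming the scikit-learn and jieba libraries and a handful of hypothetical labeled comments; the paper's own tools, emotion taxonomy, and corpus are not reproduced here.

```python
# A minimal sketch: naive Bayes emotion classification plus LDA topic mining.
# The comments, labels, and library choices are illustrative assumptions.
import jieba
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical training data: Weibo comments labeled with emotion categories.
comments = ["今天很开心", "疫情让人担忧", "医护人员辛苦了"]
labels = ["joy", "fear", "gratitude"]

# Segment Chinese text into words before vectorizing.
tokenized = [" ".join(jieba.lcut(c)) for c in comments]

# Step 1: naive Bayes emotion classifier on bag-of-words counts.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(tokenized)
clf = MultinomialNB().fit(X, labels)

# Step 2: latent Dirichlet allocation to mine comment topics.
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
for k, topic in enumerate(lda.components_):
    top_words = [vectorizer.get_feature_names_out()[i]
                 for i in topic.argsort()[-5:]]
    print(f"topic {k}: {top_words}")
```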
Pub. online: 4 Aug 2022
Type: Research Article
Open Access
Journal: Journal of Data Science
Volume 18, Issue 5 (2020): Special Issue S1 in Chinese (with abstract in English), pp. 875–888
Abstract
Surveillance of the development of COVID-19 is a complex and challenging task whose foundation is timely and accurate epidemic data. Quality control for releasing COVID-19 data is therefore very important, accounting for the releasing agent, the content to be released, and the impact of the released data. We suggest that quality requirements for the release of COVID-19 data be based on the global perspective that the goal of open epidemic data is to create a valuable ecological chain involving all stakeholders. As such, the collection, aggregation, and release of COVID-19 data should meet not only the data quality standards of official statistics and health statistics, but also the characteristics of epidemic statistics and the needs of pandemic prevention. The quality requirements should follow the unique characteristics of the epidemic and be open to public scrutiny. Integrating the perspectives of official statistics, health statistics, and open government data, we propose five quality dimensions for releasing COVID-19 data: accuracy, timeliness, systematicness, user-friendliness, and security. Through case studies of the official websites of Chinese provincial health commissions, we report quality problems in the current data release process and suggest improvements.
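As an illustration only, the sketch below shows how two of the five proposed dimensions (accuracy and timeliness) could be checked mechanically for a daily release; the field names and toy records are our own assumptions, not taken from the provincial websites studied in the paper.

```python
# A minimal sketch of automated checks for two quality dimensions:
# accuracy (internal consistency of counts) and timeliness (no release gaps).
from datetime import date, timedelta

def check_release(records):
    """records: list of dicts with 'date', 'cumulative_cases', 'new_cases'."""
    issues = []
    for prev, cur in zip(records, records[1:]):
        # Accuracy: cumulative counts must be internally consistent.
        if cur["cumulative_cases"] != prev["cumulative_cases"] + cur["new_cases"]:
            issues.append(f"{cur['date']}: cumulative total inconsistent")
        # Timeliness: releases should appear every day without gaps.
        if cur["date"] - prev["date"] != timedelta(days=1):
            issues.append(f"gap before {cur['date']}")
    return issues

print(check_release([
    {"date": date(2020, 2, 1), "cumulative_cases": 100, "new_cases": 10},
    {"date": date(2020, 2, 2), "cumulative_cases": 112, "new_cases": 12},
    {"date": date(2020, 2, 4), "cumulative_cases": 120, "new_cases": 8},
]))
```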
Many software reliability growth models (SRGMs) based upon a non-homogeneous Poisson process (NHPP) have been proposed to measure and assess the reliability of a software system quantitatively. Generally, the error detection rate and the fault content function during software testing are considered to depend on the elapsed testing time. In this paper we propose three SRGMs that incorporate the notion of error generation over time, as extensions of the delayed S-shaped software reliability growth model based on an NHPP. The model parameters are estimated using the maximum likelihood method for interval domain data, and three data sets are used to illustrate the estimation technique. The proposed models are compared with the existing delayed S-shaped model in terms of error sum of squares, mean sum of squares, predictive ratio risk, and Akaike's information criterion on the three data sets. We show that the proposed models perform noticeably better than the existing models.
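For reference, here is a minimal sketch of maximum likelihood estimation for the classical delayed S-shaped NHPP model, whose mean value function is m(t) = a(1 - (1 + bt)exp(-bt)), fitted to hypothetical grouped (interval domain) fault counts with SciPy; the paper's extended models with error generation add parameters not shown here.

```python
# MLE for the delayed S-shaped NHPP model on grouped fault-count data.
# The interval endpoints and counts below are made up for illustration.
import numpy as np
from scipy.optimize import minimize

t = np.array([1., 2., 3., 4., 5., 6.])      # interval endpoints (e.g. weeks)
n = np.array([5, 9, 12, 8, 5, 2])           # faults detected per interval

def mean_value(t, a, b):
    return a * (1.0 - (1.0 + b * t) * np.exp(-b * t))

def neg_loglik(params):
    a, b = params
    if a <= 0 or b <= 0:
        return np.inf
    m = mean_value(np.concatenate(([0.0], t)), a, b)
    dm = np.diff(m)                          # expected faults per interval
    if np.any(dm <= 0):
        return np.inf
    # Grouped NHPP log-likelihood (constant terms dropped).
    return -(np.sum(n * np.log(dm)) - m[-1])

fit = minimize(neg_loglik, x0=[n.sum() * 1.5, 0.5], method="Nelder-Mead")
print("a_hat, b_hat =", fit.x)
```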
Abstract: The Lee-Carter model and its extensions are the most popular methods in the field of mortality forecasting. However, despite the many methods introduced so far, there is no general method applicable to all situations. Singular spectrum analysis (SSA) is a relatively new, powerful, nonparametric time series technique whose forecasting ability has been demonstrated across various fields. In this paper, we investigate the feasibility of using SSA to construct mortality forecasts. We use the Hyndman-Ullah model, a recent extension of the Lee-Carter model, as a benchmark to evaluate the performance of SSA for mortality forecasting on French data sets.
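A minimal sketch of basic SSA with recurrent forecasting follows, applied to a toy series; the window length, number of components, and data are illustrative choices rather than the settings used for the French mortality data.

```python
# Basic SSA: embed, SVD, reconstruct with leading components, then forecast
# via the standard linear recurrence. Toy data, not the paper's series.
import numpy as np

def ssa_forecast(x, L, r, steps):
    N = len(x)
    K = N - L + 1
    # Embedding: build the L-by-K trajectory matrix.
    X = np.column_stack([x[i:i + L] for i in range(K)])
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    # Keep the r leading components, then diagonally average back to a series.
    Xr = (U[:, :r] * s[:r]) @ Vt[:r]
    rec = np.array([np.mean(Xr[::-1].diagonal(k)) for k in range(-L + 1, K)])
    # Linear recurrence coefficients from the leading left singular vectors.
    P = U[:, :r]
    pi = P[-1, :]
    R = P[:-1, :] @ pi / (1.0 - pi @ pi)
    out = list(rec)
    for _ in range(steps):
        out.append(R @ np.array(out[-(L - 1):]))
    return np.array(out[N:])

# Toy declining (log-mortality-like) series with noise.
rng = np.random.default_rng(0)
x = np.exp(-0.02 * np.arange(50)) + 0.01 * rng.standard_normal(50)
print(ssa_forecast(x, L=12, r=2, steps=5))
```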
Pub. online: 4 Aug 2022
Type: Research Article
Open Access
Journal: Journal of Data Science
Volume 18, Issue 5 (2020): Special Issue S1 in Chinese (with abstract in English), pp. 849–859
Abstract
Millions of people traveled from Wuhan to other cities between January 1 and January 23, 2020. Taking advantage of masked software development kit data from Aurora Mobile Ltd. and open epidemic data released by health authorities, we analyze the relationship between the number of confirmed COVID-19 cases in a region and the number of people who traveled from Wuhan to that region during this period. Further, we identify high-risk carriers of COVID-19 to improve the control of the disease. The key findings are threefold: (1) in each region, the number of high-risk carriers is highly positively correlated with the severity of the epidemic; (2) a history of visits to the 62 designated hospitals is the foremost index of risk; and (3) the second most important index is the traveler's duration of stay in Wuhan. Based on our analysis, we estimate that, as of February 4, 2020, (a) among the 8.5 million people held up in Wuhan, there were 425 thousand high-risk carriers; and (b) among the 3.5 million migrant workers held up in Hubei, there were 175 thousand high-risk carriers. The disease control authorities should closely monitor these groups.
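The sketch below illustrates, on made-up toy numbers, the kind of region-level correlation check behind finding (1) and a naive two-feature risk score reflecting findings (2) and (3); the actual indices are built from proprietary Aurora Mobile data, and the weights here are arbitrary.

```python
# Toy illustration of a region-level correlation check and a naive risk score.
import numpy as np

# Hypothetical per-region counts (not the paper's data).
high_risk = np.array([120, 340, 80, 510, 60])
confirmed = np.array([35, 110, 20, 160, 15])
print("Pearson r:", round(np.corrcoef(high_risk, confirmed)[0, 1], 3))

# Naive score: hospital-visit history dominates, then duration of stay in Wuhan.
def risk_score(visited_designated_hospital, days_in_wuhan):
    return 2.0 * visited_designated_hospital + 0.1 * min(days_in_wuhan, 20)

print(risk_score(visited_designated_hospital=True, days_in_wuhan=14))
```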
Abstract: Applying linear mixed models or generalized linear mixed models to large databases in which the level-2 units (hospitals) have a wide variety of characteristics is a problem frequently encountered in studies of medical quality. Accurate estimation of model parameters and standard errors requires accounting for the grouping of outcomes within hospitals, and including the hospitals as a random effect in the model is a common way of doing so. However, in a large, diverse population, the required assumptions are not satisfied, which can lead to inconsistent and biased parameter estimates. One solution is to use cluster analysis, with clustering variables distinct from the model covariates, to group the hospitals into smaller, more homogeneous groups, and then carry out the analysis within these groups. We illustrate this approach with a study of hemoglobin A1c control among diabetic patients in a national database of United States Department of Veterans Affairs (VA) hospitals.
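A minimal sketch of this two-stage strategy follows: cluster hospitals on characteristics not used as model covariates, then fit a random-intercept model within each cluster. The data, variable names, and cluster count are hypothetical, not the VA database.

```python
# Two-stage analysis: k-means on hospital characteristics, then a mixed model
# within each cluster. All data below are simulated for illustration.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
hospitals = pd.DataFrame({
    "hospital": range(40),
    "size": rng.normal(300, 80, 40),   # clustering variables, deliberately
    "rural": rng.integers(0, 2, 40),   # distinct from the model covariates
})
hospitals["cluster"] = KMeans(n_clusters=3, n_init=10,
                              random_state=0).fit_predict(
    hospitals[["size", "rural"]])

# Patient-level outcomes nested within hospitals.
patients = pd.DataFrame({
    "hospital": rng.integers(0, 40, 2000),
    "age": rng.normal(65, 10, 2000),
})
patients["a1c"] = 7 + 0.01 * patients["age"] + rng.normal(0, 0.5, 2000)
patients = patients.merge(hospitals[["hospital", "cluster"]], on="hospital")

# Fit a random-intercept model separately within each homogeneous cluster.
for c, grp in patients.groupby("cluster"):
    fit = smf.mixedlm("a1c ~ age", grp, groups=grp["hospital"]).fit()
    print(f"cluster {c}: age effect = {fit.params['age']:.3f}")
```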
We fit a Cox proportional hazards (PH) model to interval-censored survival data by first subdividing each individual's failure interval into non-overlapping sub-intervals. From the set of all interval endpoints in the data set, those that fall into an individual's interval are used as the cut points for that individual's sub-intervals. Each sub-interval has an accompanying weight calculated from a parametric Weibull model based on the current parameter estimates. A weighted PH model is then fit with multiple lines of observations corresponding to the sub-intervals for each individual, where the lower end of each sub-interval is used as the observed failure time and the accompanying weights are incorporated. Right-censored observations are handled in the usual manner. We iterate between estimating the baseline Weibull distribution and fitting the weighted PH model until the regression parameters of interest converge; the regression parameter estimates are fixed as an offset when we update the Weibull estimates and recalculate the weights. Our approach is similar to Satten et al.'s (1998) method for interval-censored survival analysis, which used imputed failure times generated from a parametric model in a PH model. Simulation results demonstrate apparently unbiased parameter estimation for the correctly specified Weibull model and little to no bias for a mis-specified log-logistic model. Breast cosmetic deterioration data and ICU hyperlactemia data are analyzed.
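Here is a minimal sketch of one iteration of this weighting scheme, using the lifelines library for the weighted PH fit; the outer convergence loop, the offset-based update of the Weibull baseline, and right-censored rows are omitted, and all data are toy values.

```python
# One iteration: expand interval-censored rows into weighted sub-intervals,
# then fit a weighted Cox PH model. Toy data; Weibull parameters assumed given.
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

# Failure known only to lie in (left, right]; x is a covariate.
data = pd.DataFrame({"left": [1., 2., 1., 3.],
                     "right": [3., 4., 2., 6.],
                     "x": [0., 1., 1., 0.]})

# Current Weibull baseline estimates (shape k, scale lam).
k, lam = 1.2, 4.0
S = lambda t: np.exp(-(t / lam) ** k)       # Weibull survival function

cuts = np.unique(data[["left", "right"]].values.ravel())
rows = []
for _, r in data.iterrows():
    # Endpoints falling inside this individual's interval define sub-intervals.
    pts = [r["left"]] + [c for c in cuts if r["left"] < c <= r["right"]]
    for lo, hi in zip(pts, pts[1:]):
        # Weight = conditional Weibull probability of failing in (lo, hi].
        w = (S(lo) - S(hi)) / (S(r["left"]) - S(r["right"]))
        # Lower end of the sub-interval serves as the observed failure time.
        rows.append({"time": lo, "event": 1, "x": r["x"], "w": w})

expanded = pd.DataFrame(rows)
cph = CoxPHFitter().fit(expanded, duration_col="time", event_col="event",
                        weights_col="w", robust=True)
print(cph.params_)
```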
In this article, we introduce a class of distributions with heavier tails than the Pareto distribution of the third kind, which we term the heavy-tailed Pareto (HP) distribution. Various structural properties of the new distribution are derived. It is shown that the HP distribution is in the minimum domain of attraction of the Weibull distribution, and a representation of the HP distribution in terms of a Weibull random variable is obtained, along with two characterizations. The method of maximum likelihood is used to estimate the model parameters, and simulation results are presented to assess the performance of the new model. The Marshall-Olkin heavy-tailed Pareto (MOHP) distribution is also introduced and some of its properties are studied; in particular, the MOHP distribution is shown to be geometric extreme stable. An autoregressive time series model with the new distribution as its marginal is developed and its properties are studied.
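For orientation, the sketch below shows the standard Marshall-Olkin construction that underlies families like MOHP: given any baseline survival function Fbar, the Marshall-Olkin survival function is alpha*Fbar(x) / (1 - (1-alpha)*Fbar(x)). The baseline used here is the Pareto type III survival function; the paper's HP distribution modifies the baseline in ways not reproduced here.

```python
# Generic Marshall-Olkin transform applied to a Pareto type III baseline.
import numpy as np

def pareto3_sf(x, sigma, gamma):
    # Pareto (type III) survival function: [1 + (x/sigma)^(1/gamma)]^(-1).
    return 1.0 / (1.0 + (x / sigma) ** (1.0 / gamma))

def marshall_olkin_sf(x, alpha, base_sf, *args):
    # Marshall-Olkin survival function built from any baseline survival fn.
    fbar = base_sf(x, *args)
    return alpha * fbar / (1.0 - (1.0 - alpha) * fbar)

x = np.linspace(0.1, 50, 5)
print(marshall_olkin_sf(x, 0.5, pareto3_sf, 1.0, 2.0))
```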
Abstract: In this paper we introduce a Bayesian analysis of a spherical distribution applied to rock joint orientation data, in the presence or absence of a vector of covariates, where the response variable is the angle from the mean direction and the covariates are the components of the normal upwards vector. Standard MCMC (Markov chain Monte Carlo) simulation methods, implemented in the WinBUGS software, are used to obtain the posterior summaries of interest. The proposed methodology is illustrated with a simulated data set and a real spherical data set of rock joints from a hydroelectric site.
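As a rough illustration of the kind of MCMC involved, here is a hand-rolled random-walk Metropolis sampler in Python rather than WinBUGS, with a simple von Mises likelihood for the angles standing in for the paper's spherical model (an assumption on our part, not the authors' specification).

```python
# Random-walk Metropolis for the concentration kappa of a von Mises model.
# The likelihood, prior, and toy data are illustrative assumptions.
import numpy as np
from scipy.special import i0   # modified Bessel I0, the vM normalizer

rng = np.random.default_rng(0)
theta = rng.vonmises(0.0, 3.0, size=200)     # toy angles from the mean

def log_post(kappa):
    if kappa <= 0:
        return -np.inf
    # von Mises log-likelihood (mean direction fixed at 0) + Exp(0.01) prior.
    return np.sum(kappa * np.cos(theta) - np.log(2 * np.pi * i0(kappa))) \
        - 0.01 * kappa

kappa, draws = 1.0, []
for _ in range(5000):
    prop = kappa + 0.3 * rng.standard_normal()
    if np.log(rng.uniform()) < log_post(prop) - log_post(kappa):
        kappa = prop
    draws.append(kappa)

print("posterior mean of kappa:", np.mean(draws[1000:]))
```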