Understanding shooting patterns among different players is a fundamental problem in basketball game analysis. In this paper, we quantify shooting patterns via field goal attempts and percentages over twelve non-overlapping regions of the front court. A joint Bayesian nonparametric mixture model is developed to find latent clusters of players based on their shooting patterns. We apply the proposed model to learn the heterogeneity among selected players from National Basketball Association (NBA) games over the 2018–2019 regular season and the 2019–2020 bubble season. Thirteen clusters are identified for the 2018–2019 regular season and seven clusters for the 2019–2020 bubble season. We further examine the shooting patterns of players in these clusters and discuss their relation to other available information about the players. The results shed new light on the effect of the NBA COVID bubble and may provide useful guidance for players' shot selection and teams' in-game and recruiting strategy planning.
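As a rough illustration of the clustering idea only (not the authors' joint model), the sketch below fits a truncated Dirichlet-process Gaussian mixture to a synthetic players-by-regions matrix of attempts and percentages; all data and names such as `shot_features` are hypothetical.

```python
# Illustrative sketch: a truncated Dirichlet-process mixture (via
# scikit-learn's BayesianGaussianMixture) applied to a toy
# players-by-regions feature matrix of attempts and percentages.
# The paper's joint Bayesian nonparametric model is more elaborate.
import numpy as np
from sklearn.mixture import BayesianGaussianMixture
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_players, n_regions = 200, 12

# Hypothetical features: per-region attempt counts and field goal percentages.
attempts = rng.poisson(lam=30, size=(n_players, n_regions))
percentages = rng.beta(a=5, b=6, size=(n_players, n_regions))
shot_features = StandardScaler().fit_transform(np.hstack([attempts, percentages]))

# A Dirichlet-process prior lets the data decide how many of the
# n_components upper-bound clusters are actually used.
dpgmm = BayesianGaussianMixture(
    n_components=20,
    weight_concentration_prior_type="dirichlet_process",
    covariance_type="diag",
    random_state=0,
).fit(shot_features)

labels = dpgmm.predict(shot_features)
print("clusters used:", np.unique(labels).size)
```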
Pub. online: 4 Aug 2022 | Type: Research Article | Open Access
Journal:Journal of Data Science
Volume 18, Issue 5 (2020): Special Issue S1 in Chinese (with abstract in English), pp. 907–921
Abstract
Coronavirus Disease 2019 (COVID-19) emerged in Wuhan, China in December 2019. In order to control the epidemic, the Chinese government adopted several public health measures. To study the influence of these measures on the transmissibility of COVID-19 in the city of Wuhan and other cities in the Hubei province, China, we establish generalized semi-varying coefficient models for the number of newly diagnosed cases and estimate the varying coefficients of the covariates by the spline method. Since the pandemic was most severe in Wuhan, we fit separate models for Wuhan and the remaining 16 cities in Hubei. Estimators for the incubation periods, the real-time transmission rates, and the real-time reproduction numbers are obtained. The results demonstrate that the changes in the real-time transmission rate in Wuhan and the other cities in Hubei are almost simultaneous. Further, public health interventions such as traffic restrictions, adjustment of the diagnostic standard, deployment of medical resources, and improvement of nucleic acid testing capacity had positive effects on reducing the transmission of COVID-19.
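The following sketch illustrates, under simplifying assumptions, how a time-varying coefficient can be estimated with a spline basis inside a Poisson regression. It is not the authors' generalized semi-varying coefficient model; the covariate `mobility` and all data are invented for illustration.

```python
# Minimal sketch: a Poisson regression in which the coefficient of a
# covariate varies over time through a B-spline basis, approximating a
# (semi-)varying coefficient model.
import numpy as np
import statsmodels.api as sm
from patsy import dmatrix

rng = np.random.default_rng(1)
n_days = 80
t = np.arange(n_days)
mobility = rng.normal(size=n_days)              # hypothetical covariate
beta_t = 0.8 * np.exp(-t / 40.0)                # true time-varying effect
daily_cases = rng.poisson(np.exp(1.5 + beta_t * mobility))

# Spline basis in time; interacting it with the covariate gives a
# time-varying coefficient beta(t) = basis(t) @ gamma.
basis = np.asarray(dmatrix("bs(t, df=6, degree=3) - 1", {"t": t}))
X = np.column_stack([np.ones(n_days), basis * mobility[:, None]])

fit = sm.GLM(daily_cases, X, family=sm.families.Poisson()).fit()
beta_hat = basis @ fit.params[1:]               # estimated beta(t)
print(np.round(beta_hat[:5], 3))
```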
Pub. online: 4 Aug 2022 | Type: Research Article | Open Access
Journal:Journal of Data Science
Volume 18, Issue 5 (2020): Special Issue S1 in Chinese (with abstract in English), pp. 889–906
Abstract
The new coronavirus disease (COVID-19), as a new infectious disease, has a relatively strong ability to spread from person to person. This paper studies several meteorological factors and air quality indicators in Shenzhen and Wenzhou, China, and conducts a modelling analysis of whether the transmission of COVID-19 is affected by the atmosphere. A comparative assessment is made of the characteristics of meteorological factors and air quality in these two typical Chinese cities and their impacts on the spread of COVID-19. The article uses meteorological and air quality data covering seven variables: daily average temperature, daily average relative humidity, daily average wind speed, nitrogen dioxide (NO2), atmospheric fine particulate matter (PM2.5), carbon monoxide (CO), and ozone (O3). A distributed lag non-linear model (DLNM) is constructed to explore the correlation between atmospheric conditions and non-imported confirmed cases of COVID-19, and the relative risk is introduced to measure the lagging effects of meteorological factors and air pollution on the number of non-imported confirmed cases. Our findings indicate that there are significant differences between Shenzhen and Wenzhou in the relationships between the seven predictors and the transmission of COVID-19. However, in both cities all predictors have a non-linear relationship with the number of non-imported confirmed cases. Lower daily average temperature increased the risk of epidemic transmission in the two cities; as the temperature rises, the risk of epidemic transmission in both cities decreases significantly. The daily average relative humidity has no significant effect on the epidemic situation in Shenzhen, but lower relative humidity reduces the risk of epidemic spread in Wenzhou. In contrast, meteorological factors have significant impacts on the spread of COVID-19 in Wenzhou. The four predictors (NO2, PM2.5, CO, and O3) have significant effects on the number of non-imported confirmed cases. Among them, PM2.5 has a significant positive correlation with the number of non-imported confirmed cases. Daily average wind speed, NO2, and O3 have different effects on the number of non-imported confirmed cases in the two cities.
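A heavily simplified distributed-lag sketch is given below; it smooths the lag structure of a single exposure with a spline basis inside a Poisson regression rather than building the full DLNM cross-basis, and all variables (`temperature`, `cases`) are synthetic placeholders.

```python
# Simplified distributed-lag sketch (not the full DLNM cross-basis):
# daily case counts are regressed on lagged temperature, with the lag
# structure smoothed by a B-spline basis over lag.
import numpy as np
import statsmodels.api as sm
from patsy import dmatrix

rng = np.random.default_rng(2)
n_days, max_lag = 120, 14
temperature = rng.normal(15, 5, size=n_days)

# Toy data-generating process: the lag effect decays over two weeks.
true_lag_effect = -0.03 * np.exp(-np.arange(max_lag + 1) / 5.0)
lagged = np.column_stack(
    [np.concatenate([np.full(l, temperature[0]), temperature[: n_days - l]])
     for l in range(max_lag + 1)]
)
cases = rng.poisson(np.exp(4.0 + lagged @ true_lag_effect))

# A spline basis over the lag dimension constrains the lag curve to be smooth.
lag_basis = np.asarray(dmatrix("bs(l, df=4, degree=3) - 1",
                               {"l": np.arange(max_lag + 1)}))
X = sm.add_constant(lagged @ lag_basis)
fit = sm.GLM(cases, X, family=sm.families.Poisson()).fit()

lag_curve = lag_basis @ fit.params[1:]      # estimated effect at each lag
print(np.round(lag_curve, 4))
```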
Pub. online: 4 Aug 2022 | Type: Research Article | Open Access
Journal:Journal of Data Science
Volume 18, Issue 5 (2020): Special Issue S1 in Chinese (with abstract in English), pp. 875–888
Abstract
In the wake of the COVID-19 outbreak, the public turned to Sina Weibo as a major platform for following the trend of the pandemic. Research on public sentiment and topic mining of major public opinion events based on Sina Weibo comment data is important for understanding the trend of public opinion during major epidemic outbreaks. Based on a psychological classification of Chinese emotion categories, we use open-source tools to build naive Bayes models to classify Weibo comments. Comment topics are visualized with word co-occurrence network methods and mined with a latent Dirichlet allocation (LDA) model. The results show that the psychological sentiment classification combined with the naive Bayes model can reflect the evolution of public sentiment during the epidemic, and that the LDA model and word co-occurrence network can effectively mine the topics of public concern.
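The sketch below shows one possible open-source pipeline of the kind described (a scikit-learn naive Bayes classifier plus LDA on bag-of-words counts). It is only illustrative: the tiny English corpus stands in for segmented Chinese Weibo comments, and the emotion labels are made up.

```python
# Illustrative pipeline: multinomial naive Bayes sentiment classification
# plus LDA topic mining on bag-of-words features.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.decomposition import LatentDirichletAllocation

comments = [
    "stay strong wuhan we will win",
    "worried about the rising case numbers",
    "grateful to the medical workers on the front line",
    "angry about the slow response",
]
labels = ["positive", "negative", "positive", "negative"]   # toy emotion labels

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(comments)

# Naive Bayes sentiment classifier.
clf = MultinomialNB().fit(X, labels)
print(clf.predict(vectorizer.transform(["thank you doctors"])))

# LDA topic model on the same bag-of-words matrix.
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
terms = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = [terms[i] for i in topic.argsort()[-3:]]
    print(f"topic {k}:", top)
```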
Pub. online: 4 Aug 2022 | Type: Research Article | Open Access
Journal:Journal of Data Science
Volume 18, Issue 5 (2020): Special Issue S1 in Chinese (with abstract in English), pp. 875–888
Abstract
Surveillance of the development of COVID-19 is a complex and challenging task. The foundation of such surveillance is timely and accurate epidemic data. Therefore, quality control for the release of COVID-19 data is very important, accounting for the releasing agent, the content to release, and the impact of the released data. We suggest that quality requirements for the release of COVID-19 data be based on the global perspective that the goal of open epidemic data is to create a valuable ecological chain in which all stakeholders are involved. As such, the collection, aggregation, and release of COVID-19 data should meet not only the data quality standards of official statistics and health statistics, but also the characteristics of epidemic statistics and the needs of pandemic prevention. The quality requirements should follow the unique characteristics of the epidemic and be open to public scrutiny. Integrating the perspectives of official statistics, health statistics, and open government data, we propose five quality dimensions for releasing COVID-19 data: accuracy, timeliness, systematicness, user-friendliness, and security. Through case studies of the official websites of Chinese provincial health commissions, we report the quality problems in the current data release process and suggest improvements.
Many software reliability growth models based upon a non-homogeneous Poisson process (NHPP) have been proposed to measure and assess the reliability of a software system quantitatively. Generally, the error detection rate and the fault content function during software testing are considered to depend on the elapsed testing time. In this paper we propose three software reliability growth models (SRGMs) incorporating the notion of error generation over time as an extension of the delayed S-shaped software reliability growth model based on an NHPP. The model parameters are estimated using the maximum likelihood method for interval-domain data, and three data sets are provided to illustrate the estimation technique. The proposed models are compared with the existing delayed S-shaped model based on error sum of squares, mean sum of squares, predictive ratio risk, and Akaike's information criterion using three different data sets. We show that the proposed models perform better than the existing models.
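For concreteness, the sketch below shows maximum likelihood estimation of the classical delayed S-shaped NHPP model, with mean value function m(t) = a(1 − (1 + bt)e^(−bt)), from grouped interval counts. The failure counts are invented, and the paper's proposed extensions with error generation are not implemented here.

```python
# Sketch: MLE for the delayed S-shaped NHPP model from interval-domain
# (grouped) failure counts. Data are made up for illustration.
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln

t = np.arange(1, 13, dtype=float)                  # end of each test interval
counts = np.array([3, 6, 9, 11, 10, 8, 6, 5, 3, 2, 2, 1], dtype=float)

def mean_value(t, a, b):
    # m(t) = a * (1 - (1 + b*t) * exp(-b*t))
    return a * (1.0 - (1.0 + b * t) * np.exp(-b * t))

def neg_log_lik(params):
    a, b = params
    if a <= 0 or b <= 0:
        return np.inf
    m = mean_value(t, a, b)
    increments = np.diff(np.concatenate([[0.0], m]))   # expected counts per interval
    increments = np.clip(increments, 1e-12, None)
    return -np.sum(counts * np.log(increments) - increments - gammaln(counts + 1))

res = minimize(neg_log_lik, x0=[counts.sum() * 1.2, 0.3], method="Nelder-Mead")
a_hat, b_hat = res.x
print(f"a = {a_hat:.1f}, b = {b_hat:.3f}")
```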
Abstract: The Lee-Carter model and its extensions are the most popular methods in the field of mortality rate forecasting. However, despite the many methods introduced for forecasting mortality rates, no single method is applicable to all situations. Singular Spectrum Analysis (SSA) is a relatively new, powerful, non-parametric time series method whose forecasting capability has been demonstrated in various sciences. In this paper, we investigate the feasibility of using SSA to construct mortality forecasts. We use the Hyndman-Ullah model, a recent extension of the Lee-Carter model, as a benchmark to evaluate the performance of SSA for mortality forecasting on French data sets.
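A bare-bones SSA sketch follows: it embeds a synthetic series standing in for a log mortality rate, keeps the leading singular triples, reconstructs the signal by diagonal averaging, and produces a one-step recurrent forecast. It is neither the Hyndman-Ullah benchmark nor a production SSA implementation.

```python
# Bare-bones singular spectrum analysis (SSA) with a one-step recurrent
# forecast on a synthetic series.
import numpy as np

rng = np.random.default_rng(3)
n = 60
series = (-3.0 - 0.02 * np.arange(n)
          + 0.1 * np.sin(np.arange(n) / 4)
          + 0.02 * rng.normal(size=n))            # hypothetical log mortality rate

L = 20                                            # window length
K = n - L + 1
X = np.column_stack([series[i:i + L] for i in range(K)])   # trajectory matrix

U, s, Vt = np.linalg.svd(X, full_matrices=False)
r = 3                                             # leading components kept

# Reconstruct the signal by diagonal averaging of the rank-r approximation.
Xr = (U[:, :r] * s[:r]) @ Vt[:r, :]
recon = np.array([np.mean(Xr[::-1, :].diagonal(k)) for k in range(-L + 1, K)])

# Recurrent SSA forecast: one step ahead from the reconstructed series.
P = U[:, :r]
pi = P[-1, :]
R = P[:-1, :] @ pi / (1.0 - pi @ pi)
forecast = R @ recon[-(L - 1):]
print(f"next-step forecast: {forecast:.3f}")
```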
Pub. online: 4 Aug 2022 | Type: Research Article | Open Access
Journal:Journal of Data Science
Volume 18, Issue 5 (2020): Special Issue S1 in Chinese (with abstract in English), pp. 849–859
Abstract
Millions of people traveled from Wuhan to other cities between January 1 and January 23, 2020. Taking advantage of masked software development kit data from Aurora Mobile Ltd and open epidemic data released by health authorities, we analyze the relationship between the number of confirmed COVID-19 cases in a region and the number of people who traveled from Wuhan to that region during this period. Further, we identify high-risk carriers of COVID-19 to improve the control of the epidemic. The key findings are threefold: (1) in each region the number of high-risk carriers is highly positively correlated with the severity of the epidemic; (2) a history of visits to the 62 designated hospitals is the foremost index of risk; (3) the second most important index is the traveler's duration of stay in Wuhan. Based on our analysis, we estimate that, as of February 4, 2020, (a) among the 8.5 million people stranded in Wuhan, there are 425 thousand high-risk carriers; and (b) among the 3.5 million migrant workers stranded in Hubei, there are 175 thousand high-risk carriers. The disease control authorities should closely monitor these groups.
Abstract: The application of linear mixed models or generalized linear mixed models to large databases in which the level-2 units (hospitals) have a wide variety of characteristics is a problem frequently encountered in studies of medical quality. Accurate estimation of model parameters and standard errors requires accounting for the grouping of outcomes within hospitals. Including the hospitals as random effects in the model is a common way of doing so. However, in a large, diverse population, the required assumptions are not satisfied, which can lead to inconsistent and biased parameter estimates. One solution is to use cluster analysis, with clustering variables distinct from the model covariates, to group the hospitals into smaller, more homogeneous groups. The analysis can then be carried out within these groups. We illustrate this analysis using an example from a study of hemoglobin A1c control among diabetic patients in a national database of United States Department of Veterans Affairs (VA) hospitals.
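The two-stage strategy can be sketched as follows, using made-up data and variable names (`beds`, `rural`, `a1c`, `age`): hospitals are clustered on hospital-level characteristics, and a linear mixed model with a hospital random intercept is then fitted within each cluster.

```python
# Conceptual sketch: cluster hospitals on hospital-level characteristics,
# then fit a linear mixed model with a hospital random intercept within
# each (more homogeneous) cluster.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
n_hosp, n_per = 40, 50

# Hospital-level clustering variables (distinct from the model covariates).
hosp = pd.DataFrame({
    "hospital": range(n_hosp),
    "beds": rng.integers(50, 800, n_hosp),
    "rural": rng.integers(0, 2, n_hosp),
})
hosp["cluster"] = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(
    StandardScaler().fit_transform(hosp[["beds", "rural"]])
)

# Patient-level outcomes nested within hospitals.
patients = pd.DataFrame({
    "hospital": np.repeat(hosp["hospital"].values, n_per),
    "age": rng.normal(60, 10, n_hosp * n_per),
})
hosp_effect = np.repeat(rng.normal(0, 0.3, n_hosp), n_per)
patients["a1c"] = (7.0 + 0.01 * (patients["age"] - 60)
                   + hosp_effect + rng.normal(0, 0.5, n_hosp * n_per))
patients = patients.merge(hosp[["hospital", "cluster"]], on="hospital")

# Fit a mixed model separately within each cluster.
for c, sub in patients.groupby("cluster"):
    fit = smf.mixedlm("a1c ~ age", sub, groups=sub["hospital"]).fit()
    print(f"cluster {c}: age effect = {fit.params['age']:.4f}")
```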