Pub. online: 3 Oct 2022 | Type: Statistical Data Science | Open Access
Journal:Journal of Data Science
Volume 20, Issue 4 (2022): Special Issue: Large-Scale Spatial Data Science, pp. 461–474
Abstract
Spatio-temporal filtering is a common and challenging task in many environmental applications, where the evolution is often nonlinear and the dimension of the spatial state may be very high. We propose a scalable filtering approach based on a hierarchical sparse Cholesky representation of the filtering covariance matrix. At each time point, we compress the sparse Cholesky factor into a dense matrix with a small number of columns. After applying the evolution to each of these columns, we decompress to obtain a hierarchical sparse Cholesky factor of the forecast covariance, which can then be updated based on newly available data. We illustrate the Cholesky evolution via an equivalent representation in terms of spatial basis functions. We also demonstrate the advantage of our method in numerical comparisons, including using a high-dimensional and nonlinear Lorenz model.
This study analyzes the impact of the COVID-19 pandemic on subjective well-being as measured through Twitter for Japan and Italy. In the first nine months of 2020, the Twitter indicators dropped by 11.7% for Italy and 8.3% for Japan compared to the last two months of 2019, and even more compared to their historical means. To understand what affected the Twitter mood so strongly, the study considers a pool of potential factors including: climate and air quality data, number of COVID-19 cases and deaths, Facebook COVID-19 and flu-like symptoms global survey data, coronavirus-related Google search data, policy intervention measures, human mobility data, macroeconomic variables, as well as health and stress proxy variables. This study proposes a framework to analyse and assess the relative impact of these external factors on the dynamics of Twitter mood and further implements a structural model to describe the underlying concept of subjective well-being. It turns out that prolonged mobility restrictions, flu and COVID-like symptoms, economic uncertainty and low levels of quality in social interactions have a negative impact on well-being.
Social determinants of health (SDOH) are the conditions in which people are born, grow, work, and live. Although evidence suggests that SDOH influence a range of health outcomes, health systems lack the infrastructure to access and act upon this information. The purpose of this manuscript is to explain the methodology that a health system used to: 1) identify and integrate publicly available SDOH data into the health system’s Data Warehouse, 2) integrate HIPAA-compliant geocoding software (via DeGAUSS), and 3) visualize data to inform SDOH projects (via Tableau). First, the authors engaged key stakeholders across the health system to convey the implications of SDOH data for our patient population and identify variables of interest. As a result, fourteen publicly available data sets, accounting for >30,800 variables representing national, state, county, and census tract information over 2016–2019, were cleaned and integrated into our Data Warehouse. To pilot the data visualization, we created county and census tract level maps for our service areas and plotted common SDOH metrics (e.g., income, education, insurance status). This practical, methodological integration of SDOH data at a large health system demonstrated feasibility. Ultimately, we will repeat this process system-wide to further understand the risk burden in our patient population and improve our prediction models, allowing us to become better partners with our community.
Pub. online: 30 Aug 2022 | Type: Data Science In Action | Open Access
Journal:Journal of Data Science
Volume 21, Issue 3 (2023): Special Issue: Advances in Network Data Science, pp. 578–598
Abstract
Social network analysis has created a productive framework for the analysis of the histories of patient-physician interactions and physician collaboration. Notable is the construction of networks based on the data of “referral paths” – sequences of patient-specific temporally linked physician visits – in this case, culled from a large set of Medicare claims data in the United States. Network constructions depend on a range of choices regarding the underlying data. In this paper we introduce the use of a five-factor experiment that produces 80 distinct projections of the bipartite patient-physician mixing matrix to a unipartite physician network derived from the referral path data, which is further analyzed at the level of the 2,219 hospitals in the final analytic sample. We summarize the networks of physicians within a given hospital using a range of directed and undirected network features (quantities that summarize structural properties of the network such as its size, density, and reciprocity). The different projections and their underlying factors are evaluated in terms of the heterogeneity of the network features across the hospitals. We also evaluate the projections relative to their ability to improve the predictive accuracy of a model estimating a hospital’s adoption of implantable cardiac defibrillators, a novel cardiac intervention. Because it optimizes the knowledge learned about the overall and interactive effects of the factors, we anticipate that the factorial design setting for network analysis may be useful more generally as a methodological advance in network analysis.
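To make the idea of projecting a bipartite patient-physician structure onto a physician network concrete, the following is a minimal sketch of one possible projection rule, in which two physicians are linked with a weight equal to the number of patients they share. It is not one of the 80 projections studied in the paper; the function names `project_bipartite` and `density` and the toy visit data are assumptions made purely for illustration.

```python
from collections import defaultdict
from itertools import combinations


def project_bipartite(visits):
    """Project (patient, physician) pairs onto an undirected physician network.

    Edge weight = number of patients two physicians have in common.
    This is only one of many possible projection rules (an assumption here).
    """
    physicians_by_patient = defaultdict(set)
    for patient, physician in visits:
        physicians_by_patient[patient].add(physician)

    weights = defaultdict(int)
    for physicians in physicians_by_patient.values():
        # Every pair of physicians seen by the same patient gains one unit of weight.
        for a, b in combinations(sorted(physicians), 2):
            weights[(a, b)] += 1
    return dict(weights)


def density(weights, n_physicians):
    """Share of possible undirected physician pairs that are connected."""
    possible = n_physicians * (n_physicians - 1) / 2
    return len(weights) / possible if possible else 0.0


# Toy, made-up data: three patients and three physicians.
visits = [("p1", "A"), ("p1", "B"),
          ("p2", "A"), ("p2", "B"), ("p2", "C"),
          ("p3", "C")]
edges = project_bipartite(visits)
print(edges)              # {('A', 'B'): 2, ('A', 'C'): 1, ('B', 'C'): 1}
print(density(edges, 3))  # 1.0
```

Changing the projection rule (for example, using visit order to create directed edges) changes the resulting network features, which is the kind of sensitivity a factorial design over projection choices is meant to expose.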
This paper introduces the package open-crypto for free-of-charge and systematic cryptocurrency data collection. The package supports several methods to request (1) static data, (2) real-time data and (3) historical data. It allows users to retrieve data from over 100 of the most popular and liquid exchanges worldwide. New exchanges can easily be added with the help of provided templates or updated with built-in functions from the project repository. The package is available on GitHub and the Python Package Index (PyPI). The data is stored in a relational SQL database and is therefore accessible from many different programming languages. We provide hands-on examples and illustrations for each data type, explanations of the received data, and also demonstrate usability from R and Matlab. Academic research relies heavily on costly or confidential data; however, open data projects are becoming increasingly important. This project is mainly motivated by the aim of contributing openly accessible software and free data on the cryptocurrency markets to improve transparency and reproducibility in research and other disciplines.
This paper proposes a nonuniform subsampling method for finite mixtures of regression models to reduce the computational burden of large data sets. A general estimator based on a subsample is investigated, and its asymptotic normality is established. We assign optimal subsampling probabilities to data points that minimize the asymptotic mean squared errors of the general estimator and linearly transformed estimators. Since the proposed probabilities depend on unknown parameters, an implementable algorithm is developed. We first approximate the optimal subsampling probabilities using a pilot sample. After that, we select a subsample using the approximated subsampling probabilities and compute estimates using the subsample. We evaluate the proposed method in a simulation study and present a real data example using appliance energy data.
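The two-step scheme described above (pilot sample, approximated probabilities, weighted estimation on the subsample) can be sketched generically as follows. This is a simplified illustration, not the paper's algorithm: it uses a single no-intercept regression instead of a finite mixture, and the probabilities (proportional to |residual × x|, mixed half-and-half with uniform probabilities for stability) are an assumed stand-in for the optimal probabilities derived in the paper. All function names and the simulated data are hypothetical.

```python
import random


def weighted_slope(xs, ys, ws):
    """Weighted least-squares slope for the no-intercept model y = beta * x."""
    num = sum(w * x * y for x, y, w in zip(xs, ys, ws))
    den = sum(w * x * x for x, w in zip(xs, ws))
    return num / den


def two_step_subsample(xs, ys, pilot_size, sub_size, rng):
    n = len(xs)

    # Step 1: a uniform pilot sample gives a rough estimate of beta.
    pilot = [rng.randrange(n) for _ in range(pilot_size)]
    beta0 = weighted_slope([xs[i] for i in pilot],
                           [ys[i] for i in pilot],
                           [1.0] * pilot_size)

    # Step 2: approximate nonuniform probabilities from the pilot estimate.
    # Assumed form: proportional to |residual * x|, mixed with uniform.
    scores = [abs((ys[i] - beta0 * xs[i]) * xs[i]) for i in range(n)]
    total = sum(scores)
    probs = [0.5 * s / total + 0.5 / n for s in scores]

    # Step 3: draw the subsample with replacement and weight each draw by
    # its inverse sampling probability.
    idx = rng.choices(range(n), weights=probs, k=sub_size)
    beta = weighted_slope([xs[i] for i in idx],
                          [ys[i] for i in idx],
                          [1.0 / probs[i] for i in idx])
    return beta, probs


# Simulated data with true slope 2 (made up for illustration).
rng = random.Random(0)
xs = [rng.uniform(1.0, 2.0) for _ in range(5000)]
ys = [2.0 * x + rng.gauss(0.0, 0.5) for x in xs]
beta_hat, probs = two_step_subsample(xs, ys, pilot_size=200, sub_size=500, rng=rng)
print(round(beta_hat, 2))
```

The inverse-probability weights in the last step are what keep the estimator approximately unbiased even though informative points are deliberately oversampled.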
We introduce the stepp packages for R and Stata that implement the subpopulation treatment effect pattern plot (STEPP) method. STEPP is a nonparametric graphical tool aimed at examining possible heterogeneous treatment effects in subpopulations defined on a continuous covariate or composite score. More specifically, STEPP considers overlapping subpopulations defined with respect to a continuous covariate (or risk index) and it estimates a treatment effect for each subpopulation. It also produces confidence regions and tests for treatment effect heterogeneity among the subpopulations. The original method has been extended in different directions such as different survival contexts, outcome types, or more efficient procedures for identifying the overlapping subpopulations. In this paper, we also introduce a novel method to determine the number of subjects within the subpopulations by minimizing the variability of the sizes of the subpopulations generated by a specific parameter combination. We illustrate the packages using both synthetic data and publicly available data sets. The most intensive computations in R are implemented in Fortran, while the Stata version exploits the powerful Mata language.
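The core idea of overlapping subpopulations along a continuous covariate can be illustrated with a short sliding-window sketch. This is not the stepp package API: the functions `sliding_subpopulations` and `effect_by_subpopulation`, the parameters `size` and `overlap`, and the use of a simple difference in mean outcomes as the treatment effect are all assumptions chosen for illustration, and the data are made up.

```python
def sliding_subpopulations(values, size, overlap):
    """Index lists of overlapping windows over subjects sorted by covariate.

    `size` is the target number of subjects per window and `overlap` the
    number shared by consecutive windows (both hypothetical parameters).
    """
    if not 0 <= overlap < size:
        raise ValueError("need 0 <= overlap < size")
    order = sorted(range(len(values)), key=lambda i: values[i])
    step = size - overlap
    windows, start = [], 0
    while True:
        end = start + size
        if end >= len(order):
            windows.append(order[start:])  # last window takes the remainder
            break
        windows.append(order[start:end])
        start += step
    return windows


def effect_by_subpopulation(covariate, treated, outcome, size, overlap):
    """Difference in mean outcome (treated minus control) in each window."""
    effects = []
    for window in sliding_subpopulations(covariate, size, overlap):
        t = [outcome[i] for i in window if treated[i]]
        c = [outcome[i] for i in window if not treated[i]]
        effects.append(sum(t) / len(t) - sum(c) / len(c) if t and c else None)
    return effects


# Made-up data in which the treatment effect grows with the covariate.
covariate = list(range(10))
treated = [i % 2 == 0 for i in range(10)]
outcome = [2 * i if treated[i] else i for i in range(10)]

windows = sliding_subpopulations(covariate, size=4, overlap=2)
effects = effect_by_subpopulation(covariate, treated, outcome, size=4, overlap=2)
print(windows)  # [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
print(effects)  # [0.0, 2.0, 4.0, 6.0]
```

Plotting the per-window effects against the median covariate value of each window gives the kind of pattern plot the method is named after; confidence regions and heterogeneity tests are beyond this sketch.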
Understanding shooting patterns among different players is a fundamental problem in basketball game analyses. In this paper, we quantify the shooting pattern via the field goal attempts and percentages over twelve non-overlapping regions around the front court. A joint Bayesian nonparametric mixture model is developed to find latent clusters of players based on their shooting patterns. We apply our proposed model to learn the heterogeneity among selected players from the National Basketball Association (NBA) games over the 2018–2019 regular season and 2019–2020 bubble season. Thirteen clusters are identified for the 2018–2019 regular season and seven clusters are identified for the 2019–2020 bubble season. We further examine the shooting patterns of players in these clusters and discuss their relation to players’ other available information. The results offer new insights into the effect of the NBA COVID bubble and may provide useful guidance for players’ shot selection and teams’ in-game and recruiting strategy planning.
Pub. online: 4 Aug 2022 | Type: Research Article | Open Access
Journal:Journal of Data Science
Volume 18, Issue 5 (2020): Special Issue S1 in Chinese (with abstract in English), pp. 907–921
Abstract
The Corona Virus Disease 2019 (COVID-19) emerged in Wuhan, China in December 2019. In order to control the epidemic, the Chinese government adopted several public health measures. To study the influence of these measures on the transmissibility of COVID-19 in the city of Wuhan and other cities in the Hubei province, China, we establish generalized semi-varying coefficient models for the number of newly diagnosed cases and estimate the varying coefficient for the covariates by the spline method. Since the pandemic was most severe in Wuhan, we fitted separate models for Wuhan and the remaining 16 cities in Hubei. Estimators for the incubation periods, the real-time transmission rates, and the real-time reproduction numbers were obtained. The results demonstrate that the changes in the real-time transmission rate in Wuhan and other cities in Hubei are almost simultaneous. Further, public health interventions such as restriction of traffic, adjustment of the diagnostic standard, deployment of medical resources, and improvement of nucleic acid testing capacity had positive effects on reducing the transmission of COVID-19.
Pub. online: 4 Aug 2022 | Type: Research Article | Open Access
Journal:Journal of Data Science
Volume 18, Issue 5 (2020): Special Issue S1 in Chinese (with abstract in English), pp. 889–906
Abstract
The coronavirus disease (COVID-19), a new infectious disease, has a relatively strong ability to spread from person to person. This paper studies several meteorological factors and air quality indicators in Shenzhen and Wenzhou, China, and models whether the transmission of COVID-19 is affected by atmospheric conditions. A comparative assessment is made of the characteristics of meteorological factors and air quality in these two typical Chinese cities and their impacts on the spread of COVID-19. Using meteorological and air quality data on 7 variables (daily average temperature, daily average relative humidity, daily average wind speed, nitrogen dioxide (NO2), atmospheric fine particulate matter (PM2.5), carbon monoxide (CO), and ozone (O3)), a distributed lag non-linear model (DLNM) is constructed to explore the correlation between atmospheric conditions and non-imported confirmed cases of COVID-19, and the relative risk is introduced to measure the lagged effects of meteorological factors and air pollution on the number of non-imported confirmed cases. Our findings indicate that there are significant differences in the relationship between the 7 predictors and the transmission of COVID-19 in Shenzhen and Wenzhou. However, all predictors in both cities have a non-linear relationship with the number of non-imported confirmed cases. Lower daily average temperature increased the risk of epidemic transmission in the two cities; as the temperature rises, the risk of epidemic transmission in both cities decreases significantly. The average daily relative humidity has no significant effect on the epidemic situation in Shenzhen, but lower relative humidity reduces the risk of epidemic spread in Wenzhou. In contrast, meteorological data have significant impacts on the spread of COVID-19 in Wenzhou.
The four air quality predictors (NO2, PM2.5, CO, and O3) have significant effects on the number of non-imported confirmed cases. Among them, PM2.5 has a significant positive correlation with the number of non-imported confirmed cases. Daily average wind speed, NO2 and O3 have different effects on the number of non-imported confirmed cases in different cities.