Pub. online: 13 Mar 2024 | Type: Statistical Data Science | Open Access
Journal:Journal of Data Science
Volume 22, Issue 2 (2024): Special Issue: 2023 Symposium on Data Science and Statistics (SDSS): “Inquire, Investigate, Implement, Innovate”, pp. 280–297
Abstract
The use of visuals is a key component in scientific communication. Decisions about the design of a data visualization should be informed by what design elements best support the audience’s ability to perceive and understand the components of the data visualization. We build on the foundations of Cleveland and McGill’s work in graphical perception, employing a large, nationally-representative, probability-based panel of survey respondents to test perception in stacked bar charts. Our findings provide actionable guidance for data visualization practitioners to employ in their work.
Our contribution is to widen the scope of extreme value analysis applied to discrete-valued data. Extreme values of a random variable are commonly modeled using the generalized Pareto distribution, a peak-over-threshold method that often gives good results in practice. When data is discrete, we propose two other methods using a discrete generalized Pareto and a generalized Zipf distribution respectively. Both are theoretically motivated and we show that they perform well in estimating rare events in several simulated and real data cases such as word frequency, tornado outbreaks and multiple births.
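As a rough illustration of the discrete case, the sketch below builds a discrete generalized Pareto probability mass function by differencing the survival function of the continuous generalized Pareto distribution over unit intervals. This is one common construction and is an assumption here: the abstract does not state the exact parameterization the authors use, and the function names and the parameter values (`xi`, `sigma`) are hypothetical.

```python
import math


def gpd_survival(x: float, xi: float, sigma: float) -> float:
    """Survival function P(X > x) of the continuous generalized Pareto
    distribution with shape xi and scale sigma, for exceedances x >= 0."""
    if x <= 0:
        return 1.0
    if abs(xi) < 1e-12:
        # Limiting case xi -> 0 is the exponential distribution.
        return math.exp(-x / sigma)
    base = 1.0 + xi * x / sigma
    if base <= 0:
        # Beyond the finite upper endpoint that exists when xi < 0.
        return 0.0
    return base ** (-1.0 / xi)


def discrete_gpd_pmf(k: int, xi: float, sigma: float) -> float:
    """P(K = k) for k = 0, 1, 2, ... obtained by discretizing the
    continuous distribution: P(K = k) = S(k) - S(k + 1)."""
    return gpd_survival(k, xi, sigma) - gpd_survival(k + 1, xi, sigma)


if __name__ == "__main__":
    # Hypothetical parameter values, chosen only for illustration.
    xi, sigma = 0.2, 2.0
    for k in range(5):
        print(k, discrete_gpd_pmf(k, xi, sigma))
```

In a peak-over-threshold setting, `k` would be the exceedance of a count over a chosen threshold, and `xi` and `sigma` would be estimated from the data, for example by maximum likelihood.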
This study delves into the impact of the COVID-19 pandemic on the enrollment rates of on-site undergraduate programs within Brazilian public universities. A counterfactual scenario in which the pandemic did not occur was constructed using the Machine Learning Control Method. By contrasting this hypothetical scenario with real-world data on new entrants, a variable was defined to characterize the impact of the COVID-19 pandemic on on-site undergraduate programs at Brazilian public universities. This variable reveals that the impact varies significantly with the geographical location of the institutions offering these courses. Courses offered by institutions located in cities with smaller populations experienced a more pronounced impact than those situated in larger urban centers.
One crucial aspect of precision medicine is to allow physicians to recommend the most suitable treatment for their patients. This requires understanding the treatment heterogeneity from a patient-centric view, quantified by estimating the individualized treatment effect (ITE). With a large amount of genetics data and medical factors being collected, a complete picture of individuals’ characteristics is forming, which provides more opportunities to accurately estimate ITE. Recent development using machine learning methods within the counterfactual outcome framework shows excellent potential in analyzing such data. In this research, we propose to extend meta-learning approaches to estimate individualized treatment effects with survival outcomes. Two meta-learning algorithms are considered, T-learner and X-learner, each combined with three types of machine learning methods: random survival forest, Bayesian accelerated failure time model and survival neural network. We examine the performance of the proposed methods and provide practical guidelines for their application in randomized clinical trials (RCTs). Moreover, we propose to use the Boruta algorithm to identify risk factors that contribute to treatment heterogeneity based on ITE estimates. The finite sample performances of these methods are compared through extensive simulations under different randomization designs. The proposed approach is applied to a large RCT of eye disease, namely, age-related macular degeneration (AMD), to estimate the ITE on delaying time-to-AMD progression and to make individualized treatment recommendations.
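For orientation, the T-learner idea mentioned above can be written out as follows. This is a sketch under stated assumptions: the abstract does not say which survival quantity defines the ITE, so the difference in survival probability at a fixed time $t^*$ is assumed here, and the symbols $A$ (treatment arm), $X$ (covariates), $T$ (event time) and $t^*$ are notation introduced for illustration.

```latex
% Arm-specific conditional survival functions, each estimated
% using only the patients randomized to that arm (a = 0, 1):
S_a(t \mid x) = P(T > t \mid X = x,\, A = a)

% T-learner estimate of the individualized treatment effect at time t^*:
\hat{\tau}(x) = \hat{S}_1(t^* \mid x) - \hat{S}_0(t^* \mid x)
```

Any of the three base learners named in the abstract could in principle supply $\hat{S}_0$ and $\hat{S}_1$. The X-learner adds further imputation and weighting steps that are not shown here.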
The exploration of whether artificial intelligence (AI) can evolve to possess consciousness is an intensely debated and researched topic within the fields of philosophy, neuroscience, and artificial intelligence. Understanding this complex phenomenon hinges on integrating two complementary perspectives of consciousness: the objective and the subjective. Objective perspectives involve quantifiable measures and observable phenomena, offering a more scientific and empirical approach. This includes the use of neuroimaging technologies such as electrocorticography (ECoG), EEG, and fMRI to study brain activities and patterns. These methods allow for the mapping and understanding of neural representations related to language, visual, acoustic, emotional, and semantic information. However, the objective approach may miss the nuances of personal experience and introspection. On the other hand, subjective perspectives focus on personal experiences, thoughts, and feelings. This introspective view provides insights into the individual nature of consciousness, which cannot be directly measured or observed by others. Yet, the subjective approach is often criticized for its lack of empirical evidence and its reliance on personal interpretation, which may not be universally applicable or reliable. Integrating these two perspectives is essential for a comprehensive understanding of consciousness. By combining objective measures with subjective reports, we can develop a more holistic understanding of the mind.
The United States has a racial homeownership gap due to a legacy of historic inequality and discriminatory policies, but the factors that contribute to the racial disparity in homeownership rates between White Americans and people of color have not been fully characterized. To address this disparity, policymakers need a better understanding of how risk factors affect the homeownership rates of racial and ethnic groups differently. In this study, data from several publicly available surveys, including the American Community Survey and United States Census, were leveraged in combination with statistical learning models to investigate potential factors related to homeownership rates across racial and ethnic categories, with a focus on how risk factors vary by race or ethnicity. Our models indicated that job availability for specific demographics and specific regions of the United States were factors that affect homeownership rates in Black, Hispanic, and Asian populations in different ways. Based on the results of this study, we recommend that policymakers promote strategies to increase access to jobs for people of color (POC), such as vocational training and programs to reduce implicit bias in hiring practices. These interventions could ultimately increase homeownership rates for POC and be a step toward reducing the racial wealth gap.
Racial and ethnic representation in home ownership rates is an important public policy topic for addressing inequality within society. Although more than half of US households own, rather than rent, their homes, home ownership is unequally distributed among racial and ethnic groups. Here we analyze the US Census Bureau’s American Community Survey data to conduct an exploratory and statistical analysis of home ownership in the US, and find sociodemographic factors that are associated with differences in home ownership rates. We use binomial and beta-binomial generalized linear models (GLMs) with 2020 county-level data to model the home ownership rate, and fit the beta-binomial models with Bayesian estimation. We determine that race/ethnic group, geographic region, and income all have significant associations with the home ownership rate. To make the data and results accessible to the public, we develop a Shiny web application in R with exploratory plots and model predictions.
In 2022 the American Statistical Association established the Riffenburgh Award, which recognizes exceptional innovation in extending statistical methods across diverse fields. Simultaneously, the Department of Statistics at the University of Connecticut proudly commemorated six decades of excellence, having evolved into a preeminent hub for academic, industrial, and governmental statistical grooming. To honor this legacy, a captivating virtual dialogue was conducted with the department’s visionary founder, Dr. Robert H. Riffenburgh, delving into his extraordinary career trajectory, profound insights into the statistical vocation, and heartfelt accounts from the faculty and students he personally nurtured. This multifaceted narrative documents the conversation with more detailed background information on each topic covered by the interview than what is presented in the video recording on YouTube.
In the form of a scholarly exchange with ChatGPT, we cover fundamentals of modeling stochastic dependence with copulas. The conversation is aimed at a broad audience and provides a light introduction to the topic of copula modeling, a field of potential relevance in all areas where more than one random variable appears in the modeling process. Topics covered include the definition, Sklar’s theorem, the invariance principle, pseudo-observations, tail dependence and stochastic representations. The conversation also shows to what degree it can be useful (or not) to learn about such concepts by interacting with the current version of a chatbot.
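Two of the topics listed above, pseudo-observations and the invariance principle, lend themselves to a small illustration. The sketch below uses the common rank-based definition of pseudo-observations, rank divided by n + 1, applied to one margin at a time. The function name and sample values are hypothetical, and ties are assumed absent for simplicity.

```python
import math


def pseudo_observations(sample):
    """Map one margin of a sample to (0, 1) via scaled ranks:
    u_i = rank(x_i) / (n + 1). Assumes there are no ties."""
    n = len(sample)
    order = sorted(range(n), key=lambda i: sample[i])
    ranks = [0] * n
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return [r / (n + 1) for r in ranks]


if __name__ == "__main__":
    x = [3.2, 1.5, 2.7]  # hypothetical observations of one variable
    u = pseudo_observations(x)
    print(u)

    # Invariance: a strictly increasing transformation of the margin
    # (here exp) leaves the ranks, and so the pseudo-observations, unchanged.
    u_transformed = pseudo_observations([math.exp(v) for v in x])
    print(u == u_transformed)
```

Because pseudo-observations depend on the data only through ranks, they carry information about the dependence structure while discarding the marginal distributions, which is why they are a usual starting point for fitting a copula.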
This paper aims to determine the effects of socioeconomic and healthcare factors on how well COVID-19 is controlled in both the Southern and Southeastern United States. This analysis will provide government agencies with information to determine which communities need additional COVID-19 assistance, to identify counties that effectively control COVID-19, and to apply effective strategies on a broader scale. The statistical analysis uses data from 328 counties with a population of more than 65,000 from 13 states. We define a new response variable by considering infection and mortality rates to capture how well each county controls COVID-19. We collect 14 factors from the 2019 American Community Survey Single-Year Estimates and obtain county-level infection and mortality rates from USAfacts.org. We use the least absolute shrinkage and selection operator (LASSO) regression to fit a multiple linear regression model and develop an interactive system programmed in R Shiny to deliver all results. The interactive system at https://asa-competition-smu.shinyapps.io/COVID19/ provides many options for users to explore our data, models, and results.