COVID-19 Fatality: A Cross-Sectional Study using Adaptive Lasso Penalized Sliced Inverse Regression

Coronavirus disease 2019 (COVID-19) is an infectious disease caused by severe acute respiratory syndrome coronvirus, which was declared as a global pandemic by the World Health Organi-zation on March 11, 2020. In this work, we conduct a cross-sectional study to investigate how the infection fatality rate (IFR) of COVID-19 may be associated with possible geographical or demographical features of the infected population. We employ a multiple index model in combi-nation with sliced inverse regression to facilitate the relationship between the IFR and possible risk factors. To select associated features for the infection fatality rate, we utilize an adaptive Lasso penalized sliced inverse regression method, which achieves variable selection and suﬃcient dimension reduction simultaneously with unimportant features removed automatically. We apply the proposed method to conduct a cross-sectional study for the COVID-19 data obtained from two time points of the outbreak.


Introduction
Since January 2020, many regions in China have experienced an outbreak of the coronavirus disease 2019 , caused by severe acute respiratory syndrome coronavirus (SARS-Cov-2). Researchers from different fields have been studying epidemiological and clinical characteristics of the COVID-19. For example, Epidemiology Working Group for NCIP Epidemic Response (2020) studied epidemiological characteristics of the COVID-19 in China. Xu et al. (2020) investigated pathological characteristics of a patient who died from severe infection with SARS-CoV-2. Zhang et al. (2020) accessed how livers would be affected using the available case studies. A number of researchers studied clinical features of individual patients of COVID-19 (Guan et al., 2020;Huang et al., 2020;Chen et al., 2020;Wang et al., 2020d). Investigations on the characteristics of COVID-19 by using statistical and machine learning methods were also available. For instance,  employed a modified stacked auto-encoder for modeling transmission dynamics of the epidemics to forecast confirmed cases of COVID-19 across China. Wang et al. (2020b) developed an epidemiological forecast model with an R package to assess the intervention effect on the COVID-19 epidemic within and outside Hubei in China.
As COVID-19 becomes a global pandemic, one of the greatest concerns is to identify the fatality risk for infected populations which usually differ in multiple features. It is important to understand what features of the populations are associated with the high fatality risk. At the individual level, some work was available to describe the relationship between patient level features and the fatality risk factors. For instance, Ji et al. (2020) investigated the potential association between the COVID-19 fatality and health-care resources. Shi et al. (2020) explored the association between the cardiac injury and the fatality of the patients with the COVID-19. Clinical Features of 69 cases with COVID-19 were studied by Wang et al. (2020d). Different risk factors that may be associated with the infected populations with COVID-19 were considered by different researchers, including temperature Wang et al., 2020c), humidity (Wang et al., 2020c), age structure (Onder et al., 2020), and the cardiovascular disease (Clerkin et al., 2020). While each of the possible risk factors has been separately studied to reveal the potential association with COVID-19, it is unclear how those factors may interactively affect the fatality of the COVID-19. It is critical to examine how the COVID-19 fatality may be associated with potential risk factors as a group instead of on an individual basis. To this end, in this paper we employ the multiple index model to facilitate the relationship between possible risk factors and the fatality rate of COVID-19. Such a model is advantageous in its flexibility of accommodating various kinds of association with an unspecified model function.
To reduce the variable dimension of the multiple index model, we employ the sufficient dimension reduction method, a powerful tool for reducing the variable dimension of the multiple index model which can be employed to identify important risk factors by obtaining the central subspace (Cook, 1998). To estimate the central subspace of the sufficient dimension reduction, many methods have been developed, including the most widely used sliced inverse regression (SIR) (Li, 1991), the sliced average variance estimation (SAVE) method (Cook and Weisberg, 1991), the principal Hessian directions (pHd) method (Li, 1992), and the iterative Hessian transformation method (Cook et al., 2002). The shrinkage SIR based on the Lasso penalty for sufficient dimension reduction was proposed by Ni et al. (2005). Li and Nachtsheim (2006) and Li (2007) produced sparse estimates of the basis for the central subspace by combining a regression-type formulation with the Lasso penalty. Other penalized sufficient dimension reduction methods include the constrained canonical correlation procedure (Zhou et al., 2008), the SCAD max penalized SIR method (Wu and Li, 2011), and the Lasso-SIR (Lin et al., 2019).
Motivated by Zou (2006) and Lin et al. (2019), we propose an adaptive Lasso penalized sliced inverse regression method for the multiple index model to identify the possible risk factors for the infection fatality rate of COVID-19. The proposed method develops a model-free variable selection procedure which does not require the specification of a parametric model for the underlying true process. It estimates the central subspace of the multiple index model and selects the important features simultaneously.
The remainder of this article is organized as follows. Section 2 describes the COVID-19 data sets to be analyzed. Section 3 presents the proposed method and the implementation algorithm. The analysis results are reported in Section 4. Section 5 concludes the article with a discussion.

Descriptions of the COVID-19 Data
To interpret the fatality risk of infected populations, we use the infection fatality rate (IFR) in percent of COVID-19, defined as Infection Fatality Rate (in Percent) = Total Number of Deaths Total Number of Confirmed Cases × 100.
We consider the data from n = 64 developed or developing countries in Asia, Europe, Africa, America, and Oceania. The Asian countries include Armenia, Azerbaijan, Bahrain, Iran, Israel, Japan, Kazakhstan, Malaysia, Pakistan, Philippines, Qatar, Saudi Arabia, Singapore,  Table 1, were collected between February 27, 2020 and April 9, 2020. All of the covariates are continuous variables. The covariate x 1 is defined to be the percentage of individuals age 65 or over in a country where the total population sizes counts all residents regardless of legal status or citizenship. The covariate x 2 is the prevalence of smoking which is the percentage of males and females age 15 and over who smoke any tobacco product. The covariate x 3 is the population weighted exposure to ambient PM2.5 pollution, defined as the average level of exposure of a nation's population to concentrations of suspended particles measuring less than 2.5 microns in aerodynamic diameter; exposure is calculated by weighting mean annual concentrations of PM2.5 by population in both urban and rural areas. The covariate x 4 is the age standardized death rate per 100, 000 individuals for both males and females. The covariate x 5 is the disease burden due to all chronic respiratory diseases, including silicosis, asthma, and lung disease. The covariate x 6 is the number of medical doctors, including general physicians and medical specialist per 1000 residents in a country. The covariate x 7 is the number of hospital beds, including inpatient beds available in public, private, general, and specialized hospitals and rehabilitation centers per 1000 residents in a country. The covariate x 8 is the mean systolic blood pressure (in mmHg) for adults 18 years and older, which is related to the level of the hypertension in a country. The covariates x 9 and x 10 are the semi-annual average temperature and relative humidity from November of the previous year to April of its following year, respectively. This consideration is driven by the fact that the first case of the COVID-19 was discovered in early December of 2019 and the data we consider span to the end of April of 2020. The covariates x 11 and x 12 are the number of serious cases and the total number of tests per million people in a As the outbreak of COVID-19 starts at different times for different regions, there is a time lag to obtain the number of deaths and the number of confirmed cases. To circumvent this issue, we conduct a cross-sectional study by defining the time point as the day of the first 400 confirmed cases, and we call this "Study 1". For comparison, we take a second time point, defined as the 14 days after the first 100 cases are confirmed, which is partially driven by the fact that the maximum incubation time for COVID-19 is about 14 days (e.g., He et al., 2020), and we call this "Study 2". The minimum, mean, and maximum values of the number of deaths and the number of confirmed cases for the two studies are shown in Table 2.

Notation and Framework
Let X = (X 1 , . . . , X p ) ∈ R p be the vector of covariates which follows a p-dimensional elliptical distribution with E(X) = 0 and covariance matrix Cov(X) = Σ, where the diagonal elements of Σ are Var(X j ) = 1 for j = 1, . . . , p. Let Y be the response variable. We consider the multiple index model for Y and X (Li, 1991): where is the error term independent of X with E( ) = 0 and {β 1 , . . . , β d } are unknown projection vectors with β k = (β k1 , . . . , β kp ) for k = 1, . . . , d. Here d is unknown but is regarded to be much smaller than p. Several methods are available in the literature for determining the structural dimension d, such as the asymptotic test (Li, 1991), the permutation test (Cook and Yin, 2001), the information criterion (Zhu et al., 2006), and the estimation algorithm (Lin et al., 2019). In our algorithm to be described in Section 3.2, we modify the estimation algorithm of Lin et al. (2019) by combing with the K-means method (MacQueen, 1967) to choose a suitable value for d.
Let B = (β 1 , . . . , β d ) which is a nonidentifiable p × d matrix in the sense that the vectors β 1 , . . . , β d are not unique. However, the space spanned by the columns of B, called the central space and denoted col(B), can be uniquely determined. Li (1991) proposed the sliced inverse regression (SIR) procedure to estimate the central subspace col(B) without knowing function f (·). The SIR method is summarized as follows.
Assume that there exist n independent random pairs (y i , x i ) whose distributions are identical to (Y, X), where i = 1, . . . , n. We first divide them into H equal sized slices based on the order statistics y (i) for i = 1, . . . , n, where H is a user-specified number of slices. Let c = n/H , we define y h,l = y (c(h−1)+l) and x h,l = x (c(h−1)+l) , and then we rewrite the data as (y h,l , x h,l ), where h is the slice number and l is the order number of a sample in the hth slice, and x (r) is the covariate vector corresponding to y (r) for an index r. Letx h = 1 c c l=1 x h,l be the sample mean in the hth slice, then Λ = var{E(X|Y )} can be estimated bŷ where X H = (x 1 , . . . ,x H ) is a p × H matrix. LetV be the matrix of the top d eigenvectors ofΛ and let col(V ) be the subspace formed by the column vectors ofV . Then the central subspace col(B) is estimated byΣ −1 col(V ) using the observed data, whereΣ is the estimated covariance matrix of X.

Methodology and Algorithm
For ease of exposition, we assume y 1 ≤ y 2 ≤ · · · ≤ y n for the sample {(y i , To estimate the space spanned by {β 1 , . . . , β d }, Lin et al. (2019) proposed the Lasso-SIR algorithm to find the sparse estimates, say,β k , for β k with k = 1, . . . , d. Specifically, for k = 1, . . . , d, let and µ k is the tuning parameter. Then minimizing L β k ,k with respect to β k gives the sparse estimatesβ k . Lin et al. (2019) showed that the Lasso penalized SIR regression (2) is asymptotically equivalent to the Lasso penalized linear regressions. The R package glmnet can be used to implement the Lasso penalized linear regression where the tuning parameters µ k for k = 1, . . . , d are chosen by the cross-validation method. The Lasso penalized linear regressions method does not have the selection consistency (Tibshirani, 1996;Zou, 2006). To overcome this limitation, Zou (2006) proposed the adaptive Lasso penalized linear regressions method which enjoys the selection consistency. The adaptive Lasso penalized methods can shrink the estimates of the parameters of unimportant covariates to 0 as the sample size n → ∞, thus, removing all the unimportant covariates automatically and yielding sparse estimation results. Motivated by this, we propose the adaptive Lasso penalized SIR regression. For k = 1, . . . , d, we calculate the adaptive Lasso penalized SIR regression expressioñ where the w kj = 1/|β kj | r with r > 0 are weights controlling the penalty for k = 1, . . . , d and j = 1, . . . , p, theβ kj are the estimates obtain from the unpenalized sliced inverse regression, and r can be simply set as 1. To obtain the sparse estimateβ k , we minimizeL β k ,k with respect to β kj . Specifically, we carry out the following steps.
Steps 1 and 2 are easy to implement. For Step 3, we first fit the data using unpenalized sliced inverse regression to obtainβ kj for k = 1, . . . , d and j = 1, . . . , p. The Lasso penalized linear regression was well studies by Tibshirani (1996), Efron et al. (2004) and Friedman et al. (2010). Since we can convert the optimization problem (4) to the optimization problem of adaptive Lasso penalized linear regression, we solve the optimization problem (4) using the R package glmnet with the argument alpha = 1. The tuning parameters µ k with k = 1, . . . , d are chosen by the cross-validation method. By solving the minimization problem in Step 3, we obtain a sparse estimate of β. Since col(B) = col(β 1 , . . . ,β d ) estimates the central subspace col(B), the jth covariate x j is considered to be unimportant ifβ kj = 0 for k = 1, . . . , d.

Data Analysis
We analyze the COVID-19 data described in Section 2 using the proposed adaptive Lasso penalized SIR regression method, denoted ALSIR. In comparison, we also apply the linear regressions method and the unpenalized sliced inverse regression method, denoted LR and SIR, respectively. As discussed in Section 2, we analyze the two data sets of Studies 1 and 2 separately. To make the covariates unit-less and of the same scale for the penalty function, we standardize all the covariates by subtracting the mean and dividing the standard deviation before applying these methods.

Analysis Results with H = 5
In this subsection, we set H = 5 when implementing the ALSIR method to analyze the data. We report analysis results in Table 3 which includes the parameter estimates (Est.), the associated Table 3: Analysis results for Study 1 and Study 2 using the linear regression (LR), unpenalized sliced inverse regression (SIR) and adaptive Lasso penalized SIR (ALSIR) methods. The entries for "Est.", "SE", and "ALSIR" are in percent.  , we estimate the structural dimension of the subspace to be d = 1. Therefore, we report only one direction vector (β 11 , . . . ,β 1p ) as the estimated basis for the central subspace in the analysis; the jth covariate is considered to be unimportant ifβ 1j is shrunk to zero. That is, the jth feature does not affect the fatality of the COVID-19 ifβ 1j = 0. Since the ALSIR method provides estimates of direction vectors for the basis of the central subspace derived from the multiple index model, positive or negative signs of estimates do not indicate positive or negative association of the covariate with the response. The linear regressions method shows that except for the covariate x 4 , none of the covariates are significantly associated with the IFR of COVID-19 for Study 1 data if we take the significant level to be 0.05. The unpenalized sliced inverse regression method does not remove unimportant risk factors automatically. On the contrary, the ALSIR method yields different results which implies that the linear relationship may not be appropriate. It shows that in Study 1, the covariates x 4 , x 6 and x 12 are important features for affecting the fatality. That is, the cardiovascular disease, the number of physicians, and the number of tests are closely associated with the fatality at the early stage of the outbreak. Clerkin et al. (2020) concluded that the patients with cardiovascular disease are at higher risk of fatality. Compared with the number of hospital beds, the number of doctors plays an important role in explaining the fatality rate. The fatality rate at the early stage of the COVID-19 does not seem to differ for the populations with different age structures.

Covariate
Regarding the results for Study 2, our method shows that the covariates x 1 , x 4 , x 6 , x 7 , x 9 , x 11 and x 12 are significant. That is, the age structure, cardiovascular disease, the number of physicians, the number of hospital beds, average temperature, the number of serious cases, and the number of tests are all important features for the fatality rate of the infected populations. Onder et al. (2020) found that the populations with a higher proportion of older people have a higher fatality rate. The cardiovascular disease is an important feature for IFR of the infected populations. The number of medical doctors and the number of hospital beds per 1000 residents are all important for explaining the fatality rate. Studying the data in Wuhan city, China, Ma et al. (2020) suggested that temperature has strong positive correlation with the fatality of COVID-19, while some other research indicates that high temperature could prevent the spread of COVID-19 (e.g., Wang et al., 2020a,c). In our analysis here, the covariate x 9 is semi-annual average temperature collect from November to April, with the maximum, median and minimum values being 26.583 • C, 6.717 • C and −11.583 • C respectively, and most countries in the data sets are in winter or spring during this time period. Our analysis indicates that temperature is associated with the fatality of COVID-19. Wang et al. (2020c) showed that the most suitable survival temperature for the coronvirus of COVID-19 is 8.72 • C. The smoking prevalence and PM2.5 air pollution do not associate with the fatality rate of COVID-19. The analysis results support that the number of serious cases and the number of tests are important features for the fatality of COVID-19. The analysis results show that the humidity does not associate with the IFR of COVID-19. Chronic respiratory diseases and blood pressure (hypertension) do not have a strong association with the fatality.
In summary, the cardiovascular disease, the number of physicians and the number of tests are important features for both studies, while the smoking prevalence, PM2.5 air pollution, chronic respiratory disease, blood pressure, and humidity do not seem to be associated with the fatality. The age structure, the number of hospital beds, and the number of serious cases play different roles in explaining the fatality; they are important features for Study 2 but not for Study 1. For both studies, the chronic respiratory diseases and blood pressure (hypertension) are not considered as important for explaining the COVID-19 fatality.

Sensitivity Analysis
To compare the performance of the different methods, we further provide a "bootstrap inference" for the LR, SIR and ALSIR methods, as suggested by a referee. To be specific, we independently resample 1500 bootstrap samples from the initial data of Studies 1 and 2. We fit the 1500 bootstrap samples using these methods and obtain 1500 analysis results for each of the two studies accordingly. Then we calculate the resultant standard errors and report the results in Table 4. When using the proposed method, we choose H = 5 as in Section 4.1. Unsurprisingly, the results provided by the three methods are different. Interestingly, the proposed method seems to provide more stable results than the LR and SIR methods, because its associated standard errors are smaller than those produced by the LR and SIR methods.
As discussed by Liquet and Saracco (2012), the SIR method may be sensitive to the choice of the number H of slices. To see if the same issue is related to the ALSIR method, here we investigate the effect of the choice of H on the estimation results of the proposed method by considering three values of H: H = 3, 5 or 8, where the structural dimension d is taken as 1, as in Section 4.1. The parameter estimation results are recorded in Table 5, which clearly shows that the performance of the ALSIR is sensitive to the choice of H.

Discussion
In this paper, we propose an adaptive Lasso penalized sliced inverse regression method for the multiple index model to analyze the COVID-19 data. The proposed method is flexible in the   x 3 x 4 x 5 x 6 x 7 x 8 x 9 x 10 x 11 x 12 x sense that there is no need to specify the model structure of f (·) function for the multiple index model, which protects us from potential model misspecification. We apply the proposed method with H set as 5 to identify risk factors associated with the COVID-19 fatality, which shows that the cardiovascular disease, the number of doctors and the number of tests are important for both cross-sectional studies. However, as demonstrated in Section 4.2, the performance of proposed ALSIR method is sensitive to the choice of H, a phenomenon that also occurs with the SIR method (Liquet and Saracco, 2012). When interpreting the analysis results, one must be aware of this uncertainty and regard the inference procedure as a kind of conditional analysis in the sense that H is set as a given value. Our cross-sectional study examines the data sets at two time points of the outbreak which are defined as the time of first 400 confirmed cases and the 14 days after the first 100 cases are confirmed. This consideration mainly focuses on studying the early stage of the pandemic with the incubation period taken into account. One may, of course, consider other time points and repeat the same investigation, and different findings may be uncovered as data become richer. More generally, it is interesting to extend this development to the longitudinal data framework and explore the dynamic relationship between the fatality and risk factors.
Several important issues should be acknowledged, as noticed by the referees and discussed by other authors (e.g., He et al., 2020). Due to asymptomatic infections and limited test capacity, the reported numbers of cases with COVID-19 are typically smaller than the true numbers of infections for various countries. Additionally, other aspects such as varying incubation periods make the data error-prone. To help understand the underlying truth, it is useful to conduct sensitivity analyses by modifying the development here to accommodate the feature of errorcontaminated data (Carroll et al., 2006;Yi, 2017).
In the analysis, one concern pertains to the collinearity of covariates. In the data we analyze, the correlation for paired covariates are fairly small, as suggested by the values in Table 6. Other concerns may be related to the inclusion of potential risk factors for the analysis. For example, one may ask: have we exhausted all possible risk factors for the fatality? Do we need to worry about confounding issues among the risk factors? Questions like these warrant careful studies. As the outbreak continues to exacerbate, more data become available and in-depth studies are possible to help us better understand the characteristics of the COVID-19.