In causal mediation analysis, interest lies in the direct and indirect pathways from an exposure to an outcome variable. In observational studies, a large number of baseline characteristics are collected as potential confounders to mitigate selection bias, and their dimension may approach or exceed the sample size. Flexible machine learning approaches are therefore promising for selecting a subset of relevant confounders, combined with estimation based on the efficient influence function to avoid overfitting. Among the various confounder selection strategies, two have attracted growing attention. One is the popular debiased, or double, machine learning (DML); the other is penalized partial correlation, obtained by fitting a Gaussian graphical model between the confounders and the response variable. Nonetheless, for causal mediation analyses with high-dimensional confounders, it remains unclear which confounder selection strategy works best. We therefore present a motivating study of the human microbiome, in which the dimensions of the mediator and the confounders approach or exceed the sample size, to compare possible combinations of confounder selection methods. By deriving multiply robust causal direct and indirect effects under various hypotheses, our comprehensive illustrations show how confounder selection affects estimation of the final causal target parameter, while also generating causal insights into the "gut-brain axis". Our results highlight the practicality and necessity of the discussed methods, which not only guide real-world applications for practitioners but also motivate future advances on this crucial topic in the era of big data.
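To make the DML idea mentioned above concrete, the following is a minimal sketch of cross-fitted partialling-out for a single treatment effect. It is not the authors' multiply robust mediation estimator: ridge regression stands in for an arbitrary machine learning nuisance learner, and all function names are illustrative.

```python
import numpy as np

def ridge_fit_predict(X_tr, y_tr, X_te, alpha=1.0):
    """Ridge regression as a stand-in for any ML nuisance learner."""
    d = X_tr.shape[1]
    beta = np.linalg.solve(X_tr.T @ X_tr + alpha * np.eye(d), X_tr.T @ y_tr)
    return X_te @ beta

def dml_partialling_out(Y, D, X, n_folds=2, seed=0):
    """Cross-fitted partialling-out (double machine learning):
    remove the confounder signal X from both the outcome Y and the
    exposure D on held-out folds, then regress residual on residual."""
    rng = np.random.default_rng(seed)
    n = len(Y)
    folds = np.array_split(rng.permutation(n), n_folds)
    rY, rD = np.zeros(n), np.zeros(n)
    for k in range(n_folds):
        te = folds[k]
        tr = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        rY[te] = Y[te] - ridge_fit_predict(X[tr], Y[tr], X[te])
        rD[te] = D[te] - ridge_fit_predict(X[tr], D[tr], X[te])
    return float(rD @ rY / (rD @ rD))   # Neyman-orthogonal effect estimate
```

Cross-fitting (training the nuisance models on folds disjoint from the evaluation fold) is what allows flexible learners without overfitting bias, which is the point the abstract makes about efficient-influence-function estimation.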
Yang et al. (2004) developed two-dimensional principal component analysis (2DPCA) for image representation and recognition; it has since been widely used in fields including face recognition, biometrics, cancer diagnosis, and tumor classification. 2DPCA has been shown to perform better, and to be more computationally efficient, than traditional principal component analysis (PCA). However, some theoretical properties of 2DPCA remain unknown, including how to determine the number of principal components (PCs) from the training set, which is a critical step in applying 2DPCA. The lack of rigorous criteria for determining the number of PCs hampers the generalization of 2DPCA to new applications. To address this issue, we propose a new method based on parallel analysis to determine the number of PCs in 2DPCA, with statistical justification. Several image classification experiments demonstrate that the proposed method compares favourably with other state-of-the-art approaches in terms of recognition accuracy and storage requirements, at low computational cost.
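As a rough illustration of the two ingredients this abstract combines, the sketch below implements basic 2DPCA (eigendecomposition of the image scatter matrix, following Yang et al., 2004) together with a simple parallel-analysis-style cutoff that retains eigenvalues exceeding those of random data of the same shape. The authors' statistically justified criterion may differ; function names and the Gaussian null are illustrative assumptions.

```python
import numpy as np

def two_dpca(images, k):
    """2DPCA: project h x w images onto the top-k eigenvectors of the
    w x w image scatter matrix G (Yang et al., 2004)."""
    A = np.stack(images).astype(float)          # (n, h, w)
    centered = A - A.mean(axis=0)
    # G = average over images of (A_i - mean)^T (A_i - mean)
    G = np.einsum('nhw,nhv->wv', centered, centered) / len(A)
    eigvals, eigvecs = np.linalg.eigh(G)        # ascending order
    X = eigvecs[:, ::-1][:, :k]                 # top-k projection axes (w, k)
    features = A @ X                            # (n, h, k) feature matrices
    return features, X, eigvals[::-1]

def parallel_analysis_k(images, n_draws=20, seed=0):
    """Parallel-analysis-style choice of k: keep PCs whose eigenvalues
    exceed the average eigenvalues from random Gaussian data (an
    illustrative null, not the paper's exact procedure)."""
    rng = np.random.default_rng(seed)
    n, h, w = np.stack(images).shape
    _, _, obs = two_dpca(images, k=w)
    null = np.zeros(w)
    for _ in range(n_draws):
        _, _, ev = two_dpca(list(rng.standard_normal((n, h, w))), k=w)
        null += ev
    return int(np.sum(obs > null / n_draws))
```

Note that 2DPCA works on the w x w scatter matrix of image rows rather than vectorizing each image, which is why it is cheaper than classical PCA on hw-dimensional vectors.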
Abstract: We analyze the cross-correlation between logarithmic returns of 1108 stocks listed on the Shanghai and Shenzhen Stock Exchanges of China over the period 2005 to 2010. The results suggest that the estimated distribution of correlation coefficients shifts to the right during downturns of the Chinese stock market. Because the maximum eigenvalue accounts for a large share of the total variance, the principal correlation component of the Chinese stock market is dominant, and the other components have only trivial effects on the market condition. The same-signed elements of the corresponding eigenvector lead us to propose the maximum eigenvalue series as an indicator of collective behavior in the equity market. We provide evidence that the largest eigenvalue series can serve as an effective indicator of the collective behavior of stock returns, and we find it to be positively correlated with market volatility. Using time-varying windows, we find that this positive correlation diminishes when market volatility reaches either its highest or its lowest level. By defining a stability rate, we show that the collective behavior of stocks tends to be more homogeneous during crises than in regular times. This study has implications for the ongoing discussion of correlation risk.
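The maximum-eigenvalue indicator described above can be sketched as follows: over rolling windows, form the correlation matrix of the stocks' log returns and track its largest eigenvalue, which grows when returns move together. This is a generic illustration of the rolling-window construction, not the paper's exact windowing or stability-rate definition; the window and step sizes are illustrative.

```python
import numpy as np

def max_eigen_series(returns, window=60, step=5):
    """Largest eigenvalue of the return correlation matrix over rolling
    windows, as an indicator of collective (market-wide) behavior.
    `returns` is a (T, N) array of log returns for N stocks."""
    T, N = returns.shape
    times, lam_max = [], []
    for start in range(0, T - window + 1, step):
        R = returns[start:start + window]
        C = np.corrcoef(R, rowvar=False)     # N x N correlation matrix
        lam_max.append(np.linalg.eigvalsh(C)[-1])  # largest eigenvalue
        times.append(start + window)
    return np.array(times), np.array(lam_max)
```

When returns share a strong common ("market") factor, the largest eigenvalue approaches N times the average pairwise correlation, whereas for independent returns it stays near the random-matrix bulk edge; that contrast is what makes the series informative about collective behavior.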
Abstract: Given the importance of the science and mathematics achievements of young students, one widely observed phenomenon is the disappointing performance of U.S. students in mathematics and science. To address the declining mathematics and science scores of American high school students, many strategies have been implemented over several decades. In this paper, we present an in-depth longitudinal study of American youth using a double-kernel approach to nonparametric quantile regression. Two advantages of this approach are: (1) it guarantees that the Nadaraya-Watson estimator of the conditional distribution function is itself a distribution function, whereas in some cases this kind of estimator is neither monotone nor restricted to values between 0 and 1; (2) it guarantees that quantile curves based on the Nadaraya-Watson estimator do not cross each other absurdly. Previous work has focused only on mean regression and parametric quantile regression. We obtain many interesting results in this study.
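A minimal sketch of the double-kernel construction described above: one kernel weights observations in the covariate x, and a second kernel smooths the indicator 1{Y <= y}, so the estimated conditional CDF is monotone in y and bounded in [0, 1] by construction, and quantiles obtained by inverting it cannot cross. The Gaussian kernels, bandwidths, and grid inversion are illustrative choices, not the paper's exact specification.

```python
from math import erf
import numpy as np

_erf = np.vectorize(erf)

def nw_cdf(x0, y_grid, X, Y, hx=0.5, hy=0.5):
    """Double-kernel Nadaraya-Watson estimate of the conditional CDF
    F(y | x0): Gaussian weights in x, smoothed indicator in y."""
    w = np.exp(-0.5 * ((X - x0) / hx) ** 2)
    w = w / w.sum()
    # Smoothed indicator 1{Y <= y}: Gaussian CDF Phi((y - Y_i)/hy)
    z = (y_grid[:, None] - Y[None, :]) / hy
    Phi = 0.5 * (1.0 + _erf(z / np.sqrt(2.0)))
    return Phi @ w                       # monotone in y, values in [0, 1]

def nw_quantile(x0, tau, X, Y, hx=0.5, hy=0.5):
    """Invert the estimated conditional CDF to get the tau-th quantile."""
    y_grid = np.linspace(Y.min(), Y.max(), 400)
    F = nw_cdf(x0, y_grid, X, Y, hx, hy)
    i = min(int(np.searchsorted(F, tau)), len(y_grid) - 1)
    return y_grid[i]
```

Because every quantile curve comes from inverting the same monotone CDF estimate, the tau = 0.25, 0.5, 0.75 curves are automatically ordered, which is exactly the non-crossing property the abstract emphasizes.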
Previous abstractive methods apply sequence-to-sequence structures to generate summaries without a module that helps the system detect vital mentions and relationships within a document. To address this problem, we use a semantic graph to boost generation performance. First, we extract important entities from each document and then build a graph inspired by the idea of distant supervision (Mintz et al., 2009). We then combine a Bi-LSTM with a graph encoder to obtain the representation of each graph node. A novel neural decoder is presented to leverage the information in such entity graphs. Automatic and human evaluations show the effectiveness of our technique.