Large pretrained transformer models have revolutionized modern AI applications with their state-of-the-art performance in natural language processing (NLP). However, their substantial parameter count poses challenges for real-world deployment. To address this, researchers often reduce model size by pruning parameters based on their magnitude or sensitivity. Previous research has demonstrated the limitations of magnitude pruning, especially in the context of transfer learning for modern NLP tasks. In this paper, we introduce a new magnitude-based pruning algorithm called mixture Gaussian prior pruning (MGPP), which employs a mixture Gaussian prior for regularization. MGPP prunes non-expressive weights under the guidance of the mixture Gaussian prior, aiming to retain the model’s expressive capability. Extensive evaluations across various NLP tasks, including natural language understanding, question answering, and natural language generation, demonstrate the superiority of MGPP over existing pruning methods, particularly in high sparsity settings. Additionally, we provide a theoretical justification for the consistency of the sparse transformer, shedding light on the effectiveness of the proposed pruning method.
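As a loose illustration of the idea (a minimal sketch assuming a simple two-component mixture Gaussian prior with illustrative hyperparameters, not the exact prior, hyperparameter choices, or training procedure of MGPP), the prior can serve as a penalty added to the training loss and can suggest which weights are "non-expressive":

```python
import numpy as np

def neg_log_mixture_gaussian_prior(w, pi=0.9, sigma_spike=1e-2, sigma_slab=1.0):
    """Negative log density of a two-component mixture Gaussian prior.

    pi is the mixing weight of the narrow 'spike' component; all values
    here are illustrative placeholders, not those used by MGPP."""
    spike = pi * np.exp(-0.5 * (w / sigma_spike) ** 2) / (np.sqrt(2 * np.pi) * sigma_spike)
    slab = (1 - pi) * np.exp(-0.5 * (w / sigma_slab) ** 2) / (np.sqrt(2 * np.pi) * sigma_slab)
    return -np.log(spike + slab + 1e-300)

def prune_mask(w, pi=0.9, sigma_spike=1e-2, sigma_slab=1.0):
    """Keep weights whose posterior responsibility favors the wide 'slab' component."""
    spike = pi * np.exp(-0.5 * (w / sigma_spike) ** 2) / sigma_spike
    slab = (1 - pi) * np.exp(-0.5 * (w / sigma_slab) ** 2) / sigma_slab
    return slab > spike  # True = keep, False = prune

# Toy example: a penalty term one could add to a training loss, then pruning.
weights = np.random.randn(1000) * 0.1
penalty = neg_log_mixture_gaussian_prior(weights).sum()
mask = prune_mask(weights)
print(f"penalty = {penalty:.1f}, sparsity = {1 - mask.mean():.2%}")
```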
When computations such as statistical simulations need to be carried out on a high performance computing (HPC) cluster, typical questions arise among researchers and practitioners. How do I interact with an HPC cluster? Do I need to type a long host name and a password on every single login or file transfer? Why does my locally working code no longer run on the HPC cluster? How can I install the latest versions of software on an HPC cluster to match my local setup? How can I submit a job and monitor its progress? This tutorial answers such questions through experiments on an example HPC cluster.
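As a small sketch of the job-submission workflow (assuming a SLURM scheduler, which a given cluster may or may not use; the script contents, resource values, and simulation.R file are hypothetical placeholders to be adapted), a job can be written and submitted from Python:

```python
import getpass
import subprocess
from pathlib import Path

# Hypothetical SLURM batch script; adjust job name, time, memory, and the
# loaded software environment to match the cluster's configuration.
job_script = """#!/bin/bash
#SBATCH --job-name=simulation
#SBATCH --time=01:00:00
#SBATCH --mem=4G
#SBATCH --cpus-per-task=4

module load R             # load the software environment provided by the cluster
Rscript simulation.R      # run the statistical simulation
"""

Path("job.sh").write_text(job_script)
subprocess.run(["sbatch", "job.sh"], check=True)                   # submit the job
subprocess.run(["squeue", "-u", getpass.getuser()], check=True)    # monitor its progress
```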
Connections between subpar dietary choices and negative health consequences are well established in nutritional epidemiology. Consequently, in the United States, regular surveys are conducted to evaluate dietary habits. One notable example is the National Health and Nutrition Examination Survey (NHANES), conducted every two years by the Centers for Disease Control and Prevention (CDC). Several scoring methods have been developed to assess diet quality in the overall population as well as in pertinent subgroups using the dietary recall data collected in these surveys. The Healthy Eating Index (HEI) is one such metric, developed from recommendations of the United States Department of Health and Human Services (HHS) and Department of Agriculture (USDA) and widely used by nutritionists. Presently, there is a scarcity of user-friendly statistical software implementing these standard scoring metrics. Herein, we develop the R package heiscore to address this need. Our carefully designed package, with its many user-friendly features, increases the accessibility of HEI scoring using three different methods outlined by the National Cancer Institute (NCI). Additionally, we provide functions to visualize multidimensional diet quality data via various graphing techniques, including bar charts and radar charts. The package's utility is illustrated with many examples, including comparisons between different demographic groups.
Yang et al. (2004) developed two-dimensional principal component analysis (2DPCA) for image representation and recognition, and it has been widely used in fields including face recognition, biometrics recognition, cancer diagnosis, and tumor classification. 2DPCA has been shown to perform better, and to be computationally more efficient, than traditional principal component analysis (PCA). However, some theoretical properties of 2DPCA remain unknown, including how to determine the number of principal components (PCs) from the training set, which is a critical step in applying 2DPCA. The lack of a rigorous criterion for determining the number of PCs hampers the broader application of 2DPCA. Given this issue, we propose a new method based on parallel analysis to determine the number of PCs in 2DPCA, with statistical justification. Several image classification experiments demonstrate that the proposed method compares favorably to other state-of-the-art approaches in terms of recognition accuracy and storage requirements, at a low computational cost.
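The general recipe of parallel analysis carries over to 2DPCA: compare the eigenvalues of the observed image covariance matrix with those obtained from data whose structure has been destroyed, and retain only the components that exceed the random baseline. The sketch below illustrates this recipe with a permutation scheme and an illustrative quantile threshold; the precise criterion with statistical justification is the one developed in the paper.

```python
import numpy as np

def image_covariance(images):
    """2DPCA image covariance: G = (1/n) * sum_i (A_i - mean)^T (A_i - mean)."""
    centered = images - images.mean(axis=0)
    return np.einsum('nij,nik->jk', centered, centered) / len(images)

def parallel_analysis_2dpca(images, n_sim=50, quantile=0.95, seed=0):
    """Retain components whose eigenvalues exceed the chosen quantile of
    eigenvalues obtained from permuted data of the same shape."""
    rng = np.random.default_rng(seed)
    obs_eigs = np.sort(np.linalg.eigvalsh(image_covariance(images)))[::-1]
    sim_eigs = []
    for _ in range(n_sim):
        # permute pixel values independently within each column to break structure
        permuted = np.stack([rng.permuted(img, axis=0) for img in images])
        sim_eigs.append(np.sort(np.linalg.eigvalsh(image_covariance(permuted)))[::-1])
    threshold = np.quantile(np.array(sim_eigs), quantile, axis=0)
    return int(np.sum(obs_eigs > threshold))

# Toy example: 100 random 32x24 "images".
imgs = np.random.rand(100, 32, 24)
print("number of retained PCs:", parallel_analysis_2dpca(imgs))
```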
Ultrasonic testing has been considered a promising method for diagnosing and characterizing masonry walls. Because ultrasonic waves tend to travel faster in denser materials, they are commonly used to evaluate the condition of various materials. The presence of internal voids, for example, alters the wave path, and this distinct behavior can be exploited to identify unknown conditions within the material and assess its state. We therefore applied mixed models and Gaussian processes to analyze the behavior of ultrasonic waves in masonry walls and to identify the relevant factors affecting their propagation. Under both models, we observed that the average propagation time differs depending on the material; the condition of the wall also influences the propagation time. The performances of the Gaussian process and the mixed model are compared, and we conclude that these models can be useful components of a classification model that automatically identifies anomalies within masonry walls.
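As a small illustration of the Gaussian-process part of such an analysis (with entirely hypothetical data and predictors, here a path length and a binary material indicator; the actual experimental factors and model specifications are those described in the study):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Hypothetical data: propagation time (microseconds) as a function of
# path length (cm) and a material indicator (0 = brick, 1 = mortar joint).
rng = np.random.default_rng(1)
X = np.column_stack([rng.uniform(5, 30, 80), rng.integers(0, 2, 80)])
y = 2.0 * X[:, 0] + 8.0 * X[:, 1] + rng.normal(0, 1.5, 80)

# Anisotropic RBF kernel plus a noise term; hyperparameters are fit by
# maximizing the marginal likelihood.
kernel = RBF(length_scale=[5.0, 1.0]) + WhiteKernel(noise_level=1.0)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X, y)

mean, std = gp.predict(np.array([[20.0, 1.0]]), return_std=True)
print(f"predicted propagation time: {mean[0]:.1f} +/- {std[0]:.1f}")
```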
The attention mechanism has become an almost ubiquitous component of deep learning architectures. One of its distinctive features is the computation of a non-negative probability distribution that re-weights input representations. This work reconsiders attention weights as bidirectional coefficients instead of probabilistic measures, for potential benefits in interpretability and representational capacity. After analyzing how attention scores evolve under backward gradient propagation, we propose a novel activation function, TanhMax, which possesses several favorable properties that satisfy the requirements of bidirectional attention. We conduct a battery of experiments on both text and image datasets to validate our analyses and the advantages of the proposed method. The results show that bidirectional attention is effective in revealing the semantics of input units, presenting more interpretable explanations, and increasing the expressive power of attention-based models.
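The exact form of TanhMax is defined in the paper; purely to illustrate the contrast between softmax's non-negative weights and bidirectional (signed) coefficients, the sketch below uses a plain tanh re-weighting as a stand-in, not the paper's activation:

```python
import numpy as np

def softmax_attention(scores, values):
    """Standard attention: non-negative weights summing to one."""
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w, w @ values

def signed_attention(scores, values):
    """Bidirectional re-weighting: tanh preserves the sign of each score, so an
    input unit can contribute negatively (a stand-in, not the paper's TanhMax)."""
    w = np.tanh(scores)
    return w, w @ values

scores = np.array([2.0, -1.0, 0.5])
values = np.eye(3)
print(softmax_attention(scores, values)[0])  # all weights >= 0
print(signed_attention(scores, values)[0])   # weights can be negative
```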
Analyzing gene-environment interactions (GEI) is crucial for understanding the etiology of many complex traits. Among the various study designs, case-control studies are popular for analyzing gene-environment interactions because of their efficiency in collecting covariate information. An extensive literature explores efficient estimation under various assumptions about the relationship between genetic and environmental variables. In this paper, we comprehensively review methods based on or related to the retrospective likelihood, including methods based on the hypothetical population concept, which has been largely overlooked in GEI research over the past decade. Furthermore, we establish the methodological connection between these two groups of methods by deriving a new estimator from both the retrospective likelihood and the hypothetical population perspectives. The validity of the derivation is demonstrated through numerical studies.
Imbalanced datasets present a significant challenge for machine learning models, often leading to biased predictions. To address this issue, data augmentation techniques are widely used to generate new samples for the minority class. In this paper, however, we challenge the common assumption that data augmentation is necessary to improve predictions on imbalanced datasets. Instead, we argue that adjusting the classifier cutoffs without data augmentation can produce results similar to those of oversampling techniques. Our study provides theoretical and empirical evidence to support this claim. Our findings contribute to a better understanding of the strengths and limitations of different approaches to dealing with imbalanced data and help researchers and practitioners make informed decisions about which methods to use for a given task.
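A minimal sketch of the cutoff-adjustment idea, using a logistic regression and a cutoff lowered toward the observed class prevalence (the specific cutoff rules, models, and comparisons studied in the paper may differ):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Imbalanced binary problem: roughly 5% positives.
X, y = make_classification(n_samples=5000, weights=[0.95], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]

# Default cutoff of 0.5 vs. a cutoff lowered toward the observed prevalence,
# with no oversampling of the minority class.
for cutoff in (0.5, y_tr.mean()):
    pred = (proba >= cutoff).astype(int)
    print(f"cutoff={cutoff:.3f}  F1={f1_score(y_te, pred):.3f}")
```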
The assignment problem, crucial in various real-world applications, involves optimizing the allocation of agents to tasks for maximum utility. While it has been well studied in the optimization literature when the underlying utilities between all agent-task pairs are known, research is sparse when the utilities are unknown and must be learned from data on the fly. This paper addresses this gap, motivated by mentor-mentee matching programs at many U.S. universities. We develop an efficient sequential assignment algorithm that aims to nearly maximize the overall utility simultaneously over different time periods. Our algorithm uses stochastic bandit feedback to adaptively estimate the unknown utilities through linear regression models, integrating the Upper Confidence Bound (UCB) algorithm from the multi-armed bandit problem with the Hungarian algorithm from the assignment problem. We provide theoretical bounds for both the estimation error and the total regret of our algorithm. Numerical studies are also conducted to demonstrate the practical effectiveness of our algorithm.
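A simplified sketch of the algorithmic skeleton: the paper estimates utilities with linear regression models on covariates, which the sketch below replaces with per-pair sample means to keep the example short, while retaining the combination of a UCB bonus with the Hungarian algorithm at each period:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
n_agents, n_tasks, horizon = 5, 5, 200

# Unknown true utilities; only noisy feedback for assigned pairs is observed.
true_utility = rng.uniform(0, 1, size=(n_agents, n_tasks))
counts = np.zeros((n_agents, n_tasks))
sums = np.zeros((n_agents, n_tasks))

for t in range(1, horizon + 1):
    means = np.divide(sums, counts, out=np.ones_like(sums), where=counts > 0)
    bonus = np.sqrt(2 * np.log(t) / np.maximum(counts, 1))
    ucb = means + np.where(counts > 0, bonus, 1e6)  # force exploration of unseen pairs

    # Hungarian algorithm maximizes total UCB utility (negate for a cost matrix).
    rows, cols = linear_sum_assignment(-ucb)
    for i, j in zip(rows, cols):
        reward = true_utility[i, j] + rng.normal(0, 0.1)  # noisy bandit feedback
        counts[i, j] += 1
        sums[i, j] += reward

learned_rows, learned_cols = linear_sum_assignment(-(sums / np.maximum(counts, 1)))
best_rows, best_cols = linear_sum_assignment(-true_utility)
print("learned assignment:", list(zip(learned_rows.tolist(), learned_cols.tolist())))
print("optimal assignment:", list(zip(best_rows.tolist(), best_cols.tolist())))
```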
Multivariate random forests (MVRFs) are an extension of tree-based ensembles for examining multivariate responses. MVRFs can be particularly helpful when some of the responses exhibit sparse (e.g., zero-inflated) distributions, making it attractive to borrow strength from correlated features. Tree-based algorithms select features using variable importance measures (VIMs), which score each covariate based on the strength of the model's dependence on that variable. In this paper, we develop and propose new VIMs for MVRFs. Specifically, we focus on a variable's ability to achieve split improvement, i.e., the difference in the responses between the left and right nodes obtained after splitting the parent node, for a multivariate response. Our proposed VIMs improve on the default naïve VIM in existing software and allow us to investigate the strength of dependence both globally and on a per-response basis. Our simulation studies show that the proposed VIMs recover the true predictors better than naïve measures. We demonstrate the use of the VIMs for variable selection in two empirical applications: the first uses Amazon Marketplace data to predict Buy Box prices of multiple brands in a category, and the second uses ecology data to predict the co-occurrence of multiple rare bird species. A feature of both data sets is that some outcomes are sparse, exhibiting a substantial proportion of zeros or fixed values. In both cases, using the proposed VIMs for variable screening yields superior predictive accuracy over naïve measures.
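A minimal sketch of the split-improvement quantity that such VIMs aggregate, evaluated for one candidate split of a node with a multivariate response (the exact per-response standardization and aggregation used by the proposed VIMs are defined in the paper):

```python
import numpy as np

def split_improvement(y_parent, left_mask):
    """Per-response reduction in sum of squared errors from one candidate split.

    y_parent: (n, q) multivariate response at the parent node.
    left_mask: boolean array sending each observation to the left or right child.
    Returns a length-q vector; summing it gives an aggregate improvement."""
    def sse(y):
        return ((y - y.mean(axis=0)) ** 2).sum(axis=0) if len(y) else 0.0
    left, right = y_parent[left_mask], y_parent[~left_mask]
    return sse(y_parent) - sse(left) - sse(right)

# Toy node: two responses, the second zero-inflated.
rng = np.random.default_rng(0)
x = rng.uniform(size=200)
y = np.column_stack([x + rng.normal(0, 0.1, 200),
                     (x > 0.7) * rng.poisson(3, 200)])
gain = split_improvement(y, left_mask=(x <= 0.7))
print("per-response split improvement:", gain, " total:", gain.sum())
```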