Objective: Financial fraud is a major concern for organizations across industries; billions of dollars are lost to it every year. Businesses therefore employ data mining techniques to address this persistent and growing problem. This paper reviews research studies conducted over the past decade that used data mining tools to detect financial fraud, and communicates the current trends to academic scholars and industry practitioners.
Method: Various combinations of keywords were used to identify the pertinent articles. The majority of the articles were retrieved from ScienceDirect, but the search spanned other online databases as well (e.g., Emerald, Elsevier, World Scientific, IEEE, and Routledge - Taylor and Francis Group). Our search yielded a sample of 65 relevant articles (58 peer-reviewed journal articles and 7 conference papers). One-fifth of the articles were found in Expert Systems with Applications (ESA), while about one-tenth were found in Decision Support Systems (DSS).
Results: Forty-one data mining techniques were used to detect fraud across different financial applications such as health insurance and credit cards. The logistic regression model appeared to be the leading data mining tool for detecting financial fraud, with 13% usage (a minimal logistic regression sketch appears after this abstract). In general, supervised learning tools have been used more frequently than unsupervised ones. Financial statement fraud and bank fraud are the two largest financial applications investigated in this area, accounting for about 63% (41 of the 65 reviewed articles). The two primary journal outlets for this topic are ESA and DSS.
Conclusion: This review provides a fast and easy-to-use source for both researchers and professionals, classifies financial fraud applications into high-level and detailed-level frameworks, identifies the most significant data mining techniques in this domain, and reveals the countries most exposed to financial fraud.
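As a minimal illustration of the leading technique identified by the review, the sketch below fits a logistic regression classifier to synthetic, imbalanced transaction-like data; the features, class balance, and preprocessing are assumptions and do not reproduce any of the reviewed studies.

```python
# Minimal sketch (not from the reviewed studies): a logistic regression
# fraud classifier on synthetic, imbalanced transaction-like data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Synthetic stand-in for labeled transactions (1 = fraud), heavily imbalanced.
X, y = make_classification(n_samples=20_000, n_features=10, weights=[0.98, 0.02],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# class_weight="balanced" compensates for the rarity of fraud cases.
clf = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test), digits=3))
```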
In this paper, a new five-parameter extended Burr XII model, called the new modified Singh-Maddala (NMSM) model, is developed from the cumulative hazard function of the modified log extended integrated beta hazard (MLEIBH) model. The NMSM density function can be left-skewed, right-skewed, or symmetrical. The Lambert W function is used to study descriptive measures based on quantiles, moments, moments of order statistics, incomplete moments, inequality measures, and the residual life function. Different reliability and uncertainty measures are also established theoretically. The NMSM distribution is characterized via different techniques, and its parameters are estimated using the maximum likelihood method. Simulation studies are performed, with graphical results illustrating the performance of the maximum likelihood estimates (MLEs) of the parameters. The significance and flexibility of the NMSM distribution are tested through different measures by application to two real data sets.
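The NMSM density itself is not reproduced in the abstract, so the following sketch only illustrates the workflow it relies on: maximum likelihood fitting of the ordinary Burr XII (Singh-Maddala) distribution, used here as a stand-in, on simulated data.

```python
# Hedged sketch: fits the ordinary Burr XII (Singh-Maddala) distribution by
# maximum likelihood as a stand-in for the NMSM model, whose density is not given here.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = stats.burr12.rvs(c=2.0, d=3.0, scale=1.5, size=500, random_state=rng)

# MLE of the shape and scale parameters (location fixed at 0, as usual for lifetime data).
c_hat, d_hat, loc_hat, scale_hat = stats.burr12.fit(data, floc=0)
print(c_hat, d_hat, scale_hat)

# Goodness of fit via the Kolmogorov-Smirnov test against the fitted distribution.
print(stats.kstest(data, "burr12", args=(c_hat, d_hat, loc_hat, scale_hat)))
```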
In this paper, we propose Bayesian estimation for the parameter and reliability function of the exponentiated gamma distribution under progressive type-II censored samples. The Bayes estimates of the parameter and the reliability function are derived under the assumption of an independent gamma prior by three different approximation methods, namely Lindley's approximation, the Tierney-Kadane method, and Markov chain Monte Carlo methods. Further, the Bayes estimators are compared with the corresponding maximum likelihood estimators through a simulation study. Finally, a real data set is used to illustrate the above study in a realistic setting.
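As a hedged sketch of the Markov chain Monte Carlo route mentioned above (not the authors' code), the snippet below runs a random-walk Metropolis sampler for the shape parameter θ of the exponentiated gamma distribution, F(x; θ) = [1 - e^{-x}(1 + x)]^θ, under a gamma prior and a small, made-up progressively type-II censored sample (x_i, R_i).

```python
# Hedged sketch: random-walk Metropolis for theta of the exponentiated gamma
# distribution under progressively type-II censored data (x_i, R_i) and a Gamma(a, b) prior.
import numpy as np

def log_post(theta, x, R, a=1.0, b=1.0):
    if theta <= 0:
        return -np.inf
    G = 1.0 - np.exp(-x) * (1.0 + x)            # baseline CDF values
    loglik = (np.log(theta) + np.log(x) - x + (theta - 1.0) * np.log(G)
              + R * np.log1p(-G**theta)).sum()  # R_i removed units contribute S(x_i)^R_i
    return loglik + (a - 1.0) * np.log(theta) - b * theta   # gamma prior (up to a constant)

def metropolis(x, R, n_iter=5000, step=0.2, seed=0):
    rng = np.random.default_rng(seed)
    theta, draws = 1.0, []
    for _ in range(n_iter):
        prop = theta * np.exp(step * rng.standard_normal())   # proposal on the log scale
        log_acc = (log_post(prop, x, R) - log_post(theta, x, R)
                   + np.log(prop) - np.log(theta))            # Jacobian of the log-scale walk
        if np.log(rng.uniform()) < log_acc:
            theta = prop
        draws.append(theta)
    return np.array(draws)

# Illustrative censored sample (values and removal counts are made up).
x = np.array([0.3, 0.7, 1.1, 1.6, 2.4])
R = np.array([1, 0, 2, 0, 1])
post = metropolis(x, R)
print(post[1000:].mean())   # Bayes estimate of theta under squared-error loss
```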
The COVID-19 pandemic has triggered explosive activity in the search for cures, including vaccines against SARS-CoV-2 infection. As of April 30, 2020, there are at least 102 COVID-19 vaccine development programs worldwide, the majority of which are in preclinical development; five are in phase I trials and three are in phase I/II trials. Experts caution against rushing COVID-19 vaccine development, not only because knowledge about SARS-CoV-2 is lacking (albeit rapidly accumulating), but also because vaccine development is a complex, lengthy process with its own rules and timelines. Clinical trials are critically important in vaccine development, usually starting from small-scale phase I trials and gradually moving to the next phases (II and III) after the primary objectives are met. This paper is intended to provide an overview of design considerations for vaccine clinical trials, with a special focus on COVID-19 vaccine development. Given the current pandemic paradigm and the unique features of vaccine development, our recommendations from a statistical design perspective for COVID-19 vaccine trials include: (1) novel trial designs (e.g., master protocols) to expedite the simultaneous evaluation of multiple candidate vaccines or vaccine doses, (2) human challenge studies to accelerate clinical development, (3) adaptive design strategies (e.g., group sequential designs) for early termination due to futility, efficacy, and/or safety, (4) extensive modeling and simulation to characterize and establish long-term efficacy based on early-phase or short-term follow-up data, (5) safety evaluation as one of the primary focuses throughout all phases of clinical trials, (6) leveraging real-world data and evidence in vaccine trial design and analysis to establish vaccine effectiveness, and (7) global collaboration to form a joint development effort for more efficient use of resources and expertise and for data sharing.
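To illustrate recommendation (3), the sketch below simulates a two-stage group sequential comparison of infection rates with an interim efficacy look; all rates, sample sizes, and boundary values are assumptions chosen for illustration, not a recommended protocol.

```python
# Illustrative sketch only: a two-stage group sequential comparison of infection
# rates with an O'Brien-Fleming-like efficacy boundary at the interim look.
import numpy as np

def two_prop_z(x1, n1, x0, n0):
    # Pooled two-proportion z statistic; positive z favours the vaccine arm.
    p1, p0 = x1 / n1, x0 / n0
    p = (x1 + x0) / (n1 + n0)
    se = np.sqrt(p * (1 - p) * (1 / n1 + 1 / n0))
    return (p0 - p1) / se

def simulate(p_ctrl=0.02, p_vacc=0.01, n_stage=3000, z_interim=2.80, z_final=1.98,
             n_sim=2000, seed=0):
    rng = np.random.default_rng(seed)
    stopped_early = rejected = 0
    for _ in range(n_sim):
        x0 = rng.binomial(n_stage, p_ctrl); x1 = rng.binomial(n_stage, p_vacc)
        if two_prop_z(x1, n_stage, x0, n_stage) >= z_interim:   # interim efficacy stop
            stopped_early += 1; rejected += 1; continue
        x0 += rng.binomial(n_stage, p_ctrl); x1 += rng.binomial(n_stage, p_vacc)
        if two_prop_z(x1, 2 * n_stage, x0, 2 * n_stage) >= z_final:
            rejected += 1
    return stopped_early / n_sim, rejected / n_sim

print(simulate())   # (probability of early stop, overall power) under the assumed rates
```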
Abstract: A new generalized two-parameter Lindley distribution, which offers more flexibility in modeling lifetime data, is proposed, and some of its mathematical properties, such as the density function, cumulative distribution function, survival function, hazard rate function, mean residual life function, moment generating function, quantile function, moments, Rényi entropy, and stochastic ordering, are obtained. The maximum likelihood estimation method is used to estimate the parameters of the proposed distribution, and a simulation study is carried out to examine the performance and accuracy of the maximum likelihood estimators of the parameters. Finally, an application of the proposed distribution to a real lifetime data set is presented, and its fit is compared with the fits attained by some existing lifetime distributions.
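The proposed two-parameter density is not given in the abstract, so the following sketch fits the classical one-parameter Lindley distribution by maximum likelihood as a stand-in, using its well-known mixture representation to simulate data.

```python
# Hedged sketch: MLE for the classical Lindley distribution,
#   f(x; theta) = theta**2 / (theta + 1) * (1 + x) * exp(-theta * x),
# as a stand-in for the proposed generalization.
import numpy as np
from scipy.optimize import minimize_scalar

def neg_loglik(theta, x):
    if theta <= 0:
        return np.inf
    return -(2 * np.log(theta) - np.log(theta + 1) + np.log1p(x) - theta * x).sum()

rng = np.random.default_rng(0)
# Lindley variates via the mixture form: Exp(theta) w.p. theta/(theta+1), else Gamma(2, theta).
theta_true = 1.5
mix = rng.uniform(size=1000) < theta_true / (theta_true + 1)
x = np.where(mix, rng.exponential(1 / theta_true, 1000),
             rng.gamma(2.0, 1 / theta_true, 1000))

res = minimize_scalar(neg_loglik, bounds=(1e-6, 50), args=(x,), method="bounded")
print(res.x)   # MLE of theta, close to 1.5 for a sample of this size
```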
Abstract: This paper discusses the selection of the smoothing parameter necessary to implement penalized regression with a nonconcave penalty function. The proposed method can be derived from a Bayesian viewpoint, and the resulting smoothing parameter is guaranteed to satisfy the sufficient conditions for the oracle properties of a one-step estimator. Results from simulations and applications to some real data sets reveal that our proposal works efficiently, especially for discrete outputs.
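For concreteness, the sketch below implements the SCAD thresholding rule of Fan and Li (2001), the canonical nonconcave penalty to which such smoothing-parameter selection applies; the choice of SCAD and of a = 3.7 are assumptions, since the abstract does not name a specific penalty.

```python
# Hedged sketch: the SCAD thresholding rule for an orthonormal design; the tuning
# parameter lam is the smoothing parameter whose selection the paper addresses.
import numpy as np

def scad_threshold(z, lam, a=3.7):
    """Apply the SCAD thresholding rule elementwise to z."""
    z = np.asarray(z, dtype=float)
    out = np.empty_like(z)
    small = np.abs(z) <= 2 * lam                          # soft-thresholding region
    mid = (np.abs(z) > 2 * lam) & (np.abs(z) <= a * lam)  # linear interpolation region
    big = np.abs(z) > a * lam                             # no-shrinkage region
    out[small] = np.sign(z[small]) * np.maximum(np.abs(z[small]) - lam, 0.0)
    out[mid] = ((a - 1) * z[mid] - np.sign(z[mid]) * a * lam) / (a - 2)
    out[big] = z[big]
    return out

print(scad_threshold([0.5, 1.5, 3.0, 10.0], lam=1.0))
```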
Abstract: We analyze the cross-correlation between logarithmic returns of 1108 stocks listed on the Shanghai and Shenzhen stock exchanges of China over the period 2005 to 2010. The results suggest that the estimated distribution of correlation coefficients is right-shifted during tumbles of the Chinese stock market. Because the maximum eigenvalue accounts for a large share of the spectrum, the principal correlation component in the Chinese stock market is dominant, and the other components have only trivial effects on the market condition. Since the elements of the corresponding eigenvector share the same sign, we propose the maximum eigenvalue series as an indicator of collective behavior in the equity market. We provide evidence that the largest eigenvalue series can be used as an effective indicator of the collective behavior of stock returns, which is found to be positively correlated with market volatility. Using time-varying windows, we find that this positive correlation diminishes when market volatility reaches its highest and lowest levels. By defining a stability rate, we show that the collective behavior of stocks tends to be more homogeneous during crises than in regular times. This study has implications for the ongoing discussion of correlation risk.
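A minimal sketch of the eigenvalue-based indicator, using synthetic one-factor returns rather than the 1108-stock sample: the largest eigenvalue of the rolling correlation matrix is tracked alongside average volatility in the same window.

```python
# Minimal sketch (synthetic data): the largest eigenvalue of the rolling return
# correlation matrix as a collective-behaviour indicator, compared with volatility.
import numpy as np

rng = np.random.default_rng(0)
n_days, n_stocks, window = 750, 50, 60
common = rng.standard_normal(n_days)                          # one "market" factor
returns = 0.4 * common[:, None] + rng.standard_normal((n_days, n_stocks))

max_eig, vol = [], []
for t in range(window, n_days):
    R = np.corrcoef(returns[t - window:t], rowvar=False)      # window correlation matrix
    max_eig.append(np.linalg.eigvalsh(R)[-1])                 # largest eigenvalue
    vol.append(returns[t - window:t].std(axis=0).mean())      # average volatility

print(np.corrcoef(max_eig, vol)[0, 1])   # sign of the eigenvalue-volatility relation
```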
Abstract: Recently, He and Zhu (2003) derived an omnibus goodness-of-fit test for linear or nonlinear quantile regression models based on a CUSUM process of the gradient vector, and they suggested using a particular simulation method to determine critical values for their test statistic. But despite the speed of modern computers, execution time can be high. One goal of this note is to suggest a slight modification of their method that eliminates the need for simulations in a collection of important and commonly occurring situations. For a broader range of situations, the modification can be used to determine a critical value as a function of the sample size (n), the number of predictors (q), and the quantile of interest (γ). This is in contrast to the He and Zhu approach, where the critical value is also a function of the observed values of the q predictors. As a partial check on the suggested modification in terms of controlling the Type I error probability, simulations were performed for the same situations considered by He and Zhu, and some additional simulations are reported for a much wider range of situations.
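The sketch below is a simplified CUSUM-type lack-of-fit check for a linear quantile regression, not He and Zhu's exact statistic or the proposed modification: quantile-score residuals are cumulated in the order of a predictor, and a crude simulated reference (which ignores the estimation effect that motivates the paper) supplies an approximate critical value.

```python
# Hedged sketch: a simplified CUSUM-type check of quantile regression fit.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n, gamma = 300, 0.5
x = rng.uniform(-2, 2, n)
y = 1.0 + 2.0 * x + rng.standard_normal(n)        # correctly specified linear model

X = sm.add_constant(x)
fit = sm.QuantReg(y, X).fit(q=gamma)
psi = gamma - (y < fit.predict(X)).astype(float)  # quantile score residuals

order = np.argsort(x)
stat = np.max(np.abs(np.cumsum(psi[order]))) / np.sqrt(n)

# Rough null reference: the same functional applied to independent score-like draws.
null = np.array([np.max(np.abs(np.cumsum(
    rng.choice([gamma, gamma - 1.0], size=n, p=[1 - gamma, gamma])))) / np.sqrt(n)
    for _ in range(2000)])
print(stat, np.quantile(null, 0.95))   # flag lack of fit if stat exceeds the 95% quantile
```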
Abstract: The present article discusses and compares multiple testing procedures (MTPs) for controlling the familywise error rate. Machekano and Hubbard (2006) proposed an empirical Bayes approach, a resampling-based multiple testing procedure that asymptotically controls the familywise error rate. In this paper we provide some additional work on their procedure, and we develop a resampling-based step-down procedure that asymptotically controls the familywise error rate for testing families of one-sided hypotheses. We apply these procedures to making successive comparisons between treatment effects under a simple-order assumption. For example, the treatment means may correspond to a sequence of increasing dose levels of a drug. Using simulations, we demonstrate that the proposed step-down procedure is less conservative than Machekano and Hubbard's procedure. The application of the procedure is illustrated with an example.
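As a generic illustration of a resampling-based step-down procedure (a Westfall-Young-style maxT scheme, not the authors' exact algorithm), the sketch below tests one-sided successive comparisons of ordered dose groups with bootstrap resampling under the null.

```python
# Hedged sketch: bootstrap step-down maxT procedure for one-sided successive
# comparisons of K ordered dose groups, controlling the familywise error rate.
import numpy as np

def stepdown_maxT(groups, n_boot=2000, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    K = len(groups)
    def t_stats(gs):
        # One-sided t statistics for mu_{k+1} - mu_k > 0, k = 1, ..., K-1.
        return np.array([
            (gs[k + 1].mean() - gs[k].mean())
            / np.sqrt(gs[k + 1].var(ddof=1) / len(gs[k + 1]) + gs[k].var(ddof=1) / len(gs[k]))
            for k in range(K - 1)])
    obs = t_stats(groups)
    centered = [g - g.mean() for g in groups]                     # impose the null
    boot = np.array([t_stats([rng.choice(c, size=len(c), replace=True) for c in centered])
                     for _ in range(n_boot)])
    rejected = np.zeros(K - 1, dtype=bool)
    active = np.ones(K - 1, dtype=bool)
    while active.any():
        crit = np.quantile(boot[:, active].max(axis=1), 1 - alpha)  # max over active set
        newly = active & (obs > crit)
        if not newly.any():
            break
        rejected |= newly
        active &= ~newly                                            # step down and repeat
    return rejected

rng = np.random.default_rng(1)
groups = [rng.normal(mu, 1.0, 30) for mu in (0.0, 0.1, 0.8, 0.9)]   # made-up dose means
print(stepdown_maxT(groups))
```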
Abstract: This paper aims to propose a suitable statistical model for the age distribution of prostate cancer detection. Descriptive studies suggest onset of prostate cancer after 37 years of age, with the maximum diagnosis age at around 70 years. The major deficiency of descriptive studies is that their results cannot be generalized to all populations, which usually face non-identical environmental conditions. The proposed model is assessed for suitability through different statistical tools such as the Akaike information criterion, the Kolmogorov-Smirnov distance, the Bayesian information criterion, and the χ2 statistic. The maximum likelihood estimates of the parameters of the proposed model, along with their asymptotic confidence intervals, have been obtained for the considered real data set.
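The proposed model is not specified in the abstract, so the sketch below only illustrates the suitability checks it cites (AIC, BIC, and the Kolmogorov-Smirnov distance) with a log-normal candidate fitted to made-up detection ages.

```python
# Hedged sketch: AIC, BIC, and Kolmogorov-Smirnov checks for a log-normal candidate
# fitted to illustrative (made-up) prostate cancer detection ages.
import numpy as np
from scipy import stats

ages = np.array([52, 58, 61, 63, 65, 66, 68, 69, 70, 70, 71, 72, 74, 76, 79, 83], float)

params = stats.lognorm.fit(ages, floc=0)            # MLE of the log-normal parameters
k = 2                                               # free parameters (shape and scale)
loglik = stats.lognorm.logpdf(ages, *params).sum()

aic = 2 * k - 2 * loglik
bic = k * np.log(len(ages)) - 2 * loglik
ks = stats.kstest(ages, "lognorm", args=params)
print(aic, bic, ks.statistic, ks.pvalue)
```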