Abstract: This paper aims to propose a suitable statistical model for the age distribution of prostate cancer detection. Descriptive studies suggest that the onset of prostate cancer occurs after 37 years of age, with the peak age of diagnosis at around 70 years. The major deficiency of descriptive studies is that their results cannot be generalized to all populations, which usually differ in environmental conditions. The suitability of the proposed model is checked with several statistical tools: the Akaike information criterion, the Bayesian information criterion, the Kolmogorov–Smirnov distance, and the χ² statistic. The maximum likelihood estimates of the parameters of the proposed model, along with their asymptotic confidence intervals, have been obtained for the real data set considered.
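For reference, the selection criteria named above have standard closed forms. Writing k for the number of fitted parameters, n for the sample size, \hat{L} for the maximized likelihood, F_n for the empirical distribution function, and \hat{F} for the fitted distribution function, the model with the smaller criterion value is preferred:

\[ \mathrm{AIC} = 2k - 2\ln\hat{L}, \qquad \mathrm{BIC} = k\ln n - 2\ln\hat{L}, \qquad D_n = \sup_x \bigl| F_n(x) - \hat{F}(x) \bigr|. \]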
Abstract: The development and application of computational data mining techniques in financial fraud detection and business failure prediction has become a popular cross-disciplinary research area in recent times, involving financial economists, forensic accountants, and computational modellers. Some of the computational techniques popularly used in the context of financial fraud detection and business failure prediction can also be effectively applied to the detection of fraudulent insurance claims and can therefore be of immense practical value to the insurance industry. We provide a comparative analysis of the prediction performance of a battery of data mining techniques using real-life automotive insurance fraud data. While the data used in our paper are US-based, the computational techniques we have tested can be adapted and applied to detect similar insurance frauds in other countries where an organized automotive insurance industry exists.
Abstract: Through a series of carefully chosen illustrations from biometry and biomedicine, this note underscores the importance of using appropriate analytical techniques to increase power in statistical modeling and testing. These examples also serve to highlight some of the important recent developments in applied statistics of use to practitioners.
Abstract: Supervised classification of biological samples based on genetic information (e.g., gene expression profiles) is an important problem in biostatistics. In order to find classification rules that are both accurate and interpretable, variable selection is indispensable. This article explores how an assessment of the individual importance of variables (effect size estimation) can be used to perform variable selection. I review recent effect size estimation approaches in the context of linear discriminant analysis (LDA) and propose a new, conceptually simple effect size estimation method that is at the same time computationally efficient. I then show how to use effect sizes to perform variable selection based on the misclassification rate, which is the data-independent expectation of the prediction error. Simulation studies and real data analyses illustrate that the proposed effect size estimation and variable selection methods are competitive. In particular, they lead to both compact and interpretable feature sets. Program files to be used with the statistical software R implementing the variable selection approaches presented in this article are available from my homepage: http://b-klaus.de.
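As a generic illustration of effect-size-based ranking (not necessarily the estimator proposed in the article), in a two-class setting the standardized mean difference of variable j,

\[ \hat{d}_j = \frac{\bar{x}_{1j} - \bar{x}_{2j}}{s_j}, \]

can be computed from the group means \bar{x}_{1j}, \bar{x}_{2j} and a pooled standard deviation s_j; variables are then ranked by |\hat{d}_j| and only the top-ranked ones enter the LDA rule.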
Abstract: As the COVID-19 pandemic has strongly disrupted people’s daily work and life, a great amount of scientific research has been conducted to understand the key characteristics of this new epidemic. In this manuscript, we focus on four crucial epidemic metrics with regard to COVID-19, namely the basic reproduction number, the incubation period, the serial interval, and the epidemic doubling time. We collect relevant studies based on COVID-19 data in China and conduct a meta-analysis to obtain pooled estimates of the four metrics. From the summary results, we conclude that COVID-19 has stronger transmissibility than SARS, implying that stringent public health strategies are necessary.
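For context, the simplest pooled estimate of a metric from study-level estimates \hat{\theta}_i with variances \hat{\sigma}_i^2 is the fixed-effect inverse-variance average (the manuscript's meta-analysis may instead use a random-effects variant, which adds a between-study variance component to each weight):

\[ \hat{\theta} = \frac{\sum_i w_i\, \hat{\theta}_i}{\sum_i w_i}, \qquad w_i = \frac{1}{\hat{\sigma}_i^2}. \]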
Compound distributions gained their importance from the fact that natural factors have compound effects, as in medical, social, and logical experiments. Dubey (1968) introduced the compound Weibull distribution by compounding the Weibull distribution with the gamma distribution. The main aim of this paper is to define a bivariate generalized Burr (compound Weibull) distribution whose marginals have univariate generalized Burr distributions. Several properties of this distribution, such as the marginals, conditional distributions, and product moments, have been discussed. The maximum likelihood estimates of the unknown parameters of this distribution and their approximate variance-covariance matrix have been obtained. Some simulations have been performed to assess the performance of the MLEs. A data analysis has been carried out for illustrative purposes.
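Dubey's construction is a gamma mixture over the Weibull rate parameter: if X given \theta has Weibull density f(x \mid \theta) = c\,\theta\,x^{c-1} e^{-\theta x^c} and \theta follows a gamma distribution with shape k and rate \lambda, integrating \theta out yields a Burr-type marginal:

\[ f(x) = \int_0^\infty c\,\theta\,x^{c-1} e^{-\theta x^c}\, \frac{\lambda^k}{\Gamma(k)}\, \theta^{k-1} e^{-\lambda\theta}\, d\theta = \frac{c\,k\,\lambda^k\, x^{c-1}}{(\lambda + x^c)^{k+1}}, \qquad x > 0, \]

which reduces to the standard Burr XII density when \lambda = 1.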
In this paper, we introduce a new lifetime model, called the Generalized Weibull-Burr XII distribution. We discuss some of its mathematical properties, such as the density, hazard rate function, quantile function, and moments. The maximum likelihood method is used to estimate the model parameters. A simulation study is performed to assess the performance of the maximum likelihood estimators by means of their biases and mean squared errors. Finally, we show that the proposed distribution is very competitive with other classical models by means of an application to a real data set.
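For reference, the Burr XII baseline on which such generalizations build has closed-form distribution and quantile functions (with shape parameters c, k > 0); the exact form of the generalized model is given in the paper itself:

\[ F(x) = 1 - \left(1 + x^c\right)^{-k}, \qquad Q(u) = \left[ (1-u)^{-1/k} - 1 \right]^{1/c}, \quad 0 < u < 1. \]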
Abstract: In this paper, we propose power weighted quantile regression (PWQR), which can effectively reduce the effect of heterogeneity in the conditional densities of the response and improve the efficiency of quantile regression. In addition, this article proves that the proportion of total weight assigned to observations whose actual value is less than the PWQR fitted value is very close to the corresponding quantile level. Finally, this article establishes the relationship between geomagnetic indices and geomagnetically induced currents (GIC). Motivated by the security of power system operation, we construct a GIC risk value table, which is practical to use and provides important guidance for the secure operation of power systems.
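Although the abstract does not define the weights, a weighted quantile regression estimator of this kind generically minimizes a weighted version of the standard check loss (the specific power weights w_i are defined in the paper):

\[ \hat{\beta}(\tau) = \arg\min_{\beta} \sum_{i=1}^{n} w_i\, \rho_\tau\!\left(y_i - x_i^{\top}\beta\right), \qquad \rho_\tau(u) = u\left(\tau - \mathbf{1}\{u < 0\}\right). \]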
Abstract: An individual in a finite population is represented by a random variable whose expectation is linearly composed of explanatory variables and a personal effect. This expectation locates her (his) random variable on a scale when s(he) responds to a questionnaire item or physical instrument. This formulation reinterprets design-based sampling, which represents an individual as a constant waiting to be observed. Retaining constant expectations, however, along with fixed realizations of random variables, preserves and strengthens design-based theory through the Horvitz-Thompson (1952) theorem. This interpretation reaffirms the usual design-based regression estimates, whose normality is seen to be free of any assumptions about the distribution of the outcome variable. It also formulates response error in a way that renders a superpopulation, postulated by model-based sampling, unnecessary. The value of distribution-free regression is illustrated with an analysis of American presidential approval.
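The Horvitz-Thompson (1952) theorem invoked here concerns the classical estimator of a population total: for a sample s drawn with inclusion probabilities \pi_i > 0,

\[ \hat{T} = \sum_{i \in s} \frac{y_i}{\pi_i} \]

is design-unbiased for T = \sum_{i=1}^{N} y_i, with no distributional assumptions on the y_i.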
The statistical modeling of natural disasters is an indispensable tool for extracting information for prevention and the reduction of casualties. The Poisson distribution can reveal the characteristics of a natural disaster. However, this distribution is insufficient for capturing the clustering of natural events and related casualties. The best approach is to use the Neyman type A (NTA) distribution, which accommodates the feature that two or more events may occur in a short time. We obtain some properties of the NTA distribution and suggest that it provides a suitable description for analyzing the distribution of natural disasters and casualties. We support this argument using disaster events, including earthquakes, floods, landslides, forest fires, avalanches, and rock falls, in Turkey between 1900 and 2013. The data strongly support the NTA distribution as the main tool for handling these disaster data. The findings indicate that approximately three earthquakes, fifteen landslides, five floods, six rock falls, six avalanches, and twenty-nine forest fires are expected in a year. The results from this model suggest that the probability of the total number of casualties is highest for earthquakes and lowest for rock falls. This study also finds that the expected number of natural disasters is approximately 64 per year and that the inter-event time between two successive earthquakes is approximately four months. The inter-event time for natural disasters overall is approximately six days in Turkey.
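The NTA distribution is a Poisson cluster (Poisson-stopped Poisson) law with probability generating function \( G(s) = \exp\{\lambda\,(e^{\phi(s-1)} - 1)\} \) and mean \lambda\phi, where \lambda governs the number of clusters and \phi the events per cluster. The reported inter-event times follow from simple rate arithmetic on the paper's own frequency estimates:

\[ \frac{12\ \text{months/year}}{3\ \text{earthquakes/year}} = 4\ \text{months}, \qquad \frac{365\ \text{days/year}}{64\ \text{disasters/year}} \approx 5.7\ \text{days} \approx 6\ \text{days}. \]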