Abstract: The motivation behind this paper is to investigate the use of the Softmax model for classification. We show that the Softmax model is a nonlinear generalization of logistic discrimination that can approximate the posterior probabilities of the classes, an ability that other artificial neural network (ANN) models lack. We also show that the Softmax model is more flexible than logistic discrimination in terms of correct classification. To demonstrate the performance of the Softmax model, a medical data set on thyroid gland state is used. The results indicate that the Softmax model may suffer from overfitting.
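For reference, a minimal sketch (hypothetical names, not the authors' code) of the softmax computation that yields the class posterior probabilities; with two classes it reduces to ordinary logistic discrimination.

```python
import numpy as np

# Minimal sketch: the softmax output layer maps linear class scores to
# estimates of the posterior probabilities P(class k | x).
def softmax_posteriors(X, W, b):
    """X: (n, d) inputs, W: (d, K) weights, b: (K,) biases -> (n, K) posteriors."""
    scores = X @ W + b
    scores -= scores.max(axis=1, keepdims=True)     # numerical stability
    exps = np.exp(scores)
    return exps / exps.sum(axis=1, keepdims=True)

# Toy check: each row of posteriors sums to one.
X = np.array([[1.2, 0.4], [0.3, 2.1]])
W = np.array([[0.5, -0.2, 1.0], [0.1, 0.8, -0.5]])
posteriors = softmax_posteriors(X, W, np.zeros(3))
print(posteriors.sum(axis=1))    # -> [1. 1.]
```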
Abstract: Background: Brass developed a procedure for converting the proportions dead among children ever born, as reported by women of childbearing age, into estimates of the probability of dying before attaining certain exact childhood ages. The method has become very popular in less developed countries where direct mortality estimation is not possible owing to incomplete death registration. However, the estimates of q(x), the probability of dying before age x, obtained by Trussell’s variant of the Brass method are sometimes unrealistic, with q(x) not monotonically increasing in x. Method: State-level child mortality estimates obtained by Trussell’s variant of the Brass method from the 1991 and 2001 Indian census data were made monotonically increasing by logit smoothing. Using two of the smoothed child mortality estimates, an infant mortality estimate is obtained by fitting a two-parameter Weibull survival function. Results: It has been found that in many states and union territories infant mortality rates increased between 1991 and 2001. Cross-checking with the 1991 and 2001 census data on the increase/decrease of the percentage of children who died establishes the reliability of the estimates. Conclusion: We have reason to suspect the trend of declining infant mortality reported by various agencies and researchers.
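The Weibull step admits a short worked illustration (hypothetical q(x) values, not the census estimates): with survival function S(x) = exp(-(x/b)^c), two smoothed estimates q(x1) and q(x2) determine b and c in closed form, and the infant mortality rate is then read off as q(1).

```python
import numpy as np

# Illustrative sketch (not the authors' code): fit a two-parameter Weibull
# survival function S(x) = exp(-(x/b)**c) through two smoothed child
# mortality estimates q(x1), q(x2) and read off the infant mortality rate q(1).
def infant_mortality_from_two_qx(x1, q1, x2, q2):
    # q(x) = 1 - exp(-(x/b)**c)  =>  log(-log(1 - q)) = c*log(x) - c*log(b)
    y1, y2 = np.log(-np.log(1 - q1)), np.log(-np.log(1 - q2))
    c = (y2 - y1) / (np.log(x2) - np.log(x1))        # shape parameter
    b = x1 / np.exp(y1 / c)                          # scale parameter
    return 1 - np.exp(-(1 / b) ** c)                 # q(1) = infant mortality rate

# Example with hypothetical smoothed estimates q(2) and q(5):
print(infant_mortality_from_two_qx(2, 0.080, 5, 0.105))
```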
Abstract: Modeling the Internet has been an active research area over the past ten years. From the “rich get richer” behavior to the “winners don’t take all” property, existing models depend on explicit attributes described in the network. This paper discusses the modeling of non-scale-free network subsets such as bulletin forums. A new evolution mechanism, driven by implicit attributes “hidden” in the network, leads to a slight increase in the page sizes of the front-ranked forums. Because quantifying these implicit attributes is difficult, two potential models are suggested. The first model introduces a content ratio and attaches it to the lognormal model, while the second model partitions the data into groups according to their regional specialties and fits the data within each group by a power-law model. A Taiwan-based bulletin forum is used for illustration, and the data are fitted via four models. Statistical diagnostics show that the two suggested models perform better than the traditional models in data fitting and prediction. In particular, the second model generally performs better than the first.
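As a minimal sketch of the second model's group-wise power-law fitting (hypothetical board sizes, not the paper's data), the code below fits size ~ C * rank^(-alpha) within each regional group by least squares on the log-log scale.

```python
import numpy as np

# Hedged sketch: fit a rank-size power law to each group of forum boards
# separately, via linear regression on the log-log scale.
def fit_power_law(sizes):
    sizes = np.sort(np.asarray(sizes, dtype=float))[::-1]    # descending order
    ranks = np.arange(1, len(sizes) + 1)
    slope, intercept = np.polyfit(np.log(ranks), np.log(sizes), 1)
    return -slope, np.exp(intercept)                          # (alpha, C)

# Hypothetical page counts for two regional groups of boards:
groups = {"north": [5200, 1800, 950, 610, 400], "south": [3100, 1200, 700, 350]}
for name, sizes in groups.items():
    alpha, C = fit_power_law(sizes)
    print(f"{name}: alpha = {alpha:.2f}, C = {C:.0f}")
```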
Abstract: Registrations in epidemiological studies suffer from incompleteness, so the general consensus is to use capture-recapture models. Inclusion of covariates that relate to the capture probabilities has been shown to improve the estimate of population size. The covariates used have to be measured by all the registrations. In this article, we show how multiple imputation can be used in the capture-recapture problem when some lists do not measure some of the covariates, or alternatively when some covariates are unobserved for some individuals. The approach is then applied to data on neural tube defects from the Netherlands.
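A minimal sketch of the idea, on hypothetical two-list data with one partially missing binary covariate (not the article's data or its full imputation model): impute the covariate several times, estimate population size within each covariate stratum with the bias-corrected two-list (Chapman) estimator, and combine the results across imputations.

```python
import numpy as np

# Each row: (on_list_A, on_list_B, covariate), covariate possibly missing (None).
records = [(1, 1, 0), (1, 0, None), (0, 1, 1), (1, 1, 1), (1, 0, 0),
           (0, 1, None), (1, 1, 0), (1, 0, 1), (0, 1, 0), (1, 1, 1)]

def chapman(nA, nB, nAB):
    # Bias-corrected two-list estimator of total population size.
    return (nA + 1) * (nB + 1) / (nAB + 1) - 1

rng = np.random.default_rng(0)
M = 20                                                        # number of imputations
p_cov = np.mean([c for _, _, c in records if c is not None])  # crude imputation model
estimates = []
for _ in range(M):
    imputed = [(a, b, c if c is not None else rng.binomial(1, p_cov))
               for a, b, c in records]
    total = 0.0
    for stratum in (0, 1):   # estimate within covariate strata, then sum
        rows = [(a, b) for a, b, c in imputed if c == stratum]
        nA = sum(a for a, _ in rows); nB = sum(b for _, b in rows)
        nAB = sum(a * b for a, b in rows)
        total += chapman(nA, nB, nAB)
    estimates.append(total)

print("MI estimate of population size:", np.mean(estimates))
```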
Abstract: This paper considers the statistical problems of editing and imputing data of multiple time series generated by repetitive surveys. The case under study is that of the Survey of Cattle Slaughter in Mexico’s Municipal Abattoirs. The proposed procedure consists of two phases: first, the data of each abattoir are edited to correct gross inconsistencies; second, the missing data are imputed by means of restricted forecasting. This method uses all the historical and current information available for the abattoir, together with multiple time series models from which efficient estimates of the missing data are obtained. Some empirical examples are shown to illustrate the usefulness of the method in practice.
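The restricted-forecasting step can be sketched generically (hypothetical numbers, not the survey data): given an unrestricted forecast of the missing cells, its error covariance, and linear restrictions built from the information that is actually available, the restricted forecast is the standard best linear unbiased adjustment of the unrestricted one.

```python
import numpy as np

# Minimal sketch of the restricted-forecast adjustment (generic formula, not
# the paper's full procedure): y_hat is the unrestricted forecast of the
# missing cells, Sigma its error covariance, and C y = r the restrictions
# (e.g. a known municipal total).
def restricted_forecast(y_hat, Sigma, C, r):
    gain = Sigma @ C.T @ np.linalg.inv(C @ Sigma @ C.T)
    return y_hat + gain @ (r - C @ y_hat)

# Hypothetical example: three missing monthly figures whose sum is known to be 100.
y_hat = np.array([30.0, 40.0, 20.0])          # unrestricted model forecasts
Sigma = np.diag([4.0, 9.0, 1.0])              # forecast error covariance
C = np.array([[1.0, 1.0, 1.0]]); r = np.array([100.0])
print(restricted_forecast(y_hat, Sigma, C, r))   # adjusted so the values sum to 100
```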
This paper presents an empirical study of a recently compiled workforce analytics data set modeling employment outcomes of Engineering students. The contributions reported in this paper won the data challenge of the ACM IKDD 2016 Conference on Data Science. Two problems are addressed - regression using heterogeneous information types, and the extraction of insights/trends from the data to make recommendations; both goals are supported by a range of visualizations. Whereas the data set is specific to one nation, the underlying techniques and visualization methods are generally applicable. Gaussian processes are proposed to model and predict salary as a function of heterogeneous independent attributes. The key novelties the GP approach brings to the domain of workforce analytics are (a) a statistically sound, data-dependent notion of prediction uncertainty; (b) automatic relevance determination of the various independent attributes to the dependent variable (salary); (c) seamless incorporation of both numeric and string attributes within the same regression framework without dichotomization - specifically, string attributes include single-word categorical attributes (e.g., gender), nominal attributes (e.g., college tier), and multi-word attributes (e.g., specialization); and (d) treatment of all data as correlated when making predictions. Insights from both the predictive modeling and the data analysis were used to suggest factors that, if improved, might lead to better starting salaries for Engineering students. A range of visualization techniques was used to extract key employment patterns from the data.
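A hedged sketch of the modeling idea (toy features and values, not the authors' data or code): a Gaussian process with an ARD squared-exponential kernel over numeric attributes plus a simple match/no-match kernel over a string attribute, so the categorical information enters without one-hot dichotomization and each prediction carries its own variance.

```python
import numpy as np

# ARD kernel over numeric features (one lengthscale each) plus a categorical
# match kernel over a string feature; the sum is again a valid kernel.
def kernel(Xnum1, Xnum2, cat1, cat2, lengthscales, sig_num, sig_cat):
    d = (Xnum1[:, None, :] - Xnum2[None, :, :]) / lengthscales
    k_num = sig_num**2 * np.exp(-0.5 * np.sum(d**2, axis=-1))
    k_cat = sig_cat**2 * (cat1[:, None] == cat2[None, :])
    return k_num + k_cat

# Toy training data: (GPA, test score), college tier string, salary (hypothetical units).
Xnum = np.array([[8.1, 620.], [6.5, 480.], [9.0, 700.], [7.2, 550.]])
tier = np.array(["tier1", "tier2", "tier1", "tier2"])
y = np.array([650., 320., 810., 400.])

ls = np.array([1.0, 100.0])        # one lengthscale per numeric feature (ARD)
K = kernel(Xnum, Xnum, tier, tier, ls, sig_num=200., sig_cat=100.)
K += 25.0**2 * np.eye(len(y))      # observation noise
alpha = np.linalg.solve(K, y - y.mean())

# Predictive mean and standard deviation for a new candidate.
Xs, ts = np.array([[7.8, 600.]]), np.array(["tier1"])
Ks = kernel(Xs, Xnum, ts, tier, ls, 200., 100.)
Kss = kernel(Xs, Xs, ts, ts, ls, 200., 100.) + 25.0**2
mean = y.mean() + Ks @ alpha
var = Kss - Ks @ np.linalg.solve(K, Ks.T)
print(mean.item(), np.sqrt(var).item())
```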
Objective: Financial fraud has been a major concern for organizations across industries; billions of dollars are lost yearly to such fraud, so businesses employ data mining techniques to address this continuing and growing problem. This paper reviews research studies conducted over the past decade to detect financial fraud using data mining tools and communicates the current trends to academic scholars and industry practitioners.
Method: Various combinations of keywords were used to identify the pertinent articles. The majority of the articles were retrieved from ScienceDirect, but the search spanned other online databases (e.g., Emerald, Elsevier, World Scientific, IEEE, and Routledge - Taylor and Francis Group). Our search yielded a sample of 65 relevant articles (58 peer-reviewed journal articles and 7 conference papers). One-fifth of the articles were found in Expert Systems with Applications (ESA), while about one-tenth were found in Decision Support Systems (DSS).
Results: Forty-one data mining techniques were used to detect fraud across different financial applications such as health insurance and credit cards. Logistic regression appeared to be the leading data mining tool for detecting financial fraud, with 13% usage. In general, supervised learning tools have been used more frequently than unsupervised ones. Financial statement fraud and bank fraud are the two largest financial applications investigated in this area, accounting for about 63%, or 41 of the 65 reviewed articles. The two primary journal outlets for this topic are ESA and DSS.
Conclusion: This review provides a fast and easy-to-use source for both researchers and professionals, classifies financial fraud applications into a high-level and a detailed-level framework, identifies the most significant data mining techniques in this domain, and reveals the countries most exposed to financial fraud.
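As a concrete point of reference, the sketch below (entirely synthetic data, not drawn from any reviewed study) shows how logistic regression, the most frequently used technique among the reviewed articles, scores transactions by their estimated probability of being fraudulent.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic transactions with hypothetical features: amount, hour of day,
# and a foreign-merchant flag; fraud is rare and more likely for large or
# foreign transactions.
rng = np.random.default_rng(1)
n = 1000
X = np.column_stack([
    rng.exponential(100, n),          # transaction amount
    rng.integers(0, 24, n),           # hour of day
    rng.binomial(1, 0.1, n),          # foreign merchant flag
])
logit = -4 + 0.01 * X[:, 0] + 1.5 * X[:, 2]
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

model = LogisticRegression(class_weight="balanced").fit(X, y)
scores = model.predict_proba(X)[:, 1]            # fraud probability per transaction
print("flagged:", int((scores > 0.5).sum()), "of", n)
```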
In this paper, a new five-parameter extended Burr XII model called the new modified Singh-Maddala (NMSM) distribution is developed from the cumulative hazard function of the modified log extended integrated beta hazard (MLEIBH) model. The NMSM density function can be left-skewed, right-skewed, or symmetrical. The Lambert W function is used to study descriptive measures based on quantiles, moments, moments of order statistics, incomplete moments, inequality measures, and the residual life function. Different reliability and uncertainty measures are also established theoretically. The NMSM distribution is characterized via different techniques, and its parameters are estimated using the maximum likelihood method. Simulation studies are performed, with graphical results, to illustrate the performance of the maximum likelihood estimates (MLEs) of the parameters. The significance and flexibility of the NMSM distribution are demonstrated through different measures by application to two real data sets.
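The NMSM density itself is not reproduced in this abstract, so as a hedged illustration of the maximum likelihood step only, the sketch below fits the baseline Burr XII (Singh-Maddala) distribution that the NMSM extends, using scipy's burr12 on simulated data; it is a stand-in, not the NMSM model.

```python
import numpy as np
from scipy import stats

# Simulate from the baseline Burr XII (Singh-Maddala) distribution and recover
# its parameters by maximum likelihood; the NMSM extension would replace the
# density used here.
rng = np.random.default_rng(2)
data = stats.burr12.rvs(c=2.0, d=1.5, scale=3.0, size=500, random_state=rng)

c_hat, d_hat, loc_hat, scale_hat = stats.burr12.fit(data, floc=0)   # MLEs
loglik = np.sum(stats.burr12.logpdf(data, c_hat, d_hat, loc_hat, scale_hat))
print(f"c = {c_hat:.2f}, d = {d_hat:.2f}, scale = {scale_hat:.2f}, loglik = {loglik:.1f}")
```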
In this paper, we propose Bayesian estimation of the parameter and reliability function of the exponentiated gamma distribution under progressively type-II censored samples. The Bayes estimates of the parameter and reliability function are derived under the assumption of an independent gamma prior by three different approximation methods, namely Lindley’s approximation, the Tierney-Kadane method, and Markov chain Monte Carlo. Further, the Bayes estimators are compared with the corresponding maximum likelihood estimators through a simulation study. Finally, a real data set is used to illustrate the study in a realistic setting.
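As a hedged sketch of the MCMC alternative, the code below assumes a common one-parameter exponentiated gamma parametrization, F(x; theta) = (1 - exp(-x)(1 + x))^theta, which may differ from the paper's, together with a Gamma(a, b) prior on theta and hypothetical progressively type-II censored data, and samples the posterior with a random-walk Metropolis algorithm.

```python
import numpy as np

# Log-posterior under progressive type-II censoring: the likelihood is
# prod f(x_i; theta) * (1 - F(x_i; theta))**R_i, with R_i units removed at the
# i-th failure, plus a Gamma(a, b) prior on theta.
def log_posterior(theta, x, R, a=1.0, b=1.0):
    if theta <= 0:
        return -np.inf
    G = 1 - np.exp(-x) * (1 + x)                        # baseline CDF
    logf = np.log(theta) + np.log(x) - x + (theta - 1) * np.log(G)
    logS = np.log1p(-G**theta)                          # log survival
    log_prior = (a - 1) * np.log(theta) - b * theta
    return np.sum(logf + R * logS) + log_prior

rng = np.random.default_rng(3)
x = np.array([0.3, 0.7, 1.1, 1.6, 2.4])   # hypothetical observed failure times
R = np.array([1, 0, 2, 0, 3])             # hypothetical progressive removals

theta, draws = 1.0, []
for _ in range(5000):
    prop = theta + 0.3 * rng.standard_normal()          # random-walk proposal
    if np.log(rng.uniform()) < log_posterior(prop, x, R) - log_posterior(theta, x, R):
        theta = prop
    draws.append(theta)

post = np.array(draws[1000:])                            # drop burn-in
print("posterior mean of theta:", post.mean())
print("posterior mean reliability at t=1:", np.mean(1 - (1 - 2 * np.exp(-1))**post))
```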