Journal of Data Science

Search results 892

Topic Model Kernel Classification with Probabilistically Reduced Features

Vu Nguyen, Dinh Phung, Svetha Venkatesh

https://doi.org/10.6339/JDS.201504_13(2).0006

Pub. online: 4 Aug 2022 Type: Research Article

Open Access

Journal: Journal of Data Science Volume 13, Issue 2 (2015), pp. 323–340

Abstract

Probabilistic topic models have become a standard in modern machine learning to deal with a wide range of applications. Representing data by dimensional reduction of mixture proportions extracted from topic models is not only richer in semantic interpretation, but could also be informative for classification tasks. In this paper, we describe the Topic Model Kernel (TMK), a topic-based kernel for Support Vector Machine classification on data being processed by probabilistic topic models. The applicability of our proposed kernel is demonstrated in several classification tasks with real-world datasets. TMK outperforms existing kernels on the distributional features and gives comparable results on non-probabilistic data types.

Application of One Sided t-tests and a Generalized Experiment Wise Error Rate to High-Density Oligonucleotide Microarray Experiments: An Example Using Arabidopsis

W. M. Muir, J. Romero-Severson, S.D. Rider Jr., et al. (5 authors)

https://doi.org/10.6339/JDS.2006.04(3).270

Pub. online: 4 Aug 2022 Type: Research Article

Open Access

Journal: Journal of Data Science Volume 4, Issue 3 (2006), pp. 323–341

Abstract

Motivation: A formidable challenge in the analysis of microarray data is the identification of those genes that exhibit differential expression. The objectives of this research were to examine the utility of simple ANOVA, one-sided t-tests, natural log transformation, and a generalized experiment-wise error rate methodology for analysis of such experiments. As a test case, we analyzed an Affymetrix GeneChip microarray experiment designed to test for the effect of a CHD3 chromatin remodeling factor, PICKLE, and an inhibitor of the plant hormone gibberellin (GA), on the expression of 8256 Arabidopsis thaliana genes. Results: The GFWER(k) is defined as the probability of rejecting k or more true null hypotheses at a given p level. Computing probabilities by GFWER(k) was shown to be simple to apply and, depending on the value of k, can greatly increase power. A k value as small as 2 or 3 was concluded to be adequate for large or small experiments respectively. A one-sided t-test along with GFWER(2) = 0.05 identified 43 genes as exhibiting PICKLE-dependent expression. Expression of all 43 genes was re-examined by qRT-PCR, of which 36 (83.7%) were confirmed to exhibit PICKLE-dependent expression.
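
As a rough illustration of the GFWER(k) definition in the abstract above, the sketch below computes the probability of k or more false rejections. It adds an assumption that the abstract does not state: the m true null hypotheses are tested independently, each at per-test level p, so the number of false rejections is Binomial(m, p). The function name `gfwer` is hypothetical and is not taken from the paper.

```python
from math import comb


def gfwer(k: int, m: int, p: float) -> float:
    """P(at least k false rejections) among m true nulls.

    Assumption (not from the paper): the m tests are independent and each
    rejects a true null with probability p, so the count is Binomial(m, p).
    """
    if k <= 0:
        return 1.0
    # Probability of fewer than k false rejections (binomial CDF at k - 1).
    below_k = sum(
        comb(m, i) * p**i * (1 - p) ** (m - i) for i in range(min(k, m + 1))
    )
    # Guard against tiny negative values caused by floating-point rounding.
    return max(0.0, 1.0 - below_k)
```

For fixed m and p, `gfwer(2, m, p)` is smaller than `gfwer(1, m, p)`. Under this simplified model, tolerating k > 1 false rejections therefore allows a less stringent per-test level for the same overall error rate, which is the source of the power gain the abstract describes.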

Regression Analysis of Collinear Data using r-k Class Estimator: Socio-Economic and Demographic Factors Affecting the Total Fertility Rate (TFR) in India

Piyush Kant Rai, Sarla Pareek, Hemlata Joshi

https://doi.org/10.6339/JDS.2013.11(2).1030

Pub. online: 4 Aug 2022 Type: Research Article

Open Access

Journal: Journal of Data Science Volume 11, Issue 2 (2013), pp. 323–342

Abstract

A basic assumption of the general linear regression model is that there is no correlation (or no multicollinearity) between the explanatory variables. When this assumption is not satisfied, the least squares estimators have large variances, become unstable, and may have the wrong sign. Therefore, we resort to biased regression methods, which stabilize the parameter estimates. Ridge regression (RR) and principal component regression (PCR) are two of the most popular biased regression methods that can be used in the case of multicollinearity. But the r-k class estimator, which combines the RR estimator and the PCR estimator into a single estimator, gives better estimates of the regression coefficients than either the RR estimator or the PCR estimator. This paper explores the multiple regression technique using the r-k class estimator between TFR and other socio-economic and demographic variables; the data are taken from the National Family Health Survey-III (NFHS-III) and cover 29 states of India. The analysis shows that use of contraceptive devices has the greatest impact on fertility rate, followed by maternal care, use of improved water, female age at marriage and spacing between births.
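
For orientation, the r-k class estimator is commonly written as below. The abstract does not give the formula, so this is the standard form from the literature, stated here as an assumption. The notation is also an assumption: $X$ is the design matrix with $p$ explanatory variables, $y$ is the response, $T_r$ holds the eigenvectors of $X^{\top}X$ belonging to its $r$ largest eigenvalues, and $k \ge 0$ is the ridge parameter.

```latex
\hat{\beta}_r(k) = T_r \left( T_r^{\top} X^{\top} X \, T_r + k I_r \right)^{-1} T_r^{\top} X^{\top} y
```

Setting $k = 0$ gives the PCR estimator, taking $r = p$ gives the ridge estimator, and $r = p$ with $k = 0$ recovers ordinary least squares, which is the sense in which the estimator combines RR and PCR.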

Variable Selection by sNML Criterion in Logistic Regression with an Application to a Risk-Adjustment Model for Hip Fracture Mortality

Antti Liski, Ioan Tăbuş, Reijo Sund

https://doi.org/10.6339/JDS.2012.10(2).739

Pub. online: 4 Aug 2022 Type: Research Article

Open Access

Journal: Journal of Data Science Volume 10, Issue 2 (2012), pp. 321–343

Abstract

When comparing the performance of health care providers, it is important to rule out the effect of factors that have an unwanted influence on the performance indicator (e.g., mortality). In register-based studies, randomization is out of the question. We develop a risk adjustment model for hip fracture mortality in Finland by using logistic regression. The model is used to study the impact of the length of the register follow-up period on adjusting the performance indicator for a set of comorbidities. The comorbidities are congestive heart failure, cancer and diabetes. We also introduce an implementation of the minimum description length (MDL) principle for model selection in logistic regression. This is done by using the normalized maximum likelihood (NML) technique. The computational burden becomes too heavy to apply the usual NML criterion, and therefore a technique based on the idea of sequentially normalized maximum likelihood (sNML) is introduced. The sNML criterion can be evaluated efficiently even for large models with large amounts of data. The results given by sNML are then compared to the corresponding results given by the traditional AIC and BIC model selection criteria. All three comorbidities clearly have an effect on hip fracture mortality. The results indicate that for congestive heart failure all available medical history should be used, while for cancer it is enough to use only records from half a year before the fracture. For diabetes the choice of time period is not as clear, but using records from three years before the fracture seems to be a reasonable choice.

Extended Poisson-Frechet Distribution: Mathematical Properties and Applications to Survival and Repair Times

M.S. Hamed

https://doi.org/10.6339/JDS.202004_18(2).0006

Pub. online: 4 Aug 2022 Type: Research Article

Open Access

Journal: Journal of Data Science Volume 18, Issue 2 (2020), pp. 319–342

Abstract

In this paper, a new four-parameter zero-truncated Poisson-Frechet distribution is defined and studied. Various structural mathematical properties of the proposed model, including ordinary moments, incomplete moments, generating functions, order statistics, and residual and reversed residual life functions, are investigated. The maximum likelihood method is used to estimate the model parameters. We assess the performance of the maximum likelihood method by means of a numerical simulation study. The new distribution is applied to two real data sets to illustrate its flexibility empirically.

Beta Linear Failure Rate Geometric Distribution with Applications

Broderick O. Oluyede, Ibrahim Elbatal, Shujiao Huang

https://doi.org/10.6339/JDS.201604_14(2).0006

Pub. online: 4 Aug 2022 Type: Research Article

Open Access

Journal: Journal of Data Science Volume 14, Issue 2 (2016), pp. 317–346

Abstract

This paper introduces the beta linear failure rate geometric (BLFRG) distribution, which contains a number of distributions including the exponentiated linear failure rate geometric, linear failure rate geometric, linear failure rate, exponential geometric, Rayleigh geometric, Rayleigh and exponential distributions as special cases. The model further generalizes the linear failure rate distribution. A comprehensive investigation of the model properties, including moments, conditional moments, deviations, Lorenz and Bonferroni curves and entropy, is presented. Estimates of model parameters are given. Real data examples are presented to illustrate the usefulness and applicability of the distribution.

Stochastic Diffusion Modeling of Degradation Data

Sheng-Tsaing Tseng, Chien-Yu Peng

https://doi.org/10.6339/JDS.2007.05(3).351

Pub. online: 4 Aug 2022 Type: Research Article

Open Access

Journal: Journal of Data Science Volume 5, Issue 3 (2007), pp. 315–333

Abstract

Accelerated degradation tests (ADTs) can provide timely reliability information about a product. Hence ADTs have been widely used to assess the lifetime distribution of highly reliable products. In order to properly predict the lifetime distribution, modeling the product’s degradation path plays a key role in a degradation analysis. In this paper, we use a stochastic diffusion process to describe the product’s degradation path, and a recursive formula for the product’s lifetime distribution is obtained by using the first passage time (FPT) of its degradation path. In addition, two approximate formulas for the product’s mean-time-to-failure (MTTF) and median life (B50) are given. Finally, we extend the proposed method to the case of ADT, and a real LED data set is used to illustrate the proposed procedure. The results demonstrate that the proposed method performs well for LED lifetime prediction.

Robust Ancova: Heteroscedastic Confidence Intervals that Have Some Specified Simultaneous Probability Coverage

Rand R. Wilcox

https://doi.org/10.6339/JDS.201704_15(2).0008

Pub. online: 4 Aug 2022 Type: Research Article

Open Access

Journal: Journal of Data Science Volume 15, Issue 2 (2017), pp. 313–328

Abstract

The paper deals with robust ANCOVA when there are one or two covariates. Let $M_j(Y \mid X) = \beta_{0j} + \beta_{1j} X_1 + \beta_{2j} X_2$ be some conditional measure of location associated with the random variable $Y$, given $X$, where $\beta_{0j}$, $\beta_{1j}$ and $\beta_{2j}$ are unknown parameters. A basic goal is testing the hypothesis $H_0: M_1(Y \mid X) = M_2(Y \mid X)$. A classic ANCOVA method is aimed at addressing this goal, but it is well known that violating the underlying assumptions (normality, parallel regression lines and two types of homoscedasticity) creates serious practical concerns. Methods are available for dealing with heteroscedasticity and nonnormality, and there are well-known techniques for controlling the probability of one or more Type I errors. But some practical concerns remain, which are reviewed in the paper. An alternative approach is suggested and found to have a distinct power advantage.

A New Analytic Framework for Moderation Analysis — Moving Beyond Analytic Interactions

Wan Tang, Qin Yu, Paul Crits-Christoph, et al. (4 authors)

https://doi.org/10.6339/JDS.2009.07(3).462

Pub. online: 4 Aug 2022 Type: Research Article

Open Access

Journal: Journal of Data Science Volume 7, Issue 3 (2009), pp. 313–329

Abstract

Conceptually, a moderator is a variable that modifies the effect of a predictor on a response. Analytically, the common approach in most moderation analyses is to add analytic interactions involving the predictor and moderator in the form of cross-variable products and to test the significance of such terms. The narrow scope of such a procedure is inconsistent with the broader conceptual definition of moderation, leading to confusion in the interpretation of study findings. In this paper, we develop a new approach to the analytic procedure that is consistent with the concept of moderation. The proposed framework defines moderation as a process that modifies an existing relationship between the predictor and the outcome, rather than simply a test of a predictor-by-moderator interaction. The approach is illustrated with data from a real study.

Analysis of Bank Failure Using Published Financial Statements: The Case of Indonesia (Part 2)

Loso Judijanto, E. V. Khmaladze

https://doi.org/10.6339/JDS.2003.01(3).126

Pub. online: 4 Aug 2022 Type: Research Article

Open Access

Journal: Journal of Data Science Volume 1, Issue 3 (2003), pp. 313–336

Journal of data science

Online ISSN: 1683-8602
Print ISSN: 1680-743X

Contact us

JDS@ruc.edu.cn
No. 59 Zhongguancun Street, Haidian District Beijing, 100872, P.R. China