Abstract: Conservation of artifacts is a major concern of museum curators. Light, humidity, and air pollution are responsible for the deterioration of many artifacts and materials. We present here an exploratory analysis of humidity and temperature data that were collected to document the environment of the Bowdoin College Museum of Art, located in the Walker Art Building at Bowdoin College. As a result of this study, funds are being sought to install a climate control system.
Abstract: In Bayesian analysis of mortality rates it is standard practice to present the posterior mean rates in a choropleth map, a stepped statistical surface identified by colored or shaded areas. A natural objection against the posterior mean map is that it may not be the “best” representation of the mortality rates. One should really present the map that has the highest posterior density over the ensemble of areas in the map (i.e., the coordinates that maximize the joint posterior density of the mortality rates). Thus, the posterior modal map maximizes the joint posterior density of the mortality rates. We apply a Poisson regression model, a Bayesian hierarchical model that has been used to study mortality data and other rare events when there are occurrences from many areas. The model provides convenient Rao-Blackwellized estimators of the mortality rates. Our method enables us to construct the posterior modal map of mortality data from chronic obstructive pulmonary disease (COPD) in the continental United States. We show how to fit the Poisson regression model using Markov chain Monte Carlo methods (i.e., the Metropolis-Hastings sampler); both the posterior modal map and the posterior mean map are then obtained by an output analysis of the Metropolis-Hastings sampler. The COPD data are used to provide an empirical comparison of these two maps. As expected, we found important differences between the two maps, and we recommend that the posterior modal map be used.
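The following is a minimal sketch, not the paper's model: a random-walk Metropolis-Hastings sampler for area-level Poisson rates with independent Gamma priors, showing how a posterior mean map and an (approximate) posterior modal map can both be read off the sampler output. The toy data, the Gamma hyperparameters, and the proposal scale are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
y = np.array([12, 5, 30, 8])                   # toy COPD death counts, one per area
n = np.array([10_000, 4_000, 25_000, 9_000])   # toy populations at risk
a, b = 2.0, 2_000.0                            # assumed Gamma(a, b) hyperparameters

def log_post(lam):
    """Joint log posterior of the area rates, up to an additive constant."""
    return np.sum(y * np.log(lam) - n * lam + (a - 1.0) * np.log(lam) - b * lam)

lam = y / n + 1e-6                             # start at the crude rates
draws = []
for _ in range(20_000):
    prop = lam * np.exp(0.1 * rng.normal(size=lam.size))        # log-scale random walk
    # acceptance ratio includes the Jacobian of the log-normal proposal
    log_acc = log_post(prop) - log_post(lam) + np.sum(np.log(prop / lam))
    if np.log(rng.uniform()) < log_acc:
        lam = prop
    draws.append(lam)
draws = np.array(draws[5_000:])                                 # discard burn-in

posterior_mean_map = draws.mean(axis=0)                         # one rate per area
posterior_modal_map = draws[np.argmax([log_post(d) for d in draws])]  # approximate joint mode
print(posterior_mean_map, posterior_modal_map)
```

The modal map here is approximated by the retained draw with the highest joint log posterior, which is one simple form of output analysis; the paper's exact model and estimators are not reproduced.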
The frequency of earthquakes worldwide has increased tremendously in recent years. This paper outlines an evaluation of Cumulative Sum (CUSUM) and Exponentially Weighted Moving Average (EWMA) charting techniques to determine whether the frequency of earthquakes in the world is unusual. Worldwide earthquake frequencies are considered for the period 1973 to 2016. Because the data are autocorrelated, regular control charts such as the Shewhart control chart cannot be used to detect unusual earthquake frequencies. An approach that has proved useful in dealing with autocorrelated data is to fit a time series model, such as an Autoregressive Integrated Moving Average (ARIMA) model, directly to the data and apply control charts to the residuals. The EWMA and CUSUM control charts detected unusual earthquake frequencies in the years 2012 and 2013, which are statistically out of control.
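A minimal sketch of the charting idea, assuming toy annual counts and an ARIMA(1, 1, 1) specification (the paper's fitted order is not reproduced here): fit an ARIMA model to the earthquake counts and apply an EWMA control chart to the residuals, which are approximately uncorrelated.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(1)
years = np.arange(1973, 2017)
counts = pd.Series(rng.poisson(120, size=years.size), index=years)  # toy annual counts

fit = ARIMA(counts.to_numpy(), order=(1, 1, 1)).fit()
resid = pd.Series(fit.resid, index=years).iloc[1:]   # drop the initialization residual

lam = 0.2                                            # EWMA smoothing constant
z = resid.ewm(alpha=lam, adjust=False).mean()        # EWMA of the residuals
sigma = resid.std(ddof=1)
limit = 3.0 * sigma * np.sqrt(lam / (2.0 - lam))     # steady-state EWMA control limits
flagged = z.index[np.abs(z - resid.mean()) > limit]  # years signalled as out of control
print(flagged.tolist())
```

A CUSUM chart would be applied to the same residual series in an analogous way; only the EWMA version is sketched here.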
Abstract: Motivated by a situation encountered in the Well Elderly 2 study, the paper considers the problem of robust multiple comparisons based on K independent tests associated with 2K independent groups. A simple strategy is to use an extension of Dunnett’s T3 procedure, which is designed to control the probability of one or more Type I errors. However, this method and related techniques fail to take into account the overall pattern of p-values when making decisions about which hypotheses should be rejected. The paper suggests a multiple comparison procedure that does take the overall pattern into account and then describes general situations where this alternative approach makes a practical difference in terms of both power and the probability of one or more Type I errors. For reasons summarized in the paper, the focus is on 20% trimmed means, but in principle the method considered here is relevant to any situation where the Type I error probability of the individual tests can be controlled reasonably well.
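A minimal sketch, assuming illustrative data: each of the K comparisons is carried out with Yuen's test on 20% trimmed means (SciPy's trimmed ttest_ind), and the resulting p-values are then adjusted to control the probability of one or more Type I errors. The Hochberg step-up adjustment below is a simple stand-in; it is not the paper's procedure and not Dunnett's T3.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
# K = 4 pairs of independent groups (2K = 8 groups), illustrative data
pairs = [(rng.normal(0.0, 1.0, 40), rng.normal(0.5, 1.0, 40)) for _ in range(4)]

# Yuen's test on 20% trimmed means for each pair
pvals = np.array([
    stats.ttest_ind(x, y, equal_var=False, trim=0.2).pvalue
    for x, y in pairs
])

# Hochberg step-up adjusted p-values (an illustrative stand-in, not the paper's method)
order = np.argsort(pvals)[::-1]                       # largest p-value first
adj = np.minimum.accumulate(pvals[order] * np.arange(1, pvals.size + 1))
adjusted = np.empty_like(pvals)
adjusted[order] = np.clip(adj, 0.0, 1.0)
print(pvals.round(4), adjusted.round(4))
```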
Abstract: In this paper, we propose a nonparametric approach using Dirichlet processes (DP) as a class of prior distributions for the distribution G of the random effects in the hierarchical generalized linear mixed model (GLMM). The support of the prior distribution (and of the posterior distribution) is large, allowing for a wide range of shapes for G. This provides great flexibility in estimating G and therefore produces a more flexible estimator than the parametric analysis does. We present computational strategies for the posterior computations involved in DP modeling. The proposed method is illustrated with real examples as well as simulations.
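A minimal sketch of the kind of prior placed on G, assuming a standard Normal base measure and an illustrative concentration parameter: a truncated stick-breaking draw from the Dirichlet process, from which random effects can then be sampled. This is not the paper's posterior computation strategy.

```python
import numpy as np

def draw_dp(alpha, truncation, rng):
    """Truncated stick-breaking draw from DP(alpha, G0) with a N(0, 1) base measure G0."""
    v = rng.beta(1.0, alpha, truncation)                       # stick-breaking proportions
    w = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))  # mixture weights
    atoms = rng.normal(0.0, 1.0, truncation)                   # atom locations drawn from G0
    return w / w.sum(), atoms                                  # renormalize the truncated weights

rng = np.random.default_rng(3)
w, atoms = draw_dp(alpha=2.0, truncation=100, rng=rng)
# G is the discrete distribution sum_k w_k * delta_{atom_k}; random effects are draws from G
b = rng.choice(atoms, size=10, p=w)
print(b)
```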
Abstract: An analysis of air quality data is provided for the municipal area of Taranto, which is characterized by high environmental risk due to the massive presence of industrial sites with activities of elevated environmental impact. The present study focuses on particulate matter as measured by PM10 concentrations. Preliminary analysis involved addressing several data problems, mainly: (i) imputation techniques were considered to cope with the large number of missing data, due to both the different working periods of groups of monitoring stations and occasional malfunction of PM10 sensors; (ii) because the three monitoring networks use different validation techniques, a calibration procedure was devised to allow for data comparability. Missing data imputation and calibration were addressed by three alternative procedures sharing a leave-one-out type mechanism and based on ad hoc exploratory tools and on recursive Bayesian estimation and prediction of spatial linear mixed effects models. The three procedures are introduced by motivating issues and compared in terms of performance.
We introduce a new family of distributions based on a generalized Burr III generator, called the Modified Burr III G family, and study some of its mathematical properties. Its density function can be bell-shaped, left-skewed, right-skewed, bathtub, J-shaped, or reversed-J-shaped. Its hazard rate can be increasing, decreasing, bathtub, upside-down bathtub, J-shaped, or reversed-J-shaped. Some of its special models are presented. We illustrate the importance of the family with two applications to real data sets.
Abstract: Interest in estimating the probability of cure has been increasing in cancer survival analysis as the cure of some cancer sites is becoming a reality. Mixture cure models have been used to model failure time data with the existence of long-term survivors. The mixture cure model assumes that a fraction of the survivors are cured of the disease of interest. The failure time distribution for the uncured individuals (latency) can be modeled by either parametric models or a semi-parametric proportional hazards model. In the model, the probability of cure and the latency distribution are both related to prognostic factors and patients’ characteristics. The maximum likelihood estimates (MLEs) of these parameters can be obtained using the Newton-Raphson algorithm. The EM algorithm has been proposed as a simple alternative by Larson and Dinse (1985) and Taylor (1995) in various settings for cause-specific survival analysis. This approach is extended here to grouped relative survival data. The methods are applied to analyze the colorectal cancer relative survival data from the Surveillance, Epidemiology, and End Results (SEER) program.
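A minimal sketch of the EM idea, stripped down to a constant cure probability and an exponential latency distribution with no covariates, applied to individual-level censored data rather than the grouped relative survival data of the paper; the toy data and starting values are assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 500
cured = rng.uniform(size=n) < 0.3                  # true cure fraction 0.3 (toy data)
event_time = rng.exponential(2.0, n)               # latency times for the uncured
censor_time = rng.uniform(0.0, 8.0, n)
t = np.where(cured, censor_time, np.minimum(event_time, censor_time))
d = (~cured) & (event_time <= censor_time)         # True = death observed, False = censored

pi, lam = 0.5, 1.0                                 # starting values (assumed)
for _ in range(200):
    # E-step: probability that each censored subject belongs to the uncured group
    s_u = np.exp(-lam * t)                         # exponential latency survival function
    w = np.where(d, 1.0, (1.0 - pi) * s_u / (pi + (1.0 - pi) * s_u))
    # M-step: update the cure probability and the exponential rate
    pi = 1.0 - w.mean()
    lam = d.sum() / np.sum(w * t)

print(f"estimated cure probability {pi:.3f}, latency rate {lam:.3f}")
```

In the paper's setting the E-step weights and M-step updates would instead involve covariates, the grouped relative-survival likelihood, and a semi-parametric or parametric latency model; the sketch only illustrates the alternating structure of the algorithm.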