Columba, Eclipse JDT.Core and Scarab

URL

Change Log

When What
September 18th, 2015 Donated by Sunghun Kim

Reference

Studies who have been using the data (in any form) are required to include the following reference:

@INPROCEEDINGS{6032487,
author={Sunghun Kim and Hongyu Zhang and Rongxin Wu and Liang Gong},
booktitle={Software Engineering (ICSE), 2011 33rd International Conference on},
title={Dealing with noise in defect prediction},
year={2011},
pages={481-490},
keywords={data mining;program debugging;program testing;MSR;bug report links;false negative noise;false positive noise;historical defect data;mining software repositories;optional bug fix keywords;software defect prediction models;Electrical resistance measurement;Noise;Noise measurement;Prediction algorithms;Predictive models;Resistance;Training;buggy changes;buggy files;data quality;defect prediction;noise resistance},
doi={10.1145/1985793.1985859},
ISSN={0270-5257},
month={May},
}

About the Data

Overview of Data

We use Columba, Eclipse JDT.Core and Scarab as our subjects for this experiment (Table 1), as these projects have high quality change logs and links between changes logs and bug reports.

Paper Abstract

Many software defect prediction models have been built using historical defect data obtained by mining software repositories (MSR). Recent studies have discovered that data so collected contain noises because current defect collection practices are based on optional bug fix keywords or bug report links in change logs. Automatically collected defect data based on the change logs could include noises. This paper proposes approaches to deal with the noise in defect data. First, we measure the impact of noise on defect prediction models and provide guidelines for acceptable noise level. We measure noise resistant ability of two well-known defect prediction algorithms and find that in general, for large defect datasets, adding FP (false positive) or FN (false negative) noises alone does not lead to substantial performance differences. However, the prediction performance decreases significantly when the dataset contains 20%-35% of both FP and FN noises. Second, we propose a noise detection and elimination algorithm to address this problem. Our empirical study shows that our algorithm can identify noisy instances with reasonable accuracy. In addition, after eliminating the noises using our algorithm, defect prediction accuracy is improved.