Change Log

When What
March 11, 2015 Donated by Alina Lazar


Studies who have been using the data (in any form) are required to add the following reference to their report/paper:

author = {Lazar, Alina and Ritchey, Sarah and Sharif, Bonita},
title = {Improving the Accuracy of Duplicate Bug Report Detection Using Textual Similarity Measures},
booktitle = {Proceedings of the 11th Working Conference on Mining Software Repositories},
series = {MSR 2014},
year = {2014},
isbn = {978-1-4503-2863-0},
location = {Hyderabad, India},
pages = {308--311},
numpages = {4},
url = {http://doi.acm.org/10.1145/2597073.2597088},
doi = {10.1145/2597073.2597088},
acmid = {2597088},
publisher = {ACM},
address = {New York, NY, USA},
keywords = {Duplicate bug reports, TakeLab, support vector machines, textual similarity measures},

About the Data

Dataset Acquisition

The three systems used in the case study presented here are Eclipse, Open Office, and Mozilla. Using web scraping techniques, bug reports were collected from the Bugzilla websites of the three systems. In Table 1 we report: the time interval when the bugs collected were submitted, the number of initial bugs collected, the number of initial duplicates, the number of bugs after preprocessing, and thenumber of duplicate pairs used for training.

After the bug reports were collected, reports that had an open resolution were removed from the datasets. We did this because their status cannot be confirmed from the information available. Removing them would help prevent training the model with mislabeled data. It is also important to note that this could easily be done in an industrial setting. Thedatasets were trimmed further by removing duplicate resolution bugs that did not have masters in the dataset. For each dataset, the master duplicate bug was found together with all the duplicate bugs in it’s group. This was just a preliminary step before generating all the duplicate bug report pairs. All these pairs are identified by adding to the dataset the classification decision feature decision and setting its value to 1. Four times as many non-duplicate pairs of bugs as duplicate pairs of bugs were generated for the training dataset. The decision for the non-duplicate pairs is set to -1.

The unprocessed data can be refered from “Towards more accurate retrieval of duplicate bug reports, pages 253–262”

Classification features selected (not in reports): bug id, title, description, product, component, type, priority, version, open date

18 Synthezised Features are generated using simple TakeLab . A few of them in no particular order are

1-3) n-gram word overlap for unigrams, bigrams and trigrams,

4-6) n-gram word overlap for unigrams, bigrams and trigrams after lemmatization,

7) WordNet based augmented word overlap,

8) weighted word overlap,

9-10) normalized differences for sentence length and aggregate word information content,

11) shallow named entity and numbers overlap.

FYI: The paper or dataset does not have a detailed description of the attributes or its sequence in the dataset, but it says that it is referenced from “TakeLab: systems for measuring semantic text similarity

Features 19-25 are calculated from the paper “Towards more accurate retrieval of duplicate bug reports”

feature19 = 1, if b1.prod = b2.prod

0, otherwise

feature20 = 1, if b1.comp = b2.comp

0, otherwise

feature21 = 1, if b1.type = b2.type

0, otherwise

feature22 = 1/(1 − abs(b1.priority − b2.priority))

feature23 = 1/(1 − abs(b1.version − b2.version))

feature24 = abs(b1.open date − b2.open date)

feature25 = feature25 = abs(b1.bug id − b2.bug id)

Train and Test Split of Dataset

After we generate the features for all the bugs in the datasets, we divide the initial dataset into a training set and a testing set. The training set contains 5,000 bug pairs and all the other pairs are put into the testing set. The instances are divided into the subsets using stratified sampling, so the percentage between the two classes is preserved. That means out of the 5,000 pairs, 1,000 paris are duplicate and 4,000 are non-duplicate. Next, all the values in the training and the testing datasets are scaled to the [-1,1] interval. First the training set is scaled and the ranges are then applied to scale the testing set. The training sets are used to generate the classification models