bugfiles

URL

Change Log

When What
March 13, 2015 Donated by Xin Ye

Reference

Studies who have been using the data (in any form) are required to add the following reference to their report/paper:

@inproceedings{Ye:2014,
   author = {Ye, Xin and Bunescu, Razvan and Liu, Chang},
   title = {Learning to Rank Relevant Files for Bug Reports Using Domain Knowledge},
   booktitle = {Proceedings of the 22Nd ACM SIGSOFT International Symposium on Foundations of
   Software Engineering},
   series = {FSE 2014},
   year = {2014},
   location = {Hong Kong, China},
   pages = {689--699},
   numpages = {11},
}

About the Data

Overview of Data

This dataset contains bug reports, commit history, and API descriptions of six open source Java projects including Eclipse Platform UI, SWT, JDT, AspectJ, Birt, and Tomcat. This dataset was used to evaluate a learning to rank approach that recommends relevant files for bug reports.

Dataset structure

File list:

  • **AspectJ.[xls xml]** – The bug reports and commit history of AspectJ.
  • **Birt.[xls xml]** – The bug reports and commit history of Birt.
  • **Eclipse_Platform_UI.[xls xml]** – The bug reports and commit history of Eclipse Platform UI.
  • **JDT.[xls xml]** – The bug reports and commit history of JDT.
  • **SWT.[xls xml]** – The bug reports and commit history of SWT.
  • **Tomcat.[xls xml]** – The bug reports and commit history of Tomcat.

Attribute Information

  • bug\_id – refers to the bug report id.
  • summary – refers to the bug report summary.
  • description – refers to the bug report description.
  • report\_time – refers to the bug report report time.
  • report\_timestamp – refers to the bug report report timestamp.
  • status – refers to the status of the bug report.
  • commit – refers to the SHA-1 hash id for the commit that fixed the bug report.
  • commit\_timestamp – refers to the commit timestamp.
  • files – contains the full path of every Java file that was fixed in this commit.
  • result – contains the position of every positive instance in our ranked list result.

How to obtain the source code

  • AspectJ: git clone git://git.eclipse.org/gitroot/aspectj/org.aspectj.git
  • Birt: git clone https://git.eclipse.org/r/p/birt/org.eclipse.birt
  • Eclipse Platform UI: git clone https://git.eclipse.org/r/p/platform/eclipse.platform.ui
  • JDT: git clone https://git.eclipse.org/r/p/jdt/eclipse.jdt.ui
  • SWT: git clone https://git.eclipse.org/r/p/platform/eclipse.platform.swt
  • Tomcat: git clone git://git.apache.org/tomcat.git

A before-fix version of the source code package needs to be checked out for each bug report. Taking Eclipse Bug 420972 for example, this bug was fixed at commit 657bd90. To check out the before-fix version 2143203 of the source code package, use the command git checkout 657bd90~1.

Efficient indexing of the code

If bug 420972 is the first bug processed by the system, we check out its before-fix version 2143203 and index all the corresponding source files. To process another bug report 423588, we need to check out its before-fix version 602d549 of the source code package. For efficiency reasons, we do not need to index all the source files again. Instead, we index only the changed files, i.e., files that were “Added”, “Modified”, or “Deleted” between the two bug reports. The changed files can be obtained as follows:

  • Added: git diff --name-status 2143203 602d549 | grep ".java$" | grep "^A"
  • Modified: git diff --name-status 2143203 602d549 | grep ".java$" | grep "^M"
  • Deleted: git diff --name-status 2143203 602d549 | grep ".java$" | grep "^D"

Paper abstract

When a new bug report is received, developers usually need to reproduce the bug and perform code reviews to find the cause, a process that can be tedious and time consuming. A tool for ranking all the source files of a project with respect to how likely they are to contain the cause of the bug would enable developers to narrow down their search and potentially could lead to a substantial increase in productivity. This paper introduces an adaptive ranking approach that leverages domain knowledge through functional decompositions of source code files into methods, API descriptions of library components used in the code, the bug-fixing history, and the code change history. Given a bug report, the ranking score of each source file is computed as a weighted combination of an array of features encoding domain knowledge, where the weights are trained automatically on previously solved bug reports using a learning-to-rank technique. We evaluated our system on six large scale open source Java projects, using the before-fix version of the project for every bug report. The experimental results show that the newly introduced learning-to-rank approach significantly outperforms two recent state-of-the-art methods in recommending relevant files for bug reports. In particular, our method makes correct recommendations within the top 10 ranked source files for over 70% of the bug reports in the Eclipse Platform and Tomcat projects.