When | What |
---|---|
March 1st, 2017 | Donated by Radek Silhavy |

Studies that use the data (in any form) are required to include the following reference:

```
@article{Silhavy20171,
title = "Analysis and selection of a regression model for the Use Case Points method using a stepwise approach",
journal = "Journal of Systems and Software",
volume = "125",
number = "",
pages = "1 - 14",
year = "2017",
note = "",
issn = "0164-1212",
doi = "10.1016/j.jss.2016.11.029",
url = "http://www.sciencedirect.com/science/article/pii/S016412121630231X",
author = "Radek Silhavy and Petr Silhavy and Zdenka Prokopova",
keywords = "Software size estimation, Stepwise approach, Multiple linear regression, Use Case Points, Dataset, Variables analysis",
abstract = "This study investigates the significance of use case points (UCP) variables and the influence of the complexity of multiple linear regression models on software size estimation and accuracy. Stepwise multiple linear regression models and residual analysis were used to analyse the impact of model complexity. The impact of each variable was studied using correlation analysis. The estimated size of software depends mainly on the values of the weights of unadjusted UCP, which represent a number of use cases. Moreover, all other variables (unadjusted actors' weights, technical complexity factors, and environmental complexity factors) from the {UCP} method also have an impact on software size and therefore cannot be omitted from the regression model. The best performing model (Model D) contains an intercept, linear terms, and squared terms. The results of several evaluation measures show that this model's estimation ability is better than that of the other models tested. Model D also performs better when compared to the {UCP} model, whose Sum of Squared Error was 268,620 points on Dataset 1 and 87,055 on Dataset 2. Model D achieved a greater than 90\% reduction in the Sum of Squared Errors compared to the Use Case Points method on Dataset 1 and a greater than 91\% reduction on Dataset 2. The medians of the Sum of Squared Errors for both methods are significantly different at the 95\% confidence level (p < 0.01), while the medians for Model D (312 and 37.26) are lower than Use Case Points (3134 and 3712) on Datasets 1 and 2, respectively."
}
```

We gathered this dataset from three software houses; it contains real-life project data. The Use Case Points (UCP) method, as originated by Karner, was used to count use case steps and the number of actors. The projects cover different programming languages and various problem domains. The ISBSG style was adopted for language, domain, and application type.

Attributes are used as follows:

Attribute | Description |
---|---|
Project_No | Project ID, for identification purposes only |
Simple Actors | Number of actors classified as simple according to UCP |
Average Actors | Number of actors classified as average according to UCP |
Complex Actors | Number of actors classified as complex according to UCP |
UAW | Unadjusted Actor Weight, computed using the UCP equation |
Simple UC | Number of use cases classified as simple, based on the UCP number of steps |
Average UC | Number of use cases classified as average, based on the UCP number of steps |
Complex UC | Number of use cases classified as complex, based on the UCP number of steps |
UUCW | Unadjusted Use Case Weight, computed using the UCP equation |
TCF | Technical Complexity Factor |
ECF | Environmental Complexity Factor |
Real_P20 | Real effort in person-hours, divided by the productivity factor (PF = 20) |
Real_Effort_Person_Hours | Real effort (development time) in person-hours |
Sector | Problem domain of the project |
Language | Programming language used for the project |
Methodology | Development methodology used for the project |
ApplicationType | Classification of project type, provided by the data donor |
DataDonator | Anonymized acronym for the data donor |
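
As a rough illustration of how the derived attributes relate to the counted ones, the sketch below applies the standard weights from Karner's UCP method (actors: 1, 2, 3; use cases: 5, 10, 15). The function names are hypothetical and not part of the dataset, the input values are made up for illustration, and reading Real_P20 as real effort divided by PF = 20 is an assumption based on the attribute description.

```python
# Minimal sketch of the Use Case Points arithmetic (Karner's standard weights).
# All function names are hypothetical; the numbers below are illustrative only.

def unadjusted_actor_weight(simple, average, complex_):
    """UAW: simple, average, and complex actors weighted 1, 2, and 3."""
    return 1 * simple + 2 * average + 3 * complex_

def unadjusted_use_case_weight(simple, average, complex_):
    """UUCW: simple, average, and complex use cases weighted 5, 10, and 15."""
    return 5 * simple + 10 * average + 15 * complex_

def use_case_points(uaw, uucw, tcf, ecf):
    """UCP = (UAW + UUCW) * TCF * ECF."""
    return (uaw + uucw) * tcf * ecf

def estimated_effort_hours(ucp, pf=20.0):
    """Effort estimate in person-hours for a productivity factor PF."""
    return ucp * pf

def real_p20(real_effort_person_hours, pf=20.0):
    """Assumed meaning of Real_P20: real effort divided by PF = 20."""
    return real_effort_person_hours / pf

# Made-up project: 2 simple, 1 average, 3 complex actors;
# 4 simple, 6 average, 2 complex use cases.
uaw = unadjusted_actor_weight(2, 1, 3)
uucw = unadjusted_use_case_weight(4, 6, 2)
ucp = use_case_points(uaw, uucw, tcf=1.0, ecf=0.8)
print(uaw, uucw, round(ucp, 2), round(estimated_effort_hours(ucp), 1))
```

Under this reading, Real_P20 is the real effort expressed on the same scale as UCP, so it can be compared directly with the size computed from UAW, UUCW, TCF, and ECF.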

This study investigates the significance of use case points (UCP) variables and the influence of the complexity of multiple linear regression models on software size estimation and accuracy.

Stepwise multiple linear regression models and residual analysis were used to analyse the impact of model complexity. The impact of each variable was studied using correlation analysis.

The estimated size of software depends mainly on the values of the weights of unadjusted UCP, which represent a number of use cases. Moreover, all other variables (unadjusted actors’ weights, technical complexity factors, and environmental complexity factors) from the UCP method also have an impact on software size and therefore cannot be omitted from the regression model. The best performing model (Model D) contains an intercept, linear terms, and squared terms. The results of several evaluation measures show that this model’s estimation ability is better than that of the other models tested. Model D also performs better when compared to the UCP model, whose Sum of Squared Error was 268,620 points on Dataset 1 and 87,055 on Dataset 2. Model D achieved a greater than 90% reduction in the Sum of Squared Errors compared to the Use Case Points method on Dataset 1 and a greater than 91% reduction on Dataset 2. The medians of the Sum of Squared Errors for both methods are significantly different at the 95% confidence level (p < 0.01), while the medians for Model D (312 and 37.26) are lower than Use Case Points (3134 and 3712) on Datasets 1 and 2, respectively.
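
The comparison above rests on two simple quantities: the Sum of Squared Errors and its relative reduction against a baseline. The sketch below shows both, together with one way to lay out a feature row that has an intercept, linear terms, and squared terms. It is not the authors' implementation: the function names are hypothetical, the assumption that the squared terms cover all four UCP variables is ours, and the toy numbers are made up rather than taken from the dataset.

```python
# Minimal sketch: SSE, relative SSE reduction, and a Model D style feature row.
# Names are hypothetical and the toy values are illustrative, not real data.

def sum_squared_errors(actual, predicted):
    """Sum of squared differences between actual and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted))

def sse_reduction_percent(baseline_sse, model_sse):
    """Percentage by which model_sse is lower than baseline_sse."""
    return 100.0 * (baseline_sse - model_sse) / baseline_sse

def intercept_linear_squared_row(uaw, uucw, tcf, ecf):
    """Feature row: intercept, linear terms, then squared terms.

    Assumption: squared terms are taken over all four UCP variables.
    """
    linear = [uaw, uucw, tcf, ecf]
    return [1.0] + linear + [x * x for x in linear]

# Toy example with made-up sizes and predictions.
actual = [100, 200, 300]
baseline_pred = [130, 160, 350]   # stand-in for a baseline estimate
model_pred = [110, 190, 310]      # stand-in for a regression model

baseline_sse = sum_squared_errors(actual, baseline_pred)
model_sse = sum_squared_errors(actual, model_pred)
print(baseline_sse, model_sse, sse_reduction_percent(baseline_sse, model_sse))
```

Fitting the coefficients for such feature rows (for example by ordinary least squares) is left out here; the sketch only shows how the reported reduction figures would be computed once predictions are available.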