Change Log

When               What
March 29th, 2015   Donated by Titus Barik


author = "Barik, T and Lubick, K and Smith, J and Slankas, J and Murphy-Hill, E",
title = "FUSE: A Reproducible, Extendable, Internet-scale Corpus of Spreadsheets",
note = "Submitted",

About the Data

The contributors have provided two related datasets, which together constitute the FUSE spreadsheet corpus.

  • A Web Analysis dataset of 2,127,284 URLs that return spreadsheet content, along with the full HTTP web server response, formatted as JSON records. This dataset was obtained by filtering through 26.83 billion HTTP responses within the Common Crawl archive.
  • A Binary Analysis dataset of 249,376 spreadsheets, extracted from the 1.9 PB of raw data within the Common Crawl archive. For each spreadsheet, the authors provide JSON metadata containing their analysis, which includes NLP token extraction and spreadsheet metrics.
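
As a rough illustration of working with JSON records like those in the Web Analysis dataset, the sketch below counts records per host. It rests on two assumptions that this description does not confirm: that the records are stored one JSON object per line, and that each record keeps its URL under a field named "url" (a hypothetical name; check the data schema in the paper before relying on it). The sample records are made up for the example.

```python
import json
from collections import Counter
from urllib.parse import urlsplit


def count_hosts(lines, url_field="url"):
    """Count records per host in an iterable of JSON-encoded lines.

    ASSUMPTION: one JSON object per line, with the URL stored under
    `url_field`. Neither detail is confirmed by the dataset description.
    """
    hosts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        url = record.get(url_field)
        if url:
            # hostname is lower-cased by urlsplit; None if the URL has no host
            hosts[urlsplit(url).hostname] += 1
    return hosts


# Made-up records standing in for real data.
sample = [
    '{"url": "http://example.com/a.xls"}',
    '{"url": "http://example.com/b.xls"}',
    '{"url": "http://example.org/c.xlsx"}',
]
print(count_hosts(sample).most_common())
```

Because the function accepts any iterable of lines, an open file handle can be passed in directly, so a file with millions of records is processed one line at a time rather than loaded into memory.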

Paper abstract

Spreadsheets are perhaps the most ubiquitous form of end-user programming software. This paper describes a corpus, called FUSE, containing 2,127,284 URLs that return spreadsheets (and their HTTP server responses), and 249,376 unique spreadsheets, contained within a public web archive of over 26.83 billion pages. Obtained using nearly 60,000 hours of computation, the resulting corpus exhibits several useful properties over prior spreadsheet corpora, including reproducibility and extendability. Our corpus is unencumbered by any license agreements, available to all, and intended for wide usage by end-user software engineering researchers. In this paper, we detail the data and the spreadsheet extraction process, describe the data schema, and discuss the trade-offs of FUSE with other corpora.