From Long-Term Data to Understanding: Toward a Predictive Ecology
2015 LTER ASM Estes Park, CO - August 30 - September 2, 2015

Data Provenance in R

Poster Number: 
Presenter/Primary Author: 
Emery Boose
Barbara Lerner
Aaron Ellison
Shaylyn Adams
Nicole Hoffler
Antonia Oprescu
Luis Perez
Sofiya Toskova

The ability to understand and replicate a data analysis is enhanced by metadata that describe exactly how the data were created and transformed, including all of the data artifacts and processes used along the way.  However, few (if any) workflow or scripting environments currently available capture all of this information (also known as data provenance).  Rather, most software used for data analysis is optimized for performance and ease of use and not for tracking provenance.  As a result, data provenance has had little impact so far in improving the transparency, reliability, and reproducibility of scientific results.

Two major challenges must be overcome to bring data provenance within reach of domain scientists.  First, the software tools that collect data provenance must be easy to use; ideally, the analytical tools that scientists already use should be augmented to provide this service.  This is a non-trivial task, since much of the required information is dynamic and must be collected and recorded while the script or program is executing.  Second, data provenance (once collected) has the potential to be very large and complex.  As a result, scientists will need effective tools for visualizing, querying, and managing these metadata; without such tools, the metadata will have no practical value.

In this project we are developing software tools to make data provenance available to users of the R statistical language, which is widely used by ecologists and environmental scientists for data analysis and visualization.  Our current tools include: (1) RDataTracker, a library of R functions that collects data provenance in the form of a Data Derivation Graph (or DDG) as an R script executes, and (2) DDG Explorer, a separate tool written in Java that is used to visualize, query, and store the resulting DDG.
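
To make the idea of a data derivation graph concrete, the following is a minimal conceptual sketch, written in Python purely for illustration.  It is not RDataTracker's API and does not reflect how RDataTracker or DDG Explorer are implemented; the class DerivationGraph, the helper traced, and the sample temperature values are all hypothetical.  The sketch assumes a graph with two kinds of nodes (data and procedures), records each step as it executes, and answers one simple query: which nodes a given result was derived from.

```python
from collections import defaultdict


class DerivationGraph:
    """Toy derivation graph (hypothetical): data and procedure nodes with directed edges."""

    def __init__(self):
        self.kind = {}                    # node name -> "data" or "procedure"
        self.parents = defaultdict(set)   # node name -> nodes it directly derives from

    def record(self, procedure, inputs, outputs):
        """Record one executed step: inputs -> procedure -> outputs."""
        self.kind[procedure] = "procedure"
        for name in inputs:
            self.kind.setdefault(name, "data")
            self.parents[procedure].add(name)
        for name in outputs:
            self.kind[name] = "data"
            self.parents[name].add(procedure)

    def lineage(self, name):
        """Return every node that `name` was derived from, directly or indirectly."""
        seen = set()
        stack = [name]
        while stack:
            for parent in self.parents.get(stack.pop(), ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen


def traced(graph, procedure, func, inputs, output, env):
    """Run func on the named inputs in env, store the result, and record the step."""
    env[output] = func(*(env[name] for name in inputs))
    graph.record(procedure, inputs, [output])
    return env[output]


# Hypothetical two-step analysis: drop missing values, then take the mean.
ddg = DerivationGraph()
env = {"raw_temps": [12.1, None, 14.3, 13.8]}

traced(ddg, "drop_missing", lambda xs: [x for x in xs if x is not None],
       ["raw_temps"], "clean_temps", env)
traced(ddg, "mean", lambda xs: sum(xs) / len(xs),
       ["clean_temps"], "mean_temp", env)

print(sorted(ddg.lineage("mean_temp")))
# ['clean_temps', 'drop_missing', 'mean', 'raw_temps']
```

Because each step is recorded at the moment it runs, the graph reflects what actually executed rather than what the script text suggests.  Even this toy version hints at the second challenge described above: a real analysis would produce far more nodes, so lineage queries and visualization become necessary.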