Speaker
Description
Achieving reproducibility in data science can be challenging because it depends on reproducible software, reproducible data, and ad-hoc parameters and variables. Nevertheless, good tools are being developed that help make it possible. Nix is a package manager that lets users conveniently define, create, and work with system-level virtual environments in which they can run data science workloads such as fitting or training a model. The data versioning tool DVC can be used to keep track of dataset versions and to associate, for example, a training result with the training and validation data used to produce it. In this talk, I will describe how we use Nix along with DVC (and Git, upon which DVC is built) to achieve pragmatic levels of reproducibility in some of our data science projects, and touch on some of the shortcomings of these tools.
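
To make the idea of tying a training result to its inputs concrete, here is a minimal Python sketch. It is not how DVC or Nix work internally, and it is not the workflow presented in the talk; it only illustrates the general principle of recording a code revision, the parameters, and content hashes of the data files together. The function names (`file_sha256`, `build_manifest`, `manifest_fingerprint`) and the manifest layout are hypothetical, chosen for illustration.

```python
import hashlib
import json
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(data_files, params, code_revision):
    """Collect the three ingredients named above: code, parameters, data.

    `code_revision` is assumed to be something like a Git commit hash,
    supplied by the caller; this sketch does not query Git itself.
    """
    paths = sorted(str(p) for p in data_files)
    return {
        "code_revision": code_revision,
        "params": params,
        "data": {p: file_sha256(Path(p)) for p in paths},
    }


def manifest_fingerprint(manifest) -> str:
    """Hash a canonical JSON encoding so equal manifests give equal IDs."""
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Storing such a manifest next to a trained model would let one check later whether the data, parameters, or code revision have changed since training. It says nothing about the software environment itself, which is the part a tool like Nix is meant to pin down.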