Replication with R

Replication

Replicating results and making replication materials publicly accessible are common practice in today’s social sciences. Many peer-reviewed journals require not only that your results are replicable but also that you provide replication materials for interested readers. Many replication files of peer-reviewed articles in Political Science can be found on Harvard’s Dataverse.¹

Replication with isolated libraries and version control

Most journals in the social sciences require a declaration of the data used, with a level of detail that depends on which data you used and whether you collected them yourself. To ensure replicability, you provide your code and, if possible, the data used in your analyses. The code should be executable and yield the same results as in your article, including figures and tables. You can keep your code in a Git repository (e.g., on GitHub) to save it, create snapshots (commits) of it, share any snapshot or other version of your code with others, and track changes from one version to another.

Next month, next year, next decade?

Code that runs perfectly on your computer may not run on another person’s computer. Luckily, R works fairly consistently across Windows, Linux distributions, and macOS, but packages may work differently, or not at all, across operating systems (OS). Other software may be available only for Linux but not for Windows, and vice versa. Your code will likely call functions from several packages as well. Packages, but also R and your operating system, are (hopefully) updated regularly. Updates may change how functions work: some functions are deprecated and replaced before they are removed, and others may not be supported in newer versions of R. Thus, code that ran perfectly on your computer last year may not run on your computer this year.
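Because results can depend on these versions, it helps to record which versions you actually used. The following is a minimal sketch using only base R; the file name session_info.txt is a hypothetical choice, not something prescribed here.

```r
# Record the R version used for the analysis
r_version <- R.version.string

# Look up the installed version of a single package (here: stats, shipped with R)
stats_version <- as.character(packageVersion("stats"))

# Collect R version, operating system, and attached packages in one object
session <- sessionInfo()

# Write the summary to a text file that can be shared with the replication materials
# (the file name is an illustrative assumption)
writeLines(capture.output(print(session)), "session_info.txt")
```

Such a file does not make old versions installable by itself, but it tells replicators which environment produced your results.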

There are some remedies to increase the chances that your analyses remain replicable for longer. Different versions of R and its packages are already stored in various online archives. You can save not only a specific version of your code but also of the software environment the code is executed in. To provide snapshots and version control for libraries in R, you can employ additional packages (e.g., renv, groundhog). The popular software Docker allows you to create and share an isolated software container, which works much like a virtual machine with a fixed operating system, R installation, and package library. For ongoing projects, groundhog can easily be introduced at any step.

Library versioning

renv is best initialized at the start of a project. renv creates a project-specific library, separate from your local library, whose state can be shared with and restored by others. However, replication with renv requires the project’s renv files (such as the lockfile describing the library) to be shared alongside the code and data, and it may require a specific version of R that is compatible with the packages included in the library.
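A typical renv workflow might look like the sketch below. It assumes that renv is installed and that the commands are run from the project directory; it is an outline of the usual steps rather than a complete setup.

```r
# install.packages("renv")  # once, if renv is not installed yet

# At the start of the project: create the project library and the lockfile (renv.lock)
renv::init()

# After installing or updating packages: record the current versions in renv.lock
renv::snapshot()

# On another machine, after obtaining the project files:
# reinstall the package versions recorded in renv.lock
renv::restore()
```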

In contrast, groundhog does not save a library; instead, it loads the versions of packages and their dependencies that were current on a given date. The syntax and execution of groundhog are very similar to base R, e.g., to load packages, you call groundhog.library("package", "date") instead of library("package").
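As a brief sketch, assuming groundhog is installed: the package name dplyr and the date below are illustrative choices, and the first call for a given date may need to download and install the matching versions.

```r
# install.packages("groundhog")  # once, if groundhog is not installed yet
library(groundhog)

# Load dplyr (an example package) as it was available on the given date,
# together with the matching versions of its dependencies
groundhog.library("dplyr", "2024-06-01")
```

Using one fixed date for every groundhog.library() call in a project keeps all packages consistent with each other.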

Software versioning

Docker is a well-established tool for creating software containers that function like virtual computers. These containers are independent collections of software that include everything needed to execute some desired operation. A containerized R project would include an operating system (OS), commonly a Linux distribution, the R software required to run R code, and a library of all packages required for the project. This software container can be shared and executed across multiple machines, independent of any other software on a user’s computer.

Docker can be used at any point of a project and can install only the required libraries when the containerized project is first executed.

Replicate randomness with set.seed()

Setting seeds is extremely helpful for replicability. It allows you to reproduce results that involve randomness, so that whenever you execute a script again, a random draw is exactly the same as before. A random draw could be sampling values (e.g., drawing 1000 observations from 10000 observations) or generating random values (e.g., ten numbers between 1 and 100).

To do so, use the command set.seed() and define an integer value as the seed, e.g. set.seed(2025).
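As a minimal sketch of the two kinds of random draws mentioned above (the object names are illustrative), resetting the same seed before a draw reproduces it exactly:

```r
# Generate ten random integers between 1 and 100
set.seed(2025)
first_draw <- sample(1:100, 10)

# Resetting the same seed and repeating the call gives the same ten numbers
set.seed(2025)
second_draw <- sample(1:100, 10)
identical(first_draw, second_draw)  # TRUE

# Sampling works the same way: draw 1000 row indices out of 10000 observations
set.seed(2025)
sampled_rows <- sample(10000, 1000)
```

The seed fixes the whole sequence of random numbers, so adding or reordering random calls after set.seed() changes the draws that follow.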



  1. King, G. (2007). An Introduction to the Dataverse Network as an Infrastructure for Data Sharing. Sociological Methods & Research, 36(2), 173-199.
