What makes scientific results reproducible?
===========================================

The idea of reproducible science is to provide not only the final results of your research (often in the form of a paper), but also to enable others to redo your experiments and analysis. This allows for external verification, greater trust in your results, and comparative ease in reconstructing the details of your own work in the future. So what is necessary for your computational results to be reproducible? The following is based loosely on [this article](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003285), which I would encourage you to read.

Automate everything
-------------------

You should have no manual steps in your result creation, which means codifying everything you do. This often also replaces a lot of documentation of the sort "click here, copy this from here to there, and rename that to xyz".

There are several ways of achieving that. You could write a straightforward shell script in [bash](https://en.wikipedia.org/wiki/Bash_(Unix_shell)) if you're under Linux or macOS, or use [batch files](https://en.wikipedia.org/wiki/Batch_file) or [PowerShell](https://en.wikipedia.org/wiki/PowerShell) if you're under Windows. This has, however, a serious drawback: all steps in your pipeline are run again, even if that is not necessary. For example, if you only change your postprocessing part, there is no need to run the simulation again. (You could probably implement checks like this in your favorite shell language, but then you're just reimplementing some of the solutions below. Don't.)

Better solutions to this are build systems, of which there are [many](https://en.wikipedia.org/wiki/List_of_build_automation_software). The most widely known, and almost universally available (at least on Linux systems), is [make](https://en.wikipedia.org/wiki/Make_(software)). If you're already familiar with make and it suits your work, go with it. However, I find the syntax dismal, and the timestamp-based decisions sometimes result in unnecessary reruns. The Python-based alternatives we're going to use here are [doit](pydoit.md) and [Snakemake](snakemake.md). If you know Python, you'll feel right at home.
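To give a flavor of what this looks like, here is a minimal sketch of a doit `dodo.py` with two tasks. The script and file names (`simulate.py`, `postprocess.py`, `parameters.txt`, and so on) are hypothetical placeholders for illustration, not part of any real project:

```python
# dodo.py -- a minimal doit pipeline (hypothetical file names).
# Run with `doit`; only tasks whose dependencies changed are re-executed.

def task_simulate():
    """Run the (expensive) simulation when its inputs change."""
    return {
        "file_dep": ["simulate.py", "parameters.txt"],
        "targets": ["raw_results.csv"],
        "actions": ["python simulate.py parameters.txt raw_results.csv"],
    }

def task_postprocess():
    """Turn raw simulation output into the figure used in the paper."""
    return {
        "file_dep": ["postprocess.py", "raw_results.csv"],
        "targets": ["figure1.png"],
        "actions": ["python postprocess.py raw_results.csv figure1.png"],
    }
```

With a setup like this, changing only the postprocessing script and running `doit` again re-executes just the postprocessing task; the existing simulation output is reused.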
Keep everything under version control
-------------------------------------

Version control is a way of keeping track of the changes you made to your code and why. While there are also [many](https://en.wikipedia.org/wiki/List_of_version-control_software) version control systems, one has emerged as the clear winner and is widely known and supported. Learn [Git](https://git-scm.com/) and use it habitually everywhere, all the time.

Archive your environment
------------------------

This part is quite difficult. Software changes its behaviour as newer versions are released. To make your results reproducible, the code needs to be run with the same versions again. The least you should do is write down the version numbers of the software that matters. But this is prone to error, likely to become out of date, and makes reproducing your results much more difficult than it should be. A better solution would be to provide a virtual machine with all the software that you've used preinstalled. This satisfies the reproducibility requirements, but is also cumbersome to use and archive, and not very transparent in terms of how you've set up your environment.

A newer approach is to use containers, which allow you to easily distribute your environment setup along with the rest of the code, while also being more lightweight and faster to start than virtual machines. This is discussed in detail in the [docker](docker) section.

For Analyses That Include Randomness, Note Underlying Random Seeds
------------------------------------------------------------------

To make results that stem from computations involving randomness reproducible, set the seed of the random number generator to a fixed value (e.g. `numpy.random.default_rng(42)` in Python) and note it alongside your results.

Record intermediate results
---------------------------

This is where it gets tricky. If your intermediate files are just a couple of MBs, then sure, check them into Git (using Git LFS) and be done with it. For large simulations, possible approaches are:

1. Put them into whatever large-scale storage you have access to, and make it clear where to find this data.
2. Take a look at [Git-Annex](https://git-annex.branchable.com/), which lets you combine Git with storing data anywhere you want (other servers, your phone, Dropbox, Amazon S3, USB sticks, Git LFS, BitTorrent, Google Drive, IPFS, ...).
3. Just give up. At a certain scale, in situ visualization is the only feasible thing anyway.

Provide Public Access to Scripts, Runs, and Results
---------------------------------------------------

None of the above matters in terms of me being able to reproduce your results if I'm not able to find your code and data. For code, a good approach is to upload your repository to [GitHub](https://github.com/) and include the link in your paper. Use a [tag](https://git-scm.com/book/en/v2/Git-Basics-Tagging) to mark the version that created the final paper. Additionally, you can upload your code and data to [Zenodo](https://zenodo.org/), where you'll get a DOI that allows you to reference it in the future.
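To tie the points about random seeds and recorded software versions together, here is a small sketch of a helper that fixes the seed and writes it, along with the relevant versions, next to the results. The file names and the use of NumPy are assumptions for illustration:

```python
# provenance.py -- sketch: record seed and versions alongside the results.
# Assumes NumPy is installed; file names are hypothetical.
import json
import platform
from importlib.metadata import version

import numpy as np

SEED = 42
rng = np.random.default_rng(SEED)  # all randomness flows through this generator

# ... run the actual analysis using `rng` here ...

# Note down what produced the results, so the run can be repeated later.
with open("provenance.json", "w") as f:
    json.dump(
        {
            "seed": SEED,
            "python": platform.python_version(),
            "numpy": version("numpy"),
        },
        f,
        indent=2,
    )
```

A file like `provenance.json`, committed or archived together with the code and tagged release, makes it much easier for someone else (or future you) to rerun the analysis with the same seed and versions.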