ONLINE COURSE – Reproducible Data Science using RMarkdown, Git, R packages, Docker, Make & Drake, and other tools (RDRP01)
This course will be delivered live
29 June 2020 - 3 July 2020
£450.00
This course provides a comprehensive introduction to reproducible data analysis, which we define as analysis where the entire workflow or pipeline is as open and transparent as possible, so that others, including our future selves, can exactly reproduce any of its results. We cover this topic through a thorough introduction to a set of R based and general computing tools, namely RMarkdown, Git & GitHub, R packages, Docker, GNU Make, and Drake, and show how they can be used together to do reproducible data analysis that can then be shared with others.
After a general introduction on Day 1, where we introduce the core concept of a research compendium, we cover RMarkdown, knitr, and related tools. These are vital tools for reproducible research that allow us to produce data analysis reports (articles, slides, posters, websites, etc.) by embedding analysis code (R, Python, etc.) within the text of the report; the code is executed and the results it produces are inserted into the final output document. On Day 2, we provide a comprehensive introduction to version control using Git, including using GitHub. Git and GitHub are vital tools for the organization, maintenance, and distribution of our code, especially for large scale and long term projects involving multiple collaborators. On Day 3, we cover how to create, maintain, and distribute R packages. R packages are the principal means of distributing reusable R code generally, and here we will also look at how they can be used to create, maintain, and distribute research compendia. On Day 4, we cover Docker, which is now a very popular means of producing reproducible computing environments across different devices, platforms, and operating systems. On Day 5, we cover build automation tools, particularly GNU Make and Drake, which are used for automatically running complex analysis code that involves multiple inter-dependencies between files. GNU Make is a general purpose build automation tool, while Drake is specifically designed for complex data analysis pipelines in R.
On each day, therefore, we aim to provide a comprehensive and thorough introduction to a set of valuable and generally useful computing tools, each of which plays a key role in allowing us to do reproducible data science.
This course is relevant to anyone doing data science, whether in industry or in academic research.
Venue – Delivered remotely
Time zone – UK (GMT)
Availability – 15 places
Duration – 5 days
Contact hours – Approx. 28 hours
ECTS – Equal to 3 ECTS credits
Language – English
PLEASE READ – CANCELLATION POLICY: Cancellations are accepted up to 28 days before the course start date, subject to a 25% cancellation fee. Cancellations later than this may be considered; contact firstname.lastname@example.org. Failure to attend will result in the full cost of the course being charged. In the unfortunate event that a course is cancelled due to unforeseen circumstances, a full refund of the course fees (and accommodation fees if booked through PS statistics) will be credited.
Dr. Mark Andrews
This course will be hands-on and workshop based. Each day, there may be some lecture style presentation, i.e., using slides, introducing and explaining key concepts. However, this will be minimal, and our focus each day will be the practical mastery of the computing tools we cover.
Assumed quantitative knowledge
Though we assume all participants will be experienced with some methods of statistical data analysis, no knowledge of any specific topic is required or assumed.
Assumed computer background
We will only assume a minimal familiarity with R and RStudio. More extensive R experience is desirable but not essential. No experience whatsoever with RMarkdown, Git, R package development, Docker, Make or Drake will be assumed.
Equipment and software requirements
A laptop with all the required software, i.e. R/RStudio, RMarkdown, Git, etc., installed is necessary. All this software is free and open source and available on Windows, MacOS, and Linux. Instructions on how to install this software on each of the platforms will be distributed in advance of the workshop, and in most cases the software can also be installed within minutes during the workshop itself.
UNSURE ABOUT SUITABILITY? THEN PLEASE ASK email@example.com
Monday 29th – Classes from 09:30 to 17:30
• Topic 1: Doing reproducible data science. We begin by providing an overview of reproducible data analysis generally and this course in particular. We’ll address why reproducible data analysis is valuable and survey the wide range of tools available for accomplishing it. We’ll explain that reproducible data analysis is sometimes motivated in terms of open science, which is committed to doing research where the data, analysis code, and results are made fully open and transparent to others. However, reproducible data analysis can also be motivated simply as a means of doing higher quality, more trustworthy, and more robust data analysis, even when that analysis is of a confidential nature. Here, we will also introduce the central concept of a research compendium, which is a bundling of the data, analysis code, and dynamic document files that produce the final reports of the analysis. We will then give an overview of the tools for creating, maintaining, and distributing research compendia that we will cover in the remainder of the course.
• Topic 2: RMarkdown. RMarkdown is a file format that contains a mixture of R code and text and from which we can produce data analysis reports (or slides, web pages, etc.). The report is produced by automatically executing all the analysis code in the RMarkdown file and inserting the results, such as tables and figures, along with the text into the final pdf, html, or MS Word output document. While the basics of RMarkdown can be quickly learned, our aim here is to provide a thorough and comprehensive introduction to RMarkdown so as to get the most out of it. This will include covering markdown syntax; mathematical typesetting with LaTeX; bibliography and citation management; cross references; formatting tables; controlling the placement of figures; scientific diagrams with TikZ; using alternative document templates; creating new customized templates. We will primarily focus on creating articles as the output format, but will also cover creating web pages and slides.
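The core idea of a dynamic document, code embedded in text that is executed with its results inserted into the output, can be illustrated with a toy sketch in Python. This is not how RMarkdown or knitr are implemented: the `{{ ... }}` placeholder syntax, the `render` function, and the `heights` data are all hypothetical and invented purely for illustration.

```python
import re


def render(template: str, namespace: dict) -> str:
    """Replace every {{ expression }} in the text with the value it evaluates to."""

    def evaluate(match: re.Match) -> str:
        # eval is tolerable in a toy run on text we wrote ourselves;
        # it should never be used on untrusted input.
        return str(eval(match.group(1).strip(), namespace))

    return re.sub(r"\{\{(.+?)\}\}", evaluate, template)


# Hypothetical data standing in for a real data set.
heights = [150, 160, 170]

report = render(
    "We measured {{ len(heights) }} people; "
    "the mean height was {{ round(sum(heights) / len(heights), 1) }} cm.",
    {"heights": heights},
)
print(report)
# prints: We measured 3 people; the mean height was 160.0 cm.
```

Because the numbers in the sentence are computed from the data each time the document is rendered, changing `heights` and rendering again updates the text without any manual copying of results.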
Tuesday 30th – Classes from 09:30 to 17:30
• Topic 3: Git & GitHub. The next major tool that we will cover is Git. Git is version control software, and version control software generally is vital for the organization and development of a set of source code files, especially when working collaboratively. We will argue that all the source code files in a data analysis project, including RMarkdown files, should be under version control from the beginning of the project. Using Git for this is an obvious choice because Git is powerful, open source, and now the most popular and most widely used version control system worldwide. In addition, GitHub is an excellent, free to use, and popular hosting site for Git repositories. Here, we will cover initializing Git repositories and cloning existing ones; staging and committing new or modified files to the repository; writing commit messages; pushing and pulling to and from remote repositories; checking out previous versions of the repository; resetting or reverting to a previous state of the repository (i.e. undoing); branching, merging, and rebasing. The last of these topics, i.e. branching etc., covers some especially powerful features of Git, and ones that are vital for long term and complex projects, especially those involving multiple collaborators.
Wednesday 1st – Classes from 09:30 to 17:30
• Topic 4: R packages. R packages are the means by which add-on or contributed R code is distributed, usually through CRAN repositories and GitHub. In addition to being a major tool for R users generally, packages can also be used specifically for developing and distributing research compendia. In this section of the course, we will provide a thorough introduction to developing R packages that can then be pushed to GitHub to be installed by others. We will cover all the major aspects of an R package: writing reusable functions; writing documentation for our code using roxygen2; writing tests to ensure that our code is working as expected; adding data files, including their documentation; writing the DESCRIPTION file, where we provide all the information about what our package does, how to use it, what package dependencies it has, who the authors are, etc.; writing vignettes, which are long form documentation or tutorials; uploading to GitHub for distribution; using pkgdown to create a website for the package or compendium. This section of the course is intended to provide a comprehensive introduction to developing R packages generally, and research compendia in particular. For the latter topic, we will follow the general guidelines outlined in Marwick et al (2018) “Packaging Data Analytical Work Reproducibly Using R (and Friends)”.
Thursday 2nd – Classes from 09:30 to 17:30
• Topic 5: Docker. Docker is powerful and now widely used “containerization” software that can be used to create reproducible environments and software stacks. This allows users to run software identically across different devices, platforms, and operating systems without installing any software other than Docker itself, which is open source and cross platform. Thus, Docker allows us to write our code and perform our analyses on one machine in a container using a specific stack of software. We may then create a specification of this container that others can download and which allows them to recreate the same environment with the same software stack on their devices. They can then run our code identically, using identical versions of all the software, including R packages and the lower level code libraries. Distributing our research compendium to run in a Docker container is the ultimate standard of reproducibility short of using identical hardware devices. In this section of the course, we will learn how to pull general Docker images from a Docker repository and run containers based on them. We will then focus on an R based Docker image, namely rocker, that will allow us to install an R/RStudio based container. We will then extend this rocker image to create a customized R/RStudio environment with all the packages that we require to run our compendium. We will create a Dockerfile specification of this compendium that we can distribute online, and that will allow others to download and recreate our environment exactly. Finally, we will distribute this Docker based environment via the binder website, which runs the container on a server so that it can then be used interactively through an RStudio server session. As part of this coverage, we will also cover the packrat and checkpoint R packages, which can be used for version control of R package dependencies.
Friday 3rd – Classes from 09:30 to 16:00
• Topic 6: Build automation with Make and Drake. Executing simple analyses may be as simple as running a short R script or RMarkdown file. On the other hand, complex analyses may involve dozens of scripts, each pertaining to a particular part of the analysis pipeline; there may be complex inter-dependencies between files, and the entire pipeline may take hours or even days to complete. Tools such as GNU Make and Drake allow us to run our entire analysis pipeline using single shell or R commands. More importantly, these tools identify the inter-dependencies in the code base and so allow us to rerun only those parts of the pipeline that are affected after any change is made. GNU Make is a generally useful tool for any software development, and can be used for many analysis related tasks, especially those that involve code in multiple different languages. Drake, on the other hand, was specifically designed for R based workflows, particularly those that involve high performance and distributed computing. In this final section of the course, therefore, we will explore how to use Make and Drake to automate analysis workflows. To do so, we will use some relatively simple but otherwise typical data analysis projects, involving data cleaning and model fitting, followed by report generation. Here, we will also deal with parallel and distributed computing workflows and how these may be automated by Make and Drake.
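The idea of rerunning only the affected parts of a pipeline can be sketched in a few lines of Python. This is a simplified illustration, not how Make or Drake are implemented: Make, for example, decides what is out of date by comparing file modification times, whereas this sketch simply takes the set of changed files as input. The file names and the `stale_targets` function are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical pipeline: each target maps to the files it is built from.
DEPENDS_ON = {
    "clean.csv": ["raw.csv", "clean.R"],
    "model.rds": ["clean.csv", "fit.R"],
    "figures.pdf": ["clean.csv", "plot.R"],
    "report.pdf": ["model.rds", "report.Rmd"],
}


def stale_targets(changed: set) -> list:
    """Return the targets that must be rebuilt, in a valid build order."""
    stale = set(changed)
    to_build = []
    # static_order() yields every file after all of the files it depends on.
    for node in TopologicalSorter(DEPENDS_ON).static_order():
        deps = DEPENDS_ON.get(node, [])
        # A target is stale if anything it is built from is changed or stale.
        if any(dep in stale for dep in deps):
            stale.add(node)
            to_build.append(node)
    return to_build


print(stale_targets({"fit.R"}))
# prints: ['model.rds', 'report.pdf']
```

Editing the model fitting script leaves `clean.csv` and `figures.pdf` untouched, so the data cleaning and plotting steps are skipped. In a pipeline where those steps take hours, this selective rebuilding is the main practical benefit of a build automation tool.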