Reproducible DS development with Rmarkdown and Github
Many times when conducting applied research I find myself diving into a project for a few weeks, only to emerge on the other side wondering "how did I end up here?" or thinking "I was sure I ran an analysis that showed that, but now I can't seem to reproduce it".
Borrowing the term "reproducible research" from the academic world, I have over time adopted a few tools and working methods that I call "reproducible development". The aim of "reproducible development" is to enable greater transparency and collaboration on my ongoing research while adding minimal day-to-day overhead.
In this article I give a quick run-down of how I set up and use Github and Rstudio's Rmarkdown to produce high-quality and scalable (in human time) reproducible data science development code.
Github
While data science processes usually don't involve exactly the same workflows as software development (for which Git was originally intended), I think Git is very well suited to the iterative nature of data science tasks.
When walking down different avenues of exploration, it's worthwhile to have each of them reside in its own branch. That way, instead of jotting down general pointers about what you did in some analysis, along with a few code snippets, in a text file (or, God forbid, in Word when you want plots as well), you can go back to the relevant branch, see the different iterations and read a neat report with code and images. You can even revisit ideas that didn't make it into the master branch. Be sure to use informative branch names and commit messages!
Below is an illustration of what that process might look like:
We can see in the image above some feature engineering we experimented with. Once we were confident it was useful, we merged the code into the project, but the experimentation process remains available for future scrutiny. We can also see an experiment that didn't make it back to our master branch, but maybe someone will pick it up in the future and do a better job.
Sharing your work on Github that way can enable fellow researchers to "land" anywhere they want on the development graph above and make revisions or learn from your work.
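As a rough sketch of the branching flow described above, the commands below build a throwaway repo with one experiment that gets merged and one that doesn't. The branch names, file name, commit messages and author identity are all made up for illustration; this is not the history of a real project.

```shell
#!/bin/sh
set -e

# Hypothetical scratch repository; every name below is illustrative.
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master   # call the initial branch "master", as in this article
git config user.name "Example Author"
git config user.email "author@example.com"

# Baseline analysis on master.
echo "baseline model" > analysis.Rmd
git add analysis.Rmd
git commit -q -m "Add baseline analysis report"

# Experiment 1: feature engineering that turns out to be useful, so it is merged.
git checkout -q -b feature-engineering-log-transform
echo "log-transformed features" >> analysis.Rmd
git commit -q -am "Try log transform on skewed features"
git checkout -q master
git merge -q --no-ff -m "Merge log-transform feature engineering" feature-engineering-log-transform

# Experiment 2: an idea that is not merged, but stays around on its own branch.
git checkout -q -b experiment-seasonality-features
echo "seasonality features" >> analysis.Rmd
git commit -q -am "Try seasonality features (no improvement so far)"
git checkout -q master

# Print a text version of the development graph.
git log --graph --oneline --all
```

The --no-ff flag keeps the merged experiment visible as its own line in the graph rather than flattening it into master, and the unmerged branch remains available for anyone who wants to check it out later.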
Rmarkdown
Most people familiar with Rmarkdown know it's a great tool for writing neat reports in all sorts of formats (HTML, PDF and even Word!). One format that makes it a really great combo with Github is github_document. While HTML files aren't rendered on the Github site, knitting a github_document produces an .md file, which renders perfectly well on Github and supports images, tables, math, a table of contents and much more.
What some may not realize is that Rmarkdown is also a great development tool in itself. It behaves much like the popular Jupyter notebooks, with plots, tables and equations showing next to the code that generated them:
What's more, it has tons of cool features that really support reproducible development such as:
- The first R chunk (labelled "setup" by default in the Rstudio template) always runs once when you execute code in the chunks that follow it. If you load all required packages in the setup chunk, you're guaranteed to have them loaded when you open the file and run arbitrary chunks downstream.
- When running code from within a chunk (pressing Ctrl+Enter), the working directory will always be the one in which the .Rmd file is located. In short, this means no more worrying about setting the working directory, be it when working on several projects simultaneously or when cloning a repo from Github.
- It has many cool code execution tools, such as buttons to run all chunks up to the current one or all code in the current chunk, and a green progress bar that lets you track the script's progress in real time!
- If your script is so long that scrolling around it becomes confusing, you can use this neat feature in Rstudio: when viewing Rmarkdown files you can open an interactive table of contents that lets you jump between sections (defined by # headers) in your code.
To summarise this section, I would highly recommend developing with Rmd files rather than R files.
A few set-up tips
- Place a file named "passwords.R" with all your passwords in the directory into which you clone repos, and source it from the Rmd. Since the file sits next to your repos rather than inside them, you don't accidentally publish your passwords to Github.
- I like working with cache on all chunks in my Rmd. It's usually good practice to avoid uploading the cache files generated in the process to Github, so be sure to add the following file types to your .gitignore file: *.RData, *.rdb, *.rdx, *.rds, *__packages
- Github renders CSV files pretty nicely (and lets you search them conveniently). So if you have some reference tables you want to include in your repo and you have a *.csv entry in your .gitignore file, you may want to add the following entry to your .gitignore, after the *.csv line: !reference_table_which_renders_nicely_on_github.csv to exclude it from the exclusion list.
- If you want to write basic math in your report, add the following option under the github_document format in the YAML header: pandoc_args: --webtex
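To make the .gitignore and YAML tips above concrete, here is a minimal sketch that scaffolds a throwaway project with both files and then asks git how it would treat a few file names. The project directory, the report title and the file names model_cache.rds and raw_data.csv are assumptions for illustration; only the ignore patterns and the pandoc_args option come from the tips above.

```shell
#!/bin/sh
set -e

# Hypothetical project directory, created only for this demonstration.
project=$(mktemp -d)
cd "$project"
git init -q

# Ignore patterns from the tips above; later lines override earlier ones,
# so the "!" entry has to come after the *.csv entry.
cat > .gitignore <<'EOF'
# cache files and saved R objects
*.RData
*.rdb
*.rdx
*.rds
*__packages
# ignore CSV files in general
*.csv
# but keep the one reference table that should render on Github
!reference_table_which_renders_nicely_on_github.csv
EOF

# YAML header of an Rmd file; the title is a placeholder.
cat > report.Rmd <<'EOF'
---
title: "Example report"
output:
  github_document:
    pandoc_args: --webtex
---
EOF

# git check-ignore exits with 0 when a path matches an ignore rule.
for f in model_cache.rds raw_data.csv reference_table_which_renders_nicely_on_github.csv; do
  if git check-ignore -q "$f"; then
    echo "ignored:     $f"
  else
    echo "not ignored: $f"
  fi
done
```

With these patterns the first two names should come out as ignored and the reference table as not ignored, which is a quick way to check a .gitignore before the first push.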
Sample Reproducible development repo
Feel free to clone the sample reproducible development repo below and get your reproducible project running ASAP!