GNU Make paper published in JSS

Using GNU Make to Manage the Workflow of Data Analysis Projects

It’s been a long time since my last post about useR! 2019 in Toulouse. Since then there’s been very little time for thinking about research, never mind blogging about it. Work, in the form of redesigning courses, coordinating large courses, and teaching, took over when I got back home (apart from a great 10-day trip to Namibia just before the COVID-19 lockdown hit).

On 30 August, “Using GNU Make to Manage the Workflow of Data Analysis Projects” was finally published in the Journal of Statistical Software: doi: 10.18637/jss.v094.c01

From now on, I’m hoping to get back into thinking, coding and writing but I’m always an optimist.

UseR!2019 talk available here

It looks like the direct link to the talk on GitLab is blocked. Thanks, Ray, for letting me know!

The talk may be downloaded here.

If you want to obtain the R Markdown files for the talk, you may be able to reach the GitLab site from the link at the top left of this page. If you are trying to install the `gnumaker` package and don’t have the graphics packages it depends on, you may need to install one or more of them via Bioconductor.

I’ll fix the links and installation instructions when I get back to Australia.

In the meantime, I’m having a lovely holiday in France and will be back online in 10 days or so.

Peter

Paper on using GNU Make for data analysis workflow accepted by JSS

My recently accepted and revised paper (Code Snippet) Using GNU Make to Manage the Workflow of Data Analysis Projects is now in the publication queue at the Journal of Statistical Software. A final draft version, which is subject to change, may be downloaded here.

The article describes GNU Make pattern rules for R, Sweave, R Markdown, SAS, Stata, Perl and Python to streamline management of data analysis and reporting projects. Rules are used by adding a single line to project Makefiles. Additional flexibility is available for modifying standard program options. An overall strategy is outlined for Makefile construction and illustrated via simple and complex examples.

The GNU Makefile rules described in the paper may be found at https://github.com/petebaker/r-makefile-definitions

Alpha version of codebookr for R is now on github

Alpha version of codebookr now on github

Version 0.0.0.9003 of the R package codebookr is now available. codebookr aids cleaning, checking and formatting datasets using codebook metadata.

It is very preliminary but currently reads codebooks supplied as spreadsheets and can also

  • check limits (continuous variables)
  • check categories (categorical variables/factors)

Check it out at http://github.com/petebaker/codebookr

R Makefile definitions updated

2016-06-23 at 16:23:57 (Version 0.2.9004)

  1. fixed Rscript --vanilla R CMD --vanilla bug for latexmk
  2. added variables for programs like cat, rm, pdfjam, latexmk to include PROG_OPTS to set options and LATEXMK_PRE which can be prepended to latex
  3. changed outputting R syntax from .Rmd and .Rnw files to produce -syntax.R to avoid dependency loops where a .tex file then .pdf might be produced instead of using rmarkdown
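
One way to picture item 3 is the sketch below. It is not the rule shipped in the definitions file, only an illustration of the naming idea: the R code extracted from test.Rmd goes to test-syntax.R rather than test.R, so the extracted file cannot be picked up as the source for another pattern rule. The use of knitr::purl here is an assumption about how the extraction could be done. Recipe lines start with a tab character.

```make
# Sketch only: the rule actually used in the definitions file may differ.

# Extract R code chunks from an R Markdown file into <name>-syntax.R
%-syntax.R: %.Rmd
	Rscript -e 'knitr::purl("$<", output = "$@")'

# Same idea for knitr-style .Rnw files
%-syntax.R: %.Rnw
	Rscript -e 'knitr::purl("$<", output = "$@")'
```

With rules like these, `make test-syntax.R` would write the extracted code next to test.Rmd without creating a test.R file.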

Latest version available at

github: R Makefile definitions

2016-06-19 at 23:27:34

  1. added in various rmarkdown outputs like ioslides, slidy, beamer, tufte, rtf, odt
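
As a rough illustration of what such rules can look like, the sketch below renders a single .Rmd file to several rmarkdown output formats. The target naming scheme (for example a `_ioslides.html` suffix) is hypothetical and is not taken from the definitions file; only the rmarkdown output format names are standard. Recipe lines start with a tab character.

```make
# Sketch only: target names are hypothetical, not those used in common.mk.

# ioslides presentation from R Markdown
%_ioslides.html: %.Rmd
	Rscript -e 'rmarkdown::render("$<", output_format = "ioslides_presentation", output_file = "$@")'

# beamer presentation from R Markdown
%_beamer.pdf: %.Rmd
	Rscript -e 'rmarkdown::render("$<", output_format = "beamer_presentation", output_file = "$@")'

# OpenDocument text from R Markdown
%.odt: %.Rmd
	Rscript -e 'rmarkdown::render("$<", output_format = "odt_document", output_file = "$@")'
```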

2016-05-19 at 11:58:34

  1. modified beamer from .Rnw to be more generic
  2. added beamer example and preamble .Rnw files which can be used to produce presentations, handouts, notes, articles and multiple-slides-per-page handouts from a single .Rnw file
    • see make help-beamer

useR! 2015 Aalborg Tutorial

Efficient statistical consulting using R: Workflow for data analysis projects
Tutorial Tuesday, June 30 at 9:00-12:00

Pre-tutorial:

Please install the dryworkflow package from github using these commands in R:

library(devtools)  # available on CRAN (or github)
devtools::install_github("petebaker/dryworkflow", dependencies = TRUE)

You should also install GNU Make and git. Git should already be installed if you have RStudio installed. Make will already be installed on Linux systems. Windows users should install Rtools, and Mac OS X users should install Xcode. For more details please see https://github.com/petebaker/dryworkflow

dryworkflow package 0.1.9016 available on github

The dryworkflow package produces a project skeleton for data analysis including R syntax files, report and Makefiles. Given data files and documents, it generates the skeleton with initial directories, template log files, template R syntax for data checking and initial analysis, and Makefiles, and it initialises a git repository.

Further details and installation instructions are available at https://github.com/petebaker/dryworkflow

Makefile definitions for R

R related Makefile definitions

GNU Make is commonly used to manage software projects written in languages like C or Python. For data analysis projects, its main strengths are that it lets the data analyst repeat just those steps needed when data or R syntax is changed, and that it clearly outlines the steps required.

[Image: Makefile2-2015-03-29.png]

Data analysis can involve many steps, including reading data, cleaning and transforming data, plotting data, statistical analysis and finally writing reports. While we can try to keep track of each step manually by using good documentation and being highly organised, it can prove more efficient to employ computer tools to augment these practices. One such approach is to use a tool like GNU Make. Such an approach does not obviate the need to be organised and to document the work, but it can certainly prove helpful, especially as a project grows in size.

While it is far from perfect, make is widely used in software development and also proves useful for efficiently carrying out tasks in data analysis. Unfortunately, make does not provide standard rules for producing .Rout files from .R files, .pdf files from .Rnw files, .docx files from .Rmd files and so on. It is straightforward to define a pattern rule to output a .Rout file from a .R syntax file by including the following two lines in a Makefile:

%.Rout: %.R
<TAB> R CMD BATCH --vanilla $<

which runs the command R CMD BATCH --vanilla to produce the output file. The left-hand side of the colon (:) is the target, which depends on the prerequisite file(s) to the right of the colon. Here, % is a wildcard. So, for any .R syntax file, say mySyntax.R, you can use ‘make mySyntax.Rout’ to produce the .Rout output file. Note that nothing happens if the target is newer than the prerequisite, since it is already ‘up to date’.

To actually use this rule in practice, we may have several prerequisite files, such as an R syntax file and several data files. In the Makefile we may specify the dependencies as

readData.Rout: readData.R data1.csv data2.csv oldData.RData

so we can run the syntax file by typing ‘make readData.Rout’ at the command prompt. If any of the files readData.R, data1.csv, data2.csv or oldData.RData have changed recently, and so are newer than the target file readData.Rout, then the predefined R batch command is run to get a new output file; otherwise readData.Rout is ‘up to date’.

Similar rules can be set up for producing reports from markdown or Sweave files. The file common.mk contains many such rules and can be included in a standard Makefile to facilitate a more efficient workflow. You can obtain common.mk from GitHub at https://github.com/petebaker/r-makefile-definitions
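
To make the idea concrete, here is a minimal hand-written sketch of a Makefile that combines the .Rout pattern rule with rules for reports. It is written in the spirit of common.mk but is not copied from it; the file names report.Rmd and report.html are hypothetical, and the choice of rmarkdown::render and R CMD Sweave --pdf is an assumption about how such rules might be written. Recipe lines start with a tab character.

```make
# Sketch only: simplified rules, not the definitions found in common.mk.

.PHONY: all
all: readData.Rout report.html

# Run any R syntax file in batch mode; $< is the first prerequisite
%.Rout: %.R
	R CMD BATCH --vanilla $<

# Render an R Markdown file to HTML; $@ is the target being built
%.html: %.Rmd
	Rscript -e 'rmarkdown::render("$<", output_format = "html_document", output_file = "$@")'

# Weave a Sweave file and compile the result to PDF
%.pdf: %.Rnw
	R CMD Sweave --pdf $<

# Extra prerequisites: rerun readData.R whenever any input file changes
readData.Rout: readData.R data1.csv data2.csv oldData.RData
```

Typing ‘make’ would then build both targets listed under all, rerunning only the steps whose prerequisites have changed.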

Using common.mk

  1. Download the file to a directory you commonly use to store functions and definitions. Ideally, this would be something like:
    • ~/lib or C:\MyLibrary
  2. Put the following line in your Makefile:
    • include ~/lib/common.mk where ~ will be expanded to be your HOME directory, or
    • include C:/MyLibrary/common.mk (on Windows)
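
If you would like a clearer message when the file has not been downloaded yet, one option is to check for it before including it. The sketch below assumes the ~/lib location suggested above; the variable name COMMON_MK is made up for this example and is not part of common.mk.

```make
# Sketch only: COMMON_MK is a hypothetical variable name for this example.
# make imports environment variables, so $(HOME) is your home directory.
COMMON_MK := $(HOME)/lib/common.mk

# Stop early with a readable message if the file is missing
ifeq ($(wildcard $(COMMON_MK)),)
$(error common.mk not found at $(COMMON_MK))
endif

include $(COMMON_MK)
```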

Example Makefiles

.PHONY : all
all: test.pdf test.html test.docx test2.pdf test-stitch.Rout test-stitch.pdf

## produce pdf, html, docx from test.Rmd
test.pdf: ${@:.pdf=.Rmd}
test.html: ${@:.html=.Rmd}
test.docx: ${@:.docx=.Rmd}

## produce pdf from test2.rmd
test2.pdf: ${@:.pdf=.rmd}

## use stitch to produce pdf via rmarkdown (exactly as in RStudio)
test-stitch.pdf: ${@:.pdf=.R}

## if you have common.mk in ~/lib directory comment line below 
## and uncomment the second line
include common.mk
##include ~/lib/common.mk

More usually, if there is a sequence of steps relying on a data file, say myData.csv  then your Makefile may look something like

.PHONY: all
all: report.pdf

## produce report from .Rmd once previous steps carried out
report.pdf: ${@:.pdf=.Rmd} summaryAndPlots.Rout

## summarise data
summaryAndPlots.Rout: ${@:.Rout=.R} read.Rout

## read data
read.Rout: ${@:.Rout=.R} myData.csv

include ~/lib/common.mk

Run this with the command ‘make’ at the command line or even better: set up RStudio or your editor to do this at the press of a button.
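
The same pipeline can also be written with each prerequisite spelled out by name, which some readers may find easier to scan. This is a sketch that assumes common.mk supplies the pattern rules for .Rout and .pdf targets. The GNU Make manual notes that automatic variables such as $@ only have values within a recipe, so naming each source file explicitly states the intended dependency directly rather than relying on the pattern rule to find it.

```make
.PHONY: all
all: report.pdf

## report depends on its .Rmd source and on the summary step
report.pdf: report.Rmd summaryAndPlots.Rout

## summary step depends on its R syntax file and on the read step
summaryAndPlots.Rout: summaryAndPlots.R read.Rout

## read step depends on its R syntax file and the raw data
read.Rout: read.R myData.csv

include ~/lib/common.mk
```

After editing myData.csv, ‘make’ should rerun all three steps; after editing only report.Rmd, just the report is rebuilt.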

Prerequisites

To use these makefile definitions you need to install

Note that Windows users can install Rtools (available from CRAN) to get a working version of make and may also need to install pandoc and LaTeX to produce pdf files if they haven’t already. MiKTeX is recommended, although TeX Live will also work well.

Notes

Definitions in ‘common.mk’ have been developed and tested on Linux and tested on Windows. Some tweaking may be required to suit your set up or your preferred workflow.

Peter Baker