I have a number of R scripts that I would like to chain together using a UNIX-style pipeline. Each script would take as input a data frame and provide a data frame as output. For example, I am imagining something like this that would run in R's batch mode.
cat raw-input.Rds | step1.R | step2.R | step3.R | step4.R > result.Rds
Any thoughts on how this could be done?
Writing executable scripts is not the hard part; what is tricky is making the scripts read from files and/or pipes. I wrote a somewhat general function for that here: https://stackoverflow.com/a/15785789/1201032
Here is an example where the I/O takes the form of CSV files:
Your step?.R files should look like this:
#!/usr/bin/Rscript

# Return a read connection: standard input for "-" or "/dev/stdin",
# a fifo for process-substitution paths, or a regular file otherwise.
OpenRead <- function(arg) {
  if (arg %in% c("-", "/dev/stdin")) {
    file("stdin", open = "r")
  } else if (grepl("^/dev/fd/", arg)) {
    fifo(arg, open = "r")
  } else {
    file(arg, open = "r")
  }
}

args <- commandArgs(TRUE)
file <- args[1]
fh.in <- OpenRead(file)
df.in <- read.csv(fh.in)
close(fh.in)

# do something
df.out <- df.in

# print output
write.csv(df.out, file = stdout(), row.names = FALSE, quote = FALSE)
and your CSV input file should look like this:
col1,col2
a,1
b,2
Now this should work:
cat in.csv | ./step1.R - | ./step2.R -
The - arguments are annoying but necessary. Also make sure to run something like chmod +x ./step?.R to make your scripts executable. Finally, you could store them (without the extension) in a directory that you add to your PATH, so you can run them like this:
cat in.csv | step1 - | step2 -
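If you would rather keep the .Rds format from the question instead of CSV, the same pattern works with binary connections. Here is a minimal, untested sketch of what a step could look like on a Unix-like system; it reads standard input directly (so no - argument is needed), wraps both ends in gzcon() so the stream is an ordinary gzip-compressed .Rds, and writes through pipe("cat", "wb") because stdout() is a text-mode connection that cannot carry the binary stream saveRDS produces:

#!/usr/bin/Rscript
# read a serialized data frame from standard input
# (gzcon also accepts non-compressed input)
con.in <- gzcon(file("stdin", open = "rb"))
df.in <- readRDS(con.in)
close(con.in)

# do something
df.out <- df.in

# write the result as a gzip-compressed .Rds stream to standard output;
# stdout() is text-only, so go through a binary pipe to cat
con.out <- gzcon(pipe("cat", open = "wb"))
saveRDS(df.out, file = con.out)
close(con.out)

With steps written like that, the pipeline from the question should work as-is:

cat raw-input.Rds | ./step1.R | ./step2.R | ./step3.R | ./step4.R > result.Rds

and result.Rds can then be read back with a plain readRDS("result.Rds").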
Why on earth you want to cram your workflow into pipes when you have the whole R environment available is beyond me.
Make a main.r containing the following:
source("step1.r")
source("step2.r")
source("step3.r")
source("step4.r")
That's it. You don't have to convert the output of each step into a serialised format; instead you can just leave all your R objects (datasets, fitted models, predicted values, lattice/ggplot graphics, etc) as they are, ready for the next step to process. If memory is a problem, you can rm any unneeded objects at the end of each step; alternatively, each step can work with an environment which it deletes when done, first exporting any required objects to the global environment.
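For example, the environment idea could be arranged from main.r roughly like this (sys.source() is just one way to do it, and fitted_model is a made-up object name):

# run step2.r inside its own environment
step2_env <- new.env()
sys.source("step2.r", envir = step2_env)

# keep only what later steps need, then drop the rest
fitted_model <- step2_env$fitted_model
rm(step2_env)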
If modular code is desired, you can recast your workflow as follows. Encapsulate the work done by each file into one or more functions. Then call these functions in your main.r with the appropriate arguments.
source("step1.r") # defines step1_read_input, step1_f2
source("step2.r") # defines step2_f2
source("step3.r") # defines step3_f1, step3_f2, step3_f3
source("step4.r") # defines step4_write_output
step1_read_input(...)
step1_f2(...)
....
step4_write_output(...)
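To make that concrete, here is a rough sketch of how the pieces could fit together; the function bodies and signatures are invented for illustration, and only the names come from the comments above:

# step1.r might define, e.g.:
step1_read_input <- function(path) readRDS(path)

# step4.r might define, e.g.:
step4_write_output <- function(dat, path) saveRDS(dat, path)

# main.r then threads the data frame through the steps explicitly:
source("step1.r")
source("step2.r")
source("step4.r")

dat <- step1_read_input("raw-input.Rds")
dat <- step2_f2(dat)
step4_write_output(dat, "result.Rds")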