Wednesday, October 27, 2021

[SOLVED] How can I reuse an R variable from one shell script to the other?

Issue

I have an R pipeline to run analyses on a big dataset. Currently I start an analysis by calling my script from the terminal and passing it my analysis parameters:

    $ ./my_script.R --parameter1 a1 --parameter2 b1

The script loads the dataset from a .Rds file, which takes more than a minute every time I start the script.

Is there a way to keep the dataset in memory to run multiple analyses in a row (meaning $ ./my_script.R --parameter1 a2 --parameter2 b2 etc.)? Using the global environment maybe?
Thanks!


Solution

One way to attack that problem is to let the user specify multiple pairs of arguments in a single script call, so that the program can iterate over all of them in one run and pay the startup cost only once.

Here's a sample script that uses a few things:

  1. library(optparse), for easier argument parsing. There are other packages for this and none of them is required; I find this one keeps things simple.
  2. The ability for the script to know if it is being sourced (and not run some code, useful for dev/testing) or being run from the command line (which would trigger some code to run). This is similar to python's if __name__ == '__main__': trick, something I answered a while ago as https://stackoverflow.com/a/47932989/3358272.

Neither of them is strictly necessary, but I find they help demonstrate how to structure the script so that it can handle "one or more" type operations.

#!/usr/bin/env Rscript
startup <- function() {
  message(Sys.time(), " Some expensive data load ...")
  Sys.sleep(3)
}

func1 <- function(x, y) {
  message(Sys.time(), " Called with (x,y): ", jsonlite::toJSON(list(x=x,y=y)))
}

if (sys.nframe() == 0L) {
  library(optparse)
  P <- OptionParser()
  P <- add_option(P, c("--param1"), dest = "p1", type = "character",
                  help = "Parameter 1", metavar = "P1")
  P <- add_option(P, c("--param2"), dest = "p2", type = "character",
                  help = "Parameter 2", metavar = "P2")
  P <- add_option(P, c("--param-csv"), dest = "pcsv", type = "character",
                  help = "CSV file with parameters in each column", metavar = "FILE")
  args <- parse_args(P, commandArgs(trailingOnly = TRUE))

  if (!is.null(args$pcsv)) {
    if (!file.exists(args$pcsv)) {
      stop("file not found: ", sQuote(args$pcsv))
    }
    params <- read.csv(args$pcsv, header = FALSE)
    if (!ncol(params) >= 2L) {
      stop("file does not have (at least) 2 columns")
    }
  } else {
    params <- data.frame(
      p1 = sapply(strsplit(args$p1, "[,[:space:]]+")[[1]], trimws),
      p2 = sapply(strsplit(args$p2, "[,[:space:]]+")[[1]], trimws)
    )
  }

  startup()

  for (rownum in seq_len(nrow(params))) {
    func1(params[[1]][rownum], params[[2]][rownum])
  }  
}

For the sake of this demo, startup stands in for loading your .Rds file (it takes 3 seconds here), and func1 is the rest of whatever processing you might be doing. (As a general hint, I tend to do as little work as possible within the sys.nframe() == 0 block, so that the functions I write above it can be used interactively or with the script. It's just one way to organize code.)
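
As a rough sketch of what the real pieces might look like, assuming the dataset lives in a single .Rds file: the loader returns the object, and the analysis function receives it as an argument, so nothing relies on the global environment. The names load_dataset and run_analysis, the dataset argument, and the temporary demo file are hypothetical and not part of the script above.

```r
# Hypothetical stand-ins for startup() and func1(); names are illustrative only.
load_dataset <- function(path) {
  message(Sys.time(), " Loading ", path, " ...")
  readRDS(path)  # the slow step, done once per script run
}

run_analysis <- function(dataset, x, y) {
  message(Sys.time(), " Analysing ", nrow(dataset), " rows with (x,y): ", x, ", ", y)
  invisible(NULL)
}

# Self-contained demo: write a small .Rds file so the sketch runs as-is.
demo_path <- tempfile(fileext = ".Rds")
saveRDS(mtcars, demo_path)

params <- data.frame(p1 = c("a1", "a2"), p2 = c("b1", "b2"))

dataset <- load_dataset(demo_path)  # loaded once, reused for every job
for (rownum in seq_len(nrow(params))) {
  run_analysis(dataset, params[[1]][rownum], params[[2]][rownum])
}
```

In the real script, load_dataset would be called once inside the sys.nframe() == 0L block and its result passed to each job in the loop.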

This script supports three modalities:

  • your default invocation

    $ Rscript 64287443.R --param1 foo1 --param2 bar1
    2020-10-09 15:33:48 Some expensive data load ...
    2020-10-09 15:33:51 Called with (x,y): {"x":["foo1"],"y":["bar1"]}
    

    one "job" at a time.

  • comma-separated multiple arguments, as in

    $ Rscript 64287443.R --param1 foo1,foo2 --param2 bar1,bar2
    2020-10-09 15:33:55 Some expensive data load ...
    2020-10-09 15:33:58 Called with (x,y): {"x":["foo1"],"y":["bar1"]}
    2020-10-09 15:33:58 Called with (x,y): {"x":["foo2"],"y":["bar2"]}
    

    which is equivalent to running

    $ Rscript 64287443.R --param1 foo1 --param2 bar1
    $ Rscript 64287443.R --param1 foo2 --param2 bar2
    

    except that it is only incurring the startup cost once.

  • a CSV file of jobs, one param per column.

    $ cat params.csv
    foo1,bar1
    foo2,bar2
    foo3,bar3
    
    $ Rscript 64287443.R --param-csv params.csv
    2020-10-09 15:35:15 Some expensive data load ...
    2020-10-09 15:35:18 Called with (x,y): {"x":["foo1"],"y":["bar1"]}
    2020-10-09 15:35:18 Called with (x,y): {"x":["foo2"],"y":["bar2"]}
    2020-10-09 15:35:18 Called with (x,y): {"x":["foo3"],"y":["bar3"]}
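
One caveat with the CSV route: read.csv guesses column types, so numeric-looking parameters lose their original text form. A minimal sketch, assuming the parameters should stay as strings; the temporary file and the "001"-style values are made up for the demo:

```r
# Demo file standing in for params.csv; the values are hypothetical.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("001,bar1", "002,bar2"), csv_path)

# Default behaviour: column types are guessed from the content.
guessed <- read.csv(csv_path, header = FALSE)

# Forcing every column to character keeps the values exactly as written.
as_text <- read.csv(csv_path, header = FALSE, colClasses = "character")

str(guessed$V1)  # integer column, leading zeros are gone
str(as_text$V1)  # character column, "001" and "002" preserved
```

Whether this matters depends on what the analysis expects; if the parameters are genuinely numeric, the default guessing is fine.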
    

TODO:

  • the logic to strsplit a comma-separated array for --param1 and --param2 is trusting, and should be broken down a little to test for unequal pairings, and either error or do something meaningful; as of now, unequal lengths are either recycled by data.frame() or stop with its generic error
  • in general, there is very little error checking here, but that's context-sensitive
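
A possible way to handle the first item, as a sketch only: split both strings up front, and stop with a clear message when the counts differ. The helper name split_pairs is hypothetical, and erroring out (rather than recycling the shorter side) is just one choice.

```r
# Hypothetical helper: turn two comma/space-separated strings into paired rows.
split_pairs <- function(p1, p2) {
  if (is.null(p1) || is.null(p2)) {
    stop("both --param1 and --param2 are required")
  }
  v1 <- strsplit(trimws(p1), "[,[:space:]]+")[[1]]
  v2 <- strsplit(trimws(p2), "[,[:space:]]+")[[1]]
  if (length(v1) != length(v2)) {
    stop("unequal number of values: ", length(v1), " for --param1, ",
         length(v2), " for --param2")
  }
  data.frame(p1 = v1, p2 = v2, stringsAsFactors = FALSE)
}

# Matching counts give one row per job.
params <- split_pairs("foo1,foo2", "bar1, bar2")

# Mismatched counts raise an error; capture its message for the demo.
mismatch <- tryCatch(
  split_pairs("foo1,foo2,foo3", "bar1,bar2"),
  error = function(e) conditionMessage(e)
)
```

In the script above, the else branch could then become params <- split_pairs(args$p1, args$p2).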


Answered By - r2evans