Issue
I have an R pipeline for running analyses on a big dataset. Currently I start an analysis by calling my script from the terminal and passing it my analysis parameters:
$ ./my_script.R --parameter1 a1 --parameter2 b1
The script loads the dataset from a .Rds file, but it takes more than a minute to load, every time I start the script.
Is there a way to keep the dataset in memory so that I can run multiple analyses in a row (meaning
$ ./my_script.R --parameter1 a2 --parameter2 b2
etc.)? Using the global environment, maybe?
Thanks!
Solution
One way to attack this problem is to let the user specify multiple pairs of arguments in a single script call, so that the program can iterate over all of them at once and incur the startup cost only one time.
Here's a sample script that uses a couple of things:

- library(optparse), for ease of argument parsing. There are other packages, and none is strictly required; I find it makes things look easy.
- The ability for the script to know whether it is being sourced (and therefore not run some code, useful for dev/testing) or being run from the command line (which would trigger that code to run). This is similar to python's if __name__ == '__main__': trick, something I answered a while ago at https://stackoverflow.com/a/47932989/3358272.

Neither of them is strictly necessary, but I find they help demonstrate how to structure the script so that you can facilitate "one or more"-type operations.
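Stripped to its essence, the "main guard" looks like this; greet() is a hypothetical stand-in for your real analysis functions:

```r
greet <- function(name) {
  paste("hello,", name)
}

if (sys.nframe() == 0L) {
  # This block runs when the file is executed (e.g. via Rscript), but is
  # skipped when the file is source()d during interactive development,
  # because source() adds frames to the call stack.
  message(greet("world"))
}
```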
#!/usr/bin/env r

startup <- function() {
  message(Sys.time(), " Some expensive data load ...")
  Sys.sleep(3)
}

func1 <- function(x, y) {
  message(Sys.time(), " Called with (x,y): ", jsonlite::toJSON(list(x = x, y = y)))
}

if (sys.nframe() == 0L) {
  library(optparse)
  P <- OptionParser()
  P <- add_option(P, c("--param1"), dest = "p1", type = "character",
                  help = "Parameter 1", metavar = "P1")
  P <- add_option(P, c("--param2"), dest = "p2", type = "character",
                  help = "Parameter 2", metavar = "P2")
  P <- add_option(P, c("--param-csv"), dest = "pcsv", type = "character",
                  help = "CSV file with parameters in each column", metavar = "FILE")
  args <- parse_args(P, commandArgs(trailingOnly = TRUE))

  if (!is.null(args$pcsv)) {
    if (!file.exists(args$pcsv)) {
      stop("file not found: ", sQuote(args$pcsv))
    }
    params <- read.csv(args$pcsv, header = FALSE)
    if (ncol(params) < 2L) {
      stop("file does not have (at least) 2 columns")
    }
  } else {
    params <- data.frame(
      p1 = sapply(strsplit(args$p1, "[,[:space:]]+")[[1]], trimws),
      p2 = sapply(strsplit(args$p2, "[,[:space:]]+")[[1]], trimws)
    )
  }

  startup()
  for (rownum in seq_len(nrow(params))) {
    func1(params[[1]][rownum], params[[2]][rownum])
  }
}
For the sake of this demo, startup is you loading your .Rds file (which takes 3 seconds here), and func1 is the rest of whatever processing you might be doing. (As a general hint, I tend to do as little work as possible within the sys.nframe() == 0L block, so that the functions I write above it can be used interactively or via the script. It's just one way to organize code.)
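In your case, startup() would return the dataset instead of just sleeping. A hedged sketch of what that might look like ("mydata.Rds" is a placeholder path, not from the original post):

```r
# Hypothetical startup() that loads the .Rds file once and returns it.
startup <- function(path = "mydata.Rds") {
  message(Sys.time(), " Loading ", sQuote(path), " ...")
  readRDS(path)
}
```

The loop would then pass the returned object to func1() alongside each parameter pair, so the expensive readRDS() happens a single time per invocation.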
This script supports three modalities:
your default invocation
$ Rscript 64287443.R --param1 foo1 --param2 bar1
2020-10-09 15:33:48 Some expensive data load ...
2020-10-09 15:33:51 Called with (x,y): {"x":["foo1"],"y":["bar1"]}
one "job" at a time.
multiple comma-separated arguments, as in
$ Rscript 64287443.R --param1 foo1,foo2 --param2 bar1,bar2
2020-10-09 15:33:55 Some expensive data load ...
2020-10-09 15:33:58 Called with (x,y): {"x":["foo1"],"y":["bar1"]}
2020-10-09 15:33:58 Called with (x,y): {"x":["foo2"],"y":["bar2"]}
which is equivalent to running
$ Rscript 64287443.R --param1 foo1 --param2 bar1
$ Rscript 64287443.R --param1 foo2 --param2 bar2
except that it is only incurring the startup cost once.
a CSV file of jobs, one param per column.
$ cat params.csv
foo1,bar1
foo2,bar2
foo3,bar3
$ Rscript 64287443.R --param-csv params.csv
2020-10-09 15:35:15 Some expensive data load ...
2020-10-09 15:35:18 Called with (x,y): {"x":["foo1"],"y":["bar1"]}
2020-10-09 15:35:18 Called with (x,y): {"x":["foo2"],"y":["bar2"]}
2020-10-09 15:35:18 Called with (x,y): {"x":["foo3"],"y":["bar3"]}
TODO:
- the logic to strsplit the comma-separated vectors for --param1 and --param2 is trusting, and should be broken down a little to test for unequal pairings and either error out or do something meaningful; as of now, it will fail
- in general, there is very little error checking here, but that's context-sensitive
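A hedged sketch of the first TODO item, using a helper (split_params(), not in the original script) that refuses unequal pairings:

```r
# Hypothetical helper: split both comma-separated arguments and error out
# early if the values do not pair up one-to-one.
split_params <- function(p1, p2) {
  v1 <- trimws(strsplit(p1, "[,[:space:]]+")[[1]])
  v2 <- trimws(strsplit(p2, "[,[:space:]]+")[[1]])
  if (length(v1) != length(v2)) {
    stop("unequal pairing: --param1 has ", length(v1),
         " value(s) but --param2 has ", length(v2))
  }
  data.frame(p1 = v1, p2 = v2)
}
```

The else branch of the script would then become params <- split_params(args$p1, args$p2), turning a confusing downstream failure into an immediate, readable error.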
Answered By - r2evans