I'm running code that takes a very long time to compute. I parallelized it using foreach() with %dopar% and run it on a cluster.
It generally runs fine, but sometimes it crashes with the following error:
Error in { : task 4 failed - "missing value where TRUE/FALSE needed"
Calls: %dopar% -> <Anonymous>
Execution halted
It says "Execution halted", but apparently only for that particular core: the others keep running, and at the end the job fails to write any output without telling me beforehand.
I guess it's a problem with an if statement. I tried reproducing it on my computer, but the failure is so rare that I can't trigger it.
The code easily runs for 100 hours, doing as many as 100,000 iterations, and only one of them fails.
My questions are: Can I get a traceback showing where the error occurred? (I run the code on a cluster, so I don't have all the nice RStudio tools.)
Also, is it possible to still output from a foreach() loop even if one of the tasks crashed?
Or is there a method people use to make the crash happen on my own computer?
I can post the code if needed; please ask if it would help.
The foreach ".errorhandling" argument is intended to help in this situation. If you want foreach to pass errors through, then use .errorhandling="pass". If you want it to filter out errors (which reduces the length of the result), then use .errorhandling="remove". The default value is "stop" which throws an error indicating which task failed.
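As a rough illustration of .errorhandling="pass", here is a minimal sketch that runs on the sequential backend (registerDoSEQ) so it can be tried on any machine. The function risky() is a made-up stand-in for the real task, and the NA on task 4 is an assumption chosen to reproduce the same kind of error message as in the question:

```r
library(foreach)
registerDoSEQ()  # sequential backend, used here only so the sketch runs anywhere

# Hypothetical task: task 4 produces an NA, so the 'if' condition fails
risky <- function(i) {
  x <- if (i == 4) NA else i
  if (x > 2) x * 10 else x
}

# "pass" stores the error object in the result list instead of stopping
res <- foreach(i = 1:6, .errorhandling = "pass") %dopar% risky(i)

# Find the failed tasks and look at their messages
failed <- vapply(res, inherits, logical(1), what = "error")
print(which(failed))
print(conditionMessage(res[[4]]))

# The successful results remain available and can still be written out
good <- res[!failed]
```

With this pattern the indices of the failed tasks are known at the end of the run, so only those tasks need to be rerun locally for debugging.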
Unfortunately, most parallel backends don't support tracebacks, but doMPI does. You simply call "startMPIcluster" with verbose=TRUE, and the traceback will be written to the log file of the worker that had the error. Here's an example that generates an error on task 42:
suppressMessages(library(doMPI))
cl <- startMPIcluster(4, verbose=TRUE)
registerDoMPI(cl)
g <- function(i) {
  if (i == 42) {
    # 'if (NULL)' is an error: "argument is of length zero"
    if (NULL) cat('hello, world\n')
  }
  7
}
f <- function(i) g(i)
r <- foreach(i=1:50, .errorhandling='pass') %dopar% f(i)
print(r)
closeCluster(cl)
mpi.quit()
Since it uses .errorhandling="pass", the script runs to completion, with an error object returned in element 42 of the result list. In addition, one of the log files contains a traceback of the error (along with many other messages):
waiting for a taskchunk...
executing taskchunk 42 containing 1 tasks
error executing task: argument is of length zero
traceback (most recent call first):
> g(i)
> f(i)
> eval(expr, envir, enclos)
> eval(expr, envir)
> executeTask(taskchunk$argslist[[1]])
> executeTaskChunk(cl$workerid, taskchunk, envir, err, cores)
returning error results for taskchunk 42
Unfortunately, doMPI is mostly used on Linux systems, so this isn't helpful to most Mac and Windows users.
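For backends without traceback support, one possible workaround is to capture the call stack inside the task itself. The following is only a sketch under that assumption: with_traceback() is a hypothetical helper (not part of foreach or doMPI) that uses base R's withCallingHandlers() and sys.calls() to record the stack and the task arguments when an error occurs. It reuses f and g from the example above and runs on the sequential backend for illustration:

```r
library(foreach)
registerDoSEQ()  # sequential backend, only so the sketch is runnable anywhere

g <- function(i) {
  if (i == 42) {
    if (NULL) cat("hello, world\n")  # fails: argument is of length zero
  }
  7
}
f <- function(i) g(i)

# Hypothetical helper: call fun(...); on error, return a record holding the
# message, the arguments, and the deparsed call stack instead of failing.
with_traceback <- function(fun, ...) {
  calls <- NULL
  tryCatch(
    withCallingHandlers(
      fun(...),
      # Runs while the failing frames are still on the stack
      error = function(e) calls <<- sys.calls()
    ),
    error = function(e) {
      list(
        failed  = TRUE,
        message = conditionMessage(e),
        args    = list(...),
        calls   = vapply(calls,
                         function(cl) paste(deparse(cl), collapse = " "),
                         character(1))
      )
    }
  )
}

r <- foreach(i = 1:50) %dopar% with_traceback(f, i)

# Keep only the error records; each one says which input failed and where
bad <- Filter(function(x) is.list(x) && isTRUE(x$failed), r)
print(bad[[1]]$message)
print(bad[[1]]$args)
```

Because each record stores the arguments of the failing task, the same call can be repeated on a local machine. If the task uses random numbers, the seed would also need to be recorded for the rerun to reproduce the failure.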