Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Parallel query of SQLite database in R

I have a large database (~100Gb) from which I need to pull every entry, perform some comparisons on it, and then store the results of those comparisons. I have attempted to run parallel queries within a single R sessions without any success. I can just run multiple R sessions all at once but I am looking for a better approach. Here is what I attempted:

library(RSQLite)
library(data.table)
library(foreach)
library(doMC)



#---------
# SETUP
#---------


#connect to db
db <- dbConnect(SQLite(), dbname="genes_drug_combos.sqlite")


#---------
# QUERY
#---------
# 856086 combos = 1309 * 109 * 6

registerDoMC(8)

#I would run 6 seperate R sessions (one for each i)
res_list <- foreach(i=1:6) %dopar% {

  a <- i*109-108
  b <- i*109

  pb  <- txtProgressBar(min=a, max=b, style=3)
  res <- list()

  for (j in a:b) {

    #get preds for drug combos
    statement   <- paste("SELECT * from combo_tstats WHERE rowid BETWEEN", (j*1309)-1308, "AND", j*1309)
    combo_preds <- dbGetQuery(db, statement)

    #here I do some stuff to the result returned from the query
    combo_names <- combo_preds$drug_combo
    combo_preds <- as.data.frame(t(combo_preds[,-1]))

    colnames(combo_preds)  <- combo_names

    #get top drug combos
    top_combos <- get_top_drugs(query_genes, drug_info=combo_preds, es=T)

    #update progress and store result
    setTxtProgressBar(pb, j)
    res[[ length(res)+1 ]] <- top_combos
  }
  #bind results together
  res <- rbindlist(res)
}

I dont get any errors but only one core spins up. In contrast, if I run multiple R sessions, all my cores go at it. What am I doing wrong?

like image 438
alexvpickering Avatar asked Oct 19 '25 13:10

alexvpickering


1 Answers

Some things I have learned while accessing concurrently with RSQLite the same file SQLite database:

1. Make sure each worker has its own DB connection.

  parallel::clusterEvalQ(cl = cl, {
    db.conn <- RSQLite::dbConnect(RSQLite::SQLite(), "./export/models.sqlite");
    RSQLite::dbClearResult(RSQLite::dbSendQuery(db.conn, "PRAGMA busy_timeout=5000;"));
  })

2. Use PRAGMA busy_timeout=5000;

By default this is set to 0, and chances are that you will end up with a "database is locked" error each time your worker tries to write to the DB while it is locked. Previous code sets this PRAGMA in each worker connection. Note that SELECT operations are never locked, only INSERT/DELETE/UPDATE.

3. Use PRAGMA journal_mode=WAL;

This only has to be set once and stays on by default forever. It will add two (more or less permanent) files to the DB. It will improve concurrent read/write performance. Read more here.

With the above settings I have not experienced this issue.

like image 182
Davor Josipovic Avatar answered Oct 22 '25 18:10

Davor Josipovic



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!