I have a large database (~100 GB) from which I need to pull every entry, perform some comparisons on it, and then store the results of those comparisons. I have tried to run parallel queries within a single R session, without success. I could just run multiple R sessions at once, but I am looking for a better approach. Here is what I attempted:
library(RSQLite)
library(data.table)
library(foreach)
library(doMC)
#---------
# SETUP
#---------
#connect to db
db <- dbConnect(SQLite(), dbname="genes_drug_combos.sqlite")
#---------
# QUERY
#---------
# 856086 combos = 1309 * 109 * 6
registerDoMC(8)
#I would run 6 separate R sessions (one for each i)
res_list <- foreach(i=1:6) %dopar% {
  a <- i*109-108
  b <- i*109
  pb <- txtProgressBar(min=a, max=b, style=3)
  res <- list()
  for (j in a:b) {
    #get preds for drug combos
    statement <- paste("SELECT * from combo_tstats WHERE rowid BETWEEN", (j*1309)-1308, "AND", j*1309)
    combo_preds <- dbGetQuery(db, statement)
    #here I do some stuff to the result returned from the query
    combo_names <- combo_preds$drug_combo
    combo_preds <- as.data.frame(t(combo_preds[,-1]))
    colnames(combo_preds) <- combo_names
    #get top drug combos
    top_combos <- get_top_drugs(query_genes, drug_info=combo_preds, es=T)
    #update progress and store result
    setTxtProgressBar(pb, j)
    res[[ length(res)+1 ]] <- top_combos
  }
  #bind results together
  res <- rbindlist(res)
}
I don't get any errors, but only one core spins up. In contrast, if I run multiple R sessions, all my cores go at it. What am I doing wrong?
Some things I have learned while accessing the same SQLite database file concurrently with RSQLite. First, open a separate connection in each worker:
parallel::clusterEvalQ(cl = cl, {
  db.conn <- RSQLite::dbConnect(RSQLite::SQLite(), "./export/models.sqlite");
  RSQLite::dbClearResult(RSQLite::dbSendQuery(db.conn, "PRAGMA busy_timeout=5000;"));
})
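PRAGMA busy_timeout=5000; By default this is set to 0, and chances are that you will end up with a "database is locked" error whenever a worker tries to write to the DB while it is locked. The code above sets this PRAGMA on each worker's connection, so a blocked worker waits up to 5000 ms before giving up. Note that it is mainly INSERT/DELETE/UPDATE operations that run into the lock. In WAL mode (see below) readers are not blocked by a writer; in the default rollback-journal mode a SELECT can still be blocked briefly while a writer commits.
PRAGMA journal_mode=WAL; This only has to be set once: the setting is stored in the database file and persists across connections. It adds two (more or less permanent) files next to the DB, with the suffixes -wal and -shm. It improves concurrent read/write performance. See the SQLite documentation on write-ahead logging for details.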
With the above settings I have not run into the "database is locked" error.
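To put the pieces together, here is a minimal sketch of the pattern, not a drop-in fix for the code in the question. It assumes a small throwaway database in a temp file; the names db_path, worker_con and chunks are made up for illustration, and the table contents are dummy data. Each worker opens its own connection and sets busy_timeout, and the row ranges are then queried in parallel.

```r
library(DBI)
library(RSQLite)
library(parallel)

# Hypothetical stand-in for the real database: a temp file with 12 dummy rows.
db_path <- tempfile(fileext = ".sqlite")
con <- dbConnect(SQLite(), db_path)
dbGetQuery(con, "PRAGMA journal_mode=WAL;")   # set once; stored in the DB file
dbWriteTable(con, "combo_tstats",
             data.frame(drug_combo = paste0("c", 1:12), tstat = as.numeric(1:12)))
dbDisconnect(con)

cl <- makeCluster(2)
clusterExport(cl, "db_path")

# One connection per worker, created on the worker itself.
clusterEvalQ(cl, {
  worker_con <- DBI::dbConnect(RSQLite::SQLite(), db_path)
  DBI::dbGetQuery(worker_con, "PRAGMA busy_timeout=5000;")
  NULL  # avoid sending the connection object back to the master
})

# Each chunk is a (first rowid, last rowid) pair handled by one worker call.
chunks <- list(c(1, 6), c(7, 12))
res <- parLapply(cl, chunks, function(r) {
  DBI::dbGetQuery(
    worker_con,
    "SELECT * FROM combo_tstats WHERE rowid BETWEEN ? AND ?",
    params = list(r[1], r[2])
  )
})

# Clean up the worker connections and the cluster.
clusterEvalQ(cl, DBI::dbDisconnect(worker_con))
stopCluster(cl)
```

In the question's setup the same idea would mean calling dbConnect inside the foreach body (or once per worker) rather than sharing the db handle created in the parent session.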