Fastest way to upload data via R to PostgreSQL 12

I am using the following code to connect to a PostgreSQL 12 database on Google Cloud SQL:

con <- DBI::dbConnect(odbc::odbc(), driver = driver, server = server,
                      database = database, uid = uid, pwd = pwd, port = port)

The following code then uploads the data:

DBI::dbCreateTable(con, tablename, df)
DBI::dbAppendTable(con, tablename, df)

where df is a data frame I have created in R. The data frame consists of ~ 550,000 records totaling 713 MB of data.

Uploaded this way, the data took approximately 9 hours at a rate of about 40 write operations per second. Is there a faster way to upload this data into my PostgreSQL database, preferably through R?

asked Sep 04 '25 by Dylan Russell

1 Answer

I've always found bulk copy, run external to R, to be the best approach. The insert itself can be significantly faster; the only added overhead is (1) writing the data to a file first, and the payoff is (2) a much shorter run time.

Setup for this test:

  • win10 (2004)
  • docker
  • postgres:11 container, using port 35432 on localhost, simple authentication
  • a psql binary in the host OS (where R is running); this is easy on Linux, and on Windows I grabbed the "zip" (not installer) file from https://www.postgresql.org/download/windows/ and extracted what I needed
  • I'm using data.table::fwrite to save the file because it's fast; write.table and write.csv are still much faster than DBI::dbWriteTable in this case, but with your volume of data you'll want something quick (the con2 connection used in the code below is sketched right after this list)
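
The con2 connection used in the timings below isn't shown in the answer; a minimal sketch, assuming the RPostgres driver and the docker test credentials listed above (an odbc connection like the one in the question would work just as well):

# sketch of the con2 connection used below; the driver choice (RPostgres) is an
# assumption, and the credentials/port are the docker test values listed above
con2 <- DBI::dbConnect(
  RPostgres::Postgres(),
  host     = "127.0.0.1",
  port     = 35432,
  user     = "postgres",
  password = "mysecretpassword",
  dbname   = "postgres"
)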
DBI::dbCreateTable(con2, "mt", mtcars)
DBI::dbGetQuery(con2, "select count(*) as n from mt")
#   n
# 1 0

z1000 <- data.table::rbindlist(replicate(1000, mtcars, simplify=F))
nrow(z1000)
# [1] 32000
system.time({
  DBI::dbWriteTable(con2, "mt", z1000, create = FALSE, append = TRUE)
})
#    user  system elapsed 
#    1.56    1.09   30.90 

system.time({
  data.table::fwrite(z1000, "mt.csv")
  URI <- sprintf("postgresql://%s:%s@%s:%s", "postgres", "mysecretpassword", "127.0.0.1", "35432")
  system(
    sprintf("psql.exe -U postgres -c \"\\copy %s (%s) from %s (FORMAT CSV, HEADER)\" %s",
            "mt", paste(colnames(z1000), collapse = ","),
            sQuote("mt.csv"), URI)
  )
})    
# COPY 32000
#    user  system elapsed 
#    0.05    0.00    0.19 
DBI::dbGetQuery(con2, "select count(*) as n from mt")
#       n
# 1 64000

While this is a lot smaller than your data (32K rows, 11 columns, 1.3MB of data), a speedup from 30 seconds to less than 1 second cannot be ignored.
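Applied to your data frame, a sketch along these lines should work; the credentials, host, and psql path here are placeholders, not values from your question, and it assumes the table already exists (from DBI::dbCreateTable above):

# placeholders: "youruser", "yourpassword", "your-db-host", 5432 -- substitute
# your Cloud SQL credentials; on Windows, point to psql.exe as noted above
data.table::fwrite(df, "df.csv")
URI <- sprintf("postgresql://%s:%s@%s:%s", "youruser", "yourpassword", "your-db-host", "5432")
system(
  sprintf("psql -U youruser -c \"\\copy %s (%s) from %s (FORMAT CSV, HEADER)\" %s",
          tablename, paste(colnames(df), collapse = ","),
          sQuote("df.csv"), URI)
)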


Side note: there is also a sizable difference between dbAppendTable (slow) and dbWriteTable. Comparing psql and those two functions:

z100 <- data.table::rbindlist(replicate(100, mtcars, simplify=F))

system.time({
  data.table::fwrite(z100, "mt.csv")
  URI <- sprintf("postgresql://%s:%s@%s:%s", "postgres", "mysecretpassword", "127.0.0.1", "35432")
  system(
    sprintf("/Users/r2/bin/psql -U postgres -c \"\\copy %s (%s) from %s (FORMAT CSV, HEADER)\" %s",
            "mt", paste(colnames(z100), collapse = ","),
            sQuote("mt.csv"), URI)
  )
})
# COPY 3200
#    user  system elapsed 
#     0.0     0.0     0.1 

system.time({
  DBI::dbWriteTable(con2, "mt", z100, create = FALSE, append = TRUE)
})
#    user  system elapsed 
#    0.17    0.04    2.95 

system.time({
  DBI::dbAppendTable(con2, "mt", z100)
})
#    user  system elapsed 
#    0.74    0.33   23.59 

(I don't want to time dbAppendTable with z1000 above ...)

(For kicks, I ran it with replicate(10000, ...) and ran the psql and dbWriteTable tests again; they took 2 seconds and 372 seconds, respectively. Your choice :-) ... now I have over 650,000 rows of mtcars ... hrmph ... drop table mt ...)
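
If you'd rather do that cleanup from R instead of psql, a one-line sketch using the same con2:

DBI::dbExecute(con2, "DROP TABLE mt")  # drop the test table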

answered Sep 06 '25 by r2evans