What would be the most efficient way to insert millions of records say 50-million from a Spark dataframe to Postgres Tables. I have done this from spark to MSSQL in the past by making use of bulk copy and batch size option which was successful too.
Is there something similar that can be here for Postgres?
Adding the code I have tried and the time it took to run the process:
def inserter():
    start = timer()
    sql_res.write.format("jdbc").option("numPartitions","5").option("batchsize","200000")\
    .option("url", "jdbc:postgresql://xyz.com:5435/abc_db") \
    .option("dbtable", "public.full_load").option("user", "root").option("password", "password").save()
    end = timer()
    print(timedelta(seconds=end-start))
inserter()
So I did the above approach for 10 million records and had 5 parallel connections as specified in numPartitions and also tried batch size of 200k.
The total time it took for the process was 0:14:05.760926 (fourteen minutes and five seconds).
Is there any other efficient approach which would reduce the time?
What would be the efficient or optimal batch size I can use ? Will increasing my batch size do the job quicker ? Or opening multiple connections i.e > 5 help me make the process quicker ?
On an average 14 mins for 10 million records is not bad, but looking for people out there who would have done this before to help answer this question.
How large a cluster can Spark scale to? Many organizations run Spark on clusters of thousands of nodes. The largest cluster we know has 8000 of them. In terms of data size, Spark has been shown to work well up to petabytes.
When using Postgres if you do need writes exceeding 10,000s of INSERT s per second we turn to the Postgres COPY utility for bulk loading. COPY is capable of handling 100,000s of writes per second.
I actually did kind of the same work a while ago but using Apache Sqoop.
I would say that for answering this questions we have to try to optimize the communication between Spark and PostgresSQL, specifically the data flowing from Spark to PostgreSql.
But be careful, do not forget Spark side. It does not make sense to execute mapPartitions if the number of partitions is too high compared with the number of maximum connections which
PostgreSQL support, if you have too many partitions and you are opening a connection for each one, you will probably have the following error org.postgresql.util.PSQLException: FATAL: sorry, too many clients already.
In order to tune the insertion process I would approach the problem following the next steps:
coalesce, as is mentioned here. Finally, there is no silver bullet to do this job. You can use all the tips I mentioned above but it will really depends on your data and use cases.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With