Processing millions of database records in Java [closed]

I have a requirement to write a batch job that fetches rows from a database table and, based on certain conditions, writes to other tables or updates the row with a certain value. We are using Spring and JDBC to fetch the result set, then iterate through it and process the records in a standalone Java program that is scheduled to run weekly. I know this is not the right way to do it, but we had to do it as a temporary solution. As the records grow into the millions, we will end up with out-of-memory errors, so I know this is not the best approach.

Can any of you recommend what is the best way to deal with such a situation?

Use threads, fetching 1000 records per thread and processing them in parallel?

(OR)

Use some other batch mechanism to do this (I know there is Spring Batch but have never used it)?

(OR)

Any other ideas?

asked Sep 13 '25 by user1583261

2 Answers

You already know that you can't bring a million rows into memory and operate on them.

You'll have to chunk them in some way.

Why bring them to the middle tier? I'd consider writing stored procedures and operating on the data on the database server; bringing it to the middle tier doesn't seem to buy you anything. Have your batch job kick off the stored procedure and do the calculations in place on the database server.
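As a rough illustration of that idea, here is a minimal sketch of such a stored procedure (MySQL-flavored syntax; the table and column names source_table, archive_table, status, and amount are made up for this example):

    -- Hypothetical sketch: all table/column names are invented for illustration.
    DELIMITER //

    CREATE PROCEDURE process_weekly_batch()
    BEGIN
        -- Copy qualifying rows into another table, entirely inside the database.
        INSERT INTO archive_table (id, amount, created_at)
        SELECT id, amount, created_at
        FROM   source_table
        WHERE  status = 'PENDING' AND amount > 1000;

        -- Update the same rows in place, based on the same condition.
        UPDATE source_table
        SET    status = 'PROCESSED'
        WHERE  status = 'PENDING' AND amount > 1000;
    END //

    DELIMITER ;

The weekly Java job then only needs to issue CALL process_weekly_batch(); over JDBC, so no rows ever travel to the middle tier.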

answered Sep 15 '25 by duffymo

a batch job that fetches rows from a database table and, based on certain conditions, writes to other tables or updates the row with a certain value.

This sounds like the sort of thing you should do inside the database. For example, to fetch a particular row and update it based on certain conditions, SQL has the UPDATE ... WHERE ... statement. To write to another table, you can use INSERT ... SELECT ....
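As a rough sketch of both statements (the table and column names here, orders and flagged_orders, are invented for illustration):

    -- Update rows that meet a condition, without ever fetching them into Java.
    UPDATE orders
    SET    flagged = 1
    WHERE  total > 10000 AND flagged = 0;

    -- Write matching rows to another table in a single statement.
    INSERT INTO flagged_orders (order_id, total)
    SELECT id, total
    FROM   orders
    WHERE  total > 10000;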

These may get fairly complicated, but I suggest doing everything in your power to do this inside the database, since pulling the data out to filter it is incredibly slow and defeats the purpose of having a relational database.

Note: Make sure to experiment with this on a non-production system first, and implement any limits you need so you don't lock up production tables at bad times.
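One common way to keep lock times down, assuming MySQL (where UPDATE accepts a LIMIT clause), is to process in small batches and have the scheduled job repeat the statement until it affects zero rows:

    -- Hypothetical sketch: each run only touches (and locks) up to 1000 rows.
    UPDATE orders
    SET    flagged = 1
    WHERE  total > 10000 AND flagged = 0
    LIMIT  1000;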

answered Sep 15 '25 by Brendan Long