 

Pentaho Table input to table output in batches

My input table in MySQL has 20 million records and the target table in Oracle is empty. I need to load the entire table from MySQL into Oracle. I am simply using a Table Input and a Table Output step.

Tags: pentaho, processing

My intention is not to lock the source table for a long time while reading. Is there a problem with the load (number of records) I am trying to achieve? I can see a Use batch update for inserts option in the Table Output step, but I could not find anything similar in the Table Input. Is there a way to perform batch processing in Pentaho?

asked by SriniV

2 Answers

Don't worry, 20 million records is a small number for PDI, and you will not lock the table if it is only open for input. That is why the bulk load options exist for output tables, not input tables.

A common beginner trap, however, is the Truncate table option on the output step. If you run the output step twice (inadvertently or for parallel processing), each run will lock the other. Forever.

To speed things up: you may tick the Lazy conversion check box on the input, so that the data remains in byte format until it is used. I am not sure you gain much on a simple input/output transformation, though, and if something goes wrong with Dates or Blobs when writing the output, the error message will be quite cryptic.

You can also increase the speed of the output by increasing the commit size (worth a few trials in Oracle), and by increasing the Number of rows in rowset, which will increase the number of rows read by the Table Input. To do so, right-click anywhere on the transformation, then Properties / Miscellaneous.
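For intuition, the Use batch update for inserts and commit size settings correspond roughly to the following JDBC pattern. This is only an illustrative sketch, not PDI's internal code; the connection strings, table and column names are hypothetical:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class BatchCopySketch {
    public static void main(String[] args) throws Exception {
        int commitSize = 1000; // plays the role of PDI's commit size

        try (Connection src = DriverManager.getConnection(
                 "jdbc:mysql://mysql-host:3306/sourcedb", "etl_user", "secret");
             Connection dst = DriverManager.getConnection(
                 "jdbc:oracle:thin:@//oracle-host:1521/ORCL", "etl_user", "secret")) {

            dst.setAutoCommit(false); // commit manually, once per batch

            try (Statement read = src.createStatement();
                 ResultSet rs = read.executeQuery("SELECT id, payload FROM big_table");
                 PreparedStatement write = dst.prepareStatement(
                     "INSERT INTO big_table (id, payload) VALUES (?, ?)")) {

                int pending = 0;
                while (rs.next()) {
                    write.setLong(1, rs.getLong(1));
                    write.setString(2, rs.getString(2));
                    write.addBatch();             // what "Use batch update for inserts" does
                    if (++pending == commitSize) {
                        write.executeBatch();     // send the whole batch to Oracle
                        dst.commit();             // one commit per commitSize rows
                        pending = 0;
                    }
                }
                if (pending > 0) {                // flush the last, partial batch
                    write.executeBatch();
                    dst.commit();
                }
            }
        }
    }
}
```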

Something I really advise you to do is to increase the JVM memory size. Use an editor (Notepad or better) to edit the file named spoon.bat. Around lines 94-96 you will find a line containing something like "-Xmx256K". Change it to "-Xmx4096M" (where 4096 is half the size of your machine's RAM, in MB).

"Batch processing" can mean many things. One of them is Make the transformation database transactional, which you can do with the check box just below the above-mentioned Number of rows in rowset (and buggily spelled Make the transformation database in the latest PDI version). With that box checked, if something goes wrong the state of the database is rolled back as if the transformation had never been executed. But I do not advise doing this in your case.
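At the JDBC level, that transactional behaviour amounts to wrapping the whole load in a single transaction and rolling back on failure. Again, this is only an illustrative sketch with a hypothetical copyRows callback, not PDI internals:

```java
import java.sql.Connection;
import java.sql.SQLException;

public class TransactionalLoadSketch {
    // Run the whole load as one transaction: commit only if every row succeeded.
    static void loadAll(Connection dst, Runnable copyRows) throws SQLException {
        dst.setAutoCommit(false);
        try {
            copyRows.run(); // write all rows
            dst.commit();   // all or nothing
        } catch (RuntimeException e) {
            dst.rollback(); // the database looks as if the transformation never ran
            throw e;
        }
    }
}
```

A single transaction over 20 million rows also forces the database to hold a large amount of undo data for the whole run, which is presumably why the answer above advises against it here.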

answered by AlainD


In addition to @AlainD's solution, there are a couple of options:
  • Tune MySQL for better performance on inserts
  • Use the MySQL Bulk loader step in PDI
  • Write the SQL statements to a file with PDI and read them with the mysql binary

Speed can also be boosted by using some simple JDBC connection settings, as shown in the sketch after this list:

  • useServerPrepStmts=false
  • rewriteBatchedStatements=true
  • useCompression=true
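These are standard MySQL Connector/J properties. Below is a minimal sketch of how they could be applied to the source connection; the host, schema and credentials are placeholders:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.util.Properties;

public class MySqlConnectionSketch {
    public static Connection open() throws Exception {
        // Placeholder host, schema and credentials: adjust for your environment.
        String url = "jdbc:mysql://mysql-host:3306/sourcedb"
                + "?useServerPrepStmts=false"        // use client-side prepared statements
                + "&rewriteBatchedStatements=true"   // let the driver rewrite batches into multi-row INSERTs
                + "&useCompression=true";            // compress the client/server protocol

        Properties props = new Properties();
        props.setProperty("user", "etl_user");
        props.setProperty("password", "secret");
        return DriverManager.getConnection(url, props);
    }
}
```

In PDI itself these parameters can typically be added on the database connection's Options tab rather than in code.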
answered by Yuseferi


