I have about 1300 CSV files with roughly 40k rows in each file. I have written a Python script that reads each file and converts all 40k entries into a single INSERT statement for a Postgres database.
The pseudocode is as follows:
from tqdm import tqdm

for file in tqdm(files, desc="Processing files"):
    rows = ReadFile(file)      # read all 40k rows from the file
    q = GenerateQuery(rows)    # convert all rows into a single bulk INSERT statement
    InsertData(q)              # execute the generated query to insert the data
The code works, but there are performance issues. When I start with an empty table it runs at around 2 to 3 it/s, but after 15 to 20 files it slows to about 10 to 12 s/it, and the performance keeps degrading with every further 10 to 15 files: the per-iteration time keeps increasing until it reaches 40 to 50 s/it, which is absurd. Based on my understanding, I have developed the following hypothesis:
Since the table is empty at the start, updating the table's indexes is cheap, so a 40k-row bulk insert takes almost no time to update them; but as the table grows to 10M+ records, it becomes harder and harder to keep the indexes up to date.
My question is: can I temporarily disable index updates on the table, so that after the complete data dump I can rebuild the indexes manually by running some query in Postgres (which, for now, I don't even know exists)?
There are several options:
If you have no constraints, such as uniqueness, on the indexed columns, you can drop the indexes before inserting the tuples and easily recreate them after the insertion is complete.
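For example, a minimal sketch with psycopg2, assuming a hypothetical table my_table with a single index my_table_col_idx on column col (adjust the names, connection string, and loading step to your setup):

import psycopg2

conn = psycopg2.connect("dbname=mydb user=myuser")  # placeholder connection string
conn.autocommit = True
cur = conn.cursor()

# Drop the index before the bulk load (assumes nothing depends on it, e.g. no unique constraint).
cur.execute("DROP INDEX IF EXISTS my_table_col_idx;")

# ... run the bulk inserts for all 1300 files here ...

# Rebuild the index once at the end: a single build over the full table is far
# cheaper than maintaining the index row by row during the load.
cur.execute("CREATE INDEX my_table_col_idx ON my_table (col);")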
If you have enough RAM, you can first read all the CSVs and then create the table with a single SELECT statement. This process can run in parallel, resulting in much less overall time. This approach can be helpful if you don't need to load the data into an existing table, and it may require the psycopg2 driver for Postgres.
You can use Postgres's native COPY command to load tuples from a CSV file. If I am not mistaken, parallel execution of COPY is supported in Postgres versions 14-15.
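A minimal sketch of this with psycopg2's copy_expert, assuming each CSV has a header row and its columns match a placeholder table my_table:

import psycopg2
from glob import glob

files = glob("*.csv")  # placeholder: the same list of CSV paths as in the question

conn = psycopg2.connect("dbname=mydb user=myuser")  # placeholder connection string
cur = conn.cursor()

for path in files:
    with open(path, "r") as f:
        # COPY ... FROM STDIN streams the file over the client connection,
        # avoiding the giant INSERT statement built in Python.
        cur.copy_expert("COPY my_table FROM STDIN WITH (FORMAT csv, HEADER true)", f)
    conn.commit()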
In any scenario, you can create an unlogged table (https://www.postgresql.org/docs/current/sql-createtable.html) or set an existing table to unlogged (ALTER TABLE foo SET UNLOGGED;), which can provide a speed increase of 50-200% even without parallel insertion. Be cautious, however: if there is existing data in the table and something goes wrong, you may lose all of it. If you are creating a new table, this is not an issue, since you can simply repeat the copy. Either way, once the copy is complete, it's recommended to make the table logged again (ALTER TABLE foo SET LOGGED;), which may take some time.
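Combining this with the options above, a sketch of the unlogged-load pattern (table name and connection string are placeholders):

import psycopg2

conn = psycopg2.connect("dbname=mydb user=myuser")  # placeholder connection string
conn.autocommit = True
cur = conn.cursor()

# Skip WAL while loading; a crash in this window can lose the table's contents.
cur.execute("ALTER TABLE my_table SET UNLOGGED;")

# ... bulk load here (COPY or batched INSERTs, as above) ...

# Re-enable WAL afterwards; this rewrites the table, so it may take a while on 10M+ rows.
cur.execute("ALTER TABLE my_table SET LOGGED;")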