I have this SQL query:
delete from scans
where scandatetime>(current_timestamp - interval '21 days') and
scandatetime <> (select min(tt.scandatetime) from scans tt where tt.imb = scans.imb) and
scandatetime <> (select max(tt.scandatetime) from scans tt where tt.imb = scans.imb)
;
That I use to delete records from the following table:
|imb |scandatetime |status |scanfacilityzip|
+-----------+-------------------+---------+---------------+
|isdijh23452|2020-01-01 13:45:12|Intake |12345 |
|isdijh23452|2020-01-01 13:45:12|Intake |12345 |
|isdijh23452|2020-01-01 19:30:32|Received |12345 |
|isdijh23452|2020-01-02 04:50:22|Confirmed|12345 |
|isdijh23452|2020-01-03 19:32:18|Processed|45867 |
|awgjnh09864|2020-01-01 10:24:16|Intake |84676 |
|awgjnh09864|2020-01-01 19:30:32|Received |84676 |
|awgjnh09864|2020-01-01 19:30:32|Received |84676 |
|awgjnh09864|2020-01-02 02:15:52|Processed|84676 |
such that only 2 records remain per IMB, the one with the minimum scandatetime and the maximum scandatetime. I also limit this so it only performs this operation for records that are less than 3 weeks old. The resultant table looks like this:
|imb |scandatetime |status |scanfacilityzip|
+-----------+-------------------+---------+---------------+
|isdijh23452|2020-01-01 13:45:12|Intake |12345 |
|isdijh23452|2020-01-03 19:32:18|Processed|45867 |
|awgjnh09864|2020-01-01 10:24:16|Intake |84676 |
|awgjnh09864|2020-01-02 02:15:52|Processed|84676 |
This table has a few indexes and has tens of millions of rows, so the query usually takes forever to run. How can I speed this up?
Explain output:
Delete on scans (cost=0.57..115934571.45 rows=10015402 width=6)
-> Index Scan using scans_staging_scandatetime_idx on scans (cost=0.57..115934571.45 rows=10015402 width=6)
Index Cond: (scandatetime > (CURRENT_TIMESTAMP - '21 days'::interval))
Filter: ((scandatetime <> (SubPlan 2)) AND (scandatetime <> (SubPlan 4)))
SubPlan 2
-> Result (cost=3.91..3.92 rows=1 width=8)
InitPlan 1 (returns $1)
-> Limit (cost=0.70..3.91 rows=1 width=8)
-> Index Only Scan using scans_staging_imb_scandatetime_idx on scans tt (cost=0.70..16.79 rows=5 width=8)
Index Cond: ((imb = scans.imb) AND (scandatetime IS NOT NULL))
SubPlan 4
-> Result (cost=3.91..3.92 rows=1 width=8)
InitPlan 3 (returns $3)
-> Limit (cost=0.70..3.91 rows=1 width=8)
-> Index Only Scan Backward using scans_staging_imb_scandatetime_idx on scans tt_1 (cost=0.70..16.79 rows=5 width=8)
Index Cond: ((imb = scans.imb) AND (scandatetime IS NOT NULL))
Table DDL:
-- Table Definition ----------------------------------------------
CREATE TABLE scans (
imb text,
scandatetime timestamp without time zone,
status text,
scanfacilityzip text
);
-- Indices -------------------------------------------------------
CREATE INDEX scans_staging_scandatetime_idx ON scans(scandatetime timestamp_ops);
CREATE INDEX scans_staging_imb_idx ON scans(imb text_ops);
CREATE INDEX scans_staging_status_idx ON scans(status text_ops);
CREATE INDEX scans_staging_scandatetime_status_idx ON scans(scandatetime timestamp_ops,status text_ops);
CREATE INDEX scans_staging_imb_scandatetime_idx ON scans(imb text_ops,scandatetime timestamp_ops);
Edit: Here is the explain analyze output (note, I changed the interval to 1 day to make it run faster):
Delete on scans (cost=0.58..3325615.74 rows=278811 width=6) (actual time=831562.877..831562.877 rows=0 loops=1)
-> Index Scan using scans_staging_scandatetime_idx on scans (cost=0.58..3325615.74 rows=278811 width=6) (actual time=831562.875..831562.875 rows=0 loops=1)
Index Cond: (scandatetime > (CURRENT_TIMESTAMP - '1 day'::interval))
Filter: ((scandatetime <> (SubPlan 2)) AND (scandatetime <> (SubPlan 4)))
Rows Removed by Filter: 277756
SubPlan 2
-> Result (cost=3.92..3.93 rows=1 width=8) (actual time=1.675..1.675 rows=1 loops=277756)
InitPlan 1 (returns $1)
-> Limit (cost=0.70..3.92 rows=1 width=8) (actual time=1.673..1.674 rows=1 loops=277756)
-> Index Only Scan using scans_staging_imb_scandatetime_idx on scans tt (cost=0.70..16.80 rows=5 width=8) (actual time=1.672..1.672 rows=1 loops=277756)
Index Cond: ((imb = scans.imb) AND (scandatetime IS NOT NULL))
Heap Fetches: 277761
SubPlan 4
-> Result (cost=3.92..3.93 rows=1 width=8) (actual time=0.086..0.086 rows=1 loops=164210)
InitPlan 3 (returns $3)
-> Limit (cost=0.70..3.92 rows=1 width=8) (actual time=0.084..0.085 rows=1 loops=164210)
-> Index Only Scan Backward using scans_staging_imb_scandatetime_idx on scans tt_1 (cost=0.70..16.80 rows=5 width=8) (actual time=0.083..0.083 rows=1 loops=164210)
Index Cond: ((imb = scans.imb) AND (scandatetime IS NOT NULL))
Heap Fetches: 164210
Planning Time: 11.360 ms
Execution Time: 831562.956 ms
EDIT: Result with explain analyze buffers:
Delete on scans (cost=0.57..1274693.83 rows=103787 width=6) (actual time=19309.026..19309.027 rows=0 loops=1)
Buffers: shared hit=743430 read=46033
I/O Timings: read=15917.966
-> Index Scan using scans_staging_scandatetime_idx on scans (cost=0.57..1274693.83 rows=103787 width=6) (actual time=19309.025..19309.025 rows=0 loops=1)
Index Cond: (scandatetime > (CURRENT_TIMESTAMP - '1 day'::interval))
Filter: ((scandatetime <> (SubPlan 2)) AND (scandatetime <> (SubPlan 4)))
Rows Removed by Filter: 74564
Buffers: shared hit=743430 read=46033
I/O Timings: read=15917.966
SubPlan 2
-> Result (cost=4.05..4.06 rows=1 width=8) (actual time=0.232..0.233 rows=1 loops=74564)
Buffers: shared hit=458108 read=27849
I/O Timings: read=15114.478
InitPlan 1 (returns $1)
-> Limit (cost=0.70..4.05 rows=1 width=8) (actual time=0.231..0.231 rows=1 loops=74564)
Buffers: shared hit=458108 read=27849
I/O Timings: read=15114.478
-> Index Only Scan using scans_staging_imb_scandatetime_idx on scans tt (cost=0.70..20.81 rows=6 width=8) (actual time=0.230..0.230 rows=1 loops=74564)
Index Cond: ((imb = scans.imb) AND (scandatetime IS NOT NULL))
Heap Fetches: 74583
Buffers: shared hit=458108 read=27849
I/O Timings: read=15114.478
SubPlan 4
-> Result (cost=4.05..4.06 rows=1 width=8) (actual time=0.042..0.042 rows=1 loops=34497)
Buffers: shared hit=228637 read=701
I/O Timings: read=507.724
InitPlan 3 (returns $3)
-> Limit (cost=0.70..4.05 rows=1 width=8) (actual time=0.041..0.041 rows=1 loops=34497)
Buffers: shared hit=228637 read=701
I/O Timings: read=507.724
-> Index Only Scan Backward using scans_staging_imb_scandatetime_idx on scans tt_1 (cost=0.70..20.81 rows=6 width=8) (actual time=0.040..0.040 rows=1 loops=34497)
Index Cond: ((imb = scans.imb) AND (scandatetime IS NOT NULL))
Heap Fetches: 34497
Buffers: shared hit=228637 read=701
I/O Timings: read=507.724
Planning Time: 5.350 ms
Execution Time: 19313.242 ms
Without the pre-aggregation (and avoiding the CTE):
DELETE FROM scans del
WHERE del.scandatetime > (current_timestamp - interval '21 days')
AND EXISTS (SELECT *
FROM scans x
WHERE x.imb = del.imb
AND x.scandatetime < del.scandatetime
)
AND EXISTS (SELECT *
FROM scans x
WHERE x.imb = del.imb
AND x.scandatetime > del.scandatetime
)
;
The idea is: you only delete if there is (at least) one record before, and (at least) one after it. (with the same imd) This is not true for the first and last records, only the ones inbetween.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With