I've got a 24/7 Spark Structured Streaming query (Kafka as a source) that appends data to a Delta Table.
Is it safe to periodically run VACUUM and DELETE against the same Delta table from a different cluster while the first one is still processing incoming data?
The table is partitioned on date and the DELETE will be done at partition level.
P.S. The infrastructure runs on AWS.
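For illustration, the streaming append looks roughly like this (the broker address, topic name, S3 paths, and checkpoint location are all placeholders):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_date

spark = SparkSession.builder.appName("kafka-to-delta").getOrCreate()

# Read from Kafka (bootstrap servers and topic are placeholders)
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")
    .option("subscribe", "events")
    .option("startingOffsets", "latest")
    .load()
    .selectExpr("CAST(value AS STRING) AS payload", "timestamp")
    .withColumn("date", to_date(col("timestamp")))
)

# Append-only write to a date-partitioned Delta table (S3 path is a placeholder)
query = (
    events.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "s3://my-bucket/checkpoints/events")
    .partitionBy("date")
    .start("s3://my-bucket/tables/events")
)

query.awaitTermination()
```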
If your streaming job is really append-only, then it shouldn't have any conflicts:
- DELETE at the partition level can't conflict under the default WriteSerializable isolation level, as long as the write happens without reading the table (i.e. an append-only workload).
- VACUUM simply removes files that are no longer referenced in the latest table version, so it won't conflict with appends.
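A minimal sketch of the periodic maintenance job you could run from the second cluster. The table path, the partition column name `date`, and the 30-day retention window are assumptions for illustration only:

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.appName("delta-maintenance").getOrCreate()

table_path = "s3://my-bucket/tables/events"  # placeholder path

# Partition-level delete: the predicate only references the partition column,
# so whole date partitions are dropped without reading data files that the
# streaming append could be touching.
delta_table = DeltaTable.forPath(spark, table_path)
delta_table.delete("date < date_sub(current_date(), 30)")

# VACUUM removes files no longer referenced by the current table version.
# The default retention is 7 days; keep it at least as long as your
# longest-running readers or time-travel needs.
spark.sql(f"VACUUM delta.`{table_path}` RETAIN 168 HOURS")
```

Scheduling this on its own small cluster (or as a separate job) keeps the maintenance workload isolated from the 24/7 streaming query.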