I've got a 24/7 Spark Structured Streaming query (Kafka as a source) that appends data to a Delta Table.
Is it safe to periodically run VACUUM and DELETE against the same Delta table from a different cluster while the first one is still processing incoming data?
The table is partitioned on date and the DELETE will be done at partition level.
P.S. The infrastructure runs on AWS.
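For illustration, the streaming append looks roughly like this (the broker address, topic name, S3 paths, and checkpoint location are all placeholders):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_date

spark = SparkSession.builder.appName("kafka-to-delta").getOrCreate()

# Read from Kafka (bootstrap servers and topic are placeholders)
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")
    .option("subscribe", "events")
    .option("startingOffsets", "latest")
    .load()
    .selectExpr("CAST(value AS STRING) AS payload", "timestamp")
    .withColumn("date", to_date(col("timestamp")))
)

# Append-only write to a date-partitioned Delta table (S3 path is a placeholder)
query = (
    events.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "s3://my-bucket/checkpoints/events")
    .partitionBy("date")
    .start("s3://my-bucket/tables/events")
)

query.awaitTermination()
```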
If your streaming job is really append-only, then it shouldn't have any conflicts:
- DELETE at the partition level can't conflict under the default WriteSerializable isolation level, as long as the write happens without reading the table (i.e. an append-only workload).
- VACUUM simply removes files that are no longer referenced in the latest table version, so it won't conflict with appends.
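A minimal sketch of the periodic maintenance job you could run from the second cluster. The table path, the partition column name `date`, and the 30-day retention window are assumptions for illustration only:

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.appName("delta-maintenance").getOrCreate()

table_path = "s3://my-bucket/tables/events"  # placeholder path

# Partition-level delete: the predicate only references the partition column,
# so whole date partitions are dropped without reading data files that the
# streaming append could be touching.
delta_table = DeltaTable.forPath(spark, table_path)
delta_table.delete("date < date_sub(current_date(), 30)")

# VACUUM removes files no longer referenced by the current table version.
# The default retention is 7 days; keep it at least as long as your
# longest-running readers or time-travel needs.
spark.sql(f"VACUUM delta.`{table_path}` RETAIN 168 HOURS")
```

Scheduling this on its own small cluster (or as a separate job) keeps the maintenance workload isolated from the 24/7 streaming query.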