Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Spark History Server ListBucket costs

We are using Spark history 3.2.1 to monitor our Spark applications.

We have thousands of daily jobs (running on Kubernetes) that writes event logs to S3 bucket (in a dedicated folder).

We are using history-server to analyze and compare completed jobs (incomplete running jobs never appeared in the UI but it's not a requirement now).

Recently I've noticed increase in our ListBucket API Operation in AWS billing cost explorer. This cost is higher than the cost of the StandardStorage (the price we pay for storing the data itself). It's up to few hundreds per month!

Running history-server with DEBUG log level exposed the "problem": every 10s the the history-server list the bucket to get all logs and then it iterate over each folder to get it's content. So if I want to keep the last 10,000 jobs, I'll have to pay for 10,101 ListBucket requests every 10s!

Here is one example (out of the 10k) reproduced locally with minio as S3:

22/02/20 06:44:31 DEBUG wire: http-outgoing-57 << "<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/"><Name>local-audience</Name><Prefix>history-logs/eventlog_v2_spark-ffffdf5903c841259f28b53981746b76/</Prefix><KeyCount>2</KeyCount><MaxKeys>5000</MaxKeys><Delimiter>/</Delimiter><IsTruncated>false</IsTruncated><Contents><Key>history-logs/eventlog_v2_spark-ffffdf5903c841259f28b53981746b76/appstatus_spark-ffffdf5903c841259f28b53981746b76</Key><LastModified>2022-02-12T17:00:15.304Z</LastModified><ETag>&#34;d41d8cd98f00b204e9800998ecf8427e&#34;</ETag><Size>0</Size><Owner><ID></ID><DisplayName></DisplayName></Owner><StorageClass>STANDARD</StorageClass></Contents><Contents><Key>history-logs/eventlog_v2_spark-ffffdf5903c841259f28b53981746b76/events_1_spark-ffffdf5903c841259f28b53981746b76</Key><LastModified>2022-02-12T17:00:15.136Z</LastModified><ETag>&#34;f91cc774d92c6f6c2ca4d0e1a1e76e13&#34;</ETag><Size>868837</Size><Owner><ID></ID><DisplayName></DisplayName></Owner><StorageClass>STANDARD</StorageClass></Contents></ListBucketResult>"

To ensure that the cost comes from history-server I turned it off for a day and there was no charge per ListBucket since then:

enter image description here

To mitigate the problem (because we still need the history-server), I can set the spark.history.fs.update.interval to higher number (such as 3600s or so). As we are checking the history-server once a day it is overkill and doesn't worth it (cost wise).

  • Why does it scan the completed jobs every time (over and over again) and not only new jobs? is there a way to configure such behavior to avoid those ListBucket operations?
  • If I care only for completed jobs, and assuming I can wait few minutes to see the list, is there a mode that can load the list only when I login to the UI? (rather than periodically doing it for nothing).

P.S - I'm using AWS lifecycle rules to clean this folder every few few days (and not the server cleaning feature), by expiration objects after few days.

like image 204
ItayB Avatar asked Oct 30 '25 21:10

ItayB


1 Answers

treewalking in s3 is (a) expensive and (b) horribly slow, especially given that a deep tree scan exists. If you want to fix this and can write scala code, see if you can fix the server to switch to a deep listing by moving to FileSystem.listFiles(path, true). Yes that involves coding, but the OSS community depends on everyone fixing their own personal issues and sharing the outcome

like image 166
stevel Avatar answered Nov 02 '25 13:11

stevel