Spark making expensive S3 API calls

I have a Spark job which receives a list of ~100k files and is invoked every 10 minutes. The files are in S3. The paths look like:

s3://<bucket>/<folder>/<file_name>

Files are loaded like this:

df = spark.read.option("mergeSchema", "true").schema(schema).parquet(*files)

Behind the scenes, Spark seems to make a LIST and a HEAD API call for each file. This is quite wasteful, as these are files, not directories, and are guaranteed to exist due to the nature of the job. I've looked at the Spark codebase, and this behaviour appears to come from the InMemoryFileIndex. Is there a way to configure Spark to make the GET calls directly and skip the LIST/HEAD calls?
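For context, a minimal sketch of the pattern in question versus one commonly suggested mitigation (not confirmed by the answer below): if the files share a common prefix, passing the directory to spark.read.parquet lets Spark list the prefix with a handful of paginated LIST calls instead of probing each of the ~100k paths individually. The bucket/folder placeholders, and the files and schema variables, are assumed from the question.

from pyspark.sql import SparkSession
from pyspark.sql.functions import input_file_name

spark = SparkSession.builder.getOrCreate()

# Pattern from the question: ~100k explicit paths, each of which
# gets probed while InMemoryFileIndex is built.
# df = spark.read.option("mergeSchema", "true").schema(schema).parquet(*files)

# Possible mitigation: read the whole prefix in one go, so listing
# becomes paginated LIST calls on the directory rather than per-file probes.
df = (spark.read
      .option("mergeSchema", "true")
      .schema(schema)  # schema is assumed defined, as in the question
      .parquet("s3://<bucket>/<folder>/"))

# If only a subset of files is wanted, filter afterwards by source path;
# input_file_name() returns the file each row was read from. An isin()
# over ~100k paths is illustrative only and may itself be slow.
df = df.where(input_file_name().isin(list(files)))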

Dexter asked Sep 13 '25
1 Answer

This is quite wasteful as these are files not directories and are guaranteed to exist due to the nature of the job.

The problem here is that the filesystem layer doesn't know about "the nature of the job", so it does its own probes at times.

It also looks like InMemoryFileIndex.scala is pretty inefficient; it does its own treewalk, except for some hard-coded bits for HDFS, and it does seem to rescan all the files it has just listed.

Yes, there is scope for improvement, as open source projects say. But, as they also tend to say, "please submit a patch".
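Until such a patch exists, one knob that can at least spread out the listing Spark already does is parallel partition discovery. A minimal sketch, assuming the job from the question; the threshold and parallelism values are illustrative, not tuned:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
    # Once more than 'threshold' paths are passed, Spark lists them
    # with a distributed job instead of serially on the driver.
    .config("spark.sql.sources.parallelPartitionDiscovery.threshold", "32")
    # Upper bound on the parallelism of that listing job.
    .config("spark.sql.sources.parallelPartitionDiscovery.parallelism", "1000")
    .getOrCreate())

This doesn't skip the LIST/HEAD calls; it only parallelises them across executors, which is consistent with the point above that skipping them entirely would need a patch.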

stevel answered Sep 15 '25