I have a Spark job which receives a list of ~100k files and is invoked every 10 minutes. These files are in S3. The paths look like:
s3://<bucket>/<folder>/<file_name>
Files are loaded like this:
df = spark.read.option("mergeSchema", "true").schema(schema).parquet(*files)
Behind the scenes, Spark seems to make a LIST and a HEAD API call for each file. This is quite wasteful, since these are files, not directories, and they are guaranteed to exist due to the nature of the job. I've looked at the Spark codebase, and this behaviour appears to come from the InMemoryFileIndex. Is there a way to configure Spark to make the GET calls directly and skip the LIST/HEAD calls?
"This is quite wasteful as these are files not directories and are guaranteed to exist due to the nature of the job."
The problem here is that the filesystem layer doesn't know about "the nature of the job", so it does its own probes at times.
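One partial mitigation, sketched below assuming an S3A client, is to let those probes run with more concurrency rather than trying to remove them. The settings are real Spark/S3A options, but the values are illustrative, not tuned recommendations:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # above this many input paths, Spark distributes the listing as its own job
    .config("spark.sql.sources.parallelPartitionDiscovery.threshold", "32")
    # parallelism of that distributed listing job
    .config("spark.sql.sources.parallelPartitionDiscovery.parallelism", "10000")
    # give the S3A client enough HTTP connections and threads for the probe storm
    .config("spark.hadoop.fs.s3a.connection.maximum", "200")
    .config("spark.hadoop.fs.s3a.threads.max", "100")
    .getOrCreate()
)

This doesn't eliminate the LIST/HEAD calls; it just stops them from being issued one at a time from the driver.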
It also looks like InMemoryFileIndex.scala is pretty inefficient; it does its own tree walk except for some hard-coded bits for HDFS, and it does seem to rescan all the files it has just listed.
Yes, there is scope for improvement, as open source projects say. But, as they also tend to say: "please submit a patch".
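In the meantime, if the ~100k files are in fact everything under a single prefix, one workaround (a sketch, reusing the placeholder bucket/folder from the question; pathGlobFilter needs Spark 3.0+) is to pass the directory instead of the individual paths, so S3A can enumerate up to 1000 objects per LIST call instead of issuing a HEAD per named file:

df = (
    spark.read
    .schema(schema)
    # Spark 3.0+ option: skip any non-parquet objects under the prefix
    .option("pathGlobFilter", "*.parquet")
    .parquet("s3://<bucket>/<folder>/")
)

This only helps when the directory contents match the list the job receives; if the list is a strict subset of what's under the prefix, the per-file probes are hard to avoid without changes to InMemoryFileIndex itself.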