I have a Spark job which receives a list of ~100k files and is invoked every 10 minutes. These files are in S3. The paths look like:
s3://<bucket>/<folder>/<file_name>
Files are loaded like this:
df = spark.read.option("mergeSchema", "true").schema(schema).parquet(*files)
Behind the scenes, Spark seems to make a LIST and a HEAD API call for each file. This is quite wasteful, since these are files, not directories, and they are guaranteed to exist due to the nature of the job. I've looked at the Spark codebase, and this behaviour appears to come from the InMemoryFileIndex. Is there a way to configure Spark to make the GET calls directly and skip the LIST/HEAD calls?
"This is quite wasteful as these are files not directories and are guaranteed to exist due to the nature of the job."
The problem here is that the filesystem layer doesn't know about "the nature of the job", so it does its own probes at times.
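One partial mitigation, sketched below assuming an S3A client, is to let those probes run with more concurrency rather than trying to remove them. The settings are real Spark/S3A options, but the values are illustrative, not tuned recommendations:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # above this many input paths, Spark distributes the listing as its own job
    .config("spark.sql.sources.parallelPartitionDiscovery.threshold", "32")
    # parallelism of that distributed listing job
    .config("spark.sql.sources.parallelPartitionDiscovery.parallelism", "10000")
    # give the S3A client enough HTTP connections and threads for the probe storm
    .config("spark.hadoop.fs.s3a.connection.maximum", "200")
    .config("spark.hadoop.fs.s3a.threads.max", "100")
    .getOrCreate()
)

This doesn't eliminate the LIST/HEAD calls; it just stops them from being issued one at a time from the driver.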
It also looks like InMemoryFileIndex.scala is pretty inefficient; it does its own tree walk except for some hard-coded bits for HDFS, and it does seem to rescan all the files it has just listed.
Yes, there is scope for improvement, as open source projects say. But, as they also tend to say: "please submit a patch".
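In the meantime, if the ~100k files are in fact everything under a single prefix, one workaround (a sketch, reusing the placeholder bucket/folder from the question; pathGlobFilter needs Spark 3.0+) is to pass the directory instead of the individual paths, so S3A can enumerate up to 1000 objects per LIST call instead of issuing a HEAD per named file:

df = (
    spark.read
    .schema(schema)
    # Spark 3.0+ option: skip any non-parquet objects under the prefix
    .option("pathGlobFilter", "*.parquet")
    .parquet("s3://<bucket>/<folder>/")
)

This only helps when the directory contents match the list the job receives; if the list is a strict subset of what's under the prefix, the per-file probes are hard to avoid without changes to InMemoryFileIndex itself.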