
Connect Spark to LocalStack S3 using Docker Compose

I've been trying to connect Spark to a local S3 bucket (LocalStack) using docker-compose, with the following setup:

  • One container runs the Spark driver
  • One container runs LocalStack (the S3 bucket)
  • One container runs the Spark master
  • Two more containers run the Spark workers

Still, I'm having trouble getting Spark to reach LocalStack (the S3 container).
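
For reference, here's a trimmed-down sketch of my compose layout (the images and Spark settings below are placeholders, not my exact file; everything sits on the default compose network, so services should resolve each other by name):

    version: "3"
    services:
      localstack:
        image: localstack/localstack   # this release line serves S3 on port 4572
        environment:
          - SERVICES=s3
        ports:
          - "4572:4572"
      spark-master:
        image: bitnami/spark           # placeholder image
        environment:
          - SPARK_MODE=master
      spark-worker:
        image: bitnami/spark           # placeholder image; scaled to two instances
        environment:
          - SPARK_MODE=worker
          - SPARK_MASTER_URL=spark://spark-master:7077
        depends_on:
          - spark-master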

I've already tried setting "spark.hadoop.fs.s3a.endpoint" to my host's network IP ("http://172.x.x.x"), and that works fine.

But when I use "http://localstack:4572" ("localstack" being my service alias in the docker-compose file), I get a "name or service not known" error:

19/10/30 20:39:49 WARN FileStreamSink: Error while looking for metadata directory.
Traceback (most recent call last):
  File "etl_job.py", line 103, in <module>
    etl_method(conf)
  File "etl_job.py", line 68, in etl_method
    df = spark.read.csv('s3a://test/data/RAW_DATA_Metro_Interstate_Traffic_Volume.csv.gz', header=True, inferSchema=True)
  File "/usr/local/lib/python3.7/site-packages/pyspark/sql/readwriter.py", line 476, in csv
    return self._df(self._jreader.csv(self._spark._sc._jvm.PythonUtils.toSeq(path)))
  File "/usr/local/lib/python3.7/site-packages/py4j/java_gateway.py", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/usr/local/lib/python3.7/site-packages/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/usr/local/lib/python3.7/site-packages/py4j/protocol.py", line 328, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o69.csv.
: com.amazonaws.AmazonClientException: Unable to execute HTTP request: test.localstack: Name or service not known
    at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:454)
    at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:232)
    at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528)
    at com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1031)
    at com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:994)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:297)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
    at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary$1.apply(DataSource.scala:547)
    at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary$1.apply(DataSource.scala:545)
    at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
    at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
    at scala.collection.immutable.List.foreach(List.scala:392)
    at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
    at scala.collection.immutable.List.flatMap(List.scala:355)
    at org.apache.spark.sql.execution.datasources.DataSource.org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary(DataSource.scala:545)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:359)
    at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
    at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:618)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.UnknownHostException: test.localstack: Name or service not known
    at java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)
    at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:929)
    at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1324)
    at java.net.InetAddress.getAllByName0(InetAddress.java:1277)
    at java.net.InetAddress.getAllByName(InetAddress.java:1193)
    at java.net.InetAddress.getAllByName(InetAddress.java:1127)
    at org.apache.http.impl.conn.SystemDefaultDnsResolver.resolve(SystemDefaultDnsResolver.java:45)
    at org.apache.http.impl.conn.DefaultClientConnectionOperator.resolveHostname(DefaultClientConnectionOperator.java:263)
    at org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:162)
    at org.apache.http.impl.conn.ManagedClientConnectionImpl.open(ManagedClientConnectionImpl.java:326)
    at org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:610)
    at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:445)
    at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:835)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
    at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:384)
    ... 34 more


Does anyone know a way to get Spark to connect to a LocalStack S3 bucket from Docker containers? Thanks in advance.

1 Answer

After some time, I figured it out.

It turns out you need to set the "spark.hadoop.fs.s3a.path.style.access" parameter to true so that Spark can find the container by name. Without it, the S3A client uses virtual-hosted-style addressing and prepends the bucket name to the endpoint hostname ("test.localstack"), which is exactly the hostname that fails to resolve in the stack trace above. With path-style access, requests go to "http://localstack:4572/test/..." instead, so only the service alias needs to resolve.

Keep in mind that this parameter is only available from Hadoop version 2.8.0 onwards.
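
If your image ships an older client, one way to pull in a newer S3A implementation without rebuilding the image is spark-submit's --packages flag (the version below is just an example; it must match the Hadoop line your Spark build uses):

    spark-submit \
      --packages org.apache.hadoop:hadoop-aws:2.8.5 \
      etl_job.py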

After making this change and upgrading Hadoop, everything works fine.
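
For reference, the relevant S3A settings now look roughly like this (a sketch: the dummy credentials are an assumption, since LocalStack accepts any values):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("etl_job")
             # point the S3A client at the LocalStack container by service name
             .config("spark.hadoop.fs.s3a.endpoint", "http://localstack:4572")
             # path-style requests: http://localstack:4572/test/... instead of the
             # virtual-hosted style http://test.localstack:4572/..., so no
             # per-bucket hostname needs to resolve
             .config("spark.hadoop.fs.s3a.path.style.access", "true")
             # LocalStack accepts any credentials; these are dummies
             .config("spark.hadoop.fs.s3a.access.key", "test")
             .config("spark.hadoop.fs.s3a.secret.key", "test")
             .getOrCreate())

    df = spark.read.csv("s3a://test/data/RAW_DATA_Metro_Interstate_Traffic_Volume.csv.gz",
                        header=True, inferSchema=True)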
