I'm using Docker to develop AWS Glue jobs locally (with PySpark). I have a Python file (song_data.py) containing the AWS Glue job, which uses the GlueContext class. When I run
gluesparksubmit glue_etl_scripts/song_data.py --JOB-NAME test
within the container terminal to execute the Glue job script, I get the following error:
20/06/24 02:12:54 WARN EC2MetadataUtils: Unable to retrieve the requested metadata (/latest/dynamic/instance-identity/document). Failed to connect to service endpoint:
com.amazonaws.SdkClientException: Failed to connect to service endpoint:
at com.amazonaws.internal.EC2ResourceFetcher.doReadResource(EC2ResourceFetcher.java:100)
at com.amazonaws.internal.EC2ResourceFetcher.doReadResource(EC2ResourceFetcher.java:70)
at com.amazonaws.internal.InstanceMetadataServiceResourceFetcher.readResource(InstanceMetadataServiceResourceFetcher.java:75)
at com.amazonaws.internal.EC2ResourceFetcher.readResource(EC2ResourceFetcher.java:66)
at com.amazonaws.util.EC2MetadataUtils.getItems(EC2MetadataUtils.java:402)
at com.amazonaws.util.EC2MetadataUtils.getData(EC2MetadataUtils.java:371)
at com.amazonaws.util.EC2MetadataUtils.getData(EC2MetadataUtils.java:367)
at com.amazonaws.util.EC2MetadataUtils.getEC2InstanceRegion(EC2MetadataUtils.java:282)
at com.amazonaws.regions.InstanceMetadataRegionProvider.tryDetectRegion(InstanceMetadataRegionProvider.java:59)
at com.amazonaws.regions.InstanceMetadataRegionProvider.getRegion(InstanceMetadataRegionProvider.java:50)
at com.amazonaws.regions.AwsRegionProviderChain.getRegion(AwsRegionProviderChain.java:46)
at com.amazonaws.services.glue.util.EndpointConfig$.getConfig(EndpointConfig.scala:42)
at com.amazonaws.services.glue.util.AWSConnectionUtils$.<init>(AWSConnectionUtils.scala:36)
at com.amazonaws.services.glue.util.AWSConnectionUtils$.<clinit>(AWSConnectionUtils.scala)
at com.amazonaws.services.glue.GlueContext.getCatalogSource(GlueContext.scala:152)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.ConnectException: Connection refused (Connection refused)
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:607)
at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:463)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:558)
at sun.net.www.http.HttpClient.<init>(HttpClient.java:242)
at sun.net.www.http.HttpClient.New(HttpClient.java:339)
at sun.net.www.http.HttpClient.New(HttpClient.java:357)
at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1226)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1205)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:1056)
at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:990)
at com.amazonaws.internal.ConnectionUtils.connectToEndpoint(ConnectionUtils.java:52)
at com.amazonaws.internal.EC2ResourceFetcher.doReadResource(EC2ResourceFetcher.java:80)
... 25 more
An error occurred while calling o28.getCatalogSource.
: java.lang.ExceptionInInitializerError
at com.amazonaws.services.glue.GlueContext.getCatalogSource(GlueContext.scala:152)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Caused by: com.amazonaws.SdkClientException: Unable to load region information from any provider in the chain
at com.amazonaws.regions.AwsRegionProviderChain.getRegion(AwsRegionProviderChain.java:59)
at com.amazonaws.services.glue.util.EndpointConfig$.getConfig(EndpointConfig.scala:42)
at com.amazonaws.services.glue.util.AWSConnectionUtils$.<init>(AWSConnectionUtils.scala:36)
at com.amazonaws.services.glue.util.AWSConnectionUtils$.<clinit>(AWSConnectionUtils.scala)
... 12 more
The error is raised when the glueContext.create_dynamic_frame.from_catalog() method is called within the glue job file (song_data.py):
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark import SQLContext
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from configparser import ConfigParser

config = ConfigParser()
config.read_file(open('/usr/local/src/config/aws.cfg'))

sc = SparkContext.getOrCreate()
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hadoop_conf.set("fs.s3a.access.key", config.get('AWS', 'KEY'))
hadoop_conf.set("fs.s3a.secret.key", config.get('AWS', 'SECRET'))
hadoop_conf.set("fs.s3a.endpoint", "s3.us-west-2.amazonaws.com")

sql = SQLContext(sc)
glueContext = GlueContext(sql)

try:
    song_df = glueContext.create_dynamic_frame.from_catalog(
        database='sparkify',
        table_name='song_data')
    print('Count: ', song_df.count())
    print('Schema: ')
    song_df.printSchema()
except Exception as e:
    print(e)
I've tried:
Changing the Hadoop configuration from fs.s3a to fs.s3, with different access/secret key attributes:
hadoop_conf.set("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hadoop_conf.set("fs.s3.awsAccessKeyId", config.get('AWS', 'KEY'))
hadoop_conf.set("fs.s3.awsSecretAccessKey", config.get('AWS', 'SECRET'))
hadoop_conf.set("fs.s3.endpoint", "s3.us-west-2.amazonaws.com")
Using GlueContext's create_dynamic_frame_from_catalog() method instead of create_dynamic_frame.from_catalog():
song_df = glueContext.create_dynamic_frame_from_catalog(
    database='sparkify',
    table_name='song_data')
Removing the Hadoop endpoint config:
# hadoop_conf.set("fs.s3a.endpoint", "s3.us-west-2.amazonaws.com")
UPDATED ATTEMPTS
Changed song_data.py to:
# Imports needed in addition to the ones shown in the original script
from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = (
    SparkConf()
    .set('spark.hadoop.fs.s3a.access.key', config.get('AWS', 'KEY'))
    .set('spark.hadoop.fs.s3a.secret.key', config.get('AWS', 'SECRET'))
    .set("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
)
sc = SparkContext(conf=conf)
spark = SparkSession(sc)
glueContext = GlueContext(spark)

try:
    print('Attempt 1:')
    song_df = glueContext.create_dynamic_frame.from_options(
        connection_type='s3',
        connection_options={"paths": ["s3a://sparkify-dend-analytics"]},
        format='json')
    print('Count: ', song_df.count())
    print('Schema: ')
    song_df.printSchema()
except Exception as e:
    print(e)

try:
    print('Attempt 2:')
    song_df = glueContext.create_dynamic_frame.from_catalog(
        database='sparkify',
        table_name='song_data')
    print('Count: ', song_df.count())
    print('Schema: ')
    song_df.printSchema()
except Exception as e:
    print(e)

try:
    print('Attempt 3:')
    song_df = glueContext.create_dynamic_frame_from_catalog(
        database='sparkify',
        table_name='song_data')
    print('Count: ', song_df.count())
    print('Schema: ')
    song_df.printSchema()
except Exception as e:
    print(e)
OUTPUT ERRORS
Attempt 1:
An error occurred while calling o37.getDynamicFrame.
: org.apache.hadoop.fs.s3a.AWSClientIOException: doesBucketExist on
sparkify-dend-analytics: com.amazonaws.AmazonClientException: No AWS
Credentials provided by DefaultAWSCredentialsProviderChain :
com.amazonaws.SdkClientException: Unable to load AWS credentials from any
provider in the chain: [EnvironmentVariableCredentialsProvider: Unable to
load AWS credentials from environment variables (AWS_ACCESS_KEY_ID (or
AWS_ACCESS_KEY) and AWS_SECRET_KEY (or AWS_SECRET_ACCESS_KEY)),
SystemPropertiesCredentialsProvider: Unable to load AWS credentials from
Java system properties (aws.accessKeyId and aws.secretKey),
WebIdentityTokenCredentialsProvider: You must specify a value for roleArn
and roleSessionName, com.amazonaws.auth.profile.ProfileCredentialsProvider@xxxxxxxx:
profile file cannot be null, com.amazonaws.auth.EC2ContainerCredentialsProviderWrapper@xxxxxxxx: Failed
to connect to service endpoint: ]
Attempt 2:
EC2MetadataUtils: Unable to retrieve the requested metadata (/latest/dynamic/instance-identity/document). Failed to connect to service endpoint:
com.amazonaws.SdkClientException: Failed to connect to service endpoint:
......
Caused by: java.net.ConnectException: Connection refused (Connection refused)
......
Caused by: com.amazonaws.SdkClientException: Unable to load region information from any provider in the chain
Attempt 3:
An error occurred while calling o32.getCatalogSource.
: java.lang.NoClassDefFoundError: Could not initialize class com.amazonaws.services.glue.util.AWSConnectionUtils$
Glue jobs run locally in a Docker container cannot access the Glue catalog.
Instead of reading from the catalog, read the data directly from S3 using
from_options(connection_type, connection_options={}, format=None, format_options={}, transformation_ctx="")
The AWS Glue documentation describes this method.
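As a minimal sketch of the direct read suggested above, the helper below wraps the from_options call using the signature quoted in this answer. The function name read_json_from_s3 is hypothetical, glue_context is assumed to be an already-constructed GlueContext, and S3 credentials must still be available to the job for the read to succeed.

```python
def read_json_from_s3(glue_context, paths, transformation_ctx=""):
    """Read JSON files from S3 into a DynamicFrame, bypassing the Glue catalog.

    glue_context: an existing GlueContext (assumed to be created by the caller).
    paths: a list of S3 URIs to read from.
    """
    if not paths:
        raise ValueError("paths must contain at least one S3 URI")
    # connection_type, connection_options, format and transformation_ctx
    # are the parameters named in the from_options signature above.
    return glue_context.create_dynamic_frame.from_options(
        connection_type="s3",
        connection_options={"paths": list(paths)},
        format="json",
        transformation_ctx=transformation_ctx,
    )
```

In song_data.py this might be called as read_json_from_s3(glueContext, ["s3a://sparkify-dend-analytics"]), reusing the bucket path from the question.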
Update: You are getting a region error, which is common when running Glue locally.
Try running the command below, substituting your own region. The value is used to initialize the library, and the job still runs locally:
export AWS_REGION=us-east-1
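Both errors in the question come from an empty provider chain (region in one case, credentials in the other), so a small preflight check at the top of the job can make a missing setting obvious before any Glue call is made. This is only a sketch: the function name missing_aws_settings is hypothetical, and the variable names are the ones mentioned in this answer and in the credentials error message in the question. It only inspects environment variables, so it does not cover other providers such as a profile file, and the variables themselves still have to be exported in the container shell before gluesparksubmit is run.

```python
import os

# Each requirement maps to the environment variable names that satisfy it.
REQUIRED_SETTINGS = {
    "region": ("AWS_REGION",),
    "access key": ("AWS_ACCESS_KEY_ID", "AWS_ACCESS_KEY"),
    "secret key": ("AWS_SECRET_ACCESS_KEY", "AWS_SECRET_KEY"),
}


def missing_aws_settings(environ=None):
    """Return a description of each setting that has no non-empty variable."""
    env = os.environ if environ is None else environ
    missing = []
    for label, names in REQUIRED_SETTINGS.items():
        if not any(env.get(name) for name in names):
            missing.append(f"{label} ({' or '.join(names)})")
    return missing


if __name__ == "__main__":
    problems = missing_aws_settings()
    if problems:
        print("Missing AWS settings: " + "; ".join(problems))
    else:
        print("Region and credentials are set in the environment.")
```

Passing a plain dict instead of os.environ makes the check easy to exercise without changing the real environment.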