I am able to follow the instructions in https://docs.aws.amazon.com/glue/latest/dg/monitor-continuous-logging-enable.html and log messages in the driver. But when I try to use the logger inside the map function like this:
sc = SparkContext()
glueContext = GlueContext(sc)
logger = glueContext.get_logger()
logger.info("starting glue job...")  # successful
...
def transform(item):
    logger.info("starting transform...")  # error
    ...transform logic...
Map.apply(frame=dynamicFrame, f=transform)
I get this error:
PicklingError: Could not serialize object: TypeError: can't pickle _thread.RLock objects
From my research, the message implies that the logger object cannot be serialized (pickled) when it is shipped to the workers.
What's the correct way to do logging in AWS Glue worker?
Your logger object can't be sent to the remote executors, which is why you get the serialization error. You have to initialize the logger inside the mapper function instead.
But doing that in a transform can be expensive resource-wise: mappers should ideally be quick and lightweight, since they run once per row. See the note after the snippet below for a way to keep that cost down.
Here is how you can do it, at least in Glue 3.0. The log events will end up in the error logs.
import logging

def transform(record):
    logging.basicConfig(level=logging.INFO, format="MAPPER %(asctime)s [%(levelname)s] [%(name)s] %(message)s")
    map_logger = logging.getLogger()
    map_logger.info("an info event")
    map_logger.error("an error event")
    return record
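On the resource concern: logging.basicConfig is a no-op once the root logger already has handlers, so the repeated calls above cost less than they look. If you want to make the once-per-executor setup explicit, here is a minimal sketch (the get_map_logger helper is my own name, not part of the Glue API):

import logging

def get_map_logger():
    # Configure the root logger once per executor process; basicConfig
    # does nothing if handlers are already attached, so this stays cheap.
    if not logging.getLogger().handlers:
        logging.basicConfig(
            level=logging.INFO,
            format="MAPPER %(asctime)s [%(levelname)s] [%(name)s] %(message)s",
        )
    return logging.getLogger("transformer")

def transform(record):
    get_map_logger().info("an info event")
    return record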
Here's a full example script:
import logging

from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.transforms import Map
from pyspark.context import SparkContext
from pyspark.sql.types import IntegerType

# Configure logging for the driver
logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] [%(name)s] %(message)s")
logger = logging.getLogger(__name__)


def main():
    logger.info("======== Job Initialisation ==========")
    sc = SparkContext()
    glue_context = GlueContext(sc)
    spark_session = glue_context.spark_session

    logger.info("======== Start of ETL ==========")
    df = spark_session.createDataFrame(range(1, 100), IntegerType())
    dynf = DynamicFrame.fromDF(df, glue_context, "dynf")

    # Apply the mapper function on each row
    dynf = Map.apply(frame=dynf, f=transform)

    # show() prints to stdout and returns None, so don't log its return value
    logger.info("Result:")
    dynf.show(10)
    logger.info("Done")


def transform(record):
    logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] [%(name)s] %(message)s")
    map_logger = logging.getLogger("transformer")
    map_logger.info("an info event")
    map_logger.error("an error event")
    return record


main()
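If you can't find the events, check the CloudWatch log groups for the job: with the default setup the driver and executor output typically lands in /aws-glue/jobs/output and /aws-glue/jobs/error, and with continuous logging enabled in /aws-glue/jobs/logs-v2. The mapper's events show up in the per-executor error streams.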
I faced the same issue in Glue and even tried contacting AWS, with no luck, until @selle's answer helped me. I also found that even without a logger you can see print output from the executors; you just have to dig into the error logs.
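For example, something as simple as this (a minimal sketch) will show up in the executors' error log streams:

def transform(record):
    # stdout from the executor processes is captured in the job's
    # error logs in CloudWatch
    print(f"processing record: {record}")
    return record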
This will give you a clearer picture of Glue logs: https://docs.aws.amazon.com/glue/latest/dg/reduced-start-times-spark-etl-jobs.html#reduced-start-times-logging
By the way, this is my first post on Stack Overflow.