How to log messages in AWS Glue worker (inside map function)?

I am able to follow the instructions in https://docs.aws.amazon.com/glue/latest/dg/monitor-continuous-logging-enable.html and log messages in the driver. But when I try to use the logger inside the map function like this:

from awsglue.context import GlueContext
from awsglue.transforms import Map
from pyspark.context import SparkContext

sc = SparkContext()
glueContext = GlueContext(sc)
logger = glueContext.get_logger()
logger.info("starting glue job...")  # works on the driver
...

def transform(item):
    logger.info("starting transform...")  # fails on the executors
    # ...transform logic...

Map.apply(frame=dynamicFrame, f=transform)

I get this error:

PicklingError: Could not serialize object: TypeError: can't pickle _thread.RLock objects

I researched around, and the error message implies that the logger object cannot be serialized when it is passed to the workers.

What's the correct way to log from an AWS Glue worker?

asked Oct 14 '25 by Xiqiang Lin


2 Answers

Your logger object can't be sent to the remote executors, which is why you get the serialization error. You have to initialize the logger inside the mapper function instead.

But doing that in a transform can be expensive resource-wise. Mappers should ideally be quick and lightweight, since they are executed on each row (see the sketch after the full script below for one way to reduce the repeated setup).

Here is how you can do it, in Glue 3.0 at least. The log events will end up in the error logs.

def transform(record):
    # Configure logging inside the mapper, since it runs on the executor
    logging.basicConfig(level=logging.INFO, format="MAPPER %(asctime)s [%(levelname)s] [%(name)s] %(message)s")
    map_logger = logging.getLogger()
    map_logger.info("an info event")
    map_logger.error("an error event")

    return record

Here's a full example script:

import logging

from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.transforms import Map
from pyspark.context import SparkContext
from pyspark.sql.types import IntegerType

# Configure logging for the driver
logging.basicConfig(level=logging.INFO, format='%(asctime)s [%(levelname)s] [%(name)s] %(message)s')
logger = logging.getLogger(__name__)


def main():
    logger.info("======== Job Initialisation ==========")

    sc = SparkContext()
    glue_context = GlueContext(sc)
    spark_session = glue_context.spark_session

    logger.info("======== Start of ETL ==========")

    df = spark_session.createDataFrame(range(1, 100), IntegerType())

    dynf = DynamicFrame.fromDF(df, glue_context, "dynf")

    # Apply the mapper function to each row
    dynf = Map.apply(frame=dynf, f=transform)

    # show() prints to stdout and returns None, so call it on its own
    # rather than embedding it in a log message
    dynf.show(10)

    logger.info("Done")


def transform(record):
    # Runs on the executors, so logging must be configured here,
    # not inherited from the driver
    logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] [%(name)s] %(message)s")
    map_logger = logging.getLogger("transformer")
    map_logger.info("an info event")
    map_logger.error("an error event")

    return record


main()
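If the repeated setup in the mapper is a concern, one option is to cache the logger at module level so the configuration runs only once per executor process. A minimal sketch (the get_map_logger helper is hypothetical, not part of any Glue API); note that logging.basicConfig is already a no-op once the root logger has handlers, so the main saving is skipping the repeated calls on every row:

import logging

_map_logger = None  # cached once per executor process


def get_map_logger():
    # Hypothetical helper: configure logging on first use, then reuse
    # the same logger object for every subsequent record
    global _map_logger
    if _map_logger is None:
        logging.basicConfig(
            level=logging.INFO,
            format="MAPPER %(asctime)s [%(levelname)s] [%(name)s] %(message)s",
        )
        _map_logger = logging.getLogger("transformer")
    return _map_logger


def transform(record):
    get_map_logger().info("an info event")
    return record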
answered Oct 17 '25 by selle


I faced the same issue in Glue and even tried contacting AWS, but had no luck until @selle's answer helped me. I also found that, even without a logger, you can see print output from the executors in the error logs; you just have to dig into them.
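For example, a bare print inside the mapper is enough to produce output in the executors' error log streams; a minimal sketch:

def transform(record):
    # print() on an executor ends up in that worker's error log stream,
    # not in the driver's output
    print(f"processing record: {record}")
    return record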

This page gives a clear picture of how Glue logging works: https://docs.aws.amazon.com/glue/latest/dg/reduced-start-times-spark-etl-jobs.html#reduced-start-times-logging

By the way, this is my first post on Stack Overflow.

answered Oct 17 '25 by Vignxsh s