I am trying to add logging to some Jupyter Notebook code (running the PySpark3 kernel).
Digging around SO, I found a few answers saying that basicConfig() does not work because the notebook starts its own logging session. Some workaround answers suggested running reload(logging) to get around this. With that in mind, I am setting up my logging like this:
from importlib import reload  # Not needed in Python 2
import logging
reload(logging)
logging.basicConfig(
    format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
    level=logging.INFO,
    datefmt="%y/%m/%d %H:%M:%S",
)
logger = logging.getLogger(__name__)
Then I run an info statement, logger.info("this is a test"), and I get a ValueError about an I/O operation. I am not sure what this means.
--- Logging error ---
Traceback (most recent call last):
  File "/usr/lib64/python3.6/logging/__init__.py", line 994, in emit
    stream.write(msg)
  File "/tmp/2950371398694308674", line 534, in write
    super(UnicodeDecodingStringIO, self).write(s)
ValueError: I/O operation on closed file
Call stack:
  File "/tmp/2950371398694308674", line 700, in <module>
    sys.exit(main())
  File "/tmp/2950371398694308674", line 672, in main
    response = handler(content)
  File "/tmp/2950371398694308674", line 318, in execute_request
    result = node.execute()
  File "/tmp/2950371398694308674", line 229, in execute
    exec(code, global_dict)
  File "<stdin>", line 1, in <module>
Message: 'this is a test'
Arguments: ()
This has something to do with logging interacting with stdout/stderr, but I am not sure how to resolve it.
Following up on my comment above, I came up with this workaround.
The problem seems to be that sys.stdout does not play well with Spark, or at least not when it is used from Jupyter. You can easily verify this by creating a new (PySpark3) notebook, importing sys, and then printing sys.stdout in different cells: they will print different objects. In my case there are 4 of them and it cycles between them. I can't say why it is 4; perhaps it's particular to my cluster config, but it didn't change when I changed the number of executors or the cores per executor.
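To illustrate why a handler created in one cell could fail in a later one, here is a minimal sketch that reproduces the same ValueError outside Spark. It assumes (the traceback alone does not confirm this) that the kernel wrapper hands each cell its own stdout-like object and closes it afterwards. io.StringIO stands in for that object, and handler_stream_is_closed is a hypothetical helper name, not part of the logging module.

```python
import io
import logging


def handler_stream_is_closed(handler):
    """Return True if the handler is still bound to a stream that has been closed."""
    stream = getattr(handler, "stream", None)
    return bool(getattr(stream, "closed", False))


# Stand-in for the stdout object that was current when logging was configured.
cell_stdout = io.StringIO()
handler = logging.StreamHandler(cell_stdout)

# Assumption: the wrapper closes that object once the cell has finished.
cell_stdout.close()

# The handler still holds the old object, so any later write to it fails with
# "ValueError: I/O operation on closed file".
stale = handler_stream_is_closed(handler)
```

Running handler_stream_is_closed on each entry of logging.getLogger().handlers in a later cell is one way to check whether this is what is happening in your notebook.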
My workaround goes:
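import logging
logger = logging.getLogger(__name__)
# print() looks up the current sys.stdout on every call, so it never writes to a stale object
logger.handlers[0].stream.write = print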
This works because I know my logger has only one handler, and it writes to sys.stdout. If your logger has more handlers (say, one for stdout and one for a file), I haven't figured out how to change only the stdout one: I can't check stream == sys.stdout, because the root of the problem is that the object will have changed, unless you do the comparison in the same cell where you created the logger. So this workaround may not be for everyone.
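One possible way around the multiple-handler limitation is to pick console handlers by type rather than by comparing streams, and to swap each one for a handler that looks up sys.stdout every time it emits. The sketch below is untested on a PySpark3 kernel. CurrentStdoutHandler and replace_console_handlers are hypothetical names, and the stream property copies the approach the standard library uses for its own last-resort stderr handler.

```python
import logging
import sys


class CurrentStdoutHandler(logging.StreamHandler):
    """StreamHandler that resolves sys.stdout at emit time instead of caching it."""

    def __init__(self, level=logging.NOTSET):
        # Skip StreamHandler.__init__, which would try to store a fixed stream.
        logging.Handler.__init__(self, level)

    @property
    def stream(self):
        return sys.stdout


def replace_console_handlers(logger):
    """Swap plain StreamHandlers for CurrentStdoutHandler; leave file handlers alone."""
    replaced = 0
    for old in list(logger.handlers):
        if isinstance(old, (logging.FileHandler, CurrentStdoutHandler)):
            continue  # FileHandler subclasses StreamHandler, so exclude it explicitly
        if not isinstance(old, logging.StreamHandler):
            continue
        new = CurrentStdoutHandler(old.level)
        new.setFormatter(old.formatter)  # keep the existing format string
        logger.removeHandler(old)
        logger.addHandler(new)
        replaced += 1
    return replaced
```

Two caveats: basicConfig() attaches its handler to the root logger, so the call would be replace_console_handlers(logging.getLogger()) in that setup, and the default basicConfig() stream is sys.stderr, so this swap also redirects that output to stdout.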
If I find a better solution, I'll edit this answer, but I'm using this for now and it's working like a charm.