
Read a big .mbox file with Python

Tags: python, email, mbox

I'd like to read a big 3GB .mbox file coming from a Gmail backup. This works:

import mailbox
mbox = mailbox.mbox(r"D:\All mail Including Spam and Trash.mbox")
for i, message in enumerate(mbox):
    print("from   :",message['from'])
    print("subject:",message['subject'])
    if message.is_multipart():
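        # get_payload(decode=True) returns bytes (or None for nested multipart parts)
        content = b''.join(part.get_payload(decode=True) or b''
                           for part in message.walk() if not part.is_multipart())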
    else:
        content = message.get_payload(decode=True)
    print("content:",content)
    print("**************************************")

    if i == 10:
        break

except it takes more than 40 seconds for the first 10 messages only.

Is there a faster way to access a big .mbox file with Python?

Basj asked Nov 28 '25 05:11



1 Answer

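Here's a quick and dirty attempt to implement a generator which reads an mbox file message by message. As far as I can tell, the standard mailbox.mbox scans the whole file to build a table of contents before it hands back the first message, which would explain the delay on a 3GB file; reading sequentially avoids that. I have opted to simply ditch the information from the From separator; the real mailbox library may well provide more information, and of course, this only supports reading, not searching or writing back to the input file.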

#!/usr/bin/env python3

import email
from email.policy import default

class MboxReader:
    def __init__(self, filename):
        self.handle = open(filename, 'rb')
        # An mbox file starts with a "From " separator line; consume it here.
        if not self.handle.readline().startswith(b'From '):
            self.handle.close()
            raise ValueError(f'{filename} does not look like an mbox file')

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, exc_traceback):
        self.handle.close()

    def __iter__(self):
        # Generator: collect lines up to the next "From " separator (or EOF),
        # then parse the collected lines as one message.
        lines = []
        for line in self.handle:
            if line.startswith(b'From '):
                yield email.message_from_bytes(b''.join(lines), policy=default)
                lines = []
            else:
                lines.append(line)
        # Whatever follows the last separator is the final message.
        yield email.message_from_bytes(b''.join(lines), policy=default)

Usage:

with MboxReader(mboxfilename) as mbox:
    for message in mbox:
        print(message.as_string())
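
To get back to what the question actually prints (sender, subject and body of the first ten messages), the same line-splitting idea can be combined with itertools.islice so that only the start of the file is ever read. This is just a sketch under a few assumptions: iter_mbox, summarize and first_messages are made-up helper names (not part of any library), ">From " quoting inside bodies is not undone, and get_content() may still raise on messages with a broken charset declaration.

```python
import email
from email.policy import default
from itertools import islice


def iter_mbox(path):
    """Yield one EmailMessage per "From "-separated chunk of an mbox file."""
    with open(path, 'rb') as handle:
        lines = []
        for line in handle:
            if line.startswith(b'From '):
                if lines:  # nothing to emit before the very first separator
                    yield email.message_from_bytes(b''.join(lines), policy=default)
                lines = []
            else:
                lines.append(line)
        if lines:  # the last message has no separator after it
            yield email.message_from_bytes(b''.join(lines), policy=default)


def summarize(message):
    """Return (from, subject, text); text is the plain or HTML body, or ''."""
    body = message.get_body(preferencelist=('plain', 'html'))
    text = body.get_content() if body is not None else ''
    return message['from'], message['subject'], text


def first_messages(path, count=10):
    """Parse only as many messages as requested, not the whole file."""
    return [summarize(m) for m in islice(iter_mbox(path), count)]
```

Calling first_messages(mboxfilename) should return after reading only the first ten messages' worth of bytes, because the generator is never advanced past that point.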

The policy=default argument (or any other policy, if you prefer) selects the modern EmailMessage API, which was introduced provisionally in Python 3.3 and became official in 3.6. If you need to support older Python versions from simpler times, you will want to omit it; but really, the new API is better in many ways.
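
To illustrate what the policy argument changes, here is a small comparison on a hand-written message. The RAW bytes are invented for the example; everything else is the standard email package.

```python
import email
from email.message import EmailMessage
from email.policy import default

# A made-up message with an RFC 2047 encoded-word in the Subject header.
RAW = (
    b"From: Alice <alice@example.com>\n"
    b"Subject: =?utf-8?q?caf=C3=A9?=\n"
    b"Content-Type: text/plain; charset=utf-8\n"
    b"\n"
    b"Hello\n"
)

# Without a policy you get the legacy (compat32) Message object.
legacy = email.message_from_bytes(RAW)
# With policy=default you get an EmailMessage, with decoded headers
# and helpers such as get_body() and get_content().
modern = email.message_from_bytes(RAW, policy=default)

print(type(legacy).__name__, repr(legacy['subject']))
print(type(modern).__name__, repr(modern['subject']))
print(modern.get_body(preferencelist=('plain',)).get_content())
```

With the legacy object the Subject comes back still encoded and the body has to be decoded by hand from get_payload(decode=True); the EmailMessage object does both for you.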

tripleee answered Nov 30 '25 20:11
