Python parse a raw email and get the text content of the body

Question

How can I get the text body of an email without the HTML tags?

I tried the code below to parse the mail, but I get the entire '------=_Part_2' section as the body.

My code

import email

def print_payload(message):
    print('******')
    if message.is_multipart():
        for payload in message.get_payload():
            print_payload(payload)
    else:
        print(message.get_payload())
        for part in message.walk():
            if part.get_content_type():
                body = str(part.get_payload())
                print(body)
    print('******')

# text holds the raw email shown below
message = email.message_from_string(text)
print_payload(message)

Actual email body:

Another test mail.
Thanks,
Munesh

Raw email:

Return-Path: [email protected]
Date: Mon, 18 Sep 2017 23:07:16 +0000
From: [email protected]
To: [email protected]
Cc: [email protected]
Message-ID: <[email protected]>
Subject: My email subject
MIME-Version: 1.0
Content-Type: application/ms-tnef
Content-Transfer-Encoding: binary
X-MS-Exchange-Organization-SCL: -1
X-MS-Exchange-Organization-MessageDirectionality: Originating
Thread-Topic: My email subject
X-Forefront-Antispam-Report: SFV:SKI;SCL:-1;
X-MS-PublicTrafficType: Email
X-MS-Exchange-Organization-Antispam-Report: SFV:SKI;SCL:-1;
Accept-Language: en-US
Content-Language: en-US

------=_Part_2_123.456
Content-Type: text/html; charset=us-ascii
Content-Transfer-Encoding: 7bit

<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40"><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"><meta name="Generator" content="Microsoft Word 14 (filtered medium)"><style><!-- /* Font Definitions */ @font-face
        {font-family:Calibri;
        panose-1:2 15 5 2 2 2 4 3 2 4;} /* Style Definitions */ p.MsoNormal, li.MsoNormal, div.MsoNormal
        {margin:0in;
        margin-bottom:.0001pt;
        font-size:11.0pt;
        font-family:"Calibri","sans-serif";} a:link, span.MsoHyperlink
        {mso-style-priority:99;
        color:blue;
        text-decoration:underline;} a:visited, span.MsoHyperlinkFollowed
        {mso-style-priority:99;
        color:purple;
        text-decoration:underline;} span.EmailStyle17
        {mso-style-type:personal-compose;
        font-family:"Calibri","sans-serif";
        color:windowtext;} .MsoChpDefault
        {mso-style-type:export-only;
        font-family:"Calibri","sans-serif";} @page WordSection1
        {size:8.5in 11.0in;
        margin:1.0in 1.0in 1.0in 1.0in;} div.WordSection1
        {page:WordSection1;}
--></style><!--[if gte mso 9]><xml><o:shapedefaults v:ext="edit" spidmax="1026" /></xml><![endif]--><!--[if gte mso 9]><xml><o:shapelayout v:ext="edit"><o:idmap v:ext="edit" data="1" /></o:shapelayout></xml><![endif]--></head><body lang="EN-US" link="blue" vlink="purple"><div class="WordSection1"><p class="MsoNormal">Another test mail.<o:p></o:p></p><p class="MsoNormal"><o:p>&nbsp;</o:p></p><p class="MsoNormal">Thanks,<o:p></o:p></p><p class="MsoNormal">Munesh<o:p></o:p></p><p class="MsoNormal"><o:p>&nbsp;</o:p></p></div></body></html>

------=_Part_2_123.456--

Thanks in advance.

Bradley Robinson · Accepted Answer

With the BeautifulSoup library, extracting the text is straightforward. If you don't have it installed, run pip install beautifulsoup4 first (the package is imported as bs4). Then:

from bs4 import BeautifulSoup

def print_payload(message):
    print('******')
    if message.is_multipart():
        # Recurse into each sub-part of a multipart message
        for payload in message.get_payload():
            print_payload(payload)
    else:
        body = str(message.get_payload())
        # Name the parser explicitly so bs4 does not have to guess one
        soup = BeautifulSoup(body, 'html.parser')
        # Print only the text inside each <p> element
        for paragraph in soup.find_all('p'):
            print(paragraph.text)
    print('******')

BeautifulSoup builds a parse tree from the HTML, from which you can select elements by tag. If your email contains other HTML elements that carry text (for example 'div', 'span', or 'li'), you may have to search for those too to get all the data. For this simple email, finding every 'p' element is sufficient.
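
If you would rather not depend on selecting specific tags, one possible variation is sketched below. It is not the code from the answer above: it swaps BeautifulSoup for the standard library's html.parser, prefers a text/plain part when one exists, and otherwise strips the tags from the first text/html part. The names TextExtractor, extract_text_body and SAMPLE are made up for this illustration, and the sketch assumes the message declares a multipart Content-Type with a matching boundary. The raw email as pasted in the question declares application/ms-tnef instead, so for that exact input the function would return None.

```python
import email
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects visible text, skipping <style>/<script>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("style", "script"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("style", "script") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Ignore whitespace-only runs (this includes a lone &nbsp;)
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def extract_text_body(raw):
    """Return the plain-text body of a raw email string, or None."""
    message = email.message_from_string(raw)
    html_text = None
    for part in message.walk():
        if part.is_multipart():
            continue  # containers carry no text of their own
        payload = part.get_payload(decode=True)  # undo base64 / quoted-printable
        if payload is None:
            continue
        charset = part.get_content_charset() or "utf-8"
        text = payload.decode(charset, errors="replace")
        content_type = part.get_content_type()
        if content_type == "text/plain":
            return text.strip()  # a plain-text part wins outright
        if content_type == "text/html" and html_text is None:
            parser = TextExtractor()
            parser.feed(text)
            parser.close()
            html_text = "\n".join(parser.chunks)
    return html_text


# Made-up, well-formed multipart message modelled on the one in the question
SAMPLE = (
    "MIME-Version: 1.0\n"
    "Subject: My email subject\n"
    'Content-Type: multipart/alternative; boundary="XYZ"\n'
    "\n"
    "--XYZ\n"
    "Content-Type: text/html; charset=us-ascii\n"
    "Content-Transfer-Encoding: 7bit\n"
    "\n"
    "<html><head><style>p {color: blue;}</style></head><body>"
    "<p>Another test mail.</p><p>&nbsp;</p><p>Thanks,</p><p>Munesh</p>"
    "</body></html>\n"
    "--XYZ--\n"
)

if __name__ == "__main__":
    print(extract_text_body(SAMPLE))
```

Checking get_content_type() against 'text/plain' and 'text/html' is what keeps attachments and other binary parts out of the result. Passing decode=True to get_payload() matters once a part is base64 or quoted-printable encoded, which the 7bit sample here does not exercise.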

Tags:

python

email

parsing