How can I get the text body of an email without the HTML tags?
I tried the code below to parse the mail, but I get the entire '------=_Part_2' section as the body.
My code:
import email

def print_payload(message):
    print('******')
    if message.is_multipart():
        # Recurse into each sub-part of a multipart message.
        for payload in message.get_payload():
            print_payload(payload)
    else:
        print(message.get_payload())
        for part in message.walk():
            if part.get_content_type():
                body = str(part.get_payload())
                print(body)
    print('******')

# text holds the raw email shown below
message = email.message_from_string(text)
print_payload(message)
Actual email body:
Another test mail.
Thanks,
Munesh
Raw email:
Return-Path: [email protected]
Date: Mon, 18 Sep 2017 23:07:16 +0000
From: [email protected]
To: [email protected]
Cc: [email protected]
Message-ID: <[email protected]>
Subject: My email subject
MIME-Version: 1.0
Content-Type: application/ms-tnef
Content-Transfer-Encoding: binary
X-MS-Exchange-Organization-SCL: -1
X-MS-Exchange-Organization-MessageDirectionality: Originating
Thread-Topic: My email subject
X-Forefront-Antispam-Report: SFV:SKI;SCL:-1;
X-MS-PublicTrafficType: Email
X-MS-Exchange-Organization-Antispam-Report: SFV:SKI;SCL:-1;
Accept-Language: en-US
Content-Language: en-US

------=_Part_2_123.456
Content-Type: text/html; charset=us-ascii
Content-Transfer-Encoding: 7bit

<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40"><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"><meta name="Generator" content="Microsoft Word 14 (filtered medium)"><style><!-- /* Font Definitions */ @font-face
{font-family:Calibri;
panose-1:2 15 5 2 2 2 4 3 2 4;} /* Style Definitions */ p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin:0in;
margin-bottom:.0001pt;
font-size:11.0pt;
font-family:"Calibri","sans-serif";} a:link, span.MsoHyperlink
{mso-style-priority:99;
color:blue;
text-decoration:underline;} a:visited, span.MsoHyperlinkFollowed
{mso-style-priority:99;
color:purple;
text-decoration:underline;} span.EmailStyle17
{mso-style-type:personal-compose;
font-family:"Calibri","sans-serif";
color:windowtext;} .MsoChpDefault
{mso-style-type:export-only;
font-family:"Calibri","sans-serif";} @page WordSection1
{size:8.5in 11.0in;
margin:1.0in 1.0in 1.0in 1.0in;} div.WordSection1
{page:WordSection1;}
--></style><!--[if gte mso 9]><xml><o:shapedefaults v:ext="edit" spidmax="1026" /></xml><![endif]--><!--[if gte mso 9]><xml><o:shapelayout v:ext="edit"><o:idmap v:ext="edit" data="1" /></o:shapelayout></xml><![endif]--></head><body lang="EN-US" link="blue" vlink="purple"><div class="WordSection1"><p class="MsoNormal">Another test mail.<o:p></o:p></p><p class="MsoNormal"><o:p> </o:p></p><p class="MsoNormal">Thanks,<o:p></o:p></p><p class="MsoNormal">Munesh<o:p></o:p></p><p class="MsoNormal"><o:p> </o:p></p></div></body></html>
------=_Part_2_123.456--
Thanks in advance.
With the BeautifulSoup library, extracting the text is not too hard. If you do not have it installed, run pip install beautifulsoup4 first (that package provides the bs4 module). After that:
from bs4 import BeautifulSoup

def print_payload(message):
    print('******')
    if message.is_multipart():
        # Recurse into each sub-part of a multipart message.
        for payload in message.get_payload():
            print_payload(payload)
    else:
        body = str(message.get_payload())
        # Name the parser explicitly so bs4 does not have to guess one.
        soup = BeautifulSoup(body, 'html.parser')
        # Print only the text inside each <p> element, not the markup.
        for paragraph in soup.find_all('p'):
            print(paragraph.text)
    print('******')
BeautifulSoup builds a parse tree from which HTML elements can be selected. If your email contains other HTML elements, you may need to search for those as well to get all the text. For this simple email, finding every element with the tag 'p' is sufficient.
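If you would rather not depend on particular tags, or cannot install BeautifulSoup, the same idea can be sketched with only the standard library. This is a rough alternative and not the approach above: it swaps BeautifulSoup for html.parser.HTMLParser, and the names TextExtractor, html_to_text, body_text and sample are made up for illustration. It assumes Python 3 and a well-formed MIME message whose Content-Type headers are accurate.

```python
import email
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text from any element, skipping <style> and <script> contents."""

    SKIPPED = {"style", "script"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.chunks.append(text)


def html_to_text(html):
    """Return the text chunks of an HTML string, one per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


def body_text(message):
    """Return the text of the first text/plain or text/html part, else ''."""
    for part in message.walk():
        if part.is_multipart():
            continue
        # decode=True undoes base64 / quoted-printable and returns bytes.
        raw = part.get_payload(decode=True)
        charset = part.get_content_charset() or "utf-8"
        content = raw.decode(charset, errors="replace")
        if part.get_content_type() == "text/plain":
            return content.strip()
        if part.get_content_type() == "text/html":
            return html_to_text(content)
    return ""


# Hypothetical sample message, much smaller than the one in the question.
sample = (
    "From: a@example.com\n"
    "Subject: test\n"
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/alternative; boundary="XYZ"\n'
    "\n"
    "--XYZ\n"
    "Content-Type: text/html; charset=us-ascii\n"
    "\n"
    "<html><head><style>p {color: red;}</style></head>"
    "<body><p>Another test mail.</p><p>Thanks,</p><p>Munesh</p></body></html>\n"
    "--XYZ--\n"
)

print(body_text(email.message_from_string(sample)))
```

One caveat: the email package only splits a message at boundary lines when the top-level Content-Type is multipart/* with a boundary parameter. The raw email as pasted in the question declares application/ms-tnef instead, which would explain why the whole '------=_Part_2' section comes back as one payload. For a message like that, body_text returns an empty string and the payload has to be handled separately.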