 

Unable to Parse Email in Python

I have a set of .msg files stored on my E: drive, and I need to read them and extract some information. I am using the code below in Python 3.6.

from email.parser import Parser

# Raw string so the backslashes in the Windows path are not treated as escapes.
with open(r"E:\Downloads\Test1.msg", encoding="ISO-8859-1") as fp:
    headers = Parser().parse(fp)

print('To: %s' % headers['To'])
print('From: %s' % headers['From'])
print('Subject: %s' % headers['subject'])

This is the output I get:

To: None
From: None
Subject: None

Process finished with exit code 0

I am not getting the actual values of the To, From and Subject fields.

Any thoughts on why it is not printing the actual values?

My sample .msg file looks like this:

From: [email protected]
To: [email protected]
Subject: orderid: ord1234, circtid: cr1234


Charges:
Annual Charge - 10
Excess Charges - 5

From this message I am trying to extract the order ID and circuit ID from the subject, and the charges from the mail body.


Thanks

Ratan asked Nov 29 '25 07:11



1 Answer

This is the body of the file that you posted on pastebin for us.

From: [email protected] <[email protected]>
Sent: Thursday, January 4, 2018 11:58 AM
To: Ratankumar Shivratri
Subject: Cct Id: ONE211, eCo order No: 1CTRP

Charges:

Annual rental - 2,125.00

Maintenance charge - 0.00



Regards

Ratan.

I've been able to obtain data from the headers using the following code.

>>> from email.parser import Parser
>>> p = Parser()
>>> msg = p.parse(open('ratan.msg'))
>>> msg['To']
'Ratankumar Shivratri'
>>> msg['From']
'[email protected] <[email protected]>'
>>> msg['Subject']
'Cct Id: ONE211, eCo order No: 1CTRP\n '

So that much works.
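The body can be reached in the same way. What follows is only a minimal sketch, assuming the file is plain header-and-body text like the pastebin sample and that each charge sits on its own line as "name - amount". The sample string, the placeholder sender address and the helper name parse_charges are made up for illustration.

```python
import re
from email.parser import Parser

# Hypothetical sample modelled on the pastebin message; the sender
# address is a placeholder, not the real one.
SAMPLE = (
    "From: sender@example.com\n"
    "To: Ratankumar Shivratri\n"
    "Subject: Cct Id: ONE211, eCo order No: 1CTRP\n"
    "\n"
    "Charges:\n"
    "Annual rental - 2,125.00\n"
    "Maintenance charge - 0.00\n"
)

def parse_charges(body):
    """Return {charge name: amount} for lines shaped like 'Name - 1,234.00'."""
    charges = {}
    for line in body.splitlines():
        match = re.match(r"\s*(.+?)\s+-\s+([\d,]+(?:\.\d+)?)\s*$", line)
        if match:
            # Drop thousands separators before converting to a number.
            charges[match.group(1)] = float(match.group(2).replace(",", ""))
    return charges

msg = Parser().parsestr(SAMPLE)
# For a non-multipart message, get_payload() returns the body as a string.
charges = parse_charges(msg.get_payload())
print(msg["To"])
print(charges)
```

Lines that do not look like "name - amount" (the "Charges:" heading, the sign-off) are simply skipped, so the helper tolerates the blank lines in the real message.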

The next problem I foresee is that the format of the subject headers seems to be inconsistent across messages. For instance, in the message in your question, the subject header is 'orderid: ord1234, circtid: cr1234' but in this message it's 'Cct Id: ONE211, eCo order No: 1CTRP'. You want to be able to recover 'order id, circuit id' from messages but these items don't appear in every message.

If they did, you could probably ferret them out with a regex.
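One way to cope with the inconsistency is to try one pattern per subject layout. The sketch below covers only the two subjects quoted above; the function name extract_ids and the patterns themselves are assumptions, and any other layout would need its own pattern added to the list.

```python
import re

# One pattern per subject layout seen so far. Each pattern must define
# the named groups 'order' and 'circuit'.
SUBJECT_PATTERNS = [
    re.compile(r"orderid:\s*(?P<order>\w+),\s*circtid:\s*(?P<circuit>\w+)", re.I),
    re.compile(r"Cct Id:\s*(?P<circuit>\w+),\s*eCo order No:\s*(?P<order>\w+)", re.I),
]

def extract_ids(subject):
    """Return (order_id, circuit_id), or None if no known layout matches."""
    for pattern in SUBJECT_PATTERNS:
        match = pattern.search(subject)
        if match:
            return match.group("order"), match.group("circuit")
    return None

print(extract_ids("orderid: ord1234, circtid: cr1234"))
print(extract_ids("Cct Id: ONE211, eCo order No: 1CTRP\n "))
```

Returning None for an unrecognised subject makes it easy to log the messages that need a new pattern instead of silently guessing.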

Bill Bell answered Dec 01 '25 23:12
