I am having trouble extracting metadata from a PST file.
As you can see in the code below, I am using pypff to read the PST file. I need to extract the following data from each email: sender, recipient, subject, date and time, and of course the email content.
However, I just can't figure out how to get the recipient.
I'm asking you professionals for help; maybe you know a better way to do this. I have already thought about "unpacking" all the .msg files from the PST into a folder and then iterating over them, but I wouldn't know how to do that either.
Thanks in advance for your answers and help.
# Retrieving e-mails from a PST file
# File opening
# First we load the libraries
import pypff
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Then we open the file: opening can nevertheless take quite long,
# depending on the size of the archive.
pst = pypff.file()
pst.open("PathTo.pst")
# Metadata extraction
# It is possible to navigate through the structure using the functions
# offered by the library, starting from the root:
root = pst.get_root_folder()
# To extract the data, a recursive function is necessary:
def parse_folder(base):
    messages = []
    for folder in base.sub_folders:
        if folder.number_of_sub_folders:
            messages += parse_folder(folder)
        print(folder.name)
        for message in folder.sub_messages:
            print(message.transport_headers)
            messages.append({
                "subject": message.subject,
                "sender": message.sender_name,
                "datetime": message.client_submit_time,
            })
    return messages
messages = parse_folder(root)
It is actually not that easy to find the recipient, because a PST file is usually exported from a single mailbox, i.e. a single recipient. I don't know if this will help you, but I'm dealing with a similar issue right now. In theory you can extract Original-Recipient or Final-Recipient from the message object by parsing transport_headers, using something like this:
import re
headers = {}
for hp in message.transport_headers.split('\n'):
    # Each header line looks like "Key: value\r"
    pts = re.findall(r'^([^:]+): (.+)\r$', hp)
    if pts:
        key = pts[0][0].capitalize()
        val = pts[0][1]
        headers[key] = val
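The line-by-line regex above skips folded headers (a long To: field that continues on the next line). As a minimal sketch, assuming transport_headers holds ordinary RFC 5322 style headers as a string (or bytes), the standard library's email package can do the parsing instead. The function name recipients_from_headers and the sample header text are made up for illustration and do not come from pypff:

```python
from email.parser import HeaderParser
from email.utils import getaddresses


def recipients_from_headers(raw_headers):
    """Return (name, address) pairs found in the To, Cc and Bcc fields."""
    if not raw_headers:
        return []
    # Assumption: the headers may arrive as bytes; decode defensively.
    if isinstance(raw_headers, bytes):
        raw_headers = raw_headers.decode("utf-8", errors="replace")
    parsed = HeaderParser().parsestr(raw_headers)
    fields = []
    for name in ("To", "Cc", "Bcc"):
        fields.extend(parsed.get_all(name, []))
    return getaddresses(fields)


# Hypothetical header block with a folded To: field.
sample = (
    "From: Alice <alice@example.com>\r\n"
    "To: Bob <bob@example.com>,\r\n"
    " carol@example.com\r\n"
    "Subject: Hello\r\n"
    "\r\n"
)
print(recipients_from_headers(sample))
```

In the loop from the question you would call recipients_from_headers(message.transport_headers). Messages without transport headers (drafts or some sent items, for example) yield an empty list here.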
Or maybe something like this, which dumps the raw record entries of a message so you can look for the recipient there:
for record_set in pst_message.record_sets:
    for entry in record_set.entries:
        print(f"entry type {hex(entry.get_entry_type())} {entry.get_value_type()} {entry.data}")
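To tie this back to the fields asked for in the question, here is a rough sketch of a folder walk that also visits messages stored directly in the starting folder and keeps the raw headers for later recipient parsing. It relies only on the attributes already used in the question (sub_folders, sub_messages, subject, sender_name, client_submit_time, transport_headers). The plain_text_body attribute is an assumption about the pypff build in use, so it is read with getattr. The names iter_messages and message_to_record are made up for illustration:

```python
def iter_messages(folder):
    """Yield every message in `folder` and, recursively, in its sub-folders."""
    for message in folder.sub_messages:
        yield message
    for sub_folder in folder.sub_folders:
        yield from iter_messages(sub_folder)


def message_to_record(message):
    """Collect the requested fields from one message-like object."""
    return {
        "subject": message.subject,
        "sender": message.sender_name,
        "datetime": message.client_submit_time,
        # Raw headers; the recipient can be parsed out of these later.
        "headers": message.transport_headers,
        # Assumption: plain_text_body may be missing or None.
        "body": getattr(message, "plain_text_body", None),
    }
```

With the file opened as in the question, records = [message_to_record(m) for m in iter_messages(pst.get_root_folder())] gives a list of dicts that can be passed to pd.DataFrame.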