I have a large CSV made up of an "ID" column and a "History" column.
The ID is simple, just an integer.
The History, though, is a single cell made up of up to hundreds of entries, separated by *** NOTE *** in the text area.
I want to parse this with Python and the CSV module to read the data in and export it out as a new CSV as below.
EXISTING DATA STRUCTURE:
ID,History
56457827, "*** NOTE ***
2014-02-25
Long note here. This is just a stand in to give you an idea
*** NOTE ***
2014-02-20
Another example.
This one has carriage returns.
Demonstrates they're all a bit different, though are really just text based"
56457896, "*** NOTE ***
2015-03-26
Another example of a note here. This is the text portion.
*** NOTE ***
2015-05-24
Another example yet again."
REQUIRED DATA STRUCTURE:
ID, Date, History
56457827, 2014-02-25, "Long note here. This is just a stand in to give you an idea"
56457827, 2014-02-20, "Another example.
This one has carriage returns.
Demonstrates they're all a bit different, though are really just text based"
56457896, 2015-03-26, "Another example of a note here. This is the text portion."
56457896, 2015-05-24, "Another example yet again."
So I will need to master some commands. I'm guessing I need a loop that brings the data in, which I'm sure I'll be able to manage, but then I need to analyse the data.
I believe I'll need to:
9 repeat steps 2-8 until the end of the History field
This file is about 5MB. Is this the best approach? I'm relatively new to programming and data manipulation, so I'm open to any constructive criticism before I kick into this tonight when I crack open the laptop and dig in.
Thanks so much, all feedback greatly appreciated.
OK, you can easily parse the input file with the csv module, but you will need to set skipinitialspace=True, because your file has whitespace after the comma. I also assume that the empty line after the header should not be there.
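To see why skipinitialspace matters here, this is a small sketch on made-up data (the `sample` string is an assumption, shaped like your file, not your real data). Without the option, the space before the opening quote stops the reader from treating the field as quoted, so the embedded newline splits the record in two.

```python
import csv
import io

# Made-up sample shaped like the question's file: a space follows the comma,
# and the quoted History field contains a newline.
sample = 'ID,History\n1, "first line\nsecond line"\n'

# Default settings: the leading space means the quote is not seen as a
# field-opening quote, so the record breaks at the embedded newline.
default_rows = list(csv.reader(io.StringIO(sample)))

# With skipinitialspace=True the space is dropped and the quoted field
# is read as one value spanning both lines.
fixed_rows = list(csv.reader(io.StringIO(sample), skipinitialspace=True))

print(len(default_rows))  # 3 rows: header plus two broken pieces
print(fixed_rows[1])      # ['1', 'first line\nsecond line']
```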
Then, you should split the History column on '*** NOTE ***'. The first line of the text of each note should be a date, and the remaining part is the actual History. Code could be:
import csv

# input_file_name and output_file_name are the paths of your two files
with open(input_file_name, newline='') as fd, \
        open(output_file_name, "w", newline='') as fdout:
    rd = csv.reader(fd, skipinitialspace=True)
    ID, Hist = next(rd)  # read the header line
    wr = csv.writer(fdout)
    wr.writerow((ID, 'Date', Hist))  # write header of output file
    for row in rd:
        # print(row)  # uncomment for debug traces
        hists = row[1].split('*** NOTE ***')
        for h in hists:
            h = h.strip()
            if len(h) == 0:  # skip initial empty note
                continue
            # should begin with a date line
            date, h2 = h.split('\n', 1)
            wr.writerow((row[0], date.strip(), h2.strip()))
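If you want to test the splitting step on its own before running it over the whole file, it could be pulled out into a small helper. This is only a sketch: `split_history` and `NOTE_MARKER` are names I made up, and it uses `str.partition` instead of `split('\n', 1)` so that a note containing only a date line gives an empty text rather than raising an error.

```python
NOTE_MARKER = '*** NOTE ***'  # hypothetical constant name, value taken from the question


def split_history(history, marker=NOTE_MARKER):
    """Yield (date, text) pairs from the text of one History cell."""
    for chunk in history.split(marker):
        chunk = chunk.strip()
        if not chunk:  # skip the empty piece before the first marker
            continue
        # The first line is assumed to be the date, the rest is the note text.
        date, _, text = chunk.partition('\n')
        yield date.strip(), text.strip()


cell = (
    "*** NOTE ***\n2014-02-25\nLong note here.\n"
    "*** NOTE ***\n2014-02-20\nAnother example.\nThis one has carriage returns."
)
pairs = list(split_history(cell))
for date, text in pairs:
    print(date, repr(text))
```

Inside the main loop you would then write one output row per pair, for example `for date, text in split_history(row[1]): wr.writerow((row[0], date, text))`.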